Regular expressions

Summary: We explore regular expressions in Scheme which allows us to search for patterns of text in a structured, concise manner.

Introduction

As we’ve already discovered, Scheme provides a variety of tools for working with strings and lists of strings. We can extract substrings, split strings, count strings, and more.

But we’ve encountered one significant obstacle: sometimes instead of a particular string, we want to work with patterns of characters potentially contained within a string. For example:

We want to determine whether a string contains a book title which will be written using title case (first letter capitalized of each word).
We want to find all occurrences of “??” embedded within a string literal (i.e., in a block of text surrounded by quotes) and replace them with a string of our choice.
We want to break up a string that represents a date into chunks, noting that sometimes people use “-“, “/”, or even “.” as the delimiter between day, month, and year.

As you might expect, needing to express patterns in strings is a fairly common task in computing. Some years ago, the Mathematician Stephen Kleene invented a notation, which he called “Regular Expressions”. (Fun fact: Kleene is the great-grand-advisor of Professor Sam Rebelsky in CS.) Most modern programming languages now provide some version of regular expressions.

We can use regular expressions in Scheme in a wide variety of ways, including splitting strings, identifying parts of a string, and replacing parts of a string. In this reading, we cover the basics of regular expressions and their associated library support in Scamper. You can tell we are using the csc151 regular expressions when you see procedures whose name starts with rex (short for regular expression).

Scamper

Regular Expressions exist in Scamper 3.3.0 and higher!

Regular expression basics

Regular expressions constitute a domain-specific language (DSL). In contrast to a general-purpose programming language like Racket, a DSL is a smaller language built for a particular task. Here, the task is specifying patterns of characters that appear in strings.

Because of this, it is natural to express such patterns using strings themselves. We create a regular expression that matches the string str with (rex-string str). For example, the following string expresses the pattern of characters #\h, #\e, #\l, #\l, #\o in sequence:

(rex-string "hello")

If we wish to express the pattern of seeing this series of characters in sequence, we can just replicate the string:

"hellohello"

But what if we want to capture the pattern of zero or more repetitions of the word "hello"? We clearly can’t specify that with just our literal string syntax! Regular expressions enrich strings by introducing a syntax and semantics for expressing these patterns. For example, the following string captures the pattern of zero or more repetitions of "hello".

(rex-string "hellohello")

But what if we want to capture the pattern of one or more repetitions of the word "hello"? We can’t specify that with just our literal string syntax. Regular expressions enrich strings by introducing a syntax and semantics for expressing these patterns. For example, the following regular expression captures the pattern of one or more repetitions of "hello".

(rex-repeat (rex-string "hello"))

But how do we use these regular expressions? The simplest way to use them is to check whether a string matches a regular expression using (rex-matches? rex str).

> (define hellos
    (rex-repeat (rex-string "hello")))
> (rex-matches? hellos "hello")
#t
> (rex-matches? hellos "hellohello")
#t
> (rex-matches? hellos "hellohellohello")
#t
> (rex-matches? hellos "hellohello hello")
#f
> (rex-matches? hellos "")
#f

If we want to match zero or more copies of "hello", rather than one or more, we can use rex-repeat-0.

> (define hellos0
    (rex-repeat-0 (rex-string "hello")))
> (rex->string hellos0)
"(hello)*"
> (rex-matches? hellos0 "hello")
#t
> (rex-matches? hellos0 "")
#t
> (rex-matches? hellos0 "hellohellohello")
#t
> (rex-matches? hellos0 "hellohellohelloo")
#f

Why would we care about “zero or more repetitions” of a pattern? It depends on what we’re looking for. One common instance is permitting zero or more leading zeros in an integer. Here’s another example. What if we want to repeat only some of the characters in the string? For example, what if we want to match "hello" or "helloo" or "hellooo" or even "hellooooooooo"? We build such patterns by concatenating (joining together in sequence) two regular expressions.

We solve problems like this by decomposing the problem! To achieve the goal we just described, we need to match the string "hello" followed by zero or more o’s.

(rex-string "hello") matches the initial "hello".

(rex-repeat-0 (rex-string "o")) matches zero or more copies of o.

So, (rex-concat (rex-string "hello") (rex-repeat-0 (rex-string "o"))) matches "hello" with an arbitray number of additional o’s.

> (define hello-echo (rex-concat (rex-string "hello") (rex-repeat-0 (rex-string "o"))))
> (rex-matches? hello-echo "hello")
#t
> (rex-matches? hello-echo "helloooooooooooo")
#t
> (rex-matches? hello-echo "hellooo")
#t
> (rex-matches? hello-echo "goodbye")
#f

One important characteristic of regular expressions is that we gain power by combining them together in different ways using operations like rex-concat and rex-repeat. Let’s consider a slightly more complicated situation: We want to see if a string consists only of those echoing hellos (or hellooo’s or…).

Let’s decompose the problem: We want to start with one echoing hello. We want it to be followed by zero or more echoing hellos, each of which is preceded by a space.

We know how to match one echoing hello by using the hello-echo expression. A space is the pattern (rex-string " "). A space and an echoing hello is therefore (rex-concat (rex-string " ") hello-echo).

Finally, we can repeat that with rex-repeat-0. Putting it all together, we get the following.

> (define remaining-hellos
    (rex-repeat-0 (rex-concat (rex-string " ") hello-echo)))
> (define echoing-hellos
    (rex-concat hello-echo remaining-hellos))
> (rex-matches? echoing-hellos "hello")
#t
> (rex-matches? echoing-hellos "helloooo")
#t
> (rex-matches? echoing-hellos "hello hello hello")
#t
> (rex-matches? echoing-hellos "hellooo hello hello")
#t
> (rex-matches? echoing-hellos "hellohello") ; no space
#f
> (rex-matches? echoing-hellos "hello he")   ; incomplete second hello
#f

So far we’ve seen one basic kind of regular expression: strings that match themselves. We’ve seen two ways to extend regular expressions: we can repeat them and we can concatenate them. We’ve seen one way to use regular expressions: we can check if a string matches a regular expression. We’ve learned that by combining repetition and concateneation, we can create some fairly sophisticated patterns.

Now it’s time to expand our knowledge of basic regular expressions.

More basic regular expressions

While rex-string seems to be about as simple as you can make a regular expression, there are a few basic regular expressions that are nearly as simple. For example, (rex-any-char chr) matches any single character chr.

> (rex-matches? (rex-any-char) "a")
#t
> (rex-matches? (rex-any-char) "+")
#t
> (rex-matches? (rex-any-char) "a+")
#f

We often use rex-any-char along with rex-repeat to represent “anything”. For example the following matches any string that starts with s and ends with r.

> (define sr (rex-concat (rex-string "s")
                         (rex-repeat-0 (rex-any-char))
                         (rex-string "r")))
> (rex-matches? sr "super")
#t
> (rex-matches? sr "stranger")
#t
> (rex-matches? sr "sr")
#t
> (rex-matches? sr "samr")
#t
> (rex-matches? sr "computer")
#f
> (rex-matches? sr "science")
#f
> (rex-matches? sr "science computer")
#t

At times, we want to match one of a few characters, rather than any character. There are three additional basic patterns to match individual characters.

(rex-char-set str) is a regular expression that matches any one character in str.
(rex-char-antiset str) is a regular expression that matches any one character that does not appear in str.
(rex-char-range start finish) is a regular expression that appears anywhere between the characters start and finish (e.g. #\a and #\z for lowercase letters or #\0 and #\9 for digits.

Here is an example.

> (define lowercase (rex-char-range #\a #\z))
> (define vowel (rex-char-set "aeiou"))
> (define non-vowel (rex-char-antiset "aeiou"))
> (rex-matches? lowercase "c")
#t
> (rex-matches? lowercase "C")
#f
> (rex-matches? lowercase "cc")
#f
> (rex-matches? vowel "c")
#f
> (rex-matches? vowel "e")
#t
> (rex-matches? non-vowel "f")
#t
> (rex-matches? non-vowel "e")
#f

We might use these to determine if a word contains two vowels in a row. To do this, again decompose the problem. We’ll need a possibly-empty sequence of letters, then a vowel, then a vowel, then a possibly-empty sequence of letters.

> (define double-vowel
    (rex-concat (rex-repeat-0 lowercase) vowel vowel (rex-repeat-0 lowercase)))
> (rex-matches? double-vowel-word "hello")
#f
> (rex-matches? double-vowel-word "helloo")
#t
> (rex-matches? double-vowel-word "field")
#t
> (rex-matches? double-vowel-word "aardvark")
#t

There is one more basic regular expression: rex-empty matches only the empty string.

> (rex-matches? (rex-empty) "")
#t
> (rex-matches? (rex-empty) "hello")
#f

This procedure is helpful because it turns out that the empty string is useful in conjuction with some other ways of combining regular expressions.

Combining regular expressions

Speaking of combining regular expressions, there are three basic ways of combining or extending regular expressions: You can concatenate two or more regular expressions with (rex-concat rex1 rex2 ...) and you can repeat a regular expression with (rex-repeat rex) or (rex-repeat-0 rex). The third way is to use (rex-any-of rex 1 rex2 ...), which takes two or more regular expressions as parameters and returns a regular expression that matches any of them.

> (define greeting (rex-any-of (rex-string "hello")
                               (rex-string "hi")
                               (rex-string "hola")))
> (rex-matches? greeting "HI")
#f
> (rex-matches? greeting "hello")
#t

Using regular expressions

So far, we’ve seen only one way of using regular expressions: That is, we can determine whether a string matches a regular expression (or vice versa). However, one of the most common uses of regular expressions is to search a longer string for portions that match a regular expression. The procedure (rex-find-matches rex str) accomplishes that goal. For example, we might write the following to find all of the strings with two vowels in a row.

> (define lowercase (rex-char-range #\a #\z))
> (define vowel (rex-char-set "aeiou"))
> (define double-vowel-word
    (rex-concat (rex-repeat-0 lowercase) vowel vowel (rex-repeat-0 lowercase)))
> (rex-find-matches double-vowel-word "now is the time for all good people to come to the aid of their country")
("good" "people" "aid" "their" "country")

When dealing with longer strings, we might split into substrings based on a regular expression. For example, we might want to split at commas and semicolons, with a space and an optional "and " afterwards. Let’s decompose that pattern.

A comma or semicolon is a set: (rex-char-set ",;").

A space is just a space: (rex-string " ").

"and" is a string: (rex-string "and ").

We make it optional by matching it by using rex-optional: (rex-optional (rex-string “and “))

Putting it all together, we get the following.

> (define splitter
    (rex-concat (rex-char-set ",;")
                (rex-string " ")
                (rex-optional (rex-string "and "))))
> (rex-split-string splitter "me, you, and a dog named boo")
'("me" "you" "a dog named boo")
> (rex-split-string splitter "alpha, beta; gamma")
'("alpha" "beta" "gamma")

With more practice, you’ll find many clever ways to use regular expressions.

Displaying regular expressions

How do we store regular expressions? It turns out that the human-readable representation of the csc151 rex-style regular expressions looks remarkably like the expressions we would write to create them, just without formatting.

> vowel
(rex-char-set "aeiou")
> double-vowel-word
(rex-concat [(rex-repeat-0 (rex-char-range #\a #\z)) (rex-char-set "aeiou") (rex-char-set "aeiou") (rex-repeat-0 (rex-char-range #\a #\z))])
> splitter
(rex-concat [(rex-char-set ",;") (rex-string " ") (rex-optional (rex-string "and "))])

As we noted at the beginning, there’s a more standard syntax for regular expressions. You can start to explore that syntax using rex->regexp.

> (rex->regexp vowel)
"[aeiou]"
> (rex->regexp double-vowel-word)
"(?:[a-z])*[aeiou][aeiou](?:[a-z])*"
> (rex->regexp splitter)
"[,;] (?:and )?"

As you may have noted, this syntax is much more concise. It is, however, harder for most humans to read.

Self checks

Check 1: Alternate spellings (‡)

People with the name “Jared” often encounter various “alternate spellings” of their name, including “Jered”, “Jerod”, and “Jarod”. Write two regular expressions we could use to match any of those four spellings.

Check 2: Counting gendered words

One common analysis of a text is the ratio of male pronouns to female pronouns in a text.

Suppose we want to identify the number of times that the words “she” and “her” appear in a string and we do not want to count the number of times they appear in other words (e.g., “sheet” and “there”).

How could you use regular expressions to count those appearances?

Copyright © Eric Autry, Charlie Curtsinger, Sarah Dahlby Albright, Janet Davis, Nicole Eikmeier, Fahmida Hamid, Priscilla Jiménez, Barbara Johnson, Titus Klinge, Peter-Michael Osera, Leah Perlmutter, Samuel A. Rebelsky, William Rebelsky, John David Stone, Anya Vostinar, Henry Walker, and Jerod Weinman.

Unless specified otherwise elsewhere on this page, this work is licensed under a Creative Commons Attribution 3.0 Unported License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc/3.0/ or send a letter to Creative Commons, 543 Howard Street, 5th Floor, San Francisco, California, 94105, USA.

This website was built using Jekyll, Twitter Bootstrap, and the Bootswatch Cosmo Theme.