Study Guide: Regular Expressions

Instructions

This is a study guide with links to past lectures, assignments, and handouts, as well as additional practice problems to assist you in learning the concepts.

Assignments

Important: For solutions to these assignments once they have been released, find the links in the front page calendar.

Lectures

Guides

What are Regular Expressions?

Consider the following scenarios:

  1. You've just written a 500-page book on how to dominate the game of Hog. Your book assumes that all players use six-sided dice. However, in 2035, the newest version of Hog is updated to use 25-sided dice. You now need to find all instances where you mention six-sided dice in your book and update them to refer to 25-sided dice.
  2. You are in charge of a registry of phone numbers for the SchemeCorp company. You're planning a company social for all employees in the city of Berkeley. To make your guest list, you want to find the phone numbers of all employees living in the Berkeley area codes of 314 or 510.
  3. You're the communications expert on an interplanetary voyage, and you've received a message from another starship captain with the locations of a large number of potentially habitable planets, represented as strings. You must determine which of these planets lie in your star system.

What do all of these scenarios have in common? They all involve searching for patterns within a larger piece of text. These can include extracting strings that begin with a certain set of characters, contain a certain set of characters, or follow a certain format.

Regular expressions are a powerful tool for solving these kinds of problems. With regular expression operators, we can write expressions to describe a set of strings that match a specified pattern.

For example, the following code defines a function that matches all words that start with the letter "h" (capitalized or lowercase) and end with the lowercase letter "y".

import re
def hy_finder(text):
    """
    >>> hy_finder("Hey! Hurray, I hope you have a lovely day full of harmony.")
    ['Hey', 'Hurray', 'harmony']
    """

    return re.findall(r"\b[Hh][a-z]*y\b", text)

Let's examine the above regular expression piece by piece.

  1. First, we use r"", which denotes a raw string in Python. Raw strings handle the backslash character \ differently than regular string literals. For example, the \b in this regular expression is kept as a sequence of two characters (a backslash followed by b), which the re module then interprets as a word boundary. If we were to use a string literal without the r prefix, \b would be treated as a single character, the ASCII backspace.
  2. We then begin and end our regular expression with \b. This ensures that word boundaries exist before the "h" and after the "y" in the string we want to match.
  3. We use [Hh] to represent that we want our word to start with either a capital or lowercase "h" character.
  4. We want our word to contain 0 or more (denoted by the * character) lowercase letters between the "h" and "y". We use [a-z] to refer to this set.
  5. Finally, we use the character y to denote that our string should end with the lowercase letter "y".
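
The effect of the r prefix can be checked directly. The following is a minimal sketch; the variable names are illustrative and not part of the original example.

```python
import re

raw = r"\b"     # two characters: a backslash followed by "b"
cooked = "\b"   # one character: the backspace control character (ASCII 8)

print(len(raw), len(cooked))   # 2 1
print(ord(cooked))             # 8

text = "Hey! Hurray, I hope you have a lovely day full of harmony."

# With the raw string, \b reaches the regex engine as a word boundary.
with_raw = re.findall(r"\b[Hh][a-z]*y\b", text)
print(with_raw)                # ['Hey', 'Hurray', 'harmony']

# Without the r prefix, the pattern looks for literal backspace characters,
# which this text does not contain, so nothing matches.
without_raw = re.findall("\b[Hh][a-z]*y\b", text)
print(without_raw)             # []
```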

Regular Expression Operators

Regular expressions are most often constructed using combinations of operators. The following special characters represent operators in regular expressions: \, (, ), [, ], {, }, +, *, ?, |, $, ^, and ..

We can still build regular expressions without using any of these special characters. However, these expressions would only be able to handle exact matches. For example, the expression potato would match all occurrences of the characters p, o, t, a, t, and o, in that order, within a string.
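
As a small illustration of an operator-free pattern, the sketch below searches a made-up sentence (the sample text is an assumption, chosen only for this example).

```python
import re

text = "A potato, two potatoes, and one Potato."

# No operators: the pattern matches only the exact characters p-o-t-a-t-o.
# "potatoes" contains that sequence, but "Potato" starts with a capital P.
exact = re.findall(r"potato", text)
print(exact)   # ['potato', 'potato']
```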

Leveraging these operators enables us to build much more interesting expressions that can match a wide range of patterns. We'd recommend using interactive tools like regexr.com or regex101.com to practice using these.

Let's take a look at some common operators.

Each entry below gives the pattern, its description, an example expression, and some strings that the example does and does not match in full.

  • [] - Denotes a character class. Matches characters in a set (including ranges of characters like 0-9). Use [^] to match characters outside a set. Example: [top] matches t, o, p but not s, march, 3.
  • . - Matches any character other than the newline character. Example: 1. matches 1a, 1?, 11 but not 1, 1\n.
  • \d - Matches any digit character. Equivalent to [0-9]. \D is the complement and refers to all non-digit characters. Example: \d\d matches 12, 42, 60 but not 4, 890.
  • \w - Matches any word character. Equivalent to [A-Za-z0-9_]. \W is the complement. Example: \d\w matches 1a, 9_, 4Z but not 14, a5.
  • \s - Matches any whitespace character: spaces, tabs, or line breaks. \S is the complement. Example: \d\s\w matches 1 s, 4 Z but not 1s (no whitespace) or 1  s (two spaces).
  • * - Matches 0 or more of the previous pattern. Example: a* matches the empty string, a, aa, aaaaa but not schmorp, mlep.
  • + - Matches 1 or more of the previous pattern. Example: lo+l matches lol, lool, loool but not ll, lal.
  • ? - Matches 0 or 1 of the previous pattern. Example: lo?l matches lol, ll but not lool, lulz.
  • | - Usage: Char1|Char2. Matches either Char1 or Char2. Example: a|b matches a, b but not c, d.
  • () - Creates a group. Matches occurrences of all characters within a group. Example: (<3)+ matches <3, <3<3, <3<3<3 but not <<, 33.
  • {} - Used like {Min,Max}, with no space after the comma. Matches a quantity between Min and Max of the previous pattern. Example: a{2,4} matches aa, aaa, aaaa but not a, aaaaa.
  • ^ - Matches the beginning of a string. Example: ^aw+ matches aw, aww, awww but not wa, waaa.
  • $ - Matches the end of a string. Example: \w+y$ matches hey, bay, stay but not yes, aye.
  • \b - Matches a word boundary, the beginning or end of a word. Example: \w+e\b matches bridge, smoothie but not next, everlasting.
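
A few of the examples above can be checked mechanically. The sketch below assumes the listed matches and non-matches are meant as whole-string matches, so it uses re.fullmatch; the rows list and the check helper are illustrative names, not part of the re module.

```python
import re

# (pattern, strings expected to match, strings expected not to match)
rows = [
    (r"[top]", ["t", "o", "p"], ["s", "march", "3"]),
    (r"1.", ["1a", "1?", "11"], ["1", "1\n"]),
    (r"lo+l", ["lol", "lool", "loool"], ["ll", "lal"]),
    (r"lo?l", ["lol", "ll"], ["lool", "lulz"]),
    (r"a{2,4}", ["aa", "aaa", "aaaa"], ["a", "aaaaa"]),
    (r"(<3)+", ["<3", "<3<3", "<3<3<3"], ["<<", "33"]),
]

def check(pattern, matches, non_matches):
    """Return True if pattern fully matches every string in matches
    and none of the strings in non_matches."""
    all_match = all(re.fullmatch(pattern, s) for s in matches)
    none_match = not any(re.fullmatch(pattern, s) for s in non_matches)
    return all_match and none_match

results = {pattern: check(pattern, yes, no) for pattern, yes, no in rows}
for pattern, ok in results.items():
    print(pattern, ok)
```

Swapping re.fullmatch for re.search changes the outcome for some rows, since a pattern such as [top] can be found inside a longer string like march.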

Regular Expressions in Python

In Python, we use the re module (see the Python documentation for more information) to write regular expressions. The following are some useful functions in the re module:

  • re.search(pattern, string) - returns a match object representing the first occurrence of pattern within string
  • re.sub(pattern, repl, string) - substitutes all matches of pattern within string with repl
  • re.fullmatch(pattern, string) - returns a match object, requiring that pattern matches the entirety of string
  • re.match(pattern, string) - returns a match object, requiring that string starts with a substring that matches pattern
  • re.findall(pattern, string) - returns a list of strings representing all matches of pattern within string, from left to right
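
The functions listed above can be compared side by side. This is a minimal sketch on made-up input strings; note that re.search, re.fullmatch, and re.match return None when there is no match.

```python
import re

text = "Roll 2 six-sided dice, then 3 six-sided dice."

# re.search: the first match anywhere in the string.
first = re.search(r"\d", text)
print(first.group(), first.start())   # 2 5

# re.sub: replace every match with the replacement string.
updated = re.sub(r"six-sided", "25-sided", text)
print(updated)   # Roll 2 25-sided dice, then 3 25-sided dice.

# re.fullmatch: the pattern must account for the entire string.
whole_ok = re.fullmatch(r"\d+", "2035")
whole_bad = re.fullmatch(r"\d+", "2035 AD")
print(whole_ok is not None, whole_bad is None)   # True True

# re.match: the pattern must match at the start of the string.
start_ok = re.match(r"\d+", "2035 AD")
start_bad = re.match(r"\d+", "AD 2035")
print(start_ok.group(), start_bad)   # 2035 None

# re.findall: every non-overlapping match, left to right, as a list of strings.
digits = re.findall(r"\d+", text)
print(digits)   # ['2', '3']
```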

Practice Problems

Easy

Medium

Q1: Party Planner

You are the CEO of SchemeCorp, a company you founded after learning Scheme in CS61A. You want to plan a social for everyone who works at your company, but only your colleagues who live in Berkeley can help plan the party. You want to add all employees located in Berkeley (based on their phone number) to a party-planning group chat. Given a string representing a list of employee phone numbers for SchemeCorp, write a regular expression that matches all valid phone numbers of employees in the 314 or 510 Berkeley area codes.

In addition, a few of your colleagues are visiting from Los Angeles (area code 310) and Montreal (area code 514) and would like to help. Your regular expression should match their phone numbers as well.

Valid phone numbers can be formatted in two ways. Some employees entered their phone numbers with parentheses around the area code (for example, (786)-375-6454), while some omitted the parentheses (for example, 786-375-6454). A few employees also entered their phone numbers incorrectly, with either more than or fewer than 10 digits. These phone numbers are not valid and should not be included in the group chat.

import re
def party_planner(text):
    """
    Returns all strings representing valid phone numbers with 314, 510, 310, or 514 area codes.
    The area code may or may not be surrounded by parentheses. Valid phone numbers
    have 10 digits and follow this format: XXX-XXX-XXXX, where each X represents a digit.

    >>> party_planner("(408)-996-3325, (510)-658-7400, (314)-3333-22222")
    ['(510)-658-7400']
    >>> party_planner("314-826-0705, (510)-314-3143, 408-267-7765")
    ['314-826-0705', '(510)-314-3143']
    >>> party_planner("5103143143")
    []
    >>> party_planner("514-300-2002, 310-265-4242") # invite your friends in LA and Montreal
    ['514-300-2002', '310-265-4242']
    """
    # Fill in the blank: return re.findall(__________, text)
    # One possible solution:
    return re.findall(r"\(?[53]1[04]\)?-\d{3}-\d{4}", text)

After creating your group chat, you find out that your friends in Montreal and Los Angeles can no longer attend the party-planning meeting. How would you modify your regular expression to no longer match the 514 and 310 area codes?

One way to do this is to use the | operator to match either a phone number with a 510 area code or a phone number with a 314 area code. The final regular expression looks like this:

\(?510\)?-\d{3}-\d{4}|\(?314\)?-\d{3}-\d{4}
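
To try this expression out, the sketch below wraps it in a function. The name berkeley_only is hypothetical, and the compact variant is an assumed equivalent rewrite rather than part of the original solution; it uses a non-capturing group (?:...) so that re.findall still returns whole matches instead of just the group contents.

```python
import re

def berkeley_only(text):
    """Hypothetical helper: like party_planner, but only for the
    510 and 314 area codes."""
    return re.findall(r"\(?510\)?-\d{3}-\d{4}|\(?314\)?-\d{3}-\d{4}", text)

def berkeley_only_compact(text):
    """Assumed equivalent: the alternation is limited to the area code."""
    return re.findall(r"\(?(?:510|314)\)?-\d{3}-\d{4}", text)

locals_sample = "314-826-0705, (510)-314-3143, 408-267-7765"
visitors_sample = "514-300-2002, 310-265-4242"

print(berkeley_only(locals_sample))     # ['314-826-0705', '(510)-314-3143']
print(berkeley_only(visitors_sample))   # []
print(berkeley_only_compact(locals_sample) == berkeley_only(locals_sample))   # True
```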