Study Guide: Regular Expressions

Instructions

This is a study guide with links to past lectures, assignments, and handouts, as well as additional practice problems to assist you in learning the concepts.

Assignments

Important: For solutions to these assignments once they have been released, find the links in the front page calendar.

Lectures

Guides

What are Regular Expressions?

Consider the following scenarios:

  1. You've just written a 500-page book on how to dominate the game of Hog. Your book assumes that all players use six-sided dice. However, in 2035, the newest version of Hog is updated to use 25-sided dice. You now need to find all instances where you mention six-sided dice in your book and update them to refer to 25-sided dice.
  2. You are in charge of a registry of phone numbers for the SchemeCorp company. You're planning a company social for all employees in the city of Berkeley. To make your guest list, you want to find the phone numbers of all employees living in the Berkeley area codes of 314 or 510.
  3. You're the communications expert on an interplanetary voyage, and you've received a message from another starship captain with the locations of a large number of potentially habitable planets, represented as strings. You must determine which of these planets lie in your star system.

What do all of these scenarios have in common? They all involve searching for patterns within a larger piece of text. These can include extracting strings that begin with a certain set of characters, contain a certain set of characters, or follow a certain format.

Regular expressions are a powerful tool for solving these kinds of problems. With regular expression operators, we can write expressions to describe a set of strings that match a specified pattern.

For example, the following code defines a function that matches all words that start with the letter "h" (capitalized or lowercase) and end with the lowercase letter "y".

import re
def hy_finder(text):
    """
    >>> hy_finder("Hey! Hurray, I hope you have a lovely day full of harmony.")
    ['Hey', 'Hurray', 'harmony']
    """

    return re.findall(r"\b[Hh][a-z]*y\b", text)

Let's examine the above regular expression piece by piece.

  1. First, we use r"", which denotes a raw string in Python. Raw strings handle the backslash character \ differently than regular string literals. For example, the \b in this regular expression is treated as a sequence of two characters, a backslash followed by b, which the re module then interprets as a word boundary. If we were to use a string literal without the r prefix, \b would be treated as a single character, the ASCII backspace.
  2. We then begin and end our regular expression with \b. This ensures that word boundaries exist before the "h" and after the "y" in the string we want to match.
  3. We use [Hh] to represent that we want our word to start with either a capital or lowercase "h" character.
  4. We want our word to contain 0 or more (denoted by the * character) lowercase letters between the "h" and "y". We use [a-z] to refer to this set.
  5. Finally, we use the character y to denote that our string should end with the lowercase letter "y".
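
To see why the raw string matters, the short sketch below compares the two spellings of the pattern. It reuses the regular expression from hy_finder; the variable names are illustrative only.

```python
import re

text = "Hey! Hurray, I hope you have a lovely day full of harmony."

# In a raw string, \b stays as two characters: a backslash and the letter b.
raw_pattern = r"\b[Hh][a-z]*y\b"
# In a regular string literal, Python turns \b into one backspace character.
plain_pattern = "\b[Hh][a-z]*y\b"

print(len(r"\b"), len("\b"))            # 2 1
print(re.findall(raw_pattern, text))    # ['Hey', 'Hurray', 'harmony']
print(re.findall(plain_pattern, text))  # []
```

The second search finds nothing because the text contains no backspace characters for the pattern to match.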

Regular Expression Operators

Regular expressions are most often constructed using combinations of operators. The following special characters represent operators in regular expressions: \, (, ), [, ], {, }, +, *, ?, |, $, ^, and . (the period).

We can still build regular expressions without using any of these special characters. However, these expressions would only be able to handle exact matches. For example, the expression potato would match all occurrences of the characters p, o, t, a, t, and o, in that order, within a string.
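
As a small illustration of exact matching, the sketch below searches two made-up sentences (the sample text is an assumption, not part of the guide) for the operator-free pattern potato.

```python
import re

sentence = "A potato, two potatoes, and one sweet potato."

# With no operators, the pattern matches only this exact run of characters.
found = re.findall(r"potato", sentence)
print(found)  # ['potato', 'potato', 'potato']

# A different capitalization or spelling is not matched.
missed = re.findall(r"potato", "Potato, potahto")
print(missed)  # []
```

Note that the pattern also matches inside the longer word "potatoes", since nothing in it requires a word boundary.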

Leveraging these operators enables us to build much more interesting expressions that can match a wide range of patterns. We'd recommend using interactive tools like regexr.com or regex101.com to practice using these.

Let's take a look at some common operators.

Each entry below gives a pattern, its description, an example expression, strings the example matches, and strings it does not match.

  • [] - Denotes a character class. Matches any single character in a set (including ranges of characters like 0-9). Use [^] to match characters outside a set.
      Example: [top]
      Matches: t, o, p
      Non-matches: s, march, 3

  • . - Matches any character other than the newline character.
      Example: 1.
      Matches: 1a, 1?, 11
      Non-matches: 1, 1\n

  • \d - Matches any digit character. Equivalent to [0-9]. \D is the complement and refers to all non-digit characters.
      Example: \d\d
      Matches: 12, 42, 60
      Non-matches: 4, 890

  • \w - Matches any word character. Equivalent to [A-Za-z0-9_]. \W is the complement.
      Example: \d\w
      Matches: 1a, 9_, 4Z
      Non-matches: 14, a5

  • \s - Matches any whitespace character: spaces, tabs, or line breaks. \S is the complement.
      Example: \d\s\w
      Matches: 1 s, 9 _, 4 Z
      Non-matches: 1s, 1  s (two spaces)

  • * - Matches 0 or more of the previous pattern.
      Example: a*
      Matches: (the empty string), a, aa, aaaaa
      Non-matches: schmorp, mlep

  • + - Matches 1 or more of the previous pattern.
      Example: lo+l
      Matches: lol, lool, loool
      Non-matches: ll, lal

  • ? - Matches 0 or 1 of the previous pattern.
      Example: lo?l
      Matches: lol, ll
      Non-matches: lool, lulz

  • | - Usage: Char1|Char2, written without spaces around the operator. Matches either Char1 or Char2.
      Example: a|b
      Matches: a, b
      Non-matches: c, d

  • () - Creates a group. Matches occurrences of all characters within the group, in order.
      Example: (<3)+
      Matches: <3, <3<3, <3<3<3
      Non-matches: <<, 33

  • {} - Used like {Min,Max}, with no space after the comma. Matches between Min and Max (inclusive) repetitions of the previous pattern.
      Example: a{2,4}
      Matches: aa, aaa, aaaa
      Non-matches: a, aaaaa

  • ^ - Matches the beginning of a string.
      Example: ^aw+
      Matches: aw, aww, awww
      Non-matches: wa, waaa

  • $ - Matches the end of a string.
      Example: \w+y$
      Matches: hey, bay, stay
      Non-matches: yes, aye

  • \b - Matches a word boundary, the beginning or end of a word.
      Example: e\b
      Matches: bridge, smoothie
      Non-matches: next, everlasting
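
A few of the operators above can be checked directly in Python. This is a minimal sketch: fully_matches is a hypothetical helper (not part of the re module) that wraps re.fullmatch, and the strings are taken from the examples above.

```python
import re

def fully_matches(pattern, string):
    """Hypothetical helper: True when pattern matches the entire string."""
    return re.fullmatch(pattern, string) is not None

# Character classes, the wildcard, and quantifiers.
print(fully_matches(r"[top]", "t"))       # True
print(fully_matches(r"[top]", "march"))   # False: a class matches one character
print(fully_matches(r"1.", "1?"))         # True
print(fully_matches(r"1.", "1\n"))        # False: . does not match a newline
print(fully_matches(r"lo+l", "loool"))    # True
print(fully_matches(r"lo?l", "lool"))     # False: ? allows at most one o
print(fully_matches(r"a{2,4}", "aaaaa"))  # False: five is more than Max
print(fully_matches(r"(<3)+", "<3<3"))    # True: the group repeats as a unit
print(fully_matches(r"a|b", "b"))         # True

# Anchors and word boundaries are easier to see with re.search,
# which looks for the pattern anywhere in the string.
print(bool(re.search(r"^aw+", "awww")))   # True
print(bool(re.search(r"^aw+", "waaa")))   # False: the string must start with aw
print(bool(re.search(r"\w+y$", "yes")))   # False: the string must end with y
print(bool(re.search(r"e\b", "bridge")))  # True
print(bool(re.search(r"e\b", "next")))    # False: no e at the end of a word
```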

Regular Expressions in Python

In Python, we use the re module (see the Python documentation for more information) to work with regular expressions. The following are some useful functions in the re module:

  • re.search(pattern, string) - returns a match object representing the first occurrence of pattern within string
  • re.sub(pattern, repl, string) - substitutes all matches of pattern within string with repl
  • re.fullmatch(pattern, string) - returns a match object, requiring that pattern matches the entirety of string
  • re.match(pattern, string) - returns a match object, requiring that string starts with a substring that matches pattern
  • re.findall(pattern, string) - returns a list of strings representing all matches of pattern within string, from left to right
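
The sketch below runs each of these functions on one invented sample string (the text and room numbers are made up for illustration). Note that re.search, re.match, and re.fullmatch return None when there is no match, which is why the results are wrapped in bool.

```python
import re

text = "cs61a meets in room 155, cs61b in room 306"

# re.search: a match object for the first run of exactly three digits.
first = re.search(r"\d{3}", text)
print(first.group())                   # 155

# re.findall: every match, from left to right.
print(re.findall(r"\d{3}", text))      # ['155', '306']

# re.sub: replace every match.
print(re.sub(r"\d{3}", "???", text))   # cs61a meets in room ???, cs61b in room ???

# re.match: the pattern must match at the start of the string.
print(bool(re.match(r"cs\d+", text)))  # True
print(bool(re.match(r"room", text)))   # False

# re.fullmatch: the pattern must match the whole string.
print(bool(re.fullmatch(r"cs\d+[ab]", "cs61a")))  # True
print(bool(re.fullmatch(r"cs\d+", "cs61a")))      # False
```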

Practice Problems

Easy

Q1: Meme-ified

In this question, we will identify whether a given message has been meme-ified, or uses sticky caps. A meme-ified message contains at least one instance of the following sequence: a capitalized letter, followed by a lowercase letter, followed by another capitalized letter. For example, the following string is not meme-ified: "python doesn't support tail recursion." This string, however, is meme-ified: "PyThon dOeSn'T sUpPort TaIl reCuRsiOn." Write the function memeified below, which determines whether a string is meme-ified.

import re

def memeified(message):
    """
    Returns True for strings that have been meme-ified, or contain a capital letter followed
    by a lowercase letter followed by another capital letter. Returns False for non-meme-ified strings.

    >>> memeified("PyThon dOeSn'T sUpPort TaIl reCuRsiOn.")
    True
    >>> memeified("The above statement is false! Python doesn't support TCO")
    False
    >>> memeified("LoL this is fun")
    True
    >>> memeified("lOl this is fun")
    False
    >>> memeified("I WrIte My ScHeMe wItH StYlE - (CoNs '61 (cOnS 'A NiL))")
    True
    >>> memeified("I take my scheme very seriously and only use lowercase")
    False
    """
    return bool(re.search(__________, message))

Solution:

    return bool(re.search(r"[A-Z][a-z][A-Z]", message))

Medium

Q2: Greetings

Let's say hello to our fellow bears! We've received messages from our new friends at Berkeley, and we want to determine whether or not these messages are greetings. In this problem, there are two types of greetings - salutations and valedictions. The first are messages that start with "hi", "hello", or "hey", where the first letter of these words can be either capitalized or lowercase. The second are messages that end with the word "bye" (capitalized or lowercase), followed by either an exclamation point, a period, or no punctuation. Write a regular expression that determines whether a given message is a greeting.

import re

def greetings(message):
    """
    Returns whether a string is a greeting. Greetings begin with either Hi, Hello, or
    Hey (either capitalized or lowercase), and/or end with Bye (either capitalized or lowercase) optionally followed by
    an exclamation point or period.

    >>> greetings("Hi! Let's talk about our favorite submissions to the Scheme Art Contest")
    True
    >>> greetings("Hey I just figured out that when I type the Konami Code into cs61a.org, something fun happens")
    True
    >>> greetings("I'm going to watch the sun set from the top of the Campanile! Bye!")
    True
    >>> greetings("Bye Bye Birdie is one of my favorite musicals.")
    False
    >>> greetings("High in the hills of Berkeley lived a legendary creature. His name was Oski")
    False
    >>> greetings('Hi!')
    True
    >>> greetings("bye")
    True
    """
    return bool(re.search(__________, message))

Solution:

    return bool(re.search(r"(^([Hh](ey|i|ello)\b))|(\b[bB]ye[!\.]?$)", message))
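
One way to read this regular expression is as two alternatives joined by |. The sketch below checks each half separately; classify is a hypothetical helper introduced only for this illustration, and the redundant outer parentheses of the solution are dropped.

```python
import re

# The two alternatives of the solution, checked separately.
salutation = r"^[Hh](ey|i|ello)\b"  # starts with hi, hello, or hey
valediction = r"\b[bB]ye[!\.]?$"    # ends with bye, optionally followed by ! or .

def classify(message):
    """Hypothetical helper: report which half of the pattern matched."""
    return (bool(re.search(salutation, message)),
            bool(re.search(valediction, message)))

print(classify("Hi!"))                            # (True, False)
print(classify("bye"))                            # (False, True)
print(classify("Hello and bye."))                 # (True, True)
print(classify("High in the hills of Berkeley"))  # (False, False)
print(classify("Bye Bye Birdie is one of my favorite musicals."))  # (False, False)
```

The word boundary \b after the salutation is what rejects "High", and the $ anchor is what rejects a "Bye" that appears only at the start of the message.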

Q3: Party Planner

You are the CEO of SchemeCorp, a company you founded after learning Scheme in CS61A. You want to plan a social for everyone who works at your company, but only your colleagues who live in Berkeley can help plan the party. You want to add all employees located in Berkeley (based on their phone number) to a party-planning group chat. Given a string representing a list of employee phone numbers for SchemeCorp, write a regular expression that matches all valid phone numbers of employees in the 314 or 510 Berkeley area codes.

In addition, a few of your colleagues are visiting from Los Angeles (area code 310) and Montreal (area code 514) and would like to help. Your regular expression should match their phone numbers as well.

Valid phone numbers can be formatted in two ways. Some employees entered their phone numbers with parentheses around the area code (for example, (786)-375-6454), while some omitted the parentheses (for example, 786-375-6454). A few employees also entered their phone numbers incorrectly, with either more than or fewer than 10 digits. These phone numbers are not valid and should not be included in the group chat.

import re
def party_planner(text):
    """
    Returns all strings representing valid phone numbers with 314, 510, 310, or 514 area codes.
    The area code may or may not be surrounded by parentheses. Valid phone numbers
    have 10 digits and follow this format: XXX-XXX-XXXX, where each X represents a digit.

    >>> party_planner("(408)-996-3325, (510)-658-7400, (314)-3333-22222")
    ['(510)-658-7400']
    >>> party_planner("314-826-0705, (510)-314-3143, 408-267-7765")
    ['314-826-0705', '(510)-314-3143']
    >>> party_planner("5103143143")
    []
    >>> party_planner("514-300-2002, 310-265-4242") # invite your friends in LA and Montreal
    ['514-300-2002', '310-265-4242']
    """
    return re.findall(__________, text)

Solution:

    return re.findall(r"\(?[53]1[04]\)?-\d{3}-\d{4}", text)

After creating your group chat, you find out that your friends in Montreal and Los Angeles can no longer attend the party-planning meeting. How would you modify your regular expression to no longer match the 514 and 310 area codes?

One way to do this is to use the | operator to match either a phone number with a 510 area code or a phone number with a 314 area code. The final regular expression looks like this:

\(?510\)?-\d{3}-\d{4}|\(?314\)?-\d{3}-\d{4}
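
The sketch below tries this expression on a sample list of numbers drawn from the doctests above. It also shows a shorter variant that groups the two area codes with (?:...), a non-capturing group. That variant is an assumption offered for comparison, not the guide's solution. The last pattern shows why the group should be non-capturing: when a pattern contains a capturing group, re.findall returns only the group's text.

```python
import re

numbers = "314-826-0705, (510)-314-3143, 514-300-2002, 310-265-4242"

# The pattern from the text: two full alternatives joined by |.
spelled_out = r"\(?510\)?-\d{3}-\d{4}|\(?314\)?-\d{3}-\d{4}"
# A shorter variant (an assumption): only the area codes are alternated.
grouped = r"\(?(?:510|314)\)?-\d{3}-\d{4}"
# With a capturing group, re.findall returns only the captured area code.
capturing = r"\(?(510|314)\)?-\d{3}-\d{4}"

print(re.findall(spelled_out, numbers))  # ['314-826-0705', '(510)-314-3143']
print(re.findall(grouped, numbers))      # ['314-826-0705', '(510)-314-3143']
print(re.findall(capturing, numbers))    # ['314', '510']
```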