Regular expressions

Tips for navigating the slides:
  • Press O or Escape for overview mode.
  • Visit this link for a nice printable version
  • Press the copy icon on the upper right of code blocks to copy the code

Class outline:

  • Declarative languages
  • Regular expression syntax
  • Regular expressions in Python

Declarative languages

Declarative programming

In imperative languages:

  • A "program" is a description of computational processes
  • The interpreter carries out execution/evaluation rules

In declarative languages:

  • A "program" is a description of the desired result
  • The interpreter figures out how to generate the result
  • Examples:
    • Regular expressions: Good (?:morning|evening)
    • Backus-Naur Form:
      ?calc_expr: NUMBER | calc_op
      calc_op: "(" OPERATOR calc_expr* ")"
      OPERATOR: "+" | "-" | "*" | "/"

Domain-specific languages

Many declarative languages are domain-specific: they are designed to tackle problems in a particular domain, instead of being general purpose multi-domain programming languages.

Language Domain
Regular expressions Pattern-matching strings
Backus-Naur Form Parsing strings into parse trees
SQL Querying and modifying database tables
HTML Describing the semantic structure of webpage content
CSS Styling webpages based on selectors
Prolog Describes and queries logical relations

Regular expressions

Pattern matching

Pattern matching in strings is a common problem in computer programming.

An imperative approach:


                    def is_email_address(str):
                        parts = str.split('@')
                        if len(parts) != 2:
                            return False
                        domain_parts = parts[1].split('.')
                        return len(domain_parts) >= 2 and len(domain_parts[-1]) == 3
                    

An equivalent regular expression:


                    (.+)@(.+)\.(.{3})
                    

With regular expressions, a programmer can just describe the pattern using a common syntax, and a regular expression engine figures out how to do the pattern matching for them.

Matching exact strings

The following are special characters in regular expressions: \ ( ) [ ] { } + * ? | $ ^ .

To match an exact string that has no special characters, just use the string:


                    Berkeley, CA 94720
                    

But if the matched string contains special characters, they must be escaped using a backslash.


                    \(1\+3\)
                    

The dot

The . character matches any single character that is not a new line.


                    .a.a.a
                    

It's typically better to match a more specific range of characters, however...

Character classes

Pattern Description Example Fully Matched By
[] Denotes a character class. Matches characters in a set (including ranges of characters like 0-9). Use [^] to match characters outside a set.
[top]
[h-p]
. Matches any character other than the newline character. 1.
\d Matches any digit character. Equivalent to [0-9]. \D matches the inverse (all non-digit characters). \d\d
\w Matches any word character. Equivalent to [A-Za-z0-9_]. \W matches the inverse. \d\w
\s Matches any whitespace character: spaces, tabs, or line breaks. \S matches the inverse. \d\s\w

Quantifiers

These indicate how many of a character/character class to match.

Pattern Description Example Fully Matched By
* Matches 0 or more of the previous pattern. a*
+ Matches 1 or more of the previous pattern. lo+l
? Matches 0 or 1 of the previous pattern. lo?l
{} Used like {Min, Max}. Matches a quantity between Min and Max of the previous pattern.
a{2}
a{2,}
a{2,4}

Combining patterns

Patterns P₁ and P₂ can be combined in various ways.

Combination Description Example Fully Matched By
P₁P₂ A match for P₁ followed immediately by one for P₂. ab[.,]
P₁|P₂ Matches anything that either P₁ or P₂ does. \d+|Inf
(P₁) Matches whatever P₁ does. Parentheses group, just as in arithmetic expressions. (<3)+

Anchors

These don't match an actual character, they indicate the position where the surrounding pattern should be found.

Pattern Description Example What parts match?
^ Matches the beginning of a string. ^aw+ aww aww
$ Matches the end of a string. \w+y$ stay stay
\b Matches a word boundary, the beginning or end of a word. \w+e\b broken bridge team

Regular expressions in Python

Support for regular expressions

Regular expressions are supported natively in many languages and tools.

Languages: Perl, ECMAScript, Java, Python, ..

Tools: Excel/Google Spreadsheets, SQL, BigQuery, VSCode, grep, ...

Raw strings

In normal Python strings, a backslash indicates an escape sequence, like \n for new line or \b for bell.


                    >>> print("I have\na newline in me.")
                    I have
                    a newline in me 
                    

But backslash has a special meaning in regular expressions. To make it easy to write regular expressions in Python strings, use raw strings by prefixing the string with an r:


                    pattern = r"\b[ab]+\b"
                    

The re module

The re module provides many helpful functions.

Function Description
re.search(pattern, string) returns a Match object representing the first occurrence of pattern within string
re.fullmatch(pattern, string) returns a Match object, requiring that pattern matches the entirety of string
re.match(pattern, string) returns a Match object, requiring that string starts with a substring that matches pattern
re.findall(pattern, string) returns a list of strings representing all matches of pattern within string, from left to right
re.sub(pattern, repl, string) substitutes all matches of pattern within string with repl

Match objects

The functions re.search, re.match, and re.fullmatch all take a string containing a regular expression and a string of text. They return either a Match object or, if there is no match, None.

re.search requires that the pattern exists somewhere in the string:


                    import re

                    re.search(r'-?\d+', '123 peeps')           # <re.Match object>
                    re.search(r'-?\d+', 'So many peeps')       # None
                    

Match objects are treated as true values, so you can use the result as a boolean:


                    bool(re.search(r'-?\d+', '123'))           # True
                    bool(re.search(r'-?\d+', 'So many peeps')) # False
                    

Inspecting a match

re.search returns a Match object representing the first occurrence of pattern within string.


                    title = "I Know Why the Caged Bird Sings"
                    re.search(r'Bird', title)   # 
                    

Match objects carry information about what has been matched. The Match.group() method allows you to retrieve it.


                    x = "This string contains 35 characters."
                    mat = re.search(r'\d+', x)
                    mat.group(0)  # 35
                    

Match groups

If there are parentheses in a patterns, each of the parenthesized groups will become groups in the match object.


                    x = "There were 12 pence in a shilling and 20 shillings in a pound."
                    mat = re.search(r'(\d+)[a-z\s]+(\d+)', x)
                    

                    mat.group(0)  # '12 pence in a shilling and 20'
                    mat.group(1)  # 12
                    mat.group(2)  # 20
                    mat.groups()  # (12, 20)
                    

It's also common to use parentheses in combination with quantifiers and other modifiers, however.

Finding multiple matches

re.findall() returns a list of strings representing all matches of pattern within string, from left to right.


                    locations = "CA 91105, NY 13078, CA 94702"
                    re.findall(r'\d\d\d\d\d', locations)
                    # ['91105', '13078', '94702']
                    
>

Resolving ambiguity

>

Ambiguous matches

Regular expressions can match a given string in more than one way. Especially when there are parenthesized groups, this can lead to ambiguity:


                    mat = re.match(r'wind|window', 'window')
                    mat.group()  # 'wind'
                    
                    mat = re.match(r'window|wind', 'window')
                    mat.group() # 'window'
                    
                    mat = re.match(r'(wind|window)(.*)shade', 'window shade')
                    mat.groups() # ('wind', 'ow ')
                    
                    mat = re.match(r'(window|wind)(.*)shade', 'window shade')
                    mat.groups() # ('window', ' ')
                    

Python resolves these particular ambiguities in favor of the first option.

>

Ambiguous quantifiers

Likewise, there is ambiguity with *, +, and ?.


                    mat = re.match(r'(x*)(.*)', 'xxx')
                    mat.groups()  # ('xxx', '')

                    mat = re.match(r'(x+)(.*)', 'xxx')
                    mat.groups()  # ('xxx', '')
                    
                    mat = re.match(r'(x?)(.*)', 'xxx')
                    mat.groups()  # ('x', 'xx')
                    
                    mat = re.match(r'(.*)/(.+)', '12/10/2020')
                    mat.groups()  # ('12/10', '2020')
                    

Python chooses to match greedily, matching the pattern left-to-right and, when given a choice, matching as much as possible while still allowing the rest of the pattern to match.

>

Lazy operators

Sometimes, you don’t want to match as much as possible.

The lazy operators *?, +?, and ?? match only as much as necessary for the whole pattern to match.


                    mat = re.match(r'(.*)(\d*)', 'I have 5 dollars')
                    mat.groups() # ('I have 5 dollars', '')
                    
                    mat = re.match(r'(.*?)(\d+)', 'I have 5 dollars')
                    mat.groups() # ('I have ', '5')
                    
                    mat = re.match(r'(.*?)(\d*)', 'I have 5 dollars')
                    mat.groups() # ('', '')
                    

The ambiguities introduced by *, +, ?, and | don’t matter if all you care about is whether there is a match!

Exercises

Name That Pattern! #1


                    [A-Za-z]{3}
                    

  • What's a valid input? AUS, aus
  • What's an invalid input? australia, au

Name That Pattern! #2


                    \d{4}-\d{2}-\d{2}
                    

  • What's a valid input? 2020-03-13
  • What's an invalid input? 2020/03/13, 03-13-2020

Name That Pattern! #3


                    [a-z0-9._%+-]+@[a-z0-9.-]+\.[a-z]{2,}$
                    

  • What's a valid input? someone@someplace.org
  • What's an invalid input? someone@mod%cloth.co

RegEx Makeover! #1

Let's make a regular expression to match 24-hour times of the format HH:MM.

First draft: [0-2]\d:\d\d

  • What invalid times would that match? 24:99
  • How do we fix minutes? [0-2]\d:[0-5]\d
  • How do we fix hours? ((2[0-3])|([0-1]\d)):[0-5]\d

Try in regexr.com!

Exercise: Stocks

Make a regular expression to match any tweet talking about GME stock.


                    import re

                    def match_gme(tweet):
                        """
                        >>> match_gme('GME')
                        True
                        >>> match_gme('yooo buy GME right now!')
                        True
                        >>> match_gme('#HUGME')
                        False
                        >>> match_gme('#HUGMEHARDER')
                        False
                        """
                        return bool(re.search(______, tweet))
                    

Tips

  • When learning, use sites like regexr.com
  • Get used to referencing a regular expressions cheat sheet

⚠️ A word of caution ⚠️

Regular expressions can be very useful. However: