Lab 13: Regular Expressions, BNF

Due by 11:59pm on Tuesday, April 20.

Starter Files

Download lab13.zip. Inside the archive, you will find starter files for the questions in this lab, along with a copy of the Ok autograder.

Topics

Consult this section if you need a refresher on the material for this lab. It's okay to skip directly to the questions and refer back here should you get stuck.

BNF

Backus-Naur Form (BNF) is a syntax for describing a context-free grammar. It was invented for describing the syntax of programming languages, and is still commonly used in documentation and language parsers. EBNF is a dialect of BNF which contains some convenient shorthands.

An EBNF grammar contains symbols and a set of recursive production rules. In 61A, we are using the Python Lark library to write EBNF grammars, which has a few specific rules for grammar writing.

There are two types of symbols: Non-terminal symbols can expand into non-terminals (including themselves) or terminals. In the Python Lark library, non-terminal symbols are always lowercase. Terminal symbols can be strings or regular expressions. In Lark, terminals are always uppercase.

Consider these two production rules:

numbers: INTEGER | numbers "," INTEGER
INTEGER: /-?\d+/

The symbol numbers is a non-terminal with a recursive production rule. It corresponds to either an INTEGER terminal or to the numbers symbol (itself) plus a comma plus an INTEGER terminal. The INTEGER terminal is defined using a regular expression which matches any number of digits with an optional - sign in front.

This grammar can describe strings like:

10
10,-11
10,-11,12

And so on, with any number of integers in front.

A grammar should also specify a start symbol, which corresponds to the whole expression being parsed (or the whole sentence, for a spoken language).

For the simple example of comma-separated numbers, the start symbol could just be the numbers terminal itself:

?start: numbers
numbers: numbers "," INTEGER | INTEGER
INTEGER: /-?\d+/

EBNF grammars can use these shorthand notations for specifying how many symbols to match:

EBNF Notation Meaning Pure BNF Equivalent
item* Zero or more items items: | items item
item+ One or more items items: item | items item
[item] item? Optional item optitem: | item

Lark also includes a few handy features:

  • You can specify tokens to complete ignore by using the ignore directive at the bottom of a grammar. For example, %ignore /\s+/ ignores all whitespace (tabs/spaces/new lines).
  • You can import pre-defined terminals for common types of data to match. For example, %import common.NUMBER imports a terminal that matches any integer or decimal number.

Using all of that, here's an EBNF grammar that corresponds to the Calculator language:

start: calc_expr?
calc_expr: NUMBER | calc_op
calc_op: "(" OPERATOR calc_expr* ")"
OPERATOR: "+" | "-" | "*" | "/"

%ignore /\s+/
%import common.NUMBER

You can paste that into code.cs61a.org and then input Calculator expressions in the interpreter to see their parse trees. Try it!


Regular Expressions

Regular expressions are a way to describe sets of strings that meet certain criteria, and are incredibly useful for pattern matching.

The simplest regular expression is one that matches a sequence of characters, like aardvark to match any "aardvark" substrings in a string.

However, you typically want to look for more interesting patterns. We recommend using an online tool like regexr.com for trying out patterns, since you'll get instant feedback on the match results.

Character classes

A character class makes it possible to search for any one of a set of characters. You can specify the set or use pre-defined sets.

Class Description
[abc] Matches a, b, or c
[a-z] Matches any character between a and z
[^A-Z] Matches any character that is not between A and Z.
\w Matches any "word" character. Equivalent to [A-Za-z0-9_]
\d Matches any digit. Equivalent to [0-9].
\s Matches any whitespace character (spaces, tabs, line breaks).
. Matches any character besides new line.

Character classes can be combined, like in [a-zA-Z0-9].

Combining patterns

There are multiple ways to combine patterns together in regular expressions.

Combination Description
AB A match for A followed immediately by one for B. Example: x[.,]y matches "x.y" or "x,y"
A⎮B Matches either A or B. Example: \d+⎮Inf matches either a decimal numeral or "Inf"

A pattern can be followed by one of these quantifiers to specify how many instances of the pattern can occur.

Quantifier Description
* 0 or more occurrences of the preceding pattern. Example: [a-z]* matches any sequence of lower-case letters or the empty string.
+ 1 or more occurrences of the preceding pattern. Example: \d+ matches any non-empty sequence of digits.
? 0 or 1 occurrences of the preceding pattern. Example: [-+]? matches an optional sign.
{1,3} Matches the specified quantity of the preceding pattern. {1,3} will match from 1 to 3 instances. {3} will match exactly 3 instances. {3,} will match 3 or more instances. Example: \d{5,6} matches either 5 or 6 digit numbers.

Groups

Parentheses are used similarly as in arithmetic expressions, to create groups. For example, (Mahna)+ matches strings with 1 or more "Mahna", like "MahnaMahna". Without the parentheses, Mahna+ would match strings with "Mahn" followed by 1 or more "a" characters, like "Mahnaaaa".

Anchors

Anchor Description
^ Matches the beginning of a string. Example: ^(I⎮You) matches I or You at the start of a string.
$ Normally matches the empty string at the end of a string or just before a newline at the end of a string. Example: `(.edu .org .com)$` matches .edu, .org, or .com at the end of a string.
\b Matches a "word boundary", the beginning or end of a word. Example: s\b matches s characters at the end of words.

Special characters

The following special characters are used above to denote types of patterns:

\ ( ) [ ] { } + * ? | $ ^ .

That means if you actually want to match one of those characters, you have to escape it using a backslash. For example, \(1\+3\) matches "(1 + 3)".

Using regular expressions in Python

Many programming languages have built-in functions for matching strings to regular expressions. We'll use the [Python re module] in 61A, but you can also use similar functionality in SQL, JavaScript, Excel, shell scripting, etc.

The search method searches for a pattern anywhere in a string:

re.search(r"(Mahna)+", "Mahna Mahna Ba Dee Bedebe")

That method returns back a match object, which is considered truth-y in Python and can be inspected to find the matching strings.

For more details, please consult the re module documentation or the re tutorial

BNF

Q1: EBNF Grammar

Consider this EBNF grammar for the Calculator language:

start: calc_expr

?calc_expr: NUMBER | calc_op

calc_op: "(" OPERATOR calc_expr* ")"

OPERATOR: "+" | "-" | "*" | "/"

%ignore /\s+/
%import common.NUMBER

Let's understand and modify the functionality of this BNF with a few questions.

Use Ok to test your knowledge by choosing the best answer for each of the following questions:

python3 ok -q ebnf-grammar-wwpd -u

Regular Expressions

Q2: Roman Numerals

Write a regular expression that finds any string of letters that resemble a Roman numeral and aren't part of another word. A Roman numeral is made up of the letters I, V, X, L, C, D, M and is at least one letter long.

import re

def roman_numerals(text):
    """
    Finds any string of letters that could be a Roman numeral
    (made up of the letters I, V, X, L, C, D, M).

    >>> roman_numerals("Sir Richard IIV, can you tell Richard VI that Richard IV is on the phone?")
    ['IIV', 'VI', 'IV']
    >>> roman_numerals("My TODOs: I. Groceries II. Learn how to count in Roman IV. Profit")
    ['I', 'II', 'IV']
    >>> roman_numerals("I. Act 1 II. Act 2 III. Act 3 IV. Act 4 V. Act 5")
    ['I', 'II', 'III', 'IV', 'V']
    >>> roman_numerals("Let's play Civ VII")
    ['VII']
    >>> roman_numerals("i love vi so much more than emacs.")
    []
    >>> roman_numerals("she loves ALL editors equally.")
    []
    """
    return re.findall(__________, text)

Use Ok to test your code:

python3 ok -q roman_numerals

Q3: Calculator Ops

Write a regular expression that parses strings written in the 61A Calculator language and returns any expressions which have two numeric operands, leaving out the parentheses around them.

import re

def calculator_ops(calc_str):
    """
    Finds expressions from the Calculator language that have two
    numeric operands and returns the expression without the parentheses.

    >>> calculator_ops("(* 2 4)")
    ['* 2 4']
    >>> calculator_ops("(+ (* 3 (+ (* 2 4) (+ 3 5))) (+ (- 10 7) 6))")
    ['* 2 4', '+ 3 5', '- 10 7']
    >>> calculator_ops("(* 2)")
    []
    """
    return re.findall(__________, calc_str)

Use Ok to test your code:

python3 ok -q calculator_ops

Q4: CS Classes

On reddit.com, there is an /r/berkeley subreddit for discussions about everything UC Berkeley. However, there is such a large amount of CS-related posts that those posts are auto-tagged so that readers can choose to ignore them or read only them.

Write a regular expression that finds strings that resemble a CS class- starting with "CS", followed by a number, and then optionally followed by "A", "B", or "C". Your search should be case insensitive, so both "CS61A" and "cs61a" would match.

import re

def cs_classes(post):
    """
    Returns strings that look like a Berkeley CS class,
    starting with "CS", followed by a number, optionally ending with A, B, or C.
    Case insensitive.

    >>> cs_classes("Is it unreasonable to take CS61A, CS61B, CS70, and EE16A in the summer?")
    True
    >>> cs_classes("how do I become a TA for cs61a? that job sounds so fun")
    True
    >>> cs_classes("Can I take ECON101 as a CS major?")
    False
    >>> cs_classes("Should I do the lab lites or regular labs in EE16A?")
    False
    """
    return bool(re.search(__________, post))

Use Ok to test your code:

python3 ok -q cs_classes