CS61A Lab 12: Regular Expressions and Interpretation

Week 12, 2012

Regular Expressions

Regular expressions are another type of language that are useful for matching text. While the full scope of regular expressions isn't in the scope of this course, it's good to get exposed to other types of languages, and regular expressions are incredibly useful in general programming.

Regular expressions are a formal way to pattern match strings. So far, we've been matching strings by just comparing strings to each other. For example, in calc...

	...    
	if operator == "-":
		...
	elif operator == "+":
		...
	elif operator == "/":
		...
	elif operator == "*":
		...

While this is fine for this basic case, it becomes harder to compare strings this way when we need to match strings against a more complicated pattern. The corresponding example using regular expressions would be

import re
	...
	...
	if re.match("-", operator):
		...
	elif re.match("\+", operator):
		...
	elif re.match("/", operator):
		...
	elif re.match("\*", operator):
		...

First, note that it doesn't look all that different from the original code. We have to call re.match, but other than that, it looks similar. The one weird part is the backslash in front of the multiplcation and addition operators. That's there because those operators are special, so we need to escape them for the regular expression engine to match them.

In general, regular expressions match normal strings just fine. So re.match("boomboom", s) and s == "boomboom" are effectively the same. (Note that re.match actually returns an object, and not just a boolean. This object can be used to extract what the actual match to the regular expression was using the group method. We don't need to do this for this lab, however. Also, re.match only matches the regular expression against the beginning of the string, and not the whole string)

As an example for learning regular expressions, lets look at identifiers.

Indentifiers in python are strings that can act as variable names. For example, "hi", "x", "i", and "expression" are all identifiers. Let's suppose that we're given the task of validating a given string to see if it is a valid Python identifier. A possible first attempt might be the following...

import string
def is_identifier(s):
    for char in s:
        if char not in string.ascii_letters:
            return False
    return True

The corresponding function, using regular expressions, is...

import re
def is_identifier(s):
    return True if re.match('[a-zA-Z]+', s) else False

The syntax might be confusing, so let me break it down for you. Our regular expression is "[a-zA-Z]+". Let's deal with the stuff inside the brackets first. The brackets represent what's called a character class. A character class matches anything inside in the target string. For example, the regex "li[cs]en[cs]e" matches the strings "license" or "lisence". In this case, our character class is "[a-zA-Z]". The dashes signify a range, so instead of typing "[abcdefghijklnmopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ]", we can just type "[a-zA-Z]". The plus at the end is a form of repetition. This is part of the big power of regular expressions. In this case, the plus sign means that the previous item (in this case the character class [a-zA-Z], but it could be something else) should be matched 1 or more times. The two other main repetition operators are the question mark (?) and the asterisk (*). The question mark means 0 or 1 matches, and the asterisk means 0 or more matches, in contrast with the plus sign, which matches 1 or more times.

Here are some examples to practice with. Try to predict what python will print! Don't worry if you don't understand these examples or are confused by them; regular expressions can extremely confusing.

>>> import re
>>> def matches(pattern, s):
        return True if re.match(pattern, s) else False
>>> matches("[a-zA-Z]+", "hello")

>>> matches("[A-Za-z_][A-Za-z_0-9]*", "__init__") #this is a proper identifier regex for python

>>> matches("[A-Za-z_][A-Za-z_0-9]*", "0imanidentifier")

>>> matches("aa+b", "ab")

>>> matches("aa+b", "aaaaab")

>>> matches("aa*b", "ab")

>>> matches("a[bcd]?bc", "abcbcd")

We aren't going to test you on the more complicated elements of regular expressions. This is just supposed to be an introduction to them. For more information (if you're interested in this), here are some interesting and/or useful links:

Calc: our first interpreter!

Solutions:
calc.py
scheme_reader.py

We're beginning our journey onto the world of interpreters! You may have seen an interpreter before. (Hint: that thing you've been running all of your Python in? Yeah, that's an interpreter). In fact, the Python interpreter we've been using all semester long is really nothing but a program, albeit a very complex one. You can think of it as a program that takes strings as input and produces their evaluated result as output.

There's nothing magical about interpreters, though! They're programs, just like the one's you've been writing throughout the entire semester. In fact, it's possible to write a Python interpreter in Python; PyPy is proof of that. However, because Python is a complex language, writing an interpreter for it would be a very daunting project. Instead, we'll play around with an interpreter for a calculator language.

In lecture, you were introduced to our first interpreter program, calc, which acts as a simple calculator. You can copy the code to your current directory by running the following in your terminal:

cp ~cs61a/public_html/fa12/labs/lab12/*.py .

You can also find the code in a zip file here. You can try running calc by running this command in the terminal:

python3 calc.py

To exit the program, type Ctrl-D or Ctrl-C.

Part 1)

Trace through the code in calc.py that would be evaluated when you type the following into calc.

> 2
> (+ 2 3)
> (+ 2 3 4)
> (+ 2 3)
> (+ 2)
> (+ 2 (* 4 5))

Part 2)

While prefix notation ( + 2 3 ) is easy to parse and interpret, it's not very natural for us to write code in. We'd much prefer infix notation (2 + 3). Let's implement this in our version of calc! To do this, you need to fill in the following functions, which are in the scheme_reader.py file.

  1. Write the body of read_infix. This function takes two arguments; first, and src. first is the first expression in the infix notation expression, and src is the buffer of tokens. The return should be an expression which is mathematically the same as the infix notation expression, but written using scheme style Pairs. See the doctests for simple examples. This should work in roughly the following manner:

  2. Write the body of next_is_op. This function should return whether the next token in the given buffer is an operator. Try using regular expressions! Hint: don't forget to check if there are any tokens in the buffer first.
  3. Modify the body of scheme_read to parse infix notation. This requires modifying two parts of the code:

And that's it! We have infix notation! The following inputs to calc.py should produce the given outputs.

> (+ 2 2 * 3)
8
> 2 + (- 5 2)
5
> (+ 1 3 * 4 (* 7 4))
41
> (2 * 3) + 6
12
> 6 + (2 * 3)
12

Part 3) Now, we're going to add the ability to define variables in calc. First, think about where we need to modify calc.py. Do we need to change calc_eval? calc_apply? Is define a special form?

Here are some guiding tips on how to implement definitions. First, some example usage:

> (define x 3)
x
> (+ x 2)
5
> (* x (+ 2 x))
15

And that's it! There you have basic variable declaration. Isn't that cool!??