Discussion 10: Regular Expressions

This is an online worksheet that you can work on during discussions. Your work is not graded and you do not need to submit anything.

Regular Expressions

Regular expressions are a way to describe sets of strings that meet certain criteria, and are incredibly useful for pattern matching.

The simplest regular expression matches a literal sequence of characters. For example, aardvark matches every occurrence of "aardvark" in a string.

However, you typically want to look for more interesting patterns. We recommend using an online tool like regexr.com or regex101.com for trying out patterns, since you'll get instant feedback on the match results.

Character Classes

A character class makes it possible to search for any one of a set of characters. You can specify the set or use pre-defined sets.

Class	Description
`[abc]`	Matches a, b, or c
`[a-z]`	Matches any character between a and z
`[^A-Z]`	Matches any character that is not between A and Z.
`\w`	Matches any "word" character. Equivalent to `[A-Za-z0-9_]`.
`\d`	Matches any digit. Equivalent to `[0-9]`.
`[0-9]`	Matches a single digit in the range 0 - 9. Equivalent to `\d`.
`\s`	Matches any whitespace character (spaces, tabs, line breaks).
`.`	Matches any character besides new line.

Character classes can be combined, like in [a-zA-Z0-9].
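As a quick illustration of the classes above, here is a minimal sketch using Python's re.findall, which returns every non-overlapping match as a list. The sample string is made up for this example.

```python
import re

# A made-up sample string for illustration.
text = "Room 61A, floor_2"

# \d matches one digit at a time.
digits = re.findall(r"\d", text)        # ['6', '1', '2']

# [aeiou] matches any one of the listed characters (lowercase only).
vowels = re.findall(r"[aeiou]", text)   # ['o', 'o', 'o', 'o']

# [^\w] matches any character that is NOT a word character.
non_word = re.findall(r"[^\w]", text)   # [' ', ',', ' ']
```

Note that the underscore in "floor_2" counts as a word character, so it does not appear in non_word.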

Combining Patterns

There are multiple ways to combine patterns together in regular expressions.

Combo	Description
`AB`	A match for A followed immediately by one for B. Example: `x[.,]y` matches "x.y" or "x,y".
`A|B`	Matches either A or B. Example: `\d+|Inf` matches either a sequence containing 1 or more digits or "Inf".
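The two combinations can be tried out in Python as follows. This is a small sketch with made-up input strings, not part of the worksheet's skeleton code.

```python
import re

# Concatenation: "x", then one of "." or ",", then "y".
concat = re.findall(r"x[.,]y", "x.y x,y x-y")   # ['x.y', 'x,y']

# Alternation: either one or more digits, or the literal text "Inf".
either = re.findall(r"\d+|Inf", "12 Inf 7")     # ['12', 'Inf', '7']
```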

A pattern can be followed by one of these quantifiers to specify how many instances of the pattern can occur.

Symbol	Description
`*`	0 or more occurrences of the preceding pattern. Example: `[a-z]*` matches any sequence of lower-case letters or the empty string.
`+`	1 or more occurrences of the preceding pattern. Example: `\d+` matches any non-empty sequence of digits.
`?`	0 or 1 occurrences of the preceding pattern. Example: `[-+]?` matches an optional sign.
`{1,3}`	Matches the specified quantity of the preceding pattern. `{1,3}` will match from 1 to 3 instances. `{3}` will match exactly 3 instances. `{3,}` will match 3 or more instances. Example: `\d{5,6}` matches either 5 or 6 digit numbers.
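The quantifiers above can be checked with a short sketch like the one below. It uses re.fullmatch, which succeeds only if the whole string matches the pattern; the input strings are made up for illustration.

```python
import re

# ? makes the sign optional, + requires at least one digit.
signed = re.findall(r"[-+]?\d+", "-42 +7 100000 3")
# ['-42', '+7', '100000', '3']

# * allows zero occurrences, so the empty string matches.
empty_ok = re.fullmatch(r"[a-z]*", "") is not None         # True

# {5,6} accepts 5 or 6 digits, but not 4.
five_digits = re.fullmatch(r"\d{5,6}", "94720") is not None  # True
four_digits = re.fullmatch(r"\d{5,6}", "1234") is not None   # False
```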

Groups

Parentheses create groups, much as they do in arithmetic expressions. For example, (Mahna)+ matches strings with 1 or more "Mahna", like "MahnaMahna". Without the parentheses, the + applies only to the final a, so Mahna+ matches "Mahn" followed by 1 or more "a" characters, like "Mahnaaaa".
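The difference between the grouped and ungrouped patterns can be seen in a short sketch, again using re.fullmatch so that the entire string must match:

```python
import re

# With the group, + repeats the whole word "Mahna".
grouped = re.fullmatch(r"(Mahna)+", "MahnaMahna")      # a match object

# Without the group, + repeats only the final "a".
ungrouped = re.fullmatch(r"Mahna+", "MahnaMahna")      # None
stretched = re.fullmatch(r"Mahna+", "Mahnaaaa")        # a match object

# group(1) holds the text captured by the first pair of parentheses
# (for a repeated group, the last repetition).
last_repeat = grouped.group(1)                          # 'Mahna'
```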

Anchors

^: Matches the beginning of a string. Example: ^(I|You) matches I or You at the start of a string.
$: Normally matches the empty string at the end of a string or just before a newline at the end of a string. Example: (\.edu|\.org|\.com)$ matches .edu, .org, or .com at the end of a string.
\b: Matches a "word boundary", the beginning or end of a word. Example: s\b matches s characters at the end of words.
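A minimal sketch of the three anchors in Python, with made-up input strings. re.sub is used for \b so that the matched positions are easy to see.

```python
import re

# ^ only matches at the start of the string.
at_start = re.search(r"^(I|You)", "You are here")       # matches "You"
not_at_start = re.search(r"^(I|You)", "Are You here")   # None

# $ only matches at the end of the string.
extension = re.search(r"(\.edu|\.org|\.com)$", "cs61a.org").group()  # '.org'

# s\b matches only an "s" that ends a word; replace each with "S".
marked = re.sub(r"s\b", "S", "bears stress tests")
# 'bearS stresS testS'
```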

Special Characters

The following special characters are used above to denote types of patterns:

\ / ( ) [ ] { } + * ? | $ ^ .

That means if you actually want to match one of those characters, you have to escape it using a backslash. For example, \(1\+3\) matches "(1+3)".
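The effect of escaping can be sketched in Python as follows. The helper re.escape is part of the standard re module and adds the backslashes for you; the input strings are made up for illustration.

```python
import re

# Escaped: parentheses and plus sign are treated as literal characters.
escaped = re.search(r"\(1\+3\)", "f(1+3)").group()    # '(1+3)'

# Unescaped: ( ) form a group and + means "one or more 1s",
# so this pattern looks for text like "13" or "113" instead.
unescaped = re.search(r"(1+3)", "f(1+3)")             # None
digits_only = re.search(r"(1+3)", "113").group()      # '113'

# re.escape builds the escaped pattern from a literal string.
auto_escaped = re.escape("(1+3)")                     # '\\(1\\+3\\)'
```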

Using Regular Expressions in Python

Many programming languages have built-in functions for matching strings to regular expressions. We'll use the Python re module in 61A, but you can also use similar functionality in SQL, JavaScript, Excel, shell scripting, etc.

The search method searches for a pattern anywhere in a string:

re.search(r"(Mahna)+", "Mahna Mahna Ba Dee Bedebe")

That method returns a match object, which is considered truth-y in Python and can be inspected to find the matching strings. If no match is found, it returns None.
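Here is a short sketch of inspecting the result of re.search. The group and span methods are standard match-object methods; the second input string is made up to show the no-match case.

```python
import re

match = re.search(r"(Mahna)+", "Mahna Mahna Ba Dee Bedebe")

if match:                        # a match object is truth-y
    matched_text = match.group(0)   # the full matched text: 'Mahna'
    position = match.span()         # (start, end) indices: (0, 5)

# When nothing matches, the result is None, which is false-y.
missing = re.search(r"(Mahna)+", "Ba Dee Bedebe")
```

Only the first "Mahna" is matched here, because the space between the two words is not part of the pattern.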

For more details, please consult the re module documentation or the re tutorial.

Q1: CS Classes

On reddit.com, there is an r/berkeley subreddit for discussions about everything UC Berkeley. However, there are so many EE- and CS-related posts that those posts are auto-tagged so that readers can choose to ignore them or read only them.

Write a regular expression that finds strings that resemble a CS or EE class: starting with "CS" or "EE", followed by a number, and then optionally followed by "A", "B", or "C". Your search should be case insensitive, so both "CS61A" and "cs61a" would match.
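One possible approach is sketched below; it is not the official solution. The function name has_class is hypothetical, and the sketch assumes there is no space between the department and the number.

```python
import re

def has_class(post):
    """Return True if post mentions something like CS61A or ee16b.

    Hypothetical helper: assumes the department and number are adjacent.
    """
    # (CS|EE): department, \d+: course number, [ABC]?: optional suffix.
    pattern = r"(CS|EE)\d+[ABC]?"
    # re.IGNORECASE makes the whole pattern case insensitive.
    return bool(re.search(pattern, post, re.IGNORECASE))
```

For example, has_class("Is cs61a hard?") is True, while has_class("computer science") is False.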

Q2: Greetings

Let's say hello to our fellow bears! We've received messages from our new friends at Berkeley, and we want to determine whether or not these messages are greetings. In this problem, there are two types of greetings - salutations and valedictions. The first are messages that start with "hi", "hello", or "hey", where the first letter of these words can be either capitalized or lowercase. The second are messages that end with the word "bye" (capitalized or lowercase), followed by either an exclamation point, a period, or no punctuation. Write a regular expression that determines whether a given message is a greeting.
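One way to sketch this is shown below; it is not the official solution. The function name is_greeting is hypothetical, and the sketch reads "the word bye" as requiring a word boundary before it, so "Goodbye" does not count.

```python
import re

def is_greeting(message):
    """Return True if message is a salutation or a valediction.

    Hypothetical helper; treats "hi", "hello", "hey", and "bye" as whole words.
    """
    # Salutation: hi/hello/hey at the very start, as a whole word.
    salutation = r"^[Hh](i|ello|ey)\b"
    # Valediction: the word "bye" at the very end, with optional ! or .
    valediction = r"\b[Bb]ye[!.]?$"
    return bool(re.search(salutation + "|" + valediction, message))
```

The \b after the salutation keeps words such as "high" from counting as "hi".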

Q3: Phone Number Validator

Create a regular expression that matches phone numbers that are 11, 10, or 7 numbers long.

Phone numbers 7 numbers long have a group of 3 numbers followed by a group of 4 numbers, either separated by a space, a dash, or nothing.

Examples: 123-4567, 1234567, 123 4567

Phone numbers 10 numbers long have a group of 3 numbers followed by a group of 3 numbers followed by a group of 4 numbers, either separated by a space, a dash, or nothing.

Examples: 123-456-7890, 1234567890, 123 456 7890

Phone numbers 11 numbers long have a group of 1 number followed by a group of 3 numbers followed by a group of 3 numbers followed by a group of 4 numbers, either separated by a space, a dash, or nothing.

Examples: 1-123-456-7890, 11234567890, 1 123 456 7890

It is fine to mix separators (space, dash, or nothing) within one phone number, so 123 456-7890 is fine.

Note: The skeleton code is just a suggestion; feel free to use your own structure if you prefer.
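A possible structure is sketched below; it is not the worksheet's skeleton or official solution. The function name is_phone_number is hypothetical, and the sketch assumes the whole string must be a phone number, so it uses re.fullmatch.

```python
import re

def is_phone_number(text):
    """Return True if text is a 7-, 10-, or 11-digit phone number.

    Hypothetical helper; each separator may be a space, a dash, or nothing.
    """
    sep = r"[ -]?"
    last_seven = r"\d{3}" + sep + r"\d{4}"
    # The area code is optional, and the leading single digit is
    # only allowed when the area code is present (nested optional groups).
    prefix = r"((\d" + sep + r")?\d{3}" + sep + r")?"
    return bool(re.fullmatch(prefix + last_seven, text))
```

Nesting the single leading digit inside the optional area-code group is what rules out 8-digit strings.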

Q4: Address First Line

Write a regular expression that determines whether a string contains the first line of a US mailing address.

US mailing addresses typically contain a block number, which is a sequence of 3-5 digits, followed by a street name. The street name can consist of multiple words but will always end with a street type abbreviation, which itself is a sequence of 2-5 English letters. The street name can also optionally start with a cardinal direction ("N", "E", "W", "S"). Everything should be properly capitalized.

Proper capitalization means that the first letter of each name is capitalized. It is fine to have things like "WeirdCApitalization" match.

See the doctests for some examples.
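One possible reading of these rules is sketched below; it is not the official solution, and the worksheet's doctests may expect different edge-case behavior. The function name has_address_line and the sample addresses are hypothetical. The sketch assumes parts are separated by single spaces and that the block number is not embedded in a longer run of digits.

```python
import re

def has_address_line(text):
    """Return True if text contains something like '123 Main St'.

    Hypothetical helper; assumes single spaces between the parts.
    """
    block_number = r"\b\d{3,5} "          # 3-5 digits, then a space
    cardinal = r"([NEWS] )?"              # optional direction
    street_words = r"([A-Z][A-Za-z]* )+"  # one or more capitalized words
    street_type = r"[A-Z][a-z]{1,4}\b"    # 2-5 letters, capitalized
    pattern = block_number + cardinal + street_words + street_type
    return bool(re.search(pattern, text))
```

The street words use [A-Za-z]* after the capital letter, so mixed-case names such as "WeirdCApitalization" still match.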

Q5: Basic URL Validation

In this problem, we will write a regular expression which matches a URL. URLs look like the following:

[Figure: diagram of the parts of a URL]

For example, in the link https://cs61a.org/resources/#regular-expressions, we would have:

Scheme: https://
Domain Name: cs61a.org
Path to the file: /resources
Anchor: /#regular-expressions

The port and parameters are not present in this example and you will not be required to match them for this problem.

You can reference this documentation from MDN if you're curious about the various parts of a URL.

For this problem, a valid domain name consists of two "words" separated by a single period. Recall that a "word" can consist of letters, numbers, and underscores. The second "word" should be exactly 3 characters long and represents the domain's extension. In the case of the above example, "cs61a" and "org" are the two "words" that are joined by a period.

For a URL to be "valid," it must contain a valid domain name and will optionally have a scheme, path, and anchor. (Note: In this problem, “scheme” does not refer to the programming language.)

A valid scheme will either be http:// or https://.

A valid path starts with a slash and then must be a valid path to a file or directory. A path to a directory should look something like /path/to/directory, while a path to a file might look something like /composingprograms.html (note the period followed by the extension). Paths should not end with a slash or have more than one period, so /composing.programs.html/ is not a valid path. Any non-slash and non-period character in a path should be a letter, number, or underscore.

A valid anchor starts with /#. Real anchors can be more complicated, but for this problem assume that the /# in a valid anchor is followed by letters, numbers, hyphens, or underscores.

Hint: You can use \ to escape special characters in regex.
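One way to assemble the pieces described above is sketched below; it is not the official solution. The function name is_valid_url and the constant names are hypothetical, and the sketch assumes the entire string must be a URL, so it uses re.fullmatch.

```python
import re

# Each part follows the rules stated in the problem.
SCHEME = r"(https?://)?"            # optional http:// or https://
DOMAIN = r"\w+\.\w{3}"              # word, period, 3-character extension
PATH = r"((/\w+)+(\.\w+)?)?"        # optional: /dir/dir with at most one .ext
ANCHOR = r"(/#[\w-]+)?"             # optional: /# then letters, digits, - or _

def is_valid_url(text):
    """Return True if text is a valid URL under this problem's rules.

    Hypothetical helper; ports and parameters are not handled.
    """
    return bool(re.fullmatch(SCHEME + DOMAIN + PATH + ANCHOR, text))
```

Building the pattern from named parts keeps each rule readable and easy to test on its own.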

Q6: Email Domain Validator

Create a regular expression that makes sure a given string email is a valid email address and that its domain name is in the provided list of domains.

An email address is valid if it contains letters, numbers, or underscores, followed by an @ symbol, then a domain.

All domains will have a 3 letter extension following the period.

Hint: For this problem, you will have to make a regex pattern based on the elements in the domains parameter. A for loop can help with that.

Extra: There is a particularly elegant solution that utilizes join and replace instead of a for loop.

Note: The skeleton code is just a suggestion; feel free to use your own structure if you prefer.
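A possible structure is sketched below; it is not the worksheet's skeleton or official solution. The function name email_validator is hypothetical, while the parameter names email and domains follow the problem statement. Instead of the join and replace approach mentioned above, this sketch uses join with re.escape, which escapes the periods in each domain.

```python
import re

def email_validator(email, domains):
    """Return True if email is valid and its domain is in domains.

    Hypothetical helper; assumes domains is a non-empty list of strings
    such as "berkeley.edu".
    """
    # Escape each domain so "." is literal, then join them with "|".
    domain_pattern = "|".join(re.escape(domain) for domain in domains)
    # \w+ covers the letters, numbers, and underscores before the @.
    pattern = r"\w+@(" + domain_pattern + r")"
    return bool(re.fullmatch(pattern, email))
```

For example, email_validator("oski@berkeley.edu", ["berkeley.edu", "gmail.com"]) is True, while the same address checked against ["gmail.com"] is False.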