Lab 11: Regular Expressions, SQL

Due by 11:59pm on Thursday, August 4.

Starter Files

Download Inside the archive, you will find starter files for the questions in this lab, along with a copy of the Ok autograder.


Consult this section if you need a refresher on the material for this lab. It's okay to skip directly to the questions and refer back here should you get stuck.

Regular expressions are a way to describe sets of strings that meet certain criteria, and are incredibly useful for pattern matching.

The simplest regular expression is one that matches a sequence of characters, like aardvark to match any "aardvark" substrings in a string.

However, you typically want to look for more interesting patterns. We recommend using an online tool like or for trying out patterns, since you'll get instant feedback on the match results.

Character classes

A character class makes it possible to search for any one of a set of characters. You can specify the set or use pre-defined sets.

Class Description
[abc] Matches a, b, or c
[a-z] Matches any character between a and z
[^A-Z] Matches any character that is not between A and Z.
\w Matches any "word" character. Equivalent to [A-Za-z0-9_].
\d Matches any digit. Equivalent to [0-9].
[0-9] Matches a single digit in the range 0 - 9. Equivalent to \d.
\s Matches any whitespace character (spaces, tabs, line breaks).
. Matches any character besides new line.

Character classes can be combined, like in [a-zA-Z0-9].

Combining patterns

There are multiple ways to combine patterns together in regular expressions.

Combo Description
AB A match for A followed immediately by one for B. Example: x[.,]y matches "x.y" or "x,y".
A|B Matches either A or B. Example: \d+|Inf matches either a sequence containing 1 or more digits or "Inf".

A pattern can be followed by one of these quantifiers to specify how many instances of the pattern can occur.

Symbol Description
* 0 or more occurrences of the preceding pattern. Example: [a-z]* matches any sequence of lower-case letters or the empty string.
+ 1 or more occurrences of the preceding pattern. Example: \d+ matches any non-empty sequence of digits.
? 0 or 1 occurrences of the preceding pattern. Example: [-+]? matches an optional sign.
{1,3} Matches the specified quantity of the preceding pattern. {1,3} will match from 1 to 3 instances. {3} will match exactly 3 instances. {3,} will match 3 or more instances. Example: \d{5,6} matches either 5 or 6 digit numbers.


Parentheses are used similarly as in arithmetic expressions, to create groups. For example, (Mahna)+ matches strings with 1 or more "Mahna", like "MahnaMahna". Without the parentheses, Mahna+ would match strings with "Mahn" followed by 1 or more "a" characters, like "Mahnaaaa".


  • ^: Matches the beginning of a string. Example: ^(I|You) matches I or You at the start of a string.
  • $: Normally matches the empty string at the end of a string or just before a newline at the end of a string. Example: (\.edu|\.org|\.com)$ matches .edu, .org, or .com at the end of a string.
  • \b: Matches a "word boundary", the beginning or end of a word. Example: s\b matches s characters at the end of words.

Special characters

The following special characters are used above to denote types of patterns:

\ / ( ) [ ] { } + * ? | $ ^ .

That means if you actually want to match one of those characters, you have to escape it using a backslash. For example, \(1\+3\) matches "(1 + 3)".

Using regular expressions in Python

Many programming languages have built-in functions for matching strings to regular expressions. We'll use the Python re module in 61A, but you can also use similar functionality in SQL, JavaScript, Excel, shell scripting, etc.

The search method searches for a pattern anywhere in a string:"(Mahna)+", "Mahna Mahna Ba Dee Bedebe")

That method returns back a match object, which is considered truth-y in Python and can be inspected to find the matching strings. If no match is found, returns None.

For more details, please consult the re module documentation or the re tutorial.

SQL Basics

Creating Tables

You can create SQL tables either from scratch or from existing tables.

The following statement creates a table by specifying column names and values without referencing another table. Each SELECT clause specifies the values for one row, and UNION is used to join rows together. The AS clauses give a name to each column; it need not be repeated in subsequent rows after the first.

CREATE TABLE [table_name] AS
  SELECT [val1] AS [column1], [val2] AS [column2], ... UNION
  SELECT [val3]             , [val4]             , ... UNION
  SELECT [val5]             , [val6]             , ...;

Let's say we want to make the following table called big_game which records the scores for the Big Game each year. This table has three columns: berkeley, stanford, and year.

We could do so with the following CREATE TABLE statement:

  SELECT 30 AS berkeley, 7 AS stanford, 2002 AS year UNION
  SELECT 28,             16,            2003         UNION
  SELECT 17,             38,            2014;

Selecting From Tables

More commonly, we will create new tables by selecting specific columns that we want from existing tables by using a SELECT statement as follows:

SELECT [columns] FROM [tables] WHERE [condition] ORDER BY [columns] LIMIT [limit];

Let's break down this statement:

  • SELECT [columns] tells SQL that we want to include the given columns in our output table; [columns] is a comma-separated list of column names, and * can be used to select all columns
  • FROM [table] tells SQL that the columns we want to select are from the given table; see the joins section to see how to select from multiple tables
  • WHERE [condition] filters the output table by only including rows whose values satisfy the given [condition], a boolean expression
  • ORDER BY [columns] orders the rows in the output table by the given comma-separated list of columns
  • LIMIT [limit] limits the number of rows in the output table by the integer [limit]

Note: We capitalize SQL keywords purely because of style convention. It makes queries much easier to read, though they will still work if you don't capitalize keywords.

Here are some examples:

Select all of Berkeley's scores from the big_game table, but only include scores from years past 2002:

sqlite> SELECT berkeley FROM big_game WHERE year > 2002;

Select the scores for both schools in years that Berkeley won:

sqlite> SELECT berkeley, stanford FROM big_game WHERE berkeley > stanford;

Select the years that Stanford scored more than 15 points:

sqlite> SELECT year FROM big_game WHERE stanford > 15;

SQL operators

Expressions in the SELECT, WHERE, and ORDER BY clauses can contain one or more of the following operators:

  • comparison operators: =, >, <, <=, >=, <> or != ("not equal")
  • boolean operators: AND, OR
  • arithmetic operators: +, -, *, /
  • concatenation operator: ||

Here are some examples:

Output the ratio of Berkeley's score to Stanford's score each year:

sqlite> select berkeley * 1.0 / stanford from big_game;

Output the sum of scores in years where both teams scored over 10 points:

sqlite> select berkeley + stanford from big_game where berkeley > 10 and stanford > 10;

Output a table with a single column and single row containing the value "hello world":

sqlite> SELECT "hello" || " " || "world";
hello world


Python already comes with a built-in SQLite database engine to process SQL. However, it doesn't come with a "shell" to let you interact with it from the terminal. Because of this, until now, you have been using a simplified SQLite shell written by us. However, you may find the shell is old, buggy, or lacking in features. In that case, you may want to download and use the official SQLite executable.

If running python3 didn't work, you can download a precompiled sqlite directly by following the following instructions and then use sqlite3 and ./sqlite3 instead of python3 based on which is specified for your platform.

Another way to start using SQLite is to download a precompiled binary from the SQLite website.

SQLite version 3.32.3 or higher should be sufficient.

However, before proceeding, please remove (or rename) any SQLite executables (sqlite3,, and the like) from the current folder, or they may conflict with the official one you download below. Similarly, if you wish to switch back later, please remove or rename the one you downloaded and restore the files you removed.


  1. Visit the download page linked above and navigate to the section Precompiled Binaries for Windows. Click on the link sqlite-tools-win32-x86-*.zip to download the binary.
  2. Unzip the file. There should be a sqlite3.exe file in the directory after extraction.
  3. Navigate to the folder containing the sqlite3.exe file and check that the version is at least 3.32.3:

    $ cd path/to/sqlite
    $ ./sqlite3 --version
    3.32.3 2020-06-18 14:16:19

macOS Big Sur (11.0.1) or newer

SQLite comes pre-installed. Check that you have a version that's greater than 3.32.3:

    $ sqlite3
    SQLite version 3.32.3

macOS (older versions)

SQLite comes pre-installed, but it may be the wrong version. You can take the following steps if the pre-installed version is less than 3.32.3.

  1. Visit the download page linked above and navigate to the section Precompiled Binaries for Mac OS X (x86). Click on the link sqlite-tools-osx-x86-*.zip to download the binary.
  2. Unzip the file. There should be a sqlite3 file in the directory after extraction.
  3. Navigate to the folder containing the sqlite3 file and check that the version is at least 3.32.3:

    $ cd path/to/sqlite
    $ ./sqlite3 --version
    3.32.3 2020-06-18 14:16:19


The easiest way to use SQLite on Ubuntu is to install it straight from the native repositories (the version will be slightly behind the most recent release). Check that the version is greater than 3.32.3:

$ sudo apt install sqlite3
$ sqlite3 --version
3.32.3 2020-06-18 14:16:19


To select data from multiple tables, we can use joins. There are many types of joins, but the only one we'll worry about is the inner join. To perform an inner join on two on more tables, simply list them all out in the FROM clause of a SELECT statement:

SELECT [columns] FROM [table1], [table2], ... WHERE [condition] ORDER BY [columns] LIMIT [limit];

We can select from multiple different tables or from the same table multiple times.

Let's say we have the following table that contains the names of head football coaches at Cal since 2002:

  SELECT "Jeff Tedford" AS name, 2002 as start, 2012 as end UNION
  SELECT "Sonny Dykes"         , 2013         , 2016        UNION
  SELECT "Justin Wilcox"       , 2017         , null;

When we join two or more tables, the default output is a cartesian product. For example, if we joined big_game with coaches, we'd get the following:

If we want to match up each game with the coach that season, we'd have to compare columns from the two tables in the WHERE clause:

sqlite> SELECT * FROM big_game, coaches WHERE year >= start AND year <= end;
17|38|2014|Sonny Dykes|2013|2016
28|16|2003|Jeff Tedford|2002|2012
30|7|2002|Jeff Tedford|2002|2012

The following query outputs the coach and year for each Big Game win recorded in big_game:

sqlite> SELECT name, year FROM big_game, coaches
...>        WHERE berkeley > stanford AND year >= start AND year <= end;
Jeff Tedford|2003
Jeff Tedford|2002

In the queries above, none of the column names are ambiguous. For example, it is clear that the name column comes from the coaches table because there isn't a column in the big_game table with that name. However, if a column name exists in more than one of the tables being joined, or if we join a table with itself, we must disambiguate the column names using aliases.

For examples, let's find out what the score difference is for each team between a game in big_game and any previous games. Since each row in this table represents one game, in order to compare two games we must join big_game with itself:

sqlite> SELECT b.Berkeley - a.Berkeley, b.Stanford - a.Stanford, a.Year, b.Year
...>        FROM big_game AS a, big_game AS b WHERE a.Year < b.Year;

In the query above, we give the alias a to the first big_game table and the alias b to the second big_game table. We can then reference columns from each table using dot notation with the aliases, e.g. a.Berkeley, a.Stanford, and a.Year to select from the first table.

SQL Aggregation

Previously, we have been dealing with queries that process one row at a time. When we join, we make pairwise combinations of all of the rows. When we use WHERE, we filter out certain rows based on the condition. Alternatively, applying an aggregate function such as MAX(column) combines the values in multiple rows.

By default, we combine the values of the entire table. For example, if we wanted to count the number of flights from our flights table, we could use:

sqlite> SELECT COUNT(*) from FLIGHTS;

What if we wanted to group together the values in similar rows and perform the aggregation operations within those groups? We use a GROUP BY clause.

Here's another example. For each unique departure, collect all the rows having the same departure airport into a group. Then, select the price column and apply the MIN aggregation to recover the price of the cheapest departure from that group. The end result is a table of departure airports and the cheapest departing flight.

sqlite> SELECT departure, MIN(price) FROM flights GROUP BY departure;

Just like how we can filter out rows with WHERE, we can also filter out groups with HAVING. Typically, a HAVING clause should use an aggregation function. Suppose we want to see all airports with at least two departures:

sqlite> SELECT departure FROM flights GROUP BY departure HAVING COUNT(*) >= 2;

Note that the COUNT(*) aggregate just counts the number of rows in each group. Say we want to count the number of distinct airports instead. Then, we could use the following query:

sqlite> SELECT COUNT(DISTINCT departure) FROM flights;

This enumerates all the different departure airports available in our flights table (in this case: SFO, LAX, AUH, SLC, SEA, and LAS).


First, check that a file named exists alongside the assignment files. If you don't see it, or if you encounter problems with it, scroll down to the Troubleshooting section to see how to download an official precompiled SQLite binary before proceeding.

You can start an interactive SQLite session in your Terminal or Git Bash with the following command:


While the interpreter is running, you can type .help to see some of the commands you can run.

To exit out of the SQLite interpreter, type .exit or .quit or press Ctrl-C. Remember that if you see ...> after pressing enter, you probably forgot a ;.

You can also run all the statements in a .sql file by doing the following: (Here we're using the lab11.sql file as an example.)

  1. Runs your code and then exits SQLite immediately afterwards.

    python3 < lab11.sql
  2. Runs your code and then opens an interactive SQLite session, which is similar to running Python code with the interactive -i flag.

    python3 --init lab11.sql


Regular Expressions

Q1: What Would RegEx Match?

For each of the following regular expressions, suggest a string that would be fully matched.

Use Ok to test your knowledge by choosing the best answer for each of the following questions:

python3 ok -q wwrm -u

A hexadecimal color code begins with # and is followed by exactly six hexadecimal numbers, which can be the digits 0-9 or letters a-f.

Q: #[a-f0-9]{6}
Choose the number of the correct choice:
0) A hexadecimal color code with 3 letters and 3 numbers
1) A hexadecimal color code that starts with letters and ends with numbers, like #gg1234
2) Any 6-digit hexadecimal color code, like #fdb515
3) Any hexadecimal color code with 0-6 digits

Q: (fizz(buzz|)|buzz)
Choose the number of the correct choice:
0) Only fizzbuzz or buzz
1) Only fizzbuzzbuzz
2) Only fizz
3) Only fizzbuzz, fizz, and buzz
4) Only fizzbuzz

Q: [-+]?\d*\.?\d+
Choose the number of the correct choice:
0) Only signed numbers like +1000, -1.5
1) Only signed or unsigned integers like +1000, -33
2) Signed or unsigned numbers like +1000, -1.5, .051
3) Only unsigned numbers like 0.051

Q: [1-9]+[05]+
Choose the number of the correct choice:
0) Any positive number
1) Numbers that are both greater than 5 and divisible by 5 like 10, 25, 800
2) Numbers that are divisible by 5 but do not have the digits 0 and 5 adjacent to each other as the last 2 digits
3) Numbers that are divisible by 5 like 5, 20, 6325

Q2: Scientific Name

Returns whether the input string name follows the correct format for a scientific name. A scientific name's format is as follows: starts with a capital letter, followed by a period (.) or a series of lowercase letters, followed by a space, followed by a series of lowercase letters. Refer to the doctests for examples of valid and invalid strings.

import re

def scientific_name(name):
    Returns True for strings that are in the correct notation for scientific names;
    i.e. contains a capital letter followed by a period or lowercase letters, 
    followed by a space, followed by more lowercase letters. Returns False for 
    invalid strings.

    >>> scientific_name("T. rex")
    >>> scientific_name("t. rex")
    >>> scientific_name("tyrannosurus rex")
    >>> scientific_name("t rex")
    >>> scientific_name("Falco peregrinus")
    >>> scientific_name("F peregrinus")
    >>> scientific_name("Annie the F. peregrinus")
    >>> scientific_name("I want a pet T. rex right now")
    return bool(, name))

Use Ok to test your code:

python3 ok -q scientific_name

Q3: Calculator Ops

Write a regular expression that parses strings written in the 61A Calculator language and returns True if any expression has exactly two numeric operands. Returns False otherwise.

Note: the allowed operators are +, -, *, and /. Check these lecture slides for a refresher on what the 61A calculator language is.

import re

def calculator_ops(calc_str):
    Returns True if an expression from the Calculator language that has two
    numeric operands exists in calc_str, False otherwise.

    >>> calculator_ops("(* 2 4)")
    >>> calculator_ops("(+ (* 3 (+ (* 2 4) (+ 3 5))) (+ (- 10 7) 6))")
    >>> calculator_ops("(* 2)")
    >>> calculator_ops("(/ 8 4 2)")
    >>> calculator_ops("(- 8 3)")
    >>> calculator_ops("+ 3 23")
    return bool(, calc_str))

Use Ok to test your code:

python3 ok -q calculator_ops

Survey Data

Survey Data

Last week, we asked you and your fellow students to complete a brief online survey through Google Forms, which involved relatively random but fun questions. In this lab, we will interact with the results of the survey by using SQL queries to see if we can find interesting things in the data.

First, take a look at data.sql and examine the table defined in it. Note its structure. You will be working with the two tables in this file.

The first is the table students, which is the main results of the survey. Each column represents a different question from the survey, except for the first column, which is the time of when the result was submitted. This time is a unique identifier for each of the rows in the table. The last several columns all correspond to the last question on the survey (more details below.)

Column Name Question
time The unique timestamp that identifies the submission
number What's your favorite number between 1 and 100?
color What is your favorite color?
seven Choose the number 7 below.
  • 7
  • Choose this option instead
  • seven
  • the number 7 below.
  • I find this question condescending
song If you could listen to only one of these songs for the rest of your life, which would it be?
  • "The Other Side Of Paradise" by Glass Animals
  • "Glimpse of Us" by Joji
  • "Leave the Door Open" by Anderson .Paak, Bruno Mars, and Silk Sonic
  • "Clair de Lune" by Claude Debussy
  • "Bohemian Rhapsody" by Queen
  • "Dancing Queen" by ABBA
  • "All that Jazz" from Chicago
  • "good 4 u" by Olivia Rodrigo
  • "Starman" by David Bowie
  • "Palette" by IU
  • "I Won't Say (I'm In Love)" from Hercules
date Pick a day of the year!
pet If you could have any animal in the world as a pet, what would it be?
instructor Choose your favorite photo of John DeNero
smallest Try to guess the smallest unique positive INTEGER that anyone will put!

The second table is numbers, which is the results from the survey in which students could select more than one option from the numbers listed, which ranged from 0 to 10 and included 2021, 2022, 9000, and 9001. Each row has a time (which is again a unique identifier) and has the value 'True' if the student selected the column or 'False' if the student did not. The column names in this table are the following strings, referring to each possible number: '0', '1', '2', '4', '5', '6', '7', '8', '9', '10', '2021', '2022', '9000', '9001'

Since the survey was anonymous, we used the timestamp that a survey was submitted as a unique identifier. A time in students matches up with a time in numbers. For example, a row in students whose time value is "11/17/2021 10:52:40" matches up with the row in numbers whose time value is "11/17/2021 10:52:40". These entries come from the same Google form submission and thus belong to the same student.

Note: If you are looking for your personal response within the data, you may have noticed that some of your answers are slightly different from what you had inputted. In order to make SQLite accept our data, and to optimize for as many matches as possible during our joins, we did the following things to clean up the data:

  • color and pet: We converted all the strings to be completely lowercase.
  • For some of the more "free-spirited" responses, we escaped the special characters so that they could be properly parsed.

You will write all of your solutions in the starter file lab11.sql provided. As with other labs, you can test your solutions with OK. In addition, you can use either of the following commands:

python3 < lab11.sql
python3 --init lab11.sql


Q4: What Would SQL print?

Note: there is no submission for this question

First, load the tables into sqlite3.

$ python3 --init lab11.sql

Before we start, inspect the schema of the tables that we've created for you:

sqlite> .schema

This tells you the name of each of our tables and their attributes.

Let's also take a look at some of the entries in our table. There are a lot of entries though, so let's just output the first 20:

sqlite> SELECT * FROM students LIMIT 20;

If you're curious about some of the answers students put into the Google form, open up data.sql in your favorite text editor and take a look!

For each of the SQL queries below, think about what the query is looking for, then try running the query yourself and see!

sqlite> SELECT * FROM students LIMIT 30; -- This is a comment. * is shorthand for all columns!
selects first 30 records from students;
sqlite> SELECT color FROM students WHERE number = 7;
selects the color from students who said their favorite number was 7;
sqlite> SELECT song, pet FROM students WHERE color = "blue" AND date = "12/25";
selects the song and pet from students who said their favorite color was blue and picked December 25th;

Remember to end each statement with a ;! To exit out of SQLite, type .exit or .quit or hit Ctrl-C.

Q5: Go Bears! (And Dogs?)

Now that we have learned how to select columns from a SQL table, let's filter the results to see some more interesting results!

It turns out that 61A students have a lot of school spirit: the most popular favorite color was 'blue'. You would think that this school spirit would carry over to the pet answer, and everyone would want a pet bear! Unfortunately, this was not the case, and the majority of students opted to have a pet 'dog' instead. That is the more sensible choice, I suppose...

Write a SQL query to create a table that contains both the column color and the column pet, using the keyword WHERE to restrict the answers to the most popular results of color being 'blue' and pet being 'dog'.

You should get the following output:

sqlite> SELECT * FROM bluedog;

This isn't a very exciting table, though. Each of these rows represents a different student, but all this table can really tell us is how many students both like the color blue and want a dog as a pet, because we didn't select for any other identifying characteristics. Let's create another table, bluedog_songs, that looks just like bluedog but also tells us how each student answered the song question.

You should get the following output:

sqlite> SELECT * FROM bluedog_songs;
blue|dog|Glimpse of Us
blue|dog|The Other Side of Paradise
blue|dog|Leave the Door Open
blue|dog|Bohemian Rhapsody
blue|dog|Dancing Queen
blue|dog|Bohemian Rhapsody
blue|dog|good 4 u
blue|dog|Clair de Lune
blue|dog|Dancing Queen
CREATE TABLE bluedog_songs AS

This distribution of songs actually largely represents the distribution of song choices that the total group of students made, so perhaps all we've learned here is that there isn't a correlation between a student's favorite color and desired pet, and what song they could spend the rest of their life listening to. Even demonstrating that there is no correlation still reveals facts about our data though!

Use Ok to test your code:

python3 ok -q bluedog

Q6: The Smallest Unique Positive Integer

Who successfully managed to guess the smallest unique positive integer value? Let's find out!

Write an SQL query to create a table with the columns time and smallest which contains the timestamp for each submission that made a unique guess for the smallest unique positive integer - that is, only one person put that number for their guess of the smallest unique integer. Also include their guess in the output.

Hint: Think about what attribute you need to GROUP BY. Which groups do we want to keep after this? We can filter this out using a HAVING clause. If you need a refresher on aggregation, see the topics section.

The submission with the timestamp corresponding to the minimum value of this table is the timestamp of the submission with the smallest unique positive integer!

CREATE TABLE smallest_int_having AS

Use Ok to test your code:

python3 ok -q smallest-int-having

Q7: Matchmaker, Matchmaker

Did you take 61A with the hope of finding a new group of friends? Well you're in luck! With all this data in hand, it's easy for us to find your perfect match. If two students want the same pet and have the same taste in music, they are clearly meant to be friends! In order to provide some more information for the potential pair to converse about, let's include the favorite colors of the two individuals as well!

In order to match up students, you will have to do a join on the students table with itself. When you do a join, SQLite will match every single row with every single other row, so make sure you do not match anyone with themselves, or match any given pair twice!

Important Note: When pairing the first and second person, make sure that the first person responded first (i.e. they have an earlier time). This is to ensure your output matches our tests.

Hint: When joining table names where column names are the same, use dot notation to distinguish which columns are from which table: [table_name].[column name]. This sometimes may get verbose, so it’s stylistically better to give tables an alias using the AS keyword. The syntax for this is as follows:

SELECT <[alias1].[column name1], [alias2].[columnname2]...>
    FROM <[table_name1] AS [alias1],[table_name2] AS [alias2]...> ...

The query in the football example from earlier uses this syntax.

Write a SQL query to create a table that has 4 columns:

  • The shared preferred pet of the pair
  • The shared favorite song of the pair
  • The favorite color of the first person
  • The favorite color of the second person
CREATE TABLE matchmaker AS

Use Ok to test your code:

python3 ok -q matchmaker


Make sure to submit this assignment by running:

python3 ok --submit