Study Guide: SQL
This is a study guide with links to past lectures, assignments, and handouts, as well as additional practice problems to assist you in learning the concepts.
Important: For solutions to these assignments once they have been released, see the main website
SQL is a declarative programming language. Unlike Python or Scheme where we write programs which provide the exact sequence of steps needed to solve a problem, SQL accepts instructions which express the desired result of the computation.
The challenge with writing SQL statements then is in determining how to compose the desired result! SQL has a strict syntax and a structured method of computation, so even though we write statements which express the desired result, we must still keep in mind the steps that SQL will follow to compute the result.
SQL operates on tables of data, which contains a number of fixed columns. Each row of a table represents an individual data point, with values for each column. SQL statements then operate on these tables by iterating over each row, determining if it should be included in the output relation (filtering), and then computing the resulting value which should appear in the table.
We can also describe SQL's implementation using the following code as an
example. Imagine the
ORDER BY clauses are
implemented as functions which act on rows. Here's a simplified view of how SQL
might work, if implemented in simple Python.
output_table =  for row in FROM(*input_tables): if WHERE(row): output_table += [SELECT(row)] if ORDER_BY: output_table = ORDER_BY(output_table) if LIMIT: output_table = output_table[:LIMIT]
Note that the
ORDER BY and
LIMIT clauses are applied only at the end after
all the rows in the output table have been determined.
One of the important things to remember about SQL is that we always return to this very simple model of computation: looping, filtering, applying a function, and then ordering and limiting the final output.
The simple Python example above helps expose a limitation of SQL: we currently can't create output tables with more rows than in the input! There are a few methods for creating novel combinations of existing data: joins and SQL recursion. Aggregation allows us to find patterns and consider multiple rows together as a single unit, or group.
Joins create novel combinations of data by combining data from more than one
source. Given multiple input tables, we can combine them in a join. Following
the Python metaphor, the join is like creating nested
def FROM(table_1, table_2): for row_1 in table1: for row_2 in table2: yield row_1 + row_2
Given each row in
table_1 and each row in
table_2, the join iterates over
each possible combination of rows and treats them as the input table. The same
idea extends to more than two tables as well.
The SQL lab also has a great visual demonstrating this exact result as well.
Joins are particularly useful when we want to combine data on a single column.
For example, say we have a table,
dogs, containing the
each dog, and a different table,
parents, containing the
of each dog. We might want to ask, "What's the difference in size between each
dog and their parent?" by joining together the tables in a SQL statement.
The first question we should ask ourselves is, "Which data tables do we need to
reference to assemble all the data we need?" We'll definitely need the table of
parents to determine the name of each dog and their parent. From their names,
we still need a way to get the size of each dog. That information is provided
SELECT d.name, d.size, p.parent FROM dogs as d, parents as p WHERE d.name = p.name;
But referencing the
dogs table only once will leave us in a tricky situation.
We can find either the size of the dog or their parent, but not both!
SELECT d1.name, d1.size, d2.name, d2.size FROM dogs as d1, dogs as d2, parents as p WHERE d1.name = p.name AND p.parent = d2.name;
dogs table twice provides the necessary information to solve the
We saw joins as a method for creating novel combinations of data, and recursion as an extension of joins. These methods combine data by extending the number of columns we have available to us and help us identify the patterns in data.
Aggregation functions allow us to operate on data in a different way by
combining results across multiple rows. Common aggregation functions to be
familiar with include
Applying an aggregation function to an input relation results in a single row containing the aggregate result.
> SELECT AVG(n) FROM n5; 3.0
But oftentimes, we'd like to condition the groups and compute aggregate results
for smaller portions of the input relation. We can use
GROUP BY and
to split the rows into groups and select only a subset of the groups.
output_table =  for input_group in GROUP_BY(FROM(*input_tables)): output_group =  for row in input_group: if WHERE(row): output_group += [row] if HAVING(output_group): output_table += [SELECT(output_group)] if ORDER_BY: output_table = ORDER_BY(output_table) if LIMIT: output_table = output_table[:LIMIT]
We take the results from the input tables, whether it's just a single table or
a join, and then apply the same row-by-row processing within a group. Before
adding the result of the group to the output table, we check to see if the
values of the group reflect the condition in the
HAVING clause which serves
as a filter on the groups, much like how
WHERE is a filter on the rows.
Suppose that we have a table of positive integers up to 100, as in lecture:
CREATE TABLE ints AS WITH i(n) AS ( SELECT 1 UNION SELECT n+1 FROM i LIMIT 100 ) SELECT n FROM i;
Define a table
divisors in which each row describes the number of unique
divisors for an integer up to 100. For example, the number 16 has 5 unique
divisors: 1, 2, 4, 8, and 16.
CREATE TABLE divisors ASSELECT "REPLACE THIS LINE WITH YOUR SOLUTION";SELECT a.n * b.n AS n, count(*) AS divisors FROM ints AS a, ints AS b WHERE a.n * b.n <= 100 GROUP BY a.n * b.n;
Here's an (incomplete) example of what the
divisors table should look like.
-- Example: SELECT * FROM divisors LIMIT 20; -- Expected output: -- 1|1 -- 2|2 -- 3|2 -- 4|3 -- 5|2 -- 6|4 -- 7|2 -- 8|4 -- 9|3 -- 10|4 -- 11|2 -- 12|6 -- 13|2 -- 14|4 -- 15|4 -- 16|5 -- 17|2 -- 18|6 -- 19|2 -- 20|6
Define a table
primes that has a single column containing all prime numbers up
CREATE TABLE primes ASSELECT "REPLACE THIS LINE WITH YOUR SOLUTION";SELECT n FROM divisors WHERE divisors = 2;
Here's what your output should look like.
-- Example: SELECT * FROM primes; -- Expected output: -- 2 -- 3 -- 5 -- 7 -- 11 -- 13 -- 17 -- 19 -- 23 -- 29 -- 31 -- 37 -- 41 -- 43 -- 47 -- 53 -- 59 -- 61 -- 67 -- 71 -- 73 -- 79 -- 83 -- 89 -- 97
Hint: You may want to use your
divisorstable to solve this problem.