Study Guide: SQL

Instructions

This is a study guide with links to past lectures, assignments, and handouts, as well as additional practice problems to assist you in learning the concepts.

Assignments

Important: For solutions to these assignments once they have been released, see the main website

Handouts

Lectures

Readings

Guides

SQL

SQL is a declarative programming language. Unlike Python or Scheme where we write programs which provide the exact sequence of steps needed to solve a problem, SQL accepts instructions which express the desired result of the computation.

The challenge with writing SQL statements then is in determining how to compose the desired result! SQL has a strict syntax and a structured method of computation, so even though we write statements which express the desired result, we must still keep in mind the steps that SQL will follow to compute the result.

SQL operates on tables of data, which contains a number of fixed columns. Each row of a table represents an individual data point, with values for each column. SQL statements then operate on these tables by iterating over each row, determining if it should be included in the output relation (filtering), and then computing the resulting value which should appear in the table.

We can also describe SQL's implementation using the following code as an example. Imagine the SELECT, FROM, WHERE, and ORDER BY clauses are implemented as functions which act on rows. Here's a simplified view of how SQL might work, if implemented in simple Python.

output_table = []
for row in FROM(*input_tables):
    if WHERE(row):
        output_table += [SELECT(row)]
if ORDER_BY:
    output_table = ORDER_BY(output_table)
if LIMIT:
    output_table = output_table[:LIMIT]

Note that the ORDER BY and LIMIT clauses are applied only at the end after all the rows in the output table have been determined.

One of the important things to remember about SQL is that we always return to this very simple model of computation: looping, filtering, applying a function, and then ordering and limiting the final output.

The simple Python example above helps expose a limitation of SQL: we currently can't create output tables with more rows than in the input! There are a few methods for creating novel combinations of existing data: joins and SQL recursion. Aggregation allows us to find patterns and consider multiple rows together as a single unit, or group.

Joins

Joins create novel combinations of data by combining data from more than one source. Given multiple input tables, we can combine them in a join. Following the Python metaphor, the join is like creating nested for loops.

def FROM(table_1, table_2):
    for row_1 in table1:
        for row_2 in table2:
            yield row_1 + row_2

Given each row in table_1 and each row in table_2, the join iterates over each possible combination of rows and treats them as the input table. The same idea extends to more than two tables as well.

The SQL lab also has a great visual demonstrating this exact result as well.

Joins are particularly useful when we want to combine data on a single column. For example, say we have a table, dogs, containing the name and size of each dog, and a different table, parents, containing the name and parent of each dog. We might want to ask, "What's the difference in size between each dog and their parent?" by joining together the tables in a SQL statement.

The first question we should ask ourselves is, "Which data tables do we need to reference to assemble all the data we need?" We'll definitely need the table of parents to determine the name of each dog and their parent. From their names, we still need a way to get the size of each dog. That information is provided by the dogs table.

SELECT d.name, d.size, p.parent FROM dogs as d, parents as p WHERE d.name = p.name;

But referencing the dogs table only once will leave us in a tricky situation. We can find either the size of the dog or their parent, but not both!

SELECT d1.name, d1.size, d2.name, d2.size
  FROM dogs as d1, dogs as d2, parents as p
 WHERE d1.name = p.name AND p.parent = d2.name;

Joining the dogs table twice provides the necessary information to solve the problem.

Aggregation

We saw joins as a method for creating novel combinations of data, and recursion as an extension of joins. These methods combine data by extending the number of columns we have available to us and help us identify the patterns in data.

Aggregation functions allow us to operate on data in a different way by combining results across multiple rows. Common aggregation functions to be familiar with include COUNT, MIN, MAX, SUM, and AVG.

Applying an aggregation function to an input relation results in a single row containing the aggregate result.

> SELECT AVG(n) FROM n5;
3.0

But oftentimes, we'd like to condition the groups and compute aggregate results for smaller portions of the input relation. We can use GROUP BY and HAVING to split the rows into groups and select only a subset of the groups.

output_table = []
for input_group in GROUP_BY(FROM(*input_tables)):
    output_group = []
    for row in input_group:
        if WHERE(row):
            output_group += [row]
    if HAVING(output_group):
        output_table += [SELECT(output_group)]
if ORDER_BY:
    output_table = ORDER_BY(output_table)
if LIMIT:
    output_table = output_table[:LIMIT]

We take the results from the input tables, whether it's just a single table or a join, and then apply the same row-by-row processing within a group. Before adding the result of the group to the output table, we check to see if the values of the group reflect the condition in the HAVING clause which serves as a filter on the groups, much like how WHERE is a filter on the rows.

Practice Problems

Medium

Suppose that we have a table of positive integers up to 100, as in lecture:

CREATE TABLE ints AS
  WITH i(n) AS (
      SELECT 1 UNION
      SELECT n+1 FROM i LIMIT 100
  )
  SELECT n FROM i;

Q1: Divisors

Define a table divisors in which each row describes the number of unique divisors for an integer up to 100. For example, the number 16 has 5 unique divisors: 1, 2, 4, 8, and 16.

CREATE TABLE divisors AS
SELECT "REPLACE THIS LINE WITH YOUR SOLUTION";
SELECT a.n * b.n AS n, count(*) AS divisors FROM ints AS a, ints AS b WHERE a.n * b.n <= 100 GROUP BY a.n * b.n;

Here's an (incomplete) example of what the divisors table should look like.

-- Example:
SELECT * FROM divisors LIMIT 20;
-- Expected output:
--   1|1
--   2|2
--   3|2
--   4|3
--   5|2
--   6|4
--   7|2
--   8|4
--   9|3
--   10|4
--   11|2
--   12|6
--   13|2
--   14|4
--   15|4
--   16|5
--   17|2
--   18|6
--   19|2
--   20|6

Q2: Primes

Define a table primes that has a single column containing all prime numbers up to 100.

CREATE TABLE primes AS
SELECT "REPLACE THIS LINE WITH YOUR SOLUTION";
SELECT n FROM divisors WHERE divisors = 2;

Here's what your output should look like.

-- Example:
SELECT * FROM primes;
-- Expected output:
--   2
--   3
--   5
--   7
--   11
--   13
--   17
--   19
--   23
--   29
--   31
--   37
--   41
--   43
--   47
--   53
--   59
--   61
--   67
--   71
--   73
--   79
--   83
--   89
--   97

Hint: You may want to use your divisors table to solve this problem.