# Study Guide: SQL

## Instructions

This is a study guide with links to past lectures, assignments, and handouts, as well as additional practice problems to assist you in learning the concepts.

**Assignments**

Important: For solutions to these assignments once they have been released, see the main website

**Handouts**

**Lectures**

**Readings**

# Guides

## SQL

**SQL** is a declarative programming language. Unlike Python or Scheme where
we write programs which provide the exact sequence of steps needed to solve a
problem, SQL accepts instructions which express the desired result of the
computation.

The challenge with writing SQL statements then is in determining how to compose the desired result! SQL has a strict syntax and a structured method of computation, so even though we write statements which express the desired result, we must still keep in mind the steps that SQL will follow to compute the result.

SQL operates on *tables* of data, which contains a number of fixed *columns*.
Each row of a table represents an individual data point, with values for each
column. SQL statements then operate on these tables by iterating over each row,
determining if it should be included in the output relation (filtering), and
then computing the resulting value which should appear in the table.

We can also describe SQL's implementation using the following code as an
example. Imagine the `SELECT`

, `FROM`

, `WHERE`

, and `ORDER BY`

clauses are
implemented as functions which act on rows. Here's a simplified view of how SQL
might work, if implemented in simple Python.

```
output_table = []
for row in FROM(*input_tables):
if WHERE(row):
output_table += [SELECT(row)]
if ORDER_BY:
output_table = ORDER_BY(output_table)
if LIMIT:
output_table = output_table[:LIMIT]
```

Note that the `ORDER BY`

and `LIMIT`

clauses are applied only at the end after
all the rows in the output table have been determined.

One of the important things to remember about SQL is that we always return to this very simple model of computation: looping, filtering, applying a function, and then ordering and limiting the final output.

The simple Python example above helps expose a limitation of SQL: we currently
can't create output tables with more rows than in the input! There are a few
methods for creating novel combinations of existing data: **joins** and SQL
**recursion**. **Aggregation** allows us to find patterns and consider multiple
rows together as a single unit, or group.

### Joins

Joins create novel combinations of data by combining data from more than one
source. Given multiple input tables, we can combine them in a join. Following
the Python metaphor, the join is like creating nested `for`

loops.

```
def FROM(table_1, table_2):
for row_1 in table1:
for row_2 in table2:
yield row_1 + row_2
```

Given each row in `table_1`

and each row in `table_2`

, the join iterates over
each possible combination of rows and treats them as the input table. The same
idea extends to more than two tables as well.

The SQL lab also has a great visual demonstrating this exact result as well.

Joins are particularly useful when we want to combine data on a single column.
For example, say we have a table, `dogs`

, containing the `name`

and `size`

of
each dog, and a different table, `parents`

, containing the `name`

and `parent`

of each dog. We might want to ask, "What's the difference in size between each
dog and their parent?" by joining together the tables in a SQL statement.

The first question we should ask ourselves is, "Which data tables do we need to
reference to assemble all the data we need?" We'll definitely need the table of
`parents`

to determine the name of each dog and their parent. From their names,
we still need a way to get the size of each dog. That information is provided
by the `dogs`

table.

`SELECT d.name, d.size, p.parent FROM dogs as d, parents as p WHERE d.name = p.name;`

But referencing the `dogs`

table only once will leave us in a tricky situation.
We can find either the size of the dog or their parent, but not both!

```
SELECT d1.name, d1.size, d2.name, d2.size
FROM dogs as d1, dogs as d2, parents as p
WHERE d1.name = p.name AND p.parent = d2.name;
```

Joining the `dogs`

table twice provides the necessary information to solve the
problem.

### Aggregation

We saw joins as a method for creating novel combinations of data, and recursion as an extension of joins. These methods combine data by extending the number of columns we have available to us and help us identify the patterns in data.

**Aggregation functions** allow us to operate on data in a different way by
combining results across *multiple rows*. Common aggregation functions to be
familiar with include `COUNT`

, `MIN`

, `MAX`

, `SUM`

, and `AVG`

.

Applying an aggregation function to an input relation results in a single row containing the aggregate result.

```
> SELECT AVG(n) FROM n5;
3.0
```

But oftentimes, we'd like to condition the groups and compute aggregate results
for smaller portions of the input relation. We can use `GROUP BY`

and `HAVING`

to split the rows into groups and select only a subset of the groups.

```
output_table = []
for input_group in GROUP_BY(FROM(*input_tables)):
output_group = []
for row in input_group:
if WHERE(row):
output_group += [row]
if HAVING(output_group):
output_table += [SELECT(output_group)]
if ORDER_BY:
output_table = ORDER_BY(output_table)
if LIMIT:
output_table = output_table[:LIMIT]
```

We take the results from the input tables, whether it's just a single table or
a join, and then apply the same row-by-row processing *within a group*. Before
adding the result of the group to the output table, we check to see if the
values of the group reflect the condition in the `HAVING`

clause which serves
as a filter on the groups, much like how `WHERE`

is a filter on the rows.

# Practice Problems

## Medium

Suppose that we have a table of positive integers up to 100, as in lecture:

```
CREATE TABLE ints AS
WITH i(n) AS (
SELECT 1 UNION
SELECT n+1 FROM i LIMIT 100
)
SELECT n FROM i;
```

### Q1: Divisors

Define a table `divisors`

in which each row describes the number of unique
divisors for an integer up to 100. For example, the number 16 has 5 unique
divisors: 1, 2, 4, 8, and 16.

```
CREATE TABLE divisors AS
SELECT "REPLACE THIS LINE WITH YOUR SOLUTION";
SELECT a.n * b.n AS n, count(*) AS divisors
FROM ints AS a, ints AS b
WHERE a.n * b.n <= 100
GROUP BY a.n * b.n;
```

Here's an (incomplete) example of what the `divisors`

table should look like.

```
-- Example:
SELECT * FROM divisors LIMIT 20;
-- Expected output:
-- 1|1
-- 2|2
-- 3|2
-- 4|3
-- 5|2
-- 6|4
-- 7|2
-- 8|4
-- 9|3
-- 10|4
-- 11|2
-- 12|6
-- 13|2
-- 14|4
-- 15|4
-- 16|5
-- 17|2
-- 18|6
-- 19|2
-- 20|6
```

### Q2: Primes

Define a table `primes`

that has a single column containing all prime numbers up
to 100.

```
CREATE TABLE primes AS
SELECT "REPLACE THIS LINE WITH YOUR SOLUTION";
SELECT n FROM divisors WHERE divisors = 2;
```

Here's what your output should look like.

```
-- Example:
SELECT * FROM primes;
-- Expected output:
-- 2
-- 3
-- 5
-- 7
-- 11
-- 13
-- 17
-- 19
-- 23
-- 29
-- 31
-- 37
-- 41
-- 43
-- 47
-- 53
-- 59
-- 61
-- 67
-- 71
-- 73
-- 79
-- 83
-- 89
-- 97
```

Hint: You may want to use your`divisors`

table to solve this problem.