Lab 13: SQL II: Aggregation | Spanish Grammar

Due at 11:59pm on 4/29/2015.

Starter Files

Download lab13.zip. Inside the archive, you will find starter files for the questions in this lab, along with a copy of the OK autograder.

Submission

By the end of this lab, you should have submitted the lab with python3 ok --submit. You may submit more than once before the deadline; only the final submission will be graded.

To receive credit for this lab, you must complete the questions in lab13.sql.

Table of Contents

Introduction

In this lab, you will use a subset of the IULA Spanish LSP Treebank developed by the Institut Universitari de Lingüística Aplicada at the Universitat Pompeu Fabra.

This treebank consists of several sentences from technical documents that have been annotated with the syntactic category of each word (e.g., noun or verb), along with all of its syntactic dependents. For instance, in the sentence "I have a computer," The words "I" and "computer" are dependents of the verb "have", and "a" is a dependent of the word "computer". Each dependent has a type, such as subj for the subject of a verb. The exact definition of a syntactic dependent is debated in linguistics, but a reasonable and consistent convention was used to annotate this dataset.

The lab13.py file generates a table from the data_spanish.conll text file, which describes a collection of annotated sentence. You need to run

python3 lab13.py

to generate a single table called deps that has five columns. Each row describes one dependent of a word in a sentence. The columns are:

To check that you have successfully created this table, run

python3 ok -q generate_table

To use the deps table, first open the spanish.db file from sqlite3. Keep the sqlite interpreter running to run our examples throughout the lab.

$ sqlite3
sqlite> .open spanish.db
sqlite> select word from deps where dependent_word = "caliente";
agua
agua
agua
plato
muestra

Exploring Spanish

Nouns in Spanish are masculine or feminine. Most nouns that end in "a" are feminine, but some (mostly with Greek origin) are masculine, such as "problema" (problem) or "sistema" (system).

The articles "el" and "la" both mean "the", but the first is used with masculine nouns while the second is used with feminine nouns.

Question 1

Create a table called masculine_a full of masculine nouns that end in "a". You can tell if a word ends in "a" using the expression substr(word, -1) = "a". You can tell if a word is a masculine noune if "el" is a dependent_word.

-- A table containing words that are masculine but end in "a"
CREATE TABLE masculine_a as
-- REPLACE THIS LINE select 'YOUR CODE HERE';
select word from deps where dependent_word = "el" and substr(word, -1) = "a";

Note: Finding all words ending in "a" that have "el" as a dependent will not find every masculine noun ending in "a" in Spanish. Also, some nouns that start with "a", such as "agua", are feminine but nonetheless take "el" as an article.

The following command in sqlite3 should yield the following output:

sqlite> .read lab13.sql
sqlite> select * from masculine_a limit 5;
energía
estrategia
energía
agricultura
clima

Check your work before proceeding.

python3 ok -q masculine

SQL Aggregation

Previously, we have been dealing with queries that process one row at a time. When we join, we make pairwise combinations of all of the rows. When we use WHERE, we filter out certain rows based on the condition. Alternatively, applying an aggregate function such as MAX(column) combines the values in multiple rows.

By default, all rows are combined together. What if we wanted to group together the values in similar rows and perform the aggregation operations within those groups? We use a GROUP BY clause.

Here's an example. For every unique value in category, collect all the rows having that value into a group. Then, provide that category and count the number of rows that have that category. The word "poder" is both a noun meaning "power" and a verb meaning "be able to" do something. We can see that it is usually a verb.

sqlite> SELECT category, count(*) FROM deps where word="poder" GROUP BY category;
NC|26
VMG|2
VMI|1362
VMN|7
VMP|16
VMS|6

Just like how we can filter out rows with WHERE, we can also filter out groups with HAVING. Important: A HAVING clause should use an aggregate function. Suppose we want to see all categories with at least 100 rows:

sqlite> SELECT category, COUNT(*) FROM deps GROUP BY category HAVING COUNT(*) > 100;
A|5310
CC|3955
CS|526
DA|138
F|166
NC|63980
NP|3076
PI|125
R|536
S|25693
VAI|512
VMG|183
VMI|10291
VMM|207
VMN|2274
VMP|2584
VMS|383
VSI|2978
VSN|178
Z|933

This query discards rare categories such as Demonstrative Determiner ("DD"), but keeps common ones such as Coordinating Conjunction ("CC").

Corpus Statistics

Please edit lab13.sql for the following two questions.

Question 2: Word Category Counts

Write a query that counts the number of appearances of a particular word-category pair in the dataset. The new table word_cat_count should have three columns: word, category, and count. How should you group the rows together?

Hint: To group based on information from multiple columns, not just one, simply add commas between the columns: GROUP BY school, major.

-- A table containing each word, category, and count of that group
CREATE TABLE word_cat_count AS
-- REPLACE THIS LINE select 'YOUR CODE HERE';
SELECT word, category, COUNT(*) as count FROM deps GROUP BY word, category;

Use OK to test your solution before moving on:

python3 ok -q word_cat_count

What are all the rows in word_cat_count involving the word "ser"?

$ sqlite3
sqlite> .read lab13.sql
sqlite> SELECT * FROM word_cat_count WHERE word = "ser";
ser|NC|15
ser|VSG|8
ser|VSI|2978
ser|VSN|178
ser|VSP|71
ser|VSS|78

So, "ser" is a verb meaning "to be", e.g. "Soy de California" means "I am from California." We expect to see a lot of verb (V__) tags in this result. But wait, there are also cases where "ser" is used as a noun (NC)! ¡Increíble! It turns out that "ser" can also mean "being", for instance ser humano means "human being".

Question 3: Noun Verbs

Write a query that finds all words that can appear as both nouns and verbs. Hint: Use the word_cat_count table you just wrote.

The resulting rows shoud look like the following:

deber|NC|17|deber|VMG|2

Hint: No aggregation is required here because we need to compare individual rows, not aggregate statistics/concatenations of word group. This question is a review for joins.

Hint: Use select * to select all columns from a joined table.

Hint: Nouns are always NC, but verbs can be many different categories, such as "VMI" (main indicative verb) or "VAS" (auxiliary subjunctive verb). To capture all the possinge strings, use the expression verb LIKE "V__" instead of verb = "VMI" AND verb = "VMP" AND .... The underscores match any character. LIKE is how SQL searches strings, instead of just looking for exact matches.

-- A table containing all info of words that can be both nouns and verbs
CREATE TABLE noun_verbs AS
-- REPLACE THIS LINE select 'YOUR CODE HERE';
SELECT * FROM word_cat_count as n, word_cat_count as v WHERE n.word = v.word AND n.category == "NC" AND v.category LIKE "V__";

The following command in sqlite3 should yield the following output:

sqlite> .read lab13.sql
sqlite> select * from noun_verbs limit 10;
deber|NC|17|deber|VMG|2
deber|NC|17|deber|VMI|407
deber|NC|17|deber|VMN|1
deber|NC|17|deber|VMP|18
deber|NC|17|deber|VMS|4
despertar|NC|3|despertar|VMN|2
haber|NC|2|haber|VAI|2
haber|NC|2|haber|VMI|149
haber|NC|2|haber|VMS|2
poder|NC|26|poder|VMG|2

Use OK to test your solution before moving on:

python3 ok -q noun_verbs

It turns out that haber ("to have (aux)"), like ha comido ("has eaten") can also appear as a noun!

Extra Questions

The following questions are for extra practice — they can be found in the lab13_extra.sql file. It is recommended that you complete these problems as well, but you do not need to turn them in for credit.

Dependency Frequencies

We are now ready to compute a big frequency table in lab13_extra.sql. The code for frequencies is provided for you. In particular, we group by the word, category, child dependency, and child category. Why did we consider the child's category in our group? A modifier can, for instance, be a preposition ("S") or an adjective ("A").

Let's run a few examples with the new frequencies table:

$ sqlite3
sqlite> .read lab13_extra.sql
sqlite> -- Example: See what arguments certain verbs take in
sqlite> SELECT * FROM frequencies WHERE word = "figurar";
figurar|VMI|MOD|S|1|0.0555555555555556
figurar|VMI|PP-LOC|S|8|0.444444444444444
figurar|VMI|SUBJ|NC|9|0.5
figurar|VMN|MOD|S|1|1.0

sqlite> -- Example: See what deps some words must have. "F" denotes punctuation
sqlite> SELECT word, category, dependency_type, dependent_category, count
   ...> FROM frequencies
   ...> WHERE frequency = 1 AND count > 5 AND dependent_category != "F";
autorizar|VMN|DO|NC|6
auxiliar|VMN|MOD|S|15
capaz|A|COMP|S|6
en_función_de|S|COMP|NC|6
este|DD|SPEC|DI|11
...

Question 4: The Most Likely Child

The next natural question to ask is: for a particular word, what kind of dependent should we expect to see? This question can actually help us identify certain Spanish constructions (if we're lucky) that are systematically different from English ones.

Write an appropriate select query for likeliest_child, whose columns are word, category, dependency_type, dependent_category, and the maximum frequency for each word-category pair under the column name max_frequency.

Conditions: We want to ignore deps that are just punctuation, so make sure dependent_category != "F" somewhere in the query. We also want the maximum frequency found in each word-category pair to be > 0.5 and the total number of the word-category pair occurrences to be > 10.

Hint: To reiterate, we are finding the maximum frequency for each word-category pair. How will we use GROUP BY to carry out this specific query?

Hint2: Which condition applies to each row? Which condition applies to each group? This then determines which condition(s) to put in WHERE and which to put in HAVING.

-- For each word-category pair, what is the most likely dependent category?
-- Dependent category cannot be punctuation (i.e. dependent_category != "F")
-- The observed frequency of the dependent should be greater than 0.5
-- The number of occurrences of the word-category pair should be greater than 10
CREATE TABLE likeliest_child AS
-- REPLACE THIS LINE select 'YOUR CODE HERE';
SELECT word, category, dependency_type, dependent_category, MAX(frequency) as max_frequency FROM frequencies WHERE dependent_category != "F" GROUP BY word, category HAVING MAX(frequency) > 0.5 AND SUM(count) > 10;

The following command in sqlite3 should yield the following output:

sqlite> .read lab13_extra.sql
sqlite> select word, category, dependency_type, dependent_category, ROUND(max_frequency, 5) from likeliest_child limit 20;
,|F|CONJ|NC|0.58108
ZD|Z|COMP|S|0.75
a|S|COMP|NC|0.78444
a_partir_de|S|COMP|NC|0.84
a_través_de|S|COMP|NC|0.75
abdomen|NC|SPEC|DA|0.64286
absceso|NC|MOD|A|0.58824
acceder|VMN|OBLC|S|0.75
actualidad|NC|SPEC|DA|0.73333
actualización|NC|COMP|S|0.58333
al|CS|COMP|VMN|0.95
alguno|PI|COMP|S|0.85714
ante|S|COMP|NC|0.65385
antebrazo|NC|SPEC|DA|0.7619
antes_de|S|COMP|NC|0.56
aorta|NC|MOD|A|0.53333
aplicable|A|COMP|S|0.83333
aplicar|VMN|DO|NC|0.52632
aprendiz|NC|MOD|S|0.72727
arteria|NC|MOD|A|0.55357

Use OK to test your solution:

python3 ok -q likeliest_child

We see a lot of common nouns ("NC") that have an article ("DA"), verbs ("V__") that take a direct object ("DO"), or prepositions ("S") that have noun complements. These examples are mostly similar to English. Are there others that surprise you?

In particular, I saw an interesting example: relacionar|VMP|OBLC|S|0.609756097560976

It turns out that relacionado ("related", "VMP" means participle) is used frequently with an oblique object "OBLC", namely an object denoted by con ("with").

Example: "La supervivencia de el hombre ha estado siempre estrechamente relacionada con el agua." Translation: "Human survival has always been closely related to water."

If you come up with other interesting queries, please feel free to post them on Piazza or discuss with your classmates/lab assistants/TAs!

Fin. (The end.)

Extensions

I had a lot of fun investigating Spanish, and I hope you did too. Here are some other extensions you can explore over your summer break:

Format Description

The attached narrative contains information about the CONLL file format and the convention used for dependency tags. The part of speech tagging convention can be found at this FreeLing page.