Lab 13: SQL II: Aggregation | Spanish Grammar

Due at 11:59pm on 4/29/2015.

Starter Files

Download lab13.zip. Inside the archive, you will find starter files for the questions in this lab, along with a copy of the OK autograder.

Submission

By the end of this lab, you should have submitted the lab with python3 ok --submit. You may submit more than once before the deadline; only the final submission will be graded.

To receive credit for this lab, you must complete the questions in lab13.sql.

Introduction
Exploring Spanish

Question 1

SQL Aggregation
Corpus Statistics

Question 2: Word Category Counts
Question 3: Noun Verbs

Extra Questions
Dependency Frequencies

Question 4: The Most Likely Child

Extensions
Format Description

Introduction

In this lab, you will use a subset of the IULA Spanish LSP Treebank developed by the Institut Universitari de Lingüística Aplicada at the Universitat Pompeu Fabra.

This treebank consists of several sentences from technical documents that have been annotated with the syntactic category of each word (e.g., noun or verb), along with all of its syntactic dependents. For instance, in the sentence "I have a computer," The words "I" and "computer" are dependents of the verb "have", and "a" is a dependent of the word "computer". Each dependent has a type, such as subj for the subject of a verb. The exact definition of a syntactic dependent is debated in linguistics, but a reasonable and consistent convention was used to annotate this dataset.

The lab13.py file generates a table from the data_spanish.conll text file, which describes a collection of annotated sentence. You need to run

python3 lab13.py

to generate a single table called deps that has five columns. Each row describes one dependent of a word in a sentence. The columns are:

word - the base form of a word
category - the syntactic category of a word
dependent_word - the base form of the dependent word
dependent_category - the syntactic category of the dependent word
dependency_type - the type of dependency between the word and its dependent

To check that you have successfully created this table, run

python3 ok -q generate_table

To use the deps table, first open the spanish.db file from sqlite3. Keep the sqlite interpreter running to run our examples throughout the lab.

$ sqlite3
sqlite> .open spanish.db
sqlite> select word from deps where dependent_word = "caliente";
agua
agua
agua
plato
muestra

Exploring Spanish

Nouns in Spanish are masculine or feminine. Most nouns that end in "a" are feminine, but some (mostly with Greek origin) are masculine, such as "problema" (problem) or "sistema" (system).

The articles "el" and "la" both mean "the", but the first is used with masculine nouns while the second is used with feminine nouns.

Question 1

Create a table called masculine_a full of masculine nouns that end in "a". You can tell if a word ends in "a" using the expression substr(word, -1) = "a". You can tell if a word is a masculine noune if "el" is a dependent_word.

-- A table containing words that are masculine but end in "a"
CREATE TABLE masculine_a as
    -- REPLACE THIS LINE
    select 'YOUR CODE HERE';
    select word from deps where dependent_word = "el" and substr(word, -1) = "a";

Note: Finding all words ending in "a" that have "el" as a dependent will not find every masculine noun ending in "a" in Spanish. Also, some nouns that start with "a", such as "agua", are feminine but nonetheless take "el" as an article.

The following command in sqlite3 should yield the following output:

sqlite> .read lab13.sql
sqlite> select * from masculine_a limit 5;
energía
estrategia
energía
agricultura
clima

Check your work before proceeding.

python3 ok -q masculine

SQL Aggregation

Previously, we have been dealing with queries that process one row at a time. When we join, we make pairwise combinations of all of the rows. When we use WHERE, we filter out certain rows based on the condition. Alternatively, applying an aggregate function such as MAX(column) combines the values in multiple rows.

By default, all rows are combined together. What if we wanted to group together the values in similar rows and perform the aggregation operations within those groups? We use a GROUP BY clause.

Here's an example. For every unique value in category, collect all the rows having that value into a group. Then, provide that category and count the number of rows that have that category. The word "poder" is both a noun meaning "power" and a verb meaning "be able to" do something. We can see that it is usually a verb.

sqlite> SELECT category, count(*) FROM deps where word="poder" GROUP BY category;
NC|26
VMG|2
VMI|1362
VMN|7
VMP|16
VMS|6

Just like how we can filter out rows with WHERE, we can also filter out groups with HAVING. Important: A HAVING clause should use an aggregate function. Suppose we want to see all categories with at least 100 rows:

sqlite> SELECT category, COUNT(*) FROM deps GROUP BY category HAVING COUNT(*) > 100;
A|5310
CC|3955
CS|526
DA|138
F|166
NC|63980
NP|3076
PI|125
R|536
S|25693
VAI|512
VMG|183
VMI|10291
VMM|207
VMN|2274
VMP|2584
VMS|383
VSI|2978
VSN|178
Z|933

This query discards rare categories such as Demonstrative Determiner ("DD"), but keeps common ones such as Coordinating Conjunction ("CC").

Corpus Statistics

Please edit lab13.sql for the following two questions.

Question 2: Word Category Counts

Write a query that counts the number of appearances of a particular word-category pair in the dataset. The new table word_cat_count should have three columns: word, category, and count. How should you group the rows together?

Hint: To group based on information from multiple columns, not just one, simply add commas between the columns: GROUP BY school, major.

-- A table containing each word, category, and count of that group
CREATE TABLE word_cat_count AS
    -- REPLACE THIS LINE
    select 'YOUR CODE HERE';
    SELECT word, category, COUNT(*) as count
    FROM deps
    GROUP BY word, category;

Use OK to test your solution before moving on:

python3 ok -q word_cat_count

What are all the rows in word_cat_count involving the word "ser"?

$ sqlite3
sqlite> .read lab13.sql
sqlite> SELECT * FROM word_cat_count WHERE word = "ser";
ser|NC|15
ser|VSG|8
ser|VSI|2978
ser|VSN|178
ser|VSP|71
ser|VSS|78

So, "ser" is a verb meaning "to be", e.g. "Soy de California" means "I am from California." We expect to see a lot of verb (V__) tags in this result. But wait, there are also cases where "ser" is used as a noun (NC)! ¡Increíble! It turns out that "ser" can also mean "being", for instance ser humano means "human being".

Question 3: Noun Verbs

Write a query that finds all words that can appear as both nouns and verbs. Hint: Use the word_cat_count table you just wrote.

The resulting rows shoud look like the following:

deber|NC|17|deber|VMG|2

Hint: No aggregation is required here because we need to compare individual rows, not aggregate statistics/concatenations of word group. This question is a review for joins.

Hint: Use select * to select all columns from a joined table.

Hint: Nouns are always NC, but verbs can be many different categories, such as "VMI" (main indicative verb) or "VAS" (auxiliary subjunctive verb). To capture all the possinge strings, use the expression verb LIKE "V__" instead of verb = "VMI" AND verb = "VMP" AND .... The underscores match any character. LIKE is how SQL searches strings, instead of just looking for exact matches.

-- A table containing all info of words that can be both nouns and verbs
CREATE TABLE noun_verbs AS
    -- REPLACE THIS LINE
    select 'YOUR CODE HERE';
    SELECT * FROM word_cat_count as n, word_cat_count as v
    WHERE n.word = v.word AND n.category == "NC" AND v.category LIKE "V__";

The following command in sqlite3 should yield the following output:

sqlite> .read lab13.sql
sqlite> select * from noun_verbs limit 10;
deber|NC|17|deber|VMG|2
deber|NC|17|deber|VMI|407
deber|NC|17|deber|VMN|1
deber|NC|17|deber|VMP|18
deber|NC|17|deber|VMS|4
despertar|NC|3|despertar|VMN|2
haber|NC|2|haber|VAI|2
haber|NC|2|haber|VMI|149
haber|NC|2|haber|VMS|2
poder|NC|26|poder|VMG|2

Use OK to test your solution before moving on:

python3 ok -q noun_verbs

It turns out that haber ("to have (aux)"), like ha comido ("has eaten") can also appear as a noun!

Extra Questions

The following questions are for extra practice — they can be found in the lab13_extra.sql file. It is recommended that you complete these problems as well, but you do not need to turn them in for credit.

Dependency Frequencies

We are now ready to compute a big frequency table in lab13_extra.sql. The code for frequencies is provided for you. In particular, we group by the word, category, child dependency, and child category. Why did we consider the child's category in our group? A modifier can, for instance, be a preposition ("S") or an adjective ("A").

Let's run a few examples with the new frequencies table:

$ sqlite3
sqlite> .read lab13_extra.sql
sqlite> -- Example: See what arguments certain verbs take in
sqlite> SELECT * FROM frequencies WHERE word = "figurar";
figurar|VMI|MOD|S|1|0.0555555555555556
figurar|VMI|PP-LOC|S|8|0.444444444444444
figurar|VMI|SUBJ|NC|9|0.5
figurar|VMN|MOD|S|1|1.0

sqlite> -- Example: See what deps some words must have. "F" denotes punctuation
sqlite> SELECT word, category, dependency_type, dependent_category, count
   ...> FROM frequencies
   ...> WHERE frequency = 1 AND count > 5 AND dependent_category != "F";
autorizar|VMN|DO|NC|6
auxiliar|VMN|MOD|S|15
capaz|A|COMP|S|6
en_función_de|S|COMP|NC|6
este|DD|SPEC|DI|11
...

Question 4: The Most Likely Child

The next natural question to ask is: for a particular word, what kind of dependent should we expect to see? This question can actually help us identify certain Spanish constructions (if we're lucky) that are systematically different from English ones.

Write an appropriate select query for likeliest_child, whose columns are word, category, dependency_type, dependent_category, and the maximum frequency for each word-category pair under the column name max_frequency.

Conditions: We want to ignore deps that are just punctuation, so make sure dependent_category != "F" somewhere in the query. We also want the maximum frequency found in each word-category pair to be > 0.5 and the total number of the word-category pair occurrences to be > 10.

Hint: To reiterate, we are finding the maximum frequency for each word-category pair. How will we use GROUP BY to carry out this specific query?

Hint2: Which condition applies to each row? Which condition applies to each group? This then determines which condition(s) to put in WHERE and which to put in HAVING.

-- For each word-category pair, what is the most likely dependent category?
-- Dependent category cannot be punctuation (i.e. dependent_category != "F")
-- The observed frequency of the dependent should be greater than 0.5
-- The number of occurrences of the word-category pair should be greater than 10
CREATE TABLE likeliest_child AS
    -- REPLACE THIS LINE
    select 'YOUR CODE HERE';
    SELECT word, category, dependency_type, dependent_category, MAX(frequency) as max_frequency
    FROM frequencies
    WHERE dependent_category != "F"
    GROUP BY word, category
    HAVING MAX(frequency) > 0.5 AND SUM(count) > 10;

The following command in sqlite3 should yield the following output:

sqlite> .read lab13_extra.sql
sqlite> select word, category, dependency_type, dependent_category, ROUND(max_frequency, 5) from likeliest_child limit 20;
,|F|CONJ|NC|0.58108
ZD|Z|COMP|S|0.75
a|S|COMP|NC|0.78444
a_partir_de|S|COMP|NC|0.84
a_través_de|S|COMP|NC|0.75
abdomen|NC|SPEC|DA|0.64286
absceso|NC|MOD|A|0.58824
acceder|VMN|OBLC|S|0.75
actualidad|NC|SPEC|DA|0.73333
actualización|NC|COMP|S|0.58333
al|CS|COMP|VMN|0.95
alguno|PI|COMP|S|0.85714
ante|S|COMP|NC|0.65385
antebrazo|NC|SPEC|DA|0.7619
antes_de|S|COMP|NC|0.56
aorta|NC|MOD|A|0.53333
aplicable|A|COMP|S|0.83333
aplicar|VMN|DO|NC|0.52632
aprendiz|NC|MOD|S|0.72727
arteria|NC|MOD|A|0.55357

Use OK to test your solution:

python3 ok -q likeliest_child

We see a lot of common nouns ("NC") that have an article ("DA"), verbs ("V__") that take a direct object ("DO"), or prepositions ("S") that have noun complements. These examples are mostly similar to English. Are there others that surprise you?

In particular, I saw an interesting example: relacionar|VMP|OBLC|S|0.609756097560976

It turns out that relacionado ("related", "VMP" means participle) is used frequently with an oblique object "OBLC", namely an object denoted by con ("with").

Example: "La supervivencia de el hombre ha estado siempre estrechamente relacionada con el agua." Translation: "Human survival has always been closely related to water."

If you come up with other interesting queries, please feel free to post them on Piazza or discuss with your classmates/lab assistants/TAs!

Fin. (The end.)

Extensions

I had a lot of fun investigating Spanish, and I hope you did too. Here are some other extensions you can explore over your summer break:

For more practice building SQL tables from Python, you can record all the bigrams, trigrams, ..., and n-grams that appear in the corpus. A bigram consists of a pair of words that occur next to each other, e.g. relacionado con, as you have seen above. We can then see if there are some bigrams that occur more frequently than expected.
If you look at the original CONLL file, you will notice that we have mostly ignored morphological features, such as gender/number/person. In English, subject-verb agreement is important: The pie is great vs. *The pie are great. Spanish has both subject-verb agreement and noun-adjective-article agreement. For example, "el" changes to "las" for a feminine-plural noun: "las vacaciones". You can actually get a computer to logically deduce the agreement through a process called unification. Given a problem (el) vacaciones (venir, future tense) pronto, the attributes of "vacationes" determine that the sentence should be las vacaciones vendrán pronto.
You can write a program that learns from this hand-annotated training data and then try to generate dependencies for a new sentence.
Another useful data source for Spanish is the EuroParl dataset, which aligns translated Spanish sentences with the English transcription of the European Parliament proceedings. A very important subproblem for translation is the ability to align English phrases to Spanish phrases. Example: There are many people everywhere. In Spanish: Hay mucha gente por todas partes. A program must be able to identify that "There are" matches with "Hay", "many people" matches with "mucha gente", and "everywhere" matches with "por todas partes". Given the English parsing of the NLP lecture and the Spanish dependency grammar data here, is there a general way to merge the two systems for alignment?
Of course, Spanish is not the other language out there with interesting grammatical properties. You can explore in what ways German word order is really precise. Or you can write a program to divide a stream of Chinese text into phrases. And more. The internet has a lot of rich data sets to process and learn from.

Format Description

The attached narrative contains information about the CONLL file format and the convention used for dependency tags. The part of speech tagging convention can be found at this FreeLing page.

Lab 13: SQL II: Aggregation | Spanish Grammar

Starter Files

Submission

Table of Contents

Introduction

Exploring Spanish

Question 1

SQL Aggregation

Corpus Statistics

Question 2: Word Category Counts

Question 3: Noun Verbs

Extra Questions

Dependency Frequencies

Question 4: The Most Likely Child

Extensions

Format Description