Due at 11:59pm on 4/29/2015.
Download lab13.zip. Inside the archive, you will find starter files for the questions in this lab, along with a copy of the OK autograder.
By the end of this lab, you should have submitted the lab with
python3 ok --submit
. You may submit more than once before the
deadline; only the final submission will be graded.
To receive credit for this lab, you must complete the questions in lab13.sql.
In this lab, you will use a subset of the IULA Spanish LSP Treebank developed by the Institut Universitari de Lingüística Aplicada at the Universitat Pompeu Fabra.
This treebank consists of several sentences from technical documents that have been annotated with the syntactic category of each word (e.g., noun or verb), along with all of its syntactic dependents. For instance, in the sentence "I have a computer," The words "I" and "computer" are dependents of the verb "have", and "a" is a dependent of the word "computer". Each dependent has a type, such as subj for the subject of a verb. The exact definition of a syntactic dependent is debated in linguistics, but a reasonable and consistent convention was used to annotate this dataset.
The lab13.py
file generates a table from the data_spanish.conll
text
file, which describes a collection of annotated sentence. You need to run
python3 lab13.py
to generate a single table called deps
that has five columns. Each row
describes one dependent of a word in a sentence. The columns are:
word
- the base form of a wordcategory
- the syntactic category of a worddependent_word
- the base form of the dependent worddependent_category
- the syntactic category of the dependent worddependency_type
- the type of dependency between the word and its dependentTo check that you have successfully created this table, run
python3 ok -q generate_table
To use the deps
table, first open the spanish.db
file from sqlite3. Keep the
sqlite interpreter running to run our examples throughout the lab.
$ sqlite3
sqlite> .open spanish.db
sqlite> select word from deps where dependent_word = "caliente";
agua
agua
agua
plato
muestra
Nouns in Spanish are masculine or feminine. Most nouns that end in "a" are feminine, but some (mostly with Greek origin) are masculine, such as "problema" (problem) or "sistema" (system).
The articles "el" and "la" both mean "the", but the first is used with masculine nouns while the second is used with feminine nouns.
Create a table called masculine_a
full of masculine nouns that end in "a".
You can tell if a word
ends in "a" using the expression substr(word, -1) =
"a"
. You can tell if a word
is a masculine noune if "el" is a
dependent_word
.
-- A table containing words that are masculine but end in "a"
CREATE TABLE masculine_a as
-- REPLACE THIS LINE
select 'YOUR CODE HERE';
select word from deps where dependent_word = "el" and substr(word, -1) = "a";
Note: Finding all words ending in "a" that have "el" as a dependent will not find every masculine noun ending in "a" in Spanish. Also, some nouns that start with "a", such as "agua", are feminine but nonetheless take "el" as an article.
The following command in sqlite3 should yield the following output:
sqlite> .read lab13.sql
sqlite> select * from masculine_a limit 5;
energía
estrategia
energía
agricultura
clima
Check your work before proceeding.
python3 ok -q masculine
Previously, we have been dealing with queries that process one row at a time.
When we join, we make pairwise combinations of all of the rows. When we use
WHERE
, we filter out certain rows based on the condition. Alternatively,
applying an aggregate function
such as MAX(column)
combines the values in multiple rows.
By default, all rows are combined together. What if we wanted to group together
the values in similar rows and perform the aggregation operations within those
groups? We use a GROUP BY
clause.
Here's an example. For every unique value in category, collect all the rows having that value into a group. Then, provide that category and count the number of rows that have that category. The word "poder" is both a noun meaning "power" and a verb meaning "be able to" do something. We can see that it is usually a verb.
sqlite> SELECT category, count(*) FROM deps where word="poder" GROUP BY category;
NC|26
VMG|2
VMI|1362
VMN|7
VMP|16
VMS|6
Just like how we can filter out rows with WHERE
, we can also filter out
groups with HAVING
. Important: A HAVING
clause should use an aggregate
function. Suppose we want to see all categories with at least 100 rows:
sqlite> SELECT category, COUNT(*) FROM deps GROUP BY category HAVING COUNT(*) > 100;
A|5310
CC|3955
CS|526
DA|138
F|166
NC|63980
NP|3076
PI|125
R|536
S|25693
VAI|512
VMG|183
VMI|10291
VMM|207
VMN|2274
VMP|2584
VMS|383
VSI|2978
VSN|178
Z|933
This query discards rare categories such as Demonstrative Determiner ("DD"), but keeps common ones such as Coordinating Conjunction ("CC").
Please edit lab13.sql for the following two questions.
Write a query that counts the number of appearances of a particular
word-category pair in the dataset. The new table word_cat_count
should have
three columns: word
, category
, and count
.
How should you group the rows together?
Hint: To group based on information from multiple columns, not just one,
simply add commas between the columns: GROUP BY school, major
.
-- A table containing each word, category, and count of that group
CREATE TABLE word_cat_count AS
-- REPLACE THIS LINE
select 'YOUR CODE HERE';
SELECT word, category, COUNT(*) as count
FROM deps
GROUP BY word, category;
Use OK to test your solution before moving on:
python3 ok -q word_cat_count
What are all the rows in word_cat_count
involving the word "ser"?
$ sqlite3
sqlite> .read lab13.sql
sqlite> SELECT * FROM word_cat_count WHERE word = "ser";
ser|NC|15
ser|VSG|8
ser|VSI|2978
ser|VSN|178
ser|VSP|71
ser|VSS|78
So, "ser" is a verb meaning "to be", e.g. "Soy de California" means "I am from
California." We expect to see a lot of verb (V__
) tags in this result.
But wait, there are also cases where "ser" is used as a noun (NC
)!
¡Increíble! It turns out that "ser" can also mean "being", for instance
ser humano
means "human being".
Write a query that finds all words that can appear as both nouns and verbs.
Hint: Use the word_cat_count
table you just wrote.
The resulting rows shoud look like the following:
deber|NC|17|deber|VMG|2
Hint: No aggregation is required here because we need to compare individual rows, not aggregate statistics/concatenations of word group. This question is a review for joins.
Hint: Use select *
to select all columns from a joined table.
Hint: Nouns are always NC
, but verbs can be many different categories,
such as "VMI" (main indicative verb) or "VAS" (auxiliary subjunctive verb). To
capture all the possinge strings, use the expression verb LIKE "V__"
instead
of verb = "VMI" AND verb = "VMP" AND ...
. The underscores match any
character. LIKE
is how SQL searches strings, instead of just looking for
exact matches.
-- A table containing all info of words that can be both nouns and verbs
CREATE TABLE noun_verbs AS
-- REPLACE THIS LINE
select 'YOUR CODE HERE';
SELECT * FROM word_cat_count as n, word_cat_count as v
WHERE n.word = v.word AND n.category == "NC" AND v.category LIKE "V__";
The following command in sqlite3 should yield the following output:
sqlite> .read lab13.sql
sqlite> select * from noun_verbs limit 10;
deber|NC|17|deber|VMG|2
deber|NC|17|deber|VMI|407
deber|NC|17|deber|VMN|1
deber|NC|17|deber|VMP|18
deber|NC|17|deber|VMS|4
despertar|NC|3|despertar|VMN|2
haber|NC|2|haber|VAI|2
haber|NC|2|haber|VMI|149
haber|NC|2|haber|VMS|2
poder|NC|26|poder|VMG|2
Use OK to test your solution before moving on:
python3 ok -q noun_verbs
It turns out that haber
("to have (aux)"), like ha comido
("has eaten")
can also appear as a noun!
The following questions are for extra practice — they can be found in the lab13_extra.sql file. It is recommended that you complete these problems as well, but you do not need to turn them in for credit.
We are now ready to compute a big frequency table in lab13_extra.sql. The code for frequencies
is provided for
you. In particular, we group by the word, category, child dependency, and child
category. Why did we consider the child's category in our group? A modifier
can, for instance, be a preposition ("S") or an adjective ("A").
Let's run a few examples with the new frequencies
table:
$ sqlite3
sqlite> .read lab13_extra.sql
sqlite> -- Example: See what arguments certain verbs take in
sqlite> SELECT * FROM frequencies WHERE word = "figurar";
figurar|VMI|MOD|S|1|0.0555555555555556
figurar|VMI|PP-LOC|S|8|0.444444444444444
figurar|VMI|SUBJ|NC|9|0.5
figurar|VMN|MOD|S|1|1.0
sqlite> -- Example: See what deps some words must have. "F" denotes punctuation
sqlite> SELECT word, category, dependency_type, dependent_category, count
...> FROM frequencies
...> WHERE frequency = 1 AND count > 5 AND dependent_category != "F";
autorizar|VMN|DO|NC|6
auxiliar|VMN|MOD|S|15
capaz|A|COMP|S|6
en_función_de|S|COMP|NC|6
este|DD|SPEC|DI|11
...
The next natural question to ask is: for a particular word, what kind of dependent should we expect to see? This question can actually help us identify certain Spanish constructions (if we're lucky) that are systematically different from English ones.
Write an appropriate select query for likeliest_child
, whose columns are
word
, category
, dependency_type
, dependent_category
, and the maximum frequency
for each word-category pair under the column name max_frequency
.
Conditions: We want to ignore deps that are just punctuation, so make sure
dependent_category != "F"
somewhere in the query.
We also want the maximum frequency found in each word-category pair to be > 0.5
and the total number of the word-category pair occurrences to be > 10.
Hint: To reiterate, we are finding the maximum frequency for each word-category
pair. How will we use GROUP BY
to carry out this specific query?
Hint2: Which condition applies to each row? Which condition applies to each
group? This then determines which condition(s) to put in WHERE
and which to
put in HAVING
.
-- For each word-category pair, what is the most likely dependent category?
-- Dependent category cannot be punctuation (i.e. dependent_category != "F")
-- The observed frequency of the dependent should be greater than 0.5
-- The number of occurrences of the word-category pair should be greater than 10
CREATE TABLE likeliest_child AS
-- REPLACE THIS LINE
select 'YOUR CODE HERE';
SELECT word, category, dependency_type, dependent_category, MAX(frequency) as max_frequency
FROM frequencies
WHERE dependent_category != "F"
GROUP BY word, category
HAVING MAX(frequency) > 0.5 AND SUM(count) > 10;
The following command in sqlite3 should yield the following output:
sqlite> .read lab13_extra.sql
sqlite> select word, category, dependency_type, dependent_category, ROUND(max_frequency, 5) from likeliest_child limit 20;
,|F|CONJ|NC|0.58108
ZD|Z|COMP|S|0.75
a|S|COMP|NC|0.78444
a_partir_de|S|COMP|NC|0.84
a_través_de|S|COMP|NC|0.75
abdomen|NC|SPEC|DA|0.64286
absceso|NC|MOD|A|0.58824
acceder|VMN|OBLC|S|0.75
actualidad|NC|SPEC|DA|0.73333
actualización|NC|COMP|S|0.58333
al|CS|COMP|VMN|0.95
alguno|PI|COMP|S|0.85714
ante|S|COMP|NC|0.65385
antebrazo|NC|SPEC|DA|0.7619
antes_de|S|COMP|NC|0.56
aorta|NC|MOD|A|0.53333
aplicable|A|COMP|S|0.83333
aplicar|VMN|DO|NC|0.52632
aprendiz|NC|MOD|S|0.72727
arteria|NC|MOD|A|0.55357
Use OK to test your solution:
python3 ok -q likeliest_child
We see a lot of common nouns ("NC") that have an article ("DA"), verbs ("V__") that take a direct object ("DO"), or prepositions ("S") that have noun complements. These examples are mostly similar to English. Are there others that surprise you?
In particular, I saw an interesting example:
relacionar|VMP|OBLC|S|0.609756097560976
It turns out that relacionado
("related", "VMP" means participle) is used
frequently with an oblique object "OBLC", namely an object denoted by con
("with").
Example: "La supervivencia de el hombre ha estado siempre estrechamente relacionada con el agua." Translation: "Human survival has always been closely related to water."
If you come up with other interesting queries, please feel free to post them on Piazza or discuss with your classmates/lab assistants/TAs!
Fin. (The end.)
I had a lot of fun investigating Spanish, and I hope you did too. Here are some other extensions you can explore over your summer break:
relacionado con
, as you have seen above. We can then see if there are some
bigrams that occur more frequently than expected.(el) vacaciones (venir, future tense) pronto
, the
attributes of "vacationes" determine that the sentence should be
las vacaciones vendrán pronto
.The attached narrative contains information about the CONLL file format and the convention used for dependency tags. The part of speech tagging convention can be found at this FreeLing page.