What do people tweet?
Draw their feelings on a map
to discover trends
In this project, you will develop a geographic visualization of Twitter data across the USA. You will need to use data abstraction, sequences, and dictionaries to create a modular program. This project uses ideas from Sections 2.1, 2.2, 2.3, and 2.4.3 of the Composing Programs online textbook.
The map displayed above depicts how the people in different states feel about California. This image is generated by:

1. Collecting a set of tweets tagged with geographic locations that contain a given term (here, 'cali'),
2. Assigning a sentiment value to each tweet based on the words it contains,
3. Grouping the tweets by the state whose center is closest, and
4. Coloring each state by the average sentiment of its tweets.
The details of how to conduct each of these steps are contained within the project description. By the end of this project, you will be able to map the sentiment of any word or phrase. The trends.zip archive contains all the starter code and a small set of data.
The project uses several files, but all of your changes will be made to the first one.
| File | Description |
|------|-------------|
| trends.py | A starter implementation of the main project file |
| geo.py | Geographic positions, 2-D projection equations, and geographic distance functions |
| maps.py | Functions for drawing maps |
| data.py | Functions for loading Twitter data from files |
| graphics.py | A simple Python graphics library |
| ucb.py | Utility functions for CS 61A |
| ok | CS 61A autograder |
| tests | A directory of tests used by ok |
The data directory contains all the data files needed for the project and is required to run it. The images directory contains the correct maps that your program should produce by the end of the project for the given terms. There is also a larger dataset that you can use once you are done with the project.
This is a one-week project. You may work with one partner only.
Start early! Feel free to ask for help early and often. The course staff is here to assist you, but we can't help everyone an hour before the deadline. Piazza and office hours await. You are not alone!
There are 15 possible points (12 for correctness and 3 for composition). You may earn one bonus point for submitting at least 24 hours before the deadline.
You only need to submit the file `trends.py`. You do not need to modify any other files for this project. To submit the project, change to the directory where the `trends.py` file is located and run `submit proj2`.
Throughout this project, you should be testing the correctness of your code. It is good practice to test often, so that it is easy to isolate any problems.
We have provided an autograder called `ok` to help you with testing your code and tracking your progress. The first time you run the autograder, you will be asked to log in with your @berkeley.edu account using your web browser. Please do so. Each time you run `ok`, it will back up your work and progress on our servers.
The primary purpose of `ok` is to test your implementations, but there is a catch. At first, the test cases are locked. To unlock tests, run the following command from your terminal:
python3 ok -u
When you see a `?` prompt, type what you expect the output to be based on the description. If you are correct, then this test case will be available the next time you run the autograder.
Once you have unlocked some tests and written some code, you can check the correctness of your program using the tests that you have unlocked:
python3 ok
To help with debugging, `ok` can also be run in interactive mode. When an error occurs in interactive mode, the autograder will start an interactive Python session in the environment used for the test, so that you can explore the state of the environment:
python3 ok -i
Most of the time, you will want to focus on a particular question. Use the `-q` option as directed in the problems below.
The `tests` directory is used to store autograder tests, so make sure not to modify it. You may lose all your unlocking progress if you do. If you need a fresh copy, you can download the zip archive and copy it over, but you will need to start unlocking from scratch.
If you have any problems logging in or communicating with the server, use the `--local` flag to inhibit any server communication.
The project also includes a large number of doctests. You can check that they pass as well:
python3 -m doctest *.py
In this phase, you will create a data abstraction for Tweets, split the text of a tweet into words, and calculate the amount of positive or negative feeling in a tweet.
First, we will define a data abstraction for Tweets. To ensure that we do not violate abstraction barriers later in the project, we will create two different representations:
(A) The constructor `make_tweet` returns a Python list with the following items:

- `text`: a string, the text of the tweet, all in lowercase
- `time`: a `datetime` object, when the tweet was posted
- `lat`: a float, the latitude of the tweet's location
- `lon`: a float, the longitude of the tweet's location

(B) The alternate constructor `make_tweet_fn` returns a function that takes a string argument that is one of the keys above and returns the corresponding value.
Implement the missing selector and constructor functions for these two representations: `tweet_text`, `tweet_time`, and `tweet_location` correspond to representation (A); `make_tweet_fn` corresponds to representation (B).

For `tweet_location`, you should return a `position`. The constructors and selectors for this data abstraction can be found in `geo.py`.

The two representations created by `make_tweet` and `make_tweet_fn` do not need to work together, but each constructor should work with its corresponding selectors.
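As a point of reference, here is a minimal sketch of what the two representations might look like, assuming the list layout `[text, time, lat, lon]` and the `make_position` constructor from `geo.py`; your own representation may differ as long as each constructor agrees with its selectors.

```python
from geo import make_position

def make_tweet(text, time, lat, lon):
    """Representation (A): a four-element list."""
    return [text, time, lat, lon]

def tweet_text(tweet):
    return tweet[0]

def tweet_time(tweet):
    return tweet[1]

def tweet_location(tweet):
    # Package the raw coordinates into a position abstract data type
    return make_position(tweet[2], tweet[3])

def make_tweet_fn(text, time, lat, lon):
    """Representation (B): a function from keys to values."""
    def tweet(key):
        return {'text': text, 'time': time, 'lat': lat, 'lon': lon}[key]
    return tweet
```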
As with project 1, you will need to unlock the tests first before using them:
python3 ok -q 1 -u
python3 ok -q 1
Improve the `extract_words` function as follows: assume that a word is any consecutive substring of `text` that consists only of ASCII letters. The string `ascii_letters` in the `string` module contains all letters in the ASCII character set. The `extract_words` function should list all such words in order and nothing else.
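One simple approach, sketched below, is to replace every non-letter character with a space and then split on whitespace; other approaches work too.

```python
from string import ascii_letters

def extract_words(text):
    """Return the words in text as a list, where a word is a maximal
    consecutive run of ASCII letters."""
    # Replace each non-letter with a space, then split on whitespace
    cleaned = ''.join(c if c in ascii_letters else ' ' for c in text)
    return cleaned.split()
```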
Test your implementation before moving on:
python3 ok -q 2 -u
python3 ok -q 2
Some words are associated with positive or negative sentiment, but most are not. The sentiment of some individual words, judged by a group of people, can be found in the `data/sentiments.csv` text file.
Implement the `sentiment` data abstraction, which represents a sentiment value that may or may not exist. The constructor `make_sentiment` takes either a numeric value within the interval -1 to 1, or `None` to indicate that the value does not exist. Implement the selectors `has_sentiment` and `sentiment_value` as well. You may use any representation you choose, but the rest of your program should not depend on this representation.
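One of many valid representations is to store the value itself, as in the sketch below; any representation works as long as the selectors match the constructor.

```python
def make_sentiment(value):
    """Return a sentiment, which represents a value that may not exist.

    value is either None or a number in the interval [-1, 1].
    """
    assert value is None or -1 <= value <= 1, 'Bad sentiment value'
    return value  # Represent a sentiment by the value itself

def has_sentiment(s):
    return s is not None

def sentiment_value(s):
    assert has_sentiment(s), 'No sentiment value'
    return s
```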
Test your implementation before moving on:
python3 ok -q 3 -u
python3 ok -q 3
You can experiment using the `-p` flag, which calls the `print_sentiment` function to print the sentiment values of all sentiment-carrying words in a line of text.
python3 trends.py -p computer science is my favorite!
python3 trends.py -p life without lambda: awful or awesome?
Implement `analyze_tweet_sentiment`, which takes a `tweet` and returns a `sentiment`. Read the docstrings for `get_word_sentiment` and `analyze_tweet_sentiment` in `trends.py` to understand how the two functions interact. Your implementation should not depend on the representation of a sentiment or a tweet!

The `tweet_words` function should prove useful here: it combines the `tweet_text` selector and `extract_words` function from the previous questions to return a list of words in a tweet.
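As a sketch, assuming the docstrings specify averaging the sentiment values of the tweet's sentiment-carrying words (and returning a no-value sentiment when there are none), one implementation might look like this:

```python
def analyze_tweet_sentiment(tweet):
    """Return a sentiment for the tweet: the average sentiment value of its
    sentiment-carrying words, or a sentiment with no value if none exist."""
    values = []
    for word in tweet_words(tweet):
        s = get_word_sentiment(word)
        if has_sentiment(s):
            values.append(sentiment_value(s))
    average = sum(values) / len(values) if values else None
    return make_sentiment(average)
```

Notice that the body uses only constructors and selectors, never the underlying representations of tweets or sentiments.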
Test your implementation before moving on:
python3 ok -q 4 -u
python3 ok -q 4
In this phase, we will implement two functions that together determine the centers of U.S. states. The shape of a state is represented as a list of polygons. Some states (e.g. Hawaii) consist of multiple polygons, but most states (e.g. Colorado) consist of only one polygon (represented as a length-one list of polygons).
Implement `find_centroid`, which takes a polygon and returns three values: the coordinates of its centroid and its area. The input polygon is represented as a list of `position` values that are consecutive vertices of its perimeter. The first vertex is always identical to the last.

The centroid of a two-dimensional shape is its center of balance, defined as the intersection of all straight lines that evenly divide the shape into equal-area halves. The `find_centroid` function returns the centroid coordinates and area of an individual polygon.
The formula for computing the centroid of a polygon appears on Wikipedia. The formula relies on vertices being consecutive (either clockwise or counterclockwise; both give the same answer), a property that you may assume always holds for the input.
Hint: latitudes correspond to the `x` values, and longitudes correspond to the `y` values.

The area of a polygon is never negative. Depending on how you compute the area, you may need to use the built-in `abs` function to return a non-negative number.

Manipulate positions using their selectors (`latitude` and `longitude`) rather than assuming a particular representation.
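A sketch of the standard shoelace formulas follows, assuming the `latitude` and `longitude` selectors from `geo.py`; it does not handle degenerate polygons of zero area.

```python
from geo import latitude, longitude

def find_centroid(polygon):
    """Return the centroid (x, y) and area of a polygon, where polygon is a
    list of positions whose first and last vertices are identical.

    Shoelace formulas (signed area A):
        A  = (1/2) * sum(x[i]*y[i+1] - x[i+1]*y[i])
        Cx = (1/(6A)) * sum((x[i] + x[i+1]) * (x[i]*y[i+1] - x[i+1]*y[i]))
        Cy = (1/(6A)) * sum((y[i] + y[i+1]) * (x[i]*y[i+1] - x[i+1]*y[i]))
    """
    xs = [latitude(p) for p in polygon]
    ys = [longitude(p) for p in polygon]
    a = cx = cy = 0
    for i in range(len(polygon) - 1):
        cross = xs[i] * ys[i+1] - xs[i+1] * ys[i]
        a += cross
        cx += (xs[i] + xs[i+1]) * cross
        cy += (ys[i] + ys[i+1]) * cross
    a = a / 2
    # Dividing by the signed area gives the correct centroid for either
    # vertex orientation; the returned area itself must be non-negative.
    return cx / (6 * a), cy / (6 * a), abs(a)
```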
Test your implementation before moving on:
python3 ok -q 5 -u
python3 ok -q 5
Implement `find_state_center`, which takes a state represented by a list of polygons and returns a `position`, its centroid.
The centroid of a collection of polygons can be computed by geometric decomposition. The centroid of a shape is the weighted average of the centroids of its component polygons, weighted by their area.
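For instance, a minimal sketch of this weighted average, assuming `find_centroid` from the previous problem and `make_position` from `geo.py`:

```python
from geo import make_position

def find_state_center(polygons):
    """Return the geographic center of a state (a list of polygons) as a
    position: the area-weighted average of the polygon centroids."""
    total_area = x_sum = y_sum = 0
    for polygon in polygons:
        x, y, area = find_centroid(polygon)
        # Weight each centroid by the area of its polygon
        x_sum += x * area
        y_sum += y * area
        total_area += area
    return make_position(x_sum / total_area, y_sum / total_area)
```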
Test your implementation before moving on:
python3 ok -q 6 -u
python3 ok -q 6
Once you are finished, `draw_centered_map` will draw the 10 states closest to a given state (including that state). A red dot should appear over the two-letter postal code of the specified state.
python3 trends.py -d CA
Your program should work identically, even if you use the functional representation for tweets defined in problem 1, using the `-f` flag.
python3 trends.py -f -d CA
In this phase, you will group tweets by their nearest state center and calculate the average positive or negative feeling in all the tweets associated with a state.
The name `us_states` is bound to a dictionary containing the shape of each U.S. state, keyed by its two-letter postal code.
Implement `group_tweets_by_state`, which takes a sequence of tweets and returns a dictionary. The keys of the returned dictionary are state names (two-letter postal codes), and the values are lists of tweets that appear closer to that state's center than any other.
If a state does not have any tweets, you should not include it in the returned dictionary.
Hint: You may find the `group_by_key` and built-in `min` functions useful, as in the sketch below. You may also want to define additional functions to organize your implementation into modular components.
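Here is a sketch of one possible structure, assuming `geo_distance` from `geo.py` and a `group_by_key` function that turns a list of `[key, value]` pairs into a dictionary of lists:

```python
def group_tweets_by_state(tweets):
    """Return a dictionary mapping each two-letter postal code to the list
    of tweets closer to that state's center than to any other."""
    # Compute each state's center once, keyed by postal code
    centers = {code: find_state_center(us_states[code]) for code in us_states}

    def closest_state(tweet):
        loc = tweet_location(tweet)
        return min(centers, key=lambda code: geo_distance(loc, centers[code]))

    # Pair each tweet with its closest state, then group the pairs by key
    return group_by_key([[closest_state(t), t] for t in tweets])
```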
Optional: The `group_by_key` function is slow because it traverses the list of pairs one time for each key. Can you improve it so that it considers each pair only once?
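One single-pass version, as a sketch:

```python
def group_by_key(pairs):
    """Return a dictionary mapping each key to a list of its paired values,
    visiting each [key, value] pair exactly once."""
    grouped = {}
    for key, value in pairs:
        if key not in grouped:
            grouped[key] = []
        grouped[key].append(value)
    return grouped
```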
Test your implementation before moving on:
python3 ok -q 7 -u
python3 ok -q 7
Implement `average_sentiments`. This function takes the dictionary returned by `group_tweets_by_state` and also returns a dictionary. The keys of the returned dictionary are the state names (two-letter postal codes), and the values are the average sentiment value of all the tweets in that state that have a sentiment value.

If a state has no tweets with sentiment values, leave it out of the returned dictionary entirely. Do not represent a state with no sentiment by a zero sentiment value: zero represents neutral sentiment, not unknown sentiment. States with unknown sentiment will appear gray, while states with neutral sentiment will appear white.
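A sketch of one implementation, built on the abstractions from Phase 1:

```python
def average_sentiments(tweets_by_state):
    """Return a dictionary from state names to the average sentiment value of
    that state's tweets, omitting states with no sentiment-bearing tweets."""
    averaged = {}
    for state, tweets in tweets_by_state.items():
        sentiments = [analyze_tweet_sentiment(t) for t in tweets]
        values = [sentiment_value(s) for s in sentiments if has_sentiment(s)]
        if values:  # Omit states with no sentiment values entirely
            averaged[state] = sum(values) / len(values)
    return averaged
```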
Test your implementation before moving on:
python3 ok -q 8 -u
python3 ok -q 8
You should now be able to draw maps that are colored by sentiment corresponding to tweets that contain a given term. The correct map for 'cali' appears at the top of this page.
python3 trends.py -m cali
python3 trends.py -m movie
python3 trends.py -m shopping
python3 trends.py -m "high school"
Your program should work identically, even if you use the functional representation for tweets defined in question 1, using the -f flag.
python3 trends.py -f -m cali
Finally, you can download a larger dataset once you are done with your project. After extracting the archive, you can move `tweets2011.txt` and `tweets2014.txt` to the `data` directory.
Warning: this dataset is 153 MB in zipped form. If you would rather not download the files, you can copy your Trends project onto your class account and run the following there:
cd trends/data
setup-tweets
If you run your project from your class account, make sure to use the `-X` flag with `ssh` (on Macs or Linux) or enable XMing (on Windows) so you can see the graphics!
Note: as stated in the accompanying `README.txt`, the dataset is intended solely for use with this project. Contents of `tweets2014.txt` may not be redistributed or made public (e.g., on a version-control repository). After setting up the new tweets in your `data` directory, you can use the `-m` flag above to search for more phrases, and the `-t` flag to specify the data file, like the following:
python3 trends.py -m christmas -t tweets2011.txt
python3 trends.py -m christmas -t tweets2014.txt
Congratulations! One more 61A project completed.
These extensions are optional and ungraded. In this class, you are welcome to program just for fun. If you build something interesting, come to office hours and give us a demo. However, please do not change the behavior or signature of the functions you have already implemented.
- Improve the efficiency of `group_tweets_by_state`.
- Write a function `draw_map_by_hour` that visualizes the tweets that were posted during each hour of the day. For example, you'll discover that "sandwich" tweets appear most positive at 10:00pm: late night snack! :-)
- Extend the sentiment analyzer to handle emoticons, assigning a positive sentiment to happy ones and negative sentiment to sad ones.
- Write `find_containing_state`, which finds the state that actually contains a tweet position.
- The `graphics.py` package supports animation. Use the `slide_shape` method to have states and dots slide into place.
- Build a visualization of your own (check out `draw_most_talkative_states`, then use it as a foundation and modify as needed).

Acknowledgements: Aditi Muralidharan developed this project with John DeNero. Hamilton Nguyen extended it. Kaylee Mann developed the autograder. Many others have contributed as well.