# Extra Homework 2

*Due by 11:59pm on Friday, 3/29*

## Instructions

Download `extra02.zip`. Inside the archive, you will find a file called `extra02.py`, along with a copy of the Ok autograder.

**Submission:** When you are done, submit with `python3 ok --submit`. You may submit more than once before the deadline; only the final submission will be scored.

## Using Ok

The `ok` program helps you test your code and track your progress.
The first time you run the autograder, you will be asked to log in with your
@berkeley.edu account using your web browser. Please do so. Each time you run
ok, it will back up your work and progress on our servers.
You can run all the doctests with the following command:

`python3 ok`

To test a specific question, use the `-q` option with the name of the function:

`python3 ok -q <function>`

By default, only tests that **fail** will appear. If you want to see how you did on all tests, you can use the `-v` option:

`python3 ok -v`

If you do not want to send your progress to our server or you have any problems logging in, add the `--local` flag to block all communication:

`python3 ok --local`

When you are ready to submit, run `ok` with the `--submit` option:

`python3 ok --submit`

**Readings:** You might find the following references
useful:

## Huffman Encoding Trees

### Introduction

We consider the problem of representing text as a sequence of ones and zeros (bits). As a simpler example, consider encoding an alphabet consisting of only `A`, `B`, `C`, `D`, `E`, `F`, `G`, and `H`. We can choose a code with three bits per character. For example:

```
A 000
B 001
C 010
D 011
E 100
F 101
G 110
H 111
```

With this code, the 18-character message:

`BACADAEAFABBAAAGAH`

is encoded as the string of 54 bits:

`001 000 010 000 011 000 100 000 101 000 001 001 000 000 000 110 000 111`

Codes such as the `A`-through-`H` code above (and ASCII) are known as **fixed-length codes**, because they represent each symbol in the message with the same number of bits. It is sometimes advantageous to use **variable-length codes**, in which different symbols may be represented by different numbers of bits. If our messages are such that some symbols appear very frequently and some very rarely, we can encode data more efficiently (i.e., using fewer bits per message) if we assign shorter codes to the frequent symbols. Consider the following alternative code for the letters `A` through `H`:

```
A 0
B 100
C 1010
D 1011
E 1100
F 1101
G 1110
H 1111
```

With this code, the same message as above is encoded as the string:

`100 0 1010 0 1011 0 1100 0 1101 0 100 100 0 0 0 1110 0 1111`

This string contains 42 bits, so it saves more than 20% in space in comparison with the fixed-length code shown above.
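The savings can be checked directly. Here is a minimal standalone sketch; the two dictionaries are just the code tables above written out in Python:

```python
# Fixed-length code: each letter gets its index as a 3-bit binary string.
fixed = {c: format(i, '03b') for i, c in enumerate('ABCDEFGH')}
# Variable-length code, copied from the table above.
variable = {'A': '0', 'B': '100', 'C': '1010', 'D': '1011',
            'E': '1100', 'F': '1101', 'G': '1110', 'H': '1111'}

message = 'BACADAEAFABBAAAGAH'
fixed_bits = ''.join(fixed[c] for c in message)
variable_bits = ''.join(variable[c] for c in message)
print(len(fixed_bits), len(variable_bits))  # 54 42
```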

One of the difficulties of using a variable-length code is knowing when you have reached the end of a symbol in reading a sequence of zeros and ones. One solution is to design the code in such a way that no complete code for any symbol is the beginning (or prefix) of the code for another symbol. Such a code is called a **prefix code**. In the example above, `A` is encoded by `0` and `B` is encoded by `100`, so no other symbol can have a code that begins with `0` or with `100`.
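The prefix property can be checked mechanically. A small sketch over the variable-length code above:

```python
# The eight codewords of the variable-length code above.
codes = ['0', '100', '1010', '1011', '1100', '1101', '1110', '1111']

# Prefix code: no codeword is a proper prefix of another codeword.
is_prefix_code = not any(a != b and b.startswith(a)
                         for a in codes for b in codes)
print(is_prefix_code)  # True
```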

In general, we can attain significant savings if we use
variable-length prefix codes that take advantage of the relative
frequencies of the symbols in the messages to be encoded. One
particular scheme for doing this is called **Huffman encoding**, after
its discoverer, David Huffman. A Huffman code can be represented as a
binary tree whose leaves are the symbols that are encoded.

Each symbol at a leaf is assigned a weight (its frequency), and each non-leaf node contains a weight that is the sum of all the weights of the leaves lying below it. The weights will be used to help construct the tree.

The figure below shows the Huffman encoding tree for the `A`-through-`H` prefix code given above. The weights at the leaves indicate that the tree was designed for messages in which `A` appears with frequency 8, `B` with frequency 3, and the other letters each with frequency 1.

### Decoding

To decode a bit sequence using a Huffman tree, we begin at the root and use the successive zeros and ones of the bit sequence to determine whether to move down the left or the right branch. Each time we come to a leaf, we have generated a new symbol in the message, at which point we start over from the root of the tree to find the next symbol.

For example, suppose we are given the tree above and the sequence

`10001111`

Starting at the root, we move down the right branch (since the first bit of the string is 1), then down the left branch (since the second bit is 0), then down the left branch (since the third bit is also 0). This brings us to the leaf `B`, so the first symbol of the decoded message is `B`.

Now we start again at the root, and we make a left move because the next bit in the string is 0. This brings us to the leaf `A`. Then we start again at the root with the rest of the string, 1111, so we move right, right, right, right and reach `H`. Thus, the entire message is `BAH`.
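The walk just described can be sketched independently of the homework's `tree` data abstraction. This uses a hypothetical toy representation: a leaf is a letter string, and an internal node is a `(left, right)` tuple mirroring the tree above:

```python
# Toy tree for the A-through-H code: A=0, B=100, C=1010, ..., H=1111.
toy = ('A', (('B', ('C', 'D')), (('E', 'F'), ('G', 'H'))))

def toy_decode(tree, bits):
    """Decode BITS by walking TREE, restarting at the root at each leaf."""
    message, node = '', tree
    for bit in bits:
        node = node[bit]               # 0 moves left, 1 moves right
        if isinstance(node, str):      # reached a leaf: emit its letter
            message += node
            node = tree                # start over for the next symbol
    return message

print(toy_decode(toy, [1, 0, 0, 0, 1, 1, 1, 1]))  # BAH
```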

### Encoding

Given a Huffman tree, we can enumerate all encodings by traversing the tree. That is, we can write a function that takes the Huffman tree as input and returns a dictionary that maps letters (e.g. `A`) to codes (e.g. `0`). Then, encoding a message involves concatenating the code for each letter in the message.
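Such a traversal can be sketched with a toy nested-pair representation (a leaf is a letter string, an internal node a `(left, right)` tuple), rather than the homework's `tree` abstraction:

```python
def toy_encodings(node, prefix=()):
    """Map each leaf letter to its bit list (0 = left, 1 = right)."""
    if isinstance(node, str):            # leaf: its code is the path so far
        return {node: list(prefix)}
    left, right = node
    return {**toy_encodings(left, prefix + (0,)),
            **toy_encodings(right, prefix + (1,))}

toy = ('A', ('B', 'C'))                  # A=0, B=10, C=11
print(toy_encodings(toy))  # {'A': [0], 'B': [1, 0], 'C': [1, 1]}
```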

### Generating Huffman Encoding Trees

Given an alphabet of symbols and their relative frequencies, how do we construct the tree that will encode messages with the fewest bits? Huffman gave an algorithm for doing this and showed that the resulting code is indeed the optimal variable-length code (for messages where the frequency of the symbols to be encoded matches the frequencies with which the Huffman tree was constructed). You can read more about the story of how Huffman encoding came to be here.

The algorithm for generating a Huffman tree is very simple. The idea is to arrange the tree so that the symbols with the lowest frequency appear farthest away from the root. Start with the set of leaf nodes, containing symbols and their frequencies in some initial data. Now, find two nodes with the smallest weights and merge them to produce a new node, with these two nodes as its left and right branches. The weight of the new node is the sum of the two weights. Remove the two leaves from the original set and replace them with this new node.

Repeat. At each step, merge two nodes with the smallest weights, removing them from the set and replacing them with a node that has these two as its left and right branches. The process stops when there is only one node left, which is the root of the entire Huffman tree.

Wikipedia has an excellent animation of this process.

Here is how the previous example Huffman tree is generated, where trees are described by the set of letters they contain, along with their weight:

Initial leaves:

`{(A 8) (B 3) (C 1) (D 1) (E 1) (F 1) (G 1) (H 1)}`

Trees after...

Step 1: `{(A 8) (B 3) ({C D} 2) (E 1) (F 1) (G 1) (H 1)}`

Step 2: `{(A 8) (B 3) ({C D} 2) ({E F} 2) (G 1) (H 1)}`

Step 3: `{(A 8) (B 3) ({C D} 2) ({E F} 2) ({G H} 2)}`

Step 4: `{(A 8) (B 3) ({C D} 2) ({E F G H} 4)}`

Step 5: `{(A 8) ({B C D} 5) ({E F G H} 4)}`

Step 6: `{(A 8) ({B C D E F G H} 9)}`

Step 7: `{({A B C D E F G H} 17)}`
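The merge loop above can be sketched with a heap, tracking only `(weight, symbols)` pairs. Ties between equal weights are broken arbitrarily (here, by Python's tuple ordering), so the intermediate sets may differ from the listing, but the final weight is always the sum of all frequencies:

```python
import heapq

def merge_weights(frequencies):
    """Repeatedly merge the two smallest-weight nodes; return the root."""
    heap = [(w, s) for s, w in frequencies]
    heapq.heapify(heap)
    while len(heap) > 1:
        w1, s1 = heapq.heappop(heap)    # the two smallest weights
        w2, s2 = heapq.heappop(heap)
        heapq.heappush(heap, (w1 + w2, s1 + s2))
    return heap[0]

freqs = [('A', 8), ('B', 3)] + [(c, 1) for c in 'CDEFGH']
print(merge_weights(freqs)[0])  # 17, the sum of all the frequencies
```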

The algorithm does not always specify a unique tree, because there may not be unique smallest-weight nodes at each step. Also, the choice of the order in which the two nodes are merged (i.e., which will be the right branch and which will be the left branch) is arbitrary.

Huffman trees are represented by the following data abstraction, which is defined in terms of the `tree` data abstraction from lecture:

```
# Huffman encoding trees

def huffman_leaf(letter, weight):
    """A leaf of a Huffman tree, which has a weight at the root."""
    return tree(weight, [tree(letter)])

def huffman_tree(left, right):
    """A Huffman encoding tree; left and right are also Huffman trees."""
    return tree(label(left) + label(right), [left, right])

def weight(tree):
    """The weight of a Huffman encoding tree."""
    return label(tree)

def is_huffman_leaf(tree):
    """Whether this Huffman tree is a Huffman leaf."""
    return not is_leaf(tree) and is_leaf(branches(tree)[0])

def letter(leaf):
    """The letter of a Huffman leaf."""
    return label(branches(leaf)[0])
```

The `A`-through-`H` code above can be implemented as:

```
CD = huffman_tree(huffman_leaf('c', 1), huffman_leaf('d', 1))
EF = huffman_tree(huffman_leaf('e', 1), huffman_leaf('f', 1))
GH = huffman_tree(huffman_leaf('g', 1), huffman_leaf('h', 1))
EFGH = huffman_tree(EF, GH)
BCD = huffman_tree(huffman_leaf('b', 3), CD)
BCDEFGH = huffman_tree(BCD, EFGH)
example_tree = huffman_tree(huffman_leaf('a', 8), BCDEFGH)
```

### Q1: Letters

Implement `letters`, which returns the letters of a Huffman encoding tree in the numerical order of their encodings.

```
def letters(tree):
    """Return a list of all letters encoded in Huffman encoding TREE.

    >>> letters(example_tree)
    ['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h']
    """
    "*** YOUR CODE HERE ***"
```

### Q2: Decoding

Implement `decode_one`, which takes as arguments a Huffman tree and a list of 0's and 1's (bits). It returns the first encoded letter in `code`, while removing its corresponding bits:

```
def decode(tree, code):
    """Decode CODE, a list of 0's and 1's using the Huffman encoding TREE.

    >>> decode(example_tree, [1, 0, 0, 0, 1, 1, 1, 1])
    'bah'
    """
    word = ''
    while code:
        word += decode_one(tree, code)
    return word

def decode_one(tree, code):
    """Decode and remove the first letter in CODE, using TREE.

    >>> code = [1, 0, 0, 0, 1, 1, 1, 1]
    >>> decode_one(example_tree, code)
    'b'
    >>> code  # The initial 1, 0, and 0 are removed by decode_one
    [0, 1, 1, 1, 1]
    """
    "*** YOUR CODE HERE ***"
```

### Q3: Encoding

Implement `encodings`, which returns the encodings for all letters in a Huffman tree as a dictionary:

```
def encodings(tree):
    """Return all encodings in a TREE as a dictionary that maps symbols to
    bit lists.

    >>> e = encodings(example_tree)
    >>> set(e.keys()) == set('abcdefgh')
    True
    >>> e['a']
    [0]
    >>> e['c']
    [1, 0, 1, 0]
    >>> e['h']
    [1, 1, 1, 1]
    """
    "*** YOUR CODE HERE ***"
```

### Q4: Construction

Implement `huffman`, which takes a list of (symbol, frequency) pairs and outputs an optimal encoding tree. The given code sorts the pairs from lowest to highest frequency, then creates a Huffman leaf for each pair.

```
def huffman(frequencies):
    """Return a Huffman encoding for FREQUENCIES, a list of (symbol,
    frequency) pairs.

    >>> frequencies = [('a', 8), ('b', 3), ('c', 1), ('d', 1)]
    >>> h = huffman(frequencies)
    >>> for letter, code in sorted(encodings(h).items()):
    ...     print(letter + ':', code)
    a: [1]
    b: [0, 1]
    c: [0, 0, 0]
    d: [0, 0, 1]
    """
    frequencies.sort(key=lambda freq: freq[1])  # lowest frequencies first
    leaves = [huffman_leaf(letter, freq) for letter, freq in frequencies]
    "*** YOUR CODE HERE ***"

def huffman_wiki():
    """Return a Huffman encoding tree for the text of the Huffman coding page
    on Wikipedia. (Internet connection required!)

    Note: Sometimes encodings are slightly different than the ones in the test,
    depending on the content returned by Wikipedia.

    >>> e = encodings(huffman_wiki())
    >>> [[letter, e[letter]] for letter in ['a', 'b', 'c']]
    [['a', [0, 0, 1, 0]], ['b', [1, 0, 0, 0, 1, 0]], ['c', [0, 1, 0, 1, 1]]]
    """
    from urllib.request import urlopen
    from json import loads
    from collections import Counter
    huff = urlopen('http://goo.gl/w1Jdjj').read().decode()
    content = loads(huff)['query']['pages']['13883']['revisions'][0]['*']
    return huffman(list(Counter(content).items()))
```

## Constraints

### Q5: Squarer

The `multiplier` constraint from the readings is insufficient to model equations that include squared quantities, because constraint networks must not include loops. Implement a new constraint, `squarer`, that represents the squaring relation:

```
def squarer(a, b):
    """The constraint that a*a=b.

    >>> x, y = connector('X'), connector('Y')
    >>> s = squarer(x, y)
    >>> x['set_val']('user', 10)
    X = 10
    Y = 100
    >>> x['forget']('user')
    X is forgotten
    Y is forgotten
    >>> y['set_val']('user', 16)
    Y = 16
    X = 4.0
    """
    "*** YOUR CODE HERE ***"
```

### Q6: Pythagorean Theorem

Use your `squarer` constraint to build a constraint network for the Pythagorean theorem: `a` squared plus `b` squared equals `c` squared:

```
def pythagorean(a, b, c):
    """Connect a, b, and c into a network for the Pythagorean theorem:
    a*a + b*b = c*c

    >>> a, b, c = [connector(name) for name in ('A', 'B', 'C')]
    >>> pythagorean(a, b, c)
    >>> a['set_val']('user', 5)
    A = 5
    >>> c['set_val']('user', 13)
    C = 13
    B = 12.0
    """
    "*** YOUR CODE HERE ***"
```