Hashing and Sorting

Due Monday, November 3rd at 11:59 PM (development in progress) This homework is not fully battle tested. Send any errors to Josh directly: hug@cs

Introduction
Part 1: ECHashStringSet Implementation
Part 2: ECHashStringSet Timing
Part 3: Quicksort and Mergesort Mechanics
Part 4: Sorting as a Tool
Part 5: Optional Ungraded Problems

Introduction

Be warned, this homework is somewhat long. Be cautious when implementing your code. Don't be afraid to abandon initial attempts at problems. Also it has not been fully battle tested so please send an email to Josh if you spot anything amiss.

Part 1: ECHashStringSet (ECHashStringSet.java)

For this problem, you'll expand on your work from HW6 by creating a hashing based implementation of the StringSet interface.

As in HW6, to implement the interface, you'll be working completely from scratch, with no Makefiles or skeleton code. This one will be a fair bit more difficult than the BSTStringSet from HW6.

Create a class ECHashStringSet that implements the StringSet interface using an external chaining hash table as its core data structure. In order to ensure that put and contains operations are fast, you should resize the hash table as soon as the load factor exceeds 5. For memory efficiency, you should ensure that the load factor is never less than 0.2, except for empty lists, for which the load factor is allowed to be zero. You do not need to worry about downsizing the set since we don't have a remove operation.

There is one annoying issue you'll enounter: If the hashCode() is negative, we need to remove the top bit (since negative array indices are not allowed in Java). You can do this using the bit operations we discussed earlier in class, e.g. s.hashCode() & 0x7fffffff) % bucketCount.

There's no restriction on what you are allowed to use, but you should avoid using any class that has 'Hash' in the name, since this defeats the whole point of this problem.

Also don't be tempted by the fact that ArrayLists utilize resizing -- they won't do what you want, since they won't obey the load factor specifications described above. Yes this means you'll need a helper method that resizes hash table to increase the number of buckets when needed.

For an extra challenge, don't use anything that requires an import statement.

As references, you might find the following resources useful:

Chapter 7 of DSIJ.
Josh's in-class exercise on resizing from the hashing lecture.
The Resizing Array-Based Stack that is part of Princeton's standard Java library.
The wiki article on separate chaining (the more common term for what we call external chaining in class).
If you're using something that requires imports, see this Stack Overflow post on creating arrays of objects that expect a formal parameter. There is no good way to avoid the "unchecked or unsafe operations" warning.
If you get really stuck, there is rather a similar class you can look to for inspiriation at this link. You might find it particularly helpful for writing your resize method. Avoid looking at this class until you're really truly stuck. It's good to hit all the roadblocks yourself, and only appeal to outside help once you've really appreciated the difficulty you're trying to solve.

Your correctness tests from hw6 should work almost without modification (you'll just need to change the type of object that is instantiated). Make sure to test something that inserts a large number of strings. One testing approach is to generate random strings, insert them into both your ECHashStringSet and a TreeSet<String>, then iterate through the entire TreeSet and ensure that all of its members are also contains in your ECHashStringSet.

My solution (which does use an import for handling the list of items that go in each bucket) is 66 lines including comments and whitespace.

Part 2: ECHashStringSet Timing (hw7timing.txt)

Run the provided timing test InsertRandomSpeedTest. You'll need a copy of your BSTStringSet from HW6 (or if you didn't do HW6, a copy of the solution). Fill out hw7timing.txt as directed.

Repeat this for InsertInOrderSpeedTest.

If you're having timing issues, make sure your hash table is actually resizing.

Part 3: Quicksort and Mergesort Mechanics (sortingProblems.txt)

We mentioned in class that Quicksorting an array is equivalent to inserting all of its items into a BST. In this problem, we'll explore this further.

Before we can understand this, we need a specific implementation for partitioning.

Consider the array [5, 3, 2, 1, 7, 8, 4, 6], and suppose that we pick the leftmost item 5 as our pivot. One approach to partitioning is to perform a 'stable' partitioning where all items that are less than 5 appear in the same order as they did before partitioning, and likewise for the greater items. For the given array, we'd get [3, 2, 1, 4, 5, 7, 8, 6].

One inefficient but simple way to implement this stable partitioning algorithm is to perform the following steps:

Create three empty Lists for storing integers smaller, equal to, and larger than the pivot, respectively.
Go through each item of the list, comparing it to the pivot, and adding items to the respective list based on the comparison.
Concatenting the three Lists into a single concatenated List.
Copying the concatenated List back into the array.

So for example if we partition [5, 3, 2, 1, 7, 8, 4, 6] from index 0 to index 7, we'd get:

Smaller items: [3, 2, 1, 4]
Equal items  : [5]
Larger items : [7, 8, 6]

The concatenation of these lists is just [3, 2, 1, 4, 5, 7, 8, 6]. Along the way, we compared the following pairs of numbers: 5-3, 5-2, 5-1, 5-4, 5-7, 5-8, and 5-6.

If we were using partition to sort, we'd then repeat this process for the left half and right half sides as discussed in class.

In sortingProblems.txt, fill out the list of comparisons used by Quicksort. You might find running Quicksort.java (provided in the Sorting folder in hw7) to be useful.

BSTs (1b, 1c, 1d in sortingProblems.txt)

Now draw the BST that results when you insert [5, 3, 2, 1, 7, 8, 4, 6] (in that order) into an initially empty BST. Record the comparions you observe in sortingProblems.txt. Answer questions 1c and 1d in sortingProblems.txt.

Mergesort (1e in sortingProblems.txt)

Finally, for 1e, give an example of a comparison performed by mergesort that is not performed by Quicksort or BST. This isn't particularly interesting, but we're just asking to make sure you understand how Mergesort works.

Part 4: Sorting as a Tool: Stabbing Count Queries (Intervals.java)

Oops! This problem is broken, but I'm leaving this warning here for those of you who already made the key observation that there is a trivial solution that does not require sorting.

I started from a really neat problem involving dynamic stabbing count queries, where you want to be able to support 'insert' and 'stabbingCount' in $\Theta(log N)$ time. I wanted a sorting version of this, and decided to convert it into a static version of the same problem (where you look at a fixed array) -- but I never stepped back to actually think about what I was doing since of course you can just iterate through each interval and compare against the point. Maybe some future week I'll give you the real problem. Sorry... Just skip this one. I'll add an optional problem along the same lines for those of you who wanted to do something cool with sorting.

~~There are a large number of problems for which sorting provides a fast solution, even though the problem isn't really about sorting.~~

~~Define an interval to be a pair of numbers $[x_i, x_i']$ such that $x_i < x_i'$. An interval specifies a range of values in 1D space, where we can call $x_i$ the start point, and $x_i'$ the end point.~~

~~Given a list of such intervals and an integer x, we want to know how many intervals x belongs to, or alternately how many intervals x "stabs".~~

For example, if we have intervals = $[3, 10]$, $[4, 5]$, $[6, 12]$, $[8, 15]$, $[19, 30]$, and we have x = 9, then since x belongs to three intervals: [3, 10], [6, 12], and [8, 15] we say that the "stabbing count" is 3. By contrast, we'd say that the stabbingCount(intervals, 17) is 0, since 17 does not belong to any ranges.

~~Fill in Intervals.java so that the stabbingCount method returns the correct stabbing count in $\Theta(N log N)$ time.~~

There is a clever trick to this problem that you'll need to figure out to get $\Theta(N log N)$ time. You do not need to use any nested loops for this problem, and in fact if you find yourself using them, you probably haven't found the right approach.

Note: Sorting (see the Collections class) is the most straightforward way to achieve $\Theta(N log N)$, but there are other ways as well, e.g. since Quicksort and BST insertion have equivalent runtimes, you can achieve the same effect as randomized Quicksort by inserting randomly into a BST.

~~My solution to stabbingCount is 11 lines not including whitespace. Yes there is a simple way.~~

Part 5: Optional Ungraded Problems

For those of you who want to get a little more practice with algorithm design, consider the following problems.

Distribution Count for Large Numbers (SortInts.java)

[Goodrich & Tamassia] Given a sequence of n distinct integers, each one of which is in the range $[0, n^2 - 1]$, develop an $O(n)$ algorithm for sorting them. See the skeleton file Optional/SortInts.java. You can't use ordinary distribution sort for this, because that would require initializing and traversing arrays of size $n^2$, which would take too long

Inversion Counting (Inversions.java)

Find an algorithm that runs in $O(n \lg n)$ time for computing the number of inversions in a list of n items. See the skeleton file Optioanl/Inversions.java. To test that your code is actually $O(n \lg n)$, provide it a very large list (say hundreds of thousands).

Two Sum (Sum.java)

[Goodrich&Tamassia] Given two sequences of integers, $A$ and $B$, find an algorithm that runs in $O(n \lg n)$ time (where n is the total number of integers in A and B) that determines, for a given parameter $m$, whether there is an integer $a$ in $A$ and an integer $b$ in $B$ such that $m = a + b$. See the skeleton file Optional/Sum.java. To test that your algorithm is runs in $O(n \lg n)$ time, provide sequences of hundreds of thousands of integers. Feel free to use any of the methods in java.util.Arrays.