CS61C Lab 12

Caches

Background

Goals

In this lab you will experimentally determine the parameters of a computer's cache. Good luck...

Info

The program cache.c produces timing data for accesses of elements in an array. Here is its main loop:

for (index = 0; index < limit; index = index + stride) {
        x[index] = x[index] + 1;
}

The surrounding code coordinates the setting of the variables LIMIT and STRIDE and the timing of the loop.

Each loop/stride test is repeated many times and the results are averaged. The same loops are run without the memory access to measure the time taken with loop overhead. The program subtracts the overhead time from the loop time so the results should show the time spent on memory access only. Unfortunately, some of our compilers don t compile the real loop and the dummy loop in the same way, so your times may be exaggerated. You don't have to fix this.

The program prints three values for each test, labeled as SIZE, STRIDE, and READ+WRITE.

The SIZE value refers to the size, in bytes, of the array you are testing with, not the size of the cache. (You'll have to figure out the cache size for yourself.)
The STRIDE value says how many bytes are being skipped at each access. For example, if the stride is 4, bytes 0, 4, 8, 12, ... in the array are being accessed with bytes 1, 2, 3, 5, 6, 7, 9, 10, 11, ... being skipped.
The READ+WRITE value is the amount of time needed to make the access to memory, in nanoseconds. You'll have to figure out for yourself whether these times denote hits or misses. As explained above, the calculated times may be exaggerated if the compiler doesn t make equivalent machine code for the two loops that process the array.

Exercises

Setup

Copy the contents of ~cs61c/labs/12 to a suitable location in your home directory.

$ mkdir ~/lab
$ cp -R ~cs61c/labs/12/ ~/lab

Exercise 1

CAMERA is a simple cache simulator. You can run camera by executing:

$ java -classpath ~cs61c/bin/camera Camera

You can then select to simulate a 1 way (direct mapped), 2 way, 4 way, or fully associative cache. For each, generate a random sequence of addresses, and step through the simulator. Notice how the address is broken up into the offset, index, and tag. Additionally, notice what memory is copied into the cache in addition to the address requested.

Manually generate a series of memory accesses which, when looped, will cause all misses in a direct mapped cache but will have all hits in a fully associative cache.

Then, manually generate a series of memory accesses which, when looped, will cause all misses for a fully associative cache but will have a minimal miss rate in a direct mapped cache.

NOTE: In CAMERA, the addresses are interpreted as word addresses, but in our model, we talk about byte addresses. Thus, when CAMERA mentions a word, it may be easier for you to think of it as mentioning a byte.

Checkoff

Show the above 2 memory access sequences to your TA.

Exercise 2

The Sutardja Dai Macs are very modern machines and have a complicated memory heirarchy that will be difficult to evaluate with our tests. We will be using some of the older lab machines instead.

Start by gathering some data. Log in to one of the following machines of your choice using ssh -X (Please, choose randomly. We don't want to overload any one of them). Note the machine name and any information you can find about its processor. "uname -a" or "uname -X" may work on some of the instructional machines.

List of computers that you can login to. All of them are located at 275 Soda:

forrestal franklin hancock hornet

independence intrepid kearsarge kittyhawk

langley lexington leyte midway

monterey nimitz oriskany princeton

randolph ranger saipan sanjacinto

saratoga tarawa ticonderoga

Compile cache.c without any optimizations:

$ gcc -O0 -o cache cache.c

In case you have a hard to read font, the first switch is "minus Oh Zero". Now run the program, which tests reading and writing with arrays from 4K to 1M at strides varying from 4 to 4K bytes. You will want to capture the output into a file, so use a command like "./cache > out" (which puts the result in a file named "out").

Now graph the data by running:

$ gnuplot

If it doesn't work on the machine you connected to, run it from your local lab machine. Once you are in gnuplot, give gnuplot this command:

gnuplot> load "cache.gnuplot"

This command load a sequence of command that is stored in the file cache.gnuplot. You should see the results of each array size experiment as separate graph segments. Each segment shows various strides, increasing from 4 bytes at its leftmost point to 4K bytes at its rightmost.

If you are interested, look at the file cache.gnuplot to understand how gnuplot plots your data. The top part of the file sets the plot title and x-axis to have a log scale among other things. The last line is the actual plot command. The gnuplot plot command is very powerful. See the help page from within gnuplot if you want to understand how it works. To learn more about gnuplot, try its excellent help system: "help plot", "help plot with", "help set", "help", etc.

Before you start number-crunching, try to understand the qualitative features of the graph. You should be able to explain what each of the jumps in access time means, and why some are much bigger than others. Do all your graph segments start at the same access time?

Try modifying the line containing set yrange [1:200] to a different value if the plots seems too clobbered.

You can exit gnuplot by the command:

gnuplot> quit

Exercise 3

From what you see in the graph and the numbers in your table, you should be able to determine most of the parameters of the test computer's cache. Here are the questions you should answer:

Approximately how fast is a cache hit?
Approximately how fast is a cache miss?
What is the size in bytes of the first-level cache? Why?
What is the block size in bytes of the first-level cache? Why?

Checkoff

Explain your answer to the above questions to your TA.

Exercise 4 (For Experts)

This question is not required for the checkoff. If you don't have time to answer in lab, try figuring this out at home.

Answer one more question:

What is the set associativity on the cache you're analyzing?

You may not be able to figure this out from the data given by the original cache.c. Changing the STRIDE_MIN, STRIDE_MAX, CACHE_MIN, and CACHE_MAX settings at the top of the file will help. When you're calculating the set associativity, you ll want a large stride and large array sizes to eliminate the effects of the block size and cache size. With a sufficiently large stride length along with arrays big enough to contain 1, 2, 4, ... elements to access, you should be able to observe the effects of putting several elements into the same cache set (row). Normally, retrieving elements into the same set would cause repeated misses and no benefit from the cache. But if the cache is associative at all, you ll see benefits even when multiple blocks are loaded into the same set. cache.c's testing style of increasing the array size should give you results that show how many blocks can be loaded into one set, and that's the cache's associativity.

A piece of information not required for this lab but worth exploring is the following:

Does the machine have more than one level of cache? Why or why not?

You will probably not be able to conclude very much about multiple levels of cache because the smaller caches tend to hide the effects of the larger ones. Do not expect to be able to calculate all the parameters of non-primary caches.

The Sutardja Dai Macs

If you are curious, you can try running these benchmarks locally on the Mac machines. You will have to increase the test parameters in cache.c to see any effects, and the scale of the y-axis in cache.gnuplot since the access times will be faster. Keep in mind that the cache heirarchy here is several levels deep, so any results will be difficult to derive. Lastly, gnuplot hasn't been installed on the Macs, so you will need to log into one of the other instructional servers to actually plot the data.

What you should have learned

As a result of this lab, you ought to be able to specify what data and features of the cache program let you determine a given cache parameter, and to answer questions like the following (drawn from an old exam):

Running the cache program on the Wazcog Mark IV computer, you get the following results:

size	stride
	4	8	16	32	64	128	256
64	109	126	122	121	101
128	120	118	116	121	117	105
256	119	116	102	117	113	113	110
512	466	966	949	949	983	308	311
1024	458	949	983	966	949	983	307
2048	449	992	954	966	979	979	979

(Stride and size are in bytes, times in nanoseconds. Yes, this is an artificially small example, to make the numbers fit easily.)

What is the cache size in bytes?
What is the block size in bytes?
What is the allocation policy? (Direct mapped, fully associative, or set associative? If N-way set associative, what is N?)

forrestal	franklin	hancock	hornet
independence	intrepid	kearsarge	kittyhawk
langley	lexington	leyte	midway
monterey	nimitz	oriskany	princeton
randolph	ranger	saipan	sanjacinto
saratoga	tarawa	ticonderoga