CS61CL Lab 22 - Caches

Quiz:

When we started learning about digital design, we learned how to convert any truth table to combinational logic gates. Then we learned how to build MUXes and decoders. And then we used truth tables to specify the control logic that makes our processor do the register transfers we want it to do. So we are tempted to bang out a bunch of combinational logic for every control signal. But wait: we can use MUXes and decoders instead, and it is often a lot simpler. The following truth table has two inputs and three outputs.

A B | X  Y  Z
0 0 | 1  0  0 
0 1 | 1  1  0
1 0 | 0  1  1
1 1 | 0  0  0
1. Show how to implement this with a single MUX that has multibit inputs.

2. Show how to implement this with a single decoder plus one OR gate for each output.

The memory hierarchy

The web site for your favorite computer vendor is selling a hot new item: "This machine will run billions of instructions per second!" It has a 3 GHz CPU clock speed (that's 0.33 nanoseconds per cycle) and a 40-60 nanosecond memory access time. Having taken CS61CL, you are not settling for marketing hype, so you do a little calculation. Every instruction is fetched from memory, and many of them (about 30%) also load from or store to memory. Call it 1.3 memory accesses per instruction at 50 ns per memory access. That's 65 ns per instruction just on memory, or roughly 15 million instructions per second! Is this false advertising? How can such a fast execution speed be claimed with such slow memory? The answer is that there's something between the CPU and main memory called a cache, which we'll describe in a minute.

First, we'll describe the memory hierarchy.

This hierarchy is illustrated below. In general, the lower the level, the higher the latency (access time), and the lower the cost per bit.

The cache fits in just below the CPU. It's on the chip with the CPU, and so is quickly accessible; access time ranges from 0.5 to 5 nanoseconds. There are levels of cache (level 1, level 2, and so on) that differ in size and in the speed of the bus connecting them to the CPU. Modern personal computers have 64-256 KB level 1 caches for instructions and data, a 4 MB level 2 cache, and perhaps a 256 MB level 3 cache.

How caches are used

Here's an analogy for how the memory hierarchy is used. You're writing a term paper (you play the role of the "processor") at a table in your dorm. Doe Library is equivalent to a disk; it has essentially limitless capacity, but it takes a long time to retrieve a book. You have several good books on your bookshelf. The bookshelf is like main memory: its capacity is smaller, which means you have to return a book when the shelf fills up, but it's easier and faster to find a book there than to go to the library. The open books on the table are like the cache. Their capacity is much less (an open book takes up more space than a closed one on the shelf, and you may have to shelve a book when too many others are open), but it's much faster to retrieve data from them. This organization creates the illusion of the whole library open on the tabletop: we keep as many recently used books as possible open on the table, since we're likely to use them again, and we keep as many books as Doe will allow on our shelf, since it's faster to reach for them than to go back to the library stacks.

Memory, then, contains copies of data on disk that have been recently used. Similarly, the cache contains copies of data in memory that have been recently used. In general, each level contains the most recently used data from the next lower level. Each load instruction first checks the cache to see if it contains the addressed data; only if it doesn't will main memory be accessed. (We'll discuss shortly how stores are handled.)

The actions we take at the various levels of the memory hierarchy are similar. When we make an access and find what we want there (a "hit"), we return the data immediately. When what we're looking for is not there (a "miss"), we go to the next lower level and bring it up. (The fraction of accesses for which we have to dig down and do that is called the "miss rate".) However, the means for taking these actions are very different at different levels. For fast memory in front of slow memory, we call the fast one a cache, and everything is handled in hardware; it may take 1 cycle to process a hit and tens or hundreds of cycles to process a miss. For memory in front of disk, we call the whole illusion "virtual memory"; hits are handled in hardware, while misses are handled in software by the operating system, and handling such a miss takes millions of instructions.
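
As a rough illustration of how these costs combine, the C sketch below computes an average access time from a hit cost, a miss rate, and a miss penalty. The particular numbers (1-cycle hits, 100-cycle misses, a 5% miss rate) are assumptions chosen only for the example.

    /* Illustrative only: how hit cost, miss rate, and miss penalty combine.
       The numbers are assumptions in the spirit of the text (1-cycle hits,
       misses costing on the order of 100 cycles), not measurements. */
    #include <stdio.h>

    int main(void) {
        double hit_cycles   = 1.0;    /* cost of a hit                  */
        double miss_penalty = 100.0;  /* extra cycles to go to memory   */
        double miss_rate    = 0.05;   /* fraction of accesses that miss */
        double average      = hit_cycles + miss_rate * miss_penalty;
        printf("average access time = %.1f cycles\n", average);  /* 6.0 */
        return 0;
    }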

Why do caches work?

Caches "cheat the laws of physics" by taking advantage of statistics. Programs don't really access RAM randomly. Programs have temporal and spatial locality and caches exploit this behavior to make them run faster. If we access a particular word, we are likely to access it again soon; this is temporal locality. Spatial locality says that if we use a piece of memory, we're likely to want to use the neighboring pieces sometime soon.

BrainStorm

(P&H exercise 7.3 and 7.4)

  1. Describe the general characteristics of a program that would exhibit very high amounts of temporal locality but very little spatial locality with regard to data accesses, and give an illustrative example.

  2. Describe the general characteristics of a program that would exhibit very little temporal locality but very high amounts of spatial locality with regard to data accesses, and give an illustrative example.

Fully associative caches

A fully associative cache is managed just like a hardware version of an assoc table (which you would have seen in CS 61A) or a map (which you would have seen in CS 61B). In essence, it stores a fixed-size collection of address/contents pairs. In reality, it stores what P&H refer to as a tag, which contains only as much of the address as is necessary to find it in the table. P&H also refer to the contents as the data, and to a pair plus its bookkeeping information (e.g. a "valid?" bit that says whether the cache entry contains meaningful data) as a block.

When a load is encountered, the cache is searched for a block with the desired address. If such a block is in the cache, that's a hit, and the corresponding data is returned. If the cache does not contain a block with the desired address, that's a miss; the requested information is retrieved from main memory and also stored in the cache. But unlike your software lookup, which searches sequentially through all the tags in the table, the hardware compares them all in parallel.

When the collection fills up, something currently in the collection must be replaced by the incoming address/contents pair. With a small cache, the least recently used block might be replaced; that's the block that has been unreferenced for the longest time. (The circuitry to keep track of when a block was used is complicated.) With a larger cache, the block to replace might be chosen at random, since random choice is much easier to implement in hardware.

Using sequential search to find an entry in a fully associative cache would be too slow, so the search is done in parallel. This requires a comparator for each cache entry to check an address against all the addresses currently in the cache, plus a selector to access the correct contents. If the address is found, the associated contents is returned. A diagram appears below.

We assume that the cache contains only words. This allows us to economize a bit (actually 2 bits) in storing the address. Just as in jump and branch instructions, where the last two bits of an address are 0 and therefore aren't stored, we can also get away with keeping just 30 bits of the address as the tag.
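
To make the tag, valid bit, and data bookkeeping concrete, here is a minimal software sketch of a four-entry fully associative cache of one-word blocks with LRU replacement. The names (NBLOCKS, struct block, lookup, insert) are inventions for this sketch, and its loops merely stand in for comparisons the hardware performs in parallel.

    #include <stdint.h>
    #include <stdbool.h>

    #define NBLOCKS 4                    /* number of one-word blocks (assumption) */

    struct block {
        bool     valid;                  /* does the entry hold meaningful data?   */
        uint32_t tag;                    /* addr >> 2: the low 2 bits are always 0 */
        uint32_t data;                   /* the cached word                        */
        unsigned last_used;              /* timestamp for LRU replacement          */
    };

    static struct block cache[NBLOCKS];
    static unsigned now;                 /* access counter used as a timestamp     */

    /* Hit: copy out the word and record the use.  Miss: the caller would fetch
       the word from memory and then call insert().  The loop stands in for the
       parallel tag comparison the hardware performs. */
    bool lookup(uint32_t addr, uint32_t *word) {
        uint32_t tag = addr >> 2;
        for (int i = 0; i < NBLOCKS; i++)
            if (cache[i].valid && cache[i].tag == tag) {
                cache[i].last_used = ++now;
                *word = cache[i].data;
                return true;             /* hit  */
            }
        return false;                    /* miss */
    }

    /* Place a word fetched from memory into the cache, using an invalid entry
       if there is one and otherwise evicting the least recently used block. */
    void insert(uint32_t addr, uint32_t word) {
        int victim = 0;
        for (int i = 0; i < NBLOCKS; i++) {
            if (!cache[i].valid) { victim = i; break; }
            if (cache[i].last_used < cache[victim].last_used)
                victim = i;
        }
        cache[victim] = (struct block){ true, addr >> 2, word, ++now };
    }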

BrainStorm

Consider two programs, program A with high spatial locality and low temporal locality, and program B with just the opposite. Which of the following is more likely? Briefly explain your answer.

  1. Using a fully associative cache will give program A more of a performance boost than program B.
  2. Using a fully associative cache will give program B more of a performance boost than program A.
  3. The benefits of using a fully associative cache will be roughly the same for the two programs.

A short example

Here's an example of the use of a four-word fully associative cache with a least-recently-used replacement policy. The tag fields in the cache are omitted in the table to save space; the contents field in each cache entry contains M(addr), which means the word stored in memory at address addr, or "--", which means an invalid entry. The sequence of byte addresses accessed (the "reference string", in P&H terms) is 4 (00100), 8 (01000), 16 (10000), 24 (11000), 8 (01000), 12 (01100), 16 (10000), 12 (01100), where each decimal address is followed by its binary form. Once the cache is full, the least recently used block is marked with an asterisk.

address accessed   hit or miss?   cache contents
4  (00100)         miss           M(4)    --      --      --
8  (01000)         miss           M(4)    M(8)    --      --
16 (10000)         miss           M(4)    M(8)    M(16)   --
24 (11000)         miss           M(4)*   M(8)    M(16)   M(24)
8  (01000)         hit            M(4)*   M(8)    M(16)   M(24)
12 (01100)         miss           M(12)   M(8)    M(16)*  M(24)
16 (10000)         hit            M(12)   M(8)    M(16)   M(24)*
12 (01100)         hit            M(12)   M(8)    M(16)   M(24)*

BrainStorm - A Longer Example

Here is a series of address references (byte addresses):

	8, 12, 44, 64, 84, 52, 256, 192, 
	76, 44, 12, 88, 16, 108, 24, 44

Assume that a fully associative cache with 8 one-word blocks is initially empty, and uses a least-recently-used replacement policy. Label each reference in the list as a hit or a miss, and show the final contents of the cache.

Seeing caches in action with MARS

Copy the MIPS assembly language program ~cs61cl/code/cache.s to your local directory and open it in MARS. This is a very simple program that walks sequentially through an array, modifying each word. Take a look at the program and convince yourself that you understand what it does. It has some flags that allow you to go over the array repeatedly or to skip elements when stepping through the array.
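
In C terms, the access pattern is roughly the sketch below; the parameter names are placeholders rather than the labels used in cache.s itself.

    /* A rough C equivalent of the access pattern in cache.s.  The parameter
       names (reps, stride, n) are placeholders for the program's flags and
       array size; check the labels in the assembly itself. */
    void touch(int *array, int n, int reps, int stride) {
        for (int rep = 0; rep < reps; rep++)        /* go over the array repeatedly */
            for (int i = 0; i < n; i += stride)     /* optionally skip elements     */
                array[i] += 1;                      /* load, modify, store          */
    }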

Brainstorm: Temporal Locality

If you were to change the rep count to 2 so that you went over the same array twice, what would the hit rate be? Why?

Go back to the source and change the rep count to 4. Assemble your program. Reset the cache simulator and the visualizer. Set the run speed to about 10 inst/sec so that you can see what is going on. Let it fly. Notice what happens in the cache on the first and second repeats. You can check your answer as rep two completes and see where the hit rate goes from there.

Brainstorm: Replacement

Reduce the number of blocks in the cache to 16 and run cache.s again with the rep count set to 4 as above. Observe the hit rate. Why is it so low?

Multi-word blocks

An elaboration on fully associative caching uses multi-word blocks. When a word is added to the cache, the words around it are also added. This tactic takes advantage of spatial locality.

Consider now a cache with 2-word blocks, as in the diagram below.

How is it decided, when an address/contents pair is to be added to the cache, whether the word before it or the word after it is to be added as well? Suppose addr is the byte address we want to add. The approach taken is to add the two words for which addr/8 is the same. For example, if addr is 24, we add the words at addresses 24 and 28. If addr is 36, we add the two words at addresses 32 and 36. This lets us save a bit in the tag—now we only need 29 bits to identify the address of a given cache entry.
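
In C terms, the bit-level bookkeeping for 2-word blocks might look like the sketch below; the function names are made up for illustration.

    #include <stdint.h>

    /* Address fields for 2-word (8-byte) blocks.  Two addresses belong to the
       same block exactly when addr/8 is the same: 24 and 28 both give 3,
       while 32 and 36 both give 4. */
    uint32_t block_tag(uint32_t addr)     { return addr >> 3; }          /* 29 bits    */
    uint32_t word_in_block(uint32_t addr) { return (addr >> 2) & 0x1; }  /* which word */
    uint32_t byte_in_word(uint32_t addr)  { return addr & 0x3; }         /* which byte */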

Let's see how this works with the address references

    0, 12, 4, 8

and a four-word cache with a block size of 2. We put the partner word in parentheses to show that it wasn't directly accessed.

accessed address   hit or miss?   cache contents
0                  miss           0, (4)      --
12                 miss           0, (4)      (8), 12
4                  hit            0, 4        (8), 12
8                  hit            0, 4        8, 12

BrainStorm

Here is a series of address references (byte addresses):

	8, 12, 44, 64, 84, 52, 256, 192, 
	76, 44, 12, 88, 16, 108, 24, 44

Assume that a fully associative cache with 4 two-word blocks is initially empty, and uses a least-recently-used replacement policy. Label each reference in the list as a hit or a miss, and show the final contents of the cache.

Bigger blocks

This multi-word block scheme can be extended to 4-word blocks, as in the diagram below, and beyond. Only 28 bits are needed for the tag for a 4-word block. The words in each block are those for which addr/16 is the same.
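
A hedged sketch of the same arithmetic generalized to any power-of-two block size (the function name is again made up):

    #include <stdint.h>

    /* The same arithmetic for any power-of-two block size: with 4-byte words,
       words_per_block = 2 gives addr >> 3 (a 29-bit tag) and
       words_per_block = 4 gives addr >> 4 (a 28-bit tag). */
    uint32_t tag_of(uint32_t addr, uint32_t words_per_block) {
        uint32_t low_bits = 2;                      /* byte within a word    */
        for (uint32_t w = words_per_block; w > 1; w >>= 1)
            low_bits++;                             /* plus log2(words) bits */
        return addr >> low_bits;
    }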

Brainstorm: Spatial locality in MARS

Change the configuration of your simulated cache to have four 4-word blocks, rather than 16 1-word blocks. It still has a total size of 64 bytes. Run cache.s again and observe the hit rate. Why is it so much higher? Does it matter if you set the rep count back to 1?

BrainStorm

Here is a series of address references (byte addresses):

	8, 12, 44, 64, 84, 52, 256, 192, 
	76, 44, 12, 88, 16, 108, 24, 44

Assume that a fully associative cache with 2 four-word blocks is initially empty, and uses a least-recently-used replacement policy. Label each reference in the list as a hit or a miss, and show the final contents of the cache.

An alternative: direct-mapped caches

In a fully associative cache, a particular tag/data pair can be anywhere in the cache. We've already noted that the circuitry needed to find something in the cache is complicated. In a direct-mapped cache, each memory address is associated with exactly one location within the cache. This is inflexible but easy to implement in hardware.

The block location in the cache is determined by bits in the address. With an 8-word direct-mapped cache, the last two bits in the address specify a position within the word. The next three bits of the address (counting from the right) form an index into one of the 8 cache locations, as shown in P&H Figure 7.5 below.

The remainder is the tag with which a cache entry is identified. The figure below shows how the bits are classified in an 8-word direct-mapped cache: "t" is part of the tag, "i" is part of the index, and "p" is part of the position within the word.

    tttttttttttttttttttttttttttiiipp

To see if a given address is in the cache, we isolate bits 4-2 of the address—call them index—and then compare the tag in the index'th entry with the tag from the address we're looking for.
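
A minimal C sketch of that check for the 8-word, one-word-block case might look like this; the structure and names are inventions for the sketch.

    #include <stdint.h>
    #include <stdbool.h>

    #define NLINES 8                        /* 8 one-word blocks (this example) */

    struct line { bool valid; uint32_t tag; uint32_t data; };
    static struct line cache[NLINES];

    /* Bits 1-0 are the position in the word, bits 4-2 the index, and the
       remaining 27 bits the tag.  Only one entry ever needs to be checked. */
    bool dm_lookup(uint32_t addr, uint32_t *word) {
        uint32_t index = (addr >> 2) & 0x7;          /* bits 4-2  */
        uint32_t tag   = addr >> 5;                  /* bits 31-5 */
        if (cache[index].valid && cache[index].tag == tag) {
            *word = cache[index].data;
            return true;                             /* hit  */
        }
        return false;                                /* miss */
    }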

Here is a short example that uses a 4-word cache. Bits 3 and 2 are used to index the cache.

address accessed   hit or miss?   cache contents (indices 0 through 3)
4  (00100)         miss           --      M(4)    --      --
8  (01000)         miss           --      M(4)    M(8)    --
16 (10000)         miss           M(16)   M(4)    M(8)    --
24 (11000)         miss           M(16)   M(4)    M(24)   --
8  (01000)         miss           M(16)   M(4)    M(8)    --
12 (01100)         miss           M(16)   M(4)    M(8)    M(12)
16 (10000)         hit            M(16)   M(4)    M(8)    M(12)
12 (01100)         hit            M(16)   M(4)    M(8)    M(12)

Multi-word blocks (again)

Like a fully associative cache, a direct-mapped cache can be used with multi-word blocks. The rightmost bits in an address then specify a position within the block. If there are two words per block, there are three position bits; four words per block require four position bits; and so on. The next bits specify an index within the cache. The remainder are the tag as before.

Here's a picture of an eight-word direct-mapped cache with a two-word block size, followed by the corresponding use of the bits within an address.

    tttttttttttttttttttttttttttiippp
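
In C terms, the field extraction for this configuration might look like the following sketch (function names made up):

    #include <stdint.h>

    /* Field extraction for an 8-word direct-mapped cache with 2-word blocks:
       4 blocks means 2 index bits, and 8-byte blocks mean 3 position bits,
       leaving a 27-bit tag. */
    uint32_t dm2_position(uint32_t addr) { return addr & 0x7; }          /* bits 2-0  */
    uint32_t dm2_index(uint32_t addr)    { return (addr >> 3) & 0x3; }   /* bits 4-3  */
    uint32_t dm2_tag(uint32_t addr)      { return addr >> 5; }           /* bits 31-5 */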

BrainStorm

Here is a series of address references (byte addresses):

	8, 12, 44, 64, 84, 52, 256, 192, 
	76, 44, 12, 88, 16, 108, 24, 44

Assume that a direct-mapped cache with eight words and two words per block is initially empty. Label each reference in the list as a hit or a miss, and show the final contents of the cache.

Direct mapped cache under MARS

Change the placement policy in the MARS data cache performance tool to Direct Mapping. Run cache.s and compare the hit rates of the direct-mapped and fully associative caches. Can you explain what you observe?

Now let's modify the program slightly to change the address stream that it generates. Add a variable to the beginning of the data segment.

	.data
counter:	.word	0
array:
	.space	2048
	
Add an instruction outside the loop to set up a pointer to this variable.
	la  $t2, counter
And add an instruction in the inner loop to store to this variable.
	addu	$s4,$s4,$s1
	addu	$s5,$s5,$s1
	sw      $s3, 0($t2)	# store to counter (pointed to by $t2)
	blt	$s4,$s3,innerloop
Set a breakpoint or slow down the simulation rate so that you can see the effect of mixing stores to counter with stepping through array.

Compare the hit rates with associative and direct mapped caches on this program.

Set-associative caches

A hybrid organization that combines features of direct-mapped and fully associative caches is a set-associative cache. An N-way set-associative cache is like a direct-mapped cache, each of whose elements is a fully associative cache of N elements. Thus the cache consists of a number of sets, each of which consists of N blocks. Each block in the memory maps to a unique set in the cache given by the index field, and a block can be placed in any element of that set.
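
Here is a minimal C sketch of an N-way lookup, shown 2-way with one-word blocks to match the example below; the structure and names are inventions for the sketch, and the loop over the ways stands in for comparisons done simultaneously in hardware.

    #include <stdint.h>
    #include <stdbool.h>

    #define NSETS 2                          /* e.g. 4 one-word blocks, 2-way */
    #define NWAYS 2

    struct line { bool valid; uint32_t tag; uint32_t data; unsigned last_used; };
    static struct line cache[NSETS][NWAYS];
    static unsigned now;

    /* The index picks the set, as in a direct-mapped cache; the tags of all
       NWAYS blocks in that set are then compared (in hardware, all at once). */
    bool sa_lookup(uint32_t addr, uint32_t *word) {
        uint32_t set = (addr >> 2) & (NSETS - 1);    /* bit 2 when NSETS == 2   */
        uint32_t tag = addr >> 3;                    /* bits above the index,
                                                        assuming NSETS == 2     */
        for (int way = 0; way < NWAYS; way++)
            if (cache[set][way].valid && cache[set][way].tag == tag) {
                cache[set][way].last_used = ++now;
                *word = cache[set][way].data;
                return true;                         /* hit  */
            }
        return false;                                /* miss */
    }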

The figures below portray a two-way set-associative cache and a four-way set-associative cache, both with a total of eight words.

2-way set-associative cache          4-way set-associative cache

Here's a short example of the use of a 2-way set-associative cache with four words. Bit 2 is used to choose the set.

address accessed   hit or miss?   set 0            set 1
4  (00100)         miss           --,    --        M(4),  --
8  (01000)         miss           M(8),  --        M(4),  --
16 (10000)         miss           M(8),  M(16)     M(4),  --
24 (11000)         miss           M(24), M(16)     M(4),  --
8  (01000)         miss           M(24), M(8)      M(4),  --
12 (01100)         miss           M(24), M(8)      M(4),  M(12)
16 (10000)         miss           M(16), M(8)      M(4),  M(12)
12 (01100)         hit            M(16), M(8)      M(4),  M(12)

BrainStorm

Here is a series of address references (byte addresses):

	8, 12, 44, 64, 84, 52, 256, 192, 
	76, 44, 12, 88, 16, 108, 24, 44

Assume that a 2-way set associative cache with eight words total is initially empty, and uses a least-recently-used replacement policy. Label each reference in the list as a hit or a miss, and show the final contents of the cache.

2-way Set associative in MARS

Modify the set size to be 2 blocks and run your modified version of cache.s with the counter. How does the hit rate of the 2-way set-associative cache compare to the fully associative and direct-mapped caches? What do you think is the best cost-performance tradeoff?

Summary

Today's activities introduced caches and ways to organize them. The two types of organization are fully associative, where cache entries are not in predetermined locations, and direct-mapped, where they are. An n-way set associative cache splits the entries into sets of n elements, which are then searched as in fully associative caches. Another variable is the block size; bigger blocks improve retrieval in programs with high spatial locality.

In the next lab, you will infer a cache's characteristics from program timings. We will also consider ways to be cache-conscious programmers.