When we started learning about digital design, we learned how to convert any truth table into combinational logic gates. Then we learned how to build MUXes and decoders. And then we used truth tables to specify the control logic that makes our processor perform the register transfers we want. So we are tempted to bang out a bunch of combinational logic for every control signal. But wait: we can use MUXes and decoders instead, and that is often a lot simpler. The following truth table has two inputs and three outputs.
A B | X Y Z
0 0 | 1 0 0
0 1 | 1 1 0
1 0 | 0 1 1
1 1 | 0 0 0

1. Show how to implement this with a single MUX that has multibit inputs.
2. Show how to implement this with a single decoder plus one OR gate for each output
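To make the MUX idea concrete, here is a minimal software sketch (in Python, purely illustrative): the MUX is a four-entry lookup table whose select lines are (A, B), and each entry holds one multibit output row (X, Y, Z) of the truth table.

```python
# A 4-input MUX with 3-bit-wide inputs, modeled as a lookup table.
# The select lines are (A, B); each entry is one (X, Y, Z) output row.
# This mirrors the hardware idea: wire each row of the truth table to
# one MUX input, and let (A, B) select which row drives the outputs.

MUX_INPUTS = {
    (0, 0): (1, 0, 0),
    (0, 1): (1, 1, 0),
    (1, 0): (0, 1, 1),
    (1, 1): (0, 0, 0),
}

def logic(a, b):
    """Return (X, Y, Z) for inputs A and B."""
    return MUX_INPUTS[(a, b)]

print(logic(1, 0))  # the row A=1, B=0 gives X=0, Y=1, Z=1
```

In hardware, the dictionary corresponds to four hard-wired 3-bit constants feeding a 4-to-1 MUX with (A, B) as the select lines.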
The web site for your favorite computer vendor is selling a hot new item: "This machine will run billions of instructions per second!" It has a 3 GHz CPU clock speed (that's 0.33 nanoseconds per cycle) and 40-60 nanosecond memory access time. Having taken CS 61CL, you are not settling for marketing hype, so you do a little calculation. Every instruction is fetched from memory, and many of them (about 30%) load from or store to memory. Let's call it 1.3 memory accesses per instruction at 50 ns per memory access. That's 65 ns just on memory, or only about 15 million instructions per second! Is this false advertising? How can so fast an execution speed be claimed with such slow memory access rates? The answer is that there's something between the CPU and main memory called a cache, which we'll describe in a minute.
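That back-of-envelope arithmetic can be checked directly; the numbers here are the stated assumptions (1.3 memory accesses per instruction, 50 ns per access), not measurements.

```python
# Back-of-envelope check of the vendor claim, under the stated
# assumptions: every instruction is fetched (1 access) and ~30% also
# load or store, i.e. about 1.3 memory accesses per instruction,
# at 50 ns per memory access.

accesses_per_instr = 1.3
access_time_ns = 50.0

ns_per_instr = accesses_per_instr * access_time_ns  # 65 ns on memory alone
mips = 1e9 / ns_per_instr / 1e6                     # millions of instructions/sec

print(ns_per_instr, mips)  # 65.0 ns per instruction, ~15.4 MIPS
```

Nowhere near "billions of instructions per second" if every access really went to main memory.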
First, we'll describe the memory hierarchy.
At the top is the CPU. It holds a small amount of code and data in registers that can be accessed in a fraction of a nanosecond - a single cycle.
Further down is main memory. It has more capacity than the registers, but is still limited. Most personal computers in use today have about a gigabyte of main memory.
Lower yet are various kinds of secondary storage, such as hard disks, USB sticks, DVDs, and tapes. A disk drive, for example, has huge capacity—half a trillion bytes for around $600 these days—but is also quite slow, with access in milliseconds.
This hierarchy is illustrated below. In general, the lower the level, the higher the latency (access time), and the lower the cost per bit.
The cache fits in just below the CPU. It's on the chip with the CPU, and so is quickly accessible; access time ranges from 0.5 to 5 nanoseconds. There are levels of cache—level 1, level 2, and so on—that differ in size and in the speed of the bus connecting them to the CPU. Modern personal computers have 64-256 KB level 1 caches for instructions and data, a 4 MB level 2 cache, and perhaps a 256 MB level 3 cache.
Here's an analogy for how the memory hierarchy is used. You're writing a term paper—the "processor"—at a table in your dorm. Doe Library is equivalent to a disk; it has essentially limitless capacity, but it takes a long time to retrieve a book. You have several good books on your bookshelf. The bookshelf is like main memory. Its capacity is smaller, which means you have to return a book when the shelf fills up. However, it's easier and faster to find a book there than to go to the library. The open books on the table are like the cache. Capacity is much less—an open book takes up more space than a closed one on the shelf, and you may have to shelve a book when too many others are open—but it's much faster to retrieve data. This organization creates the illusion of the whole library open on the tabletop: we keep as many recently used books as possible open on the table, since we're likely to use them again; we keep as many books as Doe will allow on our shelf, since it's faster to access them than to go to the library stacks.
Memory, then, contains copies of data on disk that have been recently used. Similarly, the cache contains copies of data in memory that have been recently used. In general, each successive level contains the most used data from the next lower level. Each load instruction first checks the cache to see if it contains the addressed data; only if it doesn't will main memory be accessed. (We'll discuss shortly how stores are handled.)
The actions we take at the various levels of the memory hierarchy are similar. When we make an access and find the data there (we call it a "hit"), we return the data immediately. When what we're looking for is not there (we call it a "miss"), we go to the next lower level and bring it up. (The frequency at which we have to dig down and do that is called the "miss rate".) However, the means for taking these actions are very different at different levels. For fast memory in front of slow memory, we call the fast one a cache, and everything is handled in hardware; it may take 1 cycle to process a hit and tens or hundreds of cycles to process a miss. For memory in front of disk, we call the whole illusion "virtual memory"; hits are handled in hardware, while misses are handled in software by the operating system, and a miss takes millions of instruction times.

(P&H exercises 7.3 and 7.4)
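The cost of that digging can be summarized with the standard average-memory-access-time formula: average time = hit time + miss rate × miss penalty. The cycle counts below are illustrative, echoing the ballpark figures in the text (1 cycle for a hit, on the order of 100 for a miss).

```python
# Average memory access time (AMAT) sketch. The formula is standard;
# the specific numbers are illustrative, not measurements.

def amat(hit_time, miss_rate, miss_penalty):
    """Average access time = hit_time + miss_rate * miss_penalty."""
    return hit_time + miss_rate * miss_penalty

# Even a 2% miss rate with a 100-cycle penalty triples the average cost
# relative to a 1-cycle hit.
print(amat(hit_time=1, miss_rate=0.02, miss_penalty=100))  # 3.0 cycles
```

This is why a tiny improvement in miss rate matters so much: the miss penalty dwarfs the hit time.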
Describe the general characteristics of a program that would exhibit very high amounts of temporal locality but very little spatial locality with regard to data accesses, and give an illustrative example.
Describe the general characteristics of a program that would exhibit very little temporal locality but very high amounts of spatial locality with regard to data accesses, and give an illustrative example.
A fully associative cache is managed just like a hardware version of an association table (which you would have seen in CS 61A) or a map (which you would have seen in CS 61B). It stores, in essence, a collection of a fixed number of address/contents pairs. In reality, it stores what P&H refer to as a tag, which contains only as much of the address as is necessary to find it in the table. P&H also refer to the contents as the data, and to a pair plus bookkeeping information (e.g. a "valid" bit that says whether the cache entry contains meaningful data) as a block.
When a load is encountered, the cache is searched for a block with the desired address. If a block with the desired address is in the cache, that's a hit, and the corresponding data is returned. If the cache does not contain a block with the desired address, that's a miss; the requested information is retrieved from main memory and also stored in the cache. But unlike a software lookup, which searches sequentially through all the tags in the table, the hardware checks them all in parallel.
When the collection fills up, something currently in the collection must be replaced by the incoming address/contents pair. With a small cache, the least recently used block might be replaced; that's the block that has been unreferenced for the longest time. (The circuitry to keep track of when a block was used is complicated.) With a larger cache, the block to replace might be chosen at random, since random choice is much easier to implement in hardware.
Using sequential search to find an entry in a fully associative cache would be too slow, so the search is done in parallel. This requires a comparator for each cache entry to check an address against all the addresses currently in the cache, plus a selector to access the correct contents. If the address is found, the associated contents is returned. A diagram appears below.
We assume that the cache contains only whole words. This allows us to economize a bit (actually 2 bits) in storing the address: just as in jump and branch instructions, where the last two bits of an address are always 0 and therefore aren't stored, we can get away with keeping just the upper 30 bits of the address as the tag.
Consider two programs, program A with high spatial locality and low temporal locality, and program B with just the opposite. Which of the following is more likely? Briefly explain your answer.
Here's an example of the use of a four-word fully associative cache with a least-recently-used replacement policy. The tag fields in the cache are omitted in the diagram to save space; the contents field in each cache entry contains M(addr), which means the word stored in memory at address addr, or "--", which means an invalid entry. The sequence of byte addresses accessed—the "reference string", in P&H terms—is 4 (00100 in binary), 8 (01000), 16 (10000), 24 (11000), 8 (01000), 12 (01100), 16 (10000), 12 (01100). The least recently used block appears in italics.
address accessed | hit or miss? | cache contents
4 (00100)        | miss         | M(4)  --    --    --
8 (01000)        | miss         | M(4)  M(8)  --    --
16 (10000)       | miss         | M(4)  M(8)  M(16) --
24 (11000)       | miss         | M(4)  M(8)  M(16) M(24)
8 (01000)        | hit          | M(4)  M(8)  M(16) M(24)
12 (01100)       | miss         | M(12) M(8)  M(16) M(24)
16 (10000)       | hit          | M(12) M(8)  M(16) M(24)
12 (01100)       | hit          | M(12) M(8)  M(16) M(24)
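The table above can be reproduced with a small simulator sketch (Python, one-word blocks, LRU replacement). The linear search here is just a software stand-in for the parallel tag comparison done in hardware.

```python
# A minimal fully associative cache with LRU replacement, assuming
# one-word blocks. Reproduces the hit/miss pattern of the worked
# example: reference string 4, 8, 16, 24, 8, 12, 16, 12.

def simulate(refs, capacity):
    cache = []          # addresses, least recently used first
    results = []
    for addr in refs:
        if addr in cache:
            results.append("hit")
            cache.remove(addr)      # refresh its LRU position
        else:
            results.append("miss")
            if len(cache) == capacity:
                cache.pop(0)        # evict the least recently used
        cache.append(addr)
    return results, cache

results, final = simulate([4, 8, 16, 24, 8, 12, 16, 12], capacity=4)
print(results)  # ['miss', 'miss', 'miss', 'miss', 'hit', 'miss', 'hit', 'hit']
print(final)    # [24, 8, 16, 12] -- LRU first; 4 was evicted by 12
```

Note how the miss on 12 evicts M(4), the least recently used block, exactly as in the table.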
Here is a series of address references (byte addresses):
8, 12, 44, 64, 84, 52, 256, 192, 76, 44, 12, 88, 16, 108, 24, 44
Assume that a fully associative cache with 8 one-word blocks is initially empty, and uses a least-recently-used replacement policy. Label each reference in the list as a hit or a miss, and show the final contents of the cache.
Go back to the source and change the rep count to 4. Assemble your program. Reset the cache simulator and the visualizer. Set the run speed to about 10 instructions per second so that you can see what is going on. Let it fly. Notice what happens in the cache on the first and second repetitions. You can check your answer as rep two completes, and see where the hit rate goes from there.
An elaboration on fully associative caching uses multi-word blocks. When a word is added to the cache, the words around it are also added. This tactic takes advantage of spatial locality.
Consider now a cache with 2-word blocks, as in the diagram below.
How is it decided, when an address/contents pair is to be added to the cache, whether the word before it or the word after it is to be added as well? Suppose addr is the byte address we want to add. The approach taken is to add the two words for which addr/8 (using integer division) is the same. For example, if addr is 24, we add the words at addresses 24 and 28. If addr is 36, we add the two words at addresses 32 and 36. This lets us save a bit in the tag—now we need only 29 bits to identify the address of a given cache entry.
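Here is a tiny sketch of the grouping rule: with 4-byte words, a 2-word block spans 8 bytes, so two addresses share a block exactly when integer-dividing by 8 gives the same result.

```python
# Which word addresses share a 2-word block? Words are 4 bytes, so a
# 2-word block covers 8 bytes; byte addresses with the same value of
# addr // 8 land in the same block.

def block_words(addr, block_bytes=8):
    """Return the word addresses in the block containing addr."""
    start = (addr // block_bytes) * block_bytes   # round down to block start
    return list(range(start, start + block_bytes, 4))

print(block_words(24))  # [24, 28]
print(block_words(36))  # [32, 36]
```

The same function with block_bytes=16 gives the grouping for 4-word blocks discussed below.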
Let's see how this works with the address references
0, 12, 4, 8
and a four-word cache with a block size of 2. We put the partner word in parentheses to show that it wasn't directly accessed.
accessed address | hit or miss? | cache contents
0                | miss         | 0, (4)   --
12               | miss         | 0, (4)   (8), 12
4                | hit          | 0, 4     (8), 12
8                | hit          | 0, 4     8, 12
Here is a series of address references (byte addresses):
8, 12, 44, 64, 84, 52, 256, 192, 76, 44, 12, 88, 16, 108, 24, 44
Assume that a fully associative cache with 4 two-word blocks is initially empty, and uses a least-recently-used replacement policy. Label each reference in the list as a hit or a miss, and show the final contents of the cache.
This multi-word block scheme can be extended to 4-word blocks, as in the diagram below, and beyond. Only 28 bits are needed for the tag for a 4-word block. The words in each block are those for which addr/16 is the same.
Here is a series of address references (byte addresses):
8, 12, 44, 64, 84, 52, 256, 192, 76, 44, 12, 88, 16, 108, 24, 44
Assume that a fully associative cache with 2 four-word blocks is initially empty, and uses a least-recently-used replacement policy. Label each reference in the list as a hit or a miss, and show the final contents of the cache.
In a fully associative cache, a particular tag/data pair can be anywhere in the cache. We've already noted that the circuitry needed to find something in the cache is complicated. In a direct-mapped cache, each memory address is associated with exactly one location within the cache. This is inflexible but easy to implement in hardware.
The block location in the cache is determined by bits in the address. With an 8-word direct-mapped cache, the last two bits in the address specify a position within the word. The next three bits in the address (counting from the right) are an index into one of the 8 cache locations, as shown in P&H Figure 7.5 below.
The remainder is the tag with which a cache entry is identified. The figure below shows how the bits are classified in an 8-word direct-mapped cache: "t" is part of the tag, "i" is part of the index, and "p" is part of the position within the word.
tttttttttttttttttttttttttttiiipp
To see if a given address is in the cache, we isolate bits 4-2 of the address—call them index—and then compare the tag in the index'th entry with the tag from the address we're looking for.
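The bit slicing can be sketched as follows, assuming 32-bit byte addresses and the 8-word cache described above.

```python
# Splitting an address for an 8-word direct-mapped cache with one-word
# blocks: bits 1-0 are the byte position within the word, bits 4-2 are
# the index, and the remaining 27 bits are the tag.

def split_address(addr):
    position = addr & 0b11         # bits 1-0
    index = (addr >> 2) & 0b111    # bits 4-2
    tag = addr >> 5                # bits 31-5
    return tag, index, position

print(split_address(44))  # 44 = 0b101100 -> tag 1, index 3, position 0
```

A hit means the tag stored at slot `index` equals `tag` (and the valid bit is set); the position bits merely select a byte within the cached word.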
Here is a short example that uses a 4-word cache. Bits 3 and 2 are used to index the cache.
address accessed | hit or miss? | cache contents
4 (00100)        | miss         | --    M(4) --    --
8 (01000)        | miss         | --    M(4) M(8)  --
16 (10000)       | miss         | M(16) M(4) M(8)  --
24 (11000)       | miss         | M(16) M(4) M(24) --
8 (01000)        | miss         | M(16) M(4) M(8)  --
12 (01100)       | miss         | M(16) M(4) M(8)  M(12)
16 (10000)       | hit          | M(16) M(4) M(8)  M(12)
12 (01100)       | hit          | M(16) M(4) M(8)  M(12)
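The direct-mapped lookup above can likewise be reproduced with a short sketch (Python, one-word blocks). Storing the full address in each slot stands in for the valid bit plus tag that real hardware keeps.

```python
# A direct-mapped cache simulator with one-word blocks. For the
# four-word cache in the example, bits 3-2 of the address (i.e. the
# word address mod 4) select the slot. Reproduces the table above.

def simulate_direct(refs, num_slots):
    slots = [None] * num_slots
    results = []
    for addr in refs:
        index = (addr // 4) % num_slots   # word address mod cache size
        if slots[index] == addr:
            results.append("hit")
        else:
            results.append("miss")
            slots[index] = addr           # replace whatever was there
    return results, slots

results, final = simulate_direct([4, 8, 16, 24, 8, 12, 16, 12], num_slots=4)
print(results)  # ['miss', 'miss', 'miss', 'miss', 'miss', 'miss', 'hit', 'hit']
print(final)    # [16, 4, 8, 12]
```

Notice that 8 and 24 collide on slot 2, so the second reference to 8 misses here even though a fully associative cache of the same size would have hit.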
Like a fully associative cache, a direct-mapped cache can be used with multi-word blocks. The rightmost bits in an address then specify a position within the block. If there are two words per block, there are three position bits; four words per block require four position bits; and so on. The next bits specify an index within the cache. The remainder are the tag as before.
Here's a picture of an eight-word direct-mapped cache with a two-word block size, followed by the corresponding use of the bits within an address.
tttttttttttttttttttttttttttiippp
Here is a series of address references (byte addresses):
8, 12, 44, 64, 84, 52, 256, 192, 76, 44, 12, 88, 16, 108, 24, 44
Assume that a direct-mapped cache with eight words and two words per block is initially empty. Label each reference in the list as a hit or a miss, and show the final contents of the cache.
Now let's modify the program slightly to change the address stream that it generates. Add a variable to the beginning of the data segment.
.data
counter: .word 0
array:   .space 2048

Add an instruction outside the loop to set up a pointer to this variable.

la $t2, counter

And add an instruction in the inner loop to store to this variable.

addu $s4, $s4, $s1
addu $s5, $s5, $s1
sw   $s3, 0($t2)
blt  $s4, $s3, innerloop

Set a breakpoint or slow down the simulation rate so that you can see the effect of mixing stores to counter with stepping through array.
Compare the hit rates with associative and direct mapped caches on this program.
A hybrid organization that combines features of direct-mapped and fully associative caches is a set-associative cache. An N-way set-associative cache is like a direct-mapped cache, each of whose elements is a fully associative cache of N elements. Thus the cache consists of a number of sets, each of which consists of N blocks. Each block in the memory maps to a unique set in the cache given by the index field, and a block can be placed in any element of that set.
The figures below portray a two-way set-associative cache and a four-way set-associative cache, both with a total of eight words.
2-way set-associative cache | 4-way set-associative cache |
Here's a short example of the use of a 2-way set-associative cache with four words. Bit 2 is used to choose the set.
address accessed | hit or miss? | cache contents
4 (00100)        | miss         | --, --        M(4), --
8 (01000)        | miss         | M(8), --      M(4), --
16 (10000)       | miss         | M(8), M(16)   M(4), --
24 (11000)       | miss         | M(24), M(16)  M(4), --
8 (01000)        | miss         | M(24), M(8)   M(4), --
12 (01100)       | miss         | M(24), M(8)   M(4), M(12)
16 (10000)       | miss         | M(16), M(8)   M(4), M(12)
12 (01100)       | hit          | M(16), M(8)   M(4), M(12)
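Here's a sketch reproducing the 2-way set-associative example (Python, one-word blocks, LRU within each set). For the four-word cache, bit 2 of the address—the low bit of the word address—selects the set.

```python
# A 2-way set-associative cache with LRU replacement within each set.
# Reproduces the table above: four words total, two sets of two blocks.

def simulate_set_assoc(refs, num_sets, ways):
    sets = [[] for _ in range(num_sets)]   # each set: LRU-first list
    results = []
    for addr in refs:
        s = (addr // 4) % num_sets         # low bits of the word address
        if addr in sets[s]:
            results.append("hit")
            sets[s].remove(addr)           # refresh LRU position
        else:
            results.append("miss")
            if len(sets[s]) == ways:
                sets[s].pop(0)             # evict LRU within the set
        sets[s].append(addr)
    return results, sets

results, final = simulate_set_assoc([4, 8, 16, 24, 8, 12, 16, 12],
                                    num_sets=2, ways=2)
print(results)  # ['miss', 'miss', 'miss', 'miss', 'miss', 'miss', 'miss', 'hit']
print(final)    # [[8, 16], [4, 12]]
```

Addresses 8, 16, and 24 all map to set 0, so they keep evicting one another, while set 1 (holding 4 and 12) stays quiet.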
Here is a series of address references (byte addresses):
8, 12, 44, 64, 84, 52, 256, 192, 76, 44, 12, 88, 16, 108, 24, 44
Assume that a 2-way set associative cache with eight words total is initially empty, and uses a least-recently-used replacement policy. Label each reference in the list as a hit or a miss, and show the final contents of the cache.
Today's activities introduced caches and ways to organize them. The two extremes of organization are fully associative, where a cache entry may go anywhere in the cache, and direct-mapped, where each entry has exactly one predetermined location. An n-way set-associative cache splits the entries into sets of n elements, which are then searched as in a fully associative cache. Another design variable is the block size; bigger blocks improve performance for programs with high spatial locality.
Coming up next lab are activities involving inference of a cache's characteristics from program timings. We will also consider ways to be cache-conscious programmers.