#### **CS162 Operating Systems and Systems Programming Lecture 10**

#### **Caches and TLBs**

October 7, 2013 Anthony D. Joseph and John Canny http://inst.eecs.berkeley.edu/~cs162

#### **Review: Address Segmentation**

[Figure: virtual memory view and physical memory view of the code, data, heap, and stack segments. A segment map of base and limit values, indexed by the seg # field of the virtual address (seg #, offset), relates the two views.]

10/7/13 Anthony D. Joseph and John Canny CS162 ©UCB Fall 2013 Lec 10.3

# **Goals for Today's Lecture**

- Paging- and Segmentation-based Translation Recap
- Multi-level Translation
- Caching
  - Misses
  - Organization
- Translation Lookaside Buffers (TLBs)
- How Caching and TLBs fit into the Virtual Memory Architecture

Note: Some slides and/or pictures in the following are adapted from slides ©2005 Silberschatz, Galvin, and Gagne. Slides courtesy of Anthony D. Joseph, John Kubiatowicz, AJ Shankar, George Necula, Alex Aiken, Eric Brewer, Ras Bodik, Ion Stoica, Doug Tygar, and David Wagner.






















|                            | Advantages                                                      | Disadvantages                              |
|----------------------------|-----------------------------------------------------------------|--------------------------------------------|
| Segmentation               | Fast context switching: segment mapping maintained by CPU       | External fragmentation                     |
| Paging (single-level page) | No external fragmentation, fast easy allocation                 | Large table size ~ virtual memory          |
| Paged segmentation         | Table size ~ # of pages in virtual memory, fast easy allocation | Multiple memory references per page access |
| Two-level pages            | Table size ~ # of pages in virtual memory, fast easy allocation | Multiple memory references per page access |

# **Multi-level Translation Analysis**

- Pros:
  - Only need to allocate as many page table entries as we need for application – size is proportional to usage
    - » In other words, sparse address spaces are easy
  - Easy memory allocation
  - Easy Sharing
    - » Share at segment or page level (need additional reference counting)
- Cons:
  - One pointer per page (typically 4K-16K pages today)
  - Page tables need to be contiguous
    - » However, previous example keeps tables to exactly one page in size
  - Three (or more, if >2 levels) memory lookups per reference
    - » Seems very expensive!
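
The lookup cost and the sparse allocation can be sketched in a few lines. This is a minimal illustration, not the scheme of any particular machine: the 10/10/12 bit split, the dictionary-based tables, and the names `translate` and `root` are all assumptions made for the example.

```python
PAGE_BITS = 12    # assumed 4 KB pages
INDEX_BITS = 10   # assumed 10 index bits per level (32-bit virtual address)


class PageFault(Exception):
    """Raised when no valid mapping exists for a virtual address."""


def translate(root, vaddr):
    """Walk a two-level table; return (physical address, memory lookups)."""
    offset = vaddr & ((1 << PAGE_BITS) - 1)
    l2_index = (vaddr >> PAGE_BITS) & ((1 << INDEX_BITS) - 1)
    l1_index = (vaddr >> (PAGE_BITS + INDEX_BITS)) & ((1 << INDEX_BITS) - 1)

    second_level = root.get(l1_index)      # lookup 1: first-level table
    if second_level is None:
        raise PageFault(hex(vaddr))
    frame = second_level.get(l2_index)     # lookup 2: second-level table
    if frame is None:
        raise PageFault(hex(vaddr))
    paddr = (frame << PAGE_BITS) | offset  # lookup 3 is the data access itself
    return paddr, 3


# Sparse address space: only two second-level tables are ever allocated,
# one near the bottom (code) and one at the top (stack).
root = {0: {1: 0x2A}, 1023: {1023: 0x07}}

print(hex(translate(root, 0x00001234)[0]))
print(hex(translate(root, 0xFFFFF010)[0]))
```

Every untouched region of the address space costs nothing here, but each reference still pays for two table lookups before the data access, which is the expense the last bullet points at.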


# **Caching Concept**



- Cache: a repository for copies that can be accessed more quickly than the original
  - Make frequent case fast and infrequent case less dominant
- Caching at different levels
  - Can cache: memory locations, address translations, pages, file blocks, file names, network routes, etc...
- Only good if:
  - Frequent case frequent enough and
  - Infrequent case not too expensive
- Important measure: Average Access Time = (Hit Rate × Hit Time) + (Miss Rate × Miss Time)
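
As a quick worked example of the average access time measure, the sketch below plugs in made-up numbers (a 1 ns hit time and a 100 ns miss time are assumptions, not measurements) to show how sensitive the average is to the hit rate.

```python
def average_access_time(hit_rate, hit_time, miss_time):
    """Average access time = hit_rate * hit_time + miss_rate * miss_time."""
    miss_rate = 1.0 - hit_rate
    return hit_rate * hit_time + miss_rate * miss_time


# Hypothetical timings in nanoseconds.
for rate in (0.90, 0.99):
    print(rate, round(average_access_time(rate, 1, 100), 2))
```

With these numbers, a 90% hit rate gives roughly 10.9 ns and a 99% hit rate roughly 2 ns: the frequent case has to be very frequent before the miss time stops dominating.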








# **Sources of Cache Misses**

- Compulsory (cold start): first reference to a block
  - "Cold" fact of life: not a whole lot you can do about it
  - Note: When running "billions" of instructions, compulsory misses are insignificant
- Capacity:
  - Cache cannot contain all blocks accessed by the program
  - Solution: increase cache size
- Conflict (collision):
  - Multiple memory locations mapped to same cache location
  - Solutions: increase cache size, or increase associativity
- Two others:
  - Coherence (invalidation): other process (e.g., I/O) updates memory
  - Policy: due to non-optimal replacement policy
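
The first three categories can be told apart in a small simulation. The sketch below assumes one common convention: a miss in a direct-mapped cache is compulsory if the block was never referenced before, capacity if a fully associative LRU cache of the same size would also miss, and conflict otherwise. The function name and the traces are made up for illustration.

```python
from collections import OrderedDict


def classify_misses(block_trace, num_lines):
    """Classify direct-mapped cache misses as compulsory, capacity, or conflict."""
    direct = {}            # cache index -> block currently held there
    full = OrderedDict()   # fully associative LRU cache of the same size
    seen = set()           # every block referenced so far
    counts = {"hit": 0, "compulsory": 0, "capacity": 0, "conflict": 0}

    for block in block_trace:
        index = block % num_lines
        dm_hit = direct.get(index) == block
        fa_hit = block in full

        # Update the fully associative reference model.
        if fa_hit:
            full.move_to_end(block)
        else:
            if len(full) >= num_lines:
                full.popitem(last=False)   # evict least recently used
            full[block] = True
        direct[index] = block

        if dm_hit:
            counts["hit"] += 1
        elif block not in seen:
            counts["compulsory"] += 1
        elif not fa_hit:
            counts["capacity"] += 1
        else:
            counts["conflict"] += 1
        seen.add(block)
    return counts


# Blocks 0 and 4 collide in a 4-line direct-mapped cache.
print(classify_misses([0, 4, 0, 4], num_lines=4))
# Five distinct blocks do not fit in four lines at all.
print(classify_misses([0, 1, 2, 3, 4, 0], num_lines=4))
```

In the first trace the repeated misses disappear with more associativity; in the second only a larger cache helps.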
















### **Administrivia**

- Project #1:
  - Code due Tuesday Oct 8 by 11:59pm
  - Design doc (submit proj1-final-design) and group evals (Google Docs form) due Wed 10/9 at 11:59PM
    - » Group evals are anonymous to your group
- Midterm #1 is Monday Oct 21 5:30-7pm in 145 Dwinelle (A-L) and 2060 Valley LSB (M-Z)
  - Closed book, one double-sided handwritten page of notes; no calculators, smartphones, Google Glass, etc.
  - Covers lectures #1-13 (Disks/SSDs, Filesystems), readings, handouts, and projects 1 and 2
  - Review session 390 Hearst Mining, Fri October 18, 5-7 PM
- Class feedback is always welcome!












# Which Block Should be Replaced on a Miss?

- Easy for Direct Mapped: Only one possibility
- Set Associative or Fully Associative:
  - Random
  - LRU (Least Recently Used)

#### Example TLB miss rates:

| Size   | 2-way LRU | 2-way Random | 4-way LRU | 4-way Random | 8-way LRU | 8-way Random |
|--------|-----------|--------------|-----------|--------------|-----------|--------------|
| 16 KB  | 5.2%      | 5.7%         | 4.7%      | 5.3%         | 4.4%      | 5.0%         |
| 64 KB  | 1.9%      | 2.0%         | 1.5%      | 1.7%         | 1.4%      | 1.5%         |
| 256 KB | 1.15%     | 1.17%        | 1.13%     | 1.13%        | 1.12%     | 1.12%        |
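
The two replacement policies are easy to compare in a toy simulator. The sketch below is only an illustration: the `miss_rate` function, its parameters, and the traces are assumptions, and it makes no attempt to reproduce the numbers in the table.

```python
import random
from collections import OrderedDict


def miss_rate(block_trace, num_sets, ways, policy="lru", seed=0):
    """Miss rate of a set-associative cache under LRU or random replacement."""
    rng = random.Random(seed)   # seeded so runs are repeatable
    sets = [OrderedDict() for _ in range(num_sets)]
    misses = 0
    for block in block_trace:
        lines = sets[block % num_sets]
        if block in lines:
            lines.move_to_end(block)              # now most recently used
            continue
        misses += 1
        if len(lines) >= ways:
            if policy == "lru":
                victim = next(iter(lines))        # least recently used
            else:
                victim = rng.choice(list(lines))  # any resident block
            del lines[victim]
        lines[block] = True
    return misses / len(block_trace)


# Cycling through 3 blocks in a 2-way set is the worst case for LRU:
# the block about to be reused is always the one evicted.
cyclic = [0, 1, 2] * 4
print(miss_rate(cyclic, num_sets=1, ways=2, policy="lru"))
print(miss_rate(cyclic, num_sets=1, ways=2, policy="random"))
```

With `ways=1` the victim choice disappears, which is the direct-mapped case from the first bullet.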


# What Happens on a Write?

- Write through: The information is written both to the block in the cache and to the block in the lower-level memory
- Write back: The information is written only to the block in the cache.
  - Modified cache block is written to main memory only when it is replaced
  - Question: is block clean or dirty?
- Pros and Cons of each?
  - WT:
    - » PRO: read misses cannot result in writes
    - » CON: processor held up on writes unless writes buffered
  - WB:
    - » PRO: repeated writes not sent to DRAM; processor not held up on writes
    - » CON: More complex; read miss may require writeback of dirty data
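
A tiny model makes the trade-off concrete. The sketch below assumes a cache that holds a single block and allocates on write misses; both simplifications, and the name `memory_writes`, are assumptions for illustration only. It counts how many writes reach memory under each policy.

```python
def memory_writes(ops, policy):
    """Count writes that reach memory. ops is a list of ("r" | "w", block)."""
    cached, dirty, writes = None, False, 0
    for op, block in ops:
        if block != cached:                  # miss: replace the resident block
            if policy == "write-back" and dirty:
                writes += 1                  # dirty victim must be written back
            cached, dirty = block, False
        if op == "w":
            if policy == "write-through":
                writes += 1                  # every write goes to memory
            else:
                dirty = True                 # only mark the block as modified
    return writes


# Three writes to block 0, then a read of block 1.
ops = [("w", 0), ("w", 0), ("w", 0), ("r", 1)]
print(memory_writes(ops, "write-through"))
print(memory_writes(ops, "write-back"))
```

Write-through sends all three writes to memory. Write-back sends one, but it is the read miss on block 1 that triggers it. Any block still dirty at the end of the trace is not counted.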


# **Caching Applied to Address Translation**



- Question is one of page locality: does it exist?
  - Instruction accesses spend a lot of time on the same page (since accesses sequential)
  - Stack accesses have definite locality of reference
  - Data accesses have less page locality, but still some...
- Can we have a TLB hierarchy?
  - Sure: multiple levels at different sizes/speeds



# **What Actually Happens on a TLB Miss?**

- Hardware traversed page tables:
  - On TLB miss, hardware in MMU looks at current page table to fill TLB (may walk multiple levels)
    - » If PTE valid, hardware fills TLB and processor never knows
    - » If PTE marked as invalid, causes Page Fault, after which kernel decides what to do afterwards
- Software traversed page tables:
  - On TLB miss, processor receives TLB fault
  - Kernel traverses page table to find PTE
    - » If PTE valid, fills TLB and returns from fault
    - » If PTE marked as invalid, internally calls Page Fault handler
- Most chip sets provide hardware traversal
  - Modern operating systems tend to have more TLB faults since they use translation for many things
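
The software-traversed path can be sketched as a lookup function. This is a simplified model under stated assumptions: the TLB and the page table are plain dictionaries keyed by virtual page number, and the names `lookup` and `PageFault` are hypothetical.

```python
class PageFault(Exception):
    """Raised when the page table entry is missing or marked invalid."""


def lookup(tlb, page_table, vpn):
    """Return (frame, "hit" | "miss") for a virtual page number."""
    if vpn in tlb:
        return tlb[vpn], "hit"
    pte = page_table.get(vpn)        # TLB miss: traverse the page table
    if pte is None or not pte["valid"]:
        raise PageFault(vpn)         # invalid PTE: page fault handler takes over
    tlb[vpn] = pte["frame"]          # valid PTE: fill the TLB
    return pte["frame"], "miss"


tlb = {}
page_table = {
    5: {"valid": True, "frame": 42},
    6: {"valid": False, "frame": None},
}
print(lookup(tlb, page_table, 5))    # first access walks the page table
print(lookup(tlb, page_table, 5))    # second access is served by the TLB
```

A hardware walker follows the same steps. The difference is only in who runs them: the MMU fills the TLB silently, while here the kernel does it inside a fault handler.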


# What happens on a Context Switch?

- Need to do something, since TLBs map virtual addresses to physical addresses
  - Address Space just changed, so TLB entries no longer valid!
- Options?
  - Invalidate TLB: simple but might be expensive
    - » What if switching frequently between processes?
  - Include ProcessID in TLB
    - » This is an architectural solution: needs hardware
- What if translation tables change?
  - For example, to move page from memory to disk or vice versa...
  - Must invalidate TLB entry!
    - » Otherwise, might think that page is still in memory!
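
The options above can be modeled with a TLB whose entries are tagged by process. The class below is a sketch under assumptions: the `TaggedTLB` name, its methods, and the dictionary keyed by `(pid, vpn)` are made up for illustration and stand in for what real hardware does with an extra tag field.

```python
class TaggedTLB:
    """Toy TLB whose entries are tagged with a process ID."""

    def __init__(self):
        self.entries = {}                    # (pid, vpn) -> frame

    def fill(self, pid, vpn, frame):
        self.entries[(pid, vpn)] = frame

    def lookup(self, pid, vpn):
        return self.entries.get((pid, vpn))  # None means TLB miss

    def invalidate(self, pid, vpn):
        """Drop one entry, e.g. after its page moves to disk."""
        self.entries.pop((pid, vpn), None)

    def flush(self):
        """The simple option: drop everything on a context switch."""
        self.entries.clear()


tlb = TaggedTLB()
tlb.fill(pid=1, vpn=0x10, frame=7)
tlb.fill(pid=2, vpn=0x10, frame=9)   # same virtual page, different process

print(tlb.lookup(1, 0x10), tlb.lookup(2, 0x10))
tlb.invalidate(1, 0x10)              # translation table changed for process 1
print(tlb.lookup(1, 0x10), tlb.lookup(2, 0x10))
```

With tags, both processes keep their entries across a switch, and a page-table change only has to remove the one stale entry.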


# What TLB organization makes sense?



- Needs to be really fast
  - Critical path of memory access
  - Seems to argue for Direct Mapped or Low Associativity
- However, needs to have very few conflicts!
  - With TLB, the Miss Time is extremely high!
  - This argues that cost of Conflict (Miss Time) is much higher than slightly increased cost of access (Hit Time)
- Thrashing: continuous conflicts between accesses
  - What if we use low order bits of page as index into TLB?
    - » First page of code, data, stack may map to same entry
    - » Need 3-way associativity at least?
  - What if we use high order bits as index?
    - » TLB mostly unused for small programs
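
Both indexing choices can be tried on a made-up layout. The sketch below assumes a 16-entry direct-mapped TLB, 20-bit virtual page numbers, and hypothetical starting pages for the code, data, and stack regions; none of these values come from a real system.

```python
INDEX_BITS = 4                  # assumed 16-entry direct-mapped TLB
TLB_ENTRIES = 1 << INDEX_BITS
VPN_BITS = 20                   # assumed 32-bit addresses with 4 KB pages


def low_index(vpn):
    """Index with the low-order bits of the virtual page number."""
    return vpn % TLB_ENTRIES


def high_index(vpn):
    """Index with the high-order bits of the virtual page number."""
    return vpn >> (VPN_BITS - INDEX_BITS)


# Hypothetical first pages of three regions, each aligned to 16 pages.
first_pages = {"code": 0x00400, "data": 0x10000, "stack": 0x7FFF0}

print({name: low_index(vpn) for name, vpn in first_pages.items()})
print({name: high_index(vpn) for name, vpn in first_pages.items()})

# A small program touching 16 consecutive code pages.
print({high_index(0x00400 + i) for i in range(16)})
```

With low-order bits, all three first pages land in entry 0 and evict each other. With high-order bits they separate, but the 16 consecutive code pages all share a single entry, leaving the rest of the TLB idle.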


### TLB organization: include protection

- How big does TLB actually have to be?
  - Usually small: 128-512 entries
  - Not very big, can support higher associativity
- TLB usually organized as fully-associative cache
  - Lookup is by Virtual Address
  - Returns Physical Address + other info
- What happens when fully-associative is too slow?
  - Put a small (4-16 entry) direct-mapped cache in front
  - Called a "TLB Slice"
- When does TLB lookup occur relative to memory cache access?
  - Before memory cache lookup?
  - In parallel with memory cache lookup?


# **Reducing translation time further**

As described, TLB lookup is in serial with cache lookup:



[Figure: TLB lookup producing the physical address]

- Machines with TLBs go one step further: they overlap TLB lookup with cache access.
  - Works because offset available early


# Overlapping TLB & Cache Access (1/2)

- Main idea:
  - Offset in virtual address exactly covers the "cache index" and "byte select"
  - Thus can select the cached byte(s) in parallel with performing address translation




# **Overlapping TLB & Cache Access (2/2)**

- Here is how this might work with a 4K cache:



- What if cache size is increased to 8KB?
  - Overlap not complete
  - Need to do something else. See CS152/252
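
The bit arithmetic behind "overlap not complete" can be checked directly. The sketch below assumes 4 KB pages and power-of-two sizes; the `overlap_complete` function and its `ways` parameter are hypothetical names. The final call shows one commonly cited workaround, higher associativity, which the slides themselves do not cover.

```python
PAGE_SIZE = 4096   # assumed 4 KB pages, so a 12-bit page offset


def overlap_complete(cache_size, ways=1, page_size=PAGE_SIZE):
    """True if the cache index and byte-select bits fit in the page offset."""
    bytes_per_way = cache_size // ways
    # For powers of two, bit_length() - 1 is log2.
    index_and_byte_bits = bytes_per_way.bit_length() - 1
    offset_bits = page_size.bit_length() - 1
    return index_and_byte_bits <= offset_bits


print(overlap_complete(4 * 1024))           # 12 bits needed, 12 available
print(overlap_complete(8 * 1024))           # 13 bits needed, 12 available
print(overlap_complete(8 * 1024, ways=2))   # each way again needs only 12 bits
```

When the check fails, at least one index bit comes from the virtual page number, so the cache cannot be indexed until translation has finished.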










# **Summary (2/2)**

- Cache Organizations:
  - Direct Mapped: single block per set
  - Set associative: more than one block per set
  - Fully associative: all entries equivalent
- TLB is cache on address translations
  - Fully associative to reduce conflicts
  - Can be overlapped with cache access
