# CS162 Operating Systems and Systems Programming Lecture 13

# Address Translation (con't) Caches and TLBs

October 16, 2006 Prof. John Kubiatowicz http://inst.eecs.berkeley.edu/~cs162

### **Review: Exceptions: Traps and Interrupts**

- A system call instruction causes a synchronous exception (or "trap")
  - In fact, often called a software "trap" instruction
- Other sources of synchronous exceptions:
  - Divide by zero, Illegal instruction, Bus error (bad address, e.g. unaligned access)
  - Segmentation Fault (address out of range)
  - Page Fault (for illusion of infinite-sized memory)
- Interrupts are Asynchronous Exceptions
  - Examples: timer, disk ready, network, etc.
  - Interrupts can be disabled, traps cannot!
- On system call, exception, or interrupt:
  - Hardware enters kernel mode with interrupts disabled
  - Saves PC, then jumps to appropriate handler in kernel
  - For some processors (x86), processor also saves registers, changes stack, etc.
- Actual handler typically saves registers, other CPU state, and switches to kernel stack

## **Review: Multi-level Translation**

- What about a tree of tables?
  - Lowest level page table ⇒ memory still allocated with bitmap
  - Higher levels often segmented
- Could have any number of levels. Example (top segment):





## Goals for Today

- Finish discussion of Address Translation
- Caching and TLBs

Note: Some slides and/or pictures in the following are adapted from slides ©2005 Silberschatz, Galvin, and Gagne

10/16/06 Kubiatowicz CS162 ©UCB Fall 2006

## What is in a PTE?

- What is in a Page Table Entry (or PTE)?
  - Pointer to next-level page table or to actual page
  - Permission bits: valid, read-only, read-write, write-only
- Example: Intel x86 architecture PTE:
  - Address: same format as previous slide (10, 10, 12-bit offset)
  - Intermediate page tables called "Directories"



- P: Present (same as "valid" bit in other architectures)
- W: Writeable
- U: User accessible
- PWT: Page write transparent: external cache write-through
- PCD: Page cache disabled (page cannot be cached)
- A: Accessed: page has been accessed recently
- D: Dirty (PTE only): page has been modified recently
- L: L=1 ⇒ 4MB page (directory only). Bottom 22 bits of virtual address serve as offset
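The field list above can be turned into a small decoder. This is a minimal sketch, not the slide's own material: the bit positions follow the standard 32-bit x86 PTE layout (the slide's figure is not reproduced in the text), and the names `PTE_FLAGS`, `decode_pte`, and `split_vaddr` are hypothetical.

```python
# Assumed bit positions: standard 32-bit x86 PTE layout (low 12 bits are flags).
PTE_FLAGS = {
    "P": 0,    # Present
    "W": 1,    # Writeable
    "U": 2,    # User accessible
    "PWT": 3,  # Page write transparent (external cache write-through)
    "PCD": 4,  # Page cache disabled
    "A": 5,    # Accessed
    "D": 6,    # Dirty (PTE only)
    "L": 7,    # 4MB page (directory only)
}


def decode_pte(pte):
    """Return (page frame number, set of flag names) for a 32-bit entry."""
    frame = pte >> 12
    flags = {name for name, bit in PTE_FLAGS.items() if (pte >> bit) & 1}
    return frame, flags


def split_vaddr(vaddr):
    """Split a 32-bit virtual address into (directory index, table index, offset)."""
    return (vaddr >> 22) & 0x3FF, (vaddr >> 12) & 0x3FF, vaddr & 0xFFF
```

For example, `decode_pte(0x12345067)` yields frame `0x12345` with the P, W, U, A, and D flags set, and `split_vaddr` applies the 10, 10, 12-bit split mentioned above.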

## Multi-level Translation Analysis

- Pros:
  - Only need to allocate as many page table entries as we need for application
    - » In other words, sparse address spaces are easy
  - Easy memory allocation
  - Easy Sharing
    - » Share at segment or page level (need additional reference counting)
- Cons:
  - One pointer per page (typically 4K-16K pages today)
  - Page tables need to be contiguous
    - » However, previous example keeps tables to exactly one page in size
  - Two (or more, if >2 levels) lookups per reference
    - » Seems very expensive!
- Really starts to be a problem for 64-bit address space:
  - How big is virtual memory space vs physical memory?
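To see why 64-bit address spaces are a problem, it helps to compute how large a single flat table would be. A rough sketch, assuming 4 KB pages and PTE sizes of 4 and 8 bytes (these sizes and the name `flat_page_table_bytes` are illustrative assumptions, not figures from the lecture):

```python
def flat_page_table_bytes(vaddr_bits, page_bytes, pte_bytes):
    """Size of a single-level page table covering the whole virtual address space."""
    num_pages = 2 ** vaddr_bits // page_bytes   # one PTE per virtual page
    return num_pages * pte_bytes


# 32-bit space, 4 KB pages, 4-byte PTEs: 2**20 entries, 4 MB of table.
print(flat_page_table_bytes(32, 4096, 4))
# 64-bit space, 4 KB pages, 8-byte PTEs: 2**52 entries, 2**55 bytes of table.
print(flat_page_table_bytes(64, 4096, 8))
```

The second figure is far larger than any physical memory, which is why the table size has to track the memory actually in use rather than the size of the virtual address space.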

## Examples of how to use a PTE

- How do we use the PTE?
  - Invalid PTE can imply different things:
    - » Region of address space is actually invalid or
    - » Page/directory is just somewhere else than memory
  - Validity checked first
    - » OS can use other (say) 31 bits for location info
- Usage Example: Demand Paging

  - Keep only active pages in memory
  - Place others on disk and mark their PTEs invalid
- Usage Example: Copy on Write
  - UNIX fork gives copy of parent address space to child
    - » Address spaces disconnected after child created
  - How to do this cheaply?
    - » Make copy of parent's page tables (point at same memory)
    - » Mark entries in both sets of page tables as read-only
    - » Page fault on write creates two copies
- Usage Example: Zero Fill On Demand
  - New data pages must carry no information (say be zeroed)
  - Mark PTEs as invalid; page fault on use gets zeroed page
  - Often, OS creates zeroed pages in background
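The copy-on-write steps can be sketched as a toy simulation. This is only an illustration under simplifying assumptions: the "page fault" is an ordinary branch inside `write`, pages are `bytearray` objects, and the classes `Frame` and `AddressSpace` are hypothetical names, not kernel interfaces.

```python
class Frame:
    """A physical page with a reference count."""

    def __init__(self, data):
        self.data = bytearray(data)
        self.refs = 1


class AddressSpace:
    def __init__(self):
        self.table = {}  # virtual page number -> [frame, writable]

    def map(self, vpn, data):
        self.table[vpn] = [Frame(data), True]

    def fork(self):
        """Copy the page table only; both copies become read-only."""
        child = AddressSpace()
        for vpn, entry in self.table.items():
            frame = entry[0]
            frame.refs += 1
            entry[1] = False                    # parent entry now read-only
            child.table[vpn] = [frame, False]   # child points at the same memory
        return child

    def write(self, vpn, offset, value):
        entry = self.table[vpn]
        frame, writable = entry
        if not writable:                 # simulated page fault on write
            if frame.refs > 1:           # still shared: make a private copy
                frame.refs -= 1
                frame = Frame(frame.data)
                entry[0] = frame
            entry[1] = True
        frame.data[offset] = value

    def read(self, vpn, offset):
        return self.table[vpn][0].data[offset]
```

After `child = parent.fork()`, both address spaces share every frame; the first write by either side copies only the page that was written.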



- Anything hardware can do can also be done in software (possibly expensive)
- Protection via Strong Typing
  - Restrict programming language so that you can't express program that would trash another program
  - Loader needs to make sure that program produced by valid compiler or all bets are off
  - Example languages: LISP, Ada, Modula-3 and Java
- Protection via software fault isolation:
  - Language independent approach: have compiler generate object code that provably can't step out of bounds
    - » Compiler puts in checks for every "dangerous" operation (loads, stores, etc). Again, need special loader.
    - » Alternative, compiler generates "proof" that code cannot do certain things (Proof Carrying Code)

  - Or: use virtual machine to guarantee safe behavior (loads and stores recompiled on the fly to check bounds)

- Also will get solutions up very soon!
- Project 2 is started!
  - We moved the design document due date to Wednesday (10/18) at 11:59pm
  - Always keep up with the project schedule by looking on the "Lectures" page
- Make sure to come to sections!
  - There will be a lot of information about the projects that I cannot cover in class
  - Also supplemental information and detail that we don't have time for in class

## Caching Concept



- Cache: a repository for copies that can be accessed more quickly than the original
  - Make frequent case fast and infrequent case less dominant
- Caching underlies many of the techniques that are used today to make computers fast
  - Can cache: memory locations, address translations, pages, file blocks, file names, network routes, etc...
- Only good if:
  - Frequent case frequent enough and
  - Infrequent case not too expensive
- Important measure: Average Access time = (Hit Rate × Hit Time) + (Miss Rate × Miss Time)
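The average access time formula can be evaluated directly. A minimal sketch, assuming the miss rate is simply one minus the hit rate; the function name and the sample timings are illustrative, not measurements:

```python
def average_access_time(hit_rate, hit_time, miss_time):
    """(Hit Rate x Hit Time) + (Miss Rate x Miss Time), with Miss Rate = 1 - Hit Rate."""
    miss_rate = 1.0 - hit_rate
    return hit_rate * hit_time + miss_rate * miss_time


# Hypothetical numbers: 1 ns hit, 100 ns miss.
for hit_rate in (0.90, 0.99):
    print(hit_rate, average_access_time(hit_rate, 1.0, 100.0))
```

With these sample numbers, raising the hit rate from 90% to 99% cuts the average from roughly 10.9 ns to roughly 2 ns, which is the "frequent case frequent enough" condition in numeric form.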







- Cannot afford to translate on every access
  - At least three DRAM accesses per actual DRAM access
  - Or: perhaps I/O if page table partially on disk!
- Even worse: What if we are using caching to make memory access faster than DRAM access???
- Solution? Cache translations!
  - Translation Cache: TLB ("Translation Lookaside Buffer")
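The "three DRAM accesses" figure, and what a translation cache buys back, can be estimated with a short calculation. This sketch assumes one DRAM access per page table level plus one for the data, and that a TLB hit needs no DRAM access for translation; the function name and the 99% hit rate are illustrative assumptions.

```python
def dram_accesses_per_reference(levels, tlb_hit_rate=0.0):
    """Expected DRAM accesses for one memory reference."""
    walk = levels + 1   # one access per table level, plus the data itself
    return tlb_hit_rate * 1 + (1 - tlb_hit_rate) * walk


print(dram_accesses_per_reference(2))         # two-level table, no TLB
print(dram_accesses_per_reference(2, 0.99))   # same table, assumed 99% TLB hit rate
```

Without a TLB, a two-level table costs three accesses per reference; with a high hit rate the expected cost falls back to just over one.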



## Memory Hierarchy of a Modern Computer System

- Take advantage of the principle of locality to:
  - Present as much memory as in the cheapest technology
  - Provide access at speed offered by the fastest technology





## A Summary on Sources of Cache Misses

- Compulsory (cold start or process migration, first reference): first access to a block
  - "Cold" fact of life: not a whole lot you can do about it
  - Note: If you are going to run "billions" of instructions, Compulsory Misses are insignificant
- Capacity:
  - Cache cannot contain all blocks accessed by the program
  - Solution: increase cache size
- Conflict (collision):
  - Multiple memory locations mapped to the same cache location
  - Solution 1: increase cache size
  - Solution 2: increase associativity
- Coherence (Invalidation): other process (e.g., I/O) updates memory



## How is a Block found in a Cache?



Figure: address fields (Tag, Index, Data Select)

- Index Used to Lookup Candidates in Cache
  - Index identifies the set
- Tag used to identify actual copy
  - If no candidates match, then declare cache miss
- Block is minimum quantum of caching
  - Data select field used to select data within block
  - Many caching applications don't have data select field

### **Review: Direct Mapped Cache**

- Direct Mapped $2^{N}$ byte cache:
  - The uppermost (32 - N) bits are always the Cache Tag
  - The lowest M bits are the Byte Select (Block Size = $2^{M}$)
- Example: 1 KB Direct Mapped Cache with 32 B Blocks
  - Index chooses potential block
  - Tag checked to verify block
  - Byte select chooses byte within block
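For the 1 KB, 32 B block example, N = 10 and M = 5, so a 32-bit address splits into a 22-bit tag, a 5-bit index, and a 5-bit byte select. A minimal sketch of that split (the function name `split_address` is hypothetical, and sizes are assumed to be powers of two):

```python
def split_address(addr, cache_bytes=1024, block_bytes=32):
    """Split a 32-bit address into (tag, index, byte_select) for a direct mapped cache."""
    m = block_bytes.bit_length() - 1    # byte select bits: block size = 2**m
    n = cache_bytes.bit_length() - 1    # cache size = 2**n
    index_bits = n - m
    byte_select = addr & (block_bytes - 1)
    index = (addr >> m) & ((1 << index_bits) - 1)
    tag = addr >> n                     # the uppermost (32 - n) bits
    return tag, index, byte_select


tag, index, byte_select = split_address(0x14020)
print(hex(tag), index, byte_select)
```

Two addresses with the same index but different tags compete for the same block, which is the source of conflict misses in a direct mapped cache.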



## **Review: Fully Associative Cache**

- Fully Associative: Every block can hold any line
  - Address does not include a cache index
  - Compare Cache Tags of all Cache Entries in Parallel
- Example: Block Size = 32 B
  - We need N 27-bit comparators
  - Still have byte select to choose from within block



## Review: Set Associative Cache



### Review: Which block should be replaced on a miss?

- Easy for Direct Mapped: Only one possibility
- Set Associative or Fully Associative:
  - Random
  - LRU (Least Recently Used)

| Size   | 2-way LRU | 2-way Random | 4-way LRU | 4-way Random | 8-way LRU | 8-way Random |
|--------|-----------|--------------|-----------|--------------|-----------|--------------|
| 16 KB  | 5.2%      | 5.7%         | 4.7%      | 5.3%         | 4.4%      | 5.0%         |
| 64 KB  | 1.9%      | 2.0%         | 1.5%      | 1.7%         | 1.4%      | 1.5%         |
| 256 KB | 1.15%     | 1.17%        | 1.13%     | 1.13%        | 1.12%     | 1.12%        |
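The two replacement policies can be compared on any access trace with a small simulator. This is a sketch only: it tracks block addresses rather than bytes, the class name `SetAssociativeCache` is hypothetical, and it will not reproduce the table's numbers, which come from measured workloads.

```python
import random
from collections import OrderedDict


class SetAssociativeCache:
    def __init__(self, num_sets, ways, policy="lru", seed=0):
        self.num_sets = num_sets
        self.ways = ways
        self.policy = policy              # "lru" or "random"
        self.rng = random.Random(seed)
        self.sets = [OrderedDict() for _ in range(num_sets)]
        self.misses = 0

    def access(self, block_addr):
        """Access one block address; return True on hit, False on miss."""
        index, tag = block_addr % self.num_sets, block_addr // self.num_sets
        lines = self.sets[index]
        if tag in lines:
            lines.move_to_end(tag)        # mark as most recently used
            return True
        self.misses += 1
        if len(lines) >= self.ways:
            if self.policy == "lru":
                lines.popitem(last=False)  # evict least recently used
            else:
                del lines[self.rng.choice(list(lines))]  # evict a random line
        lines[tag] = True
        return False
```

Setting `ways=1` gives a direct mapped cache (only one possible victim), and `num_sets=1` gives a fully associative one.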


#### Review: What happens on a write?

- Write through: The information is written to both the block in the cache and to the block in the lower-level memory
- Write back: The information is written only to the block in the cache.
  - Modified cache block is written to main memory only when it is replaced
  - Question: is block clean or dirty?
- Pros and Cons of each?
  - WT:
    - » PRO: read misses cannot result in writes
    - » CON: Processor held up on writes unless writes buffered
  - WB:
    - » PRO: repeated writes not sent to DRAM; processor not held up on writes
    - » CON: More complex; read miss may require writeback of dirty data
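The difference in memory traffic between the two policies shows up even in a one-block cache. A toy sketch, assuming every access is a write to a block address and that dirty data is flushed at the end; the function name `memory_writes` is hypothetical:

```python
def memory_writes(writes, policy):
    """Count writes reaching lower-level memory for a one-block cache.

    `writes` is a sequence of block addresses being written;
    `policy` is "write-through" or "write-back".
    """
    cached, dirty, count = None, False, 0
    for block in writes:
        if block != cached:                   # miss: replace the cached block
            if policy == "write-back" and dirty:
                count += 1                    # write the dirty victim back
            cached, dirty = block, False
        if policy == "write-through":
            count += 1                        # every write goes to memory
        else:
            dirty = True                      # defer until replacement
    if policy == "write-back" and dirty:
        count += 1                            # final flush of the last dirty block
    return count


trace = [7, 7, 7, 7, 9]
print(memory_writes(trace, "write-through"), memory_writes(trace, "write-back"))
```

Repeated writes to block 7 reach memory four times under write through but only once under write back, at the cost of tracking the dirty bit.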


### Caching Applied to Address Translation



- Stack accesses have definite locality of reference
- Data accesses have less page locality, but still some...
- Can we have a TLB hierarchy?
  - Sure: multiple levels at different sizes/speeds


## What Actually Happens on a TLB Miss?

- Hardware traversed page tables:
  - On TLB miss, hardware in MMU looks at current page table to fill TLB (may walk multiple levels)
    - » If PTE valid, hardware fills TLB and processor never knows
    - » If PTE marked as invalid, causes Page Fault, after which kernel decides what to do
- Software traversed Page tables (like MIPS)
  - On TLB miss, processor receives TLB fault
  - Kernel traverses page table to find PTE
    - » If PTE valid, fills TLB and returns from fault
    - » If PTE marked as invalid, internally calls Page Fault handler
- Most chip sets provide hardware traversal
  - Modern operating systems tend to have more TLB faults since they use translation for many things
  - Examples:
    - » shared segments
    - » user-level portions of an operating system
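The software-traversed case can be sketched as a TLB that calls a miss handler, which either fills the entry or raises a page fault. This is an illustration under assumptions: a single-level page table stored as a dictionary, 4 KB pages, FIFO eviction, and hypothetical names (`TLB`, `PageFault`, `_handle_miss`); it is not the MIPS handler itself.

```python
class PageFault(Exception):
    """Raised when the located PTE is marked invalid."""


class TLB:
    def __init__(self, page_table, capacity=4):
        self.page_table = page_table   # vpn -> (ppn, valid)
        self.capacity = capacity
        self.entries = {}              # vpn -> ppn, kept in insertion order
        self.misses = 0

    def translate(self, vaddr, page_bits=12):
        vpn, offset = vaddr >> page_bits, vaddr & ((1 << page_bits) - 1)
        if vpn not in self.entries:    # TLB miss: take the "TLB fault"
            self.misses += 1
            self._handle_miss(vpn)
        return (self.entries[vpn] << page_bits) | offset

    def _handle_miss(self, vpn):
        """Kernel-style handler: traverse the page table to find the PTE."""
        ppn, valid = self.page_table.get(vpn, (None, False))
        if not valid:
            raise PageFault(hex(vpn))
        if len(self.entries) >= self.capacity:
            self.entries.pop(next(iter(self.entries)))  # evict oldest entry (FIFO)
        self.entries[vpn] = ppn        # fill TLB and return from fault
```

A second access to the same page hits in `entries` and never reaches the handler, which is the whole point of caching the translation.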

### What happens on a Context Switch?

- Need to do something, since TLBs map virtual addresses to physical addresses
  - Address Space just changed, so TLB entries no longer valid!
- Options?
  - Invalidate TLB: simple but might be expensive
    - » What if switching frequently between processes?
  - Include ProcessID in TLB
    - » This is an architectural solution: needs hardware
- What if translation tables change?
  - For example, to move page from memory to disk or vice versa...
  - Must invalidate TLB entry!
    - » Otherwise, might think that page is still in memory!
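The "include ProcessID" option amounts to making the process ID part of the lookup key, so entries from different address spaces can coexist. A minimal sketch with hypothetical names (`TaggedTLB` and its methods), ignoring capacity limits:

```python
class TaggedTLB:
    """TLB whose entries are keyed by (process ID, virtual page number)."""

    def __init__(self):
        self.entries = {}

    def insert(self, pid, vpn, ppn):
        self.entries[(pid, vpn)] = ppn

    def lookup(self, pid, vpn):
        return self.entries.get((pid, vpn))   # None means TLB miss

    def invalidate_page(self, pid, vpn):
        """Needed when a translation changes, e.g. page moved to disk."""
        self.entries.pop((pid, vpn), None)

    def flush(self):
        """The simple option: invalidate everything on a context switch."""
        self.entries.clear()
```

With tagged entries a context switch only changes the current `pid`; `flush` is what an untagged TLB would have to do instead.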




### TLB organization: include protection

- How big does TLB actually have to be?
  - Usually small: 128-512 entries
  - Not very big, can support higher associativity
- TLB usually organized as fully-associative cache
  - Lookup is by Virtual Address
  - Returns Physical Address + other info
- What happens when fully-associative is too slow?
  - Put a small (4-16 entry) direct-mapped cache in front
  - Called a "TLB Slice"
- Example for MIPS R3000:

| Virtual Address | Physical Address | Dirty | Ref | Valid | Access | ASID |
|-----------------|------------------|-------|-----|-------|--------|------|
| 0xFA00          | 0x0003           | Y     | N   | Y     | R/W    | 34   |
| 0x0040          | 0x0010           | N     | Y   | Y     | R      | 0    |
| 0x0041          | 0x0011           | N     | Y   | Y     | R      | 0    |


## Example: R3000 pipeline includes TLB "stages"

#### MIPS R3000 Pipeline

| Inst Fetch   | Dcd/Reg | ALU / E.A. | Memory  | Write Reg |
|--------------|---------|------------|---------|-----------|
| TLB, I-Cache | RF      | Operation  |         | WB        |
|              |         | E.A. TLB   | D-Cache |           |

#### TLB

64 entry, on-chip, fully associative, software TLB fault handler

#### Virtual Address Space



## Reducing translation time further

- As described, TLB lookup is in serial with cache lookup:




- Machines with TLBs go one step further: they overlap TLB lookup with cache access.
  - Works because offset available early
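Since only the page offset is available before translation finishes, the overlap works when the cache's index and byte select bits all fall inside that offset. A rough check under that assumption, with 4 KB pages taken from the 12-bit offset used earlier; the function name `can_overlap` is hypothetical:

```python
def can_overlap(cache_bytes, ways, page_bytes=4096):
    """True if the cache index and byte select fit inside the page offset.

    Index plus byte select address cache_bytes / ways bytes, so that span
    must not exceed one page.
    """
    return cache_bytes // ways <= page_bytes


print(can_overlap(4096, 1))   # 4 KB direct mapped
print(can_overlap(8192, 1))   # 8 KB direct mapped
print(can_overlap(8192, 2))   # 8 KB, 2-way set associative
```

Under this constraint, a larger cache can keep the overlap by raising associativity rather than adding index bits.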


### **Overlapping TLB & Cache Access**




## Summary #2/2: Translation Caching (TLB)

- PTE: Page Table Entry
  - Includes physical page number
  - Control info (valid bit, etc.)
- A cache of translations called a "Translation Lookaside Buffer" (TLB)
  - Relatively small number of entries (< 512)
  - Fully Associative (since conflict misses expensive)
  - TLB entries contain PTE and optional process ID
- On TLB miss, page table must be traversed
  - If located PTE is invalid, cause Page Fault
- On context switch/change in page table
  - TLB entries must be invalidated somehow
- TLB is logically in front of cache
  - Thus, needs to be overlapped with cache access to be really fast