# **CS162 Operating Systems and** Systems Programming Lecture 14

# Caching and **Demand** Paging

October 18, 2006 Prof. John Kubiatowicz http://inst.eecs.berkeley.edu/~cs162

### **Review: A Summary on Sources of Cache Misses**

- · Compulsory (cold start): first reference to a block
  - "Cold" fact of life: not a whole lot you can do about it
  - Note: When running "billions" of instruction, Compulsory Misses are insignificant
- · Capacity:
  - Cache cannot contain all blocks access by the program
  - Solution: increase cache size
- Conflict (collision):
  - Multiple memory locations mapped to same cache location
  - Solutions: increase cache size, or increase associativity
- Two others:
  - Coherence (Invalidation): other process (e.g., I/O) updates memory
  - Policy: Due to non-optimal replacement policy

### 10/18/06

Kubiatowicz CS162 ©UCB Fall 2006

# **Review: Memory Hierarchy of a Modern Computer System**

- Take advantage of the principle of locality to:
  - Present as much memory as in the cheapest technology
  - Provide access at speed offered by the fastest technology



### Review: Where does a Block Get Placed in a Cache?



no.

Lec 14.3

Lec 14.4

### **Review: Other Caching Questions**

- What line gets replaced on cache miss?
  - Easy for Direct Mapped: Only one possibility
  - Set Associative or Fully Associative:
    - » Random
    - » LRU (Least Recently Used)

### • What happens on a write?

- Write through: The information is written to both the cache and to the block in the lower-level memory
- Write back: The information is written only to the block in the cache
  - » Modified cache block is written to main memory only when it is replaced
  - » Question is block clean or dirty?

### • Can we have a TLB hierarchy? - Sure: multiple levels at different sizes/speeds Kubiatowicz CS162 ©UCB Fall 2006 10/18/06 10/18/06 Kubiatowicz CS162 ©UCB Fall 2006 Lec 14.5 Lec 14.6 Goals for Today What Actually Happens on a TLB Miss? • Hardware traversed page tables: Finish discussion of TLBs - On TLB miss, hardware in MMU looks at current page Concept of Paging to Disk table to fill TLB (may walk multiple levels) » If PTE valid, hardware fills TLB and processor never knows • Page Faults and TLB Faults » If PTE marked as invalid, causes Page Fault, after which • Precise Interrupts kernel decides what to do afterwards Software traversed Page tables (like MIPS) Page Replacement Policies - On TLB miss, processor receives TLB fault - Kernel traverses page table to find PTE » If PTE valid, fills TLB and returns from fault » If PTE marked as invalid, internally calls Page Fault handler Most chip sets provide hardware traversal - Modern operating systems tend to have more TLB faults since they use translation for many things - Examples: Note: Some slides and/or pictures in the following are » shared segments adapted from slides ©2005 Silberschatz, Galvin, and Gagne. » user-level portions of an operating system Many slides generated from my lecture notes by Kubiatowicz.

Lec 14.7

10/18/06

**Review: Caching Applied to Address Translation** 

Result

Physical

Address

Physical

Memory

ri R

Cached?

Yes

Translate

(MMU)

Data Read or Write

(untranslated)

page (since accesses sequential)

• Question is one of page locality: does it exist?

- Instruction accesses spend a lot of time on the same

Stack accesses have definite locality of reference
Data accesses have less page locality, but still some...

No

Virtual

Address

CPU

### What happens on a Context Switch? What TLB organization makes sense? • Need to do something, since TLBs map virtual TLB CPU Cache Memory addresses to physical addresses - Address Space just changed, so TLB entries no Needs to be really fast longer valid! - Critical path of memory access Options? » In simplest view: before the cache » Thus, this adds to access time (reducing cache speed) - Invalidate TLB: simple but might be expensive - Seems to argue for Direct Mapped or Low Associativity » What if switching frequently between processes? However, needs to have very few conflicts! - Include ProcessID in TLB - With TLB, the Miss Time extremely high! » This is an architectural solution: needs hardware - This argues that cost of Conflict (Miss Time) is much higher than slightly increased cost of access (Hit Time) • What if translation tables change? Thrashing: continuous conflicts between accesses - For example, to move page from memory to disk or - What if use low order bits of page as index into TLB? vice versa... » First page of code, data, stack may map to same entry - Must invalidate TLB entry! » Need 3-way associativity at least? » Otherwise, might think that page is still in memory! - What if use high order bits as index? » TLB mostly unused for small programs Kubiatowicz CS162 ©UCB Fall 2006 10/18/06 Kubiatowicz CS162 ©UCB Fall 2006 Lec 14.9 10/18/06 Lec 14,10

### TLB organization: include protection

- How big does TLB actually have to be? - Usually small: 128-512 entries
  - Not very big, can support higher associativity
- TLB usually organized as fully-associative cache
  - Lookup is by Virtual Address
  - Returns Physical Address + other info
- What happens when fully-associative is too slow? - Put a small (4-16 entry) direct-mapped cache in front
  - Called a "TLB Slice"
- Example for MIPS R3000:

| Virtual Address | Physical Address | Dirty | Ref | Valid | Access | ASID |
|-----------------|------------------|-------|-----|-------|--------|------|
| 0xFA00          | 0x0003           | Y     | Ν   | Y     | R/W    | 34   |
| 0x0040          | 0x0010           | Ν     | Y   | Y     | R      | 0    |
| 0x0041          | 0x0011           | Ν     | Y   | Y     | R      | 0    |
|                 |                  |       |     | -     |        |      |

# Example: R3000 pipeline includes TLB "stages"

MIPS R3000 Pipeline

| Inst Fetch  |  | Dcd/ Reg |    | ALU / E.A |     | Memory  | Write Reg |  |
|-------------|--|----------|----|-----------|-----|---------|-----------|--|
| TLB I-Cache |  | he       | RF | Operation |     |         | WB        |  |
|             |  |          |    | E.A.      | TLB | D-Cache |           |  |

TLB

64 entry, on-chip, fully associative, software TLB fault handler

### Virtual Address Space



### Reducing translation time further

• As described, TLB lookup is in serial with cache lookup:



### TLB lookup with cache access.

- Works because offset available early

```
10/18/06
```

Kubiatowicz CS162 ©UCB Fall 2006

Lec 14,13

### Administrative

- Project 2 is started!
  - We moved the design document due date to today (Wednesday 10/18) at 11:59pm
  - Always keep up with the project schedule by looking on the "Lectures" page
- Make sure to come to sections!
  - There will be a lot of information about the projects that I cannot cover in class
  - Also supplemental information and detail that we don't have time for in class
- Exams are graded
  - Will be handed back during section tomorrow

### **Demand Paging**

**Overlapping TLB & Cache Access** 

index

00

10 2

disp

(=)

• Here is how this might work with a 4K cache:

20

- Need to do something else. See CS152/252

page #

• What if cache size is increased to 8KB?

- Tags in cache are virtual addresses

- Translation only happens on cache misses <sup>8/06</sup>
Kubiatowicz C5162 ©UCB Fall 2006

assoc

lookup

TLB

FN

- Overlap not complete

Another option: Virtual Caches

32

Hit/

Miss

10/18/06

- Modern programs require a lot of physical memory
   Memory per system growing faster than 25%-30%/year
- But they don't use all their memory all of the time
  - 90-10 rule: programs spend 90% of their time in 10% of their code
  - Wasteful to require all of user's code to be in memory
- Solution: use main memory as cache for disk



10/18/06

1 K

Hit/

Miss

Lec 14,14

4K Cache

4 bytes

FN Data



### Demand Paging is Caching

- Since Demand Paging is Caching, must ask:
  What is block size?

  1 page

  What is organization of this cache (i.e. direct-mapped, set-associative, fully-associative)?

  Fully associative: arbitrary virtual → physical mapping
  How do we find a page in the cache when look for it?
  First check TLB, then page-table traversal
  - What is page replacement policy? (i.e. LRU, Random...) » This requires more explanation... (kinda LRU)
  - What happens on a miss?
    - » Go to lower level to fill miss (i.e. disk)
  - What happens on a write? (write-through, write back) » Definitely write-back. Need dirty bit!

```
10/18/06
```

Kubiatowicz CS162 ©UCB Fall 2006

Lec 14.18

# Review: What is in a PTE?

- What is in a Page Table Entry (or PTE)?
  - Pointer to next-level page table or to actual page
  - Permission bits: valid, read-only, read-write, write-only
- Example: Intel x86 architecture PTE:
  - Address same format previous slide (10, 10, 12-bit offset)
  - Intermediate page tables called "Directories"



- P: Present (same as "valid" bit in other architectures)
- W: Writeable
- U: User accessible
- PWT: Page write transparent: external cache write-through
- PCD: Page cache disabled (page cannot be cached)
  - A: Accessed: page has been accessed recently
  - D: Dirty (PTE only): page has been modified recently
  - L: L=1⇒4MB page (directory only). Bottom 22 bits of virtual address serve as offset Kubiatowicz C5162 ©UCB Fall 2006

# Demand Paging Mechanisms

- PTE helps us implement demand paging
  - Valid  $\Rightarrow$  Page in memory, PTE points at physical page
  - Not Valid  $\Rightarrow$  Page not in memory; use info in PTE to find it on disk when necessary
- $\cdot$  Suppose user references page with invalid PTE?
  - Memory Management Unit (MMU) traps to OS » Resulting trap is a "Page Fault"
  - What does OS do on a Page Fault?:
    - » Choose an old page to replace
    - » If old page modified ("D=1"), write contents back to disk
    - » Change its PTE and any cached TLB to be invalid
    - » Load new page into memory from disk
    - » Update page table entry, invalidate TLB for new entry
    - » Continue thread from original faulting location
  - TLB for new page will be loaded when thread continued!
  - While pulling pages off disk for one process, OS runs another process from ready queue
  - » Suspended process sits on wait queue

10/18/06

### Software-Loaded TLB

- MIPS/Snake/Nachos TLB is loaded by software
  - High TLB hit rate⇒ok to trap to software to fill the TLB, even if slower
  - Simpler hardware and added flexibility: software can maintain translation tables in whatever convenient format
- How can a process run without access to page table?
  - Fast path (TLB hit with valid=1):
    - » Translation to physical page done by hardware
  - Slow path (TLB hit with valid=0 or TLB miss) » Hardware receives a "TLB Fault"
  - What does OS do on a TLB Fault?
    - » Traverse page table to find appropriate PTE
    - » If valid=1, load page table entry into TLB, continue thread
    - » If valid=0, perform "Page Fault" detailed previously

» Continue thread

- Everything is transparent to the user process:
  - It doesn't know about paging to/from disk
- It doesn't even know about software TLB handling Kubiatowicz C5162 ©UCB Fall 2006 Lec 14.21 10/18/06

# **Transparent Exceptions**



- How to transparently restart faulting instructions? - Could we just skip it?
  - » No: need to perform load or store after reconnecting physical page
- Hardware must help out by saving:
  - Faulting instruction and partial state
    - » Need to know which instruction caused fault
    - » Is single PC sufficient to identify faulting position????
  - Processor State: sufficient to restart user thread
    - » Save/restore registers, stack, etc

• What if an instruction has side-effects? 10/18/06 Kubiatowicz CS162 ©UCB Fall 2006

```
Lec 14,22
```

# Consider weird things that can happen

What if an instruction has side effects?

- Options:

- » Unwind side-effects (easy to restart)
- » Finish off side-effects (messy!)
- Example 1: mov (sp)+,10
  - » What if page fault occurs when write to stack pointer?
  - » Did sp get incremented before or after the page fault?
- Example 2: strcpy (r1), (r2)
  - » Source and destination overlap: can't unwind in principle!
  - » IBM S/370 and VAX solution: execute twice once read-only
- What about "RISC" processors?
  - For instance delayed branches?
    - » Example: bne somewhere ld r1,(sp)
    - » Precise exception state consists of two PCs: PC and nPC
  - Delaved exceptions:

```
» Example:
             div r1, r2, r3
             ld r1, (sp)
```

- » What if takes many cycles to discover divide by zero. but load has already caused page fault?
  - Lec 14,23

### **Precise Exceptions**

- Precise  $\Rightarrow$  state of the machine is preserved as if program executed up to the offending instruction
  - All previous instructions completed
  - Offending instruction and all following instructions act as if they have not even started
  - Same system code will work on different implementations
  - Difficult in the presence of pipelining, out-of-order execution, ...
  - MIPS takes this position
- Imprecise  $\Rightarrow$  system software has to figure out what is where and put it all back together
- Performance goals often lead designers to forsake precise interrupts
  - system software developers, user, markets etc. usually wish they had not done this
- Modern techniques for out-of-order execution and branch prediction help implement precise interrupts

10/18/06

### Page Replacement Policies

• What about LRU? • Why do we care about Replacement Policy? - Replacement is an issue with any cache - Replace page that hasn't been used for the longest time - Particularly important with pages - Programs have locality, so if something not used for a » The cost of being wrong is high: must go to disk while, unlikely to be used in the near future. » Must keep important pages in memory, not toss them out - Seems like LRU should be a good approximation to MIN. What about MIN? • How to implement LRU? Use a list! - Replace page that won't be used for the longest time - Great, but can't really know future... Head — Page 6 Page 7 Page 1 Page 2 - Makes good comparison case, however What about RANDOM? Tail (LRU) -- Pick random page for every replacement - On each use, remove page from list and place at head - Typical solution for TLB's. Simple hardware - Pretty unpredictable - makes it hard to make real-time - LRU page is at tail auarantees Problems with this scheme for paging? What about FIFO? - Need to know immediately when each page used so that - Throw out oldest page. Be fair - let every page live in can change position in list... memory for same amount of time. - Many instructions for each hardware access - Bad, because throws out heavily used pages instead of • In practice, people approximate LRU (more later) infrequently used pages

10/18/06

10/18/06

Kubiatowicz CS162 ©UCB Fall 2006

Lec 14,25

Lec 14,27

### Kubiatowicz CS162 ©UCB Fall 2006

**Replacement Policies (Con't)** 

Lec 14.26

### Summary

- TLB is cache on translations
  - Fully associative to reduce conflicts
  - Can be overlapped with cache access
- Demand Paging:
  - Treat memory as cache on disk
  - Cache miss  $\Rightarrow$  get page from disk
- Transparent Level of Indirection
  - User program is unaware of activities of OS behind scenes
- Data can be moved without affecting application correctness
   Software-loaded TLB
  - Fast Path: handled in hardware (TLB hit with valid=1)
  - Slow Path: Trap to software to scan page table
- Precise Exception specifies a single instruction for which:
  - All previous instructions have completed (committed state)
  - No following instructions nor actual instruction have started
- Replacement policies
  - FIFO: Place pages on queue, replace page at end
  - MIN: replace page that will be used farthest in future
  - LRU: Replace page that hasn't be used for the longest time