



#### Goals for Today

- Finish discussion of both Address Translation and Protection
- Caching and TLBs

Note: Some slides and/or pictures in the following are adapted from slides ©2005 Silberschatz, Galvin, and Gagne

#### What is in a PTE?

- What is in a Page Table Entry (or PTE)?
  - Pointer to next-level page table or to actual page
  - Permission bits: valid, read-only, read-write, write-only
- Example: Intel x86 architecture PTE:
  - Address has the same format as on the previous slide (10, 10, 12-bit offset)
  - Intermediate page tables called "Directories"



- P: Present (same as "valid" bit in other architectures)
- W: Writeable
- U: User accessible
- PWT: Page write transparent: external cache write-through
- PCD: Page cache disabled (page cannot be cached)
- A: Accessed: page has been accessed recently
- D: Dirty (PTE only): page has been modified recently
- L: L=1⇒4MB page (directory only).
  - Bottom 22 bits of virtual address serve as offset

10/15/07 Kubiatowicz CS162 ©UCB Fall 2007 Lec 13.5
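The flag bits listed above can be pulled out of an entry with a few shifts and masks. The sketch below assumes the classic 32-bit (non-PAE) x86 layout, with the flags in bits 0 through 7 and the physical page number in the top 20 bits; the names `PTE_FLAGS` and `decode_pte` are made up for illustration.

```python
# Assumed bit positions for a 32-bit x86 PTE/PDE (non-PAE paging).
PTE_FLAGS = {
    "P": 0,    # Present
    "W": 1,    # Writeable
    "U": 2,    # User accessible
    "PWT": 3,  # External cache write-through
    "PCD": 4,  # Page cache disabled
    "A": 5,    # Accessed
    "D": 6,    # Dirty (PTE only)
    "L": 7,    # 4 MB page (directory only)
}

def decode_pte(pte):
    """Split a 32-bit entry into its frame number and the set of flags that are on."""
    frame = pte >> 12  # top 20 bits: physical page number
    flags = {name for name, bit in PTE_FLAGS.items() if pte & (1 << bit)}
    return frame, flags

frame, flags = decode_pte(0x00ABC067)
print(hex(frame), sorted(flags))
```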

#### How is the translation accomplished?



- What exactly happens inside the MMU?
- One possibility: Hardware Tree Traversal
  - For each virtual address, takes page table base pointer and traverses the page table in hardware
  - Generates a "Page Fault" if it encounters invalid PTE
    - » Fault handler will decide what to do
    - » More on this next lecture
  - Pros: Relatively fast (but still many memory accesses!)
  - Cons: Inflexible, Complex hardware
- Another possibility: Software
  - Each traversal done in software
  - Pros: Very flexible
  - Cons: Every translation must invoke Fault!
- In fact, need way to cache translations for either case!
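As a rough illustration of the tree traversal, here is a software model of a two-level walk over the 10/10/12 address format. It is only a sketch: `frames` is a hypothetical stand-in for physical memory (a dict from frame number to the 1024 entries stored there), and `PageFault` is an ordinary Python exception rather than a hardware trap.

```python
PRESENT = 0x1

class PageFault(Exception):
    """Raised when the walk reaches an entry whose present bit is clear."""

def walk(vaddr, base_frame, frames):
    """Translate a 32-bit virtual address through a two-level page table."""
    dir_index = (vaddr >> 22) & 0x3FF    # top 10 bits
    table_index = (vaddr >> 12) & 0x3FF  # middle 10 bits
    offset = vaddr & 0xFFF               # low 12 bits

    pde = frames[base_frame][dir_index]  # memory access 1: directory entry
    if not pde & PRESENT:
        raise PageFault(hex(vaddr))
    pte = frames[pde >> 12][table_index]  # memory access 2: page table entry
    if not pte & PRESENT:
        raise PageFault(hex(vaddr))
    return ((pte >> 12) << 12) | offset  # access 3 would fetch the data itself

frames = {1: [0] * 1024, 2: [0] * 1024}
frames[1][1] = (2 << 12) | PRESENT     # directory entry 1 -> table in frame 2
frames[2][3] = (0x55 << 12) | PRESENT  # table entry 3 -> physical frame 0x55
print(hex(walk(0x00403ABC, base_frame=1, frames=frames)))
```

Even in this toy version every translation costs two table reads before the data access, which is the motivation for caching translations.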

## Examples of how to use a PTE

- How do we use the PTE?
  - Invalid PTE can imply different things:
    - » Region of address space is actually invalid or
    - » Page/directory is just somewhere else than memory
  - Validity checked first
    - » OS can use other (say) 31 bits for location info
- Usage Example: Demand Paging
  - Keep only active pages in memory
  - Place others on disk and mark their PTEs invalid
- Usage Example: Copy on Write
  - UNIX fork gives *copy* of parent address space to child
    - » Address spaces disconnected after child created
  - How to do this cheaply?
    - » Make copy of parent's page tables (point at same memory)
    - » Mark entries in both sets of page tables as read-only
    - » Page fault on write creates two copies
- Usage Example: Zero Fill On Demand
  - New data pages must carry no information (say be zeroed)
  - Mark PTEs as invalid; page fault on use gets zeroed page
  - Often, OS creates zeroed pages in background
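The copy-on-write steps can be modelled in a few lines. This is a simplified sketch, not how any particular kernel does it: `Frame`, `Pte`, `fork` and `write` are hypothetical names, the "page fault" is just an `if` on the writable flag, and every page is assumed to have been writable before the fork.

```python
class Frame:
    def __init__(self, data):
        self.data = bytearray(data)
        self.refs = 1  # how many page tables point at this frame

class Pte:
    def __init__(self, frame, writable):
        self.frame = frame
        self.writable = writable

def fork(parent_table):
    """Copy the page table, not the pages; mark both copies read-only."""
    child_table = {}
    for vpn, pte in parent_table.items():
        pte.writable = False
        pte.frame.refs += 1
        child_table[vpn] = Pte(pte.frame, writable=False)
    return child_table

def write(table, vpn, offset, value):
    """Store one byte, handling the fault a read-only PTE would cause."""
    pte = table[vpn]
    if not pte.writable:            # stands in for the page fault
        if pte.frame.refs > 1:      # still shared: make a private copy
            pte.frame.refs -= 1
            pte.frame = Frame(pte.frame.data)
        pte.writable = True
    pte.frame.data[offset] = value

parent = {0: Pte(Frame(b"\x00" * 4096), writable=True)}
child = fork(parent)
write(child, 0, 0, 7)
print(parent[0].frame.data[0], child[0].frame.data[0])
```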

#### Dual-Mode Operation

- Can Application Modify its own translation tables?
  - If it could, could get access to all of physical memory
  - Has to be restricted somehow
- To Assist with Protection, Hardware provides at least two modes (Dual-Mode Operation):
  - "Kernel" mode (or "supervisor" or "protected")
  - "User" mode (Normal program mode)
  - Mode set with bits in special control register only accessible in kernel-mode
- Intel processor actually has four "rings" of protection:
  - PL (Privilege Level) from 0 to 3
    - » PL0 has full access, PL3 has least
  - Privilege Level set in code segment descriptor (CS)
  - Mirrored "IOPL" bits in condition register give permission to programs to use the I/O instructions
  - Typical OS kernels on Intel processors only use PL0 ("kernel") and PL3 ("user")

#### For Protection, Lock User-Programs in Asylum

- Idea: Lock user programs in padded cell with no exit or sharp objects
  - Cannot change mode to kernel mode
  - User cannot modify page table mapping
  - Limited access to memory: cannot adversely affect other processes
    - » Side-effect: Limited access to memory-mapped I/O operations (I/O that occurs by reading/writing memory locations)
  - Limited access to interrupt controller
  - What else needs to be protected?
- What else needs to be protected?
- A couple of issues
  - How to share CPU between kernel and user programs?
    - » Kinda like both the inmates and the warden in asylum are the same person. How do you manage this???
  - How do programs interact?
  - How does one switch between kernel and user modes?
    - » OS → user (kernel → user mode): getting into cell
    - » User → OS (user → kernel mode): getting out of cell

### How to get from Kernel→User

- What does the kernel do to create a new user process?
  - Allocate and initialize address-space control block
  - Read program off disk and store in memory
  - Allocate and initialize translation table
    - » Point at code in memory so program can execute
    - » Possibly point at statically initialized data
  - Run Program:
    - » Set machine registers
    - » Set hardware pointer to translation table
    - » Set processor status word for user mode
    - » Jump to start of program
- How does kernel switch between processes?
  - Same saving/restoring of registers as before
  - Save/restore PSL (hardware pointer to translation table)

## User→Kernel (System Call)

- Can't let inmate (user) get out of padded cell on own
  - Would defeat purpose of protection!
  - So, how does the user program get back into kernel?



- System call: Voluntary procedure call into kernel
  - Hardware for controlled User→Kernel transition
  - Can any kernel routine be called?
    - » No! Only specific ones.
  - System call ID encoded into system call instruction
    - » Index forces well-defined interface with kernel

#### System Call Continued

- What are some system calls?
  - I/O: open, close, read, write, lseek
  - Files: delete, mkdir, rmdir, truncate, chown, chgrp, ...
  - Process: fork, exit, wait (like join)
  - Network: socket create, set options
- Are system calls constant across operating systems?
  - Not entirely, but there are lots of commonalities
  - Also some standardization attempts (POSIX)
- What happens at beginning of system call?
  - » On entry to kernel, sets system to kernel mode
  - » Handler address fetched from table/Handler started
- System Call argument passing:
  - In registers (not very much can be passed)
  - Write into user memory, kernel copies into kernel mem
    - » User addresses must be translated!
    - » Kernel has different view of memory than user
  - Every Argument must be explicitly checked!
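The table lookup and argument checking described above might look roughly like this. The sketch is purely illustrative: the process is a plain dict, the two handlers and their IDs are invented, and a real kernel would enter through a trap instruction rather than a function call.

```python
def sys_getpid(proc, args):
    return proc["pid"]

def sys_write(proc, args):
    fd, addr, length = args
    mem = proc["memory"]
    # Every argument is checked explicitly before the kernel uses it.
    if fd not in proc["open_files"]:
        return -1
    if addr < 0 or length < 0 or addr + length > len(mem):
        return -1
    kernel_copy = bytes(mem[addr:addr + length])  # copy user data into kernel memory
    proc["open_files"][fd].append(kernel_copy)
    return length

SYSCALL_TABLE = [sys_getpid, sys_write]  # the system call ID is an index into this table

def syscall(proc, call_id, *args):
    """Stand-in for the trap handler: validate the ID, then dispatch."""
    if not 0 <= call_id < len(SYSCALL_TABLE):
        return -1  # the user cannot name arbitrary kernel code
    return SYSCALL_TABLE[call_id](proc, args)

proc = {"pid": 42, "memory": bytearray(b"hello world"), "open_files": {1: []}}
print(syscall(proc, 1, 1, 0, 5), proc["open_files"][1])
```

Because the only way in is through the table index, the set of kernel entry points stays small and well defined.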



#### Closing thought: Protection without Hardware

- Does protection require hardware support for translation and dual-mode behavior?
  - No: Normally use hardware, but anything you can do in hardware you can also do in software (possibly expensive)
- Protection via Strong Typing
  - Restrict programming language so that you can't express program that would trash another program
  - Loader needs to make sure that program produced by valid compiler or all bets are off
  - Example languages: LISP, Ada, Modula-3 and Java
- Protection via software fault isolation:
  - Language independent approach: have compiler generate object code that provably can't step out of bounds
    - » Compiler puts in checks for every "dangerous" operation (loads, stores, etc). Again, need special loader.
    - » Alternative, compiler generates "proof" that code cannot do certain things (Proof Carrying Code)

- Or: use virtual machine to guarantee safe behavior (loads and stores recompiled on fly to check bounds)
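To make the "checks for every dangerous operation" idea concrete, here is a toy model of what compiler-inserted store checks could do. The sandbox bounds are hypothetical, and the masking variant is a common alternative to explicit checks that the slides do not mention; it is included only for contrast.

```python
SEGMENT_BASE = 0x4000  # hypothetical sandbox covers [0x4000, 0x5000)
SEGMENT_SIZE = 0x1000  # power of two, and the base is aligned to it

memory = bytearray(0x10000)

def checked_store(addr, value):
    """Bounds-check variant: refuse any store outside the sandbox."""
    if not SEGMENT_BASE <= addr < SEGMENT_BASE + SEGMENT_SIZE:
        raise MemoryError(f"store to {addr:#x} is outside the sandbox")
    memory[addr] = value

def masked_store(addr, value):
    """Masking variant: force the address into the sandbox instead of checking."""
    addr = SEGMENT_BASE | (addr & (SEGMENT_SIZE - 1))
    memory[addr] = value

checked_store(0x4010, 1)
masked_store(0x9123, 5)  # redirected into the sandbox rather than rejected
print(memory[0x4010], memory[0x4123], memory[0x9123])
```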

#### **Administrivia**

- Exam was too long
  - Sorry about that...
  - If it is any consolation, everyone in same boat
- Still grading exam
  - Will announce results as soon as possible
  - Also will get solutions up very soon!
- Project 2 is started!
  - Design document due date is Wednesday (10/17) at 11:59pm
  - Always keep up with the project schedule by looking on the "Lectures" page
- Make sure to come to sections!
  - There will be a lot of information about the projects that I cannot cover in class
  - Also supplemental information and detail that we don't have time for in class

#### **Review: Monitor Summary**







#### **Caching Concept**

- Cannot afford to translate on every access
  - At least three DRAM accesses per actual DRAM access
  - Or: perhaps I/O if page table partially on disk!
- Even worse: What if we are using caching to make memory access faster than DRAM access???
- Solution? Cache translations!
  - Translation Cache: TLB ("Translation Lookaside Buffer")
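A back-of-the-envelope model shows why caching translations pays off. The function below is a sketch under simple assumptions: a two-level page table (so a miss costs two extra DRAM accesses on top of the data access), no data cache, and made-up latencies of 1 ns for the TLB and 100 ns for DRAM.

```python
def effective_access_time(hit_rate, tlb_ns, dram_ns, walk_accesses=2):
    """Average time per memory reference with a TLB in front of the page table."""
    hit_cost = tlb_ns + dram_ns                         # translation cached
    miss_cost = tlb_ns + (walk_accesses + 1) * dram_ns  # walk the table, then access
    return hit_rate * hit_cost + (1 - hit_rate) * miss_cost

# No TLB at all: every reference pays for the full walk (three DRAM accesses).
print(effective_access_time(hit_rate=0.0, tlb_ns=0, dram_ns=100))
# With a TLB that hits 99% of the time.
print(effective_access_time(hit_rate=0.99, tlb_ns=1, dram_ns=100))
```

With these assumed numbers the average drops from 300 ns to roughly 103 ns per reference.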



## A Summary on Sources of Cache Misses

- Compulsory (cold start or process migration, first reference): first access to a block
  - "Cold" fact of life: not a whole lot you can do about it
  - Note: If you are going to run "billions" of instructions, Compulsory Misses are insignificant
- Capacity:
  - Cache cannot contain all blocks accessed by the program
  - Solution: increase cache size
- Conflict (collision):
  - Multiple memory locations mapped to the same cache location
  - Solution 1: increase cache size
  - Solution 2: increase associativity
- Coherence (Invalidation): other process (e.g., I/O) updates memory

# Memory Hierarchy of a Modern Computer System

- Take advantage of the principle of locality to:
  - Present as much memory as is available in the cheapest technology
  - Provide access at speed offered by the fastest technology



# How is a Block found in a Cache?




- Index Used to Lookup Candidates in Cache
  - Index identifies the set
- Tag used to identify actual copy
  - If no candidates match, then declare cache miss
- Block is minimum quantum of caching
  - Data select field used to select data within block
  - Many caching applications don't have data select field

#### **Review: Direct Mapped Cache**

- Direct Mapped $2^{N}$ byte cache:
  - The uppermost (32 - N) bits are always the Cache Tag
  - The lowest M bits are the Byte Select (Block Size = $2^{M}$)
- Example: 1 KB Direct Mapped Cache with 32 B Blocks
  - Index chooses potential block
  - Tag checked to verify block
  - Byte select chooses byte within block
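For the 1 KB cache with 32 B blocks, N = 10 and M = 5, so a 32-bit address splits into a 22-bit tag, a 5-bit index, and a 5-bit byte select. The helper below (a hypothetical `split_address`, assuming power-of-two sizes) performs that split.

```python
def split_address(addr, cache_bytes=1024, block_bytes=32):
    """Return (tag, index, byte_select) for a 32-bit address in a direct-mapped cache."""
    m = block_bytes.bit_length() - 1  # byte-select bits (block size = 2**m)
    n = cache_bytes.bit_length() - 1  # cache size = 2**n bytes
    byte_select = addr & (block_bytes - 1)
    index = (addr >> m) & ((1 << (n - m)) - 1)
    tag = addr >> n                   # uppermost 32 - n bits
    return tag, index, byte_select

tag, index, byte_select = split_address(0x12345678)
print(hex(tag), index, byte_select)
```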



#### **Review: Set Associative Cache**

- N-way set associative: N entries per Cache Index
  - N direct mapped caches operate in parallel
- Example: Two-way set associative cache
  - Cache Index selects a "set" from the cache
  - Two tags in the set are compared to input in parallel
  - Data is selected based on the tag result

[Figure: two-way set associative cache. The address is split into Cache Tag, Cache Index, and Byte Select; each way holds Valid, Cache Tag, and Cache Data; the two tag compares drive the mux selects (Sel0, Sel1) and are ORed to produce Hit.]
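A small simulator makes the difference between direct mapped and two-way set associative visible. This is a sketch only: it tracks hits and misses, not data, and it evicts the least recently used line in a set, which is an assumed replacement policy that the slides do not specify.

```python
from collections import OrderedDict

class SetAssociativeCache:
    def __init__(self, num_sets, ways, block_bytes):
        self.num_sets, self.ways, self.block_bytes = num_sets, ways, block_bytes
        # One OrderedDict per set, keyed by tag; insertion order tracks recency.
        self.sets = [OrderedDict() for _ in range(num_sets)]

    def access(self, addr):
        """Return True on a hit, False on a miss (the block is then filled)."""
        block = addr // self.block_bytes
        index = block % self.num_sets  # cache index selects the set
        tag = block // self.num_sets
        lines = self.sets[index]
        if tag in lines:               # hardware compares the tags in parallel
            lines.move_to_end(tag)
            return True
        if len(lines) == self.ways:
            lines.popitem(last=False)  # evict least recently used (assumed policy)
        lines[tag] = None
        return False

trace = [0, 1024, 0, 1024]  # two addresses that share a cache index
direct = SetAssociativeCache(num_sets=32, ways=1, block_bytes=32)
two_way = SetAssociativeCache(num_sets=16, ways=2, block_bytes=32)
print([direct.access(a) for a in trace])
print([two_way.access(a) for a in trace])
```

Both caches hold 1 KB, but only the two-way organization can keep the two conflicting blocks resident at once.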

#### **Review: Fully Associative Cache**

- Fully Associative: Every block can hold any line
  - Address does not include a cache index
  - Compare Cache Tags of all Cache Entries in Parallel
- Example: Block Size = 32 B
  - We need N 27-bit comparators
  - Still have byte select to choose from within block



## Where does a Block Get Placed in a Cache?






#### What happens on a Context Switch?

- Need to do something, since TLBs map virtual addresses to physical addresses
  - Address Space just changed, so TLB entries no longer valid!
- Options?
  - Invalidate TLB: simple but might be expensive
    - » What if switching frequently between processes?
  - Include ProcessID in TLB
    - » This is an architectural solution: needs hardware
- What if translation tables change?
  - For example, to move page from memory to disk or vice versa...
  - Must invalidate TLB entry!
    - » Otherwise, might think that page is still in memory!

#### What TLB organization makes sense?

[Figure: CPU → TLB → Cache → Memory]

- Needs to be really fast
  - Critical path of memory access
    - » In simplest view: before the cache
    - » Thus, this adds to access time (reducing cache speed)
  - Seems to argue for Direct Mapped or Low Associativity
- However, needs to have very few conflicts!
  - With TLB, the Miss Time extremely high!
  - This argues that cost of Conflict (Miss Time) is much higher than slightly increased cost of access (Hit Time)
- Thrashing: continuous conflicts between accesses
  - What if use low order bits of page as index into TLB?
    - » First page of code, data, stack may map to same entry
    - » Need 3-way associativity at least?
  - What if use high order bits as index?
    - » TLB mostly unused for small programs
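The indexing trade-off for a direct-mapped TLB can be checked with a short experiment. The page numbers below are hypothetical (chosen so that each region starts on a well-aligned boundary), and a 64-entry TLB with 20-bit virtual page numbers is assumed.

```python
def tlb_index(vpn, entries=64, high=False, vpn_bits=20):
    """Pick a direct-mapped TLB slot from the low or high bits of the page number."""
    index_bits = entries.bit_length() - 1
    return vpn >> (vpn_bits - index_bits) if high else vpn & (entries - 1)

# Hypothetical virtual page numbers for the first page of each region.
first_pages = {"code": 0x00400, "data": 0x10000, "stack": 0x80000}
low = {name: tlb_index(vpn) for name, vpn in first_pages.items()}
print(low)   # low-order index: the three regions collide in one slot

# Sixteen consecutive code pages, indexed by high-order bits instead.
code_pages = range(0x00400, 0x00410)
high = {tlb_index(vpn, high=True) for vpn in code_pages}
print(high)  # high-order index: all sixteen pages share a single slot
```

Either choice leaves a direct-mapped TLB thrashing on realistic layouts, which is one argument for high associativity.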

### TLB organization: include protection

- How big does TLB actually have to be?
  - Usually small: 128-512 entries
  - Not very big, can support higher associativity
- TLB usually organized as fully-associative cache
  - Lookup is by Virtual Address
  - Returns Physical Address + other info
- What happens when fully-associative is too slow?
  - Put a small (4-16 entry) direct-mapped cache in front
  - Called a "TLB Slice"
- Example for MIPS R3000:

| Virtual Address | Physical Address | Dirty | Ref | Valid | Access | ASID |
|-----------------|------------------|-------|-----|-------|--------|------|
| 0xFA00          | 0x0003           | Y     | N   | Y     | R/W    | 34   |
| 0x0040          | 0x0010           | N     | Y   | Y     | R      | 0    |
| 0x0041          | 0x0011           | N     | Y   | Y     | R      | 0    |
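The table can be read as a tiny fully-associative TLB. The sketch below loads the three example rows and performs a lookup that matches on both page and ASID and enforces the Access field. Treating the address columns as page numbers, and the names `TlbEntry` and `lookup`, are assumptions made for illustration.

```python
from dataclasses import dataclass

@dataclass
class TlbEntry:
    vpn: int     # "Virtual Address" column, treated as a page number
    ppn: int     # "Physical Address" column
    dirty: bool
    ref: bool
    valid: bool
    access: str  # "R" or "R/W"
    asid: int

TLB = [
    TlbEntry(0xFA00, 0x0003, True, False, True, "R/W", 34),
    TlbEntry(0x0040, 0x0010, False, True, True, "R", 0),
    TlbEntry(0x0041, 0x0011, False, True, True, "R", 0),
]

def lookup(vpn, asid, write=False):
    """Return the physical page, None on a TLB miss, or raise on a protection violation."""
    for e in TLB:  # real hardware compares all entries in parallel
        if e.valid and e.vpn == vpn and e.asid == asid:
            if write and e.access != "R/W":
                raise PermissionError(f"write to read-only page {vpn:#x}")
            e.ref = True
            e.dirty = e.dirty or write
            return e.ppn
    return None

print(lookup(0x0040, asid=0))   # hit for address space 0
print(lookup(0x0040, asid=34))  # same page, different address space: miss
```

Because the ASID is part of the match, entries from different address spaces can coexist without a flush on every context switch.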

# Example: R3000 pipeline includes TLB "stages"

MIPS R3000 Pipeline

| Inst Fetch   | Dcd/Reg | ALU / E.A            | Memory  | Write Reg |
|--------------|---------|----------------------|---------|-----------|
| TLB, I-Cache | RF      | Operation, E.A., TLB | D-Cache | WB        |

TLB: 64 entry, on-chip, fully associative, software TLB fault handler

#### Virtual Address Space




#### Reducing translation time further

- As described, TLB lookup is in serial with cache lookup:




- Machines with TLBs go one step further: they overlap TLB lookup with cache access.
  - Works because offset available early




#### Summary #1/2

- The Principle of Locality:
  - Program likely to access a relatively small portion of the address space at any instant of time.
    - » Temporal Locality: Locality in Time
    - » Spatial Locality: Locality in Space
- Three (+1) Major Categories of Cache Misses:
  - Compulsory Misses: sad facts of life. Example: cold start misses.
  - Conflict Misses: increase cache size and/or associativity
  - Capacity Misses: increase cache size
  - Coherence Misses: Caused by external processors or I/O devices
- Cache Organizations:
  - Direct Mapped: single block per set
  - Set associative: more than one block per set
  - Fully associative: all entries equivalent

## Summary #2/2: Translation Caching (TLB)

- PTE: Page Table Entries
  - Includes physical page number
  - Control info (valid bit, writeable, dirty, user, etc)
- A cache of translations called a "Translation Lookaside Buffer" (TLB)
  - Relatively small number of entries (< 512)
  - Fully Associative (Since conflict misses expensive)
  - TLB entries contain PTE and optional process ID
- On TLB miss, page table must be traversed
  - If located PTE is invalid, cause Page Fault
- On context switch/change in page table
  - TLB entries must be invalidated somehow
- TLB is logically in front of cache
  - Thus, needs to be overlapped with cache access to be really fast
