# CS162 Operating Systems and Systems Programming Lecture 14 Caching (Finished), Demand Paging October 15<sup>th</sup>, 2017 Ion Stoica http://cs162.eecs.Berkeley.edu ## Recall: In Machine Structures (eg. 61C) ... HitRate = $$90\% = Avg. Access Time = (0.9 \times 10) + (0.1 \times 100) = 19ns$$ HitRate = $$99\%$$ => Avg. Access Time= $(0.99 \times 10) + (0.01 \times 100) = 10.9$ ns #### Review: Direct Mapped Cache - Direct Mapped 2<sup>N</sup> byte cache: - The uppermost (32 N) bits are always the Cache Tag - The lowest M bits are the Byte Select (Block Size = $2^{M}$ ) - Example: I KB Direct Mapped Cache with 32 B Blocks - Index chooses potential block - Tag checked to verify block - $\frac{1}{31}$ Byte select chooses byte within block **Cache Tag Cache Index Byte Select** Ex: 0x01 Ex: 0x00 Ex: 0x50 Valid Bit **Cache Tag Cache Data** Byte 1 Byte 0 Byte 31 Byte 33 Byte 32 0x50Byte 63 3 **Byte 1023 Byte 992** 31 # Fully Associative Cache - Fully Associative: Every block can hold any line - Address does not include a cache index - Compare Cache Tags of all Cache Entries in Parallel - Example: Block Size=32B blocks - We need N 27-bit comparators - Still have byte select to choose from within block #### Set Associative Cache - N-way set associative: N entries per Cache Index - N direct mapped caches operates in parallel - Example: Two-way set associative cache - Cache Index selects a "set" from the cache - Two tags in the set are compared to input in parallel #### Where does a Block Get Placed in a Cache? • Example: Block 12 placed in 8 block cache #### **32-Block Address Space:** Block 1111111111222222222233 no. 0123456789012345678901 #### **Direct mapped:** block 12 can go only into block 4 (12 mod 8) Block 0 1 2 3 4 5 6 7 no. #### **Set associative:** block 12 can go anywhere in set 0 (12 mod 4) Block 0 1 2 3 0 1 2 3 no. Set 0 Set 1 #### **Fully associative:** block 12 can go anywhere Block 0 1 2 3 4 5 6 7 no. # Which block should be replaced on a miss? - Easy for Direct Mapped: Only one possibility - Set Associative or Fully Associative: - Random - LRU (Least Recently Used) - Miss rates for a workload: | | 2-way | | 4-way | | 8-way | | |--------|------------|-------|------------|-------|------------|-------| | Size | LRU Random | | LRU Random | | LRU Random | | | I6 KB | 5.2% | 5.7% | 4.7% | 5.3% | 4.4% | 5.0% | | 64 KB | 1.9% | 2.0% | 1.5% | 1.7% | 1.4% | 1.5% | | 256 KB | 1.15% | 1.17% | 1.13% | 1.13% | 1.12% | 1.12% | ## What happens on a write? - Write through (WT): The information is written to both the block in the cache and to the block in the lower-level memory - Write back (WB): The information is written only to the block in the cache - Modified cache block is written to main memory only when it is replaced - Question is block clean or dirty? - Pros and Cons of each? - -WT: - » PRO: read misses cannot result in writes - » CON: Processor held up on writes unless writes buffered - -WB: - » PRO: repeated writes not sent to DRAM processor not held up on writes - » CON: More complex Read miss may require writeback of dirty data Caching Applied to Address Translation - Question is one of page locality: does it exist? - Instruction accesses spend a lot of time on the same page (since accesses sequential) - Stack accesses have definite locality of reference - Data accesses have less page locality, but still some... - Can we have a TLB hierarchy? - Sure: multiple levels at different sizes/speeds # Recall: What Actually Happens on a TLB Miss? - Software traversed Page tables - On TLB miss, processor receives TLB fault - Kernel traverses page table to find PTE - » If PTE valid, fills TLB and returns from fault - » If PTE marked as invalid, internally calls Page Fault handler - Hardware traversed page tables: - On TLB miss, hardware in MMU looks at current page table to fill TLB (may walk multiple levels) - » If PTE valid, hardware fills TLB and processor never knows - » If PTE marked as invalid, causes Page Fault, after which kernel decides what to do afterwards - Most chip sets provide hardware traversal - Modern operating systems tend to have more TLB faults since they use translation for many things - Examples: - » shared segments - » user-level portions of an operating system # Transparent Exceptions: TLB/Page fault (1/2) - How to transparently restart faulting instructions? - (Consider load or store that gets TLB or Page fault) - Could we just skip faulting instruction? - » No: need to perform load or store after reconnecting physical page # Transparent Exceptions: TLB/Page fault (2/2) - Hardware must help out by saving: - Faulting instruction and partial state - » Need to know which instruction caused fault - » Is single PC sufficient to identify faulting position???? - Processor State: sufficient to restart user thread - » Save/restore registers, stack, etc - What if an instruction has side-effects? # Consider weird things that can happen - What if an instruction has side effects? - Options: - » Unwind side-effects (easy to restart) - » Finish off side-effects (messy!) - Example I: mov (sp)+,10 - » What if page fault occurs when write to stack pointer? - » Did sp get incremented before or after the page fault? - Example 2: strcpy (r1), (r2) - » Source and destination overlap: can't unwind in principle! - » IBM S/370 and VAX solution: execute twice once read-only - What about "RISC" processors? - For instance delayed branches? - » Example: bne somewhere ld r1,(sp) - » Precise exception state consists of two PCs: PC and nPC (next PC) - Delayed exceptions: - » Example: div r1, r2, r3 ld r1, (sp) - » What if takes many cycles to discover divide by zero, but load has already caused page fault? #### **Precise Exceptions** - Precise ⇒ state of the machine is preserved as if program executed up to the offending instruction - All previous instructions completed - Offending instruction and all following instructions act as if they have not even started - Same system code will work on different implementations - Difficult in the presence of pipelining, out-of-order execution, ... - MIPS takes this position - Imprecise ⇒ system software has to figure out what is where and put it all back together - Performance goals often lead to forsaking precise interrupts - System software developers, user, markets etc. usually wish they had not done this - Modern techniques for out-of-order execution and branch prediction help implement precise interrupts 11/15/18 CS162 ©UCB Fall 18 Lec 14.14 - Needs to be really fast - Critical path of memory access - » In simplest view: before the cache - » Thus, this adds to access time (reducing cache speed) - Seems to argue for Direct Mapped or Low Associativity - However, needs to have very few conflicts! - With TLB, the Miss Time extremely high! - This argues that cost of Conflict (Miss Time) is much higher than slightly increased cost of access (Hit Time) - Thrashing: continuous conflicts between accesses - What if use low order bits of page as index into TLB? - » First page of code, data, stack may map to same entry - » Need 3-way associativity at least? - What if use high order bits as index? - » TLB mostly unused for small programs #### Reducing translation time further As described, TLB lookup is in serial with cache lookup: - Machines with TLBs go one step further: they overlap TLB lookup with cache access. - Works because offset available early # Overlapping TLB & Cache Access (1/2) #### • Main idea: - Offset in virtual address exactly covers the "cache index" and "byte select" - Thus can select the cached byte(s) in parallel to perform address translation virtual address Virtual Page # Offset physical address tag / page # index byte #### Overlapping TLB & Cache Access Here is how this might work with a 4K cache: - What if cache size is increased to 8KB? - Overlap not complete - Need to do something else. See CS152/252 - Another option: Virtual Caches - Tags in cache are virtual addresses - Translation only happens on cache misses 11/15/18 CS162 ©UCB Fall 18 Lec 14.18 # Putting Everything Together: Address Translation ## Putting Everything Together: TLB ## Putting Everything Together: Cache # **BREAK** ## Next Up: What happens when ... # Where are all places that caching arises in OSes? - Direct use of caching techniques - TLB (cache of PTEs) - Paged virtual memory (memory as cache for disk) - File systems (cache disk blocks in memory) - DNS (cache hostname => IP address translations) - Web proxies (cache recently accessed pages) - Which pages to keep in memory? - All-important "Policy" aspect of virtual memory - Will spend a bit more time on this in a moment ## Impact of caches on Operating Systems (1/2) - Indirect dealing with cache effects (e.g., sync state across levels) - Maintaining the correctness of various caches - E.g., TLB consistency: - » With PT across context switches? - » Across updates to the PT? - Process scheduling - Which and how many processes are active? Priorities? - Large memory footprints versus small ones? - Shared pages mapped into VAS of multiple processes ? ## Impact of caches on Operating Systems (2/2) - Impact of thread scheduling on cache performance - Rapid interleaving of threads (small quantum) may degrade cache performance - » Increase average memory access time (AMAT) !!! - Designing operating system data structures for cache performance #### Working Set Model • As a program executes it transitions through a sequence of "working sets" consisting of varying sized subsets of the address space #### Cache Behavior under WS model - Amortized by fraction of time the Working Set is active - Transitions from one WS to the next - Capacity, Conflict, Compulsory misses - Applicable to memory caches and pages. Others? #### Another model of Locality: Zipf - Likelihood of accessing item of rank r is $\alpha$ 1/r<sup>a</sup> - Although rare to access items below the top few, there are so many that it yields a "heavy tailed" distribution - Substantial value from even a tiny cache - Substantial misses from even a very large cache # **Demand Paging** - Modern programs require a lot of physical memory - Memory per system growing faster than 25%-30%/year - But they don't use all their memory all of the time - 90-10 rule: programs spend 90% of their time in 10% of their code - Wasteful to require all of user's code to be in memory • Solution: use main memory as cache for disk ## Illusion of Infinite Memory (1/2) - Disk is larger than physical memory ⇒ - In-use virtual memory can be bigger than physical memory - Combined memory of running processes much larger than physical memory - » More programs fit into memory, allowing more concurrency ## Illusion of Infinite Memory (2/2) - Principle: Transparent Level of Indirection (page table) - Supports flexible placement of physical data - » Data could be on disk or somewhere across network - Variable location of data transparent to user program - » Performance issue, not correctness issue # Since Demand Paging is Caching, Must Ask... - What is block size? - I page - What is organization of this cache (i.e. direct-mapped, set-associative, fully-associative)? - Fully associative: arbitrary virtual → physical mapping - How do we find a page in the cache when look for it? - First check TLB, then page-table traversal - What is page replacement policy? (i.e. LRU, Random...) - This requires more explanation... (kinda LRU) - What happens on a miss? - Go to lower level to fill miss (i.e. disk) - What happens on a write? (write-through, write back) - Definitely write-back need dirty bit! # Summary: Steps in Handling a Page Fault 11/15/18 ## Summary - A cache of translations called a "Translation Lookaside Buffer" (TLB) - Relatively small number of PTEs and optional process IDs (< 512)</li> - Fully Associative (Since conflict misses expensive) - On TLB miss, page table must be traversed and if located PTE is invalid, cause Page Fault - On change in page table, TLB entries must be invalidated - TLB is logically in front of cache (need to overlap with cache access) - Precise Exception specifies a single instruction for which: - All previous instructions have completed (committed state) - No following instructions nor actual instruction have started - Can manage caches in hardware or software or both - Goal is highest hit rate, even if it means more complex cache management