+ Problem: how does the operating system get information from user memory? E.g., I/O buffers, parameter blocks. Note that the user passes the OS a virtual address.
+ 1. Use real addresses - in some cases the OS just runs unmapped. Then all it has to do is read the tables and translate user addresses in software.
+ Note: addresses that are contiguous in the virtual address space may not be contiguous physically. Thus I/O operations may have to be split up into multiple blocks. Draw an example.
+ 2. Can specify (somehow) that the data addresses are to use the user page tables. (Would need special hardware.)
+ Note that we therefore need two active PTBRs - a user PTBR and a system PTBR.
+ 3. Have OS page tables point to user pages.
+ 4. A few machines, most notably the VAX, make both system information and user information visible at once (but you can't touch system stuff unless running with the special kernel protection bit set). This makes life easy for the kernel, although it doesn't solve the I/O problem.
+ I.e., the OS is in everyone's address space.
+ VAX Addressing
+ Another example: the VAX.
+ The address is 32 bits; the top two bits select the segment. Four base-bound pairs define the page tables (system, P0, P1, unused).
+ Pages are 512 bytes long.
+ Read-write protection information is contained in the page table entries, not in the segment table.
+ One segment contains operating system stuff; two contain stuff of the current user process.
+ Potential problem: page tables can get big. Don't want to have to allocate them contiguously, especially for large user processes. The solution is to use the system page table to map the user page tables, so the user page tables can be scattered:
+ System base-bound pairs are physical addresses; the system tables must be contiguous.
+ User base-bound pairs are virtual addresses in the system space. This allows the user page tables to be scattered in non-contiguous pages of physical memory.
+ The result is a two-level scheme.
+ This is an alternative to the normal two-level scheme. If the normal two-level scheme were used, and if the page tables were paged, it would actually be a four-level scheme.
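+ To make the two-level VAX-style lookup concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not the actual hardware algorithm: physical memory is modeled as a dict from physical byte address to a page-table-entry dict, PTEs are assumed to be 4 bytes, and segment selection is omitted.

    PAGE_SIZE = 512                    # VAX pages are 512 bytes
    PTE_SIZE = 4                       # assumption: 4-byte page table entries

    def sys_translate(va, sys_base, sys_bound, physmem):
        # The system base-bound pair holds a PHYSICAL address, so the
        # system page table must be contiguous in physical memory.
        vpn, offset = divmod(va, PAGE_SIZE)
        assert vpn < sys_bound, "system length violation"
        pte = physmem[sys_base + PTE_SIZE * vpn]   # one memory reference
        assert pte["valid"]
        return pte["frame"] * PAGE_SIZE + offset

    def user_translate(va, u_base, u_bound, sys_base, sys_bound, physmem):
        # The user (P0/P1) base is a VIRTUAL address in system space, so
        # the user page tables may be scattered: the two-level scheme.
        vpn, offset = divmod(va, PAGE_SIZE)
        assert vpn < u_bound, "user length violation"
        pte_pa = sys_translate(u_base + PTE_SIZE * vpn,
                               sys_base, sys_bound, physmem)
        pte = physmem[pte_pa]                      # second memory reference
        assert pte["valid"]
        return pte["frame"] * PAGE_SIZE + offset

+ Note that every user reference costs two extra memory references here; that is exactly the overhead problem the TLB (below) addresses.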
+ Inverted Page Table
+ Idea is that the page table is organized as a hash table. Hash from the virtual address into a table with a number of entries larger than the physical memory size. (The page table is shared by all processes.)
+ Problem with segmentation and paging: extra memory references to access the translation tables can slow programs down by a factor of two or three. There are obviously too many translations required to keep them all in special processor registers.
+ But for small machines (e.g., the PDP-11), one can have one register for every page in memory, since they can only address 64 Kbytes.
+ Solution: Translation Lookaside Buffer (TLB), also called
+ Translation Buffer (TB) (DEC), or
+ Directory Lookaside Table (DLAT) (IBM), or
+ Address Translation Cache (ATC) (Motorola).
+ A TLB is used to store a few of the translation table entries. It's very fast, but only remembers a small number of entries. On each memory reference: (draw picture, explain name)
+ First ask the TLB if it knows about the page. If so, the reference proceeds fast.
+ If the TLB has no info for the page, the translator must go through the page and segment tables to get the info. The reference takes a long time, but give the info for this page to the TLB so it will know it for the next reference (the TLB must forget one of its current entries in order to record the new one).
+ TLB Organization: picture of a black box. The virtual page number goes in; the physical page location comes out. Similar to a cache.
+ So what the TLB does is:
+ Accept a virtual address
+ See if the virtual address matches an entry in the TLB
+ If so, return the real address
+ If not, ask the translator to provide the real address.
+ The translator loads the new translation into the TLB, replacing an old one. (Usually one not used recently.)
+ (Must replace an entry in the same set.)
+ Will the TLB work well if it holds only a few entries, and the program is very big?
+ Yes - due to the Principle of Locality. (Peter Denning)
+ Principle of Locality
+ 1. Temporal Locality - information that has been used recently is likely to continue to be used.
+ Alternate formulation - information in use now consists mostly of the same information as was used recently.
+ 2. Spatial Locality - info near the current locus of reference is also likely to be used in the near future.
+ Example - the top of a desk is a cache for the file cabinet. If the desk is messy, the stuff on top is likely to be what you need.
+ Explanation - code is either sequential or loops. Data used together is often clustered together (array elements, stack, etc.)
+ In practice, TLBs work quite well. Typically they find 96% to 99.9% of the translations.
+ A TLB is just a memory with some comparators. Typical size of the memory: 16-512 entries. Each entry holds a virtual page number and the corresponding physical page number. How can the memory be organized to find an entry quickly?
+ One possibility: search the whole table associatively on every reference. Hard to do for more than 32 or 64 entries.
+ A better possibility: restrict the info for any given virtual page to fall into a subset of the entries in the TLB. Then we only need to search that set. Called set associative. E.g., use the low-order bits of the virtual page number as the index to select the set. Real TLBs are either fully associative or set associative. If the size of the set is one, it is called direct mapped.
+ Diagram of set associative TLB. (A sketch in code follows below.)
+ Replacement must be in the same set.
+ The translator is a piece of hardware that knows how to translate virtual to real addresses. It uses the PTBR to find the page table(s), and reads the page table to find the page.
+ TLBs are a lot like hash tables except simpler (they must be, to be implemented in hardware). Some hash functions are better than others.
+ Is it better to use low page-number bits than high ones to select the set?
+ Low ones are best: if a large contiguous chunk of memory is being used, all its pages will fall in different sets.
+ Must be careful to flush the TB during each context swap. Why?
+ Otherwise, when we switch processes, we'll still be using the old translations from virtual to real, and will be addressing the wrong part of memory.
+ Alternative - can make a process identifier (PID) part of the virtual address. Have a Process Identifier Register (PIDR) which supplies that part of the address.
+ When we modify the page table, we must either flush the TLB or flush the entry that was modified.
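+ Pulling these pieces together, here is a minimal sketch in Python of a set-associative TLB with PID tags. All names and sizes are illustrative; a real TLB does this with comparators in hardware.

    import random

    NUM_SETS, WAYS = 16, 4         # illustrative: 64 entries total

    class TLB:
        def __init__(self):
            # Each entry is (pid, vpn, frame) or None.
            self.sets = [[None] * WAYS for _ in range(NUM_SETS)]

        def lookup(self, pid, vpn):
            s = self.sets[vpn % NUM_SETS]   # low-order VPN bits pick the set
            for e in s:
                if e and e[0] == pid and e[1] == vpn:
                    return e[2]             # hit: the physical frame
            return None                     # miss: caller asks the translator

        def insert(self, pid, vpn, frame):
            s = self.sets[vpn % NUM_SETS]   # must replace within this set
            victim = random.randrange(WAYS)
            for i, e in enumerate(s):
                if e is None:
                    victim = i              # prefer an empty way
                    break
            s[victim] = (pid, vpn, frame)

        def flush_entry(self, pid, vpn):
            # Needed whenever the OS modifies that page table entry.
            s = self.sets[vpn % NUM_SETS]
            for i, e in enumerate(s):
                if e and e[0] == pid and e[1] == vpn:
                    s[i] = None

+ With PID tags, a context switch needs only a change of the PIDR rather than a full flush; modifying a page table entry still requires flushing the matching TLB entry.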
Topic: Demand Paging, Thrashing, Working Sets
+ So far we have disentangled the programmer's view of memory from the system's view using a mapping mechanism. Each sees a different organization. This makes it easier for the OS to shuffle users around and simplifies memory sharing between users.
+ However, until now a user process had to be completely loaded into memory before it could run. (Sort of - we mentioned page faults and segment faults, but...) This is wasteful, since a process may only need a small amount of its total memory at any one time (locality).
+ Virtual memory permits a process to run with only some of its virtual address space loaded into physical memory.
+ A virtual address is translated to either (a) physical memory (small, fast) or (b) disk (backing store), which is large but slow.
+ Backing storage is typically disk.
+ The idea is to produce the illusion that the entire virtual address space is in main memory, when in fact it isn't.
+ More generally, we have a multi-level (2-level in this case) memory hierarchy. We want to have the cost of the slower and larger level, and the performance of the smaller and faster level.
+ Diagram of a memory hierarchy, showing access times.
+ The reason that this works is that most programs spend most of their time in only a small piece of the code.
+ Principle of Locality - there are two parts.
+ Temporal Locality - the same information is likely to be reused.
+ Spatial Locality - nearby information is also likely to be used in the near future.
+ (Idea invented (?) by Peter Denning.)
+ If not all of a process is loaded when it is running, what happens when it references a byte that is only in the backing store? Hardware and software cooperate to make things work anyway.
+ First, extend the page tables with an extra bit: ``present'' (or ``valid''). If present isn't set, then a reference to the page results in a trap. This trap is given a special name: page fault.
+ Page fault - an attempt to reference a page which is not in memory.
+ Diagram of a Page Table Entry. (Show real address, protection bits, valid/present bit, dirty bit, reference bit.)
+ Any page not in main memory right now has the ``present/valid'' bit cleared in its page table entry.
+ When a page fault occurs (a code sketch of these steps follows the list):
+ Trap to the OS (why?)
+ Verify that the reference is to a valid page; if not, abend.
+ Find a page frame to put the page in.
+ Find a page to replace, if there is no empty frame.
+ If it is dirty, find a place to put the replaced page on secondary storage. (Can reuse the previous location.)
+ Remove the page (either copy it back or overwrite it).
+ Update the page table.
+ Update the map of secondary storage if necessary (to show where we put the page).
+ Update the memory (core) map.
+ Flush the TLB entry for the page that has been removed.
+ The operating system brings the page into memory:
+ Find the page on secondary storage.
+ Transfer it.
+ Update the page table (set the valid bit and the real address).
+ Update the map of the file system/disk to show that the page is now in memory. (E.g., update the cache of inodes.)
+ Update the core map (memory map).
+ The process resumes execution. (I.e., it goes on the ready list; maybe it resumes.)
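+ A sketch, in Python, of the fault-handling steps just listed. The data structures are simplifications invented for the sketch (not any real OS's layout), and the victim is chosen arbitrarily here; replacement policies are discussed below.

    PAGE_TABLE = {}   # (proc, vpn) -> {"present", "frame", "dirty"} or absent
    FRAMES = {}       # frame -> (proc, vpn) resident there, or None if free
    MEMORY = {}       # frame -> page contents
    DISK = {}         # (proc, vpn) -> page contents on backing store
    TLB = {}          # (proc, vpn) -> frame, standing in for the real TLB

    def handle_page_fault(proc, vpn):
        pte = PAGE_TABLE.get((proc, vpn))
        if pte is None:
            raise RuntimeError("abend: invalid page")    # not a legal page
        free = [f for f, owner in FRAMES.items() if owner is None]
        if free:
            frame = free[0]                              # an empty frame
        else:
            # No empty frame: pick some victim (policy comes later).
            frame, (vproc, vvpn) = next(iter(FRAMES.items()))
            vpte = PAGE_TABLE[(vproc, vvpn)]
            if vpte["dirty"]:
                DISK[(vproc, vvpn)] = MEMORY[frame]      # copy victim back
            vpte["present"] = False                      # update its page table
            TLB.pop((vproc, vvpn), None)                 # flush its TLB entry
        MEMORY[frame] = DISK[(proc, vpn)]                # transfer the page in
        FRAMES[frame] = (proc, vpn)                      # update the core map
        pte.update(present=True, frame=frame, dirty=False)
        # The process now goes back on the ready list; it resumes later.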
+ Note that all of these steps take time. We may switch to another process while the I/O is taking place.
+ Multiprogramming is supposed to overlap the fetch of a page (or I/O) for one process with the execution of another.
+ If no process is available to run (all are doing I/O or handling page faults), this is called multiprogramming idle or page fetch idle.
+ Page out - to remove a page.
+ Page out a process - remove it from memory.
+ Page in a process - load its pages into memory.
+ Continuing (resuming) the process is very tricky, since the page fault may have occurred in the middle of an instruction. We don't want the user process to be aware that the page fault even happened.
+ Can the instruction just be skipped?
+ Suppose the instruction is restarted from the beginning?
+ How is the ``beginning'' located?
+ Even if the beginning is found, what about instructions with side effects, like MOV (SP)+, 10?
+ Without additional information from the hardware, it may be impossible to restart a process after a page fault. Machines that permit restarting must have hardware support to keep track of all the side effects so that they can be undone before restarting.
+ Early Apollo approach for the 68000: two processors, one just for handling page faults.
+ IBM 370 solution: execute long instructions twice.
+ If you think about this when designing the instruction set, it isn't too hard to make a machine support virtual memory. It's much harder to do after the fact.
+ How many page faults can occur in one instruction?
+ E.g., the instruction spans page boundaries, and each of two operands spans two pages. Could have a 2-level page table, with one page of the page table needed to point to each instruction & data page.
+ Once the hardware has provided the basic capabilities for virtual memory, the OS must implement three algorithms:
+ Page fetch algorithm: when to bring pages into memory.
+ Page replacement algorithm: which page(s) should be thrown out, and when.
+ Page placement algorithm: where to put the page in memory.
+ Note that the page placement algorithm for main memory is irrelevant - memory is uniform. (But the CRAY has non-uniform memory access time. It is also not irrelevant for other parts of the memory hierarchy.)
+ Page Fetch Algorithms:
+ Demand paging: start the process up with no pages loaded, and load a page when a page fault for it occurs, i.e., wait until it absolutely MUST be in memory. Almost all paging systems are like this.
+ Request paging: let the user say which pages are needed. What's wrong with this?
+ Users don't always know best, and aren't always impartial. They will overestimate needs. Maybe mention overlays here, although overlays are even more draconian than request paging.
+ Still need demand paging, in case the user doesn't remember to bring in the right page.
+ Prefetching, or prepaging: bring a page into memory before it is referenced (e.g., when one page is referenced, bring in the next one, just in case).
+ Reasons for prepaging are:
+ (a) bring in several pages at once - cuts the per-page overhead;
+ (b) eliminate the real-time delay in waiting for the page - overlap computation and fetch.
+ Idea is to guess at which page will be needed. Hard to do effectively without a prophet; may spend a lot of time doing wasted work. If used at all, typically one-block lookahead - i.e., the next one.
+ Seldom works.
+ Can also do "swapping" ("working set restoration"), whereby when you start a process, you swap in most or all of its pages, or at least all of the pages it was using the last time it was running. When it stops, you swap out its pages in a bunch on contiguous tracks on disk.
+ Also called working set restoration.
+ Overlays - a technique by which the user divides his program into segments. The user issues commands to load and unload the segments from memory; these commands specify the location in memory where the segments are placed. Used when there is no virtual memory, and the user is given a partition of real memory to work with.
+ Page Replacement Algorithms:
+ Random (RAND): pick any page at random.
+ FIFO: throw out the page that has been in memory the longest. The ideas are: (a) it's simple, and (b) the first page that was fetched is believed to be no longer needed.
+ LRU (least recently used): use the past to predict the future. Throw out the page that hasn't been used for the longest time. If there is locality, then this is presumably the best you can do.
+ MIN (or OPT): as always, the best algorithm arises if we can predict the future.
+ Throw out the page that won't be used for the longest time into the future. This requires a prophet, so it isn't practical, but it is good for comparison.
+ Real and Virtual Time
+ Virtual time is time as measured by a running process - it doesn't include time that the process is blocked (e.g., for a page fault or another reason). Often in units of memory references.
+ Real time - time as measured by the wall clock. Includes time that the process is blocked (including page faults).
+ How to evaluate paging algorithms:
+ What are the costs of page faults?
+ CPU overhead for the page fault - handler, dispatcher, I/O routines (e.g., 3000 instructions).
+ Possible CPU (multiprogramming) idle while the page arrives.
+ I/O busy while the page is transferred.
+ Main memory (or cache) interference while the page is transferred.
+ Real-time delay to handle the page fault.
+ Two approaches (metrics) for evaluating paging algorithms:
+ Curve of page faults vs. amount of space used - preferable. (Called the "parachor curve".)
+ Space-time product vs. amount of space. Want to minimize the STP. (Show curve.)
+ Space-time product (STP) - the integral of the amount of space used by the program over the time it runs. Includes time for page faults. This is the real space-time product.
+ The exact formula is the integral from 0 to E of m(t) dt, where E is the ending time of the program and m(t) is the memory used by the program at (real) time t.
+ In discrete time it is the sum over i from 0 to R of m(i) * (1 + f(i) * PFT), where R is the ending time of the program in discrete time (i.e., the number of memory references), i is the i'th memory reference, m(i) is the number of pages in memory at the i'th reference, f(i) is an indicator function (= 0 if there is no page fault, = 1 if there is), and PFT is the page fault time.
+ The sum of m(i) alone is the virtual space-time product; the second term adds in the time for page faults.
+ The space-time product can be computed approximately from the page-fault vs. space curve: STP = (F + pft * (number of page faults)) * (n bar), where F is the virtual running time of the program, pft is the time for a page fault to be handled, and n bar is the mean space occupied by the program.
+ The space-time product depends on the PFT, so it is technology dependent. It also doesn't take into account the fact that the machine may not be idle while the page is being fetched.
+ Example: try the reference string 4, 3, 2, 1, 4, 3, 5, 4, 3, 2, 1, 5. Assume there are three or four page frames of physical memory. Show the memory allocation state after each memory reference.
+ Do for MIN, LRU, FIFO (a small simulation sketch follows below).
+ See figures.
+ Note the anomaly for FIFO - we would like the miss ratio to decline with increasing memory size.
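+ A small simulation sketch of the three policies on the reference string above (plain Python; nothing here is from the original notes except the string itself). Running it shows FIFO's anomaly: 9 faults with three frames but 10 with four, while LRU and MIN never get worse with more memory.

    def simulate(policy, refs, nframes):
        """Count page faults for FIFO, LRU, or MIN on a reference string."""
        frames, faults = [], 0
        for t, page in enumerate(refs):
            if page in frames:
                if policy == "LRU":          # a hit refreshes recency
                    frames.remove(page)
                    frames.append(page)
                continue
            faults += 1
            if len(frames) == nframes:       # memory full: must evict
                if policy in ("FIFO", "LRU"):
                    victim = frames[0]       # oldest arrival / least recent
                else:                        # MIN: needed furthest in future
                    future = refs[t + 1:]
                    victim = max(frames,
                                 key=lambda p: future.index(p)
                                 if p in future else len(future) + 1)
                frames.remove(victim)
            frames.append(page)
        return faults

    refs = [4, 3, 2, 1, 4, 3, 5, 4, 3, 2, 1, 5]
    for n in (3, 4):
        for pol in ("FIFO", "LRU", "MIN"):
            print(n, "frames,", pol, ":", simulate(pol, refs, n), "faults")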
+ Stack Algorithm - an algorithm which obeys the inclusion property: the set of pages in a memory of size N at time t is always a subset of the set of pages in a memory of size N+1 at time t. Such an algorithm obviously cannot have the miss ratio increase with memory size.
+ The stack is the list of pages in order of the size of memory which includes them.
+ Implementing LRU: need some form of hardware support in order to keep track of which pages have been used recently.
+ Perfect LRU? Keep a register for each page, and store the system clock into that register on each memory reference. To replace a page, scan through all of them to find the one with the oldest clock. This is expensive if there are a lot of memory pages. (Sketched below.)
+ Or, could use a linked list to maintain an "LRU stack". Note that we can see (by inspection) that with LRU, the miss ratio will never increase with an increasing number of pages in memory.
+ In practice, almost nobody implements perfect LRU. (The CDC STAR did.)
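+ A minimal sketch of perfect LRU as just described: a timestamp "register" per resident page, written on every reference, with a full scan to find the oldest page on replacement. The scan is the expensive part.

    import itertools

    class PerfectLRU:
        def __init__(self, nframes):
            self.nframes = nframes
            self.clock = itertools.count()   # stands in for the system clock
            self.stamp = {}                  # page -> time of last reference

        def reference(self, page):
            if page not in self.stamp and len(self.stamp) == self.nframes:
                # The expensive part: scan every page for the oldest stamp.
                victim = min(self.stamp, key=self.stamp.get)
                del self.stamp[victim]
            self.stamp[page] = next(self.clock)  # store the clock on each ref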
+ Instead, we settle for an approximation that is efficient. Just find an old page, not necessarily the oldest.
+ LRU is just an approximation anyway, so why not approximate a little more?
+ Use bit (reference bit) - a bit in the page table entry (usually cached in the TLB) that is set when the page is referenced. It is turned off under OS control.
+ Clock algorithm: keep a ``use'' bit for each page frame; the hardware sets the bit for the referenced page on every memory reference. Have a pointer pointing to the k'th page frame. When a fault occurs, look at the use bit of the page being pointed to. If it is on, turn it off, increment the pointer, and repeat. If it is off, replace the page in that page frame and set use(k)=1. (Clock diagram; a sketch in code follows below.)
+ Also called FINUFO - first in, not used, first out.
+ In effect, the use bit, when used with the clock algorithm, breaks the pages into two groups: those "in use" and those "not in use". We want to replace one of the latter.
+ What does it mean if the clock hand is sweeping very slowly?
+ What does it mean if the clock hand is sweeping very fast?
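+ A minimal sketch of the clock algorithm (the frame count and structures are illustrative; in a real system the use bits are set by the hardware, not by this code):

    class Clock:
        def __init__(self, nframes):
            self.page = [None] * nframes   # which page is in each frame
            self.use = [0] * nframes       # the use (reference) bits
            self.hand = 0                  # the clock pointer

        def reference(self, p):
            if p in self.page:
                self.use[self.page.index(p)] = 1   # hardware would set this
                return
            # Fault: sweep, turning use bits off, until one is already off.
            while self.use[self.hand]:
                self.use[self.hand] = 0
                self.hand = (self.hand + 1) % len(self.page)
            self.page[self.hand] = p       # replace the page in this frame
            self.use[self.hand] = 1        # set use(k) = 1
            self.hand = (self.hand + 1) % len(self.page)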
+ Some systems also use a ``dirty'' bit to give some extra preference to dirty pages. This is because it is more expensive to throw out dirty pages: clean ones need not be written to disk.
+ What are the tradeoffs here?
+ The cost of a page fault declines - lower probability of writing out a dirty block.
+ The probability of a fault increases - i.e., if clock was a good algorithm, messing with it should make it worse.
+ How would Least Frequently Used replacement work?
+ It would be a disaster, since locality changes.
+ A per-process replacement algorithm, or local page replacement algorithm, or per-job replacement algorithm, allocates page frames to individual processes: a page fault in one process can only replace one of that process's frames. This relieves interference from other processes.
+ If all pages from all processes are lumped together by the replacement algorithm, then it is said to be a global replacement algorithm. Under this scheme, each process competes with all of the other processes for page frames.
+ If you are using a local replacement algorithm, you have partitioned memory among the jobs or processes.
+ Local algorithm:
+ Protects jobs from others which are badly behaved.
+ Hard to decide how much space to allocate to each process.
+ Allocation may be unreasonable.
+ Global algorithm:
+ Permits the memory allocation for a process to shift over time.
+ Permits the memory allocation to adapt to process needs.
+ Permits a badly behaved process to grab too much memory.
+ Thrashing: a situation in which the page fault rate is so high that the system spends most of its time either processing a page fault or waiting for a page to arrive.
+ Thrashing means that there is too much page fetch idle - time when the processor is idle waiting for a page to arrive.
+ Suppose there are many users, and that between them their processes are making frequent references to 50 pages, but memory has 40 pages.
+ Each time one page is brought in, another page, whose contents will soon be referenced, is thrown out.
+ Compute the average memory access time.
+ The system will spend all of its time reading and writing pages. It will be working very hard but not getting anything done.
+ The progress of the programs will make it look like the access time of memory is as slow as disk, rather than disks being as fast as memory.
+ Plot of CPU utilization vs. level of multiprogramming.
+ Thrashing was a severe problem in early demand paging systems.
+ Thrashing occurs because the system doesn't know when it has taken on more work than it can handle. LRU mechanisms order pages in terms of last access, but don't give absolute numbers indicating pages that mustn't be thrown out.
+ What do humans do when thrashing? If flunking all courses at midterm time, drop one.
+ Solutions to Thrashing:
+ If a single process is too large for memory, there is nothing the OS can do. That process will simply thrash. (Buy more memory.)
+ If the problem arises because of the sum of several processes:
+ Figure out how much memory each process needs. Change scheduling priorities to run processes in groups whose memory needs can be satisfied.
+ Shed load.
+ Change the paging algorithm.
+ Working Sets are a solution proposed by Peter Denning. An informal definition is
+ Working set = ``the set of pages that a process is working with, and which must thus be resident if the process is to avoid thrashing.''
+ The idea is to use the recent needs of a process to predict its future needs.
+ Formally, ``exactly that set of pages used in the preceding T virtual time units.'' (T is usually given in units of memory references.)
+ Choose T, the working set parameter. At any given time, all pages referenced by a process in its last T seconds of execution are considered to comprise its working set.
+ The Working Set Paging Algorithm keeps in memory exactly those pages used in the preceding T time units.
+ Minimum values for T are about 10,000 to 100,000 memory references.
+ A process will never be executed unless its working set is resident in main memory. Pages outside the working set may be discarded at any time.
+ Note that this requires a reservoir of unassigned page frames.
+ Working set paging requires that the sum of the sizes of the working sets of the jobs eligible to run (which we will call the balance set) be less than or equal to the amount of space available. We previously referred to the balance set as the jobs in the in-memory queue.
+ Some algorithm must be provided for moving processes into and out of the balance set. What happens if the balance set changes too frequently?
+ Still get thrashing.
+ As working sets change, corresponding changes will have to be made in the balance set.
+ Working set also has the advantage over LRU that it adjusts the amount of space in use according to what the process needs. LRU works with a fixed amount of space, even though a process's needs change.
+ How do we implement working set? Can it be done exactly?
+ One of the initial plans was to store some sort of a capacitor with each memory page. The capacitor would be charged on each reference, then would discharge slowly if the page wasn't referenced. Tau would be determined by the size of the capacitor. This wasn't actually implemented. One problem is that we want separate working sets for each process, so the capacitor should only be allowed to discharge when a particular process executes.
+ What if a page is shared?
+ Actual solution: take advantage of the use bits. (Sketched below.)
+ The OS maintains an idle-time value for each page: the amount of CPU time received by the process since the last access to the page.
+ Every once in a while, scan all pages of a process. For each use bit that is on, clear the page's idle time. For each use bit that is off, add the process's CPU time (since the last scan) to the idle time. Turn all use bits off during the scan.
+ Scans happen on the order of every few seconds (in Unix, the scan interval is on the order of a minute or more).
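+ A sketch of this sampling scan in Python. The Page and Proc fields are hypothetical stand-ins for the hardware use bit and the OS-maintained idle times; T is the working set parameter.

    class Page:
        def __init__(self):
            self.use_bit = 0      # set by hardware on each reference
            self.idle_time = 0    # process CPU time since last access

    class Proc:
        def __init__(self, npages):
            self.pages = [Page() for _ in range(npages)]
            self.resident = set(self.pages)
            self.cpu_since_scan = 0   # CPU time received since last scan

    def scan_working_set(proc, T):
        for page in proc.pages:
            if page.use_bit:
                page.idle_time = 0               # used recently: in the set
            else:
                page.idle_time += proc.cpu_since_scan
                if page.idle_time > T:
                    proc.resident.discard(page)  # outside the working set
            page.use_bit = 0                     # turn all use bits off
        proc.cpu_since_scan = 0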
+ What is the overhead of sampling the reference bits regularly?
+ Assume a sample every 10,000 memory references, a 40-Mbyte memory with 4K pages (10,000 pages), and 5 instructions (with 10 memory refs) to sample one bit. Then each scan takes 100,000 memory refs just to record the use bits.
+ Other questions about working sets and memory management in general:
+ What should T be?
+ What if it's too large?
+ What if it's too small?
+ Plot STP vs. T, and page fault rate vs. T.
+ What algorithms should be used to determine which processes are in the balance set?
+ How much memory is needed in order to keep the CPU busy? Note that under working set methods the CPU may occasionally sit idle even though there are runnable processes.
+ (How do we compute working sets if pages are shared?)
+ Working Set Restoration
+ Idea is that when we remove a process from the in-memory queue, we know what its working set is.
+ When we run the process again (i.e., promote it to the in-memory queue), we can restore the working set to memory all at once.
+ Advantages:
+ Minimize CPU overhead.
+ Don't have to wait for each page fault -> all transfers happen at once.
+ Can optimize layout when writing out, and can fetch from consecutive locations.
+ Or can just sort the fetches, so that the average latency is much smaller.
+ A problem with working set is that even the approximate implementation above has a lot of overhead. Instead, Opderbeck and Chu created an algorithm called
+ Page Fault Frequency - let X be the virtual time since the last page fault for this process.
+ At the time of a page fault: if X > T, remove all pages (of the process) with the use bit off. Then get a page frame for the new page, and turn off all reference bits for the process.
+ Idea was to make this a quick and easy way to implement working set. The idea is that as long as the process is faulting too often (X <= T), nothing is removed and its allocation grows; only when faults become infrequent are the unused pages dropped. (A sketch follows.)
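+ A sketch of page fault frequency as described above, reusing the hypothetical Page/Proc fields from the working-set sketch plus a per-process last_fault_time; "now" is the process's virtual time.

    def pff_fault(proc, new_page, T, now):
        X = now - proc.last_fault_time       # virtual time since last fault
        if X > T:
            # Faulting infrequently: drop the pages not referenced since
            # the last fault (use bit off).
            for page in list(proc.resident):
                if not page.use_bit:
                    proc.resident.discard(page)
        # If X <= T (faulting too often), nothing is removed, so the
        # process's allocation grows.
        for page in proc.resident:
            page.use_bit = 0                 # turn off all reference bits
        proc.resident.add(new_page)          # get a frame for the new page
        proc.last_fault_time = now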