Kye Hyun Kim
Roland Carlos
3/2/2005
Lecture 12
=========================================
Topic: How to evaluate paging algorithms.

If we have page faults, what are the costs we incur because of them?
- The CPU overhead for the page fault: we have to run the fault handler, the dispatcher, and the I/O routines.
- The CPU may have to remain idle while the page arrives.
- I/O may have to busy wait while the page is being transferred.
- There is main memory (or cache) interference while the page is being transferred.
- And of course, the real time delay to handle the page fault.

There are two approaches (metrics) to evaluate paging algorithms.
1. Plot page faults vs. amount of space used. This is known as a "parachor curve".
   See http://webdisk.berkeley.edu/~rollins/lecture1.jpg for a graph of faults vs. space.
   This is the more commonly used measure, because it makes the tradeoff between faults and space easy to see.
2. Plot space-time product (STP) vs. amount of space. We want to minimize the STP.
   See http://webdisk.berkeley.edu/~rollins/lecture2.jpg for a graph of space vs. time.
   See http://webdisk.berkeley.edu/~rollins/lecture3.jpg for a graph of STP vs. space.

What is the space-time product?
- It is the integral of the amount of space used by the program over the time it runs, including the time spent handling page faults.
- If we use real time, the exact formula is integral(0, E) [m(t) dt], where
  - E is the ending time of the program.
  - m(t) is the memory used by the program at time t (t measured in real time).
- If we measure in discrete time, the formula is instead
  sum(i = 0, R) [m(i) * (1 + f(i) * PFT)], where
  - R is the ending time of the program in discrete time (i.e. the number of memory references),
  - i is the i'th memory reference,
  - m(i) is the number of pages in memory at the i'th reference,
  - f(i) is an indicator function (= 0 if there is no page fault at reference i, = 1 if there is),
  - PFT is the page fault time.
  The first term is the virtual space-time product; the second term adds in the time for page faults.
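The discrete-time formula translates directly into a short computation. A minimal sketch (the function name and the sample traces for m(i) and f(i) are mine, purely illustrative):

```python
# Discrete space-time product:  STP = sum over i of m(i) * (1 + f(i) * PFT)
# Each reference costs 1 virtual time unit, plus PFT extra when it faults.

def space_time_product(m, f, pft):
    """m: pages resident at each reference; f: 1 where that reference faulted."""
    return sum(mi * (1 + fi * pft) for mi, fi in zip(m, f))

# Made-up trace: the program faults while touching its first 4 pages, then
# mostly runs resident, faulting once more near the end.
m = [1, 2, 3, 4, 4, 4, 4, 4]
f = [1, 1, 1, 1, 0, 0, 1, 0]
print(space_time_product(m, f, pft=1000))   # prints 14026
```

Note how PFT dominates: the five fault-free references contribute only 20 of the 14026 units, which is why STP is so technology dependent.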
- The space-time product can be computed approximately from the page fault vs. space curve. The approximate calculation is
  STP = (virtual running time of the program (F) + PFT * number of page faults) * (mean space occupied by the program (n bar)).
- The space-time product depends on PFT, so it is technology dependent (i.e. we may not get the same results from different machines even if everything else is the same). The STP also does not take into account the fact that the machine may not be idle while a page is being fetched.
=================================================
Topic: Example runs of certain paging algorithms.

We use the given reference string for all the tests: 4, 3, 2, 1, 4, 3, 5, 4, 3, 2, 1, 5

Least Recently Used (LRU)
- Method: When we have to bring a new page into a full page table, we kick out the page that was least recently used (i.e. least recently referenced).

4 pages test (* indicates that a page fault happened, i.e. we had to go to disk to bring this page into the page table):

4   3   2   1   4   3   5   4   3   2   1   5       reference string
----------------------------------------------------
4*  3*  2*  1*  4   3   5*  4   3   2*  1*  5*      most recently used page
    4   3   2   1   4   3   5   4   3   2   1       2nd most recently used page
        4   3   2   1   4   3   5   4   3   2       3rd most recently used page
            4   3   2   1   1   1   5   4   3       least recently used page

Result: 8 page faults

3 pages test:

4   3   2   1   4   3   5   4   3   2   1   5       reference string
----------------------------------------------------
4*  3*  2*  1*  4*  3*  5*  4   3   2*  1*  5*      most recently used page
    4   3   2   1   4   3   5   4   3   2   1       2nd most recently used page
        4   3   2   1   4   3   5   4   3   2       least recently used page

Result: 10 page faults

First In First Out (FIFO)
- Method: When we have to bring a new page into a full page table, we kick out the page that is the oldest in the page table (i.e. the first one in). This happens regardless of whether we referenced that oldest page more recently than another page.

4 pages test (* means the same as above.
# indicates that the page is the oldest page in the page table and will be kicked out if a page fault happens and there is no more space in the page table):

4   3   2   1   4   3   5   4   3   2   1   5       reference string
----------------------------------------------------
4*  4#  4#  4#  4#  4#  5*  5   5   5#  1*  1
    3*  3   3   3   3   3#  4*  4   4   4#  5*
        2*  2   2   2   2   2#  3*  3   3   3#
            1*  1   1   1   1   1#  2*  2   2

Result: 10 page faults

3 pages test:

4   3   2   1   4   3   5   4   3   2   1   5       reference string
----------------------------------------------------
4*  4#  4#  1*  1   1#  5*  5   5   5   5#  5#
    3*  3   3#  4*  4   4#  4#  4#  2*  2   2
        2*  2   2#  3*  3   3   3   3#  1*  1

Result: 9 page faults

- Note the anomaly here (known as Belady's Anomaly): when we increased the number of page frames, we actually increased the number of page faults as well. This is not always the case (and in fact should not be the case: we want the miss ratio to decline with increasing memory), which is why this case is known as an anomaly.

Optimum (OPT)
- Method: The algorithm is simple: we replace the page whose next reference is farthest in the future, i.e. the page that will be referenced later than all the other pages in the table. This delays page faults until they are absolutely necessary.
- This is also known as the Minimum (MIN) algorithm.

4 pages test (* means the same as above):

4   3   2   1   4   3   5   4   3   2   1   5       reference string
----------------------------------------------------
4*  3*  2*  1*  4   3   5*  4   3   2   1*  5
    4   4   4   1   4   4   5   5   5   5   1
        3   3   3   1   3   3   4   3   2   2
            2   2   2   2   2   2   4   3   3

Result: 6 page faults

3 pages test:

4   3   2   1   4   3   5   4   3   2   1   5       reference string
----------------------------------------------------
4*  3*  2*  1*  4   3   5*  4   3   2*  1*  5
    4   4   4   1   4   4   5   5   5   5   1
        3   3   3   1   3   3   4   3   2   2

Result: 7 page faults

- Note: A page only moves up in the order when it is referenced; otherwise it moves down or stays in place.
- Note: The OPT 3 pages test is just the 4 pages test without the fourth line. In the 3 pages test, we still replace the page that will not be used for the longest time into the future. To get the 2 pages test for OPT, just remove the third line as well.
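The fault counts in the tables above can be double-checked with a small simulator. This is a sketch of my own (the function and its structure are not from the notes), assuming each policy's eviction rule as described:

```python
# Count page faults for LRU, FIFO, and OPT on a reference string.
def count_faults(refs, frames, policy):
    mem = []          # resident pages; front of the list is the victim
    faults = 0
    for i, p in enumerate(refs):
        if p in mem:
            if policy == "LRU":              # a hit refreshes recency
                mem.remove(p)
                mem.append(p)
            continue
        faults += 1
        if len(mem) == frames:               # full: must evict
            if policy == "OPT":              # evict page used farthest ahead
                future = refs[i + 1:]
                victim = max(mem, key=lambda q: future.index(q)
                             if q in future else len(future) + 1)
                mem.remove(victim)
            else:                            # LRU and FIFO evict the front
                mem.pop(0)
        mem.append(p)
    return faults

refs = [4, 3, 2, 1, 4, 3, 5, 4, 3, 2, 1, 5]
for policy in ("LRU", "FIFO", "OPT"):
    print(policy, count_faults(refs, 4, policy), count_faults(refs, 3, policy))
# prints: LRU 8 10 / FIFO 10 9 / OPT 6 7
```

The output reproduces the hand traces, including Belady's Anomaly for FIFO (10 faults with 4 frames, 9 with 3).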
- The main problem with OPT is that it requires knowledge of the future (i.e. what pages will be referenced when), which is almost always impossible to have.
======================
Topic: Stack Algorithm
- A stack algorithm is an algorithm which obeys the inclusion property.
- The inclusion property states that the set of pages in a memory of size N at time t is always a subset of the set of pages in a memory of size N+1 at time t. Because of this, the miss ratio obviously cannot increase with memory size. (We avoid Belady's Anomaly.)
- The stack is a list of pages ordered by the smallest memory size that would include them.
=======================
Topic: Implementing LRU
- It sounds simple, but it is hard to implement. We need some form of hardware support in order to keep track of which pages have been used recently.
- Perfect LRU: Keep a register for each page and store the system clock into that register on each memory reference. To replace a page, we scan through all of the registers to find the one with the oldest clock value (thus the LRU page). However, this is not quite perfect: it is expensive if there are a lot of memory pages.
- LRU stack: Whenever a page is referenced, it is removed from its place in the stack and placed on top. That way, the top is always the most recently used page, while the bottom is always the LRU page. This makes it easy to push out a page, since the victim is always the bottom page.
- Note that we can see (by inspection) that with LRU, the miss ratio will never increase with an increasing number of pages in memory.
- Perfect LRU is hard to implement in practice. We settle for an approximation that is efficient: just find an old page, not necessarily the oldest one.
======================
Topic: Clock Algorithm
- Before we talk about the clock algorithm, we must mention use bits.
- A use bit (also known as a reference bit) is a bit in the page table entry (usually cached in the TLB) that is set when the page is referenced. It is only turned off by the OS.
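The LRU stack described under "Implementing LRU" maps naturally onto an ordered dictionary. A minimal sketch (the class and method names are hypothetical, not from any real OS):

```python
from collections import OrderedDict

class LRUStack:
    """LRU stack: most recently used page at the top, victim at the bottom."""
    def __init__(self, frames):
        self.frames = frames
        self.pages = OrderedDict()           # insertion order = recency order

    def reference(self, page):
        """Touch a page; return the evicted page on a full-memory fault, else None."""
        if page in self.pages:
            self.pages.move_to_end(page)     # hit: move to top of the stack
            return None
        victim = None
        if len(self.pages) == self.frames:
            victim, _ = self.pages.popitem(last=False)   # bottom = LRU page
        self.pages[page] = True
        return victim

s = LRUStack(3)
for p in [4, 3, 2, 1, 4]:
    s.reference(p)
print(list(s.pages))   # prints [2, 1, 4]  (LRU first, MRU last)
```

Each reference is O(1), which is the appeal of the stack over scanning per-page clock registers; the remaining cost is that real hardware cannot afford a software stack update on every memory reference, hence the approximations that follow.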
- The clock algorithm works as follows. We keep a "use" bit for each page frame; the hardware sets the bit for the referenced page on every memory reference. We also keep a pointer to some page frame. When a fault occurs, look at the use bit of the page frame the pointer points to. If it is on, turn it off, advance the pointer, and repeat the process. If it is off, replace the page in that page frame and set the new page's use bit to 1.
  See http://webdisk.berkeley.edu/~rollins/lecture4.jpg for examples of the clock algorithm.
- This algorithm is also known as FINUFO (first in, not used, first out).
- The use bit, when used with the clock algorithm, breaks the pages into two groups: those "in use" and those "not in use". We want to replace one of the "not in use" pages.
- What does it mean if the clock hand is sweeping very slowly? Answer: Plenty of memory; not many page faults.
- What does it mean if the clock hand is sweeping very fast? Answer: Not enough memory!
- Some systems also use a "dirty" bit to give some extra preference to dirty pages. This is because it is more expensive to throw out dirty pages: clean ones don't need to be written back to the disk.
- Tradeoffs:
  - The cost of a page fault declines: there is a lower probability of having to write out a dirty block when we evict.
  - The probability of a fault increases, however. Since clock was already a good algorithm, by messing with it, the chances of making it worse are higher than the chances of making it better.
========================================
Topic: Details on Replacement Algorithms
- If we wanted to implement a Least Frequently Used (LFU) replacement algorithm, how would it work?
  - It would be a disaster, since locality changes.
  - We care more about whether a page has been referenced at all (recently) than about how often it has been referenced.
  - Ex: What happens when we use a page heavily during the initial phase of a process but then never use it again? Since it was used heavily, it has a large count and remains in memory even though it is no longer needed.
  - One solution is to slowly decrement the use count over time, so that an old, no-longer-used page will eventually get paged out.
- A per-process replacement algorithm (also called a local page replacement algorithm, or a per-job replacement algorithm) allocates page frames to individual processes.
  - This means that a page fault in one process can only replace one of that process' frames.
  - An important consequence: processes cannot interfere with one another.
- If all pages from all processes are lumped together by the replacement algorithm, then it is said to be a global replacement algorithm.
  - Under this scheme, each process competes with all of the other processes for page frames.
- Local algorithms:
  - Protect jobs from others which are badly behaved.
  - It is hard to decide how much space to allocate to each process.
  - The allocation may be unreasonable.
- Global algorithms:
  - Permit the memory allocation for a process to shift over time.
  - Permit the memory allocation to adapt to process needs.
  - Permit a badly behaved process to grab too much memory.
========================================
Topic: Thrashing
- Thrashing: A situation in which the page fault rate is so high that the system spends most of its time either processing a page fault or waiting for a page to arrive.
- Thrashing means that there is too much page fetch idle time: time when the processor is idle waiting for a page to arrive.
- Suppose there are many users, and that between them their processes are making frequent references to 50 pages, but memory has 40 pages.
  - Each time one page is brought in, another page, whose contents will soon be referenced, is thrown out.
  - Compute the average memory access time: if references were spread uniformly over the 50 pages, roughly 1 in 5 would miss, so the average access time is about (4/5) * memory access time + (1/5) * disk access time, which is dominated by the disk term.
  - The system will spend all of its time reading and writing pages.
It will be working very hard but not getting anything done.
- The programs' progress will make it look like memory access is as slow as disk, rather than disk being as fast as memory.
- Plot of CPU utilization vs. level of multiprogramming:
  see http://webdisk.berkeley.edu/dav/public_html/Thrashing_Figure.JPG
  As the degree of multiprogramming increases past a point, CPU utilization drops sharply. At that point, to increase CPU utilization and stop thrashing, we must decrease the degree of multiprogramming.
- Thrashing was a severe problem in early demand paging systems.
- Thrashing occurs because the system doesn't know when it has taken on more work than it can handle. LRU mechanisms order pages in terms of last access, but don't give absolute numbers indicating which pages mustn't be thrown out.
- What do humans do when thrashing? If flunking all courses at midterm time, drop one.
- Imagine a person who bravely signed up for 25 units. This person might be THRASHING: 4 problem sets are due tomorrow, but there is time for only 2 of them. So he works on one for a while, then switches to another, and so on; unfortunately, nothing gets done.
- Solutions to Thrashing:
  - If a single process is too large for memory, there is nothing the OS can do. That process will simply thrash. (Buy more memory or a bigger machine.)
  - If the problem arises because of the sum of several processes:
    - Figure out how much memory each process needs. Change scheduling priorities to run processes in groups whose combined memory needs can be satisfied.
    - Shed load.
    - Change the paging algorithm.
- Working Sets (IBM) are a solution proposed by Peter Denning. An informal definition is:
  - Working set = "the set of pages that a process is working with, and which must thus be resident if the process is to avoid thrashing."
  - The idea is to use the recent needs of a process to predict its future needs.
- Formally: "Exactly that set of pages used in the preceding T virtual time units" (T is usually given in units of memory references).
- Choose T, the working set parameter. At any given time, all pages referenced by a process during its last T time units of execution are considered to comprise its working set.
- The Working Set Paging Algorithm keeps in memory exactly those pages used in the preceding T time units.
- Minimum values for T are about 10,000 to 100,000 memory references.
- A process is never executed unless its working set is resident in main memory. Pages outside the working set may be discarded at any time.
- Note that this requires a reservoir of unassigned page frames.
- Imagine a carpenter. His working set is the tools hanging from his belt. Every once in a while he needs an unusual tool and must climb down the ladder to get it. But if he doesn't keep his usual tools on his belt, he has to climb down and up all the time.
- The working set is dynamic, because it varies over time. Most processes don't need a static number of page frames (e.g. a compiler has different phases).
- Working set paging requires that the sum of the sizes of the working sets of the jobs eligible to run (which we will call the balance set) be less than or equal to the amount of space available. We previously referred to the balance set as the jobs in the in-memory queue.
- Some algorithm must be provided for moving processes into and out of the balance set. What happens if the balance set changes too frequently?
  - We still get thrashing.
- As working sets change, corresponding changes will have to be made in the balance set.
- Working set also has the advantage over LRU that it adjusts the amount of space in use according to what the process needs. LRU works with a fixed amount of space, even though a process' needs change.
- How do we implement working set? Can it be done exactly?
  - One of the initial plans was to attach some sort of capacitor to each memory page.
The capacitor would be charged on each reference and would then discharge slowly if the page wasn't referenced; T would be determined by the size of the capacitor. This wasn't actually implemented. One problem is that we want separate working sets for each process, so the capacitor should only be allowed to discharge while that particular process executes.
  - What if a page is shared?
- Actual solution: take advantage of use bits.
  - The OS maintains an idle time value for each page: the amount of CPU time received by the process since its last access to the page.
  - Every once in a while, scan all pages of a process. For each page with the use bit on, clear the page's idle time; for each page with the use bit off, add the process' CPU time (since the last scan) to its idle time. Turn all use bits off during the scan.
  - Scans happen on the order of every few seconds (in Unix, on the order of a minute or more).
- What is the overhead of sampling reference bits regularly?
  - Assume samples every 10,000 memory references, a 40 Mbyte memory with 4K pages (i.e. 10,000 pages), and 5 instructions (with 10 memory references) to sample one bit. Then 100,000 memory references are required just to record the use bits: the overhead is unreasonable.
- Other questions about working sets and memory management in general:
  - What should T be?
    - What if it's too large? It may overlap several localities.
    - What if it's too small? It will not encompass the entire locality.
    - Plot STP vs. T, and page fault rate vs. T.
  - What algorithms should be used to determine which processes are in the balance set?
  - How much memory is needed in order to keep the CPU busy? Note that under working set methods the CPU may occasionally sit idle even though there are runnable processes.
  - (How do we compute working sets if pages are shared?)
- Working Set Restoration (swapping)
  - The idea is that when we remove a process from the in-memory queue, we know what its working set is.
  - When we run the process again (i.e.
promote it to the in-memory queue), we can restore the working set to memory all at once.
  - Advantages:
    - Minimizes CPU overhead.
    - We don't have to wait on each page fault: all transfers happen at once.
    - We can optimize the layout when writing out, and can fetch from consecutive locations.
    - Or we can just sort the fetches, so that the average latency is much smaller.
- A problem with working set is that even the approximate implementation above has a lot of overhead. Instead, Opderbeck and Chu (UCLA) created an algorithm called
- Page Fault Frequency:
  - Let X be the virtual time since the last page fault for this process.
  - At the time of a page fault: [If X > T, remove all pages (of the process) with the use bit off.] Then get a page frame for the new page, and turn off all reference bits for the process.
  - The idea was to make this a quick and easy way to implement working set: as long as the process is faulting too often (X <= T), no pages are removed and its allocation grows; once it faults infrequently (X > T), pages unused since the last fault are dropped.
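The Page Fault Frequency rule can be sketched in a few lines. This is an illustrative sketch only: the threshold value and the set/dict standing in for real page-table state are my own assumptions.

```python
T = 5  # PFF threshold in virtual time units (illustrative value, not from the notes)

def on_page_fault(resident, use_bit, x, new_page):
    """resident: set of this process's resident pages;
    use_bit: page -> referenced since the last fault;
    x: virtual time since this process's previous page fault."""
    if x > T:                                   # faulting infrequently: shrink
        for page in [p for p in resident if not use_bit.get(p)]:
            resident.discard(page)              # unused since last fault -> out
    resident.add(new_page)                      # get a frame for the new page
    for page in resident:                       # turn off all reference bits
        use_bit[page] = False
    return resident

# Faulting rarely (x > T): page 2, unreferenced since the last fault, is dropped.
print(on_page_fault({1, 2, 3}, {1: True, 2: False, 3: True}, x=10, new_page=4))
# Faulting often (x <= T): nothing is removed, so the allocation grows.
print(on_page_fault({1, 2, 3}, {1: False, 2: False, 3: False}, x=2, new_page=5))
```

Note how the memory allocation is self-regulating: frequent faults grow the resident set, infrequent faults trim it, approximating working set without any periodic scan.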