Kye Hyun Kim
Roland Carlos
3/2/2005
Lecture 12
=========================================
Topic: How to evaluate paging algorithms.

If we have page faults, what are the costs we incur because of them?
- The CPU overhead for the page fault: we have to run the fault handler, the dispatcher, and the I/O routines.
- The CPU may have to remain idle while the page arrives.
- I/O may have to busy wait while the page is being transferred.
- There is main memory (or cache) interference while the page is being transferred.
- And of course, the real time delay to handle the page fault.

There are two approaches (metrics) to evaluate paging algorithms.
1. Plot page faults vs. amount of space used. This is known as a "parachor curve".
   See http://webdisk.berkeley.edu/~rollins/lecture1.jpg for a graph of faults vs. space.
   This is the more commonly used measure, because it makes the tradeoff between faults and space easy to see.
2. Plot space-time product (STP) vs. amount of space. We want to minimize the STP.
   See http://webdisk.berkeley.edu/~rollins/lecture2.jpg for a graph of space vs. time.
   See http://webdisk.berkeley.edu/~rollins/lecture3.jpg for a graph of STP vs. space.

What is the space-time product?
- It is the integral of the amount of space used by the program over the time it runs, including the time spent handling page faults.
- If we use real time, the exact formula is integral(0, E) [m(t) dt], where
  - E is the ending time of the program.
  - m(t) is the memory used by the program at time t (t measured in real time).
- If we measure in discrete time, the formula is instead
  sum(i = 0, R) [m(i) * (1 + f(i) * PFT)], where
  - R is the ending time of the program in discrete time (i.e. the number of memory references),
  - i is the i'th memory reference,
  - m(i) is the number of pages in memory at the i'th reference,
  - f(i) is an indicator function (= 0 if there is no page fault at reference i, = 1 if there is),
  - PFT is the page fault time.
  The first term is the virtual space-time product; the second term adds in the time for page faults.
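The discrete-time formula translates directly into a short computation. A minimal sketch (the function name and the sample traces for m(i) and f(i) are mine, purely illustrative):

```python
# Discrete space-time product:  STP = sum over i of m(i) * (1 + f(i) * PFT)
# Each reference costs 1 virtual time unit, plus PFT extra when it faults.

def space_time_product(m, f, pft):
    """m: pages resident at each reference; f: 1 where that reference faulted."""
    return sum(mi * (1 + fi * pft) for mi, fi in zip(m, f))

# Made-up trace: the program faults while touching its first 4 pages, then
# mostly runs resident, faulting once more near the end.
m = [1, 2, 3, 4, 4, 4, 4, 4]
f = [1, 1, 1, 1, 0, 0, 1, 0]
print(space_time_product(m, f, pft=1000))   # prints 14026
```

Note how PFT dominates: the five fault-free references contribute only 20 of the 14026 units, which is why STP is so technology dependent.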
- The space-time product can be computed approximately from the page fault vs. space curve. The approximate calculation is
  STP = (virtual running time of the program (F) + PFT * number of page faults) * (mean space occupied by the program (n bar)).
- The space-time product depends on PFT, so it is technology dependent (i.e. we may not get the same results from different machines even if everything else is the same). The STP also does not take into account the fact that the machine may not be idle while a page is being fetched.
=================================================
Topic: Example runs of certain paging algorithms.

We use the given reference string for all the tests: 4, 3, 2, 1, 4, 3, 5, 4, 3, 2, 1, 5

Least Recently Used (LRU)
- Method: When we have to bring a new page into a full page table, we kick out the page that was least recently used (i.e. least recently referenced).

4 pages test (* indicates that a page fault happened, i.e. we had to go to disk to bring this page into the page table):

4   3   2   1   4   3   5   4   3   2   1   5       reference string
----------------------------------------------------
4*  3*  2*  1*  4   3   5*  4   3   2*  1*  5*      most recently used page
    4   3   2   1   4   3   5   4   3   2   1       2nd most recently used page
        4   3   2   1   4   3   5   4   3   2       3rd most recently used page
            4   3   2   1   1   1   5   4   3       least recently used page

Result: 8 page faults

3 pages test:

4   3   2   1   4   3   5   4   3   2   1   5       reference string
----------------------------------------------------
4*  3*  2*  1*  4*  3*  5*  4   3   2*  1*  5*      most recently used page
    4   3   2   1   4   3   5   4   3   2   1       2nd most recently used page
        4   3   2   1   4   3   5   4   3   2       least recently used page

Result: 10 page faults

First In First Out (FIFO)
- Method: When we have to bring a new page into a full page table, we kick out the page that is the oldest in the page table (i.e. the first one in). This happens regardless of whether we referenced that oldest page more recently than another page.

4 pages test (* means the same as above.
# indicates that the page is the oldest page in the page table and will be kicked out if a page fault happens and there is no more space in the page table):

4   3   2   1   4   3   5   4   3   2   1   5       reference string
----------------------------------------------------
4*  4#  4#  4#  4#  4#  5*  5   5   5#  1*  1
    3*  3   3   3   3   3#  4*  4   4   4#  5*
        2*  2   2   2   2   2#  3*  3   3   3#
            1*  1   1   1   1   1#  2*  2   2

Result: 10 page faults

3 pages test:

4   3   2   1   4   3   5   4   3   2   1   5       reference string
----------------------------------------------------
4*  4#  4#  1*  1   1#  5*  5   5   5   5#  5#
    3*  3   3#  4*  4   4#  4#  4#  2*  2   2
        2*  2   2#  3*  3   3   3   3#  1*  1

Result: 9 page faults

- Note the anomaly here (known as Belady's Anomaly): when we increased the number of page frames, we actually increased the number of page faults as well. This is not always the case (and in fact should not be the case: we want the miss ratio to decline with increasing memory), which is why this case is known as an anomaly.

Optimum (OPT)
- Method: The algorithm is simple: we replace the page whose next reference is farthest in the future, i.e. the page that will be referenced later than all the other pages in the table. This delays page faults until they are absolutely necessary.
- This is also known as the Minimum (MIN) algorithm.

4 pages test (* means the same as above):

4   3   2   1   4   3   5   4   3   2   1   5       reference string
----------------------------------------------------
4*  3*  2*  1*  4   3   5*  4   3   2   1*  5
    4   4   4   1   4   4   5   5   5   5   1
        3   3   3   1   3   3   4   3   2   2
            2   2   2   2   2   2   4   3   3

Result: 6 page faults

3 pages test:

4   3   2   1   4   3   5   4   3   2   1   5       reference string
----------------------------------------------------
4*  3*  2*  1*  4   3   5*  4   3   2*  1*  5
    4   4   4   1   4   4   5   5   5   5   1
        3   3   3   1   3   3   4   3   2   2

Result: 7 page faults

- Note: A page only moves up in the order when it is referenced; otherwise it moves down or stays in place.
- Note: The OPT 3 pages test is just the 4 pages test without the fourth line. In the 3 pages test, we still replace the page that will not be used for the longest time into the future. To get the 2 pages test for OPT, just remove the third line as well.
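The fault counts in the tables above can be double-checked with a small simulator. This is a sketch of my own (the function and its structure are not from the notes), assuming each policy's eviction rule as described:

```python
# Count page faults for LRU, FIFO, and OPT on a reference string.
def count_faults(refs, frames, policy):
    mem = []          # resident pages; front of the list is the victim
    faults = 0
    for i, p in enumerate(refs):
        if p in mem:
            if policy == "LRU":              # a hit refreshes recency
                mem.remove(p)
                mem.append(p)
            continue
        faults += 1
        if len(mem) == frames:               # full: must evict
            if policy == "OPT":              # evict page used farthest ahead
                future = refs[i + 1:]
                victim = max(mem, key=lambda q: future.index(q)
                             if q in future else len(future) + 1)
                mem.remove(victim)
            else:                            # LRU and FIFO evict the front
                mem.pop(0)
        mem.append(p)
    return faults

refs = [4, 3, 2, 1, 4, 3, 5, 4, 3, 2, 1, 5]
for policy in ("LRU", "FIFO", "OPT"):
    print(policy, count_faults(refs, 4, policy), count_faults(refs, 3, policy))
# prints: LRU 8 10 / FIFO 10 9 / OPT 6 7
```

The output reproduces the hand traces, including Belady's Anomaly for FIFO (10 faults with 4 frames, 9 with 3).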
- The main problem with OPT is that it requires knowledge of the future (i.e. what pages will be referenced when), which is almost always impossible to have.
======================
Topic: Stack Algorithm
- A stack algorithm is an algorithm which obeys the inclusion property.
- The inclusion property states that the set of pages in a memory of size N at time t is always a subset of the set of pages in a memory of size N+1 at time t. Because of this, the miss ratio obviously cannot increase with memory size. (We avoid Belady's Anomaly.)
- The stack is a list of pages ordered by the smallest memory size that would include them.
=======================
Topic: Implementing LRU
- It sounds simple, but it is hard to implement. We need some form of hardware support in order to keep track of which pages have been used recently.
- Perfect LRU: Keep a register for each page and store the system clock into that register on each memory reference. To replace a page, we scan through all of the registers to find the one with the oldest clock value (thus the LRU page). However, this is not quite perfect: it is expensive if there are a lot of memory pages.
- LRU stack: Whenever a page is referenced, it is removed from its place in the stack and placed on top. That way, the top is always the most recently used page, while the bottom is always the LRU page. This makes it easy to push out a page, since the victim is always the bottom page.
- Note that we can see (by inspection) that with LRU, the miss ratio will never increase with an increasing number of pages in memory.
- Perfect LRU is hard to implement in practice. We settle for an approximation that is efficient: just find an old page, not necessarily the oldest one.
======================
Topic: Clock Algorithm
- Before we talk about the clock algorithm, we must mention use bits.
- A use bit (also known as a reference bit) is a bit in the page table entry (usually cached in the TLB) that is set when the page is referenced. It is only turned off by the OS.
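The LRU stack described under "Implementing LRU" maps naturally onto an ordered dictionary. A minimal sketch (the class and method names are hypothetical, not from any real OS):

```python
from collections import OrderedDict

class LRUStack:
    """LRU stack: most recently used page at the top, victim at the bottom."""
    def __init__(self, frames):
        self.frames = frames
        self.pages = OrderedDict()           # insertion order = recency order

    def reference(self, page):
        """Touch a page; return the evicted page on a full-memory fault, else None."""
        if page in self.pages:
            self.pages.move_to_end(page)     # hit: move to top of the stack
            return None
        victim = None
        if len(self.pages) == self.frames:
            victim, _ = self.pages.popitem(last=False)   # bottom = LRU page
        self.pages[page] = True
        return victim

s = LRUStack(3)
for p in [4, 3, 2, 1, 4]:
    s.reference(p)
print(list(s.pages))   # prints [2, 1, 4]  (LRU first, MRU last)
```

Each reference is O(1), which is the appeal of the stack over scanning per-page clock registers; the remaining cost is that real hardware cannot afford a software stack update on every memory reference, hence the approximations that follow.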
- The clock algorithm works as follows. We keep a "use" bit for each page frame; the hardware sets the bit for the referenced page on every memory reference. We also keep a pointer to some page frame. When a fault occurs, look at the use bit of the page frame the pointer points to. If it is on, turn it off, advance the pointer, and repeat the process. If it is off, replace the page in that page frame and set the new page's use bit to 1.
  See http://webdisk.berkeley.edu/~rollins/lecture4.jpg for examples of the clock algorithm.
- This algorithm is also known as FINUFO (first in, not used, first out).
- The use bit, when used with the clock algorithm, breaks the pages into two groups: those "in use" and those "not in use". We want to replace one of the "not in use" pages.
- What does it mean if the clock hand is sweeping very slowly? Answer: Plenty of memory; not many page faults.
- What does it mean if the clock hand is sweeping very fast? Answer: Not enough memory!
- Some systems also use a "dirty" bit to give some extra preference to dirty pages. This is because it is more expensive to throw out dirty pages: clean ones don't need to be written back to the disk.
- Tradeoffs:
  - The cost of a page fault declines: there is a lower probability of having to write out a dirty block when we evict.
  - The probability of a fault increases, however. Since clock was already a good algorithm, by messing with it, the chances of making it worse are higher than the chances of making it better.
========================================
Topic: Details on Replacement Algorithms
- If we wanted to implement a Least Frequently Used (LFU) replacement algorithm, how would it work?
  - It would be a disaster, since locality changes.
  - We care more about whether a page has been referenced at all (recently) than about how often it has been referenced.
  - Ex: What happens when we use a page heavily during the initial phase of a process but then never use it again? Since it was used heavily, it has a large count and remains in memory even though it is no longer needed.
  - One solution is to slowly decrement the use count over time, so that an old, no-longer-used page will eventually get paged out.
- A per-process replacement algorithm (also called a local page replacement algorithm, or a per-job replacement algorithm) allocates page frames to individual processes.
  - This means that a page fault in one process can only replace one of that process' frames.
  - An important consequence: processes cannot interfere with one another.
- If all pages from all processes are lumped together by the replacement algorithm, then it is said to be a global replacement algorithm.
  - Under this scheme, each process competes with all of the other processes for page frames.
- Local algorithms:
  - Protect jobs from others which are badly behaved.
  - It is hard to decide how much space to allocate to each process.
  - The allocation may be unreasonable.
- Global algorithms:
  - Permit the memory allocation for a process to shift over time.
  - Permit the memory allocation to adapt to process needs.
  - Permit a badly behaved process to grab too much memory.
========================================
Topic: Thrashing
- Thrashing: A situation in which the page fault rate is so high that the system spends most of its time either processing a page fault or waiting for a page to arrive.
- Thrashing means that there is too much page fetch idle time: time when the processor is idle waiting for a page to arrive.
- Suppose there are many users, and that between them their processes are making frequent references to 50 pages, but memory has 40 pages.
  - Each time one page is brought in, another page, whose contents will soon be referenced, is thrown out.
  - Compute the average memory access time: if references were spread uniformly over the 50 pages, roughly 1 in 5 would miss, so the average access time is about (4/5) * memory access time + (1/5) * disk access time, which is dominated by the disk term.
  - The system will spend all of its time reading and writing pages.
It will be working very hard but not getting anything done.
- The programs' progress will make it look like memory access is as slow as disk, rather than disk being as fast as memory.
- Plot of CPU utilization vs. level of multiprogramming:
  see http://webdisk.berkeley.edu/dav/public_html/Thrashing_Figure.JPG
  As the degree of multiprogramming increases past a point, CPU utilization drops sharply. At that point, to increase CPU utilization and stop thrashing, we must decrease the degree of multiprogramming.
- Thrashing was a severe problem in early demand paging systems.
- Thrashing occurs because the system doesn't know when it has taken on more work than it can handle. LRU mechanisms order pages in terms of last access, but don't give absolute numbers indicating which pages mustn't be thrown out.
- What do humans do when thrashing? If flunking all courses at midterm time, drop one.
- Imagine a person who bravely signed up for 25 units. This person might be THRASHING: 4 problem sets are due tomorrow, but there is time for only 2 of them. So he works on one for a while, then switches to another, and so on; unfortunately, nothing gets done.
- Solutions to Thrashing:
  - If a single process is too large for memory, there is nothing the OS can do. That process will simply thrash. (Buy more memory or a bigger machine.)
  - If the problem arises because of the sum of several processes:
    - Figure out how much memory each process needs. Change scheduling priorities to run processes in groups whose combined memory needs can be satisfied.
    - Shed load.
    - Change the paging algorithm.
- Working Sets (IBM) are a solution proposed by Peter Denning. An informal definition is:
  - Working set = "the set of pages that a process is working with, and which must thus be resident if the process is to avoid thrashing."
  - The idea is to use the recent needs of a process to predict its future needs.
- Formally: "Exactly that set of pages used in the preceding T virtual time units" (T is usually given in units of memory references).
- Choose T, the working set parameter. At any given time, all pages referenced by a process during its last T time units of execution are considered to comprise its working set.
- The Working Set Paging Algorithm keeps in memory exactly those pages used in the preceding T time units.
- Minimum values for T are about 10,000 to 100,000 memory references.
- A process is never executed unless its working set is resident in main memory. Pages outside the working set may be discarded at any time.
- Note that this requires a reservoir of unassigned page frames.
- Imagine a carpenter. His working set is the tools hanging from his belt. Every once in a while he needs an unusual tool and must climb down the ladder to get it. But if he doesn't keep his usual tools on his belt, he has to climb down and up all the time.
- The working set is dynamic, because it varies over time. Most processes don't need a static number of page frames (e.g. a compiler has different phases).
- Working set paging requires that the sum of the sizes of the working sets of the jobs eligible to run (which we will call the balance set) be less than or equal to the amount of space available. We previously referred to the balance set as the jobs in the in-memory queue.
- Some algorithm must be provided for moving processes into and out of the balance set. What happens if the balance set changes too frequently?
  - We still get thrashing.
- As working sets change, corresponding changes will have to be made in the balance set.
- Working set also has the advantage over LRU that it adjusts the amount of space in use according to what the process needs. LRU works with a fixed amount of space, even though a process' needs change.
- How do we implement working set? Can it be done exactly?
  - One of the initial plans was to attach some sort of capacitor to each memory page.
The capacitor would be charged on each reference and would then discharge slowly if the page wasn't referenced; T would be determined by the size of the capacitor. This wasn't actually implemented. One problem is that we want separate working sets for each process, so the capacitor should only be allowed to discharge while that particular process executes.
  - What if a page is shared?
- Actual solution: take advantage of use bits.
  - The OS maintains an idle time value for each page: the amount of CPU time received by the process since its last access to the page.
  - Every once in a while, scan all pages of a process. For each page with the use bit on, clear the page's idle time; for each page with the use bit off, add the process' CPU time (since the last scan) to its idle time. Turn all use bits off during the scan.
  - Scans happen on the order of every few seconds (in Unix, on the order of a minute or more).
- What is the overhead of sampling reference bits regularly?
  - Assume samples every 10,000 memory references, a 40 Mbyte memory with 4K pages (i.e. 10,000 pages), and 5 instructions (with 10 memory references) to sample one bit. Then 100,000 memory references are required just to record the use bits: the overhead is unreasonable.
- Other questions about working sets and memory management in general:
  - What should T be?
    - What if it's too large? It may overlap several localities.
    - What if it's too small? It will not encompass the entire locality.
    - Plot STP vs. T, and page fault rate vs. T.
  - What algorithms should be used to determine which processes are in the balance set?
  - How much memory is needed in order to keep the CPU busy? Note that under working set methods the CPU may occasionally sit idle even though there are runnable processes.
  - (How do we compute working sets if pages are shared?)
- Working Set Restoration (swapping)
  - The idea is that when we remove a process from the in-memory queue, we know what its working set is.
  - When we run the process again (i.e.
promote it to the in-memory queue), we can restore the working set to memory all at once.
  - Advantages:
    - Minimizes CPU overhead.
    - We don't have to wait on each page fault: all transfers happen at once.
    - We can optimize the layout when writing out, and can fetch from consecutive locations.
    - Or we can just sort the fetches, so that the average latency is much smaller.
- A problem with working set is that even the approximate implementation above has a lot of overhead. Instead, Opderbeck and Chu (UCLA) created an algorithm called
- Page Fault Frequency:
  - Let X be the virtual time since the last page fault for this process.
  - At the time of a page fault: [If X > T, remove all pages (of the process) with the use bit off.] Then get a page frame for the new page, and turn off all reference bits for the process.
  - The idea was to make this a quick and easy way to implement working set: as long as the process is faulting too often (X <= T), no pages are removed and its allocation grows; once it faults infrequently (X > T), pages unused since the last fault are dropped.
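The Page Fault Frequency rule can be sketched in a few lines. This is an illustrative sketch only: the threshold value and the set/dict standing in for real page-table state are my own assumptions.

```python
T = 5  # PFF threshold in virtual time units (illustrative value, not from the notes)

def on_page_fault(resident, use_bit, x, new_page):
    """resident: set of this process's resident pages;
    use_bit: page -> referenced since the last fault;
    x: virtual time since this process's previous page fault."""
    if x > T:                                   # faulting infrequently: shrink
        for page in [p for p in resident if not use_bit.get(p)]:
            resident.discard(page)              # unused since last fault -> out
    resident.add(new_page)                      # get a frame for the new page
    for page in resident:                       # turn off all reference bits
        use_bit[page] = False
    return resident

# Faulting rarely (x > T): page 2, unreferenced since the last fault, is dropped.
print(on_page_fault({1, 2, 3}, {1: True, 2: False, 3: True}, x=10, new_page=4))
# Faulting often (x <= T): nothing is removed, so the allocation grows.
print(on_page_fault({1, 2, 3}, {1: False, 2: False, 3: False}, x=2, new_page=5))
```

Note how the memory allocation is self-regulating: frequent faults grow the resident set, infrequent faults trim it, approximating working set without any periodic scan.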