************************************
cs162 Lecture Notes for Feb 28, 2005
by Jesse Davidson, David Lee
************************************

** QUICK REFERENCE **
    Translation Lookaside Buffer
    Principle of Locality
    Set Associativity
    Demand Paging, Thrashing, Working Sets
*********************

----------------------------
TRANSLATION LOOKASIDE BUFFER
----------------------------

PROBLEM with segmentation and paging: extra memory references
to access translation tables can slow programs down by a
factor of two or three. There are obviously too many
translations required to keep them all in special processor
registers.

But for small machines (e.g. PDP-11), can have one register
for every page in memory, since they can only address
64 Kbytes.

SOLUTION: Translation Lookaside Buffer (TLB), also called:
    Translation Buffer (TB) (DEC), or
    Directory Lookaside Table (DLAT) (IBM), or
    Address Translation Cache (ATC) (Motorola).
    [ As Prof Smith pointed out, ATC is really the correct name. ]

               -----------
               |         |
      VA --->  |   TLB   | -----> RA
               |         |
               -----------
                 |     ^
                 |     |
                 V     |
              ------------
              |TRANSLATOR|
              ------------

         TLB
    -----------
    | VA | RA |
    -----------
    |    |    |
    -----------
    |    |    |
    -----------
    |    |    |
    -----------

A TLB is used to store a few of the translation table entries.
It's very fast, but only remembers a small number of entries.

On each memory reference:
    First ask TLB if it knows about the page. If so, the
    reference proceeds fast.
    If TLB has no info for the page, the translator must go
    through the page and segment tables to get the info. The
    reference takes a long time, but give the info for this
    page to the TLB so it will know it for the next reference
    (TLB must forget one of its current entries in order to
    record the new one).

So what the TLB does is:
    Accept virtual address
    See if virtual address matches entry in TLB
        If so, return real address
        If not, ask translator to provide real address.
            Translator loads new translation into TLB,
            replacing an old one. (Usually one not used
            recently.) (Must replace entry in same set.)
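The hit/miss/replace cycle above can be modeled in a few lines of
Python. This is only a toy software sketch - a real TLB is hardware,
and the class name and structure here are invented for illustration:

```python
# Toy TLB: caches a few VA->RA translations; on a miss it consults the
# "translator" (a plain dict standing in for the page/segment tables)
# and evicts an arbitrary entry to make room. All names are invented.
class TinyTLB:
    def __init__(self, capacity, page_table):
        self.capacity = capacity
        self.page_table = page_table      # virtual page -> real frame
        self.entries = {}                 # the cached translations
        self.hits = self.misses = 0

    def translate(self, vpage):
        if vpage in self.entries:         # TLB hit: fast path
            self.hits += 1
            return self.entries[vpage]
        self.misses += 1                  # TLB miss: ask the translator
        frame = self.page_table[vpage]
        if len(self.entries) >= self.capacity:
            # forget one current entry to record the new one
            self.entries.pop(next(iter(self.entries)))
        self.entries[vpage] = frame
        return frame
```

Running a loop that touches the same three pages through a 4-entry
TinyTLB gives 3 misses and then nothing but hits - the locality effect
that makes real TLBs work.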
********
NOTE that the TLB is basically a cache. It simply associates a
VA with a RA, and if it is missing an entry, it passes the
request along to whatever paging/segmentation scheme we happen
to be using.
********

Will the TLB work well if it holds only a few entries, and the
program is very big?

YES - due to the Principle of Locality. (Peter Denning)

---------------------
PRINCIPLE OF LOCALITY
---------------------

TEMPORAL locality (Recently Used Information)
    Information that has been used recently is likely to
    continue to be used.
    [Alternate formulation] Information in use now consists
    mostly of the same information as was used recently.

SPATIAL locality (Information in same vicinity)
    Information near the current locus of reference is also
    likely to be used in the near future.

Example - top of desk is cache for file cabinet. If desk is
messy, stuff on top is likely to be what you need.

Explanation - code is either sequential or loops. Data used
together is often clustered together (array elements, stack,
etc.)

In practice, TLBs work quite well. Typically 96% to 99.9% of
the translations are found in the TLB.

-----------------
SET ASSOCIATIVITY
-----------------

The TLB is just a memory with some comparators. Typical size
of the memory: 16-512 entries. Each entry holds a virtual page
number and the corresponding physical page number.

How can the memory be organized to find an entry quickly?

    One possibility: search the whole table associatively on
    every reference. Hard to do for more than 32 or 64
    entries.

    A better possibility: restrict the info for any given
    virtual page to fall into a subset of entries in the TLB.
    Then only need to search that set. Called set associative.
    E.g. use the low-order bits of the virtual page number as
    the index to select the set.

Real TLBs are either fully associative or set associative. If
the size of the set is one, it is called direct mapped.
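Set selection from the low-order page-number bits can be sketched
concretely. The sizes here are illustrative assumptions (4 KB pages
and 4 sets), not anything mandated by the notes:

```python
# Split a virtual address into page#, set#, and byte offset.
PAGE_BITS = 12          # assumed 4 KB pages -> low 12 bits = byte offset
SET_BITS = 2            # assumed 4 sets -> low 2 bits of the page number

def split_va(va):
    byte = va & ((1 << PAGE_BITS) - 1)
    vpage = va >> PAGE_BITS
    tlb_set = vpage & ((1 << SET_BITS) - 1)   # low page# bits pick the set
    return vpage, tlb_set, byte

# Contiguous pages land in different sets, so they don't evict
# each other:
print([split_va(p << PAGE_BITS)[1] for p in range(6)])   # -> [0, 1, 2, 3, 0, 1]
```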
             <--- 4 sets --->
    --------------------------------
    |       |       |       |      |
    --------------------------------
    |       |       |       |      |    4 elements / set
    --------------------------------
    |       |       |       |      |
    --------------------------------
    |       |       |       |      |
    --------------------------------

    virtual address
    -----------------------
    | page# | set#| byte# |
    -----------------------

NOTE that the set # is actually the low-order bits of the
page #. The low-order bits are used as the index because they
vary the most, so nearby pages spread across different sets.

Replacement must be within the same set.

The translator is a piece of hardware that knows how to
translate virtual to real addresses. It uses the PTBR to find
the page table(s), and reads the page table to find the page.

TLBs are a lot like hash tables, except simpler (they must be,
to be implemented in hardware). Some hash functions are better
than others. Is it better to use low page-number bits than
high ones to select the set? Low ones are best: if a large
contiguous chunk of memory is being used, all of its pages
will fall in different sets.

Must be careful to flush the TLB during each context switch.
Why? Otherwise, when we switch processes, we'll still be using
the old translations from virtual to real, and will be
addressing the wrong part of memory.

    [Alternative] - can make the process identifier (PID) part
    of the virtual address. Have a Process Identifier Register
    (PIDR) which supplies that part of the address.

When we modify the page table, we must either flush the TLB or
flush the entry that was modified.

---------------------------------------------
Topic: DEMAND PAGING, THRASHING, WORKING SETS
---------------------------------------------

Paging gives you the ability to run multiple processes in
memory. So far we have disentangled the programmer's view of
memory from the system's view using a mapping mechanism. Each
sees a different organization. This makes it easier for the OS
to shuffle users around and simplifies memory sharing between
users.
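The "each sees a different organization" point can be sketched with
two hypothetical per-process page tables. All the frame numbers here
are invented; the point is only that the same virtual address can map
to different real addresses, while a shared page maps to one frame:

```python
# Two processes: vpage 0 is private (different frames), vpage 1 is a
# shared page (same frame in both tables). Numbers are made up.
page_table_A = {0: 10, 1: 42}   # vpage -> physical frame
page_table_B = {0: 11, 1: 42}   # vpage 1 is shared

def real_addr(page_table, vpage, offset, page_size=4096):
    return page_table[vpage] * page_size + offset

# Same virtual address, different real addresses for the private page:
print(real_addr(page_table_A, 0, 8), real_addr(page_table_B, 0, 8))
# ...and the same real address for the shared page:
print(real_addr(page_table_A, 1, 8) == real_addr(page_table_B, 1, 8))
```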
However, until now a user process had to be completely loaded
into memory before it could run. (Sort of - we mentioned page
faults and segment faults, but...) This is wasteful, since a
process may only need a small amount of its total memory at
any one time (locality).

Virtual memory permits a process to run with only some of its
virtual address space loaded into physical memory. The virtual
address space is translated to either
    a) physical memory (small, fast), or
    b) disk (backing store), which is large but slow.
    [ Backing storage is typically disk. ]

*************************************************************
The idea is to produce the illusion that the entire virtual
address space is in main memory, when in fact it isn't.

More generally, we have a multi-level (2 level in this case)
memory hierarchy. We want to have the cost of the slower and
larger level, and the performance of the smaller and faster
level.
*************************************************************

            ---------
            |  CPU  |
            ---------
            | CACHE |
            ---------
              ^ |
              | V
            ---------
            |MEMORY |
            ---------
              ^ |
              | V
            ---------
            | DISK  |
            ---------
              ^ |
              | V
            ---------
            | other |    [ TAPE -- Significantly slower, as those
            ---------      unable to access inst accounts
                           March 5-7 found out ]

The reason that this works is that most programs spend most of
their time in only a small piece of the code.

Principle of Locality:
    Temporal Locality - the same information is likely to be
    reused.
    Spatial Locality - nearby information is also likely to be
    used in the near future.

If not all of a process is loaded when it is running, what
happens when it references a byte that is only in the backing
store? Hardware and software cooperate to make things work
anyway.

First, extend the page tables with an extra bit: ``present''
(or ``valid''). If the present bit isn't set, then a reference
to the page results in a trap. This trap is given a special
name: page fault.

Page fault - an attempt to reference a page which is not in
memory.
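A condensed, hypothetical Python model of the valid-bit trap and a
minimal fault service. Every name and structure here is invented for
illustration; a real handler also flushes the victim's TLB entry and
updates the core map, which this sketch omits:

```python
# A page table entry is modeled as a dict with "valid", "frame",
# and "dirty" fields; backing_store stands in for the disk.
class PageFault(Exception):
    pass

def reference(vpage, page_table):
    """The hardware side: trap if the page is not present."""
    pte = page_table.get(vpage)
    if pte is None:
        raise MemoryError("abend: invalid page")   # not a legal page at all
    if not pte["valid"]:
        raise PageFault(vpage)                     # legal, but not in memory
    return pte["frame"]

def handle_fault(vpage, page_table, resident, free_frames, backing_store):
    """The OS side: find a frame (evicting if needed), bring the page in."""
    if free_frames:
        frame = free_frames.pop()
    else:
        victim = next(iter(resident))              # pick some resident page
        frame = resident.pop(victim)
        if page_table[victim]["dirty"]:
            backing_store[victim] = "written back" # reuse its old disk slot
        page_table[victim]["valid"] = False
    _ = backing_store[vpage]                       # transfer from disk
    page_table[vpage].update(valid=True, frame=frame, dirty=False)
    resident[vpage] = frame
    return frame
```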
Page Table Entry
----------------------------------------------------------------
| RA | Protection bits | valid bit | dirty bit | reference bit |
----------------------------------------------------------------

Any page not in main memory right now has the ``present/valid''
bit cleared in its page table entry.

_______________________
When a page fault occurs:

    Trap to OS - don't trust user / abnormal ends (abends).

    Verify that the reference is to a valid page; if not,
    abend.

    Find a page frame to put the page in:
        Find a page to replace, if there is no empty frame.
        If it is dirty, find a place to put the replaced page
        on secondary storage. (Can reuse its previous
        location.)
        Remove the page (either copy it back or overwrite it).
        Update the page table.
        Update the map of secondary storage if necessary (to
        show where we put the page).
        Update the memory (core) map.
        Flush the TLB entry for the page that has been
        removed.

    Operating system brings the page into memory:
        Find the page on secondary storage.
        Transfer it.
        Update the page table (set valid bit and real
        address).
        Update the map of the file system/disk to show that
        the page is now in memory. (e.g. update the cache of
        inodes)
        Update the core map (memory map).

    The process resumes execution. (i.e. it goes on the ready
    list; maybe it resumes)

Note that all of these steps take time. We may switch to
another process while the I/O is taking place.
Multiprogramming is supposed to overlap the fetch of a page
(or other I/O) for one process with the execution of another.
If no process is available to run (all doing I/O or page
faults), this is called multiprogramming idle or page fetch
idle.

Page out - to remove a page.
Page out a process - remove it from memory.
Page in a process - load its pages into memory.

__________________
RESUMING A PROCESS

Continuing (resuming) the process is very tricky, since the
page fault may have occurred in the middle of an instruction.
We don't want the user process to be aware that the page fault
even happened.

    Can the instruction just be skipped?

    Suppose the instruction is restarted from the beginning?
    How is the ``beginning'' located?
    Even if the beginning is found, what about instructions
    with side effects?

Without additional information from the hardware, it may be
impossible to restart a process after a page fault. Machines
that permit restarting must have hardware support to keep
track of all the side effects so that they can be undone
before restarting.

    Early Apollo approach for the 68000 (two processors, one
    just for handling page faults).

    IBM 370 solution (execute long instructions twice).
    [ 'practice' execution to detect page faults ]

If you think about this when designing the instruction set, it
isn't too hard to make a machine support virtual memory. It's
much harder to do after the fact.
    [ Note that RISC instruction sets solve this nicely ]

How many page faults can occur in one instruction?
    E.g. the instruction spans a page boundary, and each of
    its two operands spans two pages: 2 + 2 + 2 pages touched.
    Could also have a 2-level page table, with one page of
    page table needed to point to each instruction & data
    page. [ = SIX ]

Once the hardware has provided basic capabilities for virtual
memory, the OS must implement 3 algorithms:

    Page fetch algorithm: when to bring pages into memory.

    Page replacement algorithm: which page(s) should be thrown
    out, and when.

    Page placement algorithm: where to put the page in memory.

Note that the page placement algorithm for main memory is
irrelevant - memory is uniform. (But the CRAY has non-uniform
memory access time. Placement is also not irrelevant for other
parts of the memory hierarchy.)

______________________
Page Fetch Algorithms:

    Demand paging: start up the process with no pages loaded,
    and load a page when a page fault for it occurs, i.e. wait
    until it absolutely MUST be in memory. Almost all paging
    systems are like this.

    Request paging: let the user say which pages are needed.
    What's wrong with this?
        Users don't always know best, and aren't always
        impartial. They will overestimate their needs.
        Still need demand paging, in case the user doesn't
        remember to bring in the right page.
    Prefetching, or prepaging: bring a page into memory before
    it is referenced (e.g. when one page is referenced, bring
    in the next one, just in case). The reasons for prepaging
    are:
        (a) bring in several pages at once - cuts the per-page
        overhead;
        (b) eliminate the real-time delay in waiting for the
        page - overlap computation and fetch.
    The idea is to guess at which page will be needed. Hard to
    do effectively without a prophet; may spend a lot of time
    doing wasted work. If used at all, it is typically
    one-block lookahead - i.e. the next page. Seldom works.

    Can also do "swapping", whereby when you start a process,
    you swap in most or all of its pages, or at least all of
    the pages it was using the last time it was running. When
    it stops, you swap out its pages in a bunch, on contiguous
    tracks on disk.
    [ Also called working set restoration. ]

    Overlays - a technique by which the user divides his
    program into segments. The user issues commands to load
    and unload the segments from memory; these commands
    specify the location in memory where the segments are
    placed. Used when there is no virtual memory, and the user
    is given a partition of real memory to work with.
    [ PAINFUL ]

____________________________
Page Replacement Algorithms:

    Random (RAND): pick any page at random.

    FIFO: throw out the page that has been in memory the
    longest. The ideas are: (a) it's simple, and (b) the first
    page that was fetched is believed to be no longer needed.

    LRU (least recently used): use the past to predict the
    future. Throw out the page that hasn't been used in the
    longest time. If there is locality, then this is
    presumably the best you can do.

    **********************************************************
    NOTE that (almost) no one actually implements true LRU -
    too much overhead. Instead, an LRU bit is added to each
    entry, cleared every ***, and when an entry is used, the
    bit is set.
    Searching for an entry proceeds modulus style (i.e.
    circularly), turning bits off as the search proceeds, so
    that in the worst case we run through all entries before
    finding a cleared bit.
    **********************************************************

    MIN (or OPT): as always, the best algorithm arises if we
    can predict the future. Throw out the page that won't be
    used for the longest time into the future. This requires a
    prophet, so it isn't practical, but it is good for
    comparison.

_____________________
Real and Virtual Time

    Virtual time is time as measured by a running process - it
    doesn't include time that the process is blocked (e.g. for
    a page fault or any other reason). Often in units of
    memory references.

    Real time - time as measured by the wall clock. Includes
    time that the process is blocked (including page faults).

----------------------------------------------------------------------
EOF
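As a closing example, the replacement policies above - FIFO, LRU, and
the reference-bit approximation of LRU (usually called the "clock"
algorithm) - can be compared with a small simulation. The 3-frame
memory and the reference string are arbitrary choices, not from the
notes:

```python
from collections import OrderedDict

def count_faults(refs, nframes, policy):
    """Count page faults under FIFO, LRU, or CLOCK replacement."""
    faults = 0
    if policy in ("FIFO", "LRU"):
        mem = OrderedDict()                  # resident pages, in policy order
        for p in refs:
            if p in mem:
                if policy == "LRU":
                    mem.move_to_end(p)       # refresh recency on a hit
            else:
                faults += 1
                if len(mem) >= nframes:
                    mem.popitem(last=False)  # oldest (FIFO) / least recent (LRU)
                mem[p] = None
    else:  # CLOCK: sweep circularly, clearing use bits, evict a clear one
        frames = [None] * nframes
        refbit = [0] * nframes
        hand = 0
        for p in refs:
            if p in frames:
                refbit[frames.index(p)] = 1  # mark recently used
            else:
                faults += 1
                while refbit[hand]:          # turn bits off as search proceeds
                    refbit[hand] = 0
                    hand = (hand + 1) % nframes
                frames[hand] = p
                refbit[hand] = 1
                hand = (hand + 1) % nframes
    return faults

refs = [1, 2, 3, 4, 1, 2, 5, 1, 2, 3, 4, 5]
print([count_faults(refs, 3, pol) for pol in ("FIFO", "LRU", "CLOCK")])
# -> [9, 10, 9]
```

On this particular string FIFO happens to beat LRU; with a reference
string that has real locality, LRU and its clock approximation
generally come out ahead.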