Phase 3: Caching and Virtual Memory

The third phase of Nachos is to investigate the interaction between the TLB, the virtual memory system, and the file system. We don't provide any new virtual memory code for this assignment. You will continue to use the stub file system. For this phase, you should run gmake and nachos in the proj3 directory.

To help you to organize your code better, we provide you with a new package, nachos.vm, with two new classes, VMKernel and VMProcess. VMKernel extends UserKernel, and VMProcess extends UserProcess. VMKernel and VMProcess are the only classes you should have to modify for this project phase.

This phase and the next phase of the project involve open-ended design problems. We will expect you to come up with a design that would make sense in a real system, and to defend your choices in the design review. For example, you will have some freedom to choose how to do software address translation on TLB misses, how to represent the swap partition, how to implement paging, etc. In each case, we will expect you to come to the design review armed with a defensible justification as to why your choices are reasonable. You should evaluate your design on all the available criteria: speed of handling a TLB miss, space overhead in memory, minimizing the number of page faults, simplicity, etc. There is no single "right" answer.

The first design aspect to consider is the software-managed translation lookaside buffer (TLB). Page tables were used in phase 2 to simplify memory allocation and to isolate failures from one address space from affecting other programs. For this phase, the processor knows nothing about page tables. Instead, the processor only deals with a software-managed cache of page table entries, called the TLB. Given a memory address (an instruction to fetch, or data to load or store), the processor first looks in the TLB to determine if the mapping of the virtual page to a physical page is already known. If the translation is in the TLB, the processor uses it directly. If the translation is not in the TLB (a "TLB miss"), the processor causes a trap to the OS kernel. Then it is the kernel's responsibility to load the mapping into the TLB, using page tables, segments, inverted page tables, or whatever other mechanism might be appropriate. In other words, the Nachos MIPS simulator does not have direct access to your page tables; it only knows about the TLB. It is your job to write the code that manages a TLB miss.

The second design aspect of this project is paging, which allows physical memory pages to be transferred to and from disk to provide the illusion of an (almost) unlimited physical memory. A TLB miss may require a page to be brought in from disk to satisfy the translation. That is, when a TLB miss fault occurs, the kernel should check its own page table. If the page is not in memory, it should read the page in from disk, set the page table entry to point to the new page, install the page table entry, and resume the execution of the user program. Of course, the kernel must first find space in memory for the incoming page, potentially writing some other page back to disk, if it has been modified.

Performance of this mechanism depends greatly on the policy used to decide which pages are kept in memory and which are stored on disk. On a page fault, the kernel must decide which page to replace; ideally, it will throw out a page that will not be referenced for a long time, keeping in memory those pages may be referenced soon. Another consideration is that if the replaced page has been modified, the page must first be saved to disk before the needed page can be brought in. (Of course, if the page has not been modified, it is not necessary to write it back to disk.)

To help you implement virtual memory, each TLB entry contains three status bits: valid, used, and dirty. If the valid bit is set, the virtual page is in memory and the translation can be used directly by the processor. If the valid bit is clear, or if the page translation is not found in the TLB, then the processor traps to the OS to perform the translation. The processor sets the used bit in the TLB entry whenever a page is referenced and sets the the dirty bit whenever the page is modified.

  1. (30%) Implement software-management of the TLB, with software translation via an inverted page table.

  2. (40%) Implement demand paging of virtual memory. For this, you will need routines to move a page from disk to memory and from memory to disk. You should use the Nachos stub file system as backing store; this will make Part 3 (see below) a lot easier.

  3. (30%) Implement lazy loading of the code and data pages from user programs. That is, rather than reading all of the code and data from a user program at startup time, set up page table entries so that when page faults occur, the pages from the executable will be read into memory on demand.

Be aware that in project 3 you may run into live-lock. This is a case where your user programs will appear to have deadlocked - they will not be able to make any forward progress. There is a possibility that this will happen if you are running more than one user program with a small number of physical pages. You are not required to fix this problem. The autograder will not run any sequence that could possibly result in live-lock.

If you run more than 1 process, you risk live-lock for the case when two processes running can both get _double_ TLB misses (one for the instruction and one for data). This can result in live-lock depending on the luck of context switching and everything.

Here is an example (with two processes running, #1 & #2):
#1 TLB miss on instruction & data
#1 Acquire lock to service data
CONTEXT SWITCH
#2 TLB miss on instruction & data
#2 Go to sleep waiting for lock to service data
CONTEXT SWITCH
#1 Finish service on data
#1 Release lock, giving it to #2 since it is on the queue
#1 Try to acquire the lock for instruction
#1 Go to sleep because #2 has the lock
CONTEXT SWITCH
#2 Invalidate TLB because of context switch, removing the service #1 has done on data
#2 Service it's own data
#2 Release lock, giving it to #1 since it is on the queue
#2 Try to acquire lock for instruction
#2 Go to sleep because #1 has the lock
CONTEXT SWITCH
#1 Invalidate TLB because of context switch, removing the service #2 has done on data
#1 Service instruction miss
#1 Release lock, giving it to #2 since it is on the queue
#1 Try to acquire lock because we need to re-service data
#1 Go to sleep because #2 has the lock
CONTEXT SWITCH
...

In order to fix this, you would have to guarantee that at least one process is guaranteed to have 2 TLB translations as well as those 2 corresponding pages in memory before being context switched. This way, at least one process would be able to make progress on a double-miss instruction. You could lessen the effects of live-lock by saving and restoring the state of the TLB on context switches, but the same problem can occur with very few pages of physical memory. In this case, 2 processes could get stuck in a similar way by continually throwing out each others pages before both the instruction and data page could be loaded.

Also be careful that you don't write off other bugs in your project as live-lock. Live-lock is fairy rare and non-deterministic to get. It will only occur when running multiple user programs. I recommend lots of black/white box and stub file testing before you do full integration testing by running user programs.