Guy Boo
cs162-am
3/9/2005
lecture 15

******************************
***** MIDTERM STATISTICS *****
******************************

Average: 77.11
Median:  78
Std Dev: 11.76
Max:     96
Min:     42

Distribution:
  41-45    1  *
  46-50    2  **
  51-55    2  **
  56-60    2  **
  61-65    3  ***
  66-70    7  *******
  71-75   11  ***********
  76-80   14  **************
  81-85   11  ***********
  86-90   11  ***********
  91-95    7  *******
  96-100   1  *

Grading:
  Adrian Mettler: 1, 2
  Karl Chen:      3, 4
  Prof. Smith:    5, 6

**************************
***** TOPIC : PAGING *****
**************************

Usually you don't have a choice over what size your pages are, because the
hardware is constructed to support a specific page size. In some systems the
page size is left up to the OS, but these are rare. However, if you feel your
hard-wired page size is too small, you can simulate a larger page size by
moving and handling groups of pages at once, and you'll get performance that
approximates that of the larger page size. If you have a page size of 4KB and
you want a page size of 8KB, for example, hard-code your system to always swap
in and out 2 real, consecutive 4KB pages (or 1 conceptual 8KB page) whenever it
gets a page fault, and its performance will resemble a system with 8KB pages.
The correspondence will not be 1 to 1, however; for example, you'll still have
two entries in the TLB and the page table.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>>> Advantages of Different Page Sizes <<<
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

large
  * Less overhead when swapping.
  * Page fault processing happens less often because the larger page is more
    likely to already contain the next needed byte.
  * The TLB is more effective because the same number of entries maps more data.
  * The page table gets smaller because you need fewer pages to cover the same
    address space.
  * Less overhead to run the page replacement algorithm because you have fewer
    pages to consider.

small
  * Less internal fragmentation.
  * Less data to transfer, so each page fault takes less time.
  * The working set of a given process fits in less physical memory.
  * You use more of each page per page fault. When you swap in a page, you're
    typically using only pieces of it, so with a smaller page size less
    unnecessary data is swapped in. In other words, just because your page size
    scales up doesn't mean that you're using an equally large fraction of the
    page. For example, the same amount of data that fits in 64KB worth of 4KB
    pages may require as much as 96KB worth of 8KB pages due to internal
    fragmentation. (See the short sketch below.)

Student >> If you have larger pages do you need more address bits?
Smith >> You need more address bits to reference the byte within a page, but
you need fewer to reference the page itself. Remember that the number of
address bits you need depends only on how many bytes there are in your address
space.

As machines get faster and memories get larger, it makes sense to move to
larger page sizes, because transfer times become faster, latency times shrink,
and you can afford to waste a lot more memory in internal fragmentation. The
range of page sizes you'll typically run into is anywhere from 512B on an
antique machine to 64KB. Some high-end machines, though, have extremely large
page sizes that can be as big as 4 Megabytes.
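
To make the internal-fragmentation arithmetic concrete, here is a minimal
sketch (my own, not from lecture; the region sizes are made-up examples) that
rounds a few regions up to whole pages and reports how many pages are needed
and how many bytes are wasted for 4KB versus 8KB pages:

    /* Sketch: pages needed and internal fragmentation for hypothetical
     * region sizes, comparing 4KB and 8KB pages. */
    #include <stdio.h>
    #include <stddef.h>

    /* Round a region up to whole pages and report the wasted tail. */
    static void report(size_t region_bytes, size_t page_bytes)
    {
        size_t pages = (region_bytes + page_bytes - 1) / page_bytes; /* ceiling */
        size_t waste = pages * page_bytes - region_bytes;            /* internal fragmentation */
        printf("%zu-byte region with %zuKB pages: %zu pages, %zu bytes wasted\n",
               region_bytes, page_bytes / 1024, pages, waste);
    }

    int main(void)
    {
        size_t regions[] = { 3000, 5000, 100 * 1024 };   /* made-up region sizes */
        for (size_t i = 0; i < sizeof regions / sizeof regions[0]; i++) {
            report(regions[i], 4 * 1024);
            report(regions[i], 8 * 1024);
        }
        return 0;
    }

For each region, the 8KB pages waste at least as much space in the final,
partially used page, which is exactly the small-page advantage listed above.
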
Student >> Overall, how is the total time for delay affected by a larger page
size?
Smith >> Swap delay is composed of seek, latency, and transfer times. Exactly
what these terms mean is a topic for the following lectures, but basically seek
and latency times are not affected by page size, while transfer time grows
proportionally with how much data you have to transfer. It is important to
remember, however, that when you bring in a larger page you bring in more data,
so your page fault rate may drop significantly.

Student >> If you have larger pages, do you have the same number of entries in
your page table?
Smith >> No. You've still got a 32-bit address space, so if you have a 4KB page
you need 20 bits for the page number, and if you've got an 8KB page you need 19
bits for the page number. So: as your pages get bigger, the number of pages
gets smaller.

Student >> Is the same amount of memory referenced if you have bigger pages?
There are fewer entries, but each one is bigger, so...
Smith >> Each entry maps more stuff. So if your page table points to 8KB pages,
then each page maps 8KB, not 4KB. In particular, if you double your page size
then your page table gets half as big, and your TLB becomes much more effective
because with the same number of entries it can map twice as much stuff.

Student >> If your page table is half the size it was, can you still map the
same amount of memory as you could before?
Smith >> Yes. The total size of your address space never changed.

Student >> So doesn't that imply that page faults should occur at the same
rate?
Smith >> No. Let's take a simple example: if you have 100KB worth of stuff and
you're just scanning through it, then with 4KB pages you'd get 25 page faults
and with 8KB pages you'd get 13 page faults.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>>> Graphs of System Performance v. Page Size <<<
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

#  | *
   |  *
p  |   *
a  |    *
g  |     *
e  |      *
   |       *
f  |        *
a  |         *
u  |          **                 ***
l  |            **           ****
t  |              *****   ****
s  |                   ******
   +---------------------------------- page size

p  |
e  |
r  |          *********
f  |       ***         *****
o  |     **                 ****
r  |    **
m  |   *
a  |   *
n  |  *
c  |  *
e  | *
   +---------------------------------- page size

As you can see, as you make the page size bigger you're going to have fewer
faults because each page covers more data. But when the page size gets *too*
big, internal fragmentation begins to play a role, and your system ends up
wasting a good deal of time transferring a lot of unnecessary data on every
page fault.

Student >> How do you measure performance in this graph?
Smith >> We could use jobs per second, throughput, percentage of time spent in
useful computation rather than OS functions... For this particular application
we don't need a precise definition.

Student >> Is performance then the inverse of faults?
Smith >> Assuming that you're not considering any other parameter, yes.

Student >> Why is the minimum of the page faults v. page size curve not on the
right?
Smith >> As pages get bigger, the delay per page is much larger, so your
performance will start to drop *even though* your page fault rate continues to
drop. You're having fewer page faults, but it'll take you longer to process
each one as the pages get very big.

Student >> So if your page size is really big, will you have no page faults?
Smith >> If you can swap in the entire process address space as one page, you
shouldn't have any page faults. On the other hand, you may have a one-Gigabyte
process address space and a quarter Gigabyte of memory, in which case you can't
do that.
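
Here is a short sketch (my own; the 32-bit address space, the 4KB and 8KB page
sizes, and the 100KB sequential scan all come from the exchange above) that
works out the offset/page-number split and the fault count for the scan:

    /* Sketch: how the address-bit split and the scan fault count change
     * with page size in a 32-bit address space. */
    #include <stdio.h>

    int main(void)
    {
        unsigned long page_sizes[] = { 4096, 8192 };   /* 4KB and 8KB pages */
        unsigned long scan_bytes   = 100 * 1024;       /* the 100KB scan example */

        for (int i = 0; i < 2; i++) {
            unsigned long page = page_sizes[i];
            int offset_bits = 0;
            while ((1UL << offset_bits) < page)
                offset_bits++;                          /* log2(page size) */
            int page_number_bits = 32 - offset_bits;    /* bits left for the page number */
            unsigned long faults = (scan_bytes + page - 1) / page; /* one fault per page touched */
            printf("%luKB pages: %2d offset bits, %2d page-number bits, "
                   "%lu faults to scan 100KB once\n",
                   page / 1024, offset_bits, page_number_bits, faults);
        }
        return 0;
    }

It prints 20 page-number bits and 25 faults for 4KB pages, and 19 bits and 13
faults for 8KB pages, matching the numbers quoted above.
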
~~~~~~~~~~~~~~~~~~~~~
>>> Paging the OS <<<
~~~~~~~~~~~~~~~~~~~~~

Almost anything in memory can be paged out, including the operating system, but
some pages in memory can't. You cannot page out:

- parts of the operating system that are needed for handling page faults, and
  anything this code relies on. For example, you can't page out the page fault
  handler or the I/O code that manages the actual loading and writing during a
  page swap. This memory is referred to as 'wired down'.

- memory for programs that have a real-time response criterion. If you're
  coding the assembly line at GM and the welder takes a page fault right when
  the car trundles by, then a door might not get welded.

- sections that are already undergoing a paging or other read/write operation.
  Typically, the I/O system deals only with real addresses, so when your
  program does an I/O read, the OS constructs a command for the I/O system that
  contains a disk block number and the memory address of the receiving buffer.
  The I/O system will then write directly to that area of memory, so the OS
  must ensure that whatever's there doesn't get paged out, or else whatever
  gets paged in will have a section of its data erroneously overwritten.
  Typically, the OS performs this task with a lock bit on the page: while the
  bit is set, the page cannot be paged out at the moment. (A sketch of this
  idea follows at the end of this section.)

An additional concern introduced by paging when you're performing an I/O
transfer is how you stay within page boundaries. In general, you're
transferring to fragmented regions of memory, so either the I/O system must be
really smart and know where the page boundaries are, or you have to perform the
transfer as a separate contiguous transfer for each page written to.

Student >> Why is I/O so fundamentally disconnected from the VM system? Why
does it not perform the translation?
Smith >> The problem with the I/O is that it's connected only to the memory
bus, and the memory bus is only connected to physical memory, so there's no
virtual page table for it to look in. Besides, if you were able to pass a page
table to the device, then not only does that hardware become *much* more
complicated, it also becomes OS specific because it needs to be able to
understand the page table format. So instead of going to the computer store and
buying a disk off the shelf, you'd need to buy the "Mac OS X" version or the
"Windows NT" version, and if the page table format was ever changed - like in
an update - you wouldn't be able to use your disk anymore.

Student >> Does a disk driver play some role in that?
Smith >> Well, a driver is a piece of the OS that understands the disk, not a
program on the disk that understands the OS, so no. Not in that way.

Student >> What prevents a disk from writing anywhere in memory?
Smith >> Nothing, other than the fact that the OS alone gives it instructions
for where to write. The disk controllers really do have the power to write
anywhere in memory, so yes, you've got to make sure you don't code them
maliciously.

Student >> So if the disk is writing directly to memory instead of merely
providing data for the CPU to copy to memory, how does the CPU know when the
I/O device is done?
Smith >> That's the point of the I/O interrupt. When the I/O device completes
its task, it will send an interrupt to the CPU to let it know.
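
As a rough illustration of the lock-bit idea (the structure and field names
below are made up, not taken from any real OS), here is a victim-selection loop
that simply refuses to evict frames that are wired down or are the target of an
in-flight I/O transfer:

    /* Sketch: skip pinned frames when choosing a page to evict. */
    #include <stdbool.h>
    #include <stddef.h>

    struct frame {
        bool locked;          /* wired down: fault handler, real-time code, etc. */
        bool io_pending;      /* an I/O transfer is writing into this frame now */
        unsigned long last_use; /* whatever the replacement policy needs (LRU here) */
    };

    /* Pick the least recently used frame that is allowed to be paged out;
     * returns -1 if every frame is pinned. */
    int choose_victim(struct frame *frames, size_t nframes)
    {
        int victim = -1;
        for (size_t i = 0; i < nframes; i++) {
            if (frames[i].locked || frames[i].io_pending)
                continue;                     /* never page these out */
            if (victim < 0 || frames[i].last_use < frames[victim].last_use)
                victim = (int)i;
        }
        return victim;
    }
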
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>>> Studying Paging Algorithms <<<
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Physically studying and testing the paging algorithms we've been learning about
so far turns out to be really difficult. Frequently in computer science,
mathematical models serve such a purpose well, but with regard to paging none
has ever been found that adequately models program behavior. Alternatively, you
could run a simulation based on random numbers, but that kind of approach
suffers from the same problem as the mathematical model - it fails to
adequately simulate program behavior. Primarily, this is because the memory
addresses touched by actual programs are *not* random.

You could elect to perform experiments on an actual system (and of course that
sounds like it would be the ideal solution), but several problems emerge.
Firstly, you'd need to be able to modify the OS, and usually you won't have
access to what is most probably proprietary code. Secondly - as in mainframes
or distributed systems, which are otherwise the ideal places for such
experimentation - someone will probably want to *use* the system. Even after
all these problems have been dealt with, you'd still face the reality that your
results would be extremely difficult to reproduce. Even if you control
everything down to the order of jobs in the queue, the system clock, and the
very orientations of the disks, user and program behavior remains
unpredictable, and unless the effect of one paging algorithm is radically
different from another's, the differences will generally be swamped.

The usual approach for generating a model of program behavior to examine is
something called trace-driven simulation, where you record a program's virtual
memory references and then simulate a system running various paging algorithms
against that trace. (A toy version of such a simulator appears after the list
of trace-gathering methods below.)

*** Methods of Acquiring an Address Trace ***

instrumenting the code
  You can modify your code so that every branch, load, and store is replaced by
  a call that records the involved address and then executes the instruction.
  Once you have this trace, you can post-process it to determine the sequence
  of instruction fetches - because instruction fetches are sequential between
  branches - leaving you with the complete address trace of the program.

use a hardware monitor
  Attaching a hardware monitor to the address bus is a tricky but clean way to
  see exactly what a processor is doing while it's doing it. This approach
  worked when all the instruction fetches, loads, and stores went directly to
  main memory. With the advent of instruction, L1, and L2 caches, however, this
  approach no longer works because all of the needed wires are microscopic in
  size and internal to the chip.

trap to the OS on every instruction
  Most machines have a facility called a trace trap, which allows the
  experimenter to set certain parameters for when a trap should occur (for
  example: within some memory range, or on every load or store). Essentially,
  you set it to trap on every instruction and modify the trap handler so that
  it records all addresses associated with that instruction.

simulate the program
  You could write a machine simulator that literally walks through the
  executable file, picks up the next instruction, and makes a record of the
  addresses it touches.

turn off every valid bit after every memory reference
  As a way to force a machine that doesn't have a trace-trap facility to trap
  on every instruction, make every instruction trigger a page fault and have
  the page fault handler record the addresses.

Finally, if you have a microcoded computer, you could have the microcode store
the addresses for you.
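
Here is a toy trace-driven simulator (my own sketch; the trace, page size, and
frame count are all made-up values) that replays a virtual-address trace
against an LRU-managed set of frames and counts the page faults:

    /* Sketch: replay an address trace under LRU and count page faults. */
    #include <stdio.h>

    #define PAGE_SHIFT 12   /* 4KB pages */
    #define NFRAMES     4   /* made-up memory allocation */

    int main(void)
    {
        /* Stand-in for a real trace gathered by one of the methods above. */
        unsigned long trace[] = { 0x1000, 0x1004, 0x2fff, 0x9000,
                                  0x1008, 0x5000, 0x2000 };
        unsigned long frame_page[NFRAMES];
        unsigned long frame_age[NFRAMES];
        int used = 0, faults = 0;
        unsigned long now = 0;

        for (size_t i = 0; i < sizeof trace / sizeof trace[0]; i++, now++) {
            unsigned long page = trace[i] >> PAGE_SHIFT;
            int hit = -1, victim = 0;
            for (int f = 0; f < used; f++) {
                if (frame_page[f] == page) hit = f;
                if (frame_age[f] < frame_age[victim]) victim = f; /* oldest = LRU */
            }
            if (hit >= 0) {
                frame_age[hit] = now;                 /* refresh on a hit */
            } else {
                faults++;
                if (used < NFRAMES) victim = used++;  /* use a free frame first */
                frame_page[victim] = page;
                frame_age[victim] = now;
            }
        }
        printf("%d faults for %zu references\n",
               faults, sizeof trace / sizeof trace[0]);
        return 0;
    }

A real study would feed in a trace gathered by one of the methods above and
rerun it once per replacement policy being compared.
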
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
>>> Paging Algorithms and Minimizing Page Faults <<<
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

   | +  *   :    #
p  | +  *   :    #
a  | +  *   :    #
g  | +  *   :    #
e  | +  *   ::   #
   | +  *    :   ##
f  | +  *    ::    ###
a  | ++  **   :::     ######
u  |  +   *     :::::       ##########
l  |  ++   ****      ::::::::::       #####    FIFO / random
t  |   ++++    *************   :::::::         LRU / clock
s  |      +++++++++++++    **********          working set
   |            ++++++++++++                   optimal (requires future knowledge)
   +------------------------------------- allocated memory

The above graph shows how many page faults you can expect from a typical
process v. the amount of allocated memory. Working set is below least recently
used because WS adjusts the amount of memory based on how much the process
actually needs. Generally speaking, FIFO and random are somewhat worse than
LRU, and clock has approximately the same performance as LRU.

One way to minimize page faults is to rewrite algorithms in client code.
Generally, rewriting your code so that execution tends to remain within page
frames is not terribly effective, but rewriting data accesses is possible and
often done. For example, say you want to transpose a 1 million by 1 million
matrix stored row-wise. You could read off a single row very easily and
efficiently, but accessing every element of a column would generate a page
fault per element, so you can't perform the transposition directly. One way to
make it more efficient would be to break the matrix up into blocks like so

+---+---+---+---+---+
|   |   | A |   |   |
+---+---+---+---+---+
|   |   |   |   |   |
+---+---+---+---+---+
| B |   |   | D |   |
+---+---+---+---+---+
|   |   | C |   |   |
+---+---+---+---+---+
|   |   |   |   |   |
+---+---+---+---+---+

so that all the pages in each cell fit in memory. If you transpose the insides
of the cells, then transpose cells as a unit (e.g. copy A to B's spot and C to
D's spot and vice versa), you end up with the final matrix. (A small sketch of
a blocked transpose follows below.) This kind of optimization is usually a big
win and is done a lot, but it is not always easy, possible, or even necessary
to perform. For example, if you're trying to code a parser, there's no obvious
way to partition your data so that it minimizes page faults, but then you don't
really care, because all of your data is very likely to fit within 100KB of
space, which will more than likely all fit in memory at the same time.
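
To show the blocking idea in code, here is a rough sketch (the matrix and block
sizes are tiny, made-up values, and it transposes into a second array rather
than swapping cells in place as described above):

    /* Sketch: blocked transpose of an N x N row-major matrix. */
    #include <stdio.h>

    #define N 8   /* matrix dimension (toy size; the lecture's matrix is huge) */
    #define B 4   /* block size; in practice chosen so a block fits in memory */

    static double a[N][N], t[N][N];

    int main(void)
    {
        /* Fill the matrix with recognizable values. */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                a[i][j] = i * N + j;

        /* Walk block by block; within a block the accesses stay inside a
         * small set of pages, so column accesses stop faulting per element. */
        for (int bi = 0; bi < N; bi += B)
            for (int bj = 0; bj < N; bj += B)
                for (int i = bi; i < bi + B; i++)
                    for (int j = bj; j < bj + B; j++)
                        t[j][i] = a[i][j];

        printf("t[3][5] = %g (should equal a[5][3] = %g)\n", t[3][5], a[5][3]);
        return 0;
    }

In practice the block size would be chosen so that a block of the source and a
block of the destination fit comfortably in physical memory at the same time.
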
*******************************
***** TOPIC : I/O DEVICES *****
*******************************

This part of the course is essentially 'show and tell'.

~~~~~~~~~~~~~~~~
>>> Terminal <<<
~~~~~~~~~~~~~~~~

+------------+
|            |
|   screen   |
|            |    <- full duplex terminal
+------------+     +---------+
      ^------------|         |
                   |   CPU   |
      v------------|         |
  /------------\   +---------+
 /   keyboard   \
/________________\

An antique device. It consists of a screen and a keyboard. Notice that there's
no wire from the screen to the keyboard - they both talk directly to the CPU.
If you have a 'full duplex' connection, this is how it looks; with two wires
the CPU can output to the screen at the same time as the keyboard sends input
to the CPU. Older terminals had what were called 'half duplex' connections, in
which the screen and keyboard were connected directly, via the same wire they
used to speak to the CPU.

+------------+
|            |
|   screen   |
|            |    <- half duplex terminal
+------------+
                   +---------+
      ^            |         |
      |------------|   CPU   |
      v            |         |
  /------------\   +---------+
 /   keyboard   \
/________________\

Since they are connected by the same wire, the keyboard and the CPU cannot use
the wire to communicate at the same time. So if the CPU is outputting something
to the screen, you are not able to type. However, such machines typically
featured an interrupt button, so you could force the CPU to stop outputting in
order to listen to input from the keyboard. Terminals at one time weren't even
buffered. The earliest incarnations of terminals were teletype machines, which
had a mounted barrel of characters that spun, tilted, and hit a ribbon to
produce a character on the page, and were able to process 10 characters a
second.

Student >> You're talking about typewriters?
Smith >> No, I'm talking about teletypes. You're thinking of electric
typewriters. Those were the *next* step.

As soon as terminals switched to screens, they were able to process many more
characters at a time - approximately 1900 to 2000 baud, at about 10 bits per
character transmitted. Before mainframes came about, terminal keyboards would
send one interrupt per character pressed. Mainframes had to support multiple
users, so constant interrupts were a bad idea, and hence I/O controllers were
introduced. The controllers would buffer input from the keyboard and would
instead send an interrupt to the main processor only when carriage return was
pressed. Since then, with respect to PCs, people have figured that processor
time is cheap, so the 'one interrupt per character' functionality is
essentially what we're back to.

~~~~~~~~~~~~~~~~~~~~
>>> Line Printer <<<
~~~~~~~~~~~~~~~~~~~~

A line printer is basically an overgrown electric typewriter. It feeds in a
continuous stream of paper - separated into sheets by perforations - via
sprockets that catch on holes punched in the edges. IBM mechanical line
printers could do 2000 lines per minute. Such a printer had a steel belt, like
a motorcycle chain, and mounted on this belt were the little characters. The
belt would whip around; when the right character came by, a little hammer would
hit the belt against a ribbon and your character would appear on the paper.
There wouldn't be just one of each character on the belt - there would be
several copies of the common characters, so that you wouldn't have to whip the
whole belt around to get your character. The rarer characters would appear only
once, so it would take a while to get things like an ampersand.

Student >> Are you going to test on this type of information?
Smith >> I'm not going to test you on antique devices, but look in the reader
to see what kind of information I'll test for on an exam. One of the things I
will ask about is parameter values. I won't ask specifically how long it takes
a disk to do this or that - whether it's five milliseconds or four - but when
we get to current devices, you should know within a factor of three or five or
whatever what the parameter values are - you should know how fast things are.
You should know basically where the decimal point goes.

Line printers had an interesting problem: they could print 132 characters per
line, and a 133rd character, right at the beginning, was carriage control;
depending on what you put there you could move down a line, skip to a new page,
and so on. One common mistake was that when you wanted to print a numbered list
or table, you'd format the first 99 lines fine, but on line 100 a '1' character
would end up in the carriage control column, which meant page feed - skip to
the next page - so every line after that would end up on its own page.

Then IBM came out with the laser printer.
It resembled a line printer, especially in the fact that it was approximately
as large as a Volkswagen Bug, but it literally used a laser to print on the
paper, and it went about four or five times faster than a line printer. The
first laser printer cost about $300,000.