Chris Downie (-ai), Carly Rector (-ak)
Lecture notes on 3/30/05

I/O Optimization
================

Block Size Optimization
-----------------------
Small blocks
 * small I/O buffers
 * quickly transferred
 * require many more transfers for a fixed amount of data
 * high overhead on disk - wasted bytes for every disk block
   (interrecord gaps, header bytes, ECC bytes)
 * more entries in the file descriptor (inode) to point to blocks
 * less internal fragmentation
 * if allocation is random, more seeks

Optimal block sizes tend to range from 2K to 8K bytes. (This is an
old analysis; now approximately 4-64K.) The optimum increases with
improvements in technology.
 * Berkeley Unix uses 4K blocks (now 8K?)
 * The basic (hardware) block size on the VAX is 512 bytes
 * Berkeley Unix also uses fragments that are one fourth the size
   of the logical block

Clever solution: in addition to allocating integral blocks, the
last block of a file can consist of up to 4 fragments, each a
quarter of the block size. The minimum allocation unit is thus
really a quarter of a block.

+----------------------+
| In-class Quote Time! |
+----------------------+
"When stupidity is the explanation for something, it's probably the
right one."
------------------------

Disk Arm Scheduling
-------------------
In timesharing systems, it may sometimes be the case that several
disk I/Os are requested at the same time. We will use the following
demo sequence to illustrate the differences among disk arm
scheduling algorithms.

Demo Sequence: 38, 150, 12, 18, 302, 804, 3

FCFS - First come, first served (FIFO)
 - May result in a lot of unnecessary disk arm motion under heavy
   loads.
 - Ex: 38, 150, 12, 18, 302, 804, 3

SSTF - Shortest Seek Time First - handle the nearest request first.
 - This can reduce arm movement and result in greater overall disk
   efficiency, but some requests may have to wait a long time.
 - Problem: starvation! Imagine that the disk is heavily loaded
   with 3 open files. Two of the files are located near the center
   of the disk, the other near the edge.
   The disk can be fully busy servicing the first two files while
   ignoring the last one.
 - Ex: 38, 18, 12, 3, 150, 302, 804

Scan - Like an elevator. Move the arm in one direction, servicing
requests in that direction. Then reverse and continue.
 - Advantage: doesn't get hung up in any one place for very long
   and works well under heavy load, but it may not get the shortest
   seeks.
 - Also tends to neglect files at the periphery of the disk.
 - Ex: 38, 150, 302, 804, 18, 12, 3

Cscan - Circular scan, like a one-way elevator. Moves only in one
direction; when it finds no further requests in the scan direction,
it returns immediately to the furthest request in the other
direction and resumes the scan.
 - Treats all files (and tracks) equally, but has a somewhat higher
   mean access time than Scan.
 - Ex: 38, 150, 302, 804, 3, 12, 18

Overall
-------
SSTF has the best mean access time. Scan or Cscan can be used if
there is a danger of starvation.

Theory vs. Reality
------------------
"Let's assume that all the blocks we're asking for are uniformly
distributed over the disk's surface" -- but in reality many blocks
are written sequentially, in files.
 * Not many files are open at a given time
 * Not very many processes with open files are running at a given
   moment
 * Thus, few random requests and a lot of sequential I/Os
 * Most of the time there aren't very many disk requests in the
   queue
 * Also, if contiguous allocation is used (as with OS/360), then
   seeks are seldom required

Thus, the disk scheduling algorithm is rarely an important decision.

Aside: Google example
---------------------
How many disks do you need to handle a large number of requests for
data? If there are N requests per second and K I/Os per request,
there would be N*K I/Os per second. About 5ms per I/O implies a
maximum of 200 I/Os per second per disk, so N*K/200 disks are
needed to handle all that. This is why Google can easily give out
GBs of free storage.
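That back-of-the-envelope arithmetic can be sketched in a few lines of code; the request rate and I/Os-per-request figures below are made-up illustrative numbers, not from the lecture:

```python
# Disks needed to serve a given request load, per the aside above.
# Assumes ~5 ms per random I/O, i.e. at most 200 I/Os/sec per disk.

def disks_needed(requests_per_sec, ios_per_request, ios_per_disk_per_sec=200):
    ios_per_sec = requests_per_sec * ios_per_request  # total N*K I/Os per second
    # Round up: a fractional disk still requires a whole physical disk.
    return -(-ios_per_sec // ios_per_disk_per_sec)

# Hypothetical example: 10,000 requests/sec, 4 I/Os per request
print(disks_needed(10_000, 4))  # -> 200 disks
```

Note that the disk count is driven entirely by the I/O rate, not by capacity, which is why the resulting disks sit mostly empty.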
They have many disks in order to handle all the requests they get
in a timely manner, but most of them are around 90% empty.

Rotational Scheduling
---------------------
It is rare to have more than one request outstanding for a given
cylinder (this was more relevant when drums were used).
 * SRLTF (shortest rotational latency time first) works well, but
   rotational scheduling can be useful for writing data if we don't
   have to write back to the same location (log structured file
   system - see end of notes).
 * Rotational scheduling is hard using logical block addresses
   (LBAs), since you don't know the rotational position or the
   number of blocks per track.
 * Rotational and seek scheduling can be usefully combined (into
   "shortest time to next block") if done in the onboard disk
   controller, which should know the angular and radial position.

"Skip-Sector" or "Interleaved" disk allocation
 * Imagine that you are reading the blocks of a file sequentially
   and quickly, and the file is allocated sequentially.
 * Usually, you will find that you try to read a block just after
   the start of that block has been passed.
 * The solution is to allocate file blocks to alternate disk blocks
   or sectors. Then we haven't passed a block when we want to read
   it.
 * Note that if all bits read are immediately placed into a
   semiconductor buffer, this is unnecessary.

Example: order = 1, 6, 2, 7, 3, 8, 4, 9, 5.

   Normal Disk Sector Numbering         Interleaved Numbering
   ============================         =====================
           9    1                               5    1
         _________                            _________
     8  /\   |   /\  2                    9  /\   |   /\  6
       /  \  |  /  \                        /  \  |  /  \
      /\_  \ | /  _/\                      /\_  \ | /  _/\
     |   \_ \|/ _/   |                    |   \_ \|/ _/   |
   7 |    _={ )=_    | 3                4 |    _={ )=_    | 2
     |  _/ /   \ \_  |                    |  _/ /   \ \_  |
      \/  /     \  \/                      \/  /     \  \/
    6  \ /       \ /  4                 8  \ /       \ /  7
        \/_______\/                         \/_______\/
            5                                   3

Track offset for head and cylinder switching
 * It takes time to switch between heads on different tracks or
   cylinders. Thus we may want to skip several blocks when moving
   sequentially between tracks, to allow the head to be selected.
 * This is the same concept as above, except with reordering tracks
   rather than reordering sectors.

Aside: Microcode
----------------
Computer architects - the user instruction set is not the machine
instruction set. A user instruction is a call to "microcode", a
much lower level instruction set the user doesn't see. As an
example, the VAX had a user instruction to calculate polynomials.
This is clearly composed of lower level code - the "microcode."

I/O guys - code embedded in the controller ("firmware"). This is
the sense we usually mean it in.

File Placement
--------------
Seek distances will be minimized if commonly used files are located
near the center of the disk.
 * Even better results are obtained if reference patterns are
   analyzed and files that are frequently referenced together are
   placed near each other.
 * The frequency of seeks, and queuing for disks, will be reduced
   if commonly used files (or files used at the same time) are
   located on different disks.
 * E.g., spread the paging data sets and operating system data over
   several disks.

Disk Caching
------------
Keep a cache of recently used disk blocks in main memory.
 * Recently read blocks are retained in the cache until replaced.
 * Writes go to the disk cache and are later written back.
 * The cache would typically include the index blocks for an open
   file.

Also use the cache for read ahead and write behind:
 - Read ahead: extra blocks are read beyond what was requested.
 - Write behind: writes are not immediately written back; the disk
   does write backs when it is free.
 * Can load entire disk tracks into the cache at once.
 * Typically works quite well - hit ratios of 70-90%.
 * A power backup can be useful when using such a cache, since a
   failure means that some things you think have been written to
   disk have in fact not been.

Can also do caching in the disk controller - most controllers these
days have 64K - 4MB of cache/buffer in the controller. Mostly
useful as a buffer, not a cache, since the main memory cache is so
much larger.

+----------------------+
| In-class Quote Time! |
+----------------------+
LRU says: "I don't know nuttin'."

The question of alternative replacement algorithms came up in
class, with particular reference to S2Q and F2Q (various hybrids of
FIFO and LRU taught in CS186). While Prof. Smith was unable to
comment on the efficiency of these algorithms relative to LRU, he
stated that despite how simple the LRU replacement scheme is, he
believes it will outperform any more sophisticated scheme.
------------------------

Prefetching and Data Reorganization
-----------------------------------
Since disk blocks are often read (and written) sequentially, it can
be helpful to prefetch ahead of the current read point. It is
therefore also useful to make sure that the physical layout of the
data reflects the logical organization of the data - i.e.,
logically sequential blocks are also physically sequential. Thus it
is useful to periodically reorganize the data on disk.

Data Replication
 * Frequently used data can be replicated at multiple locations on
   the disk (seek to the nearest copy).
 * This means that on writes, the extra copies must either be
   updated or invalidated.

ALIS - Automatic Locality Improving Storage
 * Best results are obtained when techniques are combined:
   reorganize to make sequential, cluster, and replicate.

+----------------------+
| In-class Quote Time! |
+----------------------+
Note: Ants don't like living underwater.
------------------------

RAID - Redundant Array of Inexpensive Disks
-------------------------------------------
Observations:
 * Small disks are cheaper than large ones (due to economies of
   scale).
 * Failure rate is constant, independent of disk size.

Therefore, if we replace a few large disks with lots of small
disks, the failure rate increases.

Solution:
 * Interleave the blocks of the file across a set of smaller disks,
   and add a parity disk.
 * Note that since we presume (a) only one disk failure, and (b) we
   know which disk has failed, we can reconstruct the failed disk.
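A minimal sketch of that reconstruction, using XOR parity over hypothetical 2-byte blocks (block contents here are arbitrary illustrative values):

```python
from functools import reduce

def parity(blocks):
    """XOR the data blocks byte-by-byte to get the parity block."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def reconstruct(surviving_blocks, parity_block):
    """Rebuild the single failed block: XOR of all survivors plus parity."""
    return parity(surviving_blocks + [parity_block])

data = [b"\x01\x02", b"\xff\x00", b"\x10\x20"]  # blocks on 3 data disks
p = parity(data)                                 # block on the parity disk
# Disk 1 fails; its block is recovered from the others and the parity:
assert reconstruct([data[0], data[2]], p) == data[1]
```

This works because XOR is its own inverse: XORing the parity with every surviving block cancels them out, leaving exactly the missing block.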
 * Can do parity in two directions for extra reliability.

+----------------------+
| In-class Quote Time! |
+----------------------+
Prof. Smith's list of things which can cause a disk to fail:
hitting it with a sledgehammer, spilling Coke on it, zapping it
with static electricity.
Student contribution: spraying it with RAID spray.
Prof. Smith: "But then it would be bug free!"
Groans and forced chuckles from the class at large.
------------------------

Advantage: improves read bandwidth.
Problem: we have to write the parity disk on every write, so it
becomes a bottleneck.
 * One solution - interleave on a different basis than the number
   of disks. That way the parity location varies, and the
   bottleneck is spread around.

Types of RAID
-------------
RAID 0 - ordinary disks (striping, no redundancy)
RAID 1 - replication (mirroring)
RAID 4 - parity disk in a fixed location
RAID 5 - parity blocks in varying locations

Log Structured File System
--------------------------
Used in a file system dominated by writes. Blocks are all written
in one place (sequentially), with a pointer placed in the location
where they would normally be.
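The idea can be sketched as a toy in-memory store (the class and names below are my own illustration, not from the lecture): every write appends to the end of a single log, and an index records where each logical block's latest version lives.

```python
# Toy log-structured store: all writes go to the end of one log;
# an index maps each logical block number to its offset in the log.
# (Illustrative sketch only - a real LFS adds segments, cleaning,
# and an inode map.)

class LogFS:
    def __init__(self):
        self.log = []      # the sequentially written log of blocks
        self.index = {}    # logical block number -> offset in log

    def write(self, block_no, data):
        self.index[block_no] = len(self.log)  # new version appended at end
        self.log.append(data)

    def read(self, block_no):
        return self.log[self.index[block_no]]

fs = LogFS()
fs.write(7, b"old")
fs.write(7, b"new")   # overwrite appends; index now skips past b"old"
print(fs.read(7))     # -> b'new'
```

Note that overwriting a block never seeks back to its old location; the stale copy simply remains in the log until reclaimed, which is why this layout suits write-dominated workloads.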