;;***********************************************
;;  cs162 Lecture Notes for 3/30, taken by Tabassum Khan
;;***********************************************

Announcements:
    Midterm 2 is next Wednesday. It covers everything from Day 1 of cs162
    up to the lecture on Monday 04/03.

TOPIC: I/O OPTIMIZATION

We continue the discussion of the I/O system -- today, I/O optimization.

Why do we care about I/O optimization?
  1. Disk is the major I/O device -- roughly 98% of the I/Os. Some I/O goes
     to tapes, terminals, or the keyboard, but mostly your I/O goes to disk.
  2. Disk is mechanical, which means it is thousands to millions of times
     slower than the electronic devices in the system.
        CPU : ~50% faster every year
        Disk: ~7-9% faster every year
     So the CPU ends up spending all its time waiting for disk I/O, which
     can bring the system to a near halt. This is why people keep working on
     better and better I/O optimizations.

** Block Size Optimization **

On a disk, each sector is typically 512 bytes, but we can choose how much we
grab in each I/O -- anywhere from 512 bytes up to any multiple of that.

* Small blocks
    * Smaller blocks mean smaller I/O buffers. This was a problem when I/O
      buffers competed with programs for memory, but not anymore.
    * Smaller blocks transfer faster. If you do the arithmetic you will
      notice that for a 1K block the transfer time is not a significant part
      of the access time, but for a megabyte block the transfer time
      dominates and the I/O takes much longer. (A rough worked example
      appears at the end of this section.)
    * But small blocks require many more transfers for a fixed amount of
      data.
    * And they have high overhead on disk -- wasted bytes for every block
      (inter-record gaps, header bytes, ERC error-correction bytes). With
      physically small blocks on disk you by and large don't have a choice,
      because disks are hard sectored at 512 bytes these days; but if you
      can adjust it, anyone sensible will make the block size larger,
      because otherwise you waste a lot of space on inter-record gaps, marks
      that tell you where the tracks are, etc.

At the end of last lecture someone asked:
    Q: Why is a formatted disk smaller than an unformatted disk?
    A: Inter-record gaps, error correction codes, and marks on the disk
       telling you where each track and sector goes.

* Small blocks also mean more entries in the file descriptor (inode) to
  point to the blocks -- although that can usually be simulated in the
  system: the disk may be hard sectored at 512 bytes, but if you use a 4K
  logical block (8 sectors), your file descriptor only needs to point to
  4K blocks.
* Small blocks give less internal fragmentation. This was a big issue when
  disks were small; now disks are 200 Gig, so we don't care much.
* With random allocation, small blocks mean more seeks.
* Optimal block sizes tend to range from 2K to 8K bytes, with the optimum
  increasing as technology improves.
  (NOTE: the above stats are based on old analysis.)
* Berkeley Unix uses 4K blocks (it may have increased to 8K by now). The
  basic (underlying hardware) block size on the VAX is 512 bytes.
* Berkeley Unix also uses fragments that are 1/4 the size of the logical
  block. In addition to allocating an integral number of blocks, the last
  block of a file can be made up of fragments, each 1/4 of a block, so the
  minimum space a file can be allocated is 1/4 of a block. This avoids
  wasting disk space on lots of small files.
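To make the transfer-time claim concrete, here is a rough back-of-the-
envelope sketch in Python. The seek time, rotational latency, and transfer
rate are illustrative assumptions, not numbers from the lecture.

    # Rough disk access-time arithmetic (illustrative numbers, not from lecture).
    SEEK_MS       = 10.0    # assumed average seek time
    ROTATION_MS   = 4.0     # assumed average rotational latency
    XFER_MB_PER_S = 50.0    # assumed sustained transfer rate

    def access_time_ms(block_bytes):
        """Total time to read one block: seek + rotation + transfer."""
        transfer_ms = block_bytes / (XFER_MB_PER_S * 1e6) * 1000.0
        return SEEK_MS + ROTATION_MS + transfer_ms

    for size in (1 << 10, 1 << 20):          # 1 KB block vs 1 MB block
        print(f"{size:>8} bytes: {access_time_ms(size):6.2f} ms total")
    # For the 1 KB block the transfer time (~0.02 ms) is negligible next to
    # the ~14 ms of mechanical delay; for the 1 MB block the transfer
    # (~21 ms) is the largest component of the access time.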
** Disk Arm Scheduling **

In timesharing systems there may sometimes be several disk I/Os requested at
the same time, so there has to be some sort of queue where I/O requests wait
their turn to be served. Anything that has a queue can be scheduled. We did
CPU scheduling earlier in the semester; now we will do disk arm scheduling.

* First come first served (FIFO, FCFS): deal with the requests in the order
  they were received. This may result in a lot of unnecessary disk arm
  motion under heavy loads.

* Shortest seek time first (SSTF): handle the nearest request first. This
  can reduce arm movement and result in greater overall disk efficiency, but
  some requests may have to wait a long time.
    * Potential starvation. Imagine that the disk is heavily loaded, with 3
      open files. Two of the files are located near the center of the disk,
      and the third near the edge. The disk can stay fully busy servicing
      the first two files and ignore the last one.

* SCAN: like an elevator. Move the arm in one direction, servicing requests,
  until there are no additional requests in that direction. Then turn around
  and continue servicing in the reverse direction.
    * This algorithm doesn't get hung up in any one place for very long, and
      it works well under heavy load. But it may not get the shortest seek.
    * Files at the periphery of the disk are somewhat discriminated against,
      because the center of the disk is visited twice as often as the edge.

    [ASCII sketch of a disk, tracks 0..1000: under SCAN the arrows show the
     arm sweeping in one direction servicing requests, then reversing and
     sweeping back.]

* CSCAN (circular scan): like a one-way elevator -- the arm moves in only
  one direction. When it finds no further requests in the direction of the
  scan, it returns immediately to the furthest request in the other
  direction and resumes the scan.
    * This treats all files (and tracks) equally, but has a somewhat higher
      mean access time than SCAN, because it takes some time to drag the arm
      from the edge back to the other end to start the scan again.
    * But it's not as bad as it sounds. Seek time has a huge start-up and
      stop component and the arm moves very fast in the middle, so an
      edge-to-edge seek may be only about 5 times longer than a
      track-to-track seek (start the arm, stop the arm, check that you are
      on the right track).

    [ASCII sketch of a disk, tracks 0..1000: under CSCAN both arrows point
     in the same direction -- the arm sweeps one way servicing requests,
     jumps back, and sweeps the same way again.]

SERVICING EXAMPLE (a small simulation that reproduces these orderings
follows the starvation example below):

    Say our I/O requests were received in this order:
        queue = 38 150 12 18 302 804 3

        FIFO  --> 38 150 12 18 302 804 3
        SSTF  --> 38 18 12 3 150 302 804
        SCAN  --> 38 150 302 804 18 12 3
        CSCAN --> 38 150 302 804 3 12 18

    In SSTF, starvation can occur when the requests arrive in an order like:
        38 39 38 39 38 39 38 39 *12 38 39 38 39 ...
    Poor 12 will never get served :(
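Here is a minimal Python sketch of these policies. The starting head
position (track 30, between 18 and 38, moving toward higher-numbered tracks)
is an assumption chosen so the output matches the example orderings above.

    # Minimal disk-arm scheduling sketch (head start/direction are assumptions).
    def fifo(requests, head=30):
        return list(requests)

    def sstf(requests, head=30):
        pending, order = list(requests), []
        while pending:
            nxt = min(pending, key=lambda t: abs(t - head))  # nearest track first
            pending.remove(nxt)
            order.append(nxt)
            head = nxt
        return order

    def scan(requests, head=30):
        up   = sorted(t for t in requests if t >= head)      # serve on the way out...
        down = sorted((t for t in requests if t < head), reverse=True)  # ...then reverse
        return up + down

    def cscan(requests, head=30):
        up   = sorted(t for t in requests if t >= head)      # serve on the way out...
        rest = sorted(t for t in requests if t < head)       # ...then jump back, sweep again
        return up + rest

    queue = [38, 150, 12, 18, 302, 804, 3]
    for name, algo in [("FIFO", fifo), ("SSTF", sstf), ("SCAN", scan), ("CSCAN", cscan)]:
        print(f"{name:5} -> {algo(queue)}")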
Summary:
* SSTF has the best mean access time. SCAN or CSCAN can be used if there is
  a danger of starvation. FIFO is worst. This ranking is not really very
  credible, though, because it rests on an assumption the Prof. made at the
  beginning of the discussion: that the requested blocks are uniformly
  distributed over the surface and arrive in random order. In reality the
  file system tries to allocate blocks sequentially, on the same track and
  cylinder, and blocks belong to files. At any given time, how many files
  are open and actually doing I/O? Not many -- maybe 2 or 3 -- and those
  files are laid down sequentially, so we don't really have random requests.
  The chances are that only one file is being read, and then the scheduling
  problem is simplified: you hardly have to seek at all.
* Most of the time there aren't very many disk requests in the queue, so
  this isn't a terribly important decision.
* Also, if contiguous allocation is used (as with OS/360), then seeks are
  seldom required.
* There was lots of research on scheduling random-order requests before
  anyone pointed out that this is silly: researchers studied queues of
  length 500, while in reality the average queue length is more like 0.05 to
  0.2 -- so it is not really a scheduling problem at all.
* There are some systems that are genuinely disk-limited, like Google.
  With 10K-20K machines, most of their disks are only about 10% occupied,
  because they are disk-limited: they need all those disks to get enough
  aggregate I/O throughput. If each disk I/O takes 10 ms, one disk can do
  about 100 I/Os per second, so to do 10,000 I/Os per second you need 100
  disks -- even if you don't have enough data to fill 100 disks, you cannot
  do 10,000 I/Os per second without them. So you spread the data out over
  those 100 disks. (And since 90% of the disk space is free anyway, why not
  give a GB of it to somebody as Gmail storage?)

** Rotational Scheduling **

If you can schedule the arm, you can also schedule things rotationally.

* It is rare to have more than one request outstanding for a given
  cylinder. (This was more relevant when drums were used.)
* SRLTF (shortest rotational latency first) works well -- it is the
  rotational equivalent of SSTF.
* Rotational scheduling can be useful for writing data, if we don't have to
  write back to the same location (as in a log-structured file system).
* Rotational scheduling is hard using logical block addresses (LBA), since
  you don't know the rotational position or the number of blocks per track.
* Rotational and seek scheduling can be usefully combined (into "shortest
  time to next block") if done in the onboard disk controller, which knows
  the angular and radial position.

** Skip-Sector or Interleaved Disk Allocation **

This was a big problem when we didn't have buffers that could hold every
block. If you read a file sequentially, you read a block, the disk sends an
I/O interrupt, the CPU says "thank you, now get me another block" -- and by
that time the head has already passed the entire inter-record gap and the
start of the next block, so you have to wait a full rotation to get to it.

SOLUTION: skip-sector disk allocation (alternate the blocks). A sketch of
the resulting logical-to-physical mapping follows this list.

    [ASCII sketch of a track with 12 sectors: logically consecutive blocks
     are placed in every other sector around the track.]

* Imagine that you are reading the blocks of a file sequentially and
  quickly, and the file is allocated sequentially.
* Usually you will find that you try to read a block just after the start of
  the block has passed under the head.
* The solution is to allocate file blocks to alternate disk blocks or
  sectors. Then the block hasn't been passed when we want to read it.
* Note that if all bits read are immediately placed into a semiconductor
  buffer, this is unnecessary.
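A minimal sketch of the interleaving idea: logically consecutive blocks are
placed in every other physical sector of a track. The 12-sector track, the
interleave factor of 2, and the physical_sector helper are all illustrative
assumptions, not details from the lecture.

    # Skip-sector (interleaved) allocation: logical block -> physical sector.
    SECTORS_PER_TRACK = 12    # assumed track size
    INTERLEAVE        = 2     # assumed interleave factor

    def physical_sector(logical_block):
        """Place logically consecutive blocks INTERLEAVE sectors apart."""
        start = logical_block * INTERLEAVE
        # The second lap around the track fills the sectors skipped the first time.
        return (start + start // SECTORS_PER_TRACK) % SECTORS_PER_TRACK

    layout = [None] * SECTORS_PER_TRACK
    for block in range(SECTORS_PER_TRACK):
        layout[physical_sector(block)] = block
    print(layout)   # sector order around the track: [0, 6, 1, 7, 2, 8, 3, 9, 4, 10, 5, 11]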
** Track Offset for Head and Cylinder Switching **

* It takes time to switch between heads on different tracks or cylinders.
  Thus we may want to skip several blocks when moving sequentially between
  tracks, to allow time for the head to be selected.

** File Placement **

Obviously, it takes longer to move the arm over a long distance than over a
short one. If you put your frequently used files in the center of the disk,
right next to each other, you will have shorter seek times than if you put
them out on the periphery or at random. And if you have 2 disks, you can
split your files over those 2 disks so that you can read 2 files at once --
2 I/Os then take the same time as 1 I/O.

* Seek distances will be minimized if commonly used files are located near
  the center of the disk.
* Even better results if reference patterns are analyzed and files that are
  frequently referenced together are placed near each other.
* Frequency of seeks, and queueing for disks, will be reduced if commonly
  used files (or files used at the same time) are located on different
  disks. E.g. spread the paging data sets and operating system data sets
  over several disks.

** Disk Caching **

Caching works for TLBs, caching works for memory, caching works for the CPU
-- maybe it works for disks too. Amazingly enough, it does.

* Keep a cache of recently used disk blocks in main memory.
* Recently read blocks are retained in the cache until replaced.
* Writes go to the disk cache, and are later written back.
* The cache would typically include the index blocks of an open file.
* Also use the cache for read-ahead and write-behind.
* Can load entire disk tracks into the cache at once.
* Typically works quite well -- hit ratios of 70-90%.
* Can also do caching in the disk controller -- most controllers these days
  have 64K-4MB of cache/buffer. This is mostly useful as a buffer, not a
  cache, since the main memory cache is so much larger.

** Prefetching and Data Reorganization **

* Since disk blocks are often read (and written) sequentially, it can be
  very helpful to prefetch ahead of the current read point.
* It is therefore also useful to make sure that the physical layout of the
  data reflects the logical organization of the data -- i.e. logically
  sequential blocks are also physically sequential. Thus it is useful to
  periodically reorganize the data on the disk.

** Data Replication **

* Frequently used data (such as file descriptors, the inode table, the C
  compiler) can be replicated at multiple locations on the disk; then seek
  to the nearest copy.
* (The catch:) on writes, the extra copies must either be updated or
  invalidated.
* ALIS -- Automatic Locality Improving Storage (a student research project).
* Best results are obtained when the techniques are combined: reorganize to
  make data sequential, cluster, and replicate. This saves a fair amount of
  I/O time.

** RAID -- Redundant Array of Inexpensive Disks **

Invented by a research group at UC Berkeley. IBM had done some internal
research on the idea but didn't patent it.

* Observations:
    * Small disks are cheaper than large ones (due to economies of scale).
    * Failure rate per disk is roughly constant, independent of disk size.
    * Therefore, if we replace a few large disks with lots of small disks,
      the failure rate increases.
* Solution:
    * Interleave the blocks of the file across a set of smaller disks, and
      add a parity disk.
    * Note that since we presume (a) only one disk fails at a time, and
      (b) we know which disk failed, we can reconstruct the failed disk.
    * Can do parity in two directions for extra reliability.
    * XORing the bits is the key idea -- a small sketch follows below.
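A minimal sketch of the XOR parity idea, assuming a stripe across 4 data
disks; the block contents, the reconstruct helper, and the parity_disk
helper (which hints at the rotating-parity layout of RAID 5, discussed
next) are all illustrative, not from the lecture.

    # XOR parity sketch for a 4-data-disk stripe (data and helpers are illustrative).
    from functools import reduce

    def parity(blocks):
        """Parity block = byte-wise XOR of all the data blocks in the stripe."""
        return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

    def reconstruct(surviving_blocks, parity_block):
        """Rebuild the one missing block: XOR the survivors with the parity."""
        return parity(surviving_blocks + [parity_block])

    data = [b"AAAA", b"BBBB", b"CCCC", b"DDDD"]   # one stripe across 4 data disks
    p = parity(data)

    lost = data[2]                                # pretend disk 2 failed
    rebuilt = reconstruct(data[:2] + data[3:], p)
    assert rebuilt == lost                        # parity lets us recover the block

    def parity_disk(stripe, ndisks=5):
        """RAID-5-style rotating parity: a different disk holds parity per stripe."""
        return stripe % ndisks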
* Advantage:
    * Improves read bandwidth, since reads can be spread across the disks
      (and with mirroring, either copy can serve the read).
* Problem:
    * We have to write the parity disk on every write, so it becomes a
      bottleneck.
    * A solution: interleave on a different basis than the number of disks.
      Then the disk holding the parity varies, and the bottleneck is spread
      around.

* Types of RAID:

    * RAID 0 -- ordinary disks, no redundancy.
      The simplest RAID level, RAID 0 should really be called "AID", since
      it involves no redundancy. Files are broken into stripes of a size
      dictated by the user-defined stripe size of the array, and the stripes
      are sent to each disk in the array. Giving up redundancy gives this
      level the best overall performance of the single RAID levels,
      especially for its cost.

          *D*

    * RAID 1 -- replication or mirroring.
      RAID 1 is usually implemented as mirroring: a drive has its data
      duplicated on two different drives, using either a hardware RAID
      controller or software. If either drive fails, the other continues to
      function as a single drive until the failed drive is replaced.

          *D*  *D*

    * RAID 4 -- parity disk in a fixed location.
      RAID 4 improves performance by striping data across many disks in
      blocks, and provides fault tolerance through a dedicated parity disk.

          *D*  *D*  *D*  *D*  *P*

    * RAID 5 -- parity disk in a varying location.
      RAID 5 stripes both data and parity information across three or more
      drives. It is similar to RAID 4 except that it exchanges the dedicated
      parity drive for a distributed parity scheme, writing data and parity
      blocks across all the drives in the array. This removes the bottleneck
      that the dedicated parity drive represents, improving write
      performance slightly and allowing somewhat better parallelism in a
      multiple-transaction environment.

      (Sorry, I can't draw this one in ASCII -- I give up. See the picture
      in the textbook, pg 559, Figure 14.9.)

Question: What's the catch?
Answer: The assumption that only one disk fails at a time is not entirely
reasonable. In reality the failures are not necessarily independent -- all
the disks might fail at the same time, for many possible reasons:
    * power supply failure
    * a lightning strike hits them
    * an earthquake hits the disks
    * a maniac comes in with an AK-47 and shoots all the disks at once.
RAID is nice, but if you want to be really sure your disk failures are
independent, the most you can do is make sure you get your disks from
different companies, different manufacturing batches, different microcode
-- and then GOOD LUCK.

***************************END OF LECTURE***********************************