Topic: File Structure, I/O Optimization

File: a named collection of bits (usually stored on disk). From the OS's standpoint, the file consists of a bunch of blocks stored on the device. The programmer may actually see a different interface (bytes or records), but this doesn't matter to the file system (just pack the bytes into blocks, and unpack them again on reading). A file may have attributes and properties, e.g. name(s), protection, type (numeric, alphabetic, binary, C program, Fortran program, data, etc.), time of creation, time of last use, owner, length, link count, layout (format).

How do we (or can we) use a file?
Sequential: information is processed in order, one piece after the other. This is by far the most common mode: e.g. editor writes out new file, compiler compiles it, etc.
Random Access: can address any block in the file directly without passing through its predecessors. E.g. the data set for demand paging, libraries, databases. Need to know what block we want (e.g. some sort of index or address is needed).
Keyed: search for blocks with particular values, e.g. hash table, associative database, dictionary. Usually not provided by the operating system (but is provided in some IBM systems). Keyed access can be considered a form of random access.

Modern file and I/O systems must address four general problems:
1. Disk Management: efficient use of disk space, fast access to files, file structures, device use optimization. The user has a hardware-independent view of the disk. (Mostly, so does the OS.)
2. Naming: how do users refer to files? This concerns directories, links, etc.
3. Protection: all users are not equal. Want to protect users from each other. Want to have files from various users on the same disk. Want to permit controlled sharing.
4. Reliability: information must last safely for long periods of time.

Disk Management: How should the blocks of the file be placed on the disk? How should the map used to find and access the blocks look?

File Descriptor: a data structure that gives the file's attributes and contains the map which tells you where the blocks of your file are. File descriptors are stored on disk along with the files (when the files are not open).

Some system, user and file characteristics: Most files are small. In Unix, most files are very small - lots of files with a few commands in them, etc. Much of the disk is allocated to large files. Many of the I/O operations are made to large files. Most (between 60% and 85%) of the I/Os are reads. Most I/Os are sequential. Thus, per-file cost must be low, but large files must have good performance.

File Block Layout and Access: contiguous, linked, indexed or tree structured. Note - this is just standard data structures stuff, but on disk.

Contiguous allocation: Allocate the file in a contiguous set of blocks or tracks. Keep a free list of unused areas of the disk. When creating a file, make the user specify its length, and allocate all the space at once. The descriptor contains the location and size.
Advantages: Easy access, both sequential and random. Low overhead. Simple. Few seeks. Very good performance for sequential access.
Drawbacks: Horrible fragmentation will make large files impossible. Hard to predict needs at file creation time. May over-allocate. Hard to enlarge files.
Can improve this scheme by permitting files to be allocated in extents: i.e. ask for a contiguous block of space; if it isn't enough, get another. Example: IBM OS/360 permits up to 16 extents. Extra space in the last extent can be released after the file is written.
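To make the extent idea concrete, here is a minimal sketch (the structure and names are invented for illustration, following the 16-extent limit of the OS/360 example) of a descriptor that records extents and a routine that maps a logical block number to a physical block number:

    /* Illustrative only: a minimal extent-based block map, assuming a
     * descriptor that records up to 16 extents.  Names are made up. */
    #include <stdint.h>

    #define MAX_EXTENTS 16

    struct extent {
        uint32_t start;   /* first physical block of the extent */
        uint32_t length;  /* number of contiguous blocks */
    };

    struct file_desc {
        uint32_t nextents;
        struct extent ext[MAX_EXTENTS];
    };

    /* Map a logical block number to a physical block number.
     * Returns -1 if the logical block is beyond the allocated extents. */
    long map_block(const struct file_desc *fd, uint32_t logical)
    {
        for (uint32_t i = 0; i < fd->nextents; i++) {
            if (logical < fd->ext[i].length)
                return (long)fd->ext[i].start + logical;
            logical -= fd->ext[i].length;   /* skip past this extent */
        }
        return -1;   /* past the end of the file's allocation */
    }

Both sequential and random access are just a short table walk; the drawback, as noted above, is that every extent must be found contiguously on disk.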
Linked files: Link the blocks of the file together as a linked list. In the file descriptor, just keep a pointer to the first block; in each block of the file, keep a pointer to the next block.
Advantages? (Files can be extended, and there are no external fragmentation problems. Sequential access is easy: just chase links.)
Drawbacks? (Random access requires sequential access through the list. Lots of seeking, even in sequential access. Some overhead in each block for the link.)

(Simple) Indexed files: The simplest approach is to just keep an array of block pointers for each file. The file's maximum length must be declared when it is created. Allocate an array to hold pointers to all the blocks, but don't allocate the blocks. Then fill in the pointers dynamically using a free list.
Advantages? Not as much space is wasted by overpredicting; both sequential and random access are easy. Only waste space in the index.
Drawbacks? May still have to set a maximum file size (can have an overflow scheme if the file is larger than the predicted maximum). Blocks are probably allocated randomly over the disk surface, so there will be lots of seeks. The index array may be large, and may require a large file descriptor.

Multi-level indexed files: the VAX Unix solution (version 4.3). In general, any sort of multi-level tree structure. More specifically, we describe what Berkeley 4.3BSD Unix does. File descriptors contain 15 block pointers: the first 12 point to data blocks, the next three to indirect, doubly-indirect, and triply-indirect blocks (256 pointers in each indirect block). Maximum file length is fixed, but large. Descriptor space isn't allocated until needed. (A sketch of this mapping appears below, after the notes on block allocation.)
Advantages: simple, easy to implement, incremental expansion, easy access to small files. Good random access to blocks. Easy to insert a block in the middle of a file. Easy to append to a file. Small file map.
Drawbacks: The indirect mechanism doesn't provide very efficient access to large files: 3 descriptor operations for each real operation. (When we "open" the file, we can keep the first level or two of the file descriptor around, so we don't have to read it each time.) The file isn't generally allocated contiguously, so we have to seek between blocks.

Block Allocation: If all blocks are the same size, can use a bit map - one bit per disk block. Cache parts of the bit map in memory. Select a block at random (or not randomly) from the bitmap. If blocks are of variable size, can use a free list. This requires free storage area management: fragmentation and compaction. In Unix, free blocks are grouped for efficiency: each block on the free list contains pointers to many free blocks, plus a pointer to the next list block. Thus there aren't many disk references involved in allocation or deallocation. The block-by-block organization of the free list means that file data gets spread around the disk.

A more efficient solution (used in the Demos system built at Los Alamos): Allocate groups of sequential blocks. Use the multi-level index scheme described above, but each pointer isn't to one block - it is to a sequence of blocks. When we need another block for a file, we attempt to allocate the next physical block on the track (or cylinder). If we can't do it sequentially, we try to do it nearby. If we have detected a pattern of sequential writing, then we grab a bunch of blocks at a time (releasing them if unused). (The size of the bunch will depend on how many sequential writes have occurred so far.) Always keep part of the disk unallocated (as Unix does now) - then the probability that we can find a sequential block to allocate is high.
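The arithmetic behind the 12 direct plus three indirect pointers can be seen in a small sketch. This is not the actual 4.3BSD bmap() code; it simply prints which pointer slots would have to be followed for a given logical block number, assuming 256 pointers per indirect block as described above:

    /* A sketch of logical-to-physical block resolution with 12 direct
     * pointers and single/double/triple indirect blocks of 256 pointers.
     * A real implementation would read each indirect block from disk (or
     * the disk cache) and follow the pointer instead of printing it. */
    #include <stdio.h>
    #include <stdint.h>

    #define NDIR   12
    #define NINDIR 256          /* pointers per indirect block */

    void show_path(uint64_t lbn)
    {
        if (lbn < NDIR) {
            printf("direct pointer %llu\n", (unsigned long long)lbn);
            return;
        }
        lbn -= NDIR;
        if (lbn < NINDIR) {                        /* single indirect */
            printf("indirect block, slot %llu\n", (unsigned long long)lbn);
            return;
        }
        lbn -= NINDIR;
        if (lbn < (uint64_t)NINDIR * NINDIR) {     /* double indirect */
            printf("double indirect: slot %llu, then slot %llu\n",
                   (unsigned long long)(lbn / NINDIR),
                   (unsigned long long)(lbn % NINDIR));
            return;
        }
        lbn -= (uint64_t)NINDIR * NINDIR;          /* triple indirect */
        printf("triple indirect: slot %llu, slot %llu, slot %llu\n",
               (unsigned long long)(lbn / (NINDIR * NINDIR)),
               (unsigned long long)((lbn / NINDIR) % NINDIR),
               (unsigned long long)(lbn % NINDIR));
    }

For example, logical block 5 uses a direct pointer, block 100 goes through the single indirect block, and block 60000 needs the double indirect block - which is why large files cost extra descriptor reads unless the indirect blocks are cached.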
--------------------------
I/O Optimization

Block Size Optimization
Small blocks (and small I/O buffers - buffers are used for both reads and writes):
Are quickly transferred.
Require lots more transfers for a fixed amount of data.
Have high overhead on disk - wasted bytes for every disk block (inter-record gaps, header bytes, ERC bytes).
Need more entries in the file descriptor (inode) to point to blocks.
Cause less internal fragmentation.
If allocation is random, cause more seeks.
Optimal block sizes tend to range from 2K to 8K bytes, with the optimum increasing with improvements in technology. Berkeley Unix uses 4K blocks (now 8K?). The basic (hardware) block size on the VAX is 512 bytes. Berkeley Unix also uses fragments that are 1/4 the size of the logical block.

Disk Arm Scheduling: in timesharing systems, it may sometimes be the case that there are several disk I/Os requested at the same time.
First come first served (FIFO, FCFS): may result in a lot of unnecessary disk arm motion under heavy loads.
Shortest seek time first (SSTF): handle the nearest request first. This can reduce arm movement and result in greater overall disk efficiency, but some requests may have to wait a long time. The problem is starvation: imagine that the disk is heavily loaded, with 3 open files, two located near the center of the disk and the other near the edge. The disk can be kept fully busy servicing the first two files while ignoring the last one.
Scan: like an elevator. Move the arm in one direction, servicing requests, until there are no additional requests in that direction; then reverse direction and continue. This algorithm doesn't get hung up in any one place for very long, and it works well under heavy load. But it may not get the shortest seek, and it tends to neglect files at the periphery of the disk. (A sketch of the Scan policy appears below.)
CScan (circular scan): like a one-way elevator - it moves only in one direction. When it finds no further requests in the scan direction, it returns immediately to the furthest request in the other direction and resumes the scan. This treats all files (and tracks) equally, but has a somewhat higher mean access time than Scan.
SSTF has the best mean access time; Scan or CScan can be used if there is a danger of starvation. Most of the time there aren't very many disk requests in the queue, so this isn't a terribly important decision. Also, if contiguous allocation is used (as with OS/360), then seeks are seldom required.

Rotational Scheduling: It is rare to have more than one request outstanding for a given cylinder. (This was more relevant when drums were used.) SRLTF (shortest rotational latency first) works well. Rotational scheduling can, however, be useful for writing data if we don't have to write back to the same location (log structured file system). Rotational scheduling is hard using logical block addresses (LBAs), since you don't know the rotational position or the number of blocks per track.

Skip-Sector or Interleaved disk allocation: Imagine that you are reading the blocks of a file sequentially and quickly, and the file is allocated sequentially. Usually, you will find that you try to read a block just after the start of that block has passed under the head. The solution is to allocate file blocks to alternate disk blocks or sectors; then the block hasn't been passed when we want to read it.

Track offset for head switching: It takes time to switch between heads on different tracks. Thus we may want to skip several blocks when moving sequentially between tracks, to allow time for the head to be selected.
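Here is a toy sketch of the Scan (elevator) policy mentioned above. The request-queue representation and the function name are invented for illustration; real drivers keep the queue sorted, but the selection rule is the same: service the closest request ahead of the head, and reverse direction only when nothing remains on the current side.

    #define NO_REQUEST -1

    /* pending[] holds the cylinder of each outstanding request.
     * dir: +1 means the arm is sweeping toward higher cylinders, -1 lower.
     * Returns the index of the request to service next, reversing the
     * direction (via *dir) when nothing remains on the current side. */
    int scan_pick(const int *pending, int npending, int head, int *dir)
    {
        for (int pass = 0; pass < 2; pass++) {
            int best = NO_REQUEST, best_dist = 0;
            for (int i = 0; i < npending; i++) {
                int delta = (pending[i] - head) * (*dir);
                if (delta < 0)
                    continue;                 /* behind us on this sweep */
                if (best == NO_REQUEST || delta < best_dist) {
                    best = i;
                    best_dist = delta;
                }
            }
            if (best != NO_REQUEST)
                return best;
            *dir = -*dir;       /* nothing ahead: reverse, like an elevator */
        }
        return NO_REQUEST;      /* queue was empty */
    }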
File Placement: Seek distances will be minimized if commonly used files are located near the center of the disk. Frequency of seeks, and queueing for disks, will be reduced if commonly used files (or files used at the same time) are located on different disks - e.g. spread the paging data sets and operating system data sets over several disks.

Disk Caching: Keep a cache of recently used disk blocks in main memory. Recently read blocks are retained in the cache until replaced. Writes go to the disk cache, and are later written back. The cache would typically include the index blocks for an open file. Also use the cache for read ahead and write behind. Entire disk tracks can be loaded into the cache at once. This typically works quite well - hit ratios of 70-90%. Caching can also be done in the disk controller - most controllers these days have 64K-4MB of cache/buffer in the controller. This is mostly useful as a buffer, not a cache, since the main memory cache is so much larger.

RAID
Observations: Small disks are cheaper than large ones (due to economies of scale). The failure rate is constant, independent of disk size. Therefore, if we replace a few large disks with lots of small disks, the failure rate increases.
Solution: Interleave the blocks of the file across a set of smaller disks, and add a parity disk. Note that since we presume (a) only one disk failure, and (b) we know which disk failed, we can reconstruct the failed disk. Parity can be done in two directions for extra reliability.
Advantage: Improves read bandwidth.
Problem: We have to write the parity disk on every write, so it becomes a bottleneck. A solution: interleave on a different basis than the number of disks. That means that the parity disk varies, and the bottleneck is spread around.
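The reconstruction argument is just exclusive-or arithmetic. The sketch below (disk count, block size, and names are chosen only for illustration) computes a parity block for one stripe and rebuilds the block of a single failed disk:

    #include <stddef.h>
    #include <stdint.h>

    #define NDATA 4         /* data disks in the stripe */
    #define BLKSZ 512       /* bytes per block */

    /* Compute the parity block: bytewise XOR of the data blocks. */
    void make_parity(uint8_t data[NDATA][BLKSZ], uint8_t parity[BLKSZ])
    {
        for (size_t i = 0; i < BLKSZ; i++) {
            uint8_t p = 0;
            for (int d = 0; d < NDATA; d++)
                p ^= data[d][i];
            parity[i] = p;
        }
    }

    /* Rebuild the block of the one failed disk (we know which one failed)
     * by XOR-ing the surviving data blocks with the parity block. */
    void rebuild(uint8_t data[NDATA][BLKSZ], const uint8_t parity[BLKSZ],
                 int failed)
    {
        for (size_t i = 0; i < BLKSZ; i++) {
            uint8_t p = parity[i];
            for (int d = 0; d < NDATA; d++)
                if (d != failed)
                    p ^= data[d][i];
            data[failed][i] = p;   /* recovered byte */
        }
    }

Because XOR is its own inverse, the same loop that generates the parity also regenerates any one missing block.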
Topic: Directories and Other File System Topics

Naming: How do users refer to their files? How does the OS refer to the file itself? How does the OS find the file, given a name?

A File Descriptor is a data structure or record that describes the file. The file descriptor information has to be stored on disk, so it will stay around even when the OS doesn't. (Note that we are assuming that disk contents are permanent.) In Unix, all the descriptors are stored in a fixed-size array on disk. The descriptors also contain protection and accounting information. A special area of disk is used for this (the disk contains two parts: the fixed-size descriptor array, and the remainder, which is allocated for data and indirect blocks). The size of the descriptor array is determined when the disk is initialized, and can't be changed. In Unix, the descriptor is called an inode (index node), and its index in the array is called its i-number. Internally, the OS uses the i-number to refer to the file. IBM calls the equivalent structure the volume table of contents (VTOC).

The inode is the focus of all file activity in UNIX. There is a unique inode allocated for each file, including directories. An inode is 'named' by its dev/i-number pair. (iget/iget.c)

Inode fields:
reference count (number of times open)
number of links to the file
owner's user id, owner's group id
number of bytes in the file
time last accessed, time last modified, time the inode was last changed
disk block addresses, indirect blocks (discussed previously)
flags (inode is locked, file has been modified, some process is waiting on the lock)
file mode (type of file: character special, directory, block special, regular, symbolic link, socket). A socket is an endpoint of a communication, referred to by a descriptor, just like a file or a pipe. Two processes can each create a socket and then connect those two endpoints to produce a reliable byte stream. (A pipe requires a common parent process; a socket does not, and the processes may be on different machines.)
(items below not in text on 4.3BSD)
protection info (set user id on execution, set group id on execution; read, write, execute permissions; sticky bit? (check))
count of shared locks on the inode
count of exclusive locks on the inode
unique identifier
file system associated with this inode
quota structure controlling this file

When a file is open, its descriptor is kept in main memory; when the file is closed, the descriptor is stored back to disk. There is usually a per-process table of open files. In Unix, there is a process open file table, with one entry for each open of a file. The integer index into that table is the handle for that open. Multiple opens of the same file will get multiple entries. (Note that if a process forks, a given entry can be shared by several processes.) (Standard input is fd 0, standard output is fd 1, and standard error is fd 2; these must be per process.) Unix also has a system open file table, which points to the inode for the file (in the inode table); this table is system wide. There is also the inode table, a system-wide table holding active and recently used inodes. The descriptor is kept in OS space, which is paged, so it may be necessary to take a page fault to get to the descriptor info.

Users need a way of referencing files that they leave around on disk. One approach is just to have users remember descriptor indexes - i.e. the user would have to remember something like the number of the descriptor. Unfortunately, that is not very user friendly. Of course, users want to use text names to refer to files. Special disk structures called directories are used to tell what descriptor indices correspond to what names.

Approach #1: have a single directory for the whole disk. Use a special area of disk to hold the directory. The directory contains <name, descriptor index> pairs. Problems: If one user uses a name, no one else can. If you can't remember the name of a file, you may have to look through a very long list. Security problem - people can see your file names (which can be dangerous). Many personal computers work this way.

Approach #2: have a separate directory for each user (TOPS-10 approach). This is still clumsy: names from a user's different projects get confused, and you still can't remember the names of files. IBM's VM is similar to this. Files have a 3-part name, <name, type, location>, where location is A, B, C, etc. (i.e. which disk). Very painful. (Also, file names are limited to 8 characters.)

Approach #3 - the Unix approach: generalize the directory structure to a tree. Directories are stored on disk just like regular files (i.e. a file descriptor with its 15 block pointers, etc.). User programs can manipulate directories almost like any other file; only special system programs may write directories. Each directory contains <name, descriptor index> pairs. The file pointed to by the index may be another directory; hence, we get a hierarchical tree structure. Names have slashes separating the levels of the tree. There is one special directory, called the root. This directory has no name, and is the file pointed to by descriptor 2 (descriptors 0 and 1 have other special purposes). Note that we need the root - otherwise, we would have no way to reach any files. From the root, we can get anywhere in the file system. The full file name is the path name, i.e. the full name from the root.

A directory consists of some number of blocks of DIRBLKSIZ bytes, where DIRBLKSIZ is chosen such that it can be transferred to disk in a single atomic operation (e.g. 512 bytes on most machines). Each directory block contains some number of directory entry structures, which are of variable length. Each directory entry has info at the front of it, containing its inode number, the length of the entry, and the length of the name contained in the entry. These are followed by the name, padded to a 4-byte boundary with null bytes. All names are guaranteed null terminated.
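A sketch of that entry layout (modeled loosely on 4.3BSD's struct direct; the field and function names here are invented for illustration), together with a scan of one directory block for a name:

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    struct dir_entry {
        uint32_t d_ino;      /* inode number of the entry, 0 if unused */
        uint16_t d_reclen;   /* length of this whole entry, in bytes */
        uint16_t d_namlen;   /* length of the name */
        char     d_name[1];  /* name, null-padded to a 4-byte boundary */
    };

    /* Scan one DIRBLKSIZ-byte directory block for a name; returns the
     * inode number, or 0 if the name is not present in this block. */
    uint32_t dir_lookup(const char *block, size_t blksize, const char *name)
    {
        size_t off = 0;
        while (off < blksize) {
            const struct dir_entry *de =
                (const struct dir_entry *)(block + off);
            if (de->d_reclen == 0)
                break;                       /* malformed block: stop */
            if (de->d_ino != 0 &&
                de->d_namlen == strlen(name) &&
                memcmp(de->d_name, name, de->d_namlen) == 0)
                return de->d_ino;
            off += de->d_reclen;             /* entries are variable length */
        }
        return 0;
    }

A real lookup would of course walk every block of the directory, not just one.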
Note that in Unix, a file name is not the name of a file; it is only a name by which the kernel can search for the file. The inode is really the "name" of the file. Each pointer from a directory to a file is called a hard link. In some systems, there is a distinction between a "branch" and a "link", where the link is a secondary access path and the branch is the primary one (it goes with ownership). You "erase" a file by removing a link to it. In reality, a count is kept of the number of links to a file, and the file is only really erased when the last link is removed. To really erase a file, we put the blocks of the file on the free list.

Symbolic Links: There are two ways to "link" to another directory or file. One is a direct pointer; in Unix, such links are limited to not cross "file systems" - i.e. they may not point to another disk. Alternatively, we can use symbolic links, in which instead of pointing to the file or directory, we keep a symbolic name for that file or directory. We need to be careful not to create cycles in the directory system - otherwise recursive operations on the file system will loop (e.g. cp -r). In Unix, this is solved by not permitting hard links to existing directories (except by the superuser).

Pros and Cons of the tree structured directory scheme:
Can organize files in a logical manner. Easy to find the file you're looking for, even if you don't exactly remember its name.
The "name" of the file is in fact a concatenation of the path from the root. Thus the name is actually quite long - it provides semantic info. Can have duplicate names, if the path to the file is different.
Can (assuming the protection scheme permits) give away access to a subdirectory and the files under it, without giving access to all files. (Note: Unix does not permit multiple hard links to a directory, unless done by the superuser.)
Access to a file requires reading only the relevant directories, not the entire list of files. (My list of files prints out to a 1/2" listing - 10000 files.)
The structure is more complex to move around and maintain. A file access may require that many directories be read, not just one.

It is very nice that directories and file descriptors are separate, and that directories are implemented just like files. This simplifies the implementation and management of the structure (we can write ``normal'' programs to manipulate them as files). I.e. the file descriptors are things the user shouldn't have to touch, while directories can be treated as normal files.

Working directory: it is cumbersome to constantly have to specify the full path name for all files. In Unix, there is one directory per process, called the working directory, which the system remembers. This is not the same as the home directory, which is where you are at log-in time, and which is in effect the root of your personal file system.

Every user has a search path, which is a list of directories in which to look to resolve a file name. The first element is almost always the working directory. ``/'' is an escape to allow full path names: i.e. most names are relative file names, while ones beginning with "/" are full (complete) path names.
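As a small illustration of relative versus full path names and the per-process working directory, here is a fragment using the standard chdir() and open() calls (the particular file, /usr/include/stdio.h, is just an example and assumed to exist):

    #include <fcntl.h>
    #include <unistd.h>

    int main(void)
    {
        /* Change this process's working directory. */
        chdir("/usr/include");

        /* Relative name: resolved starting from the working directory,
         * so this opens /usr/include/stdio.h. */
        int fd1 = open("stdio.h", O_RDONLY);

        /* Full (complete) path name: the leading '/' starts at the root. */
        int fd2 = open("/usr/include/stdio.h", O_RDONLY);

        if (fd1 >= 0) close(fd1);
        if (fd2 >= 0) close(fd2);
        return 0;
    }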
Note that in Unix, the search path is maintained by the shell. If any other program wants to do the same, it has to rebuild the facilities from scratch; this should really be in the OS. ("set path" in .cshrc or .login.) My path is: (. ~/bin /usr/new /usr/ucb /bin /usr/bin /usr/local /usr/hosts ~/com). Basically, we want to look in the working directory, then in the system library directories. We probably don't want a search strategy that actually searches more widely; if it did, it might find a file that wasn't really the target. This is yet another example of locality. There is a simple means to change the working directory - "cd". Can also refer to the directories of other users by prefacing their logins with "~".

Operations on Files:
Open - put a file descriptor into your table of open files. Those are the files that you can use. May require that locks be set and a user count be incremented. (If any locking is involved, may have to check for deadlock.)
Close - the inverse of open.
Create a file - sometimes done automatically by open.
Remove (rm) or erase - drop the link to the file. Put the blocks back on the free list if this is the last link.
Read - read a record from the file. (This usually means that there is an "access method" - i.e. I/O code - which deals with the user in terms of records, and with the device in terms of physical blocks.)
Write - like read, but may also require disk space allocation.
Rename ("mv" or "move") - rename the file. Unix combines two different operations here: rename would strictly involve changing the file name within the same directory, while "move" moves the file from one directory to another. Unix does both with one command. Note that mv also destroys the old file if there is one with the new name.
Seek - move to a given location in the file.
Synch - write blocks of the file from the disk cache back to disk.
Change properties (e.g. protection info, owner).
Link - add a link to a file.
Lock & Unlock - lock/unlock the file.
Partial Erase (truncate).
Note that commands such as "copy", "cat", etc., are built out of the simpler commands listed above.

Pseudo Files: We have commands such as "read" and "write" for files. We want to do similar things to devices (e.g. terminal, printer, etc.). There is no reason not to treat I/O devices as files, and we can do so; these are called "pseudo files".
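For instance, here is roughly how a "copy" command can be built from the simpler operations listed above (a sketch using the standard open/read/write/close calls; the buffer size is arbitrary). Because devices appear as pseudo files, the same routine can copy to a terminal such as /dev/tty as easily as to a regular file:

    #include <fcntl.h>
    #include <unistd.h>

    int copy_file(const char *src, const char *dst)
    {
        char buf[4096];            /* one logical block at a time */
        ssize_t n;

        int in = open(src, O_RDONLY);
        if (in < 0)
            return -1;
        int out = open(dst, O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (out < 0) {
            close(in);
            return -1;
        }

        /* Sequential access: read a block, write a block, to end of file. */
        while ((n = read(in, buf, sizeof buf)) > 0) {
            if (write(out, buf, (size_t)n) != n) {
                close(in);
                close(out);
                return -1;
            }
        }
        close(in);
        close(out);
        return (n < 0) ? -1 : 0;
    }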
File Backup and Recovery
The problem - we want to avoid losing files due to:
1. System Crashes (hardware or software)
a. Physical hard failure - usually a head crash.
b. Software failure.
c. General system failure while the file is open (usually a power failure). This is the most common problem and the one we are usually concerned with.
2. User Errors - want to be able to get files back after we have destroyed them (overwritten or erased). (Unix doesn't provide this.)
3. Sabotage and malicious users.

Approaches:
Periodic full dump - periodically dump all of the files (and directories) to backup storage, such as tape. The system can be reloaded from the dump tape. Sometimes called a checkpoint dump. Note: the system has to be shut down during dumping. Slow. Recovery is only back to the last dump - not up to date. Large amount of data - slow to dump, and a large number of tapes.
Incremental (periodic) dump - dump all modified files periodically - e.g. when the user logs out, or after the file is closed. Thus we can lose a file only while it is open. Disadvantages: large quantities of data, and a long and involved recovery process.
One recovery problem is that when a crash is due to software or hardware, some tables may be left in an inconsistent condition (e.g. the free list may be wrong, etc.). It is also necessary to fix all the tables. Very system dependent.

There are several approaches to the problem of a crash while modifying a file:
Work on a copy of the file and swap it for the original when you close. This is usually what an editor (e.g. vi) does. What if the file is open by more than one person at the same time? How do we make the "swap" atomic? (see below)
Write a log of all changes to the file, so we can back up if necessary (audit trail).
Write a list of changes to the file prior to modifying the file, so we can restart the list at the point at which a crash occurred (intentions list, or log-tape write-ahead protocol).
Keep multiple copies of the file: update one, and then copy the update to the second (careful replacement).
Make a new copy of any part of the file as it is modified, and replace the old parts with the new parts when we close the file (differential file). I.e. duplicate the file descriptor, update the new copy of the file descriptor as new copies of blocks are made, and swap the new file descriptor for the old when the file is closed.
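One common way to make the "swap" atomic at user level is to write the new contents under a temporary name and then rename() it over the original: rename() replaces the old directory entry atomically, so a crash leaves either the old version or the complete new one. A minimal sketch (the temporary-name convention is just an example):

    #include <fcntl.h>
    #include <stdio.h>
    #include <unistd.h>

    int replace_file(const char *path, const char *data, size_t len)
    {
        char tmp[4096];
        snprintf(tmp, sizeof tmp, "%s.new", path);  /* hypothetical temp name */

        int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
        if (fd < 0)
            return -1;
        /* Write the new copy and force it to disk before the swap. */
        if (write(fd, data, len) != (ssize_t)len || fsync(fd) != 0) {
            close(fd);
            unlink(tmp);
            return -1;
        }
        close(fd);

        /* The atomic "swap": the directory entry now points at the new copy. */
        return rename(tmp, path);
    }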