These notes combine Smith's lecture notes with notes taken in class. Smith's notes are denoted with a "+" and notes taken in class are denoted by "-".

+ File Placement
  + Seek distances will be minimized if commonly used files are located near the center of the disk.
  + Even better results if reference patterns are analyzed and files that are frequently referenced together are placed near each other.
  + Frequency of seeks, and queueing for disks, will be reduced if commonly used files (or files used at the same time) are located on different disks.
  + E.g. spread the paging data sets and operating system data sets over several disks.
+ Disk Caching
  + Keep a cache of recently used disk blocks in main memory.
    - Can do this in main memory and in the controller.
  + Recently read blocks are retained in the cache until replaced.
  + Writes go to the disk cache, and are later written back.
  + Typically would include index blocks for an open file.
    - Most cache is in DRAM, which is volatile.
    - Linux is not a data-processing system.
  + Also use the cache for read ahead and write behind.
    - Write behind is a write to the cache.
  + Can load entire disk tracks into the cache at once.
  + Typically works quite well - hit ratios of 70-90%.
  + Can also do caching in the disk controller - most controllers these days have 64K-4MB of cache/buffer in the controller. Mostly useful as a buffer, not a cache, since the main memory cache is so much larger.
+ Prefetching and Data Reorganization
  - If you read a block, it's likely you are going to read the next few blocks as well, so you read those and put them in memory. Since they're in memory, when you actually read them, it's a lot faster.
  + Since disk blocks are often read (and written) sequentially, it can be very helpful to prefetch ahead of the current read point.
  + It is also therefore useful to make sure that the physical layout of the data reflects the logical organization of the data - i.e. logically sequential blocks are also physically sequential.
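The caching, write-behind, and read-ahead ideas above can be sketched as a tiny write-back LRU block cache. This is a minimal illustration, not a real buffer cache; the Disk class, block numbering, and the prefetch count are hypothetical stand-ins:

```python
from collections import OrderedDict

class Disk:
    """Stand-in array-of-blocks 'disk' (hypothetical, for the demo)."""
    def __init__(self, nblocks):
        self.nblocks = nblocks
        self.blocks = [b""] * nblocks
    def read(self, bn):  return self.blocks[bn]
    def write(self, bn, data): self.blocks[bn] = data

class BlockCache:
    """Write-back LRU cache of disk blocks with simple read-ahead."""
    def __init__(self, disk, capacity=8, prefetch=2):
        self.disk = disk
        self.capacity = capacity
        self.prefetch = prefetch    # blocks to read ahead of the current point
        self.cache = OrderedDict()  # block number -> data, in LRU order
        self.dirty = set()
        self.hits = self.misses = 0

    def _install(self, bn, data):
        self.cache[bn] = data
        self.cache.move_to_end(bn)                          # most recently used
        while len(self.cache) > self.capacity:
            old, old_data = self.cache.popitem(last=False)  # evict LRU block
            if old in self.dirty:       # write behind: flush only on eviction
                self.disk.write(old, old_data)
                self.dirty.discard(old)

    def read(self, bn):
        if bn in self.cache:
            self.hits += 1
            self.cache.move_to_end(bn)
            return self.cache[bn]
        self.misses += 1
        data = self.disk.read(bn)
        self._install(bn, data)
        # read ahead: sequential access makes the next blocks likely next
        for nxt in range(bn + 1, bn + 1 + self.prefetch):
            if nxt not in self.cache and nxt < self.disk.nblocks:
                self._install(nxt, self.disk.read(nxt))
        return data

    def write(self, bn, data):
        self._install(bn, data)     # write goes to the cache...
        self.dirty.add(bn)          # ...and reaches disk later

d = Disk(64)
c = BlockCache(d)
c.read(0)            # miss, but blocks 1 and 2 are prefetched
hit = 1 in c.cache   # True: a sequential read of block 1 will now hit
```

A real cache would also force dirty blocks out periodically (synch), since write-behind in volatile DRAM loses data on a crash.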
  + Thus it is useful to periodically reorganize the data on the disk.
    - Don't want a sequential file randomly scattered over the disk! Duh!
+ Data Replication
  + Frequently used data can be replicated at multiple locations on the disk.
    - If it's in more places on the disk, then the average time to reach it decreases, since it's all over the place.
  + This means that on writes, extra copies must either be updated or invalidated.
    - If you don't do this, you get lots of different versions, which is bad!
  + ALIS - automatic locality improving storage
    - Basically a smart I/O system that can reorganize itself. Does this in the background during idle time.
  + Best results obtained when techniques are combined: reorganize to make sequential, cluster, and replicate.
+ RAID - Redundant Array of Inexpensive Disks
  + Observations:
    + Small disks are cheaper than large ones (due to economies of scale)
      - This is per-byte.
    + Failure rate is constant, independent of disk size.
    + Therefore, if we replace a few large disks with lots of small disks, the failure rate increases.
  + Solution:
    + Interleave the blocks of the file across a set of smaller disks, and add a parity disk.
      - Parity in main memory is not enough, but in RAID, you know which physical disk failed, so you know which bits the parity must reconstruct.
      - If in main memory, you won't know which bit went bad.
    + Note that since we presume (a) only one disk failure, and (b) we know which disk failed, we can reconstruct the failed disk.
    + Can do parity in two directions for extra reliability.
  + Advantage:
    + Improves read bandwidth.
  + Problem:
    + This means that we have to write the parity disk on every write. It becomes a bottleneck.
      - This kills our bandwidth, which is very bad.
  + A solution - interleave on a different basis than the number of disks. That means that the parity disk varies, and the bottleneck is spread around.
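The parity scheme above can be sketched in a few lines: XOR all the data blocks in a stripe to get the parity block, and, because we know which disk failed, XOR the survivors to rebuild it. The stripe layout and disk count here are hypothetical, and the rotating-parity function is just an illustration of spreading the bottleneck, RAID-5 style:

```python
from functools import reduce

NDISKS = 5  # 4 data blocks plus 1 parity block per stripe (RAID-5 style)

def parity(blocks):
    """XOR of equal-length blocks. With one known-failed disk, the XOR of
    all the survivors (including the parity block) reproduces the missing
    block, because each surviving term cancels itself out."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

def parity_disk(stripe):
    """Rotate the parity location per stripe so no single disk is the
    write bottleneck."""
    return stripe % NDISKS

# One stripe: four data blocks plus their parity.
data = [b"\x01\x02", b"\x03\x04", b"\x05\x06", b"\x07\x08"]
p = parity(data)

# Disk 2 fails; we know WHICH disk failed, so we can reconstruct it.
survivors = [blk for i, blk in enumerate(data) if i != 2] + [p]
rebuilt = parity(survivors)
assert rebuilt == data[2]
```

This also shows why the fixed parity disk is a bottleneck: every data write must recompute and rewrite `p`, so rotating `parity_disk(stripe)` across the array spreads those writes around.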
    - Put the parity on different blocks, so P1 can be on D2, P2 on D3, etc.
  + Types of RAID:
    + RAID 0 - ordinary disks
      - No redundancy. This interleaves across disks but has no parity blocks.
      - You increase bandwidth, but reliability goes down.
    + RAID 1 - replication
      - This is just a mirror. So you double writes, but your cost doubles. This has no performance gain.
    + RAID 4 - parity disk in fixed location
      - See diagrams.
    + RAID 5 - parity disk in varying location

           -----    -----    -----    -----    -----
          | 1 5 |  | 2 6 |  | 3 7 |  | 4 8 |  | xor |
          |     |  |     |  |     |  |     |  |     |
          |     |  |     |  |     |  |     |  |     |
           -----    -----    -----    -----    -----

      - The above RAID setup has a single parity disk and can recover from one failure.

          O O O O O
          O O O O O
          O O O O O
          O O O O O
          O O O O O

        The last row and last column are parities. This can recover from two failures.

===========================
Topic: Directories and Other File System Topics

+ Naming:
  + How do users refer to their files?
  + How does the OS refer to the file itself?
  + How does the OS find the file, given the name?
+ A file descriptor is a data structure or record that describes the file.
  + The file descriptor information has to be stored on disk, so it will stay around even when the OS doesn't. (Note that we are assuming that disk contents are permanent.)
  + In Unix, all the descriptors are stored in a fixed-size array on disk. The descriptors also contain protection and accounting information.
  + A special area of disk is used for this (disk contains two parts: the fixed-size descriptor array, and the remainder, which is allocated for data and indirect blocks).
  + The size of the descriptor array is determined when the disk is initialized, and can't be changed. In Unix, the descriptor is called an inode (index node), and its index in the array is called its i-number. Internally, the OS uses the i-number to refer to the file.
    - The inode is the name of the file in the system. This is unique!
  + IBM calls the equivalent structure the volume table of contents (VTOC).
+ The inode is the focus of all file activity in UNIX. There is a unique inode allocated for each file, including directories. An inode is 'named' by its dev/inumber pair. (iget/iget.c)
+ Inode fields:
  + reference count (number of times open)
  + number of links to file
    - Number of directories pointing to the file; if there are zero, then get rid of the file.
  + owner's user id, owner's group id
    - Need it for protection and accounting (disk quota, and systems where you are charged for disk space).
    - Disk space and CPU time are getting really cheap, so we don't really know how to charge (haha) :P
  + number of bytes in file
  + time last accessed, time last modified, time inode last changed
    - When was this last referenced; time last modified is more useful so you know which version it is.
  + disk block addresses, indirect blocks (discussed previously)
  + flags: (inode is locked, file has been modified, some process waiting on lock)
    - These are one-bit flags.
  + file mode: (type of file: character special, directory, block special, regular, symbolic link, socket)
    - A symbolic link is another file.
    + A socket is an endpoint of a communication, referred to by a descriptor, just like a file or a pipe. Two processes can each create a socket and then connect those two endpoints to produce a reliable byte stream. (A pipe requires a common parent process; a socket does not, and the processes may be on different machines.)
  + protection info: (set user id on execution, set group id on execution, read, write, execute permissions, sticky bit)
    - In an inode, the sticky bit is the bit in an inode representing a directory that indicates whether other users can modify files in this directory.
    - So basically used for sharing a directory and sharing rights.
  + count of shared locks on inode
    - Lock used for reading.
    - How many people have it open for reading - makes sense.
  + count of exclusive locks on inode
    - Lock used for writing.
    - Why would you need this? More than one person writing? Not good!
    - This is more of a "suggestion", so we should still keep track of how many people have it open.
    - Sort of like a "warning".
  + unique identifier
  + file system associated with this inode
  + quota structure controlling this file
    - You may have a limited amount of disk space you can use; it basically tells you your quota, oh yay.
+ When a file is open, its descriptor is kept in main memory. When the file is closed, the descriptor is stored back to disk.
  - We keep it in memory because we don't always want to be going back to disk to get it.
+ There is usually a per-process table of open files.
  + In Unix, there is a process open file table, with one entry for each file open. The integer index into that table is the handle for that file open. Multiple opens of the file will get multiple entries. (Note that if a process forks, a given entry can be shared by several processes.)
  + (standard-in is #0, standard-out is #1, stderr is #2; these must be per process.)
+ Unix also has a system open file table, which points to the inode for the file (in the inode table). This table is system wide. Maps names to files.
  - We have this because it makes it faster to access an already open file.
  - So if the file is already open, we don't have to go look for the name in the directory.
+ There is also the inode table, which is a system-wide table holding active and recently used inodes.
  - When you close a file, you don't necessarily delete it from the table.
  - We need this because if something gets changed, you want all the processes to be able to see it.
+ The descriptor is kept in OS space, which is paged. So it may be necessary to take a page fault to get to the descriptor info.
  - All of these tables are stored in paged memory, so we can have a page fault.
+ Users need a way of referencing files that they leave around on disk. One approach is just to have users remember descriptor indexes. I.e. the user would have to remember something like the number of the descriptor, or some such.
  Unfortunately, this is not very user friendly.
+ Of course, users want to use text names to refer to files. Special disk structures called directories are used to tell what descriptor indices correspond to what names.
  - This basically maps names to files.
+ Approach #1: have a single directory for the whole disk. Use a special area of disk to hold the directory.
  + The directory contains <name, descriptor index> pairs.
  + Problems:
    + If one user uses a name, no one else can.
    + If you can't remember the name of a file, you may have to look through a very long list.
    + Security problem - people can see your file names (which can be dangerous).
  + Old personal computers (pre-Windows) worked this way.
+ Approach #2: have a separate directory for each user (TOPS-10 approach). This is still clumsy: names from a user's different projects get confused. Still can't remember names of files.
  - This is still a flat directory for each individual user!
  - File naming was a pain in the ass.
  + IBM's VM is similar to this. Files have a 3-part name: <name, type, location>, where location is A, B, C, etc. (i.e. which disk). Very painful. (Also, file names limited to 8 characters.)
+ Approach #3 - Unix approach: generalize the directory structure to a tree.
  + Directories are stored on disk just like regular files (i.e. file descriptor with 13 pointers, etc.).
  + User programs can manipulate directories almost like any other file. Only special system programs may write directories.
  + Each directory contains <name, descriptor index> pairs. The file pointed to by the index may be another directory. Hence, we get a hierarchical tree structure. Names have slashes separating the levels of the tree.
  + There is one special directory, called the root. This directory has no name, and is the file pointed to by descriptor 2 (descriptors 0 and 1 have other special purposes).
    + Note that we need the root. Otherwise, we would have no way to reach any files. From the root, we can get anywhere in the file system.
  + The full file name is the path name, i.e. the full name from the root.
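The tree scheme above resolves a path name by walking <name, i-number> pairs one directory at a time, starting from the root (or from the working directory for relative names). Here is a minimal sketch; the in-memory inode table, its layout, and the i-numbers are hypothetical, chosen only so the root is i-number 2 as in the notes:

```python
# Each "inode" is either a file or a directory mapping names -> i-numbers.
# Toy inode table (hypothetical layout, for illustration only).
inodes = {
    2: {"type": "dir",  "entries": {"usr": 5, "etc": 6}},   # root is i-number 2
    5: {"type": "dir",  "entries": {"bin": 7}},
    6: {"type": "dir",  "entries": {"passwd": 8}},
    7: {"type": "dir",  "entries": {"ls": 9}},
    8: {"type": "file", "data": b"root:0:0"},
    9: {"type": "file", "data": b"\x7fELF..."},
}
ROOT = 2

def namei(path, cwd=ROOT):
    """Resolve a path name to an i-number, one directory at a time.
    A leading '/' escapes to the root; otherwise start at the working dir."""
    ino = ROOT if path.startswith("/") else cwd
    for comp in path.strip("/").split("/"):
        if not comp:
            continue
        node = inodes[ino]
        if node["type"] != "dir":
            raise NotADirectoryError(comp)
        ino = node["entries"][comp]   # may raise KeyError: no such name here
    return ino

assert namei("/usr/bin/ls") == 9
assert namei("passwd", cwd=6) == 8    # relative name, working directory 6
```

Note how a full path reads only the directories along the path, never the whole list of files, which is one of the pros of the tree scheme discussed below.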
+ A directory consists of some number of blocks of DIRBLKSIZ bytes, where DIRBLKSIZ is chosen such that it can be transferred to disk in a single atomic operation (e.g. 512 bytes on most machines).
  + Each directory block contains some number of directory entry structures, which are of variable length. Each directory entry has info at the front of it, containing its inode number, the length of the entry, and the length of the name contained in the entry. These are followed by the name, padded to a 4-byte boundary with null bytes. All names are guaranteed null terminated.
+ Note that in Unix, a file name is not the name of a file. It is only a name by which the kernel can search for the file. The inode is really the "name" of the file.
+ Each pointer from a directory to a file is called a hard link.
  + In some systems, there is a distinction between a "branch" and a "link", where the link is a secondary access path, and the branch is the primary one (goes with ownership).
  + You "erase" a file by removing a link to it. In reality, a count is kept of the number of links to a file. It is only really erased when the last link is removed.
  + To really erase a file, we put the blocks of the file on the free list.
    - You basically garbage collect. If there are no links, then it is put on the free list.
    - If you created a file, and someone else has it open, then you can't delete it. However, you can overwrite it.
    - Unix says that you can't have multiple hard links to a directory, so that you won't have a loop. Recursive loops are bad bad bad.
    - A hard link is an actual i-number.
+ Symbolic Links
  + There are two ways to "link" to another directory or file. One is a direct pointer. In Unix, such links are limited to not cross "file systems" - i.e. not to another disk.
    - Recursive commands are okay in this case.
  + We can use symbolic links, by which instead of pointing to the file or directory, we have a symbolic name for that file or directory.
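The link-count bookkeeping above can be sketched directly: each hard link bumps the inode's link count, and the file's blocks go onto the free list only when the last link is removed. This is a toy model with a single flat directory; the class names and block numbers are hypothetical:

```python
free_list = []  # block numbers available for reuse

class Inode:
    """Minimal inode tracking the link count and the file's blocks."""
    def __init__(self, blocks):
        self.nlink = 0
        self.blocks = list(blocks)

directory = {}  # name -> inode (one flat directory, for illustration)

def link(name, inode):
    """Add a hard link: another <name, inode> pair pointing at the file."""
    directory[name] = inode
    inode.nlink += 1

def unlink(name):
    """'Erase' a file by dropping one link; the blocks are garbage
    collected onto the free list only when the last link goes away."""
    inode = directory.pop(name)
    inode.nlink -= 1
    if inode.nlink == 0:
        free_list.extend(inode.blocks)
        inode.blocks.clear()

ino = Inode(blocks=[10, 11, 12])
link("a", ino)
link("b", ino)          # a second hard link to the same inode
unlink("a")
assert ino.nlink == 1 and free_list == []   # still reachable via "b"
unlink("b")
assert free_list == [10, 11, 12]            # now really erased
```

Real Unix also defers the free-list step while some process still has the file open, which is the "someone else has it open" case noted above.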
+ We need to be careful not to create cycles in the directory system - otherwise recursive operations on the file system will loop (e.g. cp -r). In Unix, this is solved by not permitting hard links to existing directories (except by the superuser).
+ Pros and cons of the tree-structured directory scheme:
  + Can organize files in a logical manner. Easy to find the file you're looking for, even if you don't exactly remember its name.
  + The "name" of the file is in fact a concatenation of the path from the root. Thus the name is actually quite long - provides semantic info.
  + Can have duplicate names, if the path to the file is different.
  + Can (assuming the protection scheme permits) give away access to a subdirectory and the files under it, without giving access to all files. (Note: Unix does not permit multiple hard links to a directory, unless done by the superuser.)
  + Access to a file requires only reading the relevant directories, not the entire list of files. (My list of files prints out to a 1/2" printout - 20000 files.)
  + The structure is more complex to move around and maintain.
  + A file access may require that many directories be read, not just one.
  + It is very nice that directories and file descriptors are separate, and that directories are implemented just like files. This simplifies the implementation and management of the structure (can write "normal" programs to manipulate them as files).
  + I.e. the file descriptors are things the user shouldn't have to touch. Directories can be treated as normal files.
  - A tree directory is basically like a tree that we learned about in CS 61B. (You can look up tree structures on the wiki.) We can have a tree structure that has loops, so a child can reference a parent. You can also have two separate nodes reference the same file!
+ Working directory: it is cumbersome to constantly have to specify the full path name for all files.
  + In Unix, there is one directory per process, called the working directory, which the system remembers.
  + This is not the same as the home directory, which is where you are at log-in time, and which is in effect the root of your personal file system.
  + Every user has a search path, which is a list of directories in which to look to resolve a file name. The first element is almost always the working directory.
    - When you type the name of something, you don't want to type the full path for every file that you have.
    - There is some way of specifying the search path, which is where the system will look for the file that you want.
  + "/" is an escape to allow full path names. I.e. most names are relative file names. Ones starting with "/" are full (complete) path names.
  + Note that in Unix, the search path is maintained by the shell. If any other program wants to do the same, it has to rebuild the facilities from scratch. It should be in the OS. ("set path" in .cshrc or .login.)
    + My path is: (. ~/bin /usr/new /usr/ucb /bin /usr/bin /usr/local /usr/hosts ~/com)
  + Basically, you want to look in the working directory, then the system library directories.
  + We probably don't want a search strategy that actually searches more widely. If it did, it might find a file that wasn't really the target.
    + This is yet another example of locality.
  + Simple means to change the working directory - "cd". Can also refer to directories of other users by prefacing their logins with "~".
    - Search paths are important because you don't want to find the wrong file.
+ Operations on Files
  + Open - put a file descriptor into your table of open files. Those are the files that you can use. May require that locks be set, and a user count be incremented. (If any locking is involved, may have to check for deadlock.)
  + Close - inverse of open.
  + Create a file - sometimes done automatically by open.
  + Remove (rm) or erase - drop the link to the file. Put the blocks back on the free list if this is the last link.
  + Read - read a record from the file. (This usually means that there is an "access method" - i.e.
  I/O code - which deals with the user in terms of records, and the device in terms of physical blocks.)
  + Write - like read, but may also require disk space allocation.
  + Rename ("mv" or "move") - rename the file. Unix combines two different operations here. Rename would strictly involve changing the file name within the same directory. "Move" moves the file from one directory to another. Unix does both with one command.
    + Note that mv also destroys the old file if there is one with the new name. (Which could be considered a bug, not a feature.)
  + Seek - move to a given location in the file.
  + Synch - write blocks of the file from the disk cache back to disk.
  + Change properties (e.g. protection info, owner).
  + Link - add a link to a file.
  + Lock & Unlock - lock/unlock the file.
  + Partial erase (truncate).
+ Note that commands such as "copy", "cat", etc., are built out of the simpler commands listed above.
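The open/read/close operations above tie back to the three tables discussed earlier: a per-process open file table of small integers, a system-wide layer holding the seek offset per open, and the inode table with its reference count. A minimal sketch, with a hypothetical in-memory inode table and made-up i-number:

```python
# System-wide inode table (hypothetical contents, for illustration).
inode_table = {7: {"data": b"hello world", "refcount": 0}}

class OpenFile:
    """One entry per open: the seek offset lives here, not in the inode."""
    def __init__(self, inum):
        self.inum = inum
        self.offset = 0

class Process:
    def __init__(self):
        self.fds = {}            # fd -> OpenFile (per-process table)
        self.next_fd = 3         # 0, 1, 2 are stdin/stdout/stderr

    def open(self, inum):
        inode_table[inum]["refcount"] += 1   # user count incremented
        fd = self.next_fd
        self.next_fd += 1
        self.fds[fd] = OpenFile(inum)
        return fd                # the small-integer handle the program sees

    def read(self, fd, n):
        of = self.fds[fd]
        data = inode_table[of.inum]["data"][of.offset:of.offset + n]
        of.offset += len(data)   # seek position advances with each read
        return data

    def close(self, fd):
        of = self.fds.pop(fd)    # inverse of open
        inode_table[of.inum]["refcount"] -= 1

p = Process()
fd = p.open(7)
first = p.read(fd, 5)    # b"hello"
rest = p.read(fd, 6)     # b" world" - the offset advanced by the first read
p.close(fd)
```

In real Unix, fork shares open-file entries (and thus the offset) between parent and child, which is why the offset belongs to the open, not to the process or the inode.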