Advanced UNIX File Systems: Berkeley Fast File System, Logging File Systems, and RAID

Classical UNIX File System
- The traditional UNIX file system keeps I-node information separately from the data blocks; accessing a file involves a long seek from the I-node to the data.
1) Data blocks allocated randomly in aging file systems
- Blocks for the same file allocated sequentially when FS is new
- As FS “ages” and fills, need to allocate into blocks freed up when other files are deleted
- Deleted files are essentially randomly placed, so blocks for new files become scattered across the disk
2) Inodes allocated far from blocks
- All inodes at beginning of disk, far from data
- Traversing file name paths and manipulating files and directories requires going back and forth between inodes and data blocks
- FFS fix: disk partitioned into groups of cylinders ("cylinder groups")
- Data blocks of the same file allocated in the same cylinder group
- Files in the same directory allocated in the same cylinder group
- Inodes for files allocated in the same cylinder group as the file's data blocks (a sketch of this placement policy follows the list below)
- To be able to allocate by cylinder group, the disk must keep free space scattered across cylinders
- 10% of the disk is reserved just for this purpose
- The reserve is usable only by root, which is why "df" can report >100% utilization
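To make the placement policy concrete, here is a minimal sketch of choosing a cylinder group for a new inode and its data blocks. The structures, group count, and fallback rule are assumptions for illustration, not the actual BSD allocator.

```c
/* Sketch of an FFS-style cylinder-group placement policy.  The structures,
 * group count, and fallback rule are hypothetical, not the BSD kernel's. */
#define NCG 32                       /* number of cylinder groups (assumed) */

struct cg {
    unsigned free_inodes;            /* free inodes in this group */
    unsigned free_blocks;            /* free data blocks in this group */
};

static struct cg groups[NCG];

/* Put a new file's inode in the same group as its parent directory,
 * falling back to the emptiest group when that one is full. */
int pick_group_for_inode(int parent_dir_group)
{
    if (groups[parent_dir_group].free_inodes > 0)
        return parent_dir_group;

    int best = 0;
    for (int g = 1; g < NCG; g++)
        if (groups[g].free_inodes > groups[best].free_inodes)
            best = g;
    return best;
}

/* Put a file's data blocks in the same group as its inode while that
 * group still has free blocks, keeping inode-to-data seeks short. */
int pick_group_for_block(int inode_group)
{
    if (groups[inode_group].free_blocks > 0)
        return inode_group;

    int best = 0;
    for (int g = 1; g < NCG; g++)
        if (groups[g].free_blocks > groups[best].free_blocks)
            best = g;
    return best;
}
```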
- Divide block into fragments;
- Each fragment is individually addressable;
- Fragment size is specified at file system creation time;
- The lower bound of the fragment size is the disk sector size;
- Small blocks give low bandwidth utilization
- Small max file size (a function of block size)
- With larger blocks, even very large files need only two levels of indirection to reach 2^32 bytes
- Problem: internal fragmentation
- Fix: introduce "fragments" (1K pieces of a block); a worked example follows this list
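A small worked example of why fragments help: with whole-block allocation the tail of a small file wastes most of a block, while fragment allocation wastes at most one fragment minus a byte. The 4 KB block and 1 KB fragment sizes are assumptions for illustration.

```c
/* Worked example (assumed sizes): how fragments reduce internal
 * fragmentation for the tail of a small file. */
#include <stdio.h>

#define BLOCK_SIZE 4096   /* assumed FFS block size */
#define FRAG_SIZE  1024   /* assumed fragment size (one quarter block) */

/* Space allocated for a file if only whole blocks can be used. */
static long alloc_blocks_only(long file_size)
{
    long blocks = (file_size + BLOCK_SIZE - 1) / BLOCK_SIZE;
    return blocks * BLOCK_SIZE;
}

/* Space allocated if the last partial block may be stored in fragments. */
static long alloc_with_frags(long file_size)
{
    long full_blocks = file_size / BLOCK_SIZE;
    long tail = file_size % BLOCK_SIZE;
    long frags = (tail + FRAG_SIZE - 1) / FRAG_SIZE;
    return full_blocks * BLOCK_SIZE + frags * FRAG_SIZE;
}

int main(void)
{
    long size = 1100;   /* a small file */
    printf("blocks only : %ld bytes allocated (%ld wasted)\n",
           alloc_blocks_only(size), alloc_blocks_only(size) - size);
    printf("with frags  : %ld bytes allocated (%ld wasted)\n",
           alloc_with_frags(size), alloc_with_frags(size) - size);
    return 0;
}
```

For the 1100-byte file above, whole-block allocation uses 4096 bytes (2996 wasted), while fragment allocation uses 2048 bytes (948 wasted).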
- Replicate the master block (superblock) across the disk
- Parameterize the FS according to device characteristics
– Creating a file requires several synchronous metadata operations: allocate an I-node, write the allocated I-node to disk, write the directory entry, and update the directory I-node on disk.
– But what should be the order of these synchronous operations to minimize the damage that may occur due to failures?
1) allocate I-node (if needed) and write it on disk;
2) update directory on disk;
3) update I-node of the directory on disk.
– These writes should be synchronous, which requires seeks to the I-nodes;
– In the Berkeley FFS this is not a great problem as long as files are relatively small, because the directory, the file's data blocks, and the I-nodes should all be in the same cylinder group.
– It is a real problem in practice, because the access pattern is random: different applications use different files at the same time, and the dirty blocks are not guaranteed to be in the same cylinder group. (A sketch of the crash-safe ordering appears right after this list.)
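A minimal sketch of the ordering above. The synchronous-write helpers are hypothetical stand-ins for "write this block and wait for the disk", not a real kernel interface.

```c
/* Sketch of the crash-safe ordering for file creation.  The helpers are
 * hypothetical stand-ins for synchronous disk writes, not a real kernel API. */
struct inode;   /* on-disk inode image */
struct dir;     /* in-memory handle for a directory */

extern void write_inode_sync(struct inode *ip);        /* write + wait */
extern void add_dirent_sync(struct dir *dp, const char *name, int inum);
extern void write_dir_inode_sync(struct dir *dp);

void create_file(struct dir *dp, const char *name,
                 struct inode *new_ip, int new_inum)
{
    /* 1) Write the newly allocated inode first.  If we crash after this,
     *    the worst case is a lost (unreferenced) inode. */
    write_inode_sync(new_ip);

    /* 2) Then write the directory entry that names it.  A crash between
     *    steps never yields an entry pointing at an uninitialized inode. */
    add_dirent_sync(dp, name, new_inum);

    /* 3) Finally update the directory's own inode (size, mtime). */
    write_dir_inode_sync(dp);
}
```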
- Large buffer caches
- Absorb a large fraction of read requests
- Can be used for writes as well
- Coalesce small writes into large writes (see the sketch after this list)
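A minimal sketch of the "coalesce small writes" idea: scan the cache for runs of consecutive dirty blocks and issue one large write per run. The buffer structure and disk_write helper are assumptions for illustration.

```c
/* Minimal sketch (hypothetical structures) of coalescing small dirty
 * blocks in a write-back buffer cache into one large sequential write. */
#include <string.h>

#define BLOCK_SIZE  4096
#define MAX_BATCH   64

struct buf {
    long blkno;                 /* disk block number */
    char data[BLOCK_SIZE];
    int  dirty;
};

extern void disk_write(long blkno, const void *data, size_t len);

/* Flush a run of consecutive dirty blocks with one large write
 * instead of many small ones. */
void flush_coalesced(struct buf *bufs, int n)
{
    static char big[MAX_BATCH * BLOCK_SIZE];

    for (int i = 0; i < n; ) {
        if (!bufs[i].dirty) { i++; continue; }

        /* Gather consecutive dirty blocks starting at i. */
        int run = 0;
        while (i + run < n && run < MAX_BATCH &&
               bufs[i + run].dirty &&
               bufs[i + run].blkno == bufs[i].blkno + run) {
            memcpy(big + run * BLOCK_SIZE, bufs[i + run].data, BLOCK_SIZE);
            bufs[i + run].dirty = 0;
            run++;
        }
        disk_write(bufs[i].blkno, big, (size_t)run * BLOCK_SIZE);
        i += run;
    }
}
```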
Log-Structured File System (LFS): Rosenblum and Ousterhout (Berkeley, '91)
- Placement is improved, but there are still many small seeks: possibly related files end up physically separated
- I-nodes are separated from files (small seeks)
- Directory entries are separate from inodes
- With small files, most writes are to metadata, and metadata writes are synchronous
- Synchronous writes are very slow
- LFS idea: write everything sequentially to an append-only log: data blocks, attributes, inodes, directories, etc.
1) Locating data written to the log.
- FFS places files at known, fixed locations; LFS writes data "at the end" of the log
2) Managing free space on the disk
- Disk is finite, so log is finite, cannot always append
- Need to recover deleted blocks in old parts of log
- In FFS, inodes are pre-allocated in each cylinder group; directories contain the locations of inodes
- In LFS, an inode is written to the log and moves each time it is rewritten, which makes inodes hard to find
- Use another level of indirection: Inode maps
- The inode map maps file numbers to inode locations (see the lookup sketch after this list)
- Location of inode map blocks kept in checkpoint region
- Checkpoint region has a fixed location
- Cache inode maps in memory for performance
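A sketch of the resulting lookup path (checkpoint region to inode map to inode). The layout constants and helper functions are assumptions for illustration, not the actual Sprite LFS on-disk format.

```c
/* Sketch of an LFS-style inode lookup through the inode map.  The on-disk
 * layout and helpers here are assumptions, not the real Sprite LFS format. */
#include <stdint.h>

struct checkpoint {
    uint64_t imap_block[16];     /* disk addresses of inode-map blocks */
};

#define IMAP_ENTRIES_PER_BLOCK 512

extern struct checkpoint *read_checkpoint(void);       /* fixed location */
extern uint64_t *read_imap_block(uint64_t disk_addr);  /* -> map entries */
extern void *read_inode(uint64_t disk_addr);

/* file number -> current on-disk location of its inode -> inode */
void *lookup_inode(uint64_t fileno)
{
    struct checkpoint *cp = read_checkpoint();

    uint64_t map_blk = fileno / IMAP_ENTRIES_PER_BLOCK;
    uint64_t map_off = fileno % IMAP_ENTRIES_PER_BLOCK;

    uint64_t *imap = read_imap_block(cp->imap_block[map_blk]);
    uint64_t inode_addr = imap[map_off];  /* changes each time the inode is rewritten */

    return read_inode(inode_addr);
}
```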
- Need to recover deleted blocks
- Fragment log into segments
- Thread segments on disk
- Segments can be anywhere
- Reclaim space by cleaning segments
- Read segment
- Copy live data to end of log
- Now have free segment you can reuse
- Cleaning is a costly overhead (a cleaner sketch follows)
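A minimal sketch of cleaning one segment, assuming hypothetical segment structures and helpers; liveness would be determined by consulting the inode map.

```c
/* Sketch of an LFS segment cleaner (hypothetical structures): read a
 * segment, copy the still-live blocks to the end of the log, and mark the
 * segment free for reuse. */
#include <stdbool.h>
#include <stddef.h>

struct block { long addr; /* plus data */ };
struct segment {
    struct block *blocks;
    size_t nblocks;
};

extern bool  block_is_live(const struct block *b);  /* checked via inode map */
extern void  append_to_log(const struct block *b);  /* also updates inode map */
extern void  mark_segment_free(struct segment *s);

void clean_segment(struct segment *s)
{
    for (size_t i = 0; i < s->nblocks; i++) {
        /* Only blocks still referenced by some inode are copied forward;
         * dead blocks (overwritten or deleted) are simply dropped. */
        if (block_is_live(&s->blocks[i]))
            append_to_log(&s->blocks[i]);
    }
    /* The whole segment is now reusable as fresh log space. */
    mark_segment_free(s);
}
```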
RAID: Patterson, Katz, and Gibson (Berkeley, '88)
- A storage system, not a file system
- Idea: Use many disks in parallel to increase storage bandwidth, improve reliability
- Files are striped across disks
- Each stripe portion is read/written in parallel
- Bandwidth increases with more disks (see the mapping sketch below)
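A minimal sketch of block-level striping, assuming a four-disk array: logical block i lives on disk i mod N, so consecutive blocks are spread across all disks.

```c
/* Minimal sketch of block-level striping: logical block i of a file maps
 * to disk (i mod N) at offset (i div N).  The disk count is an assumption. */
#define NDISKS 4

struct location {
    int  disk;       /* which disk holds the block */
    long offset;     /* block offset within that disk */
};

struct location stripe_map(long logical_block)
{
    struct location loc;
    loc.disk   = (int)(logical_block % NDISKS);
    loc.offset = logical_block / NDISKS;
    return loc;
}
/* Consecutive logical blocks land on different disks, so a large
 * sequential read or write is serviced by all NDISKS disks in parallel. */
```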
- Small files are a problem (small writes touch less than a full stripe)
- Need to read the entire stripe, apply the small write, then write the entire stripe back out to the disks (a cheaper read-modify-write variant is sketched below)
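The bullet above describes rewriting the whole stripe. A commonly used alternative (not stated in these notes, but standard practice) is a read-modify-write that recomputes parity from the old data and old parity; the sketch below shows why even this shortcut costs four disk I/Os per small write. The block I/O helpers are hypothetical.

```c
/* Sketch of the RAID "small write" read-modify-write: updating one block
 * in a stripe costs two reads and two writes (data + parity). */
#include <stddef.h>
#include <stdint.h>

#define BLOCK_SIZE 4096

extern void read_block (int disk, long blkno, uint8_t *buf);
extern void write_block(int disk, long blkno, const uint8_t *buf);

void small_write(int data_disk, int parity_disk, long blkno,
                 const uint8_t new_data[BLOCK_SIZE])
{
    uint8_t old_data[BLOCK_SIZE], parity[BLOCK_SIZE];

    read_block(data_disk,   blkno, old_data);   /* 1st I/O */
    read_block(parity_disk, blkno, parity);     /* 2nd I/O */

    /* new_parity = old_parity XOR old_data XOR new_data */
    for (size_t i = 0; i < BLOCK_SIZE; i++)
        parity[i] ^= old_data[i] ^ new_data[i];

    write_block(data_disk,   blkno, new_data);  /* 3rd I/O */
    write_block(parity_disk, blkno, parity);    /* 4th I/O */
}
```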
- Reliability: more disks increase the chance of a media failure (the MTBF of the array drops)
- Turn reliability problem into a feature
- Use one disk to store parity data: XOR of all data blocks in stripe
- Any data block can be recovered from all the others plus the parity block, hence "redundant" (see the recovery sketch below)
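A sketch of that recovery: because the parity block is the XOR of all data blocks in the stripe, the missing block is simply the XOR of every surviving block (data plus parity).

```c
/* Sketch of recovering a lost block from the surviving blocks plus parity:
 * since parity = XOR of all data blocks in the stripe, the missing block is
 * the XOR of everything that survived. */
#include <stddef.h>
#include <stdint.h>

#define BLOCK_SIZE 4096

void recover_block(uint8_t *const survivors[], size_t nsurvivors,
                   uint8_t lost[BLOCK_SIZE])
{
    for (size_t i = 0; i < BLOCK_SIZE; i++) {
        uint8_t x = 0;
        for (size_t d = 0; d < nsurvivors; d++)
            x ^= survivors[d][i];       /* XOR data blocks and parity */
        lost[i] = x;                    /* reconstructed byte of the failed disk */
    }
}
```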
- RAID 0 (non-redundant striping): good for random access, but provides no reliability
- RAID 1 (mirroring): two disks, write data to both (expensive: 1X storage overhead)
- RAID 5: parity blocks for different stripes written to different disks
- No single parity disk, hence no bottleneck at that disk
- Higher bandwidth, but still a large overhead
RAID2 stripes data at the bit level across disks and uses a Hamming code for parity. However, the performance of bit striping is abysmal, and RAID2 is not used in practice.
RAID3 stripes data at the byte level and dedicates an entire disk to parity. Like RAID2, RAID3 is not used in practice for performance reasons: since almost any read requires more than one byte of data, reads involve operations on every disk in the set, and such disk access will easily thrash a system. Additionally, loss of the parity disk leaves the system vulnerable to corrupted data.
RAID4 stripes data at the block level and dedicates an entire disk to parity. RAID4 is similar to both RAID2 and RAID3 but significantly improves performance, as any read request contained within a single block can be serviced from a single disk. RAID4 is used only on a limited basis because of the storage penalty and data-corruption vulnerability of dedicating an entire disk to parity.
RAID5 implements block-level striping like RAID4, but stripes the parity information across all disks as well. In this way, the total storage capacity is maximized and the parity load is distributed across all disks. RAID5 also supports hot spares: disks that are members of the RAID but not in active use, which are activated and added to the array when a failed disk is detected. RAID5 is the most commonly used level because it provides the best combination of benefits at an acceptable cost.
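A sketch of one simple parity-rotation layout for RAID5. Real arrays use several variants (left/right, symmetric/asymmetric), and the five-disk count here is only an assumption for illustration.

```c
/* Sketch of one simple RAID5 parity rotation on N disks: the parity block
 * of stripe s rotates across the disks, so no single disk holds all parity. */
#define NDISKS 5

/* Disk that holds the parity block of stripe s. */
int parity_disk(long s)
{
    return (int)((NDISKS - 1) - (s % NDISKS));
}

/* Disk that holds the k-th data block of stripe s (k in 0..NDISKS-2),
 * skipping the parity disk for that stripe. */
int data_disk(long s, int k)
{
    int p = parity_disk(s);
    return (k < p) ? k : k + 1;
}
```

Because parity_disk changes from stripe to stripe, parity reads and writes are spread over all members of the array, which removes the single-parity-disk bottleneck of RAID4.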