A Fast File System for Unix

A Fast File System for Unix Marshall K. McKusick , William N. Joy, Samuel J. Leffler , Robert S. Fabry Computer Science Research Group, University of California , Berkeley. 1984 Presented By:Aravind Subhash

Basic Unix File Abstraction and System Calls • Unix abstraction • Everything to appear as if it was sequential stream of bytes. • Set of functions that provides this abstraction :- • Open : When a process calls open , try to locate the file the process identifies and return a file descriptor. • Close : release file descriptor from use and make file inaccessible until another open call is made. • Read:allow to read n bytes out of a file and into some space in a process’s memory.Read call starts where the previous read left off, Seek Pointer • Write : allow to write n bytes from a process’s memory in to a file.If write call goes past end of file , file should be made larger to accommodate new data. • Truncate : Need to be able to remove data from end of the file stream. • Seek : Allow to move the seek pointer in a file , allowing to choose which part of the byte stream we read and write to.

Old File System • Developed at Bell Labs . • Each disk drive is divided into one or more partitions. • Each disk partition can contain one file system. • File system never spans multiple partitions

File System Layout on Disk • Picture from Modern Operating Systems -Andrew S. Tanenbaum

Old File System • File System is described by its • Super Block • Number of data blocks in the file system • A Count of the maximum number of files • A Pointer to the free list • A Linked list of all free blocks in the file system • Within the file system • Files • Descriptor associated with every file called - Inode

Old File System • Inode • Information describing file ownership • Time stamps marking last modification / access times • An Array of indices that point to the data blocks for the file.

Inode Data Structure • Given the I-Node it is possible to find all the blocks of the file. • In order to accomodate file growth , last disk address of inode is used to store address of a block containing more disk block addresses. • Advantage of Inode Scheme over linked files using an in-memory table is that the i-node need only be in memory when the corresponding file is open. Picture from Modern Operating Systems -Andrew S. Tanenbaum

Old File System • An Inode may also contain • References to indirect blocks containing further data block indices. • Singly Indirect Block • contains 128 further block addresses • Doubly Indirect Block • contains 128 addresses of further singly indirect blocks • Triply Indirect Block • contains 128 addresses of further doubly indirect blocks

Inode Illustrated • Picture from Modern Operating Systems -Andrew S. Tanenbaum

Original Unix File System • Original 512 byte UNIX file system was incapable of providing the data throughput rates • This was required by many applications • VLSI design / Image processing • small amount of processing on large quantities of data • Programs that map files from the file system into large virtual address spaces • What was needed ? • A file system providing higher bandwidth

Initial works at Berkeley-on the Unix File System • Aimed to improve reliability .How ? • Factors to be considered • Seek time • Rotational delay • Transfer Rate Throughput • Need to increase Throughput , Minimise overhead . How? Overhead picture source :http://www.csee.umbc.edu/~plusquel/611/slides/chap6_1.html

Improved by a factor of 2+ Cause : Changing block size from 512 to 1024 Bytes Increase caused because of 2 factors :- Each disk transfer accessed twice as much data Most files could be described without need to access indirect blocks as direct blocks contained twice as much data. Fragmented free spaces at the end of each file so new blocks can be inserted next to old blocks. To aid in updates to files. File System performance

Great System but what is still holding it back. • Although throughput had doubled , • Still used only about 4 % of disk bandwidth. • Although initially free list was ordered for optimal access • Got quickly scrambled as files were created and removed. • Over time free list became entirely random • caused files to have their blocks allocated randomly over the disk. • This forced a seek before every block access. • Transfer rates deteriorated because of randomisation of data block placement.

New File System • Each disk drive contains one or more file systems. • Each file system is described by its super block. • Each file system block has minimum size of 4k bytes • to create files as large as 232 bytes with only 2 levels of indirection • The new file system divides a disk partition into one or more areas called cylinder groups. http://www.seas.ucla.edu/classes/mkampe/cs111.sq05/docs/bsd.html

Each cylinder group has bookkeeping information redundant copy of super block space for inodes bit map describing available blocks in the cylinder group summary info. of usage of data blocks within cylinder group. Redundant information spirals down into the pack so that data on any single track , cylinder , or platter can be lost without losing all copies of the super block. New File System

Fast File System Optimises Reads • Data laid out such that larger blocks can be transferred in a single disk transaction • Greatly increasing file system throughput • By increasing block size , disk accesses in the new file system could transfer upto four times as much information per disk transaction.

As block size on the disk increases , waste rises quickly. Optimising Storage Allocation

Divide single file system block into fragments. {2,4,8 addressable fragments }. Block map is associated with each cylinder group records the space available in a cylinder group at the fragment level. Optimising Storage Allocation-New File System • Each bit in the map records status of fragment. • X fragment in use • O fragment available

File System Parameterisation • Next goal of the new file system • To parameterize the processor capabilities and mass storage characteristics •  Blocks can be allocated in an optimum configuration dependent way. • Parameters used • Speed of the processor • Hardware support for mass storage transfers • Characteristics of the mass storage devices.

Global Layout Policies • Use file system wide summary information • Responsible for deciding the placement of new directories and files. • Calculate rotationally optimal block layouts. • Decide when to force long seek to new cylinder group if there are insufficient blocks left in the current cylinder group . • Improve performance by clustering related information.

Global Layout Policies Call Local Allocation Routines • Local Allocation Routines • Use locally optimal scheme to layout data blocks. • Methods to improve file system performance • Increase locality of reference • This minimises seek latency • Improve layout of data • To make larger transfers possible

The global layout policy tries to place all the inodes of files in a directory in the same cylinder group. Allocation of Inodes within a cylinder group is done using the next free strategy. Spills when using datablocks are handled by redirecting block allocation to a different cylinder group. Layout Policies • New File System • All inodes within in a particular cylinder group can be read with 8-16 disk transfers. • Old File System • Requires one disk transfer to fetch the inode for each file in a directory

Use next available block rotationally closest to the requested block If there are no blocks available on the same cylinder , use a block within the same cylinder group If that cylinder group is entirely full , quadratically hash the cylinder group number to choose another cylinder group to look for a free block If hash fails , apply exhaustive search to all cylinder groups Quadratic hash is used : Fast in finding unused slots in nearly full hash tables . Four level allocation strategy used by local allocator

Performance Results • Inode layout policy is effective • Large directory having many directories within. • # disk accesses for inodes cut by a factor of 2. • Large directory containing only files. • # disk accesses for inodes cut by a factor of 8.

Throughput Analysis The slower write rates (when compared to the read rates) in the 4096 types occur because the kernel has to do twice as many disk allocations per second , making the processor unable to keep up with the disk transfer rate.

Percentage(%) of bandwidth is a measure of the effective utilisation of the disk by the file system. Both reads and writes are faster in the new system . Biggest contributing factor in this speedup is because of the larger block size used by the new file system. (New File System) uses 47% of the disk bandwidth Throughput Analysis Results

File System Functional Enhancements • Long file names • (New)File names can be of arbitrary length. • File Locking • New File System • Provides file locking. [ 2 Schemes : Hard locks , Advisory locks ] • Old File System • No provision for locking files • Main drawbacks • Processes consumed CPU time by looping over attempts to create locks • Locks left lying around because of system crashes had to be manually removed • Process running as sys.admin. were forced to use a different mechanism.

Symbolic links Is implemented as a file that contains a pathname. When system encounters a symbolic link while interpreting a component of a pathname , the contents of the symbolic link is prepended to the rest of the pathname,and this name is interpreted to yield the resulting pathname. Allows references across physical file systems and supports inter-machine linkage. File System Functional Enhancements picture source : http://www.seas.ucla.edu/classes/mkampe/cs111.sq05/docs/bsd.html

File System Functional Enhancements • Rename • (New)Create new version of temporary file and rename temporary file. • (Old ) Required three calls to the system • If programs were interrupted or the system crashed between these calls , target file could be left with only its temporary name. • Quotas • (Old)Any single user can allocate all available space in the file system Shared User Systems this might not be acceptable • (New)Quota mechanism sets limit on both number of inodes and the number of disk blocks that a user may allocate.

Current Trends & Conclusion • Original File System  Fast File System [FreeBSD] [NetBSD] [OpenBSD] [NeXTStep] [Solaris] [Ext2] [Linux Native] Source :http://www.tldp.org/FAQ/Linux-FAQ/partitions.html

Extra Slides

Linux kernel contains a virtual file system layer which is used during system calls acting on files. The VFS is an indirection layer which handles the file oriented system calls and calls the necessary functions in the physical filesystem code to do the I/O. This indirection mechanism is frequently used in the Unix like operating systems to ease the integration and the use of several filesystem types. Design and Implementation of the Second Extended Filesystem ,Rémy Card, Theodore Ts'o, Stephen Tweedie, Linux Kernel File System

When a process issues a file oriented system call , the kernel calls a function contained in the VFS. This function(1) handles the structure independent manipulations and redirects the call to a function(2) contained in the physical filesystem code This function(2) is responsible for handling the structure dependent operations.File system code uses the buffer cache functions to request I/O on devices. Design and Implementation of the Second Extended Filesystem ,Rémy Card, Theodore Ts'o, Stephen Tweedie, Linux Kernel File System

A Fast File System for Unix