
Outline for Today’s Lecture

In this lecture, we will discuss file caching, NTFS, and distributed file systems. Learn how to avoid disk operations by utilizing caches effectively and understand the benefits of file caching.


Presentation Transcript


  1. Outline for Today’s Lecture Administrative: • Welcome back! • Programming assignment 2 due date extended (as requested) Objective: • File caching • NTFS • Distributed File Systems

  2. File Buffer Cache • Avoid the disk for as many file operations as possible. • The cache acts as a filter for the requests seen by the disk. • Reads are well served: reuse! • Delayed writeback can avoid going to disk at all for temp files. [Figure: process, memory file cache, and disk along the request path]

  3. Why Are File Caches Effective? 1. Locality of reference: storage accesses come in clumps. • Spatial locality: if a process accesses data in block B, it is likely to reference other nearby data soon (e.g., the remainder of block B); example: reading or writing a file one byte at a time. • Temporal locality: recently accessed data is likely to be used again. 2. Read-ahead: if we can predict what blocks will be needed soon, we can prefetch them into the cache; most files are accessed sequentially.
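
To make the read-ahead point concrete, here is a minimal user-level sketch (the file name is hypothetical) that uses the standard posix_fadvise call to declare a sequential access pattern, so the kernel's read-ahead can prefetch blocks into the file cache before they are requested:

```c
/* Sketch: hint sequential access so the OS can prefetch into the file cache.
 * "bigfile.dat" is a made-up file name for illustration. */
#define _POSIX_C_SOURCE 200112L
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    int fd = open("bigfile.dat", O_RDONLY);
    if (fd < 0) { perror("open"); return 1; }

    /* Tell the OS we will read sequentially; it may prefetch aggressively. */
    posix_fadvise(fd, 0, 0, POSIX_FADV_SEQUENTIAL);

    char buf[4096];
    ssize_t n;
    while ((n = read(fd, buf, sizeof buf)) > 0)
        ;   /* consume data; later blocks are likely already cached */

    close(fd);
    return 0;
}
```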

  4. File Buffer Cache • What do we want to cache? • inodes • dir nodes • whole files • disk blocks • The answer will determine where in the request path the cache sits. [Figure: process, V.M., and file cache in memory]

  5. Access Patterns Along the Way • The application issues: open(/foo/bar/file); read(fd,buf,sizeof(buf)); read(fd,buf,sizeof(buf)); read(fd,buf,sizeof(buf)); close(fd); • Below the file system, the open expands into: read(rootdir); read(inode); read(foo); read(inode); read(bar); read(inode); and each read(fd,...) becomes a read(filedatablock), unless the file cache absorbs it. [Figure: the same request stream seen by the process, the file system, and the file cache.]

  6. File Access Patterns • What do users seem to want from the file abstraction? • What do these usage patterns mean for file structure and implementation decisions? • What operations should be optimized 1st? • How should files be structured? • Is there temporal locality in file usage? • How long do files really live?

  7. File System Design and Implementation • Know your workload! File usage patterns should influence design decisions. Do things differently depending on: • How large are most files? How long-lived? Read vs. write activity? Shared often? • Different levels of the system "see" a different workload. • Feedback loop: the usage patterns observed today shape the design and implementation, and the design in turn shapes future usage. [Figure: feedback loop between file system design/implementation and observed usage patterns.]

  8. What to Cache? Locality in File Access Patterns (UNIX Workloads) • Most files are small (often fitting into one disk block), although most bytes are transferred from longer files. • Accesses tend to be sequential and whole-file (100% of the file is transferred). • Spatial locality. • What happens when we cache a huge file? • Most opens are for read mode; most bytes transferred are by read operations.

  9. What to Cache? Locality in File Access Patterns (continued) • There is significant reuse (re-opens): most opens are to files that have been opened recently and repeatedly. Directory nodes and executables also exhibit good temporal locality. • Looks good for caching! • Use of temp files is a significant part of file system activity in UNIX: very limited reuse, short lifetimes (less than a minute). • Long absolute pathnames are common in file opens. • Name resolution can dominate performance – why?

  10. What to do about long paths? • Make long lookups cheaper - cluster inodes and data on disk to make each component resolution step somewhat cheaper • Immediate files - meta-data and first block of data co-located • Collapse prefixes of paths - hash table • Prefix table • “Cache it” - in this case, directory info

  11. Issues for Implementing an I/O Cache Structure Goal: maintain K slots in memory as a cache over a collection of m items on secondary storage (K << m). 1. What happens on the first access to each item? Fetch it into some slot of the cache, use it, and leave it there to speed up access if it is needed again later. 2. How to determine if an item is resident in the cache? Maintain a directory of items in the cache: a hash table. Hash on a unique identifier (tag) for the item (fully associative). 3. How to find a slot for an item fetched into the cache? Choose an unused slot, or select an item to replace according to some policy, and evict it from the cache, freeing its slot.
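
The three steps above can be sketched in C. This is an illustrative user-level mock-up with hypothetical structures and a trivial round-robin victim choice, not the code of any real kernel:

```c
#include <stdio.h>
#include <string.h>

#define NBUCKETS 64
#define NBUF     16
#define BLKSIZE  4096

struct buf {
    unsigned vnode, blkno;        /* tag: which file block this buffer holds */
    int valid;                    /* does the buffer hold usable data?       */
    struct buf *hash_next;        /* hash chain link                         */
    char data[BLKSIZE];
};

static struct buf pool[NBUF];
static struct buf *bucket[NBUCKETS];
static int next_victim;           /* trivial round-robin eviction for the sketch */

static unsigned hash(unsigned vnode, unsigned blkno) {
    return (vnode * 31u + blkno) % NBUCKETS;
}

/* Stand-in for the real disk driver: just stamp the block. */
static void disk_read(unsigned vnode, unsigned blkno, char *data) {
    snprintf(data, BLKSIZE, "contents of (%u,%u)", vnode, blkno);
}

/* Remove a buffer from its current hash chain, if any. */
static void unlink_buf(struct buf *b) {
    if (!b->valid) return;
    struct buf **p = &bucket[hash(b->vnode, b->blkno)];
    while (*p && *p != b) p = &(*p)->hash_next;
    if (*p) *p = b->hash_next;
    b->valid = 0;
}

struct buf *getblk(unsigned vnode, unsigned blkno) {
    unsigned h = hash(vnode, blkno);

    /* 2. Is the item resident? Search the hash chain. */
    for (struct buf *b = bucket[h]; b; b = b->hash_next)
        if (b->vnode == vnode && b->blkno == blkno)
            return b;                              /* hit: no disk access */

    /* 3. Miss: choose a slot (free or evicted) and fetch the block. */
    struct buf *b = &pool[next_victim];
    next_victim = (next_victim + 1) % NBUF;
    unlink_buf(b);
    b->vnode = vnode;
    b->blkno = blkno;
    disk_read(vnode, blkno, b->data);              /* 1. first access: fetch it */
    b->valid = 1;
    b->hash_next = bucket[h];
    bucket[h] = b;
    return b;
}

int main(void) {
    struct buf *a = getblk(1, 7);   /* miss: fetched from "disk" */
    struct buf *c = getblk(1, 7);   /* hit: same buffer returned */
    printf("%s (hit=%d)\n", a->data, a == c);
    return 0;
}
```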

  12. File Block Buffer Cache • Most systems use a pool of buffers in kernel memory as a staging area for memory<->disk transfers. • Buffers with valid data are retained in memory in a buffer cache or file cache. • Each item in the cache is a buffer header pointing at a buffer. • Buffers are located by HASH(vnode, logical block); blocks from different files may be intermingled in the hash chains. • System data structures hold pointers to buffers only when I/O is pending or imminent: a busy bit instead of a refcount, so most buffers are "free". [Figure: hash chains of buffer headers, plus a free/inactive list with head and tail pointers.]

  13. Handling Updates in the File Cache 1. Blocks may be modified in memory once they have been brought into the cache. Modified blocks are dirty and must (eventually) be written back. Write-back vs. write-through (104?). 2. Once a block is modified in memory, the write back to disk may not be immediate (synchronous). Delayed writes absorb many small updates with one disk write. How long should the system hold dirty data in memory? Asynchronous writes allow overlapping of computation and disk update activity (write-behind): do the write call for block n+1 while the transfer of block n is in progress. Thus file caches can also improve performance for writes. 3. Knowing data gets to disk: you cannot trust that a "write" syscall has pushed the data to disk; force it with fsync.
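
Point 3 in practice, as a small sketch (the file name is hypothetical): a write may only reach the file cache, and fsync blocks until the data is actually on disk:

```c
/* Sketch: a delayed write sits in the file cache; fsync forces it to disk. */
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    int fd = open("journal.dat", O_WRONLY | O_CREAT, 0644);  /* hypothetical file */
    if (fd < 0) { perror("open"); return 1; }

    const char rec[] = "committed record\n";
    if (write(fd, rec, sizeof rec - 1) < 0)   /* may only reach the cache */
        perror("write");

    if (fsync(fd) < 0)                        /* block until the data is on disk */
        perror("fsync");

    close(fd);
    return 0;
}
```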

  14. Mechanism for Cache Eviction/Replacement • Typical approach: maintain an ordered free/inactive list of slots that are candidates for reuse. • Busy items in active use are not on the list. • E.g., some in-memory data structure holds a pointer to the item. • E.g., an I/O operation is in progress on the item. • The best candidates are slots that do not contain valid items. • Initially all slots are free, and they may become free again as items are destroyed (e.g., as files are removed). • Other slots are listed in order of value of the items they contain. • These slots contain items that are valid but inactive: they are held in memory only in the hope that they will be accessed again later.

  15. Replacement Policy • The effectiveness of a cache is determined largely by the policy for ordering slots/items on the free/inactive list; this defines the replacement policy. • A typical cache replacement policy is LRU: • Assume hot items used recently are likely to be used again. • Move the item to the tail of the free list on every release. • The item at the front of the list is the coldest inactive item. • Other alternatives: • FIFO: replace the oldest item. • MRU/LIFO: replace the most recently used item. [Figure: the same hash-chain/free-list structure as before, with eviction from the head and release to the tail.]
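
A minimal sketch of the LRU list discipline described above (node and list names are hypothetical): release puts a slot at the tail, so the head always holds the coldest eviction candidate:

```c
/* Sketch of LRU ordering on the free/inactive list. */
#include <stddef.h>

struct slot {
    struct slot *prev, *next;
    /* ... cached item ... */
};

static struct slot *lru_head, *lru_tail;

/* Called when a buffer is released (no longer busy): move to the tail. */
void lru_release(struct slot *s) {
    s->next = NULL;
    s->prev = lru_tail;
    if (lru_tail) lru_tail->next = s; else lru_head = s;
    lru_tail = s;                      /* most recently used end */
}

/* Called when a slot is needed: evict the coldest item from the head. */
struct slot *lru_evict(void) {
    struct slot *s = lru_head;         /* least recently used end */
    if (!s) return NULL;
    lru_head = s->next;
    if (lru_head) lru_head->prev = NULL; else lru_tail = NULL;
    return s;
}
```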

  16. Viewing Memory as a Unified I/O Cache • A key role of the I/O system is to manage the page/block cache for performance and reliability: • tracking cache contents and managing page/block sharing • choreographing movement to/from external storage • balancing competing uses of memory • Modern systems attempt to balance memory usage between the VM system and the file cache: • Grow the file cache for file-intensive workloads. • Grow the VM page cache for memory-intensive workloads. • Support a consistent view of files across different styles of access: a unified buffer cache.

  17. Synchronization Problems for a Cache 1. What if two processes try to get the same block concurrently, and the block is not resident? 2. What if a process requests to write block A while a put is already in progress on block A? 3. What if a get must replace a dirty block A in order to allocate a buffer to fetch block B? This will happen if the block/buffer at the head of the free list is dirty. What if another process requests to get A during the put? 4. How to handle read/write requests on shared files atomically? Unix guarantees that a read will not return the partial result of a concurrent write, and that concurrent writes do not interleave.

  18. Linux Page Cache • The page cache is the disk cache for all page-based I/O – it subsumes the file buffer cache. • All page I/O flows through the page cache. • pdflush daemons write dirty pages/buffers back to disk: • When free memory falls below a threshold, a daemon wakes up to reclaim memory; a specified number of pages are written back, until free memory is above the threshold. • Periodically, on timer expiration, a daemon wakes up so that old dirty data does not linger unwritten; it writes back all pages older than a specified limit.

  19. Layout

  20. Layout on Disk • Can address both seek and rotational latency • Cluster related things together (e.g. an inode and its data, inodes in same directory (ls command), data blocks of multi-block file, files in same directory) • Sub-block allocation to reduce fragmentation for small files • Log-Structured File Systems

  21. File Structure Implementation: Mapping File -> Block • Contiguous: 1 block pointer; causes fragmentation; growth is a problem. • Linked: each block points to the next block, the directory points to the first; OK for sequential access. • Indexed: an index structure is required; better for random access into the file (see the inode sketch after the next slide).

  22. UNIX Inodes [Figure: an inode holds the file attributes plus an array of data block addresses; the final entries point to indirect blocks of increasing depth (single, double, triple indirect) that hold further block addresses. Decoupling meta-data from directory entries: a directory entry holds only a name and an inode number.]
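
As an illustration of indexed allocation and the inode of the figure, here is a simplified C sketch; the field names, sizes, and the read_block helper are hypothetical and do not match any real UNIX on-disk format:

```c
#include <stdint.h>

#define NDIRECT         12
#define ADDRS_PER_BLOCK 1024   /* 4 KB block / 4-byte block addresses */

/* Simplified on-disk inode: attributes plus the block map. */
struct dinode {
    uint16_t mode;                 /* type + permission bits                  */
    uint16_t nlink;                /* directory entries naming this inode     */
    uint32_t size;                 /* file size in bytes                      */
    uint32_t atime, mtime, ctime;  /* access/modify/change times              */
    uint32_t direct[NDIRECT];      /* direct data block addresses             */
    uint32_t single_indirect;      /* block of further block addresses        */
    uint32_t double_indirect;      /* block of single-indirect block addrs    */
    uint32_t triple_indirect;      /* block of double-indirect block addrs    */
};

/* Assumed helper, not defined here: read one block of block addresses. */
void read_block(uint32_t blkno, uint32_t addrs[ADDRS_PER_BLOCK]);

/* Map a logical block number to a disk address (direct and single-indirect
 * cases only; double/triple indirect omitted for brevity). */
uint32_t bmap(const struct dinode *ip, uint32_t lbn) {
    if (lbn < NDIRECT)
        return ip->direct[lbn];               /* small files: direct pointers */
    uint32_t addrs[ADDRS_PER_BLOCK];
    read_block(ip->single_indirect, addrs);   /* one extra disk read          */
    return addrs[lbn - NDIRECT];
}
```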

  23. File Allocation Table (FAT) [Figure: directory entries for Lecture.ppt, Pic.jpg, and Notes.txt each record a starting block; each FAT entry holds the number of the file's next block, and each chain ends with an eof marker.]
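
A small runnable sketch of chasing a FAT chain; the table contents and block numbers are made up for illustration:

```c
/* Sketch: the directory entry holds the first block number; fat[b] gives the
 * next block of the same file, or an EOF marker at the end of the chain. */
#include <stdio.h>

#define FAT_EOF (-1)

int main(void) {
    int fat[16] = {0};
    fat[4] = 7; fat[7] = 2; fat[2] = FAT_EOF;    /* e.g. a file spanning blocks 4 -> 7 -> 2 */

    int first = 4;                                /* from the directory entry */
    for (int b = first; b != FAT_EOF; b = fat[b])
        printf("read block %d\n", b);             /* sequential access = chase pointers */
    return 0;
}
```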

  24. The Problem of Disk Layout • The level of indirection in the file block maps allows flexibility in file layout. • "File system design is 99% block allocation." [McVoy] • Competing goals for block allocation: • allocation cost • bandwidth for high-volume transfers • efficient directory operations • Goal: reduce disk arm movement and seek overhead. • Metric of merit: bandwidth utilization

  25. FFS and LFS Two different approaches to block allocation: • Cylinder groups in the Fast File System (FFS) [McKusick81] • clustering enhancements [McVoy91], and improved cluster allocation [McKusick: Smith/Seltzer96] • FFS can also be extended with metadata logging [e.g., Episode] • Log-Structured File System (LFS) • proposed in [Douglis/Ousterhout90] • implemented/studied in [Rosenblum91] • BSD port, sort of maybe: [Seltzer93] • extended with self-tuning methods [Neefe/Anderson97] • Other approach: extent-based file systems

  26. Log-Structured File Systems • Assumption: the cache is effectively filtering out reads, so we should optimize for writes. • Basic idea: manage the disk as an append-only log (subsequent writes involve minimal head movement). • Data and meta-data (mixed) are accumulated in large segments and written contiguously. • Reads work as in UNIX: once the inode is found, data blocks are located via the index. • Cleaning is an issue: to produce contiguous free space, the cleaner must correct the fragmentation that develops over time. • Claim: LFS can use 70% of disk bandwidth for writing, while Unix FFS typically uses only 5-10% because of seeks.

  27. LFS logs In LFS, all block and metadata allocation is log-based. • LFS views the disk as “one big log” (logically). • All writes are clustered and sequential/contiguous. • Intermingles metadata and blocks from different files. • Data is laid out on disk in the order it is written. • No-overwrite allocation policy: if an old block or inode is modified, write it to a new location at the tail of the log. • LFS uses (mostly) the same metadata structures as FFS; only the allocation scheme is different. • Cylinder group structures and free block maps are eliminated. • Inodes are found by indirecting through a new map

  28. LFS Data Structures on Disk • Inode – in log, same as FFS • Inode map – in log, locates position of inode, version, time of last access • Segment summary – in log, identifies contents of segment (file#, offset for each block in segment) • Segment usage table – in log, counts live bytes in segment and last write time • Checkpoint region – fixed location on disk, locates blocks of inode map, identifies last checkpoint in log. • Directory change log – in log, records directory operations to maintain consistency of ref counts in inodes
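
For concreteness, a hypothetical C rendering of these structures; the field names and sizes are illustrative and do not match the real Sprite or BSD LFS layout:

```c
#include <stdint.h>

struct imap_entry {          /* inode map: where does inode i live now?        */
    uint32_t inode_addr;     /* disk address of the latest copy of the inode   */
    uint32_t version;        /* bumped when the inode is truncated or reused   */
    uint32_t last_access;    /* time of last access                            */
};

struct seg_summary_entry {   /* one per block written in the segment           */
    uint32_t ino;            /* which file the block belongs to (file #)       */
    uint32_t offset;         /* logical block offset within that file          */
};

struct seg_usage {           /* segment usage table entry                      */
    uint32_t live_bytes;     /* bytes in the segment that are still live       */
    uint32_t last_write;     /* time of last write to the segment              */
};

struct checkpoint_region {   /* fixed location on disk (two copies alternate)  */
    uint32_t imap_block_addrs[64];   /* where the inode map blocks are         */
    uint32_t seg_usage_addrs[64];    /* where the usage table blocks are       */
    uint32_t last_segment;           /* tail of the log at checkpoint time     */
    uint64_t timestamp;
};
```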

  29. Structure of the Log [Figure: a checkpoint region at a fixed location, followed by log segments containing a mix of data blocks (file 1 block 1, file 1 block 2, file 2), directory node D1, inodes, inode map blocks, and segment summary/usage/dirlog blocks; the rest of the disk is clean.]

  30. Writing the Log in LFS 1. LFS "saves up" dirty blocks and dirty inodes until it has a full segment (e.g., 1 MB). • Dirty inodes are grouped into block-sized clumps. • Dirty blocks are sorted by (file, logical block number). • Each log segment includes summary info and a checksum. 2. LFS writes each log segment in a single burst, with at most one seek. • Find a free segment "slot" on the disk, and write it. • Store a back pointer to the previous segment. • Logically the log is sequential, but physically it consists of a chain of segments, each large enough to amortize seek overhead.
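
The "save up dirty data, write one segment in a single burst" idea, as a user-level sketch with hypothetical sizes and a stubbed-out disk write (the real LFS does this inside the kernel):

```c
#include <stdio.h>
#include <string.h>

#define BLKSIZE  4096
#define SEGSIZE  (1024 * 1024)              /* e.g. 1 MB segment */
#define SEGBLKS  (SEGSIZE / BLKSIZE)

static char   segbuf[SEGSIZE];              /* in-memory staging area for the segment */
static size_t nblocks;                      /* dirty blocks accumulated so far        */

/* Stand-in for the real segment write: one big sequential transfer. */
static void disk_write_segment(const char *seg, size_t len) {
    (void)seg;
    printf("writing %zu-byte segment in one burst\n", len);
}

/* Append one dirty block to the current segment; flush when the segment is full. */
void lfs_write_block(const char *block) {
    memcpy(segbuf + nblocks * BLKSIZE, block, BLKSIZE);
    if (++nblocks == SEGBLKS) {             /* full segment: one sequential write */
        disk_write_segment(segbuf, SEGSIZE);
        nblocks = 0;
    }
}
```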

  31. Growth of the Log [Figure: after write(file1, block1), creat(D1/file3), and write(file3, block1), the new data blocks, inodes, and updated directory D1 are appended at the tail of the log; the earlier copies remain further back in the log.]

  32. Death in the Log [Figure: the same operations leave the old copies of file1 block1, D1, and the affected inodes dead (superseded) in earlier segments; only the newly appended copies at the log tail are live.]

  33. Writing the Log: the Rest of the Story 1. LFS cannot always delay writes long enough to accumulate a full segment; sometimes it must push a partial segment. • fsync, update daemon, NFS server, etc. • Directory operations are synchronous in FFS, and some must be in LFS as well to preserve failure semantics and ordering. 2. LFS allocation and write policies affect the buffer cache, which is supposed to be filesystem-independent. • Pin (lock) dirty blocks until the segment is written; dirty blocks cannot be recycled off the free chain as before. • Endow *indirect blocks with permanent logical block numbers suitable for hashing in the buffer cache.

  34. Cleaning in LFS What does LFS do when the disk fills up? 1. As the log is written, blocks and inodes written earlier in time are superseded (“killed”) by versions written later. • files are overwritten or modified; inodes are updated • when files are removed, blocks and inodes are deallocated 2. A cleaner daemon compacts remaining live data to free up large hunks of free space suitable for writing segments. • look for segments with little remaining live data • benefit/cost analysis to choose segments • write remaining live data to the log tail • can consume a significant share of bandwidth, and there are lots of cost/benefit heuristics involved.

  35. Cleaning the Log [Figure: a segment containing live data (file 1 block 2, file 2) mixed with dead blocks is chosen for cleaning.]

  36. Cleaning the Log [Figure: the live blocks (file 1 block 2, file 2) are copied forward to the tail of the log.]

  37. Cleaning the Log [Figure: the old segment now contains no live data and is marked clean, yielding a contiguous free segment for future writes.]

  38. Cleaning Issues • Must be able to identify which blocks are live. • Must be able to identify the file to which each block belongs, in order to update that file's inode to the block's new location. • The segment summary block contains this info: file contents are associated with a uid (version # and inode #). • Inode entries contain a version # (incremented on truncate). • Compare the two to see if the inode still points to the block under consideration.
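
A sketch of the liveness test this implies, using hypothetical structures and assumed helpers for the inode map and block map lookups (not the actual LFS cleaner code):

```c
#include <stdint.h>

struct seg_summary_entry { uint32_t ino, offset, version; };
struct imap_entry        { uint32_t inode_addr, version; };

/* Assumed helpers, not defined here: consult the inode map, and ask the
 * file's inode which disk address it currently has for a logical offset. */
struct imap_entry *imap_lookup(uint32_t ino);
uint32_t inode_block_addr(uint32_t ino, uint32_t offset);

/* A block is live iff the file's version still matches and the inode still
 * points at this disk address for that logical offset. */
int block_is_live(const struct seg_summary_entry *e, uint32_t disk_addr) {
    struct imap_entry *im = imap_lookup(e->ino);
    if (im == 0 || im->version != e->version)
        return 0;                              /* file deleted or truncated */
    return inode_block_addr(e->ino, e->offset) == disk_addr;
}
```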

  39. Policies • When the cleaner cleans – threshold based. • How much – tens of segments at a time, until the threshold is reached. • Which segments: • The most fragmented segment is not the best choice. • The value of the free space in a segment depends on the stability of its live data (approximated by age). • Cost/benefit analysis: benefit = free space generated (1-u) * age of youngest block; cost = cost to read the segment + cost to move the live data. • The segment usage table supports this. • How to group live blocks.
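
For reference, the cost/benefit ratio as stated in the LFS paper [Rosenblum91], where u is the fraction of the segment that is still live (reading the segment costs 1 and writing the live data back costs u):

```latex
\frac{\text{benefit}}{\text{cost}}
  \;=\; \frac{\text{free space generated} \times \text{age of data}}{\text{cost}}
  \;=\; \frac{(1-u)\cdot \text{age}}{1+u}
```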

  40. Recovering Disk Contents • Checkpoints – define consistent states: • A position in the log where all data structures are consistent. • The checkpoint region (fixed location) contains the addresses of all blocks of the inode map and segment usage table, plus a pointer to the last segment written. • There are actually two checkpoint regions that alternate, in case a crash occurs while writing checkpoint region data. • Roll-forward – to recover beyond the last checkpoint: • Uses the segment summary blocks at the end of the log: if we find new inodes, update the inode map found from the checkpoint. • Adjust utilizations in the segment usage table. • Restore consistency between ref counts within inodes and the directory entries pointing to those inodes, using the directory operation log (like an intentions list).

  41. Recovery of the Log [Figure: the checkpoint region points to the last consistent state; segments written since the checkpoint (new copies of file1 block1, file3, and D1) are recovered by rolling forward through their segment summary blocks.]

  42. Recovery in Unix fsck • Traverses the directory structure checking ref counts of inodes • Traverses inodes and freelist to check block usage of all disk blocks

  43. Evaluation of LFS vs. FFS 1. How effective is FFS clustering in “sequentializing” disk writes? Do we need LFS once we have clustering? • How big do files have to be before FFS matches LFS? • How effective is clustering for bursts of creates/deletes? • What is the impact of FFS tuning parameters? 2. What is the impact of file system age and high disk space utilization? • LFS pays a higher cleaning overhead. • In FFS fragmentation compromises clustering effectiveness. 3. What about workloads with frequent overwrites and random access patterns (e.g., transaction processing)?

  44. Which kinds of locality on disk are better?

  45. Benchmarks and Conclusions 1. For bulk creates/deletes of small files, LFS is an order of magnitude better than FFS, which is disk-limited. • LFS gets about 70% of disk bandwidth for creates. 2. For bulk creates of large files, both FFS and LFS are disk-limited. 3. FFS and LFS are roughly equivalent for reads of files in create order, but FFS spends more seek time on large files. 4. For file overwrites in create order, FFS wins for large files.

  46. The Cleaner Controversy Seltzer measured TP performance using a TPC-B benchmark (banking application) with a separate log disk. 1. TPC-B is dominated by random reads/writes of account file. 2. LFS wins if there is no cleaner, because it can sequentialize the random writes. • Journaling log avoids the need for synchronous writes. 3. Since the data dies quickly in this application, LFS cleaner is kept busy, leading to high overhead. 4. Claim: cleaner consumes 34% of disk bandwidth at 48% space utilization, removing any advantage of LFS.

  47. NTFS

  48. NTFS • API functions for file I/O in Windows 2000 • Second column gives nearest UNIX equivalent

  49. File System API Example A program fragment for copying a file using the Windows 2000 API functions
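
The slide's code fragment is not reproduced in this transcript. The following is a minimal sketch of what such a fragment looks like using the standard Win32 calls (CreateFile, ReadFile, WriteFile, CloseHandle); the file names and buffer size are illustrative, and error handling is minimal:

```c
/* Sketch: copy one file to another with the Windows file I/O API. */
#include <windows.h>

int main(void) {
    HANDLE in  = CreateFile(TEXT("data.txt"), GENERIC_READ, FILE_SHARE_READ,
                            NULL, OPEN_EXISTING, 0, NULL);
    HANDLE out = CreateFile(TEXT("copy.txt"), GENERIC_WRITE, 0,
                            NULL, CREATE_ALWAYS, FILE_ATTRIBUTE_NORMAL, NULL);
    if (in == INVALID_HANDLE_VALUE || out == INVALID_HANDLE_VALUE)
        return 1;

    char buf[4096];
    DWORD nread, nwritten;
    /* Copy loop: read a buffer from the source, write it to the destination. */
    while (ReadFile(in, buf, sizeof buf, &nread, NULL) && nread > 0)
        WriteFile(out, buf, nread, &nwritten, NULL);

    CloseHandle(in);
    CloseHandle(out);
    return 0;
}
```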

  50. Directory System Calls • API functions for directory management in Windows 2000 • Second column gives nearest UNIX equivalent, when one exists
