Supporting High Performance I/O with Effective Caching and Prefetching


Presentation Transcript


  1. Supporting High Performance I/O with Effective Caching and Prefetching Song Jiang The ECE Department Wayne State University

  2. Disk I/O performance is critical to data-intensive applications • Data-intensive applications demand efficient I/O support. • Database; • Multimedia; • Scientific simulations; • Other desktop applications. • Hard disk is the most cost-effective online storage device. • Workstations; • Distributed systems; • Parallel systems.

  3. Unbalanced system improvements: the disks in 2000 are more than 57 times “SLOWER”, relative to other system components, than their ancestors in 1980. (Bryant and O’Hallaron, “Computer Systems: A Programmer’s Perspective”, Prentice Hall, 2003.)

  4. Disk performance and disk arm seeks • Disk performance is limited by mechanical constraints; • Seek • Rotation • Disk seek is the Achilles’ heel of disk access performance; • Access of sequential blocks is much faster.

  5. An example: IBM Ultrastar 18ZX specification* • Sequential read: 4,700 IOs/s; • Random read: < 200 IOs/s. Our goal: to increase access sequentiality for a larger transfer rate. (* Taken from IBM “ULTRASTAR 9LZX/18ZX Hardware/Functional Specification”, Version 2.4.)
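A quick back-of-the-envelope comparison illustrates how large this gap is. The 4 KB request size below is an assumption for illustration; the specification above gives only the IO rates.

```python
# Approximate transfer rates implied by the IBM Ultrastar 18ZX figures
# above: 4,700 sequential vs. < 200 random IOs per second.
# The 4 KB request size is an assumption for illustration.

BLOCK_KB = 4

def throughput_mb_per_s(ios_per_sec, block_kb=BLOCK_KB):
    """Convert an IO rate into an approximate transfer rate (MB/s)."""
    return ios_per_sec * block_kb / 1024

seq = throughput_mb_per_s(4700)
rand = throughput_mb_per_s(200)
print(f"sequential: {seq:.1f} MB/s, random: {rand:.1f} MB/s, "
      f"ratio: {seq / rand:.1f}x")
```

Even using the generous 200 IOs/s figure for random reads, sequential access delivers over 20 times the bandwidth, which is why increasing sequentiality pays off.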

  6. Towards peak disk transfer rate

  7. Random accesses are common • Workloads on popular operating systems • UNIX: most accessed files are short in length (80% are smaller than 26 Kbytes) [Ousterhout 1991]; • Windows NT: 40% of I/O operations are to files shorter than 2KBytes [Vogels 1999]. • Scientific computing • The SIO study: in many applications the majority of requests are for a small amount of data (less than a few Kbytes) [Reed 1997]; • The CHARISMA study: large, regular data structures are distributed among processes with interleaved accesses of shared files [Kotz 1996].

  8. Improving disk performance by reducing random accesses Two approaches: • Aggregate small random accesses into large sequential accesses; • Hide random accesses. My objective: improve access sequentiality in a general-purpose OS without any aid from programs. Software levels: • Applications: some database systems; • Run-time systems: collective I/O; • Operating systems: TIP informed prefetching; • Disk scheduler: CSCAN, SSTF.

  9. Outline • Inadequacy of current buffer cache management in OS; • Managing disk layout information; • Sequence-based and history-aware prefetching; • Miss-penalty aware caching; • Performance evaluation in a Linux implementation; • Related work; • Summary.

  10. Critical role of buffer cache in OS • The buffer cache can significantly influence the request stream to disk: • Only unsatisfied requests are passed to disk; • Prefetch requests are generated to disk. • The opportunity for improving access sequentiality is not well exploited: • The buffer cache ignores disk layout information; • The buffer cache is an agent between two sides, but only the behavior of one side (the application) is considered. • The effectiveness of the I/O scheduler relies on the output of the buffer cache. [Diagram: Application I/O requests → Buffer cache (caching & prefetching) → I/O scheduler → Disk driver → Disk]

  11. Problems of caching without disk layout information • Current caching minimizes the cache miss ratio by exploiting only temporal locality. • Average access time = (1 - miss_ratio) * hit_time + miss_ratio * miss_penalty; • Sequentially accessed blocks incur a small miss penalty, randomly accessed blocks a large one; • A small number of misses can still mean a long access time. • Should minimize access time by also considering on-disk data layout: • Distinguish sequentially and randomly accessed blocks; • Give “expensive” random blocks a high caching priority to suppress future random accesses; • Disk accesses become more sequential. [Diagram: blocks A, B, C, D and X1–X4 laid out on the tracks of a hard disk drive]
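Plugging hypothetical timings into the formula above shows why a lower miss ratio does not guarantee a lower access time. The millisecond values are illustrative assumptions, not measurements.

```python
# Average access time per the slide's formula. Timings are assumed:
# 0.1 ms cache hit, 1 ms penalty for a sequential miss (nearby block),
# 10 ms penalty for a random miss (seek + rotation).

def avg_access_time(miss_ratio, hit_time_ms, miss_penalty_ms):
    return (1 - miss_ratio) * hit_time_ms + miss_ratio * miss_penalty_ms

# A low miss ratio where every miss is a random access ...
few_random_misses = avg_access_time(0.05, 0.1, 10.0)      # 0.595 ms
# ... loses to a 4x higher miss ratio of cheap sequential misses.
many_sequential_misses = avg_access_time(0.20, 0.1, 1.0)  # 0.28 ms
print(few_random_misses, many_sequential_misses)
```

This is exactly the case the slide makes for giving random blocks a higher caching priority: suppressing the expensive misses matters more than minimizing the miss count.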

  12. Problems of prefetching without disk layout information • Current prefetching is conducted only on the file abstraction, working only inside individual files. • Performance constraints of prefetching on the logical file abstraction: • Sequential accesses spanning multiple small files cannot be detected; • Sequentiality in the logical abstraction may not correspond to sequentiality on the physical disk; • Previously detected sequences of blocks are not remembered; • Metadata blocks cannot be prefetched. • Should conduct prefetching directly on the disk layout of data. [Diagram: blocks of files X, Y, Z, and R (X1, X2, Y1, Y2, Z1, Z2, A–D) interleaved on the tracks of a hard disk drive]

  13. Outline • Inadequacy of current buffer cache management in OS; • Managing disk layout information; • Sequence-based and history-aware prefetching; • Miss-penalty aware caching; • Performance evaluation in a Linux implementation; • Related work; • Summary.

  14. Using disk layout information in buffer cache management • What disk layout information to use? • Logical block number (LBN): the logical disk geometry provided by the disk firmware; • Accesses of contiguous LBNs perform close to accesses of contiguous blocks on disk; • The LBN interface is highly portable across platforms. • How to efficiently manage the disk layout information? • Currently, LBNs are used only to identify disk locations for reads/writes; • We use LBNs to track access times of disk blocks and to search for access sequences; • Block table: a data structure for efficient disk block tracking.

  15. Block table: a new data structure for tracking disk blocks • A three-level structure indexed by the base-512 digits of the LBN, e.g. LBN 5140 = 0*512² + 10*512 + 20, giving indices (0, 10, 20); • Each leaf entry records the block’s recent access timestamps (time1, time2). [Diagram: three-level block table with nodes N1–N4; the path through indices 0, 10, and 20 locates the entry for LBN 5140]
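The lookup path can be sketched as a three-level radix structure indexed by the base-512 digits of the LBN. The class and method names below are illustrative, and Python dicts stand in for the page-sized array nodes a kernel implementation would use.

```python
# Sketch of a block table: a three-level radix structure indexed by the
# base-512 digits of an LBN, storing the two most recent access
# timestamps per block (which history-aware prefetching needs).

TABLE_WIDTH = 512

class BlockTable:
    def __init__(self):
        self.root = {}

    @staticmethod
    def _digits(lbn):
        # e.g. LBN 5140 = 0*512**2 + 10*512 + 20  ->  (0, 10, 20)
        return (lbn // TABLE_WIDTH**2,
                (lbn // TABLE_WIDTH) % TABLE_WIDTH,
                lbn % TABLE_WIDTH)

    def record_access(self, lbn, timestamp):
        d1, d2, d3 = self._digits(lbn)
        leaf = self.root.setdefault(d1, {}).setdefault(d2, {})
        previous, _ = leaf.get(d3, (None, None))
        leaf[d3] = (timestamp, previous)   # keep the two latest times

    def last_access(self, lbn):
        d1, d2, d3 = self._digits(lbn)
        return self.root.get(d1, {}).get(d2, {}).get(d3, (None, None))[0]
```

Every lookup or update touches at most three nodes, so tracking access times by LBN stays cheap even for a large disk.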

  16. DiskSeen: a two-side-oriented buffer cache scheme [Diagram: the buffer cache is divided into a caching area, a prefetching area, and a destaging area; blocks transfer between the areas and the disk via on-demand reads, asynchronous prefetching, and delayed write-back]

  17. Outline • Inadequacy of current buffer cache management in OS; • Managing disk layout information; • Sequence-based and history-aware prefetching; • Miss-penalty aware caching; • Performance evaluation in a Linux implementation; • Related work; • Summary.

  18. Sequence-based prefetching • Prefetching is based on LBNs; • The algorithm is similar to the Linux 2.6.11 readahead algorithm: • When 32 contiguous blocks are accessed, create a 16-block current window and a 32-block readahead window; • The two windows are shifted forward while sequential accesses are detected. [Diagram: current window and readahead window on a timestamp/LBN plot]
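A simplified sketch of this window mechanism follows. It collapses the two-window bookkeeping into a single readahead window that opens after 32 contiguous accesses and shifts forward when the reader reaches it; the names and structure are illustrative, not the actual Linux or DiskSeen code.

```python
# Sequence-based prefetching sketch: after SEQ_THRESHOLD contiguous LBN
# accesses, open a readahead window; shift it forward once the reader
# reaches its first block. Window sizes follow the slide.

SEQ_THRESHOLD = 32
READAHEAD_WIN = 32

class SequencePrefetcher:
    def __init__(self):
        self.run_start = None  # first LBN of the current sequential run
        self.next_lbn = None   # LBN expected next if access stays sequential
        self.readahead = None  # (start, end) of the readahead window

    def access(self, lbn):
        """Record an access; return the list of LBNs to prefetch."""
        if lbn != self.next_lbn:           # sequentiality broken: restart
            self.run_start, self.next_lbn = lbn, lbn + 1
            self.readahead = None
            return []
        self.next_lbn += 1
        run_length = self.next_lbn - self.run_start
        if self.readahead is None and run_length >= SEQ_THRESHOLD:
            start = self.next_lbn          # open the first readahead window
            self.readahead = (start, start + READAHEAD_WIN)
            return list(range(*self.readahead))
        if self.readahead and lbn == self.readahead[0]:
            start = self.readahead[1]      # reader caught up: shift forward
            self.readahead = (start, start + READAHEAD_WIN)
            return list(range(*self.readahead))
        return []
```

Reading LBNs 0 through 31 triggers the first 32-block prefetch; as the reader enters each prefetched window, the next one is issued, keeping the disk a window ahead of the application.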

  19. History-aware prefetching • Prefetching matches the current on-going trail of accesses against a recorded history trail; • Prefetching parameters: • Spatial window size: 32 [ Block: 5 … (5+32) ]; • Temporal window size: 256 [ timestamp: 43515 … 43515+256 ]. [Diagram: blocks B1–B8, …, B37 with their recorded access timestamps, showing a history trail and the current on-going trail through them]

  20. History-aware prefetching (Cont’d) • The two windows are shifted forward to mask disk access times; • Setting the parameters for the next readahead window: • Increase the spatial window size if many prefetched blocks are requested; • Increase the temporal window size if only a small number of blocks are prefetchable. [Diagram: current window and readahead window, with spatial and temporal window sizes, on a timestamp/LBN plot]
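These two rules can be sketched as follows; the 75% use-rate threshold, the half-window prefetchability threshold, the doubling policy, and the caps are all illustrative assumptions rather than DiskSeen's actual parameters.

```python
# Adaptive sizing of the next readahead window, per the rules above:
# grow the spatial window when prefetched blocks are being used, grow
# the temporal window when few blocks fall inside it. All thresholds
# and caps are assumed values for illustration.

SPATIAL_MAX = 256     # blocks
TEMPORAL_MAX = 4096   # timestamp units

def next_window_sizes(spatial, temporal, prefetched, used, prefetchable):
    """Return (spatial, temporal) sizes for the next readahead window."""
    if prefetched and used / prefetched > 0.75:
        spatial = min(spatial * 2, SPATIAL_MAX)     # prefetches paying off
    if prefetchable < spatial // 2:
        temporal = min(temporal * 2, TEMPORAL_MAX)  # widen the history slice
    return spatial, temporal
```

Growing the spatial window rewards accurate prefetching with deeper pipelining, while growing the temporal window compensates when the history trail is too sparse to fill the current one.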

  21. Outline • Inadequacy of current buffer cache management in OS; • Managing disk layout information; • Sequence-based and history-aware prefetching; • Miss-penalty aware caching; • Performance evaluation in a Linux implementation; • Related work; • Summary.

  22. Miss-penalty-aware management of the destaging area • Two FIFO queues: on-demand blocks in the on-demand queue, prefetched blocks in the prefetch queue; • Memory allocation between the queues determines block replacement; • The benefit of adding a segment to a queue = the miss penalty observed in that queue’s ghost segment. [Diagram: on-demand queue and prefetch queue, each followed by a ghost segment]
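The allocation rule can be sketched as follows. The penalty values and the 'seq'/'rand' tagging of ghost hits are assumptions made for illustration; the function names are hypothetical.

```python
# Memory allocation between the two destaging-area queues: each queue's
# ghost segment records blocks it recently evicted, and the benefit of
# one more segment is taken to be the total miss penalty of blocks that
# were re-requested while in the ghost. The queue with the larger
# benefit receives the next segment. Penalty values are illustrative.

SEQ_PENALTY_MS = 1.0    # assumed cost of re-reading a sequential block
RAND_PENALTY_MS = 10.0  # assumed cost of re-reading a random block

def ghost_benefit(ghost_hits):
    """Total miss penalty seen in a ghost segment.
    ghost_hits: iterable of 'seq' / 'rand' tags, one per ghost hit."""
    return sum(RAND_PENALTY_MS if tag == 'rand' else SEQ_PENALTY_MS
               for tag in ghost_hits)

def pick_queue_to_grow(ghost_hits_by_queue):
    """Grow the queue whose ghost segment absorbed the larger penalty."""
    return max(ghost_hits_by_queue,
               key=lambda q: ghost_benefit(ghost_hits_by_queue[q]))
```

Weighting ghost hits by miss penalty rather than counting them is what makes the scheme penalty-aware: a few random-block ghost hits outweigh many cheap sequential ones.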

  23. Summary: the DiskSeen scheme • Using the block table, disk layout information is easily examined; • Prefetching works directly on the disk layout and is history-aware; • Block replacement is miss-penalty aware. [Diagram: buffer cache divided into caching, prefetching, and destaging areas]

  24. Outline • Inadequacy of current buffer cache management in OS; • Managing disk layout information; • Sequence-based and history-aware prefetching; • Miss-penalty aware caching; • Performance evaluation in a Linux implementation; • Related work; • Summary.

  25. The DiskSeen prototype in Linux 2.6.11 • Prefetches blocks directly on the block device; • Blocks without disk mappings are treated as random blocks; • The segment size for memory allocation is 128KB; • Test platform: Intel P4 3.0GHz processor, 512MB of memory, and a 160GB, 7200 RPM Western Digital hard disk; • The file system is Ext3.

  26. Benchmarks • grep: searches a collection of files for lines matching a given regular expression; • CVS: a version control system; • TPC-H: a decision-support benchmark; • BLAST: a tool that searches multiple databases to match nucleotide or protein sequences.

  27. Execution times of benchmarks [Chart: benchmark execution times without DiskSeen, with DiskSeen (1st run), and with DiskSeen (2nd run); annotated improvements of 25% and 74%]

  28. Disk head movements (CVS) [Plots: disk head position over time without DiskSeen and with DiskSeen, showing movements between the CVS repository and the working directory; prefetching activity is marked]

  29. Disk head movement (grep) [Plots: disk head position over time without DiskSeen and with DiskSeen]

  30. Disk head movement (grep) [Plots: without DiskSeen and with DiskSeen, with data blocks and inode blocks distinguished]

  31. Disk head movement (grep) [Plots: without DiskSeen and with DiskSeen, with prefetching activity marked]

  32. Outline • Inadequacy of current buffer cache management in OS; • Managing disk layout information; • Sequence-based and history-aware prefetching; • Miss-penalty aware caching; • Performance evaluation in a Linux implementation; • Related work; • Summary.

  33. Related work • Caching • Numerous replacement algorithms have been proposed: LRU, FBR [Robinson, 90], LRU-K [O’Neil, 93], 2Q [Johnson, 94], LRFU [Lee, 99], EELRU [Smaragdakis, 99], MQ [Zhou, 01], LIRS [Jiang, 02], ARC [Modha, 03], CLOCK-Pro [Jiang, 05], … • They all aim only to minimize the number of misses, rather than the disk time incurred by misses.

  34. Related work (Cont’d) • Prefetching based on complex models: • Probability graph model [Vellanki, 99]; • Information-theoretic Lempel-Ziv algorithm [Curewitz, 93]; • Time series model [Tran, 01]. • Commercial systems have rarely used such sophisticated prediction schemes.

  35. Related work (Cont’d) • Prefetching • Hint-based prefetching [Patterson 95, Cao 96, Brown 01]; • Across-file prefetching [Griffioen 94, Kroeger 96, Kroeger 01]: • Uses the file as the prefetching unit; • High overhead and inconsistent performance; • Unsuitable for a general-purpose OS; • Applicable only to application-level prefetching, such as web proxy/server file prefetching [Chen 03].

  36. Related work (Cont’d) • Improving disk placement • Cluster related data and metadata into the same cylinder group [McKusick 84, Ganger 97]; • Reorganize disk block placement [Black 91, Hsu 03]; • Dynamically create block replicas [Huang 05]. • DiskSeen is more flexible and more responsive to changing access patterns.

  37. Summary • Disk performance is constrained by random accesses; • The buffer cache plays a critical role in reducing random accesses; • The lack of disk layout information in the buffer cache limits its performance; • DiskSeen improves disk performance by considering both temporal and spatial locality; • A Linux kernel implementation shows significant performance improvements for various types of workloads.
