
Storage Systems CSE 598d, Spring 2007



Presentation Transcript


  1. Storage Systems CSE 598d, Spring 2007 Lecture 11: Disk scheduling Feb 27, 2007 (ACK: Several slides borrowed from Shiva Chaitanya)

  2. Disk Access Time: Components • CPU time to issue and process I/O • contention for controller • contention for bus • contention for memory • verifying block correctness with checksums (retransmissions) • waiting in scheduling queues • ...

  3. Disk Scheduling Seek time is a dominant component of total disk I/O time. Let the operating system or disk controller choose which request to serve next, depending on the head’s current position and the requested block’s position on disk. Disk scheduling is much harder than CPU scheduling: • a mechanical device – hard to determine (accurate) access times • a disk access cannot be preempted – it runs until it finishes • disk I/O is often the main performance bottleneck

  4. Scheduling at Multiple Locations! S/W and H/W components between an application and the disk – each a possible scheduling location: file system, device driver, SCSI bus, RAID controller (if employing RAID), some bus, disk controller. Why? • Why not do it only at the FS/DD level? • Why not do it only within the disk?

  5. Scheduling at Multiple Locations! Why? • Key ideas that disk scheduling employs: • Request re-ordering for seek/positioning minimization • Exploit temporal locality • Anticipation for sequential streams • Introduce non-work conserving behavior! • Exploit spatial locality • Coalesce consecutively placed requests • Free-block scheduling • Different optimizations are best done at different locations • Furthermore, the best location to do an optimization depends on the workload!

  6. Goals • Short response time • High overall throughput • Fairness (equal probability for all blocks to be accessed in the same time) Tradeoff: Throughput vs. Fairness Socialism vs. Capitalism?

  7. Disk Scheduling Several traditional algorithms • First-Come-First-Serve (FCFS) • Shortest Seek Time First (SSTF) • Shortest Positioning Time First (SPTF) • SCAN • C-SCAN • LOOK • C-LOOK • …

  8. First–Come–First–Serve (FCFS) (figure: schedule plotted over cylinders 1–25 against time) FCFS serves the first arriving request first: • long seeks • short average response time. Incoming requests (in order of arrival): 21, 8, 7, 2, 14, 24, 12
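The FCFS schedule can be sketched in a few lines of Python. The head's starting cylinder is not stated on the slide, so 12 is an assumed value, and `fcfs` is a hypothetical helper name:

```python
def fcfs(head, requests):
    """Serve requests strictly in arrival order; return the schedule and
    the total seek distance in cylinders."""
    schedule, total = [], 0
    for r in requests:
        total += abs(r - head)  # seek from the current head position
        head = r
        schedule.append(r)
    return schedule, total

# Arrival order from the slide; head start of 12 is an assumption.
order, seek = fcfs(12, [21, 8, 7, 2, 14, 24, 12])
```

Serving in strict arrival order incurs every back-and-forth seek, which is why FCFS saturates quickly as the load grows.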

  9. Shortest Seek Time First (SSTF) (figure: schedule plotted over cylinders 1–25 against time) SSTF serves the closest request first: • short seek times • longer maximum seek times – may lead to starvation. Incoming requests (in order of arrival): 24, 21, 8, 7, 2, 14, 12
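SSTF is a greedy pick of the nearest pending request. A minimal sketch (head start of 12 is assumed, as the slide does not state it; `sstf` is a hypothetical name):

```python
def sstf(head, requests):
    """Repeatedly serve the pending request closest to the head.
    Greedy on seek distance; distant requests can starve."""
    pending, schedule, total = list(requests), [], 0
    while pending:
        nxt = min(pending, key=lambda r: abs(r - head))
        pending.remove(nxt)
        total += abs(nxt - head)
        head = nxt
        schedule.append(nxt)
    return schedule, total

# Arrival order from the slide; head start of 12 is an assumption.
order, seek = sstf(12, [24, 21, 8, 7, 2, 14, 12])
```

Note how request 24 – the first to arrive – is served last: the starvation risk the slide warns about.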

  10. SCAN (figure: schedule plotted over cylinders 1–25 against time) SCAN moves the head from edge to edge and serves requests on the way: • bi-directional • a compromise between response time and seek time optimization. Incoming requests (in order of arrival): 14, 21, 24, 7, 12, 2, 8
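A sketch of one upward-then-downward SCAN sweep, assuming cylinders 1–25 (from the figure's axis) and a head start of 12; note that SCAN travels all the way to the edge even when no request lies there (LOOK drops that extra travel):

```python
def scan(head, requests, lo=1, hi=25):
    """One SCAN pass moving toward higher cylinders first: serve requests
    on the way up, reverse at the edge, serve the rest on the way down."""
    up = sorted(r for r in requests if r >= head)
    down = sorted((r for r in requests if r < head), reverse=True)
    schedule = up + down
    # Seek distance: head -> high edge, then high edge -> lowest served request.
    total = (hi - head) + ((hi - down[-1]) if down else 0)
    return schedule, total

# Arrival order from the slide; head start of 12 is an assumption.
order, seek = scan(12, [14, 21, 24, 7, 12, 2, 8])
```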

  11. C–SCAN (figure: schedule plotted over cylinders 1–25 against time) Circular SCAN moves the head from edge to edge: • serves requests in one direction only – uni-directional • improves response time (fairness). Incoming requests (in order of arrival): 7, 12, 2, 21, 14, 8, 24

  12. LOOK and C–LOOK (figure: schedule plotted over cylinders 1–25 against time) LOOK (C-LOOK) is a variation of SCAN (C-SCAN): • same schedule as SCAN • does not run to the edges – stops and reverses at the outermost and innermost requests • increased efficiency. Incoming requests (in order of arrival): 12, 14, 2, 24, 7, 21, 8
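C-LOOK serves upward to the highest pending request, then jumps back to the lowest and continues upward, without visiting either disk edge. A sketch (head start of 12 assumed; the wrap-around jump is counted here as one long seek, which is a modeling choice):

```python
def c_look(head, requests):
    """C-LOOK: serve upward to the highest pending request, then jump
    to the lowest pending request and continue upward."""
    up = sorted(r for r in requests if r >= head)
    wrapped = sorted(r for r in requests if r < head)
    schedule = up + wrapped
    total, pos = 0, head
    for r in schedule:
        total += abs(r - pos)  # includes the single wrap-around jump
        pos = r
    return schedule, total

# Arrival order from the slide; head start of 12 is an assumption.
order, seek = c_look(12, [12, 14, 2, 24, 7, 21, 8])
```

Serving in one direction only keeps response times more uniform across cylinders than plain LOOK, at the cost of the wrap-around seek.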

  13. V–SCAN(R) V-SCAN(R) combines SCAN (or LOOK) and SSTF • define an R-sized unidirectional SCAN window, i.e., C-SCAN, and use SSTF outside the window • Example: V-SCAN(0.6) • makes a C-SCAN (C-LOOK) window over 60 % of the cylinders • uses SSTF for requests outside the window • V-SCAN(0.0) is equivalent to SSTF • V-SCAN(1.0) is equivalent to SCAN • V-SCAN(0.2) has been suggested as an appropriate configuration
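One common way to realize this SSTF/SCAN blend (following the classic V(R) formulation, which matches the slide's endpoints: R = 0 gives SSTF, R = 1 gives SCAN) is to charge a direction-reversal penalty proportional to R. This is a sketch under that assumption, with hypothetical names:

```python
def v_scan_next(head, direction, pending, R, n_cyl=25):
    """Pick the next request under V(R): seek distance in the current
    sweep direction counts as-is, while reversing direction adds a
    penalty of R * n_cyl.  R = 0.0 degenerates to SSTF; R = 1.0
    behaves like SCAN (reversals are almost never worthwhile)."""
    def effective(r):
        d = abs(r - head)
        same_direction = (r - head) * direction >= 0
        return d if same_direction else d + R * n_cyl
    return min(pending, key=effective)

# Head at cylinder 12 sweeping upward: V(0) takes the nearer request
# behind the head (pure SSTF); V(1) keeps sweeping upward (SCAN-like).
```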

  14. Shortest Positioning Time First (SPTF) • Given complete knowledge of the actual mapping of data blocks onto the media, the scheduler can choose the request with the minimum positioning delay (combined seek and rotational latency) • SPTF, like SSTF, suffers from poor starvation resistance. To reduce response time variance, priority can be given to requests that have been in the pending queue for excessive periods of time

  15. Aged Shortest Positioning Time First (ASPTF) • ASPTF(W) adjusts each positioning delay (Tpos) by subtracting a weighted value corresponding to the amount of time the request has been waiting for service (Twait) Teff = Tpos – (W*Twait) • For large values of W, ASPTF behaves like FCFS
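The aging rule above is a one-liner over (positioning delay, waiting time) pairs; with W = 0 it reduces to plain SPTF, and as W grows the waiting-time term dominates, pushing the choice toward the oldest request (FCFS-like behavior). A minimal sketch with a hypothetical helper name:

```python
def asptf_next(pending, W):
    """pending: list of (t_pos, t_wait) pairs - positioning delay and
    time already spent waiting.  Pick the request minimizing
    T_eff = T_pos - W * T_wait."""
    return min(pending, key=lambda p: p[0] - W * p[1])

# A fresh nearby request vs. an older, farther one (illustrative values).
fresh, old = (5.0, 0.0), (8.0, 10.0)
```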

  16. Scheduling in Modern Disk Drives Features of current disk drives that affect traditional scheduling algorithms • Host interface • Data layout • On-board cache Ref: B.L. Worthington, G.R. Ganger, Y.N. Patt: “Scheduling Algorithms for Modern Disk Drives”, ACM SIGMETRICS 1994

  17. Host interface • Controller presents a request to the disk drive in terms of the starting logical block number and request size • Subsequent media access hidden from the host • Scheduling entities outside of the drive have little knowledge of overhead delays

  18. Data Layout • Many systems assume sequentiality of LBN-to-PBN mappings in seek reducing algorithms • Aggressive algorithms require highly accurate knowledge of the data layout which is typically hidden • Complexity of mappings increased by zoned recording, track/cylinder skew and defect management

  19. On-Board Cache Memory within disk drives has progressed from small speed-matching buffers to megabytes of cache memory. Disk logic typically prefetches data into the cache to satisfy sequential read requests. This affects scheduling in two ways: • the position of the head cannot be determined easily • requests that can be satisfied from the cache could be given higher priority

  20. Scheduling by Logical Block Number As expected, FCFS quickly saturates as workload increases; SSTF provides lower mean response time

  21. Scheduling by Logical Block Number FCFS has the lowest coefficient of variation for lighter workloads. As FCFS begins to saturate and its response time variance increases, C-LOOK emerges as the better algorithm for response time variance

  22. Scheduling with Full Knowledge As W increases, the average response time slowly grows, though variance drops

  23. Scheduling with Full Knowledge

  24. Modern Disk Scheduling • In modern drives, C-LOOK best exploits the prefetching cache for workloads with significant read sequentiality • SSTF and LOOK perform better for random workloads • Powerful disk controllers use variants of Shortest Positioning Time First (SPTF).

  25. Freeblock Scheduling • An approach to utilizing more of a disk’s potential media bandwidth • Fill rotational latency periods with useful media transfers for background applications • It has been observed that 20–50% of a never-idle disk’s bandwidth can often be provided to background applications without affecting foreground response times Ref: Christopher R. Lumb, Jiri Schindler, Greg Ganger: “Towards Higher Disk Head Utilization: Extracting Free Bandwidth From Busy Disk Drives”, OSDI 2000

  26. Disk-intensive background tasks • Disk Reorganization • File system cleaning • Backup • Prefetching • Write-back • Integrity Checking • RAID scrubbing • Virus detection • Index Reorganization • …

  27. Free Bandwidth • Time required for a disk media access: Taccess = Tseek + Trotate + Ttransfer • Freeblock scheduling uses the Trotate component of a disk access to transfer additional data • Instead of just waiting for the desired sector to arrive, this technique transfers the intermediate sectors
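The Taccess decomposition makes the opportunity easy to quantify: the rotational-latency share of an access is the window freeblock scheduling can fill. A sketch with illustrative (assumed) timings:

```python
def free_fraction(t_seek, t_rotate, t_transfer):
    """Fraction of a media access spent in rotational latency - the
    window freeblock scheduling can fill with background transfers.
    Taccess = Tseek + Trotate + Ttransfer, per the slide."""
    t_access = t_seek + t_rotate + t_transfer
    return t_rotate / t_access

# Assumed timings in ms: 4 ms seek, 3 ms rotational latency, 1 ms
# transfer.  Rotational latency is then 37.5% of the access.
f = free_fraction(4.0, 3.0, 1.0)
```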

  28. Steps in Freeblock Scheduling • Predict how much rotational latency will occur before the next foreground media transfer • Requires detailed knowledge of disk attributes, including layout algorithms and time dependent mechanical positioning overheads • Squeeze additional media transfers into that time • Get to the destination track in time for the foreground transfer

  29. Anticipatory Disk Scheduling Reorder available disk requests for • performance, by seek optimization • proportional resource allocation, etc. Any policy needs multiple outstanding requests to make good decisions! Ref: Sitaram Iyer, Peter Druschel: “Anticipatory scheduling: A disk scheduling framework to overcome deceptive idleness in synchronous I/O”, SOSP 2001

  30. With enough requests… (figure: requests issued by processes A and B, disk location against time, served with short seeks) e.g., throughput = 21 MB/s (IBM Deskstar disk)

  31. With synchronous I/O… (figure: each process’s next request arrives “too late”, forcing the scheduler into long seeks between A and B) e.g., throughput = 5 MB/s

  32. Deceptive idleness Process A is about to issue its next request, but the scheduler hastily assumes that process A has no further requests!

  33. Proportional scheduler Allocate disk service in, say, a 1:2 ratio: deceptive idleness causes a 1:1 allocation instead (figure: service alternating A, B, A, B)

  34. Anticipatory scheduling Key idea: Sometimes wait for process whose request was last serviced. Keeps disk idle for short intervals. But with informed decisions, this: • Improves throughput • Achieves desired proportions

  35. Cost-benefit analysis Balance expected benefits of waiting against cost of keeping disk idle. Tradeoffs sensitive to scheduling policy e.g., 1. seek optimizing scheduler 2. proportional scheduler

  36. Statistics For each process, measure: 1. expected median and 95th-percentile think time 2. expected positioning time (figure: think-time histogram – number of requests against think time between the last and next request, with the median and 95th percentile marked)

  37. Cost-benefit analysis for a seek-optimizing scheduler best := best available request chosen by the scheduler next := expected forthcoming request from the process whose request was last serviced Benefit = best.positioning_time — next.positioning_time Cost = next.median_thinktime Waiting_duration = (Benefit > Cost) ? next.95percentile_thinktime : 0
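The decision rule above translates directly into code. A minimal sketch (all parameter names are paraphrases of the slide's terms; times are in arbitrary units):

```python
def waiting_duration(best_pos_time, next_pos_time,
                     next_median_think, next_95pct_think):
    """Seek-optimizing anticipation rule: wait (for up to the
    95th-percentile think time) only when the positioning time saved
    by the anticipated request exceeds the expected idle cost."""
    benefit = best_pos_time - next_pos_time  # positioning time saved
    cost = next_median_think                 # expected idle time
    return next_95pct_think if benefit > cost else 0.0
```

So a large saving against a fast-thinking process justifies idling the disk briefly, while a marginal saving does not.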

  38. Proportional scheduler Costs and benefits differ by policy; e.g., for a proportional scheduler: wait for the process whose request was last serviced 1. if it has received less than its allocation, and 2. if its think time is below a threshold (e.g., 3 ms) Waiting_duration = next.95percentile_thinktime
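The proportional variant's wait test is a conjunction of the two conditions on the slide. A sketch with hypothetical names (the 3 ms threshold is the slide's example value):

```python
def should_wait(serviced, allocation, think_time_ms, threshold_ms=3.0):
    """Proportional-scheduler anticipation rule: keep the disk idle for
    the last-serviced process only while it is under its allocation
    and thinks quickly enough to be worth waiting for."""
    return serviced < allocation and think_time_ms < threshold_ms
```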

  39. Prefetch Overlaps computation with I/O. Side-effect: avoids deceptive idleness! • Application-driven • Kernel-driven

  40. Conclusion Anticipatory scheduling: • overcomes deceptive idleness • achieves significant performance improvement on real applications • achieves desired proportions • and is easy to implement!

  41. Fairness: Evaluating disk scheduling algorithms • Storage system designers prefer to keep the queue length at disks small regardless of the load • When the queuing threshold is reached at the disk, the controller or the device driver queues the requests until the disk queue is processed • A low queuing threshold minimizes request starvation at the disk level when unfair scheduling algorithms are deployed Ref: Alma Riska, Erik Riedel: “It’s not fair – evaluating efficient disk scheduling”, MASCOTS 2003

  42. Results • Queuing more requests at the disk gives the scheduling algorithms more information, which is used for better disk resource utilization • The percentage of requests starved remains small even if longer queues build up at the disk • Overall request starvation is independent of the queuing threshold at the disk

  43. Storage subsystem architecture Queues exist at various levels: outstanding requests are queued at the disk and at the device driver in a single-disk system, and at the disks and the controller(s) in a multiple-disk system

  44. Impact of queuing thresholds (figures: average loads of 64 and of 16 outstanding requests in the system)

  45. Response time distribution • The higher the load, the larger the gap between the performance of different scheduling algorithms • Fair but simple FCFS yields the longest average request response time • The best performance is obtained when increasing the queue threshold under SPTF • What about request starvation and variability in the request response time?

  46. Response time distribution (figures: tails of the response time distribution with an average load of 16 outstanding requests, for queuing thresholds of 8 and of 16)

  47. Observations • The majority of requests under FCFS exhibit long response times, while seek-reducing algorithms result in a majority of short response times • More than 90% of requests under SPTF have shorter response times than under FCFS, and only 1% exhibit up to double the FCFS response times • The amount of starvation in position-based scheduling algorithms is the same relative to FCFS for both queuing thresholds • Hence, queuing more requests improves disk performance without introducing more request starvation

  48. Scheduling at Device Driver Level • Depends on workload and filesystem layout • E.g., with SCAN, seek times to sectors in the middle of the disk are shorter • The OS could choose between algorithms based on the current queue • likely to be expensive in CPU cycles • the queue changes as new requests arrive • SSTF or SCAN are reasonable defaults • Allow algorithm selection as part of OS tuning • FreeBSD: C-SCAN • Linux 2.2: SCAN • Linux 2.6: four different versions of the elevator algorithm

  49. Discussion: Scheduling at Multiple Locations • Positioning-based optimizations are best done within the disk • Seek-based optimizations are best done at the device driver • Why do scheduling within the FS? • device- and DD-independent • aware of the buffer cache • application isolation • Disk queue length is crucial • a short queue results in degraded throughput – locally good but globally bad schedules • a long queue results in unfairness • Non-work conservation can improve fairness and throughput! • anticipatory scheduling • Achieving proportional fairness is non-trivial • solutions based on a hierarchy of queues and anticipatory scheduling can help • Request coalescing can result in a great improvement in throughput • FS and device driver are good places • improves the sequentiality of the request stream seen by the disk • Free-block scheduling can improve throughput • can be viewed as a “corrector” for the non-work-conserving nature of the disk

  50. Additional slides on free-block scheduling
