
Storage Systems – Part II



  1. INF5070 – Media Storage and Distribution Systems: Storage Systems – Part II 27/10 - 2003

  2. Overview • Data placement • Multiple disks • Managing heterogeneous disks • Prefetching • Memory caching

  3. Data Placement on Disk

  4. Data Placement on Disk • Disk blocks can be assigned to files in many ways, and several placement schemes exist • some optimize for latency • some for increased throughput • which is best is access-pattern dependent

  5. Disk Layout
  • Constant angular velocity (CAV) disks
    • equal amount of data in each track (and thus constant transfer time)
    • constant rotation speed
  • Zoned CAV disks
    • zones are ranges of tracks; typically only a few zones
    • the different zones hold different amounts of data and thus deliver different bandwidth, i.e., more data and higher transfer rates on the outer tracks

  6. Disk Layout • Cheetah X15.3 is a zoned CAV disk: • always place often-used or high-rate data on the outermost tracks (zone 0) …!? • NO, arm movement is often more important than transfer time

  7. Data Placement on Disk
  • Contiguous placement stores the disk blocks of a file contiguously on disk
    • minimal disk arm movement when reading the whole file (no intra-file seeks)
    • possible advantage: the head need not move between read operations, so with no seeks or rotational delays one can approach the theoretical transfer rate
      • often WRONG in practice: other files are read as well
    • real advantage: the block (read operation) size need not be pre-determined (whatever amount is read, at most track-to-track seeks are performed)
    • no inter-operation gain if disk accesses are unpredictable
  [figure: files A, B and C each stored in contiguous runs of blocks]

  8. Using Adjacent Sectors, Cylinders and Tracks • To avoid seek time (and possibly rotational delay), we can store data likely to be accessed together on • adjacent sectors (similar to using larger blocks) • if the track is full, use another track on the same cylinder (only use another head) • if the cylinder is full, use next (adjacent) cylinder (track-to-track seek)

  9. Data Placement on Disk
  • Interleaved placement stores the blocks of a file with a fixed number of other blocks in between each block
    • minimal disk arm movement when reading files A, B and C (started at the same time)
    • fine for predictable workloads reading multiple files
    • no gain if disk accesses are unpredictable
  • Non-interleaved (or even random) placement can be used for highly unpredictable workloads
  [figure: blocks of files A, B and C interleaved on disk]

  10. Data Placement on Disk
  • Organ-pipe placement considers the usual disk head position
    • place the most popular data where the head most often is
    • on CAV disks, the center of the disk (the middle cylinders) is closest to the head on average
    • but a bit outward for zoned CAV disks (modified organ-pipe) – the skew depends on the tradeoff between zoned transfer time and storage capacity vs. seek time
  [figure: block access probability vs. cylinder number, peaked at the middle cylinders for organ-pipe and skewed toward the outermost tracks for modified organ-pipe]
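As a sketch of the idea (the helper names are mine, not from the slides), organ-pipe placement amounts to ordering cylinders by distance from the usual head position and assigning the most popular blocks first:

```python
# Hypothetical sketch of organ-pipe placement: the most popular blocks are
# placed on the cylinder where the head usually is, alternating outward.

def organ_pipe_order(num_cylinders, head_cylinder):
    """Cylinders ordered by distance from the usual head position."""
    return sorted(range(num_cylinders), key=lambda c: abs(c - head_cylinder))

def place_blocks(blocks_by_popularity, num_cylinders, head_cylinder):
    """Assign blocks (most popular first) to cylinders near the head."""
    order = organ_pipe_order(num_cylinders, head_cylinder)
    return {b: order[i] for i, b in enumerate(blocks_by_popularity)}

# Plain CAV disk: head_cylinder is the middle cylinder; for a zoned CAV
# disk one would skew head_cylinder a bit outward (modified organ-pipe).
placement = place_blocks(["hot", "warm", "cool", "cold"], 10, head_cylinder=5)
```

With the numbers above, "hot" lands on cylinder 5 and colder blocks farther out on either side.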

  11. Fast File System
  • FFS is a general file system
    • idea: keep an inode and its associated blocks close (no long seeks between getting the inode and the data)
    • organizes the disk in partitions – cylinder groups – each holding several inodes, a free-block bitmap, …
    • tries to store a file within one cylinder group, allocating in this order:
      • next block on the same cylinder
      • a block within the cylinder group
      • a block in another group, found using a hash function
      • search all cylinder groups for a free block

  12. Log-Structured File System
  • Log-structured placement is based on the assumptions (facts?) that
    • RAM is getting larger
    • write operations are the most expensive
    • reads can often be served from the buffer cache (!!??)
  • Organize disk blocks as a circular log
    • periodically, all pending (so far buffered) writes are performed as one batch
    • write to the next free block regardless of content (inode, directory, data, …)
    • a cleaner reorganizes holes and deleted blocks in the background
    • blocks are stored contiguously when a single file is written
    • efficient for small writes; other operations perform as in a traditional UNIX FS

  13. Minorca File System
  • Minorca is a multimedia file system (from IFI/UiO)
    • enhanced allocation of disk blocks for contiguous storage of media files
    • supports both continuous and non-continuous files in the same system using different placement policies
  • Multimedia-Oriented Split Allocation (MOSA) – one file system, two sections
    • cylinder group sections (CGSs) for non-continuous files
      • like traditional BSD FFS disk partitions
      • small block sizes (like 4 or 8 KB)
      • traditional FFS operations
    • extent sections for continuous files
      • extents contain one or more (adjacent) CGSs – summary information, allocation bitmap, data block area
      • each extent is expected to store one media file
      • large block sizes (e.g., 64 KB)
      • new “transparent” file operations, create a file using O_CREATEXT
  [figure: disk divided into cylinder groups followed by a sequence of extents]

  14. Minorca File System
  • Count-augmented address indexing in the extent section
    • observation: indirect block reads introduce extra disk I/O and break access locality (e.g., for the inode)
    • introduce a new inode structure: add a counter field to the original direct entries – direct points to a disk block, and count indicates how many blocks follow the first one contiguously
    • if contiguous allocation is assured, each direct entry can thus address many more blocks without additionally retrieving an indirect block
  [figure: inode with attributes, direct entries 0–11 each paired with a count field, plus single, double and triple indirect pointers]
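A minimal sketch of how count-augmented direct entries resolve a logical block number (the entry layout is simplified and the field names are illustrative):

```python
# Sketch of count-augmented address indexing: each direct entry is a
# (first_block, count) pair covering `count` contiguously allocated disk
# blocks, so one entry maps a whole extent without an indirect block read.

def lookup(direct_entries, file_block_no):
    """Resolve a logical file block number to a physical disk block."""
    for first_block, count in direct_entries:
        if file_block_no < count:
            return first_block + file_block_no
        file_block_no -= count
    raise ValueError("not covered by direct entries; indirect block needed")

# Two contiguous extents: disk blocks 100..103 and 500..501.
entries = [(100, 4), (500, 2)]
```

Here `lookup(entries, 3)` hits the fourth block of the first extent, while `lookup(entries, 4)` falls through to the second extent, all without extra I/O.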

  15. Other File Systems Examples • Contiguous allocation • Presto • similar to Minorca extents for continuous files • does not support small, discrete files • Fellini • simple flat file system • maintains a free-block list, grouping contiguous blocks • Continuous Media File System • ... • Several systems use multiple disks and stripe data • Symphony • Tiger Shark • Tiger • ...

  16. Multiple Disks

  17. Multiple Disks
  • Disk controllers and buses can manage several devices
  • Total system performance can be improved by replacing one large disk with many small ones accessed in parallel
  • Several independent heads can read simultaneously (if the rest of the system can keep up with the speed)
  Note: the single disk might be faster, but as seek time and rotational delay are the dominant factors of total disk access time, the two smaller disks might together operate faster, performing seeks in parallel...
  [figure: one single disk vs. two smaller disks]

  18. Striping
  • Another reason to use multiple disks: one disk cannot deliver the requested data rate
  • In such a scenario, one might use several disks for striping:
    • disk bandwidth: Bdisk
    • required bandwidth: Bdisplay
    • Bdisplay > Bdisk
    • read from n disks in parallel: n Bdisk > Bdisplay
    • clients are serviced in rounds
  • Advantages
    • high data rates – higher transfer rate compared to one disk
  • Drawbacks
    • cannot serve multiple clients in parallel
    • positioning time increases (i.e., reduced efficiency)
  [figure: a server striping each client’s data over all disks, serving clients 1–5 in rounds]
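The bandwidth condition above amounts to a one-line calculation; the bandwidth figures below are made up for illustration:

```python
import math

# Striping arithmetic from the slide: playout needs B_display, one disk
# delivers B_disk, so we read from n disks in parallel where
# n * B_disk covers B_display (positioning overhead ignored).

def disks_needed(b_display, b_disk):
    """Smallest n such that n * b_disk >= b_display."""
    return math.ceil(b_display / b_disk)

n = disks_needed(b_display=200, b_disk=60)  # illustrative Mbit/s figures
```

For a 200 Mbit/s stream on 60 Mbit/s disks this gives a stripe group of 4 disks.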

  19. Interleaving (Compound Striping)
  • Full striping is usually not necessary today:
    • faster disks
    • better compression algorithms
  • Interleaving lets each client be serviced by only a subset of the available disks
    • make groups
    • ”stripe” data such that consecutive requests arrive at the next group (here, each disk is a group)
  [figure: clients 1–3 each served by a different disk group]

  20. Interleaving (Compound Striping)
  • Divide the traditional striping group into sub-groups, e.g., staggered striping
  • Advantages
    • multiple clients can still be served in parallel
    • more efficient use of the disks
    • potentially shorter response time
  • Drawbacks
    • load imbalance (if all clients access the same group)
  [figure: blocks X0,0/X0,1 … X3,0/X3,1 staggered across sub-groups of disks]

  21. Mirroring
  • With multiple disks, the situation may arise where all requests are for one of the disks and the rest lie idle
  • In such cases, it makes sense to have replicas of the data on several disks – if the disks are identical, this is called mirroring
  • Advantages
    • faster response time
    • survives crashes – fault tolerance
    • load balancing by dividing the requests for the data equally among the mirrored disks
  • Drawbacks
    • increased storage requirements and more expensive write operations

  22. Redundant Array of Inexpensive Disks • The various RAID levels define different disk organizations to achieve higher performance and more reliability • RAID 0 - striped disk array without fault tolerance (non-redundant) • RAID 1 - mirroring • RAID 2 - memory-style error correcting code (Hamming Code ECC) • RAID 3 - bit-interleaved parity • RAID 4 - block-interleaved parity • RAID 5 - block-interleaved distributed-parity • RAID 6 - independent data disks with two independent distributed parity schemes (P+Q redundancy) • RAID 7 • RAID 10 • RAID 53 • RAID 1+0

  23. Redundant Array of Inexpensive Disks • RAID is intended ... • ... for general systems • ... to give higher throughput • ... to be fault tolerant • For multimedia systems, some requirements are still missing: • low latency • guaranteed response time • optimizations for linear access to large objects • optimizations for cyclic operations • …

  24. Replication
  • In traditional RAID systems, replication is often used for fault tolerance (and for higher performance in the newer combined levels)
  • Replication in multimedia systems is used for
    • reducing hot spots
    • increasing scalability
    • higher performance
    • …
    • and fault tolerance is often a side effect 
  • Replication in multimedia scenarios should
    • be based on observed load
    • change dynamically as popularity changes

  25. Dynamic Segment Replication (DSR)
  • DSR tries to balance load by dynamically replicating hot data
    • assumes read-only, VoD-like retrieval
    • predefines a load threshold for when to replicate a segment, examining current and expected load
    • uses copyback streams
  • Replicate when the threshold is reached – but which segment, and where?
    • tries to find a lightly loaded device, based on future-load calculations
    • not necessarily the segment that receives the additional requests (another segment may have more requests)
    • replicates based on a payoff factor p: replicate the segment x with the highest p
  [formula: the payoff factor for an additional copy of segment i combines a weighting factor, the number of replicas of segment i, and a sum over the number of viewers of each segment j – i.e., the number of future viewers of segment i – giving the expected benefit of an additional copy]

  26. Some Challenges in Managing Multiple Disks • How large should a stripe group and a stripe unit be? • Can one avoid hot sets of disks (load imbalance)? • What and when to replicate? • How to handle heterogeneous disks?

  27. Heterogeneous Disks

  28. File Placement • A multimedia file might be stored (striped) on multiple disks, but how should one choose the devices? • storage devices are limited by both bandwidth and space • we have hot (frequently viewed) and cold (rarely viewed) files • we may have several heterogeneous storage devices • The objective of a file placement policy is maximum utilization of both bandwidth and space, and hence efficient usage of all devices by avoiding load imbalance • must consider expected load and storage requirement • should a file be replicated? • expected load may change over time

  29. Bandwidth-to-Space Ratio (BSR) – I
  • BSR attempts to mix hot and cold as well as large and small multimedia objects on heterogeneous devices
    • do not optimize placement based on throughput or space alone
    • BSR considers both the required storage space and the throughput requirement (which depends on playout rate and popularity) to achieve the best combined device utilization
  [figure: a media object’s space and bandwidth requirements (bandwidth may vary with popularity) matched against disks; a deviation in the ratio wastes either space or bandwidth, no deviation wastes neither]
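A toy illustration (not the published BSR algorithm) of why the ratio matters: filling a device with objects whose combined ratio deviates from the device's own leaves one resource stranded.

```python
# Toy bandwidth-to-space accounting: placing low-BSR objects on a
# high-BSR device leaves bandwidth unused once the space runs out.

def bsr(bandwidth, space):
    """Bandwidth-to-space ratio of a device or media object."""
    return bandwidth / space

def leftover(device_bw, device_space, objects):
    """Resources remaining after placing (bandwidth, space) objects."""
    used_bw = sum(b for b, _ in objects)
    used_space = sum(s for _, s in objects)
    return device_bw - used_bw, device_space - used_space

# 100 MB/s, 100 GB disk (BSR 1.0) filled with two BSR 0.2 objects:
free_bw, free_space = leftover(100, 100, [(10, 50), (10, 50)])
```

The space is exhausted while 80 MB/s of bandwidth sits idle; mixing in hot, small objects would have balanced the two.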

  30. Bandwidth-to-Space Ratio (BSR) – II
  • The BSR policy algorithm:
    • input: space and bandwidth requirements
    • phase 1: find a device to place the media object on according to BSR; if no device, or stripe of devices, can give sufficient space or bandwidth, then add replicas
    • phase 2: find devices for the needed replicas
    • phase 3: allocate the expected load to the replica devices according to the devices’ BSR
    • phase 4: if not enough resources are available, see if other media objects can delete replicas according to their current workload
    • all phases may be needed when adding a new media object or when the workload increases – for a decrease, only phase 3 (reallocation) is needed
  • Popular, high-data-rate movies should be on high-bandwidth disks

  31. Disk Grouping
  • Disk grouping is a technique to “stripe” (or fragment) data over heterogeneous disks
    • groups heterogeneous physical disks into homogeneous logical disks
    • the amount of data on each disk (the fragments) is determined so that the service time (based on worst-case seeks) is equal for all physical disks in a logical disk
    • the blocks of an object are placed on (and read from) the logical disks in a round-robin manner – all disks in a group are activated simultaneously
  [figure: disks 0–3 paired into logical disks 0 and 1; block X0 is split into fragments X0,0 and X0,1 read in parallel, so X0 and X1 are ready for display while X2 and X3 are being read]
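The equal-service-time condition can be approximated (ignoring the seek terms, which the real scheme includes as worst-case figures) by splitting a logical block in proportion to disk bandwidth:

```python
# Fragment sizing sketch for disk grouping: if transfer time dominates,
# giving each physical disk a share proportional to its bandwidth makes
# all disks in the group finish their fragment at the same time.

def fragment_sizes(block_size, disk_bandwidths):
    """Split a logical block over disks in proportion to their bandwidth."""
    total = sum(disk_bandwidths)
    return [block_size * bw / total for bw in disk_bandwidths]

# 1024 KB logical block over a 30 MB/s disk and a 10 MB/s disk:
sizes = fragment_sizes(1024, [30, 10])
```

The faster disk gets the larger fragment, and both transfers take the same time.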

  32. Staggered Disk Grouping
  • Staggered disk grouping is a variant of disk grouping that minimizes the memory requirement
    • reading and playout are organized differently: not all fragments of a logical block are needed at the same time
    • the first (and largest) fragment goes on the most powerful disk, etc.
    • fragments are read sequentially (later fragments need not be buffered for a long time)
    • display starts as soon as the largest fragment has been read
  [figure: same logical disks as before, but fragments X0,0, X1,0 and X2,0 are read and displayed in a staggered fashion]

  33. Disk Merging
  • Disk merging forms logical disks from capacity fragments of the physical disks
    • all logical disks are homogeneous
    • supports an arbitrary mix of heterogeneous disks (grouping needs equal groups)
    • starts by choosing how many logical disks the slowest device shall support (e.g., 1 for disks 1 and 3) and calculates the corresponding number for the more powerful devices (e.g., 1.5 for disks 0 and 2 if these disks are 1.5 times better)
    • most powerful scheme: most flexible (arbitrary mix of devices) and can be adapted to zoned disks (each zone considered a disk)
  [figure: five logical disks 0–4 holding blocks X0–X4, mapped onto four physical disks; block X2 spans fragments X2,0 and X2,1 on two physical disks]
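The bookkeeping in the slide's example can be sketched as follows (the speed values are illustrative; fractional logical disks from different physical disks are later combined into whole ones):

```python
# Disk merging sketch: choose how many logical disks the slowest device
# supports, then scale by relative performance for the faster devices.

def logical_disks(disk_speeds, slowest_count=1):
    """Logical disks contributed by each physical disk."""
    slowest = min(disk_speeds)
    return [slowest_count * speed / slowest for speed in disk_speeds]

# Disks 0 and 2 are 1.5x faster than disks 1 and 3, as in the slide:
counts = logical_disks([15, 10, 15, 10])
```

This yields 1.5 + 1.0 + 1.5 + 1.0 = 5 logical disks in total, matching the five logical disks in the figure.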

  34. Prefetching and Buffering

  35. Prefetching
  • If the access pattern can be predicted, performance can be sped up using prefetching
    • a video playout is often linear: an easy-to-predict access pattern
    • eases disk scheduling
    • read larger amounts of data per request
    • data is in memory when requested – reducing page faults
  • One simple (and efficient) way of prefetching is read-ahead:
    • read more than the requested block into memory
    • serve the next read requests from the buffer cache
  • Another way of prefetching is double (multiple) buffering:
    • read data into the first buffer
    • process the data in the first buffer and at the same time read data into the second buffer
    • process the data in the second buffer and at the same time read data into the first buffer
    • etc.

  36. Multiple Buffering
  • Example: we have a file with block sequence B1, B2, … and our program processes the data sequentially, i.e., B1, B2, …
  • Single buffer solution:
    • read B1 → buffer
    • process the data in the buffer
    • read B2 → buffer
    • process the data in the buffer
    • ...
  • If P = time to process a block, R = time to read in one block, and n = # blocks:
    • single buffer time = n (P + R)

  37. Multiple Buffering
  • Double buffer solution:
    • read B1 → buffer1
    • process the data in buffer1, read B2 → buffer2
    • process the data in buffer2, read B3 → buffer1
    • process the data in buffer1, read B4 → buffer2
    • ...
  • If P = time to process a block, R = time to read in one block, n = # blocks, and P ≥ R:
    • double buffer time = R + nP
  • If P < R, we can try to add buffers (n-buffering)
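The two timing formulas from the slides translate directly into code:

```python
# Buffering times from the slides: P = process time per block, R = read
# time per block, n = number of blocks.

def single_buffer_time(n, P, R):
    """Reads and processing strictly alternate: n * (P + R)."""
    return n * (P + R)

def double_buffer_time(n, P, R):
    """With P >= R, each read hides behind processing: R + n * P."""
    if P < R:
        raise ValueError("P < R: add more buffers (n-buffering)")
    return R + n * P

# 10 blocks, 2 ms processing, 1 ms reading per block:
t_single = single_buffer_time(10, 2, 1)   # 30 ms
t_double = double_buffer_time(10, 2, 1)   # 21 ms
```

With these numbers, double buffering hides 9 of the 10 reads behind processing.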

  38. Memory Caching

  39. Is Caching Useful in a Multimedia Scenario?
  • High-rate data may need lots of memory for caching…
  • Tradeoff: amount of memory, algorithm complexity, gain, …
  • Cache only frequently used data – but how? (e.g., only the first (small) parts of a broadcast partitioning scheme, allow “top-ten” titles only, …)
  [figure: the maximum amount of memory (in total) that a Dell server could manage in 2002 – and not all of it is used for caching]

  40. Need for Special “Multimedia Algorithms”?
  • Most existing systems use an LRU variant, e.g.,
    • keep a sorted list
    • replace the first element in the list
    • insert new data elements at the end
    • if a data element is re-accessed, move it back to the end of the list
  • Extreme example – video frame playout: play a 7-frame video, then rewind and restart the playout at frame 1; after each round, LRU has just evicted the frame that is needed next
  • So the answer is in many cases YES…
  [figure: LRU buffer contents, ordered from longest to shortest time since access, after each playout round]
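The slide's point can be reproduced with a minimal LRU cache (a sketch; here a 4-slot buffer for a 7-frame clip): on the second playout every access misses, because LRU always just evicted the frame needed next.

```python
from collections import OrderedDict

# Minimal LRU buffer: cyclic sequential access to a clip larger than the
# buffer defeats LRU completely (zero hits after the first round).

class LRUBuffer:
    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = OrderedDict()   # least recently used first
        self.hits = self.misses = 0

    def access(self, frame):
        if frame in self.entries:
            self.hits += 1
            self.entries.move_to_end(frame)       # re-access: move to end
        else:
            self.misses += 1
            if len(self.entries) >= self.capacity:
                self.entries.popitem(last=False)  # evict least recently used
            self.entries[frame] = True

buf = LRUBuffer(capacity=4)
for frame in list(range(1, 8)) * 2:   # play frames 1..7, rewind, play again
    buf.access(frame)
```

All 14 accesses miss: the rewound playout chases LRU's evictions around the clip.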

  41. “Classification” of Mechanisms
  • Block-level caching considers a (possibly unrelated) set of blocks
    • each data element is viewed as an independent item
    • usually used in “traditional” systems, e.g., FIFO, LRU, CLOCK, …
    • multimedia approaches: Least/Most Relevant for Presentation (L/MRP), …
  • Stream-dependent caching considers a stream object as a whole
    • related data elements are treated in the same way
    • research prototypes in multimedia systems, e.g., BASIC, DISTANCE, Interval Caching (IC), Generalized Interval Caching (GIC), Split and Merge (SAM), SHR

  42. Least/Most Relevant for Presentation (L/MRP) [Moser et al. 95]
  • L/MRP is a buffer management mechanism for a single interactive, continuous data stream
    • adaptable to individual multimedia applications
    • supports pre-loading, i.e., prefetching data from disk
    • replaces the least relevant pages with regard to the current playout of the multimedia stream
    • COPUs – continuous object presentation units
  [figure: COPUs 10–26 along the playback direction, with classes for referenced, history and skipped COPUs; relevance values decrease with distance ahead of the current presentation point (COPU 16: 1.0, 18: 0.8, 20: 0.6, 22: 0.4, 24: 0.2, 26: 0)]
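A much-simplified sketch of the relevance idea (the real L/MRP assigns separate relevance curves to referenced, history and skipped COPUs; the 0.1-per-COPU step below merely mimics the 1.0 / 0.8 / 0.6 values in the figure):

```python
# Simplified relevance function in the spirit of L/MRP: COPUs ahead of
# the current presentation point lose relevance with distance; COPUs
# behind it (history) get a flat low value here for brevity.

def relevance(copu, current, step=0.1, history_value=0.1):
    """Relevance of a COPU given the current playout position."""
    distance = copu - current
    if distance < 0:
        return history_value                 # already played out
    return max(0.0, 1.0 - step * distance)   # prefetch candidates

def victim(copus, current):
    """Replace the least relevant COPU in the buffer."""
    return min(copus, key=lambda c: relevance(c, current))
```

Each round, the buffer evicts `victim(...)` and prefetches the most relevant missing COPU; recalculating relevance for all COPUs every round is exactly the cost noted on the next slide.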

  43. Least/Most Relevant for Presentation (L/MRP)
  • L/MRP …
    • … gives “few” disk accesses (compared to other schemes)
    • … supports interactivity
    • … supports prefetching
    • … is targeted at single streams (users)
    • … is expensive (!) to execute (relevance values are calculated for all COPUs each round)
  • Variations:
    • Q-L/MRP – extends L/MRP with multiple streams and changes the prefetching mechanism (reduces overhead) [Halvorsen et al. 98]
    • MPEG-L/MRP – gives different relevance values to different MPEG frames [Boll et al. 00]

  44. Interval Caching (IC)
  • Interval caching (IC) is a caching strategy for streaming servers
    • caches data between requests for the same video stream – based on the playout intervals between the requests
    • a following request is thus served from the cache (not the disk) filled by the preceding stream
    • intervals are sorted by length; the buffer requirement of an interval is the data size of the interval
    • to maximize the cache hit ratio (minimize disk accesses), the shortest intervals are cached first
  [figure: streams S11–S13, S21–S22 and S31–S34 on video clips 1–3 forming intervals I11, I12, I21 and I31–I33; the shortest intervals are cached first]
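A sketch of the selection step (assuming, for simplicity, a single clip, positions in data units, and an interval's length equal to its cache requirement):

```python
# Interval caching sketch: streams of the same clip at given playout
# positions form intervals between consecutive streams; the shortest
# intervals are cached first until the cache budget is exhausted.

def choose_cached_intervals(positions, cache_size):
    """Return the interval lengths selected for caching."""
    ordered = sorted(positions, reverse=True)   # earliest stream is furthest ahead
    intervals = [ordered[i] - ordered[i + 1] for i in range(len(ordered) - 1)]
    cached, used = [], 0
    for length in sorted(intervals):            # shortest first
        if used + length <= cache_size:
            cached.append(length)
            used += length
    return cached

# Three concurrent streams of one clip at positions 100, 70 and 10:
cached = choose_cached_intervals([100, 70, 10], cache_size=40)
```

With a 40-unit cache, only the 30-unit interval fits; the stream at position 70 is then served from the cache filled by the stream at 100.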

  45. Generalized Interval Caching (GIC)
  • Interval caching (IC) does not work for short clips
    • a frequently accessed short clip will not be cached
  • GIC generalizes the IC strategy
    • manages intervals for long video objects as IC does
    • for short clips, it extends the interval definition
      • keeps track of a finished stream for a while after its termination
      • defines the interval for a short stream as the length between the new stream and the position the old stream would have had if the video object were longer
      • the cache requirement is, however, only the real requirement
    • caches the shortest intervals, as in IC
  [figure: a long clip 1 with interval I11 between streams S11 and S12, vs. a short clip 2 where interval I21 extends past the end of the clip to the virtual position of the terminated stream]

  46. Generalized Interval Caching (GIC)
  • Open function:
      form, if possible, a new interval with the previous stream;
      if (NO) { exit }   /* don’t cache */
      compute interval size and cache requirement;
      reorder interval list;   /* smallest first */
      if (not already in a cached interval) {
        if (space available) { cache interval }
        else if (larger cached intervals exist and sufficient memory can be released) {
          release memory from larger intervals;
          cache new interval;
        }
      }
  • Close function:
      if (not following another stream) { exit }   /* not served from cache */
      delete interval with preceding stream;
      free memory;
      if (next interval can be cached in released memory) { cache next interval }

  47. The End: Summary

  48. Summary
  • Much work has been performed to optimize disk performance
  • For multimedia streams, ...
    • time-aware scheduling is important
    • use large block sizes or read many contiguous blocks
    • prefetch data from disk to memory for a hiccup-free playout
    • striping might not be necessary on new disks (at least not on all disks)
    • replication over multiple disks can offload a hot set of disks
    • memory caching can save disk I/Os, but it might not be worth the effort
    • ...
  • BUT, new disks are “smart”; we cannot fully control the device

  49. Some References
  • Advanced Computer & Network Corporation: “RAID.edu”, http://www.raid.com/04_00.html, 2002
  • Boll, S., Heinlein, C., Klas, W., Wandel, J.: “MPEG-L/MRP: Adaptive Streaming of MPEG Videos for Interactive Internet Applications”, Proceedings of the 6th International Workshop on Multimedia Information Systems (MIS’00), Chicago, USA, October 2000, pp. 104-113
  • Halvorsen, P., Goebel, V., Plagemann, T.: “Q-L/MRP: A Buffer Management Mechanism for QoS Support in a Multimedia DBMS”, Proceedings of the 1998 IEEE International Workshop on Multimedia Database Management Systems (IW-MMDBMS’98), Dayton, Ohio, USA, August 1998, pp. 162-171
  • Moser, F., Kraiss, A., Klas, W.: “L/MRP: A Buffer Management Strategy for Interactive Continuous Data Flows in a Multimedia DBMS”, Proceedings of the 21st VLDB Conference, Zurich, Switzerland, 1995
  • Plagemann, T., Goebel, V., Halvorsen, P., Anshus, O.: “Operating System Support for Multimedia Systems”, Computer Communications, Vol. 23, No. 3, February 2000, pp. 267-289
  • Sitaram, D., Dan, A.: “Multimedia Servers – Applications, Environments, and Design”, Morgan Kaufmann Publishers, 2000
  • Zimmermann, R., Ghandeharizadeh, S.: “Continuous Display Using Heterogeneous Disk-Subsystems”, Proceedings of the 5th ACM International Multimedia Conference, Seattle, WA, November 1997
