The HP AutoRAID Hierarchical Storage System
John Wilkes, Richard Golding, Carl Staelin, and Tim Sullivan
“virtualized disk gets smart…”
HP AutoRAID 2
• File System Recap
  • the OS manages the storage of files on storage media using a File System
  • storage media:
    • composed of an array of data units, called sectors
  • File System:
    • organizes sectors into addressable storage units
    • establishes a directory structure for accessing files
  • FFS and LFS were both developed as improvements over previous FSes
    • improved performance by optimizing access
  • FFS:
    • increased block size to reduce the number of block addresses managed in the directory
    • logically grouped cylinders to help ensure locality for the blocks of a file
  • LFS:
    • eliminated seek times by always writing at the end of the log
    • introduced a new addressable structure called extents
      • an extent is a large contiguous set of blocks
      • extents are needed so there is plenty of room at the end of the log for writing new entries
    • requires Garbage Collection of old log entries
      • live blocks of partially filled extents are migrated to other extents to free up space
HP AutoRAID 3
• Crash Recovery
  • the issue is consistency of directory data after a crash or power failure
    • directory information is typically written after the file data is written
  • FFS:
    • after a crash you have no way of knowing what you were last doing
    • requires a consistency check
      • all inode information must be verified against the data it maps to
      • inconsistencies cannot always be repaired; data can be lost
  • LFS:
    • drastically reduces recovery time because of checkpointing
      • checkpoint = a noted recent time when the files and inode map were consistent
      • verify by rolling forward through the log from the last checkpoint
    • LFS keeps a lot of additional metadata and stores some of it with the file
      • increased odds of restoring consistency
  • But neither can recover from a hardware failure…
HP AutoRAID 4
• RAID (circa the 1980s)
  • Redundant Array of Inexpensive (or Independent) Disks
  • connect multiple cheap disks into an ARRAY of disks, spread data across them!
    • a single disk has less reliability than an array of smaller drives with redundancy
  • Virtualization!
    • multiple disks, but the File System sees only one virtual unit (doesn’t know it’s virtual!)
    • requires an ARRAY CONTROLLER, a combination of hardware and software
      • handles mapping between where the FS thinks data is and where it actually is
  • Redundancy!
    • partial, like parity
    • full, like an extra copy
    • if a single drive in the array is lost, its data can be automatically regenerated
    • no longer have to worry too much about drives failing!
HP AutoRAID 5
• RAID Levels
• RAID 1 - Mirroring
  • full redundancy!
    • zero recovery time in case of disk failure: just use the copy
    • storage capacity = 50% of the total size of the array
  • writes are serialized at some level between the two disks
    • so a crash or power failure cannot leave the two copies in an inconsistent state
    • this makes writes slower than just writing to one disk
    • a write request does not return until both copies have been updated
  • transfer rate = same as one disk
  • parallel reads!
    • each copy can service a read request
HP AutoRAID 6
• RAID Levels
• RAID 3 - Byte-level striping, parity on a check disk
  • spread data by striping: byte1 -> disk1, byte2 -> disk2, byte3 -> disk3
  • reads and writes of a stripe’s bytes happen at the same time!
    • transfer rate = (N - 1) * transfer rate of one disk
  • only partial redundancy!
    • the check disk stores parity information
    • parity overhead amounts to one bit per group of corresponding bits in a stripe
    • redundancy overhead = 1/N of the array’s capacity
  • Oops! Byte striping means every disk is involved in every request!
    • no parallel reads or writes
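The byte-striping layout above can be sketched in Python (a toy illustration with hypothetical names, not AutoRAID code): bytes round-robin across the N - 1 data disks, and the check disk holds the XOR parity of each byte group:

```python
def raid3_stripe(data: bytes, n_disks: int) -> list[bytearray]:
    """Byte-level striping sketch: byte i of each group goes to data disk i,
    and the last disk holds the XOR parity of that group of bytes."""
    n_data = n_disks - 1
    disks = [bytearray() for _ in range(n_disks)]
    for i in range(0, len(data), n_data):
        group = data[i:i + n_data].ljust(n_data, b"\x00")  # pad the final group
        for d in range(n_data):
            disks[d].append(group[d])
        parity = 0
        for b in group:
            parity ^= b
        disks[-1].append(parity)  # check disk
    return disks

# Every stripe is spread over all N - 1 data disks plus the check disk,
# which is why every request involves every disk.
disks = raid3_stripe(b"ABCD", 3)
```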
HP AutoRAID 7
• Parity
  • parity is computed using XOR ( ^ ):
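As a minimal illustration (hypothetical helper names): the parity block is the XOR of all data blocks, and because XOR is its own inverse, any single lost block is the XOR of the survivors with the parity:

```python
from functools import reduce

def compute_parity(blocks: list[bytes]) -> bytes:
    """XOR the corresponding bytes of all data blocks to get the parity block."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def rebuild_lost_block(survivors: list[bytes], parity: bytes) -> bytes:
    """XOR the surviving data blocks with the parity to regenerate the lost one."""
    return compute_parity(survivors + [parity])

stripe = [b"\x0f\x33", b"\xf0\x55", b"\xaa\xaa"]
parity = compute_parity(stripe)  # b"\x55\xcc"
assert rebuild_lost_block(stripe[1:], parity) == stripe[0]
```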
HP AutoRAID 8
• RAID Levels
• RAID 5 - Block-level striping, parity interleaved
  • striping unit is 1 block: block1 -> disk1, block2 -> disk2, block3 -> disk3, etc.
  • blocks of a stripe are written at the same time!
    • transfer rate = (N - 1) * transfer rate of one disk
  • only partial redundancy!
    • parity information dispersed round-robin among all disks
    • same redundancy overhead as level 3: 1/N of the array’s capacity
  • Hey! Block striping can mean that every disk is NOT involved in a (small) request
    • parallel reads and writes can occur, depending on which disks store the involved blocks
  • BUT writes get slower!
    • this happened in RAID 3 too
    • read - modify - write:
      • read the old data and old parity
      • recompute the parity from the old data, new data, and old parity
      • write the new data and new parity
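The read-modify-write step above relies on XOR being its own inverse: the new parity can be computed from the old data, new data, and old parity alone, touching only the target disk and the parity disk. A sketch with hypothetical names:

```python
def raid5_small_write(old_data: bytes, new_data: bytes, old_parity: bytes) -> bytes:
    """new parity = old parity XOR old data XOR new data; only the target
    data disk and the parity disk need to be read and written."""
    return bytes(p ^ od ^ nd for p, od, nd in zip(old_parity, old_data, new_data))

# Updating one block this way matches recomputing parity over the whole stripe,
# without reading the other N - 2 data disks.
```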
HP AutoRAID 9
• RAID 1 vs RAID 5
• Reads:
  • RAID 1 (mirroring):
    • always offers parallel reads
  • RAID 5:
    • can only sometimes offer parallel reads
      • depends on where the needed blocks are
      • two read requests that require blocks on the same disk must be serialized
• Writes:
  • RAID 1 (mirroring):
    • must complete two writes before the request returns
      • granularity of serialization can be smaller than a file
    • can’t do parallel writes
  • RAID 5:
    • typically does read-modify-write to recompute parity
      • (HP AutoRAID uses a combination of read-modify-write and LFS!)
    • can’t do parallel writes either
• Redundancy Overhead:
  • RAID 1 = full redundancy, storage capacity reduced by 50%
  • RAID 5 = partial redundancy, storage capacity reduced by 1/N
HP AutoRAID 10
• Storage Hierarchy = HP AutoRAID
  • RAID 1 = fast reads and writes, but 50% redundancy overhead
  • RAID 5 = strong reads, slow writes, 1/N storage overhead
  • RAID 1 is fast but expensive, like a cache!
  • RAID 5 is slower but cheaper, like main memory!
  • Neither is optimal under all circumstances…
  • SO create a hierarchy:
    • use mirroring for active blocks
      • active set = blocks of regularly read and written files
    • use RAID 5 for inactive blocks
      • inactive set = blocks of read-only and rarely accessed files
  • Sounds hard!
    • Who pushes the data back and forth between the sets?
    • How often do you have to do it?
      • if the sets change too often, there’s no time for anything else!
HP AutoRAID 11
• Who Minds the Storage Hierarchy?
  • The System Administrator?
    • as long as you don’t have to pay them much
    • and if they get it right all the time and don’t make any mistakes
  • The File System?
    • if so, big plus: the File System knows better than anything who is using which files
      • can best determine active and inactive sets based on tracking access patterns
    • BUT there are a lot of different OSes with different File System options
      • that makes deployment hard
      • each File System must be modified in order to manage a storage hierarchy
  • An Array Controller?
    • embed the software to manage the hierarchy in the hardware of a controller
    • no deployment issues, just add the hardware to the system
    • overrules the existing File System…
      • lose the ability to track access patterns…
      • need a reliable and often correct policy for determining active/inactive sets…
    • sounds like virtualization…
HP AutoRAID 12
• HP AutoRAID (local hard drive gets smart!)
  • the array controller’s embedded software manages the active/inactive sets
  • application-level user interface for configuration parameters
    • set up LUNs (virtual logical units)
  • virtualization
    • the File System is out of the loop!
  • Consider Mapping:
    • the File System thinks it is addressing the blocks of a particular file
    • it doesn’t know the file is actually in a storage hierarchy
      • is the requested file in the active set? or the inactive set?
      • which disk is it on?
    • need some mapping between what the File System sees and where data actually resides on disk
HP AutoRAID 13
• Virtual to Physical Mapping
• Physically:
  • the array is structured by an address hierarchy:
    • PEGs contain 3 or more PEXes
    • a PEX is typically 1 MB of contiguous disk space, divided into 128 KB segments
    • a segment is 128 KB of contiguous sectors and holds 2 Relocation Blocks (RBs)
  • Relocation Blocks serve as:
    • the striping unit in RAID 5, the mirroring unit in RAID 1,
    • and the unit of migration between the active and inactive sets
• Virtually, the File System sees:
  • LUNs: Logical Units
    • purely virtual: no superblock, no directory, not a partition
    • rather, a set of RBs that get mapped to physical segments when actually used
    • users can create as many LUNs as they want
  • each LUN has a virtual device table that holds the list of RBs assigned to it
    • RBs in the virtual device table are mapped to RBs in PEG tables
    • PEG tables map RBs to PEX tables, in which RBs are assigned to actual segments
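A toy sketch of that two-level indirection (Python dataclasses; all names are hypothetical): a LUN’s virtual device table maps each RB to a PEG, and the PEG table pins the RB to a segment of a PEX:

```python
from dataclasses import dataclass, field

@dataclass
class PegEntry:
    pex_id: int    # which 1 MB physical extent
    segment: int   # which 128 KB segment within that PEX

@dataclass
class Peg:
    storage_class: str                          # "mirrored" or "raid5"
    rb_map: dict = field(default_factory=dict)  # RB number -> PegEntry

@dataclass
class Lun:
    vdt: dict = field(default_factory=dict)     # virtual device table: RB -> PEG id

def resolve(lun: Lun, pegs: dict, rb: int) -> tuple:
    """Follow RB -> PEG -> (PEX, segment); the host never sees this indirection."""
    peg = pegs[lun.vdt[rb]]
    entry = peg.rb_map[rb]
    return peg.storage_class, entry.pex_id, entry.segment
```

Migrating an RB between storage classes only rewrites these table entries; the address the File System uses never changes.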
HP AutoRAID 14
• Mapping
  • if RB3 migrates from inactive to active, simply update the PEX mapping in the PEG table entry that maps RB3
HP AutoRAID 15
• How cool is that…
  • What you can do when you’re not in control anymore…
• Hot-pluggable disks
  • take one out and the RAID immediately begins regenerating the missing data
  • or, if one fails, activate a spare, if available
  • the array still functions, no down time
  • requests for missing data are given top priority for regeneration
• Create a larger array on the fly
  • the size of the array is limited by the size of the smallest disk
  • so take a small disk out and put a larger disk in
  • systematically replace all disks, one by one, letting each regenerate
  • when the last bigger disk goes in, the array is automatically larger
HP AutoRAID 16
• HP AutoRAID Read and Write Operations
• RAID 1 Mirrored Storage Class
  • normal RAID Level 1 reads and writes
    • 2 reads can happen in parallel
    • a write is serialized (at the segment level) between the two disks
    • both updates must complete before the request returns (remember the overhead!)
• RAID 5 Storage Class
  • reads are processed as normal RAID 5 read operations
    • reads are parallel if possible
  • writes are log structured
    • when they happen is more complicated
  • RAID 5 writes happen for 1 of 3 reasons:
    • a File System request tries to write data held at RAID 5:
      • results in promotion of the requested data to the active set
      • (no actual write happens at RAID 5 in this case)
    • the Mirrored storage class runs out of space:
      • so data is demoted from active to inactive; RBs are copied from active to inactive
    • garbage collection and cleaning in RAID 5
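The promotion/demotion behavior for writes can be sketched as follows (a simplification with hypothetical names; the real victim-selection policy is more refined than this placeholder):

```python
def handle_write(rb: int, mirrored: set, raid5: set, mirrored_capacity: int) -> None:
    """A write to an RB held in RAID 5 first promotes it to the mirrored class;
    if the mirrored class is full, some RB is demoted to RAID 5 to make room."""
    raid5.discard(rb)                    # promotion: the RB leaves RAID 5
    if rb not in mirrored and len(mirrored) >= mirrored_capacity:
        victim = next(iter(mirrored))    # placeholder victim choice
        mirrored.discard(victim)
        raid5.add(victim)                # demotion triggers the actual RAID 5 write
    mirrored.add(rb)                     # the host's write itself lands in mirrored
```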
HP AutoRAID 17
• Holes, Cleaning, and Garbage Collection
• Holes come from:
  • demotion of RBs from active to inactive leaves ‘holes’ in the PEXes of the mirrored class
    • holes are managed as a free list
  • promotion of RBs from inactive to active leaves holes in the PEXes of RAID 5
    • by the way, RAID 5 in HP AutoRAID uses LFS…
    • so holes must be garbage collected
• Cleaning:
  • plug the holes
  • RBs are migrated between PEGs to fill some and empty others
  • cleaning the mirrored class frees up PEGs to accommodate bursts or to give to RAID 5
  • cleaning RAID 5 is an alternative to garbage collection
• Garbage Collection:
  • normal LFS garbage collection
  • or hole-plugging garbage collection to fill/free PEGs
    • this performs much better, reducing garbage collection work by up to 90%!
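Hole plugging can be sketched like this (a hypothetical representation: each PEG is a list of RB slots, with None marking a hole):

```python
def hole_plug(pegs: list) -> bool:
    """Drain the emptiest PEG by moving its live RBs into holes of fuller PEGs;
    returns True if the drained PEG can be handed back to the free pool whole."""
    def live(peg):
        return [rb for rb in peg if rb is not None]
    pegs.sort(key=lambda p: len(live(p)))   # the emptiest PEG becomes the victim
    victim, targets = pegs[0], pegs[1:]
    for rb in live(victim):
        for target in targets:
            if None in target:
                target[target.index(None)] = rb   # plug a hole in a fuller PEG
                victim[victim.index(rb)] = None   # leave a freed slot behind
                break
    return victim.count(None) == len(victim)
```

Unlike a full LFS cleaner, this moves only the live RBs into existing holes instead of rewriting whole stripes, which is where the large reduction in garbage-collection work comes from.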
HP AutoRAID 18
• Performance
  • depends most on how much of the active set fits into the mirrored class
    • if it all fits, then RAID 5 goes unused; performance is that of a RAID 1 array
  • tested OLTP against a weaker RAID array and JBOD
    • JBOD = just a bunch of disks, striped, no redundancy (so it performs the best!)
    • tested with all of the active set fitting in the Mirrored Storage Class
      • so no migration overhead
    • AutoRAID lags due to redundancy overhead
  • tested performance for different percentages of the active set at the mirrored level
    • more disks = higher % at the Mirrored Storage Class
    • obviously performance rises with a higher % because there is less migration
    • interesting to note at 8 drives, when all of the active set fits:
      • performance still rises because the transfer rate is increasing (more disks to write to)
[Figure: transaction rate of OLTP for slow RAID, HP AutoRAID, and JBOD]
[Figure: transaction rate as the number of disks in AutoRAID increases]
HP AutoRAID 19
• Can the File System help?
  • the File System sees a virtual disk
    • and probably has its own ideas of how best to lay out data on blocks to optimize access
    • perhaps by assigning the RBs of a LUN to a linear set of contiguous blocks…
  • BUT are they really going to be contiguous?
    • in the array, RBs can be mapped anywhere and most likely are not stored linearly
    • so does this make seek times really bad?
  • ran tests where they initially set up the array:
    • with all RBs laid out completely linearly
    • with all RBs laid out completely randomly
  • resulted in only modest performance gains for the initial linear layout
    • note there is no way to migrate data between sets and maintain a linear layout…
  • Conclusion:
    • the 64KB RB allocation block may sound big, but it works just fine
    • remember, large block sizes amortize seek times
    • RB size is subject to the same considerations as block size on a normal hard drive
HP AutoRAID 20
• Mirrored Storage Class Read Selection Algorithm
  • which copy should be read?
  • possibilities:
    • strict alternation
    • keep one disk head on the outer tracks, the other on the inner
    • read from the disk with the shortest queue
    • read from the disk with the shortest seek time
  • strict alternation and inner/outer can give big benefits under certain workloads…
    • AND can really punish under other workloads
  • shortest queue and shortest seek time yield the same modest gain
    • but it is hard to track the shortest seek time
    • so shortest queue wins
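The winning policy is simple enough to sketch (hypothetical function; the random tie-breaking is my addition, not from the paper):

```python
import random

def pick_mirror_copy(queue_lengths: list) -> int:
    """Shortest-queue read selection: send the read to the mirror copy whose
    disk has the fewest outstanding requests; break ties randomly."""
    shortest = min(queue_lengths)
    candidates = [i for i, q in enumerate(queue_lengths) if q == shortest]
    return random.choice(candidates)
```

Queue length is a cheap proxy for disk load that the controller already knows, whereas estimating seek time would require modeling head position on every disk.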
HP AutoRAID 21
• Conclusion
  • redundancy protects from data loss due to hardware failure
  • different striping units and levels of redundancy result in different performance
    • performance depends on the type of workload
  • redundancy also introduces overhead
    • 50% for mirroring
  • reduce redundancy overhead by using a storage hierarchy
    • implementing different RAID levels for active and inactive data
  • the storage hierarchy is managed by an array controller
    • management software embedded in the hardware controller
    • special mapping virtualizes the array
    • the File System sees one (or more) virtual logical units