STORAGE AND I/O

Presentation Transcript


  1. STORAGE AND I/O Jehan-François Pâris jfparis@uh.edu

  2. Chapter Organization • Availability and Reliability • Technology review • Solid-state storage devices • I/O Operations • Reliable Arrays of Inexpensive Disks

  3. DEPENDABILITY

  4. Reliability and Availability • Reliability • Probability R(t) that system will be up at time t if it was up at time t = 0 • Availability • Fraction of time the system is up • Reliability and availability do not measure the same thing!

  5. Which matters? • It depends: • Reliability for real-time systems • Flight control • Process control, … • Availability for many other applications • DSL service • File server, web server, …

  6. MTTF, MTTR and MTBF • MTTF is mean time to failure • MTTR is mean time to repair • 1/MTTF is the failure rate λ • MTBF, the mean time between failures, is MTBF = MTTF + MTTR

  7. Reliability • As a first approximation, R(t) = exp(–t/MTTF) • Not true if the failure rate varies over time
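
A minimal sketch of this exponential approximation in Python (the function name and the sample MTTF and t values are illustrative, not from the slides):

```python
import math

def reliability(t, mttf):
    """Probability that a system with the given MTTF is still up at time t,
    assuming a constant failure rate (exponential model)."""
    return math.exp(-t / mttf)

# Hypothetical example: MTTF = 10 years, horizon t = 2 years
print(reliability(2, 10))   # exp(-0.2) ~ 0.82
```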

  8. Availability • Measured by MTTF/(MTTF + MTTR) = MTTF/MTBF • MTTR is very important • A good MTTR requires that we detect the failure quickly

  9. The nine notation • Availability is often expressed in "nines" • 99 percent is two nines • 99.9 percent is three nines • … • Formula is –log10(1 – A) • Example: –log10(1 – 0.999) = –log10(10⁻³) = 3

  10. Example • A server crashes on the average once a month • When this happens, it takes 12 hours to reboot it • What is the server availability?

  11. Solution • MTBF = 30 days • MTTR = 12 hours = ½ day • MTTF = 29½ days • Availability is 29.5/30 = 98.3%
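
A quick check of this solution in Python, also converting the result to "nines" with the –log10(1 – A) formula from slide 9 (the function names are mine):

```python
import math

def availability(mttf, mttr):
    """Fraction of time the system is up: MTTF / (MTTF + MTTR) = MTTF / MTBF."""
    return mttf / (mttf + mttr)

def nines(a):
    """Number of nines of an availability A: -log10(1 - A)."""
    return -math.log10(1 - a)

# Server from the example: MTBF = 30 days, MTTR = 0.5 day, so MTTF = 29.5 days
a = availability(29.5, 0.5)
print(a)         # ~0.983, i.e. 98.3 %
print(nines(a))  # ~1.8 -- not even two nines
```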

  12. Keep in mind • A 99 percent availability is not as great as we might think • One hour down every 100 hours • Fifteen minutes down every 24 hours

  13. Example • A disk drive has a MTTF of 20 years. • What is the probability that the data it contains will not be lost over a period of five years?
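
The transcript gives no solution for this slide; applying R(t) = exp(–t/MTTF) from slide 7 gives roughly 78 %:

```python
import math

# Single disk: MTTF = 20 years, period t = 5 years
print(math.exp(-5 / 20))   # exp(-0.25) ~ 0.78, about a 78 % chance of no data loss
```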

  14. Example • A disk farm contains 100 disks whose MTTF is 20 years. • What is the probability that no data will be lost over a period of five years?

  15. Solution • The aggregate failure rate of the disk farm is 100 × 1/20 = 5 failures/year • The mean time to failure of the farm is 1/5 year • We apply the formula R(t) = exp(–t/MTTF) = exp(–5×5) = exp(–25) ≈ 1.4×10⁻¹¹ • Almost zero chance!
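
The same computation, checked numerically (the variable names are mine):

```python
import math

disks, mttf_single, years = 100, 20, 5
farm_failure_rate = disks / mttf_single   # 100 x 1/20 = 5 failures/year
farm_mttf = 1 / farm_failure_rate         # 1/5 year
print(math.exp(-years / farm_mttf))       # exp(-25) ~ 1.4e-11
```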

  16. TECHNOLOGY OVERVIEW

  17. Disk drives • See previous chapter • Recall that the disk access time is the sum of • The disk seek time (to get to the right track) • The disk rotational latency • The actual transfer time
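
A small sketch of that sum, with made-up drive parameters (the 4 ms seek, 7200 RPM and 150 MB/s figures are illustrative assumptions, not values from the slides):

```python
def disk_access_time_ms(seek_ms, rpm, transfer_mb_s, request_kb):
    """Access time = seek + average rotational latency (half a revolution) + transfer."""
    rotational_latency_ms = 0.5 * 60_000 / rpm
    transfer_ms = (request_kb / 1024) / transfer_mb_s * 1000
    return seek_ms + rotational_latency_ms + transfer_ms

# Hypothetical 64 KB read: 4 + 4.2 + 0.4 ~ 8.6 ms
print(disk_access_time_ms(seek_ms=4, rpm=7200, transfer_mb_s=150, request_kb=64))
```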

  18. Flash drives • Widely used in flash drives, most MP3 players and some small portable computers • Similar technology to EEPROM • Two technologies: NOR and NAND

  19. What about flash? • Widely used in flash drives, most MP3 players and some small portable computers • Several important limitations • Limited write bandwidth • Must erase a whole block of data before overwriting it • Limited endurance • 10,000 to 100,000 write cycles

  20. Storage Class Memories • Solid-state storage • Non-volatile • Much faster than conventional disks • Numerous proposals: • Ferro-electric RAM (FRAM) • Magneto-resistive RAM (MRAM) • Phase-Change Memories (PCM)

  21. Phase-Change Memories • No moving parts • (Figures: a PCM data cell; crossbar organization)

  22. Phase-Change Memories • Cells contain a chalcogenide material that has two states • Amorphous with high electrical resistivity • Crystalline with low electrical resistivity • Quickly cooling material from above fusion point leaves it in amorphous state • Slowly cooling material from above crystallization point leaves it in crystalline state

  23. Projections • Target date 2012 • Access time 100 ns • Data Rate 200–1000 MB/s • Write Endurance 10⁹ write cycles • Read Endurance no upper limit • Capacity 16 GB • Capacity growth > 40% per year • MTTF 10–50 million hours • Cost < $2/GB

  24. Interesting Issues (I) • Disks will remain much cheaper than SCM for some time • Could use SCMs as an intermediate level between main memory and disks • (Figure: hierarchy of main memory, SCM, disk)

  25. A last comment • The technology is still experimental • Not sure when it will come to the market • It might never come to market at all

  26. Interesting Issues (II) • Rather narrow gap between SCM access times and main memory access times • Main memory and SCM will interact • As the L3 cache interacts with the main memory • Not as the main memory now interacts with the disk

  27. RAID Arrays

  28. Today’s Motivation • We use RAID today for • Increasing disk throughput by allowing parallel access • Eliminating the need to make disk backups • Disks are too big to be backed up in an efficient fashion

  29. RAID LEVEL 0 • No replication • Advantages: • Simple to implement • No overhead • Disadvantage: • If the array has n disks, its failure rate is n times that of a single disk

  30. RAID levels 0 and 1 • (Figure: a RAID level 0 array beside a RAID level 1 array whose disks are mirrors)

  31. RAID LEVEL 1 • Mirroring: • Two copies of each disk block • Advantages: • Simple to implement • Fault-tolerant • Disadvantage: • Requires twice the disk capacity of normal file systems

  32. RAID LEVEL 2 • Instead of duplicating the data blocks we use an error correction code • Very bad idea because disk drives either work correctly or do not work at all • Only possible errors are omission errors • We need an omission correction code • A parity bit is enough to correct a single omission

  33. RAID levels 2 and 3 • (Figure: RAID level 2 with several check disks; RAID level 3 with a single parity disk)

  34. RAID LEVEL 3 • Requires N+1 disk drives • N drives contain data (1/N of each data block) • Block b[k] now partitioned into N fragments b[k,1], b[k,2], ... b[k,N] • Parity drive contains the exclusive or of these N fragments: p[k] = b[k,1] ⊕ b[k,2] ⊕ ... ⊕ b[k,N]

  35. How does parity work? • Truth table for XOR (same as parity): 0 ⊕ 0 = 0, 0 ⊕ 1 = 1, 1 ⊕ 0 = 1, 1 ⊕ 1 = 0

  36. Recovering from a disk failure • Small RAID level 3 array with data disks D0 and D1 and parity disk P can tolerate failure of either D0 or D1
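
A minimal sketch of parity-based recovery in Python (the helper name and sample fragments are mine): the parity fragment is the XOR of the data fragments, and a lost fragment is the XOR of all surviving ones.

```python
def xor_blocks(*blocks):
    """Bytewise XOR of equal-length fragments of one stripe."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)

d0 = b"\x12\x34\x56\x78"      # fragment stored on data disk D0
d1 = b"\x9a\xbc\xde\xf0"      # fragment stored on data disk D1
p  = xor_blocks(d0, d1)       # parity fragment stored on disk P

# Disk D1 fails: its fragment is rebuilt from the survivors
assert xor_blocks(d0, p) == d1
```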

  37. How RAID level 3 works (I) • Assume we have N + 1 disks • Each block is partitioned into N equal chunks • (Figure: one block split into four chunks, N = 4 in the example)

  38. How RAID level 3 works (II) • XOR the data chunks to compute the parity chunk • Each chunk is written to a separate disk

  39. How RAID level 3 works (III) • Each read/write involves all disks in the RAID array • Cannot do two or more reads/writes in parallel • Performance of the array is no better than that of a single disk

  40. RAID LEVEL 4 (I) • Requires N+1 disk drives • N drives contain data • Individual blocks, not chunks • Blocks with the same disk address form a stripe

  41. RAID LEVEL 4 (II) • Parity drive contains the exclusive or of the N blocks in a stripe: p[k] = b[k] ⊕ b[k+1] ⊕ ... ⊕ b[k+N-1] • Parity block now reflects contents of several blocks! • Can now do parallel reads/writes

  42. RAID levels 4 and 5 • (Figure: RAID level 4, whose single parity disk is a bottleneck, and RAID level 5 with parity distributed across all drives)

  43. RAID LEVEL 5 • Single parity drive of RAID level 4 is involved in every write • Will limit parallelism • RAID level 5 distributes the parity blocks among the N+1 drives • Much better

  44. The small write problem • Specific to RAID 5 • Happens when we want to update a single block • Block belongs to a stripe • How can we compute the new value of the parity block p[k]? • (Figure: a stripe with parity block p[k] and data blocks b[k], b[k+1], b[k+2], …)

  45. First solution • Read the values of the N-1 other blocks in the stripe • Recompute p[k] = b[k] ⊕ b[k+1] ⊕ ... ⊕ b[k+N-1] • Solution requires • N-1 reads • 2 writes (new block and new parity block)

  46. Second solution • Assume we want to update block b[m] • Read old values of b[m] and parity block p[k] • Compute new p[k] = new b[m] ⊕ old b[m] ⊕ old p[k] • Solution requires • 2 reads (old values of block and parity block) • 2 writes (new block and new parity block)
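
A minimal sketch contrasting the two approaches (function and variable names are mine; `xor_blocks` is the same bytewise XOR helper as in the earlier sketch):

```python
def xor_blocks(*blocks):
    """Bytewise XOR of equal-length blocks."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)

def update_by_recompute(stripe, m, new_block):
    """First solution: read the N-1 other blocks, recompute the parity."""
    others = [b for i, b in enumerate(stripe) if i != m]    # N-1 reads
    return new_block, xor_blocks(new_block, *others)        # 2 writes

def update_by_read_modify(stripe, old_parity, m, new_block):
    """Second solution: new p[k] = new b[m] XOR old b[m] XOR old p[k]."""
    old_block = stripe[m]                                   # 2 reads in total
    return new_block, xor_blocks(new_block, old_block, old_parity)   # 2 writes

stripe = [b"\x01\x02", b"\x03\x04", b"\x05\x06"]
parity = xor_blocks(*stripe)
new_data = b"\xff\x00"
assert update_by_recompute(stripe, 1, new_data) == \
       update_by_read_modify(stripe, parity, 1, new_data)
```

Both strategies produce the same new parity; the second needs only two reads regardless of the stripe width N.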

  47. RAID level 6 (I) • Not part of the original proposal • Two check disks • Tolerates two disk failures • More complex updates

  48. RAID level 6 (II) • Has become more popular as disks become • Bigger • More vulnerable to irrecoverable read errors • Most frequent cause for RAID level 5 array failures is • Irrecoverable read error occurring while contents of a failed disk are reconstituted

  49. RAID level 6 (III) • Typical array size is 12 disks • Space overhead is 2/12 = 16.7 % • Sole real issue is the cost of small writes • Three reads and three writes: • Read old value of block being updated, old parity block P, old parity block Q • Write new value of block being updated, new parity block P, new parity block Q

  50. CONCLUSION (II) • Low cost of disk drives made RAID level 1 attractive for small installations • Otherwise pick • RAID level 5 for higher parallelism • RAID level 6 for higher protection • Can tolerate one disk failure and irrecoverable read errors
