
Data Integrity and Protection

Jiwoong Park (jwpark@dcslab.snu.ac.kr), School of Computer Science and Engineering, Seoul National University. Overview: disk failure modes (fail-stop, fail-partial), types of failures (latent-sector error (LSE), block corruption), and solutions.





Presentation Transcript


  1. Data Integrity and Protection
     Jiwoong Park (jwpark@dcslab.snu.ac.kr)
     School of Computer Science and Engineering, Seoul National University

  2. Overview
     • Disk Failure Modes
       • Fail-Stop
       • Fail-Partial
     • Types of Failures
       • Latent-Sector Error (LSE)
       • Block Corruption
     • Solutions
       • Redundancy Mechanism for Recovery
       • Checksum for Error Detection
         • Block Corruption
         • Misdirected Write
         • Lost Write
     • Overhead
       • Space / Time

  3. Disk Failure Modes
     • Fail-Stop
       • The entire disk is working
       • The entire disk fails completely
       • The fail-stop model cannot describe the modern view of disk failure
     • Fail-Partial
       • The entire disk is working
       • The entire disk fails completely
       • New states: the disk seems to be working
         • but has one or more inaccessible blocks
         • or holds the wrong contents

  4. Types of Failures: Latent-Sector Error (LSE)
     [Image: head crash. Source: http://en.wikipedia.org/wiki/Head_crash (retrieved on 2015/06/07)]
     [Image: cosmic rays. Source: http://apod.nasa.gov/apod/ap060814.html (retrieved on 2015/06/07)]
     • Latent-Sector Error (LSE)
       • Cause
         • Disk sectors have been damaged in some way
         • Head crash: can damage the surface, making the bits unreadable
         • Cosmic rays: can flip bits, leading to incorrect contents
       • Detection
         • Easily detected with the in-disk Error Correcting Code (ECC)
         • The disk returns an error in case of an LSE
       • Recovery
         • Can be handled by a redundancy mechanism

  5. Types of Failures: Block Corruption
     [Image: wrong location. Source: http://www.derby-solicitors.co/consumer-rights/faulty-goods/ (retrieved on 2015/06/07)]
     • Block Corruption
       • Cause
         • Disk blocks become corrupt in a way that is not detectable by the disk itself
         • Buggy disk firmware: can write a block to the wrong location
         • Faulty bus: can transfer corrupt data
       • Detection
         • Not detectable by the disk itself (silent faults)
         • Can be detected with a checksum
       • Recovery
         • Can be handled by a redundancy mechanism

  6. Solutions: Redundancy Mechanism for Recovery
     [Figure: parity RAID layout across four physical disks; Physical Disk 3 fails and its blocks are rebuilt]
                  Disk 1   Disk 2   Disk 3   Disk 4
       Stripe 1   D1       D2       D3       P1
       Stripe 2   D4       D5       P2       D6
       Stripe 3   D7       P3       D8       D9
       Stripe 4   P4       D10      D11      D12
     Disk fail recovery:
       D3  = D1 XOR D2 XOR P1
       P2  = D4 XOR D5 XOR D6
       D8  = D7 XOR P3 XOR D9
       D11 = P4 XOR D10 XOR D12
     The figure also marks an LSE on a block in the second stripe.
     • How to Recover from Disk Failures?
       • Redundancy mechanism
       • Example
         • Mirrored RAID (RAID-1) or parity-based RAID (RAID-4, RAID-5)
         • Can reconstruct the block from its mirror copy or from the other blocks in the parity group
       • Problem
         • Recovery fails when a full-disk fault and an LSE occur at the same time
       • Solution
         • RAID-DP
         • Two parity disks instead of one + Write Anywhere File Layout (WAFL)
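
The recovery equations on this slide can be sketched in a few lines of Python. This is only an illustration: the helper name `xor_blocks` and the 4-byte block contents are assumptions made for the example, not part of the slides.

```python
from functools import reduce


def xor_blocks(*blocks: bytes) -> bytes:
    """Byte-wise XOR of equally sized blocks."""
    if len({len(b) for b in blocks}) != 1:
        raise ValueError("all blocks must have the same length")
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))


# Hypothetical 4-byte blocks for one stripe (D1, D2, D3) and its parity P1.
d1 = bytes([0x36, 0x5E, 0xC4, 0xCD])
d2 = bytes([0xBA, 0x14, 0x8A, 0x92])
d3 = bytes([0xEC, 0xEF, 0x2C, 0x3A])
p1 = xor_blocks(d1, d2, d3)  # parity computed when the stripe is written

# The disk holding D3 fails: D3 = D1 XOR D2 XOR P1.
rebuilt_d3 = xor_blocks(d1, d2, p1)
```

If a second block of the same stripe were also unreadable (for example an LSE on D1), the XOR would have two unknowns, which is the problem case named on the slide.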

  7. Solutions: Checksum for Error Detection
     • How to Detect Silent Faults?
       • Checksum
         • A primary mechanism used by modern storage systems for data integrity
         • The result of a checksum function that
           • takes a chunk of data as input
           • computes a function over the data
           • produces a small summary of the contents of the data
     • How to Use a Checksum?
       • When reading a block D
         • Read the stored checksum
         • Compute the checksum over the retrieved block D
         • Compare the stored checksum and the computed checksum
         • If they are equal (no error)
           • Return the data to the user
         • If they are not equal (data corruption)
           • Use a redundancy mechanism or return an error
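
A minimal sketch of the read path above, under some assumptions: `zlib.crc32` stands in for whatever checksum function the storage system uses, and `read_block`, `CorruptionError`, and the list of redundant copies are hypothetical names introduced for the example.

```python
import zlib


class CorruptionError(Exception):
    """Raised when no available copy matches the stored checksum."""


def checksum(block: bytes) -> int:
    # Stand-in checksum function (assumption): CRC-32 from the standard library.
    return zlib.crc32(block)


def read_block(copies, stored_checksum: int) -> bytes:
    """Return the first copy whose computed checksum equals the stored one."""
    for data in copies:
        if checksum(data) == stored_checksum:
            return data  # no error: return the data to the user
    # Every copy failed the comparison: redundancy is exhausted.
    raise CorruptionError("no copy matches the stored checksum")


good = b"hello, disk"
stored = checksum(good)
corrupted = b"hellO, disk"  # one silently corrupted byte
```

Here `read_block([corrupted, good], stored)` falls back to the second copy, while `read_block([corrupted], stored)` raises an error.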

  8. Common Checksum Function
     • XOR-based Checksum
       • Basic Idea
         • XOR each chunk of the data block being checksummed
         • Produce a single value
       • Example
         • 4-byte checksum over a block of 16 bytes
               0011 0110 0101 1110 1100 0100 1100 1101
               1011 1010 0001 0100 1000 1010 1001 0010
               1110 1100 1110 1111 0010 1100 0011 1010
               0100 0000 1011 1110 1111 0110 0110 0110
           XOR 0010 0000 0001 1011 1001 0100 0000 0011
       • Limitation
         • Cannot detect a two-bit corruption when both flipped bits are in the same position within different checksummed units (figure callout: "Detection Failed")
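
The XOR checksum and its limitation can be reproduced with a short sketch. The function name `xor_checksum` is an assumption; the 16-byte block is the one shown in the example rows above, written in hexadecimal.

```python
from functools import reduce


def xor_checksum(block: bytes, width: int = 4) -> bytes:
    """XOR all width-byte chunks of block into a single width-byte value."""
    if len(block) % width:
        raise ValueError("block length must be a multiple of the chunk width")
    chunks = [block[i:i + width] for i in range(0, len(block), width)]
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*chunks))


# The 16-byte block from the slide, one 4-byte chunk per group.
block = bytes.fromhex("365ec4cd ba148a92 ecef2c3a 40bef666")
original = xor_checksum(block)

# Flip the same bit position (top bit of the first byte) in two different chunks.
corrupted = bytearray(block)
corrupted[0] ^= 0x80
corrupted[4] ^= 0x80
after_corruption = xor_checksum(bytes(corrupted))
```

The two flips cancel in the XOR, so `after_corruption` equals `original` even though the data changed.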

  9. Common Checksum Function
     • Addition-based Checksum
       • Basic Idea
         • Perform 2's complement addition over each chunk of the data
         • Produce a single value, ignoring overflow
       • Example
         • 4-byte checksum over a block of 16 bytes
           Data (X)                                  2's complement (~X + 1)
           0011 0110 0101 1110 1100 0100 1100 1101   1100 1001 1010 0001 0011 1011 0011 0011
           1011 1010 0001 0100 1000 1010 1001 0010   0100 0101 1110 1011 0111 0101 0110 1110
           1110 1100 1110 1111 0010 1100 0011 1010   0001 0011 0001 0000 1101 0011 1100 0110
           0100 0000 1011 1110 1111 0110 0110 0110   1011 1111 0100 0001 0000 1001 1001 1010
           ADD of the 2's complement column (overflow bit 1 ignored):
                                                     1110 0001 1101 1110 1000 1110 0000 0001
       • Limitation
         • Not good if the data is shifted
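
A sketch of the addition-based checksum over the same 16-byte block. The names `add_checksum` and `negate` are assumptions; the figure sums the 2's complement of each chunk, which gives the same value as negating the plain sum, so both forms are shown.

```python
def add_checksum(block: bytes, width: int = 4) -> int:
    """Sum the width-byte chunks as unsigned integers, discarding overflow."""
    mask = (1 << (8 * width)) - 1
    total = 0
    for i in range(0, len(block), width):
        total = (total + int.from_bytes(block[i:i + width], "big")) & mask
    return total


def negate(value: int, width: int = 4) -> int:
    """2's complement negation (~X + 1) within width bytes."""
    mask = (1 << (8 * width)) - 1
    return (-value) & mask


block = bytes.fromhex("365ec4cd ba148a92 ecef2c3a 40bef666")
plain_sum = add_checksum(block)

# Sum of the negated chunks, as drawn in the figure.
negated_chunks = b"".join(
    negate(int.from_bytes(block[i:i + 4], "big")).to_bytes(4, "big")
    for i in range(0, len(block), 4)
)
figure_sum = add_checksum(negated_chunks)
```

Because addition is commutative, reordering the chunks leaves the result unchanged, which is one reason this checksum is considered weak.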

  10. Common Checksum Function
     [Flowchart: Start -> initialize s1 = s2 = 0 -> while there is more data: s1 = (s1 + d_i) mod 255, s2 = (s2 + s1) mod 255 -> output the 16-bit checksum (s2, s1) -> Stop]
     • Fletcher Checksum
       • Basic Idea
         • Compute two check bytes, s1 and s2
         • s1 = (s1 + d_i) mod 255
         • s2 = (s2 + s1) mod 255
       • Example
         • Fletcher checksum over 8-bit data units d_i, producing a 16-bit checksum (s2, s1)
         • Can detect all single-bit errors, all double-bit errors, and a large percentage of burst errors
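
The flowchart translates almost directly into code. The following is a small sketch, with the function name `fletcher16` chosen for the example; it packs the two check bytes as (s2, s1), following the output box of the flowchart.

```python
def fletcher16(data: bytes) -> int:
    """Fletcher checksum over 8-bit units, returned as the 16-bit value (s2 << 8) | s1."""
    s1 = s2 = 0
    for d in data:
        s1 = (s1 + d) % 255   # running sum of the data bytes
        s2 = (s2 + s1) % 255  # running sum of the running sums
    return (s2 << 8) | s1


value = fletcher16(b"abcde")
```

Unlike a plain sum, s2 depends on the position of each byte, so swapping two different bytes changes the checksum in most cases.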

  11. Common Checksum Function
     • Cyclic Redundancy Check (CRC)
       • Basic Idea
         • Treat the data block as if it is a large binary number
         • Divide it by an agreed-upon value (k)
         • Produce a single value
           • The remainder of this division
       • Example
         • CRC-8-ATM with polynomial x^8 + x^2 + x + 1
           • 1x^8 + 0x^7 + 0x^6 + 0x^5 + 0x^4 + 0x^3 + 1x^2 + 1x^1 + 1x^0
           • Polynomial (divisor) = 100000111 (binary)
           • Input Di = 01010111 (binary)
           • Output CRC(Di) = 10100010 (binary)
         • Modulo-2 long division (each step XORs the divisor 100000111, aligned under the leading 1 bit)
             0101011100000000   input followed by eight 0 bits
           ^ 0100000111000000
           = 0001011011000000
           ^ 0001000001110000
           = 0000011010110000
           ^ 0000010000011100
           = 0000001010101100
           ^ 0000001000001110
           = 0000000010100010   remainder (output) = 10100010
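
The division above can be carried out bit by bit in code. This is a plain, unoptimized sketch (real implementations usually use lookup tables); the function name `crc8_atm` is an assumption, and the polynomial is passed as 0x07 because the x^8 term is implicit in the 8-bit register.

```python
def crc8_atm(data: bytes, poly: int = 0x07) -> int:
    """Bitwise CRC-8 with polynomial x^8 + x^2 + x + 1, initial value 0."""
    crc = 0
    for byte in data:
        crc ^= byte
        for _ in range(8):
            if crc & 0x80:
                # Leading bit is 1: shift it out and subtract (XOR) the divisor.
                crc = ((crc << 1) ^ poly) & 0xFF
            else:
                crc = (crc << 1) & 0xFF
    return crc


remainder = crc8_atm(bytes([0b01010111]))
```

For the slide's input 01010111, `remainder` is 0b10100010, matching the last line of the long division. Appending the remainder to the data and recomputing yields 0, which is how a receiver can check a block with this variant.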

  12. Good Checksum Function
     [Image: TNSTAAFL. Source: http://www.hagencartoons.com/ (retrieved on 2015/06/07)]
     • Collision
       • Two data blocks with non-identical contents can have identical checksums
     • There’s No Such Thing As A Free Lunch (TNSTAAFL)
       • The more protection you get, the costlier it is
     • Good Checksum Function
       • Easy to compute
       • Lower chance of collisions

  13. Checksum Layout
     [Figure: two on-disk layouts]
       Layout 1 (checksum stored with each sector): C[D0] D0 | C[D1] D1 | C[D2] D2 | C[D3] D3 | C[D4] D4
       Layout 2 (checksums packed into one sector): C[D0] C[D1] C[D2] C[D3] C[D4] | D0 | D1 | D2 | D3 | D4
     • Basic Approach
       • Simply store a checksum with each disk sector (or block)
     • Problem
       • Disks can only write in sector-sized chunks (512 bytes)
     • Solution
       • Format the drive with 520-byte sectors
         • A single write is needed to overwrite block D
       • Store the n checksums together in a sector, followed by n data blocks
         • One read and two writes are needed to overwrite block D
         • A read of the sector holding the stored checksums + writes for the data and for the checksum sector
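
The 520-byte-sector option can be sketched as follows. Several details are assumptions made for the example: the 8 extra bytes hold a CRC-32 (from `zlib`) zero-extended to 64 bits, and the helper names `pack_sector` and `unpack_sector` are hypothetical.

```python
import struct
import zlib

DATA_SIZE = 512     # user data per sector
SECTOR_SIZE = 520   # 512 bytes of data + 8 bytes of checksum


def pack_sector(data: bytes) -> bytes:
    """Build one 520-byte sector: data followed by its checksum."""
    if len(data) != DATA_SIZE:
        raise ValueError("data must be exactly 512 bytes")
    return data + struct.pack(">Q", zlib.crc32(data))


def unpack_sector(sector: bytes) -> bytes:
    """Verify the trailing checksum and return the 512 data bytes."""
    if len(sector) != SECTOR_SIZE:
        raise ValueError("sector must be exactly 520 bytes")
    data, trailer = sector[:DATA_SIZE], sector[DATA_SIZE:]
    (stored,) = struct.unpack(">Q", trailer)
    if zlib.crc32(data) != stored:
        raise ValueError("checksum mismatch")
    return data


sector = pack_sector(bytes(range(256)) * 2)
```

Because data and checksum travel in the same sector, overwriting a block is a single write, in contrast to the packed layout, which needs a read and two writes.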

  14. Misdirected Writes
     [Figure: each block is stored with its checksum plus the disk and block number]
       Disk 0: C[D0] disk=0 block=0 | D0 | C[D1] disk=0 block=1 | D1 | C[D2] disk=0 block=2 | D2
       Disk 1: C[D0] disk=1 block=0 | D0 | C[D1] disk=1 block=1 | D1 | C[D2] disk=1 block=2 | D2
     • Cause
       • The data is written to disk correctly, but in the wrong location
     • Solution
       • Add a little more information to each checksum
       • Physical identifier: the disk number and the sector (block) number of the block
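
A sketch of how the physical identifier catches a misdirected write. The `BlockTag` record, the `verify` function, and the use of `zlib.crc32` are assumptions made for illustration only.

```python
import zlib
from dataclasses import dataclass


@dataclass(frozen=True)
class BlockTag:
    """Checksum plus physical identifier stored alongside a block."""
    checksum: int
    disk: int
    block: int


def make_tag(data: bytes, disk: int, block: int) -> BlockTag:
    return BlockTag(zlib.crc32(data), disk, block)


def verify(data: bytes, tag: BlockTag, disk: int, block: int) -> str:
    """Check the contents first, then check that they are where they belong."""
    if zlib.crc32(data) != tag.checksum:
        return "corrupt"
    if (tag.disk, tag.block) != (disk, block):
        return "misdirected"
    return "ok"


data = b"payload"
tag = make_tag(data, disk=1, block=2)   # intended for disk 1, block 2
result = verify(data, tag, disk=0, block=2)  # but found on disk 0, block 2
```

The checksum alone would pass here, since the data itself is intact; only the disk and block numbers reveal that it landed in the wrong place.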

  15. Lost Writes
     • Cause
       • The device informs the host that a write has completed, but in fact it has not
       • The previous approaches cannot detect lost writes because
         • The old block likely has a matching checksum
         • The physical ID will also be correct
     • Solution
       • Write verify (read-after-write)
         • Immediately read back the data after a write
         • Quite slow, doubling the number of I/Os needed to complete a write
       • Checksum stored elsewhere
         • Used in the Zettabyte File System (ZFS)
         • Store a checksum in the inode / indirect block for every block included within a file
         • Can fail only if the writes to both the inode and the data are lost simultaneously
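
The idea of keeping the checksum away from the data can be illustrated with a toy model. This is not how ZFS is implemented; the `TinyFS` class, its `lose_data_write` flag (used to simulate a lost write), and the use of `zlib.crc32` are all assumptions for the sketch.

```python
import zlib


class StaleBlockError(Exception):
    """Raised when a block does not match the checksum held by its parent."""


class TinyFS:
    """Toy model: data blocks on 'disk', their checksums in a separate 'inode'."""

    def __init__(self, nblocks: int):
        empty = bytes(4)
        self.disk = [empty] * nblocks
        self.inode = [zlib.crc32(empty)] * nblocks

    def write(self, n: int, data: bytes, lose_data_write: bool = False) -> None:
        self.inode[n] = zlib.crc32(data)  # checksum goes to the parent structure
        if not lose_data_write:           # a lost write leaves the old data in place
            self.disk[n] = data

    def read(self, n: int) -> bytes:
        data = self.disk[n]
        if zlib.crc32(data) != self.inode[n]:
            raise StaleBlockError(f"block {n} does not match its parent checksum")
        return data


fs = TinyFS(2)
fs.write(0, b"good")
fs.write(1, b"new!", lose_data_write=True)  # device claims success, data never lands
```

Reading block 1 now fails, because the old contents no longer match the checksum recorded in the inode. If the inode update had been lost as well, the old data and old checksum would still agree, which is the failure case noted on the slide.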

  16. Scrubbing
     [Image source: http://stockfresh.com/royalty-free-stock-photos/dozing (retrieved on 2015/06/07)]
     • Periodic Checking
       • Disk scrubbing
         • Periodically read through every block of the system
         • Check whether the checksums are still valid
         • Schedule scans on a nightly or weekly basis
     • When Do Checksums Actually Get Checked?
       • When the data is accessed by applications
       • Most data is rarely accessed and so remains unchecked
       • Unchecked data is problematic for a reliable storage system
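
One scrubbing pass might look like the sketch below. The function name `scrub`, the in-memory lists standing in for the disk, and `zlib.crc32` as the checksum are assumptions; scheduling (nightly or weekly) is left out.

```python
import zlib


def scrub(blocks, checksums):
    """Read every block and return the indices whose checksum no longer matches."""
    bad = []
    for index, (data, stored) in enumerate(zip(blocks, checksums)):
        if zlib.crc32(data) != stored:
            bad.append(index)
    return bad


blocks = [b"alpha", b"bravo", b"charlie"]
checksums = [zlib.crc32(b) for b in blocks]

blocks[1] = b"brav0"  # silent corruption in a block no application has read
bad_blocks = scrub(blocks, checksums)
```

The blocks reported in `bad_blocks` can then be repaired from redundant copies before an application ever asks for them.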

  17. Overheads of Checksumming
     • Space
       • Disk
         • The space used for stored checksums on the disk cannot be used for user data
       • Memory
         • Accessing data needs room in memory for both the checksums and the data itself
     • Time
       • CPU
         • The CPU must compute the checksum over each block
         • Whenever the data is stored or accessed
         • Combining copying and checksumming can be quite effective
       • I/O
         • Extra I/Os to access checksums that are stored separately from the data
           • Can be reduced by design
         • Extra I/Os needed for background scrubbing
           • The impact can be limited by controlling the start time of disk scrubbing (e.g., midnight)

  18. Summary
     • Disk Failure
       • Latent-sector errors (LSEs) are easily detectable
       • Silent faults are hard to detect
         • Block corruption, misdirected writes, lost writes
     • Solution
       • Redundancy mechanism
       • Checksum
         • Time overheads can be noticeable
         • Space overheads are negligible
     • Best Checksum?
       • It depends
       • Addition checksum when a very lightweight function is needed
       • Fletcher checksum for intermediate complexity
       • CRC when the best error detection is needed, at a higher computational cost
