
Understanding the Robustness of SSDs under Power Fault



Presentation Transcript


  1. Understanding the Robustness of SSDs under Power Fault 서동화 dhdh0113@gmail.com

  2. Contents • Introduction • Background • Testing Framework • Experimental Results

  3. Introduction • Flash-based solid-state disks (SSDs) • A “truly revolutionary and disruptive” technology. • Greater performance. • Lower power draw. • The behavior of flash memory in adverse conditions has only been studied at the component level. • Given the opaque and confidential nature of FTLs, the behavior of full devices under unusual conditions remains a mystery to the public. • This paper considers the behavior of SSDs under power fault. • Although loss of power seems like an easy fault to prevent, recent experience shows that a simple loss of power is still a distressingly frequent occurrence.

  4. Introduction • Power fault cases • HOSTING • Jul. 2012: “… human error was responsible for a data center POWER OUTAGE …” • Amazon • Jun. 2012: “Amazon Data Center LOSES POWER During Storm …” • Amazon • May 2010: “Car Crash Triggers Amazon POWER OUTAGE …” • iWeb • 2010: “About 3,000 servers at Montreal web host iWeb experienced an OUTAGE …” • And so on …

  5. Background • NAND Flash Low-Level Details • The floating gate inside a NAND flash cell is susceptible to a variety of faults that may cause data corruption. • Write endurance • Program disturb • Read disturb • Aging

  6. Background • NAND Flash Low-Level Details • [Diagrams: erase, write, and read operations]

  7. Reference • Write disturb • Program disturb • (Source: Characterizing Flash Memory: Anomalies, Observations, and Applications)

  8. Reference • Read disturb • (Source: Characterizing Flash Memory: Anomalies, Observations, and Applications)

  9. Background • SSD High-Level Concerns • SSDs use firmware called an FTL (flash translation layer) to make the device appear as if it can do update-in-place. • The primary responsibility of an FTL is to maintain a mapping between logical and physical addresses. • Remapping tables are typically stored in a volatile write-back cache. • Due to cost considerations, manufacturers typically attempt to minimize the size of the write-back cache as well as the capacitor backing it. • Loss of power during program operations can make the flash cells more susceptible to other faults. • Erase operations are also susceptible to power loss, since they take much longer to complete than program operations.
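The out-of-place update scheme described above can be sketched with a toy page-level mapping. This is a minimal illustration under broad assumptions, not the design of any real FTL (which are proprietary and far more complex); all names here are invented for the sketch. The key point is that a write touches two pieces of state — the flash page and the cached mapping — and power loss between the two loses the update.

```python
# Toy sketch of FTL-style out-of-place updates (illustrative only).

class ToyFTL:
    """Maps logical page numbers to physical pages; writes never overwrite in place."""

    def __init__(self, num_physical_pages):
        self.mapping = {}                        # logical -> physical (volatile cache!)
        self.flash = [None] * num_physical_pages  # simulated flash pages
        self.next_free = 0                        # naive log-structured allocator

    def write(self, logical_page, data):
        # Step 1: program a fresh physical page.
        phys = self.next_free
        self.next_free += 1
        self.flash[phys] = data
        # Step 2: update the (cached) remapping table.
        # A power fault between step 1 and step 2 can lose this update,
        # leaving the logical page pointing at stale data.
        self.mapping[logical_page] = phys

    def read(self, logical_page):
        phys = self.mapping.get(logical_page)
        return None if phys is None else self.flash[phys]

ftl = ToyFTL(8)
ftl.write(0, b"v1")
ftl.write(0, b"v2")              # remapped to a new physical page
assert ftl.read(0) == b"v2"
assert ftl.flash[0] == b"v1"     # old copy remains on flash until garbage collection
```

The stale copy left behind by each update is also why a lost mapping-table entry can silently resurrect an old version of the data.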

  10. Testing Framework • Types of failures • Bit Corruption • Metadata Corruption • Dead Device • [Diagram: record state before and after the power fault]

  11. Testing Framework • Types of failures • Shorn Writes • Flying Writes • [Diagram: record state before and after the power fault]

  12. Testing Framework • Types of failures • Bit corruption • Half-programmed flash cells are susceptible to bit errors. • Flying writes • Due to corruption of, or missing updates to, the FTL’s remapping tables. • Shorn writes • Because single operations may be internally remapped across multiple flash chips to improve throughput. • Metadata corruption • Because an FTL is a complex piece of software, corruption of its internal state could be problematic. • Unserializable writes • Due to the high degree of parallelism inside an SSD.

  13. Testing Framework • Types of failures • Local consistency • Most of the faults can be detected using local-only data. • Either a record is correct or it is not. • Global consistency • Unserializability is a more complex property. • Whether the result of a workload is serializable depends not only on individual records, but on how they fit into a total order of all the operations.

  14. Testing Framework • Detecting local failures • In order to detect local failures, we need to write records that can be checked for consistency.
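One way to realize such self-checking records is to stamp each record with its target address, a creation timestamp, and a checksum over the body. This is a hedged sketch, not the paper's actual record format — the field layout, sizes, and names below are assumptions for illustration:

```python
# Sketch: self-checking records for detecting local failures.
import struct
import time
import zlib

HEADER = struct.Struct("<QQI")   # (target block address, timestamp_ns, crc32)
RECORD_SIZE = 512                # illustrative record size

def make_record(block_addr, payload, timestamp_ns=None):
    """Build a fixed-size record whose header lets a reader verify it locally."""
    ts = time.time_ns() if timestamp_ns is None else timestamp_ns
    body = payload.ljust(RECORD_SIZE - HEADER.size, b"\0")
    return HEADER.pack(block_addr, ts, zlib.crc32(body)) + body

def check_record(record, expected_addr):
    """Classify a record read back from `expected_addr` using only local data."""
    addr, ts, crc = HEADER.unpack_from(record)
    body = record[HEADER.size:]
    if zlib.crc32(body) != crc:
        return "bit corruption"      # checksum mismatch: flipped bits or shorn write
    if addr != expected_addr:
        return "flying write"        # intact record, but it landed at the wrong address
    return "ok"

rec = make_record(42, b"hello")
assert check_record(rec, 42) == "ok"
assert check_record(rec, 43) == "flying write"
```

A shorn write would typically surface here as a checksum failure, since only part of the record body reaches the flash.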

  15. Testing Framework • Dealing with complex FTLs • Naive padding • Random-number padding • Padding with copies of the header • Some advanced FTLs compress data before writing it. • In order to avoid such compression, we further perform randomization on the regular record format.
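The compression concern above can be demonstrated directly. This is a sketch under the assumption that a compressing FTL behaves roughly like a general-purpose compressor such as zlib: constant padding collapses to almost nothing, while random padding stays essentially incompressible, so the device is forced to physically write every record.

```python
# Sketch: why naive padding defeats the test on a compressing FTL.
import os
import zlib

def naive_padding(n):
    return b"\0" * n          # highly compressible: the FTL may barely write anything

def random_padding(n):
    return os.urandom(n)      # essentially incompressible: forces a real write

assert len(zlib.compress(naive_padding(4096))) < 100    # shrinks to a few bytes
assert len(zlib.compress(random_padding(4096))) > 4000  # stays near full size
```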

  16. Testing Framework • Detecting global failures • Unserializability is not a property of a single record and thus cannot be tested with only local information. • During a power fault, we expect that some FTLs may fail to persist outstanding writes to the flash, or may lose mapping-table updates. • We call such misordered or missing operations unserialized writes.

  17. Testing Framework • Detecting global failures • To detect unserializability, we need information about the completion time of each write. • We make use of the time when the records were created.
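Using the creation times described above, the serializability check can be sketched as follows (an illustration, not the paper's implementation; the function name is invented). For a single-threaded sequential workload, records read back in address order must carry non-decreasing timestamps; any inversion indicates a misordered or lost write.

```python
# Sketch: detecting unserialized writes from per-record creation times.

def find_serialization_errors(timestamps):
    """Return indices of records whose timestamp precedes their predecessor's.

    `timestamps` is the list of creation times of records read back in
    address order after the power fault; an empty result means the
    surviving records are consistent with a serial execution.
    """
    return [i for i in range(1, len(timestamps))
            if timestamps[i] < timestamps[i - 1]]

assert find_serialization_errors([1, 2, 3, 4]) == []
assert find_serialization_errors([1, 3, 2, 4]) == [2]   # record 2 is out of order
```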

  18. Testing Framework • Applying workloads • Random writes • Concurrent sequential writes • Single-threaded sequential writes • Power fault injection • Putting it together

  19. Experimental Results • Experimental Environment • We selected fifteen representative SSDs from five different vendors. • For comparison purposes, we also evaluated two traditional hard drives. • The SSDs and the hard drives are used as raw devices. • No file system is created on the devices. • We use synchronized I/O. • Each write operation does not return until its data is flushed to the device. • This bypasses the buffer cache. • Scenarios • Power fault during concurrent random writes. • Power fault during concurrent sequential writes. • Power fault during single-threaded sequential writes.
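The synchronized, raw-device I/O described above corresponds to opening the block device with O_SYNC. This is a sketch of the setup, not the study's actual harness; the device path is a placeholder, and pointing this at a real disk destroys its contents.

```python
# Sketch: synchronized writes to a raw block device (no file system).
import os

def write_record_sync(device_path, offset, record):
    """Write one record at a byte offset; the call returns only after the
    OS has pushed the data to the device (O_SYNC), bypassing the buffer cache
    in the sense that completion is not acknowledged from RAM alone."""
    fd = os.open(device_path, os.O_WRONLY | os.O_SYNC)
    try:
        os.pwrite(fd, record, offset)
    finally:
        os.close(fd)

# Example (placeholder path -- would overwrite a real device):
# write_record_sync("/dev/sdX", 0, b"record bytes")
```

Note that even an O_SYNC-acknowledged write may still sit in the drive's own volatile write-back cache, which is precisely the window the power-fault experiments probe.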

  20. Experimental Results • Overall Results • We found that 13 out of 15 devices exhibited failures. • In SSD#3, about one third of the data was lost because one third of the device became inaccessible. • In SSD#1, all of its data was lost. What the hell …

  21. Experimental Results • Bit corruption • One common way to deal with bit errors is ECC. • The number of chip-level bit errors under power failure can exceed the correction capability of the ECC. • Shorn writes • This shows that shorn writes are not a rare failure mode under power fault. • Subpage programming
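The ECC point can be made concrete with a simple bit-error counter (a sketch; the correction capability t = 8 is an illustrative value, not a figure from the study). An ECC code protecting a page corrects at most t flipped bits, so counting the Hamming distance between written and read-back data shows when a power-fault-induced burst becomes uncorrectable.

```python
# Sketch: counting bit errors against an ECC correction budget.

def bit_errors(written: bytes, read_back: bytes) -> int:
    """Hamming distance in bits between written and read-back data."""
    return sum(bin(a ^ b).count("1") for a, b in zip(written, read_back))

ECC_CORRECTABLE = 8   # illustrative per-page correction capability (t)

written   = bytes([0b11110000] * 4)
read_back = bytes([0b11110111] * 4)   # 3 flipped bits per byte -> 12 errors
assert bit_errors(written, read_back) == 12
assert bit_errors(written, read_back) > ECC_CORRECTABLE   # ECC cannot recover this page
```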

  22. Experimental Results • Unserializable writes • No relationship between the number of serialization errors and an SSD’s unit price stands out, except for the most expensive SLC device. • Scenarios • 1) uncompleted program 2) FTL 3) old record

  23. Experimental Results • Metadata corruption • After 8 injected power faults, only 69.5% of all the records could be retrieved from SSD#3. • This corruption makes 30.5% of the flash memory space unavailable. • We assume corruption of metadata. • Dead device • After 136 injected power faults, SSD#1 became completely unusable. • All of the data stored on it was lost. • Possible causes: loss of metadata, or a power spike during power loss.
