Lecture 5: Wrap-up RAID Flash memory

Lecture 5: Wrap-up RAID Flash memory Prof. Shahram Ghandeharizadeh Computer Science Department University of Southern California

Mental Block from Last Lecture

Question! Why? Level 4 Level 5

Last Lecture’s Discussion
With RAID 4, why is the performance of small writes D/2G?
To write block b:
(1) Read the old block b and the old parity block ECC 1.
(2) Compute the new parity using the old block b, the new block b, and the old parity: new parity = (old block xor new block) xor old parity ECC 1.
(3) Write the new block b and the new parity block.
Disk 1    Disk 2    Disk 3    Disk 4    Parity
Block a   Block b   Block c   Block d   ECC 1
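The parity update described above can be checked with a short sketch. The 4-byte block contents and the helper names (`xor_blocks`, `full_parity`, `small_write_parity`) are made up for illustration and are not from the lecture.

```python
from functools import reduce


def xor_blocks(x: bytes, y: bytes) -> bytes:
    """Byte-wise XOR of two equal-length blocks."""
    if len(x) != len(y):
        raise ValueError("blocks must have the same length")
    return bytes(p ^ q for p, q in zip(x, y))


def full_parity(blocks):
    """Parity computed from scratch: XOR of every data block in the stripe."""
    return reduce(xor_blocks, blocks)


def small_write_parity(old_block, new_block, old_parity):
    """Parity for a small write: (old block xor new block) xor old parity."""
    return xor_blocks(xor_blocks(old_block, new_block), old_parity)


# Hypothetical stripe: blocks a, b, c, d plus parity ECC 1.
a = b"\x01\x02\x03\x04"
b = b"\x10\x20\x30\x40"
c = b"\xaa\xbb\xcc\xdd"
d = b"\x0f\x0e\x0d\x0c"
ecc1 = full_parity([a, b, c, d])

# Small write of block b: only old b and old ECC 1 are read.
new_b = b"\xff\x00\xff\x00"
new_ecc1 = small_write_parity(b, new_b, ecc1)

# The shortcut agrees with recomputing parity over the whole stripe.
assert new_ecc1 == full_parity([a, new_b, c, d])
```

The point of the shortcut is that blocks a, c, and d are never read, which is why a small write costs two reads and two writes regardless of the group size.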

Small Write with RAID 4
Note that a write to a block on Disk 3 cannot proceed in parallel because the Parity disk is busy! By comparison, a single disk would perform the write of block b without reading it first. With 1 group, Level 4 RAID performs ½ the number of small-write events when compared with 1 disk. With nG groups (nG = D/G), Level 4 RAID performs nG/2 = D/2G events when compared with 1 disk.
Timeline for writing block b: Disk 2 reads old b while the Parity disk reads old ECC1; new parity ECC1 = (old b xor new b) xor old ECC1; Disk 2 then writes new b while the Parity disk writes new ECC1.

RAID 4 Two groups may perform write operations independently.

RAID 5: Resolve the Bottleneck
With Level 5 RAID, different disks may perform different small write operations simultaneously.
Disk 1    Disk 2    Disk 3    Disk 4    Disk 5
Block a   Block b   Block c   Block d   ECC 1
Block e   Block f   Block g   ECC 2     Block h
Block i   Block j   ECC 3     Block k   Block l
Block m   ECC 4     Block n   Block o   Block p
ECC 5     Block q   Block r   Block s   Block t

RAID 5 Example: Write blocks a and f simultaneously and initiate part of the write for block j. All disks are busy reading data and parity blocks. A write requires 4 I/Os.
Timeline on Disks 1 to 5: Read a, Read f, Read ECC 1, Read ECC 2, Read ECC 3; compute the parity blocks for a and f; Write a, Write f, Write ECC 1, Write ECC 2; Read n.

RAID 5
When compared with one disk, a small write on Level 5 RAID performs 4 times as many I/Os. To compare with one disk, divide the total number of operations supported by the disks by 4. Because parity is spread across all disks, the check disks contribute as well.
Total number of small writes for 1 group with 1 check disk: D/4 + 1/4.
With nG groups, there are nG check disks: D/4 + nG*C/4, where nG = D/G.
Substituting nG gives D/4 + (D/4 * C/G).

Level 5 RAID D/4 + (D/4 * C/G)
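As a quick numeric check, the small-write formulas for Level 4 (D/2G) and Level 5 (D/4 + (D/4 * C/G)) can be evaluated side by side. This is a sketch only: the function names and the sample configuration (D = 100, G = 10, C = 1) are assumptions chosen for illustration.

```python
from fractions import Fraction


def raid4_small_writes(D: int, G: int) -> Fraction:
    """Small writes relative to one disk for Level 4: nG/2 with nG = D/G."""
    return Fraction(D, 2 * G)


def raid5_small_writes(D: int, G: int, C: int = 1) -> Fraction:
    """Small writes relative to one disk for Level 5: D/4 + (D/4 * C/G)."""
    return Fraction(D, 4) + Fraction(D, 4) * Fraction(C, G)


# Hypothetical configuration: 100 data disks, 10 data disks per group,
# 1 check disk per group.
D, G, C = 100, 10, 1
level4 = raid4_small_writes(D, G)
level5 = raid5_small_writes(D, G, C)
print(f"Level 4: {float(level4)} small writes relative to one disk")
print(f"Level 5: {float(level5)} small writes relative to one disk")
```

With these sample numbers Level 4 is limited by its 10 parity disks, while Level 5 spreads parity so every disk contributes to small writes.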

A Comment Definitions may appear somewhat arbitrary and far-fetched! Definitions are applied consistently.

Flash Memory Goetz Graefe. The Five-Minute Rule Twenty Years Later, and How Flash Memory Changes the Rules. DaMoN 2007.

Alternative Storage Mediums Magnetic disk drive Flash memory Dynamic Random Access Memory (DRAM)

Flash Memory [Kim et al. 02] Nonvolatile storage medium: stored data persists after power is turned off. Supports random access to data. Comes in two types: NOR: can read/erase/write 1 byte individually. NAND: optimized for mass storage and supports read/erase/write of a block. A block consists of multiple pages. A page is typically 512 bytes. A block is somewhere between 4 KB and 128 KB. Write performance of NAND flash memory is an order of magnitude higher when compared with NOR.

Flash Storage Comes with different interfaces: UFD: USB Flash Disk. Throughput is price dependent; typically quoted at: Read throughput of 8 to 16 MBps. Write throughput of 6 to 12 MBps. Flash memory card: Accessed as memory. Typically byte-accessible. Flash disk: Accessed through a disk interface. Block-accessible. Focus of this paper is on flash disk.

Flash Memory Reads are faster than writes because a write of a page (512 bytes) requires a block to be erased. Sequential writes are fast because the interface has a cache and manages write operations intelligently. Random write operations are slow because of the erase operations and a small cache.

Flash: Sequential Reads/Write [Gray’08] Read/Write performance is sensitive to request size. Read performance is significantly better than write performance. Throughput plateaus at 53 MBps for reads and 35 MBps for writes. Note the higher throughput for flash disk when compared with UFD.

Flash: Random Read/Write [Gray’08] Read performance is comparable to sequential reads. Write performance is very poor: 216 KBps with 8 KB writes (27 requests per second). The poor performance of random writes is being addressed and might have been addressed already! (A fast-moving field!)

Disk & Flash [Gray 08] Disk provides a higher bandwidth with sequential reads/writes. With random reads, flash blows disk away! Why? When one considers power consumption, IOPS/watt of flash is very impressive!

Flash Reliability of flash suffers after 100,000 to 1,000,000 erase-and-write cycles. Less reliable (a lower MTTF) than magnetic disk, assuming a write-intensive workload.

Characteristics RAM is faster than the other two storage mediums. Flash disk consumes less power than disk because there are no moving parts.

Disk and DRAM Question: When does it make economic sense to keep a piece of data resident in DRAM, and when does it make sense to keep it on disk, from which it must be moved to main memory before it is read or written? Assumptions: Fixed-size disk pages, say 4 KB. A 250 GB disk costs $80 and supports 83 page reads per second, so the price per page read per second is about $1. 1 MB of DRAM holds 256 disk pages and costs $0.047, so the cost of a disk page occupying DRAM is $0.000184. If making a page memory resident saves 1 page access per second (a/s), then it saves $1. A good deal. If it saves 0.1 page a/s, then it saves 10 cents, still a good deal. The break-even point is one access every $1/$0.000184 = 5,435 seconds, which is roughly 90 minutes. In 1987, this break-even point was 2 minutes.
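The break-even arithmetic can be reproduced with a few lines. This is a minimal sketch using only the prices and rates quoted on the slide; the function and parameter names are assumptions introduced here.

```python
def break_even_seconds(device_price: float,
                       page_reads_per_second: float,
                       dram_price_per_mb: float,
                       pages_per_mb: int) -> float:
    """Interval between accesses at which caching a page in DRAM costs
    the same as fetching it from the device each time."""
    # Dollars needed to buy one page read per second of device bandwidth.
    price_per_read_per_second = device_price / page_reads_per_second
    # Dollars needed to keep one page resident in DRAM.
    dram_price_per_page = dram_price_per_mb / pages_per_mb
    return price_per_read_per_second / dram_price_per_page


# Slide values: $80 disk, 83 page reads/s, $0.047 per MB of DRAM, 4 KB pages.
disk_seconds = break_even_seconds(80.0, 83.0, 0.047, 256)
print(f"Disk/DRAM break-even: {disk_seconds / 60:.1f} minutes")
```

With unrounded inputs the result lands slightly under 90 minutes; the slide reaches 90 by rounding the price per page read per second up to $1.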

Disk and DRAM: Moral of the story In 2007, pages referenced every 90 minutes should be DRAM resident. In 1987, pages referenced every 5 minutes should be DRAM resident. Key observation: Focus is on memory space and disk bandwidth! Is something missing from this analysis?

Assumed Page Size Matters A larger page size enhances the throughput of a magnetic disk drive. How?

Assumed Page Size Matters A larger page size enhances the throughput of a magnetic disk drive. With small page sizes (1 KB), seek and rotational latency result in a lower disk throughput, and a higher cost per a/s.

Flash and DRAM Question: When does it make economic sense to keep a piece of data resident in DRAM, and when does it make sense to keep it on Flash, from which it must be moved to main memory before it is read or written? Assumptions: Fixed-size disk pages, say 4 KB. A 32 GB Flash disk costs $999 and supports 6200 page reads per second, so the price per page read per second is about $0.16. 1 MB of DRAM holds 256 disk pages and costs $0.047, so the cost of a disk page occupying DRAM is $0.000184. If making a page memory resident saves 1 page access per second (a/s), then it saves $0.16. A good deal. If it saves 0.1 page a/s, then it saves $0.016, still a good deal. The break-even point is one access every $0.16/$0.000184 = 870 seconds, which is roughly 15 minutes. If the price of flash drops to $400, the break-even point is 6 minutes.
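The same calculation can be swept over the two flash prices mentioned on the slide. This is a sketch under the slide's assumptions (6200 page reads per second, $0.047 per MB of DRAM, 256 pages per MB); the names `flash_break_even_minutes` and `FLASH_PRICES` are hypothetical.

```python
READS_PER_SECOND = 6200      # page reads per second quoted for the flash disk
DRAM_PRICE_PER_MB = 0.047    # dollars per MB of DRAM
PAGES_PER_MB = 256           # 4 KB pages per MB

FLASH_PRICES = [999.0, 400.0]  # 2007 price and the anticipated price


def flash_break_even_minutes(flash_price: float) -> float:
    """Break-even access interval, in minutes, for a page cached in DRAM
    instead of being read from flash on every access."""
    price_per_read_per_second = flash_price / READS_PER_SECOND
    dram_price_per_page = DRAM_PRICE_PER_MB / PAGES_PER_MB
    return price_per_read_per_second / dram_price_per_page / 60


for price in FLASH_PRICES:
    minutes = flash_break_even_minutes(price)
    print(f"Flash at ${price:.0f}: break-even every {minutes:.1f} minutes")
```

Because the break-even interval is proportional to the device price, any further drop in flash price shortens the interval by the same factor.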

Flash and DRAM: Moral of the story With 2007 price of $999, pages referenced every 15 minutes should be DRAM resident. With anticipated price of $400, pages referenced every 6 minutes should be DRAM resident. Focus is on DRAM space, and Flash bandwidth! Is something missing from this analysis?

What is Missing? Page size matters (same discussion as DRAM) With flash disk, throughput of reads and writes is asymmetrical – even with sequential reads and writes. A 32 GB Flash disk costs $999 and supports 30 page writes per second. So the price per page write per second is about $33. (For reads, it is 16 cents.)

Disk and Flash Memory With Flash memory, the available flash is accessible in the same manner as DRAM. The read and write performance of Flash memory differs from that of DRAM. One may repeat the analysis to establish a Δ-Minute rule for magnetic disk and flash memory; see the discussion of Table 3.

Possible Software Architectures? Extended buffer pool: Flash is an extension of DRAM. Extended disk: Flash is an extension of disk. Treat DRAM, Flash, and magnetic disk independently using a new cache management technique. Trojan storage manager. This paper focuses on the first two possibilities using LRU to manage their content.

Architecture Choice The choice of an architecture depends on the pattern of usage. This study claims: File systems and operating systems prefer the “extended buffer pool” architecture. DBMSs prefer the “extended disk” architecture. Why?

Usage Pattern
File system/OS: Pointer pages maintain data pages or runs of contiguous pages. Moving a page requires writing the page and the entire pointer page. During recovery, the file system checks the entire storage. Many random I/Os! Extended buffer pool architecture.
DBMS, assuming logging with immediate database modification: Data is stored in B-tree indexes. Writing a page requires appending a few bytes to the log file. The log file is flushed using large sequential write operations. During recovery, the DBMS plays log records sequentially. Large I/Os! Extended disk architecture.

LOG-BASED RECOVERY Initial values: A=1000, B=10. Transaction steps: (1) Read(A) (2) A=A-50 (3) Write(A) (4) Read(B) (5) B=B+50 (6) Write(B) (7) Commit
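The seven steps above can be sketched as a toy log with immediate database modification. Everything here is illustrative: the record layout (`start`, `update` with old and new values, `commit`), the transaction name `T1`, and the functions `run_transfer` and `recover` are assumptions, not the lecture's code.

```python
def run_transfer(db: dict, log: list, crash_before_commit: bool = False) -> None:
    """Move 50 from A to B, appending a log record before each database write."""
    log.append(("start", "T1"))
    old_a = db["A"]                                   # (1) Read(A)
    new_a = old_a - 50                                # (2) A = A - 50
    log.append(("update", "T1", "A", old_a, new_a))   # log first
    db["A"] = new_a                                   # (3) Write(A)
    old_b = db["B"]                                   # (4) Read(B)
    new_b = old_b + 50                                # (5) B = B + 50
    log.append(("update", "T1", "B", old_b, new_b))
    db["B"] = new_b                                   # (6) Write(B)
    if crash_before_commit:
        return                                        # failure before step (7)
    log.append(("commit", "T1"))                      # (7) Commit


def recover(db: dict, log: list) -> None:
    """Redo committed transactions, then undo uncommitted ones."""
    committed = {rec[1] for rec in log if rec[0] == "commit"}
    for rec in log:                                   # redo: forward scan
        if rec[0] == "update" and rec[1] in committed:
            db[rec[2]] = rec[4]                       # install new value
    for rec in reversed(log):                         # undo: backward scan
        if rec[0] == "update" and rec[1] not in committed:
            db[rec[2]] = rec[3]                       # restore old value


# Failure after Write(B) but before Commit: recovery undoes both updates.
db, log = {"A": 1000, "B": 10}, []
run_transfer(db, log, crash_before_commit=True)
recover(db, log)
assert db == {"A": 1000, "B": 10}
```

Note that the log is only appended to, which is the access pattern the usage-pattern slide attributes to a DBMS: small appends flushed as large sequential writes.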

Checkpointing Motivation: In the presence of failures, the system consults the log file to determine which transactions should be redone and which should be undone. There are two major difficulties: (1) the search process is time consuming; (2) most transactions are okay because their updates have already made it to the database, so the system performs wasteful work by searching through and redoing these transactions. Approach: perform a checkpoint that requires the following operations: (1) output all log records from main memory to the disk; (2) output all modified (dirty) pages in the buffer pool to the disk; (3) output a log record <checkpoint> onto the log file on disk.
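A small sketch shows how a <checkpoint> record shortens the search. It makes a simplifying assumption that is not stated on the slide: no transaction is active while the checkpoint is taken, so recovery only needs the log records after the last checkpoint. The record layout and the function names are hypothetical.

```python
def records_after_last_checkpoint(log: list) -> list:
    """Return the suffix of the log that recovery has to examine."""
    last = max(
        (i for i, rec in enumerate(log) if rec[0] == "checkpoint"),
        default=-1,  # no checkpoint yet: examine the whole log
    )
    return log[last + 1:]


def classify_transactions(log: list):
    """Split transactions seen after the last checkpoint into redo and undo lists."""
    tail = records_after_last_checkpoint(log)
    started = [rec[1] for rec in tail if rec[0] == "start"]
    committed = {rec[1] for rec in tail if rec[0] == "commit"}
    redo = [t for t in started if t in committed]
    undo = [t for t in started if t not in committed]
    return redo, undo


# Hypothetical log: T1 finished before the checkpoint and is skipped entirely.
example_log = [
    ("start", "T1"), ("commit", "T1"),
    ("checkpoint",),
    ("start", "T2"), ("commit", "T2"),
    ("start", "T3"),
]
redo_list, undo_list = classify_transactions(example_log)
print("redo:", redo_list, "undo:", undo_list)
```

Handling transactions that are still active at checkpoint time requires recording them in the checkpoint record, which this sketch deliberately leaves out.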

Checkpointing (Cont…) Dirty pages and log records stored on flash storage persist across a failure, so there is no need to flush them to the disk drive. If the DBMS assumes the extended buffer pool architecture, the checkpoint operation will flush data to disk unnecessarily! This is the motivation for the extended disk architecture with transaction-processing applications!

Checkpointing Unsure about the following argument:

B+-TREE A B+-tree is a multi-level, tree-structured directory. A node is a page. A larger node has a higher fan-out, reducing the depth of the tree. The utility of a node is measured by the logarithm of the number of records in the node. A larger node has a higher utility. [Figure: B+-tree with a root, internal nodes, leaf nodes, and the data file.]

B+-Tree Using Flash-Disk hardware combination, page size of 256/512 maximizes utility/time value. Note access time does not change as a function of page size. 3rd column is a log function of the 2nd column.
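The node utility and utility/time measures can be sketched as follows. The record size (20 bytes), node utilization (70%), and the 10 ms access time in the usage line are assumptions chosen for illustration, as are all function names; the slide's table values are not reproduced here.

```python
import math


def records_per_node(page_bytes: int, record_bytes: int = 20,
                     utilization: float = 0.7) -> int:
    """Records that fit in a node (hypothetical record size and fill factor)."""
    return int(page_bytes * utilization / record_bytes)


def node_utility(records: int) -> float:
    """Utility of a node: log2 of the number of records it holds."""
    return math.log2(records)


def utility_per_time(records: int, access_time_ms: float) -> float:
    """Utility gained per millisecond spent fetching the node."""
    return node_utility(records) / access_time_ms


def tree_depth(total_records: int, fanout: int) -> int:
    """Levels needed so that fanout**depth covers total_records."""
    depth, capacity = 1, fanout
    while capacity < total_records:
        capacity *= fanout
        depth += 1
    return depth


# Usage with a 4 KB page and an assumed 10 ms access time.
n = records_per_node(4096)
print(n, round(node_utility(n), 2), round(utility_per_time(n, 10.0), 3))
print("depth for 1,000,000 records at fan-out 100:", tree_depth(1_000_000, 100))
```

Doubling the page size adds only one unit of utility, so a larger page pays off only while the access time grows more slowly than the logarithm.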

B+-tree Using DRAM-Flash combination, a small page size (2 KB) provides the highest utility.

Summary With an extended-disk architecture that requires a page to migrate from DRAM to Flash and then to disk, different B+-tree page sizes should be used with Flash and Disk. SB-trees [O’Neil 1992] support the concept of extents and different page sizes.

EXTERNAL SORTING Sort a 20 page relation assuming a five page buffer pool. Merge sort
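The 20-page, five-buffer example can be sketched as an in-memory simulation of external merge sort. It assumes the common scheme in which pass 0 sorts one buffer-full at a time and each later pass merges up to B-1 runs (one buffer page is reserved for output); the slide's figure may use a different fan-in. The function name and page contents are hypothetical.

```python
import heapq


def external_merge_sort(pages, buffer_pages):
    """Sort a list of pages (each a list of records); return (records, passes)."""
    if buffer_pages < 3:
        raise ValueError("need at least 3 buffer pages to merge 2 runs")
    # Pass 0: read buffer_pages pages at a time and sort them into one run.
    runs = []
    for i in range(0, len(pages), buffer_pages):
        chunk = [rec for page in pages[i:i + buffer_pages] for rec in page]
        runs.append(sorted(chunk))
    passes = 1
    # Merge passes: combine up to (buffer_pages - 1) runs at a time.
    fan_in = buffer_pages - 1
    while len(runs) > 1:
        runs = [
            list(heapq.merge(*runs[i:i + fan_in]))
            for i in range(0, len(runs), fan_in)
        ]
        passes += 1
    return (runs[0] if runs else []), passes


# 20 pages of 2 records each, sorted with a 5-page buffer pool.
relation = [[(37 * i) % 101, (53 * i + 7) % 101] for i in range(20)]
sorted_records, num_passes = external_merge_sort(relation, 5)
print("passes:", num_passes)
```

Pass 0 produces four runs of five pages, and a single 4-way merge finishes the sort; each run is written and read sequentially, which is the access pattern that suits intermediate storage with fast sequential I/O.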

EXTERNAL SORTING Use flash to store intermediate runs: Large sequential reads/writes to flash memory. More energy efficient! Merge sort

References
Gray and Fitzgerald. Flash Disk Opportunity for Server Applications. ACM Queue, July 2008.
Kim et al. A Space-Efficient Flash Translation Layer for CompactFlash Systems. IEEE Transactions on Consumer Electronics, Vol. 48, No. 2, May 2002.
O’Neil, P. The SB-Tree: An Index-Sequential Structure for High-Performance Sequential Access. Acta Inf., 29(3), 1992.