flash memory and ssd

Flash Memory and SSD

Nov. 8, 2010

Sungjoo Yoo

Embedded System Architecture Lab.

ESA, POSTECH, 2010

agenda
Agenda
  • NAND Flash memory
    • Internal operations and reliability
  • Flash memory organization and operations
    • Read, program, and erase
  • ECC (error correction code)
  • Wear leveling, etc.
  • Flash translation layer (FTL)
  • SSD architecture

ESA, POSTECH, 2010

context ssd in notebook and pda
Context: SSD in Notebook and PDA
  • SSD Benefits
    • High performance, low power, and reliability

10 Flash memory chips

ESA, POSTECH, 2010

an example intel ssd
An Example: Intel SSD
  • Especially good for random accesses compared with an HDD

ESA, POSTECH, 2010

nor vs nand summary

[Source: J. Lee, 2007]

NOR vs. NAND Summary
  • NOR Flash
    • Random, direct access interface
    • Fast random reads
    • Slow erase and write
    • Mainly for code storage
  • NAND Flash
    • Block I/O access
    • Higher density, lower cost
    • Better performance for erase and write
    • Mainly for (sequential) data storage

ESA, POSTECH, 2010

area efficiency

[Source: Samsung, 2000]

Area Efficiency
  • Metal contacts in NOR cell are the limiting factor: 2.5X difference in area/cell

ESA, POSTECH, 2010

nand flash memory circuit
NAND Flash Memory Circuit

[A 113mm² 32Gb 3b/cell NAND Flash Memory]

slc vs mlc

[Source: Z. Wu, 2007]

SLC vs. MLC
  • SLC (single level cell) vs. MLC

[Figure: Vth distributions: MLC packs four states (11, 10, 00, 01) into the voltage window in which SLC stores only two (1, 0)]

  • SLC: fast, fewer errors, low bit density
  • MLC: slow, more errors, high bit density

+ ECC (error correction code), e.g., RS, LDPC, BCH, … (e.g., 4-bit ECC for 512B)

ESA, POSTECH, 2010

a zeroing cell to cell interference page architecture

(2008)

  • Three major parasitic effects
    • Background pattern dependency (BPD)
    • Noise
    • Cell-to-cell interference
  • Proposed solutions:
    • Reduce the number of neighbour cells
    • Reduce the Vth shift

A Zeroing Cell-to-Cell Interference Page Architecture

With Temporary LSB Storing and Parallel MSB Program Scheme for MLC NAND Flash Memories

slide12

Features of 32Gb 3b/cell (D3) NAND flash memory with sub-35nm CMOS process (This has reduced cost and increased density compared to conventional chips)

A 113mm² 32Gb 3b/cell NAND Flash Memory

slide13

- Achieving typical 2.3MB/s program performance with an ISPP scheme
- Lowering the page programming current using self-boosting of program-inhibit voltages

A 3.3V 32Mb NAND Flash Memory with Incremental Step Pulse Programming Scheme

[Figure: Vpgm vs. time: ISPP raises Vpgm in incremental steps from pulse to pulse, while the conventional scheme applies a fixed Vpgm]

slide14

- Achieving typical 2.3MB/s program performance with an ISPP scheme
- Lowering the page programming current using self-boosting of program-inhibit voltages

A 3.3V 32Mb NAND Flash Memory with Incremental Step Pulse Programming Scheme

[Figure: distribution of the number of program cycles in a chip; the ISPP device has the narrowest distribution]

[Figure: programmed-cell Vth distributions (0.6V) for the device with ISPP and two devices with fixed Vpgm pulses]

slide16
Circuits and algorithms which result in an approximately 30% increase in program throughput compared to the conventional scheme

A 48nm 32Gb 8-Level NAND Flash Memory with 5.5MB/s Program Throughput

slide18
- Programming method that suppresses the floating-gate coupling effect and achieves a narrow Vth distribution for 16LC (16-level cell)

A 70nm 16Gb 16-Level-Cell NAND flash Memory

slide19
- Programming method that suppresses the floating-gate coupling effect and achieves a narrow Vth distribution for 16LC (16-level cell)

A 70nm 16Gb 16-Level-Cell NAND flash Memory

Vth distribution transition of 16LC

Comparison of Vth distribution

slide20

Features of the 64Gb 4b/cell NAND flash memory in 43nm CMOS (reduced cost and increased density compared to conventional chips)

A 5.6MB/s 64Gb 4b/Cell NAND Flash Memory in 43nm CMOS

nand flash lifetime

[Source: Samsung, 2000]

NAND Flash Lifetime
  • # of erase operations is limited due to degradation → wear leveling & ECC are needed!

ESA, POSTECH, 2010

slide23

Representing raw error data from multi-level-cell devices from four manufacturers, identifying the root-cause mechanisms, and estimating the resulting uncorrectable bit error rate (UBER)

Bit Error Rate in NAND Flash Memories

slide24

Representing raw error data from multi-level-cell devices from four manufacturers, identifying the root-cause mechanisms, and estimating the resulting uncorrectable bit error rate (UBER)

Bit Error Rate in NAND Flash Memories

[Figure: error mechanisms: detrapping and SILC (stress-induced leakage current)]

slide25

Representing raw error data from multi-level-cell devices from four manufacturers, identifying the root-cause mechanisms, and estimating the resulting uncorrectable bit error rate (UBER)

Bit Error Rate in NAND Flash Memories

slide26

Representing raw error data from multi-level-cell devices from four manufacturers, identifying the root-cause mechanisms, and estimating the resulting uncorrectable bit error rate (UBER)

Bit Error Rate in NAND Flash Memories

column dependent errors
Column-Dependent Errors

Redundant bitlines!

Redundant bitlines?

[Swenson, 2009]

agenda1
Agenda
  • NAND Flash memory
    • Internal operations and reliability
  • Flash memory organization and operations
    • Read, program, and erase
  • ECC (error correction code)
  • Wear leveling, etc.
  • Flash translation layer (FTL)
  • SSD architecture

ESA, POSTECH, 2010

flash operations

[Source: J. Lee, 2007]

Flash Operations
  • Operations
    • Read
    • Write or Program
      • Changes a desired state from 1 to 0
    • Erase
      • Changes all the states from 0 to 1
  • Unit
    • Page
      • Read/Write unit (in NAND)
    • Block
      • Erase unit


ESA, POSTECH, 2010

performance comparison

[Source: Micron, 2007e]

Performance Comparison

Small block: tR = 15us, tPROG = 200us → 12.65MB/s for read, 2.33MB/s for program

Large block: tR = 25us, tPROG = 300us → 16.13MB/s for read, 5.20MB/s for program (runtime reduction!)

Data transfer: 33ns*2K = 66us

Note: the same erase time per block!

ESA, POSTECH, 2010

read operation

[Source: Micron, 2006]

Read Operation

[Figure: READ command timing: tR = 25us array-to-register read, followed by serial data output]

ESA, POSTECH, 2010

program with random data input

[Source: Micron, 2006]

Program with Random Data Input
  • Often used for partial page program

ESA, POSTECH, 2010

single page write case
Single Page Write Case
  • Remember “erase-before-write” means “no overwrite”!

(tR + tRC + tWC + tPROG )*(# pages/block) + tERASE

= (25us + 105.6us*2 + 300us)*64 + 2ms

= 36.32ms for a single-page (2KB) write operation
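The figure above is easy to reproduce. A minimal C sketch of the same arithmetic, using the timing parameters from this slide (25us read, 105.6us transfer each way, 300us program, 2ms erase, 64 pages/block); it only illustrates the cost model, it is not controller firmware.

#include <stdio.h>

/* Timing parameters from the slide (microseconds) */
#define T_R      25.0     /* array read per page           */
#define T_RC     105.6    /* read data transfer per page   */
#define T_WC     105.6    /* write data transfer per page  */
#define T_PROG   300.0    /* program per page              */
#define T_ERASE  2000.0   /* block erase                   */
#define PAGES_PER_BLOCK 64

int main(void)
{
    /* Erase-before-write: copy out all pages, erase the block, write all back */
    double per_page = T_R + T_RC + T_WC + T_PROG;
    double total_us = per_page * PAGES_PER_BLOCK + T_ERASE;
    printf("single-page update cost = %.2f ms\n", total_us / 1000.0);  /* ~36.32 ms */
    return 0;
}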

ESA, POSTECH, 2010

internal data move operations

[Source: Micron, 2007c]

Internal Data Move Operations
  • Internal data move
  • Internal data move with random data input

ESA, POSTECH, 2010

slide55

[Source: Micron, 2007c]

[Figure: with the internal data move, the off-chip data transfer through the controller is removed]

ESA, POSTECH, 2010

read status

[Source: Micron, 2007d]

Read Status
  • Read status can be issued during other operations

ESA, POSTECH, 2010

the simplest nand flash controller

[Source: Micron, 2006]

The Simplest NAND Flash Controller
  • NAND Flash Control by Processor

ESA, POSTECH, 2010

command address and data selection
Command, Address, and Data Selection
  • Which operation?

/* Memory-mapped NAND controller registers (addresses from the slide) */
volatile unsigned char *cmd  = (volatile unsigned char *)0xFFF010;
volatile unsigned char *addr = (volatile unsigned char *)0xFFF020;
volatile unsigned char *data = (volatile unsigned char *)0xFFF000;  /* data register */

*cmd  = 0x80;   /* 0x80 = PAGE PROGRAM setup: this sequence starts a program operation */
*addr = ColL;   /* column address, low byte  */
*addr = ColH;   /* column address, high byte */
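For completeness, a hedged sketch of a full page-read sequence over the same memory-mapped registers (0xFFF000 data, 0xFFF010 command, 0xFFF020 address, as above). The command bytes 00h/30h (READ) and 70h (READ STATUS) follow the common NAND command set also referenced later (the 00h-30h read command); the number of address cycles, the status bit, and the polling style vary by device, so treat this as an illustration rather than a ready-made driver.

#include <stdint.h>

#define NAND_DATA ((volatile uint8_t *)0xFFF000)
#define NAND_CMD  ((volatile uint8_t *)0xFFF010)
#define NAND_ADDR ((volatile uint8_t *)0xFFF020)

/* Read one 2KB page into buf (address layout assumed: 2 column + 3 row cycles) */
static void nand_read_page(uint32_t row, uint8_t *buf)
{
    *NAND_CMD  = 0x00;                /* READ setup                                */
    *NAND_ADDR = 0x00;                /* column address, low byte (start of page)  */
    *NAND_ADDR = 0x00;                /* column address, high byte                 */
    *NAND_ADDR = row & 0xFF;          /* row address (page/block), cycle 1         */
    *NAND_ADDR = (row >> 8) & 0xFF;   /* row address, cycle 2                      */
    *NAND_ADDR = (row >> 16) & 0xFF;  /* row address, cycle 3                      */
    *NAND_CMD  = 0x30;                /* READ confirm: start the tR array read     */

    *NAND_CMD = 0x70;                 /* READ STATUS: poll until the die is ready  */
    while ((*NAND_DATA & 0x40) == 0)  /* bit 6 = ready/busy in the status byte     */
        ;
    *NAND_CMD = 0x00;                 /* switch back to data output mode           */

    for (int i = 0; i < 2048; i++)    /* clock out the page data                   */
        buf[i] = *NAND_DATA;
}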

ESA, POSTECH, 2010

agenda2
Agenda
  • NAND Flash memory
    • Internal operations and reliability
  • Flash memory organization and operations
    • Read, program, and erase
  • ECC (error correction code)
  • Wear leveling, etc.
  • Flash translation layer (FTL)
  • SSD architecture

ESA, POSTECH, 2010

slc vs mlc1

[Source: Z. Wu, 2007]

SLC vs. MLC
  • SLC (single level cell) vs. MLC

[Figure: Vth distributions: MLC packs four states (11, 10, 00, 01) into the voltage window in which SLC stores only two (1, 0)]

  • SLC: fast, fewer errors, low bit density
  • MLC: slow, more errors, high bit density

+ ECC (error correction code), e.g., RS, LDPC, BCH, … (e.g., 4-bit ECC for 512B)

ESA, POSTECH, 2010

mlc vs slc characteristics

[Source: Micron, 2006]

MLC vs. SLC: Characteristics
  • MLC
    • 2x inferior performance to SLC
    • 10x shorter lifetime than SLC

ESA, POSTECH, 2010

nand flash lifetime1

[Source: Samsung, 2000]

NAND Flash Lifetime
  • # of erase operations is limited due to degradation → wear leveling & ECC are needed!

ESA, POSTECH, 2010

column dependent errors1
Column-Dependent Errors

Redundant bitlines!

Redundant bitlines?

[Swenson, 2009]

two methods to enhance endurance effective capacity
Two Methods to Enhance Endurance (& Effective Capacity)
  • ECC (error correction code)
  • Wear leveling

ESA, POSTECH, 2010

ecc algorithms

[Source: Micron, 2006]

ECC Algorithms
  • SLC: Hamming
  • MLC: RS, BCH, LDPC, etc.

ESA, POSTECH, 2010

hamming code of a single data

[Source: Micron, 2007f]

Hamming Code of A Single Data

Old ECC is computed from the original data (e.g., 01010001) by XORing groups of bits (e.g., 0^0^0^1, 1^1^0^1, 0^1^0^0)

Corruption: 01010001 → 01010101

New ECC is computed with the corrupted data

Error detection by XORing the old and new ECCs

1-bit error location (= correction) by XORing the old and new odd ECCs

2^n data bits → 2*n bits for ECC
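One compact way to see the "XOR the old and new ECCs" step is the parity-over-bit-index trick below: it locates a single flipped bit, which is the essence of the Hamming scheme sketched above. This is a simplified illustration in C (it ignores the even/odd split and the row/column layout of the Micron code) and assumes a 256-byte buffer.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* XOR together the bit indices of all '1' bits; flipping one bit at index j
 * changes this value by exactly j, so old_ecc ^ new_ecc == j. */
static uint32_t index_parity_ecc(const uint8_t *data, int len)
{
    uint32_t ecc = 0;
    for (int i = 0; i < len * 8; i++)
        if (data[i / 8] & (1u << (i % 8)))
            ecc ^= (uint32_t)i;
    return ecc;
}

int main(void)
{
    uint8_t buf[256];
    memset(buf, 0xA5, sizeof buf);

    uint32_t old_ecc = index_parity_ecc(buf, sizeof buf);  /* stored in the spare area */

    buf[100] ^= 0x08;                 /* corrupt one bit: byte 100, bit 3 -> index 803 */

    uint32_t new_ecc = index_parity_ecc(buf, sizeof buf);
    uint32_t syndrome = old_ecc ^ new_ecc;                 /* non-zero: error detected */
    printf("flipped bit index = %u (byte %u, bit %u)\n",
           syndrome, syndrome / 8, syndrome % 8);          /* location = correction */
    return 0;
}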

ESA, POSTECH, 2010

spare area to store ecc

[Source: Micron, 2007g]

Spare Area to Store ECC
  • Total 24 bits (= 18 + 6)
    • Byte address: 512 = 2^9 → 2*9 = 18 bits; bit address: 8 = 2^3 → 2*3 = 6 bits

ESA, POSTECH, 2010

ecc algorithms1

[Source: Micron, 2006]

ECC Algorithms
  • SLC: Hamming
  • MLC: RS, BCH, LDPC, etc.

ESA, POSTECH, 2010

ecc for mlc example

[Source: Z. Wu, 2007]

ECC for MLC (Example)
  • 4 level (2 bit) cell vs. 8 level (3 bit) cell
  • (511, 451) Reed-Solomon code
  • Applied to 8 level cell
  • Code rate = 451/511 = 0.883
  • Effective bits = 3*451/511 ~ 2.6 bits/cell
  • >30% better capacity than 2 bit MLC

ESA, POSTECH, 2010

effective error correction level
Effective Error Correction Level
  • ECC Strength, BER, and BLER

[Deal, 2009]

ecc strength and parity bytes bch code c ase
ECC Strength and Parity Bytes: BCH Code Case
  • Parity data size
    • n bit data, e.g., 8191b
    • t bit error correction
    • ~(log2n+1)*t bits for parity
      • E.g., 1024B-ECC32: (13+1)*32 bits = 56B

[Deal, 2009]
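The parity-size rule of thumb above turns into a few lines of arithmetic. A small C sketch, assuming the slide's approximation of roughly (log2(n)+1)*t parity bits for n data bits and t correctable bit errors.

#include <stdio.h>

/* Approximate BCH parity size: (floor(log2(n_data_bits)) + 1) * t bits */
static unsigned bch_parity_bits(unsigned long n_data_bits, unsigned t)
{
    unsigned m = 0;
    while ((1ul << (m + 1)) <= n_data_bits)   /* m = floor(log2(n_data_bits)) */
        m++;
    return (m + 1) * t;
}

int main(void)
{
    /* 1024B data, 32-bit correction: (13 + 1) * 32 = 448 bits = 56 B */
    unsigned bits = bch_parity_bits(1024ul * 8, 32);
    printf("parity = %u bits = %u bytes\n", bits, bits / 8);
    return 0;
}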

copyback to enhance data integrity

[Source: Micron, 2008]

Copyback to Enhance Data Integrity
  • If the # of errors in a block increases, read the data and re-write it (to the same or another block)
    • Similar to self refresh in DRAM

ESA, POSTECH, 2010

agenda3
Agenda
  • NAND Flash memory
    • Internal operations and reliability
  • Flash memory organization and operations
    • Read, program, and erase
  • ECC (error correction code)
  • Wear leveling, etc.
  • Flash translation layer (FTL)
  • SSD architecture

ESA, POSTECH, 2010

wear leveling

[Source: Micron, 2006e]

Wear Leveling
  • Two scenarios of MLC NAND Flash usage (512MB, 4096 blocks, 10k erase cycles)
    • Workload: update 6 files/hour, 50 blocks/file, 24 hours/day
    • Scenario 1 (no wear leveling): only 200 blocks are used
    • Scenario 2 (wear leveling): all 4096 blocks are evenly used (a rough lifetime comparison is sketched below)
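A back-of-the-envelope comparison of the two scenarios, assuming the workload above translates directly into block erases (6 files/hour * 50 blocks/file * 24 hours = 7,200 block erases per day) and 10k-cycle endurance; the numbers are my own arithmetic, not Micron's.

#include <stdio.h>

int main(void)
{
    const double erases_per_day = 6.0 * 50.0 * 24.0;  /* 7,200 block erases/day */
    const double endurance      = 10000.0;            /* erase cycles per block */

    /* Scenario 1: the same 200 blocks absorb all erases */
    double days_no_wl = 200.0 * endurance / erases_per_day;

    /* Scenario 2: wear leveling spreads erases over all 4096 blocks */
    double days_wl = 4096.0 * endurance / erases_per_day;

    printf("no wear leveling  : ~%.0f days\n", days_no_wl);      /* ~278 days   */
    printf("with wear leveling: ~%.0f days (~%.1f years)\n",
           days_wl, days_wl / 365.0);                            /* ~15.6 years */
    return 0;
}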

ESA, POSTECH, 2010

wear leveling methods
Wear Leveling Methods
  • Track erase cycles per block
  • Try to achieve uniform erases over blocks
  • Hot-cold swapping-based methods
  • K-leveling
hot cold swapping
Hot-Cold Swapping
  • If the erase count (EC) of a log block reaches a limit, the youngest (least-erased) data block becomes the new log block, and the contents of the two blocks are swapped

Log block

k leveling
K-Leveling

[Figure: per-block erase counts, each around 2K erases; K-leveling bounds the spread of erase counts across blocks]

K-leveling: An Efficient Wear-leveling Scheme for Flash Memory, 2006
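A minimal sketch of the K-leveling idea: when the gap between the most- and least-erased blocks exceeds K, move the cold data of the least-erased block onto the most-erased block so that future erases land on the less-worn block. The data structures, the value of K, and the migrate_block() helper are illustrative assumptions, not taken from the paper.

#include <stdio.h>

#define NUM_BLOCKS 4096
#define K          64          /* allowed spread of erase counts (assumed) */

static unsigned erase_count[NUM_BLOCKS];

/* Hypothetical helper: move the (cold) contents of 'src' into 'dst'. */
static void migrate_block(int src, int dst) { (void)src; (void)dst; }

static void k_level(void)
{
    int min_b = 0, max_b = 0;
    for (int b = 1; b < NUM_BLOCKS; b++) {
        if (erase_count[b] < erase_count[min_b]) min_b = b;
        if (erase_count[b] > erase_count[max_b]) max_b = b;
    }
    if (erase_count[max_b] - erase_count[min_b] > K) {
        /* Park cold data on the worn block; the young block rejoins the free pool */
        migrate_block(min_b, max_b);
        printf("swap cold data: block %d -> block %d\n", min_b, max_b);
    }
}

int main(void)
{
    erase_count[7] = 2100;      /* an artificially worn block */
    k_level();
    return 0;
}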

initial bad block identification

[Source: Micron, 2006b]

Initial Bad Block Identification

  • Performed on boot-up
  • Check the byte at column offset 2048 (the first spare-area byte) of pages 0 and 1 in each block
    • If both bytes are 0xFF, it is a good block; otherwise, record it as an initial bad block
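A sketch of the boot-time scan described above, written around a hypothetical nand_read_byte(block, page, column) helper (a real implementation would issue READ commands through the controller registers shown earlier). Column 2048 is the first spare-area byte of a 2KB page.

#include <stdint.h>
#include <stdio.h>

#define NUM_BLOCKS   2048
#define SPARE_OFFSET 2048    /* first byte of the spare area in a 2KB page */

static uint8_t bad_block_table[NUM_BLOCKS];

/* Hypothetical helper: return the byte at 'column' of 'page' in 'block'. */
static uint8_t nand_read_byte(int block, int page, int column)
{
    (void)block; (void)page; (void)column;
    return 0xFF;             /* stub: pretend every block reads back as good */
}

static void scan_factory_bad_blocks(void)
{
    for (int b = 0; b < NUM_BLOCKS; b++) {
        uint8_t m0 = nand_read_byte(b, 0, SPARE_OFFSET);
        uint8_t m1 = nand_read_byte(b, 1, SPARE_OFFSET);
        /* Factory marking: anything other than 0xFF in pages 0 and 1 means bad */
        bad_block_table[b] = (m0 == 0xFF && m1 == 0xFF) ? 0 : 1;
    }
}

int main(void)
{
    scan_factory_bad_blocks();
    int bad = 0;
    for (int b = 0; b < NUM_BLOCKS; b++) bad += bad_block_table[b];
    printf("initial bad blocks: %d\n", bad);
    return 0;
}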

ESA, POSTECH, 2010

block degradation and tracking at runtime

[Source: Micron, 2006b]

Block Degradation and Tracking at Runtime
  • Important to track the blocks that go bad during normal device operation
  • When and how to check?
    • Issue a Read Status command after any Erase and Program operation
  • Two types of failure
    • Permanent: add the block to the bad block table
    • Temporary
      • Program disturb: e.g., neighbor page data are corrupted
      • Read disturb: e.g., due to too many reads of a single page
      • Over-programming: e.g., all data look like ‘0’
      • Data loss: e.g., due to charge loss or gain
      • Solution: erase the corresponding block and re-program it

ESA, POSTECH, 2010

agenda4
Agenda
  • NAND Flash memory
    • Internal operations and reliability
  • Flash memory organization and operations
    • Read, program, and erase
  • ECC (error correction code)
  • Wear leveling
  • Flash translation layer (FTL)
  • SSD architecture

ESA, POSTECH, 2010

typical flash storage

[Source: J. Lee, 2007]

Typical Flash Storage
  • Both the # of Flash I/O ports and the controller technology determine Flash performance

[Figure: Host (PC) ↔ I/O interface (USB, IDE, PCMCIA) ↔ controller ↔ NAND Flash; the FTL runs on the controller (e.g., Intel SSD)]

ESA, POSTECH, 2010

flash translation layer ftl

[Source: J. Lee, 2007]

Flash Translation Layer (FTL)
  • A software layer emulating a standard block-device read/write interface
  • Features
    • Sector mapping
    • Garbage collection
    • Power-off recovery
    • Bad block management
    • Wear-leveling
    • Error correction code (ECC)

ESA, POSTECH, 2010

single page write case1
Single Page Write Case
  • Remember “erase-before-write” means “no overwrite”!

(tR + tRC + tWC + tPROG )*(# pages/block) + tERASE

= (25us + 105.6us*2 + 300us)*64 + 2ms

= 36.32ms for a single-page (2KB) write operation

ESA, POSTECH, 2010

replacement block scheme ban 1995
Replacement Block Scheme [Ban, 1995]
  • In-place scheme
    • Keep the same page index in data and update blocks

Called data (D) block; called update (U) block (or log block)

[Figure: each page write goes to a free page of the U block instead of rewriting the D block]

Previously, two single-page write operations take 2 x 36.32ms = 72.63ms

Now, two single-page write operations take 2 x (tWC + tPROG) = 2 x (105.6us + 300us) = 0.81ms

→ ~90X reduction!

ESA, POSTECH, 2010

replacement block scheme ban 19951
Replacement Block Scheme [Ban, 1995]
  • In-place scheme
    • Keep the same page index in data and log blocks

[Figure: D block with two update blocks, U block 1 and U block 2]

Advantage: simple

Disadvantages: U-block utilization is low; violates the sequential-write constraint

ESA, POSTECH, 2010

log buffer based scheme kim 2002
Log Buffer-based Scheme [Kim, 2002]
  • In-place (linear mapping) vs. out-of-place (remapping) schemes

[Figure: D block with U blocks under the in-place scheme vs. the out-of-place scheme]

In-place scheme
+ No need to manage complex mapping info
- Low U block utilization
- Violation of the sequential write constraint

Out-of-place scheme
+ High U block utilization
+ Sequential writes
- Mapping information needs to be maintained

ESA, POSTECH, 2010

garbage collection gc
Garbage Collection (GC)

D block

U block 1

U block 2

No more U block!

 Perform garbage collection

to reclaim U block(s) by erasing blocks with

many invalid pages

ESA, POSTECH, 2010

three types of garbage collection

[Kang, 2006]

Three Types of Garbage Collection
  • Which one will be the most efficient?

ESA, POSTECH, 2010

garbage collection overhead
Garbage Collection Overhead

Full merge cost calculation

Assumptions:
  • 64 pages/block
  • tRC = tWC = 100us, tPROG = 300us, tERASE = 2ms
  • Max # of valid page copies = 64, # of block erases = 3

[Figure: full merge: the valid pages of the D block and U blocks 1 and 2 are copied into a free block, and the three old blocks are erased]

Runtime cost = 64*(tRC + tWC + tPROG) + 3*tERASE = 64*(100us*2 + 300us) + 3*2ms = 38ms

Valid page copies may dominate the runtime cost → minimize the # of valid page copies
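For contrast with the 38ms full merge, a switch merge reclaims a log block with a single erase and no page copies, and a partial merge copies only the remaining valid pages. A small C sketch of that comparison using the assumptions above; the partial-merge page count is an assumed example, and the three merge types are the ones referred to on the previous slide.

#include <stdio.h>

#define T_RC    100.0    /* us, read transfer   */
#define T_WC    100.0    /* us, write transfer  */
#define T_PROG  300.0    /* us, page program    */
#define T_ERASE 2000.0   /* us, block erase     */

/* Generic merge cost: copy 'copies' valid pages, erase 'erases' blocks */
static double merge_cost_us(int copies, int erases)
{
    return copies * (T_RC + T_WC + T_PROG) + erases * T_ERASE;
}

int main(void)
{
    printf("full merge   : %.1f ms\n", merge_cost_us(64, 3) / 1000.0);  /* 38.0 ms */
    printf("partial merge: %.1f ms\n", merge_cost_us(16, 1) / 1000.0);  /* 16 remaining valid pages (assumed) */
    printf("switch merge : %.1f ms\n", merge_cost_us(0, 1) / 1000.0);   /*  2.0 ms: erase only */
    return 0;
}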

ESA, POSTECH, 2010

three representative methods of flash translation layer
Three Representative Methods of Flash Translation Layer
  • FAST [Lee, 2007]
    • Two types of log block
      • A sequential write log block to maximize switch merges
      • Random write log blocks cover the other write accesses
  • Superblock [Kang, 2006]
    • A group of blocks is managed as a superblock
    • Linear address mapping is broken within a superblock to reduce # of valid page copies in GC
  • LAST [Lee, 2008]
    • Two partitions in random write log blocks
      • Hot partition → more dead blocks → reduction in full merges

ESA, POSTECH, 2010

profiling conditions
Profiling Conditions
  • Samsung Sens X360, 64GB SSD (Samsung)
  • MobileMark 2007: Reader, Productivity, and DVD
  • busTrace v8.0: http://www.bustrace.com
  • 2 hours trace
    • 2 hours from the start of MobileMark
summary of profiling results
Summary of Profiling Results

Reader:       total read/write 217MB   (read 97MB,    write 120MB)

Productivity: total read/write 2,878MB (read 1,939MB, write 939MB)

DVD:          total read/write 931MB   (read 608MB,   write 323MB)

detailed view productivity scenario
Detailed View: Productivity Scenario
  • The local view is not as complex as the global view
clear difference of access patterns in rd wr lengths
Clear Difference of Access Patterns in RD/WR Lengths
  • Long sequential RDs/WRs (>=128 sectors) are concentrated on some specific durations (e.g., program load)
  • Short sequential and random RDs/WRs are distributed over the entire period
locality detection random vs sequential

[Lee, 2008]

Locality Detection:Random vs. Sequential
  • Observations
    • Short requests are very frequent (a)
    • Short requests tend to access random locations (b)
    • Long requests tend to access sequential locations (b)
    • Threshold of randomness
      • 4KB from experiments

ESA, POSTECH, 2010

slide104

[Lee, 2008]

LAST
  • Observations
    • Typical SSD accesses contain both random and sequential traffic
    • Random traffic can be classified into hot and cold

ESA, POSTECH, 2010

last scheme

[Lee, 2008]

LAST Scheme

ESA, POSTECH, 2010

last scheme1

[Lee, 2008]

LAST Scheme

< 4KB

>= 4KB

ESA, POSTECH, 2010

why hot and cold

[Lee, 2008]

Why Hot and Cold?
  • Observation
    • A large fraction of invalid pages (>50%) occupies the random log buffer space
    • They are mostly caused by hot pages
  • Problem
    • Invalid pages are distributed over random log buffer space, which causes full merges (expensive!)

ESA, POSTECH, 2010

aggregating invalid pages due to hot pages
Aggregating Invalid Pages due to Hot Pages
  • An example trace
    • 1, 4, 3, 1, 2, 7, 8, 2, 1, …
  • A single random-buffer partition suffers from invalid pages distributed across its space
  • In the LAST method, the hot partition aggregates invalid pages → full merges can be reduced, and they are also delayed

ESA, POSTECH, 2010

temporal locality hot or cold
Temporal Locality: Hot or Cold?
  • Update interval (calculated for each page access)

= Current page access time – last page access time

  • If update interval < k (threshold)
    • Hot (means frequent writes)
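Putting the last few slides together, a minimal sketch of the locality detector: requests of at least 4KB go to the sequential log buffer, and smaller (random) requests are classified as hot when their update interval is below the threshold k. The request fields, the time unit, and the value of k are illustrative assumptions.

#include <stdint.h>
#include <stdio.h>

#define SEQ_THRESHOLD_BYTES 4096   /* from the slides: >= 4KB is sequential        */
#define HOT_INTERVAL_K      100    /* threshold k, arbitrary time units (assumed)  */

enum partition { SEQUENTIAL, RANDOM_HOT, RANDOM_COLD };

/* last_write_time[page] = time of the previous write to that logical page */
static uint64_t last_write_time[1 << 16];

static enum partition classify(uint32_t page, uint32_t bytes, uint64_t now)
{
    if (bytes >= SEQ_THRESHOLD_BYTES)
        return SEQUENTIAL;

    uint64_t interval = now - last_write_time[page];   /* update interval */
    last_write_time[page] = now;
    return (interval < HOT_INTERVAL_K) ? RANDOM_HOT : RANDOM_COLD;
}

int main(void)
{
    printf("%d\n", classify(42, 8192, 400));   /* SEQUENTIAL (0)                    */
    printf("%d\n", classify(7,   512, 500));   /* RANDOM_COLD (2): long interval    */
    printf("%d\n", classify(7,   512, 550));   /* RANDOM_HOT (1): re-written quickly */
    return 0;
}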

ESA, POSTECH, 2010

last scheme2

[Lee, 2008]

LAST Scheme

< 4KB

>= 4KB

ESA, POSTECH, 2010

experimental results
Experimental Results
  • Full merge cost is significantly reduced by LAST
  • Many dead blocks are created → GC with the lowest cost (only an erase is needed)

ESA, POSTECH, 2010

agenda5
Agenda
  • NAND Flash memory
    • Internal operations and reliability
  • Flash memory organization and operations
    • Read, program, and erase
  • ECC (error correction code)
  • Wear leveling
  • Flash translation layer (FTL)
  • SSD architecture

ESA, POSTECH, 2010

ssd architecture
SSD Architecture
  • Multi-channel SSD architecture
  • High speed interface
  • RAID (redundant array of independent disks)
  • PRAM for SSD
ssd architecture terminology 10 channels 8 ways channel
SSD Architecture Terminology: 10 channels & 8 ways/channel

[Figure: the host connects over SATA2 to the controller, which drives 10 channels (Ch 1 .. Ch 10) with 8 NAND dies per channel]

Channel = Flash memory chip

Way = Flash memory die

ESA, POSTECH, 2009

single channel ssd
Single Channel SSD

[Figure: the host connects to the controller, which drives a single channel (Ch 1) of 8 NAND chips (Flash 0 .. Flash 7)]

LA (logical address) is divided into Flash ID, Block ID, Page ID, and Sector ID fields

[Timing: IO (channel) transfers to the 8 chips are serialized on the one channel, while each Program (way) proceeds in parallel]

Single channel case: 8*63.33us + 300us = 806.64us

multi channel ssd
Multi-Channel SSD

[Figure: the host connects to the controller, which drives two channels (Ch 1, Ch 2) with 4 NAND chips each]

LA (logical address) is divided into Flash ID, Block ID, Page ID, and Sector ID fields

Multi-channel / multi-way SSD controller architecture

Single channel case: 8*63.33us + 300us = 806.64us (63.33us data transfer per page, 300us program)

Two channel case: 4*63.33us + 300us = 553.32us

→ Multi-channel effect
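Both timing figures follow from a simple model: page data for the chips on a channel is transferred serially (63.33us per page), while the 300us programs overlap across chips. A small C sketch of that model; it ignores command overhead and assumes the pages are spread evenly over the channels.

#include <stdio.h>

#define T_XFER_US 63.33   /* per-page data transfer over one channel            */
#define T_PROG_US 300.0   /* per-chip page program (overlaps across chips)      */

/* Time to program one page on each of 'chips' chips spread over 'channels' channels */
static double program_burst_us(int chips, int channels)
{
    int per_channel = chips / channels;
    return per_channel * T_XFER_US + T_PROG_US;
}

int main(void)
{
    printf("1 channel : %.2f us\n", program_burst_us(8, 1));  /* 8*63.33 + 300 = 806.64 */
    printf("2 channels: %.2f us\n", program_burst_us(8, 2));  /* 4*63.33 + 300 = 553.32 */
    return 0;
}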

notice
Notice
  • Most of the following slides are from ONFI, www.onfi.com
sequential reads by single core
Sequential Reads by Single Core
  • Single core does not utilize Flash chip-level parallelism
  • Flash I/O needs to be performed sequentially

[Figure: 8 Flash chips (8X) on one channel; the LA (logical address) is divided into Flash ID, Block ID, Page ID, and Sector ID fields]

Note: a Flash page read command (00h-30h) consumes only a few clock cycles

[Timing: the 60us data output of each chip (Flash 0 .. Flash 7) is serialized on the single channel]

ddr memory operation sdram ddr and ddr2 3
DDR Memory Operation: SDRAM, DDR, and DDR2/3
  • SDR SDRAM
  • DDR: use both clock edges
  • DDR2: increase the clock speed of the memory interface (max bus clock 400MHz)
  • DDR3: max bus clock 800MHz

performance improvement by high speed nand interface
Performance Improvement by High Speed NAND Interface
  • Single core does not utilize Flash chip-level parallelism
  • Flash I/O needs to be performed sequentially

[Figure: 8 Flash chips (8X) on one channel; the LA (logical address) is divided into Flash ID, Block ID, Page ID, and Sector ID fields]

[Timing: the per-chip data output time drops from 60us to 21us with the faster interface]

33MHz → 100MHz gives a significant performance improvement
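The 60us and 21us figures are roughly the page transfer time at the two interface speeds. A quick check in C, assuming a 2112-byte page (2KB data + 64B spare) moved one byte per interface clock; real devices differ in page size and bus width.

#include <stdio.h>

int main(void)
{
    const double page_bytes = 2112.0;       /* 2KB data + 64B spare */
    double t_33  = page_bytes / 33.0;       /* us at 33 MHz, 1 byte/cycle: ~64 us */
    double t_100 = page_bytes / 100.0;      /* us at 100 MHz: ~21 us */
    printf("33 MHz : %.1f us per page\n", t_33);
    printf("100 MHz: %.1f us per page\n", t_100);
    return 0;
}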

summary of high speed interface
Summary of High Speed Interface
  • A fast NAND interface improves SSD performance, especially read performance
  • Currently, two camps with similar solutions
    • ONFI: Intel, Hynix, Micron, SanDisk, …
    • Toggle-mode DDR NAND: Samsung, …
agenda6
Agenda
  • What is RAID?
  • State-of-the-art SSD RAID
slide145

[Patterson, ‘88]

RAID
  • High performance & reliability
    • Parallel accesses to disks
    • ECC or parity in check disks enable error detection/correction
ssd architecture terminology 10 channels 8 ways channel1

[Patterson, ‘88]

SSD Architecture Terminology: 10 channels & 8 ways/channel

[Figure: the host connects over SATA2 to the controller, which drives 10 channels (Ch 1 .. Ch 10) with 8 NAND dies per channel]

Channel = Flash memory chip

Way = Flash memory die

Way, channel, or SSD = disk in RAID

ESA, POSTECH, 2009

sector and bit level interleaving

[Patterson, ‘88]

Sector and Bit Level Interleaving

Sector level interleaving: each sector is stored whole on one data disk; the check disk P holds the parity of the corresponding sectors
  • Pros: independent small reads
  • Cons: slightly more difficult parity calculation (a small write requires 2R + 2W)

Bit level interleaving: each sector is spread bit by bit across the data disks; P holds the bit-wise parity
  • Pros: easy parity/ECC calculation
  • Cons: low small-read performance; slow-down due to lack of synchronization among disks

slow down due to lack of synchronization among disks

[Patterson, ‘88]

Slow Down due to Lack of Synchronization among Disks

Bit level interleaving; C = check disk (note: latency values)

disk 1  disk 2  disk 3  disk 4   C    stripe latency
  5       2       1       5      2    MAX(5, 2, 1, 5, 2) = 5
  2       6       1       3      3    MAX(2, 6, 1, 3, 3) = 6
  7       3       3       1      1    MAX(7, 3, 3, 1, 1) = 7
  1       1       6       5      2    MAX(1, 1, 6, 5, 2) = 6

- Each disk has a different latency due to the different position of its disk head
- The maximum latency among the disks determines the total (read) performance
- The slow-down problem can be especially severe for small reads, since sequential reads can average out inter-disk latency differences

raid 1 2

[Patterson, ‘88]

RAID 1 & 2
  • RAID 1: mirrored disks
    • Too expensive
  • RAID 2: check disks with ECC (Hamming code)
    • Bit level interleaving
    • Low small read performance
    • High check disk cost
raid 2

[Patterson, ‘88]

RAID 2
  • High cost → RAID 3
    • Hamming code requires several check disks
      • 10 data disks → 4 check disks
      • 25 data disks → 5 check disks
  • Low small read performance → RAID 4
    • On a small read, all the disks in a group need to be read, including the check disks

[Figure: bit-level interleaving: data bits striped across the data disks (1 0 1 0 / 0 1 1 1 / 0 0 0 1 / 1 1 1 0) plus ECC check disks]

raid 3

[Patterson, ‘88]

RAID 3
  • Only one check disk per group for parity
  • Each data disk can detect an error
  • In case of error, the data from the other data disks and the parity are used to correct the error
  • Still bit level interleaving
raid 2 to 3

[Patterson, ‘88]

RAID 2 to 3

RAID 2: data striped bit-wise across four data disks (1 0 1 0 / 0 1 1 1 / 0 0 0 1 / 1 1 1 0) plus multiple ECC check disks

RAID 3: the same data disks with a single parity disk P = bit-wise XOR of the data disks = 0 0 1 0

Each disk has its own parity (per-disk error detection), so a single check disk is enough for correction

raid 4

[Patterson, ‘88]

RAID 4
  • Sector level interleaving
  • Independent reads are possible → high small-read performance
  • However, still low small write performance since the check disk is the bottleneck
raid 41

[Patterson, ‘88]

RAID 4

[Figure: sector level interleaving: each sector is stored whole on a data disk, and the corresponding parity sectors are on the check disk P]

Small read performance improves

New parity = old parity ^ old data ^ new data
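The parity-update rule above is what makes a RAID-4/5 small write cost two reads and two writes (old data and old parity in, new data and new parity out). A minimal sketch over in-memory byte arrays, assuming four data disks and one parity disk purely for illustration.

#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define NDISKS  4
#define STRIPE  8          /* bytes per disk per stripe (toy size) */

static uint8_t data[NDISKS][STRIPE];
static uint8_t parity[STRIPE];

/* Small write to one disk: read old data + old parity, write new data + new parity */
static void small_write(int disk, const uint8_t *newdata)
{
    for (int i = 0; i < STRIPE; i++) {
        uint8_t old = data[disk][i];               /* read old data                              */
        parity[i] = parity[i] ^ old ^ newdata[i];  /* new parity = old parity ^ old data ^ new data */
        data[disk][i] = newdata[i];                /* write new data (and new parity)            */
    }
}

int main(void)
{
    uint8_t buf[STRIPE];
    memset(buf, 0x5A, STRIPE);
    small_write(2, buf);

    /* Invariant: parity is still the XOR of all data disks */
    uint8_t check = 0;
    for (int d = 0; d < NDISKS; d++) check ^= data[d][0];
    printf("parity ok: %s\n", (check == parity[0]) ? "yes" : "no");
    return 0;
}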

raid 5

[Patterson, ‘88]

RAID 5
  • Parity is spread over all the disks
  • Parity access is not the bottleneck in small write performance
  • Sector level interleaving
differential ssd raid microsoft 2009
Differential SSD RAID [Microsoft, 2009]
  • Reliability vs. performance in SSD RAID
  • RAID 4 case example
differential raid
Differential RAID
  • Skewed parity assignment to disks
  • RAID 4 : (100%, 0, 0, 0, 0, 0)
  • RAID 5 : (20%, 20%, 20%, 20%, 20%)
  • Diff-RAID : (1, 5, 5, 5, 5, 5)
differential raid1
Differential RAID
  • RAID 4 and 5 are two extremes
    • RAID 4 : (100%, 0, 0, 0, 0, 0)
    • RAID 5 : (20%, 20%, 20%, 20%, 20%)
  • Diff-RAID enables a trade-off between reliability and performance
open problems
Open Problems
  • Parity assignment problem in Diff-RAID
    • Given the reliability constraint and SSD access characteristics, assign parity to disks in order to maximize performance while meeting the reliability constraint
      • Design time version
      • Runtime version
  • Minimizing writes and reads due to parity updates
    • Problem statement: parity updates are small data updates. However, the NAND Flash page size is growing (up to 16KB in the near future), so it will be too expensive to update parity on every small write.
    • New ideas are required
      • Parity log block (packing parity data of different pages into a single log page)
      • PRAM (and DRAM buffer) for parity or parity log block
      • Parity prefetch to reduce channel/way conflicts due to parity reads, etc.
a self balancing striping scheme

[Chang, 2008]

A Self Balancing Striping Scheme
  • Load balancing in multiple NAND channels
pram a hybrid storage partner

[Samsung, 2008]

PRAM: A Hybrid Storage Partner
  • PRAM
    • Fast byte- or word-access capability without the erase-before-program requirement, like normal DRAM
    • Still high cost / bit
      • Though recently in mass production
a pram nand hybrid storage1

[Samsung, 2008]

A PRAM+NAND Hybrid Storage
  • File system metadata occupies up to 50% of the total write traffic
  • Store metadata on PRAM and other user data on NAND
on average 16 of metadata is updated

[Samsung, 2008]

On Average, 16% of Metadata is Updated
  • PRAM filter extracts only the updated data
reference
Reference
  • [Samsung, 2000] Samsung Electronics, Samsung NAND Flash Memory, 2000.
  • [Micron, 2006] Micron, NAND Flash 101 - An Introduction to NAND Flash and How to Design It In to Your Next Product, Nov. 2006.
  • [Micron, 2007] Micron, NAND Flash Performance Increase - Using the Micron® PAGE READ CACHE MODE Command, June 2007.
  • [Micron, 2007b] Micron, NAND Flash Performance Increase with PROGRAM PAGE CACHE MODE Command, June 2007.
  • [Micron, 2007c] Micron, NAND Flash Performance Improvement Using Internal Data Move, June 2007.
  • [Micron, 2007d] Micron, Monitoring Ready/Busy Status in 2, 4, and 8Gb Micron NAND Flash Devices, June 2007.
  • [Micron, 2007e] Micron, Small Block vs. Large Block NAND Devices, June 2007.
  • [Micron, 2006b] Micron, NAND Flash Design and Use Considerations, Aug. 2006.
  • [Micron, 2006e] Micron, Wear-Leveling Techniques in NAND Flash Devices, Aug. 2006.
  • [Micron, 2007f] Micron, Hamming Codes for NAND Flash Memories, June 2007.
  • [Z. Wu, 2007] Flash memory with coding and signal processing, US Patent 2007/0171714 A1, Published, 2007
  • [Micron, 2008] Micron, Using COPYBACK Operations to Maintain Data Integrity in NAND Flash Devices, Oct. 2008.
  • [Micron, 2007g] Micron, Micron ECC Module for NAND Flash via Xilinx™ Spartan™-3 FPGA, June 2007.

ESA, POSTECH, 2010

reference1
Reference
  • [Ban, 1995] A. Ban, Flash File System, US Patent, no. 5,404,485, April 1995.
  • [Kim, 2002] J. Kim, et al., “A Space-Efficient Flash Translation Layer for compactflash Systems”, IEEE Transactions on Consumer Electronics, May 2002.
  • [Kang, 2006] J. Kang, et al., “A Superblock-based Flash Translation Layer for NAND Flash Memory”, Proc. EMSOFT, Oct. 2006.
  • [S. Lee, 2007] A Log Buffer-Based Flash Translation Layer Using Fully-Associative Sector Translation, ACM TECS, 2007
  • [Lee, 2008] S. Lee, et al., “LAST: locality-aware sector translation for NAND flash memory-based storage systems”, ACM SIGOPS Operating Systems Review archive, Volume 42 ,  Issue 6, October 2008.

ESA, POSTECH, 2010

nand nor characteristics

[Source: Micron, 2006]

NAND/NOR Characteristics
  • NAND is currently favored thanks to better write (erase) performance and area efficiency

ESA, POSTECH, 2010

nand vs nor required pins

[Source: Micron, 2006]

NAND vs. NOR: Required Pins
  • NAND utilizes multiplexed I/O (I/O[7:0] in the table) for commands and data
    • NAND operation: command  address  data
  • NOR has separate address and data buses

ESA, POSTECH, 2010

garbage collection in last step 1 victim partition selection
Garbage Collection in LAST:Step 1 Victim Partition Selection
  • Basic rule
    • If there is a dead block in Hot partition, we select Hot partition as the victim partition
    • Else, we select Cold partition as the victim
  • Demotion from Hot to Cold page
    • If there is a log block whose last update is older than a certain threshold (the age threshold, i.e., it is old enough), then we select the Hot partition as the victim

ESA, POSTECH, 2010

garbage collection in last step 2 victim block selection
Garbage Collection in LAST:Step 2 Victim Block Selection
  • Case A: Victim partition = Hot partition
    • If there is a dead block, select it
    • Else, select a least recently updated block
  • Case B: Victim partition = Cold partition
    • Choose the block with the lowest (full) merge cost (in the merge cost table)
      • Na: associativity degree, Np: # valid page copies
      • Cc: page copy cost, Ce: erase cost
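The slide gives the cost parameters but not the expression, so the sketch below assumes a plausible model, merge cost = Np*Cc + (Na + 1)*Ce (copy the valid pages, then erase the associated data blocks and the log block); this is my assumption, not the exact formula from the LAST paper.

#include <stdio.h>

#define CC_US  500.0    /* page copy cost Cc (read + transfer + program), assumed */
#define CE_US 2000.0    /* erase cost Ce, assumed */

/* Assumed merge cost model: copy Np valid pages, erase Na data blocks + the log block */
static double merge_cost_us(int Na, int Np)
{
    return Np * CC_US + (Na + 1) * CE_US;
}

int main(void)
{
    /* Pick the cold-partition victim with the lowest estimated merge cost */
    struct { int Na, Np; } log_blocks[] = { {4, 60}, {1, 12}, {2, 30} };
    int n = (int)(sizeof log_blocks / sizeof log_blocks[0]);
    int best = 0;
    for (int i = 1; i < n; i++)
        if (merge_cost_us(log_blocks[i].Na, log_blocks[i].Np) <
            merge_cost_us(log_blocks[best].Na, log_blocks[best].Np))
            best = i;
    printf("victim log block = %d (cost %.1f ms)\n",
           best, merge_cost_us(log_blocks[best].Na, log_blocks[best].Np) / 1000.0);
    return 0;
}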

ESA, POSTECH, 2010

adaptiveness in last
Adaptiveness in LAST
  • Hot/cold partition sizes (Sh, Sc), the temporal locality threshold (k), the age threshold, etc. are adjusted at runtime depending on the given traffic
    • Nd = # dead blocks in Hot partition
    • Uh = utilization of Hot partition (# valid pages / # total pages)
  • One example of runtime policy
    • If Nd is increasing, then reduce Sh since too many log blocks are assigned to Hot partition
  • There are several more policy examples in the paper
    • Comment: the policies do not seem exhaustive, so they could be improved further

ESA, POSTECH, 2010