Design Tradeoffs for SSD Performance

Ted Wobber

Principal Researcher

Microsoft Research, Silicon Valley

Rotating Disks vs. SSDs

We have a good model of how rotating disks work… what about SSDs?

Rotating Disks vs. SSDs: Main take-aways
  • Forget everything you knew about rotating disks. SSDs are different
  • SSDs are complex software systems
  • One size doesn’t fit all
A Brief Introduction

Microsoft Research – a focus on ideas and understanding

Will SSDs Fix All Our Storage Problems?
  • Excellent read latency; sequential bandwidth
  • Lower $/IOPS/GB
  • Improved power consumption
  • No moving parts
  • Form factor, noise, …

Performance surprises?

Performance/Surprises
  • Latency/bandwidth
    • “How fast can I read or write?”
    • Surprise: Random writes can be slow
  • Persistence
    • “How soon must I replace this device?”
    • Surprise: Flash blocks wear out
What’s in This Talk
  • Introduction
  • Background on NAND flash, SSDs
  • Points of comparison with rotating disks
    • Write-in-place vs. write-logging
    • Moving parts vs. parallelism
    • Failure modes
  • Conclusion
What’s *NOT* in This Talk
  • Windows
  • Analysis of specific SSDs
  • Cost
  • Power savings
Full Disclosure
  • “Black box” study based on the properties of NAND flash
  • A trace-based simulation of an “idealized” SSD
  • Workloads
    • TPC-C
    • Exchange
    • Postmark
    • IOzone
Background: NAND flash blocks
  • A flash block is a grid of cells
  • Erase: Quantum release for all cells
  • Program: Quantum injection for some cells
  • Read: NAND operation with a page selected
  • Block geometry: 4096 + 128 bit-lines × 64 page lines
  • Can't reset bits to 1 except with erase
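
To make the program/erase asymmetry concrete, here is a minimal sketch (not from the talk; page and block sizes follow the slide, everything else is illustrative) of a flash block in which programming can only clear bits, so data cannot be rewritten in place without an erase:

```python
# Illustrative model of NAND program/erase semantics; names and API are hypothetical.
PAGE_BYTES = 4096          # data area per page, per the slide (spare bytes omitted)
PAGES_PER_BLOCK = 64

class FlashBlock:
    def __init__(self) -> None:
        # Erased flash reads as all 1s.
        self.pages = [bytearray(b"\xff" * PAGE_BYTES) for _ in range(PAGES_PER_BLOCK)]

    def erase(self) -> None:
        """Quantum release for all cells: every bit returns to 1."""
        for page in self.pages:
            page[:] = b"\xff" * PAGE_BYTES

    def program(self, page_no: int, data: bytes) -> None:
        """Quantum injection for some cells: programming can only clear bits (1 -> 0)."""
        assert len(data) == PAGE_BYTES
        target = self.pages[page_no]
        for i, byte in enumerate(data):
            if byte & ~target[i] & 0xFF:
                raise ValueError("bit already 0: can't set it back to 1 without an erase")
            target[i] &= byte

    def read(self, page_no: int) -> bytes:
        return bytes(self.pages[page_no])
```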

Background: 4GB flash package (SLC)

[Diagram: a 4GB SLC flash package. Two dies (die 0, die 1), each with four planes (0–3); each plane holds an array of blocks and has its own data register; the registers connect to a shared serial-out interface.]

MLC (multiple bits in cell): slower, less durable

Background: SSD Structure

[Simplified block diagram of an SSD, centered on the Flash Translation Layer (proprietary firmware).]

Write-in-Place vs. Logging
  • Rotating disks
    • Constant map from LBA to on-disk location
  • SSDs
    • Writes always to new locations
    • Superseded blocks cleaned later
Log-based Writes: Map granularity = 1 block

[Diagram: LBA-to-block map. Updating page P means writing all of Block(P) to a new flash block in write order; pages are moved in a foreground read-modify-write, i.e. write amplification.]

Log-based Writes: Map granularity = 1 page

[Diagram: page-granularity LBA map. Writes to pages P and Q each go to a fresh flash page (Page(P), Page(Q)); superseded pages pile up, so blocks must be cleaned in the background, which is again write amplification.]

Log-based Writes: Simple simulation result
  • Map granularity = flash block (256KB)
    • TPC-C average I/O latency = 20 ms
  • Map granularity = flash page (4KB)
    • TPC-C average I/O latency = 0.2 ms
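
A back-of-the-envelope view of why the block-granularity map hurts small random writes so badly (my arithmetic, using the granularities quoted above): every 4KB update drags the rest of its 256KB block through a foreground read-modify-write.

```latex
\frac{256\,\mathrm{KB\ (block)}}{4\,\mathrm{KB\ (page)}} = 64
\quad\Rightarrow\quad
\text{each random 4\,KB write rewrites roughly 64 pages in the foreground}
```
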
Log-based Writes: Block cleaning
  • Move valid pages so block can be erased
  • Cleaning efficiency: Choose blocks to minimize page movement

[Diagram: LBA-to-page map with pages P, Q, R. The valid pages of a victim block are copied elsewhere so the block can be erased, and the map entries Page(P), Page(Q), Page(R) are updated to the new locations.]
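
A minimal sketch of a page-granularity FTL with greedy cleaning (assumed names and data structures; not the proprietary firmware or the simulator from the talk): writes always go to a fresh page, and the cleaner picks the block with the fewest valid pages so the least data has to move.

```python
# Toy page-mapped FTL with greedy cleaning.  Assumes the physical space is
# larger than the set of LBAs ever written (i.e., some over-provisioning).
from typing import Dict, List, Optional, Tuple

PAGES_PER_BLOCK = 64

class ToyFTL:
    def __init__(self, num_blocks: int) -> None:
        self.lba_map: Dict[int, Tuple[int, int]] = {}        # LBA -> (block, page)
        # valid[b][p] = LBA stored at (b, p), or None if the page is free/superseded.
        self.valid: List[List[Optional[int]]] = [
            [None] * PAGES_PER_BLOCK for _ in range(num_blocks)
        ]
        self.free_blocks = list(range(num_blocks))
        self.open_block = self.free_blocks.pop()
        self.next_page = 0
        self.pages_moved = 0   # pages copied by cleaning = background write amplification

    def write(self, lba: int) -> None:
        """Log-structured write: always to a fresh page; the old copy is superseded."""
        if self.next_page == PAGES_PER_BLOCK:
            self._open_new_block()
        old = self.lba_map.get(lba)
        if old is not None:
            self.valid[old[0]][old[1]] = None                 # supersede the old page
        self.valid[self.open_block][self.next_page] = lba
        self.lba_map[lba] = (self.open_block, self.next_page)
        self.next_page += 1

    def _open_new_block(self) -> None:
        relocated: List[int] = []
        if not self.free_blocks:
            relocated = self._clean_one_block()
        self.open_block = self.free_blocks.pop()
        self.next_page = 0
        for lba in relocated:                                 # move valid pages (background)
            self.pages_moved += 1
            self.write(lba)

    def _clean_one_block(self) -> List[int]:
        """Greedy victim choice: fewest valid pages = best cleaning efficiency."""
        victim = min(
            (b for b in range(len(self.valid)) if b != self.open_block),
            key=lambda b: sum(p is not None for p in self.valid[b]),
        )
        survivors = [lba for lba in self.valid[victim] if lba is not None]
        self.valid[victim] = [None] * PAGES_PER_BLOCK         # erase the victim
        self.free_blocks.append(victim)
        return survivors
```

Driving this with a random-overwrite workload shows pages_moved climbing as the free blocks run out; the next two slides, over-provisioning and delete notification, are both about keeping that number down.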

Over-provisioning: Putting off the work
  • Keep extra (unadvertised) blocks
  • Reduces “pressure” for cleaning
  • Improves foreground latency
  • Reduces write-amplification due to cleaning
Delete Notification: Avoiding the work
  • SSD doesn’t know what LBAs are in use
    • Logical disk is always full!
  • If the SSD can know what pages are unused, these can be treated as “superseded”
  • Better cleaning efficiency
  • De-facto over-provisioning

“Trim” API:

An important step forward
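
A tiny sketch (hypothetical names) of what the notification buys the FTL: dropping a map entry turns the corresponding flash page into a superseded page that the cleaner can reclaim without copying it.

```python
# Hypothetical view of "Trim" at the level of the FTL's LBA map: only live
# LBAs are mapped, so any flash page not referenced by the map is superseded.
lba_map = {0: ("block7", 3), 1: ("block7", 4)}   # LBA -> (block, page); illustrative

def trim(lba: int) -> None:
    """Host declares the LBA unused: forget it, so its page needs no copying."""
    lba_map.pop(lba, None)

trim(1)   # ("block7", 4) now counts as superseded: de-facto over-provisioning
```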

Delete Notification: Cleaning Efficiency
  • Postmark trace
  • One-third the pages moved
  • Cleaning efficiency improved by a factor of 3
  • Block lifetime improved
LBA Map Tradeoffs
  • Large granularity
    • Simple; small map size
    • Low overhead for sequential write workload
    • Foreground write amplification (R-M-W)
  • Fine granularity
    • Complex; large map size
    • Can tolerate random write workload
    • Background write amplification (cleaning)
Write-in-place vs. Logging: Summary
  • Rotating disks
    • Constant map from LBA to on-disk location
  • SSDs
    • Dynamic LBA map
    • Various possible strategies
    • Best strategy deeply workload-dependent
Moving Parts vs. Parallelism
  • Rotating disks
    • Minimize seek time and impact of rotational delay
  • SSDs
    • Maximize number of operations in flight
    • Keep chip interconnect manageable
Improving IOPS: Strategies
  • Request-queue sort by sector address
  • Defragmentation
  • Application-level block ordering

[Slide callouts: defragmentation for cleaning efficiency is unproven (the next write might re-fragment); "one request at a time per disk head"; "null seek time".]

Flash Chip Bandwidth
  • Serial interface is performance bottleneck
    • Reads constrained by serial bus
    • 25ns/byte = 40 MB/s (not so great)
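
The 40 MB/s figure is just the reciprocal of the per-byte transfer time (my arithmetic):

```latex
\frac{1\ \mathrm{byte}}{25\ \mathrm{ns}} = 4\times 10^{7}\ \mathrm{bytes/s} = 40\ \mathrm{MB/s}
```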

[Diagram: the per-plane registers on both dies feed a single 8-bit serial bus.]

SSD Parallelism: Strategies
  • Striping
  • Multiple “channels” to host
  • Background cleaning
  • Operation interleaving
  • Ganging of flash chips
Striping
  • LBAs striped across flash packages
    • Single request can span multiple chips
    • Natural load balancing
  • What’s the right stripe size?

[Diagram: controller striping LBAs across eight flash packages; package i holds LBAs i, i+8, i+16, …, e.g. package 0 holds 0, 8, 16, 24, 32, 40.]
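
A minimal sketch of the mapping in the diagram (a one-page stripe unit and eight packages are assumptions matching the figure, not a claim about any real SSD):

```python
# Illustrative striping of logical pages across flash packages.
from typing import Tuple

NUM_PACKAGES = 8   # matches the diagram; real controllers vary

def place(logical_page: int) -> Tuple[int, int]:
    """Map a logical page number to (package, page-within-package)."""
    return logical_page % NUM_PACKAGES, logical_page // NUM_PACKAGES

# Logical pages 0..7 land on packages 0..7; page 8 wraps back to package 0,
# so one large request spans many chips and random requests spread naturally.
assert [place(n)[0] for n in range(10)] == [0, 1, 2, 3, 4, 5, 6, 7, 0, 1]
```

The stripe-size question in the last bullet is the tradeoff hiding here: a larger stripe unit keeps small requests on a single chip, while a smaller one spreads even modest requests across many chips.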

Operations in Parallel
  • SSDs are akin to RAID controllers
    • Multiple onboard parallel elements
  • Multiple request streams are needed to achieve maximal bandwidth
  • Cleaning on inactive flash elements
    • Non-trivial scheduling issues
    • Much like “Log-Structured File System”, but at a lower level of the storage stack
Interleaving
  • Concurrent ops on a package or die
    • E.g., register-to-flash “program” on die 0 concurrent with serial line transfer on die 1
  • 25% extra throughput on reads, 100% on writes
  • Erase is slow, can be concurrent with other ops

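A back-of-the-envelope model of where figures like these can come from when two dies share one serial bus (the timings are my assumptions, roughly SLC-class, not numbers from the talk):

```python
# Two-die interleaving on one serial bus: overlap one die's bus transfer with
# the other die's array operation.  Timings below are assumptions.
T_XFER = 100.0   # us to move a 4 KB page over the serial interface
T_READ = 25.0    # us, flash array -> register
T_PROG = 200.0   # us, register -> flash array

def interleave_gain(t_op: float) -> float:
    """Throughput gain of two interleaved dies vs. one die at a time.

    One die at a time: each page costs t_xfer + t_op.
    Two dies: in steady state two pages finish every max(2*t_xfer, t_xfer + t_op),
    because the bus and the busier die are the only serialized resources.
    """
    one_die = T_XFER + t_op
    two_dies = max(2 * T_XFER, T_XFER + t_op) / 2
    return one_die / two_dies - 1

print(f"read gain  ~ {interleave_gain(T_READ):.0%}")   # ~25%
print(f"write gain ~ {interleave_gain(T_PROG):.0%}")   # ~100%
```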

Interleaving: Simulation
  • TPC-C and Exchange
    • No queuing, no benefit
  • IOzone and Postmark
    • Sequential I/O component results in queuing
    • Increased throughput
Intra-plane Copy-back
  • Block-to-block transfer internal to chip
    • But only within the same plane!
  • Cleaning on-chip!
  • Optimizing for this can hurt load balance
    • Conflicts with striping
    • But data needn’t cross serial I/O pins


Cleaning with Copy-back: Simulation
  • Copy-back operation for intra-plane transfer
  • TPC-C shows 40% improvement in cleaning costs
  • No benefit for IOzone and Postmark
    • Perfect cleaning efficiency
Ganging
  • Optimally, all flash chips are independent
  • In practice, too many wires!
  • Flash packages can share a control bus, with or without separate data channels
  • Operations in lock-step or coordinated

[Diagrams: shared-control gang and shared-bus gang configurations.]

Shared-bus Gang: Simulation
  • Scaling capacity without scaling pin-density
  • Workload (Exchange) requires 900 IOPS
    • 16-gang fast enough
Parallelism Tradeoffs
  • No one scheme optimal for all workloads

With faster serial connect, intra-chip ops are less important

Moving Parts vs. Parallelism: Summary
  • Rotating disks
    • Seek, rotational optimization
    • Built-in assumptions everywhere
  • SSDs
    • Operations in parallel are key
    • Lots of opportunities for parallelism, but with tradeoffs
Failure Modes: Rotating disks
  • Media imperfections, loose particles, vibration
  • Latent sector errors [Bairavasundaram 07]
    • E.g., with uncorrectable ECC
    • Frequency of affected disks increases linearly with time
    • Most affected disks (80%) have < 50 errors
    • Temporal and spatial locality
    • Correlation with recovered errors
  • Disk scrubbing helps
Failure Modes: SSDs
  • Types of NAND flash errors (mostly when erases > wear limit)
    • Write errors: Probability varies with # of erasures
    • Read disturb: Increases with # of reads
    • Data retention errors: Charge leaks over time
    • Little spatial or temporal locality (within equally worn blocks)
  • Better ECC can help
  • Errors increase with wear: Need wear-leveling
Wear-leveling: Motivation
  • Example: 25% over-provisioning to enhance foreground performance
  • Prematurely worn blocks = reduced over-provisioning = poorer performance
  • Over-provisioning budget consumed: writes no longer possible!
  • Must ensure even wear
Wear-leveling: Modified "greedy" algorithm

[Diagram: an expiry meter tracks remaining lifetime for block A; when A is cleaned, cold content from block B is migrated into it.]

  • If Remaining(A) < Throttle-Threshold, reduce probability of cleaning A
  • If Remaining(A) < Migrate-Threshold, clean A, but migrate cold data into A
  • If Remaining(A) >= Migrate-Threshold, clean A
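
A minimal sketch of that modified greedy policy (my structuring of the bullets above; the thresholds, the throttling probability, and the helper callbacks are all assumptions):

```python
# Greedy cleaning tempered by wear-leveling.  remaining(A) is block A's
# remaining lifetime fraction (erase budget left); all constants are placeholders.
import random
from typing import Callable, List

MIGRATE_THRESHOLD = 0.20    # below this, recycle the block only for cold data
THROTTLE_THRESHOLD = 0.05   # below this, also make cleaning it unlikely
THROTTLE_PROB = 0.10        # chance of still cleaning a nearly expired block

def pick_and_clean(blocks: List[str],
                   valid_pages: Callable[[str], int],
                   remaining: Callable[[str], float],
                   clean: Callable[[str], None],
                   migrate_cold_data_into: Callable[[str], None]) -> str:
    """Pick a victim greedily (fewest valid pages), then apply the wear rules."""
    for a in sorted(blocks, key=valid_pages):
        if remaining(a) < THROTTLE_THRESHOLD and random.random() > THROTTLE_PROB:
            continue                      # rate-limit reuse of nearly worn blocks
        clean(a)                          # move valid pages out and erase A
        if remaining(a) < MIGRATE_THRESHOLD:
            migrate_cold_data_into(a)     # park rarely written data on the worn block
        return a
    a = min(blocks, key=valid_pages)      # everything was throttled: clean anyway
    clean(a)
    return a
```
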
Wear-leveling Results
  • Fewer blocks reach expiry with rate-limiting
  • Smaller standard deviation of remaining lifetimes with cold-content migration
  • Cost of migrating cold pages (~5% avg. latency)

[Figure: block wear in IOzone]

Failure Modes: Summary
  • Rotating disks
    • Reduce media tolerances
    • Scrubbing to deal with latentsector errors
  • SSDs
    • Better ECC
    • Wear-leveling is critical
    • Greater density → more errors?
Rotating Disks vs. SSDs
  • Don’t think of an SSD as just a faster rotating disk
  • Complex firmware/hardware system with substantial tradeoffs

SSD Design Tradeoffs
  • Write amplification → more wear
Call To Action
  • Users need help in rationalizing workload-sensitive SSD performance
    • Operation latency
    • Bandwidth
    • Persistence
  • One size doesn’t fit all… manufacturers should help users determine the right fit
  • Open the “black box” a bit
    • Need software-visible metrics
Additional Resources
  • USENIX paper:http://research.microsoft.com/users/vijayanp/papers/ssd-usenix08.pdf
  • SSD Simulator download:http://research.microsoft.com/downloads
  • Related Sessions
    • ENT-C628: Solid State Storage in Server and Data Center Environments (2pm, 11/5)

© 2008 Microsoft Corporation. All rights reserved. Microsoft, Windows, Windows Vista and other product names are or may be registered trademarks and/or trademarks in the U.S. and/or other countries.

The information herein is for informational purposes only and represents the current view of Microsoft Corporation as of the date of this presentation. Because Microsoft must respond to changing market conditions, it should not be interpreted to be a commitment on the part of Microsoft, and Microsoft cannot guarantee the accuracy of any information provided after the date of this presentation. MICROSOFT MAKES NO WARRANTIES, EXPRESS, IMPLIED OR STATUTORY, AS TO THE INFORMATION IN THIS PRESENTATION.