Flash-based (cloud) storage systems Lecture 25 Aditya Akella
• BufferHash: invented in the context of network de-dup (e.g., inter-DC log transfers)
• SILT: a more “traditional” key-value store
Cheap and Large CAMs for High Performance Data-Intensive Networked Systems
Ashok Anand, Chitra Muthukrishnan, Steven Kappes, and Aditya Akella (University of Wisconsin-Madison); Suman Nath (Microsoft Research)
New data-intensive networked systems need large hash tables (10s to 100s of GBs)
New data-intensive networked systems: WAN optimizers
A WAN optimizer sits on the WAN path between a branch office and a data center. Objects are split into chunks (~4 KB); each chunk is named by a 20 B key that maps to a chunk pointer in an object store (~4 TB). The index is a large hash table (~32 GB) that must sustain high-speed lookups (~10K/sec for a 500 Mbps link) as well as high-speed inserts and evictions (~10K/sec).
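As a hedged illustration of this workload (not the authors' code), the sketch below shows how a WAN optimizer might drive such a chunk index: ~4 KB chunks are named by 20-byte SHA-1 keys that map to pointers into the object store. The function and variable names, and the placeholder send path (emit_reference/emit_chunk), are assumptions for illustration.

```python
# Sketch: chunk-index usage in a WAN optimizer (illustrative, not the paper's code).
import hashlib

CHUNK_SIZE = 4 * 1024   # ~4 KB chunks, as on the slide

chunk_index = {}        # stands in for the ~32 GB hash table: 20 B key -> chunk pointer
object_store = []       # stands in for the ~4 TB object store

def emit_reference(key):
    """Placeholder: tell the remote end it already has this chunk."""
    pass

def emit_chunk(chunk):
    """Placeholder: send the raw chunk bytes over the WAN."""
    pass

def send_object(data: bytes):
    """Split an object into chunks; send only chunks not seen before."""
    for start in range(0, len(data), CHUNK_SIZE):
        chunk = data[start:start + CHUNK_SIZE]
        key = hashlib.sha1(chunk).digest()       # 20-byte key
        if key in chunk_index:                   # needs ~10K lookups/sec for 500 Mbps
            emit_reference(key)
        else:
            chunk_index[key] = len(object_store) # needs ~10K inserts/sec
            object_store.append(chunk)
            emit_chunk(chunk)
```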
New data-intensive networked systems
• Other systems:
• De-duplication in storage systems (e.g., Data Domain)
• CCN cache (Jacobson et al., CoNEXT 2009)
• DONA directory lookup (Koponen et al., SIGCOMM 2006)
• These need cost-effective large hash tables: cheap large CAMs (CLAMs)
Candidate options (for a 128 GB hash table)
• Disk: 250 random reads/sec, 250 random writes/sec, $30+; too slow
• DRAM: 300K random reads/sec, 300K random writes/sec, $120K+ (2.5 ops/sec/$); too expensive
• Flash-SSD: 10K* random reads/sec, 5K* random writes/sec, $225+; slow writes
+ Price statistics from 2008-09. * Derived from latencies on Intel M-18 SSD in experiments.
The question: how to deal with the slow writes of Flash SSDs?
CLAM design
• New data structure “BufferHash” + Flash
• Key features:
• Avoid random writes; perform sequential writes in a batch
• Sequential writes are 2X faster than random writes (Intel SSD)
• Batched writes reduce the number of writes going to Flash
• Bloom filters for optimizing lookups
BufferHash performs orders of magnitude better than DRAM-based traditional hash tables in ops/sec/$
Flash/SSD primer
• Reads and writes happen at the granularity of a flash page; I/O smaller than a page should be avoided, if possible
• Random writes are expensive; avoid random page writes
Conventional hash table on Flash/SSD
Keys are likely to hash to random locations, causing random writes. The SSD's FTL handles random writes to some extent, but the garbage collection overhead is high: ~200 lookups/sec and ~200 inserts/sec with the WAN optimizer workload, far below the required ~10 K/s and ~5 K/s.
Conventional hash table on Flash/SSD
We can't assume locality in requests, so keeping a DRAM cache in front of the on-flash table won't work.
Our approach: Buffering insertions
• Control the impact of random writes
• Maintain a small hash table (buffer) in DRAM
• As the in-memory buffer gets full, write it to flash in one batch
• We call the in-flash copy of a buffer an incarnation of that buffer
Buffer: in-memory hash table (DRAM). Incarnation: in-flash hash table (Flash SSD). A short sketch of this idea follows below.
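A minimal sketch of the buffering idea, assuming a dictionary-based model (class and method names are illustrative, not the paper's implementation): inserts go into a small in-memory table, and a full buffer is written out sequentially as a new incarnation.

```python
# Sketch: buffered insertions with in-flash incarnations (assumed structure).
class BufferedTable:
    def __init__(self, buffer_capacity):
        self.buffer = {}                  # in-memory buffer (DRAM)
        self.capacity = buffer_capacity
        self.incarnations = []            # in-flash hash tables, newest last

    def insert(self, key, value):
        self.buffer[key] = value
        if len(self.buffer) >= self.capacity:
            # One large sequential write instead of many random page writes.
            self.incarnations.append(dict(self.buffer))
            self.buffer = {}

    def lookup(self, key):
        if key in self.buffer:            # DRAM hit
            return self.buffer[key]
        for incarnation in reversed(self.incarnations):
            if key in incarnation:        # each check models a flash read
                return incarnation[key]
        return None
```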
Two-level memory hierarchy
The buffer lives in DRAM; incarnations live on flash, tracked in an incarnation table ordered from latest to oldest (e.g., 4, 3, 2, 1). The net hash table is the buffer plus all incarnations.
Lookups are impacted due to buffers
A lookup must check the DRAM buffer and then potentially perform an in-flash lookup in every incarnation listed in the incarnation table. Multiple in-flash lookups per query: can we limit this to only one?
Bloom filters for optimizing lookups
Keep one in-memory Bloom filter per incarnation. A lookup first checks the buffer and the filters in memory, and goes to flash only for incarnations whose filter matches (a match can still be a false positive). Configure the filters carefully: 2 GB of Bloom filters for 32 GB of flash keeps the false positive rate under 0.01.
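A hedged sketch of how per-incarnation Bloom filters limit flash reads. The tiny BloomFilter below (its size and hash choices) and the lookup helper are illustrative assumptions; the paper sizes real filters (~2 GB for 32 GB of flash) for a false positive rate below 0.01.

```python
# Sketch: one in-memory Bloom filter per incarnation gates the in-flash lookups.
import hashlib

class BloomFilter:
    def __init__(self, nbits=1 << 16, nhashes=4):
        self.nbits, self.nhashes = nbits, nhashes
        self.bits = bytearray(nbits // 8)

    def _positions(self, key: bytes):
        for i in range(self.nhashes):
            digest = hashlib.sha1(bytes([i]) + key).digest()
            yield int.from_bytes(digest[:8], "big") % self.nbits

    def add(self, key: bytes):
        for p in self._positions(key):
            self.bits[p // 8] |= 1 << (p % 8)

    def might_contain(self, key: bytes) -> bool:
        return all(self.bits[p // 8] & (1 << (p % 8)) for p in self._positions(key))

def lookup(key: bytes, buffer, incarnations, filters):
    """Check the DRAM buffer, then only incarnations whose filter matches."""
    if key in buffer:
        return buffer[key]
    for incarnation, bf in zip(reversed(incarnations), reversed(filters)):
        if bf.might_contain(key):            # usually at most one incarnation matches
            if key in incarnation:           # models the single flash read
                return incarnation[key]
            # else: the filter gave a false positive; keep looking
    return None
```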
Update: naïve approach
Updating a key in place would mean locating and overwriting its copy inside an incarnation on flash, i.e., expensive random writes. Discard this naïve approach.
Lazy updates
Treat an update as an insert of (key, new value) into the DRAM buffer; the (key, old value) copy remains in an older incarnation on flash. Lookups check the buffer and the latest incarnations first, so they return the new value.
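A sketch of lazy updates, building on the BufferedTable sketch above (its insert/lookup methods are assumptions of this note, not the paper's API): an update is simply an insert into the buffer, and the newest-to-oldest lookup order makes the new value shadow the stale copy on flash.

```python
# Sketch: lazy updates never touch flash; stale values are shadowed, not erased.
def update(table, key, new_value):
    table.insert(key, new_value)   # goes to the DRAM buffer only

def lookup_latest(table, key):
    # BufferedTable.lookup scans the buffer first, then incarnations newest-to-oldest,
    # so the most recently written value wins over older copies still on flash.
    return table.lookup(key)
```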
Eviction for streaming apps
• Eviction policies may depend on the application: LRU, FIFO, priority-based eviction, etc.
• Two BufferHash primitives:
• Full discard: evict all items; naturally implements FIFO
• Partial discard: retain a few items, e.g., priority-based eviction by retaining high-priority items
• BufferHash is best suited for FIFO, since incarnations are arranged by age
• Other useful policies are possible at some additional cost (details in the paper)
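Below is a hedged sketch of the two eviction primitives, assuming incarnations and their Bloom filters are kept in lists ordered oldest-first; the function names and arguments are illustrative, not the paper's interface.

```python
# Sketch: BufferHash eviction primitives (assumed list-based bookkeeping).
def full_discard(incarnations, filters):
    """Drop the oldest incarnation and its Bloom filter: FIFO for free."""
    if incarnations:
        incarnations.pop(0)
        filters.pop(0)

def partial_discard(incarnations, filters, keep_predicate, rebuild_filter):
    """Retain only selected (e.g., high-priority) items of the oldest incarnation.
    Costlier: the incarnation is rewritten and its filter rebuilt."""
    if not incarnations:
        return
    kept = {k: v for k, v in incarnations[0].items() if keep_predicate(k, v)}
    incarnations[0] = kept
    filters[0] = rebuild_filter(kept.keys())
```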
Issues with using one buffer
A single buffer in DRAM must handle all operations and eviction policies, and it has a high worst-case insert latency: flushing a 1 GB buffer takes a few seconds, during which new lookups stall.
Partitioning buffers
• Partition buffers based on the first few bits of the key space
• Buffer size > flash page: avoids I/O smaller than a page
• Buffer size >= erase block: avoids random page writes
• Reduces worst-case insert latency
• Eviction policies apply per buffer
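A small sketch of per-partition buffering with an assumed 4-bit key prefix and an assumed flush callback; in practice the partition count is chosen so each buffer is at least one flash erase block.

```python
# Sketch: the top bits of the key select which (small) buffer an insert goes to.
PREFIX_BITS = 4                       # 2^4 = 16 partitions; illustrative only
buffers = [dict() for _ in range(1 << PREFIX_BITS)]

def partition_of(key: bytes) -> int:
    """Use the top PREFIX_BITS of the key's first byte to pick a buffer."""
    return key[0] >> (8 - PREFIX_BITS)

def buffer_full(buf, max_entries=1024):
    return len(buf) >= max_entries    # stands in for "buffer reached block size"

def insert(key: bytes, value, flush):
    b = partition_of(key)
    buffers[b][key] = value
    if buffer_full(buffers[b]):
        flush(b, buffers[b])          # sequential write of only this partition
        buffers[b] = {}
```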
BufferHash: Putting it all together
• Multiple buffers (Buffer 1 … Buffer K) in DRAM
• Multiple incarnations per buffer on flash
• One in-memory Bloom filter per incarnation
Net hash table = all buffers + all incarnations
Latency analysis
• Insertion latency: the worst case is proportional to the buffer size; the average case is constant for buffers larger than a block
• Lookup latency: the average case depends on the number of incarnations and on the false positive rate of the Bloom filters
Parameter tuning: Total size of buffers
Total size of buffers = B1 + B2 + … + BN. Given fixed DRAM, how much should be allocated to buffers? Total Bloom filter size = DRAM − total size of buffers.
• Lookup cost scales with (#incarnations × false positive rate), where #incarnations = flash size / total buffer size
• The false positive rate increases as the Bloom filters shrink
• Too small a total buffer size is not optimal (too many incarnations); too large is not optimal either (filters become too small)
• Optimal = 2 * SSD/entry
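The tradeoff on this slide can be written compactly as below; the symbols M (DRAM size), F (flash size), and B (total buffer size) are introduced here for illustration and do not appear on the slides.

```latex
% Sketch of the buffer-sizing tradeoff: DRAM of size M is split between
% buffers (total size B) and Bloom filters (size M - B), over flash of size F.
\[
  N_{\mathrm{inc}} = \frac{F}{B},
  \qquad
  \mathbb{E}[\text{flash reads per lookup}] \;\approx\;
  N_{\mathrm{inc}} \cdot p_{\mathrm{fp}}(M - B)
  \;=\; \frac{F}{B}\, p_{\mathrm{fp}}(M - B)
\]
% where p_fp(M - B) is the Bloom-filter false-positive rate, which grows as the
% filter memory M - B shrinks. A very small B means many incarnations; a very
% large B starves the filters, so an intermediate B minimizes the product.
```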
Parameter tuning: Per-buffer size
What should the size of a partitioned buffer (e.g., B1) be?
• It affects worst-case insertion latency
• It is adjusted according to application requirements (128 KB to 1 block)
SILT: A Memory-Efficient, High-Performance Key-Value Store
Hyeontaek Lim, Bin Fan, David G. Andersen (Carnegie Mellon University), Michael Kaminsky (Intel Labs)
2011-10-24
Key-Value Store
Clients issue PUT(key, value), value = GET(key), and DELETE(key) requests against a key-value store cluster.
• E-commerce (Amazon)
• Web server acceleration (Memcached)
• Data deduplication indexes
• Photo storage (Facebook)
SILT goal: use much less memory than previous systems while retaining high performance.
Three Metrics to Minimize
• Memory overhead = index size per entry; ideally 0 (no memory overhead)
• Read amplification = flash reads per query; limits query throughput; ideally 1 (no wasted flash reads)
• Write amplification = flash writes per entry; limits insert throughput and reduces flash life expectancy; must be small enough for flash to last a few years
Landscape before SILT: [figure: prior systems (SkimpyStash, HashCache, BufferHash, FlashStore, FAWN-DS) plotted by read amplification vs. memory overhead (bytes/entry); none achieves both low memory overhead and low read amplification]
Solution Preview: (1) Three Stores with (2) New Index Data Structures
• In memory: SILT Log Index (write-friendly), SILT Filter, and SILT Sorted Index (memory-efficient); the data live on flash
• Queries look up the stores in sequence (from new to old)
• Inserts only go to the Log
• Data are moved between stores in the background
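A sketch (with assumed store interfaces) of the three-store flow described above: writes go only to the log, and reads consult the stores from newest to oldest.

```python
# Sketch: SILT-style multi-store front end (store interfaces are assumptions).
class MultiStore:
    def __init__(self, log_store, hash_stores, sorted_store):
        self.log_store = log_store        # write-friendly, takes all inserts
        self.hash_stores = hash_stores    # immutable, converted from old logs
        self.sorted_store = sorted_store  # memory-efficient, bulk-merged

    def put(self, key, value):
        self.log_store.append(key, value)     # inserts only go to the log

    def get(self, key):
        # Newest data shadows older data, so check stores from new to old.
        for store in [self.log_store, *self.hash_stores, self.sorted_store]:
            value = store.get(key)
            if value is not None:
                return value
        return None
```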
LogStore: No Control over Data Layout
Inserted entries are appended to an on-flash log (older to newer). A naive in-memory hashtable index would cost 48+ B/entry; the SILT Log Index costs 6.5+ B/entry.
Memory overhead: 6.5+ bytes/entry. Write amplification: 1.
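A minimal sketch of the LogStore idea, using a plain Python dict where SILT uses a compact partial-key cuckoo index; it shows why write amplification stays at 1 (appends only) while the memory cost is dominated by the per-entry index.

```python
# Sketch: append-only on-flash log plus a small in-memory index (assumed layout).
class LogStoreSketch:
    def __init__(self):
        self.log = []        # stands in for the on-flash append-only log
        self.index = {}      # in-memory: key -> offset of the newest entry
                             # (SILT keeps this near 6.5 B/entry with a
                             #  partial-key cuckoo hash, not a Python dict)

    def append(self, key, value):
        self.index[key] = len(self.log)
        self.log.append((key, value))    # one sequential flash write per entry

    def get(self, key):
        offset = self.index.get(key)
        if offset is None:
            return None
        stored_key, value = self.log[offset]     # one flash read
        return value if stored_key == key else None
```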
SortedStore: Space-Optimized Layout
Data are kept in an on-flash sorted array, indexed by the in-memory SILT Sorted Index (0.4 B/entry). Inserts must be bulk-merged to amortize their cost.
Memory overhead: 0.4 bytes/entry. Write amplification: high.
Combining SortedStore and LogStore
Entries accumulate in the LogStore (SILT Log Index over an on-flash log) and are periodically merged into the SortedStore (SILT Sorted Index over an on-flash sorted array).
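A sketch of the background merge, assuming both inputs fit in memory for illustration (the real merge streams sequentially over flash); function and variable names are assumptions.

```python
# Sketch: bulk-merge log-store entries into a new on-flash sorted array.
import heapq

def merge_into_sorted(old_sorted, new_entries):
    """old_sorted: existing (key, value) pairs already in key order.
    new_entries: recently written (key, value) pairs from newer stores.
    Returns a new sorted array; newer values shadow older ones."""
    new_sorted = sorted(new_entries, key=lambda kv: kv[0])
    merged = {}
    # heapq.merge streams both inputs in key order (one sequential pass);
    # for equal keys the old entry arrives first, so the new value overwrites it.
    for key, value in heapq.merge(old_sorted, new_sorted, key=lambda kv: kv[0]):
        merged[key] = value
    return list(merged.items())   # insertion order is already key order
```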
Achieving both Low Memory Overhead and Low Write Amplification
• SortedStore alone: low memory overhead, high write amplification
• LogStore alone: high memory overhead, low write amplification
• SortedStore + LogStore together achieve both: write amplification = 5.4 (about a 3-year flash life) and memory overhead = 1.3 B/entry
• With intermediate “HashStores”, memory overhead drops to 0.7 B/entry!
SILT’s Design (Recap)
The LogStore (SILT Log Index over an on-flash log) is converted into HashStores (SILT Filter over on-flash hashtables), which are merged into the SortedStore (SILT Sorted Index over an on-flash sorted array).
Memory overhead: 0.7 bytes/entry. Read amplification: 1.01. Write amplification: 5.4.
New Index Data Structures in SILT
• SILT Sorted Index (for the SortedStore): entropy-coded tries; highly compressed (0.4 B/entry)
• SILT Filter & Log Index (for the HashStore & LogStore): partial-key cuckoo hashing; compact (2.2 & 6.5 B/entry) and very fast (> 1.8 M lookups/sec)
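Below is a simplified illustration of the partial-key idea behind the SILT Filter/Log Index, written in the cuckoo-filter style (a short tag plus an alternate bucket derivable from bucket and tag). It is not SILT's exact layout or parameters; all constants and helper names are assumptions, and a hit must still be verified against the full key on flash.

```python
# Sketch: partial-key cuckoo hashing, storing only (tag, log_offset) in memory.
import hashlib

NUM_BUCKETS = 1 << 12          # power of two, so XOR-based alternation works
MAX_KICKS = 128                # displacement limit before declaring the table full
table = [None] * NUM_BUCKETS   # each slot holds (tag, log_offset) or None

def _h(data: bytes) -> int:
    return int.from_bytes(hashlib.sha1(data).digest()[:8], "big")

def _tag(key: bytes) -> int:
    return (_h(b"tag" + key) & 0x7FFF) or 1       # 15-bit nonzero tag

def _bucket1(key: bytes) -> int:
    return _h(key) % NUM_BUCKETS

def _alt_bucket(bucket: int, tag: int) -> int:
    # The alternate bucket depends only on (bucket, tag), so entries can be
    # relocated without reading their full keys from flash.
    return (bucket ^ _h(tag.to_bytes(2, "big"))) % NUM_BUCKETS

def insert(key: bytes, offset: int) -> bool:
    entry, bucket = (_tag(key), offset), _bucket1(key)
    for _ in range(MAX_KICKS):
        if table[bucket] is None:
            table[bucket] = entry
            return True
        table[bucket], entry = entry, table[bucket]   # kick out the resident
        bucket = _alt_bucket(bucket, entry[0])        # move it to its other bucket
    return False   # too full; SILT would convert this LogStore to a HashStore

def lookup_offset(key: bytes):
    tag, b1 = _tag(key), _bucket1(key)
    for b in (b1, _alt_bucket(b1, tag)):
        if table[b] is not None and table[b][0] == tag:
            return table[b][1]    # candidate offset; verify the full key on flash
    return None
```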
Landscape after SILT: [figure: the same read amplification vs. memory overhead (bytes/entry) plot as before, with SILT achieving both lower memory overhead and lower read amplification than SkimpyStash, HashCache, BufferHash, FlashStore, and FAWN-DS]
Outline • Background and motivation • Our CLAM design • Key operations (insert, lookup, update) • Eviction • Latency analysis and performance tuning • Evaluation
Evaluation
• Configuration: 4 GB DRAM, 32 GB Intel SSD, Transcend SSD
• 2 GB buffers, 2 GB Bloom filters, 0.01 false positive rate
• FIFO eviction policy
BufferHash performance
• WAN optimizer workload: random key lookups followed by inserts, with a 40% hit rate; workloads from real packet traces were also used
• Comparison with BerkeleyDB (a traditional hash table) on the Intel SSD: better lookups and better inserts!
Insert performance: [CDF of insert latency (ms) on the Intel SSD] 99% of inserts complete in < 0.1 ms with BufferHash (the buffering effect), whereas 40% of inserts take > 5 ms with BerkeleyDB (random writes are slow!).
Lookup performance: [CDF of lookup latency (ms) for the 40%-hit workload] 99% of BufferHash lookups take < 0.2 ms; 60% of lookups don't go to flash at all, and an Intel SSD read takes about 0.15 ms. With BerkeleyDB, 40% of lookups take > 5 ms due to garbage collection overhead caused by writes.
Performance in Ops/sec/$ • 16K lookups/sec and 160K inserts/sec • Overall cost of $400 • 42 lookups/sec/$ and 420 inserts/sec/$ • Orders of magnitude better than 2.5 ops/sec/$ of DRAM based hash tables
Other workloads
• Varying fractions of lookups; results on the Transcend SSD (average latency per operation)
• BufferHash is ideally suited for write-intensive workloads
Evaluation summary
• BufferHash performs orders of magnitude better in ops/sec/$ than traditional hash tables on DRAM (and disks)
• BufferHash is best suited for a FIFO eviction policy; other policies can be supported at additional cost (details in the paper)
• A WAN optimizer using BufferHash can operate optimally at 200 Mbps, much better than the 10 Mbps achieved with BerkeleyDB (details in the paper)
Related Work • FAWN (Vasudevan et al., SOSP 2009) • Cluster of wimpy nodes with flash storage • Each wimpy node has its hash table in DRAM • We target… • Hash table much bigger than DRAM • Low latency as well as high throughput systems • HashCache (Badam et al., NSDI 2009) • In-memory hash table for objects stored on disk
WAN optimizer using BufferHash • With BerkeleyDB, throughput up to 10 Mbps • With BufferHash, throughput up to 200 Mbps with Transcend SSD • 500 Mbps with Intel SSD • At 10 Mbps, average throughput per object improves by 65% with BufferHash
Evaluation (SILT)
• Various combinations of indexing schemes
• Background operations (merge/conversion)
• Query latency