
Spatial data management over flash memory






Presentation Transcript


  1. Spatial data management over flash memory Ioannis Koltsidas and Stratis D. Viglas SSTD 2011, Minneapolis, MN

  2. Flash: a disruptive technology • Orders of magnitude better performance than HDD • Low power consumption • Dropping prices • Idea: throw away HDDs and replace everything with flash SSDs • Not enough capacity • Not enough money to buy the not-enough-capacity • However, flash technology is effectively forced upon us in several settings • Mobile devices • Low-power data centers and clusters • Potentially all application areas dealing with spatial data • We must seamlessly integrate flash into the storage hierarchy • Need custom, flash-aware solutions

  3. Outline • Flash-based device design • Flash memory • Solid state drives • Spatial data challenges • Taking advantage of asymmetry • Storage and indexing • Buffering and caching

  4. Flash memory cells • Flash cell: a floating-gate transistor • Floating gate • Control gate • Oxide layer • Electrons get trapped in the floating gate • Two states: floating gate charged or not (‘0’ or ‘1’) • The charge changes the threshold voltage (VT) of the cell • To read: apply a voltage between the possible VT values • the MOSFET channel conducts (‘1’) • or, it remains insulating (‘0’) • After a number of program/erase cycles, the oxide wears out [Diagram: cross-section of a floating-gate transistor with source line, bit line, control gate, floating gate, oxide layer, and N-type source and drain in a P-type silicon substrate] • Single-Level-Cell (SLC): one bit per cell • Multi-Level-Cell (MLC): two or more bits per cell • Reading senses the amount of current flow to distinguish levels • Programming takes longer, puts more strain on the oxide

  5. Flash memory arrays • Cells are connected to form arrays; the connection scheme yields either NOR or NAND flash • Flash page: the unit of read / program operations (typically 2kB – 8kB) • Flash block: the unit of erase operations (typically 32 – 128 pages) • Before a page can be re-programmed, the whole block has to be erased first • Reading a page is much faster than writing one • It takes some time before the cell charge reaches a stable state • Erasing takes two orders of magnitude more time than reading
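To make the erase-before-write constraint concrete, here is a minimal sketch in Python (all class and method names are hypothetical, not from the talk): pages are the unit of read and program, the block is the unit of erase, and a programmed page cannot be re-programmed until its whole block has been erased.

```python
class FlashBlock:
    def __init__(self, pages_per_block=64):
        self.pages = [None] * pages_per_block  # None marks an erased page

    def read_page(self, i):
        # Reads are page-granular and fast
        return self.pages[i]

    def program_page(self, i, data):
        # Programming is page-granular, but only erased pages accept it
        if self.pages[i] is not None:
            raise ValueError("page already programmed; erase the block first")
        self.pages[i] = data

    def erase(self):
        # Erasure is block-granular and roughly 100x slower than a read
        self.pages = [None] * len(self.pages)


block = FlashBlock()
block.program_page(0, b"record v1")
# block.program_page(0, b"record v2")  # would raise: no in-place update
block.erase()                           # erase the whole block first
block.program_page(0, b"record v2")     # now the page accepts a new version
```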

  6. Flash-based Solid State Drives (SSDs) • Common I/O interface • Block-addressable interface • No mechanical latency • Access latency independent of the access pattern • 30 to 50 times more efficient in IOPS/$ per GB than HDDs • Read/write asymmetry • Reads are faster than writes • Erase-before-write limitation • Limited endurance and the need for wear leveling • 5-year warranty for enterprise SSDs (assuming 10 complete re-writes per day) • Energy efficiency • 100 – 200 times more efficient than HDDs in IOPS/Watt • Physical properties • Resistance to extreme shock, vibration, temperature, altitude • Near-instant start-up time

  7. SSD challenges • Host interface • Flash memory: read_flash_page, program_flash_page, erase_flash_block • Typical block device interface: read_sector, write_sector • Writes in place would kill performance and lifetime • Solution: perform writes out-of-place • Amortize block erasures over many write operations • Writes go to spare, erased blocks; old pages are invalidated • Device logical block address (LBA) space ≠ physical block address (PBA) space • Flash Translation Layer (FTL) • Address translation (logical-to-physical mapping) • Garbage collection (block reclamation) • Wear-leveling [Diagram: logical pages in the LBA space at the device level are mapped by the Flash Translation Layer to flash pages in the PBA space at the flash chip level; flash blocks include spare capacity]
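A toy page-level FTL illustrating the out-of-place write scheme the slide describes. This is a sketch under the assumption of a simple page-granular mapping table (the PageFTL name and its internals are hypothetical), not how any real, proprietary FTL works; real FTLs reclaim whole blocks and must wear-level across them.

```python
class PageFTL:
    def __init__(self, num_physical_pages):
        self.mapping = {}                            # LBA -> physical page
        self.free = list(range(num_physical_pages))  # erased physical pages
        self.invalid = set()                         # stale pages awaiting GC
        self.flash = [None] * num_physical_pages

    def write_sector(self, lba, data):
        # Out-of-place write: program a spare erased page...
        if not self.free:
            self.garbage_collect()
        ppn = self.free.pop()
        self.flash[ppn] = data
        # ...and invalidate the old physical copy, if any
        if lba in self.mapping:
            self.invalid.add(self.mapping[lba])
        self.mapping[lba] = ppn

    def read_sector(self, lba):
        # Address translation: logical-to-physical lookup
        return self.flash[self.mapping[lba]]

    def garbage_collect(self):
        # Grossly simplified: a real FTL erases whole victim blocks,
        # relocating their still-valid pages first, and wear-levels
        for ppn in self.invalid:
            self.flash[ppn] = None
            self.free.append(ppn)
        self.invalid.clear()
```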

  8. Off-the-shelf SSDs [Chart: IOPS of consumer and enterprise SSDs compared against HDDs; for reference, a 15k RPM SAS HDD delivers ~250-300 IOPS and a 7.2k RPM SATA HDD ~80 IOPS. Consumer SSDs are roughly 1 order of magnitude faster; enterprise SSDs are more than 2 orders of magnitude faster]

  9. Work so far: better FTL algorithms • Hide the complexity from the user by adding intelligence at the controller level • Great! (for the majority of user-level applications) • But, as is usually the case, you can’t have a one-size-fits-all solution • Data management applications have a much better understanding of access patterns • File systems don’t • Spatial data management has even more specific needs

  10. Competing goals • SSD designers assume a generic filesystem above the device. Goals: • Hide the complexities of flash memory • Improve performance for generic workloads and I/O patterns • Protect their competitive advantage by hiding algorithm and implementation details • DBMS designers have full control of the I/O issued to the device. Goals: • Predictability for I/O operations, independence of hardware specifics • Clear characterization of I/O patterns • Exploit synergies between query processing and flash memory properties

  11. A (modest) proposal for areas to focus on • Data structure level • Ways of helping the FTL • Introduce imbalance to tree structures • Trade (cheap) reads for (expensive) writes • Memory management • Add spatial intelligence to the buffer pool • Take advantage of work on spatial trajectory prediction • Combine with cost-based replacement • Prefetch data, delay expensive writes

  12. Turning asymmetry into an advantage • Common characteristic of all SSDs: low random read latency • Write speed and throughput differ dramatically across types of device • Sometimes write speed is orders of magnitude slower than read speed • Key idea: if we don’t need to write, then we shouldn’t • Procrastination might pay off in the long term • Only write if the cost has been expensed

  13. Read/write asymmetry • Consider the case where writes are x times more expensive than reads • This means that for each write we avoid, we “gain” x time units • Take any R-tree structure and introduce controlled imbalance • Rebalance when we have expensed the cost [Diagram: in the original setup a parent has an overflowing child; balanced insertion allocates a new sibling and updates the parent; unbalanced insertion attaches an overflow area to the overflowing child, leaving the parent untouched]

  14. In more detail • Parent P • Overflowing node L • On overflow, allocate overflow node S • Instead of performing three writes (nodes P, L, and S), we perform two (nodes L and S) • We have saved x time units • Record a counter c at L • Increment it each time we traverse L to get to S • Once the counter reaches x, rebalance • The cost has been expensed [Diagram: only the L and S nodes are written, not P; the counter c is kept at L; rebalance when c > x]
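A sketch of the bookkeeping the slide describes, assuming (hypothetically) that a write costs x = 20 read units; the Node layout is illustrative, not the paper's actual R-tree node format. The overflow node S absorbs inserts without touching the parent P, and the counter c triggers a rebalance once the accumulated extra reads have paid for the deferred write.

```python
X = 20  # assumed asymmetry: one write costs as much as 20 reads

class Node:
    def __init__(self, capacity=4):
        self.entries = []      # real R-tree entries would carry MBRs too
        self.capacity = capacity
        self.overflow = None   # overflow node S hanging off this node L
        self.counter = 0       # c: times L was traversed to reach S

def insert(leaf, entry):
    if len(leaf.entries) < leaf.capacity:
        leaf.entries.append(entry)   # one write: L only
        return
    if leaf.overflow is None:
        leaf.overflow = Node()       # two writes (L and S); the parent P
                                     # is not touched, saving one write
    else:
        leaf.counter += 1            # one extra read per traversal into S
                                     # (lookups through L would count too)
    leaf.overflow.entries.append(entry)
    if leaf.counter >= X:            # extra reads now equal the cost of
        rebalance(leaf)              # the deferred write: time to rebalance

def rebalance(leaf):
    # A proper split would redistribute the entries of L and S and
    # finally write P, L, and S; elided in this sketch
    leaf.counter = 0
```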

  15. Observations • If there are no “hotspots” in the R-tree then we have potentially huge gains • Counter-intuitive: the more imbalance, the lower the I/O cost • In the worst case, as good as a balanced tree • Method is applicable either at the leaves or at the index nodes • Likelihood of rebalancing is proportional to the level at which the imbalance was introduced (i.e., the deeper the level of imbalance, the higher the likelihood) • Good fit to data access patterns in location-aware spatial services • Update rate is relatively low; point queries are highly volatile as users move about an area • Extensions in hybrid server-oriented configurations • Both HDDs and SSDs are used for persistent storage • Write-intensive (and potentially unbalanced) nodes placed on the HDD

  16. Cost-based replacement • Choice of victim depends on probability of reference (as usual) • But the eviction cost is not uniform • Clean pages bear no write cost; dirty pages result in a write • I/O asymmetry: writes more expensive than reads • It doesn’t hurt if we misestimate the heat of a page • So long as we save (expensive) writes • Key idea: combine LRU-based replacement with cost-based algorithms • Applicable in both SSD-only and hybrid systems

  17. In more detail • Starting point: cost-based page replacement • Divide the buffer pool into two regions • Time region: typical LRU • Cost region: multiple LRU queues, one per cost class • Order queues based on cost • Evict from the time region into the cost region • Final victim is always from the cost region [Diagram: buffer pool split into a time region and a cost region, with the cost region’s queues ordered by cost]
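A minimal sketch of the two-region buffer pool, assuming two hypothetical cost classes (clean pages with eviction cost 1, dirty pages with cost 20); class and method names are illustrative, not from the paper.

```python
from collections import OrderedDict

class TwoRegionPool:
    """Time region: plain LRU. Cost region: one LRU queue per cost class."""

    def __init__(self, time_cap, cost_cap):
        self.time = OrderedDict()   # page -> eviction cost, in LRU order
        self.cost = {1: OrderedDict(), 20: OrderedDict()}  # clean, dirty
        self.time_cap, self.cost_cap = time_cap, cost_cap

    def access(self, page, cost):
        # A hit anywhere promotes the page back into the time region
        self.time.pop(page, None)
        for q in self.cost.values():
            q.pop(page, None)
        self.time[page] = cost
        if len(self.time) > self.time_cap:
            victim, c = self.time.popitem(last=False)  # LRU of time region
            self.cost[c][victim] = c                   # demote by cost class
        if sum(len(q) for q in self.cost.values()) > self.cost_cap:
            return self.evict()

    def evict(self):
        # Final victim: the LRU page of the cheapest non-empty cost class,
        # so clean pages go first and expensive writes are postponed
        for c in sorted(self.cost):
            if self.cost[c]:
                victim, _ = self.cost[c].popitem(last=False)
                return victim
```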

  18. Location-awareness • A host of work in wireless networks deals with trajectory prediction • Consider the case where services are offered based on user location • Primary data are stored in an R-tree • User location triggers queries on the R-tree • User motion creates hotspots (more precisely, hot paths) on the tree structure

  19. Location-aware buffer pool management • What if the classes of the cost region track user motion? • The higher the utility of keeping a page in the buffer pool, the higher its eviction cost • Utility correlated with motion trajectory • As the user moves about an area, new pages are brought into the buffer pool and older pages are evicted • Potentially huge savings if the trajectory is tracked accurately enough • Flashmobs (pun intended!) • Users tend to move in sets into areas of interest • Overall response time of the system is minimized • Recency/frequency of access may not be able to predict future behavior • Trajectory tracking potentially will
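A hypothetical sketch of how trajectory prediction could drive page utility: extrapolate the user's next grid cells linearly and score each buffered R-tree page by its overlap with the predicted path, evicting the lowest-utility page first. The grid model and the linear extrapolation are assumptions for illustration, not the authors' method.

```python
def predicted_cells(trajectory, steps=3):
    # Naive linear extrapolation of the user's next grid cells from the
    # last two observed positions (a stand-in for real trajectory models)
    (x0, y0), (x1, y1) = trajectory[-2], trajectory[-1]
    dx, dy = x1 - x0, y1 - y0
    return {(x1 + i * dx, y1 + i * dy) for i in range(1, steps + 1)}

def utility(page_cells, trajectory):
    # A page whose region covers the predicted path is worth keeping:
    # its eviction cost is high
    return len(page_cells & predicted_cells(trajectory))

def pick_victim(buffered, trajectory):
    # buffered: page id -> set of grid cells the page's subtree covers
    return min(buffered, key=lambda p: utility(buffered[p], trajectory))

# Example: a user moving east keeps pages along the predicted path hot
path = [(0, 0), (1, 0)]
pages = {"leafA": {(2, 0), (3, 0)}, "leafB": {(5, 9)}}
assert pick_victim(pages, path) == "leafB"
```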

  20. Conclusions and outlook • Flash memory and SSDs are becoming ubiquitous • Both at the mobile device and at the enterprise level • Need for new data structures and algorithms • Existing ones target the memory-disk performance bottleneck • That bottleneck is smaller with SSDs • A new bottleneck has appeared: read/write asymmetry • Introduce imbalance at the data structure level • Trade reads for writes through the allocation of overflow nodes • Take cost into account when managing main memory • Cost-based replacement based on motion tracking and trajectory prediction
