
Research Issues/Challenges to Systems Software for Multicore and Data-Intensive Applications


Presentation Transcript


  1. Research Issues/Challenges to Systems Software for Multicore and Data-Intensive Applications. Xiaodong Zhang, Ohio State University. In collaboration with F. Chen, X. Ding, Q. Lu, P. Sadayappan, Ohio State; S. Jiang, Wayne State University; Z. Zhang, Iowa State; Q. Lu, Intel; J. Lin, IBM; Kei Davis, Los Alamos National Lab

  2. Footsteps of Challenges in HPC • 1970s-80s: Killer applications demanded a lot of CPU cycles • a single processor was very slow (below 1 MHz) • beginning of parallel processing (PP): algorithms, architecture, software • 1980s: communication bottlenecks and the burden of PP • challenge I: fast interconnection networks • challenge II: automatic PP and shared virtual memory • 1990s: "Memory Wall" and utilization of commodity processors • challenge I: cache design and optimization • challenge II: Networks of Workstations for HPC • 2000s and now: "Disk Wall" and multi-core processors

  3. Moore's Law Driven Computing Research (IEEE Spectrum, May 2008) • 25 years of golden age of parallel computing • 10 years of dark age of parallel computing, in which the CPU-memory gap was the major concern • New era of multicore computing, in which the memory problem continues

  4. Unbalanced System Improvements • The disks in 2000 are 57 times "slower" than their ancestors in 1980 --- increasingly widening the speed gap between peta-scale computing and peta-byte accesses. (Bryant and O'Hallaron, "Computer Systems: A Programmer's Perspective", Prentice Hall, 2003)

  5. Dropping Prices of Solid State Disks (SSDs) (CACM, 7/08) [chart: flash cost ($) per GB over time]

  6. Prices & Performance: Disks, DRAMs, and SSDs (CACM, 7/08)

  7. Power Consumption for Typical Components (CACM, 7/08)

  8. Opportunities of Technology Advancements • Single-core CPU reached its peak performance • 1971 (2,300 transistors on the Intel 4004 chip): 0.4 MHz • 2005 (1 billion+ transistors on the Intel Pentium D): 3.75 GHz • After a 10,000-times improvement, clock rates stopped rising and dropped • CPU improvement is now reflected in the number of cores per chip • Increased DRAM capacity enables large working sets • 1971 ($400/MB) to 2006 (0.09 cent/MB): 444,444 times lower • Buffer cache is increasingly important to break the "disk wall" • SSDs (flash memory) can further break the "wall" • Low power (6-8X lower than disks, 2X lower than DRAM) • Fast random reads (200X faster than disks, 25X slower than DRAM) • Slow writes (300X slower than DRAM, 12X faster than disks) • Relatively expensive (8X more than disks, 5X cheaper than DRAM)

  9. Research and Challenges • New issues in multicore • Utilizing parallelism in multicore is much more complex • Resource competition in multicore causes new problems • OS scheduling is multicore- and shared-cache-unaware • Challenge: caches are not in the scope of OS management • Fast data access is most desirable • Sequential locality in disks is not effectively exploited • Where should flash memory be in the storage hierarchy? • How to use flash memory and buffer cache to improve disk performance/energy? • Challenge: disks are not in the scope of OS management

  10. Data-Intensive Scalable Computing (DISC) • Massively accessing/processing data sets in parallel • Drafted by R. Bryant at CMU, endorsed by industry (Intel, Google, Microsoft, Sun) and by scientists in many areas • Applications in science, industry, and business • Special requirements for DISC infrastructure: • Top 500 DISC ranked by data throughput as well as by FLOPS • Frequent interactions between parallel CPUs and distributed storage; scalability is challenging • DISC is not an extension of SC, but demands new technology advancements

  11. Systems Comparison (courtesy of Bryant) • Conventional supercomputer systems: disk data stored separately; no support for collection or management; data brought in for computation; time consuming; limits interactivity • DISC systems: system collects and maintains data; shared, active data set; computation co-located with disks; faster access

  12. Outline • Why is multicore the only choice? • Performance bottlenecks in multicores • OS plays a strong supportive role • A case study of DBMS on multicore • OS-based cache partitioning in multicores • Summary

  13. Multi-Core is the Only Choice to Continue Moore's Law • Baseline frequency: 1.00x performance, 1.00x power • Over-clocked (1.2x frequency): 1.13x performance, 1.73x power • Under-clocked (0.8x frequency): 0.87x performance, 0.51x power • Dual-core (0.8x frequency): 1.73x performance, 1.02x power • Much better performance at similar power consumption • R. M. Ramanathan, Intel Multi-Core Processors: Making the Move to Quad-Core and Beyond, white paper

  14. Shared Resource Conflicts in Multicores [figure: cache-sensitive, computation-intensive, and streaming jobs sharing the cache and memory bus] • Scheduling two cache-sensitive jobs - causing cache conflicts

  15. Shared Resource Conflicts in Multicores [figure: two streaming jobs saturating the memory bus] • Scheduling two cache-sensitive jobs - causing cache conflicts • Scheduling two streaming jobs - causing memory bus congestion

  16. Shared Resource Conflicts in Multicores [figure: two computation-intensive jobs leaving the cache and memory bus underutilized] • Scheduling two cache-sensitive jobs - causing cache conflicts • Scheduling two streaming jobs - causing memory bus congestion • Scheduling two CPU-intensive jobs - underutilizing cache and bus

  17. Shared Resource Conflicts in Multicores [figure: a streaming job polluting the cache and increasing memory activity] • Scheduling two cache-sensitive jobs - causing cache conflicts • Scheduling two streaming jobs - causing memory bus congestion • Scheduling two CPU-intensive jobs - underutilizing cache and bus • Scheduling cache-sensitive & streaming jobs - conflicts & congestion

  18. Many Cores, Limited Cache, Single Bus [figure: many per-core caches sharing one memory bus] • Many cores - oversupplying computational power • Limited cache - lowering cache capacity per process and per core • Single bus - increasing bandwidth sharing by many cores

  19. Moore's Law Driven Relational DBMS Research • 1970: Relational data model (E. F. Codd) • 1976: DBMS: System R and Ingres (IBM, UC Berkeley) • 1977-1997: Parallel DBMS: DIRECT, Gamma, Paradise (Wisconsin-Madison) • 1999: Architecture-optimized DBMS (MonetDB): Database Architecture Optimized for the New Bottleneck: Memory Access (VLDB'99) • What should we do as DBMS meets multicores?

  20. Good/Bad News for "Hash Join" [figure: two queries, HJ(H1, Tb) on Core 1 and HJ(H2, Ta) on Core 2, with hash tables H1, H2 and tables Ta, Tb in the shared cache] • Example query: select * from Ta, Tb where Ta.x = Tb.y • The shared cache provides both data sharing and cache contention

  21. Consequence of Cache Contention • During concurrent query executions in multicore, cache contention causes new concerns: • Suboptimal query plans - due to multicore shared-cache unawareness • Confused scheduling - generating unnecessary cache conflicts • Cache is allocated by demand, not by locality - weak-locality blocks, such as one-time-accessed blocks, pollute and waste cache space

  22. Suboptimal Query Plan • The query optimizer selects the "best" plan for a query. • The query optimizer is not shared-cache aware. • Some query plans are cache sensitive and some are not. • A set of cache-sensitive/non-sensitive queries would confuse the scheduler.

  23. Multicore Unaware Scheduling • Default scheduling takes a FIFO order, causing cache conflicts • Multicore-aware optimization: co-schedule queries • "Hash joins" are cache sensitive, "table scans" are insensitive • Default: co-schedule "hash joins" (cache conflict!) • Multicore-aware: co-schedule a "hash join" and a "table scan" • 30% improvement! (see the sketch below)
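A toy illustration of the co-scheduling idea above (the sensitivity flags, query names, and pairing loop are assumptions for illustration, not the actual DBMS scheduler): rather than pairing queries in FIFO order, a cache-sensitive query is paired with a cache-insensitive one whenever possible.

    /* Toy sketch: pair a cache-sensitive query with an insensitive one
     * instead of scheduling strictly in FIFO order.                    */
    #include <stdbool.h>
    #include <stdio.h>

    struct query { const char *name; bool cache_sensitive; };

    /* Pick a partner for q: prefer a query of the opposite sensitivity. */
    static int pick_partner(const struct query *qs, const bool *done,
                            int n, int q)
    {
        int fallback = -1;
        for (int i = 0; i < n; i++) {
            if (done[i] || i == q) continue;
            if (qs[i].cache_sensitive != qs[q].cache_sensitive)
                return i;                 /* sensitive + insensitive pair */
            if (fallback < 0) fallback = i;
        }
        return fallback;                  /* no better choice available   */
    }

    int main(void)
    {
        struct query qs[] = {
            {"hashjoin-1", true}, {"hashjoin-2", true},
            {"tablescan-1", false}, {"tablescan-2", false},
        };
        bool done[4] = {false};
        for (int i = 0; i < 4; i++) {
            if (done[i]) continue;
            done[i] = true;
            int j = pick_partner(qs, done, 4, i);
            if (j >= 0) {
                done[j] = true;
                printf("co-schedule: %s + %s\n", qs[i].name, qs[j].name);
            }
        }
        return 0;
    }

With the assumed input, the sketch pairs each "hashjoin" with a "tablescan", which is exactly the multicore-aware schedule the slide describes.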

  24. Locality-based Cache Allocation • Different queries have different cache utilization. • Cache allocation is demand-based by default. • Weak-locality queries should be allocated small space. • Co-schedule a "hash join" (strong locality) and a "table scan" (one-time accesses), and allocate more cache to the "hash join": 16% improvement!

  25. A DBMS Framework for Multicores [figure: queries flow through the query optimizer and query scheduler to cores sharing the last-level cache, with OS-level cache partitioning underneath] • Query optimization (DB level): the query optimizer generates optimal query plans based on usage of the shared cache • Query scheduling (DB level): group co-running queries to minimize access conflicts in the shared cache • Cache partitioning (OS level): allocate cache space to maximize cache utilization for co-scheduled queries

  26. Challenges and Opportunities • Challenges to DBMS • A DBMS running in user space is not able to directly control cache allocation in multicores • Scheduling: predict potential cache conflicts among co-running queries • Partitioning: determine access locality for different query operations (join, scan, aggregation, sorting, ...) • Opportunities • The query optimizer can provide hints of data access patterns and estimate working-set sizes during query execution • The operating system can manage cache allocation by using page coloring during virtual-physical address mapping

  27. OS-Based Cache Partitioning • Static cache partitioning • Predetermines the amount of cache blocks allocated to each program at the beginning of its execution • Divides the shared cache into multiple regions and partitions cache regions through OS page address mapping • Dynamic cache partitioning • Adjusts cache quotas among processes dynamically • Dynamically changes processes' cache usage through OS page address re-mapping

  28. Page Coloring • Physically indexed caches are divided into multiple regions (colors). • All cache lines in a physical page are cached in one of those regions (colors). • The OS can control the page color of a virtual page through address mapping, by selecting a physical page with a specific value in its page color bits. [figure: the virtual page number is translated to a physical page number; the page color bits are the physical-address bits shared by the cache set index and the physical page number]
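A minimal sketch of how page color bits can be derived, assuming an illustrative physically indexed last-level cache (the cache parameters below are assumptions, not taken from the slides):

    /* Sketch: deriving a physical page's color from its frame number. */
    #include <stdio.h>

    #define CACHE_SIZE    (4u << 20)   /* 4 MB shared cache (assumed)  */
    #define ASSOCIATIVITY 16u
    #define PAGE_SIZE     4096u

    /* Number of page colors = (cache size / associativity) / page size,
     * i.e., how many page-sized chunks one cache "way" spans.          */
    static unsigned num_colors(void)
    {
        return CACHE_SIZE / ASSOCIATIVITY / PAGE_SIZE;   /* 64 here     */
    }

    /* A page's color is the low bits of its page frame number that
     * overlap the cache set index.                                     */
    static unsigned page_color(unsigned long pfn)
    {
        return (unsigned)(pfn % num_colors());
    }

    int main(void)
    {
        printf("colors: %u\n", num_colors());
        printf("color of PFN 0x12345: %u\n", page_color(0x12345));
        return 0;
    }

By choosing, at mapping time, only physical pages whose color falls in a given set, the OS confines a process to the corresponding cache regions.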

  29. Enhancement for Static Cache Partitioning • Physical pages are grouped into page bins according to their page color. • The shared cache is partitioned between two processes through OS address mapping: each process is mapped only to physical pages drawn from its own subset of page bins (colors). • Cost: main memory space needs to be partitioned too (co-partitioning).
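A hedged sketch of the per-color page bins (structure and helper names are assumptions, not the authors' kernel code): each color has its own free list, and a process allocates only from the colors assigned to it, which is also why main memory ends up co-partitioned.

    /* Hypothetical per-color page bins for static partitioning. */
    #include <stdbool.h>
    #include <stddef.h>

    #define NUM_COLORS 64

    struct page { unsigned long pfn; struct page *next; };

    /* One free list ("bin") per page color. */
    static struct page *bins[NUM_COLORS];

    /* Allocate a free page whose color belongs to the process's
     * assigned color set; NULL if all assigned bins are empty.     */
    struct page *alloc_colored_page(const bool assigned[NUM_COLORS])
    {
        for (int c = 0; c < NUM_COLORS; c++) {
            if (assigned[c] && bins[c]) {
                struct page *p = bins[c];
                bins[c] = p->next;        /* pop from that color's bin */
                return p;
            }
        }
        return NULL;                      /* colors exhausted: memory is
                                             co-partitioned with cache */
    }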

  30. Page Re-Coloring for Dynamic Partitioning • Pages of a process are organized into linked lists by their colors (a page-links table indexed by color 0 .. N-1). • Memory allocation guarantees that pages are evenly distributed over all the lists (colors) to avoid hot spots. • Page re-coloring: allocate a page in the new color, copy the memory contents, free the old page.
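A compact sketch of the re-coloring step itself (the helper names are hypothetical; in a real kernel the copy happens during address re-mapping):

    /* Sketch of page re-coloring: move one page to a new color. */
    #include <string.h>

    #define PAGE_SIZE 4096u

    struct page { unsigned color; void *data; };

    /* Hypothetical helpers: allocate a free physical page of the given
     * color, and release a page back to its color bin.                 */
    extern struct page *alloc_page_of_color(unsigned color);
    extern void free_page(struct page *p);

    /* Re-color: allocate in the new color, copy contents, free the old
     * page. The caller then updates the virtual-to-physical mapping.   */
    struct page *recolor_page(struct page *old, unsigned new_color)
    {
        struct page *np = alloc_page_of_color(new_color);
        if (!np)
            return old;                         /* keep old on failure  */
        memcpy(np->data, old->data, PAGE_SIZE); /* copy memory contents */
        free_page(old);                         /* free the old page    */
        return np;
    }

The copy is what makes dynamic partitioning expensive, which motivates the migration-overhead reductions on the next slides.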

  31. Reduce Page Migration Overhead • Control the frequency of page migration: frequent enough to capture phase changes, while reducing the frequency of large page migrations • Lazy migration: avoid unnecessary migration • Observation: not all pages are accessed between two of their migrations • Optimization: do not migrate a page until it is accessed

  32. Lazy Page Migration • With the optimization, pages that are never touched between re-colorings avoid unnecessary migration. • Only 2% page-migration overhead on average, and up to 7%.
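A hedged sketch of lazy migration (helper names and the fault-path hook are assumptions): when the partition changes, pages are only tagged with their target color and unmapped; the physical move happens on the next access, so untouched pages are never copied.

    /* Lazy page migration sketch: defer the copy until the page is used. */
    #include <stdbool.h>

    struct page {
        unsigned color;          /* current color                        */
        unsigned target_color;   /* color it should move to              */
        bool     needs_move;     /* set when the partition is re-sized   */
    };

    /* Hypothetical helpers. */
    extern struct page *recolor_page(struct page *p, unsigned new_color);
    extern void unmap_page(struct page *p); /* force a fault on next use */

    /* Called when the cache quota changes: just mark pages, don't copy. */
    void schedule_recolor(struct page *p, unsigned new_color)
    {
        p->target_color = new_color;
        p->needs_move = true;
        unmap_page(p);
    }

    /* Called from the access (fault) path: migrate only pages actually
     * touched since the quota change.                                   */
    struct page *on_page_access(struct page *p)
    {
        if (p->needs_move) {
            p = recolor_page(p, p->target_color);
            p->needs_move = false;
        }
        return p;
    }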

  33. Research in Cache Partitioning • Merits of page-color-based cache partitioning • Static partitioning has low overhead, although memory has to be co-partitioned among processes • A measurement-based platform can be built to evaluate cache-partitioning methods on real machines • Limits and research issues • The overhead of dynamic cache partitioning is high • The cache is managed indirectly, at the page level • Can the OS directly partition the cache with low overhead? We have proposed a hybrid method for this purpose

  34. "Disk Wall" is a Critical Issue • Many data-intensive applications generate huge data sets on disks worldwide at very high speed. • LANL turbulence simulation: processing 100+ TB. • Google searches and accesses over 10 billion web pages and tens of TB of data on the Internet. • Internet traffic is expected to increase from 1 to 16 million TB/month due to multimedia data. • We carry very large digital data: films, photos, ... • The cost-effective and reliable home for data is disks. • Slow disk data access is the major bottleneck.

  35. Data-Intensive Scalable Computing (DISC) • Massively accessing/processing data sets in parallel • Drafted by R. Bryant at CMU, endorsed by industry (Intel, Google, Microsoft, Sun) and by scientists in many areas • Applications in science, industry, and business • Special requirements for DISC infrastructure: • Top 500 DISC ranked by data throughput as well as by FLOPS • Frequent interactions between parallel CPUs and distributed storage; scalability is challenging • DISC is not an extension of SC, but demands new technology advancements

  36. Systems Comparison (courtesy of Bryant) • Conventional supercomputer systems: disk data stored separately; no support for collection or management; data brought in for computation; time consuming; limits interactivity • DISC systems: system collects and maintains data; shared, active data set; computation co-located with disks; faster access

  37. Sequential Locality is Unique in Disks • Sequential locality: disk accesses in sequence are fastest. • Disk speed is limited by mechanical constraints: seek/rotation (high latency and power consumption). • The OS can guess the sequential disk layout, but it is not always right.

  38. Weak OS Ability to Exploit Sequential Locality • The OS is not exactly aware of the disk layout • Sequential data placement has been implemented since the Fast File System in BSD (1984) • put files in one directory in sequence on disk • follow the execution sequence to place data on disk • assume temporal sequence = disk layout sequence • The assumption is not always right, and performance suffers • data are accessed in both sequential and random patterns • an application accesses multiple files • buffer caching/prefetching know little about the disk layout

  39. IBM Ultrastar 18ZX Specification* • Sequential read: 4,700 IO/s • Random read: < 200 IO/s • Our goal: to maximize opportunities for sequential accesses, for high speed and high I/O throughput • *Taken from IBM "ULTRASTAR 9LZX/18ZX Hardware/Functional Specification", Version 2.4

  40. Existing Approaches and Limits • Programming for disk performance • Hiding disk latency by overlapping computation, e.g., sorting large data sets (SIGMOD'97) • Application dependent and a programming burden • Transparent and Informed Prefetching (TIP) • Applications issue hints on their future I/O patterns to guide prefetching/caching (SOSP'99) • Not general enough to cover all applications • Collective I/O: gather multiple I/O requests to make contiguous disk accesses for parallel programs

  41. Our Objectives • Exploiting sequential locality in disks • by minimizing random disk accesses • making disk-aware caching and prefetching • utilizing both buffer cache and SSDs • An application-independent approach: putting disk access information on the OS map • Exploiting DUal LOcalities (DULO): • temporal locality of program execution • sequential locality of disk accesses

  42. What is Buffer Cache Aware and Unaware? • The buffer cache is an agent between I/O requests and disks • It is aware of access patterns in time sequence (a good position to exploit temporal locality) • It is not clear about the physical layout (limited ability to exploit sequential locality in disks) • Existing functions: send unsatisfied requests to disks; LRU replacement by temporal locality; prefetch under a sequential-access assumption • Ineffectiveness of the I/O scheduler: sequential locality is not open to buffer management [figure: application I/O requests pass through the buffer cache (caching & prefetching), the I/O scheduler, and the disk driver to the disk]

  43. Limits of Hit-Ratio-Based Buffer Cache Management • Minimizing the cache miss ratio exploits only temporal locality • Sequentially accessed blocks: small miss penalty (sequential locality) • Randomly accessed blocks: large miss penalty (temporal locality)

  44. Unique and Critical Roles of Buffer Cache [figure: random blocks A-D and a sequential run X1-X4 laid out on disk tracks] • The buffer cache can influence request stream patterns in disks • If the buffer cache is disk-layout-aware, the OS is able to • distinguish sequentially and randomly accessed blocks • give "expensive" random blocks high caching priority in DRAM/SSD • replace long sequential runs of data blocks to disks in a timely manner • Disk accesses become more sequential

  45. Prefetching Efficiency is Performance Critical [figure: a process alternates between computation and idle waits on synchronous and prefetch disk requests] • Prefetching may incur non-sequential disk access • Non-sequential accesses are much slower than sequential accesses • Disk layout information must be introduced into prefetching policies • It is increasingly difficult to hide disk accesses behind computation

  46. File-Level Prefetching is Disk-Layout Unaware [figure: files X, Y, Z, R and their metadata scattered across the disk] • Multiple files sequentially allocated on disks cannot be prefetched at once • Metadata are allocated separately on disks and cannot be prefetched • Sequentiality at the file abstraction may not translate to sequentiality on the physical disk • Deep access-history information is usually not recorded

  47. Opportunities and Challenges • With disk spatial locality ("disk-seen"), exploit DULO for fast disk accesses • Challenges in building a disk-seen system infrastructure • Disk layout information is increasingly hidden in disks • analyze and utilize disk-layout information • accurately and timely identify long disk sequences • consider trade-offs of temporal and spatial locality (buffer cache hit ratio vs. miss penalty: not necessarily following LRU) • manage its data structures with low overhead • implement it in the OS kernel for practical usage

  48. Disk-Seen Task 1: Make Disk Layout Information Available • Which disk layout information to use? • Logical block number (LBN): a location mapping provided by firmware (each block is given a sequence number) • Accesses of contiguous LBNs perform close to accesses of contiguous blocks on disk (except where bad blocks occur) • The LBN interface is highly portable across platforms • How to efficiently manage the disk layout information? • LBN is normally used only to identify disk locations for reads/writes; we want to track access times of disk blocks and search for access sequences via LBNs • Disk block table: a data structure for efficiently tracking disk blocks
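A hedged sketch of what such a block table could look like (the two-level layout and sizes are illustrative assumptions, not the exact structure used in the work): a radix-style table indexed by LBN that records when each block was last accessed, so adjacent LBNs touched close together in time can later be grouped into sequences.

    /* Sketch of a disk block table indexed by LBN. */
    #include <stdlib.h>

    #define L1_BITS 12               /* top-level entries: 4096         */
    #define L2_BITS 12               /* blocks per leaf: 4096           */
    #define L2_SIZE (1u << L2_BITS)

    struct leaf { unsigned long access_time[L2_SIZE]; /* 0 = unseen */ };

    static struct leaf *block_table[1u << L1_BITS];

    /* Record that block 'lbn' was accessed at logical time 'now'. */
    void block_table_record(unsigned long lbn, unsigned long now)
    {
        unsigned long hi = lbn >> L2_BITS;
        unsigned long lo = lbn & (L2_SIZE - 1);

        if (hi >= (1u << L1_BITS))
            return;                      /* sketch covers 24-bit LBNs only */
        if (!block_table[hi])
            block_table[hi] = calloc(1, sizeof(struct leaf));
        if (block_table[hi])
            block_table[hi]->access_time[lo] = now;
    }

    /* Last access time of a block, or 0 if never accessed. */
    unsigned long block_table_lookup(unsigned long lbn)
    {
        unsigned long hi = lbn >> L2_BITS;
        if (hi >= (1u << L1_BITS) || !block_table[hi])
            return 0;
        return block_table[hi]->access_time[lbn & (L2_SIZE - 1)];
    }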

  49. Disk-Seen Task 2: Exploiting Dual Localities (DULO) [figure: LRU stack divided into a staging section (correlation buffer and sequencing bank) and an evicting section] • Sequence forming: a sequence is a number of blocks whose disk locations are adjacent and that have been accessed during a limited time period • Sequence sorting: based on recency (temporal locality) and size (spatial locality)
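A rough sketch of sequence forming under the definition above (the data layout and time threshold are assumptions): blocks in the sequencing bank are ordered by LBN, and runs of adjacent LBNs whose access times fall within a limited window are grouped into one sequence.

    /* Sketch: group LBN-adjacent blocks accessed close together in time. */
    #include <stdlib.h>

    struct blk { unsigned long lbn; unsigned long access_time; };

    #define TIME_WINDOW 64   /* "limited time period" (assumed units)     */

    static int by_lbn(const void *a, const void *b)
    {
        const struct blk *x = a, *y = b;
        return (x->lbn > y->lbn) - (x->lbn < y->lbn);
    }

    /* Writes the start index of each formed sequence into 'starts' and
     * returns how many sequences were found.                             */
    size_t form_sequences(struct blk *bank, size_t n, size_t *starts)
    {
        size_t nseq = 0;
        qsort(bank, n, sizeof(*bank), by_lbn);   /* order by disk location */

        for (size_t i = 0; i < n; i++) {
            int adjacent = i > 0 && bank[i].lbn == bank[i - 1].lbn + 1;
            unsigned long dt = i > 0
                ? (bank[i].access_time > bank[i - 1].access_time
                       ? bank[i].access_time - bank[i - 1].access_time
                       : bank[i - 1].access_time - bank[i].access_time)
                : 0;
            /* Start a new sequence when blocks are not adjacent on disk
             * or were accessed too far apart in time.                    */
            if (i == 0 || !adjacent || dt > TIME_WINDOW)
                starts[nseq++] = i;
        }
        return nseq;
    }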

  50. Disk-SeenTASK 3:DULO-Caching • Adapted GreedyDual Algorithm • a global inflation value L, and a value H for each sequence • Calculate H values for sequences in sequencing bank: H = L + 1 / Length( sequence ) Random blocks have larger H values • When a sequence (s) is replaced, L = H value of s . L increases monotonically and make future sequences have larger H values • Sequences with smaller H values are placed closer to the bottom of LRU stack H=L0+0.25 H=L0+1 H=L0+1 H=L0+0.25 LRU Stack L=L0 L=L1
