Presentation Transcript
slide1

Research Issues/Challenges to Systems Software for Multicore and Data-Intensive Applications

Xiaodong Zhang

Ohio State University

In collaboration with

F. Chen, X. Ding, Q. Lu, P. Sadayappan, Ohio State

S. Jiang, Wayne State University, Z. Zhang, Iowa State

Q. Lu, Intel, J. Lin, IBM

Kei Davis, Los Alamos National Lab

slide2

Footsteps of Challenges in HPC

  • 1970s-80s: killer applications demand a lot of CPU cycles
    • a single processor was very slow (below 1 MHz)
    • beginning of parallel processing (PP): algorithms, architecture, software
  • 1980s: communication bottlenecks and the burden of PP
    • challenge I: fast interconnection networks
    • challenge II: automatic PP and shared virtual memory
  • 1990s: “Memory Wall” and utilization of commodity processors
    • challenge I: cache design and optimization
    • challenge II: Networks of Workstations for HPC
  • 2000s and now: “Disk Wall” and multicore processors
Moore’s Law Driven Computing Research(IEEE Spectrum, May 2008)

(Figure: timeline of computing research — 25 years of the golden age of parallel computing, followed by 10 years of a dark age of parallel computing in which the CPU-memory gap is the major concern, and now a new era of multicore computing in which the memory problem continues.)

slide4

Unbalanced System Improvements

Disks in 2000 are 57 times “slower” than their ancestors in 1980, increasingly widening the speed gap between peta-scale computing and peta-byte accesses.

Bryant and O’Hallaron, “Computer Systems: A Programmer’s Perspective”,

Prentice Hall, 2003

Opportunities of Technology Advancements

  • Single-core CPU reached its peak performance
    • 1971 (2,300 transistors on the Intel 4004 chip): 0.4 MHz
    • 2005 (1 billion+ transistors on the Intel Pentium D): 3.75 GHz
    • after a 10,000-times improvement, clock rates stopped rising and then dropped
    • CPU improvement is now reflected by the number of cores in a chip
  • Increased DRAM capacity enables large working sets
    • 1971 ($400/MB) to 2006 (0.09 cent/MB): 444,444 times lower
    • buffer cache is increasingly important to break the “disk wall”
  • SSDs (flash memory) can further break the “wall”
    • low power (6-8X lower than disks, 2X lower than DRAM)
    • fast random reads (200X faster than disks, 25X slower than DRAM)
    • slow writes (300X slower than DRAM, 12X faster than disks)
    • relatively expensive (8X more than disks, 5X cheaper than DRAM)

Research and Challenges

  • New issues in multicore
    • utilizing parallelism in multicore is much more complex
    • resource competition in multicore causes new problems
    • OS scheduling is multicore- and shared-cache-unaware
    • challenge: caches are not in the scope of OS management
  • Fast data access is most desirable
    • sequential locality in disks is not effectively exploited
    • where should flash memory be in the storage hierarchy?
    • how can flash memory and the buffer cache be used to improve disk performance/energy?
    • challenge: disks are not in the scope of OS management

slide10

Data-Intensive Scalable Computing (DISC)

  • Massively Accessing/Processing Data Sets in Parallel.
    • drafted by R. Bryant at CMU, endorsed by Industries: Intel, Google, Microsoft, Sun, and scientists in many areas.
    • Applications in science, industry, and business.
  • Special requirements for DISC Infrastructure:
    • Top 500 DISC ranked by data throughput, as well as FLOPS
    • Frequent interactions between parallel CPUs and distributed storage. Scalability is challenging.
    • DISC is not an extension of SC, but demands new technology advancements.
Systems Comparison (courtesy of Bryant)

Conventional computer systems:
  • Disk data stored separately
  • No support for collection or management
  • Brought in for computation
  • Time consuming
  • Limits interactivity

DISC systems:
  • System collects and maintains data
  • Shared, active data set
  • Computation co-located with disks
  • Faster access

slide12

Outline

  • Why is multicore the only choice?
  • Performance bottlenecks in multicores
  • OS plays a strong supportive role
  • A case study of DBMS on multicore
  • OS-based cache partitioning in multicores
  • Summary
Multi-Core is the only Choice to Continue Moore’s Law

(Figure: relative performance and power, normalized to the baseline frequency. Baseline: 1.00x performance, 1.00x power. Over-clocked (1.2x): 1.13x performance, 1.73x power. Under-clocked (0.8x): 0.87x performance, 0.51x power. Dual-core (0.8x): 1.73x performance, 1.02x power — much better performance at similar power consumption.)

R.M. Ramanathan, Intel Multi-Core Processors: Making the Move to Quad-Core and Beyond, white paper

slide14

Shared Resource Conflicts in Multicores

(Figure: job queues feed two cores that share a cache and the memory bus; the jobs include cache-sensitive, computation-intensive, and streaming jobs. Two cache-sensitive jobs conflict in the shared cache.)

  • Scheduling two cache sensitive jobs - causing cache conflicts
slide15

Shared Resource Conflicts in Multicores

(Figure: the same two-core system; two streaming jobs saturate the shared memory bus.)

  • Scheduling two cache sensitive jobs - causing cache conflicts
  • Scheduling two streaming jobs - causing memory bus congestion
slide16

Shared Resource Conflicts in Multicores

(Figure: the same two-core system; two computation-intensive jobs leave the shared cache and memory bus underutilized.)

  • Scheduling two cache sensitive jobs - causing cache conflicts
  • Scheduling two streaming jobs - causing memory bus congestion
  • Scheduling two CPU intensive jobs - underutilizing cache and bus
slide17

Shared Resource Conflicts in Multicores

(Figure: the same two-core system; a streaming job pollutes the shared cache and increases memory activity while co-running with a cache-sensitive job.)

  • Scheduling two cache sensitive jobs - causing cache conflicts
  • Scheduling two streaming jobs - causing memory bus congestion
  • Scheduling two CPU intensive jobs - underutilizing cache and bus
  • Scheduling cache sensitive & streaming jobs - conflicts & congestion
slide18

Many Cores, Limited Cache, Single Bus

(Figure: many cores with their caches share a single memory bus to memory.)

  • Many Cores - oversupplying computational power
  • Limited Cache - lowering cache capacity per process and per core
  • Single Bus - increasing bandwidth sharing by many cores
Moore’s Law Driven Relational DBMS Research

What should we do as DBMS meets multicores?

(Timeline of relational DBMS research:)
  • 1970: relational data model (E. F. Codd)
  • 1976: DBMS: System R and Ingres (IBM, UC Berkeley)
  • 1977-1997: parallel DBMS: DIRECT, Gamma, Paradise (Wisconsin, Madison)
  • Architecture-optimized DBMS (MonetDB): “Database Architecture Optimized for the New Bottleneck: Memory Access” (VLDB’99)

Good/Bad News for “Hash Join”

select * from Ta, Tb where Ta.x = Tb.y

(Figure: Core 1 runs HJ(H1, Tb) for Query 1 and Core 2 runs HJ(H2, Ta) for Query 2; hash tables H1 and H2 and tables Ta and Tb compete for space in the shared cache in front of memory, producing both sharing and conflicts.)

The shared cache provides both data sharing and cache contention.

Consequence of Cache Contention

During concurrent query executions in multicore, cache contention causes new concerns:

  • Suboptimal query plans
    • due to multicore shared-cache unawareness
  • Confused scheduling
    • generating unnecessary cache conflicts
  • Cache is allocated by demand, not by locality
    • weak-locality blocks, such as one-time-accessed blocks, pollute and waste cache space
Suboptimal Query Plan
  • The query optimizer selects the “best” plan for a query.
  • The query optimizer is not shared-cache aware.
  • Some query plans are cache sensitive and some are not.
  • A set of cache-sensitive/non-sensitive queries would confuse the scheduler.
Multicore Unaware Scheduling
  • Default scheduling takes a FIFO order, causing cache conflicts
  • Multicore-aware optimization: co-schedule queries
  • “hashjoins” (cache sensitive), “table scans” (insensitive)
  • Default: co-schedule “hashjoins” (cache conflict!)
  • Multicore-aware: co-schedule “hashjoin” and “tablescan”

(Measured result: 30% improvement from multicore-aware co-scheduling; a co-scheduling sketch follows below.)
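The co-scheduling idea above can be sketched as follows; this is a minimal illustration assuming each query carries a cache-sensitivity tag from the optimizer (the struct and names are illustrative, not the actual DBMS scheduler):

#include <stdio.h>

/* Illustrative query descriptor; 'cache_sensitive' would come from an
 * optimizer hint (e.g., hash join = sensitive, table scan = insensitive). */
struct query {
    const char *name;
    int cache_sensitive;
};

/* Instead of FIFO order (which may co-run two hash joins), pair one
 * cache-sensitive query with one insensitive query on the two cores
 * sharing a cache.  Assumes at most 64 queued queries. */
static void co_schedule(struct query *q, int n)
{
    int used[64] = { 0 };
    for (;;) {
        int s = -1, ns = -1;
        for (int i = 0; i < n; i++) {
            if (used[i]) continue;
            if (q[i].cache_sensitive && s < 0) s = i;
            if (!q[i].cache_sensitive && ns < 0) ns = i;
        }
        if (s < 0 && ns < 0) break;                 /* nothing left to run */
        if (s >= 0) used[s] = 1;
        if (ns >= 0) used[ns] = 1;
        printf("co-run: %s + %s\n",
               s  >= 0 ? q[s].name  : "(idle)",
               ns >= 0 ? q[ns].name : "(idle)");
    }
}

int main(void)
{
    struct query batch[] = {
        { "hashjoin-1", 1 }, { "hashjoin-2", 1 },
        { "tablescan-1", 0 }, { "tablescan-2", 0 },
    };
    co_schedule(batch, 4);    /* pairs each hashjoin with a tablescan */
    return 0;
}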

Locality-based Cache Allocation
  • Different queries have different cache utilization.
  • Cache allocation is demand-based by default.
  • Weak locality queries should be allocated small space.

Co-schedule “hashjoin” (strong locality) and “tablescan” (one-time accesses), allocate more to “hashjoin”.

(Measured result: 16% improvement from locality-based cache allocation.)

A DBMS Framework for Multicores

(Figure: queries pass through the query optimizer, the query scheduler, and OS cache partitioning before running on cores that share the last-level cache.)

  • Query optimization (DB level)
    • the query optimizer generates optimal query plans based on usage of the shared cache
  • Query scheduling (DB level)
    • group co-running queries to minimize access conflicts in the shared cache
  • Cache partitioning (OS level)
    • allocate cache space to maximize cache utilization for co-scheduled queries

Challenges and Opportunities
  • Challenges to DBMS
    • a DBMS running in user space is not able to directly control cache allocation in multicores
    • scheduling: predict potential cache conflicts among co-running queries
    • partitioning: determine access locality for different query operations (join, scan, aggregation, sorting, …)
  • Opportunities
    • the query optimizer can provide hints of data access patterns and estimate working-set sizes during query execution
    • the operating system can manage cache allocation by using page coloring during virtual-physical address mapping.
OS-Based Cache Partitioning

  • Static cache partitioning
    • predetermines the number of cache blocks allocated to each program at the beginning of its execution
    • divides the shared cache into multiple regions and partitions cache regions through OS page address mapping
  • Dynamic cache partitioning
    • adjusts cache quotas among processes dynamically
    • dynamically changes processes’ cache usage through OS page address re-mapping

Page Coloring
  • Physically indexed caches are divided into multiple regions (colors).
  • All cache lines in a physical page are cached in one of those regions (colors).

(Figure: a virtual address (virtual page number, page offset) is translated under OS control to a physical address (physical page number, page offset); the physically indexed cache decomposes the same physical address into cache tag, set index, and block offset. The page color bits are the bits shared by the physical page number and the cache set index.)

OS can control the page color of a virtual page through address mapping (by selecting a physical page with a specific value in its page color bits). A sketch of the color computation follows below.
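To make the color computation concrete, here is a minimal sketch with assumed cache parameters (4 MB, 16-way, 4 KB pages are illustrative values, not a specific processor):

#include <stdint.h>
#include <stdio.h>

/* Assumed example parameters. */
#define CACHE_SIZE (4u << 20)   /* 4 MB physically indexed last-level cache */
#define ASSOC      16u          /* 16-way set associative */
#define PAGE_SIZE  4096u        /* 4 KB pages */

/* Number of page colors = (cache size / associativity) / page size:
 * these are the set-index bits that lie above the page-offset bits. */
static unsigned num_colors(void)
{
    return (CACHE_SIZE / ASSOC) / PAGE_SIZE;     /* 64 colors here */
}

/* The color of a physical page is the low bits of its physical page number,
 * i.e., the bits shared by the physical page number and the cache set index. */
static unsigned page_color(uint64_t phys_addr)
{
    uint64_t ppn = phys_addr / PAGE_SIZE;
    return (unsigned)(ppn % num_colors());
}

int main(void)
{
    uint64_t pa = 0x12345000;
    printf("%u colors; physical address 0x%llx has color %u\n",
           num_colors(), (unsigned long long)pa, page_color(pa));
    return 0;
}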

Enhancement for Static Cache Partitioning

Physical pages are grouped into page bins according to their page color (a bin-allocation sketch follows below).

(Figure: the page bins of Process 1 and Process 2 are mapped by the OS to disjoint colors of a physically indexed cache; the shared cache is partitioned between the two processes through address mapping.)

Cost: main memory space needs to be partitioned too (co-partitioning).
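A minimal sketch of how such page bins could drive colored allocation; the free-list layout and the per-process color mask are illustrative assumptions, not the actual kernel code:

#include <stddef.h>

#define NUM_COLORS 64          /* assumed, matching the color computation above */

/* One free list ("bin") of physical pages per color. */
struct page { struct page *next; unsigned long pfn; };
static struct page *bins[NUM_COLORS];

/* Each process is statically assigned a subset of colors via a bitmask;
 * allocation round-robins over its colors so its pages, and hence its
 * cache footprint, stay within its partition. */
struct proc_partition { unsigned long long color_mask; unsigned next_color; };

static struct page *alloc_colored_page(struct proc_partition *p)
{
    for (unsigned tries = 0; tries < NUM_COLORS; tries++) {
        unsigned c = p->next_color;
        p->next_color = (p->next_color + 1) % NUM_COLORS;
        if (!(p->color_mask & (1ULL << c)) || bins[c] == NULL)
            continue;                   /* color not ours, or bin empty */
        struct page *pg = bins[c];
        bins[c] = pg->next;             /* pop a page from that color's bin */
        return pg;
    }
    return NULL;                        /* no free page in our colors */
}

The co-partitioning cost noted above falls out of this scheme: a process can only receive memory frames from the bins of its own colors.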

Page Re-Coloring for Dynamic Partitioning

Page re-coloring (a sketch follows below):
  • allocate a page in the new color
  • copy the memory contents
  • free the old page

  • Pages of a process are organized into linked lists by their colors.
  • Memory allocation guarantees that pages are evenly distributed into all the lists (colors) to avoid hot spots.

(Figure: a per-process page-links table with one list per color, 0 through N-1; the colors currently allocated to the process are marked.)
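The three re-coloring steps can be sketched as below; the helper functions are hypothetical stand-ins for kernel services (colored allocation, page-table remapping), not actual Linux APIs:

#include <string.h>

#define PAGE_SIZE 4096

/* Hypothetical helpers standing in for kernel services. */
void *alloc_page_with_color(unsigned color);            /* grab a free page of that color */
void  free_page_frame(void *frame);                     /* return a frame to its bin */
void  remap_virtual_page(void *vaddr, void *new_frame); /* update the page table entry */

/* Re-color one virtual page: allocate in the new color, copy, remap, free. */
static int recolor_page(void *vaddr, void *old_frame, unsigned new_color)
{
    void *new_frame = alloc_page_with_color(new_color);
    if (new_frame == NULL)
        return -1;                              /* no free page of the target color */
    memcpy(new_frame, old_frame, PAGE_SIZE);    /* copy memory contents */
    remap_virtual_page(vaddr, new_frame);       /* point the mapping at the new frame */
    free_page_frame(old_frame);                 /* free the old page */
    return 0;
}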

Reduce Page Migration Overhead

  • Control the frequency of page migration
    • frequent enough to capture phase changes
    • reduce large page-migration frequency
  • Lazy migration: avoid unnecessary migration
    • observation: not all pages are accessed between two consecutive migrations.
    • optimization: do not migrate a page until it is accessed

Lazy Page Migration

(Figure: the per-process page-links table again, colors 0 through N-1; unnecessary page migration is avoided for pages that are never accessed after a quota change.)

With this optimization, page migration overhead is only 2% on average (up to 7%).
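One plausible way to realize "do not migrate until accessed" is to unmap the page when the quota changes and defer the copy to the page-fault handler; the sketch below assumes that mechanism and reuses the hypothetical helpers above (an illustration, not necessarily how the authors implemented it):

/* Hypothetical helper: clear the mapping so the next access faults. */
void unmap_virtual_page(void *vaddr);
int  recolor_page(void *vaddr, void *old_frame, unsigned new_color);  /* from the sketch above */

struct lazy_page { void *vaddr; void *frame; unsigned target_color; };

/* Phase 1: when the cache quota changes, only mark pages for migration. */
static void schedule_lazy_migration(struct lazy_page *p, unsigned new_color)
{
    p->target_color = new_color;
    unmap_virtual_page(p->vaddr);        /* the page is NOT copied yet */
}

/* Phase 2: in the page-fault handler, migrate only the pages actually touched. */
static int lazy_migration_fault(struct lazy_page *p)
{
    return recolor_page(p->vaddr, p->frame, p->target_color);
}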

Research in Cache Partitioning

  • Merits of page-color-based cache partitioning
    • static partitioning has low overhead, although memory has to be co-partitioned among processes
    • a measurement-based platform can be built to evaluate cache partitioning methods on real machines
  • Limits and research issues
    • the overhead of dynamic cache partitioning is high
    • the cache is managed indirectly, at the page level
    • can the OS directly partition the cache with low overhead?
    • we have proposed a hybrid method for this purpose

slide34

“Disk Wall” is a Critical Issue

  • Many data-intensive applications generate huge data sets on disks worldwide at a very fast speed.
      • LANL turbulence simulation: processing 100+ TB.
      • Google searches and accesses over 10 billion web pages and tens of TB of data on the Internet.
      • Internet traffic is expected to increase from 1 to 16 million TB/month due to multimedia data.
      • We carry very large digital data: films, photos, …
  • The home of the data is cost-effective and reliable disks
      • Slow disk data access is the major bottleneck
slide35

Data-Intensive Scalable Computing (DISC)

  • Massively Accessing/Processing Data Sets in Parallel.
    • drafted by R. Bryant at CMU, endorsed by Industries: Intel, Google, Microsoft, Sun, and scientists in many areas.
    • Applications in science, industry, and business.
  • Special requirements for DISC Infrastructure:
    • Top 500 DISC ranked by data throughput, as well as FLOPS
    • Frequent interactions between parallel CPUs and distributed storage. Scalability is challenging.
    • DISC is not an extension of SC, but demands new technology advancements.
Systems Comparison (courtesy of Bryant)

Conventional computer systems:
  • Disk data stored separately
  • No support for collection or management
  • Brought in for computation
  • Time consuming
  • Limits interactivity

DISC systems:
  • System collects and maintains data
  • Shared, active data set
  • Computation co-located with disks
  • Faster access

slide37

Sequential Locality is Unique in Disks

  • Sequential locality: disk accesses in sequence are fastest
  • Disk speed is limited by mechanical constraints
    • seek/rotation (high latency and power consumption)
  • The OS can guess the sequential disk layout, but it is not always right.
slide38

Weak OS Ability to Exploit Sequential Locality

  • The OS is not exactly aware of the disk layout
    • Sequential data placement has been implemented
      • since the Fast File System in BSD (1984)
      • puts the files of one directory in sequence on disk
      • follows the execution sequence to place data on disk.
    • Assumes temporal sequence = disk layout sequence.
  • The assumption is not always right, and performance suffers.
    • data accesses occur in both sequential and random patterns
    • an application accesses multiple files.
    • buffer caching/prefetching know little about the disk layout.
IBM Ultrastar 18ZX Specification *

(Figure: sequential reads achieve 4,700 IO/s, while random reads achieve fewer than 200 IO/s.)

Our goal: to maximize opportunities for sequential accesses, for high speed and high I/O throughput.

* Taken from the IBM “ULTRASTAR 9LZX/18ZX Hardware/Functional Specification”, Version 2.4

slide40

Existing Approaches and Limits

  • Programming for Disk Performance
    • Hiding disk latency by overlapping computing
    • Sorting large data sets (SIGMOD’97)
    • Application dependent and programming burden
  • Transparent and Informed Prefetching (TIP)
    • Applications issue hints on their future I/O patterns to guide prefetching/caching (SOSP’99)
    • Not general enough to cover all applications
  • Collective I/O: gather multiple I/O requests
    • make contiguous disk accesses for parallel programs
slide41

Our Objectives

  • Exploiting sequential locality in disks
    • by minimizing random disk accesses
    • making disk-aware caching and prefetching
    • utilizing both buffer cache and SSDs
  • Application independent approach
    • putting disk access information on OS map
    • Exploiting DUal LOcalities (DULO):
      • Temporal locality of program execution
      • Sequential locality of disk accesses
slide42

What is Buffer Cache Aware and Unaware?

(Figure: the I/O stack — application I/O requests flow through the buffer cache (caching & prefetching), the I/O scheduler, and the disk driver to the disk.)

  • The buffer cache is an agent between I/O requests and disks.
    • aware of access patterns in time sequence (in a good position to exploit temporal locality)
    • not clear about the physical layout (limited ability to exploit sequential locality in disks)
  • Existing functions
    • send unsatisfied requests to disks
    • LRU replacement by temporal locality
    • prefetch under the sequential access assumption.
  • Ineffectiveness of the I/O scheduler: sequential locality is not open to buffer management.

slide43

Limits of Hit-ratio based Buffer Cache Management

  • Minimizing the cache miss ratio by only exploiting temporal locality
    • sequentially accessed blocks: small miss penalty
    • randomly accessed blocks: large miss penalty

(Figure: cached blocks characterized along two dimensions — temporal locality and sequential locality.)

slide44

(Figure: a hard disk drive with random blocks A, B, C, D and a sequential run X1-X4 laid out on the disk tracks.)

  • Unique and critical roles of the buffer cache
    • the buffer cache can influence the request stream patterns seen by disks
  • If the buffer cache is disk-layout-aware, the OS is able to
    • distinguish sequentially and randomly accessed blocks
    • give “expensive” random blocks high caching priority in DRAM/SSD
    • replace long sequential runs of data blocks to disks in a timely way
    • make disk accesses more sequential.
Prefetching Efficiency is Performance Critical

(Figure: process and disk timelines — with synchronous requests the process sits idle while the disk works; prefetch requests overlap disk work with computation.)

  • Prefetching may incur non-sequential disk access
    • non-sequential accesses are much slower than sequential accesses
    • disk layout information must be introduced into prefetching policies.

It is increasingly difficult to hide disk accesses behind computation.

File-level Prefetching is Disk Layout Unaware

(Figure: blocks A-D of files X, Y, Z, and R, together with the metadata of files X, Y, and Z, interleaved on disk.)

  • Multiple files sequentially allocated on disks cannot be prefetched at once.
  • Metadata are allocated separately on disks and cannot be prefetched together.
  • Sequentiality at the file abstraction may not translate into sequentiality on the physical disk.
  • Deep access history information is usually not recorded.
slide47

Opportunities and Challenges

  • With Disk Spatial Locality (Disk-Seen)
    • Exploit DULO for fast disk accesses.
  • Challenges to build Disk-Seen System Infrastructure
    • Disk layout information is increasingly hidden in disks.
    • analyze and utilize disk-layout information
    • accurately and timely identify long disk sequences
    • consider trade-offs of temporal and spatial locality (buffer cache hit ratio vs. miss penalty: not necessarily following LRU)
    • manage its data structures with low overhead
    • Implement it in OS kernel for practical usage
slide48

Disk-Seen Task 1: Make Disk Layout Info Available

  • Which disk layout information to use?
    • Logical block number (LBN): location mapping provided by firmware. (each block is given a sequence number)
    • Accesses of contiguous LBNs have performance close to accesses of contiguous blocks on disk (except when bad blocks occur).
    • The LBN interface is highly portable across platforms.
  • How to efficiently manage the disk layout information?
    • LBN is only used to identify disk locations for read/write;
    • We want to track access times of disk blocks and search for access sequences via LBNs;
    • Disk block table: a data structure for efficient disk blocks tracking.
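A minimal sketch of what such a block table might look like — a two-level structure indexed by LBN that records recent access times; the layout and sizes are illustrative, not the DiskSeen implementation:

#include <stdint.h>
#include <stdlib.h>

/* Two-level table indexed by LBN; each leaf entry keeps the most recent
 * access timestamps of one disk block, so recently accessed blocks with
 * adjacent LBNs can later be grouped into sequences. */
#define LEAF_BITS  12
#define LEAF_SIZE  (1u << LEAF_BITS)
#define DIR_SIZE   (1u << 20)                      /* covers 2^32 blocks in total */

struct block_entry { uint32_t last_access[2]; };   /* two most recent timestamps */
struct block_table { struct block_entry *leaves[DIR_SIZE]; };

static void record_access(struct block_table *t, uint64_t lbn, uint32_t now)
{
    uint64_t dir = lbn >> LEAF_BITS;
    uint32_t idx = (uint32_t)(lbn & (LEAF_SIZE - 1));
    if (t->leaves[dir] == NULL) {
        t->leaves[dir] = calloc(LEAF_SIZE, sizeof(struct block_entry));
        if (t->leaves[dir] == NULL)
            return;                                /* out of memory: drop the record */
    }
    struct block_entry *e = &t->leaves[dir][idx];
    e->last_access[1] = e->last_access[0];         /* keep a short access history */
    e->last_access[0] = now;
}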
slide49

Disk-Seen Task 2: Exploiting Dual Localities (DULO)

(Figure: an LRU stack whose top is a staging section containing a correlation buffer and a sequencing bank, and whose bottom is an evicting section.)

  • Sequence forming
    • sequence: a number of blocks whose disk locations are adjacent and which have been accessed during a limited time period.
  • Sequence sorting, based on recency (temporal locality) and size (spatial locality)
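Sequence forming under that definition can be sketched as follows, assuming blocks carry the LBN and access time recorded in the block table above (the time window is an illustrative threshold):

#include <stdint.h>
#include <stdio.h>

struct blk { uint64_t lbn; uint32_t time; };

/* Group blocks (already ordered by LBN) into sequences: LBNs must be
 * adjacent and access times must fall within a small window. */
#define TIME_WINDOW 8            /* illustrative threshold */

static void form_sequences(struct blk *b, int n)
{
    int start = 0;
    for (int i = 1; i <= n; i++) {
        int brk = (i == n) ||
                  (b[i].lbn != b[i - 1].lbn + 1) ||           /* not adjacent on disk */
                  (b[i].time - b[start].time > TIME_WINDOW);  /* accessed too far apart */
        if (brk) {
            printf("sequence: LBN %llu..%llu (length %d)\n",
                   (unsigned long long)b[start].lbn,
                   (unsigned long long)b[i - 1].lbn, i - start);
            start = i;
        }
    }
}

int main(void)
{
    struct blk blocks[] = { {100,1},{101,2},{102,2},{200,3},{300,4},{301,4} };
    form_sequences(blocks, 6);   /* -> [100..102], [200], [300..301] */
    return 0;
}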

slide50

Disk-Seen Task 3: DULO-Caching

  • Adapted GreedyDual algorithm
    • a global inflation value L, and a value H for each sequence
  • Calculate H values for sequences in the sequencing bank:
    • H = L + 1 / Length(sequence)
    • random blocks have larger H values
  • When a sequence s is replaced, L = H value of s
    • L increases monotonically and makes future sequences have larger H values
  • Sequences with smaller H values are placed closer to the bottom of the LRU stack

(Figure: an LRU stack holding sequences with H = L0 + 0.25 and random blocks with H = L0 + 1; the global value inflates from L = L0 to L = L1 as sequences are replaced.)
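The adapted GreedyDual bookkeeping described above can be sketched as follows; the victim search and the notion of a sequence are simplified for illustration (this is not the kernel implementation):

#include <stdio.h>

/* Each cached sequence gets H = L + 1/length when it enters the sequencing
 * bank; on eviction, the global inflation value L is raised to the victim's H,
 * so newer sequences outrank older ones. */
struct seq { int length; double H; int valid; };

static double L = 0.0;                      /* global inflation value */

static void insert_sequence(struct seq *s, int length)
{
    s->length = length;
    s->H = L + 1.0 / length;                /* long sequences get small H */
    s->valid = 1;
}

/* Evict the sequence with the smallest H (closest to the stack bottom). */
static int evict_one(struct seq *seqs, int n)
{
    int victim = -1;
    for (int i = 0; i < n; i++)
        if (seqs[i].valid && (victim < 0 || seqs[i].H < seqs[victim].H))
            victim = i;
    if (victim >= 0) {
        L = seqs[victim].H;                 /* inflate the global value */
        seqs[victim].valid = 0;
    }
    return victim;
}

int main(void)
{
    struct seq cache[3] = { { 0, 0.0, 0 } };
    insert_sequence(&cache[0], 4);          /* long sequence: H = 0.25 */
    insert_sequence(&cache[1], 1);          /* random block:  H = 1.0  */
    evict_one(cache, 3);                    /* evicts the long sequence, L = 0.25 */
    insert_sequence(&cache[2], 4);          /* new sequence: H = 0.25 + 0.25 = 0.5 */
    printf("L = %.2f\n", L);
    return 0;
}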

slide51

DULO-Caching Principles

  • Moving long sequences to the bottom of stack
    • replace them early, get them back fast from disks
  • Replacement priority is set by sequence length.
  • Moving LRU sequences to the bottom of stack
    • exploiting temporal locality of data accesses
  • Keeping random blocks in upper level stack
    • hold them: expensive to get back from disks.
slide52

Disk-Seen Task 5: DULO-Prefetching

(Figure: blocks plotted by LBN and timestamp around the block that initiates prefetching; a spatial window and a temporal window bound which resident and non-resident blocks are considered.)

  • Prefetch size: the maximum number of blocks to be prefetched.
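The windowed selection implied by the figure can be sketched as follows, assuming the initiating block's LBN and timestamp plus the three parameters above; the policy shown is illustrative, not the exact DiskSeen prefetching policy:

#include <stdint.h>

struct pf_blk { uint64_t lbn; uint32_t time; int resident; };

/* Collect up to 'prefetch_size' non-resident blocks whose LBNs fall within
 * the spatial window and whose recorded access times fall within the
 * temporal window of the initiating block. */
static int select_prefetch(const struct pf_blk *tbl, int n,
                           uint64_t init_lbn, uint32_t init_time,
                           uint64_t spatial_win, uint32_t temporal_win,
                           int prefetch_size, uint64_t *out)
{
    int count = 0;
    for (int i = 0; i < n && count < prefetch_size; i++) {
        uint64_t d_lbn  = tbl[i].lbn  > init_lbn  ? tbl[i].lbn  - init_lbn
                                                  : init_lbn  - tbl[i].lbn;
        uint32_t d_time = tbl[i].time > init_time ? tbl[i].time - init_time
                                                  : init_time - tbl[i].time;
        if (!tbl[i].resident && d_lbn <= spatial_win && d_time <= temporal_win)
            out[count++] = tbl[i].lbn;      /* candidate to read ahead */
    }
    return count;
}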

What can DULO-Caching/-Prefetch do and not do?
  • Effective for
    • mixed sequential/random accesses (cache them differently)
    • many small files (package them in a prefetch)
    • many one-time sequential accesses (replace them quickly)
    • repeatable complex patterns that cannot be detected without disk info (remember them)
  • Not effective for
    • dominantly random or sequential accesses (performs equivalently to LRU)
    • a large file sequentially located on disk (file-level prefetch can handle it)
    • non-repeatable accesses (performs equivalently to file-level prefetch)
slide54

DiskSeen: a System Infrastructure to Support DULO-Caching and DULO-Prefetching

(Figure: the buffer cache is divided into a caching area, a prefetching area, and a destaging area above the disk. On-demand reads are placed at the stack top; blocks transfer between the areas; DULO-Prefetching adjusts windows/streams; DULO-Caching manages LRU blocks and long sequences.)

slide55

DULO Caching does not affect Execution Times of Pure Sequential or Random Workloads

(Figure: execution times of Diff (random accesses) and TPC-H Query #6 (sequential accesses) are essentially unchanged by DULO caching.)

slide56

DULO Caching Reduces Execution Times for Workloads with Mixed Patterns

(Figure: execution times of PostMark, which mixes sequential and random access patterns, are reduced by DULO caching.)

slide59

Conclusions (1)

  • Issues of multicores
    • resource conflicts in shared caches and the memory bus
    • the OS is not shared-cache-aware
  • Multicore- and cache-aware scheduling is essential
    • schedule jobs based on resource demand initially
    • reschedule jobs to optimize resource utilization
  • Building a hybrid OS resource management system
    • dynamically allocate cache space to each process
    • put cache information into the OS map
    • minimize OS and hardware overheads.
slide60

Conclusions (2)

  • Disk performance is limited because
    • the OS is unable to effectively exploit sequential locality.
  • The buffer cache is a critical component for storage.
    • existing OSes mainly exploit temporal locality.
  • Building a Disk-Seen system infrastructure for
    • DULO-Caching and DULO-Prefetching
  • Flash memory in the storage hierarchy
    • holding the random accesses
    • serving as an L2 cache for the DRAM buffer cache
    • caching for low power