Gaining Insights into Multi-Core Cache Partitioning: Bridging the Gap between Simulation and Real Systems

Jiang Lin1, Qingda Lu2, Xiaoning Ding2,

Zhao Zhang1, Xiaodong Zhang2, and P. Sadayappan2

1 Department of ECE

Iowa State University

2 Department of CSE

The Ohio State University

Shared Caches Can be a Critical Bottleneck in Multi-Core Processors
  • L2/L3 caches are shared by multiple cores
    • Intel Xeon 51xx (2core/L2)
    • AMD Barcelona (4core/L3)
    • Sun T2, ... (8core/L2)
  • Effective cache partitioning is critical to mitigating the bottleneck caused by conflicting accesses to the shared cache.
  • Several hardware cache partitioning methods have been proposed with different optimization objectives
    • Performance: [HPCA’02], [HPCA’04], [Micro’06]
    • Fairness: [PACT’04], [ICS’07], [SIGMETRICS’07]
    • QoS: [ICS’04], [ISCA’07]

[Figure: multiple cores sharing a single L2/L3 cache]

Limitations of Simulation-Based Studies
  • Excessive simulation time
    • Whole programs cannot be evaluated: simulating even a single SPEC CPU2006 benchmark to completion would take weeks or months
    • As the number of cores continues to increase, simulation becomes even more limited
  • Absence of long-term OS activities
    • Interactions between the processor and the OS affect performance significantly
  • Proneness to simulation inaccuracy
    • Bugs in the simulator
    • Many dynamics and details of a real system are impossible to model


Our Approach to Address the Issues

Design and implement OS-based Cache Partitioning

  • Embedding a cache partitioning mechanism in the OS
    • By enhancing the page coloring technique
    • To support both static and dynamic cache partitioning
  • Evaluate cache partitioning policies on commodity processors
    • Execution- and measurement-based
    • Run applications to completion
    • Measure performance with hardware counters


Four Questions to Answer
  • Can we confirm the conclusions made by simulation-based studies?
  • Can we provide new insights and findings that simulation cannot?
  • Can we make a case for our OS-based approach as an effective option for evaluating multi-core cache partitioning designs?
  • What are the advantages and disadvantages of OS-based cache partitioning?


Outline
  • Introduction
  • Design and implementation of OS-based cache partitioning mechanisms
  • Evaluation environment and workload construction
  • Cache partitioning policies and their results
  • Conclusion


OS-Based Cache Partitioning Mechanisms
  • Static cache partitioning
    • Predetermines the amount of cache blocks allocated to each program at the beginning of its execution
    • Page coloring enhancement
    • Divides the shared cache into multiple regions and partitions them through OS page address mapping
  • Dynamic cache partitioning
    • Adjusts cache quotas among processes dynamically
    • Page re-coloring
    • Dynamically changes a process's cache usage through OS page address re-mapping


Page Coloring
  • Physically indexed caches are divided into multiple regions (colors).
  • All cache lines in a physical page are cached in one of those regions (colors).

[Figure: a virtual address (virtual page number + page offset) is translated by the OS into a physical address; the bits shared by the physical page number and the cache set index form the page color bits, which select one cache region (color) in the physically indexed cache]

The OS can control the page color of a virtual page through address mapping (by selecting a physical page with a specific value in its page color bits).
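As a sketch, the color of a physical page follows from the cache geometry. The parameters below mirror the Xeon 5160's 4MB, 16-way shared L2 used later in the deck; the 64-byte line size and the `page_color` helper are assumptions for illustration (the deck's 16 colors would correspond to grouping adjacent hardware colors):

```python
# Hypothetical cache geometry modeled on the Xeon 5160's shared L2
# (the 64-byte line size is an assumption).
CACHE_SIZE = 4 * 2**20    # 4MB shared L2
ASSOC      = 16           # 16-way set-associative
LINE_SIZE  = 64           # assumed cache line size in bytes
PAGE_SIZE  = 4096         # 4KB pages
PAGE_SHIFT = 12           # log2(PAGE_SIZE)

NUM_SETS      = CACHE_SIZE // (ASSOC * LINE_SIZE)  # 4096 sets
SETS_PER_PAGE = PAGE_SIZE // LINE_SIZE             # one page spans 64 consecutive sets
NUM_COLORS    = NUM_SETS // SETS_PER_PAGE          # 64 hardware colors

def page_color(phys_addr: int) -> int:
    """Color = the set-index bits that lie above the page offset."""
    return (phys_addr >> PAGE_SHIFT) % NUM_COLORS
```

All pages with the same color map to the same slice of cache sets, which is what lets the OS steer a process's cache footprint purely through page allocation.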

Enhancement for Static Cache Partitioning

Physical pages are grouped into page bins according to their page color.

[Figure: OS address mapping from the page bins of Process 1 and Process 2 into disjoint regions of the physically indexed cache]

The shared cache is partitioned between two processes through address mapping.

Cost: Main memory space needs to be partitioned too (co-partitioning).
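A minimal sketch of the bin-based static partitioning described above, assuming two processes and 16 colors; the `ColorAllocator` class and its free-list layout are hypothetical stand-ins for the kernel's per-color free lists:

```python
from collections import deque

class ColorAllocator:
    """Sketch: free physical pages grouped into per-color bins; each
    process may only draw pages whose colors fall in its static quota."""
    def __init__(self, free_pages_by_color):
        self.bins = {c: deque(pages) for c, pages in free_pages_by_color.items()}
        self.quota = {}   # pid -> list of colors this process may use

    def set_partition(self, pid, colors):
        self.quota[pid] = list(colors)

    def alloc_page(self, pid):
        colors = self.quota[pid]
        # Prefer the color with the most free pages to spread allocations.
        color = max(colors, key=lambda c: len(self.bins[c]))
        if not self.bins[color]:
            raise MemoryError("no free pages in this process's colors")
        return color, self.bins[color].popleft()
```

Because pages of a given color only exist in that color's share of physical memory, restricting a process to a subset of colors also restricts its share of main memory, which is the co-partitioning cost noted above.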

Dynamic Cache Partitioning
  • Why?
    • Programs have dynamic behavior
    • Most proposed schemes are dynamic
  • How?
    • Page re-coloring
  • How to handle the overhead?
    • Measure the overhead with performance counters
    • Subtract it from the results (to emulate hardware schemes, which have no migration cost)


Dynamic Cache Partitioning through Page Re-Coloring
  • Page re-coloring:
    • Allocate page in new color
    • Copy memory contents
    • Free old page

  • Pages of a process are organized into linked lists by their colors.
  • Memory allocation guarantees that pages are evenly distributed into all the lists (colors) to avoid hot spots.

[Figure: the per-process page-links table, one list per color (0 to N−1); re-coloring moves a page between allocated colors]
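The three re-coloring steps can be sketched as follows; the `free_bins`/`mapping` dictionaries are hypothetical stand-ins for the kernel's free lists and page tables:

```python
from collections import defaultdict

PAGE_SIZE = 4096
free_bins = defaultdict(list)   # color -> list of free frames (bytearrays)
mapping = {}                    # virtual page number -> (color, frame)

def recolor(vpn, new_color):
    """Re-color one virtual page: allocate a frame in the new color,
    copy the contents, free the old frame, and update the mapping."""
    old_color, old_frame = mapping[vpn]
    if not free_bins[new_color]:
        raise MemoryError("no free frame of the requested color")
    new_frame = free_bins[new_color].pop()   # 1. allocate page in new color
    new_frame[:] = old_frame                 # 2. copy memory contents
    free_bins[old_color].append(old_frame)   # 3. free the old page
    mapping[vpn] = (new_color, new_frame)
```

The copy in step 2 is the dominant cost, which is what the migration-overhead controls on the next slides are designed to limit.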

Control the Page Migration Overhead
  • Control the frequency of page migration
    • Frequent enough to capture application phase changes
    • Infrequent enough to avoid large page migration overhead
  • Lazy migration: avoid unnecessary page migration
    • Observation: not all pages are accessed between two consecutive migrations.
    • Optimization: do not migrate a page until it is actually accessed


Lazy Page Migration
  • After the optimization
    • On average, page migration overhead is 2%
    • At most, it is 7%

[Figure: process page links across colors 0 to N−1; pages that are never accessed between re-partitionings are never migrated, avoiding unnecessary page migration]
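Lazy migration can be sketched by tagging pages as stale on a re-partition and migrating only on first touch; the fault-handler-style `access` function and the `mapping` structure are hypothetical illustrations:

```python
mapping = {}   # vpn -> {"color": int, "stale_target": int | None}

def repartition(vpns, new_color):
    """On a re-partition, only mark the pages; copy nothing yet."""
    for vpn in vpns:
        mapping[vpn]["stale_target"] = new_color

def access(vpn):
    """Modeled page access: migrate lazily on the first touch after a
    re-partition; pages that are never touched are never migrated."""
    entry = mapping[vpn]
    if entry["stale_target"] is not None:
        entry["color"] = entry["stale_target"]   # migrate now (alloc + copy + free)
        entry["stale_target"] = None
    return entry["color"]
```

In a kernel, "marking" would unmap the page so the first access faults into the migration path; this is what keeps the measured overhead at 2% on average.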

Outline
  • Introduction
  • Design and implementation of OS-based cache partitioning mechanisms
  • Evaluation environment and workload construction
  • Cache partitioning policies and their results
  • Conclusion


Experimental Environment
  • Dell PowerEdge1950
    • Two-way SMP, Intel dual-core Xeon 5160
    • Shared 4MB L2 cache, 16-way
    • 8GB Fully Buffered DIMM
  • Red Hat Enterprise Linux 4.0
    • 2.6.20.3 kernel
    • Performance counter tools from HP (Pfmon)
    • Divide L2 cache into 16 colors


Benchmark Classification

6

9

6

8

29 benchmarks from SPEC CPU2006

  • Is it sensitive to L2 cache capacity?
    • Red group: IPC(1MB L2) / IPC(4MB L2) < 80%
      • Giving red benchmarks more cache yields a big performance gain
    • Yellow group: 80% < IPC(1MB L2) / IPC(4MB L2) < 95%
      • Giving yellow benchmarks more cache yields a moderate performance gain
  • Otherwise: does it access the L2 cache extensively?
    • Green group: >= 14 accesses per 1K cycles
      • Give it a small cache
    • Black group: < 14 accesses per 1K cycles
      • Cache-insensitive

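The classification above amounts to a small decision procedure; a sketch, where the function name and argument units are assumptions:

```python
def classify(ipc_1mb, ipc_4mb, l2_accesses_per_kcycle):
    """Hypothetical classifier following the slide's rules: the first two
    arguments are IPCs measured with a 1MB and a 4MB L2 cache, the third
    is the L2 access rate per 1K cycles."""
    ratio = ipc_1mb / ipc_4mb
    if ratio < 0.80:
        return "red"      # big gain from more cache
    if ratio < 0.95:
        return "yellow"   # moderate gain from more cache
    if l2_accesses_per_kcycle >= 14:
        return "green"    # accesses L2 heavily but is capacity-insensitive
    return "black"        # cache-insensitive
```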

Workload Construction

From the 6 red, 9 yellow, and 6 green benchmarks, 2-core pairs were formed: RR (3 pairs), RY (6 pairs), YY (3 pairs), RG (6 pairs), YG (6 pairs), and GG (3 pairs).

27 workloads: representative benchmark combinations


Outline
  • Introduction
  • OS-based cache partitioning mechanism
  • Evaluation environment and workload construction
  • Cache partitioning policies and their results
    • Performance
    • Fairness
  • Conclusion


Performance – Metrics
  • Metrics are divided into evaluation metrics and policy metrics [PACT’06]
    • Evaluation metrics:
      • The optimization objectives; not always available at run time
    • Policy metrics:
      • Used to drive dynamic partitioning policies; available at run time
      • Sum of IPCs, combined cache miss rate, combined cache misses


Static Partitioning
  • Total number of cache colors: 16
  • Give at least two colors to each program
    • Ensures that each program gets at least 1GB of memory to avoid swapping (a consequence of co-partitioning)
  • Try all possible partitionings for all workloads
    • (2:14), (3:13), (4:12), …, (8:8), …, (13:3), (14:2)
    • Record the value of each evaluation metric
    • Compare the performance of every partitioning with that of the shared (unpartitioned) cache

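The search space above is small enough to enumerate exhaustively; a sketch, with constant names assumed for illustration:

```python
TOTAL_COLORS = 16
MIN_COLORS = 2   # each program gets at least two colors

# All static partitionings (p0:p1) with p0 + p1 == 16 and p0, p1 >= 2.
partitionings = [(p0, TOTAL_COLORS - p0)
                 for p0 in range(MIN_COLORS, TOTAL_COLORS - MIN_COLORS + 1)]
```

This yields the 13 partitionings from (2:14) to (14:2), each of which is run to completion on the real machine.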

Performance – Optimal Static Partitioning
  • Confirms that cache partitioning has a significant performance impact
  • Different evaluation metrics yield different performance gains
  • RG-type workloads have the largest performance gains (up to 47%)
  • Other workload types also gain (2% to 10%)


A New Finding
  • Workload RG1: 401.bzip2 (Red) + 410.bwaves (Green)
  • Intuitively, giving more cache space to 401.bzip2 (Red)
    • Largely increases the performance of 401.bzip2 (Red)
    • Slightly decreases the performance of 410.bwaves (Green)
  • However, we observe that the performance of both programs increases


Insight into Our Finding
  • We have the same observation in RG4, RG5, and YG5
  • This is not observed in simulation, which
    • Did not model the main-memory subsystem in detail
    • Assumed a fixed memory access latency
  • This shows the advantage of our execution- and measurement-based study
Performance - Dynamic Partition Policy

  • A simple greedy policy
  • Emulates the policy of [HPCA’02]

The policy loop:

  1. Initialize the partition to (8:8).
  2. Run the current partition (P0:P1) for one epoch; if the workload has finished, exit.
  3. Try one epoch for each of the two neighboring partitions, (P0−1:P1+1) and (P0+1:P1−1).
  4. Choose the partitioning with the best policy-metric measurement as the next partition, and go back to step 2.

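The greedy loop can be sketched as follows; `measure(p0)` abstracts running one epoch under partition (p0 : total − p0) and returning the policy metric (lower is better, e.g. combined miss rate), and all names are assumptions:

```python
def greedy_partition(measure, total=16, min_colors=2, epochs=8):
    """Sketch of the greedy policy emulating [HPCA'02]: repeatedly
    compare the current partition with its two neighbors and keep
    whichever measures best."""
    p0 = total // 2                       # start from the even split (8:8)
    for _ in range(epochs):
        candidates = [p0]                 # current partition
        if p0 - 1 >= min_colors:
            candidates.append(p0 - 1)     # neighbor (p0-1 : p1+1)
        if p0 + 1 <= total - min_colors:
            candidates.append(p0 + 1)     # neighbor (p0+1 : p1-1)
        p0 = min(candidates, key=measure) # keep the best-measuring partition
    return p0, total - p0
```

Since each step moves at most one color, the policy hill-climbs toward a local optimum of the policy metric, one epoch per trial.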

Performance – Static & Dynamic
  • Use combined miss rate as the policy metric
  • For RG-type and some RY-type workloads:
    • Static partitioning outperforms dynamic partitioning
  • For RR-type and the remaining RY-type workloads:
    • Dynamic partitioning outperforms static partitioning


Fairness – Metrics and Policy [PACT’04]
  • Metrics
    • Evaluation metric FM0
      • The difference in slowdown between co-scheduled programs; smaller is better
    • Policy metrics FM1 to FM5, available at run time
  • Policy
    • Repartitioning and rollback

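FM0 as described on the slide can be sketched as follows; per [PACT’04], slowdown is execution time when sharing divided by execution time when running alone, and using the max−min spread of slowdowns as "difference in slowdown" is an assumption about the exact formula:

```python
def fm0(times_shared, times_alone):
    """Hypothetical sketch of the FM0 fairness metric: the spread of
    per-program slowdowns across co-scheduled programs (smaller is
    fairer; 0 means all programs slow down equally)."""
    slowdowns = [s / a for s, a in zip(times_shared, times_alone)]
    return max(slowdowns) - min(slowdowns)
```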

Fairness - Result
  • Dynamic partitioning can achieve better fairness
    • If we use FM0 as both the evaluation metric and the policy metric
  • None of the policy metrics (FM1 to FM5) is good enough to drive the partitioning policy to fairness comparable with static partitioning
  • A strong correlation was reported in the simulation-based study [PACT’04]
  • In our study, none of the policy metrics has a consistently strong correlation with FM0; differences from [PACT’04]:
    • SPEC CPU2006 (ref input) vs. SPEC CPU2000 (test input)
      • Trillions of instructions completed vs. less than one billion
    • 4MB L2 cache vs. 512KB L2 cache


Conclusion
  • Confirmed some conclusions made by simulation studies
  • Provided new insights and findings
    • Moving cache space from one program to another can increase the performance of both
    • Poor correlation between evaluation and policy metrics for fairness
  • Made a case for our OS-based approach as an effective option for evaluating multi-core cache partitioning
  • Advantages of OS-based cache partitioning
    • Works on commodity processors, enabling an execution- and measurement-based study
  • Disadvantages of OS-based cache partitioning
    • Co-partitioning (may underutilize memory) and migration overhead


Ongoing Work
  • Reduce migration overhead on commodity processors
  • Cache partitioning at the compiler level
    • Partition cache at object level
  • Hybrid cache partitioning method
    • Remove the cost of co-partitioning
    • Avoid page migration overhead


Jiang Lin1, Qingda Lu2, Xiaoning Ding2,

Zhao Zhang1, Xiaodong Zhang2, and P. Sadayappan2

Gaining Insights into Multi-Core Cache Partitioning: Bridging the Gap between Simulation and Real Systems

Thanks!

1 Iowa State University

2 The Ohio State University

Fairness - Correlation between Evaluation Metrics and Policy Metrics (Reported by [PACT’04])

Strong correlation was reported in simulation study – [PACT’04]


Fairness - Correlation between Evaluation Metrics and Policy Metrics (Our result)
  • None of the policy metrics has a consistently strong correlation with FM0; differences from [PACT’04]:
    • SPEC CPU2006 (ref input) vs. SPEC CPU2000 (test input)
    • Trillions of instructions completed vs. less than one billion
    • 4MB L2 cache vs. 512KB L2 cache
