
Organizing the Last Line of Defense before hitting the Memory Wall for Chip-Multiprocessors (CMPs)

C. Liu, A. Sivasubramaniam, M. Kandemir

The Pennsylvania State University

[email protected]



Outline

  • CMPs and L2 organization

  • Shared Processor-based Split L2

  • Evaluation using SpecOMP/Specjbb

  • Summary of Results



Why CMPs?

  • Can exploit coarser granularity of parallelism

  • Better use of anticipated billion-transistor designs

    • Multiple and simpler cores

  • Commercial and research prototypes

    • Sun MAJC

    • Piranha

    • IBM Power 4/5

    • Stanford Hydra

    • ….



Higher pressure on memory system

  • Multiple active threads => larger working set

  • Solution?

    • Bigger caches.

    • Faster interconnects.

  • What if we have to go off-chip?

  • The cores need to share the limited pins.

  • The impact of off-chip accesses may be much worse than incurring a few extra cycles on-chip.

  • This calls for close scrutiny of on-chip caches.



On-chip Cache Hierarchy

  • Assume 2 levels

    • L1 (I/D) is private

    • What about L2?

  • L2 is the last line of defense before going off-chip, and is the focus of this paper.


Private (P) L2

[Figure: each core has private L1 I/D caches backed by its own private L2 unit; the L2 units sit on the interconnect, kept consistent by a coherence protocol, with off-chip memory beyond.]

Advantages:

  • Less interconnect traffic

  • Insulates L2 units

Disadvantages:

  • Duplication

  • Load imbalance


Shared-Interleaved (SI) L2

[Figure: all cores' private L1 I/D caches, kept consistent by a coherence protocol, share a single interleaved L2 across the interconnect.]

Advantages:

  • No duplication

  • Balances the load

Disadvantages:

  • Interconnect traffic

  • Interference between cores



Desirables

  • Approach the behavior of private L2s when sharing is not significant

  • Approach the behavior of private L2s when the load is balanced or when cores would otherwise interfere

  • Approach the behavior of shared L2 when there is significant sharing

  • Approach the behavior of shared L2 when demands are uneven


Shared Processor-based Split L2

[Figure: the L2 is divided into many small splits on the interconnect; a table and split-select logic steer each core's L1 (I$/D$) requests to the splits allocated to that core.]

Processors/cores are allocated L2 splits.



Lookup

  • Look up all splits allocated to the requesting core simultaneously.

  • If not found, look up all the other splits (at extra latency).

  • If found there, move the block over to one of the requester's splits (chosen randomly), removing it from the other split.

  • Else, go off-chip and place the block in one of the requester's splits (chosen randomly). (A sketch of this flow follows.)
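A minimal sketch of this lookup flow in Python, assuming a software stand-in for the hardware: the SplitL2 class, its split_table argument, and the set-of-blocks representation are all hypothetical, and eviction/capacity are ignored.

    import random

    class SplitL2:
        """Toy model of the split-L2 lookup policy described above."""

        def __init__(self, split_table):
            # split_table: core id -> list of split ids owned by that core
            self.split_table = split_table
            all_ids = {s for ids in split_table.values() for s in ids}
            self.splits = {s: set() for s in all_ids}   # split id -> blocks

        def access(self, core, block):
            own = self.split_table[core]
            # 1. Probe all of the requester's splits simultaneously.
            if any(block in self.splits[s] for s in own):
                return "hit (own split)"
            # 2. Miss there: probe all other splits, at extra on-chip latency.
            for s, blocks in self.splits.items():
                if s not in own and block in blocks:
                    # 3. Migrate the block to a randomly chosen split of the
                    #    requester, removing the old copy, so at most one
                    #    copy of a block ever lives in the L2.
                    blocks.remove(block)
                    self.splits[random.choice(own)].add(block)
                    return "hit (other split, migrated)"
            # 4. Otherwise go off-chip; place the block in a random owned split.
            self.splits[random.choice(own)].add(block)
            return "miss (off-chip)"

For example, with l2 = SplitL2({0: [0, 1], 1: [2, 3]}), a first l2.access(0, 0x40) goes off-chip, a repeat hits in one of core 0's splits, and a later l2.access(1, 0x40) migrates the block into core 1's splits.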



Note …

  • As in the private case, a core cannot place blocks so as to evict blocks useful to another core.

  • As in the shared case, a core can look at the (shared) blocks of other cores, at a slightly higher cost that stays well below that of an off-chip access.

  • There is at most one copy of a block in the L2.


Shared Split Uniform (SSU)

[Figure: the table and split-select logic allocate the same number of L2 splits to every core.]


Shared Split Non-Uniform (SSN)

[Figure: the table and split-select logic allocate different numbers of L2 splits to different cores.]


Split Table

[Figure: the split table maps each processor (P0-P3) to the set of L2 splits it owns.]
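One plausible software model of such a table, as a hedged sketch: the per-core bit-vector encoding and the owned_splits helper are assumptions for illustration, not the paper's hardware design.

    NUM_SPLITS = 16   # e.g., a 2MB L2 as 16 splits of 128K each

    # Hypothetical split table: bit i set => split i belongs to that core.
    split_table = {
        0: 0b0000000000001111,   # P0 owns splits 0-3
        1: 0b0000000011110000,   # P1 owns splits 4-7
        2: 0b0000111100000000,   # P2 owns splits 8-11
        3: 0b1111000000000000,   # P3 owns splits 12-15
    }

    def owned_splits(core):
        """Decode a core's bit vector into the list of split IDs it owns."""
        vec = split_table[core]
        return [i for i in range(NUM_SPLITS) if (vec >> i) & 1]

Under this encoding, changing a core's allocation amounts to rewriting its bit vector, which is what lets the organization give different cores different numbers of splits (SSN) and change allocations over time.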



Evaluation

  • Using the Simics full-system simulator

  • Benchmarks: SpecOMP2000 + Specjbb

  • Reference dataset used

  • Several billion instructions were simulated.

  • A bus interconnect with MESI coherence was simulated.



Default configuration



Benchmarks (SpecOMP + Specjbb)



SSN Terminology

  • With a total L2 of 2MB (16 splits of 128K each) to be allocated to 8 cores, SSN-152 refers to (see the sketch after this list):

    • 512K (4 splits) allocated to 1 CPU

    • 256K (2 splits) allocated to each of 5 CPUs

    • 128K (1 split) allocated to each of 2 CPUs

  • Determining how much to allocate to each CPU (and when) is postponed to future work.

  • Here, we use a profile-driven approach based on L2 demands.
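A small sketch of the arithmetic behind these labels. The decode_ssn helper is hypothetical, and the digit meaning (counts of CPUs receiving 4, 2, and 1 splits, respectively) is assumed from the SSN-152 example above.

    SPLIT_KB = 128        # one split
    TOTAL_SPLITS = 16     # 2MB total L2
    NUM_CPUS = 8

    def decode_ssn(name):
        """Decode an SSN-xyz label: x CPUs get 4 splits each, y CPUs get
        2 splits each, z CPUs get 1 split each (assumed convention)."""
        x, y, z = (int(d) for d in name.split("-")[1])
        alloc = [4] * x + [2] * y + [1] * z
        assert len(alloc) == NUM_CPUS and sum(alloc) == TOTAL_SPLITS
        return [n * SPLIT_KB for n in alloc]   # per-CPU capacity in KB

    print(decode_ssn("SSN-152"))
    # [512, 256, 256, 256, 256, 256, 128, 128]
    # i.e., 512K for 1 CPU, 256K for each of 5 CPUs, 128K for each of 2 CPUs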



Application behavior

  • Intra-application heterogeneity

    • Spatial (among CPUs): allocate non-uniform splits to different CPUs.

    • Temporal (for each CPU): change the number of splits allocated to a CPU at different points in time.

  • Inter-application heterogeneity

    • Different applications running at the same time can have different L2 demands.



Definitions

  • SHF (Spatial Heterogeneity Factor)

  • THF (Temporal Heterogeneity Factor)



Spatial Heterogeneity Factor
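A hedged sketch of one plausible form, assuming (per the backup "Meaning" slide) that SHF is an L1-access-weighted standard deviation of the per-CPU L2 load; the symbols below are assumptions, not the paper's notation.

    % Hedged sketch: \ell_i = L2 load imposed by CPU i, \bar{\ell} = mean
    % load across the N CPUs, A = weight derived from overall L1 accesses.
    \mathrm{SHF} \;=\; A \cdot
      \frac{\sqrt{\frac{1}{N}\sum_{i=1}^{N}\bigl(\ell_i - \bar{\ell}\bigr)^{2}}}{\bar{\ell}}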



Temporal Heterogeneity Factor
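Analogously, again as a hedged sketch with assumed symbols: the deviation is taken over epochs for a given CPU rather than across CPUs.

    % Hedged sketch: \ell_t = L2 load a CPU imposes in epoch t,
    % \bar{\ell} = mean over the T epochs, A = L1-access weight.
    \mathrm{THF} \;=\; A \cdot
      \frac{\sqrt{\frac{1}{T}\sum_{t=1}^{T}\bigl(\ell_t - \bar{\ell}\bigr)^{2}}}{\bar{\ell}}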



Results: SI



Results: SSU



Results: SSN



Summary of Results

  • When P does better than S (e.g., apsi), SSU/SSN does at least as well as P.

  • When S does better than P (e.g., swim, mgrid, specjbb), SSU/SSN does at least as well as S.

  • In nearly all cases (except applu), some configuration of SSU/SSN does the best.

  • On average, we get over 11% improvement in IPC over the best S/P configuration(s).


Inter-application Heterogeneity

  • Different applications have different L2 demands

  • These applications could even be running concurrently on different CPUs.



Inter-application results

  • ammp+apsi: low + high demand

  • ammp+fma3d: both low

  • swim+apsi: both high, imbalanced + balanced

  • swim+mgrid: both high, imbalanced + imbalanced



Inter-application: ammp+apsi

  • SSN-152

  • 1.25MB dynamically allocated to apsi, 0.75MB to ammp.

  • The graph shows the rough 5:3 allocation (10 splits of 128K vs. 6).

  • Better overall IPC value.

  • Low miss rate for apsi, without affecting the miss rate of ammp.



Concluding Remarks

  • Shared Processor-based Split L2 is a flexible way of approaching the behavior of shared or private L2, based on which is preferable.

  • It accommodates spatial and temporal heterogeneity in L2 demands, both within an application and across applications.

  • It becomes even more important as off-chip access costs grow.



Future Work

  • How to configure the split sizes: statically, dynamically, or a combination of the two?



Backup Slides



Meaning

  • Capture the heterogeneity, between CPUs (spatial) or over epochs (temporal), of the load imposed on the L2 structure.

  • Weighting by L1 accesses reflects the effect on the overall IPC.

    • If the overall accesses are low, there will not be a significant impact on the IPC even if the standard deviation is high.



Results: P



Results: SI



Results: SSU


Results

  • In swim, mgrid, and specjbb, the high L1 miss rate means higher pressure on the L2, which results in significant IPC improvements (30.9% to 42.5%).

  • Except for applu, the shared split L2 performs the best.



Why does private L2 do better in some cases?

  • L2 performance depends on:

    • The degree of sharing

    • The imbalance of the load imposed on the L2

  • For applu and swim+apsi:

    • Only 12% of the blocks are shared at any time, mainly between 2 CPUs.

    • There is not much spatial/temporal heterogeneity.



Why do we use IPC instead of execution time?

  • We could not run any of the benchmarks to completion, since we use the "reference" dataset.

  • Another possible indicator is the number of iterations of a certain loop (for example, the dominating loop) executed per unit time.

  • We did this and found a direct correlation between the IPC values and the number of iterations.



Results



Closer look: specjbb

  • SSU is over 31% better than the private L2.

  • There is a direct correlation between the L2 misses and the IPC values.

  • P's IPC never exceeds 2.5, while SSU sometimes pushes over 3.0.



Sensitivity: Larger L2

  • 2MB -> 4MB -> 8MB

    • Miss rates go down, and the differences arising from miss rates diminish; swim still gets considerable savings.

    • If application working sets keep growing, the shared split L2 will still help.

    • More L2 splits -> finer granularity -> could help SSN.



Sensitivity: Longer memory access

120 cycles -> 240 cycles: the benefits are amplified.

