
Presentation Transcript



Organizing the Last Line of Defense before hitting the Memory Wall for Chip-Multiprocessors (CMPs)

C. Liu, A. Sivasubramaniam, M. Kandemir

The Pennsylvania State University

anand@cse.psu.edu



Outline

  • CMPs and L2 organization

  • Shared Processor-based Split L2

  • Evaluation using SpecOMP/Specjbb

  • Summary of Results



Why CMPs?

  • Can exploit coarser granularity of parallelism

  • Better use of anticipated billion transistor designs

    • Multiple and simpler cores

  • Commercial and research prototypes

    • Sun MAJC

    • Piranha

    • IBM Power 4/5

    • Stanford Hydra

    • ….



Higher pressure on memory system

  • Multiple active threads => larger working set

  • Solution?

    • Bigger Cache.

    • Faster interconnect.

  • What if we have to go off-chip?

  • The cores need to share the limited pins.

  • The impact of off-chip accesses may be much worse than incurring a few extra cycles on-chip.

  • This calls for close scrutiny of on-chip caches.



On-chip Cache Hierarchy

  • Assume 2 levels

    • L1 (I/D) is private

    • What about L2?

  • L2 is the last line of defense before going off-chip, and is the focus of this paper.


Private (P) L2

[Diagram: each core has a private L1 (I$ and D$) and its own private L2; the L2s communicate over the interconnect under a coherence protocol and share the off-chip memory interface.]

Advantages:

  • Less interconnect traffic

  • Insulates L2 units

Disadvantages:

  • Duplication

  • Load imbalance


Shared-Interleaved (SI) L2

[Diagram: all cores' L1s (I$ and D$) access a single shared L2, interleaved across banks, over the interconnect; a coherence protocol keeps the L1s consistent.]

Advantages:

  • No duplication

  • Balances the load

Disadvantages:

  • Interconnect traffic

  • Interference between cores



Desirables

  • Approach the behavior of private L2s when sharing is not significant

  • Approach the behavior of private L2s when the load is balanced or when cores would interfere with each other

  • Approach the behavior of a shared L2 when there is significant sharing

  • Approach the behavior of a shared L2 when demands are uneven


Shared Processor-based Split L2

[Diagram: the L2 is divided into many small splits; a table and split-select logic directs each core's L1 (I$ and D$) requests, over the interconnect, to the splits allocated to that core.]

Processors/cores are allocated L2 splits



Lookup

  • Look up all splits allocated to the requesting core simultaneously.

  • If the block is not found, look up all other splits (at extra latency).

  • If it is found in another core's split, move the block over to one of the requesting core's splits (chosen randomly), removing it from the other split.

  • Otherwise, go off-chip and place the block in one of the requesting core's splits (chosen randomly).
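The split-lookup policy above can be summarized in a short sketch. The following Python is a minimal illustration, not the paper's implementation: the class and method names (SplitL2, lookup) and the latency constants are assumptions made for the example.

```python
import random

# Hypothetical latencies (cycles); illustrative placeholders only.
LOCAL_SPLIT_HIT = 10
REMOTE_SPLIT_HIT = 16
OFF_CHIP = 120

class SplitL2:
    """Sketch of a shared processor-based split L2 (names are assumptions)."""

    def __init__(self, split_table):
        # split_table: {core_id: [split_id, ...]} -- which splits each core owns
        self.split_table = split_table
        # contents: {split_id: set of block addresses currently cached there}
        self.contents = {s: set() for splits in split_table.values() for s in splits}

    def lookup(self, core, addr):
        my_splits = self.split_table[core]
        # 1. Search all splits allocated to the requesting core in parallel.
        if any(addr in self.contents[s] for s in my_splits):
            return LOCAL_SPLIT_HIT
        # 2. Not found locally: search every other split (extra latency).
        for s, blocks in self.contents.items():
            if s not in my_splits and addr in blocks:
                # 3. Hit in another core's split: migrate the block into one of
                #    the requester's splits, removing the old copy (one copy in L2).
                blocks.remove(addr)
                self.contents[random.choice(my_splits)].add(addr)
                return REMOTE_SPLIT_HIT
        # 4. Miss everywhere: fetch off-chip and place in a random local split.
        self.contents[random.choice(my_splits)].add(addr)
        return OFF_CHIP
```

The sketch ignores set-associativity, replacement within a split, and L1 coherence; it only illustrates the search order, the block migration on a remote hit, and the single-copy invariant.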



Note …

  • A core cannot place blocks that evict blocks useful to another core (as in the private case).

  • A core can look up (shared) blocks in other cores' splits at a slightly higher cost that is still far below an off-chip access (as in the shared case).

  • There is at most one copy of a block in the L2.


Shared Split Uniform (SSU)

[Diagram: the same split L2 organization, with the table and split select allocating an equal number of splits to every core.]


Shared Split Non-Uniform (SSN)

[Diagram: the same split L2 organization, but cores are allocated different numbers of splits according to their L2 demands.]


Split Table

[Diagram: the split table records, for each processor (P0-P3 in the example), which L2 splits it has been allocated.]



Evaluation

  • Using the Simics full-system simulator

  • Benchmarks: SpecOMP2000 + Specjbb

  • Reference dataset used

  • Several billion instructions were simulated.

  • A bus interconnect with the MESI coherence protocol was simulated.



Default configuration



Benchmarks (SpecOMP + Specjbb)



SSN Terminology

  • With a total L2 of 2MB (16 splits of 128K each) to be allocated to 8 cores, SSN-152 refers to:

    • 512K (4 splits) allocated to 1 CPU

    • 256K (2 splits) allocated to each of 5 CPUs

    • 128K (1 split) allocated to each of 2 CPUs

  • Determining how much to allocate to each CPU (and when) is postponed to future work.

  • Here, we use a profile-based approach driven by L2 demands.
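As an illustration of a profile-based allocation, the sketch below divides the 16 splits among 8 cores in proportion to their profiled L2 demand while guaranteeing each core at least one split. The function name, the apportionment scheme, and the demand numbers are hypothetical; the paper's actual policy may differ.

```python
def allocate_splits(demands, total_splits=16):
    """Give each core at least one split, then hand out the rest
    in proportion to its profiled L2 demand (e.g. miss counts)."""
    n = len(demands)
    alloc = {core: 1 for core in demands}            # minimum of 1 split each
    remaining = total_splits - n
    total_demand = sum(demands.values())
    # Largest-remainder style apportionment of the remaining splits.
    shares = {c: demands[c] * remaining / total_demand for c in demands}
    for c in demands:
        alloc[c] += int(shares[c])
    leftover = total_splits - sum(alloc.values())
    for c in sorted(demands, key=lambda c: shares[c] - int(shares[c]), reverse=True)[:leftover]:
        alloc[c] += 1
    return alloc

# Hypothetical per-core L2 demand profile for 8 cores (made-up numbers).
demands = {0: 40, 1: 15, 2: 15, 3: 15, 4: 15, 5: 15, 6: 5, 7: 5}
print(allocate_splits(demands))
# One CPU gets 4 splits, five get 2 each, two get 1 each: the SSN-152 configuration.
```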



Application behavior

  • Intra-application heterogeneity

    • Spatial (among CPUs): allocate non-uniform numbers of splits to different CPUs.

    • Temporal (for each CPU): change the number of splits allocated to a CPU at different points in time.

  • Inter-application heterogeneity

    • Different applications running at the same time can have different L2 demands.



Definition

  • SHF (Spatial Heterogeneity Factor)

  • THF (Temporal Heterogeneity Factor)



Spatial Heterogeneity Factor



Temporal Heterogeneity Factor



Results: SI



Results: SSU



Results: SSN



Summary of Results

  • When P does better than S (e.g. apsi), SSU/SSN does at least as well as P, if not better.

  • When S does better than P (e.g. swim, mgrid, specjbb), SSU/SSN does at least as well as S, if not better.

  • In nearly all cases (except applu), some configuration of SSU/SSN does the best.

  • On average, we get over 11% improvement in IPC over the best S/P configuration(s).



Inter-application Heterogeneity

  • Different applications have different L2 demands

  • These applications could even be running concurrently on different CPUs.



Inter-application results

  • ammp+apsi: low + high L2 demand

  • ammp+fma3d: both low

  • swim+apsi: both high, imbalanced + balanced

  • swim+mgrid: both high, imbalanced + imbalanced



Inter-application: ammp+apsi

  • SSN-152

  • 1.25MB is dynamically allocated to apsi, 0.75MB to ammp.

  • The graph shows the rough 5:3 allocation.

  • Better overall IPC value.

Low miss rate for apsi without affecting the miss rate of ammp.



Concluding Remarks

  • Shared Processor-based Split L2 is a flexible way of approaching the behavior of shared or private L2 (based on what is preferable)

  • It accommodates spatial and temporal heterogeneity in L2 demands both within an application and across applications.

  • Becomes even more important as off-chip accesses become more costly.



Future Work

  • How to configure the split sizes: statically, dynamically, or with a combination of the two?



Backup Slides



Meaning

  • Capture the heterogeneity, between CPUs (spatial) or over epochs (temporal), of the load imposed on the L2 structure.

  • Weighting by L1 accesses reflects the effect on the overall IPC.

    • If the overall accesses are low, there will not be a significant impact on IPC even if the standard deviation is high.
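The precise SHF/THF formulas appear on the earlier slides as figures and are not reproduced here. Purely as an illustration of the idea, the sketch below computes an access-weighted spread of L2 load across CPUs (or across epochs); the function name and the formulation are assumptions, not the paper's definitions.

```python
import statistics

def heterogeneity_factor(l2_loads, l1_accesses, peak_l1_accesses):
    """Illustrative only (assumed formulation, not the paper's exact SHF/THF):
    take the relative spread (coefficient of variation) of the L2 load across
    CPUs (spatial) or epochs (temporal), and scale it by how heavily the L1s
    were accessed, so a large spread matters little when overall traffic is low."""
    mean_load = statistics.mean(l2_loads)
    if mean_load == 0:
        return 0.0
    spread = statistics.pstdev(l2_loads) / mean_load
    weight = sum(l1_accesses) / peak_l1_accesses   # in [0, 1] if peak is an upper bound
    return spread * weight

# Made-up numbers: 4 CPUs, very uneven L2 load, moderately busy L1s.
print(heterogeneity_factor([900, 100, 120, 80], [5000, 4000, 4500, 4200],
                           peak_l1_accesses=40000))
```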



Results: P



Results: SI



Results: SSU


Results

In swim, mgrid, and specjbb, the high L1 miss rate means higher pressure on the L2, which results in significant IPC improvements (30.9% to 42.5%).

Results

Except for applu, the shared split L2 performs the best.



Why does private L2 do better in some cases?

  • L2 performance depends on:

    • The degree of sharing

    • The imbalance of the load imposed on the L2

  • For applu and swim+apsi:

    • Only 12% of the blocks are shared at any time, mainly between 2 CPUs.

    • Not much spatial/temporal heterogeneity.



Why do we use IPC instead of execution time?

  • We could not run any of the benchmarks to completion, since we are using the "reference" dataset.

  • Another possible indicator is the number of iterations of a certain loop (for example, the dominating loop) executed per unit of time.

  • We did this and found a direct correlation between the IPC value and the number of iterations.



Results



Closer look: specjbb

  • SSU is over 31% better than the private L2.

  • Direct correlation between the L2 misses and the IPC values.

  • P never exceeds 2.5, while SSU sometimes pushes over 3.0.



Sensitivity: Larger L2

  • 2MB -> 4MB -> 8MB

    • Miss rates go down, and the differences arising from miss rates diminish; 'swim' still gets considerable savings.

    • If application working sets keep growing, the shared split L2 will still help.

    • More L2 splits -> finer granularity -> could help SSN.



Sensitivity: Longer memory access

120 cycles -> 240 cycles: benefits are amplified.

