XSTREAM: Cross-core Spatial Streaming based MLC Prefetchers for Parallel Applications in CMPs

Biswabandan Panda, Shankar Balachandran



XSTREAM: Cross-core Spatial Streaming based MLC Prefetchers for Parallel Applications in CMPs. Biswabandan Panda, Shankar Balachandran. Indian Institute of Technology Madras, India. PACT 2014.


Presentation Transcript



XSTREAM: Cross-core Spatial Streaming based MLC Prefetchers for Parallel Applications in CMPs

Biswabandan Panda, Shankar Balachandran

{biswa,[email protected]

Indian Institute of Technology Madras, India

PACT 2014



Quick Summary

  • Problem: Per-Core Spatial Streaming based MLC prefetchers

    • Oblivious to the cross-core (XCORE) streams.

  • Opportunity:

    • Spatial streams spread across the cores.

  • Our Contribution: XSTREAM (XCORE Stream Prefetching)

    • Communication of the streams from one prefetcher to another Just In Time.

  • Improves Performance with negligible storage:

    • 11.3% (9%) speedup in 4-core (8-core) CMPs with an additional storage of 50KB.



Background on Streams

Stream – A sequence of cache-line-aligned memory addresses.

Temporal Streams – Sequences of temporally correlated addresses, exploited by TMS. [Wenisch, ISCA ‘05].

Spatial Streams – Streams that are correlated in space, exploited by SMS [Somogyi, ISCA ‘06].

SpatioTemporal Streams – Temporal correlation among the spatial regions, and spatial correlation within a region, exploited by STeMS [Somogyi, ISCA ‘09].



Streams With Examples

[Figure: the access sequence A, A + 2, B, C, A + 3, A + 7, shown once for spatial streams and once for temporal streams.]

Spatial streams – Scans over fixed data layout.

Temporal streams – Pointer Chasing Codes.

Focus of This Talk – Spatial Streaming.


SMS 101: Per-Core Training

  • Divides the memory space into fixed-size regions, indexed by a signature (PC/offset).

  • Each signature contains a bit vector.

  • Each bit in the bit vector corresponds to a cache line.

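[Figure: SMS training structures. The Active Generation Table (AGT) comprises a Filter Table (FT; fields Tag, PC/Offset) and an Accumulation Table (AT; fields Tag, PC/Offset, Bit Vector). The Pattern History Table (PHT) maps a signature (Sig) to a Bit Vector. Example for region A with signature PC/1: (1) a miss to A+1 starts the generation; (2) misses to A+3 and A+2 grow the bit vector from 0101 to 0111; (3) the eviction/invalidation of A moves PC/1 → 0111 into the PHT.]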
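The training flow on this slide can be sketched in a few lines of Python. This is a simplified model, not the authors' hardware: the 4-line region size is taken from the 4-bit vectors on the slide, the filter and accumulation tables are merged into one dictionary, and all class and method names are hypothetical.

```python
LINES_PER_REGION = 4  # assumption: matches the 4-bit vectors shown on the slide


class SmsTrainer:
    """Toy model of per-core SMS training (AGT + PHT); names are illustrative."""

    def __init__(self):
        self.agt = {}  # region tag -> (signature, bit vector as a list of 0/1)
        self.pht = {}  # signature -> bit vector learned in the last generation

    def miss(self, pc, line_addr):
        region, offset = divmod(line_addr, LINES_PER_REGION)
        if region not in self.agt:
            # Trigger access: the signature is the PC plus the offset in the region.
            self.agt[region] = ((pc, offset), [0] * LINES_PER_REGION)
        self.agt[region][1][offset] = 1

    def end_generation(self, line_addr):
        # An eviction or invalidation in the region ends its generation
        # and moves the accumulated bit vector into the PHT.
        region = line_addr // LINES_PER_REGION
        if region in self.agt:
            signature, bits = self.agt.pop(region)
            self.pht[signature] = bits


trainer = SmsTrainer()
A = 8  # a region-aligned line address, chosen arbitrarily
for addr in (A + 1, A + 3, A + 2):
    trainer.miss("PC", addr)
trainer.end_generation(A)
print(trainer.pht)  # {('PC', 1): [0, 1, 1, 1]}
```

The leftmost bit is taken to be offset 0, which is the only reading under which misses to A+1, A+3, and A+2 produce the 0111 shown on the slide.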



Baseline Organization

[Figure: baseline CMP. Cores 0 to N-1 each have a private L1 and a private L2 with an SMS prefetcher. An XBar connects the L2s to the shared L3 and, through the IMC, to DDR3 channels 0 to K-1.]



XCORE Spatial Streams: An Example

[Figure: demand misses over time. Core 0 misses on {C, D, E, H} and {E, H, A, B}; Core 1 misses on {C, D, F, G} and {A, B, F, G}. The streams that recur on the other core are marked as a lost opportunity.]

ICORE (Intra-core) Streams – recur within a core: E and H at core 0, F and G at core 1.

XCORE Streams – spread and recur across multiple cores: A, B, C, and D.
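The ICORE/XCORE split in this example can be reproduced with a small classifier. This is only a sketch: each letter is treated as one stream identifier, the miss order within a core is assumed, and the function name is hypothetical.

```python
from collections import defaultdict


def classify_streams(misses_per_core):
    """Split stream ids into XCORE (seen on 2+ cores) and ICORE (recur on one core)."""
    cores_seen = defaultdict(set)
    counts = defaultdict(int)
    for core, misses in misses_per_core.items():
        for stream in misses:
            cores_seen[stream].add(core)
            counts[(core, stream)] += 1

    xcore = {s for s, cores in cores_seen.items() if len(cores) >= 2}
    icore = defaultdict(set)
    for (core, stream), n in counts.items():
        if n >= 2 and stream not in xcore:
            icore[core].add(stream)
    return xcore, dict(icore)


# Miss sequences from the slide; the order in time is an assumption.
misses = {
    0: list("CDEH") + list("EHAB"),
    1: list("CDFG") + list("ABFG"),
}
xcore, icore = classify_streams(misses)
print(sorted(xcore))  # ['A', 'B', 'C', 'D']
print({core: sorted(s) for core, s in icore.items()})  # {0: ['E', 'H'], 1: ['F', 'G']}
```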



Prior Works and XSTREAM



Spatial Signatures in a 4-core CMP

[Figure: spatial signatures in a 4-core CMP, with an 80% callout.]



Distribution of Responses on an MLC miss

[Figure: distribution of responses on an MLC miss, with a 90% callout.]

Solution: Increase the size of the L1s and L2s.

L1: 32KB to 128KB – 1.8% improvement in execution time.

L2: 256KB to 1MB – 8% improvement in execution time.

XSTREAM – 11.3% improvement with 50KB of additional hardware.



XCORE Signatures - Observations

  • Observation 1: 80% of the spatial signatures recur in 2 or more cores (XCORE signatures).

  • Observation 2: At most 4068 XCORE signatures in PARSEC recur at least 4 times.

  • Observation 3: Recurrences are separated by 32K cycles on average, which is enough time to communicate the streams from one prefetcher to another.


XSTREAM in a Nutshell

  • Cross-core Spatial Prefetching framework: Based on spatial streams.

    • An MLC prefetcher (Master Prefetcher) communicates spatial streams to other MLC prefetchers (Worker Prefetchers).

    • Communication happens Just In Time.

  • XSTREAM is

    • a Data forwarding framework.

    • an Inter-core Prefetching framework.



Ideal XSTREAM

[Figure: results for an ideal XSTREAM, with callouts of 23% and 19%.]



Working Steps of XSTREAM

STEP I: XCORE Training (XSTREAM Detection)

Identifies the XCORE signatures and the corresponding master/worker prefetcher.

STEP II: XCORE Timeliness

Finds and stores the difference in time between the recurrences of XCORE streams.

STEP III: XCORE Communication (XCORE Comm.)

JIT (based on STEP II) communication of the trained streams from the master to the worker prefetcher.



Shared XSTREAM Detector

[Figure: the shared XSTREAM detector has one row per signature (Sig 0 to Sig S-1), each with entries 0 to K-1. Every entry has the fields Done, Master, Time, and BV. Example: signature PC1 with Done = 0, Master = 0, Time = t, BV = 0111.]

  • BV: Bit Vector.

  • Time (When): The time at which the BV is inserted.

  • Master (Who): Core-Id of the prefetcher that sent the BV.

  • Done: Whether the entry has already participated in the XSTREAM detection step.



Enhanced Per Core PHT

[Figure: the baseline PHT0 entry (Sig, BV), for example PC/1 → 0011, next to the XSTREAM PHT0 entry (Sig, BV, Worker, Time, Init), for example PC/1, 0011, Worker = 1, Time = δ, Init = 0.]

  • Worker : Core-Id of the prefetcher in which the signature will recur.

  • Time : The time gap after which the signature will recur at the worker.

  • Init : Whether the master has initiated the comm.



Baseline + XSTREAM

XBuff-S: Shared XBuffer

An interface between SMS and the XSTREAM detector. Stores the BV and the Core-Id of the prefetcher.

XBuff-P: Private XBuffer (per core)

Buffers the BV sent by the master prefetcher and the predicted worker that will use the BV in the future.

XSTREAM uses the clock domain that drives the LLC. This clock domain is oblivious to P(C) states.

[Figure: the baseline CMP extended with XSTREAM. Cores 0 to N-1 (f0, Vdd) each have a private L1 and L2, an SMS prefetcher, and an XBuff-P. The XBar connects them to the XBuff-S, the XSTREAM detector, and the shared L3 (f1, Vdd).]



XSTREAM Detection in a 2-core CMP

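[Figure: XSTREAM detection in a 2-core CMP. For signature PC/1, AGT0 (core 0) produces BV0 = 0111 and AGT1 (core 1) produces BV1 = 0011; each PHT entry has the fields Sig, BV, Worker, Time, and Init. Both bit vectors travel through the XBuff-S to the XSTREAM detector, which records two entries for PC/1: (Done = 0, Master = 0, Time = t, BV = 0111) and (Done = 0, Master = 1, Time = t+δ, BV = 0011).]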



XSTREAM Detection - 2

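[Figure: XSTREAM detection, continued. The detector holds two entries for PC/1: (Done = 0, Master = 0, Time = t, BV = 0111) and (Done = 0, Master = 1, Time = t+δ, BV = 0011). When the check "#1s >= CCth" passes (Y), the PHT0 entry at core 0 for PC/1 (BV = 0111) is filled with Worker = 1, Time = δ, Init = 0.]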
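The detection step can be sketched as follows. Several details are assumptions rather than facts from the slides: the "#1s" count is taken over the bitwise AND of the two bit vectors, the earlier sender is treated as the master, `CC_TH = 2` is a made-up threshold, and all names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional

CC_TH = 2  # hypothetical value; the slides do not state CCth


@dataclass
class DetectorEntry:
    master: int        # core id of the prefetcher that sent the BV
    time: int          # encoded time at which the BV was inserted
    bv: int            # bit vector, one bit per cache line of the region
    done: bool = False


@dataclass
class PhtEntry:
    bv: int
    worker: Optional[int] = None  # core where the signature should recur
    time: Optional[int] = None    # predicted gap (delta) until it recurs
    init: bool = False


def detect(entries, master_pht_entry):
    """Compare the two oldest pending entries of one signature."""
    pending = sorted((e for e in entries if not e.done), key=lambda e: e.time)
    if len(pending) < 2:
        return False
    first, second = pending[0], pending[1]
    if first.master == second.master:
        return False  # same core twice: not a cross-core recurrence
    common_ones = bin(first.bv & second.bv).count("1")
    if common_ones < CC_TH:
        return False
    master_pht_entry.worker = second.master
    master_pht_entry.time = second.time - first.time
    first.done = second.done = True
    return True


# Values from the slide, with t = 10 and delta = 8 chosen arbitrarily.
entries = [DetectorEntry(0, 10, 0b0111), DetectorEntry(1, 18, 0b0011)]
pht0 = PhtEntry(bv=0b0111)
print(detect(entries, pht0), pht0.worker, pht0.time)  # True 1 8
```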



XCORE Communication

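[Figure: XCORE communication in a 2-core CMP. PHT0 (core 0, master) holds PC/1 with BV0 = 1001, Worker = 1, and Time = δ; PHT1 (core 1) holds PC/1 with BV1 = 0110. Core 0 sends BV0 over the XBar to the XBuff-P of core 1, where BV1 ∪ BV0 = 1111 is formed at t + δ and handed to the PFQ. BV0 also goes through the XBuff-S to the XSTREAM detector, which records (Done = 0, Master = 0, Time = t, BV = 1001) for PC/1.]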
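The merge at the worker can be illustrated with the bit vectors from this slide. The sketch below assumes 64-byte cache lines, 4-line regions, and that the leftmost bit is offset 0; the helper names are hypothetical.

```python
LINE_SIZE = 64  # bytes per cache line (assumption)


def parse_bv(text):
    """Turn a string such as '1 0 0 1' into a list of booleans, leftmost = offset 0."""
    return [ch == "1" for ch in text.replace(" ", "")]


def union(bv_a, bv_b):
    """BV1 U BV0: a line is prefetched if either prefetcher saw it."""
    return [a or b for a, b in zip(bv_a, bv_b)]


def prefetch_addresses(region_base, bv):
    """Addresses that would be placed in the prefetch queue for this region."""
    return [region_base + i * LINE_SIZE for i, bit in enumerate(bv) if bit]


bv0 = parse_bv("1 0 0 1")  # forwarded by the master (core 0)
bv1 = parse_bv("0 1 1 0")  # already trained at the worker (core 1)
merged = union(bv1, bv0)
print([hex(a) for a in prefetch_addresses(0x1000, merged)])
# ['0x1000', '0x1040', '0x1080', '0x10c0']
```

With the union, the worker covers all four lines of the region, whereas its own history alone would cover only two.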



XCORE Timeliness - Implementation

XCORE communication depends on the accurate prediction of the time difference (δ).

Local/global registers at the PHTs/XSTREAM detector store a 4-bit encoded cycle value [cycle/4000].

The δ estimates are tuned to minimize the noise from the interconnect.

On average, these estimates are accurate 67% of the time.
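One way to read the 4-bit [cycle/4000] encoding is sketched below. The floor division and the wraparound at 16 ticks are assumptions; the slides give only the bit width and the divisor.

```python
CYCLES_PER_TICK = 4000
TICK_BITS = 4
TICK_MASK = (1 << TICK_BITS) - 1  # 0b1111


def encode(cycle):
    """4-bit timestamp: cycle count in units of 4000 cycles, wrapping at 16."""
    return (cycle // CYCLES_PER_TICK) & TICK_MASK


def delta_ticks(t_master, t_worker):
    """Gap between two encoded timestamps, tolerant of one wraparound."""
    return (t_worker - t_master) & TICK_MASK


t_master = encode(100_000)           # 25 ticks, stored as 9
t_worker = encode(100_000 + 32_000)  # 33 ticks, stored as 1
gap = delta_ticks(t_master, t_worker)
print(gap, gap * CYCLES_PER_TICK)  # 8 32000
```

Under these assumptions the 32K-cycle average separation reported earlier maps to 8 ticks, which fits in the 4-bit field.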



Interconnect Support (XSTREAM Transactions)

X-Request: Between an MLC prefetcher and the shared XSTREAM detector.

Contents: BV with the Core-Id of the Master.

X-Response: Between the XSTREAM detector and the master prefetcher.

Contents: Time field with the Core-Id of the Worker.

X-Comm: Between the Master and the Worker.

Contents: BV and the Worker.
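The three transaction types can be written down as plain records. The field lists follow the contents named above; the extra `sig` field and the class and function names are assumptions added so that the example is self-contained.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class XRequest:   # MLC prefetcher -> shared XSTREAM detector
    sig: str      # assumption: the signature travels with the BV
    bv: int
    master: int


@dataclass(frozen=True)
class XResponse:  # XSTREAM detector -> master prefetcher
    sig: str
    time: int     # predicted gap (delta), encoded
    worker: int


@dataclass(frozen=True)
class XComm:      # master prefetcher -> worker prefetcher
    sig: str
    bv: int
    worker: int


def transactions(sig, bv, master, worker, delta):
    """The messages exchanged for one XCORE signature, in order."""
    return [
        XRequest(sig, bv, master),
        XResponse(sig, delta, worker),
        XComm(sig, bv, worker),
    ]


for msg in transactions("PC/1", 0b1001, master=0, worker=1, delta=8):
    print(msg)
```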



Simulation Methodology



Parameters Specific to XSTREAM


Speedup in a 4-core CMP

[Figure: per-benchmark speedup in a 4-core CMP, grouped into three bands: up to 4%; above 4% and up to 13%; above 13% and up to 29%. The overall speedup is 11.3%.]


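Speedup in an 8-core CMP

[Figure: per-benchmark speedup in an 8-core CMP, grouped into three bands: below 2%; above 2% and up to 10%; above 10% and below 30%. The overall speedup is 9.0%.]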



4-core to 8-core

dedup and ferret spawn 2+3n and 2+4n threads, respectively, in an n-core system.

fluidanimate, freqmine, and vips – the slowest thread limits the performance.

x264 and streamcluster – the degree of XCORE communication increases with the core count.
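The thread counts in the first point are easy to tabulate. The formulas come from the slide; the function name is hypothetical.

```python
def threads(benchmark, n_cores):
    """Threads spawned by dedup (2 + 3n) and ferret (2 + 4n) on n cores."""
    per_core = {"dedup": 3, "ferret": 4}[benchmark]
    return 2 + per_core * n_cores


for name in ("dedup", "ferret"):
    for n in (4, 8):
        total = threads(name, n)
        print(f"{name}: {n} cores -> {total} threads ({total / n:.2f} per core)")
```

Both benchmarks run more than three threads per core at either core count.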


Storage Overhead (4-core CMP)

XSTREAM incurs an overhead that is a little more than 1/6th of a single MLC (256KB).



Summary of Evaluations

  • XSTREAM consumes

    • 18% of the spare interconnect bandwidth (which is 57% of the theoretical limit).

    • 7.23GB/sec (average) DRAM bandwidth.



Other Results in the Paper

  • Sensitivity study with various cache sizes of L1/L2/L3.

  • Scalability

    • Scales well for 16-core CMP too (10% Improvement).

  • Quantitative Analysis of each PARSEC benchmark.

  • Special Cases and specific issues.

  • Analysis of Bandwidth/DRAM Traffic.

  • Detailed Calculation of the Hardware Overhead

    • 2.3% increase in the L2 cache area.



Conclusion

  • A new Spatial Streaming mechanism for private MLCs.

  • Key Idea: Communication of spatial streams from one prefetcher to another.

  • Key properties:

    • Low Hardware Cost

    • Simple and Practical hardware implementation

    • Just In Time Communication

  • Improves execution time

    • Outperforms the state-of-the-art spatial streaming technique by 11.3% (9%) in 4-core (8-core) CMPs.



Thank You

This work is supported by an IBM India Shared University Research (SUR) Grant and a Ph.D. Fellowship from Tata Consultancy Services.



Backup Slides



Prefetch Metrics – 4-core

Acc, Cov – Higher the better

PF Traffic – Lower the better



Prefetch Metrics – 8-core

Acc, Cov – Higher the better

PF Traffic – Lower the better



Reduction in Demand Miss Rate (4-core)



Reduction in Demand Miss Rate (8-core)
