XSTREAM: Cross-core Spatial Streaming based MLC Prefetchers for Parallel Applications in CMPs

Biswabandan Panda,Shankar Balachandran

{biswa,shankar}@cse.iitm.ac.in

Indian Institute of Technology Madras, India

PACT 2014

Quick Summary
  • Problem: Per-core spatial streaming based MLC (mid-level cache) prefetchers
    • are oblivious to the cross-core (XCORE) streams.
  • Opportunity:
    • Spatial streams are spread across the cores.
  • Our Contribution: XSTREAM (XCORE Stream Prefetching)
    • Communication of the streams from one prefetcher to another Just In Time.
  • Improves performance with negligible storage:
    • 11.3% (9%) speedup in 4-core (8-core) CMPs with an additional storage of 50KB.
Background on Streams

Stream – A sequence of cache-line-aligned memory addresses.

Temporal Streams – Sequences of temporally correlated addresses, exploited by TMS [Wenisch, ISCA ‘05].

Spatial Streams – Streams that are correlated in space, exploited by SMS [Somogyi, ISCA ‘06].

Spatio-Temporal Streams – Temporal correlation among the spatial regions, and spatial correlation within a region, exploited by STeMS [Somogyi, ISCA ‘09].

Streams With Examples

[Figure: the miss sequence A, A + 2, B, C, A + 3, A + 7, shown once under the heading Spatial Streams and once under the heading Temporal Streams.]

Spatial streams – Scans over fixed data layout.

Temporal streams – Pointer-chasing codes.

Focus of This Talk – Spatial Streaming.

SMS101: Per Core Training
  • Divides the memory space into fixed-size regions.
  • Each region's access pattern is indexed by a signature (PC/offset).
  • Each signature has an associated bit vector.
  • Each bit in the bit vector corresponds to a cache line of the region.

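[Figure: SMS training structures. The Active Generation Table (AGT) consists of a Filter Table (FT; fields Tag, PC/Offset) and an Accumulation Table (AT; fields Tag, PC/Offset, Bit Vector). The Pattern History Table (PHT) has the fields Sig and Bit Vector. Example for region A with signature PC/1: (1) a miss to A+1 allocates an FT entry; (2) a miss to A+3 creates an AT entry with bit vector 0101, which a miss to A+2 updates to 0111; (3) an eviction/invalidation in A stores PC/1 with bit vector 0111 in the PHT.]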
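The training flow on this slide (misses to A+1, A+3 and A+2, followed by an eviction) can be sketched in a few lines of Python. This is only a toy model under stated assumptions: the class name `SMSTrainer`, the dictionary-based tables, and the 4-line region size are illustrative choices, not the hardware organization from the paper.

```python
from dataclasses import dataclass, field

REGION_LINES = 4  # assumption: 4 cache lines per region, matching the 4-bit vectors on the slide


@dataclass
class SMSTrainer:
    """Toy per-core SMS training: Filter Table -> Accumulation Table -> PHT."""
    filter_table: dict = field(default_factory=dict)   # region tag -> (pc, trigger offset)
    accumulation: dict = field(default_factory=dict)   # region tag -> (signature, bit list)
    pht: dict = field(default_factory=dict)            # signature -> bit-vector string

    def miss(self, pc, region, offset):
        if region in self.accumulation:
            # Region is already being accumulated: mark the touched line.
            self.accumulation[region][1][offset] = 1
        elif region in self.filter_table:
            sig = self.filter_table[region]
            if offset != sig[1]:
                # Second distinct line in the region: promote FT entry to the AT.
                del self.filter_table[region]
                bits = [0] * REGION_LINES
                bits[sig[1]] = 1
                bits[offset] = 1
                self.accumulation[region] = (sig, bits)
        else:
            # Trigger miss: the signature is PC/offset.
            self.filter_table[region] = (pc, offset)

    def evict(self, region):
        """An eviction/invalidation ends the generation and trains the PHT."""
        self.filter_table.pop(region, None)
        if region in self.accumulation:
            sig, bits = self.accumulation.pop(region)
            self.pht[sig] = "".join(map(str, bits))


sms = SMSTrainer()
for offset in (1, 3, 2):          # misses to A+1, A+3, A+2
    sms.miss(pc="PC", region="A", offset=offset)
sms.evict("A")
print(sms.pht)                    # {('PC', 1): '0111'}
```

The leftmost character of the bit-vector string stands for line 0 of the region, which reproduces the 0101 and 0111 patterns shown on the slide.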

Baseline Organization

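[Figure: N cores (Core 0, Core 1, ..., Core N-1), each with a private L1 and a private L2 with an SMS prefetcher, connected through a crossbar (XBar) to a shared L3 and an IMC with K DDR3 channels (DDR3 0, DDR3 1, ..., DDR3 K-1).]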

XCORE Spatial Streams: An Example

[Figure: demand misses over time, labelled "Lost Opportunity". Core 0: C D E H and E H A B. Core 1: C D F G and A B F G.]

ICORE (Intra-core) Streams – recur within a core.

E and H at core 0, F and G at core 1.

XCORE Streams – spread and recur across multiple cores.

A, B, C, and D.
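As a rough illustration of the ICORE/XCORE distinction, the sketch below classifies the streams from this example. The function name `classify` and the data layout are hypothetical; the grouping rule (a stream seen on two or more cores is XCORE, a stream that recurs on exactly one core is ICORE) is a simplified reading of the definitions above.

```python
from collections import defaultdict

# Demand-miss streams per core from the example (two groups per core;
# the order of the groups does not matter for this classification).
misses = {
    0: [["C", "D", "E", "H"], ["E", "H", "A", "B"]],
    1: [["C", "D", "F", "G"], ["A", "B", "F", "G"]],
}


def classify(misses):
    """Return (xcore, icore): streams seen on >= 2 cores, and streams recurring on one core."""
    count = defaultdict(lambda: defaultdict(int))   # stream -> core -> occurrences
    for core, groups in misses.items():
        for group in groups:
            for stream in group:
                count[stream][core] += 1

    xcore = {s for s, per_core in count.items() if len(per_core) >= 2}
    icore = defaultdict(set)
    for s, per_core in count.items():
        if s not in xcore:
            (core, n), = per_core.items()           # exactly one core saw this stream
            if n >= 2:
                icore[core].add(s)
    return xcore, dict(icore)


xcore, icore = classify(misses)
print(sorted(xcore))   # ['A', 'B', 'C', 'D']
```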

Distribution of Responses on an MLC miss

[Figure: distribution of responses on an MLC miss; highlighted value: 90%.]

Solution: Increase the size of the L1s and L2s.

L1: 32KB to 128KB – 1.8% improvement in execution time.

L2: 256KB to 1MB – 8% improvement in execution time.

XSTREAM – 11.3% with 50KB of additional hardware.

XCORE Signatures - Observations
  • Observation 1: 80% of the spatial signatures recur in 2 or more cores (XCORE signatures).

  • Observation 2: A maximum of only 4068 XCORE signatures in PARSEC recur at least 4 times.

  • Observation 3: Recurrences are separated by 32K cycles on average, which is enough time to communicate the streams from one prefetcher to another.

XSTREAM in a NutShell
  • Cross-core Spatial Prefetching framework: Based on spatial streams.
    • An MLC prefetcher (the Master Prefetcher) communicates spatial streams to other MLC prefetchers (the Worker Prefetchers).
    • Communication happens Just In Time.
  • XSTREAM is
    • a Data forwarding framework.
    • an Inter-core Prefetching framework.
Ideal XSTREAM

[Figure: results for an ideal XSTREAM; highlighted values: 23% & 19%.]

Working Steps of XSTREAM

STEP I : XCORE Training (XSTREAM Detection)

Identifies the XCORE signatures and the corresponding master/worker prefetchers.

STEP II : XCORE Timeliness

Finds and stores the difference in time between the recurrences of XCORE streams.

STEP III : XCORE Communication (XCORE Comm.)

Just-in-time (JIT, based on STEP II) communication of the trained streams from the master to the worker prefetcher.

Shared XSTREAM Detector

[Figure: the shared XSTREAM detector holds signatures Sig 0 ... Sig S-1, each with entries Entry 0 ... Entry K-1. Every entry has the fields Done, Master, Time, and BV.]

  • BV : Bit Vector
  • Time (When) : The time at which the BV is inserted.
  • Master (Who) : Core-Id of the prefetcher that sent the BV.
  • Done : Whether the entry has already participated in the XSTREAM detection step.

[Example: an entry for signature PC1 inserted at time t with BV 0 1 1 1; the other fields shown are 0.]

Enhanced Per Core PHT

[Figure: the baseline Pattern History Table (PHT0) has the fields Sig and BV (example: PC/1, 0011). The XSTREAM Pattern History Table (PHT0) adds the fields Worker, Time, and Init (example: PC/1, 0011, Worker 1, Time δ, Init 0).]

  • Worker : Core-Id of the prefetcher in which the signature will recur.
  • Time : The time gap after which the signature will recur at the worker.
  • Init : Whether the master has initiated the comm.
Baseline + XSTREAM

XBuff-S : Shared XBuffer

An interface between SMS and the XSTREAM detector.

Stores the BV and the Core-Id of the prefetcher.

XBuff-P : Private XBuffer (Per Core)

Buffers the BV sent by the master prefetcher and the predicted worker which will use the BV in future.

XSTREAM uses the clock domain which drives the LLC. This clock domain is oblivious to P(C) states.

[Figure: the baseline organization extended with a per-core XBuff-P next to each SMS prefetcher, and a shared XBuff-S and XSTREAM Detector next to the XBar and the shared L3. The cores are labelled f0, Vdd; the shared L3 side is labelled f1, Vdd.]

XSTREAM Detection in a 2-core CMP

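[Figure: XSTREAM detection, steps 1 and 2. For signature PC/1, AGT0 at Core 0 sends BV0 = 0 1 1 1 and AGT1 at Core 1 sends BV1 = 0 0 1 1 through XBuff-S to the XSTREAM Detector. PHT 0 and PHT 1 (fields Sig, BV, Worker, Time, Init) hold PC/1 with Init = 0. The detector (fields Sig, Done, Master, Time, BV) records two entries for PC/1: Done 0, Master 0, Time t, BV 0 1 1 1; and Done 0, Master 1, Time t+δ, BV 0 0 1 1.]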

XSTREAM Detection - 2

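[Figure: XSTREAM detection, steps 3 and 4. The XSTREAM Detector holds two entries for PC/1: Master 0, Time t, BV 0 1 1 1; and Master 1, Time t+δ, BV 0 0 1 1. It checks whether #1s >= CCth. On Y, the PHT 0 entry at Core 0 becomes Sig PC/1, BV 0 1 1 1, Worker 1, Time δ, Init 0.]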
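A minimal sketch of the detection step for the 2-core example follows. It rests on several assumptions that the slides do not confirm: that "#1s" counts the lines common to both bit vectors (bitwise AND), that the earlier sender becomes the master and the later one the worker, and that `CC_TH = 2`. The names `Entry`, `XStreamDetector`, and `x_request` are hypothetical.

```python
from dataclasses import dataclass

CC_TH = 2  # hypothetical value for the threshold CCth; the slides do not give one


@dataclass
class Entry:
    master: int         # Core-Id of the prefetcher that sent the BV
    time: int           # time at which the BV was inserted
    bv: int             # bit vector, e.g. 0b0111
    done: bool = False  # has this entry already taken part in detection?


class XStreamDetector:
    def __init__(self):
        self.table = {}   # signature -> list of entries

    def x_request(self, sig, core, time, bv):
        """Insert a BV; return (master, worker, delta) if an XCORE stream is detected."""
        entries = self.table.setdefault(sig, [])
        new = Entry(core, time, bv)
        for old in entries:
            if old.done or old.master == core:
                continue
            # Assumption: "#1s" is the number of lines set in both bit vectors.
            if bin(old.bv & new.bv).count("1") >= CC_TH:
                old.done = True
                entries.append(new)
                return old.master, core, time - old.time
        entries.append(new)
        return None


det = XStreamDetector()
first = det.x_request("PC/1", core=0, time=1_000, bv=0b0111)    # BV0 at time t
second = det.x_request("PC/1", core=1, time=33_000, bv=0b0011)  # BV1 at time t + delta
print(first, second)   # None (0, 1, 32000)
```

In this reading, the returned tuple corresponds to what the master's PHT needs: the worker's Core-Id and the time gap δ.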

XCORE Communication

[Figure: XCORE communication, steps 1 to 5. PHT 0 at Core 0 holds PC/1 with BV0 = 1 0 0 1, Worker 1, Time δ; PHT 1 at Core 1 holds PC/1 with BV1 = 0 1 1 0. The XSTREAM Detector entry for PC/1 has Done 0, Master 0, Time t, BV 1 0 0 1. BV0 is shown passing through AGT0, XBuff-S, the XBar, and XBuff-P; the union BV1 U BV0 = 1 1 1 1 and the PFQ appear at time t + δ.]
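The union shown on this slide (BV1 U BV0 = 1 1 1 1) can be illustrated with a short sketch. The helper name `merge_and_prefetch`, the region base address, and the idea of returning a list of line addresses for a prefetch queue are assumptions made for illustration only.

```python
def merge_and_prefetch(bv_worker, bv_master, region_base, n_lines=4):
    """Union the worker's own BV with the BV received from the master and
    list the cache lines of the region that the merged vector covers."""
    merged = bv_worker | bv_master
    # The leftmost bit stands for line 0, as in the bit vectors on the slides.
    lines = [region_base + i
             for i in range(n_lines)
             if (merged >> (n_lines - 1 - i)) & 1]
    return merged, lines


merged, lines = merge_and_prefetch(bv_worker=0b0110, bv_master=0b1001, region_base=100)
print(format(merged, "04b"), lines)   # 1111 [100, 101, 102, 103]
```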

XCORE Timeliness - Implementation

XCORE communication depends on the accurate prediction of the time difference (δ).

Local/global registers at the PHTs/XSTREAM detector store a 4-bit encoded cycle value [cycle/4000].

δ estimates are tuned to minimize the noise from the interconnect.

On average, these estimates are accurate 67% of the time.
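One possible reading of the 4-bit [cycle/4000] encoding is a coarse quantization of the cycle gap. The sketch below applies it to δ directly and saturates at the largest 4-bit value; both choices are assumptions, since the slide does not say how overflow is handled or whether timestamps rather than gaps are encoded.

```python
CYCLES_PER_TICK = 4000   # from the slide: [cycle/4000]
FIELD_BITS = 4           # from the slide: 4-bit encoded value


def encode_delta(delta_cycles):
    """Quantize a cycle gap into a 4-bit field (saturation is an assumption)."""
    return min(delta_cycles // CYCLES_PER_TICK, (1 << FIELD_BITS) - 1)


def decode_delta(code):
    """Recover the approximate cycle gap from the 4-bit code."""
    return code * CYCLES_PER_TICK


print(encode_delta(32_000))                  # 8
print(decode_delta(encode_delta(32_000)))    # 32000
```

With this encoding, the average separation of 32K cycles reported earlier fits in the 4-bit field, which can represent gaps of up to 60,000 cycles in steps of 4000.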

Interconnect Support (XSTREAM Transactions)

X-Request : Between an MLC prefetcher and the shared XSTREAM detector.

Contents : BV with the Core-Id of the Master.

X-Response : Between the XSTREAM detector and the master prefetcher.

Contents : Time field with the Core-Id of the Worker.

X-Comm : Between the Master and the Worker.

Contents : BV and the Worker.
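The three transaction types can be written down as plain records. The class and field names below are hypothetical and mirror only the contents listed on this slide; a real message would presumably also carry the signature, which is not listed here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class XRequest:
    """MLC prefetcher -> shared XSTREAM detector."""
    bv: int        # bit vector
    master: int    # Core-Id of the sending (master) prefetcher


@dataclass(frozen=True)
class XResponse:
    """XSTREAM detector -> master prefetcher."""
    time: int      # Time field (the encoded gap)
    worker: int    # Core-Id of the worker


@dataclass(frozen=True)
class XComm:
    """Master prefetcher -> worker prefetcher."""
    bv: int        # bit vector forwarded by the master
    worker: int    # Core-Id of the destination worker


# One round trip for the 2-core example: core 0 is the master, core 1 the worker.
req = XRequest(bv=0b0111, master=0)
resp = XResponse(time=8, worker=1)
comm = XComm(bv=req.bv, worker=resp.worker)
print(comm)   # XComm(bv=7, worker=1)
```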

Speedup in a 4-core CMP

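[Figure: per-benchmark speedup in a 4-core CMP, grouped into three bands: <= 4%, > 4% & <= 13%, and > 13% & <= 29%. Highlighted value: 11.3%.]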

Speedup in an 8-core CMP

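[Figure: per-benchmark speedup in an 8-core CMP, grouped into three bands: < 2%, > 2% & <= 10%, and > 10% & < 30%. Highlighted value: 9.0%.]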

4-core to 8-core

dedup and ferret spawn 2+3n and 2+4n threads, respectively, in an n-core system.

fluidanimate, freqmine, and vips – the slowest thread limits the performance.

x264 and streamcluster – the degree of XCORE comm. increases with the core count.

Storage Overhead (4 core CMP)

XSTREAM incurs an overhead which is a little more than 1/6th of a single MLC (256KB).

Summary of Evaluations
  • XSTREAM consumes
    • 18% of the spare interconnect bandwidth (which is 57% of the theoretical limit).
    • 7.23GB/sec (average) DRAM bandwidth.
Other Results in the Paper
  • Sensitivity study with various cache sizes of L1/L2/L3.
  • Scalability
    • Scales well for 16-core CMP too (10% Improvement).
  • Quantitative Analysis of each PARSEC benchmark.
  • Special Cases and specific issues.
  • Analysis of Bandwidth/DRAM Traffic.
  • Detailed Calculation of the Hardware Overhead
    • 2.3% increase in the L2 cache area.
Conclusion
  • A new Spatial Streaming mechanism for private MLCs.
  • Key Idea: Communication of spatial streams from one prefetcher to another.
  • Key properties:
    • Low Hardware Cost
    • Simple and Practical hardware implementation
    • Just In Time Communication
  • Improves execution time
    • Outperforms the state-of-the-art spatial streaming technique by 11.3% (9%) in 4-core (8-core) CMPs.
Thank You

This work is supported by an IBM India Shared University Research (SUR) Grant and a Ph.D. Fellowship from Tata Consultancy Services.

Prefetch Metrics – 4-core

Acc, Cov – Higher the better

PF Traffic – Lower the better

Prefetch Metrics – 8-core

Acc, Cov – Higher the better

PF Traffic – Lower the better
