Effective Instruction Scheduling Techniques for an Interleaved Cache Clustered VLIW Processor

Enric Gibert (1), Jesús Sánchez (1,2), Antonio González (1,2)

(1) Dept. d'Arquitectura de Computadors, Universitat Politècnica de Catalunya (UPC), Barcelona
(2) Intel Barcelona Research Center, Intel Labs, Barcelona

Motivation
  • Capacity vs. communication-bound designs
  • Clustered microarchitectures:
    • Simpler and faster
    • Lower power consumption
    • Communications are not homogeneous
  • Clustering is common in the embedded/DSP domain
Clustered Microarchitectures

[Figure: a 4-cluster VLIW processor. Each cluster (CLUSTER 1-4) has its own functional units (FUs) and register file, and the clusters are connected by register-to-register communication buses. The L1 cache is centralized and shared by all clusters through memory buses, backed by the L2 cache.]

GOAL: distribute the memory hierarchy!!!

Contributions
  • Distribution of data cache:
    • Interleaved cache clustered VLIW processor
  • Hardware enhancement:
    • Attraction Buffers
  • Effective instruction scheduling techniques
    • Modulo scheduling
    • Loop unrolling + smart assignment of latencies + padding
Talk Outline
  • MultiVLIW
  • Interleaved-cache clustered VLIW processor
  • Instruction scheduling algorithms and techniques
  • Hardware enhancement: Attraction Buffers
  • Simulation framework
  • Results
  • Conclusions
MultiVLIW

[Figure: a 4-cluster VLIW processor in which the L1 cache is distributed: each cluster (CLUSTER 1-4) has its own functional units, register file, and cache module, and each module stores full cache blocks (TAG+STATE+DATA). A cache-coherence protocol keeps the modules consistent. Clusters are connected by register-to-register communication buses, and the cache modules are backed by the L2 cache.]

Interleaved Cache

[Figure: a 4-cluster VLIW processor with a word-interleaved distributed cache. A cache block (TAG plus words W0-W7) is split into subblocks across the four cache modules: cluster 1 holds W0 and W4, cluster 2 holds W1 and W5, cluster 3 holds W2 and W6, cluster 4 holds W3 and W7, and each module keeps its own TAG. An access from a cluster is therefore a local hit, a remote hit, a local miss, or a remote miss. Clusters are connected by register-to-register communication buses, and the cache modules are backed by the L2 cache.]

BASE Scheduling Algorithm

[Flowchart: START → sort the nodes → take the next node → select its possible clusters. If there are 0 candidate clusters, set II=II+1 and restart; if there is more than one, choose by best profit in the output edges, breaking ties with the least-loaded cluster; with a single candidate, schedule the node on it. If scheduling is not successful, set II=II+1 and restart; if successful, continue with the next node.]

Scheduling Algorithm
  • For word-interleaved cache clustered processors
  • Scheduling steps:
      • Loop unrolling
      • Assignment of latencies to memory instructions
        • Higher assumed latencies → less stall time but longer compute time
        • Lower assumed latencies → more stall time but shorter compute time
      • Order instructions (DDG nodes)
      • Cluster assignment and scheduling
STEP 1: Loop Unrolling

[Figure: array a[] word-interleaved across the four cache modules: cluster 1 holds a[0] and a[4]; cluster 2 holds a[1] and a[5]; cluster 3 holds a[2] and a[6]; cluster 4 holds a[3] and a[7].]

Original loop: the single load "ld r3, a[i]" visits a different cache module each iteration, so only 25% of its accesses are local:

for (i=0; i<MAX; i++) {
    ld r3, a[i]
    r4 = OP(r3)
    st r4, b[i]
}

After unrolling by 4, each copy of the load has a 16-byte stride and always accesses the same cache module, so 100% of its accesses are local:

for (i=0; i<MAX; i+=4) {
    ld r31, a[i]   (stride 16 bytes)
    ld r32, a[i+1] (stride 16 bytes)
    ld r33, a[i+2] (stride 16 bytes)
    ld r34, a[i+3] (stride 16 bytes)
    ...
}

  • Selective unrolling:
    • No unrolling
    • Unroll x N: for strides multiple of NxI
    • Optimum Unrolling Factor (OUF) unrolling

STEP 2: Latency Assignment

[Figure: an example DDG with nodes n1 (load), n2 (load), n3 (add), n4 (store), n5 (sub), n6 (load), n7 (div), n8 (add), connected by register-flow dependences (L=1, except L=8 out of the div) and by memory dependences of distance 1 that close two recurrences, REC1 and REC2. Each memory instruction can be assigned one of four latencies: local hit LH=1 cycle, remote hit RH=5 cycles, local miss LM=10 cycles, remote miss RM=15 cycles. The annotations show how the MII of each recurrence changes with the assigned latencies, ranging from MII=9 and MII=10 for the shortest assignments up to MII=22, 23, 28, and 33 with L=15 on every load.]

STEPS 3 and 4
  • Step 3: Order instructions
  • Step 4: Cluster assignment and scheduling
Scheduling Restrictions

[Figure: the four clusters' cache modules share memory buses with the next memory level, so a memory instruction scheduled on one cluster (e.g. accessing a[0] or a[3]) may have to reach a cache module in another cluster (e.g. the one holding a[4] or a[7]) over those same buses.]

NON-DETERMINISTIC BUS LATENCY!!!

STEPS 3 and 4
  • Step 3: Order instructions
  • Step 4: Cluster assignment and scheduling
    • Non-memory instructions → same as BASE
      • Minimize register communications + maximize workload
    • Memory instructions:
      • Memory instructions in the same chain → same cluster
      • IPBC (Interleaved Preferred Build Chains)
        • Average "preferred cluster" of the chain
        • Padding → meaningful preferred-cluster information
          • Stack frames
          • Dynamically allocated data
      • IBC (Interleaved Build Chains)
        • Minimize register communications of the 1st instruction of the chain

Memory Dependent Chains

[Figure: the same example DDG as in Step 2 (nodes n1-n8; load latencies LH=1, RH=5, LM=10, RM=15; memory dependences of distance 1; register-flow dependences with L=1, except L=8 out of the div). The scheduling order is order={n5, n4, n3, n2, n1, n8, n7, n6}, and each memory-dependent chain is annotated with a preferred cluster (preferred=1 or preferred=2), derived from where its accessed data falls relative to the NxI interleaving boundary.]

Attraction Buffers
  • Cost-effective mechanism to increase local accesses

[Figure: each cluster adds a small Attraction Buffer (ABuffer) beside its cache module; array a[] is word-interleaved as before (cluster 1 holds a[0] and a[4], cluster 2 holds a[1] and a[5], cluster 3 holds a[2] and a[6], cluster 4 holds a[3] and a[7]). Example: the loads "ld r3, a[3]" and "ld r3, a[7]" (stride 16 bytes) execute on cluster 1, but their data lives in cluster 4's module, so without the buffer 0% of the accesses are local; with the ABuffer attracting the remotely fetched data, local accesses rise to 50%.]

Evaluation Framework
  • IMPACT C compiler
  • Mediabench benchmark suite
Local Accesses

[Chart: percentage of local accesses per benchmark for each scheduling configuration. Legend: OUF = Optimum Unrolling Factor, P = Padding, NC = No Chains.]

Why Remote Accesses?
  • Double-precision accesses (mpeg2dec)
  • Unclear "preferred cluster" information:
      • Indirect accesses (e.g. a[b[i]]) (jpegdec, jpegenc, pegwitdec, pegwitenc)
      • Different alignment (epicenc, jpegdec, jpegenc)
      • Strides not multiple of NxI (selective unrolling, …)
  • Memory-dependent chains (epicdec, pgpdec, pgpenc, rasta)
  • Example of a problematic loop (the inner loop's start index depends on k):

      for (k=0; k<MAX; k++) {
          for (i=k; i<MAX; i++)
              load a[i]
      }
Conclusions
  • Interleaved cache clustered VLIW processor
  • Effective instruction scheduling techniques:
    • Smart assignment of latencies
    • Loop unrolling + padding (27% more local hits)
  • Sources of remote accesses and stall time identified
  • Attraction Buffers (reduce stall time by up to 34%)
  • Cycle count results:
    • vs. MultiVLIW: 7% slowdown but simpler hardware
    • vs. unified cache: 11% speedup
Question: Latency Assignment

[Figure: example DDG with MII(REC1)=20 and MII(DDG)=10.]

Question: Padding

void foo(int *array, int *accum) {
    *accum = 0;
    for (int i=0; i<MAX; i++)
        *accum += array[i];
}

void main() {
    int *a, value;
    a = malloc(MAX*sizeof(int));
    foo(a, &value);
}

[Figure: without padding, accum (a stack variable) and the malloc'ed array a[] are spread across the four cache modules (accum in one module; a[0], a[4], … in another; a[1], a[5], … in another; and so on), so the alignment of each object relative to the interleaving is unknown at compile time.]

Question: Coherence
  • Memory-dependent chains
    • Modified data:
      • Present in only one Attraction Buffer
    • Data present in multiple Attraction Buffers:
      • Replicated in a read-only manner
    • Local scheduling technique:
      • At the end of the loop → flush the Attraction Buffers' contents

[Figure: the same word (a[2]) replicated in read-only mode in the Attraction Buffers of three of the four clusters.]