Compiling for EDGE Architectures: The TRIPS Prototype Compiler

Kathryn McKinley

Doug Burger, Steve Keckler,

Jim Burrill1, Xia Chen, Katie Coons, Sundeep Kushwaha, Bert Maher, Nick Nethercote, Aaron Smith, Bill Yoder

et al.

The University of Texas at Austin

1University of Massachusetts, Amherst


Technology Scaling Hitting the Wall

[Figure, shown qualitatively and analytically: the fraction of a chip with a 20 mm edge reachable in one cycle shrinks as feature size scales from 130 nm through 100 nm and 70 nm down to 35 nm]

Either way … Partitioning for on-chip communication is key


OO SuperScalars Out of Steam

Clock ride is over

  • Wire and pipeline limits

  • Quadratic out-of-order issue logic

  • Power, a first-order constraint

    Problems for any architectural solution

  • ILP - instruction level parallelism

  • Memory and on-chip latency

    Major vendors ending processor lines



What’s next?


Post-RISC Solutions

  • CMP - An evolutionary path

    • Replicate what we already have 2 to N times on a chip

    • Coarse grain parallelism

    • Exposes the resources to the programmer and compiler

  • Explicit Data Graph Execution (EDGE)

    1. The program graph is broken into a sequence of blocks

      • Blocks commit atomically or not at all - a block never partially commits

    2. Dataflow within a block, with ISA support for direct producer-consumer communication

      • No shared named registers (point-to-point dataflow edges only)

      • Memory is still a shared namespace

      • The block’s dataflow graph (DFG) is explicit in the architecture


Outline

  • TRIPS Execution Model & ISA

  • TRIPS Architectural Constraints

  • Compiler Structure

  • Spatial Path Scheduling


Block Atomic Execution Model

[Figure: a TRIPS block’s flow graph (adds, loads, shifts, compares, stores, branches) is expressed as a dataflow graph with register read/write interface instructions, then mapped onto the execution substrate alongside the register file and data caches]

  • TRIPS block - single entry constrained hyperblock

  • Dataflow execution w/ target position encoding


TRIPS Block Constraints

Fixed Size: 128 instructions

  • Padded with no-ops if needed

Load/Store Identifiers: 32 load/store queue identifiers

  • More than 32 static loads and stores is possible

Registers: 32 reads and 32 writes, 8 to each of 4 banks (in addition to the 128 instructions)

[Figure: block datapath - register banks feed up to 32 reads into the 1-128 instruction DFG, which issues up to 32 loads and 32 stores to memory, produces up to 32 register writes, reads the PC, and executes one terminating branch to produce the next PC]

  • Constant Output: all stores and writes execute, one branch

    • Simplifies hardware logic for detecting block completion

    • Every path of execution through a block must produce the same stores and register writes

Simplifies the hardware, more work for the compiler
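The block constraints above can be expressed as a simple validator. This is a minimal sketch: the instruction records, field names, and the bank assignment (register number mod 4) are illustrative assumptions, not taken from the TRIPS compiler.

```python
# Sketch of a TRIPS block-constraint checker. Each instruction is a dict with
# "op" and, for loads/stores, an "lsid" (load/store queue identifier).
# The constant-output check (all stores/writes execute on every path) needs
# predicate-path analysis and is omitted here.
def check_block(instrs, reads, writes):
    errors = []
    if len(instrs) > 128:
        errors.append("more than 128 instructions")
    lsids = {i["lsid"] for i in instrs if i["op"] in ("ld", "st")}
    if len(lsids) > 32:
        errors.append("more than 32 load/store IDs")
    if len(reads) > 32 or len(writes) > 32:
        errors.append("more than 32 register reads or writes")
    # 8 reads and 8 writes per bank, 4 banks (assumed bank = register mod 4)
    for bank in range(4):
        if sum(1 for r in reads if r % 4 == bank) > 8:
            errors.append(f"more than 8 reads in bank {bank}")
        if sum(1 for w in writes if w % 4 == bank) > 8:
            errors.append(f"more than 8 writes in bank {bank}")
    if sum(1 for i in instrs if i["op"] == "br") != 1:
        errors.append("block must have exactly one (terminating) branch")
    return errors
```

A phase that may have violated a constraint (e.g. hyperblock formation overgrowing a block) would run such a check and split or re-form the block on failure.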


Compiler Phases (Classic)

TIL: TRIPS Intermediate Language - RISC-like three-address form

TASL: TRIPS Assembly Language - dataflow target form w/ locations encoded

Scale Compiler (UTexas/UMass):

  • Frontends: C, FORTRAN

  • Inlining, unrolling/flattening

  • Scalar optimizations: PRE, global value numbering, scalar replacement, global variable replacement, SCC, copy propagation, array access strength reduction, LICM, tree height reduction, useless copy removal, dead variable elimination

  • Code generation: Alpha, SPARC, PPC, TRIPS TIL


Backend Compiler Flow

TIL
  → Hyperblock Formation: if-conversion, loop peeling, while-loop unrolling, instruction merging, predicate optimizations
  → Resource Allocation: register allocation, reverse if-conversion & split, load/store ID assignment, SSA for constant outputs
  → Scheduling: fanout insertion, instruction placement, target form generation
  → TASL


Correctness:Progressively Satisfy Constraints

Constraint checklist:

  • 128 instructions

  • 32 load/store IDs

  • 32 register reads/writes (8 per bank, 4 banks)

  • constant output

[Figure: the backend flow from TIL through Hyperblock Formation, Resource Allocation, and Scheduling to TASL, annotated with which constraints each phase satisfies]



Predication & Hyperblock Formation

Predication

  • Convert control dependence to data dependence

  • Improves instruction fetch bandwidth

  • Eliminates branch mispredictions

  • Adds overhead

  • Any instruction can have a predicate, but...

  • Predicate head (low power) or bottom (speculative)

    Hyperblock

  • Scheduling region (set of basic blocks)

  • Single entry, multiple exit, predicated instructions

  • Expose parallelism without oversaturating resources

  • Must satisfy block constraints

[Figure: a predicate p routed to the head of a dependence chain (low power) vs. to the bottom (speculative)]


Accuracy?

[Repeat of the constraint checklist and backend-flow diagram: how accurately can each phase estimate the constraints before later phases enforce them?]




Spatial Scheduling Problem

Partitioned microarchitecture

[Figure: a dataflow graph of loads, multiplies, adds, and a store awaiting placement]


Spatial Scheduling Problem

Partitioned microarchitecture; anchor points

[Figure: the same dataflow graph with its loads, stores, and register accesses pinned to anchor points]


Spatial Scheduling Problem

Partitioned microarchitecture; anchor points; balance latency and concurrency

[Figure: alternative placements of the dataflow graph trading communication latency against concurrency]


Outline

  • Background

  • Spatial Path Scheduling

  • Simulated Annealing

  • Extending SPS

  • Conclusions and Future Work


Dissecting the Problem

  • Scheduling can have two components

    • Placement: Where an instruction executes

    • Issue: When an instruction executes

Placement × issue taxonomy:

  • Static placement, static issue (SPSI): VLIW

  • Static placement, dynamic issue (SPDI): EDGE / TRIPS

  • Dynamic placement, static issue (DPSI): a bad idea

  • Dynamic placement, dynamic issue (DPDI): superscalars


Explicit Data Graph Execution

  • Block-atomic execution

    • Instruction groups fetch, execute, and commit atomically

  • Direct instruction communication

    • Explicitly encode the dataflow graph by specifying targets

RISC (centralized register file):

  add r1, r4, r5
  add r2, r5, r6
  add r3, r1, r2

EDGE (instructions target instructions, not registers):

  i1: add i3
  i2: add i3
  i3: add i4
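Direct target encoding can be mimicked with a toy dataflow interpreter. This is a sketch only: `ARITY`, `run_block`, and the operand-delivery scheme are illustrative assumptions, not the TRIPS ISA.

```python
# Toy interpreter for EDGE-style direct instruction communication: each
# instruction names the instructions (not registers) that consume its result.
ARITY = {"add": 2, "mov": 1}

def run_block(instrs, inputs):
    """instrs: {name: (op, [target names])}; inputs: [(target name, value)]."""
    operands = {name: [] for name in instrs}
    results = {}
    for target, value in inputs:          # register reads inject block inputs
        operands[target].append(value)
    ready = [n for n in instrs if len(operands[n]) == ARITY[instrs[n][0]]]
    while ready:                          # dataflow firing: run when operands arrive
        name = ready.pop()
        op, targets = instrs[name]
        vals = operands[name]
        results[name] = sum(vals) if op == "add" else vals[0]
        for t in targets:                 # deliver result directly to consumers
            operands[t].append(results[name])
            if len(operands[t]) == ARITY[instrs[t][0]]:
                ready.append(t)
    return results
```

Running the three-add example delivers each sum straight to its consumer with no shared register names in between.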


Scheduling for TRIPS

[Figure: the TRIPS tile topology - a control tile, 4 register tiles (R0-R3) along the register file, 4 data tiles (D0-D3) along the data cache, and a 4×4 grid of execution tiles (E0-E15)]

  • TRIPS ISA

    • Up to 128 instructions/block

    • Any instruction can be in any slot

  • TRIPS microarchitecture

    • Up to 8 blocks in flight

    • 1-cycle latency between adjacent ALUs

  • Known

    • Execution latencies

    • Lower bound for communication latency

  • Unknown (estimated)

    • Memory access latencies

    • Resource conflicts




Greedy Scheduling for TRIPS

  • GRST [PACT ‘04]: Based on VLIW list-scheduling

  • Augmented with five heuristics

    • Prioritizes critical path (C)

    • Reprioritizes after each placement (R)

    • Accounts for data cache locality (L)

    • Accounts for register output locality (O)

    • Load balancing for local issue contention (B)

  • Drawbacks

    • Unnecessary restrictions on scheduling order

    • Inelegant and overly specific

Goal: replace these heuristics with an elegant approach designed for spatial scheduling




Outline

  • Background

  • Spatial Path Scheduling

  • Simulated Annealing

  • Extending SPS

  • Conclusions and Future Work


Spatial Path Scheduling Overview

Scheduler: (dataflow graph, placement topology) → placement

[Figure: an example dataflow graph of reads, adds, multiplies, loads, and a branch mapped by the scheduler onto the tile topology; legend: register, data cache, execution, and control tiles]





Spatial Path Scheduling Overview

[Figure: the running example dataflow graph - read R1, read R2, adds, multiplies, two loads, a branch, and write R1]

Initialize all known anchor points

Until all instructions are scheduled:

  • Populate the open list

  • Find placement costs

  • Choose the minimum cost location

  • Schedule the instruction whose minimum placement cost is largest

    (Choose the max of the mins)


Spatial Path Scheduling Example

  • Initialize all known anchor points

[Figure: the register reads/writes and control flow are anchored at their register-file banks and the control tile; execution instructions remain unplaced]

Spatial Path Scheduling Example

  • Populate the open list

Open list: instructions that are candidates for scheduling. We include instructions with no parents, or with at least one placed parent.


Spatial Path Scheduling Example

  • Calculate the placement cost for each instruction in the open list at each slot

Placement cost(i, slot): longest path length through i if placed at slot

cost = inputCost + execCost + outputCost (includes communication and execution latencies)
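The cost formula can be written down directly. This sketch assumes Manhattan-distance hop latency between grid tiles and that each placed parent/child carries its own accumulated path length; the function and parameter names are illustrative.

```python
# Sketch of the SPS placement-cost computation for one (instruction, slot)
# pair: cost = inputCost + execCost + outputCost.
def placement_cost(slot, exec_latency, placed_parents, placed_children):
    """placed_parents: [(slot, path length so far)];
    placed_children: [(slot, remaining path length)]."""
    def hops(a, b):
        # lower-bound communication latency: one cycle per grid hop
        return abs(a[0] - b[0]) + abs(a[1] - b[1])
    # inputCost: latest arrival among placed producers
    input_cost = max((ready + hops(p_slot, slot)
                      for p_slot, ready in placed_parents), default=0)
    # outputCost: longest remaining path to placed consumers
    output_cost = max((hops(slot, c_slot) + tail
                       for c_slot, tail in placed_children), default=0)
    return input_cost + exec_latency + output_cost
```

With a parent 1 hop away carrying a 15-cycle path, a 3-cycle execute latency, and a consumer 1 hop away with a 2-cycle tail, the cost is 16 + 3 + 3 = 22, matching the worked example's arithmetic.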


Spatial Path Scheduling Example

  • Calculate the placement cost for each instruction in the open list at each slot

[Figure: evaluating the mul at one execution tile - 1 cycle from the register file, a 3-cycle execute latency, and 3- and 5-cycle paths onward; total placement cost = 16 + 3 + 3 = 22]


Spatial Path Scheduling Example

  • Calculate the placement cost for each instruction in the open list at each slot

[Figure: a table of per-slot placement costs - mul: 22/22/24/26, add: 24/24/26/28, write R1: 26/26/28/30]


Spatial Path Scheduling Example

  • Choose the minimum-cost location for each instruction

[Figure: the same per-slot cost table with each instruction’s minimum-cost slot highlighted]


Spatial Path Scheduling Example

  • Break ties

  • Example heuristics:

    • Links consumed

    • ALU utilization

[Figure: the cost table shows tied minimum costs (22 vs. 22) for an instruction’s two cheapest slots]


Spatial Path Scheduling Example

  • Place the instruction with the highest minimum cost (choose the max of the mins)

[Figure: the mul is placed on the execution tile adjacent to its data-tile anchors]


Spatial Path Scheduling Algorithm

Schedule(block, topology):
  initialize known anchor points
  while not all instructions are scheduled:
    for each instruction i in the open list:
      for each available location n:
        calculate the placement cost for (i, n)
        track the n with the minimum placement cost
      track the i with the highest minimum placement cost
    schedule the i with the highest minimum placement cost

Per-block complexity (i = # of instructions, n = # of ALUs):
  SPS: O(i² * n)
  GRST: O(i² + i * n)
  Exhaustive search: O(i!)
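The loop above can be made concrete. This is a runnable sketch, not the TRIPS scheduler: it assumes a grid topology with one cycle per Manhattan hop, ignores contention, and the `Instr` record and its fields are illustrative.

```python
# Minimal Spatial Path Scheduling loop: pick the instruction whose best
# (minimum) placement cost is largest, and place it there.
from dataclasses import dataclass, field

@dataclass
class Instr:
    name: str
    latency: int = 1
    parents: list = field(default_factory=list)
    children: list = field(default_factory=list)
    slot: tuple = None        # (row, col); anchors arrive pre-placed
    cost: int = 0             # path length through this instruction

def hops(a, b):
    return abs(a[0] - b[0]) + abs(a[1] - b[1])

def placement_cost(i, slot):
    # longest path through i at slot: inputCost + execCost + outputCost
    inp = max((p.cost + hops(p.slot, slot) for p in i.parents if p.slot), default=0)
    out = max((hops(slot, c.slot) + c.cost for c in i.children if c.slot), default=0)
    return inp + i.latency + out

def schedule(instrs, slots):
    unplaced = [i for i in instrs if i.slot is None]
    while unplaced:
        # open list: no parents, or at least one placed parent
        open_list = [i for i in unplaced
                     if not i.parents or any(p.slot for p in i.parents)]
        best_i, best_slot, best_min = None, None, -1
        for i in open_list:
            min_cost, min_slot = min((placement_cost(i, s), s) for s in slots)
            if min_cost > best_min:              # max of the mins
                best_i, best_slot, best_min = i, min_slot, min_cost
        best_i.slot, best_i.cost = best_slot, best_min
        unplaced.remove(best_i)
    return {i.name: i.slot for i in instrs}
```

Each pass scans every open instruction against every slot, which is where the O(i² * n) per-block complexity comes from.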


SPS Benefits and Limitations

  • Benefits

    • Automatically exploits known communication latencies

    • Designed for spatial scheduling

    • Minimizes critical path length at each step

    • Naturally encompasses four of five GRST heuristics

  • Limitations of basic algorithm

    • Does not account for resource contention

    • Uses no global information

    • Minimum communication latencies may be optimistic


Experimental Methodology

  • 26 hand-optimized microbenchmarks

    • Extracted from SPEC2000, EEMBC, Livermore Loops, MediaBench, and C libraries

    • Average dynamic instructions fetched per block: 67.3 (range 14.5 to 117.5)

  • Cycle-accurate simulator

    • Within 4% of RTL on average

    • Models communication and contention delays

  • Comparison points

    • Greedy Scheduling for TRIPS (GRST)

    • Simulated annealing


SPS Performance

[Chart: per-benchmark speedup of basic SPS over GRST; geometric mean 1.19]




Outline

  • Background

  • Spatial Path Scheduling

  • Simulated Annealing

  • Extending SPS

  • Conclusions and Future Work


How well can we do?

  • Simulated annealing

    • Artificial intelligence search technique

    • Uses random perturbations to avoid local optima

    • Approximates a global optimum

  • Cost function: simulated cycles

    • Uncertainty makes static cost functions insufficient

    • Simulated cycle count is the best cost function available

  • Purpose

    • Optimization

    • Discover performance upper bound

    • Tool to improve scheduler
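The annealing search can be sketched generically. This is an illustrative placement annealer, not the authors' tool: the perturbation (swap two instructions' slots), cooling schedule, and parameters are assumptions, and the cost function is a stand-in for the simulated cycle count used in the talk.

```python
# Simulated annealing over placements: random swap perturbations, with the
# Metropolis criterion to escape local optima as the temperature cools.
import math, random

def anneal(placement, cost_fn, steps=5000, t0=10.0, cooling=0.999, seed=0):
    rng = random.Random(seed)
    best = cur = dict(placement)
    best_cost = cur_cost = cost_fn(cur)
    t = t0
    for _ in range(steps):
        a, b = rng.sample(list(cur), 2)
        cand = dict(cur)
        cand[a], cand[b] = cand[b], cand[a]   # perturb: swap two slots
        c = cost_fn(cand)
        # Always accept improvements; sometimes accept regressions,
        # with probability falling as the temperature drops.
        if c <= cur_cost or rng.random() < math.exp((cur_cost - c) / t):
            cur, cur_cost = cand, c
            if c < best_cost:
                best, best_cost = cand, c
        t *= cooling
    return best, best_cost
```

Because the cost function is a full simulation, each step is expensive; this is why annealing serves as an upper-bound and scheduler-tuning tool rather than a production scheduler.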


Speedup with Simulated Annealing

[Chart: per-benchmark speedup over GRST for basic SPS vs. annealed schedules; geometric means - basic SPS: 1.19, annealed: 1.40]




Outline

  • Background

  • Spatial Path Scheduling

  • Simulated Annealing

  • Extending SPS

  • Conclusions and Future Work


Extending SPS

  • Contention

    • Network link contention

    • Local and Global ALU contention

  • Global register prioritization

  • Path volume scheduling


ALU Contention

  • What if two instructions are ready to execute on the same ALU at the same time?

[Figure: a placement where two instructions mapped to the same execution tile become ready in the same cycle, forcing one to wait]


Local vs. Global ALU Contention

  • Local ALU contention

    • Keep track of expected issue time

    • Increase placement cost if conflict occurs

  • Global ALU contention

    • Resource utilization in previous/next block

    • Weighting function

  • Modify placement cost
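The local case above can be sketched with a small bookkeeping class. This is an assumption-laden illustration of "track expected issue time, increase placement cost on conflict"; the class and field names are not from the TRIPS compiler.

```python
# Local ALU contention tracking: keep the next free issue cycle per tile and
# fold the expected issue delay into the placement cost on collision.
from collections import defaultdict

class ContentionTracker:
    def __init__(self):
        self.busy_until = defaultdict(int)   # tile -> next free issue cycle

    def adjusted_cost(self, base_cost, tile, ready_time):
        # If the tile is still issuing when this instruction becomes ready,
        # the conflict delays it; charge that delay to the placement cost.
        delay = max(0, self.busy_until[tile] - ready_time)
        return base_cost + delay

    def commit(self, tile, ready_time):
        # One instruction issues per tile per cycle.
        start = max(ready_time, self.busy_until[tile])
        self.busy_until[tile] = start + 1
```

Global contention cannot be tracked this precisely across blocks, which is why it is handled probabilistically with a weighting function instead.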


Speedup over GRST

[Chart: per-benchmark speedup over GRST; geometric means - basic SPS: 1.19, SPS extended: 1.31, annealed: 1.40]




Related Work

  • Scheduling for VLIW [Ellis, Fisher]

  • Scheduling for other partitioned architectures

    • Partitioned VLIW [Gilbert, Kailas, Kessler, Özer, Qian, Zalamea]

    • RAW [Lee]

    • Wavescalar [Mercaldi]

  • ASIC and FPGA place and route [Paulin]

    • Resource conflicts known statically

    • Substrate may not be fixed

    • Simulated annealing [Betz]


Conclusions and Future Work

  • Future work

    • Register allocation

    • Memory placement

    • Reliability-aware scheduling

  • Conclusions

    • General spatial instruction scheduling algorithm

    • Reasons explicitly about anchor points

    • Performance within 4% of annealed results



Mapping instructions to Physical Locations

  • Scheduler converts operand format to target format, and assigns IDs

  • ID assigned to each instruction indicates physical location

  • The microarchitecture can interpret this ID in many different ways

  • To schedule well, the scheduler must understand how the microarchitecture translates ID -> Physical location

TIL (operand format):

  read t0, g1
  read t1, g2
  muli t2, t1, 4
  ld t3, 0(t2)
  ld t4, 4(t2)
  mul t5, t3, t4
  add t6, t5, t0
  addi t7, t1, 8
  br t7
  write g1, t6

    ↓ Scheduler ↓

TASL (target format):

  R[1] read, G[1], N[5]
  R[2] read, N[2], N[6]
  N[2] muli, N[34], N[1]
  N[34] ld, N[32]
  N[1] ld, N[32]
  N[32] mul, N[5]
  N[5] add, W[1]
  N[6] addi, N[0]
  N[0] br
  W[1] write, G[1]


Mapping Instructions to Physical Locations (cont.)

[Figures: three interpretations of the same TASL code - the instruction IDs 0-127 striped across the 4×4 execution-tile grid in different orders - showing that the microarchitecture can translate an ID to a physical location in many different ways while the TASL code itself is unchanged]



Simulated Annealing

  • Cost function: Simulated cycles

  • Prune space further with critical path tool

[Chart: guided vs. unguided annealing for memset_hand]


Contention

  • ALU contention

    • Local (within a block) - Estimate temporal schedule

    • Global (between blocks) - Probabilistic - use weighting function

  • Network link contention

    • Precise measurements too inaccurate

    • Estimate with threshold, weighting function

  • Weight network link and global ALU contention based on annealed results

weight = (1 - fullness) * (1 - criticality / concurrency)


Global Register Prioritization

  • Problem: Any register dependence may be important with speculative execution

  • Solution: Extend path lengths through registers

    Register prioritization:

  • Schedule smaller loops before larger loops

  • Schedule loop-carried dependences first

  • Extend placement cost through registers to previous/next block


Path Volume Scheduling

  • Problem: The basic SPS algorithm does not account for the number of instructions in the path

  • Solution: Perform a depth-first search with iterative deepening to find the shortest path that holds all instructions
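The search described above can be sketched as iterative-deepening DFS over the tile graph. This is an illustrative reading of the solution, with an assumed per-tile instruction capacity; the function names and the capacity model are not from the TRIPS compiler.

```python
# Iterative-deepening DFS for the shortest tile path whose combined slot
# capacity can hold all instructions in the chain being placed.
def find_path(topology, start, n_instrs, slot_capacity, max_depth=16):
    # topology: tile -> list of neighboring tiles
    for depth in range(1, max_depth + 1):       # iterative deepening
        path = _dfs(topology, start, depth, [start], slot_capacity, n_instrs)
        if path:
            return path
    return None

def _dfs(topology, tile, depth, path, cap, need):
    if len(path) * cap >= need:                 # path is big enough: done
        return list(path)
    if depth == 0:
        return None
    for nxt in topology[tile]:
        if nxt not in path:                     # simple path: no revisits
            path.append(nxt)
            found = _dfs(topology, nxt, depth - 1, path, cap, need)
            path.pop()
            if found:
                return found
    return None
```

Iterative deepening keeps the search shortest-first, so the first path found is the shortest one that fits the instruction volume.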

