
Optimizing Compilers
CISC 673, Spring 2009
Instruction Scheduling

John Cavazos

University of Delaware

Instruction Scheduling
  • Reordering instructions to improve performance
  • Takes into account anticipated latencies
    • Machine-specific
  • Performed late in the optimization pass
  • Exploits instruction-level parallelism (ILP)
Modern Architectures Features
  • Superscalar
    • Multiple logic units
  • Multiple issue
    • 2 or more instructions issued per cycle
  • Speculative execution
    • Branch predictors
    • Speculative loads
  • Deep pipelines
Types of Instruction Scheduling
  • Local Scheduling
    • Basic Block Scheduling
  • Global Scheduling
    • Trace Scheduling
    • Superblock Scheduling
    • Software Pipelining
Scheduling for different Computer Architectures
  • Out-of-order Issue
    • Scheduling is useful
  • In-order issue
    • Scheduling is very important
  • VLIW
    • Scheduling is essential!
Challenges to ILP
  • Structural hazards
    • Insufficient resources to exploit parallelism
  • Data hazards
    • An instruction depends on the result of a previous instruction still in the pipeline
  • Control hazards
    • Branches and jumps modify the PC
    • They affect which instructions should be in the pipeline
Recall from Architecture…
  • IF – Instruction Fetch
  • ID – Instruction Decode
  • EX – Execute
  • MA – Memory access
  • WB – Write back

Overlapped execution: consecutive instructions occupy adjacent pipeline stages, one cycle apart.

  Instr 1:  IF  ID  EX  MA  WB
  Instr 2:      IF  ID  EX  MA  WB
  Instr 3:          IF  ID  EX  MA  WB

Structural Hazards

Instruction latency: execute takes > 1 cycle (assumes floating-point ops take 2 execute cycles).

  addf R3,R1,R2   IF  ID  EX  EX  MA  WB
  addf R3,R3,R4       IF  ID  stall  EX  EX  MA  WB

The second addf needs R3 from the first, so it stalls until the two execute cycles finish.

Data Hazards

Memory latency: data not ready.

  lw  R1,0(R2)    IF  ID  EX  MA  WB
  add R3,R1,R4        IF  ID  stall  EX  MA  WB

The add needs R1, which the load produces only at the end of its MA stage.

Control Hazards

  Taken Branch        IF  ID  EX  MA  WB
  Instr + 1               IF  --- --- --- ---
  Branch Target               IF  ID  EX  MA  WB
  Branch Target + 1               IF  ID  EX  MA  WB

The instruction after a taken branch is fetched and then squashed; fetching resumes at the branch target.

Basic Block Scheduling
  • For each basic block:
    • Construct directed acyclic graph (DAG) using dependences between statements
      • Node = statement / instruction
      • Edge (a,b) = statement a must execute before b
    • Schedule instructions using the DAG
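The DAG-construction step can be sketched as follows. This is an illustrative Python sketch, not the course's code: instructions are modeled as (destination, sources) register tuples, so memory dependences between loads and stores are ignored here.

```python
# Illustrative sketch: build a dependence DAG for one basic block.
# Each instruction is (dest, *sources); register names only, so memory
# dependences between loads/stores are not tracked.

def build_dag(block):
    """Return the set of edges (i, j): instruction i must run before j."""
    edges = set()
    for j, (dst_j, *srcs_j) in enumerate(block):
        for i, (dst_i, *srcs_i) in enumerate(block[:j]):
            raw = dst_i in srcs_j      # j reads what i wrote
            war = dst_j in srcs_i      # j overwrites what i read
            waw = dst_i == dst_j       # both write the same register
            if raw or war or waw:
                edges.add((i, j))
    return edges

block = [
    ("R2", "R1"),        # a) lw R2, (R1)
    ("R3", "R1"),        # b) lw R3, 4(R1)
    ("R4", "R2", "R3"),  # c) R4 <- R2 + R3
    ("R5", "R2"),        # d) R5 <- R2 - 1
]
print(sorted(build_dag(block)))  # [(0, 2), (0, 3), (1, 2)]
```

The edges found are a→c, b→c, and a→d: only true orderings, since the two loads are independent of each other.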
Data Dependences
  • If two operations access the same register and one access is a write, they are dependent
  • Types of data dependences

RAW (read after write):

  r1 = r2 + r3
  r4 = r1 * 6

WAW (write after write):

  r1 = r2 + r3
  r1 = r4 * 6

WAR (write after read):

  r1 = r2 + r3
  r2 = r5 * 6

Cannot reorder two dependent instructions
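A tiny checker for these three categories, as a hedged sketch (the (destination, source-set) representation is made up for illustration):

```python
# Illustrative sketch: classify the dependence between two instructions.
# Each instruction is a (dest, sources) pair; the representation is
# hypothetical, not from the slides.

def classify(first, second):
    """Return the dependence kinds from `first` to the later `second`."""
    dst1, srcs1 = first
    dst2, srcs2 = second
    kinds = []
    if dst1 in srcs2:
        kinds.append("RAW")   # second reads what first wrote
    if dst1 == dst2:
        kinds.append("WAW")   # both write the same register
    if dst2 in srcs1:
        kinds.append("WAR")   # second overwrites what first read
    return kinds

print(classify(("r1", {"r2", "r3"}), ("r4", {"r1"})))  # ['RAW']
print(classify(("r1", {"r2", "r3"}), ("r1", {"r4"})))  # ['WAW']
print(classify(("r1", {"r2", "r3"}), ("r2", {"r5"})))  # ['WAR']
```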

Basic Block Scheduling Example

Original schedule:

  a) lw R2, (R1)
  b) lw R3, 4(R1)
  c) R4 ← R2 + R3
  d) R5 ← R2 - 1

Dependence DAG: a → c, b → c, a → d, each edge with latency 2 (the loads take 2 cycles).

Schedule 1 (5 cycles):

  a) lw R2, (R1)
  b) lw R3, 4(R1)
  --- nop ---
  c) R4 ← R2 + R3
  d) R5 ← R2 - 1

Schedule 2 (4 cycles):

  a) lw R2, (R1)
  b) lw R3, 4(R1)
  d) R5 ← R2 - 1
  c) R4 ← R2 + R3

Moving the independent instruction d into the load-delay slot removes the nop.
Scheduling Algorithm
  • Construct the dependence DAG for the basic block
  • Put the DAG's roots in the candidate set
  • While the candidate set is not empty:
    • Evaluate all candidates and select the best one, using the scheduling heuristics (in order)
    • Delete the scheduled instruction from the candidate set
    • Add newly exposed candidates
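The steps above can be sketched as a minimal list scheduler. This is an illustrative Python sketch under assumed inputs: `succs` maps each node to its DAG successors and `priority` to its heuristic score (e.g., height); cycle-accurate latency tracking is omitted.

```python
# Minimal list-scheduling sketch. Roots seed the candidate ("ready")
# set; each step picks the highest-priority candidate and exposes any
# successors whose predecessors are now all scheduled.

def list_schedule(nodes, succs, priority):
    preds_left = {n: 0 for n in nodes}
    for n in nodes:
        for m in succs.get(n, []):
            preds_left[m] += 1
    ready = [n for n in nodes if preds_left[n] == 0]   # the DAG roots
    order = []
    while ready:
        best = max(ready, key=priority.get)            # heuristic choice
        ready.remove(best)
        order.append(best)
        for m in succs.get(best, []):                  # expose new candidates
            preds_left[m] -= 1
            if preds_left[m] == 0:
                ready.append(m)
    return order

# The four-instruction example: loads a, b feed c; a also feeds d.
order = list_schedule(["a", "b", "c", "d"],
                      {"a": ["c", "d"], "b": ["c"]},
                      {"a": 3, "b": 3, "c": 1, "d": 1})
print(order)  # ['a', 'b', 'd', 'c']
```

With heights as priorities, the sketch reproduces the 4-cycle schedule from the earlier example (d fills the load-delay slot).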
Instruction Scheduling Heuristics
  • Scheduling is NP-complete, so we need heuristics
  • Bias the scheduler to prefer instructions that:
    • Can execute earliest
    • Have many successors
      • More flexibility in scheduling
    • Make progress along the critical path
    • Free registers
      • Reduce register pressure
  • A combination of heuristics can be used
Computing Priorities

Height(n) =
  • exec(n), if n is a leaf
  • max(Height(m)) + exec(n) over all successors m of n, otherwise

Critical path(s) = the path through the dependence DAG with the longest latency
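The recurrence above translates directly into a memoized traversal. The following Python sketch assumes a successor map and per-node execution latencies (the values are illustrative):

```python
# Sketch of the Height(n) recurrence: exec(n) for a leaf, otherwise the
# maximum successor height plus exec(n). Memoized to visit each node once.

def height(n, succs, exec_time, memo=None):
    memo = {} if memo is None else memo
    if n not in memo:
        kids = succs.get(n, [])
        tail = max(height(m, succs, exec_time, memo) for m in kids) if kids else 0
        memo[n] = exec_time[n] + tail
    return memo[n]

# The earlier 4-node DAG, with loads taking 2 cycles and ALU ops 1:
succs = {"a": ["c", "d"], "b": ["c"]}
exec_time = {"a": 2, "b": 2, "c": 1, "d": 1}
print(height("a", succs, exec_time))  # 2 + max(1, 1) = 3
```

The root with the greatest height sits at the head of a critical path.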

Example – Determine Height and CP

Assume:
  • memory instructions = 3 cycles
  • mult = 2 cycles (to have the result in a register)
  • everything else = 1 cycle

[Figure: dependence DAG over nodes a–i with per-node latencies; determine each node's height.]

Critical path: _______

Example

[Figure: the same DAG with heights filled in: a = 13, c = 12, b = 10, d = 10, e = 9, g = 8, f = 7, h = 5, i = 3.]

___ cycles

Global Scheduling: Superblock
  • Definition:
    • A single trace of contiguous, frequently executed blocks
    • A single entry and multiple exits
  • Formation algorithm:
    • Pick a trace of frequently executed basic blocks
    • Eliminate side entrances (tail duplication)
  • Scheduling and optimization:
    • Speculate operations within the superblock
    • Apply optimizations to the scope defined by the superblock
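The trace-picking step of the formation algorithm can be sketched by greedily following the hottest outgoing edge from each block. This Python sketch assumes a CFG given as block → {successor: edge count}; all names and counts are illustrative, and tail duplication is not implemented here.

```python
# Sketch of trace selection: starting from the entry block, repeatedly
# follow the most frequently executed successor edge until the trace
# ends or would revisit a block.

def select_trace(cfg, entry):
    trace, node = [entry], entry
    while cfg.get(node):
        node = max(cfg[node], key=cfg[node].get)   # hottest outgoing edge
        if node in trace:                          # don't loop back
            break
        trace.append(node)
    return trace

cfg = {
    "A": {"B": 90, "C": 10},
    "B": {"E": 90},
    "C": {"E": 10},
    "E": {"F": 90},
}
print(select_trace(cfg, "A"))  # ['A', 'B', 'E', 'F']
```

On the example CFG that follows, this picks the hot path A–B–E–F; tail duplication would then copy any block on the trace that has a side entrance.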
Superblock Formation

[Figure: a CFG with blocks A (100), B (90), C (10), D (0), E (90), F (100). Select a trace through the hot path A–B–E–F; tail duplication then copies F into F' (10), leaving F (90) with the trace as its only entry.]

Select a trace, then tail duplicate.


[Figure: CSE within a superblock. A block computing r1 = r2*3 reaches a merge block computing r3 = r2*3, but a side entrance through r2 = r2+1 joins at the same block, blocking CSE. After trace selection and tail duplication, the side entrance gets its own copy of r3 = r2*3; within the superblock (single entry, so no merge), CSE rewrites r3 = r2*3 as r3 = r1.]

Optimizations within Superblock
  • By limiting the scope of optimization to superblock:
    • optimize for the frequent path
    • may enable optimizations that are not feasible otherwise (CSE, loop invariant code motion,...)
  • For example: CSE
Scheduling Algorithm Complexity
  • Time complexity: O(n²)
    • n = maximum number of instructions in a basic block
  • Building the dependence DAG is worst-case O(n²)
    • Each instruction must be compared to every other instruction
  • Scheduling then requires inspecting the remaining instructions at each step, again O(n²)
  • In the average case, each instruction is compared against only a small constant number of others (e.g., 3)
Very Long Instruction Word (VLIW)
  • The compiler determines exactly what is issued every cycle (before the program is run)
  • Schedules also account for anticipated latencies
  • Any hardware change requires a compiler change
  • Usually used in embedded systems (hence the simple hardware)
  • Itanium is actually an EPIC-style machine (the compiler accounts for most parallelism, but not latencies)
Sample VLIW code

VLIW processor: 5-issue
  • 2 Add/Sub units (1 cycle)
  • 1 Mul/Div unit (2 cycles, unpipelined)
  • 1 Ld/St unit (2 cycles, pipelined)
  • 1 Branch unit (no delay slots)

  Add/Sub     Add/Sub     Mul/Div     Ld/St        Branch
  c = a + b   d = a - b   e = a * b   ld j = [x]   nop
  g = c + d   h = c - d   nop         ld k = [y]   nop
  nop         nop         i = j * c   ld f = [z]   br g
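A VLIW compiler must pack each cycle's operations into the machine's fixed slot mix. The following sketch (unit names are made up; latency constraints such as the unpipelined Mul/Div are not checked) validates one bundle against the 2 Add/Sub + 1 Mul/Div + 1 Ld/St + 1 Branch configuration above:

```python
# Sketch: check that a bundle of operations fits the assumed slot mix
# (2 Add/Sub, 1 Mul/Div, 1 Ld/St, 1 Branch). Only structural slot
# availability is modeled, not latencies.

SLOTS = ["addsub", "addsub", "muldiv", "ldst", "branch"]

def fits(bundle):
    """bundle: list of (unit, op) pairs; True if the slot mix allows it."""
    free = list(SLOTS)
    for unit, _ in bundle:
        if unit not in free:
            return False        # no free slot of that kind left
        free.remove(unit)
    return True

cycle1 = [("addsub", "c = a + b"), ("addsub", "d = a - b"),
          ("muldiv", "e = a * b"), ("ldst", "ld j = [x]")]
print(fits(cycle1))                                  # True
print(fits(cycle1 + [("muldiv", "i = j * c")]))      # False: only one Mul/Div slot
```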

Next Time
  • Phase-ordering