Optimizing Compilers CISC 673 Spring 2009 Instruction Scheduling - PowerPoint PPT Presentation

Presentation Transcript

  1. Optimizing Compilers, CISC 673, Spring 2009: Instruction Scheduling. John Cavazos, University of Delaware

  2. Instruction Scheduling • Reordering instructions to improve performance • Takes into account anticipated latencies • Machine-specific • Performed late in optimization pass • Instruction-Level Parallelism (ILP)

  3. Modern Architecture Features • Superscalar: multiple logic units • Multiple issue: 2 or more instructions issued per cycle • Speculative execution: branch predictors, speculative loads • Deep pipelines

  4. Types of Instruction Scheduling • Local Scheduling • Basic Block Scheduling • Global Scheduling • Trace Scheduling • Superblock Scheduling • Software Pipelining

  5. Scheduling for Different Computer Architectures • Out-of-order issue: scheduling is useful • In-order issue: scheduling is very important • VLIW: scheduling is essential!

  6. Challenges to ILP • Structural hazards: insufficient resources to exploit parallelism • Data hazards: an instruction depends on the result of a previous instruction still in the pipeline • Control hazards: branches and jumps modify the PC, which affects which instructions should be in the pipeline

  7. Recall from Architecture… • IF – Instruction Fetch • ID – Instruction Decode • EX – Execute • MA – Memory access • WB – Write back
     Pipeline diagram, three overlapping instructions:
       IF ID EX MA WB
          IF ID EX MA WB
             IF ID EX MA WB

  8. Structural Hazards • Instruction latency: execute takes > 1 cycle
       addf R3,R1,R2    IF ID EX EX MA WB
       addf R3,R3,R4       IF ID stall EX EX MA WB
     Assumes floating point ops take 2 execute cycles

  9. Data Hazards • Memory latency: data not ready
       lw  R1,0(R2)     IF ID EX MA WB
       add R3,R1,R4        IF ID EX stall MA WB

  10. Control Hazards
        Taken Branch         IF ID EX MA WB
        Instr + 1            IF --- --- --- ---
        Branch Target        IF ID EX MA WB
        Branch Target + 1    IF ID EX MA WB

  11. Basic Block Scheduling • For each basic block: • Construct directed acyclic graph (DAG) using dependences between statements • Node = statement / instruction • Edge (a,b) = statement a must execute before b • Schedule instructions using the DAG

  12. Data Dependences • If two operations access the same register and at least one access is a write, they are dependent • Types of data dependences:
        RAW (read after write):    r1 = r2 + r3 ; r4 = r1 * 6
        WAR (write after read):    r1 = r2 + r3 ; r2 = r5 * 6
        WAW (write after write):   r1 = r2 + r3 ; r1 = r4 * 6
      • Cannot reorder two dependent instructions
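
The three dependence types can be checked mechanically. The following is a minimal Python sketch, assuming each instruction is reduced to one destination register and a tuple of source registers; the `Instr` type and the `dependences` function are illustrative names, not part of the lecture.

```python
from typing import NamedTuple, Tuple


class Instr(NamedTuple):
    """Hypothetical instruction model: one destination, several sources."""
    dest: str
    srcs: Tuple[str, ...]


def dependences(first: Instr, second: Instr) -> set:
    """Return the dependence kinds that force `first` to stay before `second`."""
    kinds = set()
    if first.dest in second.srcs:
        kinds.add("RAW")   # second reads what first wrote
    if second.dest in first.srcs:
        kinds.add("WAR")   # second overwrites what first read
    if first.dest == second.dest:
        kinds.add("WAW")   # both write the same register
    return kinds


first = Instr("r1", ("r2", "r3"))                  # r1 = r2 + r3
print(dependences(first, Instr("r4", ("r1",))))    # r4 = r1 * 6
print(dependences(first, Instr("r2", ("r5",))))    # r2 = r5 * 6
print(dependences(first, Instr("r1", ("r4",))))    # r1 = r4 * 6
```

An empty result means the pair may be reordered freely, at least as far as register dependences are concerned; memory dependences are outside this sketch.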

  13. Basic Block Scheduling Example
      Original schedule:
        a) lw R2, (R1)
        b) lw R3, (R1) 4
        c) R4 ← R2 + R3
        d) R5 ← R2 - 1
      Dependence DAG (each edge has latency 2): a → c, b → c, a → d
      Schedule 1 (5 cycles): a) lw R2, (R1) • b) lw R3, (R1) 4 • --- nop --- • c) R4 ← R2 + R3 • d) R5 ← R2 - 1
      Schedule 2 (4 cycles): a) lw R2, (R1) • b) lw R3, (R1) 4 • d) R5 ← R2 - 1 • c) R4 ← R2 + R3

  14. Scheduling Algorithm • Construct the dependence DAG for the basic block • Put the roots in the candidate set • Use the scheduling heuristics (in order) to select instructions • While the candidate set is not empty: (1) evaluate all candidates and select the best one; (2) delete the scheduled instruction from the candidate set; (3) add newly exposed candidates
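
The loop above can be sketched in Python as follows. This is one possible reading, not the lecture's implementation: it assumes a single-issue machine, tracks the earliest cycle at which each instruction's operands are available, and takes the heuristic as a precomputed `priority` table. The function name, the dictionary layout, and the priority values are all assumptions. The data is the four-instruction block from the basic block scheduling example, with loads taking 2 cycles and the other operations 1.

```python
def list_schedule(latency, succs, priority):
    """Greedy list scheduling for a single-issue machine (an assumption).

    latency:  instr -> cycles until its result is available
    succs:    instr -> instructions that depend on it
    priority: instr -> number; larger values are scheduled first
    Returns a list of (issue_cycle, instr) pairs.
    """
    # Count the unscheduled predecessors of every instruction.
    pending = {n: 0 for n in latency}
    for n in latency:
        for s in succs.get(n, ()):
            pending[s] += 1

    earliest = {n: 1 for n in latency}                     # first legal issue cycle
    candidates = {n for n in latency if pending[n] == 0}   # roots of the DAG
    schedule, cycle = [], 1

    while candidates:
        ready = [n for n in candidates if earliest[n] <= cycle]
        if ready:
            # Highest priority wins; the name breaks ties deterministically.
            best = min(ready, key=lambda n: (-priority[n], n))
            schedule.append((cycle, best))
            candidates.remove(best)
            for s in succs.get(best, ()):
                earliest[s] = max(earliest[s], cycle + latency[best])
                pending[s] -= 1
                if pending[s] == 0:
                    candidates.add(s)   # newly exposed candidate
        # If nothing was ready, this cycle is a stall (a nop).
        cycle += 1
    return schedule


latency = {"a": 2, "b": 2, "c": 1, "d": 1}
succs = {"a": ["c", "d"], "b": ["c"]}
priority = {"a": 3, "b": 3, "c": 1, "d": 1}   # assumed: latency-weighted path length
print(list_schedule(latency, succs, priority))
# [(1, 'a'), (2, 'b'), (3, 'd'), (4, 'c')]
```

With these inputs the sketch issues d in the cycle where c would otherwise wait for the second load, which reproduces the 4-cycle schedule.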

  15. Instruction Scheduling Heuristics • Optimal scheduling is NP-complete, so we need heuristics • Bias the scheduler to prefer instructions that: • have the earliest execution time • have many successors (more flexibility in scheduling) • make progress along the critical path • free registers (reduce register pressure) • Heuristics can be combined

  16. Computing Priorities • Height(n) = exec(n) if n is a leaf • Height(n) = exec(n) + max(Height(m)) over all successors m of n, otherwise • Critical path(s) = path(s) through the dependence DAG with the longest latency
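
The height recurrence translates directly into a memoized traversal. A minimal Python sketch follows; the function names and the small four-node DAG are made up for illustration and are not the DAG from the lecture's example.

```python
def heights(exec_time, succs):
    """Height(n) = exec(n) for a leaf, else exec(n) + max height of a successor."""
    memo = {}

    def height(n):
        if n not in memo:
            below = [height(m) for m in succs.get(n, ())]
            memo[n] = exec_time[n] + (max(below) if below else 0)
        return memo[n]

    return {n: height(n) for n in exec_time}


def critical_path(exec_time, succs):
    """Follow the tallest successor from the tallest node down to a leaf."""
    h = heights(exec_time, succs)
    node = max(h, key=h.get)
    path = [node]
    while succs.get(node):
        node = max(succs[node], key=h.get)
        path.append(node)
    return path


# Hypothetical DAG: w and x both feed y, which feeds z.
exec_time = {"w": 3, "x": 1, "y": 2, "z": 3}
succs = {"w": ["y"], "x": ["y"], "y": ["z"]}
print(heights(exec_time, succs))        # {'w': 8, 'x': 6, 'y': 5, 'z': 3}
print(critical_path(exec_time, succs))  # ['w', 'y', 'z']
```

The height of the tallest node is a lower bound on the schedule length, which is why height is a common priority for the scheduler.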

  17. Example – Determine Height and CP • Assume: memory instructions = 3 cycles, mult = 2 cycles (to have the result in a register), rest = 1 cycle • [Figure: dependence DAG over instructions a through i, annotated with latencies] • Critical path: _______

  18. Example • [Figure: the same DAG over instructions a through i, annotated with latencies and computed heights] • ___ cycles

  19. Global Scheduling: Superblock • Definition: a single trace of contiguous, frequently executed blocks with a single entry and multiple exits • Formation algorithm: pick a trace of frequently executed basic blocks, then eliminate side entrances (tail duplication) • Scheduling and optimization: speculate operations in the superblock; apply optimizations to the scope defined by the superblock

  20. Superblock Formation • [Figure: control-flow graph with block execution counts A 100, B 90, C 10, D 0, E 90, F 100. Step 1: select a trace. Step 2: tail duplicate, which splits F into F (90) and F’ (10).]
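
Tail duplication can be sketched on a small control-flow graph. The Python below is a simplified illustration under stated assumptions: the block names follow the formation slide, but the edges are assumed, since the transcript preserves only the execution counts. The function `tail_duplicate` is a hypothetical helper that copies the trace from the first block with a side entrance onward and redirects every side entrance to the copy.

```python
def tail_duplicate(succs, trace):
    """Remove side entrances into `trace` by duplicating its tail.

    succs: block -> list of successor blocks
    trace: blocks of the selected trace, in order
    Returns a new successor map; copies are named with a trailing apostrophe.
    """
    preds = {}
    for block, targets in succs.items():
        for t in targets:
            preds.setdefault(t, set()).add(block)

    # Find the first trace block entered from anywhere but its trace predecessor.
    first = None
    for i in range(1, len(trace)):
        if preds.get(trace[i], set()) - {trace[i - 1]}:
            first = i
            break

    new = {block: list(targets) for block, targets in succs.items()}
    if first is None:
        return new   # already a superblock

    copy = {block: block + "'" for block in trace[first:]}
    # A copy keeps the original's successors but stays inside the copied tail.
    for block in copy:
        new[copy[block]] = [copy.get(t, t) for t in succs.get(block, [])]
    # Redirect every side entrance to the copy.
    for block, targets in succs.items():
        for idx, t in enumerate(targets):
            if t in copy and block != trace[trace.index(t) - 1]:
                new[block][idx] = copy[t]
    return new


# Assumed edges for the six blocks on the slide.
cfg = {"A": ["B", "C"], "B": ["D", "E"], "C": ["F"],
       "D": ["F"], "E": ["F"], "F": []}
print(tail_duplicate(cfg, ["A", "B", "E", "F"]))
```

Under these assumed edges, C and D are redirected to the copy F' while E keeps the original F, so the trace A, B, E, F is left with a single entry.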

  21. Optimizations within Superblock • By limiting the scope of optimization to the superblock: • optimize for the frequent path • may enable optimizations that are not feasible otherwise (CSE, loop-invariant code motion, ...) • For example, CSE:
      [Figure, three panels]
        Trace selection:    r1 = r2*3 • r2 = r2 + 1 • r3 = r2*3
        Tail duplication:   r1 = r2*3 • r2 = r2 + 1 • r3 = r2*3 • r3 = r2*3 (duplicate)
        CSE within superblock (no merge since single entry):   r1 = r2*3 • r2 = r2 + 1 • r3 = r1 • r3 = r2*3

  22. Scheduling Algorithm Complexity • Time complexity: O(n^2), where n = max number of instructions in a basic block • Building the dependence DAG: worst case O(n^2), since each instruction must be compared to every other instruction • Scheduling then requires each instruction to be inspected at each step: O(n^2) • Average case: small constant (e.g., 3)

  23. Very Long Instruction Word (VLIW) • Compiler determines exactly what is issued every cycle (before the program is run) • Schedules also account for latencies • All hardware changes result in a compiler change • Usually embedded systems (hence simple HW) • Itanium is actually an EPIC-style machine (accounts for most parallelism, not latencies)

  24. Sample VLIW code • VLIW processor: 5 issue • 2 Add/Sub units (1 cycle) • 1 Mul/Div unit (2 cycles, unpipelined) • 1 LD/ST unit (2 cycles, pipelined) • 1 Branch unit (no delay slots)
      Add/Sub     Add/Sub     Mul/Div     Ld/St        Branch
      c = a + b   d = a - b   e = a * b   ld j = [x]   nop
      g = c + d   h = c - d   nop         ld k = [y]   nop
      nop         nop         i = j * c   ld f = [z]   br g

  25. Next Time • Phase-ordering