Optimizing Compilers CISC 673 Spring 2009 Instruction Scheduling


**Optimizing Compilers, CISC 673, Spring 2009: Instruction Scheduling**

John Cavazos, University of Delaware

**Instruction Scheduling**

• Reordering instructions to improve performance
• Takes anticipated latencies into account
• Machine-specific
• Performed late in the optimization pass
• Exploits instruction-level parallelism (ILP)

**Modern Architecture Features**

• Superscalar: multiple logic units
• Multiple issue: 2 or more instructions issued per cycle
• Speculative execution: branch predictors, speculative loads
• Deep pipelines

**Types of Instruction Scheduling**

• Local scheduling: basic block scheduling
• Global scheduling: trace scheduling, superblock scheduling, software pipelining

**Scheduling for Different Computer Architectures**

• Out-of-order issue: scheduling is useful
• In-order issue: scheduling is very important
• VLIW: scheduling is essential!

**Challenges to ILP**

• Structural hazards: insufficient resources to exploit parallelism
• Data hazards: an instruction depends on the result of a previous instruction still in the pipeline
• Control hazards: branches and jumps modify the PC, affecting which instructions should be in the pipeline

**Recall from Architecture…**

The classic five-stage pipeline:

• IF – instruction fetch
• ID – instruction decode
• EX – execute
• MA – memory access
• WB – write back

Successive instructions overlap, each one stage behind the previous:

| cycle | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|
| instr 1 | IF | ID | EX | MA | WB | | |
| instr 2 | | IF | ID | EX | MA | WB | |
| instr 3 | | | IF | ID | EX | MA | WB |

**Structural Hazards**

Instruction latency: execute takes more than one cycle. Assuming floating-point ops take 2 execute cycles:

| cycle | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|
| addf R3,R1,R2 | IF | ID | EX | EX | MA | WB | | |
| addf R3,R3,R4 | | IF | ID | stall | EX | EX | MA | WB |

**Data Hazards**

Memory latency: the loaded data is not ready when the dependent instruction needs it.

| cycle | 1 | 2 | 3 | 4 | 5 | 6 | 7 |
|---|---|---|---|---|---|---|---|
| lw R1,0(R2) | IF | ID | EX | MA | WB | | |
| add R3,R1,R4 | | IF | ID | EX | stall | MA | WB |

**Control Hazards**

| cycle | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|
| taken branch | IF | ID | EX | MA | WB | | | |
| instr + 1 | | IF | --- | --- | --- | --- | | |
| branch target | | | IF | ID | EX | MA | WB | |
| branch target + 1 | | | | IF | ID | EX | MA | WB |

**Basic Block Scheduling**

• For each basic block:
  • Construct a directed acyclic graph (DAG) from the dependences between statements
  • Node = statement/instruction
  • Edge (a, b) = statement a must execute before b
  • Schedule instructions using the DAG

**Data Dependences**

• If two operations access the same register and at least one access is a write, they are dependent
• Two dependent instructions cannot be reordered
• Types of data dependences:

| type | example |
|---|---|
| RAW (read after write) | r1 = r2 + r3; r4 = r1 * 6 |
| WAW (write after write) | r1 = r2 + r3; r1 = r4 * 6 |
| WAR (write after read) | r1 = r2 + r3; r2 = r5 * 6 |

**Basic Block Scheduling Example**

Original order:

• a) lw R2, (R1)
• b) lw R3, 4(R1)
• c) R4 ← R2 + R3
• d) R5 ← R2 − 1

Dependence DAG: a → c, a → d, b → c, with latency 2 on the load edges.

• Schedule 1 (5 cycles): a, b, nop, c, d
• Schedule 2 (4 cycles): a, b, d, c — moving d into the load-delay slot hides the latency

**Scheduling Algorithm**

• Construct the dependence DAG for the basic block
• Put the roots in the candidate set
• While the candidate set is not empty:
  • Evaluate all candidates using the scheduling heuristics (in order) and select the best one
  • Delete the scheduled instruction from the candidate set
  • Add newly exposed candidates

**Instruction Scheduling Heuristics**

• Optimal scheduling is NP-complete, so we need heuristics
• Bias the scheduler to prefer instructions that:
  • have the earliest execution time
  • have many successors (more flexibility in later scheduling)
  • make progress along the critical path
  • free registers (reduce register pressure)
• A combination of heuristics can be used

**Computing Priorities**

height(n) =
• exec(n), if n is a leaf
• max(height(m)) + exec(n) over all successors m of n, otherwise

Critical path = the path through the dependence DAG with the longest latency.

**Example – Determine Height and CP**

Assume memory instructions take 3 cycles, multiplies 2 (to have the result in a register), and everything else 1 cycle.

[Figure: a dependence DAG with nodes a–i; determine the height of each node and the critical path.]

**Example (solution)**

[Figure: the same DAG annotated with heights a = 13, b = 10, c = 12, d = 10, e = 9, f = 7, g = 8, h = 5, i = 3; the critical path length is the height of the root.]

**Global Scheduling: Superblock**

• Definition:
  • a single trace of contiguous, frequently executed blocks
  • a single entry and multiple exits
• Formation algorithm:
  • pick a trace of frequently executed basic blocks
  • eliminate side entrances (tail duplication)
• Scheduling and optimization:
  • speculate operations in the superblock
  • apply optimizations to the scope defined by the superblock

**Superblock Formation**

[Figure: a CFG with blocks A (100), B (90), C (10), D (0), E (90), F (100); the trace A → B → E → F is selected, and tail duplication creates a copy F′ (10) so the trace has a single entry.]

**Optimizations within a Superblock**

• By limiting the scope of optimization to the superblock:
  • we optimize for the frequent path
  • optimizations that are otherwise infeasible may be enabled (CSE, loop-invariant code motion, …)
• Example, CSE:

[Figure: r1 = r2*3 on the frequent path, r2 = r2 + 1 on a side path, and r3 = r2*3 at the join; after tail duplication the frequent path has a single entry, so CSE replaces its r3 = r2*3 with r3 = r1.]

**Scheduling Algorithm Complexity**

• Time complexity: O(n²), where n is the maximum number of instructions in a basic block
• Building the dependence DAG is worst-case O(n²): each instruction must be compared to every other instruction
• Scheduling then requires each instruction to be inspected at each step, again O(n²)
• Average case: a small constant number of comparisons per instruction (e.g., 3)

**Very Long Instruction Word (VLIW)**

• The compiler determines exactly what is issued every cycle (before the program is run)
• Schedules also account for latencies
• All hardware changes result in a compiler change
• Usually embedded systems (hence simple hardware)
• Itanium is actually an EPIC-style machine (the compiler accounts for most parallelism, but not latencies)

**Sample VLIW Code**

VLIW processor, 5-issue:

• 2 Add/Sub units (1 cycle)
• 1 Mul/Div unit (2 cycles, unpipelined)
• 1 Ld/St unit (2 cycles, pipelined)
• 1 Branch unit (no delay slots)

| Add/Sub | Add/Sub | Mul/Div | Ld/St | Branch |
|---|---|---|---|---|
| c = a + b | d = a - b | e = a * b | ld j = [x] | nop |
| g = c + d | h = c - d | nop | ld k = [y] | nop |
| nop | nop | i = j * c | ld f = [z] | br g |

**Next Time**

• Phase-ordering
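The RAW/WAW/WAR classification above can be sketched in a few lines of Python. The tuple encoding of an instruction (destination register, set of source registers) is my own illustration, not from the slides:

```python
def classify(first, second):
    """Classify data dependences from `first` to a later instruction
    `second`. Each instruction is (dest_register, source_registers)."""
    d1, srcs1 = first
    d2, srcs2 = second
    deps = set()
    if d1 in srcs2:
        deps.add("RAW")   # second reads what first wrote
    if d1 == d2:
        deps.add("WAW")   # both write the same register
    if d2 in srcs1:
        deps.add("WAR")   # second overwrites a register first reads
    return deps

# The slides' three example pairs (first instruction: r1 = r2 + r3):
print(classify(("r1", {"r2", "r3"}), ("r4", {"r1"})))  # {'RAW'}
print(classify(("r1", {"r2", "r3"}), ("r1", {"r4"})))  # {'WAW'}
print(classify(("r1", {"r2", "r3"}), ("r2", {"r5"})))  # {'WAR'}
```

An empty result means the two instructions are independent and the scheduler is free to reorder them.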
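The load-use stall from the data-hazard slide can be demonstrated with a toy in-order pipeline model. This is a minimal sketch, assuming exactly one stall cycle whenever an instruction consumes the result of the immediately preceding load; the function name and instruction encoding are mine:

```python
def count_load_use_stalls(instrs):
    """Count stall cycles in a simple in-order pipeline where a load's
    result is available only after its MA stage, so an immediately
    following consumer stalls one cycle (the slides' lw/add example).
    Each instruction is (opcode, dest_register, source_registers)."""
    stalls = 0
    for prev, cur in zip(instrs, instrs[1:]):
        op, dest, _ = prev
        _, _, srcs = cur
        if op == "lw" and dest in srcs:
            stalls += 1
    return stalls

# The slides' example: the add uses R1, loaded by the lw just before it.
program = [("lw", "R1", {"R2"}), ("add", "R3", {"R1", "R4"})]
print(count_load_use_stalls(program))  # 1
```

Inserting an independent instruction between the load and its consumer drops the count to zero, which is exactly what instruction scheduling tries to achieve.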
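The list-scheduling loop and the height priority above fit together naturally: compute height(n) with the slide's formula, then repeatedly pick the ready instruction with the greatest height (critical path first). The sketch below uses only that one heuristic, and the graph encoding (successor lists plus per-node exec times) is my own:

```python
import heapq

def height(n, succs, exec_time, memo):
    """height(n) = exec(n) if n is a leaf, else
    max(height(m)) + exec(n) over successors m (slide formula)."""
    if n not in memo:
        ms = succs.get(n, [])
        memo[n] = exec_time[n] + (max(height(m, succs, exec_time, memo)
                                      for m in ms) if ms else 0)
    return memo[n]

def list_schedule(nodes, succs, exec_time):
    """Greedy list scheduling: keep a candidate set of ready
    instructions and always schedule the one with the greatest height."""
    memo = {}
    for n in nodes:
        height(n, succs, exec_time, memo)
    npreds = {n: 0 for n in nodes}
    for n in nodes:
        for m in succs.get(n, []):
            npreds[m] += 1
    # roots (no predecessors) form the initial candidate set
    ready = [(-memo[n], n) for n in nodes if npreds[n] == 0]
    heapq.heapify(ready)
    order = []
    while ready:
        _, n = heapq.heappop(ready)    # best candidate by height
        order.append(n)
        for m in succs.get(n, []):     # expose newly ready candidates
            npreds[m] -= 1
            if npreds[m] == 0:
                heapq.heappush(ready, (-memo[m], m))
    return order

# The slides' four-instruction block: loads a and b feed c; a feeds d.
succs = {"a": ["c", "d"], "b": ["c"]}
exec_time = {"a": 2, "b": 2, "c": 1, "d": 1}
print(list_schedule(["a", "b", "c", "d"], succs, exec_time))
```

With these exec times the two loads have the greatest heights, so they issue first, matching the good schedules on the example slide (c and d tie in height, so either order between them is consistent with this single heuristic).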
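The superblock formation steps above (pick a trace, then remove side entrances by tail duplication) can be sketched on a small CFG model. This is an illustrative simplification, not the slides' algorithm verbatim: it duplicates the trace suffix starting at the first side entrance and only handles side entrances from blocks outside the trace; the `"'"` naming and dict encoding are mine:

```python
def tail_duplicate(succs, trace):
    """succs: dict block -> list of successor blocks (the whole CFG).
    Returns a new CFG in which side entrances into `trace` are
    redirected to duplicated copies (block name + "'"), giving the
    trace a single entry, as in the slides' A-B-E-F / F' figure."""
    in_trace = set(trace)
    preds = {}
    for b, ss in succs.items():
        for s in ss:
            preds.setdefault(s, []).append(b)
    # find the first trace block (after the head) with a side entrance
    start = None
    for i, b in enumerate(trace[1:], 1):
        if any(p != trace[i - 1] for p in preds.get(b, [])):
            start = i
            break
    if start is None:
        return dict(succs)              # already single-entry
    new_succs = {b: list(ss) for b, ss in succs.items()}
    dup = {b: b + "'" for b in trace[start:]}
    # the duplicated suffix mirrors the original, edges renamed inside it
    for b in trace[start:]:
        new_succs[dup[b]] = [dup.get(s, s) for s in succs.get(b, [])]
    # redirect off-trace side entrances to the duplicates
    for b in trace[start:]:
        for p in preds.get(b, []):
            if p not in in_trace:
                new_succs[p] = [dup[b] if s == b else s for s in new_succs[p]]
    return new_succs

# A CFG shaped like the slides' figure: F has a side entrance from D.
cfg = {"A": ["B", "C"], "B": ["E"], "C": ["D"], "D": ["F"], "E": ["F"], "F": []}
print(tail_duplicate(cfg, ["A", "B", "E", "F"]))
```

After the transformation, D branches to the duplicate F′ while the trace A → B → E → F keeps its single entry, so scheduling and optimization can treat the trace as one straight-line region.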