
CSE P501 – Compiler Construction


Presentation Transcript


  1. CSE P501 – Compiler Construction Instruction Scheduling • Issues • Latencies • List scheduling Jim Hogg - UW - CSE - P501

  2. Instruction Scheduling is . . . Original order: a b c d e f g h; scheduled order: b a d f c g h e • Execute in-order to get the correct answer • Issue in new order • eg: memory fetch is slow • eg: divide is slow • Overall faster • Still get the correct answer! • Originally devised for super-computers • Now used everywhere: • in-order procs - older ARM • out-of-order procs - newer x86 • Compiler does the 'heavy lifting' - reduces chip power

  3. Chip Complexity, 1 The following factors make scheduling complicated: • Different kinds of instruction take different times (in clock cycles) to complete • Modern chips have multiple functional units • so they can issue several operations per cycle • "super-scalar" • Loads are non-blocking • ~50 in-flight loads and ~50 in-flight stores

  4. Typical Instruction Timings

  5. Load Latencies • Instruction issue: ~5 per cycle • Register: 1 cycle • L1 Cache: ~4 cycles • L2 Cache: ~10 cycles • L3 Cache: ~40 cycles • DRAM: ~100 ns • L1 = 64 KB per core • L2 = 256 KB per core • L3 = 2-8 MB, shared

  6. Super-Scalar

  7. Chip Complexity, 2 • Branch costs vary (branch predictor) • Branches on some processors have delay slots (eg: Sparc) • Modern processors have branch-predictor logic in hardware • heuristics predict whether branches are taken or not • keeps pipelines full • GOAL: the scheduler should reorder instructions to • hide latencies • take advantage of multiple functional units (and delay slots) • help the processor pipeline execution effectively • However, many chips schedule on-the-fly too • eg: Haswell out-of-order window = 192 ops

  8. Data Dependence Graph (nodes a..i; leaves such as a at the top, root i at the bottom) • read-after-write = RAW = true dependence = flow dependence • write-after-read = WAR = anti-dependence • write-after-write = WAW = output dependence • The scheduler is free to reorder instructions, so long as it respects the inter-instruction dependences
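The three dependence kinds above can be sketched in Python. The representation (each instruction as one destination register plus a set of source registers) and the function name are illustrative, not from the slides:

```python
def classify_dependence(earlier, later):
    """Return the dependence kinds from `earlier` to `later`.

    Each instruction is (def_reg, set_of_use_regs)."""
    e_def, e_uses = earlier
    l_def, l_uses = later
    kinds = set()
    if e_def in l_uses:
        kinds.add("RAW")   # read-after-write: true/flow dependence
    if l_def in e_uses:
        kinds.add("WAR")   # write-after-read: anti-dependence
    if l_def == e_def:
        kinds.add("WAW")   # write-after-write: output dependence
    return kinds

# r1 = r2 * r3 ; r4 = r1 + r2  ->  RAW on r1
print(classify_dependence(("r1", {"r2", "r3"}), ("r4", {"r1", "r2"})))
# {'RAW'}
```

RAW edges must always be honoured; WAR and WAW conflicts can often be removed by renaming registers instead (as slide 13 notes).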

  9. Scheduling Really Works ... Original vs Scheduled code for a = 2*a*b*c*d • 1 Functional Unit • Load or Store: 3 cycles • Multiply: 2 cycles • Otherwise: 1 cycle • New schedule uses an extra register, r3 • Preserves the (WAW) output-dependence

  10. Scheduler: Job Description • The Job • Given code for some machine, and latencies for each instruction, reorder to minimize execution time • Constraints • Produce correct code • Minimize wasted cycles • Avoid spilling registers • Don't take forever to reach an answer

  11. Job Description - Part 2 • foreach instruction ins in the dependence graph • Denote the number of cycles ins takes to execute as ins.delay • Denote the cycle number in which ins should start as ins.start • foreach instruction dep that is dependent on ins • Ensure ins.start + ins.delay <= dep.start What if the scheduler makes a mistake? On-chip hardware stalls the pipeline until operands become available: slower, but still correct!
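The constraint above (ins.start + ins.delay <= dep.start for every dependence edge) can be checked mechanically. A minimal sketch, with hypothetical names:

```python
def schedule_is_legal(start, delay, edges):
    """Check every dependence edge (ins, dep), meaning `dep` depends on
    `ins`, satisfies ins.start + ins.delay <= dep.start."""
    return all(start[ins] + delay[ins] <= start[dep] for ins, dep in edges)

# a (a 3-cycle load) feeds b: b may start no earlier than cycle 4
delay = {"a": 3, "b": 1}
print(schedule_is_legal({"a": 1, "b": 4}, delay, [("a", "b")]))  # True
print(schedule_is_legal({"a": 1, "b": 3}, delay, [("a", "b")]))  # False
```

As the slide says, a schedule that violates this is still executed correctly on most hardware; the pipeline just stalls, costing cycles.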

  12. Dependence Graph + Timings • Superscripts show the latency-weighted path length to the end of the computation: a:13, c:12, b:10, e:10, d:9, g:8, f:7, h:5, i:3 • a-b-d-f-h-i is the critical path • Leaves can be scheduled at any time - no constraints • Since a has the longest path, schedule it first; then c; then ... • 1 Functional Unit • Load or Store: 3 cycles • Multiply: 2 cycles • Otherwise: 1 cycle
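The path-length superscripts can be computed bottom-up over the dependence DAG: priority(n) = delay(n) + max over successors of their priority. A Python sketch; the graph below covers only the critical path a-b-d-f-h-i, with assumed per-node delays chosen to match the slide's numbers:

```python
def priorities(delay, succs):
    """Latency-weighted path length from each node to the end of the DAG:
    priority(n) = delay(n) + max(priority(s) for successors s),
    or just delay(n) at a node with no successors."""
    memo = {}
    def prio(n):
        if n not in memo:
            memo[n] = delay[n] + max((prio(s) for s in succs.get(n, [])),
                                     default=0)
        return memo[n]
    return {n: prio(n) for n in delay}

# Critical path a-b-d-f-h-i from the slide; delays are assumptions
# (loads/stores 3, multiplies 2, otherwise 1) that reproduce its superscripts.
delay = {"a": 3, "b": 1, "d": 2, "f": 2, "h": 2, "i": 3}
succs = {"a": ["b"], "b": ["d"], "d": ["f"], "f": ["h"], "h": ["i"]}
print(priorities(delay, succs))
# {'a': 13, 'b': 10, 'd': 9, 'f': 7, 'h': 5, 'i': 3}
```

This priority is the usual tie-breaker in list scheduling: the node with the longest remaining path (here a, at 13) is scheduled first.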

  13. List Scheduling • Build a precedence graph D • Compute a priority function over the nodes in D • typical: longest latency-weighted path • Rename registers to remove WAW conflicts • Create schedule, one cycle at a time • Use a queue of operations that are Ready • At each cycle • Choose a Ready operation and schedule it • Update the Ready queue

  14. List Scheduling Algorithm

    cycle = 1                      // clock cycle number
    Ready = leaves of D            // ready to be scheduled
    Active = {}                    // being executed
    while Ready ∪ Active ≠ {} do
        foreach ins ∈ Active do
            if ins.start + ins.delay < cycle then
                remove ins from Active
                foreach successor suc of ins in D do
                    if suc is ready (all its predecessors have completed)
                        then Ready = Ready ∪ {suc}
                endforeach
            endif
        endforeach
        if Ready ≠ {} then
            remove an instruction ins from Ready
            ins.start = cycle
            Active = Active ∪ {ins}
        endif
        cycle++
    endwhile
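A runnable Python transcription of that pseudocode; a sketch assuming a single functional unit and name-order tie-breaking (a real scheduler would pick the Ready instruction with the highest priority from slide 12). The strict `<` retire test mirrors the pseudocode; `<=` would release successors one cycle earlier and still satisfy the slide-11 constraint.

```python
def list_schedule(delay, succs, preds):
    """Forward list scheduling over dependence graph D.

    delay: {ins: cycles}; succs/preds: adjacency maps of D.
    Returns {ins: start_cycle}."""
    cycle = 1
    start = {}
    ready = {n for n in delay if not preds.get(n)}   # leaves of D
    active = set()                                   # being executed
    done = set()
    while ready or active:
        for ins in sorted(active):                   # retire finished work
            if start[ins] + delay[ins] < cycle:
                active.discard(ins)
                done.add(ins)
                for suc in succs.get(ins, []):
                    if all(p in done for p in preds.get(suc, [])):
                        ready.add(suc)               # suc is now ready
        if ready:                                    # issue one op per cycle
            ins = min(ready)                         # tie-break by name
            ready.discard(ins)
            start[ins] = cycle
            active.add(ins)
        cycle += 1
    return start

# a (2 cycles) feeds b (1 cycle): b is issued once a has retired
print(list_schedule({"a": 2, "b": 1}, {"a": ["b"]}, {"b": ["a"]}))
# {'a': 1, 'b': 4}
```

Because only one instruction is issued per cycle, this models the slides' single-functional-unit machine; a super-scalar version would issue up to k Ready operations per cycle instead of one.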

  15. Beyond Basic Blocks • List scheduling dominates, but moving beyond basic blocks can improve the quality of the code. Possibilities: • Schedule extended basic blocks (EBBs) • Watch for exit points - they limit reordering or require compensating code • Trace scheduling • Use profiling information to select regions for scheduling, using traces (paths) through the code
