
Survey of Low-Complexity, Low Power Instruction Scheduling




  1. Survey of Low-Complexity, Low Power Instruction Scheduling Alex Li, Lewen Lo, Sara Sadeghi Baghsorkhi

  2. Motivation • Scalability of instruction window size • Extract greater ILP • Power consumption • CAM logic is power hungry • Complexity • Wire delay of associative logic dominates gate delay in scheduler

  3. Outline • Wakeup logic optimizations • Distributed instruction queues • Waiting Instruction Buffer • Preschedulers • Cyclone • Wakeup-Free

  4. Wakeup Logic [diagram: one issue-queue entry — Opcode, FU Type, Dest Reg, Src Reg 1 with valid bit V1, Src Reg 2 with valid bit V2, and ready bit R; result-register tags broadcast by the wakeup logic are compared (=) against each source-register tag, and the ready signal feeds the select logic]

  5. Gated Tag Matching • Rationale • Parts of the IQ waste energy • Energy-wasting sources • Empty area • Ready operands • Issued instructions • Solution • Gate the comparators! [diagram: circular issue queue with head and tail pointers] Folegnani et al. ISCA2001

  6. Gated Tag Matching • Furthermore… • Young instructions contribute little to performance • Solution: Dynamic resizing • Use a limit pointer & performance counters • Reduce size as long as the IPC loss stays below a threshold • Increase size if the IPC loss exceeds the threshold for a set period • Cost: Additional logic for • Gated comparators • Performance counter • Claims • 128-entry queue, effective size ~43 • 4% performance loss • 90.7% wakeup-logic energy savings • 14.9% chip energy savings • Significant energy savings • Based on conventional design, no performance benefit [diagram: issue queue with head, tail, and limit pointers] Folegnani et al. ISCA2001
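The dynamic-resizing policy above can be sketched as a simple controller run once per sampling interval. This is a minimal sketch, not Folegnani et al.'s actual hardware; the parameter names (`ipc_threshold`, `grow_period`, `step`) and their values are illustrative assumptions.

```python
def resize_queue(size, ipc, baseline_ipc, low_cycles,
                 ipc_threshold=0.02, grow_period=4,
                 min_size=16, max_size=128, step=8):
    """Return (new_size, new_low_cycles) after one sampling interval.

    size         -- current effective queue size (limit pointer position)
    ipc          -- IPC measured over the last interval
    baseline_ipc -- IPC reference for the full-size queue
    low_cycles   -- consecutive intervals spent below the IPC threshold
    """
    if ipc >= baseline_ipc * (1 - ipc_threshold):
        # Performance loss is acceptable: keep shrinking the window.
        return max(min_size, size - step), 0
    low_cycles += 1
    if low_cycles >= grow_period:
        # IPC stayed below threshold for a set period: grow the window back.
        return min(max_size, size + step), 0
    return size, low_cycles
```

Entries beyond the limit pointer would have their comparators gated off, which is where the wakeup-energy savings come from.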

  7. Tag Elimination • Rationale • Most instructions (80-96%) have at most 1 non-ready operand • The last-arriving operand wakes up the instruction • Base Approach • Issue window with 2-, 1-, and 0-comparator entries • Insert instructions based on operand readiness • Advanced Approach • Eliminate 2-comparator entries • Predict the last-arriving operand • Re-issue on misprediction • Results (32 1-comp / 32 0-comp entries) • Slight IPC loss (1-3%) • Accounting for the reduced delay, good speedup (25-45%) • 65-75% lower energy-delay product • Drastically reduces associative logic (to 1/4) • Reduces energy • No performance impact (even speedup) [diagram: issue-window entries with two, one, or zero tag comparators] Ernst et al. ISCA2002
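The base approach's steering decision — place each instruction in an entry with one comparator per not-yet-ready operand — might look like the following sketch. The helper name and the free-entry bookkeeping are assumptions for illustration, not the paper's mechanism.

```python
def choose_entry(srcs_ready, free_counts):
    """Pick an entry type for a dispatching instruction.

    srcs_ready  -- one boolean per source operand (ready at dispatch?)
    free_counts -- free-entry counts keyed by '0comp', '1comp', '2comp'
    Returns the chosen entry type, or None to stall dispatch.
    """
    # One comparator is needed per operand that must still be woken up.
    need = sum(1 for r in srcs_ready if not r)
    # Try the cheapest entry type that can still wake this instruction,
    # falling back to entries with more comparators if none are free.
    for kind in ('0comp', '1comp', '2comp')[need:]:
        if free_counts.get(kind, 0) > 0:
            free_counts[kind] -= 1
            return kind
    return None
```

An instruction with both operands ready at dispatch never needs a comparator at all, which is why 0-comparator entries pay off so often.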

  8. N-use Issue Logic • Rationale • Most instructions (75-78%) have 1 (or few) dependent instructions • Approach • More SRAM (N-use table) • Less CAM (I-buffer) • Wake up dependents only • Claims • 2-use table + 2-entry I-buffer comparable to 64-entry CAM (~4% slowdown) • 96 regs → 192 entries in the 2-use table! • Justifications • DOES reduce CAM (64 to 2 cells) • Energy to support the 2-use table ≈ that of gated entries • Less complex, but maybe more area • Cycle time may be reduced • Drastically different design Canal et al. ICS2001

  9. Distributed Instruction Queue (FIFO) • Instructions in a queue form a dependence chain. • Only instructions at the heads of the queues can be ready. • Works well for INT codes, but poorly for FP codes. • A large number of FIFOs increases complexity Palacharla et al. 97
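Palacharla et al.'s steering idea — append an instruction behind the producer of one of its source operands, so each FIFO holds a dependence chain and only FIFO heads need checking — can be sketched as below. The data-structure names and the stall policy when no FIFO is available are assumptions of this sketch.

```python
def steer(instr_srcs, instr_dest, fifos, producer_fifo):
    """Steer one instruction into a dependence-chain FIFO.

    fifos         -- list of lists; each FIFO holds dest regs, head at index 0
    producer_fifo -- maps a register to the FIFO index of its producer
    Returns the chosen FIFO index, or -1 to stall.
    """
    # Prefer appending directly behind a source operand's producer,
    # which is only legal while that producer is the tail of its FIFO.
    for src in instr_srcs:
        f = producer_fifo.get(src)
        if f is not None and fifos[f] and fifos[f][-1] == src:
            fifos[f].append(instr_dest)
            producer_fifo[instr_dest] = f
            return f
    # Otherwise start a new dependence chain in an empty FIFO.
    for i, fifo in enumerate(fifos):
        if not fifo:
            fifo.append(instr_dest)
            producer_fifo[instr_dest] = i
            return i
    return -1
```

FP codes tend to have wider, shorter dependence chains, which fragments across many FIFOs — consistent with the slide's note that the scheme works poorly for FP.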

  10. Distributed Instruction Queue (Buffer) • Multiple dependence chains share a queue. • Queues are not FIFOs, but they do not require wakeup logic. • Dispatch order and issue order may differ • Latencies known at issue time decide which instruction is selected next. • Still simple selection logic • Same performance with less power consumption Abella and Gonzalez 04

  11. Selection Logic Abella and Gonzalez 04

  12. Waiting Instruction Buffer [diagram: the Issue Queue holds LD r1, 1024(r0); ADD r3, r1, r2; ADD r4, r1, r4; SLL r3, 0x4, r3; SUB r4, r4, r2; ADD r5, r3, r4, with LD r6, 256(r5) and ADD r6, r6, r0 arriving from Instruction Dispatch; the load has issued to the Data Cache; the Waiting Instruction Buffer is empty] Lebeck et al. 02

  13. Waiting Instruction Buffer [diagram: "Load miss on r1" — the load misses in the Data Cache; its dependents ADD r3, r1, r2; ADD r4, r1, r4; SLL r3, 0x4, r3; SUB r4, r4, r2; ADD r5, r3, r4; LD r6, 256(r5); ADD r6, r6, r0 occupy the Issue Queue] Lebeck et al. 02

  14. Waiting Instruction Buffer [diagram: the miss is outstanding ("Cache Miss"); the dependent chain still fills the Issue Queue] Lebeck et al. 02

  15. Waiting Instruction Buffer [diagram: ADD r3, r1, r2 and ADD r4, r1, r4 have moved into the WIB (pending results r3, r4); SLL r3, 0x4, r3; SUB r4, r4, r2; ADD r5, r3, r4; LD r6, 256(r5); ADD r6, r6, r0 remain in the Issue Queue] Lebeck et al. 02

  16. Waiting Instruction Buffer [diagram: SLL r3, 0x4, r3 and SUB r4, r4, r2 have also moved into the WIB; ADD r5, r3, r4; LD r6, 256(r5); ADD r6, r6, r0 remain while the miss is outstanding] Lebeck et al. 02

  17. Waiting Instruction Buffer [diagram: "Miss Resolved" — the remaining dependents LD r6, 256(r5) and ADD r6, r6, r0 wait in the WIB, and the freed Issue Queue entries (….) are available to newly dispatched instructions] Lebeck et al. 02

  18. Waiting Instruction Buffer [diagram: "Instructions reinserted" — after the miss resolves, the WIB contents (ADD r3, r1, r2; ADD r4, r1, r4; SLL r3, 0x4, r3; …) are reinserted into the Issue Queue] Lebeck et al. 02

  19. Waiting Instruction Buffer • No support for back-to-back execution with parent loads that miss in the cache • Power consumption • Many instruction moves between the Issue Queue and the WIB • A large WIB
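The WIB mechanism walked through above — drain load-miss dependents out of the issue queue into the WIB, then reinsert them when the miss resolves — can be modeled roughly as follows. This is a simplified sketch, not Lebeck et al.'s design; the class layout and register-set tracking are assumptions, and it handles only one outstanding miss.

```python
class WIB:
    """Toy model of an issue queue backed by a Waiting Instruction Buffer.
    Instructions are (srcs, dest) tuples held in dispatch order."""

    def __init__(self):
        self.issue_queue = []
        self.wib = []           # instructions waiting on a cache miss
        self.miss_regs = set()  # registers whose values are pending on the miss

    def on_load_miss(self, dest_reg):
        """Drain every instruction that (transitively) depends on the
        missed load into the WIB, freeing issue-queue entries."""
        self.miss_regs.add(dest_reg)
        remaining = []
        for srcs, dest in self.issue_queue:  # dispatch order => transitive deps
            if any(s in self.miss_regs for s in srcs):
                self.miss_regs.add(dest)     # dependents' results also pending
                self.wib.append((srcs, dest))
            else:
                remaining.append((srcs, dest))
        self.issue_queue = remaining

    def on_miss_resolved(self):
        """Reinsert the waiting chain into the issue queue."""
        self.issue_queue.extend(self.wib)
        self.wib.clear()
        self.miss_regs.clear()
```

The drain/reinsert traffic in this model is exactly the power cost the slide calls out: every dependent moves twice per miss.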

  20. Motivation behind Preschedulers • Compiler-heavy scheduling • “Dumber” scheduler • More conservative (on branches, load/store addresses, other run-time things) • Hardware-intensive scheduling • Takes advantage of knowledge at run-time • Much more complex

  21. Motivation behind Preschedulers • Some dead instructions sit in scheduler slots • Reduce dead slots by only sending fireable instructions • Increases effective instruction window • Eliminates associative logic, decreasing: • Complexity • Delay (allowing for a possible clock speed increase) • Power consumption

  22. Dataflow-based Prescheduler • Register Use Line Table (RULT), width W • Active line = ready instructions • line = max(a, b, c) + x • the max of the current active line (a) and the lines of both operands (b, c), plus the latency x • Circular setup • Each cycle, increment the active line
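The line computation above can be sketched as below. This is a simplified model of the idea, not Michaud et al.'s hardware: the `rult` dictionary stands in for the Register Use Line Table, and the assumption that an untracked operand is ready at the active line is illustrative.

```python
def preschedule(active_line, rult, srcs, dest, latency, num_lines):
    """Place one instruction on a line of the circular prescheduling array.

    rult      -- maps a register to the (absolute) line where it becomes ready
    latency   -- this instruction's execution latency, in lines
    num_lines -- number of lines in the circular array
    Returns the physical (wrapped) line index for the instruction.
    """
    # Operands not in the table are assumed ready now (at the active line).
    ready = [rult.get(s, active_line) for s in srcs]
    # line = max(a, b, c): current active line and both operand lines.
    line = max([active_line] + ready)
    # The result becomes available `latency` lines later.
    rult[dest] = line + latency
    return line % num_lines  # circular setup: wrap around the array
```

Because instructions are placed on the line where their last operand arrives, the active line holds only fireable instructions, which is what lets the issue buffer stay small.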

  23. Dataflow Prescheduler Performance 8-entry issue buffer, 12 lines, 8 FIFOs 16-entry issue buffer, 12 lines, 16 FIFOs • Avg. 54% performance increase for 8-entry buffer • Avg. 33% performance increase for 16-entry buffer Michaud et al. HPCA2001

  24. Cyclone • Re-vamp the scheduler (take advantage of higher performance) • Instructions from the prescheduler go into the countdown queue • When the countdown reaches N/2 → main queue • Main-queue entries promote to the right • Column 0 is issued each cycle Ernst et al. ISCA2003
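The countdown-then-promote timing can be sketched as below. This is illustrative only — it traces one instruction's ideal path and ignores the collisions and replay that the real Cyclone queues must handle.

```python
def cyclone_path(delay):
    """Trace an instruction with predicted issue delay `delay` (cycles).

    Returns the per-cycle (queue, column) positions: the first half of the
    delay is spent walking the countdown queue, the second half promoting
    right through the main queue until column 0, where it issues.
    """
    half = (delay + 1) // 2
    # First half: walk through the countdown queue.
    path = [('countdown', c) for c in range(half - 1, -1, -1)]
    # Second half: promote rightward through the main queue to column 0.
    path += [('main', c) for c in range(delay - half - 1, -1, -1)]
    return path  # the instruction issues after reaching ('main', 0)
```

Splitting the delay in half is what gives the structure its name: the instruction traverses a cycle through both queues and arrives at the issue column exactly when its operands are predicted ready.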

  25. Cyclone (cont’d) • Replay mechanism • Register File Ready Bits for final operand check • Store set predictor • A conservative method avoiding load/store dependence messiness

  26. Cyclone Performance • Decrease in latency • 8-decode, 8-issue Cyclone takes ~12% of area compared to 64-instruction 8-issue CAM Ernst et al. ISCA2003

  27. Cyclone Analysis • Eliminates both wakeup and selection logic • Competition for issue ports • Congestion • Collisions during promotion (modifying promotion paths only shifts the pressure) • Replay-decode collisions

  28. Wakeup-Free (WF) schemes: WF-Replay • Latency counters + selection logic • Uses the entire scheduler • For a 32-entry queue at issue width 4, a 9% performance hit (vs. 25.5% for Cyclone) • At issue width 6, a 0.2% performance hit; at issue width 8, no performance hit Hu et al. HPCA2004
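One cycle of the latency-counter-plus-replay idea might be modeled as below. This is a sketch, not Hu et al.'s exact scheme: the entry layout, the register ready-bit check at issue, and the replay penalty of 2 cycles are all assumptions for illustration.

```python
def tick(entries, ready_bits, issue_width):
    """Advance a wakeup-free scheduler by one cycle.

    entries    -- list of dicts {'counter': int, 'srcs': [regs]}
    ready_bits -- maps a register to True once its value is available
    Returns (issued, replayed) entry lists; issued entries are removed.
    """
    # No tag broadcast: every entry just decrements its latency counter.
    for e in entries:
        e['counter'] = max(0, e['counter'] - 1)
    # Selection logic considers entries whose counter reached zero.
    candidates = [e for e in entries if e['counter'] == 0]
    issued, replayed = [], []
    for e in candidates[:issue_width]:
        if all(ready_bits.get(s, False) for s in e['srcs']):
            issued.append(e)
        else:
            # Timing misprediction: replay with a fresh (assumed) penalty.
            e['counter'] = 2
            replayed.append(e)
    for e in issued:
        entries.remove(e)
    return issued, replayed
```

The CAM wakeup is gone entirely; correctness rests on the final ready-bit check, and mispredicted timings cost only a replay rather than a wrong result.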

  29. WF-Precheck • Do a precheck instead of replay • Check Reg Ready Bits before issuing • If not ready, recalculate timing • Increases complexity of selection logic Hu et al. HPCA2004

  30. Segmented Issue Queue Hu et al. HPCA2004

  31. Segmented Issue Queue Commentary • Rows represent different classes of latencies • Only select on lowest row (latency 0) • Sinking/Collapsing structure to prevent pileups

  32. WF-Segment Performance • 5.8% perf. loss (3.5% vs. Precheck) Hu et al. HPCA2004

  33. Conclusions • Low-power optimizations tend to target control logic • They don't change the underlying structure • Low-complexity optimizations • More creative designs • Low power • No appreciable performance loss (possibly speedup)

  34. Backup Slides
