Operation Tables for Scheduling in the presence of Partial Bypassing

Aviral Shrivastava (1), Eugene Earlie (2), Nikil Dutt (1), Alex Nicolau (1). (1) Center for Embedded Computer Systems, University of California, Irvine, CA, USA. (2) Strategic CAD Labs, Intel, Hudson, MA, USA.


Presentation Transcript


  1. Operation Tables for Scheduling in the Presence of Partial Bypassing
     Aviral Shrivastava (1), Eugene Earlie (2), Nikil Dutt (1), Alex Nicolau (1)
     (1) Center for Embedded Computer Systems, University of California, Irvine, CA, USA
     (2) Strategic CAD Labs, Intel, Hudson, MA, USA

  2. Bypassing Improves Performance
     • Pipelining improves performance
       • Limited by pipeline hazards
     • Bypasses eliminate certain data hazards
       • Further improve performance
     [Figure: two pipeline diagrams (F, D, OR, X1, X2, WB, with register file RF) for the sequence R1 ← R2 + R3; R4 ← R4 + R1]

  3. Impact of Bypassing
     • Area and power consumption
       • Wide multiplexers
       • Bypass control logic
       • Bypass wires
     • Cycle time
       • Bypasses may be part of the timing-critical path
     • Overall chip complexity
       • Deeply pipelined
       • Out-of-order processors
     • Wiring congestion
     [Figure: pipeline F, D, X1, X2, WB with RF, M1, and M2]
     References: P. Ahuja et al., The Performance Impact of Incomplete Bypassing in Processor Pipelines, MICRO 1995. A. Abnous and N. Bagherzadeh, Pipelining and bypassing in a VLIW processor, IEEE Trans... 1995.

  4. Bypassing in Embedded Systems
     • Bypassing increases performance
       • But may have a significant impact on area, power consumption, wire congestion, etc.
     • The Embedded Systems Dilemma
       • No Bypassing - Too low performance
       • Full Bypassing - Too much area, power, wire congestion
     • How to customize Bypassing?

  5. Partial Bypassing – Solution and Problem
     • Solution –
       • Only the most beneficial bypasses are present
       • Implements a trade-off between performance, area, power consumption, etc. of the processor
     • Problem –
       • How to compile for a processor with partial bypassing?
     [Figure: pipeline F, D, OR, X1, X2, WB with register file RF]

  6. Related Work
     • Compilation for partial bypassing
       • P. Ahuja et al. [MICRO’95]: manual compilation
       • M. Buss et al. [CASES’01]: optimize inter-cluster copy operations
       • K. Fan et al. [ASSP’03]: FU-allocation strategy for VLIW processors
     • No existing generic compilation technique (RISC, superscalar, superpipelined)
     • No instruction reordering
     • No accurate “pipeline hazard detection” technique
     We present: an accurate, generic, retargetable pipeline hazard detection technique

  7. Pipeline Hazards
     • Data Hazards
     • Resource Hazards
     • Resource Hazards – Structural Information
       • Reservation Tables
     [Figure: pipeline F, D, OR, X1, X2, WB with register file RF and connections C1, C2, C3]

  8. Resource Hazard Detection
     [Figure: the same pipeline, labeled "Resource Hazard"]
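
The reservation-table check named on slides 7 and 8 can be sketched as follows. This is a minimal illustration, not the authors' implementation: the table layout (one set of resources per cycle after issue) and the names `ADD_RT`, `reserve`, and `has_resource_hazard` are assumptions; the resource names follow the ADD reservation table on slide 13.

```python
# Hypothetical layout: entry i holds the resources the operation occupies
# i cycles after it is issued (stage names and ports as on slide 13).
ADD_RT = [
    {"F"},
    {"D"},
    {"OR", "C1", "C2"},  # operand read through ports C1 and C2
    {"X1"},
    {"X2"},
    {"WB", "C3"},        # write-back through port C3
]

def has_resource_hazard(busy, table, issue_cycle):
    """Return True if issuing `table` at `issue_cycle` needs a resource
    that `busy` (cycle -> set of reserved resources) already holds."""
    for offset, resources in enumerate(table):
        if busy.get(issue_cycle + offset, set()) & resources:
            return True
    return False

def reserve(busy, table, issue_cycle):
    """Record the resources of `table` in `busy`, starting at `issue_cycle`."""
    for offset, resources in enumerate(table):
        busy.setdefault(issue_cycle + offset, set()).update(resources)

busy = {}
reserve(busy, ADD_RT, 0)
same_cycle = has_resource_hazard(busy, ADD_RT, 0)  # both would need F in cycle 0
next_cycle = has_resource_hazard(busy, ADD_RT, 1)  # the second ADD trails by one stage
```

In this sketch a second ADD conflicts when issued in the same cycle as the first, but not when issued one cycle later.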

  9. Data Hazard Detection
     • Control Flow Graph – Register Information
     • Operation Latency
       • Least delay (in cycles) by which dependent operations must be separated to avoid a data hazard
     [Figure: control flow graph with operation latencies (operations a to f), and the scheduled operations over time]
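
As a sketch of how constant operation latencies are used, the snippet below computes the earliest cycle at which each operation can start when only data dependences are considered. The graph, its latencies, and the name `earliest_start_times` are made up for illustration (the figure on the slide is not recoverable), and resource limits are ignored.

```python
def earliest_start_times(order, latency):
    """order: operations in an order that respects dependences.
    latency: dict (producer, consumer) -> minimum separation in cycles.
    Returns dict operation -> earliest start cycle, dependences only."""
    start = {}
    for op in order:
        start[op] = max(
            (start[p] + lat for (p, c), lat in latency.items() if c == op),
            default=0,
        )
    return start

# Hypothetical dependence graph: a feeds b and c, which both feed d.
latency = {("a", "b"): 1, ("a", "c"): 2, ("b", "d"): 2, ("c", "d"): 1}
start = earliest_start_times(["a", "b", "c", "d"], latency)
```

A consumer scheduled earlier than its computed start cycle would incur a data hazard under this latency model.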

  10. Traditional - Operation Latency
      • Operation Latency of a non-bypassed or fully bypassed pipeline is a constant
      [Figure: two pipeline diagrams (F, D, OR, X1, X2, WB, with register file RF) for the sequence R1 ← R2 + R3; R4 ← R4 + R1]
      No Bypassing: Operation Latency = 3
      Full Bypassing: Operation Latency = 1

  11. Partial Bypasses - Operation Latency
      • Operation Latency ill-defined
      [Figure: pipeline F, D, OR, X1, X2, X3, WB with register file RF, for the sequence R1 ← R2 + R3; R4 ← R4 + R1]
      Partial Bypassing: Operation Latency = ??
      • Delay (in cycles) depends on the structure
        • Processor pipeline
        • Presence/absence of bypasses
      • Need structural information to detect data hazards
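
One way to see why a single latency number stops being meaningful is to list every issue separation at which the consumer can obtain its operand. The sketch below uses the six-stage pipeline of slide 10 and rests on assumptions that are not stated in the slides: the result exists once the producer reaches X1, operands are read in OR, and a value written in WB can be read from the register file in the same cycle. The function name `legal_delays` is hypothetical.

```python
STAGES = ["F", "D", "OR", "X1", "X2", "WB"]

def legal_delays(bypass_sources, max_delay=5):
    """Issue separations (consumer minus producer, in cycles) at which the
    consumer's OR stage can obtain the producer's result.
    bypass_sources: set of stages that have a bypass to OR (assumed model)."""
    read = STAGES.index("OR")
    write_back = STAGES.index("WB")
    delays = []
    for d in range(1, max_delay + 1):
        producer_pos = read + d  # where the producer is while the consumer is in OR
        if producer_pos >= write_back:
            delays.append(d)     # value is already in the register file
        elif STAGES[producer_pos] in bypass_sources:
            delays.append(d)     # value arrives over a bypass
    return delays

no_bypass = legal_delays(set())             # smallest legal delay is 3
full_bypass = legal_delays({"X1", "X2"})    # smallest legal delay is 1
partial = legal_delays({"X1"})              # delay 1 works, delay 2 does not
```

Under these assumptions the no-bypass and full-bypass cases reproduce the latencies 3 and 1 from slide 10, while a bypass from X1 only gives a set of legal delays with a gap, which a single operation latency cannot express.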

  12. Partial Bypassing - Pipeline Hazards
      • Traditionally (No or Full Bypassing)
        • Resource Hazards - Structural information
        • Data Hazards - Register information + Operation Latency
      • Partial Bypassing
        • Resource Hazards - Structural information
        • Data Hazards - Register information + Structural information
      • Structural information captured by Reservation Tables
      • Augment Reservation Tables with register information
      Our Contribution - Operation Table

  13. Reservation Table
      Reservation Table for ADD:
        1. F
        2. D
        3. OR: C1 RF, C2 RF
        4. X1
        5. X2
        6. WB: C3 RF
      • Reservation Table is a binding between
        • Operation and processor resources
      • Does not support multiple datapaths
      [Figure: pipeline F, D, OR, X1, X2, WB with register file RF and connections C1, C2, C3]

  14. Enhanced Reservation Table
      Reservation Table for ADD:
        1. F
        2. D
        3. OR: C1 RF, C2 RF, C5 BRF
        4. X1: C4 BRF
        5. X2
        6. WB: C3 RF
      • Reservation Table is a binding between
        • Operation and processor resources
      [Figure: pipeline F, D, OR, X1, X2, WB with RF, BRF, and connections C1 to C5]

  15. Operation Table
      Operation Table for ADD R1 R2 R3:
        1. F
        2. D
        3. OR
           ReadOperands: R2: C1 RF; R3: C2 RF, C5 BRF
           DestOperands: R1 RF
        4. X1
           WriteOperands: R1: C4 BRF
        5. X2
        6. WB
           WriteOperands: R1: C3 RF
      • Operation Table is a binding between
        • Operation and Processor Resources and Registers
      • Can be used to detect both data and resource hazards
      [Figure: pipeline F, D, OR, X1, X2, WB with RF, BRF, and connections C1 to C5]
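
The Operation Table for ADD could be held in a structure like the one below. This is a sketch based on one reading of the flattened slide, in particular that R3 has two alternative read paths (C2 from RF, C5 from BRF); the class `Stage` and the helper names are hypothetical, not the authors' data structures.

```python
from dataclasses import dataclass, field

@dataclass
class Stage:
    unit: str
    reads: dict = field(default_factory=dict)   # register -> [(port, storage), ...] alternatives
    dests: list = field(default_factory=list)   # (register, storage) pairs marked as destinations
    writes: dict = field(default_factory=dict)  # register -> (port, storage)

def add_operation_table(dst, src1, src2):
    """Operation Table for ADD dst, src1, src2, as read from slide 15."""
    return [
        Stage("F"),
        Stage("D"),
        Stage("OR",
              reads={src1: [("C1", "RF")],
                     src2: [("C2", "RF"), ("C5", "BRF")]},
              dests=[(dst, "RF")]),
        Stage("X1", writes={dst: ("C4", "BRF")}),
        Stage("X2"),
        Stage("WB", writes={dst: ("C3", "RF")}),
    ]

def resources_per_stage(table):
    """Drop the register names to get a reservation-table-like view.
    Lists every port a stage may use, including alternatives."""
    view = []
    for stage in table:
        ports = {port for paths in stage.reads.values() for port, _ in paths}
        ports |= {port for port, _ in stage.writes.values()}
        view.append((stage.unit, sorted(ports)))
    return view

ot = add_operation_table("R1", "R2", "R3")
view = resources_per_stage(ot)
```

Discarding the register names gives back the resource view of slide 14, which illustrates that the Operation Table extends the reservation table rather than replacing it.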

  16. Pipeline Hazard Detection using OT
      [Figure: pipeline F, D, OR, X1, X2, WB with RF, BRF, and connections C1 to C5]

  17. Resource Hazard Detection
      [Figure: the same pipeline, labeled "Resource Hazard"]

  18. Data Hazard Detection
      [Figure: the same pipeline, labeled "Data Hazard"]
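
A data-hazard check driven by Operation Tables might look like the sketch below. It is not the authors' algorithm. It assumes that a value written to RF stays readable from the write cycle onward, and that BRF (treated here as a bypass storage) holds the value only in the cycle it is written. The tuple layout and the names `add_ot` and `data_hazard` are hypothetical.

```python
def add_ot(dst, src1, src2):
    """Operation Table for ADD as (unit, reads, writes) per stage.
    reads: register -> [(port, storage)], writes: register -> (port, storage)."""
    return [
        ("F", {}, {}),
        ("D", {}, {}),
        ("OR", {src1: [("C1", "RF")], src2: [("C2", "RF"), ("C5", "BRF")]}, {}),
        ("X1", {}, {dst: ("C4", "BRF")}),
        ("X2", {}, {}),
        ("WB", {}, {dst: ("C3", "RF")}),
    ]

def data_hazard(producer, p_issue, consumer, c_issue):
    """True if the consumer reads a register the producer writes at a cycle
    when none of the consumer's read paths can supply the value."""
    write_cycle = {}
    for offset, (_, _, writes) in enumerate(producer):
        for reg, (_, storage) in writes.items():
            write_cycle[(reg, storage)] = p_issue + offset
    produced = {reg for reg, _ in write_cycle}
    for offset, (_, reads, _) in enumerate(consumer):
        t = c_issue + offset
        for reg, paths in reads.items():
            if reg not in produced:
                continue
            available = False
            for _, storage in paths:
                w = write_cycle.get((reg, storage))
                if w is None:
                    continue
                # Assumed semantics: RF keeps the value, bypass storage is transient.
                if (storage == "RF" and t >= w) or (storage != "RF" and t == w):
                    available = True
            if not available:
                return True
    return False

producer = add_ot("R1", "R2", "R3")          # R1 <- R2 + R3, issued at cycle 0
via_bypass = add_ot("R4", "R4", "R1")        # R1 is read on the port pair with a BRF path
rf_only = add_ot("R4", "R1", "R4")           # R1 is read on the RF-only port
```

With the producer issued at cycle 0, `data_hazard(producer, 0, via_bypass, 1)` is false because the bypass supplies R1, while `data_hazard(producer, 0, rf_only, 1)` is true because that operand position has no bypass; the same pair of operations therefore needs a different separation depending on which operand carries the dependence.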

  19. Scheduling using Operation Tables
      • Operation Tables provide a way to accurately detect pipeline hazards
        • Detect data and resource hazards
      • Most scheduling algorithms have two main components
        • Generate possible reorderings
        • Evaluate each to find the best one
      • Most scheduling algorithms should be able to benefit from a better evaluation mechanism
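
The generate-and-evaluate split can be sketched with a brute-force scheduler for a tiny basic block. Everything here is illustrative: the evaluator `issue_cycles` is a stand-in that uses fixed delays on a single-issue in-order machine, where an OT-based evaluator would go instead, and exhaustive enumeration is only practical for a handful of operations.

```python
from itertools import permutations

def issue_cycles(order, delay):
    """Stand-in evaluator: cycles needed to issue `order` on a single-issue,
    in-order machine. delay: dict (producer, consumer) -> minimum separation."""
    issue = {}
    t = 0
    for op in order:
        ready = max(
            (issue[p] + d for (p, c), d in delay.items() if c == op and p in issue),
            default=0,
        )
        t = max(t, ready)  # stall until the operands are ready
        issue[op] = t
        t += 1
    return t

def best_order(ops, delay):
    """Generate every dependence-respecting reordering and keep the cheapest."""
    best = None
    for order in permutations(ops):
        pos = {op: i for i, op in enumerate(order)}
        if any(pos[p] > pos[c] for p, c in delay):
            continue  # a consumer would precede its producer
        cost = issue_cycles(order, delay)
        if best is None or cost < best[0]:
            best = (cost, order)
    return best

# Hypothetical block: b depends on a with a delay of 3; c is independent.
result = best_order(["a", "b", "c"], {("a", "b"): 3})
```

Here the independent operation c is moved between a and b to fill part of the stall, so `result` is `(4, ('a', 'c', 'b'))`. A more accurate evaluator changes which order wins without changing the search.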

  20. Experimental Setup
      • Platform – Intel XScale
        • 7-stage super-pipelined RISC
      • Benchmarks – MiBench
      • Scheduler
        • Instruction reordering within a basic block
        • Currently a post pass in the compiler
      [Figure: evaluation flow. The application is compiled with gcc -O3 into an executable; that executable, and the executable produced by the OT-based scheduler, are each run on a cycle-accurate simulator, giving GCC Cycles and OT Cycles.]
      Performance Improvement = (GCC Cycles – OT Cycles) / GCC Cycles
      References: Intel XScale Microarchitecture Programmers Reference Manual, http://www.developer.intel.com; M. R. Guthaus et al., MiBench: A free commercially representative…, IEEE Workshop… 2001

  21. Up to 20% Performance Improvement
      Performance Improvement = (GCC Cycles – OT Cycles) / GCC Cycles
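
The metric is a relative reduction in cycle count. The snippet below applies it to made-up cycle counts chosen only to show the arithmetic; they are not measurements from the presentation, and the function name is hypothetical.

```python
def performance_improvement(gcc_cycles, ot_cycles):
    """(GCC Cycles - OT Cycles) / GCC Cycles, as a fraction of the baseline."""
    return (gcc_cycles - ot_cycles) / gcc_cycles

# Hypothetical counts: 1000 cycles with gcc -O3, 800 after OT-based scheduling.
improvement = performance_improvement(1000, 800)
```

With these illustrative numbers the improvement is 0.2, that is, 20% fewer cycles than the baseline.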

  22. Summary
      • Bypassing improves performance but is costly in terms of area, power, etc.
      • Partial bypassing offers valuable trade-offs but poses challenges for compilation
      • Operation latencies in a partially bypassed pipeline are ill-defined
      • We define the Operation Table (OT) as a binding between an operation and the processor's resources and registers
      • OTs can be used to accurately detect hazards even in the presence of partial bypassing
      • Simple OT-based scheduling at the basic-block level yields up to 20% performance improvement

  23. Thank You! Questions/Comments? aviral@ics.uci.edu
