
Out-of-Order Execution Structures



Presentation Transcript


  1. Out-of-Order Execution Structures ECE1773 - Fall ‘07 ECE Toronto

  2. MIPS R10000-Like Design • Based on: "Complexity-Effective Superscalar Processors", S. Palacharla, N. Jouppi and J. E. Smith, ISCA 97

  3. Fetch Phase • Fetch: read instructions from the I-Cache, predict branches, pass on to the Decode phase

  4. Decode Phase • Decode: parse the instruction and shuffle opcode parts to the appropriate ports for rename

  5. Renaming Phase • Rename: map architectural registers to physical registers, eliminating false dependences • Passes renamed instructions to the scheduler (this step is called Dispatch)

  6. Scheduling Phase • Wakeup: instructions check whether they have become ready, using the physical register names that arrive from Writeback • Select: amongst the ready instructions, select those to execute, subject to structural hazards
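
The wakeup step above can be sketched in a few lines of Python. This is only an illustration of the tag-matching idea, not the circuit from the paper: the `WindowEntry` class, its field names, and the string tags such as `p7` are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class WindowEntry:
    """One issue-window entry (hypothetical layout, for illustration only)."""
    name: str
    src_tags: tuple                      # physical register names of the sources
    ready: list = field(init=False)      # one ready bit per source operand

    def __post_init__(self):
        # No operand is ready when the instruction enters the window.
        self.ready = [False] * len(self.src_tags)

    def is_ready(self):
        return all(self.ready)


def wakeup(window, broadcast_tags):
    """Match every broadcast tag against every waiting operand tag."""
    for entry in window:
        for i, tag in enumerate(entry.src_tags):
            if tag in broadcast_tags:
                entry.ready[i] = True
    # Instructions whose operands are all ready become candidates for select.
    return [e.name for e in window if e.is_ready()]


window = [
    WindowEntry("add", ("p3", "p7")),
    WindowEntry("mul", ("p7", "p9")),
    WindowEntry("sub", ("p7",)),
]
first = wakeup(window, {"p7"})    # only "sub" has all operands ready
second = wakeup(window, {"p3"})   # "add" becomes ready as well
```

The sketch keeps ready instructions in the window; in a real scheduler the select step would pick among them and remove the ones it issues.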

  7. Register File Read Phase • Read source operands

  8. Bypass and Execute Phase

  9. Data Cache Access Phase

  10. Writeback Phase • Write the result to the register file • Broadcast the tag in order to wake up waiting instructions • Note that the tag broadcast should happen TWO cycles in advance of the result production

  11. Reservation Station Model • Used by Pentium Pro, PowerPC 604 • Re-order buffer holds values • Renaming points to re-order buffer entries • Tomasulo-like

  12. Physical Register File vs. Reservation Station • Physical Register File: values reside in the register file; at writeback, instructions broadcast the register name • Reservation Stations: values reside in the register file upon commit (non-speculative) and in reservation stations prior to commit (speculative)

  13. Quantifying Complexity • Critical path delay as a function of architectural parameters: instruction window size (WinSize) and issue width (IW) • Full-custom implementations • Study the critical path • Delay model • Extrapolate how it will scale with “future” technologies

  14. Renaming • Inputs: IW instructions, each with up to 2 input register names and up to 1 output register name • Outputs: 2 input physical registers, 1 new output physical register, 1 previous physical register name for checkpointing, and the updated rename table • Superscalar issue complicates things a bit

  15. Renaming One Instruction [Figure: RAT with entries p0 to p31; read ports for sources s1 and s2 and for the old mapping of destination d, which is kept for misspeculation recovery; a write port installs the new register for d taken from the free list]

  16. Renaming Two Instructions [Figure: RAT plus cross-bundle dependency check logic; labels: s1, s2, d, ps1, ps2, old d, new d]

  17. Renaming More Instructions • Dependency checking logic for instruction i must match against all preceding destinations • If there are multiple matches it must enforce priority: pick the one closest to this instruction
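
A minimal Python sketch of renaming one group with the priority rule above. It is a software model under stated assumptions, not the hardware: the function name `rename_group`, the dict-based RAT, the list-based free list, and the register names are all hypothetical, and every instruction is assumed to have exactly one destination and two sources.

```python
def rename_group(group, rat, free_list):
    """Rename a group of (dest, src1, src2) architectural register triples.

    rat maps architectural -> physical names; free_list holds free physical
    registers. Returns one (new_dest, phys_src1, phys_src2, old_dest) tuple
    per instruction and updates rat in place.
    """
    # All RAT reads see the table as it was before this group (parallel read).
    rat_before = dict(rat)
    new_dests = []   # (arch dest, new phys reg) of earlier group members
    renamed = []

    def lookup(arch):
        # Dependency check: the closest preceding destination has priority.
        for prev_arch, prev_phys in reversed(new_dests):
            if prev_arch == arch:
                return prev_phys
        return rat_before[arch]   # no match inside the group: use the RAT

    for dest, src1, src2 in group:
        ps1, ps2 = lookup(src1), lookup(src2)
        old = lookup(dest)             # previous mapping, kept for recovery
        new = free_list.pop(0)         # new register from the free list
        new_dests.append((dest, new))
        renamed.append((new, ps1, ps2, old))

    for arch, phys in new_dests:       # the last writer of a register wins
        rat[arch] = phys
    return renamed


rat = {"r1": "p1", "r2": "p2", "r3": "p3"}
free_list = ["p10", "p11", "p12"]
group = [("r1", "r2", "r3"), ("r1", "r1", "r2"), ("r3", "r1", "r1")]
result = rename_group(group, rat, free_list)
```

The third instruction reads `r1` after two earlier writes to it in the same group, so it picks up `p11`, the mapping created by the closest preceding writer.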

  18. RAT: SRAM Implementation [Figure: SRAM array of #ARCH REGS rows by lg(#PHYS REGS) columns; a decoder indexed by the arch reg selects a row of SRAM cells, and bitlines feed sense amps that output the phys reg]

  19. SRAM RAT cell

  20. RAT: CAM Implementation • One CAM entry per physical register • Active bit indicates the current map • New version by setting active bit [Figure: CAM array of #PHYS REGS rows by lg(#ARCH REGS) columns, searched by arch reg; one active bit per row; an encoder outputs the phys reg]

  21. CAM Cell

  22. SRAM vs. CAM • SRAM: arch reg rows, lg(phys reg) columns, SRAM read/write • CAM: phys reg rows, lg(arch reg) columns, CAM match • Update: reset previous valid bit, set current valid bit
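
The two organizations can be contrasted with a small Python model. This is a sketch under assumptions: the class names `SramRat` and `CamRat`, the integer register numbers, and the identity mapping at reset are invented for the example, and the update rule is modeled on the CAM side, where the valid (active) bits live.

```python
class SramRat:
    """One row per architectural register; each row stores a phys reg number."""

    def __init__(self, num_arch):
        self.rows = list(range(num_arch))   # assumed reset state: arch i -> phys i

    def lookup(self, arch):
        return self.rows[arch]              # indexed read through the decoder

    def update(self, arch, phys):
        old, self.rows[arch] = self.rows[arch], phys
        return old                          # previous mapping, for recovery


class CamRat:
    """One row per physical register; each row stores an arch reg and a valid bit."""

    def __init__(self, num_arch, num_phys):
        self.arch = [None] * num_phys
        self.valid = [False] * num_phys
        for a in range(num_arch):           # assumed reset state: arch i -> phys i
            self.arch[a], self.valid[a] = a, True

    def lookup(self, arch):
        # Associative match: exactly one valid row may hold this arch reg.
        hits = [p for p, a in enumerate(self.arch) if self.valid[p] and a == arch]
        assert len(hits) == 1
        return hits[0]                      # the encoder turns the hit row into a number

    def update(self, arch, phys):
        old = self.lookup(arch)
        self.valid[old] = False             # reset previous valid bit
        self.arch[phys] = arch
        self.valid[phys] = True             # set current valid bit
        return old


sram, cam = SramRat(4), CamRat(4, 8)
olds = [sram.update(2, 6), cam.update(2, 6)]
maps = [sram.lookup(2), cam.lookup(2)]
```

Both tables return the same answers; they differ in which dimension grows with the number of physical registers, which is what drives the row and column counts listed above.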

  23. Scheduler: Part #1 - Wakeup

  24. Scheduler: Part #2 - Select • For a single FU • Tree of arbiters with REQ and GRANT signals • Anyreq is raised if any req is active; a grant is issued if the arbiter is enabled • Root is enabled if the FU is available • Location-based select policy
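
A recursive Python sketch of a tree of arbiters for a single FU. The function name `select`, the fan-in of 4, and the choice of "lowest position wins" as the location-based policy are assumptions made for illustration; the slide does not fix them.

```python
def select(requests, enable=True, fan_in=4):
    """Return a grant vector with at most one grant for a single FU.

    requests: list of booleans, one REQ per issue-window position.
    enable:   at the root, True if the FU is available.
    """
    n = len(requests)
    if n <= fan_in:
        # Arbiter cell: if enabled, grant the first requester by position.
        grants = [False] * n
        if enable:
            for i, req in enumerate(requests):
                if req:
                    grants[i] = True
                    break
        return grants

    # Split the window into at most fan_in groups, one per child subtree.
    size = -(-n // fan_in)                  # ceiling division
    groups = [requests[i:i + size] for i in range(0, n, size)]

    # Each child raises anyreq if any of its requests is active.
    anyreq = [any(g) for g in groups]

    # This node arbitrates among its children and enables at most one.
    child_enable = select(anyreq, enable, fan_in)

    grants = []
    for group, en in zip(groups, child_enable):
        grants.extend(select(group, en, fan_in))
    return grants


reqs = [False] * 16
reqs[5] = reqs[9] = True
granted = select(reqs)                       # FU available
blocked = select(reqs, enable=False)         # FU busy: root not enabled
```

With two requesters at positions 5 and 9, only position 5 is granted; when the root is not enabled, nothing is granted.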

  25. Select for More than One FU • Handling multiple FUs of the same type: stack select logic blocks in series (a hierarchy) and mask the request granted to the previous unit • NOT feasible for more than 2 FUs • Alternative: statically partition the issue window among FUs (MIPS R10000, HP PA 8000)
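
The two options above can be compared with a short Python sketch. The helper names `select_one`, `select_stacked`, and `select_partitioned` are hypothetical, and a simple lowest-position priority stands in for the real select block.

```python
def select_one(requests):
    """Stand-in select block: index of the first active request, or None."""
    for i, req in enumerate(requests):
        if req:
            return i
    return None


def select_stacked(requests, num_fus):
    """Select blocks in series; each one masks the request already granted."""
    remaining = list(requests)
    grants = []
    for _ in range(num_fus):
        winner = select_one(remaining)
        grants.append(winner)
        if winner is not None:
            remaining[winner] = False   # mask it for the next block in the stack
    return grants


def select_partitioned(requests, num_fus):
    """Each FU only sees its own static slice of the issue window."""
    size = len(requests) // num_fus
    grants = []
    for fu in range(num_fus):
        part = requests[fu * size:(fu + 1) * size]
        winner = select_one(part)
        grants.append(None if winner is None else fu * size + winner)
    return grants


reqs = [False, True, True, False, False, False, False, False]
stacked = select_stacked(reqs, 2)           # both ready instructions issue
partitioned = select_partitioned(reqs, 2)   # second FU finds nothing in its half
```

The example shows the trade-off: the stacked version issues both ready instructions but places the select blocks in series, while the partitioned version keeps each select block independent at the cost of leaving an FU idle when the ready instructions cluster in one partition.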

  26. Datapath and Bypass • Commonly used layout [Figure: one bit-slice; turning on tri-state A passes the result of FU1 to the left operand of FU0]

  27. Complexity Analysis • Critical path delay as a function of: • Issue Width • Window Size • Register Renaming Table • Wakeup and Select • Bypass paths

  28. Methodology • A representative CMOS design is selected from published alternatives • Circuits implemented for 3 technologies: 0.8 micron, 0.35 micron and 0.18 micron • Optimized for speed • Wire parasitics (Rmetal, Cmetal) included in the delay model

  29. Methodology • Feature size scaling: 1 / S • Voltage scaling: 1 / U • Logic delay = (CL x V) / I • Capacitive load: CL = 1 → 1 / S • Supply voltage: V = 1 → 1 / U • Average charge/discharge current: I = 1 → 1 / U • So, logic delay = (1 / S x 1 / U) / (1 / U) = 1 / S

  30. Wire Delay • Intrinsic RC delay = 0.5 x Rmetal x Cmetal x L^2 • L: wire length • Rmetal: resistance per unit length • Cmetal: capacitance per unit length • 0.5: 1st-order approximation of the distributed RC model (uniformly distributed R and C)

  31. Wire Delay Scaling • Metal thickness doesn’t scale much • Width ~ 1/S • Rmetal ~ S • Fringe capacitance (edges to parallel wires and the substrate) dominates at smaller feature sizes • Parallel-plate capacitance scales with 1 / S • Cmetal ~ S • Length scales with 1/S • Overall scale factor: S x S x (1/S)^2 = 1 • Wire delay remains constant
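
The two scaling rules can be checked numerically. This Python sketch simply plugs the slides' scale factors into the slides' delay expressions; the function names and the example values of S and U are assumptions chosen for illustration (S here is the ratio between the 0.8 micron and 0.18 micron feature sizes).

```python
import math


def logic_delay_scale(S, U):
    """Scale factor of logic delay = (CL x V) / I."""
    cl = 1 / S        # capacitive load scales as 1/S
    v = 1 / U         # supply voltage scales as 1/U
    i = 1 / U         # average charge/discharge current scales as 1/U
    return cl * v / i


def wire_delay_scale(S):
    """Scale factor of wire delay = 0.5 x Rmetal x Cmetal x L^2."""
    r_metal = S       # resistance per unit length grows with S
    c_metal = S       # capacitance per unit length grows with S
    length = 1 / S    # wire length shrinks with 1/S
    return r_metal * c_metal * length ** 2   # the constant 0.5 cancels out


S = 0.8 / 0.18        # example: 0.8 micron scaled down to 0.18 micron
U = 2.0               # arbitrary example voltage scaling factor
logic = logic_delay_scale(S, U)
wire = wire_delay_scale(S)
```

Logic gets faster by the factor S while the wire term stays put, so structures dominated by wire delay take up a growing share of the cycle as the feature size shrinks.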

  32. Register Renaming Table

  33. Dependency Checking Logic • Accessed in parallel with the map table • Every logical register is compared against the logical destination registers of the current rename group • For IW = 2, 4, 8, its delay is less than that of the map table

  34. Renaming Delay • SRAM scheme • Delay components: time to decode the arch reg index, time to drive the wordline, time to pull down the bitline, time for the sense amp to detect the pull-down • MUX time ignored, as the control from the dependency check logic arrives in advance

  35. Renaming Circuit

  36. Decoder Delay

  37. Decoder Delay • Predecoding for speed • Length of predecode lines depends on: • Cellheight: height of a single cell excluding wordlines • Wordline spacing • NVREG: # of virtual registers • x3: 3-operand instructions

  38. Decoder Delay • Tnand: fall delay of NAND • Tnor: rise delay of NOR • Rnandpd: NAND pull-down channel resistance + predecode line metal resistance • Ceq: diffusion capacitance of NAND + gate capacitance of NOR + interconnect capacitance

  39. Decoder Delay • Substituting the predecode line length, Req and Ceq, we get a delay that is quadratic in IW, with c2 the coefficient of the IW^2 term • c2: intrinsic RC delay of the predecode line • c2 very small • Decoder delay ~linearly dependent on IW

  40. Rename Delay • Wordline • c2: intrinsic RC delay of the wordline • c2 very small → wordline delay ~linearly dependent on IW

  41. Rename Delay • Bitline: c2 very small, so bitline delay ~linearly dependent on IW • SenseAmp delay ~linearly dependent on IW

  42. Rename Logic Delay Scaling • Total delay increases linearly with IW • Each component shows a linear increase with IW • Bitline delay > wordline delay • Bitline length ~ # of logical registers • Wordline length ~ width of physical register designator • IW impact on delay worsens with decreasing feature size (increase in bitline and wordline delay with increasing IW) • 0.8um: IW 2 → 8 gives bitline delay +37% • 0.18um: IW 2 → 8 gives bitline delay +53%

  43. Wakeup Delay • Critical path: mismatch → pull ready signal low • Delay components: • Tag drivers → drive the tag lines (vertical) • Mismatched bit: pull-down stack → pull the matchline low (horizontal) • Final OR gate → ORs all the matchlines of an operand tag • Ttagdrive ~ driver pull-up R, tagline length and tagline load C • Quadratic component significant for IW > 2 at 0.18um

  44. Wakeup Delay • Quadratic component small for both cases • Both delays ~linearly dependent on IW

  45. Wakeup Delay: IW and Window Size • 0.18um process • Quadratic dependence • Issue width has the greater effect → it increases all 3 delay components [Figure: delay as IW and WinSize increase together]

  46. Wakeup Delay: Window Size • 8-way, 0.18um process • Tag drive delay increases rapidly as WinSize increases • Match OR delay constant

  47. Wakeup Delay: Feature Size • 8-way, 64-entry window • Tag drive and tag match delays do not scale as well as Match OR delay • Match OR → logic delay • Others → also have wire delays

  48. Selection Logic and Bypass Delay • Selection: logarithmically dependent on WinSize • Bypass: delay dependent on (IW)^2
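
Both trends can be illustrated with a rough Python sketch. It rests on two assumptions that are not stated on the slide: the selection tree is built from 4-input arbiters, so its depth grows with log4(WinSize), and the bypass wire length grows in proportion to IW, so the L^2 term of the wire delay gives an IW^2 dependence. The function names and the baseline of IW = 4 are hypothetical.

```python
def arbiter_levels(win_size, fan_in=4):
    """Number of arbiter levels needed to cover win_size requests."""
    levels, span = 0, 1
    while span < win_size:
        span *= fan_in      # each extra level multiplies the covered window
        levels += 1
    return levels


def relative_bypass_delay(iw, base_iw=4):
    """Bypass wire delay relative to a base issue width, assuming L ~ IW."""
    return (iw / base_iw) ** 2


levels = {w: arbiter_levels(w) for w in (16, 32, 64, 128)}
bypass = relative_bypass_delay(8)   # doubling IW from 4 to 8
```

Growing the window from 16 to 128 entries adds only two arbiter levels, while doubling the issue width quadruples the modeled bypass wire delay.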
