
Realizing High IPC Through a Scalable, Multipath Microarchitecture


Presentation Transcript


  1. Realizing High IPC Through a Scalable, Multipath Microarchitecture David Kaeli Northeastern University Computer Architecture Research Laboratory Boston, MA USA

  2. The Team David Morano Alireza Khalafi Marcos de Alba Northeastern University Boston, MA USA Augustus Uht Sean Langford* University of Rhode Island Kingston, RI USA (*now at CMU)

  3. The Road to High IPC • Many studies have concluded that typical programs (e.g., SPECint) contain a significant amount of Instruction Level Parallelism (ILP) • Lam and Wilson reported an IPC of ~40 for SP-CD-MF (speculative execution, perfect control dependence information, multi-path execution) • Gonzalez and Gonzalez reported an IPC of ~37 for an infinite instruction window with no value prediction (IPC dropped to just under 10 for a 128-entry instruction window) • So why are we still living with low, single-digit IPCs??? • Nobody has been aggressive enough!!!

  4. Machine Philosophy • Issue a column of instructions on every cycle (not always possible) • Spend the rest of the time executing, squashing, snarfing and re-executing as necessary to preserve true control flow and data flow dependencies • Retire instructions a column at a time • Design a datapath that is scalable in terms of latency as the size of the machine grows • ISA independent
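
  As a rough illustration of the column-at-a-time discipline above, here is a minimal Python sketch (not the authors' FastLevo/LevoSim code) of a window that loads a column when there is room, lets everything execute out of order, and retires the oldest column once all of its instructions are done; the column size, window depth and helper functions are assumptions.

  # Hypothetical sketch of the column-at-a-time load/retire discipline.
  # The window is a FIFO of columns; a column is a list of instruction dicts.
  from collections import deque

  N_ROWS = 8       # instructions per column (assumed)
  N_COLUMNS = 4    # columns in the execution window (assumed)

  def step(window, fetch_column, execute_all):
      """One machine cycle: load a new column if there is room, let every
      active station execute/re-execute, then retire the oldest column once
      all of its instructions have settled."""
      if len(window) < N_COLUMNS:                 # load in order, a column at a time
          window.append(fetch_column(N_ROWS))
      execute_all(window)                         # rampantly out-of-order execution
      if window and all(i["done"] for i in window[0]):
          window.popleft()                        # retire in order, a column at a time

  # toy drivers, just to make the sketch runnable
  fetch = lambda n: [{"done": False} for _ in range(n)]
  def run_all(window):
      for col in window:
          for insn in col:
              insn["done"] = True                 # pretend everything finishes

  w = deque()
  for _ in range(10):
      step(w, fetch, run_all)
  print(len(w))   # 0: every loaded column retired in this toy run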

  5. Outline for this Talk • Overview of the Levo microarchitecture • Discussion of scalability within the Levo datapath • Disjoint execution • Simulation methodology and results • Comments and summary

  6. Levo Microarchitectural Features • In-order instruction load, in-order retirement, rampantly out-of-order execution • Active stations – a more intelligent version of Tomasulo’s reservation stations • Instruction/operand/memory/predicate time tags – used to enforce data and control dependencies in a distributed fashion • Hardware runtime predication – used for all basic blocks (BBs) with targets within the execution window • Distributed register file – reduces contention for a shared register file • Aggressive speculation – execute instructions independent of any data flow or control flow dependencies • Disjoint execution to cover control hazards • A limit study with real hardware constraints

  7. In-order Instruction Load • Instructions are fetched in static order from the I-cache, except: • Unconditional jump paths are followed • Loops are dynamically unrolled • For conditional branches with far targets (more than 2/3rds of the execution window away), if the branch is strongly predicted taken, static fetching resumes from the target • A conventional 2-level gshare branch predictor is used • Dynamic run-time predicates are generated so that every branch domain in the Execution Window is control independent • Nullify operations are broadcast to cause dependent instructions to re-execute
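
  The far-target rule can be made concrete with a small sketch. The window size, the word-aligned PC arithmetic and the predictor interface below are assumptions, not Levo's actual fetch logic; only the decision rule (follow unconditional jumps, redirect only on far, strongly-predicted-taken branches, otherwise keep fetching statically) comes from the slide.

  # Hypothetical sketch of the fetch-redirect rule on slide 7.
  from collections import namedtuple

  WINDOW_SIZE = 256                    # instructions in the execution window (assumed)
  FAR_THRESHOLD = 2 * WINDOW_SIZE // 3 # "greater than 2/3rds of the window"

  Insn = namedtuple("Insn", "kind target")

  class GsharePredictorStub:           # stand-in for the 2-level gshare predictor
      def strongly_taken(self, pc):
          return True

  def next_fetch_pc(pc, insn, predictor):
      if insn.kind == "jump":                        # unconditional: always follow
          return insn.target
      if insn.kind == "branch":
          distance = abs(insn.target - pc) // 4      # word-sized instructions assumed
          if distance > FAR_THRESHOLD and predictor.strongly_taken(pc):
              return insn.target                     # far + strongly taken: redirect
      return pc + 4                                  # otherwise keep fetching statically

  print(hex(next_fetch_pc(0x400000, Insn("branch", 0x401000), GsharePredictorStub())))
  # 0x401000: a far, strongly-predicted-taken branch redirects static fetch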

  8. Microarchitecture [Block diagram showing the I-Cache, the Memory Window, and the n x m time-ordered Execution Window]

  9. Active Stations • A more intelligent version of Tomasulo’s reservation stations • Each AS holds: • A single instruction • The instruction’s operands • A time tag denoting its logical position in the execution window • Each AS shares a processing element with a number of other ASs (as defined by the size of a sharing group)

  10. Active Stations • Communicate with other active stations in order to: • Snoop for the latest operand values • Forward results to other active stations • Request a value from other active stations • Re-execute their instruction with new operand values • Control flow changes are handled through runtime predication
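
  To summarize slides 9-10, a hedged sketch of the per-AS state follows; the field names and types are mine, not Levo's.

  # Hypothetical sketch of the state an Active Station (AS) holds.
  from dataclasses import dataclass, field

  @dataclass
  class ActiveStation:
      time_tag: tuple                 # (column, row): logical position in the window
      instruction: str                # the single instruction this AS holds
      operands: dict = field(default_factory=dict)  # reg name -> (value, producer tag)
      predicate: bool = True          # runtime predicate gating commitment
      result: object = None           # latest (speculative) result, rebroadcast on change
      sharing_group_size: int = 4     # ASs sharing one PE (assumed value)

  a_s = ActiveStation(time_tag=(0, 3), instruction="ADD R2,R2,#4")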

  11. Time Tags • Enforce the nominal sequential order of the instructions executed • Accompany all in-flight register values, memory values and predicate values • Have two parts: • Column tag – is decremented by 1 whenever the left-most column is loaded • Row tag – does not change
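
  A minimal sketch of the two-part time-tag mechanics, assuming program order is column-major over (column, row) as the n x m window on slide 12 suggests; the list-of-lists representation is purely illustrative.

  # Hypothetical sketch of slide 11: when the left-most column is loaded,
  # every live column tag is decremented by one; row tags never change.
  def advance_window(tags):
      """tags: list of [column, row] time tags for in-flight values."""
      for t in tags:
          t[0] -= 1            # column part shifts; row part stays

  def earlier(tag_a, tag_b):
      """Program order: compare the column part first, then the row part."""
      return tag_a < tag_b     # lexicographic (column, row) comparison

  tags = [[3, 0], [3, 5], [2, 7]]
  advance_window(tags)             # a new left-most column was loaded
  print(tags)                      # [[2, 0], [2, 5], [1, 7]] -- relative order unchanged
  print(earlier([1, 7], [2, 0]))   # True: column compared first, then row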

  12. Execution Window [Diagram: the Execution Window is an n-row by m-column array of Active Stations AS(row, column); a sharing group of 4 mainline ASs shares a single processing element (PE)]

  13. Active Station Operand Snooping and Snarfing [Diagram: an AS watches the result/operand forwarding buses; the broadcast time tag, address and path are compared against the AS’s own operand entries, and a match causes the value to be snarfed and the instruction to execute or re-execute]

  14. Time Tags vs. Renaming – An Example
  Instr. No.   (a) Program Code   (b) With Renaming   (c) With Time Tags
  1            R4 = 1             R4a = 1             R4 = 1, ResTT = 1
  5            R4 = 2             R4b = 2             R4 = 2, ResTT = 5
  9            R3 = R4            R3 = R4b            R3 = R4, LSTT = 1, then 5
  (a) Sequential execution. (b) Out-of-order (OOO) execution with renaming; at the end, R3 holds ‘2’. (c) OOO execution with the Last Snarfed Time Tag (LSTT) kept in the Active Station: I1’s result and Result Time Tag (ResTT) are broadcast and snarfed by I9 (R3 = 1, LSTT = 1); I5’s result and ResTT are then broadcast and snarfed (R3 = 2, LSTT = 5); at the end, R3 holds ‘2’. The same result is obtained if I5 broadcasts first: LSTT is set to and stays at ‘5’, so I1’s result is not snarfed by I9 – I9 snarfs only the I5 result.
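
  The snarfing rule in this example can be captured in a few lines. The class and method names below are mine; only the rule itself (snarf a broadcast whose Result Time Tag is not older than the last snarfed tag and is still earlier than the consumer's own tag) comes from slides 13-14.

  # Hypothetical sketch of the Last-Snarfed-Time-Tag (LSTT) rule.
  class OperandSlot:
      def __init__(self, reg, own_tag):
          self.reg = reg
          self.own_tag = own_tag
          self.lstt = -1          # last snarfed time tag; -1 = nothing snarfed yet
          self.value = None

      def snoop(self, reg, res_tt, value):
          """Called for every result broadcast seen on the forwarding bus."""
          if reg == self.reg and self.lstt <= res_tt < self.own_tag:
              self.lstt, self.value = res_tt, value
              return True          # snarfed: execute or re-execute with the new value
          return False

  # Replaying slide 14 with I5 broadcasting first:
  # I1 (R4=1, tag 1), I5 (R4=2, tag 5), consumer I9 (tag 9).
  i9 = OperandSlot("R4", own_tag=9)
  i9.snoop("R4", 5, 2)    # I5 broadcasts first -> snarfed, LSTT = 5
  i9.snoop("R4", 1, 1)    # I1 broadcasts later -> ignored (tag 1 < LSTT)
  print(i9.value)         # 2, matching sequential semantics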

  15. Scalable Microarchitecture • Time tag size grows linearly with the total number of ASs • No reorder buffer (which typically grows O(n²)) • No centralized architected register file • Register forwarding units hold the ISA-defined register state • Forwarding transactions maintain state • Segmented result buses – fixed length • Distributed L0 caching in the datapath

  16. Observation About Register Lifetimes • The MultiScalar project demonstrated that register lifetimes are short (spanning 1-2 basic blocks, within 32 instructions) • If instructions are laid out in a time-ordered fashion, the probability that we will have to forward a value very far forward in time is low • As a result, we can segment our interconnection fabric, assuming that communications will only span either the current, or at most the next, segment
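
  For clarity, one way to measure the lifetimes being discussed is sketched below: for each register definition in a dynamic trace, count the distance to its last use before redefinition. The trace format is hypothetical; the sketch is only meant to pin down what "short lifetime" means here.

  # Hypothetical sketch of a register-lifetime measurement over a dynamic trace.
  def register_lifetimes(trace):
      """trace: list of (dest_reg, [source_regs]) in dynamic program order."""
      last_def, lifetimes = {}, []
      for i, (dest, srcs) in enumerate(trace):
          for r in srcs:
              if r in last_def:
                  last_def[r] = (last_def[r][0], i)     # extend the last-use point
          if dest in last_def:                          # redefinition ends a lifetime
              def_pos, last_use = last_def[dest]
              lifetimes.append(last_use - def_pos)
          last_def[dest] = (i, i)
      return lifetimes

  trace = [("r1", []), ("r2", ["r1"]), ("r1", ["r2"]), ("r3", ["r1", "r2"])]
  print(register_lifetimes(trace))   # [1]: the first r1 value dies after one instruction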

  17. Segmented Buses (Spanning Buses) • Use segmented buses to propagate execution results to later stations • Adjacent segments are interconnected with Forwarding Units (one forwarding unit, per bus, per column) • Register Forwarding/Filter Units (RFUs) hold a version of the ISA register state • Memory Forwarding/Filter Units (MFUs) and Predicate Forwarding Units (PFUs) are also provided • Backwarding buses are also provided • The number of I/Os to a FU is independent of the machine size and depends only on the column height • Segmented buses help preserve scalability in the datapath

  18. [Diagram: three columns of the segmented bus fabric; in each column, groups of ASs attach to the M and D buses through Forwarding Units (FUs), and the FUs link each column’s buses to the previous and the next column]

  19. Register Forwarding/Filter Units • Capture the persistent register state • All buses are register transaction buses • Consolidate update transactions on input • Updates are forwarded to the output bus request logic immediately when possible • Requests are “filtered” based on time-tag value • Updates are managed in the file store in FIFO order [Diagram: RFU internals – a per-path ISA register file indexed by time tag, with forwarding and backwarding read and write buses running forward and backward in time]
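
  A hedged sketch of the time-tag filtering an RFU performs on read requests follows; the version store and method names are assumptions, and the real unit consolidates updates and arbitrates buses in hardware rather than sorting lists.

  # Hypothetical sketch of RFU request filtering: answer a read request with
  # the newest buffered register version that is still earlier (in time-tag
  # order) than the requester; otherwise pass the request backward.
  class RegisterForwardingUnit:
      def __init__(self):
          self.versions = {}   # reg -> list of (time_tag, value), kept in tag order

      def update(self, reg, time_tag, value):
          """Consolidate an update transaction arriving on a forwarding bus."""
          self.versions.setdefault(reg, []).append((time_tag, value))
          self.versions[reg].sort()                 # keep the version store time-ordered

      def filter_request(self, reg, requester_tag):
          """Return (time_tag, value) if this RFU can satisfy the request,
          else None so the request propagates toward earlier columns."""
          candidates = [(t, v) for t, v in self.versions.get(reg, [])
                        if t < requester_tag]
          return candidates[-1] if candidates else None

  rfu = RegisterForwardingUnit()
  rfu.update("r4", 5, 2)
  rfu.update("r4", 1, 1)
  print(rfu.filter_request("r4", 9))   # (5, 2): newest version older than tag 9
  print(rfu.filter_request("r4", 3))   # (1, 1)
  print(rfu.filter_request("r7", 9))   # None: pass the request backward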

  20. Memory Forwarding/Filter Units • Serve as an L0 cache • All buses are memory buses (their number set according to the interleave factor) • Consolidate update transactions on input • Updates are “forwarded” to the output bus request logic immediately when possible • Requests are “filtered” based on time-tag value • Current policy is to queue outgoing requests or responses in FIFOs until the buses are granted for use [Diagram: MFU internals – a memory cache indexed by time tag, with FIFOs feeding the forwarding and backwarding read and write buses]
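
  The FIFO queueing policy in the last bullet might look roughly like the sketch below; the transaction format, bus count and grant interface are assumptions.

  # Hypothetical sketch of the MFU's outgoing-transaction FIFO from slide 20.
  from collections import deque

  class MemoryForwardingUnit:
      def __init__(self, n_buses=2):               # bus count set by the interleave factor
          self.n_buses = n_buses
          self.l0 = {}                              # address -> (time_tag, value): L0 cache
          self.outgoing = deque()                   # FIFO of pending bus transactions

      def enqueue(self, txn):
          self.outgoing.append(txn)                 # queue until a bus is granted

      def bus_cycle(self, granted):
          """Send at most one queued transaction per granted bus this cycle."""
          sent = []
          for _ in range(min(granted, self.n_buses, len(self.outgoing))):
              sent.append(self.outgoing.popleft())
          return sent

  mfu = MemoryForwardingUnit()
  mfu.enqueue(("read", 0x1000, 9))                  # (op, address, requester time tag)
  mfu.enqueue(("write", 0x1008, 5, 42))
  print(mfu.bus_cycle(granted=1))                   # only one bus granted this cycle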

  21. Disjoint Path Execution • Levo can only obtain high IPC if: • we can provide a large window of instructions to execute • a large percentage of the instructions on the eventual committed control-flow path are included in the window • To address hard-to-predict conditional control flow, we utilize disjoint path spawning (Disjoint Eager Execution, DEE) in Levo

  22. Disjoint Path Execution • To enable path spawning we provide a disjoint-path (D-path) set of ASs that share a processing element with a mainline set of ASs • D-paths are spawned in the case of hammock branches • The D-path is copied from the mainline path • The sign of the associated predicate is inverted for the D-path • The D-path receives lower priority for the PE than the mainline • When a hammock branch is mispredicted, we can treat the D-path as the new mainline path and continue execution accordingly
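
  A minimal sketch of the spawn-and-swap rules on this slide, assuming AS state is reduced to a few dictionary fields; the priority encoding and the promotion step are my simplifications, not Levo's hardware.

  # Hypothetical sketch of D-path spawning for a hammock branch: copy the
  # mainline ASs, invert the predicate sense, give the copy lower PE priority,
  # and on a misprediction promote the D-path to mainline instead of refetching.
  import copy

  def spawn_dpath(mainline_domain):
      """mainline_domain: the mainline ASs inside the hammock branch's domain."""
      dpath = copy.deepcopy(mainline_domain)        # D-path is copied from the mainline
      for a_s in dpath:
          a_s["predicate"] = not a_s["predicate"]   # inverted predicate sense
          a_s["priority"] = "low"                   # mainline gets the shared PE first
      return dpath

  def on_branch_resolve(mispredicted, mainline, dpath):
      """On a hammock misprediction the D-path becomes the new mainline;
      no squash-and-refetch is needed."""
      if mispredicted:
          for a_s in dpath:
              a_s["priority"] = "high"              # promoted path now owns the PE
          return dpath, mainline
      return mainline, dpath

  mainline = [{"insn": "ADD R2,R2,#4", "predicate": True, "priority": "high"},
              {"insn": "SW 30(R4),R2", "predicate": True, "priority": "high"}]
  dpath = spawn_dpath(mainline)
  mainline, dpath = on_branch_resolve(True, mainline, dpath)
  print(mainline[0])   # the former D-path entry, with the inverted predicate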

  23. Disjoint Path Execution – Example
  Label    Addr   Instruction         History
  START:   A100   LW    R2,20(R4)
           A104   SUB   R2,R2,#1
           A108   BEQZ  R2,TAR1       Weakly T
           A10C   ADD   R2,R2,#4
           A110   SW    30(R4),R2
  TAR1:    A114   LW    R2,30(R4)
           A118   SUB   R2,R2,#8
           A11C   BEQZ  R2,TAR2       Weakly NT
           A120   SW    20(R4),R2
  TAR2:    A124   ADD   R2,R2,#10
           A128   SUB   R1,R1,#1
           A12C   BNEQZ R1,START      Strongly T
           A130   SW    40(R4),R2
  [Diagram: the resulting mainline path and disjoint path instruction sequences from this code, laid out side by side in the execution window]

  24. Modeling and Results • Present work utilizes: • MIPS-1/MIPS-2 machine • SGI compiler • SPECint 95 (compress, go and ijpeg) and SPECint 2000 (bzip2, crafty, gcc, gzip, mcf, parser and vortex) benchmarks • 3 levels of modeling: • Trace-driven model (FastLevo) – results in this presentation • Detailed cycle-accurate model (LevoSim) – still under development • Synthesizable VHDL hardware model (HDLevo) – validation • Design space exploration: • Impact of D-paths • Real vs. ideal memory • Range of bus latency issues

  25. Modeling parameters [table of simulation parameters; not captured in the transcript]

  26. Modeling parameters (continued) [table of simulation parameters; not captured in the transcript]

  27. IPC obtained with Levo [chart; not captured in the transcript]

  28. Speedup obtained using D-paths versus single-path execution (harmonic means) [chart; not captured in the transcript]

  29. IPC of Levo compared to modeling 100% L1 I/D hits (harmonic means) [chart; not captured in the transcript]

  30. Summary of Additional Experiments • Varying the L1-D/L2 hit time (versus 1 cycle): • Increased L1-D hit time to 2/4/8 cycles = 10/22/43% IPC loss • Increased L2 hit time to 2/4/8/16 cycles = 0.8/2.3/4.7/8.9% IPC loss • Varying the number of buses per FU: • Decreased to 1 bus/FU = 14% IPC loss • Increased to 4 buses/FU = 3% IPC gain • Removal of the stride predictor = 0.8% IPC loss • Varying the number of columns per D-path: • Increased to 2 columns/D-path = 8% IPC loss • Use of D-paths = 45% IPC gain • Varying the number of branch prediction tables: • Decreased from 1 per row to a single table of the same total size = 0.4% IPC loss

  31. Comments and Future Directions • I-fetch is the main barrier to further gains in IPC • The use of a detailed VHDL model of critical components in Levo has allowed us to design scalable resources • A number of novel microarchitectural features are present in a single design • Future challenges in Levo include: • Improved I-fetch – (EV8, trace cache, dynamic D-paths) • Finish design of an ARB-like memory • Consider compiler support to aid in-order issue and D-path execution • Consider multithreaded extensions to support coarse-grained multithreading

  32. To learn more, visit: http://www.ece.neu.edu/info/architecture/research/Levo.html Also see our paper at Euro-Par 2002.
