
Superscalar Processors



  1. Superscalar Processors • Superscalar Execution • How it can help • Issues: • Maintaining Sequential Semantics • Scheduling • Scoreboard • Superscalar vs. Pipelining • Example: Alpha 21164 and 21064

  2. Sequential Execution Semantics • Contract: The machine should appear to behave like this.

  3. Sequential Execution Semantics • We will be studying techniques that exploit the semantics of Sequential Execution. • Sequential Execution Semantics: • instructions appear as if they executed in the program-specified order • and one after the other • Alternatively: • At any given point in time we should be able to identify an instruction such that: • 1. All preceding instructions have executed • 2. No following instruction has executed

  4. Pipelined Execution • Pipelining: Partial Overlap of Instructions • Initiate one instruction per cycle • Subsequent instructions overlap partially • Commit one instruction per cycle [Figure: program order vs. pipelined overlap]

  5. Superscalar - In-order • Two or more consecutive instructions in the original program order can execute in parallel • This is the dynamic execution order • N-way Superscalar • Can issue up to N instructions per cycle • 2-way, 3-way, … [Figure: program order vs. pipelining vs. superscalar overlap]

  6. Superscalar vs. Pipelining • Example loop (sum += a[i--]): loop: ld r2, 10(r1); add r3, r3, r2; sub r1, r1, 1; bne r1, r0, loop [Figure: fetch/decode timing of ld, add, sub, bne under pipelining vs. 2-way superscalar]

  7. Superscalar Performance • Performance Spectrum? • What if all instructions were dependent? • Speedup = 1, i.e., superscalar buys us nothing • What if all instructions were independent? • Speedup = N, where N = superscalarity • Again, the key is typical program behavior • Some parallelism exists

  8. “Real-Life” Performance • OLTP = Online Transaction Processing • Source: P. Ranganathan, K. Gharachorloo, S. V. Adve, and L. A. Barroso, “Performance of Database Workloads on Shared-Memory Systems with Out-of-Order Processors,” ASPLOS ’98

  9. “Real-Life” Performance • SPEC CPU 2000 • SimpleScalar simulation: 32K I$ and D$, 8K bpred

  10. Superscalar Issue • An instruction at decode can execute if: • Dependences • RAW • Input operand availability • WAR and WAW • Must check against Instructions: • Simultaneously Decoded • In-progress in the pipeline (i.e., previously issued) • Recall the register vector from pipelining • Increasingly Complex with degree of superscalarity • 2-way, 3-way, …, n-way

  11. Issue Rules • Stall at decode if: • RAW dependence and no data available • Source registers against previous targets • WAR or WAW dependence • Target register against previous targets + sources • No resource available • This check is done in program order (see the sketch below)
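To make the issue rules concrete, here is a minimal Python sketch of the in-order check, assuming a simplified two-source, one-target instruction format; the names (Instr, issue_group) and the stop-at-first-stall policy are illustrative, not taken from the slides.

from typing import NamedTuple, Optional, Tuple

class Instr(NamedTuple):
    tgt: Optional[str]       # destination register (None if the instr writes nothing)
    srcs: Tuple[str, ...]    # source registers

def issue_group(group, pending_writes):
    """Return the prefix of `group` that may issue this cycle.
    `pending_writes` holds registers with an in-flight write (the
    scoreboard).  Checks run in program order: stall on RAW (a source
    has a pending write), WAW (the target has a pending write), or WAR
    (the target is read by an earlier instruction in the same group)."""
    issued = []
    writes = set(pending_writes)   # in-flight targets + earlier-in-group targets
    reads = set()                  # sources of earlier-in-group instructions
    for instr in group:
        raw = any(s in writes for s in instr.srcs)
        waw = instr.tgt is not None and instr.tgt in writes
        war = instr.tgt is not None and instr.tgt in reads
        if raw or waw or war:
            break                  # in-order issue: this and all later instrs stall
        if instr.tgt is not None:
            writes.add(instr.tgt)
        reads.update(instr.srcs)
        issued.append(instr)
    return issued

# The loop body from slide 6: only the ld issues; the add stalls on r2 (RAW).
group = [Instr("r2", ("r1",)),          # ld  r2, 10(r1)
         Instr("r3", ("r3", "r2")),     # add r3, r3, r2
         Instr("r1", ("r1",)),          # sub r1, r1, 1
         Instr(None, ("r1", "r0"))]     # bne r1, r0, loop
print(len(issue_group(group, set())))   # -> 1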

  12. Issue Mechanism – A Group of Instructions at Decode • Assume 2 source & 1 target max per instr. • comparators for 2-way: • 3 for tgt and 2 for src (tgt: WAW + WAR, src: RAW) • comparators for 4-way: • 2nd instr: 3 tgt and 2 src • 3rd instr: 6 tgt and 4 src • 4th instr: 9 tgt and 6 src • simplifications may be possible • resource checking not shown [Figure: tgt/src1/src2 comparator network across the group, in program order]
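A back-of-envelope count of those comparators, following the slide's scheme: each later instruction compares its sources against all earlier targets (RAW), and its target against all earlier targets and sources (WAW/WAR). The function name is made up for illustration.

def comparator_count(n_way, n_src=2, n_tgt=1):
    """Comparators needed to cross-check an n-way decode group."""
    tgt_cmps = src_cmps = 0
    for i in range(1, n_way):              # instruction i checks 0..i-1
        tgt_cmps += i * (n_tgt + n_src)    # its tgt vs earlier tgts (WAW) + srcs (WAR)
        src_cmps += i * n_tgt * n_src      # its srcs vs earlier tgts (RAW)
    return tgt_cmps, src_cmps

print(comparator_count(2))   # (3, 2): matches the 2-way numbers above
print(comparator_count(4))   # (18, 12): 3+6+9 tgt and 2+4+6 src comparators
# The cost grows quadratically with issue width, one reason wide
# in-order issue gets expensive.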

  13. Issue – Checking for Dependences with In-Flight Instructions • Naïve implementation: • Compare registers with all outstanding registers • RAW, WAR and WAW • How many comparators do we need? • Stages x Superscalarity x Regs per Instruction • Priority enforcers? • But we need some of this for bypassing anyway • RAW

  14. Issue – Checking for Dependences with In-Flight Instructions • Scoreboard: • Pending Write per register, one bit • Set at decode / Reset at writeback • Pending Read? • Not needed if all reads are done in order • WAR and WAW not possible • Can handle structural hazards • Busy indicators per resource • Can handle bypass • Where a register value is produced • R0 busy, in ALU0, at time +3
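A minimal scoreboard sketch matching the slide: one pending-write bit per register, set at decode and cleared at writeback; no pending-read bit is needed because reads happen in order, so WAR and WAW cannot slip through. Class and method names are illustrative.

class Scoreboard:
    def __init__(self, n_regs=32):
        self.pending_write = [False] * n_regs

    def can_issue(self, srcs, tgt):
        # RAW: some source has a pending write; WAW: so does the target.
        return (not any(self.pending_write[r] for r in srcs)
                and not self.pending_write[tgt])

    def decode(self, tgt):       # set the bit when the writer issues
        self.pending_write[tgt] = True

    def writeback(self, tgt):    # clear it when the value is written
        self.pending_write[tgt] = False

sb = Scoreboard()
sb.decode(2)                               # ld r2, ... is in flight
print(sb.can_issue(srcs=[3, 2], tgt=3))    # False: RAW on r2
sb.writeback(2)
print(sb.can_issue(srcs=[3, 2], tgt=3))    # True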

  15. Implications • Need to multiport some structures • Register File • Multiple Reads and Writes per cycle • Register Availability Vector (scoreboard) • Multiple Reads and Writes per cycle • From Decode and Commit • Also need to worry about WAR and WAW • Resource tracking • Additional issue conditions • Many Superscalars had additional restrictions • E.g., execute one integer and one floating point op • one branch, or one store/load

  16. Preserving Sequential Semantics • In principle not much different than pipelining • Program order is preserved in the pipeline • Some instructions proceed in parallel • But order is clearly defined • Defer interrupts to commit stage (i.e., writeback) • Flush all subsequent instructions • may include instructions committing simultaneously • Allow all preceding instructions to commit • Recall comparisons are done in program order • Must have sufficient time in clock cycle to handle these (sketch below)
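A sketch of deferring an exception to the commit point, assuming a simple in-order list of completed-but-uncommitted instructions; the representation (dicts with an "exception" flag) is hypothetical.

def commit(window):
    """Commit in program order.  An exception is taken only when the
    faulting instruction reaches the commit point: everything before it
    commits, and it plus everything after it (including instructions
    that would have committed the same cycle) is flushed."""
    for i, entry in enumerate(window):
        if entry["exception"]:
            return window[:i], window[i:]   # (committed, flushed)
    return window, []

done, flushed = commit([{"op": "ld",  "exception": False},
                        {"op": "add", "exception": False},
                        {"op": "div", "exception": True},
                        {"op": "bne", "exception": False}])
print([e["op"] for e in done])     # ['ld', 'add']: all preceding committed
print([e["op"] for e in flushed])  # ['div', 'bne']: none following committed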

  17. Preserving Sequential Semantics • Example loop (sum += a[i--]): loop: ld r2, 10(r1); add r3, r3, r2; sub r1, r1, 1; bne r1, r0, loop [Figure: the same pipelined vs. 2-way superscalar timing, revisited for ordering]

  18. Interrupts Example [Figure: pipeline timelines for ld, add, div, bne, bne, showing where the exception is raised mid-pipeline vs. where it is actually taken]

  19. Superscalar and Pipelining • In principle they are orthogonal • Superscalar non-pipelined machine • Pipelined non-superscalar • Superscalar and Pipelined (common) • Additional functionality needed by Superscalar: • Another bound on clock cycle • At some point it limits the number of pipeline stages

  20. Superscalar vs. Superpipelining • Superpipelining: • Vaguely defined as deep pipelining, i.e., lots of stages • Superscalar issue complexity limits super-pipelining • How do they compare? • 2-way Superscalar vs. Twice the stages • Not much difference. [Figure: 2-way superscalar (two fetch/decode/inst per cycle) vs. superpipelined (F1 F2 D1 D2 E1 E2) timing]

  21. Superscalar vs. Superpipelining • WANT 2X PERFORMANCE: [Figure: eight overlapped instructions, 2-way superscalar vs. superpipelined (F1 F2 D1 D2 E1 E2)]

  22. Superscalar vs. Superpipelining • WANT 2X PERFORMANCE: RAW [Figure: the same two timelines with a RAW dependence; dependent instructions repeat stages (stall) in both the superscalar and the superpipelined machine]

  23. Pipeline Performance • g = fraction of time the pipeline is filled • 1-g = fraction of time the pipeline is not filled (stalled) • As 1-g grows, performance suffers (see the model below)
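The slide gives only the definition of g; one simple way to quantify its effect (an assumption, not the slide's own model) is to treat a filled pipeline as completing one instruction per cycle and a stalled one as completing none, so speedup over an unpipelined machine scales with g:

def pipelined_speedup(g, depth):
    """Rough model: filled fraction g runs at 1 IPC; an unpipelined
    machine of the same depth runs at 1/depth IPC."""
    return g * depth

for g in (1.0, 0.9, 0.5):
    print(f"g = {g:.1f}: ~{pipelined_speedup(g, 5):.1f}x over unpipelined")
# g = 1.0: ~5.0x   g = 0.9: ~4.5x   g = 0.5: ~2.5x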

  24. Superscalar vs. Superpipelining: Another View • Source: Lipasti, Shen, Wood, Hill, Sohi, Smith (CMU/Wisconsin) • Amdahl’s Law [Figure: work performed over time; fraction f spread across N processors, fraction 1-f serial] • f = fraction that is vectorizable (parallelism) • v = speedup for f • Overall speedup = 1 / ((1 - f) + f / v)
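The same formula, executable, using the slide's definitions of f and v:

def amdahl(f, v):
    """Overall speedup when fraction f of the work is sped up by v."""
    return 1.0 / ((1.0 - f) + f / v)

print(round(amdahl(0.9, 8), 2))              # 4.71: 90% parallel across 8 units
print(round(amdahl(0.9, float("inf")), 2))   # 10.0: even infinite v is capped at 1/(1-f)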

  25. Amdahl’s Law: Sequential Part Limits Performance • Parallelism can’t help if there isn’t any • Even if v is infinite • Performance is limited by the nonvectorizable portion (1-f) [Figure: the same work/time diagram, N processors, fractions f and 1-f]

  26. Amdahl’s Law

  27. Case Study: Alpha 21164

  28. 21164: Int. Pipe

  29. 21164: Memory Pipeline

  30. 21164: Floating-Point Pipe

  31. Performance Comparison [Figure: 4-way vs. 2-way performance comparison]

  32. CPI Comparison

  33. Compiler Impact [Figure: base vs. optimized performance]

  34. Issue Cycle Distribution - 21164

  35. Issue Cycle Distribution - 21064

  36. Stall Cycles - 21164 [Figure: breakdown of data dependence/data stalls vs. no-instruction stalls]

  37. Stall Cycles Distribution • Model: a cycle counts as a stall when no instruction is committing • Does not capture overlapping factors: • Stall due to a dependence while committing • Stall due to a cache miss while committing

  38. Replay Traps • Tried to do something and couldn’t • Store and write-buffer is full • Can’t complete instruction • Load and miss-address-file full • Can’t complete instruction • Assumed cache hit and it was a miss • Dependent instructions executed • Must re-execute dependent instructions • Re-execute the instruction and everything that follows

  39. Replay Traps Explained • ld r1 • add _, r1 [Figure: pipeline timing. Cache hit: ld runs F D E M W and the add issues as F D D E M W. Cache miss: ld becomes F D E M M W and the add stalls further, F D D D E M W]

  40. Optimistic Scheduling • ld r1 • add _, r1 [Figure: scheduling decisions start early in the pipeline; the add should start execution right after the load, but hit/miss is only known at M, after the point where the scheduler must decide that the add executes]

  41. Optimistic Scheduling #2 • ld r1 • add _, r1 [Figure: same timing, but the scheduler guesses hit/miss at the decision point so the add can start on time; a sketch follows]
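A toy sketch of hit-speculative scheduling and the resulting replay trap, assuming the scheduler always guesses "hit" and that the instructions after the load depend on it; names and structure are illustrative.

def schedule_load_use(load_hits, window):
    """window[0] is a load; the rest follow it in program order.
    Issue everything on the guess that the load hits; if it actually
    missed, the guess is discovered too late to stall, so the load's
    followers must re-execute (the replay trap)."""
    issued = list(window)               # speculatively issued on "hit"
    replay = [] if load_hits else window[1:]
    return issued, replay

issued, replay = schedule_load_use(False, ["ld r1", "add _, r1", "bne"])
print(issued)   # all three issued on the optimistic guess
print(replay)   # ['add _, r1', 'bne']: re-executed after the miss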

  42. Stall Distribution

  43. 21164 Microarchitecture • Instruction Fetch/Decode + Branch Units • Integer Execution Unit • Floating-Point Execution Unit • Memory Address Translation Unit • Cache Control and Bus Interface • Data Cache • Instruction Cache • Second-Level Cache

  44. Instruction Decode/Issue • Up to four insts/cycle • Naturally aligned groups • Must start at 16 byte boundary (INT16) • Simplifies Fetch path (in a second) • All of group must issue before next group gets in • Simplifies Scheduling • No need for reshuffling

  45. Instruction Decode/Issue • Up to four insts/cycle • Naturally aligned groups • Must start at 16 byte boundary (INT16) • Simplifies Fetch path • Where do instructions come from? [Figure: what the I-Cache provides vs. what the CPU needs]

  46. Fetching Four Instructions • Where do instructions come from? [Figure: what the I-Cache provides vs. what the CPU needs] • Software must guarantee alignment at 16-byte boundaries • Lots of NOPs (see the sketch below)
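A small sketch of the naturally aligned (INT16) fetch-group rule, assuming fixed 4-byte instructions: a branch target that lands mid-group wastes the leading slots, which is why the compiler pads hot targets with NOPs. The function name is made up.

def fetch_group(pc):
    """Return (group base, useful instruction slots) for a 16-byte,
    four-instruction naturally aligned fetch group."""
    base = pc & ~0xF              # groups start on 16-byte boundaries
    dead = (pc - base) // 4       # slots before the target are wasted
    return base, 4 - dead

base, n = fetch_group(0x1000)
print(hex(base), n)               # 0x1000 4: a fully useful group
base, n = fetch_group(0x1008)
print(hex(base), n)               # 0x1000 2: half the group is dead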

  47. Instruction Buffer and Prefetch • I-buffer feeding issue • 4-entry, 8-instruction prefetch buffer • Check I-Cache and PB in parallel • PB hit: Fill Cache, Feed pipeline • PB miss: Prefetch four lines

  48. Branch Execution • One cycle delay → Calc. target PC • Naïve implementation: • Can fetch only every other cycle • Branch Prediction to avoid the delay • Up to four pending branches in stage 2 • Assignment to Functional Units • One at stage 3 • Instruction Scheduling/Issue • One at stage 4 • Instruction Execution • Flush and execute from the right PC

  49. Return Address Stack • Returns → Target Address Changes • Conventional Branch Prediction can’t handle them • Predictable change • Return address = Call site return point • Detect Calls • Push return address onto hardware stack • Return pops address • Speculative • 12-entry “stack” • Circular queue → overflow/underflow messes it up (sketch below)
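A sketch of the return-address stack as a fixed-size circular queue, per the slide; with 12 entries, a deep enough call chain wraps around and silently corrupts the oldest predictions. A tiny 2-entry stack makes the overflow visible. Names are illustrative.

class ReturnAddressStack:
    def __init__(self, size=12):
        self.buf = [0] * size
        self.top = 0                         # next push position

    def push(self, ret_addr):                # on a call
        self.buf[self.top % len(self.buf)] = ret_addr
        self.top += 1                        # overflow overwrites the oldest

    def pop(self):                           # on a return: predicted target
        self.top -= 1                        # underflow wraps around too
        return self.buf[self.top % len(self.buf)]

ras = ReturnAddressStack(size=2)
for addr in (0x100, 0x200, 0x300):           # three nested calls
    ras.push(addr)
print(hex(ras.pop()))   # 0x300: correct
print(hex(ras.pop()))   # 0x200: correct
print(hex(ras.pop()))   # 0x300, not 0x100: the overflow corrupted this one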

  50. Instruction Translation Buffer • Translate Virtual Addresses to Physical • 48-entry fully-associative • Pages 8KB to 4MB • Not-last-used/Not-MRU replacement • 128 Address space identifiers
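A simplified fully-associative TLB lookup in the spirit of this slide: every entry is checked in parallel in hardware (sequentially here), each entry carries its own power-of-two page size and an address-space number. The classes and fields are hypothetical; the real 21164 mechanism differs in detail.

class TLBEntry:
    def __init__(self, vpage, pframe, page_size, asn):
        self.vpage, self.pframe = vpage, pframe      # virtual/physical base
        self.page_size, self.asn = page_size, asn    # power-of-two size, ASN

def translate(tlb, vaddr, asn):
    for e in tlb:                                    # fully associative: try all
        offset_mask = e.page_size - 1
        if (vaddr & ~offset_mask) == e.vpage and e.asn == asn:
            return e.pframe | (vaddr & offset_mask)
    return None                                      # miss: fill from page tables

tlb = [TLBEntry(vpage=0x2000, pframe=0x9000, page_size=8192, asn=5)]
print(hex(translate(tlb, 0x2ABC, asn=5)))   # 0x9abc
print(translate(tlb, 0x2ABC, asn=7))        # None: ASN mismatch, different space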
