
Procesadores Superescalares


Presentation Transcript


  1. Procesadores Superescalares Prof. Mateo Valero Las Palmas de Gran Canaria, 26 November 1999

  2. Initial developments • Mechanical machines • 1854: Boolean algebra by G. Boole • 1904: Diode vacuum tube by J.A. Fleming • 1945: Stored-program concept by J. von Neumann • 1946: ENIAC by J.P. Eckert and J. Mauchly • 1949: EDSAC by M. Wilkes • 1951: UNIVAC I • 1952: IBM 701

  3. ENIAC 1946

  4. EDSAC 1949

  5. Pipeline

  6. Superscalar Processor [Pipeline diagram: Fetch → Decode → Rename → Instruction Window (wakeup + select) → Register File → Bypass → Data Cache; scalable pipes] • Fetch of multiple instructions every cycle. • Renaming of registers to eliminate added dependencies. • Instructions wait for source operands and for functional units. • Out-of-order execution, but in-order graduation.

  7. Technology Trends and Impact [Plot: delay in ps vs. technology generation, for issue width = 4 and 8 and ROB size = 32 and 64] S. Palacharla et al., "Complexity Effective…", ISCA 1997, Denver.

  8. Physical Scalability [Plot: die reachable (percent) vs. processor generation (0.25, 0.18, 0.13, 0.1, 0.08, 0.06 microns)] Doug Matzke, "Will Physical Scalability…", IEEE Computer, Sept. 1997, pp. 37-39.

  9. Register influence on ILP • 8-way fetch/issue • Window of 256 entries • Up to 1 taken branch • Gshare, 64K entries • One-cycle latency • Spec95

  10. Register File Latency • 66% and 20% performance improvement when moving from 2- to 1-cycle latency

  11. Outline • Virtual-physical registers • A register file cache • VLIW architectures

  12. Virtual-Physical Registers • Motivation • Conventional renaming scheme • Virtual-Physical Registers [Timeline: Icache → Decode & Rename → … → Commit, with intervals where the physical register is used, unused, and used again]

  13. Example • Latencies: cache miss: 20, fdiv: 20, fmul: 10, fadd: 5

    load f2, 0(r4)            load p1, 0(r4)
    fdiv f2, f2, f10  rename  fdiv p2, p1, p10
    fmul f2, f2, f12    →     fmul p3, p2, p12
    fadd f2, f2, 1            fadd p4, p3, 1

• Register pressure (average registers per cycle): Conventional: 3.6, Virtual-Physical: 0.7
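To make the conventional side of this example concrete, here is a minimal Python sketch of rename-at-decode with a map table and a free list; the structures and names are illustrative, not the slide's hardware. Run on the four instructions above, it reproduces the p1-p4 mapping and shows why the chain holds physical registers from decode onward even though the values arrive much later.

    free_list = [f"p{i}" for i in range(1, 33)]   # free physical registers
    map_table = {}                                # logical -> physical

    def rename(op, dst, srcs):
        srcs = [map_table.get(s, s) for s in srcs]  # unmapped sources read as-is
        preg = free_list.pop(0)                     # allocated already at decode
        map_table[dst] = preg
        return op, preg, srcs

    code = [("load", "f2", ["0(r4)"]),
            ("fdiv", "f2", ["f2", "f10"]),
            ("fmul", "f2", ["f2", "f12"]),
            ("fadd", "f2", ["f2", "1"])]

    for inst in code:
        print(rename(*inst))
    # load -> p1, fdiv -> p2 (reads p1), fmul -> p3 (reads p2), fadd -> p4 (reads p3)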

  14. Percentage of Used/Wasted Registers

  15. Virtual-Physical Registers • Physical registers play two different roles • Keep track of dependences (decode) • Provide a storage location for results (write-back) • Proposal: three types of registers • Logical: architected registers • Virtual-Physical (VP): keep track of dependences • Physical: store values • Approach • Decode: rename from logical to VP • Write-back (or issue): rename from VP to physical
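A minimal sketch of this two-step renaming, assuming simple dictionary map tables (the names and structures are mine, not the paper's): decode hands out a VP tag that only tracks the dependence, and a physical register is taken only when the value is produced at write-back.

    from itertools import count

    vp_tags = (f"vp{i}" for i in count(1))      # dependence tags, no storage
    phys_free = [f"p{i}" for i in range(1, 9)]  # real registers, allocated late
    general_map = {}   # logical -> VP tag      (updated at decode)
    phys_map = {}      # VP tag  -> physical    (updated at write-back)

    def decode(dst, srcs):
        srcs = [general_map.get(s, s) for s in srcs]
        tag = next(vp_tags)          # no physical register consumed here
        general_map[dst] = tag
        return tag, srcs

    def write_back(tag, value, register_file):
        preg = phys_free.pop(0)      # storage is allocated only now
        phys_map[tag] = preg
        register_file[preg] = value
        return preg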

  16. Virtual-Physical Registers • Hardware support [Diagram: General Map Table (Lreg → VPreg), Physical Map Table (VPreg → Preg, valid bit), instruction queue entries with Src1, Src2, Dest, ROB entries with Lreg and VPreg; pipeline stages Fetch → Decode → Issue → Execute → Write-back → Commit]

  17. Virtual-Physical Registers • No free physical register • Re-execute but… if it is the oldest instruction… • Avoiding deadlock • A number (NRR) of registers is reserved for the oldest instructions (see the sketch below) • 21% speedup for Spec95 on an 8-way issue machine [HPCA-4] • Conclusions • Optimal NRR is different for each program • For a given program, the best NRR may be different for different sections of code
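The deadlock-avoidance rule can be stated as a small allocation check. This is a sketch of my reading of the slide, not the paper's circuit: the last NRR free physical registers are granted only to the NRR oldest in-flight instructions, so the head of the ROB can always complete.

    NRR = 4   # number of reserved registers (tuned per program on the slides)

    def may_allocate(num_free, rob_position):
        """rob_position: 0 for the oldest in-flight instruction."""
        if num_free > NRR:
            return True               # general pool: anyone may allocate
        return rob_position < NRR     # reserved pool: oldest instructions only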

  18. Virtual-Physical Registers • Performance evaluation • SimpleScalar OoO with modified renaming • 8-way issue • RUU: 128 entries • FU (latency): 8 simple int. (1), 4 int. mult. (7), 6 simple FP (4), 4 FP mult. (4), 4 FP div. (16), 4 memory ports • L1 Dcache: 32 KB, 2-way, 32 B/line, 1 cycle • L1 Icache: 32 KB, 2-way, 64 B/line, 1 cycle • L2 cache: 1 MB, 2-way, 64 B/line, 12 cycles • Main memory: 50 cycles • Branch prediction: 18-bit Gshare, 2 taken branches • Benchmarks: SPEC95, Compaq/DEC compilers, -O5

  19. Virtual-Physical Registers • Performance evaluation

  20. IPC and NRR

  21. Virtual-Physical Registers • What is the optimal allocation policy? • Approximation • Registers should be allocated to the instructions that can use them earlier (avoid unused registers) • If some instruction must stall because of the lack of registers, choose the latest instructions (delaying the earliest would also delay the commit of the latest) • Implementation • Each instruction allocates a physical register at write-back; if none is available, it steals the register from the latest instruction after the current one (see the sketch below)
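A sketch of this stealing policy under assumed data structures (an age-ordered list of in-flight instructions, each holding at most one result register); it is illustrative, not the paper's implementation.

    def allocate_at_write_back(free_regs, in_flight):
        """in_flight: instructions ordered oldest -> youngest, each a dict
        with a 'preg' field (None while it holds no physical register)."""
        if free_regs:
            return free_regs.pop()
        for victim in reversed(in_flight):       # try the youngest first
            if victim["preg"] is not None:
                stolen = victim["preg"]
                victim["preg"] = None            # victim re-requests later
                return stolen
        raise RuntimeError("no register to steal")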

  22. DSY Performance • SpecInt95 • SpecFP95

  23. Performance and Number of Registers • SpecInt95 • SpecFP95

  24. Outline • Virtual-physical registers • A register file cache • VLIW architectures

  25. Register Requirements

  26. Register File Latency • 66% and 20% performance improvement when moving from 2- to 1-cycle latency

  27. Register File Bypass SpecInt95

  28. Register File Bypass SpecFP95

  29. Register File Cache • Organization • Bank 1 (Register File): all registers (128), 2-cycle latency • Bank 2 (Reg. File Cache): a subset of the registers (16), 1-cycle latency [Diagram: RF and RFC]

  30. Experimental Framework • OoO simulator, 8-way issue/commit • Functional units (latency): 2 simple integer (1), 3 complex integer: mult. (2), div. (14), 4 simple FP (2), 2 FP div. (14), 3 branch (1), 4 load/store • 128-entry ROB • 16-bit Gshare • Icache and Dcache: 64 KB, 2-way set-associative, 1/8-cycle hit/miss • Dcache: lock-up free, 16 outstanding misses • Benchmarks: Spec95, DEC compiler, -O4 (int.), -O5 (FP), 100 million instructions after initialization • Access time and area models: extension of the Wilton & Jouppi models

  31. Caching Policy (1 of 3) • First policy • Many values (85% int and 84% FP) are used at most once • Thus, only non-bypassed values are cached • FIFO replacement [Diagram: RF and RFC]
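A toy model of this first policy, with the sizes from the slides (128-register RF, 16-entry RFC) but otherwise assumed structure: results that were already bypassed to their consumer skip the RFC, and the RFC evicts in FIFO order.

    from collections import OrderedDict

    RFC_ENTRIES = 16
    rfc = OrderedDict()    # preg -> value; insertion order gives FIFO

    def write_result(preg, value, was_bypassed, register_file):
        register_file[preg] = value      # main RF always gets the value
        if was_bypassed:
            return                       # consumer already served: skip RFC
        if len(rfc) == RFC_ENTRIES:
            rfc.popitem(last=False)      # FIFO eviction of the oldest entry
        rfc[preg] = value

    def read(preg, register_file):
        # 1-cycle hit in the RFC, 2-cycle access to the main RF otherwise
        return rfc[preg] if preg in rfc else register_file[preg]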

  32. Performance • 20% and 4% improvement over 2-cycle • 29% and 13% degradation over 1-cycle

  33. Caching Policy (2 of 3) • Second policy • Cache values that are sources of any non-issued instruction with all its operands ready • Not issued because of lack of functional units • or because the other operand is in the main register file [Diagram: RF and RFC]
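The selection test of the second policy, sketched in Python with assumed instruction records: a value is cached if some waiting instruction has all operands ready but could not issue, either for lack of a functional unit or because its other operand still sits in the main register file.

    def should_cache(preg, waiting_instructions, rfc):
        for inst in waiting_instructions:
            if preg in inst["sources"] and all(inst["ready"].values()):
                others = [s for s in inst["sources"] if s != preg]
                if inst["stalled_on_fu"] or any(s not in rfc for s in others):
                    return True
        return False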

  34. Performance • 24% and 5% improvement over 2-cycle • 25% and 12% degradation over 1-cycle

  35. Caching Policy (3 of 3) • Third policy • Values that are sources of any non-issued instruction with all its operands ready • Prefetching • Table that for each physical register indicates which is the other operand of the first instruction that uses it • Replacement: give priority to those values already read at least once
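A sketch of the third policy's prefetching table (field names assumed, not the paper's): for each physical register it records the companion operand of the first instruction that reads it, and when the register enters the RFC the companion is pulled in from the main register file as well.

    companion = {}   # preg -> the other source of its first consumer

    def note_first_consumer(inst):
        srcs = inst["sources"]
        if len(srcs) == 2:
            a, b = srcs
            companion.setdefault(a, b)   # keep only the first consumer's pair
            companion.setdefault(b, a)

    def on_rfc_insert(preg, rfc, register_file):
        other = companion.get(preg)
        if other is not None and other not in rfc:
            rfc[other] = register_file[other]    # prefetch the companion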

  36. Performance • 27% and 7% improvement over 2-cycle • 24% and 11% degradation over 1-cycle

  37. Speed for Different RFC Architectures • Taking access time into account • SpecInt95

  38. Speed for Different RFC Architectures • SpecFP95

  39. Conclusions • Register file access time is critical • Virtual-physical registers significantly reduce the register pressure • 24% improvement for SpecFP95 • A register file cache can reduce the average access time • 27% and 7% improvement for a two-level, locality-based partitioning architecture

  40. High performance instruction fetch through software/hardware cooperation Alex Ramirez, Josep Ll. Larriba-Pey, Mateo Valero. UPC-Barcelona

  41. Superscalar Processor [Pipeline diagram: Fetch → Decode → Rename → Instruction Window (wakeup + select) → Register File → Bypass → Data Cache] • Fetch of multiple instructions every cycle. • Renaming of registers to eliminate added dependencies. • Instructions wait for source operands and for functional units. • Out-of-order execution, but in-order graduation. J.E. Smith and S. Vajapeyam, "Trace Processors…", IEEE Computer, Sept. 1997, pp. 68-74.

  42. Motivation [Diagram: Instruction Fetch & Decode → Instruction Queue(s) → Instruction Execution, with branch/jump outcomes fed back to fetch] • Instruction fetch rate is important not only in steady state • Program start-up • Misspeculation points • Program segments with little ILP

  43. Motivation • Instruction fetch effectively limits the performance of superscalar processors • Even more relevant at program startup points • More aggressive processors need higher fetch bandwidth • Multiple basic block fetching becomes necessary • Current solutions need extensive additional hardware • Branch address cache • Collapsing buffer: multi-ported cache • Trace cache: special purpose cache

  44. PostgreSQL [Plot: 64KB I1, 64KB D1, 256KB L2]

  45. Program Behaviour • 64KB I1, 64KB D1, 256KB L2

  46. The Fetch Unit (1 of 3) [Diagram: scalar fetch unit; fetch address → instruction cache (i-cache) and branch prediction mechanism → next address logic, shift & mask → to decode] • Scalar Fetch Unit • Few instructions per cycle • 1 branch • Limitations • Prediction accuracy • I-cache miss rate • Prev. work, code reordering (software: reduce cache misses) • Fisher (IEEE Tr. on Comp. ’81) • Hwu and Chang (ISCA’89) • Pettis and Hansen (SIGPLAN’90) • Torrellas et al. (HPCA’95) • Kalamatianos et al. (HPCA’98)

  47. The Fetch Unit (2 of 3) [Diagram: aggressive core fetch unit; fetch address → instruction cache (i-cache), branch target buffer, return stack, multiple branch predictor → next address logic, shift & mask → to decode] • Aggressive Fetch Unit • Many instructions per cycle • Several branches • Limitations • Prediction accuracy • Sequentiality • I-cache miss rate • Prev. work, trace building (hardware: form traces at run time) • Yeh et al. (ICS’93) • Conte et al. (ISCA’95) • Rotenberg et al. (MICRO’96) • Friendly et al. (MICRO’97)

  48. Trace Cache [Diagram: control-flow graph with basic blocks b0-b8] • A trace is a sequence of logically contiguous instructions. • A trace cache line stores a segment of the dynamic instruction trace across multiple, potentially taken, branches (b1-b2-b4, b1-b3-b7, …). • It is indexed by fetch address and branch outcomes: a history-based fetch mechanism.
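A minimal sketch of the history-based indexing described on this slide; the field widths and structures are illustrative assumptions. A line is selected by the fetch address plus the predicted outcomes of the next branches, so the traces b1-b2-b4 and b1-b3-b7 occupy different entries even though both start at b1.

    MAX_BRANCHES = 3   # branches per trace line (assumed)
    trace_cache = {}   # (fetch_addr, outcome_bits) -> list of instructions

    def lookup(fetch_addr, predicted_outcomes):
        bits = tuple(predicted_outcomes[:MAX_BRANCHES])
        return trace_cache.get((fetch_addr, bits))       # None on a miss

    def fill(fetch_addr, outcomes, instructions):
        bits = tuple(outcomes[:MAX_BRANCHES])
        trace_cache[(fetch_addr, bits)] = instructions   # from the fill buffer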

  49. The Fetch Unit (3 of 3) [Diagram: aggressive core fetch unit extended with a trace cache (t-cache) and a fill buffer, filled from fetch or commit] • The trace cache aims at forming traces at run time.

  50. Our Contribution • Mixed software-hardware approach • Optimize performance at compile-time • Use profiling information • Make optimum use of the available hardware • Avoid redundant work at run-time • Do not repeat what was done at compile-time • Adapt hardware to the new software • Software Trace Cache • Profile-directed code reordering & mapping • Selective Trace Storage • Fill Unit modification
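As a rough illustration of profile-directed code reordering, here is a toy seed-and-grow chaining pass in the spirit of the Pettis and Hansen work cited earlier; it is not the authors' Software Trace Cache algorithm, only the general idea of making hot paths sequential in memory.

    def build_trace(seed, edge_counts, placed):
        """edge_counts: {block: {successor: profile count}}.
        Grows a layout chain from a hot seed block by always following
        the most frequently taken successor edge not yet placed."""
        trace, block = [], seed
        while block is not None and block not in placed:
            trace.append(block)
            placed.add(block)
            succs = {b: c for b, c in edge_counts.get(block, {}).items()
                     if b not in placed}
            block = max(succs, key=succs.get) if succs else None
        return trace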
