
Microarchitecture of Superscalars (4) Decoding








  1. Microarchitecture of Superscalars (4) Decoding Dezső Sima Fall 2007 (Ver. 2.0) © Dezső Sima, 2007

  2. Overview
1. Overview
2. Straightforward parallel decoding
3. Predecoding
4. Decoding with CISC/RISC conversion
  4.1 Overview
  4.2 Decoding into µops
  4.3 Decoding into macroops
5. Using a trace cache
6. Decoding with instruction grouping
  6.1 Overview
  6.2 Grouping of RISC instructions
  6.3 Grouping of CISC instructions

  3. 1. Overview
Decoding techniques used in superscalars:
– Straightforward parallel decoding: 1st-gen. RISC superscalars
– Predecoding: beginning with 2nd-gen. superscalars
– Decoding with CISC/RISC conversion: beginning with 2nd-gen. superscalar CISCs
  – Decoding into µops: Intel, beginning with the Pentium Pro (Pentium M, Core); AMD (up to two µops)
  – Decoding into macroops: AMD, beginning with the K7 (Athlon), K8 (Hammer)
– Using a trace cache: P4-family
– Decoding with instruction grouping
  – Grouping of RISC instructions (POWER4, POWER5)
  – Grouping of CISC instructions

  4. 2 Straightforward parallel decoding Figure 2.1: The PowerPC 601’s front end Source: Stokes, J.H., „PowerPC on Apple: An architecture history”, Aug. 2004. http://arstechnica.com/articles

  5. 3 Predecoding (1) Figure 3.1: Contrasting the decoding and instruction issue in a scalar and a 4-way superscalar processor (figure labels: I-cache, instruction buffer, Decode / Issue / Check; typical FX pipeline layouts for scalar vs. superscalar issue)

  6. 3 Predecoding (1) When instructions are written into the I-cache, the predecode unit of a RISC processor appends 4-7 bits to each instruction. AMD's CISC processors append n bits to each byte (K5, K6: 5 bits/byte; K7, K8: 3 bits/byte). Figure 3.2: The principle of predecoding (second-level cache (or memory) → predecode unit → I-cache; typically 128 bits/cycle in, e.g. 148 bits/cycle out) Source: Sima, D. et al., „ACA", Addison-Wesley 1997
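To make the idea concrete, here is a minimal C sketch of RISC-style predecoding: as a line is filled from the L2, each fixed-length instruction gets a small class tag appended in the I-cache. The encoding, opcode ranges and names are assumptions for illustration only; real predecode units append 4-7 bits per instruction (or, in AMD's CISC parts, a few bits per byte) and encode considerably more.

```c
#include <stdint.h>
#include <stddef.h>

/* Assumed predecode classes (not any real processor's format). */
enum { PRE_ALU = 0x1, PRE_LOADSTORE = 0x2, PRE_BRANCH = 0x4 };

/* Assumed opcode placement: top 6 bits of a 32-bit instruction word. */
static uint8_t predecode_bits(uint32_t insn)
{
    uint32_t opcode = insn >> 26;
    if (opcode >= 0x30) return PRE_BRANCH;      /* assumed branch opcode range     */
    if (opcode >= 0x20) return PRE_LOADSTORE;   /* assumed load/store opcode range */
    return PRE_ALU;
}

/* Fill one I-cache line: copy the instruction words and append their
 * predecode bits, so the decoders need not re-examine the opcodes later. */
void fill_icache_line(const uint32_t *from_l2, size_t n,
                      uint32_t *line_insns, uint8_t *line_pre)
{
    for (size_t i = 0; i < n; i++) {
        line_insns[i] = from_l2[i];
        line_pre[i]   = predecode_bits(from_l2[i]);
    }
}
```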

  7. 3 Predecoding (2) Figure 3.3: The introduction of predecoding Source: Sima, D. et al., „ACA”, Addison-Wesley 1997

  8. 3 Predecoding (3) Figure 3.4: Variable length instruction decoding in the Athlon Source: de Vries, H., „Understanding the detailed Architecture of AMD's 64 bit Core", Sept. 2003, http://www.chip-architect.com

  9. 3 Predecoding (4) Figure 3.5: Opteron's instruction cache and decoding Source: de Vries, H., „Understanding the detailed Architecture of AMD's 64 bit Core", Sept. 2003, http://www.chip-architect.com

  10. 4 Decoding with CISC/RISC conversion 4.1 Overview CISC instructions are decoded with CISC/RISC conversion into internal operations (µops or macroops) and executed on a RISC core; the program state is modified at retirement, after RISC/CISC re-conversion. Examples: PPro, K6. Figure 4.1: Principle of decoding with CISC/RISC conversion Source: Sima, D. et al., „ACA", Addison-Wesley 1997

  11. 4.2 Decoding into µops (1) Figure 4.2: The Microarchitecture of the Pentium Pro Source: Shanley, T. ,”Pentium Pro Processor System Architecture”, Addison-Wesley Press, 1997

  12. 4.2 Decoding into µops (2) Figure 4.3: Basic misprediction pipeline of the Pentium III Source: Hinton, G. et al., „The Microarchitecture of the Pentium 4 Processor”, Intel Technology Journal Q1, 2001

  13. 4.2 Decoding into µops (3) Figure 4.4: Decoding in AMD's K6 Source: Shriver, B., Smith, B., „The Anatomy of a High-Performance Microprocessor", IEEE Computer Society Press, 1998

  14. 4.2 Decoding into µops (4) Figure 4.5: The Microarchitecture of the Pentium M (Yonah) Source: Kanter, D., „Intel’s next Generation Microarchitecture Unveiled”, Real World Tech., 2006 March 9.

  15. 4.2 Decoding into µops (5) Figure 4.6: The Microarchitecture of the Core processor family Source: Kanter, D., „Intel’s next Generation Microarchitecture Unveiled”, Real World Tech., 2006 March 9.

  16. 4.3 Decoding into macroops (1) Figure 4.7: The Microarchitecture of the AMD Athlon™ Source: Meyer, D., „The AMD-K7 Processor", MPF, Oct. 1998

  17. 4.3 Decoding into macroops (2) Figure 4.8: Decoding in the Athlon (1) Source: Meyer, D., „The AMD-K7 Processor”, MPF. Oct. 1998

  18. 4.3 Decoding into macroops (3) Figure 4.9: Decoding in the Athlon (2) Source: Meyer, D., „The AMD-K7 Processor”, MPF. Oct. 1998

  19. 4.3 Decoding into macroops (4) Each MacroOp: 1 or 2 operations (OPs), e.g.:
ADD EAX, EBX → 1 ADD OP
AND EAX, [EBX+16] → 1 LOAD OP + 1 AND OP
Up to 3 MacroOps per cycle, with up to 3 FX + 2 L/S OPs (dual-ported D$!) per cycle
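A minimal C sketch of the 1-or-2-OPs-per-MacroOp idea shown by the two examples above. The types and the decode_to_macroop() helper are illustrative assumptions, not AMD's internal format.

```c
/* One x86 instruction maps to one MacroOp holding 1 or 2 primitive OPs. */
typedef enum { OP_NONE, OP_LOAD, OP_ALU } op_kind;

typedef struct {
    op_kind ops[2];     /* at most two primitive operations per MacroOp */
    int     n_ops;
} macroop;

/* has_mem_operand is an assumed flag: true for forms like AND EAX,[EBX+16]. */
macroop decode_to_macroop(int has_mem_operand)
{
    macroop m = { { OP_NONE, OP_NONE }, 0 };
    if (has_mem_operand)
        m.ops[m.n_ops++] = OP_LOAD;   /* AND EAX,[EBX+16]: load the memory operand first */
    m.ops[m.n_ops++] = OP_ALU;        /* the ADD/AND operation itself */
    return m;                         /* ADD EAX,EBX -> 1 OP; AND EAX,[EBX+16] -> 2 OPs */
}
```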

  20. 4.3 Decoding into macroops (5) Figure 4.10: The Microarchitecture of the Hammer Source: Weber, F., „AMD’s Next Generation Microprocessor Architecture”, MPF. Oct. 2001

  21. 5 Using a trace cache (1) Figure 5.1: The Microarchitecture of the Pentium 4 (Willamette)

  22. 5 Using a trace cache (2) Figure 5.2: Basic misprediction pipeline of the Pentium 4 (Willamette) Source: Hinton, G. et al., „The Microarchitecture of the Pentium 4 Processor”, Intel Technology Journal Q1, 2001

  23. 5 Using a trace cache (3) Figure 5.3: The Microarchitecture of the Pentium 4 (Prescott) Source: Kanter, D., „Intel’s next Generation Microarchitecture Unveiled”, Real World Tech., 2006 March 9.

  24. 6. Decoding with instruction grouping 6.1 Overview
Decoding with instruction grouping:
– Grouping of RISC instructions: K7 (Athlon), K8 (Hammer), POWER4, POWER5
– Grouping of CISC instructions: Pentium M, Core arch.

  25. 6.2 Grouping of RISC instructions (1) Up to 3 MacroOps are decoded per cycle; these MacroOps are allocated a line in the ROB. The ROB has 24 lines of 3 entries each. The ROB retires a line if it is the oldest one and all MacroOps in that line are completed. Figure 5.3: Instruction grouping in the K7 and K8 Source: de Vries, H., „Understanding the detailed Architecture of AMD's 64 bit Core", Sept. 2003, http://www.chip-architect.com
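A minimal C sketch of line-based retirement under the assumptions stated above (24 lines of 3 MacroOp entries, the oldest line retires only when all of its MacroOps have completed). Field and function names are made up for illustration.

```c
#include <stdbool.h>

#define ROB_LINES 24
#define LINE_SIZE 3

typedef struct {
    bool valid;
    bool completed[LINE_SIZE];   /* completion status of the 3 MacroOps in this line */
} rob_line;

typedef struct {
    rob_line line[ROB_LINES];
    int oldest;                  /* index of the oldest allocated line */
} rob;

/* Retire the oldest line only when every MacroOp in it has completed. */
bool try_retire_line(rob *r)
{
    rob_line *l = &r->line[r->oldest];
    if (!l->valid)
        return false;
    for (int i = 0; i < LINE_SIZE; i++)
        if (!l->completed[i])
            return false;        /* retirement is in order, whole lines at a time */
    l->valid  = false;
    r->oldest = (r->oldest + 1) % ROB_LINES;
    return true;
}
```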

  26. 6.2 Grouping of RISC instructions (2) Figure 6.1: Out-of-order execution of MacroOps from the FX schedulers in the K8L (to be introduced in Q2 2007); the K8L scheduler has 8×3 entries vs. 6×3 in the K8 (figure labels: Decoders, Schedulers, EUs) Source: Malich, Y., „AMD's Next Generation Microarchitecture Preview: from K8 to K8L", Aug. 2006.

  27. 6.2 Grouping of RISC instructions (3) Dispatch: instruction groups are dispatched in-order and the individual instructions are forwarded to the issue queues. Execute: individual instructions are executed out of order in the execution units. Retire: instruction groups are retired in-order from the ROB, modifying the program state. Figure 6.1: The principle of instruction grouping in IBM's POWER4 and POWER5 processors
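A minimal C sketch of the group lifecycle described above: groups are dispatched and retired in order, while their individual instructions execute out of order in between. The group size of 5 and all names are assumptions for illustration, not IBM's exact design.

```c
#include <stdbool.h>

#define GROUP_SIZE 5             /* assumed group capacity */

typedef struct {
    bool done[GROUP_SIZE];       /* set out of order, as each instruction finishes executing */
    int  n;                      /* instructions actually placed in this group */
} group;

/* Dispatch: the group enters the machine in order; its instructions are then
 * forwarded individually to the issue queues and may execute out of order. */
void dispatch(group *g)
{
    for (int i = 0; i < g->n; i++)
        g->done[i] = false;
}

/* Retire: program state is modified only when the whole group (and all older
 * groups) has completed -- retirement stays in order, one group at a time. */
bool can_retire(const group *g)
{
    for (int i = 0; i < g->n; i++)
        if (!g->done[i])
            return false;
    return true;
}
```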

  28. 6.2 Grouping of RISC instructions (4) Figure 6.2: Implementation of instruction grouping in IBM's POWER5 processor Source: Sinharoy, B. et al., „POWER5 system microarchitecture", IBM J. Res. & Dev., July/Sept. 2005.

  29. 6.3 Grouping of CISC instructions (1) (Intel: macro-op fusion) Introduced in the Core architecture. Terminology: x86 instructions are macro-ops; internal instructions are μops. Macro-op fusion combines two macro-ops into a single μop. Specifically, x86 compare or test instructions are fused with x86 jumps to produce a single μop. Any decoder can perform macro-op fusion, but only one macro-op fusion can be performed in each cycle. In the Core architecture the max. decode bandwidth is therefore 4+1 x86 instructions/cycle. Macro-op fusion can reduce the number of μops by about 10%.
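A minimal C sketch of the fusion rule stated above: a CMP or TEST followed by a conditional jump is emitted as one fused μop, and the caller allows at most one fusion per decode cycle. The types and the decode_one() helper are hypothetical, not Intel's decoder interface.

```c
typedef enum { X86_CMP, X86_TEST, X86_JCC, X86_OTHER } x86_kind;
typedef struct { x86_kind kind; } x86_insn;
typedef struct { int is_fused; x86_kind first, second; } uop;

/* Decode one uop from the instruction stream; returns the number of x86
 * instructions consumed (2 if fused, else 1). fusion_budget models the
 * "only one macro-op fusion per cycle" restriction and is managed by the caller. */
int decode_one(const x86_insn *in, int remaining, int fusion_budget, uop *out)
{
    if (fusion_budget > 0 && remaining >= 2 &&
        (in[0].kind == X86_CMP || in[0].kind == X86_TEST) &&
        in[1].kind == X86_JCC) {
        out->is_fused = 1;                 /* compare/test + jump -> a single fused uop */
        out->first  = in[0].kind;
        out->second = in[1].kind;
        return 2;                          /* this is how 4+1 instructions/cycle arises */
    }
    out->is_fused = 0;
    out->first  = in[0].kind;
    out->second = X86_OTHER;
    return 1;
}
```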

  30. 6.3 Grouping of CISC instructions (2) Benefits: • Fewer μops → increased performance • ooo execution becomes more effective, as the instruction window now covers more (~10%) x86 instructions
