
Evolution of the ILP Processing



  1. Evolution of the ILP Processing Dezső Sima Fall 2008 (Ver. 2.0) © Dezső Sima, 2008

  2. Foreword The steady demand for higher processor performance has provoked the successive introduction of temporal, issue and intra-instruction parallelism into processor operation. Consequently, traditional sequential processors, pipelined processors, superscalar processors and superscalar processors with multimedia and 3D support mark subsequent evolutionary phases of microprocessors.

  On the other hand, the introduction of each basic technique mentioned gave rise to specific system bottlenecks whose resolution called for innovative new techniques. Thus, the emergence of pipelined instruction processing stimulated the introduction of caches and of speculative branch processing. The debut of superscalar instruction issue gave rise to more advanced memory subsystems and to more advanced branch processing. The desire to further increase the per-cycle performance of first generation superscalars called for avoiding their issue bottleneck through the introduction of shelving, renaming and a concerted enhancement of all relevant subsystems of the microarchitecture. Finally, the utilization of intra-instruction parallelism through SIMD instructions required an adequate extension of the ISA and the system architecture.

  With the main dimensions of parallelism more or less exhausted in the second generation superscalars for general purpose applications, increasing the clock frequency remained the single major possibility to increase performance further. The rapid increase of clock frequencies, however, led to limits of evolution, as discussed in Chapter II.

  3. Structure 1. Paradigms of ILP-processing 2. Introduction of temporal parallelism 3. Introduction of issue parallelism 3.1. VLIW processing 3.2. Superscalar processing 4. Introduction of data parallelism 5. The main road of evolution 6. Outlook

  4. 1. Paradigms of ILP-processing

  5. 1.1. Introduction (1) Figure 1.1: Evolution of computer classes [Timeline 1950–2000: supercomputers (ENIAC, NORC, CDC-6600, Cray-1, Cray-2, Cray-3, Cray-4, Cray T3E); mainframes (UNIVAC, IBM /360, /370, /390, z/900); minicomputers (PDP-8, PDP-11, VAX); servers/workstations (RS/6000, Xeon); microcomputers, desktop and value PCs (4004, 8080, 8088, Altair, 80286, 80386, 80486, Pentium, PPro, PII, PIII, P4, Celeron)]

  6. 1.1. Introduction (2) Figure 1.2: The integer performance of Intel’s x86 line of processors

  7. 1.2. Paradigms of ILP-processing (1) [Diagram: temporal parallelism → pipeline processors; issue parallelism with static dependency resolution → VLIW processors]

  8. VLIW processing [Diagram: a stream of independent instructions (static dependency resolution) is fetched and executed (F, E) in parallel by the processor] VLIW: Very Long Instruction Word

  9. 1.2. Paradigms of ILP processing (1) [Diagram: temporal parallelism → pipeline processors; issue parallelism with static dependency resolution → VLIW processors, with dynamic dependency resolution → superscalar processors]

  10. Superscalar processing vs. VLIW processing [Diagram: VLIW processing executes independent instructions (static dependency resolution); superscalar processing executes dependent instructions with dynamic dependency resolution; both fetch and execute (F, E) several instructions in parallel] VLIW: Very Long Instruction Word

  11. 1.2. Paradigms of ILP processing (1) [Diagram, extended: temporal parallelism → pipeline processors; issue parallelism with static/dynamic dependency resolution → VLIW/superscalar processors; data parallelism → SIMD extension]

  12. 1.2. Paradigms of ILP-processing (2) Figure 1.3: The emergence of ILP-paradigms and processor types [Sequential processing → temporal parallelism: pipeline processors (~’85) → issue parallelism: VLIW and EPIC processors (static dependency resolution), superscalar processors (dynamic dependency resolution) (~’90) → data parallelism: superscalar processors with SIMD extension (~’95–’00)]

  13. 1.3. Performance potential of ILP-processors (1) [Figure: absolute performance growing from sequential through pipeline and VLIW/superscalar processing to the SIMD extension; ideal case vs. real case]

  14. 1.3. Performance potential of ILP-processors (2) Performance components of ILP-processors: absolute performance = clock frequency × per-cycle efficiency, where the per-cycle efficiency comprises the contributions of temporal, issue and data parallelism and the efficiency of speculative execution. The clock frequency depends on the technology and the μarchitecture; the per-cycle efficiency depends on the ISA, μarchitecture, system architecture, OS, compiler and application.
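The decomposition on this slide, absolute performance = clock frequency × per-cycle efficiency, can be illustrated with a minimal Python sketch. All numbers below are invented for illustration, not measured values from the lecture:

```python
# Illustration of the slide's performance decomposition:
# absolute performance = clock frequency * per-cycle efficiency, where
# per-cycle efficiency aggregates the contributions of temporal, issue
# and data parallelism and the efficiency of speculative execution.
# The factor values below are made-up examples.

def absolute_performance(f_clock_hz, temporal, issue, data, spec_eff):
    """Instructions/second as the product of the slide's components."""
    per_cycle_efficiency = temporal * issue * data * spec_eff
    return f_clock_hz * per_cycle_efficiency

# e.g. a 200 MHz processor sustaining ~2 instructions/cycle
# at 90% speculation efficiency:
perf = absolute_performance(200e6, temporal=1.0, issue=2.0,
                            data=1.0, spec_eff=0.9)
print(f"{perf:.3g} instructions/s")  # 3.6e+08 instructions/s
```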

  15. 2. Introduction of temporal parallelism

  16. 2.1. Introduction (1) Types of temporal parallelism in ILP processors (F: fetch cycle, D: decode cycle, E: execute cycle, W: write cycle) Figure 2.1: Implementation alternatives of temporal parallelism [Pipeline processors overlap all phases of consecutive instructions i, i+1, i+2, i+3; examples: Atlas (1963), IBM 360/91 (1967), i80386 (1985), M68030 (1988), R2000 (1988)]
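The overlapping of the F, D, E, W phases shown in Figure 2.1 can be tabulated with a few lines of Python. This is an idealized model with no stalls; only the 4-stage layout is taken from the figure:

```python
# Ideal 4-stage pipeline: instruction i occupies stage s in cycle i + s,
# so consecutive instructions overlap all of their phases.

STAGES = ["F", "D", "E", "W"]

def pipeline_timetable(n_instructions):
    """Return {instruction index: {cycle: stage}} for an ideal pipeline."""
    table = {}
    for i in range(n_instructions):
        table[i] = {i + s: STAGES[s] for s in range(len(STAGES))}
    return table

# Four instructions finish in 4 + (4 - 1) = 7 cycles
# instead of 4 * 4 = 16 cycles sequentially.
tt = pipeline_timetable(4)
last_cycle = max(c for row in tt.values() for c in row)
print(last_cycle + 1)  # 7
```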

  17. 2.1. Introduction (2) Figure 2.2: The appearance of pipeline processors [Timeline 1980–1992 of pipeline (scalar) processors: x86 line (80286, 80386, 80486), Motorola M68000 line (68020, 68030, 68040), MIPS R line (R2000, R3000, R6000, R4000)]

  18. 2.2. Processing bottlenecks evoked and their resolution 2.2.1. Overview: the scarcity of memory bandwidth (2.2.2) and the problem of branch processing (2.2.3)

  19. 2.2.2. The scarcity of memory bandwidth (1) Moving from sequential to pipeline processing, more instructions and data need to be fetched per cycle, so a larger memory bandwidth is required.

  20. 2.2.2. The scarcity of memory bandwidth (2) Figure 2.3: Introduction of caches [Timeline 1980–1992 distinguishing pipeline (scalar) processors without caches from those with caches in the x86, M68000 and MIPS R lines; C(n): universal cache (size in kB), C(n/m): instruction/data caches (sizes in kB)]

  21. 2.2.3. The problem of branch processing (1) (e.g. in case of conditional branches) Figure 2.4: Processing of a conditional branch on a 4-stage (F D E W) pipeline [bc: conditional branch, bti: branch target instruction; the condition checking (branch!) and the branch address calculation follow the decode of bc, so the sequentially fetched instructions ii+1, ii+2 must be discarded and bti can only be fetched cycles later]
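The cost of the taken branch in Figure 2.4 can be expressed as a simple bubble count. A hedged Python sketch, assuming the branch outcome is known at the end of a given stage (the stage indices here are illustrative assumptions, not the lecture's exact timing):

```python
# If a taken conditional branch is only resolved in a late pipeline
# stage, every sequential instruction fetched in the meantime must be
# discarded; the number of discarded instructions equals the number of
# stages between fetch and resolution.

def taken_branch_penalty(resolve_stage, fetch_stage=0):
    """Bubble cycles caused by a taken branch on an ideal pipeline."""
    return resolve_stage - fetch_stage

# With stages F=0, D=1, E=2, W=3: resolving the branch in E means the
# two instructions fetched while bc moved from F to E are discarded.
print(taken_branch_penalty(resolve_stage=2))  # 2
```

The deeper the pipeline (the later the resolve stage), the larger this penalty, which is what motivates the branch prediction introduced on the next slides.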

  22. 2.2.3. The problem of branch processing (2) Figure 2.5: Principle of branch prediction in case of a conditional branch [Basic blocks of instructions delimited by conditional branches; at each conditional branch a path is guessed and followed speculatively until the guess is approved]
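The slide states only the principle of guessing a path. One common concrete mechanism, shown here as an illustrative Python sketch rather than anything specified in the lecture, is a 2-bit saturating-counter predictor:

```python
# 2-bit saturating counter per branch: states 0-1 predict "not taken",
# states 2-3 predict "taken". Two wrong guesses are needed to flip the
# prediction, which tolerates a single anomalous outcome.

class TwoBitPredictor:
    def __init__(self):
        self.state = 2  # start weakly taken (arbitrary choice)

    def predict(self):
        return self.state >= 2

    def update(self, taken):
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)

# A loop-closing branch (taken, taken, ..., not taken) is mispredicted
# only once, at the loop exit.
p = TwoBitPredictor()
mispredictions = 0
for taken in [True] * 5 + [False]:
    if p.predict() != taken:
        mispredictions += 1
    p.update(taken)
print(mispredictions)  # 1
```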

  23. 2.2.3. The problem of branch processing (3) Figure 2.6: Introduction of branch prediction in (scalar) pipeline processors [Timeline 1980–1992 of the x86, M68000 and MIPS R lines, marking the models with speculative execution of branches]

  24. 2.3. Generations of pipeline processors (1)
                                   Cache    Speculative branch processing
  1st generation pipelined         no       no
  1.5 generation pipelined         yes      no
  2nd generation pipelined         yes      yes

  25. 2.3. Generations of pipeline processors (2) Figure 2.7: Generations of pipeline processors [Timeline 1980–1992 classifying the x86, M68000 and MIPS R lines into 1st generation pipelined (no cache, no speculative branch processing), 1.5 generation pipelined (cache, no speculative branch processing) and 2nd generation pipelined (cache, speculative branch processing)]

  26. 2.4. Exhausting the available temporal parallelism 2nd generation pipeline processors already exhaust the available temporal parallelism.

  27. 3. Introduction of issue parallelism

  28. 3.1. Options to implement issue parallelism Pipeline processing is extended either by VLIW (EPIC) instruction issue with static dependency resolution (3.2) or by superscalar instruction issue with dynamic dependency resolution (3.3).

  29. 3.2. VLIW processing (1) Figure 3.1: Principle of VLIW processing [VLIW instructions with independent sub-instructions (static dependency resolution) are fetched from memory/cache and their sub-instructions dispatched in parallel to the execution units (EUs) of the VLIW processor, ~10–30 EUs]

  30. 3.2. VLIW processing (2) VLIW: Very Long Instruction Word (term coined in 1983 by Fisher). Length of sub-instructions: ~32 bits; instruction length: ~n × 32 bits, where n is the number of execution units (EUs). Static dependency resolution with parallel optimization requires a complex VLIW compiler.
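The n × 32-bit instruction format just described can be illustrated with a small Python sketch; the NOP encoding, the slot order and the example opcodes are assumptions for illustration only:

```python
# One long instruction word bundles n 32-bit sub-instructions, one slot
# per execution unit; slots the compiler cannot fill with independent
# work hold NOPs. Encodings below are invented for illustration.

NOP = 0x00000000  # assumed NOP encoding

def pack_vliw(sub_instructions, n_slots):
    """Pack up to n_slots 32-bit sub-instructions into one n*32-bit word."""
    assert len(sub_instructions) <= n_slots
    padded = sub_instructions + [NOP] * (n_slots - len(sub_instructions))
    word = 0
    for op in padded:
        word = (word << 32) | (op & 0xFFFFFFFF)
    return word

# With n = 4 EUs but only 2 independent sub-instructions found,
# half of the 128-bit word is NOPs -- the partial-filling drawback
# noted a few slides later.
word = pack_vliw([0x12345678, 0x9ABCDEF0], n_slots=4)
print(hex(word))  # 0x123456789abcdef00000000000000000
```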

  31. The term ‘VLIW’ 3.2. VLIW processing (3) Figure 3.2: Experimental and commercially available VLIW processors Source: Sima et al., ACA, Addison-Wesley, 1997

  32. 3.2. VLIW processing (4) Benefits of static dependency resolution: less complex processors, earlier appearance, and either a higher fc or larger ILP.

  33. 3.2. VLIW processing (5) Drawbacks of static dependency resolution: a completely new ISA; new compilers and OS; rewriting of applications; achieving the critical mass needed to convince the market. The compiler uses technology-dependent parameters (e.g. latencies of EUs and caches, repetition rates of EUs) for dependency resolution and parallel optimization, so new processor models require new compiler versions.

  34. 3.2. VLIW processing (6) Drawbacks of static dependency resolution (cont.): VLIW instructions are only partially filled, so memory space and bandwidth are poorly utilized.

  35. 3.2. VLIW processing (7) Commercial VLIW processors: Trace (1987) from Multiflow and Cydra-5 (1989) from Cydrome. Within a few years both firms went bankrupt; their developers moved to HP and IBM, where they became initiators/developers of EPIC processors.

  36. 3.2. VLIW processing (8) VLIW → EPIC: integration of SIMD instructions and advanced superscalar features. 1994: Intel and HP announced their cooperation; 1997: the term EPIC was born; 2001: IA-64 → Itanium.

  37. 3.3. Superscalar processing 3.3.1. Introduction (1) Superscalar processing extends pipeline processing with superscalar instruction issue. Main attributes of superscalar processing: dynamic dependency resolution and a compatible ISA.

  38. 3.3.1. Introduction (2) Figure 3.3: Experimental superscalar processors Source: Sima et al., ACA, Addison-Wesley, 1997

  39. 3.3.1. Introduction (3) Figure 3.4: Emergence of superscalar processors Source: Sima et al., ACA, Addison-Wesley, 1997

  40. 3.3.2. Attributes of first generation superscalars (1)
  Width: 2–3 RISC instructions/cycle or 2 CISC instructions/cycle „wide”
  Core: static branch prediction
  Caches: single-ported, blocking L1 data caches; off-chip L2 caches attached via the processor bus
  Examples: Alpha 21064, PA 7100, Pentium

  41. 3.3.2. Attributes of first generation superscalars (2) Consistency of processor features (1) Dynamic instruction frequencies in general purpose applications: FX instructions ~40%, load instructions ~30%, store instructions ~10%, branches ~20%, FP instructions ~1–5%. Available parallelism in general purpose applications assuming direct issue: ~2 instructions/cycle (Wall 1989; Lam, Wilson 1992). Source: Sima et al., ACA, Addison-Wesley, 1997

  42. 3.3.2. Attributes of first generation superscalars (3) Consistency of processor features (2)
  Reasonable core width: 2–3 instructions/cycle
  Required number of data cache ports (np): np ~ 0.4 × (2–3) = 0.8–1.2 instructions/cycle → single-port data caches
  Required EUs (each L/S instruction generates an address calculation as well):
  FX: ~0.8 × (2–3) = 1.6–2.4 → 2–3 FX EUs
  L/S: ~0.4 × (2–3) = 0.8–1.2 → 1 L/S EU
  Branch: ~0.2 × (2–3) = 0.4–0.6 → 1 B EU
  FP: ~(0.01–0.05) × (2–3) → 1 FP EU
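This back-of-the-envelope sizing can be reproduced directly: multiply the core width by the dynamic frequency of each instruction class from the previous slide, counting the address calculation of every load/store toward the FX demand, as the slide does:

```python
# Dynamic instruction frequencies in general purpose applications
# (from the lecture: FX 40%, load 30%, store 10%, branch 20%).
FREQ = {"FX": 0.4, "load": 0.3, "store": 0.1, "branch": 0.2}

def required_rates(width):
    """Instructions/cycle each unit class must sustain at a given core width."""
    ldst = FREQ["load"] + FREQ["store"]   # 0.4 -> L/S unit and cache-port demand
    fx = FREQ["FX"] + ldst                # 0.8 -> FX, incl. L/S address calcs
    return {
        "FX": fx * width,
        "L/S": ldst * width,
        "branch": FREQ["branch"] * width,
        "cache_ports": ldst * width,
    }

# First-generation widths 2-3 reproduce the slide's numbers, e.g.
# FX demand 1.6-2.4 -> 2-3 FX EUs, cache-port demand 0.8-1.2 -> 1 port.
for w in (2, 3):
    print(w, required_rates(w))
```

The same function applied to wider cores (widths 4–5) reproduces the second-generation sizing shown later, which is why those designs need dual-ported data caches and more FX units.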

  43. 3.3.3. The bottleneck evoked and its resolution (1) The issue bottleneck Figure 3.5: The principle of direct issue ((a): simplified structure of the microarchitecture assuming direct issue; (b): the issue process)

  44. 3.3.3. The bottleneck evoked and its resolution (2) Eliminating the issue bottleneck Figure 3.6: Principle of the buffered (out-of-order) issue
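The principle of buffered issue can be illustrated with a toy Python model. This is a deliberate simplification, not the slide's microarchitecture: register names, the initially available registers and the zero-latency timing are invented for the example.

```python
# Decoded instructions wait in a buffer (shelving); in each "cycle" any
# instruction whose source operands are ready may issue, regardless of
# program order. A blocked instruction therefore no longer stalls the
# independent instructions behind it, as it would with direct issue.

def buffered_issue(instructions, issue_width):
    """instructions: list of (name, srcs, dest) tuples. Returns issue order."""
    ready = {"r1", "r2"}          # assumed initially available registers
    buffer_ = list(instructions)  # the shelving buffer
    order = []
    while buffer_:
        # issue up to issue_width instructions whose sources are ready
        issued = [i for i in buffer_ if set(i[1]) <= ready][:issue_width]
        if not issued:
            break
        for name, srcs, dest in issued:
            order.append(name)
            ready.add(dest)       # result available to later cycles
            buffer_.remove((name, srcs, dest))
    return order

# i2 depends on i1's result r3; the independent i3 overtakes it.
prog = [("i1", ("r1",), "r3"), ("i2", ("r3",), "r4"), ("i3", ("r2",), "r5")]
print(buffered_issue(prog, issue_width=2))  # ['i1', 'i3', 'i2']
```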

  45. 3.3.3. The bottleneck evoked and its resolution (3) First generation (narrow) superscalars → second generation (wide) superscalars: elimination of the issue bottleneck and, in addition, widening of the processing width of all subsystems of the core.

  46. 3.3.4. Attributes of second generation superscalars (1)
  Width: first generation „narrow” superscalars: 2–3 RISC instructions/cycle or 2 CISC instructions/cycle; second generation „wide” superscalars: 4 RISC instructions/cycle or 3 CISC instructions/cycle
  Core: first generation: static branch prediction; second generation: buffered (ooo) issue, predecoding, dynamic branch prediction, register renaming, ROB
  Caches: first generation: single-ported, blocking L1 data caches, off-chip L2 caches attached via the processor bus; second generation: dual-ported, non-blocking L1 data caches, directly attached off-chip L2 caches
  Examples: first generation: Alpha 21064, PA 7100, Pentium; second generation: Alpha 21264, PA 8000, Pentium Pro, K6

  47. 3.3.4. Attributes of second generation superscalars (2) Consistency of processor features (1) Dynamic instruction frequencies in general purpose applications: FX instructions ~40%, load instructions ~30%, store instructions ~10%, branches ~20%, FP instructions ~1–5%. Available parallelism in general purpose applications assuming buffered issue: ~4–6 instructions/cycle (Wall 1990). Source: Sima et al., ACA, Addison-Wesley, 1997

  48. Figure 3.7: Extent of parallelism available in general purpose applications assuming buffered issue Source: Wall: Limits of ILP, WRL TN-15, Dec. 1990

  49. 3.3.4. Attributes of second generation superscalars (3) Consistency of processor features (2)
  Reasonable core width: 4–5 instructions/cycle
  Required number of data cache ports (np): np ~ 0.4 × (4–5) = 1.6–2 instructions/cycle → dual-port data caches
  Required EUs (each L/S instruction generates an address calculation as well):
  FX: ~0.8 × (4–5) = 3.2–4 → 3–4 FX EUs
  L/S: ~0.4 × (4–5) = 1.6–2 → 2 L/S EUs
  Branch: ~0.2 × (4–5) = 0.8–1 → 1 B EU
  FP: ~(0.01–0.05) × (4–5) → 1 FP EU

  50. 3.3.5. Exhausting the issue parallelism In general purpose applications, 2nd generation („wide”) superscalars already exhaust the parallelism available at the instruction level.
