
Evolution of the ILP Processing



  1. Evolution of the ILP Processing Dezső Sima Fall 2008 (Ver. 2.0) © Dezső Sima, 2008

  2. Foreword The steady demand for higher processor performance has provoked the successive introduction of temporal, issue and intra-instruction parallelism into processor operation. Consequently, traditional sequential processors, pipelined processors, superscalar processors and superscalar processors with multimedia and 3D support mark subsequent evolutionary phases of microprocessors.

  On the other hand, the introduction of each basic technique mentioned gave rise to specific system bottlenecks whose resolution called for innovative new techniques. Thus, the emergence of pipelined instruction processing stimulated the introduction of caches and of speculative branch processing. The debut of superscalar instruction issue gave rise to more advanced memory subsystems and to more advanced branch processing. The desire to further increase the per-cycle performance of first generation superscalars called for avoiding their issue bottleneck through the introduction of shelving, renaming and a concerted enhancement of all relevant subsystems of the microarchitecture. Finally, the utilization of intra-instruction parallelism through SIMD instructions required an adequate extension of the ISA and the system architecture.

  With the main dimensions of parallelism more or less exhausted in the second generation superscalars for general purpose applications, increasing the clock frequency remained the single major possibility to increase performance further. The rapid increase of clock frequencies, however, led to limits of evolution, as discussed in Chapter II.

  3. Structure 1. Paradigms of ILP-processing 2. Introduction of temporal parallelism 3. Introduction of issue parallelism 3.1. VLIW processing 3.2. Superscalar processing 4. Introduction of data parallelism 5. The main road of evolution 6. Outlook

  4. 1. Paradigms of ILP-processing

  5. 1.1. Introduction (1) Figure 1.1: Evolution of computer classes [Timeline 1950–2000: supercomputers (ENIAC, NORC, CDC-6600, Cray-1, Cray-2, Cray-3, Cray-4, Cray T3E); mainframes (UNIVAC, IBM /360, /370, /390, z/900); minicomputers (PDP-8, PDP-11, VAX); servers/workstations (RS/6000, Xeon); microcomputers, desktop and value PCs (4004, 8080, 8088, Altair, 80286, 80386, 80486, Pentium, PPro, PII, PIII, P4, Celeron)]

  6. 1.1. Introduction (2) Figure 1.2: The integer performance of Intel’s x86 line of processors

  7. 1.2. Paradigms of ILP-processing (1) [Diagram: temporal parallelism → pipeline processors; issue parallelism with static dependency resolution → VLIW processors]

  8. VLIW processing [Diagram: a stream of independent instructions (static dependency resolution) is fetched and executed (F, E) in parallel by the processor] VLIW: Very Long Instruction Word

  9. 1.2. Paradigms of ILP processing (1) [Diagram: temporal parallelism → pipeline processors; issue parallelism with static dependency resolution → VLIW processors, with dynamic dependency resolution → superscalar processors]

  10. Superscalar processing vs. VLIW processing [Diagram: VLIW processing executes independent instructions (static dependency resolution); superscalar processing executes dependent instructions with dynamic dependency resolution; both fetch and execute (F, E) several instructions in parallel] VLIW: Very Long Instruction Word

  11. 1.2. Paradigms of ILP processing (1) [Diagram, extended: temporal parallelism → pipeline processors; issue parallelism with static/dynamic dependency resolution → VLIW/superscalar processors; data parallelism → SIMD extension]

  12. 1.2. Paradigms of ILP-processing (2) Figure 1.3: The emergence of ILP-paradigms and processor types [Sequential processing → temporal parallelism: pipeline processors (~’85) → issue parallelism: VLIW and EPIC processors (static dependency resolution), superscalar processors (dynamic dependency resolution) (~’90) → data parallelism: superscalar processors with SIMD extension (~’95–’00)]

  13. 1.3. Performance potential of ILP-processors (1) [Figure: absolute performance growing from sequential through pipeline and VLIW/superscalar processing to the SIMD extension; ideal case vs. real case]

  14. 1.3. Performance potential of ILP-processors (2) Performance components of ILP-processors: absolute performance = clock frequency × per-cycle efficiency, where the per-cycle efficiency comprises the contributions of temporal, issue and data parallelism and the efficiency of speculative execution. The clock frequency depends on the technology and the μarchitecture; the per-cycle efficiency depends on the ISA, μarchitecture, system architecture, OS, compiler and application.
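The decomposition on this slide, absolute performance = clock frequency × per-cycle efficiency, can be illustrated with a minimal Python sketch. All numbers below are invented for illustration, not measured values from the lecture:

```python
# Illustration of the slide's performance decomposition:
# absolute performance = clock frequency * per-cycle efficiency, where
# per-cycle efficiency aggregates the contributions of temporal, issue
# and data parallelism and the efficiency of speculative execution.
# The factor values below are made-up examples.

def absolute_performance(f_clock_hz, temporal, issue, data, spec_eff):
    """Instructions/second as the product of the slide's components."""
    per_cycle_efficiency = temporal * issue * data * spec_eff
    return f_clock_hz * per_cycle_efficiency

# e.g. a 200 MHz processor sustaining ~2 instructions/cycle
# at 90% speculation efficiency:
perf = absolute_performance(200e6, temporal=1.0, issue=2.0,
                            data=1.0, spec_eff=0.9)
print(f"{perf:.3g} instructions/s")  # 3.6e+08 instructions/s
```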

  15. 2. Introduction of temporal parallelism

  16. 2.1. Introduction (1) Types of temporal parallelism in ILP processors (F: fetch cycle, D: decode cycle, E: execute cycle, W: write cycle) Figure 2.1: Implementation alternatives of temporal parallelism [Pipeline processors overlap all phases of consecutive instructions i, i+1, i+2, i+3; examples: Atlas (1963), IBM 360/91 (1967), i80386 (1985), M68030 (1988), R2000 (1988)]
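The overlapping of the F, D, E, W phases shown in Figure 2.1 can be tabulated with a few lines of Python. This is an idealized model with no stalls; only the 4-stage layout is taken from the figure:

```python
# Ideal 4-stage pipeline: instruction i occupies stage s in cycle i + s,
# so consecutive instructions overlap all of their phases.

STAGES = ["F", "D", "E", "W"]

def pipeline_timetable(n_instructions):
    """Return {instruction index: {cycle: stage}} for an ideal pipeline."""
    table = {}
    for i in range(n_instructions):
        table[i] = {i + s: STAGES[s] for s in range(len(STAGES))}
    return table

# Four instructions finish in 4 + (4 - 1) = 7 cycles
# instead of 4 * 4 = 16 cycles sequentially.
tt = pipeline_timetable(4)
last_cycle = max(c for row in tt.values() for c in row)
print(last_cycle + 1)  # 7
```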

  17. 2.1. Introduction (2) Figure 2.2: The appearance of pipeline processors [Timeline 1980–1992 of pipeline (scalar) processors: x86 line (80286, 80386, 80486), Motorola M68000 line (68020, 68030, 68040), MIPS R line (R2000, R3000, R6000, R4000)]

  18. 2.2. Processing bottlenecks evoked and their resolution 2.2.1. Overview: the scarcity of memory bandwidth (2.2.2) and the problem of branch processing (2.2.3)

  19. 2.2.2. The scarcity of memory bandwidth (1) Moving from sequential to pipeline processing, more instructions and data need to be fetched per cycle, so a larger memory bandwidth is required.

  20. 2.2.2. The scarcity of memory bandwidth (2) Figure 2.3: Introduction of caches [Timeline 1980–1992 distinguishing pipeline (scalar) processors without caches from those with caches in the x86, M68000 and MIPS R lines; C(n): universal cache (size in kB), C(n/m): instruction/data caches (sizes in kB)]

  21. 2.2.3. The problem of branch processing (1) (e.g. in case of conditional branches) Figure 2.4: Processing of a conditional branch on a 4-stage (F D E W) pipeline [bc: conditional branch, bti: branch target instruction; the condition checking (branch!) and the branch address calculation follow the decode of bc, so the sequentially fetched instructions ii+1, ii+2 must be discarded and bti can only be fetched cycles later]
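The cost of the taken branch in Figure 2.4 can be expressed as a simple bubble count. A hedged Python sketch, assuming the branch outcome is known at the end of a given stage (the stage indices here are illustrative assumptions, not the lecture's exact timing):

```python
# If a taken conditional branch is only resolved in a late pipeline
# stage, every sequential instruction fetched in the meantime must be
# discarded; the number of discarded instructions equals the number of
# stages between fetch and resolution.

def taken_branch_penalty(resolve_stage, fetch_stage=0):
    """Bubble cycles caused by a taken branch on an ideal pipeline."""
    return resolve_stage - fetch_stage

# With stages F=0, D=1, E=2, W=3: resolving the branch in E means the
# two instructions fetched while bc moved from F to E are discarded.
print(taken_branch_penalty(resolve_stage=2))  # 2
```

The deeper the pipeline (the later the resolve stage), the larger this penalty, which is what motivates the branch prediction introduced on the next slides.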

  22. 2.2.3. The problem of branch processing (2) Figure 2.5: Principle of branch prediction in case of a conditional branch [Basic blocks of instructions delimited by conditional branches; at each conditional branch a path is guessed and followed speculatively until the guess is approved]
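The slide states only the principle of guessing a path. One common concrete mechanism, shown here as an illustrative Python sketch rather than anything specified in the lecture, is a 2-bit saturating-counter predictor:

```python
# 2-bit saturating counter per branch: states 0-1 predict "not taken",
# states 2-3 predict "taken". Two wrong guesses are needed to flip the
# prediction, which tolerates a single anomalous outcome.

class TwoBitPredictor:
    def __init__(self):
        self.state = 2  # start weakly taken (arbitrary choice)

    def predict(self):
        return self.state >= 2

    def update(self, taken):
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)

# A loop-closing branch (taken, taken, ..., not taken) is mispredicted
# only once, at the loop exit.
p = TwoBitPredictor()
mispredictions = 0
for taken in [True] * 5 + [False]:
    if p.predict() != taken:
        mispredictions += 1
    p.update(taken)
print(mispredictions)  # 1
```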

  23. 2.2.3. The problem of branch processing (3) Figure 2.6: Introduction of branch prediction in (scalar) pipeline processors [Timeline 1980–1992 of the x86, M68000 and MIPS R lines, marking the models with speculative execution of branches]

  24. 2.3. Generations of pipeline processors (1)
                                   Cache    Speculative branch processing
  1st generation pipelined         no       no
  1.5 generation pipelined         yes      no
  2nd generation pipelined         yes      yes

  25. 2.3. Generations of pipeline processors (2) Figure 2.7: Generations of pipeline processors [Timeline 1980–1992 classifying the x86, M68000 and MIPS R lines into 1st generation pipelined (no cache, no speculative branch processing), 1.5 generation pipelined (cache, no speculative branch processing) and 2nd generation pipelined (cache, speculative branch processing)]

  26. 2.4. Exhausting the available temporal parallelism 2nd generation pipeline processors already exhaust the available temporal parallelism.

  27. 3. Introduction of issue parallelism

  28. 3.1. Options to implement issue parallelism Pipeline processing is extended either by VLIW (EPIC) instruction issue with static dependency resolution (3.2) or by superscalar instruction issue with dynamic dependency resolution (3.3).

  29. 3.2. VLIW processing (1) Figure 3.1: Principle of VLIW processing [VLIW instructions with independent sub-instructions (static dependency resolution) are fetched from memory/cache and their sub-instructions dispatched in parallel to the execution units (EUs) of the VLIW processor, ~10–30 EUs]

  30. 3.2. VLIW processing (2) VLIW: Very Long Instruction Word (term coined in 1983 by Fisher). Length of sub-instructions: ~32 bits; instruction length: ~n × 32 bits, where n is the number of execution units (EUs). Static dependency resolution with parallel optimization requires a complex VLIW compiler.
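The n × 32-bit instruction format just described can be illustrated with a small Python sketch; the NOP encoding, the slot order and the example opcodes are assumptions for illustration only:

```python
# One long instruction word bundles n 32-bit sub-instructions, one slot
# per execution unit; slots the compiler cannot fill with independent
# work hold NOPs. Encodings below are invented for illustration.

NOP = 0x00000000  # assumed NOP encoding

def pack_vliw(sub_instructions, n_slots):
    """Pack up to n_slots 32-bit sub-instructions into one n*32-bit word."""
    assert len(sub_instructions) <= n_slots
    padded = sub_instructions + [NOP] * (n_slots - len(sub_instructions))
    word = 0
    for op in padded:
        word = (word << 32) | (op & 0xFFFFFFFF)
    return word

# With n = 4 EUs but only 2 independent sub-instructions found,
# half of the 128-bit word is NOPs -- the partial-filling drawback
# noted a few slides later.
word = pack_vliw([0x12345678, 0x9ABCDEF0], n_slots=4)
print(hex(word))  # 0x123456789abcdef00000000000000000
```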

  31. The term ‘VLIW’ 3.2. VLIW processing (3) Figure 3.2: Experimental and commercially available VLIW processors Source: Sima et al., ACA, Addison-Wesley, 1997

  32. 3.2. VLIW processing (4) Benefits of static dependency resolution: less complex processors, earlier appearance, and either a higher fc or larger ILP.

  33. 3.2. VLIW processing (5) Drawbacks of static dependency resolution: a completely new ISA; new compilers and OS; rewriting of applications; achieving the critical mass needed to convince the market. The compiler uses technology-dependent parameters (e.g. latencies of EUs and caches, repetition rates of EUs) for dependency resolution and parallel optimization, so new processor models require new compiler versions.

  34. 3.2. VLIW processing (6) Drawbacks of static dependency resolution (cont.): VLIW instructions are only partially filled, so memory space and bandwidth are poorly utilized.

  35. 3.2. VLIW processing (7) Commercial VLIW processors: Trace (1987) from Multiflow and Cydra-5 (1989) from Cydrome. Within a few years both firms went bankrupt; their developers moved to HP and IBM, where they became initiators/developers of EPIC processors.

  36. 3.2. VLIW processing (8) VLIW → EPIC: integration of SIMD instructions and advanced superscalar features. 1994: Intel and HP announced their cooperation; 1997: the term EPIC was born; 2001: IA-64 → Itanium.

  37. 3.3. Superscalar processing 3.3.1. Introduction (1) Superscalar processing extends pipeline processing with superscalar instruction issue. Main attributes of superscalar processing: dynamic dependency resolution and a compatible ISA.

  38. 3.3.1. Introduction (2) Figure 3.3: Experimental superscalar processors Source: Sima et al., ACA, Addison-Wesley, 1997

  39. 3.3.1. Introduction (3) Figure 3.4: Emergence of superscalar processors Source: Sima et al., ACA, Addison-Wesley, 1997

  40. 3.3.2. Attributes of first generation superscalars (1)
  Width: 2–3 RISC instructions/cycle or 2 CISC instructions/cycle „wide”
  Core: static branch prediction
  Caches: single-ported, blocking L1 data caches; off-chip L2 caches attached via the processor bus
  Examples: Alpha 21064, PA 7100, Pentium

  41. 3.3.2. Attributes of first generation superscalars (2) Consistency of processor features (1) Dynamic instruction frequencies in general purpose applications: FX instructions ~40%, load instructions ~30%, store instructions ~10%, branches ~20%, FP instructions ~1–5%. Available parallelism in general purpose applications assuming direct issue: ~2 instructions/cycle (Wall 1989; Lam, Wilson 1992). Source: Sima et al., ACA, Addison-Wesley, 1997

  42. 3.3.2. Attributes of first generation superscalars (3) Consistency of processor features (2)
  Reasonable core width: 2–3 instructions/cycle
  Required number of data cache ports (np): np ~ 0.4 × (2–3) = 0.8–1.2 instructions/cycle → single-port data caches
  Required EUs (each L/S instruction generates an address calculation as well):
  FX: ~0.8 × (2–3) = 1.6–2.4 → 2–3 FX EUs
  L/S: ~0.4 × (2–3) = 0.8–1.2 → 1 L/S EU
  Branch: ~0.2 × (2–3) = 0.4–0.6 → 1 B EU
  FP: ~(0.01–0.05) × (2–3) → 1 FP EU
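This back-of-the-envelope sizing can be reproduced directly: multiply the core width by the dynamic frequency of each instruction class from the previous slide, counting the address calculation of every load/store toward the FX demand, as the slide does:

```python
# Dynamic instruction frequencies in general purpose applications
# (from the lecture: FX 40%, load 30%, store 10%, branch 20%).
FREQ = {"FX": 0.4, "load": 0.3, "store": 0.1, "branch": 0.2}

def required_rates(width):
    """Instructions/cycle each unit class must sustain at a given core width."""
    ldst = FREQ["load"] + FREQ["store"]   # 0.4 -> L/S unit and cache-port demand
    fx = FREQ["FX"] + ldst                # 0.8 -> FX, incl. L/S address calcs
    return {
        "FX": fx * width,
        "L/S": ldst * width,
        "branch": FREQ["branch"] * width,
        "cache_ports": ldst * width,
    }

# First-generation widths 2-3 reproduce the slide's numbers, e.g.
# FX demand 1.6-2.4 -> 2-3 FX EUs, cache-port demand 0.8-1.2 -> 1 port.
for w in (2, 3):
    print(w, required_rates(w))
```

The same function applied to wider cores (widths 4–5) reproduces the second-generation sizing shown later, which is why those designs need dual-ported data caches and more FX units.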

  43. 3.3.3. The bottleneck evoked and its resolution (1) The issue bottleneck Figure 3.5: The principle of direct issue ((a): simplified structure of the microarchitecture assuming direct issue; (b): the issue process)

  44. 3.3.3. The bottleneck evoked and its resolution (2) Eliminating the issue bottleneck Figure 3.6: Principle of the buffered (out-of-order) issue
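The principle of buffered issue can be illustrated with a toy Python model. This is a deliberate simplification, not the slide's microarchitecture: register names, the initially available registers and the zero-latency timing are invented for the example.

```python
# Decoded instructions wait in a buffer (shelving); in each "cycle" any
# instruction whose source operands are ready may issue, regardless of
# program order. A blocked instruction therefore no longer stalls the
# independent instructions behind it, as it would with direct issue.

def buffered_issue(instructions, issue_width):
    """instructions: list of (name, srcs, dest) tuples. Returns issue order."""
    ready = {"r1", "r2"}          # assumed initially available registers
    buffer_ = list(instructions)  # the shelving buffer
    order = []
    while buffer_:
        # issue up to issue_width instructions whose sources are ready
        issued = [i for i in buffer_ if set(i[1]) <= ready][:issue_width]
        if not issued:
            break
        for name, srcs, dest in issued:
            order.append(name)
            ready.add(dest)       # result available to later cycles
            buffer_.remove((name, srcs, dest))
    return order

# i2 depends on i1's result r3; the independent i3 overtakes it.
prog = [("i1", ("r1",), "r3"), ("i2", ("r3",), "r4"), ("i3", ("r2",), "r5")]
print(buffered_issue(prog, issue_width=2))  # ['i1', 'i3', 'i2']
```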

  45. 3.3.3. The bottleneck evoked and its resolution (3) First generation (narrow) superscalars → second generation (wide) superscalars: elimination of the issue bottleneck and, in addition, widening of the processing width of all subsystems of the core.

  46. 3.3.4. Attributes of second generation superscalars (1)
  Width: first generation „narrow” superscalars: 2–3 RISC instructions/cycle or 2 CISC instructions/cycle; second generation „wide” superscalars: 4 RISC instructions/cycle or 3 CISC instructions/cycle
  Core: first generation: static branch prediction; second generation: buffered (ooo) issue, predecoding, dynamic branch prediction, register renaming, ROB
  Caches: first generation: single-ported, blocking L1 data caches, off-chip L2 caches attached via the processor bus; second generation: dual-ported, non-blocking L1 data caches, directly attached off-chip L2 caches
  Examples: first generation: Alpha 21064, PA 7100, Pentium; second generation: Alpha 21264, PA 8000, Pentium Pro, K6

  47. 3.3.4. Attributes of second generation superscalars (2) Consistency of processor features (1) Dynamic instruction frequencies in general purpose applications: FX instructions ~40%, load instructions ~30%, store instructions ~10%, branches ~20%, FP instructions ~1–5%. Available parallelism in general purpose applications assuming buffered issue: ~4–6 instructions/cycle (Wall 1990). Source: Sima et al., ACA, Addison-Wesley, 1997

  48. Figure 3.7: Extent of parallelism available in general purpose applications assuming buffered issue Source: Wall: Limits of ILP, WRL TN-15, Dec. 1990

  49. 3.3.4. Attributes of second generation superscalars (3) Consistency of processor features (2)
  Reasonable core width: 4–5 instructions/cycle
  Required number of data cache ports (np): np ~ 0.4 × (4–5) = 1.6–2 instructions/cycle → dual-port data caches
  Required EUs (each L/S instruction generates an address calculation as well):
  FX: ~0.8 × (4–5) = 3.2–4 → 3–4 FX EUs
  L/S: ~0.4 × (4–5) = 1.6–2 → 2 L/S EUs
  Branch: ~0.2 × (4–5) = 0.8–1 → 1 B EU
  FP: ~(0.01–0.05) × (4–5) → 1 FP EU

  50. 3.3.5. Exhausting the issue parallelism In general purpose applications, 2nd generation („wide”) superscalars already exhaust the parallelism available at the instruction level.
