
1. Evolution of ILP-processing


Presentation Transcript


  1. 1. Evolution of ILP-processing Dezső Sima, Fall 2006 © D. Sima, 2006

  2. Structure 1. Paradigms of ILP-processing 2. Introduction of temporal parallelism 3. Introduction of issue parallelism 3.1. VLIW processing 3.2. Superscalar processing 4. Introduction of data parallelism 5. The main road of evolution 6. Outlook

  3. 1. Paradigms of ILP-processing 1.1. Introduction (1) Figure 1.1: Evolution of computer classes (a 1950-2000 timeline: supercomputers from the ENIAC, NORC and CDC-6600 through the Cray line to the Cray T3E; mainframes from the UNIVAC through the IBM /360, /370, /390 to the z/900; minicomputers PDP-8, PDP-11, VAX; servers/workstations RS/6000, Xeon; microcomputers and desktop PCs from the 4004, 8080 and Altair through the 8088, 80286, 80386, 80486 and the Pentium line to the P4; value PCs: Celeron)

  4. 1.1. Introduction (2) Figure 1.2: The integer performance of Intel's x86 line of processors

  5. 1.2. Paradigms of ILP-processing (1) Paradigms of ILP-processing Issue parallelism Temporal parallelism Static dependency resolution Pipeline processors VLIW processors

  6. VLIW processing Independent instructions (static dependency resolution) F E F E F E Processor VLIW: Very Long Instruction Word Instructions

  7. 1.2. Paradigms of ILP processing (1) Paradigms of ILP processing Issue parallelism Temporal parallelism Static dependency resolution Dynamic dependency resolution Pipeline processors VLIW processors Superscalar processors

  8. Superscalar processing VLIW processing Independent instructions (static dependency resolution) Dependent instructions Dynamic dependency resolution F E F E F E F E F E F E Processor Processor VLIW: Very Long Instruction Word Instructions

  9. 1.2. Paradigms of ILP processing (1) Paradigms of ILP processing Issue parallelism Data parallelism Temporal parallelism Static dependency resolution Dynamic dependency resolution SIMD extension Pipeline processors VLIW processors Superscalar processors

  10. 1.2. Paradigms of ILP-processing (2) Issue parallelism Data parallelism Static dependency resolution Sequential processing Temporal parallelism VLIW processors EPIC processors Dynamic dependency resolution Pipeline processors Superscalar processors Superscalar processors with SIMD extension ~ '85 ~ '90 ~ '95-'00 Figure 1.3: The emergence of ILP-paradigms and processor types

  11. 1.3. Performance potential of ILP-processors (1) Absolute performance of sequential, pipeline, VLIW/superscalar and SIMD-extension processors: ideal case vs. real case

  12. 1.3. Performance potential of ILP-processors (2) Performance components of ILP-processors: absolute performance = clock frequency * per-cycle efficiency, where the per-cycle efficiency combines temporal parallelism, issue parallelism, data parallelism and the efficiency of speculative execution. The clock frequency depends on the technology and the μarchitecture; the per-cycle efficiency depends on the ISA, μarchitecture, system architecture, OS, compiler and application
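The performance decomposition on slide 12 can be sketched numerically; the clock frequency and IPC values below are illustrative assumptions, not figures from the lecture:

```python
# Sketch of the performance decomposition: absolute performance as the
# product of clock frequency and per-cycle efficiency (IPC).
# All numeric values are illustrative assumptions.

def performance(f_clk_hz, ipc):
    """Instructions per second = clock frequency * instructions per cycle."""
    return f_clk_hz * ipc

scalar = performance(100e6, 0.8)        # hypothetical 100 MHz pipelined core
superscalar = performance(100e6, 1.6)   # hypothetical 2-issue core, same clock

# Same technology (same clock), double the per-cycle efficiency:
print(superscalar / scalar)             # 2.0
```

This makes the two independent levers explicit: technology raises the first factor, microarchitecture/ISA/compiler raise the second.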

  13. 2. Introduction of temporal parallelism 2.1. Introduction (1) Types of temporal parallelism in ILP processors: pipeline processors overlap all phases of consecutive instructions i, i+1, i+2, i+3 (F: fetch cycle, D: decode cycle, E: execute cycle, W: write cycle). Examples: Atlas (1963), IBM 360/91 (1967), R2000 (1988), i80386 (1985), M68030 (1988). Figure 2.1: Implementation alternatives of temporal parallelism
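The overlapping of phases on slide 13 can be illustrated with a small model; the 4-stage F/D/E/W pipeline is the one shown in the figure, while the function names and the ideal no-stall behaviour are assumptions:

```python
# Minimal model of temporal parallelism on an ideal 4-stage pipeline:
# instruction i occupies stage s in cycle i + s, so the phases of
# consecutive instructions overlap.

STAGES = ["F", "D", "E", "W"]

def timetable(n_instructions):
    """Map each cycle to the (instruction, stage) pairs active in it."""
    table = {}
    for i in range(n_instructions):
        for s, stage in enumerate(STAGES):
            table.setdefault(i + s, []).append((i, stage))
    return table

def total_cycles(n_instructions):
    """Fill the pipeline once, then one instruction completes per cycle."""
    return n_instructions + len(STAGES) - 1

# In cycle 3 the pipeline is full: i3 is fetched while i0 writes back.
print(timetable(4)[3])      # [(0, 'W'), (1, 'E'), (2, 'D'), (3, 'F')]
print(total_cycles(100))    # 103 cycles instead of 400 sequential ones
```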

  14. 2.1. Introduction (2) Figure 2.2: The appearance of pipeline processors (a 1980-1992 timeline of the x86 line: 80286, 80386, 80486; the M68000 line: 68020, 68030, 68040; and the MIPS R line: R2000, R3000, R6000, R4000)

  15. 2.2. Processing bottlenecks evoked and their resolution 2.2.1. Overview The scarcity of memory bandwidth (2.2.2) The problem of branch processing (2.2.3)

  16. 2.2.2. The scarcity of memory bandwidth (1) Sequential processing → pipeline processing: more instructions and data need to be fetched per cycle → larger memory bandwidth required
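The bandwidth argument of slide 16 in code form (a sketch; the clock rate, the 4-byte instruction size and the 4-stage depth are illustrative assumptions):

```python
# An ideal pipeline fetches one instruction every cycle; a sequential
# processor fetches one instruction every K cycles (K = number of phases).

K = 4                # processing phases (F, D, E, W)

def fetch_bandwidth(f_clk_hz, instr_bytes, fetches_per_cycle):
    """Instruction-fetch bandwidth in bytes per second."""
    return f_clk_hz * instr_bytes * fetches_per_cycle

seq  = fetch_bandwidth(50e6, 4, 1 / K)   # sequential: one fetch per K cycles
pipe = fetch_bandwidth(50e6, 4, 1)       # pipelined: one fetch per cycle

print(pipe / seq)    # 4.0: the pipeline needs K times the fetch bandwidth
```

This is exactly the pressure that caches (next slide) were introduced to relieve.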

  17. 2.2.2. The scarcity of memory bandwidth (2) Figure 2.3: Introduction of caches (the 1980-1992 timeline of the x86, M68000 and MIPS R lines, distinguishing pipeline (scalar) processors without and with cache(s); C(n): universal cache (size in kB), C(n/m): instruction/data cache (sizes in kB), e.g. C(8), C(16), C(4,4), C(8,8))

  18. 2.2.3. The problem of branch processing (1) (e.g. in case of conditional branches; bc: conditional branch, bti: branch target instruction) The pipeline keeps fetching sequential instructions until decoding bc recognizes the branch; only after condition checking (branch!) and branch address calculation can the bti be fetched, so the intervening cycles are lost. Figure 2.4: Processing of a conditional branch on a 4-stage pipeline

  19. 2.2.3. The problem of branch processing (2) Basic blocks, separated by conditional branches; instructions other than conditional branches; guessed path vs. approved path. Figure 2.5: Principle of branch prediction in case of a conditional branch
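One common way to realize the "guessed path" idea of slide 19 is a 2-bit saturating counter per branch; the slides do not prescribe a concrete predictor, so this sketch is an illustration, not the lecture's scheme:

```python
# 2-bit saturating counter: two mispredictions are needed to flip the
# guessed direction, which tolerates the single exit of a loop branch.

class TwoBitPredictor:
    def __init__(self):
        self.state = 0        # 0,1 = predict not taken; 2,3 = predict taken

    def predict(self):
        return self.state >= 2

    def update(self, taken):
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)

# A loop branch taken 9 times, then not taken once:
p = TwoBitPredictor()
hits = 0
outcomes = [True] * 9 + [False]
for taken in outcomes:
    hits += (p.predict() == taken)
    p.update(taken)
print(hits, "/", len(outcomes))   # 7 / 10 after a cold start
```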

  20. 2.2.3. The problem of branch processing (3) Figure 2.6: Introduction of branch prediction in (scalar) pipeline processors (the 1980-1992 timeline of the x86, M68000 and MIPS R lines, marking which models introduced speculative execution of branches)

  21. 2.3. Generations of pipeline processors (1) 1st generation pipelined: no cache, no speculative branch processing; 1.5th generation pipelined: cache, no speculative branch processing; 2nd generation pipelined: cache, speculative branch processing

  22. 2.3. Generations of pipeline processors (2) Figure 2.7: Generations of pipeline processors (the 1980-1992 timeline of the x86, M68000 and MIPS R lines, classified as 1st generation pipelined: no cache, no speculative branch processing; 1.5th generation pipelined: cache, no speculative branch processing; 2nd generation pipelined: cache, speculative branch processing)

  23. 2.4. Exhausting the available temporal parallelism 2nd generation pipeline processors already exhaust the available temporal parallelism

  24. 3. Introduction of issue parallelism 3.1. Options to implement issue parallelism VLIW (EPIC) instruction issue: static dependency resolution (3.2); superscalar instruction issue: pipeline processing with dynamic dependency resolution (3.3)

  25. 3.2. VLIW processing (1) Memory/cache → VLIW instructions with independent sub-instructions (static dependency resolution) → VLIW processor (~10-30 EUs). Figure 3.1: Principle of VLIW processing

  26. 3.2. VLIW processing (2) VLIW: Very Long Instruction Word Term: 1983 (Fisher) Length of sub-instructions: ~32 bit Instruction length: ~n*32 bit (n: number of execution units (EUs)) Static dependency resolution with parallel optimization Complex VLIW compiler
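The n*32-bit instruction format of slide 26 can be modelled as a packing routine; the all-zero NOP encoding and the 8-EU machine are assumptions for illustration:

```python
# Pack up to n_eus 32-bit sub-instructions into one n_eus*32-bit VLIW word;
# unused slots are padded with a NOP (assumed to encode as zero).

NOP = 0x00000000

def pack_vliw(sub_instructions, n_eus):
    """Build one long instruction word, slot 0 in the most significant bits."""
    assert len(sub_instructions) <= n_eus
    slots = list(sub_instructions) + [NOP] * (n_eus - len(sub_instructions))
    word = 0
    for op in slots:
        word = (word << 32) | (op & 0xFFFFFFFF)
    return word

# Three independent operations packed for a hypothetical 8-EU VLIW machine:
word = pack_vliw([0x12345678, 0x9ABCDEF0, 0x0F0F0F0F], 8)
print(word.bit_length() <= 8 * 32)   # True: fits a 256-bit instruction word
```

Note that five of the eight slots here are NOPs, which previews the utilization drawback discussed a few slides later.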

  27. The term ‘VLIW’ 3.2. VLIW processing (3) Figure 3.2: Experimental and commercially available VLIW processors Source: Sima et al., ACA, Addison-Wesley, 1997

  28. 3.2. VLIW processing (4) Benefits of static dependency resolution: Less complex processors Earlier appearance Either higher fc or larger ILP

  29. 3.2. VLIW processing (5) Drawbacks of static dependency resolution: Completely new ISA New compilers, OS Rewriting of applications Achieving the critical mass to convince the market The compiler uses technology dependent parameters (e.g. latencies of EUs and caches, repetition rates of EUs) for dependency resolution and parallel optimization New proc. models require new compiler versions

  30. 3.2. VLIW processing (6) Drawbacks of static dependency resolution (cont.): VLIW instructions are only partially filled Poorly utilized memory space and bandwidth
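The memory-utilization drawback of slide 30 as a one-line calculation (the 3-of-8 fill rate is an assumed, illustrative value, not a measured figure):

```python
# Every unfilled slot of a VLIW instruction is NOP padding that still
# occupies memory space and fetch bandwidth.

def wasted_fraction(filled_slots, n_eus):
    """Fraction of the instruction word spent on padding."""
    return 1 - filled_slots / n_eus

# Suppose the compiler fills on average only 3 of 8 slots:
print(wasted_fraction(3, 8))   # 0.625 of the instruction stream is padding
```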

  31. 3.2. VLIW processing (7) Commercial VLIW processors: Trace (1987, Multiflow), Cydra-5 (1989, Cydrome). In a few years both firms went bankrupt; their developers moved to HP and IBM, where they became initiators/developers of EPIC processors

  32. 3.2. VLIW processing (8) VLIW → EPIC: integration of SIMD instructions and advanced superscalar features. 1994: Intel and HP announced their cooperation. 1997: The term EPIC was born. 2001: IA-64 → Itanium

  33. 3.3. Superscalar processing 3.3.1. Introduction (1) Pipeline processing Superscalar instruction issue Main attributes of superscalar processing: Dynamic dependency resolution Compatible ISA

  34. 3.3.1. Introduction (2) Figure 3.3: Experimental superscalar processors Source: Sima et al., ACA, Addison-Wesley, 1997

  35. 3.3.1. Introduction (3) Figure 3.4: Emergence of superscalar processors Source: Sima et al., ACA, Addison-Wesley, 1997

  36. 3.3.2. Attributes of first generation superscalars (1) Width: "wide" enough to issue 2-3 RISC instructions/cycle or 2 CISC instructions/cycle. Core: static branch prediction. Cache: single-ported, blocking L1 data caches; off-chip L2 caches attached via the processor bus. Examples: Alpha 21064, PA 7100, Pentium

  37. 3.3.2. Attributes of first generation superscalars (2) Consistency of processor features (1) Dynamic instruction frequencies in gen. purpose applications: FX instructions ~40 %, load instructions ~30 %, store instructions ~10 %, branches ~20 %, FP instructions ~1-5 %. Available parallelism in gen. purpose applications assuming direct issue: ~2 instructions/cycle (Wall 1989; Lam, Wilson 1992) Source: Sima et al., ACA, Addison-Wesley, 1997

  38. 3.3.2. Attributes of first generation superscalars (3) Consistency of processor features (2) Reasonable core width: 2-3 instructions/cycle. Required number of data cache ports (np): np ~ 0.4 * (2-3) = 0.8-1.2 accesses/cycle → single-port data caches. Required EUs (each L/S instruction generates an address calculation as well): FX ~ 0.8 * (2-3) = 1.6-2.4 → 2-3 FX EUs; L/S ~ 0.4 * (2-3) = 0.8-1.2 → 1 L/S EU; branch ~ 0.2 * (2-3) = 0.4-0.6 → 1 B EU; FP ~ (0.01-0.05) * (2-3) → 1 FP EU
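The consistency calculation of slide 38 restated as a short script, using the instruction frequencies quoted on the previous slide; the function name and the dictionary layout are mine:

```python
# Derive per-cycle demand on cache ports and execution units from the
# dynamic instruction mix, for a core of a given issue width.

MIX = {"FX": 0.40, "load": 0.30, "store": 0.10, "branch": 0.20}

def required_units(width):
    """Per-cycle demand on a core issuing `width` instructions per cycle."""
    ls = MIX["load"] + MIX["store"]   # 0.4 of all instructions access memory
    return {
        "cache_ports": ls * width,            # data-cache accesses per cycle
        # each L/S also needs an FX address calculation: 0.4 + 0.4 = 0.8
        "FX_EUs": (MIX["FX"] + ls) * width,
        "LS_EUs": ls * width,
        "branch_EUs": MIX["branch"] * width,
    }

for w in (2, 3):       # first-generation widths; use 4 and 5 for 2nd gen.
    print(w, required_units(w))
```

Rounding each demand up to whole units reproduces the slide's conclusions (single-port data cache, 2-3 FX EUs, 1 L/S EU, 1 branch EU).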

  39. 3.3.3. The bottleneck evoked and its resolution (1) The issue bottleneck (a): Simplified structure of the microarchitecture assuming direct issue (b): The issue process Figure 3.5: The principle of direct issue

  40. 3.3.3. The bottleneck evoked and its resolution (2) Eliminating the issue bottleneck Figure 3.6: Principle of the buffered (out of order) issue

  41. 3.3.3. The bottleneck evoked and its resolution (3) First generation (narrow) superscalars → second generation (wide) superscalars: elimination of the issue bottleneck and, in addition, widening of the processing width of all subsystems of the core

  42. 3.3.4. Attributes of second generation superscalars (1) First generation "narrow" superscalars → second generation "wide" superscalars. Width: 2-3 RISC instructions/cycle or 2 CISC instructions/cycle → 4 RISC instructions/cycle or 3 CISC instructions/cycle. Core: static branch prediction → buffered (ooo) issue, predecoding, dynamic branch prediction, register renaming, ROB. Caches: single-ported, blocking L1 data caches with off-chip L2 caches attached via the processor bus → dual-ported, non-blocking L1 data caches with directly attached off-chip L2 caches. Examples: Alpha 21064 → Alpha 21264; PA 7100 → PA 8000; Pentium → Pentium Pro, K6

  43. 3.3.4. Attributes of second generation superscalars (2) Consistency of processor features (1) Dynamic instruction frequencies in gen. purpose applications: FX instructions ~40 %, load instructions ~30 %, store instructions ~10 %, branches ~20 %, FP instructions ~1-5 %. Available parallelism in gen. purpose applications assuming buffered issue: ~4-6 instructions/cycle (Wall 1990) Source: Sima et al., ACA, Addison-Wesley, 1997

  44. Figure 3.7: Extent of parallelism available in general purpose applications assuming buffered issue Source: Wall: Limits of ILP, WRL TN-15, Dec. 1990

  45. 3.3.4. Attributes of second generation superscalars (3) Consistency of processor features (2) Reasonable core width: 4-5 instructions/cycle. Required number of data cache ports (np): np ~ 0.4 * (4-5) = 1.6-2 accesses/cycle → dual-port data caches. Required EUs (each L/S instruction generates an address calculation as well): FX ~ 0.8 * (4-5) = 3.2-4 → 3-4 FX EUs; L/S ~ 0.4 * (4-5) = 1.6-2 → 2 L/S EUs; branch ~ 0.2 * (4-5) = 0.8-1 → 1 B EU; FP ~ (0.01-0.05) * (4-5) → 1 FP EU

  46. 3.3.5. Exhausting the issue parallelism In general purpose applications 2nd generation ("wide") superscalars already exhaust the parallelism available at the instruction level

  47. 4. Introduction of data parallelism 4.1. Overview (1) Figure 4.1: Implementation alternatives of data parallelism

  48. 4.1. Overview (2) Superscalar issue → superscalar extension / EPIC extension: SIMD instructions (FX/FP), i.e. multiple operations within a single instruction. Figure 4.2: Principle of introducing SIMD instructions in superscalar and VLIW (EPIC) processors
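The "multiple operations within a single instruction" principle of slide 48 can be sketched as a lane-wise packed addition; the 64-bit word with four 16-bit lanes (MMX-style FX SIMD) is an illustrative choice, not a specific ISA:

```python
# One packed "instruction" applies the same operation to all sub-words at
# once: four 16-bit lanes in a 64-bit word, added with wrap-around.

LANES, BITS = 4, 16
MASK = (1 << BITS) - 1

def packed_add(a, b):
    """Lane-wise 16-bit addition of two 64-bit packed operands."""
    result = 0
    for lane in range(LANES):
        shift = lane * BITS
        s = (((a >> shift) & MASK) + ((b >> shift) & MASK)) & MASK
        result |= s << shift
    return result

a = 0x0001_0002_0003_0004
b = 0x0010_0020_0030_0040
print(hex(packed_add(a, b)))   # lanes: 0x0011, 0x0022, 0x0033, 0x0044
```

In hardware the four lane additions happen in parallel within one ALU; the loop here only emulates that data parallelism.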

  49. 4.2. The appearance of SIMD instructions in superscalars (1) Intel's and AMD's ISA extensions (MMX, SSE, SSE2, SSE3, 3DNow!, 3DNow! Professional) Figure 4.3: The emergence of FX-SIMD and FP-SIMD instructions in superscalars

  50. The 2.5th and 3rd generation superscalars (1) Second generation superscalars + FX SIMD (MM) → 2.5th generation superscalars; + FX SIMD and FP SIMD (MM+3D) → 3rd generation superscalars
