
ECE8833 Polymorphous and Many-Core Computer Architecture

Presentation Transcript


  1. ECE8833 Polymorphous and Many-Core Computer Architecture Lecture 1 Early ILP Processors and Performance Bound Model Prof. Hsien-Hsin S. Lee School of Electrical and Computer Engineering

  2. Decoupled Access/Execute Computer Architectures James E. Smith, ACM TOCS, 1984 (an earlier version was published in ISCA 1982)

  3. Background of DAE, circa 1982 • Written at a time when vector machines were dominating • Example (4096-bit vector registers holding 64 elements of 64 bits each): LV v1, mem[a1]; MULV v3, v2, v1; ADDV v5, v4, v3 • Vector chaining (Cray-1): as elements of v1 arrive from memory, the multiply and then the add start consuming them, so the three instructions overlap on the time line

  4. Background of DAE, circa 1982 • Written at a time when vector machines were dominating • LV v1, mem[a1]; MULV v3, v2, v1; ADDV v5, v4, v3 • [Figure: dataflow, memory feeds v1; v1 and v2 feed MUL producing v3; v3 and v4 feed ADD producing v5] • What about modern SIMD ISAs?

  5. Today's State of the Art? • Intel AVX • Intel Larrabee NI
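
As a rough modern analogue of the vector sequence on the previous slides, here is a minimal AVX sketch in C (the array names, the function name madd, and the loop structure are illustrative assumptions, not from the slides): it performs the same load, multiply, add pattern on 256-bit registers holding four doubles instead of a 4096-bit vector register.

```c
#include <immintrin.h>

/* Element-wise out[i] = c[i] + b[i] * a[i], mirroring the
 * LV / MULV / ADDV sequence from the Cray-style example. */
void madd(const double *a, const double *b, const double *c,
          double *out, int n)
{
    for (int i = 0; i + 4 <= n; i += 4) {
        __m256d v1 = _mm256_loadu_pd(a + i);                     /* LV   v1, mem[a1] */
        __m256d v3 = _mm256_mul_pd(_mm256_loadu_pd(b + i), v1);  /* MULV v3, v2, v1  */
        __m256d v5 = _mm256_add_pd(_mm256_loadu_pd(c + i), v3);  /* ADDV v5, v4, v3  */
        _mm256_storeu_pd(out + i, v5);
    }
}
```

(Compile with -mavx; a remainder loop for n not divisible by 4 is omitted.)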

  6. DAE, circa 1982 • Fine-grained parallelism: vector vs. superscalar • What about scalar performance? • Remember Flynn's bottleneck? (Page 290)

  7. Flynn's Bottleneck • ILP ≈ 1.86 • Programs on IBM 7090 • In essence, one cannot execute more than about one instruction per cycle • ILP exploited within basic blocks only • [Riseman & Foster '72] • Breaking control dependences • A perfect machine model • Benchmarks include numerical programs, an assembler, and a compiler • [Figure: control-flow graph of basic blocks BB0–BB4]

  8. DAE, circa 1982/1984 • Issues in the CDC 6600 & IBM 360/91 • Overlapping instructions by OoO issue → complex control → slower clock → offsets the benefit • Complex issue methods were abandoned by their manufacturers • Less determinism • Problems in HW debugging • Errors may not be reproducible • Complexity can be shifted to system software

  9. Decoupled Access/Execute Architecture • An architecture with two instruction streams to break Flynn's bottleneck • Access processor • eXecute processor • Hey, this was the 1980s • Separate RFs (A0, A1, A2, ..., An-1 and X0, X1, X2, ..., Xm-1), which can be totally incompatible • Synchronization issue?

  10. DAE

  11. Data Movement • XLQ and XSQ (the X load and store queues) are specified as registers at the ISA level • [Figure: data-in and data-out paths through the paired queues]

  12. Register-to-Register Synch • [Figure: synchronizing a copy between an X-register Xi and an A-register Aj]

  13. Branch Synch-up • One processor runs ahead and resolves the branch • The other executes an unconditional jump (the BFQ instruction) • Branch outcomes in the XBQ can be used to reduce I-fetch in the X-processor

  14. DAE Code Example
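
The code on the original slide is not reproduced in this transcript. Below is a minimal C sketch of the idea only, with assumed queue names and sizes rather than the paper's actual syntax: the access (A) side performs the loads and pushes operands into a load queue, and the execute (X) side pops them and computes, so the A-stream can slip ahead until the queue fills. In hardware the two streams run concurrently; here they are simply interleaved in one loop.

```c
#include <stdio.h>

#define QSIZE 8                       /* assumed queue depth */
static double xlq[QSIZE];             /* stand-in for the X load queue (XLQ) */
static int head, tail, count;         /* count tracks occupancy; a real design
                                         would stall the A-stream when full */

static void   xlq_push(double v) { xlq[tail] = v; tail = (tail + 1) % QSIZE; count++; }
static double xlq_pop(void)      { double v = xlq[head]; head = (head + 1) % QSIZE; count--; return v; }

int main(void)
{
    double a[16], b[16], c[16];
    for (int i = 0; i < 16; i++) { b[i] = i; c[i] = 2.0 * i; }

    for (int i = 0; i < 16; i++) {
        xlq_push(b[i]);               /* A-stream: load b[i] into the XLQ */
        xlq_push(c[i]);               /* A-stream: load c[i] into the XLQ */
        double x = xlq_pop();         /* X-stream: operand from the XLQ   */
        double y = xlq_pop();         /* X-stream: operand from the XLQ   */
        a[i] = x + y;                 /* X-stream: compute (a store would go through the XSQ) */
    }
    printf("a[5] = %g\n", a[5]);
    return 0;
}
```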

  15. Modern Issue Considerations • Although it is a ’82/’84 paper, it already considers the issues on the following slides (e.g., precise exceptions)

  16. Precise Exceptions • Simple approach → force the instructions to complete in order • In DAE, applied to each of the two streams separately • Example of imprecise-exception issues • Requires care when coding the A and E programs

  17. Requirement for Precise Exception

  18. Why (and How) Does It Work? • Avg. speedup = 1.58 for LFK (the Lawrence Livermore Fortran Kernels) • Execution on the 2 processors is somewhat balanced • Why? It works nicely as shown on LFK • The X-processor's computation is not as fast: 6-cycle FP add, 7-cycle FP multiply • The A-processor takes care of: memory (11-cycle load), branch resolution

  19. Disadvantages of the DAE Architecture • Writing 2 separate programs • What high-level language? • Who should do it? • Certain duplication in hardware: instruction memory/cache, instruction fetch unit, decoder

  20. Interleaving Instruction Streams • Use a bit to tag each instruction with its stream (X or A) • No split-branch instruction • X7 is the XLQ or XSQ: once loaded, it is used once; it must be stored after the X-processor writes to it • [Figure: interleaved X/A instruction stream]
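
A minimal sketch of splitting one tagged instruction stream into the two streams (the tag-bit position, the instruction encoding, and the queue structure are assumptions for illustration only):

```c
#include <stdint.h>
#include <stddef.h>

/* Assumed encoding: the most significant bit of each instruction word
 * tags its stream (0 = A, 1 = X).  The splitter walks the single
 * interleaved stream and routes each instruction to the right queue. */
typedef struct { uint32_t words[64]; size_t n; } queue_t;

void split_streams(const uint32_t *insts, size_t n, queue_t *aq, queue_t *xq)
{
    for (size_t i = 0; i < n; i++) {
        queue_t *q = (insts[i] >> 31) ? xq : aq;   /* tag bit selects the stream */
        q->words[q->n++] = insts[i];               /* no bounds check: sketch only */
    }
}
```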

  21. Summary of the DAE Architecture • 2-wide issue per cycle • Allows a constrained form of OoO execution • Data accesses can be done well in advance (i.e., the A-stream can "slip" ahead) • Enables a certain level of data prefetching • Was novel in 1982!

  22. The ZS-1 Central Processor James E. Smith, et al. in ASPLOS-II, 1987

  23. Astronautics ZS-1 Central Processor • A realization of DAE (by the same author) • Decouples the instruction stream into fixed-point/memory and floating-point operations • The two streams communicate via architectural queues • Extensively pipelined • 22.5 MFLOPS, 45 MIPS

  24. ZS-1 Central Processor • 31 A (and X) registers + 1 queue entry = 5-bit encoded operand specifiers • [Figure: block diagram; the queues communicate with memory; one instruction buffer holds 24 instructions, the other holds 4]

  25. ZS-1 Central Processor • An instruction cannot be issued until its dependences are resolved • A load may bypass independent stores • Load-load and store-store order is maintained
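
A minimal sketch of the bypass rule above, assuming a simple pending-store address queue (the data structure and names are not from the ZS-1 paper): a load may issue ahead of older stores only when it is independent of all of them, while loads keep order among themselves and stores among themselves.

```c
#include <stdbool.h>
#include <stdint.h>

/* Addresses of older stores that have not yet been performed, oldest first. */
typedef struct {
    uint64_t addr[16];    /* assumed queue depth */
    int      n;
} store_queue_t;

/* A load may bypass the pending stores only if its address matches
 * none of them, i.e., it is independent of every older store. */
bool load_may_bypass(const store_queue_t *sq, uint64_t load_addr)
{
    for (int i = 0; i < sq->n; i++)
        if (sq->addr[i] == load_addr)
            return false;     /* dependent store in flight: wait */
    return true;              /* independent: safe to issue early */
}
```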

  26. Can Load Bypass Load? Why not? • Initially (A) = 100; Core 2 holds R3 = 25 • Core 1: (1) Load R1, (A); (2) Load R2, (A) • Core 2: (3) Store (A), R3 • What's wrong with the execution order (2)(3)(1)? • The younger load (2) would return the old value 100 while the older load (1) returns the new value 25, so same-address loads would appear to complete out of program order

  27. ZS-1: Processing of Two Iterations • Pipeline stage legend: S = splitter, B = instruction buffer read, D = decode, I = issue, E = execute

  28. IBM RS/6000 and POWER • Evolved from IBM ACS and the 801 • Foundation of the POWER architecture (Performance Optimization With Enhanced RISC) • 10 discrete chips in the early POWER1 system • Single-chip solutions in the RSC and in a subsequent POWER2 version called P2SC

  29. POWER2 Processor Node • 8 discrete chips on an MCM • 66.7 MHz, 6-issue (2 reserved for br/comp) • 2 FXUs (memory, integer, logical), 2 ops per cycle • Dual-pipe FPUs can perform per cycle: 2 DP FMAs, 2 FP loads, 2 FP stores • [Figure: node block diagram: Instruction Cache Unit with 32KB I-cache and dual branch processors; dispatch, instruction buffers, and sync; Fixed-Point Unit (FXU) with one execution unit with mult/div and one without; Floating-Point Unit (FPU) with arithmetic, load, and store execution units; Data Cache Unit (DCU) of 4 separate chips (32KB each); Storage Control Unit; Memory Unit (64MB – 512MB); optional secondary cache (1 or 2MB)]

  30. MACS Performance Bound Model • To analyze achievable performance (mostly FP) in scientific applications • [Figure: ladder of bounds from the M Bound through the MA, MAC, and MACS Bounds down to the physically measured actual run time, with Gap A, Gap C, Gap S, and Gap P between successive levels]

  31. MACS Performance Bound Model • Gap A (what keeps you from attaining peak performance): excessive loads/stores (more than the essential ones, e.g., for a[i] = b[i]); loop bookkeeping • Gap C (reason we may want to have 432?): hardware restrictions (architectural registers); redundant instructions; load/store overhead in function calls • Gap S: weak scheduling algorithm; resource conflicts preventing a tighter schedule; solution: modulo scheduling to compact the code • Gap P: cache misses, inter-core communication, system effects (e.g., context switches); solutions: prefetching, loop blocking, loop fusion, loop interchange, etc.
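
As a small illustration of how the bound ladder on the previous slide decomposes measured run time into these gaps, here is a C sketch (the function and variable names, the cycles-per-flop units, and the numbers in main are illustrative assumptions; the exact bound formulas come from the MACS papers):

```c
#include <stdio.h>

/* Given the four bounds and the measured run time in the same units
 * (e.g., cycles per flop), attribute the loss relative to peak to the
 * application (A), compiler code generation (C), instruction
 * scheduling (S), and everything else such as the memory system (P). */
static void macs_gaps(double m, double ma, double mac, double macs, double measured)
{
    printf("Gap A (application)         : %.3f\n", ma   - m);
    printf("Gap C (compiler codegen)    : %.3f\n", mac  - ma);
    printf("Gap S (scheduling)          : %.3f\n", macs - mac);
    printf("Gap P (memory, system, ...) : %.3f\n", measured - macs);
}

int main(void)
{
    macs_gaps(0.25, 0.40, 0.55, 0.70, 1.10);   /* illustrative numbers only */
    return 0;
}
```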

  32. POWER2 M Bound (Ideal, Ideal) • M Bound: peak = 1 FMA to each of the 2 FPU pipelines = 4 flops/cycle = 0.25 CPF (cycles per flop) • [Figure: FPU slice of the POWER2: dispatch, instruction buffer, and the arithmetic, store, and load execution units of the Floating-Point Unit (FPU)]

  33. POWER2 MA Bound (ideal compiler and rest) • Given the visible workload of the high-level application • Calculate the essential operations that must be performed • MA Bound = time bound for all FP operations, counting the essential, minimum FP operations needed to complete the computation • A factor of 4 for divide and square root is a common choice to reflect their weight relative to other computations

  34. POWER2 MA Bound (ideal compiler and rest) • Assumptions: non-pipelined FP ops; 2 FP pipelines; max 4 dispatches to the FPU and FXU; other fixed-point work considered irrelevant; simplified memory model
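
One plausible way to write the MA bound under the assumptions listed above; this is a sketch of the idea, not the exact formula from the slides or the MACS papers, and the way the essential operations are partitioned (FMAs, other FP ops, divide/sqrt) is an assumption:

```c
/* MA bound in cycles per flop (CPF): essential FP work divided by the
 * two FPU pipelines, with divide/sqrt weighted by a factor of 4 as on
 * the earlier slide.  An FMA occupies one pipeline slot but performs
 * two flops, so an all-FMA workload reproduces the 0.25 CPF M bound. */
double ma_bound_cpf(long fma_ops, long other_fp_ops, long div_sqrt_ops)
{
    const double DIV_SQRT_WEIGHT = 4.0;   /* relative weight of div/sqrt */
    const double FP_PIPES        = 2.0;   /* POWER2 FPU pipelines        */

    double slots  = fma_ops + other_fp_ops + DIV_SQRT_WEIGHT * div_sqrt_ops;
    double cycles = slots / FP_PIPES;               /* time bound for all FP ops */
    double flops  = 2.0 * fma_ops + other_fp_ops + div_sqrt_ops;

    return cycles / flops;                          /* cycles per flop */
}
```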

  35. POWER2 MAC Bound • Similar to computing the MA Bound, but using the actual, compiler-generated instruction count

  36. POWER2 MACS Bound • Similar to computing the MAC Bound, but the numerator comes from the actual compiler-scheduled code

  37. IBM SP2 Performance Bound • Later expansion to include inter-processor communication bound
