
VLSI ARCHITECTURE DESIGN COURSE LECTURE #4-5



  1. VLSI ARCHITECTURE DESIGN COURSE LECTURE #4-5 • The Generic Processor • Microarchitecture trends • Performance/power/frequency implications • Insights • Today's lecture: comprehend the performance, power, and area implications of various microarchitectures

  2. References of the day • “Computer Architecture: A Quantitative Approach” (2nd edition), John L. Hennessy, David A. Patterson, Chapters 3-4 (pp. 125-370) • “Computer Organization and Design”, John L. Hennessy, David A. Patterson, Chapters 5-6, 9 (pp. 268-451, 594-646) • “Tuning the Pentium Pro Micro-Architecture”, David Papworth, IEEE Micro, April 1996 • “IA-64 Application Architecture Tutorial”, Allan D. Knies, Hot Chips 11, August 1999 • “Billion-Transistor Architectures: There and Back Again”, Doug Burger, James Goodman, IEEE Computer, March • “A VLIW Architecture for a Trace Scheduling Compiler”, R. Colwell, R. Nix, J. O’Donnell, D. Papworth, P. Rodman, ACM, 1987 • Joseph Fisher, “The VLIW Machine: A Multiprocessor for Compiling Scientific Code”, IEEE Computer, July 1984 • “The IBM System/360 Model 91: Machine Philosophy and Instruction Handling”, R. M. Tomasulo et al., IBM Journal of Research and Development 11:1, 1967. Some of the lecture material was prepared by Ronny Ronen.

  3. Computing platform: Messages • Balanced design • Power ∝ CV²f • System performance • Transactions overhead • Memory as a scratch pad • Scheduling • System efficiency • … • CPU • ILP and IPC vs. frequency • External vs. internal frequency • Speculation • Branch prediction • $ (caches) • Memory disambiguation • Instruction and data prefetch • Value prediction • … • Multithread • Multithreading on a single core • Multi-core systems • $ in multi-core • Asymmetry • NUMA • Scheduling in MC • Multi-core vs. multi-thread machines • …

  4. The Generic Processor • Sophisticated organization to “service” instructions [Block diagram: instruction supply -> execution engine -> data supply] • Instruction supply: instruction cache, branch prediction, instruction decoder, ... • Execution engine: instruction scheduler, register files, execution units, ... • Data supply: data cache, TLBs, ... • Goal: maximum throughput - a balanced design

  5. Power & Performance • Performance ∝ 1/Execution Time = (IPC × Frequency) / #-of-instructions-in-task • For a given instruction stream, performance depends on the number of instructions executed per time unit: Performance ∝ IPC × Frequency • Sometimes measured in MIPS - Million Instructions Per Second • Power ∝ C × V² × Frequency, where C = overall capacitance; for a given technology, C is ~proportional to the number of transistors • Energy efficiency = Performance/Power • Measured in MIPS/Watt • Message: Power = C × V² × Frequency
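
To make the relations above concrete, here is a minimal numeric sketch; the parameter values (IPC, frequency, capacitance, voltage) are illustrative assumptions, and only dynamic power is modeled:

    #include <stdio.h>

    int main(void) {
        /* Illustrative core parameters (not taken from the slides) */
        double ipc  = 2.0;      /* instructions per cycle   */
        double freq = 3.0e9;    /* clock frequency [Hz]     */
        double cap  = 1.0e-9;   /* switched capacitance [F] */
        double vdd  = 1.0;      /* supply voltage [V]       */

        double mips       = ipc * freq / 1e6;        /* Performance ~ IPC * f       */
        double power      = cap * vdd * vdd * freq;  /* Dynamic power = C * V^2 * f */
        double efficiency = mips / power;            /* Energy efficiency [MIPS/W]  */

        printf("Perf = %.0f MIPS, Power = %.2f W, Efficiency = %.1f MIPS/W\n",
               mips, power, efficiency);
        return 0;
    }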

  6. Microprocessor Performance Evolution [John DeVale & Bryan Black, 2006] [Figure: IPC vs. frequency for MRM, Itanium, YNH, Intel Pentium M, Power 3, Power 4, AMD Opteron, AMD Athlon, and Intel Pentium 4] Message: Frequency vs. IPC

  7. Real life: Performance vs. frequency [Chart: application performance vs. processor frequency; performance scales sub-linearly with frequency (scores of roughly 708, 807, 866, 878, corresponding to about 87%-95% scaling)] Message: Internal vs. external frequency * Source: Intel® Pentium® 4 Processor and Intel® 850 Performance Brief, April 2002

  8. Microarchitecture • Microprocessor core - performance/power/area insights • Parallelism • Pipeline stalls/bypasses • Superpipeline • Static/dynamic scheduling • Branch prediction • Memory hierarchy • VLIW / EPIC

  9. Parallelism Evolution - Performance, power, area insights? [Figure: instruction streams flowing through arrangements of processor elements (PE = Processor Element) for the basic configuration, a pipeline, an in-order superscalar, an out-of-order superscalar, and a VLIW machine]

  10. Static Scheduling: VLIW / EPIC - Performance, power, area insights? [Figure: pipeline diagrams for a wide, statically scheduled machine with integer (I), float (F), memory (M), and branch (B) pipes; st = stall, gray = nop] • Static scheduling of instructions by the compiler • VLIW: Very Long Instruction Word (Multiflow, TI C6x family) • EPIC: Explicitly Parallel Instruction Computing (IA-64) • Shorter pipe, wider machine, global view => potentially huge ILP (wider & simpler than a plain superscalar!) • Many nops, sensitive to varying latencies (memory accesses) • Low utilization • Huge code size • Highly dependent on the compiler • EPIC overcomes some of these limitations: • Advance loads (hide memory latency) • Predicated execution (avoid branches) • Decoder templates (reduce nops) • But at increased complexity • Examples: Intel Itanium® processors, DSPs

  11. Dynamic Scheduling - Performance, power, area insights? • Scheduling instructions at run time, by the hardware • Advantages: • Works on the dynamic instruction flow: can schedule across procedures, modules, ... • Can see dynamic values (memory addresses) • Can accommodate varying latencies and cases (e.g., cache miss) • Disadvantages: • Can schedule within a limited window only • Should be fast - cannot be too smart

  12. Out Of Order Execution [Figure: pipeline timing of five instructions under in-order vs. out-of-order processing, assuming unlimited resources and a 2-cycle load latency] • In-order execution: instructions are processed in their program order - this limits the potential parallelism • OOO: instructions are executed based on “data flow” rather than program order • Example (src -> dest):
Before:
(1) load (r10), r21
(2) mov r21, r31 (2 depends on 1)
(3) load a, r11
(4) mov r11, r22 (4 depends on 3)
(5) mov r22, r23 (5 depends on 4)
After:
(1) load (r10), r21; (3) load a, r11;
<wait for loads to complete>
(2) mov r21, r31; (4) mov r11, r22; (5) mov r22, r23
• Usually highly superscalar • Examples: Intel Pentium® II/III/4, Compaq Alpha 21264

  13. Out Of Order (cont.) - Performance, power, area insights? • Advantages: • Helps exploit Instruction Level Parallelism (ILP) • Helps cover latencies (e.g., cache miss, divide) • Artificially increases the register file size (i.e., the number of registers)? • Superior/complementary to the compiler scheduler • Dynamic instruction window • Makes use of more registers than the architectural registers? • Disadvantages: • Complex microarchitecture • Complex scheduler • Large instruction window • Speculative execution • Requires a reordering back-end mechanism (retirement) for: • Precise interrupt resolution • Misprediction/speculation recovery • Memory ordering
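
A minimal sketch of the register-renaming idea behind "more registers than the architectural registers": every write to an architectural register gets a fresh physical register, removing false (WAR/WAW) dependences. The table sizes and the round-robin allocation are illustrative only, not any specific processor's mechanism:

    #include <stdio.h>

    #define NUM_ARCH_REGS 8     /* architectural registers r0..r7 (hypothetical) */
    #define NUM_PHYS_REGS 32    /* larger physical register file                 */

    static int rename_map[NUM_ARCH_REGS];  /* arch reg -> current physical reg   */
    static int next_free = NUM_ARCH_REGS;  /* naive allocator: no reclamation    */

    /* Every write to an architectural register is mapped to a new physical one. */
    static int rename_dest(int arch_reg) {
        rename_map[arch_reg] = next_free++ % NUM_PHYS_REGS;
        return rename_map[arch_reg];
    }

    /* Reads see the most recent mapping of the architectural register. */
    static int rename_src(int arch_reg) {
        return rename_map[arch_reg];
    }

    int main(void) {
        for (int i = 0; i < NUM_ARCH_REGS; i++) rename_map[i] = i;

        /* Two back-to-back writes to r1 no longer conflict after renaming. */
        printf("first  write of r1 -> p%d\n", rename_dest(1));
        printf("second write of r1 -> p%d\n", rename_dest(1));
        printf("a later read of r1 -> p%d\n", rename_src(1));
        return 0;
    }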

  14. Branch Prediction - Performance, power, area insights? • Goal - ensure instruction supply by correct prefetching • In the past, the prefetcher assumed fall-through • Loses on unconditional branches (e.g., call) • Loses on frequently taken branches (e.g., loops) • Dynamic branch prediction: • Predicts whether a branch is taken/not taken • Predicts the branch target address • Misprediction cost varies (higher with increased pipeline depth) • Typical branch prediction rates: ~90%-96%, i.e., 4%-10% misprediction; with 10-25 branches between mispredictions, that is 50-125 instructions between mispredictions • Misprediction cost increases with: • Pipeline depth • Machine width • e.g., 3-wide × 10 stages = 30 instructions flushed!
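
One common dynamic taken/not-taken scheme is a table of 2-bit saturating counters indexed by the branch address. A minimal sketch; the table size and the indexing are illustrative, not tied to any particular processor:

    #include <stdint.h>
    #include <stdbool.h>
    #include <stdio.h>

    #define BHT_ENTRIES 1024                /* illustrative table size          */
    static uint8_t bht[BHT_ENTRIES];        /* 2-bit saturating counters (0..3) */

    /* Predict taken when the counter is in a "taken" state (2 or 3). */
    static bool predict(uint32_t pc) {
        return bht[(pc >> 2) % BHT_ENTRIES] >= 2;
    }

    /* Move the counter toward the actual outcome, saturating at 0 and 3. */
    static void update(uint32_t pc, bool taken) {
        uint8_t *c = &bht[(pc >> 2) % BHT_ENTRIES];
        if (taken && *c < 3) (*c)++;
        if (!taken && *c > 0) (*c)--;
    }

    int main(void) {
        uint32_t loop_branch = 0x4000;      /* hypothetical branch address       */
        for (int i = 0; i < 8; i++) {       /* a loop branch that is mostly taken */
            bool taken = (i != 7);
            printf("iter %d: predict %d, actual %d\n", i, predict(loop_branch), taken);
            update(loop_branch, taken);
        }
        return 0;
    }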

  15. Caches In computer engineering, a cache (pronounced /kæʃ/ “kash” in the US and /keɪʃ/ “kaysh” in Aust/NZ) is a component that transparently stores data so that future requests for that data can be served faster (Wikipedia)
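
As a concrete illustration of the definition, a minimal direct-mapped cache lookup; the capacity, line size, and address split are illustrative assumptions:

    #include <stdint.h>
    #include <stdbool.h>
    #include <stdio.h>

    #define LINE_BYTES 64                    /* illustrative line size              */
    #define NUM_LINES  1024                  /* 1024 x 64 B = 64 KB, direct-mapped  */

    typedef struct {
        bool     valid;
        uint64_t tag;
    } cache_line_t;

    static cache_line_t cache[NUM_LINES];

    /* A hit means a copy of the requested location is already in the cache,
       so the access is served without going to main memory.                 */
    static bool access_cache(uint64_t addr) {
        uint64_t block = addr / LINE_BYTES;
        uint64_t index = block % NUM_LINES;   /* which cache line            */
        uint64_t tag   = block / NUM_LINES;   /* which memory block it holds */
        if (cache[index].valid && cache[index].tag == tag)
            return true;                      /* hit                         */
        cache[index].valid = true;            /* miss: fill the line         */
        cache[index].tag   = tag;
        return false;
    }

    int main(void) {
        printf("first access:  %s\n", access_cache(0x1234) ? "hit" : "miss");
        printf("second access: %s\n", access_cache(0x1234) ? "hit" : "miss");
        return 0;
    }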

  16. Memory hierarchy - Performance, power, area insights? [Hierarchy, from small and fast to big and slow:] • CPU registers: <500 B, 0.25 ns • L1 cache: 64 KB, 1-2 ns • L2 cache: 8 MB, 5 ns • Main memory (DRAM): 4 GB, 100 ns • Disk/Flash: 100 GB, 1 ms / 10 µs • Perf/power: What are the parameters to consider here?

  17. Environment and motivation • Moore’s Law: 2X transistors (cores?) per chip every technology generation; however, current process generations provide almost the same clock rate • A processor running a single process can compute only as fast as memory • A 3 GHz processor can execute an “add” operation in 0.33 ns • Today’s “external main memory” latency is 50-100 ns • Naïve implementation: loads/stores can be 300x slower than other operations

  18. Cache Motivation - CPU-DRAM gap (latency) [Figure: processor performance grows ~60%/yr (2X/1.5 yr, “Moore’s Law”) while DRAM grows ~9%/yr (2X/10 yrs); the processor-memory performance gap grows ~50%/year] • Memory latency can be handled by: • A multi-threaded engine (no cache) => every memory access = off-chip access => BW and power implications? • Caches => every cache miss = off-chip access => BW and power implications?

  19. Memory Hierarchy! [Figure: number of CPU cycles (C) to reach each memory domain => latency: registers 1 C, memory T = 300 C, ~10,000 C to SSD, ~1,000,000 C to disk]

  20. Cache A cache is a smaller, faster memory which stores copies of the data from the most frequently used main memory locations

  21. Memory Hierarchy, solution I - single-core environment! A fast memory structure between CPU and memory solves the latency issue [Figure: registers 1 C, cache 10 C, memory 300 C, ~10,000 C to SSD, ~1,000,000 C to disk (C = CPU cycles)]
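
The benefit of the cache level can be quantified with the usual average-memory-access-time relation. A small sketch using the cycle counts from the slide plus an assumed 5% miss rate:

    #include <stdio.h>

    int main(void) {
        double hit_time     = 10.0;    /* cache access, in CPU cycles (from the slide) */
        double miss_penalty = 300.0;   /* main-memory access, in CPU cycles            */
        double miss_rate    = 0.05;    /* assumed miss rate, for illustration          */

        /* Average Memory Access Time = hit time + miss rate * miss penalty */
        double amat = hit_time + miss_rate * miss_penalty;
        printf("AMAT = %.1f cycles (vs. %.0f cycles for every access with no cache)\n",
               amat, miss_penalty);
        return 0;
    }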

  22. Memory Hierarchy, solution II - multi-thread environment! Many executing threads hide the memory latency [Figure: a thread issues a memory access (300 C) and other threads execute while it waits; memory 300 C, ~10,000 C to SSD, ~1,000,000 C to disk; result: Performance1, bandwidth BW1, power P1]

  23. Memory Hierarchy, solution II (cont.) - multi-thread environment! A memory structure ($) between CPU and memory serves as a BW filter [Figure: cache (10 C) in front of memory (300 C); ~10,000 C to SSD, ~1,000,000 C to disk] Same performance: Performance1, but off-chip bandwidth BW1*MR and power P1*MR (MR = cache miss rate)
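
A small sketch of the bandwidth-filter effect; the core bandwidth demand and the miss rate are assumed values for illustration:

    #include <stdio.h>

    int main(void) {
        double core_bw_gbs = 100.0;  /* bandwidth demanded by the threads [GB/s], assumed */
        double miss_rate   = 0.10;   /* MR = cache miss rate, assumed                     */

        /* Only misses go off-chip, so the cache filters the off-chip bandwidth:
           BW_offchip = BW_core * MR (and off-chip power scales roughly the same way). */
        printf("off-chip BW = %.1f GB/s instead of %.1f GB/s\n",
               core_bw_gbs * miss_rate, core_bw_gbs);
        return 0;
    }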

  24. Power, Performance, Area: Insights - 1 • Energy to process one instruction: Wi • Increases with the complexity of the processor, e.g., an OOO processor consumes more energy per instruction than an in-order processor => affects Perf/Power • Energy efficiency = Perf/Power • Its value deteriorates as speculation increases and complexity grows • Area efficiency = Performance/Area • Leakage becomes a major issue • Effectiveness of area - how to get more performance for a given area (secondary to power)

  25. Power, Performance, Area: Insights - 2 • Performance: Perf ∝ IPC × f • Voltage scaling: increase the operating voltage to increase frequency; f = k × V (within a given voltage range) • Power & energy consumption: P ∝ C × V² × f => P ∝ C × V³ (using f = k × V); E = P × t • Tradeoff: • Maximum performance vs. minimum energy: 1% perf => 1% power (without voltage scaling) • Maximum performance within a constrained power budget: 1% perf => 3% power (with voltage scaling)
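
A quick numeric check of the cubic relation; the values are illustrative, f = k × V is assumed to hold exactly, and only dynamic power is modeled:

    #include <stdio.h>

    int main(void) {
        double cap = 1.0;                 /* capacitance, arbitrary units     */
        double v0 = 1.00, f0 = 3.0e9;     /* baseline voltage and frequency   */
        double v1 = 1.01, f1 = f0 * 1.01; /* +1% frequency needs +1% voltage  */

        double p0 = cap * v0 * v0 * f0;   /* P = C * V^2 * f                  */
        double p1 = cap * v1 * v1 * f1;

        /* With f = k*V, P ~ C*V^3, so +1% performance costs about +3% power. */
        printf("power increase: %.2f%%\n", (p1 / p0 - 1.0) * 100.0);
        return 0;
    }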

  26. Power, Performance, Area: Insights - 3 • Many things do not scale: wire delays, power, memory latencies and bandwidth, instruction-level parallelism (ILP), ... We solve one: we fertilize the others! • Performance = frequency * IPC • Increasing IPC => more work per instruction (prediction, renaming, scheduling, etc.) and more useless work (speculation, replays, ...) • More frequency => more pipe stages: fewer gate delays per stage but more gate delays per instruction overall, and bigger losses due to flushes, cache misses, and prefetch misses • We may “gain” performance => but with a lot of area and power!

  27. Static Scheduling: VLIW / EPIC - A short architectural case study • Why “new”? ... CISC = old • Why reviving? ... OOO complexity • Advantages - simplicity (pipeline, dependency, dynamic) • Reasons: • EOL of x86? • Business? • Servers? • Questions to ask: • Technical • Business • Controllers? • Questions to ask: • Technical • Business

  28. Static Issuing example: VLIW (Very Long Instruction Word) - Multiflow 7/200 • A VLIW machine performs many program steps at once • Many operations are grouped together into a very long instruction word and execute together [Block diagram: memory, register file, and LD/ST, FADD, FMUL, IALU execution units; the instruction word has LD/ST, FADD, FMUL, IALU, and BRANCH fields] Ref: “A VLIW Architecture for a Trace Scheduling Compiler”, Colwell, Nix, O’Donnell

  29. Multiflow 7/200 (cont.) - Compiler Basic Concept An optimizing compiler arranges instructions according to instruction timing. Example: A = (B+C) * (D+E); F = G*H + X*Y. Assumed latencies: Load 3, FADD 3, FMUL 3, Store 1.
LD #B, R1
LD #C, R2
FADD R1, R2, R3
LD #D, R4
LD #E, R5
FADD R4, R5, R6
FMUL R6, R3, R1
STO R1, #A
LD #G, R7
LD #H, R8
FMUL R7, R8, R9
LD #X, R4
LD #Y, R5
FMUL R4, R5, R6
FADD R6, R9, R1
STO R1, #F

  30. Multiflow 7/200 (cont.) - Compiler Basic Concept Example (cont.): A = (B+C) * (D+E); F = G*H + X*Y. Assumed latencies: Load 3, FADD 3, FMUL 3, Store 1. [Schedule table with one column per functional unit (LD/ST, IALU, FADD, FMUL, BR) and one row per cycle; issue order: LD #B, R1 / LD #C, R2 / LD #D, R4 / LD #E, R5 / LD #G, R7 / FADD R1,R2,R3 / LD #H, R8 / LD #X, R4 / FADD R4,R5,R6 / LD #Y, R5 / FMUL R7,R8,R9 / FMUL R3,R6,R1 / FMUL R4,R5,R6 / STO R1, #A / FADD R9,R6,R1 / STO R1, #F] - : stalled cycle, takes time but no space. Overall latency: 17 cycles. Very low code efficiency: <25%!
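
A rough way to read the "<25%" figure (assuming the 16 operations listed above, the 5 issue slots of the instruction word, and the 17-cycle latency): efficiency ≈ 16 useful operations / (17 cycles × 5 slots) = 16/85 ≈ 19%, so more than three quarters of the issue slots hold nops or stalls.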

  31. Intel® Itanium™ Processor Block Diagram [Block diagram: L1 instruction cache and fetch/pre-fetch engine with ITLB, ECC, and branch prediction; IA-32 decode and control; instruction queue (8 bundles); 9 issue ports (B B B M M I I F F); register stack engine / re-mapping; branch & predicate registers, 128 integer registers, 128 FP registers; scoreboard, predicates, NaTs, exceptions; branch units, integer and MM units, dual-port L1 data cache and DTLB, ALAT, floating-point units (SIMD FMAC × 2); L2 cache, L3 cache, and bus controller, with ECC throughout]

  32. IA-64 Instruction Types and Templates • Instruction types - M: memory, I: shifts/MM, A: ALU, B: branch, F: floating point, L+X: long • Template types: • Regular: MII, MLX, MMI, MFI, MMF • Stop: MI_I, M_MI • Branch: MIB, MMB, MFB, MBB, BBB • All come in two versions: with a stop at the end and without a stop at the end • Bundle format (128 bits): instruction 2 (41 bits), instruction 1 (41 bits), instruction 0 (41 bits), template (5 bits) • Microarchitecture considerations: • Can run N bundles per clock (Merced: N = 2) • Limits on the number of memory ports (Merced = 2, future > 2?)
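
A minimal sketch of pulling the fields out of a 128-bit bundle, assuming the usual IA-64 layout with the 5-bit template in the low-order bits followed by slot 0, slot 1, and slot 2 (the bundle value in main is an arbitrary placeholder):

    #include <stdint.h>
    #include <stdio.h>

    /* A 128-bit bundle held as two 64-bit halves (lo = bits 0..63). */
    typedef struct { uint64_t lo, hi; } bundle_t;

    /* Extract 'len' bits starting at bit position 'pos' (len <= 63, pos + len <= 128). */
    static uint64_t bits(bundle_t b, unsigned pos, unsigned len) {
        uint64_t v;
        if (pos >= 64)                v = b.hi >> (pos - 64);
        else if (pos + len <= 64)     v = b.lo >> pos;
        else                          v = (b.lo >> pos) | (b.hi << (64 - pos));
        return v & ((1ULL << len) - 1);
    }

    int main(void) {
        bundle_t b = { 0x0123456789abcdefULL, 0xfedcba9876543210ULL }; /* dummy bits */

        uint64_t template = bits(b, 0, 5);    /* 5-bit template          */
        uint64_t slot0    = bits(b, 5, 41);   /* instruction 0, 41 bits  */
        uint64_t slot1    = bits(b, 46, 41);  /* instruction 1, 41 bits  */
        uint64_t slot2    = bits(b, 87, 41);  /* instruction 2, 41 bits  */

        printf("template=%llx slot0=%llx slot1=%llx slot2=%llx\n",
               (unsigned long long)template, (unsigned long long)slot0,
               (unsigned long long)slot1, (unsigned long long)slot2);
        return 0;
    }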
