CS 7810 Lecture 3

Presentation Transcript


  1. CS 7810 Lecture 3
  Clock Rate vs. IPC: The End of the Road for Conventional Microarchitectures
  V. Agarwal, M.S. Hrishikesh, S.W. Keckler, D. Burger, UT-Austin, ISCA '00

  2. Previous Papers
  • Limits of ILP – it is probably worth doing out-of-order (o-o-o) superscalar
  • Complexity-Effective – wire delays make the implementations harder and increase latencies
  • Today's paper – these latencies severely impact IPCs and slow the growth in processor performance

  3. 1995-2000

  4. 1995-2000
  • Clock speed has improved by 50% every year
    • Reduction in logic delays
    • Deeper pipelines → this will soon end
  • IPC has gone up dramatically (the increased complexity was worth it) → will this end too?

  5. Wire Scaling
  • Multiple wire layers – the SIA roadmap predicts dimensions (somewhat aggressive)
  • As transistor widths shrink, wires become thinner and their per-length resistance goes up (quadratically – Table 1)
  • Parallel-plate capacitance reduces, but coupling capacitance increases (slight overall increase)
  • The equations are different, but the end result is similar to Palacharla's (without repeaters)
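  The quadratic intuition can be sketched in a few lines of Python. This is a toy distributed-RC (Elmore) model; the per-mm resistance and capacitance values are illustrative assumptions, not the paper's SIA-derived parameters:

    def unrepeated_delay_ps(length_mm, r_ohm_per_mm, c_ff_per_mm):
        """Elmore delay of a distributed RC line: 0.5 * R_total * C_total."""
        r_total = r_ohm_per_mm * length_mm             # ohms
        c_total = c_ff_per_mm * length_mm * 1e-15      # farads
        return 0.5 * r_total * c_total * 1e12          # seconds -> ps

    # Quadratic in length: doubling the wire quadruples the delay.
    print(unrepeated_delay_ps(2.5, 100, 200))   # 62.5 ps (assumed R, C)
    print(unrepeated_delay_ps(5.0, 100, 200))   # 250.0 ps, 4x the above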

  6. Wire Scaling

  7. Wire Scaling
  • With repeaters, the delay of a fixed-length wire does not go up quadratically as we shrink gate width
  • In going from 250nm → 35nm:
    • 5mm wire delay: 170ps → 390ps
    • delay to cross X gates: 170ps → 55ps
    • SIA clock speed: 0.75GHz → 13.5GHz
    • delay to cross X gates: 0.13 cycles → 0.75 cycles
  • We could increase wire width, but that compromises bandwidth
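  The cycle figures follow directly from the picosecond delays and the clock targets; a quick Python check using the slide's own data points:

    def delay_in_cycles(delay_ps, clock_ghz):
        # One clock cycle lasts 1000 / clock_ghz picoseconds.
        return delay_ps * clock_ghz / 1000.0

    # The same gate path shrinks in absolute time but grows in cycles,
    # because the clock target rises faster than the gates speed up.
    print(delay_in_cycles(170, 0.75))   # ~0.13 cycles at 250nm
    print(delay_in_cycles(55, 13.5))    # ~0.74 cycles at 35nm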

  8. Clock Scaling
  • Logic delay (the FO4 delay) scales linearly with gate length
  • Likewise, work per pipeline stage has also been shrinking
  • The SIA predicts that today's 16 FO4/stage delay will shrink to 5.6 FO4/stage
  • A 64-bit add takes 5.5 FO4 – hence, they examine SIA (super-aggressive), 8-FO4 (aggressive), and 16-FO4 (conservative) scaling strategies
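  As a back-of-envelope check, the common rule of thumb that an FO4 delay is roughly 360 ps per micron of drawn gate length (our assumption, not a figure from the paper) maps these strategies to clock rates close to the slide's numbers:

    def fo4_ps(gate_length_nm):
        # Rule of thumb (Horowitz): FO4 delay ~ 360 ps per micron of drawn
        # gate length, i.e. 0.36 ps per nm. Assumed, not from the paper.
        return 0.36 * gate_length_nm

    def clock_ghz(gate_length_nm, fo4_per_stage):
        period_ps = fo4_per_stage * fo4_ps(gate_length_nm)
        return 1000.0 / period_ps

    print(clock_ghz(250, 16))    # ~0.69 GHz, near the slide's 0.75 GHz
    print(clock_ghz(35, 5.6))    # ~14 GHz, near the SIA 13.5 GHz target
    print(clock_ghz(35, 8))      # ~9.9 GHz for the aggressive 8-FO4 design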

  9. Clock Scaling

  10. Clock Scaling
  • While the 15-20% improvement in technology scaling will continue, the 15-20% improvement in pipeline depth will cease

  11. On-Chip Wire Delays
  • The number of bits reachable in a cycle is shrinking (by more than a factor of two across three generations)
    → Structures that fit in a cycle today will have to be shrunk (smaller regfiles, issue queues)
  • Chip area is steadily increasing
    → Less than 1% of the chip is reachable in a cycle; 30 cycles to go across the chip!
  • Processors are becoming communication-bound
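  A toy Python estimate of where the "less than 1%" figure comes from, reusing the slide-7 repeated-wire number (5 mm ≈ 390 ps at 35 nm); the 20 mm die side and the Manhattan-routing assumption are ours:

    def reachable_fraction(cycle_ps, ps_per_mm, chip_side_mm):
        reach_mm = cycle_ps / ps_per_mm    # longest wire run in one cycle
        reach_area = 2 * reach_mm ** 2     # diamond under Manhattan routing
        return min(1.0, reach_area / chip_side_mm ** 2)

    cycle_ps = 1000 / 13.5                 # SIA 13.5 GHz clock
    ps_per_mm = 390 / 5                    # slide 7: 5mm = 390ps at 35nm
    print(reachable_fraction(cycle_ps, ps_per_mm, 20))   # ~0.005, under 1%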

  12. Processor Structure Delays
  • To model the microarchitecture, they estimate the delays of all wire-limited structures
  • Weakness: bypass delays are not considered

  13. Microarchitecture Scaling
  • Capacity Scaling: constant access latencies in cycles (simpler designs); scale capacities down to make the structure fit
  • Pipeline Scaling: constant capacities; latencies go up, hence deeper pipelines
  • Any other approaches?

  14. Microarchitecture Scaling
  • Capacity Scaling: constant access latencies in cycles (simpler designs); scale capacities down to make the structure fit
  • Pipeline Scaling: constant capacities; latencies go up, hence deeper pipelines
  • Replicated Capacity Scaling: a fast core with few resources, replicated many times – high IPC if you can localize communication
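  A minimal sketch contrasting the first two policies, under the deliberately crude assumption that a structure's access time grows linearly with its capacity (real delay models, e.g. Palacharla's, are more detailed):

    import math

    PS_PER_ENTRY = 4.0   # assumed access delay per issue-queue entry

    def capacity_scaling(cycle_ps):
        # Hold access latency at one cycle; shrink the structure to fit.
        return {"entries": int(cycle_ps // PS_PER_ENTRY), "cycles": 1}

    def pipeline_scaling(entries, cycle_ps):
        # Hold capacity constant; let access latency grow in cycles.
        access_ps = entries * PS_PER_ENTRY
        return {"entries": entries, "cycles": math.ceil(access_ps / cycle_ps)}

    cycle_ps = 1000 / 13.5                  # the SIA 13.5 GHz clock
    print(capacity_scaling(cycle_ps))       # {'entries': 18, 'cycles': 1}
    print(pipeline_scaling(40, cycle_ps))   # {'entries': 40, 'cycles': 3}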

  15. IPC Comparisons
  [Diagram comparing the three configurations: Pipeline Scaling (20-entry IQ, 40 regs, with 2-cycle wakeup, 2-cycle regread, and 2-cycle bypass), Capacity Scaling (15-entry IQ, 30 regs), and Replicated Capacity Scaling (replicated clusters, each with a 15-entry IQ, 30 regs, and functional units)]

  16. Methodology

  17. Results

  18. Results
  • Every instruction experiences longer latencies
  • IPCs are much lower for aggressive clocks
  • Overall performance is still comparable for all approaches

  19. Results
  • In 17 years, we are seeing only a 7-fold speedup (historically, it would have been ~1720-fold) – an annual increase of 12.5%
  • Slow growth because increases in pipeline depth and IPC will stagnate
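  The arithmetic checks out; a quick verification in Python, assuming the usual 55%-per-year historical growth rate behind the 1720x figure:

    years = 17
    print(7 ** (1 / years))   # ~1.12 -> roughly the 12.5% annual growth cited
    print(1.55 ** years)      # ~1721 -> the ~1720x historical expectation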

  20. Questionable Assumptions
  • Additional transistors are not being used to improve IPC
  • All instructions pay wire-delay penalties

  21. Conclusions
  • Large monolithic cores will perform poorly – microarchitectures will have to be partitioned
  • On-chip caches will be the biggest bottlenecks – 3-cycle 0.5KB L1s, 30-50-cycle 2MB L2s
  • Future proposals should be wire-delay-sensitive

  22. Next Class' Paper
  • "Dynamic Code Partitioning for Clustered Architectures", UPC-Barcelona, 2001
  • Instruction steering heuristics to balance load and minimize communication
