
A few issues on the design of future multicores





  1. A few issues on the design of future multicores André Seznec IRISA/INRIA

  2. Single Chip Uniprocessor: the end of the road • (Very) wide-issue superscalar processors are not cost-effective: • More than quadratic complexity on many key components: • Register file • Bypass network • Issue logic • Limited performance return • Failure of EV8 = the end of very wide-issue superscalar processors
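
A rough first-order sketch of the scaling behind these bullets (back-of-the-envelope figures of mine, not from the slides): for an issue width of $W$ instructions per cycle,

\[
\#\text{bypass paths} = O(W^2), \qquad
A_{\text{register file}} \propto (\#\text{ports})^2 \;\text{with}\; \#\text{ports} \approx 3W, \qquad
t_{\text{wakeup/select}} \approx O(W^2)
\]

so doubling the issue width roughly quadruples the area or delay of each of these structures, while the extracted instruction-level parallelism grows far more slowly.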

  3. Hardware thread parallelism • High-end single-chip components: • Chip multiprocessors: • IBM Power 5, dual-core Intel Pentium 4, dual-core Athlon 64 • Many CMP SoCs for embedded markets • Cell • (Simultaneous) Multithreading: • Pentium 4, Power 5

  4. Thread parallelism • Expressed by the application developer: • Depends on the application itself • Depends on the programming language or paradigm • Depends on the programmer • Discovered by the compiler: • Automatic (static) parallelization • Exploited by the runtime: • Task scheduling • Dynamically discovered/exploited by hardware or software: • Speculative hardware/software threading
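
As a minimal illustration of the first two bullets (a sketch of mine, assuming a C toolchain with OpenMP support; not from the slides): the pragma below is parallelism expressed by the application developer, while dropping it and relying on automatic (static) parallelization is the compiler-discovered case.

    /* Hedged sketch: thread parallelism expressed explicitly by the
     * programmer through an OpenMP pragma.  Removing the pragma and
     * relying on automatic parallelization is the compiler-driven case. */
    #include <stdio.h>

    int main(void) {
        double sum = 0.0;
        #pragma omp parallel for reduction(+:sum)   /* developer-expressed parallelism */
        for (int i = 1; i <= 1000000; i++)
            sum += 1.0 / i;
        printf("sum = %f\n", sum);
        return 0;
    }

The runtime's role (task scheduling) is hidden inside the OpenMP library; the hardware/software speculative-threading case has no portable source-level equivalent.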

  5. Direction of (single chip) architecture: betting on parallelism success • (Future) applications are intrinsically parallel: • As many simple cores as possible → SSC: Sea of Simple Cores • (Future) applications are moderately parallel: • A few complex state-of-the-art superscalar cores → FCC: Few Complex Cores

  6. SSC: Sea of Simple Cores

  7. FCC: Few Complex Cores [Figure: a few 4-way out-of-order superscalar cores around a shared L3 cache]

  8. Common architectural design issues

  9. Instruction Set Architecture • Single ISAs ? • Extension of “conventional” multiprocessors • Shared or distributed memory ? • Heterogeneous ISAs: • A la Cell ?: (master processor + slave processors) x N • A la SoC ?: specialized coprocessors • Radically new architecture ? • Which one ?

  10. Hardware accelerators ? • SIMD extensions: • Seem to be accepted; shift the burden to application developers and compilers • Reconfigurable datapaths: • Popular when you have a well-defined, intrinsically parallel application • Vector extensions: • Might be the right move when targeting essentially scientific computing
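
To make the "burden on developers and compilers" point concrete, here is a hedged sketch (my example, not from the slides) of the same loop written once for the compiler to auto-vectorize and once with explicit SSE intrinsics:

    #include <emmintrin.h>   /* SSE/SSE2 intrinsics; an x86 target is assumed */

    /* Compiler-friendly version: a plain loop the auto-vectorizer can handle. */
    void add_auto(const float *a, const float *b, float *c, int n) {
        for (int i = 0; i < n; i++)
            c[i] = a[i] + b[i];
    }

    /* Developer-burden version: explicit 4-wide SIMD plus a scalar tail. */
    void add_sse(const float *a, const float *b, float *c, int n) {
        int i = 0;
        for (; i + 4 <= n; i += 4) {
            __m128 va = _mm_loadu_ps(a + i);
            __m128 vb = _mm_loadu_ps(b + i);
            _mm_storeu_ps(c + i, _mm_add_ps(va, vb));
        }
        for (; i < n; i++)      /* remaining elements */
            c[i] = a[i] + b[i];
    }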

  11. On-chip memory/processors/memory bandwidth • The uniprocessor credo was: “Use the remaining silicon for caches” • New issue: an extra processor or more cache ? • Extra processing power = increased memory bandwidth demand, increased power consumption, more temperature hot spots • Extra cache = decreased (external) memory bandwidth demand
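
One crude way to quantify this tradeoff (a rule-of-thumb model of mine, not from the slides): off-chip traffic scales linearly with the core count but drops only slowly with cache size, using the classical square-root rule of thumb for miss ratios,

\[
B_{\text{off-chip}} \approx N_{\text{cores}} \cdot a \cdot m(C) \cdot L,
\qquad m(C) \approx \frac{k}{\sqrt{C}}
\]

where $a$ is the memory access rate per core, $m(C)$ the miss ratio of a cache of size $C$, $L$ the line size, and $k$ an application-dependent constant. Adding a core raises bandwidth demand linearly, while spending the same silicon on cache reduces it only by roughly the square root of the size increase.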

  12. Memory hierarchy organization ?

  13. Flat: sharing a big L2/L3 cache ? [Figure: twelve μP cores, each with a private cache ($), all backed by one big shared L3 cache]

  14. Flat: communication issues ? Through the big cache [Figure: twelve μP cores with private caches communicating through the shared L3 cache]

  15. Flat: communication issues ? Grid-like ? [Figure: twelve μP cores with private caches and the L3 cache, connected in a grid-like fashion]

  16. Hierarchical organization ? [Figure: eight μP cores with private caches, pairs of cores sharing four L2 caches, backed by a shared L3 cache]

  17. Hierarchical organization ? • Arbitration at all levels • Coherency at all levels • Interleaving at all levels • Bandwidth dimensioning
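
For the "bandwidth dimensioning" bullet, one possible first-cut rule (my sketch, not from the slides) is to size each level for the aggregate miss traffic of the level above:

\[
B_{L3} \gtrsim \sum_{i=1}^{N_{L2}} r_i \, m_i \, L
\]

where $r_i$ is the access rate of L2 cache $i$, $m_i$ its miss ratio, and $L$ the line size; the memory interface is then dimensioned the same way from the L3 miss traffic.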

  18. NoC structure • Very dependent on the memory hierarchy organization !! • + sharing coprocessors/hardware accelerators • + I/O buses/(processors ?) • + memory interface • + network interface

  19. Example [Figure: six μP cores with private caches, pairs of cores sharing three L2 caches, a shared L3 cache, a memory interface, and I/O]

  20. Multithreading ? • An extra level of thread parallelism !! • Might be an interesting alternative to prefetching on massively parallel applications

  21. Power and thermal issues • Voltage/frequency scaling to adapt to the workload ? • Adapting the workload to the available power ? • Adapting/dimensioning the architecture to the power budget • Activity migration for managing temperatures ?
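
The reason voltage/frequency scaling is the lever of choice here (the standard CMOS approximation, not from the slides): dynamic power is

\[
P_{\text{dyn}} \approx \alpha\, C\, V^2 f, \qquad f \propto V \;\Rightarrow\; P_{\text{dyn}} \propto V^3
\]

so scaling voltage and frequency down together buys a roughly cubic power reduction for a linear performance loss, which is why adapting the architecture or the workload to the power budget usually starts there.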

  22. General issues for software/compiler • Parallelism detection and partitioning: • find the correct granularity • Memory bandwidth mastering • Non-uniform memory latency • Optimizing sequential code portions
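
A simple model for the "correct granularity" bullet (my own sketch, not from the slides): with useful work $g$ per task and a per-task scheduling/communication overhead $o$,

\[
\text{efficiency} \approx \frac{g}{g + o}
\]

so grains must be coarse enough to amortize $o$, yet fine enough to keep all cores load-balanced and to adapt to the non-uniform memory latencies mentioned above.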

  23. SSC design specificities

  24. Basic core granularity • RISC cores • VLIW cores • In-order superscalar cores

  25. Homogeneous vs. heterogeneous ISAs • Core specialization: • RISC + VLIW or DSP slaves ? • Master core + a set of special purpose cores ?

  26. Sharing issue • Simple cores: • Lots of duplication and lots of unused resources at any given time • Adjacent cores can share: • Caches • Functional units: FP, mult/div, multimedia • Hardware accelerators

  27. An example of sharing [Figure: adjacent simple cores (μP) with private IL1/DL1 caches and instruction fetch, sharing FP units, a hardware accelerator, and the L2 cache]

  28. Multithreading/prefetching • Multithreading: • Is the extra complexity worth it for simple cores ? • Prefetching: • Is it worth it ? • Sharing prefetch engines ?
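
As a point of comparison for the "is it worth it" question, software prefetching can be sketched in a few lines (a hedged example of mine using the GCC/Clang __builtin_prefetch builtin; the prefetch distance of 16 elements is an arbitrary illustrative choice):

    /* Hedged sketch: software prefetching of a streaming read. */
    void scale(double *x, const double *y, long n, double a) {
        for (long i = 0; i < n; i++) {
            if (i + 16 < n)                            /* stay within the array */
                __builtin_prefetch(&y[i + 16], 0, 1);  /* read, low temporal locality */
            x[i] = a * y[i];
        }
    }

A hardware prefetch engine does the equivalent transparently; the slide's question is whether such an engine is worth dedicating to each simple core or should be shared among adjacent cores.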

  29. Vision of an SSC (my own vision)

  30. SSC: the basic brick [Figure: a cluster of simple cores (μP) with private I$ and D$ caches, shared FP units, and a shared L2 cache]

  31. [Figure: four SSC bricks, each a cluster of simple cores with I$/D$ caches, FP units, and its own L2 cache, around a shared L3 cache with memory, network, and system interfaces]

  32. FCC design specificities

  33. Only limited available thread parallelism ? • Focus on uniprocessor architecture: • Find the correct tradeoff between complexity and performance • Power and temperature issues • Vector extensions ? • Contiguous vectors (a la SSE) ? • Strided vectors in L2 caches (Tarantula-like) ?
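
To make the contiguous-versus-strided distinction concrete (my own sketch, not from the slides): SSE-style extensions handle the first loop well, while a Tarantula-like strided vector unit targets accesses like the second.

    /* Contiguous access: maps directly onto SSE-style packed vectors. */
    void axpy(float *y, const float *x, float a, int n) {
        for (int i = 0; i < n; i++)
            y[i] += a * x[i];
    }

    /* Strided access (e.g. walking a column of a row-major matrix):
     * awkward for packed SSE loads, the target of strided vector support. */
    void axpy_col(float *y, const float *x, float a, int n, int stride) {
        for (int i = 0; i < n; i++)
            y[i * stride] += a * x[i * stride];
    }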

  34. Performance enablers • SMT for parallel workloads ? • Helper threads ? • Run ahead threads • Speculative multithreading hardware support

  35. Intermediate design ? • SSCs: • Shine on massively parallel applications • Poor/limited performance on sequential sections • FCCs: • Moderate performance on parallel applications • Good performance on sequential sections

  36. Mix of FCC and SSC: Amdahl’s law
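
The argument behind this slide is the standard Amdahl's-law bound, stated here in its usual form: with a fraction $p$ of the work parallelizable over $N$ cores,

\[
\text{Speedup}(N) = \frac{1}{(1-p) + p/N} \;\xrightarrow{\;N \to \infty\;}\; \frac{1}{1-p}
\]

so the sequential fraction quickly dominates; a mix in which an FCC-style complex core runs the sequential sections while an SSC-style array runs the parallel sections attacks both terms at once.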

  37. Ultimate Out-of-order Superscalar: the basic brick [Figure: ultimate out-of-order cores with private I$ and D$ caches, FP units, and a shared L2 cache]

  38. [Figure: four “Ult. O-O-O” bricks, each with I$/D$ caches and an L2 cache, sharing an L3 cache, a memory interface, a network interface, and a system interface]

  39. Conclusion • The era of the uniprocessor has come to an end • No clear trend on how to continue • Might be time for more architectural diversity
