Single-Chip Multiprocessors: Redefining the Microarchitecture of Multiprocessors

Presentation Transcript

  1. Single-Chip Multiprocessors: Redefining the Microarchitecture of Multiprocessors. Guri Sohi, University of Wisconsin

  2. Outline • Waves of innovation in architecture • Innovation in uniprocessors • Opportunities for innovation • Example CMP innovations • Parallel processing models • CMP memory hierarchies

  3. Waves of Research and Innovation • A new direction is proposed or new opportunity becomes available • The center of gravity of the research community shifts to that direction • SIMD architectures in the 1960s • HLL computer architectures in the 1970s • RISC architectures in the early 1980s • Shared-memory MPs in the late 1980s • OOO speculative execution processors in the 1990s

  4. Waves • Wave is especially strong when coupled with a “step function” change in technology • Integration of a processor on a single chip • Integration of a multiprocessor on a chip

  5. Uniprocessor Innovation Wave • Integration of processor on a single chip • The inflexion point • Argued for different architecture (RISC) • More transistors allow for different models • Speculative execution • Then the rebirth of uniprocessors • Continue the journey of innovation • Totally rethink uniprocessor microarchitecture

  6. The Next Wave • Can integrate a simple multiprocessor on a chip • Basic microarchitecture similar to traditional MP • Rethink the microarchitecture and usage of chip multiprocessors

  7. Broad Areas for Innovation • Overcoming traditional barriers • New opportunities to use CMPs for parallel/multithreaded execution • Novel use of on-chip resources (e.g., on-chip memory hierarchies and interconnect)

  8. Remainder of Talk Roadmap • Summary of traditional parallel processing • Revisiting traditional barriers and overheads to parallel processing • Novel CMP applications and workloads • Novel CMP microarchitectures

  9. Multiprocessor Architecture • Take state-of-the-art uniprocessor • Connect several together with a suitable network • Expend hardware to provide cache coherence and streamline inter-node communication • Have to live with defined interfaces

  10. Software Responsibilities • Reason about parallelism, execution times, and overheads • This is hard • Use synchronization to ease reasoning • Synchronization tends to turn parallel execution serial • Very difficult to parallelize transparently

  11. Net Result • Difficult to get speedup from parallelism • Computation remains largely serial • Inter-node communication latencies exacerbate the problem • Multiprocessors rarely used for parallel execution • Typical use: improve throughput • This will have to change • Will need to rethink “parallelization”

  12. Rethinking Parallelization • Speculative multithreading • Speculation to overcome other performance barriers • Revisiting computation models for parallelization • New parallelization opportunities • New types of workloads (e.g., multimedia) • New variants of parallelization models

  13. Speculative Multithreading • Speculatively parallelize an application • Use speculation to overcome ambiguous dependences • Use hardware support to recover from mis-speculation • E.g., multiscalar • Use hardware to overcome barriers
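
To make the speculative-multithreading idea concrete, here is a minimal software sketch under simplifying assumptions: the only shared state is one versioned value, and `VersionedInt`, the version check, and the retry loop are hypothetical illustrations rather than the hardware mechanism (e.g., Multiscalar) the slide refers to. The task runs ahead assuming the ambiguous dependence does not materialize, and squashes and re-executes if a conflicting write is detected.

```cpp
#include <atomic>
#include <future>
#include <iostream>

// Hypothetical shared value with a version counter used to detect whether
// an ambiguous ("maybe") dependence actually occurred.
struct VersionedInt {
    std::atomic<int> value{0};
    std::atomic<unsigned> version{0};
    void write(int v) { value.store(v); version.fetch_add(1); }
};

VersionedInt shared_input;

int add_one(int x) { return x + 1; }

// Speculatively execute work() on a snapshot of shared_input; if the version
// changed underneath us, squash the result and re-execute (mis-speculation).
int speculative_task(int (*work)(int)) {
    for (;;) {
        unsigned v0 = shared_input.version.load();
        int snapshot = shared_input.value.load();
        int result = work(snapshot);                 // run ahead speculatively
        if (shared_input.version.load() == v0)       // no conflicting write
            return result;                           // commit
        // conflict detected: discard result and retry
    }
}

int main() {
    shared_input.write(41);
    auto task = std::async(std::launch::async, speculative_task, add_one);
    std::cout << "committed result: " << task.get() << "\n";   // 42
}
```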

  14. Overcoming Barriers: Memory Models • Weak models proposed to overcome performance limitations of sequential consistency (SC) • Speculation used to overcome “maybe” dependences • Series of papers showing SC can achieve the performance of weak models

  15. Implications • Strong memory models not necessarily low performance • Programmer does not have to reason about weak models • More likely to have parallel programs written

  16. Overcoming Barriers: Synchronization • Synchronization to avoid “maybe” dependences • Causes serialization • Speculate to overcome serialization • Recent work on techniques to dynamically elide synchronization constructs
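
As a rough software analogue of dynamically eliding a synchronization construct (the hardware techniques alluded to here do this transparently and more generally; the counter, retry bound, and fallback policy below are invented for illustration): the critical section's effect is first attempted optimistically without serializing on the lock, and the lock is acquired only if the optimistic attempts keep failing.

```cpp
#include <atomic>
#include <mutex>
#include <thread>
#include <vector>
#include <iostream>

std::atomic<long> counter{0};
std::mutex counter_lock;   // the lock the programmer wrote around "counter += n"

// Optimistically perform the update without taking the lock; fall back to
// the lock (full serialization) only after repeated conflicts.
void add_elided(long n) {
    long old = counter.load(std::memory_order_relaxed);
    for (int attempt = 0; attempt < 8; ++attempt) {
        if (counter.compare_exchange_weak(old, old + n))
            return;                                   // elided: no serialization
    }
    std::lock_guard<std::mutex> g(counter_lock);      // conflict: serialize
    counter.fetch_add(n);
}

int main() {
    std::vector<std::thread> ts;
    for (int i = 0; i < 4; ++i)
        ts.emplace_back([] { for (int j = 0; j < 100000; ++j) add_elided(1); });
    for (auto& t : ts) t.join();
    std::cout << counter << "\n";                     // 400000
}
```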

  17. Implications • Programmer can make liberal use of synchronization to ease programming • Little performance impact of synchronization • More likely to have parallel programs written

  18. Revisiting Parallelization Models • Transactions simplify writing of parallel code, but implementing their semantics in software has very high overhead • Hardware support for transactions will exist • Speculative multithreading is ordered transactions, with no software overhead to implement the semantics • More applications likely to be written with transactions
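
A sketch of what the transactional model buys the programmer, assuming a hypothetical `atomic_do` wrapper; it is backed here by a single global mutex purely to keep the example self-contained, which preserves the all-or-nothing, isolated semantics but none of the concurrency or performance that hardware transaction support would provide. Speculative multithreading then corresponds to such transactions being implicitly ordered by program order at commit.

```cpp
#include <mutex>

// Hypothetical transactional wrapper. With hardware support the body would
// execute speculatively and commit atomically; a single global mutex stands
// in here so the example is runnable with the same semantics.
std::mutex txn_fallback;

template <typename Body>
void atomic_do(Body body) {
    std::lock_guard<std::mutex> g(txn_fallback);
    body();
}

struct Account { long balance = 0; };

// The programmer reasons about one transaction, not about which locks
// protect which account or in what order they must be acquired.
void transfer(Account& from, Account& to, long amount) {
    atomic_do([&] {
        from.balance -= amount;
        to.balance   += amount;
    });
}

int main() {
    Account a{100}, b{0};
    transfer(a, b, 40);
    return (a.balance == 60 && b.balance == 40) ? 0 : 1;
}
```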

  19. New Opportunities for CMPs • New opportunities for parallelism • Emergence of multimedia workloads • Amenable to traditional parallel processing • Parallel execution of overhead code • Program demultiplexing • Separating user and OS

  20. New Opportunities for CMPs • New opportunities for microarchitecture innovation • Instruction memory hierarchies • Data memory hierarchies

  21. Reliable Systems • Software is unreliable and error-prone • Hardware will be unreliable and error-prone • Improving hardware/software reliability will result in significant software redundancy • Redundancy will be source of parallelism

  22. Software Reliability & Security • Reliability & security via dynamic monitoring • Many academic proposals for C/C++ code: CCured, Cyclone, SafeC, etc. • VM performs checks for Java/C# • High overheads! Encourages use of unsafe code

  23. Software Reliability & Security: Monitoring Code • Divide program into tasks • Fork a monitor thread to check the computation of each task • Instrument the monitor thread with safety-checking code (figure: program tasks A, B, C, D with corresponding monitor threads A', B', C')

  24. A B A’ B’ C’ C’ Software Reliability & Security - Commit/abort at task granularity - Precise error detection achieved by re-executing code w/ in-lined checks D C COMMIT D COMMIT ABORT

  25. Software Reliability & Security • Fine-grained instrumentation: flexible, arbitrary safety checking • Coarse-grained verification: amortizes thread startup latency over many instructions • Initial results: software-based fault isolation (Wahbe et al., SOSP 1993) • Assumes no misspeculation due to traps, interrupts, or speculative storage overflow

  26. Software Reliability & Security (results figure)

  27. Program De-multiplexing • New opportunities for parallelism • 2-4x parallelism • A program is a multiplexing of methods (or functions) onto a single control flow • De-multiplex the methods of the program • Execute methods in parallel in dataflow fashion, as sketched below
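
A software caricature of de-multiplexed execution, assuming the methods are independent once their inputs are available (all names below are illustrative): each method starts when its operands are ready, as in a dataflow graph, rather than at its call site in the sequential control flow.

```cpp
#include <future>
#include <iostream>

// Hypothetical independent methods of the program.
int  load_a()              { return 3; }
int  load_b()              { return 4; }
int  combine(int a, int b) { return a * b; }

int main() {
    // Sequentially the program would call load_a(), load_b(), and combine()
    // one after another on a single control flow. De-multiplexed, the two
    // loads run in parallel and combine() fires when both complete, i.e.
    // execution follows the data dependences, not the call order.
    auto fa = std::async(std::launch::async, load_a);
    auto fb = std::async(std::launch::async, load_b);
    int result = combine(fa.get(), fb.get());
    std::cout << result << "\n";   // prints 12
}
```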

  28. Data-flow Execution • Data-flow machines: dataflow in programs; no control flow or PC; nodes are instructions on function units • OoO superscalar machines: sequential programs; limited dataflow within the ROB; nodes are instructions on function units (figure: dataflow graph of numbered instructions)

  29. Program Demultiplexing • Nodes are methods (M); processors act as function units (figure: a sequential program contrasted with demux'ed dataflow execution of its numbered methods)

  30. Program Demultiplexing • Triggers usually fire after the data dependences of M are satisfied; chosen with software support • Handlers generate parameters • PD is data-flow-based speculative execution; differs from control-flow speculation (figure: on trigger, method 6 is executed speculatively; on call, its result is used)

  31. Data-flow Execution of Methods (figure: timeline contrasting the earliest possible execution time of M with its execution time plus overheads)

  32. Impact of Parallelization • Expect different characteristics for code on each core • More reliance on inter-core parallelism • Less reliance on intra-core parallelism • May have specialized cores

  33. Microarchitectural Implications • Processor Cores • Skinny, less complex • Will SMT go away? • Perhaps specialized • Memory structures (i-caches, TLBs, d-caches) • Different organizations possible

  34. Microarchitectural Implications • Novel memory hierarchy opportunities • Use on-chip memory hierarchy to improve latency and attenuate off-chip bandwidth • Pressure on non-core techniques to tolerate longer latencies • Helper threads, pre-execution
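
A small sketch of the helper-thread / pre-execution idea in the last bullet, under the assumption that touching data ahead of the main traversal warms the shared on-chip cache (the traversal and data layout are invented for illustration): a helper thread runs a stripped-down version of the upcoming computation purely for its memory side effects.

```cpp
#include <atomic>
#include <iostream>
#include <numeric>
#include <thread>
#include <vector>

// Helper thread: walk the index array ahead of the main thread and touch
// each element so that it is (hopefully) resident on chip when needed.
void prefetch_helper(const std::vector<int>& idx,
                     const std::vector<double>& data,
                     std::atomic<bool>& stop) {
    volatile double sink = 0;            // keep the loads from being elided
    for (size_t i = 0; i < idx.size() && !stop.load(); ++i)
        sink = sink + data[idx[i]];
    (void)sink;
}

int main() {
    std::vector<double> data(1 << 20, 1.0);
    std::vector<int> idx(1 << 20);
    std::iota(idx.begin(), idx.end(), 0);

    std::atomic<bool> stop{false};
    std::thread helper(prefetch_helper, std::cref(idx), std::cref(data),
                       std::ref(stop));

    double sum = 0;                      // the real (main-thread) work
    for (int i : idx) sum += data[i];

    stop = true;
    helper.join();
    std::cout << sum << "\n";
}
```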

  35. Instruction Memory Hierarchy • Computation spreading

  36. Conventional CMP Organizations (figure: each processor has private L1 I- and D-caches and a private L2, connected by an interconnect network)

  37. Code Commonality • A large fraction of the executed code is common to all processors, at both 8 KB page and 64-byte cache block granularity • Replication causes poor utilization of the L1 I-cache

  38. Removing Duplication • Avoiding duplication significantly reduces misses • A shared L1 I-cache may be impractical (figure: instruction MPKI with a 64 KB I-cache across the benchmarks)

  39. Computation Spreading • Avoid reference spreading by distributing computation based on code regions • Each processor is responsible for a particular region of code (across multiple threads) • References to a particular code region are localized • Computation from one thread is carried out on different processors (see the sketch below)
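
A simplified sketch of the policy (the region identifiers, the per-region worker, and the dispatch queue are hypothetical): instead of one thread executing all of its own code on one core, region-tagged fragments of work are routed to the core that owns that code region, so each core's L1 I-cache sees only its region.

```cpp
#include <condition_variable>
#include <functional>
#include <iostream>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// One worker per core; each worker owns one code region and executes only
// work belonging to that region, localizing instruction-cache references.
struct RegionWorker {
    std::queue<std::function<void()>> q;
    std::mutex m;
    std::condition_variable cv;
    bool done = false;

    void run() {
        std::unique_lock<std::mutex> lk(m);
        while (!done || !q.empty()) {
            cv.wait(lk, [&] { return done || !q.empty(); });
            while (!q.empty()) {
                auto work = std::move(q.front()); q.pop();
                lk.unlock(); work(); lk.lock();
            }
        }
    }
    void submit(std::function<void()> work) {
        { std::lock_guard<std::mutex> g(m); q.push(std::move(work)); }
        cv.notify_one();
    }
    void shutdown() {
        { std::lock_guard<std::mutex> g(m); done = true; }
        cv.notify_one();
    }
};

int main() {
    enum Region { PARSE = 0, EXECUTE = 1, NUM_REGIONS = 2 };   // code regions
    std::vector<RegionWorker> workers(NUM_REGIONS);
    std::vector<std::thread> cores;
    for (auto& w : workers) cores.emplace_back(&RegionWorker::run, &w);

    // An application thread becomes a sequence of region-tagged steps; each
    // step is spread to the core that owns its region.
    workers[PARSE].submit([]   { std::cout << "parse step on core 0\n"; });
    workers[EXECUTE].submit([] { std::cout << "execute step on core 1\n"; });

    for (auto& w : workers) w.shutdown();
    for (auto& t : cores) t.join();
}
```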

  40. Example (figure: execution over time under the canonical model versus computation spreading)

  41. L1 Cache Performance • Significant reduction in instruction misses • Data cache performance deteriorates • Migration of computation disrupts the locality of data references

  42. Data Memory Hierarchy • Co-operative caching

  43. Conventional CMP Organizations (figure: private per-processor L1 and L2 caches connected by an interconnect network)

  44. Conventional CMP Organizations (figure: private per-processor L1 caches with a shared L2 behind the interconnect network)

  45. A Spectrum of CMP Cache Designs (figure: spectrum from shared caches, with unconstrained capacity sharing and best capacity, to private caches, with no capacity sharing and best latency; in between lie hybrid schemes such as CMP-NUCA, victim replication, and CMP-NuRAPID, configurable/malleable caches (Liu et al. HPCA'04, Huh et al. ICS'05, etc.), and cooperative caching)

  46. CMP Cooperative Caching • Basic idea: private caches cooperatively form an aggregate global cache • Use private caches for fast access / isolation • Share capacity through cooperation • Mitigate interference via cooperation throttling • Inspired by cooperative file/web caches (figure: processors with private L1s and L2s cooperating over the interconnect; a toy sketch follows)
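
A toy model of the cooperation policy, under heavy simplifications (block-granularity LRU lists stand in for real private L2 caches, and the spill decision is the simplest possible): when a private cache evicts a block, the block is placed into a peer's private cache rather than discarded, so the private caches collectively approximate a shared aggregate cache. On the tiny trace in `main`, the second pass over blocks 0-3 then hits on chip instead of going off chip.

```cpp
#include <iostream>
#include <list>
#include <unordered_set>
#include <vector>

// Toy private cache: an LRU list of block addresses.
struct Cache {
    size_t capacity;
    std::list<long> lru;                       // front = most recently used
    std::unordered_set<long> present;

    bool lookup(long blk) {
        if (!present.count(blk)) return false;
        lru.remove(blk); lru.push_front(blk);  // refresh LRU position
        return true;
    }
    // Insert a block, returning the evicted victim (or -1 if none).
    long insert(long blk) {
        if (present.count(blk)) { lookup(blk); return -1; }
        long victim = -1;
        if (lru.size() == capacity) {
            victim = lru.back(); lru.pop_back(); present.erase(victim);
        }
        lru.push_front(blk); present.insert(blk);
        return victim;
    }
};

int main() {
    std::vector<Cache> l2 = {{2, {}, {}}, {2, {}, {}}};   // two tiny private L2s
    long offchip_accesses = 0;

    auto access = [&](int core, long blk) {
        if (l2[core].lookup(blk)) return;                  // local hit
        for (auto& peer : l2)                              // remote on-chip hit
            if (&peer != &l2[core] && peer.lookup(blk)) return;
        ++offchip_accesses;                                // miss everywhere
        long victim = l2[core].insert(blk);
        // Cooperation: spill the victim into the peer cache instead of
        // dropping it, so on-chip capacity is shared.
        if (victim != -1) l2[core ^ 1].insert(victim);
    };

    for (long blk : {0, 1, 2, 3, 0, 1, 2, 3}) access(0, blk);
    std::cout << "off-chip accesses: " << offchip_accesses << "\n";
}
```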

  47. Reduction of Off-chip Accesses • Trace-based simulation, 1 MB L2 cache per core • C-Clean: good for commercial workloads • C-Singlet: good for all benchmarks • C-1Fwd: good for heterogeneous workloads • Shared (LRU) gives the most reduction (figure: reduction of off-chip access rate per benchmark, with per-benchmark private-cache MPKI)

  48. Cycles Spent on L1 Misses • Latencies: 10 cycles local L2, 30 cycles next L2, 40 cycles remote L2, 300 cycles off-chip • Trace-based simulation with no miss overlapping • Configurations: P = Private, C = C-Clean, F = C-1Fwd, S = Shared (figure: cycle breakdown for each configuration on OLTP, Apache, ECperf, JBB, Barnes, Ocean, Mix1, and Mix2)

  49. Summary • Start of a new wave of innovation to rethink parallelism and parallel architecture • New opportunities for innovation in CMPs • New opportunities for parallelizing applications • Expect little resemblance between MPs today and CMPs in 15 years • We need to invent and define differences