Single-Chip Multiprocessors: Redefining the Microarchitecture of Multiprocessors

Presentation Transcript

  1. Single-Chip Multiprocessors: Redefining the Microarchitecture of Multiprocessors. Guri Sohi, University of Wisconsin

  2. Outline • Waves of innovation in architecture • Innovation in uniprocessors • Opportunities for innovation • Example CMP innovations • Parallel processing models • CMP memory hierarchies

  3. Waves of Research and Innovation • A new direction is proposed or new opportunity becomes available • The center of gravity of the research community shifts to that direction • SIMD architectures in the 1960s • HLL computer architectures in the 1970s • RISC architectures in the early 1980s • Shared-memory MPs in the late 1980s • OOO speculative execution processors in the 1990s

  4. Waves • Wave is especially strong when coupled with a “step function” change in technology • Integration of a processor on a single chip • Integration of a multiprocessor on a chip

  5. Uniprocessor Innovation Wave • Integration of processor on a single chip • The inflexion point • Argued for different architecture (RISC) • More transistors allow for different models • Speculative execution • Then the rebirth of uniprocessors • Continue the journey of innovation • Totally rethink uniprocessor microarchitecture

  6. The Next Wave • Can integrate a simple multiprocessor on a chip • Basic microarchitecture similar to traditional MP • Rethink the microarchitecture and usage of chip multiprocessors

  7. Broad Areas for Innovation • Overcoming traditional barriers • New opportunities to use CMPs for parallel/multithreaded execution • Novel use of on-chip resources (e.g., on-chip memory hierarchies and interconnect)

  8. Remainder of Talk Roadmap • Summary of traditional parallel processing • Revisiting traditional barriers and overheads to parallel processing • Novel CMP applications and workloads • Novel CMP microarchitectures

  9. Multiprocessor Architecture • Take state-of-the-art uniprocessor • Connect several together with a suitable network • Expend hardware to provide cache coherence and streamline inter-node communication • Have to live with defined interfaces

  10. Software Responsibilities • Reason about parallelism, execution times, and overheads • This is hard • Use synchronization to ease reasoning • Synchronization tends to turn parallel execution serial • Very difficult to parallelize transparently

  11. Net Result • Difficult to get speedup from parallelism • Computation remains largely serial • Inter-node communication latencies exacerbate the problem • Multiprocessors rarely used for parallel execution • Typical use: improve throughput • This will have to change • Will need to rethink “parallelization”

  12. Rethinking Parallelization • Speculative multithreading • Speculation to overcome other performance barriers • Revisiting computation models for parallelization • New parallelization opportunities • New types of workloads (e.g., multimedia) • New variants of parallelization models

  13. Speculative Multithreading • Speculatively parallelize an application • Use speculation to overcome ambiguous dependences • Use hardware support to recover from mis-speculation • E.g., multiscalar • Use hardware to overcome barriers
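
To make the speculative-multithreading idea concrete, here is a minimal software sketch under simplifying assumptions: the only shared state is one versioned value, and `VersionedInt`, the version check, and the retry loop are hypothetical illustrations rather than the hardware mechanism (e.g., Multiscalar) the slide refers to. The task runs ahead assuming the ambiguous dependence does not materialize, and squashes and re-executes if a conflicting write is detected.

```cpp
#include <atomic>
#include <future>
#include <iostream>

// Hypothetical shared value with a version counter used to detect whether
// an ambiguous ("maybe") dependence actually occurred.
struct VersionedInt {
    std::atomic<int> value{0};
    std::atomic<unsigned> version{0};
    void write(int v) { value.store(v); version.fetch_add(1); }
};

VersionedInt shared_input;

int add_one(int x) { return x + 1; }

// Speculatively execute work() on a snapshot of shared_input; if the version
// changed underneath us, squash the result and re-execute (mis-speculation).
int speculative_task(int (*work)(int)) {
    for (;;) {
        unsigned v0 = shared_input.version.load();
        int snapshot = shared_input.value.load();
        int result = work(snapshot);                 // run ahead speculatively
        if (shared_input.version.load() == v0)       // no conflicting write
            return result;                           // commit
        // conflict detected: discard result and retry
    }
}

int main() {
    shared_input.write(41);
    auto task = std::async(std::launch::async, speculative_task, add_one);
    std::cout << "committed result: " << task.get() << "\n";   // 42
}
```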

  14. Overcoming Barriers: Memory Models • Weak models proposed to overcome performance limitations of sequential consistency (SC) • Speculation used to overcome “maybe” dependences • Series of papers showing SC can achieve the performance of weak models

  15. Implications • Strong memory models not necessarily low performance • Programmer does not have to reason about weak models • More likely to have parallel programs written

  16. Overcoming Barriers: Synchronization • Synchronization to avoid “maybe” dependences • Causes serialization • Speculate to overcome serialization • Recent work on techniques to dynamically elide synchronization constructs
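
As a rough software analogue of dynamically eliding a synchronization construct (the hardware techniques alluded to here do this transparently and more generally; the counter, retry bound, and fallback policy below are invented for illustration): the critical section's effect is first attempted optimistically without serializing on the lock, and the lock is acquired only if the optimistic attempts keep failing.

```cpp
#include <atomic>
#include <mutex>
#include <thread>
#include <vector>
#include <iostream>

std::atomic<long> counter{0};
std::mutex counter_lock;   // the lock the programmer wrote around "counter += n"

// Optimistically perform the update without taking the lock; fall back to
// the lock (full serialization) only after repeated conflicts.
void add_elided(long n) {
    long old = counter.load(std::memory_order_relaxed);
    for (int attempt = 0; attempt < 8; ++attempt) {
        if (counter.compare_exchange_weak(old, old + n))
            return;                                   // elided: no serialization
    }
    std::lock_guard<std::mutex> g(counter_lock);      // conflict: serialize
    counter.fetch_add(n);
}

int main() {
    std::vector<std::thread> ts;
    for (int i = 0; i < 4; ++i)
        ts.emplace_back([] { for (int j = 0; j < 100000; ++j) add_elided(1); });
    for (auto& t : ts) t.join();
    std::cout << counter << "\n";                     // 400000
}
```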

  17. Implications • Programmer can make liberal use of synchronization to ease programming • Little performance impact of synchronization • More likely to have parallel programs written

  18. Revisiting Parallelization Models • Transactions simplify writing of parallel code, but implementing their semantics in software has very high overhead • Hardware support for transactions will exist • Speculative multithreading is ordered transactions, with no software overhead to implement the semantics • More applications likely to be written with transactions
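
A sketch of what the transactional model buys the programmer, assuming a hypothetical `atomic_do` wrapper; it is backed here by a single global mutex purely to keep the example self-contained, which preserves the all-or-nothing, isolated semantics but none of the concurrency or performance that hardware transaction support would provide. Speculative multithreading then corresponds to such transactions being implicitly ordered by program order at commit.

```cpp
#include <mutex>

// Hypothetical transactional wrapper. With hardware support the body would
// execute speculatively and commit atomically; a single global mutex stands
// in here so the example is runnable with the same semantics.
std::mutex txn_fallback;

template <typename Body>
void atomic_do(Body body) {
    std::lock_guard<std::mutex> g(txn_fallback);
    body();
}

struct Account { long balance = 0; };

// The programmer reasons about one transaction, not about which locks
// protect which account or in what order they must be acquired.
void transfer(Account& from, Account& to, long amount) {
    atomic_do([&] {
        from.balance -= amount;
        to.balance   += amount;
    });
}

int main() {
    Account a{100}, b{0};
    transfer(a, b, 40);
    return (a.balance == 60 && b.balance == 40) ? 0 : 1;
}
```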

  19. New Opportunities for CMPs • New opportunities for parallelism • Emergence of multimedia workloads • Amenable to traditional parallel processing • Parallel execution of overhead code • Program demultiplexing • Separating user and OS

  20. New Opportunities for CMPs • New opportunities for microarchitecture innovation • Instruction memory hierarchies • Data memory hierarchies

  21. Reliable Systems • Software is unreliable and error-prone • Hardware will be unreliable and error-prone • Improving hardware/software reliability will result in significant software redundancy • Redundancy will be source of parallelism

  22. Software Reliability & Security • Reliability & security via dynamic monitoring • Many academic proposals for C/C++ code: CCured, Cyclone, SafeC, etc. • VM performs checks for Java/C# • High overheads! Encourages use of unsafe code

  23. Software Reliability & Security: Monitoring Code • Divide program into tasks • Fork a monitor thread to check the computation of each task • Instrument the monitor thread with safety-checking code (figure: program tasks A, B, C, D with corresponding monitor threads A', B', C')

  24. A B A’ B’ C’ C’ Software Reliability & Security - Commit/abort at task granularity - Precise error detection achieved by re-executing code w/ in-lined checks D C COMMIT D COMMIT ABORT

  25. Software Reliability & Security • Fine-grained instrumentation: flexible, arbitrary safety checking • Coarse-grained verification: amortizes thread startup latency over many instructions • Initial results: software-based fault isolation (Wahbe et al., SOSP 1993) • Assumes no misspeculation due to traps, interrupts, or speculative storage overflow

  26. Software Reliability & Security (results figure)

  27. Program De-multiplexing • New opportunities for parallelism • 2-4x parallelism • A program is a multiplexing of methods (or functions) onto a single control flow • De-multiplex the methods of the program • Execute methods in parallel in dataflow fashion, as sketched below
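
A software caricature of de-multiplexed execution, assuming the methods are independent once their inputs are available (all names below are illustrative): each method starts when its operands are ready, as in a dataflow graph, rather than at its call site in the sequential control flow.

```cpp
#include <future>
#include <iostream>

// Hypothetical independent methods of the program.
int  load_a()              { return 3; }
int  load_b()              { return 4; }
int  combine(int a, int b) { return a * b; }

int main() {
    // Sequentially the program would call load_a(), load_b(), and combine()
    // one after another on a single control flow. De-multiplexed, the two
    // loads run in parallel and combine() fires when both complete, i.e.
    // execution follows the data dependences, not the call order.
    auto fa = std::async(std::launch::async, load_a);
    auto fb = std::async(std::launch::async, load_b);
    int result = combine(fa.get(), fb.get());
    std::cout << result << "\n";   // prints 12
}
```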

  28. Data-flow Execution • Data-flow machines: dataflow in programs; no control flow or PC; nodes are instructions on function units • OoO superscalar machines: sequential programs; limited dataflow within the ROB; nodes are instructions on function units (figure: dataflow graph of numbered instructions)

  29. Program Demultiplexing • Nodes are methods (M); processors act as function units (figure: a sequential program contrasted with demux'ed dataflow execution of its numbered methods)

  30. Program Demultiplexing • Triggers usually fire after the data dependences of M are satisfied; chosen with software support • Handlers generate parameters • PD is data-flow-based speculative execution; differs from control-flow speculation (figure: on trigger, method 6 is executed speculatively; on call, its result is used)

  31. Data-flow Execution of Methods (figure: timeline contrasting the earliest possible execution time of M with its execution time plus overheads)

  32. Impact of Parallelization • Expect different characteristics for code on each core • More reliance on inter-core parallelism • Less reliance on intra-core parallelism • May have specialized cores

  33. Microarchitectural Implications • Processor Cores • Skinny, less complex • Will SMT go away? • Perhaps specialized • Memory structures (i-caches, TLBs, d-caches) • Different organizations possible

  34. Microarchitectural Implications • Novel memory hierarchy opportunities • Use on-chip memory hierarchy to improve latency and attenuate off-chip bandwidth • Pressure on non-core techniques to tolerate longer latencies • Helper threads, pre-execution
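
A small sketch of the helper-thread / pre-execution idea in the last bullet, under the assumption that touching data ahead of the main traversal warms the shared on-chip cache (the traversal and data layout are invented for illustration): a helper thread runs a stripped-down version of the upcoming computation purely for its memory side effects.

```cpp
#include <atomic>
#include <iostream>
#include <numeric>
#include <thread>
#include <vector>

// Helper thread: walk the index array ahead of the main thread and touch
// each element so that it is (hopefully) resident on chip when needed.
void prefetch_helper(const std::vector<int>& idx,
                     const std::vector<double>& data,
                     std::atomic<bool>& stop) {
    volatile double sink = 0;            // keep the loads from being elided
    for (size_t i = 0; i < idx.size() && !stop.load(); ++i)
        sink = sink + data[idx[i]];
    (void)sink;
}

int main() {
    std::vector<double> data(1 << 20, 1.0);
    std::vector<int> idx(1 << 20);
    std::iota(idx.begin(), idx.end(), 0);

    std::atomic<bool> stop{false};
    std::thread helper(prefetch_helper, std::cref(idx), std::cref(data),
                       std::ref(stop));

    double sum = 0;                      // the real (main-thread) work
    for (int i : idx) sum += data[i];

    stop = true;
    helper.join();
    std::cout << sum << "\n";
}
```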

  35. Instruction Memory Hierarchy • Computation spreading

  36. Conventional CMP Organizations (figure: each processor has private L1 I- and D-caches and a private L2, connected by an interconnect network)

  37. Code Commonality • A large fraction of the executed code is common to all processors, at both 8 KB page and 64-byte cache block granularity • Replication causes poor utilization of the L1 I-cache

  38. Removing Duplication • Avoiding duplication significantly reduces misses • A shared L1 I-cache may be impractical (figure: instruction MPKI with a 64 KB I-cache across the benchmarks)

  39. Computation Spreading • Avoid reference spreading by distributing computation based on code regions • Each processor is responsible for a particular region of code (across multiple threads) • References to a particular code region are localized • Computation from one thread is carried out on different processors (see the sketch below)
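
A simplified sketch of the policy (the region identifiers, the per-region worker, and the dispatch queue are hypothetical): instead of one thread executing all of its own code on one core, region-tagged fragments of work are routed to the core that owns that code region, so each core's L1 I-cache sees only its region.

```cpp
#include <condition_variable>
#include <functional>
#include <iostream>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// One worker per core; each worker owns one code region and executes only
// work belonging to that region, localizing instruction-cache references.
struct RegionWorker {
    std::queue<std::function<void()>> q;
    std::mutex m;
    std::condition_variable cv;
    bool done = false;

    void run() {
        std::unique_lock<std::mutex> lk(m);
        while (!done || !q.empty()) {
            cv.wait(lk, [&] { return done || !q.empty(); });
            while (!q.empty()) {
                auto work = std::move(q.front()); q.pop();
                lk.unlock(); work(); lk.lock();
            }
        }
    }
    void submit(std::function<void()> work) {
        { std::lock_guard<std::mutex> g(m); q.push(std::move(work)); }
        cv.notify_one();
    }
    void shutdown() {
        { std::lock_guard<std::mutex> g(m); done = true; }
        cv.notify_one();
    }
};

int main() {
    enum Region { PARSE = 0, EXECUTE = 1, NUM_REGIONS = 2 };   // code regions
    std::vector<RegionWorker> workers(NUM_REGIONS);
    std::vector<std::thread> cores;
    for (auto& w : workers) cores.emplace_back(&RegionWorker::run, &w);

    // An application thread becomes a sequence of region-tagged steps; each
    // step is spread to the core that owns its region.
    workers[PARSE].submit([]   { std::cout << "parse step on core 0\n"; });
    workers[EXECUTE].submit([] { std::cout << "execute step on core 1\n"; });

    for (auto& w : workers) w.shutdown();
    for (auto& t : cores) t.join();
}
```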

  40. Example (figure: execution over time under the canonical model versus computation spreading)

  41. L1 Cache Performance • Significant reduction in instruction misses • Data cache performance deteriorates • Migration of computation disrupts the locality of data references

  42. Data Memory Hierarchy • Co-operative caching

  43. Conventional CMP Organizations (figure: private per-processor L1 and L2 caches connected by an interconnect network)

  44. Conventional CMP Organizations (figure: private per-processor L1 caches with a shared L2 behind the interconnect network)

  45. A Spectrum of CMP Cache Designs (figure: spectrum from shared caches, with unconstrained capacity sharing and best capacity, to private caches, with no capacity sharing and best latency; in between lie hybrid schemes such as CMP-NUCA, victim replication, and CMP-NuRAPID, configurable/malleable caches (Liu et al. HPCA'04, Huh et al. ICS'05, etc.), and cooperative caching)

  46. CMP Cooperative Caching • Basic idea: private caches cooperatively form an aggregate global cache • Use private caches for fast access / isolation • Share capacity through cooperation • Mitigate interference via cooperation throttling • Inspired by cooperative file/web caches (figure: processors with private L1s and L2s cooperating over the interconnect; a toy sketch follows)
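
A toy model of the cooperation policy, under heavy simplifications (block-granularity LRU lists stand in for real private L2 caches, and the spill decision is the simplest possible): when a private cache evicts a block, the block is placed into a peer's private cache rather than discarded, so the private caches collectively approximate a shared aggregate cache. On the tiny trace in `main`, the second pass over blocks 0-3 then hits on chip instead of going off chip.

```cpp
#include <iostream>
#include <list>
#include <unordered_set>
#include <vector>

// Toy private cache: an LRU list of block addresses.
struct Cache {
    size_t capacity;
    std::list<long> lru;                       // front = most recently used
    std::unordered_set<long> present;

    bool lookup(long blk) {
        if (!present.count(blk)) return false;
        lru.remove(blk); lru.push_front(blk);  // refresh LRU position
        return true;
    }
    // Insert a block, returning the evicted victim (or -1 if none).
    long insert(long blk) {
        if (present.count(blk)) { lookup(blk); return -1; }
        long victim = -1;
        if (lru.size() == capacity) {
            victim = lru.back(); lru.pop_back(); present.erase(victim);
        }
        lru.push_front(blk); present.insert(blk);
        return victim;
    }
};

int main() {
    std::vector<Cache> l2 = {{2, {}, {}}, {2, {}, {}}};   // two tiny private L2s
    long offchip_accesses = 0;

    auto access = [&](int core, long blk) {
        if (l2[core].lookup(blk)) return;                  // local hit
        for (auto& peer : l2)                              // remote on-chip hit
            if (&peer != &l2[core] && peer.lookup(blk)) return;
        ++offchip_accesses;                                // miss everywhere
        long victim = l2[core].insert(blk);
        // Cooperation: spill the victim into the peer cache instead of
        // dropping it, so on-chip capacity is shared.
        if (victim != -1) l2[core ^ 1].insert(victim);
    };

    for (long blk : {0, 1, 2, 3, 0, 1, 2, 3}) access(0, blk);
    std::cout << "off-chip accesses: " << offchip_accesses << "\n";
}
```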

  47. Reduction of Off-chip Accesses • Trace-based simulation, 1 MB L2 cache per core • C-Clean: good for commercial workloads • C-Singlet: good for all benchmarks • C-1Fwd: good for heterogeneous workloads • Shared (LRU) gives the most reduction (figure: reduction of off-chip access rate per benchmark, with per-benchmark private-cache MPKI)

  48. Cycles Spent on L1 Misses • Latencies: 10 cycles local L2, 30 cycles next L2, 40 cycles remote L2, 300 cycles off-chip • Trace-based simulation with no miss overlapping • Configurations: P = Private, C = C-Clean, F = C-1Fwd, S = Shared (figure: cycle breakdown for each configuration on OLTP, Apache, ECperf, JBB, Barnes, Ocean, Mix1, and Mix2)

  49. Summary • Start of a new wave of innovation to rethink parallelism and parallel architecture • New opportunities for innovation in CMPs • New opportunities for parallelizing applications • Expect little resemblance between MPs today and CMPs in 15 years • We need to invent and define differences