  1. Multi-This, Multi-That, …

  2. Limits on IPC
  • Lam92
    • This paper focused on the impact of control flow on ILP
    • Speculative execution can expose 10-400 IPC
    • assumes no machine limitations except for control dependencies and actual dataflow dependencies
  • Wall91
    • This paper looked at limits more broadly
    • No branch prediction, no register renaming, no memory disambiguation: 1-2 IPC
    • ∞-entry bpred, 256 physical registers, perfect memory disambiguation: 4-45 IPC
    • perfect bpred, register renaming and memory disambiguation: 7-60 IPC
    • This paper did not consider "control independent" instructions

  3. Practical Limits
  • Today, 1-2 IPC sustained
    • far from the 10's-100's reported by limit studies
  • Limited by:
    • branch prediction accuracy
    • underlying DFG (influenced by algorithms, compiler)
    • memory bottleneck
    • design complexity (implementation, test, validation, manufacturing, etc.)
    • power
    • die area

  4. Differences Between Real Hardware and Limit Studies?
  • Real branch predictors aren't 100% accurate
  • Memory disambiguation is not perfect
  • Physical resources are limited
    • can't have infinite register renaming w/o an infinite PRF
    • would need an infinite-entry ROB, RS and LSQ
    • would need 10's-100's of execution units for 10's-100's of IPC
  • Bandwidth/latencies are limited
    • studies assumed single-cycle execution
    • infinite fetch/commit bandwidth
    • infinite memory bandwidth (perfect caching)

  5. Bridging the Gap
  [Chart: Watts/IPC (log scale: 1, 10, 100) for Single-Issue Pipelined, Superscalar Out-of-Order (Today), Superscalar Out-of-Order (Hypothetical-Aggressive), and the Limits]
  • Power has been growing exponentially as well
  • Diminishing returns w.r.t. larger instruction window, higher issue-width

  6. Past the Knee of the Curve?
  [Chart: Performance vs. "Effort" for Scalar In-Order, Moderate-Pipe Superscalar/OOO, and Very-Deep-Pipe Aggressive Superscalar/OOO]
  • Made sense to go Superscalar/OOO: good ROI
  • Beyond that point, very little gain for substantial effort

  7. So how do we get more Performance?
  • Keep pushing IPC and/or frequency?
    • possible, but too costly
    • design complexity (time to market), cooling (cost), power delivery (cost), etc.
  • Look for other parallelism
    • ILP/IPC: fine-grained parallelism
    • Multi-programming: coarse-grained parallelism
      • assumes multiple user-visible processing elements
      • all parallelism up to this point was user-invisible

  8. User Visible/Invisible
  • All microarchitecture performance gains up to this point were "free"
    • in that no user intervention is required beyond buying the new processor/system
    • recompilation/rewriting could provide even more benefit, but you get some even if you do nothing
  • Multi-processing pushes the problem of finding the parallelism above the ISA interface

  9. Workload Benefits
  [Diagram: running Task A and Task B back-to-back on one 3-wide or 4-wide OOO CPU vs. in parallel on two 3-wide or two 2-wide OOO CPUs; the two-CPU configurations finish sooner (shorter total runtime)]
  • This assumes you have two tasks/programs to execute…

  10. … If Only One Task
  [Diagram: with only Task A to run, one 4-wide OOO CPU beats one 3-wide; two 3-wide CPUs give no benefit over one CPU (the second sits idle); two 2-wide CPUs are a performance degradation]

  11. Sources of (Coarse) Parallelism
  • Different applications
    • MP3 player in background while you work on Office
    • other background tasks: OS/kernel, virus check, etc.
  • Piped applications
    • gunzip -c foo.gz | grep bar | perl some-script.pl
  • Within the same application
    • Java (scheduling, GC, etc.)
    • explicitly coded multi-threading (pthreads, MPI, etc.)

  12. (Execution) Latency vs. Bandwidth
  • Desktop processing
    • typically want an application to execute as quickly as possible (minimize latency)
  • Server/enterprise processing
    • often throughput oriented (maximize bandwidth)
    • latency of an individual task is less important
    • ex. Amazon processing thousands of requests per minute: it's OK if an individual request takes a few seconds more, so long as the total number of requests is processed in time

  13. Benefit of MP Depends on Workload
  [Chart: speedup for 1-4 CPUs as the parallelizable portion of the work is split across them]
  • Limited number of parallel tasks to run on a PC
    • adding more CPUs than tasks provides zero performance benefit
  • Even for parallel code, Amdahl's law will likely result in sub-linear speedup (sketched below)
  • In practice, the parallelizable portion may not be evenly divisible
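
  As a rough illustration, a minimal C sketch of Amdahl's law; the 80% parallel fraction is an assumed value, not from the slides:

    #include <stdio.h>

    /* Amdahl's law: speedup = 1 / ((1 - f) + f/n), where f is the
     * parallelizable fraction of the work and n is the number of CPUs. */
    static double amdahl(double f, int n) {
        return 1.0 / ((1.0 - f) + f / n);
    }

    int main(void) {
        double f = 0.8;             /* assume 80% of the work parallelizes */
        for (int n = 1; n <= 4; n++)
            printf("%d CPU(s): speedup = %.2fx\n", n, amdahl(f, n));
        return 0;                   /* 4 CPUs yield only ~2.5x, not 4x */
    }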

  14. Basic Models for Parallel Programs
  • Shared Memory
    • there's some portion of memory (maybe all) which is shared among all cooperating processes (threads)
    • communicate by reading/writing shared locations
  • Message Passing
    • no memory is shared
    • explicitly send a message to/from threads to communicate data/control

  15. Shared Memory Model
  • That's basically it…
    • need to fork/join threads and synchronize (typically with locks; see the sketch below)
  [Diagram: CPU0 writes X and CPU1 reads X through shared Main Memory]
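
  A minimal shared-memory sketch using POSIX threads (assuming a pthreads environment; the shared variable x and the value 42 are illustrative): one thread writes a shared location, the other reads it, with a mutex providing the synchronization:

    #include <pthread.h>
    #include <stdio.h>

    static int x;                         /* shared location "X" */
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void *writer(void *arg) {      /* plays the role of CPU0: Write X */
        pthread_mutex_lock(&lock);
        x = 42;
        pthread_mutex_unlock(&lock);
        return NULL;
    }

    int main(void) {                      /* plays the role of CPU1: Read X */
        pthread_t t;
        pthread_create(&t, NULL, writer, NULL);
        pthread_join(t, NULL);            /* join, then read the shared location */
        pthread_mutex_lock(&lock);
        printf("X = %d\n", x);
        pthread_mutex_unlock(&lock);
        return 0;
    }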

  16. Cache Coherency
  [Diagram: CPU0 writes X and CPU1 later reads X; each CPU has its own L2 above shared Main Memory]
  • On the write, need to invalidate the other copies, or update them: other copies (if any) are now stale
  • On the read, CPU0 has the most recent value, so it services the load request instead of going to main memory
  • Normally don't need to modify main memory: requests are serviced by other CPUs (main memory only updated on evict/writeback)

  17. Cache Coherency Protocols
  • Not covered in this course
    • Milos' 8803 next semester will cover it in more detail
  • Many different protocols
    • different numbers of states
    • different bandwidth/performance/complexity tradeoffs
  • Current protocols are usually referred to by their states (one is sketched below)
    • ex. MESI, MOESI, etc.
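
  Since the protocols are only named here, the following is just a sketch of the MESI state set in C, with one illustrative transition; a real protocol defines transitions for every processor and bus event in every state:

    /* The four states that give MESI its name. */
    typedef enum { MODIFIED, EXCLUSIVE, SHARED, INVALID } mesi_t;

    /* One illustrative transition: another CPU wants to write this line,
     * so our copy, whatever state it was in, must be invalidated
     * (a MODIFIED copy would first supply/write back its data). */
    static mesi_t on_remote_write(mesi_t s) {
        (void)s;
        return INVALID;
    }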

  18. Message Passing Protocols
  • Explicitly send data from one thread to another
    • need to track IDs of other CPUs
    • a broadcast may need multiple sends
    • each CPU has its own memory space
  • Hardware: send/recv queues between CPUs (sketched below)
  [Diagram: CPU0's Send queue feeding CPU1's Recv queue]
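
  A software sketch of the kind of one-direction send/recv queue the slide describes, assuming a single producer and single consumer; the queue size is made up, and a real shared-memory implementation would additionally need atomics or memory barriers:

    #include <stdint.h>

    #define QSIZE 16                      /* power of two; hypothetical depth */

    struct queue {                        /* one direction: CPU0 -> CPU1 */
        uint32_t buf[QSIZE];
        unsigned head, tail;              /* head: next recv, tail: next send */
    };

    static int send_msg(struct queue *q, uint32_t v) {
        if (q->tail - q->head == QSIZE) return 0;   /* queue full */
        q->buf[q->tail++ % QSIZE] = v;
        return 1;
    }

    static int recv_msg(struct queue *q, uint32_t *v) {
        if (q->head == q->tail) return 0;           /* queue empty */
        *v = q->buf[q->head++ % QSIZE];
        return 1;
    }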

  19. You Can Fake One on the Other
  [Diagram 1: faking message passing on shared memory: a common memory holds per-CPU regions "Memory0" and "Memory1"; CPU0's Send(42) becomes Write Q1=42 and CPU1's Recv() becomes Read Q1]
  [Diagram 2: faking shared memory on message-passing CPUs: CPU0's Write X=42 is forwarded as Send(X,42), and CPU1 learns "X" was written via Recv(); apart from communication delays, each individual memory space has the same contents]

  20. Shared Memory Focus
  • Most small-to-medium multi-processors (these days) use some sort of shared memory
  • Shared memory doesn't scale as well to larger numbers of nodes
    • communications are broadcast based
    • the bus becomes a severe bottleneck
  • Message passing doesn't need a centralized bus
    • can arrange a multi-processor like a graph
      • nodes = CPUs, edges = independent links/routes
    • can have multiple communications/messages in transit at the same time

  21. SMP Machines
  • SMP = Symmetric Multi-Processing
  • Symmetric = all CPUs are "equal"
    • equal = any process can run on any CPU
    • contrast with older parallel systems with a master CPU and multiple worker CPUs
  [Diagram: four equal CPUs, CPU0-CPU3, on a shared interconnect]

  22. Hardware Modifications for SMP
  • Processor
    • mainly support for cache coherence protocols
      • includes caches, write buffers, LSQ
    • control complexity increases, as memory latencies may be substantially more variable
  • Motherboard
    • multiple sockets (one per CPU)
    • datapaths between CPUs and the memory controller
  • Other
    • case: larger for a bigger mobo, better airflow
    • power: bigger power supply for N CPUs
    • cooling: need to remove N CPUs' worth of heat

  23. Chip-Multiprocessing
  • Simple SMP on the same chip
  [Block diagrams: Intel "Smithfield" and the AMD dual-core Athlon FX]

  24. Shared Caches
  • Resources can be shared between CPUs
    • ex. IBM Power 5: the L2 cache is shared between both CPUs (no need to keep two copies coherent)
    • the L3 cache is also shared (only the tags are on-chip; the data are off-chip)
  [Diagram: CPU0 and CPU1 sharing one L2]

  25. Benefits?
  • Cheaper than mobo-based SMP
    • all/most interface logic integrated onto the main chip (fewer total chips, single CPU socket, single interface to main memory)
    • less power than mobo-based SMP as well (on-die communication is more power-efficient than chip-to-chip communication)
  • Performance
    • on-chip communication is faster
  • Efficiency
    • potentially better use of hardware resources than trying to make a wider/more-OOO single-threaded CPU

  26. Performance vs. Power
  • 2x CPUs is not necessarily 2x performance
  • 2x CPUs ⇒ ½ power for each
    • maybe a little better than ½ if resources can be shared
  • Back-of-the-envelope calculation (reproduced in the sketch below):
    • 3.8 GHz CPU at 100W
    • dual-core: 50W per CPU
    • P ∝ V³: V_orig³/V_CMP³ = 100W/50W ⇒ V_CMP ≈ 0.8 V_orig
    • f ∝ V: f_CMP ≈ 0.8 × 3.8 GHz ≈ 3.0 GHz
  • Benefit of SMP: full power budget per socket!
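
  The same back-of-the-envelope numbers as C code, under the slide's simplifying assumptions that P ∝ V³ and f ∝ V (compile with -lm):

    #include <math.h>
    #include <stdio.h>

    int main(void) {
        double p_orig = 100.0, f_orig = 3.8;   /* 100 W at 3.8 GHz */
        double p_core = p_orig / 2.0;          /* dual-core: 50 W per CPU */
        /* P scales as V^3, so V_CMP/V_orig = cbrt(P_CMP/P_orig) */
        double vscale = cbrt(p_core / p_orig); /* ~0.79, i.e. ~0.8 */
        /* f scales linearly with V */
        printf("V scale = %.2f, f_CMP = %.1f GHz\n", vscale, vscale * f_orig);
        return 0;                              /* ~0.8 * 3.8 GHz ~ 3.0 GHz */
    }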

  27. Simultaneous Multi-Threading
  • Uni-processor: 4-6 wide, lucky if you get 1-2 IPC
    • poor utilization
  • SMP: 2-4 CPUs, but need independent tasks
    • else poor utilization as well
  • SMT: idea is to use a single large uni-processor as a multi-processor

  28. [Diagram: a regular CPU, a 2-way SMT, and a 4-thread SMT each cost approximately 1x hardware; a CMP costs 2x]

  29. Overview of SMT Hardware Changes
  • For an N-way (N threads) SMT, we need (see the sketch after this list):
    • ability to fetch from N threads
    • N sets of registers (including PCs)
    • N rename tables (RATs)
    • N virtual memory spaces
  • But we don't need to replicate the entire OOO execution engine (schedulers, execution units, bypass networks, ROBs, etc.)
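
  A structural sketch in C of what that per-thread replication might look like; all names and sizes are hypothetical:

    #include <stdint.h>

    #define NTHREADS  2
    #define NARCHREGS 32

    struct thread_ctx {
        uint64_t pc;                   /* per-thread PC */
        uint16_t rat[NARCHREGS];       /* per-thread rename table (RAT) */
        uint64_t asid;                 /* per-thread virtual memory space */
    };

    /* Replicated: one context per thread.  NOT replicated: the shared
     * OOO engine (schedulers, execution units, bypass, ROB, PRF). */
    struct smt_frontend {
        struct thread_ctx ctx[NTHREADS];
    };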

  30. SMT Fetch
  [Diagram 1: cycle-multiplexed fetch logic: PC0, PC1, PC2 are selected by cycle % N into a single fetch of the shared I$, then decode, etc., and on into the RS]
  • Duplicate fetch logic
    [Diagram 2: each of PC0, PC1, PC2 has its own fetch into the I$, feeding decode, rename and dispatch into the RS]
  • Alternatives
    • multiplex the fetch logic on something other than the cycle count
    • duplicate the I$ as well

  31. SMT Rename
  • Thread #1's R12 != Thread #2's R12
    • separate name spaces
    • need to disambiguate
  [Diagram: two options: (1) one RAT per thread (RAT0 indexed by Thread0's register #, RAT1 by Thread1's), both mapping into the shared PRF; (2) a single RAT indexed by the thread-ID concatenated with the register # (sketched below)]
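
  A sketch of the second option, a single RAT indexed by the thread-ID concatenated with the architectural register number; the 32-entry architectural register file is an assumption:

    #define LOG2_NARCHREGS 5   /* 32 architectural registers */

    /* Thread 1's R12 must not alias Thread 0's R12, so the RAT index
     * is the concatenation {thread_id, arch_reg}. */
    static unsigned rat_index(unsigned thread_id, unsigned arch_reg) {
        return (thread_id << LOG2_NARCHREGS) | arch_reg;
    }
    /* rat_index(0,12) == 12, rat_index(1,12) == 44: distinct entries */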

  32. SMT Issue, Exec, Bypass, …
  • No change needed
  Before renaming (both threads happen to run the same code):
    Thread 0: Add R1 = R2 + R3; Sub R4 = R1 – R5; Xor R3 = R1 ^ R4; Load R2 = 0[R3]
    Thread 1: Add R1 = R2 + R3; Sub R4 = R1 – R5; Xor R3 = R1 ^ R4; Load R2 = 0[R3]
  After renaming:
    Thread 0: Add T12 = T20 + T8; Sub T19 = T12 – T16; Xor T14 = T12 ^ T19; Load T23 = 0[T14]
    Thread 1: Add T17 = T29 + T3; Sub T5 = T17 – T2; Xor T31 = T17 ^ T5; Load T25 = 0[T31]
  Shared RS entries: after renaming, the two threads' instructions carry disjoint physical tags, so they can sit intermixed in the same reservation stations

  33. SMT Cache
  • Each process has its own virtual address space
    • the TLB must be thread-aware (see the sketch below)
      • translate (thread-id, virtual page) → physical page
    • the virtual portion of the caches must also be thread-aware
      • a VIVT cache must now be (virtual addr, thread-id)-indexed and (virtual addr, thread-id)-tagged
      • similar for a VIPT cache
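
  A sketch of a thread-aware TLB match in C; the struct layout is invented for illustration. The point is that the tag comparison includes the thread ID, so two threads' identical virtual pages don't hit each other's translations:

    #include <stdbool.h>
    #include <stdint.h>

    struct tlb_entry {
        bool     valid;
        uint8_t  thread_id;     /* which hardware thread owns this entry */
        uint64_t vpage;         /* virtual page number */
        uint64_t ppage;         /* physical page number */
    };

    /* translate (thread-id, virtual page) -> physical page */
    static bool tlb_hit(const struct tlb_entry *e,
                        uint8_t thread_id, uint64_t vpage) {
        return e->valid && e->thread_id == thread_id && e->vpage == vpage;
    }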

  34. SMT Commit
  • One "commit PC" per thread
  • Register file management
    • ARF/PRF organization: need one ARF per thread
    • unified PRF: need one "architected RAT" per thread
  • Need to maintain interrupts, exceptions and faults on a per-thread basis
  • Just as an OOO core must appear to the outside world to execute in-order, an SMT core must appear as if it is actually N CPUs

  35. SMT Design Space
  • Number of threads
  • Full-SMT vs. hard-partitioned SMT (contrasted in the sketch after this list)
    • full-SMT: ROB entries can be allocated arbitrarily between the threads
    • hard-partitioned: if only one thread, it may use all ROB entries; with two threads, each is limited to one half of the ROB (even if the other thread uses only a few entries); possibly similar for RS, LSQ, PRF, etc.
  • Amount of duplication
    • duplicate I$, D$, fetch engine, decoders, schedulers, etc.?
  • There's a continuum of possibilities between SMT and CMP
    • ex. could have a CMP where the FP unit is shared SMT-style
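
  A sketch contrasting the two ROB-allocation policies, with a hypothetical 128-entry ROB:

    #define ROB_SIZE 128

    /* Full-SMT: a thread may allocate any free entry. */
    static int can_alloc_full(int free_entries) {
        return free_entries > 0;
    }

    /* Hard-partitioned: with T active threads, each thread is capped at
     * ROB_SIZE / T entries even if the others leave theirs unused. */
    static int can_alloc_partitioned(int my_entries, int active_threads) {
        return my_entries < ROB_SIZE / active_threads;
    }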

  36. SMT Performance
  • When it works, it fills idle "issue slots" with work from other threads; throughput improves
  • But sometimes it can cause performance degradation!
    • i.e., Time(finish one task, then do the other) < Time(do both at the same time using SMT)

  37. How?
  • Cache thrashing
  [Diagram: Thread0 just fits in the level-1 caches (I$, D$) and executes reasonably quickly due to high cache hit rates; after a context switch, Thread1 also fits nicely in the caches on its own; but the caches were just big enough to hold one thread's data, not two threads' worth, so with both threads running, both have significantly higher cache miss rates]

  38. This is all combinable
  • Can have a system that supports SMP, CMP and SMT at the same time
  • Take a dual-socket SMP motherboard…
  • Insert two chips, each a dual-core CMP…
  • Where each core supports two-way SMT
  • This example provides 8 threads' worth of execution, shared on 4 actual "cores", split across two physical packages

  39. OS Confusion
  • SMT/CMP is supposed to look like multiple CPUs to the software/OS
  [Diagram: two cores (either SMP/CMP), each 2-way SMT, appear as four virtual CPUs; the OS has two tasks A and B and schedules them to CPU0 and CPU1, the two SMT contexts of the same core, leaving the other core (CPU2/CPU3) idle]
  • Performance is worse than if SMT were turned off and only 2-way SMP were used

  40. OS Confusion (2)
  • Asymmetries in the MP hierarchy can be very difficult for the OS to deal with
    • need to break the abstraction: the OS needs to know which CPUs are real physical processors (SMP), which are shared in the same package (CMP), and which are virtual (SMT)
  • Distinct applications should be scheduled to physically different CPUs
    • no cache contention, no power contention
  • Cooperative applications (different threads of the same program) should maybe be scheduled to the same physical chip (CMP)
    • reduces latency of inter-thread communication; possibly reduces duplication if a shared L2 is used
  • Use SMT as the last choice (one way to encode this preference order is sketched below)
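
  One way to express that preference order is a scoring function; the topology encoding below is invented for illustration:

    /* For placing a second, unrelated task relative to an already-running
     * one: prefer a CPU in a different package, then a different core in
     * the same package (CMP), and an SMT sibling only as a last choice. */
    struct vcpu { int package, core; };   /* SMT context implied by vcpu id */

    static int placement_score(struct vcpu running, struct vcpu candidate) {
        if (candidate.package != running.package) return 2;  /* best */
        if (candidate.core != running.core)       return 1;  /* same chip */
        return 0;                                 /* SMT sibling: last */
    }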

  41. Multi-* is Happening
  • Just a question of exactly how: number of cores, support for SMT/SoeMT, asymmetric cores, etc.
