
On Power and Multi-Processors





Presentation Transcript


  1. On Power and Multi-Processors Finishing up power issues and how those issues have led us to multi-core processors. Introduce multi-processor systems.

  2. Power from last time • Power is perhaps the performance limiter • Can’t remove enough heat to keep performance increasing • Even for things with plugs, energy is an issue • $$$$ for energy, $$$$ to remove heat. • For things without plugs, energy is huge • Cost of batteries • Time between charging

  3. What we did last time (1/5) • Power is important • It has become a (perhaps the) limiting factor in performance

  4. When last we met (2/5) • Power is important to computer architects. • It limits performance. • With cooling techniques we can increase our power budget (and thus our performance). • But these techniques get very expensive very quickly. • Active cooling systems cost $ to build, and cost power (and thus $) to operate.

  5. When we last met (3/5) • Energy is important to computer architects • Energy is what we really pay for (10¢ per kWh) • Energy is what limits batteries (usually listed as mAh) • AA batteries tend to have 1000-2000 mAh, or (assuming 1.5V) 1.5-3.0Wh* • iPad battery is rated at about 25Wh. • Some devices are limited by the energy they can scavenge. http://iprojectideas.blogspot.com/2010/06/human-power-harvesting.html *Watch those units. Assuming it takes 5Wh of energy to charge a 3Wh battery, how much does it cost for the energy to charge that battery? What about an iPad?

  6. When we last met (4/5): Power vs. Energy What uses power in a chip?

  7. When last we met (5/5) • How do performance and power relate?* • Power is approximately proportional to V²f. • Performance is approximately proportional to f. • The required voltage is approximately proportional to the desired frequency.** • If you accept all of these assumptions, we get that an X increase in performance would require an X³ increase in power. • This is a really important fact. *These are all pretty rough rules of thumb. Consider the second one and discuss its shortcomings. **This one in particular tends to hold only over fairly small (10-20%?) changes in V.

  8. What we didn’t quite do last time • For things without plugs, energy is huge • Cost of batteries • Time between charging • (somewhat extreme) Example: • iPad has ~42Wh (11,560mAh) which is huge • Still run down a battery in hours • Costs $99 for a new one (2 years of use or so)

  9. Dynamic (Capacitive) Power Dissipation • Data dependent - a function of switching activity

  10. Capacitive Power Dissipation: Power ~ ½CV²Af • Capacitance (C): function of wire length, transistor size • Supply voltage (V): has been dropping with successive fab generations • Clock frequency (f): increasing… • Activity factor (A): how often, on average, do wires switch?

  11. Static Power: Leakage Currents • Subthreshold currents grow exponentially with increases in temperature and decreases in threshold voltage • But threshold voltage scaling is key to circuit performance! • Gate leakage is primarily dependent on gate oxide thickness and biases • Both types of leakage are heavily dependent on stacking and input pattern


  13. Aside: What do we mean by Power? • Max power: artificial code generating max CPU activity • Worst-case app trace: practical applications' worst case • Thermal power: running average of worst-case app power over a time period corresponding to the thermal time constant • Average power: long-term average over typical apps (minutes) • Transient power: variability in power consumption for the supply network

  14. Lowering Dynamic Power • Reducing Vdd has a quadratic effect • Has a negative (~linear) effect on performance, however • Lowering C_L (load capacitance) • May improve performance as well • Keep transistors small (keeps intrinsic capacitance (gate and diffusion) small) • Reduce switching activity • A function of signal transition stats and clock rate • Clock-gating idle units • Impacted by logic and architecture decisions


  16. Lowering Static Power • Design-time decisions • Use fewer, smaller transistors -- stack when possible to minimize contacts with Vdd/Gnd • Multithreshold process technology (multiple oxides too!) • Use “high-Vt” slow transistors whenever possible • Dynamic techniques • Reverse body bias (dynamically adjust threshold) • Low-leakage sleep mode (maintains state), e.g. XScale • Vdd-gating (cut voltage/ground connection to circuits) • Zero-leakage sleep mode • Loses state; overheads to enable/disable

  17. Power vs. Energy • Power: consumption in watts, the rate at which energy is consumed over time • Determines battery life in hours • Sets packaging limits • Energy: measured in joules or kWh • Energy = power * delay (joules = watts * seconds) • A lower energy number means less power to perform a computation at the same frequency

  18. Power vs. Energy (figure)

  19. Power vs. Energy • Power-delay product (PDP) = Pavg * t • PDP is the average energy consumed per switching event • Energy-delay product (EDP) = PDP * t • Takes into account that one can trade increased delay for lower energy/operation • Energy-delay² product (ED2P) = EDP * t • Gives a sense of what voltage scaling could get you.

  20. E vs. EDP vs. ED2P • Currently have a processor design: • 80W, 1 BIPS, 1.5V, 1GHz • Want to reduce power, willing to lose some performance • Cache optimization: • IPC decreases by 10%, power reduced by 20% => Final processor: 900 MIPS, 64W • Relative E = MIPS/W (higher is better) = 14.06/12.5 ≈ 1.125x • Energy is better, but is this a “better” processor?

  21. Not necessarily • 80W, 1 BIPS, 1.5V, 1GHz • Cache optimization: IPC decreases by 10%, power reduced by 20% => 900 MIPS, 64W • Relative E = MIPS/W ≈ 1.125x • Relative EDP = MIPS²/W ≈ 1.01x • Relative ED2P = MIPS³/W ≈ 0.911x • What if we just adjust frequency/voltage on the processor? • How to reduce power by 20%? • P ∝ CV²f ∝ CV³ (since f ∝ V) => drop voltage (and also frequency) by 7% => 0.93 * 0.93 * 0.93 ≈ 0.8x • So for equal power (64W): • Cache optimization = 900 MIPS • Simple voltage/frequency scaling = 930 MIPS

  22. So? • What we’ve concluded is that if we want to increase performance by a factor of X, we might be looking at a factor of X³ in power! • But if you are paying attention, that’s just for voltage scaling! • What about other techniques? Performance & Power

  23. Other techniques? • Well, we could try to improve the amount of ILP we take advantage of. • That probably involves making a “wider” processor (more superscalar) • What are the costs associated with doing that? • How much bigger do things get? • What do we expect the performance gains to be? • How about circuit techniques? • Historically the “threshold voltage” has dropped as circuits get smaller. • So power drops. • This has (mostly) stopped being true. • And it’s actually what got us into trouble to begin with!

  24. So we are hosed… • I mean, if voltage scaling doesn’t work, circuit shrinking doesn’t help (much), and ILP techniques don’t clearly work… • What’s left? • How about we drop performance to 80% of what it was and have 2 processors? • How much power does that take? • How much performance could we get? • Pros/cons? • What if I wanted 8 processors? • How much performance drop is needed per processor?

  25. Multiprocessors Keeping it all working together

  26. Thread-Level Parallelism
struct acct_t { int bal; };
shared struct acct_t accts[MAX_ACCT];
int id, amt;
if (accts[id].bal >= amt) {
  accts[id].bal -= amt;
  spew_cash();
}
• Thread-level parallelism (TLP) • Collection of asynchronous tasks: not started and stopped together • Data shared loosely, dynamically • Example: database/web server (each query is a thread) • accts is shared, can’t register allocate even if it were scalar • id and amt are private variables, register allocated to r1, r2
0: addi r1,accts,r3
1: ld 0(r3),r4
2: blt r4,r2,6
3: sub r4,r2,r4
4: st r4,0(r3)
5: call spew_cash
6: ...

  27. Shared-Memory Multiprocessors • Shared memory • Multiple execution contexts sharing a single address space • Multiple programs (MIMD) • Or more frequently: multiple copies of one program (SPMD) • Implicit (automatic) communication via loads and stores (diagram: P1-P4 sharing one memory system)

  28. Why Shared Memory? Pluses • To applications, looks like a multitasking uniprocessor • For the OS, only evolutionary extensions are required • Easy to do communication without the OS • Software can worry about correctness first, then performance Minuses • Proper synchronization is complex • Communication is implicit, so harder to optimize • Hardware designers must implement it Result • Traditionally bus-based symmetric multiprocessors (SMPs), and now CMPs, are the most successful parallel machines ever • And the first with multi-billion-dollar markets

  29. Shared-Memory Multiprocessors • Provide a shared-memory abstraction • Familiar and efficient for programmers (diagram: P1-P4, each with a cache and network interface, connected through an interconnection network to memories M1-M4)

  30. Paired vs. Separate Processor/Memory? • Separate processor/memory • Uniform memory access (UMA): equal latency to all memory • Simple software: doesn’t matter where you put data • Lower peak performance • Bus-based UMAs common: symmetric multi-processors (SMP) • Paired processor/memory • Non-uniform memory access (NUMA): faster to local memory • More complex software: where you put data matters • Higher peak performance, assuming proper data placement

  31. Shared vs. Point-to-Point Networks • Shared network: e.g., bus (left) • Low latency • Low bandwidth: doesn’t scale beyond ~16 processors • Shared property simplifies cache coherence protocols (later) • Point-to-point network: e.g., mesh or ring (right) • Longer latency: may need multiple “hops” to communicate • Higher bandwidth: scales to 1000s of processors • Cache coherence protocols are complex

  32. Organizing Point-To-Point Networks • Network topology: organization of the network • Trades off performance (connectivity, latency, bandwidth) vs. cost • Router chips • Networks that require separate router chips are indirect • Networks that use processor/memory/router packages are direct • Fewer components: “glueless MP” • Point-to-point network examples • Indirect: tree (left) • Direct: mesh or ring (right)

  33. Implementation #1: Snooping Bus MP • Two basic implementations • Bus-based systems • Typically small: 2-8 (maybe 16) processors • Typically processors split from memories (UMA) • Sometimes multiple processors on a single chip (CMP) • Symmetric multiprocessors (SMPs) • Common; I use one every day

  34. Implementation #2: Scalable MP • General point-to-point network-based systems • Typically processor/memory/router blocks (NUMA) • Glueless MP: no need for additional “glue” chips • Can be arbitrarily large: 1000s of processors • Massively parallel processors (MPPs) • In reality only government (DoD) has MPPs… • Companies have much smaller systems: 32-64 processors • Scalable multi-processors

  35. Issues for Shared Memory Systems • Two in particular • Cache coherence • Memory consistency model • Closely related to each other • Different solutions for SMPs and MPPs

  36. An Example Execution • Two $100 withdrawals from account #241 at two ATMs • Each transaction maps to a thread on a different processor • Track accts[241].bal (address is in r3) • Both Processor 0 and Processor 1 run the same code:
0: addi r1,accts,r3
1: ld 0(r3),r4
2: blt r4,r2,6
3: sub r4,r2,r4
4: st r4,0(r3)
5: call spew_cash

  37. No-Cache, No-Problem (same two-processor code as slide 36) • Scenario I: processors have no caches • No problem • Memory value over time: 500 → 400 → 300

  38. Cache Incoherence (same two-processor code as slide 36) • Scenario II: processors have write-back caches • Potentially 3 copies of accts[241].bal: memory, P0’s cache, P1’s cache • Can become incoherent (inconsistent): memory stays at 500 while P0’s cache holds a dirty 400 and P1’s cache also ends up with a dirty 400

  39. Hardware Cache Coherence • Coherence controller (CC) sits between the cache (D$ tags and data) and the bus • Examines bus traffic (addresses and data) • Executes the coherence protocol • Decides what to do with the local copy when you see different things happening on the bus

  40. Snooping Cache-Coherence Protocols • The bus provides a serialization point • Each cache controller “snoops” all bus transactions • Takes action to ensure coherence: invalidate, update, or supply the value • Which action depends on the state of the block and the protocol

  41. Snooping Design Choices • The controller updates the state of blocks in response to processor and snoop events, and generates bus transactions • Often have duplicate cache tags (so snooping doesn’t contend with the processor) • Snoopy protocol: a set of states, a state-transition diagram, and actions • Basic choices: write-through vs. write-back, invalidate vs. update (diagram: processor ld/st on one side, snooped bus transactions on the other, both hitting a cache of state/tag/data entries)

  42. The Simple Invalidate Snooping Protocol • Write-through, no-write-allocate cache • Actions: PrRd, PrWr, BusRd, BusWr • Transitions (processor or bus event / bus transaction generated): • Invalid, PrRd / BusRd -> Valid • Invalid, PrWr / BusWr -> Invalid (no allocation on write) • Valid, PrRd / -- -> Valid • Valid, PrWr / BusWr -> Valid (write through) • Valid, snooped BusWr -> Invalid

  43. Example (worked over time; figure) • Actions: PrRd, PrWr, BusRd, BusWr

  44. More Generally: MOESI [Sweazey & Smith, ISCA ’86] • M - Modified (dirty) • O - Owned (dirty but shared) WHY? • E - Exclusive (clean, unshared): only copy, not dirty • S - Shared • I - Invalid • Venn groupings from the slide: validity = {M, O, E, S}; exclusiveness = {M, E}; ownership = {M, O} • Variants • MSI • MESI • MOSI • MOESI

  45. MESI example • Actions: • PrRd, PrWr • BRL - Bus Read Line (BusRd) • BWL - Bus Write Line (BusWr) • BRIL - Bus Read and Invalidate • BIL - Bus Invalidate Line • States: • M - Modified (dirty) • E - Exclusive (clean, unshared): only copy, not dirty • S - Shared • I - Invalid
