Overview

Overview • Motivation (Kevin) • Thermal issues (Kevin) • Power modeling (David) • Thermal management (David) • Optimal DTM (Lev) • Clustering (Antonio) • Power distribution (David) • What current chips do (Lev) • HotSpot (Kevin)

The clustering approach • Reduce complexity by partitioning • Less latency, area, power and temperature • Fast, simple, distributed units • Communication latency is heterogeneous and exposed to the microarchitecture • Localize critical communication within clusters (fast wires) [Figure labels: global resources; cluster0 to cluster3; interconnection network]

The clustering approach (...) • Smaller structures consume less power • Higher power efficiency [Zyuban, IEEE Transactions 01] • Partitioning simplifies power management • Via clock/power gating techniques [Bahar, ISCA 01] • Via dynamic cluster resizing [González, ICCD 03] • Via DVS/DFS • Partitioning reduces temperature • Activity is distributed [Chaparro, TACS 04] • Hopping schemes can be applied [Chaparro, TACS 04] • Adds flexibility for temperature-effective layouts • IPC overheads due to communication/imbalance • Compensated by shorter latency/clock period [Palacharla, ISCA 97], [Canal, HPCA 00]

Clustered microarchitecture • Dynamic steering • Distributed Issue, Registers, FUs • Inter-cluster register communication [Figure labels: Icache; Fetch & decode; Cluster Steering logic; clusters C0 to C3 (Issue-Queue, Register File, FUs); IC Network]

On-demand communication [Canal, PACT 99] • Map table tracks locations of register values • At rename • allocate a register for the result, in the assigned cluster • if a source operand is in a remote cluster • insert a copy instruction in the remote cluster • allocate a register for the copy • At commit • free the register(s) allocated by the previous mapping [Figure labels: Register Map Table; log. reg.; phys. reg.; C0 to C3]
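
The rename-time steps above can be sketched as follows. This is a minimal illustration, not the mechanism from [Canal, PACT99]: the `MapTable` class, its `rename` method and the counter-based register allocator are hypothetical names introduced here, and the sketch assumes a register that has never been written needs no copy.

```python
from itertools import count

class MapTable:
    """Tracks, per logical register, which clusters hold a copy of its value."""

    def __init__(self, num_clusters):
        self.locations = {}  # logical reg -> {cluster: physical reg}
        self._next_phys = [count() for _ in range(num_clusters)]

    def _alloc(self, cluster):
        # Hypothetical allocator: hands out increasing physical register ids.
        return next(self._next_phys[cluster])

    def rename(self, srcs, dst, cluster):
        """Rename one instruction steered to `cluster`.

        Returns (copies, freed): the copy instructions to insert, as
        (logical reg, remote cluster) pairs, and the (cluster, physical reg)
        pairs of the previous mapping, to be freed at commit.
        """
        copies = []
        for src in srcs:
            where = self.locations.setdefault(src, {})
            if where and cluster not in where:
                remote = next(iter(where))             # a cluster holding the value
                where[cluster] = self._alloc(cluster)  # register for the copy
                copies.append((src, remote))           # copy inserted in remote cluster
        freed = list(self.locations.get(dst, {}).items())
        self.locations[dst] = {cluster: self._alloc(cluster)}
        return copies, freed
```

A second consumer of the same value in the same cluster needs no further copy, because the map table already records the local replica.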

Rename [Figure: renaming example for Cluster 1 (Renaming Table, Steering Logic); columns src1 to src5 and dst; logical row: 2, 3, X, X, X, 1; physical row: 3, 10, X, X, X, 14]

Copy instructions [Figure: renaming example for Cluster 2 (Renaming Table, Steering Logic); a copy instruction with src1 = CL1:10 and dst = CL2:27; columns src1 to src5 and dst; logical row: 2, 3, X, X, X, 1; physical values shown: 15, 27, 14]

Broadcast communication • Values sent to all register files • Local file is updated earlier than remote ones • Registers are replicated in all files • Register storage waste • Increase in power • Values are written multiple times • Increase in power • May reduce communication penalties • Values are present everywhere • But not at the same time • E.g.: Alpha 21264

Cluster assignment schemes • Main goals • Minimize inter-cluster communication penalty • Maximize workload balance • Main approaches • Static approaches [Farkas, Micro 97] [Sastry, PLDI 98] • Less flexible than dynamic ones: poor load balancing • Dynamic, dependence-based [Palacharla, ISCA 97] [Alpha 21264] [Kemp, ICPP 96] • Only consider dependences through unavailable operands • Lack specific balancing mechanisms • Dynamic, workload balance oriented [Baniasadi 00] • Only suitable for architectures with low communication penalty • Dynamic, dependence-based and workload balance oriented [Canal, HPCA 00] [Parcerisa, PACT 02] • Try to find the best trade-off between communications and workload balance

Cluster assignment schemes • Accurate-Rebalancing Priority RMB • 1. To minimize communication penalties: • If a source register is unavailable: choose its producer's cluster • Else: select the clusters with the highest number of source registers mapped • 2. Choose the least loaded of the above • Exception: if imbalance > threshold, exclude clusters with positive workload before applying rules 1 and 2
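
The two rules and the exception above can be sketched as a single function. This is an illustration under stated assumptions, not the published scheme: `steer` and its parameters are hypothetical, workload is assumed to be a signed per-cluster value relative to the average (so "positive workload" means overloaded), and imbalance is assumed to be the gap between the most and least loaded clusters.

```python
def steer(srcs, workload, threshold):
    """Pick a cluster for one instruction.

    srcs: list of (available, clusters) pairs, one per source register;
          `clusters` is the set of clusters where the register is mapped
          (for an unavailable register, its producer's cluster).
    workload: signed per-cluster load relative to the average (assumption).
    threshold: imbalance above which overloaded clusters are excluded.
    """
    candidates = set(range(len(workload)))
    # Exception: under heavy imbalance, drop clusters with positive workload.
    if max(workload) - min(workload) > threshold:
        candidates = {c for c in candidates if workload[c] <= 0} or candidates

    # Rule 1: minimize communication penalties.
    pending = set()
    for available, clusters in srcs:
        if not available:
            pending |= clusters & candidates
    if pending:
        preferred = pending
    else:
        mapped = {c: sum(c in clusters for _, clusters in srcs) for c in candidates}
        best = max(mapped.values())
        preferred = {c for c, n in mapped.items() if n == best}

    # Rule 2: least loaded among the preferred clusters (lowest id on ties).
    return min(sorted(preferred), key=lambda c: workload[c])
```

With a high threshold the function is purely dependence-driven; lowering the threshold makes rebalancing take over more often.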

Evaluation SpecInt95

Dynamic vs. static steering S. Sastry, S.Palacharla and J.E.Smith, PLDI 1998

Data cache architectures [González, WMPI 04] • Centralized • Dcache is a cluster • Single Load/Store queue • Simple disambiguation [Figure labels: four Backend blocks; L1 Dcache; UL2]

Data cache architecture (II) • Attraction caches • Lines are copied on demand • A coherence scheme is needed • Steering must exploit data locality [Figure labels: UL2; four DL1 caches; BE 1 to BE 4]

Data cache architecture (III) • Replicated • Area cost • Traffic due to store broadcast [Figure labels: UL2; four DL1 caches; BE 1 to BE 4]

Data cache architecture (IV) • Interleaved • Word/line interleaved • Steering needs to predict the bank [Figure labels: UL2; four DL1 caches; BE 1 to BE 4]

Memory issues • Disambiguation • Load/Store queues are distributed • Stores are allocated in all clusters • Address is computed in one and broadcast • Loads go to memory once previous stores know their addresses • Memory coherence • Write-Invalidate / Write-Update protocols
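
A minimal sketch of the disambiguation bullets above, assuming each store carries a program-order sequence number. The `DistributedLSQ` class and its method names are hypothetical, and the sketch only models when a load may go to memory, not forwarding or coherence.

```python
class DistributedLSQ:
    """One load/store queue per cluster; every store has an entry in all of them."""

    def __init__(self, num_clusters):
        # Per cluster: store sequence number -> address (None until known).
        self.queues = [dict() for _ in range(num_clusters)]

    def allocate_store(self, seq):
        # Stores are allocated in all clusters.
        for queue in self.queues:
            queue[seq] = None

    def broadcast_address(self, seq, address):
        # The address is computed in one cluster and broadcast to the rest.
        for queue in self.queues:
            queue[seq] = address

    def load_can_issue(self, cluster, load_seq):
        # A load goes to memory once all previous stores know their addresses.
        queue = self.queues[cluster]
        return all(addr is not None for seq, addr in queue.items() if seq < load_seq)
```

Because every cluster sees every store address, a load can be checked locally in whichever cluster it was steered to.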

Performance comparison

Thermal benefits of clustering • Example layout for a quad-cluster architecture [Figure labels: ROB, Cluster 0 to Cluster 3, FPS, CS, IS, ITLB, FPRF, IRF, RAT, DECO, TC, BP, MS/MOB, FPFU, IFU, DL0, DTLB, UL2]

Temperature metrics • AbsMax • Maximum sensed temperature • Average • Average temperature across time and area • AverageMax • Time average of the maximum sensed temperature
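
The three metrics can be computed from a trace of sensor readings as sketched below. The function names and the trace layout (one list of per-block temperatures per time step) are assumptions made for illustration, and `average` assumes all blocks have equal area; with unequal blocks it would need area weights.

```python
def abs_max(samples):
    """Maximum sensed temperature over all time steps and blocks."""
    return max(max(step) for step in samples)

def average(samples):
    """Average temperature across time and area (equal-area blocks assumed)."""
    total = sum(sum(step) for step in samples)
    return total / sum(len(step) for step in samples)

def average_max(samples):
    """Time average of the per-step maximum sensed temperature."""
    return sum(max(step) for step in samples) / len(samples)
```

For the two-step, two-block trace `[[60, 70], [80, 50]]`, AbsMax is 80, Average is 65.0 and AverageMax is 75.0.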

Clustering reduces temperature • If clustering is smart

Clustering effects • May end up with higher power densities! • Simpler and smaller units may create hotspots • Layout must be thermal-effective • Surround hotspots by cold areas • Activity steering must be smart • Other techniques (e.g. throttling) can be applied at smaller granularity • Aim at particular clusters without affecting others

Dynamic cluster resizing [González, ICCD 03] • Motivation

Dynamic cluster resizing • Proposal • Dynamically compute the energy of blocks • Schedulers, FUs, DL0s, etc. • Dynamically compute the energy × delay² of the processor • Use different configurations for different intervals • Measure the optimal configuration • Gate-off (disable) useless units • Scheduler level • Backend level
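
The selection step in the proposal above can be sketched as follows, assuming each candidate configuration has been run for one interval and its energy and delay recorded. The function names and the shape of `measurements` are hypothetical; how energy is estimated per block is outside this sketch.

```python
def ed2p(energy, delay):
    """Energy x delay^2 product for one measured interval."""
    return energy * delay ** 2

def best_configuration(measurements):
    """Return the configuration with the lowest energy x delay^2.

    measurements: {configuration: (energy, delay)}, e.g. keyed by the
    number of enabled backends (an assumption for this sketch).
    """
    return min(measurements, key=lambda cfg: ed2p(*measurements[cfg]))
```

Squaring the delay weights performance more heavily than energy, so a configuration that saves energy is selected only if it slows execution down a little.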

Dynamic cluster resizing • Is ED2P(x) < ED2P(x+1) < ED2P(x-1)? [Figure labels: I$, Decode, Rename, Steer; backends BE1 to BEn; memory bus; disamb. bus; UL2 cache; candidate configurations X-y, X-3, X-2, X-1, X, X+1, X+2, X+3, X+y, each with its ED2P value]

Dynamic cluster resizing

Cluster hopping • Motivation • Power and average temperature savings when statically Vdd gating clusters * Temperatures in the backend area when gating all but the indicated cluster(s). Reductions over the in-box ambient temperature (45º) with respect to a baseline quad-cluster architecture.

Cluster hopping • Based on activity migration [Heo, ISLPED 03] • Vdd gate a subset of clusters • Rotate clusters to spread activity over time • Gated clusters cannot provide any register value • Before gating, some register values must be evicted • Cache/DTLB contents are lost • Unless some low power (e.g. drowsy) mode is used • Proactive and/or reactive behavior • Proactive: Per interval basis • Reactive: On thermal events
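
The proactive and reactive behaviors above can be sketched as two small helpers. Both are hypothetical illustrations: `hopping_schedule` rotates the gated subset by one cluster per interval, which is only one possible rotation and is not claimed to match any specific scheme evaluated in the slides, and `reactive_gating` assumes a thermal event is a per-cluster temperature above a limit.

```python
def hopping_schedule(num_clusters, num_gated, intervals):
    """Proactive: yield the set of Vdd-gated clusters for each interval."""
    for i in range(intervals):
        yield {(i + k) % num_clusters for k in range(num_gated)}

def reactive_gating(temps, limit, max_gated):
    """Reactive: gate the hottest clusters above `limit`, at most `max_gated`."""
    by_heat = sorted(range(len(temps)), key=lambda c: -temps[c])
    return {c for c in by_heat[:max_gated] if temps[c] > limit}
```

Neither helper models the eviction of register values or the loss of cache/DTLB contents that gating a cluster entails.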

Cluster hopping schemes • Effective at reducing average temperature (thus leakage) but not max temperature [Figure: schemes 1dis-rot, 2dis-alt, 2dis-dia, 3dis-rot]

Thermal-aware steering • Try to minimize max temperature • Take into account cluster temperature when deciding destination • Some examples • Cold • Dispatch to coldest cluster with available resources • Lowest average temperature • Lowest peak temperature • T-Cold • Like Cold but discard clusters that are too hot • If difference in temperature with previous cluster (ordered by temperature) is higher than a threshold
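
A sketch of the Cold and T-Cold policies above. The function names and arguments are hypothetical. `temps` holds one temperature per cluster, whether average or peak. The T-Cold cut is one reading of the slide: walking clusters from coldest to hottest, the first cluster that exceeds its colder neighbour by more than the threshold is discarded along with all hotter ones.

```python
def cold(temps, has_resources):
    """Cold: coldest cluster with available resources, or None."""
    ready = [c for c in range(len(temps)) if has_resources[c]]
    return min(ready, key=lambda c: temps[c]) if ready else None

def t_cold(temps, has_resources, threshold):
    """T-Cold: like Cold, but never dispatch to a cluster that is too hot."""
    order = sorted(range(len(temps)), key=lambda c: temps[c])
    eligible = [order[0]]
    for prev, cur in zip(order, order[1:]):
        if temps[cur] - temps[prev] > threshold:
            break  # cur and every hotter cluster are discarded
        eligible.append(cur)
    # `eligible` is ordered coldest first, so the first ready one is the coldest.
    ready = [c for c in eligible if has_resources[c]]
    return ready[0] if ready else None
```

Returning `None` models a dispatch stall: T-Cold prefers waiting over sending work to a cluster that is already much hotter than the rest.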

Thermal-aware steering • T-Thermal • Minimize communications unless candidate cluster is too hot • If temperature difference > threshold → priority to the colder • Otherwise → priority to the one that minimizes communications and, in case of a tie, maximizes workload balance (#instructions in the schedulers)
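
One way to read the T-Thermal rule is sketched below. It is an interpretation, not the authors' implementation: a cluster counts as too hot when it exceeds the coldest cluster by more than the threshold, and workload balance is approximated by preferring the cluster with the fewest instructions in its scheduler. All names are hypothetical.

```python
def t_thermal(temps, comms, queued, threshold):
    """Pick a cluster: fewest communications among clusters that are not too hot.

    temps:  per-cluster temperature
    comms:  communications the instruction would need in each cluster
    queued: instructions currently in each cluster's scheduler
    """
    coldest = min(temps)
    # The coldest cluster always qualifies, so the list is never empty.
    cool_enough = [c for c in range(len(temps)) if temps[c] - coldest <= threshold]
    # Fewest communications first; ties go to the least loaded scheduler.
    return min(cool_enough, key=lambda c: (comms[c], queued[c]))
```

With a very large threshold the policy ignores temperature and steers purely by communications and balance; with a threshold of zero it behaves like Cold.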

Thermal-aware steering • Thermal-aware steering standalone

Hopping + thermal steering • Putting it all together

Clustering the front-end [Parcerisa, TR 02] [Figure labels: PC, Fetch, Decode, Rename, Cluster Assignment, Dependence Checking, Br. Prediction, Distributed Back-end; signals: src/dst regs., assignments, steering, hit/miss]

Distributed branch predictor • Broadcast every prediction (next PC) to all clusters • Hardware loop: predictor uses PC as index • insert bubble when switching the predictor cluster (2) • if interleaving by low order bits: frequent bubbles • Solution • Pipeline prediction ahead of I-cache + interleave by hi-bits • Bubble only when high level interleave boundary crossed (2) [Figure labels: Predictor Table; Cluster 0 to Cluster 3; St, BrP, F, Dec, R, D, Back-end; paths (1) and (2)]
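
The effect of the interleaving choice on bubbles can be illustrated by counting how often consecutive PCs map to different predictor banks. `bank_switches` and the `shift` parameter (which address bit the bank index starts at) are hypothetical, and the usage below assumes 4-byte instructions fetched sequentially.

```python
def bank_switches(pcs, num_banks, shift):
    """Count changes of predictor bank along a sequence of PCs.

    The bank is taken from the address bits starting at `shift`:
    a small shift interleaves by low-order bits, a large one by high bits.
    """
    banks = [(pc >> shift) % num_banks for pc in pcs]
    return sum(a != b for a, b in zip(banks, banks[1:]))
```

For 64 sequential instructions, `bank_switches(list(range(0, 256, 4)), 4, 2)` gives 63 switches (one per instruction), while a shift of 6 gives only 3, which is why interleaving by high bits makes bubbles rare.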

Impact of distributing branch predictor • Bank switching • SpecInt95: every 24 instructions • Mbench: every 133 instructions • IPC loss • SpecInt95: 0.5% • Mbench: no loss

Distributed cluster assignment • Make local assignments and broadcast them to all clusters • Loop: steering logic uses assignments made by other clusters • Partial solution: use outdated info (2 cycles) • Problem: outdated dependences → generates communications • Solution: • anticipate dependence-checking and • override assignment, if dependence was violated [Figure labels: two pipelines (St, BrP, F, Dec, R, D, Back-end), the second with an added Dep stage; legend: Broadcast assignments, Broadcast register designators, override assignments]

Impact of distributing assignment • W/o assignment overriding • 0.42 communications / instruction • More than 10% IPC loss • With assignment overriding • 0.17 communications / instruction • Less than 2% IPC loss

Thermal benefits • Clustering the rename table and the reorder buffer [Chaparro, 04]

Summary • Clustering is thermal-effective (in addition to complexity-effective) • Reduces power • Distributes activity • Clustering enables effective temperature control schemes • Adaptive configuration • DVS/DFS • Cluster hopping • Thermal steering
