Advanced Microarchitecture

Multi-This, Multi-That, …

Limits on IPC
  • Lam92
    • This paper focused on the impact of control flow on ILP
    • Speculative execution can expose 10-400 IPC
      • assumes no machine limitations except control dependencies and actual dataflow dependencies
  • Wall91
    • This paper looked at limits more broadly
      • no branch prediction, no register renaming, no memory disambiguation: 1-2 IPC
      • ∞-entry bpred, 256 physical registers, perfect memory disambiguation: 4-45 IPC
      • perfect bpred, register renaming, and memory disambiguation: 7-60 IPC
    • This paper did not consider “control-independent” instructions

Practical Limits
  • Today, 1-2 IPC sustained
    • far from the 10’s-100’s reported by limit studies
  • Limited by:
    • branch prediction accuracy
    • underlying DFG
      • influenced by algorithms, compiler
    • memory bottleneck
    • design complexity
      • implementation, test, validation, manufacturing, etc.
    • power
    • die area

Differences Between Real Hardware and Limit Studies?
  • Real branch predictors aren’t 100% accurate
  • Memory disambiguation is not perfect
  • Physical resources are limited
    • can’t have infinite register renaming w/o infinite PRF
    • need infinite-entry ROB, RS and LSQ
    • need 10’s-100’s of execution units for 10’s-100’s of IPC
  • Bandwidth/latencies are limited; the limit studies assumed:
    • single-cycle execution
    • infinite fetch/commit bandwidth
    • infinite memory bandwidth (perfect caching)

Bridging the Gap

[Figure: Watts per IPC, log scale (1, 10, 100), for single-issue pipelined, superscalar out-of-order (today), hypothetical-aggressive superscalar out-of-order, and the limit-study designs. Power has been growing exponentially as well; diminishing returns w.r.t. larger instruction windows and higher issue width.]
Past the Knee of the Curve?

[Figure: performance vs. “effort” across scalar in-order, moderate-pipe superscalar/OOO, very-deep-pipe aggressive superscalar/OOO designs. Going superscalar/OOO made sense: good ROI; past the knee of the curve there is very little gain for substantial effort.]
So how do we get more Performance?
  • Keep pushing IPC and/or frequency?
    • possible, but too costly
      • design complexity (time to market), cooling (cost), power delivery (cost), etc.
  • Look for other parallelism
    • ILP/IPC: fine-grained parallelism
    • Multi-programming: coarse-grained parallelism
      • assumes multiple user-visible processing elements
      • all parallelism up to this point was user-invisible

User Visible/Invisible
  • All microarchitecture performance gains up to this point were “free”
    • free in that no user intervention required beyond buying the new processor/system
      • recompilation/rewriting could provide even more benefit, but you get some even if you do nothing
  • Multi-processing pushes the problem of finding the parallelism to above the ISA interface

Workload Benefits

[Figure: runtimes of Task A and Task B on a single 4-wide OOO CPU (tasks run back-to-back), on two 3-wide OOO CPUs, and on two 2-wide OOO CPUs (tasks run in parallel); the parallel configurations finish both tasks sooner.]

This assumes you have two tasks/programs to execute…
… If Only One Task

[Figure: with only Task A to run, a 4-wide OOO CPU gives some benefit over a 3-wide; two 3-wide OOO CPUs give no benefit over one CPU (the second core sits idle); two 2-wide OOO CPUs are a performance degradation.]
Sources of (Coarse) Parallelism
  • Different applications
    • MP3 player in background while you work on Office
    • Other background tasks: OS/kernel, virus check, etc.
    • Piped applications
      • gunzip -c foo.gz | grep bar | perl some-script.pl
  • Within the same application
    • Java (scheduling, GC, etc.)
    • Explicitly coded multi-threading (see the sketch below)
      • pthreads, MPI, etc.
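A minimal sketch of explicitly coded multi-threading with pthreads; the worker function and thread count here are made up for illustration:

```c
#include <pthread.h>
#include <stdio.h>

#define NUM_THREADS 4   /* hypothetical: one task per hardware context */

/* Each worker is an independent coarse-grained task that the OS can
 * schedule onto any available CPU/core/SMT context. */
static void *worker(void *arg) {
    long id = (long)arg;
    printf("task %ld running\n", id);
    return NULL;
}

int main(void) {
    pthread_t tid[NUM_THREADS];
    for (long i = 0; i < NUM_THREADS; i++)
        pthread_create(&tid[i], NULL, worker, (void *)i);
    for (int i = 0; i < NUM_THREADS; i++)
        pthread_join(tid[i], NULL);   /* wait for all tasks to finish */
    return 0;
}
```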

(Execution) Latency vs. Bandwidth
  • Desktop processing
    • typically want an application to execute as quickly as possible (minimize latency)
  • Server/Enterprise processing
    • often throughput oriented (maximize bandwidth)
    • latency of individual task less important
      • ex. Amazon processing thousands of requests per minute: it’s OK if an individual request takes a few seconds longer, so long as the total number of requests is processed in time

Benefit of MP Depends on Workload

[Figure: execution time on 1, 2, 3, and 4 CPUs; only the parallelizable portion of the work shrinks as CPUs are added.]
  • Limited number of parallel tasks to run on a PC
    • adding more CPUs than tasks provides zero performance benefit
  • Even for parallel code, Amdahl’s Law will likely result in sub-linear speedup (worked example below)
  • In practice, the parallelizable portion may not be evenly divisible
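A quick worked instance of Amdahl's Law (the values of F and N are chosen for illustration): even with three quarters of the work perfectly parallelizable, four CPUs fall well short of 4x:

$$\text{Speedup} = \frac{1}{(1-F) + F/N} = \frac{1}{0.25 + 0.75/4} \approx 2.3 \qquad (F = 0.75,\; N = 4)$$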

Cache Coherency Protocols
  • Not covered in this course
    • You should have seen a bunch of this in CS6290
  • Many different protocols
    • different number of states
    • different bandwidth/performance/complexity tradeoffs
    • current protocols usually referred to by their states
      • ex. MESI, MOESI, etc.

Shared Memory Focus
  • Most small-medium multi-processors (these days) use some sort of shared memory
    • shared memory doesn’t scale as well to larger numbers of nodes
      • communication is broadcast-based
      • bus becomes a severe bottleneck
        • or you have to deal with directory-based implementations
    • message passing doesn’t need centralized bus
      • can arrange multi-processor like a graph
        • nodes = CPUs, edges = independent links/routes
      • can have multiple communications/messages in transit at the same time

SMP Machines
  • SMP = Symmetric Multi-Processing
    • Symmetric = All CPUs are “equal”
    • Equal = any process can run on any CPU
      • contrast with older parallel systems with master CPU and multiple worker CPUs

[Figure: four-socket SMP board (CPU0-CPU3); pictures found via Google Images.]

Hardware Modifications for SMP
  • Processor
    • mainly support for cache coherence protocols
      • includes caches, write buffers, LSQ
      • control complexity increases, as memory latencies may be substantially more variable
  • Motherboard
    • multiple sockets (one per CPU)
    • datapaths between CPUs and memory controller
  • Other
    • Case: larger for bigger mobo, better airflow
    • Power: bigger power supply for N CPUs
    • Cooling: need to remove N CPUs’ worth of heat

Chip-Multiprocessing
  • Simple SMP on the same chip

[Figures: Intel “Smithfield” block diagram and AMD dual-core Athlon FX; pictures found via Google Images.]
Shared Caches
  • Resources can be shared between CPUs
    • ex. IBM Power 5

[Figure: CPU0 and CPU1 share the L2 cache (no need to keep two copies coherent); the L3 cache is also shared, with only the tags on-chip and the data off-chip.]
Benefits?
  • Cheaper than mobo-based SMP
    • all/most interface logic integrated onto the main chip (fewer total chips, single CPU socket, single interface to main memory)
    • less power than mobo-based SMP as well (on-die communication is more power-efficient than chip-to-chip communication)
  • Performance
    • on-chip communication is faster
  • Efficiency
    • potentially better use of hardware resources than trying to make a wider/more-OOO single-threaded CPU

Performance vs. Power
  • 2x CPUs does not necessarily mean 2x performance
  • 2x CPUs ⇒ ~½ power for each
    • maybe a little better than ½ if resources can be shared
  • Back-of-the-envelope calculation (sketch below):
    • 3.8 GHz CPU at 100W
    • Dual-core: 50W per CPU
    • P ∝ V³ (since P ∝ CV²f and f ∝ V): V³orig / V³CMP = 100W / 50W ⇒ VCMP ≈ 0.8 Vorig
    • f ∝ V ⇒ fCMP ≈ 0.8 × 3.8 GHz ≈ 3.0 GHz
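A tiny check of the arithmetic above; a sketch only, plugging the slide's 100W/50W and 3.8 GHz figures into P ∝ V³ and f ∝ V:

```c
#include <math.h>
#include <stdio.h>

int main(void) {
    /* back-of-the-envelope from the slide: P scales as V^3, f as V */
    double p_ratio = 100.0 / 50.0;            /* Porig / Pcmp          */
    double v_scale = 1.0 / cbrt(p_ratio);     /* Vcmp / Vorig ~= 0.79  */
    printf("fCMP = %.1f GHz\n", 3.8 * v_scale); /* prints ~3.0 GHz     */
    return 0;
}
```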

Simultaneous Multi-Threading
  • Uni-Processor: 4-6 wide, lucky if you get 1-2 IPC
    • poor utilization
  • SMP: 2-4 CPUs, but need independent tasks
    • else poor utilization as well
  • SMT: Idea is to use a single large uni-processor as a multi-processor

[Figure: pipeline utilization of a regular CPU vs. 2- and 4-thread SMT (approx. 1x hardware cost) vs. CMP (2x hardware cost).]
Overview of SMT Hardware Changes
  • For an N-way (N threads) SMT, we need:
    • Ability to fetch from N threads
    • N sets of registers (including PCs)
    • N rename tables (RATs)
    • N virtual memory spaces
  • But we don’t need to replicate the entire OOO execution engine (schedulers, execution units, bypass networks, ROBs, etc.); see the sketch below
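A minimal sketch (structure and field names assumed, not from any real design) of what an N-way SMT core replicates per thread versus shares:

```c
#include <stdint.h>

#define N_THREADS   2    /* N-way SMT */
#define N_ARCH_REGS 32

/* Replicated per thread: fetch PC, rename table, virtual memory context. */
struct thread_ctx {
    uint64_t pc;                    /* per-thread fetch (and commit) PC  */
    uint16_t rat[N_ARCH_REGS];      /* per-thread rename table (RAT)     */
    uint64_t page_table_base;       /* per-thread virtual address space  */
};

struct smt_core {
    struct thread_ctx thread[N_THREADS];
    /* Shared, not replicated: schedulers (RS), execution units,
     * bypass networks, ROB, physical register file, caches, ... */
};
```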

SMT Fetch
  • Duplicate fetch logic

[Figure: each thread’s PC (PC0, PC1, PC2) drives its own fetch from the shared I$, feeding decode, rename, and dispatch into the RS.]

  • Alternatives
    • Cycle-multiplexed fetch logic: a single fetch port selects PC(cycle % N) each cycle
    • Other-multiplexed fetch logic
    • Duplicate the I$ as well

SMT Rename
  • Thread #1’s R12 != Thread #2’s R12
    • separate name spaces
    • need to disambiguate

[Figure: two options. (a) One RAT per thread (RAT0 for Thread0, RAT1 for Thread1), each indexed by architectural register number, both mapping into a shared PRF. (b) A single RAT indexed by the concatenation of thread-ID and register number.]
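A minimal sketch (names and sizes assumed) of the single-RAT option, forming the lookup index by concatenating the thread-ID with the architectural register number:

```c
#include <stdint.h>

#define N_THREADS   2
#define N_ARCH_REGS 32

/* One shared RAT indexed by {thread-ID, arch reg#}, mapping to PRF tags. */
static uint16_t rat[N_THREADS * N_ARCH_REGS];

/* Thread 1's R12 and Thread 0's R12 land in different entries, so the
 * two name spaces are disambiguated automatically. */
static inline uint16_t rename_lookup(int tid, int arch_reg) {
    int idx = (tid << 5) | arch_reg;   /* concat(thread-ID, register #);
                                          5 = log2(N_ARCH_REGS) */
    return rat[idx];
}
```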

SMT Issue, Exec, Bypass, …
  • No change needed

Before renaming, both threads use the same architectural names:
  Thread 0: Add R1 = R2 + R3; Sub R4 = R1 – R5; Xor R3 = R1 ^ R4; Load R2 = 0[R3]
  Thread 1: Add R1 = R2 + R3; Sub R4 = R1 – R5; Xor R3 = R1 ^ R4; Load R2 = 0[R3]

After renaming, the physical tags are unique across threads:
  Thread 0: Add T12 = T20 + T8; Sub T19 = T12 – T16; Xor T14 = T12 ^ T19; Load T23 = 0[T14]
  Thread 1: Add T17 = T29 + T3; Sub T5 = T17 – T2; Xor T31 = T17 ^ T5; Load T25 = 0[T31]

Shared RS entries (instructions from both threads intermix freely):
  Sub T5 = T17 – T2; Add T12 = T20 + T8; Load T25 = 0[T31]; Xor T14 = T12 ^ T19;
  Load T23 = 0[T14]; Sub T19 = T12 – T16; Xor T31 = T17 ^ T5; Add T17 = T29 + T3

SMT Cache
  • Each process has its own virtual address space
    • TLB must be thread-aware (sketch below)
      • translate (thread-id, virtual page) → physical page
    • Virtual portion of caches must also be thread-aware
      • a VIVT cache must now be (virtual addr, thread-id)-indexed and (virtual addr, thread-id)-tagged
      • similar for VIPT caches
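A minimal sketch (structure and field names assumed) of a thread-aware TLB lookup, where the thread-id participates in the match alongside the virtual page number:

```c
#include <stdbool.h>
#include <stdint.h>

#define TLB_ENTRIES 64

struct tlb_entry {
    bool     valid;
    uint8_t  tid;     /* thread-id: part of the match, not just the VPN */
    uint64_t vpn;     /* virtual page number   */
    uint64_t ppn;     /* physical page number  */
};

static struct tlb_entry tlb[TLB_ENTRIES];   /* fully associative sketch */

/* Hit only if BOTH the virtual page and the thread-id match, so two
 * threads' identical virtual addresses cannot alias to one another. */
static bool tlb_lookup(uint8_t tid, uint64_t vpn, uint64_t *ppn) {
    for (int i = 0; i < TLB_ENTRIES; i++) {
        if (tlb[i].valid && tlb[i].tid == tid && tlb[i].vpn == vpn) {
            *ppn = tlb[i].ppn;
            return true;
        }
    }
    return false;   /* TLB miss: walk this thread's page table */
}
```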

SMT Commit
  • One “Commit PC” per thread
  • Register File Management
    • ARF/PRF organization
      • need one ARF per thread
    • Unified PRF
      • need one “architected RAT” per thread
  • Need to maintain interrupts, exceptions, and faults on a per-thread basis
    • just as OOO execution must appear to the outside world to be in-order, SMT must appear as if it were actually N CPUs

SMT Design Space
  • Number of threads
  • Full-SMT vs. Hard-partitioned SMT
    • full-SMT: ROB-entries can be allocated arbitrarily between the threads
    • hard-partitioned: if only one thread, use all ROB entries; if two threads, each is limited to one half of the ROB (even if the other thread uses only a few entries); possibly similar for RS, LSQ, PRF, etc.
  • Amount of duplication
    • Duplicate I$, D$, fetch engine, decoders, schedulers, etc.?
    • There’s a continuum of possibilities between SMT and CMP
      • ex. could have CMP where FP unit is shared SMT-styled

SMT Performance
  • When it works, it fills idle “issue slots” with work from other threads; throughput improves
  • But sometimes it can cause performance degradation!

[Figure: the degradation case is when Time(finish one task, then do the other) < Time(do both at the same time using SMT).]

How?
  • Cache thrashing

[Figure: Thread0 just fits in the level-1 I$ and D$, so alone it executes reasonably quickly due to high cache hit rates; Thread1 alone also fits nicely. The caches were just big enough to hold one thread’s data, but not two threads’ worth, so running both threads gives both significantly higher cache miss rates.]
Fairness
  • Consider two programs
    • By themselves:
      • Program A: runtime = 10 seconds
      • Program B: runtime = 10 seconds
    • On SMT:
      • Program A: runtime = 14 seconds
      • Program B: runtime = 18 seconds
  • Standard Deviation of Speedups (lower = better)
    • A’s speedup: 10/14 = 0.71
    • B’s speedup: 10/18 = 0.56
    • SDS = 0.11
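A quick sketch reproducing the slide's numbers, using the sample standard deviation over the per-program speedups:

```c
#include <math.h>
#include <stdio.h>

int main(void) {
    /* runtime alone / runtime under SMT, from the slide */
    double speedup[] = { 10.0 / 14.0, 10.0 / 18.0 };
    int n = 2;

    double mean = 0.0, var = 0.0;
    for (int i = 0; i < n; i++) mean += speedup[i] / n;
    for (int i = 0; i < n; i++)
        var += (speedup[i] - mean) * (speedup[i] - mean) / (n - 1);

    printf("SDS = %.2f\n", sqrt(var));   /* prints 0.11 */
    return 0;
}
```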

Fairness (2)
  • SDS encourages everyone to be punished similarly
    • does not account for actual performance, so if everyone is 1000x slower, it’s still “fair”
  • Alternative: Harmonic Mean of Weighted IPCs (HMWIPC)
    • IPC_i = achieved IPC for thread i
    • SingleIPC_i = IPC when thread i runs alone
    • HMWIPC = N / (SingleIPC_1/IPC_1 + SingleIPC_2/IPC_2 + … + SingleIPC_N/IPC_N) (sketch below)
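A sketch of the metric with assumed IPC measurements (the values below are made up for illustration):

```c
#include <stdio.h>

int main(void) {
    /* hypothetical measurements for N = 2 threads */
    double single_ipc[] = { 2.0, 1.5 };  /* IPC when each runs alone */
    double ipc[]        = { 1.4, 0.9 };  /* IPC achieved under SMT   */
    int n = 2;

    /* HMWIPC = N / sum_i(SingleIPC_i / IPC_i): rewards actual
     * throughput, not merely equal suffering. */
    double denom = 0.0;
    for (int i = 0; i < n; i++)
        denom += single_ipc[i] / ipc[i];

    printf("HMWIPC = %.3f\n", n / denom);   /* 0.646 for these numbers */
    return 0;
}
```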

This is all combinable
  • Can have a system that supports SMP, CMP and SMT at the same time
  • Take a dual-socket SMP motherboard…
  • Insert two chips, each with a dual-core CMP…
  • Where each core supports two-way SMT
  • This example provides 8 threads’ worth of execution, shared on 4 actual “cores”, split across two physical packages

OS Confusion
  • SMT/CMP is supposed to look like multiple CPUs to the software/OS

[Figure: the OS has two tasks, A and B, and sees four virtual CPUs (CPU0-CPU3) from two physical cores (SMP/CMP), each 2-way SMT. Scheduling A and B onto CPU0 and CPU1 puts both tasks on the two SMT contexts of the same physical core while the other core sits idle; performance is worse than if SMT were turned off and the machine used as a 2-way SMP.]
OS Confusion (2)
  • Asymmetries in the MP hierarchy can be very difficult for the OS to deal with
    • need to break the abstraction: the OS must know which CPUs are separate physical processors (SMP), which share the same package (CMP), and which are virtual (SMT)
    • Distinct applications should be scheduled to physically different CPUs
      • no cache contention, no power contention
    • Cooperative applications (different threads of the same program) should perhaps be scheduled to the same physical chip (CMP)
      • reduces latency of inter-thread communication; possibly reduces duplication if a shared L2 is used
    • Use SMT as the last choice

Multi-* is Happening
  • Intel Pentium 4 already had “Hyperthreading” (SMT)
    • went away for a while, but is back in Core i7
  • IBM Power 5 and later have SMT
  • Dual, Quad core already here
  • Octo-core soon
    • Intel Core i7: currently 4 cores, each with 2-thread SMT (8 threads total)
  • So is single-thread performance dead?
  • Is single-thread microarchitecture performance dead?

Following adapted from Mark Hill’s HPCA08 keynote talk

Recall Amdahl’s Law
  • Begins with a simple software assumption (limit argument)
    • Fraction F of execution time is perfectly parallelizable
    • No overhead for
        • Scheduling
        • Synchronization
        • Communication, etc.
    • Fraction 1 – F is completely serial
  • Time on 1 core = (1 – F) / 1 + F / 1 = 1
  • Time on N cores = (1 – F) / 1 + F / N


Recall Amdahl’s Law [1967]

Amdahl’s Speedup = 1 / ( (1 – F) + F / N )
  • For mainframes, Amdahl expected 1 – F = 35%
    • For a 4-processor system, speedup ≈ 2
    • For an infinite-processor system, speedup < 3
    • Therefore, stay with mainframes with one/few processors
  • Do multicore chips repeal Amdahl’s Law?
    • Answer: No, But.
Designing Multicore Chips Hard
  • Designers must confront single-core design options
    • Instruction fetch, wakeup, select
    • Execution unit configuration & operand bypass
    • Load/store queue(s) & data cache
    • Checkpoint, log, runahead, commit
  • As well as additional design degrees of freedom
    • How many cores? How big is each?
    • Shared caches: how many levels? How many banks?
    • Memory interface: How many banks?
    • On-chip interconnect: bus, switched, ordered?
Want Simple Multicore Hardware Model

To Complement Amdahl’s Simple Software Model

(1) Chip Hardware Roughly Partitioned into

  • Multiple Cores (with L1 caches)
  • The Rest (L2/L3 cache banks, interconnect, pads, etc.)
  • Changing Core Size/Number does NOT change The Rest

(2) Resources for Multiple Cores Bounded

  • Bound of N resources per chip for cores
  • Due to area, power, cost ($$$), or multiple factors
  • Bound = Power? (but our pictures use Area)
Want Simple Multicore Hardware Model, cont.

(3) Micro-architects can improve single-core performance using more of the bounded resource

  • A Simple Base Core
    • Consumes 1 Base Core Equivalent (BCE) resources
    • Provides performance normalized to 1
  • An Enhanced Core (in the same process generation)
    • Consumes R BCEs
    • Delivers performance Perf(R)
  • What does the function Perf(R) look like?
More on Enhanced Cores
  • (Perf(R) = performance from consuming R BCEs of resources)
  • If Perf(R) > R, always enhance the core
    • cost-effectively speeds up both sequential & parallel code
  • Therefore, the equations assume Perf(R) < R
  • Graphs assume Perf(R) = square root of R
    • 2x performance for 4 BCEs, 3x for 9 BCEs, etc.
    • Why? Models diminishing returns with “no coefficients”
  • How to speed up the enhanced core?
    • <Insert favorite or TBD micro-architectural ideas here>
How Many (Symmetric) Cores per Chip?
  • Each chip bounded to N BCEs (for all cores)
  • Each core consumes R BCEs
  • Assume Symmetric Multicore = all cores identical
  • Therefore, N/R cores per chip, since (N/R) × R = N
  • For an N = 16 BCE chip:
    • Sixteen 1-BCE cores, or
    • Four 4-BCE cores, or
    • One 16-BCE core
Performance of Symmetric Multicore Chips

Symmetric Speedup = 1 / ( (1 – F) / Perf(R) + F × R / (Perf(R) × N) )

(Enhanced cores speed up both the serial and parallel fractions.)
  • Serial fraction 1 – F uses 1 core at rate Perf(R)
    • Serial time = (1 – F) / Perf(R)
  • Parallel fraction F uses N/R cores at rate Perf(R) each
    • Parallel time = F / (Perf(R) × (N/R)) = F × R / (Perf(R) × N)
  • Therefore, speedup w.r.t. one base core is the formula above (checked in the sketch below)
  • Implications?
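A small sketch, assuming Perf(R) = √R as the graphs do, that reproduces the numbers on the slides that follow:

```c
#include <math.h>
#include <stdio.h>

/* Symmetric multicore speedup: N BCEs per chip, R BCEs per core. */
static double sym_speedup(double F, double N, double R) {
    double perf = sqrt(R);                 /* assumed Perf(R) = sqrt(R) */
    return 1.0 / ((1.0 - F) / perf + F * R / (perf * N));
}

int main(void) {
    printf("%.1f\n", sym_speedup(0.5, 16, 16));  /* 4.0: one 16-BCE core   */
    printf("%.1f\n", sym_speedup(0.9, 16, 2));   /* 6.7: eight 2-BCE cores */
    return 0;
}
```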
Symmetric Multicore Chip, N = 16 BCEs

[Figure: speedup vs. number of cores (1, 2, 4, 8, or 16) for F = 0.5. The optimum is a single 16-BCE core: R = 16, Cores = 1, Speedup = 4 = 1/(0.5/4 + 0.5×16/(4×16)).]

Need to increase parallelism to make multicore optimal!
Symmetric Multicore Chip, N = 16 BCEs

At F = 0.9, multicore is optimal (R = 2, Cores = 8, Speedup = 6.7, vs. Speedup = 4 at R = 16, Cores = 1 for F = 0.5), but the speedup is still limited.

Need to obtain even more parallelism!
Symmetric Multicore Chip, N = 16 BCEs

F1, R=1, Cores=16, Speedup16

F matters: Amdahl’s Law applies to multicore chips

Researchers should target parallelism F first

symmetric multicore chip n 16 bces3
Symmetric Multicore Chip, N = 16 BCEs

Recall F = 0.9, R = 2, Cores = 8, Speedup = 6.7

As Moore’s Law enables N to go from 16 to 256 BCEs: more core enhancements? More cores? Or both?
Symmetric Multicore Chip, N = 256 BCEs

As Moore’s Law increases N, we often need enhanced core designs.

Some researchers should target single-core performance.

  • F → 1: R = 1 (vs. 1), Cores = 256 (vs. 16), Speedup = 204 (vs. 16). MORE CORES!
  • F = 0.99: R = 3 (vs. 1), Cores = 85 (vs. 16), Speedup = 80 (vs. 13.9). CORE ENHANCEMENTS & MORE CORES!
  • F = 0.9: R = 28 (vs. 2), Cores = 9 (vs. 8), Speedup = 26.7 (vs. 6.7). CORE ENHANCEMENTS!