Future Computer Advances are Between a Rock (Slow Memory) and a Hard Place (Multithreading)

Mark D. Hill

Computer Sciences Dept.

and Electrical & Computer Engineering Dept.

University of Wisconsin—Madison

Multifacet Project (www.cs.wisc.edu/multifacet)

October 2004

Full Disclosure: Consult for Sun & US NSF

Executive Summary: Problem
  • Expect computer performance doubling every 2 years
  • Derives from Technology & Architecture
  • Technology will advance for ten or more years
  • But Architecture faces a Rock: Slow Memory
    • a.k.a. Wall [Wulf & McKee 1995]
  • Prediction: Popular Moore’s Law (doubling performance) will end soon, regardless of the real Moore’s Law (doubling transistors)
Executive Summary: Recommendation
  • Chip Multiprocessing (CMP) Can Help
    • Implement multiple processors per chip
    • >>10x cost-performance for multithreaded workloads
    • What about software with one apparent thread?
  • Go to Hard Place: Mainstream Multithreading
    • Make most workloads flourish with chip multiprocessing
    • Computer architects can help, but only in the long run
    • Requires moving multithreading from CS fringe to center (algorithms, programming languages, …, hardware)
  • Necessary For Restoring Popular Moore’s Law
Outline
  • Executive Summary
  • Background
    • Moore’s Law
    • Architecture
    • Instruction Level Parallelism
    • Caches
  • Going Forward, Processor Architecture Hits the Rock
  • Chip Multiprocessing to the Rescue?
  • Go to the Hard Place of Mainstream Multithreading
Society Expects A Popular Moore’s Law

Computing critical: commerce, education, engineering, entertainment, government, medicine, science, …

    • Servers (> PCs)
    • Clients (= PCs)
    • Embedded (< PCs)
  • Come to expect a misnamed “Moore’s Law”
    • Computer performance doubles every two years (same cost)
    • ⇒ Progress in next two years = All past progress
  • Important Corollary
    • Computer cost halves every two years (same performance)
    • ⇒ In ten years, same performance for 3% (sales tax – Jim Gray; arithmetic sketched below)
  • Derives from Technology & Architecture
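As a sanity check on the corollary, here is a minimal sketch of the arithmetic behind Gray's "sales tax" quip (assuming one doubling every two years):

```latex
% Ten years = 5 doublings at one doubling per two years, so
% performance grows 2^5 = 32x at the same cost; equivalently, the
% same performance costs 1/32 of the original price:
\[
\left(\tfrac{1}{2}\right)^{10/2} \;=\; \tfrac{1}{32} \;\approx\; 3.1\%
\]
```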
(Technologist’s) Moore’s Law Provides Transistors

Number of transistors per chip doubles every two years (18 months)

Merely a “Law” of Business Psychology

Performance from Technology & Architecture

[Figure: processor performance growth over time; reprinted from Hennessy and Patterson, "Computer Architecture: A Quantitative Approach," 3rd Edition, 2003, Morgan Kaufmann Publishers]

Architects Use Transistors To Compute Faster
  • Bit Level Parallelism (BLP) within Instructions
  • Instruction Level Parallelism (ILP) among Instructions
  • Scores of speculative instructions ⇒ look sequential! (see the sketch below)

[Figure: instruction timelines (time →, instructions ↓) showing many instructions in flight at once]
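To make the ILP point concrete, here is an illustrative micro-sketch (not from the talk; names and constants are arbitrary). The first loop is one long dependence chain; the second has four independent chains that an out-of-order core can overlap:

```c
/* Sketch: instruction-level parallelism (illustrative).
 * "serial" forms one dependence chain; a/b/c/d form four
 * independent chains that hardware can execute in parallel. */
#include <stdio.h>

#define N 100000000

int main(void) {
    double serial = 0.0;
    for (long i = 0; i < N; i++)
        serial += 1e-9;            /* each add waits on the previous one */

    double a = 0, b = 0, c = 0, d = 0;
    for (long i = 0; i < N; i += 4) {   /* four independent chains */
        a += 1e-9; b += 1e-9; c += 1e-9; d += 1e-9;
    }
    printf("%f %f\n", serial, a + b + c + d);
    return 0;
}
```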
Architects Use Transistors to Tolerate Slow Memory
  • Cache
    • Small, Fast Memory
    • Holds information expected to be used soon
    • Mostly Successful
  • Apply Recursively
    • Level-one cache(s)
    • Level-two cache
  • Most of microprocessor die area is cache! (see the sketch below)
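A minimal sketch of why caches mostly succeed (illustrative; array sizes are arbitrary). Unit-stride traversal reuses each cache line; the strided traversal touches a new line on nearly every access:

```c
/* Sketch: cache locality (illustrative). Row-major traversal is
 * unit-stride and mostly hits in L1; column-major strides by a
 * whole row and misses far more often on large arrays. */
#include <stdio.h>

#define ROWS 4096
#define COLS 4096
static int grid[ROWS][COLS];   /* 64MB: much larger than any cache */

int main(void) {
    long sum = 0;
    for (int r = 0; r < ROWS; r++)     /* cache-friendly: unit stride */
        for (int c = 0; c < COLS; c++)
            sum += grid[r][c];

    for (int c = 0; c < COLS; c++)     /* cache-hostile: stride = one row */
        for (int r = 0; r < ROWS; r++)
            sum += grid[r][c];

    printf("%ld\n", sum);
    return 0;
}
```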
Outline
  • Executive Summary
  • Background
  • Going Forward, Processor Architecture Hits the Rock
    • Technology Continues
    • Slow Memory
    • Implications
  • Chip Multiprocessing to the Rescue?
  • Go to the Hard Place of Mainstream Multithreading
Future Technology Implications
  • For (at least) ten years, Moore’s Law continues
    • More repeated doublings of number of transistors per chip
    • Faster transistors
  • But hard for processor architects to use
    • More transistors: hard to use due to global wire delays
    • Faster transistors: hard to use due to too much dynamic power
  • Moreover, hitting a Rock: Slow Memory
    • Memory access = 100s of floating-point multiplies!
    • a.k.a. Wall [Wulf & McKee 1995]
Rock: Memory Gets (Relatively) Slower

[Figure: processor vs. DRAM performance gap widening over time; reprinted from Hennessy and Patterson, "Computer Architecture: A Quantitative Approach," 3rd Edition, 2003, Morgan Kaufmann Publishers]

Impact of Slow Memory (Rock)
  • Off-Chip Misses are now hundreds of cycles

[Figure: two instruction timelines (time →, instructions ↓) for instructions I1–I4 with window = 4 (64): a "Good Case!" where compute phases overlap memory phases, and a "More Realistic Case" where long memory phases dominate]
Implications of Slow Memory (Rock)
  • Increasing memory latency hides the compute phases (memory time dominates)
  • Near Term Implications
    • Reduce memory latency
    • Fewer memory accesses
    • More Memory Level Parallelism (MLP; see the sketch below)
  • Longer Term Implications
    • What can single-threaded software do while waiting 100 instruction opportunities, 200, 400, … 1000?
    • What can amazing speculative hardware do?
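To illustrate the MLP bullet above, here is a sketch (illustrative, not from the talk). Pointer chasing serializes misses because each load's address depends on the previous load; the indexed loop lets hardware keep many misses outstanding:

```c
/* Sketch: memory-level parallelism (illustrative). */
#include <stdio.h>
#include <stdlib.h>

#define N (1 << 24)   /* 64MB of ints: far larger than any cache */

int main(void) {
    int *next = malloc(N * sizeof *next);
    if (!next) return 1;

    /* Sattolo's algorithm: one random cycle through all N slots */
    for (int i = 0; i < N; i++) next[i] = i;
    for (int i = N - 1; i > 0; i--) {
        int j = rand() % i;
        int t = next[i]; next[i] = next[j]; next[j] = t;
    }

    int p = 0;
    for (int i = 0; i < N; i++)
        p = next[p];          /* no MLP: one miss outstanding at a time */

    long sum = 0;
    for (int i = 0; i < N; i++)
        sum += next[i];       /* MLP: many independent misses in flight */

    printf("%d %ld\n", p, sum);
    free(next);
    return 0;
}
```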
Assessment So Far
  • Appears
    • Popular Moore’s Law (doubling performance) will end soon, regardless of the real Moore’s Law (doubling transistors)
  • Processor performance hitting Rock (Slow Memory)
  • No known way to overcome this, unless
  • Redefine performance in Popular Moore’s Law
    • From Processor Performance
    • To Chip Performance
Outline
  • Executive Summary
  • Background
  • Going Forward, Processor Architecture Hits the Rock
  • Chip Multiprocessing to the Rescue?
    • Small & Large CMPs
    • CMP Systems
    • CMP Workload
  • Go to the Hard Place of Mainstream Multithreading
Performance for Chip, not Processor or Thread
  • Chip Multiprocessing (CMP)
  • Replicate Processor
  • Private L1 Caches
    • Low latency
    • High bandwidth
  • Shared L2 Cache
    • Larger than if private
Piranha Processing Node

(The next several slides are from Luiz Barroso’s ISCA 2000 presentation of Piranha: A Scalable Architecture Based on Single-Chip Multiprocessing; the original deck builds up the diagram incrementally, consolidated here.)

[Figure: block diagram of the Piranha node: eight CPUs, each with L1 I$ and D$, around an intra-chip switch (ICS) connecting eight L2$ banks, memory controllers (MEM-CTL), protocol engines (HE & RE), and a router, all on a single chip]

  • Alpha core: 1-issue, in-order, 500MHz
  • L1 caches: I&D, 64KB, 2-way
  • Intra-chip switch (ICS): 32GB/sec, 1-cycle delay
  • L2 cache: shared, 1MB, 8-way
  • Memory Controller (MC): RDRAM, 12.8GB/sec (8 banks @ 1.6GB/sec)
  • Protocol Engines (HE & RE): programmable, 1K instr., even/odd interleaving
  • System Interconnect: 4-port Xbar router, topology independent, 32GB/sec total bandwidth (4 links @ 8GB/s)
  • Single Chip
Single-Chip Piranha Performance
  • Piranha’s performance margin is 3x for OLTP and 2.2x for DSS
  • Piranha has more outstanding misses ⇒ better utilizes the memory system
Simultaneous Multithreading (SMT)
  • Multiplex S logical processors on each processor
    • Replicate registers, share caches, & manage other parts
    • Implementation factors keep S small, e.g., 2-4
  • Cost-effective gain if threads available
    • E.g., S=2 ⇒ 1.4x performance
  • Modest cost
    • Limits waste if additional logical processor(s) not used
  • Worthwhile CMP enhancement
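One concrete consequence of SMT, sketched below (assumes a POSIX-like system; _SC_NPROCESSORS_ONLN is a common extension, not strictly standard): the OS exposes each SMT context as a logical processor, so on a C-core chip with S-way SMT this prints roughly C*S:

```c
/* Sketch: SMT shows up to software as extra logical processors. */
#include <stdio.h>
#include <unistd.h>

int main(void) {
    long logical = sysconf(_SC_NPROCESSORS_ONLN);
    printf("logical processors online: %ld\n", logical);
    return 0;
}
```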
Small CMP Systems

[Figure: a CMP chip (C) connected directly to a memory controller (M)]

  • Use One CMP (with C cores of S-way SMT)
    • C=[2,16] & S=[2,4] ⇒ C*S = [4,64] threads
    • Size of a small PC!
  • Directly Connect CMP (C) to Memory Controller (M) or DRAM
Medium CMP Systems

[Figure: two organizations, "Processor-Centric" and "Dance Hall," each connecting multiple CMPs (C) and memory controllers (M)]

  • Use 2-16 CMPs (with C cores of S-way SMT)
    • Smaller: 2*4*4 = 32 threads
    • Larger: 16*16*4 = 1024 threads
    • In a single cabinet
  • Connecting CMPs & memory controllers/DRAM raises many issues
Inflection Points
  • An inflection point occurs when a smooth input change leads to a disruptive output change
  • Enough transistors for …
    • 1970s simple microprocessor
    • 1980s pipelined RISC
    • 1990s speculative out-of-order
    • 2000s …
  • CMP will be Server Inflection Point
    • Expect >10x performance for less cost
    • Implying >>10x cost-performance
    • Early CMPs look like old SMPs, but expect dramatic advances!
So What’s Wrong with CMP Picture?
  • Chip Multiprocessors
    • Allow profitable use of more transistors
    • Support modest to vast multithreading
    • Will be inflection point for commercial servers
  • But
    • Many workloads have single thread (available to run)
    • Even if single thread solves a problem formerly done by many people in parallel (e.g., clerks in payroll processing)
  • Go to a Hard Place
    • Make most workloads flourish with CMPs
Outline
  • Executive Summary
  • Background
  • Going Forward, Processor Architecture Hits the Rock
  • Chip Multiprocessing to the Rescue?
  • Go to the Hard Place of Mainstream Multithreading
    • Parallel from Fringe to Center
    • For All of Computer Science!
Thread Parallelism from Fringe to Center
  • History
    • Automatic Computer (vs. Human) → Computer
    • Digital Computer (vs. Analog) → Computer
  • Must Change
    • Parallel Computer (vs. Sequential) → Computer
    • Parallel Algorithm (vs. Sequential) → Algorithm
    • Parallel Programming (vs. Sequential) → Programming
    • Parallel Library (vs. Sequential) → Library
    • Parallel X (vs. Sequential) → X
  • Otherwise, repeated performance doublings unlikely
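As a minimal sketch of "Parallel Algorithm → Algorithm" (illustrative; the slice-per-thread partitioning and names are ours, not the talk's), here is a sequential sum recast for POSIX threads; compile with -pthread:

```c
/* Sketch: a sequential sum recast so T threads each sum a slice,
 * then the partial results are combined. */
#include <pthread.h>
#include <stdio.h>

#define N 1000000
#define T 4   /* N is divisible by T */

static int data[N];
static long partial[T];

static void *sum_slice(void *arg) {
    long t = (long)arg;
    long lo = t * (N / T), hi = (t + 1) * (N / T);
    long s = 0;
    for (long i = lo; i < hi; i++) s += data[i];
    partial[t] = s;
    return NULL;
}

int main(void) {
    for (long i = 0; i < N; i++) data[i] = 1;

    pthread_t tid[T];
    for (long t = 0; t < T; t++)
        pthread_create(&tid[t], NULL, sum_slice, (void *)t);

    long total = 0;
    for (long t = 0; t < T; t++) {
        pthread_join(tid[t], NULL);
        total += partial[t];
    }
    printf("%ld\n", total);   /* prints 1000000 */
    return 0;
}
```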
Computer Architects Can Contribute
  • Chip Multiprocessor Design
    • Transcend pre-CMP multiprocessor design
    • Intra-CMP has lower latency & much higher bandwidth
  • Hide Multithreading (Helper Threads)
  • Assist Multithreading (Thread-Level Speculation)
  • Ease Multithreaded Programming (Transactions)
  • Provide a “Gentle Ramp to Parallelism” (Hennessy)
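To ground the transactions bullet, here is a sketch of the lock discipline that transactional memory aims to hide (illustrative; transfer/account are hypothetical names, and the talk does not prescribe this code). With locks, the programmer must acquire both accounts in a global order to avoid deadlock; with transactions, the body would be a single atomic region:

```c
/* Sketch: manual lock ordering, which transactions would replace. */
#include <pthread.h>
#include <stdint.h>

struct account { pthread_mutex_t lock; long balance; };

void transfer(struct account *from, struct account *to, long amt) {
    /* lock in address order to avoid deadlock: easy to get wrong */
    struct account *first  = (uintptr_t)from < (uintptr_t)to ? from : to;
    struct account *second = (uintptr_t)from < (uintptr_t)to ? to : from;
    pthread_mutex_lock(&first->lock);
    pthread_mutex_lock(&second->lock);
    from->balance -= amt;   /* with transactions: one atomic { ... } block */
    to->balance   += amt;
    pthread_mutex_unlock(&second->lock);
    pthread_mutex_unlock(&first->lock);
}

int main(void) {
    struct account a = { PTHREAD_MUTEX_INITIALIZER, 100 };
    struct account b = { PTHREAD_MUTEX_INITIALIZER, 0 };
    transfer(&a, &b, 50);
    return 0;
}
```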
But All of Computer Science is Needed
  • Hide Multithreading (Libraries & Compilers)
  • Assist Multithreading (Development Environments)
  • Ease Multithreaded Programming (Languages)
  • Divide & Conquer Multithreaded Complexity (Theory & Abstractions)
  • Must Enable
    • 99% of programmers think sequentially while
    • 99% of instructions execute in parallel
  • Enable a “Parallelism Superhighway”
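One existing way libraries and compilers let "99% of programmers think sequentially" is OpenMP, sketched below (assumes an OpenMP-capable compiler; compile with -fopenmp). The loop reads like sequential code, and the runtime splits it across threads:

```c
/* Sketch: "hide multithreading" in the compiler/library layer. */
#include <stdio.h>

#define N 1000000

int main(void) {
    static double y[N];
    #pragma omp parallel for      /* iterations are independent */
    for (long i = 0; i < N; i++)
        y[i] = 2.0 * i;
    printf("%f\n", y[N - 1]);
    return 0;
}
```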
Summary
  • (Single-Threaded) Computing faces a Rock: Slow Memory
  • Popular Moore’s Law (doubling performance) will end soon
  • Chip Multiprocessing Can Help
    • >>10x cost-performance for multithreaded workloads
    • What about software with one apparent thread?
  • Go to Hard Place: Mainstream Multithreading
    • Make most workloads flourish with chip multiprocessing
    • Computer architects can help, but only in the long run
    • Requires moving multithreading from CS fringe to center
  • Necessary For Restoring Popular Moore’s Law