Multicore Computing - Evolution
Presentation Transcript

Performance Scaling

[Chart: processor performance scaling across Intel generations, from the 8086 and 286 through the 386, 486, Pentium®, Pentium® Pro, and Pentium® 4 architectures. Source: Shekhar Borkar, Intel Corp.]


Intel

  • Homogeneous cores

  • Bus-based on-chip interconnect

  • Shared Memory

  • Traditional I/O

[Die photo: classic OOO cores (reservation stations, issue ports, schedulers, etc.) alongside large, shared, set-associative caches with prefetching. Source: Intel Corp.]


IBM Cell Processor

  • Heterogeneous multicore

  • Classic (stripped down) core plus co-processor accelerators

  • High bandwidth, multiple buses

  • High speed I/O

Source: IBM


AMD Au1200 System on Chip

  • Embedded processor

  • Custom cores

  • On-chip buses

  • On-chip I/O

Source: AMD


PlayStation 2 Die Photo (SoC)

[Die photo: floating-point MAC units. Source: IEEE Micro, March/April 2000]


Multi-* is Happening

Source: Intel Corp.


Intel’s Roadmap for Multicore

[Roadmap chart: mobile, desktop, and enterprise processor lines from 2006 through 2008, progressing from single-core (SC) parts with 512KB-2MB of cache through dual-core (DC) and quad-core (QC) parts with 2-16MB of shared cache, to eight-core (8C) parts with 12MB of shared cache at 45nm. Source: Adapted from Tom’s Hardware]

  • Drivers are

    • Market segments

    • More cache

    • More cores


Distillation Into Trends

  • Technology Trends

    • What can we expect/project?

  • Architecture Trends

    • What are the feasible outcomes?

  • Application Trends

    • What are the driving deployment scenarios?

    • Where are the volumes?


Technology Scaling

[Diagram: MOSFET cross-section showing the gate, drain, source, and body terminals, gate-oxide thickness tox, and channel length L.]

  • 30% scaling down in dimensions → doubles transistor density (see the sketch below)

  • Power per transistor

    • Vdd scaling → lower power

  • Transistor delay = Cgate Vdd / ISAT

    • Cgate, Vdd scaling → lower delay
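A back-of-the-envelope version of the bullets above, assuming classic (Dennard-style) scaling in which the linear dimensions, Cgate, Vdd, and ISAT all shrink by roughly the same factor of 0.7 per generation:

\[
A' \approx (0.7)^2 A \approx 0.5\,A \;\Rightarrow\; \text{density} \approx 2\times,
\qquad
\tau = \frac{C_{gate}\,V_{dd}}{I_{SAT}} \;\Rightarrow\;
\tau' \approx \frac{(0.7\,C_{gate})(0.7\,V_{dd})}{0.7\,I_{SAT}} = 0.7\,\tau
\]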


Fundamental Trends

Source: Shekhar Borkar, Intel Corp.


Moore’s Law

  • How do we use the increasing number of transistors?

  • What are the challenges that must be addressed?

Source: Intel Corp.


Impact of Moore’s Law To Date

  • Increase frequency: deeper pipelines

  • Increase ILP: concurrent threads, branch prediction, and SMT

  • Push the memory wall: larger caches

  • Manage power: clock gating, activity minimization

[Die photo: IBM Power5. Source: IBM]


Shaping Future Multicore Architectures

  • The ILP Wall

    • Limited ILP in applications

  • The Frequency Wall

    • Not much headroom

  • The Power Wall

    • Dynamic and static power dissipation

  • The Memory Wall

    • Gap between compute bandwidth and memory bandwidth

  • Manufacturing

    • Non-recurring engineering (NRE) costs

    • Time to market


The Frequency Wall

  • Not much headroom left in the stage-to-stage times (currently 8-12 FO4 delays)

  • Increasing frequency leads to the power wall

Vikas Agarwal, M. S. Hrishikesh, Stephen W. Keckler, and Doug Burger, “Clock Rate versus IPC: The End of the Road for Conventional Microarchitectures,” in Proc. ISCA, 2000.


Options

  • Increase performance via parallelism

    • On chip, this has been largely at the instruction/data level

  • The 1990s through 2005 were the era of instruction-level parallelism

    • Single instruction multiple data/Vector parallelism

      • MMX, SSE (streaming SIMD), vector co-processors

    • Out Of Order (OOO) execution cores

    • Explicitly Parallel Instruction Computing (EPIC)

  • Have we exhausted options in a thread?


The ILP Wall - Past the Knee of the Curve?

[Chart: performance versus design “effort” for scalar in-order, moderate-pipeline superscalar/OOO, and very-deep-pipeline aggressive superscalar/OOO cores. Moving to superscalar/OOO made sense (good ROI); past the knee of the curve there is very little gain for substantial effort. Source: G. Loh]


The ILP Wall

  • Limiting phenomena for ILP extraction:

    • Clock rate: at the wall each increase in clock rate has a corresponding CPI increase (branches, other hazards)

    • Instruction fetch and decode: at the wall more instructions cannot be fetched and decoded per clock cycle

    • Cache hit rate: poor locality can limit ILP and it adversely affects memory bandwidth

    • ILP in applications: the serial fraction of applications

  • Reality:

    • Limit studies cap IPC at 100-400 (assuming an ideal processor)

    • Current processors have IPC of only 1-2


The ILP Wall: Options

  • Increase granularity of parallelism

    • Simultaneous Multi-threading to exploit TLP

      • TLP has to exist, otherwise poor utilization results

    • Coarse grain multithreading

    • Throughput computing

  • New languages/applications

    • Data intensive computing in the enterprise

    • Media rich applications


The Memory Wall

[Chart: processor performance (“Moore’s Law”) improving at roughly 60% per year versus DRAM performance improving at roughly 7% per year, opening a processor-memory performance gap that grows about 50% per year.]


The Memory Wall

  • Increasing the number of cores increases the demanded memory bandwidth

  • What architectural techniques can meet this demand?

Average access time
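For reference, the standard relation behind this label (the slide itself does not spell it out): the average access time each core sees is the hit time plus the miss rate times the miss penalty, and adding cores increases the miss traffic that the penalty term must absorb.

\[
\text{AMAT} = t_{hit} + (\text{miss rate}) \times (\text{miss penalty})
\]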


The Memory Wall

  • On-die caches are both area intensive and power intensive

    • The StrongARM dissipates more than 43% of its power in caches

    • Caches incur huge area costs

  • Larger caches never deliver the near-universal performance boost offered by frequency ramping (Source: Intel)

[Die photos: AMD Dual-Core Athlon FX and IBM Power5, illustrating on-die cache area.]


The Power Wall

  • Power per transistor scales with frequency and with the square of Vdd (see the relation below)

    • Lower Vdd can be compensated for with increased pipelining to keep throughput constant

    • Power per transistor is not the same as power per unit area → power density is the problem!

    • Multiple units can be run at lower frequencies to keep throughput constant, while saving power
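The relation referred to above is the standard CMOS dynamic power expression; the limiting quantity is this power (plus leakage) summed over a fixed die area, i.e. power density.

\[
P_{dyn} = \alpha\, C\, V_{dd}^{2}\, f,
\qquad
\text{power density} = \frac{P_{dyn} + P_{leakage}}{\text{die area}}
\]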


Leakage Power Basics

  • Sub-threshold leakage

    • Increases with lower Vth, and with higher T and W

  • Gate-oxide leakage

    • Increases with thinner Tox, higher W

    • High-k dielectrics offer a potential solution

  • Reverse biased pn junction leakage

    • Very sensitive to T, V (in addition to diffusion area)


The Current Power Trend

[Chart: power density (W/cm²) versus year, 1970-2010, for Intel processors from the 4004, 8008, 8080, 8085, and 8086 through the 286, 386, 486, Pentium®, and P6. Extrapolated power density passes that of a hot plate and heads toward a nuclear reactor, a rocket nozzle, and eventually the sun’s surface. Source: Intel Corp.]


Improving Power/Performance

  • Consider a constant die size and decreasing core area each generation = more cores per chip

    • Effect of lowering voltage and frequency → power reduction

    • Increasing cores per chip → performance increase

      → Better power/performance! (worked example below)
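A worked example with illustrative (assumed) numbers, using the dynamic power relation from the power wall slide: replace one core running at (Vdd, f) with two copies of the same core running at 0.8 Vdd and 0.5 f on the same die.

\[
P_{2\,\text{cores}} \propto 2\,C\,(0.8\,V_{dd})^{2}\,(0.5\,f) = 0.64\,C\,V_{dd}^{2}\,f
\]

Aggregate throughput stays roughly constant (two cores at half frequency), while dynamic power drops by about a third, provided the workload actually spreads across both cores.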


Accelerators

[Die photo: TCP/IP offload engine, 2.23 mm × 3.54 mm, 260K transistors. Source: Shekhar Borkar, Intel Corp.]

  • Opportunities: network processing engines, MPEG encode/decode engines, speech engines


Low-Power Design Techniques

  • Circuit and gate level methods

    • Voltage scaling

    • Transistor sizing

    • Glitch suppression

    • Pass-transistor logic

    • Pseudo-nMOS logic

    • Multi-threshold gates

  • Functional and architectural methods

    • Clock gating

    • Clock frequency reduction

    • Supply voltage reduction

    • Power down/off

    • Algorithmic and software techniques

Two decades’ worth of research and development!


The Economics of Manufacturing

  • Where are the costs of developing the next generation processors?

    • Design Costs

    • Manufacturing Costs

  • What type of chip-level solutions do the economics imply?

  • Assessing the implications of Moore’s Law is an exercise in mass production


The Cost of An ASIC

  • Cost and risk are rising to unacceptable levels

  • Top cost drivers

    • Verification (40%)

    • Architecture Design (23%)

    • Embedded Software Design

      • 1400 man months (SW)

      • 1150 man months (HW)

    • HW/SW integration

[Example: a design with 80M transistors in 100nm technology; estimated cost $85M-$90M; 12-18 months across design, implementation, verification, prototyping, and production.]

Source: Handel H. Jones, “How to Slow the Design Cost Spiral,” Electronics Design Chain, September 2002, www.designchain.com


The Spectrum of Architectures

[Spectrum diagram: architectures ordered from customization fully in software (development by compilation) to customization fully in hardware (development by synthesis), with customization decreasing and design NRE effort and time to market increasing toward the hardware end.]

  • Microprocessor

  • Fixed + variable ISA: Tensilica, Stretch Inc.

  • Polymorphic computing architectures and tiled architectures: MONARCH, RAW, TRIPS, PACT, PicoChip

  • FPGA: Xilinx, Altera

  • Structured ASIC: LSI Logic, Leopard Logic

  • Custom ASIC


Interlocking Trade-offs

[Diagram: interlocking trade-offs among ILP, frequency, power, and memory, with edges labeled bandwidth, miss penalty, dynamic penalties, leakage power, speculation, and dynamic power.]


Multi-core Architecture Drivers

  • Addressing ILP limits

    • Multiple threads

    • Coarse grain parallelism → raise the level of abstraction

  • Addressing Frequency and Power limits

    • Multiple slower cores across technology generations

    • Scaling via increasing the number of cores rather than frequency

    • Heterogeneous cores for improved power/performance

  • Addressing memory system limits

    • Deep, distributed, cache hierarchies

    • OS replication → shared memory remains dominant

  • Addressing manufacturing issues

    • Design and verification costs

       Replication → the network becomes more important!



Beyond ILP

[Figure: a program’s execution time split into serial and parallelizable fractions, run on 1, 2, 3, and 4 CPUs; only the parallelizable fraction shrinks as CPUs are added.]

  • Performance is limited by the serial fraction (Amdahl’s law; see below)

  • Coarse grain parallelism in the post ILP era

    • Thread, process and data parallelism

  • Learn from the lessons of the parallel processing community

    • Revisit the classifications and architectural techniques
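The serial-fraction limit referenced above is Amdahl’s law: if a fraction p of the work parallelizes across n CPUs, the remaining serial fraction bounds the speedup, as the 1-4 CPU figure illustrates.

\[
\text{Speedup}(n) = \frac{1}{(1-p) + p/n},
\qquad
\lim_{n \to \infty} \text{Speedup}(n) = \frac{1}{1-p}
\]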


Flynn’s Model

  • Flynn’s Classification

    • Single instruction stream, single data stream (SISD)

      • The conventional, word-sequential architecture including pipelined computers

    • Single instruction stream, multiple data stream (SIMD)

      • The multiple ALU-type architectures (e.g., array processor)

    • Multiple instruction stream, single data stream (MISD)

      • Not very common

    • Multiple instruction stream, multiple data stream (MIMD)

      • The traditional multiprocessor system

M.J. Flynn, “Very high speed computing systems,” Proc. IEEE, vol. 54(12), pp. 1901–1909, 1966.


SIMD/Vector Computation


  • SIMD and Vector models are spatial and temporal analogs of each other (see the SIMD sketch below)

  • A rich architectural history dating back to 1953!

[Figures: IBM Cell SPE organization and pipeline diagrams. Sources: IBM, Cray.]
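As a concrete illustration of the spatial (SIMD) style, here is a minimal sketch in C using x86 SSE intrinsics; the function name vadd4 and the alignment assumptions are illustrative, and this is not the Cell SPE instruction set itself.

#include <xmmintrin.h>  /* SSE intrinsics */

/* Add two float arrays four lanes at a time: each _mm_add_ps performs
   four additions in parallel, the spatial analog of a vector add.
   Assumes n is a multiple of 4 and the pointers are 16-byte aligned. */
void vadd4(const float *a, const float *b, float *c, int n)
{
    for (int i = 0; i < n; i += 4) {
        __m128 va = _mm_load_ps(&a[i]);           /* load 4 floats */
        __m128 vb = _mm_load_ps(&b[i]);
        _mm_store_ps(&c[i], _mm_add_ps(va, vb));  /* 4 adds at once */
    }
}

A temporal (vector) machine would instead express the same loop as a single vector instruction streamed through a pipelined functional unit.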


SIMD/Vector Architectures

  • VIRAM - Vector IRAM

    • Logic is slow in a DRAM process

    • Put a vector unit in the DRAM and provide a port between a traditional processor and the vector IRAM, instead of placing a whole processor in DRAM

Source: Berkeley Vector IRAM


MIMD Machines


  • Parallel processing has catalyzed the development of several generations of parallel processing machines

  • Unique features include the interconnection network, support for system-wide synchronization, and programming languages/compilers

[Diagram: four nodes, each a processor plus cache (P + C) with a directory (Dir) and local memory, connected by an interconnection network.]


Basic Models for Parallel Programs

  • Shared Memory

    • Coherency/consistency are driving concerns

    • Programming model is simplified at the expense of system complexity

  • Message Passing

    • Typically implemented on distributed memory machines

    • System complexity is simplified at the expense of increased effort by the programmer


Shared Memory Model

  • That’s basically it…

    • Need to fork/join threads and synchronize (typically with locks); see the sketch below

[Diagram: CPU0 and CPU1 sharing a single main memory; CPU0 writes X and CPU1 reads X through ordinary loads and stores.]
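A minimal shared-memory sketch using POSIX threads (an assumed API; the slides do not name one): two threads communicate through ordinary loads and stores to the shared variable X, with a mutex providing the synchronization mentioned above.

#include <pthread.h>
#include <stdio.h>

static int X = 0;                 /* shared variable, visible to both threads */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *writer(void *arg)    /* "Write X" */
{
    pthread_mutex_lock(&lock);
    X = 42;
    pthread_mutex_unlock(&lock);
    return NULL;
}

static void *reader(void *arg)    /* "Read X" */
{
    pthread_mutex_lock(&lock);
    printf("read X = %d\n", X);   /* may print 0 or 42: the mutex gives   */
    pthread_mutex_unlock(&lock);  /* atomicity, not ordering between threads */
    return NULL;
}

int main(void)
{
    pthread_t t0, t1;
    pthread_create(&t0, NULL, writer, NULL);   /* fork */
    pthread_create(&t1, NULL, reader, NULL);
    pthread_join(t0, NULL);                    /* join */
    pthread_join(t1, NULL);
    return 0;
}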


Message Passing Protocols

  • Explicitly send data from one thread to another (see the sketch below)

    • need to track IDs of other CPUs

    • broadcast may need multiple sends

    • each CPU has its own memory space

  • Hardware: send/recv queues between CPUs

[Diagram: CPU0 and CPU1, each with private memory, connected by send/receive queues.]
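A minimal message-passing sketch using MPI (an assumed library; the slides do not prescribe one): rank 0 explicitly sends an integer to rank 1, and each rank owns its own private address space, matching the CPU0/CPU1 send/recv queues in the figure.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, x;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);      /* each CPU tracks its own ID */

    if (rank == 0) {
        x = 42;
        MPI_Send(&x, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);   /* "Send" to rank 1 */
    } else if (rank == 1) {
        MPI_Recv(&x, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);                       /* "Recv" from rank 0 */
        printf("rank 1 received x = %d\n", x);
    }

    MPI_Finalize();
    return 0;
}

Run with, for example, mpirun -np 2 ./a.out; a broadcast to many ranks would need multiple sends (or MPI_Bcast).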


Shared Memory vs. Message Passing

  • Shared memory doesn’t scale as well to larger numbers of nodes

    • communications are broadcast-based

    • the bus becomes a severe bottleneck

  • Message passing doesn’t need centralized bus

    • can arrange the multiprocessor like a graph

      • nodes = CPUs, edges = independent links/routes

    • can have multiple communications/messages in transit at the same time


Two Emerging Challenges

  • Programming models and compilers?

  • Interconnection networks

Source: Intel Corp.; IBM