
VLSI ARCHITECTURE DESIGN COURSE LECTURE #4-5

  • The Generic Processor

    • Microarchitecture trends

    • Performance/power/frequency implications

    • Insights

Today's lecture: comprehend the performance, power, and area implications of various microarchitectures



References of the day

  • “Computer Architecture: A Quantitative Approach” (2nd edition), John L. Hennessy, David A. Patterson, Chapters 3-4 (pp. 125-370)

  • “Computer Organization and Design”, John L. Hennessy, David A. Patterson, Chapters 5-6, 9 (pp. 268-451, 594-646)

  • “Tuning the Pentium Pro Micro-Architecture”, David Papworth, IEEE Micro, April 1996

  • “IA-64 Application Architecture Tutorial”, Allan D. Knies, Hot Chips 11, August 1999

  • “Billion-Transistor Architectures: There and Back Again”, Doug Burger, James Goodman, IEEE Computer, March 2004

  • “A VLIW Architecture for a Trace Scheduling Compiler”, R. Colwell, R. Nix, J. O’Donnell, D. Papworth, P. Rodman, ACM, 1987

  • “The VLIW Machine: A Multiprocessor for Compiling Scientific Code”, Joseph Fisher, IEEE Computer, July 1984

  • “The IBM System/360 Model 91: Machine Philosophy and Instruction Handling”, R. M. Tomasulo et al., IBM Journal of Research and Development 11:1, 1967

    Some of the lecture material was prepared by Ronny Ronen



Computing Platform: Messages

  • Balanced design

  • Power ∝ C × V² × f

  • System Performance

    • Transactions overhead

    • Memory as a scratch pad

    • Scheduling

    • System efficiency

  • CPU

    • ILP and IPC vs. Frequency

    • External vs. internal frequency

    • Speculation

      • Branch Prediction

      • $ (Caches)

      • Memory disambiguation

      • Instructions and Data Prefetch

      • Value prediction

      • ….

  • Multithread

    • Multithread on single core

    • Multi-cores system

      • $ in multi-core

      • Asymmetry

      • NUMA

      • Scheduling in MC

      • Multi-core vs. multi-thread machines

      • ….



The Generic Processor

[Block diagram: instruction supply → execution engine → data supply]

Sophisticated organization to “service” instructions:

  • Instruction supply

    • Instruction cache

    • Branch prediction

    • Instruction decoder

    • ...

  • Execution engine

    • Instruction scheduler

    • Register files

    • Execution units

    • ...

  • Data supply

    • Data cache

    • TLBs

  • Goal - Maximum throughput – balanced design




Power & Performance

  • Performance ∝ 1/Execution-Time = (IPC × Frequency) / #-of-instructions-in-task

    For a given instruction stream: Performance depends on the number of instructions executed per time-unit:

  • Performance ∝ IPC × Frequency

    Sometimes measured in MIPS - Million Instructions Per Second

  • Power ∝ C × V² × Frequency

    C = overall capacitance; for a given technology it is ~proportional to the # of transistors

  • Energy Efficiency = Performance/Power

    • Measured in MIPS/Watt

Message: Power ∝ C × V² × Frequency
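As a quick numeric sketch of these relations (all parameter values below are hypothetical, chosen only to make the units concrete), one can compute MIPS, Watts, and MIPS/Watt directly:

```python
# Sketch of the slide's relations: Performance ~ IPC x f, Power ~ C x V^2 x f,
# Energy efficiency = Performance / Power (MIPS/Watt). Values are hypothetical.

def performance_mips(ipc: float, freq_mhz: float) -> float:
    """Million instructions per second: IPC x clock (MHz)."""
    return ipc * freq_mhz

def power_watts(c_farads: float, v_volts: float, freq_hz: float) -> float:
    """Dynamic power: P = C x V^2 x f."""
    return c_farads * v_volts**2 * freq_hz

ipc, freq_mhz = 1.5, 2000.0        # hypothetical core: IPC 1.5 at 2 GHz
c, v = 40e-9, 1.2                  # hypothetical switched capacitance and Vdd

perf = performance_mips(ipc, freq_mhz)        # 3000 MIPS
power = power_watts(c, v, freq_mhz * 1e6)     # ~115 W
print(f"{perf:.0f} MIPS, {power:.1f} W, {perf / power:.1f} MIPS/Watt")
```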



[John DeVale & Bryan Black, 2006]

Microprocessor Performance Evolution

[Chart: IPC vs. frequency positioning of Itanium, Intel P-M (YNH, MRM), IBM Power 3, Power 4, AMD Athlon, AMD Opteron, and Intel P4]

Message: Frequency vs. IPC


Real life: Performance vs. frequency

[Chart: measured benchmark scores vs. clock frequency: 866 → 95%, 878 → 92%, 807 → 87%, 708 → 90%]

Message: Internal vs. external frequency

* Source: Intel® Pentium® 4 Processor and Intel® 850 Performance Brief, April 2002



Microarchitecture

  • Micro-Processor Core – Performance/power/area insights

    • Parallelism

    • Pipeline stalls/Bypasses

    • Superpipeline

    • Static/Dynamic scheduling

    • Branch prediction

    • Memory Hierarchy

  • VLIW / EPIC


Parallelism Evolution: Performance, power, area insights?

[Diagrams: an instruction stream a, b, c, ... n processed by (1) a basic configuration with a single processor element (PE = Processor Element), (2) a pipeline, (3) an in-order superscalar, (4) a VLIW machine, and (5) an out-of-order superscalar]


Static Scheduling: VLIW / EPIC. Performance, power, area insights?

[Pipeline diagrams: wide instructions with integer (I), float (F), memory (M), and branch (B) slots flowing through fetch, decode, execute, and writeback stages; a stall (st) in one slot stalls the whole wide instruction, and gray slots are nops]

  • Static scheduling of instructions by compiler

    • VLIW: Very Long Instruction Word (Multiflow, TI C6x family)

    • EPIC: Explicitly Parallel Instruction Computing (IA-64)

  • Shorter pipe, wider machine, global view => potentially huge ILP (wider & simpler than plain superscalar!)

  • Many nops, sensitive to varying latencies (memory accesses)

    • Low utilization

    • Huge code size

    • Highly depends on compiler

  • EPIC overcomes some of these limitations:

    • Advance loads (hide memory latency)

    • Predicated execution (avoid branches)

    • Decoder templates (reduce nops)

      But at increased complexity

Legend: I = integer, F = float, M = memory, B = branch, st = stall, gray = nop

[Diagram axes: pipeline stages vs. time]

Perf/power: performance increases, efficiency (perf/power) decreases

Examples: Intel Itanium® processors, DSPs



Dynamic Scheduling: Performance, power, area insights?

  • Scheduling instructions at run time, by the HW

  • Advantages:

    • Works on the dynamic instruction flow: can schedule across procedures, modules...

    • Can see dynamic values (memory addresses)

    • Can accommodate varying latencies and cases (e.g. cache miss)

  • Disadvantages

    • Can schedule within a limited window only (see the sketch below)

    • Should be fast - cannot be too smart

Perf/power: performance increases, efficiency (perf/power) decreases
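A toy illustration of the "limited window" disadvantage mentioned above. The sketch below is not any real scheduler: it issues, each cycle, every ready instruction among only the oldest `window` not-yet-issued instructions, with unit latency and unlimited execution units; the dependence graph is invented for illustration.

```python
# Toy model (not a real scheduler) of how a finite instruction window limits
# the ILP that dynamic scheduling can find. deps[i] lists the instructions
# that i must wait for; all latencies are 1 cycle; units are unlimited.

def cycles(deps, window):
    done, issued, t = set(), set(), 0
    n = len(deps)
    while len(done) < n:
        t += 1
        # the window only covers the oldest not-yet-issued instructions
        pending = [i for i in range(n) if i not in issued][:window]
        ready = [i for i in pending if all(d in done for d in deps[i])]
        issued.update(ready)
        done.update(ready)        # 1-cycle latency: done at end of this cycle
    return t

# a 4-deep dependence chain (0 -> 1 -> 2 -> 3) followed by 4 independent ops
deps = [[], [0], [1], [2], [], [], [], []]
print(cycles(deps, window=8))     # 4 cycles: chain overlaps independent work
print(cycles(deps, window=2))     # 6 cycles: the small window serializes
```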


Out Of Order Execution

[Timing diagrams: the five-instruction example below flowing through fetch (F), decode (D), execute (E), and writeback (W) stages over cycles 1-9, once in order and once out of order]

  • In Order Execution: instructions are processed in their program order.

    • Limits the potential parallelism.

  • OOO: Instructions are executed based on “data flow” rather than program order

    Before (src -> dest):

    (1) load (r10), r21
    (2) mov r21, r31    (2 depends on 1)
    (3) load a, r11
    (4) mov r11, r22    (4 depends on 3)
    (5) mov r22, r23    (5 depends on 4)

    After:

    (1) load (r10), r21; (3) load a, r11; <wait for loads to complete>; (2) mov r21, r31; (4) mov r11, r22; (5) mov r22, r23

  • Usually highly superscalar

[Timing diagrams above: in-order processing vs. out-of-order processing over time t, assuming unlimited resources and a 2-cycle load latency]

Examples: Intel Pentium® II/III/4, Compaq Alpha 21264
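The reordering can be reproduced with a small model. The sketch below is a simplification, not a real pipeline: unlimited units, 2-cycle loads, 1-cycle moves; an instruction issues when its sources are ready (OOO) or when, in addition, no older instruction is still waiting (in-order).

```python
# Sketch contrasting in-order and out-of-order issue on the slide's
# 5-instruction example. Not cycle-accurate; latencies as assumed above.

LAT = {"load": 2, "mov": 1}
# (opcode, destination, source) for the example on this slide
prog = [("load", "r21", None),   # (1) load (r10), r21
        ("mov",  "r31", "r21"),  # (2) depends on (1)
        ("load", "r11", None),   # (3) load a, r11
        ("mov",  "r22", "r11"),  # (4) depends on (3)
        ("mov",  "r23", "r22")]  # (5) depends on (4)

def finish_time(in_order: bool) -> int:
    ready = {}                    # register -> cycle its value is ready
    last_issue, end = 0, 0
    for op, dst, src in prog:
        issue = ready.get(src, 1) if src else 1
        if in_order:
            issue = max(issue, last_issue)   # cannot pass older instructions
        last_issue = issue
        ready[dst] = issue + LAT[op]
        end = max(end, ready[dst])
    return end - 1                # completion cycle of the last writer

print("in-order :", finish_time(True))    # 6: the loads serialize the moves
print("OOO      :", finish_time(False))   # 4: both loads overlap
```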



Out Of Order (cont.): Performance, power, area insights?

  • Advantages

    • Help exploit Instruction Level Parallelism (ILP)

    • Help cover latencies (e.g., cache miss, divide)

    • Artificially increases the register file size (i.e., the number of registers)?

    • Superior/complementary to compiler scheduler

      • Dynamic instruction window

      • Makes use of more registers than the architectural registers?

  • Complex microarchitecture

    • Complex scheduler

      • Large instruction window

      • Speculative execution

    • Requires reordering back-end mechanism (retirement) for:

      • Precise interrupt resolution

      • Misprediction/speculation recovery

      • Memory ordering

Perf/power: performance increases, efficiency (perf/power) decreases



Branch Prediction: Performance, power, area insights?

  • Goal - ensure instruction supply by correct prefetching

  • In the past - prefetcher assumed fall-through

    • Lose on unconditional branch (e.g., call)

    • Lose on frequently taken branches (e.g., loops)

  • Dynamic Branch prediction

    • Predicts whether a branch is taken/not taken

    • Predicts branch target address

  • Misprediction cost varies (higher w/ increased pipeline depth)

  • Typical branch prediction rates: ~90%-96% correct → 4%-10% misprediction → a misprediction every 10-25 branches → every 50-125 instructions (at roughly one branch per five instructions)

  • Misprediction cost increased with

    • Pipeline depth

    • Machine width

      • e.g. 3-wide x 10 stages = 30 instructions flushed!


Perf/power: performance increases, efficiency (perf/power) decreases
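The slides do not name a specific predictor, so as one common concrete scheme, here is a sketch of a 2-bit saturating-counter predictor running on a loop branch, plus the slide's width x depth flush cost:

```python
# Sketch of the classic 2-bit saturating-counter predictor (one counter here;
# a real predictor indexes a table of counters by branch address).
# States 0-1 predict not-taken, states 2-3 predict taken.

def predict(counter: int) -> bool:
    return counter >= 2

def update(counter: int, taken: bool) -> int:
    return min(counter + 1, 3) if taken else max(counter - 1, 0)

counter, mispredicts = 0, 0
history = [True] * 9 + [False]        # a loop branch: taken 9x, then exits
for taken in history:
    mispredicts += predict(counter) != taken
    counter = update(counter, taken)
print(f"{mispredicts}/{len(history)} mispredicted")    # 3/10 here

# Cost of each miss scales with width x depth, as on the slide:
width, depth = 3, 10
print("instructions flushed per miss ~", width * depth)   # 30
```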



Caches

In computer engineering, a cache (pronounced /kæʃ/kash in US and /keɪʃ/ kaysh in Aust/NZ) is a component that transparently stores data so that future requests for that data can be served faster (Wikipedia)


Memory hierarchy: Performance, power, area insights?

[Diagram: CPU at the top, small and fast; capacity grows and speed drops down the hierarchy]

Level                Capacity (size)   Latency
Registers            <500B             0.25ns
L1 cache             64KB              1-2ns
L2 cache             8MB               5ns
Main memory (DRAM)   4GB               100ns
Disk / Flash         100GB             1ms / 10us

Perf/power: What are the parameters to consider here?
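One way to see what matters is average memory access time (AMAT). A minimal sketch using the latencies from the table above (the hit rates are assumptions):

```python
# Average memory access time for the hierarchy above. Hit rates are
# hypothetical; latencies follow the slide's table.
l1_hit, l2_hit = 0.95, 0.90           # assumed hit rates
t_l1, t_l2, t_mem = 1.5, 5.0, 100.0   # ns, from the table

amat = t_l1 + (1 - l1_hit) * (t_l2 + (1 - l2_hit) * t_mem)
print(f"AMAT = {amat:.2f} ns")        # ~2.25 ns vs. 100 ns with no caches
```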



Environment and motivation

Moore’s Law: 2X transistors (cores?) per chip every technology generation; however, recent process generations provide almost the same clock rate

  • A processor running a single process can compute only as fast as memory

    • A 3GHz processor can execute an “add” operation in 0.33ns

    • Today’s “external Main memory” latency is 50-100ns

    • Naïve implementation: loads/stores can be 300x slower than other operations



Cache Motivation: CPU - DRAM Gap (latency)

[Chart: processor performance improves ~60%/yr (2X/1.5yr, “Moore’s Law”) while DRAM improves ~9%/yr (2X/10yrs); the processor-memory performance gap grows ~50%/year]

  • Memory latency can be handled by:

  • A multi-threaded engine (no cache) → every memory access = off-chip access → BW and power implications?

  • Caches → every cache miss = off-chip access → BW and power implications?
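The quoted ~50%/year gap is just the compounding ratio of the chart's two trend lines; a one-line check:

```python
# CPU ~60%/yr vs. DRAM ~9%/yr: the ratio grows by 1.60/1.09 - 1 per year,
# roughly the "50%/year" quoted on the slide.
cpu, dram = 1.60, 1.09
print(f"gap growth ~ {cpu / dram - 1:.0%}/year")        # ~47%/year
print(f"after 10 years: {(cpu / dram) ** 10:.0f}x wider")  # ~46x
```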


Memory Hierarchy

Number of CPU cycles to reach each memory domain → latency:

[Diagram: CPU + registers (1 C) → memory (T = 300 C) → SSD (10,000 C) → disk (1,000,000 C); C = CPU cycles]




Cache

A cache is a smaller, faster memory which stores copies of the data from the most frequently used main memory locations



Memory Hierarchy: Solution I – single-core environment

A fast memory structure between the CPU and memory solves the latency issue

[Diagram: CPU + registers (1 C) → cache (10 C) → memory (300 C) → SSD (10,000 C) / disk (1,000,000 C); C = CPU cycles]



Memory Hierarchy: Solution II – multi-thread environment

Many threads executing in turn hide the latency

[Diagram: each thread alternates execution with memory accesses (300 C); while one thread waits on memory, others execute, hiding the latency. Baseline: Performance1, BW1, P1. SSD: 10,000 C; disk: 1,000,000 C]



Memory Hierarchy: Solution II – multi-thread environment (cont.)

A memory structure ($) between the CPU and memory serves as a BW filter

[Diagram: cache (10 C) in front of memory (300 C), SSD (10,000 C), and disk (1,000,000 C). Same performance: Performance1, but off-chip bandwidth BW1 × MR and off-chip power P1 × MR]


MR = cache miss rate
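A back-of-the-envelope sketch of the BW-filter point (the bandwidth number is hypothetical): only misses go off-chip, so off-chip bandwidth, and the power it costs, scales with MR:

```python
# With a cache, only misses go off-chip, so off-chip bandwidth scales with
# the miss rate MR. Numbers are hypothetical.
core_bw_gbs = 50.0     # bandwidth the threads demand
mr = 0.05              # cache miss rate
print(f"off-chip BW: {core_bw_gbs * mr:.1f} GB/s "
      f"(vs. {core_bw_gbs:.1f} GB/s with no cache)")   # 2.5 vs. 50 GB/s
```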



Power, Performance, Area: Insights – 1

  • Energy to process one instruction: Wi

    • increases with the complexity of the processor. E.g., an OOO processor consumes more energy per instruction than an in-order processor → lower Perf/Power

  • Energy efficiency = Perf/Power

    • value deteriorates as speculation increases and complexity grows

  • Area efficiency = Performance/area

    • Leakage becomes a major issue

    • Effectiveness of area – how to get more performance for a given area (secondary to power)



Power, Performance, Area: Insights – 2

  • Performance

    • Perf ∝ IPC × f

  • Voltage Scaling

    • Increase operating voltage to increase frequency

    • f = k * V (within a given voltage range)

  • Power & Energy consumption

    • P ∝ C × V² × f → P ~ k × C × V³ (since f = k × V)

    • E = P * t

  • Tradeoff

    • Maximum performance

    • Minimum energy: 1% perf ↔ 1% power <w/o voltage scaling>

    • Maximum performance within constrained power: 1% perf ↔ 3% power <with voltage scaling> (see the sketch below)
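The 1% vs. 3% tradeoff follows directly from P ∝ C × V² × f with f = k × V, i.e. P ∝ V³. A small sketch (constants are arbitrary, since only ratios matter):

```python
# Perf ~ f and f = k*V, so P ~ C * V^2 * f ~ V^3: a 1% frequency gain taken
# via voltage costs ~3% power; at fixed voltage it would cost ~1%.
base_v, k, c = 1.0, 1.0, 1.0

def power(v: float) -> float:
    return c * v**2 * (k * v)         # P ~ C * V^2 * f, with f = k * V

for dperf in (0.01, 0.10):
    v = base_v * (1 + dperf)          # raise V (and hence f) together
    print(f"+{dperf:.0%} perf -> +{power(v) / power(base_v) - 1:.1%} power")
# +1% perf -> +3.0% power ; +10% perf -> +33.1% power
```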


Power, Performance, Area: Insights – 3

  • Many things do not scale:

    • Wire delays

    • Power

    • Memory latencies and bandwidth

    • Instruction-level parallelism (ILP)

    • ... we solve one: we fertilize the others!

  • Performance = frequency × IPC

  • Increasing IPC => more work per instruction

    • Prediction, renaming, scheduling, etc.

    • More useless work: speculation, replays...

  • More frequency => more pipe stages

    • Fewer gate delays per stage

    • More gate delays per instruction overall

    • Bigger loss due to flushes, cache misses, prefetch misses

  • We may “gain” performance => but with a lot of area and power!



Static Scheduling: VLIW / EPIC. A short architectural case study

  • Why “new”? ... CISC = old

  • Why reviving? ... OOO complexity

  • Advantages – simplicity (pipeline, dependency, dynamic)

  • Reasons:

    • EOL of X86?

    • Business?

  • Servers?

    • Questions to ask?

      • Technical

      • Business

  • Controllers

    • Questions to ask?

      • Technical

      • Business



Static Issuing – Example: VLIW (Very Long Instruction Word), Multiflow 7/200

  • A VLIW performs many program steps at once.

  • Many operations are grouped together into a Very Long Instruction Word and executed together

[Diagram: memory and a shared register file feed the functional units (LD/ST, FADD, FMUL, IALU, BRANCH); one field of the very long instruction word controls each unit every cycle]

Ref: “A VLIW Architecture for a Trace Scheduling Compiler”, Colwell, Nix, O’Donnell



Multiflow 7/200 (cont.): Compiler Basic Concept

The optimizing compiler arranges instructions according to instruction timing.

Example:

LD   #B, R1
LD   #C, R2
FADD R1, R2, R3
LD   #D, R4
LD   #E, R5
FADD R4, R5, R6
FMUL R6, R3, R1
STO  R1, #A
LD   #G, R7
LD   #H, R8
FMUL R7, R8, R9
LD   #X, R4
LD   #Y, R5
FMUL R4, R5, R6
FADD R6, R9, R1
STO  R1, #F

A = (B+C) * (D+E)
F = G*H + X*Y

Assumed latencies: Load 3, FADD 3, FMUL 3, Store 1


Multiflow 7/200 (cont.): Compiler Basic Concept

Assumed latencies: Load 3, FADD 3, FMUL 3, Store 1

Example (cont.): A = (B+C) * (D+E); F = G*H + X*Y

Scheduled VLIW stream (columns = functional-unit slots; cycle placement follows the assumed latencies):

Cycle  LD/ST        IALU   FADD             FMUL             BR
  1    LD #B, R1
  2    LD #C, R2
  3    LD #D, R4
  4    LD #E, R5
  5    LD #G, R7           FADD R1,R2,R3
  6    LD #H, R8
  7    LD #X, R4           FADD R4,R5,R6
  8    LD #Y, R5
  9                                         FMUL R7,R8,R9
 10                                         FMUL R3,R6,R1
 11                                         FMUL R4,R5,R6
 12    - - - - - - - - -
 13    STO R1, #A
 14                        FADD R9,R6,R1
 15    - - - - - - - - -
 16    - - - - - - - - -
 17    STO R1, #F

- - - - - : stalled cycle, takes time but no space (no code).

Overall latency: 17 cycles. Very low code efficiency: <25%!
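The table can be reproduced by a toy earliest-cycle scheduler (a sketch of what a trace-scheduling compiler computes, not Multiflow's actual algorithm): one LD/ST slot per cycle, the slide's latencies, and FADD/FMUL treated as always available, since no two of them contend in this example. Destination labels like "outA" are just symbolic names for dependences:

```python
# Toy earliest-cycle scheduler for the example above. Each op issues at the
# first cycle when its sources are ready (and the single LD/ST slot is free);
# a result is ready LAT cycles after issue.

LAT = {"LD": 3, "FADD": 3, "FMUL": 3, "STO": 1}
# (unit, symbolic destination, symbolic sources)
ops = [("LD", "B", []), ("LD", "C", []), ("LD", "D", []), ("LD", "E", []),
       ("LD", "G", []), ("FADD", "BC", ["B", "C"]),
       ("LD", "H", []), ("LD", "X", []),
       ("FADD", "DE", ["D", "E"]), ("LD", "Y", []),
       ("FMUL", "GH", ["G", "H"]), ("FMUL", "A", ["BC", "DE"]),
       ("FMUL", "XY", ["X", "Y"]), ("STO", "outA", ["A"]),
       ("FADD", "F", ["GH", "XY"]), ("STO", "outF", ["F"])]

ready, ldst_busy, finish = {}, set(), 0
for unit, dst, srcs in ops:
    t = max([ready[s] for s in srcs], default=1)
    if unit in ("LD", "STO"):         # single LD/ST unit: one op per cycle
        while t in ldst_busy:
            t += 1
        ldst_busy.add(t)
    ready[dst] = t + LAT[unit]
    finish = max(finish, t + LAT[unit] - 1)
print("overall latency:", finish, "cycles")   # 17, as on the slide
```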



Intel® Itanium™ Processor Block Diagram

[Block diagram: L1 instruction cache and fetch/pre-fetch engine (with ITLB, ECC) feeding branch prediction, IA-32 decode and control, and an 8-bundle instruction queue; 9 issue ports (B B B M M I I F F); register stack engine / re-mapping; 128 integer registers, 128 FP registers, branch & predicate registers; scoreboard, predicate, NaTs, exceptions; branch units, integer and MM units, dual-port L1 data cache with DTLB and ALAT, floating point units (2x SIMD FMAC); L2 cache, L3 cache, and bus controller, all ECC-protected]


IA64 Instruction Template

Bundle layout (128 bits): Instruction 2 (41 bits) | Instruction 1 (41 bits) | Instruction 0 (41 bits) | Template (5 bits)

Instruction types:

  • M: Memory

  • I: Shifts, MM

  • A: ALU

  • B: Branch

  • F: Floating point

  • L+X: Long

Template types:

  • Regular: MII, MLX, MMI, MFI, MMF

  • Stop: MI_I, M_MI

  • Branch: MIB, MMB, MFB, MBB, BBB

  • All come in two versions: with stop at end, without stop at end

  • Microarchitecture considerations:

    • Can run N bundles per clock (Merced = 2)

    • Limits on number of memory ports (Merced = 2, future > 2?)
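As a concrete illustration of the bundle layout, here is a sketch that packs and unpacks the 128-bit bundle, assuming the standard IA-64 bit assignment (template in bits 0-4, then three 41-bit slots); the field values are arbitrary test data:

```python
# Slice a 128-bit IA-64 bundle: 5-bit template in the low bits, then three
# 41-bit instruction slots (41 * 3 + 5 = 128).

def decode_bundle(bundle: int):
    template = bundle & 0x1F                    # bits 0-4
    mask41 = (1 << 41) - 1
    slots = [(bundle >> (5 + 41 * i)) & mask41 for i in range(3)]
    return template, slots                      # slot 0 = bits 5-45, etc.

# round-trip check with hypothetical field values
tmpl, s0, s1, s2 = 0x10, 0x1AAAAAAAAAA, 0xBBBBBBBBBB, 0xCCCCCCCCCC
bundle = tmpl | (s0 << 5) | (s1 << 46) | (s2 << 87)
assert decode_bundle(bundle) == (tmpl, [s0, s1, s2])
```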

