
Platform-based Design

Exploiting ILP

VLIW architectures (part a)

TU/e 5kk70

Henk Corporaal

Bart Mesman


What are we talking about?

VLIW = Very Long Instruction Word architecture

Instruction format: | operation 1 | operation 2 | operation 3 | operation 4 | operation 5 |

ILP = Instruction Level Parallelism = the ability to perform multiple operations (or instructions), from a single instruction stream, in parallel

Platform Design H. Corporaal and B. Mesman


VLIW: Topics Overview

  • Enhance performance: architecture methods

  • Instruction Level Parallelism

    • Limits on ILP

  • VLIW

    • Examples

  • Clustering

  • Code generation

  • Hands-on

Enhance performance: 3 architecture methods

  • (Super)-pipelining

  • Powerful instructions

    • MD-technique

      • multiple data operands per operation

    • MO-technique

      • multiple operations per instruction

  • Multiple instruction issue

Architecture methods: Pipelined Execution of Instructions

[Figure: simple 5-stage pipeline. Instructions 1-4 on the vertical axis, cycles 1-8 on the horizontal axis; each instruction passes through IF, DC, RF, EX and WB, starting one cycle after its predecessor]

IF: Instruction Fetch

DC: Instruction Decode

RF: Register Fetch

EX: Execute instruction

WB: Write Result Register

  • Purpose of pipelining:

    • Reduce #gate_levels in critical path

    • Reduce CPI close to one

    • More efficient Hardware

  • Problems

    • Hazards: pipeline stalls

      • Structural hazards: add more hardware

      • Control hazards, branch penalties: use branch prediction

      • Data hazards: bypassing required
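The overlap in the 5-stage pipeline can be sketched in a few lines of Python. This is a minimal illustration that assumes an ideal pipeline with no stalls; the function names are hypothetical and not part of the course material.

```python
STAGES = ["IF", "DC", "RF", "EX", "WB"]

def pipeline_schedule(num_instructions, stages=STAGES):
    """Map each instruction (1-based) to {cycle: stage}, assuming no stalls."""
    schedule = {}
    for i in range(num_instructions):
        # Instruction i (0-based) occupies stage s (0-based) in cycle i + s + 1.
        schedule[i + 1] = {i + s + 1: stage for s, stage in enumerate(stages)}
    return schedule

def total_cycles(num_instructions, num_stages=len(STAGES)):
    """Cycles needed to finish all instructions in a stall-free pipeline."""
    return num_instructions + num_stages - 1

# Print a cycle-by-cycle diagram for four instructions.
last_cycle = total_cycles(4)
print("instr " + " ".join(f"{c:>3}" for c in range(1, last_cycle + 1)))
for instr, row in pipeline_schedule(4).items():
    cells = " ".join(f"{row.get(c, ''):>3}" for c in range(1, last_cycle + 1))
    print(f"{instr:>5} {cells}")
```

Four instructions finish in 4 + 5 - 1 = 8 cycles, so the CPI approaches one as the instruction count grows; any hazard that stalls the pipeline adds cycles on top of this.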

Architecture methods: Pipelined Execution of Instructions

Superpipelining:

  • Split one or more of the critical pipeline stages

Architecture methods: Powerful Instructions (1)

MD-technique

  • Multiple data operands per operation

  • SIMD: Single Instruction Multiple Data

Vector instruction:

for (i = 0; i < 64; i++)
    c[i] = a[i] + 5*b[i];

In vector notation: c = a + 5*b

Assembly:

set   vl,64        ; vector length = 64
ldv   v1,0(r2)     ; v1 = b
mulvi v2,v1,5      ; v2 = 5*b
ldv   v1,0(r1)     ; v1 = a
addv  v3,v1,v2     ; v3 = a + 5*b
stv   v3,0(r3)     ; c = v3
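To see that the six vector instructions compute the same result as the scalar loop, they can be emulated one by one. The sketch below is an illustration only: Python lists stand in for the vector registers, and the function name is hypothetical.

```python
VL = 64  # vector length, as in "set vl,64"

def run_vector_code(a, b):
    """Emulate the vector assembly; a, b are lists of at least VL numbers."""
    v1 = b[:VL]                           # ldv   v1,0(r2)   (r2 points to b)
    v2 = [x * 5 for x in v1]              # mulvi v2,v1,5
    v1 = a[:VL]                           # ldv   v1,0(r1)   (r1 points to a)
    v3 = [x + y for x, y in zip(v1, v2)]  # addv  v3,v1,v2
    return v3                             # stv   v3,0(r3)   (r3 points to c)

def run_scalar_loop(a, b):
    """Reference: the original 64-iteration loop."""
    c = [0] * VL
    for i in range(VL):
        c[i] = a[i] + 5 * b[i]
    return c
```

Each vector instruction replaces 64 scalar operations, which is the "multiple data operands per operation" idea behind the MD-technique.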

Architecture methods: Powerful Instructions (1)

[Figure: SIMD execution method. The same instruction stream (Instruction 1, Instruction 2, Instruction 3, ..., Instruction n) is executed over time on node1, node2, ..., node-K]

SIMD computing

  • Nodes used for independent operations

  • Mesh or hypercube connectivity

  • Exploit data locality of e.g. image processing applications

  • Dense encoding (few instruction bits needed)

Architecture methods: Powerful Instructions (1)

  • Sub-word parallelism

    • SIMD on restricted scale:

    • Used for Multi-media instructions

  • Examples

    • MMX, SUN-VIS, HP MAX-2, AMD-K7/Athlon 3Dnow, Trimedia II

    • Example: i=1..4|ai-bi|

Architecture methods: Powerful Instructions (2)

MO-technique: multiple operations per instruction

  • CISC (Complex Instruction Set Computer)

  • VLIW (Very Long Instruction Word)

VLIW instruction example:

| field       | FU 1          | FU 2           | FU 3           | FU 4         | FU 5        |
| instruction | sub r8, r5, 3 | and r1, r5, 12 | mul r6, r5, r2 | ld r3, 0(r5) | bnez r5, 13 |

VLIW architecture: central Register File

[Figure: a central register file connected to execution units 1-9, which are driven by issue slots 1-3]

TM1000 DSPCPU

  • Register file (128 regs, 32 bit, 15 ports)

  • Instruction register (5 issue slots), each slot driving one execution unit

  • Functional units: 5 constant, 5 ALU, 2 memory, 2 shift, 2 DSP-ALU, 2 DSP-mul, 3 branch, 2 FP ALU, 2 Int/FP ALU, 1 FP compare, 1 FP div/sqrt

  • Data cache (16 kB)

  • Instruction cache (32 kB)

  • PC

TriMedia TM32A processor

[Figure: die floorplan. Labeled blocks: I/O interface, I-cache (32K) and D-cache (16K) with their tags, sequencer / decode, register file (128 regs x 32 bits), ALU0-ALU4, SHIFTER0, SHIFTER1, DSPALU0, DSPALU2, DSPMUL1, DSPMUL2, IFMUL1 and IFMUL2 (float), FALU0, FALU3, FCOMP2, FTOUGH1]

  • 0.18 micron

  • Area: 16.9 mm2

  • 200 MHz (typ)

  • 1.4 W

  • 7 mW/MHz (MIPS = 0.9 mW/MHz)

Architecture methods: Powerful Instructions (2) VLIW Characteristics

  • Only RISC like operation support

    • Short cycle times

  • Flexible: Can implement any FU mixture

  • Extensible

  • Tight inter FU connectivity required

  • Large instructions (up to 1000 bits)

  • Not binary compatible

  • But good compilers exist

Architecture methods: Multiple instruction issue (per cycle)

Who guarantees semantic correctness?

  • i.e. who decides which instructions can be executed in parallel?

  • User specifies multiple instruction streams

    • MIMD (Multiple Instruction Multiple Data)

  • Run-time detection of ready instructions

    • Superscalar

  • Compile into dataflow representation

    • Dataflow processors

    Multiple instruction issue: Three Approaches

    Example code

    a := b + 15;

    c := 3.14 * d;

    e := c / f;

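    Translation to DDG (Data Dependence Graph)

    [Figure: DDG of the example code. ld &b and the constant 15 feed +, which feeds st &a; ld &d and the constant 3.14 feed *, which feeds st &c and /; ld &f also feeds /, which feeds st &e]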

    Generated Code

    Instr.  Sequential Code     Dataflow Code
    I1      ld   r1,M(&b)       ld   M(&b) -> I2
    I2      addi r1,r1,15       addi 15    -> I3
    I3      st   r1,M(&a)       st   M(&a)
    I4      ld   r1,M(&d)       ld   M(&d) -> I5
    I5      muli r1,r1,3.14     muli 3.14  -> I6, I8
    I6      st   r1,M(&c)       st   M(&c)
    I7      ld   r2,M(&f)       ld   M(&f) -> I8
    I8      div  r1,r1,r2       div        -> I9
    I9      st   r1,M(&e)       st   M(&e)

    Notes:

    • An MIMD may execute two streams: (1) I1-I3 (2) I4-I9

      • No dependencies between streams; in practice communication and synchronization required between streams

    • A superscalar issues multiple instructions from sequential stream

      • Obey dependencies (True and name dependencies)

      • Reverse engineering of DDG needed at run-time

    • Dataflow code is direct representation of DDG
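The "reverse engineering" a superscalar has to do can be illustrated by recovering the true (read-after-write) dependences from the sequential code I1-I9. The sketch below is a simplified model: it tracks register dependences only and assumes the six memory addresses are all distinct, so no memory dependences arise. The data layout and function name are hypothetical.

```python
# (name, destination register or None, source registers) for I1..I9
SEQUENTIAL = [
    ("I1", "r1", []),            # ld   r1,M(&b)
    ("I2", "r1", ["r1"]),        # addi r1,r1,15
    ("I3", None, ["r1"]),        # st   r1,M(&a)
    ("I4", "r1", []),            # ld   r1,M(&d)
    ("I5", "r1", ["r1"]),        # muli r1,r1,3.14
    ("I6", None, ["r1"]),        # st   r1,M(&c)
    ("I7", "r2", []),            # ld   r2,M(&f)
    ("I8", "r1", ["r1", "r2"]),  # div  r1,r1,r2
    ("I9", None, ["r1"]),        # st   r1,M(&e)
]

def true_dependences(code):
    """Return RAW edges (producer, consumer) in program order."""
    last_writer = {}  # register -> name of the most recent instruction writing it
    edges = []
    for name, dest, srcs in code:
        for reg in srcs:
            if reg in last_writer:
                edges.append((last_writer[reg], name))
        if dest is not None:
            last_writer[dest] = name
    return edges

for producer, consumer in true_dependences(SEQUENTIAL):
    print(f"{producer} -> {consumer}")
```

The recovered edges are exactly the arrows written out explicitly in the dataflow code. The reuse of r1 by I4 is a name dependence only, which is why I1-I3 and I4-I9 can run as independent streams.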

    Multiple Instruction Issue: Data flow processor

    [Figure: data flow processor. Blocks: Token Matching, Token Store, Instruction Generate, Instruction Store, Reservation Stations, FU-1, FU-2, ..., FU-K, Result Tokens]

    Instruction Pipeline Overview

    [Figure: pipeline stage diagrams (IF, DC, RF, ISSUE, EX, ROB, WB) compared for CISC, RISC, Superscalar, Superpipelined, DATAFLOW and VLIW]

    Four dimensional representation of the architecture design space <I, O, D, S>

    [Figure: architectures placed in the four-dimensional space: SIMD, Vector, CISC, Superscalar, MIMD, Dataflow, RISC, Superpipelined, VLIW]

    Axes:

    • Instructions/cycle ‘I’

    • Operations/instruction ‘O’

    • Data/operation ‘D’

    • Superpipelining degree ‘S’

    Architecture design space

    Typical values of K (# of functional units or processor nodes), and <I, O, D, S> for different architectures:

    Architecture     K     I     O     D     S     Mpar
    CISC             1     0.2   1.2   1.1   1     0.26
    RISC             1     1     1     1     1.2   1.2
    VLIW             10    1     10    1     1.2   12
    Superscalar      3     3     1     1     1.2   3.6
    Superpipelined   1     1     1     1     3     3
    Vector           7     0.1   1     64    5     32
    SIMD             128   1     1     128   1.2   154
    MIMD             32    32    1     1     1.2   38
    Dataflow         10    10    1     1     1.2   12

    $S(\text{architecture}) = \sum_{Op \in I\_set} f(Op) \cdot lt(Op)$

    $M_{par} = I \cdot O \cdot D \cdot S$
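As a quick check, the Mpar column can be recomputed from the I, O, D and S values given on the slide. The dictionary layout below is only an illustration; the numbers are the ones from the table.

```python
# architecture -> (I, O, D, S), values as given on the slide
ARCHITECTURES = {
    "CISC":           (0.2, 1.2, 1.1, 1),
    "RISC":           (1,   1,   1,   1.2),
    "VLIW":           (1,   10,  1,   1.2),
    "Superscalar":    (3,   1,   1,   1.2),
    "Superpipelined": (1,   1,   1,   3),
    "Vector":         (0.1, 1,   64,  5),
    "SIMD":           (1,   1,   128, 1.2),
    "MIMD":           (32,  1,   1,   1.2),
    "Dataflow":       (10,  1,   1,   1.2),
}

def mpar(i, o, d, s):
    """Mpar = I * O * D * S."""
    return i * o * d * s

for name, params in ARCHITECTURES.items():
    print(f"{name:<15} Mpar = {mpar(*params):.2f}")
```

The computed products agree with the table after rounding (for example 153.6 for SIMD, listed as 154).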

    Overview

    • Enhance performance: architecture methods

    • Instruction Level Parallelism

      • limits on ILP

    • VLIW

      • Examples

    • Clustering

    • Code generation

    • Hands-on

    General organization of an ILP architecture

    [Figure: a CPU containing an instruction fetch unit, an instruction decode unit, FU-1 to FU-5, a register file and a bypassing network, connected to instruction memory and data memory]

    Motivation for ILP

    • Increasing VLSI densities; decreasing feature size

    • Increasing performance requirements

    • New application areas, like

      • multi-media (image, audio, video, 3-D)

      • intelligent search and filtering engines

      • neural, fuzzy, genetic computing

    • More functionality

    • Use of existing Code (Compatibility)

    • Low Power: P = fCVdd2

    Low power through parallelism

    • Sequential Processor

      • Switching capacitance C

      • Frequency f

      • Voltage V

      • P = fCV2

    • Parallel Processor (two times the number of units)

      • Switching capacitance 2C

      • Frequency f/2

      • Voltage V’ < V

      • P = f/2 2C V’2 =fCV’2

    Measuring and exploiting available ILP

    • How much ILP is there in applications?

    • How to measure parallelism within applications?

      • Using existing compiler

      • Using trace analysis

        • Track all the real data dependencies (RaWs) of instructions from issue window

          • register dependence

          • memory dependence

        • Check for correct branch prediction

          • if prediction correct continue

          • if wrong, flush schedule and start in next cycle

    Trace analysis

    Program

    For i := 0..2
      A[i] := i;
    S := X+3;

    Compiled code

          set  r1,0
          set  r2,3
          set  r3,&A
    Loop: st   r1,0(r3)
          add  r1,r1,1
          add  r3,r3,4
          brne r1,r2,Loop
          add  r1,r5,3

    Trace

    set  r1,0
    set  r2,3
    set  r3,&A
    st   r1,0(r3)
    add  r1,r1,1
    add  r3,r3,4
    brne r1,r2,Loop
    st   r1,0(r3)
    add  r1,r1,1
    add  r3,r3,4
    brne r1,r2,Loop
    st   r1,0(r3)
    add  r1,r1,1
    add  r3,r3,4
    brne r1,r2,Loop
    add  r1,r5,3

    How parallel can this code be executed?

    Trace analysis

    Parallel Trace

    Cycle 1: set r1,0 | set r2,3 | set r3,&A
    Cycle 2: st r1,0(r3) | add r1,r1,1 | add r3,r3,4
    Cycle 3: st r1,0(r3) | add r1,r1,1 | add r3,r3,4 | brne r1,r2,Loop
    Cycle 4: st r1,0(r3) | add r1,r1,1 | add r3,r3,4 | brne r1,r2,Loop
    Cycle 5: brne r1,r2,Loop
    Cycle 6: add r1,r5,3

    Max ILP = Speedup = L_serial / L_parallel = 16 / 6 = 2.7
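The parallel trace can be reproduced with a small scheduler. This is a sketch under stated assumptions, not the tool used in the course: unit latency, only true register dependences (registers are implicitly renamed), no memory dependences between the stores, and a hypothetical "always predict taken" branch predictor, so that the final loop-exit branch is mispredicted and the schedule restarts in the next cycle.

```python
def build_trace():
    """Executed trace as (op, dest, sources, taken); taken is None for non-branches."""
    trace = [("set", "r1", [], None), ("set", "r2", [], None), ("set", "r3", [], None)]
    for i in range(3):
        trace.append(("st", None, ["r1", "r3"], None))
        trace.append(("add", "r1", ["r1"], None))
        trace.append(("add", "r3", ["r3"], None))
        trace.append(("brne", None, ["r1", "r2"], i < 2))  # falls through on loop exit
    trace.append(("add", "r1", ["r5"], None))
    return trace

def schedule(trace, predict_taken=True):
    """Return the issue cycle of every trace entry (as-soon-as-possible schedule)."""
    ready = {}    # register -> cycle in which its latest value is produced
    barrier = 1   # earliest cycle allowed after the last mispredicted branch
    cycles = []
    for op, dest, srcs, taken in trace:
        cycle = max([barrier] + [ready[r] + 1 for r in srcs if r in ready])
        cycles.append(cycle)
        if dest is not None:
            ready[dest] = cycle
        if taken is not None and taken != predict_taken:
            barrier = cycle + 1  # flush: later operations start in the next cycle
    return cycles

cycles = schedule(build_trace())
print("parallel length:", max(cycles))
print("speedup:", round(len(cycles) / max(cycles), 1))
```

Under these assumptions the 16 operations fit in 6 cycles, matching the parallel trace and the speedup of 16 / 6.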

    Ideal Processor

    Assumptions for ideal/perfect processor:

    1. Register renaming: infinite number of virtual registers => all register WAW & WAR hazards avoided

    2. Branch and jump prediction: perfect => all program instructions available for execution

    3. Memory-address alias analysis: addresses are known. A store can be moved before a load provided the addresses are not equal

    Also:

    • unlimited number of instructions issued/cycle (unlimited resources), and

    • unlimited instruction window

    • perfect caches

    • 1 cycle latency for all instructions (including FP *, /)

    Programs were compiled using the MIPS compiler with maximum optimization level

    Upper Limit to ILP: Ideal Processor

    [Figure: IPC per benchmark on the ideal processor. Integer: 18 - 60; FP: 75 - 150]

    Window Size and Branch Impact

    • Change from infinite window to examine 2000 and issue at most 64 instructions per cycle

    [Figure: IPC per benchmark for five branch predictors: Perfect, Tournament, BHT (512), Profile, No prediction. FP: 15 - 45; Integer: 6 - 12]

    Impact of Limited Renaming Registers

    • Changes: 2000 instr. window, 64 instr. issue, 8K 2-level predictor (slightly better than tournament predictor)

    [Figure: IPC per benchmark versus number of renaming registers: Infinite, 256, 128, 64, 32. FP: 11 - 45; Integer: 5 - 15]

    Memory Address Alias Impact

    • Changes: 2000 instr. window, 64 instr. issue, 8K 2-level predictor, 256 renaming registers

    [Figure: IPC per benchmark for four alias analysis models: Perfect, Global/stack perfect, Inspection, None. FP: 4 - 45 (Fortran, no heap); Integer: 4 - 9]

    Window Size Impact

    • Assumptions: Perfect disambiguation, 1K Selective predictor, 16 entry return stack, 64 renaming registers, issue as many as window

    [Figure: IPC per benchmark versus window size: Infinite, 256, 128, 64, 32, 16, 8, 4. FP: 8 - 45; Integer: 6 - 12]

    How to Exceed ILP Limits of This Study?

    • WAR and WAW hazards through memory: the study eliminated WAW and WAR hazards through register renaming, but not those through memory

    • Unnecessary dependences

      • the compiler did not unroll loops, so the dependence on the iteration variable remains

    • Overcoming the data flow limit: value prediction, i.e. predicting values and speculating on the prediction

      • Address value prediction and speculation: predict addresses and speculate by reordering loads and stores. Could provide better aliasing analysis

    Conclusions

    • Amount of parallelism is limited

      • higher in Multi-Media

      • higher in kernels

    • Trace analysis detects all types of parallelism

      • task, data and operation types

    • Detected parallelism depends on

      • quality of compiler

      • hardware

      • source-code transformations

    Overview

    • Enhance performance: architecture methods

    • Instruction Level Parallelism

    • VLIW

      • Examples

        • C6

        • TM

        • IA-64: Itanium, ....

        • TTA

    • Clustering

    • Code generation

    • Hands-on

    VLIW concept

    [Figure: a VLIW architecture with 7 FUs. Instruction memory feeds the instruction register, which drives the function units (3 Int FU, 2 LD/ST, 2 FP FU); these connect to an Int register file, a floating point register file and data memory]

    VLIW characteristics

    • Multiple operations per instruction

    • One instruction per cycle issued (at most)

    • Compiler is in control

    • Only RISC like operation support

      • Short cycle times

      • Easier to compile for

    • Flexible: Can implement any FU mixture

    • Extensible / Scalable

      However:

    • tight inter FU connectivity required

    • not binary compatible !!

      • (new long instruction format)

    VelociTI C6x datapath

    VLIW example: TMS320C62

    TMS320C62 VelociTI Processor

    • 8 operations (of 32-bit) per instruction (256 bit)

    • Two clusters

      • 8 FUs: 4 FUs per cluster (2 multipliers, 6 ALUs in total)

      • 2 x 16 registers

      • One bus available to write in register file of other cluster

    • Flexible addressing modes (like circular addressing)

    • Flexible instruction packing

    • All instructions conditional

    • 5 ns, 200 MHz, 0.25 um, 5-layer CMOS

    • 128 KB on-chip RAM

    VLIW example: Trimedia TM1000 DSPCPU

    • Register file (128 regs, 32 bit, 15 ports)

    • Instruction register (5 issue slots), each slot driving one execution unit

    • Functional units: 5 constant, 5 ALU, 2 memory, 2 shift, 2 DSP-ALU, 2 DSP-mul, 3 branch, 2 FP ALU, 2 Int/FP ALU, 1 FP compare, 1 FP div/sqrt

    • Data cache (16 kB)

    • Instruction cache (32 kB)

    • PC

    Intel Architecture IA-64

    Explicitly Parallel Instruction Computing (EPIC)

    • IA-64 architecture -> Itanium, first realization

    Register model:

    • 128 64-bit integer registers (plus extra bits), stacked, rotating

    • 128 82-bit floating point registers, rotating

    • 64 1-bit boolean (predicate) registers

    • 8 64-bit branch target address registers

    • system control registers

    EPIC Architecture: IA-64

    • Instructions grouped in 128-bit bundles

      • 3 × 41-bit instructions

      • 5 template bits, indicate type and stop location

    • Each 41-bit instruction

      • starts with 4-bit opcode, and

      • ends with 6-bit guard (boolean) register-id

    • Supports speculative loads
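The field widths add up to exactly one bundle: 3 × 41 + 5 = 128 bits. The sketch below packs and unpacks such a bundle in Python. The slide gives only the field widths; the bit positions used here (template in the lowest 5 bits, guard register-id in the lowest 6 bits of each instruction, opcode in the highest 4 bits) are stated as an assumption for illustration, and the function names are hypothetical.

```python
SLOT_BITS = 41      # width of one instruction
TEMPLATE_BITS = 5   # width of the template field
SLOTS = 3           # instructions per bundle

def pack_bundle(template, slots):
    """Combine a 5-bit template and three 41-bit instructions into a 128-bit int."""
    assert 0 <= template < (1 << TEMPLATE_BITS) and len(slots) == SLOTS
    bundle = template
    for n, instr in enumerate(slots):
        assert 0 <= instr < (1 << SLOT_BITS)
        bundle |= instr << (TEMPLATE_BITS + n * SLOT_BITS)
    return bundle

def unpack_bundle(bundle):
    """Split a 128-bit bundle back into (template, [slot0, slot1, slot2])."""
    template = bundle & ((1 << TEMPLATE_BITS) - 1)
    mask = (1 << SLOT_BITS) - 1
    slots = [(bundle >> (TEMPLATE_BITS + n * SLOT_BITS)) & mask for n in range(SLOTS)]
    return template, slots

def opcode_and_guard(instr):
    """Return (4-bit opcode, 6-bit guard register-id) of a 41-bit instruction."""
    return instr >> (SLOT_BITS - 4), instr & 0x3F

example = pack_bundle(0x10, [(0xA << 37) | 0x15, 0, (1 << 41) - 1])
print(unpack_bundle(example))
```

A 6-bit guard field is what is needed to address the 64 1-bit boolean registers of the register model.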

    Itanium

    Itanium 2: McKinley

    EPIC Architecture: IA-64

    • EPIC allows for more binary compatibility than a plain VLIW:

      • Function unit assignment performed at run-time

      • Hardware stalls (interlocks) when FU results are not available

    • See other website for more info on IA-64:

      • www.ics.ele.tue.nl/~heco/courses/ACA

      • (look at related material)
