Embedded computer architecture

Embedded Computer Architecture

Exploiting ILP

VLIW architectures

TU/e 5KK73

Henk Corporaal

Bart Mesman


What are we talking about?

ILP = Instruction Level Parallelism = the ability to perform multiple operations (or instructions), from a single instruction stream, in parallel

VLIW = Very Long Instruction Word architecture

Instruction format example of a 5-issue VLIW:

| operation 1 | operation 2 | operation 3 | operation 4 | operation 5 |

Embedded Computer Architecture H. Corporaal and B. Mesman


Single Issue RISC vs VLIW

[Figure: the compiler packs operations (op) and nops into instructions. The RISC CPU executes 1 instr/cycle; the 3-issue VLIW also executes 1 instr/cycle, which amounts to 3 ops/cycle.]


Topics Overview

  • Enhance performance:

    • What options do you have?

  • Operation/Instruction Level Parallelism

    • Limits on ILP

  • VLIW

    • Examples

  • Clustering

  • Code generation

  • Hands-on

Architecture methods: Pipelined Execution of Instructions

IF: Instruction Fetch
DC: Instruction Decode
RF: Register Fetch
EX: Execute instruction
WB: Write Result Register

[Figure: pipeline diagram of instructions 1-4 over cycles 1-8; each instruction passes through IF, DC, RF, EX and WB and starts one cycle after its predecessor.]

Simple 5-stage pipeline

  • Purpose of pipelining:

    • Reduce #gate_levels in critical path

    • Reduce CPI close to one (instead of a large number for the multicycle machine)

    • More efficient Hardware

  • Problems

    • Hazards: pipeline stalls

      • Structural hazards: add more hardware

      • Control hazards, branch penalties: use branch prediction

      • Data hazards: bypassing required

Architecture methods: Pipelined Execution of Instructions

Superpipelining:

  • Split one or more of the critical pipeline stages

  • Superpipelining degree S:

$$S(\text{architecture}) = \sum_{Op \,\in\, I\_set} f(Op) \cdot lt(Op)$$

where:

f(Op) is the frequency of operation Op

lt(Op) is the latency of operation Op
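As a rough illustration of the superpipelining degree, the sketch below evaluates the weighted sum of latencies for a made-up operation mix. The operation names, frequencies and latencies are assumptions chosen for the example, not values from the course.

```python
# Hypothetical operation mix: name -> (relative frequency f, latency lt in cycles).
# These numbers are illustrative assumptions, not measurements from the slides.
OPERATION_MIX = {
    "alu":    (0.50, 1),
    "load":   (0.20, 2),
    "store":  (0.10, 1),
    "branch": (0.15, 1),
    "mul":    (0.05, 3),
}


def superpipelining_degree(mix):
    """Return S = sum over all operations of f(op) * lt(op)."""
    total_freq = sum(f for f, _ in mix.values())
    if abs(total_freq - 1.0) > 1e-9:
        raise ValueError("operation frequencies must sum to 1")
    return sum(f * lt for f, lt in mix.values())


S = superpipelining_degree(OPERATION_MIX)
print(f"S = {S:.2f}")
```

With unit latency for every operation the sum collapses to 1, which is the value the design-space table later lists for CISC.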

Architecture methods: Powerful Instructions (1)

MD-technique

  • Multiple data operands per operation

  • SIMD: Single Instruction Multiple Data

Vector instruction:

for (i=0; i<64; i++)
    c[i] = a[i] + 5*b[i];

or

c = a + 5*b

Assembly:

set   vl,64        // vector length = 64
ldv   v1,0(r2)     // load vector b
mulvi v2,v1,5      // v2 = 5*b
ldv   v1,0(r1)     // load vector a
addv  v3,v1,v2     // v3 = a + 5*b
stv   v3,0(r3)     // store vector c

Architecture methods: Powerful Instructions (1)

[Figure: SIMD Execution Method. Over time, nodes node1, node2, ..., node-K all execute Instruction 1, Instruction 2, Instruction 3, ..., Instruction n in lockstep.]

SIMD computing

  • Nodes used for independent operations

  • Mesh or hypercube connectivity

  • Exploit data locality of e.g. image processing applications

  • Dense encoding (few instruction bits needed)

Architecture methods: Powerful Instructions (1)

  • Sub-word parallelism

    • SIMD on restricted scale:

    • Used for Multi-media instructions

  • Examples

    • MMX, SSX, SUN-VIS, HP MAX-2, AMD 3DNow!, Trimedia II

    • Example: $\sum_{i=1..4} |a_i - b_i|$

Architecture methods: Powerful Instructions (2)

MO-technique: multiple operations per instruction

Two options:

  • CISC (Complex Instruction Set Computer)

  • VLIW (Very Long Instruction Word)

VLIW instruction example (one instruction, one field per FU):

field:  FU 1           | FU 2            | FU 3            | FU 4          | FU 5
op:     sub r8, r5, 3  | and r1, r5, 12  | mul r6, r5, r2  | ld r3, 0(r5)  | bnez r5, 13

VLIW architecture: central Register File

[Figure: a shared, multi-ported register file serving execution units 1-9, which are driven by issue slots 1-3.]

Q: How many ports does the register file need for n-issue?
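One common way to reason about this question is to assume that every operation reads two source registers and writes one result, so each issue slot needs its own read and write ports. That assumption is not stated on the slide; the sketch below simply counts ports under it.

```python
def register_file_ports(issue_width, reads_per_op=2, writes_per_op=1):
    """Count register file ports, assuming every issue slot needs its own ports.

    reads_per_op and writes_per_op are assumptions (a typical three-address operation).
    Returns (read_ports, write_ports, total_ports).
    """
    read_ports = issue_width * reads_per_op
    write_ports = issue_width * writes_per_op
    return read_ports, write_ports, read_ports + write_ports


for n in (1, 3, 5):
    print(n, register_file_ports(n))
```

For a 5-issue machine this gives 15 ports, the same figure quoted for the TriMedia TM1000 register file later in the deck.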

TriMedia TM32A processor

[Figure: chip floor plan showing the I/O interface, I-cache (32K) and D-cache (16K) with their tags, sequencer/decode, the register file (128 regs x 32 bits), and the functional units ALU0-ALU4, DSPALU0, DSPALU2, DSPMUL1, DSPMUL2, SHIFTER0, SHIFTER1, IFMUL1 and IFMUL2 (float), FALU0, FALU3, FCOMP2 and FTOUGH1.]

  • 0.18 micron

  • area: 16.9 mm^2

  • 200 MHz (typ)

  • 1.4 W

  • 7 mW/MHz (MIPS processor: 0.9 mW/MHz)


Architecture methods: Powerful Instructions (2) VLIW Characteristics

  • Only RISC like operation support

    • Short cycle times

  • Flexible: Can implement any FU mixture

  • Extensible

  • Tight inter FU connectivity required

  • Large instructions (up to 1024 bits)

  • Not binary compatible !!!

  • But good compilers exist

Architecture methods: Multiple instruction issue (per cycle)

Who guarantees semantic correctness?

  • Which instructions can be executed in parallel?

  • User: specifies multiple instruction streams

    • Multi-processor: MIMD (Multiple Instruction Multiple Data)

  • HW: Run-time detection of ready instructions

    • Superscalar

  • Compiler: Compile into dataflow representation

    • Dataflow processors


    Multiple instruction issue: Three Approaches

    Example code

    a := b + 15;
    c := 3.14 * d;
    e := c / f;

    Translation to DDG (Data Dependence Graph)

    [Figure: DDG of the example. ld &b and the constant 15 feed +, whose result is stored to &a; ld &d and the constant 3.14 feed *, whose result is stored to &c and also feeds /; ld &f feeds /, whose result is stored to &e.]


    Generated Code

    Instr.  Sequential Code      Dataflow Code
    I1      ld   r1,M(&b)        ld   M(&b)   -> I2
    I2      addi r1,r1,15        addi 15      -> I3
    I3      st   r1,M(&a)        st   M(&a)
    I4      ld   r1,M(&d)        ld   M(&d)   -> I5
    I5      muli r1,r1,3.14      muli 3.14    -> I6, I8
    I6      st   r1,M(&c)        st   M(&c)
    I7      ld   r2,M(&f)        ld   M(&f)   -> I8
    I8      div  r1,r1,r2        div          -> I9
    I9      st   r1,M(&e)        st   M(&e)

    • 3 approaches:

    • An MIMD may execute two streams: (1) I1-I3 (2) I4-I9

      • No dependencies between streams; in practice communication and synchronization required between streams

    • A superscalar issues multiple instructions from sequential stream

      • Obey dependencies (True and name dependencies)

      • Reverse engineering of DDG needed at run-time

    • Dataflow code is direct representation of DDG
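    The dataflow code above can be read directly as a dependence graph. The sketch below encodes those nine instructions and computes the earliest cycle for each one, assuming unit latency and unlimited function units; both assumptions are made only for this illustration.

```python
# Each instruction: name -> (operation, list of producer instructions it depends on).
# The edges follow the "-> Ix" targets of the dataflow code.
DDG = {
    "I1": ("ld b", []),
    "I2": ("addi 15", ["I1"]),
    "I3": ("st a", ["I2"]),
    "I4": ("ld d", []),
    "I5": ("muli 3.14", ["I4"]),
    "I6": ("st c", ["I5"]),
    "I7": ("ld f", []),
    "I8": ("div", ["I5", "I7"]),
    "I9": ("st e", ["I8"]),
}


def asap_schedule(ddg):
    """Earliest cycle per instruction, assuming unit latency and unlimited FUs."""
    cycle = {}

    def visit(name):
        if name not in cycle:
            producers = ddg[name][1]
            cycle[name] = 1 + max((visit(p) for p in producers), default=0)
        return cycle[name]

    for name in ddg:
        visit(name)
    return cycle


schedule = asap_schedule(DDG)
length = max(schedule.values())
print(schedule)
print(f"{len(DDG)} instructions in {length} cycles")
```

    Under these assumptions the critical path is I4, I5, I8, I9, so the nine instructions need at least four cycles.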

    Multiple Instruction Issue: Data flow processor

    [Figure: data flow processor with Token Matching, Token Store, Instruction Generate, Instruction Store, Reservation Stations and function units FU-1, FU-2, ..., FU-K, plus a Result Tokens feedback path.]


    Instruction Pipeline Overview

    [Figure: pipeline diagrams for CISC (no pipelining), RISC, Superscalar, Superpipelined, VLIW and Dataflow, built from the stages IF, DC, RF, EX and WB. The superscalar pipeline replicates the stages k times (IF1..IFk, DC1..DCk, RF1..RFk, EX1..EXk, WB1..WBk) and adds ISSUE and ROB stages; the superpipelined one splits IF into IF1..IFs and EX into EX1..EX5; VLIW and dataflow replicate RF/EX/WB k times.]


    Four dimensional representation of the architecture design space <I, O, D, S>

    [Figure: four axes, Instructions/cycle 'I', Operations/instruction 'O', Data/operation 'D' and Superpipelining Degree 'S', on logarithmic scales (0.1 to 100), with CISC, RISC, Superscalar, Superpipelined, VLIW, Vector, SIMD, MIMD and Dataflow placed in this space.]


    Architecture design space

    Typical values of K (# of functional units or processor nodes), and <I, O, D, S> for different architectures

    Architecture      K     I     O     D     S     Mpar
    CISC              1     0.2   1.2   1.1   1     0.26
    RISC              1     1     1     1     1.2   1.2
    VLIW              10    1     10    1     1.2   12
    Superscalar       3     3     1     1     1.2   3.6
    Superpipelined    1     1     1     1     3     3
    Vector            7     0.1   1     64    5     32
    SIMD              128   1     1     128   1.2   154
    MIMD              32    32    1     1     1.2   38
    Dataflow          10    10    1     1     1.2   12

    $$S(\text{architecture}) = \sum_{Op \,\in\, I\_set} f(Op) \cdot lt(Op)$$

    $$M_{par} = I \cdot O \cdot D \cdot S$$
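    The Mpar column can be cross-checked against the product I*O*D*S. The sketch below recomputes it from the table values; the 2% tolerance is an assumption to allow for the rounding in the listed numbers.

```python
# name: (K, I, O, D, S, Mpar as listed in the table)
DESIGN_SPACE = {
    "CISC":           (1,   0.2, 1.2, 1.1, 1,   0.26),
    "RISC":           (1,   1,   1,   1,   1.2, 1.2),
    "VLIW":           (10,  1,   10,  1,   1.2, 12),
    "Superscalar":    (3,   3,   1,   1,   1.2, 3.6),
    "Superpipelined": (1,   1,   1,   1,   3,   3),
    "Vector":         (7,   0.1, 1,   64,  5,   32),
    "SIMD":           (128, 1,   1,   128, 1.2, 154),
    "MIMD":           (32,  32,  1,   1,   1.2, 38),
    "Dataflow":       (10,  10,  1,   1,   1.2, 12),
}


def mpar(i, o, d, s):
    """Machine parallelism Mpar = I * O * D * S."""
    return i * o * d * s


computed = {}
for name, (k, i, o, d, s, listed) in DESIGN_SPACE.items():
    computed[name] = mpar(i, o, d, s)
    print(f"{name:15s} computed {computed[name]:8.2f}   listed {listed}")
```

    Note that K does not enter the formula; it only describes how many units are available to deliver that parallelism.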


    Overview

    • Enhance performance: architecture methods

    • Instruction Level Parallelism (ILP)

      • limits on ILP

    • VLIW

      • Examples

    • Clustering

    • Code generation

    • Hands-on

    General organization of an ILP architecture

    [Figure: a CPU with an instruction fetch unit and an instruction decode unit fed from instruction memory, function units FU-1 to FU-5, a bypassing network, a register file, and a data memory.]


    Motivation for ILP

    • Increasing VLSI densities; decreasing feature size

    • Increasing performance requirements

    • New application areas, like

      • multi-media (image, audio, video, 3-D, holographic)

      • intelligent search and filtering engines

      • neural, fuzzy, genetic computing

    • More functionality

    • Use of existing Code (Compatibility)

    • Low Power: P = fCVdd2

    Low power through parallelism

    • Sequential Processor

      • Switching capacitance C

      • Frequency f

      • Voltage V

      • $P = f C V^2$

    • Parallel Processor (two times the number of units)

      • Switching capacitance 2C

      • Frequency f/2

      • Voltage V' < V

      • $P = \frac{f}{2} \cdot 2C \cdot V'^2 = f C V'^2$
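    A small numeric sketch of the comparison above. The frequency, capacitance and voltage values are invented for the example; only the formula P = f C V^2 comes from the slide.

```python
def dynamic_power(f_hz, c_farad, v_volt):
    """Dynamic power P = f * C * V^2."""
    return f_hz * c_farad * v_volt ** 2


# Assumed example values (not from the slides).
f, c, v = 200e6, 1e-9, 1.8      # sequential processor
v_reduced = 1.2                 # lower supply voltage made possible by running at f/2

p_seq = dynamic_power(f, c, v)
p_par = dynamic_power(f / 2, 2 * c, v_reduced)   # twice the units, half the frequency
ratio = p_par / p_seq

print(f"sequential: {p_seq:.3f} W, parallel: {p_par:.3f} W, ratio: {ratio:.2f}")
```

    The throughput is unchanged (two units at half the clock), and the power ratio reduces to (V'/V)^2, so all of the saving comes from the lower supply voltage.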


    Measuring and exploiting available ILP

    • How much ILP is there in applications?

    • How to measure parallelism within applications?

      • Using existing compiler

      • Using trace analysis

        • Track all the real data dependencies (RaWs) of instructions from issue window

          • register dependence

          • memory dependence

        • Check for correct branch prediction

          • if prediction correct continue

          • if wrong, flush schedule and start in next cycle

    Trace analysis

    Program

    For i := 0..2
      A[i] := i;
    S := X+3;

    Compiled code

          set  r1,0
          set  r2,3
          set  r3,&A
    Loop: st   r1,0(r3)
          add  r1,r1,1
          add  r3,r3,4
          brne r1,r2,Loop
          add  r1,r5,3

    Trace

    set  r1,0
    set  r2,3
    set  r3,&A
    st   r1,0(r3)
    add  r1,r1,1
    add  r3,r3,4
    brne r1,r2,Loop
    st   r1,0(r3)
    add  r1,r1,1
    add  r3,r3,4
    brne r1,r2,Loop
    st   r1,0(r3)
    add  r1,r1,1
    add  r3,r3,4
    brne r1,r2,Loop
    add  r1,r5,3

    How parallel can you execute this code?


    Trace analysis

    Parallel Trace

    set r1,0      set r2,3      set r3,&A
    st r1,0(r3)   add r1,r1,1   add r3,r3,4
    st r1,0(r3)   add r1,r1,1   add r3,r3,4   brne r1,r2,Loop
    st r1,0(r3)   add r1,r1,1   add r3,r3,4   brne r1,r2,Loop
    brne r1,r2,Loop
    add r1,r5,3

    Max ILP = Speedup = Lserial / Lparallel = 16 / 6 = 2.7 (16 instructions in the serial trace, 6 cycles in the parallel trace)
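    The parallel trace can be reproduced with a small scheduler. The sketch below tracks only true (RAW) register dependences with unit latency, and restarts scheduling in the next cycle after a mispredicted branch. Treating the final loop-exit branch as the only misprediction is an assumption made here so that the result lines up with the six-cycle trace; memory dependences are ignored because the three stores go to different addresses.

```python
from collections import Counter

# (opcode, destination register or None, source registers, mispredicted branch?)
TRACE = [("set", "r1", [], False), ("set", "r2", [], False), ("set", "r3", [], False)]
for iteration in range(3):
    last = iteration == 2  # assumption: only the loop-exit branch is mispredicted
    TRACE += [
        ("st",   None, ["r1", "r3"], False),
        ("add",  "r1", ["r1"],       False),
        ("add",  "r3", ["r3"],       False),
        ("brne", None, ["r1", "r2"], last),
    ]
TRACE.append(("add", "r1", ["r5"], False))


def schedule_trace(trace):
    """Return the issue cycle of every instruction in the trace."""
    ready = {}      # register -> first cycle in which its latest value can be read
    earliest = 1    # raised to the next cycle after a mispredicted branch (flush)
    cycles = []
    for opcode, dest, sources, mispredicted in trace:
        cycle = max([earliest] + [ready.get(reg, 1) for reg in sources])
        cycles.append(cycle)
        if dest is not None:
            ready[dest] = cycle + 1   # unit latency
        if mispredicted:
            earliest = cycle + 1
    return cycles


cycles = schedule_trace(TRACE)
per_cycle = Counter(cycles)
speedup = len(TRACE) / max(cycles)
print(sorted(per_cycle.items()))
print(f"{len(TRACE)} instructions / {max(cycles)} cycles = {speedup:.1f}")
```

    Because each read uses the most recent earlier writer, WAR and WAW hazards never delay an instruction here, which corresponds to assuming unlimited register renaming.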


    Ideal Processor

    Assumptions for ideal/perfect processor:

    1. Register renaming: infinite number of virtual registers => all register WAW & WAR hazards avoided

    2. Branch and jump prediction: perfect => all program instructions available for execution

    3. Memory-address alias analysis: addresses are known. A store can be moved before a load provided the addresses are not equal

    Also:

    • unlimited number of instructions issued/cycle (unlimited resources), and

    • unlimited instruction window

    • perfect caches

    • 1 cycle latency for all instructions (FP *,/)

      Programs were compiled using MIPS compiler with maximum optimization level

    Upper Limit to ILP: Ideal Processor

    [Figure: bar chart of IPC per benchmark on the ideal processor. Integer: 18 - 60; FP: 75 - 150.]


    Window Size and Branch Impact

    • Change from infinite window to examine 2000 and issue at most 64 instructions per cycle

    [Figure: bar chart of IPC per benchmark for the branch predictors Perfect, Tournament, BHT(512), Profile and No prediction. FP: 15 - 45; Integer: 6 - 12.]


    Limiting nr. of Renaming Registers

    • Changes: 2000 instr. window, 64 instr. issue, 8K 2-level predictor (slightly better than tournament predictor)

    [Figure: bar chart of IPC per benchmark for Infinite, 256, 128, 64 and 32 renaming registers. FP: 11 - 45; Integer: 5 - 15.]


    Memory Address Alias Impact

    • Changes: 2000 instr. window, 64 instr. issue, 8K 2-level predictor, 256 renaming registers

    [Figure: bar chart of IPC per benchmark for the alias analysis models Perfect, Global/stack perfect, Inspection and None. FP: 4 - 45 (Fortran, no heap); Integer: 4 - 9.]


    Reducing Window Size

    • Assumptions: Perfect disambiguation, 1K Selective predictor, 16 entry return stack, 64 renaming registers, issue as many as window

    [Figure: bar chart of IPC per benchmark for window sizes Infinite, 256, 128, 64, 32, 16, 8 and 4. FP: 8 - 45; Integer: 6 - 12.]


    How to Exceed ILP Limits of This Study?

    • WAR and WAW hazards through memory: the study eliminated WAW and WAR hazards through register renaming, but not in memory

    • Unnecessary dependences

      • the compiler did not unroll loops, so the dependence on the iteration variable remains

    • Overcoming the data flow limit: value prediction, i.e. predicting values and speculating on the prediction

      • Address value prediction and speculation: predict addresses and speculate by reordering loads and stores. This could provide better aliasing analysis

    Conclusions

    • Amount of parallelism is limited

      • higher in Multi-Media and Signal Processing appl.

      • higher in kernels

    • Trace analysis detects all types of parallelism

      • task, data and operation types

    • Detected parallelism depends on

      • quality of compiler

      • hardware

      • source-code transformations

    Overview

    • Enhance performance: architecture methods

    • Instruction Level Parallelism

    • VLIW

      • Examples

        • C6

        • TM

        • IA-64: Itanium, ....

        • TTA

    • Clustering

    • Code generation

    • Hands-on

    VLIW: general concept

    [Figure: a VLIW architecture with 7 FUs. The instruction memory fills the instruction register, whose fields drive the function units: three Int FUs, two LD/ST units and two FP FUs, connected to an Int Register File, a Floating Point Register File and the Data Memory.]


    VLIW characteristics

    • Multiple operations per instruction

    • One instruction per cycle issued (at most)

    • Compiler is in control

    • Only RISC like operation support

      • Short cycle times

      • Easier to compile for

    • Flexible: Can implement any FU mixture

    • Extensible / Scalable

      However:

    • tight inter FU connectivity required

    • not binary compatible !!

      • (new long instruction format)

    • low code density

    VelociTI C6x datapath

    [Figure: VelociTI C6x datapath.]


    VLIW example: TMS320C62

    TMS320C62 VelociTI Processor

    • 8 operations (of 32-bit) per instruction (256 bit)

    • Two clusters

      • 8 FUs: 4 FUs / cluster (2 multipliers, 6 ALUs)

      • 2 x 16 registers

      • One bus available to write in register file of other cluster

    • Flexible addressing modes (like circular addressing)

    • Flexible instruction packing

    • All instructions conditional

    • Originally: 5 ns, 200 MHz, 0.25 um, 5-layer CMOS

    • 128 KB on-chip RAM


    VLIW example: Philips TriMedia TM1000

    [Figure: the PC addresses the instruction cache (32 kB), which fills the instruction register (5 issue slots); the issue slots drive five execution units that share the register file (128 regs, 32 bit, 15 ports) and access the data cache (16 kB).]

    Functional units: 5 constant, 5 ALU, 2 memory, 2 shift, 2 DSP-ALU, 2 DSP-mul, 3 branch, 2 FP ALU, 2 Int/FP ALU, 1 FP compare, 1 FP div/sqrt


    Intel EPIC Architecture IA-64

    Explicitly Parallel Instruction Computing (EPIC)

    • IA-64 architecture -> Itanium, first realization 2001

      Register model:

    • 128 64-bit int x bits, stack, rotating

    • 128 82-bit floating point, rotating

    • 64 1-bit boolean

    • 8 64-bit branch target address

    • system control registers

      See http://en.wikipedia.org/wiki/Itanium

    EPIC Architecture: IA-64

    • Instructions grouped in 128-bit bundles

      • 3 * 41-bit instruction

      • 5 template bits, indicate type and stop location

    • Each 41-bit instruction

      • starts with 4-bit opcode, and

      • ends with 6-bit guard (boolean) register-id

    • Supports speculative loads
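    The bundle arithmetic (3 * 41 + 5 = 128 bits) can be made concrete with a little bit manipulation. The sketch below assumes the template sits in the five least significant bits, followed by the three slots in order, and that within a slot the guard id occupies the lowest 6 bits and the opcode the highest 4. These bit positions are assumptions for illustration; the slides only give the field widths.

```python
SLOT_BITS = 41
TEMPLATE_BITS = 5
SLOT_MASK = (1 << SLOT_BITS) - 1


def pack_bundle(template, slots):
    """Build a 128-bit bundle from a 5-bit template and three 41-bit slots."""
    if not 0 <= template < (1 << TEMPLATE_BITS) or len(slots) != 3:
        raise ValueError("need a 5-bit template and exactly three slots")
    bundle = template
    for i, slot in enumerate(slots):
        if not 0 <= slot <= SLOT_MASK:
            raise ValueError("each slot must fit in 41 bits")
        bundle |= slot << (TEMPLATE_BITS + i * SLOT_BITS)
    return bundle


def unpack_bundle(bundle):
    """Split a 128-bit bundle into (template, [slot0, slot1, slot2])."""
    template = bundle & ((1 << TEMPLATE_BITS) - 1)
    slots = [(bundle >> (TEMPLATE_BITS + i * SLOT_BITS)) & SLOT_MASK for i in range(3)]
    return template, slots


def slot_fields(slot):
    """Return (4-bit opcode, 6-bit guard register id) of one 41-bit instruction."""
    opcode = slot >> (SLOT_BITS - 4)   # assumed: highest 4 bits
    guard = slot & 0x3F                # assumed: lowest 6 bits
    return opcode, guard


slots_in = [(0x8 << 37) | 5, (0x1 << 37) | 0, 63]
bundle = pack_bundle(0x10, slots_in)
template, slots_out = unpack_bundle(bundle)
print(hex(bundle), template, [slot_fields(s) for s in slots_out])
```

    A 6-bit guard field is exactly wide enough to name one of the 64 1-bit boolean registers listed in the register model.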

    Itanium organization

    [Figure: Itanium organization.]


    Itanium 2: McKinley

    [Figure: Itanium 2 (McKinley).]

    EPIC Architecture: IA-64

    • EPIC allows for more binary compatibility than a plain VLIW:

      • Function unit assignment performed at run-time

      • Lock when FU results not available

    • See other website (course 5MD00) for more info on IA-64:

      • www.ics.ele.tue.nl/~heco/courses/ACA

      • (look at related material)



    VLIW evaluation

    [Figure: general organization of an ILP architecture (instruction memory, instruction fetch and decode units, FU-1 to FU-5, bypassing network, register file, data memory), annotated with the control problem: with N function units, the marked parts grow as O(N^2) and O(N)-O(N^2).]


    VLIW evaluation

    Strong points of VLIW:

    • Scalable (add more FUs)

    • Flexible (an FU can be almost anything; e.g. multimedia support)

    Weak points:

    • With N FUs:

      • Bypassing complexity: O(N^2)

      • Register file complexity: O(N)

      • Register file size: O(N^2)

    • Register file design restricts FU flexibility

    Solution: .................................................. ?


    Solution

    TTA: Transport Triggered Architecture

    Mirroring the Programming Paradigm

    [Figure: dataflow graph built from +, -, >, * and st nodes.]


    Transport Triggered Architecture

    General organization of a TTA

    [Figure: a CPU with an instruction fetch unit and an instruction decode unit fed from instruction memory, function units FU-1 to FU-5, a bypassing network, a register file, and a data memory.]


    TTA structure; datapath details

    [Figure: two load/store units, two integer ALUs, a float ALU, integer, float and boolean register files, an instruction unit and an immediate unit, each connected through sockets; the load/store units connect to Data Memory and the instruction unit to Instruction Memory.]


    TTA hardware characteristics

    • Modular: building blocks easy to reuse

    • Very flexible and scalable

      • easy inclusion of Special Function Units (SFUs)

    • Very low complexity

      • > 50% reduction on # register ports

      • reduced bypass complexity (no associative matching)

      • up to 80 % reduction in bypass connectivity

      • trivial decoding

      • reduced register pressure

      • easy register file partitioning (a single port is enough!)

    TTA software characteristics

    • More difficult to schedule !

    • But: extra scheduling optimizations

    The operation add r3, r1, r2 becomes three moves:

    • r1 -> add.o1;

    • r2 -> add.o2;

    • add.r -> r3

    That does not look like an improvement !?!

    [Figure: adder FU with operand ports o1 and o2 and result port r.]


    Program TTAs

    [Figure: FU pipeline with Operand and Trigger registers, an internal stage, and a Result register.]

    • How to do data operations ?

    • 1. Transport of operands to FU

      • Operand move (s)

      • Trigger move

    • 2. Transport of results from FU

      • Result move (s)

    Example: add r3,r1,r2 becomes

    r1 -> Oint      // operand move to integer unit
    r2 -> Tadd      // trigger move to integer unit
    ............    // addition operation in progress
    Rint -> r3      // result move from integer unit

    How to do control flow ?

    1. Jump:   #jump-address -> pc
    2. Branch: #displacement -> pcd
    3. Call:   pc -> r; #call-address -> pcd


    Scheduling example

    VLIW

    add r1,r1,r2
    sub r4,r1,95

    TTA

    r1 -> add.o1, r2 -> add.o2
    add.r -> sub.o1, 95 -> sub.o2
    sub.r -> r4

    [Figure: TTA datapath with a load/store unit, two integer ALUs, an integer RF and an immediate unit.]
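    The step from the two VLIW operations to the five TTA moves can be sketched as a tiny translator. It forwards a result straight from an FU result port to the next operand port (software bypassing) and drops the write-back of a register that is not needed afterwards. The function name, the one-FU-per-operation simplification and the `live_out` set are assumptions for this illustration, not the actual Move compiler.

```python
def to_tta_moves(ops, live_out):
    """Translate three-address operations into TTA moves with software bypassing.

    ops: list of (fu, dst, src1, src2); each FU name is assumed to be used only once.
    live_out: registers whose values are still needed after the sequence.
    """
    moves = []
    in_port = {}   # register -> FU result port that currently holds its value
    for index, (fu, dst, src1, src2) in enumerate(ops):
        # Read operands from a result port when available, else from the register file.
        moves.append(f"{in_port.get(src1, src1)} -> {fu}.o1")
        moves.append(f"{in_port.get(src2, src2)} -> {fu}.o2")
        in_port[dst] = f"{fu}.r"
        overwritten_later = any(later[1] == dst for later in ops[index + 1:])
        if dst in live_out and not overwritten_later:
            moves.append(f"{fu}.r -> {dst}")   # write back only values still needed
    return moves


ops = [("add", "r1", "r1", "r2"), ("sub", "r4", "r1", "95")]
moves = to_tta_moves(ops, live_out={"r4"})
for move in moves:
    print(move)
```

    With r1 assumed dead after the sequence, the add result never touches the register file, which is where the saving in register ports comes from.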


    TTA Instruction format

    General MOVE instructions: multiple fields

    | move 1 | move 2 | move 3 | move 4 |

    General MOVE field:

    | g | i | src | dst |

    g: guard specifier
    i: immediate specifier
    src: source
    dst: destination

    How to use immediates?

    Small, 6 bits:  | g | 1 | imm | dst |
    Long, 32 bits:  | g | 0 | Ir-1 | dst |  plus a separate imm field


    Programming TTAs

    How to do conditional execution

    Each move is guarded

    Example

    r1 -> cmp.o1    // operand move to compare unit
    r2 -> cmp.o2    // trigger move to compare unit
    cmp.r -> g      // put result in boolean register g
    g: r3 -> r4     // guarded move takes place when r1=r2


    Register file port pressure for TTAs

    [Figure: register file port pressure for TTAs.]


    Summary of TTA Advantages

    • Better usage of transport capacity

      • Instead of 3 transports per dyadic operation, about 2 are needed

      • # register ports reduced by at least 50%

      • Inter-FU connectivity reduced by 50-70%

        • No full connectivity required

    • Both the transport capacity and # register ports become independent design parameters; this removes one of the major bottlenecks of VLIWs

    • Flexible: FUs can incorporate arbitrary functionality

    • Scalable: #FUs, #reg.files, etc. can be changed

    • FU splitting results in extra exploitable concurrency

    • TTAs are easy to design and can have short cycle times

    TTA automatic DSE

    [Figure: the Move framework. With user interaction, an optimizer passes architecture parameters to a parametric compiler, which produces parallel object code, and to a hardware generator, which produces the chip; both give feedback to the optimizer. The explored design points (x), plotted as exec. time against cost, form the solution space bounded by the Pareto curve.]
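    The Pareto curve in the figure consists of the design points that no other point beats on both cost and execution time. A minimal sketch of that filter follows; the candidate (cost, exec. time) pairs are invented for the example and do not come from the Move framework.

```python
def pareto_front(points):
    """Return the points not dominated by any other point.

    Each point is (cost, exec_time); lower is better for both.
    A point is dominated if another point is no worse on both axes.
    """
    front = []
    for p in points:
        dominated = any(
            q != p and q[0] <= p[0] and q[1] <= p[1]
            for q in points
        )
        if not dominated:
            front.append(p)
    return sorted(front)


# Hypothetical design points: (cost, exec. time).
candidates = [(10, 90), (20, 60), (25, 70), (40, 35), (45, 50), (70, 30), (80, 32)]
front = pareto_front(candidates)
print(front)
```

    The quadratic scan is fine for a handful of points; a real exploration loop with many candidates would more likely sort by cost first and keep a running minimum of execution time.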


    Overview

    • Enhance performance: architecture methods

    • Instruction Level Parallelism

    • VLIW

    • Examples

      • C6

      • TM

      • TTA

    • Clustering and Reconfigurable components

    • Code generation

    • Hands-on

    Clustered VLIW

    • Clustering = splitting up the VLIW data path (the same can be done for the instruction path)

    [Figure: three clusters, each with a loop buffer, three FUs and its own register file, between a Level 1 Instruction Cache and a Level 1 Data Cache, backed by a Level 2 (shared) Cache.]


    Clustered VLIW

    Why clustering?

    • Timing: faster clock

    • Lower Cost

      • silicon area

      • T2M (time to market)

    • Lower Energy

    What's the disadvantage?

    Want to know more: see the PhD thesis of Andrei Terechko


    Fine-Grained reconfigurable: Xilinx XC4000 FPGA

    [Figure: XC4000 structure with Configurable Logic Blocks (CLBs), I/O Blocks (IOBs) and Programmable Interconnect. The CLB detail shows function generators F, G and H with inputs F1-F4, G1-G4 and C1-C4 (H1, DIN, S/R, EC), two flip-flops with S/R control, clock K, and outputs X and Y; the IOB detail shows input and output buffers with flip-flops, slew rate control, passive pull-up/pull-down, a delay element and the pad.]


    Coarse-Grained reconfigurable: Chameleon CS2000

    • Highlights:

    • 32-bit datapath (ALU/Shift)

    • 16x24 Multiplier

    • distributed local memory

    • fixed timing

    Recent Coarse Grain Reconfigurable Architectures

    • SmartCell 2009

      • read http://www.hindawi.com/journals/es/2009/518659.html

    • Montium (reconfigurable VLIW)

    • RAPID

    • NIOS II

    • RAW

    • PicoChip

    • PACT XPP64

    • many more ….

    Hybrid FPGAs: Virtex II-Pro

    [Figure: Virtex II Pro, courtesy of Xilinx, with the following blocks:]

    • GHz IO: Up to 16 serial transceivers

    • Memory blocks

    • PowerPCs

    • Reconfigurable logic blocks


    Xilinx Zynq with 2 ARM processors

    [Figure: Xilinx Zynq with 2 ARM processors.]


    Granularity Makes Differences

    [Figure: granularity makes differences.]


    HW or SW reconfigurable?

    [Figure: design space with data path granularity (fine to coarse) on one axis and reconfiguration time (reset, context, loop buffer, 1 cycle) on the other, with configuration bandwidth indicated. FPGA is labelled spatial mapping and VLIW temporal mapping; subword parallelism is marked as well.]