Embedded Computer Architecture

Exploiting ILP

VLIW architectures

TU/e 5KK73

Henk Corporaal

Bart Mesman

What are we talking about?

ILP = Instruction Level Parallelism = ability to perform multiple operations (or instructions), from a single instruction stream, in parallel

VLIW = Very Long Instruction Word architecture

Instruction format example of a 5-issue VLIW:

| operation 1 | operation 2 | operation 3 | operation 4 | operation 5 |

Embedded Computer Architecture H. Corporaal and B. Mesman

Single Issue RISC vs VLIW

[Figure: a compiler turns a sequential stream of single-operation instructions into long instructions with three operation slots each, padded with nops where no operation fits. The RISC CPU executes 1 instr/cycle; the 3-issue VLIW also executes 1 instr/cycle, which gives 3 ops/cycle.]

Topics Overview
  • Enhance performance:
    • What options do you have?
  • Operation/Instruction Level Parallelism
    • Limits on ILP
  • VLIW
    • Examples
  • Clustering
  • Code generation
  • Hands-on

Architecture methods: Pipelined Execution of Instructions

[Figure: simple 5-stage pipeline. Instructions 1 to 4 overlap over cycles 1 to 8; each instruction passes through IF, DC, RF, EX and WB, one stage per cycle.]

IF: Instruction Fetch
DC: Instruction Decode
RF: Register Fetch
EX: Execute instruction
WB: Write Result Register

  • Purpose of pipelining:
    • Reduce #gate_levels in critical path
    • Reduce CPI close to one (instead of a large number for the multicycle machine)
    • More efficient Hardware
  • Problems
    • Hazards: pipeline stalls
      • Structural hazards: add more hardware
      • Control hazards, branch penalties: use branch prediction
      • Data hazards: bypassing required
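
The cycle grid of the simple 5-stage pipeline can be reproduced in a few lines. This is a minimal sketch that assumes an ideal pipeline without any stalls; the function names `pipeline_timeline` and `total_cycles` are made up for illustration.

```python
STAGES = ["IF", "DC", "RF", "EX", "WB"]

def pipeline_timeline(num_instructions, stages=STAGES):
    """Return {instruction: {cycle: stage}} for an ideal pipeline (no hazards)."""
    timeline = {}
    for i in range(num_instructions):
        # Instruction i (0-based) enters the first stage in cycle i + 1
        # and advances one stage per cycle.
        timeline[i + 1] = {i + 1 + s: name for s, name in enumerate(stages)}
    return timeline

def total_cycles(num_instructions, num_stages=len(STAGES)):
    # The first instruction needs num_stages cycles; each later one adds a cycle.
    return num_stages + num_instructions - 1

if __name__ == "__main__":
    n = 4
    for instr, row in pipeline_timeline(n).items():
        cells = [row.get(c, "--") for c in range(1, total_cycles(n) + 1)]
        print(instr, " ".join(cells))
```

For four instructions this yields 8 cycles, so the CPI approaches one as the instruction count grows, provided no hazard stalls the pipeline.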

Architecture methods: Pipelined Execution of Instructions

Superpipelining:

  • Split one or more of the critical pipeline stages
  • Superpipelining degree S:

S(architecture) = f(Op) * lt (Op)

Op I_set

where:

f(op) is frequency of operation op

lt(op) is latency of operation op
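
As a worked example of the formula, the sketch below computes S for an operation mix. The mix itself (frequencies and latencies) is hypothetical and not taken from the slides; only the weighted sum follows the definition above.

```python
def superpipelining_degree(mix):
    """mix maps an operation name to (frequency, latency in cycles)."""
    total = sum(freq for freq, _ in mix.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError("frequencies must sum to 1")
    # S = sum over all operations of f(Op) * lt(Op)
    return sum(freq * latency for freq, latency in mix.values())

# Hypothetical operation mix, chosen only for illustration.
example_mix = {
    "alu":    (0.5, 1),
    "load":   (0.2, 2),
    "store":  (0.1, 1),
    "branch": (0.1, 1),
    "mul":    (0.1, 3),
}

S = superpipelining_degree(example_mix)
print(f"S = {S:.2f}")
```

With this mix the degree is about 1.4: longer-latency loads and multiplies pull the average above one.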

Architecture methods: Powerful Instructions (1)

MD-technique

  • Multiple data operands per operation
  • SIMD: Single Instruction Multiple Data

Vector instruction:

for (i = 0; i < 64; i++)
    c[i] = a[i] + 5*b[i];

or

c = a + 5*b

Assembly:

set vl,64

ldv v1,0(r2)

mulvi v2,v1,5

ldv v1,0(r1)

addv v3,v1,v2

stv v3,0(r3)
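
To make the correspondence between the loop and the vector assembly concrete, here is a small emulation. It is a sketch under assumptions: memory is a flat Python list indexed per element, and the base addresses in r1, r2 and r3 (for a, b and c) are invented for the example.

```python
VL = 64  # vector length, as set by "set vl,64"

def ldv(memory, base):
    """Load VL consecutive elements starting at base."""
    return memory[base:base + VL]

def mulvi(v, imm):
    """Multiply every element by an immediate."""
    return [x * imm for x in v]

def addv(v, w):
    """Element-wise addition of two vectors."""
    return [x + y for x, y in zip(v, w)]

def stv(memory, base, v):
    """Store a vector of VL elements starting at base."""
    memory[base:base + VL] = v

# Hypothetical layout: a at 0, b at 64, c at 128.
memory = list(range(64)) + [2 * i for i in range(64)] + [0] * 64
r1, r2, r3 = 0, 64, 128

v1 = ldv(memory, r2)   # ldv   v1,0(r2)   load b
v2 = mulvi(v1, 5)      # mulvi v2,v1,5    5*b
v1 = ldv(memory, r1)   # ldv   v1,0(r1)   load a
v3 = addv(v1, v2)      # addv  v3,v1,v2   a + 5*b
stv(memory, r3, v3)    # stv   v3,0(r3)   store c

# Reference result from the scalar loop.
scalar = [memory[r1 + i] + 5 * memory[r2 + i] for i in range(VL)]
```

Five vector instructions replace 64 iterations of the scalar loop, which is the point of the MD-technique.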

Architecture methods: Powerful Instructions (1)

[Figure: SIMD execution method. Over time, nodes node1, node2, ..., node-K all execute Instruction 1, then Instruction 2, Instruction 3, ..., Instruction n.]

SIMD computing

  • Nodes used for independent operations
  • Mesh or hypercube connectivity
  • Exploit data locality of e.g. image processing applications
  • Dense encoding (few instruction bits needed)

Architecture methods: Powerful Instructions (1)
  • Sub-word parallelism
    • SIMD on restricted scale:
    • Used for Multi-media instructions
  • Examples
    • MMX, SSE, SUN VIS, HP MAX-2, AMD 3DNow!, TriMedia II
    • Example: $\sum_{i=1}^{4} |a_i - b_i|$

Architecture methods: Powerful Instructions (2)

MO-technique: multiple operations per instruction

Two options:

  • CISC (Complex Instruction Set Computer)
  • VLIW (Very Long Instruction Word)

VLIW instruction example (one instruction, one field per FU):

FU 1: sub r8, r5, 3
FU 2: and r1, r5, 12
FU 3: mul r6, r5, r2
FU 4: ld r3, 0(r5)
FU 5: bnez r5, 13


VLIW architecture: central Register File

Shared, Multi-ported Register file

[Figure: nine execution units (Exec unit 1 to Exec unit 9), grouped under Issue slot 1, Issue slot 2 and Issue slot 3, all connected to the shared register file.]

Q: How many ports does the register file need for n-issue?
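
One way to approach the question is a simple count. The sketch below assumes that every issue slot can start one operation per cycle that reads two operands and writes one result; real designs may share or reduce ports, so treat the numbers as an upper-bound estimate rather than the slide's official answer.

```python
def register_file_ports(issue_slots, reads_per_op=2, writes_per_op=1):
    """Estimate register file ports for an n-issue machine (assumed 2R + 1W per slot)."""
    read_ports = issue_slots * reads_per_op
    write_ports = issue_slots * writes_per_op
    return {"read": read_ports, "write": write_ports, "total": read_ports + write_ports}

ports_5_issue = register_file_ports(5)
print(ports_5_issue)
```

Under this assumption a 5-issue machine needs 15 ports, which is the port count quoted for the TriMedia TM1000 register file later in this deck.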

TriMedia TM32A processor

[Figure: TM32A die floorplan showing the I/O interface, I-cache (32K) and D-cache (16K) with their tag arrays, the sequencer/decode block, the register file (128 regs x 32 bits) and the functional units: ALU0 to ALU4, DSPALU0, DSPALU2, DSPMUL1, DSPMUL2, IFMUL1 and IFMUL2 (float), FALU0, FALU3, FCOMP2, FTOUGH1, SHIFTER0, SHIFTER1.]

  • 0.18 micron
  • area: 16.9 mm^2
  • 200 MHz (typ)
  • 1.4 W
  • 7 mW/MHz (MIPS processor: 0.9 mW/MHz)

Architecture methods: Powerful Instructions (2) VLIW Characteristics
  • Only RISC like operation support
    • Short cycle times
  • Flexible: Can implement any FU mixture
  • Extensible
  • Tight inter FU connectivity required
  • Large instructions (up to 1024 bits)
  • Not binary compatible !!!
  • But good compilers exist

Architecture methods: Multiple instruction issue (per cycle)

Who guarantees semantic correctness?

    • which instructions can be executed in parallel?
  • User: he specifies multiple instruction streams
    • Multi-processor: MIMD (Multiple Instruction Multiple Data)
  • HW: Run-time detection of ready instructions
    • Superscalar
  • Compiler: Compile into dataflow representation
    • Dataflow processors

Multiple instruction issue: Three Approaches

Example code

a := b + 15;
c := 3.14 * d;
e := c / f;

Translation to DDG (Data Dependence Graph)

[Figure: DDG of the example. ld &b feeds + (with constant 15), whose result is stored to &a. ld &d feeds * (with constant 3.14), whose result is stored to &c and also feeds /, together with ld &f; the result of / is stored to &e.]

Generated Code

Instr.  Sequential Code       Dataflow Code
I1      ld r1,M(&b)           ld M(&b) -> I2
I2      addi r1,r1,15         addi 15 -> I3
I3      st r1,M(&a)           st M(&a)
I4      ld r1,M(&d)           ld M(&d) -> I5
I5      muli r1,r1,3.14       muli 3.14 -> I6, I8
I6      st r1,M(&c)           st M(&c)
I7      ld r2,M(&f)           ld M(&f) -> I8
I8      div r1,r1,r2          div -> I9
I9      st r1,M(&e)           st M(&e)

  • 3 approaches:
  • An MIMD may execute two streams: (1) I1-I3 (2) I4-I9
    • No dependencies between streams; in practice communication and synchronization required between streams
  • A superscalar issues multiple instructions from sequential stream
    • Obey dependencies (True and name dependencies)
    • Reverse engineering of DDG needed at run-time
  • Dataflow code is direct representation of DDG
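
The "reverse engineering of the DDG" that a superscalar performs can be sketched as follows. The code derives the true (read-after-write) dependences from the sequential code and then groups instructions into independent streams, as an MIMD machine could use them. The tuple encoding of the instructions and the helper names are assumptions made for this example; name dependences on r1 are ignored, as if registers were renamed.

```python
# (name, operation, destination register or None, source registers)
CODE = [
    ("I1", "ld",   "r1", []),
    ("I2", "addi", "r1", ["r1"]),
    ("I3", "st",   None, ["r1"]),
    ("I4", "ld",   "r1", []),
    ("I5", "muli", "r1", ["r1"]),
    ("I6", "st",   None, ["r1"]),
    ("I7", "ld",   "r2", []),
    ("I8", "div",  "r1", ["r1", "r2"]),
    ("I9", "st",   None, ["r1"]),
]

def true_dependences(code):
    """Return RAW edges (producer, consumer) in program order."""
    last_writer = {}
    edges = []
    for name, _op, dst, srcs in code:
        for reg in srcs:
            if reg in last_writer:
                edges.append((last_writer[reg], name))
        if dst is not None:
            last_writer[dst] = name
    return edges

def independent_streams(code, edges):
    """Group instructions that are connected through dependence edges."""
    group = {name: {name} for name, *_ in code}
    for a, b in edges:
        if group[a] is not group[b]:
            merged = group[a] | group[b]
            for n in merged:
                group[n] = merged
    streams = []
    for name, *_ in code:
        if group[name] not in streams:
            streams.append(group[name])
    return [sorted(s) for s in streams]

EDGES = true_dependences(CODE)
STREAMS = independent_streams(CODE, EDGES)
```

The edges match the arrows in the dataflow code (for instance I5 feeds both I6 and I8), and the two resulting streams are I1-I3 and I4-I9.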

Multiple Instruction Issue: Data flow processor

[Figure: data flow processor built from token matching, token store, instruction generate, instruction store, reservation stations and function units FU-1, FU-2, ..., FU-K; result tokens are fed back.]

Instruction Pipeline Overview

[Figure: pipeline diagrams per architecture. CISC (no pipelining): IF, DC, RF, EX, WB. RISC: IF, DC/RF, EX, WB. Superscalar: k parallel lanes IF1..IFk, DC1..DCk, ISSUE, RF1..RFk, EX1..EXk, ROB, WB1..WBk. Superpipelined: IF1, IF2, ..., IFs, DC, RF, EX1, EX2, ..., EX5, WB. Dataflow and VLIW: k parallel lanes RF1..RFk, EX1..EXk, WB1..WBk; the VLIW lanes follow a single IF and DC.]

Four dimensional representation of the architecture design space <I, O, D, S>

[Figure: CISC, RISC, superscalar, superpipelined, VLIW, vector, SIMD, MIMD and dataflow architectures placed along four axes: Instructions/cycle 'I', Operations/instruction 'O', Data/operation 'D' and Superpipelining degree 'S', with axis ticks at 0.1, 10 and 100.]

Architecture design space

Typical values of K (# of functional units or processor nodes), and <I, O, D, S> for different architectures

Architecture      K     I     O     D     S     Mpar
CISC              1     0.2   1.2   1.1   1     0.26
RISC              1     1     1     1     1.2   1.2
VLIW              10    1     10    1     1.2   12
Superscalar       3     3     1     1     1.2   3.6
Superpipelined    1     1     1     1     3     3
Vector            7     0.1   1     64    5     32
SIMD              128   1     1     128   1.2   154
MIMD              32    32    1     1     1.2   38
Dataflow          10    10    1     1     1.2   12

$S(\text{architecture}) = \sum_{Op \in I\_set} f(Op) \cdot lt(Op)$

$M_{par} = I \cdot O \cdot D \cdot S$
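
The Mpar column can be checked directly against the formula. The sketch below recomputes it from the I, O, D and S values in the table; the dictionary layout and function name are just one convenient encoding.

```python
# name: (K, I, O, D, S), values copied from the table above
ARCHITECTURES = {
    "CISC":           (1,   0.2, 1.2, 1.1, 1),
    "RISC":           (1,   1,   1,   1,   1.2),
    "VLIW":           (10,  1,   10,  1,   1.2),
    "Superscalar":    (3,   3,   1,   1,   1.2),
    "Superpipelined": (1,   1,   1,   1,   3),
    "Vector":         (7,   0.1, 1,   64,  5),
    "SIMD":           (128, 1,   1,   128, 1.2),
    "MIMD":           (32,  32,  1,   1,   1.2),
    "Dataflow":       (10,  10,  1,   1,   1.2),
}

def mpar(i, o, d, s):
    """Parallelism measure: Mpar = I * O * D * S."""
    return i * o * d * s

MPAR = {name: mpar(*values[1:]) for name, values in ARCHITECTURES.items()}

if __name__ == "__main__":
    for name, value in MPAR.items():
        print(f"{name:15s} {value:8.2f}")
```

The computed values agree with the table after rounding (for example 0.264 for CISC and 153.6 for SIMD).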

Overview
  • Enhance performance: architecture methods
  • Instruction Level Parallelism (ILP)
    • limits on ILP
  • VLIW
    • Examples
  • Clustering
  • Code generation
  • Hands-on

General organization of an ILP architecture

[Figure: CPU containing an instruction fetch unit, an instruction decode unit, function units FU-1 to FU-5, a bypassing network and a register file, connected to instruction memory and data memory.]

Motivation for ILP
  • Increasing VLSI densities; decreasing feature size
  • Increasing performance requirements
  • New application areas, like
    • multi-media (image, audio, video, 3-D, holographic)
    • intelligent search and filtering engines
    • neural, fuzzy, genetic computing
  • More functionality
  • Use of existing Code (Compatibility)
  • Low Power: P = fCVdd2

Low power through parallelism
  • Sequential Processor
    • Switching capacitance C
    • Frequency f
    • Voltage V
    • P = fCV2
  • Parallel Processor (two times the number of units)
    • Switching capacitance 2C
    • Frequency f/2
    • Voltage $V' < V$
    • $P = (f/2) \cdot 2C \cdot V'^2 = f C V'^2$

Measuring and exploiting available ILP
  • How much ILP is there in applications?
  • How to measure parallelism within applications?
    • Using existing compiler
    • Using trace analysis
      • Track all the true data dependences (RAW) of instructions in the issue window
        • register dependence
        • memory dependence
      • Check for correct branch prediction
        • if prediction correct continue
        • if wrong, flush schedule and start in next cycle

Trace analysis

Program

For i := 0..2
  A[i] := i;
S := X+3;

Compiled code

      set r1,0
      set r2,3
      set r3,&A
Loop: st r1,0(r3)
      add r1,r1,1
      add r3,r3,4
      brne r1,r2,Loop
      add r1,r5,3

Trace

set r1,0
set r2,3
set r3,&A
st r1,0(r3)
add r1,r1,1
add r3,r3,4
brne r1,r2,Loop
st r1,0(r3)
add r1,r1,1
add r3,r3,4
brne r1,r2,Loop
st r1,0(r3)
add r1,r1,1
add r3,r3,4
brne r1,r2,Loop
add r1,r5,3

How parallel can you execute this code?

Trace analysis

Parallel Trace

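Cycle 1: set r1,0 | set r2,3 | set r3,&A
Cycle 2: st r1,0(r3) | add r1,r1,1 | add r3,r3,4
Cycle 3: st r1,0(r3) | add r1,r1,1 | add r3,r3,4 | brne r1,r2,Loop
Cycle 4: st r1,0(r3) | add r1,r1,1 | add r3,r3,4 | brne r1,r2,Loop
Cycle 5: brne r1,r2,Loop
Cycle 6: add r1,r5,3

Max ILP = Speedup = $L_{serial} / L_{parallel}$ = 16 / 6 = 2.7 (16 instructions in the serial trace, 6 cycles in the parallel trace)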
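
The parallel trace can be derived mechanically. The sketch below schedules each trace instruction in the cycle after its last producer (true dependences only, unit latency, registers renamed) and restarts after a mispredicted branch. It assumes the loop branch is always predicted taken, so only the final, not-taken execution is mispredicted, and it ignores memory dependences because the three stores go to different addresses. These modelling choices are assumptions, not statements from the slides.

```python
def build_trace(iterations=3):
    """Trace entries: (text, destination register, source registers, mispredicted)."""
    trace = [
        ("set r1,0", "r1", [], False),
        ("set r2,3", "r2", [], False),
        ("set r3,&A", "r3", [], False),
    ]
    for it in range(iterations):
        last = it == iterations - 1
        trace += [
            ("st r1,0(r3)", None, ["r1", "r3"], False),
            ("add r1,r1,1", "r1", ["r1"], False),
            ("add r3,r3,4", "r3", ["r3"], False),
            # Assumed predictor: always "taken", wrong only on loop exit.
            ("brne r1,r2,Loop", None, ["r1", "r2"], last),
        ]
    trace.append(("add r1,r5,3", "r1", ["r5"], False))
    return trace

def schedule(trace):
    """Return the issue cycle of every trace instruction."""
    ready = {}    # register -> cycle in which its current value is produced
    barrier = 0   # cycle of the last mispredicted branch
    cycles = []
    for _text, dst, srcs, mispredicted in trace:
        cycle = max([ready.get(r, 0) for r in srcs] + [barrier]) + 1
        if dst is not None:
            ready[dst] = cycle
        if mispredicted:
            barrier = cycle  # later instructions start in the next cycle
        cycles.append(cycle)
    return cycles

TRACE = build_trace()
CYCLES = schedule(TRACE)
ILP = len(TRACE) / max(CYCLES)
```

Under these assumptions the schedule reproduces the six-cycle parallel trace above, giving 16 / 6, or about 2.7 instructions per cycle.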

Ideal Processor

Assumptions for ideal/perfect processor:

1. Register renaming: infinite number of virtual registers => all register WAW & WAR hazards avoided

2. Branch and jump prediction: perfect => all program instructions available for execution

3. Memory-address alias analysis: all addresses are known. A store can be moved before a load provided their addresses are not equal

Also:

  • unlimited number of instructions issued/cycle (unlimited resources), and
  • unlimited instruction window
  • perfect caches
  • 1 cycle latency for all instructions (including FP * and /)

Programs were compiled using MIPS compiler with maximum optimization level

Upper Limit to ILP: Ideal Processor

[Figure: IPC per benchmark on the ideal processor. Integer: 18 - 60; FP: 75 - 150.]

Window Size and Branch Impact
  • Change from infinite window to examine 2000 and issue at most 64 instructions per cycle

[Figure: IPC per benchmark for five branch prediction schemes: Perfect, Tournament, BHT (512), Profile, No prediction. FP: 15 - 45; Integer: 6 - 12.]

Limiting nr. of Renaming Registers
  • Changes: 2000 instr. window, 64 instr. issue, 8K 2-level predictor (slightly better than tournament predictor)

[Figure: IPC per benchmark for Infinite, 256, 128, 64 and 32 renaming registers. FP: 11 - 45; Integer: 5 - 15.]

Memory Address Alias Impact
  • Changes: 2000 instr. window, 64 instr. issue, 8K 2-level predictor, 256 renaming registers

[Figure: IPC per benchmark for four alias analysis models: Perfect, Global/stack perfect, Inspection, None. FP: 4 - 45 (Fortran, no heap); Integer: 4 - 9.]

Reducing Window Size
  • Assumptions: Perfect disambiguation, 1K Selective predictor, 16 entry return stack, 64 renaming registers, issue as many as window

[Figure: IPC per benchmark for window sizes Infinite, 256, 128, 64, 32, 16, 8 and 4. FP: 8 - 45; Integer: 6 - 12.]

How to Exceed ILP Limits of This Study?
  • WAR and WAW hazards through memory: eliminated WAW and WAR hazards through register renaming, but not in memory
  • Unnecessary dependences
    • the compiler did not unroll loops, so the dependence on the iteration variable remains
  • Overcoming the data flow limit: value prediction, predicting values and speculating on prediction
    • Address value prediction and speculation predicts addresses and speculates by reordering loads and stores. Could provide better aliasing analysis

Conclusions
  • Amount of parallelism is limited
    • higher in Multi-Media and Signal Processing appl.
    • higher in kernels
  • Trace analysis detects all types of parallelism
    • task, data and operation types
  • Detected parallelism depends on
    • quality of compiler
    • hardware
    • source-code transformations

Overview
  • Enhance performance: architecture methods
  • Instruction Level Parallelism
  • VLIW
    • Examples
      • C6
      • TM
      • IA-64: Itanium, ....
      • TTA
  • Clustering
  • Code generation
  • Hands-on

VLIW: general concept

[Figure: a VLIW architecture with 7 FUs. Instruction memory feeds the instruction register, which controls the function units (three Int FU, two LD/ST, two FP FU); the datapath also contains an Int register file, a floating point register file and data memory.]

VLIW characteristics
  • Multiple operations per instruction
  • One instruction per cycle issued (at most)
  • Compiler is in control
  • Only RISC like operation support
    • Short cycle times
    • Easier to compile for
  • Flexible: Can implement any FU mixture
  • Extensible / Scalable

However:

  • tight inter FU connectivity required
  • not binary compatible !!
    • (new long instruction format)
  • low code density

VelociTI C6x datapath (figure)

VLIW example: TMS320C62

TMS320C62 VelociTI Processor

  • 8 operations (of 32-bit) per instruction (256 bit)
  • Two clusters
    • 8 FUs: 4 FUs per cluster (2 multipliers, 6 ALUs)
    • 2 x 16 registers
    • One bus available to write in register file of other cluster
  • Flexible addressing modes (like circular addressing)
  • Flexible instruction packing
  • All instructions conditional
  • Originally: 5 ns, 200 MHz, 0.25 um, 5-layer CMOS
  • 128 KB on-chip RAM


VLIW example: Philips TriMedia TM1000

[Figure: TM1000 organization. The PC addresses the instruction cache (32 kB), which fills the instruction register (5 issue slots); the slots drive five execution units that share the register file (128 regs, 32 bit, 15 ports) and the data cache (16 kB).]

Functional units available to the execution units:

  • 5 constant
  • 5 ALU
  • 2 memory
  • 2 shift
  • 2 DSP-ALU
  • 2 DSP-mul
  • 3 branch
  • 2 FP ALU
  • 2 Int/FP ALU
  • 1 FP compare
  • 1 FP div/sqrt

Intel EPIC Architecture IA-64

Explicitly Parallel Instruction Computing (EPIC)

  • IA-64 architecture -> Itanium, first realization 2001

Register model:

  • 128 64-bit integer registers, stacked, rotating
  • 128 82-bit floating point, rotating
  • 64 1-bit boolean
  • 8 64-bit branch target address
  • system control registers

See http://en.wikipedia.org/wiki/Itanium

EPIC Architecture: IA-64
  • Instructions grouped in 128-bit bundles
    • 3 * 41-bit instruction
    • 5 template bits, indicate type and stop location
  • Each 41-bit instruction
    • starts with 4-bit opcode, and
    • ends with 6-bit guard (boolean) register-id
  • Supports speculative loads
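
The field widths add up to exactly one bundle: 3 x 41 + 5 = 128 bits. The sketch below packs and unpacks such a bundle. The bit positions (template in the lowest 5 bits, opcode in the top 4 bits of a slot, guard id in its lowest 6 bits) are an assumption for this example, since the slides only give the field widths; all function names are illustrative.

```python
SLOT_BITS = 41
TEMPLATE_BITS = 5
SLOT_MASK = (1 << SLOT_BITS) - 1

def pack_bundle(template, slots):
    """Combine a 5-bit template and three 41-bit instructions into 128 bits."""
    if not 0 <= template < (1 << TEMPLATE_BITS):
        raise ValueError("template must fit in 5 bits")
    if len(slots) != 3:
        raise ValueError("a bundle holds exactly three instructions")
    bundle = template
    for n, slot in enumerate(slots):
        if not 0 <= slot <= SLOT_MASK:
            raise ValueError("instruction must fit in 41 bits")
        bundle |= slot << (TEMPLATE_BITS + n * SLOT_BITS)
    return bundle

def unpack_bundle(bundle):
    """Split a 128-bit bundle back into (template, [slot0, slot1, slot2])."""
    template = bundle & ((1 << TEMPLATE_BITS) - 1)
    slots = [(bundle >> (TEMPLATE_BITS + n * SLOT_BITS)) & SLOT_MASK
             for n in range(3)]
    return template, slots

def slot_fields(slot):
    """Return (4-bit opcode, 6-bit guard register id) of a 41-bit instruction."""
    opcode = slot >> (SLOT_BITS - 4)  # assumed: opcode in the top 4 bits
    guard = slot & 0x3F               # assumed: guard id in the low 6 bits
    return opcode, guard

example_slots = [(0x8 << 37) | 5, (0x1 << 37) | 0, SLOT_MASK]
example_bundle = pack_bundle(0x10, example_slots)
```

A 6-bit guard id is enough to address the 64 one-bit boolean registers of the register model.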

Itanium organization (figure)

Itanium 2: McKinley (figure)

EPIC Architecture: IA-64
  • EPIC allows for more binary compatibility than a plain VLIW:
    • Function unit assignment performed at run-time
    • Lock (stall) when FU results are not available
  • See other website (course 5MD00) for more info on IA-64:
    • www.ics.ele.tue.nl/~heco/courses/ACA
    • (look at related material)


VLIW evaluation

[Figure: the general ILP organization (instruction memory, instruction fetch unit, instruction decode unit, FU-1 to FU-5, bypassing network, register file, data memory) annotated with the control problem and with the complexities $O(N^2)$ and $O(N)$-$O(N^2)$ for N function units.]

Strong points of VLIW:

    • Scalable (add more FUs)
    • Flexible (an FU can be almost anything; e.g. multimedia support)

Weak points:

  • With N FUs:
    • Bypassing complexity: $O(N^2)$
    • Register file complexity: $O(N)$
    • Register file size: $O(N^2)$
  • Register file design restricts FU flexibility

Solution: .................................................. ?

Solution

TTA: Transport Triggered Architecture

Mirroring the Programming Paradigm

[Figure: dataflow graph built from +, -, >, * and st operations.]

Transport Triggered Architecture

General organization of a TTA

[Figure: CPU with instruction memory, instruction fetch unit, instruction decode unit, FU-1 to FU-5, bypassing network, register file and data memory.]


TTA structure; datapath details

[Figure: two load/store units, two integer ALUs, a float ALU, integer, float and boolean register files, an instruction unit and an immediate unit, each attached through sockets; data memory and instruction memory are connected to the load/store and instruction units.]

TTA hardware characteristics
  • Modular: building blocks easy to reuse
  • Very flexible and scalable
    • easy inclusion of Special Function Units (SFUs)
  • Very low complexity
    • > 50% reduction on # register ports
    • reduced bypass complexity (no associative matching)
    • up to 80 % reduction in bypass connectivity
    • trivial decoding
    • reduced register pressure
    • easy register file partitioning (a single port is enough!)

TTA software characteristics

add r3, r1, r2

becomes

  • r1 -> add.o1;
  • r2 -> add.o2;
  • add.r -> r3

That does not look like an improvement !?!

[Figure: adder FU with operand ports o1 and o2 and result port r.]

  • More difficult to schedule !
  • But: extra scheduling optimizations

Program TTAs

[Figure: FU pipeline with operand and trigger inputs, an internal stage and a result output.]

  • How to do data operations ?
  • 1. Transport of operands to FU
    • Operand move (s)
    • Trigger move
  • 2. Transport of results from FU
    • Result move (s)

Example: Add r3,r1,r2 becomes

r1 -> Oint     // operand move to integer unit
r2 -> Tadd     // trigger move to integer unit
………….          // addition operation in progress
Rint -> r3     // result move from integer unit

How to do control flow ?

1. Jumps: #jump-address -> pc
2. Branch: #displacement -> pcd
3. Call: pc -> r; #call-address -> pcd

Scheduling example

VLIW

add r1,r1,r2
sub r4,r1,95

TTA

r1 -> add.o1, r2 -> add.o2
add.r -> sub.o1, 95 -> sub.o2
sub.r -> r4

[Figure: datapath with a load/store unit, two integer ALUs, an integer RF and an immediate unit.]
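
The TTA schedule of this example can be stepped through with a tiny move interpreter. It is only a sketch: it assumes that o1 is a plain operand port, that a write to o2 triggers the operation, and that results are available to the next instruction (no FU latency is modelled). The input values for r1 and r2 are made up.

```python
import operator

OPS = {"add": operator.add, "sub": operator.sub}

def run_moves(instructions, registers):
    """Execute a list of TTA instructions, each a list of (source, destination) moves."""
    regs = dict(registers)
    fus = {name: {"o1": 0, "r": 0} for name in OPS}

    def read(src):
        if isinstance(src, int):          # immediate value
            return src
        if "." in src:                    # FU result port, e.g. "add.r"
            unit, port = src.split(".")
            return fus[unit][port]
        return regs[src]                  # register

    def write(dst, value):
        if "." not in dst:
            regs[dst] = value
            return
        unit, port = dst.split(".")
        if port == "o1":
            fus[unit]["o1"] = value
        else:                             # o2 is the trigger port: operation fires
            fus[unit]["r"] = OPS[unit](fus[unit]["o1"], value)

    for moves in instructions:
        # All moves of one instruction read their sources in parallel.
        pairs = [(read(src), dst) for src, dst in moves]
        # Apply operand moves before trigger moves of the same instruction.
        pairs.sort(key=lambda p: p[1].endswith(".o2"))
        for value, dst in pairs:
            write(dst, value)
    return regs

PROGRAM = [
    [("r1", "add.o1"), ("r2", "add.o2")],
    [("add.r", "sub.o1"), (95, "sub.o2")],
    [("sub.r", "r4")],
]
RESULT = run_moves(PROGRAM, {"r1": 10, "r2": 5})
```

The sum travels straight from add.r to sub.o1 without passing through the register file. Note that the TTA code never writes the sum back to r1, which is only valid if r1 is not needed afterwards; the example appears to assume that.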

TTA Instruction format

General MOVE instructions: multiple fields

| move 1 | move 2 | move 3 | move 4 |

General MOVE field: | g | i | src | dst |

g : guard specifier
i : immediate specifier
src : source
dst : destination

How to use immediates?

[Figure: field layouts for immediates. Small, 6 bits: g, 1, imm, dst. Long, 32 bits: g, 0, Ir-1, dst, and imm.]

Programming TTAs

How to do conditional execution

Each move is guarded

Example

r1 -> cmp.o1     // operand move to compare unit
r2 -> cmp.o2     // trigger move to compare unit
cmp.r -> g       // put result in boolean register g
g: r3 -> r4      // guarded move takes place when r1 = r2

Register file port pressure for TTAs (figure)

Summary of TTA Advantages
  • Better usage of transport capacity
    • Instead of 3 transports per dyadic operation, about 2 are needed
    • # register ports reduced by at least 50%
    • Inter FU connectivity reduced by 50-70%
      • No full connectivity required
  • Both the transport capacity and # register ports become independent design parameters; this removes one of the major bottlenecks of VLIWs
  • Flexible: FUs can incorporate arbitrary functionality
  • Scalable: #FUs, #reg.files, etc. can be changed
  • FU splitting results in extra exploitable concurrency
  • TTAs are easy to design and can have short cycle times

TTA automatic DSE

[Figure: design space exploration in the Move framework. With user interaction, an optimizer passes architecture parameters to a parametric compiler (producing parallel object code) and a hardware generator (producing the chip); both return feedback to the optimizer. The evaluated design points (x) are plotted as exec. time versus cost, with the Pareto curve bounding the solution space.]

Overview
  • Enhance performance: architecture methods
  • Instruction Level Parallelism
  • VLIW
  • Examples
    • C6
    • TM
    • TTA
  • Clustering and Reconfigurable components
  • Code generation
  • Hands-on

Clustered VLIW

[Figure: clustered VLIW. A level 1 instruction cache feeds three loop buffers; each loop buffer drives a group of three FUs with its own register file; the register files connect to the level 1 data cache, and a level 2 (shared) cache backs both level 1 caches.]
  • Clustering = splitting up the VLIW data path (the same can be done for the instruction path)

Clustered VLIW

Why clustering?

  • Timing: faster clock
  • Lower Cost
    • silicon area
    • time to market (T2M)
  • Lower Energy

What’s the disadvantage?

Want to know more? See the PhD thesis of Andrei Terechko.

Fine-Grained reconfigurable: Xilinx XC4000 FPGA

[Figure: XC4000 structure with programmable interconnect, I/O blocks (IOBs) and configurable logic blocks (CLBs). IOB detail: pad, input buffer, output buffer, D flip-flops, delay, slew rate control and passive pull-up/pull-down. CLB detail: function generators F (inputs F1 to F4), G (inputs G1 to G4) and H, two D flip-flops with S/R control and clock enable (EC), control inputs C1 to C4 (H1, DIN, S/R, EC), clock K and outputs X and Y.]

Coarse-Grained reconfigurable: Chameleon CS2000
  • Highlights:
  • 32-bit datapath (ALU/Shift)
  • 16x24 Multiplier
  • distributed local memory
  • fixed timing

Recent Coarse Grain Reconfigurable Architectures
  • SmartCell 2009
    • read http://www.hindawi.com/journals/es/2009/518659.html
  • Montium (reconfigurable VLIW)
  • RAPID
  • NIOS II
  • RAW
  • PicoChip
  • PACT XPP64
  • many more ….

Hybrid FPGAs: Virtex II-Pro

[Figure, courtesy of Xilinx (Virtex II Pro): GHz I/O with up to 16 serial transceivers, memory blocks, PowerPC cores and reconfigurable logic blocks.]

Xilinx Zynq with 2 ARM processors (figure)

Granularity Makes Differences (figure)


HW or SW reconfigurable?

[Figure: reconfiguration time (from 1 cycle up to reset) plotted against data path granularity (fine to coarse), with configuration bandwidth as a further dimension. FPGA (spatial mapping) and VLIW (temporal mapping) are placed in this space, together with loop buffer, context and subword parallelism.]