
Real-time Signal Processing on Embedded Systems

Advanced Cutting-edge Research Seminar I&III

Architectural improvements of microprocessors
  • Pipelining
  • Parallel processing exploiting ILP
    • Superscalar
    • VLIW
  • SIMD
Procedure of instruction execution on a processor
  • Instruction Fetch (IF)
    • fetches an instruction from main memory.
  • Instruction Decode (ID)
    • decodes fetched instruction
  • Execution (EX)
    • executes decoded instruction
  • Memory Access (MA)
    • accesses main memory
  • Write Back (WB)
    • writes data back to registers
Operation cycles on a processor
  • Single-cycle machine
    • This kind of machine executes all stages from IF to WB in one cycle.
    • The cycle time is determined by the slowest instruction, because every instruction must complete within a single cycle.
  • Multi-cycle machine
    • This kind of machine executes an instruction over several cycles.

[Figure: the five stages IF, ID, EX, MA, WB executed one per cycle]

Pipelined operation
  • Pipelining can improve the throughput of instructions.

[Figure: pipeline timing diagram. Successive instructions enter the pipeline one cycle apart, so their IF, ID, EX, MA, and WB stages overlap.]

To realize pipelined operation, several techniques are required.

Causes of pipeline hazards
  • Structural hazard: the hardware cannot cope with the combination of instructions issued in the same cycle.
  • Data hazard: a later instruction must wait for the completion of an earlier instruction because it uses the earlier instruction's result.
  • Control hazard: whether an instruction is executed depends on the result of a preceding (branch) instruction.
Structural hazard

[Figure: a CPU (PC, instruction register, instruction decoder, ALU, registers) connected to a single memory, with several instructions in flight in the IF, ID, EX, MA, WB pipeline. The MA stage of an earlier instruction and the IF stage of a later instruction access the memory in the same cycle, causing a conflict.]

  • Resolve 1: to stall the next instruction.
  • Resolve 2: to add another data bus to access the instruction memory, i.e. separate instruction and data memories (Harvard architecture).

Data hazard

[Figure: the CPU executes the sequence

add $s0,$t0,$t1   ($s0=$t0+$t1)
sub $t2,$s0,$t3   ($t2=$s0-$t3)

with initial register values $t0=5, $t1=4, $t2=3, $t3=2, $t4=1 and $s0-$s4=0. The sub enters the pipeline before the add has written $s0 back, so it reads the stale value $s0=0 and computes -2 = 0 - 2 instead of the correct result.]

  • Waiting by stalls: the sub is delayed until the add writes $s0 back, consuming 3 cycles.
  • Resolve: forwarding. The result of the add ($s0=9) is forwarded directly to the ALU, so the sub computes 7 = 9 - 2 without waiting for write-back.
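A rough C sketch of the forwarding decision (a simplified pipeline-simulator view; the structure and field names are illustrative assumptions, not part of the lecture):

typedef struct { int dst; int result; } ExMemReg;      /* pipeline register after EX   */
typedef struct { int src1, src2; int a, b; } IdExReg;  /* operands entering EX         */

/* If the previous instruction writes a register that the current one reads,
   take the in-flight result instead of the stale register-file value. */
void forward(IdExReg *cur, const ExMemReg *prev)
{
    if (prev->dst == cur->src1) cur->a = prev->result;
    if (prev->dst == cur->src2) cur->b = prev->result;
}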

Control hazard

An instruction sequence including a branch (in this explanation the PC uses word addresses for simplicity):

add $s0,$t0,$t1   ($s0=$t0+$t1)             PC=10
beq $s1,$s2, 40   (if($s1==$s2){goto 40})   PC=11
or $s3,$s4,$t2    ($s3=$s4|$t2)             PC=12

[Figure: the PC of the instruction following the beq depends on the branch condition: if the branch is taken, PC=40; if not taken, PC=12. The condition is not known until the branch has advanced through the pipeline.]

  • Resolve 1: stall
    • 2-cycle stall; the number of required stall cycles is determined by the architecture.
    • 1-cycle stall if the processor can calculate the branch target address at the ID stage.
  • Resolve 2: branch prediction
    • In this example, the next PC is predicted as if the branch is always untaken, so the instructions at PC=11 and PC=12 are fetched without stalling.
    • If the prediction misses, in other words if the branch is taken, the wrongly fetched instruction is cancelled with a stall and fetching continues from PC=40.

Control hazard
  • More practical scheme: dynamic branch prediction
    • n-bit counter-based prediction: the lower i bits of the address of a branch instruction index a Branch History Table of n-bit saturating up/down counters.

1-bit counter-based prediction

[Figure: a two-state counter. In state 1 the branch is predicted taken, in state 0 it is predicted untaken; a taken branch sets the counter to 1, an untaken branch sets it to 0.]

2-bit counter-based prediction

[Figure: a four-state saturating counter with states 11, 10, 01, 00. States 11 and 10 predict the branch will be taken; states 01 and 00 predict it will be untaken. A taken branch moves the counter toward 11, an untaken branch moves it toward 00.]

This scheme is adopted in the Intel Pentium, Sun UltraSPARC, MIPS R10000, etc.
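A minimal C sketch of such a 2-bit saturating-counter predictor (the table size and index width are illustrative assumptions):

#include <stdbool.h>
#include <stdint.h>

#define BHT_BITS 10                      /* use the lower 10 bits of the branch address */
#define BHT_SIZE (1u << BHT_BITS)

static uint8_t bht[BHT_SIZE];            /* 2-bit counters, values 0..3 */

/* Predict taken when the counter is in state 10 (2) or 11 (3). */
bool predict(uint32_t branch_pc) {
    return bht[branch_pc & (BHT_SIZE - 1)] >= 2;
}

/* Update: saturate toward 3 when the branch is taken, toward 0 when it is not. */
void update(uint32_t branch_pc, bool taken) {
    uint8_t *c = &bht[branch_pc & (BHT_SIZE - 1)];
    if (taken  && *c < 3) (*c)++;
    if (!taken && *c > 0) (*c)--;
}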

Control hazard
  • Resolve 3: delayed branch
    • An instruction that has no dependency on the branch is inserted just after the branch (between the beq and the or) and is always executed.
    • While the inserted instruction executes, the branch is resolved; the next PC becomes 13 or 40, and the instruction at the determined address is executed.

Exploiting ILP (Instruction Level Parallelism)
  • Superscalar: issuing multiple instructions per cycle with hardware support.
    • Advantage: binary compatibility.
  • VLIW: issuing multiple instructions per cycle with compiler support.
    • Advantage: simple hardware
Types of data dependence
  • True data dependence (RAW: Read After Write)
  • Anti-dependence (WAR: Write After Read)
  • Output dependence (WAW: Write After Write)

True dependence is difficult to remove:

i1: r2=r1+r3
i2: r4=r2+1        (i2 reads the r2 written by i1)

Anti-dependence and output dependence can be removed by register renaming; they are called artificial dependences:

i1: r1=r2+r3
i2: r2=r4+1        (anti-dependence with i1 on r2)
i3: r1=r4+2        (output dependence with i1 on r1)
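As a sketch of how renaming removes these artificial dependences (the physical register names p7 and p8 are hypothetical):

i1: r1=r2+r3
i2: p7=r4+1        (r2 renamed to p7: the anti-dependence with i1 disappears)
i3: p8=r4+2        (r1 renamed to p8: the output dependence with i1 disappears)

Later readers of r2 and r1 are redirected to p7 and p8, so the three instructions can be executed in any order.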

Basic Architecture of Superscalar Processor

[Figure: block diagram. The frontend (instruction cache, instruction decode, branch prediction, register renaming) dispatches instructions into the instruction window. The ex-core issues them to the function units, which use the registers and the data cache. The backend reorders the results in the reorder buffer and commits them.]

Basic function of Frontend
  • provides enough instructions.
  • predicts the next instruction address when a branch instruction appears.
  • resolves artificial dependences by register renaming.
  • analyzes true data dependence after register renaming.
  • transfers instructions after the above operations.
    • This operation is called “dispatch”.
Basic function of Ex-core
  • finds as many independent instructions as possible in the "instruction window".
    • In this operation, dynamic scheduling is performed to resolve several restrictions: data dependences, resources, predefined priorities, etc.
  • executes independent instructions in parallel.
    • An operation that transfers an instruction to a function unit is called “issue”.
Basic function of Backend
  • updates the processor state.
    • Results obtained out of order are reordered into the original program order.
    • The processor state is updated precisely.
      • Updating the processor state based on an execution result is called “commit”.
      • Removing a finished instruction from the processor is called “retire”.
Dynamic instruction scheduling
  • Instruction scheduling determines the order in which instructions are issued and when they are issued.
  • In superscalar processors, dynamic instruction scheduling is performed using instructions stored in the instruction buffer.

In the following slides, dynamic scheduling will be explained using several types of processors: a 1-way in-order processor, an i-way in-order processor, and an i-way out-of-order processor.

1-way in-order issue
  • The number of instructions issued per cycle is at most 1.
  • The size of the instruction window is 1, because no subsequent instruction can be issued while an earlier instruction is blocked.
  • Only true and output dependences need to be checked, because anti-dependences are always resolved.
Control by R flag
  • R flag is used to check true and output dependences.

[Figure: each register has an R flag and a value field. An instruction holds op, dst, src1, and src2 register numbers and reads the R flag and value of each register it uses.]

R==false means the register is reserved but the result has not been stored yet. In this case, the operand is not available.

Only when R(dst) == true && R(src1) == true && R(src2) == true is the instruction issued. (This condition is called “ready”.)

Update sequence of the R flag
  • The R bit of the destination register becomes false when an instruction is issued.
  • The R bit of the destination register becomes true when the result is stored in the destination.

By the above update:

  • Instructions using unavailable registers as source registers are not issued; true dependence is resolved.
  • Instructions using an unavailable register as the destination register are not issued; output dependence is resolved.

Practically, resource restrictions must also be satisfied to issue instructions, in addition to the dependency check. In this lecture, only the restriction on function units is considered, to simplify the discussion.
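A minimal C sketch of the R-flag check and update described above (the register-file size and structure names are illustrative assumptions):

#include <stdbool.h>

#define NUM_REGS 32

typedef struct { bool r; int value; } Reg;           /* R flag + value                */
typedef struct { int dst, src1, src2; } Inst;        /* register numbers of operands  */

Reg regfile[NUM_REGS];

/* An instruction is "ready" only if dst, src1, and src2 are all available. */
bool ready(const Inst *i) {
    return regfile[i->dst].r && regfile[i->src1].r && regfile[i->src2].r;
}

/* Issue: reserve the destination register (its R bit becomes false). */
void issue(const Inst *i) {
    regfile[i->dst].r = false;
}

/* Write back: store the result and release the destination register. */
void write_back(const Inst *i, int result) {
    regfile[i->dst].value = result;
    regfile[i->dst].r = true;
}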
i-way in-order issue
  • Consider how the following 4 instructions are executed on this processor.

i1: r1 = r5

i2: r2 = r1 + 1

i3: r3 = r6

i4: r4 = r3 +1

IPC becomes 1.3 (4 instructions / 3 cycles).

[Figure: in-order scheduling of i1-i4]

How to check dependency of instructions?
  • True and output dependence must be checked.

[Figure: the i instructions in the instruction window each read the R flag and value of their operands from the register file, requiring 3 × i register ports.]

How to allocate resources (function units)?
  • Allocation of a function unit is performed as follows.
    • Check whether any preceding ready instruction refers to the function unit. If no instruction refers to it, the function unit is available.
    • Repeat the above procedure for every function unit.
Complexity of i-way in-order issue
  • Ready detection
    • ports are required.
    • comparators are required for check of operand dependency.
  • Resource allocation
    • input NOR gate is required.

Complexity increases by

i-way out-of-order issue
  • Out-of-order scheduling of the same code used in the previous i-way in-order case.

i1: r1 = r5

i2: r2 = r1 + 1

i3: r3 = r6

i4: r4 = r3 +1

IPC becomes 2.0 (4 instructions / 2 cycles).

[Figure: out-of-order scheduling of i1-i4]

Architectural requirements for out-of-order execution
  • The depth of the instruction window should be increased.
  • The register file must provide more ports for the dependence check.
  • Anti-dependence must be checked, in addition to the i-way in-order case.
  • Resource allocation can be performed in the same way as the i-way in-order case.
Complexity of i-way out-of-order issue
  • Ready detection
    • ports are required.
    • comparators are required for check of operand dependency.
  • Resource allocation
    • input NOR gate is required.

Increase of hardware complexity is more significant than the in-order case because n>>i in general.

Complexity increases by

Tomasulo’s Algorithm
  • was proposed by R.M. Tomasulo in 1967.
  • was originally adopted in the floating-point unit of the IBM System/360 Model 91.
    • Performance was drastically improved.
  • Similar algorithms are used in the latest microprocessors.
Superscalar arch using Tomasulo

Instructioncache

Frontend

Instruction decode

Branch prediction

Datacache

Tag allocation

dispatch

Ex-core

・・・・・

Registers

・・・・・

Reservation

Station

・・・・・

issue

Function unit

Function unit

・・・・・

Contents of reservation station and register
  • Register
    • Tag is used for register renaming.
  • Reservation station
    • op: opcode
    • dtag: destination tag
    • stag: source tag
    • R: ready flag
    • value: operand’s value

[Figure: a register entry holds R, tag, and value; a reservation-station entry holds op, dtag, and, for each of the two sources, R, stag, and value.]
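A minimal C sketch of these entries (the field names follow the slide; the concrete types are assumptions):

#include <stdbool.h>

typedef struct {            /* register entry                          */
    bool r;                 /* R: value is valid                       */
    int  tag;               /* tag used for register renaming          */
    int  value;
} RegEntry;

typedef struct {            /* one source operand                      */
    bool r;                 /* R: operand value is available           */
    int  stag;              /* source tag (valid while not yet ready)  */
    int  value;
} SrcOperand;

typedef struct {            /* reservation-station entry               */
    int        op;          /* opcode                                  */
    int        dtag;        /* destination tag                         */
    SrcOperand src1, src2;
} RSEntry;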

Operation on the arch
  • Dispatch
  • Issue
  • Execution
  • Finalization
Operation on the arch
  • Dispatch
    • A dtag is assigned to the destination operand from a tag pool that holds unassigned tags.
    • Source operands are obtained by reading the registers using each register number. If R is true, the value is read; otherwise the tag is read from the register.
    • Then, the instruction is stored in a reservation station corresponding to the function unit it uses.
Operation on the arch
  • Issue
    • A ready instruction in a reservation station is issued to the corresponding function unit, if the function unit is available.
    • The issued instruction is deleted from the reservation station.
  • Execution
    • Issued instructions are executed on corresponding function units.
Operation on the arch
  • Finalize
    • Based on the result of execution, the dtag and the result value are broadcast on the result bus.
    • If an instruction holds the broadcast dtag as an stag, its R flag and value are replaced by true and the broadcast result value, respectively.
    • Only if a register holds a tag equal to the broadcast dtag is the broadcast result stored in that register.
    • Finally, the broadcast tag is returned to the tag pool.
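A rough C sketch of this finalize (result-broadcast) step, reusing the RegEntry and RSEntry structures sketched earlier (sizes and names are illustrative assumptions):

#define NUM_RS   8
#define NUM_REGS 32

RSEntry  rs[NUM_RS];
RegEntry regs[NUM_REGS];

/* Broadcast a finished instruction's dtag and result value. */
void finalize(int dtag, int result)
{
    for (int i = 0; i < NUM_RS; i++) {        /* wake up waiting source operands */
        if (!rs[i].src1.r && rs[i].src1.stag == dtag) {
            rs[i].src1.r = true;  rs[i].src1.value = result;
        }
        if (!rs[i].src2.r && rs[i].src2.stag == dtag) {
            rs[i].src2.r = true;  rs[i].src2.value = result;
        }
    }
    for (int r = 0; r < NUM_REGS; r++) {      /* only a register still holding the tag is updated */
        if (!regs[r].r && regs[r].tag == dtag) {
            regs[r].r = true;  regs[r].value = result;
        }
    }
    /* finally, dtag would be returned to the tag pool (omitted here) */
}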
An example of Tomasulo
  • The superscalar processor used in this example has the following 5-stage pipeline, and the number of ways is 2.
    • IF: fetches 2 instructions.
    • ID: decodes, allocates tags, and dispatches.
    • RS: waits for operands until the instruction becomes ready.
    • EX: executes an instruction.
    • WB: writes a result.

i1: r1 = load A

i2: r2 = r1 + 3

i3: r3 = r2 + 1

i4: r4 = load B

#A and B are const

[Figures: cycles 0-6 show, for each cycle, the state of instructions i1-i4, the registers, and the tag pool as the instructions move through the pipeline.]

Problem of out-of-order execution
  • It is difficult to update the processor state precisely if an exception occurs.

[Figure: in-order execution vs. out-of-order execution of the same code when an exception occurs]

Flow of exception handling
  • Unfinished instructions, including the one that caused the exception, are invalidated.
  • Control is transferred to the OS, which saves the current state to main memory and handles the exception.
  • After the exception has been handled, the CPU re-executes the instruction that caused it.
Problem of out-of-order execution
  • It is difficult to update the processor state precisely if an exception occurs.
  • In-order execution:
    • Save the current state.
    • The OS handles the exception.
    • The CPU restarts from i3 (the instruction that caused the exception).

Problem of out-of-order execution
  • It is difficult to update the processor state precisely if an exception occurs.
  • Out-of-order execution: when the current state is saved,
    • i5 has finished before i3,
    • i1 has not finished, and
    • the data of r3 has been lost.
  • The OS handles the exception, but the CPU cannot restart from i3.

A reorder buffer is used for precise exception handling.

Reorder buffer
  • Updates CPU’s state in the original program order by reordering results.
  • Handles exception at the state update.

[Figure: results and exception information enter the reorder buffer, which stores them in the original program order, detects exceptions, and commits the results to the registers.]
Superscalar arch using Tomasulo and reorder buffer

Instructioncache

Frontend

Instruction decode

Branch prediction

Datacache

Tag allocation

dispatch

・・・・・

Backend

Registers

Reorder Buffer

・・・・・

commit

Reservation

Station

・・・・・

issue

Ex-core

Function unit

Function unit

・・・・・

Behaviour of reorder buffer
  • If a result without an exception is available, it is stored to a register and the corresponding entry is removed.
  • If a result carries an exception, the pipeline and the reorder buffer are cleared.
  • If a result has not yet arrived, the reorder buffer waits until it is obtained.
Contents of reorder buffer
  • PC: instruction address
  • R: Ready flag
  • dreg: register number of destination
  • dtag: operand tag of destination
  • E: Exception flag
  • result: result

[Figure: a reorder-buffer entry consists of the fields PC | R | dreg | dtag | E | result.]
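A minimal C sketch of a reorder-buffer entry and the commit check described in the previous slide (it reuses the RegEntry structure sketched earlier; names are illustrative assumptions):

#include <stdbool.h>

typedef struct {
    unsigned pc;        /* instruction address                     */
    bool     r;         /* ready flag: result has been written     */
    int      dreg;      /* destination register number             */
    int      dtag;      /* destination operand tag                 */
    bool     e;         /* exception flag                          */
    int      result;
} RobEntry;

/* Try to commit the oldest entry; returns true if it can be retired. */
bool commit(RobEntry *head, RegEntry *regs)
{
    if (!head->r)            /* result not yet available: wait        */
        return false;
    if (head->e) {           /* exception: clear pipeline and buffer  */
        /* exception handling at head->pc (omitted)                   */
        return false;
    }
    regs[head->dreg].value = head->result;   /* update the register   */
    regs[head->dreg].r = true;
    return true;                             /* entry can be removed  */
}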

Operand bypass and supply of source operand tag
  • Tomasulo: operand values are obtained from registers that have the latest values.
  • Reorder buffer: the latest values are stored in the reorder buffer (not in the registers).
  • Procedure of obtaining operands:
    • Check dependences against instructions decoded in the same cycle. If there is a dependence, the stag becomes the dtag of the instruction it depends on.
    • Otherwise, the reorder buffer is searched by source register number to obtain the value (when R=1) or the tag (when R=0). If the reorder buffer has no entry for that register number, the value is obtained from the register file.
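A rough C sketch of this operand lookup, assuming a reorder buffer of RobEntry elements as sketched above, ordered oldest first (the function and parameter names are illustrative):

/* rob[] holds 'count' valid entries, oldest first; newer writes override older ones. */
bool lookup_operand(int reg, const RobEntry rob[], int count,
                    const RegEntry regs[], int *value, int *stag)
{
    for (int i = count - 1; i >= 0; i--) {      /* search newest to oldest   */
        if (rob[i].dreg == reg) {
            if (rob[i].r) { *value = rob[i].result; return true; }
            *stag = rob[i].dtag;                /* not ready: wait on the tag */
            return false;
        }
    }
    *value = regs[reg].value;                   /* no pending write: read the register file */
    return true;
}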
An example of reorder buffer
  • The superscalar processor used in this example has the following 6-stage pipeline, and the number of ways is 2.
    • IF: fetches 2 instructions.
    • ID: decodes, allocates tags, and dispatches.
    • RS: waits for operands until the instruction becomes ready.
    • EX: executes an instruction.
    • WB: writes results to reorder buffer.
    • RT: writes result to registers.
A code used in the example

i1: 0x40: r1 = load A (r0)

i2: 0x44: r2 = r1 + r3

i3: 0x48: r2 = r2 + 16

i4: 0x4C: r5 = load 0 (r1)

i5: 0x50: r1 = r1 + 1

i6: 0x54: r2 = load 0 (r2)

(The values 0x40-0x54 are the instruction addresses.)

[Figures: cycles 0-7 show the state of instructions i1-i6 and the reorder buffer (with its tail pointer) at each cycle; at cycle 7 an exception is detected.]

VLIW (Very Long Instruction Word)
  • In a VLIW processor, the compiler extracts the parallelism in the code. Therefore, the special hardware support used in a superscalar processor becomes unnecessary.
    • Superscalar: dynamic scheduling by hardware support
    • VLIW: static scheduling by compiler
Overview of VLIW

[Figure: in a superscalar processor, the compiler performs code generation and the processor performs scheduling at run time before execution; in a VLIW processor, the compiler performs both code generation and scheduling, and the processor only executes the long instruction words.]

VLIW code

i1: r3=r4+1

i2: r1=load(r2)

i3: r1=r1<<r3

i4: r5=r2+r6

i5:beq r5,L

Sequential code (above)

[Figure: the same instructions packed into VLIW instruction words]
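As an illustration (assuming a three-slot instruction word; the real packing depends on the machine's issue slots), the compiler could bundle the sequential code as follows, since i3 depends on i1 and i2, and i5 depends on i4:

word 1:  i1: r3=r4+1    |  i2: r1=load(r2)  |  i4: r5=r2+r6
word 2:  i3: r1=r1<<r3  |  i5: beq r5,L     |  nop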

Hardware organization of VLIW

[Figure: a long instruction word from the instruction cache drives several function units in parallel (two ALUs, a memory unit, and a branch unit), which share the register file and the data cache.]

Dynamic vs Static scheduling

i1: r1=load A

i2: r2=load(r1)

i3: r3=load B

i4: r4=r3<<r2

i5: r5=r4+1

i6: r6=r2+r5

[Figures: the sample code, its data-dependency graph, an optimal static schedule, and a dynamic schedule.]

Advantage of dynamic scheduling
  • Scheduling based on information that can only be obtained at run time.
    • For example, a cache miss can be hidden.
  • Scheduling based on accurate memory dependences.
    • Data addresses, which are known only at run time, improve the quality of the schedule.
Taxonomy of scheduling algorithm
  • Local scheduling
  • Global scheduling
    • Cyclic scheduling
    • Acyclic scheduling
      • Trace-based scheduling
      • DAG-based (Directed acyclic graph) scheduling
VLIW-based commercial processors
  • Transmeta Crusoe
    • Aimed at mobile computing
  • Texas Instruments TMS320C6x series
    • Embedded applications
  • Intel Itanium
Parallel operation by SIMD
  • What is SIMD? SIMD (Single Instruction, Multiple Data) means that the same operation is applied to several operands at once.
    • Ex: Addition

Sequential operation:

int i;
int a[4]={1,2,3,4};
int b[4]={5,6,7,8};
int c[4];

for (i=0;i<4;i++){
    c[i]=a[i]+b[i];
}

SIMD operation (one instruction performs all four additions at once):

c[0]=a[0]+b[0]
c[1]=a[1]+b[1]
c[2]=a[2]+b[2]
c[3]=a[3]+b[3]

Allocation of vector values
  • Vector values are allocated to memory in the big-endian style as shown in the following figure.

*This figure is adapted from cell.fixstars.com

How to access vector type via normal pointer

vector signed int va = (vector signed int) { 1, 2, 3, 4 };

int *a = (int *) &va;

*This figure is adapted from cell.fixstars.com

How to access a normal array from vector type

int a[8] __attribute__((aligned(16))) = { 1, 2, 3, 4, 5, 6, 7, 8 };

vector signed int *va = (vector signed int *) a;

__attribute__((aligned(16))) forces scalar data to be 16 byte-aligned

*This figure is adapted from cell.fixstars.com

SIMD operation on PPE

int a[4] __attribute__((aligned(16))) = { 1, 2, 3, 4 };

int b[4] __attribute__((aligned(16))) = { 5, 6, 7, 8 };

int c[4] __attribute__((aligned(16)));

vector signed int *va = (vector signed int *) a;

vector signed int *vb = (vector signed int *) b;

vector signed int *vc = (vector signed int *) c;

*vc = vec_add(*va, *vb);

vec_add is a SIMD function provided by VMX (Vector Multimedia Extension), proposed by IBM and Motorola.

Entire code for vector addition

#include <stdio.h>

#include <altivec.h>

int a[4] __attribute__((aligned(16))) = { 1, 2, 3, 4 };

int b[4] __attribute__((aligned(16))) = { 5, 6, 7, 8 };

int c[4] __attribute__((aligned(16)));

int main(int argc, char **argv)

{

vector signed int *va = (vector signed int *) a;

vector signed int *vb = (vector signed int *) b;

vector signed int *vc = (vector signed int *) c;

*vc = vec_add(*va, *vb);

printf("c[0]=%d, c[1]=%d, c[2]=%d, c[3]=%d\n", c[0], c[1], c[2], c[3]);

return 0;

}

How to create dense vector data
  • In general, vector data is not densely stored. Therefore, dense vector data must be created before vector operations.

vc = vec_perm(va, vb, vpat);

*This figure is adapted from cell.fixstars.com
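A small sketch of how vec_perm gathers non-contiguous elements: each byte of the pattern selects one byte from the 32-byte concatenation of the two inputs. The pattern below is an assumed example that picks elements 0 and 2 of va followed by elements 0 and 2 of vb (byte indices are for the big-endian PPE):

vector signed int va = (vector signed int) { 10, 20, 30, 40 };
vector signed int vb = (vector signed int) { 50, 60, 70, 80 };

/* bytes 0-15 select from va, bytes 16-31 select from vb */
vector unsigned char vpat = (vector unsigned char) {
     0,  1,  2,  3,     /* va element 0 */
     8,  9, 10, 11,     /* va element 2 */
    16, 17, 18, 19,     /* vb element 0 */
    24, 25, 26, 27 };   /* vb element 2 */

vector signed int vc = vec_perm(va, vb, vpat);  /* vc = { 10, 30, 50, 70 } */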

Ex of vec_perm: Transpose

*These figures are adapted from cell.fixstars.com

Branch on SIMD

*These figures are adapted from cell.fixstars.com

Procedure of SIMD Branch

*These figures are adapted from cell.fixstars.com

Detail of SIMD Branch

vec_cmpgt(), vec_sel()

*These figures are adapted from cell.fixstars.com

Ex of SIMD Branch

int a[16] = { 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16 };

int b[16] = { 16, 15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1 };

int c[16];

int i;

for (i = 0; i < 16; i++) {

if (a[i] > b[i]) {

c[i] = a[i] - b[i];

} else {

c[i] = b[i] - a[i];

}

}

Ex of SIMD Branch

int a[16] __attribute__((aligned(16))) = { 1, 2, 3, 4, 5, 6, 7, 8,

9, 10, 11, 12, 13, 14, 15, 16 };

int b[16] __attribute__((aligned(16))) = { 16, 15, 14, 13, 12, 11, 10, 9,

8, 7, 6, 5, 4, 3, 2, 1 };

int c[16] __attribute__((aligned(16)));

vector signed int *va = (vector signed int *) a;

vector signed int *vb = (vector signed int *) b;

vector signed int *vc = (vector signed int *) c;

vector signed int vc_true, vc_false;

vector unsigned int vpat;

int i;

for (i = 0; i < 4; i++) {

vpat = vec_cmpgt(va[i], vb[i]);

vc_true = vec_sub(va[i], vb[i]);

vc_false = vec_sub(vb[i], va[i]);

vc[i] = vec_sel(vc_false, vc_true, vpat);

}