Architectures of Digital Information Systems Part 4: Caches, pipelines and superscalar machines

dr.ir. A.C. Verschueren
Eindhoven University of Technology
Section of Digital Information Systems

The memory speed ‘gap’
  • High-performance processors are much too fast for the main memory they are connected to
    • Processors running at 1000 MHz would like a memory read/write cycle time of 1 nanosecond
    • Large memories built from (relatively) cheap RAMs have cycle times on the order of 100 nanoseconds

100 times slower, and this speed gap continues to grow...

[Figure: two remedies. 1) Wide memory words: one read fetches words 0..3 in parallel, after which the processor uses them one by one. 2) Multiple memory 'banks': four overlapping accesses in parallel, supplying words 0..7 in turn.]
Wide words and memory banking
  • The gap can be closed IF the processor tolerates a long delay between the start and end of a cycle

  • Drawbacks: complex timing, and lots of pins

The big IF in closing the gap
  • Long memory access delays can be tolerated IF addresses are known in advance
    • True for sequential instruction reads
    • NOT true for most of the other read operations
  • Memory reading MUST become quicker!
  • Not interested in (timing of) write operations
    • Data & address to memory, then forget about it...

'Cache' is French: 'secret hiding place'

Small-scale virtual memory: the cache
  • A 'cache' is a small but very fast memory which contains the 'most active' memory words

IF a requested memory word is in the cache

THEN supply the word from the cache {very fast}

ELSE supply the word from main memory {rather slow} and place it in the cache for later references (throwing out unused words when needed)

  • An ideal cache knows which words will be used soon
  • A good cache reaches 95% THEN and only 5% ELSE
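(A quick check with the figures from the 'gap' slide: at 95% hits, the average read time is roughly 0.95 × 1 ns + 0.05 × 100 ns ≈ 6 ns, about 17 times faster than the 100 ns main memory alone.)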
Keeping the cache hidden
  • The cache must keep a copy of memory words
  • Memory mapped I/O ports are problematic
    • These can spontaneously change their value !
    • Have to be made 'non-cacheable' at all times
  • Shared memory is problematic too
    • Make it non-cacheable (from all sides), or better
    • Inform all attached caches of changes (write actions)
Cache writing policies

'write-through’: written data copied into memory

  • Option: write to cache only if word is already present
    • The amount of data in the cache can be reduced
    • Read after non-cached write requires true memory read

'posted write’: writes buffered until the bus is free

    • Gives priority to reads, allows high speed write bursts
    • More hardware, delay between CPU and memory write

'late write’: write only to make free space in cache (as in the Pentium)

    • Reduces the number of memory write cycles drastically
    • Complex cache control, especially with shared memory!
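A small C sketch (an illustration only, not the actual 82385 logic) of the write path under the write-through and late-write policies; the 'dirty' flag marks lines that still have to be copied to memory:

typedef struct {
    unsigned tag;
    int      valid;
    int      dirty;      /* set on a late-write hit, cleared on write-back */
    unsigned data[8];    /* an 8-word line, as in the 82385 example below  */
} cache_line;

enum policy { WRITE_THROUGH, LATE_WRITE };

void cache_write(cache_line *l, unsigned tag, unsigned word, unsigned value,
                 enum policy p,
                 void (*mem_write)(unsigned, unsigned, unsigned))
{
    int hit = l->valid && l->tag == tag;
    if (hit)
        l->data[word] = value;        /* update the cached copy             */
    if (p == WRITE_THROUGH)
        mem_write(tag, word, value);  /* always copy into main memory       */
    else if (hit)
        l->dirty = 1;                 /* late write: defer the memory cycle */
}

Note that this sketch writes the cache only on a hit, matching the 'write to cache only if word is already present' option above.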


[Figure: an 80386 CPU with an 82385 cache controller. A bus switch links the CPU bus (data/address/control) to the system bus; the cache memory and its administration sit beside the controller, with main memory on the system bus.]
An example of a cache
  • To reduce the amount of administration memory, a single cache 'line' administers blocks of 8 words
[Figure: the 32-bit address is split into a 17-bit 'tag', a 10-bit line number, a 3-bit word number and a 2-bit byte number. The line number selects one of 1024 lines, each holding a 17-bit tag, a 'line valid' flag and eight 32-bit data words with their own 'word valid' flags; a 'hit' requires the stored tag to match the address tag.]
Intel 82385 'direct mapped’ cache mode
  • Also known as '1-way set associative', prone to 'tag clashing' !
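The field division can be written out directly; a small C sketch following the 17/10/3/2 split shown above (the names are ours):

#include <stdint.h>

typedef struct { uint32_t tag, line, word, byte; } addr_fields;

addr_fields split_address(uint32_t a)
{
    addr_fields f;
    f.byte = a         & 0x3;      /* bits  1..0              */
    f.word = (a >> 2)  & 0x7;      /* bits  4..2              */
    f.line = (a >> 5)  & 0x3FF;    /* bits 14..5, 1024 lines  */
    f.tag  = (a >> 15) & 0x1FFFF;  /* bits 31..15, 17 tag bits */
    return f;
}

A 'hit' then requires tags[f.line] == f.tag plus the line and word valid bits.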
[Figure: in 2-way mode the 32-bit address is split into an 18-bit 'tag', a 9-bit line number, a 3-bit word number and a 2-bit byte number. Two sets of 512 lines with 18-bit tags are searched in parallel; hit logic combines the two tag comparisons, and per-line LRU bits steer replacement.]
Intel 82385 ’2-way set associative’ mode
  • 'Least Recently Used' bits indicate which set in each line was used last (the other set is the replacement target)
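A hedged C sketch of the 2-way lookup; the entry type and the set_refill() helper are assumptions for illustration:

typedef struct { unsigned tag; int valid; } way_entry;

extern void set_refill(int set, unsigned line, unsigned tag);  /* hypothetical */

int lookup_2way(way_entry set0[], way_entry set1[], int lru[],
                unsigned line, unsigned tag)
{
    if (set0[line].valid && set0[line].tag == tag) { lru[line] = 0; return 1; }
    if (set1[line].valid && set1[line].tag == tag) { lru[line] = 1; return 1; }
    int victim = (lru[line] == 0) ? 1 : 0;  /* the set NOT used last           */
    set_refill(victim, line, tag);
    lru[line] = victim;                     /* refilled set is now 'used last' */
    return 0;                               /* miss */
}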
The MESI protocol
  • Late write and shared memory combine badly
    • The 'MESI' protocol solves this with four states for each of the cache words (or lines)

Modified: cached data differs from the main memory and is only located in this cache

Exclusive: cached data is the same as main memory and is only located in this cache

Shared: cached data is the same as main memory and also located in one or more other caches

Invalid: cache word/line not loaded with memory data

State changes in the MESI protocol
  • Induced by processor read/write actions and actions of other cache controllers
  • Caches keep track of other read/write actions
    • Uses 'bus snooping': monitoring the address and control buses when they are driven by someone else
    • During a memory access, other cache controllers indicate if one of them contains the accessed location

Needed to decide between the Shared/Exclusive states!
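A minimal C sketch of the snoop side of such a protocol, with a hypothetical write_back_line() helper (the 82496 details follow below):

typedef enum { INVALID, SHARED, EXCLUSIVE, MODIFIED } mesi_state;

extern void write_back_line(void);  /* hypothetical: copy local data to memory */

mesi_state on_snoop(mesi_state s, int snooped_write)
{
    if (s == INVALID)
        return INVALID;        /* nothing cached here, nothing to do         */
    if (snooped_write)
        return INVALID;        /* someone else writes: our copy is now stale */
    if (s == MODIFIED)
        write_back_line();     /* snooped read of modified data              */
    return SHARED;             /* another cache now holds the line too       */
}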

Intel 82496 CPU accesses

(the 82496 is the cache controller of the Pentium)

  • A read hit reads the cache, does not change state
  • A read miss reads memory, other controllers check if they also contain the address read
  • Write hit handling depends on the state
    • If Shared, write is done in main memory too
    • If Exclusive or Modified, write is only done in cache
  • A write miss writes to memory, but not the cache (normal MESI would write the cache too). Other caches may change their state!

Intel 82496 state diagram

[State diagram: a read hit leaves the state unchanged. A read miss goes to Shared if the location is also 'somewhere else' and to Exclusive if it is 'only here'; a write miss leaves the line Invalid. A write hit in Shared writes to memory and moves to Exclusive; a write hit in Exclusive moves to Modified (setup for late write); read/write hits in Modified stay there. A snooped write leads to Invalid; a snooped read moves Exclusive to Shared, and also Modified to Shared, where (*) this controller copies the local data to memory immediately.]

[Figure: the CPU with a small & fast on-chip cache (first level), a large(r) & slow(er) off-chip cache (second level), and huge & very slow main memory.]
Final remarks on caches (1)
  • High performance processors rely on caches
    • Main memory must be accessed in a single clock cycle
  • At 1 GHz, the cache must be on the CPU chip
    • But a large & fast cache takes a lot of chip space!


Final remarks on caches (2)
  • The off-chip cache becomes as slow as main memory was some time ago...
  • Second level cache placed on the CPU chip too
    • Examples: PowerPC, Crusoe (both > 256 kilobytes!)
    • The external cache becomes a third-level cache
    • Data transfer between the on-chip caches can move a complete cache line in parallel: a huge speedup
Speeding it up: which speed ?
  • It is nice to talk for hours about how to increase the speed of a processor, but...

what do we actually want ?

  • We first have to look at the application side, where speed is measured more in terms of

algorithm execution performance than processor performance

Different applications, different speeds
  • Fixed function (control) applications: the required algorithms must be executed in a given amount of time; it is expensive and unnecessary to go any faster !
  • Transaction processing and databases: the algorithms must be executed at a speed such that the system is not perceived to be 'slow' by human standards
  • 'Number crunching' and simulations: the algorithms must be executed as fast as possible

The last one is used the least !

Number crunching and simulation (1)
  • The only applications where algorithm processing speed is of major concern
    • A single operation may take hours or even days!
  • It may be worthwhile to spend a lot of money to increase processing speed by 'only' 10%
    • These users are willing to upgrade their computer once a year to follow the latest technology trend...
Number crunching and simulation (2)
  • 'No holds barred' - all tricks in the book are used
    • Massively parallel processor systems
    • Special purpose hardware
    • Vector processors and 'systolic arrays' ('Single Instruction Multiple Data' machines)
    • 'Normal' processors speeded up by all kinds of tricks

often based upon the type of operations to be performed

We will focus on some of these tricks

Algorithm processing speed
  • The clock speed of a processor doesn't say much
    • The Rekursiv machine (vCISC) at 10 MHz beats a TI 'LISP engine' (RISC) at 40 MHz to run LISP
      • The reason: Rekursiv can ‘Malloc’ in one clock cycle
  • It is possible to optimise a processor architecture to fit the programming language
    • which may give tremendous speedups (f.i. LISP, Smalltalk or Java)
The problem with ‘benchmarks’

Meaningless Information about Processor Speed

  • 'Million Instructions Per Second' is an empty measurement

unless scaled to some normalised instruction set and 'mix'

  • 'Standard' benchmark programs are not representative of real applications

their instruction mix is non-standard and results are influenced by the compiler which is used

Reduced Instruction Set Computers
  • Execute a 'simple' instruction set (load/store philosophy: operations between registers only)
  • Have fixed length instructions with a few formats (easy to decode but sometimes space inefficient)
  • Use a large number of general purpose registers (needed for calculating addresses and reducing reads/writes)
  • Tuned for high-speed instruction execution
    • But not high speed 'C' execution, as some believe
Complex Instruction Set Computers
  • Execute a complex instruction set, doing much more in one instruction but difficult to decode
  • Have variable length instructions, which gives higher storage efficiency and shorter programs
  • Use a moderate number of registers(some of them special purpose)
  • Tuneable towards high-level language execution
    • f.i. 'ENTER' and 'LEAVE' instructions in the 80286
    • or even operating system support (task switching)
The RISC/CISC boundary fades fast
  • 'RISC' is sometimes a completely misplaced label
    • The IBM 'POWER' architecture has more instructions than an average CISC
  • RISC speed ('one instruction per clock') can also be reached by modern CISC processors
    • Which then perform the equivalent of several RISC instructions in that same 'single clock'
The number of instructions per clock

'one instruction per clock' (1 IPC) is hardly ever reached, even for RISC CPUs

  • Early RISCs reached 0.3 .. 0.5 IPC
    • it takes a lot of hardware to reach 0.6 .. 0.7 IPC when running normal programs !
  • Only 'Superscalar' processors can reach (and even exceed) 1 IPC
Standard CISC instruction execution
  • In the old days, a CISC processor took a lot of clocks to execute a single instruction

1: fetch the (first part of the) instruction

2: decode the (first part of the) instruction

3: fetch more parts of the instruction if needed

4: fetch operands from memory (after address calculations) and/or registers

5: perform ALU operation (may take several cycles)

6: write result(s) to registers and/or memory

A program to execute programs
  • These old machines interpreted the actual program
    • They ran a lower-level ('microcode') program!
  • Hardware was expensive, so it was re-used for different purposes during different clock cycles
    • A single bus to transfer data inside the processor
    • One ALU for addresses and actual operations
Streamlining the execution on a RISC
  • Early RISC processors could break instruction execution into four basic steps

1: Fetch instruction (always the same size)

2: Decode instruction and read source operands (s1, s2)

3: Execute the actual operation in the ALU

4: Write result to destination operand (d)

  • We will denote these four steps with the letters FDEW from now on...
[Figure: single-clock datapath. The PC addresses the program memory; the fetched instruction selects two source registers (s1, s2) and a destination (d) in the data registers; the ALU combines s1 and s2 and writes the result to d, while the PC is incremented by 1.]
Single clock RISC instruction execution
  • The basic instruction execution steps can be executed within one clock
[Timing: within one clock cycle, the delays of the program address, instruction, source operands and ALU result all add up, plus the setup time to the clock edge.]
Single clock RISC execution timing
  • This is a bit slow in terms of clock speed
[Figure: the same datapath with extra registers added: 'I' behind the program memory, 'S1'/'S2' behind the register read ports and 'D' behind the ALU, all loaded under supervision of a control unit.]
Extra registers for the one clock RISC
  • The clock speed can be increased by adding extra registers
Timing of RISC with extra registers
  • The control unit tells all registers when to load

1: Read program memory and store in 'I', PC++

2: Read source registers and store in 'S1'/'S2’

3: Perform ALU operation and store result in 'D’

4: Write 'D' contents into destination register

  • Less done in each clock cycle: clock speed higher

but the number of clocks per instruction goes up and the total instruction execution time increases !

[Figure: a cost-reduced version: a multiplexer in front of a single register-file port replaces the separate s1/s2/d ports, and the I, S1, S2 and D registers are reused over several clock cycles.]
Reducing hardware costs
  • The previous solution can be optimised a lot to reduce hardware costs
The ‘reduced hardware costs’ timing
  • Separate clock cycles for reading operands
    • The data registers have become single ported (much less hardware than 3-ported)
    • It is even possible to do PC++ with the ALU
  • Back at square one: this is how they used to do it in the old days... VERY slow

[Figure: the datapath split into four stages by pipeline registers: stage 1 'Fetch' (PC, program memory, I1), stage 2 'Decode' (register read into S1/S2, I2), stage 3 'Execute' (ALU into D, I3) and stage 4 'Write' (destination register write).]
Splitting the processor in ‘stages’
  • By adding even more registers,we can split the processor in 'stages'
The stages form a ‘pipeline’
  • Each stage uses independent hardware
    • Performs one of the basic instruction execution steps
  • The stages can all work at the same time
    • In general on different instructions !
  • This way of splitting a processor in stages is called 'pipelining'
clock:     X          X + 1        X + 2        X + 3
stage 1:   fetch N    fetch N+1    fetch N+2    fetch N+3
stage 2:   ?          read N       read N+1     read N+2
stage 3:   ?          ?            ALU N        ALU N+1
stage 4:   ?          ?            ?            write N
The timing of a pipeline
  • These stages handle 4 instructions in parallel
    • At roughly four times the clock speed of the first hardware implementation!
r1 := r2 + 3    F D E W
r3 := r4 x r5     F D E E E W        (multiply uses 2 extra clocks in ALU)
r6 := r4 - 2        F D . . E W      (must wait for E)
r7 := r2 - r5         F . . D E W    (must wait for D)
r0 := r5 + 22           . . F D E W  (must wait for F)

(time runs to the right; the dots mark 'stall cycles')
Giving more time to a pipeline stage
  • A pipeline stage which cannot handle the next instruction in one clock cycle has to 'stall' the stages in front of it


The bad thing about pipeline stalls
  • Stalls force 'no operation' cycles on the expensive hardware of the previous stages
  • The following instructions finish later than absolutely necessary

Pipeline stalls should be avoided whenever possible !

initial values: r1 = 11, r2 = 22, r3 = 34

r1 := r2 + 3    F D E W      E: r2+3 = 22+3 = 25, W: r1 := 25
r4 := r3 - r1     F D E W    E: r3-r1 = 34-11 = 23, W: r4 := 23

wrong value! '25' not written yet...
Another pipeline problem: ‘dependencies’
  • In the standard pipeline, instructions which depend upon each other's results give problems


r1 := r2 + 3    F D E W        E: r2+3 = 22+3 = 25, W: r1 := 25
r4 := r3 - r1     F D D D E W  D repeats while 'D source = E destination' or 'D source = W destination' matches; then E: r3-r1 = 34-25 = 9, W: r4 := 9
Solving the dependency problem
  • Compare D, E and W stage operands, stall the pipeline if a match is found

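A hedged C sketch of this comparison (the instruction fields are assumptions):

typedef struct { int dest, src1, src2; int valid; } stage_instr;

int must_stall(stage_instr d, stage_instr e, stage_instr w)
{
    int hazard = 0;
    if (e.valid) hazard |= (d.src1 == e.dest) || (d.src2 == e.dest);
    if (w.valid) hazard |= (d.src1 == w.dest) || (d.src2 == w.dest);
    return hazard;  /* 1: hold the D stage (and the stages before it) one cycle */
}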

[Figure: the four-stage pipeline extended with a result forwarding 'path': the ALU result register D is fed back through multiplexers on the s1 and s2 inputs of the Execute stage.]
Result forwarding to solve dependencies

{ source operand control and multiplexer specification: }
IF I3.dest = I2.source1 THEN s1 := D ELSE s1 := S1;
IF I3.dest = I2.source2 THEN s2 := D ELSE s2 := S2;

r1 := r2 + 3    F D E W
r3 := [r4]        F D M M M W     (handled by the memory pipeline hardware)
r6 := r4 - 2        F D E W
r7 := r2 - r5         F D E W
r0 := r3 + 22           F D E W   (forwarding !)

(two write stages are active at the same time, and the write order of r3 and r6 is reversed !)
Parallel pipelines to speed things up
  • No need to wait for the completion of slow operations, if handled by separate hardware


The ‘order of completion’
  • In this example, we have 'out-of-order completion'

r6 is written before r3, while the instruction ordering suggests r3 before r6 !

  • The normal case is called 'in-order completion'

Shorthand: ‘OOO’

Dependencies with OOO completion

  • write/read order or 'true data dependency': reading the 2nd source must wait for the 1st destination write, otherwise the 2nd instruction uses a wrong source value
  • read/write dependency or 'antidependency': writing the 2nd destination must be done after reading the 1st source value, otherwise the 1st instruction uses a wrong source value
  • write/write dependency: writing the 2nd destination must be done after writing the 1st destination, otherwise a wrong result is left in the destination at the end

‘Scoreboarding’ instead of forwarding
  • Result forwarding helps in a simple pipeline
    • It becomes rather complex in a multiple pipeline with out-of-order completion
    • One of the earlier DEC Alpha processors used more than 40 result forwarding paths
  • A 'register scoreboard' can be used to make sure that dependency relations are kept in order
Operation of a register scoreboard
  • All registers have a 'scoreboard' bit, initially reset
  • Instructions wait in the Decode stage until all their source and destination scoreboard bits are reset (to zero)
  • Instructions which exit the Decode stage set the scoreboard bit in their destination register(s)
  • A scoreboard bit is reset during the writing of a destination register in any Writeback stage
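A minimal sketch of these four rules in C, assuming one scoreboard bit per register:

enum { NREGS = 32 };           /* register count is an assumption       */
int scoreboard[NREGS];         /* one bit per register, initially reset */

int can_leave_decode(int s1, int s2, int d)
{
    return !scoreboard[s1] && !scoreboard[s2] && !scoreboard[d];
}

void on_decode_exit(int d) { scoreboard[d] = 1; }  /* destination now pending  */
void on_writeback(int d)   { scoreboard[d] = 0; }  /* result has been written  */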
Scoreboard performance
  • A simple scoreboard is very conservative in its stalling decisions
    • It stalls the pipeline for true data dependencies

But removes all forwarding paths in return !

  • Write-write and antidependencies are stalled much longer than absolutely necessary
    • They should be stalled in the Writeback stage, not the Decode stage !
The real reason for some dependencies
  • Write-write and antidependencies exist because a register is re-used to hold another value !
  • If we use a different destination register for each write action, these dependencies vanish
    • This requires changing the program, which is not always possible
    • The number of available registers may not be enough: every result a different register ?
Register ‘renaming’ as solution
  • Write-write and antidependencies can be removed by writing each result in a different hardware register
    • This removes the direct relation between a register number in the program and a real register

Register numbers are renamed into something else !

    • Have to make sure that source register references always use the correct (renamed) hardware register
Register renaming example

before renaming:     after renaming:
1) R1 := R2 + 3      R1b := R2a + 3
2) R3 := R1 x 2      R3b := R1b x 2
3) R1 := R6 + R2     R1c := R6a + R2a
4) R2 := R1 - 15     R2b := R1c - 15

All registers start as R..a. The true dependencies (1 before 2, 3 before 4) remain; the anti-dependencies and write-write dependencies caused by re-using R1 and R2 disappear.

An implementation of register renaming
  • Use a lookup table in the Decode stage which indicates the 'current' hardware register for each of the software-visible registers
    • Source values are read from the hardware registers currently referenced from the lookup table
    • Each destination register gets a 'fresh' hardware register whose reference is placed in the lookup table
    • Later pipeline stages all use the hardware register references for result forwarding and/or writeback
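A sketch of such a lookup table in C; the sizes and the naive 'fresh register' allocator are assumptions (when a hardware register may be re-used is exactly the problem discussed below):

enum { NSOFT = 8, NHARD = 64 };   /* assumed register counts                     */
int rename_tab[NSOFT];            /* current hardware register per soft register */
int next_free = NSOFT;            /* naive allocator, for illustration only      */

void rename_instr(int *src1, int *src2, int *dest)
{
    *src1 = rename_tab[*src1];         /* sources: current mapping               */
    *src2 = rename_tab[*src2];
    int fresh = next_free++ % NHARD;   /* destination: a fresh hardware register */
    rename_tab[*dest] = fresh;
    *dest = fresh;
}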
The problem with register renaming
  • When is a hardware register not needed anymore ?

OR, in other words

  • When can a hardware register be re-used ?
    • There must be another hardware register assigned for its software register number

AND

    • All source value references to it must have been done

Will be solved later

[Example: 'PC := PC + 5' is a jump, with the PC only updated in the W stage (F D E W). The instructions fetched in the meantime ('r8 := r1 - 22', 'r3 := r4 x r5') are fetched at the wrong address !]

Flow control instructions in the pipeline
  • When the PC is changed by an instruction, the Fetch stage must wait for the actual update
    • For instance: a relative jump calculated by the ALU, with PC updated in the Writeback stage
[Example: 'PC := 25' now updates the PC in the Decode stage (F D). Only one instruction ('r3 := r4 x r5') is fetched at the wrong address; it is turned into a no-operation (NOP), and fetching continues correctly with 'r8 := r1 - 22'.]

Improving the flow control handling
  • The number of stall cycles can be reduced a lot: update the PC earlier in the pipeline
    • For instance in the Decode stage
[Example: at address X, 'PC := 25' updates the PC in the Decode stage; the instruction at X+1 ('r3 := r4 x r5') sits in the 'delay slot' and is executed anyway, after which fetching continues at address 25 ('r8 := r1 - 22') without any stall.]

Another method: use ‘delay slots’
  • The pipeline stall can be removed by executing instruction(s) following the flow control instruction
    • These are executed before the actual jump is made


Delay slots: to have or not to have
  • Using delay slots changes processor behaviour: old programs will not run anymore !
  • Compilers try to find useful instructions for delay slots
    • Able to fill 75% of the first delay slots
    • But only filling 40% of the second delay slots
  • If no useful instruction can be found, insert a NOP
An alternative to delay slots
  • Sometimes there are several stages between fetching and execution (PC update) of a jump instruction
    • Would lead to many (unfillable) delay slots
  • Alternative solution: a 'branch target cache' (BTC)
    • For out-of-sequence jumps, this cache contains the new PC value and the first (few) instruction(s)
    • It is indexed on the address of the jump instruction

the BTC ‘knows’ a jump is coming before it is fetched !

[Example: the jump '10: PC := 22' hits in the BTC while it is being fetched; the BTC provides the target instruction '22: r3 := r4 x r5' itself and updates the PC to 23, so fetching continues with '23: r8 := r1 - 22'; the wrongly fetched '11: r2 := r6 + 3' is discarded.]

Operation of the Branch Target Cache
  • If the Branch Target Cache hits, the fetch stage starts fetching after the target address
    • The BTC provides the first (few) instruction(s) itself


[Example: '10: JNZ r1,22' (F D E W) is predicted taken, so '22: r3 := r4 x r5' (in the 'delay slot') and the following instructions ('23: r7 := r9', '24: r6 := 5') are fetched and executed right away. The test outcome is only known in the W stage: if the prediction was correct, no time is lost; if it was wrong, the speculatively executed instructions must be discarded and fetching restarts at '11: r2 := 3' and '12: r8 := 0'.]

Jump prediction saves time
  • By predicting the outcome of a conditional jump, there is no need to wait until the test outcome is known
    • Example: condition test outcome known in W stage

Must avoid wrong predictions !

How to predict a test outcome (1)
  • Prediction may be given with a bit in the instruction
    • Shifts prediction problem to the assembler/compiler
    • Instruction set must be changed to hold this flag
  • The prediction may be based upon the type of test and/or jump direction
    • End of loop jumps are taken most of the time
    • A single bit test is generally unpredictable...
How to predict a test outcome (2)
  • Prediction can be based upon the previous outcome(s) of the condition test
    • This is done with a 'branch history buffer’
      • A cache which holds information for the most recently executed conditional jumps
      • May be based solely on last execution or more complex (statistical) algorithms
      • Implemented in separate hardware or combined with branch target/instruction caches
  • Combination can achieve a 'hit rate' of > 90%!
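As one example of such an algorithm (an assumption for illustration, not necessarily what any particular processor does): a 2-bit saturating counter per buffer entry, predicting 'taken' when the counter is 2 or 3:

unsigned char history[1024];   /* buffer size is an assumption */

int predict_taken(unsigned idx) { return history[idx] >= 2; }

void update_history(unsigned idx, int taken)
{
    if (taken  && history[idx] < 3) history[idx]++;   /* saturate at 3 */
    if (!taken && history[idx] > 0) history[idx]--;   /* saturate at 0 */
}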
CALL and RETURN handling
  • A subroutine CALL can be seen as a jump combined with a memory write
    • Is not more problematic than a normal JUMP
  • A subroutine RETURN gives more problems
    • The new PC value cannot be determined from the instruction location and contents
    • Special tricks exist to bypass the memory stack read(for instance a ‘return address cache’)
Calculated and indirect jumps
  • These give huge problems in a pipeline
    • The new PC value must be determined before fetching can continue
  • Most of the speedup tricks break down on this problem
    • A Branch Target Cache can help a little bit, but only if the actual target remains stable
    • The predicted target must be checked afterwards !

Moving instructions around
  • It is possible to change the execution order of instructions which do not have dependencies

without renaming:     with renaming:
1) R1 := R2 + 3       R1b := R2a + 3
2) R3 := R1 x 2       R3b := R1b x 2
3) R1 := R6 + R2      R1c := R6a + R2a
4) R2 := R1 - 15      R2b := R1c - 15

  • True dependencies: 2) comes after 1), 4) comes after 3)
  • With renaming, these are the only sequence restrictions !

For example, both of these execution orders are legal after renaming:

3) R1c := R6a + R2a        3) R1c := R6a + R2a
1) R1b := R2a + 3          4) R2b := R1c - 15
2) R3b := R1b x 2          1) R1b := R2a + 3
4) R2b := R1c - 15         2) R3b := R1b x 2
Out-of-order (OOO) execution
  • Changing the order of instruction execution can remove pipeline stalls and/or fill delay slots: increase the performance
    • Instructions can be re-ordered in the program, but this is not OOO execution !
  • OOO execution: instructions are sent to the operational units (ALU, load/store...) in a different order than the program specifies

OOO memory accessing is not discussed here

[Figure: two buffering strategies. 1) Separate instruction buffers: fetch & decode sends instructions to a 'reservation station' with its own scheduler in front of each functional unit (ALU, load/store), all working on the (renamed) registers. 2) A central instruction 'window': one buffer and a single scheduler feed all functional units.]

Instruction buffers for OOO execution
  • To be able to change the execution order,fetched instructions must be buffered
Differences between buffer strategies
  • Reservation stations have advantages
    • Smaller buffers, schedulers are simpler
    • Buffer entries can be tailored to instruction format
    • Routing of instructions across chip simpler
  • The central instruction window also has advantages
    • Total number of buffered instructions can be smaller
    • The single scheduler can take better decisions
    • No ‘false locking’ with identical functional units
[Example: the sequence A1, B1, A2, B2, A3, B3, A4, B4 is spread over two identical ALUs with separate reservation stations; A2 ends up in the station of ALU 2 and B2 in the station of ALU 1. Each station now stalls on an instruction whose predecessors sit in the other station, so both stations are 'locked' although executable work is buffered: 'false locking'.]

False locking between functional units

This will not happen with a central instruction window !

Hybrid solution: one reservation station + one scheduler for multiple identical functional units

Scheduler operation
  • The schedulers actually have only a simple task

Pick ready-to-execute instructions from their buffers and send them to the appropriate operational units

'ready-to-execute' = all source values known

  • Try to calculate conditional jump results ASAP
  • Otherwise: oldest instructions first
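A sketch of that picking rule in C, assuming a simple buffer with per-entry status flags and ages:

typedef struct { int ready; int age; int unit; } buffered_instr;

int pick(buffered_instr b[], int n, int unit)
{
    int best = -1;
    for (int i = 0; i < n; i++)
        if (b[i].ready && b[i].unit == unit &&
            (best < 0 || b[i].age < b[best].age))
            best = i;   /* oldest ready instruction for this unit */
    return best;        /* -1: nothing can be issued this cycle   */
}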
‘Ready to execute’ determination
  • The scheduler(s) depend on other system parts to determine which instructions can be executed

The fetch unit knows the original order of the instructions and must determine the dependencies

The operational units signal the end of a dependency when writing a result operand

The instruction buffer(s) determine from this information which instructions are ready to execute and store this knowledge in status flags

The ‘scoreboard’, again
  • A simple scoreboarding technique can be used for ‘ready to execute’ determination
    • Renamed registers get a flag bit which indicates the register does not contain a result yet
    • Each renamed destination register write sets the attached flag bit to indicate the result is available
  • An instruction is ready to execute when all the flag bits of its renamed source registers are set
The problem with interrupts and traps
  • OOO completion means instructions results may be written in an order which differs from the instruction sequence in the program
    • If an instruction generates a trap, instructions following it may already have changed registers (and/or memory locations !)
    • If an interrupt must break off processing, some instructions may not complete while later ones in the program have already completed
Solution: a ‘safe state’ register set
  • With these imprecise interrupts and traps, it is almost impossible to get the processor in a state from which it can be safely restarted
  • We must find a way to maintain the 'visible' set of processor registers in a 'safe state': updated in the normal program order
    • We don't care if this updating of the safe state lags behind the normal updating of the renamed set
[Figure: a 'reorder buffer': a simulated FIFO with a 'tail'/write pointer and a 'head'/read pointer between the result bus(es) and the safe register set. Each entry holds a valid flag, the renamed register number, the real register number and the operand value; in-order updates leave the head of the FIFO towards the safe register set.]

Implementation of the safe state
  • One common way to provide this 'safe' register set is by using a so-called 'reorder buffer'
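A sketch of the retiring side of such a buffer in C (the entry layout follows the figure above; the sizes are assumptions):

typedef struct { int valid; int real_reg; unsigned value; } rob_entry;

enum { ROBSIZE = 16, NREAL = 8 };
rob_entry rob[ROBSIZE];     /* the simulated FIFO, filled OOO by result buses */
int head;                   /* read pointer: the oldest instruction           */
unsigned safe_regs[NREAL];  /* the in-order 'safe state' register set         */

void retire(void)
{
    while (rob[head].valid) {        /* stop at the first unfinished entry */
        safe_regs[rob[head].real_reg] = rob[head].value;
        rob[head].valid = 0;         /* free the entry for re-use          */
        head = (head + 1) % ROBSIZE;
    }
}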

Operation of the reorder buffer

Four instructions writing to (real) registers R2, R1, R2 & R3

[Worked example: the four instructions are allocated FIFO entries with renamed register numbers 6 (R2), 7 (R1), 8 (R2) and 9 (R3). The results arrive out of order (33 for entry 7, 678 for entry 6, 6 for entry 9, 10 for entry 8), but entries 'retire' from the head strictly in order, so the safe register set steps from R1: 12, R2: 47, R3: 114 to R1: 33, R2: 10, R3: 6, exactly as the program order prescribes.]

Other solutions and variations
  • Both 'history buffer' and 'future file' are (minor) variations/extensions on the reorder buffer
  • A central instruction window can combine the reorder buffer and instruction buffer functions
  • 'Checkpoint repair' makes backups of the complete register set when problems may occur
    • Only instructions which were already in execution at the time of the backup modify the backup's state (these must complete execution)
OOO execution & conditional jumps
  • Machines incapable of moving instructions across (conditional) jumps will not perform well
    • Basic block sizes of 4..6 instructions are normal for CISCs (6..8 instructions for RISCs)
    • Around half of the jumps are conditional !
  • The problem with conditional jumps
    • If the prediction is wrong, the processor state must be restored to the point of the jump instruction

In fact, the same as if a trap occurred

‘Speculative’ OOO conditional jumps (1)

'Speculative fetching’ fetches and decodes instructions after the conditional jump, but does not take them into execution

'Speculative execution’ also executes instructions in the predicted path, using renaming as a buffer for the in-order (safe) state

  • The speculative renamed registers are discarded when the prediction was incorrect
  • Rename indexes must be restored ! (checkpoint repair ?)
‘Speculative’ OOO conditional jumps (2)

'Multi-path speculative execution’ extends speculative execution to handle both paths following a conditional branch

    • may also allow multiple condition tests to be unresolved (needs more checkpointing buffers)
  • Retiring of renamed registers is frozen for speculative renamed registers until the branch outcome is known
Handling more instructions per clock
  • Fetching more than one instruction per clock is generally not such a problem
    • Make the bus to the instruction memory wider !
  • Need more than one functional unit to actually execute the instructions in parallel
  • Must also decode more than one instruction per clock to get a 'superscalar' processor
Superscalar parts we have already seen
  • Instruction decoders can easily send multiple instructions to separate reservation stations
    • With a minor increase in complexity even multiple instructions to the same reservation station
  • The central instruction window can be modified to receive multiple instructions in a single cycle
    • The scheduler can be changed to handle multiple instructions in parallel
Superscalar dependency detection
  • Instruction dependency determination must now be partially implemented in a parallel form
    • Renamed register indexes must be forwarded between concurrently decoded instructions
    • It must be possible to create multiple renamed registers in a single cycle
  • It must also be possible to update multiple in-order (safe) registers in parallel !

Another method to go superscalar
  • Very Large Instruction Word (VLIW) machines pack several ‘normal’ instructions in a single ‘superinstruction’
    • They execute this superinstruction using separate functional units

With all scheduling done by the compiler !

    • Programming VLIW machines in assembly language is virtually impossible
VLIW, but not exactly
  • The Intel 80860 processor uses another trick which resembles VLIW operation
    • It always fetches two instructions at a time
    • If the first one is a floating point operation, it checks a flag in this instruction
    • If this flag is set, it assumes the second one is not a floating point operation and executes both in parallel
  • The Intel Pentium 'pairs' instructions without flags