slide1

Pipelining (Chapter 8)

http://www.pds.ewi.tudelft.nl/~iosup/Courses/2012_ti1400_8.ppt

Course website:

http://www.pds.ewi.tudelft.nl/~iosup/Courses/2012_ti1400_results.htm


TU-Delft

TI1400/12-PDS

slide2

Basic idea (1)

[Figure: sequential execution. Instructions I1..I4 run strictly one after another along the time axis: F1 E1, F2 E2, F3 E3, F4 E4.]

[Figure: hardware organization. An instruction fetch unit passes each fetched instruction through an interstage buffer B1 to the execution unit.]

slide3

Basic idea (2): Overlap

Clock cycle:  1   2   3   4   5
I1:           F1  E1
I2:               F2  E2
I3:                   F3  E3
I4:                       F4  E4

(pipelined execution; time runs left to right)

slide4

Instruction phases

  • F: Fetch instruction
  • D: Decode instruction and fetch operands
  • O: Perform operation
  • W: Write result

slide5

Four-stage pipeline

Clock cycle

1

2

3

4

5

I1

F1

D1

O1

W1

I2

F2

D2

O2

W2

F3

D3

O3

W3

I3

F4

D4

O4

W4

I4

time

pipelined execution

5
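To see the overlap as arithmetic, here is a minimal sketch (my own illustration, not from the lecture) that computes in which clock cycle each stage of each instruction runs in an ideal four-stage pipeline:

STAGES = ["F", "D", "O", "W"]

def schedule(n_instr):
    # stage s of instruction i occupies clock cycle i + s + 1 (1-based)
    return {f"{name}{i + 1}": i + s + 1
            for i in range(n_instr)
            for s, name in enumerate(STAGES)}

cycles = schedule(4)
print(cycles["W1"], cycles["W4"])  # 4 and 7: one completion per cycle once full
# 4 instructions finish in 4 + 4 - 1 = 7 cycles, versus 16 when run one-by-one.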

slide6

Hardware organization (1)

[Figure: hardware organization. Fetch unit -> buffer B1 -> decode-and-operand-fetch unit -> buffer B2 -> operation unit -> buffer B3 -> write unit.]

slide7

Hardware organization (2)

During cycle 4, the buffers contain:

  • B1:
    • instruction I3
  • B2:
    • the source operands of I2
    • the specification of the operation
    • the specification of the destination operand
  • B3:
    • the result of the operation of I1
    • the specification of the destination operand


slide8

Hardware organization (3)

[Figure: the same four-stage organization, annotated with the buffer contents during cycle 4: B1 holds instruction I3, B2 holds the operands and operation of I2, B3 holds the result of I1.]

slide9

Pipeline stall (1)

  • Pipeline stall: a delay in one stage of the pipeline caused by an instruction that cannot proceed
  • Reasons for pipeline stall:
    • Cache miss
    • Long operation (for example, division)
    • Dependency between successive instructions
    • Branching


slide10

Pipeline stall (2): Cache miss

Clock cycle:  1   2   3   4   5   6   7   8   9
I1:           F1  D1  O1  W1
I2:               F2  F2  F2  F2  D2  O2  W2
I3:                                   F3  D3  O3  W3

(Cache miss in I2: the fetch F2 occupies cycles 2 through 5.)

slide11

Pipeline stall (3): Cache miss

Clock cycle:  1     2     3     4     5     6     7     8     9
F stage:      F1    F2    F2    F2    F2    F3
D stage:            D1    idle  idle  idle  D2    D3
O stage:                  O1    idle  idle  idle  O2    O3
W stage:                        W1    idle  idle  idle  W2    W3

(Effect of the cache miss in F2, viewed per pipeline stage.)

slide12

Pipeline stall (4): Long operation

[Figure: pipeline timing over clock cycles 1..8. The operation O2 (for example, a division) takes three cycles, so the instructions behind I2 stall until the operation unit is free, and every later stage slips accordingly.]

slide13

Pipeline stall (5): Dependencies

  • Instructions:

ADD R1, 3(R1)

ADD R4, 4(R1)

cannot be done in parallel

  • Instructions:

ADD R2, 3(R1)

ADD R4, 4(R3)

can be done in parallel

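The rule behind these two cases is a read-after-write check on register names; a minimal sketch (my own illustration, with the register sets read off the instructions above):

def raw_hazard(writes_first, reads_second):
    # True if the second instruction reads a register the first one writes
    return bool(set(writes_first) & set(reads_second))

# ADD R1,3(R1) writes R1; ADD R4,4(R1) reads R1 for its address:
print(raw_hazard({"R1"}, {"R1", "R4"}))   # True  -> cannot be done in parallel
# ADD R2,3(R1) writes R2; ADD R4,4(R3) reads R3 and R4:
print(raw_hazard({"R2"}, {"R3", "R4"}))   # False -> can be done in parallel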

slide14

Pipeline stall (6): Branch

Only start fetching instructions after the branch has been executed.

[Figure: Ii is a branch (Fi Ei); the target instruction Ik is fetched (Fk) only after Ei completes, then executed (Ek). Pipeline stall due to a branch.]

slide15

Data dependency (1): example

MUL R2,R3,R4 /* R4 destination */

ADD R5,R4,R6 /* R6 destination */

The new value of R4 must be available before the ADD instruction uses it


slide16

Data dependency (2): example

[Figure: I1 (MUL) runs F1 D1 O1 W1; I2 (ADD) is fetched in cycle 2, but its operand fetch D2 must wait until W1 has written the new value of R4, delaying I2, I3, and I4. Pipeline stall due to the data dependence between W1 and D2.]

slide17

Branching: Instruction queue

[Figure: the Fetch unit fills an instruction queue; a Dispatch unit takes instructions from the queue and feeds the Operation and Write stages.]

slide18

Idling at branch

[Figure: Ij is a branch (Fj Ej); the next sequential instruction Ij+1 is fetched (Fj+1) but then idles when the branch is taken; fetching resumes at the target, and Ik (Fk Ek) and Ik+1 (Fk+1 Ek+1) proceed normally.]

slide19

Branch with instruction queue

Branch folding: execute the branch instruction simultaneously with other instructions (i.e., compute the target while the instruction queue keeps the execution unit supplied).

[Figure: I1 (F1 E1) and I2 (F2 E2) flow normally; I3 is a branch (F3 E3); the prefetched I4 (F4) is discarded, and execution continues at the branch target with Ij, Ij+1, Ij+2, Ij+3 (Fj Ej and so on) without an idle execution cycle.]

slide20

Delayed branch (1): reordering

Original:

LOOP  Shift_left R1
      Decrement R2
      Branch_if>0 LOOP
NEXT  Add R1,R3

(always lose a cycle on the branch)

Reordered:

LOOP  Decrement R2
      Branch_if>0 LOOP
      Shift_left R1
NEXT  Add R1,R3

(the Shift_left after the branch is always executed)

slide21

Delayed branch (2): execution timing

[Figure: execution timing of the reordered loop. Each iteration runs Decrement (F E), Branch (F E), and then Shift (F E) in the slot after the branch; after the last iteration the Add follows the Shift, so no cycle is lost.]

slide22

Branch prediction (1)

[Figure: I1 (Compare) runs F1 D1 E1 W1; I2 (Branch-if>) runs F2 E2; the speculatively fetched I3 (F3 D3 E3) and I4 (F4 D4) are discarded (X) when the prediction turns out wrong, and fetching restarts at Ik (Fk Dk). Effect of incorrect branch prediction.]

slide23

Branch prediction (2)

Possible implementation (sketched below):

  • use a single bit
  • the bit records the previous choice of the branch
  • the bit tells from which location to fetch the next instructions

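A minimal sketch of such a predictor (my own Python illustration; indexing the bit by branch address is an assumption, the slide does not specify it):

class OneBitPredictor:
    def __init__(self):
        self.last = {}                      # branch address -> last outcome

    def predict(self, pc):
        return self.last.get(pc, False)     # first encounter: predict not taken

    def update(self, pc, taken):
        self.last[pc] = taken               # record the previous choice

p = OneBitPredictor()
mispredicts = 0
for taken in [True] * 9 + [False]:          # a loop branch: taken 9x, then exit
    if p.predict(0x40) != taken:
        mispredicts += 1
    p.update(0x40, taken)
print(mispredicts)                          # 2: wrong on loop entry and on exit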

slide24

Data paths of CPU (1)

[Figure: CPU data paths. The register file delivers Source 1 and Source 2 to the SRC1 and SRC2 latches at the ALU inputs; the ALU result lands in the RSLT latch and is written to the destination register. Operand forwarding routes RSLT back to the ALU inputs.]

slide25

Data paths of CPU (2)

[Figure: SRC1, SRC2, the ALU, and RSLT implement the Operation and Write stages above the register file; a forwarding data path feeds RSLT straight back to SRC1/SRC2, bypassing the register file.]
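The forwarding decision itself is just a comparison of register names; a minimal sketch (my own illustration) for one ALU input:

def read_operand(src, regfile, prev_dest, rslt):
    # take the RSLT latch when the previous instruction writes this register
    if prev_dest == src:
        return rslt                 # forwarded: available one cycle earlier
    return regfile[src]             # otherwise: normal register-file read

regs = {"R1": 5, "R2": 7, "R3": 0}
rslt = regs["R1"] + regs["R2"]      # ALU output of Add R1,R2,R3, not yet written
print(read_operand("R3", regs, "R3", rslt))   # 12, not the stale 0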

slide26

Pipelined operation

I1: Add R1, R2, R3
I2: Shift_left R3

The result of the Add has to be available in time for the Shift.

[Figure: I1 fetches, then computes R1 + R2 and writes R3; I2 must shift the new value of R3 immediately afterwards; I3 and I4 follow with the normal F D O W stages.]

slide27

Short pipeline

[Figure: short pipeline. The Add's ALU output is forwarded (fwd) directly into the shift of the next instruction, so the dependent instruction continues with at most a brief delay; I3 proceeds with the normal F D O W stages.]

slide28

Long pipeline

[Figure: long pipeline. The operation stage takes three cycles (O O O, numbered 1 2 3) per instruction; forwarding (fwd) can only happen after the last operation cycle, so a dependent instruction still has to wait.]

slide29

Compiler solution

I1: Add R1, R2, R3
I2: Shift_left R3

The compiler inserts no-operations to wait for the result:

I1: Add R1, R2, R3
    NOP
    NOP
I2: Shift_left R3
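A minimal sketch (my own illustration, not a real compiler pass) of the idea: insert NOPs whenever an instruction reads a register written by one of the two preceding instructions:

def insert_nops(prog, window=2):
    # prog: list of (name, registers written, registers read)
    out = []
    for op, dsts, srcs in prog:
        for age, (p_op, p_dsts, _) in enumerate(reversed(out[-window:])):
            if p_op != "NOP" and set(p_dsts) & set(srcs):
                out += [("NOP", (), ())] * (window - age)
                break
        out.append((op, dsts, srcs))
    return out

prog = [("Add",        ("R3",), ("R1", "R2")),
        ("Shift_left", ("R3",), ("R3",))]
print([op for op, _, _ in insert_nops(prog)])
# ['Add', 'NOP', 'NOP', 'Shift_left']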

slide30

Side effects

Another form of (implicit) data dependency: instructions can have side effects that are used by the next instruction.

I2: ADD D1, D2
I3: ADDX D3, D4

(the ADDX consumes the carry that the ADD produces: the carry is an implicit copy from I2 to I3)

slide31

Complex addressing mode

Load (X(R1)), R2        (the offset X is part of the instruction)

[Figure: the Load spends three successive decode/address cycles, computing X+[R1], fetching [X+[R1]], then fetching [[X+[R1]]], before the result reaches R2; the next instruction waits for forwarding (fwd) and its decode idles.]

This complex addressing mode causes a pipeline stall.

slide32

Simple addressing modes

The same effect built from simple instructions:

Add #X,R1,R2
Load (R2),R2
Load (R2),R2

[Figure: the Add computes X+[R1] into R2, the first Load fetches [X+[R1]], the second Load fetches [[X+[R1]]]; with forwarding between them, the next instruction starts no later than with the complex mode.]

Built up from simple instructions: the sequence takes the same amount of time (a numeric sketch follows).
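A numeric sketch of the equivalence (my own illustration, with made-up memory contents):

mem = {100: 200, 200: 42}    # toy memory: hypothetical contents
R1, X = 60, 40

addr = X + R1                # Add #X,R1,R2   -> R2 = X + [R1]    = 100
val1 = mem[addr]             # Load (R2),R2   -> R2 = [X+[R1]]    = 200
val2 = mem[val1]             # Load (R2),R2   -> R2 = [[X+[R1]]]  = 42
print(val2)                  # the same value the one complex Load would deliver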

slide33

Addressing modes

  • Requirements on addressing modes with pipelining:
    • operand access requires no more than one memory access
    • only load and store instructions access memory
    • addressing modes do not have side effects
  • Possible addressing modes:
    • register
    • register indirect
    • index

slide34

Condition codes (1)

  • Problems in RISC with condition codes (CCs):
    • do instructions after reordering have access to the right CC values?
    • are the CCs already available for the next instruction?
  • Solutions:
    • compiler detection
    • no automatic use of CCs; use them only when explicitly specified in the instruction

slide35

Explicit specification of CCs

Increment R5
Add R2, R4
Add-with-increment R1, R3     (the last two form a double-precision addition)

In PowerPC instructions (C: change carry flag, E: use carry flag):

ADDI R5, R5, 1
ADDC R4, R2, R4
ADDE R3, R1, R3
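A minimal sketch in Python (standing in for the PowerPC semantics named above) of the explicit carry chain in a double-precision addition:

MASK = (1 << 32) - 1

def addc(a, b):                  # like ADDC: add and produce the carry
    s = a + b
    return s & MASK, s >> 32

def adde(a, b, carry):           # like ADDE: add and consume the carry
    s = a + b + carry
    return s & MASK, s >> 32

lo, c = addc(0xFFFFFFFF, 0x00000001)     # low words overflow -> carry = 1
hi, _ = adde(0x00000001, 0x00000002, c)  # high words plus the carry
print(hex((hi << 32) | lo))              # 0x400000000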

slide36

Two execution units

[Figure: the Fetch unit fills an instruction queue; the Dispatch unit issues instructions to two execution units, a floating-point unit and an integer unit, which share the Write stage.]

slide37

Instruction flow (superscalar)

Clock cycle:  1   2   3   4   5   6   7   8   9
I1 (Fadd):    F1  D1  O1  O1  O1  W1
I2 (Add):         F2  D2  O2  W2
I3 (Fsub):            F3  D3  O3  O3  O3  W3
I4 (Sub):                 F4  D4  O4  W4

Simultaneous execution of floating-point and integer operations (the floating-point operations occupy their unit for three cycles).

slide38

Completion in program order

[Figure: the same four instructions as above, but each write stage W is held back until all earlier instructions have completed, so W2 waits for W1 and W4 waits for W3: every instruction waits until the previous instruction has completed.]

slide39

Consequences of completion order

When an exception occurs:

  • if writes are not necessarily in instruction order: imprecise exceptions
  • if writes are in order: precise exceptions

slide40

PowerPC pipeline

[Figure: PowerPC pipeline. The instruction cache feeds instruction fetch and an instruction queue, with a branch unit alongside; a dispatcher issues instructions to the LSU (load/store unit), FPU, and IU (integer unit); the LSU accesses the data cache through a store queue, and results retire via a completion queue.]

slide41

Performance Effects (1)

  • Execution time of a program: T
  • Dynamic instruction count: N
  • Number of cycles per instruction: S
  • Clock rate: R
  • Without pipelining: T = (N x S) / R
  • With an n-stage pipeline: T' = T / n ??? (see the check below)
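A quick numeric check of the formula, with example values of my own choosing:

N = 1_000_000      # dynamic instruction count
S = 4              # cycles per instruction without pipelining
R = 500e6          # clock rate: 500 MHz
n = 4              # pipeline stages

T = (N * S) / R    # 0.008 s without pipelining
print(T, T / n)    # T' = T / n = 0.002 s would hold only for a perfect pipeline
# The '???' is the point: stalls keep real pipelines from reaching T / n,
# as the next slides quantify.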

slide42

Performance Effects (2)

  • Cycle time: 2 ns (R is 500 MHz)
  • Cache hit (miss) ratio for instructions: 0.95 (0.05)
  • Cache hit (miss) ratio for data: 0.90 (0.10)
  • Fraction of instructions that need data from memory: 0.30
  • Cache miss penalty: 17 cycles
  • Average extra delay per instruction:

(0.05 + 0.3 x 0.1) x 17 = 1.36 cycles,

so the pipeline slows down by a factor of more than 2 (verified below).
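Checking the slide's arithmetic in code:

i_miss  = 0.05     # instruction-fetch miss ratio
d_frac  = 0.30     # fraction of instructions that access data memory
d_miss  = 0.10     # data miss ratio
penalty = 17       # cycles per miss

extra = (i_miss + d_frac * d_miss) * penalty
print(extra)       # 1.36 extra cycles per instruction
print(1 + extra)   # 2.36 cycles per completion instead of 1: more than 2x slower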

slide43

Performance Effects (3)

  • On average, the fetch stage takes, due to instruction cache misses:

1 + (0.05 x 17) = 1.85 cycles

  • On average, the decode stage takes, due to operand cache misses:

1 + (0.3 x 0.1 x 17) = 1.51 cycles

  • For a total additional cost of 1.36 cycles


slide44

Performance Effects (4)

  • If only one stage takes longer, the additional time should be counted relative to that one stage, not relative to the complete instruction
  • In other words: the pipeline runs only as fast as its slowest stage

[Figure: the stages F1 D1 O1 W1 drawn twice, once with equal stage times and once with one stage stretched; the stretched stage dictates the timing of the entire pipeline.]

slide45

Performance Effects (5)

  • A delay of 1 cycle in a single stage for every fourth instruction gives an average penalty of 0.25 cycles
  • Average inter-completion time (checked below):

(3 x 1 + 1 x 2) / 4 = 1.25

[Figure: five instructions flow through F D O W; one instruction's stage takes an extra cycle, so its completion and the completions behind it shift by one cycle.]
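The same average as a one-line check:

# three instructions complete after 1 cycle, the delayed one after 2 cycles
print((3 * 1 + 1 * 2) / 4)   # 1.25 cycles between completions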

slide46

Performance Effects (6)

  • Delays in two stages:
    • k% of the instructions stall in one stage, with a penalty of s cycles
    • l% of the instructions stall in another stage, with a penalty of t cycles
  • Average inter-completion time:

((100 - k - l) x 1 + k x (1 + s) + l x (1 + t)) / 100 = (100 + ks + lt) / 100

  • In the example (k = 5, l = 3, s = t = 17): 2.36 cycles (see the check below)
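The general formula as a small function (my own restatement), reproducing both the 2.36 above and the 1.25 from the previous slide:

def inter_completion(k, l, s, t):
    # k, l in percent of instructions; s, t in penalty cycles
    return ((100 - k - l) * 1 + k * (1 + s) + l * (1 + t)) / 100

print(inter_completion(5, 3, 17, 17))   # 2.36: the cache-miss example
print(inter_completion(25, 0, 1, 0))    # 1.25: one extra cycle every 4th instr.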

slide47

Performance Effects (7)

  • A large number of pipeline stages seems advantageous, but:
    • more instructions are processed simultaneously, so there is more opportunity for conflicts
    • the branch penalty becomes larger
    • the ALU is usually the bottleneck, so smaller time steps are of no use