pipelining 5 n.
Download
Skip this Video
Loading SlideShow in 5 Seconds..
Pipelining 5 PowerPoint Presentation
Download Presentation
Pipelining 5

Loading in 2 Seconds...

play fullscreen
1 / 19

Pipelining 5 - PowerPoint PPT Presentation


  • 117 Views
  • Uploaded on

Pipelining 5. Two Approaches for Multiple Issue. Superscalar Issue a variable number of instructions per clock Instructions are scheduled either statically or dynamically VLIW (Very Long Instruction Word)

loader
I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
capcha
Download Presentation

PowerPoint Slideshow about 'Pipelining 5' - ksena


An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript
two approaches for multiple issue
Two Approaches for Multiple Issue
  • Superscalar
    • Issue a variable number of instructions per clock
    • Instructions are scheduled either statically or dynamically
  • VLIW (Very Long Instruction Word)
    • Issue a single very long instruction per clock that contains a large number of real instructions
    • Instructions are scheduled statically by the compiler
superscalar
Superscalar
  • Superscalar DLX: 2 instructions, 1 FP & 1 anything else

– Fetch 64-bits/clock cycle; Int on left, FP on right

– Can only issue 2nd instruction if 1st instruction issues

– More ports for FP registers to do FP load & FP op as doubles

Type Pipe Stages

Int. instruction IF ID EX MEM WB

FP instruction IF ID EX MEM WB

Int. instruction IF ID EX MEM WB

FP instruction IF ID EX MEM WB

Int. instruction IF ID EX MEM WB

FP instruction IF ID EX MEM WB

  • 1 cycle load delay expands to 3 instructions in Superscalar
    • instruction in right half can’t use it, nor instructions in next slot
unrolled loop
Unrolled Loop

1 Loop: LD F0,0(R1)

2 LD F6,-8(R1)

3 LD F10,-16(R1)

4 LD F14,-24(R1)

5 ADDD F4,F0,F2

6 ADDD F8,F6,F2

7 ADDD F12,F10,F2

8 ADDD F16,F14,F2

9 SD 0(R1),F4

10 SD -8(R1),F8

11 SD -16(R1),F12

12 SUBI R1,R1,#32

13 BNEZ R1,LOOP

14 SD 8(R1),F16 ; 8-32 = -24

14 clock cycles, or 3.5 per iteration

LD to ADDD: 1 Cycle

ADDD to SD: 2 Cycles

loop unrolling in superscalar
Loop Unrolling in Superscalar

Integer instruction FP instruction Clock cycle

Loop: LD F0,0(R1) 1

LD F6,-8(R1) 2

LD F10,-16(R1) ADDD F4,F0,F2 3

LD F14,-24(R1) ADDD F8,F6,F2 4

LD F18,-32(R1) ADDD F12,F10,F2 5

SD 0(R1),F4 ADDD F16,F14,F2 6

SD -8(R1),F8 ADDD F20,F18,F2 7

SD -16(R1),F12 8

SD -24(R1),F16 9

SUBI R1,R1,#40 10

BNEZ R1,LOOP 11

SD -32(R1),F20 12

  • Unrolled 5 times to avoid delays (+1 due to Superscalar)
  • 12 clocks, or 2.4 clocks per iteration
dynamic scheduling in superscalar
Dynamic Scheduling in Superscalar
  • Dependencies will stop instruction issue
  • Code compiled for non-superscalar will run poorly on superscalar
  • Simple approach
    • Separate Tomasulo control
    • Separate reservation stations for Integer FU/Reg and for FP FU/Reg
dynamic scheduling in superscalar1
Dynamic Scheduling in Superscalar
  • How to issue two dependent instructions in the same cycle? - otherwise why use the dynamic scheduling?
    • Issue 2X Clock Rate, so that issue remains in order
    • More complex issue logic - relatively easy since only FP loads might cause dependency between integer and FP issue
performance of dynamic superscalar
Performance of Dynamic Superscalar

Iteration Instructions Issues Executes Writes result

no. clock-cycle number

1 LD F0,0(R1) 1 2 4

1 ADDD F4,F0,F2 1 5 8

1 SD 0(R1),F4 2 9

1 SUBI R1,R1,#8 3 4 5

1 BNEZ R1,LOOP 4 5

2 LD F0,0(R1) 5 6 8

2 ADDD F4,F0,F2 5 9 12

2 SD 0(R1),F4 6 13

2 SUBI R1,R1,#8 7 8 9

2 BNEZ R1,LOOP 8 9

5 clocks per iteration

Branches, Decrements still take 1 clock cycle

limits of superscalar
Limits of Superscalar
  • While Integer/FP split is simple for the HW, get CPI of 0.5 only for programs with:
    • Exactly 50% FP operations
    • No hazards
  • If more instructions issue at the same time, greater difficulty of decode and issue
    • Even 2-scalar => examine 2 opcodes, 6 register specifiers, & decide if 1 or 2 instructions can issue
  • Issue rates of modern processors vary between 2 and 4 instructions per cycle.
vliw processors
VLIW Processors
  • Very Long Instruction Word (VLIW) processors
    • Tradeoff instruction space for simple decoding
    • The long instruction word has room for many operations
    • By definition, all the operations the compiler puts in the long instruction word can execute in parallel
    • E.g., 2 integer operations, 2 FP ops, 2 Memory refs, 1 branch
      • 16 to 24 bits per field => 7*16 or 112 bits to 7*24 or 168 bits wide
    • Need compiling technique that schedules across several branches (Trace Scheduling)
loop unrolling in vliw
Loop Unrolling in VLIW

Memory Memory FP FP Int. op/ Clockreference 1 reference 2 operation 1 op. 2 branch

LD F0,0(R1) LD F6,-8(R1) 1

LD F10,-16(R1) LD F14,-24(R1) 2

LD F18,-32(R1) LD F22,-40(R1) ADDD F4,F0,F2 ADDD F8,F6,F2 3

LD F26,-48(R1) ADDD F12,F10,F2 ADDD F16,F14,F2 4

ADDD F20,F18,F2 ADDD F24,F22,F2 5

SD 0(R1),F4 SD -8(R1),F8 ADDD F28,F26,F2 6

SD -16(R1),F12 SD -24(R1),F16 7

SD -32(R1),F20 SD -40(R1),F24 SUBI R1,R1,#48 8

SD -0(R1),F28 BNEZ R1,LOOP 9

  • Unrolled 7 times to avoid delays
  • 7 results in 9 clocks, or 1.3 clocks per iteration
  • Need more registers in VLIW
limits of multi issue machines
Limits of Multi-Issue Machines
  • Inherent limitations of ILP
    • 1 branch in 5 instructions => how to keep a 5-way VLIW busy?
    • Latencies of units => many operations must be scheduled
    • Need about Pipeline Depth x No. Functional Units of independent operations to keep the functional units busy
  • Difficulties in building HW
    • Duplicate FUs to get parallel execution
    • Increased # of ports to register file
    • Increased # of ports to memory
    • Instruction issue hardware (a wide spectrum)
limits to multi issue machines
Limits to Multi-Issue Machines
  • Limitations specific to either Superscalar or VLIW implementation
    • Superscalar instruction issue logic
    • VLIW code size: unroll loops + wasted fields
    • VLIW lock step => 1 hazard & all instructions stall
    • VLIW binary compatibility => object-code (binary) translation
hardware based speculation
Hardware-based Speculation
  • Instructions are executed out of order and speculatively but committed in order
  • Instruction commit is separated from instruction completion
  • In-order instruction commit requires a hardware buffer called the reorder buffer
  • The reorder buffer holds the results of an instruction between its completion time and commit time
  • Exceptions are also processed in order (precise exception model)
hardware based speculation1
Hardware-based Speculation

reservation

station 1

EX

FU1

reservation

station 2

reservation

station 3

Write

results

Commit

IF

Issue

reservation

station 1

Wait until the head of the reorder buffer is reached and the result is present

EX

FUn

reservation

station 2

Structural hazard: delaying the issue until there is an empty reservation station and an empty slot in the reorder buffer

RAW data hazard: wait at the reservation station until the values of the source registers are available

hardware based speculation3
Hardware-based Speculation
  • Example Code

LD F6,34(R2)

LD F2,45(R3)

MULTD F0,F2,F4

SUBD F8,F6,F2

DIVD F10,F0,F6

ADDD F6,F8,F2

  • Figure 4.35 in the book
hardware based speculation4
Hardware-based Speculation
  • Advantages
    • Lessens the performance degradation resulting from control hazards
    • Allows a precise exception model since exception conditions can be checked at the instruction commit time
    • Can incorporate hardware-based branch prediction
    • Does not require additional bookkeeping code
    • Does not depend on a good compiler - performs OK with non-optimized code
  • Disadvantage
    • Hardware complexity
summary
Summary
  • Superscalar and VLIW
    • CPI < 1
    • Superscalar is more hardware dependent (dynamic)
    • VLIW is more compiler dependent (static)
    • More instructions issue at same time => larger penalties for hazards
  • Hardware-based speculative execution
    • Minimizes the impact of control hazards on performance
    • Enables a precise exception model for out of order and/or speculative execution