
Advanced Topics in Pipelining


rhutchinson

Presentation Transcript


  1. Advanced Topics in Pipelining
     • Two methods to exploit instruction-level parallelism
     • Superpipelining: longer (deeper) pipelines
       • The ideal speedup is equal to the number of pipeline stages.
       • 8 or more pipeline stages are common in modern processors.
     • Superscalar: multiple issue (CPI can be less than one)
       • Instruction execution rate exceeds the clock rate.
       • 6 GHz four-way multiple issue => CPI = 0.25, IPC = 4
       • 24 billion instructions/second
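The superscalar arithmetic on this slide can be checked with a short sketch. The function names are illustrative and not from the presentation, and the result is an ideal peak rate that assumes every issue slot is filled on every cycle.

```python
def cpi_from_ipc(ipc: float) -> float:
    """CPI is the reciprocal of IPC."""
    return 1.0 / ipc


def peak_instructions_per_second(clock_hz: float, issue_width: int) -> float:
    """Ideal peak rate: every cycle issues `issue_width` instructions (no stalls assumed)."""
    return clock_hz * issue_width


# 6 GHz clock, four-way multiple issue, as on the slide.
print(cpi_from_ipc(4))                       # CPI
print(peak_instructions_per_second(6e9, 4))  # instructions per second
```

Real programs fall short of this peak because dependencies and cache misses leave issue slots empty.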

  2. Static Multiple Issue
     • Two-issue 5-stage MIPS processor
       • Each issue packet pairs one R-type or branch instruction with one load or store.
     • VLIW concept
       • The compiler removes dependencies between the two instructions in each pair.

  3. Static Two-Issue MIPS

  4. Example: Static Two-Issue MIPS 1/2
     • Extra read and write ports are needed on the register file.
     • Data dependencies result in more serious stalls.
       • In a superscalar pipeline, the next two instructions cannot use the result of a lw instruction without stalling.
     • Example:
         Loop: lw   $t0, 0($s1)
               addu $t0, $t0, $s2
               sw   $t0, 0($s1)
               addi $s1, $s1, -4
               bne  $s1, $zero, Loop
     • Reorder the instructions to avoid as many pipeline stalls as possible.

  5. Example: Static Two-Issue MIPS 2/2
         Loop: lw   $t0, 0($s1)
               addu $t0, $t0, $s2
               sw   $t0, 0($s1)
               addi $s1, $s1, -4
               bne  $s1, $zero, Loop
     • The 5 instructions issue in 4 clock cycles: CPI = 4/5 = 0.8 => IPC = 1.25

  6. Loop Unrolling 1/2
     • Reordered loop:
         Loop: lw   $t0, 0($s1)
               addi $s1, $s1, -4
               addu $t0, $t0, $s2
               sw   $t0, 4($s1)
               bne  $s1, $zero, Loop
     • Loop unrolled four times:
         Loop: lw   $t0, 0($s1)
               addi $s1, $s1, -16
               lw   $t1, 12($s1)
               addu $t0, $t0, $s2
               lw   $t2, 8($s1)
               addu $t1, $t1, $s2
               lw   $t3, 4($s1)
               addu $t2, $t2, $s2
               sw   $t0, 16($s1)
               addu $t3, $t3, $s2
               sw   $t1, 12($s1)
               sw   $t2, 8($s1)
               sw   $t3, 4($s1)
               bne  $s1, $zero, Loop
     • Register renaming: the unrolled copies use $t1, $t2, and $t3 instead of reusing $t0.
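The transformation on this slide can be sketched at a higher level. The following is a minimal Python model, not the presentation's code: the function names are hypothetical, the list `a` stands in for the memory walked by $s1, and the element count is assumed to be a multiple of four.

```python
def add_scalar_rolled(a, s2):
    """One element per iteration, mirroring the original loop."""
    a = list(a)
    i = len(a) - 1
    while i >= 0:
        a[i] += s2
        i -= 1
    return a


def add_scalar_unrolled4(a, s2):
    """Four elements per iteration; assumes len(a) is a multiple of 4."""
    if len(a) % 4 != 0:
        raise ValueError("length must be a multiple of 4")
    a = list(a)
    i = len(a) - 1
    while i >= 0:
        # Four loads into distinct temporaries (the register renaming step).
        t0, t1, t2, t3 = a[i], a[i - 1], a[i - 2], a[i - 3]
        # Four independent additions that a scheduler is free to interleave.
        t0, t1, t2, t3 = t0 + s2, t1 + s2, t2 + s2, t3 + s2
        # Four stores, then a single loop-control update per four elements.
        a[i], a[i - 1], a[i - 2], a[i - 3] = t0, t1, t2, t3
        i -= 4
    return a


print(add_scalar_unrolled4([1, 2, 3, 4, 5, 6, 7, 8], 10))
```

Distinct temporaries matter because reusing a single one would serialize the four copies through a name dependence, even though they share no data.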

  7. Loop Unrolling 2/2
     • CPI = 8/14 ≈ 0.57 => IPC = 1.75
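The CPI and IPC figures for the two-issue loops can be reproduced with a few lines. This is a sketch using only the cycle and instruction counts quoted on the slides; the helper names are illustrative.

```python
def cpi(cycles: int, instructions: int) -> float:
    """Clock cycles per instruction."""
    return cycles / instructions


def ipc(cycles: int, instructions: int) -> float:
    """Instructions per clock cycle."""
    return instructions / cycles


# Scheduled loop: 5 instructions in 4 cycles, one array element per iteration.
print(cpi(4, 5), ipc(4, 5))
# Unrolled loop: 14 instructions in 8 cycles, four array elements per iteration.
print(round(cpi(8, 14), 2), ipc(8, 14))
# Cycles per array element: 4 before unrolling, 8 / 4 = 2 after.
print(4 / 1, 8 / 4)
```

The cycles-per-element comparison is the more telling number: unrolling halves the cycles spent per element, partly by removing three of every four addi/bne pairs.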

  8. Speculation
     • Guess an outcome (for example, the direction of a branch) and execute instructions based on that guess.
     • Can be done by the compiler or by hardware
       • The compiler reorders the instructions.
     • A recovery mechanism fixes things up when the speculation turns out to be wrong.
     • The results of speculative execution are kept in temporary buffers until they are no longer speculative.
       • They are committed when the speculation is correct
       • and discarded otherwise.
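The buffer-then-commit idea can be illustrated with a small model. This is a minimal sketch, not a description of any particular processor: the class `SpeculativeBuffer` and its methods are hypothetical, and a dictionary stands in for the register file.

```python
class SpeculativeBuffer:
    """Holds speculative register writes apart from the architectural state."""

    def __init__(self, registers):
        self.registers = registers  # architectural (committed) state
        self.pending = {}           # results that are still speculative

    def write(self, reg, value):
        # Speculative results go to the temporary buffer only.
        self.pending[reg] = value

    def read(self, reg):
        # Speculative instructions see earlier speculative results first.
        return self.pending.get(reg, self.registers[reg])

    def commit(self):
        # Speculation was correct: make the results architecturally visible.
        self.registers.update(self.pending)
        self.pending.clear()

    def discard(self):
        # Speculation was wrong: drop the results, registers are untouched.
        self.pending.clear()


regs = {"$t0": 1}
buf = SpeculativeBuffer(regs)
buf.write("$t0", 99)
buf.discard()          # misprediction
print(regs["$t0"])
```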

  9. IA-64 Architecture
     • RISC-style instruction set
       • almost like a 64-bit MIPS
     • Differences
       • IA-64 has more registers (128 integer, 128 floating-point, 8 special registers for branches).
       • IA-64 places instructions into groups or bundles (VLIW).
       • IA-64 includes special capabilities for speculation and branch elimination.
     • Predication: branch elimination
       • Loop unrolling does not help with if-then-else statements.

  10. Predication in IA-64
      • 64 1-bit predicate registers
      • Example:
                CMP Ra, Rb
                JNE else
                MOV Ra, 0
                JMP end
          else: MOV Ra, Rb
          end:  whatever
      • Code with predicates:
          CMPEQ Ra, Rb, P1/P2
          [P1] MOV Ra, 0
          [P2] MOV Ra, Rb
      • If the predicate is not true, the instruction becomes a nop.
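The equivalence of the two code sequences on this slide can be checked with a small model. This is a sketch in Python rather than IA-64 assembly; the function names are illustrative, and each predicated instruction is modelled as a conditional assignment that leaves its destination unchanged when the predicate is false.

```python
def branch_version(ra, rb):
    """Models the CMP / JNE / MOV / JMP sequence."""
    if ra != rb:   # JNE else
        ra = rb    # else: MOV Ra, Rb
    else:
        ra = 0     # MOV Ra, 0
    return ra


def predicated_version(ra, rb):
    """Models CMPEQ Ra, Rb, P1/P2 followed by two predicated moves."""
    p1 = ra == rb            # CMPEQ sets P1 ...
    p2 = not p1              # ... and its complement P2
    ra = 0 if p1 else ra     # [P1] MOV Ra, 0   (nop when P1 is false)
    ra = rb if p2 else ra    # [P2] MOV Ra, Rb  (nop when P2 is false)
    return ra


print(branch_version(5, 5), predicated_version(5, 5))
print(branch_version(5, 8), predicated_version(5, 8))
```

Both predicated moves are always issued, so the pipeline has no branch to mispredict; the cost is that one of the two slots does no useful work.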

  11. Predicates in ARM
      • Almost all instructions can be conditionally executed.
      • Thirteen different predicates are available, each depending on the four flags Carry, Overflow, Zero, and Negative in some way.
      • Every instruction reserves a bit-field for the predicate, which specifies whether that instruction should have an effect.
      • The ARM's 16-bit Thumb instruction set has no branch predication, in order to save encoding space.

  12. IA-64 Characteristics
      • Itanium: 3.2 GFLOPS
      • Itanium: 6.67 GFLOPS

  13. Dynamic Pipeline Scheduling
      • Dynamic pipeline scheduling is a hardware mechanism to avoid pipeline stalls.
      • Example:
          lw   $t0, 20($s2)
          addu $t1, $t0, $t2
          sub  $s4, $s4, $t3
          slti $t5, $s4, 20
      • Even though addu has to wait for lw to complete, the following two instructions can be started.
      • Out-of-order execution => more complicated pipeline control.
      • Dynamic pipeline scheduling looks past a stalled instruction to find later instructions to execute while the stall is being resolved.
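The claim that sub and slti can start while addu waits follows from a read-after-write dependence check, which can be sketched as below. The `Instr` tuple and `waits_on_load` function are hypothetical helpers; the sketch tracks only true (read-after-write) dependences on the load result and ignores structural hazards.

```python
from typing import NamedTuple, Tuple


class Instr(NamedTuple):
    text: str
    dest: str
    srcs: Tuple[str, ...]


program = [
    Instr("lw   $t0, 20($s2)", "$t0", ("$s2",)),
    Instr("addu $t1, $t0, $t2", "$t1", ("$t0", "$t2")),
    Instr("sub  $s4, $s4, $t3", "$s4", ("$s4", "$t3")),
    Instr("slti $t5, $s4, 20", "$t5", ("$s4",)),
]


def waits_on_load(program, load_index=0):
    """Indices of later instructions that need the load result, directly or transitively."""
    tainted = {program[load_index].dest}  # registers whose value depends on the load
    blocked = []
    for i in range(load_index + 1, len(program)):
        ins = program[i]
        if tainted & set(ins.srcs):
            blocked.append(i)
            tainted.add(ins.dest)
        else:
            # An independent instruction overwrites its destination with a ready value.
            tainted.discard(ins.dest)
    return blocked


blocked = waits_on_load(program)
for i, ins in enumerate(program[1:], start=1):
    print(ins.text, "-> waits" if i in blocked else "-> can start")
```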

  14. Dynamic Pipeline Scheduling
      [Figure: an instruction fetch and decode unit issues in order to four reservation stations; these feed the functional units (integer, integer, floating point, load/store), which execute out of order; a commit unit with reorder buffers commits in order.]

  15. Dynamic Pipeline Scheduling
      • 5 to 10 functional units, each with reservation stations (RS) that hold the operands and the operation.
      • When a reservation station holds all the operands and the functional unit is ready to execute, the result is calculated.
        • If necessary, the result is sent to other reservation stations that are waiting for it.
      • The commit unit decides when it is safe to put the result into the register file or into memory (committing).
      • Completion methods: in-order completion and out-of-order completion.
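In-order commit after out-of-order execution can be illustrated with a toy reorder buffer. This is a minimal sketch under simplifying assumptions: the class `ReorderBuffer` is hypothetical, instructions are identified by unique string tags, and committing simply returns the results in program order instead of writing a register file.

```python
from collections import deque


class ReorderBuffer:
    """Entries are kept in program order; results may arrive in any order."""

    def __init__(self):
        self.entries = deque()

    def issue(self, tag):
        # In-order issue: allocate an entry at the tail.
        self.entries.append({"tag": tag, "done": False, "value": None})

    def complete(self, tag, value):
        # Out-of-order execution: any entry may finish at any time.
        for entry in self.entries:
            if entry["tag"] == tag:
                entry["done"] = True
                entry["value"] = value
                return
        raise KeyError(tag)

    def commit(self):
        # In-order commit: retire from the head only while the head is done.
        committed = []
        while self.entries and self.entries[0]["done"]:
            entry = self.entries.popleft()
            committed.append((entry["tag"], entry["value"]))
        return committed


rob = ReorderBuffer()
for tag in ("lw", "addu", "sub"):
    rob.issue(tag)
rob.complete("sub", 3)   # sub finishes first
print(rob.commit())      # nothing retires: lw is still at the head
rob.complete("lw", 1)
print(rob.commit())      # lw retires; addu now blocks sub
```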

  16. Pentium 4
      • After being fetched, IA-32 instructions are translated into microoperations.
      • Microoperations
        • dynamically scheduled
        • speculative pipelining
        • issue rate: three microoperations per cycle
      • Deep pipelining
        • 20 stages
      • 7 functional units
      • Support for 126 outstanding operations
      • Trace cache

  17. Pentium 3 vs. Pentium 4

  18. Pentium 4 Datapath
      [Figure: components shown are instruction prefetch and decode, branch prediction, trace cache, microoperation queue, dispatch and register renaming, register file, memory operation queue, integer and floating-point operation queue, functional units (complex instruction, integer, floating point, load, integer, store), commit unit, and data cache.]

  19. Datapath Comparison 1/2
      [Figure: datapath designs placed along two axes, clock rate (slower to faster) and IPC (slower to faster). Designs shown: single-cycle, multi-cycle, pipelined, deeply pipelined, multiple-issue pipelined, multiple-issue deep pipelined.]

  20. Datapath Comparison 2/2
      [Figure: datapath designs placed along two axes, hardware (shared to specialized) and latency in instructions (1 to several). Designs shown: single-cycle, multi-cycle, pipelined, deeply pipelined, multiple-issue pipelined, multiple-issue deep pipelined.]
