

  1. Chapter 8 Pipelining

  2. Pipelining • A strategy for employing parallelism to achieve better performance • Taking the “assembly line” approach to fetching and executing instructions

  3. The Cycle The control unit repeats: fetch, execute, fetch, execute, and so on.

  4. The Cycle What if there were separate components for fetching the instruction and executing it? Then the fetch unit fetches instructions and the execute unit executes them. So, how about fetching one instruction while executing another?

  5. (Figure: fetch and execute steps laid out against clock cycles.)

  6. Overlapping fetch with execute: a two-stage pipeline.

  7. (Figure: two-stage pipeline timing chart, fetches F1 through F4 overlapped with executes.) Both components are busy during each clock cycle.
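
A minimal sketch of the cycle-count arithmetic behind this overlap, in Python; the function names and the one-cycle-per-stage assumption are mine, not the slides':

    # Total cycles for n instructions, assuming fetch and execute each
    # take one clock cycle.
    def cycles_sequential(n):
        return 2 * n                    # fetch, then execute, one instruction at a time

    def cycles_two_stage(n):
        return n + 1 if n > 0 else 0    # one cycle to fill, then one completion per cycle

    print(cycles_sequential(4))   # 8
    print(cycles_two_stage(4))    # 5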

  8. The Cycle The cycle can be divided into four parts: fetch instruction, decode instruction, execute instruction, write result back to memory. So, how about four components?

  9. The four components operating in parallel

  10. (Figure: interstage buffers between the four units: a buffer for the instruction, a buffer for the operands, and a buffer for the result.)

  11. (Figure: pipeline snapshot. The interstage buffers hold the operands for I2, the operation info for I2, and the write info for I2; instruction I3 is being fetched while the result of instruction I1 is written.)

  12. One clock cycle for each pipeline stage. Therefore the cycle time must be long enough for the longest stage, and a unit is idle if it requires less time than another. Best if all stages are about the same length. Cache memory helps.

  13. Fetching (instructions or data) from main memory may take 10 times as long as an operation such as ADD. Cache memory (especially if on the same chip) allows fetching as quickly as other operations.
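
To see why stage balance matters, here is a back-of-the-envelope calculation; the stage latencies below are invented for illustration:

    # The clock period must cover the slowest stage, so unbalanced
    # stages leave the faster units idle for part of each cycle.
    stage_ns = {"fetch": 2.0, "decode": 1.0, "execute": 2.5, "write": 1.0}

    clock_ns = max(stage_ns.values())        # 2.5 ns: the longest stage sets the cycle time
    unpipelined_ns = sum(stage_ns.values())  # 6.5 ns per instruction without pipelining
    print(unpipelined_ns / clock_ns)         # 2.6x speedup, short of the ideal 4x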

  14. One clock cycle per component, four cycles total to complete an instruction

  15. Completes an instruction each clock cycle; therefore, four times as fast as without the pipeline.
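
A minimal sketch that prints this ideal four-stage timetable; the stage letters and the helper are my own:

    STAGES = ["F", "D", "E", "W"]   # fetch, decode, execute, write

    def timetable(n):
        # Instruction i (0-based) occupies stage s during cycle i + s + 1,
        # so one instruction completes per cycle once the pipeline is full.
        last_cycle = n + len(STAGES) - 1
        for i in range(n):
            row = ["."] * last_cycle
            for s, name in enumerate(STAGES):
                row[i + s] = name
            print(f"I{i + 1}:", " ".join(row))

    timetable(4)   # 4 instructions finish in 4 + 4 - 1 = 7 cycles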

  16. Completes an instruction each clock cycle; therefore, four times as fast as without the pipeline, as long as nothing takes more than one cycle. But sometimes things take longer: most executes, such as ADD, take one cycle, but suppose DIVIDE takes three.

  17. (Figure: a multi-cycle execute leaves the other stages idle: Write has nothing to write, Decode can’t use its “out” buffer, and Fetch can’t use its “out” buffer.)

  18. (Figure: no data reaches Write.) A data “hazard” has caused the pipeline to “stall”.
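
A sketch of that timing effect under my own simplified model (one-cycle fetch, decode, and write; execute may take longer):

    def total_cycles(exec_cycles):
        # exec_cycles[i] = cycles instruction i spends in the execute stage.
        # Execute can't begin before the previous instruction has left it.
        finish = 0
        for i, e in enumerate(exec_cycles):
            start = max(i + 2, finish)   # after this instruction's fetch and decode
            finish = start + e
        return finish + 1                # plus the final write cycle

    print(total_cycles([1, 1, 1, 1]))   # 7: the ideal case
    print(total_cycles([1, 3, 1, 1]))   # 9: a 3-cycle DIVIDE stalls the pipeline 2 cycles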

  19. Instruction I2 was not in the cache and required a main memory access: an instruction “hazard” (or “control hazard”) has caused the pipeline to “stall”.

  20. Structural Hazards • Conflict over use of a hardware resource • Memory: can’t fetch an instruction while another instruction is fetching an operand, for example • Cache: the same, unless the cache has multiple ports or there are separate caches for instructions and data • Register file: one access at a time, again unless it has multiple ports
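
A toy way to count such conflicts; this is a rough model of my own (it ignores how deferred requests pile up into later cycles):

    def stall_cycles(requests_per_cycle, ports=1):
        # Each cycle, requests beyond the number of ports must wait,
        # costing roughly one stall cycle apiece.
        return sum(max(0, want - ports) for want in requests_per_cycle)

    accesses = [2, 1, 2]                     # fetch and operand access colliding twice
    print(stall_cycles(accesses, ports=1))   # 2 stalls with single-ported memory
    print(stall_cycles(accesses, ports=2))   # 0 with two ports (or split I/D caches)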

  21. Structural Hazards • Conflict over use of a hardware resource, such as the register file • Example: LOAD X(R1), R2 (LOAD R2, X(R1) in MIPS) • The memory address is X + [R1], i.e., the displacement X plus the address in R1 • Load that word from memory (cache) into R2
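
The address arithmetic, spelled out in Python; the register and memory contents are invented:

    def load_indexed(memory, regs, X, base, dest):
        address = X + regs[base]       # effective address: X + [R1]
        regs[dest] = memory[address]   # load that word into the destination register

    regs = {"R1": 100, "R2": 0}
    memory = {104: 42}
    load_indexed(memory, regs, X=4, base="R1", dest="R2")   # LOAD X(R1), R2 with X = 4
    print(regs["R2"])   # 42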

  22. (Figure: I2 takes an extra cycle for the cache access as part of execution and is writing to the register file; I3 must wait for the register file to calculate the address; I5’s fetch is delayed.)

  23. Data Hazards • Situations that cause the pipeline to stall because data to be operated on is delayed • execute takes extra cycle, for example

  24. Data Hazards • Or, because of data dependencies • Pipeline stalls because an instruction depends on data from another instruction

  25. Concurrency A ← 3 + A; B ← 4 × A. Can’t be performed concurrently: the result is incorrect if the new value of A is not used. A ← 5 × C; B ← 20 + C. Can be performed concurrently (or in either order) without affecting the result.

  26. Concurrency A ← 3 + A; B ← 4 × A. The second operation depends on completion of the first. A ← 5 × C; B ← 20 + C. The two operations are independent.
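
The same two pairs written as ordinary assignments (the initial values are arbitrary; A2 and B2 stand in for the second pair's A and B):

    A, B, C = 1, 0, 2

    A = 3 + A     # A becomes 4
    B = 4 * A     # must see the new A, so it depends on the line above

    A2 = 5 * C    # uses only C
    B2 = 20 + C   # uses only C: independent, either order gives the same result

    print(A, B, A2, B2)   # 4 16 10 22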

  27. MUL R2, R3, R4 will write its result in R4. ADD R5, R4, R6 depends on the result in R4 from the previous instruction and can’t finish decoding until the result is in R4.

  28. Data Forwarding • Pipeline stalls in previous example waiting for I1’s result to be stored in R4 • Delay can be reduced if result is forwarded directly to I2
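
A minimal sketch of the forwarding idea, not the actual hardware; the register values are invented and the 2-cycle figure assumes the slides' four-stage pipeline:

    def run(program, forward):
        regs = {"R2": 3, "R3": 5, "R4": 0, "R5": 7, "R6": 0}
        bypass = None   # (register, value) of the most recent execute result
        stalls = 0
        for op, src1, src2, dest in program:
            def read(r):
                nonlocal stalls
                if bypass and bypass[0] == r:
                    if forward:
                        return bypass[1]   # take the value off the bypass path
                    stalls += 2            # otherwise wait for write-back
                return regs[r]
            a, b = read(src1), read(src2)
            value = a * b if op == "MUL" else a + b
            regs[dest] = value
            bypass = (dest, value)
        return regs["R6"], stalls

    program = [("MUL", "R2", "R3", "R4"), ("ADD", "R4", "R5", "R6")]
    print(run(program, forward=False))   # (22, 2): same result, 2-cycle stall
    print(run(program, forward=True))    # (22, 0): forwarding removes the stall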

  29. (Figure: timing with a pipeline stall versus timing with data forwarding.)

  30. (Figure: forwarding paths for MUL R2, R3, R4. The operands come from R2 and R3; the result R2 × R3 goes to R4 and is also forwarded directly to I2.)

  31. (Figure: pipeline snapshot of MUL R2, R3, R4 followed by ADD R5, R4, R6. The value R2 × R3 is produced in the execute stage, forwarded to the ADD, and written to R4; the ADD then computes R4 + R5 into R6.)

  32. A 2-cycle stall is introduced by the hardware (if there is no data forwarding). If solved by software:

    MUL R2, R3, R4
    NOOP
    NOOP
    ADD R5, R4, R6
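
A toy version of the software fix: a scheduler of my own invention that pads the hazard window with NOOPs when one instruction's destination feeds the next instruction's sources:

    def insert_noops(program, window=2):
        # program: list of (opcode, (source registers), destination register)
        out = []
        for i, (op, srcs, dest) in enumerate(program):
            out.append((op, srcs, dest))
            if i + 1 < len(program) and dest in program[i + 1][1]:
                out += [("NOOP", (), None)] * window   # pad out the hazard window
        return out

    prog = [("MUL", ("R2", "R3"), "R4"), ("ADD", ("R4", "R5"), "R6")]
    for op, srcs, dest in insert_noops(prog):
        print(op, *srcs, dest or "")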

  33. Side Effects • ADD (R1)+, R2, R3 not only changes the destination register but also autoincrements R1 • ADD R1, R3 followed by ADDWC R2, R4: the add with carry depends on the condition code flag set by the previous ADD, an implicit dependency

  34. Side Effects • Data dependency on something other than the result destination • Multiple dependencies • Pipelining clearly works better if side effects are avoided in the instruction set • Simple instructions

  35. Instruction Hazards • The pipeline depends on a steady stream of instructions from the instruction fetch unit. (Figure: a pipeline stall from a cache miss.)

  36. Decode, execute, and write units are all idle for the “extra” clock cycles

  37. Branch Instructions • Their purpose is to change the content of the PC and fetch another instruction • Consequently, the fetch unit may be fetching an “unwanted” instruction

  38. (Figure: a two-stage pipeline executing SW R1, A; BUN K; LW R5, B. The execute stage computes the new PC value while instruction 3 is being fetched, so instruction 3 is discarded and instruction K is fetched instead.)

  39. The lost cycle is the “branch penalty”.

  40. (Figure: the same branch in a four-stage pipeline. Instruction 3 is fetched and decoded, and instruction 4 is fetched, before the branch resolves; instructions 3 and 4 are discarded and instruction K is fetched.)

  41. In a four-stage pipeline, the penalty is two clock cycles.
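
A back-of-the-envelope look at what that penalty costs on average; the 20% branch frequency is an assumed figure, not from the slides:

    def effective_cpi(branch_fraction, penalty_cycles, base_cpi=1.0):
        # Average cycles per instruction once branch penalties are included.
        return base_cpi + branch_fraction * penalty_cycles

    print(effective_cpi(0.20, 2))   # 1.4: two-cycle penalty on 20% of instructions
    print(effective_cpi(0.20, 1))   # 1.2: cutting the penalty to one cycle helps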

  42. Unconditional Branch Instructions • Reducing the branch penalty requires computing the branch address earlier • Hardware in the fetch and decode units identifies branch instructions and computes the branch target address (instead of doing it in the execute stage)

  43. (Figure: the branch is identified during decode, so only one instruction is fetched and discarded; the penalty is reduced to one cycle.)

  44. Instruction Queue and Prefetching • Fetching instructions into a “queue” • Dispatch unit (added to decode) to take instructions from queue • Enlarging the “buffer” zone between fetch and decode
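
A minimal sketch of the queue between fetch and dispatch; the capacity and all names are mine:

    from collections import deque

    queue = deque()   # the enlarged "buffer zone" between fetch and decode
    CAPACITY = 4

    def fetch(instr):
        if len(queue) < CAPACITY:
            queue.append(instr)   # prefetch whenever there is room

    def dispatch():
        return queue.popleft() if queue else None   # None models a stall

    for i in range(1, 4):
        fetch(f"I{i}")
    print(dispatch(), dispatch(), dispatch(), dispatch())   # I1 I2 I3 None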

  45. (Figure: the interstage buffers again, with the instruction queue enlarging the buffer between fetch and decode: buffer for instruction, buffer for operands, buffer for result.)
