
Instruction Set Architecture & Pipelining

CS 505: Computer Architecture, Spring 2005. Thu D. Nguyen.


Presentation Transcript


  1. Instruction Set Architecture & Pipelining. CS 505: Computer Architecture, Spring 2005. Thu D. Nguyen

  2. Instruction Set Architecture (ISA) [Figure: layers, top to bottom: software, instruction set, hardware] CS 505: Computer Structures

  3. Instruction Set Architecture (ISA) [Figure: the same software / instruction set / hardware layers, with Higher-Level Languages and the compiler added]

  4. Classes of ISAs

  5. Review: Basic ISA Classes
     • Accumulator:
       • 1 address: add A; acc ← acc + mem[A]
       • 1+x address: addx A; acc ← acc + mem[A + x]
     • Stack:
       • 0 address: add; tos ← tos + next
     • General Purpose Register:
       • 2 address: add A B; EA(A) ← EA(A) + EA(B)
       • 3 address: add A B C; EA(A) ← EA(B) + EA(C)
     • Load/Store:
       • 3 address: add Ra Rb Rc; Ra ← Rb + Rc
       • load Ra Rb; Ra ← mem[Rb]
       • store Ra Rb; mem[Rb] ← Ra
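To make the difference between the classes concrete, here is a minimal sketch of three toy interpreters (accumulator, stack, load/store) that each compute C = A + B. Everything in it is an assumption made for illustration: the tuple-based instruction encoding, the `load`/`store`/`push`/`pop` mnemonics, the dictionary standing in for memory, and the use of direct memory addresses in the load/store variant (the slide uses register-indirect addressing).

```python
def run_accumulator(mem, prog):
    """One implicit register (acc); every arithmetic op names one memory address."""
    acc = 0
    for op, addr in prog:
        if op == "load":
            acc = mem[addr]
        elif op == "add":
            acc = acc + mem[addr]      # acc <- acc + mem[A]
        elif op == "store":
            mem[addr] = acc
    return mem


def run_stack(mem, prog):
    """Operands are implicit: add pops two values and pushes their sum."""
    stack = []
    for instr in prog:
        op = instr[0]
        if op == "push":
            stack.append(mem[instr[1]])
        elif op == "add":
            stack.append(stack.pop() + stack.pop())   # tos <- tos + next
        elif op == "pop":
            mem[instr[1]] = stack.pop()
    return mem


def run_load_store(mem, prog):
    """Arithmetic touches registers only; memory is reached via load/store."""
    regs = {}
    for op, *args in prog:
        if op == "load":
            regs[args[0]] = mem[args[1]]
        elif op == "add":
            ra, rb, rc = args
            regs[ra] = regs[rb] + regs[rc]            # Ra <- Rb + Rc
        elif op == "store":
            mem[args[1]] = regs[args[0]]
    return mem


ACC_PROG = [("load", "A"), ("add", "B"), ("store", "C")]
STACK_PROG = [("push", "A"), ("push", "B"), ("add",), ("pop", "C")]
LS_PROG = [("load", "r1", "A"), ("load", "r2", "B"),
           ("add", "r3", "r1", "r2"), ("store", "r3", "C")]
```

All three programs leave the same value in `mem["C"]`. The load/store version needs more instructions than the accumulator version, but each of its instructions does only one simple thing.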

  6. Evolution of Instruction Sets
     • Single Accumulator (EDSAC 1950)
     • Accumulator + Index Registers (Manchester Mark I, IBM 700 series 1953)
     • Separation of Programming Model from Implementation
       • High-level Language Based (B5000 1963)
       • Concept of a Family (IBM 360 1964)
     • General Purpose Register Machines
       • Complex Instruction Sets (Vax, Intel 432 1977-80)
       • Load/Store Architecture (CDC 6600, Cray 1 1963-76)
     • RISC (Mips, Sparc, HP-PA, IBM RS6000, . . . 1987)

  7. Issues in Instruction Set Design
     • Opcodes
     • Memory addressing
     • Type and size of operands
     • Encoding
     • Implementation (pipelining, exploiting ILP)

  8. Addressing Modes
     • Too many to count?
     • Register, Immediate, Displacement, Register indirect, Indexed, Direct, Memory indirect, Autoincrement, Autodecrement, Scaled

  9. Usage of Addressing Modes

  10. Displacement

  11. Immediate Usage

  12. Size of Immediate Operands

  13. Quantitative Design Methodology
     • The previous study is an example of quantitative design methodology
     • The rest of the data and results are left for you to read in the book
     • Data lead to many of the design decisions embodied in today’s RISC processors
     • So how come Intel (and AMD) are so successful when going against this trend?

  14. VAX-11: the canonical CISC
     • Variable format, 2 and 3 address instructions
     • Rich set of orthogonal address modes
       • immediate, offset, indexed, autoinc/dec, indirect, indirect+offset
       • applied to any operand
     • Simple and complex instructions
       • synchronization instructions
       • data structure operations (queues)
       • polynomial evaluation

  15. Review: Load/Store Architectures
     • 3 address GPR
     • Register to register arithmetic
     • Load and store with simple addressing modes (reg + immediate)
     • Simple conditionals
       • compare ops + branch z
       • compare&branch
       • condition code + branch on condition
     • Simple fixed-format encoding: op r r r; op r r immed; op offset
     • Substantial increase in instructions
     • Decrease in data BW (due to many registers)
     • Even more significant decrease in CPI (pipelining)
     • Cycle time, Real estate, Design time, Design complexity
     [Figure: MEM and reg blocks]

  16. Case Study: MIPS
     • Simple load-store instruction set
     • Designed for pipelining efficiency
     • Efficient as a compiler target

  17. MIPS
     • 32 64-bit GPRs
       • R0 is always 0
     • 32 FPRs (capable of holding double-precision 64-bit values)
     • Data types: 8-bit bytes, 16-bit half words, 32-bit words, 64-bit double words, 32-bit and 64-bit single/double precision floating point
     • Addressing modes: immediate & displacement
       • 16-bit fields
       • Register Indirect? Absolute addressing?

  18. MIPS Instruction Format

  19. MIPS Instruction Set
     • Arithmetic/logical
       • Add, AddU, Sub, SubU, And, Or, Xor, Nor, SLT, SLTU
       • AddI, AddIU, SLTI, SLTIU, AndI, OrI, XorI, LUI
       • SLL, SRL, SRA, SLLV, SRLV, SRAV
     • Memory Access
       • LB, LBU, LH, LHU, LW, LWL, LWR
       • SB, SH, SW, SWL, SWR
     • Control
       • J, JAL, JR, JALR
       • BEq, BNE, BLEZ, BGTZ, BLTZ, BGEZ, BLTZAL, BGEZAL

  20. Instruction Usage

  21. Execution Cycle
     • Instruction Fetch: Obtain instruction from program storage
     • Instruction Decode: Determine required actions and instruction size
     • Operand Fetch: Locate and obtain operand data
     • Execute: Compute result value or status
     • Result Store: Deposit results in storage for later use

  22. What’s a Clock Cycle?
     • Old days: 10 levels of gates
     • Today: determined by numerous time-of-flight issues + gate delays
       • clock propagation, wire lengths, etc.
     [Figure: latch or register feeding combinational logic]

  23. Fast, Pipelined Instruction Interpretation [Figure: stages Instruction Fetch, Decode & Operand Fetch, Execute, and Store Results, separated by the Instruction Register, Operand Registers, and Result Registers (results go to Registers or Mem); along the time axis, five instructions overlap, each passing through IF, D, E, W]

  24. Sequential Laundry
     • Sequential laundry takes 6 hours for 4 loads
     • If they learned pipelining, how long would laundry take?
     [Figure: timeline from 6 PM to midnight; loads A, B, C, D run one after another in task order, each taking 30, 40, then 20 minutes]

  25. Pipelined Laundry: Start work ASAP
     • Pipelined laundry takes 3.5 hours for 4 loads
     [Figure: timeline from 6 PM to midnight; loads A, B, C, D overlap in task order, with intervals of 30, 40, 40, 40, 40, 20 minutes along the time axis]
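The 6-hour and 3.5-hour figures on the laundry slides can be checked with a short calculation. The sketch below uses the stage times from the slides (30, 40, 20 minutes). The stage names in the comment, the function names, and the assumption that a finished load can wait between stages until the next stage is free are illustrative choices, not taken from the slides.

```python
STAGE_MINUTES = [30, 40, 20]   # assumed to be washer, dryer, folder


def sequential_minutes(stages, loads):
    """Each load runs through every stage before the next load starts."""
    return loads * sum(stages)


def pipelined_minutes(stages, loads):
    """Each stage starts a load as soon as the load and the stage are both ready."""
    stage_free_at = [0] * len(stages)   # time at which each stage becomes idle
    for _ in range(loads):
        ready = 0                        # time this load leaves the previous stage
        for s, duration in enumerate(stages):
            start = max(ready, stage_free_at[s])
            stage_free_at[s] = start + duration
            ready = stage_free_at[s]
    return stage_free_at[-1]


seq = sequential_minutes(STAGE_MINUTES, 4)    # 360 minutes = 6 hours
pipe = pipelined_minutes(STAGE_MINUTES, 4)    # 210 minutes = 3.5 hours
```

The pipelined total is 30 + 4 × 40 + 20 minutes: after the first wash, the 40-minute stage is the bottleneck for every load.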

  26. Pipelining Lessons
     • Pipelining doesn’t help latency of single task, it helps throughput of entire workload
     • Pipeline rate limited by slowest pipeline stage
     • Multiple tasks operating simultaneously
     • Potential speedup = Number pipe stages
     • Unbalanced lengths of pipe stages reduce speedup
     • Time to “fill” pipeline and time to “drain” it reduce speedup
     [Figure: pipelined laundry timeline from 6 PM to 9 PM; loads A, B, C, D in task order, with intervals of 30, 40, 40, 40, 40, 20 minutes]
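The fill and drain effect in the lessons above can be put into numbers. This sketch assumes an idealized pipeline with k perfectly balanced stages of one cycle each and no stalls. The function name and that idealization are assumptions made for illustration.

```python
def pipeline_speedup(stages, tasks):
    """Speedup of a balanced, stall-free pipeline over unpipelined execution."""
    unpipelined_cycles = tasks * stages
    # The first task needs `stages` cycles to fill the pipe; each later task
    # completes one cycle after the one before it.
    pipelined_cycles = stages + (tasks - 1)
    return unpipelined_cycles / pipelined_cycles


single = pipeline_speedup(5, 1)       # one task: no speedup at all
many = pipeline_speedup(5, 1000)      # many tasks: close to the 5x limit
```

With a single task the speedup is exactly 1, which is the "doesn't help latency" point. As the number of tasks grows, the speedup approaches the number of stages but does not reach it.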

  27. Instruction Pipelining
     • Execute billions of instructions, so throughput is what matters
     • What is desirable in instruction sets for pipelining?
       • Variable length instructions vs. all instructions same length?
       • Memory operands part of any operation vs. memory operands only in loads or stores?
       • Register operand many places in instruction format vs. registers located in same place?

  28. Example: MIPS (Note register location)
     • Register-Register: Op [31:26] | Rs1 [25:21] | Rs2 [20:16] | Rd [15:11] | Opx [10:0] (the figure also marks a boundary between bits 6 and 5)
     • Register-Immediate: Op [31:26] | Rs1 [25:21] | Rd [20:16] | immediate [15:0]
     • Branch: Op [31:26] | Rs1 [25:21] | Rs2/Opx [20:16] | immediate [15:0]
     • Jump / Call: Op [31:26] | target [25:0]
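Because the register fields sit at the same bit positions in every format, hardware can extract them before it knows which format applies. The sketch below shows that idea in software. It assumes the bit ranges listed on the slide, treats Opx as the whole 11-bit field [10:0], and sign-extends the 16-bit immediate. The helper names and the two example words are hypothetical and are built from field values, not taken from the slides.

```python
def bits(word, hi, lo):
    """Return bits hi..lo (inclusive) of a 32-bit word as an unsigned integer."""
    return (word >> lo) & ((1 << (hi - lo + 1)) - 1)


def sign_extend(value, width):
    """Interpret a width-bit unsigned value as a two's-complement integer."""
    sign_bit = 1 << (width - 1)
    return (value ^ sign_bit) - sign_bit


def decode_fields(word):
    """Extract every field position; the opcode decides which ones are meaningful."""
    return {
        "op": bits(word, 31, 26),
        "rs1": bits(word, 25, 21),
        "rs2_or_rd": bits(word, 20, 16),   # Rs2 (reg-reg, branch) or Rd (reg-imm)
        "rd": bits(word, 15, 11),          # Rd for register-register only
        "opx": bits(word, 10, 0),
        "immediate": sign_extend(bits(word, 15, 0), 16),
        "target": bits(word, 25, 0),
    }


# Hypothetical example words assembled from field values.
reg_reg = (0 << 26) | (2 << 21) | (3 << 16) | (1 << 11) | 0x20
reg_imm = (8 << 26) | (2 << 21) | (1 << 16) | 0xFFFF

fields_rr = decode_fields(reg_reg)   # rs1 = 2, rs2 = 3, rd = 1, opx = 0x20
fields_ri = decode_fields(reg_imm)   # rs1 = 2, rd = 1, immediate = -1
```

Note that `rs1` is read from bits [25:21] in both examples without consulting `op` first, which is what lets register fetch overlap with decode.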

  29. 5 Steps of MIPS Datapath (Figure 3.1, Page 130, CA:AQA 2e) [Figure: datapath in five stages: Instruction Fetch; Instr. Decode / Reg. Fetch; Execute / Addr. Calc; Memory Access; Write Back. Labeled components: Next PC, Next SEQ PC, Adder (+4), Memory (Address, Inst), Reg File (RS1, RS2, RD), Sign Extend (Imm), ALU, Zero?, Data Memory, LMD, MUXes, WB Data]

  30. 5 Steps of MIPS Datapath (Figure 3.4, Page 134, CA:AQA 2e) [Figure: the same five-stage datapath with pipeline registers IF/ID, ID/EX, EX/MEM, and MEM/WB between stages; Next SEQ PC and RD are carried forward through the pipeline registers]

  31. Visualizing Pipelining (Figure 3.3, Page 133, CA:AQA 2e) [Figure: instruction order against time in clock cycles 1 to 7; four instructions each pass through Ifetch, Reg, ALU, DMem, Reg, each starting one cycle after the previous one]

  32. It’s Not That Easy for Computers
     • Limits to pipelining: Hazards prevent next instruction from executing during its designated clock cycle
     • Structural hazards: HW cannot support this combination of instructions (single person to fold and put clothes away)
     • Data hazards: Instruction depends on result of prior instruction still in the pipeline (missing sock)
     • Control hazards: Caused by delay between the fetching of instructions and decisions about changes in control flow (branches and jumps)

  33. Example: One Memory Port/Structural Hazard (Figure 3.6, Page 142, CA:AQA 2e) [Figure: instruction order against time in clock cycles 1 to 7 for Load, Instr 1, Instr 2, Instr 3, Instr 4; the Load's DMem access coincides with a later instruction's Ifetch, marked as a structural hazard]

  34. Resolving structural hazards
     • Structural hazards: attempt to use same hardware for two different things at the same time
     • Solution 1: Wait
       • must detect the hazard
       • must have mechanism to stall
     • Solution 2: Throw more hardware at the problem

  35. Detecting and Resolving Structural Hazard [Figure: instruction order against time in clock cycles 1 to 7 for Load, Instr 1, Instr 2, Stall, Instr 3; a row of bubbles is inserted so that Instr 3's Ifetch no longer coincides with the Load's DMem access]

  36. Eliminating Structural Hazards at Design Time [Figure: pipelined datapath with pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB and separate Instr Cache and Data Cache; the Datapath and Control Path are labeled]

  37. Role of Instruction Set Design in Structural Hazard Resolution
     • Simple to determine the sequence of resources used by an instruction
       • opcode tells it all
     • Uniformity in the resource usage
     • MIPS approach => all instructions flow through same 5-stage pipeline

  38. Data Hazards (Figure 3.9, page 147, CA:AQA 2e) [Figure: instruction order against time in clock cycles, with stages IF, ID/RF, EX, MEM, WB, for the sequence below; every instruction after the add reads r1]
         add r1,r2,r3
         sub r4,r1,r3
         and r6,r1,r7
         or r8,r1,r9
         xor r10,r1,r11

  39. Three Generic Data Hazards
     • Read After Write (RAW): InstrJ tries to read operand before InstrI writes it
         I: add r1,r2,r3
         J: sub r4,r1,r3
     • Caused by a “Data Dependence” (in compiler nomenclature). This hazard results from an actual need for communication.

  40. Three Generic Data Hazards
     • Write After Read (WAR): InstrJ writes operand before InstrI reads it
         I: sub r4,r1,r3
         J: add r1,r2,r3
         K: mul r6,r1,r7
     • Called an “anti-dependence” by compiler writers. This results from reuse of the name “r1”.
     • Can’t happen in MIPS 5 stage pipeline because:
       • All instructions take 5 stages, and
       • Reads are always in stage 2, and
       • Writes are always in stage 5

  41. Three Generic Data Hazards
     • Write After Write (WAW): InstrJ writes operand before InstrI writes it
         I: sub r1,r4,r3
         J: add r1,r2,r3
         K: mul r6,r1,r7
     • Called an “output dependence” by compiler writers. This also results from the reuse of name “r1”.
     • Can’t happen in MIPS 5 stage pipeline because:
       • All instructions take 5 stages, and
       • Writes are always in stage 5
     • Will see WAR and WAW in later more complicated pipes
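The three cases can be told apart by comparing register names alone. The sketch below classifies the dependence between an earlier instruction I and a later instruction J, using the I/J pairs from the three slides. The `(dest, src1, src2)` tuple encoding and the function name are assumptions made for illustration. The function reports dependences only: whether one becomes an actual hazard depends on the pipeline, as the slides note for the 5-stage MIPS pipe.

```python
def classify_dependences(first, second):
    """Return the dependence types between an earlier and a later instruction.

    Each instruction is a (dest, src1, src2) tuple of register names.
    """
    dest_i, *srcs_i = first
    dest_j, *srcs_j = second
    kinds = []
    if dest_i in srcs_j:
        kinds.append("RAW")   # J reads what I writes (data dependence)
    if dest_j in srcs_i:
        kinds.append("WAR")   # J writes what I reads (anti-dependence)
    if dest_i == dest_j:
        kinds.append("WAW")   # both write the same register (output dependence)
    return kinds


# I: add r1,r2,r3   J: sub r4,r1,r3
raw = classify_dependences(("r1", "r2", "r3"), ("r4", "r1", "r3"))
# I: sub r4,r1,r3   J: add r1,r2,r3
war = classify_dependences(("r4", "r1", "r3"), ("r1", "r2", "r3"))
# I: sub r1,r4,r3   J: add r1,r2,r3
waw = classify_dependences(("r1", "r4", "r3"), ("r1", "r2", "r3"))
```

Only the RAW case reflects a real flow of data. The WAR and WAW cases come from reusing the name r1 and would disappear if J wrote a different register.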

  42. Forwarding to Avoid Data Hazard (Figure 3.10, Page 149, CA:AQA 2e) [Figure: instruction order against time in clock cycles for the sequence below; the add's result is forwarded to the later instructions that read r1]
         add r1,r2,r3
         sub r4,r1,r3
         and r6,r1,r7
         or r8,r1,r9
         xor r10,r1,r11

  43. HW Change for Forwarding (Figure 3.20, Page 161, CA:AQA 2e) [Figure: ID/EX, EX/MEM, and MEM/WR pipeline registers around the ALU and Data Memory, with muxes on the ALU inputs selecting among Registers, Immediate, NextPC, and forwarded values]

  44. Data Hazard Even with Forwarding (Figure 3.12, Page 153, CA:AQA 2e) [Figure: instruction order against time in clock cycles for the sequence below; the loaded value is not available in time for the sub that immediately follows the lw]
         lw r1, 0(r2)
         sub r4,r1,r6
         and r6,r1,r7
         or r8,r1,r9

  45. Resolving this load hazard
     • Adding hardware? ... not
     • Detection?
     • Compilation techniques?
     • What is the cost of load delays?

  46. Resolving the Load Data Hazard [Figure: instruction order against time in clock cycles for the sequence below; a bubble is inserted after the lw, delaying the sub and every later instruction by one cycle]
         lw r1, 0(r2)
         sub r4,r1,r6
         and r6,r1,r7
         or r8,r1,r9

  47. Software Scheduling to Avoid Load Hazards
     Try producing fast code for a = b + c; d = e - f; assuming a, b, c, d, e, and f in memory.
     Slow code:
         LW  Rb,b
         LW  Rc,c
         ADD Ra,Rb,Rc
         SW  a,Ra
         LW  Re,e
         LW  Rf,f
         SUB Rd,Re,Rf
         SW  d,Rd
     Fast code:
         LW  Rb,b
         LW  Rc,c
         LW  Re,e
         ADD Ra,Rb,Rc
         LW  Rf,f
         SW  a,Ra
         SUB Rd,Re,Rf
         SW  d,Rd
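The benefit of the reordering can be counted mechanically. The sketch below assumes a pipeline with forwarding and a one-cycle load delay, so an instruction stalls only when it immediately follows the load that produces one of its source registers. The `(opcode, dest, sources)` encoding and the function name are assumptions made for illustration.

```python
def count_load_stalls(program):
    """Count one stall for every instruction that uses the result of the
    load directly before it (assumed one-cycle load delay with forwarding)."""
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        prev_op, prev_dest, _ = prev
        _, _, cur_sources = cur
        if prev_op == "LW" and prev_dest in cur_sources:
            stalls += 1
    return stalls


# (opcode, destination register or None, source registers)
SLOW = [
    ("LW", "Rb", []), ("LW", "Rc", []), ("ADD", "Ra", ["Rb", "Rc"]),
    ("SW", None, ["Ra"]),
    ("LW", "Re", []), ("LW", "Rf", []), ("SUB", "Rd", ["Re", "Rf"]),
    ("SW", None, ["Rd"]),
]
FAST = [
    ("LW", "Rb", []), ("LW", "Rc", []), ("LW", "Re", []),
    ("ADD", "Ra", ["Rb", "Rc"]), ("LW", "Rf", []), ("SW", None, ["Ra"]),
    ("SUB", "Rd", ["Re", "Rf"]), ("SW", None, ["Rd"]),
]

slow_stalls = count_load_stalls(SLOW)   # ADD after LW Rc, SUB after LW Rf
fast_stalls = count_load_stalls(FAST)   # no load is followed by its consumer
```

Under these assumptions the slow sequence stalls twice and the fast one not at all. Both contain the same eight instructions, so the gain comes from ordering alone.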

  48. Control Hazard on Branches => Three Stage Stall [Figure: instruction order against time in clock cycles for the sequence below; the three instructions after the beq are fetched before the branch outcome is known]
         10: beq r1,r3,36
         14: and r2,r3,r5
         18: or r6,r1,r7
         22: add r8,r1,r9
         36: xor r10,r1,r11

  49. Example: Branch Stall Impact
     • If 30% of instructions are branches, a 3-cycle stall is significant
     • Two part solution:
       • Determine branch taken or not sooner, AND
       • Compute taken branch address earlier
     • MIPS branch tests if register = 0 or ≠ 0
     • MIPS Solution:
       • Move Zero test to ID/RF stage
       • Adder to calculate new PC in ID/RF stage
       • 1 clock cycle penalty for branch versus 3
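The cost of the branch penalty can be estimated with a one-line CPI model. The sketch below assumes a base CPI of 1 and that every branch pays the full penalty. The function name and those simplifications are assumptions, and the 30% figure is the one from the slide.

```python
def cpi_with_branch_stalls(branch_fraction, penalty_cycles, base_cpi=1.0):
    """Average cycles per instruction when every branch stalls for a fixed penalty."""
    return base_cpi + branch_fraction * penalty_cycles


cpi_three_cycle = cpi_with_branch_stalls(0.30, 3)   # about 1.9
cpi_one_cycle = cpi_with_branch_stalls(0.30, 1)     # about 1.3
improvement = cpi_three_cycle / cpi_one_cycle       # about 1.46
```

Under these assumptions, moving the zero test and the target adder into ID/RF takes CPI from about 1.9 to about 1.3, which makes the machine roughly 1.46 times faster on this instruction mix.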

  50. Pipelined MIPS Datapath (Figure 3.22, page 163, CA:AQA 2/e) [Figure: five-stage datapath (Instruction Fetch; Instr. Decode / Reg. Fetch; Execute / Addr. Calc; Memory Access; Write Back) with pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB; the Zero? test and an extra Adder for the branch target sit in the Instr. Decode / Reg. Fetch stage]
