Designing an instruction set architecture suitable for specific applications, focusing on operations, operands, addressing modes, and optimization.
Midterm 2 review Chapter 2 - 4
Instruction Set Architecture • Interface between the hardware and software • Easy to program with and efficient to implement • Regularity, simple tradeoffs, constant operands • Constant instruction size, small instruction set • What operations to include? • What types of operands to include? • What addressing modes to include? • What memory addressing modes to include? • Should be optimized for the targeted application
Operations • Computation • Add, sub, mult, div, shift, … • Control flow • Branch, jump, jal, ….
Operands • Data type: int, float, immediate … • Internal storage • Stack, accumulator: old style • Register-memory • Smaller code size, but variable cycles per instruction, and it is harder to encode both a memory address and a register in one instruction • Register-register • Larger code size, but relatively constant cycles per instruction • List on page 98
Memory Address Modes • Register • Immediate • Displacement • …. • A list of examples on page 104
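As a rough illustration of the three modes named above, the sketch below resolves an operand on a toy machine. The register names, memory contents, and the `operand` helper are made up for this example and are not taken from the slides or the textbook list.

```python
# Toy machine state; register names and values are hypothetical.
regs = {"R1": 100, "R2": 8}
memory = {100: 11, 108: 22}

def operand(mode, reg=None, imm=0):
    """Return the operand value selected by a simple addressing mode."""
    if mode == "register":        # operand is the register's contents
        return regs[reg]
    if mode == "immediate":       # operand is the constant encoded in the instruction
        return imm
    if mode == "displacement":    # operand is Mem[imm + Regs[reg]]
        return memory[imm + regs[reg]]
    raise ValueError(f"unknown mode: {mode}")

print(operand("register", reg="R2"))             # 8
print(operand("immediate", imm=3))               # 3
print(operand("displacement", reg="R1", imm=8))  # 22
```

Displacement with `imm=0` behaves like register-indirect addressing, which is the form `0(R1)` used in the example loop later in this review.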
Branch prediction • 2-bit branch predictor • Correlating branch predictor • (m,n) predictor • Use the outcomes of the last m branches to pick one of 2^m n-bit predictors • Tournament predictor • Branch Target Buffers
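The (m,n) scheme can be sketched in a few lines. This is a minimal model under stated assumptions, not a description of any real hardware: the table size, the PC indexing (4-byte instructions), the initial counter value of 0, and the names `CorrelatingPredictor` and `mispredictions` are all chosen for illustration. With m = 0 it reduces to a plain n-bit predictor, so (0,2) is the 2-bit predictor.

```python
class CorrelatingPredictor:
    """(m, n) predictor: m bits of global history select one of 2**m n-bit counters."""

    def __init__(self, m, n, entries=16):
        self.m, self.n, self.entries = m, n, entries
        self.history = 0                      # outcomes of the last m branches, as bits
        self.max_count = (1 << n) - 1
        # counters[entry][history]; every counter starts at 0 (strongly not taken)
        self.counters = [[0] * (1 << m) for _ in range(entries)]

    def _index(self, pc):
        return (pc >> 2) % self.entries       # assumes 4-byte instructions

    def predict(self, pc):
        count = self.counters[self._index(pc)][self.history]
        return count >= (1 << (self.n - 1))   # upper half of the range means "taken"

    def update(self, pc, taken):
        row = self.counters[self._index(pc)]
        if taken:
            row[self.history] = min(row[self.history] + 1, self.max_count)
        else:
            row[self.history] = max(row[self.history] - 1, 0)
        if self.m:                            # shift the outcome into the history
            mask = (1 << self.m) - 1
            self.history = ((self.history << 1) | int(taken)) & mask


def mispredictions(predictor, pc, outcomes):
    """Run one branch through the predictor and count wrong predictions."""
    wrong = 0
    for taken in outcomes:
        if predictor.predict(pc) != taken:
            wrong += 1
        predictor.update(pc, taken)
    return wrong


alternating = [True, False] * 10
print(mispredictions(CorrelatingPredictor(0, 2), 0x40, alternating))  # 10
print(mispredictions(CorrelatingPredictor(1, 2), 0x40, alternating))  # 2
```

On a strictly alternating branch the 2-bit predictor is wrong half the time, while one bit of history lets the (1,2) predictor learn the pattern after a short warm-up.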
Explore ILP • Dynamic • Tomasulo + branch prediction => speculation • More information available at run time for optimization, but a smaller instruction window • More hardware, better binary compatibility • Static • Bigger window but less information • Simple hardware, complex compiler
Example
Loop: L.D    F0, 0(R1)
      Add.D  F0, F0, F2
      S.D    0(R1), F0
      L.D    F0, 0(R2)
      Mult.D F0, F0, F2
      S.D    0(R2), F0
      SUBI   R1, R1, 8
      SUBI   R2, R2, 8
      BNEZ   R2, Loop
Speculative Dynamic Machine specification • Issue rate of 1 • One broadcast per cycle on the CDB • Branch takes 1 cycle • Load takes 1 cycle • Integer ALU takes 1 cycle • Float add takes 2 cycles • Float multiply takes 3 cycles • These cycle counts do not include the write to the CDB
Cycle 0 • Reorder buffer • Reservation table • FP register status • Code: the Example loop
Cycle 1 • Reorder buffer • Reservation table • FP register status • Code: the Example loop
Cycle 2 • Reorder buffer • Reservation table • FP register status • Code: the Example loop
Cycle 3 • Reorder buffer • Reservation table • FP register status • Code: the Example loop
Cycle 4 • Reorder buffer • Reservation table • FP register status • Code: the Example loop
Cycle n • Reorder buffer • Reservation table • FP register status • Code: the Example loop
Cycle n+1 • Reorder buffer • Reservation table • FP register status • Code: the Example loop
Cycle n+2 • Reorder buffer • Reservation table • FP register status • Code: the Example loop
Cycle n+3 • Reorder buffer • Reservation table • FP register status • Code: the Example loop
VLIW example
Loop: L.D    F0, 0(R1)
      Add.D  F0, F0, F2
      S.D    0(R1), F0
      L.D    F0, 0(R2)
      Mult.D F0, F0, F2
      S.D    0(R2), F0
      SUBI   R1, R1, 8
      SUBI   R2, R2, 8
      BNEZ   R2, Loop
• Static machine specification • One delay slot between any two floating-point operations linked by a true data dependence • One branch delay slot
Register rename
Before renaming:
Loop: L.D    F0, 0(R1)
      Add.D  F0, F0, F2
      S.D    0(R1), F0
      L.D    F0, 0(R2)
      Mult.D F0, F0, F2
      S.D    0(R2), F0
      SUBI   R1, R1, 8
      SUBI   R2, R2, 8
      BNEZ   R2, Loop
After renaming (the second chain uses F1 instead of F0):
Loop: L.D    F0, 0(R1)
      Add.D  F0, F0, F2
      S.D    0(R1), F0
      L.D    F1, 0(R2)
      Mult.D F1, F1, F2
      S.D    0(R2), F1
      SUBI   R1, R1, 8
      SUBI   R2, R2, 8
      BNEZ   R2, Loop
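Why the rename helps can be checked mechanically. The sketch below lists the name dependences (WAR and WAW) on FP registers in a straight-line sequence. It is an illustration only: the `(opcode, destination, sources)` tuple encoding and the `name_dependences` helper are assumptions of this example, address registers are left out, and only the six FP instructions of the loop body are modeled.

```python
def name_dependences(code):
    """Return WAR/WAW dependences as (kind, earlier_index, later_index, register)."""
    last_writer = {}   # register -> index of its most recent writer
    readers = {}       # register -> indices that read it since that write
    found = []
    for j, (_op, dst, srcs) in enumerate(code):
        producers = set()
        for reg in srcs:
            if reg in last_writer:
                producers.add(last_writer[reg])   # true (RAW) dependence
            readers.setdefault(reg, []).append(j)
        if dst is not None:
            for i in readers.get(dst, []):
                if i != j:
                    found.append(("WAR", i, j, dst))
            # A WAW against one of this instruction's own producers is already
            # ordered by the RAW dependence, so it is not reported.
            if dst in last_writer and last_writer[dst] not in producers:
                found.append(("WAW", last_writer[dst], j, dst))
            last_writer[dst] = j
            readers[dst] = []
    return found


# (opcode, FP destination or None, FP sources); address registers are omitted.
before = [
    ("L.D", "F0", []), ("Add.D", "F0", ["F0", "F2"]), ("S.D", None, ["F0"]),
    ("L.D", "F0", []), ("Mult.D", "F0", ["F0", "F2"]), ("S.D", None, ["F0"]),
]
after = [
    ("L.D", "F0", []), ("Add.D", "F0", ["F0", "F2"]), ("S.D", None, ["F0"]),
    ("L.D", "F1", []), ("Mult.D", "F1", ["F1", "F2"]), ("S.D", None, ["F1"]),
]

print(name_dependences(before))  # [('WAR', 2, 3, 'F0'), ('WAW', 1, 3, 'F0')]
print(name_dependences(after))   # []
```

Before renaming, the second load cannot move above the first store (WAR) or the add (WAW). After renaming, the two chains share no written register, so the compiler is free to interleave them.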
Instruction reorder
Before reordering:
Loop: L.D    F0, 0(R1)
      Add.D  F0, F0, F2
      S.D    0(R1), F0
      L.D    F1, 0(R2)
      Mult.D F1, F1, F2
      S.D    0(R2), F1
      SUBI   R1, R1, 8
      SUBI   R2, R2, 8
      BNEZ   R2, Loop
After reordering (SUBI R1 fills the branch delay slot):
Loop: L.D    F0, 0(R1)
      L.D    F1, 0(R2)
      Add.D  F0, F0, F2
      Mult.D F1, F1, F2
      S.D    0(R1), F0
      S.D    0(R2), F1
      SUBI   R2, R2, 8
      BNEZ   R2, Loop
      SUBI   R1, R1, 8
The loop can be unrolled to increase reordering freedom.
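The benefit of the reordering can be estimated by counting stalls under the static machine's rule of one delay slot between dependent floating-point operations. The sketch below is a simplified single-issue model and makes several assumptions not stated on the slides: FP loads and stores are treated as floating-point operations for the purpose of the rule, integer registers are not tracked, loop-carried dependences are ignored, and the `stall_cycles` helper and tuple encoding are hypothetical.

```python
def stall_cycles(code, gap=1):
    """Count stalls when each FP consumer must issue at least gap + 1 cycles after its producer."""
    issue = []          # issue cycle of each instruction, in program order
    last_writer = {}    # FP register -> index of its most recent writer
    for j, (_op, dst, srcs) in enumerate(code):
        cycle = issue[-1] + 1 if issue else 0          # single issue, in order
        for reg in srcs:
            if reg in last_writer:
                cycle = max(cycle, issue[last_writer[reg]] + gap + 1)
        issue.append(cycle)
        if dst is not None:
            last_writer[dst] = j
    return issue[-1] - (len(code) - 1) if issue else 0


# (opcode, FP destination or None, FP sources); integer operands are not tracked
# because the rule modeled here only constrains floating-point dependences.
renamed = [
    ("L.D", "F0", []), ("Add.D", "F0", ["F0", "F2"]), ("S.D", None, ["F0"]),
    ("L.D", "F1", []), ("Mult.D", "F1", ["F1", "F2"]), ("S.D", None, ["F1"]),
    ("SUBI", None, []), ("SUBI", None, []), ("BNEZ", None, []),
]
reordered = [
    ("L.D", "F0", []), ("L.D", "F1", []),
    ("Add.D", "F0", ["F0", "F2"]), ("Mult.D", "F1", ["F1", "F2"]),
    ("S.D", None, ["F0"]), ("S.D", None, ["F1"]),
    ("SUBI", None, []), ("BNEZ", None, []), ("SUBI", None, []),
]

print(stall_cycles(renamed))    # 4
print(stall_cycles(reordered))  # 0
```

Under these assumptions, interleaving the two chains places an independent instruction in every delay slot, so the reordered body issues without stalls.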
Software pipeline
Code for one iteration (the slide repeats this block, labeled "8 copies"):
      L.D    F0, 0(R1)
      L.D    F1, 0(R2)
      Add.D  F0, F0, F2
      Mult.D F1, F1, F2
      S.D    0(R1), F0
      S.D    0(R2), F1
      SUBI   R2, R2, 8
      SUBI   R1, R1, 8
      BNEZ   R2, Loop
Midterm details • Take home • Available online on Monday 12/9 in the morning • Due 12/16 at 11:59 pm • 3 questions