Shannon Tauro /Jerry Lebowitz - PowerPoint PPT Presentation

Presentation Transcript

  1. Computer Organization: Design of the Microarchitecture Level (Tanenbaum 4.4) Shannon Tauro/Jerry Lebowitz

  2. Design Challenge – Optimizing Design Metrics • Obvious design goal: • Construct an implementation with desired functionality • Key design challenge: • Simultaneously optimize numerous design metrics • Design metric • A measurable feature of a system’s implementation

  3. Design Challenge – Optimizing Design Metrics • Common metrics • Unit cost: the monetary cost of manufacturing each copy of the system, excluding NRE cost • NRE cost (Non-Recurring Engineering cost): the one-time monetary cost of designing the system • Size: the physical space required by the system • There are several others, such as reliability, ease of use, and energy requirements

  4. Design Challenge – Optimizing Design Metrics • Expertise with both software and hardware is needed to optimize design metrics • Not just a hardware or software expert • A designer must be comfortable with various technologies in order to choose the best for a given application and constraints • [Figure: digital camera chip (lens, CCD, CCD preprocessor, A2D, D2A, pixel coprocessor, JPEG codec, microcontroller, multiplier/accumulator, DMA controller, display controller, memory controller, ISA bus interface, UART, LCD controller) pulled between four metrics (power, performance, size, NRE cost) across hardware and software]

  5. Definitions • A clock is a circuit that emits a series of pulses with a precise pulse width and precise interval between consecutive pulses • The interval between the corresponding edges of the two consecutive pulses is known as the clock cycle time • A key factor in determining clock speed is the amount of work that must be done in each clock cycle • The more work the longer the cycle • The sequence of operations that must be performed serially in a single clock cycle determines the length of the cycle • Even though there are parallel operations transpiring
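The point that the longest serial chain of operations sets the cycle time can be sketched in a few lines of Python. This is only an illustration: the function name `min_cycle_time` and the delay values are assumptions, not figures from the slides.

```python
def min_cycle_time(paths):
    """Return the shortest usable clock cycle, in nanoseconds.

    Each path is a list of delays that must happen one after another.
    Different paths run in parallel, so the slowest path sets the cycle.
    """
    return max(sum(path) for path in paths)


# Illustrative delays only (ns); these values are assumed for the example.
paths = [
    [1.0, 3.0, 1.0],  # serial chain: drive buses, ALU/shifter, write back
    [2.0],            # some other action that proceeds in parallel
]
print(min_cycle_time(paths))
```

Shortening the parallel path changes nothing here; only shortening the serial chain lets the clock run faster.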

  6. Speed vs. Cost Two main methods of gaining speed: 1. Hardware: speed through new technology

  7. Speed vs. Cost 2. Organization (given a technology & ISA): Three basic approaches for speeding up execution • Reduce the # of clock cycles needed to execute an instruction • Reducing # of micro-instructions; path length (for an ISA instruction) • Simplify the organization so that the clock cycle can be shorter • Adding hardware (does not help as much as expected) • Breaking data path into stages • Overlap the execution of instructions • Separating circuitry for fetching instructions (8 bit memory port, MBR and PC) can be effective • Pipelining

  8. Speed vs. Cost • How can cost be measured for circuits? • Measured in a variety of ways: • Count number of components • The entire processor exists on a single chip • Bigger, more complex chips are much more expensive than smaller, simpler ones • Technology used, whether components are custom made or COTS (commercial off the shelf) • The more area required for the functions, the larger the chip • Designers use the term “real estate” (area required for a circuit)

  9. Speed vs. Cost • Speeding up the circuit with fast components costs money - $$$$$ • A trade-off similar to memory hierarchies • Use a small number of fast parts • Those that we determine will be used the most frequently

  10. Speed vs. Cost (Decoding) • One can control the amount of decoding • Any of the nine registers can be read into the ALU from the B bus • Only 4 bits in the microinstruction are required to specify which register is to be selected • Decoding adds delay

  11. Speed vs. Cost (Decoding) • Delays • ALU receives its input slightly delayed • The result is available on the C bus a little later • Clock cannot run quite as fast due to the delays • Reducing the width of the control store by 5 bits comes at the cost of reduced clock speed
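The encoding trade-off can be sketched as a small decoder model in Python. The register names and their order are an assumption based on the usual textbook description of the Mic-1 B bus (the slides only say "nine registers"), and `decode_b_field` is a hypothetical helper name.

```python
# Assumed register order for the 4-bit B-bus field (not given in the slides).
B_BUS_REGISTERS = ["MDR", "PC", "MBR", "MBRU", "SP", "LV", "CPP", "TOS", "OPC"]

ENCODED_BITS = 4                                      # field width with a decoder
ONE_HOT_BITS = len(B_BUS_REGISTERS)                   # one enable bit per register
BITS_SAVED = ONE_HOT_BITS - ENCODED_BITS              # 9 - 4


def decode_b_field(code):
    """Model the decoder: turn a 4-bit code into nine enable lines."""
    if not 0 <= code < 2 ** ENCODED_BITS:
        raise ValueError("B field must fit in 4 bits")
    # Codes 9..15 select no register, so every enable line stays 0.
    return [1 if index == code else 0 for index in range(ONE_HOT_BITS)]


enables = decode_b_field(4)
selected = [name for name, on in zip(B_BUS_REGISTERS, enables) if on]
print(selected, "bits saved per microinstruction:", BITS_SAVED)
```

In hardware the decoder is combinational logic sitting in front of the B bus, which is where the extra delay mentioned above comes from.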

  12. Reducing the Execution Path Length • Best Quote: • “Simple machines are not fast & fast machines are not simple.” • A look at our architecture: (Mic-1 CPU) • Uses the minimum amount of hardware: • 10 registers • Simple ALU (1 bit ALU replicated 32 times) • Shifter • Decoder • Control store • Some glue

  13. Reducing the Execution Path Length • Let’s look at ways to reduce the number of micro-instructions per ISA instruction • Recall… each ISA instruction is represented as several micro-code instructions…

  14. A Method to Reduce the Execution Path Length • One way is to reduce the path length by merging the Interpreter Loop with Microcode

  15. A Method to Reduce the Execution Path Length • The main loop must be executed at the beginning of every IJVM instruction • It is possible to overlap it with a previous instruction

  16. A Method to Reduce the Execution Path Length • Original sequence (four cycles):
      Label      Operations                        Comment
      pop1       MAR = SP = SP - 1; rd             Read in the next-to-top word on stack
      pop2                                         Wait for the new TOS to be read from memory
      pop3       TOS = MDR; goto Main1             Copy new word to TOS
      Main1      PC = PC + 1; fetch; goto (MBR)    MBR holds opcode; get next byte; dispatch
    • The sequence above can be reduced to three instructions by merging the main-loop instructions (three cycles):
      Label      Operations                        Comment
      pop1       MAR = SP = SP - 1; rd             Read in the next-to-top word on stack
      Main.pop   PC = PC + 1; fetch                Wait for the new TOS to be read from memory
      pop3       TOS = MDR; goto (MBR)             Copy new word to TOS
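A rough way to check the merge is to list the operations in each micro-instruction and confirm that the shorter sequence still performs all of them. The sketch below does that in Python; the data layout and the helper `all_operations` are assumptions made for illustration, and the now-unneeded `goto Main1` is left out because it only transfers control.

```python
# Each entry is (label, operations performed in that cycle).
ORIGINAL = [
    ("pop1", ["MAR = SP = SP - 1", "rd"]),
    ("pop2", []),  # idle cycle: waiting for memory
    ("pop3", ["TOS = MDR"]),
    ("Main1", ["PC = PC + 1", "fetch", "goto (MBR)"]),
]

MERGED = [
    ("pop1", ["MAR = SP = SP - 1", "rd"]),
    ("Main.pop", ["PC = PC + 1", "fetch"]),  # main-loop work fills the idle cycle
    ("pop3", ["TOS = MDR", "goto (MBR)"]),
]


def all_operations(sequence):
    """Collect every operation in a sequence, ignoring which cycle it is in."""
    return sorted(op for _, ops in sequence for op in ops)


same_work = all_operations(ORIGINAL) == all_operations(MERGED)
cycles_saved = len(ORIGINAL) - len(MERGED)
print(same_work, cycles_saved)
```

The saving comes entirely from using the cycle that pop2 spent waiting on memory.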

  17. Reducing the Execution Path Length • Look at the architecture shown below: • Let’s simulate the IADD ISA instruction: • Is there a path that could be sped up by adding something?

  18. Reducing the Execution Path Length • Add another bus!!!—the A bus • No longer need an instruction to simply load the H register • Possible to add any register to any register in one cycle

  19. Reducing the Execution Path Length • Using a 3-bus architecture • How can the following sequence of micro-instructions for ILOAD be reduced?

  20. Reducing the Execution Path Length • The result: • Adding an additional bus has reduced the total execution time of ILOAD from six to five cycles • What are the apparent trade-offs here?

  21. Instruction Fetch Unit (IFU) • Cardinal Rule of Computer Design: Make the common case fast • What is common about almost all instructions? • For every instruction the following may occur: • The PC is passed through the ALU and incremented • The PC is used to fetch the next byte in the instruction stream • Operands are read from memory • Operands are written to memory • The ALU does a computation & results are stored back • How can we improve this? • Create an independent unit to fetch and process the instructions: Instruction Fetch Unit (IFU)

  22. Instruction Fetch Unit • Reduce the ALU load • Requires an incrementer • Far simpler than an adder or another ALU • Can independently increment PC and fetch bytes from the byte stream before they are needed • If an instruction has an operand, it must be explicitly fetched one byte at a time • Not having to increment PC in the main loop helps, as incrementing PC is generally all the main loop does • Tradeoffs?

  23. Instruction Fetch Unit • Two approaches 1. Interpret each opcode, determine the number of additional fields (operands), fetch and assemble them 2. Take advantage of the stream nature of the instructions and make available at all times the next 8- and 16-bit pieces for immediate use • Will discuss the second approach

  24. Instruction Fetch Unit • There are now two MBRs • 8-bit MBR1 and 16-bit MBR2 • The IFU keeps track of the most recent byte(s) consumed by the main execution • When MBR1 is read, the next values are shifted into MBR1 & MBR2 • MBR1 holds the oldest byte in the shift register while MBR2 holds the oldest 2 bytes (16-bit integer) • Allows the instructions to use what they need, making the next 8- and 16-bit pieces available

  25. Instruction Fetch Unit • Benefits: • Eliminates the main loop entirely; each instruction branches directly to the next instruction • Avoids tying up the ALU incrementing the PC • Treats instructions as streams, taking advantage of the stream nature of instructions • NOTE: Bytes are opcodes & operands; not all instructions use operands
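The behaviour of MBR1 and MBR2 can be approximated with a byte queue. The Python sketch below is a simplified model, not the actual IFU: the class and method names are hypothetical, refilling from memory is ignored, and the high-byte-first order for MBR2 is an assumption.

```python
from collections import deque


class InstructionFetchUnit:
    """Toy model: a queue of prefetched bytes standing in for the shift register."""

    def __init__(self, byte_stream):
        self._bytes = deque(byte_stream)

    def read_mbr1(self):
        """Consume the oldest byte; the remaining bytes shift up by one."""
        return self._bytes.popleft()

    def read_mbr2(self):
        """Consume the oldest two bytes as one 16-bit value (high byte first)."""
        high = self._bytes.popleft()
        low = self._bytes.popleft()
        return (high << 8) | low


# Illustrative byte stream: an opcode with a 1-byte operand,
# then an opcode with a 2-byte operand.
ifu = InstructionFetchUnit([0x15, 0x02, 0xA7, 0x00, 0x10])
first_opcode = ifu.read_mbr1()
index = ifu.read_mbr1()
second_opcode = ifu.read_mbr1()
offset = ifu.read_mbr2()
print(hex(first_opcode), index, hex(second_opcode), offset)
```

Each instruction simply takes the 8- or 16-bit piece it needs, and the next piece is already waiting when the following instruction asks for it.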

  26. Pipelining • What about pipelining? • Attempt to make the clock cycle shorter by introducing more parallelism • Clock cycle • Recall the clock cycle is limited by the time needed for the signals to propagate through the data path

  27. Pipelining • There are three major components to the actual data path cycle • The time to drive the selected registers onto the A and B buses (registers + A and B buses) • The time for the ALU and shifter to do their work (ALU/shifter) • The time for the results to get back to the registers to be stored (C bus) • Adding parallelism is a real opportunity

  28. Pipelined Laundry • Steps • Washing machine (30 minutes) • Dryer (30 minutes) • Folding (30 minutes) • Putting away (30 minutes) • Each step is part of doing one load of laundry • How did we pipeline them? • How did we know how to pipeline them?
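The laundry analogy can be worked through numerically. The Python sketch below assumes four equal 30-minute steps as listed on the slide; the function names and the choice of four loads are assumptions for the example.

```python
def sequential_minutes(loads, stages, minutes_per_stage):
    """Finish each load completely before starting the next one."""
    return loads * stages * minutes_per_stage


def pipelined_minutes(loads, stages, minutes_per_stage):
    """Start a new load as soon as the first stage is free.

    The first load needs all stages; each later load finishes
    one stage-time after the previous one.
    """
    return (stages + loads - 1) * minutes_per_stage


loads = 4  # assumed number of loads
print(sequential_minutes(loads, 4, 30))  # 480
print(pipelined_minutes(loads, 4, 30))   # 210
```

A single load still takes two hours either way; what improves is how many loads finish per hour.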

  29. Pipelining • Our data path can also be broken into logical steps: • Registers + A & B buses • ALU/shifter • C bus • We separate each portion by using latches (flip-flops, i.e. registers) • One inserted in the middle of each bus

  30. Pipelining • Why do this? • What have we gained? • We can speed up the clock because the maximum delay is now shorter • We can use all parts of the data path during every cycle

  31. Pipelining • Now it takes three clock cycles to use the data path • One for loading the A and B latches • One for running the ALU and shifter and loading the C latch • One for storing the C latch back into the registers • Are we worse off now?

  32. Pipelining • First point… • Now we have three smaller data paths with reduced maximum delays, so the clock frequency can be higher • By breaking up the data path into three time intervals (each one is about 1/3 as long), the clock speed can be roughly tripled • Not quite tripled, since the added latches introduce delay of their own

  33. Reducing the Execution Path Length • Second point… • What matters is throughput rather than the speed of an individual instruction • Before… 1 micro-instruction = 1 datapath cycle • Now… 1 micro-instruction = 1 datapath cycle divided into 3 steps • For example, look at swap1: • Before: MAR = SP – 1; rd • Now: B = SP, then C = B – 1, then MAR = C; rd, then MDR = Mem • We try to issue a new micro-instruction on every cycle, so the ALU can be used on every cycle
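Why the clock does not quite triple can be shown with a small calculation. In the Python sketch below the delay values and the function name `pipelined_speedup` are assumptions, and the stages are taken to be perfectly balanced, which a real data path will not be.

```python
def pipelined_speedup(unpipelined_delay_ns, stages, latch_delay_ns):
    """Ratio of old cycle time to new cycle time after splitting the data path.

    Each stage gets an equal share of the original delay plus the
    delay of the latch that was inserted to separate the stages.
    """
    stage_delay = unpipelined_delay_ns / stages + latch_delay_ns
    return unpipelined_delay_ns / stage_delay


# Illustrative numbers only: a 3 ns data path cut into 3 stages.
print(pipelined_speedup(3.0, 3, 0.0))   # ideal latches: exactly 3x
print(pipelined_speedup(3.0, 3, 0.25))  # with latch delay: below 3x
```

The larger the latch delay relative to a stage, the further the achievable clock speed falls short of the ideal factor.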

  34. Pipelining ISA: swap • Pipelined implementation of swap • Notice Swap3 • Depends on the result of Swap1 • Called read-after-write (RAW) dependence or true dependence
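A RAW dependence can be found mechanically by tracking which micro-instruction last wrote each register. The Python sketch below is an illustration under assumptions: the read and write sets for swap2 and swap3 are not given in the slides and are filled in hypothetically, and `raw_dependences` is an invented helper name.

```python
def raw_dependences(program):
    """Return (producer, consumer, register) for each read-after-write pair.

    program is a list of (label, registers_read, registers_written).
    """
    last_writer = {}
    found = []
    for label, reads, writes in program:
        for register in sorted(reads):
            if register in last_writer:
                found.append((last_writer[register], label, register))
        for register in writes:
            last_writer[register] = label
    return found


# swap1 follows the slides (MAR = SP - 1; rd, which later fills MDR).
# The swap2 and swap3 register sets are assumptions for this example.
swap = [
    ("swap1", {"SP"}, {"MAR", "MDR"}),
    ("swap2", {"SP"}, {"MAR"}),
    ("swap3", {"MDR"}, {"H"}),
]
dependences = raw_dependences(swap)
print(dependences)
```

In a pipeline, the consumer has to wait until the producer's result is actually available, so a dependence like this one forces a stall.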

  35. Mic-3: 4-Stage Pipeline • 4-stage pipeline: • Stage 1: Instruction fetching • Stage 2: Operand access • Stage 3: ALU operations • Stage 4: Writeback to registers

  38. Mic-3: 4-Stage Pipeline • 4-stage pipeline: • Stage 1: Instruction fetching • Stage 2: Operand access • Stage 3: ALU operations • Stage 4: Writeback to registers • NOTE: Although the Mic-3 program takes more cycles than the Mic-2 program, it still runs faster

  39. Backup

  40. Reducing the Execution Path Length • For the instructions shown below, let’s determine the new micro-instruction sequence if we were to merge the Main1 instruction with each micro-instruction that performs goto Main1 • What are the trade-offs for doing this?

  41. Reducing the Execution Path Length • The Mic-2 Microprogram

  42. Pipelining ISA: istore • Let’s pipeline the istore sequence • Report • the sequence of instructions; • the number of full datapath cycles required to successfully execute this sequence