Shannon Tauro /Jerry Lebowitz

Computer Organization: Design of the Microarchitecture Level (Tanenbaum 4.4) Shannon Tauro/Jerry Lebowitz

Design Challenge – Optimizing Design Metrics • Obvious design goal: • Construct an implementation with desired functionality • Key design challenge: • Simultaneously optimize numerous design metrics • Design metric: • A measurable feature of a system’s implementation • Optimizing design metrics is a key challenge

Design Challenge – Optimizing Design Metrics • Common metrics • Unit cost: the monetary cost of manufacturing each copy of the system, excluding NRE cost • NRE cost (Non-Recurring Engineering cost): the one-time monetary cost of designing the system • Size: the physical space required by the system • There are several others, such as reliability, ease of use, and energy requirements

Design Challenge – Optimizing Design Metrics • Expertise with both software and hardware is needed to optimize design metrics • Not just a hardware or software expert • A designer must be comfortable with various technologies in order to choose the best for a given application and constraints • [Figure: digital camera chip block diagram (lens, CCD, A2D, D2A, CCD preprocessor, pixel coprocessor, JPEG codec, microcontroller, multiplier/accumulator, DMA controller, display controller, memory controller, ISA bus interface, UART, LCD controller), annotated with the metrics power, performance, size, and NRE cost across hardware and software]

Definitions • A clock is a circuit that emits a series of pulses with a precise pulse width and a precise interval between consecutive pulses • The interval between the corresponding edges of two consecutive pulses is known as the clock cycle time • A key factor in determining clock speed is the amount of work that must be done in each clock cycle • The more work, the longer the cycle • The sequence of operations that must be performed serially in a single clock cycle determines the length of the cycle • This holds even though other operations take place in parallel
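The point about serial work setting the cycle length can be sketched in a few lines. This is a minimal illustration, not a model of any real data path: the path names and the nanosecond delays are invented for the example.

```python
# Hypothetical delays in nanoseconds; names and numbers are illustrative only.
def cycle_time(paths):
    """Return the minimum clock cycle: the longest chain of serial delays.

    `paths` maps a path name to the list of delays that must happen one
    after another on that path. Different paths run in parallel, so only
    the slowest one limits the clock.
    """
    return max(sum(delays) for delays in paths.values())


paths = {
    "registers -> ALU -> registers": [1.0, 3.0, 1.0],  # serial steps add up
    "memory address setup": [2.0],                      # runs in parallel
}
print(cycle_time(paths))
```

Shortening the parallel path changes nothing here; only shortening the longest serial chain allows a faster clock.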

Speed vs. Cost • Two main methods of gaining speed: 1. Hardware: speed through new technology

Speed vs. Cost 2. Organization (given a technology & ISA): Three basic approaches for speeding up execution • Reduce the # of clock cycles needed to execute an instruction • Reducing # of micro-instructions; path length (for an ISA instruction) • Simplify the organization so that the clock cycle can be shorter • Adding hardware (does not help as much as expected) • Breaking data path into stages • Overlap the execution of instructions • Separating circuitry for fetching instructions (8 bit memory port, MBR and PC) can be effective • Pipelining
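The first two approaches can be compared with the usual execution-time product. The sketch below assumes a fixed instruction count and uses made-up numbers; overlapping instructions (the third approach) would show up as a lower effective number of cycles per instruction.

```python
def execution_time_ns(instructions, cycles_per_instruction, cycle_time_ns):
    """Total time = instruction count x cycles per instruction x cycle time."""
    return instructions * cycles_per_instruction * cycle_time_ns


# Illustrative numbers only (not measurements of any real machine).
baseline = execution_time_ns(1000, 4, 10.0)
fewer_cycles = execution_time_ns(1000, 3, 10.0)   # shorter path length
shorter_cycle = execution_time_ns(1000, 4, 8.0)   # simpler organization

print(baseline, fewer_cycles, shorter_cycle)
```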

Speed vs. Cost • How can cost be measured for circuits? • Measured in a variety of ways: • Count number of components • The entire processor exists on a single chip • Bigger, more complex chips are much more expensive than smaller, simpler ones • Technology used, whether components are custom made or COTS (commercial off the shelf) • The more area required for the functions, the larger the chip • Designers use the term “real estate” (area required for a circuit)

Speed vs. Cost • Speeding up the circuit with fast components costs money - $$$$$ • A trade-off similar to memory hierarchies • Use a small number of fast parts • Those that we determine will be used the most frequently

Speed vs. Cost (Decoding) • One can control the amount of decoding • While any of the nine registers can be read into the ALU from the B bus, only 4 bits in the microinstruction are required to specify which register is to be selected • Decoding adds delay

Speed vs. Cost (Decoding) • Delays • ALU receives its input slightly delayed • The result is available on the C bus a little later • Clock cannot run quite as fast due to the delays • Reducing the control store by 5 bits per microinstruction (4 encoded bits instead of 9 select lines) comes at the cost of reduced clock speed
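The encoding trade-off above can be sketched as a small decoder model. The register order in `B_BUS_REGISTERS` is an assumption made for illustration, and the function names are hypothetical; the point is only that a 4-bit field expands to nine select lines, saving 5 bits per microinstruction at the price of a decoding step.

```python
# Assumed register order, for illustration only.
B_BUS_REGISTERS = ["MDR", "PC", "MBR", "MBRU", "SP", "LV", "CPP", "TOS", "OPC"]


def decode_b_field(field):
    """Expand a 4-bit encoded field into one enable line per register."""
    if not 0 <= field <= 0xF:
        raise ValueError("field must fit in 4 bits")
    return [1 if field == i else 0 for i in range(len(B_BUS_REGISTERS))]


def selected_register(field):
    """Name of the register driven onto the B bus, or None for unused codes."""
    lines = decode_b_field(field)
    return B_BUS_REGISTERS[lines.index(1)] if 1 in lines else None


ENCODED_BITS = 4
ONE_HOT_BITS = len(B_BUS_REGISTERS)       # one select bit per register
BITS_SAVED = ONE_HOT_BITS - ENCODED_BITS  # per microinstruction

print(selected_register(4), BITS_SAVED)
```

In hardware the list comprehension corresponds to a decoder circuit on the path to the ALU, which is where the extra delay comes from.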

Reducing the Execution Path Length • Best Quote: • “Simple machines are not fast & fast machines are not simple.” • A look at our architecture: (Mic-1 CPU) • Uses the minimum amount of hardware: • 10 registers • Simple ALU (1 bit ALU replicated 32 times) • Shifter • Decoder • Control store • Some glue

Reducing the Execution Path Length • Let’s look at ways to reduce the number of micro-instructions per ISA instruction • Recall… each ISA instruction is represented as several micro-code instructions…

A Method to Reduce the Execution Path Length • One way is to reduce the path length by merging the Interpreter Loop with Microcode

A Method to Reduce the Execution Path Length • The main loop must be executed at the beginning of every IJVM instruction • It is possible to overlap it with a previous instruction

A Method to Reduce the Execution Path Length
• Original POP sequence (four cycles):
Label | Operations | Comment
pop1 | MAR = SP = SP - 1; rd | Read in the next-to-top word on the stack
pop2 | | Wait for the new TOS to be read from memory
pop3 | TOS = MDR; goto Main1 | Copy new word to TOS
Main1 | PC = PC + 1; fetch; goto (MBR) | MBR holds opcode; get next byte; dispatch
• The sequence above can be reduced to three instructions by merging the main-loop instruction into it (three cycles):
Label | Operations | Comment
pop1 | MAR = SP = SP - 1; rd | Read in the next-to-top word on the stack
Main.pop | PC = PC + 1; fetch | Wait for the new TOS to be read from memory
pop3 | TOS = MDR; goto (MBR) | Copy new word to TOS
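The saving comes from filling the idle wait cycle with the main-loop work. A minimal sketch that represents both POP sequences as data and counts cycles, assuming one microinstruction per data path cycle; the helper names are hypothetical.

```python
# Each entry is (label, operations); an empty string marks a pure wait cycle.
POP_ORIGINAL = [
    ("pop1", "MAR = SP = SP - 1; rd"),
    ("pop2", ""),
    ("pop3", "TOS = MDR; goto Main1"),
    ("Main1", "PC = PC + 1; fetch; goto (MBR)"),
]
POP_MERGED = [
    ("pop1", "MAR = SP = SP - 1; rd"),
    ("Main.pop", "PC = PC + 1; fetch"),
    ("pop3", "TOS = MDR; goto (MBR)"),
]


def cycles(microprogram):
    """One microinstruction takes one data path cycle."""
    return len(microprogram)


def idle_slots(microprogram):
    """Labels of microinstructions that only wait and leave the ALU unused."""
    return [label for label, ops in microprogram if not ops]


print(cycles(POP_ORIGINAL), idle_slots(POP_ORIGINAL))
print(cycles(POP_MERGED), idle_slots(POP_MERGED))
```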

Reducing the Execution Path Length • Look at the architecture shown below: • Let’s simulate the IADD ISA instruction: • Is there a path that could be sped up by adding something?

Reducing the Execution Path Length • Add another bus: the A bus • No longer need an instruction to simply load the H register • Possible to add any register to any register in one cycle

Reducing the Execution Path Length • Using a 3-bus architecture • How can the following sequence of micro-instructions for ILOAD be reduced?

Reducing the Execution Path Length • The result: • Adding an additional bus has reduced the total execution time of ILOAD from six to five cycles • What are the apparent trade-offs here?

Instruction Fetch Unit (IFU) • Cardinal Rule of Computer Design: Make the common case fast • What is common about almost all instructions? • For every instruction the following may occur: • The PC is passed through the ALU and incremented • The PC is used to fetch the next byte in the instruction stream • Operands are read from memory • Operands are written to memory • The ALU does a computation & results are stored back • How can we improve this? • Create an independent unit to fetch and process the instructions: Instruction Fetch Unit (IFU)

Instruction Fetch Unit • Reduce the ALU load • Requires an incrementer • Far simpler than an adder or another ALU • Can independently increment PC and fetch bytes from the byte stream before they are needed • In the existing design, if an instruction has an operand, it must be explicitly fetched one byte at a time • Not having to increment PC in the main loop helps, since incrementing PC is generally all the main loop does • Tradeoffs?

Instruction Fetch Unit • Two approaches: 1. Interpret each opcode, determine the number of additional fields (operands), then fetch and assemble them 2. Take advantage of the stream nature of the instructions and make the next 8-bit and 16-bit pieces available at all times for immediate use • The second approach is discussed here

Instruction Fetch Unit • There are now two MBRs • 8-bit MBR1 and 16-bit MBR2 • The IFU keeps track of the most recent byte(s) consumed by the main execution unit • When MBR1 is read, the next values are shifted into MBR1 and MBR2 • MBR1 holds the oldest byte in the shift register, while MBR2 holds the oldest 2 bytes (a 16-bit integer) • Instructions use only what they need, and the next 8-bit and 16-bit pieces are always available

Instruction Fetch Unit • Benefits: • Eliminates the main loop entirely; each instruction branches directly to the next instruction • Avoids tying up the ALU incrementing the PC • Takes advantage of the stream nature of instructions • NOTE: Bytes are opcodes and operands; not all instructions use operands
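The MBR1/MBR2 behavior described on these slides can be sketched as a small class. This is a simplified model under stated assumptions: the byte stream is held in memory as a whole (no prefetch timing), values are treated as unsigned, and the two bytes in MBR2 are combined big-endian. The class and method names are hypothetical.

```python
class InstructionFetchUnit:
    """Toy model of an IFU exposing the next 8-bit and 16-bit pieces."""

    def __init__(self, byte_stream):
        self._stream = bytes(byte_stream)
        self._pos = 0  # index of the oldest unconsumed byte

    @property
    def mbr1(self):
        """Oldest unconsumed byte, without consuming it."""
        return self._stream[self._pos]

    @property
    def mbr2(self):
        """Oldest two unconsumed bytes as one 16-bit value (big-endian assumed)."""
        hi, lo = self._stream[self._pos], self._stream[self._pos + 1]
        return (hi << 8) | lo

    def read_mbr1(self):
        """Consume one byte; the following bytes shift into MBR1/MBR2."""
        value = self.mbr1
        self._pos += 1
        return value

    def read_mbr2(self):
        """Consume two bytes at once."""
        value = self.mbr2
        self._pos += 2
        return value


# Illustrative stream: opcode + 1-byte operand, then opcode + 2-byte operand.
ifu = InstructionFetchUnit([0x15, 0x02, 0xA7, 0x00, 0x10])
opcode_a, operand_a = ifu.read_mbr1(), ifu.read_mbr1()
opcode_b, operand_b = ifu.read_mbr1(), ifu.read_mbr2()
```

The execution unit never increments a program counter in this model; it only reads MBR1 or MBR2, which is the sense in which the ALU is relieved of PC bookkeeping.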

Pipelining • What about pipelining? • Attempt to make the clock-cycle faster by introducing more parallelism • Clock cycle • Recall the clock cycle is limited by the time needed for the signals to propagate through the data-path

Pipelining • There are three major components to the actual data path cycle • Registers + A and B buses: the time to drive the selected registers onto the A and B buses • ALU/shifter: the time for the ALU and shifter to do their work • C bus: the time for the results to get back to the registers to be stored • Adding parallelism is a real opportunity

Pipelined Laundry • Steps • Washing machine (30 minutes) • Dryer (30 minutes) • Folding (30 minutes) • Putting away (30 minutes) • Each step is part of doing one load of laundry • How did we pipeline them? • How did we know how to pipeline them?
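The laundry analogy is easy to quantify. A minimal sketch, assuming four 30-minute steps, one machine (or person) per step, and no gaps between loads:

```python
STAGE_MINUTES = 30
STAGES = 4  # wash, dry, fold, put away


def sequential_minutes(loads):
    """Finish each load completely before starting the next one."""
    return loads * STAGES * STAGE_MINUTES


def pipelined_minutes(loads):
    """Start a new load every 30 minutes while earlier loads move on."""
    if loads == 0:
        return 0
    # The first load takes all four stages; each later load adds one stage time.
    return (STAGES + loads - 1) * STAGE_MINUTES


print(sequential_minutes(4), pipelined_minutes(4))
```

A single load still takes 120 minutes either way; pipelining improves throughput, not the latency of one load.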

Pipelining • Our data path can also be broken into logical steps • USED: • Registers + A & B Buses • ALU/Shifter • C Bus • We separate each portion by using latches : flip-flops (registers) • One inserted in the middle of each bus

Pipelining • Why do this? • What have we gained? • We can speed up the clock because the maximum delay is now shorter • We can use all parts of the data path during every cycle

Pipelining • Now it takes three clock cycles to use the data path • One for loading the A and B latches • One for running the ALU and shifter and loading the C latch • One for storing the C latch back into the registers • Are we worse off now?

Pipelining • First point… • Now we have three smaller data paths with reduced maximum delays, so the clock frequency can be higher • By breaking up the data path into three time intervals (each about 1/3 as long), the clock speed can be tripled • Not quite true, since additional registers (latches) have been added
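Why the speed-up falls short of a factor of three can be sketched numerically. The function below assumes the data path splits into perfectly equal stages and that each added latch costs a fixed delay; both the even split and the delay values are illustrative assumptions.

```python
def pipelined_speedup(unpipelined_delay_ns, stages, latch_delay_ns):
    """Clock speed-up from splitting a data path into equal stages.

    Each stage takes its share of the original delay plus the delay of
    the latch that was inserted to separate it from the next stage.
    """
    stage_cycle = unpipelined_delay_ns / stages + latch_delay_ns
    return unpipelined_delay_ns / stage_cycle


ideal = pipelined_speedup(9.0, 3, 0.0)       # no latch overhead
realistic = pipelined_speedup(9.0, 3, 0.5)   # hypothetical 0.5 ns per latch
print(ideal, realistic)
```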

Reducing the Execution Path Length • Second point… • Throughput (rather than speed) of an individual instruction • Before… • 1 micro-instruction = 1 datapath cycle • Now… • 1 micro-instruction = 1 datapath cycle divided into 3 steps • For example, look at swap1:
before: MAR = SP – 1; rd
now, step 1: B = SP
now, step 2: C = B – 1
now, step 3: MAR = C; rd
now, step 4: MDR = Mem
• We try to issue a new micro-instruction on every cycle, so the ALU can be used on every cycle

Pipelining ISA: swap • Pipelined implementation of swap • Notice Swap3 • Depends on the result of Swap1 • Called a read-after-write (RAW) dependence or true dependence
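A RAW dependence can be found mechanically by tracking which micro-instruction last wrote each register. The sketch below is a minimal illustration: the read and write sets are assumptions, the `rd` issued by swap1 is modeled as eventually writing MDR, and the bodies shown for swap2 and swap3 are hypothetical stand-ins because the slide's figure is not reproduced here.

```python
def raw_dependences(microinstructions):
    """Return (producer, consumer, register) for each read-after-write pair.

    Each micro-instruction is (label, registers_read, registers_written).
    """
    last_writer = {}
    found = []
    for label, reads, writes in microinstructions:
        for reg in sorted(reads):
            if reg in last_writer:
                found.append((last_writer[reg], label, reg))
        for reg in writes:
            last_writer[reg] = label
    return found


# Assumed read/write sets for the first three swap micro-instructions.
swap = [
    ("swap1", {"SP"}, {"MAR", "MDR"}),  # MAR = SP - 1; rd (rd later loads MDR)
    ("swap2", {"SP"}, {"MAR"}),         # hypothetical: MAR = SP
    ("swap3", {"MDR"}, {"H"}),          # hypothetical: H = MDR; wr
]
print(raw_dependences(swap))
```

In a pipeline, the consumer of such a pair cannot start its operand access until the producer's result has been written, so the pipeline must stall in between.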

Mic-3: 4-Stage Pipeline • 4-stage pipeline: • Stage 1: Instruction fetching • Stage 2: Operand access • Stage 3: ALU operations • Stage 4: Writeback to registers • NOTE: although the Mic-3 program takes more cycles than the Mic-2 program, it still runs faster

Backup

Reducing the Execution Path Length • For the instructions shown below, let’s determine the new micro-instruction sequence if we were to merge the Main1 instruction with each micro-instruction that is performed: goto Main1 • What are the trade-offs for doing this?

Reducing the Execution Path Length • The Mic-2 Microprogram

Pipelining ISA: istore • Let’s pipeline the istore sequence • Report • the sequence of instructions; • the number of full datapath cycles required to successfully execute this sequence
