Final Review

Final Review Bernard Chen

Example 1
• Binary selector inputs:
1) MUX A selector (SELA): place the content of R2 onto bus A
2) MUX B selector (SELB): place the content of R3 onto bus B
3) ALU operation selector (OPR): perform the arithmetic addition R2 + R3
4) Decoder selector (SELD): transfer the content of the output bus into R1

Encoding of Register Selection Fields:
• SELA or SELB = 000 (External Input): the MUX selects the external input data
• SELD = 000 (None): no destination register is selected, but the contents of the output bus are available at the external output

(Example 2)
1. Micro-operation: R1 ← R2 - R3
2. Control word:
Field:        SELA  SELB  SELD  OPR
Symbol:       R2    R3    R1    SUB
Control word: 010   011   001   00101
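The field encoding in Example 2 can be sketched in a few lines of Python. This is only an illustration: the function names are hypothetical, the register codes for R4 to R7 are assumed to continue the binary pattern seen for R1 to R3, and only the SUB operation code from the example is included.

```python
# Only the operation code shown in the example is listed here.
OPR_CODES = {"SUB": "00101"}


def reg_code(name):
    """Return the 3-bit selector code for a register name.

    "INPUT" (for SELA/SELB) and "NONE" (for SELD) both map to 000.
    Codes for R4..R7 are an assumption that extends R1=001, R2=010, R3=011.
    """
    if name in ("INPUT", "NONE"):
        return "000"
    number = int(name[1:])
    if not 1 <= number <= 7:
        raise ValueError(f"unknown register: {name}")
    return format(number, "03b")


def control_word(sela, selb, seld, opr):
    """Assemble the control word as SELA SELB SELD OPR."""
    return " ".join([reg_code(sela), reg_code(selb), reg_code(seld), OPR_CODES[opr]])


print(control_word("R2", "R3", "R1", "SUB"))  # R1 <- R2 - R3
```

Keeping the fields as separate strings mirrors the four-column layout of the control word table.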

STACK OPERATIONS: REVERSE POLISH NOTATION (postfix)
• Evaluation procedure:
1. Scan the expression from left to right.
2. When an operator is reached, perform the operation with the two operands found on the left side of the operator.
3. Replace the two operands and the operator by the result obtained from the operation.
• (Example) infix: 3 * 4 + 5 * 6 = 42; postfix: 3 4 * 5 6 * +
• Reduction steps:
3 4 * 5 6 * +
12 5 6 * +
12 30 +
42

STACK OPERATIONS: REVERSE POLISH NOTATION (postfix)
• Reverse Polish notation evaluation with a stack. A stack is a natural and efficient way to evaluate arithmetic expressions.
• Stack evaluation:
Get value
If value is data: push data
Else if value is operation: pop, pop, evaluate, and push

STACK OPERATIONS: REVERSE POLISH NOTATION (postfix)
• (Example) Using a stack to evaluate 3 * 4 + 5 * 6 = 42, written in postfix as 3 4 * 5 6 * +
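The push/pop procedure can be sketched as a short Python function. The name `eval_rpn` and the space-separated token format are assumptions made for this illustration.

```python
import operator

# Supported binary operators; anything else is treated as a number.
OPS = {
    "+": operator.add,
    "-": operator.sub,
    "*": operator.mul,
    "/": operator.truediv,
}


def eval_rpn(expression):
    """Evaluate a space-separated postfix expression using a stack."""
    stack = []
    for token in expression.split():
        if token in OPS:
            # Operator: pop, pop, evaluate, push.
            right = stack.pop()
            left = stack.pop()
            stack.append(OPS[token](left, right))
        else:
            # Data: push.
            stack.append(float(token))
    if len(stack) != 1:
        raise ValueError("malformed postfix expression")
    return stack[0]


print(eval_rpn("3 4 * 5 6 * +"))  # 42.0
```

The first value popped is the right-hand operand, which matters for subtraction and division.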

8.4 Instruction Formats
• Zero-address instructions: a stack is used. An arithmetic operation pops two operands from the stack and pushes the result.
• One-address instructions: AC and memory. Since the accumulator always provides one operand, only one memory address needs to be specified.
• Two-address instructions: two addresses (registers or memory locations) are specified; one of them also receives the final result.
• Three-address instructions: three addresses (registers or memory locations) are specified, one for the final result. This format is used with general register organization.

EXAMPLE: Show how the following operation can be performed using:
a- three-address instructions
b- two-address instructions
c- one-address instructions
d- zero-address instructions
X = (A + B) * (C + D)

a- Three-address instructions (general register organization)
ADD R1, A, B     R1 ← M[A] + M[B]
ADD R2, C, D     R2 ← M[C] + M[D]
MUL X, R1, R2    M[X] ← R1 * R2

b- Two-address instructions (general register organization)
MOV R1, A    R1 ← M[A]
ADD R1, B    R1 ← R1 + M[B]
MOV R2, C    R2 ← M[C]
ADD R2, D    R2 ← R2 + M[D]
MOV X, R2    M[X] ← R2
MUL X, R1    M[X] ← R1 * M[X]

c- One-address instructions
LOAD A     AC ← M[A]
ADD B      AC ← AC + M[B]
STORE T    M[T] ← AC
LOAD C     AC ← M[C]
ADD D      AC ← AC + M[D]
MUL T      AC ← AC * M[T]
STORE X    M[X] ← AC

d- Zero-address instructions (stack organization)
If an operand is encountered: push the value. Else, if an operator is encountered: pop, pop, operation, push. That is, pop an operand, pop another operand, then perform the operation and push the result back onto the stack.
PUSH A    TOS ← A
PUSH B    TOS ← B
ADD       TOS ← (A+B)
PUSH C    TOS ← C
PUSH D    TOS ← D
ADD       TOS ← (C+D)
MUL       TOS ← (C+D)*(A+B)
POP X     M[X] ← TOS
(*TOS stands for top of stack.)
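To see the zero-address program run, here is a minimal sketch of a stack machine in Python. The interpreter `run_stack_program`, the dictionary standing in for memory, and the sample values of A, B, C, D are all assumptions for illustration.

```python
def run_stack_program(program, memory):
    """Execute PUSH/POP/ADD/MUL instructions against a dict used as memory."""
    stack = []
    for line in program:
        parts = line.split()
        op = parts[0]
        if op == "PUSH":
            stack.append(memory[parts[1]])   # TOS <- M[address]
        elif op == "POP":
            memory[parts[1]] = stack.pop()   # M[address] <- TOS
        elif op in ("ADD", "MUL"):
            right = stack.pop()
            left = stack.pop()
            stack.append(left + right if op == "ADD" else left * right)
        else:
            raise ValueError(f"unknown instruction: {line}")
    return memory


# X = (A + B) * (C + D), using the instruction sequence from the example.
PROGRAM = ["PUSH A", "PUSH B", "ADD", "PUSH C", "PUSH D", "ADD", "MUL", "POP X"]

result = run_stack_program(PROGRAM, {"A": 1, "B": 2, "C": 3, "D": 4})
print(result["X"])  # (1 + 2) * (3 + 4) = 21
```

Only the PUSH and POP instructions carry an address; ADD and MUL take their operands implicitly from the stack.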

Pipelining: Laundry Example
A small laundry has one washer, one dryer, and one operator. It takes 90 minutes to finish one load:
• Washer takes 30 minutes
• Dryer takes 40 minutes
• “Operator folding” takes 20 minutes
Loads: A, B, C, D

Sequential Laundry
This operator schedules his loads to be delivered to the laundry every 90 minutes, which is the time required to finish one load. In other words, he will not start a new task until he is done with the previous one. The process is sequential. Sequential laundry takes 6 hours for 4 loads.
[Figure: timeline from 6 PM to midnight; loads A, B, C, D in task order, each taking 30 + 40 + 20 = 90 min, one after another]

Efficiently scheduled laundry: Pipelined Laundry
The operator starts work ASAP. Another operator asks for loads to be delivered to the laundry every 40 minutes. Pipelined laundry takes 3.5 hours for 4 loads.
[Figure: timeline starting at 6 PM; loads A, B, C, D in task order, overlapped: 30 + 40 + 40 + 40 + 40 + 20 = 210 min]

Pipelining Facts
• Multiple tasks operate simultaneously
• Pipelining doesn’t help the latency of a single task; it helps the throughput of the entire workload
• Pipeline rate is limited by the slowest pipeline stage
• Potential speedup = number of pipe stages
• Unbalanced lengths of pipe stages reduce speedup
• Time to “fill” the pipeline and time to “drain” it reduce speedup
• The washer waits for the dryer for 10 minutes
[Figure: pipelined laundry timeline starting at 6 PM; loads A, B, C, D in task order: 30 + 40 + 40 + 40 + 40 + 20 min]
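The laundry timings can be checked with a small simulation. This is a sketch under one assumption not stated on the slides: a finished load may wait between stages (for example in a basket) until the next machine is free. The function names are hypothetical.

```python
def sequential_time(stage_minutes, num_loads):
    """Each load runs through every stage before the next load starts."""
    return sum(stage_minutes) * num_loads


def pipelined_time(stage_minutes, num_loads):
    """Each stage handles one load at a time; loads overlap across stages."""
    stage_free_at = [0] * len(stage_minutes)  # when each stage next becomes idle
    finish = 0
    for _ in range(num_loads):
        ready = 0  # time at which this load is ready for its next stage
        for stage, minutes in enumerate(stage_minutes):
            start = max(ready, stage_free_at[stage])
            ready = start + minutes
            stage_free_at[stage] = ready
        finish = ready
    return finish


stages = [30, 40, 20]  # washer, dryer, folding
print(sequential_time(stages, 4))  # 360 minutes = 6 hours
print(pipelined_time(stages, 4))   # 210 minutes = 3.5 hours
```

After the first load, one load finishes every 40 minutes, which is the time of the slowest stage (the dryer).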

9.2 Pipelining • Suppose we want to perform the combined multiply and add operations with a stream of numbers: • Ai * Bi + Ci for i =1,2,3,…,7

Pipeline Performance
• n: number of instructions
• k: number of stages in the pipeline
• τ: clock cycle
• Tk: total time
n is equivalent to the number of loads in the laundry example, and k is the number of stages (washing, drying, and folding). The clock cycle is the time of the slowest task.

Example: 6 tasks, divided into 4 segments
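A common textbook form of the total time, assumed here rather than taken from the slide, is Tk = (k + n - 1) × clock cycle: the first task needs k cycles to pass through all segments, and each of the remaining n - 1 tasks completes one cycle later. The sketch below applies it to 6 tasks and 4 segments; the function names and the `tau` parameter (the clock cycle) are hypothetical.

```python
def pipeline_time(n, k, tau=1):
    """Total time for n tasks on a k-segment pipeline with clock cycle tau."""
    return (k + n - 1) * tau


def unpipelined_time(n, k, tau=1):
    """Total time if every task takes all k segments before the next starts."""
    return n * k * tau


def speedup(n, k):
    """Ratio of unpipelined time to pipelined time."""
    return unpipelined_time(n, k) / pipeline_time(n, k)


print(pipeline_time(6, 4))     # 9 clock cycles
print(unpipelined_time(6, 4))  # 24 clock cycles
print(speedup(6, 4))           # about 2.67
```

As n grows, the speedup approaches k, which matches the statement that the potential speedup equals the number of pipe stages.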

Some definitions
• Pipelining is an implementation technique in which multiple instructions are overlapped in execution.
• Pipeline stage: a computer pipeline divides instruction processing into stages. Each stage completes a part of an instruction and loads a new part in parallel. The stages are connected one to the next to form a pipe: instructions enter at one end, progress through the stages, and exit at the other end.

Some definitions
• Throughput of the instruction pipeline is determined by how often an instruction exits the pipeline. Pipelining does not decrease the time for individual instruction execution. Instead, it increases instruction throughput.
• Machine cycle: the time required to move an instruction one step further in the pipeline. The length of the machine cycle is determined by the time required for the slowest pipe stage.

Instruction pipeline (Contd.)
• Sequential processing is faster for a small number of instructions

Difficulties...
• If a complicated memory access occurs in stage 1, stage 2 will be delayed and the rest of the pipe is stalled.
• If there is a branch (an if or a jump), then some of the instructions that have already entered the pipeline should not be processed.
• We need to deal with these difficulties to keep the pipeline moving.

5-Stage Pipelining
Time: 1 2 3 4 5 6 7 8 9
S1:   1 2 3 4 5 6 7 8 9
S2:     1 2 3 4 5 6 7 8
S3:       1 2 3 4 5 6 7
S4:         1 2 3 4 5 6
S5:           1 2 3 4 5
S1: Fetch Instruction (FI)
S2: Decode Instruction (DI)
S3: Fetch Operand (FO)
S4: Execution Instruction (EI)
S5: Write Operand (WO)

Five Stage Instruction Pipeline
• Fetch instruction
• Decode instruction
• Fetch operands
• Execute instructions
• Write result
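The space-time relationship of the five stages can be sketched as follows: with no stalls, the instruction in stage s during clock cycle t is instruction t - s + 1. The function name and the use of `None` for an idle stage are assumptions for this illustration.

```python
def space_time_diagram(num_instructions, num_stages=5):
    """Return one row per stage; each row lists the instruction number
    occupying that stage in each clock cycle, or None if the stage is idle."""
    total_cycles = num_stages + num_instructions - 1
    rows = []
    for stage in range(1, num_stages + 1):
        row = []
        for cycle in range(1, total_cycles + 1):
            instr = cycle - stage + 1
            row.append(instr if 1 <= instr <= num_instructions else None)
        rows.append(row)
    return rows


# Five instructions through five stages take 5 + 5 - 1 = 9 clock cycles.
for number, row in enumerate(space_time_diagram(5), start=1):
    cells = " ".join("." if x is None else str(x) for x in row)
    print(f"S{number}: {cells}")
```

Each row is the previous row shifted one cycle to the right, which is the staircase shape of the diagram.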

Two major difficulties
• Data Dependency
• Branch Difficulties
Solutions:
• Prefetch target instruction
• Delayed Branch
• Branch target buffer (BTB)
• Branch Prediction

Data Dependency
• Use Delay Load to solve:
Example:
load R1    R1 ← M[Addr1]
load R2    R2 ← M[Addr2]
ADD        R3 ← R1 + R2
Store      M[addr3] ← R3

Delay Load
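One way to picture the delayed load is a pass that inserts a no-operation after any load whose destination register is read by the very next instruction. The sketch below assumes a one-cycle load delay; the `Instr` tuple, the `NOP` marker, and the function name are hypothetical and only illustrate the idea.

```python
from typing import List, NamedTuple, Tuple


class Instr(NamedTuple):
    op: str                 # e.g. "LOAD", "ADD", "STORE"
    dest: str               # register or memory location written
    srcs: Tuple[str, ...]   # registers or memory locations read


NOP = Instr("NOP", "", ())


def insert_delay_slots(program: List[Instr]) -> List[Instr]:
    """Insert a NOP after a LOAD when the next instruction reads its result."""
    out = []
    for i, ins in enumerate(program):
        out.append(ins)
        nxt = program[i + 1] if i + 1 < len(program) else None
        if ins.op == "LOAD" and nxt is not None and ins.dest in nxt.srcs:
            out.append(NOP)
    return out


program = [
    Instr("LOAD", "R1", ("M[Addr1]",)),
    Instr("LOAD", "R2", ("M[Addr2]",)),
    Instr("ADD", "R3", ("R1", "R2")),
    Instr("STORE", "M[addr3]", ("R3",)),
]
print([ins.op for ins in insert_delay_slots(program)])
# ['LOAD', 'LOAD', 'NOP', 'ADD', 'STORE']
```

Here the ADD needs R2 immediately after it is loaded, so a single no-operation separates the two instructions.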

Example
• Five instructions need to be carried out:
Load from memory to R1
Increment R2
Add R3 to R4
Subtract R5 from R6
Branch to address X

Delay Branch

Rearrange the Instruction

Floating Point Arithmetic Pipeline
• Example for floating-point addition and subtraction
• Inputs are two normalized floating-point binary numbers
• X = A x 2^a
• Y = B x 2^b
• A and B are two fractions that represent the mantissas
• a and b are the exponents
• Design the pipeline segments that are used to perform the “add” operation

Floating Point Arithmetic Pipeline
• Compare the exponents
• Align the mantissas
• Add or subtract the mantissas
• Normalize the result

Floating Point Arithmetic Pipeline
• X = 0.9504 x 10^3 and Y = 0.8200 x 10^2
• The two exponents are subtracted in the first segment to obtain 3 - 2 = 1
• The larger exponent 3 is chosen as the exponent of the result
• Segment 2 shifts the mantissa of Y to the right to obtain Y = 0.0820 x 10^3
• The mantissas are now aligned
• Segment 3 produces the sum Z = 1.0324 x 10^3
• Segment 4 normalizes the result by shifting the mantissa once to the right and incrementing the exponent by one to obtain Z = 0.10324 x 10^4
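The four segments can be traced in code using the decimal example above. This is a sketch that runs the segments one after another instead of overlapping them, and it uses Python's `decimal` module so the decimal mantissas stay exact; the function name `fp_add` and the normalization range 0.1 ≤ |mantissa| < 1 are assumptions for this illustration.

```python
from decimal import Decimal


def fp_add(a_mant, a_exp, b_mant, b_exp):
    """Add two decimal floating-point numbers mantissa x 10^exp."""
    # Segment 1: compare the exponents and keep the larger one.
    diff = a_exp - b_exp
    exp = max(a_exp, b_exp)

    # Segment 2: align the mantissas by shifting the one with the
    # smaller exponent to the right.
    if diff >= 0:
        b_mant = b_mant.scaleb(-diff)
    else:
        a_mant = a_mant.scaleb(diff)

    # Segment 3: add the mantissas.
    mant = a_mant + b_mant

    # Segment 4: normalize the result.
    while abs(mant) >= 1:
        mant = mant.scaleb(-1)
        exp += 1
    while mant != 0 and abs(mant) < Decimal("0.1"):
        mant = mant.scaleb(1)
        exp -= 1
    return mant, exp


mant, exp = fp_add(Decimal("0.9504"), 3, Decimal("0.8200"), 2)
print(mant, exp)  # mantissa 0.10324 (printed with a trailing zero), exponent 4
```

In a real pipeline each segment would work on a different pair of operands during the same clock cycle.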

Memory Hierarchy
• The main memory occupies a central position by being able to communicate directly with the CPU and with auxiliary memory devices through an I/O processor
• A special very-high-speed memory called cache is used to increase the speed of processing by making current programs and data available to the CPU at a rapid rate

RAM

ROM

Memory Address Map
• A memory address map is a pictorial representation of the address space assigned to each chip in the system
• To demonstrate with an example, assume that a computer system needs 512 bytes of RAM and 512 bytes of ROM
• Each RAM chip has 128 bytes and needs seven address lines, whereas the ROM chip has 512 bytes and needs nine address lines

Memory Address Map

Memory Address Map
• The hexadecimal address column assigns a range of hexadecimal addresses to each chip
• Lines 8 and 9 represent four distinct binary combinations that specify which RAM chip is chosen
• When line 10 is 0, the CPU selects a RAM chip; when it is 1, it selects the ROM
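The chip selection described above can be sketched as a decoder over a 10-bit address. Line 1 is taken to be the least significant bit, and the chip labels "RAM 1" to "RAM 4" and the function name are assumptions for this illustration.

```python
def decode_address(address):
    """Map a 10-bit address to (chip, offset within the chip)."""
    if not 0 <= address < 1024:
        raise ValueError("address must fit in 10 lines")
    if (address >> 9) & 1:
        # Line 10 is 1: the ROM is selected; lines 1-9 address its 512 bytes.
        return "ROM", address & 0x1FF
    # Line 10 is 0: lines 8 and 9 pick one of four RAM chips,
    # and lines 1-7 address the 128 bytes inside it.
    chip = (address >> 7) & 0b11
    return f"RAM {chip + 1}", address & 0x7F


for addr in (0x000, 0x07F, 0x080, 0x1FF, 0x200, 0x3FF):
    print(f"{addr:03X} -> {decode_address(addr)}")
```

Under these assumptions the four RAM chips cover hexadecimal addresses 000 to 1FF and the ROM covers 200 to 3FF.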

Cache memory
• The performance of cache memory is frequently measured in terms of a quantity called the hit ratio
• When the CPU refers to memory and finds the word in cache, it is said to produce a hit
• Otherwise, it is a miss
• Hit ratio = hits / (hits + misses)

Cache memory
• The basic characteristic of cache memory is its fast access time
• Therefore, very little or no time must be wasted when searching for words in the cache
• The transformation of data from main memory to cache memory is referred to as a mapping process. There are three types of mapping:
• Associative mapping
• Direct mapping
• Set-associative mapping

Cache memory • To help understand the mapping procedure, we have the following example:

Associative mapping

Direct Mapping
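A minimal sketch of direct mapping, together with the hit ratio defined earlier, is shown below. The cache size, the word-level addressing, and the address trace are made-up values for illustration, and the function names are hypothetical. The low-order part of the address is the cache index and the remaining high-order part is the tag.

```python
def simulate_direct_mapped(addresses, num_lines):
    """Count hits and misses for a direct-mapped cache with num_lines entries."""
    tags = [None] * num_lines  # tag stored at each cache index
    hits = misses = 0
    for addr in addresses:
        index = addr % num_lines   # which cache line the address maps to
        tag = addr // num_lines    # identifies which block occupies that line
        if tags[index] == tag:
            hits += 1
        else:
            misses += 1
            tags[index] = tag      # replace whatever was stored there
    return hits, misses


def hit_ratio(hits, misses):
    """Hit ratio = hits / (hits + misses)."""
    return hits / (hits + misses)


# Addresses 0 and 8 map to the same line of an 8-line cache and evict each other.
hits, misses = simulate_direct_mapped([0, 1, 0, 8, 0, 1], num_lines=8)
print(hits, misses, hit_ratio(hits, misses))
```

The trace shows the main weakness of direct mapping: two addresses with the same index cannot be in the cache at the same time.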
