
Performance, ALUs and such like

  1. Performance, ALUs and such like • The good news: no quiz today! • Homework #1 is on the net now, and so are the slides from the previous class. • Home page is www.cs.ucsd.edu/~tsoni/cse141 • The final will be on the last day of class; no special time slot. • Add-drops will be handled at the break. • Today: Chapters 2 and 4 of the text. Tarun Soni, Summer ‘03

  2. The Story so far: • Computer organization: concept of abstraction • Instruction Set Architectures: Definition, types, examples • Instruction formats: operands, addressing modes • Operations: load, store, arithmetic, logical • Control instructions: branch, jump, procedures • Stacks Basically learnt about Instruction Set Architectures Tarun Soni, Summer ‘03

  3. MIPS Software Register Conventions
  Reg     Name    Usage
  0       zero    constant 0
  1       at      reserved for the assembler
  2-3     v0-v1   expression evaluation & function results
  4-7     a0-a3   arguments
  8-15    t0-t7   temporaries: caller saves (callee can clobber)
  16-23   s0-s7   saved temporaries: callee saves (callee must save before use)
  24-25   t8-t9   temporaries (cont’d)
  26-27   k0-k1   reserved for OS kernel
  28      gp      pointer to global area
  29      sp      stack pointer
  30      fp      frame pointer
  31      ra      return address (set by hardware on jal)
  Tarun Soni, Summer ‘03

  4. Example: Swap()
  swap(int v[], int k)
  {
    int temp;
    temp = v[k];
    v[k] = v[k+1];
    v[k+1] = temp;
  }
  • Can we figure out the code?
  swap:                    // $4 = v, $5 = k
    sll  $2, $5, 2         // $2 = k*4
    add  $2, $4, $2        // $2 = v + 4*k = &v[k]
    lw   $15, 0($2)        // $15 = temp = v[k]
    lw   $16, 4($2)        // $16 = v[k+1]
    sw   $16, 0($2)        // v[k] = v[k+1]
    sw   $15, 4($2)        // v[k+1] = temp
    jr   $31               // return
  Tarun Soni, Summer ‘03

  5. Example: Leaf_procedure()
  int PairDiff(int a, int b, int c, int d)
  {
    int temp;
    temp = (a+b) - (c+d);
    return temp;
  }
  • Procedures? Assume the caller puts a, b, c, d in $a0-$a3 and wants the result in $v0.
  PairDiff:
    sub  $sp, $sp, 12     // make space for 3 saved registers
    sw   $t1, 8($sp)      // save $t1 (optional under the MIPS convention)
    sw   $t0, 4($sp)      // save $t0 (optional under the MIPS convention)
    sw   $s0, 0($sp)      // save $s0
    add  $t0, $a0, $a1    // t0 = a+b
    add  $t1, $a2, $a3    // t1 = c+d
    sub  $s0, $t0, $t1    // s0 = t0-t1
    add  $v0, $s0, $zero  // put the return value in $v0
    lw   $s0, 0($sp)      // restore registers
    lw   $t0, 4($sp)      // (optional under the MIPS convention)
    lw   $t1, 8($sp)      // (optional under the MIPS convention)
    add  $sp, $sp, 12     // ‘pop’ the stack
    jr   $ra              // the actual return to the calling routine
  Tarun Soni, Summer ‘03

  6. Example: Nested_procedure()
  int fact(int n)
  {
    if (n < 1) return 1;
    else return n * fact(n-1);
  }
  • What about nested procedures? $ra ?? • Recursive procedures? Assume $a0 = n
  fact:
    sub  $sp, $sp, 8      // make space for 2 words on the stack
    sw   $ra, 4($sp)      // save the return address
    sw   $a0, 0($sp)      // save the argument n
    slti $t0, $a0, 1      // test for n < 1
    beq  $t0, $zero, L1   // if (n >= 1) goto L1
    addi $v0, $zero, 1    // (n < 1 case) $v0 = 1
    add  $sp, $sp, 8      // ‘pop’ the stack
    jr   $ra              // return
  L1:                     // (n >= 1 case)
    sub  $a0, $a0, 1      // n--
    jal  fact             // recursive call; fact(n-1) returns here
    lw   $a0, 0($sp)      // restore n
    lw   $ra, 4($sp)      // restore the return address
    add  $sp, $sp, 8      // ‘pop’ the stack
    mul  $v0, $a0, $v0    // $v0 = n * fact(n-1)
    jr   $ra              // return to the caller
  Tarun Soni, Summer ‘03

  7. Comparing Instruction Set Architectures (instruction count, CPI, cycle time)
  Design-time metrics: ° Can it be implemented, in how long, at what cost? ° Can it be programmed? Ease of compilation?
  Static metrics: ° How many bytes does the program occupy in memory?
  Dynamic metrics: ° How many instructions are executed? ° How many bytes does the processor fetch to execute the program? ° How many clocks are required per instruction? ° How "lean" a clock is practical?
  Best metric: time to execute the program! This depends on • instruction set, • processor organization, and • compilation techniques.
  Tarun Soni, Summer ‘03

  8. Computer Performance Measuring and Discussing Computer System Performance or “My computer is faster than your computer” Tarun Soni, Summer ‘03

  9. SPEC Performance • Since the RISC introduction, performance improves ~50% per year (2x every 1.5 years) • But what is performance?? Tarun Soni, Summer ‘03

  10. Performance is in the eye of the beholder? • Purchasing perspective • given a collection of machines, which has the • best performance? • least cost? • best performance / cost? • Design perspective • faced with design options, which has the • best performance improvement? • least cost? • best performance / cost? • Both require • a basis for comparison • a metric for evaluation • Our goal is to understand the cost & performance implications of architectural choices. Tarun Soni, Summer ‘03

  11. Which has higher performance?
  Plane        DC to Paris   Speed      Passengers   Throughput (pmph)
  Boeing 747   6.5 hours     610 mph    470          286,700
  Concorde     3 hours       1350 mph   132          178,200
  Two ideas: • How much faster is the Concorde compared to the 747? • How much bigger is the 747 than the Douglas DC-8?
  ° Time to do the task (execution time) – execution time, response time, latency
  ° Tasks per day, hour, week, sec, ns ... (performance) – throughput, bandwidth
  Response time and throughput are often in opposition. Tarun Soni, Summer ‘03

  12. Two mechanisms of getting to the Bay Area
  Vehicle      Time to Bay Area   Speed     Passengers   Throughput (pmph)
  Ferrari      3.1 hours          160 mph   2            320
  Greyhound    7.7 hours          65 mph    60           3900
  ° Time to do the task from start to finish – execution time, response time, latency
  ° Tasks per unit time – throughput, bandwidth (mostly used for data movement)
  Response time and throughput are often in opposition. Tarun Soni, Summer ‘03

  13. Relative performance? • can be confusing: A runs in 12 seconds, B runs in 20 seconds • A/B = 0.6, so A is 40% faster, or 1.4X faster, or B is 40% slower • B/A = 1.67, so A is 67% faster, or 1.67X faster, or B is 67% slower • needs a precise definition Tarun Soni, Summer ‘03

  14. Relative performance? • Performance is in units of things-per-second • bigger is better • If we are primarily concerned with response time:
  performance(X) = 1 / execution_time(X)
  "X is n times faster than Y" means
  n = Performance(X) / Performance(Y) = Execution_Time(Y) / Execution_Time(X)
  Tarun Soni, Summer ‘03
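
  (A minimal C check of this definition, using the numbers from slide 13; purely illustrative and not part of the original slides.)

    #include <stdio.h>

    /* Speedup as defined above:
     * n = Performance(X)/Performance(Y) = ExecutionTime(Y)/ExecutionTime(X). */
    int main(void) {
        double time_a = 12.0;          /* program A runs in 12 seconds (slide 13) */
        double time_b = 20.0;          /* program B runs in 20 seconds */
        double n = time_b / time_a;    /* A is n times faster than B */
        printf("A is %.2f times as fast as B\n", n);                          /* 1.67 */
        printf("equivalently, A is %.0f%% faster than B\n", (n - 1.0) * 100.0);
        return 0;
    }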

  15. How many times? • Time of Concorde vs. Boeing 747? • Concorde is 1350 mph / 610 mph = 2.2 times faster • = 6.5 hours / 3 hours • Throughput of Concorde vs. Boeing 747? • Concorde is 178,200 pmph / 286,700 pmph = 0.62 “times faster” • Boeing is 286,700 pmph / 178,200 pmph = 1.6 “times faster” • Boeing is 1.6 times (“60%”) faster in terms of throughput • Concorde is 2.2 times (“120%”) faster in terms of flying time • We will focus primarily on execution time for a single job. Tarun Soni, Summer ‘03

  16. Some grammar? • “times faster than” (or “times as fast as”) • there’s a multiplicative factor relating the quantities • “X was 3 times faster than Y” ⇒ speed(X) = 3 × speed(Y) • “percent faster than” • implies an additive relationship • “X was 25% faster than Y” ⇒ speed(X) = (1 + 25/100) × speed(Y) • “percent slower than” • implies subtraction • “X was 5% slower than Y” ⇒ speed(X) = (1 - 5/100) × speed(Y) • “100% slower” means it doesn’t move at all! • “times slower than” or “times as slow as” • is awkward • “X was 3 times slower than Y” means speed(X) = 1/3 × speed(Y) Tarun Soni, Summer ‘03

  17. Avoid Linguistic Confusion
  X is r times faster than Y ⇒ speed(X) = r × speed(Y) ⇒ speed(Y) = (1/r) × speed(X) ⇒ Y is r times slower than X
  X is r times faster than Y, and Y is s times faster than Z ⇒ speed(X) = r × speed(Y) = rs × speed(Z) ⇒ X is rs times faster than Z (you cannot do this with % numbers!)
  Easiest way to avoid confusion: • convert “% faster” to “times faster” • then do the calculation and convert back if needed • Example: change “25% faster” to “5/4 times faster”. Tarun Soni, Summer ‘03

  18. Which time anyways?
  > time foo
  ... foo’s results ...
  90.7u 12.9s 2:39 65%
  >
  (90.7u = user CPU time, 12.9s = kernel CPU time, 2:39 = wallclock time)
  • user CPU time? (time the CPU spends running your code) • total CPU time (user + kernel)? (includes operating-system code) • wallclock time? (total elapsed time) • includes time spent waiting for I/O, other users, ... • The answer depends ... For measuring processor speed, we can use total CPU time. • If there is no I/O and no interrupts, wallclock may be better • more precise (microseconds rather than 1/100 sec) • can measure individual sections of code Tarun Soni, Summer ‘03

  19. Metrics of Performance (each level of the system has its own natural metric)
  Application                     Answers per month; useful operations per second
  Programming Language, Compiler
  ISA                             (millions of) instructions per second – MIPS; (millions of) F.P. operations per second – MFLOP/s
  Datapath, Control               Megabytes per second
  Function Units                  Cycles per second (clock rate)
  Transistors, Wires, Pins
  Each metric has a place and a purpose, and each can be misused. Tarun Soni, Summer ‘03

  20. Levels of benchmarking
  Actual target workload:       Pros: representative. Cons: very specific, non-portable, difficult to run or measure, hard to identify the cause of a result.
  Full application benchmarks:  Pros: portable, widely used, improvements are useful in reality. Cons: less representative.
  Small “kernel” benchmarks:    Pros: easy to run, usable early in the design cycle. Cons: easy to “fool”.
  Microbenchmarks:              Pros: identify peak capability and potential bottlenecks. Cons: “peak” may be a long way from application performance.
  Tarun Soni, Summer ‘03

  21. Cycle Time • Instead of reporting execution time in seconds, we often use cycles • Clock “ticks” indicate when to start activities (one abstraction): • cycle time = time between ticks = seconds per cycle • clock rate (frequency) = cycles per second (1 Hz = 1 cycle/sec) • A 200 MHz clock has a cycle time of 1/(200 × 10^6) s = 5 ns. Tarun Soni, Summer ‘03

  22. CPU Execution Time
  CPU Execution Time = Instruction Count × CPI × Clock Cycle Time
  (seconds) = (instructions) × (cycles/instruction) × (seconds/cycle)
  • Improve performance => reduce execution time • Reduce instruction count (programmer, compiler) • Reduce cycles per instruction (ISA, machine designer) • Reduce clock cycle time (hardware designer, physicist) Tarun Soni, Summer ‘03
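
  (A small C sketch of this equation; the instruction count and CPI below are assumed values, chosen purely for illustration.)

    #include <stdio.h>

    /* CPU time [s] = instruction count x CPI [cycles/instruction] x cycle time [s/cycle]
     *             = instruction count x CPI / clock rate [cycles/s]                      */
    int main(void) {
        double instr_count = 2.0e9;    /* instructions executed (assumed) */
        double cpi         = 2.2;      /* average cycles per instruction (assumed) */
        double clock_hz    = 200.0e6;  /* 200 MHz clock, as on slide 21 */

        double cycle_time = 1.0 / clock_hz;                 /* 5 ns */
        double cpu_time   = instr_count * cpi * cycle_time;
        printf("cycle time = %.1f ns\n", cycle_time * 1e9);
        printf("CPU time   = %.2f s\n", cpu_time);          /* 22 s */
        return 0;
    }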

  23. Performance Variation
  CPU Execution Time = Instruction Count × CPI × Clock Cycle Time
  Tarun Soni, Summer ‘03

  24. Amdahl’s Law • Execution Time After Improvement = Execution Time Unaffected + (Execution Time Affected / Amount of Improvement) • Example: "Suppose a program runs in 100 seconds on a machine, with multiply responsible for 80 seconds of this time. How much do we have to improve the speed of multiplication if we want the program to run 4 times faster?" How about making it 5 times faster? • Principle: make the common case fast. Tarun Soni, Summer ‘03
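
  (A worked sketch of the slide’s example in C; this is one way to do the arithmetic, not part of the original slide.)

    #include <stdio.h>

    /* Amdahl's Law: T_new = T_unaffected + T_affected / s,
     * applied to the example: 100 s total, 80 s of it in multiply. */
    int main(void) {
        double t_total = 100.0, t_mult = 80.0;
        double t_rest  = t_total - t_mult;        /* 20 s that multiply cannot touch */

        /* 4x overall means T_new = 25 s:  25 = 20 + 80/s  =>  s = 80/5 = 16 */
        double s4 = t_mult / (t_total / 4.0 - t_rest);
        printf("4x overall needs multiply to be %.0fx faster\n", s4);   /* 16 */

        /* 5x overall means T_new = 20 s, i.e. zero time in multiply:
         * impossible, since the unaffected 20 s already use the whole budget. */
        if (t_total / 5.0 - t_rest <= 0.0)
            printf("5x overall is impossible no matter how fast multiply gets\n");
        return 0;
    }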

  25. MIPS, MFLOPS etc.
  • MIPS – million instructions per second
    MIPS = (number of instructions executed in the program) / (execution time in seconds × 10^6) = Clock rate / (CPI × 10^6)
  • MFLOPS – million floating-point operations per second
    MFLOPS = (number of floating-point operations executed in the program) / (execution time in seconds × 10^6)
  • program-independent • deceptive
  Tarun Soni, Summer ‘03
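
  (The same two definitions written out in C with assumed numbers; nothing in either formula says what the counted instructions or operations actually accomplish, which is why the metrics can mislead.)

    #include <stdio.h>

    /* MIPS   = instruction count / (exec time x 10^6) = clock rate / (CPI x 10^6)
     * MFLOPS = FP operation count / (exec time x 10^6)                            */
    int main(void) {
        double clock_hz = 200e6, cpi = 2.2;            /* assumed machine */
        printf("native MIPS rating = %.1f\n", clock_hz / (cpi * 1e6));   /* ~90.9 */

        double fp_ops = 50e6, exec_time = 2.0;         /* assumed workload */
        printf("MFLOPS = %.1f\n", fp_ops / (exec_time * 1e6));           /* 25.0 */
        return 0;
    }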

  26. Example RISC Processor — Base Machine (Reg / Reg), typical mix
  Op       Freq   Cycles   CPI(i)   % Time
  ALU      50%    1        0.5      23%
  Load     20%    5        1.0      45%
  Store    10%    3        0.3      14%
  Branch   20%    2        0.4      18%
                  Total CPI = 2.2
  How much faster would the machine be if a better data cache reduced the average load time to 2 cycles? How does this compare with using branch prediction to shave a cycle off the branch time? What if two ALU instructions could be executed at once? Tarun Soni, Summer ‘03
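
  (One way to attack the first two questions, as a C sketch: recompute the weighted CPI for each change and compare. The inputs come straight from the table; the answers are illustrative arithmetic, not the official solution.)

    #include <stdio.h>

    int main(void) {
        /* weighted CPI = sum of (frequency x cycles) over the instruction mix */
        double base        = 0.5*1 + 0.2*5 + 0.1*3 + 0.2*2;   /* = 2.2 */
        double fast_load   = 0.5*1 + 0.2*2 + 0.1*3 + 0.2*2;   /* loads take 2 cycles: 1.6 */
        double fast_branch = 0.5*1 + 0.2*5 + 0.1*3 + 0.2*1;   /* branches take 1 cycle: 2.0 */

        printf("base CPI = %.2f\n", base);
        printf("better data cache: CPI = %.2f, speedup = %.2fx\n", fast_load,   base / fast_load);
        printf("branch prediction: CPI = %.2f, speedup = %.2fx\n", fast_branch, base / fast_branch);
        return 0;
    }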

  27. SPEC: Which programs? • peak throughput measures (simple programs) • synthetic benchmarks (whetstone, dhrystone, ...) • real applications • SPEC (best of both worlds, but with problems of their own) • System Performance Evaluation Cooperative • provides a common set of real applications along with strict guidelines for how to run them • provides a relatively unbiased means to compare machines. Tarun Soni, Summer ‘03

  28. SPEC89 • Compiler “enhancements” and performance Tarun Soni, Summer ‘03

  29. SPECCPU2000 Suite • SPECint2000: • gzip and bzip2 – compression • gcc – compiler; 205K lines of messy code! • crafty – chess program • parser – word processing • vortex – object-oriented database • perlbmk – PERL interpreter • eon – computer visualization • vpr, twolf – CAD tools for VLSI • mcf, gap – “combinatorial” programs • SPECfp2000: 10 Fortran, 3 C programs • scientific application programs (physics, chemistry, image processing, number theory, ...) Tarun Soni, Summer ‘03

  30. Performance is always misleading • Performance is specific to a particular program (or set of programs) • Total execution time is a consistent summary of performance • For a given architecture, performance increases come from: • increases in clock rate (without adverse CPI effects) • improvements in processor organization that lower CPI • compiler enhancements that lower CPI and/or instruction count • Pitfall: expecting an improvement in one aspect of a machine’s performance to improve total performance by a proportional amount • You should not always believe everything you read! Read carefully! Tarun Soni, Summer ‘03

  31. Computer Arithmetic — What do all those bits mean now?
  bits (011011011100010....01) can be interpreted as:
  • instruction: R-format, I-format, ...
  • data: number (integer: signed / unsigned; floating point: single precision / double precision) or text (chars, ...)
  Tarun Soni, Summer ‘03

  32. Computer Arithmetic • How do you represent • negative numbers? • fractions? • really large numbers? • really small numbers? • How do you • do arithmetic? • identify errors (e.g. overflow)? • What is an ALU and what does it look like? • ALU=arithmetic logic unit Tarun Soni, Summer ‘03

  33. Big Endian vs. Little Endian
  (Figure: memory as a sequence of byte addresses 0, 1, 2, ..., each holding 8 bits of data; the two orderings differ in whether the most- or least-significant byte of a word goes at the lowest address.)
  • Big Endian: IBM, Motorola, HP, Sun • Little Endian: DEC, Intel • Some processors (e.g. PowerPC) provide both • if you can figure out how to switch modes or get the compiler to issue byte-reversed loads and stores. Tarun Soni, Summer ‘03
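
  (A quick C check, not from the slides, showing which ordering the machine you compile on actually uses.)

    #include <stdio.h>
    #include <stdint.h>

    int main(void) {
        uint32_t word = 0x01020304;                    /* a known 32-bit pattern */
        unsigned char *p = (unsigned char *)&word;     /* look at its bytes in memory */

        if (p[0] == 0x04)
            printf("little endian: least-significant byte at the lowest address\n");
        else
            printf("big endian: most-significant byte at the lowest address\n");
        return 0;
    }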

  34. Binary Numbers: An Introduction • Consider a 4-bit binary number
  Decimal   Binary      Decimal   Binary
  0         0000        4         0100
  1         0001        5         0101
  2         0010        6         0110
  3         0011        7         0111
  • Examples of binary arithmetic:
  3 + 2 = 5:   0011 + 0010 = 0101
  3 + 3 = 6:   0011 + 0011 = 0110   (with carries out of bits 0 and 1)
  Tarun Soni, Summer ‘03

  35. Negative Numbers: Some options • Sign Magnitude -- MSB is the sign bit, the rest stays the same • -1 == 1001 • -5 == 1101 • One’s complement -- flip all bits to negate • -1 == 1111 • -5 == 1010 • We would like a number system that provides • an obvious representation of 0, 1, 2, ... • uses the adder for addition • a single value of 0 • equal coverage of positive and negative numbers • easy detection of sign • easy negation Tarun Soni, Summer ‘03

  36. Negative Numbers: two’s complement • Positive numbers: normal binary representation • Negative numbers: flip all the bits (0 ↔ 1), then add 1
  Decimal:           -8   -7   -6   -5   -4   -3   -2   -1   0    1    2    3    4    5    6    7
  Two’s complement:  1000 1001 1010 1011 1100 1101 1110 1111 0000 0001 0010 0011 0100 0101 0110 0111
  Smallest 4-bit number: -8. Biggest 4-bit number: 7. Tarun Soni, Summer ‘03

  37. Two’s complement arithmetic
  Decimal   2’s Complement      Decimal   2’s Complement
  0         0000                -1        1111
  1         0001                -2        1110
  2         0010                -3        1101
  3         0011                -4        1100
  4         0100                -5        1011
  5         0101                -6        1010
  6         0110                -7        1001
  7         0111                -8        1000
  • Examples:
  7 - 6 = 7 + (-6) = 1:    0111 + 1010 = 0001   (carry out of the top bit is discarded)
  3 - 5 = 3 + (-5) = -2:   0011 + 1011 = 1110
  Uses a simple adder for both + and - numbers. Tarun Soni, Summer ‘03
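
  (A C sketch of the same 4-bit examples: subtraction really is just addition of the two’s complement on an ordinary adder, keeping only 4 bits.)

    #include <stdio.h>

    static unsigned add4(unsigned a, unsigned b) {
        return (a + b) & 0xF;                  /* carry out of bit 3 is discarded */
    }

    static int as_signed4(unsigned x) {        /* read a 4-bit pattern as a value in [-8, 7] */
        return (x & 0x8) ? (int)x - 16 : (int)x;
    }

    int main(void) {
        printf("7 - 6 = %d\n", as_signed4(add4(0x7, 0xA)));   /* 0111 + 1010 -> 1  */
        printf("3 - 5 = %d\n", as_signed4(add4(0x3, 0xB)));   /* 0011 + 1011 -> -2 */
        return 0;
    }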

  38. Things to keep in mind • Negation • flip the bits and add 1 (works for + and -) • might cause overflow • Extend the sign when loading into a larger register • +3 => 0011, 00000011, 0000000000000011 • -3 => 1101, 11111101, 1111111111111101 • Overflow detection • (need to raise an “exception” when the answer can’t be represented):
  0101 (5) + 0110 (6) = 1011 (-5) ??!!!
  Tarun Soni, Summer ‘03

  39. Overflow detection again
  No overflow:
  0010 (2)  + 0011 (3)  = 0101 (5)
  1100 (-4) + 1110 (-2) = 1010 (-6)
  Overflow:
  0111 (7)  + 0011 (3)  = 1010 (-6) ?!
  1100 (-4) + 1011 (-5) = 0111 (7) ?!
  So how do we detect overflow? Carry into MSB != Carry out of MSB. Tarun Soni, Summer ‘03
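
  (The detection rule, written as a C sketch for 4-bit signed addition; the bit juggling below is just one way to expose the two carries, not a description of the hardware.)

    #include <stdio.h>

    /* Overflow occurred exactly when the carry INTO the sign bit
     * differs from the carry OUT of it. */
    static int overflows4(unsigned a, unsigned b) {
        unsigned carry_into_msb = (((a & 0x7) + (b & 0x7)) >> 3) & 1;  /* carry into bit 3  */
        unsigned carry_out_msb  = ((a + b) >> 4) & 1;                  /* carry out of bit 3 */
        return carry_into_msb != carry_out_msb;
    }

    int main(void) {
        printf("0010 + 0011 ( 2 +  3): overflow = %d\n", overflows4(0x2, 0x3));   /* 0 */
        printf("1100 + 1110 (-4 + -2): overflow = %d\n", overflows4(0xC, 0xE));   /* 0 */
        printf("0111 + 0011 ( 7 +  3): overflow = %d\n", overflows4(0x7, 0x3));   /* 1 */
        printf("1100 + 1011 (-4 + -5): overflow = %d\n", overflows4(0xC, 0xB));   /* 1 */
        return 0;
    }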

  40. Execution: the heart of it all
  (Figure: an ALU with two 32-bit inputs a and b, an operation select input, and a 32-bit result.)
  Instruction Fetch → Instruction Decode → Operand Fetch → Execute → Result Store → Next Instruction
  Tarun Soni, Summer ‘03

  41. A Basic ALU
  (Figure: ALU with N-bit inputs A and B, a 3-bit ALUop control input, and outputs Result (N bits), Zero, Overflow, and CarryOut.)
  ALU Control Lines (ALUop)   Function
  000                         And
  001                         Or
  010                         Add
  110                         Subtract
  111                         Set-on-less-than
  General idea: build it for 1-bit numbers and then extend it to n bits! Tarun Soni, Summer ‘03
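
  (A tiny behavioural model of that control encoding in C; a sketch of what the ALU computes, not of how the hardware is built — the following slides build it from 1-bit cells.)

    #include <stdio.h>
    #include <stdint.h>

    uint32_t alu(unsigned aluop, uint32_t a, uint32_t b, int *zero) {
        uint32_t result;
        switch (aluop) {
        case 0x0: result = a & b; break;                       /* 000: And              */
        case 0x1: result = a | b; break;                       /* 001: Or               */
        case 0x2: result = a + b; break;                       /* 010: Add              */
        case 0x6: result = a - b; break;                       /* 110: Subtract         */
        case 0x7: result = ((int32_t)a < (int32_t)b); break;   /* 111: Set-on-less-than */
        default:  result = 0;     break;
        }
        *zero = (result == 0);                                 /* Zero output (used by branches) */
        return result;
    }

    int main(void) {
        int zero;
        printf("add: %u\n", alu(0x2, 7, 5, &zero));   /* 12 */
        printf("sub: %u\n", alu(0x6, 7, 5, &zero));   /* 2  */
        printf("slt: %u\n", alu(0x7, 3, 9, &zero));   /* 1  */
        return 0;
    }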

  42. Some basics of digital logic Tarun Soni, Summer ‘03

  43. 1-bit ALU
  ALU Control Lines (ALUop)   Function
  000                         And
  001                         Or
  010                         Add
  But how do we make the adder? Tarun Soni, Summer ‘03

  44. 1-bit Full Adder • Inputs A, B, CarryIn; outputs Sum, CarryOut • This is also called a (3, 2) adder • Half Adder: no CarryIn and no CarryOut • Truth table:
  A  B  CarryIn | CarryOut  Sum   Comments
  0  0  0       | 0         0     0 + 0 + 0 = 00
  0  0  1       | 0         1     0 + 0 + 1 = 01
  0  1  0       | 0         1     0 + 1 + 0 = 01
  0  1  1       | 1         0     0 + 1 + 1 = 10
  1  0  0       | 0         1     1 + 0 + 0 = 01
  1  0  1       | 1         0     1 + 0 + 1 = 10
  1  1  0       | 1         0     1 + 1 + 0 = 10
  1  1  1       | 1         1     1 + 1 + 1 = 11
  Tarun Soni, Summer ‘03

  45. 1-bit Full Adder: CarryOut • From the rows of the truth table above where CarryOut = 1:
  CarryOut = (!A & B & CarryIn) | (A & !B & CarryIn) | (A & B & !CarryIn) | (A & B & CarryIn)
  which simplifies to
  CarryOut = (B & CarryIn) | (A & CarryIn) | (A & B)
  Tarun Soni, Summer ‘03

  46. 1-bit Full Adder: Sum • From the rows of the truth table above where Sum = 1:
  Sum = (!A & !B & CarryIn) | (!A & B & !CarryIn) | (A & !B & !CarryIn) | (A & B & CarryIn)
  i.e. Sum = A XOR B XOR CarryIn
  Tarun Soni, Summer ‘03
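
  (The two equations above, written bit by bit in C and then chained into a 4-bit ripple-carry adder; an illustrative sketch of the construction these slides describe.)

    #include <stdio.h>

    static unsigned full_adder(unsigned a, unsigned b, unsigned cin, unsigned *cout) {
        *cout = (a & b) | (a & cin) | (b & cin);   /* CarryOut equation from slide 45 */
        return a ^ b ^ cin;                        /* Sum equation from slide 46      */
    }

    static unsigned ripple_add4(unsigned a, unsigned b, unsigned cin) {
        unsigned sum = 0, carry = cin;
        for (int i = 0; i < 4; i++) {              /* chain four 1-bit adders */
            sum |= full_adder((a >> i) & 1, (b >> i) & 1, carry, &carry) << i;
        }
        return sum;                                /* final carry out is dropped */
    }

    int main(void) {
        printf("0011 + 0010 = %u\n", ripple_add4(0x3, 0x2, 0));   /* 5 */
        printf("0111 + 1010 = %u\n", ripple_add4(0x7, 0xA, 0));   /* 1, i.e. 7 + (-6) */
        return 0;
    }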

  47. 32-bit ALU • (ALUop) Function: And, Or, Add • The 1-bit ALU • What about other operations?
  sub $s1, $s2, $s3   ; $s1 = $s2 - $s3 ; subtraction
  slt $s1, $s2, $s3   ; if ($s2 < $s3) {$s1 = 1} else {$s1 = 0} ; set on less than (SLT)
  The 32-bit ALU. Tarun Soni, Summer ‘03

  48. 32-bit ALU • Keep in mind the following: • (A - B) is the same as: A + (-B) • 2’s Complement negate: Take the inverse of every bit and add 1 • Bit-wise inverse of B is !B: • A - B = A + (-B) = A + (!B + 1) = A + !B + 1 Binvert provides the negation What about the ‘+1’ ? Tarun Soni, Summer ‘03

  49. 32-bit ALU Setting CarryIn[0] = 1 provides the ‘+1’ for the 32-bit adder. Tarun Soni, Summer ‘03
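
  (Putting slides 48 and 49 together: a small C check, illustrative only, that the adder plus the inverted B plus CarryIn[0] = 1 really performs subtraction.)

    #include <stdio.h>
    #include <stdint.h>

    int main(void) {
        uint32_t a = 29, b = 13;
        uint32_t diff = a + ~b + 1;             /* Binvert gives ~b, CarryIn[0] = 1 gives the "+1" */
        printf("%u - %u = %u\n", a, b, diff);   /* 16 */
        return 0;
    }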

  50. 32-bit ALU: slt • slt instruction • If ( a<b) result = 1, else result = 0; • If ( a-b < 0 ) result = 1, else result = 0; • Do a subtract • use sign bit • route to bit 0 of result • all other bits zero Tarun Soni, Summer ‘03
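
  (slt as described above, sketched in C: do a subtract, then route the sign bit of the result to bit 0. Overflow corner cases are ignored here for brevity.)

    #include <stdio.h>
    #include <stdint.h>

    static uint32_t slt(int32_t a, int32_t b) {
        int32_t diff = a - b;                 /* do a subtract */
        return ((uint32_t)diff >> 31) & 1;    /* sign bit becomes result bit 0; all other bits zero */
    }

    int main(void) {
        printf("slt(3, 9)  = %u\n", slt(3, 9));    /* 1 */
        printf("slt(9, 3)  = %u\n", slt(9, 3));    /* 0 */
        printf("slt(-4, 2) = %u\n", slt(-4, 2));   /* 1 */
        return 0;
    }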
