Advanced Computer Architecture

Advanced Computer Architecture • We will consider the issues involved in current Architecture design and implementation: • RISC instruction sets • Pipelining • Reducing Cache Misses and Optimizing Virtual Memory Usage • Designing an I/O System • Interconnection Networks and Multiprocessing

A Different Perspective • Rather than focusing on the roles of the architectural components, we will • Use quantitative measures to test ideas • Use a RISC instruction set for examples • Discuss a variety of software and hardware optimization techniques • Attempt to extract as much parallelism from the code as possible

How did we get here? • Growth of microprocessor capabilities and reduction in prices in the past few years (see figure 1.1 p. 2) • Dominance of microprocessor-based machines (workstations, PCs) in society • Freedom from compatibility with older architectures and designs • Renaissance in computer design emphasizing new architectural innovations • Efficient use of improvements in hardware and compiler technologies • Sustaining these improvements will require continued innovation in design: our task!

Tasks of an Architect • Determine what attributes (HSA, ISA) are important for a new machine • Design the machine to maximize its performance while staying within cost constraints • ISA design, functional organization, logic design, implementation of circuits, packaging, power supply & cooling • We will concentrate on the first 3 • See figure 1.2, page 5

Trends in Computer Usage • High-level languages are replacing assembly language - easier software compatibility, less restrictive on hardware • Memory needs of software grow by a factor of 1.5 to 2 every year • Compiler technology continues to improve (optimization) • Improved ISAs are moving towards RISC

Trends in Hardware Technology • Transistor density increases 50%/year • Die sizes increase 10-25%/year • Combined, transistor count increases 60-80%/year • DRAM density increases 60%/year • DRAM cycle time decreases by about 1/3 in 10 years • Disk density increases 50%/year • Disk access time decreases by about 1/3 in 10 years

Cost • Factors: • Learning curve (manufacturing cost decreases over time) • Yield (the fraction of manufactured components that pass testing) • Volume (how many units are produced) • Commoditization (how widely distributed the product is) • Packaging • See figure 1.3, page 9

Formulas for computing cost • Cost(IC) = (Cost(die) + Cost(testing die) + Cost(packaging & final test)) / Final test yield • Cost(die) = Cost(wafer) / (Dies per wafer * Die yield) • Dies per wafer = pi * (Diameter(wafer)/2)^2 / Die area - pi * Diameter(wafer) / sqrt(2 * Die area) • Example: number of dies per 20 cm wafer for a die that is 1.5 cm on a side (die area 2.25 cm^2) = pi * (20/2)^2 / 2.25 - pi * 20 / sqrt(2 * 2.25) = 314/2.25 - 62.8/2.12 = 139.6 - 29.6 = 110
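
The dies-per-wafer and die-cost formulas above can be sketched in Python as follows. This is a minimal illustration, not part of the original slides; the function names are hypothetical, and the wafer cost in the usage line is an invented figure used only to exercise the function.

```python
import math


def dies_per_wafer(wafer_diameter_cm, die_area_cm2):
    """Usable dies on a round wafer: wafer area / die area, minus an edge-loss term."""
    wafer_area = math.pi * (wafer_diameter_cm / 2) ** 2
    edge_loss = math.pi * wafer_diameter_cm / math.sqrt(2 * die_area_cm2)
    return wafer_area / die_area_cm2 - edge_loss


def die_cost(wafer_cost, wafer_diameter_cm, die_area_cm2, die_yield):
    """Cost(die) = Cost(wafer) / (Dies per wafer * Die yield)."""
    return wafer_cost / (dies_per_wafer(wafer_diameter_cm, die_area_cm2) * die_yield)


# Slide example: 20 cm wafer, die 1.5 cm on a side (2.25 cm^2).
print(int(dies_per_wafer(20, 2.25)))  # whole dies only -> 110

# Hypothetical wafer cost of 1100 (ASSUMED, not from the slides), perfect die yield.
print(die_cost(1100, 20, 2.25, die_yield=1.0))
```

Truncating to a whole number reflects that a partial die at the wafer edge cannot be sold.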

Another Example • Die yield = Wafer yield * (1 + dpua * Die area / a)^-a • dpua is defects per unit area; a is a parameter roughly corresponding to the number of masking levels (a measure of manufacturing complexity) • Find the die yield for dies that are 1 cm on a side, assuming a defect density of 0.8 per cm^2. What about 1.5 cm on a side? • The die areas are 1 cm^2 and 2.25 cm^2; the calculations take wafer yield = 100% and a = 3 • 1 cm per side: die yield = (1 + 0.8*1/3)^-3 = 0.49 • 1.5 cm per side: die yield = (1 + 0.8*2.25/3)^-3 = 0.24
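
The yield model on this slide can be checked with a few lines of Python. This is a sketch only; the function and parameter names are hypothetical, and the defaults (a = 3, wafer yield = 100%) are the values implied by the slide's arithmetic.

```python
def die_yield(defects_per_cm2, die_area_cm2, a=3.0, wafer_yield=1.0):
    """Die yield = wafer yield * (1 + defects per unit area * die area / a) ** -a."""
    return wafer_yield * (1 + defects_per_cm2 * die_area_cm2 / a) ** -a


# Slide example: 0.8 defects per cm^2, dies 1 cm and 1.5 cm on a side.
print(round(die_yield(0.8, 1.0 * 1.0), 2))  # -> 0.49
print(round(die_yield(0.8, 1.5 * 1.5), 2))  # -> 0.24
```

Roughly doubling the die area halves the yield here, which is why die cost grows much faster than die area.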

Reporting Performance • What does it mean that one computer is faster than another? • We might use terms such as: • execution time (also called response time) • throughput • wall-clock time (or elapsed time) • CPU time, user CPU time, system CPU time • System performance • CPU performance

Performance Measures • To say X is n times faster than Y means that • Execution time Y / Execution time X = n • Performance X / Performance Y = n • Saying the throughput of X is 1.3 times higher than Y means that X completes 1.3 times as many tasks as Y in the same amount of time

Comparing Results • MIPS, MFLOPS, MHz, throughput, response time, etc. can all be misleading when quoted in isolation • Computer X might run at 300 MHz and Y at 100 MHz, yet Y might have a larger throughput than X • It is important to compare computers that are performing the same (or equivalent) tasks - this is the only way to get accurate comparisons

Benchmarks • There are four levels of programs that can be used to test performance • Real programs - e.g., C compiler, TeX, CAD tool, programs that have input and output and options that the user can select • Kernels - key pieces extracted from real programs and tested on their own • Toy benchmarks - 10-100 lines of code, such as quicksort, whose results are known in advance • Synthetic benchmarks - try to match the average frequency of operations in some large program

Benchmark Suite • A set of programs that test different performance metrics such as arrays, floating point operations, loops, etc… • SPEC92 is a commonly quoted benchmark suite • One problem that has arisen is that some architectures are now optimized to perform well on SPEC92 even though the computers produced may not be as good as others!

SPEC92 programs • Consists of source programs in C and FORTRAN • Programs range from 272 to 83,589 lines of code • Real-world applications such as a circuit simulator, Monte Carlo simulation of a nuclear reactor, a chemical application that solves equations for a model of 500 atoms, matrix multiplication and FFT, a neural net training simulator, a Lisp interpreter, spreadsheet computations, etc. • See figure 1.9, page 22

Reporting Performance Results • One important factor is that performance results be reproducible • However, reported results may omit such information as the input, compiler settings, version of compiler, version of OS, size and number of disks, etc. • SPEC benchmark reports must include information like compiler flags, a fairly complete description of the machine, and results from running with normal and optimized compilers

Comparing Performances • Consider Figure 1.11, page 24. We can say: • A is 10 times faster than B for P1 • B is 10 times faster than A for P2 • A is 20 times faster than C for P1 • C is 50 times faster than A for P2 • B is 2 times faster than C for P1 • C is 5 times faster than B for P2 • Taken by itself, none of these gives a real picture of the power of any computer, but advertisers might use one of them anyway!

A Consistent Measure • We can solve the previous problem by computing total execution time for the 2 programs and say • B is 9.1 times faster than A for P1 & P2 • C is 25 times faster than A for P1 & P2 • C is 2.75 times faster than B for P1 & P2 • We can also use arithmetic mean, harmonic mean, weighted mean and geometric mean to provide a better picture. See figures 1.12 and 1.13 on pages 26-27 for examples
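
The total-execution-time comparison can be sketched as below. Figure 1.11 is not reproduced in these notes, so the execution times in the table are ASSUMED values, chosen to be consistent with every ratio quoted on the previous slide. The helper name `times_faster` is hypothetical.

```python
from statistics import geometric_mean, mean

# ASSUMED execution times in seconds (consistent with the quoted ratios,
# e.g. A is 10x faster than B on P1, C is 50x faster than A on P2).
times = {
    "A": {"P1": 1, "P2": 1000},
    "B": {"P1": 10, "P2": 100},
    "C": {"P1": 20, "P2": 20},
}

# Total execution time over both programs, per machine.
totals = {machine: sum(runs.values()) for machine, runs in times.items()}


def times_faster(slow, fast):
    """How many times faster `fast` is than `slow`, using total execution time."""
    return totals[slow] / totals[fast]


for machine, runs in times.items():
    values = list(runs.values())
    print(machine, "total:", totals[machine],
          "arithmetic mean:", mean(values),
          "geometric mean:", round(geometric_mean(values), 2))

print("B vs A:", round(times_faster("A", "B"), 1))  # -> 9.1
print("C vs A:", round(times_faster("A", "C"), 1))  # -> 25.0
print("C vs B:", round(times_faster("B", "C"), 2))  # -> 2.75
```

With these assumed times, A and B have the same geometric mean even though their totals differ by a factor of 9.1, which shows how much the choice of mean can change the conclusion.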

Amdahl’s Law • A fundamental law describing the performance gain (speedup) obtained from an architectural enhancement • Speedup = performance of the task using the enhancement / performance of the task without the enhancement, or • Speedup = execution time without the enhancement / execution time using the enhancement when possible

Using Amdahl’s Law • We must consider two factors in using this: • Fraction enhanced: the fraction of the computation time in the original machine that can be converted to take advantage of the enhancement • Speedup enhanced: the improvement gained by the enhanced execution mode (how much faster would the task run if the enhanced mode were used for the entire program?) • Speedup = 1 / [(1 - Fraction enhanced) + (Fraction enhanced / Speedup enhanced)]

Examples • An enhancement runs 10 times faster but is only usable 40% of the time. • Speedup = 1 / [(1 - 0.4) + 0.4/10] = 1 / 0.64 = 1.56 • Suppose FP sqrt is responsible for 20% of the execution time of a benchmark. We could add FP sqrt hardware that will speed up this operation by a factor of 10, or we could try to enhance all FP operations by a factor of 2 (FP operations account for 1/2 of the benchmark's execution time) • Speedup FP sqrt = 1 / [(1 - 0.2) + 0.2/10] = 1 / 0.82 = 1.22 • Speedup all FP = 1 / [(1 - 0.5) + 0.5/2] = 1 / 0.75 = 1.33
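
Amdahl's law is easy to check numerically. The sketch below uses a hypothetical helper, `amdahl_speedup`, and reproduces the three results on this slide.

```python
def amdahl_speedup(fraction_enhanced, speedup_enhanced):
    """Overall speedup = 1 / ((1 - f) + f / s) for enhanced fraction f and local speedup s."""
    if not 0.0 <= fraction_enhanced <= 1.0:
        raise ValueError("fraction_enhanced must be between 0 and 1")
    if speedup_enhanced <= 0:
        raise ValueError("speedup_enhanced must be positive")
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)


print(round(amdahl_speedup(0.4, 10), 2))  # enhancement usable 40% of the time -> 1.56
print(round(amdahl_speedup(0.2, 10), 2))  # faster FP sqrt only -> 1.22
print(round(amdahl_speedup(0.5, 2), 2))   # all FP operations 2x faster -> 1.33
```

A modest improvement to a frequent case (all FP, factor of 2) beats a large improvement to a rare case (FP sqrt, factor of 10).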

CPU Performance • CPU time = CPU clock cycles for prog * Clock cycle time • CPU time = CPU clock cycles for prog / Clock rate • IC - instruction count (# of instructions in the program), CPI - clock cycles per instruction • CPI = CPU clock cycles for prog / IC • CPU time = IC * CPI * Clock cycle time • CPU time = Sum_i (CPI_i * IC_i) * Clock cycle time • CPI = Sum_i (CPI_i * IC_i / Instruction count)

Example • Frequency of FP operations = 25% • Average CPI of FP operations = 4.0 • Average CPI of other instructions = 1.33 • Frequency of FP sqrt = 2%, CPI of FP sqrt = 20 • CPI = 4*25%+1.33*75% = 2.0 • Two alternatives: reduce CPI of FP sqrt to 2 or reduce CPI of all FP ops to 2 • CPI new FP sqrt = CPI original - 2% * (20-2) = 1.64 • CPI new FP = 75%*1.33+25%*2.0=1.5 • Speedup new FP = CPI original/CPI new FP = 1.33 (refer back to previous example)
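
The CPI comparison above can be reproduced as follows. This is a sketch under the slide's numbers; variable names are hypothetical. The code keeps full precision, so the original CPI comes out as 1.9975 rather than the rounded 2.0 used on the slide, and intermediate values differ slightly in the third decimal place.

```python
# Instruction mix from the slide.
fp_fraction, fp_cpi = 0.25, 4.0
other_fraction, other_cpi = 0.75, 1.33
sqrt_fraction, sqrt_cpi = 0.02, 20.0

# Overall CPI is the frequency-weighted sum of per-class CPIs.
cpi_original = fp_fraction * fp_cpi + other_fraction * other_cpi

# Alternative 1: FP sqrt CPI drops from 20 to 2.
cpi_new_sqrt = cpi_original - sqrt_fraction * (sqrt_cpi - 2.0)

# Alternative 2: every FP operation drops to CPI 2.
cpi_new_fp = other_fraction * other_cpi + fp_fraction * 2.0

# Same instruction count and clock in all cases, so speedup is the CPI ratio.
print("speedup, faster sqrt:", round(cpi_original / cpi_new_sqrt, 2))  # -> 1.22
print("speedup, faster FP:  ", round(cpi_original / cpi_new_fp, 2))    # -> 1.33
```

Both speedups agree with the Amdahl's law results for the same two alternatives.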

CPU Components’ Performance • Large part of a Comp. Architect’s job is to design tools or means of measuring the CPU component performances • Low level tools: timing estimators • We can also measure the instruction count for a program using compiler technology, using the program execution duration and the instruction mix • Execution-based monitoring by including in the program, code that saves the instruction mix during execution

Measuring CPI • Requires knowing the processor’s organization and the instruction stream • Designers may use Average CPIs but this is influenced by cache and pipeline structures • We might assume a perfect memory system that does not cause delays • Pipeline CPI measures can be determined by simulating the pipeline (which might be sufficient for simple pipes but not for advanced pipes)

More Examples of CPU Performance • 2 alternatives for a conditional branch instruction • CPU A: condition code is set by a compare instruction and followed by a branch that tests the condition code • CPU B: compare is included in the branch • On both CPUs a conditional branch takes 2 clock cycles and all other instructions take 1 clock cycle. For CPU A, 20% of all instructions are conditional branches (so another 20% are the compares that precede them) • Assume CPU A has a clock cycle that is 1.25 times faster than CPU B's (since CPU A does not have the compare included in the branch instruction) • Which CPU is faster?

Solution • CPI A = 0.2 * 2 + 0.8 * 1 = 1.2 • CPU time A = IC A * 1.2 * Clock cycle time A • A’s clock rate is 1.25 times higher than B’s, so Clock cycle time B = 1.25 * Clock cycle time A • B does not execute separate compares, so IC B = 0.8 * IC A, and its instruction mix becomes 25% branches (20/80) and 75% other • CPI B = 0.25 * 2 + 0.75 * 1 = 1.25 • CPU time B = IC B * 1.25 * Clock cycle time B = 0.8 * IC A * 1.25 * 1.25 * Clock cycle time A • CPU time B = 1.25 * IC A * Clock cycle time A • So CPU time A (1.2 * IC A * Clock cycle time A) is shorter than CPU time B, and A is faster
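
The comparison can be worked through numerically with everything expressed relative to CPU A (instruction count 1, clock cycle time 1). This is an illustrative sketch; the variable names are hypothetical.

```python
BRANCH_CPI, OTHER_CPI = 2, 1

# CPU A: 20% of instructions are branches, each preceded by a separate compare.
branch_fraction_a = 0.20
cpi_a = branch_fraction_a * BRANCH_CPI + (1 - branch_fraction_a) * OTHER_CPI
time_a = 1.0 * cpi_a * 1.0  # relative IC * CPI * relative cycle time

# CPU B: the compare is folded into the branch, so those 20% of instructions vanish.
ic_b = 1.0 - branch_fraction_a             # 0.8 of A's instruction count
branch_fraction_b = branch_fraction_a / ic_b  # same branch count in a smaller program
cpi_b = branch_fraction_b * BRANCH_CPI + (1 - branch_fraction_b) * OTHER_CPI
cycle_time_b = 1.25                        # B's clock cycle is 1.25x longer than A's
time_b = ic_b * cpi_b * cycle_time_b

print("CPU time A:", round(time_a, 3))  # -> 1.2
print("CPU time B:", round(time_b, 3))  # -> 1.25
print("faster CPU:", "A" if time_a < time_b else "B")
```

B executes fewer instructions, but its higher CPI and slower clock more than cancel that advantage.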

Memory Hierarchy • Register (CPU) • Cache • Main Memory • I/O Devices • Hard Disk • Optical disk, floppy disk • Magnetic tape • See figures 1.15 and 1.16 on p. 40-41

Cache Performance • Assume: Cache is 10 times faster than memory and cache hit rate is 90%. How much speedup is gained using this cache? • Use Amdahl’s law: • Speedup = 1 / [(1-90%) + (90%/10)] = 1/[.1+.9/10] = 5.3 • Over a 5 times speedup by using cache with these specifications!

Memory Impact on CPU • In a pipeline, a memory stall occurs when a memory access (instruction or operand fetch) misses in the cache • CPU execution time = (CPU clock cycles + Memory stall cycles) * Clock cycle time • Memory stall cycles = number of misses * miss penalty = IC * misses per instruction * miss penalty = IC * memory references per instruction * miss rate * miss penalty • Miss rate is determined by cache efficiency • Miss penalty is determined by main memory system speed (also bus load and bandwidth, etc.)

Example • Assume a machine with • CPI = 2.0 when all memory accesses are hits • Only data accesses are loads and stores (40% of all instructions are loads and stores) • Miss penalty = 25 clock cycles • Miss rate = 2% • How much faster would the machine be if all accesses are hits?

Solution • For machine with no misses: • CPU exec. Time = (CPU clock cycles + memory stall cycles) * clock cycle = (IC * CPI + 0) * clock cycle • For machine with 2% miss rate: • Memory stall cycles = IC * memory references/instr * miss rate * miss penalty = IC * (1 + .4) * .02 * 25 = IC * .7 • CPU exec Time = (IC * 2.0 + IC * .7) * clock cycle = 2.7 * IC * clock cycle • So, the machine with no misses is 2.7/2.0 times faster or 1.35 times faster
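
The stall calculation above can be expressed per instruction, since IC and clock cycle time cancel in the ratio. This is a sketch with a hypothetical helper name.

```python
def cpi_with_stalls(base_cpi, mem_refs_per_instr, miss_rate, miss_penalty):
    """Effective CPI = base CPI + memory references/instr * miss rate * miss penalty."""
    return base_cpi + mem_refs_per_instr * miss_rate * miss_penalty


base_cpi = 2.0
# One instruction fetch per instruction, plus a data access for the 40% loads/stores.
mem_refs_per_instr = 1 + 0.4

real_cpi = cpi_with_stalls(base_cpi, mem_refs_per_instr, miss_rate=0.02, miss_penalty=25)
print("CPI with misses:", round(real_cpi, 2))                   # -> 2.7
print("perfect-cache speedup:", round(real_cpi / base_cpi, 2))  # -> 1.35
```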

Fallacies • MIPS is an accurate measure for comparing performance among computers • MFLOPS is a consistent and useful measure of performance • Synthetic benchmarks predict performance for real programs • Benchmarks remain valid indefinitely • Peak performance tracks observed performance

What is wrong with MIPS? • MIPS is dependent on the instruction set, making it difficult to compare computers that use different ISAs • MIPS varies between programs on the same computer! • MIPS can vary inversely to performance • Example: floating point operations might be executed by floating point hardware (fewer, slower instructions and thus a lower MIPS rating) or by software routines built from simple integer instructions, which give a higher MIPS rating though a slower outcome

Example: optimized compiler • Optimized compiler for load-store machine with specs as shown in figure 1.17, p. 45 • Compiler discards 1/2 of the ALU instructions • Assume a 2 nsec clock cycle (and no system issues), 1.57 unoptimized CPI, what is the MIPS rating for optimized vs. unoptimized code?

Solution • CPI unopt = 1.57 • MIPS unopt = 500 MHz / (1.57 * 10^6) = 318.5 • CPU time unopt = IC unopt * 1.57 * 2*10^-9 = 3.14 * 10^-9 * IC unopt • CPI opt = [(.43/2)*1 + .21*2 + .12*2 + .24*2] / [1 - (.43/2)] = 1.73 • MIPS opt = 500 MHz / (1.73 * 10^6) = 289.0 • CPU time opt = (0.785 * IC unopt) * 1.73 * 2*10^-9 = 2.72 * 10^-9 * IC unopt • Optimized code is 3.14/2.72 = 1.15 times faster, but its MIPS rating is lower!
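
The MIPS paradox in this example can be reproduced with a short script. Figure 1.17 is not included in these notes: the frequencies and CPIs below are taken from the CPI formula on the slide, while the class labels other than ALU (load, store, branch) are ASSUMED. The code does not round CPI before dividing, so its MIPS and speedup figures differ slightly from the slide's.

```python
CLOCK_CYCLE_NS = 2.0
CLOCK_RATE_MHZ = 1000.0 / CLOCK_CYCLE_NS  # 2 ns cycle -> 500 MHz

# class: (instructions relative to unoptimized IC, CPI). Labels partly ASSUMED.
unoptimized = {"alu": (0.43, 1), "load": (0.21, 2), "store": (0.12, 2), "branch": (0.24, 2)}

# The optimizing compiler discards half of the ALU instructions.
optimized = dict(unoptimized)
optimized["alu"] = (unoptimized["alu"][0] / 2, unoptimized["alu"][1])


def summarize(mix):
    """Return (CPI, MIPS rating, CPU time in ns per unoptimized instruction)."""
    relative_ic = sum(count for count, _ in mix.values())
    cycles = sum(count * cpi for count, cpi in mix.values())
    cpi = cycles / relative_ic
    mips = CLOCK_RATE_MHZ / cpi  # clock rate / (CPI * 10^6)
    return cpi, mips, cycles * CLOCK_CYCLE_NS


cpi_u, mips_u, time_u = summarize(unoptimized)
cpi_o, mips_o, time_o = summarize(optimized)

print(f"unoptimized: CPI={cpi_u:.2f} MIPS={mips_u:.1f} time={time_u:.2f} ns")
print(f"optimized:   CPI={cpi_o:.2f} MIPS={mips_o:.1f} time={time_o:.2f} ns")
print(f"speedup from optimization: {time_u / time_o:.2f}")
```

The optimized program finishes sooner yet reports a lower MIPS rating, because the instructions that were removed were the cheapest ones.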
