Computer Architecture

Princess Sumaya University for Technology Computer Architecture Dr. Esam Al_Qaralleh

Performance & cost

Performance Evolution • 1970s • Mainframes dominated – performance improved 25-30%/yr • Mostly due to improved architecture + some technology aids • 1980s • VLSI + microprocessor became the foundation • Technology improves at 35%/yr

Performance Evolution (Cont.) • 1980s (Cont.) • Compiler focus brought on the great CISC vs. RISC debate • With the exception of Intel – RISC won the argument • RISC performance improved by 50%/year initially • Of course RISC is not as simple anymore and the compiler is a key part of the game • Does not matter how fast your computer is, if the compiler wastes most of it due to the inability to generate efficient code • With the exploitation of instruction-level parallelism (pipeline + super-scalar) and the use of caches, performance is further enhanced CISC: Complex Instruction Set Computing RISC: Relegate Important Stuff to the Compiler (Reduced Instruction Set Computing)

Growth in Performance (Figure 1.1) Mainly due to advanced architecture ideas Technology driven

Optimizing the Design • Usually the functional requirements are set by the company/marketplace • Which design is optimal depends on the choice of metric • Cost minimized → simple design • Performance maximized → complex design or better technology • Time to market minimized → also favors simplicity • Oh – and you only get one shot • Requires heaps of simulation and must quantify everything • Inherent requirements for deep infrastructure and support • Plus you must predict the trends…

Cost, Price, and Their Trends

Cost • Clearly a marketplace issue -- profit as a function of volume • Let’s focus on hardware costs • Factors impacting cost • Learning curve – manufacturing costs decrease over time • Yield – the percentage of manufactured devices that survive the testing procedure • Volume is also a key factor in determining cost • Commodities are products that are sold by multiple vendors in large volumes and are essentially identical (e.g. laptops)

Learning Curve at Work

Integrated Circuits Costs Die Cost goes roughly with die area

Cost of an Integrated Circuit • Die yield is the fraction or percentage of good dies on a wafer • α is a parameter that corresponds roughly to the number of masking levels, a measure of manufacturing complexity that is critical to die yield (α = 4.0 is a good estimate).

Example: Finding the number of dies • Find the number of dies per 30-cm wafer for a die that is 0.7 cm on a side. • Ans: The die area is 0.7 × 0.7 = 0.49 cm². Thus
$$\text{Dies per wafer} = \frac{\pi \times (30/2)^2}{0.49} - \frac{\pi \times 30}{(2 \times 0.49)^{0.5}} = 1347$$
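
The dies-per-wafer estimate can be checked with a few lines of Python. This is a minimal sketch: the function name `dies_per_wafer` and its parameter names are illustrative and do not come from the slides. The formula is the one used in the worked example (wafer area over die area, minus a correction term for the wafer edge).

```python
import math

def dies_per_wafer(wafer_diameter_cm: float, die_area_cm2: float) -> float:
    """Estimate dies per wafer: wafer area / die area, minus an edge-loss term."""
    radius = wafer_diameter_cm / 2
    whole_wafer = math.pi * radius ** 2 / die_area_cm2
    edge_loss = math.pi * wafer_diameter_cm / math.sqrt(2 * die_area_cm2)
    return whole_wafer - edge_loss

# 30-cm wafer, die 0.7 cm on a side (area 0.49 cm^2), as in the example
estimate = dies_per_wafer(30, 0.49)
print(int(estimate))  # truncate: a fractional die is not usable
```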

Example: Finding the die yield • Find the die yield for dies that are 1 cm on a side and 0.7 cm on a side, assuming a defect density of 0.6 per cm². • Ans: The die areas are 1 cm² and 0.49 cm². For the larger die, the yield is
$$\text{Die yield} = \left(1 + \frac{0.6 \times 1}{4}\right)^{-4} = 0.57$$
For the smaller die, it is
$$\text{Die yield} = \left(1 + \frac{0.6 \times 0.49}{4}\right)^{-4} = 0.75$$
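
The same yield calculation as a small Python sketch. The function name `die_yield` and its parameters are illustrative names, not part of the slides. The expression is the one applied in the worked example, with α defaulting to 4.0.

```python
def die_yield(defects_per_cm2: float, die_area_cm2: float, alpha: float = 4.0) -> float:
    """Die yield as used in the example: (1 + defect density * area / alpha) ** -alpha."""
    return (1 + defects_per_cm2 * die_area_cm2 / alpha) ** (-alpha)

# Defect density of 0.6 per cm^2, dies of 1 cm^2 and 0.49 cm^2
for area in (1.0, 0.49):
    print(f"area {area} cm^2 -> yield {die_yield(0.6, area):.2f}")
```

Because the yield falls as die area grows, a larger die costs more twice over: fewer dies fit on the wafer and a smaller fraction of them work.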

Computer Designers and Chip Costs • The computer designer affects die size, and hence cost, both by what functions are included on or excluded from the die and by the number of I/O pins

Measuring and Reporting Performance

Definitions of Time • Time can be defined in different ways, depending on what we are measuring: • Response time : Total time to complete a task, including time spent executing on the CPU, accessing disk and memory, waiting for I/O and other processes, and operating system overhead. • CPU execution time : Total time a CPU spends computing on a given task (excludes time for I/O or running other programs). This is also referred to as simply CPU time. • User CPU time : Total time the CPU spends in the program itself • System CPU time : Total time the operating system spends executing tasks for the program. • For example, a program may have a system CPU time of 22 sec., a user CPU time of 90 sec., a CPU execution time of 112 sec. (90 + 22), and a response time of 162 sec.

Performance • Time to do the task (execution time) – execution time, response time, latency • Tasks per day, hour, week, sec, ns, ... (performance) – performance, throughput, bandwidth • Response time – the time between the start and the completion of a task • Thus, to maximize performance, we need to minimize execution time • If X is n times faster than Y, then $n = \frac{\text{Execution time}_Y}{\text{Execution time}_X} = \frac{\text{Performance}_X}{\text{Performance}_Y}$ • Throughput – the total amount of work done in a given time • Important to data center managers • Decreasing response time almost always improves throughput

Calculating CPU Performance • Want to distinguish elapsed time and the time spent on our task • CPU execution time (CPU time) – time the CPU spends working on a task • Does not include time waiting for I/O or running other programs • Can improve performance by reducing either the length of the clock cycle or the number of clock cycles required for a program

Calculating CPU Performance (Cont.) • We tend to count instructions executed = IC • Note: looking at the object code is just a start • What we care about is the dynamic count - e.g. don’t forget loops, recursion, branches, etc. • CPI (Clock cycles Per Instruction) is a figure of merit

Calculating CPU Performance (Cont.) • 3 Focus Factors -- Cycle Time, CPI, IC • Sadly - they are interdependent and making one better often makes another worse (though the impacts are often small or predictable) • Cycle time depends on HW technology and organization • CPI depends on organization (pipeline, caching...) and ISA • IC depends on ISA and compiler technology • Often CPIs are easier to deal with on a per-instruction basis

Clock Cycles per Instruction • Not all instructions take the same amount of time to execute • One way to think about execution time is that it equals the number of instructions executed multiplied by the average time per instruction • Clock cycles per instruction (CPI) – the average number of clock cycles each instruction takes to execute • A way to compare two different implementations of the same ISA
$$\text{CPU clock cycles for a program} = \text{Instructions for a program} \times \text{Average clock cycles per instruction}$$

Effective CPI • Computing the overall effective CPI is done by looking at the different types of instructions and their individual cycle counts and averaging
$$\text{Overall effective CPI} = \sum_{i=1}^{n} (\text{CPI}_i \times \text{IC}_i)$$
• Where $\text{IC}_i$ is the fraction (percentage) of executed instructions that belong to class i • $\text{CPI}_i$ is the (average) number of clock cycles per instruction for that instruction class • n is the number of instruction classes • The overall effective CPI varies by instruction mix – a measure of the dynamic frequency of instructions across one or many programs
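
A minimal Python sketch of the weighted sum. The function name `effective_cpi` and the instruction mix below are hypothetical, chosen only to illustrate the calculation. They are not measurements from the slides.

```python
def effective_cpi(mix):
    """mix: list of (fraction_of_instructions, cycles_per_instruction) pairs."""
    total_fraction = sum(fraction for fraction, _ in mix)
    if abs(total_fraction - 1.0) > 1e-9:
        raise ValueError("instruction fractions must sum to 1")
    return sum(fraction * cpi for fraction, cpi in mix)

# Hypothetical instruction mix (NOT from the slides): ALU, load, store, branch
hypothetical_mix = [(0.50, 1), (0.20, 5), (0.10, 3), (0.20, 2)]
print(f"{effective_cpi(hypothetical_mix):.2f}")
```

In this made-up mix, loads are only 20% of the instructions but contribute 1.0 of the 2.2 cycles per instruction, which is why the mix matters as much as the per-class cycle counts.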

Example of Computing CPU time • If a computer has a clock rate of 50 MHz, how long does it take to execute a program with 1,000 instructions, if the CPI for the program is 3.5? • Using the equation CPU time = Instruction count × CPI / clock rate gives
$$\text{CPU time} = \frac{1000 \times 3.5}{50 \times 10^{6}}\ \text{s} = 70\ \mu\text{s}$$
• If a computer’s clock rate increases from 200 MHz to 250 MHz and the other factors remain the same, how many times faster will the computer be?
$$\frac{\text{CPU time}_{old}}{\text{CPU time}_{new}} = \frac{\text{clock rate}_{new}}{\text{clock rate}_{old}} = \frac{250\ \text{MHz}}{200\ \text{MHz}} = 1.25$$
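
The CPU time equation translates directly into code. In this sketch the function name `cpu_time_seconds` is an illustrative choice; the inputs are the numbers from the example.

```python
def cpu_time_seconds(instruction_count: float, cpi: float, clock_rate_hz: float) -> float:
    """CPU time = instruction count * CPI / clock rate."""
    return instruction_count * cpi / clock_rate_hz

# 1,000 instructions, CPI of 3.5, 50 MHz clock
t = cpu_time_seconds(1_000, 3.5, 50e6)
print(f"{t * 1e6:.1f} microseconds")

# Same program and CPI, clock raised from 200 MHz to 250 MHz
speedup = cpu_time_seconds(1_000, 3.5, 200e6) / cpu_time_seconds(1_000, 3.5, 250e6)
print(f"speedup {speedup:.2f}")
```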

Comparing and Summarizing Performance • How do we summarize the performance for a benchmark set with a single number? • The average of execution times that is directly proportional to total execution time is the arithmetic mean (AM)
$$\text{AM} = \frac{1}{n} \sum_{i=1}^{n} \text{Time}_i$$
• Where $\text{Time}_i$ is the execution time for the ith program of a total of n programs in the workload • A smaller mean indicates a smaller average execution time and thus improved performance • Guiding principle in reporting performance measurements is reproducibility – list everything another experimenter would need to duplicate the experiment (version of the operating system, compiler settings, input set used, specific computer configuration (clock rate, cache sizes and speed, memory size and speed, etc.))

Choosing Programs to Evaluate Performance • Real applications – clearly the right choice • Porting and eliminating system-dependent activities • User burden -- to know which of your programs you really care about • Modified (or scripted) applications • Enhance portability or focus on particular aspects of system performance • Kernels – small, key pieces of real programs • Best used to isolate performance of individual features to explain the reasons for differences in performance of real programs • e.g. testing memory/ALU/branch instructions • Not real programs however -- no user really uses them

Choosing Programs to Evaluate Performance (Cont.) • Toy benchmarks – quicksort, puzzle • Beginning programming assignment • Synthetic benchmarks • Try to match the average frequency of operations and operands of a large set of programs • No user really runs them -- not even pieces of real programs • They typically reside in cache & don’t test memory performance • At the very least you must understand what the benchmark code is in order to understand what it might be measuring • Companies thrive or bust on benchmark performance • Hence they optimize for the benchmark • BEWARE ALWAYS!!

Benchmark Suites • SPEC (Standard Performance Evaluation Corporation) • http://www.spec.org • Desktop benchmarks • CPU-intensive: SPEC CPU2000 • Graphic-intensive: SPECviewperf • Server benchmarks • CPU throughput-oriented: SPECrate • I/O activity: SPECSFS (NFS), SPECWeb • Transaction processing: TPC (Transaction Processing Performance Council) • Embedded benchmarks • EEMBC (EDN Embedded Microprocessor Benchmark Consortium)

SPEC Benchmarks www.spec.org

Other Performance Metrics • Power consumption – especially in the embedded market where battery life is important (and passive cooling) • For power-limited applications, the most important metric is energy efficiency

Evaluating ISAs • Design-time metrics: • Can it be implemented, in how long, at what cost? • Can it be programmed? Ease of compilation? • Static Metrics: • How many bytes does the program occupy in memory? • Dynamic Metrics: • How many instructions are executed? How many bytes does the processor fetch to execute the program? • How many clocks are required per instruction? • Best Metric: Time to execute the program! It depends on the instruction set, the processor organization, and compilation techniques, through Inst. Count, CPI, and Cycle Time.

Other Problems • Let’s assume we can get the test jig specified properly • See the following example • Which is better? • By how much? • Are the programs equally important?

Some Aggregate Job Mix Options • Arithmetic Mean - provides a simple average • Does not account for weight - all programs treated equally • Weighted arithmetic mean • Weight is the frequency % of use • Better but beware the dominant program time • Depends on the reference machine

Weighted Arithmetic Mean
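
A minimal sketch of the weighted arithmetic mean, where each program's execution time is multiplied by its frequency of use. The function name, the two machines, and all times and weights below are hypothetical values chosen for illustration. None of them come from the slides.

```python
def weighted_arithmetic_mean(times, weights):
    """Sum of weight_i * time_i; weights are usage frequencies that sum to 1."""
    if len(times) != len(weights):
        raise ValueError("need one weight per program")
    if abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(w * t for w, t in zip(weights, times))

# Hypothetical execution times in seconds for programs P1 and P2
times_a = [1.0, 1000.0]   # machine A
times_b = [10.0, 100.0]   # machine B

equal = [0.5, 0.5]        # both programs run equally often
skewed = [0.999, 0.001]   # P1 dominates the workload

for name, weights in (("equal", equal), ("skewed", skewed)):
    wam_a = weighted_arithmetic_mean(times_a, weights)
    wam_b = weighted_arithmetic_mean(times_b, weights)
    print(f"{name}: A={wam_a:.3f} s, B={wam_b:.3f} s")
```

With equal weights the long-running P2 dominates and machine B looks better. With the skewed weights machine A looks better, so the choice of weights can reverse the conclusion.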

Normalized Time Metrics • Geometric Mean • Has the nice property that: • Ratio of the means = Mean of the ratios • Consistent no matter which machine is the reference • Better than arithmetic means but • Doesn’t form accurate prediction models – doesn’t predict execution time • Still have to remain cautious

Normalized Time Metrics Arithmetic mean should not be used to average normalized execution time
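
The following sketch illustrates why the arithmetic mean misleads on normalized times while the geometric mean stays consistent. The execution times are hypothetical and the helper names are illustrative. Nothing here is taken from the slides' own figures.

```python
import math

def arithmetic_mean(values):
    return sum(values) / len(values)

def geometric_mean(values):
    # exp of the mean log; equivalent to the n-th root of the product
    return math.exp(sum(math.log(v) for v in values) / len(values))

def normalize(times, reference):
    """Divide each program's time by the reference machine's time for it."""
    return [t / r for t, r in zip(times, reference)]

# Hypothetical execution times (seconds) for two programs on machines A and B
times_a = [1.0, 1000.0]
times_b = [10.0, 100.0]

for ref_name, ref in (("A", times_a), ("B", times_b)):
    norm_a = normalize(times_a, ref)
    norm_b = normalize(times_b, ref)
    print(f"reference {ref_name}: "
          f"AM A={arithmetic_mean(norm_a):.2f} B={arithmetic_mean(norm_b):.2f}, "
          f"GM A={geometric_mean(norm_a):.2f} B={geometric_mean(norm_b):.2f}")
```

With these made-up numbers, the arithmetic mean of normalized times favors whichever machine is chosen as the reference. The geometric mean gives the same comparison (a tie here) under either reference.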

Quantitative Principles of Computer Design

Make the Common Case Fast • Need to validate that it is common or uncommon • Often • Common cases are simpler than uncommon cases • e.g. exceptions like overflow, interrupts, ... • Truly simple is usually both cheap and fast - best of both worlds • Trick is to quantify the advantage of a proposed enhancement

Amdahl’s Law • Defines the speedup gained from a particular feature • Depends on 2 factors • Fraction of original computation time that can take advantage of the enhancement - i.e. the commonality of the feature • Level of improvement gained by the feature • Amdahl’s law is a quantification of the diminishing-returns principle

Amdahl's Law (Cont.) Suppose that enhancement E accelerates a fraction F of the task by a factor S, and the remainder of the task is unaffected
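
Under that assumption, the new execution time and the overall speedup can be written as follows. F and S are the symbols from the statement above, and the ExTime notation matches the worked example later in the deck.

$$
\begin{aligned}
\text{ExTime}_{new} &= \text{ExTime}_{old} \times \left[(1 - F) + \frac{F}{S}\right] \\
\text{Speedup} &= \frac{\text{ExTime}_{old}}{\text{ExTime}_{new}} = \frac{1}{(1 - F) + \dfrac{F}{S}}
\end{aligned}
$$

As S grows without bound, the speedup approaches $1/(1 - F)$, so the unaffected fraction sets a hard ceiling on the gain.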

Simple Example • Important Application: • FPSQRT accounts for 20% • FP instructions account for 50% • Other instructions account for 30% • Designers say it costs the same to speed up: • FPSQRT by 40x • FP by 2x • Other by 8x • Which one should you invest in? • Straightforward: plug in the numbers & compare, BUT what’s your guess?? • Amdahl’s Law says nothing about cost

And the Winner Is…?
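
Plugging the three options into Amdahl's Law settles the question. This is a minimal sketch: the function name `amdahl_speedup` and the dictionary layout are illustrative, and the fractions and speedups are the ones listed in the example (20% by 40x, 50% by 2x, 30% by 8x).

```python
def amdahl_speedup(fraction_enhanced: float, speedup_enhanced: float) -> float:
    """Overall speedup when only a fraction of the time is accelerated."""
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)

# (fraction of time affected, speedup of that fraction) for each option
options = {"FPSQRT": (0.20, 40), "FP": (0.50, 2), "Other": (0.30, 8)}

results = {name: amdahl_speedup(f, s) for name, (f, s) in options.items()}
best = max(results, key=results.get)

for name, value in results.items():
    print(f"{name}: overall speedup {value:.3f}")
print("best option:", best)
```

With these numbers, speeding up "Other" by 8x gives the largest overall gain (about 1.36), ahead of FP (about 1.33) and FPSQRT (about 1.24), even though FPSQRT has by far the largest local speedup.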

Example of Amdahl’s Law • Floating point instructions are improved to run twice as fast, but only 10% of the time was spent on these instructions originally. How much faster is the new machine?
$$\text{Speedup} = \frac{\text{ExTime}_{old}}{\text{ExTime}_{new}} = \frac{1}{(1 - \text{Fraction}_{enhanced}) + \dfrac{\text{Fraction}_{enhanced}}{\text{Speedup}_{enhanced}}}$$
$$\text{Speedup} = \frac{1}{(1 - 0.1) + 0.1/2} = 1.053$$
• The new machine is 1.053 times as fast, or 5.3% faster. • How much faster would the new machine be if floating point instructions become 100 times faster?
$$\text{Speedup} = \frac{1}{(1 - 0.1) + 0.1/100} = 1.110$$

Estimating Performance Improvements • Assume a processor currently requires 10 seconds to execute a program and processor performance improves by 50 percent per year. • By what factor does processor performance improve in 5 years? $(1 + 0.5)^5 = 7.59$ • How long will it take a processor to execute the program after 5 years? $\text{ExTime}_{new} = 10/7.59 = 1.32$ seconds
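
The compound-growth estimate as a short Python sketch. The function name `improvement_factor` is an illustrative choice; the 50% annual rate, 5-year horizon, and 10-second baseline are the values from the example.

```python
def improvement_factor(annual_rate: float, years: int) -> float:
    """Performance multiplier after compounding annual_rate for the given years."""
    return (1 + annual_rate) ** years

factor = improvement_factor(0.5, 5)
new_time = 10 / factor  # seconds, starting from a 10-second run

print(f"factor {factor:.2f}, new execution time {new_time:.2f} s")
```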

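Performance Example • Computers M1 and M2 are two implementations of the same instruction set. • M1 has a clock rate of 50 MHz and M2 has a clock rate of 75 MHz. • M1 has a CPI of 2.8 and M2 has a CPI of 3.2 for a given program. • How many times faster is M2 than M1 for this program? The instruction counts cancel because both machines run the same program on the same instruction set:
$$\frac{\text{ExTime}_{M1}}{\text{ExTime}_{M2}} = \frac{\text{IC}_{M1} \times \text{CPI}_{M1} / \text{Clock Rate}_{M1}}{\text{IC}_{M2} \times \text{CPI}_{M2} / \text{Clock Rate}_{M2}} = \frac{2.8/50}{3.2/75} = 1.31$$
• What would the clock rate of M1 have to be for them to have the same execution time? Setting $2.8 / \text{Clock Rate}_{M1} = 3.2 / 75$ gives $\text{Clock Rate}_{M1} = 2.8 \times 75 / 3.2 \approx 65.6$ MHz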

Simple Example • Suppose we have made the following measurements: • Frequency of FP operations (other than FPSQR) =25% • Average CPI of FP operations=4.0 • Average CPI of other instructions=1.33 • Frequency of FPSQR=2% • CPI of FPSQR=20 • Two design alternatives • Reduce the CPI of FPSQR to 2 • Reduce the average CPI of all FP operations to 2

And The Winner is…
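
One way to compare the two alternatives is to compute the average CPI for each, since instruction count and clock rate are unchanged. The sketch below rests on assumptions that the slide does not state explicitly: the measurements are read literally, so FPSQR (2%) is counted separately from the other FP operations (25%); the remaining 73% of instructions are "other"; and the second alternative lowers the CPI of every FP operation, FPSQR included, to 2. The function and variable names are illustrative.

```python
def average_cpi(classes):
    """classes: iterable of (frequency, cpi) pairs whose frequencies sum to 1."""
    return sum(freq * cpi for freq, cpi in classes)

# Measurements from the slide, read literally (see assumptions in the text)
FP_FREQ, FP_CPI = 0.25, 4.0             # FP operations other than FPSQR
SQR_FREQ, SQR_CPI = 0.02, 20.0          # FPSQR
OTHER_CPI = 1.33
OTHER_FREQ = 1.0 - FP_FREQ - SQR_FREQ   # assumption: everything else (73%)

base = average_cpi([(FP_FREQ, FP_CPI), (SQR_FREQ, SQR_CPI), (OTHER_FREQ, OTHER_CPI)])

# Alternative 1: reduce the CPI of FPSQR to 2
alt1 = average_cpi([(FP_FREQ, FP_CPI), (SQR_FREQ, 2.0), (OTHER_FREQ, OTHER_CPI)])

# Alternative 2: reduce the CPI of all FP operations to 2 (assumed to include FPSQR)
alt2 = average_cpi([(FP_FREQ, 2.0), (SQR_FREQ, 2.0), (OTHER_FREQ, OTHER_CPI)])

for label, cpi in (("original", base), ("alternative 1", alt1), ("alternative 2", alt2)):
    print(f"{label}: CPI {cpi:.4f}, speedup {base / cpi:.3f}")
```

Under this reading, the second alternative yields the lower CPI and therefore the larger speedup. A different reading of the frequencies (for example, FPSQR included in the 25%) changes the absolute numbers.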
