ECE200 – Computer Organization

ECE200 – Computer Organization Chapter 2 - The Role of Performance

Homework 2 • 2.1-2.4, 2.10, 2.11, 2.13-2.17, 2.26-2.28, 2.39, 2.41-2.44

Outline for Chapter 2 lectures • How computer systems are generally evaluated • How architects make design tradeoffs • Performance metrics • Combining performance results • Amdahl’s Law

Evaluating computer system performance • A workload is a collection of programs • A user’s workload is the set of programs that they run day in and day out on their computer • Ideally, the user evaluates the performance of their workload on a given machine before deciding whether to purchase it • We common folk don’t get to do this! • Maybe General Motors can… • So how can we as customers make purchase decisions without being able to run our programs on different machines?

Benchmarks • Benchmarks are particular programs chosen to measure the “goodness” (usually performance) of a machine • Benchmark suites attempt to mimic the workloads of particular user communities • Scientific benchmarks, business benchmarks, consumer benchmarks, etc. • Computer manufacturers report performance results for benchmarks to aid users in making machine comparisons • The hope is that most user workloads can be represented well enough by a modest set of benchmark suites

The SPEC benchmarks • SPEC = System Performance Evaluation Cooperative (now the Standard Performance Evaluation Corporation) • Established in 1989 by computer companies to create a benchmark set and reporting practices for evaluating CPU and memory system performance • Provides a set of primarily integer benchmarks (SPECint) and a set of primarily floating point benchmarks (SPECfp) • Results reported by companies using SPEC • Individual benchmark results • A composite integer result (single number) • A composite floating point result (single number) • Throughput results obtained by simultaneously running multiple copies of each individual benchmark • SPEC also has Java, web, and other benchmarks • www.spec.org

The latest version: SPEC CPU2000 • Comprises the SPECint2000 and SPECfp2000 benchmarks • SPECint2000 programs • 164.gzip: Data compression utility • 175.vpr: FPGA circuit placement and routing • 176.gcc: C compiler • 181.mcf: Minimum cost network flow solver • 186.crafty: Chess program • 197.parser: Natural language processing • 252.eon: Ray tracing • 253.perlbmk: Perl • 254.gap: Computational group theory • 255.vortex: Object-oriented database • 256.bzip2: Data compression utility • 300.twolf: Place and route simulator

The latest version: SPEC CPU2000 • SPECfp2000 programs • 168.wupwise: Quantum chromodynamics • 171.swim: Shallow water modeling • 172.mgrid: Multi-grid solver in 3D potential field • 173.applu: Parabolic/elliptic partial differential equations • 177.mesa: 3D graphics library • 178.galgel: Fluid dynamics: analysis of oscillatory instability • 179.art: Neural network simulation: adaptive resonance theory • 183.equake: Finite element simulation: earthquake modeling • 187.facerec: Computer vision: recognizes faces • 188.ammp: Computational chemistry • 189.lucas: Number theory: primality testing • 191.fma3d: Finite-element crash simulation • 200.sixtrack: Particle accelerator model • 301.apsi: Solves problems regarding temperature, wind, distribution of pollutants

Benchmarks and the architect • Computer architects developing a new machine • Want to know how fast their machine will run compared to current offerings • Also want to know the cost/performance benefit when making tradeoffs throughout the design process • Example: if I change the cache size, what will be the relative change in performance? • Standard benchmarks (like SPEC) are used by architects in evaluating design tradeoffs • Real customer applications (like PowerPoint) are also used

Architecting a new machine • Chicken and egg problem: Architects need to compare different design options before having the systems to run the benchmarks on • Solution of long ago: hardware prototyping • Build a system, evaluate it, re-design it, re-evaluate it, … • High level of circuit integration makes this nearly impossible today • Difficult to get at internals of chip • Too costly ($ and time) to re-spin

Simulation and circuit analysis • To evaluate the performance of different design options we need • The number of clock cycles required to run each benchmark for each design option • The clock frequency of each design option • Execution time = number of clock cycles/clock frequency • Clock frequency is evaluated by circuit designers working in conjunction with the architects • The number of clock cycles is determined by the architects via architectural simulation
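The relation Execution time = number of clock cycles/clock frequency can be sketched in a few lines of Python. The two design options and their numbers below are hypothetical, chosen only to show that the option with fewer cycles is not automatically the faster one.

```python
def execution_time_seconds(cycles: int, clock_hz: float) -> float:
    """Execution time = number of clock cycles / clock frequency."""
    if clock_hz <= 0:
        raise ValueError("clock frequency must be positive")
    return cycles / clock_hz


# Hypothetical design options: (cycle count from simulation, clock frequency in Hz).
design_options = {
    "A": (600_000_000, 1.0e9),  # more cycles, faster clock
    "B": (500_000_000, 0.8e9),  # fewer cycles, slower clock
}

for name, (cycles, clock_hz) in design_options.items():
    print(name, execution_time_seconds(cycles, clock_hz), "s")
```

With these made-up numbers, option B saves cycles but still loses overall, which is why both the cycle count and the clock frequency have to be evaluated for every design option.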

Architectural simulation • A model that faithfully mimics the operation of the new computer system is written in an HLL • The model executes machine code and collects performance statistics while it runs • Power dissipation may also be evaluated • Functional correctness is not tested at this level (it is tested later, once the HDL design is completed) • Various design parameters can be changed to allow architects to explore how combinations of design options impact performance

An example: SimpleScalar • An architectural simulator written by Todd Austin (currently a Professor at Michigan) • Written in C • Executes MIPS programs among others • Widely used in the architecture community (especially by academics) • Models a high performance CPU, caches, and main memory • Publicly available at www.simplescalar.com • We use this a LOT in our research here • Check out www.ece.rochester.edu/research/acal

SimpleScalar input file (partial)
    -fetch:ifqsize 16            # instruction fetch queue size (in insts)
    -fetch:mplat 2               # extra branch mis-prediction latency
    -fetch:speed 2               # speed of front-end of machine relative to execution core
    -bpred comb                  # branch predictor type {nottaken|taken|perfect|bimod|2lev|comb}
    -bpred:bimod 4096            # bimodal predictor config (<table size>)
    -bpred:2lev 1 4096 12 1      # 2-level predictor config (<l1size> <l2size> <hist_size> <xor>)
    -bpred:comb 4096             # combining predictor config (<meta_table_size>)
    -bpred:ras 64                # return address stack size (0 for no return stack)
    -bpred:btb 2048 4            # BTB config (<num_sets> <associativity>)
    -bpred:spec_update <null>    # speculative predictors update in {ID|WB} (default non-spec)
    -decode:width 8              # instruction decode B/W (insts/cycle)
    -issue:width 4               # instruction issue B/W (insts/cycle)
    -issue:inorder false         # run pipeline with in-order issue
    -issue:wrongpath true        # issue instructions down wrong execution paths
    -commit:width 4              # instruction commit B/W (insts/cycle)
    -ruu:size 64                 # register update unit (RUU) size
    -lsq:size 16                 # load/store queue (LSQ) size

SimpleScalar input file (partial)
    -cache:dl1 dl1:256:32:2:r      # l1 data cache config, i.e., {<config>|none}
    -cache:dl1lat 1                # l1 data cache hit latency (in cycles)
    -cache:dl2 ul2:32768:32:4:l    # l2 data cache config, i.e., {<config>|none}
    -cache:dl2lat 15               # l2 data cache hit latency (in cycles)
    -cache:il1 il1:512:32:4:r      # l1 inst cache config, i.e., {<config>|dl1|dl2|none}
    -cache:il1lat 1                # l1 instruction cache hit latency (in cycles)
    -cache:il2 dl2                 # l2 instruction cache config, i.e., {<config>|dl2|none}
    -cache:il2lat 15               # l2 instruction cache hit latency (in cycles)
    -cache:flush false             # flush caches on system calls
    -cache:icompress false         # convert 64-bit inst addresses to 32-bit inst equivalents
    -mem:lat 75 2                  # memory access latency (<first_chunk> <inter_chunk>)
    -mem:width 16                  # memory access bus width (in bytes)
    -res:ialu 2                    # total number of integer ALU's available
    -res:imult 2                   # total number of integer multiplier/dividers available
    -res:memport 2                 # total number of memory system ports available (to CPU)
    -res:fpalu 2                   # total number of floating point ALU's available
    -res:fpmult 2                  # total number of floating point multiplier/dividers available
    ETC
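To read the cache entries above, it may help to decode one config string. The sketch below assumes the field order `<name>:<nsets>:<bsize>:<assoc>:<repl>` used by SimpleScalar's cache options (block size in bytes; replacement policy `l` = LRU, `f` = FIFO, `r` = random). The class and function names are illustrative only and are not part of SimpleScalar.

```python
from typing import NamedTuple


class CacheConfig(NamedTuple):
    name: str
    nsets: int        # number of sets
    block_bytes: int  # block (line) size in bytes
    assoc: int        # associativity (ways per set)
    repl: str         # replacement policy code: 'l', 'f', or 'r'


def parse_cache_config(text: str) -> CacheConfig:
    """Split a '<name>:<nsets>:<bsize>:<assoc>:<repl>' string into fields."""
    name, nsets, bsize, assoc, repl = text.split(":")
    return CacheConfig(name, int(nsets), int(bsize), int(assoc), repl)


def capacity_bytes(cfg: CacheConfig) -> int:
    """Total data capacity = sets x block size x associativity."""
    return cfg.nsets * cfg.block_bytes * cfg.assoc


for text in ("dl1:256:32:2:r", "il1:512:32:4:r", "ul2:32768:32:4:l"):
    cfg = parse_cache_config(text)
    print(cfg.name, capacity_bytes(cfg) // 1024, "KB")
```

Under that reading, the listed configuration describes a 16 KB L1 data cache, a 64 KB L1 instruction cache, and a 4 MB unified L2.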

SimpleScalar output file (partial)
    sim_num_insn 400000000           # total number of instructions committed
    sim_num_refs 211494189           # total number of loads and stores committed
    sim_num_loads 152980862          # total number of loads committed
    sim_num_stores 58513327.0000     # total number of stores committed
    sim_num_branches 6017796         # total number of branches committed
    sim_elapsed_time 7735            # total simulation time in seconds
    sim_inst_rate 51712.9929         # simulation speed (in insts/sec)
    sim_total_insn 432323546         # total number of instructions executed
    sim_total_refs 214748856         # total number of loads and stores executed
    sim_total_loads 155672898        # total number of loads executed
    sim_total_stores 59075958.0000   # total number of stores executed
    sim_total_branches 6921320       # total number of branches executed
    sim_cycle 564744596              # total simulation time in cycles
    sim_IPC 0.7083                   # instructions per cycle
    sim_CPI 1.4119                   # cycles per instruction
    sim_exec_BW 0.7655               # total instructions (mis-spec + committed) per cycle
    sim_IPB 66.4695                  # instruction per branch
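The derived statistics in this output can be recomputed from the raw counters, which is a useful sanity check when reading simulator results. The sketch below copies the counters from the listing; the dictionary and variable names are just for illustration.

```python
# Raw counters copied from the SimpleScalar output above.
stats = {
    "sim_num_insn": 400_000_000,    # instructions committed
    "sim_total_insn": 432_323_546,  # instructions executed (incl. mis-speculated)
    "sim_num_branches": 6_017_796,  # branches committed
    "sim_cycle": 564_744_596,       # simulated clock cycles
}

ipc = stats["sim_num_insn"] / stats["sim_cycle"]        # instructions per cycle
cpi = stats["sim_cycle"] / stats["sim_num_insn"]        # cycles per instruction
exec_bw = stats["sim_total_insn"] / stats["sim_cycle"]  # executed insts per cycle
ipb = stats["sim_num_insn"] / stats["sim_num_branches"] # instructions per branch

print(f"IPC={ipc:.4f} CPI={cpi:.4f} exec_BW={exec_bw:.4f} IPB={ipb:.4f}")
```

Note that IPC and CPI are reciprocals, and that sim_exec_BW exceeds sim_IPC because it also counts instructions issued down mis-speculated paths.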

SimpleScalar output file (partial)
    bpred_comb.lookups 7496023         # total number of bpred lookups
    bpred_comb.updates 6017796         # total number of updates
    bpred_comb.addr_hits 5386455       # total number of address-predicted hits
    bpred_comb.dir_hits 5669839        # total number of direction-predicted hits (includes addr-hits)
    bpred_comb.used_bimod 4795297      # total number of bimodal predictions used
    bpred_comb.used_2lev 1222499       # total number of 2-level predictions used
    bpred_comb.misses 347957           # total number of misses
    bpred_comb.jr_hits 106171          # total number of address-predicted hits for JR's
    bpred_comb.jr_seen 496372          # total number of JR's seen
    bpred_comb.bpred_addr_rate 0.8951  # branch address-prediction rate (i.e., addr-hits/updates)
    bpred_comb.bpred_dir_rate 0.9422   # branch direction-prediction rate (i.e., all-hits/updates)
    bpred_comb.bpred_jr_rate 0.2139    # JR address-prediction rate (i.e., JR addr-hits/JRs seen)
    bpred_comb.retstack_pushes 363292  # total number of address pushed onto ret-addr stack
    bpred_comb.retstack_pops 645247    # total number of address popped off of ret-addr stack
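The branch predictor rates follow directly from the counters, using the ratios given in the output's own comments. A short sketch with the numbers copied from the listing (the variable names are illustrative):

```python
# Counters copied from the bpred_comb section of the output above.
updates = 6_017_796
addr_hits = 5_386_455
dir_hits = 5_669_839
used_bimod = 4_795_297
used_2lev = 1_222_499
misses = 347_957
jr_hits = 106_171
jr_seen = 496_372

addr_rate = addr_hits / updates  # address-prediction rate
dir_rate = dir_hits / updates    # direction-prediction rate
jr_rate = jr_hits / jr_seen      # JR address-prediction rate

print(f"addr={addr_rate:.4f} dir={dir_rate:.4f} jr={jr_rate:.4f}")
# In this run, every update is either a direction hit or a miss, and every
# update used exactly one of the two component predictors.
print(updates - dir_hits == misses, used_bimod + used_2lev == updates)
```

The low JR rate alongside a high direction rate suggests that, in this run, indirect jumps are much harder to predict than ordinary conditional branches.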

SimpleScalar output file (partial)
    il1.accesses 446551244.0000  # total number of accesses
    il1.hits 415761995           # total number of hits
    il1.misses 30789249          # total number of misses
    il1.replacements 30787201    # total number of replacements
    il1.writebacks 0             # total number of writebacks
    il1.invalidations 0          # total number of invalidations
    il1.miss_rate 0.0689         # miss rate (i.e., misses/ref)
    il1.repl_rate 0.0689         # replacement rate (i.e., repls/ref)
    il1.wb_rate 0.0000           # writeback rate (i.e., wrbks/ref)
    il1.inv_rate 0.0000          # invalidation rate (i.e., invs/ref)
    dl1.accesses 207813682.0000  # total number of accesses
    dl1.hits 204118070           # total number of hits
    dl1.misses 3695612           # total number of misses
    dl1.replacements 3695100     # total number of replacements
    dl1.writebacks 1707742       # total number of writebacks
    dl1.invalidations 0          # total number of invalidations
    dl1.miss_rate 0.0178         # miss rate (i.e., misses/ref)
    ETC
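The cache miss rates can be checked the same way: miss rate = misses/accesses, and hits plus misses should account for every access. A minimal sketch using the counters from the listing (the helper name `miss_rate` is illustrative):

```python
def miss_rate(misses: int, accesses: int) -> float:
    """Miss rate = misses / total accesses."""
    return misses / accesses


# Counters copied from the il1 and dl1 sections of the output above.
il1 = {"accesses": 446_551_244, "hits": 415_761_995, "misses": 30_789_249}
dl1 = {"accesses": 207_813_682, "hits": 204_118_070, "misses": 3_695_612}

for name, cache in (("il1", il1), ("dl1", dl1)):
    rate = miss_rate(cache["misses"], cache["accesses"])
    consistent = cache["hits"] + cache["misses"] == cache["accesses"]
    print(f"{name}: miss rate {rate:.4f}, counters consistent: {consistent}")
```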

Performance metrics • Execution time • Also known as wall clock time, elapsed time, response time • Total time to complete a task • Example: hit RETURN, how long until the answer appears on the screen • Throughput • Also known as bandwidth • Total number of operations (such as instructions, memory requests, programs) completed per unit time (rate) • Performance improves when execution time is reduced or throughput is increased

Breaking down execution time • Breaking execution time into components allows designers to focus on particular machine levels • I/O operations are often overlapped with the execution of another task on the CPU • I/O system may be designed almost independently of rest of system • CPU time ignores the I/O component of execution time
[Figure: system block diagram showing the Central Processing Unit (fetching instructions and operands), Level1 Instruction Cache, Level1 Data Cache, Level2 Cache, Interconnect, Main Memory, and Input/Output (disk, keyboard/mouse, network, etc.)]

CPU Time as a performance metric • The time spent by the CPU, caches, and main memory in executing the workload • Ignores any idle time due to I/O activity • Two components • User CPU Time: the time spent by user programs • System CPU Time: the time spent by the operating system (OS) • Often only User CPU Time is evaluated… • Many standard benchmarks like SPEC have little OS activity • Many architectural simulators do not support the OS • There are exceptions, such as SimOS from Stanford • OS code is often not available to evaluate

CPU Time breakdown • CPU Time = CYCLES x CT = INST x CPI x CT • CYCLES • Total cycles to execute the program • CT • Clock cycle time (clock period) • 1/clock frequency • INST • Total number of assembly instructions executed • CPI • Average number of clock cycles executed per instruction • Total clock cycles/total instructions executed (CYCLES/INST) • Different instruction types (add, divide, etc.) may take different numbers of clock cycles to execute
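The breakdown CPU Time = CYCLES x CT = INST x CPI x CT, with CPI as an average over instruction types, can be sketched as follows. The instruction mix and the 0.5 ns clock period are hypothetical values picked only to make the arithmetic easy to follow.

```python
def average_cpi(mix: dict) -> float:
    """CPI = total clock cycles / total instructions executed.

    `mix` maps an instruction type to (count, cycles per instruction).
    """
    total_inst = sum(count for count, _ in mix.values())
    total_cycles = sum(count * cycles for count, cycles in mix.values())
    return total_cycles / total_inst


def cpu_time(inst: int, cpi: float, ct_seconds: float) -> float:
    """CPU Time = INST x CPI x CT."""
    return inst * cpi * ct_seconds


# Hypothetical instruction mix: type -> (count, cycles per instruction).
mix = {"alu": (500, 1), "load": (300, 2), "branch": (200, 3)}
inst = sum(count for count, _ in mix.values())
cpi = average_cpi(mix)
ct = 0.5e-9  # assumed clock period of 0.5 ns (2 GHz clock)

print(inst, cpi, cpu_time(inst, cpi, ct))
```

Here the 1000 instructions take 1700 cycles, so CPI is 1.7 even though half of the instructions complete in a single cycle.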

CPU Time example • CYCLES = 6 • INST = 4 • CPI = 6/4 = 1.5 • CT = 1 ns • CPU Time = CYCLES x CT = 6 x 1 ns = 6 ns
    lw  $4, 0($2)      2 cycles
    lw  $6, 4($2)      2 cycles
    add $4, $4, $6     1 cycle
    sw  $4, 0($2)      1 cycle
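The same numbers can be derived mechanically from the instruction sequence and its per-instruction cycle counts. This is only a bookkeeping sketch; the `trace` list simply restates the slide's four instructions.

```python
# (instruction, cycles) pairs taken from the example above.
trace = [
    ("lw $4, 0($2)", 2),
    ("lw $6, 4($2)", 2),
    ("add $4, $4, $6", 1),
    ("sw $4, 0($2)", 1),
]
CT_NS = 1.0  # clock cycle time in nanoseconds

cycles = sum(c for _, c in trace)  # CYCLES
inst = len(trace)                  # INST
cpi = cycles / inst                # CPI = CYCLES / INST
cpu_time_ns = cycles * CT_NS       # CPU Time = CYCLES x CT

print(cycles, inst, cpi, cpu_time_ns)
```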

What parts of CPU Time (INST, CPI, CT)… • Are influenced by the ISA designer? • Are influenced by the compiler writer? • Are influenced by the microarchitect?

What parts of CPU Time can be ignored if • The programs are already compiled and you are designing the microarchitecture? • The ISA and microarchitecture are fixed and you are developing a compiler? • You are comparing two machines that have different ISAs?

Latency • Latency = number of clock cycles required to do something • Access cache, execute a particular instruction, etc. • Alternate definition: amount of time (ns) to do something • Designers may increase latency in order to decrease CT • Why might the higher latency option increase the total delay? • Why might the higher latency option perform better?
[Figure: one logic block with a 1 ns delay (latency = 1 cycle, CT = 1 ns) versus the same logic split into two stages of 0.6 ns and 0.55 ns (latency = 2 cycles, CT = 0.6 ns)]
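One way to approach the two questions is to put numbers on them. The total delay through the logic is latency x CT, so the two-cycle option takes 2 x 0.6 ns = 1.2 ns instead of 1 ns, since every stage occupies a full clock period. The second half of the sketch is a hypothetical program (1000 other cycles plus 100 uses of this logic), and it assumes the rest of the machine can also run at the shorter CT with unchanged cycle counts.

```python
def total_delay_ns(latency_cycles: int, ct_ns: float) -> float:
    """Time for one operation = latency in cycles x clock cycle time."""
    return latency_cycles * ct_ns


def program_time_ns(other_cycles: int, unit_uses: int,
                    latency_cycles: int, ct_ns: float) -> float:
    """Hypothetical program: cycles spent elsewhere plus serialized uses of the unit."""
    return (other_cycles + unit_uses * latency_cycles) * ct_ns


delay_one_cycle = total_delay_ns(1, 1.0)  # latency = 1 cycle, CT = 1 ns
delay_two_cycle = total_delay_ns(2, 0.6)  # latency = 2 cycles, CT = 0.6 ns

# Assumed workload: 1000 other cycles, 100 uses of the logic in question.
time_one_cycle = program_time_ns(1000, 100, 1, 1.0)
time_two_cycle = program_time_ns(1000, 100, 2, 0.6)

print(delay_one_cycle, delay_two_cycle)
print(time_one_cycle, time_two_cycle)
```

Under these assumptions the single operation gets slower, but the program as a whole gets faster because every other cycle also shrinks from 1 ns to 0.6 ns.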

The CYCLES-CT tradeoff • A feature that improves either CYCLES or CT very often worsens the other • Examples • Increasing cache size to reduce CYCLES at the expense of CT • CYCLES is reduced because slow main memory is accessed less often • Larger cache operates at a slower speed, may have to increase CT • Increasing machine parallelism to reduce CYCLES at the expense of CT • Downsides?
[Figure: a single multiplier (1 multiply at a time) versus two multipliers (2 multiplies at a time)]

The INST-CPI tradeoff • In creating an assembly equivalent to an HLL program, the compiler writer may have several choices that differ in INST and CPI • Best solution may involve more, simpler instructions • Example: multiply by constant 5
    Option 1 (1 instruction, 4 cycles):
    muli $2, $4, 5     4 cycles
    Option 2 (2 instructions, 2 cycles):
    sll  $2, $4, 2     1 cycle
    add  $2, $2, $4    1 cycle
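The multiply-by-5 example can be checked both for correctness and for cost. The sketch below mirrors the shift-and-add sequence in Python and tabulates INST, CYCLES, and CPI for the two options using the cycle counts from the slide; the function and dictionary names are illustrative.

```python
def times5_shift_add(x: int) -> int:
    """Mirror of 'sll $2, $4, 2' followed by 'add $2, $2, $4': (x << 2) + x."""
    return (x << 2) + x


# Cycle counts per instruction for each option, taken from the example above.
options = {
    "muli": [4],        # one multiply-immediate instruction
    "sll+add": [1, 1],  # shift left by 2, then add the original value
}

for name, cycle_counts in options.items():
    inst = len(cycle_counts)
    cycles = sum(cycle_counts)
    print(f"{name}: INST={inst} CYCLES={cycles} CPI={cycles / inst}")
```

The shift-and-add version doubles INST but cuts CPI from 4 to 1, so total cycles drop from 4 to 2.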

Summarizing performance results • Useful to generate a single performance number from multiple benchmark results • For execution time (and its derivatives) • Total the execution time of all the n programs • Can also use the Arithmetic Mean \( AM = \frac{1}{n} \sum_{i=1}^{n} Time_i \), where Time_i is the execution time of the ith program • The Weighted AM assigns weights to each program: \( WAM = \sum_{i=1}^{n} Weight_i \times Time_i \), where Weight_i is the weight assigned to the ith program • All weights add to 1 • Weighting example: equalize all
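A minimal sketch of the arithmetic mean and the weighted arithmetic mean of execution times follows. The three execution times and the weights are hypothetical, and the function names are illustrative.

```python
import math


def arithmetic_mean(times: list) -> float:
    """AM = (1/n) * sum of the n execution times."""
    return sum(times) / len(times)


def weighted_arithmetic_mean(times: list, weights: list) -> float:
    """Weighted AM = sum of Weight_i * Time_i, with the weights adding to 1."""
    if len(times) != len(weights):
        raise ValueError("need one weight per program")
    if not math.isclose(sum(weights), 1.0):
        raise ValueError("weights must add to 1")
    return sum(w * t for w, t in zip(weights, times))


# Hypothetical execution times (seconds) of three benchmark programs.
times = [2.0, 4.0, 6.0]
weights = [0.5, 0.25, 0.25]  # assumed: the first program runs most often

print(arithmetic_mean(times))
print(weighted_arithmetic_mean(times, weights))
```

With equal weights of 1/n the weighted mean reduces to the plain arithmetic mean.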

Summarizing performance results • For performance as a rate, e.g., instructions/sec • Use the Harmonic Mean \( HM = \frac{n}{\sum_{i=1}^{n} \frac{1}{Rate_i}} \), where Rate_i is the rate of the ith program • Also have Weighted HM
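The harmonic mean of rates can be sketched as below, together with a weighted form written as 1 / sum(Weight_i / Rate_i) with weights adding to 1. The two rates are hypothetical; the comparison at the end shows why the arithmetic mean of rates would overstate performance when each program executes the same number of instructions.

```python
def harmonic_mean(rates: list) -> float:
    """HM = n / sum(1 / Rate_i)."""
    return len(rates) / sum(1.0 / r for r in rates)


def weighted_harmonic_mean(rates: list, weights: list) -> float:
    """Weighted HM = 1 / sum(Weight_i / Rate_i), with the weights adding to 1."""
    return 1.0 / sum(w / r for w, r in zip(weights, rates))


# Hypothetical rates (million instructions/sec) for two programs.
rates = [100.0, 200.0]
hm = harmonic_mean(rates)

# Cross-check: if each program executes 1000 million instructions,
# the overall rate is total instructions / total time.
total_time = sum(1000.0 / r for r in rates)
overall_rate = 2 * 1000.0 / total_time

print(hm, overall_rate, sum(rates) / len(rates))
```

The harmonic mean matches the true overall rate, while the arithmetic mean of the two rates (150) does not.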

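Amdahl’s Law (very famous) • The law of diminishing returns • Performance improvement of an enhancement is limited by the fraction of time the enhancement is used • \( \text{execution\_time}_{\text{new}} = \text{execution\_time}_{\text{old}} \times \left[ (1 - \text{fraction}_{\text{enhanced}}) + \frac{\text{fraction}_{\text{enhanced}}}{\text{speedup}_{\text{enhanced}}} \right] \) • where execution_time_old is the execution time without the enhancement, fraction_enhanced is the fraction of the time (NOT the instructions) that can take advantage of the enhancement, and speedup_enhanced is the speedup obtained when using the enhancement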
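The diminishing returns are easy to see numerically. The sketch below uses the usual form of Amdahl's Law, new time / old time = (1 - fraction_enhanced) + fraction_enhanced / speedup_enhanced, with a hypothetical fraction of 0.5; the function names are illustrative.

```python
def time_ratio(fraction_enhanced: float, speedup_enhanced: float) -> float:
    """execution_time_new / execution_time_old under Amdahl's Law."""
    if not 0.0 <= fraction_enhanced <= 1.0:
        raise ValueError("fraction_enhanced must be between 0 and 1")
    if speedup_enhanced <= 0:
        raise ValueError("speedup_enhanced must be positive")
    return (1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced


def overall_speedup(fraction_enhanced: float, speedup_enhanced: float) -> float:
    """Overall speedup = execution_time_old / execution_time_new."""
    return 1.0 / time_ratio(fraction_enhanced, speedup_enhanced)


# Hypothetical case: the enhancement applies to half of the execution time.
for s in (2, 10, 1000):
    print(s, round(overall_speedup(0.5, s), 3))
```

However large speedup_enhanced becomes, the overall speedup in this case stays below 1 / (1 - 0.5) = 2, because the unenhanced half of the time is untouched.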

Amdahl’s Law example • Assume multiply operations constitute 20% of the execution time of a benchmark • What is the execution time improvement for a new multiplier that provides a 10 times speedup over the existing multiplier?
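One way to work this example, using the usual form of Amdahl's Law with the fraction of time enhanced equal to 0.2 and the speedup of the enhanced part equal to 10:

```latex
\begin{align*}
\text{execution\_time}_{\text{new}}
  &= \text{execution\_time}_{\text{old}} \times \left[ (1 - 0.2) + \frac{0.2}{10} \right] \\
  &= \text{execution\_time}_{\text{old}} \times (0.8 + 0.02)
   = 0.82 \times \text{execution\_time}_{\text{old}} \\[4pt]
\text{speedup}_{\text{overall}}
  &= \frac{\text{execution\_time}_{\text{old}}}{\text{execution\_time}_{\text{new}}}
   = \frac{1}{0.82} \approx 1.22
\end{align*}
```

So a multiplier that is 10 times faster reduces execution time by only 18%, since the other 80% of the time is unaffected.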

Questions?
