Lecture 2: Review of Performance/Cost/Power Metrics and Architectural Basics

Lecture 2: Review of Performance/Cost/Power Metrics and Architectural Basics Prof. Jan M. Rabaey Computer Science 252 Spring 2000 “Computer Architecture in Cory Hall”

Review Lecture 1 • Class Organization • Class Projects • Trends in the Industry and Driving Forces

Computer Architecture Topics • Input/Output and Storage: Disks, WORM, Tape; RAID • Emerging Technologies; Interleaving; Bus protocols • DRAM • Memory Hierarchy: Coherence, Bandwidth, Latency; L2 Cache; L1 Cache • Addressing, Protection, Exception Handling • VLSI • Instruction Set Architecture • Pipelining and Instruction Level Parallelism: Pipelining, Hazard Resolution, Superscalar, Reordering, Prediction, Speculation, Vector, VLIW, DSP, Reconfiguration

Computer Architecture Topics • Multiprocessors: Shared Memory, Message Passing, Data Parallelism • Networks and Interconnections: Network Interfaces, Interconnection Network, Processor-Memory-Switch • Topologies, Routing, Bandwidth, Latency, Reliability • (figure: processor (P) and memory (M) nodes attached through switches (S) to an interconnection network)

The Secret of Architecture Design: Measurement and Evaluation • Architecture Design is an iterative process: • Searching the space of possible designs • At all levels of computer systems • (figure: Creativity feeds Cost / Performance Analysis, which sorts Good Ideas, Mediocre Ideas, and Bad Ideas)

Computer Engineering Methodology • Analysis: Evaluate Existing Systems for Bottlenecks (input: Benchmarks) • Design: Simulate New Designs and Organizations (inputs: Technology Trends, Workloads) • Implementation: Implement Next Generation System (input: Implementation Complexity)

Measurement Tools • Hardware: Cost, delay, area, power estimation • Benchmarks, Traces, Mixes • Simulation (many levels) • ISA, RT, Gate, Circuit • Queuing Theory • Rules of Thumb • Fundamental “Laws”/Principles

Review: Performance, Cost, Power

Metric 1: Performance
Plane        DC to Paris   Speed      Passengers   Throughput (passenger-mile/hour)
Boeing 747   6.5 hours     610 mph    470          286,700
Concorde     3 hours       1350 mph   132          178,200
• Time to run the task • Execution time, response time, latency • Tasks per day, hour, week, sec, ns … • Throughput, bandwidth

The Performance Metric • "X is n times faster than Y" means $\frac{\text{ExTime}(Y)}{\text{ExTime}(X)} = \frac{\text{Performance}(X)}{\text{Performance}(Y)} = n$ • Speed of Concorde vs. Boeing 747 • Throughput of Boeing 747 vs. Concorde
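As a quick check of the two comparisons on this slide, the sketch below recomputes them from the DC-to-Paris figures. The function and variable names are illustrative only and do not come from the lecture.

```python
# Figures copied from the DC-to-Paris slide.
planes = {
    "Boeing 747": {"hours": 6.5, "mph": 610, "passengers": 470},
    "Concorde": {"hours": 3.0, "mph": 1350, "passengers": 132},
}

def throughput(plane):
    """Passenger-miles per hour: speed times passengers."""
    return plane["mph"] * plane["passengers"]

def times_faster(extime_y, extime_x):
    """n such that X is n times faster than Y: ExTime(Y) / ExTime(X)."""
    return extime_y / extime_x

b747, concorde = planes["Boeing 747"], planes["Concorde"]

# Latency view: the Concorde wins on trip time.
latency_ratio = times_faster(b747["hours"], concorde["hours"])
# Throughput view: the Boeing 747 wins on passenger-miles per hour.
throughput_ratio = throughput(b747) / throughput(concorde)

print(f"Concorde is {latency_ratio:.2f}x faster in trip time")
print(f"Boeing 747 has {throughput_ratio:.2f}x the throughput")
```

Which plane is "faster" depends on whether the metric is latency or throughput, which is the point of the slide.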

Amdahl's Law • Speedup due to enhancement E: $\text{Speedup}(E) = \frac{\text{ExTime w/o E}}{\text{ExTime w/ E}} = \frac{\text{Performance w/ E}}{\text{Performance w/o E}}$ • Suppose that enhancement E accelerates a fraction F of the task by a factor S, and the remainder of the task is unaffected

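Amdahl’s Law • $\text{ExTime}_{new} = \text{ExTime}_{old} \times \left[ (1 - \text{Fraction}_{enhanced}) + \frac{\text{Fraction}_{enhanced}}{\text{Speedup}_{enhanced}} \right]$ • $\text{Speedup}_{overall} = \frac{\text{ExTime}_{old}}{\text{ExTime}_{new}} = \frac{1}{(1 - \text{Fraction}_{enhanced}) + \frac{\text{Fraction}_{enhanced}}{\text{Speedup}_{enhanced}}}$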

Amdahl’s Law • Floating point instructions improved to run 2X; but only 10% of actual instructions are FP • $\text{ExTime}_{new} = \text{ExTime}_{old} \times (0.9 + 0.1/2) = 0.95 \times \text{ExTime}_{old}$ • $\text{Speedup}_{overall} = \frac{1}{0.95} = 1.053$ • Law of diminishing returns: • Focus on the common case!
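The floating-point example can be reproduced with a small helper. This is a minimal sketch of Amdahl's Law as stated on the slides; the function name `amdahl_speedup` is an assumption made for illustration.

```python
def amdahl_speedup(fraction_enhanced, speedup_enhanced):
    """Overall speedup when a fraction of the task is sped up by a given factor."""
    if not 0.0 <= fraction_enhanced <= 1.0:
        raise ValueError("fraction_enhanced must lie in [0, 1]")
    if speedup_enhanced <= 0:
        raise ValueError("speedup_enhanced must be positive")
    return 1.0 / ((1.0 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)

# Slide example: FP runs 2x faster, but only 10% of instructions are FP.
fp_case = amdahl_speedup(0.10, 2.0)

# Diminishing returns: even a near-infinite FP speedup is capped at 1 / 0.9.
fp_limit = amdahl_speedup(0.10, 1e12)

print(f"2x FP speedup  -> overall {fp_case:.3f}")
print(f"huge FP speedup -> overall {fp_limit:.3f}")
```

The cap of 1/(1 - F) is why the slide advises focusing on the common case: the untouched 90% dominates no matter how fast the FP unit becomes.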

Metrics of Performance (figure: levels of the system, each paired with its usual metric) • Application: Answers per month, Operations per second • Programming Language, Compiler, ISA: (millions) of Instructions per second: MIPS; (millions) of (FP) operations per second: MFLOP/s • Datapath, Control: Megabytes per second • Function Units, Transistors, Wires, Pins: Cycles per second (clock rate)

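Aspects of CPU Performance
$\text{CPU time} = \frac{\text{Seconds}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Cycle}}$
               Inst Count   CPI   Clock Rate
Program        X
Compiler       X            (X)
Inst. Set.     X            X
Organization                X     X
Technology                        X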

Cycles Per Instruction • “Average Cycles per Instruction”: CPI = Cycles / Instruction Count = (CPU Time * Clock Rate) / Instruction Count • $\text{CPU time} = \text{CycleTime} \times \sum_{i=1}^{n} \text{CPI}_i \times I_i$ • “Instruction Frequency”: $\text{CPI} = \sum_{i=1}^{n} \text{CPI}_i \times F_i$ where $F_i = \frac{I_i}{\text{Instruction Count}}$ • Invest Resources where time is Spent!

Example: Calculating CPI • Base Machine (Reg / Reg), Typical Mix
Op       Freq   CPIi   CPIi*Fi   (% Time)
ALU      50%    1      .5        (33%)
Load     20%    2      .4        (27%)
Store    10%    2      .2        (13%)
Branch   20%    2      .4        (27%)
Total                  1.5
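The CPI table can be recomputed directly from the instruction mix. The sketch below uses the frequencies and per-class CPIs from the slide; the instruction count and clock rate in the final CPU-time line are hypothetical values chosen only to exercise the formula.

```python
# Instruction mix from the slide: op -> (frequency, cycles per instruction).
MIX = {
    "ALU": (0.50, 1),
    "Load": (0.20, 2),
    "Store": (0.10, 2),
    "Branch": (0.20, 2),
}

def average_cpi(mix):
    """CPI = sum over instruction classes of CPI_i * F_i."""
    return sum(freq * cpi for freq, cpi in mix.values())

def time_share(mix):
    """Fraction of execution time spent in each instruction class."""
    total = average_cpi(mix)
    return {op: freq * cpi / total for op, (freq, cpi) in mix.items()}

def cpu_time(instruction_count, cpi, clock_rate_hz):
    """CPU time = instructions * cycles/instruction * seconds/cycle."""
    return instruction_count * cpi / clock_rate_hz

cpi = average_cpi(MIX)
shares = time_share(MIX)
for op, share in shares.items():
    print(f"{op:7s} {share:.0%} of time")

# Hypothetical workload: 1e9 instructions on a 500 MHz clock (not from the slide).
seconds = cpu_time(1e9, cpi, 500e6)
print(f"CPI = {cpi:.2f}, CPU time = {seconds:.2f} s")
```

Although ALU operations are half of the instructions, they account for only a third of the time, which is what "invest resources where time is spent" refers to.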

Creating Benchmark Sets • Real programs • Kernels • Toy benchmarks • Synthetic benchmarks • e.g. Whetstones and Dhrystones

SPEC: System Performance Evaluation Cooperative • First Round 1989 • 10 programs yielding a single number (“SPECmarks”) • Second Round 1992 • SPECint92 (6 integer programs) and SPECfp92 (14 floating point programs) • Compiler flags unlimited, e.g., the March 93 flags for the DEC 4000 Model 610: spice: unix.c:/def=(sysv,has_bcopy,”bcopy(a,b,c)= memcpy(b,a,c)” wave5: /ali=(all,dcom=nat)/ag=a/ur=4/ur=200 nasa7: /norecu/ag=a/ur=4/ur2=200/lc=blas • Third Round 1995 • new set of programs: SPECint95 (8 integer programs) and SPECfp95 (10 floating point) • “benchmarks useful for 3 years” • Single flag setting for all programs: SPECint_base95, SPECfp_base95

How to Summarize Performance • Arithmetic mean (weighted arithmetic mean) tracks execution time: $\sum(T_i)/n$ or $\sum(W_i \times T_i)$ • Harmonic mean (weighted harmonic mean) of rates (e.g., MFLOPS) tracks execution time: $n/\sum(1/R_i)$ or $n/\sum(W_i/R_i)$ • Normalized execution time is handy for scaling performance (e.g., X times faster than SPARCstation 10) • Arithmetic mean impacted by choice of reference machine • Use the geometric mean for comparison: $(\prod T_i)^{1/n}$ • Independent of chosen machine • but not good metric for total execution time
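The claim about reference machines is easy to see numerically. The sketch below compares two machines, B and C, using execution times normalized to two different reference machines. The three machines and their times are hypothetical, not data from the lecture, and the helper names are illustrative.

```python
import math

def arithmetic_mean(values):
    return sum(values) / len(values)

def harmonic_mean(rates):
    """Appropriate for rates such as MFLOPS: n / sum(1 / R_i)."""
    return len(rates) / sum(1.0 / r for r in rates)

def geometric_mean(values):
    """n-th root of the product, computed in log space for stability."""
    return math.exp(sum(math.log(v) for v in values) / len(values))

# Hypothetical execution times in seconds for two programs on three machines.
TIMES = {"A": [1.0, 1000.0], "B": [10.0, 100.0], "C": [20.0, 20.0]}

def normalized(machine, reference):
    """Per-program execution time relative to the reference machine."""
    return [t / r for t, r in zip(TIMES[machine], TIMES[reference])]

def b_over_c(mean, reference):
    """Ratio of B's summary to C's summary under a given mean and reference."""
    return mean(normalized("B", reference)) / mean(normalized("C", reference))

for reference in ("A", "B"):
    arith = b_over_c(arithmetic_mean, reference)
    geo = b_over_c(geometric_mean, reference)
    print(f"reference {reference}: arithmetic B/C = {arith:.3f}, geometric B/C = {geo:.3f}")
```

The arithmetic mean of normalized times changes its verdict on B versus C when the reference changes, while the geometric mean gives the same ratio either way. Neither normalized summary tells you the total execution time, which is the caveat in the last bullet.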

SPEC First Round • One program: 99% of time in single line of code • New front-end compiler could improve dramatically IBM Powerstation 550 for 2 different compilers

Impact of Means on SPECmark89 for IBM 550 (without and with special compiler option)
             Ratio to VAX      Time              Weighted Time
Program      Before   After    Before   After    Before   After
gcc          30       29       49       51       8.91     9.22
espresso     35       34       65       67       7.64     7.86
spice        47       47       510      510      5.69     5.69
doduc        46       49       41       38       5.81     5.45
nasa7        78       144      258      140      3.43     1.86
li           34       34       183      183      7.86     7.86
eqntott      40       40       28       28       6.68     6.68
matrix300    78       730      58       6        3.43     0.37
fpppp        90       87       34       35       2.97     3.07
tomcatv      33       138      20       19       2.01     1.94
Mean         54       72       124      108      54.42    49.99
Mean type    Geometric         Arithmetic        Weighted Arith.
Ratio        1.33              1.16              1.09

Performance Evaluation • “For better or worse, benchmarks shape a field” • Good products are created when you have: • Good benchmarks • Good ways to summarize performance • Since sales are in part a function of performance relative to the competition, companies invest in improving the product as reported by the performance summary • If the benchmarks/summary are inadequate, vendors must choose between improving the product for real programs and improving it to get more sales; sales almost always wins! • Execution time is the measure of computer performance!

Integrated Circuits Costs • Die Cost goes roughly with $(\text{die area})^4$

Real World Examples
Chip          Metal layers   Line width (microns)   Wafer cost   Defects/cm2   Area (mm2)   Dies/wafer   Yield   Die cost
386DX         2              0.90                   $900         1.0           43           360          71%     $4
486DX2        3              0.80                   $1200        1.0           81           181          54%     $12
PowerPC 601   4              0.80                   $1700        1.3           121          115          28%     $53
HP PA 7100    3              0.80                   $1300        1.0           196          66           27%     $73
DEC Alpha     3              0.70                   $1500        1.2           234          53           19%     $149
SuperSPARC    3              0.70                   $1700        1.6           256          48           13%     $272
Pentium       3              0.80                   $1500        1.5           296          40           9%      $417
• From “Estimating IC Manufacturing Costs,” by Linley Gwennap, Microprocessor Report, August 2, 1993, p. 15
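The die cost column appears to follow from the other columns. The sketch below assumes the usual relation, die cost = wafer cost / (dies per wafer x die yield), which is not written out on the slide, and it ignores testing and packaging. The rows are copied from the table; the names are illustrative.

```python
# (chip, wafer cost in $, dies per wafer, die yield, die cost in $ as listed)
CHIPS = [
    ("386DX", 900, 360, 0.71, 4),
    ("486DX2", 1200, 181, 0.54, 12),
    ("PowerPC 601", 1700, 115, 0.28, 53),
    ("HP PA 7100", 1300, 66, 0.27, 73),
    ("DEC Alpha", 1500, 53, 0.19, 149),
    ("SuperSPARC", 1700, 48, 0.13, 272),
    ("Pentium", 1500, 40, 0.09, 417),
]

def die_cost(wafer_cost, dies_per_wafer, die_yield):
    """Cost per good die: the wafer cost spread over the dies that work."""
    return wafer_cost / (dies_per_wafer * die_yield)

for name, wafer_cost, dies, die_yield, listed in CHIPS:
    computed = die_cost(wafer_cost, dies, die_yield)
    print(f"{name:12s} computed ${computed:7.2f}   listed ${listed}")
```

Larger dies lose twice: fewer fit on a wafer and a smaller share of them are defect-free, which is why cost climbs so steeply with area.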

Cost/Performance: What is Relationship of Cost to Price? • Recurring Costs • Component Costs • Direct Costs (add 25% to 40%): recurring costs such as labor, purchasing, scrap, warranty • Non-Recurring Costs or Gross Margin (add 82% to 186%): R&D, equipment maintenance, rental, marketing, sales, financing cost, pretax profits, taxes • Average Discount to get List Price (add 33% to 66%): volume discounts and/or retailer markup • (figure: share of List Price: Average Discount 25% to 40%, Gross Margin 34% to 39%, Direct Cost 6% to 8%, Component Cost 15% to 33%; Avg. Selling Price is List Price less Average Discount)
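The markups on this slide compound, and chaining them reproduces the percentage ranges quoted for the list-price breakdown. The sketch below is one reading of the slide, assuming each "add x%" applies to the running total from the previous step; the $100 component cost and the function name are hypothetical.

```python
def price_breakdown(component_cost, direct_markup, margin_markup, discount_markup):
    """Apply each markup to the running total and report shares of list price."""
    after_direct = component_cost * (1 + direct_markup)
    avg_selling_price = after_direct * (1 + margin_markup)
    list_price = avg_selling_price * (1 + discount_markup)
    return {
        "list_price": list_price,
        "avg_selling_price": avg_selling_price,
        "component_share": component_cost / list_price,
        "direct_share": (after_direct - component_cost) / list_price,
        "margin_share": (avg_selling_price - after_direct) / list_price,
        "discount_share": (list_price - avg_selling_price) / list_price,
    }

# Low end and high end of the markup ranges given on the slide.
low = price_breakdown(100.0, 0.25, 0.82, 0.33)
high = price_breakdown(100.0, 0.40, 1.86, 0.66)

for label, result in (("low markups", low), ("high markups", high)):
    print(f"{label}: list price ${result['list_price']:.2f}, "
          f"components {result['component_share']:.0%}, "
          f"direct {result['direct_share']:.0%}, "
          f"margin {result['margin_share']:.0%}, "
          f"discount {result['discount_share']:.0%}")
```

Under this reading, components make up roughly a third of the list price at the low end of the markups and about 15% at the high end.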

Chip Prices (August 1993)
Chip          Area (mm2)   Mfg. cost   Price   Multiplier   Comment
386DX         43           $9          $31     3.4          Intense Competition
486DX2        81           $35         $245    7.0          No Competition
PowerPC 601   121          $77         $280    3.6
DEC Alpha     234          $202        $1231   6.1          Recoup R&D?
Pentium       296          $473        $965    2.0          Early in shipments
• Assume purchase of 10,000 units

Summary: Price vs. Cost

Power/Energy Source: Intel • Lead processor power increases every generation • Compactions provide higher performance at lower power

Energy/Power • Power dissipation: rate at which energy is taken from the supply (power source) and transformed into heat: P = E/t • Energy dissipation for a given instruction depends upon type of instruction (and state of the processor) • $P = \frac{1}{\text{CPU Time}} \times \sum_{i=1}^{n} E_i \times I_i$
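The power formula mirrors the CPI formula: sum the energy of each instruction class, then divide by the run time. The sketch below illustrates this with made-up numbers. The instruction counts, per-instruction energies, and CPU time are all hypothetical, and the model ignores the dependence on processor state that the slide mentions.

```python
def total_energy(counts, energy_per_instruction):
    """E = sum over instruction classes of E_i * I_i, in joules."""
    return sum(energy_per_instruction[op] * count for op, count in counts.items())

def average_power(counts, energy_per_instruction, cpu_time_seconds):
    """P = E / t, in watts."""
    return total_energy(counts, energy_per_instruction) / cpu_time_seconds

# Hypothetical workload: instruction counts per class.
COUNTS = {"ALU": 5e8, "Load": 2e8, "Store": 1e8, "Branch": 2e8}
# Hypothetical energy per instruction, in joules (nanojoule scale).
ENERGY = {"ALU": 1.0e-9, "Load": 2.0e-9, "Store": 2.0e-9, "Branch": 1.5e-9}
CPU_TIME = 3.0  # seconds, hypothetical

energy = total_energy(COUNTS, ENERGY)
power = average_power(COUNTS, ENERGY, CPU_TIME)
print(f"energy = {energy:.2f} J, average power = {power:.3f} W")
```

Running the same program faster leaves the energy roughly unchanged in this model but raises the average power, which is why energy and power are tracked as separate metrics.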

The Energy-Flexibility Gap (figure: Energy Efficiency in MOPS/mW (or MIPS/mW), log scale from 0.1 to 1000, versus Flexibility (Coverage)) • Dedicated HW • Reconfigurable Processor/Logic: Pleiades, 10-80 MOPS/mW • ASIPs, DSPs: 2 V DSP, 3 MOPS/mW • Embedded Processors: SA110, 0.4 MIPS/mW

Summary, #1 • Designing to Last through Trends
        Capacity        Speed
Logic   2x in 3 years   2x in 3 years
SPEC RATING: 2x in 1.5 years
DRAM    4x in 3 years   2x in 10 years
Disk    4x in 3 years   2x in 10 years
• 6yrs to graduate => 16X CPU speed, DRAM/Disk size • Time to run the task • Execution time, response time, latency • Tasks per day, hour, week, sec, ns, … • Throughput, bandwidth • “X is n times faster than Y” means $\frac{\text{ExTime}(Y)}{\text{ExTime}(X)} = \frac{\text{Performance}(X)}{\text{Performance}(Y)} = n$
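The "6 years to graduate" figure in Summary #1 is compound growth applied to the trend rates. A minimal sketch, with an illustrative function name:

```python
def growth(factor, period_years, elapsed_years):
    """Improvement after elapsed_years when things get `factor` times better every period."""
    return factor ** (elapsed_years / period_years)

YEARS = 6  # time to graduate, from the slide

cpu_speed = growth(2, 1.5, YEARS)      # SPEC rating: 2x in 1.5 years
dram_capacity = growth(4, 3, YEARS)    # DRAM capacity: 4x in 3 years
disk_capacity = growth(4, 3, YEARS)    # Disk capacity: 4x in 3 years
dram_speed = growth(2, 10, YEARS)      # DRAM speed: 2x in 10 years

print(f"CPU speed x{cpu_speed:.0f}, DRAM size x{dram_capacity:.0f}, "
      f"disk size x{disk_capacity:.0f}, DRAM speed x{dram_speed:.2f}")
```

CPU speed and memory capacity both grow 16x over the period while DRAM speed improves by only about half again, so the gap between processor and memory speed widens.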

Summary, #2 • Amdahl’s Law: $\text{Speedup}_{overall} = \frac{\text{ExTime}_{old}}{\text{ExTime}_{new}} = \frac{1}{(1 - \text{Fraction}_{enhanced}) + \frac{\text{Fraction}_{enhanced}}{\text{Speedup}_{enhanced}}}$ • CPI Law: $\text{CPU time} = \frac{\text{Seconds}}{\text{Program}} = \frac{\text{Instructions}}{\text{Program}} \times \frac{\text{Cycles}}{\text{Instruction}} \times \frac{\text{Seconds}}{\text{Cycle}}$ • Execution time is the REAL measure of computer performance! • Good products are created when you have: • Good benchmarks, good ways to summarize performance • Different set of metrics apply to embedded systems

Review: Instruction Sets, Pipelines, and Caches

Computer Architecture Is … the attributes of a [computing] system as seen by the programmer, i.e., the conceptual structure and functional behavior, as distinct from the organization of the data flows and controls, the logic design, and the physical implementation. Amdahl, Blaauw, and Brooks, 1964

Computer Architecture’s Changing Definition • 1950s to 1960s: Computer Architecture Course = Computer Arithmetic • 1970s to mid 1980s: Computer Architecture Course = Instruction Set Design, especially ISA appropriate for compilers • 1990s: Computer Architecture Course = Design of CPU, memory system, I/O system, Multiprocessors

Computer Architecture is ... Instruction Set Architecture Organization Hardware

Instruction Set Architecture (ISA) software instruction set hardware

Interface Design • A good interface: • Lasts through many implementations (portability, compatibility) • Is used in many different ways (generality) • Provides convenient functionality to higher levels • Permits an efficient implementation at lower levels • (figure: one Interface over time, with successive uses above it and implementations imp 1, imp 2, imp 3 below it)

Evolution of Instruction Sets • Single Accumulator (EDSAC 1950) • Accumulator + Index Registers (Manchester Mark I, IBM 700 series 1953) • Separation of Programming Model from Implementation: High-level Language Based (B5000 1963); Concept of a Family (IBM 360 1964) • General Purpose Register Machines: Complex Instruction Sets (Vax, Intel 432 1977-80); Load/Store Architecture (CDC 6600, Cray 1 1963-76) • RISC (Mips, Sparc, HP-PA, IBM RS6000, PowerPC . . . 1987) • LIW/”EPIC”? (IA-64 . . . 1999)

Evolution of Instruction Sets • Major advances in computer architecture are typically associated with landmark instruction set designs • Ex: Stack vs GPR (System 360) • Design decisions must take into account: • technology • machine organization • programming languages • compiler technology • operating systems • applications • And they in turn influence these

A "Typical" RISC • 32-bit fixed format instruction (3 formats I,R,J) • 32 32-bit GPR (R0 contains zero, DP take pair) • 3-address, reg-reg arithmetic instruction • Single address mode for load/store: base + displacement • no indirection • Simple branch conditions (based on register values) • Delayed branch see: SPARC, MIPS, HP PA-Risc, DEC Alpha, IBM PowerPC, CDC 6600, CDC 7600, Cray-1, Cray-2, Cray-3

Example: MIPS (DLX) instruction formats, fields given as name (bits) • Register-Register: Op (31-26), Rs1 (25-21), Rs2 (20-16), Rd (15-11), Opx (10-0) • Register-Immediate: Op (31-26), Rs1 (25-21), Rd (20-16), immediate (15-0) • Branch: Op (31-26), Rs1 (25-21), Rs2/Opx (20-16), immediate (15-0) • Jump / Call: Op (31-26), target (25-0)
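Fixed field positions are what make decoding cheap: every field is a shift and a mask. The sketch below decodes the Register-Register and Register-Immediate formats. It assumes Op in bits 31-26, Rs1 in 25-21, the next register field in 20-16, and a 16-bit immediate in 15-0, as the bit markers on the slide suggest; placing Rd in bits 15-11 and Opx in the low 11 bits is one reading of the garbled figure, and the function names and the example opcode value are hypothetical.

```python
def bits(word, high, low):
    """Extract bits high..low (inclusive) of a 32-bit instruction word."""
    width = high - low + 1
    return (word >> low) & ((1 << width) - 1)

def sign_extend(value, width):
    """Interpret a width-bit field as a two's-complement number."""
    sign_bit = 1 << (width - 1)
    return (value ^ sign_bit) - sign_bit

def decode_register_register(word):
    # Assumed layout: Op | Rs1 | Rs2 | Rd | Opx
    return {
        "op": bits(word, 31, 26),
        "rs1": bits(word, 25, 21),
        "rs2": bits(word, 20, 16),
        "rd": bits(word, 15, 11),
        "opx": bits(word, 10, 0),
    }

def decode_register_immediate(word):
    # Assumed layout: Op | Rs1 | Rd | 16-bit signed immediate
    return {
        "op": bits(word, 31, 26),
        "rs1": bits(word, 25, 21),
        "rd": bits(word, 20, 16),
        "imm": sign_extend(bits(word, 15, 0), 16),
    }

# Hypothetical encodings, built by hand to exercise the decoders.
rr_word = (0 << 26) | (1 << 21) | (2 << 16) | (3 << 11) | 0x20
ri_word = (0x23 << 26) | (4 << 21) | (9 << 16) | 0xFFFC  # immediate is -4

print(decode_register_register(rr_word))
print(decode_register_immediate(ri_word))
```

Because the source registers sit in the same bits in every format, the register file can be read before the opcode has been fully decoded.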

Pipelining: It's Natural! • Laundry Example • Ann, Brian, Cathy, Dave (A, B, C, D) each have one load of clothes to wash, dry, and fold • Washer takes 30 minutes • Dryer takes 40 minutes • “Folder” takes 20 minutes

Sequential Laundry • (figure: timeline from 6 PM to Midnight; loads A, B, C, D in task order, each taking 30, 40, and 20 minutes back to back) • Sequential laundry takes 6 hours for 4 loads • If they learned pipelining, how long would laundry take?

Pipelined Laundry: Start work ASAP • (figure: timeline from 6 PM; loads A, B, C, D in task order overlap, giving segments of 30, 40, 40, 40, 40, and 20 minutes along the time axis) • Pipelined laundry takes 3.5 hours for 4 loads

Pipelining Lessons • Pipelining doesn’t help latency of single task, it helps throughput of entire workload • Pipeline rate limited by slowest pipeline stage • Multiple tasks operating simultaneously • Potential speedup = Number of pipe stages • Unbalanced lengths of pipe stages reduce speedup • Time to “fill” pipeline and time to “drain” it reduce speedup
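The laundry numbers can be reproduced with a short simulation. This is a sketch under one assumption not stated on the slides: a finished load may wait between stages, so the washer never stalls because the dryer is busy. Function names are illustrative.

```python
def sequential_time(stage_minutes, loads):
    """Each load runs through every stage before the next load starts."""
    return loads * sum(stage_minutes)

def pipelined_time(stage_minutes, loads):
    """Each stage starts a load once the load is ready and the stage is free."""
    stage_free_at = [0] * len(stage_minutes)
    for _ in range(loads):
        ready = 0  # time at which this load leaves the previous stage
        for stage, minutes in enumerate(stage_minutes):
            start = max(ready, stage_free_at[stage])
            stage_free_at[stage] = start + minutes
            ready = stage_free_at[stage]
    return stage_free_at[-1]

LAUNDRY = [30, 40, 20]  # washer, dryer, folder (minutes), from the slides

seq = sequential_time(LAUNDRY, 4)
pipe = pipelined_time(LAUNDRY, 4)
print(f"sequential {seq / 60:.1f} h, pipelined {pipe / 60:.1f} h, speedup {seq / pipe:.2f}")

# Balanced stages and many loads approach the ideal speedup of 3 (the stage count).
balanced = [30, 30, 30]
many = 1000
print(f"balanced, {many} loads: speedup "
      f"{sequential_time(balanced, many) / pipelined_time(balanced, many):.2f}")
```

With four loads the 40-minute dryer sets the pace and fill and drain time still matter, so the speedup is well short of 3; balanced stages and a long stream of loads close most of that gap.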

Computer Pipelines • Execute billions of instructions, so throughput is what matters • DLX desirable features: all instructions same length, registers located in same place in instruction format, memory operands only in loads or stores

5 Steps of DLX Datapath (Figure 3.1, Page 130) • Instruction Fetch • Instr. Decode / Reg. Fetch • Execute / Addr. Calc • Memory Access • Write Back • (figure: datapath showing Next PC, Adder (+4), Next SEQ PC, instruction Memory (Address, Inst), Reg File (RS1, RS2, RD), Sign Extend (Imm), MUXes, ALU, Zero? test, Data Memory, LMD, WB Data)

5 Steps of DLX Datapath (Figure 3.4, Page 134) • Instruction Fetch • Instr. Decode / Reg. Fetch • Execute / Addr. Calc • Memory Access • Write Back • (figure: the same datapath with pipeline registers IF/ID, ID/EX, EX/MEM, MEM/WB between stages; Next SEQ PC and RD are carried forward from stage to stage) • Data stationary control • local decode for each instruction phase / pipeline stage
