This resource outlines the concept of Cycles Per Instruction (CPI), illustrating its significance in evaluating CPU performance. It details formulas for calculating CPI, average CPU time, and compares performance between two machines with varying CPI and clock cycle times. The provided examples demonstrate the impact of instruction frequency and execution time on overall performance. Additionally, strategies for optimizing CPU efficiency through improved data caching and branch prediction are discussed. This informs students of the key metrics in computer architecture critical for performance analysis.
Computer Architecture CSE 3322
Send email to Pramod Kumar, pxk3008@exchange.uta.edu, with the names and emails of your four project team members by Mon Sept 15. If not on a team, send your email address to Pramod.
Web Site: crystal.uta.edu/~jpatters/cse3322
CPI: “Average clock cycles per instruction”
$$\text{CPI} = \frac{\text{Clock Cycles}}{\text{Instruction Count}} = \frac{\text{CPU Time} \times \text{Clock Rate}}{\text{Instruction Count}}$$
$$\text{CPU Time} = \frac{\text{Instruction Count} \times \text{CPI}}{\text{Clock Rate}} = \text{Instruction Count} \times \text{CPI} \times \text{Clock Cycle Time}$$
$$\text{Average CPI} = \frac{\sum_{i=1}^{n} \text{CPI}(i) \times I(i)}{\text{Instruction Count}} = \sum_{i=1}^{n} \text{CPI}(i) \times F(i)$$
Here I(i) is the instruction count for class i, and F(i) = I(i) / Instruction Count is the instruction frequency.
Invest resources where time is spent!
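The formulas above can be sketched in a few lines of Python. This is only an illustration: the function names and the sample numbers in the usage section are assumptions chosen for the example, not part of the course material.

```python
def cpu_time(instruction_count, cpi, clock_rate_hz):
    """CPU time in seconds = instruction count * CPI / clock rate."""
    return instruction_count * cpi / clock_rate_hz


def average_cpi(cycles_per_class, counts_per_class):
    """Average CPI = sum(CPI(i) * I(i)) / total instruction count."""
    total_instructions = sum(counts_per_class)
    total_cycles = sum(c * n for c, n in zip(cycles_per_class, counts_per_class))
    return total_cycles / total_instructions


def average_cpi_from_freq(cycles_per_class, freq_per_class):
    """Average CPI = sum(CPI(i) * F(i)), where the F(i) sum to 1."""
    return sum(c * f for c, f in zip(cycles_per_class, freq_per_class))


if __name__ == "__main__":
    # Hypothetical instruction classes and counts, for illustration only.
    cycles = [1, 2, 3]
    counts = [600, 300, 100]
    freqs = [n / sum(counts) for n in counts]

    # Both forms of the average CPI formula should agree.
    print(average_cpi(cycles, counts))
    print(average_cpi_from_freq(cycles, freqs))

    # CPU time for 1000 instructions on an assumed 100 MHz clock.
    print(cpu_time(sum(counts), average_cpi(cycles, counts), 100e6))
```

The count-based and frequency-based forms give the same average CPI because F(i) is simply I(i) divided by the total instruction count.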
CPI Example
Suppose we have two implementations of the same instruction set. For some program:
• Machine A has a clock cycle time of 10 ns and a CPI of 2.0
• Machine B has a clock cycle time of 20 ns and a CPI of 1.2
Which machine is faster for this program, and by how much?
• Machine A: CPU Time = I * 2.0 * 10 ns = I * 20 ns
• Machine B: CPU Time = I * 1.2 * 20 ns = I * 24 ns
A is 24/20 = 1.2 times faster than B.
Note: CPI is smaller for B, yet A is faster because its clock cycle time is half of B's.
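As a quick check of the comparison above, the following sketch computes the time per instruction for each machine. The function and variable names are illustrative assumptions. Because both machines run the same program on the same instruction set, the instruction count I cancels out of the ratio.

```python
def time_per_instruction_ns(cpi, clock_cycle_ns):
    """Average time per instruction = CPI * clock cycle time."""
    return cpi * clock_cycle_ns


# Values taken from the example: A has CPI 2.0 at 10 ns, B has CPI 1.2 at 20 ns.
time_a = time_per_instruction_ns(2.0, 10)
time_b = time_per_instruction_ns(1.2, 20)

# The instruction count is identical for both machines, so it cancels here.
speedup_a_over_b = time_b / time_a

if __name__ == "__main__":
    print(f"A: {time_a:.1f} ns per instruction")
    print(f"B: {time_b:.1f} ns per instruction")
    print(f"A is {speedup_a_over_b:.2f} times faster than B")
```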
Example (RISC processor)
Base Machine (Reg / Reg), Typical Mix

Op       Freq   Cycles   F(i)*CPI(i)   % Time
ALU      50%    1        0.5           23%
Load     20%    5        1.0           45%
Store    10%    3        0.3           14%
Branch   20%    2        0.4           18%
Total                    2.2 = CPI ave

$$\frac{\text{CPU Time}(i)}{\text{CPU Time}} = \frac{\text{Instr Cnt}(i) \times \text{CPI}(i) \times \text{Clk Cycle Time}}{\text{Inst Cnt} \times \text{CPI}_{ave} \times \text{Clk Cycle Time}}$$
$$\%\ \text{Time} = \frac{F(i) \times \text{CPI}(i)}{\text{CPI}_{ave}}$$
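The table can be reproduced with a short sketch. The dictionary layout and function names are assumptions made for this illustration; the frequencies and cycle counts are the ones from the typical mix.

```python
# Typical mix from the table: op -> (frequency, cycles).
BASE_MIX = {
    "ALU": (0.50, 1),
    "Load": (0.20, 5),
    "Store": (0.10, 3),
    "Branch": (0.20, 2),
}


def average_cpi(mix):
    """CPI ave = sum over ops of F(i) * CPI(i)."""
    return sum(freq * cycles for freq, cycles in mix.values())


def time_share(mix):
    """Fraction of CPU time per op = F(i) * CPI(i) / CPI ave."""
    cpi_ave = average_cpi(mix)
    return {op: freq * cycles / cpi_ave for op, (freq, cycles) in mix.items()}


if __name__ == "__main__":
    print(f"CPI ave = {average_cpi(BASE_MIX):.2f}")
    for op, share in time_share(BASE_MIX).items():
        print(f"{op}: {share * 100:.0f}% of time")
```

Loads are only 20% of the instructions but account for the largest share of the time, which is where the "invest resources where time is spent" advice points.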
Example (RISC processor): what-if questions on the base machine
CPU Time = Inst Cnt * CPI ave * Clk Cycle Time, so with the instruction count and clock cycle time unchanged, the speedup is the ratio of the average CPIs.
• How much faster would the machine be if a better data cache reduced the average load time to 2 cycles? Load goes from 5 to 2 cycles, its F(i)*CPI(i) from 1.0 to 0.4, and CPI ave from 2.2 to 1.6. Speedup = 2.2/1.6 = 1.375.
• How does this compare with using branch prediction to shave a cycle off the branch time? Branch goes from 2 cycles to 1, its F(i)*CPI(i) from 0.4 to 0.2, and CPI ave from 2.2 to 2.0.
• What if two ALU instructions could be executed at once? ALU goes from 1 cycle to 0.5, its F(i)*CPI(i) from 0.5 to 0.25, and CPI ave from 2.2 to 1.95.
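The three what-if cases can be compared side by side with a small sketch. The helper names are assumptions for illustration, and the speedup figures assume that the instruction count and clock cycle time stay the same, so that speedup is just the ratio of average CPIs.

```python
# Typical mix for the base machine: op -> (frequency, cycles).
BASE_MIX = {
    "ALU": (0.50, 1),
    "Load": (0.20, 5),
    "Store": (0.10, 3),
    "Branch": (0.20, 2),
}


def average_cpi(mix):
    """CPI ave = sum over ops of F(i) * CPI(i)."""
    return sum(freq * cycles for freq, cycles in mix.values())


def with_cycles(mix, op, new_cycles):
    """Return a copy of the mix with one op's cycle count changed."""
    changed = dict(mix)
    freq, _ = changed[op]
    changed[op] = (freq, new_cycles)
    return changed


SCENARIOS = {
    "better data cache (load = 2 cycles)": with_cycles(BASE_MIX, "Load", 2),
    "branch prediction (branch = 1 cycle)": with_cycles(BASE_MIX, "Branch", 1),
    "two ALU ops at once (ALU = 0.5 cycles)": with_cycles(BASE_MIX, "ALU", 0.5),
}

if __name__ == "__main__":
    base_cpi = average_cpi(BASE_MIX)
    for name, mix in SCENARIOS.items():
        cpi = average_cpi(mix)
        # Speedup assumes unchanged instruction count and clock cycle time.
        print(f"{name}: CPI = {cpi:.2f}, speedup = {base_cpi / cpi:.3f}")
```

Under these assumptions the data cache improvement gives the largest gain, since loads account for the largest share of the base machine's time.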
A compiler designer is trying to decide between two code sequences for a particular machine. Based on the hardware implementation, there are three different classes of instructions:
• Class A takes 1 cycle
• Class B takes 2 cycles
• Class C takes 3 cycles
The first code sequence has 5 instructions: 2 of A, 1 of B, and 2 of C. Cycles: 2*1 + 1*2 + 2*3 = 10.
The second sequence has 6 instructions: 4 of A, 1 of B, and 1 of C. Cycles: 4*1 + 1*2 + 1*3 = 9.
Which sequence will be faster? How much? The second, by 10/9 = 1.11.
What is the CPI for each sequence? First: 10/5 = 2. Second: 9/6 = 1.5.
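The same arithmetic can be written as a short sketch. The names below are illustrative assumptions; the cycle counts and instruction mixes are the ones from the problem.

```python
# Cycles per instruction class, from the problem statement.
CYCLES = {"A": 1, "B": 2, "C": 3}


def total_cycles(counts):
    """Total clock cycles = sum of (cycles for class) * (instructions of class)."""
    return sum(CYCLES[cls] * n for cls, n in counts.items())


def cpi(counts):
    """CPI = total cycles / total instruction count."""
    return total_cycles(counts) / sum(counts.values())


seq1 = {"A": 2, "B": 1, "C": 2}  # 5 instructions
seq2 = {"A": 4, "B": 1, "C": 1}  # 6 instructions

if __name__ == "__main__":
    for name, seq in (("first", seq1), ("second", seq2)):
        print(f"{name}: {total_cycles(seq)} cycles, CPI = {cpi(seq):.2f}")
    print(f"second is {total_cycles(seq1) / total_cycles(seq2):.2f} times faster")
```

The second sequence executes more instructions but fewer cycles, so instruction count alone does not decide which sequence is faster.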
A popular performance metric is MIPS, the number of millions of instructions per second. For a given program,
$$\text{MIPS} = \frac{\text{Instruction Count}}{\text{Execution Time} \times 10^{6}}$$
• Cannot compare if instruction set is different
• Highly dependent on the program
• Can be inversely proportional to performance
MIPS example
Two different compilers are being tested for a 100 MHz machine with three different classes of instructions:

              Class A   Class B   Class C
CPI           1         2         3

Instruction counts (billions):

Code from     A     B     C     Total
Compiler 1    5     1     1     7
Compiler 2    10    1     1     12

• Which sequence will be faster according to MIPS?
• Which sequence will be faster according to execution time?

              CPU cycles                     Exec Time                     MIPS
Compiler 1    5 + 1*2 + 1*3 = 10 billion     10^10 * 10^-8 = 100 sec       7*10^3 / 100 = 70
Compiler 2    10 + 1*2 + 1*3 = 15 billion    15*10^9 * 10^-8 = 150 sec     12*10^3 / 150 = 80

Compiler 2 has the higher MIPS rating (80 vs. 70), but Compiler 1 has the shorter execution time (100 sec vs. 150 sec).
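The table can be checked with a minimal sketch that applies the CPU time and MIPS formulas directly. The function and variable names are assumptions for illustration; the clock rate, class CPIs, and instruction counts come from the example.

```python
CLOCK_RATE_HZ = 100e6  # 100 MHz machine from the example
CYCLES = {"A": 1, "B": 2, "C": 3}  # cycles per instruction class


def evaluate(counts):
    """Return (execution time in seconds, MIPS) for per-class instruction counts."""
    instructions = sum(counts.values())
    cycles = sum(CYCLES[cls] * n for cls, n in counts.items())
    exec_time = cycles / CLOCK_RATE_HZ
    mips = instructions / (exec_time * 1e6)
    return exec_time, mips


# Instruction counts in the example are given in billions.
compiler1 = {"A": 5e9, "B": 1e9, "C": 1e9}
compiler2 = {"A": 10e9, "B": 1e9, "C": 1e9}

if __name__ == "__main__":
    for name, counts in (("Compiler 1", compiler1), ("Compiler 2", compiler2)):
        exec_time, mips = evaluate(counts)
        print(f"{name}: {exec_time:.0f} sec, {mips:.0f} MIPS")
```

The code from Compiler 2 scores higher on MIPS while taking longer to run, which is one concrete case of MIPS moving in the opposite direction from performance.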
Benchmarks
• Performance best determined by running a real application
  • Use programs typical of expected workload
  • Or, typical of expected class of applications, e.g., compilers/editors, scientific applications, graphics, etc.
• Small benchmarks
  • nice for architects and designers
  • easy to standardize
  • can be abused
Benchmarks
• SPEC (System Performance Evaluation Cooperative)
  • companies have agreed on a set of real programs and inputs
  • can still be abused
  • valuable indicator of performance (and compiler technology)
SPEC95
• Eighteen application benchmarks (with inputs) reflecting a technical computing workload
• Eight integer applications
  • go, m88ksim, gcc, compress, li, ijpeg, perl, vortex
• Ten floating-point intensive applications
  • tomcatv, swim, su2cor, hydro2d, mgrid, applu, turb3d, apsi, fpppp, wave5
• Must run with standard compiler flags
  • eliminate special undocumented incantations that may not even generate working code for real programs
Amdahl's Law
$$\text{Execution Time After Improvement} = \text{Execution Time Unaffected} + \frac{\text{Execution Time Affected}}{\text{Amount of Improvement}}$$
• Example: Suppose a program runs in 100 seconds on a machine, with multiply responsible for 80 seconds of this time. How much do we have to improve the speed of multiplication if we want the program to run 4 times faster?
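One way to work the example is to rearrange Amdahl's Law for the amount of improvement. The sketch below does that; the function names are assumptions for illustration, and the result follows only from the formula and the numbers stated in the example.

```python
def time_after_improvement(unaffected, affected, improvement):
    """Amdahl's Law: new time = unaffected time + affected time / improvement."""
    return unaffected + affected / improvement


def required_improvement(unaffected, affected, target_time):
    """Solve Amdahl's Law for the improvement factor that reaches target_time."""
    remaining = target_time - unaffected
    if remaining <= 0:
        # The unaffected part alone already uses up the whole target time.
        raise ValueError("target cannot be reached by improving this part alone")
    return affected / remaining


if __name__ == "__main__":
    total, multiply = 100.0, 80.0      # seconds, from the example
    unaffected = total - multiply      # 20 seconds not spent in multiply
    target = total / 4                 # "4 times faster" means 25 seconds
    factor = required_improvement(unaffected, multiply, target)
    print(f"multiply must be {factor:.0f} times faster")
    print(f"check: {time_after_improvement(unaffected, multiply, factor):.0f} sec")
```

With these numbers the target is 25 seconds, of which 20 are unaffected, so multiplication must fit in the remaining 5 seconds: an improvement of 80/5 = 16 times. The same rearrangement shows that a target of 20 seconds or less cannot be met by speeding up multiplication alone.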