Performance and Optimization

  1. Performance and Optimization

  2. Measuring Performance
  • The key measure of performance for a computing system is speed:
    • Response time (also called execution time or latency).
    • Throughput.
  • We have seen how to increase throughput, while slightly increasing the execution time of each single instruction: pipeline design.
  • We now concentrate on measuring total execution time, which can mean:
    • Elapsed time -- includes all I/O, OS activity, and time spent on other jobs.
    • CPU time -- time spent by the processor on your job.

  3. CPU Execution Time
  • We consider CPU execution time on an unloaded system:
      CPU Time = Execution Time
      Performance = 1 / CPU Time
  • Machine X is n times faster than machine Y if
      CPU TimeY / CPU TimeX = n,  or equivalently,  performanceX / performanceY = n
  • Basic measure of performance:
      CPU Time = Cycles count x Clock cycle time
               = (Clock cycles / program) x (Seconds / clock cycle)

  4. CPU Execution Time
  • Clock cycle time is measured in nanoseconds (10^-9 sec) or microseconds (10^-6 sec).
  • Clock rate = 1 / (Clock cycle time) and is measured in:
    • Megahertz (MHz) - 10^6 cycles/sec
    • Gigahertz (GHz) - 10^9 cycles/sec

  5. CPI (Cycles per Instruction)
  • Cycles count = IC x CPI
      = (Instructions / program) x (Average clock cycles / instruction)
  • CPI is one way to compare different implementations of the same Instruction Set Architecture (ISA), since the instruction count (IC) for a given program will be the same in both cases.

  6. CPI - Pipelined Implementation
  • In each cycle, the execute stage processes either an instruction or a bubble, injected due to one of three special cases.
  • Given a total of Ci instructions and Cb bubbles, the processor requires Ci + Cb clock cycles to execute the Ci instructions.
  • Thus, in a pipelined implementation, CPI = (Ci + Cb) / Ci = 1 + Cb/Ci.
  • Cb/Ci is the average number of bubbles injected per instruction. In our implementation, CPI = 1.27.
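  As a concrete illustration of that 1.27 (the instruction and bubble counts here are assumed, not from the slide): a run that executes Ci = 1,000 instructions and injects Cb = 270 bubbles needs 1,270 cycles, giving CPI = (1000 + 270) / 1000 = 1 + 0.27 = 1.27.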

  7. CPI Example
  • We have two machines with different implementations of the same ISA (Instruction Set Architecture). Machine A has a clock cycle time of 10 ns and a CPI of 2.0 for program P; machine B has a clock cycle time of 20 ns and a CPI of 1.2 for the same program. Which machine is faster?
  • Let IC be the number of instructions to be executed (the same on both machines). Then:
      Cycles countA = 2.0 IC        Cycles countB = 1.2 IC
  • Calculate the CPU Time for each machine:
      CPU TimeA = 2.0 IC x 10 ns = 20.0 IC ns
      CPU TimeB = 1.2 IC x 20 ns = 24.0 IC ns
  • Machine A is faster; in fact 24/20 = 1.2, i.e., 20% faster.

  8. Composite Performance Measure
  • CPU Time = Instruction Count x CPI x Clock cycle time
      = (Instructions / program) x (Average clock cycles / instruction) x (Seconds / clock cycle)
    or, equivalently,
      CPU Time = Instruction Count x CPI / Clock rate
  • These formulas show that performance is always a function of 3 distinct factors; 1 or 2 factors alone are not sufficient.
  • IC (Instruction Count) was once the main factor advertised (VAX); today clock rate is in the headlines (3 GHz Pentiums). CPI is more difficult to advertise.
  • Changing one factor often affects the others. For example:
    • Decreasing the instruction count means each instruction is doing more; hence CPI, cycle time, or both may increase.
    • A smart compiler may decrease CPI by choosing the right kind and order of instructions, without a large increase in instruction count.
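  As a worked example (all three numbers are assumed, for illustration only): a program with IC = 10^9 instructions and CPI = 1.5 on a 2 GHz machine takes CPU Time = 10^9 x 1.5 / (2 x 10^9 cycles/sec) = 0.75 seconds.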

  9. Amdahl's Law
  • Make the common case fast -- why?
  • Denote the fraction of execution time affected by an enhancement as Fenhanced, and the speedup of that part as speedupenhanced. Then:
      Speedup = CPU Timeold / CPU Timenew
              = CPU Timeold / (CPU Timeold x (1 - Fenhanced) + CPU Timeold x Fenhanced / speedupenhanced)
              = 1 / ((1 - Fenhanced) + Fenhanced / speedupenhanced)
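  The formula is easy to check mechanically. Below is a minimal C sketch (the function name amdahl_speedup and the test values are my own, chosen to match the FP example on the next slide):

      #include <stdio.h>

      /* Amdahl's Law: overall speedup when a fraction f of execution
       * time is accelerated by a factor s and the rest is unchanged. */
      double amdahl_speedup(double f, double s) {
          return 1.0 / ((1.0 - f) + f / s);
      }

      int main(void) {
          /* 74% (= 20/27) of the time sped up 10x -> overall 3x. */
          printf("speedup = %.2f\n", amdahl_speedup(20.0 / 27.0, 10.0));
          return 0;
      }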

  10. Amdahl's Law (Example)
  • Suppose we have a technique for improving the performance of FP operations by a factor of 10. What fraction of the code must be floating point to achieve a 200% improvement in performance (i.e., a speedup of 3)?
      3 = 1 / ((1 - Fenhanced) + Fenhanced / 10)
      ==> (1 - Fenhanced) + Fenhanced / 10 = 1/3
      ==> 1 - 0.9 Fenhanced = 1/3
      ==> Fenhanced = (2/3) / 0.9 = 20/27 = 74%
  • Even dramatic enhancements make a limited contribution unless they relate to a very common case.

  11. Amdahl's Law (example cont.)
  • Let us assume a sequential processor with these stages (and the percentage of time spent in each):
    • Fetch - 11%
    • Decode - 18%
    • Execute - 23%
    • Memory - 40%
    • WriteBack - 8%
  • and per-stage speedups of 1x, 5x, 20x, 1.6x, and 1x respectively. Then:
      new time fraction = 0.11/1 + 0.18/5 + 0.23/20 + 0.40/1.6 + 0.08/1 = 0.4875
      Speedup = 1 / 0.4875 =~ 2
  • The 5x and 20x speedups have little effect: they apply to small fractions of the time.

  12. Amdahl's Law (Parallel)
  • Max speedup with N processors = 1 / ((1 - P) + P/N), where:
    • P is the portion of the code that can be made parallel.
    • N is the number of processors.
  • With a very large number of processors, the speedup is bounded by 1/(1 - P).
  • What does that mean about the efficiency of parallel computing? On which kinds of problems does it pay off?
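  As a numeric illustration (P = 0.9 and N = 8 are assumed values): if 90% of a program parallelizes, the speedup on N = 8 processors is 1 / (0.1 + 0.9/8) =~ 4.7, and even as N grows without bound it is capped at 1 / (1 - 0.9) = 10. The amdahl_speedup sketch above computes the same figure as amdahl_speedup(0.9, 8).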

  13. Important to Keep in Mind
  • The 90/10 rule: 90% of the time is spent in 10% of the code.
  • Readability vs. performance.
  • Time vs. memory.

  14. Machine-Independent Optimizations
  • Optimizations you should do regardless of processor / compiler.
  • Code Motion
    • Reduce the frequency with which a computation is performed, if it will always produce the same result.
    • Especially: moving code out of a loop.
  • Before:
      for (i = 0; i < n; i++)
          for (j = 0; j < n; j++)
              a[n*i + j] = b[j];
  • After:
      for (i = 0; i < n; i++) {
          int ni = n*i;
          for (j = 0; j < n; j++)
              a[ni + j] = b[j];
      }

  15. Reduction in Strength
  • Replace a costly operation with a simpler one, e.g. shift and add instead of multiply or divide:
      16*x  -->  x << 4
  • The utility is machine dependent: it depends on the cost of the multiply or divide instruction. On a Pentium II or III, an integer multiply requires only 4 CPU cycles.
  • Recognize a sequence of products. Before:
      for (i = 0; i < n; i++)
          for (j = 0; j < n; j++)
              a[n*i + j] = b[j];
    After (the multiply n*i is replaced by a running sum):
      int ni = 0;
      for (i = 0; i < n; i++) {
          for (j = 0; j < n; j++)
              a[ni + j] = b[j];
          ni += n;
      }

  16. Using more efficient instructions
  • [The slide's table contrasting cases where instructions can and cannot be replaced by more efficient ones did not survive transcription.]

  17. Arrays and Loops Example
  • [The slide's listings of the original C code and the corresponding assembly code did not survive transcription.]

  18. Loop Optimization
  • Loop optimization is the process of increasing the execution speed of loops and reducing their overheads. It plays an important role in improving cache performance and making effective use of parallel processing capabilities.
  • Most of a program's execution time is spent in loops, so many compiler optimization techniques have been developed to make them faster.

  19. Time Scales
  • Absolute Time
    • Typically measured in nanoseconds (10^-9 seconds) - the time scale of computer instructions.
  • Clock Cycles
    • Most computers are controlled by a high-frequency clock signal. Typical range:
      • 100 MHz = 10^8 cycles per second; clock period = 10 ns.
      • 2 GHz = 2 x 10^9 cycles per second; clock period = 0.5 ns.

  20. Cycles Per Element
  • A convenient way to express the performance of a program that operates on vectors or lists of length n:
      T = CPE * n + Overhead
  • In the measured plot, vsum1 has a slope (CPE) of 4.0 and vsum2 a slope of 3.5.
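  As a worked example (the overhead constant is assumed): at vsum1's CPE of 4.0 with an overhead of 80 cycles, a vector of n = 1000 elements costs T = 4.0 * 1000 + 80 = 4080 cycles.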

  21. Optimization Example

      void combine1(vec_ptr v, int *dest)
      {
          int i;
          *dest = 0;
          for (i = 0; i < vec_length(v); i++) {
              int val;
              get_vec_element(v, i, &val);
              *dest += val;
          }
      }

  • Procedure:
    • Compute the sum of all elements of an integer vector.
    • Store the result at the destination location.
    • The vector data structure and operations are defined via an abstract data type.
  • Pentium II/III performance, in clock cycles per element:
    • 42.06 (compiled -g), 31.25 (compiled -O2).

  22. General GCC optimization commands
  • Most optimizations are enabled only if -O is set on the command line. Otherwise they are disabled, even if individual optimization flags are specified.
  • With -O, the compiler tries to reduce code size and execution time, without performing any optimizations that take a great deal of compilation time.
  • With -O2, GCC optimizes even more: it performs nearly all supported optimizations that do not involve a space-speed tradeoff. Compared to -O, this option increases both compilation time and the performance of the generated code.
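  For instance, the running vector-sum example could be built at the two levels measured on the previous slide like this (the file name combine.c is my own assumption):

      gcc -g  combine.c -o combine     # unoptimized, with debug info
      gcc -O2 combine.c -o combine     # nearly all non-size-costly optimizations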

  23. Understanding the Loop

      void combine1_goto(vec_ptr v, int *dest)
      {
          int i = 0;
          int val;
          *dest = 0;
          if (i >= vec_length(v))
              goto done;
      loop:
          get_vec_element(v, i, &val);
          *dest += val;
          i++;
          if (i < vec_length(v))
              goto loop;
      done:
          ;
      }

  • Inefficiency: the procedure vec_length is called on every iteration, even though its result is always the same.

  24. Move vec_length Call Out of Loop

      void combine2(vec_ptr v, int *dest)
      {
          int i;
          int length = vec_length(v);
          *dest = 0;
          for (i = 0; i < length; i++) {
              int val;
              get_vec_element(v, i, &val);
              *dest += val;
          }
      }

  • Optimization: move the call to vec_length out of the inner loop; its value does not change from one iteration to the next (code motion).
  • CPE: 20.66 (compiled -O2).
  • vec_length requires only constant time, but carries significant overhead.

  25. Array optimizations - Loops
  • Array and loop optimizations (sketched in the code below):
    • No need for a loop variable.
    • Using pointer arithmetic I: instead of increasing the loop variable by one, increase the pointer by the size of the data type.
    • Using pointer arithmetic II: compute the address of the final array element, and use a comparison against this address as the loop test (do-while loop).
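  A minimal sketch of these variants for a plain int array (the function names sum_ptr and sum_end are my own):

      #include <stddef.h>

      /* Pointer arithmetic I: the index only counts iterations; the walk
       * uses data++, which advances by sizeof(int) bytes each step. */
      int sum_ptr(const int *data, size_t n) {
          int sum = 0;
          for (size_t i = 0; i < n; i++)
              sum += *data++;
          return sum;
      }

      /* Pointer arithmetic II: precompute the address one past the last
       * element and use it as the loop test in a do-while; no separate
       * loop variable is needed. */
      int sum_end(const int *data, size_t n) {
          int sum = 0;
          if (n > 0) {                 /* guard: a do-while runs at least once */
              const int *end = data + n;
              do {
                  sum += *data++;
              } while (data < end);
          }
          return sum;
      }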

  26. Optimization techniques (C++ oriented)
  • For loops:
    • Use ++i instead of i++: i++ must be able to return the unincremented original value and therefore store it, whereas ++i can return the incremented value without storing the previous one. (This mattered mainly with old compilers, but it is still good practice because you never know what machine will run your code.) It matters especially on non-primitive types such as iterators.
    • x += a is more efficient than x = x + a. (It probably doesn't matter to most compilers nowadays, but x = x + a once evaluated x twice.)
    • Count down to 0 instead of up: it is usually faster to compare against 0 (see the sketch below).
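  A minimal sketch of the count-down idiom on a simple array sum (the function names are my own). On many machines the decrement itself sets the condition flags, so the comparison against 0 comes essentially for free:

      /* Counting up: each iteration needs an explicit compare against n. */
      int sum_up(const int *data, int n) {
          int sum = 0;
          for (int i = 0; i < n; i++)
              sum += data[i];
          return sum;
      }

      /* Counting down: the loop test compares against 0, which the
       * decrement can establish via condition flags. */
      int sum_down(const int *data, int n) {
          int sum = 0;
          for (int i = n - 1; i >= 0; i--)
              sum += data[i];
          return sum;
      }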

  27. Reduction in Strength

      void combine3(vec_ptr v, int *dest)
      {
          int i;
          int length = vec_length(v);
          int *data = get_vec_start(v);
          *dest = 0;
          for (i = 0; i < length; i++) {
              *dest += data[i];
          }
      }

  • Optimization: avoid a procedure call to retrieve each vector element.
    • Get a pointer to the start of the array before the loop; within the loop, just do an array reference.
    • Not as clean in terms of data abstraction.
  • CPE: 6.00 (compiled -O2).
  • Procedure calls are expensive! Bounds checking is expensive!

  28. Eliminate Unneeded Memory Refs

      void combine4(vec_ptr v, int *dest)
      {
          int i;
          int length = vec_length(v);
          int *data = get_vec_start(v);
          int sum = 0;
          for (i = 0; i < length; i++)
              sum += data[i];
          *dest = sum;
      }

  • Optimization: don't store to the destination until the end.
    • The local variable sum is held in a register, avoiding 1 memory read and 1 memory write per iteration.
  • CPE: 2.00 (compiled -O2).
  • Memory references are expensive!

  29. Detecting Unneeded Memory Refs.
  • Combine3:
      .L18:
          movl (%ecx,%edx,4),%eax
          addl %eax,(%edi)
          incl %edx
          cmpl %esi,%edx
          jl .L18
  • Combine4:
      .L24:
          addl (%eax,%edx,4),%ecx
          incl %edx
          cmpl %esi,%edx
          jl .L24
  • Performance:
    • Combine3: 5 instructions in 6 clock cycles; the addl must both read and write memory.
    • Combine4: 4 instructions in 2 clock cycles.

  30. Optimization Blocker: Memory Aliasing
  • Aliasing: two different memory references specify a single location.
  • Example (traced below):
    • v: [3, 2, 17]
    • combine3(v, get_vec_start(v)+2) --> ?
    • combine4(v, get_vec_start(v)+2) --> ?
  • Observations:
    • Aliasing is easy to have happen in C, since the language allows address arithmetic and direct access to storage structures.
    • Get in the habit of introducing local variables and accumulating within loops: it is your way of telling the compiler not to check for aliasing.
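  Tracing the two calls by hand (my own walk-through of the code shown earlier): with dest aliased to v[2], combine3's updates to *dest change the very element the loop later reads, while combine4 reads everything before its single final write.

      combine3: *dest = 0       -> v = [3, 2, 0]
                i=0: *dest += 3 -> v = [3, 2, 3]
                i=1: *dest += 2 -> v = [3, 2, 5]
                i=2: *dest += 5 -> v = [3, 2, 10]   (reads the aliased v[2])

      combine4: sum = 3 + 2 + 17 = 22, then *dest = 22 -> v = [3, 2, 22]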

  31. Loop Unrolling

      void combine5(vec_ptr v, int *dest)
      {
          int length = vec_length(v);
          int limit = length - 2;
          int *data = get_vec_start(v);
          int sum = 0;
          int i;
          /* Combine 3 elements at a time */
          for (i = 0; i < limit; i += 3)
              sum += data[i] + data[i+1] + data[i+2];
          /* Finish any remaining elements */
          for (; i < length; i++)
              sum += data[i];
          *dest = sum;
      }

  • Optimization: combine multiple iterations into a single loop body, amortizing the loop overhead across iterations; finish the extra elements at the end.
  • Measured CPE = 1.33.

  32. Parallel Loop Unrolling

      void combine6(vec_ptr v, int *dest)
      {
          int length = vec_length(v);
          int limit = length - 1;
          int *data = get_vec_start(v);
          int x0 = 1;
          int x1 = 1;
          int i;
          /* Combine 2 elements at a time */
          for (i = 0; i < limit; i += 2) {
              x0 *= data[i];
              x1 *= data[i+1];
          }
          /* Finish any remaining elements */
          for (; i < length; i++)
              x0 *= data[i];
          *dest = x0 * x1;
      }

  • Code version: integer product.
  • Optimization: accumulate in two different products, which can be computed simultaneously, and combine them at the end.
  • Performance: CPE = 2.0, a 2x improvement.

  33. Code Optimizing
  • Time:
    • Use efficient algorithms.
    • Move constant calculations outside of loops.
    • Access memory as little as possible.
    • Use more efficient instructions (shift vs. multiply).
    • Minimize calls to inefficient functions.
  • Memory:
    • Use the smallest data types that fit.
    • Use more efficient structures that allow the same functionality with less memory (see example later).
    • Use as few variables as possible.
