Introduction to application optimization using Intel® performance tools.

  1. Introduction to application optimization using Intel® performance tools. Andrei Anufrienko, Intel Compiler Group

  2. The objectives of this course. Get a basic understanding of: the main factors affecting processor performance; basic performance improvement techniques; Intel® tools for performance analysis; the main options and components of the Intel compiler; the theoretical foundations of some performance optimizations.

  3. You will be able to: describe the main problems of processor performance; investigate an application using the VTune™ Performance Analyzer and find problem areas; identify the main problems of the analyzed application; develop a strategy to improve application performance; describe the main components of the compiler and their functions; control the level of optimization with command-line options.

  4. Course plan: Intel microprocessor architecture and the main factors affecting processor performance; VTune™ Performance Analyzer usage; the role of the compiler in improving application performance; some theoretical concepts (control flow graph, data-flow analysis); permutation optimizations and their applicability, dependencies; vectorization; parallelization using OpenMP directives and auto-parallelization; the main components of the compiler, their tasks and interconnection.

  5. Intel microprocessor architecture and the main factors affecting the processor performance.

  6. Simplified processor model [diagram; labels: External memory, Operative memory (RAM), Input-output bus, System bus, Input-output unit, Commands, Data, Arithmetic logic unit (ALU), Control unit, Processor, Registers]

  7. Simplified processor model. Components: control unit (CU), arithmetic logic unit (ALU), system registers, front-side bus (FSB), memory, peripheral devices. The control unit (CU) decodes instructions received from memory, controls the ALU, and transfers data between the CPU registers, memory, and peripheral devices. The ALU consists of several parts that perform arithmetic and logical operations on the system registers. System registers are a piece of memory inside the CPU used for temporary storage of the information being processed. The system bus transfers data between the CPU and memory, as well as between the CPU and peripherals.

  8. High performance is one of the key factors in the competition between computer system manufacturers. Processor performance is directly related to the amount of computational work that can be done per unit of time. Roughly speaking: Performance = Number of instructions / Time. We will discuss performance on the basis of the IA32 and IA32E architectures (IA32 with EM64T). Factors affecting processor performance: CPU clock frequency; amount and speed of accessible memory; the performance of the instructions and the completeness of the instruction set; use of the internal registers; the quality of pipelining; the quality of branch prediction; the quality of prefetching; superscalarity; the quality of vectorization; parallelization and multi-core.
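
The rough formula above can be tried on numbers. The sketch below is only an illustration: the instruction count, clock frequency, and cycles-per-instruction values are hypothetical, and the decomposition of execution time into instruction count, cycles per instruction, and clock frequency is the usual textbook one, not something stated on the slide.

```python
def performance(instruction_count, seconds):
    """Instructions executed per second (the slide's rough definition)."""
    return instruction_count / seconds


def execution_time(instruction_count, cycles_per_instruction, clock_hz):
    """Textbook decomposition (an assumption, not from the slides):
    time = instructions * average cycles per instruction / clock frequency."""
    return instruction_count * cycles_per_instruction / clock_hz


# Hypothetical program: 6 billion instructions on a 3 GHz processor.
baseline = execution_time(6e9, cycles_per_instruction=1.5, clock_hz=3e9)
improved = execution_time(6e9, cycles_per_instruction=1.0, clock_hz=3e9)
speedup = baseline / improved
```

With the same instruction count and clock frequency, lowering the average number of cycles per instruction (for example through better cache use or fewer branch mispredictions) is what shortens the run in this model.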

  9. Clock rate. Because the processor is made of different components working at different speeds, there is a processor timer that provides synchronization by sending a periodic synchronization signal. Its frequency is called the clock speed of the processor. Memory speed and amount. 8086: 1 MB of memory. 80286: new system registers and a new memory mode, 16 MB of memory. 80386: the first 32-bit processor, 4 GB. EM64T (Extended Memory 64 Technology): ~2^64 bytes.
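
The memory limits listed above follow directly from the width of the address. The sketch below assumes the commonly cited address widths (20, 24, 32, and 64 bits), which are not given on the slide; the 64-bit figure is the architectural limit, and actual processors may implement fewer address bits.

```python
# Address widths in bits: commonly cited values, not taken from the slides.
ADDRESS_BITS = {"8086": 20, "80286": 24, "80386": 32, "EM64T": 64}


def addressable_bytes(bits):
    """Number of distinct byte addresses for a given address width."""
    return 2 ** bits


def human_readable(num_bytes):
    """Format a power-of-two byte count using binary multiples."""
    value = num_bytes
    for unit in ["B", "KB", "MB", "GB", "TB", "PB", "EB"]:
        if value < 1024:
            return f"{value} {unit}"
        value //= 1024
    return f"{value} ZB"


limits = {cpu: human_readable(addressable_bytes(bits))
          for cpu, bits in ADDRESS_BITS.items()}
```

Each additional address bit doubles the addressable memory, which is why the step from 32 to 64 bits is so much larger than the earlier steps.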

  10. The performance of the instructions and the completeness of the instruction set. Performance depends on how well the instructions are implemented and how well the basic instruction set covers all possible tasks. CISC and RISC (complex and reduced instruction set computing). Modern Intel processors are a hybrid of CISC and RISC: before execution, the processor converts CISC instructions into a simpler, RISC-like instruction set.

  11. Registers and memory. System registers have the shortest access time, so the number of available registers affects the performance of the microprocessor. Register spilling: a shortage of system registers causes heavy traffic between the registers and the application stack. IA32E (EM64T technology) added more system registers. Today, memory access is much slower than computation. Two characteristics describe the properties of memory: response time (latency), the number of processor cycles required to transfer data from memory; and bandwidth, the number of items that can be transferred between the processor and memory in one cycle. There are two possible strategies for improving performance: reduce the response time or prefetch the necessary memory.

  12. Memory access time is reduced by a cache system (a small amount of memory located on the processor). Memory blocks are preloaded into the cache. If the requested address is in the cache, there is a “hit” and data access is much faster. Otherwise there is a “cache miss” and additional time is needed: the block of memory is read into the cache over one or more bus cycles, which is called filling a cache line. (The size of a cache line is 64 bytes.) There are different kinds of cache: fully associative cache (each block can be placed anywhere in the cache); direct-mapped cache (each block can be loaded into only one place); and various hybrid options (sectored cache, set-associative cache). Set-associative access: the least significant bits of the block address determine the cache set the block can be loaded into; a cache line holds several words of main memory, and placement within the set is associative. The quality of memory access is the main key to performance.
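
The set-associative mapping described above can be sketched as simple address arithmetic. This is a minimal model, assuming a power-of-two layout; the 64-byte line size comes from the slide, while the number of sets (64) is a hypothetical value chosen for illustration.

```python
LINE_SIZE = 64  # bytes per cache line (from the slide)
NUM_SETS = 64   # hypothetical: a 32 KiB, 8-way cache has 32768 / (64 * 8) = 64 sets


def split_address(address, line_size=LINE_SIZE, num_sets=NUM_SETS):
    """Return (tag, set_index, offset) for a byte address."""
    offset = address % line_size   # position of the byte inside its cache line
    block = address // line_size   # number of the memory block (line-sized)
    set_index = block % num_sets   # low bits of the block number pick the set
    tag = block // num_sets        # remaining high bits identify the block
    return tag, set_index, offset
```

In this model, addresses 0 and 4096 produce the same set index with different tags, so they compete for the lines of one set, while address 64 falls into the next set.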

  13. Modern computing architectures contain a complicated cache hierarchy. Nehalem (Core i7): L1 latency 4 cycles; L2 latency 11 cycles; L3 latency 38 cycles; main memory latency > 100 cycles. Proactive memory access is implemented by hardware prefetching, which is based on the history of cache misses. It tries to detect and prefetch independent streams of data. There is also a special set of instructions that asks the processor to load the specified memory into the cache (software prefetching).
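
To see how much the hit rates matter, the latencies above can be combined into an average cost per load. This is a rough sketch: the hit rates are hypothetical, main memory is assumed to cost exactly 100 cycles (the slide only says "more than 100"), and each listed latency is treated as the total cost of a load served by that level.

```python
# Load latencies in cycles from the slide; 100 is an assumed value for "> 100".
LATENCY = {"L1": 4, "L2": 11, "L3": 38, "RAM": 100}


def expected_latency(l1_hit, l2_hit, l3_hit, latency=LATENCY):
    """Average cycles per load, given the hit rate at each level
    (each rate is conditional on missing the previous level)."""
    reach_l2 = 1.0 - l1_hit                 # fraction of loads that miss L1
    reach_l3 = reach_l2 * (1.0 - l2_hit)    # ... and also miss L2
    reach_ram = reach_l3 * (1.0 - l3_hit)   # ... and also miss L3
    return (l1_hit * latency["L1"]
            + reach_l2 * l2_hit * latency["L2"]
            + reach_l3 * l3_hit * latency["L3"]
            + reach_ram * latency["RAM"])


typical = expected_latency(0.9, 0.5, 0.5)
```

Under these assumptions, a 90% L1 hit rate already gives an average of about 7.6 cycles per load, almost twice the pure L1 latency, because the few loads that reach main memory are so expensive.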

  14. The principle of locality. The quality of prefetching. Locality of reference helps to reuse variables or related data. There is a difference between temporal locality (reuse of the same data and resources) and spatial locality (use of data located close together in memory). The caching mechanism uses the principle of temporal locality. (Before a new cache line is loaded into the cache, some cache line has to be freed. The cache mechanism selects the one with the oldest access time.) The prefetching engine uses the principle of spatial locality. It tries to detect the memory access pattern in order to preload into the cache the memory that will be needed soon. The size of the preloaded memory block (cache line) is 64 bytes. Thus, with good spatial locality (data used together during a calculation is located close together in memory), fewer cache lines have to be loaded into the cache. One known performance problem is “cache aliasing”: an unfortunate placement in memory of the objects involved in a calculation causes useful cache lines to be replaced by other needed addresses. Example: Z = sqrt(y^2 + x^2). If the operands are adjacent in memory, one cache line has to be loaded; if they are scattered, up to three cache lines have to be loaded.
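
The Z = sqrt(y^2 + x^2) example can be checked by counting the cache lines that the three operands occupy. The sketch below assumes 8-byte operands and hypothetical byte addresses; only the 64-byte line size comes from the slide.

```python
LINE_SIZE = 64  # bytes, as stated on the slide


def cache_lines_touched(addresses, size=8, line_size=LINE_SIZE):
    """Set of cache-line numbers covered by objects of `size` bytes
    starting at each of the given byte addresses."""
    lines = set()
    for start in addresses:
        first = start // line_size
        last = (start + size - 1) // line_size  # an object may straddle two lines
        lines.update(range(first, last + 1))
    return lines


adjacent = cache_lines_touched([0, 8, 16])         # x, y, Z side by side
scattered = cache_lines_touched([0, 4096, 8192])   # x, y, Z far apart
```

Here the adjacent layout touches one cache line and the scattered layout touches three, which matches the two cases on the slide. An operand that is not aligned, such as 8 bytes starting at address 60, would occupy two lines on its own.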

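  15. Pipeline

     | Tick | Instruction fetch | Register fetch | Instruction decode | Execution | Data fetch | Write back |
     |------|-------------------|----------------|--------------------|-----------|------------|------------|
     | 0    | instr. 1          | -              | -                  | -         | -          | -          |
     | 1    | instr. 2          | instr. 1       | -                  | -         | -          | -          |
     | 2    | instr. 3          | instr. 2       | instr. 1           | -         | -          | -          |
     | 3    | instr. 4          | instr. 3       | instr. 2           | instr. 1  | -          | -          |
     | 4    | instr. 5          | instr. 4       | instr. 3           | instr. 2  | instr. 1   | -          |
     | 5    | instr. 6          | instr. 5       | instr. 4           | instr. 3  | instr. 2   | instr. 1   |
     | 6    | instr. 7          | instr. 6       | instr. 5           | instr. 4  | instr. 3   | instr. 2   |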

  16. The quality of pipelining; instruction-level parallelism. Pipelining means that successive instructions are processed at the same time, each in a different phase of the pipeline. Typical instruction execution can be divided into the following steps: instruction fetch (IF); instruction decoding / register selection (ID); operation execution / calculation of effective memory addresses (EX); memory access (MEM); storing the result (WB). Pipelining improves the throughput of the processor, but if instructions depend on the results of previous instructions, there will be delays. Thus the benefit of pipelining depends on the level of instruction-level parallelism.
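
The throughput gain from pipelining can be estimated with a small model. This sketch assumes an ideal pipeline with no stalls, where one new instruction enters on every tick; the function names are illustrative only.

```python
def pipelined_cycles(num_instructions, num_stages):
    """Cycles to finish all instructions when a new one enters every tick."""
    if num_instructions == 0:
        return 0
    return num_stages + num_instructions - 1


def unpipelined_cycles(num_instructions, num_stages):
    """Cycles when each instruction passes all stages before the next starts."""
    return num_instructions * num_stages


def stage_occupancy(tick, num_instructions, num_stages):
    """Instruction number in each stage at a given tick (None = empty stage)."""
    row = []
    for stage in range(num_stages):
        instr = tick - stage + 1
        row.append(instr if 1 <= instr <= num_instructions else None)
    return row
```

For seven instructions and a six-stage pipeline this model gives 12 cycles instead of 42. Any dependency that forces an instruction to wait adds stall cycles on top of the ideal figure, which is why the level of instruction-level parallelism matters.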

  17. The quality of prediction. Instructions may depend on data and on control logic (data dependence and control-flow dependence). The efficiency of the pipeline is limited by conditional branches in the instruction flow. If there is a conditional branch, the following instructions are not known until the condition has been evaluated. Should the pipeline be stopped? The branch predictor is designed to solve this problem. The predictor selects one possible path and continues fetching and processing instructions. All processed instructions are kept in pipeline storage. If the predictor's assumption was correct, all of them are marked as valid; otherwise a “branch misprediction” has happened: the pipeline storage has to be cleared and new instructions have to be fetched. There are static and dynamic predictors. A static predictor uses simple rules; the trivial prediction is that a forward branch will not be taken and a backward branch will be taken. A dynamic predictor collects statistics on every branch and bases its choice on this information. There is also branch target prediction, which predicts the target of unconditional jumps.
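
Both kinds of predictor can be modelled in a few lines. The static rule below is the one from the slide. The dynamic predictor is a 2-bit saturating counter, a common textbook scheme used here as an assumption; the slides do not say which scheme any particular processor uses.

```python
def static_predict(branch_address, target_address):
    """Static rule from the slide: backward jumps (typically loops) are
    predicted taken, forward jumps are predicted not taken."""
    return target_address < branch_address


def count_mispredictions(outcomes, state=0):
    """Dynamic prediction with one 2-bit saturating counter.
    States 0-1 predict 'not taken', states 2-3 predict 'taken'."""
    misses = 0
    for taken in outcomes:
        predicted_taken = state >= 2
        if predicted_taken != taken:
            misses += 1
        # Move the counter toward the observed outcome, saturating at 0 and 3.
        state = min(state + 1, 3) if taken else max(state - 1, 0)
    return misses


# A loop branch that is taken nine times and then falls through once.
loop_branch = [True] * 9 + [False]
cold_misses = count_mispredictions(loop_branch)            # counter starts at 0
warm_misses = count_mispredictions(loop_branch, state=3)   # counter already trained
```

In this model the untrained counter mispredicts the first two iterations and the loop exit, while a trained counter mispredicts only the exit.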

  18. Superscalarity. A superscalar processor is capable of performing multiple operations per clock cycle. It has several execution units. The superscalar technique has several identifying characteristics: instructions are issued from a sequential instruction stream; a special hardware unit detects data dependences between instructions at run time; the CPU accepts multiple instructions per clock cycle. A modern CPU is always superscalar and pipelined. Each execution unit has its own specialization. A diverse instruction mix and a high level of instruction-level parallelism give the best CPU efficiency.

  19. Simplified processor model [diagram; labels: External memory, Operative memory (RAM), Input-output bus, Prefetching, Input-output unit, Caches, Branch prediction, Control unit (CU), Arithmetic logic unit (ALU), Superscalar, Registers]

  20. Vector instructions and vectorization. A typical vector instruction performs an elementary operation on two vector sequences located in memory or in fixed-length vector registers: C(1:n) = A(1:n) + B(1:n). Fortran array sections are a convenient notation for vector operations. Vectorization is the process of converting a scalar calculation, in which an operation is performed on a pair of operands, into a vector representation, in which an operation is performed on a pair of vector operands. Each vector contains several scalar operands. The Pentium III processor of the x86 family introduced SSE (Streaming SIMD Extensions): eight 128-bit registers (XMM0-XMM7) and 70 new instructions, including instructions for floating-point numbers. SSE2, SSE3, SSSE3, SSE4, SSE4.2, and AVX are further extensions of SSE.
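
The effect of vectorization on the number of executed operations can be modelled as follows. This is a pure counting sketch, not real SIMD code: one pass of the main loop stands for one vector instruction, and the width of 4 assumes that four 32-bit values fit into one 128-bit register.

```python
VECTOR_WIDTH = 4  # four 32-bit floats fit in one 128-bit XMM register


def scalar_add(a, b):
    """One addition per pair of elements. Returns (result, operation count)."""
    result = []
    for i in range(len(a)):
        result.append(a[i] + b[i])
    return result, len(a)


def vector_add(a, b, width=VECTOR_WIDTH):
    """Adds `width` elements per modelled vector operation, then finishes
    the leftover elements one by one. Returns (result, operation count)."""
    result = []
    operations = 0
    n = len(a)
    main = n - n % width
    for i in range(0, main, width):
        # Models a single vector instruction working on `width` lanes.
        result.extend(x + y for x, y in zip(a[i:i + width], b[i:i + width]))
        operations += 1
    for i in range(main, n):  # scalar remainder loop
        result.append(a[i] + b[i])
        operations += 1
    return result, operations


a = list(range(10))
b = [10 * x for x in a]
scalar_result, scalar_ops = scalar_add(a, b)
vector_result, vector_ops = vector_add(a, b)
```

For ten elements the model needs four operations (two vector operations plus two scalar ones for the remainder) instead of ten, with identical results.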

  21. Look-ahead and out-of-order execution. Modern x86 microprocessors have advanced mechanisms to examine the instruction flow and identify instructions that can be computed in parallel. If the look-ahead buffer contains enough instructions that can be processed together, then the processor pipeline works with maximum efficiency. This approach leads to execution that changes the instruction order (out-of-order execution). Implementing out-of-order mechanisms makes the processor architecture more complicated and causes additional energy costs. There are Intel processors without out-of-order support (Itanium, Atom). In this case instruction scheduling is the key factor in good processor performance.

  22. Parallelization and multi-core. Multitasking is a method in which multiple tasks, also known as processes, share the common resources of a microprocessor. Multithreaded computers have hardware support for executing multiple threads efficiently. Threads are parts of a process and share the same memory. Multithreading makes it possible to divide a calculation into several parts that are processed in parallel. Hyper-threading technology (Pentium 4 through Core i7) makes it possible to mix the instruction sequences of different processes to improve instruction-level parallelism. Cores: a multi-core microprocessor contains several superscalar pipelines that have their own computational resources but share the system bus, memory, and upper-level caches. Multiprocessor solutions contain several processors. Multiprocessor and multi-core systems make it possible to increase application performance by creating multiple threads.
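
Dividing a calculation into parts that are handed to several threads can be sketched with the standard library. The function name and the worker count are hypothetical. Note that in standard CPython builds the global interpreter lock prevents pure-Python threads from computing in parallel, so this example shows only the structure of the decomposition, not a real speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def parallel_sum(values, workers=4):
    """Split `values` into roughly equal chunks, sum each chunk in its own
    task, then combine the partial sums."""
    chunk = max(1, (len(values) + workers - 1) // workers)
    parts = [values[i:i + chunk] for i in range(0, len(values), chunk)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partial_sums = pool.map(sum, parts)  # one task per chunk
        return sum(partial_sums)


total = parallel_sum(list(range(1, 101)))
```

The threads share the same list in memory, as described above; each one reads only its own part, so no synchronization is needed until the partial sums are combined.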

  23. Main characteristics of an application that affect its performance: efficiency of calculations; effectiveness of memory usage; correct branch prediction; efficient use of vector instructions; effectiveness of parallelization; level of instruction-level parallelism.

  24. Performance measuring. What factors affect the performance of a specific program? Compiler quality and the performance of the computer system. Consumers need criteria to determine the performance of a computer system: a representative set of typical tasks; a universal testing scheme; independence from microprocessor manufacturers. SPEC (Standard Performance Evaluation Corporation, spec.org) is a non-profit organization that creates, supports, and maintains a standard set of tests for comparing the performance of different computer systems. This organization develops and publishes standard suites for performance measurement. CPU2006 is designed to measure performance and can be used to compare programs running on different computer systems. OMP2001 measures performance on tests that use the OpenMP standard for shared-memory parallel processing.

  25. The role of the optimizing compiler. A compiler translates the entire source program into an equivalent program in machine code or assembly language. Does the compiler play any role in the struggle for microprocessor performance? The compiler is used during testing and debugging of a new microprocessor's functionality. The performance gain of a new computer system from a new instruction set or an increased number of registers can be demonstrated only with an optimizing compiler that supports these innovations. The compiler is able to hide the architects' mistakes.

  26. List of literature for deeper study: Randy Allen and Ken Kennedy, “Optimizing Compilers for Modern Architectures”; David F. Bacon, Susan L. Graham, and Oliver J. Sharp, “Compiler Transformations for High-Performance Computing”; Aart J.C. Bik, “The Software Vectorization Handbook”; Richard Gerber, Aart J.C. Bik, Kevin B. Smith, and Xinmin Tian, “The Software Optimization Cookbook”; Intel® 64 and IA-32 Architectures Software Developer's Manual; Intel® 64 and IA-32 Architectures Optimization Reference Manual; Agner Fog, “Optimizing software in C++: An optimization guide for Windows, Linux and Mac platforms”, http://www.agner.org/optimize/

  27. Thank you!
