
Optimizing compiler. Interprocedural optimizations.




  1. Optimizing compiler. Interprocedural optimizations.

  2. Interprocedural optimization How can good programming style be combined with the speed requirements of an application? Good programming style assumes: • Modularity. • Readability and code reuse. • Encapsulation of implementation details. Modularity of the source code complicates the task of optimization. All previously discussed optimizations work at the procedural level: • Optimizations work effectively with local variables. • Every function call is a "black box" with unknown properties. • The properties of procedure parameters are unknown. • The properties of global variables are unknown. To solve these problems the program has to be analyzed as a whole.

  3. Some basic problems of procedure-level optimization 1.) Scalar optimizations: according to the iterative algorithm for data flow analysis, Reaches(b) = ∪ over all predecessors p of (DefsOut(p) ∪ (Reaches(p) ∩ ¬Killed(p))). When an unknown function is called from a basic block p, every local and global variable that could be changed inside this function according to the language rules must be put into Killed(p). To perform high-quality optimizations the compiler needs to know which objects can actually be changed inside the function. 2.) Loop optimizations: for high-quality loop optimizations the compiler needs to: • determine which objects cannot address the same memory; • determine the properties of the functions called within the loops (they do not change the iteration variables, do not terminate the program, etc.); • estimate the number of loop iterations. 3.) Loop vectorization: for successful loop vectorization, information about the memory alignment of objects can be very useful. To obtain such information we need to analyze the entire program. This (interprocedural) information improves many classic intraprocedural optimizations.

  4. One-pass and multi-pass compilation • In computer programming, a one-pass compiler is a compiler that processes the source code of each compilation unit only once; it does not look back at previously processed code. A multi-pass compiler traverses the source code or its internal representation several times. • To gather information about function properties, the compiler needs to analyze every function and its interconnections with other functions, because each of them can contain arbitrary function calls, including calls to itself (recursion). It is necessary to analyze a call graph. A call graph represents the calling relationships between the subroutines in a program. Each node represents a procedure, and each edge (f, g) indicates that procedure f calls procedure g.

  5. A call graph may be static, calculated at compile time, or dynamic. A static call graph contains all possible ways control can be passed. A dynamic call graph is obtained during the execution of the program and can be useful for performance analysis. • One of the main tasks of interprocedural analysis is constructing the call graph and determining function properties. For example, global data flow analysis is intended to compute the objects which can be modified within each function. • A call graph can be complete or incomplete. If an application uses utilities from an external library, then the graph will be incomplete and full analysis cannot be performed.

  6. Two-pass and single-pass compilation scheme [diagram] Single pass: source files → FE (C++/C or Fortran) → internal representation → profiler → scalar optimizations → loop optimizations → code generation → object files. Two pass (-Qipo/-Qip): the internal representation is saved into temporary files or object files with IR; interprocedural optimizations run over them, followed by scalar optimizations, loop optimizations and code generation → executable file or library.

  7. One-pass and multi-pass compilation actions When one-pass compilation is used, the compiler performs the following steps: parsing and internal representation creation, profile analysis, scalar and loop optimizations, code generation. Object files corresponding to the source files are generated, and a linker builds the application from these files. In the case of multi-pass compilation, during the first pass the compiler parses the sources, creates the internal representation, performs some scalar optimizations and saves the internal representation into the object files. These files contain a packed internal representation of the corresponding source files, which makes interprocedural analysis and optimization possible. During this analysis the call graph is built and additional function properties are collected. The next step is the interprocedural optimizations; such optimizations work on parts of the call graph. Finally, the compiler performs scalar and loop optimizations, and one or several final object files are generated.

  8. Main compiler options for interprocedural optimizations /Qipo[n] enables interprocedural optimization between files. This is also called multifile interprocedural optimization (multifile IPO) or Whole Program Optimization (WPO). /Qipo-c generates a multi-file object file (ipo_out.obj). /Qipo-S generates a multi-file assembly file (ipo_out.asm). /Qipo-jobs<n> specifies the number of jobs to be executed simultaneously during the IPO link phase. There is also a partial interprocedural analysis which works on single-file scope. In this case a partial call graph is built, and the interprocedural optimizations are performed according to the information obtained from its analysis. /Qip[-] enables (DEFAULT)/disables single-file IP optimization within files.

  9. Mod/Ref analysis Interprocedural analysis collects MOD and REF sets for each routine. The MOD/REF sets contain the objects which can be modified or referenced during the routine's execution. These sets can be used for scalar optimizations. Let's consider a simple example with two files. Function main contains a call of function "unknown", which is located in another file. The local variable "a" is assigned a constant value before this call. If this constant can be propagated through the function call, the following if statement can be simplified and the check deleted.

/* test.c */
#include <stdio.h>
extern void unknown(int *a);
int main()
{
  int a, b, c;
  a = 5;
  c = a;
  unknown(&a);
  if (a == 5) printf("a==5\n");
  b = a;
  printf("%d %d %d\n", a, b, c);
  return 1;
}

/* unknown.c */
#include <stdio.h>
void unknown(int *a)
{
  printf("a=%d\n", *a);
}

  10. We can use the assembler files to determine whether the check if(a==5) was deleted.
icl test.c unknown.c -S
Without IPO the check is present. Let's inspect the test.asm file:
call _unknown ;9.1
...
.B1.2: ; Preds .B1.8
mov edi, DWORD PTR [a.302.0.1] ;10.4
cmp edi, 5 ;10.7
jne .B1.4 ; Prob 0% ;10.7
icl -Ob0 test.c unknown.c -Qipo -S
With -Qipo the check was eliminated; -Ob0 is needed to prevent inlining of unknown.
call _unknown ;9.1
...
.B1.2: ; Preds .B1.7
push OFFSET FLAT: ??_C@_05A@a?$DN?$DN5?6?$AA@ ;11.3
call _printf ;11.3

  11. Alias analysis • Alias analysis is used to determine whether a storage location may be accessed in more than one way. Two pointers are said to be aliased if they point to the same location. • Explicit aliasing: different objects refer to the same memory according to the programming language rules (union for C/C++, EQUIVALENCE for Fortran). • Parameter aliasing: a formal argument can be aliased with another formal argument or with objects from the global scope. • Pointer analysis: pointers can be aliased if the sets of objects which can be referenced through these pointers have common elements. • Alias analysis is important for finding loop dependences.

  12. Alias analysis example A dependence may appear if two pointers (a and b) reference the same memory location. In this case any loop optimization is prohibited.

#include <stdio.h>
int p1=1, p2=2;
int *a, *b;
void init(int **a, int **b)
{
  *a = &p1;
  *b = &p1; /* <= a and b point to p1 */
}
int main()
{
  int i, ar[100];
  init(&a, &b);
  printf("*a= %d *b=%d\n", *a, *b);
  for (i=0; i<100; i++) {
    ar[i] = i*(*a)*(*a);
    *b += 1; /* *a is changed through *b */
  }
  printf("ar[50]= %d p2=%d\n", ar[50], p2);
}

  13. Other aspects of interprocedural analysis Interprocedural analysis is used: • to determine function attributes. For example, attributes such as "no_side_effect" and "always_return" are used to simplify some kinds of analysis and optimization. • to determine attributes of variables. For example, if a variable does not have the "address taken" attribute, then it cannot be updated through pointers, which simplifies many optimizations. Whole-program analysis is required to handle global variables. • for data promotion. Each variable has a scope; IPA allows this scope to be reduced according to the real usage. • to remove unused global variables. • to remove dead code. There can be subgraphs in the call graph which are not connected to the program entry. Such subgraphs can be safely removed from the final executable. • to provide information about argument alignment. If the actual function arguments are always aligned, then vectorization can be improved for the procedure.

  14. Interprocedural optimizations An interprocedural optimization is a program transformation that involves more than one procedure of the program; in other words, an optimization based on the results of interprocedural analysis. Constant propagation is performed on the basis of the interprocedural value propagation graph. As a result of this optimization, some formal arguments can be replaced with the corresponding constant value. Simple example: if all calls of a function f(x,y,z) pass the same constant value for the actual argument x, then the formal argument x can be replaced with this constant inside the function body. Constant result propagation: if a procedure returns a constant value, then this value can be propagated into the callers.

  15. Interprocedural constant propagation example

/* test.c */
#include <stdio.h>
extern void known(int variant, int *var);
int main()
{
  int var;
  int ttt;
  var = 2;
  ttt = 3;
  known(var, &ttt);
  printf("ttt=%i\n", ttt);
}

/* known.c */
void known(int var, int *ttt)
{
  if (var > 0)
    (*ttt)++;
  else
    (*ttt)--;
}

IPO constant propagation should simplify the body of the known routine.
icc -Ob0 test.c known.c -fast -ipo -S
...
known:
# parameter 1: %edi
# parameter 2: %rsi
..B2.1: # Preds ..B2.0
..___tag_value_known.8: #1.30
addl $1, (%rsi) #3.3
ret #6.1
.align 16,0x90

  16.

#include <stdio.h>
int fcall(int x) {
  if (x > 3)
    printf("x>3");
  else
    printf("x<=3");
  return x+1;
}
int main()
{
  int x, y;
  x = 2;
  y = fcall(x);
  x = 1;
  y = fcall(x);
}

It is easy to see that in this program the formal argument "x" of function fcall can only take the values 2 or 1. The if condition inside fcall is resolved identically for these values. Let's check whether interprocedural optimization performs constant propagation in this case.
icl test2.c -Ob0 -O3 -Qipo -S ??

  17. Inlining Inlining, or inline expansion, is a compiler optimization that replaces a function call site with the body of the callee. Inlining reduces execution time by removing the cost of the function call, eliminates branches and keeps the executed code close together in memory, which improves instruction cache performance through better locality of reference. Inlining also allows intraprocedural optimizations to be performed on the inlined function body: in most cases the larger scope enables better scheduling and register allocation. The disadvantage of inlining is an increase in application size; compile time and compiler resource usage also grow as a result. Inlining heuristics try to choose the best candidates for inlining to get the most performance without exceeding the allowed code growth. A programmer can recommend inlining a function with the inline attribute, for example: inline int exforsys(int x1) { return 5*x1; }

  18.

REAL A(100)
INTEGER I
DO I = 1,100
  A(I) = I
END DO
DO I = 1,100
  CALL AADD(A,I,1.0)
END DO
PRINT *, A(100)
END

SUBROUTINE AADD(ARRAY,EL,AD)
REAL :: ARRAY(*)
INTEGER EL
REAL AD
ARRAY(EL) = ARRAY(EL) + AD
RETURN
END

Inlining allows intraprocedural optimizations to be performed on the inlined function body. Inlining of subroutine AADD allows the loop containing the call to be vectorized.
ifort -Ob0 test_vec.f90 -Qvec_report3
...
..\test_vec.f90(10): (col. 2) remark: loop was not vectorized: nonstandard loop is not a vectorization candidate.
ifort test_vec.f90 -Qvec_report3
...
C:\users\aanufrie\students\ipo\5\test_vec.f90(8): (col. 2) remark: LOOP WAS VECTORIZED.

  19. Inlining directives • #pragma inline [recursive] • #pragma forceinline [recursive] • #pragma noinline • The recursive clause demands inlining of all routines which are called by the marked call. • The inline directive recommends inlining a routine. • noinline demands that a routine not be inlined. • forceinline demands that a routine be inlined. • Fortran directives: • cDEC$ ATTRIBUTES INLINE :: procedure • cDEC$ ATTRIBUTES NOINLINE :: procedure • cDEC$ ATTRIBUTES FORCEINLINE :: procedure

  20. Compiler options • /Ob<n> controls inline expansion: • n=0 disables inlining • n=1 inlines functions declared with __inline, and performs C++ inlining • n=2 inlines any function, at the compiler's discretion • /Qinline-min-size:<n> • sets the size limit for inlining small routines • /Qinline-min-size- • no size limit for inlining small routines • /Qinline-max-size:<n> • sets the size limit for inlining large routines • /Qinline-max-size- • no size limit for inlining large routines • /Qinline-max-total-size:<n> • maximum increase in size for inline function expansion • /Qinline-max-total-size- • no size limit for inline function expansion

  21. /Qinline-max-per-routine:<n> • maximum number of inline instances in any function • /Qinline-max-per-routine- • no maximum number of inline instances in any function • /Qinline-max-per-compile:<n> • maximum number of inline instances in the current compilation • /Qinline-max-per-compile- • no maximum number of inline instances in the current compilation • /Qinline-factor:<n> • scale the inlining upper limits by n percent • /Qinline-factor- • do not set inlining upper limits • /Qinline-forceinline • treat inline routines as forceinline • /Qinline-dllimport • allow (DEFAULT)/disallow inlining of functions declared __declspec(dllimport) • /Qinline-calloc directs the compiler to inline calloc() calls as malloc()/memset()

  22. Procedure cloning Cloning specializes a function for a specific class of call sites. Sometimes specific characteristics of the actual arguments allow special optimizations to be performed on a procedure. In this case it is possible to create a specialized procedure and replace the initial procedure call with a call to the new one everywhere the actual arguments have these characteristics. The trivial case is a call of a procedure with a constant argument. For example, if there are several calls of some procedure f of the form f(x,y,TRUE) and several calls f(x,y,FALSE), then it is sometimes profitable to create procedures f_TRUE(x,y) and f_FALSE(x,y) and replace the initial calls with calls to the new procedures.

  23. Partial inlining Partial inlining is an efficient form of inlining which inlines only part of the callee function, for example the early-exit test at its top:

/* before: every list element pays for a call */
while (q) {
  process_elem(q);
  q = q->next;
}
void process_elem(plist p) {
  if (p->type != 2) return;
  ...
}

/* after: only the test is inlined into the caller */
while (q) {
  if (q->type == 2)
    process_elem(q);
  q = q->next;
}
void process_elem(plist p) { ... }

  24. Data transformations Data transformation is an interprocedural optimization which changes the layout of user data to provide better cache locality during execution. The following types of data transformation are widely known: • permutation of structure fields • structure splitting • Permutation of structure fields can improve cache locality if the fields which are used together during a calculation are located close to each other. In this case the system bus reads fewer cache lines from memory. • Structure splitting leaves the hot (frequently used) fields in the main structure and moves the other fields into a separate cold section. After this optimization the data need less memory and fit the cache better. • The compiler needs to prove the correctness of such a transformation; in many cases whole-program analysis is needed.

  25. Structure splitting and field reordering example

/* struct.h */
#ifndef PERF
typedef struct {
  double x;
  char title[40];
  double y;
  char title2[22];
  double z;
} VecR;
#else
typedef struct {
  char title[40];
  char title2[22];
} ColdFields;
typedef struct {
  double x;
  double y;
  double z;
  ColdFields *cold;
} VecR;
#endif

/* struct.c */
#include <stdio.h>
#include <stdlib.h>
#include "struct.h"
int main()
{
  int i, k;
  VecR *array = malloc(10000*sizeof(VecR));
#ifdef PERF
  for (i=0; i<10000; i++)
    array[i].cold = (ColdFields*)malloc(sizeof(ColdFields));
#endif
  for (i=0; i<10000; i++) {
    array[i].x = 1.0;
    array[i].y = 2.0;
    array[i].z = 0.0;
  }
  for (k=1; k<10000; k++) {
    for (i=k; i<9999; i++) {
      array[i].x = array[i-1].y + 1.0;
      array[i].y = array[i+1].x + array[i+1].y;
      array[i].z = (array[i-1].y - array[i-1].x)/array[i-1].y;
    }
  }
  printf("%f \n", array[100].z);
#ifdef PERF
  for (i=0; i<10000; i++)
    free(array[i].cold);
#endif
  free(array);
}

  26. Result of test execution
icc struct.c -fast -o a.out
icc struct.c -fast -DPERF -o b.out
time ./a.out
real 0m0.808s
time ./b.out
real 0m0.566s

  27. Pointer chasing • Data access through several levels of pointers is one of the most common problems in C++ code. If the program data do not fit into the cache subsystem, then every pointer dereference causes a significant stall in the calculation. • This problem can also be caused by a wrong data transformation.

class Employers {
  Personal_info *p;
  ...
};
class Personal_info {
  Family_info *f;
  ...
};
class Family_info {
  int members;
  ...
};

all_members += employer->p->f->members;

  28. Devirtualization of C++ virtual methods C++ is an object-oriented language with a high level of abstraction and the ability to select the class method to execute depending on the type of the object at run time. Pointers to the different class methods are stored in a special table, and a call of a virtual function through this table is expensive for performance. Sometimes a call through the virtual method table can be replaced with a call of a specific method. A => B => C. All derived classes override virtual int foo(). int process(class A *a) { return a->foo(); }

  29. Devirtualization example • Class A isn't instantiated in this source, so it is possible to perform devirtualization.

#include <stdio.h>
class A {
  virtual int foo() { return 1; }
  friend int process(class A *a);
};
class B : public A {
  virtual int foo() { return 2; }
  friend int process(class A *a);
};
int process(class A *a)
{
  return a->foo();
}
void main()
{
  B* pB = new B;
  int result2 = process(pB);
}

icl test.cpp -S
mov eax, DWORD PTR [ebx]
mov ecx, ebx
call DWORD PTR [eax]   (call through the table)

icl test.cpp -Ob0 -Qipo -S
call ?process.@@YAHPAVA@@@Z

  30. Thank you!
