
Advanced Analysis in SUIF2 and Future Work



Presentation Transcript


  1. Advanced Analysis in SUIF2 and Future Work Monica Lam Stanford University http://suif.stanford.edu/

  2. Interprocedural, High-Level Transforms for Locality and Parallelism
     • Program transformation for computational kernels: a new technique based on affine partitioning
     • Interprocedural analysis framework, designed to maximize code reuse: flow sensitivity; context sensitivity
     • Interprocedural program analysis: pointer alias analysis (Steensgaard’s algorithm); scalar/array dependence, privatization, reduction recognition
     • Parallel code generation: define new IR nodes for parallel code

  3. Loop Transforms: Cholesky Factorization Example
     DO 1 J = 0, N
       I0 = MAX ( -M, -J )
       DO 2 I = I0, -1
         DO 3 JJ = I0 - I, -1
           DO 3 L = 0, NMAT
 3           A(L,I,J) = A(L,I,J) - A(L,JJ,I+J) * A(L,I+JJ,J)
         DO 2 L = 0, NMAT
 2         A(L,I,J) = A(L,I,J) * A(L,0,I+J)
       DO 4 L = 0, NMAT
 4       EPSS(L) = EPS * A(L,0,J)
       DO 5 JJ = I0, -1
         DO 5 L = 0, NMAT
 5         A(L,0,J) = A(L,0,J) - A(L,JJ,J) ** 2
       DO 1 L = 0, NMAT
 1       A(L,0,J) = 1. / SQRT ( ABS (EPSS(L) + A(L,0,J)) )
     DO 6 I = 0, NRHS
       DO 7 K = 0, N
         DO 8 L = 0, NMAT
 8         B(I,L,K) = B(I,L,K) * A(L,0,K)
         DO 7 JJ = 1, MIN (M, N-K)
           DO 7 L = 0, NMAT
 7           B(I,L,K+JJ) = B(I,L,K+JJ) - A(L,-JJ,K+JJ) * B(I,L,K)
       DO 6 K = N, 0, -1
         DO 9 L = 0, NMAT
 9         B(I,L,K) = B(I,L,K) * A(L,0,K)
         DO 6 JJ = 1, MIN (M, K)
           DO 6 L = 0, NMAT
 6           B(I,L,K-JJ) = B(I,L,K-JJ) - A(L,-JJ,K) * B(I,L,K)

  4. Results for Optimizing Perfect Nests
     Speedup on a Digital Turbolaser with 8 300-MHz 21164 processors

  5. Optimizing Arbitrary Loop Nesting Using Affine Partitions
     [The Cholesky factorization code of slide 3, shown with per-statement affine partitions; the diagram highlights the L dimension of the A, B, and EPSS statement groups.]

  6. Results with Affine Partitioning + Blocking

  7. New Transform Theory
     • Domain: arbitrary loop nesting; each instruction optimized separately
     • Unifies: permutation, skewing, reversal, fusion, fission, statement reordering
     • Supports blocking across all loop nests
     • Optimal: maximum degree of parallelism with the minimum degree of synchronization
     • Minimizes communication by aligning the computation and pipelining
     • More powerful & simpler software engineering
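
A minimal sketch (not from the talk) of how the classical transforms listed above are all instances of affine mappings on iteration vectors; the 2-deep loop nest and the Python/numpy framing are mine for illustration:

import numpy as np

i = np.array([3, 5])            # one iteration (i, j) of a hypothetical 2-deep loop nest

permute = np.array([[0, 1],     # loop interchange: (i, j) -> (j, i)
                    [1, 0]])
reverse = np.array([[1, 0],     # reversal of the inner loop: (i, j) -> (i, -j)
                    [0, -1]])
skew    = np.array([[1, 0],     # skewing: (i, j) -> (i, i + j)
                    [1, 1]])

print(permute @ i, reverse @ i, skew @ i)
# Affine partitioning searches this space of mappings directly (plus per-statement
# offsets, which cover fusion, fission, and statement reordering) instead of
# composing the individual transforms pass by pass.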

  8. A Simple Example
     FOR i = 1 TO n DO
       FOR j = 1 TO n DO
         A[i,j] = A[i,j] + B[i-1,j];   (S1)
         B[i,j] = A[i,j-1] * B[i,j];   (S2)
     [Diagram: the iteration spaces of S1 and S2 over i and j.]

  9. Best Parallelization Scheme
     SPMD code (let p be the processor’s ID number):
        if (1-n <= p <= n) then
          if (1 <= p) then
            B[p,1] = A[p,0] * B[p,1];                     (S2)
          for i1 = max(1,1+p) to min(n,n-1+p) do
            A[i1,i1-p] = A[i1,i1-p] + B[i1-1,i1-p];       (S1)
            B[i1,i1-p+1] = A[i1,i1-p] * B[i1,i1-p+1];     (S2)
          if (p <= 0) then
            A[n+p,n] = A[n+p,n] + B[n+p-1,n];             (S1)
     The solution can be expressed as affine partitions:
        S1: execute iteration (i, j) on processor i-j.
        S2: execute iteration (i, j) on processor i-j+1.
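
As a quick sanity check of this scheme (my own illustration, not part of the talk), a brute-force Python script confirms that with S1 on processor i-j and S2 on processor i-j+1 every flow dependence of the slide-8 example stays on one processor, so no communication is needed:

n = 6
proc_S1 = lambda i, j: i - j
proc_S2 = lambda i, j: i - j + 1

for i in range(1, n + 1):
    for j in range(1, n + 1):
        if i > 1:   # S1 at (i, j) reads B[i-1, j], written by S2 at (i-1, j)
            assert proc_S1(i, j) == proc_S2(i - 1, j)
        if j > 1:   # S2 at (i, j) reads A[i, j-1], written by S1 at (i, j-1)
            assert proc_S2(i, j) == proc_S1(i, j - 1)
print("no cross-processor flow dependences")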

  10. Maximum Parallelism & No Communication
     Let Fxj be an access to array x in statement j,
         ij be an iteration index for statement j,
         Bj ij ≥ 0 represent the loop bound constraints for statement j.
     Find Cj, which maps an instance of statement j to a processor, such that
         ∀ ij, ik with Bj ij ≥ 0 and Bk ik ≥ 0:
             Fxj(ij) = Fxk(ik)  ⟹  Cj(ij) = Ck(ik)
     with the objective of maximizing the rank of Cj.
     [Diagram: iterations i1, i2 map through the array accesses F1(i1), F2(i2) to the same array element, and through C1(i1), C2(i2) to a processor ID.]
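
To make the constraint concrete, here is one instantiation (my own working, in the slide’s notation) for array B shared between S1 and S2 of the slide-8 example; the affine partitions from slide 9 satisfy it:

     S1 reads B[i-1, j], so F_B,S1(i, j) = (i-1, j);  S2 writes B[i', j'], so F_B,S2(i', j') = (i', j').
     F_B,S1(i, j) = F_B,S2(i', j')  ⟹  i' = i-1 and j' = j,
     so the constraint requires C_S1(i, j) = C_S2(i-1, j).
     With C_S1(i, j) = i-j and C_S2(i', j') = i'-j'+1, both sides equal i-j, so the constraint holds with rank-1 mappings.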

  11. Algorithm
     ∀ ij, ik with Bj ij ≥ 0 and Bk ik ≥ 0:  Fxj(ij) = Fxk(ik)  ⟹  Cj(ij) = Ck(ik)
     • Rewrite the partition constraints as systems of linear equations
        • use the affine form of Farkas’ Lemma to rewrite the constraints as systems of linear inequalities in C and λ
        • use the Fourier-Motzkin algorithm to eliminate the Farkas multipliers λ and obtain systems of linear equations AC = 0
     • Find solutions using linear algebra techniques
        • the null space of the matrix A is a solution for C with maximum rank
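
A minimal Python sketch of the last step only, assuming the Farkas and Fourier-Motzkin phases have already reduced the constraints to a linear system AC = 0; the matrix below is made up for illustration, and sympy stands in for the linear algebra package:

import sympy as sp

A = sp.Matrix([[1, -1, 0, 0],
               [0, 0, 1, -1]])     # hypothetical constraint matrix from the elimination phase

basis = A.nullspace()              # every vector in the null space satisfies A c = 0
C = sp.Matrix.hstack(*basis)       # using the whole basis gives a maximum-rank mapping
print(C, "rank =", C.rank())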

  12. Pipelining: Alternating Direction Integration Example
     Requires transposing data:
        DO J = 1 to N (parallel)
          DO I = 1 to N
            A(I,J) = f(A(I,J), A(I-1,J))
        DO J = 1 to N
          DO I = 1 to N (parallel)
            A(I,J) = g(A(I,J), A(I,J-1))
     Moves only boundary data:
        DO J = 1 to N (parallel)
          DO I = 1 to N
            A(I,J) = f(A(I,J), A(I-1,J))
        DO J = 1 to N (pipelined)
          DO I = 1 to N
            A(I,J) = g(A(I,J), A(I,J-1))

  13. Finding the Maximum Degree of Pipelining
     Let Fxj be an access to array x in statement j,
         ij be an iteration index for statement j,
         Bj ij ≥ 0 represent the loop bound constraints for statement j.
     Find Tj, which maps an instance of statement j to a time stage, such that
         ∀ ij, ik with Bj ij ≥ 0 and Bk ik ≥ 0:
             (ij ≺ ik) ∧ (Fxj(ij) = Fxk(ik))  ⟹  Tj(ij) ≼ Tk(ik)  (lexicographically)
     with the objective of maximizing the rank of Tj.
     [Diagram: iterations i1, i2 map through the array accesses F1(i1), F2(i2) to the same array element, and through T1(i1), T2(i2) to a time stage.]

  14. Key Insight
     • Choice in the time mapping => (pipelined) parallelism
     • Degrees of parallelism = rank(T) - 1

  15. Putting it All Together
     • Find maximum outer-loop parallelism with minimum synchronization
        • Divide into strongly connected components
        • Apply the processor-mapping algorithm (no communication) to the program
        • If no parallelism is found, apply the time-mapping algorithm to find pipelining
        • If no pipelining is found (the outer loop stays sequential), repeat the process on the inner loops
     • Minimize communication
        • Use a greedy method to order communicating pairs
        • Try to find communication-free, or neighborhood-only, communication by solving similar equations
     • Aggregate computations on consecutive data to improve spatial locality
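
A rough, runnable sketch of this strategy (my paraphrase, not SUIF2 source); processor_mapping and time_mapping are stand-ins for the algorithms of slides 10 and 13, and the loop/SCC representation is a made-up dictionary:

def parallelize(loop, processor_mapping, time_mapping):
    plan = []
    for scc in loop["sccs"]:
        C = processor_mapping(scc)
        if C is not None:                       # communication-free parallelism found
            plan.append(("parallel", scc["name"], C))
            continue
        T = time_mapping(scc)
        if T is not None:                       # rank(T) - 1 degrees of pipelined parallelism
            plan.append(("pipelined", scc["name"], T))
            continue
        for inner in scc.get("inner", []):      # outer loop stays sequential; recurse inward
            plan += parallelize(inner, processor_mapping, time_mapping)
    return plan

# Toy run: pretend only "s2" admits a processor mapping and "s1" only pipelines.
demo = {"sccs": [{"name": "s1"}, {"name": "s2"}]}
print(parallelize(demo,
                  processor_mapping=lambda s: "i-j" if s["name"] == "s2" else None,
                  time_mapping=lambda s: "j" if s["name"] == "s1" else None))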

  16. Current Status
     • Completed:
        • Mathematics package
           • Integrated Omega: Pugh’s Presburger arithmetic package
           • Linear algebra package: Farkas’ lemma, Gaussian elimination, finding null spaces
        • Can find communication-free partitions
     • In progress:
        • Rest of affine partitioning
        • Code generation

  17. Interprocedural Analysis
     Two major design choices in program analysis:
     • Across procedures
        • No interprocedural analysis
        • Interprocedural: context-insensitive
        • Interprocedural: context-sensitive
     • Within a procedure
        • Flow-insensitive
        • Flow-sensitive: interval/region based
        • Flow-sensitive: iterative over the flow graph
        • Flow-sensitive: SSA based

  18. Efficient Context-Sensitive Analysis
     • Bottom-up
        • A region/interval: a procedure or a loop
        • An edge: a call or code in an inner scope
        • Summarize each region (with a transfer function)
        • Find strongly connected components (SCCs)
        • Bottom-up traversal of the SCCs
        • Iteration to find a fixed point for recursive functions
     • Top-down
        • Top-down propagation of values
        • Iteration to find a fixed point for recursive functions
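
An illustrative Python sketch of the bottom-up half, using a made-up call graph and a simple "set of modified globals" as the region summary; networkx stands in for the call graph/SCC data structures, and this is not the SUIF2 framework itself:

import networkx as nx

calls = {"main": ["f", "g"], "f": ["g", "f"], "g": []}    # hypothetical call graph (f is recursive)
local_mods = {"main": {"x"}, "f": {"y"}, "g": {"z"}}      # globals modified directly by each procedure

cg = nx.DiGraph((caller, callee) for caller, cs in calls.items() for callee in cs)
cg.add_nodes_from(calls)
cond = nx.condensation(cg)                                # DAG of SCCs
summary = {}
for comp in reversed(list(nx.topological_sort(cond))):    # callees before callers: bottom-up
    scc = cond.nodes[comp]["members"]
    for p in scc:
        summary[p] = set(local_mods[p])
    changed = True
    while changed:                                        # fixed point, needed for recursive SCCs
        changed = False
        for p in scc:
            for callee in calls[p]:
                merged = summary[p] | summary.get(callee, set())
                if merged != summary[p]:
                    summary[p], changed = merged, True
print(summary)    # g -> {z}, f -> {y, z}, main -> {x, y, z}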

  19. Interprocedural Framework Architecture
     [Architecture diagram:]
     • Driver: bottom-up, top-down, and linear traversals
     • Compound handlers: procedure calls and returns; regions & statements; basic blocks
     • Primitive handlers: e.g. array summaries, pointer aliases
     • Data structures: call graph and SCCs; regions; control flow graphs

  20. Interprocedural Framework Architecture
     • Interprocedural analysis data structures, e.g. call graphs, SSA form, regions or intervals
     • Handlers: orthogonal sets of handlers for different groups of constructs
        • Primitive: the user specifies the analysis-specific semantics of primitives
        • Compound: handles compound statements and calls
        • The user chooses between handlers of different strengths, e.g. no interprocedural analysis versus context-sensitive; flow-insensitive versus flow-sensitive (CFG-based)
        • All the handlers are registered in a visitor
     • Driver
        • Invoked by the user’s request for information (demand driven)
        • Builds prepass data structures
        • Invokes the right set of handlers in the right order (e.g. bottom-up traversal of the call graph)
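
A Python stand-in for the handler/driver split sketched above (ModRefPrimitives, SequenceCompound, and Driver are invented names for illustration; the real SUIF2 interfaces are C++ and considerably richer):

class ModRefPrimitives:
    """Primitive handler: user-supplied, analysis-specific semantics (here: mod sets)."""
    def assign(self, target):                 # x = ...
        return {target}
    def call(self, callee_summary):           # reuse the callee's bottom-up summary
        return set(callee_summary)

class SequenceCompound:
    """Compound handler: combines the effects of the statements in a region."""
    def __init__(self, prim):
        self.prim = prim
    def region(self, stmts, summaries):
        mods = set()
        for kind, arg in stmts:
            if kind == "assign":
                mods |= self.prim.assign(arg)
            elif kind == "call":
                mods |= self.prim.call(summaries.get(arg, set()))
        return mods

class Driver:
    """Driver: walks procedures in a given bottom-up order, invoking the registered handlers."""
    def __init__(self, compound):
        self.compound, self.summaries = compound, {}
    def run(self, procedures, bottom_up_order):
        for name in bottom_up_order:
            self.summaries[name] = self.compound.region(procedures[name], self.summaries)
        return self.summaries

procs = {"g": [("assign", "z")], "f": [("assign", "y"), ("call", "g")]}
print(Driver(SequenceCompound(ModRefPrimitives())).run(procs, ["g", "f"]))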

  21. Pointer Alias Analysis
     • Steensgaard’s pointer alias analysis (completed)
     • Flow-insensitive and context-insensitive, type-inference based analysis
     • Very efficient: near linear-time analysis
     • Very inaccurate
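
A minimal sketch of the unification flavor of Steensgaard’s analysis (simplified: no fields, no function pointers, no path compression), showing why it is nearly linear time but coarse; the variable names and helpers here are mine for illustration:

class Node:
    def __init__(self, name):
        self.name, self.parent, self.pointee = name, None, None
    def find(self):
        n = self
        while n.parent:
            n = n.parent
        return n

def pointee(n):                       # the equivalence class that n may point to
    r = n.find()
    if r.pointee is None:
        r.pointee = Node("*" + r.name)
    return r.pointee.find()

def unify(a, b):                      # merge two classes and, recursively, their pointees
    a, b = a.find(), b.find()
    if a is b:
        return
    pa, pb = a.pointee, b.pointee
    b.parent = a
    if pa and pb:
        unify(pa, pb)
    elif pb:
        a.pointee = pb

v = {x: Node(x) for x in "pqab"}
unify(pointee(v["p"]), v["a"])               # p = &a
unify(pointee(v["q"]), v["b"])               # q = &b
unify(pointee(v["p"]), pointee(v["q"]))      # p = q: what p and q point to is unified
print(pointee(v["p"]) is pointee(v["q"]))    # True: a and b now land in one points-to class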

  22. Parallelization Analysis
     • Scalar analysis
        • Mod/ref, reduction recognition: bottom-up, flow-insensitive
        • Liveness for privatization: bottom-up and top-down, flow-sensitive
     • Region-based array analysis
        • May-write, must-write, read, upwards-exposed read: bottom-up
        • Array liveness for privatization: bottom-up and top-down
        • Uses our interprocedural framework + Omega
     • Symbolic analysis
        • Find linear relationships between scalar variables to improve array analysis
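
A tiny illustration of how such region summaries compose when two regions are executed in sequence (the set-of-strings array sections are a stand-in; the actual summaries are systems of affine inequalities over array indices):

def sequence(s1, s2):
    return {
        "may_write":  s1["may_write"]  | s2["may_write"],
        "must_write": s1["must_write"] | s2["must_write"],
        "read":       s1["read"]       | s2["read"],
        # reads of the second region are exposed only where the first does not surely write
        "exposed":    s1["exposed"]    | (s2["exposed"] - s1["must_write"]),
    }

r1 = {"may_write": {"A[1:10]"}, "must_write": {"A[1:10]"}, "read": set(), "exposed": set()}
r2 = {"may_write": set(), "must_write": set(),
      "read": {"A[1:10]", "B[1:10]"}, "exposed": {"A[1:10]", "B[1:10]"}}
print(sequence(r1, r2)["exposed"])    # {'B[1:10]'}: the read of A is covered by r1's write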

  23. Parallel Code Generation
     • Loop bound generation: uses Omega, based on the affine mappings
     • Outlining and cloning primitives
     • Special IR nodes to represent parallelization primitives
        • Allow a succinct and high-level description of parallelization decisions
        • For communication to and from users
        • Reduction and private variables and primitives
        • Synchronization and parallelization primitives
     Pipeline: SUIF → SUIF + parallel IR → SUIF
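
A small sketch of what the outlining primitive produces (illustrative Python, not the SUIF2 IR nodes; outlined_body and run_parallel are made-up names): the parallel loop body becomes a separate function, and a runtime stub invokes it once per processor chunk.

def outlined_body(A, B, lo, hi):
    # the body of the original "DO I = lo, hi" loop, now a callable unit
    for i in range(lo, hi):
        A[i] = A[i] + B[i]

def run_parallel(A, B, n, nprocs=4):
    # stand-in for the parallel runtime: each call would run on its own processor
    chunk = (n + nprocs - 1) // nprocs
    for p in range(nprocs):
        outlined_body(A, B, p * chunk, min(n, (p + 1) * chunk))

A, B = list(range(8)), list(range(8))
run_parallel(A, B, len(A))
print(A)    # [0, 2, 4, 6, 8, 10, 12, 14]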

  24. Status
     • Completed
        • Call graphs, SCCs
        • Steensgaard’s pointer alias analysis
        • Integration of a garbage collector with SUIF
     • In progress
        • Interprocedural analysis framework
        • Array summaries
        • Scalar dependence analysis
        • Parallel code generation
     • To be done
        • Scalar symbolic analysis

  25. Future Work: Basic Compiler Research
     A flexible and integrated platform for new optimizations:
     • Combinations of pointer, object-oriented, and parallelization optimizations to parallelize or SIMDize (MMX) multimedia applications
     • Interaction of garbage collection and exception handling with back-end optimizations
     • Embedded compilers with application-specific additions at the source-language and architectural levels

  26. As a Useful Compiler for High-Performance Computers
     • Basic ingredients of a state-of-the-art parallelizing compiler
        • Requires experimentation, tuning, refinement
        • First implementation of affine partitioning
        • Interprocedural parallelization requires many analyses working together
     • Missing functions
        • Automatic data distribution
        • User interaction needed for parallelizing large code regions
           • SUIF Explorer: a prototype interactive parallelizer in SUIF1
           • Requires tools: algorithms to guide performance tuning, program slices, visualization tools
        • New techniques: extend affine mapping to sparse codes (with permutation index arrays)
        • Fortran 90 front end
        • Debugging support

  27. New-Generation Productivity Tool
     Apply high-level program analysis to increase programmers’ productivity
     • Many existing analyses
        • High-level, interprocedural side-effect analysis with pointers and arrays
     • New analyses
        • Flow- and context-sensitive pointer alias analysis
        • Interprocedural control-path-based analysis
     • Examples of tools
        • Find bugs in programs
        • Prove or disprove user invariants
        • Generate test cases
        • Interactive demand-driven analysis to aid program debugging
     • Can also apply to Verilog/VHDL to improve hardware verification

  28. Finally ... • The system has to be actively maintained and supported to be useful.

  29. The End
