Working and Researching on Open64 Institute of Computing Technology, Chinese Academy of Sciences
Outline
• Reform Open64 as an aggressive program analysis tool
  • Source code analysis and error checking
• Source-to-source transformation
  • WHIRL to C
  • Extending UPC for GPU cluster
• New targeting
  • Target to LOONGSON CPU
Whole Program Analysis (WPA)
• Aimed at error checking
• A framework
• Pointer analysis
  • The foundation of other program analyses
  • Flow- and context-sensitive
• Program slicing
  • Interprocedural
  • Reduces program size for specific problems
WPA Framework
• [Diagram] IPL summary phase, IPA_LINK, Whole Program Analyzer
• Whole Program Analyzer components: build call graph; construct SSA form for each procedure; FSCS pointer analysis (LevPA); static slicer; static error checker
LevPA -- Level by Level pointer analysis
• A flow- and context-sensitive pointer analysis
• Analyzes millions of lines of code quickly
• The work has been published as
  Hongtao Yu, Jingling Xue, Wei Huo, Zhaoqing Zhang, Xiaobing Feng. Level by Level: Making Flow- and Context-Sensitive Pointer Analysis Scalable for Millions of Lines of Code. In Proceedings of the 2010 International Symposium on Code Generation and Optimization. April 24-28, 2010, Toronto, Canada.
LevPA
• Level by level analysis
  • Analyze the pointers in decreasing order of their points-to levels
    • Suppose int **q, *p, x; then q has level 2, p has level 1 and x has level 0
    • A variable can be referenced directly or indirectly through dereferences of another pointer
• Fast flow-sensitive analysis on full sparse SSA
• Fast and accurate context-sensitive analysis using a full transfer function
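The ordering rule above can be sketched in a few lines of C. This is only an illustration: `struct var` and `analysis_order` are hypothetical names, and the levels are assumed to be given (here they match the declared pointer depth, as in the `int **q, *p, x;` example).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical record: a variable and its points-to level (ptl). */
struct var {
    const char *name;
    int ptl;
};

/* Fills order[] with indices into vars[], highest points-to level first,
 * which is the order in which a level-by-level analysis visits them.
 * Returns the number of indices written. */
size_t analysis_order(const struct var *vars, size_t n, size_t *order)
{
    assert(vars != NULL && order != NULL);

    int highest = 0;
    for (size_t i = 0; i < n; i++)
        if (vars[i].ptl > highest)
            highest = vars[i].ptl;

    size_t k = 0;
    for (int level = highest; level >= 0; level--)
        for (size_t i = 0; i < n; i++)
            if (vars[i].ptl == level)
                order[k++] = i;
    return k;
}
```

Given the variables x (level 0), q (level 2) and p (level 1), the function visits q first, then p, then x.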
Framework
• Compute points-to levels
• For each points-to level, from the highest to the lowest
  • Bottom-up: evaluate transfer functions
  • Top-down: propagate points-to sets
• Incrementally build the call graph
Figure 1. Level-by-level pointer analysis (LevPA).
Example
    int obj, t;
    main() {
    L1:   int **x, **y;
    L2:   int *a, *b, *c, *d, *e;
    L3:   x = &a; y = &b;
    L4:   foo(x, y);
    L5:   *b = 5;
    L6:   if ( … ) { x = &c; y = &e; }
    L7:   else { x = &d; y = &d; }
    L8:   c = &t;
    L9:   foo(x, y);
    L10:  *e = 10; }
    void foo(int **p, int **q) {
    L11:  *p = *q;
    L12:  *q = &obj; }
• ptl(x, y, p, q) = 2
• ptl(a, b, c, d, e) = 1
• ptl(t, obj) = 0
• Analysis order: first { x, y, p, q }, then { a, b, c, d, e }, last { t, obj }
Bottom-up analyze level 2
    void foo(int **p, int **q) {
    L11:  *p1 = *q1;
    L12:  *q1 = &obj; }
• p1's points-to set depends on formal-in p
• q1's points-to set depends on formal-in q
    main() {
    L1:   int **x, **y;
    L2:   int *a, *b, *c, *d, *e;
    L3:   x1 = &a; y1 = &b;
    L4:   foo(x1, y1);
    L5:   *b = 5;
    L6:   if ( … ) { x2 = &c; y2 = &e; }
    L7:   else { x3 = &d; y3 = &d; }
          x4 = ϕ(x2, x3); y4 = ϕ(y2, y3)
    L8:   c = &t;
    L9:   foo(x4, y4);
    L10:  *e = 10; }
• x1 → { a }, y1 → { b }
• x2 → { c }, y2 → { e }
• x3 → { d }, y3 → { d }
• x4 → { c, d }, y4 → { e, d }
Full-sparse Analysis
• Achieve flow-sensitivity flow-insensitively
  • Regard each SSA name as a unique variable
  • Set-constraint-based pointer analysis
• Full sparse
  • Saves time
  • Saves space
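A minimal sketch of the idea, using the level 2 pointers of main from the example. Because every SSA name has a single definition, the constraints can be solved in any order and still give flow-sensitive results. The bitset encoding, `struct constraint` and `solve` are illustrative assumptions, not the LevPA implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint32_t pts_set;                 /* points-to set stored as a bitset */

/* Abstract objects a..e from the example, one bit each. */
#define OBJ_A ((pts_set)1 << 0)
#define OBJ_B ((pts_set)1 << 1)
#define OBJ_C ((pts_set)1 << 2)
#define OBJ_D ((pts_set)1 << 3)
#define OBJ_E ((pts_set)1 << 4)

/* Each SSA name is treated as its own variable. */
enum ssa_name { X1, Y1, X2, Y2, X3, Y3, X4, Y4, NUM_NAMES };

/* dst includes addr plus the sets of src1 and src2 (-1 means "no source"). */
struct constraint {
    enum ssa_name dst;
    pts_set addr;
    int src1, src2;
};

/* Constraints for main, deliberately listed with the phi nodes first:
 * statement order does not matter to the solver. */
static const struct constraint main_level2[] = {
    { X4, 0, X2, X3 },        /* x4 = phi(x2, x3) */
    { Y4, 0, Y2, Y3 },        /* y4 = phi(y2, y3) */
    { X1, OBJ_A, -1, -1 },    /* x1 = &a */
    { Y1, OBJ_B, -1, -1 },    /* y1 = &b */
    { X2, OBJ_C, -1, -1 },    /* x2 = &c */
    { Y2, OBJ_E, -1, -1 },    /* y2 = &e */
    { X3, OBJ_D, -1, -1 },    /* x3 = &d */
    { Y3, OBJ_D, -1, -1 },    /* y3 = &d */
};
enum { MAIN_LEVEL2_COUNT = sizeof main_level2 / sizeof main_level2[0] };

/* Iterates the inclusion constraints until no points-to set grows. */
void solve(const struct constraint *cs, size_t n, pts_set pts[NUM_NAMES])
{
    assert(cs != NULL && pts != NULL);
    for (size_t i = 0; i < NUM_NAMES; i++)
        pts[i] = 0;

    int changed = 1;
    while (changed) {
        changed = 0;
        for (size_t i = 0; i < n; i++) {
            pts_set in = cs[i].addr;
            if (cs[i].src1 >= 0) in |= pts[cs[i].src1];
            if (cs[i].src2 >= 0) in |= pts[cs[i].src2];
            if ((pts[cs[i].dst] | in) != pts[cs[i].dst]) {
                pts[cs[i].dst] |= in;
                changed = 1;
            }
        }
    }
}
```

Calling `solve(main_level2, MAIN_LEVEL2_COUNT, pts)` yields the sets listed for the example, for instance x4 → { c, d } and y4 → { e, d }.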
Top-down analyze level 2
    void foo(int **p, int **q) {
    L11:  *p = *q;
    L12:  *q = &obj; }
• main: propagate points-to sets to foo from each call site
  • L4: foo.p → { a }, foo.q → { b }
  • L9: foo.p → { c, d }, foo.q → { d, e }
  • Merged: foo.p → { a, c, d }, foo.q → { b, d, e }
Top-down analyze level 2
• foo: expand pointer dereferences
    void foo(int **p, int **q) {
          μ(b, d, e)
    L11:  *p1 = *q1;
          χ(a, c, d)
    L12:  *q1 = &obj;
          χ(b, d, e)
    }
• Calling contexts are merged here
Context Condition
• To be context-sensitive
• Points-to relation c_i
  • p ⟹ v (p → v): p must (may) point to v, where p is a formal parameter
• Context condition ℂ(c_1,…,c_k)
  • A Boolean function over higher-level points-to relations
• Context-sensitive μ and χ
  • μ(v_i, ℂ(c_1,…,c_k))
  • v_{i+1} = χ(v_i, M, ℂ(c_1,…,c_k))
  • M ∈ {may, must} indicates a weak or strong update
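One way to read the χ operator for a single calling context is sketched below. This is an interpretation under stated assumptions: points-to sets are bitsets, the context condition has already been evaluated to a Boolean, and `apply_chi` is a hypothetical helper rather than anything from Open64.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef uint32_t pts_set;                  /* points-to set as a bitset */

enum update_kind { UPDATE_MAY, UPDATE_MUST };

/* Evaluates v_{i+1} = chi(v_i, M, C) in one calling context.
 *   old_pts : points-to set of v_i
 *   new_pts : set written by the indirect store
 *   kind    : M, may (weak update) or must (strong update)
 *   cond    : value of the context condition C in this context */
pts_set apply_chi(pts_set old_pts, pts_set new_pts,
                  enum update_kind kind, bool cond)
{
    assert(kind == UPDATE_MAY || kind == UPDATE_MUST);

    if (!cond)
        return old_pts;            /* the store cannot reach v here */
    if (kind == UPDATE_MUST)
        return new_pts;            /* strong update: old value is killed */
    return old_pts | new_pts;      /* weak update: old value may survive */
}
```

With `old_pts = 0x1` and `new_pts = 0x2`, a must update under a true condition gives `0x2`, a may update gives `0x3`, and a false condition leaves `0x1` unchanged.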
Context-sensitive μ and χ
    void foo(int **p, int **q) {
          μ(b, q⟹b)
          μ(d, q→d)
          μ(e, q→e)
    L11:  *p1 = *q1;
          a = χ(a, must, p⟹a)
          c = χ(c, may, p→c)
          d = χ(d, may, p→d)
    L12:  *q1 = &obj;
          b = χ(b, must, q⟹b)
          d = χ(d, may, q→d)
          e = χ(e, may, q→e)
    }
Bottom-up analyze level 1
    void foo(int **p, int **q) {
          μ(b1, q⟹b)
          μ(d1, q→d)
          μ(e1, q→e)
    L11:  *p1 = *q1;
          a2 = χ(a1, must, p⟹a)
          c2 = χ(c1, may, p→c)
          d2 = χ(d1, may, p→d)
    L12:  *q1 = &obj;
          b2 = χ(b1, must, q⟹b)
          d3 = χ(d2, may, q→d)
          e2 = χ(e1, may, q→e)
    }
• Trans(foo, a) = < { }, { <b, q⟹b>, <d, q→d>, <e, q→e> }, p⟹a, must >
• Trans(foo, c) = < { }, { <b, q⟹b>, <d, q→d>, <e, q→e> }, p→c, may >
• Trans(foo, b) = < { <obj, q⟹b> }, { }, q⟹b, must >
• Trans(foo, e) = < { <obj, q→e> }, { }, q→e, may >
• Trans(foo, d) = < { <obj, q→d> }, { <b, p→d ∧ q⟹b>, <d, p→d>, <e, p→d ∧ q→e> }, p→d ∨ q→d, may >
Bottom-up analyze level 1
    int obj, t;
    main() {
    L1:   int **x, **y;
    L2:   int *a, *b, *c, *d, *e;
    L3:   x1 = &a; y1 = &b;
          μ(b1, true)
    L4:   foo(x1, y1);
          a2 = χ(a1, must, true)
          b2 = χ(b1, must, true)
          c2 = χ(c1, may, true)
          d2 = χ(d1, may, true)
          e2 = χ(e1, may, true)
    L5:   *b1 = 5;
    L6:   if ( … ) { x2 = &c; y2 = &e; }
    L7:   else { x3 = &d; y3 = &d; }
          x4 = ϕ(x2, x3); y4 = ϕ(y2, y3)
    L8:   c1 = &t;
          μ(d1, true)
          μ(e1, true)
    L9:   foo(x4, y4);
          a2 = χ(a1, must, true)
          b2 = χ(b1, must, true)
          c2 = χ(c1, may, true)
          d2 = χ(d1, may, true)
          e2 = χ(e1, may, true)
    L10:  *e1 = 10; }
Full context-sensitive analysis
• Compute a complete transfer function for each procedure
• The transfer function is cheap to represent and to apply
  • Represent calling contexts by calling conditions
    • Merge similar calling contexts
    • Costs less than using calling strings
  • Implement context conditions with BDDs
    • Represent context conditions compactly
    • Evaluate Boolean operations efficiently
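To make "context conditions as Boolean functions" concrete, the sketch below stores a condition over three points-to relations as a truth table in one machine word. This is a stand-in chosen for brevity, not a BDD: it shares the property that equal functions have equal representations, but it does not scale beyond a handful of variables. All names (`cond_t`, `cond_var`, and so on) are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Number of points-to relations c0..c2; keep it at 4 or fewer so that
 * the 2^NUM_RELS truth-table rows fit comfortably in 32 bits. */
#define NUM_RELS 3
#define NUM_ROWS (1u << NUM_RELS)

/* Bit r of a cond_t is the condition's value under assignment r,
 * where bit i of r is the truth value of relation c_i. */
typedef uint32_t cond_t;

/* The condition "relation c_i holds". */
cond_t cond_var(unsigned i)
{
    assert(i < NUM_RELS);
    cond_t table = 0;
    for (unsigned r = 0; r < NUM_ROWS; r++)
        if ((r >> i) & 1u)
            table |= (cond_t)1 << r;
    return table;
}

/* Conjunction and disjunction are single bitwise operations. */
cond_t cond_and(cond_t f, cond_t g) { return f & g; }
cond_t cond_or(cond_t f, cond_t g)  { return f | g; }

/* Value of the condition in one calling context (one assignment). */
bool cond_eval(cond_t f, unsigned assignment)
{
    assert(assignment < NUM_ROWS);
    return ((f >> assignment) & 1u) != 0;
}
```

For example, taking c0 as p→d and c1 as q→d, the condition p→d ∨ q→d from Trans(foo, d) is `cond_or(cond_var(0), cond_var(1))`; it evaluates to true in any context where either relation holds.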
Experiment
• Analyzes millions of lines of code in minutes
• Faster than the state-of-the-art FSCS pointer analysis algorithms
Table 2. Performance (secs).
Future work
• The points-to results are currently used only for error checking
• We are working on
  • Serving optimization
    • Let the WPA framework generate code (connect to CG)
    • Make points-to sets available to optimization passes
    • New optimizations under the WPA framework
  • Serving parallelization
    • Provide precise information to programmers to guide parallelization
An interprocedural slicer
• Based on the PDG (program dependence graph)
• Compressing the PDG
  • Merge nodes that are aliased
• Accommodates multiple pointer analyses
• Allows many problems to be solved on a slice, reducing time and space costs
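At its core, a backward slice is a reachability computation over dependence edges. The sketch below shows that core on a small adjacency matrix. It is a simplification: it ignores the calling-context matching that an interprocedural slicer needs, and `struct pdg` and `backward_slice` are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_NODES 16

/* deps[u][v] is true when statement u depends on statement v
 * (through a data or control dependence). */
struct pdg {
    size_t n;
    bool deps[MAX_NODES][MAX_NODES];
};

/* Marks in in_slice[] every statement the criterion transitively
 * depends on, including the criterion itself. Returns the slice size. */
size_t backward_slice(const struct pdg *g, size_t criterion,
                      bool in_slice[MAX_NODES])
{
    assert(g != NULL && g->n <= MAX_NODES && criterion < g->n);

    size_t worklist[MAX_NODES];   /* each node is pushed at most once */
    size_t top = 0, count = 0;

    for (size_t i = 0; i < MAX_NODES; i++)
        in_slice[i] = false;

    in_slice[criterion] = true;
    worklist[top++] = criterion;

    while (top > 0) {
        size_t u = worklist[--top];
        count++;
        for (size_t v = 0; v < g->n; v++) {
            if (g->deps[u][v] && !in_slice[v]) {
                in_slice[v] = true;
                worklist[top++] = v;
            }
        }
    }
    return count;
}
```

Statements left outside the slice cannot affect the criterion, so a checker that only cares about the criterion can skip them.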
Application of slice
• Now aiding program error checking
  • Reduce the number of states to be checked
    • Use Saturn as our error checker
    • Input slices to Saturn instead of the whole program
    • After slicing, Saturn detects errors in file operations 11.59 times faster and errors in memory operations 2.06 times faster
  • Improve the accuracy of error checking tools
    • Use Fastcheck as our error checker
    • Fastcheck detects more true errors
Improvement on whirl2c
• Previous status
  • Whirl2c was designed as a debugging aid for compiler engineers working on IPA and LNO
  • The Berkeley UPC group and the Houston OpenUH group extended whirl2c somewhat, but it still cannot support large applications and various optimizations
• Problem
  • Type information becomes incorrect because of transformations
Improvement on whirl2c
• Our work
  • Improve whirl2c so that its output can be recompiled and executed
  • Passes the SPEC2000 C/C++ programs under O0/O2/O3+IPA, based on pathscale-2.2
• Motivation
  • Some customers require us not to touch their platforms
  • Support the retargetability of some platform-independent optimizations
  • Support debugging the whirl2c output with gdb
Improvement on whirl2c
• Incorrect information due to transformation
• [Figure: code before structure folding and after structure folding; frontend, whirl2c, wrong output]
Improvement on whirl2c
• Incorrect type information is mainly related to pointer, array and structure types and their compositions
• We re-infer the correct type information from basic types
  • Basic type information is used to generate assembly code, so it is reliable
  • Array element size is also reliable
  • A series of rules derives the correct type information from basic type information, array element size information and operators
• Information that whirl2c needs but that optimizations have made incorrect is corrected just before whirl2c runs, which requires little change to the existing IR and whirl2c
Extending UPC with Hierarchical Parallelism
• UPC (Unified Parallel C), a parallel extension to ISO C99
  • A dialect of the PGAS (Partitioned Global Address Space) languages
  • Suitable for distributed memory machines, shared memory systems and hybrid memory systems
  • Good performance, portability and programmability
• Important UPC features
  • SPMD parallelism
  • Shared data is partitioned into segments, each of which has affinity to one UPC thread; shared data is referenced through shared pointers
  • Global workload partitioning: upc_forall with an affinity expression
• ICT extends UPC with hierarchical parallelism
  • Extend data distribution for shared arrays
  • Hybrid SPMD with implicit thread hierarchy
  • Realize important optimizations targeting GPU clusters
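The affinity rule behind these features can be written down in plain C. In standard UPC, element i of a shared array with block size B has affinity to thread (i / B) mod THREADS, and a upc_forall loop whose affinity expression is the address of that element runs iteration i on that thread. The helper names below are hypothetical and the code only models the mapping; it is not UPC.

```c
#include <assert.h>
#include <stddef.h>

/* Thread that element i has affinity to, for a shared array laid out
 * with the given block size across the given number of threads. */
size_t affinity_thread(size_t i, size_t block, size_t threads)
{
    assert(block > 0 && threads > 0);
    return (i / block) % threads;
}

/* Number of iterations of a loop over indices 0..n-1 that thread `me`
 * would execute if each iteration ran on the thread owning element i. */
size_t iterations_on_thread(size_t n, size_t block, size_t threads,
                            size_t me)
{
    size_t count = 0;
    for (size_t i = 0; i < n; i++)
        if (affinity_thread(i, block, threads) == me)
            count++;
    return count;
}
```

With 10 elements, block size 2 and 4 threads, thread 0 owns elements 0, 1, 8 and 9, so it executes 4 of the 10 iterations.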
Source-to-source Compiler, built upon Berkeley UPC (Open64)
• Frontend support
• Analysis and transformation on upc_forall loops
  • Shared memory management based on reuse analysis
  • Data regroup analysis for global memory coalescing
    • Structure splitting and array transpose
  • Instrumentation for memory consistency (in collaboration with the DSM system)
  • Affinity-aware loop tiling
    • For multidimensional data blocking on shared arrays
  • Create data environments for kernel loops, leveraging array section analysis
    • Copy in, copy out, private (allocation), formal arguments
• CUDA kernel code generation and runtime instrumentation
  • Kernel function and kernel invocation
• Whirl2c translator: UPC => C + UPCR + CUDA
Memory Optimizations for CUDA
• What data will be put into the shared memory?
  • First, pseudo tiling
  • Extend REGION with reuse degree and region volume
    • Inter-thread and intra-thread
    • Average reuse degree for merged regions
  • 0-1 bin packing problem (SM capacity)
    • Quantify the profit: reuse degree integrated with coalescing attribute
    • Prefer inter-thread reuse
• What is the optimal data layout in global memory?
  • Coalescing attributes of array references
    • Only contiguity constraints are considered
  • Legality analysis
  • Cost model and amortization analysis
  • Code transformations (in a runtime library)
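The shared memory selection step can be illustrated as a 0-1 knapsack: each candidate region has a volume and a profit, and the chosen regions must fit in the shared memory capacity. This is a sketch under assumptions, since the slide does not give the exact formulation: the profit is taken as a single number, volumes are in abstract units, and `struct candidate_region` and `best_profit` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_CAPACITY 64   /* capacity bound in abstract units (assumption) */

/* A candidate array region: space it needs and benefit of caching it. */
struct candidate_region {
    size_t volume;
    unsigned profit;
};

/* Classic 0-1 knapsack by dynamic programming: returns the best total
 * profit of a subset of regions whose volumes sum to at most capacity. */
unsigned best_profit(const struct candidate_region *r, size_t n,
                     size_t capacity)
{
    assert(r != NULL && capacity <= MAX_CAPACITY);

    unsigned best[MAX_CAPACITY + 1] = { 0 };  /* best[c]: profit within c */

    for (size_t i = 0; i < n; i++) {
        /* Walk capacities downwards so each region is used at most once. */
        for (size_t c = capacity; c >= r[i].volume; c--) {
            unsigned with = best[c - r[i].volume] + r[i].profit;
            if (with > best[c])
                best[c] = with;
            if (c == 0)
                break;            /* avoid size_t wrap-around */
        }
    }
    return best[capacity];
}
```

For regions with (volume, profit) of (4, 10), (3, 7) and (2, 6) and a capacity of 5, the best choice is the last two regions, for a profit of 13.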
Extend UPC's Runtime System
• A DSM system on each UPC thread
  • Demand-driven data transfer between GPU and CPU
  • Manages all global variables
  • Grain size: a UPC tile for shared arrays, and a private array as a whole
  • Shuffle remote and local array regions into one contiguous physical block before transferring
• Data transformation for memory coalescing
  • Implemented on the GPU side using a CUDA kernel
  • Leverages shared memory
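The shuffle step can be pictured as packing several source regions into one staging buffer so that a single transfer moves them all. The sketch below shows only the host-side packing with `memcpy`; the actual transfer to the GPU is omitted, and `struct array_region` and `pack_regions` are hypothetical names rather than part of the runtime.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* One array region to be transferred: where it starts and how long it is. */
struct array_region {
    const void *base;
    size_t bytes;
};

/* Copies the regions back to back into staging and returns the number
 * of bytes packed, so one contiguous block can be transferred afterwards. */
size_t pack_regions(const struct array_region *r, size_t n,
                    void *staging, size_t staging_bytes)
{
    size_t offset = 0;
    for (size_t i = 0; i < n; i++) {
        assert(r[i].bytes <= staging_bytes - offset);  /* must fit */
        memcpy((char *)staging + offset, r[i].base, r[i].bytes);
        offset += r[i].bytes;
    }
    return offset;
}
```

Packing a 2-element local region followed by a 3-element remote region produces one 5-element block in the order the regions were listed.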
UPC Performance on CUDA cluster
• CPUs on each node: 2 dual-core AMD Opteron 880
• GPU: NVIDIA GeForce 9800 GX2
• Compilers: nvcc (2.2) -O3; GCC (3.4.6) -O3
• 4-node CUDA cluster connected by Ethernet
Open Source Loongcc
• Targets the LOONGSON CPU
• Based on Open64
  • Main trunk, r2716
• LOONGSON is a MIPS-like processor
  • Has new instructions
• New features
  • LOONGSON machine model
  • LOONGSON feature support
    • FE, LNO, WOPT, CG
  • Edge profiling