Program Profiling: Applications, Algorithms and Tools







Presentation Transcript
1. Program Profiling: Applications, Algorithms and Tools. Thomas Ball, Microsoft Research, May 2001

2. Overview
- Why profile? (Applications)
- What to profile? (Algorithms)
- How to profile? (Infrastructure/Tools)
- Future directions for profiling

3. Why Profile? "If a given portion of a program has no observable effects, then you have no way of knowing if it is executing, if it has finished, if it got part way through and then stopped, or if it produced 'the right answer.' Programmers nearly always must rely on highly indirect measures to determine what happens when their programs execute. This is one reason why debugging is so difficult." [Digital Woes, Lauren Ruth Weiner, 1993, Addison-Wesley]

4. Example I: Mystery Code

```c
#include <stdio.h>
main(t,_,a)
char *a;
{
return!0<t?t<3?main(-79,-13,a+main(-87,1-_,main(-86,0,a+1)+a)):
1,t<_?main(t+1,_,a):3,main(-94,-27+t,a)&&t==2?_<13?
main(2,_+1,"%s %d %d\n"):9:16:t<0?t<-72?main(_,t,
"@n'+,#'/*{}w+/w#cdnr/+,{}r/*de}+,/*{*+,/w{%+,/w#q#n+,/#{l+,/n{n+,/+#n+,/#\
;#q#n+,/+k#;*+,/'r :'d*'3,}{w+K w'K:'+}e#';dq#'l \
q#'+d'K#!/+k#;q#'r}eKK#}w'r}eKK{nl]'/#;#q#n'){)#}w'){){nl]'/+#n';d}rw' i;# \
){nl]!/n{n#'; r{#w'r nc{nl]'/#{l,+'K {rw' iK{;[{nl]'/w#q#n'wk nw' \
iwk{KK{nl]!/w{%'l##w#' i; :{nl]'/*{q#'ld;r'}{nlwb!/*de}'c \
;;{nl'-{}rw]'/+,}##'*}#nc,',#nw]'/+kd'+e}+;#'rdq#w! nr'/ ') }+}{rl#'{n' ')# \
}'+}##(!!/")
:t<-50?_==*a?putchar(31[a]):main(-65,_,a+1):main((*a=='/')+t,_,a+1)
:0<t?main(2,2,"%s"):*a=='/'||main(0,main(-61,*a,
"!ek;dc i@bK'(q)-[w]*%n+r3#l,{}:\nuwloca-O;m .vpbks,fxntdCeghiry"),a+1);
}
```

5. Running the Program On the first day of Christmas my true love gave to me a partridge in a pear tree. On the second day of Christmas my true love gave to me two turtle doves and a partridge in a pear tree. ... On the twelfth day of Christmas my true love gave to me twelve drummers drumming, eleven pipers piping, ten lords a-leaping, nine ladies dancing, eight maids a-milking, seven swans a-swimming, six geese a-laying, five gold rings; four calling birds, three french hens, two turtle doves and a partridge in a pear tree.

6. Pretty Printed Code

```c
#include <stdio.h>
main(t,_,a)
char *a;
{
  if ((!0) < t) {
    if (t < 3)
      main(-79,-13,a+main(-87,1-_,main(-86,0,a+1)+a));
    if (t < _)
      main(t+1,_,a);
    if (main(-94,-27+t,a)) {
      if (t==2) {
        if (_ < 13) {
          return main(2,_+1,"%s %d %d\n");
        } else {
          return 9;
        }
      } else
        return 16;
    } else
      return 0;
    ...
```

7. PP Path Profiling Tool
- Instruments Sparc/Solaris executables
  - records intraprocedural paths
  - based on EEL instrumentation technology
- Usage:
  - `pp a.out`        // produces a.out.pp
  - `a.out.pp`        // produces a.out.Paths
  - `pp_stats a.out`  // produces path statistics from the a.out.Paths file

8. Example: Path Profiling
- How often does a control-flow path execute?
- Levels of profiling:
  - blocks ⇒ statements & lines
  - edges ⇒ branches & blocks
  - paths ⇒ sequences of edges & blocks

[Figure: control-flow graph with blocks A-F and execution counts 400, 57, 343]

9. Naive Path Profiling
- Instrument every block to append its name to a path buffer: put("A"), put("B"), ..., then record_path() at the exit block F

[Figure: CFG A-F with a put() call in each block and record_path() at the exit]

10. Efficient Path Profiling [Ball/Larus, MICRO 96]

| Path   | Encoding |
| ------ | -------- |
| ACDF   | 0 |
| ACDEF  | 1 |
| ABCDF  | 2 |
| ABCDEF | 3 |
| ABDF   | 4 |
| ABDEF  | 5 |

Instrumentation (from the slide's CFG): r = 0 at entry; r updated on a few chord edges (r = 2, r = 4, r += 1); count[r]++ at the exit block F.

11. Path Regeneration
- Given a path sum P, which path produced it?
- Walk the CFG from the entry; at each node take the outgoing edge with the largest value that does not exceed the remaining sum, and subtract it (the slide works the example P = 3)

[Figure: CFG A-F with each node labeled by its number of paths to the exit (n1, n2, n1+n2, ...); the walk for P = 3 is highlighted]

13. 12 Days of Christmas
- 66 occurrences of non-partridge-in-a-pear-tree gifts
- 114 strings printed
- 2358 characters printed (wc -c)

14. What can we learn from a profile?
- Profile partitions the program into low- and high-frequency clusters of paths
  - number of verses vs. string searching
- Profile identifies paths related to frequencies seen in analysis of the program's output
  - printing a string or character
- Inefficiencies pop out
  - hidden O(n^2) algorithms

15. Example II: Profiling Bebop
- Bebop performs reachability analysis of boolean programs
- Uses a symbolic version of the [Reps-Horwitz-Sagiv, POPL'95] interprocedural dataflow analysis
  - explicit representation of control flow
  - implicit representation of reachable states via BDDs
- Complexity of the algorithm is O(E * 2^n)
  - E = size of the interprocedural control-flow graph
  - n = max. number of variables in the scope of any label

16. Bebop
- Exploits procedural abstraction
  - number of globals bounded by g
  - number of locals bounded by h
  - O(E * 2^(g+h)) = O(E), since g and h are constants
- Expect space usage and time for model checking to be linear in the size of the program

17.

```
void level<i>()
begin
  decl a, b, c;
  if (g) then
    while (!a | !b | !c) do
      if (!a) then
        a := 1;
      elsif (!b) then
        a, b := 0, 1;
      elsif (!c) then
        a, b, c := 0, 0, 1;
      else
        skip;
      fi
    od
  else
    <stmt>; <stmt>;
  fi
  g := !g;
end

decl g;

void main()
begin
  level1();
  level1();
  if (!g) then
    reach: skip;
  else
    skip;
  fi
end
```

18. Simple Profiling
- Memory usage
  - BDD libraries report peak space usage and many other useful statistics
- Wall time
  - cygwin "time bebop ..."
- Visual Studio profiler
- IceCap
  - internal Microsoft profiling tool

19. A Lesson: Profiling is critical when reusing code
- Eliminated various stupidities in our code
  - still had O(n^2) time behavior!
- Profiling narrowed the cause down to the BDD libraries
  - they assume a small number of "managed" BDD variables
  - BDD operations generally are O(size of BDD)
  - but in both the CMU and CU packages, some BDD operations loop through all managed variables, whether or not they appear in the BDD!

20. Overview
- Why profile? (Applications)
- What to profile?
- How to profile? (Infrastructure/tools)
- Future directions for profiling

21. Applications -> What to profile?
- Control-flow profiles
  - trace scheduling [Ellis]
  - code positioning [Pettis/Hansen]
  - improving dataflow analysis [Ammons/Larus]
- Value profiles
  - superscalar architecture [Sodani/Sohi]
  - method specialization [Dean, et al.]
- Address profiles
  - improving D-cache performance [Calder, et al.]
- Communication profiles
  - component placement [Hunt/Scott]

22. Control-flow Profiles and Optimization
- Two main ideas:
  - Optimize the hot paths (procedures, etc.)
    - superblocks
    - VLIW
    - classic compiler optimizations
    - dataflow analysis
  - Separate the hot from the cold
    - affinity graph
    - temporal relationship graph

23. Trace Scheduling for VLIW
- VLIW: Very Long Instruction Word
  - execute multiple instructions in a single step
  - need large basic blocks to fully utilize the provided "width"
- Problem: conditional branches
- Solution:
  - use edge/branch profiles to form "superblocks"
  - schedule instructions within a superblock
  - generate compensation code to fix up state when the prediction is wrong
- An early form of "superscalar + speculation", with the complexity pushed to the compiler (IA-64)

24. Trace Scheduling: Code Motion Across Basic Blocks

[Figure: before/after CFGs for the trace a = b + c; if x > 10; f = a * 3, with the off-trace code d = a - 3; g = d + 2 on the other side of the branch]

25. Trace Scheduling Rules
- If a trace operation moves below a conditional jump, place a copy of it on the off-trace edge of the jump
- A trace operation that writes to x can't move above a conditional jump if x is live on the off-trace edge of the jump
- etc.

26. Chang, Mahlke, Hwu
- IMPACT compiler
- Superblock formation
  - based on edge profiles and a greedy algorithm
- Tail duplication
  - eliminates control-flow merges into the middle of a superblock
- Classic compiler optimizations
  - constant propagation, CSE, loop induction variables, ...

27. Superblock Formation

[Figure: CFG with blocks A-F annotated with edge profile counts (90/10 splits); the heavily executed edges select the superblock]

28. Tail Duplication

[Figure: the merge block F is duplicated as F', so off-trace paths enter F' instead of merging into the middle of the superblock]

29. Optimizations
- Local optimizations
  - constant/copy propagation
  - CSE
  - redundant load/store removal
- Dead code removal
- Loop optimizations
  - hoist loop-invariant code
    - don't hoist past a conditional if an exception is possible
  - induction variable elimination

Example: x = c; y = x; z = x/a

30. Pettis/Hansen: Profile-Guided Code Positioning
- Goal
  - reduce working-set size, TLB misses and I-cache misses
- Three techniques
  - procedure positioning
  - basic block positioning
  - procedure splitting

31. Procedure Positioning
- Profile calls between procedures via link-time insertion of monitoring code
- Construct an undirected, weighted call graph
- Greedy algorithm to merge nodes
  - "closest is best" strategy: if P calls Q frequently, P and Q should reside close to one another in the executable
  - merge until one node is left

32. Example

[Figure: undirected weighted call graph over procedures A-H; the heaviest edge (A-D, weight 10) is merged first]

33. Example

[Figure: A and D merged into a single node A,D; their edge weights to the remaining nodes are summed]

34. Example

[Figure: merging continues (e.g., C and F into C,F); the candidate orderings A-D-C-F, A-D-F-C, D-A-C-F and D-A-F-C are compared to keep the closest procedures adjacent]

35. Basic Block Positioning
- Separate hot blocks from cold blocks
  - using edge/branch profiles
- Reorganize blocks so that "normal" control flow is straight-line code
- Create chains (superblocks)
  - consider edges in descending order of frequency, merging chains when possible
  - start with the procedure entry; grow each chain by following the outgoing edge with the highest frequency

36. Dataflow Analysis
- Dataflow functions
- Composition of functions along a path: h ∘ g ∘ f applied to the entry value 0
- Merge operators to combine results from multiple paths

[Figure: a path through three blocks with transfer functions f, g, h]

37. Restructuring for Path-sensitive Data Flow [Ammons, Larus]

[Figure: CFG A-F before and after restructuring; blocks C and E are duplicated along the hot path]

38. Ammons/Larus Algorithm
- Duplicate hot paths to eliminate merges into the middle of hot paths
  - as with superblock optimizations
- Perform traditional DFA on the new CFG
- Compact the CFG
  - merge duplicated CFG nodes that have equivalent dataflow results

39. Applications -> What to profile?
- Control-flow profiles
- Value profiles
  - superscalar architecture [Sodani/Sohi]
  - method specialization [Chambers, et al.]
- Address profiles
- Communication profiles

40. Value Profiles
- Observation
  - many (static) instructions compute the same value over and over again!
- Why?
  - regularity in input data
  - repeated traversal of structures
- Applications
  - architecture: cache results in a buffer to eliminate redundant work
  - compilers: partially evaluate code with respect to commonly occurring constants

41. Example

```c
s4 = search(l,4);
s6 = search(l,6);

bool search(list* l, int i) {
  while (l != NULL) {
    if (l->val == i) return true;
    l = l->next;
  }
  return false;
}
```

42. Some Numbers on Redundancy [Sodani-Sohi, profile produced with SimpleScalar simulator]

43. Selective Specialization for OO Languages [Dean/Chambers/Grove]
- Naïve customization
  - given method m accepting an argument of class A, superclass of B, superclass of C
  - generate m w.r.t. A, B and C
  - code explosion
- Specialization
  - compile a method multiple times, based on the value/dynamic type of commonly passed parameters
  - use profiles to address cost/benefit

44. Technique
- Collects gprof-style information
  - weighted call graph
  - node = message send
  - edge = actual method receiver
- Algorithm focuses on
  - high-weight sends
  - dynamic dispatch
  - actual "pass-through" to formal

45. Applications -> What to profile?
- Control-flow profiles
- Value profiles
- Address profiles
- Communication profiles

46. Address Profiles
- Want to change the location of objects to minimize cache misses [Calder et al.]
- Addresses change with inputs, so how do we name objects?
  - global variables
  - stack variables
  - heap objects: address of the malloc site XOR'd with a few return addresses on the stack [Barrett/Zorn, Lebeck/Wood]

47. Temporal Relationship Graph [Gloy, et al.]
- Undirected, weighted graph
  - nodes = objects
  - an edge (v, n, w) records that n cache misses would occur if v and w were mapped to the same cache location

[Figure: the trace o1 o2 o3 o1 processed through a queue of recent objects to build the TRG]

48. Object Placement Algorithm
- Similar in spirit to [Pettis/Hansen]
- Visit edges of the TRG in decreasing frequency
  - determine placement of objects
  - minimize conflict misses
  - based on the placement of previously placed objects
  - uses the original TRG
  - coalesce nodes, sum edge weights