
Towards a Performance Tool Interface for OpenMP: An Approach Based on Directive Rewriting

Bernd Mohr, Felix Wolf. Forschungszentrum Jülich, John von Neumann-Institut für Computing, Zentralinstitut für Angewandte Mathematik, 52425 Jülich. {b.mohr,f.wolf}@fz-juelich.de


Presentation Transcript


  1. Towards a Performance Tool Interface for OpenMP: An Approach Based on Directive Rewriting • Bernd Mohr, Felix Wolf, Forschungszentrum Jülich, John von Neumann-Institut für Computing, Zentralinstitut für Angewandte Mathematik, 52425 Jülich, {b.mohr,f.wolf}@fz-juelich.de • Allen Malony, Sameer Shende, University of Oregon, Department of Computer and Information Science, Eugene, Oregon 97403, {malony,sameer}@cs.uoregon.edu

  2. Outline • Introduction • Proposed OpenMP Performance Tool Interface • Prototype Implementation • Examples • Future Work

  3. Introduction • Motivation • “Standard” OpenMP performance tools interface, similar in spirit to the MPI profiling interface (PMPI) • Goals • Expose OpenMP parallel execution to the performance measurement system • Define it at the abstraction level of the OpenMP programming model • Make the performance measurement interface portable • across different platforms • across all OpenMP supported languages • across different performance tools • Allow flexibility in how the interface is applied

  4. Proposed OpenMP Performance Tool Interface • POMP • OpenMP Directive Instrumentation • OpenMP Runtime Library Routine Instrumentation • Performance Monitoring Library Control • User Code Instrumentation • Context Descriptors • Conditional Compilation • Conditional / Selective Transformations • Remarks • C/C++ OpenMP Pragma Instrumentation • Implementation Issues • Open Issues

  5. OpenMP Directive Instrumentation • Insert calls to pomp_NAME_TYPE(d) at appropriate places around directives • NAME: name of the OpenMP construct • TYPE • fork, join: mark a change in the degree of parallelism • enter, exit: flag entering/exiting an OpenMP construct • begin, end: mark start/end of the body of a construct • d: context descriptor • Observation of the implicit barrier at DO, SECTIONS, WORKSHARE, SINGLE constructs • Add NOWAIT to the construct • Make the barrier explicit

  6. Example: !$OMP PARALLEL DO Instrumentation
     Original:
         !$OMP PARALLEL DO clauses...
         do loop
         !$OMP END PARALLEL DO
     Instrumented:
         call pomp_parallel_fork(d)
         !$OMP PARALLEL other-clauses...
         call pomp_parallel_begin(d)
         call pomp_do_enter(d)
         !$OMP DO schedule-clauses, ordered-clauses, lastprivate-clauses
         do loop
         !$OMP END DO NOWAIT
         call pomp_barrier_enter(d)
         !$OMP BARRIER
         call pomp_barrier_exit(d)
         call pomp_do_exit(d)
         call pomp_parallel_end(d)
         !$OMP END PARALLEL
         call pomp_parallel_join(d)

  7. OpenMP Runtime Library Routine Instrumentation • Transform • omp_###_lock() → pomp_###_lock() • omp_###_nest_lock() → pomp_###_nest_lock() [ ### = init | destroy | set | unset | test ] • POMP version • Calls the omp version internally • Can do extra work before and after the call • Transformations of other OpenMP API functions necessary?
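
The wrapper idea on this slide can be sketched in C as follows. This is a minimal sketch, not the OPARI or POMP library code: the wrapper signatures are assumed to mirror the `omp_*_lock` routines, `lock_events` is a hypothetical counter standing in for a real measurement system, and the stand-in lock type is only there so the sketch also builds without OpenMP support.

```c
#include <assert.h>

#ifdef _OPENMP
#include <omp.h>
#else
/* Stand-ins (assumption) so the sketch also builds without OpenMP. */
typedef int omp_lock_t;
static void omp_init_lock(omp_lock_t *l)    { *l = 0; }
static void omp_set_lock(omp_lock_t *l)     { *l = 1; }
static void omp_unset_lock(omp_lock_t *l)   { *l = 0; }
static void omp_destroy_lock(omp_lock_t *l) { (void)l; }
#endif

/* Hypothetical event counter; not thread-safe, serial demo only. */
static int lock_events = 0;

/* Each POMP version calls the omp version internally. */
void pomp_init_lock(omp_lock_t *l)    { omp_init_lock(l); }
void pomp_destroy_lock(omp_lock_t *l) { omp_destroy_lock(l); }

void pomp_set_lock(omp_lock_t *l) {
    lock_events++;      /* extra work before the call: lock requested */
    omp_set_lock(l);
    lock_events++;      /* extra work after the call: lock acquired */
}

void pomp_unset_lock(omp_lock_t *l) {
    omp_unset_lock(l);
    lock_events++;      /* lock released */
}

/* Serial demo: one acquire/release pair produces three events. */
int lock_demo(void) {
    omp_lock_t l;
    lock_events = 0;
    pomp_init_lock(&l);
    pomp_set_lock(&l);
    pomp_unset_lock(&l);
    pomp_destroy_lock(&l);
    assert(lock_events == 3);
    return lock_events;
}
```

Recording one event before and one after `omp_set_lock` would let a tool separate the time spent waiting for the lock from the time spent holding it.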

  8. Performance Monitoring Library Control • Give the programmer control over performance monitoring at runtime • !$OMP INST [ INIT | FINALIZE | ON | OFF ] • Translated into • pomp_init(), pomp_finalize() • pomp_on(), pomp_off() • Ignored in “normal” OpenMP compilation mode • Alternatives • !$POMP? • Use conditional compilation with explicit POMP calls
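
One way a monitoring library might implement these four control calls is a pair of flags that gate event recording. The sketch below is an assumption about the library side, not taken from the slides: `pomp_record` and the `recorded` counter are hypothetical, and only the four function names come from the slide.

```c
#include <assert.h>

static int pomp_initialized = 0;  /* set between pomp_init and pomp_finalize */
static int pomp_tracing     = 0;  /* toggled by pomp_on / pomp_off */
static int recorded         = 0;  /* hypothetical count of recorded events */

void pomp_init(void)     { pomp_initialized = 1; pomp_tracing = 1; }
void pomp_finalize(void) { pomp_tracing = 0; pomp_initialized = 0; }
void pomp_on(void)       { if (pomp_initialized) pomp_tracing = 1; }
void pomp_off(void)      { pomp_tracing = 0; }

/* Hypothetical event hook: records only while monitoring is switched on. */
void pomp_record(void) { if (pomp_tracing) recorded++; }

/* The calls a translator would emit for INST INIT, OFF, ON, FINALIZE. */
int control_demo(void) {
    recorded = 0;
    pomp_init();
    pomp_record();   /* counted */
    pomp_off();
    pomp_record();   /* skipped: monitoring is off */
    pomp_on();
    pomp_record();   /* counted */
    pomp_finalize();
    pomp_record();   /* skipped: library is finalized */
    assert(recorded == 2);
    return recorded;
}
```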

  9. User Code Instrumentation • Compiler / transformation tool should insert • pomp_begin(d) • pomp_end(d) calls at the beginning and end of each (?) user function • Allow user-specified arbitrary (non-function) code regions • !$OMP INST BEGIN ( <region name> ) arbitrary user code !$OMP INST END ( <region name> ) • Alternatives • !$POMP? • Use conditional compilation with explicit POMP calls (descriptor?)

  10. Context Descriptors • Describe execution contexts through a context descriptor
         typedef struct ompregdescr {
           char name[];                    /* construct */
           char sub_name[];                /* region name */
           int num_sections;
           char filename[];                /* src filename */
           int begin_line1, begin_lineN;   /* begin line # */
           int end_line1, end_lineN;       /* end line # */
           WORD data[4];                   /* perf. data */
           struct ompregdescr* next;
         } OMPRegDescr;
      • Generate context descriptors in global static memory:
         OMPRegDescr rd42675 = { "critical", "phase1", 0, "foo.c", 5, 5, 13, 13 };
      • Pass address to POMP functions
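
The descriptor on the slide is written as shorthand (several `char name[]` members and an undefined `WORD` type), so it does not compile as shown. A compilable variant might look like the sketch below, which assumes `const char *` for the strings and `void *` for the tool data slots; `last_seen` and the body of `pomp_critical_enter` are hypothetical and only show that a single pointer carries the full source context.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef struct ompregdescr {
    const char *name;              /* construct */
    const char *sub_name;          /* region name */
    int num_sections;
    const char *filename;          /* source file name */
    int begin_line1, begin_lineN;  /* begin line numbers */
    int end_line1, end_lineN;      /* end line numbers */
    void *data[4];                 /* tool data; void* is an assumption for WORD */
    struct ompregdescr *next;
} OMPRegDescr;

/* Descriptor in global static memory, initialized at compile time.
   Members without an initializer (data, next) start out as zero/NULL. */
static OMPRegDescr rd42675 = { "critical", "phase1", 0, "foo.c", 5, 5, 13, 13 };

/* Hypothetical library side: remember the last descriptor passed in. */
static const OMPRegDescr *last_seen = NULL;
void pomp_critical_enter(OMPRegDescr *d) { last_seen = d; }

/* Pass the descriptor's address and read the context back on the tool side. */
int descriptor_demo(void) {
    pomp_critical_enter(&rd42675);
    assert(last_seen->data[0] == NULL && last_seen->next == NULL);
    return last_seen == &rd42675
        && strcmp(last_seen->name, "critical") == 0
        && strcmp(last_seen->filename, "foo.c") == 0
        && last_seen->begin_line1 == 5
        && last_seen->end_lineN == 13;
}
```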

  11. Conditional Compilation • C, C++, [Fortran, if supported]
         #ifdef _POMP
         arbitrary user code
         #endif
      • Fortran Free Form
         !P$ arbitrary user code
      • Fortran Fixed Form
         CP$ arbitrary
         *P$ user
         !P$ code
      • Usual restrictions apply

  12. Conditional / Selective Transformations • (Temporarily) disable / re-enable POMP instrumentation at compile time • !$OMP NOINSTRUMENT • !$OMP INSTRUMENT • Alternative: • !$POMP?

  13. C/C++ OpenMP Pragma Instrumentation • No END pragmas • instrumentation for the “closing” part follows the structured block • adding nowait has to be done in the “opening” part • #pragma omp XXX { pomp_###_begin(d); structured block; pomp_###_end(d); } • Simple differences in language • no “call” keyword • statements end with “;” • !$OMP → #pragma omp

  14. Example: #pragma omp sections Instrumentation
      Original:
         #pragma omp sections
         {
           #pragma omp section
           structured block;
           #pragma omp section
           structured block;
         }
      Instrumented:
         pomp_sections_enter(d);
         #pragma omp sections nowait
         {
           #pragma omp section
           { pomp_section_begin(d);
             structured block;
             pomp_section_end(d); }
           #pragma omp section
           { pomp_section_begin(d);
             structured block;
             pomp_section_end(d); }
         }
         pomp_barrier_enter(d);
         #pragma omp barrier
         pomp_barrier_exit(d);
         pomp_sections_exit(d);
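
A self-contained C version of the instrumented sections construct might look like the sketch below. Everything except the `pomp_*` function names is an assumption made for illustration: `Descr` is a reduced stand-in for the context descriptor, the stub functions only count events, and the construct is orphaned (executed by a team of one) so the result is deterministic. In real code it would sit inside a parallel region. Without OpenMP support the pragmas are ignored and the code runs serially with the same result.

```c
#include <assert.h>

/* Reduced stand-in for the context descriptor (assumption). */
typedef struct { const char *name; } Descr;

/* Hypothetical measurement stub: count every POMP event. */
static int n_events = 0;
static void log_event(void) {
    #pragma omp atomic
    n_events++;
}

void pomp_sections_enter(Descr *d) { (void)d; log_event(); }
void pomp_sections_exit(Descr *d)  { (void)d; log_event(); }
void pomp_section_begin(Descr *d)  { (void)d; log_event(); }
void pomp_section_end(Descr *d)    { (void)d; log_event(); }
void pomp_barrier_enter(Descr *d)  { (void)d; log_event(); }
void pomp_barrier_exit(Descr *d)   { (void)d; log_event(); }

int sections_demo(void) {
    static Descr d = { "sections" };
    int a = 0, b = 0;
    n_events = 0;

    pomp_sections_enter(&d);
    #pragma omp sections nowait        /* implicit barrier removed ... */
    {
        #pragma omp section
        { pomp_section_begin(&d); a = 1; pomp_section_end(&d); }
        #pragma omp section
        { pomp_section_begin(&d); b = 2; pomp_section_end(&d); }
    }
    pomp_barrier_enter(&d);
    #pragma omp barrier                /* ... and made explicit and observable */
    pomp_barrier_exit(&d);
    pomp_sections_exit(&d);

    /* enter + 2 * (begin, end) + barrier enter/exit + exit */
    assert(n_events == 8);
    return a + b;
}
```

Because the barrier is now a separate, bracketed event, a tool could attribute waiting time at the end of the construct to the barrier rather than to the sections themselves.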

  15. Implementation Issues • pomp_NAME_TYPE(d) is more efficient / simpler than pomp_event(POMP_TYPE, POMP_NAME, fname, line#, ...) • Inlining of POMP calls possible • Context descriptors • Full context information available, incl. source reference • But minimal runtime overhead • just one argument needs to be passed • no need to dynamically allocate memory for data!! • context data initialization at compile time • Context data is kept together with the executable • Allows for separate compilation • Potentially too much overhead for ATOMIC, CRITICAL, MASTER, SINGLE, and OpenMP lock calls • --pomp-disable=construct-list

  16. Open Issues • ORDERED? • FLUSH? • Instrumentation of PARALLEL DO / FOR loop iterations • Potentially allows measurement of the influence of loop scheduling policies • Overhead?? • Allow passing additional user information to the POMP library • Conditional compilation • Extra parameter to !$OMP INST BEGIN/END • ... • Specification of the extent of user code instrumentation • Additional pragmas/directives? • Separate (outside source code) specification? • OpenMP Runtime Instrumentation necessary?

  17. Prototype Implementation: OPARI • OpenMP Pragma And Region Instrumentor (OPARI) • Source-to-source translator to insert POMP calls around OpenMP constructs and API functions • Supports • Fortran77 and Fortran90, OpenMP 2.0 • C and C++, OpenMP 1.0 • Runtime Library Control (init, finalize, on, off) • (Manual) User Code Instrumentation (begin, end) • Conditional Compilation (#ifdef _POMP, !P$) • Conditional / Selective Transformation ([no]instrument) • Preserves source code information (#line line file) • ~ 2000 lines of C++ code

  18. OPARI • Limitations • Fortran: • END DO and END PARALLEL DO directives required • atomic expression on a line by itself • C/C++: • structured blocks: simple expression statement or block (compound statement) • Exception: for statement after parallel for • Could be fixed by enhancing OPARI’s parsing capabilities • Source code and documentation available at http://www.fz-juelich.de/zam/kojak/opari/

  19. Prototype Implementation: POMP Library • EXtensible PERformance Tool (EXPERT) • Automatic event trace analyzer • http://www.fz-juelich.de/zam/kojak/expert/ • Tuning and Analysis Utilities (TAU) • Performance analysis framework • http://www.acl.lanl.gov/tau/ • Required ~ 1 day to implement tool specific POMP libraries

  20. Prototype Implementation: EXPERT POMP Library
         void pomp_for_enter(OMPRegDescr* r) {
           /* Get EPILOG region descriptor stored in r */
           ElgRegion* e = (ElgRegion*)(r->data[0]);
           /* If not yet there, initialize and store it */
           if (! e) e = ElgRegion_Init(r);
           /* Record enter event */
           elg_enter(e->rid);
         }
         void pomp_for_exit(OMPRegDescr* r) {
           /* Record collective exit event */
           elg_omp_collexit();
         }
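
The lazy-initialization pattern used here (look up the tool's own region record in the descriptor, create it on first use) can be tried out without EXPERT. The sketch below is a stand-alone illustration under stated assumptions: `ToolRegion` and `tool_region_init` are hypothetical replacements for `ElgRegion` and `ElgRegion_Init`, the descriptor uses `const char *` and `void *` members so that it compiles, and the code is not thread-safe.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct ompregdescr {
    const char *name;
    const char *sub_name;
    int num_sections;
    const char *filename;
    int begin_line1, begin_lineN;
    int end_line1, end_lineN;
    void *data[4];                 /* tool data slots; void* is an assumption */
    struct ompregdescr *next;
} OMPRegDescr;

/* Hypothetical tool-side region record (stands in for ElgRegion). */
typedef struct { int rid; long enters; } ToolRegion;

static int next_rid = 0;

/* Create the tool record and cache it in the descriptor. */
static ToolRegion *tool_region_init(OMPRegDescr *r) {
    ToolRegion *t = malloc(sizeof *t);
    if (!t) abort();
    t->rid = next_rid++;
    t->enters = 0;
    r->data[0] = t;
    return t;
}

void pomp_for_enter(OMPRegDescr *r) {
    ToolRegion *t = (ToolRegion *)r->data[0];  /* record stored in r, if any */
    if (!t) t = tool_region_init(r);           /* first visit: create it */
    t->enters++;                               /* stand-in for an enter event */
}

/* Two enters on the same descriptor reuse one tool record. */
int lazy_init_demo(void) {
    OMPRegDescr rd = { "for", "", 0, "foo.c", 10, 10, 20, 20 };
    pomp_for_enter(&rd);
    ToolRegion *first = rd.data[0];
    pomp_for_enter(&rd);
    assert(rd.data[0] == first);
    int n = (int)first->enters;
    free(first);
    return n;
}
```

Since the descriptor starts out zero-initialized, the tool pays the registration cost only once per region and afterwards needs a single pointer load per event.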

  21. Prototype Implementation: TAU POMP Library
         TAU_GLOBAL_TIMER(tfor, "for enter/exit", "[OpenMP]", OpenMP);
         void pomp_for_enter(OMPRegDescr* r) {
         #ifdef TAU_AGGREGATE_OPENMP_TIMINGS
           TAU_GLOBAL_TIMER_START(tfor);
         #endif
         #ifdef TAU_OPENMP_REGION_VIEW
           TauStartOpenMPRegionTimer();
         #endif
         }
         void pomp_for_exit(OMPRegDescr* r) { ... }

  22. Examples • EXPERT • REMO: Weather Forecast • DKRZ Germany • MPI + OpenMP (experimental) • TAU • Stommel: Ocean Circulation Simulation • SDSC • MPI + OpenMP • event trace based → Vampir • profile based → RACY

  23. Future Work • Measure typical POMP calling overhead • EPCC OpenMP Microbenchmarks? • Investigate “formal” standardization with the OpenMP forum [OpenMP Supplemental Standard?] • OpenMP programmers: • What do you expect from an OpenMP performance tool? • Tool developers: • Download and try out OPARI • Implement the POMP interface for your tool • Tell us about problems, comments, enhancements • OpenMP ARB members: • What do we need to do next?

  24. Conclusion • POMP OpenMP Performance Tool Interface • Portable • Flexible • Efficient • Defined at the abstraction level of the OpenMP programming model • Standard? • Prototype Software • OpenMP Pragma And Region Instrumentor (OPARI) http://www.fz-juelich.de/zam/kojak/opari/ • Tuning and Analysis Utilities (TAU) http://www.acl.lanl.gov/tau/

  25. !$OMP PARALLEL Instrumentation
         call pomp_parallel_fork(d)
         !$OMP PARALLEL
         call pomp_parallel_begin(d)
         structured block
         call pomp_barrier_enter(d)
         !$OMP BARRIER
         call pomp_barrier_exit(d)
         call pomp_parallel_end(d)
         !$OMP END PARALLEL
         call pomp_parallel_join(d)

  26. !$OMP DO Instrumentation
         call pomp_do_enter(d)
         !$OMP DO
         do loop
         !$OMP END DO NOWAIT
         call pomp_barrier_enter(d)
         !$OMP BARRIER
         call pomp_barrier_exit(d)
         call pomp_do_exit(d)

  27. !$OMP WORKSHARE Instrumentation
         call pomp_workshare_enter(d)
         !$OMP WORKSHARE
         structured block
         !$OMP END WORKSHARE NOWAIT
         call pomp_barrier_enter(d)
         !$OMP BARRIER
         call pomp_barrier_exit(d)
         call pomp_workshare_exit(d)

  28. !$OMP SECTIONS Instrumentation
         call pomp_sections_enter(d)
         !$OMP SECTIONS
         !$OMP SECTION
         call pomp_section_begin(d)
         structured block
         call pomp_section_end(d)
         !$OMP SECTION
         call pomp_section_begin(d)
         structured block
         call pomp_section_end(d)
         !$OMP END SECTIONS NOWAIT
         call pomp_barrier_enter(d)
         !$OMP BARRIER
         call pomp_barrier_exit(d)
         call pomp_sections_exit(d)

  29. Synchronization Constructs Instrumentation 1
      SINGLE:
         call pomp_single_enter(d)
         !$OMP SINGLE
         call pomp_single_begin(d)
         structured block
         call pomp_single_end(d)
         !$OMP END SINGLE NOWAIT
         call pomp_barrier_enter(d)
         !$OMP BARRIER
         call pomp_barrier_exit(d)
         call pomp_single_exit(d)
      MASTER:
         !$OMP MASTER
         call pomp_master_begin(d)
         structured block
         call pomp_master_end(d)
         !$OMP END MASTER

  30. Synchronization Constructs Instrumentation 2
      CRITICAL:
         call pomp_critical_enter(d)
         !$OMP CRITICAL
         call pomp_critical_begin(d)
         structured block
         call pomp_critical_end(d)
         !$OMP END CRITICAL
         call pomp_critical_exit(d)
      BARRIER:
         call pomp_barrier_enter(d)
         !$OMP BARRIER
         call pomp_barrier_exit(d)
      ATOMIC:
         call pomp_atomic_enter(d)
         !$OMP ATOMIC
         atomic expression
         call pomp_atomic_exit(d)

  31. Automatic Analysis • EXtensible PERformance Tool (EXPERT) • programmable, extensible, flexible performance property specification • based on event patterns • analyzes along three hierarchical dimensions • performance properties (general → specific) • dynamic call tree position • location (machine → node → process → thread) • Done: fully functional demonstration prototype • Work in Progress: • optimization / generalization • more performance properties • source code and time line displays

  32. Expert Result Presentation • Interconnected weighted tree browser • scalable, still accurate • Each node has a weight • Percentage of CPU allocation time • i.e. time spent in the subtree of the call tree • Displayed weight depends on the state of the node • Collapsed (including weight of descendants) • Expanded (without weight of descendants) • Displayed using • Color: allows hot spots (bottlenecks) to be identified easily • Numerical value: detailed comparison • [Figure: collapsed node “100 main”; expanded it shows “10 main” with children “30 foo” and “60 bar”]
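
The collapsed/expanded rule can be checked against the numbers in the figure: a collapsed node shows its own weight plus that of all descendants, an expanded node shows only its own. The sketch below is an illustration, not EXPERT code; the `Node` type and the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical call tree node; exclusive = weight without descendants. */
typedef struct Node {
    const char *name;
    double exclusive;
    const struct Node *children[4];
    size_t n_children;
} Node;

/* Weight of the node plus all of its descendants. */
double inclusive_weight(const Node *n) {
    double w = n->exclusive;
    for (size_t i = 0; i < n->n_children; i++)
        w += inclusive_weight(n->children[i]);
    return w;
}

/* Collapsed nodes show the inclusive weight, expanded nodes the exclusive one. */
double displayed_weight(const Node *n, int expanded) {
    return expanded ? n->exclusive : inclusive_weight(n);
}

/* Tree from the slide: main (10) with children foo (30) and bar (60). */
double weight_demo(int expanded) {
    static const Node foo  = { "foo",  30.0, { NULL }, 0 };
    static const Node bar  = { "bar",  60.0, { NULL }, 0 };
    static const Node root = { "main", 10.0, { &foo, &bar }, 2 };
    double w = displayed_weight(&root, expanded);
    assert(w == (expanded ? 10.0 : 100.0));
    return w;
}
```

With this rule the values visible at any moment always add up to the total, whichever nodes are expanded.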

  33. Performance Properties View • [Screenshot annotations: Fine: user code • Fine: OpenMP + MPI • Main problem: idle threads]

  34. Dynamic Call Tree View • [Screenshot annotations: 1st, 2nd, and 3rd optimization opportunity]

  35. Locations View • Supports locations up to Grid scale • Easily allows exploration of load balance problems on different levels • [ Of course, the Idle Thread Problem only applies to slave threads ]
