CUG14 BoF: Future Needs for Understanding User-Level Activity with ALTD

  1. CUG14 BoF: Future Needs for Understanding User-Level Activity with ALTD

  2. ALTD
     • What it does
       • Intercepts linker (ld) and job launcher (aprun)
       • Uses linker tracemap option to get all libraries
       • Stores all of this in a database
     • What it gets
       • Full path of the executable
       • Static and dynamic libraries used by the executable
     • What it can be used for
       • Which executables use the largest number of core hours?
         • Are they managed by the center? Do they use the system efficiently?
       • Which libraries, applications, or tools are being used?
         • Are there libraries we should remove? Are there libraries we should install?
       • What percentage of executables are scripts?
         • Are these scripts being used because the job starter isn’t sophisticated enough?
       • Are there any executables with modification times older than 1 year?
         • Should we ask the user to recompile?

  3. What Does NERSC Collect?
     • ALTD
       • Track library usage both at compile and run time
     • Torque Logs
       • Job information, accounting
     • ALPS Logs
       • Track application run-time data and options on the Cray systems
     • Darshan
       • IO profiling data
     • IPM
       • MPI profiling data
     • Performance Monitoring
       • Monitoring system performance over the lifetime of the machines
     • LMT
       • Lustre data

  5. ALTD is enabled on all major computing platforms at NERSC

  6. Applications of ALTD
     An ALTD tool to restore the build environment for an application:

       aryal@edison12:~> linkinfo.sh /global/homes/a/aryal/bin/gvasp5.3.2
       User : zz217
       Linked on : 2013-01-03
       Executable Name: vasp
       Libraries Used :
         //usr/lib64/libhugetlbfs.a
         ../vasp.5.lib/libdmy.a
         /opt/cray/atp/1.6.0/lib//libAtpSigHCommData.a
         /opt/cray/atp/1.6.0/lib//libAtpSigHandler.a
         /opt/cray/libsci/12.0.00/cray/81/sandybridge/lib/libsci_cray_mp.a
         /opt/fftw/3.3.0.1/x86_64/lib/libfftw3.a
         /opt/cray/mpt/5.6.0/gni/mpich2-cray/74/lib/libmpich_cray.a
         /opt/cray/mpt/5.6.0/gni/mpich2-cray/74/lib/libmpl.a
         /opt/cray/xpmem/0.1-2.0500.36799.3.6.ari/lib64/libxpmem.a
         /opt/cray/pmi/4.0.0-1.0000.9282.69.4.ari/lib64/libpmi.a
         /opt/cray/ugni/4.0-1.0500.5836.7.58.ari/lib64/libugni.a
         /opt/cray/udreg/2.3.2-1.0500.5931.3.1.ari/lib64/libudreg.a
         /opt/cray/alps/5.0.1-2.0500.7663.1.1.ari/lib64/libalpslli.a
         /opt/cray/alps/5.0.1-2.0500.7663.1.1.ari/lib64/libalpsutil.a
         /opt/cray/cce/8.1.2/craylibs/x86-64/libpgas-dmapp.a
         /opt/cray/cce/8.1.2/craylibs/x86-64/libu.a
         /opt/cray/dmapp/4.0.1-1.0500.5932.6.5.ari/lib64/libdmapp.a
         /opt/cray/pmi/4.0.0-1.0000.9282.69.4.ari/lib64/libpmi.a
         /opt/cray/cce/8.1.2/craylibs/x86-64/libfi.a
         /opt/gcc/4.4.4/snos/lib64/libstdc++.a
         /opt/gcc/4.4.4/snos/lib/gcc/x86_64-suse-linux/4.4.4/libgcc_eh.a
         /opt/cray/cce/8.1.2/craylibs/x86-64/libf.a
         /opt/cray/cce/8.1.2/craylibs/x86-64/libcraymath.a
         /opt/cray/cce/8.1.2/craylibs/x86-64/libcraymp.a
         /opt/cray/cce/8.1.2/craylibs/x86-64/libu.a
         /opt/cray/cce/8.1.2/craylibs/x86-64/libcsup.a
         //usr/lib64/librt.a
         /opt/cray/cce/8.1.2/craylibs/x86-64/libtcmalloc_minimal.a
         //usr/lib64/libpthread.a
         //usr/lib64/libc.a
         /opt/gcc/4.4.4/snos/lib/gcc/x86_64-suse-linux/4.4.4/libgcc_eh.a
         //usr/lib64/libm.a
         /opt/gcc/4.4.4/snos/lib/gcc/x86_64-suse-linux/4.4.4/libgcc.a

     • Understanding current library usage and planning for future software needs
     • Providing usage statistics to developers and vendors
     • Restoring the program environment where user applications were built
     • Assisting with debugging system issues
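
The library list printed by linkinfo.sh contains repeated entries and doubled slashes, so a usage census would likely want to normalize it first. The following is a minimal sketch of that clean-up step, assuming the paths have already been read into a Python list. The helper names `normalize` and `unique_libraries` are hypothetical and are not part of ALTD; the sample paths are taken from the listing above.

```python
import re
from collections import OrderedDict


def normalize(path):
    """Collapse repeated slashes, e.g. '//usr/lib64/libm.a' -> '/usr/lib64/libm.a'."""
    return re.sub(r"/+", "/", path)


def unique_libraries(paths):
    """Return normalized library paths in first-seen order, without repeats."""
    seen = OrderedDict()
    for p in paths:
        seen.setdefault(normalize(p), None)
    return list(seen)


# A few entries copied from the linkinfo.sh output (libpmi.a appears twice there).
sample = [
    "//usr/lib64/libhugetlbfs.a",
    "/opt/cray/atp/1.6.0/lib//libAtpSigHandler.a",
    "/opt/fftw/3.3.0.1/x86_64/lib/libfftw3.a",
    "/opt/cray/pmi/4.0.0-1.0000.9282.69.4.ari/lib64/libpmi.a",
    "/opt/cray/pmi/4.0.0-1.0000.9282.69.4.ari/lib64/libpmi.a",
]
libs = unique_libraries(sample)
```

A regular expression is used instead of `os.path.normpath` because POSIX path normalization deliberately preserves a leading double slash.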

  7. ALTD at CSCS
     • In production at CSCS since 2011
     • Rock solid: just a single downtime in two years
     • Rosa (Cray XE6) since March 2011
       • 600K compilations, 2.8M jobs
     • Todi (Cray XK6/XK7) since October 2012
       • 470K compilations, 500K jobs
     • Daint (Cray XC30) since March 2013
       • 100K compilations, 550K jobs
     • We’ve added an additional SQL table “accounting” which logs more data about the application execution: number of cores used, number of cores claimed, number of threads, MPI processes, processes per node, …
     • We want to be able to detect situations like the use of a buggy or non-performant library

  8. How we mine data: a hypothetical situation
     A critical bug has been identified in FFTW version 3.3.0.2, affecting code correctness

  9. First, find which users have linked this library

       mysql> select distinct username
              from altd_rosa_link_tags, altd_rosa_linkline
              where altd_rosa_link_tags.linkline_id = altd_rosa_linkline.linking_inc
                and exit_code = 0
                and linkline like '%fftw/3.3.0.2/%';
       +----------+
       | username |
       +----------+
       | tkachenn |
       | boswald  |
       | liang    |
       | robinson |
       | yunding  |
       | zilia    |
       +----------+
       5 rows in set (4.33 sec)

     • Querying the ALTD database reveals that several users have applications linked to the buggy library

  10. Now, check if they are using the buggy application

       mysql> select altd_rosa_jobs.*
              from altd_rosa_link_tags, altd_rosa_linkline, altd_rosa_jobs
              where altd_rosa_jobs.tag_id = altd_rosa_link_tags.tag_id
                and altd_rosa_link_tags.linkline_id = altd_rosa_linkline.linking_inc
                and exit_code = 0
                and linkline like '%fftw/3.3.0.2/%'
                and altd_rosa_jobs.username = "robinson";
       +---------+--------+------------------------+----------+------------+--------+---------------+
       | run_inc | tag_id | executable             | username | run_date   | job_id | build_machine |
       +---------+--------+------------------------+----------+------------+--------+---------------+
       | 2410158 | 438583 | /users/robinson/mycode | robinson | 2013-11-05 | 834805 | rosa          |
       | 2410172 | 438583 | /users/robinson/mycode | robinson | 2013-11-05 | 834805 | rosa          |
       | 2410198 | 438583 | /users/robinson/mycode | robinson | 2013-11-05 | 834805 | rosa          |
       | 2410222 | 438583 | /users/robinson/mycode | robinson | 2013-11-05 | 834805 | rosa          |
       +---------+--------+------------------------+----------+------------+--------+---------------+
       4 rows in set (0.65 sec)

      • And it’s confirmed that user “robinson” is running the application linked to the buggy library
      • It’s now up to the user services group to contact the user and recommend relinking their applications against the newer version of FFTW, which has fixed the bug
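
The two lookups above can be tried without a production database. The sketch below reproduces them against an in-memory SQLite database rather than MySQL. The table and column names come from the queries on the slides, but which table holds `username` and `exit_code` is an assumption, and the users, paths, and rows are made up for illustration.

```python
import sqlite3


def build_demo_db():
    """Create a tiny in-memory stand-in for the ALTD tables (assumed layout)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE altd_rosa_linkline (
            linking_inc INTEGER PRIMARY KEY, linkline TEXT);
        CREATE TABLE altd_rosa_link_tags (
            tag_id INTEGER PRIMARY KEY, linkline_id INTEGER,
            username TEXT, exit_code INTEGER);
        CREATE TABLE altd_rosa_jobs (
            run_inc INTEGER PRIMARY KEY, tag_id INTEGER,
            executable TEXT, username TEXT);
    """)
    conn.executemany("INSERT INTO altd_rosa_linkline VALUES (?, ?)", [
        (1, "/opt/fftw/3.3.0.2/x86_64/lib/libfftw3.a"),
        (2, "/opt/fftw/3.3.0.1/x86_64/lib/libfftw3.a"),
    ])
    conn.executemany("INSERT INTO altd_rosa_link_tags VALUES (?, ?, ?, ?)", [
        (10, 1, "alice", 0),   # successful link against the flagged version
        (11, 1, "bob", 0),     # linked it too, but never ran a job
        (12, 2, "carol", 0),   # linked an unaffected version
        (13, 1, "dave", 1),    # failed link (non-zero exit code), ignored
    ])
    conn.executemany("INSERT INTO altd_rosa_jobs VALUES (?, ?, ?, ?)", [
        (100, 10, "/users/alice/mycode", "alice"),
        (101, 10, "/users/alice/mycode", "alice"),
        (102, 12, "/users/carol/other", "carol"),
    ])
    return conn


def users_linked_to(conn, pattern):
    """Step 1: users with a successful link line matching the pattern."""
    rows = conn.execute(
        "SELECT DISTINCT t.username FROM altd_rosa_link_tags AS t "
        "JOIN altd_rosa_linkline AS l ON t.linkline_id = l.linking_inc "
        "WHERE t.exit_code = 0 AND l.linkline LIKE ? ORDER BY t.username",
        (pattern,))
    return [r[0] for r in rows]


def jobs_run_by(conn, pattern, username):
    """Step 2: jobs a given user ran with an executable linked that way."""
    rows = conn.execute(
        "SELECT j.run_inc, j.executable FROM altd_rosa_jobs AS j "
        "JOIN altd_rosa_link_tags AS t ON j.tag_id = t.tag_id "
        "JOIN altd_rosa_linkline AS l ON t.linkline_id = l.linking_inc "
        "WHERE t.exit_code = 0 AND l.linkline LIKE ? AND j.username = ? "
        "ORDER BY j.run_inc",
        (pattern, username))
    return rows.fetchall()


conn = build_demo_db()
users = users_linked_to(conn, "%fftw/3.3.0.2/%")
jobs = jobs_run_by(conn, "%fftw/3.3.0.2/%", "alice")
```

Passing the pattern and username as bound parameters keeps the same two-step logic while avoiding hand-edited SQL strings for each new library of interest.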

  11. This methodology is clearly unmanageable!
      • Ideally, user support specialists would be alerted automatically to “situations of interest”
        • Users running applications linked to legacy, less-performant, or buggy libraries
        • Users running legacy versions of applications
        • Users building code with legacy compilers
        • Users making use of their own libs or apps, when more optimized versions are available centrally
      How can we automate the processes of data mining, reporting and alerting?
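
One possible shape for such automation is a watch list of link-line patterns that is checked against recorded links and jobs on a schedule. The sketch below is only an illustration of that idea, not a description of any existing tool: the record layout, the `WATCHLIST` contents, and the function name `find_alerts` are all hypothetical.

```python
from collections import defaultdict

# Hypothetical watch list: substring of a link line -> reason for the alert.
WATCHLIST = {
    "fftw/3.3.0.2/": "buggy FFTW release",
    "gcc/4.4.4/": "legacy compiler runtime",
}


def find_alerts(links, jobs, watchlist):
    """Map each user who ran a flagged executable to the reasons it was flagged.

    links: iterable of dicts with 'tag_id' and 'linkline'
    jobs:  iterable of dicts with 'tag_id' and 'username'
    """
    # First pass: which link tags match any watch-list pattern?
    flagged = {}
    for link in links:
        for pattern, reason in watchlist.items():
            if pattern in link["linkline"]:
                flagged.setdefault(link["tag_id"], set()).add(reason)

    # Second pass: which users actually ran executables built from those links?
    alerts = defaultdict(set)
    for job in jobs:
        for reason in flagged.get(job["tag_id"], ()):
            alerts[job["username"]].add(reason)
    return dict(alerts)


# Made-up records for illustration only.
links = [
    {"tag_id": 1, "linkline": "/opt/fftw/3.3.0.2/x86_64/lib/libfftw3.a"},
    {"tag_id": 2, "linkline": "/opt/fftw/3.3.0.1/x86_64/lib/libfftw3.a"},
]
jobs = [
    {"tag_id": 1, "username": "alice"},
    {"tag_id": 2, "username": "carol"},
]
alerts = find_alerts(links, jobs, WATCHLIST)
```

Keeping the watch list as data means support staff could add a new “situation of interest” without touching the matching code.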

  12. Lariat (TACC)
      • What it does
        • Intercepts job launcher (ibrun)
        • Uses ldd to get shared libraries
        • Checks run-time environment against compile-time environment
      • What it gets
        • Full path of the executable
        • Dynamic libraries used by the executable
        • Last modification time of the executable
        • Size of the executable (e.g. bss, data, text)
        • Unique hash of the executable
        • Whether the executable is a binary or a shell script
      • What it can be used for
        • Which executables use the largest number of core hours?
          • Are they managed by TACC? Do they use the system efficiently?
        • Which libraries, applications, or tools are being used?
          • Are there libraries we should remove? Are there libraries we should install?
        • What percentage of executables are scripts?
          • Are these scripts being used because the job starter isn’t sophisticated enough?
          • Should we direct these users to our parametric job launcher?
        • Are there any executables with modification times older than 1 year?
          • Should we ask the user to recompile?
        • Are there any executables with large statically allocated arrays (bss)?
      These can also be obtained with ALTD through straightforward modifications, as CSCS has already done
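
Several of the per-executable items listed above can be gathered with nothing beyond the standard library. The sketch below is a rough illustration under stated assumptions, not Lariat's actual implementation: the function name `describe_executable` is hypothetical, SHA-1 is an arbitrary choice of hash, a script is detected only by its `#!` prefix, and the bss/data/text section sizes and the ldd step are left out.

```python
import hashlib
import os
import tempfile


def describe_executable(path):
    """Collect path, modification time, size, hash, and script/binary flag."""
    real = os.path.realpath(path)          # full path with symlinks resolved
    st = os.stat(real)
    digest = hashlib.sha1()
    with open(real, "rb") as fh:
        head = fh.read(2)                  # first two bytes decide script vs. binary
        digest.update(head)
        for chunk in iter(lambda: fh.read(65536), b""):
            digest.update(chunk)
    return {
        "path": real,
        "mtime": st.st_mtime,              # seconds since the epoch
        "size_bytes": st.st_size,          # whole-file size, not per-section sizes
        "sha1": digest.hexdigest(),
        "is_script": head == b"#!",
    }


# Demonstration on a throwaway shell script.
with tempfile.NamedTemporaryFile("wb", suffix=".sh", delete=False) as tmp:
    tmp.write(b"#!/bin/sh\necho hello\n")
info = describe_executable(tmp.name)
os.unlink(tmp.name)
```

Comparing `info["mtime"]` against the current time would answer the “older than 1 year” question, and the hash makes it possible to tell when two paths hold the same build.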

  13. TACC_Stats
      • Job-level transparent performance monitoring from HPC compute nodes
      • CPU performance counters
      • IB statistics
      • Lustre statistics
      • Scheduler job statistics
      • Host data
      • OS statistics
      • Analyses integrate available Lariat data

  14. Nightly Analyses
      • Automatically analyzes jobs nightly
      • Highlights jobs worth looking at
      • Tries to provide a one-stop view of a job for
        • Support staff
        • Sysadmins
        • And soon, users

  15. Current Reports
      • High levels of imbalance
      • Low Flops (but other activity)
      • Idle hosts
      • Catastrophic performance drop

  18. XALT: Understanding the Software Needs of High End Computer Users
      • Newly NSF-funded project
      • Will be combining the best of Lariat and ALTD
      • Collecting job-level and link-time level data and subsequent analytics
      • Building a community around analytics: potentially one of many tools
      • Will make it available to the community
      • Optional interface to XDMoD/SUPReMM

  19. XALT Goals
      • Goal is a census of libraries and applications and automatic filtering of user issues
        • What additional user problems can we detect and report (perhaps correct) automatically?
        • How can we leverage lessons learned by the TACC Stats team to implement additional automatic filtering?
      • Plan to add tracking of function calls as well
      • Want to balance the need for portability with support for site-specific capabilities
      • Want to simplify the processes system administrators use to install, configure, and manage

  20. Mark’s [not so hidden] agenda
      • Do you know what libraries are being used? Can you help a user figure out what he did X months ago? Do you know how many users have trouble with runtime environment matching compile time?
      • Would you have strong opposition to intercepting the linker "ld"?
      • Anyone willing to be a beta tester for our before and after study?
      • Do you have any issues with dropping dot files in user home directories?
      • Do you want to track library function calls?
      • xalt-users@lists.sourceforge.net
      • Want feedback, hungry for ideas

  21. Thanks to
      • Richard Gerber and Zhengji Zhao, NERSC
      • Tim Robinson, CSCS
      • Bill Barth, TACC

  22. Contact Info
      • Mark R. Fahey
        • mfahey@utk.edu
      • Robert McLay
        • mclay@tacc.utexas.edu

  23. Background Slides

  24. Robert McLay, TACC

  25. My Passions
      • Protect new user but stay out of vet's way
      • Make staff support efficient and effective
      • Automate detection, correction, prevention
      • Make the repeat tickets go away!

  26. Making a difference…
      Maintain consistent, compatible software environment
      Lmod and related tools

        $ module swap mvapich2 impi
        Inactive Modules:
          1) vasp
        Due to MODULEPATH changes the following have been reloaded:
          1) fftw3/3.3.2

        $ module load mvapich2
        Lmod Error: You can only have one MPI module loaded at a time.
        You already have impi loaded.

  27. Making a difference…
      Detect potential problems and alert users
      Lariat and related tools

        TACC: Starting up job 423224
        ******************************************************
        WARNING: Your MPI Environment is: mvapich2/1.9a2
        Your executable was built with: impi/4.1.0.030
        ******************************************************
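
The check behind a warning like this comes down to comparing the MPI stack recorded at build time with the one loaded at launch. The following is a minimal sketch of that comparison, assuming both values are already available as module-style strings; how Lariat actually obtains them is not shown in the slides, and the function name `mpi_mismatch_warning` is hypothetical.

```python
def mpi_mismatch_warning(runtime_mpi, build_mpi):
    """Return a warning banner if the two MPI identifiers differ, else None.

    The identifiers are compared as whole strings (name and version), which
    is an assumption; a real tool might compare only the MPI family name.
    """
    if runtime_mpi == build_mpi:
        return None
    bar = "*" * 54
    return "\n".join([
        bar,
        "WARNING: Your MPI Environment is: " + runtime_mpi,
        "Your executable was built with: " + build_mpi,
        bar,
    ])


# Values taken from the example banner on the slide.
message = mpi_mismatch_warning("mvapich2/1.9a2", "impi/4.1.0.030")
```

Returning `None` for the matching case lets the job launcher stay silent unless there is something worth telling the user.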

  28. Making a difference…
      Job-level usage data on libraries and applications
      ALTD (Mark Fahey, NICS)

  29. Joining forces…
      • ALTD: Job-level usage data on libraries and applications
      • Lariat: Detect potential problems and alert users

        TACC: Starting up job 423224
        ******************************************************
        WARNING: Your MPI Environment is: mvapich2/1.9a2
        Your executable was built with: impi/4.1.0.030
        ******************************************************

      • XALT

  30. My own not-so-hidden agenda...
      • Looking for XALT beta users
      • Hungry for ideas, needs, feedback
      • Wanting to begin conversation with kindred souls

  31. Lariat User #1
      Bill Barth
      Director of HPC, TACC
      Co-PI, SUPReMM

  37. NERSC Job Data
      Richard Gerber, Zhengji Zhao
      NERSC User Services

  39. Expose job data via the web
      • We try to make as much data available as possible via the web
        • For users to track usage
        • For users to check resource utilization
        • For users to monitor performance
        • For staff to help debug jobs
        • For summary reports
      • The following are web screen shots
      • All data collection is transparent to users

  49. Monitoring Software Usage at CSCS
      Dr Tim Robinson, CSCS
      Drilling Down: Understanding User-Level Activity on Today’s Supercomputers

  50. We support many, many libs, tools, apps, compilers…
