S3D: Performance Impact of Hybrid XT3/XT4


S3D: Performance Impact of Hybrid XT3/XT4. Sameer Shende tau-team@cs.uoregon.edu. Acknowledgements. Alan Morris [UO] Kevin Huck [UO] Allen D. Malony [UO] Kenneth Roche [ORNL] Bronis R. de Supinski [LLNL] John Mellor-Crummey [Rice] Nick Wright [SDSC] Jeff Larkin [Cray, Inc.]


S3D: Performance Impact of Hybrid XT3/XT4

Sameer Shende
tau-team@cs.uoregon.edu

Acknowledgements
  • Alan Morris [UO]
  • Kevin Huck [UO]
  • Allen D. Malony [UO]
  • Kenneth Roche [ORNL]
  • Bronis R. de Supinski [LLNL]
  • John Mellor-Crummey [Rice]
  • Nick Wright [SDSC]
  • Jeff Larkin [Cray, Inc.]

The performance data presented here is available at:


TAU Parallel Performance System
  • http://www.cs.uoregon.edu/research/tau/
  • Multi-level performance instrumentation
    • Multi-language automatic source instrumentation
  • Flexible and configurable performance measurement
  • Widely-ported parallel performance profiling system
    • Computer system architectures and operating systems
    • Different programming languages and compilers
  • Support for multiple parallel programming paradigms
    • Multi-threading, message passing, mixed-mode, hybrid
The Story So Far...
  • Scalability study of S3D using TAU
    • MPI_Wait
    • Loop: ComputeSpeciesDiffFlux (630-656) [Rice, SDSC]
    • Loop: ReactionRateBounds (374-386) [exp]
  • Earlier 3D scatter plots pointed to a single “slow” node
  • Identifying individual nodes by mapping ranks to nodes within TAU
  • Cray utilities: nodeinfo, xtshowmesh, xtshowcabs
  • Ran a 6400 core simulation to identify XT3/XT4 partition performance issues (removed -feature=xt3)
Case Study
  • Harness testcase
  • Platform: Jaguar Combined Cray XT3/XT4 at ORNL
    • 6400p
  • Goal:
    • To evaluate the performance impact of combined XT3/XT4 nodes on S3D executions
    • Performance evaluation of MPI_Wait
    • Study mapping of MPI ranks to nodes
3D Scatter Plots
  • Plot four routines along X, Y, Z, and Color axes
  • Each routine has a range (max, min)
  • Each process (rank) has a unique position along the three axes and a unique color
  • Allows us to examine the distribution of nodes (clusters)
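The mapping described above can be sketched as follows. This is a minimal illustration, not TAU's own plotting code: each routine's per-rank values are rescaled to [0, 1] using that routine's (min, max) range, and the four rescaled values give a rank its X, Y, Z, and color coordinates. The routine names `loop_a` and `loop_b` and all numbers are hypothetical.

```python
def normalize(values):
    """Rescale a list of per-rank values to [0, 1] using its own min and max."""
    lo, hi = min(values), max(values)
    span = hi - lo
    # A routine with identical values on every rank collapses to 0.0.
    return [0.0 if span == 0 else (v - lo) / span for v in values]


def scatter_coordinates(profile, routines):
    """Return one (x, y, z, color) tuple per rank.

    profile  -- dict mapping a routine or metric name to a list of per-rank values
    routines -- exactly four names, assigned to the X, Y, Z, and color axes
    """
    if len(routines) != 4:
        raise ValueError("expected four routines: X, Y, Z, and color")
    axes = [normalize(profile[name]) for name in routines]
    return list(zip(*axes))


# Hypothetical per-rank data for four ranks (illustration only).
profile = {
    "MPI_Wait": [15.0, 435.0, 60.0, 30.0],
    "loop_a": [10.0, 5.0, 7.5, 10.0],
    "loop_b": [2.0, 1.0, 1.0, 2.0],
    "PAPI_L1_DCM": [4e6, 1e6, 2e6, 4e6],
}
points = scatter_coordinates(profile, ["MPI_Wait", "loop_a", "loop_b", "PAPI_L1_DCM"])
for rank, point in enumerate(points):
    print(rank, point)
```

Ranks that behave alike land close together in this normalized space, which is what makes clusters of slow nodes visible.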
3D Triangle Mesh Display
  • Plot MPI rank, routine name, and exclusive time along X, Y and Z axes
  • Color can be shown by a fourth metric
  • Scalable view
  • Suitable for very large number of processors
Zoom, Change Color to L1 Data Cache Misses
  • Loop in ComputeSpeciesDiffFlux (630-656) has high L1 DCMs (red)
  • Takes longer to execute on this “slice” of processors. So do other routines. Slower memory?
Changing Color to MFLOPS
  • Loop in ComputeSpeciesDiffFlux (630-656) has lower MFLOPS (dark blue)
Getting Back to MPI_Wait()
  • Why does MPI_Wait take less time on these cores?
  • What does the profile of MPI_Wait look like?
MPI_Wait - Sorted by Exclusive Time
  • MPI_Wait takes 435.84 seconds on rank 3101
  • It takes 59.6 s on rank 3233 and 29.2 s on rank 3200
  • It takes 15.49 seconds on rank 0!
  • How is rank 3101 different from rank 0?
Comparing PAPI Floating Point Instructions
  • PAPI_FP_INS are the same - as expected
Comparing Performance - MFLOPS
  • For the memory-intensive loop in ComputeSpeciesDiffFlux, rank 0 achieves about 65% of the MFLOPS of rank 3101 (114 vs. 174 MFLOPS)!
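The 65% figure can be checked directly from the two rates quoted above. The sketch below also derives the implied time ratio, under the assumption (supported by the identical PAPI_FP_INS counts on the previous slide) that both ranks execute the same number of floating-point instructions in this loop. The function name is illustrative only.

```python
def relative_rate(slow_mflops, fast_mflops):
    """Fraction of the faster rank's MFLOPS that the slower rank achieves."""
    return slow_mflops / fast_mflops


ratio = relative_rate(114.0, 174.0)  # rank 0 vs. rank 3101, from the slide
time_factor = 1.0 / ratio            # same FP work, so time scales inversely

print(f"rank 0 reaches {ratio:.1%} of rank 3101's rate")
print(f"rank 0 needs about {time_factor:.2f}x as long for the same loop")
```

With equal floating-point work, a rank running at roughly 65% of the rate spends about one and a half times as long in the loop, so faster ranks finish early and wait.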
Comparing MFLOPS: Rank 3101 vs Rank 0
  • Rank 0 appears to be “slower” than rank 3101
  • Are there other nodes that are similarly slow and have shorter wait times?
  • What does the MPI_Wait profile look like across all nodes?
MPI_Wait Profile

What is this rank?

MPI_Wait Profile Shifts at rank 114!
  • Ranks 0 through 113 take less time in MPI_Wait than the ranks from 114 onward
Another Shift in MPI_Wait()
  • This shift is observed in ranks 3200 through 3313
  • Again 114 processors... (like ranks 0 through 113)
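Shifts like these were spotted visually in the profile display, but the same idea can be sketched in code: scan the per-rank MPI_Wait times and report contiguous runs of ranks that fall below a threshold. The threshold, the minimum run length, and the sample data below are all assumptions chosen for illustration, not values from the study.

```python
def low_wait_runs(wait_times, threshold, min_length=2):
    """Return (first_rank, last_rank) for each contiguous run of ranks
    whose wait time is below `threshold` and that spans at least
    `min_length` ranks."""
    runs = []
    start = None
    for rank, seconds in enumerate(wait_times):
        if seconds < threshold:
            if start is None:
                start = rank          # a new low-wait run begins here
        elif start is not None:
            if rank - start >= min_length:
                runs.append((start, rank - 1))
            start = None
    # Close a run that extends to the last rank.
    if start is not None and len(wait_times) - start >= min_length:
        runs.append((start, len(wait_times) - 1))
    return runs


# Synthetic profile: two blocks of low-wait ranks among high-wait ranks.
wait = [20.0] * 3 + [300.0] * 4 + [25.0] * 3 + [300.0] * 2
print(low_wait_runs(wait, threshold=100.0))
```

Two runs of equal length, as with ranks 0 through 113 and 3200 through 3313, are a hint that the same set of physical nodes may be involved.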
  • Hmm...
  • How do other routines perform on these ranks?
  • What are the physical node ids?
MPI_Wait
  • While MPI_Wait takes less time on these CPUs, other routines take longer
  • Points to a load imbalance!
MetaData for Ranks 3200 and 0
  • Ranks 3200 and 0 both lie on the same physical node, nid03406!
Mapping Ranks from TAU to Physical Processors
  • Ranks 0..113 lie on processors 3406..3551
  • Ranks 3200..3313 are also on 3406..3551
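Once each rank's node name is known from the profile metadata, grouping ranks by node is straightforward. The sketch below assumes the mapping has already been extracted into a plain dictionary; it does not read TAU profile files. Only the pairing of ranks 0 and 3200 on nid03406 comes from the slides; the other node names are hypothetical placeholders.

```python
from collections import defaultdict


def ranks_by_node(rank_to_node):
    """Invert a rank -> node-name mapping into node-name -> sorted rank list."""
    nodes = defaultdict(list)
    for rank, node in sorted(rank_to_node.items()):
        nodes[node].append(rank)
    return dict(nodes)


def shared_nodes(rank_to_node):
    """Keep only the nodes that host more than one MPI rank."""
    return {node: ranks
            for node, ranks in ranks_by_node(rank_to_node).items()
            if len(ranks) > 1}


rank_to_node = {
    0: "nid03406",      # from the slides
    3200: "nid03406",   # from the slides
    114: "nid04000",    # hypothetical node name
    3101: "nid05000",   # hypothetical node name
}
print(shared_nodes(rank_to_node))
```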
Results from Cray’s nodeinfo Utility
  • Processors 3406..3551 (physical ids) are located on the XT3 partition
  • XT3 partition has slow DDR-400 memory (5986 MB/s)
  • XT3 has a slower SS1 (1109 MB/s) interconnect
  • XT4 partition has faster DDR2-667 memory modules (7147 MB/s) and a faster SeaStar2 (SS2) interconnect (2022 MB/s)
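To put the quoted bandwidths side by side, the short calculation below computes the XT4-to-XT3 ratio for memory and for the interconnect. The numbers are the ones listed above; the dictionary layout is only an illustration.

```python
# Bandwidths in MB/s as reported on the slide.
bandwidth = {
    "memory": {"XT3": 5986.0, "XT4": 7147.0},
    "interconnect": {"XT3": 1109.0, "XT4": 2022.0},
}

# Ratio of XT4 to XT3 bandwidth for each component.
ratios = {name: values["XT4"] / values["XT3"]
          for name, values in bandwidth.items()}

for name, ratio in ratios.items():
    print(f"{name}: XT4 is about {ratio:.2f}x the XT3 bandwidth")
```

By these figures the XT4 partition has roughly 1.2 times the memory bandwidth and 1.8 times the interconnect bandwidth of the XT3 partition.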
Location of Physical Nodes in the Cabinets
  • Using the Cray utilities xtshowcabs and xtshowmesh
  • All nodes marked with job “c” came from our S3D run
  • What does the mesh look like?
xtshowmesh (1 of 2)
  • Nodes marked with a “c” are from our S3D run
xtshowmesh (2 of 2)
  • Nodes marked with a “c” are from our S3D run
  • Using a combination of XT3/XT4 nodes slowed down parts of S3D
  • The application spends a considerable amount of time spinning/polling in MPI_Wait
  • The load imbalance is probably caused by non-uniform nodes
  • Conducted a performance characterization of S3D
  • This data will help derive communication models that explain the performance data observed [John Mellor-Crummey, Rice]
  • Techniques to improve cache memory utilization in the loops identified by TAU will help overall performance [SDSC, Rice]
  • I/O characterization of S3D will help identify I/O scaling issues
S3D - Building with TAU
  • Change name of compiler in build/make.XT3
    • ftn => tau_f90.sh
    • cc => tau_cc.sh
  • Set compile time environment variables
    • setenv TAU_MAKEFILE /spin/proj/perc/TOOLS/tau_latest/xt3/lib/Makefile.tau-nocomm-multiplecounters-mpi-papi-pdt-pgi
      • Disabled tracking message communication statistics in TAU
      • MPI_Comm_compare() is not called inside TAU’s MPI wrapper
      • Choose callpath, PAPI counters, MPI profiling, PDT for source instrumentation
    • setenv TAU_OPTIONS '-optTauSelectFile=select.tau -optPreProcess'
      • Selective instrumentation file eliminates instrumentation in lightweight routines
      • Pre-process Fortran source code using cpp before compiling
  • Set runtime environment variables for instrumentation control and PAPI counter selection in the job submission script:
    • export TAU_THROTTLE=1
    • export COUNTER2=PAPI_FP_INS
    • export COUNTER3=PAPI_L1_DCM
    • export COUNTER4=PAPI_TOT_INS
    • export COUNTER5=PAPI_L2_DCM
Selective Instrumentation in TAU

% cat select.tau

loops routine="#"


Getting Access to TAU on Jaguar
  • set path=(/spin/proj/perc/TOOLS/tau_latest/x86_64/bin $path)
  • Choose Stub Makefiles (TAU_MAKEFILE env. var.) from /spin/proj/perc/TOOLS/tau_latest/xt3/lib/Makefile.*
    • Makefile.tau-mpi-pdt-pgi (flat profile)
    • Makefile.tau-mpi-pdt-pgi-trace (event trace, for use with Vampir)
    • Makefile.tau-callpath-mpi-pdt-pgi (single metric, callpath profile)
  • Binaries of S3D can be found in:
      • ~sameer/scratch/S3D-BINARIES
        • withtau
          • papi, multiplecounters, mpi, pdt, pgi options
        • without_tau
Concluding Discussion
  • Performance tools must be used effectively
  • More intelligent performance systems for productive use
    • Evolve to application-specific performance technology
    • Deal with scale by “full range” performance exploration
    • Autonomic and integrated tools
    • Knowledge-based and knowledge-driven process
  • Performance observation methods do not necessarily need to change in a fundamental sense
    • More automatically controlled and efficiently used
  • Develop next-generation tools and deliver to community
  • Open source with support by ParaTools, Inc.
  • http://www.cs.uoregon.edu/research/tau
Support Acknowledgements
  • Department of Energy (DOE)
    • Office of Science
    • PERI