Multi-Experiment Performance Data Management and Data Mining


Allen D. Malony

Department of Computer and Information Science

Performance Research Laboratory

University of Oregon

Outline of Talk
  • Performance problem solving
    • Scalability, productivity, and performance technology
    • Application-specific and autonomic performance tools
  • TAU parallel performance system
  • Performance data management and data mining
    • Performance Data Management Framework (PerfDMF)
    • PerfExplorer
  • Multi-experiment case studies
    • Comparative analysis (PERC tool study)
    • Clustering analysis
  • Future work and concluding remarks
Research Motivation
  • Tools for performance problem solving
    • Empirical-based performance optimization process
    • Performance technology concerns

[Figure labels: Experiment management, Performance storage; Instrumentation, Measurement, Analysis, Visualization]
Challenges in Performance Problem Solving
  • How to make the process more effective (productive)?
  • Process may depend on scale of parallel system
    • Standard approaches deliver a lot of data with little value
  • What are the important events and performance metrics?
    • Tied to application structure and computational model
  • Process and tools can be more application-aware
    • Tools have poor support for application-specific aspects
  • What are the significant issues that will affect the technology used to support the process?
  • Enhance application development and benchmarking
  • New paradigm in performance process and technology
Role of Automation and Knowledge Discovery
  • Scale forces the process to become more intelligent
  • Even with intelligent and application-specific tools, the decision of what to analyze is difficult and can be intractable
  • More automation and knowledge-based decision making
  • Build autonomic capabilities into the tools
    • Support broader experimentation methods and refinement
    • Access and correlate data from several sources
    • Automate performance data analysis / mining / learning
    • Include predictive features and experiment refinement
  • Knowledge-driven adaptation and optimization guidance
  • Address scale issues through increased expertise
TAU Performance System
  • Tuning and Analysis Utilities (13+ year project effort)
  • Performance system framework for HPC systems
    • Integrated, scalable, flexible, and parallel
  • Targets a general complex system computation model
    • Entities: nodes / contexts / threads
    • Multi-level: system / software / parallelism
    • Measurement and analysis abstraction
  • Integrated toolkit for performance problem solving
    • Instrumentation, measurement, analysis, and visualization
    • Portable performance profiling and tracing facility
    • Performance data management and data mining
  • University of Oregon, Research Center Jülich, LANL
Important Questions for Application Developers
  • How does performance vary with different compilers?
  • Is poor performance correlated with certain OS features?
  • Has a recent change caused unanticipated performance?
  • How does performance vary with MPI variants?
  • Why is one application version faster than another?
  • What is the reason for the observed scaling behavior?
  • Did two runs exhibit similar performance?
  • How are performance data related to application events?
  • Which machines will run my code the fastest and why?
  • Which benchmarks predict my code performance best?
Performance Problem Solving Goals
  • Answer questions at multiple levels of interest
    • Data from low-level measurements and simulations
      • use to predict application performance
    • High-level performance data spanning dimensions
      • machine, applications, code revisions, data sets
      • examine broad performance trends
  • Discover general correlations between application performance and features of the external environment
  • Develop methods to predict application performance from lower-level metrics
  • Discover performance correlations between a small set of benchmarks and a collection of applications that represent a typical workload for a given system
Automatic Performance Analysis Tool (Concept)

PSU: Kathryn Mohror, Karen Karavanic

UO: Kevin Huck

LLNL: John May, Brian Miller (CASC)




ParaProf Performance Profile Analysis

Raw files










PerfExplorer (K. Huck, UO)
  • Performance knowledge discovery framework
    • Use the existing TAU infrastructure
      • TAU instrumentation data, PerfDMF
    • Client-server based system architecture
    • Data mining analysis applied to parallel performance data
  • Technology integration
    • Relational Database Management Systems (RDBMS)
    • Java API and toolkit
    • R-project / Omegahat statistical analysis
    • Web-based client
      • Jakarta web server and Struts (for a thin web-client)
PerfExplorer Architecture

Server accepts multiple client requests and returns results

Server supports R data mining operations built using RSJava

PerfDMF Java API used to access DBMS via JDBC

Client is a traditional Java application with GUI (Swing)

Analyses can be scripted, parameterized, and monitored

Browsing of analysis results via automatic web page creation and thumbnails

PERC Tool Requirements and Evaluation
  • Performance Evaluation Research Center (PERC)
    • DOE SciDAC
    • Evaluation methods/tools for high-end parallel systems
  • PERC tools study (led by ORNL, Pat Worley)
    • In-depth performance analysis of select applications
    • Evaluate performance analysis requirements
    • Test tool functionality and ease of use
  • Applications
    • Start with fusion code – GYRO
    • Repeat with other PERC benchmarks
    • Continue with SciDAC codes
GYRO Execution Parameters
  • Three benchmark problems
    • B1-std : 16n processors, 500 timesteps
    • B2-cy : 16n processors, 1000 timesteps
    • B3-gtc : 64n processors, 100 timesteps
  • Test different methods to evaluate nonlinear terms:
    • Direct method
    • FFT (“nl2” for B1 and B2, “nl1” for B3)
  • Task affinity enabled/disabled (p690 only)
  • Memory affinity enabled/disabled (p690 only)
  • Filesystem location (Cray X1 only)
Primary Evaluation Machines
  • Phoenix (ORNL – Cray X1)
    • 512 multi-streaming vector processors
  • Ram (ORNL – SGI Altix (1.5 GHz Itanium2))
    • 256 total processors
  • TeraGrid
    • ~7,738 total processors on 15 machines at 9 sites
  • Cheetah (ORNL – p690 cluster (1.3 GHz, HPS))
    • 864 total processors on 27 compute nodes
  • Seaborg (NERSC – IBM SP3)
    • 6080 total processors on 380 compute nodes
Region (Events) of Interest
  • Total program is measured, plus specific code regions
  • NL : nonlinear advance
  • NL_tr* : transposes before / after nonlinear advance
  • Coll : collisions
  • Coll_tr* : transposes before/after main collision routine
  • Lin_RHS : compute right hand side of the electron and ion GKEs (GyroKinetic (Vlasov) Equations)
  • Field : explicit or implicit advance of fields and solution of explicit Maxwell equations
  • I/O, extras


Data Collected Thus Far…
  • User timer data
    • Self instrumentation in the GYRO application
    • Outputs aggregate data per N timesteps
      • N = 50 (B1, B3)
      • N = 125 (B2)
  • HPM (Hardware Performance Monitor) data
    • IBM platform (p690) only
  • MPICL profiling/tracing
    • Cray X1 and IBM p690
  • TAU (all platforms, profiling/tracing, in progress)
  • Data processed by hand into Excel spreadsheets
PerfExplorer Analysis of Self-Instrumented Data
  • PerfExplorer
    • Focus on comparative analysis
    • Apply to PERC tool evaluation study
  • Look at user timer data
    • Aggregate data
      • no per process data
      • process clustering analysis is not applicable
    • Timings output every N timesteps
      • some phase analysis possible
  • Goal
    • Recreate manually generated performance reports
Comparative Analysis
  • Supported analysis
    • Timesteps per second
    • Relative speedup and efficiency
      • For entire application (compare machines, parameters, etc.)
      • For all events (on one machine, one set of parameters)
      • For one event (compare machines, parameters, etc.)
    • Fraction of total runtime for one group of events
    • Runtime breakdown (as a percentage)
  • Initial analysis implemented as scalability study
  • Future analysis
    • Arbitrary organization
    • Parametric studies
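
The supported metrics listed above can be sketched in a few lines. This is a minimal illustration, assuming the common definitions relative to the smallest measured processor count (the base case); the function names and all timing numbers are hypothetical and are not taken from PerfExplorer or from the GYRO study.

```python
def timesteps_per_second(timesteps, runtime_seconds):
    """Throughput of a run: simulated timesteps per wall-clock second."""
    return timesteps / runtime_seconds

def relative_speedup(base_time, time):
    """Speedup of a run relative to the base-case run."""
    return base_time / time

def relative_efficiency(base_procs, base_time, procs, time):
    """Relative speedup divided by the ideal speedup (procs / base_procs)."""
    return relative_speedup(base_time, time) / (procs / base_procs)

# Hypothetical total runtimes in seconds, keyed by processor count.
runtimes = {16: 400.0, 32: 220.0, 64: 125.0}
base_procs = min(runtimes)          # smallest run is the base case
base_time = runtimes[base_procs]

for procs in sorted(runtimes):
    s = relative_speedup(base_time, runtimes[procs])
    e = relative_efficiency(base_procs, base_time, procs, runtimes[procs])
    print(f"{procs:4d} procs: speedup {s:.2f}, efficiency {e:.2f}")
```

The same two functions apply per event rather than per application by passing the time of a single event (for example a transpose region) instead of the total runtime.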
PerfExplorer Interface


Select experiments and trials of interest

Data organized in application, experiment, trial structure

(arbitrary organization will be supported in the future)
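
The application, experiment, trial organization can be pictured as a three-level hierarchy. The sketch below is only an illustration of that nesting: the class names, fields, and sample values are hypothetical and do not reflect the actual PerfDMF schema or Java API.

```python
from dataclasses import dataclass, field

@dataclass
class Trial:
    name: str
    processors: int
    runtime_seconds: float

@dataclass
class Experiment:
    name: str
    trials: list = field(default_factory=list)

@dataclass
class Application:
    name: str
    experiments: list = field(default_factory=list)

def select_trials(app, experiment_names):
    """Return (experiment name, trial) pairs for the chosen experiments."""
    return [(exp.name, trial)
            for exp in app.experiments if exp.name in experiment_names
            for trial in exp.trials]

# Hypothetical data: one application, two experiments, three trials.
gyro = Application("GYRO", [
    Experiment("B1-std on machine A",
               [Trial("16 procs", 16, 400.0), Trial("32 procs", 32, 220.0)]),
    Experiment("B1-std on machine B",
               [Trial("16 procs", 16, 520.0)]),
])

selected = select_trials(gyro, {"B1-std on machine A"})
for exp_name, trial in selected:
    print(exp_name, trial.name, trial.runtime_seconds)
```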

PerfExplorer Interface

Select analysis

Timesteps per Second
  • Cray X1 is the fastest to solution in all 3 tests
  • FFT (nl2) improves time for B3-gtc only
  • TeraGrid faster than p690 for B1-std?
  • Plots generated automatically







Relative Efficiency (B1-std)
  • By experiment (B1-std)
    • Total runtime (Cheetah (red))
  • By event for one experiment
    • Coll_tr (blue) is significant
  • By experiment for one event
    • Shows how Coll_tr behaves for all experiments



16-processor base case

Relative Speedup (B2-cy)
  • By experiment (B2-cy)
    • Total runtime (X1 (blue))
  • By event for one experiment
    • NL_tr (orange) is significant
  • By experiment for one event
    • Shows how NL_tr behaves for all experiments
Fraction of Total Runtime (Communication)
  • IBM SP3 (cyan) has the highest fraction of total time spent in communication for all three benchmarks
  • Cray X1 has the lowest fraction in communication




Runtime Breakdown on IBM SP3
  • Communication grows as a percentage of total runtime as the application scales (colors match in graphs)
  • Both Coll_tr (blue) and NL_tr (orange) scale poorly
  • I/O (green) scales poorly, but its percentage of total runtime is small
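
A runtime breakdown and a group fraction of this kind can be computed as sketched below. The event names come from the regions of interest listed earlier, but the per-event times are hypothetical, and treating the two transpose events as the communication group is an assumption made only for this example.

```python
def runtime_breakdown(event_times):
    """Map each event to its percentage of the summed event time."""
    total = sum(event_times.values())
    return {event: 100.0 * t / total for event, t in event_times.items()}

def group_fraction(event_times, group):
    """Fraction of the summed event time spent in one group of events."""
    total = sum(event_times.values())
    return sum(t for event, t in event_times.items() if event in group) / total

# Hypothetical per-event times in seconds for a single trial.
times = {"NL": 30.0, "NL_tr": 10.0, "Coll": 20.0, "Coll_tr": 15.0,
         "Lin_RHS": 15.0, "Field": 8.0, "I/O": 2.0}

# Assumption: the transposes are counted as communication.
communication = {"NL_tr", "Coll_tr"}

breakdown = runtime_breakdown(times)
for event, pct in breakdown.items():
    print(f"{event:8s} {pct:5.1f}%")
print("communication fraction:", group_fraction(times, communication))
```

Repeating the calculation for each processor count gives the scaling view: an event whose percentage grows as processors are added is one that scales poorly.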
Clustering Analysis
  • “Scalable Analysis Techniques for Microprocessor Performance Counter Metrics,” Ahn and Vetter, SC2002
  • Applied multivariate statistical analysis techniques to large datasets of performance data (PAPI events)
  • Cluster Analysis and F-Ratio
    • Agglomerative Hierarchical Method - dendrogram identified groupings of master, slave threads in sPPM
    • K-means clustering and F-ratio - differences between master, slave related to communication and management
  • Factor Analysis
    • shows highly correlated metrics fall into peer groups
  • Combined techniques (recursively) lead to observations of application behavior that are hard to identify otherwise
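
To make the F-ratio idea concrete, the sketch below computes the standard one-way ANOVA F statistic for a single metric across clusters: between-cluster variance divided by within-cluster variance. This is an assumption about the general technique; the exact formulation used by Ahn and Vetter may differ, and the metric names and values are hypothetical.

```python
def f_ratio(groups):
    """One-way ANOVA F statistic for one metric across clusters.

    groups is a list of clusters, each a list of per-thread values.
    Raises ZeroDivisionError if every cluster has zero internal variance.
    """
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    means = [sum(g) / len(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, means))
    ss_within = sum((x - m) ** 2
                    for g, m in zip(groups, means) for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Hypothetical per-thread values, grouped as [master threads, worker threads].
metrics = {
    "metric_a": [[100.0, 102.0, 101.0], [130.0, 131.0, 129.0]],
    "metric_b": [[10.0, 14.0, 12.0], [11.0, 13.0, 12.0]],
}

for name, groups in metrics.items():
    print(name, f_ratio(groups))
```

A metric with a large F-ratio separates the clusters well, so it is a candidate explanation for why the threads were grouped apart; a metric with an F-ratio near zero does not distinguish them.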
Similarity Analysis
  • Can we recreate Ahn and Vetter’s results?
  • Apply techniques from the phase analysis (Sherwood)
    • Threads of execution can be compared for similarity
    • Threads with abnormal behavior show up as less similar
  • Each thread is represented as a vector (V) of dimension n
    • n is the number of functions in the application

V = [f_1, f_2, …, f_n] (represents the event mix)

    • Each value is the percentage of time spent in that function
      • normalized from 0.0 to 1.0
  • Distance calculated between the vectors U and V:

ManhattanDistance(U, V) = ∑ |u_i - v_i|, for i = 1, …, n
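
The thread vectors and the distance defined above can be sketched as follows. The per-function times are hypothetical, and normalizing by the thread's total time is an assumption about how the fractions are obtained.

```python
def event_mix(times):
    """Normalize per-function times so the values sum to 1.0."""
    total = sum(times)
    return [t / total for t in times]

def manhattan_distance(u, v):
    """Sum of absolute component-wise differences between two vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same dimension")
    return sum(abs(ui - vi) for ui, vi in zip(u, v))

# Hypothetical time (seconds) spent in three functions by three threads.
threads = {
    "thread 0": event_mix([50.0, 30.0, 20.0]),
    "thread 1": event_mix([48.0, 32.0, 20.0]),
    "thread 2": event_mix([10.0, 30.0, 60.0]),   # abnormal event mix
}

reference = threads["thread 0"]
for name, vector in threads.items():
    print(name, round(manhattan_distance(reference, vector), 3))
```

Because every vector sums to 1.0, the distance lies between 0.0 (identical event mix) and 2.0 (no time in common functions), so a thread with abnormal behavior stands out as a large distance from its peers.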

sPPM on Blue Horizon (64x4, OpenMP+MPI)
  • TAU profiles
  • 10 events
  • PerfDMF
  • threads 32-47
sPPM on MCR (total instructions, 16x2)
  • TAU/PerfDMF
  • 120 events
  • master (even)
  • worker (odd)
sPPM on MCR (PAPI_FP_INS, 16x2)
  • TAU profiles
  • PerfDMF
  • master/worker
  • higher/lower

Same result as Ahn/Vetter

sPPM on Frost (PAPI_FP_INS, 256 threads)
  • View of fewer than half of the threads of execution is possible on the screen at one time
  • Three groups are obvious:
    • Lower ranking threads
    • One unique thread
    • Higher ranking threads
      • 3% more FP
  • Finding subtle differences is difficult with this view
sPPM on Frost (PAPI_FP_INS, 256 threads)
  • Dendrogram shows 5 natural clusters:
    • Unique thread
    • High ranking master threads
    • Low ranking master threads
    • High ranking worker threads
    • Low ranking worker threads
  • TAU profiles
  • PerfDMF
  • R direct access to DM
  • R routine
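
The slides obtain the dendrogram from an R routine. As a rough illustration of what agglomerative hierarchical clustering does, here is a plain-Python sketch that repeatedly merges the two closest clusters. The choice of average linkage with Manhattan distance is an assumption, and the thread vectors are hypothetical.

```python
def manhattan(u, v):
    return sum(abs(a - b) for a, b in zip(u, v))

def average_linkage(c1, c2, points):
    """Mean pairwise distance between the members of two clusters."""
    total = sum(manhattan(points[i], points[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))

def agglomerate(points, k):
    """Merge the two closest clusters until only k remain.

    Returns (clusters, merges); merges lists each merge bottom-up,
    which is the information a dendrogram displays.
    """
    clusters = [[i] for i in range(len(points))]
    merges = []
    while len(clusters) > k:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = average_linkage(clusters[a], clusters[b], points)
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        merged = clusters[a] + clusters[b]
        clusters = [c for i, c in enumerate(clusters) if i not in (a, b)]
        clusters.append(merged)
    return clusters, merges

# Hypothetical normalized event mixes for five threads.
points = [[0.50, 0.50], [0.52, 0.48], [0.90, 0.10], [0.88, 0.12], [0.10, 0.90]]

clusters, merges = agglomerate(points, 3)
print(sorted(sorted(c) for c in clusters))
```

This exhaustive search is cubic in the number of threads, which is acceptable for a few hundred threads but is one reason a statistics package is the practical choice at larger scale.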


sPPM on Frost (PAPI_FP_INS, 256 threads)
  • After K-means clustering into 5 clusters
    • Similar clusters are formed (seed with group means)
    • Each cluster’s performance characteristics analyzed
  • Dimensionality reduction (256 threads to 5 clusters!)
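
Seeding K-means with group means can be sketched as below: the initial centroids are the means of groups found beforehand (for example from a dendrogram), and Lloyd's algorithm then refines the assignment. This is a minimal plain-Python illustration with squared Euclidean distance, not the R routine used in the slides; the data and the initial groups are hypothetical.

```python
def sq_dist(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v))

def mean(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def kmeans(points, seeds, max_iter=100):
    """Lloyd's algorithm starting from the given seed centroids."""
    centroids = [list(s) for s in seeds]
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(len(centroids)),
                          key=lambda c: sq_dist(p, centroids[c]))
                      for p in points]
        if new_labels == labels:      # assignments stable: converged
            break
        labels = new_labels
        for c in range(len(centroids)):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:               # keep the old centroid if a cluster empties
                centroids[c] = mean(members)
    return labels, centroids

# Hypothetical thread vectors and groups taken from a prior hierarchical step.
points = [[0.50, 0.50], [0.52, 0.48], [0.90, 0.10], [0.88, 0.12], [0.10, 0.90]]
groups = [[0, 1], [2, 3], [4]]
seeds = [mean([points[i] for i in g]) for g in groups]

labels, centroids = kmeans(points, seeds)
print(labels)
```

Each resulting centroid summarizes one cluster, so subsequent analysis can examine a handful of centroids instead of every thread.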





Barrier [OpenMP:runhyd3.F <604,0>]






Current and Future Work
  • ParaProf
    • Developing 3D performance displays
  • PerfDMF
    • Adding new database backends and distributed support
    • Building support for user-created tables
  • PerfExplorer
    • Extending comparative and clustering analysis
    • Adding new data mining capabilities
    • Building in scripting support
  • Performance regression testing tool (PerfRegress)
  • Integrate in Eclipse Parallel Tools Platform (PTP)
Concluding Discussion
  • Performance tools must be used effectively
  • More intelligent performance systems for productive use
    • Evolve to application-specific performance technology
    • Deal with scale by “full range” performance exploration
    • Autonomic and integrated tools
    • Knowledge-based and knowledge-driven process
  • Performance observation methods do not necessarily need to change in a fundamental sense
    • More automatically controlled and efficiently used
  • Develop next-generation tools and deliver to community
Support Acknowledgements
  • Department of Energy (DOE)
    • Office of Science contracts
    • University of Utah ASCI Level 1 sub-contract
    • ASC/NNSA Level 3 contract
  • NSF
    • High-End Computing Grant
  • Research Centre Juelich
    • John von Neumann Institute
    • Dr. Bernd Mohr
  • Los Alamos National Laboratory