
The Case for Monitoring and Testing



Presentation Transcript


  1. The Case for Monitoring and Testing
  David Montoya, CScADS, July 15, 2013
  LA-UR-13-25132

  2. From a Production Computing Perspective
  Where do traditional performance analysis tools fit in the process, and what is the usage model?
  • Low use: high usage entry cost / skill required
  What usage model will increase awareness, improve application efficiency, and drive environment efficiency?
  • Monitor the health of both applications and system resources
  • Baseline and track (a minimal sketch follows this slide)
  • Strike the proper balance of tools to track and probe
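
To make "baseline and track" concrete, here is a minimal sketch, assuming a hypothetical get_metric() sampler (the load-average read below is just a stand-in): it keeps a rolling baseline of one node metric and flags samples that drift well outside it.

    import time
    from collections import deque
    from statistics import mean, stdev

    WINDOW = 60          # samples kept in the rolling baseline
    THRESHOLD_SIGMA = 3  # flag samples this many std-devs from baseline

    def get_metric():
        """Hypothetical sampler: here, the 1-minute load average from /proc."""
        with open("/proc/loadavg") as f:
            return float(f.read().split()[0])

    def baseline_and_track(period_s=5):
        history = deque(maxlen=WINDOW)
        while True:
            value = get_metric()
            if len(history) >= 10:  # need some history before judging
                mu, sigma = mean(history), stdev(history)
                if sigma > 0 and abs(value - mu) > THRESHOLD_SIGMA * sigma:
                    print(f"anomaly: {value:.2f} vs baseline {mu:.2f} +/- {sigma:.2f}")
            history.append(value)
            time.sleep(period_s)

    if __name__ == "__main__":
        baseline_and_track()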

  3. Target Usage – Monitoring and Testing
  User
  • Understand how applications are utilizing platform resources
  • Diagnose problems
  • Adjust the mapping of processes onto resources to optimize for minimum resource use, minimum power consumption, or shortest run time
  System/Software Administrators
  • Diagnose problems / discover root causes
  • Ensure the health and balance of the system
  • Mitigate the effects of errors
  • Develop better utilization policies for all resources
  System Architects
  • Develop a deep understanding of interactions between system components (hardware, firmware, system software, application)
  • Develop new architectural features to address current shortcomings

  4. Current State of Affairs
  • It is no longer enough to analyze the performance of the application alone. A wide range of evolving node/processor architectures forces closer assessment of the environment.
  • With the increasing scale of resources and the compute environment, machine failure rates come to the forefront (MTTF / MTTI).
  • New resources such as burst buffers, file system architectures, I/O approaches (e.g., PLFS), tools, and programming models all impact resource utilization and performance.
  • Issues such as power management are having a larger impact.

  5. Moving Toward Tighter Integration
  • As scale increases, the computing architecture becomes more integrated, with sub-systems providing services; distributed approaches for those services are evolving.
  • Additional, more tightly integrated run-time systems are evolving.
  • We have come full circle: compute environments are no longer individual components or loosely coupled systems but architected systems that need to behave in a more holistic manner.
  • The focus of HPC performance analysis needs to move from application performance alone to the application's ability to perform in a given computing environment, and to the environment's own performance.
  • This is a move toward balance and resource utilization, and it targets application flexibility.

  6. The Current Tool Box and Its Evolution
  • Typical monitoring systems target failure detection, uptime, and resource state/trend overview:
    • Information targeted at system administration
    • Collection intervals of minutes
    • Relatively high overhead (on both compute nodes and aggregators)
  • Application profiling/debugging/tracing tools:
    • Collection intervals of sub-seconds (even sub-milliseconds)
    • Typically require linking (i.e., the tool may perturb the application profile)
    • Limits on scale
    • Do not account for external applications competing for the same resources
  • Monitoring tool example: the Lightweight Distributed Metric Service (LDMS):
    • Continuous data collection, transport, and storage as a system service
    • Targets system administrators, users, and applications
    • Enables collection of a reasonably large number of metrics, with collection periods that enable job-centric resource-utilization analysis and run-time anomaly detection
    • Variable collection period (~seconds)
    • On-node interface to run-time data (a sampler sketch follows this slide)
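
To make the collection model concrete, below is a minimal sketch of an LDMS-style on-node sampler. It does not use the actual LDMS API; the publish hook and metric choices are assumptions. It only illustrates the pattern the slide describes: periodically reading cheap kernel counters at a configurable period and handing each metric set off for transport.

    import time

    def sample_node_metrics():
        """Read a few cheap kernel counters, as a node sampler plugin would."""
        metrics = {}
        with open("/proc/stat") as f:
            # First line: cpu  user nice system idle iowait ...
            fields = f.readline().split()
            metrics["cpu_user"] = int(fields[1])
            metrics["cpu_system"] = int(fields[3])
            metrics["cpu_idle"] = int(fields[4])
        with open("/proc/meminfo") as f:
            for line in f:
                key, value = line.split(":")
                if key in ("MemTotal", "MemFree", "MemAvailable"):
                    metrics[key] = int(value.split()[0])  # kB
        return metrics

    def run_sampler(publish, period_s=1.0):
        """Collect at a fixed period and hand each sample to a transport hook."""
        while True:
            sample = sample_node_metrics()
            sample["timestamp"] = time.time()
            publish(sample)  # in a real service this would go to an aggregator
            time.sleep(period_s)

    if __name__ == "__main__":
        run_sampler(print, period_s=2.0)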

  7. How Do You Move Forward? Data and Integration
  • You need to understand the health of the system and where there is stress, and tie that back to application behavior.
  • This keeps aspects of traditional application analysis, but adds system monitoring of all key subsystems, with the ability to assess the impact of application behavior and resource interaction.
  • Integration of the data provides assessment of the application and the various subsystems, and then the ability to apply solutions to better balance the system, enact efficiencies, and establish throughput.
  Monitoring and Testing
  • Collect system and subsystem data: network, file systems, compute nodes, resource manager data, etc. (a job-centric join of such data is sketched after this slide)
  • Currently collaborating with monitoring tool developers (SNL, others); taking inventory via a Monitoring and Testing Summit.
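
One concrete form of that integration is joining resource manager records with node-level samples to get job-centric utilization. The sketch below is illustrative only; the Job and Sample records are assumed to have already been parsed out of scheduler logs and a metric store, and it simply averages CPU idle per job over the job's nodes and time window.

    from dataclasses import dataclass

    @dataclass
    class Job:                # from the resource manager (e.g., scheduler logs)
        job_id: str
        nodes: set
        start: float
        end: float

    @dataclass
    class Sample:             # from a node-level monitor (e.g., a metric store)
        node: str
        timestamp: float
        cpu_idle_pct: float

    def job_centric_idle(jobs, samples):
        """Average CPU idle per job, over the job's nodes and time window."""
        report = {}
        for job in jobs:
            vals = [s.cpu_idle_pct for s in samples
                    if s.node in job.nodes and job.start <= s.timestamp <= job.end]
            report[job.job_id] = sum(vals) / len(vals) if vals else None
        return report

    # Toy usage with made-up records:
    jobs = [Job("1234", {"nid001", "nid002"}, 100.0, 200.0)]
    samples = [Sample("nid001", 150.0, 40.0), Sample("nid002", 160.0, 60.0)]
    print(job_centric_idle(jobs, samples))  # {'1234': 50.0}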

  8. LANL Monitoring and Testing Summit
  Monitoring / Testing Frameworks:
  • Splunk
  • Zenoss
  • RabbitMQ
  • LDMS framework / monitoring infrastructure
  • OVIS – HPC system analysis
  • Gazebo testing framework
  • CTS testing framework
  Application:
  • MTT – OpenMPI testing
  • Darshan – IO analysis
  • EAP and LAP dashboards
  • ByFL
  Network:
  • IB performance monitoring
  • IB monitoring
  • IB error monitoring
  • ibperf_seq, ibperf_ring, ibperf_agg, mpiring (a toy ring test is sketched after this slide)
  • IDS project (security)
  • Network monitoring in Splunk
  • DISCOM testing – Tri-lab data transfer, Cat 2 function/performance testing
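
As an illustration of what a ring test such as mpiring measures, here is a toy version using mpi4py; it is not the LANL tool, and the buffer size and iteration count are arbitrary assumptions. Each rank exchanges a buffer with its ring neighbors, and rank 0 reports a rough per-hop bandwidth.

    import numpy as np
    from mpi4py import MPI

    def ring_test(n_bytes=1 << 20, iters=10):
        """Pass a buffer around the ring; report per-hop bandwidth on rank 0."""
        comm = MPI.COMM_WORLD
        rank, size = comm.Get_rank(), comm.Get_size()
        right, left = (rank + 1) % size, (rank - 1) % size
        buf = np.zeros(n_bytes, dtype=np.uint8)
        recv = np.empty_like(buf)

        comm.Barrier()
        t0 = MPI.Wtime()
        for _ in range(iters):
            # Combined send/receive avoids deadlock around the ring
            comm.Sendrecv(buf, dest=right, recvbuf=recv, source=left)
        elapsed = MPI.Wtime() - t0
        if rank == 0:
            gbps = n_bytes * iters / elapsed / 1e9
            print(f"{size} ranks: {gbps:.2f} GB/s per hop (approx)")

    if __name__ == "__main__":
        ring_test()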

  9. LANL Monitoring and Testing Summit – cont.
  File Systems:
  • File systems monitoring in Splunk
  • New system integration testing
  • FStools – file system tools for the users
  • File system tree walk (a toy walk is sketched after this slide)
  • File System Health Check
  • Splunk FTA monitoring
  • PLFS regression and performance testing
  • Panfs release file system testing and analysis
  • HPSS monitoring
  Cluster/Node:
  • Baler – log file analysis tool
  • LDMS node collection
  • Automatic Library Tracking Database (ALTD) – general software usage tracking
  • Cielo DRAM and SRAM monitoring
  • HPCSTATs (reporting more than monitoring) – Moab logs
  • CBTF-based GPU/NVIDIA monitoring
  • GPU/cluster testing
  • SecMon – security monitoring via Zenoss
  • Splunk cluster testing
  • New system integration
  • Post-DST / utilization testing
  • Software testing
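
As a hint at what a "file system tree walk" check involves, here is a toy sketch (not any of the LANL tools listed above; the slow-stat threshold is an assumption): it walks a directory tree, timing stat() on every entry and reporting failures and slow operations.

    import os
    import time

    SLOW_STAT_S = 0.5  # flag any stat() slower than this (assumed threshold)

    def tree_walk_check(root):
        """Walk a tree, timing stat() on every entry; report errors and slow ops."""
        problems = []
        for dirpath, dirnames, filenames in os.walk(root, onerror=problems.append):
            for name in dirnames + filenames:
                path = os.path.join(dirpath, name)
                t0 = time.monotonic()
                try:
                    os.stat(path)
                except OSError as exc:
                    problems.append(exc)
                    continue
                if time.monotonic() - t0 > SLOW_STAT_S:
                    problems.append(f"slow stat ({path})")
        return problems

    if __name__ == "__main__":
        for p in tree_walk_check("/tmp"):
            print(p)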

  10. Next Steps
  • Assess efforts, integrate
  • Assess data, integrate
  • Assess information views for target users, integrate
  • Start over
