How Early is too Early to Plan for Operational Readiness?

Presentation Transcript


  1. How Early is too Early to Plan for Operational Readiness? Sadaf Alam, Chief Architect and Head of HPC Operations, Swiss National Supercomputing Centre (CSCS), Switzerland. 2014 Smoky Mountains Computational Sciences and Engineering Conference

  2. How Early is too Early to Plan for Operational Readiness? A Proposal for a Robust505 List. Sadaf Alam, Chief Architect and Head of HPC Operations, Swiss National Supercomputing Centre (CSCS), Switzerland. (Slide annotated "Late. Late.")

  3. Outline (* compute, data, visualization, etc.)

  4. (Image-only slide.)

  5. Computing Systems @ CSCS (http://www.cscs.ch/computers)
  • User Lab: Cray XC30 with GPU devices
  • User Lab: Cray XE6
  • User Lab R&D: Cray XK7 with GPU devices
  • User Lab: InfiniBand Cluster
  • User Lab: InfiniBand Cluster
  • User Lab: SGI Altix
  • User Lab: Cray XMT
  • User Lab: InfiniBand Cluster with GPU devices
  • MeteoSwiss: Cray XE6
  • EPFL Blue Brain Project: IBM BG/Q & visualization cluster
  • PASC: InfiniBand Cluster
  • LCG Tier-2: InfiniBand Cluster
  ... and several T&D systems (incl. an IBM iDataPlex M3 with GPUs and two dense GPU servers), plus the networking and storage infrastructure.

  6. Customers, Users & Operational Responsibilities
  • Customers' & users' priorities:
    • Robust and sustainable performance for production-level simulations
    • Debugging and performance-measurement tools to identify and isolate issues (e.g. TAU, Vampir, vendor tools, DDT, TotalView)
  • 24/7 operational support considerations:
    • Monitoring for degradation and failures, isolating components as needed (e.g. Ganglia, customized vendor interfaces); a minimal health-check sketch follows this slide
    • Quick diagnostics and fixes of known problems
    • Alerting mechanisms for on-call services (e.g. Nagios)
  • Realities of using bleeding-edge technologies: tools are primarily available for non-accelerated clusters running MPI & OpenMP applications, plus the processors, memories, NICs, ...
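To make the monitoring and alerting bullet concrete: below is a minimal sketch (not CSCS's actual tooling) of a Nagios-style GPU health check that shells out to nvidia-smi and maps the result onto the standard plugin exit codes. The temperature thresholds and the exact query fields are assumptions to adapt per GPU model, driver version, and site policy.

```python
#!/usr/bin/env python
"""Nagios-style GPU health check: a minimal sketch, not production tooling."""
import subprocess
import sys

# Nagios plugin exit codes
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

TEMP_WARN_C = 85   # assumed thresholds; tune per GPU model and site policy
TEMP_CRIT_C = 95

def main():
    try:
        out = subprocess.check_output(
            ["nvidia-smi",
             "--query-gpu=index,temperature.gpu,ecc.errors.uncorrected.volatile.total",
             "--format=csv,noheader,nounits"],
            universal_newlines=True)
    except (OSError, subprocess.CalledProcessError) as err:
        print("UNKNOWN: nvidia-smi failed: %s" % err)
        return UNKNOWN

    status, messages = OK, []
    for line in out.strip().splitlines():
        idx, temp, ecc = [f.strip() for f in line.split(",")]
        temp = int(temp)
        if temp >= TEMP_CRIT_C or (ecc.isdigit() and int(ecc) > 0):
            status = CRITICAL
            messages.append("GPU%s temp=%dC ecc_uncorr=%s" % (idx, temp, ecc))
        elif temp >= TEMP_WARN_C:
            status = max(status, WARNING)
            messages.append("GPU%s temp=%dC" % (idx, temp))

    print(["OK", "WARNING", "CRITICAL"][status] + ": " +
          ("; ".join(messages) or "all GPUs nominal"))
    return status

if __name__ == "__main__":
    sys.exit(main())
```

An on-call alerting system such as Nagios would run this check periodically on each GPU node and page operators only on WARNING/CRITICAL results.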

  7. Piz Daint: Applications readiness -> installation -> operation

  8. (Timeline figure, 2009-2015; timelines & releases are not precise.) Tracks: application investment & engagement, training and workshops, prototypes & early-access parallel systems, and HPC installation and operations. Elements shown include the High Performance High Productivity Computing (HP2C) program and training, the Platform for Advanced Scientific Computing (PASC) with its conferences & workshops, prototypes with accelerator devices, GPU nodes in a viz cluster, a GPU cluster, Cray XK6, Cray XK7, Cray XC30, the adaptive Cray XC30, and Green500/Top500 entries.

  9. (Timeline figure, 2009-2015; timelines & releases are not precise.) Tracks: requirements analysis, applications development and tuning, and 24/7 monitoring & troubleshooting. Systems: x86 cluster with C2070, M2050 and S1070 GPUs; iDataPlex cluster with M2090; Cray XK6; Cray XK7; Cray XC30 & hybrid XC30; testbed with Kepler & Xeon Phi. Programming interfaces: CUDA 2.x through 6.x, OpenCL 1.0/1.1/1.2/2.0, OpenACC 1.0 and 2.0, GPU-enabled MPI & MPS, GPUDirect, and GPUDirect-RDMA. Monitoring & tooling: Tesla Deployment Kit (TDK) with NVML & healthmon Ganglia plugins, GPU Deployment Kit (GDK) with NVML & healthmon v2, Cray PMDB, Node Health Check & RUR, plus custom solutions & integration at CSCS on a case-by-case basis.
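As an illustration of the NVML-based monitoring path named in the timeline (NVML & healthmon Ganglia plugins), here is a minimal collector sketch using the pynvml bindings; the metric names and output format are assumptions, not the actual CSCS/Ganglia plugin.

```python
"""Minimal NVML metric collector: a sketch of what a Ganglia-style GPU plugin
gathers, not the actual NVML/healthmon plugin used in production."""
import pynvml  # NVML Python bindings (assumed available on the node)

def collect_gpu_metrics():
    """Return a dict of per-GPU metrics read through NVML."""
    pynvml.nvmlInit()
    metrics = {}
    try:
        for i in range(pynvml.nvmlDeviceGetCount()):
            handle = pynvml.nvmlDeviceGetHandleByIndex(i)
            util = pynvml.nvmlDeviceGetUtilizationRates(handle)
            metrics["gpu%d.temp_c" % i] = pynvml.nvmlDeviceGetTemperature(
                handle, pynvml.NVML_TEMPERATURE_GPU)
            metrics["gpu%d.util_pct" % i] = util.gpu
            metrics["gpu%d.mem_util_pct" % i] = util.memory
            # Volatile (since last reboot) uncorrected ECC errors, if supported
            try:
                metrics["gpu%d.ecc_uncorr" % i] = pynvml.nvmlDeviceGetTotalEccErrors(
                    handle,
                    pynvml.NVML_MEMORY_ERROR_TYPE_UNCORRECTED,
                    pynvml.NVML_VOLATILE_ECC)
            except pynvml.NVMLError:
                pass  # ECC counters not available on this device
    finally:
        pynvml.nvmlShutdown()
    return metrics

if __name__ == "__main__":
    for name, value in sorted(collect_gpu_metrics().items()):
        print("%s %s" % (name, value))
```

A Ganglia integration would push each key/value pair as a gmetric instead of printing it; the collection logic stays the same.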

  10. Classification of NVIDIA Tools and Interfaces. (Figure: tools grouped by audience, users & code developers vs. sys admins, with a note that additional effort is required for integration into a cluster environment.)

  11. Case Study #1
  • Finding and resolving bugs in the GPU driver
  • Intermittent bug appears only at scale, on 1K+ GPU devices
  • Error code can be confused with user programming bugs
  • Users do not see the error code; it is recorded only in console logs (see the log-scanning sketch after this slide)
  • Availability of a driver patch
  • Validation of the patch by vendor & OEM
  • Driver patch evaluation and regression
  • Deployment of a driver patch == a major intervention
  • Verification and resumption of operations
  • ... until a new, unknown issue is triggered
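Since the error code in Case Study #1 lands only in console logs, a simple operational aid is a log scan. The following hypothetical sketch looks for NVIDIA driver Xid messages and separates codes already in a known-errors list from new ones; the log path, regular expression, and known-code set are assumptions for illustration.

```python
"""Sketch: scan console/syslog output for NVIDIA driver Xid errors and split
them into 'known' vs 'unknown' codes. Illustrative only; the log path,
message format, and known-code set are assumptions."""
import re
import sys
from collections import Counter

# Hypothetical known-errors list (would normally come from a KEDB)
KNOWN_XID_CODES = {13, 31, 43, 79}

XID_RE = re.compile(r"NVRM: Xid \((?P<dev>[^)]+)\): (?P<code>\d+)")

def scan(log_path):
    known, unknown = Counter(), Counter()
    with open(log_path) as log:
        for line in log:
            match = XID_RE.search(line)
            if not match:
                continue
            code = int(match.group("code"))
            (known if code in KNOWN_XID_CODES else unknown)[code] += 1
    return known, unknown

if __name__ == "__main__":
    path = sys.argv[1] if len(sys.argv) > 1 else "/var/log/messages"
    known, unknown = scan(path)
    print("known Xid codes:  ", dict(known))
    print("unknown Xid codes:", dict(unknown))
```

Any code in the "unknown" bucket is a candidate for escalation to the vendor rather than a quick fix from the known-errors database.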

  12. Case Study #2: Enabling privileged modes for legitimate use cases. Implemented to support application needs. (Excerpt for the K20X, extracted from the NVML reference document.)
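To make the privileged-modes point concrete: NVML query calls are unprivileged, while the corresponding set calls (compute mode, application clocks) require root, which is why they have to be brokered for users. Below is a minimal sketch with the pynvml bindings; the specific calls and clock values are illustrative, not the exact set documented for the K20X on the slide.

```python
"""Sketch of the privileged/unprivileged split in NVML: reads work as an
ordinary user, writes such as compute mode or application clocks need root.
Illustrative only; not the exact set of modes brokered at CSCS."""
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

# Unprivileged query: any user can read the current compute mode.
mode = pynvml.nvmlDeviceGetComputeMode(handle)
print("current compute mode:", mode)

# Privileged operations: these fail with a permission error unless run as root.
try:
    pynvml.nvmlDeviceSetComputeMode(handle, pynvml.NVML_COMPUTEMODE_EXCLUSIVE_PROCESS)
    # Example application clocks (memory MHz, graphics MHz); valid pairs are
    # device-specific, e.g. query nvmlDeviceGetSupportedMemoryClocks first.
    pynvml.nvmlDeviceSetApplicationsClocks(handle, 2600, 732)
except pynvml.NVMLError as err:
    print("privileged call refused (expected for non-root users):", err)

pynvml.nvmlShutdown()
```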

  13. (Cartoon: three users, each saying "I want root permission".)

  14. Enabling privileged modes via the resource manager: allows users to use default mode, visualization mode, clock-frequency boost, etc. without compromising the default operational settings.
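One plausible way to broker those modes, sketched under assumptions rather than as CSCS's actual implementation, is a root-run job prolog that applies a user-requested GPU mode with nvidia-smi, paired with an epilog that restores site defaults. The mode table, clock values, and request mechanism below are hypothetical.

```python
#!/usr/bin/env python
"""Hypothetical job prolog: apply a user-requested GPU mode at job start with
root-only nvidia-smi calls, so users never need root themselves.
The mode table, clock values, and request mechanism are illustrative."""
import os
import subprocess

# Hypothetical mapping from a user-visible mode name to privileged settings.
MODES = {
    "default": [["nvidia-smi", "-c", "EXCLUSIVE_PROCESS"]],  # site default
    "viz":     [["nvidia-smi", "-c", "DEFAULT"]],            # shared/visualization use
    "boost":   [["nvidia-smi", "-ac", "2600,732"]],          # application clocks, MHz
}

def apply_mode(mode_name):
    """Run the privileged nvidia-smi commands for the requested mode."""
    for cmd in MODES.get(mode_name, MODES["default"]):
        subprocess.check_call(cmd)

if __name__ == "__main__":
    # How the request reaches the prolog (job constraint, generic resource,
    # forwarded environment) is site-specific; an env var keeps the sketch short.
    apply_mode(os.environ.get("GPU_MODE", "default"))
    # A matching epilog would restore defaults, e.g. `nvidia-smi -rac` and
    # `nvidia-smi -c EXCLUSIVE_PROCESS`.
```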

  15. Work in Progress
  • Monitoring interfaces and diagnostics
    • Add new ones as they are identified; feed the findings back to vendors
  • Extending the logic used to interpret logs
  • Implementing new alerts
  • Early identification of degradation of components (see the regression sketch after this slide)
    • Partly identified by the regression suite
    • ... still, some alarms are first triggered by users
  • Unintended consequence: reduced user productivity and service quality, and with it the credibility of the service provider
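As an illustration of catching degradation through a regression suite before users do, here is a hypothetical check that compares a nightly micro-benchmark result against a stored per-node baseline and alerts when it drifts beyond a tolerance; the metric, file format, and threshold are assumptions.

```python
"""Hypothetical regression check: compare a nightly micro-benchmark result
against a per-node baseline and flag degradation before users notice it.
Baseline file format, metric, and tolerance are illustrative."""
import json
import sys

TOLERANCE = 0.10  # alert if performance drops more than 10% below baseline

def check(baseline_path, results_path):
    with open(baseline_path) as f:
        baseline = json.load(f)   # e.g. {"nid00012": 165.2, ...} in GB/s
    with open(results_path) as f:
        results = json.load(f)

    degraded = []
    for node, measured in results.items():
        expected = baseline.get(node)
        if expected and measured < expected * (1.0 - TOLERANCE):
            degraded.append((node, measured, expected))
    return degraded

if __name__ == "__main__":
    degraded = check("baseline_bandwidth.json", "last_run_bandwidth.json")
    for node, measured, expected in degraded:
        print("ALERT %s: %.1f GB/s vs baseline %.1f GB/s" % (node, measured, expected))
    sys.exit(1 if degraded else 0)
```

The nonzero exit status can feed the same Nagios-style alerting path as the health checks, so degradation surfaces as an operator alarm rather than a user ticket.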

  16. Robust505 List: Incentivizing Operational Readiness for Vendors & Service Providers

  17. Robust505

  18. Proposal & Guidelines
  • Zero to minimal overhead for making a submission
  • Metrics (TBD; a sketch of a possible submission record follows this slide):
    • Data collection and reporting for Top500 runs
    • Uptime
    • Failure classification (known vs. unknown)
    • Self-healing vs. intervention, i.e. unscheduled maintenance
  • Known errors database (KEDB)
    • Faster workarounds & quicker resumption of service to users
    • Knowledge sharing
  • Ganglia and Nagios integration and completeness (main system, ecosystem, file system)
  • Best practices from other service providers, e.g. cloud
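Purely to make the proposed (still TBD) metrics tangible, here is a hypothetical sketch of what a Robust505 submission record might look like; every field name and the derived ratio are assumptions, not part of the proposal.

```python
"""Hypothetical Robust505 submission record, sketching the metrics listed on
the slide (uptime, failure classification, KEDB coverage, monitoring
integration). All field names are assumptions; the metrics are marked TBD."""
from dataclasses import dataclass, field

@dataclass
class Robust505Entry:
    system_name: str
    site: str
    top500_rank: int                 # reported alongside the Top500 run data
    uptime_pct: float                # availability over the reporting period
    known_failures: int              # failures matched to the known errors database
    unknown_failures: int            # previously unseen failure modes
    self_healed: int                 # resolved without intervention
    unscheduled_interventions: int   # required unscheduled maintenance
    kedb_entries: int                # size of the known errors database
    monitoring_coverage: dict = field(default_factory=dict)  # e.g. {"ganglia": "main+fs"}

    def known_failure_ratio(self):
        """Fraction of failures that were already known (higher is better)."""
        total = self.known_failures + self.unknown_failures
        return self.known_failures / total if total else 1.0
```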

  19. Next Steps
  • Form a working group to make a concrete proposal
    • Include future requirements for integrating additional services, e.g. big data
  • Find volunteers to iron out the details
  • Explore opportunities to gain leverage through upcoming deployments, e.g. the Trinity and CORAL installations

  20. Final Thoughts. (Two sketch plots of success over time: wearing a computer scientist hat, success is performance; wearing an operational staff hat, success is no unscheduled downtime, minimal service interruptions, and a low average slowdown for production users.)

  21. Thank you
