
Redfish Aggregator for HPC monitoring and control



Presentation Transcript


  1. Redfish Aggregator for HPC monitoring and control Ghazanfar Ali, Doctoral Student, TTU ghazanfar.ali@ttu.edu Advisors: Mr. Lowell Wofford, LANL Mr. Cory Donald Lueninghoener, LANL Dr. Alan Sill, Managing Director, HPCC, TTU Dr. Yong Chen, Site Director, NSF Cloud and Autonomic Computing, TTU Ultrascale Systems Research Center (USRC) Los Alamos National Laboratory 9/17/2019

  2. Agenda • Background & Issues Analysis • Proposed: Definition, Capabilities, Architecture • Conclusions

  3. Background: An Example of a Traditional Monitoring Framework • [Diagram: Redfish-enabled TTU Quanah cluster (Node 1 … Node 467) and a job scheduler feeding an Aggregation Service (GET polling) via the UGE REST API, with Nagios Core, InfluxDB, and visual analytics downstream.] • Identified gaps: • Lack of framework-oriented group operations (parallelism, timeout) • Lack of profiling/classification of resources/metrics • Lack of metrics caching/interleaving • Lack of a distributed aggregator • Lack of a push-based communication model (Server-Sent Events (SSE)) • Manual resource discovery • Lack of universal/unified interfacing with other HPC infrastructure • Lack of exertion of systematic control over HPC infrastructure

  4. Background: Group Operations • To enable a Redfish client to acquire data from, or push data to, large clusters: • The client must implement a fan-out strategy to perform cluster-wide operations. • The client must implement a predetermined timeout for operation-result convergence. • The client must handle failed requests. • Therefore, support for framework-oriented parallelism and related mechanisms is crucial to enable monitoring and management at massive cluster scale.
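The bullets above can be sketched as a framework-level group operation: fan a request out to all nodes in parallel, bound the whole operation with one timeout, and collect failures instead of aborting. The `fetch` callable is a stand-in (an assumption) for a real per-node Redfish GET.

```python
# Minimal sketch of a group operation with parallelism, a shared timeout,
# and failure collection. `fetch` is a placeholder for a per-node request.
from concurrent.futures import ThreadPoolExecutor, as_completed

def group_get(nodes, fetch, timeout=5.0, workers=32):
    """Run fetch(node) across all nodes; return (results, failures)."""
    results, failures = {}, {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = {pool.submit(fetch, n): n for n in nodes}
        # as_completed raises TimeoutError if results do not converge in time
        for fut in as_completed(futures, timeout=timeout):
            node = futures[fut]
            try:
                results[node] = fut.result()
            except Exception as exc:        # a failed node does not abort the group
                failures[node] = exc
    return results, failures
```

The client-side burden described above collapses into one call: `results, failures = group_get(node_list, fetch)`.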

  5. Background: Lack of Profiling/Classification of Metrics • Different HPC metrics have different sensitivities (e.g., monitoring intervals). • For example, there can be a requirement to monitor node power usage at a 1-second interval while also monitoring jobs at a 120-second interval. • Acquiring monitoring data with a uniform monitoring interval potentially: • Overwhelms the “slow” monitored resources (e.g., the job scheduler) • Misses data from “fast” monitored resources (e.g., node power usage) • Therefore, support for metric profiling is crucial to enable monitoring in an optimal and sustainable way.
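One way to sketch metric profiling is to let each metric carry its own polling interval and have the aggregator only poll metrics that are due. The metric names and intervals below are the slide's examples, not a fixed Redfish schema.

```python
# Sketch of per-metric polling profiles: given {metric: interval_seconds},
# compute the (time, metric) poll events due within a time horizon.
import heapq

def plan_polls(profiles, horizon):
    """Return chronologically ordered (time, metric) events up to `horizon`."""
    heap = [(interval, name, interval) for name, interval in profiles.items()]
    heapq.heapify(heap)
    schedule = []
    while heap and heap[0][0] <= horizon:
        t, name, interval = heapq.heappop(heap)
        schedule.append((t, name))
        heapq.heappush(heap, (t + interval, name, interval))  # reschedule
    return schedule
```

With `plan_polls({"node_power": 1, "job_state": 120}, horizon=4)`, the fast power metric is polled every second while the slow job metric is left alone, avoiding both overwhelming the scheduler and missing power samples.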

  6. Background: Single-Instance Aggregator Service • A single-instance aggregator service introduces the following scalability bottlenecks: • It can acquire monitoring data from only a limited number of nodes. • It is cumbersome for acquiring monitoring data from heterogeneous resources. • Therefore, a distributed aggregator component is central to enabling scalable monitoring.
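A minimal sketch of distribution is to shard the node inventory across several aggregator instances with a deterministic hash, so each instance monitors only its shard. The instance names are illustrative assumptions.

```python
# Sketch of sharding nodes across aggregator instances by hashing the
# node name, so assignment is deterministic and needs no coordination.
import hashlib

def assign_shard(node, aggregators):
    """Map a node name to one aggregator instance deterministically."""
    digest = hashlib.sha256(node.encode()).hexdigest()
    return aggregators[int(digest, 16) % len(aggregators)]

def shard_nodes(nodes, aggregators):
    """Partition the full node list into per-aggregator work lists."""
    shards = {a: [] for a in aggregators}
    for n in nodes:
        shards[assign_shard(n, aggregators)].append(n)
    return shards
```

Every node lands in exactly one shard, and any component can recompute the mapping locally; rebalancing on membership change would need consistent hashing, which this sketch omits.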

  7. Background: Pull-Based Communication Model (GET Polling) • Pull is based on the request/response communication pattern. • This approach has a myriad of bottlenecks in terms of latency, reliability, scalability, and performance. • W3C [1] has defined Server-Sent Events (SSE), which enables a server to push data periodically to (subscribed) clients via HTTP. • [1] https://www.w3.org/TR/eventsource • Therefore, support for the push communication pattern is vital to enable more deterministic monitoring.
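The SSE wire format referenced above is line-oriented (`event:` and `data:` fields, blank line dispatches the event). A sketch of the client-side parsing, with the HTTP transport abstracted as any iterable of text lines (in production the lines would come from a response to a Redfish event subscription):

```python
# Sketch of parsing a W3C Server-Sent Events stream into (event, data)
# pairs. A blank line terminates and dispatches the pending event.
def parse_sse(lines):
    event, data = "message", []
    for line in lines:
        line = line.rstrip("\n")
        if line == "":                          # dispatch on blank line
            if data:
                yield event, "\n".join(data)
            event, data = "message", []         # reset to defaults
        elif line.startswith("event:"):
            event = line[len("event:"):].strip()
        elif line.startswith("data:"):
            data.append(line[len("data:"):].strip())
```

Because the server pushes and the client only parses, the polling loop (and its latency/overhead trade-off) disappears.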

  8. Background: Manual Resource Discovery • To enable monitoring and other management functions (e.g., provisioning), resource reachability is critical. • Traditional methodologies are quite complicated (gathering MAC addresses of nodes, etc.) and involve human intervention and dependencies. • Therefore, support for automatic resource discovery is extremely useful for auto-building the resource inventory to enable zero-touch monitoring.
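Redfish already defines a zero-touch discovery path via SSDP: services answer an M-SEARCH probe for the DMTF search target `urn:dmtf-org:service:redfish-rest:1`. A sketch that builds the probe (sending it over UDP multicast 239.255.255.250:1900 and collecting replies, which yields the BMC inventory, is omitted here):

```python
# Sketch of automated discovery: build the SSDP M-SEARCH probe that
# Redfish services answer. Sending/receiving over UDP is omitted.
def build_msearch(mx=3):
    return ("M-SEARCH * HTTP/1.1\r\n"
            "HOST: 239.255.255.250:1900\r\n"
            'MAN: "ssdp:discover"\r\n'
            f"MX: {mx}\r\n"
            "ST: urn:dmtf-org:service:redfish-rest:1\r\n"
            "\r\n").encode("ascii")
```

Each reply carries the service's Redfish root URL, which replaces manual MAC gathering as the seed for the resource inventory.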

  9. Background: Lack of Unified Interfacing with Other HPC Infrastructure Elements • In HPC, different data sources expose monitoring data via different protocols/APIs. • A BMC can expose data via IPMI and/or the Redfish API. • A job scheduler can expose job data via the Univa Grid Engine (UGE) REST API and/or the SLURM API. • It is quite complicated to implement all southbound interfaces within the aggregator. • Therefore, universal/unified interfacing with other HPC infrastructure elements is very helpful for interacting with variant and/or different HPC systems in a consistent manner.
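The unified interfacing above amounts to an adapter pattern: each protocol gets a wrapper exposing one normalized `read()` contract, so the aggregator core never speaks IPMI, Redfish, or a scheduler API directly. The wrapper classes and sample payload shapes below are illustrative assumptions.

```python
# Sketch of unified southbound interfacing via protocol wrappers with a
# common read() contract; payloads stand in for real API responses.
from abc import ABC, abstractmethod

class DataSource(ABC):
    @abstractmethod
    def read(self) -> dict:
        """Return metrics in a normalized {metric: value} form."""

class RedfishSource(DataSource):
    def __init__(self, payload):
        self.payload = payload              # would come from a Redfish GET
    def read(self):
        return {"power_watts": self.payload["PowerConsumedWatts"]}

class SchedulerSource(DataSource):
    def __init__(self, payload):
        self.payload = payload              # would come from a UGE/SLURM API
    def read(self):
        return {"running_jobs": len(self.payload["jobs"])}

def aggregate(sources):
    """Merge normalized readings from all wrapped sources."""
    merged = {}
    for s in sources:
        merged.update(s.read())
    return merged
```

Adding a new data source then means writing one wrapper, not touching the aggregator core.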

  10. Background: Lack of Exertion of Systematic Control over HPC Infrastructure • Controlling HPC infrastructure refers to pushing specific configurations/states to the HPC infrastructure. • This programmatic control will be instrumental in adjusting the behavior (in terms of functionality and usage) of HPC resources. • Many tools lack this much-needed capability. • Therefore, support for exerting control over HPC infrastructure is crucial to automate HPC operations in a systematic manner.
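As a concrete instance of pushing a configuration, a Redfish client can cap a chassis power budget by PATCHing the `PowerLimit` inside the legacy Redfish `Power` resource. The sketch below only constructs the request path and body; sending it would require an authenticated session with the BMC.

```python
# Sketch of exerting control: build the PATCH that caps a chassis power
# limit via the Redfish Power resource (PowerControl/PowerLimit).
import json

def power_limit_patch(chassis_id, watts):
    """Return (path, json_body) for a power-cap PATCH request."""
    path = f"/redfish/v1/Chassis/{chassis_id}/Power"
    body = {"PowerControl": [{"PowerLimit": {"LimitInWatts": watts}}]}
    return path, json.dumps(body)
```

The same push-config shape generalizes to the other control examples on the capabilities slide (e.g., thermal setpoints).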

  11. Agenda • Background & Issues Analysis • Proposed: Definition, Capabilities, Architecture • Conclusions

  12. Aggregator Definition • Aggregation can be referred to as a framework that enables acquisition of data from across multiple sources [Source: ISE Data Aggregation Reference Architecture]. • For the HPC community, the Redfish Aggregator is a single point of entry to monitor, manage, and control an HPC data center.

  13. List of Proposed Aggregation Capabilities • Group operations (parallelism, timeout) • Metric profiling: telemetry configs (Metric Definition, Metric Report Definition, Trigger Definition) • Distributed aggregation • Support for SSE-based metric retrieval and combining event streams • Example HPC control implementations (power usage, temperature) • Metrics interleaving and caching • Automated resource discovery • Interfacing with other HPC infrastructure components
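The telemetry configs named above map onto the DMTF Telemetry Service schema. A sketch of building a periodic `MetricReportDefinition` (field names follow the DMTF schema; the metric property URIs passed in would be illustrative, chosen per deployment):

```python
# Sketch of a Redfish Telemetry Service MetricReportDefinition: a
# periodic report over selected metric properties.
def metric_report_definition(name, interval_iso, metric_properties):
    """Build a periodic report definition; interval is an ISO 8601 duration."""
    return {
        "Id": name,
        "Name": name,
        "MetricReportDefinitionType": "Periodic",
        "Schedule": {"RecurrenceInterval": interval_iso},
        "Metrics": [{"MetricProperties": [p]} for p in metric_properties],
    }
```

POSTing one such definition per metric profile (e.g., `PT1S` for node power, `PT120S` for jobs) realizes the per-metric intervals argued for earlier.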

  14. Overall Architectural Goals • “Monitor” refers to acquiring data using pull/push. • “Scientize” refers to data interpretation in terms of thresholds (e.g., OK, Warning, Critical) and enumeration. • “Control” refers to adjusting resource configurations to a desired state.
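The "scientize" step can be sketched as mapping a raw reading onto the OK/Warning/Critical enumeration against per-metric thresholds (the threshold values below are illustrative):

```python
# Sketch of the "scientize" goal: interpret a raw metric value against
# warning/critical thresholds into an enumerated state.
def classify(value, warning, critical):
    if value >= critical:
        return "Critical"
    if value >= warning:
        return "Warning"
    return "OK"
```

For example, with thresholds of 70/90 °C, a 75 °C reading classifies as "Warning"; the controller can then react to states rather than raw numbers.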

  15. Potential Architecture for Redfish Aggregator for HPC Monitoring and Control • [Architecture diagram of the proposed model: an HPC Controller drives the Redfish Aggregator, backed by a Data Repository. The aggregator fans out to distributed aggregation handlers, each paired with a protocol wrapper (IPMI, scheduler API, storage API, PDU API, chiller API, UPS API, generator API, InfiniBand, in-band, NETCONF) fronting a resource group: BMCs, OS-based metrics, PDUs, chillers, UPSs, generators, storage nodes, fabric network, routers/switches, and the job scheduler.]

  16. Synergy with Relevant Standards of the Open Grid Forum (OGF) • The Redfish Aggregator has a natural symbiosis with the OGF GLUE specification [1]. • GLUE specifies a conceptual information model, schema, and use cases for Grid and Cloud infrastructure. • [Diagram: discovery feeds the Aggregator, which maps the job scheduler and networking devices onto the Redfish resource model.] • [1] https://redmine.ogf.org/boards/43/topics/506

  17. Agenda • Background & Issues Analysis • Proposed: Definition, Capabilities, Architecture • Conclusions

  18. Conclusions • Discussed the proposed requirements and framework for the “Redfish Aggregator for HPC monitoring and control”. • Contribute to and develop the Redfish Aggregator in collaboration with the DMTF Redfish Forum, research labs, academia, and industry.
