Monitoring and Fault Tolerance

Helge Meinhard / CERN-IT

OpenLab workshop

08 July 2003

Monitoring and Fault Tolerance: Context

[Diagram: context of monitoring and fault tolerance — the Fault Mgmt System, Monitoring System, Configuration System and Installation System, all connected to a Node]
History (1)
  • In the 1990s, “massive” deployments of Unix boxes required automated monitoring of system state
  • Answer: SURE
    • Pure exception/alarm system
    • No archiving of values, hence not useful for performance monitoring
    • Not scalable to O(1000) nodes
History (2)
  • PEM project at CERN (1999/2000) took a fresh look at fabric mgmt, in particular monitoring
  • PEM tool survey: Commercial tools found not flexible enough and too expensive; free solutions not appropriate
  • Architecture, design and implementation from scratch
History (3)
  • 2001 - 2003: European DataGrid project with work package on Fabric Management
    • Subtasks: configuration, installation, monitoring, fault tolerance, resource management, gridification
    • Profited from PEM work, developed ideas further
History (4)
  • In 2001, some doubts about ‘do-it-all-ourselves’ approach of EDG WP4
  • Parallel to EDG WP4, project launched to investigate whether commercial SCADA system could be used
  • Architecture deliberately kept similar to WP4
Monitoring and FT architecture (1)
  • Monitoring: Non-intrusively captures the actual state of a system (it is not supposed to change that state)
  • Fault Tolerance: Reads and correlates data from monitoring system, triggers corrective actions (state-changing)
Monitoring and FT architecture (2)
  • MSA – Monitoring Sensor Agent
  • MR – Monitoring Repository
  • WP4: MR code with lower layer as flat file archive, or using Oracle
  • CCS: PVSS system

[Diagram: sensors feeding the MSA, which forwards data to the Monitoring Repository]
Monitoring and FT architecture (3)
  • MSA controls communication with Monitoring Repository, configures sensors, requests samples, listens to sensors
  • Sensors send metrics on request or spontaneously to MSA
  • Communication MSA – MR: UDP or TCP based
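The sensor-to-agent exchange above can be sketched with a plain UDP datagram. This is a minimal illustration only: the field names and JSON encoding are invented for the example, not the actual PEM/WP4 wire format.

```python
import json
import socket

# "MSA" side: bind a UDP socket on localhost (ephemeral port).
msa = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
msa.bind(("127.0.0.1", 0))
msa_addr = msa.getsockname()

# Sensor side: send one metric sample as a JSON datagram to the MSA.
sensor = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sample = {"node": "lxbatch001", "metric": "loadavg_1min", "value": 0.42}
sensor.sendto(json.dumps(sample).encode(), msa_addr)

# "MSA" side: receive and decode the sample.
data, _ = msa.recvfrom(4096)
received = json.loads(data)
print(received["metric"], received["value"])

sensor.close()
msa.close()
```

UDP keeps the sensors fire-and-forget and cheap on the node; a TCP variant would trade that for delivery guarantees, matching the "UDP or TCP based" choice on the slide.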
Monitoring and FT architecture (4)
  • FT system subscribing to metrics from monitoring subsystem
  • Rule-based correlation engine takes decisions on firing actuators
  • Actuators controlled by Actuator Agent, all actions logged by monitoring system
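The rule-based correlation described above can be sketched as predicates over incoming metrics that fire actuators. The rules, metric names, and actuator functions here are invented for illustration; they are not taken from the EDG WP4 design.

```python
# Stand-in for the actuator agent's action log (all actions are logged).
actions = []

def restart_daemon(node):
    # A hypothetical actuator; a real one would act on the node.
    actions.append(("restart_daemon", node))

# Each rule pairs a predicate over the latest metrics with an actuator.
rules = [
    (lambda m: m["daemon_up"] == 0, restart_daemon),
    (lambda m: m["disk_free_pct"] < 5,
     lambda node: actions.append(("clean_tmp", node))),
]

def correlate(node, metrics):
    """Evaluate all rules against one node's metrics; fire matching actuators."""
    for predicate, actuator in rules:
        if predicate(metrics):
            actuator(node)

correlate("lxbatch001", {"daemon_up": 0, "disk_free_pct": 40})
print(actions)
```

A real engine would also correlate across nodes and over time; this sketch only shows the rule-to-actuator dispatch.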
Deployment (1)
  • End 2001: Put early versions of MSA and sensors on big clusters (~800 Linux machines), sending data (~100 metrics per machine, 1/min…1/day) to a PVSS-based repository
  • At the same time, ~300 machines started sending performance metrics into flat file WP4 repository
Deployment (2)
  • Sensors more refined over time (metrics added according to operational needs)
  • Both exception and performance oriented sensors now deployed in parallel (some 150 metrics per node)
  • More special machines added, currently ~1500 machines being monitored
  • Test in May 2003: some 500 metric changes per second into the repository (~150 changes/s after “smoothing”)
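One common way to get such a reduction (500 changes/s down to ~150/s) is a deadband filter: forward a value to the repository only when it differs enough from the last forwarded one. The slide does not say which smoothing was used; the sketch below, with an invented 10% threshold, just illustrates the idea.

```python
def smooth(samples, rel_threshold=0.10):
    """Return the subset of samples that would be forwarded to the repository.

    A sample passes only if it differs from the last forwarded value by
    more than rel_threshold (relative deadband).
    """
    forwarded = []
    last = None
    for value in samples:
        if last is None or abs(value - last) > rel_threshold * abs(last):
            forwarded.append(value)
            last = value
    return forwarded

# Small jitter around 1.0 is suppressed; real changes get through.
print(smooth([1.00, 1.02, 1.05, 1.30, 1.31, 0.50]))
```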
Deployment (3)
  • Repository requirements:
    • Repository API implementation
    • Oracle based
    • Fully functional alarm display for operators
  • Currently using both an Oracle-MR based repository, and a PVSS based one
  • Operators using PVSS based alarm screen as alternative to Sure display
Deployment (4)
  • Interfaces: C API available, simple command line interface by end July, prototype Web access to time series of a metric available
  • Fault tolerance: Just starting to look at WP4 prototype
  • Configuration of monitoring: ad-hoc, to be migrated to CDB
  • Near term: Production services for LCG-1
    • Add more machines (e.g. network), metrics
    • Software and service monitoring
  • Medium term (end 2003): Monitoring for Solaris and Windows, …
  • 2004 or 2005: Review of chosen solution for monitoring and FT
    • Some of 1999 arguments no longer valid
    • Will look at commercial and freeware solutions
Machine control
  • High level: interplay of State Management System, Configuration Management, Monitoring, Fault Tolerance, …
  • Low level:
    • Past: CPU boxes didn’t have anything (5 rolling tables with monitors and keyboards per 500…1000 machines), disk and tape servers with analog KVM switches
    • Future: Have investigated various options, benefit/cost analysis. Will go to serial consoles on all machines, 1 head node per 50…100 machines with serial multiplexers