
NGOP Overview




  1. NGOP Overview
     Jim Fromm
     Farms and Clustered Systems Group, Computing Division, Fermilab

  2. People
     • Integrated Systems Development Department
       • Don Petravick
       • Krzysztof Genser
       • Jim Fromm
       • Tanya Levshina
       • Igor Mandrichenko
       • Terry Jones
     • Operating Systems Support Dept.
       • Troy Dawson
       • Lisa Giachetti
       • Ken Schumacher
       • Marc Mengel
     • Computing Services Dept.
       • Jeff Mack
       • Rick Thies
       • Rich Thompson
     http://www-isd.fnal.gov/ngop

  3. Goals
     • The NGOP working group is charged with developing a Distributed Management System (DMS) that will scale to the anticipated requirements of the Run II farms.
     • The future size of the farms requires that the DMS be proactive: the system should take corrective action when possible.
     • Must detect hardware, system, and application problems.
     • Problem diagnostics should eliminate “noise”, or false alarms.
     • Should provide tools for performance analysis.

  4. NGOP History
     • Summer 1999: NGOP group created to gather requirements for a Distributed Management System capable of efficiently monitoring the Fermilab computing facility for Run II.
     • Sept 1999: Requirements gathering completed.
     • Dec 1999: Evaluation of available products presented.
     • Jan 2000: Decision made to develop a custom DMS.
     • Today: Development of the prototype is underway. Completion is expected before year end.

  5. We are not alone…
     • As computer farms get larger, other HEP sites are looking at a similar problem.
     • March 2000: CERN and BNL visited Fermilab to exchange ideas and lessons learned. SLAC, JLAB, and IN2P3 participated via video conference.
     • July 2000: Fermilab visited CERN to follow up on the March meetings.

  6. Some Terminology
     • Monitored Object: one of the following:
       • Host: A computer identified by its full domain name.
       • Cluster: A collection of hosts.
       • Component: An atomic element that has a well-defined behavior.
       • System: A collection of components.
     • Condition: A pre-defined state of a Monitored Object.
     • Event: A description of a detected condition.
     • Action: An activity initiated by the NGOP system based on an event.
     • Alarm: An asynchronous indicator initiated by NGOP.
     • Status: Shows the level of “functionality” of the monitored element.
     • Monitoring Agent: A software component that generates events based on conditions and performs actions.
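The terminology above can be read as a small data model. The following is a minimal sketch in Python, assuming nothing about NGOP's real classes: every class and field name is hypothetical and chosen only to mirror the definitions on the slide.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Host:
    name: str  # full domain name of the computer


@dataclass
class Cluster:
    name: str
    hosts: List[Host] = field(default_factory=list)  # a collection of hosts


@dataclass
class Component:
    name: str  # atomic element with well-defined behavior


@dataclass
class System:
    name: str
    components: List[Component] = field(default_factory=list)


# A monitored object is any one of the four kinds above.
MonitoredObject = Union[Host, Cluster, Component, System]


@dataclass
class Event:
    source: MonitoredObject  # object on which the condition was detected
    condition: str           # name of the pre-defined state (hypothetical)
    description: str         # human-readable description of the event


def make_example() -> Event:
    """Build one cluster of two hosts and an event on the first host."""
    farm = Cluster("farm", [Host("node1.example.org"), Host("node2.example.org")])
    return Event(farm.hosts[0], "tmp_full", "/tmp is full")
```

Because a host is referenced by a cluster rather than owned by it, the same `Host` could appear in more than one `Cluster`, which is one way to allow overlapping clusters.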

  7. NGOP Requirements
     • Essential Features:
       • Should detect hardware, network, system, and application errors:
         • System daemon status (inetd, mbatchd).
         • Unreachable hosts.
         • Security breaches.
         • /tmp full.
       • Should run on all Fermilab-supported operating systems.
       • Scalable to 1000s of hosts.
       • Must be multi-user and must support different authorization levels.
       • Provide an interface for user-written monitoring tools.
       • Generate different levels of alarms (Warning, Info, etc.).
       • Perform actions based on alarms and events (email, page, restart daemon).
       • Provide a hierarchical view of the monitored system.
       • Dynamic configuration.
       • Provide monitoring capabilities via a web browser, GUI, and command line interface.
       • Provide special states for monitored objects, such as “known bad”.
     • Desirable Features:
       • Ability to have overlapping clusters.
       • Ability to generate reports based on selection criteria.
       • Implement step-by-step notification of performed actions.

  8. Products Evaluation
     • Some Evaluated Products:
       • Patrol
         • Not scalable for centralized monitoring
         • One level of hierarchy
         • No overlapping clusters
         • No filtering of events
         • No GUI/UI
       • Tkined/Scotty
         • Not scalable for multiple users
         • System monitored only while the GUI is running
         • Only one level of alarms
       • Nocol
         • No notion of hierarchy or clusters
         • Web and “GUI” (curses) interfaces have limited customization
         • Very limited filtering of events
       • Netlogger
         • Limited off-the-shelf functionality
         • No customization for monitoring agents
         • Very limited way to create a hierarchy
         • Requires too much knowledge of the underlying system to detect a problem
       • Misc Commercial Products
         • Complex
         • Did not meet requirements
         • Very expensive, both in licensing and setup costs

  9. Product Evaluation Summary
     • Many commercial and open-source products try to solve the problem in many different ways.
     • None of the evaluated products met the basic requirements at Fermilab.
     • Discussions with others who chose the commercial route were not encouraging; many bad experiences were documented.
     • The decision was made to develop our own custom DMS.

  10. Design Summary – Key System Components
      • Monitoring Agent (MA): Monitors a monitored object and generates events based on certain conditions.
      • Sensor Agent: Similar to a monitoring agent, but this process collects performance data and generates events at a higher rate than a monitoring agent.
      • NGOP Central Server (NCS): The central daemon process that gathers events from MAs, provides users with requested information, and dumps persistent data into the Archive Server.
      • NGOP Configuration File Management Service: Provides a mechanism to centrally locate system configuration and rules. Allows for dynamic reconfiguration of the system.
      • Archive Server: A daemon that handles archive storage. Provides a means to write, read, and query the data.
      • Monitoring Client: Communicates with the NCS using an API to display system status in a meaningful manner.

  11. NGOP Architecture
      [Architecture diagram. Labeled elements: Cluster A; Cluster B with sub-clusters B1 and B2; Monitoring Agents (MA); Sensor Agents (s); Central Server; Configuration File Management Service with Persistent Config. Data; Archive Service with Archive; Performance Storage Service with Performance Data; Router; Monitor; Administrator; Action Client; Report Generator; Data Analyzer.]
      Legend:
      • Monitored Objects: Host, Element, Cluster, System
      • NGOP Components: Sensor Agent, Monitoring Agent, Server, Monitoring Clients, Data Storage
      • Connections: TCP; UDP; connection between Monitored Element and MA
      • Not implemented in prototype yet (marker for some diagram elements)

  12. Monitoring Agents – The hook into NGOP
      • A monitoring agent (MA) is a process that monitors an object and generates events when a condition is met. A message describing each event is sent to the NGOP Central Server (NCS).
      • NGOP defines the protocol used to exchange information with the central server.
      • A set of basic MAs will be deployed with the NGOP system; users are free to write their own.
      • An API (C, C++, Perl, Python) will be provided to allow development of MAs.
      • MAs should send information to the NCS:
        • When the current characteristics of a monitored object meet a condition.
        • When the condition is no longer satisfied.
        • Periodically, as heartbeat messages that let the NCS know the MA is still alive.
      • Examples:
        • Monitor whether or not a batch system is running.
        • Monitor the size of a file system, issuing alarms when it is 90% full.
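The three reporting cases on this slide (condition met, condition cleared, periodic heartbeat) can be sketched for the file-system example as follows. This is a minimal illustration in Python, one of the API languages listed; the class name, the message dictionaries, and the `send` callable are hypothetical, since the slides do not show the actual NGOP protocol or API. The `probe` and `clock` parameters exist only so the logic can be exercised without a real disk or real time.

```python
import shutil
import time


def fs_fraction_used(path):
    """Fraction (0.0 to 1.0) of the file system holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


class FileSystemAgent:
    """Hypothetical MA: reports when a file system crosses a fullness threshold."""

    def __init__(self, path, send, threshold=0.90, heartbeat_every=60.0,
                 probe=fs_fraction_used, clock=time.monotonic):
        self.path = path
        self.send = send                    # callable that delivers a message to the NCS
        self.threshold = threshold
        self.heartbeat_every = heartbeat_every
        self.probe = probe
        self.clock = clock
        self.condition_met = False          # last state reported to the NCS
        self.last_heartbeat = None

    def poll(self):
        """Run one monitoring cycle; send messages only when something changed."""
        now = self.clock()
        met = self.probe(self.path) >= self.threshold
        if met != self.condition_met:
            # The condition was just met, or is no longer satisfied.
            self.condition_met = met
            self.send({"type": "event", "object": self.path,
                       "condition": "fs_full",
                       "state": "set" if met else "cleared"})
        if self.last_heartbeat is None or now - self.last_heartbeat >= self.heartbeat_every:
            # Periodic heartbeat so the server knows this agent is alive.
            self.last_heartbeat = now
            self.send({"type": "heartbeat", "agent": "fs:" + self.path})
```

In a real agent, `poll()` would be called in a loop with a sleep between cycles, and `send` would write to whatever transport the NGOP protocol defines.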

  13. Sensor Agents
      • Sensor Agents send performance data to the Performance Storage Service.
      • The rate of this data is expected to be much higher than that of the MAs.
      • Examples:
        • Monitor the temperature of a computer every second.
        • Monitor the CPU utilization continuously.

  14. NGOP Central Server
      • The NCS is the process that receives messages sent from MAs, stores them via the Archive Server, and provides monitoring clients (a GUI, for example) with requested information.
      • One instance of the NCS will be running in the system.
      • The NCS must handle many (10,000+) MAs and ~50 clients.
      • The NCS should:
        • Update object characteristics when an MA reports a change.
        • Determine if an MA is dead, and forward this information to the relevant monitoring client.
        • Forward event and action messages to the Archive Server.
        • Forward event messages to subscribed monitoring clients.
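The NCS responsibilities listed above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the NGOP implementation: the class, the method names, the message dictionaries, and the rule "an MA is dead if no message arrived within `heartbeat_timeout` seconds" are all hypothetical. An in-memory list stands in for the Archive Server.

```python
from collections import defaultdict


class CentralServer:
    """Hypothetical NCS: tracks agent liveness, archives and forwards messages."""

    def __init__(self, heartbeat_timeout=180.0):
        self.heartbeat_timeout = heartbeat_timeout
        self.last_seen = {}                    # agent id -> time of last message
        self.subscribers = defaultdict(list)   # agent id -> client callbacks
        self.archive = []                      # stand-in for the Archive Server

    def subscribe(self, agent_id, callback):
        """Register a monitoring client for messages about one agent."""
        self.subscribers[agent_id].append(callback)

    def _notify(self, agent_id, message):
        for callback in self.subscribers[agent_id]:
            callback(message)

    def receive(self, agent_id, message, now):
        """Handle one message from an MA; any message counts as a sign of life."""
        self.last_seen[agent_id] = now
        if message["type"] in ("event", "action"):
            self.archive.append(message)       # event and action messages are archived
        if message["type"] == "event":
            self._notify(agent_id, message)    # events go to subscribed clients

    def check_agents(self, now):
        """Report agents that have been silent longer than the timeout."""
        dead = [agent for agent, seen in self.last_seen.items()
                if now - seen > self.heartbeat_timeout]
        for agent in dead:
            del self.last_seen[agent]          # report each dead agent only once
            self._notify(agent, {"type": "event", "agent": agent,
                                 "condition": "agent_dead"})
        return dead
```

With 10,000+ agents, a linear scan per check is still cheap; a production server would more likely be concerned with connection handling than with this bookkeeping.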

  15. NGOP Configuration File Management Service
      • Responsible for providing a central repository for system configuration and monitoring rules.
      • Allows for dynamic reconfiguration of the system.
      • Configuration files are written in XML.
      • The central repository is implemented using CVS in the prototype.
      • Only authorized users can update it.
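The slides state only that configuration files are XML; they do not show a schema. As an illustration, the sketch below invents a small format (the element and attribute names `ngop`, `cluster`, `host`, and `rule` are assumptions, not NGOP's real schema) and reads it with Python's standard `xml.etree.ElementTree` module.

```python
import xml.etree.ElementTree as ET

# Hypothetical configuration; the real NGOP schema is not shown in the slides.
CONFIG_XML = """
<ngop>
  <cluster name="farm-a">
    <host name="node1.example.org"/>
    <host name="node2.example.org"/>
  </cluster>
  <rule object="node1.example.org" condition="fs_full" alarm="Warning"/>
</ngop>
"""


def load_config(text):
    """Parse the hypothetical format into (clusters, rules)."""
    root = ET.fromstring(text.strip())
    # Map each cluster name to the list of host names it contains.
    clusters = {
        cluster.get("name"): [host.get("name") for host in cluster.findall("host")]
        for cluster in root.findall("cluster")
    }
    # Each rule becomes a plain dict of its attributes.
    rules = [dict(rule.attrib) for rule in root.findall("rule")]
    return clusters, rules
```

Keeping such files in a version-controlled repository, as the prototype does with CVS, gives change history and a natural place to enforce who may commit.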

  16. Rules
      • Rules define the status and the alarm level associated with monitored objects.
      • A rule describes the condition that must be satisfied for a monitored object to have a given status and alarm level.
      • Master rules are stored in the Configuration File Management Service (CFMS).
      • Users can create their own rules and store them locally. Users with permission can store these rules in the CFMS.
      • Dependency rules are a mechanism to filter out noise. For example, a batch system can be dependent on the power supply: if the power goes out on a machine, no separate alarm is raised for the batch system being down.
      • Alarm/Action rules define the condition that will cause an alarm to be raised or an action to be performed.
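The dependency-rule example (batch system depends on power supply) can be sketched as a filter over the set of objects currently reported down. This is a hypothetical illustration in Python: the function name, the object naming scheme, and the single-parent dependency map are assumptions, not the NGOP rule format.

```python
def filter_dependent_events(down, depends_on):
    """Return the members of `down` that should still be reported.

    `down` is a set of object names currently failing; `depends_on` maps an
    object to the one object it depends on. An object is suppressed when
    anything up its dependency chain is also down.
    """
    def has_failed_dependency(obj):
        visited = set()
        parent = depends_on.get(obj)
        # Walk up the chain; `visited` guards against cyclic dependency maps.
        while parent is not None and parent not in visited:
            if parent in down:
                return True
            visited.add(parent)
            parent = depends_on.get(parent)
        return False

    return {obj for obj in down if not has_failed_dependency(obj)}
```

For example, with `depends_on = {"batch@node1": "power@node1"}`, a simultaneous failure of both objects reports only `power@node1`, while a batch failure on its own is still reported.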

  17. Monitoring Clients
      • Monitoring clients will be developed with an API that allows determination of the status of each node in a hierarchy, based on rules and current information obtained from the NCS.
      • Monitoring clients will initiate action requests.
      • Monitoring clients determine the state of the system and monitored elements based on information gathered from the NCS.
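One simple way a client could derive the status of each node in a hierarchy is to propagate the worst status of the children upward. The Python sketch below assumes that rule; the slides do not specify how NGOP rules combine statuses. The four status names come from the NGOP Monitor slide, but their severity ordering, in particular where "Undefined" falls, is an assumption.

```python
# Assumed ordering from least to most severe; not taken from the slides.
SEVERITY = {"Good": 0, "Undefined": 1, "Warning": 2, "Bad": 3}


def rollup(node, children, leaf_status):
    """Status of `node`: its own status if it is a leaf, else the worst child status.

    `children` maps a node name to its child names; `leaf_status` maps leaf
    names to a status. Leaves with no reported status count as "Undefined".
    """
    kids = children.get(node, [])
    if not kids:
        return leaf_status.get(node, "Undefined")
    child_statuses = [rollup(kid, children, leaf_status) for kid in kids]
    return max(child_statuses, key=SEVERITY.__getitem__)
```

For a tree `farm -> clusterA -> (n1, n2)`, a single "Warning" on `n2` makes both `clusterA` and `farm` show "Warning", which matches the idea of a hierarchical view where problems are visible from the top level.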

  18. Archiver/Performance Storage Service
      • The Archive/Performance Storage Service (PSS) is responsible for storing and retrieving messages generated by the NGOP system. These messages represent event, sensor, or action data.
      • Components:
        • Archive Server
        • Archive Retriever
        • Performance Storage Subsystem (PSS)
        • PSS Retriever
        • Archive Database Interface
        • Database (Oracle)
        • DBArchiver
      • The PSS is simply another instance of the Archive Server.
      • Performance data will need to be consolidated.
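The slide does not say how performance data would be consolidated. One common approach, shown here purely as an assumption, is to average high-rate samples into fixed time buckets before long-term storage. The function name and the `(timestamp, value)` sample format are hypothetical.

```python
from collections import defaultdict


def consolidate(samples, bucket_seconds):
    """Average (timestamp, value) samples into fixed-width time buckets.

    Returns a dict mapping each bucket's start time to the mean of the
    values whose timestamps fall in [start, start + bucket_seconds).
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for timestamp, value in samples:
        start = int(timestamp // bucket_seconds) * bucket_seconds
        sums[start] += value
        counts[start] += 1
    return {start: sums[start] / counts[start] for start in sorted(sums)}
```

Applied to once-per-second temperature readings with `bucket_seconds=60`, this keeps one value per minute instead of sixty, at the cost of losing short spikes; storing a per-bucket maximum alongside the mean would be one way to retain them.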

  19. NGOP Prototype
      NGOP prototype development is currently underway. The prototype consists of the following modules:
      • NGOP Central Server
      • Configuration File Management Service
      • Monitoring Agents:
        • OS Health: Monitors specific system daemons, file system existence and size, CPU load, and free memory.
        • Ping Agent: Monitors node reachability.
        • FBSNG Agent: Monitors the FBSNG batch system.
      • NGOP Client API
        • Determines the status of each monitored element based on pre-defined rules and current information received from the NGOP Central Server.
      • NGOP Monitor
        • Graphical representation of monitored element status.
        • Provides a means to see and acknowledge events and alarms that have occurred.
        • Provides limited configuration options.
      • Archive Server
        • Stores event and action messages to local disk.
        • The Archive Database Interface moves the messages from local disk to an Oracle database.

  20. NGOP Monitor
      [Screenshot of the NGOP Monitor. Callouts: Alarm; Status (Bad, Warning, Good, Undefined); Event description.]

  21. NGOP Monitor (event acknowledgment, known-status modification…)
      [Screenshot of the NGOP Monitor. Callout: Monitored Element Info.]

  22. NGOP Monitor (Configuration Options)
      [Screenshots of configuration dialogs. Callouts: default icons for known object types; default colors for status representation; selecting elements for top-level display.]

  23. Summary
      • Building a DMS is a complex problem.
      • Various commercial and open-source systems were analyzed. None met the basic requirements for the NGOP project at Fermilab.
      • A prototype system is under development.
      • See http://www-isd.fnal.gov/ngop for project details.
