NGOP Prototype Status Report

T. Levshina


Next Generation Operation Group

Integrated Systems Development Department

Krzysztof Genser

Terry Jones

Tanya Levshina

Igor Mandrichenko

Don Petravick

Operating Systems Support Department

Troy Dawson

Jim Fromm

Lisa Giacchetti

Marc Mengel

Ken Schumacher

Steven Timm

Computing Services Department

Rick Thies

Rich Thompson

[email protected]


Presentation Highlights

  • NGOP project phases

  • Status of the Framework

  • Status of the prototype deployment

  • Near future milestones

NGOP Project Phases (since last HEPIX)

  • December 2000: First prototype implementation was released.

  • January 2001: Prototype installed on farms. Classes held for farm administrators.

  • February 2001: NGOP server node was installed in the operator console area. Monitoring by operators started.

  • March 2001: New release (“Swatch” and “PlugIns” Agents). NGOP was evaluated by system administrators, operators and others. A strategy meeting was held.

  • April 2001: “Xfalive” service (low-level ping) was provided for all nodes monitored by the Computing Services Department.

NGOP Architecture

[Figure: NGOP architecture diagram. Labeled components: Central Server; Configuration File Management Service with persistent configuration data; Archive Service with its archive; Report Generator; Monitor; Administrator; Action Client; Data Analyzer; Router; Performance Data Storage Service with performance data; Monitoring Agents (MA) and sensors (s) on Cluster A, Cluster B, Cluster B1 and Cluster B2. Legend: monitored objects (Host, Element, Cluster, System); NGOP components (Sensor Agent, Monitoring Agent, Data Storage, Server, Monitoring Clients); connection types (TCP, UDP; connection between Monitored Element and MA); a marker for components not implemented in the prototype yet.]

    Data Flow and NGOP Components Interaction

    [Figure: data flow diagram. Monitoring Agents (MA) watch Monitored Elements and report events to the Central Server. Monitors receive information from the Central Server and issue Action Requests, which are carried out by the Action Client. The Configuration Service (backed by CVS) and the Archiver are also attached to the Central Server.]

    Example events shown on the diagram:

    • ID=swap.nodeA State=Up Value=98 SevLevel=Error Dscrb=“swap > 95 %”

    • ID=syslogd.nodeB State=Down Dscrb=“syslogd is down”
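The two example events on this slide suggest a small record with an identifier, a state, an optional value, an optional severity level and a description. The sketch below models such a record in Python and builds the swap event from the slide. The class name `Event`, the function `swap_event` and the 95 % default limit as a parameter are assumptions made for illustration; they are not taken from the NGOP MA API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Event:
    """Hypothetical in-memory form of one event; field names follow the slide."""
    id: str                           # e.g. "swap.nodeA"
    state: str                        # "Up" or "Down"
    value: Optional[float] = None     # measured value, if any
    sev_level: Optional[str] = None   # e.g. "Error"; None when nothing is wrong
    dscrb: str = ""                   # human-readable description


def swap_event(node: str, swap_used_pct: float, limit: float = 95.0) -> Event:
    """Build a swap event; severity is set only above the limit (assumed rule)."""
    over = swap_used_pct > limit
    return Event(
        id=f"swap.{node}",
        state="Up",
        value=swap_used_pct,
        sev_level="Error" if over else None,
        dscrb=f"swap > {limit:g} %" if over else "",
    )


event = swap_event("nodeA", 98)
print(event)
```

With a swap usage of 98 the record carries the same fields as the first example event; a value at or below the limit produces a record with no severity level.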

    Status of Framework (Implemented Components)

    • Monitoring Agent:

      • MA API (only Python binding)

      • PlugIns Agent (XML configuration is required)

      • Several types of MAs are provided in NGOP Prototype:

        • Linux Node “health”:

          • System Daemons presence

          • Critical File Systems presence and size

          • CPU load

          • Memory utilization

          • Swap utilization

          • Number of users

          • Number of users’ processes

          • Number of processors

          • Baseboard temperature

          • Fan speed

        • “Xfalive”:

          • Node availability (low level ping)

          • Node reset

        • FBS:

          • FBS Daemons presence

          • Resources (“cpu” and scratch disk availability)

        • “Swatch”:

          • Watches a log file (e.g. syslog or console log) for lines matching a regular expression
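The “Swatch” agent is described as watching a log file for lines that match a regular expression. A minimal sketch of that idea in Python follows. The function name `matching_lines`, the pattern and the sample log lines are hypothetical; the real agent is configured through NGOP's own configuration files, which are not shown in this presentation.

```python
import re
from typing import Iterable, Iterator, Pattern


def matching_lines(lines: Iterable[str], pattern: Pattern[str]) -> Iterator[str]:
    """Yield every line in which the pattern is found (trailing newline removed)."""
    for line in lines:
        if pattern.search(line):
            yield line.rstrip("\n")


# Hypothetical pattern for NFS timeouts and two made-up syslog lines.
NFS_TIMEOUT = re.compile(r"nfs: server \S+ not responding")
sample_log = [
    "Apr  2 10:01:07 nodeB kernel: nfs: server fs1 not responding, timed out\n",
    "Apr  2 10:01:09 nodeB sshd[411]: session opened for user alice\n",
]

hits = list(matching_lines(sample_log, NFS_TIMEOUT))
for hit in hits:
    print(hit)
```

Because `matching_lines` accepts any iterable of strings, the same function works on an open file object, which yields one line at a time.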

    Status of Framework (Implemented Components)

    • NGOP Central Server (NCS):

      • Gathers events from MAs

      • Scalable (~512 nodes so far)

      • Provide users with requested information

      • Handle multiple users

      • Primitive locking mechanism to prevent simultaneous actions

      • Action broadcasting

      • Store information locally and forward it to Archive Storage

    • NGOP Configuration File Management Service:

      • Provide a central repository for system configuration and monitoring rules.

      • Perform configuration sanity check

      • Provide clients with component subscription list

      • Allow dynamic reconfiguration

      • Notify clients about new configuration
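The NCS list above mentions a primitive locking mechanism that prevents simultaneous actions. One simple way such a mechanism could work is to allow a single holder per target at a time. The sketch below illustrates that idea only; the class `ActionLocks`, its methods and the target and user names are hypothetical and do not describe the actual NCS implementation.

```python
import threading
from typing import Dict, Optional


class ActionLocks:
    """At most one user may hold the lock for a given target (illustrative only)."""

    def __init__(self) -> None:
        self._guard = threading.Lock()      # protects the table below
        self._holders: Dict[str, str] = {}  # target -> user holding the lock

    def acquire(self, target: str, user: str) -> bool:
        """Return True if the lock was granted, False if someone already holds it."""
        with self._guard:
            if target in self._holders:
                return False
            self._holders[target] = user
            return True

    def release(self, target: str, user: str) -> bool:
        """Release the lock; only the current holder may do so."""
        with self._guard:
            if self._holders.get(target) != user:
                return False
            del self._holders[target]
            return True

    def holder(self, target: str) -> Optional[str]:
        """Return the user holding the lock for the target, or None."""
        with self._guard:
            return self._holders.get(target)


locks = ActionLocks()
first = locks.acquire("nodeA", "operator1")    # granted
second = locks.acquire("nodeA", "operator2")   # refused while operator1 holds it
print(first, second, locks.holder("nodeA"))
```

A second request for the same target is refused until the first holder releases it, which is enough to keep two users from acting on one node at the same time.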

    Status of Framework (Implemented Components)

    • Archive Server:

      • Handles archive storage (Oracle).

      • Provides a means to read and query the data (FNAL web interface: MISWEB)

      • Performs data roll out

      • Performs clean up procedure

    • Action Client:

      • Performs centralized actions

      • Verifies user authorization to perform the action

      • Notifies NCS about action exit status

    • Monitoring Client:

      • Allows users to configure custom-built system views

      • Defines rules that determine the status of systems and their components

      • Requests and receives information about monitored objects

      • Determines the status of a system based on the rules and the obtained information

      • Initiates requests to perform actions

      • All configuration files are written in XML
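The Monitoring Client is described as deriving the status of a system from rules and from the information it receives about monitored objects. The presentation does not show the rule language, so the sketch below uses the simplest possible rule as a stand-in: a system takes the worst status among its components. The status names and their ordering, and the function `system_status`, are assumptions made for this example.

```python
from typing import Dict, List

# Assumed ordering from best to worst; only "Error" appears in the presentation.
SEVERITY_ORDER: List[str] = ["Ok", "Warning", "Error"]


def system_status(component_status: Dict[str, str]) -> str:
    """Return the worst status found among the components ("Undefined" if none)."""
    if not component_status:
        return "Undefined"
    return max(component_status.values(), key=SEVERITY_ORDER.index)


node_a = {"swap.nodeA": "Error", "cpu.nodeA": "Ok", "memory.nodeA": "Warning"}
print(system_status(node_a))
```

Real rules would presumably be richer than a worst-of roll-up, but the shape is the same: component information goes in and a single system status comes out.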

    Status of Framework (Not Yet Implemented Components)

    • Sensor Agent:

      Agent that collects performance data and generates events at a higher rate than a monitoring agent.

    • Performance Data Storage Service:

      Service that allows persistent storage of performance data, as well as means to read and query the data. Performance data will need to be consolidated.

    • Looping Monitoring Agent:

      Agent that can receive information from the NCS, analyze it, derive new events and send them back to the NCS.
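Since the Looping Monitoring Agent is not implemented yet, the following is only a guess at what "derive new events" could look like: node states received from the NCS are analyzed, and a cluster-level event is produced when too many nodes are down. The function `derive_cluster_event`, the 10 % default threshold and the dictionary layout of the derived event are all hypothetical.

```python
from typing import Dict, Optional


def derive_cluster_event(
    cluster: str,
    node_states: Dict[str, str],
    max_down_fraction: float = 0.1,
) -> Optional[Dict[str, str]]:
    """Return a derived cluster event if the share of down nodes exceeds the limit."""
    if not node_states:
        return None
    down = [node for node, state in node_states.items() if state == "Down"]
    if len(down) / len(node_states) <= max_down_fraction:
        return None
    return {
        "ID": f"nodes.{cluster}",
        "SevLevel": "Error",
        "Dscrb": f"{len(down)} of {len(node_states)} nodes are down",
    }


states = {"n1": "Down", "n2": "Up", "n3": "Up", "n4": "Down"}
derived = derive_cluster_event("clusterA", states, max_down_fraction=0.25)
print(derived)
```

In a real looping agent the derived event would be sent back to the NCS like any other event, so that monitors could display it alongside the node-level events.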

    CFMS Admin

    NGOP Monitor (Configuration)

    NGOP Monitor (Display)

    NGOP Monitor (Display)

    Prototype Statistics

    • Some implementation details:

      • Written primarily in Python (some modules in C)

        • ~10,000 lines of Python code and ~1,000 lines of C code

      • Uses XML (and partially MathML) for all configuration files

        • ~600 configuration files

    • Some deployment details:

      • Monitoring 512 nodes, checking for nodes being down and node resets.

      • Monitoring four farms (CDF, D0 and two Fixed Target experiment farms), which account for 270 of the 512 nodes.

      • Number of Monitoring Agents: ~557 (270 local MAs monitor operating system and sensor data on the farms, 270 local MAs monitor syslog on the farms, 4 MAs monitor FBS on the corresponding farms, 13 MAs perform the “xfalive” service)

      • Number of Monitored Objects: ~6,500

      • About 5 instances of “ngop monitor” (GUI) are running simultaneously.

      • Local event log has been kept since January 12.

        • Rate is ~13 events per hour
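The deployment figures above can be cross-checked with a few lines of arithmetic. The snippet uses only numbers quoted on this slide; the dictionary keys are descriptive labels chosen for this example, and the per-agent and per-day figures are simple derived averages, not measurements from the deployment.

```python
# Monitoring Agent counts as listed on the slide.
agents = {
    "os_and_sensors": 270,
    "syslog": 270,
    "fbs": 4,
    "xfalive": 13,
}

total_agents = sum(agents.values())
monitored_objects = 6500            # "~6,500" on the slide
events_per_hour = 13                # "~13 events per hour" on the slide

objects_per_agent = round(monitored_objects / total_agents, 1)
events_per_day = events_per_hour * 24

print(total_agents, objects_per_agent, events_per_day)
```

The four agent groups add up to the quoted total of 557, which works out to roughly 11.7 monitored objects per agent and about 312 logged events per day.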

    Current Configuration

    [Figure: deployment diagram of the current configuration. Labels recoverable from the diagram: CDF Farm, D0 Farm, FixTarget Farm and Old FixTarget Farm; hosts cdffarm1, fnsfo, d0bbin, fnsfh and FNCDUH; node ranges fncdf 1 - 90, fnd0 1 - 100, fnpc 1 - 37 and fnpc 201 - 250; Monitoring Agents MA (CDF_FBS), MA (D0_FBS), MA (FT_FBS), MA (OFT_FBS), MA (OSHealth) and Swatch; NGOP MAs (Ping) for PPD, MISCOMP, CMS, CDF, D0, Kerberos, FNALU, Division Servers, SDSS, License Servers, Mail Servers, KTEV, MINOS, HPPC, ODS, BTEV and Enstore; NGOP Central Server, Config File Management Server, Action Client, Archive Service and WWW; three User Nodes, each running an NGOP Monitor.]

    Summary Of Occurred Events

    • Detected Problems:

      • Node reset

      • Node is down

      • One CPU is missing after reboot

      • File system not mounted

      • System daemon is dead

      • FBS Batch Manager is down

    • Raised Alarms:

      • Memory usage is high

      • Swap usage is high

      • CPU Load is high

      • File System is full

      • Baseboard temperature is high

      • Specific messages found in syslog: NFS timeouts, drive timeouts …

    Report Generator (MISCOMP Web Query Interface)

    Next Milestone: From Prototype to Production System (for ~600 nodes)

    • Goal 1: Gradually give the System Managers a Framework to develop and evolve tools to locally monitor their systems and enable them to send filtered information to the CSD operators

    • Goal 2: Make sure all production systems can be supported by NGOP (excluding Windows 2000 in the first phase)

    Wish List: Improve the Production System

    • Provide Monitoring Client API

    • Implement Looping Agents

    • Implement historical rules and escalating alarms

    • Implement “snapshot” (“give me the updated system status now”) feature

    • Provide Monitoring Agent API bindings for languages other than Python

    • Fully Kerberize

    • Provide Standard Win2000 Monitoring Agents

    • Design and provide dynamic handling of configuration changes for the Monitoring Client

    • Allow for easier handling of multiple configurations

    • Improve Admin (Configuration Client) GUI

    • Provide Configuration GUI (hoping for a good free XML Editor though)

    • Provide Performance Data Framework

    • Redesign/Rewrite GUI (for scalability and friendliness)

    • Provide GUI for non-Linux platforms if really needed

    • Work on scalability up to 10,000 hosts
