NGOP Prototype Status Report

T. Levshina

Next Generation Operation Group

Integrated Systems Development Department

Krzysztof Genser

Terry Jones

Tanya Levshina

Igor Mandrichenko

Don Petravick

Operating Systems Support Department

Troy Dawson

Jim Fromm

Lisa Giacchetti

Marc Mengel

Ken Schumacher

Steven Timm

Computing Services Department

Rick Thies

Rich Thompson

[email protected]

Presentation Highlights
  • NGOP project phases
  • Status of the Framework
  • Status of the prototype deployment
  • Near future milestones

NGOP Project Phases (since last HEPIX)
  • December 2000: First prototype implementation was released.
  • January 2001: Prototype was installed on the farms. Classes were held for farm administrators.
  • February 2001: NGOP server node was installed in the operator console area. Monitoring by operators started.
  • March 2001: New release (“Swatch” and “PlugIns” Agents). NGOP was evaluated by system administrators, operators and others. A strategy meeting was held.
  • April 2001: “Xfalive” service (low-level ping) was provided for all nodes monitored by the Computing Services Department.

NGOP Architecture

Report

Generator

Cluster A

Archive

Service

Archive

MA

Monitor

MA

Administrator

MA

Central Server

Configuraton

File Management

Service

Persistent

Config.Data

Cluster B

Cluster B1

    • Monitored Objects
    • Host Element
    • Cluster System
    • NGOP Components
    • Sensor Agent Server
    • Monitoring Agent Monitoring
    • Data Storage Clients
  • Connections
    • TCP connection between
    • UDP Monitored Element
    • and MA
    • Not implemented in prototype yet

MA

MA

Action

Client

MA

s

S

s

MA

s

Data

Analyzer

Router

MA

MA

s

s

s

s

Performance

Storage

Service

Cluster B2

Performance

Data

Data Flow and NGOP Components Interaction

ID=swap.nodeA

State=Up Value=98

SevLevel=Error

Dscrb=“swap > 95 %”

MA

Monitored

Elements

Monitor

Monitor

Action Request

MA

MA

MA

Monitored

Elements

Monitored

Elements

Monitored

Elements

Monitor

Central

Server

Action Request

MA

Monitored

Elements

CVS

ID=syslogd.nodeB

State=Down

Dscrb=“syslogd is down”

MA

Action

Client

Monitored

Elements

Configuration

Service

Archiver

Status of Framework (Implemented Components)
  • Monitoring Agent:
    • MA API (only Python binding)
    • PlugIns Agent (XML configuration is required)
    • Several types of MAs are provided in NGOP Prototype:
      • Linux Node "health" :
        • System Daemons presence
        • Critical File Systems presence and size
        • CPU load
        • Memory utilization
        • Swap utilization
        • Number of users
        • Number of users’ processes
        • Number of processors
        • Baseboard temperature
        • Fan speed
      • “Xfalive”:
        • Node availability (low level ping)
        • Node reset
      • FBS :
        • FBS Daemons presence
        • Resources (“cpu” and scratch disk availability)
      • “Swatch” :
        • watches a log file for lines matching a regular expression, e.g. syslog or console log
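
The “Swatch” idea (watch a log for lines matching a regular expression) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the pattern names, the regular expressions and the sample log lines are made up, and the real agent is configured through XML rather than a Python dictionary.

```python
import re
from typing import Iterable, Iterator, Tuple

# Hypothetical patterns, chosen only to illustrate the matching step.
PATTERNS = {
    "nfs_timeout": re.compile(r"nfs: server \S+ not responding"),
    "drive_timeout": re.compile(r"\bhd[a-z]: .*timeout", re.IGNORECASE),
}


def scan(lines: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Yield (pattern_name, line) for every line that matches a watched pattern."""
    for line in lines:
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                yield name, line.rstrip("\n")


# Made-up syslog lines, used only as sample input.
sample_log = [
    "May  2 10:01:07 nodeB kernel: nfs: server fs1 not responding, still trying\n",
    "May  2 10:01:09 nodeB sshd[411]: session opened for user alice\n",
    "May  2 10:02:31 nodeB kernel: hda: lost interrupt, timeout\n",
]
matches = list(scan(sample_log))
for name, line in matches:
    print(name, "->", line)
```

Because `scan` accepts any iterable of lines, the same function could be fed from an open file object instead of a list.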

Status of Framework (Implemented Components)
  • NGOP Central Server(NCS):
    • Gather events from MAs
    • Scalable (so far to ~512 nodes)
    • Provide users with requested information
    • Handle multiple users
    • Primitive locking mechanism to prevent simultaneous actions
    • Action broadcasting
    • Store information locally and forward it to Archive Storage
  • NGOP Configuration File Management Service:
    • Provide a central repository for system configuration and monitoring rules.
    • Perform configuration sanity check
    • Provide clients with component subscription list
    • Allow dynamic reconfiguration
    • Notify clients about new configuration
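
The slide does not say how the “primitive locking mechanism” in NCS works. One simple way to prevent simultaneous actions on the same target is a table of per-target locks, sketched below. The class `ActionLocks`, its methods, and the target and user names are all hypothetical, not the NCS implementation.

```python
import threading
from typing import Dict, Optional


class ActionLocks:
    """Per-target lock table: at most one user may act on a target at a time."""

    def __init__(self) -> None:
        self._owners: Dict[str, str] = {}  # target id -> user holding the lock
        self._guard = threading.Lock()     # protects the table itself

    def acquire(self, target: str, user: str) -> bool:
        """Take the lock for `target`; return False if someone already holds it."""
        with self._guard:
            if target in self._owners:
                return False
            self._owners[target] = user
            return True

    def release(self, target: str, user: str) -> bool:
        """Release the lock; only the current holder may do so."""
        with self._guard:
            if self._owners.get(target) != user:
                return False
            del self._owners[target]
            return True

    def owner(self, target: str) -> Optional[str]:
        """Return the user holding the lock for `target`, or None."""
        with self._guard:
            return self._owners.get(target)


locks = ActionLocks()
print(locks.acquire("nodeA", "operator1"))  # first request takes the lock
print(locks.acquire("nodeA", "operator2"))  # refused while operator1 holds it
print(locks.release("nodeA", "operator1"))  # holder releases the lock
```

Refusing a second request instead of queueing it keeps the sketch simple; a production server might instead queue requests or add lock timeouts.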

Status of Framework (Implemented Components)
  • Archive Server:
    • Handles archive storage (Oracle).
    • Provides a means to read and query the data (FNAL web interface: MISWEB)
    • Performs data roll out
    • Performs clean up procedure
  • Action Client:
    • Performs centralized actions
    • Verifies user authorization to perform the action
    • Notifies NCS about action exit status
  • Monitoring Client:
    • Allows configuration of custom-built system views
    • Defines rules that determine the status of systems and their components
    • Requests and receives information about monitored objects
    • Determines the status of a system based on the rules and the information obtained
    • Initiates requests to perform actions
    • All configuration files are written in XML
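
To illustrate how a rule can turn component information into a system status, here is a minimal sketch of one possible rule: a system is as bad as its worst component. The severity scale, the function names and the view layout are assumptions for illustration; the real Monitoring Client reads its rules from XML, and only the level “Error” appears in these slides.

```python
from typing import Dict, Iterable, List

# Hypothetical severity scale, lowest to highest.
SEVERITY_ORDER = ["Ok", "Warning", "Error"]


def worst(levels: Iterable[str]) -> str:
    """Return the highest severity in `levels`, or "Ok" if there are none."""
    return max(levels, key=SEVERITY_ORDER.index, default="Ok")


def system_status(view: Dict[str, List[str]],
                  component_levels: Dict[str, str]) -> Dict[str, str]:
    """Map each system in the view to the worst severity among its components.

    Components with no reported level are treated as "Ok".
    """
    return {
        system: worst(component_levels.get(c, "Ok") for c in components)
        for system, components in view.items()
    }


view = {
    "nodeA": ["swap.nodeA", "cpu.nodeA"],
    "nodeB": ["syslogd.nodeB"],
}
levels = {"swap.nodeA": "Error", "cpu.nodeA": "Warning"}
status = system_status(view, levels)
print(status)
```

The same `worst` helper could be applied one level up, so that a cluster takes the worst status of its nodes.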

Status of Framework (Not Yet Implemented Components)
  • Sensor Agent:

Agent that collects performance data and generates events at a higher rate than a monitoring agent.

  • Performance Data Storage Service:

Service that allows persistent storage of performance data, as well as a means to read and query the data. Performance data will need to be consolidated.

  • Looping Monitoring Agent:

Agent that can receive information from NCS, analyze it, derive new events and send them back to NCS.
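
Since the Looping Monitoring Agent is not implemented yet, the following is only a sketch of the kind of derivation it could perform: take node-level events received from NCS and derive a cluster-level event when enough nodes in one cluster are down. The event layout (plain dictionaries), the `nodes_down.<cluster>` ID scheme, the threshold and the node-to-cluster map are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set


def derive_cluster_events(events: Iterable[dict],
                          node_to_cluster: Dict[str, str],
                          threshold: int = 2) -> List[dict]:
    """Derive one new event per cluster with at least `threshold` nodes down."""
    down: Dict[str, Set[str]] = defaultdict(set)
    for event in events:
        if event["State"] != "Down":
            continue
        # IDs are assumed to look like "<element>.<node>".
        _, _, node = event["ID"].partition(".")
        cluster = node_to_cluster.get(node)
        if cluster is not None:
            down[cluster].add(node)
    return [
        {
            "ID": f"nodes_down.{cluster}",
            "State": "Down",
            "Value": len(nodes),
            "Dscrb": f"{len(nodes)} nodes down in {cluster}",
        }
        for cluster, nodes in sorted(down.items())
        if len(nodes) >= threshold
    ]


incoming = [
    {"ID": "ping.node1", "State": "Down"},
    {"ID": "ping.node2", "State": "Down"},
    {"ID": "ping.node3", "State": "Up"},
    {"ID": "ping.node4", "State": "Down"},
]
clusters = {"node1": "farmA", "node2": "farmA", "node3": "farmA", "node4": "farmB"}
derived = derive_cluster_events(incoming, clusters)
print(derived)
```

In a real agent the derived events would be sent back to NCS; here they are only returned, to keep the example self-contained.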

Prototype Statistics
  • Some implementation details:
    • Written primarily in Python (some modules in C)
      • ~10,000 lines of Python code and ~1,000 lines of C code
    • Use XML (and partially MathML) for all configuration files
      • ~ 600 configuration files
  • Some deployment details:
    • Monitoring 512 nodes, checking for nodes being down and node resets.
    • Monitoring four farms (CDF, D0, two Fixed Target experiment farms): 270 nodes out of 512
    • Number of Monitoring Agents: ~557 (270 local MAs monitor operating system and sensor data on the farms, 270 local MAs monitor syslog on the farms, 4 MAs monitor FBS on the corresponding farms, 13 MAs perform the “xfalive” service)
    • Number of Monitored Objects ~ 6,500
    • About 5 instances of “ngop monitor” (GUI) run simultaneously.
    • A local event log has been kept since January 12.
      • Rate is ~ 13 events per hour
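
The deployment figures above can be cross-checked with a few lines of arithmetic. The numbers are taken from this slide; the variable names and the derived ratios (objects per agent, events per day, share of farm nodes) are only illustrative.

```python
# Monitoring Agent counts as listed on the slide.
agents = {
    "OS and sensors (local, farms)": 270,
    "syslog (local, farms)": 270,
    "FBS": 4,
    "xfalive": 13,
}
total_agents = sum(agents.values())

monitored_objects = 6500     # approximate figure from the slide
events_per_hour = 13         # approximate figure from the slide
farm_nodes, all_nodes = 270, 512

objects_per_agent = monitored_objects / total_agents
events_per_day = events_per_hour * 24
farm_share = farm_nodes / all_nodes

print(f"total agents: {total_agents}")
print(f"objects per agent: {objects_per_agent:.1f}")
print(f"events per day: {events_per_day}")
print(f"farm nodes as share of all nodes: {farm_share:.1%}")
```

The four agent counts add up to the ~557 quoted on the slide.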

Current Configuration

CDF Farm

FixTarget Farm

cdffarm1

fnsfo

MA

(CDF_FBS)

MA

(FT_FBS)

PPD

MISCOMP

CMS

CDF

D0

Kerberos

FNALU

Division

Servers

SDSS

License

Servers

License

Servers

Mail

Servers

KTEV

MINOS

HPPC

ODS

BTEV

Enstore

D0 Farm

fncdf 1 - 90

Fnpc 201 - 250

d0bbin

Swatch

Swatch

MA

(OSHealth)

MA

(OSHealth)

MA

(D0_FBS)

fnd0 1 - 100

MA

(OSHealth)

NGOP

MAs

(Ping)

Old FixTarget Farm

User Node

User Node

User Node

fnsfh

NGOP

Monitor

NGOP

Monitor

NGOP

Monitor

MA

(OFT_FBS)

Config

File Management

Server

NGOP

Central

Server

fnpc 1 - 37

FNCDUH

Swatch

Action

Client

MA

(OSHealth)

Archive

Service

WWW

Swatch

Summary Of Occurred Events
  • Detected Problems:
    • Node reset
    • Node is down
    • One CPU is missing after reboot
    • File system not mounted
    • System daemon is dead
    • FBS Batch Manager is down
  • Raised Alarms:
    • Memory usage is high
    • Swap usage is high
    • CPU Load is high
    • File System is full
    • Baseboard temperature is high
    • Specific messages found in syslog: NFS timeouts, drive timeouts …

Next Milestone: From Prototype to Production System (for ~600 nodes)
  • Goal 1: Gradually give the System Managers a Framework to develop and evolve tools to locally monitor their systems and enable them to send filtered information to the CSD operators
  • Goal 2: Make sure all production systems can be supported by NGOP (excluding Windows2000 in the first phase)

Wish List: Improve the Production System
  • Provide Monitoring Client API
  • Implement Looping Agents
  • Implement historical rules and escalating alarms
  • Implement “snapshot” (“give me the updated system status now”) feature
  • Provide Monitoring Agent API bindings for languages other than Python
  • Fully Kerberize
  • Provide Standard Win2000 Monitoring Agents
  • Design and provide dynamic handling of configuration changes for the Monitoring Client
  • Allow for easier handling of multiple configurations
  • Improve Admin Client (Configuration Client) GUI
  • Provide Configuration GUI (hoping for a good free XML Editor though)
  • Provide Performance Data Framework
  • Redesign/Rewrite GUI (for scalability and friendliness)
  • Provide GUI for non-Linux platforms if really needed
  • Work on scalability up to 10,000 hosts
