NGOP Overview

Presentation Transcript


  1. NGOP Overview
     J. Fromm, K. Genser, T. Levshina, M. Mengel

  2. Next Generation Operation Group
     Integrated Systems Development Department: Krzysztof Genser, Terry Jones, Tanya Levshina, Igor Mandrichenko, Don Petravick
     Operating Systems Support Department: Troy Dawson, Jim Fromm, Lisa Giacchetti, Marc Mengel, Ken Schumacher, Steven Timm
     Computing Services Department: Rick Thies, Rich Thompson
     Large Scale Cluster Computing Workshop at Fermilab

  3. Current way of monitoring
     • Various monitoring tools, thus no comprehensive picture of the status of services:
       • Xfalive
       • Patrol
       • NOC (network)
       • Fermi software (Enstore, FBS, …)
     • Actions are initiated by a user's problem report
     • Sometimes misleading information
     • Postmortem investigation

  4. Fermi Computing Environment
     • Heterogeneous clusters
     • Various OSs
     • Different services (batch, interactive, farms)
     • Various sets of applications (LSF, FBS, Enstore, SAM)
     • Mixed management:
       • system administrators
       • software administrators
     • Computing Services Department (CSD) provides a single point of contact for reporting problems

  5. NGOP Goals
     • Active monitoring
     • Problem diagnostics
     • Early error detection and problem prevention
     • Centralized data collection
     • Status of service evaluation
     • Execution of corrective and notification actions
     • Performance analysis

  6. NGOP Project Phases
     • 8/1999 – 3/2000: Creation of the NGOP group. Gathering requirements for a Distributed Monitoring System. Evaluation of available commercial and freeware products.
     • 3/2000 – 12/2000: Design and development of the NGOP prototype.
     • 1/2001 – present: Prototype deployment on the farms. Farm monitoring by system administrators and operators. Prototype evaluation. Extending the “xfalive” service to all nodes monitored by CSD.

  7. Prototype Statistics
     • Some implementation details:
       • Written primarily in Python (some modules in C)
       • Uses XML (and partially MathML) for all configuration files
     • Some deployment details:
       • Monitoring a total of 512 nodes:
         • Checking for node being down and node reset
       • On four farms (CDF, D0, two Fixed Target experiment farms; 270 nodes):
         • System daemons presence
         • Critical file systems presence and size
         • CPU load, memory and swap utilization
         • Number of users and users’ processes
         • Number of processors off-line
         • Baseboard temperature and fan speed
         • NFS timeouts
         • Disk errors
       • Number of monitored objects: ~6,500
       • About 5 instances of “ngop monitor” (GUI) are running simultaneously
       • Events are stored in an Oracle database
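The slide above says the configuration is kept in XML and that agents check quantities such as CPU load, swap utilization and baseboard temperature. As a rough sketch of how such a threshold check could look, the example below parses a small XML fragment and compares one sample against it. The element names, attribute names, limits and severities are all invented for illustration; they are not the real NGOP schema, and this is not the NGOP implementation.

```python
import xml.etree.ElementTree as ET

# Hypothetical configuration: element/attribute names and limits are
# assumptions made for this sketch, not the real NGOP configuration format.
CONFIG = """
<monitored_element name="fnpc201">
  <rule metric="cpu_load" max="8.0" severity="warning"/>
  <rule metric="swap_used_pct" max="90" severity="error"/>
  <rule metric="baseboard_temp_c" max="55" severity="error"/>
</monitored_element>
"""

def load_rules(xml_text):
    """Parse the XML and return (host name, list of (metric, limit, severity))."""
    root = ET.fromstring(xml_text.strip())
    rules = [
        (rule.get("metric"), float(rule.get("max")), rule.get("severity"))
        for rule in root.findall("rule")
    ]
    return root.get("name"), rules

def evaluate(host, rules, sample):
    """Return one event tuple for every metric in the sample above its limit."""
    events = []
    for metric, limit, severity in rules:
        value = sample.get(metric)
        if value is not None and value > limit:
            events.append((host, metric, severity, value))
    return events

host, rules = load_rules(CONFIG)
# One invented measurement; metrics missing from the sample are skipped.
events = evaluate(host, rules, {"cpu_load": 9.5, "swap_used_pct": 40.0})
print(events)
```

In this sketch a metric that is absent from the sample is simply skipped; a real agent would probably treat a missing measurement as an event in its own right.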

  8. Current Configuration
     [Diagram. Recoverable labels: NGOP Central Server, Config File Management Server, Archive Service, Action Client, WWW; NGOP Monitors on user nodes; Monitoring Agents (MA) for OSHealth, Ping and FBS (CDF_FBS, D0_FBS, FT_FBS, OFT_FBS), plus Swatch, on the CDF Farm, D0 Farm, FixTarget Farm and Old FixTarget Farm; nodes cdffarm1, fnsfo, d0bbin, fnsfh, fncdf 1-90, fnd0 1-100, fnpc 201-250, fnpc 1-37; monitored groups PPD, MISCOMP, CMS, CDF, D0, Kerberos, FNALU, Division Servers, SDSS, License Servers, Mail Servers, KTEV, MINOS, HPSS, ODS, BTEV, Enstore, FNCDUH.]

  9. Summary of Occurred Events
     • Detected problems:
       • Node reset
       • Node is down
       • One CPU is missing after reboot
       • File system not mounted
       • System daemon is dead
       • FBS Batch Manager is down
     • Raised alarms:
       • Memory usage is high
       • Swap usage is high
       • CPU load is high
       • File system is full
       • Baseboard temperature is high
       • Specific messages found in syslog: NFS timeouts, drive timeouts, …
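The last alarm type above relies on spotting specific messages in syslog. A minimal sketch of that kind of pattern matching is shown below. The regular expressions and the sample log lines are assumptions made for illustration; the slides do not give the actual patterns or say how the Swatch-based agents are configured.

```python
import re
from collections import Counter

# Hypothetical patterns, loosely modelled on the "NFS timeouts, drive
# timeouts" messages named in the slide; the real rules are not given.
PATTERNS = [
    ("nfs_timeout", re.compile(r"nfs.*(timed out|not responding)", re.IGNORECASE)),
    ("drive_timeout", re.compile(r"\b[hs]d[a-z]\b.*timeout", re.IGNORECASE)),
]

def scan(lines):
    """Return (event_name, line) for every line matching a known pattern."""
    hits = []
    for line in lines:
        for name, pattern in PATTERNS:
            if pattern.search(line):
                hits.append((name, line))
                break  # report each line at most once
    return hits

# Invented sample lines, for illustration only.
SAMPLE = [
    "Jun  1 12:00:01 fncdf12 kernel: nfs: server fnsfo not responding, timed out",
    "Jun  1 12:00:05 fncdf12 kernel: hda: lost interrupt, drive timeout",
    "Jun  1 12:00:09 fncdf12 sshd[123]: session opened for user farmuser",
]

hits = scan(SAMPLE)
counts = Counter(name for name, _ in hits)
print(dict(counts))
```

Counting matches per event name, rather than forwarding every line, is one simple way to keep a burst of identical messages from flooding a central server.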

  10. GUI Monitor Snapshots

  11. Report Generator (MISCOMP Web Query Interface)

  12. What’s next?
     • NGOP Production (end of summer 2001)
     • Wish list:
       • Provide Monitoring Client API
       • Implement Correlation (aka Looping) Agents
       • Implement historical rules and escalating alarms
       • Implement “snapshot” (“give me the updated system status now”) feature
       • Provide other than Python Monitoring Agent API
       • Fully Kerberize
       • Provide standard Win2000 Monitoring Agents
       • Design and provide dynamic handling of configuration changes for the Monitoring Client
       • Allow for easier handling of multiple configurations
       • Improve the Admin (Configuration) Client GUI
       • Provide Configuration GUI (hoping for a good free XML editor though)
       • Provide Performance Data Framework
       • Redesign/rewrite GUI (for scalability and friendliness)
       • Provide GUI for non-Linux platforms if really needed
       • Work on scalability up to 10000 hosts
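The wish list mentions historical rules and escalating alarms without describing them. One possible reading is a rule that looks at recent history and raises the severity the longer a condition persists. The sketch below illustrates that idea only; the class name, the severity levels and the step size are hypothetical and do not describe a planned NGOP design.

```python
# Hypothetical severity ladder; the slides do not define alarm levels.
LEVELS = ["ok", "warning", "error", "critical"]

class EscalatingAlarm:
    """Raise the severity as a threshold breach persists over samples."""

    def __init__(self, threshold, step=2):
        self.threshold = threshold
        self.step = step          # consecutive breaches needed per escalation
        self.consecutive = 0      # history kept as a running breach count

    def update(self, value):
        """Record one sample and return the current severity level."""
        if value > self.threshold:
            self.consecutive += 1
        else:
            self.consecutive = 0  # any good sample clears the alarm
        if self.consecutive == 0:
            return LEVELS[0]
        index = 1 + (self.consecutive - 1) // self.step
        return LEVELS[min(index, len(LEVELS) - 1)]

alarm = EscalatingAlarm(threshold=90.0, step=2)
# Invented swap-usage percentages sampled at successive intervals.
levels = [alarm.update(v) for v in [50, 95, 96, 97, 98, 99, 40]]
print(levels)
```

Here a single good sample clears the alarm completely; a rule that decays more slowly would need to keep more history than a running count.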

  13. More Info
     URL: http://www-isd.fnal.gov/ngop/
     E-mail: ngop@fnal.gov