Hepsysman
HEPSYSMAN

Monitoring Workshop

Introduction to the Day

and

Overview of Ganglia

Pete Gronbech


Agenda

Wednesday 31st October 2007

10:00 Start / Coffee

10:30 - 11:00 Introduction & Ganglia Overview (Pete Gronbech)

11:00 - 12:30 MonAMI Interactive Workshop (Paul Millar)

12:30 - 13:30 Lunch

13:30 - 14:00 Intro to Nagios (A. Elwell)

14:00 - 14:30 GRID Service Monitoring Group (Ian Neilson)

14:30 - 15:00 Further Nagios Scripts (Chris Brew)

15:00 - 16:00 Live install at a site and workshop discussion

16:00 - 16:30 Other monitoring tools discussion (Pakiti, gridmap, accounting cpu and storage, SAM, SAM admins page etc.)

16:30 AOB and wrap up

Introduction & Ganglia


Why Monitoring

  • Critical machines cannot be trusted: your systems will fail. When they do, two things save you from downtime: redundancy and monitoring systems

  • Limited manpower at sites

  • Ever increasing sizes of clusters

  • Complex software with many failure modes

  • Need to meet SLAs – 95% uptime

  • PR and reporting
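
To put the 95% uptime figure in concrete terms, the short calculation below works out the downtime budget it leaves. The 30-day month and 365-day year are assumptions made for the arithmetic, not terms taken from any actual SLA.

```python
# Downtime allowed by an availability target (illustrative arithmetic only).

def downtime_budget_hours(availability: float, period_hours: float) -> float:
    """Return the hours of downtime permitted in a period at the given availability."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    return (1.0 - availability) * period_hours

HOURS_PER_MONTH = 30 * 24   # assumed 30-day month
HOURS_PER_YEAR = 365 * 24   # assumed non-leap year

monthly = downtime_budget_hours(0.95, HOURS_PER_MONTH)
yearly = downtime_budget_hours(0.95, HOURS_PER_YEAR)
print(f"95% uptime allows about {monthly:.0f} h/month or {yearly:.0f} h/year of downtime")
```

That is roughly a day and a half per month, which is easy to use up on a single unnoticed failure over a weekend.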

Many external monitoring sites

  • Gstat - http://goc.grid.sinica.edu.tw/gstat/UKI.html

  • Steve Lloyd's page - http://hepwww.ph.qmul.ac.uk/~lloyd/gridpp/ukgrid.html

  • SAM - https://lcg-sam.cern.ch:8443/sam/sam.py?sensors=CE&regions=UKI&vo=ops&order=SiteName&funct=ShowSensorTests

Many more external monitoring sites

  • Gridmap - http://gridmap.cern.ch/gm/

  • Gridview - http://gridview.cern.ch/GRIDVIEW/

  • Accounting - http://www3.egee.cesga.es/gridsite/accounting/CESGA/egee_view.html

Local Site Monitoring

  • Unix tools

  • Batch System tools

  • What is really needed is something that gives a quick visual overview of the health and load of your cluster: Ganglia

Ganglia

How does Ganglia work?

  • Ganglia works through a small agent, gmond, running on each node or machine to be monitored. The same gmond package and configuration can be rolled out to many machines at once. Each gmond reports the state of its local node, and a machine running a master gmetad instance collects the data.

  • The server uses RRDtool to store the data over time

  • The Ganglia framework can be extended to monitor many parameters.
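
One way to extend Ganglia with extra parameters is the `gmetric` command-line tool that ships with it, which injects a custom metric into the cluster. The sketch below only builds and, if asked, runs such a command. The metric name `pbs_running_jobs` and its value are hypothetical, and `publish` assumes `gmetric` is on the PATH of a machine already running gmond.

```python
import shutil
import subprocess

def gmetric_command(name, value, metric_type="uint32", units=""):
    """Build a gmetric argument list for one custom metric."""
    cmd = ["gmetric", "--name", name, "--value", str(value), "--type", metric_type]
    if units:
        cmd += ["--units", units]
    return cmd

def publish(name, value, metric_type="uint32", units=""):
    """Send the metric if gmetric is installed; return True if it was sent."""
    if shutil.which("gmetric") is None:
        return False  # gmetric is not installed on this machine
    subprocess.run(gmetric_command(name, value, metric_type, units), check=True)
    return True

# Hypothetical metric name and value, for illustration only.
print(" ".join(gmetric_command("pbs_running_jobs", 42, units="jobs")))
```

Run from cron every minute or so, a script like this would make the new metric appear alongside the built-in ones on the web front end.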

Setup

The software can be downloaded from http://ganglia.sourceforge.net/

Clients just have to run gmond, which is configured by /etc/gmond.conf

Server to collect the data runs gmetad. It could also run gmond to monitor itself.

The web interface needs to run on a webserver.

[Diagram: Computers A, B and C each run gmond; Computer D runs gmetad, gmond and httpd.]

Client Setup

[Diagram: Computers A, B and C each run gmond.]

/etc/gmond.conf extracts

cluster {
  name = "LCG Workers"
}

/* Feel free to specify as many udp_send_channels as you like. Gmond
   used to only support having a single channel */
udp_send_channel {
  mcast_join = 239.2.11.95
  port = 8649
}

/* You can specify as many udp_recv_channels as you like as well. */
udp_recv_channel {
  mcast_join = 239.2.11.95
  port = 8649
  bind = 239.2.11.95
}

/* You can specify as many tcp_accept_channels as you like to share
   an xml description of the state of the cluster */
tcp_accept_channel {
  port = 8649
}

udp_send_channel {
  port = 8649
  host = pplxconfig
}
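
The tcp_accept_channel above hands an XML description of the cluster to any client that connects to it. As a rough sketch of how that can be used, the code below reads the dump and extracts one metric per host. The sample XML is hand-written and heavily abridged, with made-up values; real gmond output carries many more attributes and metrics. `fetch_gmond_xml` assumes a gmond reachable on port 8649 as configured in the extract.

```python
import socket
import xml.etree.ElementTree as ET

def fetch_gmond_xml(host, port=8649, timeout=5.0):
    """Read the XML dump gmond writes to a TCP client before closing the socket."""
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as sock:
        while True:
            data = sock.recv(65536)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks)  # bytes, so the parser can honour the XML encoding header

def metric_by_host(xml_data, metric_name):
    """Map host name -> value string for one named metric."""
    root = ET.fromstring(xml_data)
    result = {}
    for host in root.iter("HOST"):
        for metric in host.iter("METRIC"):
            if metric.get("NAME") == metric_name:
                result[host.get("NAME")] = metric.get("VAL")
    return result

# Abridged, invented sample standing in for real gmond output.
SAMPLE = """<GANGLIA_XML VERSION="3.0" SOURCE="gmond">
 <CLUSTER NAME="LCG Workers">
  <HOST NAME="computerA"><METRIC NAME="load_one" VAL="0.42"/></HOST>
  <HOST NAME="computerB"><METRIC NAME="load_one" VAL="1.07"/></HOST>
 </CLUSTER>
</GANGLIA_XML>"""

print(metric_by_host(SAMPLE, "load_one"))
```

Against a live node, `metric_by_host(fetch_gmond_xml("computerA"), "load_one")` would do the same with real data; this is also a quick way to confirm that a newly configured gmond is hearing from its neighbours.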

yum install ganglia-gmond

edit /etc/gmond.conf

service gmond start

chkconfig gmond on

Server Setup

[Diagram: Computer D runs gmetad, gmond and httpd.]

Extracts from /etc/gmetad.conf

data_source "LCG Workers" computerA.physics.ox.ac.uk computerB.physics.ox.ac.uk computerC.physics.ox.ac.uk

data_source "LCG Servers" t2se01.physics.ox.ac.uk:8656 t2ce02.physics.ox.ac.uk:8656 gridlogger.physics.ox.ac.uk:8656
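
To make the structure of these lines explicit, here is a small parser sketch for a data_source entry: a quoted cluster name, an optional polling interval in seconds, then one or more hosts with an optional port. It assumes that a missing port means gmond's default of 8649, and it does not handle IPv6 literals. This is an illustration of the format, not the parser gmetad itself uses.

```python
import shlex

DEFAULT_PORT = 8649  # gmond's default port, assumed when no port is given

def parse_data_source(line):
    """Split a data_source line into (name, interval, [(host, port), ...])."""
    tokens = shlex.split(line)  # shlex keeps the quoted cluster name as one token
    if len(tokens) < 3 or tokens[0] != "data_source":
        raise ValueError("not a data_source line with at least one host")
    name, rest = tokens[1], tokens[2:]
    interval = None
    if rest[0].isdigit():  # optional polling interval in seconds
        interval, rest = int(rest[0]), rest[1:]
    if not rest:
        raise ValueError("data_source line lists no hosts")
    hosts = []
    for item in rest:
        host, _, port = item.partition(":")
        hosts.append((host, int(port) if port else DEFAULT_PORT))
    return name, interval, hosts

line = 'data_source "LCG Servers" t2se01.physics.ox.ac.uk:8656 t2ce02.physics.ox.ac.uk:8656'
print(parse_data_source(line))
```

Listing several hosts per source gives gmetad alternatives to poll if the first one is down, which is why each cluster above names more than one machine.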

yum install ganglia-gmond ganglia-gmetad ganglia-web

edit /etc/gmond.conf

edit /etc/gmetad.conf

Aggregating sub clusters

Host level detail

Customizing

  • Adding PBS Batch Queue data

PBS Queue Monitoring

  • Originally based on RAL Tier 1 work

  • Actually fairly complicated.

  • See Chris Brew or me later for details.

How is Ganglia different from Nagios?

  • Ganglia is architecturally designed to perform efficiently in very large monitoring environments: each Ganglia gmond performs its service checks locally, reporting in at a regular interval to the gmetad. Nagios performs its service checks by polling each device across a network connection and waiting for a response (known as "active checks"), which can be more resource and bandwidth intensive.

  • Nagios uses the results of its active checks to determine state by comparing the metrics it polls to thresholds. These state changes can in turn be used to generate notifications and customizable corrective actions. Ganglia, by contrast, has no built-in thresholds, and so does not generate events or notifications.

  • The general rule of thumb has been: if you need to monitor a limited number of aspects of a large number of identical devices, use Ganglia; if you want to monitor lots of aspects of a smaller number of different devices, use Nagios. But those distinctions are blurring as Ganglia supports more and more devices, and as Nagios' scalability improves.

How is Ganglia different from Nagios

  • The problem with Ganglia, and with all the other external web pages we have been looking at, is that you have to look at them!

  • If all is well with your system you don’t want to have to look.

  • This is where Nagios comes in. It can be set up to alert you when something goes wrong, or when a value passes a threshold.
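
The threshold idea can be sketched in a few lines. The code below maps a measured value to the OK, WARNING or CRITICAL states that Nagios plugins conventionally signal with exit codes 0, 1 and 2. The load-average thresholds are hypothetical, and this is a simplified illustration rather than a complete plugin.

```python
# Conventional Nagios plugin exit codes.
OK, WARNING, CRITICAL = 0, 1, 2
STATE_NAMES = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL"}

def check_threshold(value, warn, crit):
    """Return the state for a metric where higher values are worse."""
    if value >= crit:
        return CRITICAL
    if value >= warn:
        return WARNING
    return OK

def status_line(label, value, warn, crit):
    """Return (state, one-line summary) in the style of a plugin's output."""
    state = check_threshold(value, warn, crit)
    text = f"{label} {STATE_NAMES[state]} - value={value} (warn={warn}, crit={crit})"
    return state, text

# Hypothetical load-average thresholds, for illustration only.
state, line = status_line("LOAD", 5.2, warn=4.0, crit=8.0)
print(line)
```

A real plugin would finish with `sys.exit(state)` so that Nagios can turn the state change into a notification.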
