1 / 57

Monitoring Systems and POWER5

2. Monitoring Systems and POWER5/6 LPARs with Ganglia. Agenda. Ganglia what is it ?Ganglia components and data flowAn introduction to RRDToolGanglia metrics what can be measured ?New POWER5/6 metrics (AIX

linus
Download Presentation

Monitoring Systems and POWER5

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


    1. Monitoring Systems and POWER5/6 LPARs with Ganglia Michael Perzl – michael@perzl.org

    2. 2 Monitoring Systems and POWER5/6 LPARs with Ganglia Agenda Ganglia – what is it ? Ganglia components and data flow An introduction to RRDTool Ganglia metrics – what can be measured ? New POWER5/6 metrics (AIX & Linux) Extending Ganglia with gmetric Add device specific information to Ganglia Ganglia network communication Installation issues Where to get Ganglia for AIX and Linux on POWER ? Best practices Future additions / plans Discussion Links

    3. Ganglia – what is it ?

    4. 4 Monitoring Systems and POWER5/6 LPARs with Ganglia Ganglia – what is it ? (1/3) Ganglia is an Open Source cluster performance monitoring tool and has been extended to include POWER5/6 features like shared processor LPARs, entitlement, physical CPU usage etc. This session covers: the technical details of Ganglia and the POWER5/6 extensions how to set it up and use it to monitor all LPARs in a single machine and lots of machines

    5. 5 Monitoring Systems and POWER5/6 LPARs with Ganglia Ganglia – what is it ? (2/3) Ganglia properties: scalable distributed monitoring system for high-performance computing systems such as clusters and grids based on a hierarchical design targeted at federations of clusters relies on a multicast-based listen/announce protocol to monitor state within clusters and uses a tree of point-to-point connections amongst representative cluster nodes to federate clusters and aggregate their state leverages widely used technologies such as XML for data representation XDR (eXternal Data Representation) for compact, portable data transport RRDtool for data storage and visualization uses carefully engineered data structures and algorithms to achieve very low per-node overheads and high concurrency robust implementation Open Source, written in C Downloaded 110,000+ times, 145+ countries, 500+ clusters, 2000+ nodes

    6. 6 Monitoring Systems and POWER5/6 LPARs with Ganglia Ganglia – what is it ? (3/3) Ganglia properties (cont.): has been ported to an extensive set of operating systems and processor architectures: AIX Darwin FreeBSD HP-UX IRIX Linux OSF NetBSD Solaris Windows (via Cygwin) is currently in use on over 500+ clusters around the world has been used to link clusters across university campuses and around the world and can scale to handle clusters with 2000+ nodes check http://ganglia.info/ for more details

    7. Ganglia components and data flow

    8. 8 Monitoring Systems and POWER5/6 LPARs with Ganglia Ganglia components The ganglia system consists of: two unique daemons: Ganglia Monitoring Daemon (gmond) monitoring daemon, collects the metrics runs on each node Ganglia Meta Daemon (gmetad) polls all gmond clients and stores the collected metrics in Round-Robin Databases (RRDs) a PHP-based web frontend a few other small utility programs gmetric can be used to easily extend Ganglia with additional user-defined metrics gstat gexec

    9. 9 Monitoring Systems and POWER5/6 LPARs with Ganglia Ganglia – Schematic View

    10. 10 Monitoring Systems and POWER5/6 LPARs with Ganglia Ganglia Architecture

    11. 11 Monitoring Systems and POWER5/6 LPARs with Ganglia Ganglia Monitoring Daemon (gmond) Ganglia Monitoring Daemon (gmond) is a multi-threaded daemon which runs on each cluster node you want to monitor. Installation is easy: just the daemon and a configuration file (/etc/gmond.conf) gmond has four main responsibilities: monitor changes in host state announce relevant changes listen to the state of all other ganglia nodes via a unicast or multicast channel answer requests for an XML description of the cluster state Each gmond transmits information in two different ways: unicasting or multicasting host state in external data representation (XDR) format using UDP messages sending XML over a TCP connection

    12. 12 Monitoring Systems and POWER5/6 LPARs with Ganglia Ganglia Meta Daemon (gmetad) (1/2) Ganglia Meta Daemon (gmetad) is a daemon which typically only runs on one specific cluster node – or on more when using a staged setup. Installation is easy: just the daemon and a configuration file (/etc/gmetad.conf) Federation in Ganglia is achieved using a tree of point-to-point connections amongst representative cluster nodes to aggregate the state of multiple clusters. At each node in the tree a gmetad periodically polls a collection of child data sources parses the collected XML saves all numeric volatile metrics to round-robin databases exports the aggregated XML over a TCP socket to clients

    13. 13 Monitoring Systems and POWER5/6 LPARs with Ganglia Ganglia Meta Daemon (gmetad) (2/2) Data sources may be either gmond daemons, representing specific clusters or other gmetad daemons, representing sets of clusters Data sources use source IP addresses for access control Multiple IP addresses can be specified for failover The capability is natural for aggregating data from clusters since each gmond daemon contains the entire state of its cluster

    14. 14 Monitoring Systems and POWER5/6 LPARs with Ganglia Ganglia PHP web frontend (1/2) Web frontend properties: provides a view of the gathered information via real-time dynamic web pages displays Ganglia data in a meaningful way for system administrators and users For example, one can view the CPU utilization over the past hour, day, week, month, or year The web frontend shows similar graphs for memory usage, disk usage, network statistics, number of running processes, and all other Ganglia metrics

    15. 15 Monitoring Systems and POWER5/6 LPARs with Ganglia Ganglia PHP web frontend (2/2) Web frontend properties (cont.): depends on the existence of the gmetad which provides it with data from several Ganglia sources opens the local port 8651 (by default) and expects to receive a Ganglia XML tree the web pages themselves are highly dynamic; any change to the Ganglia data appears immediately on the site This behavior leads to a very responsive site, but requires that the full XML tree be parsed on every page access Therefore, the Ganglia web frontend should run on a fairly powerful, dedicated machine if it presents a large amount of data is written in the PHP scripting language and uses graphs generated by gmetad to display history information has been tested on many flavors of Unix (primarily Linux) with the Apache web server and the PHP 4.1 module

    16. 16 Monitoring Systems and POWER5/6 LPARs with Ganglia Ganglia - data flow (1/4)

    17. 17 Monitoring Systems and POWER5/6 LPARs with Ganglia Ganglia - data flow (2/4)

    18. 18 Monitoring Systems and POWER5/6 LPARs with Ganglia Ganglia - data flow (3/4)

    19. 19 Monitoring Systems and POWER5/6 LPARs with Ganglia Ganglia - data flow (4/4)

    20. 20 Monitoring Systems and POWER5/6 LPARs with Ganglia Ganglia - data flow again

    21. An introduction to RRDTool

    22. 22 Monitoring Systems and POWER5/6 LPARs with Ganglia RRDTool Homepage: http://oss.oetiker.ch/rrdtool/ RRD is the Acronym for Round-Robin Database. RRD is a system to store and display time-series data (i.e., network bandwidth, machine-room temperature, server load average). It stores the data in a very compact way that will not expand over time (fixed size of DB), and it presents useful graphs by processing the data to enforce a certain data density. It can be used either via simple wrapper scripts (from shell or Perl) or via frontends that poll network devices and put a friendly user interface on it.

    23. 23 Monitoring Systems and POWER5/6 LPARs with Ganglia RRDTool example graph

    24. 24 Monitoring Systems and POWER5/6 LPARs with Ganglia RRDTool example # rrdtool create test.rrd \ --start 920804400 \ --step 300 \ DS:km:COUNTER:600:U:U \ RRA:AVERAGE:0.5:1:24 # rrdtool update test.rrd 920804700:12345 920805000:12357 920805300:12363 # rrdtool update test.rrd 920805600:12363 920805900:12363 920806200:12373 # rrdtool update test.rrd 920806500:12383 920806800:12393 920807100:12399 # rrdtool update test.rrd 920807400:12405 920807700:12411 920808000:12415 # rrdtool update test.rrd 920808300:12420 920808600:12422 920808900:12423 # rrdtool graph kilometer.png \ --start 920804400 \ --end 920808000 \ DEF:mykm=test.rrd:km:AVERAGE \ LINE2:mykm#FF0000

    25. Ganglia metrics – what can be monitored ?

    26. 26 Monitoring Systems and POWER5/6 LPARs with Ganglia Metrics Definition of a metric: A metric is a certain observed property of the system. Number of metrics: 34 standard metrics, i.e., available (i.e., defined) on all platforms Additional platform dependent metrics available Solaris 8 additional metrics available HP-UX 4 additional metrics available AIX 18 additional new metrics available for POWER5/6 !!! details later…. Remarks: One RRD database per Ganglia metric is used Database size is fixed (~ 12 kB per RRD database with default settings) Some standard metrics do not exist on all platforms, e.g., some metrics (coming from Linux) don’t exist or don’t make sense on AIX

    27. 27 Monitoring Systems and POWER5/6 LPARs with Ganglia Ganglia standard metrics (1/2) boottime system boot timestamp bytes_in number of network bytes received per second bytes_out number of network bytes sent out per second cpu_aidle percent of time since boot idle CPU not defined on AIX, Linux yes cpu_idle percent CPU idle time cpu_nice percent CPU nice not defined on AIX, Linux yes cpu_num number of CPUs cpu_intr number of interrupts (??) not defined on AIX, Linux yes cpu_sintr number of system interrupts (??) not defined on AIX, Linux yes cpu_speed speed of CPUs in MHz cpu_system percent CPU system cpu_user percent CPU user cpu_wio CPU time spent waiting for I/O disk_free total free disk space in GB disk_total total available disk space in GB load_one load average over 1 minute load_five load average over 5 minutes load_fifteen load average over 15 minutes

    28. 28 Monitoring Systems and POWER5/6 LPARs with Ganglia Ganglia standard metrics (2/2) machine_type type of machine (e.g., POWER5) mem_total total available memory in kB mem_free amount of free memory in kB mem_shared amount of shared memory not defined on AIX, Linux yes mem_buffers amount of memory used for buffers not defined on AIX, Linux yes mem_cached amount of memory used for cache AIX: numperm memory pages mtu MTU size reported in bytes os_name name of OS os_release OS release version (on AIX: level of fileset bos.mp) part_max_used most filled disk partition not defined on AIX, Linux yes pkts_in number of network packets received pkts_out number of network packets sent out proc_run total number of running processes proc_total total number of processes swap_free free swap space in kB AIX: paging space free swap_total total available swap space in kB AIX: paging space

    29. New POWER5/6 metrics (AIX & Linux)

    30. 30 Monitoring Systems and POWER5/6 LPARs with Ganglia Ganglia and POWER5/6 Current deficiences of Ganglia on POWER5/6: Ganglia does not understand Shared Processor LPAR statistics things like capped, weight, CPU entitlement etc. these metrics can be added (see below) How to fix these deficiences, i.e., add these new metrics ? Easy solution: Extend Ganglia with the utility program gmetric Details in section “Extending Ganglia with gmetric” (later) Preferred solution: Add these new metrics to the gmond implementation on AIX and Linux on POWER Requires significant patching of Ganglia source code This has been completed and tested, i.e., ready to go ! Where to get it ? My personal web site: http://www.perzl.org/ganglia/

    31. 31 Monitoring Systems and POWER5/6 LPARs with Ganglia Additional Ganglia POWER5/6 metrics (1/5) Question: How are those additional metrics programmed and where do they get the information from? Answer: AIX: Only the APIs provided by libperfstat are used As a consequence the fileset bos.perf.libperfstat must be installed Linux: Only entries in the /proc file system are used, e.g., /proc/cpuinfo, /proc/meminfo, /proc/ppc64/lparcfg etc. No additional fileset must be installed

    32. 32 Monitoring Systems and POWER5/6 LPARs with Ganglia Additional Ganglia POWER5/6 metrics (2/5) capped cpu_entitlement cpu_in_lpar cpu_in_machine cpu_in_pool cpu_pool_idle cpu_used disk_read disk_write kernel64bit lpar lpar_name lpar_num oslevel serial_num smt splpar weight

    33. 33 Monitoring Systems and POWER5/6 LPARs with Ganglia Additional Ganglia POWER5 metrics (3/5) capped Type: String value returns "yes" if the system is a POWER5 Shared Processor LPAR which is running in capped mode or "no" otherwise cpu_entitlement Type: Float value returns the Capacity Entitlement of the system in units of physical CPUs cpu_in_lpar Type: Integer value returns the number of CPUs the OS sees in the system. In a POWER5 Shared Processor LPAR this returns the number of virtual CPUs. When SMT is enabled this number is doubled. cpu_in_machine Type: Integer value returns the number of physical CPUs in the whole system cpu_in_pool Type: Integer value returns the number of physical CPUs in the Shared Processor Pool cpu_pool_idle Type: Float value returns in fractional numbers of physical CPUs how much the Shared Processor Pool is idle

    34. 34 Monitoring Systems and POWER5/6 LPARs with Ganglia Additional Ganglia POWER5 metrics (4/5) cpu_used Type: Float value returns in fractional numbers of physical CPUs how much compute resources this shared processor has used since the last time this metric was measured disk_read Type: Float value returns in units of kB the total read I/O of the system disk_write Type: Float value returns in units of kB the total write I/O of the system kernel64bit Type: String value returns "yes" if the running kernel is a 64-bit kernel or "no" otherwise lpar Type: String value returns "yes" if the system is a LPAR or "no" otherwise lpar_name Type: String value returns the name of the LPAR as defined on the Hardware Management Console (HMC) or some reasonable message otherwise

    35. 35 Monitoring Systems and POWER5/6 LPARs with Ganglia Additional Ganglia POWER5 metrics (5/5) lpar_num Type: Integer value returns the partition ID of the LPAR as defined on the Hardware Management Console (HMC) or some reasonable message otherwise oslevel Type: String value returns the version string as provided by the AIX command 'oslevel‘ serial_num Type: String value returns the serial number of the system as provided by the AIX command 'uname‘ smt Type: String value returns "yes" if SMT is enabled or "no" otherwise splpar Type: String value returns "yes" if the system is running in a shared processor LPAR or "no" otherwise weight Type: Integer value returns the weight of the LPAR running in uncapped mode

    36. Extending Ganglia with gmetric

    37. 37 Monitoring Systems and POWER5/6 LPARs with Ganglia Extending Ganglia How can I easily add metrics to Ganglia ? Ganglia has a simple way to add metrics that should be monitored The utility program gmetric is used for that purpose These new metrics are then automatically added to the database and web server data and graphs

    38. 38 Monitoring Systems and POWER5/6 LPARs with Ganglia gmetric example – Machine firmware level Example of a static metric: Machine firmware level gmetric --name firmware \ --value `lsattr -El sys0 -a modelname -F value` \ --type "string" Remarks: The above will only save the statistics once. The firmware level is unlikely to change without reboot, therefore it is sufficient to run this command once.

    39. 39 Monitoring Systems and POWER5/6 LPARs with Ganglia gmetric example – database transactions Example of a variable metric: Transaction rate of your database To add the number of transaction and assuming you have a script that will work this out called "transactions" that returns a number with a decimal point – you will have to write this script yourself ! gmetric --name tpm \ --value `/usr/local/bin/transactions` \ --type double Remarks: This command will only save the statistics once. As the number of transactions per minute will definitely change, to get these always up to date, it is recommended to run the command regularly, e.g., run once every 60 seconds via cron.

    40. Add device specific information to Ganglia

    41. 41 Monitoring Systems and POWER5/6 LPARs with Ganglia General Remarks Add device specific information to Ganglia via gmetric for Network adapters Disks Disk adapters (SCSI + Fibre Channel) Available in two variants: as a daemon (implemented in C) as a shell script Network information: Daemons: g_aix_netif for AIX g_linux_netif for Linux on POWER Shell script: ent_adapter.sh for AIX Disk information: Daemons: g_aix_disk for AIX g_linux_disk for Linux on POWER Disk adapter information (AIX only) Daemons g_aix_adapter for AIX Shell script: fcs_adapter.sh for AIX

    42. 42 Monitoring Systems and POWER5/6 LPARs with Ganglia Add device specific information on AIX and Linux (1/4) Network adapter information: g_aix_netif or g_linux_netif Utility daemon program periodically calls gmetric (interval configurable) Monitored parameters per network interface: Bytes received / second Bytes transmitted / second Packets received / second Packets transmitted / second MTU size (AIX only) Example: g_aix_netif -s5 -b3 -p3 -m en1 Every 5 seconds get the number of bytes/sec and packets/sec transferred in and out as well as the current MTU size for network interface en1.

    43. 43 Monitoring Systems and POWER5/6 LPARs with Ganglia Add device specific information on AIX and Linux (2/4) Disk information: g_aix_disk or g_linux_disk Utility daemon program periodically calls gmetric (interval configurable) Monitored parameters per disk: Bytes read / second Bytes written / second Example: g_aix_disk -s10 -i -o hdisk3 Every 10 seconds get the number of bytes/sec transferred in and out for disk hdisk3.

    44. 44 Monitoring Systems and POWER5/6 LPARs with Ganglia Add device specific information on AIX and Linux (3/4) Disk adapter information: g_aix_adapter (AIX only) Utility daemon program periodically calls gmetric (interval configurable) Monitored parameters per disk adapter: Bytes read / second Bytes written / second Example: g_aix_adapter –s5 –i –o scsi0 Every 5 seconds get the number of bytes/sec transferred in and out for SCSI adapter scsi0.

    45. 45 Monitoring Systems and POWER5/6 LPARs with Ganglia Add device specific information on AIX and Linux (4/4) ./g_aix_netif: Version 1.0 g_aix_netif [OPTIONS] <network-interface> [-?] or [-h] This help information [-s seconds] The time between output (default is 60 seconds), seconds must be in the range [1..3600]. [-c loop_count] The number of loops (default = 20 million), loop_count must be in the range [10..20000000]. [-b 0|1|2|3] Show bytes received/sent (default = off) 0 = don't show, 1 = show incoming 2 = show outgoing, 3 = show both [-p 0|1|2|3] Show packets received/sent (default = off) 0 = don't show, 1 = show incoming 2 = show outgoing, 3 = show both [-m] Show MTU size for <network-interface> (default = off) [-d] Debug mode, remain in foreground, output only to screen Please note: If not in debug mode (-d) g_aix_netif runs the same command via a shell and assumes gmetric will be found in the PATH and gmond is already running. g_aix_netif will disconnect from your terminal to become a daemon. Use "ps -ef | grep g_aix_netif" to confirm it is running.

    46. Ganglia network communication

    47. 47 Monitoring Systems and POWER5/6 LPARs with Ganglia Ganglia network communication Ganglia by default uses Multicast Some network administrators might not like Multicast This can be changed to Unicast - requires changes to default config files Very simple changes to the gmond.conf files Ganglia gmond network “chatter” The processes talk to each other quite a lot Not large in 100Mb or 1Gb or virtual networks terms Recommendation: Ganglia over admin network rather than user network if possible

    48. 48 Monitoring Systems and POWER5/6 LPARs with Ganglia gmond: Multicast configuration example gmond.conf: /* This configuration is as close to 2.5.x default behavior as possible The values closely match ./gmond/metric.h definitions in 2.5.x */ globals { daemonize = yes setuid = yes ??? on some Linux distros: no user = nobody debug_level = 0 max_udp_msg_len = 1472 mute = no deaf = no host_dmax = 3600 /*secs */ cleanup_threshold = 300 /*secs */ gexec = no } /* If a cluster attribute is specified, then all gmond hosts are wrapped inside of a <CLUSTER> tag. If you do not specify a cluster tag, then all <HOSTS> will NOT be wrapped inside of a <CLUSTER> tag. */ cluster { name = "System p5 Model 550" owner = "unspecified" latlong = "unspecified" url = "unspecified" } gmond.conf continued: ... /* The host section describes attributes of the host, like the location */ host { location = "System p5 Model 550" } /* Feel free to specify as many udp_send_channels as you like. Gmond used to only support having a single channel */ udp_send_channel { mcast_join = 239.2.11.71 port = 8649 } /* You can specify as many udp_recv_channels as you like as well. */ udp_recv_channel { mcast_join = 239.2.11.71 port = 8649 bind = 239.2.11.71 } /* You can specify as many tcp_accept_channels as you like to share an xml description of the state of the cluster */ tcp_accept_channel { port = 8649 }

    49. 49 Monitoring Systems and POWER5/6 LPARs with Ganglia gmond: Unicast configuration example gmond.conf: /* This configuration is as close to 2.5.x default behavior as possible The values closely match ./gmond/metric.h definitions in 2.5.x */ globals { daemonize = yes setuid = yes ??? on some Linux distros: no user = nobody debug_level = 0 max_udp_msg_len = 1472 mute = no deaf = no host_dmax = 3600 /*secs */ cleanup_threshold = 300 /*secs */ gexec = no } /* If a cluster attribute is specified, then all gmond hosts are wrapped inside of a <CLUSTER> tag. If you do not specify a cluster tag, then all <HOSTS> will NOT be wrapped inside of a <CLUSTER> tag. */ cluster { name = "System p5 Model 550" owner = "unspecified" latlong = "unspecified" url = "unspecified" } ... gmond.conf continued: ... /* The host section describes attributes of the host, like the location */ host { location = "System p5 Model 550" } /* Feel free to specify as many udp_send_channels as you like. Gmond used to only support having a single channel */ udp_send_channel { host = p550-aix port = 8649 } /* You can specify as many udp_recv_channels as you like as well. */ udp_recv_channel { port = 8649 } /* You can specify as many tcp_accept_channels as you like to share an xml description of the state of the cluster */ tcp_accept_channel { port = 8649 }

    50. 50 Monitoring Systems and POWER5/6 LPARs with Ganglia gmetad: configuration example # Format: # data_source "my cluster" [polling interval] address1:port addreses2:port ... # # The keyword 'data_source' must immediately be followed by a unique string which identifies the source, # then an optional polling interval in seconds. The source will be polled at this interval on average. # If the polling interval is omitted, 15sec is asssumed. # # A list of machines which service the data source follows, in the format ip:port, or name:port. # If a port is not specified then 8649 (the default gmond port) is assumed. # default: There is no default value # # data_source "my cluster" 10 localhost my.machine.edu:8649 1.2.3.5:8655 # data_source "my grid" 50 1.3.4.7:8655 grid.org:8651 grid-backup.org:8651 # data_source "another source" 1.3.4.7:8655 1.3.4.8 data_source "Systen p5 Model 550" 15 p550-aix:8649 p550-nim:8649 # Round-Robin Archives # You can specify custom Round-Robin archives here (defaults are listed below) # # RRAs "RRA:AVERAGE:0.5:1:240" "RRA:AVERAGE:0.5:24:240" "RRA:AVERAGE:0.5:168:240" \ # "RRA:AVERAGE:0.5:672:240" "RRA:AVERAGE:0.5:5760:370" # The name of this Grid. All the data sources above will be wrapped in a GRID tag with this name. # default: Unspecified gridname "Mycluster"

    51. 51 Monitoring Systems and POWER5/6 LPARs with Ganglia Security issues Ganglia setup can be tunneled through SSH if certain security guidelines must be adhered to security guidelines allow only SSH connections, nothing else Detailed description of such a setup can be found at: http://www.ibm.com/collaboration/wiki/display/LinuxP/Ganglia/

    52. Installation issues

    53. 53 Monitoring Systems and POWER5/6 LPARs with Ganglia Build requirements for Ganglia from scratch RRDTool package dependencies freetype2 libart_lpgl libpng Perl zlib Apache 2 package dependencies httpd expat Perl zlib mod_ssl httpd openssl PHP package dependencies none gmond package dependencies none gmetad packages dependencies Apache 2 PHP libxml2 rrdtool

    54. 54 Monitoring Systems and POWER5/6 LPARs with Ganglia Installation issues on POWER5/6 Linux: Installation on any recent Linux distribution is very easy! All Linux distributions contain the necessary RPM packages. AIX: For AIX some (though not all) of the prerequisites can be fulfilled with RPM packages from the AIX Toolbox for Linux Applications: http://www.ibm.com/servers/aix/products/aixos/linux/ There was also a problem for people getting hold of Apache 2 on AIX with the latest PHP version, so Nigel Griffiths wrote a How-To to build these popular Open Source tools for AIX with GCC: http://www.ibm.com/collaboration/wiki/display/WikiPtype/aixopen Compilation of Ganglia with the IBM C/C++ compilers is also easy and is done for the Ganglia binaries provided on my personal website: http://www.perzl.org/ganglia/

    55. Where to get Ganglia for AIX and Linux on POWER ?

    56. 56 Monitoring Systems and POWER5/6 LPARs with Ganglia Ganglia binaries and source code for POWER5/6 (1/2) Where to get it ? My personal web site: http://www.perzl.org/ganglia/ What is available ? Binary and source RPMs available for: AIX v4.3.3 (gmond only) AIX 5L v5.1 and v5.2 AIX 5L v5.3 Red Hat Enterprise Linux 4 and 5 SLES 9 and SLES 10 Fedora Core 4, 5, 6 and 7 openSUSE 10.0, 10.1, 10.2 and 10.3 Precompiled Apache 2 + PHP for Ganglia gmetad on AIX5L v5.1 and higher My enhanced Ganglia web interface Device specific information (network, disk, adapter) added to Ganglia via gmetric

    57. 57 Monitoring Systems and POWER5/6 LPARs with Ganglia Ganglia binaries and source code for POWER5/6 (2/2) Required source code changes against version 3.0.5 of Ganglia to incorporate the new POWER5/6 metrics: configure configure.in gmetad/Makefile.in gmond/gmond.c lib/apr_net.c lib/libgmond.c lib/protocol.x libmetrics/libmetrics.h libmetrics/aix/metrics.c libmetrics/linux/metrics.c libmetrics/tests/test-metrics.c

    58. 58 Monitoring Systems and POWER5/6 LPARs with Ganglia Download statistics of http://perzl.org/ganglia/ (1/2)

    59. 59 Monitoring Systems and POWER5/6 LPARs with Ganglia Download statistics of http://perzl.org/ganglia/ (2/2)

    60. Best Practices

    61. 61 Monitoring Systems and POWER5/6 LPARs with Ganglia Some things to consider before you start… Hostnames To Ganglia a new hostname is a new machine Has to resolve IP address so use DNS IP addresses stable Make sure you are not going to change IP addresses Time and date Make sure the timezone, time and date is consistent on all machines in a cluster Use of NTP is recommended So These are normal on production machines For prototype and test systems – get this right before starting Ganglia Simple Ganglia How-To available for people setting up their first Ganglia system available at: http://www.ibm.com/collaboration/wiki/display/WikiPtype/ganglia

    62. 62 Monitoring Systems and POWER5/6 LPARs with Ganglia Ganglia data flow and what goes where…

    63. 63 Monitoring Systems and POWER5/6 LPARs with Ganglia Best Practices Preferred setup Ganglia sampling intervals Ganglia default ports Shared Ethernet statistics Fibre Channel statistics Enhanced web interface

    64. 64 Monitoring Systems and POWER5/6 LPARs with Ganglia Best Practices – Preferred Setup Define each System p machine with all its LPARs as a separate cluster Use Unicast for network communication Define at least two LPARs per System p machine as gmond hosts for gmetad One would be sufficient, however, two is better for high availibility reasons Define those two LPARs in /etc/gmetad.conf as the information brokers for that machine From gmetad: Don’t poll the gmond hosts more frequently than every 15 secs Know upfront what time intervals to use for sampling (RRAs stanza in /etc/gmetad.conf) See next slides Use my extensions for Ethernet adapters Fibre Channel adapters Web interface

    65. 65 Monitoring Systems and POWER5/6 LPARs with Ganglia Best Practices – Ganglia Sampling Intervals (1/6) Important to know: The sampling interval is defined in /etc/gmetad.conf. The "RRAs" stanza is used to defined individual settings. The sampling settings are global. If no "RRAs" stanza is defined a default configuration is used. For historic reasons all values are specified in intervals of 15 seconds.

    66. 66 Monitoring Systems and POWER5/6 LPARs with Ganglia Best Practices – Ganglia Sampling Intervals (2/6) Example: Default settings in Ganglia RRAs "RRA:AVERAGE:0.5:1:240" \ "RRA:AVERAGE:0.5:24:240" \ "RRA:AVERAGE:0.5:168:240" \ "RRA:AVERAGE:0.5:672:240" \ "RRA:AVERAGE:0.5:5760:370" used for Translation: display of Take 240 samples at 1 × 15 seconds intervals hour Take 240 samples at 24 × 15 seconds (= 6 minutes) intervals day Take 240 samples at 168 × 15 seconds (= 42 minutes) intervals week Take 240 samples at 672 × 15 seconds (= 168 minutes) intervals month Take 370 samples at 5760 × 15 seconds (= 24 hours) intervals year

    67. 67 Monitoring Systems and POWER5/6 LPARs with Ganglia Best Practices – Ganglia Sampling Intervals (3/6) Example: 1-minute sampling for one year RRAs "RRA:AVERAGE:0.5:4:525600" Translation: Take 525600 samples at 4 × 15 seconds (= 1 minute) intervals 525600 = 60 (samples/hour) × 24 (hours) × 365 (days) × 1 (year)

    68. 68 Monitoring Systems and POWER5/6 LPARs with Ganglia Best Practices – Ganglia Sampling Intervals (4/6) Example: 1-minute sampling for 6 months, 5-minute sampling for 2 years RRAs "RRA:AVERAGE:0.5:4:259200" \ "RRA:AVERAGE:0.5:20:210240" Translation: Take 259200 samples at every 4 × 15 seconds (= 1 minute) intervals 259200 = 60 (samples/hour) × 24 (hours) × 30 (days) × 6 (months) Take 210240 samples at every 20 × 15 seconds (= 5 minutes) intervals 210240 = 12 (samples/hour) × 24 (hours) × 365 (days) × 2 (years)

    69. 69 Monitoring Systems and POWER5/6 LPARs with Ganglia Best Practices – Ganglia Sampling Intervals (5/6) Example: 15-second sampling for 1 day, 1-minute sampling for 2 months, 10-minute sampling for 1 year RRAs "RRA:AVERAGE:0.5:1:5760" \ "RRA:AVERAGE:0.5:4:86400" \ "RRA:AVERAGE:0.5:40:52560" Translation: Take 5760 samples at every 1 × 15 seconds intervals 5760 = 4 (samples/minute) 60 (samples/hour) × 24 (hours) Take 86400 samples at every 4 × 15 seconds (= 1 minute) intervals 86400 = 60 (samples/hour) × 24 (hours) × 30 (days) × 2 (months) Take 52560 samples at every 40 × 15 seconds (= 10 minutes) intervals 52560 = 6 (samples/hour) × 24 (hours) × 365 (days) × 1 (year)

    70. 70 Monitoring Systems and POWER5/6 LPARs with Ganglia Best Practices – Ganglia Sampling Intervals (6/6) Example: 1-minute sampling for 2 months, 5-minute sampling for 6 months, 15-minute sampling for 3 years RRAs "RRA:AVERAGE:0.5:4:86400" \ "RRA:AVERAGE:0.5:20:51840" \ "RRA:AVERAGE:0.5:60:105120" Translation: Take 86400 samples at every 4 × 15 seconds (= 1 minute) intervals 86400 = 60 (samples/hour) × 24 (hours) × 30 (days) × 2 (months) Take 210240 samples at every 20 × 15 seconds (= 5 minutes) intervals 51840 = 12 (samples/hour) × 24 (hours) × 30 (days) × 6 (month) Take 105120 samples at every 60 × 15 seconds (= 15 minutes) intervals 105120 = 4 (samples/hour) × 24 (hours) × 365 (days) × 3 (years)

    71. 71 Monitoring Systems and POWER5/6 LPARs with Ganglia Best Practices – Default Ports Ganglia by default uses the following ports: 8649 The port gmond uses for Sending to other gmonds via UDP (udp_send_channel in /etc/gmond.conf) Receiving from other gmonds via UDP (udp_receive_channel in /etc/gmond.conf) Sending an XML description of the state of the cluster (tcp_accept_channel in /etc/gmond.conf) 8651 The port gmetad will answer requests for XML. 8652 The port gmetad will answer queries for XML. This facility allows simple subtree and summation views of the XML tree.

    72. 72 Monitoring Systems and POWER5/6 LPARs with Ganglia Best Practices – Shared Ethernet Statistics Question: How to monitor SEA statistics on the VIO server ? The AIX libperfstat library seems not to report any statistics about Ethernet adapters if there are no interfaces defined on that adapter. Only seldom interfaces are defined on SEAs. The AIX command ‘entstat’ however provides these statistics. Solution: Extension through gmetric via a shell script Korn shell script ‘ent_adapter.sh’ Get it from http://www.perzl.org/ganglia/devicespecific.html Graphs will appear immediately for that specific host

    73. 73 Monitoring Systems and POWER5/6 LPARs with Ganglia Best Practices – Fibre Channel Statistics Question: How to monitor Fibre Channel statistics on the VIO server ? The AIX libperfstat library seems not to report any statistics about Fibre Channel adapters if there are no disks attached to the adapter. Tapes, for instance, would be left out. The AIX command ‘fcstat’ however provides these statistics. Solution: Extension through gmetric via a shell script Korn shell script ‘fcs_adapter.sh’ Get it from http://www.perzl.org/ganglia/devicespecific.html Graphs will appear immediately for that specific host

    74. 74 Monitoring Systems and POWER5/6 LPARs with Ganglia Best Practices – Enhanced Web Interface (1/2) Get it from: http://www.perzl.org/ganglia/webinterface.html Includes the following extensions: Jscalendar patch of Timothy D. Witham for selecting custom intervals for graph display. Custom Graph Patch of Alex Balk. Zoomable graphs Adapted from the UC Berkeley Grid live Demo website. They implemented some nice graph zooming, i.e., when you click onto a single node metric graph or on one of the Grid overview graphs you get a nicely scaled bigger version of that graph.

    75. 75 Monitoring Systems and POWER5/6 LPARs with Ganglia Best Practices – Enhanced Web Interface (2/2) Includes the following extensions (cont.) cpu_used statistics for IBM POWER5/6 systems A cpu_used overview graph for the cluster_view template which - when clicked onto - produces a detailed cpu_used statistics graph for the whole cluster

    76. Future additions / plans

    77. 77 Monitoring Systems and POWER5/6 LPARs with Ganglia Future additions / plans Try to incorporate the POWER5/6 additions into the Ganglia mainstream source code Adapt the AIX and Linux on POWER metrics to POWER6 Provide a custom web interface tailored specifically to POWER5/6 Future updates on my personal web site: Update RPMs to newest versions Provide RPMs for Apache + PHP on my AIX Open Source web site (very soon)

    78. Discussion

    79. 79 Monitoring Systems and POWER5/6 LPARs with Ganglia Ganglia advantages Overview and major statistics available at one look with nice graphs Available on many platforms Widely used, many users and many different setups Global view (cluster/grid) and local view (single node) available Easily remote accessible through web interface Highly customizable through config files Shared Processor LPAR statistics available Monitored data is stored in Round-Robin Databases (RRDs), i.e., this information could be easily passed on accounting too Fine granular statistics possible Different time interval views: hour, day, week, month, year or customizable Open Source (source code and binary RPMs for AIX and Linux on POWER) available Low risk and low cost Easily extendible through utility program gmetric through adapting the Ganglia source code (as shown in this presentation)

    80. 80 Monitoring Systems and POWER5/6 LPARs with Ganglia Ganglia disadvantages Not an official IBM tool No official support available Primarily a monitoring tool, not an accounting tool ‘‘Only‘‘ a monitoring (visualization) tool, no actions can be triggered AIX setup from scratch requires some work (building all prerequisite software), although it is well documented Normally not necessary by using my binary RPMs

    81. Links

    82. 82 Monitoring Systems and POWER5/6 LPARs with Ganglia Links (1/2) Main Ganglia website http://ganglia.info/ Ganglia Documentation http://ganglia.info/docs/ Ganglia Source Code Download http://ganglia.sourceforge.net/downloads.php Ganglia POWER5/6 extensions and ready-to-run binaries (RPM files) as well as source code http://www.perzl.org/ganglia/ My personal AIX Open Source repository http://www.perzl.org/aix/

    83. 83 Monitoring Systems and POWER5/6 LPARs with Ganglia Links (2/2) Ganglia Usage at Wikipedia http://ganglia.wikimedia.org/ RRDTool homepage http://oss.oetiker.ch/rrdtool/ Ganglia How-To on IBM AIX wiki site (written by Nigel Griffiths) http://www.ibm.com/collaboration/wiki/display/WikiPtype/ganglia Open Source with AIX on IBM AIX wiki site (written by Nigel Griffiths) http://www.ibm.com/collaboration/wiki/display/WikiPtype/aixopen IBM AIX wiki site: http://www.ibm.com/collaboration/wiki/display/WikiPtype/Home IBM Linux on POWER wiki site: http://www.ibm.com/collaboration/wiki/display/LinuxP/Home

    84. 84 Monitoring Systems and POWER5/6 LPARs with Ganglia Questions ?

    85. Backup Slides

    86. Simple setup example

    87. 87 Monitoring Systems and POWER5/6 LPARs with Ganglia 1 Simple gmond install (1/2) On all data nodes: Install the gmond RPM file on each data node rpm –Uvh ganglia-gmond-VVV.PPP.rpm VVV.PPP is the version number and platform like: 3.0.5 aix5.1.ppc, aix5.3.ppc, suse.ppc64, redhat.ppc64 Edit the configuration file /etc/gmond.conf On Linux on POWER need access to /proc/ppc64/lparcfg = root user only So also set “setuid = no” depending on Linux distribution

    88. 88 Monitoring Systems and POWER5/6 LPARs with Ganglia 1 Simple gmond install (2/2) Use the gmond control script located in /etc to start gmond: /etc/init.d/gmond start Linux (SUSE and Red Hat) /etc/rc.d/init.d/gmond start AIX These scripts also automatically start gmond when booting the system All options are: start stop restart status

    89. 89 Monitoring Systems and POWER5/6 LPARs with Ganglia 2 Simple gmetad - prerequisites (1/3) Gmetad needs rrdtool to store the data and Apache2 + PHP to serve the web pages For Linux rrdtool, Apache2 and PHP5 is part of every Linux distribution For AIX you can resolve the prerequisites as follows: rrdtool provided at http://www.perzl.org/ganglia/ libart_lgpl AIX Toolbox for Linux Applications libpng AIX Toolbox for Linux Applications freetype2 AIX Toolbox for Linux Applications zlib AIX Toolbox for Linux Applications Perl AIX Toolbox for Linux Applications AIX Toolbox for Linux Applications http://www.ibm.com/servers/aix/products/aixos/linux

    90. 90 Monitoring Systems and POWER5/6 LPARs with Ganglia 2 Simple gmetad install (2/3) Install the gmetad RPM file on each node rpm –Uvh ganglia-gmetad-VVV.PPP.rpm VVV.PPP is the version number and platform like: 3.0.5 and aix5.1.ppc, aix5.3.ppc, suse.ppc64, redhat.ppc64 Edit the configuration file /etc/gmetad.conf data_source “mycluster" localhost

    91. 91 Monitoring Systems and POWER5/6 LPARs with Ganglia 2 Start gmetad (3 of 3) Use the gmetad control script located in /etc to start gmetad: /etc/init.d/gmetad start Linux (SUSE and Red Hat) /etc/rc.d/init.d/gmetad start AIX These scripts also automatically start gmetad when booting the system All options are: start stop restart status

    92. 92 Monitoring Systems and POWER5/6 LPARs with Ganglia 3 Ganglia Web Server front end setup (1/2) Could use existing web server but we need PHP support On AIX – we have only used Apache2 and PHP5 (see next foil) On Linux use version included with the distribution (Red Hat EL 4 & 5, SUSE SLES 9 & SLES 10) Simple test of PHP if it works: Create <web-server-directory>/phptest.php Use browser to access this file Should print out lots of interesting data rpm –Uvh ganglia-web-3.0.5-1.noarch.rpm “noarch” as this is PHP scripts only may have to use the “--ignoreos” flag for installation may have to move these files to your web server directory tree depends on if you are running AIX or Linux

    93. 93 Monitoring Systems and POWER5/6 LPARs with Ganglia For AIX you will not be able to find Apache2 and PHP5 with the required features: AIX CD-ROM or update download site – nope AIX repositories (Bull or UCLA) – nope AIX Toolbox for Linux Applications – nope WAS – nope IBM HTTP Server (Apache with add-ons) – nope Get Apache + PHP from my personal website instead! http://www.perzl.org/ganglia/ http://www.perzl.org/aix/ If you have to recompile your own Apache + PHP Using latest GNU GCC compiler and latest libraries - it is actually easy Nigel Griffiths wrote the details of how to do this on the AIX 5L Wiki at http://www.ibm.com/collaboration/wiki/display/WikiPtype/aixopen 3 Ganglia Web Server front end setup (AIX) (2/2)

More Related