570 likes | 1.07k Views
2. Monitoring Systems and POWER5/6 LPARs with Ganglia. Agenda. Ganglia what is it ?Ganglia components and data flowAn introduction to RRDToolGanglia metrics what can be measured ?New POWER5/6 metrics (AIX
E N D
1. Monitoring Systems andPOWER5/6 LPARs with Ganglia
Michael Perzl – michael@perzl.org
2. 2 Monitoring Systems and POWER5/6 LPARs with Ganglia Agenda Ganglia – what is it ?
Ganglia components and data flow
An introduction to RRDTool
Ganglia metrics – what can be measured ?
New POWER5/6 metrics (AIX & Linux)
Extending Ganglia with gmetric
Add device specific information to Ganglia
Ganglia network communication
Installation issues
Where to get Ganglia for AIX and Linux on POWER ?
Best practices
Future additions / plans
Discussion
Links
3. Ganglia – what is it ?
4. 4 Monitoring Systems and POWER5/6 LPARs with Ganglia Ganglia – what is it ? (1/3) Ganglia is an Open Source cluster performance monitoring tool and has been extended to include POWER5/6 features like shared processor LPARs, entitlement, physical CPU usage etc.
This session covers:
the technical details of Ganglia and the POWER5/6 extensions
how to set it up and use it to monitor all LPARs in a single machine and lots of machines
5. 5 Monitoring Systems and POWER5/6 LPARs with Ganglia Ganglia – what is it ? (2/3) Ganglia properties:
scalable distributed monitoring system for high-performance computing systems such as clusters and grids
based on a hierarchical design targeted at federations of clusters
relies on a multicast-based listen/announce protocol to monitor state within clusters and uses a tree of point-to-point connections amongst representative cluster nodes to federate clusters and aggregate their state
leverages widely used technologies such as
XML for data representation
XDR (eXternal Data Representation) for compact, portable data transport
RRDtool for data storage and visualization
uses carefully engineered data structures and algorithms to achieve very low per-node overheads and high concurrency
robust implementation
Open Source, written in C
Downloaded 110,000+ times, 145+ countries, 500+ clusters, 2000+ nodes
6. 6 Monitoring Systems and POWER5/6 LPARs with Ganglia Ganglia – what is it ? (3/3) Ganglia properties (cont.):
has been ported to an extensive set of operating systems and processor architectures:
AIX
Darwin
FreeBSD
HP-UX
IRIX
Linux
OSF
NetBSD
Solaris
Windows (via Cygwin)
is currently in use on over 500+ clusters around the world
has been used to link clusters across university campuses and around the world and can scale to handle clusters with 2000+ nodes
check http://ganglia.info/ for more details
7. Ganglia components and data flow
8. 8 Monitoring Systems and POWER5/6 LPARs with Ganglia Ganglia components The ganglia system consists of:
two unique daemons:
Ganglia Monitoring Daemon (gmond)
monitoring daemon, collects the metrics
runs on each node
Ganglia Meta Daemon (gmetad)
polls all gmond clients and stores the collected metrics in Round-Robin Databases (RRDs)
a PHP-based web frontend
a few other small utility programs
gmetric
can be used to easily extend Ganglia with additional user-defined metrics
gstat
gexec
9. 9 Monitoring Systems and POWER5/6 LPARs with Ganglia Ganglia – Schematic View
10. 10 Monitoring Systems and POWER5/6 LPARs with Ganglia Ganglia Architecture
11. 11 Monitoring Systems and POWER5/6 LPARs with Ganglia Ganglia Monitoring Daemon (gmond) Ganglia Monitoring Daemon (gmond) is a multi-threaded daemon which runs on each cluster node you want to monitor.
Installation is easy:
just the daemon and a configuration file (/etc/gmond.conf)
gmond has four main responsibilities:
monitor changes in host state
announce relevant changes
listen to the state of all other ganglia nodes via a unicast or multicast channel
answer requests for an XML description of the cluster state
Each gmond transmits information in two different ways:
unicasting or multicasting host state in external data representation (XDR) format using UDP messages
sending XML over a TCP connection
12. 12 Monitoring Systems and POWER5/6 LPARs with Ganglia Ganglia Meta Daemon (gmetad) (1/2) Ganglia Meta Daemon (gmetad) is a daemon which typically only runs on one specific cluster node – or on more when using a staged setup.
Installation is easy:
just the daemon and a configuration file (/etc/gmetad.conf)
Federation in Ganglia is achieved using a tree of point-to-point connections amongst representative cluster nodes to aggregate the state of multiple clusters.
At each node in the tree a gmetad
periodically polls a collection of child data sources
parses the collected XML
saves all numeric volatile metrics to round-robin databases
exports the aggregated XML over a TCP socket to clients
13. 13 Monitoring Systems and POWER5/6 LPARs with Ganglia Ganglia Meta Daemon (gmetad) (2/2) Data sources may be either
gmond daemons, representing specific clusters
or
other gmetad daemons, representing sets of clusters
Data sources use source IP addresses for access control
Multiple IP addresses can be specified for failover
The capability is natural for aggregating data from clusters since each gmond daemon contains the entire state of its cluster
14. 14 Monitoring Systems and POWER5/6 LPARs with Ganglia Ganglia PHP web frontend (1/2) Web frontend properties:
provides a view of the gathered information via real-time dynamic web pages
displays Ganglia data in a meaningful way for system administrators and users
For example, one can view the CPU utilization over the past hour, day, week, month, or year
The web frontend shows similar graphs for memory usage, disk usage, network statistics, number of running processes, and all other Ganglia metrics
15. 15 Monitoring Systems and POWER5/6 LPARs with Ganglia Ganglia PHP web frontend (2/2) Web frontend properties (cont.):
depends on the existence of the gmetad which provides it with data from several Ganglia sources
opens the local port 8651 (by default) and expects to receive a Ganglia XML tree
the web pages themselves are highly dynamic; any change to the Ganglia data appears immediately on the site
This behavior leads to a very responsive site, but requires that the full XML tree be parsed on every page access
Therefore, the Ganglia web frontend should run on a fairly powerful, dedicated machine if it presents a large amount of data
is written in the PHP scripting language and uses graphs generated by gmetad to display history information
has been tested on many flavors of Unix (primarily Linux) with the Apache web server and the PHP 4.1 module
16. 16 Monitoring Systems and POWER5/6 LPARs with Ganglia Ganglia - data flow (1/4)
17. 17 Monitoring Systems and POWER5/6 LPARs with Ganglia Ganglia - data flow (2/4)
18. 18 Monitoring Systems and POWER5/6 LPARs with Ganglia Ganglia - data flow (3/4)
19. 19 Monitoring Systems and POWER5/6 LPARs with Ganglia Ganglia - data flow (4/4)
20. 20 Monitoring Systems and POWER5/6 LPARs with Ganglia Ganglia - data flow again
21. An introduction to RRDTool
22. 22 Monitoring Systems and POWER5/6 LPARs with Ganglia RRDTool Homepage: http://oss.oetiker.ch/rrdtool/
RRD is the Acronym for Round-Robin Database.
RRD is a system to store and display time-series data (i.e., network bandwidth, machine-room temperature, server load average).
It stores the data in a very compact way that will not expand over time (fixed size of DB), and it presents useful graphs by processing the data to enforce a certain data density.
It can be used either via simple wrapper scripts (from shell or Perl) or via frontends that poll network devices and put a friendly user interface on it.
23. 23 Monitoring Systems and POWER5/6 LPARs with Ganglia RRDTool example graph
24. 24 Monitoring Systems and POWER5/6 LPARs with Ganglia RRDTool example # rrdtool create test.rrd \
--start 920804400 \
--step 300 \
DS:km:COUNTER:600:U:U \
RRA:AVERAGE:0.5:1:24
# rrdtool update test.rrd 920804700:12345 920805000:12357 920805300:12363
# rrdtool update test.rrd 920805600:12363 920805900:12363 920806200:12373
# rrdtool update test.rrd 920806500:12383 920806800:12393 920807100:12399
# rrdtool update test.rrd 920807400:12405 920807700:12411 920808000:12415
# rrdtool update test.rrd 920808300:12420 920808600:12422 920808900:12423
# rrdtool graph kilometer.png \
--start 920804400 \
--end 920808000 \
DEF:mykm=test.rrd:km:AVERAGE \
LINE2:mykm#FF0000
25. Ganglia metrics – what can be monitored ?
26. 26 Monitoring Systems and POWER5/6 LPARs with Ganglia Metrics Definition of a metric:
A metric is a certain observed property of the system.
Number of metrics:
34 standard metrics, i.e., available (i.e., defined) on all platforms
Additional platform dependent metrics available
Solaris
8 additional metrics available
HP-UX
4 additional metrics available
AIX
18 additional new metrics available for POWER5/6 !!!
details later….
Remarks:
One RRD database per Ganglia metric is used
Database size is fixed (~ 12 kB per RRD database with default settings)
Some standard metrics do not exist on all platforms, e.g., some metrics (coming from Linux) don’t exist or don’t make sense on AIX
27. 27 Monitoring Systems and POWER5/6 LPARs with Ganglia Ganglia standard metrics (1/2) boottime
system boot timestamp
bytes_in
number of network bytes received per second
bytes_out
number of network bytes sent out per second
cpu_aidle
percent of time since boot idle CPU
not defined on AIX, Linux yes
cpu_idle
percent CPU idle time
cpu_nice
percent CPU nice
not defined on AIX, Linux yes
cpu_num
number of CPUs
cpu_intr
number of interrupts (??)
not defined on AIX, Linux yes
cpu_sintr
number of system interrupts (??)
not defined on AIX, Linux yes cpu_speed
speed of CPUs in MHz
cpu_system
percent CPU system
cpu_user
percent CPU user
cpu_wio
CPU time spent waiting for I/O
disk_free
total free disk space in GB
disk_total
total available disk space in GB
load_one
load average over 1 minute
load_five
load average over 5 minutes
load_fifteen
load average over 15 minutes
28. 28 Monitoring Systems and POWER5/6 LPARs with Ganglia Ganglia standard metrics (2/2) machine_type
type of machine (e.g., POWER5)
mem_total
total available memory in kB
mem_free
amount of free memory in kB
mem_shared
amount of shared memory
not defined on AIX, Linux yes
mem_buffers
amount of memory used for buffers
not defined on AIX, Linux yes
mem_cached
amount of memory used for cache
AIX: numperm memory pages
mtu
MTU size reported in bytes
os_name
name of OS os_release
OS release version (on AIX: level of fileset bos.mp)
part_max_used
most filled disk partition
not defined on AIX, Linux yes
pkts_in
number of network packets received
pkts_out
number of network packets sent out
proc_run
total number of running processes
proc_total
total number of processes
swap_free
free swap space in kB
AIX: paging space free
swap_total
total available swap space in kB
AIX: paging space
29. New POWER5/6 metrics (AIX & Linux)
30. 30 Monitoring Systems and POWER5/6 LPARs with Ganglia Ganglia and POWER5/6 Current deficiences of Ganglia on POWER5/6:
Ganglia does not understand Shared Processor LPAR statistics
things like capped, weight, CPU entitlement etc.
these metrics can be added (see below)
How to fix these deficiences, i.e., add these new metrics ?
Easy solution:
Extend Ganglia with the utility program gmetric
Details in section “Extending Ganglia with gmetric” (later)
Preferred solution:
Add these new metrics to the gmond implementation on AIX and Linux on POWER
Requires significant patching of Ganglia source code
This has been completed and tested, i.e., ready to go !
Where to get it ?
My personal web site: http://www.perzl.org/ganglia/
31. 31 Monitoring Systems and POWER5/6 LPARs with Ganglia Additional Ganglia POWER5/6 metrics (1/5) Question:
How are those additional metrics programmed and where do they get the information from?
Answer:
AIX:
Only the APIs provided by libperfstat are used
As a consequence the fileset bos.perf.libperfstat must be installed
Linux:
Only entries in the /proc file system are used, e.g., /proc/cpuinfo, /proc/meminfo, /proc/ppc64/lparcfg etc.
No additional fileset must be installed
32. 32 Monitoring Systems and POWER5/6 LPARs with Ganglia Additional Ganglia POWER5/6 metrics (2/5) capped
cpu_entitlement
cpu_in_lpar
cpu_in_machine
cpu_in_pool
cpu_pool_idle
cpu_used
disk_read
disk_write kernel64bit
lpar
lpar_name
lpar_num
oslevel
serial_num
smt
splpar
weight
33. 33 Monitoring Systems and POWER5/6 LPARs with Ganglia Additional Ganglia POWER5 metrics (3/5) capped
Type: String value
returns "yes" if the system is a POWER5 Shared Processor LPAR which is running in capped mode or "no" otherwise
cpu_entitlement
Type: Float value
returns the Capacity Entitlement of the system in units of physical CPUs
cpu_in_lpar
Type: Integer value
returns the number of CPUs the OS sees in the system. In a POWER5 Shared Processor LPAR this returns the number of virtual CPUs. When SMT is enabled this number is doubled.
cpu_in_machine
Type: Integer value
returns the number of physical CPUs in the whole system
cpu_in_pool
Type: Integer value
returns the number of physical CPUs in the Shared Processor Pool
cpu_pool_idle
Type: Float value
returns in fractional numbers of physical CPUs how much the Shared Processor Pool is idle
34. 34 Monitoring Systems and POWER5/6 LPARs with Ganglia Additional Ganglia POWER5 metrics (4/5) cpu_used
Type: Float value
returns in fractional numbers of physical CPUs how much compute resources this shared processor has used since the last time this metric was measured
disk_read
Type: Float value
returns in units of kB the total read I/O of the system
disk_write
Type: Float value
returns in units of kB the total write I/O of the system
kernel64bit
Type: String value
returns "yes" if the running kernel is a 64-bit kernel or "no" otherwise
lpar
Type: String value
returns "yes" if the system is a LPAR or "no" otherwise
lpar_name
Type: String value
returns the name of the LPAR as defined on the Hardware Management Console (HMC) or some reasonable message otherwise
35. 35 Monitoring Systems and POWER5/6 LPARs with Ganglia Additional Ganglia POWER5 metrics (5/5) lpar_num
Type: Integer value
returns the partition ID of the LPAR as defined on the Hardware Management Console (HMC) or some reasonable message otherwise
oslevel
Type: String value
returns the version string as provided by the AIX command 'oslevel‘
serial_num
Type: String value
returns the serial number of the system as provided by the AIX command 'uname‘
smt
Type: String value
returns "yes" if SMT is enabled or "no" otherwise
splpar
Type: String value
returns "yes" if the system is running in a shared processor LPAR or "no" otherwise
weight
Type: Integer value
returns the weight of the LPAR running in uncapped mode
36. Extending Ganglia with gmetric
37. 37 Monitoring Systems and POWER5/6 LPARs with Ganglia Extending Ganglia How can I easily add metrics to Ganglia ?
Ganglia has a simple way to add metrics that should be monitored
The utility program gmetric is used for that purpose
These new metrics are then automatically added to the database and web server data and graphs
38. 38 Monitoring Systems and POWER5/6 LPARs with Ganglia gmetric example – Machine firmware level Example of a static metric: Machine firmware level
gmetric --name firmware \ --value `lsattr -El sys0 -a modelname -F value` \ --type "string"
Remarks:
The above will only save the statistics once.
The firmware level is unlikely to change without reboot, therefore it is sufficient to run this command once.
39. 39 Monitoring Systems and POWER5/6 LPARs with Ganglia gmetric example – database transactions Example of a variable metric: Transaction rate of your database
To add the number of transaction and assuming you have a script that will work this out called "transactions" that returns a number with a decimal point – you will have to write this script yourself !
gmetric --name tpm \ --value `/usr/local/bin/transactions` \ --type double
Remarks:
This command will only save the statistics once.
As the number of transactions per minute will definitely change, to get these always up to date, it is recommended to run the command regularly, e.g., run once every 60 seconds via cron.
40. Add device specific information to Ganglia
41. 41 Monitoring Systems and POWER5/6 LPARs with Ganglia General Remarks Add device specific information to
Ganglia via gmetric for
Network adapters
Disks
Disk adapters (SCSI + Fibre Channel)
Available in two variants:
as a daemon (implemented in C)
as a shell script Network information:
Daemons:
g_aix_netif for AIX
g_linux_netif for Linux on POWER
Shell script:
ent_adapter.sh for AIX
Disk information:
Daemons:
g_aix_disk for AIX
g_linux_disk for Linux on POWER
Disk adapter information (AIX only)
Daemons
g_aix_adapter for AIX
Shell script:
fcs_adapter.sh for AIX
42. 42 Monitoring Systems and POWER5/6 LPARs with Ganglia Add device specific information on AIX and Linux (1/4) Network adapter information: g_aix_netif or g_linux_netif
Utility daemon program periodically calls gmetric (interval configurable)
Monitored parameters per network interface:
Bytes received / second
Bytes transmitted / second
Packets received / second
Packets transmitted / second
MTU size (AIX only)
Example:
g_aix_netif -s5 -b3 -p3 -m en1
Every 5 seconds get the number of bytes/sec and packets/sec transferred in and out as well as the current MTU size for network interface en1.
43. 43 Monitoring Systems and POWER5/6 LPARs with Ganglia Add device specific information on AIX and Linux (2/4) Disk information: g_aix_disk or g_linux_disk
Utility daemon program periodically calls gmetric (interval configurable)
Monitored parameters per disk:
Bytes read / second
Bytes written / second
Example:
g_aix_disk -s10 -i -o hdisk3
Every 10 seconds get the number of bytes/sec transferred in and out for disk hdisk3.
44. 44 Monitoring Systems and POWER5/6 LPARs with Ganglia Add device specific information on AIX and Linux (3/4) Disk adapter information: g_aix_adapter (AIX only)
Utility daemon program periodically calls gmetric (interval configurable)
Monitored parameters per disk adapter:
Bytes read / second
Bytes written / second
Example:
g_aix_adapter –s5 –i –o scsi0
Every 5 seconds get the number of bytes/sec transferred in and out for SCSI adapter scsi0.
45. 45 Monitoring Systems and POWER5/6 LPARs with Ganglia Add device specific information on AIX and Linux (4/4) ./g_aix_netif: Version 1.0
g_aix_netif [OPTIONS] <network-interface>
[-?] or [-h] This help information
[-s seconds] The time between output (default is 60 seconds),
seconds must be in the range [1..3600].
[-c loop_count] The number of loops (default = 20 million),
loop_count must be in the range [10..20000000].
[-b 0|1|2|3] Show bytes received/sent (default = off)
0 = don't show, 1 = show incoming
2 = show outgoing, 3 = show both
[-p 0|1|2|3] Show packets received/sent (default = off)
0 = don't show, 1 = show incoming
2 = show outgoing, 3 = show both
[-m] Show MTU size for <network-interface> (default = off)
[-d] Debug mode, remain in foreground, output only to screen
Please note:
If not in debug mode (-d) g_aix_netif runs the same command via a shell
and assumes gmetric will be found in the PATH and gmond is already running.
g_aix_netif will disconnect from your terminal to become a daemon.
Use "ps -ef | grep g_aix_netif" to confirm it is running.
46. Ganglia network communication
47. 47 Monitoring Systems and POWER5/6 LPARs with Ganglia Ganglia network communication Ganglia by default uses Multicast
Some network administrators might not like Multicast
This can be changed to Unicast - requires changes to default config files
Very simple changes to the gmond.conf files
Ganglia gmond network “chatter”
The processes talk to each other quite a lot
Not large in 100Mb or 1Gb or virtual networks terms
Recommendation:
Ganglia over admin network rather than user network if possible
48. 48 Monitoring Systems and POWER5/6 LPARs with Ganglia gmond: Multicast configuration example gmond.conf:
/* This configuration is as close to 2.5.x default behavior as possible
The values closely match ./gmond/metric.h definitions in 2.5.x */
globals {
daemonize = yes
setuid = yes ??? on some Linux distros: no
user = nobody
debug_level = 0
max_udp_msg_len = 1472
mute = no
deaf = no
host_dmax = 3600 /*secs */
cleanup_threshold = 300 /*secs */
gexec = no
}
/* If a cluster attribute is specified, then all gmond hosts are wrapped inside
of a <CLUSTER> tag. If you do not specify a cluster tag, then all
<HOSTS> will NOT be wrapped inside of a <CLUSTER> tag. */
cluster {
name = "System p5 Model 550"
owner = "unspecified"
latlong = "unspecified"
url = "unspecified"
}
gmond.conf continued:
...
/* The host section describes attributes of the host, like the location */
host {
location = "System p5 Model 550"
}
/* Feel free to specify as many udp_send_channels as you like. Gmond
used to only support having a single channel */
udp_send_channel {
mcast_join = 239.2.11.71
port = 8649
}
/* You can specify as many udp_recv_channels as you like as well. */
udp_recv_channel {
mcast_join = 239.2.11.71
port = 8649
bind = 239.2.11.71
}
/* You can specify as many tcp_accept_channels as you like to share
an xml description of the state of the cluster */
tcp_accept_channel {
port = 8649
}
49. 49 Monitoring Systems and POWER5/6 LPARs with Ganglia gmond: Unicast configuration example gmond.conf:
/* This configuration is as close to 2.5.x default behavior as possible
The values closely match ./gmond/metric.h definitions in 2.5.x */
globals {
daemonize = yes
setuid = yes ??? on some Linux distros: no
user = nobody
debug_level = 0
max_udp_msg_len = 1472
mute = no
deaf = no
host_dmax = 3600 /*secs */
cleanup_threshold = 300 /*secs */
gexec = no
}
/* If a cluster attribute is specified, then all gmond hosts are wrapped inside
of a <CLUSTER> tag. If you do not specify a cluster tag, then all
<HOSTS> will NOT be wrapped inside of a <CLUSTER> tag. */
cluster {
name = "System p5 Model 550"
owner = "unspecified"
latlong = "unspecified"
url = "unspecified"
}
... gmond.conf continued:
...
/* The host section describes attributes of the host, like the location */
host {
location = "System p5 Model 550"
}
/* Feel free to specify as many udp_send_channels as you like. Gmond
used to only support having a single channel */
udp_send_channel {
host = p550-aix
port = 8649
}
/* You can specify as many udp_recv_channels as you like as well. */
udp_recv_channel {
port = 8649
}
/* You can specify as many tcp_accept_channels as you like to share
an xml description of the state of the cluster */
tcp_accept_channel {
port = 8649
}
50. 50 Monitoring Systems and POWER5/6 LPARs with Ganglia gmetad: configuration example # Format:
# data_source "my cluster" [polling interval] address1:port addreses2:port ...
#
# The keyword 'data_source' must immediately be followed by a unique string which identifies the source,
# then an optional polling interval in seconds. The source will be polled at this interval on average.
# If the polling interval is omitted, 15sec is asssumed.
#
# A list of machines which service the data source follows, in the format ip:port, or name:port.
# If a port is not specified then 8649 (the default gmond port) is assumed.
# default: There is no default value
#
# data_source "my cluster" 10 localhost my.machine.edu:8649 1.2.3.5:8655
# data_source "my grid" 50 1.3.4.7:8655 grid.org:8651 grid-backup.org:8651
# data_source "another source" 1.3.4.7:8655 1.3.4.8
data_source "Systen p5 Model 550" 15 p550-aix:8649 p550-nim:8649
# Round-Robin Archives
# You can specify custom Round-Robin archives here (defaults are listed below)
#
# RRAs "RRA:AVERAGE:0.5:1:240" "RRA:AVERAGE:0.5:24:240" "RRA:AVERAGE:0.5:168:240" \
# "RRA:AVERAGE:0.5:672:240" "RRA:AVERAGE:0.5:5760:370"
# The name of this Grid. All the data sources above will be wrapped in a GRID tag with this name.
# default: Unspecified
gridname "Mycluster"
51. 51 Monitoring Systems and POWER5/6 LPARs with Ganglia Security issues Ganglia setup can be tunneled through SSH if
certain security guidelines must be adhered to
security guidelines allow only SSH connections, nothing else
Detailed description of such a setup can be found at:
http://www.ibm.com/collaboration/wiki/display/LinuxP/Ganglia/
52. Installation issues
53. 53 Monitoring Systems and POWER5/6 LPARs with Ganglia Build requirements for Ganglia from scratch RRDTool package dependencies
freetype2
libart_lpgl
libpng
Perl
zlib
Apache 2 package dependencies
httpd
expat
Perl
zlib
mod_ssl
httpd
openssl
PHP package dependencies
none
gmond package dependencies
none
gmetad packages dependencies
Apache 2
PHP
libxml2
rrdtool
54. 54 Monitoring Systems and POWER5/6 LPARs with Ganglia Installation issues on POWER5/6 Linux:
Installation on any recent Linux distribution is very easy!
All Linux distributions contain the necessary RPM packages.
AIX:
For AIX some (though not all) of the prerequisites can be fulfilled with RPM packages from the AIX Toolbox for Linux Applications:
http://www.ibm.com/servers/aix/products/aixos/linux/
There was also a problem for people getting hold of Apache 2 on AIX with the latest PHP version, so Nigel Griffiths wrote a How-To to build these popular Open Source tools for AIX with GCC:
http://www.ibm.com/collaboration/wiki/display/WikiPtype/aixopen
Compilation of Ganglia with the IBM C/C++ compilers is also easy and is done for the Ganglia binaries provided on my personal website:
http://www.perzl.org/ganglia/
55. Where to get Ganglia for AIXand Linux on POWER ?
56. 56 Monitoring Systems and POWER5/6 LPARs with Ganglia Ganglia binaries and source code for POWER5/6 (1/2) Where to get it ?
My personal web site: http://www.perzl.org/ganglia/
What is available ?
Binary and source RPMs available for:
AIX v4.3.3 (gmond only)
AIX 5L v5.1 and v5.2
AIX 5L v5.3
Red Hat Enterprise Linux 4 and 5
SLES 9 and SLES 10
Fedora Core 4, 5, 6 and 7
openSUSE 10.0, 10.1, 10.2 and 10.3
Precompiled Apache 2 + PHP for Ganglia gmetad on AIX5L v5.1 and higher
My enhanced Ganglia web interface
Device specific information (network, disk, adapter) added to Ganglia via gmetric
57. 57 Monitoring Systems and POWER5/6 LPARs with Ganglia Ganglia binaries and source code for POWER5/6 (2/2) Required source code changes against version 3.0.5 of Ganglia to incorporate the new POWER5/6 metrics:
configure
configure.in
gmetad/Makefile.in
gmond/gmond.c
lib/apr_net.c
lib/libgmond.c
lib/protocol.x
libmetrics/libmetrics.h
libmetrics/aix/metrics.c
libmetrics/linux/metrics.c
libmetrics/tests/test-metrics.c
58. 58 Monitoring Systems and POWER5/6 LPARs with Ganglia Download statistics of http://perzl.org/ganglia/ (1/2)
59. 59 Monitoring Systems and POWER5/6 LPARs with Ganglia Download statistics of http://perzl.org/ganglia/ (2/2)
60. Best Practices
61. 61 Monitoring Systems and POWER5/6 LPARs with Ganglia Some things to consider before you start… Hostnames
To Ganglia a new hostname is a new machine
Has to resolve IP address so use DNS
IP addresses stable
Make sure you are not going to change IP addresses
Time and date
Make sure the timezone, time and date is consistent on all machines in a cluster
Use of NTP is recommended
So
These are normal on production machines
For prototype and test systems – get this right before starting Ganglia
Simple Ganglia How-To available for people setting up their first Ganglia system available at:
http://www.ibm.com/collaboration/wiki/display/WikiPtype/ganglia
62. 62 Monitoring Systems and POWER5/6 LPARs with Ganglia Ganglia data flow and what goes where…
63. 63 Monitoring Systems and POWER5/6 LPARs with Ganglia Best Practices Preferred setup
Ganglia sampling intervals
Ganglia default ports
Shared Ethernet statistics
Fibre Channel statistics
Enhanced web interface
64. 64 Monitoring Systems and POWER5/6 LPARs with Ganglia Best Practices – Preferred Setup Define each System p machine with all its LPARs as a separate cluster
Use Unicast for network communication
Define at least two LPARs per System p machine as gmond hosts for gmetad
One would be sufficient, however, two is better for high availibility reasons
Define those two LPARs in /etc/gmetad.conf as the information brokers for that machine
From gmetad: Don’t poll the gmond hosts more frequently than every 15 secs
Know upfront what time intervals to use for sampling (RRAs stanza in /etc/gmetad.conf)
See next slides
Use my extensions for
Ethernet adapters
Fibre Channel adapters
Web interface
65. 65 Monitoring Systems and POWER5/6 LPARs with Ganglia Best Practices – Ganglia Sampling Intervals (1/6) Important to know:
The sampling interval is defined in /etc/gmetad.conf.
The "RRAs" stanza is used to defined individual settings.
The sampling settings are global.
If no "RRAs" stanza is defined a default configuration is used.
For historic reasons all values are specified in intervals of 15 seconds.
66. 66 Monitoring Systems and POWER5/6 LPARs with Ganglia Best Practices – Ganglia Sampling Intervals (2/6) Example: Default settings in Ganglia
RRAs "RRA:AVERAGE:0.5:1:240" \ "RRA:AVERAGE:0.5:24:240" \ "RRA:AVERAGE:0.5:168:240" \ "RRA:AVERAGE:0.5:672:240" \ "RRA:AVERAGE:0.5:5760:370"
used for
Translation: display of
Take 240 samples at 1 × 15 seconds intervals hour
Take 240 samples at 24 × 15 seconds (= 6 minutes) intervals day
Take 240 samples at 168 × 15 seconds (= 42 minutes) intervals week
Take 240 samples at 672 × 15 seconds (= 168 minutes) intervals month
Take 370 samples at 5760 × 15 seconds (= 24 hours) intervals year
67. 67 Monitoring Systems and POWER5/6 LPARs with Ganglia Best Practices – Ganglia Sampling Intervals (3/6) Example: 1-minute sampling for one year
RRAs "RRA:AVERAGE:0.5:4:525600"
Translation:
Take 525600 samples at 4 × 15 seconds (= 1 minute) intervals
525600 = 60 (samples/hour) × 24 (hours) × 365 (days) × 1 (year)
68. 68 Monitoring Systems and POWER5/6 LPARs with Ganglia Best Practices – Ganglia Sampling Intervals (4/6) Example: 1-minute sampling for 6 months, 5-minute sampling for 2 years
RRAs "RRA:AVERAGE:0.5:4:259200" \ "RRA:AVERAGE:0.5:20:210240"
Translation:
Take 259200 samples at every 4 × 15 seconds (= 1 minute) intervals
259200 = 60 (samples/hour) × 24 (hours) × 30 (days) × 6 (months)
Take 210240 samples at every 20 × 15 seconds (= 5 minutes) intervals
210240 = 12 (samples/hour) × 24 (hours) × 365 (days) × 2 (years)
69. 69 Monitoring Systems and POWER5/6 LPARs with Ganglia Best Practices – Ganglia Sampling Intervals (5/6) Example: 15-second sampling for 1 day, 1-minute sampling for 2 months, 10-minute sampling for 1 year
RRAs "RRA:AVERAGE:0.5:1:5760" \ "RRA:AVERAGE:0.5:4:86400" \ "RRA:AVERAGE:0.5:40:52560"
Translation:
Take 5760 samples at every 1 × 15 seconds intervals
5760 = 4 (samples/minute) 60 (samples/hour) × 24 (hours)
Take 86400 samples at every 4 × 15 seconds (= 1 minute) intervals
86400 = 60 (samples/hour) × 24 (hours) × 30 (days) × 2 (months)
Take 52560 samples at every 40 × 15 seconds (= 10 minutes) intervals
52560 = 6 (samples/hour) × 24 (hours) × 365 (days) × 1 (year)
70. 70 Monitoring Systems and POWER5/6 LPARs with Ganglia Best Practices – Ganglia Sampling Intervals (6/6) Example: 1-minute sampling for 2 months, 5-minute sampling for 6 months, 15-minute sampling for 3 years
RRAs "RRA:AVERAGE:0.5:4:86400" \ "RRA:AVERAGE:0.5:20:51840" \ "RRA:AVERAGE:0.5:60:105120"
Translation:
Take 86400 samples at every 4 × 15 seconds (= 1 minute) intervals
86400 = 60 (samples/hour) × 24 (hours) × 30 (days) × 2 (months)
Take 210240 samples at every 20 × 15 seconds (= 5 minutes) intervals
51840 = 12 (samples/hour) × 24 (hours) × 30 (days) × 6 (month)
Take 105120 samples at every 60 × 15 seconds (= 15 minutes) intervals
105120 = 4 (samples/hour) × 24 (hours) × 365 (days) × 3 (years)
71. 71 Monitoring Systems and POWER5/6 LPARs with Ganglia Best Practices – Default Ports Ganglia by default uses the following ports:
8649The port gmond uses for
Sending to other gmonds via UDP (udp_send_channel in /etc/gmond.conf)
Receiving from other gmonds via UDP (udp_receive_channel in /etc/gmond.conf)
Sending an XML description of the state of the cluster (tcp_accept_channel in /etc/gmond.conf)
8651The port gmetad will answer requests for XML.
8652The port gmetad will answer queries for XML.This facility allows simple subtree and summation views of the XML tree.
72. 72 Monitoring Systems and POWER5/6 LPARs with Ganglia Best Practices – Shared Ethernet Statistics Question: How to monitor SEA statistics on the VIO server ?
The AIX libperfstat library seems not to report any statistics about Ethernet adapters if there are no interfaces defined on that adapter.
Only seldom interfaces are defined on SEAs.
The AIX command ‘entstat’ however provides these statistics.
Solution: Extension through gmetric via a shell script
Korn shell script ‘ent_adapter.sh’
Get it from http://www.perzl.org/ganglia/devicespecific.html
Graphs will appear immediately for that specific host
73. 73 Monitoring Systems and POWER5/6 LPARs with Ganglia Best Practices – Fibre Channel Statistics Question: How to monitor Fibre Channel statistics on the VIO server ?
The AIX libperfstat library seems not to report any statistics about Fibre Channel adapters if there are no disks attached to the adapter.
Tapes, for instance, would be left out.
The AIX command ‘fcstat’ however provides these statistics.
Solution: Extension through gmetric via a shell script
Korn shell script ‘fcs_adapter.sh’
Get it from http://www.perzl.org/ganglia/devicespecific.html
Graphs will appear immediately for that specific host
74. 74 Monitoring Systems and POWER5/6 LPARs with Ganglia Best Practices – Enhanced Web Interface (1/2) Get it from:http://www.perzl.org/ganglia/webinterface.html
Includes the following extensions:
Jscalendar patch of Timothy D. Witham for selecting custom intervals for graph display.
Custom Graph Patch of Alex Balk.
Zoomable graphs
Adapted from the UC Berkeley Grid live Demo website.
They implemented some nice graph zooming, i.e., when you click onto a single node metric graph or on one of the Grid overview graphs you get a nicely scaled bigger version of that graph.
75. 75 Monitoring Systems and POWER5/6 LPARs with Ganglia Best Practices – Enhanced Web Interface (2/2) Includes the following extensions (cont.)
cpu_used statistics for IBM POWER5/6 systems
A cpu_used overview graph for the cluster_view template which - when clicked onto - produces a detailed cpu_used statistics graph for the whole cluster
76. Future additions / plans
77. 77 Monitoring Systems and POWER5/6 LPARs with Ganglia Future additions / plans Try to incorporate the POWER5/6 additions into the Ganglia mainstream source code
Adapt the AIX and Linux on POWER metrics to POWER6
Provide a custom web interface tailored specifically to POWER5/6
Future updates on my personal web site:
Update RPMs to newest versions
Provide RPMs for Apache + PHP on my AIX Open Source web site (very soon)
78. Discussion
79. 79 Monitoring Systems and POWER5/6 LPARs with Ganglia Ganglia advantages Overview and major statistics available at one look with nice graphs
Available on many platforms
Widely used, many users and many different setups
Global view (cluster/grid) and local view (single node) available
Easily remote accessible through web interface
Highly customizable through config files
Shared Processor LPAR statistics available
Monitored data is stored in Round-Robin Databases (RRDs), i.e., this information could be easily passed on accounting too
Fine granular statistics possible
Different time interval views: hour, day, week, month, year or customizable
Open Source (source code and binary RPMs for AIX and Linux on POWER) available
Low risk and low cost
Easily extendible
through utility program gmetric
through adapting the Ganglia source code (as shown in this presentation)
80. 80 Monitoring Systems and POWER5/6 LPARs with Ganglia Ganglia disadvantages Not an official IBM tool
No official support available
Primarily a monitoring tool, not an accounting tool
‘‘Only‘‘ a monitoring (visualization) tool, no actions can be triggered
AIX setup from scratch requires some work (building all prerequisite software), although it is well documented
Normally not necessary by using my binary RPMs
81. Links
82. 82 Monitoring Systems and POWER5/6 LPARs with Ganglia Links (1/2) Main Ganglia website
http://ganglia.info/
Ganglia Documentation
http://ganglia.info/docs/
Ganglia Source Code Download
http://ganglia.sourceforge.net/downloads.php
Ganglia POWER5/6 extensions and ready-to-run binaries (RPM files) as well as source code
http://www.perzl.org/ganglia/
My personal AIX Open Source repository
http://www.perzl.org/aix/
83. 83 Monitoring Systems and POWER5/6 LPARs with Ganglia Links (2/2) Ganglia Usage at Wikipedia
http://ganglia.wikimedia.org/
RRDTool homepage
http://oss.oetiker.ch/rrdtool/
Ganglia How-To on IBM AIX wiki site (written by Nigel Griffiths)
http://www.ibm.com/collaboration/wiki/display/WikiPtype/ganglia
Open Source with AIX on IBM AIX wiki site (written by Nigel Griffiths)
http://www.ibm.com/collaboration/wiki/display/WikiPtype/aixopen
IBM AIX wiki site:
http://www.ibm.com/collaboration/wiki/display/WikiPtype/Home
IBM Linux on POWER wiki site:
http://www.ibm.com/collaboration/wiki/display/LinuxP/Home
84. 84 Monitoring Systems and POWER5/6 LPARs with Ganglia Questions ?
85. Backup Slides
86. Simple setup example
87. 87 Monitoring Systems and POWER5/6 LPARs with Ganglia 1 Simple gmond install (1/2) On all data nodes:
Install the gmond RPM file on each data node
rpm –Uvh ganglia-gmond-VVV.PPP.rpm
VVV.PPP is the version number and platform like:
3.0.5
aix5.1.ppc, aix5.3.ppc, suse.ppc64, redhat.ppc64
Edit the configuration file
/etc/gmond.conf
On Linux on POWER
need access to /proc/ppc64/lparcfg = root user only
So also set “setuid = no” depending on Linux distribution
88. 88 Monitoring Systems and POWER5/6 LPARs with Ganglia 1 Simple gmond install (2/2) Use the gmond control script located in /etc to start gmond:
/etc/init.d/gmond start
Linux (SUSE and Red Hat)
/etc/rc.d/init.d/gmond start
AIX
These scripts also automatically start gmond when booting the system
All options are:
start
stop
restart
status
89. 89 Monitoring Systems and POWER5/6 LPARs with Ganglia 2 Simple gmetad - prerequisites (1/3) Gmetad needs rrdtool to store the data and Apache2 + PHP to serve the web pages
For Linux rrdtool, Apache2 and PHP5 is part of every Linux distribution
For AIX you can resolve the prerequisites as follows:
rrdtool provided at http://www.perzl.org/ganglia/
libart_lgpl AIX Toolbox for Linux Applications
libpng AIX Toolbox for Linux Applications
freetype2 AIX Toolbox for Linux Applications
zlib AIX Toolbox for Linux Applications
Perl AIX Toolbox for Linux Applications
AIX Toolbox for Linux Applications
http://www.ibm.com/servers/aix/products/aixos/linux
90. 90 Monitoring Systems and POWER5/6 LPARs with Ganglia 2 Simple gmetad install (2/3) Install the gmetad RPM file on each node
rpm –Uvh ganglia-gmetad-VVV.PPP.rpm
VVV.PPP is the version number and platform like:
3.0.5 and
aix5.1.ppc, aix5.3.ppc, suse.ppc64, redhat.ppc64
Edit the configuration file /etc/gmetad.conf
data_source “mycluster" localhost
91. 91 Monitoring Systems and POWER5/6 LPARs with Ganglia 2 Start gmetad (3 of 3) Use the gmetad control script located in /etc to start gmetad:
/etc/init.d/gmetad start
Linux (SUSE and Red Hat)
/etc/rc.d/init.d/gmetad start
AIX
These scripts also automatically start gmetad when booting the system
All options are:
start
stop
restart
status
92. 92 Monitoring Systems and POWER5/6 LPARs with Ganglia 3 Ganglia Web Server front end setup (1/2) Could use existing web server but we need PHP support
On AIX – we have only used Apache2 and PHP5 (see next foil)
On Linux use version included with the distribution(Red Hat EL 4 & 5, SUSE SLES 9 & SLES 10)
Simple test of PHP if it works:
Create <web-server-directory>/phptest.php
Use browser to access this file
Should print out lots of interesting data
rpm –Uvh ganglia-web-3.0.5-1.noarch.rpm
“noarch” as this is PHP scripts only
may have to use the “--ignoreos” flag for installation
may have to move these files to your web server directory tree
depends on if you are running AIX or Linux
93. 93 Monitoring Systems and POWER5/6 LPARs with Ganglia For AIX you will not be able to find Apache2 and PHP5 with the required features:
AIX CD-ROM or update download site – nope
AIX repositories (Bull or UCLA) – nope
AIX Toolbox for Linux Applications – nope
WAS – nope
IBM HTTP Server (Apache with add-ons) – nope
Get Apache + PHP from my personal website instead!
http://www.perzl.org/ganglia/
http://www.perzl.org/aix/
If you have to recompile your own Apache + PHP
Using latest GNU GCC compiler and latest libraries - it is actually easy
Nigel Griffiths wrote the details of how to do this on the AIX 5L Wiki at
http://www.ibm.com/collaboration/wiki/display/WikiPtype/aixopen 3 Ganglia Web Server front end setup (AIX) (2/2)