How to monitor the $H!T out of Hadoop

Developing a comprehensive open approach to monitoring Hadoop clusters

Relevant Hadoop Information • From 3 to 3,000 Nodes • Hardware/Software failures “common” • Redundant Components: DataNode, TaskTracker • Non-redundant Components: NameNode, JobTracker, SecondaryNameNode • Fast-Evolving Technology (Best Practices?)

Monitoring Software • Nagios • Red/Yellow/Green Alerts, Escalations • De facto Standard, widely deployed • Text-based configuration • Web Interface • Pluggable with shell scripts/external apps • Return 0 - OK
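The "Return 0 - OK" bullet refers to the Nagios plugin convention: a check prints one status line and reports its result through the exit code (0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN). A minimal sketch of such a plugin follows; the function name `check_threshold` and the command-line shape are illustrative assumptions, not part of Nagios.

```python
import sys

# Standard Nagios plugin exit codes.
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL", UNKNOWN: "UNKNOWN"}


def check_threshold(value, warn, crit):
    """Map a numeric reading to a Nagios status (higher readings are worse)."""
    if value is None:
        return UNKNOWN
    if value >= crit:
        return CRITICAL
    if value >= warn:
        return WARNING
    return OK


def main(argv):
    # Hypothetical usage: plugin.py <value> <warn> <crit>
    try:
        value, warn, crit = (float(arg) for arg in argv[1:4])
    except ValueError:
        print("UNKNOWN - usage: plugin.py <value> <warn> <crit>")
        return UNKNOWN
    status = check_threshold(value, warn, crit)
    print(f"{LABELS[status]} - value={value}")
    return status


if __name__ == "__main__":
    sys.exit(main(sys.argv))
```

Nagios only looks at the exit code and the first line of output, which is why any shell script or external program can serve as a check.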

Cacti • Performance Graphing System • RRD/RRA Front End • Slick Web Interface • Template System for Graph Types • Pluggable • SNMP input • Shell script/external program

hadoop-cacti-jtg • JMX Fetching Code w/ (kick off) scripts • Cacti templates For Hadoop • Premade Nagios Check Scripts • Helper/Batch/automation scripts • Apache License

Hadoop JMX

Sample Cluster P1 • NameNode & SecNameNode • Hardware RAID • 8 GB RAM • 1x QUAD CORE • DerbyDB (hive) on SecNameNode • JobTracker • 8GB RAM • 1x QUAD CORE

Sample Cluster P2 • Slave (hadoopdata1-XXXX) • JBOD 8x 1TB SATA Disk • RAM 16GB • 2x Quad Core

Prerequisites • Nagios (install) DAG RPMs • Cacti (install) Several RPMs • Liberal network access to the cluster

Alerts & Escalations • X Nodes * Y Services = Less Sleep • Define a policy • Wake Me Up’s (SMS) • Don’t Wake Me Up’s (EMAIL) • Review (Daily, Weekly, Monthly)

Wake Me Up’s • NameNode • Disk Full (Big Big Headache) • RAID Array Issues (failed disk) • JobTracker • SecNameNode • You do not realize it is not working until it is too late

Don’t Wake Me Up’s • Or ‘Wake someone else up’ • DataNode • Warning: currently a failed disk will take down the DataNode (see Jira) • TaskTracker • Hardware • Bad Disk (Start RMA) • Slaves are expendable (up to a point)

Monitoring Battle Plan • Start With the Basics • Ping, Disk • Add Hadoop Specific Alarms • check_data_node • Add JMX Graphing • NameNodeOperations • Add JMX Based alarms • FilesTotal > 1,000,000 or LiveNodes < 50%

The Basics Nagios • Nagios (All Nodes) • Host up (Ping check) • Disk % Full • SWAP > 85% • Load-based alarms are somewhat useless: 389% CPU load is not necessarily a bad thing in Hadoopville
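The disk and swap checks above can be sketched in a few lines. This is an illustration only, assuming a Linux host: swap usage is parsed from `/proc/meminfo`-style text, the 85% swap limit comes from the slide, and the 90% disk limit and the function names are assumptions.

```python
import shutil


def disk_percent_used(path="/"):
    """Percentage of the filesystem holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def swap_percent_used(meminfo_text):
    """Parse /proc/meminfo-style text and return swap usage as a percentage."""
    fields = {}
    for line in meminfo_text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts:
            fields[name.strip()] = int(parts[0])  # values are reported in kB
    total = fields.get("SwapTotal", 0)
    if total == 0:
        return 0.0  # no swap configured
    return 100.0 * (total - fields.get("SwapFree", 0)) / total


def basic_alarms(disk_pct, swap_pct, disk_limit=90.0, swap_limit=85.0):
    """Return a list of human-readable alarm messages (empty when healthy)."""
    alarms = []
    if disk_pct > disk_limit:
        alarms.append(f"disk {disk_pct:.0f}% full (limit {disk_limit:.0f}%)")
    if swap_pct > swap_limit:
        alarms.append(f"swap {swap_pct:.0f}% used (limit {swap_limit:.0f}%)")
    return alarms
```

On a real node the swap figure would come from `swap_percent_used(open("/proc/meminfo").read())`. CPU load is deliberately left out, for the reason the slide gives.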

The Basics Cacti • Cacti (All Nodes) • CPU (full CPU) • RAM/SWAP • Network • Disk Usage

Disk Utilization

RAID Tools • Hpacucli – not a Street Fighter move • Alerts on RAID events (NameNode) • Disk failed • Rebuilding • JBOD (DataNode) • Failed Drive • Drive Errors • Dell, SUN, Vendor Specific Tools

Before you jump in • X Nodes * Y Checks = Lots of work • About 3 Nodes into the process … • Wait!!! I need some interns!!! • Solution: S.I.C.C.T. (Semi-Intelligent Configuration Cloning Tools) • (I made that up) • (for this presentation)
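One way to picture a configuration-cloning tool is a small generator that stamps out a Nagios service definition for every host and check. The sketch below is not the tooling shipped with hadoop-cacti-jtg: the host prefix `hadoopdata`, the command names `check_remote_datanode` and `check_remote_tasktracker`, and the helper functions are all hypothetical.

```python
SERVICE_TEMPLATE = """define service {{
    service_description  {description}
    use                  generic-service
    host_name            {host}
    check_command        {command}
}}
"""


def slave_hostnames(prefix, count, start=1):
    """Build names such as hadoopdata1, hadoopdata2, ... (naming is assumed)."""
    return [f"{prefix}{i}" for i in range(start, start + count)]


def render_services(hosts, checks):
    """Render one Nagios service block per (host, check) pair."""
    blocks = []
    for host in hosts:
        for description, command in checks:
            blocks.append(
                SERVICE_TEMPLATE.format(
                    description=description, host=host, command=command
                )
            )
    return "\n".join(blocks)


if __name__ == "__main__":
    # Hypothetical check commands, using the web UI ports of each daemon.
    checks = [
        ("check_remote_datanode", "check_remote_datanode!50075"),
        ("check_remote_tasktracker", "check_remote_tasktracker!50060"),
    ]
    print(render_services(slave_hostnames("hadoopdata", 3), checks))
```

Three hosts and two checks already yield six service blocks, which is the "X Nodes * Y Checks" problem in miniature.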

Nagios • Answers “IS IT RUNNING?” • Text based Configuration

Cacti • Answers “HOW WELL IS IT RUNNING?” • Web Based configuration • php-cli tools

Monitoring Battle Plan Thus Far • Start With the Basics • Ping, Disk !!!!!!Done!!!!!! • Add Hadoop Specific Alarms • check_data_node • Add JMX Graphing • NameNodeOperations • Add JMX Based alarms • FilesTotal > 1,000,000 or LiveNodes < 50%

Add Hadoop Specific Alarms • Hadoop Components with a Web Interface • NameNode 50070 • JobTracker 50030 • TaskTracker 50060 • DataNode 50075 • check_http + regex = simple + effective

nagios_check_commands.cfg

define command {
    command_name  check_remote_namenode
    command_line  $USER1$/check_http -H $HOSTADDRESS$ -u http://$HOSTADDRESS$:$ARG1$/dfshealth.jsp -p $ARG1$ -r NameNode
}

define service {
    service_description  check_remote_namenode
    use                  generic-service
    host_name            hadoopname1
    check_command        check_remote_namenode!50070
}

• Component Failure • (Future) Newer Hadoop will have XML status
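The "check_http + regex" idea is simply: fetch the daemon's status page and confirm that an expected string appears in it. The following sketch shows the same logic in Python for illustration. It is not a replacement for the `check_http` plugin, and the function names are assumptions.

```python
import http.client
import re
import urllib.request


def page_matches(body, pattern):
    """True when the regular expression is found anywhere in the page body."""
    return re.search(pattern, body) is not None


def check_web_ui(host, port, path, pattern, timeout=10):
    """Return (nagios_exit_code, message) for a daemon's web interface."""
    url = f"http://{host}:{port}{path}"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            body = response.read().decode("utf-8", errors="replace")
    except (OSError, http.client.HTTPException) as exc:
        # Connection refused, timeout, HTTP error status, malformed reply...
        return 2, f"CRITICAL - {url} unreachable: {exc}"
    if page_matches(body, pattern):
        return 0, f"OK - {url} matched /{pattern}/"
    return 2, f"CRITICAL - {url} did not match /{pattern}/"


if __name__ == "__main__":
    # Host name and port taken from the service definition above.
    code, message = check_web_ui("hadoopname1", 50070, "/dfshealth.jsp", "NameNode")
    print(message)
    raise SystemExit(code)
```

Matching on page content rather than only on the HTTP status catches the case where something answers on the port but is not the expected daemon.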

Monitoring Battle Plan • Start With the Basics • Ping, Disk (Done) • Add Hadoop Specific Alarms • check_data_node (Done) • Add JMX Graphing • NameNodeOperations • Add JMX Based alarms • FilesTotal > 1,000,000 or LiveNodes < 50%

JMX Graphing • Enable JMX • Import Templates

JMX Graphing

Standard Java JMX

Monitoring Battle Plan Thus Far • Start With the Basics !!!!!!Done!!!!! • Ping, Disk • Add Hadoop Specific Alarms !Done! • check_data_node • Add JMX Graphing !Done! • NameNodeOperations • Add JMX Based alarms • FilesTotal > 1,000,000 or LiveNodes < 50%

Add JMX based Alarms • hadoop-cacti-jtg is flexible • extend fetch classes • Don’t call output() • Write your own check logic

Quick JMX Base Walkthrough • url, user, pass, object specified from CLI • wantedVariables, wantedOperations by inheritance • fetch() output() provided
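The walkthrough describes a small inheritance pattern: a base class holds the connection details and provides fetch() and output(), while subclasses only declare which attributes they want. The Python sketch below mirrors that shape for illustration only. It is not the hadoop-cacti-jtg code (which is Java), every class and attribute name other than `FilesTotal` is hypothetical, and the JMX read is replaced by a stand-in method. The space-separated `name:value` output is the form Cacti script data sources commonly consume.

```python
class JMXBase:
    """Hypothetical analogue of a JMX fetch base class."""

    wanted_variables = []  # subclasses list the attributes they want

    def __init__(self, url, user, password, object_name):
        self.url = url
        self.user = user
        self.password = password
        self.object_name = object_name
        self.values = {}

    def read_attribute(self, name):
        """Stand-in for a real JMX attribute read; subclasses supply it."""
        raise NotImplementedError

    def fetch(self):
        """Read every wanted attribute and remember the results."""
        self.values = {
            name: self.read_attribute(name) for name in self.wanted_variables
        }
        return self.values

    def output(self):
        """Format fetched values as space-separated name:value pairs."""
        return " ".join(f"{name}:{value}" for name, value in self.values.items())


class NameNodeStatus(JMXBase):
    # Only the wanted list changes; fetch() and output() are inherited.
    wanted_variables = ["FilesTotal", "BlocksTotal"]


class CannedNameNodeStatus(NameNodeStatus):
    """Demo subclass returning canned numbers instead of talking to JMX."""

    canned = {"FilesTotal": 1200, "BlocksTotal": 3400}

    def read_attribute(self, name):
        return self.canned[name]


if __name__ == "__main__":
    status = CannedNameNodeStatus("placeholder-url", "user", "pass", "placeholder-object")
    status.fetch()
    print(status.output())
```

A check that needs its own logic would call fetch(), inspect `values`, and skip output() entirely.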

Extend for NameNode

Extend for Nagios
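The alarm logic named in the battle plan (FilesTotal > 1,000,000 or LiveNodes < 50%) reduces to a small function once the values have been fetched. The sketch below assumes the numbers are already in hand; the function name and parameters are illustrative, and treating both conditions as CRITICAL is an assumption rather than something the slides specify.

```python
# Nagios plugin exit codes used by this check.
OK, CRITICAL, UNKNOWN = 0, 2, 3


def namenode_alarm(files_total, live_nodes, total_nodes,
                   max_files=1_000_000, min_live_fraction=0.5):
    """Return (exit_code, message) for the two NameNode alarm conditions."""
    if total_nodes <= 0:
        return UNKNOWN, "UNKNOWN - no DataNodes configured"
    problems = []
    if files_total > max_files:
        problems.append(f"FilesTotal {files_total} > {max_files}")
    if live_nodes / total_nodes < min_live_fraction:
        problems.append(
            f"LiveNodes {live_nodes}/{total_nodes} below {min_live_fraction:.0%}"
        )
    if problems:
        return CRITICAL, "CRITICAL - " + "; ".join(problems)
    return OK, f"OK - FilesTotal {files_total}, LiveNodes {live_nodes}/{total_nodes}"
```

A wrapper script would print the message and exit with the returned code, which is all Nagios needs.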

Monitoring Battle Plan • Start With the Basics !DONE! • Ping, Disk • Add Hadoop Specific Alarms !DONE! • check_data_node • Add JMX Graphing !DONE! • NameNodeOperations • Add JMX Based alarms !DONE! • FilesTotal > 1,000,000 or LiveNodes < 50%

Review • File System Growth • Size • Number of Files • Number of Blocks • Ratios • Utilization • CPU/Memory • Disk • Email (nightly) • FSCK • DFSADMIN
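The ratios worth reviewing can be derived from three numbers: file count, block count, and bytes used. The sketch below shows one possible set of ratios and a simple growth calculation between two snapshots; the function names and the choice of ratios are assumptions, and the figures in the usage lines are made up for illustration.

```python
def review_ratios(files, blocks, used_bytes):
    """Derive simple file system ratios from raw counters."""
    return {
        "blocks_per_file": blocks / files if files else 0.0,
        "bytes_per_file": used_bytes / files if files else 0.0,
        "bytes_per_block": used_bytes / blocks if blocks else 0.0,
    }


def percent_growth(previous, current):
    """Percentage change from one snapshot to the next."""
    if previous == 0:
        return 0.0
    return 100.0 * (current - previous) / previous


if __name__ == "__main__":
    # Made-up snapshot values, for illustration only.
    print(review_ratios(files=1000, blocks=1500, used_bytes=3_000_000))
    print(percent_growth(previous=100, current=125))
```

A low bytes-per-file figure is a hint that small files are accumulating, which ties back to the FilesTotal alarm.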

The Future • JMX Coming to JobTracker and TaskTracker (0.21) • Collect and Graph Jobs Running • Collect and Graph Map / Reduce per node • Profile Specific Jobs in Cacti?
