How to monitor the $H!T out of Hadoop

Presentation Transcript

  1. How to monitor the $H!T out of Hadoop: Developing a comprehensive, open approach to monitoring Hadoop clusters

  2. Relevant Hadoop Information • From 3 to 3,000 nodes • Hardware/software failures are “common” • Redundant components: DataNode, TaskTracker • Non-redundant components: NameNode, JobTracker, SecondaryNameNode • Fast-evolving technology (best practices?)

  3. Monitoring Software • Nagios • Red/Yellow/Green alerts, escalations • De facto standard, widely deployed • Text-based configuration • Web interface • Pluggable with shell scripts/external apps • Return code 0 = OK (1 = warning, 2 = critical)

  4. Cacti • Performance graphing system • Front end for RRDtool (RRD/RRA) • Slick web interface • Template system for graph types • Pluggable • SNMP input • Shell script / external program input

  5. hadoop-cacti-jtg • JMX fetching code with kick-off scripts • Cacti templates for Hadoop • Premade Nagios check scripts • Helper/batch/automation scripts • Apache License

  6. Hadoop JMX

  7. Sample Cluster, Part 1 • NameNode & SecondaryNameNode • Hardware RAID • 8 GB RAM • 1x quad-core CPU • Derby DB (Hive) on the SecondaryNameNode • JobTracker • 8 GB RAM • 1x quad-core CPU

  8. Sample Cluster, Part 2 • Slaves (hadoopdata1-XXXX) • JBOD: 8x 1 TB SATA disks • 16 GB RAM • 2x quad-core CPUs

  9. Prerequisites • Nagios (installed from the DAG RPMs) • Cacti (installed from several RPMs) • Liberal network access to the cluster

  10. Alerts & Escalations • X nodes * Y services = less sleep • Define a policy • Wake Me Up’s (SMS) • Don’t Wake Me Up’s (email) • Review (daily, weekly, monthly)

  11. Wake Me Up’s • NameNode • Disk full (big, big headache) • RAID array issues (failed disk) • JobTracker • SecondaryNameNode • Don’t wait until it’s too late to realize it isn’t working

  12. Don’t Wake Me Up’s • Or ‘wake someone else up’ • DataNode • Warning: currently a failed disk will take down the DataNode (see JIRA) • TaskTracker • Hardware • Bad disk (start the RMA) • Slaves are expendable (up to a point)

  13. Monitoring Battle Plan • Start With the Basics • Ping, Disk • Add Hadoop Specific Alarms • check_data_node • Add JMX Graphing • NameNodeOperations • Add JMX Based alarms • FilesTotal > 1,000,000 or LiveNodes < 50%

  14. The Basics: Nagios • Nagios (all nodes) • Host up (ping check) • Disk % full • Swap > 85% * Load-based alarms are somewhat useless: 389% CPU load is not necessarily a bad thing in Hadoopville

  15. The Basics: Cacti • Cacti (all nodes) • CPU (full CPU) • RAM/swap • Network • Disk usage

  16. Disk Utilization

  17. RAID Tools • hpacucli (HP’s array CLI) – not a Street Fighter move • Alerts on RAID events (NameNode) • Disk failed • Rebuilding • JBOD (DataNode) • Failed drive • Drive errors • Dell, Sun, vendor-specific tools

  18. Before you jump in • X nodes * Y checks = lots of work • About 3 nodes into the process … • Wait!!! I need some interns!!! • Solution: S.I.C.C.T., Semi-Intelligent Configuration-Cloning Tools • (I made that up) • (for this presentation)

  19. Nagios • Answers “IS IT RUNNING?” • Text-based configuration

  20. Cacti • Answers “HOW WELL IS IT RUNNING?” • Web-based configuration • php-cli tools

  21. Monitoring Battle Plan Thus Far • Start With the Basics • Ping, Disk !!!!!!Done!!!!!! • Add Hadoop Specific Alarms • check_data_node • Add JMX Graphing • NameNodeOperations • Add JMX Based alarms • FilesTotal > 1,000,000 or LiveNodes < 50%

  22. Add Hadoop Specific Alarms • Hadoop Components with a Web Interface • NameNode 50070 • JobTracker 50030 • TaskTracker 50060 • DataNode 50075 • check_http + regex = simple + effective

  23. nagios_check_commands.cfg

    define command {
        command_name    check_remote_namenode
        command_line    $USER1$/check_http -H $HOSTADDRESS$ -u http://$HOSTADDRESS$:$ARG1$/dfshealth.jsp -p $ARG1$ -r NameNode
    }

    define service {
        service_description    check_remote_namenode
        use                    generic-service
        host_name              hadoopname1
        check_command          check_remote_namenode!50070
    }

  • Catches component failure • (Future) Newer Hadoop will have XML status

  24. Monitoring Battle Plan • Start With the Basics • Ping, Disk (Done) • Add Hadoop Specific Alarms • check_data_node (Done) • Add JMX Graphing • NameNodeOperations • Add JMX Based alarms • FilesTotal > 1,000,000 or LiveNodes < 50%

  25. JMX Graphing • Enable JMX • Import Templates

  26. JMX Graphing

  27. JMX Graphing

  28. JMX Graphing

  29. Standard Java JMX
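
The slide above covered the standard Java JMX API; since the screenshot itself is not reproduced in this transcript, here is a minimal sketch of the same idea: connect to a Hadoop daemon over JMX and read one attribute. It assumes JMX remote access has been enabled for the NameNode JVM (for example via the com.sun.management.jmxremote.* properties in HADOOP_NAMENODE_OPTS); the host name, port, and MBean ObjectName below are placeholders, and exact Hadoop MBean names vary by version, so confirm them in JConsole first.

    import javax.management.MBeanServerConnection;
    import javax.management.ObjectName;
    import javax.management.remote.JMXConnector;
    import javax.management.remote.JMXConnectorFactory;
    import javax.management.remote.JMXServiceURL;

    public class JmxRead {
        public static void main(String[] args) throws Exception {
            // Placeholder host/port; use whatever JMX port you configured for the NameNode JVM.
            JMXServiceURL url = new JMXServiceURL(
                "service:jmx:rmi:///jndi/rmi://hadoopname1:8004/jmxrmi");
            JMXConnector connector = JMXConnectorFactory.connect(url, null);
            try {
                MBeanServerConnection mbs = connector.getMBeanServerConnection();
                // Example ObjectName only; verify the real name for your Hadoop version in JConsole.
                ObjectName fsState = new ObjectName(
                    "Hadoop:service=NameNode,name=FSNamesystemState");
                System.out.println("FilesTotal=" + mbs.getAttribute(fsState, "FilesTotal"));
            } finally {
                connector.close();
            }
        }
    }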

  30. Monitoring Battle Plan Thus Far • Start With the Basics !!!!!!Done!!!!! • Ping, Disk • Add Hadoop Specific Alarms !Done! • check_data_node • Add JMX Graphing !Done! • NameNodeOperations • Add JMX Based alarms • FilesTotal > 1,000,000 or LiveNodes < 50%

  31. Add JMX-Based Alarms • hadoop-cacti-jtg is flexible • Extend the fetch classes • Don’t call output() • Write your own check logic

  32. Quick JMX Base Walkthrough • URL, user, password, and JMX object are specified on the CLI • wantedVariables / wantedOperations are supplied by inheritance • fetch() and output() are provided by the base class
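
The actual hadoop-cacti-jtg classes are not shown in this transcript, so the following is only a hypothetical sketch of the pattern the slide describes: the base class owns fetch() and output(), subclasses declare wantedVariables by inheritance, and the connection details come in from the kick-off script's command line (authentication is omitted for brevity). Class names such as JmxFetchBase and NameNodeFetch are illustrative, not the project's real API.

    import java.util.LinkedHashMap;
    import java.util.Map;
    import javax.management.MBeanServerConnection;
    import javax.management.ObjectName;
    import javax.management.remote.JMXConnector;
    import javax.management.remote.JMXConnectorFactory;
    import javax.management.remote.JMXServiceURL;

    // Illustrative base class: fetch() and output() are provided, subclasses pick the attributes.
    abstract class JmxFetchBase {
        protected final Map<String, Object> results = new LinkedHashMap<String, Object>();

        // Subclasses declare which MBean attributes they want.
        protected abstract String[] wantedVariables();

        public void fetch(String serviceUrl, String objectName) throws Exception {
            JMXConnector connector =
                JMXConnectorFactory.connect(new JMXServiceURL(serviceUrl), null);
            try {
                MBeanServerConnection mbs = connector.getMBeanServerConnection();
                ObjectName name = new ObjectName(objectName);
                for (String attr : wantedVariables()) {
                    results.put(attr, mbs.getAttribute(name, attr));
                }
            } finally {
                connector.close();
            }
        }

        // Print a Cacti-friendly "name:value name:value" line on stdout.
        public void output() {
            StringBuilder line = new StringBuilder();
            for (Map.Entry<String, Object> e : results.entrySet()) {
                line.append(e.getKey()).append(':').append(e.getValue()).append(' ');
            }
            System.out.println(line.toString().trim());
        }
    }

    // "Extend for NameNode" (next slide): the subclass only lists the attributes to graph.
    class NameNodeFetch extends JmxFetchBase {
        protected String[] wantedVariables() {
            return new String[] { "FilesTotal", "BlocksTotal", "CapacityUsed" };
        }

        public static void main(String[] args) throws Exception {
            NameNodeFetch fetcher = new NameNodeFetch();
            fetcher.fetch(args[0], args[1]);   // JMX service URL and MBean name from the kick-off script
            fetcher.output();
        }
    }

Cacti's script/command data input expects exactly that kind of name:value line, which is how the graph templates stay decoupled from the fetch code.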

  33. Extend for NameNode

  34. Extend for Nagios
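
Slide 34's code is not reproduced here either, so below is a hedged, standalone sketch of what an "extend for Nagios" check might look like: fetch a few NameNode attributes over JMX, apply the battle-plan thresholds (FilesTotal > 1,000,000, live nodes < 50%), print one status line, and exit with a Nagios return code (0 = OK, 1 = WARNING, 2 = CRITICAL, 3 = UNKNOWN). In the real project you would extend its fetch classes and skip output(), as slide 31 says; the class and attribute names here (CheckNameNodeJmx, LiveNodes, DeadNodes) are placeholders.

    import javax.management.MBeanServerConnection;
    import javax.management.ObjectName;
    import javax.management.remote.JMXConnector;
    import javax.management.remote.JMXConnectorFactory;
    import javax.management.remote.JMXServiceURL;

    // Hypothetical Nagios check: read NameNode metrics over JMX, apply thresholds, exit 0/2/3.
    public class CheckNameNodeJmx {
        public static void main(String[] args) {
            int status;
            String message;
            try {
                // args[0]: JMX service URL, args[1]: MBean ObjectName (both passed by the Nagios command).
                JMXConnector connector =
                    JMXConnectorFactory.connect(new JMXServiceURL(args[0]), null);
                try {
                    MBeanServerConnection mbs = connector.getMBeanServerConnection();
                    ObjectName name = new ObjectName(args[1]);

                    // Attribute names are placeholders; confirm the real ones for your Hadoop version.
                    long filesTotal = Long.parseLong(mbs.getAttribute(name, "FilesTotal").toString());
                    long live = Long.parseLong(mbs.getAttribute(name, "LiveNodes").toString());
                    long dead = Long.parseLong(mbs.getAttribute(name, "DeadNodes").toString());

                    // Battle-plan thresholds: FilesTotal > 1,000,000 or fewer than 50% of nodes live.
                    if (filesTotal > 1000000L || live * 2 < live + dead) {
                        status = 2;   // CRITICAL
                        message = "CRITICAL - FilesTotal=" + filesTotal + " live=" + live + " dead=" + dead;
                    } else {
                        status = 0;   // OK
                        message = "OK - FilesTotal=" + filesTotal + " live=" + live;
                    }
                } finally {
                    connector.close();
                }
            } catch (Exception e) {
                status = 3;           // UNKNOWN: could not reach or read the MBean
                message = "UNKNOWN - " + e.getMessage();
            }
            System.out.println(message);
            System.exit(status);
        }
    }

A check like this would be wired into Nagios the same way as the check_http example on slide 23, just with a command_line that runs java with the right classpath instead of check_http.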

  35. Monitoring Battle Plan • Start With the Basics !DONE! • Ping, Disk • Add Hadoop Specific Alarms !DONE! • check_data_node • Add JMX Graphing !DONE! • NameNodeOperations • Add JMX Based alarms !DONE! • FilesTotal > 1,000,000 or LiveNodes < 50%

  36. Review • File System Growth • Size • Number of Files • Number of Blocks • Ratios • Utilization • CPU/Memory • Disk • Email (nightly) • FSCK • DFSADMIN

  37. The Future • JMX Coming to JobTracker and TaskTracker (0.21) • Collect and Graph Jobs Running • Collect and Graph Map / Reduce per node • Profile Specific Jobs in Cacti?