
Apache Hadoop


Presentation Transcript


  1. Apache Hadoop Daniel Lust, Anthony Taliercio

  2. What is Apache Hadoop? • Allows applications to utilize thousands of nodes while exchanging thousands of terabytes of data to complete a task • Supports distributed applications under a free license • Used by many popular companies, such as Facebook, Twitter, eBay, IBM, Apple, Microsoft, Hewlett-Packard, and many others…

  3. Continued… • Written in Java • Scales well: works with thousands of nodes, or with just a few nodes and inexpensive hardware • Your average Hadoop cluster consists of two major parts: a single master node and multiple worker nodes • The master node is made up of four parts: the JobTracker, TaskTracker, NameNode, and DataNode • A worker node, also known as a slave node, can act as both a DataNode and a TaskTracker, or as just one of the two (a client configuration for such a cluster is sketched below)
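
A minimal sketch of how a Java client could point at the single master node described above, using the classic (pre-YARN) configuration keys; the host name "master" and the ports are assumptions, not values from the presentation.

    import org.apache.hadoop.conf.Configuration;

    public class ClusterConfig {
        // Builds a Configuration aimed at a cluster whose master runs both
        // the NameNode and the JobTracker (classic Hadoop 1.x layout).
        public static Configuration forCluster() {
            Configuration conf = new Configuration();
            conf.set("fs.default.name", "hdfs://master:9000"); // NameNode RPC address (assumed host/port)
            conf.set("mapred.job.tracker", "master:9001");     // JobTracker address (assumed host/port)
            return conf;
        }
    }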

  4. Overview Of Hadoop • Hadoop uses what is called HDFS, the Hadoop Distributed File System • HDFS takes files and splits them redundantly across the cluster • The redundancy helps eliminate possible data loss (a small write sketch follows below)
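
To make the storage model concrete, here is a hedged sketch of writing a file to HDFS through the Java FileSystem API; the file path and contents are illustrative assumptions.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsWrite {
        public static void main(String[] args) throws Exception {
            // Picks up the cluster address from the configuration on the
            // classpath, so the same code runs on one node or a thousand.
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // HDFS splits the file into blocks and replicates each block
            // (three copies by default) across DataNodes for redundancy.
            Path file = new Path("/books/moby-dick.txt"); // assumed path
            try (FSDataOutputStream out = fs.create(file)) {
                out.writeUTF("Call me Ishmael.");
            }
            fs.close();
        }
    }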

  5. MapReduce • Software written by Google to process massive amounts of unstructured data in parallel across a distributed cluster of processors

  6. MapReduce • Offers a clean abstraction for data analysis tasks, organizing the jobs issued over HDFS so that no job is unnecessarily repeated • If a task fails, it can be handed to a different node to complete (the word-count example below shows the map/reduce split)
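
As an illustration of the abstraction, here is a sketch of the canonical word-count map and reduce functions against the org.apache.hadoop.mapreduce API; this is the standard textbook example, not code taken from the presentation.

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class WordCount {
        // Map: emit (word, 1) for every token in a line of input.
        public static class TokenizerMapper
                extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(LongWritable key, Text value, Context ctx)
                    throws IOException, InterruptedException {
                StringTokenizer it = new StringTokenizer(value.toString());
                while (it.hasMoreTokens()) {
                    word.set(it.nextToken());
                    ctx.write(word, ONE); // framework groups these by word
                }
            }
        }

        // Reduce: sum the counts emitted for each word.
        public static class IntSumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) sum += v.get();
                ctx.write(key, new IntWritable(sum));
            }
        }
    }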

  7. Running Hadoop • First run of Hadoop on the master computer • Various processes are started, including: • TaskTracker • JobTracker • DataNode • SecondaryNameNode • NameNode • The master also connects through SSH to the slave computers to start a DataNode and TaskTracker on each

  8. Running Hadoop • Used Hadoop to do a word count on six different books • HDFS copied the books across the cluster, and a pre-written program ran the word count on them • Each node returned data, using the DataNode process to save its results • When a node failed, the job was reissued to another node (a driver for submitting such a job is sketched below)
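
For completeness, a hedged sketch of the driver that would submit a word-count job like the one described; it wires up the WordCount classes sketched earlier, and the /books input and /wc-output output paths are assumptions.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCountDriver {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCountDriver.class);
            job.setMapperClass(WordCount.TokenizerMapper.class);
            job.setCombinerClass(WordCount.IntSumReducer.class); // local pre-aggregation
            job.setReducerClass(WordCount.IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path("/books"));       // assumed input dir
            FileOutputFormat.setOutputPath(job, new Path("/wc-output")); // assumed output dir
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }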

  9. Example Output of Job Processes

  10. Word count Output

  11. Tested on 1-3 Nodes • 1 node: job completion 00:01:45 • 2 nodes: job completion 00:01:28 • 3 nodes: job completion 00:01:00

  12. Conclusion • Our guide covered everything you need to get started with Apache Hadoop • There are, however, many problems you may run into along the way • Troubleshooting was a large part of our project
