
Apache Hadoop


Presentation Transcript


  1. Apache Hadoop Daniel Lust, Anthony Taliercio

  2. What is Apache Hadoop? • Allows applications to utilize thousands of nodes while exchanging thousands of terabytes of data to complete a task • Supports distributed applications under a free license • Used by many popular companies, such as Facebook, Twitter, eBay, IBM, Apple, Microsoft, Hewlett-Packard, and many others…

  3. Continued… • Written in Java • Scales well: works with thousands of nodes, or with just a few nodes and inexpensive hardware • Your average Hadoop cluster consists of two major parts: a single master node and multiple worker nodes • The master node is made up of four parts: the JobTracker, TaskTracker, NameNode, and DataNode • A worker node, also known as a slave node, can act as both a DataNode and a TaskTracker, or as just one of the two (a client configuration for such a cluster is sketched below)
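
A minimal sketch of how a Java client could point at the single master node described above, using the classic (pre-YARN) configuration keys; the host name "master" and the ports are assumptions, not values from the presentation.

    import org.apache.hadoop.conf.Configuration;

    public class ClusterConfig {
        // Builds a Configuration aimed at a cluster whose master runs both
        // the NameNode and the JobTracker (classic Hadoop 1.x layout).
        public static Configuration forCluster() {
            Configuration conf = new Configuration();
            conf.set("fs.default.name", "hdfs://master:9000"); // NameNode RPC address (assumed host/port)
            conf.set("mapred.job.tracker", "master:9001");     // JobTracker address (assumed host/port)
            return conf;
        }
    }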

  4. Overview Of Hadoop • Hadoop uses what is called HDFS, the Hadoop Distributed File System • HDFS takes files and splits them redundantly across the cluster • The redundancy helps eliminate possible data loss (a small write sketch follows below)
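
To make the storage model concrete, here is a hedged sketch of writing a file to HDFS through the Java FileSystem API; the file path and contents are illustrative assumptions.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsWrite {
        public static void main(String[] args) throws Exception {
            // Picks up the cluster address from the configuration on the
            // classpath, so the same code runs on one node or a thousand.
            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // HDFS splits the file into blocks and replicates each block
            // (three copies by default) across DataNodes for redundancy.
            Path file = new Path("/books/moby-dick.txt"); // assumed path
            try (FSDataOutputStream out = fs.create(file)) {
                out.writeUTF("Call me Ishmael.");
            }
            fs.close();
        }
    }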

  5. MapReduce • Software written by Google to process massive amounts of unstructured data in parallel across a distributed cluster of processors

  6. MapReduce • Offers a clean abstraction for data analysis tasks, organizing the jobs issued over HDFS so that no job is unnecessarily repeated • If a task fails, it can be handed to a different node to complete (the word-count example below shows the map/reduce split)
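
As an illustration of the abstraction, here is a sketch of the canonical word-count map and reduce functions against the org.apache.hadoop.mapreduce API; this is the standard textbook example, not code taken from the presentation.

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class WordCount {
        // Map: emit (word, 1) for every token in a line of input.
        public static class TokenizerMapper
                extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(LongWritable key, Text value, Context ctx)
                    throws IOException, InterruptedException {
                StringTokenizer it = new StringTokenizer(value.toString());
                while (it.hasMoreTokens()) {
                    word.set(it.nextToken());
                    ctx.write(word, ONE); // framework groups these by word
                }
            }
        }

        // Reduce: sum the counts emitted for each word.
        public static class IntSumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text key, Iterable<IntWritable> values, Context ctx)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable v : values) sum += v.get();
                ctx.write(key, new IntWritable(sum));
            }
        }
    }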

  7. Running Hadoop • First run of Hadoop on the master computer • Various processes are started, including: • TaskTracker • JobTracker • DataNode • SecondaryNameNode • NameNode • The master also connects through SSH to the slave computers to start a DataNode and TaskTracker on each

  8. Running Hadoop • Used Hadoop to do a word count on six different books • HDFS copied the books across the cluster, and a pre-written program ran the word count on them • Each node returned data, using the DataNode process to save its results • When a node failed, the job was reissued to another node (a driver for submitting such a job is sketched below)
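
For completeness, a hedged sketch of the driver that would submit a word-count job like the one described; it wires up the WordCount classes sketched earlier, and the /books input and /wc-output output paths are assumptions.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCountDriver {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCountDriver.class);
            job.setMapperClass(WordCount.TokenizerMapper.class);
            job.setCombinerClass(WordCount.IntSumReducer.class); // local pre-aggregation
            job.setReducerClass(WordCount.IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path("/books"));       // assumed input dir
            FileOutputFormat.setOutputPath(job, new Path("/wc-output")); // assumed output dir
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }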

  9. Example Output of Job Processes

  10. Word count Output

  11. Tested on 1-3 Nodes • 1 node: job completion 00:01:45 • 2 nodes: job completion 00:01:28 • 3 nodes: job completion 00:01:00

  12. Conclusion • Our guide covered everything you need to get started with Apache Hadoop • There are, however, many problems you may run into along the way • Troubleshooting was a large part of our project
