Hadoop tutorial

Big Data Hadoop Tutorial PDF for Beginners: a tour of Apache Hadoop, its components, flavors, and much more. This PDF tutorial covers the following topics: 1. What is Hadoop 2. Hadoop history 3. Why Hadoop 4. Hadoop nodes 5. Hadoop architecture 6. Hadoop data flow 7. Hadoop components (HDFS, MapReduce, YARN) 8. Hadoop daemons 9. Hadoop characteristics. To learn Hadoop and build a career in Big Data, contact info@data-flair.training, +91-7718877477, +91-9111133369, or visit https://data-flair.training/


PowerPoint Slideshow about 'Hadoop tutorial' - PritamPal



Contents

1. Hadoop Tutorial
2. What is Hadoop?
3. Why Hadoop?
4. Hadoop Architecture
5. Hadoop Components
5.1. HDFS – Storage Layer
5.2. MapReduce – Processing Layer
5.3. YARN – Resource Management Layer
6. Hadoop Daemons
7. How Hadoop Works
8. Hadoop Flavors
9. Hadoop Ecosystem Components
10. Conclusion


Hadoop Tutorial

1. Hadoop Tutorial

Apache Hadoop is an open-source, scalable, and fault-tolerant framework written in Java. It efficiently processes large volumes of data on a cluster of commodity hardware. Hadoop is not only a storage system but a platform for both storing and processing large data sets.

In this Hadoop tutorial, we will learn about the Hadoop architecture, the Hadoop daemons, and the different flavors of Hadoop. We will also cover the Hadoop components: HDFS, MapReduce, and YARN.

2. What is Hadoop?

Hadoop is an open-source tool from the Apache Software Foundation (ASF). Being open source means it is freely available and we can change its source code as our requirements demand. If certain functionality does not meet your needs, you can change it. Much of the Hadoop code has been written by contributors from Yahoo, IBM, Facebook, and Cloudera.

It provides an efficient framework for running jobs on multiple nodes of a cluster. A cluster is a group of systems connected via a LAN. Apache Hadoop processes data in a distributed way because it works on multiple machines simultaneously.

Hadoop was inspired by Google, which published papers about technologies it uses, such as the MapReduce programming model and its file system (GFS). Hadoop was originally written for the Nutch search engine project by Doug Cutting and his team. It very soon became a top-level project because of its huge popularity.

Apache Hadoop is an open-source framework written in Java. The basic Hadoop programming language is Java, but this does not mean you can code only in Java. You can also code in C, C++, Perl, Python, Ruby, and other languages, although Java is the better choice because it gives you lower-level control of the code.
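
One common route for non-Java code is Hadoop Streaming, which runs any program that reads records from standard input and writes tab-separated key/value lines to standard output. The tutorial text does not name this utility, so treat the following as a minimal local sketch under that assumption. The function names `mapper` and `reducer` and the sample lines are hypothetical, and the call to `sorted` only stands in for the sort the framework performs between the two phases.

```python
from itertools import groupby


def mapper(lines):
    """Emit one "word<TAB>1" record for every word in the input lines."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reducer(sorted_records):
    """Sum the counts per word. The input must already be sorted by key."""
    pairs = (record.split("\t", 1) for record in sorted_records)
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


# Hypothetical sample input; a real streaming job would read standard input.
sample_lines = ["Hadoop stores data", "Hadoop processes data"]
for record in reducer(sorted(mapper(sample_lines))):
    print(record)
```

In an actual streaming job the mapper and reducer would usually be two separate scripts, each reading its own standard input.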

Hadoop is designed to process huge volumes of data efficiently on a cluster of commodity hardware. Commodity hardware means low-end, inexpensive machines, which makes Hadoop very economical to run.

Hadoop can be set up on a single machine (pseudo-distributed mode), but it shows its real power on a cluster of machines. We can scale it to thousands of nodes on the fly, i.e., without any downtime, so there is no need to take the system down to add more nodes to the cluster. A separate guide covers Hadoop installation on a multi-node cluster.

Hadoop consists of three key parts:

Hadoop Distributed File System (HDFS) – The storage layer of Hadoop. HDFS is designed to be a highly reliable storage system.

MapReduce – The data processing layer of Hadoop. MapReduce is the distributed processing framework, which processes data in parallel across the cluster.

YARN – Yet Another Resource Negotiator, the resource management layer of Hadoop. YARN manages resources on the cluster.


https://data-flair.training/big-data-hadoop/


3. Why Hadoop?

Let us now understand why Hadoop is so popular and why it reportedly captured more than 90% of the big data market.

Apache Hadoop is not only a storage system but a platform for both data storage and processing. It is scalable (we can add more nodes on the fly) and fault tolerant (even if a node goes down, its data is processed by another node).

The following characteristics make Hadoop a unique platform:

Flexibility to store and mine any type of data, whether structured, semi-structured, or unstructured. It is not bound by a single schema.

Excels at processing data of a complex nature. Its scale-out architecture divides workloads across many nodes. In addition, its flexible file system eliminates ETL bottlenecks.

Scales economically because, as discussed, it can be deployed on commodity hardware. Its open-source nature also guards against vendor lock-in.

4. Hadoop Architecture

Now that we know what Apache Hadoop is, let us look at the Hadoop architecture in detail. Hadoop works in a master-slave fashion. There are a few master nodes and n slave nodes, where n can be in the thousands. The master manages, maintains, and monitors the slaves, while the slaves are the actual worker nodes. Because the master is the centerpiece of the Hadoop cluster, it should be deployed on good hardware, not just commodity hardware.

The master stores the metadata (data about data), while the slaves store the actual data, distributed across the cluster. The client connects to the master node to perform any task. In the rest of this Hadoop tutorial, we will discuss the different components of Hadoop in detail.


5. Hadoop Components

Apache Hadoop has three main components. In this Hadoop tutorial, you will learn what HDFS, MapReduce, and YARN are. Let us discuss them one by one:

5.1. HDFS – Storage Layer

HDFS, the Hadoop Distributed File System, provides storage for Hadoop in a distributed fashion.

In the Hadoop architecture, a daemon called the NameNode runs on the master node for HDFS, and a daemon called the DataNode runs on every slave. Hence the slaves are also called DataNodes. The NameNode stores metadata and manages the DataNodes, while the DataNodes store the data and do the actual work.

HDFS is a highly fault tolerant, distributed, reliable and scalable file system for data storage.

For more about the features of HDFS, consult the dedicated HDFS guide first, and then proceed with this Hadoop tutorial.

HDFS is designed to handle huge volumes of data. The expected file size is in the range of GBs to TBs. A file is split into blocks (128 MB by default), which are stored across multiple machines. Each block is replicated according to the replication factor, so HDFS can tolerate the failure of a node in the cluster.
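
The block and replica arithmetic can be sketched in a few lines. This is only an illustration: the 128 MB block size comes from the text, while the replication factor of 3, the round-robin placement policy, and all function and node names are assumptions made for this example. Real HDFS placement also takes racks into account.

```python
BLOCK_SIZE_MB = 128      # default block size stated in the text
REPLICATION_FACTOR = 3   # assumption: a commonly used default; it is configurable


def split_into_blocks(file_size_mb, block_size_mb=BLOCK_SIZE_MB):
    """Return the size in MB of each block; only the last one may be smaller."""
    full_blocks, remainder = divmod(file_size_mb, block_size_mb)
    return [block_size_mb] * full_blocks + ([remainder] if remainder else [])


def place_replicas(num_blocks, datanodes, replication=REPLICATION_FACTOR):
    """Assign every block to `replication` distinct nodes, round-robin (hypothetical policy)."""
    if replication > len(datanodes):
        raise ValueError("replication factor exceeds the number of datanodes")
    return {
        block_id: [datanodes[(block_id + i) % len(datanodes)] for i in range(replication)]
        for block_id in range(num_blocks)
    }


def all_blocks_readable(placement, failed_nodes):
    """True if every block still has at least one replica on a live node."""
    failed = set(failed_nodes)
    return all(any(node not in failed for node in nodes) for nodes in placement.values())


blocks = split_into_blocks(300)                      # [128, 128, 44]
placement = place_replicas(len(blocks), ["dn1", "dn2", "dn3", "dn4"])
print(blocks)
print(placement)
print(all_blocks_readable(placement, ["dn2"]))       # True: other replicas survive
```

With three replicas per block, losing a single node never removes the last copy of a block, which is the property the paragraph above relies on.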

5.2. MapReduce – Processing Layer

Now it is time to understand one of the most important pillars of Hadoop: MapReduce. MapReduce is a programming model designed to process large volumes of data in parallel by dividing the work into a set of independent tasks. MapReduce is the heart of Hadoop. It moves computation close to the data, because moving a huge volume of data would be very costly. It allows massive scalability across hundreds or thousands of servers in a Hadoop cluster.


Hence, MapReduce is a framework for distributed processing of huge data sets over a cluster of nodes. Because the data is already stored in a distributed manner in HDFS, MapReduce can process it in a distributed manner as well.
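
The programming model itself can be shown in a single process. The sketch below is not Hadoop's API: `map_reduce`, `parse_reading`, `hottest`, and the sample readings are hypothetical names and data chosen for illustration. It runs the map, shuffle, and reduce phases one after another in memory, whereas Hadoop runs the independent map and reduce tasks on many nodes.

```python
from collections import defaultdict


def map_reduce(records, map_fn, reduce_fn):
    """Run map, shuffle, and reduce sequentially in memory."""
    # Map phase: every record is processed independently of the others.
    intermediate = []
    for record in records:
        intermediate.extend(map_fn(record))

    # Shuffle phase: bring all values that share a key together.
    grouped = defaultdict(list)
    for key, value in intermediate:
        grouped[key].append(value)

    # Reduce phase: collapse the values of each key into one result.
    return {key: reduce_fn(key, values) for key, values in grouped.items()}


def parse_reading(line):
    """Map function: turn "year,temperature" into a (year, temperature) pair."""
    year, temperature = line.split(",")
    return [(year, int(temperature))]


def hottest(year, temperatures):
    """Reduce function: keep the highest temperature seen for a year."""
    return max(temperatures)


readings = ["1949,111", "1950,22", "1949,78", "1950,-11"]
print(map_reduce(readings, parse_reading, hottest))  # {'1949': 111, '1950': 22}
```

Because the map function looks at one record at a time and the reduce function at one key at a time, both phases can be split into independent tasks.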

5.3. YARN – Resource Management Layer

YARN (Yet Another Resource Negotiator) is the resource management layer of Hadoop. In a multi-node cluster, it becomes very complex to manage, allocate, and release resources (CPU, memory, disk). YARN manages these resources efficiently and allocates them on request from any application.

On the master node, the ResourceManager daemon runs for YARN; on every slave node, a NodeManager daemon runs.
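
The request-and-grant cycle between the two daemons can be modeled roughly as follows. This is a toy model, not YARN's API or scheduler: the class fields, the `allocate` and `release` methods, the first-fit policy, and the node names are all assumptions for illustration, and only memory and CPU cores are tracked.

```python
from dataclasses import dataclass


@dataclass
class NodeManager:
    """Free resources on one worker node (hypothetical fields)."""
    name: str
    free_memory_mb: int
    free_vcores: int


class ResourceManager:
    """Tracks worker nodes and grants resources on request (first-fit, illustrative)."""

    def __init__(self, nodes):
        self.nodes = list(nodes)

    def allocate(self, memory_mb, vcores):
        """Reserve resources on the first node that can satisfy the request.

        Returns the node name, or None if no node has enough free resources.
        """
        for node in self.nodes:
            if node.free_memory_mb >= memory_mb and node.free_vcores >= vcores:
                node.free_memory_mb -= memory_mb
                node.free_vcores -= vcores
                return node.name
        return None

    def release(self, node_name, memory_mb, vcores):
        """Return previously granted resources to the named node."""
        for node in self.nodes:
            if node.name == node_name:
                node.free_memory_mb += memory_mb
                node.free_vcores += vcores
                return
        raise KeyError(node_name)


rm = ResourceManager([NodeManager("nm1", 4096, 2), NodeManager("nm2", 8192, 4)])
print(rm.allocate(3072, 2))   # nm1
print(rm.allocate(3072, 1))   # nm2, because nm1 has no free cores left
print(rm.allocate(16384, 1))  # None, no node is large enough
```

Keeping the bookkeeping in one place is what lets many applications share the same cluster without overcommitting a node.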

A separate comparison covers the differences between two resource managers, YARN and Apache Mesos. The next topic in this Hadoop tutorial is an important one: the Hadoop daemons.

6. Hadoop Daemons

Daemons are processes that run in the background. Hadoop has four main daemons.

Hadoop Daemons:

NameNode – Runs on the master node for HDFS.

DataNode – Runs on the slave nodes for HDFS.

ResourceManager – Runs on the master node for YARN.

NodeManager – Runs on the slave nodes for YARN.

These four daemons must be running for Hadoop to be functional. Apart from these, there can be a secondary NameNode, a standby NameNode, a JobHistoryServer, etc.


7. How Hadoop Works

So far we have studied the Hadoop introduction and the Hadoop architecture in detail. Now let us summarize how Apache Hadoop works, step by step:

i) Input data is broken into blocks of 128 MB (by default), and the blocks are moved to different nodes.

ii) Once all the blocks of the file are stored on DataNodes, the user can process the data.

iii) The master then schedules the program (submitted by the user) on individual nodes.

iv) Once all the nodes have processed the data, the output is written back to HDFS.
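
Step iii is where the idea of moving computation to the data shows up: the master prefers to run each task on a node that already holds the block. The sketch below illustrates that idea with a hypothetical `schedule_tasks` function and made-up block and node names. It picks the least-loaded node among those holding a replica, which is a simplification for this example and not Hadoop's actual scheduling algorithm.

```python
def schedule_tasks(block_locations):
    """Assign one task per block to the least-loaded node that stores the block.

    `block_locations` maps a block id to the list of nodes holding a replica.
    Ties are broken by node name so that the result is deterministic.
    """
    load = {}
    assignment = {}
    for block_id, nodes in sorted(block_locations.items()):
        chosen = min(nodes, key=lambda node: (load.get(node, 0), node))
        assignment[block_id] = chosen
        load[chosen] = load.get(chosen, 0) + 1
    return assignment


locations = {
    "blk_0": ["dn1", "dn2"],
    "blk_1": ["dn1", "dn3"],
    "blk_2": ["dn2", "dn3"],
}
print(schedule_tasks(locations))
# {'blk_0': 'dn1', 'blk_1': 'dn3', 'blk_2': 'dn2'}
```

Every task in this example runs on a node that already stores its block, so no block has to cross the network before processing starts.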

8. Hadoop Flavors

This section of Hadoop Tutorial talks about the various flavors of Hadoop.

Apache – The vanilla flavor, since the actual code resides in the Apache repositories.

Hortonworks – A popular distribution in the industry.

Cloudera – The most popular distribution in the industry.

MapR – It has rewritten HDFS, and its file system is faster than the others.

IBM – Its proprietary distribution is known as BigInsights.

All flavors are almost the same, so if you know one, you can easily work with the others as well.


9. Hadoop Ecosystem Components

In this section, we will cover the Hadoop ecosystem components. The following components form the Hadoop ecosystem:

Hadoop HDFS – Distributed storage layer for Hadoop.

Hadoop YARN – Resource management layer introduced in Hadoop 2.x.

Hadoop MapReduce – Distributed processing layer for Hadoop.

HBase – A column-oriented database that runs on top of HDFS. It is a NoSQL database, so it does not understand structured queries. It suits sparse data sets well.

Hive – Apache Hive is a data warehousing infrastructure based on Hadoop. It enables easy data summarization using SQL queries.

Pig – A high-level scripting language. Pig enables writing complex data processing without Java programming.

Flume – A reliable system for efficiently collecting large amounts of data from many different sources in real time.

Sqoop – A tool designed to transport huge volumes of data between Hadoop and an RDBMS.

Oozie – A Java web application used to schedule Apache Hadoop jobs. It combines multiple jobs sequentially into one logical unit of work.

ZooKeeper – A centralized service for maintaining configuration information, naming, providing distributed synchronization, and providing group services.


Mahout – A library of scalable machine-learning algorithms, implemented on top of Apache Hadoop

and using the MapReduce paradigm.

Refer to the Hadoop Ecosystem Components tutorial for a detailed study of all the ecosystem components of Hadoop.

10. Conclusion

In conclusion, Apache Hadoop is one of the most popular and powerful big data tools. Hadoop stores and processes huge amounts of data in a distributed manner on a cluster of nodes. It provides a highly reliable storage layer (HDFS), a batch processing engine (MapReduce), and a resource management layer (YARN).

This conclusion is not the end but a foundation to learn Apache Hadoop. Here are the next steps after

you are through with this Apache Hadoop Tutorial:

1. Internal Working of Hadoop
2. Hadoop Distributed File System
3. MapReduce Tutorial
4. YARN Tutorial
5. Install Hadoop 2 on Ubuntu
