
Hadoop Tutorial | Big Data Hadoop Tutorial For Beginners | Hadoop Certification Training | Edureka

This Edureka "Hadoop Tutorial" (Hadoop Blog series: https://goo.gl/LFesy8) will help you solve Big Data use cases just like a data analyst. You will learn the concepts of both Hadoop and Spark, along with K-Means clustering and Zeppelin for visualizing your data.

Topics covered in this tutorial:
1. Big Data Use Cases - US Election & Instant Cabs
2. Solution strategy of the use cases
3. Hadoop & Spark Introduction
4. Hadoop Master/Slave Architecture
5. Hadoop Core Components
6. HDFS Data Blocks
7. HDFS Read/Write Mechanism
8. YARN Components
9. Spark Components
10. Spark Architecture
11. K-Means and Zeppelin
12. Implementing the solution of the use cases using Hadoop, Spark and other big data tools




Presentation Transcript


  1. Hadoop Tutorial

  2. Big Data Use Cases

  3. Big Data Use-Cases: 1. US Primary Election Analysis 2. Market Analysis for US Cab Start-up

  4. US Primary Election Analysis

  5. US Election STEP 1: Primary & Caucuses

  6. US Election STEP 1: Primary & Caucuses STEP 2: National Conventions

  7. US Election STEP 1: Primary & Caucuses STEP 2: National Conventions STEP 3: General Elections

  8. US Election STEP 1: Primary & Caucuses STEP 2: National Conventions STEP 3: General Elections STEP 4: Electoral College

  9. US Primary Election PROBLEM STATEMENT: In the US Primary Election 2016, Hillary Clinton was nominated over Bernie Sanders from the Democratic Party and, on the other hand, Donald Trump was nominated from the Republican Party to contest for the presidential position. As an analyst, you have been tasked to understand the different factors that led to the wins of Hillary Clinton and Donald Trump in the primary elections, based on demographic features, in order to plan their next initiatives and campaigns.

  10. US Primary Election Dataset: Now, as a data analyst, you have 2 datasets available: 1. US Primary Election Data Set 2. US Demographic Features (County-wise) Data Set

  11. US Primary Election Dataset
      state: list of US states
      state_abbreviation: abbreviation of each US state
      county: list of counties in each US state
      fips: FIPS county code, a Federal Information Processing Standards (FIPS) code that uniquely identifies each county
      party: party in the US primaries (Republican or Democrat)
      candidate: candidate in the US primary election from each party
      votes: number of votes gained by a candidate
      fraction_votes: total number of votes gained by a candidate / total votes gained by the party
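  A minimal Scala/Spark sketch of loading this dataset; the HDFS path and file name are assumptions, while the column names are the fields listed above:

      import org.apache.spark.sql.SparkSession

      val spark = SparkSession.builder().appName("USPrimaryElection").getOrCreate()

      // Read the primary-results CSV from HDFS; the header row supplies the column names
      val election = spark.read
        .option("header", "true")
        .option("inferSchema", "true")      // infer votes / fraction_votes as numeric
        .csv("hdfs:///user/edureka/primary_results.csv")   // assumed path

      election.printSchema()   // state, state_abbreviation, county, fips, party, candidate, votes, fraction_votes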

  12. US County Demographic Features Dataset DETAILS:
      Population, 2014 estimate
      Population, 2010 (April 1) estimates base
      Population, percent change - April 1, 2010 to July 1, 2014
      Population, 2010
      Persons under 5 years, percent, 2014
      Persons under 18 years, percent, 2014
      Persons 65 years and over, percent, 2014
      Female persons, percent, 2014
      White alone, percent, 2014
      …

  13. US Election Solution Strategy - Step 1: US Primary Election Dataset

  14. US Election Solution Strategy - Step 2: Storing Data in HDFS
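  A small sketch of this step, assuming the raw CSV sits on the local disk of an edge node. On the command line this is simply hdfs dfs -put; the same operation through the Hadoop FileSystem API from Scala (paths are placeholders) looks like this:

      import org.apache.hadoop.conf.Configuration
      import org.apache.hadoop.fs.{FileSystem, Path}

      // Equivalent of: hdfs dfs -put primary_results.csv /user/edureka/
      val fs = FileSystem.get(new Configuration())    // picks up core-site.xml / hdfs-site.xml
      fs.copyFromLocalFile(
        new Path("primary_results.csv"),              // local file (placeholder)
        new Path("/user/edureka/primary_results.csv") // HDFS destination (placeholder)
      )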

  15. US Election Solution Strategy - Step 3: Processing Data Using Spark Components

  16. US Election Solution Strategy - Step 4: Transforming Data Using Spark SQL
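  A hedged sketch of the Spark SQL step, reusing the election DataFrame from the earlier sketch. Filtering on fraction_votes > 0.5 to pick county winners is an illustrative query, not the exact transformation from the video:

      // Register the DataFrame as a temporary view so it can be queried with SQL
      election.createOrReplaceTempView("primary")

      // Counties where a Democratic candidate took more than half of the party's votes
      val demWinners = spark.sql("""
        SELECT state, county, fips, candidate, votes, fraction_votes
        FROM primary
        WHERE party = 'Democrat' AND fraction_votes > 0.5
      """)
      demWinners.show(10)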

  17. US Election Solution Strategy - Step 5: Clustering Data Using Spark MLlib (K-Means)
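  A minimal Spark MLlib (ml package) K-Means sketch. countyFeatures and the demographic column names are hypothetical; in practice they would come from joining the election results with the county demographics dataset on fips:

      import org.apache.spark.ml.feature.VectorAssembler
      import org.apache.spark.ml.clustering.KMeans

      // Assemble the numeric columns into a single feature vector
      val assembler = new VectorAssembler()
        .setInputCols(Array("fraction_votes", "pct_white", "pct_under_18"))  // hypothetical column names
        .setOutputCol("features")
      val featureDF = assembler.transform(countyFeatures)                    // hypothetical joined DataFrame

      // Cluster the counties; k = 4 is an arbitrary choice for the sketch
      val kmeans = new KMeans().setK(4).setSeed(1L).setFeaturesCol("features")
      val model = kmeans.fit(featureDF)
      val clustered = model.transform(featureDF)   // adds a "prediction" (cluster id) column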

  18. US Election Solution Strategy - Step 6: Visualizing the Result Using Zeppelin
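  In a Zeppelin notebook the clustered DataFrame can be handed straight to the built-in charts; a small sketch, assuming the clustered DataFrame from the previous step:

      // ZeppelinContext (z) renders a DataFrame with table/bar/scatter toggles in the UI
      z.show(clustered)

      // Alternatively, register a temp view and plot it from a %sql paragraph:
      clustered.createOrReplaceTempView("county_clusters")
      // %sql SELECT prediction, count(*) FROM county_clusters GROUP BY prediction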

  19. US Election Solution Strategy (all steps): 1. US Primary Election Dataset 2. Storing Data in HDFS 3. Processing Data Using Spark Components 4. Transforming Data Using Spark SQL 5. Clustering Data Using Spark MLlib (K-Means) 6. Visualizing the Result Using Zeppelin

  20. Visualization of Result

  21. Market Analysis for US Cab Start-Ups

  22. Market Analysis for US Cab Start-Ups PROBLEM STATEMENT: A US cab service start-up wants to meet the demands in an optimum manner and maximize its profit. Thus, they have hired you as a data analyst to interpret the available Uber data set and find out the beehive customer pick-up points & peak hours, so that the demand can be met in a profitable manner.

  23. Uber Dataset • Date/Time – Pickup Date & Time • Lat – Latitude of Pickup • Lon – Longitude of Pickup • Base – TLC Base Code

  24. Market Analysis for US Cab Start-Ups Solution Strategy - Step 1: Uber Pick-Up Locations Dataset

  25. Market Analysis for US Cab Start-Ups Solution Strategy - Step 2: Storing Data in HDFS

  26. Market Analysis for US Cab Start-Ups Solution Strategy - Step 3: Transforming the Dataset
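  A minimal sketch of this transformation, assuming the Uber CSV columns listed above (Date/Time, Lat, Lon, Base) and a timestamp format like 4/30/2014 23:22:00; the HDFS path and the format string are assumptions:

      import org.apache.spark.sql.functions.{to_timestamp, hour, col}
      import org.apache.spark.ml.feature.VectorAssembler

      val uber = spark.read
        .option("header", "true").option("inferSchema", "true")
        .csv("hdfs:///user/edureka/uber.csv")                    // assumed path

      // Parse the pickup timestamp and pull out the hour (for the peak-hour analysis)
      val withHour = uber
        .withColumn("pickup_ts", to_timestamp(col("Date/Time"), "M/d/yyyy HH:mm:ss"))
        .withColumn("pickup_hour", hour(col("pickup_ts")))

      // Assemble latitude/longitude into the feature vector that K-Means will cluster on
      val uberFeatures = new VectorAssembler()
        .setInputCols(Array("Lat", "Lon"))
        .setOutputCol("features")
        .transform(withHour)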

  27. Market Analysis for US Cab Start-Ups Solution Strategy - Step 4: K-Means Clustering on Latitude & Longitude (Predictions)

  28. Market Analysis for US Cab Start-Ups Solution Strategy (all steps): 1. Uber Pick-Up Locations Dataset 2. Storing Data in HDFS 3. Transforming the Dataset 4. K-Means Clustering on Latitude & Longitude (Predictions)

  29. Let Us Know What It Takes…

  30. Fundamentals Road Map: Introduction to Hadoop & Spark, HDFS (Hadoop Storage), YARN (Hadoop Processing), Apache Spark, K-Means & Zeppelin, Solution of Use-Cases

  31. Introduction to Hadoop & Spark

  32. Introduction to Hadoop & Spark
      Hadoop: Hadoop is a framework that allows you to store and process large data sets in a parallel and distributed fashion.
      ❖ Hadoop has two core components:
      ▪ HDFS: allows you to dump any kind of data across the cluster
      ▪ YARN: allows parallel processing of the data stored in HDFS
      Spark: Apache Spark is an open-source cluster-computing framework for real-time processing.
      ❖ Provides an interface for programming entire clusters with implicit data parallelism and fault tolerance
      ❖ Built on top of YARN and extends the YARN model to efficiently use more types of computations

  33. Spark Complementing Hadoop

  34. Spark & Hadoop Challenges Addressed: 1. Faster Analytics: Spark processes data up to 100 times faster than MapReduce (for in-memory workloads) 2. Cost Optimization: Spark applications can run on YARN, leveraging the existing Hadoop cluster 3. Avoid Duplication: Apache Spark can use HDFS as its storage. Combining Spark's strengths, i.e. high processing speed, advanced analytics and multiple integration support, with Hadoop's low-cost operation on commodity hardware gives the best results.
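  A minimal sketch of running the same Spark code on YARN. The jar name is a placeholder; in practice the application is usually launched with spark-submit rather than by hard-coding the master in code:

      // Typical launch command (placeholder jar name):
      //   spark-submit --master yarn --deploy-mode cluster election-analysis.jar
      // The same can be expressed when building the session yourself:
      import org.apache.spark.sql.SparkSession

      val sparkOnYarn = SparkSession.builder()
        .appName("ElectionAnalysisOnYarn")
        .master("yarn")              // resources are negotiated with the YARN ResourceManager
        .getOrCreate()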

  35. Big Data Use-Cases

  36. Big Data Use-Cases

  37. Big Data Use-Cases

  38. Big Data Use-Cases Solution Architecture: Big Data is stored on HDFS and processed through the YARN framework; solution options (tools used for processing) include MapReduce, Apache Hive, Apache Spark and Kafka.

  39. Fundamentals Road Map: HDFS (Hadoop Storage)

  40. HDFS ❖ HDFS stands for Hadoop Distributed File System ❖ HDFS is the storage unit of Hadoop ❖ HDFS creates an abstraction layer over the distributed storage resources, from where we can see the whole of HDFS as a single unit (diagram: a NameNode and Secondary NameNode managing multiple DataNodes)

  41. NameNode • Master daemon • Maintains and manages the DataNodes • Records metadata, e.g. location of stored blocks, size of the files, permissions, hierarchy, etc. • Receives heartbeats and block reports from all the DataNodes

  42. Secondary NameNode • Checkpointing is the process of combining the edit logs with the FsImage • Allows faster recovery, as we have a backup of the metadata • Checkpointing happens periodically (default: 1 hour)

  43. Secondary NameNode (checkpointing diagram: NameNode and Secondary NameNode)

  44. DataNode • Slave daemons • Store the actual data • Serve read and write requests

  45. HDFS Architecture in Detail (diagram: the client issues metadata ops to the NameNode, which keeps metadata such as name and replicas, e.g. /hdfs/foo/data, 3, …; block ops, reads, writes and replication involve the DataNodes spread across Rack 1 and Rack 2)

  46. HDFS Block & Replication

  47. HDFS Data Block • Each file is stored on HDFS as blocks • The default size of each block is 128 MB • Let us say I have a file example.txt of size 380 MB: it is split into Block 1 (128 MB), Block 2 (128 MB) and Block 3 (124 MB) • Question: how many blocks will be created if a file of size 500 MB is copied to HDFS?

  48. HDFS Data Block • Answer: a 500 MB file is split into Block 1 (128 MB), Block 2 (128 MB), Block 3 (128 MB) and Block 4 (116 MB), i.e. 4 blocks, just as the 380 MB example.txt needs 3 blocks (128 MB + 128 MB + 124 MB)
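  A quick sanity check of the block arithmetic above (Scala, default block size 128 MB):

      val blockSizeMB = 128
      def blocksFor(fileMB: Int): Int = math.ceil(fileMB.toDouble / blockSizeMB).toInt

      blocksFor(380)   // 3 blocks: 128 + 128 + 124 MB
      blocksFor(500)   // 4 blocks: 128 + 128 + 128 + 116 MB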

  49. HDFS Block Replication Each data block is replicated (thrice by default) and distributed across different DataNodes. Example (diagram): a 248 MB file is split into Block 1 (128 MB) and Block 2 (120 MB), and each block is stored on three different DataNodes (Replication Factor = 3).

  50. Rack Awareness • The Rack Awareness Algorithm reduces latency as well as provides fault tolerance by replicating data blocks across racks • The Rack Awareness Algorithm says that the first replica of a block is stored on a local rack and the next two replicas are stored on a different (remote) rack
