
Hadoop Tutorial | Big Data Hadoop Tutorial For Beginners | Hadoop Certification Training | Edureka

This Edureka "Hadoop Tutorial" (Hadoop Blog series: https://goo.gl/LFesy8) will help you solve Big Data use cases just like a data analyst. You will learn the concepts of both Hadoop and Spark, along with K-Means clustering and Zeppelin for visualizing your data.

Topics covered in this tutorial:
1. Big Data Use Cases - US Election & Instant Cabs
2. Solution strategy of the use cases
3. Hadoop & Spark Introduction
4. Hadoop Master/Slave Architecture
5. Hadoop Core Components
6. HDFS Data Blocks
7. HDFS Read/Write Mechanism
8. YARN Components
9. Spark Components
10. Spark Architecture
11. K-Means and Zeppelin
12. Implementing the solution of the use cases using Hadoop, Spark and other big data tools




Presentation Transcript


  1. Hadoop Tutorial

  2. Big Data Use Cases

  3. Big Data Use-Cases: 1. US Primary Election Analysis 2. Market Analysis for US Cab Start-up

  4. US Primary Election Analysis

  5. US Election STEP 1: Primary & Caucuses

  6. US Election STEP 1: Primary & Caucuses STEP 2: National Conventions

  7. US Election STEP 1: Primary & Caucuses STEP 2: National Conventions STEP 3: General Elections

  8. US Election STEP 1: Primary & Caucuses STEP 2: National Conventions STEP 3: General Elections STEP 4: Electoral College

  9. US Primary Election PROBLEM STATEMENT: In the US Primary Election 2016, Hillary Clinton was nominated over Bernie Sanders from the Democratic Party and, on the other hand, Donald Trump was nominated from the Republican Party to contest for the presidential position. As an analyst, you have been tasked to understand the different factors that led to the wins of Hillary Clinton and Donald Trump in the primary elections, based on demographic features, in order to plan their next initiatives and campaigns.

  10. US Primary Election Dataset: Now, as a data analyst, you have 2 datasets available: 1. US Primary Election Data Set 2. US Demographic Features (County-wise) Data Set

  11. US Primary Election Dataset
      state: list of US states
      state_abbreviation: abbreviation of each US state
      county: list of counties in each US state
      fips: FIPS county code, a Federal Information Processing Standards (FIPS) code that uniquely identifies each county
      party: party in the US primaries (Republican or Democrat)
      candidate: candidate in the US primary election from each party
      votes: number of votes gained by a candidate
      fraction_votes: total number of votes gained by a candidate / total votes gained by the party
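  A minimal Scala/Spark sketch of loading this dataset; the HDFS path and file name are assumptions, while the column names are the fields listed above:

      import org.apache.spark.sql.SparkSession

      val spark = SparkSession.builder().appName("USPrimaryElection").getOrCreate()

      // Read the primary-results CSV from HDFS; the header row supplies the column names
      val election = spark.read
        .option("header", "true")
        .option("inferSchema", "true")      // infer votes / fraction_votes as numeric
        .csv("hdfs:///user/edureka/primary_results.csv")   // assumed path

      election.printSchema()   // state, state_abbreviation, county, fips, party, candidate, votes, fraction_votes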

  12. US County Demographic Features Dataset DETAILS:
      Population, 2014 estimate
      Population, 2010 (April 1) estimates base
      Population, percent change - April 1, 2010 to July 1, 2014
      Population, 2010
      Persons under 5 years, percent, 2014
      Persons under 18 years, percent, 2014
      Persons 65 years and over, percent, 2014
      Female persons, percent, 2014
      White alone, percent, 2014
      …

  13. US Election Solution Strategy - Step 1: US Primary Election Dataset

  14. US Election Solution Strategy - Step 2: Storing Data in HDFS
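  A small sketch of this step, assuming the raw CSV sits on the local disk of an edge node. On the command line this is simply hdfs dfs -put; the same operation through the Hadoop FileSystem API from Scala (paths are placeholders) looks like this:

      import org.apache.hadoop.conf.Configuration
      import org.apache.hadoop.fs.{FileSystem, Path}

      // Equivalent of: hdfs dfs -put primary_results.csv /user/edureka/
      val fs = FileSystem.get(new Configuration())    // picks up core-site.xml / hdfs-site.xml
      fs.copyFromLocalFile(
        new Path("primary_results.csv"),              // local file (placeholder)
        new Path("/user/edureka/primary_results.csv") // HDFS destination (placeholder)
      )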

  15. US Election Solution Strategy - Step 3: Processing Data Using Spark Components

  16. US Election Solution Strategy - Step 4: Transforming Data Using Spark SQL
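  A hedged sketch of the Spark SQL step, reusing the election DataFrame from the earlier sketch. Filtering on fraction_votes > 0.5 to pick county winners is an illustrative query, not the exact transformation from the video:

      // Register the DataFrame as a temporary view so it can be queried with SQL
      election.createOrReplaceTempView("primary")

      // Counties where a Democratic candidate took more than half of the party's votes
      val demWinners = spark.sql("""
        SELECT state, county, fips, candidate, votes, fraction_votes
        FROM primary
        WHERE party = 'Democrat' AND fraction_votes > 0.5
      """)
      demWinners.show(10)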

  17. US Election Solution Strategy - Step 5: Clustering Data Using Spark MLlib (K-Means)
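  A minimal Spark MLlib (ml package) K-Means sketch. countyFeatures and the demographic column names are hypothetical; in practice they would come from joining the election results with the county demographics dataset on fips:

      import org.apache.spark.ml.feature.VectorAssembler
      import org.apache.spark.ml.clustering.KMeans

      // Assemble the numeric columns into a single feature vector
      val assembler = new VectorAssembler()
        .setInputCols(Array("fraction_votes", "pct_white", "pct_under_18"))  // hypothetical column names
        .setOutputCol("features")
      val featureDF = assembler.transform(countyFeatures)                    // hypothetical joined DataFrame

      // Cluster the counties; k = 4 is an arbitrary choice for the sketch
      val kmeans = new KMeans().setK(4).setSeed(1L).setFeaturesCol("features")
      val model = kmeans.fit(featureDF)
      val clustered = model.transform(featureDF)   // adds a "prediction" (cluster id) column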

  18. US Election Solution Strategy - Step 6: Visualizing the Result Using Zeppelin
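  In a Zeppelin notebook the clustered DataFrame can be handed straight to the built-in charts; a small sketch, assuming the clustered DataFrame from the previous step:

      // ZeppelinContext (z) renders a DataFrame with table/bar/scatter toggles in the UI
      z.show(clustered)

      // Alternatively, register a temp view and plot it from a %sql paragraph:
      clustered.createOrReplaceTempView("county_clusters")
      // %sql SELECT prediction, count(*) FROM county_clusters GROUP BY prediction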

  19. US Election Solution Strategy (all steps): 1. US Primary Election Dataset 2. Storing Data in HDFS 3. Processing Data Using Spark Components 4. Transforming Data Using Spark SQL 5. Clustering Data Using Spark MLlib (K-Means) 6. Visualizing the Result Using Zeppelin

  20. Visualization of Result

  21. Market Analysis for US Cab Start-Ups

  22. Market Analysis for US Cab Start-Ups PROBLEM STATEMENT: A US cab service start-up wants to meet the demands in an optimum manner and maximize its profit. Thus, they have hired you as a data analyst to interpret the available Uber data set and find out the beehive customer pick-up points & peak hours, so that the demand can be met in a profitable manner.

  23. Uber Dataset • Date/Time – Pickup Date & Time • Lat – Latitude of Pickup • Lon – Longitude of Pickup • Base – TLC Base Code

  24. Market Analysis for US Cab Start-Ups Solution Strategy - Step 1: Uber Pick-Up Locations Dataset

  25. Market Analysis for US Cab Start-Ups Solution Strategy - Step 2: Storing Data in HDFS

  26. Market Analysis for US Cab Start-Ups Solution Strategy - Step 3: Transforming the Dataset
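  A minimal sketch of this transformation, assuming the Uber CSV columns listed above (Date/Time, Lat, Lon, Base) and a timestamp format like 4/30/2014 23:22:00; the HDFS path and the format string are assumptions:

      import org.apache.spark.sql.functions.{to_timestamp, hour, col}
      import org.apache.spark.ml.feature.VectorAssembler

      val uber = spark.read
        .option("header", "true").option("inferSchema", "true")
        .csv("hdfs:///user/edureka/uber.csv")                    // assumed path

      // Parse the pickup timestamp and pull out the hour (for the peak-hour analysis)
      val withHour = uber
        .withColumn("pickup_ts", to_timestamp(col("Date/Time"), "M/d/yyyy HH:mm:ss"))
        .withColumn("pickup_hour", hour(col("pickup_ts")))

      // Assemble latitude/longitude into the feature vector that K-Means will cluster on
      val uberFeatures = new VectorAssembler()
        .setInputCols(Array("Lat", "Lon"))
        .setOutputCol("features")
        .transform(withHour)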

  27. Market Analysis for US Cab Start-Ups Solution Strategy - Step 4: K-Means Clustering on Latitude & Longitude (Predictions)

  28. Market Analysis for US Cab Start-Ups Solution Strategy (all steps): 1. Uber Pick-Up Locations Dataset 2. Storing Data in HDFS 3. Transforming the Dataset 4. K-Means Clustering on Latitude & Longitude (Predictions)

  29. Let Us Know What It Takes…

  30. Fundamentals Road Map: Introduction to Hadoop & Spark, HDFS (Hadoop Storage), YARN (Hadoop Processing), Apache Spark, K-Means & Zeppelin, Solution of Use-Cases

  31. Introduction to Hadoop & Spark

  32. Introduction to Hadoop & Spark
      Hadoop: Hadoop is a framework that allows you to store and process large data sets in a parallel and distributed fashion.
      ❖ Hadoop has two core components:
      ▪ HDFS: allows you to dump any kind of data across the cluster
      ▪ YARN: allows parallel processing of the data stored in HDFS
      Spark: Apache Spark is an open-source cluster-computing framework for real-time processing.
      ❖ Provides an interface for programming entire clusters with implicit data parallelism and fault tolerance
      ❖ Built on top of YARN and extends the YARN model to efficiently use more types of computations

  33. Spark Complementing Hadoop

  34. Spark & Hadoop Challenges Addressed: 1. Faster Analytics: Spark processes data up to 100 times faster than MapReduce (for in-memory workloads) 2. Cost Optimization: Spark applications can run on YARN, leveraging the existing Hadoop cluster 3. Avoid Duplication: Apache Spark can use HDFS as its storage. Combining Spark's strengths, i.e. high processing speed, advanced analytics and multiple integration support, with Hadoop's low-cost operation on commodity hardware gives the best results.
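  A minimal sketch of running the same Spark code on YARN. The jar name is a placeholder; in practice the application is usually launched with spark-submit rather than by hard-coding the master in code:

      // Typical launch command (placeholder jar name):
      //   spark-submit --master yarn --deploy-mode cluster election-analysis.jar
      // The same can be expressed when building the session yourself:
      import org.apache.spark.sql.SparkSession

      val sparkOnYarn = SparkSession.builder()
        .appName("ElectionAnalysisOnYarn")
        .master("yarn")              // resources are negotiated with the YARN ResourceManager
        .getOrCreate()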

  35. Big Data Use-Cases

  36. Big Data Use-Cases

  37. Big Data Use-Cases

  38. Big Data Use-Cases Solution Architecture: Big Data is stored on HDFS and processed through the YARN framework; solution options (tools used for processing) include MapReduce, Apache Hive, Apache Spark and Kafka.

  39. Fundamentals Road Map: HDFS (Hadoop Storage)

  40. HDFS ❖ HDFS stands for Hadoop Distributed File System ❖ HDFS is the storage unit of Hadoop ❖ HDFS creates an abstraction layer over the distributed storage resources, from where we can see the whole of HDFS as a single unit (diagram: a NameNode and Secondary NameNode managing multiple DataNodes)

  41. NameNode • Master daemon • Maintains and manages the DataNodes • Records metadata, e.g. location of stored blocks, size of the files, permissions, hierarchy, etc. • Receives heartbeats and block reports from all the DataNodes

  42. Secondary NameNode • Checkpointing is the process of combining the edit logs with the FsImage • Allows faster recovery, as we have a backup of the metadata • Checkpointing happens periodically (default: 1 hour)

  43. Secondary NameNode (checkpointing diagram: NameNode and Secondary NameNode)

  44. DataNode • Slave daemons • Store the actual data • Serve read and write requests

  45. HDFS Architecture in Detail (diagram: the client issues metadata ops to the NameNode, which keeps metadata such as name and replicas, e.g. /hdfs/foo/data, 3, …; block ops, reads, writes and replication involve the DataNodes spread across Rack 1 and Rack 2)

  46. HDFS Block & Replication

  47. HDFS Data Block • Each file is stored on HDFS as blocks • The default size of each block is 128 MB • Let us say I have a file example.txt of size 380 MB: it is split into Block 1 (128 MB), Block 2 (128 MB) and Block 3 (124 MB) • Question: how many blocks will be created if a file of size 500 MB is copied to HDFS?

  48. HDFS Data Block • Answer: a 500 MB file is split into Block 1 (128 MB), Block 2 (128 MB), Block 3 (128 MB) and Block 4 (116 MB), i.e. 4 blocks, just as the 380 MB example.txt needs 3 blocks (128 MB + 128 MB + 124 MB)
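  A quick sanity check of the block arithmetic above (Scala, default block size 128 MB):

      val blockSizeMB = 128
      def blocksFor(fileMB: Int): Int = math.ceil(fileMB.toDouble / blockSizeMB).toInt

      blocksFor(380)   // 3 blocks: 128 + 128 + 124 MB
      blocksFor(500)   // 4 blocks: 128 + 128 + 128 + 116 MB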

  49. HDFS Block Replication Each data block is replicated (thrice by default) and distributed across different DataNodes. Example (diagram): a 248 MB file is split into Block 1 (128 MB) and Block 2 (120 MB), and each block is stored on three different DataNodes (Replication Factor = 3).

  50. Rack Awareness • The Rack Awareness Algorithm reduces latency as well as provides fault tolerance by replicating data blocks across racks • The Rack Awareness Algorithm says that the first replica of a block is stored on a local rack and the next two replicas are stored on a different (remote) rack
