
Introduction to Hadoop


Presentation Transcript


  1. Introduction to Hadoop – Richard Holowczak, Baruch College

  2. Problems of Scale
  • As data size and processing complexity grow:
    • Contention for disks – disks have limited throughput
    • Processing cores per server/OS image are limited
    • … so processing throughput is limited
  • Reliability of distributed systems:
    • Tightly coupled distributed systems fall apart when one component (disk, network, CPU, etc.) fails
    • What happens to processing jobs when there is a failure?
  • Rigid structure of distributed systems:
    • Consider our ETL processes: the target schema is fixed ahead of time

  3. Hadoop
  • A distributed data processing “ecosystem” that is:
    • Scalable
    • Reliable
    • Fault tolerant
  • A collection of projects currently maintained under the Apache Software Foundation: hadoop.apache.org
  • Storage layer: Hadoop Distributed File System (HDFS)
  • Scheduling layer: Hadoop YARN
  • Execution layer: Hadoop MapReduce
  • Plus many more projects built on top of these

  4. Hadoop Distributed File System (HDFS)
  • Built on top of commodity hardware and operating systems
    • Any functioning Linux (or Windows) system can be set up as a node
  • Files are split into 64 MB blocks that are distributed and replicated across nodes
    • Typically at least 3 copies of each block are made
  • File I/O semantics are simplified:
    • Write once (no notion of “update”)
    • Read many times as a stream (no random file I/O)
  • When a node fails, additional copies of its blocks are created on other nodes
  • A special “Name Node” keeps track of how a file's blocks are stored across the different nodes
  • Location designations include:
    • Node
    • Rack
    • Data center
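
A minimal sketch of these write-once / read-as-a-stream semantics using Hadoop's Java FileSystem API; the class name, path, and record contents are made up for illustration, and a reachable HDFS cluster configured via core-site.xml is assumed:

    import java.io.BufferedReader;
    import java.io.InputStreamReader;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsWriteReadSketch {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();   // picks up core-site.xml / hdfs-site.xml
            FileSystem fs = FileSystem.get(conf);       // handle to the cluster's default file system
            Path file = new Path("/user/demo/xyz.txt"); // hypothetical path

            // Write once: create the file and stream records into it (no in-place updates later).
            try (FSDataOutputStream out = fs.create(file)) {
                out.writeBytes("6/02/2011, Electronics, ..., 3, $130\n");
            }

            // Read many times: open the file as a stream and read it sequentially.
            try (BufferedReader in = new BufferedReader(new InputStreamReader(fs.open(file)))) {
                String line;
                while ((line = in.readLine()) != null) {
                    System.out.println(line);
                }
            }
        }
    }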

  5. HDFS Example 1 (diagram): xyz.txt is split into Block1 and Block2. The Name Node records Block1 as stored on Nodes 1, 2, and 3 and Block2 on Nodes 1, 3, and 4; each node holds its assigned blocks and the nodes communicate over the network.

  6. HDFS Example 2 (diagram): the same layout as Example 1, but Node 2 – which holds a copy of Block1 – fails.

  7. HDFS Example 3 (diagram): Block1 from the failed Node 2 is re-replicated to Node 4, and the Name Node adds the entry Block1_N4 to its records, restoring three copies of each block.

  8. Hadoop Execution Layer: MapReduce
  • Processing architecture for Hadoop
  • Processing functions are sent to the nodes where the data reside
  • The Map function is mainly concerned with parsing and filtering data
    • Collects instances of values V for each key K
    • This function is programmed by the developer
  • The Shuffle merges and sorts the emitted { Ki, Vi } pairs by key
    • This step is done automatically by MapReduce
  • The Reduce function is mainly concerned with summarizing data
    • Summarizes the set of values V for each key K
    • This function is programmed by the developer
  • A minimal skeleton of the two developer-supplied functions is sketched below
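
A minimal skeleton of the two developer-supplied functions using the org.apache.hadoop.mapreduce API; the class names and key/value types here are placeholder assumptions, not anything prescribed by the slides:

    import java.io.IOException;

    import org.apache.hadoop.io.DoubleWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Map: parse/filter one input record and emit (K, V) pairs.
    public class ExampleMapper extends Mapper<LongWritable, Text, Text, DoubleWritable> {
        @Override
        protected void map(LongWritable offset, Text line, Context ctx)
                throws IOException, InterruptedException {
            // ... parse 'line', form a key, compute the value for that key ...
            ctx.write(new Text("someKey"), new DoubleWritable(1.0));
        }
    }

    // The shuffle happens automatically between the two classes: all values
    // emitted under the same key are grouped and handed to one reduce() call.

    // Reduce: summarize the set of values grouped under each key.
    class ExampleReducer extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
        @Override
        protected void reduce(Text key, Iterable<DoubleWritable> values, Context ctx)
                throws IOException, InterruptedException {
            double sum = 0.0;
            for (DoubleWritable v : values) {
                sum += v.get();
            }
            ctx.write(key, new DoubleWritable(sum));
        }
    }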

  9. Hadoop Scheduling Layer
  • The Job Tracker writes out a plan for completing a job and then tracks its progress
    • A job is broken up into independent tasks
    • Each task is routed to a CPU that is “close” to the data (same node, same rack, different rack)
  • Nodes run Task Trackers that carry out the tasks required to complete a job
  • When a node fails, the Job Tracker automatically restarts its tasks on another node
  • The scheduler may also distribute the same task to multiple nodes and keep the result from the node that finishes first (speculative execution)

  10. MapReduce Example
  • Compare 2012 total sales with 2011 total sales, broken down by product category
  • Data set: sales transaction records:
    • Date, Product, ProductCategory, CustomerName, …, Quantity, Price
  • Key: [ Year, ProductCategory ]
  • Value: [ Price * Quantity ]
  • Map function: for every record, form the key, then multiply Price * Quantity and assign it to the value, e.g. < [2011, Electronics], [ $50.40 ] >
  • Shuffle: merge/sort all of the <K, V> pairs on the common key
  • Reduce function: for each key K, sum up all of the associated values V
  • A sketch of this job in Java follows below
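
A sketch of how this job might be written, filling in the skeleton from the previous slide with the sales-specific logic; the field positions in the comma-separated records, the class names, and the input/output paths are assumptions for illustration, not taken from the slides:

    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.DoubleWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class SalesByCategory {

        // Map: parse one transaction record, form the [Year, ProductCategory] key,
        // and emit Price * Quantity as the value.
        public static class SalesMapper extends Mapper<LongWritable, Text, Text, DoubleWritable> {
            @Override
            protected void map(LongWritable offset, Text record, Context ctx)
                    throws IOException, InterruptedException {
                // Assumed layout: Date, Product, ProductCategory, CustomerName, ..., Quantity, Price
                String[] f = record.toString().split(",");
                String date = f[0].trim();
                String year = date.substring(date.lastIndexOf('/') + 1);  // "6/02/2011" -> "2011"
                String category = f[2].trim();
                int quantity = Integer.parseInt(f[f.length - 2].trim());
                double price = Double.parseDouble(f[f.length - 1].trim().replace("$", ""));
                ctx.write(new Text(year + "," + category), new DoubleWritable(price * quantity));
            }
        }

        // Reduce: for each [Year, ProductCategory] key, sum the values grouped by the shuffle.
        public static class SalesReducer extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
            @Override
            protected void reduce(Text key, Iterable<DoubleWritable> values, Context ctx)
                    throws IOException, InterruptedException {
                double total = 0.0;
                for (DoubleWritable v : values) {
                    total += v.get();
                }
                ctx.write(key, new DoubleWritable(total));
            }
        }

        // Driver: wire the mapper and reducer into a job and submit it to the cluster.
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "sales by year and category");
            job.setJarByClass(SalesByCategory.class);
            job.setMapperClass(SalesMapper.class);
            job.setReducerClass(SalesReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(DoubleWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));    // directory holding the sales files
            FileOutputFormat.setOutputPath(job, new Path(args[1]));  // must not already exist
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

Comparing 2012 with 2011 totals is then just a matter of pairing up the per-year, per-category sums in the reducer output.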

  11. MapReduce Example (diagram): the Job Tracker plans job J101 by assigning tasks to nodes that hold the relevant blocks of xyz.txt – task Ta runs against Block1 on Nodes 1 and 2, task Tb against Block2 on Node 3 – using the block locations supplied by the Name Node; each node's Task Tracker executes its assigned tasks against the locally stored sales records (e.g. “6/02/2011, Electronics, …, 3, $130”).

  12. Common MapReduce domains
  • Indexing documents or web pages
  • Counting word frequencies
  • Processing log files
  • ETL
  • Processing image archives
  • Common characteristics:
    • Files/blocks can be independently processed and the results easily merged
    • Scales with the number of nodes, the size of data, and the number of CPUs

  13. Additional Apache/Hadoop Projects
  • HBase – Large-table NoSQL database
  • Hive – Data warehousing infrastructure / SQL support
  • Pig – Data processing scripting language / MapReduce
  • Oozie – Workflow scheduling
  • Flume – Distributed log data collection and processing
  • Mahout – Machine learning libraries
