Introduction to Hadoop Richard Holowczak Baruch College
Problems of Scale • As data size and processing complexity grow: • Contention for disks – disks have limited throughput • Processing cores per server/OS image are limited • … so processing throughput is limited • Reliability of distributed systems: • Tightly coupled distributed systems fall apart when one component (disk, network, CPU, etc.) fails • What happens to processing jobs when there is a failure? • Rigid structure of distributed systems: • Consider our ETL processes: the target schema is fixed ahead of time
Hadoop • A distributed data processing “ecosystem” that is • Scalable • Reliable • Fault Tolerant • A collection of projects currently maintained under the Apache Software Foundation: • hadoop.apache.org • Storage Layer: Hadoop Distributed File System (HDFS) • Scheduling Layer: Hadoop YARN • Execution Layer: Hadoop MapReduce • Plus many more projects built on top of this
Hadoop Distributed File System (HDFS) • Built on top of commodity hardware and operating systems • Any functioning Linux (or Windows) system can be set up as a node • Files are split into 64MB blocks that are distributed and replicated across nodes • Typically at least 3 copies of each block are made • File I/O semantics are simplified: • Write once (no notion of “update”) • Read many times as a stream (no random file I/O) • When a node fails, additional copies of its blocks are created on other nodes • A special “Name Node” keeps track of how a file's blocks are stored across different nodes • Location designations include: • Node • Rack • Data Center
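To make the write-once / read-as-a-stream semantics concrete, here is a minimal sketch against HDFS's Java FileSystem API; the path is illustrative, and the cluster configuration is assumed to be picked up from the usual core-site.xml / hdfs-site.xml files:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsReadWrite {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration(); // reads core-site.xml / hdfs-site.xml
    FileSystem fs = FileSystem.get(conf);     // the cluster's default file system

    Path file = new Path("/user/demo/xyz.txt"); // illustrative path

    // Write once: with overwrite=false, create() fails if the file already
    // exists. There is no in-place update of file contents.
    try (FSDataOutputStream out = fs.create(file, false)) {
      out.write("hello hdfs\n".getBytes(StandardCharsets.UTF_8));
    }

    // Read many times, as a stream from the beginning of the file.
    try (BufferedReader in = new BufferedReader(
        new InputStreamReader(fs.open(file), StandardCharsets.UTF_8))) {
      String line;
      while ((line = in.readLine()) != null) {
        System.out.println(line);
      }
    }
  }
}

Note that create() refusing to touch an existing file is exactly the “no update” rule above, while open() hands back a plain input stream for repeated sequential reads.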
HDFS Example 1 • [Diagram: file xyz.txt is split into Block1 and Block2. Block1 is replicated on Nodes 1, 2, and 3; Block2 on Nodes 1, 3, and 4. The Name Node's table records every block-to-node mapping: Block1_N1, Block2_N1, Block1_N2, Block1_N3, Block2_N3, Block2_N4.]
HDFS Example 2 • [Diagram: Node 2 fails, taking one of Block1's three copies offline. The Name Node's table still lists the now-unreachable Block1_N2 copy.]
HDFS Example 3 • [Diagram: blocks from the failed node are replicated – the Name Node has a new copy of Block1 created on Node 4, restoring three replicas, and adds Block1_N4 to its table.]
Hadoop Execution Layer: MapReduce • Processing architecture for Hadoop • Processing functions are sent to the nodes where the data reside • Map function is mainly concerned with parsing and filtering data • Collects instances of values V for each key K • This function is programmed by the developer • Shuffle: instances of { Ki, Vi } with the same key are merged • This step is done automatically by MapReduce • Reduce function is mainly concerned with summarizing data • Summarizes the set of V for each key K • This function is programmed by the developer • (See the sketch below)
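As a rough single-process illustration of the three phases (ordinary Java, not the Hadoop API – the record layout is a simplified version of the sales example two slides ahead):

import java.util.*;

public class MapReduceConcept {
  public static void main(String[] args) {
    List<String> records = Arrays.asList(
        "2011,Electronics,50.40",
        "2011,Electronics,12.00",
        "2012,Kitchen,65.00");

    // Map: parse each record and emit a (key, value) pair.
    // Shuffle: group every value under its key (Hadoop does this automatically).
    Map<String, List<Double>> groups = new TreeMap<>();
    for (String r : records) {
      String[] f = r.split(",");
      String key = f[0] + "," + f[1];          // K = [Year, Category]
      double value = Double.parseDouble(f[2]); // V = sale amount
      groups.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
    }

    // Reduce: summarize the list of values for each key.
    groups.forEach((key, values) -> System.out.println(
        key + " -> " + values.stream().mapToDouble(Double::doubleValue).sum()));
  }
}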
Hadoop Scheduling Layer • Job Tracker writes out a plan for completing a job and then tracks its progress • A job is broken up into independent tasks • Each task is routed to a CPU that is “close” to the data (same node, same rack, different rack) • Nodes run Task Trackers that carry out the tasks required to complete a job • When a node fails, the Job Tracker automatically restarts its tasks on another node • The scheduler may also distribute the same task to multiple nodes and keep the results from the node that finishes first (see the snippet below)
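That run-the-same-task-twice behavior is known as speculative execution, and it can be toggled per job. A minimal sketch, assuming the Hadoop 2.x property names (older MRv1 releases used different keys):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class SpeculativeConfig {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    conf.setBoolean("mapreduce.map.speculative", true);     // duplicate slow map tasks
    conf.setBoolean("mapreduce.reduce.speculative", false); // but not reduce tasks
    Job job = Job.getInstance(conf, "speculative demo");
    // ... set mapper, reducer, and input/output paths as usual ...
  }
}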
MapReduce Example • Compare 2012 total sales with 2011 total sales, broken down by product category • Data set: sales transaction records: • Date, Product, ProductCategory, CustomerName, …, Quantity, Price • Key: [ Year, ProductCategory ] • Value: [ Price * Quantity ] • Map function: for every record, form the Key, then multiply Price * Quantity and assign the result to the Value: < [2011, Electronics] , [ $50.40 ] > • Shuffle: merge/sort all of the <K, V> pairs on common keys • Reduce function: for each K, sum up all of the associated values V • (A Java sketch of this job follows)
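A sketch of this job against the Hadoop MapReduce Java API might look like the following; the comma-separated input format, the exact field positions, and all class names here are assumptions for illustration:

import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DoubleWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SalesByYearCategory {

  public static class SalesMapper
      extends Mapper<LongWritable, Text, Text, DoubleWritable> {
    @Override
    protected void map(LongWritable offset, Text line, Context ctx)
        throws IOException, InterruptedException {
      // Assumed layout: Date, Product, ProductCategory, ..., Quantity, Price
      String[] f = line.toString().split(",");
      String date = f[0].trim();
      String year = date.substring(date.length() - 4);  // e.g. "2011"
      String category = f[2].trim();
      int quantity = Integer.parseInt(f[f.length - 2].trim());
      double price = Double.parseDouble(f[f.length - 1].trim().replace("$", ""));
      // Key = [Year, ProductCategory], Value = Price * Quantity
      ctx.write(new Text(year + "," + category),
                new DoubleWritable(price * quantity));
    }
  }

  public static class SumReducer
      extends Reducer<Text, DoubleWritable, Text, DoubleWritable> {
    @Override
    protected void reduce(Text key, Iterable<DoubleWritable> values, Context ctx)
        throws IOException, InterruptedException {
      double total = 0.0;
      for (DoubleWritable v : values) total += v.get();
      ctx.write(key, new DoubleWritable(total)); // total sales per (year, category)
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "sales by year/category");
    job.setJarByClass(SalesByYearCategory.class);
    job.setMapperClass(SalesMapper.class);
    job.setCombinerClass(SumReducer.class); // sums are associative, so this is safe
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(DoubleWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Because addition is associative, the same reducer can be reused as a combiner, pre-summing values on the map side before the shuffle to cut network traffic.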
MapReduce Example (continued) • [Diagram: the blocks of xyz.txt hold the raw sales records, e.g. “6/02/2011, Electronics, …, 3, $130”. Block1 is replicated on Nodes 1, 2, and 3; Block2 on Nodes 1 and 3. The Job Tracker's plan for job J101 assigns task Ta to Block1 on Nodes 1 and 2 and task Tb to Block2 on Node 3; each node's Task Tracker runs its assigned tasks alongside tasks from other jobs (Tx, Ty, Tz).]
Common MapReduce domains • Indexing documents or web pages • Counting word frequencies • Processing log files • ETL • Processing image archives • Common characteristics: • Files/blocks can be processed independently and the results easily merged • Scales with the number of nodes, the size of the data, and the number of CPUs
Additional Apache/Hadoop Projects • HBase – Large-table NoSQL database • Hive – Data warehousing infrastructure / SQL support • Pig – Data processing scripting language compiled to MapReduce • Oozie – Workflow scheduling • Flume – Distributed log file collection • Mahout – Machine learning libraries