
Hadoop – A Very Brief Introduction


Presentation Transcript


  1. Introduction to Map-Reduce and Join Processing

  2. Hadoop – A Very Brief Introduction • A framework for creating distributed applications that process huge amounts of data • Scalability, Fault Tolerance, Ease of Programming • Two main components • HDFS – Hadoop Distributed File System • Map-Reduce • How is data organized on HDFS? • How is data processed using Map-Reduce?

  3. HDFS • Stores files in blocks across many nodes in a cluster • Replicates the blocks across nodes for durability • Default block size – 64 MB • Master/Slave architecture • HDFS Master – NameNode • Runs on a single node as a master process • Directs client access to files in HDFS • HDFS Slave – DataNode • Runs on all nodes in the cluster • Handles block creation/replication/deletion • Takes orders from the NameNode
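
  The slides describe block placement only at a high level; the following is a toy Python sketch (not Hadoop code) of the idea: split a file into 64 MB blocks and have a NameNode-like function pick three DataNodes for each block. The node names and the random placement policy are made up for illustration (real HDFS placement is rack-aware).

```python
import random

BLOCK_SIZE = 64 * 1024 * 1024   # default block size from the slide
REPLICATION = 3                 # default replication factor

def place_blocks(file_size_bytes, data_nodes):
    """Toy stand-in for the NameNode's placement decision:
    split the file into 64 MB blocks and pick 3 distinct
    DataNodes to hold each block."""
    num_blocks = -(-file_size_bytes // BLOCK_SIZE)  # ceiling division
    return {block_id: random.sample(data_nodes, REPLICATION)
            for block_id in range(num_blocks)}

nodes = ["node1", "node2", "node3", "node4", "node5", "node6"]
# a 150 MB file -> 3 blocks, each stored on 3 of the 6 nodes
for block, replicas in place_blocks(150 * 1024 * 1024, nodes).items():
    print(f"block {block} -> {replicas}")
```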

  4. HDFS (Replication Factor = 3) • A file with columns A, B, C and rows R1–R15 is split into three 64 MB blocks, and all blocks are distributed across the cluster:
  Block 1 (64 MB): R1 (1 2 3), R2 (2 3 5), R3 (2 4 6), R4 (6 4 2), R5 (1 3 6)
  Block 2 (64 MB): R6 (8 9 1), R7 (2 3 1), R8 (9 9 2), R9 (1 7 4), R10 (1 2 2)
  Block 3 (64 MB): R11 (2 3 4), R12 (4 5 6), R13 (6 7 8), R14 (9 8 3), R15 (3 2 1)

  5. HDFS • Writing a file (put File1.txt): the client contacts the NameNode, which assigns each block of the file to a set of DataNodes • In the diagram the cluster has six DataNodes (1–6) and the three blocks of File1.txt are placed on nodes {1, 4, 5}, {2, 5, 6} and {2, 3, 4}

  6. HDFS • Reading a file: the client asks the NameNode for the block locations ({1, 4, 5}, {2, 5, 6}, {2, 3, 4}) and then reads the blocks directly from the DataNodes • Because blocks are read from many DataNodes in parallel, aggregate read rate ≈ per-node transfer rate × number of machines

  7. HDFS • Fault-Tolerant – handles node failures • Self-Healing – rebalances files across the cluster; when a node fails, its data is automatically re-copied from the remaining two replicas • Scalable – grows just by adding new nodes (the diagram repeats the read example, with the replica sets updated after a node failure)

  8. Map-Reduce • Logical functions: Mappers and Reducers • Developers write map and reduce functions, then submit them as a jar to the Hadoop cluster • Hadoop handles distributing the Map and Reduce tasks across the cluster

  9. Map-Reduce • A map task is started for each split / 64 MB block. Each map task generates some intermediate data. • Hadoop collects the output of all map tasks, reorganizes them and passes the reorganized data to Reduce tasks • Reduce tasks process this re-organized data and generate the final output • Flow • HDFS Block to Map Task • Map Task to Hadoop Engine • Hadoop Shuffles and Sorts the Map output • Hadoop Engine to Reduce Tasks and Reduce Processing
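
  As a minimal, in-memory sketch of this flow (plain Python, not the Hadoop API), the driver below runs a map function over every record of every split, groups the intermediate pairs by key (the shuffle), and calls a reduce function once per group. The later examples (word count, aggregation, joins) can all be expressed as a map_fn/reduce_fn pair plugged into something like this.

```python
from collections import defaultdict

def run_map_reduce(splits, map_fn, reduce_fn):
    """Toy driver for the flow on the slide:
    HDFS blocks -> map tasks -> shuffle/sort -> reduce tasks."""
    # 1. One "map task" per split; each emits (key, value) pairs.
    intermediate = []
    for split in splits:
        for record in split:
            intermediate.extend(map_fn(record))

    # 2. Shuffle: group the values of all pairs that share a key.
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)

    # 3. Reduce: one call per key group, producing the final output.
    return [reduce_fn(key, values) for key, values in sorted(groups.items())]

# Toy usage: count values per key.
splits = [[("a", 1), ("b", 1)], [("a", 1)]]
print(run_map_reduce(splits, lambda r: [r], lambda k, vs: (k, sum(vs))))
# [('a', 2), ('b', 1)]
```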

  10. HDFS to Map Tasks • Records are read one by one from each block and passed to map for processing; the component that does this is the InputFormat / RecordReader • A record is passed as a key-value pair: the key is a byte offset and the value is the record • The offset is usually ignored by the map
  MAP-1 input: (0, R1 1 2 3) (10, R2 2 3 5) (20, R3 2 4 6) (30, R4 6 4 2) (40, R5 1 3 6)
  MAP-2 input: (50, R6 8 9 1) (60, R7 2 3 1) (70, R8 9 9 2) (80, R9 1 7 4) (90, R10 1 2 2)
  MAP-3 input: (100, R11 2 3 4) (110, R12 4 5 6) (120, R13 6 7 8) (130, R14 9 8 3) (140, R15 3 2 1)
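
  A rough sketch of what a RecordReader produces, assuming plain text records separated by newlines (the slide's offsets 0, 10, 20, … are illustrative; real offsets are byte positions within the file):

```python
def record_reader(block_text, block_start_offset=0):
    """Toy version of the InputFormat / RecordReader step:
    turn a block of text into (byte offset, record) pairs."""
    offset = block_start_offset
    for line in block_text.splitlines(keepends=True):
        yield (offset, line.rstrip("\n"))
        offset += len(line)

# Example: the first three records of block 1 from the earlier slide.
block = "R1 1 2 3\nR2 2 3 5\nR3 2 4 6\n"
for key, value in record_reader(block):
    print(key, value)   # 0 R1 1 2 3 / 9 R2 2 3 5 / 18 R3 2 4 6
```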

  11. Map Task • Takes in a key-value pair and transforms it into a set of key-value pairs: {K1, V1} ==> [{K2, V2}] • Example: if the second column of the record is odd, emit nothing; if it is even, emit one pair per even divisor of its value, with the divisor as the key and the value of the third column as the value (the row label counts as the first column, so here the second column is attribute A and the third is attribute B)
  MAP-1 input: (0, R1 1 2 3) (10, R2 2 3 5) (20, R3 2 4 6) (30, R4 6 4 2) (40, R5 1 3 6) → output: (2, 3) (2, 4) (2, 4) (6, 4)
  MAP-2 input: (0, R6 8 9 1) (10, R7 2 3 1) (20, R8 9 9 2) (30, R9 1 7 4) (40, R10 1 2 2) → output: (2, 9) (4, 9) (8, 9) (2, 3)
  MAP-3 input: (0, R11 2 3 4) (10, R12 4 5 6) (20, R13 6 7 8) (30, R14 9 8 3) (40, R15 3 2 1) → output: (2, 3) (2, 5) (4, 5) (2, 7)
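
  A sketch of that map function in Python, under the reading above (attribute A as the "second column", attribute B as the "third"); the function name is made up:

```python
def even_divisor_map(record):
    """Map function from the slide, for records like 'R3 2 4 6':
    if A is even, emit one (divisor, B) pair per even divisor of A;
    if A is odd, emit nothing."""
    fields = record.split()
    a, b = int(fields[1]), int(fields[2])
    if a % 2 != 0:
        return []
    return [(d, b) for d in range(2, a + 1) if d % 2 == 0 and a % d == 0]

print(even_divisor_map("R6 8 9 1"))   # [(2, 9), (4, 9), (8, 9)]
print(even_divisor_map("R2 2 3 5"))   # [(2, 3)]
print(even_divisor_map("R1 1 2 3"))   # [] because A = 1 is odd
```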

  12. Hadoop Sorting and Shuffling • Hadoop processes the key-value pairs output by the map tasks so that the values of all pairs with the same key are grouped together • These groups are then passed to reducers for processing
  Map outputs – MAP-1: (2, 3) (2, 4) (2, 4) (6, 4); MAP-2: (2, 9) (4, 9) (8, 9) (2, 3); MAP-3: (2, 3) (2, 5) (4, 5) (2, 7)
  After shuffle: (2, [3, 3, 3, 4, 4, 5, 7, 9]) (4, [5, 9]) (6, [4]) (8, [9])
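
  The grouping can be sketched as a sort followed by grouping of equal keys, which is closer to how Hadoop actually shuffles (a sort-merge rather than a hash table); the map outputs below are the ones from the slide:

```python
from itertools import groupby
from operator import itemgetter

def shuffle_and_sort(map_outputs):
    """Merge all map outputs, sort by key, and group the values
    of equal keys into one list per key."""
    all_pairs = [pair for output in map_outputs for pair in output]
    all_pairs.sort(key=itemgetter(0))
    return [(k, [v for _, v in grp])
            for k, grp in groupby(all_pairs, key=itemgetter(0))]

map1 = [(2, 3), (2, 4), (2, 4), (6, 4)]
map2 = [(2, 9), (4, 9), (8, 9), (2, 3)]
map3 = [(2, 3), (2, 5), (4, 5), (2, 7)]
print(shuffle_and_sort([map1, map2, map3]))
# [(2, [3, 4, 4, 9, 3, 3, 5, 7]), (4, [9, 5]), (6, [4]), (8, [9])]
```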

  13. Hadoop Engine to Reduce Tasks and Reduce Processing • Let the number of distinct keys (groups) be m and the number of reduce tasks be k • The m groups are distributed across the k reduce tasks using a hash function • Each reduce task processes its groups and generates the final output; in this example the reducer sums all the values of a group
  REDUCER 1: (2, [3, 4, 4, 9, 3, 3, 5, 7]) (6, [4]) → (2, 38) (6, 4)
  REDUCER 2: (4, [9, 5]) (8, [9]) → (4, 14) (8, 9)
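
  A sketch of the distribution step: a partition function assigns each key group to one of the k reduce tasks, and each reducer applies the example's sum. The partitioner below is a toy one chosen so the assignment matches the slide (Hadoop's default HashPartitioner uses the key's hashCode modulo the number of reducers):

```python
def partition(key, num_reducers):
    # Toy partitioner: keys 2 and 6 land on one reducer, 4 and 8 on the other.
    return (key // 2) % num_reducers

def sum_reducer(key, values):
    # The example reducer from the slide: sum all values in the group.
    return (key, sum(values))

groups = {2: [3, 4, 4, 9, 3, 3, 5, 7], 4: [9, 5], 6: [4], 8: [9]}
for key, values in sorted(groups.items()):
    print(f"reducer {partition(key, 2)}: {sum_reducer(key, values)}")
# reducer 1: (2, 38)
# reducer 0: (4, 14)
# reducer 1: (6, 4)
# reducer 0: (8, 9)
```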

  14. Word-Count
  Input lines (one per map task): "Hadoop Uses Map-Reduce", "There is a Map-Phase", "There is a Reduce phase"
  Map output – one (word, 1) pair per word: (Hadoop, 1) (Uses, 1) (Map, 1) (Reduce, 1) / (There, 1) (is, 1) (a, 1) (Map, 1) (Phase, 1) / (There, 1) (is, 1) (a, 1) (Reduce, 1) (Phase, 1)
  Shuffle groups equal words and partitions them to reducers by first letter:
  Reducer A–I: (a, [1, 1]) (hadoop, [1]) (is, [1, 1]) → (a, 2) (hadoop, 1) (is, 2)
  Reducer J–Q: (map, [1, 1]) (phase, [1, 1]) → (map, 2) (phase, 2)
  Reducer R–Z: (reduce, [1, 1]) (there, [1, 1]) (uses, [1]) → (reduce, 2) (there, 2) (uses, 1)
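
  The same word count as a pair of plain Python functions wired through an in-memory shuffle (a sketch of the logic, not Hadoop API code; the final counts are lower-cased, matching the reducer outputs on the slide):

```python
import re
from collections import defaultdict

def wc_map(line):
    # Emit (word, 1) for every word in the line, lower-cased and
    # split on non-letters so "Map-Reduce" yields "map" and "reduce".
    return [(w.lower(), 1) for w in re.findall(r"[A-Za-z]+", line)]

def wc_reduce(word, counts):
    return (word, sum(counts))

lines = ["Hadoop Uses Map-Reduce", "There is a Map-Phase", "There is a Reduce phase"]
groups = defaultdict(list)
for line in lines:                  # map phase
    for word, one in wc_map(line):
        groups[word].append(one)
for word in sorted(groups):         # reduce phase
    print(wc_reduce(word, groups[word]))
# ('a', 2) ('hadoop', 1) ('is', 2) ('map', 2) ('phase', 2)
# ('reduce', 2) ('there', 2) ('uses', 1)
```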

  15. Map-Reduce Example: Aggregation • Compute the average of B for each distinct value of A • Each map emits an (A, B) pair per tuple
  MAP 1: (1, 10) (2, 20) (1, 10)
  MAP 2: (1, 30) (3, 40) (2, 10) (1, 20)
  Reducer 1: (1, [10, 10, 30, 20]) → (1, 17.5)
  Reducer 2: (2, [10, 20]) (3, [40]) → (2, 15) (3, 40)
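
  A sketch of the aggregation in the same style: the map emits (A, B) and the reduce averages the B values of each group. Averaging directly in the reducer works here, but if a combiner were added one would have to carry (sum, count) pairs instead, since an average of averages is not the overall average.

```python
from collections import defaultdict

def avg_map(row):
    a, b = row          # key on A, value is B
    return (a, b)

def avg_reduce(a, b_values):
    return (a, sum(b_values) / len(b_values))

rows = [(1, 10), (2, 20), (1, 10),             # map 1's split
        (1, 30), (3, 40), (2, 10), (1, 20)]    # map 2's split
groups = defaultdict(list)
for row in rows:
    a, b = avg_map(row)
    groups[a].append(b)
print([avg_reduce(a, vs) for a, vs in sorted(groups.items())])
# [(1, 17.5), (2, 15.0), (3, 40.0)]
```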

  16. Designing a Map-Reduce Algorithm • Thinking in terms of Map and Reduce • What data should be the key? • What data should be the values? • Minimizing cost • Reading and map-processing cost • Communication cost • Processing cost at the reducer • Load balancing • All reducers should receive a similar volume of traffic • It should not happen that only a few machines do most of the work while the rest sit idle

  17. Join On Point Data • Select R.A, R.B, S.D where R.A == S.A • Each map tags its tuples with the relation they come from and emits them keyed on A; each reducer pairs the R and S tuples that share a key
  MAP 1 (R): (1, [R, 10]) (2, [R, 20]) (1, [R, 10]) (1, [R, 30]) (3, [R, 40])
  MAP 2 (S): (1, [S, 20]) (2, [S, 30]) (2, [S, 10]) (3, [S, 50]) (3, [S, 40])
  Reducer 1: (1, [(R, 10), (R, 10), (R, 30), (S, 20)]) → (1, 10, 20) (1, 10, 20) (1, 30, 20)
  Reducer 2: (2, [(R, 20), (S, 30), (S, 10)]) (3, [(R, 40), (S, 50), (S, 40)]) → (2, 20, 30) (2, 20, 10) (3, 40, 50) (3, 40, 40)

  18. Join On Point Data • Select R.A, R.B, S.D where R.A == S.A • The range of attribute A is divided into k parts: a hash function maps the value of A to a bucket in [1, …, k] • One reducer is defined for each of the k buckets • A tuple from R or S is communicated to reducer i if the value of R.A or S.A hashes to bucket i • Each reducer computes its part of the join output
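
  A minimal in-memory sketch of this repartition (reduce-side) join, using the R and S tuples from the previous slide; the grouping dictionary stands in for the hash-partitioned shuffle, and the helper names are made up:

```python
from collections import defaultdict

def join_map(relation, tup):
    """Map side: key each tuple on A and tag it with its relation."""
    a, other = tup                      # (A, B) for R, (A, D) for S
    return (a, (relation, other))

def join_reduce(a, tagged_values):
    """Reduce side: pair every R value with every S value for this A."""
    r_vals = [v for tag, v in tagged_values if tag == "R"]
    s_vals = [v for tag, v in tagged_values if tag == "S"]
    return [(a, b, d) for b in r_vals for d in s_vals]

R = [(1, 10), (2, 20), (1, 10), (1, 30), (3, 40)]   # (A, B)
S = [(1, 20), (2, 30), (2, 10), (3, 50), (3, 40)]   # (A, D)

groups = defaultdict(list)
for rel, tuples in (("R", R), ("S", S)):
    for t in tuples:
        key, value = join_map(rel, t)
        groups[key].append(value)

for a in sorted(groups):
    print(join_reduce(a, groups[a]))
# [(1, 10, 20), (1, 10, 20), (1, 30, 20)]
# [(2, 20, 30), (2, 20, 10)]
# [(3, 40, 50), (3, 40, 40)]
```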

  19. Join On Point Data • Assume k = 3 and h(1) = 0, h(2) = 1, h(3) = 2
  Bucket 0 gets the tuples with A = 1: R1 (1 10 12), R3 (1 10 22), R4 (1 30 56), S1 (1 20 22) → joined pairs R1·S1, R3·S1, R4·S1
  Bucket 1 gets the tuples with A = 2: R2 (2 20 34), S2 (2 30 36), S3 (2 10 29) → joined pairs R2·S2, R2·S3
  Bucket 2 gets the tuples with A = 3: R5 (3 40 17), S4 (3 50 16), S5 (3 40 37) → joined pairs R5·S4, R5·S5

  20. Map-Reduce Example: Inequality Join • Select R.A, R.B, S.D where R.A <= S.A • Consider a 3-node cluster with one reducer r_i per value of A (i = 1, 2, 3) • An S tuple with S.A = i is sent only to reducer r_i; an R tuple with R.A = a is replicated to every reducer r_i with i >= a, since it can join with any S tuple whose A value is at least a
  MAP 1 (R): (r1, [R, 1, 10]) (r2, [R, 1, 10]) (r3, [R, 1, 10]) (r2, [R, 2, 20]) (r3, [R, 2, 20]) … (r3, [R, 3, 40])
  MAP 2 (S): (r1, [S, 1, 20]) (r2, [S, 2, 30]) (r2, [S, 2, 10]) (r3, [S, 3, 50]) (r3, [S, 3, 40])
  Reducer 1: ([R, 1, 10], [R, 1, 10], [R, 1, 30], [S, 1, 20]) → (1, 10, 20) (1, 10, 20) (1, 30, 20)
  Reducer 2: …
  Reducer 3: ([R, 1, 10], [R, 2, 20], [R, 1, 10], [R, 1, 30], [R, 3, 40], [S, 3, 50], [S, 3, 40]) → (1, 10, 50) (1, 10, 40) (1, 10, 50) (1, 10, 40) (1, 30, 50) (1, 30, 40) (2, 20, 50) (2, 20, 40) (3, 40, 50) (3, 40, 40)
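
  A sketch of the replication scheme in the same style, with one reducer per value of A as on the slide; the explicit ra <= sa check in the reducer is redundant given the replication but keeps the join predicate visible:

```python
from collections import defaultdict

BUCKETS = [1, 2, 3]   # one reducer per value of A, as on the slide

def map_r(a, b):
    # An R tuple can join with S tuples in any bucket >= its own A,
    # so replicate it to every such reducer.
    return [(i, ("R", a, b)) for i in BUCKETS if i >= a]

def map_s(a, d):
    # An S tuple only joins within its own bucket.
    return [(a, ("S", a, d))]

def reduce_join(tagged):
    r_tuples = [(a, b) for tag, a, b in tagged if tag == "R"]
    s_tuples = [(a, d) for tag, a, d in tagged if tag == "S"]
    return [(ra, rb, sd) for ra, rb in r_tuples
                         for sa, sd in s_tuples if ra <= sa]

R = [(1, 10), (1, 10), (1, 30), (2, 20), (3, 40)]   # (A, B)
S = [(1, 20), (2, 30), (2, 10), (3, 50), (3, 40)]   # (A, D)

reducers = defaultdict(list)
for a, b in R:
    for key, value in map_r(a, b):
        reducers[key].append(value)
for a, d in S:
    for key, value in map_s(a, d):
        reducers[key].append(value)

for i in sorted(reducers):
    print(f"reducer {i}: {reduce_join(reducers[i])}")
# reducer 1 yields the three (1, _, 20) tuples from the slide,
# reducer 3 the ten tuples that join with S.A = 3.
```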

  21. Why Is Join on Map-Reduce a Complex Task? • Data for multiple relations is distributed across different machines • Map-Reduce is inherently designed for processing a single dataset • An output tuple can be generated only when all the input tuples it joins are collected at a common machine • Ensuring this for all output tuples is non-trivial • A priori, we don't know which tuples are going to join to form an output tuple; that is precisely the join problem • Guaranteeing it may involve a lot of replication and hence a lot of communication • Tuples from every candidate combination need to be collected at reducers and the join predicates need to be checked
