
Hadoop – A Very Brief Introduction


Presentation Transcript


  1. Introduction to Map-Reduce and Join Processing

  2. Hadoop – A Very Brief Introduction • A framework for creating distributed applications that process huge amounts of data • Scalability, Fault Tolerance, Ease of Programming • Two main components • HDFS – Hadoop Distributed File System • Map-Reduce • How is data organized on HDFS? • How is data processed using Map-Reduce?

  3. HDFS • Stores files in blocks across many nodes in a cluster • Replicates the blocks across nodes for durability • Default block size – 64 MB • Master/Slave architecture • HDFS Master – NameNode • Runs on a single node as a master process • Directs client access to files in HDFS • HDFS Slave – DataNode • Runs on all nodes in the cluster • Handles block creation/replication/deletion • Takes orders from the NameNode
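
  The slides describe block placement only at a high level; the following is a toy Python sketch (not Hadoop code) of the idea: split a file into 64 MB blocks and have a NameNode-like function pick three DataNodes for each block. The node names and the random placement policy are made up for illustration (real HDFS placement is rack-aware).

```python
import random

BLOCK_SIZE = 64 * 1024 * 1024   # default block size from the slide
REPLICATION = 3                 # default replication factor

def place_blocks(file_size_bytes, data_nodes):
    """Toy stand-in for the NameNode's placement decision:
    split the file into 64 MB blocks and pick 3 distinct
    DataNodes to hold each block."""
    num_blocks = -(-file_size_bytes // BLOCK_SIZE)  # ceiling division
    return {block_id: random.sample(data_nodes, REPLICATION)
            for block_id in range(num_blocks)}

nodes = ["node1", "node2", "node3", "node4", "node5", "node6"]
# a 150 MB file -> 3 blocks, each stored on 3 of the 6 nodes
for block, replicas in place_blocks(150 * 1024 * 1024, nodes).items():
    print(f"block {block} -> {replicas}")
```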

  4. HDFS (Replication Factor = 3) • A file with columns A, B, C and rows R1–R15 is split into three 64 MB blocks, and all blocks are distributed across the cluster:
  Block 1 (64 MB): R1 (1 2 3), R2 (2 3 5), R3 (2 4 6), R4 (6 4 2), R5 (1 3 6)
  Block 2 (64 MB): R6 (8 9 1), R7 (2 3 1), R8 (9 9 2), R9 (1 7 4), R10 (1 2 2)
  Block 3 (64 MB): R11 (2 3 4), R12 (4 5 6), R13 (6 7 8), R14 (9 8 3), R15 (3 2 1)

  5. HDFS • Writing a file (put File1.txt): the client contacts the NameNode, which assigns each block of the file to a set of DataNodes • In the diagram the cluster has six DataNodes (1–6) and the three blocks of File1.txt are placed on nodes {1, 4, 5}, {2, 5, 6} and {2, 3, 4}

  6. HDFS • Reading a file: the client asks the NameNode for the block locations ({1, 4, 5}, {2, 5, 6}, {2, 3, 4}) and then reads the blocks directly from the DataNodes • Because blocks are read from many DataNodes in parallel, aggregate read rate ≈ per-node transfer rate × number of machines

  7. HDFS • Fault-Tolerant – handles node failures • Self-Healing – rebalances files across the cluster; when a node fails, its data is automatically re-copied from the remaining two replicas • Scalable – grows just by adding new nodes (the diagram repeats the read example, with the replica sets updated after a node failure)

  8. Map-Reduce • Logical functions: Mappers and Reducers • Developers write map and reduce functions, then submit them as a jar to the Hadoop cluster • Hadoop handles distributing the Map and Reduce tasks across the cluster

  9. Map-Reduce • A map task is started for each split / 64 MB block. Each map task generates some intermediate data. • Hadoop collects the output of all map tasks, reorganizes them and passes the reorganized data to Reduce tasks • Reduce tasks process this re-organized data and generate the final output • Flow • HDFS Block to Map Task • Map Task to Hadoop Engine • Hadoop Shuffles and Sorts the Map output • Hadoop Engine to Reduce Tasks and Reduce Processing
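
  As a minimal, in-memory sketch of this flow (plain Python, not the Hadoop API), the driver below runs a map function over every record of every split, groups the intermediate pairs by key (the shuffle), and calls a reduce function once per group. The later examples (word count, aggregation, joins) can all be expressed as a map_fn/reduce_fn pair plugged into something like this.

```python
from collections import defaultdict

def run_map_reduce(splits, map_fn, reduce_fn):
    """Toy driver for the flow on the slide:
    HDFS blocks -> map tasks -> shuffle/sort -> reduce tasks."""
    # 1. One "map task" per split; each emits (key, value) pairs.
    intermediate = []
    for split in splits:
        for record in split:
            intermediate.extend(map_fn(record))

    # 2. Shuffle: group the values of all pairs that share a key.
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)

    # 3. Reduce: one call per key group, producing the final output.
    return [reduce_fn(key, values) for key, values in sorted(groups.items())]

# Toy usage: count values per key.
splits = [[("a", 1), ("b", 1)], [("a", 1)]]
print(run_map_reduce(splits, lambda r: [r], lambda k, vs: (k, sum(vs))))
# [('a', 2), ('b', 1)]
```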

  10. HDFS to Map Tasks • Records are read one by one from each block and passed to map for processing; the component that does this is the InputFormat / RecordReader • A record is passed as a key-value pair: the key is a byte offset and the value is the record • The offset is usually ignored by the map
  MAP-1 input: (0, R1 1 2 3) (10, R2 2 3 5) (20, R3 2 4 6) (30, R4 6 4 2) (40, R5 1 3 6)
  MAP-2 input: (50, R6 8 9 1) (60, R7 2 3 1) (70, R8 9 9 2) (80, R9 1 7 4) (90, R10 1 2 2)
  MAP-3 input: (100, R11 2 3 4) (110, R12 4 5 6) (120, R13 6 7 8) (130, R14 9 8 3) (140, R15 3 2 1)
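
  A rough sketch of what a RecordReader produces, assuming plain text records separated by newlines (the slide's offsets 0, 10, 20, … are illustrative; real offsets are byte positions within the file):

```python
def record_reader(block_text, block_start_offset=0):
    """Toy version of the InputFormat / RecordReader step:
    turn a block of text into (byte offset, record) pairs."""
    offset = block_start_offset
    for line in block_text.splitlines(keepends=True):
        yield (offset, line.rstrip("\n"))
        offset += len(line)

# Example: the first three records of block 1 from the earlier slide.
block = "R1 1 2 3\nR2 2 3 5\nR3 2 4 6\n"
for key, value in record_reader(block):
    print(key, value)   # 0 R1 1 2 3 / 9 R2 2 3 5 / 18 R3 2 4 6
```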

  11. Map Task • Takes in a key-value pair and transforms it into a set of key-value pairs: {K1, V1} ==> [{K2, V2}] • Example: if the second column of the record is odd, emit nothing; if it is even, emit one pair per even divisor of its value, with the divisor as the key and the value of the third column as the value (the row label counts as the first column, so here the second column is attribute A and the third is attribute B)
  MAP-1 input: (0, R1 1 2 3) (10, R2 2 3 5) (20, R3 2 4 6) (30, R4 6 4 2) (40, R5 1 3 6) → output: (2, 3) (2, 4) (2, 4) (6, 4)
  MAP-2 input: (0, R6 8 9 1) (10, R7 2 3 1) (20, R8 9 9 2) (30, R9 1 7 4) (40, R10 1 2 2) → output: (2, 9) (4, 9) (8, 9) (2, 3)
  MAP-3 input: (0, R11 2 3 4) (10, R12 4 5 6) (20, R13 6 7 8) (30, R14 9 8 3) (40, R15 3 2 1) → output: (2, 3) (2, 5) (4, 5) (2, 7)
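
  A sketch of that map function in Python, under the reading above (attribute A as the "second column", attribute B as the "third"); the function name is made up:

```python
def even_divisor_map(record):
    """Map function from the slide, for records like 'R3 2 4 6':
    if A is even, emit one (divisor, B) pair per even divisor of A;
    if A is odd, emit nothing."""
    fields = record.split()
    a, b = int(fields[1]), int(fields[2])
    if a % 2 != 0:
        return []
    return [(d, b) for d in range(2, a + 1) if d % 2 == 0 and a % d == 0]

print(even_divisor_map("R6 8 9 1"))   # [(2, 9), (4, 9), (8, 9)]
print(even_divisor_map("R2 2 3 5"))   # [(2, 3)]
print(even_divisor_map("R1 1 2 3"))   # [] because A = 1 is odd
```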

  12. Hadoop Sorting and Shuffling • Hadoop processes the key-value pairs output by the map tasks so that the values of all pairs with the same key are grouped together • These groups are then passed to reducers for processing
  Map outputs – MAP-1: (2, 3) (2, 4) (2, 4) (6, 4); MAP-2: (2, 9) (4, 9) (8, 9) (2, 3); MAP-3: (2, 3) (2, 5) (4, 5) (2, 7)
  After shuffle: (2, [3, 3, 3, 4, 4, 5, 7, 9]) (4, [5, 9]) (6, [4]) (8, [9])
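
  The grouping can be sketched as a sort followed by grouping of equal keys, which is closer to how Hadoop actually shuffles (a sort-merge rather than a hash table); the map outputs below are the ones from the slide:

```python
from itertools import groupby
from operator import itemgetter

def shuffle_and_sort(map_outputs):
    """Merge all map outputs, sort by key, and group the values
    of equal keys into one list per key."""
    all_pairs = [pair for output in map_outputs for pair in output]
    all_pairs.sort(key=itemgetter(0))
    return [(k, [v for _, v in grp])
            for k, grp in groupby(all_pairs, key=itemgetter(0))]

map1 = [(2, 3), (2, 4), (2, 4), (6, 4)]
map2 = [(2, 9), (4, 9), (8, 9), (2, 3)]
map3 = [(2, 3), (2, 5), (4, 5), (2, 7)]
print(shuffle_and_sort([map1, map2, map3]))
# [(2, [3, 4, 4, 9, 3, 3, 5, 7]), (4, [9, 5]), (6, [4]), (8, [9])]
```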

  13. Hadoop Engine to Reduce Tasks and Reduce Processing • Let the number of distinct keys (groups) be m and the number of reduce tasks be k • The m groups are distributed across the k reduce tasks using a hash function • Each reduce task processes its groups and generates the final output; in this example the reducer sums all the values of a group
  REDUCER 1: (2, [3, 4, 4, 9, 3, 3, 5, 7]) (6, [4]) → (2, 38) (6, 4)
  REDUCER 2: (4, [9, 5]) (8, [9]) → (4, 14) (8, 9)
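
  A sketch of the distribution step: a partition function assigns each key group to one of the k reduce tasks, and each reducer applies the example's sum. The partitioner below is a toy one chosen so the assignment matches the slide (Hadoop's default HashPartitioner uses the key's hashCode modulo the number of reducers):

```python
def partition(key, num_reducers):
    # Toy partitioner: keys 2 and 6 land on one reducer, 4 and 8 on the other.
    return (key // 2) % num_reducers

def sum_reducer(key, values):
    # The example reducer from the slide: sum all values in the group.
    return (key, sum(values))

groups = {2: [3, 4, 4, 9, 3, 3, 5, 7], 4: [9, 5], 6: [4], 8: [9]}
for key, values in sorted(groups.items()):
    print(f"reducer {partition(key, 2)}: {sum_reducer(key, values)}")
# reducer 1: (2, 38)
# reducer 0: (4, 14)
# reducer 1: (6, 4)
# reducer 0: (8, 9)
```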

  14. Word-Count
  Input lines (one per map task): "Hadoop Uses Map-Reduce", "There is a Map-Phase", "There is a Reduce phase"
  Map output – one (word, 1) pair per word: (Hadoop, 1) (Uses, 1) (Map, 1) (Reduce, 1) / (There, 1) (is, 1) (a, 1) (Map, 1) (Phase, 1) / (There, 1) (is, 1) (a, 1) (Reduce, 1) (Phase, 1)
  Shuffle groups equal words and partitions them to reducers by first letter:
  Reducer A–I: (a, [1, 1]) (hadoop, [1]) (is, [1, 1]) → (a, 2) (hadoop, 1) (is, 2)
  Reducer J–Q: (map, [1, 1]) (phase, [1, 1]) → (map, 2) (phase, 2)
  Reducer R–Z: (reduce, [1, 1]) (there, [1, 1]) (uses, [1]) → (reduce, 2) (there, 2) (uses, 1)
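
  The same word count as a pair of plain Python functions wired through an in-memory shuffle (a sketch of the logic, not Hadoop API code; the final counts are lower-cased, matching the reducer outputs on the slide):

```python
import re
from collections import defaultdict

def wc_map(line):
    # Emit (word, 1) for every word in the line, lower-cased and
    # split on non-letters so "Map-Reduce" yields "map" and "reduce".
    return [(w.lower(), 1) for w in re.findall(r"[A-Za-z]+", line)]

def wc_reduce(word, counts):
    return (word, sum(counts))

lines = ["Hadoop Uses Map-Reduce", "There is a Map-Phase", "There is a Reduce phase"]
groups = defaultdict(list)
for line in lines:                  # map phase
    for word, one in wc_map(line):
        groups[word].append(one)
for word in sorted(groups):         # reduce phase
    print(wc_reduce(word, groups[word]))
# ('a', 2) ('hadoop', 1) ('is', 2) ('map', 2) ('phase', 2)
# ('reduce', 2) ('there', 2) ('uses', 1)
```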

  15. Map-Reduce Example: Aggregation • Compute the average of B for each distinct value of A • Each map emits an (A, B) pair per tuple
  MAP 1: (1, 10) (2, 20) (1, 10)
  MAP 2: (1, 30) (3, 40) (2, 10) (1, 20)
  Reducer 1: (1, [10, 10, 30, 20]) → (1, 17.5)
  Reducer 2: (2, [10, 20]) (3, [40]) → (2, 15) (3, 40)
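
  A sketch of the aggregation in the same style: the map emits (A, B) and the reduce averages the B values of each group. Averaging directly in the reducer works here, but if a combiner were added one would have to carry (sum, count) pairs instead, since an average of averages is not the overall average.

```python
from collections import defaultdict

def avg_map(row):
    a, b = row          # key on A, value is B
    return (a, b)

def avg_reduce(a, b_values):
    return (a, sum(b_values) / len(b_values))

rows = [(1, 10), (2, 20), (1, 10),             # map 1's split
        (1, 30), (3, 40), (2, 10), (1, 20)]    # map 2's split
groups = defaultdict(list)
for row in rows:
    a, b = avg_map(row)
    groups[a].append(b)
print([avg_reduce(a, vs) for a, vs in sorted(groups.items())])
# [(1, 17.5), (2, 15.0), (3, 40.0)]
```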

  16. Designing a Map-Reduce Algorithm • Thinking in terms of Map and Reduce • What data should be the key? • What data should be the values? • Minimizing cost • Reading and map-processing cost • Communication cost • Processing cost at the reducer • Load balancing • All reducers should receive a similar volume of traffic • It should not happen that only a few machines do most of the work while the rest sit idle

  17. Join On Point Data • Select R.A, R.B, S.D where R.A == S.A • Each map tags its tuples with the relation they come from and emits them keyed on A; each reducer pairs the R and S tuples that share a key
  MAP 1 (R): (1, [R, 10]) (2, [R, 20]) (1, [R, 10]) (1, [R, 30]) (3, [R, 40])
  MAP 2 (S): (1, [S, 20]) (2, [S, 30]) (2, [S, 10]) (3, [S, 50]) (3, [S, 40])
  Reducer 1: (1, [(R, 10), (R, 10), (R, 30), (S, 20)]) → (1, 10, 20) (1, 10, 20) (1, 30, 20)
  Reducer 2: (2, [(R, 20), (S, 30), (S, 10)]) (3, [(R, 40), (S, 50), (S, 40)]) → (2, 20, 30) (2, 20, 10) (3, 40, 50) (3, 40, 40)

  18. Join On Point Data • Select R.A, R.B, S.D where R.A == S.A • The range of attribute A is divided into k parts: a hash function maps the value of A to a bucket in [1, …, k] • One reducer is defined for each of the k buckets • A tuple from R or S is communicated to reducer i if the value of R.A or S.A hashes to bucket i • Each reducer computes its part of the join output
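
  A minimal in-memory sketch of this repartition (reduce-side) join, using the R and S tuples from the previous slide; the grouping dictionary stands in for the hash-partitioned shuffle, and the helper names are made up:

```python
from collections import defaultdict

def join_map(relation, tup):
    """Map side: key each tuple on A and tag it with its relation."""
    a, other = tup                      # (A, B) for R, (A, D) for S
    return (a, (relation, other))

def join_reduce(a, tagged_values):
    """Reduce side: pair every R value with every S value for this A."""
    r_vals = [v for tag, v in tagged_values if tag == "R"]
    s_vals = [v for tag, v in tagged_values if tag == "S"]
    return [(a, b, d) for b in r_vals for d in s_vals]

R = [(1, 10), (2, 20), (1, 10), (1, 30), (3, 40)]   # (A, B)
S = [(1, 20), (2, 30), (2, 10), (3, 50), (3, 40)]   # (A, D)

groups = defaultdict(list)
for rel, tuples in (("R", R), ("S", S)):
    for t in tuples:
        key, value = join_map(rel, t)
        groups[key].append(value)

for a in sorted(groups):
    print(join_reduce(a, groups[a]))
# [(1, 10, 20), (1, 10, 20), (1, 30, 20)]
# [(2, 20, 30), (2, 20, 10)]
# [(3, 40, 50), (3, 40, 40)]
```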

  19. Join On Point Data • Assume k = 3 and h(1) = 0, h(2) = 1, h(3) = 2
  Bucket 0 gets the tuples with A = 1: R1 (1 10 12), R3 (1 10 22), R4 (1 30 56), S1 (1 20 22) → joined pairs R1·S1, R3·S1, R4·S1
  Bucket 1 gets the tuples with A = 2: R2 (2 20 34), S2 (2 30 36), S3 (2 10 29) → joined pairs R2·S2, R2·S3
  Bucket 2 gets the tuples with A = 3: R5 (3 40 17), S4 (3 50 16), S5 (3 40 37) → joined pairs R5·S4, R5·S5

  20. Map-Reduce Example: Inequality Join • Select R.A, R.B, S.D where R.A <= S.A • Consider a 3-node cluster with one reducer r_i per value of A (i = 1, 2, 3) • An S tuple with S.A = i is sent only to reducer r_i; an R tuple with R.A = a is replicated to every reducer r_i with i >= a, since it can join with any S tuple whose A value is at least a
  MAP 1 (R): (r1, [R, 1, 10]) (r2, [R, 1, 10]) (r3, [R, 1, 10]) (r2, [R, 2, 20]) (r3, [R, 2, 20]) … (r3, [R, 3, 40])
  MAP 2 (S): (r1, [S, 1, 20]) (r2, [S, 2, 30]) (r2, [S, 2, 10]) (r3, [S, 3, 50]) (r3, [S, 3, 40])
  Reducer 1: ([R, 1, 10], [R, 1, 10], [R, 1, 30], [S, 1, 20]) → (1, 10, 20) (1, 10, 20) (1, 30, 20)
  Reducer 2: …
  Reducer 3: ([R, 1, 10], [R, 2, 20], [R, 1, 10], [R, 1, 30], [R, 3, 40], [S, 3, 50], [S, 3, 40]) → (1, 10, 50) (1, 10, 40) (1, 10, 50) (1, 10, 40) (1, 30, 50) (1, 30, 40) (2, 20, 50) (2, 20, 40) (3, 40, 50) (3, 40, 40)
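
  A sketch of the replication scheme in the same style, with one reducer per value of A as on the slide; the explicit ra <= sa check in the reducer is redundant given the replication but keeps the join predicate visible:

```python
from collections import defaultdict

BUCKETS = [1, 2, 3]   # one reducer per value of A, as on the slide

def map_r(a, b):
    # An R tuple can join with S tuples in any bucket >= its own A,
    # so replicate it to every such reducer.
    return [(i, ("R", a, b)) for i in BUCKETS if i >= a]

def map_s(a, d):
    # An S tuple only joins within its own bucket.
    return [(a, ("S", a, d))]

def reduce_join(tagged):
    r_tuples = [(a, b) for tag, a, b in tagged if tag == "R"]
    s_tuples = [(a, d) for tag, a, d in tagged if tag == "S"]
    return [(ra, rb, sd) for ra, rb in r_tuples
                         for sa, sd in s_tuples if ra <= sa]

R = [(1, 10), (1, 10), (1, 30), (2, 20), (3, 40)]   # (A, B)
S = [(1, 20), (2, 30), (2, 10), (3, 50), (3, 40)]   # (A, D)

reducers = defaultdict(list)
for a, b in R:
    for key, value in map_r(a, b):
        reducers[key].append(value)
for a, d in S:
    for key, value in map_s(a, d):
        reducers[key].append(value)

for i in sorted(reducers):
    print(f"reducer {i}: {reduce_join(reducers[i])}")
# reducer 1 yields the three (1, _, 20) tuples from the slide,
# reducer 3 the ten tuples that join with S.A = 3.
```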

  21. Why Is Join on Map-Reduce a Complex Task? • Data for multiple relations is distributed across different machines • Map-Reduce is inherently designed for processing a single dataset • An output tuple can be generated only when all the input tuples it joins are collected at a common machine • Ensuring this for all output tuples is non-trivial • A priori, we don't know which tuples are going to join to form an output tuple; that is precisely the join problem • Guaranteeing it may involve a lot of replication and hence a lot of communication • Tuples from every candidate combination need to be collected at reducers and the join predicates need to be checked
