
INTRODUCTION TO HADOOP & MAP-REDUCE

Discover what Hadoop is and how it can process huge amounts of data in a distributed environment. Learn about HDFS, Map-Reduce, and the Hadoop ecosystem.


Presentation Transcript


  1. INTRODUCTION TO HADOOP & MAP-REDUCE

  2. What is Hadoop? An open-source, batch/offline-oriented, data- and I/O-intensive, general-purpose framework for building distributed applications that process huge amounts of data. HUGE means: • A few thousand machines • Petabytes of data • Thousands of jobs processed each week What is not Hadoop? • A relational database • An OLTP system • A structured data store of any kind

  3. Hadoop vs Relational • General Purpose vs Relational Data • User Control vs System Defined • No Schema vs Schema • Key-Value Pairs vs Tables • Offline/batch vs Online/Real-time

  4. Hadoop Eco-System • HDFS • Hadoop Distributed File System • Map-Reduce System • A distributed framework for executing work in parallel • Hive/Pig/JAQL • SQL-like languages to manipulate relational data on HDFS • HBase • A column store on Hadoop • Misc • Avro, Ganglia, Sqoop, ZooKeeper, Mahout

  5. HDFS • Hadoop Distributed File System • Stores files in blocks across many nodes in a cluster • Default block size – 64 MB • Replicates the blocks across nodes for durability • Master/Slave Architecture • HDFS Master: NameNode • Runs on a single node as a master process • Directs client access to files in HDFS • HDFS Slave: DataNode • Runs on all other nodes in the cluster • Handles block creation/replication/deletion • Takes orders from the NameNode
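A minimal sketch of how a client program talks to HDFS through the Java API (the file path below is a made-up placeholder, and the cluster address is assumed to come from the standard configuration files). The client asks the NameNode for metadata and block placement, while the file bytes stream directly to and from DataNodes:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.BlockLocation;
  import org.apache.hadoop.fs.FSDataOutputStream;
  import org.apache.hadoop.fs.FileStatus;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class HdfsPutExample {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();  // reads core-site.xml / hdfs-site.xml
      FileSystem fs = FileSystem.get(conf);      // NameNode resolved from fs.defaultFS

      // Write a file: the NameNode picks target DataNodes, then the
      // client streams the data to them block by block.
      Path file = new Path("/user/demo/File1.txt"); // hypothetical path
      try (FSDataOutputStream out = fs.create(file)) {
        out.writeBytes("hello hdfs\n");
      }

      // Ask the NameNode where the replicas of each block landed.
      FileStatus status = fs.getFileStatus(file);
      for (BlockLocation loc : fs.getFileBlockLocations(status, 0, status.getLen())) {
        System.out.println(loc); // offset, length, and replica hosts
      }
    }
  }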

  6. HDFS [Diagram: a single NameNode coordinating six DataNodes, labeled 1–6]

  7. HDFS [Diagram: putting File1.txt into HDFS. The client asks the NameNode to store the file; its three blocks are replicated across DataNodes (1, 4, 5), (2, 5, 6), and (2, 3, 4)]

  8. HDFS [Diagram: reading the file back. The NameNode returns the block locations, e.g. block 1 on DataNodes (1, 4), block 2 on (2, 6), block 3 on (2, 3), and the client reads different blocks from different DataNodes in parallel] • Aggregate read rate ≈ per-node transfer rate x number of machines read from in parallel; e.g., three DataNodes streaming at 100 MB/s each give roughly 300 MB/s

  9. HDFS • Fault-Tolerant • Handles node failures: if a node dies, its blocks are automatically re-replicated from the remaining replicas (with the default replication factor of 3, from the remaining two copies) • Self-Healing • Rebalances files across the cluster • Scalable • Grows just by adding new nodes

  10. Map-Reduce • Logical functions: Mappers and Reducers • Developers write map and reduce functions, then submit a JAR to the Hadoop cluster • Hadoop handles distributing the map and reduce tasks across the cluster • Typically batch-oriented

  11. Map-Reduce Job-Flow

  12. Word-Count [Diagram: three input lines flow through map, sort/shuffle, and reduce] • Input: "Hadoop Uses Map-Reduce", "There is a Map-Phase", "There is a Reduce phase" • Map: each line is tokenized into (word, 1) pairs, e.g. (hadoop, 1), (uses, 1), (map, 1), (reduce, 1) • Sort/Shuffle: pairs are partitioned by key range across three reducers — A-I gets (a, [1,1]), (hadoop, [1]), (is, [1,1]); J-Q gets (map, [1,1]), (phase, [1,1]); R-Z gets (reduce, [1,1]), (there, [1,1]), (uses, [1]) • Reduce: each reducer sums its value lists — (a, 2), (hadoop, 1), (is, 2); (map, 2), (phase, 2); (reduce, 2), (there, 2), (uses, 1)

  13. Map-Reduce Daemons • Job-Tracker (Master) • Manages map-reduce jobs • Partitions tasks across different nodes • Manages task failures and restarts tasks on different nodes • Speculative execution: runs backup copies of slow tasks • Task-Tracker (Slave) • Creates and runs individual map and reduce tasks • Reports task status to the Job-Tracker

  14. Word-Count Map

  // Mapper<LongWritable, Text, Text, IntWritable>: input key/value = (byte offset, line of text),
  // output key/value = (word, count)
  public class WordCountMap extends Mapper<LongWritable, Text, Text, IntWritable> {
    public void map(LongWritable key, Text line, Context context)
        throws IOException, InterruptedException {
      String[] tokens = line.toString().split("\\s+"); // Tokenize(line) on the slide, spelled out
      for (String token : tokens) {
        context.write(new Text(token), new IntWritable(1)); // emit (word, 1)
      }
    }
  }

  15. Word-Count Reduce

  // Reducer<Text, IntWritable, Text, IntWritable>: input (word, [1, 1, ...]) from the shuffle,
  // output (word, total)
  public class WordCountReduce extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable value : values) sum += value.get(); // count(values) on the slide, spelled out
      context.write(key, new IntWritable(sum));
    }
  }

  16. Word-Count Runner Class

  public class WordCountRunner {
    public static void main(String[] args) throws Exception {
      Job job = new Job();
      job.setJarByClass(WordCountRunner.class);
      job.setMapperClass(WordCountMap.class);
      job.setReducerClass(WordCountReduce.class);
      FileInputFormat.addInputPath(job, new Path(args[0]));   // input files path
      FileOutputFormat.setOutputPath(job, new Path(args[1])); // output path
      job.setMapOutputKeyClass(Text.class);
      job.setMapOutputValueClass(IntWritable.class);
      job.setOutputKeyClass(Text.class);
      job.setOutputValueClass(IntWritable.class);
      job.setNumReduceTasks(1);
      job.waitForCompletion(true);
    }
  }

  17. Running a Job • ./bin/hadoop jar WC.jar WordCountRunner <input-dir> <output-dir>

  18. Cluster View of a MR Job Flow [Diagram: the client submits a JAR to the JobTracker; the JobTracker assigns map (M) and reduce (R) tasks to TaskTrackers running alongside the DataNodes; (k, v) pairs flow from the MAP PHASE through SHUFFLE/SORT into the REDUCE PHASE, after which the job is finished]

  19. Map-Reduce Example: Aggregation • Compute Avg(B) for each distinct value of A [Diagram: • MAP 1 reads the tuples (1, 10), (2, 20), (1, 10) and MAP 2 reads (1, 30), (3, 40), (2, 10), (1, 20); each mapper emits (A, B) pairs • The shuffle groups the pairs by A: Reducer 1 receives (1, [10, 10, 30, 20]); Reducer 2 receives (2, [20, 10]) and (3, [40]) • The reducers output the averages (1, 17.5), (2, 15), (3, 40)]
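A minimal Java sketch of this aggregation, assuming for illustration that the input is text lines of the form "A,B" (the slides do not specify an input format); each class would live in its own file:

  import java.io.IOException;
  import org.apache.hadoop.io.DoubleWritable;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Mapper;
  import org.apache.hadoop.mapreduce.Reducer;

  // Mapper: parse "A,B" and emit (A, B) so the shuffle groups all B values by A.
  class AvgMap extends Mapper<LongWritable, Text, IntWritable, DoubleWritable> {
    public void map(LongWritable key, Text line, Context context)
        throws IOException, InterruptedException {
      String[] f = line.toString().split(",");
      context.write(new IntWritable(Integer.parseInt(f[0].trim())),
                    new DoubleWritable(Double.parseDouble(f[1].trim())));
    }
  }

  // Reducer: average the grouped values, e.g. (1, [10, 10, 30, 20]) -> (1, 17.5).
  class AvgReduce extends Reducer<IntWritable, DoubleWritable, IntWritable, DoubleWritable> {
    public void reduce(IntWritable key, Iterable<DoubleWritable> values, Context context)
        throws IOException, InterruptedException {
      double sum = 0;
      int count = 0;
      for (DoubleWritable v : values) { sum += v.get(); count++; }
      context.write(key, new DoubleWritable(sum / count));
    }
  }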

  20. Map-Reduce Example: Join • Select R.A, R.B, S.D where R.A == S.A [Diagram: • MAP 1 reads R = {(1, 10), (2, 20), (1, 10), (1, 30), (3, 40)} and emits each tuple keyed by A and tagged with its relation: (1, [R, 10]), (2, [R, 20]), (1, [R, 10]), (1, [R, 30]), (3, [R, 40]) • MAP 2 reads S = {(1, 20), (2, 30), (2, 10), (3, 50), (3, 40)} and emits (1, [S, 20]), (2, [S, 30]), (2, [S, 10]), (3, [S, 50]), (3, [S, 40]) • Reducer 1 receives (1, [(R, 10), (R, 10), (R, 30), (S, 20)]) and joins the R and S tuples sharing key 1, producing (1, 10, 20), (1, 10, 20), (1, 30, 20) • Reducer 2 receives (2, [(R, 20), (S, 30), (S, 10)]) and (3, [(R, 40), (S, 50), (S, 40)]), producing (2, 20, 30), (2, 20, 10), (3, 40, 50), (3, 40, 40)]
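A minimal sketch of this reduce-side equi-join in Java. The "R,a,b" / "S,a,d" line encoding and the comma-tagged intermediate values are illustrative choices, not from the slides:

  import java.io.IOException;
  import java.util.ArrayList;
  import java.util.List;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Mapper;
  import org.apache.hadoop.mapreduce.Reducer;

  // Mapper: key each tuple by the join attribute A and tag the value with its
  // source relation, so the reducer can tell R tuples from S tuples.
  class JoinMap extends Mapper<LongWritable, Text, IntWritable, Text> {
    public void map(LongWritable key, Text line, Context context)
        throws IOException, InterruptedException {
      String[] f = line.toString().split(","); // e.g. "R,1,10" or "S,1,20"
      context.write(new IntWritable(Integer.parseInt(f[1].trim())),
                    new Text(f[0].trim() + "," + f[2].trim())); // "R,10" / "S,20"
    }
  }

  // Reducer: for one value of A, buffer the R side and cross it with the S side.
  class JoinReduce extends Reducer<IntWritable, Text, IntWritable, Text> {
    public void reduce(IntWritable key, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
      List<String> rSide = new ArrayList<>();
      List<String> sSide = new ArrayList<>();
      for (Text value : values) {
        String[] f = value.toString().split(",");
        (f[0].equals("R") ? rSide : sSide).add(f[1]);
      }
      for (String b : rSide) {    // emit (A, B, D) for every R-S pair
        for (String d : sSide) {  // sharing this value of A
          context.write(key, new Text(b + "," + d));
        }
      }
    }
  }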

  21. Map-Reduce Example: Inequality Join • Select R.A, R.B, S.D where R.A <= S.A • Consider a 3-node cluster with reducers r1, r2, r3, where reducer rj owns the S tuples with A == j [Diagram: • MAP 1 replicates each R tuple (tagged [R, A, B]) to every reducer rj with j >= R.A — e.g. (1, 10) goes to r1, r2, and r3 as (r1, [R, 1, 10]), (r2, [R, 1, 10]), (r3, [R, 1, 10]), while (2, 20) goes only to r2 and r3, and (3, 40) only to r3 • MAP 2 sends each S tuple (tagged [S, A, D]) only to its own reducer: (r1, [S, 1, 20]), (r2, [S, 2, 30]), (r2, [S, 2, 10]), (r3, [S, 3, 50]), (r3, [S, 3, 40]) • Reducer 1 receives (r1, ([R, 1, 10], [R, 1, 10], [R, 1, 30], [S, 1, 20])) and joins R tuples with A <= 1 against S tuples with A == 1; Reducer 3 receives every R tuple plus [S, 3, 50] and [S, 3, 40] • The combined output includes (1, 10, 20), (1, 10, 20), (1, 30, 20), (1, 10, 50), (1, 10, 40), (1, 30, 50), (1, 30, 40), (2, 20, 50), (2, 20, 40), (3, 40, 50), (3, 40, 40), and likewise for reducer 2]
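A sketch of just the mapper side of this replicated join, under the slide's toy assumptions that A ranges over 1..3 and reducer j handles the S tuples with A == j (the "R,a,b" / "S,a,d" line encoding is again a made-up placeholder):

  import java.io.IOException;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Mapper;

  // Mapper for R.A <= S.A: an S tuple goes only to the reducer that owns its
  // A value, while an R tuple is copied to every reducer j >= R.A, so each
  // reducer can compute its share of the join locally.
  class ThetaJoinMap extends Mapper<LongWritable, Text, IntWritable, Text> {
    private static final int NUM_REDUCERS = 3; // matches the 3-node example

    public void map(LongWritable key, Text line, Context context)
        throws IOException, InterruptedException {
      String[] f = line.toString().split(","); // "R,1,10" or "S,3,50"
      int a = Integer.parseInt(f[1].trim());
      if (f[0].trim().equals("S")) {
        context.write(new IntWritable(a), line);   // S stays on its own reducer
      } else {
        for (int j = a; j <= NUM_REDUCERS; j++) {
          context.write(new IntWritable(j), line); // replicate R upward
        }
      }
    }
  }

The price of this scheme is the extra communication for the replicated R tuples and the skew it creates: the highest-numbered reducer receives every R tuple, which is exactly the load-balancing concern raised on the next slide.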

  22. Designing a Map-Reduce Algorithm • Thinking in terms of Map and Reduce • What data should be the key? • What data should be the values? • Minimizing Cost • Reading cost • Communication cost • Processing cost at the reducer • Load Balancing • All reducers should receive a similar volume of traffic • It should not happen that a few machines are overloaded while the others sit idle

  23. SQL-Like Languages For Map-Reduce • Hive, Pig, JAQL • A user need not write native Java map-reduce code • SQL-like statements can be written to process data on Hadoop • Allows users without a deep understanding of map-reduce to work with data stored on HDFS

  24. JAQL • A simpler language for writing Map-Reduce jobs • Reduces the barrier to Hadoop use by eliminating the need for many users to write Java programs • Exploits massive parallelism using Hadoop • Provides a simple yet powerful language to manipulate semi-structured data • Uses JSON as its data model • Most data has a natural JSON representation • Easily extended using Java, Python, JavaScript • Inspired by UNIX pipes • Other such languages: Hive, Pig • Resources • http://code.google.com/p/jaql • http://jaql.org

  25. JavaScript Object Notation (JSON) • JSON has arrays, records, strings, numbers, booleans, and null • [] == array, {} == record or object, x: == field name • $emp = [ {name: "Jon Doe", income: 20000, mgr: false}, {name: "Vince Wayne", income: 32500, mgr: false} ] • $emp = [ {name: "Jon Doe", income: 20000, mgr: false, dob: {day: 1, month: 1, year: 1975}}, {name: "Vince Wayne", income: 32500, mgr: false, dob: {day: 1, month: 2, year: 1978}} ] • $emp = [ {name: "Jon Doe", income: 20000, mgr: false, skills: ["java", "C++", "Hadoop"]}, {name: "Vince Wayne", income: 32500, mgr: false, skills: ["java", "DB2", "SQL"]} ] • $emp = [ {name: "Jon Doe", income: 20000, mgr: false, exp: [{org: "IBM", from: 2000, to: 2005}, {org: "yahoo", from: 2005, to: 2010}]}, {name: "Vince Wayne", income: 32500, mgr: false, exp: [{org: "IBM", from: 2000, to: 2003}, {org: "oracle", from: 2003, to: 2010}]} ]

  26. Accessing Data • $emp = [ {name: "Jon Doe", income: 20000, mgr: false, skills: ["java", "C++", "Hadoop"], exp: [{org: "IBM", from: 2000, to: 2005}, {org: "yahoo", from: 2005, to: 2010}]}, {name: "Vince Wayne", income: 32500, mgr: false, skills: ["java", "C++", "Hadoop"], exp: [{org: "IBM", from: 2000, to: 2003}, {org: "oracle", from: 2003, to: 2010}]} ] • $emp[0] = {name: "Jon Doe", income: 20000, mgr: false, skills: ["java", "C++", "Hadoop"], exp: [{org: "IBM", from: 2000, to: 2005}, {org: "yahoo", from: 2005, to: 2010}]} • $emp[0].name = "Jon Doe" • $emp[0].exp[0] = {org: "IBM", from: 2000, to: 2005} • $emp[0].exp[0].org = "IBM" • $emp[0].skills[0] = "java" • $emp[*].name = ["Jon Doe", "Vince Wayne"] • $emp[0].exp[*].org = ["IBM", "yahoo"] • $emp[*].exp[*].org = [["IBM", "yahoo"], ["IBM", "oracle"]]

  27. JAQL core functionalities • Filter • Transform • Group • Join • Sort • Expand

  28. Filter • $input -> filter <boolean expression>; • In <boolean expression> the variable $ is bound to each item of the input • The <boolean expression> can be composed of the relations ==, !=, >, >=, <, <= • Complex expressions can be created with not, and, or which are evaluated in this order • If the <boolean expression> evaluates to true, the item from the input is included in the output

  29. Filter Example • $employees = [ {name: "Jon Doe", income: 20000, mgr: false}, {name: "Vince Wayne", income: 32500, mgr: false}, {name: "Jane Dean", income: 72000, mgr: true}, {name: "Alex Smith", income: 25000, mgr: false} ]; • $employees -> filter $.mgr or $.income > 30000; • [ {"income": 32500, "mgr": false, "name": "Vince Wayne"}, {"income": 72000, "mgr": true, "name": "Jane Dean"} ]

  30. Group By • $input -> group by <variable> = <grouping items> into <expression> • Similar to SQL group-by • $ is bound to the grouped items • To get an array of all values for an item that are aggregated into one group, use $[*]

  31. Group By Example • $employees = [ {id: 1, dept: 1, band: 7, income: 12000}, {id: 2, dept: 1, band: 8, income: 13000}, {id: 3, dept: 2, band: 7, income: 15000}, {id: 4, dept: 1, band: 8, income: 10000}, {id: 5, dept: 3, band: 7, income: 8000}, {id: 6, dept: 2, band: 8, income: 5000}, {id: 7, dept: 1, band: 7, income: 24000} ]; • $employees -> group by $dept = $.dept into {$dept, total: sum($[*].income)}; [ {dept: 1, total: 59000}, {dept: 2, total: 20000}, {dept: 3, total: 8000} ] • $employees -> group by $dept_group = $.dept into {$dept_group, total: sum($[*].income)}; • $employees -> group by $dept_group = {$.dept, $.band} into {$dept_group.*, total: sum($[*].income)}; • $employees -> group by $dept_group = {$.dept, $.band} into {$dept_group, total: sum($[*].income)};

  32. Join • join <variable-list> where <join-condition(s)> into <expression> • <variable list> contains two or more variables that should share at least one attribute • <join condition(s)>: only equality predicates are allowed • <expression> is applied to all items from the input that match the join condition. To copy all fields of an input, use $input.* • Add the keyword 'preserve' to make it a full outer join

  33. Join Example • $users = [ {name: "Jon Doe", password: "asdf1234", id: 1}, {name: "Jane Doe", password: "qwertyui", id: 2}, {name: "Max Mustermann", password: "q1w2e3r4", id: 3} ]; $pages = [ {userid: 1, url: "code.google.com/p/jaql/"}, {userid: 2, url: "www.cnn.com"}, {userid: 1, url: "java.sun.com/javase/6/docs/api/"} ] • join $users, $pages where $users.id == $pages.userid into {$users.name, $pages.*} • [ {"name": "Jon Doe", "url": "code.google.com/p/jaql/", "userid": 1}, {"name": "Jon Doe", "url": "java.sun.com/javase/6/docs/api/", "userid": 1}, {"name": "Jane Doe", "url": "www.cnn.com", "userid": 2} ]

  34. IBM InfoSphere BigInsights • IBM's offering for managing Big Data • Powered by Hadoop and other components • Provides a fully tested environment

  35. Recap • Introduction to Apache Hadoop • HDFS and the Map-Reduce programming framework • NameNode, DataNode • JobTracker, TaskTracker • Map and Reduce method signatures • Word-Count example • Flow in Map-Reduce • Java implementation • More Map-Reduce examples • Aggregation, equi-join and inequality join • Introduction to JAQL and IBM BigInsights

  36. Advanced Concepts In Hadoop • Map-Reduce Programming Framework • Combiner, Counter, Partitioner, Distributed-Cache • Hadoop I/O • Input-Formats and Output-Formats • Input and Output-Formats provided by Hadoop • Writing Custom Input and Output Formats • Passing custom objects as key-values • Chaining Map-Reduce Jobs • Hadoop Tuning and Optimization • Configuration Parameters • Hadoop Eco-System • Hive/Pig/JAQL • HBase • Avro, ZooKeeper, Mahout, Sqoop, Ganglia etc. • An Overview of Hadoop Research • Join Processing : Multi-way equi and theta joins, set-similarity joins, k-NN joins, interval and spatial joins • Graph Processing, Text Processing etc • Systems : ReStore, PerfXPlain, Stubby, RAMP, HadoopDB etc.

  37. References • Hadoop: The Definitive Guide, O'Reilly Press • Pro Hadoop: Build Scalable, Distributed Applications in the Cloud • Hadoop Tutorial: http://developer.yahoo.com/hadoop/tutorial/ • www.slideshare.net
