
INTRODUCTION TO HADOOP & MAP-REDUCE

Discover what Hadoop is and how it can process huge amounts of data in a distributed environment. Learn about HDFS, Map-Reduce, and the Hadoop ecosystem.


Presentation Transcript


  1. INTRODUCTION TO HADOOP & MAP-REDUCE

  2. What is Hadoop? An open-source, batch/offline-oriented, data- and I/O-intensive, general-purpose framework for building distributed applications that process huge amounts of data. HUGE means: • A few thousand machines • Petabytes of data • Thousands of jobs processed each week What is not Hadoop? • A relational database • An OLTP system • A structured data store of any kind

  3. Hadoop vs Relational • General Purpose vs Relational Data • User Control vs System Defined • No Schema vs Schema • Key-Value Pairs vs Tables • Offline/batch vs Online/Real-time

  4. Hadoop Eco-System • HDFS • Hadoop Distributed File System • Map-Reduce System • A distributed framework for executing work in parallel • Hive/Pig/JAQL • SQL-like languages to manipulate relational data on HDFS • HBase • A column store on Hadoop • Misc • Avro, Ganglia, Sqoop, ZooKeeper, Mahout

  5. HDFS • Hadoop Distributed File System • Stores files in blocks across many nodes in a cluster • Default block size – 64 MB • Replicates the blocks across nodes for durability • Master/Slave Architecture • HDFS Master: NameNode • Runs on a single node as a master process • Directs client access to files in HDFS • HDFS Slave: DataNode • Runs on all other nodes in the cluster • Handles block creation/replication/deletion • Takes orders from the NameNode
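A minimal sketch of how a client program talks to HDFS through the Java API (the file path below is a made-up placeholder, and the cluster address is assumed to come from the standard configuration files). The client asks the NameNode for metadata and block placement, while the file bytes stream directly to and from DataNodes:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.BlockLocation;
  import org.apache.hadoop.fs.FSDataOutputStream;
  import org.apache.hadoop.fs.FileStatus;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  public class HdfsPutExample {
    public static void main(String[] args) throws Exception {
      Configuration conf = new Configuration();  // reads core-site.xml / hdfs-site.xml
      FileSystem fs = FileSystem.get(conf);      // NameNode resolved from fs.defaultFS

      // Write a file: the NameNode picks target DataNodes, then the
      // client streams the data to them block by block.
      Path file = new Path("/user/demo/File1.txt"); // hypothetical path
      try (FSDataOutputStream out = fs.create(file)) {
        out.writeBytes("hello hdfs\n");
      }

      // Ask the NameNode where the replicas of each block landed.
      FileStatus status = fs.getFileStatus(file);
      for (BlockLocation loc : fs.getFileBlockLocations(status, 0, status.getLen())) {
        System.out.println(loc); // offset, length, and replica hosts
      }
    }
  }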

  6. HDFS [Diagram: a single NameNode coordinating six DataNodes, labeled 1–6]

  7. HDFS [Diagram: putting File1.txt into HDFS. The client asks the NameNode to store the file; its three blocks are replicated across DataNodes (1, 4, 5), (2, 5, 6), and (2, 3, 4)]

  8. HDFS [Diagram: reading the file back. The NameNode returns the block locations, e.g. block 1 on DataNodes (1, 4), block 2 on (2, 6), block 3 on (2, 3), and the client reads different blocks from different DataNodes in parallel] • Aggregate read rate ≈ per-node transfer rate x number of machines read from in parallel; e.g., three DataNodes streaming at 100 MB/s each give roughly 300 MB/s

  9. HDFS • Fault-Tolerant • Handles node failures: if a node dies, its blocks are automatically re-replicated from the remaining replicas (with the default replication factor of 3, from the remaining two copies) • Self-Healing • Rebalances files across the cluster • Scalable • Grows just by adding new nodes

  10. Map-Reduce • Logical functions: Mappers and Reducers • Developers write map and reduce functions, then submit a JAR to the Hadoop cluster • Hadoop handles distributing the map and reduce tasks across the cluster • Typically batch-oriented

  11. Map-Reduce Job-Flow

  12. Word-Count [Diagram: three input lines flow through map, sort/shuffle, and reduce] • Input: "Hadoop Uses Map-Reduce", "There is a Map-Phase", "There is a Reduce phase" • Map: each line is tokenized into (word, 1) pairs, e.g. (hadoop, 1), (uses, 1), (map, 1), (reduce, 1) • Sort/Shuffle: pairs are partitioned by key range across three reducers — A-I gets (a, [1,1]), (hadoop, [1]), (is, [1,1]); J-Q gets (map, [1,1]), (phase, [1,1]); R-Z gets (reduce, [1,1]), (there, [1,1]), (uses, [1]) • Reduce: each reducer sums its value lists — (a, 2), (hadoop, 1), (is, 2); (map, 2), (phase, 2); (reduce, 2), (there, 2), (uses, 1)

  13. Map-Reduce Daemons • Job-Tracker (Master) • Manages map-reduce jobs • Partitions tasks across different nodes • Manages task failures and restarts tasks on different nodes • Speculative execution: runs backup copies of slow tasks • Task-Tracker (Slave) • Creates and runs individual map and reduce tasks • Reports task status to the Job-Tracker

  14. Word-Count Map

  // Mapper<LongWritable, Text, Text, IntWritable>: input key/value = (byte offset, line of text),
  // output key/value = (word, count)
  public class WordCountMap extends Mapper<LongWritable, Text, Text, IntWritable> {
    public void map(LongWritable key, Text line, Context context)
        throws IOException, InterruptedException {
      String[] tokens = line.toString().split("\\s+"); // Tokenize(line) on the slide, spelled out
      for (String token : tokens) {
        context.write(new Text(token), new IntWritable(1)); // emit (word, 1)
      }
    }
  }

  15. Word-Count Reduce

  // Reducer<Text, IntWritable, Text, IntWritable>: input (word, [1, 1, ...]) from the shuffle,
  // output (word, total)
  public class WordCountReduce extends Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable value : values) sum += value.get(); // count(values) on the slide, spelled out
      context.write(key, new IntWritable(sum));
    }
  }

  16. Word-Count Runner Class

  public class WordCountRunner {
    public static void main(String[] args) throws Exception {
      Job job = new Job();
      job.setJarByClass(WordCountRunner.class);
      job.setMapperClass(WordCountMap.class);
      job.setReducerClass(WordCountReduce.class);
      FileInputFormat.addInputPath(job, new Path(args[0]));   // input files path
      FileOutputFormat.setOutputPath(job, new Path(args[1])); // output path
      job.setMapOutputKeyClass(Text.class);
      job.setMapOutputValueClass(IntWritable.class);
      job.setOutputKeyClass(Text.class);
      job.setOutputValueClass(IntWritable.class);
      job.setNumReduceTasks(1);
      job.waitForCompletion(true);
    }
  }

  17. Running a Job • ./bin/hadoop jar WC.jar WordCountRunner <input-dir> <output-dir>

  18. Cluster View of a MR Job Flow [Diagram: the client submits a JAR to the JobTracker; the JobTracker assigns map (M) and reduce (R) tasks to TaskTrackers running alongside the DataNodes; (k, v) pairs flow from the MAP PHASE through SHUFFLE/SORT into the REDUCE PHASE, after which the job is finished]

  19. Map-Reduce Example: Aggregation • Compute Avg(B) for each distinct value of A [Diagram: • MAP 1 reads the tuples (1, 10), (2, 20), (1, 10) and MAP 2 reads (1, 30), (3, 40), (2, 10), (1, 20); each mapper emits (A, B) pairs • The shuffle groups the pairs by A: Reducer 1 receives (1, [10, 10, 30, 20]); Reducer 2 receives (2, [20, 10]) and (3, [40]) • The reducers output the averages (1, 17.5), (2, 15), (3, 40)]
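A minimal Java sketch of this aggregation, assuming for illustration that the input is text lines of the form "A,B" (the slides do not specify an input format); each class would live in its own file:

  import java.io.IOException;
  import org.apache.hadoop.io.DoubleWritable;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Mapper;
  import org.apache.hadoop.mapreduce.Reducer;

  // Mapper: parse "A,B" and emit (A, B) so the shuffle groups all B values by A.
  class AvgMap extends Mapper<LongWritable, Text, IntWritable, DoubleWritable> {
    public void map(LongWritable key, Text line, Context context)
        throws IOException, InterruptedException {
      String[] f = line.toString().split(",");
      context.write(new IntWritable(Integer.parseInt(f[0].trim())),
                    new DoubleWritable(Double.parseDouble(f[1].trim())));
    }
  }

  // Reducer: average the grouped values, e.g. (1, [10, 10, 30, 20]) -> (1, 17.5).
  class AvgReduce extends Reducer<IntWritable, DoubleWritable, IntWritable, DoubleWritable> {
    public void reduce(IntWritable key, Iterable<DoubleWritable> values, Context context)
        throws IOException, InterruptedException {
      double sum = 0;
      int count = 0;
      for (DoubleWritable v : values) { sum += v.get(); count++; }
      context.write(key, new DoubleWritable(sum / count));
    }
  }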

  20. Map-Reduce Example: Join • Select R.A, R.B, S.D where R.A == S.A [Diagram: • MAP 1 reads R = {(1, 10), (2, 20), (1, 10), (1, 30), (3, 40)} and emits each tuple keyed by A and tagged with its relation: (1, [R, 10]), (2, [R, 20]), (1, [R, 10]), (1, [R, 30]), (3, [R, 40]) • MAP 2 reads S = {(1, 20), (2, 30), (2, 10), (3, 50), (3, 40)} and emits (1, [S, 20]), (2, [S, 30]), (2, [S, 10]), (3, [S, 50]), (3, [S, 40]) • Reducer 1 receives (1, [(R, 10), (R, 10), (R, 30), (S, 20)]) and joins the R and S tuples sharing key 1, producing (1, 10, 20), (1, 10, 20), (1, 30, 20) • Reducer 2 receives (2, [(R, 20), (S, 30), (S, 10)]) and (3, [(R, 40), (S, 50), (S, 40)]), producing (2, 20, 30), (2, 20, 10), (3, 40, 50), (3, 40, 40)]
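A minimal sketch of this reduce-side equi-join in Java. The "R,a,b" / "S,a,d" line encoding and the comma-tagged intermediate values are illustrative choices, not from the slides:

  import java.io.IOException;
  import java.util.ArrayList;
  import java.util.List;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Mapper;
  import org.apache.hadoop.mapreduce.Reducer;

  // Mapper: key each tuple by the join attribute A and tag the value with its
  // source relation, so the reducer can tell R tuples from S tuples.
  class JoinMap extends Mapper<LongWritable, Text, IntWritable, Text> {
    public void map(LongWritable key, Text line, Context context)
        throws IOException, InterruptedException {
      String[] f = line.toString().split(","); // e.g. "R,1,10" or "S,1,20"
      context.write(new IntWritable(Integer.parseInt(f[1].trim())),
                    new Text(f[0].trim() + "," + f[2].trim())); // "R,10" / "S,20"
    }
  }

  // Reducer: for one value of A, buffer the R side and cross it with the S side.
  class JoinReduce extends Reducer<IntWritable, Text, IntWritable, Text> {
    public void reduce(IntWritable key, Iterable<Text> values, Context context)
        throws IOException, InterruptedException {
      List<String> rSide = new ArrayList<>();
      List<String> sSide = new ArrayList<>();
      for (Text value : values) {
        String[] f = value.toString().split(",");
        (f[0].equals("R") ? rSide : sSide).add(f[1]);
      }
      for (String b : rSide) {    // emit (A, B, D) for every R-S pair
        for (String d : sSide) {  // sharing this value of A
          context.write(key, new Text(b + "," + d));
        }
      }
    }
  }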

  21. Map-Reduce Example: Inequality Join • Select R.A, R.B, S.D where R.A <= S.A • Consider a 3-node cluster with reducers r1, r2, r3, where reducer rj owns the S tuples with A == j [Diagram: • MAP 1 replicates each R tuple (tagged [R, A, B]) to every reducer rj with j >= R.A — e.g. (1, 10) goes to r1, r2, and r3 as (r1, [R, 1, 10]), (r2, [R, 1, 10]), (r3, [R, 1, 10]), while (2, 20) goes only to r2 and r3, and (3, 40) only to r3 • MAP 2 sends each S tuple (tagged [S, A, D]) only to its own reducer: (r1, [S, 1, 20]), (r2, [S, 2, 30]), (r2, [S, 2, 10]), (r3, [S, 3, 50]), (r3, [S, 3, 40]) • Reducer 1 receives (r1, ([R, 1, 10], [R, 1, 10], [R, 1, 30], [S, 1, 20])) and joins R tuples with A <= 1 against S tuples with A == 1; Reducer 3 receives every R tuple plus [S, 3, 50] and [S, 3, 40] • The combined output includes (1, 10, 20), (1, 10, 20), (1, 30, 20), (1, 10, 50), (1, 10, 40), (1, 30, 50), (1, 30, 40), (2, 20, 50), (2, 20, 40), (3, 40, 50), (3, 40, 40), and likewise for reducer 2]
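A sketch of just the mapper side of this replicated join, under the slide's toy assumptions that A ranges over 1..3 and reducer j handles the S tuples with A == j (the "R,a,b" / "S,a,d" line encoding is again a made-up placeholder):

  import java.io.IOException;
  import org.apache.hadoop.io.IntWritable;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Mapper;

  // Mapper for R.A <= S.A: an S tuple goes only to the reducer that owns its
  // A value, while an R tuple is copied to every reducer j >= R.A, so each
  // reducer can compute its share of the join locally.
  class ThetaJoinMap extends Mapper<LongWritable, Text, IntWritable, Text> {
    private static final int NUM_REDUCERS = 3; // matches the 3-node example

    public void map(LongWritable key, Text line, Context context)
        throws IOException, InterruptedException {
      String[] f = line.toString().split(","); // "R,1,10" or "S,3,50"
      int a = Integer.parseInt(f[1].trim());
      if (f[0].trim().equals("S")) {
        context.write(new IntWritable(a), line);   // S stays on its own reducer
      } else {
        for (int j = a; j <= NUM_REDUCERS; j++) {
          context.write(new IntWritable(j), line); // replicate R upward
        }
      }
    }
  }

The price of this scheme is the extra communication for the replicated R tuples and the skew it creates: the highest-numbered reducer receives every R tuple, which is exactly the load-balancing concern raised on the next slide.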

  22. Designing a Map-Reduce Algorithm • Thinking in terms of Map and Reduce • What data should be the key? • What data should be the values? • Minimizing Cost • Reading cost • Communication cost • Processing cost at the reducer • Load Balancing • All reducers should receive a similar volume of traffic • It should not happen that a few machines are overloaded while the others sit idle

  23. SQL-Like Languages For Map-Reduce • Hive, Pig, JAQL • A user need not write native Java map-reduce code • SQL-like statements can be written to process data on Hadoop • Allows users without a deep understanding of map-reduce to work with data stored on HDFS

  24. JAQL • A simpler language for writing Map-Reduce jobs • Reduces the barrier to Hadoop use by eliminating the need for many users to write Java programs • Exploits massive parallelism using Hadoop • Provides a simple yet powerful language to manipulate semi-structured data • Uses JSON as its data model • Most data has a natural JSON representation • Easily extended using Java, Python, JavaScript • Inspired by UNIX pipes • Other such languages: Hive, Pig • Resources • http://code.google.com/p/jaql • http://jaql.org

  25. JavaScript Object Notation (JSON) • JSON has arrays, records, strings, numbers, booleans, and null • [] == array, {} == record or object, x: == field name • $emp = [ {name: "Jon Doe", income: 20000, mgr: false}, {name: "Vince Wayne", income: 32500, mgr: false} ] • $emp = [ {name: "Jon Doe", income: 20000, mgr: false, dob: {day: 1, month: 1, year: 1975}}, {name: "Vince Wayne", income: 32500, mgr: false, dob: {day: 1, month: 2, year: 1978}} ] • $emp = [ {name: "Jon Doe", income: 20000, mgr: false, skills: ["java", "C++", "Hadoop"]}, {name: "Vince Wayne", income: 32500, mgr: false, skills: ["java", "DB2", "SQL"]} ] • $emp = [ {name: "Jon Doe", income: 20000, mgr: false, exp: [{org: "IBM", from: 2000, to: 2005}, {org: "yahoo", from: 2005, to: 2010}]}, {name: "Vince Wayne", income: 32500, mgr: false, exp: [{org: "IBM", from: 2000, to: 2003}, {org: "oracle", from: 2003, to: 2010}]} ]

  26. Accessing Data • $emp = [ {name: "Jon Doe", income: 20000, mgr: false, skills: ["java", "C++", "Hadoop"], exp: [{org: "IBM", from: 2000, to: 2005}, {org: "yahoo", from: 2005, to: 2010}]}, {name: "Vince Wayne", income: 32500, mgr: false, skills: ["java", "C++", "Hadoop"], exp: [{org: "IBM", from: 2000, to: 2003}, {org: "oracle", from: 2003, to: 2010}]} ] • $emp[0] = {name: "Jon Doe", income: 20000, mgr: false, skills: ["java", "C++", "Hadoop"], exp: [{org: "IBM", from: 2000, to: 2005}, {org: "yahoo", from: 2005, to: 2010}]} • $emp[0].name = "Jon Doe" • $emp[0].exp[0] = {org: "IBM", from: 2000, to: 2005} • $emp[0].exp[0].org = "IBM" • $emp[0].skills[0] = "java" • $emp[*].name = ["Jon Doe", "Vince Wayne"] • $emp[0].exp[*].org = ["IBM", "yahoo"] • $emp[*].exp[*].org = [["IBM", "yahoo"], ["IBM", "oracle"]]

  27. JAQL core functionalities • Filter • Transform • Group • Join • Sort • Expand

  28. Filter • $input -> filter <boolean expression>; • In <boolean expression> the variable $ is bound to each item of the input • The <boolean expression> can be composed of the relations ==, !=, >, >=, <, <= • Complex expressions can be created with not, and, or which are evaluated in this order • If the <boolean expression> evaluates to true, the item from the input is included in the output

  29. Filter Example • $employees = [ {name: "Jon Doe", income: 20000, mgr: false}, {name: "Vince Wayne", income: 32500, mgr: false}, {name: "Jane Dean", income: 72000, mgr: true}, {name: "Alex Smith", income: 25000, mgr: false} ]; • $employees -> filter $.mgr or $.income > 30000; • [ {"income": 32500, "mgr": false, "name": "Vince Wayne"}, {"income": 72000, "mgr": true, "name": "Jane Dean"} ]

  30. Group By • $input -> group by <variable> = <grouping items> into <expression> • Similar to SQL group-by • $ is bound to the grouped items • To get an array of all values for an item that are aggregated into one group, use $[*]

  31. Group By Example • $employees = [ {id: 1, dept: 1, band: 7, income: 12000}, {id: 2, dept: 1, band: 8, income: 13000}, {id: 3, dept: 2, band: 7, income: 15000}, {id: 4, dept: 1, band: 8, income: 10000}, {id: 5, dept: 3, band: 7, income: 8000}, {id: 6, dept: 2, band: 8, income: 5000}, {id: 7, dept: 1, band: 7, income: 24000} ]; • $employees -> group by $dept = $.dept into {$dept, total: sum($[*].income)}; [ {dept: 1, total: 59000}, {dept: 2, total: 20000}, {dept: 3, total: 8000} ] • $employees -> group by $dept_group = $.dept into {$dept_group, total: sum($[*].income)}; • $employees -> group by $dept_group = {$.dept, $.band} into {$dept_group.*, total: sum($[*].income)}; • $employees -> group by $dept_group = {$.dept, $.band} into {$dept_group, total: sum($[*].income)};

  32. Join • join <variable-list> where <join-condition(s)> into <expression> • <variable list> contains two or more variables that should share at least one attribute • <join condition(s)>: only equality predicates are allowed • <expression> is applied to all items from the input that match the join condition. To copy all fields of an input, use $input.* • Add the keyword 'preserve' to make it a full outer join

  33. Join Example • $users = [ {name: "Jon Doe", password: "asdf1234", id: 1}, {name: "Jane Doe", password: "qwertyui", id: 2}, {name: "Max Mustermann", password: "q1w2e3r4", id: 3} ]; $pages = [ {userid: 1, url: "code.google.com/p/jaql/"}, {userid: 2, url: "www.cnn.com"}, {userid: 1, url: "java.sun.com/javase/6/docs/api/"} ] • join $users, $pages where $users.id == $pages.userid into {$users.name, $pages.*} • [ {"name": "Jon Doe", "url": "code.google.com/p/jaql/", "userid": 1}, {"name": "Jon Doe", "url": "java.sun.com/javase/6/docs/api/", "userid": 1}, {"name": "Jane Doe", "url": "www.cnn.com", "userid": 2} ]

  34. IBM InfoSphere BigInsights • IBM's offering for managing Big Data • Powered by Hadoop and other components • Provides a fully tested environment

  35. Recap • Introduction to Apache Hadoop • HDFS and the Map-Reduce programming framework • NameNode, DataNode • JobTracker, TaskTracker • Map and Reduce method signatures • Word-Count example • Flow in Map-Reduce • Java implementation • More Map-Reduce examples • Aggregation, equi-join and inequality join • Introduction to JAQL and IBM BigInsights

  36. Advanced Concepts In Hadoop • Map-Reduce Programming Framework • Combiner, Counter, Partitioner, Distributed-Cache • Hadoop I/O • Input-Formats and Output-Formats • Input and Output-Formats provided by Hadoop • Writing Custom Input and Output Formats • Passing custom objects as key-values • Chaining Map-Reduce Jobs • Hadoop Tuning and Optimization • Configuration Parameters • Hadoop Eco-System • Hive/Pig/JAQL • HBase • Avro, ZooKeeper, Mahout, Sqoop, Ganglia etc. • An Overview of Hadoop Research • Join Processing : Multi-way equi and theta joins, set-similarity joins, k-NN joins, interval and spatial joins • Graph Processing, Text Processing etc • Systems : ReStore, PerfXPlain, Stubby, RAMP, HadoopDB etc.

  37. References • Hadoop: The Definitive Guide, O'Reilly Press • Pro Hadoop: Build Scalable, Distributed Applications in the Cloud • Hadoop Tutorial: http://developer.yahoo.com/hadoop/tutorial/ • www.slideshare.net
