
INTRODUCTION TO HADOOP & MAP-REDUCE


Presentation Transcript


  1. INTRODUCTION TO HADOOP & MAP-REDUCE

  2. Outline • Map-Reduce Features • Combiner / Partitioner / Counter • Passing Configuration Parameters • Distributed-Cache • Hadoop I/O • Passing Custom Objects as Key-Values • Input and Output Formats • Introduction • Input/Output Formats provided by Hadoop • Writing Custom Input/Output-Formats • Miscellaneous • Chaining Map-Reduce Jobs • Compression • Hadoop Tuning and Optimization

  3. Combiner • A local reduce • Processes the output of each map function • Same signature as of a reduce • Often reduces the number of intermediate key-value pairs

  4. Word-Count Sort/Shuffle [Diagram: three map tasks process input words such as "Hadoop Map Map Reduce Hadoop Map Map" and emit (word, 1) pairs; the shuffle routes each key to a reducer by alphabetic range (A-I, J-Q, R-Z), where values are grouped and summed, e.g. (Hadoop, [1,1,1]) → (Hadoop, 3), (Map, [1,1,1,1,1,1,1]) → (Map, 7), (Reduce, [1,1]) → (Reduce, 2), (Key, [1,1]) → (Key, 2), (Value, [1,1]) → (Value, 2)]

  5. Word-Count with a Combiner [Diagram: each map task's output first passes through a combiner that aggregates locally, e.g. (Hadoop, [1,1]) → (Hadoop, 2), (Map, [1,1,1,1]) → (Map, 4), (Reduce, [1]) → (Reduce, 1); the reducers then merge the partial counts, e.g. (Hadoop, [2,1]) → (Hadoop, 3), (Map, [4,3]) → (Map, 7), (Key, [2]) → (Key, 2), (Value, [2]) → (Value, 2), so far fewer intermediate key-value pairs cross the network]

  6. Combiner
public class WordCountCombiner extends Reducer<Text, IntWritable, Text, IntWritable> {
    // Generic parameters: type of input key, input value, output key, output value
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        // count(values) is a helper that sums the partial counts for this key
        context.write(key, new IntWritable(count(values)));
    }
}

  7. Word-Count Runner Class
public class WordCountRunner {
    public static void main(String[] args) throws Exception {
        Job job = new Job();
        job.setMapperClass(WordCountMap.class);
        job.setCombinerClass(WordCountCombiner.class);
        job.setReducerClass(WordCountReduce.class);
        job.setJarByClass(WordCountRunner.class);
        FileInputFormat.addInputPath(job, inputFilesPath);
        FileOutputFormat.setOutputPath(job, outputPath);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setNumReduceTasks(1);
        job.waitForCompletion(true);
    }
}

  8. Counters

  9. Counters • Built-in Counters • Report Metrics for various aspects of a Job • Task Counters • Gather information about tasks over the course of a job • Results are aggregated across all tasks • MAP_INPUT_RECORDS, REDUCE_INPUT_GROUPS • FileSystem Counters • BYTES_READ, BYTES_WRITTEN • Bytes Read/Written by each File-System (HDFS, KFS, Local, S3 etc) • FileInputFormat Counters • BYTES_READ (Bytes Read through FileInputFormat) • FileOutputFormat Counters • BYTES_WRITTEN (Bytes Written through FileOutputFormat) • Job Counters • Maintained by Job-Tracker • TOTAL_LAUNCHED_MAPS, TOTAL_LAUNCHED_REDUCES

  10. User-Defined Counters
public class WordCountMap extends Mapper<LongWritable, Text, Text, IntWritable> {
    enum WCCounters {NOUNS, PRONOUNS, ADJECTIVES};

    public void map(LongWritable key, Text line, Context context)
            throws IOException, InterruptedException {
        String[] tokens = Tokenize(line);
        for (int i = 0; i < tokens.length; i++) {
            if (isNoun(tokens[i]))
                context.getCounter(WCCounters.NOUNS).increment(1);
            else if (isProNoun(tokens[i]))
                context.getCounter(WCCounters.PRONOUNS).increment(1);
            else if (isAdjective(tokens[i]))
                context.getCounter(WCCounters.ADJECTIVES).increment(1);
            context.write(new Text(tokens[i]), new IntWritable(1));
        }
    }
}

  11. Retrieving the Value of a Counter
Counters counters = job.getCounters();
Counter counter = counters.findCounter(WCCounters.NOUNS);
long value = counter.getValue();

  12. Output
13/10/08 15:36:15 INFO mapred.JobClient WordCountMap.NOUNS=2342
13/10/08 15:36:15 INFO mapred.JobClient WordCountMap.PRONOUNS=2124
13/10/08 15:36:15 INFO mapred.JobClient WordCountMap.ADJECTIVES=1897

  13. Partitioner • Maps keys to reducers/partitions • Determines which reducer receives a given key • Identical keys produced by different map functions must map to the same partition/reducer • If n reducers are used, then n partitions must be filled • The number of reducers is set by calling setNumReduceTasks • Hadoop uses HashPartitioner as the default partitioner; it assigns a key to partition (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks

  14. Defining a Custom Partitioner • Implement a class which extends the Partitioner class and overrides getPartition • Partitioning impacts the load-balancing aspect of a map-reduce program • Word-Count: many words start with vowels • Words starting with different characters are sent to different reducers • For words starting with vowels, the second character may also be taken into account (one possible implementation is sketched below)
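A minimal sketch of such a partitioner, matching the WordCountPartitioner referenced in the runner on slide 15; the vowel-handling logic is one possible choice and is not taken from the slides:

public class WordCountPartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numPartitions) {
        String word = key.toString().toLowerCase();
        if (word.isEmpty() || numPartitions == 1) {
            return 0;
        }
        char first = word.charAt(0);
        if ("aeiou".indexOf(first) >= 0 && word.length() > 1) {
            // Words starting with a vowel: also use the second character,
            // so these frequent words spread over several partitions.
            return (first * 31 + word.charAt(1)) % numPartitions;
        }
        // All other words: partition by the first character only.
        return first % numPartitions;
    }
}

Whatever scheme is used, it must be deterministic so that identical keys always land in the same partition.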

  15. Word-Count Runner Class
public class WordCountRunner {
    public static void main(String[] args) throws Exception {
        Job job = new Job();
        job.setMapperClass(WordCountMap.class);
        job.setCombinerClass(WordCountCombiner.class);
        job.setReducerClass(WordCountReduce.class);
        job.setJarByClass(WordCountRunner.class);
        job.setPartitionerClass(WordCountPartitioner.class);
        FileInputFormat.addInputPath(job, inputFilesPath);
        FileOutputFormat.setOutputPath(job, outputPath);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        job.setNumReduceTasks(1);
        job.waitForCompletion(true);
    }
}

  16. Passing Configuration Parameters • Map-Reduce jobs may require certain input parameters • One may want to avoid counting words starting with certain prefixes • Prefixes can be set in the configuration

  17. Word-Count Runner Class
public class WordCountRunner {
    public static void main(String[] args) throws Exception {
        Job job = new Job();
        Configuration conf = job.getConfiguration();
        conf.set("PrefixesToAvoid", "abs bts bnm swe");
        ...
        job.waitForCompletion(true);
    }
}

  18. Word-Count Map
public class WordCountMap extends Mapper<LongWritable, Text, Text, IntWritable> {
    private String[] prefixesToAvoid;

    public void setup(Context context) throws IOException, InterruptedException {
        Configuration conf = context.getConfiguration();
        String prefixes = conf.get("PrefixesToAvoid");
        this.prefixesToAvoid = prefixes.split(" ");
    }

    public void map(LongWritable key, Text line, Context context)
            throws IOException, InterruptedException {
        String[] tokens = Tokenize(line);
        for (int i = 0; i < tokens.length; i++) {
            context.write(new Text(tokens[i]), new IntWritable(1));
        }
    }
}
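As written, the map reads the prefixes in setup() but does not yet use them. A minimal sketch of the filtering step that slide 16 describes; the startsWithAny helper is illustrative and not part of the original code:

// Inside map(): skip tokens that begin with one of the configured prefixes.
for (int i = 0; i < tokens.length; i++) {
    if (startsWithAny(tokens[i], prefixesToAvoid)) {
        continue;   // do not count this word
    }
    context.write(new Text(tokens[i]), new IntWritable(1));
}

// Helper method added to WordCountMap.
private boolean startsWithAny(String word, String[] prefixes) {
    for (String prefix : prefixes) {
        if (word.startsWith(prefix)) {
            return true;
        }
    }
    return false;
}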

  19. Distributed Cache • A file may need to be broadcast to each map-node • For example, a dictionary in a spell-check • Such file-names can be added to a distributed-cache • Hadoop copies files added to the cache to all map-nodes • Step 1: Put the file into HDFS • hdfs dfs -put /tmp/file1 /cachefile1 • Step 2: Add the cache file in the Job configuration • Configuration conf = job.getConfiguration(); • DistributedCache.addCacheFile(new URI("/cachefile1"), conf); • Step 3: Access the cache file locally at each map • Path[] cacheFiles = context.getLocalCacheFiles(); • FileInputStream finputStream = new FileInputStream(cacheFiles[0].toString());
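Putting the three steps together, a minimal sketch of a mapper that loads the cached dictionary in setup(); the SpellCheckMap class, the dictionary field, and the one-word-per-line file layout are assumptions for illustration:

public class SpellCheckMap extends Mapper<LongWritable, Text, Text, IntWritable> {
    private Set<String> dictionary = new HashSet<String>();

    @Override
    protected void setup(Context context) throws IOException, InterruptedException {
        // The file was added with DistributedCache.addCacheFile(new URI("/cachefile1"), conf)
        Path[] cacheFiles = context.getLocalCacheFiles();
        BufferedReader reader = new BufferedReader(new FileReader(cacheFiles[0].toString()));
        String word;
        while ((word = reader.readLine()) != null) {
            dictionary.add(word.trim());   // one dictionary word per line
        }
        reader.close();
    }
}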

  20. Hadoop I/O : Reading an HDFS File
// Get FileSystem object instance
FileSystem fs = FileSystem.get(conf);
// Get file stream
Path infile = new Path(filePath);
BufferedReader br = new BufferedReader(new InputStreamReader(fs.open(infile)));
// Read the file line by line
StringBuilder fileContent = new StringBuilder();
String line = br.readLine();
while (line != null) {
    fileContent.append(line).append("\n");
    line = br.readLine();
}
br.close();

  21. Hadoop I/O : Writing to an HDFS File
// Get FileSystem object instance
FileSystem fs = FileSystem.get(conf);
// Get file stream
Path path = new Path(filePath);
FSDataOutputStream outputStream = fs.create(path);
// Write to the file
byte[] bytes = content.getBytes();
outputStream.write(bytes, 0, bytes.length);
outputStream.close();

  22. Hadoop I/O : Getting the File being Processed • A map-reduce job may need to process multiple files • The functionality of a map may depend upon which file is being processed
FileSplit fileSplit = (FileSplit) context.getInputSplit();
String filename = fileSplit.getPath().getName();
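A minimal sketch of how a map might branch on the file name, assuming the temperature and pressure files of the later weather example contain "temperature" and "pressure" in their names (those names are illustrative, not from the slides):

public void map(LongWritable key, Text line, Context context)
        throws IOException, InterruptedException {
    FileSplit fileSplit = (FileSplit) context.getInputSplit();
    String filename = fileSplit.getPath().getName();

    if (filename.contains("temperature")) {
        // parse the line as a temperature record
    } else if (filename.contains("pressure")) {
        // parse the line as a pressure record
    }
}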

  23. Custom Objects as Key-Values • Passing key and values from map functions to reducers • IntWritable, DoubleWritable, LongWritable, Text, ArrayWritable • Passing key and values of custom classes may be desirable • Objects that can be passed around must implement certain interfaces • Writable for passing as values • WritableComparable for passing as keys

  24. Example Use-Case • Consider weather data • Temperature and pressure values at different latitude-longitude-elevation-timestamp quadruples • The data is hence 4-dimensional • Temperature and pressure data are in separate files • File format: latitude, longitude, elevation, timestamp, temperature-value • Ex: 10 20 10 1 99F and 10 21 10 2 98F • Similarly for pressure, e.g. 10 20 10 1 101kPa • We want to read the two data files and combine the data • Ex: 10 20 10 1 99F 101kPa • Let class STPoint represent the coordinates • class STPoint { double latitude, longitude, elevation; long timestamp; }

  25. Map to Reduce Flow (Text keys, DoubleWritable values) [Diagram: one map reads the temperature file (lines "10 20 1 10 99F", "10 21 1 10 98F") and another reads the pressure file (lines "10 20 1 10 101kPa", "10 21 1 10 109kPa"); each emits the coordinate string as a Text key, e.g. (10 20 1 10, 99F) and (10 21 1 10, 101kPa); the reduce brings together both readings for the same coordinates: (10, 20, 1, 10, 99F, 101kPa)]

  26. Map to Reduce Flow (STPoint keys, DoubleWritable values) [Diagram: the same flow, but each map emits an STPoint object as the key, e.g. (STPoint(10 20 1 10), 99F) and (STPoint(10 21 1 10), 101kPa); the reduce receives both readings for the same point: (STPoint(10, 20, 1, 10), 99F, 101kPa)]

  27. Map (output key type: Text, output value type: DoubleWritable)
public class MyMap extends Mapper<LongWritable, Text, Text, DoubleWritable> {
    public void map(LongWritable key, Text line, Context context)
            throws IOException, InterruptedException {
        String tokens[] = line.toString().split(" ");
        double latitude = new Double(tokens[0]).doubleValue();
        double longitude = new Double(tokens[1]).doubleValue();
        double elevation = new Double(tokens[2]).doubleValue();
        long timestamp = new Long(tokens[3]).longValue();
        double attrVal = new Double(tokens[4]).doubleValue();
        String keyString = latitude + " " + longitude + " " + elevation + " " + timestamp;
        context.write(new Text(keyString), new DoubleWritable(attrVal));
    }
}

  28. New Map (output key type: STPoint, output value type: DoubleWritable)
public class MyMap extends Mapper<LongWritable, Text, STPoint, DoubleWritable> {
    public void map(LongWritable key, Text line, Context context)
            throws IOException, InterruptedException {
        String tokens[] = line.toString().split(" ");
        double latitude = new Double(tokens[0]).doubleValue();
        double longitude = new Double(tokens[1]).doubleValue();
        double elevation = new Double(tokens[2]).doubleValue();
        long timestamp = new Long(tokens[3]).longValue();
        double attrVal = new Double(tokens[4]).doubleValue();
        STPoint stpoint = new STPoint(latitude, longitude, elevation, timestamp);
        context.write(stpoint, new DoubleWritable(attrVal));
    }
}
More intuitive, human readable, and reduces processing on the reduce side.

  29. New Reduce (input key: STPoint, input value: DoubleWritable; output key: Text, output value: DoubleWritable)
public class DataReadReduce extends Reducer<STPoint, DoubleWritable, Text, DoubleWritable> {
    public void reduce(STPoint key, Iterable<DoubleWritable> values, Context context) {
    }
}
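The reduce body is left empty on the slide. A minimal sketch of one thing it could do for the weather use-case, emitting the point's coordinates as a Text key together with each reading received for it; combining both readings onto a single output line (as slides 25-26 show) would instead require a Text output value, which this sketch deliberately avoids so the declared types stay unchanged:

public void reduce(STPoint key, Iterable<DoubleWritable> values, Context context)
        throws IOException, InterruptedException {
    Text outKey = new Text(key.toString());
    // Write one output record per reading for this point,
    // e.g. "10.0 20.0 1.0 10  99.0" and "10.0 20.0 1.0 10  101.0".
    for (DoubleWritable value : values) {
        context.write(outKey, value);
    }
}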

  30. Passing Custom Objects as Key-Values • Key-value pairs are written to local disk by map functions • The user must tell Hadoop how to write a custom object • Key-value pairs are read by reducers from local disk • The user must tell Hadoop how to read a custom object • Keys are sorted and compared • The user must specify how to compare two keys

  31. WritableComparable Interface • Three methods • public void readFields(DataInput in) {} • public void write(DataOutput out) {} • public int compareTo(Object other) {} • Objects that are passed as keys must implement the WritableComparable interface • Objects that are passed as values must implement the Writable interface • The Writable interface does not have the compareTo method • Only keys are compared, not values, hence the compareTo method is not required for objects passed only as values

  32. Implementing WritableComparable for STPoint
public void readFields(DataInput in) throws IOException {
    this.latitude = in.readDouble();
    this.longitude = in.readDouble();
    this.elevation = in.readDouble();
    this.timestamp = in.readLong();
}

public void write(DataOutput out) throws IOException {
    out.writeDouble(this.latitude);
    out.writeDouble(this.longitude);
    out.writeDouble(this.elevation);
    out.writeLong(this.timestamp);
}

public int compareTo(STPoint other) {
    return this.toString().compareTo(other.toString());
}
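Putting the pieces together, a possible full STPoint class, sketched from the fields on slide 24 (imports of java.io.DataInput, java.io.DataOutput, java.io.IOException and org.apache.hadoop.io.WritableComparable are assumed); the no-argument constructor, toString, hashCode, and equals are additions not shown on the slides:

public class STPoint implements WritableComparable<STPoint> {
    private double latitude, longitude, elevation;
    private long timestamp;

    public STPoint() { }    // required: Hadoop instantiates the object before calling readFields

    public STPoint(double latitude, double longitude, double elevation, long timestamp) {
        this.latitude = latitude;
        this.longitude = longitude;
        this.elevation = elevation;
        this.timestamp = timestamp;
    }

    public void readFields(DataInput in) throws IOException {
        latitude = in.readDouble();
        longitude = in.readDouble();
        elevation = in.readDouble();
        timestamp = in.readLong();
    }

    public void write(DataOutput out) throws IOException {
        out.writeDouble(latitude);
        out.writeDouble(longitude);
        out.writeDouble(elevation);
        out.writeLong(timestamp);
    }

    public int compareTo(STPoint other) {
        return this.toString().compareTo(other.toString());
    }

    @Override
    public String toString() {
        return latitude + " " + longitude + " " + elevation + " " + timestamp;
    }

    @Override
    public int hashCode() {    // used by the default HashPartitioner to pick a reducer
        return toString().hashCode();
    }

    @Override
    public boolean equals(Object o) {
        return (o instanceof STPoint) && this.toString().equals(o.toString());
    }
}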

  33. InputFormat and OutputFormat • InputFormat • Defines how to read data from file and feed it to the map functions • OutputFormat • Defines how to write data on to a file • Hadoop provides various Input and Output Formats • A user can also implement custom input and output formats • Defining custom input and output formats is a very useful feature of map-reduce

  34. Input-Format • Defines how to read data from a file and feed it to the map functions • How are splits defined? • getSplits() • How are records defined? • createRecordReader() (getRecordReader() in the old mapred API) • Hadoop provides various input and output formats • A user can also implement custom input and output formats • Defining custom input and output formats is a very useful feature of map-reduce

  35. Split [Diagram: an input file of 20 records (R1-R20 with columns A, B, C) divided into four 64 MB splits; Split 1 is processed by MAP-1, Split 2 by MAP-2, Split 3 by MAP-3, and Split 4 by MAP-4]

  36. Split [Diagram: the same 20-record file divided into two 64 MB splits, Split 1 processed by MAP-1 and Split 2 by MAP-2]

  37. Split [Diagram: repeats the two-split layout of the previous slide]

  38. Split [Diagram: the same file divided into three splits, processed by MAP-1, MAP-2, and MAP-3]

  39. Record-Reader [Diagram: records R1 (1 2 3), R2 (2 3 5), R3 (2 4 6), R4 (6 4 2), R5 (1 3 6); with the default record reader, all records are fed to the map task one by one]

  40. Record-Reader [Diagram: the same rows regrouped by a custom record reader so that there are three records now]

  41. Record-Reader [Diagram: all the tuples with identical values in column 1 are bunched into the same record, e.g. R1 and R5 (value 1), R2 and R3 (value 2), and R4 (value 6)]

  42. TextInputFormat • The default input format • Key is the byte offset and value is the line content • Suitable for reading raw text files • Example: the lines "10 20 1 10 99F" and "10 21 1 10 98F" are fed to the map as (0, "10 20 1 10 99F") and (10, "10 21 1 10 98F")

  43. KeyValueTextInputFormat • Input data is in the form key \tab value • Anything before the tab is the key • Anything after the tab is the value • Example: the lines "10 20 1 10 \t 99F" and "10 21 1 10 \t 98F" are fed to the map as ("10 20 1 10", "99F") and ("10 21 1 10", "98F") • If a line contains no tab, the whole line becomes the key and the value is empty

  44. SequenceFileInputFormat • A Hadoop-specific high-performance binary input format • The key is user-defined • The value is user-defined • Example: a binary sequence file is fed to the map as user-defined key-value pairs such as ("10 20 1 10", "99F") and ("10 21 1 10", "98F")

  45. OutputFormats • TextOutputFormat • The default output format • Writes data in key \tab value format • This output can subsequently be read by KeyValueTextInputFormat • SequenceFileOutputFormat • Writes binary files suitable for reading into subsequent MR jobs • Keys and values are user-defined
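For reference, a minimal sketch of how a runner might select these formats on the Job, assuming the new-API classes under org.apache.hadoop.mapreduce.lib:

// First job: read raw text lines, write the output as a binary sequence file.
job.setInputFormatClass(TextInputFormat.class);
job.setOutputFormatClass(SequenceFileOutputFormat.class);

// A follow-up job could then read that binary output directly:
// job2.setInputFormatClass(SequenceFileInputFormat.class);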

  46. Text Input and Output Format [Diagram: the pairs ("10 20 1 10", "99F") and ("10 21 1 10", "98F") written by TextOutputFormat become the lines "10 20 1 10 \tab 99F" and "10 21 1 10 \tab 98F"; reading these lines back with TextInputFormat yields (0, "10 20 1 10 \tab 99F") and (10, "10 21 1 10 \tab 98F"), while KeyValueTextInputFormat yields ("10 20 1 10", "99F") and ("10 21 1 10", "98F") again]

  47. Custom Input Formats • Allows a user control over how to read data and subsequently feed it to the map functions • Advisable to implement custom input formats for specific use-cases • Simplifies the process of implementing map-reduce algorithms

  48. CustomInputFormat - Key is of type STPoint [Diagram: the lines "10 20 1 10 99F" and "10 21 1 10 98F" pass through MY INPUT FORMAT and reach the map as (STPoint(10 20 1 10), 99F) and (STPoint(10 21 1 10), 98F)]

  49. Map (output key type: STPoint, output value type: DoubleWritable)
public class MyMap extends Mapper<LongWritable, Text, STPoint, DoubleWritable> {
    public void map(LongWritable key, Text line, Context context)
            throws IOException, InterruptedException {
        String tokens[] = line.toString().split(" ");
        double latitude = new Double(tokens[0]).doubleValue();
        double longitude = new Double(tokens[1]).doubleValue();
        double elevation = new Double(tokens[2]).doubleValue();
        long timestamp = new Long(tokens[3]).longValue();
        double attrVal = new Double(tokens[4]).doubleValue();
        STPoint stpoint = new STPoint(latitude, longitude, elevation, timestamp);
        context.write(stpoint, new DoubleWritable(attrVal));
    }
}

  50. New Map With Custom Input Format (map input key: STPoint, map input value: DoubleWritable; map output key: STPoint, map output value: DoubleWritable)
class MyMap extends Mapper<STPoint, DoubleWritable, STPoint, DoubleWritable> {
    public void map(STPoint point, DoubleWritable attrValue, Context context)
            throws IOException, InterruptedException {
        context.write(point, attrValue);
    }
}
More intuitive, human readable.
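The slides do not show MY INPUT FORMAT itself. A minimal sketch of what such a format could look like with the new API, assuming the record reader wraps the standard LineRecordReader and parses each line into an STPoint key and a DoubleWritable value; all class names other than the Hadoop API types are illustrative:

public class STPointInputFormat extends FileInputFormat<STPoint, DoubleWritable> {
    @Override
    public RecordReader<STPoint, DoubleWritable> createRecordReader(InputSplit split, TaskAttemptContext context) {
        return new STPointRecordReader();
    }
}

class STPointRecordReader extends RecordReader<STPoint, DoubleWritable> {
    private final LineRecordReader lineReader = new LineRecordReader();
    private STPoint key;
    private DoubleWritable value;

    @Override
    public void initialize(InputSplit split, TaskAttemptContext context)
            throws IOException, InterruptedException {
        lineReader.initialize(split, context);
    }

    @Override
    public boolean nextKeyValue() throws IOException, InterruptedException {
        if (!lineReader.nextKeyValue()) {
            return false;
        }
        // Parse "latitude longitude elevation timestamp attribute-value"
        String[] tokens = lineReader.getCurrentValue().toString().split(" ");
        key = new STPoint(Double.parseDouble(tokens[0]), Double.parseDouble(tokens[1]),
                          Double.parseDouble(tokens[2]), Long.parseLong(tokens[3]));
        // Strip a trailing unit such as F or kPa before converting to a double.
        value = new DoubleWritable(Double.parseDouble(tokens[4].replaceAll("[^0-9.]", "")));
        return true;
    }

    @Override public STPoint getCurrentKey() { return key; }
    @Override public DoubleWritable getCurrentValue() { return value; }
    @Override public float getProgress() throws IOException { return lineReader.getProgress(); }
    @Override public void close() throws IOException { lineReader.close(); }
}

With this format set via job.setInputFormatClass(STPointInputFormat.class), the simplified map of slide 50 receives ready-made STPoint keys instead of raw text lines.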
