Some tips for effective map reducing


  Some tips for effective map reducing
  CHRISTOPHER SEVERS, eBay


  THE AGENDA
  • Quick survey of the current landscape for Hadoop tools
  • A light comparison of the best functional tools
  • General advice
  • Some code samples

  The Alternatives
  I promise this part will be quick

  VANILLA MAPREDUCE

      package org.myorg;

      import java.io.IOException;
      import java.util.*;

      import org.apache.hadoop.fs.Path;
      import org.apache.hadoop.conf.*;
      import org.apache.hadoop.io.*;
      import org.apache.hadoop.mapreduce.*;
      import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
      import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
      import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
      import org.apache.hadoop.mapreduce.lib.output.TextOutputFormat;

      public class WordCount {

        // Emits (word, 1) for every whitespace-separated token in a line.
        public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
          private final static IntWritable one = new IntWritable(1);
          private Text word = new Text();

          public void map(LongWritable key, Text value, Context context)
              throws IOException, InterruptedException {
            String line = value.toString();
            StringTokenizer tokenizer = new StringTokenizer(line);
            while (tokenizer.hasMoreTokens()) {
              word.set(tokenizer.nextToken());
              context.write(word, one);
            }
          }
        }

        // Sums the counts emitted for each word.
        public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
          public void reduce(Text key, Iterable<IntWritable> values, Context context)
              throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
              sum += val.get();
            }
            context.write(key, new IntWritable(sum));
          }
        }

        public static void main(String[] args) throws Exception {
          Configuration conf = new Configuration();
          Job job = new Job(conf, "wordcount");

          job.setOutputKeyClass(Text.class);
          job.setOutputValueClass(IntWritable.class);

          job.setMapperClass(Map.class);
          job.setReducerClass(Reduce.class);

          job.setInputFormatClass(TextInputFormat.class);
          job.setOutputFormatClass(TextOutputFormat.class);

          FileInputFormat.addInputPath(job, new Path(args[0]));
          FileOutputFormat.setOutputPath(job, new Path(args[1]));

          job.waitForCompletion(true);
        }
      }

  PIG
  • Apache Pig is a really great tool for quick, ad-hoc data analysis
  • While we can do amazing things with it, I'm not sure we should
  • Anything complicated requires User Defined Functions (UDFs)
  • UDFs require a separate code base
  • Now you have to maintain two separate languages for no good reason

  APACHE HIVE
  • On previous slide: s/Pig/Hive/g

  General Advice
  Do this, not that

  DO
  • Use a higher-level abstraction like distributed lists
  • Use objects instead of tuples
  • Use a good serialization format
  • Always check for data quality
  • Use flatMap for uncertain computations
  • Develop reusable reductions (monoids!)
  • Prefer map-side operations when possible
  • Always check for data skew
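
  Two of the points above, objects instead of tuples and checking for data skew, can be sketched on plain Scala collections. The `Click` record, its fields, and the `maxKeyShare` helper are hypothetical names made up for this illustration; on a real job the same check would run against a sample of the input.

```scala
object SkewExample {
  // Hypothetical record type; named fields are easier to read than _1, _2, _3.
  final case class Click(userId: String, itemId: String, price: Double)

  // Fraction of all records that belong to the single most frequent key.
  // A value close to 1.0 means one reducer would receive almost everything.
  def maxKeyShare[A, K](records: Seq[A])(key: A => K): Double =
    if (records.isEmpty) 0.0
    else {
      val counts = records.groupBy(key).map { case (_, group) => group.size }
      counts.max.toDouble / records.size
    }

  val clicks: Seq[Click] = Seq(
    Click("u1", "i1", 9.99),
    Click("u1", "i2", 4.50),
    Click("u1", "i3", 1.25),
    Click("u2", "i1", 9.99)
  )

  // Share of the records held by the busiest user.
  val share: Double = maxKeyShare(clicks)(_.userId)
}
```

  Here `share` is 0.75: three of the four clicks belong to `u1`, which would be a warning sign before grouping by user.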

  DON'T
  • Never use nulls
  • Don't use too many levels of nesting
  • Don't use shared state
  • Don't use iteration (too much)
  • Try not to start with a complicated approach
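
  As a small illustration of the "never use nulls" rule, the sketch below wraps a Java lookup that returns null in an `Option`. The `categories` map and the `categoryOf` helper are hypothetical and exist only for this example.

```scala
import java.util.{HashMap => JHashMap}

object NoNulls {
  // A Java map returns null for a missing key.
  private val categories = new JHashMap[String, String]()
  categories.put("i1", "books")

  // Option(x) is None when x is null and Some(x) otherwise.
  def categoryOf(itemId: String): Option[String] =
    Option(categories.get(itemId))

  // flatMap drops the None results, so nothing downstream ever sees a null.
  val known: List[String] = List("i1", "i404").flatMap(id => categoryOf(id))
}
```

  `known` contains only `"books"`; the missing item `"i404"` disappears instead of turning into a null that fails later in the job.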

  SCALDING AND SCOOBI
  This is what we use at eBay

  SOME SCALA CODE

      val myLines = getStuff
      val myWords = myLines.flatMap(w => w.split("\\s+"))
      // Group the words (not the lines) so that each key is a single word.
      val myWordsGrouped = myWords.groupBy(identity)
      val countedWords = myWordsGrouped.mapValues(x => x.size)
      write(countedWords)
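
  For trying this in the Scala REPL, here is a self-contained version of the word count on plain collections. The in-memory `myLines` list is an assumed stand-in for `getStuff`, and the `write` step is left out because the slide does not define it.

```scala
object LocalWordCount {
  // Stand-in for getStuff: a small in-memory collection of lines.
  val myLines: List[String] = List("a b a", "b c")

  // One line becomes many words.
  val myWords: List[String] = myLines.flatMap(line => line.split("\\s+"))

  // word -> all occurrences of that word
  val myWordsGrouped: Map[String, List[String]] = myWords.groupBy(identity)

  // word -> number of occurrences
  val countedWords: Map[String, Int] =
    myWordsGrouped.map { case (word, occurrences) => word -> occurrences.size }
}
```

  `map` is used in place of `mapValues` only because `mapValues` returns a lazy view in newer Scala versions; the result is the same.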

  SOME SCALDING CODE

      val myLines = TypedPipe.from(TextLine(path))
      val myWords = myLines.flatMap(w => w.split(" "))
        .groupBy(identity)
        .size
      myWords.write(TypedTsv[(String, Long)](output))

  WHAT HAPPENED ON THE PREVIOUS SLIDE?
  • flatMap()
    • Similar to map, but a one-to-many rather than one-to-one mapping
    • Use it when a computation may or may not produce a result
    • Can handle errors with the Option (Maybe) monad; a None value is discarded
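
  A minimal sketch of the error-handling point, on plain collections: a parser that returns `Option`, fed through `flatMap` so that bad records vanish. The `parseInt` helper and the sample input are made up for the example.

```scala
import scala.util.Try

object SafeParse {
  // Returns None instead of throwing when the field is not a valid integer.
  def parseInt(field: String): Option[Int] = Try(field.trim.toInt).toOption

  val raw: List[String] = List("1", "x", " 3 ", "")

  // Each input yields zero or one output; the None results are dropped.
  val parsed: List[Int] = raw.flatMap(field => parseInt(field))
}
```

  `parsed` is `List(1, 3)`. With `map` the result would instead be a `List[Option[Int]]` that every later step has to unpack.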

  MORE EXPLANATION
  • groupBy()
    • Takes a function that generates a key from the given value
    • Logically the result can be thought of as an associative array: key -> List of values
    • In Scalding this doesn't necessarily force a Hadoop reduce phase; it depends on what comes after
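
  The "key -> List of values" picture is exactly what `groupBy` returns on a local Scala collection, which makes it easy to explore in the REPL. The `Sale` type and its data are hypothetical.

```scala
object GroupByShape {
  final case class Sale(country: String, amount: Int)

  val sales: List[Sale] = List(Sale("US", 10), Sale("DE", 5), Sale("US", 7))

  // The key function picks the country; the result maps each key to its values.
  val byCountry: Map[String, List[Sale]] = sales.groupBy(_.country)

  // A typical follow-up step: reduce each group to a single value.
  val totals: Map[String, Int] =
    byCountry.map { case (country, group) => country -> group.map(_.amount).sum }
}
```

  This only shows the logical shape. On a cluster the per-key list is not necessarily materialized, which is why what follows the `groupBy` matters.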

  THE BEST PART
  • size
    • This part is pure magic
    • size is actually sugar for .map(t => 1L).sum
    • sum has an implicit argument, mon: Monoid[T]
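
  To make the "sugar" concrete, here is a rough sketch of how `size` can be built from `sum` and an implicit monoid. The `Monoid` trait below is a minimal stand-in written for this example, not Algebird's actual definition, and `sum` and `size` are local helpers rather than Scalding methods.

```scala
object SizeAsSum {
  // Minimal stand-in for a monoid type class (an assumption for this sketch).
  trait Monoid[T] {
    def zero: T
    def plus(l: T, r: T): T
  }

  // Standard addition on Long, with 0 as the identity element.
  implicit val longAddition: Monoid[Long] = new Monoid[Long] {
    val zero = 0L
    def plus(l: Long, r: Long): Long = l + r
  }

  // sum receives the monoid as an implicit argument, as described on the slide.
  def sum[T](values: Iterable[T])(implicit mon: Monoid[T]): T =
    values.foldLeft(mon.zero)(mon.plus)

  // size: replace every element with 1L, then sum.
  def size[A](values: Iterable[A]): Long = sum(values.map(_ => 1L))
}
```

  Because the compiler picks the monoid from the element type, the same `sum` works unchanged for any type that has a `Monoid` instance in scope.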

  MONOIDS: WHY YOU SHOULD CARE ABOUT MATH
  • From Wikipedia:
    • A monoid is an algebraic structure with a single associative binary operation and an identity element.
  • Almost everything you want to do is a monoid
    • Standard addition of numeric types is the most common
    • List/map/set/string concatenation
    • Top k elements
    • Bloom filter, count-min sketch, HyperLogLog
    • Stochastic gradient descent
    • Histograms
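
  As one example from the list, "top k elements" can be written as an associative `plus` with the empty list as identity. The `TopK` class is a hand-rolled sketch for illustration, not a library implementation; it assumes its inputs are already normalized (sorted descending, at most k elements) or are normalized by adding them to `zero` first.

```scala
object TopKMonoid {
  // Keeps the k largest values seen so far, in descending order.
  final class TopK(k: Int) {
    val zero: List[Int] = Nil
    def plus(l: List[Int], r: List[Int]): List[Int] =
      (l ++ r).sorted(Ordering[Int].reverse).take(k)
  }

  val top2 = new TopK(2)

  // Pretend each inner list lives on a different mapper.
  val partitions: List[List[Int]] = List(List(5, 1), List(9), List(3, 7))

  // Normalize each partition locally, then merge the partial results.
  val partials: List[List[Int]] = partitions.map(p => top2.plus(top2.zero, p))
  val result: List[Int] = partials.foldLeft(top2.zero)(top2.plus)
}
```

  Associativity is what lets the partial results be merged in any grouping, so the same answer comes out whether the merge happens map-side, reduce-side, or both.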

  MORE MONOID STUFF
  • If you are aggregating, you are probably using a monoid
  • Scalding has Algebird and monoid support baked in
  • Scoobi can use Algebird (or any other monoid library) with almost no work
    • combine { case (l,r) => monoid.plus(l,r) }
  • Algebird handles tuples with ease
  • Very easy to define monoids for your own types
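
  The tuple point can be sketched without any library: if each component has a monoid, the pair has one too, so a (total, count) aggregate falls out for free. The `Monoid` trait, `pairMonoid`, and `sumByKey` below are hypothetical stand-ins written for this example and are not Algebird's or Scoobi's API.

```scala
object PairMonoid {
  // Minimal stand-in for a monoid type class (an assumption for this sketch).
  trait Monoid[T] {
    def zero: T
    def plus(l: T, r: T): T
  }

  implicit val longMonoid: Monoid[Long] = new Monoid[Long] {
    val zero = 0L
    def plus(l: Long, r: Long): Long = l + r
  }

  // A pair of monoids is itself a monoid, combined component-wise.
  implicit def pairMonoid[A, B](implicit a: Monoid[A], b: Monoid[B]): Monoid[(A, B)] =
    new Monoid[(A, B)] {
      val zero: (A, B) = (a.zero, b.zero)
      def plus(l: (A, B), r: (A, B)): (A, B) =
        (a.plus(l._1, r._1), b.plus(l._2, r._2))
    }

  // Local analogue of a per-key combine: merge all values of a key with plus.
  def sumByKey[K, V](pairs: Seq[(K, V)])(implicit mon: Monoid[V]): Map[K, V] =
    pairs.groupBy(_._1).map { case (k, kvs) =>
      (k, kvs.map(_._2).reduce { (l, r) => mon.plus(l, r) })
    }

  val prices: Seq[(String, Long)] = Seq(("i1", 10L), ("i2", 4L), ("i1", 20L))

  // (total price, number of records) per item, using the derived pair monoid.
  val stats: Map[String, (Long, Long)] =
    sumByKey(prices.map { case (item, price) => (item, (price, 1L)) })
}
```

  Dividing the two components afterwards gives an average per key, while the aggregation itself stays a single reusable `plus`.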

  ADVANTAGES
  • Type checking
    • Find errors at compile time, not at job submission time (or even worse, 5 hours after job submission time)
  • Single language
    • Scala is a full programming language
  • Productivity
    • Since the code you write looks like collections code, you can use the Scala REPL to prototype
  • Clarity
    • Write code as a series of operations and let the job planner smash it all together

  Conclusion
  We’re almost done!

  THINGS TO TAKE AWAY
  • MapReduce is a functional problem, so we should use functional tools
  • You can increase productivity, safety, and maintainability all at once with no downside
  • Thinking of data flows in a functional way opens up many new possibilities
  • The community is awesome

  THANKS! • Questions/comments?

