
Introduction to Google MapReduce



Presentation Transcript


  1. Introduction to Google MapReduce. Based on materials from the Internet.

  2. What is MapReduce? • A programming model (and its associated implementation) • For processing large data sets • Exploits a large set of commodity computers • Executes processes in a distributed manner • Offers a high degree of transparency

  3. Distributed Word Count [Figure: very big data is split into chunks ("Split data"), each split is counted independently ("count"), and the partial counts are merged ("merge") into a single merged count.]
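
The split, count, and merge flow of the Distributed Word Count slide can be sketched in a single process. This is only an illustration, not Hadoop or Google code: the class `SplitCountMerge` and its methods are hypothetical names, and a parallel stream stands in for the separate machines that would each count one split.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.stream.Collectors;

// Hypothetical single-process illustration; not Hadoop or Google code.
class SplitCountMerge {

    // Count the words in one split of the input.
    static Map<String, Integer> count(String split) {
        Map<String, Integer> counts = new HashMap<>();
        for (String word : split.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                counts.merge(word, 1, Integer::sum);
            }
        }
        return counts;
    }

    // Merge the per-split counts into one result, sorted by word.
    static Map<String, Integer> merge(List<Map<String, Integer>> partials) {
        Map<String, Integer> merged = new TreeMap<>();
        for (Map<String, Integer> partial : partials) {
            partial.forEach((word, n) -> merged.merge(word, n, Integer::sum));
        }
        return merged;
    }

    // Split -> count each split independently -> merge.
    static Map<String, Integer> wordCount(List<String> splits) {
        List<Map<String, Integer>> partials = splits.parallelStream()
                .map(SplitCountMerge::count)
                .collect(Collectors.toList());
        return merge(partials);
    }

    public static void main(String[] args) {
        // prints {a=3, b=2, c=1}
        System.out.println(wordCount(Arrays.asList("a b a", "b c", "a")));
    }
}
```

Because each split is counted without looking at any other split, the counting step can run on as many workers as there are splits; only the merge needs to see all partial results.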

  4. Map Reduce • Map: • Accepts an input key/value pair • Emits intermediate key/value pairs • Reduce: • Accepts an intermediate key with its list of values (key/value*) • Emits output key/value pairs [Figure: Very big data → Map → Partitioning Function → Reduce → Result]
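
The map and reduce contracts on this slide can be written down as two small interfaces plus an in-memory driver. This is a minimal sketch under stated assumptions: `MiniMapReduce`, `Emitter`, `MapFn`, and `ReduceFn` are hypothetical names invented here, everything runs in one JVM, and a sorted map plays the role of the grouping step between map and reduce.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical in-memory model of the map/reduce contract; names are illustrative.
class MiniMapReduce {

    // Receives emitted key/value pairs.
    interface Emitter<K, V> {
        void emit(K key, V value);
    }

    // Map: one input key/value pair -> zero or more intermediate pairs.
    interface MapFn<K1, V1, K2, V2> {
        void map(K1 key, V1 value, Emitter<K2, V2> out);
    }

    // Reduce: one intermediate key plus all of its values -> output pairs.
    interface ReduceFn<K2, V2, K3, V3> {
        void reduce(K2 key, List<V2> values, Emitter<K3, V3> out);
    }

    static <K1, V1, K2 extends Comparable<K2>, V2, K3, V3> List<Map.Entry<K3, V3>> run(
            Map<K1, V1> input, MapFn<K1, V1, K2, V2> mapper, ReduceFn<K2, V2, K3, V3> reducer) {
        // Map phase; intermediate values are grouped by key as they are emitted.
        Map<K2, List<V2>> groups = new TreeMap<>();
        input.forEach((k, v) -> mapper.map(k, v,
                (k2, v2) -> groups.computeIfAbsent(k2, x -> new ArrayList<>()).add(v2)));
        // Reduce phase; one call per intermediate key, in sorted key order.
        List<Map.Entry<K3, V3>> output = new ArrayList<>();
        groups.forEach((k2, vs) -> reducer.reduce(k2, vs,
                (k3, v3) -> output.add(new AbstractMap.SimpleEntry<>(k3, v3))));
        return output;
    }

    // Word count expressed with the two functions above.
    static List<Map.Entry<String, Integer>> wordCount(Map<Integer, String> docs) {
        MapFn<Integer, String, String, Integer> mapper = (docId, text, out) -> {
            for (String w : text.split(" ")) {
                out.emit(w, 1);
            }
        };
        ReduceFn<String, Integer, String, Integer> reducer = (word, ones, out) -> {
            int sum = 0;
            for (int n : ones) {
                sum += n;
            }
            out.emit(word, sum);
        };
        return run(docs, mapper, reducer);
    }

    public static void main(String[] args) {
        Map<Integer, String> docs = new TreeMap<>();
        docs.put(1, "to be or not to be");
        // prints [be=2, not=1, or=1, to=2]
        System.out.println(wordCount(docs));
    }
}
```

The driver never inspects what the mapper or reducer does; it only moves key/value pairs between them, which is the part a real framework distributes across machines.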

  5. Partitioning Function

  6. Partitioning Function (2) • Default: hash(key) mod R, where R is the number of reduce tasks • Guarantees: • Relatively well-balanced partitions • Keys are processed in sorted order within each partition • Distributed Sort • Map: emit(key, value) • Reduce (with R=1): emit(key, value) • Distributed Word Count • Map: for all w in value do emit(w, 1) • Reduce: emit(key, sum(value*))
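
The default rule, hash(key) mod R, and the ordering guarantee can be tried out in a few lines. The sketch below is an assumption-laden illustration, not a framework API: `HashPartitionSketch`, `partition`, and `shuffle` are hypothetical names, the sign-bit mask is there because Java's `hashCode()` may be negative, and a `TreeMap` per partition stands in for the sort the framework performs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeMap;

// Hypothetical helper; mirrors the "hash(key) mod R" rule from the slide.
class HashPartitionSketch {

    // Map a key to one of R partitions; masking the sign bit keeps the result non-negative.
    static int partition(Object key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    // Route each key to its partition and count occurrences;
    // the TreeMap keeps keys sorted within a partition.
    static List<TreeMap<String, Integer>> shuffle(List<String> keys, int numReducers) {
        List<TreeMap<String, Integer>> partitions = new ArrayList<>();
        for (int r = 0; r < numReducers; r++) {
            partitions.add(new TreeMap<>());
        }
        for (String key : keys) {
            partitions.get(partition(key, numReducers)).merge(key, 1, Integer::sum);
        }
        return partitions;
    }

    public static void main(String[] args) {
        List<TreeMap<String, Integer>> parts = shuffle(Arrays.asList("b", "a", "c", "a", "d"), 2);
        // prints:
        // partition 0: {b=1, d=1}
        // partition 1: {a=2, c=1}
        for (int r = 0; r < parts.size(); r++) {
            System.out.println("partition " + r + ": " + parts.get(r));
        }
    }
}
```

With R=1 every key lands in the same partition, so the within-partition ordering becomes a total order over all keys; that is why the Distributed Sort on the slide needs nothing more than identity map and reduce functions.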

  7. MapReduce

          class MapReduce {
              class Mapper … {
                  // map code
              }
              class Reducer … {
                  // reduce code
              }
              main() {
                  JobConf conf = new JobConf("MR.Class");
                  // other code
              }
          }

  8. MapReduce Transparencies. Plus Google Distributed File System (GFS): • Parallelization • Fault tolerance • Locality optimization • Load balancing

  9. MapReduce outside Google • Hadoop (Java) • Emulates MapReduce and GFS • The architecture of Hadoop MapReduce and DFS is master/slave

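  10. Example Word Count: Map

          // Written against the early, pre-generics org.apache.hadoop.mapred API.
          // Needs java.io.IOException, java.util.StringTokenizer,
          // org.apache.hadoop.io.* and org.apache.hadoop.mapred.*.
          public static class MapClass extends MapReduceBase implements Mapper {
              private final static IntWritable one = new IntWritable(1);
              private Text word = new Text();

              // Called once per input record; value holds one line of text.
              public void map(WritableComparable key, Writable value,
                              OutputCollector output, Reporter reporter) throws IOException {
                  String line = ((Text) value).toString();
                  StringTokenizer itr = new StringTokenizer(line);
                  while (itr.hasMoreTokens()) {
                      word.set(itr.nextToken());
                      output.collect(word, one);  // emit (word, 1)
                  }
              }
          }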

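  11. Example Word Count: Reduce

          // Needs java.io.IOException, java.util.Iterator,
          // org.apache.hadoop.io.* and org.apache.hadoop.mapred.*.
          public static class Reduce extends MapReduceBase implements Reducer {
              // Called once per word; values iterates over all counts emitted for it.
              public void reduce(WritableComparable key, Iterator values,
                                 OutputCollector output, Reporter reporter) throws IOException {
                  int sum = 0;
                  while (values.hasNext()) {
                      sum += ((IntWritable) values.next()).get();
                  }
                  output.collect(key, new IntWritable(sum));  // emit (word, total)
              }
          }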

  12. Example Word Count: Main

          public static void main(String[] args) throws IOException {
              // checking goes here
              JobConf conf = new JobConf();
              conf.setOutputKeyClass(Text.class);
              conf.setOutputValueClass(IntWritable.class);
              conf.setMapperClass(MapClass.class);
              conf.setCombinerClass(Reduce.class);
              conf.setReducerClass(Reduce.class);
              conf.setInputPath(new Path(args[0]));
              conf.setOutputPath(new Path(args[1]));
              JobClient.runJob(conf);
          }

  13. One-time setup • Configure hadoop-site.xml and the slaves file • Initialize the namenode • Start Hadoop MapReduce and DFS • Upload your data to DFS • Run your process… • Download your data from DFS

  14. Summary • A simple programming model for processing large datasets on large clusters of computers • Fun to use: focus on the problem, and let the library deal with the messy details

  15. References • Original paper (http://labs.google.com/papers/mapreduce.html) • On wikipedia (http://en.wikipedia.org/wiki/MapReduce) • Hadoop – MapReduce in Java (http://lucene.apache.org/hadoop/) • Starfish - MapReduce in Ruby (http://rufy.com/starfish/)
