
Beyond map/reduce functions: partitioner, combiner and parameter configuration

Gang Luo, Sept. 9, 2010


Presentation Transcript


  1. Gang Luo, Sept. 9, 2010. Beyond map/reduce functions: partitioner, combiner and parameter configuration

  2. Partitioner • Determines which reducer/partition a record should go to • Given the key, the value and the number of partitions, returns an integer • Partition: (K2, V2, #Partitions) → integer

  3. Partitioner
  Interface:
  public interface Partitioner<K2, V2> extends JobConfigurable {
    int getPartition(K2 key, V2 value, int numPartitions);
  }
  Implementation:
  public class MyPartitioner<K2, V2> implements Partitioner<K2, V2> {
    public int getPartition(K2 key, V2 value, int numPartitions) {
      // your logic!
    }
  }

  4. Partitioner
  Example:
  public class MyPartitioner implements Partitioner<Text, Text> {
    public int getPartition(Text key, Text value, int numPartitions) {
      int hashCode = key.hashCode();
      // mask the sign bit so the index is never negative
      int partitionIndex = (hashCode & Integer.MAX_VALUE) % numPartitions;
      return partitionIndex;
    }
  }
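The modulo logic in the example can be exercised outside Hadoop. A minimal stand-alone sketch (plain Java, hypothetical key values) showing why the sign mask matters when hashCode() is negative:

```java
public class PartitionDemo {
    // Same logic as getPartition above: mask the sign bit so the
    // result is always in [0, numPartitions).
    static int partitionFor(String key, int numPartitions) {
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        // "polygenelubricants".hashCode() is Integer.MIN_VALUE, so a plain
        // % would produce a negative (invalid) partition index.
        int p = partitionFor("polygenelubricants", 4);
        System.out.println(p); // always in 0..3, never negative
    }
}
```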

  5. Combiner • Reduces the amount of intermediate data before it is sent to the reducers • Pre-aggregation • The interface is exactly the same as the reducer's

  6. Combiner
  Example:
  public class MyCombiner extends MapReduceBase implements Reducer<Text, Text, Text, Text> {
    public void reduce(Text key, Iterator<Text> values,
                       OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
      // your logic
    }
  }
  The input key/value should be the same type as the map output key/value; the output key/value should be the same type as the reducer input key/value.
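To see what pre-aggregation buys you, here is a rough stand-alone sketch (plain Java, not the Hadoop API) of a word-count combiner: six map output records collapse to one partial sum per distinct word before anything is sent over the network:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CombinerDemo {
    // Stand-in for a combiner: sum counts per word locally, so only one
    // (word, partialSum) record per distinct word leaves the map task.
    static Map<String, Integer> combine(List<String> mapOutputKeys) {
        Map<String, Integer> partial = new HashMap<>();
        for (String word : mapOutputKeys) {
            partial.merge(word, 1, Integer::sum);
        }
        return partial;
    }

    public static void main(String[] args) {
        List<String> mapped = List.of("a", "b", "a", "a", "b", "c");
        Map<String, Integer> combined = combine(mapped);
        // 6 map output records collapse to 3 combined records
        System.out.println(combined.size()); // 3
    }
}
```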

  7. Parameter • Cluster-level parameters (e.g. HDFS block size) • Job-specific parameters (e.g. number of reducers, map output buffer size) • Configurable; important for job performance • User-defined parameters • Used to pass information from the driver to mappers/reducers • Help make your mappers/reducers more generic

  8. Parameter
  JobConf conf = new JobConf(Driver.class);
  conf.setNumReduceTasks(10); // set the number of reducers via a built-in setter
  conf.set("io.sort.mb", "200"); // set the map output buffer size by parameter name
  conf.set("delimiter", "\t"); // set a user-defined parameter
  int numReducers = conf.getNumReduceTasks(); // get a value via a built-in getter
  String buffSize = conf.get("io.sort.mb", "200"); // get a value by parameter name (with a default)
  String delimiter = conf.get("delimiter", "\t"); // get a user-defined parameter (with a default)
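The get-with-default lookup used by conf.get(name, default) follows the same semantics as java.util.Properties, which makes it easy to try outside a cluster. A small stand-alone sketch (the parameter names are illustrative, not special to Hadoop):

```java
import java.util.Properties;

public class ParamDemo {
    public static void main(String[] args) {
        Properties conf = new Properties();
        conf.setProperty("delimiter", ",");          // explicitly set

        // getProperty(name, default) returns the default only when the
        // name was never set, mirroring JobConf.get(name, default).
        String delim = conf.getProperty("delimiter", "\t");  // set value wins
        String buf = conf.getProperty("io.sort.mb", "100");  // falls back to default

        System.out.println(delim + " " + buf);
    }
}
```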

  9. Parameter • There are some built-in parameters managed by Hadoop. We are not supposed to change them, but we can read them • String inputFile = jobConf.get("map.input.file"); • Gets the path to the current input • Used in joining datasets
  * For the new API, you should use:
  FileSplit split = (FileSplit) context.getInputSplit();
  String inputFile = split.getPath().toString();

  10. More about Hadoop • Identity Mapper/Reducer • Output == input; no modification • Why do we need map/reduce functions without any logic in them? • Sorting! • More generally, whenever you only want the basic functionality provided by Hadoop (e.g. sorting/grouping)
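A rough stand-alone illustration (plain Java, not the Hadoop API) of why identity functions are enough for sorting: the framework sorts map output by key during the shuffle, so passing records through unchanged still yields sorted output. A TreeMap plays the role of the shuffle here:

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;

public class IdentitySortDemo {
    public static void main(String[] args) {
        // Identity "map": emit each record unchanged
        Function<String, String> identityMap = Function.identity();

        List<String> records = List.of("banana", "apple", "cherry");

        // The shuffle sorts map output by key; a TreeMap stands in for it.
        Map<String, String> shuffled = new TreeMap<>();
        for (String r : records) {
            shuffled.put(identityMap.apply(r), r);
        }

        // Identity "reduce": keys arrive already sorted
        System.out.println(shuffled.keySet()); // [apple, banana, cherry]
    }
}
```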

  11. More about Hadoop • How is the number of splits determined? • If a file is large enough and splittable, it is split into multiple pieces (split size = block size) • If a file is non-splittable, there is only one split • If a file is small (smaller than a block), there is one split per file, unless...
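The default rule above (split size = block size) amounts to ceiling division of the file size by the block size. A back-of-the-envelope sketch in plain Java, with made-up sizes:

```java
public class SplitCountDemo {
    // Default rule sketched above: one split per full or partial block
    static long numSplits(long fileSize, long blockSize) {
        return (fileSize + blockSize - 1) / blockSize; // ceiling division
    }

    public static void main(String[] args) {
        long block = 128L * 1024 * 1024;                            // 128 MB block size
        System.out.println(numSplits(1024L * 1024 * 1024, block));  // 1 GB file -> 8 splits
        System.out.println(numSplits(10L * 1024 * 1024, block));    // 10 MB file -> 1 split
    }
}
```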

  12. More about Hadoop • CombineFileInputFormat • Merges multiple small files into one split, which is then processed by one mapper • Saves mapper slots and reduces overhead • Other options for handling small files? • hadoop fs -getmerge src dest
