
Beyond map/reduce functions: partitioner, combiner and parameter configuration

Gang Luo, Sept. 9, 2010


Presentation Transcript


  1. Gang Luo, Sept. 9, 2010. Beyond map/reduce functions: partitioner, combiner and parameter configuration

  2. Partitioner • Determines which reducer/partition a record should go to • Given the key, the value and the number of partitions, returns an integer • Partition: (K2, V2, #Partitions) → integer

  3. Partitioner
  Interface:
  public interface Partitioner<K2, V2> extends JobConfigurable {
    int getPartition(K2 key, V2 value, int numPartitions);
  }
  Implementation:
  public class MyPartitioner<K2, V2> implements Partitioner<K2, V2> {
    public int getPartition(K2 key, V2 value, int numPartitions) {
      // your logic!
    }
  }

  4. Partitioner
  Example:
  public class MyPartitioner implements Partitioner<Text, Text> {
    public int getPartition(Text key, Text value, int numPartitions) {
      int hashCode = key.hashCode();
      // mask the sign bit so the index is never negative
      int partitionIndex = (hashCode & Integer.MAX_VALUE) % numPartitions;
      return partitionIndex;
    }
  }
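The modulo logic in the example can be exercised outside Hadoop. A minimal stand-alone sketch (plain Java, hypothetical key values) showing why the sign mask matters when hashCode() is negative:

```java
public class PartitionDemo {
    // Same logic as getPartition above: mask the sign bit so the
    // result is always in [0, numPartitions).
    static int partitionFor(String key, int numPartitions) {
        return (key.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }

    public static void main(String[] args) {
        // "polygenelubricants".hashCode() is Integer.MIN_VALUE, so a plain
        // % would produce a negative (invalid) partition index.
        int p = partitionFor("polygenelubricants", 4);
        System.out.println(p); // always in 0..3, never negative
    }
}
```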

  5. Combiner • Reduces the amount of intermediate data before it is sent to the reducers • Pre-aggregation • The interface is exactly the same as the reducer's

  6. Combiner
  Example:
  public class MyCombiner extends MapReduceBase implements Reducer<Text, Text, Text, Text> {
    public void reduce(Text key, Iterator<Text> values,
                       OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
      // your logic
    }
  }
  The input key/value should be the same type as the map output key/value; the output key/value should be the same type as the reducer input key/value.
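To see what pre-aggregation buys you, here is a rough stand-alone sketch (plain Java, not the Hadoop API) of a word-count combiner: six map output records collapse to one partial sum per distinct word before anything is sent over the network:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CombinerDemo {
    // Stand-in for a combiner: sum counts per word locally, so only one
    // (word, partialSum) record per distinct word leaves the map task.
    static Map<String, Integer> combine(List<String> mapOutputKeys) {
        Map<String, Integer> partial = new HashMap<>();
        for (String word : mapOutputKeys) {
            partial.merge(word, 1, Integer::sum);
        }
        return partial;
    }

    public static void main(String[] args) {
        List<String> mapped = List.of("a", "b", "a", "a", "b", "c");
        Map<String, Integer> combined = combine(mapped);
        // 6 map output records collapse to 3 combined records
        System.out.println(combined.size()); // 3
    }
}
```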

  7. Parameter • Cluster-level parameters (e.g. HDFS block size) • Job-specific parameters (e.g. number of reducers, map output buffer size) • Configurable; important for job performance • User-defined parameters • Used to pass information from the driver to mappers/reducers • Help make your mappers/reducers more generic

  8. Parameter
  JobConf conf = new JobConf(Driver.class);
  conf.setNumReduceTasks(10); // set the number of reducers via a built-in setter
  conf.set("io.sort.mb", "200"); // set the map output buffer size by parameter name
  conf.set("delimiter", "\t"); // set a user-defined parameter
  int numReducers = conf.getNumReduceTasks(); // get a value via a built-in getter
  String buffSize = conf.get("io.sort.mb", "200"); // get a value by parameter name (with a default)
  String delimiter = conf.get("delimiter", "\t"); // get a user-defined parameter (with a default)
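The get-with-default lookup used by conf.get(name, default) follows the same semantics as java.util.Properties, which makes it easy to try outside a cluster. A small stand-alone sketch (the parameter names are illustrative, not special to Hadoop):

```java
import java.util.Properties;

public class ParamDemo {
    public static void main(String[] args) {
        Properties conf = new Properties();
        conf.setProperty("delimiter", ",");          // explicitly set

        // getProperty(name, default) returns the default only when the
        // name was never set, mirroring JobConf.get(name, default).
        String delim = conf.getProperty("delimiter", "\t");  // set value wins
        String buf = conf.getProperty("io.sort.mb", "100");  // falls back to default

        System.out.println(delim + " " + buf);
    }
}
```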

  9. Parameter • There are some built-in parameters managed by Hadoop. We are not supposed to change them, but we can read them • String inputFile = jobConf.get("map.input.file"); • Gets the path to the current input • Used in joining datasets
  * For the new API, you should use:
  FileSplit split = (FileSplit) context.getInputSplit();
  String inputFile = split.getPath().toString();

  10. More about Hadoop • Identity Mapper/Reducer • Output == input; no modification • Why do we need map/reduce functions without any logic in them? • Sorting! • More generally, whenever you only want the basic functionality provided by Hadoop (e.g. sorting/grouping)
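A rough stand-alone illustration (plain Java, not the Hadoop API) of why identity functions are enough for sorting: the framework sorts map output by key during the shuffle, so passing records through unchanged still yields sorted output. A TreeMap plays the role of the shuffle here:

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;

public class IdentitySortDemo {
    public static void main(String[] args) {
        // Identity "map": emit each record unchanged
        Function<String, String> identityMap = Function.identity();

        List<String> records = List.of("banana", "apple", "cherry");

        // The shuffle sorts map output by key; a TreeMap stands in for it.
        Map<String, String> shuffled = new TreeMap<>();
        for (String r : records) {
            shuffled.put(identityMap.apply(r), r);
        }

        // Identity "reduce": keys arrive already sorted
        System.out.println(shuffled.keySet()); // [apple, banana, cherry]
    }
}
```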

  11. More about Hadoop • How is the number of splits determined? • If a file is large enough and splittable, it is split into multiple pieces (split size = block size) • If a file is non-splittable, there is only one split • If a file is small (smaller than a block), there is one split per file, unless...
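The default rule above (split size = block size) amounts to ceiling division of the file size by the block size. A back-of-the-envelope sketch in plain Java, with made-up sizes:

```java
public class SplitCountDemo {
    // Default rule sketched above: one split per full or partial block
    static long numSplits(long fileSize, long blockSize) {
        return (fileSize + blockSize - 1) / blockSize; // ceiling division
    }

    public static void main(String[] args) {
        long block = 128L * 1024 * 1024;                            // 128 MB block size
        System.out.println(numSplits(1024L * 1024 * 1024, block));  // 1 GB file -> 8 splits
        System.out.println(numSplits(10L * 1024 * 1024, block));    // 10 MB file -> 1 split
    }
}
```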

  12. More about Hadoop • CombineFileInputFormat • Merges multiple small files into one split, which is then processed by one mapper • Saves mapper slots and reduces overhead • Other options for handling small files? • hadoop fs -getmerge src dest
