
MapReduce Programming


Presentation Transcript


  1. MapReduce Programming Yue-Shan Chang

  2. [Figure: MapReduce execution overview] The user program forks a master and worker processes (1). The master assigns map tasks and reduce tasks to workers (2). Map workers read their input splits (3) and write intermediate data to local disk (4). Reduce workers remotely read the intermediate files (5) and write the final output files (6). Data flow: input files → map phase → intermediate files (on local disk) → reduce phase → output files.

  3. MapReduce Program Structure class MapReduce { class Mapper … { /* map code */ } class Reducer … { /* reduce code */ } main() { /* main program setup area */ JobConf conf = new JobConf(MR.class); /* other configuration-parameter code */ } }
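
A minimal, hedged sketch of this structure using the old org.apache.hadoop.mapred API; the class name MyMapReduce, the word-count logic, and the /input and /output paths are illustrative assumptions, not part of the slide.

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class MyMapReduce {

  // Mapper: emits (word, 1) for every whitespace-separated token in a line
  public static class MyMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      for (String token : value.toString().split("\\s+")) {
        if (!token.isEmpty()) output.collect(new Text(token), ONE);
      }
    }
  }

  // Reducer: sums the counts emitted for each word
  public static class MyReducer extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      int sum = 0;
      while (values.hasNext()) sum += values.next().get();
      output.collect(key, new IntWritable(sum));
    }
  }

  // Main program setup area
  public static void main(String[] args) throws IOException {
    JobConf conf = new JobConf(MyMapReduce.class);
    conf.setJobName("my-mapreduce");
    conf.setMapperClass(MyMapper.class);
    conf.setReducerClass(MyReducer.class);
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);
    FileInputFormat.setInputPaths(conf, new Path("/input"));
    FileOutputFormat.setOutputPath(conf, new Path("/output"));
    JobClient.runJob(conf);  // submit the job and wait for completion
  }
}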

  4. MapReduce Job

  5. Handled parts

  6. Configuration of a Job • JobConf object • JobConf is the primary interface for a user to describe a map-reduce job to the Hadoop framework for execution. • JobConf typically specifies the Mapper, combiner (if any), Partitioner, Reducer, InputFormat and OutputFormat implementations to be used. • It also indicates the set of input files (setInputPaths(JobConf, Path...)/addInputPath(JobConf, Path), or setInputPaths(JobConf, String)/addInputPaths(JobConf, String)) and where the output files should be written (setOutputPath(Path)).
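
Building on the sketch after slide 3, a hedged example of the JobConf calls listed above; MyJob, MyMapper, MyReducer and the paths are assumed names.

JobConf conf = new JobConf(MyJob.class);
conf.setJobName("example-job");
conf.setMapperClass(MyMapper.class);              // Mapper implementation
conf.setCombinerClass(MyReducer.class);           // optional combiner (often the reducer class)
conf.setPartitionerClass(HashPartitioner.class);  // partitioner (HashPartitioner is the default)
conf.setReducerClass(MyReducer.class);            // Reducer implementation
conf.setInputFormat(TextInputFormat.class);       // InputFormat implementation
conf.setOutputFormat(TextOutputFormat.class);     // OutputFormat implementation
FileInputFormat.setInputPaths(conf, new Path("/data/in"));    // set of input files
FileOutputFormat.setOutputPath(conf, new Path("/data/out"));  // output directory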

  7. Configuration of a Job

  8. Input Splitting • An input split will normally be a contiguous group of records from a single input file. • If the number of requested map tasks is larger than the number of files, or the individual files are larger than the suggested fragment size, multiple input splits may be constructed from each input file. • The user has considerable control over the number of input splits.

  9. Specifying Input Formats • The Hadoop framework provides a large variety of input formats. • KeyValueTextInputFormat: Key/value pairs, one per line. • TextInputFormat: The key is the byte offset of the line, and the value is the line. • NLineInputFormat: Similar to KeyValueTextInputFormat, but the splits are based on N lines of input rather than Y bytes of input. • MultiFileInputFormat: An abstract class that lets the user implement an input format that aggregates multiple files into one split. • SequenceFileInputFormat: The input file is a Hadoop sequence file, containing serialized key/value pairs.
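
A short sketch of pairing an input format with the matching mapper input types, reusing the conf object from the earlier examples:

// TextInputFormat (the default): key = LongWritable byte offset, value = Text line
conf.setInputFormat(TextInputFormat.class);
// ... the mapper then implements Mapper<LongWritable, Text, K2, V2>

// KeyValueTextInputFormat: key and value are both Text,
// split at the first tab character on each line
// conf.setInputFormat(KeyValueTextInputFormat.class);
// ... the mapper then implements Mapper<Text, Text, K2, V2>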

  10. Specifying Input Formats

  11. Setting the Output Parameters • The framework requires that the output parameters be configured, even if the job will not produce any output. • The framework will collect the output from the specified tasks and place it into the configured output directory.
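
A hedged sketch of the output-side calls, assuming Text keys, IntWritable values, and an assumed output path:

conf.setOutputKeyClass(Text.class);            // key type of the final output
conf.setOutputValueClass(IntWritable.class);   // value type of the final output
conf.setOutputFormat(TextOutputFormat.class);  // how the output files are written
FileOutputFormat.setOutputPath(conf, new Path("/data/out"));  // directory must not already exist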

  12. Setting the Output Parameters

  13. A Simple Map Function: IdentityMapper
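
The slide's code is not captured in this transcript; the following sketch shows the essence of the bundled org.apache.hadoop.mapred.lib.IdentityMapper, which simply forwards every record unchanged (imports as in the sketch after slide 3):

public class IdentityMapper<K, V> extends MapReduceBase
    implements Mapper<K, V, K, V> {
  // Pass each (key, value) pair straight through to the output
  public void map(K key, V val,
                  OutputCollector<K, V> output, Reporter reporter) throws IOException {
    output.collect(key, val);
  }
}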

  14. A Simple Reduce Function: IdentityReducer

  15. A Simple Reduce Function: IdentityReducer
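
As with the mapper, the slide's code is not captured here; this sketch shows the essence of org.apache.hadoop.mapred.lib.IdentityReducer (imports as in the sketch after slide 3):

public class IdentityReducer<K, V> extends MapReduceBase
    implements Reducer<K, V, K, V> {
  // Emit every value in the group unchanged, one output record per input value
  public void reduce(K key, Iterator<V> values,
                     OutputCollector<K, V> output, Reporter reporter) throws IOException {
    while (values.hasNext()) {
      output.collect(key, values.next());
    }
  }
}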

  16. Configuring the Reduce Phase • the user must supply the framework with five pieces of information • The number of reduce tasks; if zero, no reduce phase is run • The class supplying the reduce method • The input key and value types for the reduce task; by default, the same as the reduce output • The output key and value types for the reduce task • The output file type for the reduce task output
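
The five pieces map onto JobConf calls roughly as follows; the class names and types are assumptions for illustration:

conf.setNumReduceTasks(4);                       // number of reduce tasks (0 = no reduce phase)
conf.setReducerClass(MyReducer.class);           // class supplying the reduce method
conf.setMapOutputKeyClass(Text.class);           // reduce input key type (defaults to the output key type)
conf.setMapOutputValueClass(IntWritable.class);  // reduce input value type (defaults to the output value type)
conf.setOutputKeyClass(Text.class);              // reduce output key type
conf.setOutputValueClass(IntWritable.class);     // reduce output value type
conf.setOutputFormat(TextOutputFormat.class);    // output file type for the reduce output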

  17. How Many Maps? • The number of maps is usually driven by the total size of the inputs, that is, the total number of blocks of the input files. • The right level of parallelism for maps seems to be around 10-100 maps per node. • It is best if the maps take at least a minute to execute. • setNumMapTasks(int) can be used to hint at a higher number of maps.

  18. Reducer • Reducer reduces a set of intermediate values which share a key to a smaller set of values. • Reducer has 3 primary phases: shuffle, sort and reduce. • Shuffle • Input to the Reducer is the sorted output of the mappers. In this phase the framework fetches the relevant partition of the output of all the mappers, via HTTP. • Sort • The framework groups Reducer inputs by keys (since different mappers may have output the same key) in this stage • The shuffle and sort phases occur simultaneously; while map-outputs are being fetched they are merged.

  19. How Many Reduces? • The right number of reduces seems to be 0.95 or 1.75 multiplied by (<no. of nodes> * mapred.tasktracker.reduce.tasks.maximum). • With 0.95 all of the reduces can launch immediately and start transferring map outputs as the maps finish. • With 1.75 the faster nodes will finish their first round of reduces and launch a second wave of reduces doing a much better job of load balancing.
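
A quick worked example under assumed numbers: with 10 nodes and mapred.tasktracker.reduce.tasks.maximum = 2, the cluster has 20 reduce slots, so 0.95 gives about 19 reduces (one wave) and 1.75 gives 35 (two waves, with faster nodes taking a second round):

int reduceSlots = 10 * 2;                            // assumed: 10 nodes x 2 reduce slots per node
conf.setNumReduceTasks((int) (0.95 * reduceSlots));  // 19 reduces, launched in a single wave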

  20. How Many Reduces? • Increasing the number of reduces increases the framework overhead, but improves load balancing and lowers the cost of failures. • Reducer NONE • It is legal to set the number of reduce-tasks to zero if no reduction is desired. • In this case the outputs of the map-tasks go directly to the FileSystem, into the output path set by setOutputPath(Path). • The framework does not sort the map-outputs before writing them out to the FileSystem.
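
A sketch of a map-only ("Reducer NONE") job under these rules; the mapper class and output path are assumptions:

conf.setNumReduceTasks(0);                     // no reduce phase
conf.setMapperClass(MyMapper.class);
FileOutputFormat.setOutputPath(conf, new Path("/data/map-only-out"));
// map outputs go directly to the FileSystem, unsorted, one part file per map task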

  21. Reporter • Reporter is a facility for Map/Reduce applications to report progress, set application-level status messages and update Counters. • Mapper and Reducer implementations can use the Reporter to report progress or just indicate that they are alive.
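
A hedged sketch of a map method that uses the Reporter; it is a fragment meant to live inside a Mapper like the one after slide 3, and the counter enum, status text, and empty-record check are assumptions:

enum MyCounters { BAD_RECORDS }

public void map(LongWritable key, Text value,
                OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
  reporter.setStatus("processing offset " + key.get());  // application-level status message
  if (value.getLength() == 0) {
    reporter.incrCounter(MyCounters.BAD_RECORDS, 1);      // update a custom counter
    return;
  }
  reporter.progress();                                    // signal that the task is still alive
  output.collect(value, new IntWritable(1));
}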

  22. JobTracker • JobTracker is the central location for submitting and tracking MR jobs in a network environment. • JobClient is the primary interface by which user-job interacts with the JobTracker • provides facilities to submit jobs, track their progress, access component-tasks' reports and logs, get the Map/Reduce cluster's status information and so on.
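
A sketch of the two most common JobClient uses, assuming the conf object built earlier:

JobClient client = new JobClient(conf);
ClusterStatus status = client.getClusterStatus();  // Map/Reduce cluster status information
System.out.println("task trackers: " + status.getTaskTrackers());

RunningJob result = JobClient.runJob(conf);        // submit and block until completion,
                                                   // printing progress as the job runs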

  23. Job Submission and Monitoring • The job submission process involves: • Checking the input and output specifications of the job. • Computing the InputSplit values for the job. • Setting up the requisite accounting information for the DistributedCache of the job, if necessary. • Copying the job's jar and configuration to the Map/Reduce system directory on the FileSystem. • Submitting the job to the JobTracker and optionally monitoring its status.
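
A sketch of non-blocking submission with simple monitoring; the polling interval is an assumption:

JobClient client = new JobClient(conf);
RunningJob running = client.submitJob(conf);   // submit to the JobTracker without blocking
while (!running.isComplete()) {
  System.out.printf("map %.0f%%  reduce %.0f%%%n",
      running.mapProgress() * 100, running.reduceProgress() * 100);
  try { Thread.sleep(5000); } catch (InterruptedException ignored) { }  // poll every 5 s (assumed)
}
System.out.println(running.isSuccessful() ? "job succeeded" : "job failed");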

  24. MapReduce Details for Multimachine Clusters

  25. Introduction • Why? • datasets that can’t fit on a single machine, • have time constraints that are impossible to satisfy with a small number of machines, • need to rapidly scale the computing power applied to a problem due to varying input set sizes.

  26. Requirements for Successful MapReduce Jobs • Mapper • Ingest the input and process the input record, sending forward the records that can be passed to the reduce task or to the final output directly. • Reducer • Accept the key and value groups that passed through the mapper, and generate the final output. • The job must be configured with the location and type of the input data, the mapper class to use, the number of reduce tasks required, and the reducer class and I/O types.

  27. Requirements for Successful MapReduce Jobs • The TaskTracker service will actually run your map and reduce tasks, and the JobTracker service will distribute the tasks and their input splits to the various trackers. • The cluster must be configured with the nodes that will run the TaskTrackers, and with the number of TaskTrackers to run per node.

  28. Requirements for Successful MapReduce Jobs • There are three levels of configuration to address when configuring MapReduce on your cluster: • configure the machines, • the Hadoop MapReduce framework, • and the jobs themselves

  29. Launching MapReduce Jobs • launch the preceding example from the command line > bin/hadoop [-libjars jar1.jar,jar2.jar,jar3.jar] jar myjar.jar MyClass

  30. MapReduce-Specific Configuration for Each Machine in a Cluster • install any standard JARs that your application uses • It is probable that your applications will have a runtime environment that is deployed from a configuration management application, which you will also need to deploy to each machine. • The machines will need to have enough RAM for the Hadoop Core services plus the RAM required to run your tasks. • The conf/slaves file should have the set of machines to serve as TaskTracker nodes

  31. DistributedCache • distributes application-specific, large, read-only files efficiently • a facility provided by the Map/Reduce framework to cache files (text, archives, jars and so on) needed by applications. • The framework will copy the necessary files to the slave node before any tasks for the job are executed on that node
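
A sketch of distributing a read-only lookup file; the HDFS path and the use inside configure() are assumptions:

// At job-setup time: register the file with the cache
DistributedCache.addCacheFile(URI.create("/data/lookup/terms.txt"), conf);

// Inside a task, e.g. in configure(JobConf job): locate the node-local copy
Path[] localFiles = DistributedCache.getLocalCacheFiles(job);
// localFiles[0] now points at the local copy of terms.txt on this node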

  32. Adding Resources to the Task Classpath • Methods • JobConf.setJar(String jar): Sets the user JAR for the MapReduce job. • JobConf.setJarByClass(Class cls): Determines the JAR that contains the class cls and calls JobConf.setJar(jar) with that JAR. • DistributedCache.addArchiveToClassPath(Path archive, Configuration conf): Adds an archive path to the current set of classpath entries.
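
A brief sketch combining these calls; MyJob and the archive path are assumptions:

conf.setJarByClass(MyJob.class);   // ship the JAR containing MyJob with the job
// add an extra archive already on the FileSystem to every task's classpath
DistributedCache.addArchiveToClassPath(new Path("/libs/extra-lib.jar"), conf);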

  33. Configuring the Hadoop Core Cluster Information • Setting the Default File System URI • You can also use the JobConf object to set the default file system: • conf.set( "fs.default.name", "hdfs://NamenodeHostname:PORT");

  34. Configuring the Hadoop Core Cluster Information • Setting the JobTracker Location • use the JobConf object to set the JobTracker information: • conf.set( "mapred.job.tracker", "JobtrackerHostname:PORT");
