
MapReduce Programming


Presentation Transcript


  1. MapReduce Programming Yue-Shan Chang

  2. [Figure: MapReduce execution overview] The user program forks a master and worker processes (1). The master assigns map tasks and reduce tasks to workers (2). Map workers read their input splits (3) and write intermediate data to local disk (4). Reduce workers remotely read the intermediate files (5) and write the final output files (6). Data flow: input files → map phase → intermediate files (on local disk) → reduce phase → output files.

  3. MapReduce Program Structure class MapReduce { class Mapper … { /* map code */ } class Reducer … { /* reduce code */ } main() { /* main program setup area */ JobConf conf = new JobConf(MR.class); /* other configuration-parameter code */ } }
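
A minimal, hedged sketch of this structure using the old org.apache.hadoop.mapred API; the class name MyMapReduce, the word-count logic, and the /input and /output paths are illustrative assumptions, not part of the slide.

import java.io.IOException;
import java.util.Iterator;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.*;

public class MyMapReduce {

  // Mapper: emits (word, 1) for every whitespace-separated token in a line
  public static class MyMapper extends MapReduceBase
      implements Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    public void map(LongWritable key, Text value,
                    OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      for (String token : value.toString().split("\\s+")) {
        if (!token.isEmpty()) output.collect(new Text(token), ONE);
      }
    }
  }

  // Reducer: sums the counts emitted for each word
  public static class MyReducer extends MapReduceBase
      implements Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
                       OutputCollector<Text, IntWritable> output, Reporter reporter)
        throws IOException {
      int sum = 0;
      while (values.hasNext()) sum += values.next().get();
      output.collect(key, new IntWritable(sum));
    }
  }

  // Main program setup area
  public static void main(String[] args) throws IOException {
    JobConf conf = new JobConf(MyMapReduce.class);
    conf.setJobName("my-mapreduce");
    conf.setMapperClass(MyMapper.class);
    conf.setReducerClass(MyReducer.class);
    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);
    FileInputFormat.setInputPaths(conf, new Path("/input"));
    FileOutputFormat.setOutputPath(conf, new Path("/output"));
    JobClient.runJob(conf);  // submit the job and wait for completion
  }
}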

  4. MapReduce Job

  5. Handled parts

  6. Configuration of a Job • JobConf object • JobConf is the primary interface for a user to describe a map-reduce job to the Hadoop framework for execution. • JobConf typically specifies the Mapper, combiner (if any), Partitioner, Reducer, InputFormat and OutputFormat implementations to be used. • It also indicates the set of input files (setInputPaths(JobConf, Path...)/addInputPath(JobConf, Path), or setInputPaths(JobConf, String)/addInputPaths(JobConf, String)) and where the output files should be written (setOutputPath(Path)).
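
Building on the sketch after slide 3, a hedged example of the JobConf calls listed above; MyJob, MyMapper, MyReducer and the paths are assumed names.

JobConf conf = new JobConf(MyJob.class);
conf.setJobName("example-job");
conf.setMapperClass(MyMapper.class);              // Mapper implementation
conf.setCombinerClass(MyReducer.class);           // optional combiner (often the reducer class)
conf.setPartitionerClass(HashPartitioner.class);  // partitioner (HashPartitioner is the default)
conf.setReducerClass(MyReducer.class);            // Reducer implementation
conf.setInputFormat(TextInputFormat.class);       // InputFormat implementation
conf.setOutputFormat(TextOutputFormat.class);     // OutputFormat implementation
FileInputFormat.setInputPaths(conf, new Path("/data/in"));    // set of input files
FileOutputFormat.setOutputPath(conf, new Path("/data/out"));  // output directory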

  7. Configuration of a Job

  8. Input Splitting • An input split will normally be a contiguous group of records from a single input file. • If the number of requested map tasks is larger than the number of files, or the individual files are larger than the suggested fragment size, multiple input splits may be constructed from each input file. • The user has considerable control over the number of input splits.

  9. Specifying Input Formats • The Hadoop framework provides a large variety of input formats. • KeyValueTextInputFormat: Key/value pairs, one per line. • TextInputFormat: The key is the byte offset of the line, and the value is the line. • NLineInputFormat: Similar to KeyValueTextInputFormat, but the splits are based on N lines of input rather than Y bytes of input. • MultiFileInputFormat: An abstract class that lets the user implement an input format that aggregates multiple files into one split. • SequenceFileInputFormat: The input file is a Hadoop sequence file, containing serialized key/value pairs.
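
A short sketch of pairing an input format with the matching mapper input types, reusing the conf object from the earlier examples:

// TextInputFormat (the default): key = LongWritable byte offset, value = Text line
conf.setInputFormat(TextInputFormat.class);
// ... the mapper then implements Mapper<LongWritable, Text, K2, V2>

// KeyValueTextInputFormat: key and value are both Text,
// split at the first tab character on each line
// conf.setInputFormat(KeyValueTextInputFormat.class);
// ... the mapper then implements Mapper<Text, Text, K2, V2>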

  10. Specifying Input Formats

  11. Setting the Output Parameters • The framework requires that the output parameters be configured, even if the job will not produce any output. • The framework will collect the output from the specified tasks and place it into the configured output directory.
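
A hedged sketch of the output-side calls, assuming Text keys, IntWritable values, and an assumed output path:

conf.setOutputKeyClass(Text.class);            // key type of the final output
conf.setOutputValueClass(IntWritable.class);   // value type of the final output
conf.setOutputFormat(TextOutputFormat.class);  // how the output files are written
FileOutputFormat.setOutputPath(conf, new Path("/data/out"));  // directory must not already exist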

  12. Setting the Output Parameters

  13. A Simple Map Function: IdentityMapper
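
The slide's code is not captured in this transcript; the following sketch shows the essence of the bundled org.apache.hadoop.mapred.lib.IdentityMapper, which simply forwards every record unchanged (imports as in the sketch after slide 3):

public class IdentityMapper<K, V> extends MapReduceBase
    implements Mapper<K, V, K, V> {
  // Pass each (key, value) pair straight through to the output
  public void map(K key, V val,
                  OutputCollector<K, V> output, Reporter reporter) throws IOException {
    output.collect(key, val);
  }
}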

  14. A Simple Reduce Function: IdentityReducer

  15. A Simple Reduce Function: IdentityReducer
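
As with the mapper, the slide's code is not captured here; this sketch shows the essence of org.apache.hadoop.mapred.lib.IdentityReducer (imports as in the sketch after slide 3):

public class IdentityReducer<K, V> extends MapReduceBase
    implements Reducer<K, V, K, V> {
  // Emit every value in the group unchanged, one output record per input value
  public void reduce(K key, Iterator<V> values,
                     OutputCollector<K, V> output, Reporter reporter) throws IOException {
    while (values.hasNext()) {
      output.collect(key, values.next());
    }
  }
}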

  16. Configuring the Reduce Phase • the user must supply the framework with five pieces of information • The number of reduce tasks; if zero, no reduce phase is run • The class supplying the reduce method • The input key and value types for the reduce task; by default, the same as the reduce output • The output key and value types for the reduce task • The output file type for the reduce task output
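
The five pieces map onto JobConf calls roughly as follows; the class names and types are assumptions for illustration:

conf.setNumReduceTasks(4);                       // number of reduce tasks (0 = no reduce phase)
conf.setReducerClass(MyReducer.class);           // class supplying the reduce method
conf.setMapOutputKeyClass(Text.class);           // reduce input key type (defaults to the output key type)
conf.setMapOutputValueClass(IntWritable.class);  // reduce input value type (defaults to the output value type)
conf.setOutputKeyClass(Text.class);              // reduce output key type
conf.setOutputValueClass(IntWritable.class);     // reduce output value type
conf.setOutputFormat(TextOutputFormat.class);    // output file type for the reduce output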

  17. How Many Maps? • The number of maps is usually driven by the total size of the inputs, that is, the total number of blocks of the input files. • The right level of parallelism for maps seems to be around 10-100 maps per node. • It is best if the maps take at least a minute to execute. • setNumMapTasks(int) can be used to hint at a higher number of maps.

  18. Reducer • Reducer reduces a set of intermediate values which share a key to a smaller set of values. • Reducer has 3 primary phases: shuffle, sort and reduce. • Shuffle • Input to the Reducer is the sorted output of the mappers. In this phase the framework fetches the relevant partition of the output of all the mappers, via HTTP. • Sort • The framework groups Reducer inputs by keys (since different mappers may have output the same key) in this stage • The shuffle and sort phases occur simultaneously; while map-outputs are being fetched they are merged.

  19. How Many Reduces? • The right number of reduces seems to be 0.95 or 1.75 multiplied by (<no. of nodes> * mapred.tasktracker.reduce.tasks.maximum). • With 0.95 all of the reduces can launch immediately and start transferring map outputs as the maps finish. • With 1.75 the faster nodes will finish their first round of reduces and launch a second wave of reduces doing a much better job of load balancing.
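
A quick worked example under assumed numbers: with 10 nodes and mapred.tasktracker.reduce.tasks.maximum = 2, the cluster has 20 reduce slots, so 0.95 gives about 19 reduces (one wave) and 1.75 gives 35 (two waves, with faster nodes taking a second round):

int reduceSlots = 10 * 2;                            // assumed: 10 nodes x 2 reduce slots per node
conf.setNumReduceTasks((int) (0.95 * reduceSlots));  // 19 reduces, launched in a single wave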

  20. How Many Reduces? • Increasing the number of reduces increases the framework overhead, but improves load balancing and lowers the cost of failures. • Reducer NONE • It is legal to set the number of reduce-tasks to zero if no reduction is desired. • In this case the outputs of the map-tasks go directly to the FileSystem, into the output path set by setOutputPath(Path). • The framework does not sort the map-outputs before writing them out to the FileSystem.
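
A sketch of a map-only ("Reducer NONE") job under these rules; the mapper class and output path are assumptions:

conf.setNumReduceTasks(0);                     // no reduce phase
conf.setMapperClass(MyMapper.class);
FileOutputFormat.setOutputPath(conf, new Path("/data/map-only-out"));
// map outputs go directly to the FileSystem, unsorted, one part file per map task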

  21. Reporter • Reporter is a facility for Map/Reduce applications to report progress, set application-level status messages and update Counters. • Mapper and Reducer implementations can use the Reporter to report progress or just indicate that they are alive.
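
A hedged sketch of a map method that uses the Reporter; it is a fragment meant to live inside a Mapper like the one after slide 3, and the counter enum, status text, and empty-record check are assumptions:

enum MyCounters { BAD_RECORDS }

public void map(LongWritable key, Text value,
                OutputCollector<Text, IntWritable> output, Reporter reporter) throws IOException {
  reporter.setStatus("processing offset " + key.get());  // application-level status message
  if (value.getLength() == 0) {
    reporter.incrCounter(MyCounters.BAD_RECORDS, 1);      // update a custom counter
    return;
  }
  reporter.progress();                                    // signal that the task is still alive
  output.collect(value, new IntWritable(1));
}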

  22. JobTracker • JobTracker is the central location for submitting and tracking MR jobs in a network environment. • JobClient is the primary interface by which user-job interacts with the JobTracker • provides facilities to submit jobs, track their progress, access component-tasks' reports and logs, get the Map/Reduce cluster's status information and so on.
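
A sketch of the two most common JobClient uses, assuming the conf object built earlier:

JobClient client = new JobClient(conf);
ClusterStatus status = client.getClusterStatus();  // Map/Reduce cluster status information
System.out.println("task trackers: " + status.getTaskTrackers());

RunningJob result = JobClient.runJob(conf);        // submit and block until completion,
                                                   // printing progress as the job runs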

  23. Job Submission and Monitoring • The job submission process involves: • Checking the input and output specifications of the job. • Computing the InputSplit values for the job. • Setting up the requisite accounting information for the DistributedCache of the job, if necessary. • Copying the job's jar and configuration to the Map/Reduce system directory on the FileSystem. • Submitting the job to the JobTracker and optionally monitoring its status.
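
A sketch of non-blocking submission with simple monitoring; the polling interval is an assumption:

JobClient client = new JobClient(conf);
RunningJob running = client.submitJob(conf);   // submit to the JobTracker without blocking
while (!running.isComplete()) {
  System.out.printf("map %.0f%%  reduce %.0f%%%n",
      running.mapProgress() * 100, running.reduceProgress() * 100);
  try { Thread.sleep(5000); } catch (InterruptedException ignored) { }  // poll every 5 s (assumed)
}
System.out.println(running.isSuccessful() ? "job succeeded" : "job failed");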

  24. MapReduce Details for Multimachine Clusters

  25. Introduction • Why? • datasets that can’t fit on a single machine, • have time constraints that are impossible to satisfy with a small number of machines, • need to rapidly scale the computing power applied to a problem due to varying input set sizes.

  26. Requirements for Successful MapReduce Jobs • Mapper • Ingest the input and process the input record, sending forward the records that can be passed to the reduce task or to the final output directly. • Reducer • Accept the key and value groups that passed through the mapper, and generate the final output. • The job must be configured with the location and type of the input data, the mapper class to use, the number of reduce tasks required, and the reducer class and I/O types.

  27. Requirements for Successful MapReduce Jobs • The TaskTracker service will actually run your map and reduce tasks, and the JobTracker service will distribute the tasks and their input splits to the various trackers. • The cluster must be configured with the nodes that will run the TaskTrackers, and with the number of TaskTrackers to run per node.

  28. Requirements for Successful MapReduce Jobs • There are three levels of configuration to address when configuring MapReduce on your cluster: • configure the machines, • the Hadoop MapReduce framework, • and the jobs themselves

  29. Launching MapReduce Jobs • launch the preceding example from the command line > bin/hadoop [-libjars jar1.jar,jar2.jar,jar3.jar] jar myjar.jar MyClass

  30. MapReduce-Specific Configuration for Each Machine in a Cluster • install any standard JARs that your application uses • It is probable that your applications will have a runtime environment that is deployed from a configuration management application, which you will also need to deploy to each machine. • The machines will need to have enough RAM for the Hadoop Core services plus the RAM required to run your tasks. • The conf/slaves file should have the set of machines to serve as TaskTracker nodes

  31. DistributedCache • distributes application-specific, large, read-only files efficiently • a facility provided by the Map/Reduce framework to cache files (text, archives, jars and so on) needed by applications. • The framework will copy the necessary files to the slave node before any tasks for the job are executed on that node
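
A sketch of distributing a read-only lookup file; the HDFS path and the use inside configure() are assumptions:

// At job-setup time: register the file with the cache
DistributedCache.addCacheFile(URI.create("/data/lookup/terms.txt"), conf);

// Inside a task, e.g. in configure(JobConf job): locate the node-local copy
Path[] localFiles = DistributedCache.getLocalCacheFiles(job);
// localFiles[0] now points at the local copy of terms.txt on this node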

  32. Adding Resources to the Task Classpath • Methods • JobConf.setJar(String jar): Sets the user JAR for the MapReduce job. • JobConf.setJarByClass(Class cls): Determines the JAR that contains the class cls and calls JobConf.setJar(jar) with that JAR. • DistributedCache.addArchiveToClassPath(Path archive, Configuration conf): Adds an archive path to the current set of classpath entries.
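
A brief sketch combining these calls; MyJob and the archive path are assumptions:

conf.setJarByClass(MyJob.class);   // ship the JAR containing MyJob with the job
// add an extra archive already on the FileSystem to every task's classpath
DistributedCache.addArchiveToClassPath(new Path("/libs/extra-lib.jar"), conf);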

  33. Configuring the Hadoop Core Cluster Information • Setting the Default File System URI • You can also use the JobConf object to set the default file system: • conf.set( "fs.default.name", "hdfs://NamenodeHostname:PORT");

  34. Configuring the Hadoop Core Cluster Information • Setting the JobTracker Location • use the JobConf object to set the JobTracker information: • conf.set( "mapred.job.tracker", "JobtrackerHostname:PORT");
