
MapReduce: Simplified Data Processing on Large Clusters J. Dean and S. Ghemawat (Google) OSDI 2004





Presentation Transcript


  1. MapReduce: Simplified Data Processing on Large Clusters
J. Dean and S. Ghemawat (Google), OSDI 2004
Shimin Chen, DISC Reading Group

  2. Motivation: Large Scale Data Processing
• Process lots of data to produce other derived data
• Input: crawled documents, web request logs, etc.
• Output: inverted indices, web page graph structure, top queries in a day, etc.
• Want to use hundreds or thousands of CPUs, but want to focus only on the functionality
• MapReduce hides the messy details in a library:
  • Parallelization
  • Data distribution
  • Fault tolerance
  • Load balancing

  3. Outline • Programming Model • Implementation • Refinements • Evaluation • Conclusion

  4. Programming Model
Input & output: each a set of key/value pairs
Programmer specifies two functions:
• map(in_key, in_value) -> list(out_key, intermediate_value)
  Processes an input key/value pair to generate intermediate pairs
• reduce(out_key, list(intermediate_value)) -> list(out_value)
  Given all intermediate values for a particular key, produces a set of merged output values (usually just one)
Inspired by similar primitives in LISP and other functional languages

  5. Example: Count Word Occurrences
map(String input_key, String input_value):
  // input_key: document name
  // input_value: document contents
  for each word w in input_value:
    EmitIntermediate(w, "1");

reduce(String output_key, Iterator intermediate_values):
  // output_key: a word
  // output_values: a list of counts
  int result = 0;
  for each v in intermediate_values:
    result += ParseInt(v);
  Emit(AsString(result));

  6. Looking at Actual Code (Appendix A)
#include "mapreduce/mapreduce.h"

// User's map function
class WordCounter : public Mapper {
 public:
  virtual void Map(const MapInput& input) {
    const string& text = input.value();
    const int n = text.size();
    for (int i = 0; i < n; ) {
      // Skip past leading whitespace
      while ((i < n) && isspace(text[i]))
        i++;
      // Find word end
      int start = i;
      while ((i < n) && !isspace(text[i]))
        i++;
      if (start < i)
        Emit(text.substr(start, i - start), "1");
    }
  }
};
REGISTER_MAPPER(WordCounter);

  7. // User's reduce function
class Adder : public Reducer {
  virtual void Reduce(ReduceInput* input) {
    // Iterate over all entries with the
    // same key and add the values
    int64 value = 0;
    while (!input->done()) {
      value += StringToInt(input->value());
      input->NextValue();
    }
    // Emit sum for input->key()
    Emit(IntToString(value));
  }
};
REGISTER_REDUCER(Adder);

  8. int main(int argc, char** argv) {
  ParseCommandLineFlags(argc, argv);
  MapReduceSpecification spec;
  // Store list of input files into "spec"
  for (int i = 1; i < argc; i++) {
    MapReduceInput* input = spec.add_input();
    input->set_format("text");
    input->set_filepattern(argv[i]);
    input->set_mapper_class("WordCounter");
  }
  // Specify the output files:
  //   /gfs/test/freq-00000-of-00100
  //   /gfs/test/freq-00001-of-00100
  MapReduceOutput* out = spec.output();
  out->set_filebase("/gfs/test/freq");
  out->set_num_tasks(100);
  out->set_format("text");
  out->set_reducer_class("Adder");
  // Optional: do partial sums within map tasks
  out->set_combiner_class("Adder");
  // Tuning parameters
  spec.set_machines(2000);
  spec.set_map_megabytes(100);
  spec.set_reduce_megabytes(100);
  // Now run it
  MapReduceResult result;
  if (!MapReduce(spec, &result)) abort();
  // Done: 'result' structure contains info about counters, time
  // taken, number of machines used, etc.
  return 0;
}

  9. More Examples
• Inverted index: word -> documents (see the sketch after this list)
  • Map: parses each document, emits <word, document ID>
  • Reduce: emits <word, list of document IDs>
• Distributed grep
  • Map: emits a line if the input document matches a given pattern
  • Reduce: identity function
• Distributed sort
  • Map: extracts the key from each record, emits <key, record>
  • Reduce: emits all pairs unchanged
  • Relies on the partitioning function and ordering guarantees (later in the talk)
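As a sketch only (not from the paper or these slides): an inverted-index mapper and reducer in the style of the Appendix A code. The paper only demonstrably exposes input.value() on MapInput, so the key() accessor for the document name and the SplitIntoWords() tokenizer are assumptions for illustration.

// Hypothetical inverted-index job in the Appendix A style.
class InvertedIndexMapper : public Mapper {
 public:
  virtual void Map(const MapInput& input) {
    const string& doc_id = input.key();    // assumed accessor: document name
    const string& text = input.value();
    // Emit <word, document ID> for every word in the document
    for (const string& word : SplitIntoWords(text))  // hypothetical tokenizer
      Emit(word, doc_id);
  }
};
REGISTER_MAPPER(InvertedIndexMapper);

class DocIdListReducer : public Reducer {
 public:
  virtual void Reduce(ReduceInput* input) {
    // Concatenate all document IDs seen for this word
    string ids;
    while (!input->done()) {
      if (!ids.empty()) ids += ",";
      ids += input->value();
      input->NextValue();
    }
    Emit(ids);  // output pair is <word, list of document IDs>
  }
};
REGISTER_REDUCER(DocIdListReducer);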

  10. MapReduce Jobs Run in August 2004 (Table 1)

  11. Outline • Programming Model • Implementation • Refinements • Evaluation • Conclusion

  12. Implementation Overview
Execution environment: Google cluster
• 100s/1000s of 2-CPU x86 machines, 2-4 GB of memory
• 100 Mbps or 1 Gbps Ethernet, but limited (average) bisection bandwidth
• Storage on local IDE disks
• GFS: distributed file system manages the data
• Job scheduling system: jobs made up of tasks, scheduler assigns tasks to machines

  13. Parallelization
• Map
  • Divide the input into M equal-sized splits
  • Each split is 16-64 MB
• Reduce
  • Partition the intermediate key space into R pieces: hash(intermediate_key) mod R (see the sketch below)
• Typical setting:
  • 2,000 machines
  • M = 200,000
  • R = 5,000
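A minimal sketch of the default partitioning step: the paper specifies only "hash(key) mod R", so the choice of std::hash here is an assumption.

#include <functional>
#include <string>

// Maps an intermediate key to one of R reduce partitions.
int Partition(const std::string& intermediate_key, int R) {
  return static_cast<int>(std::hash<std::string>{}(intermediate_key) % R);
}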

  14. Execution Overview (figure)
• (0) The user program calls mapreduce(spec, &result)
• The input is divided into M splits of 16-64 MB each
• Intermediate keys are partitioned into R regions by the partitioning function hash(intermediate_key) mod R
• Each reduce worker reads all intermediate data for its region and sorts it by intermediate key

  15. Timeline (figure)

  16. More Details
• Master state (see the sketch below):
  • Per map task: state (idle/in-progress/completed), locations of the R intermediate files, worker machine
  • Per reduce task: state (idle/in-progress/completed), worker machine
  • O(M+R) scheduling decisions, O(M*R) space
• Locality-preserving scheduling
  • Schedule a map task close to the input location
• Prefer fine-grain tasks
  • Dynamic load balancing
  • Speeds up recovery
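A hedged sketch of the per-task bookkeeping implied by the bullets above; the struct and field names are hypothetical, not from the paper.

#include <string>
#include <vector>

enum class TaskState { kIdle, kInProgress, kCompleted };

struct MapTaskInfo {
  TaskState state = TaskState::kIdle;
  // For completed map tasks: locations (and sizes) of the R
  // intermediate file regions, forwarded to reduce workers.
  std::vector<std::string> intermediate_file_locations;  // size R
  std::string worker;  // machine currently executing the task
};

struct ReduceTaskInfo {
  TaskState state = TaskState::kIdle;
  std::string worker;
};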

  17. Fault Tolerance via Re-Execution
On worker failure:
• Detect failure via periodic heartbeats
• Re-execute completed and in-progress map tasks (map output lives on the failed machine's local disk)
• Fine-grain tasks: the completed tasks can be re-executed on multiple machines quickly
• Re-execute in-progress reduce tasks (completed reduce output is already in the global file system)
• Task completion is committed through the master
Master failure:
• Could handle, but don't yet (master failure is unlikely)
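A hedged sketch of this re-execution rule, reusing the hypothetical structs from the slide-16 sketch; the function shape is an assumption, not the paper's code.

// Marks tasks on a failed worker as idle so the scheduler re-executes them.
void HandleWorkerFailure(const std::string& failed_worker,
                         std::vector<MapTaskInfo>& map_tasks,
                         std::vector<ReduceTaskInfo>& reduce_tasks) {
  for (MapTaskInfo& t : map_tasks) {
    // Map output lives on the failed worker's local disk, so both
    // completed and in-progress map tasks must be redone.
    if (t.worker == failed_worker && t.state != TaskState::kIdle)
      t.state = TaskState::kIdle;
  }
  for (ReduceTaskInfo& t : reduce_tasks) {
    // Completed reduce output is in the global file system, so only
    // in-progress reduce tasks must be redone.
    if (t.worker == failed_worker && t.state == TaskState::kInProgress)
      t.state = TaskState::kIdle;
  }
}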

  18. Outline • Programming Model • Implementation • Refinements • Evaluation • Conclusion

  19. Backup Tasks
• Problem: "stragglers"
  • A machine takes an unusually long time to complete one of the last few map or reduce tasks
  • E.g. a bad disk with frequent correctable errors, other jobs running on the machine, machine configuration problems, etc.
• Near the end of a phase, the master schedules backup executions of the remaining in-progress tasks (see the sketch below)
• Whichever copy finishes first "wins"
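A hedged sketch of the trigger for this heuristic, reusing the slide-16 structs; the paper does not specify an exact threshold, so the one below is an assumption.

#include <vector>

// True when only a handful of tasks in the phase are still outstanding.
bool NearEndOfPhase(const std::vector<MapTaskInfo>& tasks) {
  int remaining = 0;
  for (const MapTaskInfo& t : tasks)
    if (t.state != TaskState::kCompleted) remaining++;
  return remaining > 0 && remaining <= 5;  // threshold is an assumption
}

When this becomes true, the master would schedule a duplicate execution of each in-progress task on an idle worker and commit whichever copy completes first.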

  20. Combiner Function
• Purpose: reduce the data sent over the network
• The combiner function performs partial merging of intermediate data at the map worker
• Typically, the combiner function == the reduce function
  • Requires the function to be commutative and associative
  • E.g. word count
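For word count, the combiner is exactly the Adder reducer from slide 7; the driver in slide 8 already enables it:

// Optional: do partial sums within map tasks
out->set_combiner_class("Adder");

Addition is commutative and associative, so partially summing counts at the map worker cannot change the final result.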

  21. Skipping Bad Records
• Map/Reduce functions sometimes fail deterministically on particular inputs
  • The best solution is to debug and fix, but that is not always possible
• On a seg fault:
  • Send a UDP packet to the master from the signal handler
  • Include the sequence number of the record being processed
• If the master sees two failures for the same record:
  • The next worker is told to skip that record
• Effect: can work around bugs in third-party libraries
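A hedged sketch of the worker-side half of this mechanism, assuming POSIX signals and UDP; the paper describes the handshake but not this code, and the globals below are assumptions about worker state.

#include <csignal>
#include <cstdlib>
#include <netinet/in.h>
#include <sys/socket.h>

static int g_udp_socket = -1;           // opened at worker startup (assumed)
static sockaddr_in g_master_addr = {};  // master's address (assumed)
static volatile sig_atomic_t g_current_record = -1;  // set before each user-code call

// Last-gasp handler: report which record was being processed, then exit.
// Both sendto() and _Exit() are async-signal-safe under POSIX.
void SegvHandler(int /*signo*/) {
  sig_atomic_t seq = g_current_record;
  sendto(g_udp_socket, &seq, sizeof(seq), 0,
         reinterpret_cast<const sockaddr*>(&g_master_addr),
         sizeof(g_master_addr));
  _Exit(1);
}

// Installed once at startup: std::signal(SIGSEGV, SegvHandler);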

  22. Other Refinements
• Extensible input and output types
• Local execution for debugging
• Status web page
• User-defined counters
  • Counter values are returned to the user code
  • Displayed on the status web page
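For user-defined counters, the paper gives an example closely following this pseudocode, counting capitalized words from inside the map function:

Counter* uppercase;
uppercase = GetCounter("uppercase");

map(String name, String contents):
  for each word w in contents:
    if (IsCapitalized(w)):
      uppercase->Increment();
    EmitIntermediate(w, "1");

The master aggregates counter values from the workers, discounting duplicate (backup or re-executed) task executions, and returns them in the result structure seen in slide 8.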

  23. Outline • Programming Model • Implementation • Refinements • Evaluation • Conclusion

  24. Setup
Tests run on a cluster of 1800 machines:
• 4 GB of memory
• Dual-processor 2 GHz Xeons with Hyperthreading
• Dual 160 GB IDE disks
• Gigabit Ethernet per machine
• Bisection bandwidth approximately 100 Gbps

  25. Benchmarks
Two benchmarks:
• Grep: scan 10^10 100-byte records to extract records matching a rare pattern (92K matching records)
  • M = 15,000 (input split size about 64 MB)
  • R = 1
• Sort: sort 10^10 100-byte records (modeled after the TeraSort benchmark)
  • M = 15,000 (input split size about 64 MB)
  • R = 4,000

  26. Grep
Locality optimization helps:
• 1800 machines read 1 TB of data at a peak of ~31 GB/s
• Without it, the rack switches would limit reading to 10 GB/s
Startup overhead is significant for short jobs:
• Total time about 150 seconds; about 1 minute of startup time

  27. Sort (figure: data transfer rates over time for normal execution, no backup tasks, and machine failures)
• Without backup tasks: 44% longer
• With 200 tasks killed mid-run: 5% longer than normal execution

  28. Conclusion
• MapReduce has proven to be a useful abstraction
• Greatly simplifies large-scale computations at Google
• Fun to use: focus on the problem, let the library deal with the messy details
