mapreduce l.
Download
Skip this Video
Loading SlideShow in 5 Seconds..
MapReduce PowerPoint Presentation
Download Presentation
MapReduce

Loading in 2 Seconds...

play fullscreen
1 / 23

MapReduce - PowerPoint PPT Presentation


  • 173 Views
  • Uploaded on

MapReduce. Simplified Data Processing on Large Clusters Google, Inc. Presented by Prasad Raghavendra. Introduction. Model for processing large data sets. Contains Map and Reduce functions.

loader
I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
capcha
Download Presentation

PowerPoint Slideshow about 'MapReduce' - gus


An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript
mapreduce

MapReduce

Simplified Data Processing on Large Clusters

Google, Inc.

Presented by

Prasad Raghavendra

introduction
Introduction
  • Model for processing large data sets.
  • Contains Map and Reduce functions.
  • Runs on a large cluster of machines.
  • A lot of MapReduce programs are executed on Google’s cluster everyday.
motivation
Motivation
  • Very large data sets need to be processed.

- The whole Web, billions of Pages

  • Lots of machines

- Use them efficiently.

processing of large data sets
Processing of Large Data Sets

For example:

- Counting access frequency to URLs:

Input: list(RequestURL)

Output: list(RequestURL, total_number)

- Distributed Grep

- Distributed Sort

slide5

Programming modelInput & Output: each a set of key/value pairs Programmer specifies two functions: map (in_key, in_value) -> list(out_key, intermediate_value) Name comes from map function in LISPEx. (map 'list #’+ '(1 2 3) '(1 2 3)) => (2 4 6)-Processes input key/value pair -Produces set of intermediate pairsmap(document, content) {for each word in contentemit(word, “1”)}

slide6

reduce (out_key, list(intermediate_value)) -> list(out_value) Name comes from reduce function in LISPEx. (reduce #’+ '(1 2 3 4 5)) => 15- Combines all intermediate values for a particular key - Produces a set of merged output values (usually just one)reduce(word, values) {result = 0;for each value in valuesresult += valueemitString(w, result)}

slide7
ExampleThe problem of counting the number of occurrences of each word in a large collection ofdocuments.
  • Page 1: the weather is good
  • Page 2: today is good
  • Page 3: good weather is good
map output
Map output
  • Worker 1:

(the 1), (weather 1), (is 1), (good 1).

  • Worker 2:

(today 1), (is 1), (good 1).

  • Worker 3:

(good 1), (weather 1), (is 1), (good 1).

reduce input
Reduce Input
  • Worker 1:(the 1)
  • Worker 2: (is 1), (is 1), (is 1)
  • Worker 3:(weather 1), (weather 1)
  • Worker 4:(today 1)
  • Worker 5:(good 1),(good 1), (good 1),

(good 1)

reduce output
Reduce Output
  • Worker 1: (the 1)
  • Worker 2: (is 3)
  • Worker 3: (weather 2)
  • Worker 4: (today 1)
  • Worker 5: (good 4)
flow of mapreduce operation
Flow of MapReduce Operation
  • The MapReduce library in the user program splits the input files into M pieces(16,64 MB).
  • One of the copies of the program is special . The master. The rest are workers .
  • A worker who is assigned a map task parses key/value pairs out of the input data.
  • Periodically, the buffered pairs are written to local disk.
  • When a reduce worker is notified by the master about these locations, it uses remote procedure calls to read the buffered data.
  • The output of the Reduce function is appended to a final output file.
  • When all map tasks and reduce tasks have been completed, the master wakes up the user program.
problem stragglers
Problem: Stragglers
  • Often some machines are late in their replies

- slow disk, overloaded, etc

  • Approach:

- when only few tasks left to execute, start backup

tasks

- a task completes when either primary or backup

completes task

  • Performance:

- without backup, sort (->) takes 44% longer

partition function
Partition Function
  • Defines which worker processes which keys

- default: hash(key2) mod R

Other partition functions useful:

- sort: prefix of k bytes of line

- idea: based on known/sampled distribution of key2

to evenly distribute processed keys

combiner function
Combiner Function
  • Problem:

intermediate results can be quite verbose

e.g., (“the”, 1) could occur many times in previous example

  • Approach:

perform a local reduction before writing intermediate

results

typically, combiner same function as reduce func

This will reduce the run-time because less writing to

disk and across the network

performance
Performance
  • Scan 10^10 100-byte records to extract records matching a rare pattern (92K matching records) : 150 seconds.
  • Sort 10^10 100-byte records (modeled after TeraSort benchmark) : normal 839 seconds.
fault tolerance
Fault Tolerance
  • Crash of worker

all - even finished - tasks are redone

  • Crash of leader

crash of leader process -> restart process with

checkpoint

crash of leader machine-> unlikely - restart

computation

redo computation

conclusion
Conclusion
  • MapReduce has proven to be a useful abstraction
  • Easy to use
  • Very large variety of problems are easily expressible as MapReduce computations
  • Greatly simplifies large-scale computations at Google