Map Reduce

Map Reduce. Dustin Beaupre Thuy Nguyen Relation to our course: Chapter 8 : Physical Data Model - 8.3.2 Hash Tables & Files - 8.6.3 : Parallel Processing Sources : 1. wikipedia entry (en.wikipedia.org/wiki/MapReduce)

Presentation Transcript


  1. MapReduce. Dustin Beaupre, Thuy Nguyen. Relation to our course: Chapter 8: Physical Data Model (8.3.2: Hash Tables & Files; 8.6.3: Parallel Processing). Sources: 1. Wikipedia entry (en.wikipedia.org/wiki/MapReduce) 2. Apache MapReduce Tutorial (hadoop.apache.org/docs/r1.2.1/mapred_tutorial.html)

  2. Why is MapReduce useful? • Processes large amounts of data using parallel processing. • Divides the workload across a large number of machines. • If the data is updated, you have to re-map. • Useful for data mining. • Is fault tolerant: if one machine stops working, its task is reassigned to another machine.

  3. What does map do? The workload is distributed to multiple machines, and on each one map() performs filtering and sorting. What does reduce do? It combines the map output for each key into a single result: reduce() performs a summary operation.

  4. Logical View • Data is handled as (key, value) pairs • Map(): takes one pair of data in one domain and returns a list of pairs in a different domain • Map(k1, v1) -> list(k2, v2) • Reduce(): applied in parallel to each group of pairs that share a key, producing a collection of values in the same domain • Reduce(k2, list(v2)) -> list(v3) • Overall: transforms a list of (key, value) pairs into a single list of values
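
The two signatures on this slide can be sketched in Python. This is a minimal single-process illustration, not Hadoop's API: `map_reduce`, `wc_map`, and `wc_reduce` are hypothetical names, the two documents are made-up sample input, and the grouping dictionary stands in for work a real framework spreads across machines.

```python
from collections import defaultdict

def map_reduce(records, map_fn, reduce_fn):
    """Apply Map to every (k1, v1) pair, group by k2, then Reduce each group."""
    groups = defaultdict(list)
    for k1, v1 in records:
        # Map(k1, v1) -> list(k2, v2)
        for k2, v2 in map_fn(k1, v1):
            groups[k2].append(v2)
    # Reduce(k2, list(v2)) -> list(v3), one call per distinct key
    return {k2: reduce_fn(k2, values) for k2, values in groups.items()}

def wc_map(doc_name, text):
    # Emit (word, 1) for every word in the document.
    return [(word, 1) for word in text.split()]

def wc_reduce(word, counts):
    # Summarize all the 1s emitted for this word.
    return [sum(counts)]

docs = [("a.txt", "the cat sat"), ("b.txt", "the dog sat")]
result = map_reduce(docs, wc_map, wc_reduce)
print(result)
```

Here the input domain is (document name, text) and the intermediate domain is (word, count), which is the change of domain the Map() bullet describes.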

  5. Execution Trace for Wordcount (figure: Mapper A and Mapper B, with a Reducer)

  6. SQL: SELECT eyeColor, COUNT(*) FROM worldPopulation GROUP BY eyeColor • Suppose everyone were in this database: all ~7,222,157,690 people • Sequential response time is too large! • MapReduce may help!

  7. Execution Trace for EyeColorCount (figure: Mapper A and Mapper B, with a Reducer)
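
As a rough stand-in for the trace, the eye-color count can be walked through with two mappers and one reducer. This is only a sketch: the rows are made-up sample data, and `mapper`, `reducer`, `partition_a`, and `partition_b` are hypothetical names rather than anything taken from the original slide.

```python
from collections import defaultdict

# Made-up sample rows: two partitions of a tiny worldPopulation table.
partition_a = [{"eyeColor": "brown"}, {"eyeColor": "blue"}, {"eyeColor": "brown"}]
partition_b = [{"eyeColor": "green"}, {"eyeColor": "brown"}, {"eyeColor": "blue"}]

def mapper(rows):
    # Each mapper sees only its own partition and emits (eyeColor, 1) per row.
    return [(row["eyeColor"], 1) for row in rows]

def reducer(pairs):
    # Group the pairs by key, then sum each group (the COUNT(*) per GROUP BY key).
    grouped = defaultdict(list)
    for color, one in pairs:
        grouped[color].append(one)
    return {color: sum(ones) for color, ones in grouped.items()}

out_a = mapper(partition_a)  # Mapper A
out_b = mapper(partition_b)  # Mapper B
counts = reducer(out_a + out_b)

print("Mapper A:", out_a)
print("Mapper B:", out_b)
print("Reducer: ", counts)
```

The two mapper calls are independent of each other, so on a cluster they could run at the same time on different machines; only the reducer needs to see both outputs.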

  8. MapReduce steps • Prepare the Map() input • Run the user-provided Map() code • “Shuffle” the Map output to the Reduce processors • Run the user-provided Reduce() code • Produce the final output
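
The five steps can be laid out as one small driver function. This is a single-process sketch under stated assumptions: `run_job` and its parameters are hypothetical, the input records are made up, and hashing keys with `zlib.crc32` to pick a reduce processor is one possible routing rule chosen here for illustration, not necessarily what a given framework uses.

```python
import zlib
from collections import defaultdict

def run_job(records, map_fn, reduce_fn, num_mappers=2, num_reducers=2):
    # 1. Prepare the Map() input: split the records into one chunk per mapper.
    splits = [records[i::num_mappers] for i in range(num_mappers)]

    # 2. Run the user-provided Map() code on each split.
    map_outputs = [
        [pair for k1, v1 in split for pair in map_fn(k1, v1)]
        for split in splits
    ]

    # 3. Shuffle: hash each key so all of its values reach the same reducer.
    buckets = [defaultdict(list) for _ in range(num_reducers)]
    for output in map_outputs:
        for k2, v2 in output:
            r = zlib.crc32(str(k2).encode("utf-8")) % num_reducers
            buckets[r][k2].append(v2)

    # 4. Run the user-provided Reduce() code once per key in each bucket.
    reduced = [
        {k2: reduce_fn(k2, values) for k2, values in sorted(bucket.items())}
        for bucket in buckets
    ]

    # 5. Produce the final output: merge the reducers' results.
    final = {}
    for part in reduced:
        final.update(part)
    return final

people = [(1, "blue"), (2, "brown"), (3, "blue"), (4, "green")]
final = run_job(
    people,
    map_fn=lambda person_id, color: [(color, 1)],
    reduce_fn=lambda color, ones: sum(ones),
)
print(final)
```

Because a key always hashes to the same bucket, no two reducers ever hold values for the same key, which is why step 5 can merge their outputs without conflicts.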

  9. Overall, the goal of MapReduce is to produce correct output from large data sets in the least amount of time. Any questions?
