
Lecture 2 – MapReduce: Theory and Implementation

CSE 490h – Introduction to Distributed Computing, Winter 2008. Except as otherwise noted, the content of this presentation is licensed under the Creative Commons Attribution 2.5 License.


Presentation Transcript


  1. Lecture 2 – MapReduce: Theory and Implementation CSE 490h – Introduction to Distributed Computing, Winter 2008 Except as otherwise noted, the content of this presentation is licensed under the Creative Commons Attribution 2.5 License.

  2. Last Class • How do I process lots of data? • Distribute the work • Can I distribute the work? • Maybe… if it’s not dependent on other tasks • Example: Fibonacci.

  3. Last Class • What problems can occur? • Large tasks • Unpredictable bugs • Machine failure • How do we solve / avoid these? • Break up into small chunks? • Restart tasks? • Use known working solutions

  4. MapReduce • Concept from functional programming • Implemented by Google • Applied to a large number of problems

  5. Functional Programming Review
     Java:
       int fooA(String[] list) { return bar1(list) + bar2(list); }
       int fooB(String[] list) { return bar2(list) + bar1(list); }
     Do they give the same result?
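
One way to see why the Java answer can be "no" is the sketch below. The bodies of bar1 and bar2 are hypothetical (the slide does not define them): bar1 is assumed to mutate the array it receives. Since Java evaluates operands left to right, swapping the two calls changes the sum.

```java
// Hypothetical bodies for bar1 and bar2; the slide leaves them unspecified.
class EvaluationOrder {
    // Side effect: overwrites the first element, then returns its length.
    static int bar1(String[] list) {
        list[0] = "overwritten";
        return list[0].length();
    }

    // Pure read: returns the length of the first element.
    static int bar2(String[] list) {
        return list[0].length();
    }

    static int fooA(String[] list) { return bar1(list) + bar2(list); }
    static int fooB(String[] list) { return bar2(list) + bar1(list); }

    public static void main(String[] args) {
        System.out.println(fooA(new String[] {"a"})); // prints 22
        System.out.println(fooB(new String[] {"a"})); // prints 12
    }
}
```

In fooA, bar2 sees the array after bar1 has changed it; in fooB it sees the original contents.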

  6. Functional Programming Review
     Functional Programming:
       fun fooA(l: int list) = bar1(l) + bar2(l)
       fun fooB(l: int list) = bar2(l) + bar1(l)
     Do they give the same result?

  7. Functional Programming Review • Operations do not modify data structures: They always create new ones • Original data still exists in unmodified form

  8. Functional Updates Do Not Modify Structures
       fun foo(x, lst) = let val lst' = rev lst in rev (x :: lst') end
       foo: 'a * 'a list -> 'a list
     The foo() function above reverses the list, adds the new element to the front of the reversed list, and reverses the result, which appends the element to the end of the original list. But it never modifies lst!
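
The same idea can be sketched in Java, under the assumption that we copy instead of mutating. The class and method names (FunctionalAppend, append) are illustrative, not from the lecture.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

class FunctionalAppend {
    // Mirrors foo(x, lst): every step works on a fresh list, so lst is never modified.
    static <T> List<T> append(T x, List<T> lst) {
        List<T> reversed = new ArrayList<>(lst); // copy, so lst stays untouched
        Collections.reverse(reversed);           // lst' = reverse of lst
        List<T> withX = new ArrayList<>();
        withX.add(x);                            // x :: lst'
        withX.addAll(reversed);
        Collections.reverse(withX);              // reverse (x :: lst')
        return withX;
    }

    public static void main(String[] args) {
        List<Integer> original = List.of(1, 2, 3);
        List<Integer> appended = append(4, original);
        System.out.println(appended); // prints [1, 2, 3, 4]
        System.out.println(original); // prints [1, 2, 3]
    }
}
```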

  9. Functions Can Be Used As Arguments
       fun DoDouble(f, x) = f (f x)
     It does not matter what f does to its argument; DoDouble() will do it twice. What is the type of this function?
       x: 'a
       f: 'a -> 'a
       DoDouble: ('a -> 'a) * 'a -> 'a
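
A rough Java counterpart, assuming java.util.function.UnaryOperator as the stand-in for a function of type 'a -> 'a. The name doDouble follows the slide; the class name is made up for the example.

```java
import java.util.function.UnaryOperator;

class DoDoubleDemo {
    // Applies f twice to x, whatever f happens to do.
    static <T> T doDouble(UnaryOperator<T> f, T x) {
        return f.apply(f.apply(x));
    }

    public static void main(String[] args) {
        Integer n = DoDoubleDemo.<Integer>doDouble(v -> v + 3, 10);
        String s = DoDoubleDemo.<String>doDouble(v -> v + "!", "hi");
        System.out.println(n); // prints 16
        System.out.println(s); // prints hi!!
    }
}
```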

  10. map (Functional Programming) Creates a new list by applying f to each element of the input list; returns output in order. map f lst: ('a -> 'b) -> ('a list) -> ('b list)

  11. map Implementation
        fun map f [] = []
          | map f (x::xs) = (f x) :: (map f xs)
      • This implementation moves left-to-right across the list, mapping elements one at a time
      • … But does it need to?
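
For comparison, here is a recursive Java sketch that follows the two SML clauses directly. It is written for clarity rather than efficiency, and the class name MapDemo is an assumption of this example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

class MapDemo {
    // map f []      = []
    // map f (x::xs) = (f x) :: (map f xs)
    static <A, B> List<B> map(Function<A, B> f, List<A> lst) {
        List<B> result = new ArrayList<>();
        if (lst.isEmpty()) {
            return result;                                  // empty-list clause
        }
        result.add(f.apply(lst.get(0)));                    // f x
        result.addAll(map(f, lst.subList(1, lst.size())));  // map f xs
        return result;
    }

    public static void main(String[] args) {
        List<Integer> squares = MapDemo.<Integer, Integer>map(n -> n * n, List.of(1, 2, 3));
        System.out.println(squares); // prints [1, 4, 9]
    }
}
```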

  12. Implicit Parallelism In map • In a purely functional setting, elements of a list being computed by map cannot see the effects of the computations on other elements • If the result does not depend on the order in which f is applied to the elements, we can reorder or parallelize execution • This is the “secret” that MapReduce exploits
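
A small single-machine sketch of this point, using Java parallel streams as a stand-in for a cluster (an assumption of this example, not how MapReduce itself runs). Because square is pure, the parallel version yields the same list as the sequential one.

```java
import java.util.List;
import java.util.stream.Collectors;

class ParallelMapDemo {
    // A pure function: its result depends only on its argument.
    static int square(int n) {
        return n * n;
    }

    // Applies square to each element, one after another.
    static List<Integer> sequentialMap(List<Integer> input) {
        return input.stream().map(ParallelMapDemo::square).collect(Collectors.toList());
    }

    // Same mapping, but elements may be processed on several threads in any order.
    // The collected list still follows the input order.
    static List<Integer> parallelMap(List<Integer> input) {
        return input.parallelStream().map(ParallelMapDemo::square).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<Integer> input = List.of(1, 2, 3, 4, 5);
        System.out.println(sequentialMap(input).equals(parallelMap(input))); // prints true
    }
}
```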

  13. Fold Moves across a list, applying f to each element plus an accumulator. f returns the next accumulator value, which is combined with the next element of the list. fold f x0 lst: ('a * 'b -> 'b) -> 'b -> ('a list) -> 'b

  14. fold left vs. fold right • Order of list elements can be significant • Fold left moves left-to-right across the list • Fold right moves right-to-left
      SML Implementation:
        fun foldl f a [] = a
          | foldl f a (x::xs) = foldl f (f(x, a)) xs
        fun foldr f a [] = a
          | foldr f a (x::xs) = f(x, (foldr f a xs))
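
To make the difference concrete, here is a Java sketch of both folds, written with loops instead of recursion (a design choice of this example). String concatenation is used as f because it is order-sensitive, so the two folds give visibly different results.

```java
import java.util.List;
import java.util.function.BiFunction;

class FoldDemo {
    // Left fold: visits elements first to last, calling f(element, accumulator).
    static <A, B> B foldl(BiFunction<A, B, B> f, B acc, List<A> lst) {
        for (A x : lst) {
            acc = f.apply(x, acc);
        }
        return acc;
    }

    // Right fold: visits elements last to first, calling f(element, accumulator).
    static <A, B> B foldr(BiFunction<A, B, B> f, B acc, List<A> lst) {
        for (int i = lst.size() - 1; i >= 0; i--) {
            acc = f.apply(lst.get(i), acc);
        }
        return acc;
    }

    public static void main(String[] args) {
        List<String> letters = List.of("a", "b", "c");
        String left = FoldDemo.<String, String>foldl((x, a) -> x + a, "", letters);
        String right = FoldDemo.<String, String>foldr((x, a) -> x + a, "", letters);
        System.out.println(left);  // prints cba
        System.out.println(right); // prints abc
    }
}
```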

  15. Example fun foo(l: int list) = sum(l) + mul(l) + length(l) How can we implement this?

  16. Example (Solved)
        (* helpers are defined first so that foo can refer to them *)
        fun sum(lst) = foldl (fn (x,a) => x+a) 0 lst
        fun mul(lst) = foldl (fn (x,a) => x*a) 1 lst
        fun length(lst) = foldl (fn (x,a) => 1+a) 0 lst
        fun foo(l: int list) = sum(l) + mul(l) + length(l)

  17. Google MapReduce • Input Handling • Map function • Partition Function • Compare Function • Reduce Function • Output Writer

  18. Input Handling • Divides up data into bite-size chunks • Starts up tasks • Assigns tasks to idle workers
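
The first bullet can be sketched in memory as follows. This is only an illustration: the real system splits files on distributed storage, and chunkSize, InputSplitter, and split are names invented for this example. Starting tasks and assigning them to workers are not modeled.

```java
import java.util.ArrayList;
import java.util.List;

class InputSplitter {
    // Cuts the records into consecutive chunks of at most chunkSize elements.
    static <T> List<List<T>> split(List<T> records, int chunkSize) {
        List<List<T>> chunks = new ArrayList<>();
        for (int start = 0; start < records.size(); start += chunkSize) {
            int end = Math.min(start + chunkSize, records.size());
            chunks.add(new ArrayList<>(records.subList(start, end)));
        }
        return chunks;
    }

    public static void main(String[] args) {
        System.out.println(split(List.of(1, 2, 3, 4, 5), 2)); // prints [[1, 2], [3, 4], [5]]
    }
}
```

Each chunk could then be handed to one map task.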

  19. Map • Input: Key, Value pair • Output: Key, Value pairs • Example: Annual Rainfall Per City

  20. Map (Example) • Example: Annual Rainfall Per City
        map(String key, String value):
          // key: date
          // value: weather info
          foreach (City c in value)
            EmitIntermediate(c, c.rainfall)
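
A runnable Java sketch of such a map function is given below. The input format is an assumption made for this example (the slide does not specify one): the value is taken to be a string such as "Seattle=0.4;Spokane=0.0", and emitted pairs are simply collected into a list.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

class RainfallMapper {
    // key: date; value: weather info in the assumed "City=rainfall;City=rainfall" format.
    // Returns one (city, rainfall) pair per reading, standing in for EmitIntermediate.
    static List<Map.Entry<String, Double>> map(String date, String weatherInfo) {
        List<Map.Entry<String, Double>> emitted = new ArrayList<>();
        for (String reading : weatherInfo.split(";")) {
            String[] parts = reading.split("=");
            emitted.add(new SimpleEntry<String, Double>(parts[0], Double.valueOf(parts[1])));
        }
        return emitted;
    }

    public static void main(String[] args) {
        System.out.println(map("2008-01-07", "Seattle=0.4;Spokane=0.0"));
        // prints [Seattle=0.4, Spokane=0.0]
    }
}
```

Note that the date key is not needed for this job; only the city becomes the intermediate key.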

  21. Partition Function • Allocates map output to particular reduce tasks • Input: key, number of reduce tasks • Output: Index of the desired reduce task • Typical: hash(key) % numberOfReduces
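
The typical partition function might look like this in Java. One caveat specific to Java (not stated on the slide): String.hashCode() can be negative, so this sketch uses Math.floorMod instead of % to keep the index in range. The class name is illustrative.

```java
class Partitioner {
    // Returns the index of the reduce task for this key, in [0, numberOfReduces).
    static int partition(String key, int numberOfReduces) {
        // floorMod never returns a negative result for a positive divisor,
        // unlike %, which keeps the sign of a negative hash code.
        return Math.floorMod(key.hashCode(), numberOfReduces);
    }

    public static void main(String[] args) {
        int index = partition("Seattle", 4);
        System.out.println(index >= 0 && index < 4); // prints true
    }
}
```

The same key always maps to the same index, so all values for one city end up at the same reduce task.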

  22. Comparison • Sorts input for each reduce • Example: Annual rainfall per city • Sorts rainfall data for each city • Seattle: {0, 0, 0, 1, 4, 7, 10, …}
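
As an in-memory sketch of this step, the code below groups intermediate pairs by city and sorts each city's values. Integer readings and the names ShuffleSort and groupAndSort are assumptions for illustration; a real implementation sorts data on disk across machines.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class ShuffleSort {
    // Groups (city, rainfall) pairs by city, then sorts the values of each city.
    static Map<String, List<Integer>> groupAndSort(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
        }
        for (List<Integer> values : grouped.values()) {
            Collections.sort(values);
        }
        return grouped;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Integer>> pairs = List.of(
                Map.entry("Seattle", 4), Map.entry("Spokane", 2),
                Map.entry("Seattle", 0), Map.entry("Seattle", 1));
        System.out.println(groupAndSort(pairs)); // prints {Seattle=[0, 1, 4], Spokane=[2]}
    }
}
```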

  23. Reduce • Input: Key, Sorted list of values • Output: Single value • Example: Annual rainfall per city

  25. Reduce (Example) • Example: Annual rainfall per city
        reduce(String key, Iterator values):
          // key: city
          // values: rainfall readings
          sum = 0, count = 0
          for each (v in values)
            sum += v
            count = count + 1
          Emit(sum / count)   // average reading for the city
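
A Java sketch of this reduce function follows. Like the slide, it emits the average of the values; returning sum instead would give a yearly total. Using double values and returning 0.0 for an empty input are assumptions of this example.

```java
import java.util.Iterator;
import java.util.List;

class RainfallReducer {
    // key: city; values: rainfall readings for that city.
    // Returns the average reading, standing in for Emit(sum / count).
    static double reduce(String city, Iterator<Double> values) {
        double sum = 0.0;
        int count = 0;
        while (values.hasNext()) {
            sum += values.next();
            count++;
        }
        // Assumed behavior: an empty input yields 0.0 instead of dividing by zero.
        return count == 0 ? 0.0 : sum / count;
    }

    public static void main(String[] args) {
        System.out.println(reduce("Seattle", List.of(0.0, 1.0, 5.0).iterator())); // prints 2.0
    }
}
```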

  26. Output • Writes the output to storage (GFS, etc.)

  27. MapReduce for Google Local • Intersections • Rendering Tiles • Finding nearest gas stations
