
Introduction to MapReduce




Presentation Transcript


  1. Introduction to MapReduce Amit K Singh

  2. Do you recognize this ?? “The density of transistors on a chip doubles every 18 months, for the same cost” — Gordon Moore (1965)


  4. The Free Lunch Is Almost Over !!

  5. [Images: Super Computer (web graphic, Janet E. Ward, 2000); Cluster of Desktops] The Future is Multi-core !!

  6. Replace specialized, powerful super-computers with large clusters of commodity hardware. • But distributed programming is inherently complex. The Future is Multi-core !!

  7. A platform for reliable, scalable parallel computing. • Abstracts the issues of a distributed, parallel environment away from the programmer. • Runs over the Google File System. Google’s MapReduce Paradigm

  8. A highly scalable distributed file system for large, data-intensive applications. • Provides redundant storage of massive amounts of data on cheap, unreliable machines. • Provides a platform over which other systems, such as MapReduce and BigTable, operate. Detour: Google File System (GFS)

  9. GFS Architecture

  10. “Consider the problem of counting the number of occurrences of each word in a large collection of documents” • How would you do it in parallel ? MapReduce: Insight

  11. One possible solution
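One such solution (a hedged sketch; the slide's actual diagram is an image): partition the collection, count each partition independently (the step that can run on separate machines), then merge the partial counts.

```python
from collections import Counter

def count_chunk(docs):
    """Count word occurrences in one partition of the collection."""
    c = Counter()
    for doc in docs:
        c.update(doc.split())
    return c

def parallel_word_count(docs, n_workers=4):
    """Partition the documents, count each partition independently
    (each count_chunk call could run on a separate machine), then
    merge the partial counts into a global result."""
    chunks = [docs[i::n_workers] for i in range(n_workers)]
    partials = [count_chunk(chunk) for chunk in chunks]  # parallelizable step
    total = Counter()
    for p in partials:
        total += p
    return total

print(parallel_word_count(["a b a", "b c", "a"]))
```

The merge step is the sequential bottleneck; MapReduce generalizes exactly this partition/merge pattern.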

  12. Inspired by the map and reduce operations commonly used in functional programming languages such as Lisp. • Users implement an interface of two primary methods: • 1. Map: (key1, val1) → (key2, val2) • 2. Reduce: (key2, [val2]) → [val3] MapReduce Programming Model

  13. Map, a pure function written by the user, takes an input key/value pair and produces a set of intermediate key/value pairs, e.g. (doc-id, doc-content). • Drawing an analogy to SQL, map can be visualized as the group-by clause of an aggregate query. Map operation

  14. On completion of the map phase, all the intermediate values for a given intermediate key are combined into a list and given to a reducer. • Reduce can be visualized as an aggregate function (e.g., average) computed over all the rows with the same group-by attribute. Reduce operation

  15. Pseudo-code

  map(String input_key, String input_value):
    // input_key: document name
    // input_value: document contents
    for each word w in input_value:
      EmitIntermediate(w, "1");

  reduce(String output_key, Iterator intermediate_values):
    // output_key: a word
    // intermediate_values: a list of counts
    int result = 0;
    for each v in intermediate_values:
      result += ParseInt(v);
    Emit(AsString(result));
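Slide 15's pseudo-code can be exercised in Python with a minimal sequential driver (a sketch of mine, not part of the talk; `run_mapreduce` is a name of my choosing). The driver simulates the shuffle step that groups intermediate values by key between the two phases.

```python
from collections import defaultdict

def map_fn(input_key, input_value):
    # input_key: document name, input_value: document contents
    for w in input_value.split():
        yield (w, "1")

def reduce_fn(output_key, intermediate_values):
    # output_key: a word; intermediate_values: a list of counts
    yield str(sum(int(v) for v in intermediate_values))

def run_mapreduce(inputs):
    """Sequential driver: map phase, shuffle (group intermediate
    values by key), then reduce phase."""
    groups = defaultdict(list)
    for k, v in inputs:
        for k2, v2 in map_fn(k, v):
            groups[k2].append(v2)
    return {k2: list(reduce_fn(k2, vs))[0] for k2, vs in sorted(groups.items())}

print(run_mapreduce([("d1", "a b a"), ("d2", "b")]))
# {'a': '2', 'b': '2'}
```

In the real system the shuffle is distributed: mappers partition intermediate keys across reducers, which fetch and sort their partitions.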

  16. MapReduce: Execution overview

  17. MapReduce: Execution overview

  18. MapReduce: Example

  19. MapReduce in Parallel: Example

  20. MapReduce: Runtime Environment

  21. Handled via re-execution of tasks. • Task completion is committed through the master. • What happens if a mapper fails ? • Re-execute completed + in-progress map tasks (completed map output lives on the failed machine's local disk, so it is lost). • What happens if a reducer fails ? • Re-execute only in-progress reduce tasks (completed reduce output is already in GFS). • What happens if the master fails ? • Potential trouble !! MapReduce: Fault Tolerance

  22. Leverage GFS to schedule a map task on a machine that contains a replica of the corresponding input data. • Thousands of machines read input at local disk speed • Without this, rack switches limit read rate MapReduce: Refinements Locality Optimization

  23. Slow workers (“stragglers”) are a source of bottleneck and may delay completion time. • Near the end of a phase, spawn backup copies of the remaining in-progress tasks; whichever copy finishes first wins. • Effectively utilizes spare computing power: the MapReduce paper reports a sort job taking 44% longer with backup tasks disabled. MapReduce: Refinements Redundant Execution

  24. Map/Reduce functions sometimes fail deterministically for particular inputs. • Fixing the bug might not be possible: third-party libraries. • On error, the worker sends a signal to the master. • If multiple errors occur on the same record, the record is skipped. MapReduce: Refinements Skipping Bad Records

  25. Combiner Function at Mapper • Sorting Guarantees within each reduce partition. • Local execution for debugging/testing • User-defined counters MapReduce: Refinements Miscellaneous
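As an illustration of the combiner refinement (a sketch, not code from the talk): for word count, running a local reduce at the mapper collapses repeated (word, "1") pairs before they are shuffled across the network.

```python
from collections import Counter

def map_with_combiner(doc):
    """Instead of emitting one (word, "1") pair per occurrence,
    pre-aggregate locally: one (word, count) pair per distinct word.
    The combiner here is the same logic as the reducer, applied to
    the mapper's local output only."""
    return list(Counter(doc.split()).items())

pairs = map_with_combiner("the cat sat on the mat the end")
# 6 pairs are shuffled instead of 8 raw (word, "1") pairs
```

A combiner is only safe when the reduce function is associative and commutative, as summation is here.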

  26. MapReduce: Walk-through of one more application

  27. PageRank models the behavior of a “random surfer”. • C(t) is the out-degree of page t, and (1-d) is a damping factor (the probability of a random jump). • The “random surfer” keeps clicking on successive links at random, not taking content into consideration. • Each page distributes its rank equally among all pages it links to. • The damping factor models the surfer “getting bored” and typing an arbitrary URL. MapReduce : PageRank
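The formula itself appears only as an image on the slide; the standard formulation matching the C(t) and (1-d) terms described above is:

```latex
PR(p) \;=\; (1-d) \;+\; d \sum_{t \to p} \frac{PR(t)}{C(t)}
```

where the sum runs over all pages t that link to p.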

  28. Computing PageRank

  29. Effects at each iteration are local: the (i+1)th iteration depends only on the ith. • At iteration i, the PageRank of individual nodes can be computed independently. PageRank : Key Insights

  30. Use a sparse matrix representation (M). • Map each row of M to a list of PageRank “credit” to assign to out-link neighbours. • These credits are reduced to a single PageRank value for a page by summing over them. PageRank using MapReduce

  31. Map: distribute PageRank “credit” to link targets Reduce: gather up PageRank “credit” from multiple sources to compute new PageRank value PageRank using MapReduce Iterate until convergence Source of Image: Lin 2008

  32. Map task takes (URL, page-content) pairs and maps them to (URL, (PRinit, list-of-urls)) • PRinit is the “seed” PageRank for URL • list-of-urls contains all pages pointed to by URL • Reduce task is just the identity function Phase 1: Process HTML

  33. Reduce task gets (URL, url_list) and many (URL, val) values • Sum the vals and apply the damping factor d to get the new PR • Emit (URL, (new_rank, url_list)) • Check for convergence using a non-parallel component Phase 2: PageRank Distribution
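Phase 2's map and reduce steps can be sketched in Python (names such as `pr_map`, `pr_reduce`, and the sequential `iterate` driver are my choices, not from the slides). Each map call passes the link structure through alongside the rank credit, so the reducer can re-emit (new_rank, url_list) for the next iteration.

```python
from collections import defaultdict

D = 0.85  # damping factor d; (1 - D) is the random-jump term

def pr_map(url, state):
    """state = (rank, out_links). Distribute rank credit equally to
    each out-link, and pass the link structure through unchanged."""
    rank, out_links = state
    yield (url, ("links", out_links))
    for target in out_links:
        yield (target, ("credit", rank / len(out_links)))

def pr_reduce(url, values):
    """Sum incoming credit, apply damping, re-emit (rank, out_links)."""
    out_links, total = [], 0.0
    for tag, v in values:
        if tag == "links":
            out_links = v
        else:
            total += v
    return (url, ((1 - D) + D * total, out_links))

def iterate(graph):
    """One PageRank iteration as a sequential map/shuffle/reduce pass."""
    groups = defaultdict(list)
    for url, state in graph.items():
        for k, v in pr_map(url, state):
            groups[k].append(v)
    return dict(pr_reduce(k, vs) for k, vs in groups.items())

graph = {"a": (1.0, ["b"]), "b": (1.0, ["a"])}
graph = iterate(graph)  # repeat until ranks stop changing
```

The convergence check (comparing ranks between iterations) is the non-parallel component mentioned on the slide.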

  34. Distributed Grep • Count of URL Access Frequency • Clustering (K-means) • Graph Algorithms • Indexing Systems MapReduce Programs In Google Source Tree MapReduce: Some More Apps

  35. Pig (Yahoo!) • Hadoop (Apache) • DryadLINQ (Microsoft) MapReduce: Extensions and similar apps

  36. Large Scale Systems Architecture using MapReduce

  37. Although restrictive, the model is a good fit for many problems encountered in practice when processing large data sets. • The functional programming paradigm can be applied to large-scale computation. • Easy to use: it hides the messy details of parallelization, fault tolerance, data distribution, and load balancing from programmers. • And finally, if it works for Google, it should be handy !! Take Home Messages

  38. Thank You
