Introduction to Search Engines Technology, CS 236375, Technion, Winter 2013

Presentation Transcript


  1. Introduction to Search Engines Technology, CS 236375, Technion, Winter 2013. Map-Reduce. Amit Gross. Some slides are courtesy of: Edward Bortnikov & Ronny Lempel, Yahoo! Labs, Haifa

  2. Problem Example • Given a huge set of documents, count how many times each token appears in the corpus. • The data is of petabyte scale. • Some sort of distributed computing is required – but we don’t want to reinvent the wheel for each new task.

  3. Solution Paradigm • Describe the problem as a set of Map-Reduce tasks, from the functional programming paradigm. • Map: data -> (key,value)* • Document -> (token, ‘1’)* • Reduce: (key,List<value>) -> (key,value’) • (token,List<1>) -> (token,#repeats)
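To make the two signatures above concrete, here is a minimal single-process Python sketch of the paradigm; the driver name run_mapreduce and the in-memory grouping are illustrative assumptions, since a real framework distributes the map, shuffle, and reduce phases across machines.

from itertools import groupby
from operator import itemgetter

def run_mapreduce(inputs, mapper, reducer):
    # Map: every input record yields zero or more (key, value) pairs.
    pairs = [kv for record in inputs for kv in mapper(record)]
    # Shuffle: sort by key and group, as the framework does between the two phases.
    pairs.sort(key=itemgetter(0))
    grouped = ((key, [value for _, value in group])
               for key, group in groupby(pairs, key=itemgetter(0)))
    # Reduce: one call per key, with the full list of that key's values.
    return [reducer(key, values) for key, values in grouped]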

  4. Word-count - example
     Input:
     • D1 = The good the bad and the ugly
     • D2 = As good as it gets and more
     • D3 = Is it ugly and bad? It is, and more!
     Map: Text -> (term, '1'):
     (the,1); (good,1); (the,1); (bad,1); (and,1); (the,1); (ugly,1);
     (as,1); (good,1); (as,1); (it,1); (gets,1); (and,1); (more,1);
     (is,1); (it,1); (ugly,1); (and,1); (bad,1); (it,1); (is,1); (and,1); (more,1)

  5. Word-count - example
     Shuffle output (term, list<1>):
     (the,[1,1,1]); (good,[1,1]); (bad,[1,1]); (ugly,[1,1]); (and,[1,1,1,1]);
     (as,[1,1]); (it,[1,1,1]); (gets,[1]); (more,[1,1]); (is,[1,1])
     Reduce: (term, list<1>) -> (term, #occurrences):
     (the,3); (good,2); (bad,2); (ugly,2); (and,4); (as,2); (it,3); (gets,1); (more,2); (is,2)

  6. Word-count – pseudo-code:
     Map(Document):
        terms[] <- parse(Document)
        for each t in terms:
           emit(t, '1')
     Reduce(term, list<1>):
        emit(term, sum(list))
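For reference, here is a runnable single-machine Python version of the pseudo-code above, with the shuffle simulated by a dictionary; the tokenizer (lower-casing plus a \w+ regex) is an assumption chosen to reproduce the counts on slide 5.

import re
from collections import defaultdict

def wordcount_map(document):
    # Text -> (term, 1) pairs, one per token occurrence (lower-cased).
    for term in re.findall(r"\w+", document.lower()):
        yield (term, 1)

def wordcount_reduce(term, counts):
    # (term, [1, 1, ...]) -> (term, #occurrences)
    return (term, sum(counts))

# Single-machine simulation of the shuffle step for the three example documents.
docs = ["The good the bad and the ugly",
        "As good as it gets and more",
        "Is it ugly and bad? It is, and more!"]
groups = defaultdict(list)
for doc in docs:
    for term, one in wordcount_map(doc):
        groups[term].append(one)
print(dict(wordcount_reduce(term, ones) for term, ones in groups.items()))
# -> {'the': 3, 'good': 2, 'bad': 2, 'and': 4, 'ugly': 2, 'as': 2, 'it': 3, 'gets': 1, 'more': 2, 'is': 2}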

  7. Other examples:
     grep(Text, regex):
     • Map(Text, regex) -> (line#, 1) for each line matching regex
     • Reduce(line#, [1]) -> line# (identity)
     Inverted-Index:
     • Map(docId, Text) -> (term, docId)
     • Reduce(term, list<docId>) -> (term, sorted(list<docId>))
     Reverse Web-Link-Graph:
     • Map(Webpages) -> (target, source) [for each link]
     • Reduce(target, list<source>) -> (target, list<source>)
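As an illustration of the inverted-index pair from the list above, a Python sketch follows; whitespace tokenization and lower-casing are assumptions, and a real job would emit postings for many documents in parallel.

def inverted_index_map(doc_id, text):
    # (docId, Text) -> (term, docId), one pair per distinct term in the document.
    for term in set(text.lower().split()):
        yield (term, doc_id)

def inverted_index_reduce(term, doc_ids):
    # (term, list<docId>) -> (term, sorted posting list)
    return (term, sorted(doc_ids))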

  8. Data-flow
     Input (text) -> Mapper -> (key, value) pairs
     -> Framework: sort pairs by key, create a list per key, shuffle keys by hash value
     -> (key, list<value>) -> Reducer -> output
     (Mapper and Reducer are user-supplied; the sort/shuffle step belongs to the framework.)
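The "shuffle keys by hash value" step decides which reducer receives each key. A small sketch of such a partition function, assuming CRC32 as the deterministic hash (the function name is illustrative, not any framework's API):

import zlib

def partition(key, num_reducers):
    # Every mapper applies the same deterministic hash, so all (key, value)
    # pairs sharing a key are routed to the same reducer.
    return zlib.crc32(str(key).encode("utf-8")) % num_reducers

# e.g. with 2 reducers, every ("the", 1) pair, from any mapper,
# is sent to reducer number partition("the", 2).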

  9. Example: MR job on 2 machines
     [Diagram: four map tasks (M1–M4) read input splits from the DFS, a shuffle feeds two reduce tasks (R1, R2), and the output is written back to the DFS.]
     Synchronous execution: every R starts computing after all M's have completed.

  10. Storage • Job input and output are stored on DFS • Replicated, reliable storage • Intermediate files reside on local disks • Non-reliable • Data is transferred from Mappers to Reducers over the network, as files – time consuming.

  11. Combiners • Often, the reducer does simple aggregation • Sum, average, min/max, … • Commutative and associative functions • We can do some aggregation at the mapper side • … and eliminate a lot of network traffic! • Where can we use this in an example we have already seen? • Word Count – the combiner is identical to the reducer
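A sketch of how a word-count combiner cuts traffic, folded into the map side for brevity (an assumption made here for compactness; real frameworks take the combiner as a separate function, typically the reducer itself):

from collections import Counter

def map_with_combiner(document):
    # Combiner folded into the map side: count locally, then emit one
    # (term, partial_count) pair per distinct term instead of one pair per
    # occurrence; safe because the reducer's sum() is commutative and associative.
    for term, partial_count in Counter(document.lower().split()).items():
        yield (term, partial_count)

# The reducer is unchanged: it still sums the values it receives;
# with the combiner they are partial sums rather than all 1's.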

  12. Data-flow with combiner
      Input (text) -> Mapper -> (key, value) pairs
      -> Combiner -> (key, value') pairs (done on the same machine as the Mapper!)
      -> Framework: sort pairs by key, create a list per key, shuffle keys by hash value
      -> (key, list<value'>) -> Reducer -> output
      (Mapper, Combiner, and Reducer are user-supplied; the sort/shuffle step belongs to the framework.)

  13. Fault tolerance • If the probability for a single machine to fail is p, the probability that some machine in a cluster of n machines fails is 1 - (1-p)^n • For p = 0.0005, n = 2000 we get 1 - 0.9995^2000 ≈ 0.63 • We cannot assume no faults in large-scale clusters. • The framework takes care of fault tolerance. • There is a master that controls the data flow. • If a mapper/reducer fails – just resend the task to a different machine. • If the master fails – some other machine becomes the new master.
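The arithmetic behind the first bullet, spelled out (assuming independent machine failures):

# Probability that at least one of n machines fails.
p, n = 0.0005, 2000
prob_some_machine_fails = 1 - (1 - p) ** n
print(round(prob_some_machine_fails, 3))   # ~0.632: failures are the norm at this scale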

  14. Straggler Tasks
      The slowest task (the straggler) determines the job latency.
      [Diagram: the same 4-mapper/2-reducer job, with input and output on the DFS and one slow task holding up the whole job.]

  15. Speculative Execution • Schedule a backup task if the original task takes too long to complete • Same input(s), different output(s) • Failed tasks and stragglers get the same treatment • Let the fastest win • After one task completes, kill all the clones • Challenge: how can we tell a task is late?
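One possible answer to the closing question, sketched under the assumption of a Hadoop-style heuristic that compares each running task's progress to its peers and backs up the ones that lag far behind; the function name and the 0.2 threshold are illustrative, not part of any framework's API.

def pick_backup_candidates(progress, lag_threshold=0.2):
    # progress: task_id -> fraction of the task completed so far (0.0 to 1.0).
    # Flag running tasks whose progress is well below the average of their peers;
    # the threshold is an illustrative assumption, not a framework constant.
    if not progress:
        return []
    average = sum(progress.values()) / len(progress)
    return [task for task, score in progress.items()
            if score < average - lag_threshold]

# e.g. pick_backup_candidates({"M1": 0.9, "M2": 0.85, "M3": 0.3}) -> ["M3"]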

  16. Summary • A simple paradigm for batch processing • Data- and computation-intensive jobs • Simplicity is key for scalability • No silver bullet • E.g., MPI is better for iterative computation-intensive workloads (e.g., scientific simulations)
