Map-Reduce: Win -or- Epic Win


Presentation Transcript


  1. CSC313: Advanced Programming Topics. Map-Reduce: Win -or- Epic Win

  2. Brief History of Google: BackRub (1996), 4 disk drives, 24 GB total storage

  3. Brief History of Google: Google (1998), 44 disk drives, 366 GB total storage

  4. Traditional Design Principles
  • If the problem is big enough, use a supercomputer
  • Supercomputers use desktop CPUs, just a lot more of them
  • They also provide huge bandwidth to memory, equivalent to many machines' bandwidth at once
  • But supercomputers are VERY, VERY expensive, and maintenance stays expensive once the machine is bought
  • You do get something for the money: high quality == low downtime
  • The safe, expensive solution to very large problems

  5. Why Trade Money for Safety?

  6. How Was Search Performed?
  • The browser first asks DNS to resolve www.yahoo.com to an IP address
  • http://www.yahoo.com/search?p=pager thus becomes http://209.191.122.70/search?p=pager
  • The search request is then sent directly to that single address

  7. Google’s Big Insight
  • Performing search is “embarrassingly parallel”
  • No need for a supercomputer and all that expense
  • Can instead use lots & lots of desktops, with identical effective bandwidth & performance
  • But the problem is that desktop machines are unreliable
  • So budget for 2 replacements of each, since the machines are cheap
  • Just expect failure; the software provides the quality

  8. Brief History of Google: Google (2012), ?0,000 total servers, ??? PB total storage

  9. How Is Search Performed Now?
  • The query http://209.85.148.100/search?q=android is now spread across many specialized machines
  • A Spell Checker and an Ad Server each handle the query in parallel with the search itself
  • Index Servers (TB) and Document Servers (TB) together find and return the matching pages

  10. Google’s Processing Model
  • Buy cheap machines & prepare for the worst: machines are going to fail, but this is still the cheaper approach
  • Important steps keep the whole system reliable:
    - Replicate data so that information losses are limited
    - Move data freely so loads can always be rebalanced
  • These decisions lead to many other benefits:
    - Scalability is helped by the focus on balancing
    - Search speed improves; performance is much better
    - Resources are utilized fully, since search demand varies

  11. Heterogeneous Processing
  • Buying the cheapest computers means variance between machines is high
  • Programs must handle both homo- & heterogeneous systems
  • A centralized work queue helps with the different speeds, since faster machines simply pull new tasks more often (see the sketch below)
  • This process also leads to a few small downsides: space, power consumption & cooling costs
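
Not from the deck; a minimal runnable Java sketch of that centralized work-queue idea, with hypothetical names and timings. Workers of different speeds drain one shared queue, so the fast machine naturally ends up doing more of the work:

import java.util.concurrent.*;

public class WorkQueueDemo {
    public static void main(String[] args) throws InterruptedException {
        // One central queue of 100 task IDs shared by all "machines".
        BlockingQueue<Integer> tasks = new LinkedBlockingQueue<>();
        for (int i = 0; i < 100; i++) tasks.add(i);

        Thread fast = worker("fast", tasks, 10);   // 10 ms per task
        Thread slow = worker("slow", tasks, 50);   // 50 ms per task
        fast.start(); slow.start();
        fast.join();  slow.join();
    }

    // Each worker pulls the next task whenever it is free; no per-machine
    // assignment is ever computed, so speed differences balance themselves.
    static Thread worker(String name, BlockingQueue<Integer> tasks, long msPerTask) {
        return new Thread(() -> {
            int done = 0;
            while (tasks.poll() != null) {
                try { Thread.sleep(msPerTask); } catch (InterruptedException e) { return; }
                done++;
            }
            System.out.println(name + " completed " + done + " tasks");
        });
    }
}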

  12. Complexity at Google: avoid this nightmare using abstractions

  13. Google Abstractions
  • Google File System: handles replication to provide scalability & durability
  • BigTable: manages large structured data sets
  • Chubby: gonna skip past that joke; a distributed locking service
  • MapReduce: if the job fits, easy parallelism is possible without much work

  14. Remember Google’s Problem

  15. MapReduce Overview
  • The programming model provides a good Façade:
    - Automatic parallelization & load balancing
    - Network and disk I/O optimization
    - Robust performance even if machines fail
  • The idea came from 2 Lisp (functional) primitives (see the example below):
    - Map: process each entry in a list using some function
    - Reduce: recombine the data using a given function
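
Not from the deck; a tiny illustration of those two primitives using Java streams (any functional language would do). Map transforms every list entry, and Reduce recombines the results:

import java.util.List;

public class MapReducePrimitives {
    public static void main(String[] args) {
        List<String> words = List.of("to", "be", "or", "not", "to", "be");

        int totalLetters = words.stream()
                                .map(String::length)        // map: word -> its length
                                .reduce(0, Integer::sum);   // reduce: recombine with +
        System.out.println(totalLetters);                   // prints 13
    }
}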

  16. Typical MapReduce Problem
  • Read lots and lots of data (e.g., TBs)
  • Map: extract the important data from each entry in the input
  • Combine the Maps' outputs and sort the entries by key
  • Reduce: process each key's entries to get the result for that key
  • Output the final result & watch the money roll in
  • The template method is always the same; just the map & reduce hook methods change (see the skeleton below)
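
Not from the deck; a minimal Java sketch of that template-method view, with hypothetical class and method names. The run() outline is fixed; each job supplies only the two hooks:

import java.util.*;

abstract class MapReduceJob<K2 extends Comparable<K2>, V2, R> {
    // The two hook methods that vary from problem to problem.
    abstract void map(String key, String value, Map<K2, List<V2>> out);
    abstract R reduce(K2 key, List<V2> values);

    // Convenience for map() implementations: collect a (key, value) pair.
    static <K, V> void emit(Map<K, List<V>> out, K key, V value) {
        out.computeIfAbsent(key, k -> new ArrayList<>()).add(value);
    }

    // The template method: read -> map -> shuffle/sort -> reduce -> output.
    final Map<K2, R> run(Map<String, String> input) {
        Map<K2, List<V2>> intermediate = new TreeMap<>();  // TreeMap keeps keys sorted
        for (Map.Entry<String, String> e : input.entrySet())
            map(e.getKey(), e.getValue(), intermediate);   // map phase (shuffle happens as pairs land)
        Map<K2, R> result = new TreeMap<>();
        for (Map.Entry<K2, List<V2>> e : intermediate.entrySet())
            result.put(e.getKey(), reduce(e.getKey(), e.getValue())); // reduce phase
        return result;
    }
}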

  17. Pictorial View of MapReduce

  18. Ex: Count Word Frequencies
  • Map processes each file separately: input is Key=URL, Value=text on page, and for every word found it emits Key'=word, Value'="1"
  • In the shuffle step, the Maps' outputs are combined & entries sorted by key, e.g. ("be","1") ("be","1") ("not","1") ("or","1") ("to","1") ("to","1")
  • Reduce combines each key's entries to compute the final output: ("be","2") ("not","1") ("or","1") ("to","2")

  19. Word Frequency Pseudo-code

  // Map: emit ("word", "1") for every word on the page
  Map(String input_key, String input_values) {
    String[] words = input_values.split(" ");
    foreach w in words {
      EmitIntermediate(w, "1");
    }
  }

  // Reduce: sum the counts collected for each word
  Reduce(String key, Iterator intermediate_values) {
    int result = 0;
    foreach v in intermediate_values {
      result += ParseInt(v);
    }
    Emit(result);
  }
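
For comparison, the same two hooks written against the hypothetical MapReduceJob skeleton sketched earlier, runnable on one machine:

import java.util.*;

class WordCount extends MapReduceJob<String, String, Integer> {
    void map(String url, String text, Map<String, List<String>> out) {
        for (String w : text.split(" "))
            emit(out, w, "1");                        // EmitIntermediate(w, "1")
    }
    Integer reduce(String word, List<String> counts) {
        int result = 0;
        for (String c : counts) result += Integer.parseInt(c);
        return result;                                // Emit(result)
    }

    public static void main(String[] args) {
        Map<String, String> pages = Map.of("http://a.example", "to be or not to be");
        System.out.println(new WordCount().run(pages)); // {be=2, not=1, or=1, to=2}
    }
}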

  20. Ex: Build Search Index
  • Map processes each file separately & records the words found on each: input is Key=URL, Value=text on page, and for every word it emits Key'=word, Value'=URL
  • To get the search Map, Reduce combines each key's results: output is Key=word, Value=list of URLs containing that word

  21. Search Index Pseudo-code

  // Map: emit ("word", URL) for every word on the page
  Map(String input_key, String input_values) {
    String[] words = input_values.split(" ");
    foreach w in words {
      EmitIntermediate(w, input_key);
    }
  }

  // Reduce: collect every URL on which each word appears
  Reduce(String key, Iterator intermediate_values) {
    List result = new ArrayList();
    foreach v in intermediate_values {
      result.add(v);
    }
    Emit(result);
  }
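
Again as a runnable sketch on the hypothetical skeleton, with made-up pages:

import java.util.*;

class InvertedIndex extends MapReduceJob<String, String, List<String>> {
    void map(String url, String text, Map<String, List<String>> out) {
        for (String w : text.split(" "))
            emit(out, w, url);                        // EmitIntermediate(w, input_key)
    }
    List<String> reduce(String word, List<String> urls) {
        return new ArrayList<>(urls);                 // every URL containing this word
    }

    public static void main(String[] args) {
        Map<String, String> pages = Map.of(
            "http://a.example", "to be or not to be",
            "http://b.example", "let it be");
        System.out.println(new InvertedIndex().run(pages).get("be"));
        // prints the URLs containing "be" (order depends on Map iteration)
    }
}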

  22. Ex: Page Rank Computation
  • Google’s algorithm for ranking pages' relevance
  • Map: input is Key=<URL, rank>, Value=links on page; for each of the page's N links it emits Key'=link, Value'=<URL, rank/N>, and also passes the page's own link list through for the next iteration
  • Reduce: sums the rank/N contributions arriving at each URL to get its new rank, outputting Key=<URL, new rank>, Value=links on page (see the sketch below)
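
Not the deck's code; a minimal single-machine Java sketch of one PageRank iteration, folding the map step (each page emits rank/N to every link target) and the reduce step (sum the contributions per target) into plain loops. The 3-page graph is made up, and the damping factor is omitted as on the slide:

import java.util.*;

public class PageRankStep {
    // One iteration: every page splits its rank evenly across its links.
    static Map<String, Double> step(Map<String, Double> ranks,
                                    Map<String, List<String>> links) {
        Map<String, Double> next = new HashMap<>();
        for (String url : ranks.keySet()) next.put(url, 0.0);
        for (String url : links.keySet()) {
            List<String> outgoing = links.get(url);
            double share = ranks.get(url) / outgoing.size();  // rank / N
            for (String target : outgoing)                    // "map" emit
                next.merge(target, share, Double::sum);       // "reduce" sum
        }
        return next;
    }

    public static void main(String[] args) {
        Map<String, List<String>> links = Map.of(
            "A", List.of("B", "C"),
            "B", List.of("C"),
            "C", List.of("A"));
        Map<String, Double> ranks = new HashMap<>(Map.of("A", 1.0, "B", 1.0, "C", 1.0));
        for (int i = 0; i < 10; i++) ranks = step(ranks, links); // iterate toward convergence
        System.out.println(ranks);
    }
}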
