
MapReduce and Hadoop




  1. MapReduce and Hadoop Frankie Pike

  2. Why care? • 2010: 1.2 zettabytes • 1.2 trillion gigabytes • DVDs past the moon • 2-way = 6 newspapers every day • ~58% growth per year

  3. Why care?

  4. Why care? • Google’s capacity = 1 exabyte • 24 hours of YouTube > Internet in 2000 • 4 years of video / day on YouTube • 100 trillion words online

  5. Common Architecture http://www.adopenstatic.com/images/resources/blog/Kerberos6.jpg

  6. Common Architecture • Single point of failure • Space-constraints • Multi-tenancy difficulties • Re-writing of programs or changes to network config

  7. MapReduce

  8. The Promise • High reliability: any node can go down • High scalability: easy to add nodes • Multi-tenancy • Cost reduction • “Cloud-friendly” • Java, C++, C#, Python, R • Transparent parallelization

  9. The Kryptonite • Data set needs to be “big enough” • Consistency mid-processing

  10. Two Steps in MapReduce • Map • Reduce

  11. Mapping • Input K/V pairs -> intermediate K/V pairs • Input and intermediate types can differ • (Server Key, Blog Data) -> (Blog Key, Post Count) • Output is sorted and partitioned for reduction • Number of maps depends on task and cluster • 10 TB of data with a 128 MB block size = about 82,000 maps • 10-100 maps per node is ideal
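
The map-count figure above is simply input size divided by block size, and the mapping step itself fits in a few lines. The Python below is an illustrative sketch, not Hadoop's API: `num_map_tasks`, `map_blog`, and `partition` are hypothetical names, binary units (1 TB = 1024^4 bytes) are assumed, and one map task per block is an assumption.

```python
import math
import zlib
from collections import defaultdict

MB = 1024 ** 2
TB = 1024 ** 4

def num_map_tasks(input_bytes, block_bytes):
    """Assumes one map task per input block."""
    return math.ceil(input_bytes / block_bytes)

def map_blog(server_key, blog_data):
    """Map (server key, blog data) to intermediate (blog key, post count) pairs.

    blog_data is assumed to be a dict of blog key -> list of posts.
    """
    for blog_key, posts in blog_data.items():
        yield blog_key, len(posts)

def partition(pairs, num_reducers):
    """Sort intermediate pairs and send every pair with the same key to the same bucket."""
    buckets = defaultdict(list)
    for key, value in sorted(pairs):
        bucket = zlib.crc32(str(key).encode()) % num_reducers
        buckets[bucket].append((key, value))
    return buckets

print(num_map_tasks(10 * TB, 128 * MB))  # 81920, the "about 82,000" quoted above
```

Hashing the key to pick a bucket guarantees that all values for one key reach the same reducer, which is what lets the reduce step consolidate matching keys.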

  12. Reducing • Intermediate K/V -> Intermediate K/V (smaller) • Matching keys consolidated • (A, 15); (B, 6); (A, 3) -> (A, 18); (B, 6) • Number of Reductions >= 0 • Hopefully smaller dataset at each iteration • Reduce as much as needed
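
The consolidation shown above can be sketched as a sum over matching keys. This is a minimal Python illustration; `reduce_counts` is a hypothetical name, and summation is assumed as the combining operation because that is what the (A, 15); (A, 3) -> (A, 18) example implies.

```python
from collections import defaultdict

def reduce_counts(pairs):
    """Consolidate pairs that share a key by summing their values."""
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return sorted(totals.items())

first_pass = reduce_counts([("A", 15), ("B", 6), ("A", 3)])
print(first_pass)  # [('A', 18), ('B', 6)]

# The output has the same shape as the input, so it can be reduced again
# together with newly arrived pairs.
second_pass = reduce_counts(first_pass + [("B", 4)])
print(second_pass)  # [('A', 18), ('B', 10)]
```

Because input and output share one format, the reduce step can run zero, one, or many times, which matches "Reduce as much as needed".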

  13. An Example { "type": "post", "name": "Raven's Map/Reduce functionality", "blog_id": 1342, "post_id": 29293921, "tags": ["raven", "nosql"], "post_content": "<p>...</p>", "comments": [ { "source_ip": "124.2.21.2", "author": "martin", "text": "..."} ] } Goal: count the comments for each blog http://ayende.com/blog/4435/map-reduce-a-visual-explanation

  14. Step 1: Map to final format http://ayende.com/blog/4435/map-reduce-a-visual-explanation

  15. Step 2: Reduce (Partition) http://ayende.com/blog/4435/map-reduce-a-visual-explanation

  16. Step 3: Reduce (more) http://ayende.com/blog/4435/map-reduce-a-visual-explanation

  17. Step 4: Reduce (most) http://ayende.com/blog/4435/map-reduce-a-visual-explanation
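
Steps 1 through 4 can be sketched end to end in Python. This is an illustration under assumptions, not the code from the cited post: the function names, the batch size, and the sample documents are made up, and only the `blog_id` and `comments` fields from the example document are used.

```python
from collections import defaultdict
from itertools import chain

def map_post(doc):
    """Step 1: map a post document to the final format (blog_id, comment count)."""
    return doc["blog_id"], len(doc["comments"])

def reduce_counts(pairs):
    """Sum comment counts per blog_id."""
    totals = defaultdict(int)
    for blog_id, count in pairs:
        totals[blog_id] += count
    return list(totals.items())

def count_comments(docs, batch_size=2):
    mapped = [map_post(d) for d in docs]
    # Step 2: reduce each batch (partition) independently.
    batches = [mapped[i:i + batch_size] for i in range(0, len(mapped), batch_size)]
    partial = [reduce_counts(batch) for batch in batches]
    # Steps 3-4: reduce the partial results again until one row per blog remains.
    return dict(reduce_counts(chain.from_iterable(partial)))

# Hypothetical sample documents shaped like the example post.
posts = [
    {"blog_id": 1342, "comments": [{"author": "martin"}]},
    {"blog_id": 1342, "comments": [{"author": "a"}, {"author": "b"}]},
    {"blog_id": 7, "comments": []},
    {"blog_id": 7, "comments": [{"author": "c"}, {"author": "d"}, {"author": "e"}]},
    {"blog_id": 1342, "comments": [{"author": "f"}]},
]

print(count_comments(posts))  # {1342: 4, 7: 3}
```

The batch size changes how many partial results exist but not the final counts, since summing is associative.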

  18. Single Node http://bc.tech.coop/blog/070520.html

  19. Dual Node http://map-reduce.wikispaces.asu.edu/

  20. N-Nodes http://www.inventoland.net/img/blog/mapReduce.png

  21. Dealing with Failure • Workers: occasional check-in pings by masters • Masters: data structures get periodic auto-saves and consistency checks; can restart from periodic saves • Bandwidth: tasks attempt to pair with local storage
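
The worker case above can be sketched as a master that records each successful ping and requeues the tasks of any worker that stays silent too long. This Python sketch is an assumption-laden illustration, not Hadoop's or Google's implementation: the `Master` class, its method names, and the timeout value are hypothetical, and timestamps are passed in explicitly to keep the example deterministic.

```python
class Master:
    def __init__(self, timeout):
        self.timeout = timeout   # seconds of silence before a worker is presumed dead
        self.last_seen = {}      # worker id -> time of last successful ping
        self.assigned = {}       # worker id -> list of task ids
        self.pending = []        # tasks waiting to be (re)assigned

    def record_ping(self, worker, now):
        self.last_seen[worker] = now

    def assign(self, worker, task):
        self.assigned.setdefault(worker, []).append(task)

    def check_workers(self, now):
        """Requeue the tasks of workers not heard from within the timeout."""
        failed = [w for w, seen in self.last_seen.items() if now - seen > self.timeout]
        for worker in failed:
            self.pending.extend(self.assigned.pop(worker, []))
            del self.last_seen[worker]
        return failed

master = Master(timeout=10)
master.record_ping("w1", now=0)
master.record_ping("w2", now=0)
master.assign("w1", "map-0")
master.assign("w2", "map-1")
master.record_ping("w2", now=8)       # w2 checks in again, w1 stays silent

print(master.check_workers(now=12))   # ['w1']
print(master.pending)                 # ['map-0']
```

Requeued tasks can then be handed to any healthy worker, which is what lets any single node go down without losing the job.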

  22. Has it worked? • Patented • Regenerated index

  23. Apache Hadoop • “open source software for reliable, scalable, distributed computing” • Hadoop Distributed File System (HDFS) • Hadoop MapReduce • Cassandra (multi-master database) • HBase (scalable, distributed, structured database) • Mahout (data mining and machine learning libs) • ZooKeeper (coordination service)

  24. Sources • Avankipu & Sdsalvi, Cloud Computing - An Overview. http://map-reduce.wikispaces.asu.edu • Ayende Rahien, Map/Reduce – A Visual Explanation. http://ayende.com/blog/4435/map-reduce-a-visual-explanation • http://hadoop.apache.org/ • http://en.wikipedia.org/wiki/MapReduce/
