This document introduces MapReduce, a programming model for analyzing large datasets by emitting key-value pairs and then merging the values that share a key with a reduce function. A word frequency example illustrates the process and shows how MapReduce handles large data volumes by hiding the details of parallel computing, fault tolerance, and load balancing. The code breakdown covers the two main components, the Map and Reduce functions, and shows how they are applied to count word occurrences.
MapReduce: Simplified Data Processing on Large Clusters • Appendix A: Word Frequency • Alex Newton, Billy Coss
Contents • Abstract • Introduction • MapReduce • Word Frequency Analysis Sample Code
Abstract • MapReduce is a programming model used to analyze large amounts of data • Map emits a key-value pair for each item it processes, without merging duplicate keys • Reduce takes the key-value pairs emitted by Map and merges all values that share a key into a single result per key
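The three stages described above (map, group by key, reduce) can be sketched in a few lines of Python. This is a single-process illustration only, not the distributed implementation from the paper; the helper names `map_phase`, `shuffle`, and `reduce_phase` are hypothetical and chosen for this example.

```python
from collections import defaultdict


def map_phase(records, map_fn):
    """Apply map_fn to every record and collect all emitted (key, value) pairs.

    Duplicate keys are kept as-is at this stage.
    """
    return [pair for record in records for pair in map_fn(record)]


def shuffle(pairs):
    """Group values by key so each key maps to the list of its values."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(grouped)


def reduce_phase(grouped, reduce_fn):
    """Merge each key's list of values into one result per key."""
    return {key: reduce_fn(key, values) for key, values in grouped.items()}


if __name__ == "__main__":
    # Assumed toy input: one record containing three words.
    pairs = map_phase(["a b a"], lambda line: [(w, 1) for w in line.split()])
    print(pairs)    # duplicate key "a" appears twice
    print(reduce_phase(shuffle(pairs), lambda key, values: sum(values)))
```

The intermediate list still holds one pair per occurrence; only after grouping and reducing does each key appear once.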
Introduction • Data analysts at Google frequently work on extremely large sets of raw data • Parallel computing is required to process these datasets in a reasonable amount of time • MapReduce was created as an abstraction that hides the details of parallelization, fault tolerance, data distribution, and load balancing
MapReduce • Image taken from the OSDI '04 presentation by Jeff Dean and Sanjay Ghemawat.
Word Frequency Analysis Example Code • The code is divided into three parts: main, WordCounter, and Adder • WordCounter implements the Map function • It skips any leading whitespace, then parses each word out of the text • The word itself is the key; the value is 1 • Adder implements the Reduce function • It iterates through the values emitted for a given key and adds them together • Since each value is 1, the sum is the number of times the word is used
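The structure described on this slide can be approximated in Python as follows. This is a sketch, not the appendix code itself: the paper's example is written in C++ against Google's MapReduce library, whereas here `word_counter`, `adder`, and `main` are plain functions, and the grouping step inside `main` is an assumed in-memory stand-in for what the library does across machines.

```python
from collections import defaultdict


def word_counter(text):
    """Map step: yield (word, 1) for every whitespace-delimited word."""
    i, n = 0, len(text)
    while i < n:
        # Skip any leading whitespace.
        while i < n and text[i].isspace():
            i += 1
        # Scan to the end of the current word.
        start = i
        while i < n and not text[i].isspace():
            i += 1
        if start < i:
            yield text[start:i], 1


def adder(word, counts):
    """Reduce step: add up all values emitted for one word."""
    return word, sum(counts)


def main(documents):
    """Run the map step on each document, group by word, then reduce."""
    grouped = defaultdict(list)
    for text in documents:
        for word, one in word_counter(text):
            grouped[word].append(one)
    return dict(adder(word, counts) for word, counts in grouped.items())


if __name__ == "__main__":
    print(main(["the quick the", "  quick the "]))
```

Because every emitted value is 1, summing a word's values is the same as counting its occurrences.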
Sources • J. Dean & S. Ghemawat (2004). MapReduce: Simplified Data Processing on Large Clusters. OSDI '04: 6th Symposium on Operating Systems Design and Implementation, pp. 137, 149. http://research.google.com/archive/mapreduce.html