
# Google MapReduce





Simplified Data Processing on Large Clusters

Jeff Dean, Sanjay Ghemawat (Google, Inc.)
http://labs.google.com/papers/mapreduce.html

Presented by Conroy Whitney, 4th year CS – Web Development

Outline
• Motivation
• MapReduce Concept
• Map? Reduce?
• Example of MapReduce problem
• MapReduce Cluster Environment
• Lifecycle of MapReduce operation
• Optimizations to MapReduce process
• Conclusion
• MapReduce in Googlicious Action
Motivation: Large Scale Data Processing
• Many tasks consist of processing lots of data to produce lots of other data
• Want to use hundreds or thousands of CPUs ... but this needs to be easy!
• MapReduce provides
• User-defined functions
• Automatic parallelization and distribution
• Fault-tolerance
• I/O scheduling
• Status and monitoring
Programming Concept
• Map
• Perform a function on individual values in a data set to create a new list of values
• Example: square x = x * x; map square [1,2,3,4,5] returns [1,4,9,16,25]
• Reduce
• Combine values in a data set to create a new value
• Example: reduce (+) 0 [1,2,3,4,5] returns 15 (the sum of the elements); see the sketch below
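The two primitives can be made concrete with a short sketch; Python is used here purely for illustration, since the slide's own examples are pseudocode:

```python
from functools import reduce

def square(x):
    return x * x

values = [1, 2, 3, 4, 5]

# Map: apply a function to each value, producing a new list of values
squares = list(map(square, values))                 # [1, 4, 9, 16, 25]

# Reduce: combine all values in the list into a single value
total = reduce(lambda acc, x: acc + x, values, 0)   # 15
```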
Example: Reverse Web-Link Graph
• Find all pages that link to a certain page
• Map Function
• Outputs <target, source> pairs for each link to a target URL found in a source page
• For each page we know what pages it links to
• Reduce Function
• Concatenates the list of all source URLs associated with a given target URL and emits the pair: <target, list(source)>
• For a given web page, we know what pages link to it (sketched below)
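A minimal sketch of this reverse-link example, assuming pages arrive as (source URL, outgoing links) pairs; the tiny in-process driver below only stands in for the real MapReduce library:

```python
from collections import defaultdict

def map_links(source_url, outgoing_links):
    # Emit a <target, source> pair for each link found in the source page
    for target in outgoing_links:
        yield target, source_url

def reduce_links(target_url, sources):
    # Concatenate all sources for a target and emit <target, list(source)>
    return target_url, list(sources)

# Toy driver standing in for the MapReduce library
pages = {"a.html": ["b.html", "c.html"], "b.html": ["c.html"]}
intermediate = defaultdict(list)
for src, links in pages.items():
    for target, source in map_links(src, links):
        intermediate[target].append(source)

output = [reduce_links(t, s) for t, s in intermediate.items()]
# e.g. ("c.html", ["a.html", "b.html"]): both pages link to c.html
```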
Other Applications
• Distributed grep
• Distributed sort
• Term-Vector per Host
• Web Access Log Statistics
• Document Clustering
• Machine Learning
• Statistical Machine Translation
Performance Boasts
• Distributed grep
• 10^10 100-byte records (~1TB of data)
• A rare 3-character pattern found in ~100k records
• ~1800 workers
• 150 seconds start to finish, including ~60 seconds startup overhead
• Distributed sort
• Same files/workers as above
• 50 lines of MapReduce code
• Sorted in 891 seconds, beating the best reported result of 1057 seconds for the TeraSort benchmark
Typical Cluster
• 100s/1000s of dual-core machines, each with 2-4GB of memory
• Limited internal bandwidth
• Temporary storage on local IDE disks
• Distributed file system for permanent/shared storage
• Job scheduling system
• Master-Scheduler assigns tasks to Worker machines
Execution Initialization
• Split input file into 64MB sections (GFS)
• Read in parallel by multiple machines
• Fork off program onto multiple machines
• One machine is Master
• Master assigns idle machines to either Map or Reduce tasks
• Master coordinates data communication between map and reduce machines (see the sketch below)
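A hypothetical sketch of the task bookkeeping described above: one map task per 64MB split, plus a fixed number of reduce tasks (the names here are illustrative, not from the paper):

```python
SPLIT_SIZE = 64 * 1024 * 1024  # 64MB input splits, matching the GFS chunk size

def make_tasks(input_size_bytes, num_reduce_tasks):
    # One map task per 64MB section of the input file
    map_tasks = [("map", offset, min(offset + SPLIT_SIZE, input_size_bytes))
                 for offset in range(0, input_size_bytes, SPLIT_SIZE)]
    # A fixed number of reduce (partition) tasks
    reduce_tasks = [("reduce", r) for r in range(num_reduce_tasks)]
    return map_tasks, reduce_tasks

# A 1GB input file yields 16 map tasks for the Master to hand out
maps, reduces = make_tasks(1024 * 1024 * 1024, num_reduce_tasks=4)
```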
Map-Machine
• Reads contents of assigned portion of input-file
• Parses and prepares data for input to map function (e.g. reads <a href> links out of HTML)
• Passes data into map function and saves result in memory (e.g. <target, source>)
• Periodically writes completed work to local disk
• Notifies Master of this partially completed work (intermediate data); sketched below
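A sketch of that map-worker loop, with hypothetical callbacks (`read_split`, `spill_to_local_disk`, `notify_master`) standing in for the real library's I/O:

```python
def run_map_task(read_split, map_fn, spill_to_local_disk, notify_master,
                 buffer_limit=10_000):
    buffer = []
    for record in read_split():               # parsed input, e.g. links from HTML
        for key, value in map_fn(record):     # e.g. <target, source> pairs
            buffer.append((key, value))
        # Periodically write completed work to local disk
        # (the real library also partitions the buffer by hash(key) % R)
        if len(buffer) >= buffer_limit:
            location = spill_to_local_disk(buffer)
            notify_master(location)           # Master tells reducers where it is
            buffer.clear()
    if buffer:                                # flush the final partial buffer
        notify_master(spill_to_local_disk(buffer))
```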
Reduce-Machine
• Retrieves intermediate data from Map-Machine via remote-read
• Sorts intermediate data by key (e.g. by target page)
• Iterates over intermediate data
• For each unique key, sends corresponding set through reduce function
• Appends result of reduce function to final output file (GFS); see the sketch below
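The reduce side, sketched the same way; `fetch_intermediate`, `reduce_fn`, and `append_to_output` are assumed hooks, not the library's actual API:

```python
from itertools import groupby
from operator import itemgetter

def run_reduce_task(fetch_intermediate, reduce_fn, append_to_output):
    # Remote-read intermediate <key, value> pairs, then sort them by key
    pairs = sorted(fetch_intermediate(), key=itemgetter(0))
    # For each unique key, send the corresponding values through reduce
    for key, group in groupby(pairs, key=itemgetter(0)):
        values = [value for _, value in group]
        append_to_output(reduce_fn(key, values))  # append to final output (GFS)
```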
Worker Failure
• Master pings workers periodically
• Any machine that does not respond is considered “dead”
• Both Map- and Reduce-Machines
• Any task in progress is reset and becomes eligible for re-scheduling
• Map-Machines
• Completed map tasks are also reset, because their results are stored on the failed machine's local disk
• Reduce-Machines are notified to fetch data from the new machine assigned to the task (see the sketch below)
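A rough sketch of the Master's failure handling, with workers modeled as plain dictionaries (a simplification, not the paper's data structures):

```python
import time

PING_TIMEOUT = 60.0  # seconds of silence before a worker is considered "dead"

def check_workers(workers):
    now = time.time()
    for w in workers:
        if now - w["last_pong"] <= PING_TIMEOUT:
            continue
        w["alive"] = False
        # Any task in progress is reset and becomes eligible for re-scheduling
        for task in w["in_progress"]:
            task["state"] = "idle"
        if w["kind"] == "map":
            # Completed map output lives on the dead machine's local disk,
            # so completed map tasks must also be redone; reducers are then
            # pointed at whichever machine re-executes them
            for task in w["completed"]:
                task["state"] = "idle"
```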
Skipping Bad Records
• Bugs in user code (from unexpected data) cause deterministic crashes
• Ideally, fix the bug and re-run
• Not possible with third-party code
• When a worker dies, it sends a “last gasp” UDP packet to the Master describing the record it was processing
• If more than one worker dies on the same record, the Master issues yet another re-execute command
• Tells the new worker to skip the problem record (see the sketch below)
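The record-skipping logic might look like this sketch, where a Python exception stands in for the crash and the "last gasp" UDP packet:

```python
# Master-side bookkeeping (hypothetical): failures seen per record index
failures = {}

def record_last_gasp(record_index):
    failures[record_index] = failures.get(record_index, 0) + 1

def should_skip(record_index):
    # Skip once more than one worker has died on the same record
    return failures.get(record_index, 0) > 1

def guarded_map(map_fn, record, record_index):
    if should_skip(record_index):
        return []                       # Master told this worker to skip it
    try:
        return list(map_fn(record))
    except Exception:
        record_last_gasp(record_index)  # "last gasp": report the record, then die
        raise
```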
Backup Tasks
• Some “straggler” machines do not perform optimally
• Other processes demanding resources
• Faulty disks can slow I/O speeds from 30MB/s to 1MB/s
• One machine had its CPU cache disabled (?!)
• Near the end of a phase, schedule redundant execution of in-progress tasks
• First copy to complete “wins” (see the sketch below)
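Backup-task scheduling reduces to a few lines; `assign_to_idle_worker` and the task dictionaries are assumptions for the sketch:

```python
def schedule_backups(tasks, assign_to_idle_worker, phase_nearly_done):
    if not phase_nearly_done():
        return
    for task in tasks:
        # Launch a redundant copy of every still-in-progress task;
        # whichever execution finishes first "wins", the other is discarded
        if task["state"] == "in_progress" and "backup" not in task:
            task["backup"] = assign_to_idle_worker(task)
```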
Locality
• Network bandwidth is scarce
• Input files stored in ~64MB chunks
• Redundant storage (usually on 3+ machines)
• Assign Map-Machines to work on portions of input-files that they already have on local disk
• Read input file at local disk speeds
• Without this, read speed is limited by the network switch (see the sketch below)
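Locality-aware assignment, sketched with a hypothetical replica map from input split to the hosts holding a GFS copy of it:

```python
def pick_map_task(idle_tasks, worker_host, replica_hosts):
    # Prefer a split whose replica already sits on the worker's local disk,
    # so input is read at local-disk speed instead of across the switch
    for task in idle_tasks:
        if worker_host in replica_hosts[task["split"]]:
            return task
    # Otherwise fall back to a remote read (the real scheduler also
    # prefers replicas on the same network rack)
    return idle_tasks[0] if idle_tasks else None
```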
Conclusion
• Complete rewrite of the production indexing system
• 20+ TB of data
• Indexing takes 5-10 MapReduce operations
• Indexing code is simpler, smaller, and easier to understand
• Fault Tolerance, Distribution, Parallelization hidden within MapReduce library
• Avoids extra passes over the data
• Easy to change indexing system
• Improve performance of indexing process by adding new machines to cluster