Google MapReduce

Simplified Data Processing on

Large Clusters

Jeff Dean, Sanjay Ghemawat

Google, Inc.

Presented by

Conroy Whitney

4th year CS – Web Development

Outline

  • Motivation
  • MapReduce Concept
    • Map? Reduce?
  • Example of MapReduce problem
    • Reverse Web-Link Graph
  • MapReduce Cluster Environment
  • Lifecycle of MapReduce operation
  • Optimizations to MapReduce process
  • Conclusion
    • MapReduce in Googlicious Action
Motivation: Large Scale Data Processing
  • Many tasks consist of processing lots of data to produce lots of other data
  • Want to use hundreds or thousands of CPUs ... but this needs to be easy!
  • MapReduce provides
    • User-defined functions
    • Automatic parallelization and distribution
    • Fault-tolerance
    • I/O scheduling
    • Status and monitoring
Programming Concept
  • Map
    • Perform a function on individual values in a data set to create a new list of values
    • Example: square x = x * x; map square [1,2,3,4,5] returns [1,4,9,16,25]
  • Reduce
    • Combine values in a data set to create a new value
    • Example: reduce (+) [1,2,3,4,5] returns 15 (the sum of the elements); see the sketch below
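
The same two primitives can be demonstrated with Python built-ins; a minimal sketch for illustration only (this is not the MapReduce library's actual API):

    from functools import reduce

    def square(x):
        return x * x

    values = [1, 2, 3, 4, 5]
    print(list(map(square, values)))                   # [1, 4, 9, 16, 25]
    print(reduce(lambda total, x: total + x, values))  # 15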
Example: Reverse Web-Link Graph
  • Find all pages that link to a certain page
  • Map Function
    • Outputs <target, source> pairs for each link to a target URL found in a source page
    • For each page we know what pages it links to
  • Reduce Function
    • Concatenates the list of all source URLs associated with a given target URL and emits the pair: <target, list(source)>
    • For a given web page, we know what pages link to it
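
A hedged, single-process sketch of this job in Python; parse_links, run_job, and the in-memory shuffle are stand-ins invented for illustration, not the framework's real machinery:

    import re
    from collections import defaultdict

    def parse_links(html):
        # Toy link extractor; a real job would parse HTML properly.
        return re.findall(r'href="([^"]+)"', html)

    def map_fn(source, html):
        # Emit a <target, source> pair for each link in the source page.
        for target in parse_links(html):
            yield (target, source)

    def reduce_fn(target, sources):
        # Emit <target, list(source)>: every page linking to target.
        return (target, sources)

    def run_job(pages):
        # Stand-in for the framework's shuffle/sort between map and reduce.
        intermediate = defaultdict(list)
        for source, html in pages.items():
            for target, src in map_fn(source, html):
                intermediate[target].append(src)
        return [reduce_fn(t, s) for t, s in sorted(intermediate.items())]

    pages = {"a.html": '<a href="b.html"></a>', "b.html": '<a href="a.html"></a>'}
    print(run_job(pages))  # [('a.html', ['b.html']), ('b.html', ['a.html'])]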
Additional Examples
  • Distributed grep
  • Distributed sort
  • Term-Vector per Host
  • Web Access Log Statistics
  • Document Clustering
  • Machine Learning
  • Statistical Machine Translation
Performance Boasts
  • Distributed grep
    • 10^10 100-byte records (~1 TB of data)
    • Searches for a rare 3-character pattern occurring in ~100k records
    • ~1800 workers
    • 150 seconds start to finish, including ~60 seconds of startup overhead
  • Distributed sort
    • Same files/workers as above
    • 50 lines of MapReduce code
    • 891 seconds, including overhead
    • Beats the best reported TeraSort benchmark result of 1057 seconds
Typical Cluster
  • 100s/1000s of dual-core machines with 2-4 GB of memory
  • Limited internal bandwidth
  • Temporary storage on local IDE disks
  • Google File System (GFS)
    • Distributed file system for permanent/shared storage
  • Job scheduling system
    • Jobs made up of tasks
    • Master-Scheduler assigns tasks to Worker machines
Execution Initialization
  • Split input file into 64 MB sections (GFS)
    • Read in parallel by multiple machines
  • Fork off program onto multiple machines
  • One machine is Master
  • Master assigns idle machines to either Map or Reduce tasks
  • Master coordinates data communication between map and reduce machines (sketched below)
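
A hedged sketch of this initialization in Python; the Master class, split_input helper, and task encoding are invented for illustration:

    SPLIT_SIZE = 64 * 1024 * 1024  # 64 MB, matching the GFS section size

    def split_input(file_size):
        # One (offset, length) map task per 64 MB section of the input.
        return [(off, min(SPLIT_SIZE, file_size - off))
                for off in range(0, file_size, SPLIT_SIZE)]

    class Master:
        def __init__(self, splits, n_reduce):
            self.map_tasks = list(splits)           # one map task per split
            self.reduce_tasks = list(range(n_reduce))

        def assign(self, idle_worker):
            # Map tasks are handed out first; reduce tasks follow.
            if self.map_tasks:
                return ("map", self.map_tasks.pop())
            if self.reduce_tasks:
                return ("reduce", self.reduce_tasks.pop())
            return None

    master = Master(split_input(200 * 1024 * 1024), n_reduce=2)
    print(master.assign("worker-1"))  # ('map', (201326592, 8388608)), the last 8 MB split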
Map Machine
  • Reads contents of assigned portion of input file
  • Parses and prepares data for input to map function (e.g. read <a /> from HTML)
  • Passes data into map function and saves result in memory (e.g. <target, source>)
  • Periodically writes completed work to local disk
  • Notifies Master of this partially completed work (intermediate data); sketched below
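
A hedged Python sketch of that worker loop; the partitioning, JSON spill format, and notify_master stand-in are assumptions, not the real implementation:

    import json, os, tempfile

    def notify_master(task_id, partition, path):
        # Stand-in for the RPC that reports intermediate-file locations.
        print(f"task {task_id}: partition {partition} at {path}")

    def run_map_task(task_id, records, map_fn, n_reduce, spill_every=10_000):
        # Buffer <key, value> pairs in memory, one bucket per reduce task.
        buffers = [[] for _ in range(n_reduce)]
        pending, spills = 0, 0

        def spill_to_disk():
            # Periodically write completed work to local disk ...
            nonlocal spills
            for r, buf in enumerate(buffers):
                path = os.path.join(tempfile.gettempdir(),
                                    f"map-{task_id}-{spills}-part-{r}.json")
                with open(path, "w") as f:
                    json.dump(buf, f)
                buf.clear()
                notify_master(task_id, r, path)  # ... and tell the Master
            spills += 1

        for key, value in records:
            for k, v in map_fn(key, value):
                buffers[hash(k) % n_reduce].append((k, v))
                pending += 1
            if pending >= spill_every:
                spill_to_disk()
                pending = 0
        spill_to_disk()  # flush the final partial buffer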
Reduce Machine
  • Receives notification from Master of partially completed work
  • Retrieves intermediate data from Map-Machine via remote read
  • Sorts intermediate data by key (e.g. by target page)
  • Iterates over intermediate data
    • For each unique key, sends corresponding set through reduce function
  • Appends result of reduce function to final output file (GFS); sketched below
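
A hedged Python sketch of those steps, with local files standing in for remote reads and JSON for the intermediate format:

    import json
    from itertools import groupby
    from operator import itemgetter

    def run_reduce_task(intermediate_paths, reduce_fn, output_path):
        pairs = []
        for path in intermediate_paths:        # stand-in for remote reads
            with open(path) as f:
                pairs.extend(json.load(f))
        pairs.sort(key=itemgetter(0))          # sort by key, e.g. target page
        with open(output_path, "a") as out:    # append to final output file
            for key, group in groupby(pairs, key=itemgetter(0)):
                values = [v for _, v in group]
                out.write(json.dumps(reduce_fn(key, values)) + "\n")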
Worker Failure
  • Master pings workers periodically
  • Any machine that does not respond is considered “dead”
  • Both Map- and Reduce-Machines
    • Any task in progress is reset and becomes eligible for re-scheduling
  • Map-Machines
    • Completed tasks are also reset because results are stored on local disk
    • Reduce-Machines are notified to fetch data from the new machine assigned to the task (sketched below)
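
A hedged Python sketch of this reset policy; the Task record and its states are invented for illustration:

    from dataclasses import dataclass

    @dataclass
    class Task:
        kind: str                  # "map" or "reduce"
        state: str                 # "idle", "in-progress", or "done"
        worker: str | None = None

    def handle_dead_worker(dead, tasks):
        for t in tasks:
            if t.worker != dead:
                continue
            if t.state == "in-progress":
                t.state, t.worker = "idle", None  # eligible for re-scheduling
            elif t.state == "done" and t.kind == "map":
                # Completed map output lives on the dead machine's local
                # disk, so the task is reset; reducers must be re-notified.
                t.state, t.worker = "idle", None

    tasks = [Task("map", "done", "w1"), Task("reduce", "in-progress", "w1")]
    handle_dead_worker("w1", tasks)
    print(tasks)  # both tasks are back to idle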
Skipping Bad Records
  • Bugs in user code (from unexpected data) cause deterministic crashes
    • Ideally, fix the bug and re-run
    • Not possible with third-party code
  • As a worker dies, it sends a “last gasp” UDP packet to the Master describing the record
  • If more than one worker dies over the same record, the Master issues another re-execute command
  • It tells the new worker to skip the problem record (sketched below)
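
A hedged Python sketch of the Master-side bookkeeping; function names and the record-id scheme are assumptions:

    from collections import Counter

    failures = Counter()           # record id -> number of worker deaths

    def on_last_gasp(record_id):
        # Called when a dying worker's UDP packet names a record.
        failures[record_id] += 1

    def records_to_skip():
        # Skip a record only after it has killed more than one worker.
        return {r for r, n in failures.items() if n > 1}

    def safe_map(records, map_fn, skip):
        for rid, (key, value) in enumerate(records):
            if rid in skip:
                continue           # problem record: skip it
            yield from map_fn(key, value)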
Backup Tasks
  • Some workers (“stragglers”) do not perform optimally
    • Other processes demanding resources
    • Bad Disks (correctable errors)
      • Slows I/O speeds from 30 MB/s to 1 MB/s
    • CPU caches disabled?!
  • Near end of phase, schedule redundant execution of in-process tasks
  • First to complete “wins”
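
A hedged Python sketch of the idea using threads, where whichever copy of a task finishes first wins:

    from concurrent.futures import ThreadPoolExecutor, FIRST_COMPLETED, wait

    def run_with_backup(task_fn, args, pool):
        # Schedule the task twice and take whichever copy finishes first.
        primary = pool.submit(task_fn, *args)
        backup = pool.submit(task_fn, *args)   # redundant execution
        done, pending = wait([primary, backup], return_when=FIRST_COMPLETED)
        for f in pending:
            f.cancel()                         # the straggler's result is ignored
        return next(iter(done)).result()

    with ThreadPoolExecutor(max_workers=4) as pool:
        print(run_with_backup(sum, ([1, 2, 3],), pool))  # 6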
Locality
  • Network bandwidth is scarce
  • Google File System (GFS)
    • Around 64 MB file sizes
    • Redundant storage (usually 3+ machines)
  • Assign Map-Machines to work on portions of input files which they already have on local disk
  • Read input files at local disk speeds
  • Without this, read speed is limited by the network switch (sketched below)
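
A hedged Python sketch of locality-aware assignment; the replica map is an invented stand-in for GFS metadata:

    def pick_map_worker(split, idle_workers, replicas):
        # replicas: split -> set of machines holding a GFS copy of it.
        local = [w for w in idle_workers if w in replicas.get(split, set())]
        if local:
            return local[0]                    # read at local disk speed
        return idle_workers[0] if idle_workers else None

    print(pick_map_worker("part-0", ["w1", "w2"], {"part-0": {"w2"}}))  # w2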
Conclusion
  • Complete rewrite of the production indexing system
    • 20+ TB of data
    • Indexing takes 5-10 MapReduce operations
    • Indexing code is simpler, smaller, and easier to understand
    • Fault tolerance, distribution, and parallelization are hidden within the MapReduce library
    • Avoids extra passes over the data
    • Easy to change the indexing system
    • Indexing performance improves simply by adding new machines to the cluster