

Leroy Garcia

Map Reduce


What is Map Reduce?

  • A patented programming model developed by Google

    • Derived from LISP and other forms of functional programming

  • Used for processing and generating large data sets

  • Exploits large clusters of commodity computers

  • Executes processing in a distributed manner

  • Easy to use; no messy parallelization code for the user to write


Implementation at Google

  • Machines w/ Multiple Processors

  • Commodity Networking Hardware

  • Cluster of Hundreds or Thousands of Machines

  • IDE Disks used for storage

  • Input Data managed by GFS (the Google File System)

  • Users submit jobs to a scheduling system


Introduction

  • How does Map Reduce work?


Overview

  • Programming Model

  • Implementation

  • Refinement

  • Performance

  • Related Topics

  • Conclusion


Programming Model



  • Map

    • Input: key/value pair

      • Key: ex. Document Name

      • Value: ex. Document Contents

    • Output:

      • A set of intermediate key/value pairs



  • Reduce (see the word-count sketch below)

    • Input: an intermediate key and the list of values emitted for that key

      • Key: ex. a word

      • Values: ex. the list of values collected for that word

    • Output:

      • A list of values or a single value
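As a minimal illustration of this model, here is word count expressed as a pair of Python functions (a sketch only; the names map_fn and reduce_fn are placeholders, not Google's actual C++ API):

    # map(key, value): key is ex. a document name, value is ex. the document contents.
    def map_fn(key, value):
        intermediate = []
        for word in value.split():
            intermediate.append((word, 1))   # emit an intermediate key/value pair
        return intermediate

    # reduce(key, values): key is a word, values are all the counts emitted for it.
    def reduce_fn(key, values):
        return sum(values)                   # a single output value per key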


Map Reduce

[Diagram: Big Data → MAP → partitioning function → REDUCE → Result]


Execution

[Diagram: the input is split across map tasks (M), each emitting intermediate key/value pairs (k1:v, k2:v, ...); the pairs are grouped by key (k1:v,v,v,v; k2:v; k3:v,v; k4:v,v,v; k5:v) and handed to reduce tasks (R)]


Parallel Execution

[Diagram: Map Tasks 1–3 run in parallel, each emitting intermediate key/value pairs; a partition function routes the pairs, which are sorted and grouped by key before the parallel reduce tasks run]


The Map Step

[Diagram: each input key-value pair (k, v) is passed to map, which emits intermediate key-value pairs]


Reduce Step

[Diagram: intermediate key-value pairs are grouped by key into key-value groups; reduce processes each group and emits output key-value pairs]


Word Count

[Diagram: MAP emits per-record counts such as {Boy,34}, {Boy,12}, {Boy,23}, {Boy,16}, {Girl,18}, {Girl,8}, {Girl,5}, {Girl,12}; after grouping by key, REDUCE sums them into {Boy,85} and {Girl,43}]


Examples

  • Distributed Grep (sketched after this list)

  • Count of URL Access Frequency

  • Reverse Web-Link Graph

  • Term-Vector per Host

  • Inverted Index

  • Distributed Sort
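For instance, the distributed grep above can be written with a map that emits matching lines and a reduce that is just the identity; a minimal Python sketch (the pattern is an arbitrary example):

    import re

    PATTERN = re.compile(r"error")       # example pattern to grep for

    # map: emit the line itself if it matches the pattern
    def grep_map(filename, line):
        if PATTERN.search(line):
            yield (line, "")

    # reduce: identity function that just copies the intermediate data to the output
    def grep_reduce(line, values):
        return line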


Practical Examples

  • Large PDF Generation

  • Artificial Intelligence

  • Statistical Data

  • Geographical Data


Large-Scale PDF Generation

  • The New York Times needs to generate PDF files for 11,000,000 articles (every article from 1851-1980) in the form of images scanned from the original paper

  • Each article is composed of numerous TIFF images which are scaled and glued together

  • Code for generating a PDF is relatively straightforward


Artificial Intelligence

  • Compute statistics

    • Central Limit Theorem

  • N voting nodes cast votes (map)

  • Tally votes and take action (reduce)


Statistical Analysis

  • Statistical analysis of a current stock against historical data

  • Each node (map) computes similarity and ROI

  • Tally the votes (reduce) to generate the expected ROI and its standard deviation (a hypothetical sketch follows below)

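A hypothetical sketch of such a job (the similarity measure, the 10-match cutoff, and the data layout below are illustrative assumptions, not from the slides):

    import statistics

    # map: compare the current price window against one historical window and
    # emit (similarity score, the return observed after that window).
    def similarity_map(current_window, historical_window, observed_return):
        diffs = [abs(a - b) for a, b in zip(current_window, historical_window)]
        score = -sum(diffs) / len(diffs)            # crude similarity stand-in
        yield ("votes", (score, observed_return))

    # reduce: tally the votes to produce an expected ROI and its standard deviation.
    def roi_reduce(key, scored_returns):
        best = sorted(scored_returns, reverse=True)[:10]    # keep the 10 closest matches
        returns = [r for score, r in best]
        return statistics.mean(returns), statistics.stdev(returns)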


Geographical Data

  • Large data sets including road, intersection, and feature data

  • Problems that Google Maps has used MapReduce to solve

    • Locating roads connected to a given intersection

    • Rendering of map tiles

    • Finding nearest feature to a given address or location


Geographical Data

  • Input: Graph describing node network with all gas stations marked

  • Map: Search five mile radius of each gas station and mark distance to each node

  • Sort: Sort by key

  • Reduce: For each node, emit path and gas station with the shortest distance

  • Output: Graph marked with the nearest gas station for each node (a sketch of the map and reduce steps follows below)
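A sketch of those map and reduce steps, assuming the road network is given as adjacency lists with edge lengths in miles (the helper names are illustrative, not from the slides):

    import heapq

    # map: from each gas station, search outward (Dijkstra, capped at a 5-mile radius)
    # and emit (node, (station, distance)) for every node reached.
    def station_map(station, graph):
        dist = {station: 0.0}
        heap = [(0.0, station)]
        while heap:
            d, node = heapq.heappop(heap)
            if d > 5.0 or d > dist.get(node, float("inf")):
                continue
            yield (node, (station, d))
            for neighbor, length in graph[node]:
                nd = d + length
                if nd < dist.get(neighbor, float("inf")):
                    dist[neighbor] = nd
                    heapq.heappush(heap, (nd, neighbor))

    # reduce: for each node, keep the gas station with the shortest distance
    # (the path could be carried along in the same way).
    def nearest_reduce(node, candidates):
        return min(candidates, key=lambda sd: sd[1])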


Implementation


Map/Reduce Walkthrough

  • Map: (functional programming) applies a function to each element of a list or array

  • Mapper: the node that applies the function to one element of the set

  • Reduce: (functional programming) iterates a function across an array, accumulating a result

  • Reducer: the node that reduces across all the like-keyed elements (see the Python analogy below)
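In plain functional-programming terms (Python's built-in map and functools.reduce), the two operations being referred to are:

    from functools import reduce

    # map: apply a function to each element of a list
    squares = list(map(lambda x: x * x, [1, 2, 3, 4]))        # [1, 4, 9, 16]

    # reduce: iterate a function across the list, accumulating a single result
    total = reduce(lambda acc, x: acc + x, squares, 0)        # 30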


Execution Overview

  • The MapReduce library splits the input files

  • It then starts up copies of the program on the cluster

    • One copy of the program becomes the master

    • The master assigns either map or reduce responsibilities to the remaining workers

  • Each map worker reads its splits

    • Parses key/value pairs out of the input data

    • Passes each pair to the user-defined Map function

  • Buffered pairs are written to local disk, partitioned into regions by the partitioning function

    • The locations of these buffered pairs on the local disk are passed back to the master

    • The master is responsible for forwarding these locations to the reduce workers

  • The locations of the buffered pairs are given to the reduce workers by the master

    • Each reduce worker sorts the intermediate keys

  • The reduce worker iterates over the sorted intermediate data for each unique intermediate key

    • It passes the key and the corresponding set of intermediate values to the user's Reduce function

    • The output of the Reduce function is appended to a final output file for this reduce partition

  • When all map tasks and reduce tasks have been completed, the master wakes up the user program (a single-machine simulation of these steps is sketched below)
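The steps above can be made concrete with a small single-machine simulation (a sketch of the control flow only, not the distributed implementation; the example job is word count):

    from collections import defaultdict

    def run_mapreduce(splits, map_fn, reduce_fn, R=3):
        # 1. "Map workers" parse key/value pairs out of their split and call the user Map function.
        regions = [defaultdict(list) for _ in range(R)]
        for split_name, records in splits.items():
            for key, value in records:
                for ik, iv in map_fn(key, value):
                    # 2. Buffered pairs are partitioned into R regions by a partitioning function.
                    regions[hash(ik) % R][ik].append(iv)
        # 3. Each "reduce worker" sorts its region by intermediate key, then calls Reduce per key.
        output = []
        for region in regions:
            for ik in sorted(region):
                output.append((ik, reduce_fn(ik, region[ik])))
        return output

    # Example: word count over two input splits.
    splits = {"split0": [("doc1", "the quick brown fox")],
              "split1": [("doc2", "the lazy dog")]}
    word_map = lambda name, text: [(w, 1) for w in text.split()]
    word_reduce = lambda word, counts: sum(counts)
    print(run_mapreduce(splits, word_map, word_reduce))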


Distributed Execution Overview

[Diagram: the user program forks a master and workers; the master assigns map and reduce tasks; map workers read input splits 0–2 and write intermediate data to local disk; reduce workers remote-read and sort that data, then write output files 0 and 1]


Fault Tolerance

  • Worker Failure

  • Master Failure

  • Dealing with Stragglers

  • Locality

  • Task Granularity

  • Skipping Bad Records


Worker Failure

[Diagram: the master pings each worker; a failed worker's in-progress reduce task is reset to idle and reassigned to another worker, and a map task already completed on a failed worker is likewise reset to idle and re-executed elsewhere]


Master Failure

  • Checkpoints

[Diagram: the master writes periodic checkpoints (123, 124, 125); when the master fails, a new master restarts from the latest checkpoint (125)]


Dealing with Stragglers

  • Straggler: a machine in a cluster that is running significantly slower than the rest

[Diagram: a backup copy of the straggler's map task is scheduled on a good machine; whichever copy crosses the finish line first completes the task]


Locality

  • Input Data is stored locally

  • GFS divides files into 64 MB blocks

  • Stores 3 copies of each block on different machines

  • The master finds a replica of the input data and schedules the map task accordingly

  • Map tasks are scheduled so that a replica of the GFS input block is on the same machine or the same rack


Task Granularity

  • Fine-grained tasks (many more tasks than machines) minimize the time for fault recovery

  • Can pipeline shuffling with map execution

  • Better dynamic load balancing

  • Often use 200,000 map/5000 reduce tasks w/ 2000 machines


Refinements


Partitioning Function

  • The users of MapReduce specify the number of reduce tasks/output files that they desire

  • Data gets partitioned across these tasks using a partitioning function on the intermediate key

  • A special partitioning function can be supplied

    • e.g. hash(Hostname(urlkey)) mod R (see the sketch below)

  • Ordering guarantee

    • Intermediate keys are processed in increasing key order

    • Generates sorted output per partition
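A sketch of both the default partitioning and the host-based special case mentioned above (illustrative Python, not the original implementation):

    from urllib.parse import urlparse

    R = 4   # number of reduce tasks / output files requested by the user

    # Default partitioning: spread intermediate keys evenly across the R reduce tasks.
    def default_partition(key, num_reduce_tasks):
        return hash(key) % num_reduce_tasks

    # Special case for URL keys: hash(Hostname(urlkey)) mod R, so that all URLs from
    # the same host end up in the same output file.
    def host_partition(urlkey, num_reduce_tasks):
        return hash(urlparse(urlkey).hostname) % num_reduce_tasks

    print(host_partition("http://example.com/index.html", R))   # same value for every example.com URL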


Combiner Function (Optional)

  • Used by the map task when there is significant repetition in the intermediate keys produced by each map task (a sketch follows below)

[Diagram: within a map worker, the map function emits pairs such as (Girls, 1), (Girls, 1), (Girls, 2), (Girls, 2) from a text document; the combiner function merges them locally into (Girls, 6) before they are sent to the reducers]
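A sketch of such a combiner for word count (illustrative Python; it has the same shape as a reduce function but runs locally on the map worker):

    from collections import defaultdict

    # Combiner: partially sum counts for repeated keys before anything is shuffled.
    def combine(pairs):
        partial = defaultdict(int)
        for word, count in pairs:            # e.g. ("Girls", 1), ("Girls", 1), ("Girls", 2), ("Girls", 2)
            partial[word] += count
        return list(partial.items())         # e.g. [("Girls", 6)] is what reaches the reducer

    print(combine([("Girls", 1), ("Girls", 1), ("Girls", 2), ("Girls", 2)]))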


Input and Output Types

  • Input:

  • Supports reading data in various formats

    • Support for a new input type requires only a simple implementation of a reader interface (a hypothetical sketch follows below)

    • Ex. records from a database

    • Ex. a data structure mapped in memory

  • Output:

  • User code can likewise add support for new output types
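The reader interface itself is not spelled out on the slide; a hypothetical Python rendering of the idea is below (each reader just yields key/value pairs for the map phase, so a database or an in-memory data structure only needs its own small reader):

    # Hypothetical reader interface: anything that can yield (key, value) pairs
    # can feed the map phase, regardless of where the data actually lives.
    class RecordReader:
        def read(self):
            raise NotImplementedError

    class TextFileReader(RecordReader):
        def __init__(self, path):
            self.path = path
        def read(self):
            with open(self.path) as f:
                for lineno, line in enumerate(f):
                    yield (lineno, line.rstrip("\n"))

    class InMemoryReader(RecordReader):
        def __init__(self, mapping):
            self.mapping = mapping           # e.g. a data structure mapped in memory
        def read(self):
            yield from self.mapping.items()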


Skipping Bad Records

  • Map/Reduce functions sometimes fail for particular inputs

  • Best solution is to debug & fix, but not always possible

  • On seg fault:

    • Send UDP packet to master from signal handler

    • Include sequence number of record being processed

  • If the master sees two failures for the same record:

    • The next worker is told to skip the record (a conceptual sketch follows below)
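A conceptual sketch of the worker side of this mechanism (a Python stand-in, not the original C++; the master address and wire format are made-up placeholders):

    import signal, socket, struct

    MASTER_ADDR = ("master.example.com", 9999)   # hypothetical master address
    current_record_seqno = 0                     # updated before each record is processed

    # On a segmentation fault, send the master a UDP packet containing the
    # sequence number of the record being processed when the fault occurred.
    def on_segfault(signum, frame):
        packet = struct.pack("!Q", current_record_seqno)
        socket.socket(socket.AF_INET, socket.SOCK_DGRAM).sendto(packet, MASTER_ADDR)

    signal.signal(signal.SIGSEGV, on_segfault)

    def process_split(records, map_fn, skip_set):
        global current_record_seqno
        for seqno, record in enumerate(records):
            if seqno in skip_set:                # records the master has marked as bad
                continue
            current_record_seqno = seqno
            map_fn(*record)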


Status Pages

[Screenshots: status pages for MapReduce jobs]


Performance

Tests run on a cluster of 1800 machines:

  • 4 GB of memory

  • Dual-processor 2 GHz Xeons with Hyperthreading

  • Dual 160 GB IDE disks

  • Gigabit Ethernet per machine

  • Bisection bandwidth approximately 100 Gbps

Two benchmarks: grep (MR_Grep) and sort (MR_Sort)


MR_Grep

[Figure: inputs scanned]

  • Locality optimization helps:

    • 1800 machines read 1 TB of data at a peak of ~31 GB/s

    • Without this, rack switches would limit reads to 10 GB/s


MR_Sort

[Figure: sort benchmark under three conditions: normal, no backup tasks, and 200 processes killed]


Related Topics


Other Notable Implementations of MapReduce

  • Hadoop

    • Open-source implementation of MapReduce (a streaming word-count sketch follows this list)

  • HDFS

    • Primary storage system used by Hadoop applications. HDFS creates multiple replicas of data blocks and distributes them on compute nodes throughout a cluster to enable reliable, extremely rapid computations.

  • Amazon Elastic Compute Cloud (EC2)

    • Virtualized computing environment designed for use with other Amazon services (especially S3)

  • Amazon Simple Storage Service (S3)

    • Scalable, inexpensive internet storage which can store and retrieve any amount of data at any time from anywhere on the web

    • Asynchronous, decentralized system which aims to reduce scaling bottlenecks and single points of failure
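For the Hadoop entry above, a word-count job can be run through Hadoop Streaming, where the mapper and reducer are ordinary programs that read stdin and write tab-separated key/value pairs to stdout; a minimal pair of scripts might look like this (illustrative; consult the Hadoop Streaming documentation for the exact job submission command):

    # mapper.py -- passed to Hadoop Streaming as the -mapper program
    import sys
    for line in sys.stdin:
        for word in line.split():
            print(word + "\t1")

    # reducer.py -- passed as the -reducer program; its input arrives sorted by key
    import sys
    current, count = None, 0
    for line in sys.stdin:
        word, n = line.rstrip("\n").rsplit("\t", 1)
        if word != current:
            if current is not None:
                print(current + "\t" + str(count))
            current, count = word, 0
        count += int(n)
    if current is not None:
        print(current + "\t" + str(count))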


Conclusion

  • MapReduce has proven to be a useful abstraction

  • Greatly simplifies large-scale computations at Google

  • Easily handles machine failures

  • Allows users to focus on the problem, without having to deal with complicated code behind the scenes


Questions?

