"Learn about MapReduce, a powerful tool for distributed computing, its history, framework, and an example of word counting. Explore how data processing and reduction work across multiple computers efficiently. Join us to dive into the world of Big Data processing!"
MapReduce, the Big Data Workhorse
Vyassa Baratham, Stony Brook University
April 20, 2013, 1:05-2:05pm
cSplash 2013
Distributed Computing
• Use several computers to process large amounts of data
• Often significant distribution overhead
• How do you deal with dependencies between data elements?
• e.g., counting word occurrences: what if the same word gets sent to two computers? (a minimal sketch follows)
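A minimal sketch of that dependency problem (plain Python; the chunks and names are illustrative): two machines each count words in their own piece of the text, and the partial counts for the same word must still be merged afterward. Automating exactly this merge is what MapReduce does.

    from collections import Counter

    # Two "computers" each count words in their own chunk of the input.
    chunk1 = "the quick brown fox".split()
    chunk2 = "the lazy dog and the fox".split()

    partial1 = Counter(chunk1)   # {'the': 1, 'quick': 1, ...}
    partial2 = Counter(chunk2)   # {'the': 2, 'lazy': 1, ...}

    # The same words ("the", "fox") appear in both partial results,
    # so a final merge step is needed to get correct global counts.
    total = partial1 + partial2
    print(total["the"])          # 3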
History of MapReduce
• Developed at Google 1999-2000, published by Google in 2004
• Used to make/maintain Google's WWW index
• Open-source implementation by the Apache Software Foundation: Hadoop
• "Spinoffs", e.g., HBase (used by Facebook)
• Amazon's Elastic MapReduce (EMR) service
  – Uses the Hadoop implementation of MapReduce
• Various wrapper libraries, e.g., mrjob
MapReduce, Conceptually
• Split data for distributed processing
• But some data may depend on other data to be processed correctly
• MapReduce maps which data need to be processed together
• Then reduces (processes) the data
The MapReduce Framework
• Input is split into different chunks
[Diagram: the input divided into chunks Input 1 through Input 9]
The MapReduce Framework
• Each chunk is sent to one of several computers running the same map() function
[Diagram: Inputs 1-3 go to Mapper 1, Inputs 4-6 to Mapper 2, Inputs 7-9 to Mapper 3]
The MapReduce Framework
• Each map() function outputs several (key, value) pairs
[Diagram: each mapper emits pairs such as (k1, v1), (k3, v2), (k2, v4), ...]
The MapReduce Framework
• The map() outputs are collected and sorted by key
[Diagram: a master node collects every pair and sorts them by key: all k1 pairs, then all k2 pairs, then all k3 pairs]
The MapReduce Framework
• Several computers running the same reduce() function receive the (key, value) pairs
[Diagram: the k1 pairs go to Reducer 1, the k2 pairs to Reducer 2, the k3 pairs to Reducer 3]
The MapReduce Framework
• All the records for a given key will be sent to the same reducer; this is why we sort
[Diagram: same flow as above, highlighting that each key's records land on a single reducer]
The MapReduce Framework
• Each reducer outputs a final value (maybe with a key)
[Diagram: Reducer 1 emits Output 1, Reducer 2 emits Output 2, Reducer 3 emits Output 3]
The MapReduce Framework
• The reducer outputs are aggregated and become the final output
[Diagram: Outputs 1-3 together form the job's final output]
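A minimal single-machine sketch of this whole pipeline (plain Python; run_mapreduce, the hash partitioner, and the chunking are illustrative stand-ins for what Hadoop does across machines, not Hadoop's API):

    from collections import defaultdict

    def run_mapreduce(chunks, map_fn, reduce_fn, num_reducers=3):
        # Map phase: in a real cluster each chunk goes to a different machine.
        pairs = []
        for chunk in chunks:
            pairs.extend(map_fn(chunk))
        # Shuffle/sort phase: group values by key, and assign each key to
        # exactly one reducer (a simple hash partitioner stands in here).
        partitions = [defaultdict(list) for _ in range(num_reducers)]
        for key, value in pairs:
            partitions[hash(key) % num_reducers][key].append(value)
        # Reduce phase: each reducer sees all values for its keys, and no others.
        output = []
        for partition in partitions:
            for key in sorted(partition):
                output.extend(reduce_fn(key, partition[key]))
        return output

Plugging in the word-count map() and reduce() defined on the next slides turns this skeleton into a complete (local, illustrative) job.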
Example – Word Count
• Problem: given a large body of text, count how many times each word occurs
• How can we parallelize?
• Mapper key = word
• Mapper value = # occurrences in this mapper's input
• Reducer key = word
• Reducer value = sum of # occurrences over all mappers
Example – Word Count

    def map(input):                       # name kept from the slide; shadows Python's built-in map()
        counts = {}                       # this mapper's local word counts
        for word in input.split():        # assumes input is a chunk of raw text
            counts[word] = counts.get(word, 0) + 1
        for word in counts:
            yield (word, counts[word])    # one (word, local count) pair per word
Example – Word Count

    def reduce(key, values):              # name kept from the slide
        total = 0                         # renamed from "sum" to avoid shadowing sum()
        for val in values:                # one local count from each mapper that saw this word
            total += val
        yield (key, total)                # the word's global count
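Using the run_mapreduce() simulation sketched after the framework slides, the two functions compose into a full local word count:

    chunks = ["the quick brown fox", "the lazy dog", "the fox again"]
    for word, count in run_mapreduce(chunks, map, reduce):
        print(word, count)    # "fox" 2, "the" 3, every other word 1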
Now Let's Do It
• I need 3 volunteer slave nodes
• I'll be the master node
Considerations
• Hadoop takes care of distribution, but only as efficiently as you allow
• Input must be split evenly
• Values should be spread evenly over keys
  – If not, the reduce() step will not be well distributed – imagine all values get mapped to the same key; then the reduce() step is not parallelized at all! (see the sketch below)
• Several keys should be used
  – If you have few keys, then few computers can be used as reducers
• By the same token, more/smaller input chunks are good
• You need to know the data you're processing!
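A quick illustration of key skew (hypothetical numbers; reuses the hash partitioner idea from the earlier sketch): reducer load is balanced only when values are spread across many keys.

    from collections import Counter

    # 1,000,000 values spread over 1,000 keys: every reducer gets work.
    balanced = [("word%d" % (i % 1000), 1) for i in range(1_000_000)]

    # 1,000,000 values all sharing one key: one reducer does everything.
    skewed = [("the", 1)] * 1_000_000

    def reducer_load(pairs, num_reducers=3):
        # How many records each reducer would receive.
        return Counter(hash(k) % num_reducers for k, _ in pairs)

    print(reducer_load(balanced))  # roughly equal counts per reducer
    print(reducer_load(skewed))    # all 1,000,000 records on one reducer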
Practical Hadoop Concerns
• I/O is often the bottleneck, so use compression!
  – But some compression formats are not splittable
  – Entire input files (large!) will be sent to single mappers, destroying hopes of distribution
• Consider using a combiner ("pre-reducer") – a sketch follows this slide
• EMR considerations:
  – Input from S3 is fast
  – Nodes are virtual machines
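For word count, the reduce step is a sum, which is associative and commutative, so the reducer logic can double as the combiner: each mapper pre-sums its own output before anything crosses the network. A minimal sketch of the idea (the wiring is illustrative; in Hadoop you register a combiner class, or pass -combiner to Hadoop Streaming):

    def combine(key, values):
        # Runs on the mapper's machine, on that mapper's output only.
        # For word count it is identical to reduce(): it just sums.
        yield (key, sum(values))

    # Effect: a mapper that saw "the" 500 times ships one pair
    # ("the", 500) instead of 500 pairs, cutting shuffle I/O.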
Hadoop Streaming
• Hadoop in its original form uses Java
• Hadoop Streaming allows programmers to avoid direct interaction with Java by instead using Unix STDIN/STDOUT (a word-count sketch follows below)
• Requires serialization of keys and values
• Potential problem: records are serialized as "<key>\t<value>" – but what if a serialized key or value itself contains a "\t"?
• Beware of stray "print" statements
  – Safer to print to STDERR
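A minimal streaming word count (Python; a sketch under the conventions above – tab-separated records, scripts reading STDIN and writing STDOUT; file names and the streaming jar path are illustrative and vary by installation).

mapper.py:

    import sys
    # Emit "word\t1" for every word; Hadoop sorts these lines by key for us.
    for line in sys.stdin:
        for word in line.split():
            print("%s\t1" % word)

reducer.py:

    import sys
    # Input arrives sorted by key, so equal words form contiguous runs.
    current, total = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t")
        if word != current:
            if current is not None:
                print("%s\t%d" % (current, total))
            current, total = word, 0
        total += int(count)
    if current is not None:
        print("%s\t%d" % (current, total))

Illustrative invocation (jar path varies by Hadoop version):

    hadoop jar hadoop-streaming.jar \
        -input /data/text -output /data/counts \
        -mapper mapper.py -reducer reducer.py \
        -file mapper.py -file reducer.py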
Hadoop Streaming
[Diagram: the Java Hadoop process sends serialized input to the user's script via STDIN and reads serialized output back from the script's STDOUT]
Thank You!
• Thanks for your attention
• Please send feedback, comments, and questions to vyassa.baratham@stonybrook.edu
• Interested in physics? Want to learn about Monte Carlo simulation?