"Learn about MapReduce, a powerful tool for distributed computing, its history, framework, and an example of word counting. Explore how data processing and reduction work across multiple computers efficiently. Join us to dive into the world of Big Data processing!"
MapReduce, the Big Data Workhorse
Vyassa Baratham, Stony Brook University
April 20, 2013, 1:05-2:05pm
cSplash 2013
Distributed Computing
• Use several computers to process large amounts of data
• Often significant distribution overhead
• How do you deal with dependencies between data elements?
• e.g., counting word occurrences: what if the same word gets sent to two computers? (a minimal sketch follows)
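A minimal sketch of that dependency problem (plain Python; the chunks and names are illustrative): two machines each count words in their own piece of the text, and the partial counts for the same word must still be merged afterward. Automating exactly this merge is what MapReduce does.

    from collections import Counter

    # Two "computers" each count words in their own chunk of the input.
    chunk1 = "the quick brown fox".split()
    chunk2 = "the lazy dog and the fox".split()

    partial1 = Counter(chunk1)   # {'the': 1, 'quick': 1, ...}
    partial2 = Counter(chunk2)   # {'the': 2, 'lazy': 1, ...}

    # The same words ("the", "fox") appear in both partial results,
    # so a final merge step is needed to get correct global counts.
    total = partial1 + partial2
    print(total["the"])          # 3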
History of MapReduce
• Developed at Google 1999-2000, published by Google in 2004
• Used to make/maintain Google's WWW index
• Open-source implementation by the Apache Software Foundation: Hadoop
• "Spinoffs", e.g., HBase (used by Facebook)
• Amazon's Elastic MapReduce (EMR) service
  – Uses the Hadoop implementation of MapReduce
• Various wrapper libraries, e.g., mrjob
MapReduce, Conceptually
• Split data for distributed processing
• But some data may depend on other data to be processed correctly
• MapReduce maps which data need to be processed together
• Then reduces (processes) the data
The MapReduce Framework
• Input is split into different chunks
[Diagram: the input divided into chunks Input 1 through Input 9]
The MapReduce Framework
• Each chunk is sent to one of several computers running the same map() function
[Diagram: Inputs 1-3 go to Mapper 1, Inputs 4-6 to Mapper 2, Inputs 7-9 to Mapper 3]
The MapReduce Framework
• Each map() function outputs several (key, value) pairs
[Diagram: each mapper emits pairs such as (k1, v1), (k3, v2), (k2, v4), ...]
The MapReduce Framework
• The map() outputs are collected and sorted by key
[Diagram: a master node collects every pair and sorts them by key: all k1 pairs, then all k2 pairs, then all k3 pairs]
The MapReduce Framework
• Several computers running the same reduce() function receive the (key, value) pairs
[Diagram: the k1 pairs go to Reducer 1, the k2 pairs to Reducer 2, the k3 pairs to Reducer 3]
The MapReduce Framework
• All the records for a given key will be sent to the same reducer; this is why we sort
[Diagram: same flow as above, highlighting that each key's records land on a single reducer]
The MapReduce Framework
• Each reducer outputs a final value (maybe with a key)
[Diagram: Reducer 1 emits Output 1, Reducer 2 emits Output 2, Reducer 3 emits Output 3]
The MapReduce Framework
• The reducer outputs are aggregated and become the final output
[Diagram: Outputs 1-3 together form the job's final output]
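A minimal single-machine sketch of this whole pipeline (plain Python; run_mapreduce, the hash partitioner, and the chunking are illustrative stand-ins for what Hadoop does across machines, not Hadoop's API):

    from collections import defaultdict

    def run_mapreduce(chunks, map_fn, reduce_fn, num_reducers=3):
        # Map phase: in a real cluster each chunk goes to a different machine.
        pairs = []
        for chunk in chunks:
            pairs.extend(map_fn(chunk))
        # Shuffle/sort phase: group values by key, and assign each key to
        # exactly one reducer (a simple hash partitioner stands in here).
        partitions = [defaultdict(list) for _ in range(num_reducers)]
        for key, value in pairs:
            partitions[hash(key) % num_reducers][key].append(value)
        # Reduce phase: each reducer sees all values for its keys, and no others.
        output = []
        for partition in partitions:
            for key in sorted(partition):
                output.extend(reduce_fn(key, partition[key]))
        return output

Plugging in the word-count map() and reduce() defined on the next slides turns this skeleton into a complete (local, illustrative) job.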
Example – Word Count
• Problem: given a large body of text, count how many times each word occurs
• How can we parallelize?
• Mapper key = word
• Mapper value = # occurrences in this mapper's input
• Reducer key = word
• Reducer value = sum of # occurrences over all mappers
Example – Word Count

    def map(input):                       # name kept from the slide; shadows Python's built-in map()
        counts = {}                       # this mapper's local word counts
        for word in input.split():        # assumes input is a chunk of raw text
            counts[word] = counts.get(word, 0) + 1
        for word in counts:
            yield (word, counts[word])    # one (word, local count) pair per word
Example – Word Count

    def reduce(key, values):              # name kept from the slide
        total = 0                         # renamed from "sum" to avoid shadowing sum()
        for val in values:                # one local count from each mapper that saw this word
            total += val
        yield (key, total)                # the word's global count
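Using the run_mapreduce() simulation sketched after the framework slides, the two functions compose into a full local word count:

    chunks = ["the quick brown fox", "the lazy dog", "the fox again"]
    for word, count in run_mapreduce(chunks, map, reduce):
        print(word, count)    # "fox" 2, "the" 3, every other word 1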
Now Let's Do It
• I need 3 volunteer slave nodes
• I'll be the master node
Considerations
• Hadoop takes care of distribution, but only as efficiently as you allow
• Input must be split evenly
• Values should be spread evenly over keys
  – If not, the reduce() step will not be well distributed – imagine all values get mapped to the same key; then the reduce() step is not parallelized at all! (see the sketch below)
• Several keys should be used
  – If you have few keys, then few computers can be used as reducers
• By the same token, more/smaller input chunks are good
• You need to know the data you're processing!
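A quick illustration of key skew (hypothetical numbers; reuses the hash partitioner idea from the earlier sketch): reducer load is balanced only when values are spread across many keys.

    from collections import Counter

    # 1,000,000 values spread over 1,000 keys: every reducer gets work.
    balanced = [("word%d" % (i % 1000), 1) for i in range(1_000_000)]

    # 1,000,000 values all sharing one key: one reducer does everything.
    skewed = [("the", 1)] * 1_000_000

    def reducer_load(pairs, num_reducers=3):
        # How many records each reducer would receive.
        return Counter(hash(k) % num_reducers for k, _ in pairs)

    print(reducer_load(balanced))  # roughly equal counts per reducer
    print(reducer_load(skewed))    # all 1,000,000 records on one reducer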
Practical Hadoop Concerns
• I/O is often the bottleneck, so use compression!
  – But some compression formats are not splittable
  – Entire input files (large!) will be sent to single mappers, destroying hopes of distribution
• Consider using a combiner ("pre-reducer") – a sketch follows this slide
• EMR considerations:
  – Input from S3 is fast
  – Nodes are virtual machines
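For word count, the reduce step is a sum, which is associative and commutative, so the reducer logic can double as the combiner: each mapper pre-sums its own output before anything crosses the network. A minimal sketch of the idea (the wiring is illustrative; in Hadoop you register a combiner class, or pass -combiner to Hadoop Streaming):

    def combine(key, values):
        # Runs on the mapper's machine, on that mapper's output only.
        # For word count it is identical to reduce(): it just sums.
        yield (key, sum(values))

    # Effect: a mapper that saw "the" 500 times ships one pair
    # ("the", 500) instead of 500 pairs, cutting shuffle I/O.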
Hadoop Streaming
• Hadoop in its original form uses Java
• Hadoop Streaming allows programmers to avoid direct interaction with Java by instead using Unix STDIN/STDOUT (a word-count sketch follows below)
• Requires serialization of keys and values
• Potential problem: records are serialized as "<key>\t<value>" – but what if a serialized key or value itself contains a "\t"?
• Beware of stray "print" statements
  – Safer to print to STDERR
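A minimal streaming word count (Python; a sketch under the conventions above – tab-separated records, scripts reading STDIN and writing STDOUT; file names and the streaming jar path are illustrative and vary by installation).

mapper.py:

    import sys
    # Emit "word\t1" for every word; Hadoop sorts these lines by key for us.
    for line in sys.stdin:
        for word in line.split():
            print("%s\t1" % word)

reducer.py:

    import sys
    # Input arrives sorted by key, so equal words form contiguous runs.
    current, total = None, 0
    for line in sys.stdin:
        word, count = line.rstrip("\n").split("\t")
        if word != current:
            if current is not None:
                print("%s\t%d" % (current, total))
            current, total = word, 0
        total += int(count)
    if current is not None:
        print("%s\t%d" % (current, total))

Illustrative invocation (jar path varies by Hadoop version):

    hadoop jar hadoop-streaming.jar \
        -input /data/text -output /data/counts \
        -mapper mapper.py -reducer reducer.py \
        -file mapper.py -file reducer.py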
Hadoop Streaming
[Diagram: the Java Hadoop process sends serialized input to the user's script via STDIN and reads serialized output back from the script's STDOUT]
Thank You!
• Thanks for your attention
• Please send feedback, comments, and questions to vyassa.baratham@stonybrook.edu
• Interested in physics? Want to learn about Monte Carlo simulation?