Problem-solving on large-scale clusters: theory and applications

Lecture 3: Bringing it all together

Today’s Outline
• Course directions, projects, and feedback
• Quiz 2
• Context / Where we are
• Why do we care about fold() and map()?
• Why do we care about parallelization and data dependencies?
• MapReduce architecture from 10,000 feet

Context and Review
• Data dependencies determine whether a problem can be formulated in MapReduce
• The properties of fold() and map() determine how to formulate a problem in MapReduce
How do you parallelize fold()? map()?
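One way to approach the question is a small sketch that simulates splitting the work into chunks. It is sequential Haskell, not real parallel code, and the names chunksOf, parMapSketch and parFoldSketch are illustrative, not from the lecture. The point it tries to show: map can be split anywhere, while fold can be split only when the operator is associative and the start value is its identity.

```haskell
-- Split a list into pieces of at most n elements (n must be positive).
chunksOf :: Int -> [a] -> [[a]]
chunksOf _ [] = []
chunksOf n xs = let (h, t) = splitAt n xs in h : chunksOf n t

-- map has no dependencies between elements, so each chunk could be handed
-- to a different worker; concatenating the per-chunk results restores order.
parMapSketch :: Int -> (a -> b) -> [a] -> [b]
parMapSketch n f = concatMap (map f) . chunksOf n

-- fold threads an accumulator through the list. Folding each chunk on its
-- own and then folding the partial results gives the same answer only if
-- op is associative and z is its identity element.
parFoldSketch :: Int -> (a -> a -> a) -> a -> [a] -> a
parFoldSketch n op z = foldl op z . map (foldl op z) . chunksOf n
```

With these definitions, parFoldSketch 3 (+) 0 [1..10] agrees with foldl (+) 0 [1..10], while swapping in (-) does not, because subtraction is not associative.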

MapReduce Introduction
• MapReduce is both a programming model and a clustered computing system
  • A specific way of formulating a problem, which yields good parallelizability
  • A system which takes a MapReduce-formulated problem and executes it on a large cluster
    • Hides implementation details, such as hardware failures, grouping and sorting, scheduling …
• Previous lectures have focused on MapReduce-the-problem-formulation
• Today will mostly focus on MapReduce-the-system

MR Problem Formulation: Formal Definition
MapReduce:
  mapreduce fm fr l = map (reducePerKey fr) (group (map fm l))
  reducePerKey fr (k,v_list) = (k, (foldl (fr k) [] v_list))
• Assume map here is actually concatMap.
• Argument l is a list of documents
• The result of the first map is a list of key-value pairs
• The function fr takes three arguments: key, context, and current. With currying, this allows for locking the value of “key” for each list during the fold.
MapReduce maps a fold over the sorted result of a map!
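The definition above can be turned into a small runnable sketch. A few things here are assumptions rather than part of the slide: group is implemented with Data.Map (which also yields keys in sorted order), the initial accumulator is passed in as an extra argument z instead of being fixed to [], and the word-count functions are a hypothetical example.

```haskell
import qualified Data.Map as Map

-- Collect all values that share a key, keeping their original order.
-- Map.toList returns the keys in ascending order.
group :: Ord k => [(k, v)] -> [(k, [v])]
group kvs = Map.toList (Map.fromListWith (flip (++)) [(k, [v]) | (k, v) <- kvs])

-- Fold one key's value list; (fr k) locks the key in place via currying.
reducePerKey :: (k -> acc -> v -> acc) -> acc -> (k, [v]) -> (k, acc)
reducePerKey fr z (k, vs) = (k, foldl (fr k) z vs)

-- concatMap is used for the first map, as the slide assumes.
mapreduce :: Ord k
          => (doc -> [(k, v)]) -> (k -> acc -> v -> acc) -> acc
          -> [doc] -> [(k, acc)]
mapreduce fm fr z l = map (reducePerKey fr z) (group (concatMap fm l))

-- Hypothetical example: word count.
wordCountMap :: String -> [(String, Int)]
wordCountMap doc = [(w, 1) | w <- words doc]

wordCountReduce :: String -> Int -> Int -> Int
wordCountReduce _key total count = total + count

example :: [(String, Int)]
example = mapreduce wordCountMap wordCountReduce 0 ["a b a", "b c"]
```

Evaluating example in GHCi gives [("a",2),("b",2),("c",1)].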

MR System Overview (1 of 2)
Map:
• Preprocesses a set of files to generate intermediate key-value pairs
• As parallelized as you want
Group:
• Partitions intermediate key-value pairs by unique key, generating a list of all associated values
Reduce:
• For each key, iterates over value list
• Performs computation that requires context between iterations
• Parallelizable amongst different keys, but not within one key

MR System Overview (2 of 2) Shamelessly stolen from Jeff Dean’s OSDI ‘04 presentation http://labs.google.com/papers/mapreduce-osdi04-slides/index.html

Example: MapReduce DocInfo (1 of 2)
MapReduce:
  mapreduce fm fr l = map (reducePerKey fr) (group (map fm l))
  reducePerKey fr (k,v_list) = (k, (foldl (fr k) [] v_list))
Pseudocode for fm
  fm contents = concat [
      [("spaces", (count_spaces contents))],
      (map (emit "raw") (split contents)),
      (map (emit "scrub") (scrub (split contents)))]
  emit label value = (label, (value, 1))

Example: MapReduce DocInfo (2 of 2)
MapReduce:
  mapreduce fm fr l = map (reducePerKey fr) (group (map fm l))
  reducePerKey fr (k,v_list) = (k, (foldl (fr k) [] v_list))
Pseudocode for fr (arguments in foldl order: key, context, current)
  fr "spaces" [] count = [count]
  fr "spaces" (total:xs) count = (total+count:xs)
  fr "raw" result (word,count) = update_result (word,count) result
  fr "scrub" result (word,count) = update_result (word,count) result
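The DocInfo pseudocode can be made runnable under a few assumptions, sketched here. To keep the types uniform, the space count is encoded as a count for the empty word, so a single fr clause covers all three keys. The body of scrub (lowercase, letters only), the use of words for split, and the bodies of countSpaces and updateResult are guesses at what the slides intend.

```haskell
import Data.Char (isAlpha, toLower)
import qualified Data.Map as Map

type Value  = (String, Int)    -- (word, count); the word is "" for "spaces"
type Result = [(String, Int)]  -- accumulated (word, total) pairs

group :: Ord k => [(k, v)] -> [(k, [v])]
group kvs = Map.toList (Map.fromListWith (flip (++)) [(k, [v]) | (k, v) <- kvs])

reducePerKey :: (k -> Result -> Value -> Result) -> (k, [Value]) -> (k, Result)
reducePerKey fr (k, vs) = (k, foldl (fr k) [] vs)

mapreduce :: Ord k
          => (doc -> [(k, Value)]) -> (k -> Result -> Value -> Result)
          -> [doc] -> [(k, Result)]
mapreduce fm fr l = map (reducePerKey fr) (group (concatMap fm l))

countSpaces :: String -> Int
countSpaces = length . filter (== ' ')

-- Assumed meaning of "scrub": lowercase, keep letters, drop empty results.
scrub :: [String] -> [String]
scrub = filter (not . null) . map (map toLower . filter isAlpha)

emit :: String -> String -> (String, Value)
emit label value = (label, (value, 1))

fm :: String -> [(String, Value)]
fm contents = concat
  [ [("spaces", ("", countSpaces contents))]
  , map (emit "raw") (words contents)
  , map (emit "scrub") (scrub (words contents))
  ]

-- Add a count to the word's running total, appending the word if it is new.
updateResult :: Value -> Result -> Result
updateResult (w, c) [] = [(w, c)]
updateResult (w, c) ((w', t) : rest)
  | w == w'   = (w', t + c) : rest
  | otherwise = (w', t) : updateResult (w, c) rest

fr :: String -> Result -> Value -> Result
fr _key result value = updateResult value result

docInfo :: [(String, Result)]
docInfo = mapreduce fm fr ["Hello, hello world", "World peace"]
```

In this sketch, lookup "scrub" docInfo gives Just [("hello",2),("world",2),("peace",1)] and lookup "spaces" docInfo gives Just [("",3)].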

Group Exercise
Formulate the following as MapReduces:
• Find the set of unique words in a document
  • Input: a bunch of words
  • Output: all the unique words (no repeats)
• Calculate per-employee taxes
  • Input: a list of (employee, salary, month) tuples
  • Output: a list of (employee, taxes due) pairs
• Randomly reorder sentences
  • Input: a bunch of documents
  • Output: all sentences in random order (may include duplicates)
• Compute the minesweeper grid/map
  • Input: coordinates for the location of mines
  • Output: coordinate/value pairs for all non-zero cells
Can you think of generalized techniques for decomposing problems?

MapReduce Parallelization: Execution Shamelessly stolen from Jeff Dean’s OSDI ‘04 presentation http://labs.google.com/papers/mapreduce-osdi04-slides/index.html

MapReduce Parallelization: Pipelining
• Fine-grained tasks: many more map tasks than machines
  • Better dynamic load balancing
  • Minimizes time for fault recovery
  • Can pipeline the shuffling/grouping while maps are still running
• Example: 2000 machines -> 200,000 map + 5000 reduce tasks (about 100 map tasks per machine)
Shamelessly stolen from Jeff Dean’s OSDI ‘04 presentation http://labs.google.com/papers/mapreduce-osdi04-slides/index.html

Example: MR DocInfo, revisited
Do MapReduce DocInfo in 2 passes (instead of 1), performing all the work in the “group” step
Map1:
• Tokenize document
• For each token output:
  • (“raw:<word>”, 1)
  • (“scrubbed:<scrubbed_word>”, 1)
Reduce1:
• For each key, ignore value list and output (key, 1)
Map2:
• Tokenize document
• For each token “type:value”, output (type, 1)
Reduce2:
• For each key, output (key, (sum values))
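The two passes can be sketched by chaining two calls to a generic mapreduce. Treat this as an illustration under assumptions: mapreduce takes its initial accumulator as an extra argument, pass 2 reads the (key, 1) records produced by pass 1 directly instead of re-tokenizing a file, and scrubWord (lowercase, letters only) is a guess at what "scrubbed" means. The result counts the distinct raw words and the distinct scrubbed words.

```haskell
import Data.Char (isAlpha, toLower)
import qualified Data.Map as Map

group :: Ord k => [(k, v)] -> [(k, [v])]
group kvs = Map.toList (Map.fromListWith (flip (++)) [(k, [v]) | (k, v) <- kvs])

mapreduce :: Ord k
          => (a -> [(k, v)]) -> (k -> acc -> v -> acc) -> acc
          -> [a] -> [(k, acc)]
mapreduce fm fr z l = [(k, foldl (fr k) z vs) | (k, vs) <- group (concatMap fm l)]

scrubWord :: String -> String
scrubWord = map toLower . filter isAlpha

-- Map1: one "raw:" record per token, plus a "scrubbed:" record when the
-- scrubbed form is non-empty.
map1 :: String -> [(String, Int)]
map1 doc = concat
  [ ("raw:" ++ w, 1) : [("scrubbed:" ++ s, 1) | let s = scrubWord w, not (null s)]
  | w <- words doc ]

-- Reduce1: ignore the values; grouping has already removed duplicate keys.
reduce1 :: String -> Int -> Int -> Int
reduce1 _key _acc _value = 1

-- Map2: keep only the "type" prefix of each "type:value" key.
map2 :: (String, Int) -> [(String, Int)]
map2 (key, _) = [(takeWhile (/= ':') key, 1)]

-- Reduce2: sum the ones, giving the number of distinct values per type.
reduce2 :: String -> Int -> Int -> Int
reduce2 _key total value = total + value

docInfo2 :: [String] -> [(String, Int)]
docInfo2 docs = mapreduce map2 reduce2 0 (mapreduce map1 reduce1 1 docs)
```

For the input ["Hello, hello world", "World peace"], docInfo2 returns [("raw",5),("scrubbed",3)]: five distinct raw tokens and three distinct scrubbed words.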

Example: MR DocInfo, revisited
• Of the 2 DocInfo MapReduce implementations, which is better?
• Define “better”. What resources are you considering? Dev time? CPU? Network? Disk? Complexity? Reusability?
[Figure: Mappers and Reducers connected to GFS]
Key:
• Connections are network links
• GFS is a cluster of storage machines

Hadoop-as-MapReduce
  mapreduce fm fr l = map (reducePerKey fr) (group (map fm l))
  reducePerKey fr (k,v_list) = (k, (foldl (fr k) [] v_list))
Hadoop:
• The fm and fr are function objects (classes)
• Class for fm implements the Mapper interface
  map(WritableComparable key, Writable value, OutputCollector output, Reporter reporter)
• Class for fr implements the Reducer interface
  reduce(WritableComparable key, Iterator values, OutputCollector output, Reporter reporter)
Hadoop takes the generated class files and manages running them

Bonus Materials: MR Runtime
• The following slides illustrate an example run of MapReduce on a Google cluster
• A sample job from the indexing pipeline, processing ~900 GB of crawled pages

MR Runtime (1 of 9) Shamelessly stolen from Jeff Dean’s OSDI ‘04 presentation http://labs.google.com/papers/mapreduce-osdi04-slides/index.html
