This lecture introduces data parallelism, MapReduce concepts, and distributed computing. It covers OCaml programming, parallelization techniques (SIMD, MIMD, and distributed computing), functional programming, and Google's large-scale implementation of the MapReduce paradigm (Fall 2008).
Lecture #4: Introduction to Data Parallelism and MapReduce
CS492 Special Topics in Computer Science: Distributed Algorithms and Systems
Today’s Topics • Short quiz on programming in OCaml
How to parallelize (I)
• Run-length encoding
• Fibonacci function
• Calculation of π
• Word count
• Inverted index
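To make the word-count example concrete, here is a minimal sketch in OCaml of the map/reduce decomposition the lecture has in mind: a map phase that turns each document into (word, 1) pairs, a grouping step, and a reduce phase that sums the counts per word. The function names (`map_doc`, `reduce_word`, `word_count`) are illustrative, not from any particular library.

```ocaml
(* map phase: each document yields (word, 1) pairs *)
let map_doc (doc : string) : (string * int) list =
  String.split_on_char ' ' doc
  |> List.filter (fun w -> w <> "")
  |> List.map (fun w -> (w, 1))

(* reduce phase: sum the counts collected for one word *)
let reduce_word (word : string) (counts : int list) : string * int =
  (word, List.fold_left ( + ) 0 counts)

(* group intermediate pairs by key, then reduce each group;
   in a real MapReduce run the grouping (shuffle) is done by the runtime *)
let word_count (docs : string list) : (string * int) list =
  let pairs = List.concat_map map_doc docs in
  let tbl = Hashtbl.create 16 in
  List.iter
    (fun (w, c) ->
      let prev = try Hashtbl.find tbl w with Not_found -> [] in
      Hashtbl.replace tbl w (c :: prev))
    pairs;
  Hashtbl.fold (fun w cs acc -> reduce_word w cs :: acc) tbl []
```

Because `map_doc` touches each document independently and `reduce_word` touches each key independently, both phases parallelize trivially across machines.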
How to parallelize (II)
• SIMD
• MIMD via shared memory
• MIMD via message passing
• Distributed computing
MapReduce
• Functional programming: the “map / reduce” way of thinking about problem solving
• Google’s runtime library supporting the MR paradigm at very large scale
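The “map / reduce” way of thinking comes straight from functional programming: transform each element independently (map), then combine the results with an associative operation (reduce, i.e. a fold). A two-line OCaml sketch of sum-of-squares shows the pattern:

```ocaml
(* map: square each element independently *)
let squares = List.map (fun x -> x * x) [ 1; 2; 3; 4 ]

(* reduce: fold the mapped results into one value *)
let sum_of_squares = List.fold_left ( + ) 0 squares
```

Because each `map` application is independent and `( + )` is associative, the same computation can be split across many machines, which is exactly the structure Google's runtime exploits.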
MapReduce Execution Overview (CS492, Fall 2008)
How popular is MapReduce?
• In September 2007, Google used 11,081 “machine-years” (roughly, CPU-years) on MapReduce jobs alone
• Assuming all machines were 100% busy and ran only MR jobs, that one month’s usage implies 11,081 × 365 / 30 ≈ 134,818 machines running continuously
• If a rack holds 176 CPUs (88 1U dual-processor machines): 134,818 / 176 ≈ 766 racks
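The back-of-the-envelope arithmetic above can be checked directly; this is only the slide's own estimate restated as code, under the slide's assumptions (100% utilization, MR-only workload):

```ocaml
(* 11,081 machine-years consumed in a single month (Sep 2007) *)
let machine_years = 11_081.

(* machine-years per month -> machines busy all month long *)
let machines = machine_years *. 365. /. 30.   (* ~134,818 *)

(* 88 1U dual-processor boxes per rack = 176 CPUs *)
let cpus_per_rack = 176.
let racks = machines /. cpus_per_rack         (* ~766 *)
```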
Reading material
• J. Dean and S. Ghemawat, “MapReduce: Simplified Data Processing on Large Clusters,” Communications of the ACM, Vol. 51, No. 1, Jan. 2008
• J. Dean and S. Ghemawat, “MapReduce: Simplified Data Processing on Large Clusters,” USENIX OSDI 2004