
Main Memory Map Reduce (M3R)

Main Memory Map Reduce (M3R). VLDB, August 2012 (to appear). Avi Shinnar, David Cunningham, Ben Herta, Vijay Saraswat. In collaboration with: Yan Li*, Dave Grove, Mikio Takeuchi**, Salikh Zakirov**, Juemin Zhang (* IBM China Research Lab, ** IBM Tokyo Research Lab).


Presentation Transcript


  1. Main Memory Map Reduce (M3R). VLDB, August 2012 (to appear). Avi Shinnar, David Cunningham, Ben Herta, Vijay Saraswat. In collaboration with: Yan Li*, Dave Grove, Mikio Takeuchi**, Salikh Zakirov**, Juemin Zhang (* IBM China Research Lab, ** IBM Tokyo Research Lab; all others IBM T.J. Watson Research Lab, New York).

  2. M3R/Hadoop • Hadoop • Popular Java API for Map/Reduce programming • Out of core, resilient, scalable (1000 nodes) • Based on HDFS (a resilient distributed filesystem) • M3R/Hadoop • Reimplementation of Hadoop API using managed X10 • Existing Hadoop applications just work • Reuse HDFS (and some other parts of Hadoop) • In-memory: problem size must fit in cluster RAM • Not resilient: cluster scales until MTBF barrier • But considerably faster (closer to HPC speeds)

  3. System ML performance results. Iterative sparse matrix algorithms, implemented in DML and executed with System ML; run on 20 x86 nodes (each with 8 cores and 16 GB RAM).

  4. Sparse Matrix * Dense Vector performance: results for our sparse MatVecMult code running on Hadoop and M3R/Hadoop. The algorithm is specially tailored for M3R; approx. 50x speedup.

  5. Architecture. [Diagram: a Java Hadoop app submits multiple jobs either to the stock Hadoop Map Reduce engine (JVM only) or, via the M3R/Hadoop adaptor, to the M3R engine (JVM/native, written in X10); both paths read and write HDFS data. The M3R engine also accepts Java M3R jobs and X10 M3R jobs directly.]

  6. Speeding up Iterative Hadoop Map Reduce Jobs • Reducing disk I/O • Reducing network communication • Reducing serialization/deserialization – e.g., about 25 seconds for a 1x1M sparse matrix
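
  For a sense of where such serialization time goes (an illustrative micro-benchmark sketch, not from the paper), one can time round-tripping a large batch of Hadoop Writable pairs through an in-memory buffer, which is the kind of work the shuffle and spill paths repeat on every iteration:

    import org.apache.hadoop.io.DataInputBuffer;
    import org.apache.hadoop.io.DataOutputBuffer;
    import org.apache.hadoop.io.DoubleWritable;
    import org.apache.hadoop.io.LongWritable;

    public class SerdeCost {
      public static void main(String[] args) throws Exception {
        final int n = 10_000_000;                  // number of (key, value) pairs to round-trip
        LongWritable key = new LongWritable();
        DoubleWritable value = new DoubleWritable();

        DataOutputBuffer out = new DataOutputBuffer();
        long t0 = System.nanoTime();
        for (int i = 0; i < n; i++) {              // serialize, as the map-side spill would
          key.set(i);
          value.set(i * 0.5);
          key.write(out);
          value.write(out);
        }
        DataInputBuffer in = new DataInputBuffer();
        in.reset(out.getData(), out.getLength());
        for (int i = 0; i < n; i++) {              // deserialize, as the reduce side would
          key.readFields(in);
          value.readFields(in);
        }
        long ms = (System.nanoTime() - t0) / 1_000_000;
        System.out.println(n + " pairs (de)serialized in " + ms + " ms");
      }
    }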

  7. Presentation Note. In the pictures that follow, BLUE lines represent slow communication paths and BLACK lines represent fast in-memory aliasing. The M3R goal is to turn BLUE lines into BLACK lines to get performance.

  8. Basic Flow for an (Iterative) Hadoop Job, with an emphasis on Disk I/O. [Diagram: File System (HDFS) → Input (InputFormat/RecordReader/InputSplit) → Map (Mapper) → Shuffle → Reduce (Reducer) → Output (OutputFormat/RecordWriter/OutputCommitter) → File System (HDFS); intermediate data also passes through the file system on the map and reduce sides of the shuffle.]

  9. Basic Flow for an (Iterative) M3R/Hadoop Job, with an emphasis on Disk I/O. [Diagram: the same pipeline, but without the intermediate file-system writes around the shuffle.]

  10. Basic Flow for an (Iterative) M3R/Hadoop Job, with an emphasis on Disk I/O. [Diagram: as slide 9, with an in-memory Cache added beside HDFS in front of the Input and Output stages, so iterative jobs can avoid re-reading from and re-writing to HDFS.]

  11.–12. Basic Flow for an (Iterative) Hadoop Job, with an emphasis on Network I/O. [Diagram (two build-up slides): the same Hadoop pipeline as slide 8, now annotated for network traffic.]

  13.–15. Basic Flow for an (Iterative) M3R/Hadoop Job, with an emphasis on Network I/O. [Diagram (three build-up slides): the M3R/Hadoop pipeline with HDFS plus Cache → Input → Map → Shuffle → Reduce → Output, annotated for network traffic.]

  16. Map/Shuffle/Reduce. [Diagram: Map (Mapper) → Shuffle → Reduce (Reducer).]

  17. Mappers/Shuffle/Reducers. [Diagram: Mapper1–Mapper6 → Shuffle → Reducer1–Reducer6.]

  18. Co-locating Mappers and Reducers. [Diagram: Mapper1–Mapper6 → Shuffle → Reducer1–Reducer6, with mappers and reducers co-located.]

  19. Co-locating Mappers and Co-locating Reducers. [Diagram: Mapper1–Mapper6 → Shuffle → Reducer1–Reducer6.]

  20. Hadoop Broadcast. [Diagram: Mapper1–Mapper6 → Shuffle → Reducer1–Reducer6.]

  21.–22. M3R Broadcast via De-Duplication. [Diagram (two build-up slides): Mapper1–Mapper6 → Shuffle → Reducer1–Reducer6.]

  23. Iterated Matrix Vector Multiplication. Algorithm (row-block partitioned G, V): Replicate V – in parallel, each place broadcasts its segment of V to all others. Then, in parallel at each place, multiply each local row of G with V, yielding a new distributed V. Key to performance: read the appropriate part of G once and never communicate G; communicate only to replicate V. (Also subdivide horizontally for out-of-core V.) [Diagram: new V = G * V, with V replicated.]
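
  As a rough illustration of the per-place work (a sketch, not code from the paper; the CSR block layout and method name below are assumptions), each place multiplies its local row block of the sparse matrix G against the fully replicated dense vector V:

    // Sketch of the local compute for one iteration of V' = G * V: each place
    // holds one row block of G (here in CSR form) plus a replicated copy of V.
    class RowBlockMultiply {
      static double[] multiply(int[] rowPtr, int[] colIdx, double[] vals, // CSR row block of G
                               double[] v) {                              // replicated dense V
        int localRows = rowPtr.length - 1;
        double[] out = new double[localRows];                             // this place's segment of the new V
        for (int r = 0; r < localRows; r++) {
          double sum = 0.0;
          for (int k = rowPtr[r]; k < rowPtr[r + 1]; k++) {
            sum += vals[k] * v[colIdx[k]];
          }
          out[r] = sum;
        }
        return out;  // the segments are re-broadcast to rebuild the replicated V for the next iteration
      }
    }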

  24. Iterated Matrix Vector Multiplication in Hadoop. [Diagram: Job 1 – Input (G) → Map/Pass (G) and Input (V) → Map/Bcast (V), both read from the File System (HDFS), feed a Shuffle and a Reducer (*), which writes Output V# back to HDFS. Job 2 – Input (V#) → Map/Pass (V#) → Shuffle → Reducer (+) → Output V'.]
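
  The shape of the Hadoop-side driver for this pipeline is roughly as follows (a hedged sketch, not taken from the paper: the class, job names, and HDFS paths are hypothetical, and Hadoop's identity mapper/reducer stand in for the elided multiply and sum logic). The point is that every iteration submits two jobs and goes back to HDFS and the network for G, V, V#, and V':

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.lib.IdentityMapper;
    import org.apache.hadoop.mapred.lib.IdentityReducer;

    public class IterativeMatVecDriver {
      public static void main(String[] args) throws Exception {
        Path g = new Path(args[0]);          // row-block partitioned matrix G on HDFS
        Path v = new Path(args[1]);          // current vector V on HDFS
        int iterations = Integer.parseInt(args[2]);

        for (int i = 0; i < iterations; i++) {
          Path partials = new Path("V_sharp_" + i);   // partial products V#
          Path vNext    = new Path("V_" + (i + 1));   // new vector V'

          // Job 1: read G and V from HDFS, shuffle, multiply (the real Reducer (*) is elided).
          JobConf mult = new JobConf(IterativeMatVecDriver.class);
          mult.setJobName("matvec-multiply-" + i);
          FileInputFormat.setInputPaths(mult, g, v);
          mult.setMapperClass(IdentityMapper.class);    // stands in for Map/Pass(G) and Map/Bcast(V)
          mult.setReducerClass(IdentityReducer.class);  // stands in for Reducer (*)
          mult.setOutputKeyClass(LongWritable.class);
          mult.setOutputValueClass(Text.class);
          FileOutputFormat.setOutputPath(mult, partials);
          JobClient.runJob(mult);

          // Job 2: read V# back from HDFS, shuffle, sum (the real Reducer (+) is elided).
          JobConf sum = new JobConf(IterativeMatVecDriver.class);
          sum.setJobName("matvec-sum-" + i);
          FileInputFormat.setInputPaths(sum, partials);
          sum.setMapperClass(IdentityMapper.class);
          sum.setReducerClass(IdentityReducer.class);
          sum.setOutputKeyClass(LongWritable.class);
          sum.setOutputValueClass(Text.class);
          FileOutputFormat.setOutputPath(sum, vNext);
          JobClient.runJob(sum);

          v = vNext;  // under stock Hadoop, both G and the new V are re-read from HDFS next iteration
        }
      }
    }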

  25. Iterated Matrix Vector Multiplication (algorithm recap; same content as slide 23). [Diagram: new V = G * V, with V replicated.]

  26. Iterated Matrix Vector Multiplication in M3R. [Diagram: the same two-stage pipeline as slide 24, but G and V are read from HDFS through the in-memory Cache, V# is no longer written out between the two stages, and only the final V' is output.]

  27. Iterated Matrix Vector Multiplication in M3R. [Diagram: as slide 26, annotated "Do not communicate G" on the G path and "Do no communication" on the V# shuffle.]

  28. Partition Stability in M3R • The reducer associated with a given partition number will always be run at the same place • Same place => same memory • Can reuse existing data structures • User can control local vs. remote communications (see the partitioner sketch below)
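
  As a sketch of how a job can exploit this guarantee (not from the paper; the BlockPartitioner class and the block-id key type are hypothetical), a deterministic partitioner routes the same row block to the same partition number, and therefore to the same place, in every job of an iterative computation:

    // Hypothetical deterministic partitioner (mapred 0.20 API). Because M3R
    // guarantees that a given partition number always runs at the same place,
    // keys routed by this partitioner land in the same JVM across jobs, so
    // data cached there can be reused without network traffic.
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Writable;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.Partitioner;

    public class BlockPartitioner implements Partitioner<IntWritable, Writable> {
      public void configure(JobConf job) {
        // no configuration needed for this sketch
      }

      public int getPartition(IntWritable blockId, Writable value, int numPartitions) {
        // Same block id -> same partition number -> same place, in every job.
        return (blockId.get() & Integer.MAX_VALUE) % numPartitions;
      }
    }

  A job would select it with conf.setPartitionerClass(BlockPartitioner.class); on stock Hadoop the same code still runs, it simply gains no cross-job locality.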

  29. An M3R/Hadoop Job that Exploits Locality. [Diagram: File System (HDFS) plus Cache → Input (InputFormat/RecordReader/InputSplit) → Map (Mapper) → Partitioner → Shuffle → Reduce (Reducer) → Output (OutputFormat/RecordWriter/OutputCommitter).]

  30. Iterated Matrix Vector Multiplication in M3R (repeated from slide 27). [Diagram: as slide 27.]

  31. Conclusions • Sacrifice resilience and out-of-core execution; gain performance • Used X10 to build a fast map/reduce engine • Used X10/Java interop to wrap it with the Hadoop API • Used X10 features to implement the distributed cache • Avoid serialization, disk, and network I/O costs • 10x faster for unmodified Hadoop apps (System ML) • 50x faster for a Hadoop app designed for M3R

  32. Backup Slides

  33. Conclusions • M3R is operational: a multi-JVM, multi-threaded, main-memory map reduce implementation written in X10 • Runs Java Hadoop 0.20.2 mapred jobs, which may use HDFS • Also supports an API that permits mappers/reducers to operate on objects pre-positioned in global memory • M3R can run DML programs unchanged (some changes were needed to the DML compiler to teach it to use the caching file system); speedups range from 1.6x to 13x (measurements on an 8-core, 20-node cluster; more tests are under way) • Java Hadoop jobs can be written to take advantage of M3R features: Cloning – make keys/values immutable so they don't need to be cloned; Caching – the engine caches key-value pairs associated with files (avoiding (de)serialization and I/O costs); Partition Stability – key-value pairs with the same partition number go to the same JVM, across jobs • A Matrix Vector multiply implementation (in Java Hadoop 0.20.2) shows cycle-time improvements from 9x to 47x, exploiting cloning, caching, and partition stability • Better performance is possible with code written to the M3R APIs

  34. M3R/Hadoop Limitations • The entire mapper output for a given job must fit in available memory (it may be possible to relax this in the future) • The mapper and reducer code must be safe for multi-threaded execution; in particular, the use of static variables is suspect (see the sketch below). Code that runs correctly using Hadoop's multithreaded map runner is probably fine; M3R will support single-threaded places (may already) • Code should not assume the JVM will be restarted between tasks: clean up after yourself • For now, only mapred 0.20.2 is supported • No failure resilience
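
  To illustrate the static-variable caveat (an illustrative sketch, not from the deck; the mapper below is hypothetical): under M3R several mapper tasks can share one JVM, so per-task state belongs in instance fields rather than statics:

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapred.MapReduceBase;
    import org.apache.hadoop.mapred.Mapper;
    import org.apache.hadoop.mapred.OutputCollector;
    import org.apache.hadoop.mapred.Reporter;

    public class SequenceTagMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, LongWritable, Text> {

      // UNSAFE under M3R: multiple mapper tasks in one JVM would share and race on this.
      // private static long nextId = 0;

      // Safe: each mapper task gets its own instance, hence its own counter.
      private long nextId = 0;

      public void map(LongWritable offset, Text line,
                      OutputCollector<LongWritable, Text> out, Reporter reporter)
          throws IOException {
        out.collect(new LongWritable(nextId++), line);  // per-task id, no shared mutable state
      }
    }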

  35. X10: An Evolution of Java for the Scale-Out Era (X10 – Performance and Productivity at Scale). [Diagram: X10 programs run over the APGAS runtime across many places: managed places, where X10 compiles to Java and runs alongside Java code on a JVM, and native places, including GPU and FPGA targets.] • How do you deal with peta-bytes of data? • How do you take advantage of GPUs and FPGAs?

  36. X10 and the APGAS Model • Asynchrony: async S • Atomicity: when (c) S • Locality: at (P) S • Order: finish S, clocks • Global data structures: points, regions, distributions, arrays • Five basic constructs: async S – run S as a separate activity; at (P) S – switch to place P to run S; finish S – execute S, wait for termination; when (c) S – execute S when c holds, atomically; clocked async / clocked finish – support barriers

    class HelloWholeWorld {
      public static def main(s:Array[String](1)):void {
        finish for (p in Place.places())
          async at (p)
            Console.OUT.println("(At " + p + ") " + s(0));
      }
    }

  • Cilk-style work-stealing scheduler • Runs on modern interconnects; collectives exploit hardware support; RDMA transfer support • Runs on Blue Gene, x86, Power • Runs natively and in a JVM • Java-like productivity, MPI-like performance

  37. M3R Goals • Fast multi-node, multi-threaded Map Reduce for clusters with high MTBF, optimized for iterative jobs • Support the Hadoop mapred 0.20.2 API with minimal user-visible changes; the same job should be able to run on Hadoop and M3R • Support HDFS access • Perform well on scale-up SMPs • Ensure that DML programs can run unchanged on Hadoop or M3R (also Jaql, Nimble, …)

  38. Reading G into the Correct Partitions: Preloading. [Diagram: Input G is read from HDFS (through the Cache), passed unchanged through a Map and a Reduce stage with a custom Partitioner on the Shuffle, and written out as Partitioned_G.] Each block of G thus ends up, and stays cached, at the place that will consume it in later jobs; a sketch of such a preload job follows.
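
  A sketch of what such a preload job could look like with the stock mapred API (the paths, key/value types, and the BlockPartitioner class from the earlier sketch are all hypothetical; G is assumed here to be stored as a SequenceFile of (block id, block bytes) pairs):

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;
    import org.apache.hadoop.mapred.SequenceFileInputFormat;
    import org.apache.hadoop.mapred.SequenceFileOutputFormat;
    import org.apache.hadoop.mapred.lib.IdentityMapper;
    import org.apache.hadoop.mapred.lib.IdentityReducer;

    public class PreloadG {
      public static void main(String[] args) throws Exception {
        JobConf job = new JobConf(PreloadG.class);
        job.setJobName("preload-G");

        // Pass G through unchanged; the only purpose of this job is to route
        // each block of G to its final partition (and hence its final place).
        job.setInputFormat(SequenceFileInputFormat.class);   // G assumed stored as (block id, block bytes)
        job.setMapperClass(IdentityMapper.class);
        job.setReducerClass(IdentityReducer.class);
        job.setPartitionerClass(BlockPartitioner.class);     // from the earlier sketch
        job.setOutputFormat(SequenceFileOutputFormat.class);
        job.setOutputKeyClass(IntWritable.class);            // block id
        job.setOutputValueClass(BytesWritable.class);        // block payload

        FileInputFormat.setInputPaths(job, new Path("G"));
        FileOutputFormat.setOutputPath(job, new Path("Partitioned_G"));
        JobClient.runJob(job);
        // Under M3R, later jobs that read Partitioned_G with the same partitioner
        // find each block already cached at the place that will use it.
      }
    }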

  39. Preload performance. Cost reflects the one-time preload cost plus the cost of 3 iterations; preload costs decrease (are amortized) as the number of iterations grows.

  40. Reading G into the Correct Partitions: PlacedSplit

    class MatrixInputSplit implements InputSplit, PlacedSplit {
      int getPartition(…) { … }
      …
    }

  M3R/Hadoop prioritizes PlacedSplit requests over HDFS locality.
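
  A heavily hedged sketch of such a split (the deck only shows that PlacedSplit has a getPartition method, so the interface's package, its exact signature, and the split's fields below are assumptions made for illustration):

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;
    import org.apache.hadoop.mapred.InputSplit;

    // Sketch only. PlacedSplit is M3R's interface; the single-argument
    // getPartition signature used here is an assumption.
    public class MatrixInputSplit implements InputSplit /*, PlacedSplit */ {
      private int rowBlock;      // which row block of G this split covers
      private long lengthBytes;  // size of the split in bytes

      public MatrixInputSplit() { }                        // needed for Writable deserialization
      public MatrixInputSplit(int rowBlock, long lengthBytes) {
        this.rowBlock = rowBlock;
        this.lengthBytes = lengthBytes;
      }

      // Assumed PlacedSplit method: ask that this split be processed by the
      // partition (and hence the place) that will later reduce this row block.
      public int getPartition(int numPartitions) {
        return rowBlock % numPartitions;
      }

      // Standard InputSplit/Writable methods.
      public long getLength() { return lengthBytes; }
      public String[] getLocations() { return new String[0]; }  // placement comes from getPartition, not HDFS locality
      public void write(DataOutput out) throws IOException {
        out.writeInt(rowBlock);
        out.writeLong(lengthBytes);
      }
      public void readFields(DataInput in) throws IOException {
        rowBlock = in.readInt();
        lengthBytes = in.readLong();
      }
    }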

  41. Hadoop Serialization and Mutation. [Diagram: the stock Hadoop pipeline from slide 8 (HDFS → Input → Map → Shuffle → Reduce → Output → HDFS, with file-system writes around the shuffle); each of these hops serializes and deserializes key-value pairs.]

  42. Encouraging Mutation: WordCount

    public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
      private final static IntWritable one = new IntWritable(1);
      private Text word = new Text();

      public void map(LongWritable key, Text value, Context context)
          throws IOException, InterruptedException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
          word.set(tokenizer.nextToken());   // the same Text object is mutated and re-emitted for every token
          context.write(word, one);
        }
      }
    }

  43. Aliasing and Mutation. [Diagram: the M3R/Hadoop pipeline from slide 29 (HDFS plus Cache → Input → Map → Partitioner → Shuffle → Reduce → Output). Because the in-memory shuffle and cache pass object references rather than serialized copies, a mapper that mutates and re-emits the same object would corrupt aliased values unless the engine clones them.]

  44. ImmutableOutputs. [Diagram: the same pipeline, with the Mapper and Reducer declared as "implements ImmutableOutput"; the engine can then alias their outputs directly instead of cloning them.]
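
  For example (a sketch under assumptions: the deck names ImmutableOutput but does not show its package or contract, so it is stubbed here as a marker interface), a WordCount mapper that promises never to mutate emitted objects allocates fresh key/value instances per record:

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Stub for the marker interface named on the slide; its real package in
    // M3R/Hadoop is not shown in the deck.
    interface ImmutableOutput { }

    public class ImmutableWordCountMap
        extends Mapper<LongWritable, Text, Text, IntWritable>
        implements ImmutableOutput {

      public void map(LongWritable key, Text value, Context context)
          throws IOException, InterruptedException {
        StringTokenizer tokenizer = new StringTokenizer(value.toString());
        while (tokenizer.hasMoreTokens()) {
          // Fresh, never-mutated objects: the engine may keep references to them
          // (in the cache or across the in-memory shuffle) without cloning.
          context.write(new Text(tokenizer.nextToken()), new IntWritable(1));
        }
      }
    }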

  45. M3R Foundation. At its heart, a core X10 M3R engine allows jobs to "pin" data in distributed memory using standard X10 features. Over this, we built an interop layer that consumes Hadoop jobs and runs them against the core engine. The cache and the partition-stability guarantee mediate between these worlds: they allow a Hadoop programmer to pin data in memory. Writing directly to the X10 M3R interface offers the opportunity for increased performance: X10 code for MatVecMult performs 10x better (managed backend, sockets); native code (on sockets) performs 30x better.

  46. Mappers/Shuffle/Reducers (repeated from slide 17). [Diagram: Mapper1–Mapper6 → Shuffle → Reducer1–Reducer6.]

  47. Connecting Mappers and Reducers. [Diagram: Mapper1–Mapper6 → Shuffle → Reducer1–Reducer6.]

  48. Partitioner: Connecting Mappers and Reducers. [Diagram: Mapper1–Mapper6 → Partitioner → Shuffle → Reducer1–Reducer6.] For each emitted pair, the partitioner decides which reducer receives it:

    int partitionNumber = getPartition(key, value);
