
Synthesizing MPI Implementations from Functional Data-Parallel Programs

Tristan Aubrey-Jones

Bernd Fischer


People need to program distributed memory architectures:

GPUs

  • Graphics/GPGPU: CUDA/OpenCL
  • Memory levels: thread-local, block-shared, device-global.

Server farms/compute clusters

  • Big Data/HPC; MapReduce/HPF; Ethernet/InfiniBand
  • Our focus: programming shared-nothing clusters

Future many-core architectures?


Our Aim

To automatically synthesize distributed memory implementations (C++/MPI).

  • We want this not just for arrays, but also for other collection types and disk-backed collections.

[Diagram: the same computation implemented single-threaded, on parallel shared memory, and on distributed memory; many distributions are possible.]


Our Approach

  • We define Flocc, a data-parallel DSL, where parallelism is expressed via combinator functions.

  • We develop a code generator that searches for well-performing implementations and generates C++/MPI code.

  • We use extended types to carry data-distribution information, and type inference to infer data distributions.


Flocc: a data-parallel DSL

(Functional language on compute clusters)


Flocc syntax

Polyadic lambda calculus, standard call-by-value semantics.


Matrix multiplication in Flocc

Data-parallelism via combinator functions (abstracts away from individual element access)

let R1 = eqJoin(\((ai,aj),_) -> aj,
                \((bi,bj),_) -> bi,
                A, B) in …

SELECT * FROM A JOIN B ON A.j = B.i;

eqJoin :: ((k₁,v₁) → x, (k₂,v₂) → x, Map k₁ v₁, Map k₂ v₂) → Map (k₁,k₂) (v₁,v₂)


Matrix multiplication in Flocc (continued)

let R1 = eqJoin(\((ai,aj),_) -> aj,
                \((bi,bj),_) -> bi,
                A, B) in
let R2 = map (id, \(_,(av,bv)) -> mul(av,bv),
              R1) in …

SELECT A.i as i, B.j as j, A.v * B.v as v
FROM A JOIN B ON A.j = B.i;

eqJoin :: ((k₁,v₁) → x, (k₂,v₂) → x, Map k₁ v₁, Map k₂ v₂) → Map (k₁,k₂) (v₁,v₂)

map :: (k₁ → k₂, (k₁,v₁) → v₂, Map k₁ v₁) → Map k₂ v₂


Matrix multiplication in Flocc (continued)

let R1 = eqJoin(\((ai,aj),_) -> aj,
                \((bi,bj),_) -> bi,
                A, B) in
let R2 = map (id, \(_,(av,bv)) -> mul(av,bv),
              R1) in
groupReduce (\(((ai,aj),(bi,bj)),_) -> (ai,bj),
             snd, add, R2)

SELECT A.i as i, B.j as j, sum(A.v * B.v) as v
FROM A JOIN B ON A.j = B.i GROUP BY A.i, B.j;

eqJoin :: ((k₁,v₁) → x, (k₂,v₂) → x, Map k₁ v₁, Map k₂ v₂) → Map (k₁,k₂) (v₁,v₂)

map :: (k₁ → k₂, (k₁,v₁) → v₂, Map k₁ v₁) → Map k₂ v₂

groupReduce :: ((k₁,v₁) → k₂, (k₁,v₁) → v₂, (v₂,v₂) → v₂, Map k₁ v₁) → Map k₂ v₂
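
The following is a minimal sequential (single-node) reference semantics for these three combinators, sketched in Haskell over Data.Map. The uncurried tuple-argument style follows the signatures above; the definitions themselves are illustrative, not Flocc's actual implementation (map is renamed map' to avoid clashing with Prelude.map).

import qualified Data.Map as M

eqJoin :: (Ord k1, Ord k2, Eq x)
       => ((k1,v1) -> x, (k2,v2) -> x, M.Map k1 v1, M.Map k2 v2)
       -> M.Map (k1,k2) (v1,v2)
eqJoin (f1, f2, a, b) = M.fromList
  [ ((ka,kb), (va,vb)) | (ka,va) <- M.toList a
                       , (kb,vb) <- M.toList b
                       , f1 (ka,va) == f2 (kb,vb) ]

map' :: Ord k2 => (k1 -> k2, (k1,v1) -> v2, M.Map k1 v1) -> M.Map k2 v2
map' (fk, fv, m) = M.fromList [ (fk k, fv (k,v)) | (k,v) <- M.toList m ]

groupReduce :: Ord k2
            => ((k1,v1) -> k2, (k1,v1) -> v2, (v2,v2) -> v2, M.Map k1 v1)
            -> M.Map k2 v2
groupReduce (fk, fv, comb, m) =
  M.fromListWith (curry comb) [ (fk (k,v), fv (k,v)) | (k,v) <- M.toList m ]

-- Matrix multiplication, following the Flocc program above:
matMul :: M.Map (Int,Int) Float -> M.Map (Int,Int) Float -> M.Map (Int,Int) Float
matMul a b =
  let r1 = eqJoin (\((_,aj),_) -> aj, \((bi,_),_) -> bi, a, b)
      r2 = map' (id, \(_,(av,bv)) -> av * bv, r1)
  in  groupReduce (\(((ai,_),(_,bj)),_) -> (ai,bj), snd, uncurry (+), r2)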


Distribution Types

Distributed data layout (DDL) types


Deriving distribution plans

Hidden from the user:

  • Different distributed implementations of each combinator.

  • Enumerates different choices of combinator implementations.

  • Each combinator implementation has a DDL type.

  • Use type inference to check whether a set of choices is sound, insert redistributions to make it sound, and infer data distributions.

  • Have backend template for each combinator implementation.

let R = eqJoin(\((ai,aj),_) -> aj,
               \((bi,bj),_) -> bi,
               A, B) in
groupReduce(\(((ai,aj),(bi,bj)),_) -> (ai,bj),
            \(_,(av,bv)) -> mul (av,bv), add, R)

Implementation choices: eqJoin1 / eqJoin2 / eqJoin3 for the join; groupReduce1 / groupReduce2 for the reduction (the map is folded into the groupReduce).


A Distributed Map type

  • The DMap type extends the basic Map k v type to symbolically describe how the Map should be distributed over the cluster.

  • Key and value types t1 and t2.

  • Partition function f: a function taking key-value pairs to node coordinates, i.e. f : (t1,t2) → N.

  • Partition dimension identifier d1: specifies which nodes the coordinates map onto.

  • Mirror dimension identifier d2: specifies which nodes to mirror the partitions over.

  • Also includes a local layout mode and key, omitted for brevity. The same scheme also works for Arrays (DArr) and Lists (DList); a sketch of the distribution metadata follows.
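
As a hedged sketch of how this distribution metadata might be represented symbolically (the constructor and field names here are illustrative, not the paper's implementation):

-- Distribution metadata attached to a DMap, sketched in Haskell.
data Dim = Dim0 | Dim1 deriving (Eq, Show)

data DistInfo k v = DistInfo
  { partFun    :: (k, v) -> Int   -- f : (t1,t2) -> N, key-value pair to node coordinate
  , partDims   :: [Dim]           -- d1: node dimension(s) the partitions map onto
  , mirrorDims :: [Dim]           -- d2: node dimension(s) the partitions are mirrored over
  }

-- For example, the first distribution on the following slides,
--   A :: DMap (Int,Int) Float (hash(dim0) fst∘fst) (dim0) [],
-- partitions A by the hash of its row index over the 3 nodes along dim0:
aDist :: DistInfo (Int, Int) Float
aDist = DistInfo { partFun    = \((x, _), _) -> x `mod` 3
                 , partDims   = [Dim0]
                 , mirrorDims = [] }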


Some example distributions

A :: DMap (Int,Int) Float (hash(dim0) fst∘fst) (dim0) []

[Diagram: 2D node topology, dimensions dim0 and dim1.]

hash(dim0) fst∘fst = \((x,_),_) -> x % 3


Some example distributions (continued)

A :: DMap (Int,Int) Float (hash(dim0) fst∘fst) (dim0) [dim1]

[Diagram: 2D node topology, dimensions dim0 and dim1; the partitions are additionally mirrored along dim1.]

hash(dim0) fst∘fst = \((x,_),_) -> x % 3


Some example distributions (continued)

A :: DMap (Int,Int) Float (hash(dim0) snd) (dim0) [dim1]

[Diagram: 2D node topology, dimensions dim0 and dim1.]

hash(dim0) snd = \(_,x) -> ceil(x) % 3


Some example distributions (continued)

A :: DMap … (hash(dim0,dim1) fst) (dim0,dim1) []

[Diagram: 2D node topology, dimensions dim0 and dim1; A is partitioned over both dimensions.]

hash(dim0,dim1) fst = \((x,y),_) -> (x % 3, y % 4)
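
A small illustrative helper (assumed, not from the paper) showing how a 2D partition function like the one above can be composed with a row-major layout of the 3-by-4 node grid to obtain a single node rank:

owner :: ((Int, Int), Float) -> Int
owner ((x, y), _) =
  let (p, q) = (x `mod` 3, y `mod` 4)   -- hash(dim0,dim1) fst, as on this slide
  in  p * 4 + q                         -- row-major rank on the dim0 × dim1 grid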


Distribution types for group-reduces

  • Input collection is partitioned using any projection function f

    • must exchange values between nodes before reduction

groupReduce1 : ∀ f, k, k’, v, v’, d, m.
  ((k,v) → k’, (k,v) → v’, (v’,v’) → v’,
   DMap k v (hash(d) f) d m)
  → DMap k’ v’ (hash(d) fst) d m

  • Result is always co-located by the result Map’s keys.

[Diagram: each of the three nodes re-partitions its input by the output key, exchanges entries with the other nodes, then performs a local group-reduce to produce its output partition; a sequential simulation of this strategy is sketched below.]
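
Below is a sequential simulation of the groupReduce1 strategy (a sketch, not the generated C++/MPI code): each node's partition is modelled as a local Map, entries are exchanged so that equal output keys meet on one node, and each node then reduces locally. The helper names (Node, hash, numbered nodes) are illustrative.

import qualified Data.Map as M

type Node k v = M.Map k v            -- one node's local partition

groupReduce1 :: Ord k2
             => Int                  -- number of nodes along dimension d
             -> ((k1,v1) -> k2)      -- output-key projection
             -> ((k1,v1) -> v2)      -- value projection
             -> ((v2,v2) -> v2)      -- reduction function
             -> (k2 -> Int)          -- hash of output key to node coordinate
             -> [Node k1 v1] -> [Node k2 v2]
groupReduce1 n fk fv comb hash locals =
  [ M.fromListWith (curry comb)
      [ (k2, fv (k1,v1))                             -- local reduce on the owner node
      | local <- locals, (k1,v1) <- M.toList local   -- the "exchange": in MPI this is an
      , let k2 = fk (k1,v1), hash k2 == me ]         -- all-to-all send to hash k2's owner
  | me <- [0 .. n-1] ]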


Distribution types for group-reduces (continued)

  • Input is partitioned using the groupReduce’s key projection function f

    • result keys are already co-located

groupReduce2 : ∀ k, k’, v, v’, d, m. Π(f,_,_,_) :
  ((k,v) → k’, (k,v) → v’, (v’,v’) → v’,
   DMap k v (hash(d) f) d m)
  → DMap k’ v’ (hash(d) fst) d m

  • Result is always co-located by the result Map’s keys.

[Diagram: each of the three nodes group-reduces its local partition directly. No inter-node communication is necessary (but the input distribution is constrained); see the sketch below.]
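
By contrast, a sketch of the groupReduce2 strategy: because the input is already partitioned by the hash of the groupReduce's own key projection, every group lives entirely on one node, so each node reduces its local Map independently (Node as in the previous sketch; names illustrative).

import qualified Data.Map as M

type Node k v = M.Map k v

groupReduce2 :: Ord k2
             => ((k1,v1) -> k2) -> ((k1,v1) -> v2) -> ((v2,v2) -> v2)
             -> [Node k1 v1] -> [Node k2 v2]
groupReduce2 fk fv comb = map localReduce
  where
    localReduce local = M.fromListWith (curry comb)
      [ (fk (k,v), fv (k,v)) | (k,v) <- M.toList local ]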


Dependent type schemes

Two different binders/schemes:

  • Π binds a concrete term in the AST
    • lifts the argument’s AST into the types when instantiated
    • must match the concrete term when unifying (rigid)
    • used to specify that a collection is distributed using a specific projection function
    • (non-standard dependent-type syntax)
  • ∀ binds a type variable
    • creates a fresh variable when instantiated for a term
    • can be replaced by any type when unifying (flexible)
    • used to specify that a collection is distributed by any partition function / arbitrarily
    • similar to standard type schemes
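
An illustrative sketch (not the compiler's actual data types) of a type-scheme representation with these two binders:

type TyVar = String

data Term = TmVar String | TmLam String Term | TmApp Term Term   -- combinator-argument AST

data Type = TV TyVar
          | TCon String [Type]        -- e.g. DMap k v (hash d f) d m
          | TTerm Term                -- an embedded (projection) function

data Scheme = Mono Type
            | Forall TyVar Scheme     -- flexible: fresh type variable when instantiated
            | Pi String Scheme        -- rigid: names an argument whose concrete term
                                      -- is lifted into the type at instantiation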


Extending Damas-Milner inference

  • We use a constraint-based implementation of Algorithm W and extend it to:

    • Lift parameter functions into types at function apps (see DApp).

    • Compare embedded functions (extends Unify).

      • Undecidable in general.

      • Currently we use an under-approximation which simplifies functions and tests for syntactic equality (sketched below).

      • Future work solves equations involving projection functions.
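
A toy sketch of that under-approximation: two embedded functions are judged equal only if they are syntactically identical after simplification. The Exp type and the simplifier here are illustrative; the real compiler works on Flocc's own AST.

data Exp = Var String | Lam String Exp | App Exp Exp
         | Pair Exp Exp | Fst Exp | Snd Exp
  deriving (Eq, Show)

-- Simplify projections of explicit pairs; a fuller simplifier would also
-- beta-reduce, etc.
simplify :: Exp -> Exp
simplify (Fst e)    = case simplify e of { Pair a _ -> a; e' -> Fst e' }
simplify (Snd e)    = case simplify e of { Pair _ b -> b; e' -> Snd e' }
simplify (App f x)  = App (simplify f) (simplify x)
simplify (Lam x b)  = Lam x (simplify b)
simplify (Pair a b) = Pair (simplify a) (simplify b)
simplify e          = e

-- Under-approximates semantic equality: it may miss functions that are in
-- fact equal, but never equates functions that differ.
funEq :: Exp -> Exp -> Bool
funEq f g = simplify f == simplify g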


Distribution types for joins

  • Inputs are partitioned using the join’s key emitter functions

    • no inter-node communication, equal keys already co-located

eqJoin1 : ∀ k’, k1, k2, v1, v2, d, m. Π(f1,f2,_,_) :
  ((k1,v1) → k’, (k2,v2) → k’,
   DMap k1 v1 (hash(d) f1) d m,
   DMap k2 v2 (hash(d) f2) d m)
  → DMap (k1,k2) (v1,v2) (hash(d) (f1 lft ∩ f2 rht)) d m

Inputs are partitioned by f1 and f2 (i.e. aligned).

[Diagram: each of the three nodes locally equi-joins its partitions of the two inputs; a sketch follows the list below.]

  • No inter-node communication
  • Constrains inputs and outputs
  • No mirroring of inputs
  • (Like a sort-merge join)
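
A sketch of the eqJoin1 strategy: both inputs are already co-partitioned by their join-key projections along the same node dimension, so each node joins its local partitions independently and nothing is exchanged (Node as in the earlier sketches; names illustrative).

import qualified Data.Map as M

type Node k v = M.Map k v

eqJoin1 :: (Ord k1, Ord k2, Eq x)
        => ((k1,v1) -> x) -> ((k2,v2) -> x)
        -> [Node k1 v1] -> [Node k2 v2]       -- aligned: the i-th partitions share node i
        -> [Node (k1,k2) (v1,v2)]
eqJoin1 f1 f2 = zipWith localJoin
  where
    localJoin a b = M.fromList
      [ ((ka,kb), (va,vb)) | (ka,va) <- M.toList a
                           , (kb,vb) <- M.toList b
                           , f1 (ka,va) == f2 (kb,vb) ]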


Example: deriving a distributed-memory matrix multiply


Deriving a distributed matrix multiply

  • The program again:

let R = eqJoin(\((ai,aj),_) -> aj,
               \((bi,bj),_) -> bi,
               A, B) in
groupReduce(\(((ai,aj),(bi,bj)),_) -> (ai,bj),
            \(_,(av,bv)) -> mul (av,bv), add, R)

Implementation choices: eqJoin1 / eqJoin2 / eqJoin3 for the join; groupReduce1 / groupReduce2 for the reduction (the map is folded into the groupReduce).


Distributed matrix multiplication - #1

  • groupReduce2 – input partitioned by the key projection f; output partitioned by fst (the output keys)

  • eqJoin3 – left input partitioned by lftFun(f), right by rhtFun(f); output partitioned by any f

A: Partitioned by row along d1, mirrored along d2

B: Partitioned by column along d2, mirrored along d1

A : DMap (Int,Int) Float (hash(d1) \((ai,aj),_) -> ai) d1 [d2,m]
B : DMap (Int,Int) Float (hash(d2) \((bi,bj),_) -> bj) d2 [d1,m]
R : DMap ((Int,Int),(Int,Int)) (Float,Float)
      (hash(d1,d2) \(((ai,aj),(bi,bj)),_) -> (ai,bj)) (d1,d2) m
C : DMap (Int,Int) Float (hash(d1,d2) fst) (d1,d2) m

C: Partitioned by (row, column) along (d1, d2)

Must partition and mirror A and B at beginning of computation.


Distributed matrix multiplication - #1 (continued)

  • groupReduce2 – input partitioned by the key projection f; output partitioned by fst (the output keys)

  • eqJoin3 – left input partitioned by lftFun(f), right by rhtFun(f); output partitioned by any f

A: Partitioned by row along d1, mirrored along d2

B: Partitioned by column along d2, mirrored along d1

[Diagram: eqJoin3 followed by groupReduce2 on a 3-by-3 grid of 9 nodes, with dimensions d1 and d2.]
C: Partitioned by (row, column) along (d1, d2)

A common solution.

Must partition and mirror A and B at beginning of computation.


Distributed matrix multiplication - #2

  • eqJoin1 – left input partitioned by f1, right by f2, aligned along d

  • groupReduce1 – input partitioned by any f; output partitioned by fst (the output keys)

A: Partitioned by col along d

B: Partitioned by row along d (aligned with A)

A : DMap (Int,Int) Float (hash(d) \((ai,aj),_) -> aj) d m
B : DMap (Int,Int) Float (hash(d) \((bi,bj),_) -> bi) d m

R : DMap ((Int,Int),(Int,Int)) (Float,Float)
      (hash(d) (\((ai,aj),_) -> aj lft ∩ \((bi,bj),_) -> bi rht)) d m

C : DMap (Int,Int) Float (hash(d) fst) d m

C: Partitioned by (row, column) along d

Must exchange R during groupReduce1.


Distributed matrix multiplication - #2 (continued)

A by col, B by row, so join is in-place but group must exchange values to group by key.

[Diagram: on 3 nodes along dimension d, each node locally equi-joins its aligned partitions of A and B (eqJoin1); groupReduce1 then re-partitions the partial products by the result key (regrouping by column from B) and sums the contributions from all three nodes.]

Efficient for sparse matrices, where A and B are large, but R is small.


Implementation & results

Generated implementations


Prototype Implementation

  • Uses a genetic search to explore candidate implementations.

  • For each candidate we automatically insert redistributions to make it type-check.

  • We evaluate each candidate by generating code, compiling it, and running it on some test data (a simplified sketch of this search loop follows the list).

  • Generates C++ using MPI.

  • (~20k lines of Haskell)
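
A highly simplified, hypothetical sketch of that search loop (the prototype uses a genetic search over per-combinator implementation choices; every name below is illustrative):

import Data.List (sortOn)

type Candidate = [Int]   -- one implementation choice per combinator call

search :: Int                          -- generations remaining
       -> (Candidate -> [Candidate])   -- mutate/crossover a candidate
       -> (Candidate -> IO Double)     -- code-generate, compile, run, and time it
       -> [Candidate] -> IO Candidate
search gens mutate runtime pop = do
  timed <- mapM (\c -> (,) c <$> runtime c) pop
  let ranked = map fst (sortOn snd timed)            -- fastest candidates first
  if gens == 0
    then pure (head ranked)
    else search (gens - 1) mutate runtime
                (take 4 ranked ++ concatMap mutate (take 4 ranked))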


Example programs

Performance comparable with hand-coded versions:

  • PLINQ comparisons were run on a quad-core (Intel Xeon W3520, 2.67 GHz) x64 desktop with 12GB of memory.

  • C++/MPI comparisons were run on Iridis 3 & 4: a third-generation cluster with ~1000 Westmere compute nodes, each with two 6-core CPUs and 22GB of RAM, connected by an InfiniBand network. Speedups are relative to sequential execution, averaged over runs on 1, 2, 3, 4, 8, 9, 16, and 32 nodes.


Conclusions: so what?


The Pros and Cons…

Benefits

  • Multiple distributed collections: Maps, Arrays, Lists...
  • Can support in-memory and disk-backed collections (e.g. for Big Data).
  • Generates distributed algorithms automatically.
  • Flexible communication.
  • Concise input language.
  • Abstracts away from complex details.

Limitations

  • Currently, data-parallel algorithms must be expressed via combinators.
  • Data distributions are limited to those that can be expressed in the types.


Future work

  • Improve prototype implementation.

  • More distributed collection types and combinators.

  • Better function equation solving during unification.

  • Support more distributed memory architectures (GPUs).

  • Retrofit into an existing functional language.

  • Similar type inference for imperative languages?

  • Suggestions?


Questions?

