Synthesizing MPI Implementations from Functional Data-Parallel Programs



Synthesizing MPI Implementations from Functional Data-Parallel Programs

Tristan Aubrey-Jones

Bernd Fischer


People need to program distributed memory architectures:

  • GPUs (graphics/GPGPU): CUDA/OpenCL; memory levels: thread-local, block-shared, device-global.
  • Server farms/compute clusters. Our focus: programming shared-nothing clusters.
    • Big Data/HPC; MapReduce/HPF; Ethernet/InfiniBand
  • Future many-core architectures?



Our Aim

To automatically synthesize distributed memory implementations (C++/MPI).

  • We want this not just for arrays, but also for other collection types and disk-backed collections.

[Diagram: from single-threaded, to parallel shared-memory, to distributed memory; many distributions are possible]



Our Approach

  • We define Flocc, a data-parallel DSL, where parallelism is expressed via combinator functions.

  • We develop a code generator that searches for well-performing implementations and generates C++/MPI code.

  • We use extended types to carry data-distribution information, and type inference to infer data distributions.


Flocc

A data-parallel DSL
(Functional language on compute clusters)



Flocc syntax

Polyadic lambda calculus, standard call-by-value semantics.
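
The full grammar is not shown on the slide; purely as a rough illustration (not Flocc's actual abstract syntax), a polyadic lambda-calculus core could be modelled in Haskell along these lines:

    -- Hypothetical sketch of a polyadic lambda-calculus core (illustrative only;
    -- not Flocc's actual AST).
    data Pat
      = PVar String            -- variable pattern
      | PTup [Pat]             -- tuple pattern, e.g. ((ai,aj),_)
      | PWild                  -- wildcard _

    data Expr
      = Var String             -- identifiers, including combinators (map, eqJoin, ...)
      | Lam Pat Expr           -- \pat -> e
      | App Expr [Expr]        -- polyadic application f(e1, ..., en)
      | Tup [Expr]             -- tuple construction
      | Let String Expr Expr   -- let x = e1 in e2, evaluated call-by-value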



Matrix multiplication in Flocc

Data-parallelism via combinator functions (abstracts away from individual element access)

let R1 = eqJoin(\((ai,aj),_) -> aj,
                \((bi,bj),_) -> bi,
                A, B) in …

SELECT * FROM A JOIN B ON A.j = B.i;

eqJoin :: ((k₁,v₁) → x, (k₂,v₂) → x, Map k₁ v₁, Map k₂ v₂) → Map (k₁,k₂) (v₁,v₂)



Matrix multiplication in Flocc

let R1 = eqJoin(\((ai,aj),_) -> aj,
                \((bi,bj),_) -> bi,
                A, B) in
let R2 = map(id, \(_,(av,bv)) -> mul(av,bv),
             R1) in …

SELECT A.i as i, B.j as j, A.v * B.v as v
FROM A JOIN B ON A.j = B.i;

eqJoin :: ((k₁,v₁) → x, (k₂,v₂) → x, Map k₁ v₁, Map k₂ v₂) → Map (k₁,k₂) (v₁,v₂)

map :: (k₁ → k₂, (k₁,v₁) → v₂, Map k₁ v₁) → Map k₂ v₂



Matrix multiplication in Flocc

let R1 = eqJoin(\((ai,aj),_) -> aj,
                \((bi,bj),_) -> bi,
                A, B) in
let R2 = map(id, \(_,(av,bv)) -> mul(av,bv),
             R1) in
groupReduce(\(((ai,aj),(bi,bj)),_) -> (ai,bj),
            snd, add, R2)

SELECT A.i as i, B.j as j, sum(A.v * B.v) as v
FROM A JOIN B ON A.j = B.i
GROUP BY A.i, B.j;

eqJoin :: ((k₁,v₁) → x, (k₂,v₂) → x, Map k₁ v₁, Map k₂ v₂) → Map (k₁,k₂) (v₁,v₂)

map :: (k₁ → k₂, (k₁,v₁) → v₂, Map k₁ v₁) → Map k₂ v₂

groupReduce :: ((k₁,v₁) → k₂, (k₁,v₁) → v₂, (v₂,v₂) → v₂, Map k₁ v₁) → Map k₂ v₂
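
For reference, the three combinators have a straightforward sequential reading; a minimal Haskell sketch over Data.Map (curried for convenience, and not the code the compiler generates) is:

    import qualified Data.Map as M

    -- Sequential reference semantics for the three combinators (a sketch only).
    mapC :: Ord k2 => (k1 -> k2) -> ((k1,v1) -> v2) -> M.Map k1 v1 -> M.Map k2 v2
    mapC kf vf m = M.fromList [ (kf k, vf (k,v)) | (k,v) <- M.toList m ]

    eqJoinC :: (Ord k1, Ord k2, Eq x)
            => ((k1,v1) -> x) -> ((k2,v2) -> x)
            -> M.Map k1 v1 -> M.Map k2 v2 -> M.Map (k1,k2) (v1,v2)
    eqJoinC f g a b = M.fromList
      [ ((ka,kb), (va,vb)) | (ka,va) <- M.toList a
                           , (kb,vb) <- M.toList b
                           , f (ka,va) == g (kb,vb) ]

    groupReduceC :: Ord k2
                 => ((k1,v1) -> k2) -> ((k1,v1) -> v2) -> ((v2,v2) -> v2)
                 -> M.Map k1 v1 -> M.Map k2 v2
    groupReduceC kf vf rf m =
      M.fromListWith (curry rf) [ (kf kv, vf kv) | kv <- M.toList m ]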


Distribution Types

Distributed data layout (DDL) types



Deriving distribution plans

(Hidden from the user.)

  • Different distributed implementations of each combinator.

  • Enumerates different choices of combinator implementations.

  • Each combinator implementation has a DDL type.

  • Use type inference to check if a set of choices is sound, insert redistributions to make sound, and infer data distributions.

  • Have backend template for each combinator implementation.

Implementation choices for this program: the map is folded into the groupReduce; eqJoin can be implemented by eqJoin1 / eqJoin2 / eqJoin3; groupReduce by groupReduce1 / groupReduce2.

let R = eqJoin(\((ai,aj),_) -> aj,
               \((bi,bj),_) -> bi,
               A, B) in
groupReduce(\(((ai,aj),(bi,bj)),_) -> (ai,bj),
            \(_,(av,bv)) -> mul(av,bv), add, R)



A Distributed Map type

  • The DMap type extends the basic Map k v type to symbolically describe how the Map should be distributed on the cluster.

  • Key and value types t1 and t2.

  • Partition function f: a function taking key-value pairs to node coordinates, i.e. f : (t1,t2) → N.

  • Partition dimension identifier d1: specifies which nodes the coordinates map onto.

  • Mirror dimension identifier d2: specifies which nodes to mirror the partitions over.

  • Also includes a local layout mode and key, omitted for brevity. The same scheme also works for Arrays (DArr) and Lists (DList). (A symbolic encoding is sketched below.)
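
One hypothetical way to represent this distribution information symbolically (not the compiler's actual data structures) is:

    -- Hypothetical encoding of the distribution information carried by a DMap type.
    -- Partition functions are kept as symbolic terms, not Haskell functions,
    -- so that the type checker can compare them.
    type DimId = String               -- a node-topology dimension, e.g. "dim0"

    data PartFun
      = Fst | Snd | Id
      | Comp PartFun PartFun          -- function composition f ∘ g
      | HashOnto [DimId] PartFun      -- hash(d1,...,dn) ∘ f
      deriving (Eq, Show)

    data Ty = TInt | TFloat | TPair Ty Ty | TVar String
      deriving (Eq, Show)

    data DistTy = DMapTy
      { keyTy    :: Ty                -- t1
      , valTy    :: Ty                -- t2
      , partFun  :: PartFun           -- f : (t1,t2) → N, as a symbolic term
      , partDims :: [DimId]           -- dimensions the partitions map onto (d1)
      , mirrDims :: [DimId]           -- dimensions the partitions are mirrored over (d2)
      }                               -- (local layout mode and key omitted, as above)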


Some example distributions

A :: DMap (Int,Int) Float (hash(dim0) ∘ fst ∘ fst) (dim0) []

[Diagram: 2D node topology with dimensions dim0 and dim1]

hash(dim0) ∘ fst ∘ fst = \((x,_),_) -> x % 3


Some example distributions

A :: DMap (Int,Int) Float (hash(dim0) ∘ fst ∘ fst) (dim0) [dim1]

[Diagram: 2D node topology with dimensions dim0 and dim1; partitions mirrored along dim1]

hash(dim0) ∘ fst ∘ fst = \((x,_),_) -> x % 3


Some example distributions

A :: DMap (Int,Int) Float (hash(dim0) ∘ snd) (dim0) [dim1]

[Diagram: 2D node topology with dimensions dim0 and dim1]

hash(dim0) ∘ snd = \(_,x) -> ceil(x) % 3


Some example distributions

A :: DMap … (hash(dim0,dim1) ∘ fst) (dim0,dim1) []

[Diagram: 2D node topology with dimensions dim0 and dim1]

hash(dim0,dim1) ∘ fst = \((x,y),_) -> (x % 3, y % 4)
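
Written out as plain functions, these partition functions send each key-value pair to a coordinate in the node grid; a small Haskell sketch, assuming a 3-by-4 topology as in the pictures:

    -- Sketch: the example partition functions as plain functions over
    -- ((Int,Int), Float) pairs, for a hypothetical 3x4 node grid.
    type Coord = (Int, Int)          -- (position along dim0, position along dim1)

    -- hash(dim0) ∘ fst ∘ fst: partition by the first key component (the row)
    byRow :: ((Int, Int), Float) -> Int
    byRow ((i, _), _) = i `mod` 3

    -- hash(dim0) ∘ snd: partition by the Float value
    byValue :: ((Int, Int), Float) -> Int
    byValue (_, v) = ceiling v `mod` 3

    -- hash(dim0,dim1) ∘ fst: block partition over both grid dimensions
    byBlock :: ((Int, Int), Float) -> Coord
    byBlock ((i, j), _) = (i `mod` 3, j `mod` 4)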



Distribution types for group reduces

  • Input collection is partitioned using any projection function f

    • must exchange values between nodes before reduction

groupReduce1 : ∀f,k,k’,v,v’,d,m.
  ((k,v) → k’, (k,v) → v’, (v’,v’) → v’, DMap k v (hash(d)∘f) d m)
  → DMap k’ v’ (hash(d)∘fst) d m

  • Result is always co-located by the result Map’s keys.

[Diagram: each of Node1–Node3 re-partitions its input by the output key, then performs a local group-reduce to produce its output partition]
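
The two phases each node performs can be sketched as pure functions (a sketch only; bucketByKey and localReduce are hypothetical names, and the actual exchange is an MPI alltoall-style step in the generated C++):

    import qualified Data.Map as M

    -- Sketch of the groupReduce1 strategy. Phase 1: bucket the locally held,
    -- projected pairs by the hash of their output key (one bucket per node).
    bucketByKey :: Int -> (k2 -> Int) -> ((k1,v1) -> k2) -> ((k1,v1) -> v2)
                -> [(k1,v1)] -> [(Int, [(k2,v2)])]
    bucketByKey nNodes hash kf vf local =
      [ (n, [ (kf kv, vf kv) | kv <- local, hash (kf kv) == n ])
      | n <- [0 .. nNodes - 1] ]

    -- Phase 2: after the exchange, reduce the received pairs locally by key.
    localReduce :: Ord k2 => ((v2,v2) -> v2) -> [(k2,v2)] -> M.Map k2 v2
    localReduce rf received = M.fromListWith (curry rf) received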



Distribution types for group reduces

  • Input is partitioned using the groupReduce’s key projection function f

    • result keys are already co-located

groupReduce2 : ∀k,k’,v,v’,d,m. Π(f,_,_,_) :
  ((k,v) → k’, (k,v) → v’, (v’,v’) → v’, DMap k v (hash(d)∘f) d m)
  → DMap k’ v’ (hash(d)∘fst) d m

  • Result is always co-located by the result Map’s keys.

[Diagram: each of Node1–Node3 performs a local group-reduce on its input partition to produce its output partition]

No inter-node communication is necessary (but this constrains the input distribution).



Dependent type schemes

Two different binders/schemes:

  • Π binds a concrete term in the AST:
    • lifts the argument's AST into the types when instantiated
    • must match the concrete term when unifying (rigid)
    • used to specify that a collection is distributed using a specific projection function
    • (non-standard dependent-type syntax)
  • ∀ binds a type variable:
    • creates a fresh variable when instantiated for a term
    • can be replaced by any type when unifying (flexible)
    • used to specify that a collection is distributed by any partition function / arbitrarily
    • similar to standard type schemes
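
A rough encoding of such schemes (illustrative only; the constructor and type names are hypothetical):

    -- Hypothetical encoding of distribution-type schemes with both binders.
    type TyVar   = String
    type TermVar = String
    data DistTy  = DistTyStub        -- stands for a distribution type such as DMap

    data Scheme
      = Forall TyVar Scheme          -- ∀a. σ : instantiated with a fresh type
                                     --         variable; flexible when unifying
      | Pi TermVar Scheme            -- Πx. σ : the concrete argument term is lifted
                                     --         into the type; rigid when unifying
      | Mono DistTy                  -- the underlying distribution type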



Extending Damas-Milner inference

  • We use a constraint-based implementation of Algorithm W and extend it to:

    • Lift parameter functions into types at function apps (see DApp).

    • Compare embedded functions (extends Unify).

      • Undecidable in general.

      • Currently use an under-approximation which simplifies functions and tests for syntactic equality.

      • Future work solves equations involving projection functions.
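
A minimal sketch of that under-approximation, assuming projection functions are represented symbolically (the datatype and names below are hypothetical):

    -- Sketch: under-approximate equality of embedded projection functions.
    -- Two functions are considered equal only if they simplify to the same
    -- syntactic normal form; anything else is treated as "not equal".
    data Fun
      = FstF | SndF | IdF
      | CompF Fun Fun                -- composition f ∘ g
      deriving (Eq, Show)

    -- Normal form: flatten composition chains and drop identities.
    norm :: Fun -> [Fun]
    norm (CompF f g) = norm f ++ norm g
    norm IdF         = []
    norm f           = [f]

    funEq :: Fun -> Fun -> Bool
    funEq f g = norm f == norm g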



Distribution types for joins

  • Inputs are partitioned using the join’s key emitter functions

    • no inter-node communication, equal keys already co-located

eqJoin1 : ∀k’,k1,k2,v1,v2,d,m. Π(f1,f2,_,_) :
  ((k1,v1) → k’, (k2,v2) → k’,
   DMap k1 v1 (hash(d)∘f1) d m,
   DMap k2 v2 (hash(d)∘f2) d m)
  → DMap (k1,k2) (v1,v2) (hash(d)∘(f1∘lft ∩ f2∘rht)) d m

Partitioned by f1 and f2 (i.e. aligned)

[Diagram: each of Node1–Node3 performs a local equi-join on its aligned input partitions to produce its output partition]

  • No inter-node communication
  • Constrains inputs and outputs
  • No mirroring of inputs
  • (Like sort-merge join)
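
Because equal keys are already co-located, each node only needs a purely local join over its own partitions; a sketch of that local step (using an index rather than a sort-merge, purely for brevity):

    import qualified Data.Map as M

    -- Sketch of the purely local equi-join performed on each node in eqJoin1,
    -- over its locally held partitions of A and B (no communication).
    localEqJoin :: (Ord k1, Ord k2, Ord x)
                => ((k1,v1) -> x) -> ((k2,v2) -> x)
                -> M.Map k1 v1 -> M.Map k2 v2 -> M.Map (k1,k2) (v1,v2)
    localEqJoin f g localA localB =
      let -- index the local B partition by its join key
          index = M.fromListWith (++) [ (g kv, [kv]) | kv <- M.toList localB ]
      in  M.fromList
            [ ((ka,kb), (va,vb))
            | (ka,va) <- M.toList localA
            , (kb,vb) <- M.findWithDefault [] (f (ka,va)) index ]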


Example

Deriving a distributed-memory matrix multiply



Deriving a distributed matrix-multiply

  • The program again:

Implementation choices: the map is folded into the groupReduce; eqJoin1 / eqJoin2 / eqJoin3; groupReduce1 / groupReduce2.

let R = eqJoin(\((ai,aj),_) -> aj,
               \((bi,bj),_) -> bi,
               A, B) in
groupReduce(\(((ai,aj),(bi,bj)),_) -> (ai,bj),
            \(_,(av,bv)) -> mul(av,bv), add, R)



Distributed matrix multiplication - #1

  • groupReduce2 – input partitioned by the key projection f, output partitioned by fst (the output keys)
  • eqJoin3 – left input partitioned by lftFun(f), right by rhtFun(f), output by any f

A: Partitioned by row along d1, mirrored along d2
B: Partitioned by column along d2, mirrored along d1

A : DMap (Int,Int) Float (hash(d1) ∘ \((ai,aj),_) -> ai) d1 [d2,m]
B : DMap (Int,Int) Float (hash(d2) ∘ \((bi,bj),_) -> bj) d2 [d1,m]
R : DMap ((Int,Int),(Int,Int)) (Float,Float)
      (hash(d1,d2) ∘ \(((ai,aj),(bi,bj)),_) -> (ai,bj)) (d1,d2) m
C : DMap (Int,Int) Float (hash(d1,d2) ∘ fst) (d1,d2) m

C: Partitioned by (row, column) along (d1, d2)

Must partition and mirror A and B at the beginning of the computation.
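
A small worked sketch of where plan #1 places the data, assuming the 3-by-3 node grid used on the next slide (the function names are hypothetical):

    -- Sketch of plan #1 placement on a 3x3 grid (d1 = d2 = 3).
    -- Node (p,q) holds row block p of A, column block q of B, and computes
    -- output block (p,q) of C without further communication.
    type Node = (Int, Int)     -- (coordinate along d1, coordinate along d2)

    nodesHoldingA, nodesHoldingB, nodesHoldingC :: (Int, Int) -> [Node]
    nodesHoldingA (i, _) = [ (i `mod` 3, q) | q <- [0 .. 2] ]   -- mirrored along d2
    nodesHoldingB (_, j) = [ (p, j `mod` 3) | p <- [0 .. 2] ]   -- mirrored along d1
    nodesHoldingC (i, j) = [ (i `mod` 3, j `mod` 3) ]           -- partitioned by both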



Distributed matrix multiplication - #1

  • groupReduce2 – input partitioned by the key projection f, output partitioned by fst (the output keys)
  • eqJoin3 – left input partitioned by lftFun(f), right by rhtFun(f), output by any f

A: Partitioned by row along d1, mirrored along d2

B: Partitioned by column along d2, mirrored along d1

[Diagram: eqJoin3 followed by groupReduce2 on a 3-by-3 grid of 9 nodes, with dimensions d1 and d2]

C: Partitioned by (row, column) along (d1, d2)

A common solution.

Must partition and mirror A and B at beginning of computation.



Distributed matrix multiplication - #2

  • eqJoin1 – left input partitioned by f1, right by f2, aligned along d
  • groupReduce1 – input partitioned by any f, output partitioned by fst (the output keys)

A: Partitioned by column along d
B: Partitioned by row along d (aligned with A)

A : DMap (Int,Int) Float (hash(d) ∘ \((ai,aj),_) -> aj) d m
B : DMap (Int,Int) Float (hash(d) ∘ \((bi,bj),_) -> bi) d m
R : DMap ((Int,Int),(Int,Int)) (Float,Float)
      (hash(d) ∘ ((\((ai,aj),_) -> aj)∘lft ∩ (\((bi,bj),_) -> bi)∘rht)) d m
C : DMap (Int,Int) Float (hash(d) ∘ fst) d m

C: Partitioned by (row, column) along d

Must exchange R during groupReduce1.



Distributed matrix multiplication - #2

A is partitioned by column and B by row, so the join is in place, but the group-reduce must exchange values to group by key.

[Diagram: 3 nodes along d; each node joins its local partitions of A and B (eqJoin1), then the partial products are regrouped by B's column and summed (groupReduce1) to produce C]

Efficient for sparse matrices, where A and B are large, but R is small.


Implementation & Results

Generated implementations



Prototype Implementation

  • Uses a genetic search to explore candidate implementations.

  • For each candidate we automatically insert redistributions to make it type check.

  • We evaluate each candidate by generating code for it, compiling it, and running it on some test data.

  • Generates C++ using MPI.

  • (~20k lines of Haskell)
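
Schematically, the search loop can be pictured as follows (a sketch under assumed names and types such as Candidate and searchPlans; the real search's genetic operators are not detailed in the talk):

    import Data.List (minimumBy)
    import Data.Ord (comparing)

    -- Schematic sketch of the generate-and-test search (not the actual compiler).
    -- A candidate assigns one distributed implementation to each combinator call site.
    type CallSite   = Int
    type ImplChoice = String                      -- e.g. "eqJoin1", "groupReduce2", ...
    type Candidate  = [(CallSite, ImplChoice)]
    type Seconds    = Double

    searchPlans
      :: Int                                      -- number of generations
      -> [Candidate]                              -- initial population
      -> (Candidate -> Maybe Candidate)           -- type check; insert redistributions
      -> (Candidate -> IO Seconds)                -- generate C++/MPI, compile, run, time
      -> ([(Candidate, Seconds)] -> [Candidate])  -- breed the next generation
      -> IO Candidate
    searchPlans gens pop0 check time breed = go gens pop0
      where
        evaluate cands =
          sequence [ fmap (\t -> (c, t)) (time c)
                   | Just c <- map check cands ]  -- discard ill-typed candidates
        go 0 pop = do scored <- evaluate pop
                      return (fst (minimumBy (comparing snd) scored))
        go n pop = do scored <- evaluate pop
                      go (n - 1) (breed scored)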



Example programs

Performance comparable with hand-coded versions:

  • PLINQ comparisons run on a quad-core (Intel Xeon W3520, 2.67GHz) x64 desktop with 12GB of memory.

  • C++/MPI comparisons run on Iridis 3 & 4: a 3rd-generation cluster with ~1000 Westmere compute nodes, each with two 6-core CPUs and 22GB of RAM, connected by an InfiniBand network. Speedups are relative to sequential execution, averaged over runs on 1, 2, 3, 4, 8, 9, 16, and 32 nodes.


Conclusions

So what?



The Pros and Cons…

Benefits:

  • Multiple distributed collections: Maps, Arrays, Lists...
  • Can support in-memory and disk-backed collections (e.g. for Big Data).
  • Generates distributed algorithms automatically.
  • Flexible communication.
  • Concise input language.
  • Abstracts away from complex details.

Limitations:

  • Currently, data-parallel algorithms must be expressed via combinators.
  • Data distributions are limited to those that can be expressed in the types.



Future work

  • Improve prototype implementation.

  • More distributed collection types and combinators.

  • Better function equation solving during unification.

  • Support more distributed memory architectures (GPUs).

  • Retrofit into an existing functional language.

  • Similar type inference for imperative languages?

  • Suggestions?



Questions?

