
Synthesizing MPI Implementations from Functional Data-Parallel Programs

Tristan Aubrey-Jones

Bernd Fischer


People need to program distributed memory architectures:

GPUs

  • Graphics/GPGPU: CUDA/OpenCL
  • Memory levels: thread-local, block-shared, device-global.

Server farms/compute clusters

  • Big Data/HPC; MapReduce/HPF; Ethernet/InfiniBand
  • Our focus: programming shared-nothing clusters

Future many-core architectures?


Our Aim

To automatically synthesize distributed memory implementations (C++/MPI).

  • We want this not just for arrays, but also for other collection types and disk-backed collections.

[Diagram: the same computation implemented single-threaded, on parallel shared memory, and on distributed memory; many distributions are possible.]


Our Approach

  • We define Flocc, a data-parallel DSL, where parallelism is expressed via combinator functions.

  • We develop a code generator that searches for well-performing implementations and generates C++/MPI code.

  • We use extended types to carry data-distribution information, and type inference to infer data distributions.


Flocc: a data-parallel DSL

(Functional language on compute clusters)


Flocc syntax

Polyadic lambda calculus, standard call-by-value semantics.


Matrix multiplication in Flocc

Data-parallelism via combinator functions (abstracts away from individual element access)

let R1 = eqJoin(\((ai,aj),_) -> aj,
                \((bi,bj),_) -> bi,
                A, B) in …

SELECT * FROM A JOIN B ON A.j = B.i;

eqJoin :: ((k₁,v₁) → x, (k₂,v₂) → x, Map k₁ v₁, Map k₂ v₂) → Map (k₁,k₂) (v₁,v₂)


Matrix multiplication in Flocc (continued)

let R1 = eqJoin(\((ai,aj),_) -> aj,
                \((bi,bj),_) -> bi,
                A, B) in
let R2 = map (id, \(_,(av,bv)) -> mul(av,bv),
              R1) in …

SELECT A.i as i, B.j as j, A.v * B.v as v
FROM A JOIN B ON A.j = B.i;

eqJoin :: ((k₁,v₁) → x, (k₂,v₂) → x, Map k₁ v₁, Map k₂ v₂) → Map (k₁,k₂) (v₁,v₂)

map :: (k₁ → k₂, (k₁,v₁) → v₂, Map k₁ v₁) → Map k₂ v₂


Matrix multiplication in Flocc (continued)

let R1 = eqJoin(\((ai,aj),_) -> aj,
                \((bi,bj),_) -> bi,
                A, B) in
let R2 = map (id, \(_,(av,bv)) -> mul(av,bv),
              R1) in
groupReduce (\(((ai,aj),(bi,bj)),_) -> (ai,bj),
             snd, add, R2)

SELECT A.i as i, B.j as j, sum(A.v * B.v) as v
FROM A JOIN B ON A.j = B.i GROUP BY A.i, B.j;

eqJoin :: ((k₁,v₁) → x, (k₂,v₂) → x, Map k₁ v₁, Map k₂ v₂) → Map (k₁,k₂) (v₁,v₂)

map :: (k₁ → k₂, (k₁,v₁) → v₂, Map k₁ v₁) → Map k₂ v₂

groupReduce :: ((k₁,v₁) → k₂, (k₁,v₁) → v₂, (v₂,v₂) → v₂, Map k₁ v₁) → Map k₂ v₂
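
The following is a minimal sequential (single-node) reference semantics for these three combinators, sketched in Haskell over Data.Map. The uncurried tuple-argument style follows the signatures above; the definitions themselves are illustrative, not Flocc's actual implementation (map is renamed map' to avoid clashing with Prelude.map).

import qualified Data.Map as M

eqJoin :: (Ord k1, Ord k2, Eq x)
       => ((k1,v1) -> x, (k2,v2) -> x, M.Map k1 v1, M.Map k2 v2)
       -> M.Map (k1,k2) (v1,v2)
eqJoin (f1, f2, a, b) = M.fromList
  [ ((ka,kb), (va,vb)) | (ka,va) <- M.toList a
                       , (kb,vb) <- M.toList b
                       , f1 (ka,va) == f2 (kb,vb) ]

map' :: Ord k2 => (k1 -> k2, (k1,v1) -> v2, M.Map k1 v1) -> M.Map k2 v2
map' (fk, fv, m) = M.fromList [ (fk k, fv (k,v)) | (k,v) <- M.toList m ]

groupReduce :: Ord k2
            => ((k1,v1) -> k2, (k1,v1) -> v2, (v2,v2) -> v2, M.Map k1 v1)
            -> M.Map k2 v2
groupReduce (fk, fv, comb, m) =
  M.fromListWith (curry comb) [ (fk (k,v), fv (k,v)) | (k,v) <- M.toList m ]

-- Matrix multiplication, following the Flocc program above:
matMul :: M.Map (Int,Int) Float -> M.Map (Int,Int) Float -> M.Map (Int,Int) Float
matMul a b =
  let r1 = eqJoin (\((_,aj),_) -> aj, \((bi,_),_) -> bi, a, b)
      r2 = map' (id, \(_,(av,bv)) -> av * bv, r1)
  in  groupReduce (\(((ai,_),(_,bj)),_) -> (ai,bj), snd, uncurry (+), r2)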


Distribution Types

Distributed data layout (DDL) types


Deriving distribution plans

Hidden from the user:

  • Different distributed implementations of each combinator.

  • Enumerates different choices of combinator implementations.

  • Each combinator implementation has a DDL type.

  • Use type inference to check whether a set of choices is sound, insert redistributions to make it sound, and infer data distributions.

  • Have backend template for each combinator implementation.

let R = eqJoin(\((ai,aj),_) -> aj,
               \((bi,bj),_) -> bi,
               A, B) in
groupReduce(\(((ai,aj),(bi,bj)),_) -> (ai,bj),
            \(_,(av,bv)) -> mul (av,bv), add, R)

Implementation choices: eqJoin1 / eqJoin2 / eqJoin3 for the join; groupReduce1 / groupReduce2 for the reduction (the map is folded into the groupReduce).


A Distributed Map type

  • The DMap type extends the basic Map k v type to symbolically describe how the Map should be distributed over the cluster.

  • Key and value types t1 and t2.

  • Partition function f: a function taking key-value pairs to node coordinates, i.e. f : (t1,t2) → N.

  • Partition dimension identifier d1: specifies which nodes the coordinates map onto.

  • Mirror dimension identifier d2: specifies which nodes to mirror the partitions over.

  • Also includes a local layout mode and key, omitted for brevity. The same scheme also works for Arrays (DArr) and Lists (DList); a sketch of the distribution metadata follows.
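
As a hedged sketch of how this distribution metadata might be represented symbolically (the constructor and field names here are illustrative, not the paper's implementation):

-- Distribution metadata attached to a DMap, sketched in Haskell.
data Dim = Dim0 | Dim1 deriving (Eq, Show)

data DistInfo k v = DistInfo
  { partFun    :: (k, v) -> Int   -- f : (t1,t2) -> N, key-value pair to node coordinate
  , partDims   :: [Dim]           -- d1: node dimension(s) the partitions map onto
  , mirrorDims :: [Dim]           -- d2: node dimension(s) the partitions are mirrored over
  }

-- For example, the first distribution on the following slides,
--   A :: DMap (Int,Int) Float (hash(dim0) fst∘fst) (dim0) [],
-- partitions A by the hash of its row index over the 3 nodes along dim0:
aDist :: DistInfo (Int, Int) Float
aDist = DistInfo { partFun    = \((x, _), _) -> x `mod` 3
                 , partDims   = [Dim0]
                 , mirrorDims = [] }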


Some example distributions

A :: DMap (Int,Int) Float (hash(dim0) fst∘fst) (dim0) []

[Diagram: 2D node topology, dimensions dim0 and dim1.]

hash(dim0) fst∘fst = \((x,_),_) -> x % 3


Some example distributions (continued)

A :: DMap (Int,Int) Float (hash(dim0) fst∘fst) (dim0) [dim1]

[Diagram: 2D node topology, dimensions dim0 and dim1; the partitions are additionally mirrored along dim1.]

hash(dim0) fst∘fst = \((x,_),_) -> x % 3


Some example distributions (continued)

A :: DMap (Int,Int) Float (hash(dim0) snd) (dim0) [dim1]

[Diagram: 2D node topology, dimensions dim0 and dim1.]

hash(dim0) snd = \(_,x) -> ceil(x) % 3


Some example distributions (continued)

A :: DMap … (hash(dim0,dim1) fst) (dim0,dim1) []

[Diagram: 2D node topology, dimensions dim0 and dim1; A is partitioned over both dimensions.]

hash(dim0,dim1) fst = \((x,y),_) -> (x % 3, y % 4)
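
A small illustrative helper (assumed, not from the paper) showing how a 2D partition function like the one above can be composed with a row-major layout of the 3-by-4 node grid to obtain a single node rank:

owner :: ((Int, Int), Float) -> Int
owner ((x, y), _) =
  let (p, q) = (x `mod` 3, y `mod` 4)   -- hash(dim0,dim1) fst, as on this slide
  in  p * 4 + q                         -- row-major rank on the dim0 × dim1 grid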


Distribution types for group-reduces

  • Input collection is partitioned using any projection function f

    • must exchange values between nodes before reduction

groupReduce1 : ∀ f, k, k’, v, v’, d, m.
  ((k,v) → k’, (k,v) → v’, (v’,v’) → v’,
   DMap k v (hash(d) f) d m)
  → DMap k’ v’ (hash(d) fst) d m

  • Result is always co-located by the result Map’s keys.

[Diagram: each of the three nodes re-partitions its input by the output key, exchanges entries with the other nodes, then performs a local group-reduce to produce its output partition; a sequential simulation of this strategy is sketched below.]
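
Below is a sequential simulation of the groupReduce1 strategy (a sketch, not the generated C++/MPI code): each node's partition is modelled as a local Map, entries are exchanged so that equal output keys meet on one node, and each node then reduces locally. The helper names (Node, hash, numbered nodes) are illustrative.

import qualified Data.Map as M

type Node k v = M.Map k v            -- one node's local partition

groupReduce1 :: Ord k2
             => Int                  -- number of nodes along dimension d
             -> ((k1,v1) -> k2)      -- output-key projection
             -> ((k1,v1) -> v2)      -- value projection
             -> ((v2,v2) -> v2)      -- reduction function
             -> (k2 -> Int)          -- hash of output key to node coordinate
             -> [Node k1 v1] -> [Node k2 v2]
groupReduce1 n fk fv comb hash locals =
  [ M.fromListWith (curry comb)
      [ (k2, fv (k1,v1))                             -- local reduce on the owner node
      | local <- locals, (k1,v1) <- M.toList local   -- the "exchange": in MPI this is an
      , let k2 = fk (k1,v1), hash k2 == me ]         -- all-to-all send to hash k2's owner
  | me <- [0 .. n-1] ]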


Distribution types for group-reduces (continued)

  • Input is partitioned using the groupReduce’s key projection function f

    • result keys are already co-located

groupReduce2 : ∀ k, k’, v, v’, d, m. Π(f,_,_,_) :
  ((k,v) → k’, (k,v) → v’, (v’,v’) → v’,
   DMap k v (hash(d) f) d m)
  → DMap k’ v’ (hash(d) fst) d m

  • Result is always co-located by the result Map’s keys.

[Diagram: each of the three nodes group-reduces its local partition directly. No inter-node communication is necessary (but the input distribution is constrained); see the sketch below.]
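
By contrast, a sketch of the groupReduce2 strategy: because the input is already partitioned by the hash of the groupReduce's own key projection, every group lives entirely on one node, so each node reduces its local Map independently (Node as in the previous sketch; names illustrative).

import qualified Data.Map as M

type Node k v = M.Map k v

groupReduce2 :: Ord k2
             => ((k1,v1) -> k2) -> ((k1,v1) -> v2) -> ((v2,v2) -> v2)
             -> [Node k1 v1] -> [Node k2 v2]
groupReduce2 fk fv comb = map localReduce
  where
    localReduce local = M.fromListWith (curry comb)
      [ (fk (k,v), fv (k,v)) | (k,v) <- M.toList local ]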


Dependent type schemes

Two different binders/schemes:

  • Π binds a concrete term in the AST
    • lifts the argument’s AST into the types when instantiated
    • must match the concrete term when unifying (rigid)
    • used to specify that a collection is distributed using a specific projection function
    • (non-standard dependent-type syntax)
  • ∀ binds a type variable
    • creates a fresh variable when instantiated for a term
    • can be replaced by any type when unifying (flexible)
    • used to specify that a collection is distributed by any partition function / arbitrarily
    • similar to standard type schemes
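
An illustrative sketch (not the compiler's actual data types) of a type-scheme representation with these two binders:

type TyVar = String

data Term = TmVar String | TmLam String Term | TmApp Term Term   -- combinator-argument AST

data Type = TV TyVar
          | TCon String [Type]        -- e.g. DMap k v (hash d f) d m
          | TTerm Term                -- an embedded (projection) function

data Scheme = Mono Type
            | Forall TyVar Scheme     -- flexible: fresh type variable when instantiated
            | Pi String Scheme        -- rigid: names an argument whose concrete term
                                      -- is lifted into the type at instantiation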


Extending Damas-Milner inference

  • We use a constraint-based implementation of Algorithm W and extend it to:

    • Lift parameter functions into types at function apps (see DApp).

    • Compare embedded functions (extends Unify).

      • Undecidable in general.

      • Currently we use an under-approximation which simplifies functions and tests for syntactic equality (sketched below).

      • Future work solves equations involving projection functions.
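
A toy sketch of that under-approximation: two embedded functions are judged equal only if they are syntactically identical after simplification. The Exp type and the simplifier here are illustrative; the real compiler works on Flocc's own AST.

data Exp = Var String | Lam String Exp | App Exp Exp
         | Pair Exp Exp | Fst Exp | Snd Exp
  deriving (Eq, Show)

-- Simplify projections of explicit pairs; a fuller simplifier would also
-- beta-reduce, etc.
simplify :: Exp -> Exp
simplify (Fst e)    = case simplify e of { Pair a _ -> a; e' -> Fst e' }
simplify (Snd e)    = case simplify e of { Pair _ b -> b; e' -> Snd e' }
simplify (App f x)  = App (simplify f) (simplify x)
simplify (Lam x b)  = Lam x (simplify b)
simplify (Pair a b) = Pair (simplify a) (simplify b)
simplify e          = e

-- Under-approximates semantic equality: it may miss functions that are in
-- fact equal, but never equates functions that differ.
funEq :: Exp -> Exp -> Bool
funEq f g = simplify f == simplify g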


Distribution types for joins

  • Inputs are partitioned using the join’s key emitter functions

    • no inter-node communication, equal keys already co-located

eqJoin1 : ∀ k’, k1, k2, v1, v2, d, m. Π(f1,f2,_,_) :
  ((k1,v1) → k’, (k2,v2) → k’,
   DMap k1 v1 (hash(d) f1) d m,
   DMap k2 v2 (hash(d) f2) d m)
  → DMap (k1,k2) (v1,v2) (hash(d) (f1 lft ∩ f2 rht)) d m

Inputs are partitioned by f1 and f2 (i.e. aligned).

[Diagram: each of the three nodes locally equi-joins its partitions of the two inputs; a sketch follows the list below.]

  • No inter-node communication
  • Constrains inputs and outputs
  • No mirroring of inputs
  • (Like a sort-merge join)
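
A sketch of the eqJoin1 strategy: both inputs are already co-partitioned by their join-key projections along the same node dimension, so each node joins its local partitions independently and nothing is exchanged (Node as in the earlier sketches; names illustrative).

import qualified Data.Map as M

type Node k v = M.Map k v

eqJoin1 :: (Ord k1, Ord k2, Eq x)
        => ((k1,v1) -> x) -> ((k2,v2) -> x)
        -> [Node k1 v1] -> [Node k2 v2]       -- aligned: the i-th partitions share node i
        -> [Node (k1,k2) (v1,v2)]
eqJoin1 f1 f2 = zipWith localJoin
  where
    localJoin a b = M.fromList
      [ ((ka,kb), (va,vb)) | (ka,va) <- M.toList a
                           , (kb,vb) <- M.toList b
                           , f1 (ka,va) == f2 (kb,vb) ]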


Example: deriving a distributed-memory matrix multiply


Deriving a distributed matrix multiply

  • The program again:

let R = eqJoin(\((ai,aj),_) -> aj,
               \((bi,bj),_) -> bi,
               A, B) in
groupReduce(\(((ai,aj),(bi,bj)),_) -> (ai,bj),
            \(_,(av,bv)) -> mul (av,bv), add, R)

Implementation choices: eqJoin1 / eqJoin2 / eqJoin3 for the join; groupReduce1 / groupReduce2 for the reduction (the map is folded into the groupReduce).


Distributed matrix multiplication - #1

  • groupReduce2 – input partitioned by the key projection f; output partitioned by fst (the output keys)

  • eqJoin3 – left input partitioned by lftFun(f), right by rhtFun(f); output partitioned by any f

A: Partitioned by row along d1, mirrored along d2

B: Partitioned by column along d2, mirrored along d1

A : DMap (Int,Int) Float (hash(d1) \((ai,aj),_) -> ai) d1 [d2,m]
B : DMap (Int,Int) Float (hash(d2) \((bi,bj),_) -> bj) d2 [d1,m]
R : DMap ((Int,Int),(Int,Int)) (Float,Float)
      (hash(d1,d2) \(((ai,aj),(bi,bj)),_) -> (ai,bj)) (d1,d2) m
C : DMap (Int,Int) Float (hash(d1,d2) fst) (d1,d2) m

C: Partitioned by (row, column) along (d1, d2)

Must partition and mirror A and B at beginning of computation.


Distributed matrix multiplication - #1 (continued)

  • groupReduce2 – input partitioned by the key projection f; output partitioned by fst (the output keys)

  • eqJoin3 – left input partitioned by lftFun(f), right by rhtFun(f); output partitioned by any f

A: Partitioned by row along d1, mirrored along d2

B: Partitioned by column along d2, mirrored along d1

[Diagram: eqJoin3 followed by groupReduce2 on a 3-by-3 grid of 9 nodes, with dimensions d1 and d2.]
C: Partitioned by (row, column) along (d1, d2)

A common solution.

Must partition and mirror A and B at beginning of computation.


Distributed matrix multiplication - #2

  • eqJoin1 – left input partitioned by f1, right by f2, aligned along d

  • groupReduce1 – input partitioned by any f; output partitioned by fst (the output keys)

A: Partitioned by col along d

B: Partitioned by row along d (aligned with A)

A : DMap (Int,Int) Float (hash(d) \((ai,aj),_) -> aj) d m
B : DMap (Int,Int) Float (hash(d) \((bi,bj),_) -> bi) d m

R : DMap ((Int,Int),(Int,Int)) (Float,Float)
      (hash(d) (\((ai,aj),_) -> aj lft ∩ \((bi,bj),_) -> bi rht)) d m

C : DMap (Int,Int) Float (hash(d) fst) d m

C: Partitioned by (row, column) along d

Must exchange R during groupReduce1.


Distributed matrix multiplication - #2 (continued)

A by col, B by row, so join is in-place but group must exchange values to group by key.

[Diagram: on 3 nodes along dimension d, each node locally equi-joins its aligned partitions of A and B (eqJoin1); groupReduce1 then re-partitions the partial products by the result key (regrouping by column from B) and sums the contributions from all three nodes.]

Efficient for sparse matrices, where A and B are large, but R is small.


Implementation & results

Generated implementations


Prototype Implementation

  • Uses a genetic search to explore candidate implementations.

  • For each candidate we automatically insert redistributions to make it type-check.

  • We evaluate each candidate by generating code, compiling it, and running it on some test data (a simplified sketch of this search loop follows the list).

  • Generates C++ using MPI.

  • (~20k lines of Haskell)
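
A highly simplified, hypothetical sketch of that search loop (the prototype uses a genetic search over per-combinator implementation choices; every name below is illustrative):

import Data.List (sortOn)

type Candidate = [Int]   -- one implementation choice per combinator call

search :: Int                          -- generations remaining
       -> (Candidate -> [Candidate])   -- mutate/crossover a candidate
       -> (Candidate -> IO Double)     -- code-generate, compile, run, and time it
       -> [Candidate] -> IO Candidate
search gens mutate runtime pop = do
  timed <- mapM (\c -> (,) c <$> runtime c) pop
  let ranked = map fst (sortOn snd timed)            -- fastest candidates first
  if gens == 0
    then pure (head ranked)
    else search (gens - 1) mutate runtime
                (take 4 ranked ++ concatMap mutate (take 4 ranked))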


Example programs

Performance comparable with hand-coded versions:

  • PLINQ comparisons were run on a quad-core (Intel Xeon W3520, 2.67 GHz) x64 desktop with 12GB of memory.

  • C++/MPI comparisons were run on Iridis 3 & 4: a third-generation cluster with ~1000 Westmere compute nodes, each with two 6-core CPUs and 22GB of RAM, connected by an InfiniBand network. Speedups are relative to sequential execution, averaged over runs on 1, 2, 3, 4, 8, 9, 16, and 32 nodes.


Conclusions: so what?


The Pros and Cons…

Benefits

  • Multiple distributed collections: Maps, Arrays, Lists...
  • Can support in-memory and disk-backed collections (e.g. for Big Data).
  • Generates distributed algorithms automatically.
  • Flexible communication.
  • Concise input language.
  • Abstracts away from complex details.

Limitations

  • Currently, data-parallel algorithms must be expressed via combinators.
  • Data distributions are limited to those that can be expressed in the types.


Future work

  • Improve prototype implementation.

  • More distributed collection types and combinators.

  • Better function equation solving during unification.

  • Support more distributed memory architectures (GPUs).

  • Retrofit into an existing functional language.

  • Similar type inference for imperative languages?

  • Suggestions?


Questions?

