Eden: Parallel Functional Programming with Haskell

Rita Loogen

Philipps-Universität Marburg, Germany

Joint work with Yolanda Ortega Mallén, Ricardo Peña, Alberto de la Encina, Mercedes Hidalgo Herrero, Cristóbal Pareja, Fernando Rubio, Lidia Sánchez-Gil, Clara Segura, Pablo Roldan Gomez (Universidad Complutense de Madrid)

Jost Berthold, Silvia Breitinger, Mischa Dieterle, Thomas Horstmeyer, Ulrike Klusik, Oleg Lobachev, Bernhard Pickenbrock, Steffen Priebe, Björn Struckmeier (Philipps-Universität Marburg)

CEFP Budapest 2011

Marburg / Lahn

Rita Loogen: Eden – CEFP 2011

Overview
  • Lectures I & II (Thursday)
    • Motivation
    • Basic Constructs
    • Case Study: Mergesort
    • EdenTV – The Eden Trace Viewer
    • Reducing communication costs
    • Parallel map implementations
    • Explicit Channel Management
    • The Remote Data Concept
    • Algorithmic Skeletons
      • Nested Workpools
      • Divide and Conquer
  • Lecture III: Lab Session (Friday Morning)
  • Lecture IV: Implementation
    • Layered Structure
    • Primitive Operations
    • The Eden Module
      • The Trans class
      • The PA monad
      • Process Handling
      • Remote Data
Materials

  • Lecture Notes
  • Slides
  • Example Programs (Case studies)
  • Exercises

are provided via the Eden web page:

www.informatik.uni-marburg.de/~eden

Navigate to CEFP!

Motivation

Our Goal

Parallel programming at a high level of abstraction

  • functional language (e.g. Haskell)
  • => concise programs
  • => high programming efficiency

Functional languages expose inherent parallelism, exploited by automatic parallelisation or by annotations.

Our Approach

Parallel programming at a high level of abstraction

  • functional language (e.g. Haskell)
  • => concise programs
  • => high programming efficiency
  • plus parallelism control
    • explicit processes
    • implicit communication
    • distributed memory

Eden = Haskell + Parallelism

www.informatik.uni-marburg.de/~eden

Basic Constructs

Eden = Haskell + Coordination: parallel programming at a high level of abstraction

  • process definition
  • process instantiation

process :: (Trans a, Trans b) => (a -> b) -> Process a b

gridProcess = process (\ (fromLeft,fromTop) ->
                  let ... in (toRight, toBottom))

( # ) :: (Trans a, Trans b) => Process a b -> a -> b

(outEast, outSouth) = gridProcess # (inWest, inNorth)

Process outputs are computed by concurrent threads; lists are sent as streams.

Derived operators and functions
  • Parallel function application
    • Often, process abstraction and instantiation are used in the following combination
  • Eager process creation
    • Eager creation of a series of processes

($#) :: (Trans a, Trans b) => (a -> b) -> a -> b
f $# x = process f # x    -- ($#) = (#) . process

spawn :: (Trans a, Trans b) => [Process a b] -> [a] -> [b]
spawn = zipWith (#)       -- ignoring demand control

spawnF :: (Trans a, Trans b) => [a -> b] -> [a] -> [b]
spawnF = spawn . (map process)
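Ignoring parallelism, these operators have a simple denotational reading: f $# x equals f x, and spawn is zipWith of application. The following self-contained sketch makes this concrete with sequential stand-ins (the Process type, process and # below are local toy definitions for illustration, not Eden's real remote-process machinery):

```haskell
-- Sequential stand-ins illustrating the *denotational* reading of the
-- Eden operators above (assumption: not the real Eden definitions,
-- which create remote processes; here everything runs locally).
newtype Process a b = Process (a -> b)

process :: (a -> b) -> Process a b
process = Process

( # ) :: Process a b -> a -> b
(Process f) # x = f x

($#) :: (a -> b) -> a -> b
f $# x = process f # x

spawn :: [Process a b] -> [a] -> [b]
spawn = zipWith (#)

main :: IO ()
main = do
  print ((+1) $# 41)                                 -- 42
  print (spawn (map process [(*2), (+10)]) [3, 4])   -- [6,14]
```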


Evaluating f $# e

The main (parent) process creates a child process on a remote PE:

  • the graph of the process abstraction process f is evaluated by the new child process on the remote PE;
  • the graph of the argument expression e is evaluated in the parent process by a new concurrent thread and its result is sent to the child process;
  • the child process sends the result of f $# e back to the parent.

Defining process nets - Example: Computing Hamming numbers

import Control.Parallel.Eden

hamming :: [Int]
hamming = 1 : sm ((uncurry sm) $# (map (*2) $# hamming,
                                   map (*3) $# hamming))
                 (map (*5) $# hamming)

sm :: [Int] -> [Int] -> [Int]
sm [] ys = ys
sm xs [] = xs
sm (x:xs) (y:ys)
  | x <  y    = x : sm xs (y:ys)
  | x == y    = x : sm xs ys
  | otherwise = y : sm (x:xs) ys

The process net consists of three scaling processes (*2, *3, *5) and two stream-merge (sm) processes feeding back into hamming.
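Replacing every $# by plain application turns the process net into an ordinary corecursive Haskell program with the same result stream, which gives a useful sequential check (this sequential version is a sketch, not part of the Eden code above):

```haskell
-- Sequential version of the Hamming net: every ($#) replaced by
-- plain application, yielding an ordinary lazy Haskell definition.
sm :: [Int] -> [Int] -> [Int]
sm [] ys = ys
sm xs [] = xs
sm (x:xs) (y:ys)
  | x <  y    = x : sm xs (y:ys)
  | x == y    = x : sm xs ys
  | otherwise = y : sm (x:xs) ys

hamming :: [Int]
hamming = 1 : sm (uncurry sm (map (*2) hamming, map (*3) hamming))
                 (map (*5) hamming)

main :: IO ()
main = print (take 10 hamming)  -- [1,2,3,4,5,6,8,9,10,12]
```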

Questions about Semantics
  • simple denotational semantics
    • process abstraction -> lambda abstraction
    • process instantiation -> application
    • gives the value/result of the program, but no information about execution, degree of parallelism, speedups/slowdowns
  • operational semantics
    • When will a process be created? When will a process instantiation be evaluated?
    • To which degree will process inputs/outputs be evaluated? Weak head normal form, normal form, or ...?
    • When will process inputs/outputs be communicated?
Answers

  • When will a process be created? When will a process instantiation be evaluated?
    • Lazy evaluation (Haskell): only if and when its result is demanded
    • Eden: only if and when its result is demanded
  • To which degree will process inputs/outputs be evaluated?
    • Lazy evaluation (Haskell): WHNF (weak head normal form)
    • Eden: normal form
  • When will process inputs/outputs be communicated?
    • Lazy evaluation (Haskell): only if demanded; request and answer messages necessary
    • Eden: eager (push) communication; values are communicated as soon as available
Lazy evaluation vs. Parallelism
  • Problem: Lazy evaluation ==> distributed sequentiality
  • Eden's approach:
    • eager process creation with spawn
    • eager communication:
      • normal form evaluation of all process outputs (by independent threads)
      • push communication, i.e. values are communicated as soon as available
    • explicit demand control by sequential strategies (module Control.Seq):
      • rnf, rwhnf, ... :: Strategy a
      • using :: a -> Strategy a -> a
      • pseq :: a -> b -> b (module Control.Parallel)
Case Study: Mergesort

Case Study: Mergesort

The unsorted list is split into two unsorted sublists, the sublists are sorted recursively, and the sorted sublists are merged into the sorted list.

Haskell Code:

mergeSort :: (Ord a, Show a) => [a] -> [a]
mergeSort []  = []
mergeSort [x] = [x]
mergeSort xs  = sortMerge (mergeSort xs1) (mergeSort xs2)
  where [xs1,xs2] = unshuffle 2 xs
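The code relies on the helpers unshuffle and sortMerge, which are not shown on the slide. The following self-contained sketch gives plausible definitions (an assumption: Eden's skeleton library may define them slightly differently, though unshuffle n is documented as round-robin distribution):

```haskell
-- Plausible definitions of the helpers used by mergeSort (assumption:
-- written to match the usual Eden skeleton-library semantics).

-- unshuffle n xs distributes xs round robin into n sublists
unshuffle :: Int -> [a] -> [[a]]
unshuffle n xs = [ takeEach (drop i xs) | i <- [0 .. n-1] ]
  where takeEach []     = []
        takeEach (y:zs) = y : takeEach (drop (n-1) zs)

-- sortMerge merges two sorted lists into one sorted list
sortMerge :: Ord a => [a] -> [a] -> [a]
sortMerge [] ys = ys
sortMerge xs [] = xs
sortMerge (x:xs) (y:ys)
  | x <= y    = x : sortMerge xs (y:ys)
  | otherwise = y : sortMerge (x:xs) ys

mergeSort :: Ord a => [a] -> [a]
mergeSort []  = []
mergeSort [x] = [x]
mergeSort xs  = sortMerge (mergeSort xs1) (mergeSort xs2)
  where [xs1,xs2] = unshuffle 2 xs

main :: IO ()
main = print (mergeSort [5,1,4,2,3])  -- [1,2,3,4,5]
```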

Example: Mergesort in parallel

Eden Code (simplest version):

parMergeSort :: (Ord a, Show a, Trans a) => [a] -> [a]
parMergeSort []  = []
parMergeSort [x] = [x]
parMergeSort xs  = sortMerge (parMergeSort $# xs1) (parMergeSort $# xs2)
  where [xs1,xs2] = unshuffle 2 xs

Example: Mergesort process net

Each recursive call parMergeSort $# xsi instantiates a new child process, so the main process becomes the root of a binary tree of child processes:

parMergeSort :: (Ord a, Show a, Trans a) => [a] -> [a]
parMergeSort []  = []
parMergeSort [x] = [x]
parMergeSort xs  = sortMerge (parMergeSort $# xs1) (parMergeSort $# xs2)
  where [xs1,xs2] = unshuffle 2 xs

EdenTV: The Eden Trace Viewer Tool

The Eden System

An Eden program runs on a parallel runtime system (management of processes and communication). EdenTV visualises the behaviour of this parallel system.

Compiling, Running, Analysing Eden Programs

Set up the environment for Eden on the lab computers by calling

    edenenv

Compile Eden programs with

    ghc -parmpi --make -O2 -eventlog myprogram.hs    or
    ghc -parpvm --make -O2 -eventlog myprogram.hs

If you use PVM, you first have to start it. Provide a pvmhosts or mpihosts file.

Run compiled programs with

    myprogram +RTS -ls -N -RTS

View the activity profile (trace file) with

    edentv myprogram_..._-N4_-RTS.parevents

eden threads and processes
deblock thread

new thread

runnable

suspend thread

block thread

kill thread

run thread

running

blocked

kill thread

finished

kill thread

Eden Threads andProcesses
  • An Eden processcomprisesseveralthreads(one per outputchannel).
  • Thread State Transition Diagram:
EdenTV
  • Diagrams: Machines (PEs), Processes, Threads
  • Message overlays for the Machines and Processes views
  • zooming
  • message streams
  • additional information
  • ...

EdenTV Demo

Case Study: Mergesort continued

Example: Activity profile of parallel mergesort
  • Program run, length of input list: 1,000
  • Observation: SLOWDOWN
    • Seq. runtime: 0.0037 s
    • Par. runtime: 0.9472 s
  • Reasons:
    • 1999 processes, mostly blocked
    • 31940 messages
    • delayed process creation
    • process placement
How can we improve our parallel mergesort? Here are some rules of thumb.
  • Adapt the total number of processes to the number of available processor elements (PEs); in Eden: noPe :: Int
  • Use the eager process creation functions spawn or spawnF.
  • By default, Eden places processes round robin on the available PEs. Try to distribute processes evenly over the PEs.
  • Avoid element-wise streaming if not necessary, e.g. by putting the list into some "box" or by chunking it into bigger pieces.

THINK PARALLEL!

Parallel Mergesort revisited

The unsorted input list is split with unshuffle (noPe-1) into unsorted sublists; each sublist is sorted by its own mergesort process; the sorted sublists are then merged into the sorted list (mergemanylists).

A Simple Parallelisation of map

map :: (a -> b) -> [a] -> [b]
map f xs = [ f x | x <- xs ]

parMap :: (Trans a, Trans b) => (a -> b) -> [a] -> [b]
parMap f = spawn (repeat (process f))

parMap creates one process per list element.

Alternative Parallelisation of mergesort - 1st try

Eden Code:

par_ms :: (Ord a, Show a, Trans a) => [a] -> [a]
par_ms xs
  = head $ sms $ parMap mergeSort (unshuffle (noPe-1) xs)

sms :: (NFData a, Ord a) => [[a]] -> [[a]]
sms []            = []
sms xss@[xs]      = xss
sms (xs1:xs2:xss) = sms (sortMerge xs1 xs2 : sms xss)

+ Total number of processes = noPe
+ eagerly created processes
+ round robin placement leads to 1 process per PE
- but maybe still too many messages

Resulting Activity Profile (Processes/Machine View)

Previous results for input size 1,000:
  • Seq. runtime: 0.0037 s
  • Par. runtime: 0.9472 s

New results:
  • Input size 1,000
  • seq. runtime: 0.0037 s
  • par. runtime: 0.0427 s
  • 8 PEs, 8 processes, 15 threads
  • 2042 messages
  • Much better, but still SLOWDOWN
  • Reason: indeed too many messages

Reducing Communication Costs

Reducing the Number of Messages by Chunking Streams

Split a list (stream) into chunks:

chunk :: Int -> [a] -> [[a]]
chunk size [] = []
chunk size xs = ys : chunk size zs
  where (ys,zs) = splitAt size xs

Combine with the parallel map implementation of mergesort:

par_ms_c :: (Ord a, Show a, Trans a) =>
            Int ->        -- chunk size
            [a] -> [a]
par_ms_c size xs
  = head $ sms $ map concat $
    parMap ((chunk size) . mergeSort . concat)
           (map (chunk size) (unshuffle (noPe-1) xs))

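The chunk function can be exercised on its own; note that concat . chunk size is the identity, which is exactly why chunking streams is safe (self-contained sketch; main is only for demonstration):

```haskell
-- Minimal sequential demo of the chunk function defined above.
chunk :: Int -> [a] -> [[a]]
chunk size [] = []
chunk size xs = ys : chunk size zs
  where (ys, zs) = splitAt size xs

main :: IO ()
main = do
  print (chunk 3 [1 .. 10 :: Int])          -- [[1,2,3],[4,5,6],[7,8,9],[10]]
  -- chunking followed by concat restores the original list
  print (concat (chunk 3 [1 .. 10 :: Int])) -- [1,2,3,4,5,6,7,8,9,10]
```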

Resulting Activity Profile (Processes/Machine View)

Previous results for input size 1,000:
  • Seq. runtime: 0.0037 s
  • Par. runtime I: 0.9472 s
  • Par. runtime II: 0.0427 s

New results:
  • Input size 1,000, chunk size 200
  • seq. runtime: 0.0037 s
  • par. runtime: 0.0133 s
  • 8 PEs, 8 processes, 15 threads
  • 56 messages
  • Much better, but still SLOWDOWN
  • parallel runtime w/o startup and finish of the parallel system: 0.0125 - 0.009 = 0.0035 s
  • => increase input size

Activity Profile for Input Size 1,000,000
  • Input size 1,000,000
  • Chunk size 1000
  • seq. runtime: 7.287 s
  • par. runtime: 2.795 s
  • 8 PEs, 8 processes, 15 threads
  • 2044 messages
  • => speedup of 2.6 on 8 PEs

Phases visible in the profile: unshuffle, map mergesort, merge.

Further improvement

Idea: remove the input list distribution by local sublist selection. Instead of sending each process its chunked sublist, each process receives only its index i and selects its own sublist from the input:

par_ms_c :: (Ord a, Show a, Trans a) => Int -> [a] -> [a]
par_ms_c size xs
  = head $ sms $ map concat $
    parMap ((chunk size) . mergeSort . concat)
           (map (chunk size) (unshuffle (noPe-1) xs))

par_ms_b :: (Ord a, Show a, Trans a) => Int -> [a] -> [a]
par_ms_b size xs
  = head $ sms $ map concat $
    parMap (\ i -> chunk size (mergeSort ((unshuffle (noPe-1) xs) !! i)))
           [0 .. noPe-2]

Corresponding Activity Profiles
  • Input size 1,000,000
  • Chunk size 1000
  • seq. runtime: 7.287 s
  • par. runtime: 2.795 s
  • new par. runtime: 2.074 s
  • 8 PEs, 8 processes, 15 threads
  • 1036 messages
  • => speedup of 3.5 on 8 PEs

Parallel map implementations

parallel map implementations parmap vs farm
T

T

T

T

T

T

T

T

...

...

...

...

P

P

P

P

P

P

...

...

PE

PE

PE

PE

Parallel mapimplementations: parMapvsfarm

parMap

farm

farm :: (Trans a, Trans b) =>

([a] -> [[a]]) -> ([[b]] -> [b]) ->

(a -> b)-> [a] -> [b]

farmdistributecombinef xs

= combine (parMap(map f)

(distributexs))

parMap :: (Trans a, Trans b) => (a -> b) -> [a] -> [b]

parMapf xs

=spawn(repeat(process f))xs

Process farms

farm :: (Trans a, Trans b) =>
        ([a] -> [[a]]) ->   -- distribute
        ([[b]] -> [b]) ->   -- combine
        (a -> b) -> [a] -> [b]
farm distribute combine f
  = combine . (parMap (map f)) . distribute

While parMap creates 1 process per task, a farm creates 1 process per sub-task list, i.e. typically 1 process per PE, with static task distribution. Choose e.g.

  • distribute = unshuffle noPe / combine = shuffle
  • distribute = splitIntoN noPe / combine = concat
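Denotationally, farm distribute combine f equals combine . map (map f) . distribute, so the skeleton computes exactly map f whenever combine . distribute is the identity, as with the unshuffle/shuffle pair. A sequential sketch (parMap replaced by plain map; the helper definitions are assumptions in the spirit of the skeleton library):

```haskell
import Data.List (transpose)

-- Sequential sketch of the farm skeleton: parMap replaced by plain map,
-- so only the distribute/combine plumbing is exercised (assumption: this
-- mirrors the Eden version's denotational behaviour).
farm :: ([a] -> [[a]]) -> ([[b]] -> [b]) -> (a -> b) -> [a] -> [b]
farm distribute combine f = combine . map (map f) . distribute

-- round-robin distribution into n sublists
unshuffle :: Int -> [a] -> [[a]]
unshuffle n xs = [ takeEach (drop i xs) | i <- [0 .. n-1] ]
  where takeEach []     = []
        takeEach (y:zs) = y : takeEach (drop (n-1) zs)

-- inverse of unshuffle
shuffle :: [[a]] -> [a]
shuffle = concat . transpose

main :: IO ()
main = print (farm (unshuffle 3) shuffle (*2) [1 .. 7 :: Int])  -- [2,4,6,8,10,12,14]
```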


Example: Functional Program for Mandelbrot Sets

Idea: parallel computation of lines. The image area is given by its upper-left (ul) and lower-right (lr) corners and the horizontal dimension dimx.

image :: Double -> Complex Double -> Complex Double -> Integer -> String
image threshold ul lr dimx
  = header ++ (concat $ map xy2col lines)
  where
    xy2col :: [Complex Double] -> String
    xy2col line = concatMap (rgb . (iter threshold (0.0 :+ 0.0) 0)) line
    (dimy, lines) = coord ul lr dimx

Example: Parallel Functional Program for Mandelbrot Sets

Idea: parallel computation of lines.

image :: Double -> Complex Double -> Complex Double -> Integer -> String
image threshold ul lr dimx
  = header ++ (concat $ map xy2col lines)
  where
    xy2col :: [Complex Double] -> String
    xy2col line = concatMap (rgb . (iter threshold (0.0 :+ 0.0) 0)) line
    (dimy, lines) = coord ul lr dimx

Replace map by

    farm (unshuffle noPe) shuffle
or  farmB (splitIntoN noPe) concat

Mandelbrot Traces

Problem size: 2000 x 2000
Platform: Beowulf cluster, Heriot-Watt University, Edinburgh
(32 Intel P4-SMP nodes @ 3 GHz, 512MB RAM, Fast Ethernet)

farm (unshuffle noPe) shuffle: round robin static task distribution
farm (splitIntoN noPe) concat: blockwise static task distribution

example ray tracing
Camera

2D Image

3D Scene

Example: Ray Tracing

rayTrace:: Size -> CamPos -> [Object] -> [Impact]

rayTracesizecameraPosscene = findImpactsallRaysscene

whereallRays = generateRayssizecameraPos

findImpacts:: [Ray] -> [Object] -> [Impact]

findImpactsraysobjs = map (firstImpactobjs) rays

Reducing Communication Costs by Chunking

Combine chunking with a parallel map implementation:

chunkMap :: Int -> (([a] -> [b]) -> ([[a]] -> [[b]]))
         -> (a -> b) -> [a] -> [b]
chunkMap size mapscheme f xs
  = concat (mapscheme (map f) (chunk size xs))

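With the trivial map scheme (plain map), chunkMap size map f must coincide with map f, which gives a quick sequential check of the plumbing (self-contained sketch):

```haskell
-- Sequential check of chunkMap: with map as the map scheme,
-- chunkMap size map f must equal map f.
chunk :: Int -> [a] -> [[a]]
chunk size [] = []
chunk size xs = ys : chunk size zs
  where (ys, zs) = splitAt size xs

chunkMap :: Int -> (([a] -> [b]) -> ([[a]] -> [[b]])) -> (a -> b) -> [a] -> [b]
chunkMap size mapscheme f xs = concat (mapscheme (map f) (chunk size xs))

main :: IO ()
main = print (chunkMap 4 map (+1) [1 .. 10 :: Int])  -- [2,3,4,5,6,7,8,9,10,11]
```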

Raytracer Example: Element-wise Streaming vs Chunking

With chunking (input size 250, chunk size 500):
  • Runtime: 0.235 s
  • 8 PEs, 9 processes, 17 threads
  • 48 conversations, 548 messages

Without chunking (input size 250):
  • Runtime: 6.311 s
  • 8 PEs, 9 processes, 17 threads
  • 48 conversations, 125048 messages

Communication vs Parameter Passing

Process inputs
  • can be communicated: f $# inp
  • can be passed as a parameter: (\ () -> f inp) $# ()    -- () is a dummy process input

In the first case, the graph of the input expression is evaluated in the parent process by a concurrent thread and then sent to the child process. In the second case, it is packed (serialised) together with the process abstraction and sent to the remote PE, where the child process is created to evaluate this expression.

farm vs offline farm
T

T

T

T

...

...

P

P

P

P

T...T

T...T

...

...

PE

PE

PE

PE

Farm vs Offline Farm

Farm

Offline Farm

offlineFarm:: (Trans a, Trans b) =>

([a] -> [[a]]) -> ([[b]] -> [b]) ->

(a -> b) -> [a] -> [b]

offlineFarmdistributecombinef xs

= combine$

spawn

(map (rfi (map f)) (distributexs) ) (repeat ())

rfi :: (a -> b) -> a -> Process () b

rfi h x = process (\ () -> h x)

farm :: (Trans a, Trans b) =>

([a] -> [[a]]) -> ([[b]] -> [b]) ->

(a -> b)-> [a] -> [b]

farmdistributecombinef xs

= combine (parMap(map f)

(distributexs))

Raytracer Example: Farm vs Offline Farm

Farm (input size 250, chunk size 500):
  • Runtime: 0.235 s
  • 8 PEs, 9 processes, 17 threads
  • 48 conversations, 548 messages

Offline farm (input size 250, chunk size 500):
  • Runtime: 0.119 s
  • 8 PEs, 9 processes, 17 threads
  • 40 conversations, 290 messages

Eden: What we have seen so far

  • Eden extends Haskell with parallelism
    • explicit process definitions and implicit communication
    • control of process granularity, distribution of work, and communication topology
  • implemented by extending the Glasgow Haskell Compiler (GHC)
  • tool EdenTV to analyse parallel program behaviour
  • rules of thumb for producing efficient parallel programs
    • number of processes ~ noPe
    • reducing communication: chunking; offline processes (parameter passing instead of communication)
  • parallel map implementations:

    Scheme       | task decomposition | task distribution
    -------------|--------------------|-------------------------------------
    parMap       | regular            | static: process per task
    farm         | regular            | static: process per processor
    offlineFarm  | regular            | static: task selection in processes
    workpool     | irregular          | dynamic
Overview Eden Lectures
  • Lectures I & II (Thursday)
    • Motivation
    • Basic Constructs
    • Case Study: Mergesort
    • EdenTV – The Eden Trace Viewer
    • Reducing communication costs
    • Parallel map implementations
    • Explicit Channel Management
    • The Remote Data Concept
    • Algorithmic Skeletons
      • Nested Workpools
      • Divide and Conquer
  • Lecture III: Lab Session (Friday Morning)
  • Lecture IV: Implementation (Friday Afternoon)
    • Layered Structure
    • Primitive Operations
    • The Eden Module
      • The Trans class
      • The PA monad
      • Process Handling
      • Remote Data
Many-to-one Communication: merge

Using the non-deterministic merge function: merge :: [[a]] -> [a]

Workpool or Master/Worker Scheme: a master process distributes tasks on demand to nw worker processes; the workers' requests are collected with merge.

masterWorker :: (Trans a, Trans b) => Int -> Int -> (a -> b) -> [a] -> [b]
masterWorker nw prefetch f tasks = orderBy fromWs reqs
  where
    fromWs   = parMap (map f) toWs
    toWs     = distribute nw tasks reqs
    reqs     = initReqs ++ newReqs
    initReqs = concat (replicate prefetch [0 .. nw-1])
    newReqs  = merge [ [ i | r <- rs ] | (i, rs) <- zip [0 .. nw-1] fromWs ]

Example: Mandelbrot revisited!

offlineFarm with splitIntoN (noPe-1):
  • Input size 2000
  • Runtime: 13.09 s
  • 8 PEs, 8 processes, 15 threads
  • 35 conversations, 1536 messages

masterWorker with prefetch 50:
  • Input size 2000
  • Runtime: 13.91 s
  • 8 PEs, 8 processes, 22 threads
  • 42 conversations, 3044 messages

Parallel map implementations

  • static task distribution: parMap (one process per task), farm and offlineFarm (one process per PE)
  • dynamic task distribution: workpool (master with np workers, requests collected via merge)

From parMap over farm/offlineFarm to workpool, the granularity of the processes increases.

Explicit Channel Management

Explicit Channel Management in Eden

Example: Definition of a process ring

ring :: (Trans i, Trans o, Trans r) =>
        ((i,r) -> (o,r)) ->   -- ring process fct
        [i] -> [o]            -- input-output fct
ring f is = os
  where
    (os, ringOuts) = unzip [ process f # inp |
                             inp <- lazyzip is ringIns ]
    ringIns = rightRotate ringOuts
    rightRotate xs = last xs : init xs

Problem: only indirect ring connections via the parent process

Explicit Channels in Eden

  • Channel generation: new :: Trans a => (ChanName a -> a -> b) -> b
  • Channel usage: parfill :: Trans a => ChanName a -> a -> b -> b

plink :: (Trans i, Trans o, Trans r) =>
         ((i,r) -> (o,r)) ->
         Process (i, ChanName r) (o, ChanName r)
plink f = process fun_link
  where
    fun_link (fromP, nextChan)
      = new (\ prevChan prev ->
          let (toP, next) = f (fromP, prev)
          in  parfill nextChan next
                      (toP, prevChan))

Ring Definition with Explicit Channels

ring :: (Trans i, Trans o, Trans r) =>
        ((i,r) -> (o,r)) ->   -- ring process fct
        [i] -> [o]            -- input-output fct
ring f is = os
  where
    (os, ringOuts) = unzip [ plink f # inp |
                             inp <- lazyzip is ringIns ]
    ringIns = leftRotate ringOuts
    leftRotate xs = tail xs ++ [head xs]

Note that process f is replaced by plink f and the rotation direction flips (left rotate instead of right rotate), because now only channel names, not the ring data itself, are passed via the parent.

Trace Profile Ring: Implicit vs Explicit Channels

Ring with implicit channels: all communications go through the generator process (number 1).

Ring with explicit channels: ring processes communicate directly.

The Remote Data Concept

The "Remote Data" Concept
  • Functions:
    • Release local data with release :: a -> RD a
    • Fetch released data with fetch :: RD a -> a
  • Replace
    • (process g # (process f # inp))

with

    • process (g . fetch) # (process (release . f) # inp)

Now the result of f is transferred directly from the f-process to the g-process instead of being routed through the parent.

Ring Definition with Remote Data

ring :: (Trans i, Trans o, Trans r) =>
        ((i,r) -> (o,r)) ->   -- ring process fct
        [i] -> [o]            -- input-output fct
ring f is = os
  where
    (os, ringOuts) = unzip [ process f_RD # inp |
                             inp <- lazyzip is ringIns ]
    f_RD (i, ringIn) = (o, release ringOut)
      where (o, ringOut) = f (i, fetch ringIn)
    ringIns = rightRotate ringOuts    -- ringOuts has type [RD r]
    rightRotate xs = last xs : init xs

Implementation of Remote Data with Dynamic Channels

-- remote data
type RD a = ChanName (ChanName a)

-- convert local data into corresponding remote data
release :: Trans a => a -> RD a
release x = new (\ cc c -> parfill c x cc)

-- convert remote data into corresponding local data
fetch :: Trans a => RD a -> a
fetch cc = new (\ c x -> parfill cc c x)
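Whatever the operational behaviour, release and fetch must satisfy the law fetch . release == id denotationally. The stand-ins below model only this law; they are emphatically not the channel-based implementation above:

```haskell
-- Sequential stand-ins for release/fetch (assumption: these only capture
-- the denotational law `fetch . release == id`; the real Eden version
-- uses dynamic channels as shown above).
type RD a = a

release :: a -> RD a
release = id

fetch :: RD a -> a
fetch = id

main :: IO ()
main = print ((fetch . release) (42 :: Int))  -- 42
```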

Example: Computing Shortest Paths

Compute the shortest way from A to B for arbitrary nodes A and B!

Map -> Graph -> Adjacency matrix / Distance matrix

Nodes: 1 Main Station, 2 Elisabeth Church, 3 Town Hall, 4 Mensa, 5 Old University

Distance matrix (∞ = no direct connection):

      0  200  300    ∞    ∞
    200    0  150    ∞  400
    300  150    0   50  125
      ∞    ∞   50    0  100
      ∞  400  125  100    0

Warshall's algorithm in a process ring

Each ring process holds one row of the distance matrix (rowk) and iterates over the rows rowi received from the ring:

ring_iterate :: Int -> Int -> Int ->
                [Int] -> [[Int]] -> ([Int], [[Int]])
ring_iterate size k i rowk (rowi:xs)
  | i > size  = (rowk, [])                  -- end of iterations
  | i == k    = (rowR, rowk : restoutput)   -- send own row
  | otherwise = (rowR, rowi : restoutput)   -- update row
  where
    (rowR, restoutput) = ring_iterate size k (i+1) nextrowk xs
    nextrowk | i == k    = rowk             -- no update with own row
             | otherwise = updaterow rowk rowi (rowk !! (i-1))

Force evaluation of nextrowk by inserting rnf nextrowk `pseq` before the call of ring_iterate.
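For reference, a sequential list-based version of the algorithm (a sketch, not the Eden ring code) applied to the distance matrix of the shortest-paths example, with a large constant standing in for ∞:

```haskell
-- Sequential reference for Warshall's shortest-paths algorithm
-- (assumption: simple list-based sketch, not the Eden ring version),
-- on the distance matrix of the example; `inf` stands in for ∞.
inf :: Int
inf = 100000

graph :: [[Int]]
graph = [ [  0, 200, 300, inf, inf]
        , [200,   0, 150, inf, 400]
        , [300, 150,   0,  50, 125]
        , [inf, inf,  50,   0, 100]
        , [inf, 400, 125, 100,   0] ]

warshall :: [[Int]] -> [[Int]]
warshall m0 = foldl step m0 [0 .. length m0 - 1]
  where
    -- one iteration: allow paths through intermediate node k
    step m k = [ [ min (m !! i !! j) (m !! i !! k + m !! k !! j)
                 | j <- idx ] | i <- idx ]
      where idx = [0 .. length m - 1]

main :: IO ()
main = print (warshall graph !! 0 !! 4)
-- 425: Main Station -> Town Hall -> Old University
```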

Traces of parallel Warshall

Without additional demand control the trace shows a sequential start-up phase. With additional demand on nextrowk this phase disappears; the end of the fast version is marked in the trace.

(Advanced) Algorithmic Skeletons

Algorithmic Skeletons
  • patterns of parallel computations => in Eden: parallel higher-order functions
  • typical patterns:
    • parallel maps and master-worker systems: parMap, farm, offline_farm, mw (workpoolSorted)
    • map-reduce
    • topology skeletons: pipeline, ring, torus, grid, trees ...
    • divide and conquer
  • in the following:
    • nested master-worker systems
    • divide and conquer schemes

See Eden's Skeleton Library.
Nesting Workpools

A flat workpool has one master distributing tasks to np workers and collecting results via merge. In a nested workpool the master distributes tasks to sub-workpools (sub-wp), each of which is again a master with its own workers.

nesting workpools1
workpool

merge

worker

np-1

tasks

worker

1

worker

0

results

sub-wp

sub-wp

sub-wp

sub-wp

merge

merge

w{np-1}

w1

w0

merge

merge

w{np-1}

w1

w0

w{np-1}

w{np-1}

w1

w1

w0

w0

Nesting Workpools

wpNested :: (Trans a, Trans b) =>

[Int] -> [Int] -> -- branchingdegrees/prefetches

-- per level

([a] -> [b]) -> -- workerfunction

[a] -> [b] -- tasks, results

wpNestednspfswf = foldrfldwf (zipnspfs)

where

fld :: (Trans a, Trans b) =>

(Int,Int) -> ([a] -> [b]) -> ([a] -> [b])

fld (n,pf) wf = workpool' n pfwf

sub-wp

sub-wp

sub-wp

sub-wp

sub-wp

wpnested [4,5] [64,8] yields

Hierarchical Workpool

1 master, 4 submasters, 20 workers: faster result collection via the hierarchy -> better overall runtime

Mandelbrot Trace
Problem size: 2000 x 2000
Platform: Beowulf cluster, Heriot-Watt University, Edinburgh
(32 Intel P4-SMP nodes @ 3 GHz, 512MB RAM, Fast Ethernet)

Experimental Results
  • Mandelbrot set visualisation for 5000 x 5000 pixels, calculated line-wise (5000 tasks)
  • Platform: Beowulf cluster Heriot-Watt University (32 Intel P4-SMP nodes @ 3 GHz, 512MB RAM, Fast Ethernet)

    nesting       [31]   [2,15]   [4,7]   [6,5]   [2,2,4]
    logical PEs    32      33      33      37       35

1-Level-Nesting Trace

branching [4,7] => 33 logical PEs, prefetch 40: four submasters with 7 workers each (7W). Garbage collection pauses are visible in the trace.

3 level nesting trace
4W

4W

4W

4W

4W

4W

3-Level-Nesting Trace

branching

[3,2,4]

=> 34 log.PEs

prefetch 60

Divide-and-conquer

dc :: (a -> Bool) -> (a -> b) -> (a -> [a]) -> ([b] -> b) -> a -> b
dc trivial solve split combine task
  = if trivial task then solve task
    else combine (map rec_dc (split task))
  where rec_dc = dc trivial solve split combine

Regular binary scheme with default placing: the call tree is unfolded onto the PEs round robin, so several tree nodes can end up on the same PE.
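Run sequentially (with plain map instead of parallel unfolding), the dc scheme can be instantiated directly, e.g. with a mergesort (the instance below is an illustrative assumption, not from the slides):

```haskell
-- The dc scheme above, executed sequentially, instantiated with mergesort.
dc :: (a -> Bool) -> (a -> b) -> (a -> [a]) -> ([b] -> b) -> a -> b
dc trivial solve split combine = go
  where go task
          | trivial task = solve task
          | otherwise    = combine (map go (split task))

-- mergesort as a dc instance (illustrative assumption)
msort :: Ord a => [a] -> [a]
msort = dc (\xs -> length xs <= 1) id halve merge2
  where
    halve xs = let (l, r) = splitAt (length xs `div` 2) xs in [l, r]
    merge2 [xs, ys] = m xs ys
    merge2 xss      = concat xss
    m [] ys = ys
    m xs [] = xs
    m (x:xs) (y:ys) | x <= y    = x : m xs (y:ys)
                    | otherwise = y : m (x:xs) ys

main :: IO ()
main = print (msort [3,1,4,1,5,9,2,6 :: Int])  -- [1,1,2,3,4,5,6,9]
```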

explicit placement via ticket list
1

2

1

2

1

3

5

1

4

3

6

2

7

5

8

1

4

3

6

2

7

5

8

Explicit Placement via Ticket List

2,

3, 4, 5, 6, 7, 8

1

unshuffle

5, 7

4,

6, 8

3,

2

1

6

8

5

2

7

4

1

3

4

1

5

3

7

2

6

8

4

1

5

3

7

2

6

8

Regular DC-Skeleton with Ticket Placement

dcNTickets :: (Trans a, Trans b) =>
              Int -> [Int] -> ...   -- branch degree / tickets / ...
dcNTickets k [] trivial solve split combine
  = dc trivial solve split combine
dcNTickets k tickets trivial solve split combine x
  = if trivial x then solve x
    else childRes `pseq` rnf myRes `pseq`   -- demand control
         combine (myRes : childRes ++ localRess)
  where
    childRes   = spawnAt childTickets childProcs procIns
    childProcs = map (process . rec_dcN) theirTs
    rec_dcN ts = dcNTickets k ts trivial solve split combine
    -- ticket distribution
    (childTickets, restTickets) = splitAt (k-1) tickets
    (myTs : theirTs) = unshuffle k restTickets
    -- input splitting
    (myIn : theirIn) = split x
    (procIns, localIns) = splitAt (length childTickets) theirIn
    -- local computations
    myRes     = dcNTickets k myTs trivial solve split combine myIn
    localRess = map (dc trivial solve split combine) localIns
Regular DC-Skeleton with Ticket Placement (continued)

  • arbitrary, but fixed branching degree k
  • flexible: works with too few tickets and with duplicate tickets
  • parallel unfolding controlled by the ticket list
Case Study: Karatsuba
  • multiplication of large integers
  • fixed branching degree 3
  • complexity O(n^(log2 3)), combine complexity O(n)
  • Platform: LAN (Fast Ethernet), 7 dual-core linux workstations, 2 GB RAM
  • input size: 2 integers with 32768 digits each
Divide-and-Conquer Schemes
  • Distributed expansion
  • Flat expansion
Divide-and-Conquer Using Master-Worker

divConFlat :: (Trans a, Trans b, Show b, Show a, NFData b) =>
              ((a -> b) -> [a] -> [b])
           -> Int -> (a -> Bool) -> (a -> b) -> (a -> [a]) -> ([b] -> b) -> a -> b
divConFlat parallelMapSkel depth trivial solve split combine x
  = combineTopMaster (\_ -> combine) levels results
  where
    (tasks, levels) = generateTasks depth trivial split x
    results = parallelMapSkel dcSeq tasks
    dcSeq   = dc trivial solve split combine

Case Study: Parallel FFT
  • frequency distribution in a signal, decimation in time
  • radix-4 FFT, input size: 4^10 complex numbers
  • Platform: Beowulf Cluster Edinburgh

Problem: communicating huge data amounts
Intermediate Conclusions
  • Eden enables high-level parallel programming
  • Use predefined skeletons or design your own
  • Eden's skeleton library provides a large collection of sophisticated skeletons:
    • parallel maps: parMap, farm, offlineFarm ...
    • master-worker: flat, hierarchical, distributed ...
    • divide-and-conquer: ticket placement, via master-worker ...
    • topological skeletons: ring, torus, all-to-all, parallel transpose ...

eden lab session
Eden Lab Session
  • Download the exercise sheet from http://www.mathematik.uni-marburg.de/~eden/?content=cefp
  • Chooseoneofthethreeassignmentsanddownloadthecorrespondingsequentialprogram: sumEuler.hs (easy) - juliaSets.hs (medium) - gentleman.hs (advanced)
  • Download the sample mpihostsfileandmodifyittorandomlychosen lab computersnylxywithxychosenfrom 01 upto 64
  • Call edenenvtosetuptheenvironmentfor Eden
  • Compile Eden programswithghc –parmpi --make –O2 –eventlogmyprogram.hs
  • Runcompiledprogramswithmyprogram +RTS –ls -Nx-RTS with x=noPe
  • Viewactivityprofile (tracefile) withedentvmyprogram_..._-N…_-RTS.parevents


eden s implementation
Eden's Implementation


glasgow haskell compiler eden extensions
Glasgow Haskell Compiler & Eden Extensions

[Pipeline diagram: Haskell/Eden source passes through the GHC front end (reader with Lex/Yacc parser, renamer, type checker, desugarer) into core syntax, is simplified, translated to STG, and compiled via abstract C to C, which the C compiler processes. The front end is unchanged; Eden extends the back end so that the generated code runs on the parallel Eden runtime system, which sits on a message passing library (MPI or PVM).]
eden s parallel runtime system prts
Eden's parallel runtime system (PRTS)

Modification of GUM, the PRTS of GpH (Glasgow Parallel Haskell):

  • Recycled
    • Thread management: heap objects, thread scheduler
    • Memory management: local garbage collection
    • Communication: graph packing and unpacking routines
  • Newly developed
    • Process management: runtime tables, generation and termination
    • Channel management: channel representation, connection, etc.
  • Simplifications
    • no „virtual shared memory“ (global address space) necessary
    • no globalisation of unevaluated data
    • no global garbage collection of data
dream distributed eden abstract machine
DREAM: DistRibuted Eden Abstract Machine
  • abstract view of Eden's parallel runtime system
  • abstract view of a process:

[Figure: a process as a heap with inports and outports; each thread is represented by a TSO (thread state object) in the heap; BH denotes a black hole closure, on access threads are suspended until this closure is overwritten.]

garbage collection and termination
Garbage Collection and Termination
  • no global address space, only local heaps with inports/outports
  • no need for global garbage collection
    • local garbage collection suffices
    • outports act as additional roots
    • inports can be recognised as garbage

[Figure: local garbage collection in a process; an inport that is no longer referenced is detected as garbage and closed with a "close" message to the corresponding outport of the sender.]
implementation of eden
Implementation of Eden

Eden
  • Parallel programming on a high level of abstraction
    • explicit process definitions
    • implicit communication
  • Automatic process and channel management
    • distributed graph reduction
    • management of processes and their interconnecting channels
    • message passing

[Diagram: Eden -> ? -> Runtime System (RTS): how is the language mapped onto the RTS?]

implementation of eden1
Eden Module: Implementation of Eden

Eden
  • Parallel programming on a high level of abstraction
    • explicit process definitions
    • implicit communication
  • Automatic process and channel management
    • distributed graph reduction
    • management of processes and their interconnecting channels
    • message passing

[Diagram: the Eden Module bridges between Eden and the Runtime System (RTS).]

layer structure
Layer Structure

Eden programs
Skeleton Library
Eden Module
Primitive Operations
Parallel GHC Runtime System
parprim the interface to the parallel rts
Parprim – The Interface to the Parallel RTS

Primitive operations provide the basic functionality:

  • channel administration
    • primitive channels (= inports): data ChanName' a = Chan Int# Int# Int# -- PE, process, inport
    • create communication channel(s): createC :: IO (ChanName' a, a)
    • connect communication channel: connectToPort :: ChanName' a -> IO ()
  • communication
    • send data: sendData :: Mode -> a -> IO ()
    • modi: data Mode = Connect | Stream | Data | Instantiate Int
  • thread creation: fork :: IO () -> IO ()
  • general: noPE, selfPE :: Int

the eden module
The Eden Module
  • Type class Trans of transmissible data types, with overloaded communication functions for lists (-> streams): write :: a -> IO (), and for tuples (-> concurrency): createComm :: IO (ChanName a, a)
  • explicit definitions of process, ( # ) and Process as well as spawn:

process :: (Trans a, Trans b) => (a -> b) -> Process a b

( # ) :: (Trans a, Trans b) => Process a b -> a -> b

spawn :: (Trans a, Trans b) => [Process a b] -> [a] -> [b]

  • explicit channels: newtype ChanName a = Comm (a -> IO ())
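For experimenting with skeleton code without the parallel RTS, this interface can be mocked sequentially. A hypothetical sketch (the real definitions carry Trans contexts and create remote processes; here instantiation is plain function application):

```haskell
-- A purely sequential stand-in for Eden's coordination interface,
-- enough to type-check and test skeletons without the parallel RTS.
newtype Process a b = Proc (a -> b)

process :: (a -> b) -> Process a b
process = Proc

( # ) :: Process a b -> a -> b
(Proc f) # x = f x

spawn :: [Process a b] -> [a] -> [b]
spawn = zipWith ( # )

-- parMap with one process per list element, following the definition
-- in Eden's skeleton library:
parMap :: (a -> b) -> [a] -> [b]
parMap f = spawn (repeat (process f))
```

Swapping this module for the real Eden module turns the same skeleton code into a parallel program.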
type class trans
Type class Trans

class NFData a => Trans a where
  write :: a -> IO ()
  write x = rnf x `pseq` sendData Data x

  createComm :: IO (ChanName a, a)
  createComm = do (cx, x) <- createC
                  return (Comm (sendVia cx), x)

sendVia :: ChanName' a -> a -> IO ()
sendVia ch d = do connectToPort ch
                  write d

tuple transmission by concurrent threads
Tuple transmission by concurrent threads

instance (Trans a, Trans b) => Trans (a,b) where
  createComm = do (cx, x) <- createC
                  (cy, y) <- createC
                  return (Comm (write2 (cx,cy)), (x,y))

write2 :: (Trans a, Trans b) => (ChanName' a, ChanName' b) -> (a,b) -> IO ()
write2 (c1,c2) (x1,x2) = do fork (sendVia c1 x1)
                            sendVia c2 x2


stream transmission of lists
Stream Transmission of Lists

instance Trans a => Trans [a] where
  write l@[] = sendData Data l
  write (x:xs) = do rnf x `pseq` sendData Stream x
                    write xs

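The effect of Stream mode can be imitated in plain Concurrent Haskell, as an illustrative simulation (not Eden code; sendStream and receiveStream are names chosen here): the producer sends one message per list element, with Nothing marking the end of the stream, and the consumer sees a lazy list that grows as messages arrive.

```haskell
import Control.Concurrent.Chan
import Data.Maybe (catMaybes, isJust)

-- One message per list element (cf. sendData Stream); Nothing plays
-- the role of the final Data message for []. Eden evaluates each
-- element to normal form before sending; seq only approximates that.
sendStream :: Chan (Maybe a) -> [a] -> IO ()
sendStream ch []     = writeChan ch Nothing
sendStream ch (x:xs) = do x `seq` writeChan ch (Just x)
                          sendStream ch xs

-- The consumer obtains the elements as a lazy list, element by
-- element, as soon as they have been received.
receiveStream :: Chan (Maybe a) -> IO [a]
receiveStream ch = do
  msgs <- getChanContents ch
  return (catMaybes (takeWhile isJust msgs))
```

In Eden the same behaviour is implicit: a process consuming a list input can start working on the first elements while the producer is still computing the rest.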

the pa monad
The PA Monad

Improving control over parallel activities:

newtype PA a = PA { fromPA :: IO a }

instance Monad PA where
  return b = PA $ return b
  (PA ioX) >>= f = PA $ do x <- ioX
                           fromPA $ f x

runPA :: PA a -> a
runPA = unsafePerformIO . fromPA
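With current GHC, the Monad instance additionally needs Functor and Applicative instances. A complete, compilable version of the slide's code, using only base:

```haskell
import System.IO.Unsafe (unsafePerformIO)

-- The PA monad wraps IO so the Eden module can sequence coordination
-- actions internally and still expose pure interfaces via runPA.
newtype PA a = PA { fromPA :: IO a }

instance Functor PA where
  fmap f (PA io) = PA (fmap f io)

instance Applicative PA where
  pure = PA . pure
  PA iof <*> PA iox = PA (iof <*> iox)

instance Monad PA where
  return = pure
  (PA ioX) >>= f = PA $ do x <- ioX
                           fromPA (f x)

runPA :: PA a -> a
runPA = unsafePerformIO . fromPA
```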

remote process creation
Remote Process Creation

data (Trans a, Trans b) => Process a b
  = Proc (ChanName b                -- output channel(s)
          -> ChanName' (ChanName a) -- channel for returning input channel handle(s)
          -> IO ())

process :: (a -> b) -> Process a b
process f = Proc f_remote
  where f_remote (Comm sendResult) inCC
          = do (sendInput, invals) <- createComm
               connectToPort inCC
               sendData Data sendInput  -- return input channel handle(s)
               sendResult (f invals)    -- send process output

process instantiation
Process Instantiation

( # ) :: (Trans a, Trans b) => Process a b -> a -> b
pabs # inps = unsafePerformIO $ instantiateAt 0 pabs inps

instantiateAt :: (Trans a, Trans b) => Int -> Process a b -> a -> IO b
instantiateAt pe (Proc f_remote) inps
  = do (sendresult, result) <- createComm  -- output channel(s)
       (inCC, Comm sendInput) <- createC   -- channel for returning input channel handle(s)
       sendData (Instantiate pe) (f_remote sendresult inCC)
       fork (sendInput inps)
       return result

slide103
Process creation protocol (parent process on PE 1, child process on PE 2):

Parent process:
  • create input channels (inports) for the child process' results
  • create input channels for receiving the child process' input channels
  • send an instantiate message containing the created input channels
  • on arrival of the child's input channels: create outports and fork threads for sending the inputs to the child process
  • these threads evaluate the input components to normal form, send the results to the child and terminate, while the parent continues its own evaluation

Child process:
  • create input channels for receiving its inputs
  • send these input channels to the parent process
  • create outports and fork one thread per output component
  • each thread evaluates its output component, sends the results to the parent and terminates
  • when all threads have finished, terminate the process and close its inports
conclusions of lecture 3
Conclusions of Lecture 3

Layered implementation of Eden:

Eden programs
Skeleton Library
Eden Module
Primitive Operations
Parallel GHC Runtime System

  • More flexibility
  • Complexity hidden
  • Better maintainability
  • Lean interface to the GHC RTS
conclusions
Conclusions

www.informatik.uni-marburg.de/~eden

  • Eden = Haskell + Coordination
  • Explicit process definitions
  • Implicit communication (message transfer)
  • Explicit channel management -> arbitrary process topologies
  • Nondeterministic merge -> master-worker systems with dynamic load balancing
  • Remote data -> pass data directly from producer to consumer processes
  • Programming methodology: use algorithmic skeletons
  • EdenTV to analyse parallel program behaviour
  • Available on several platforms


more on eden
More on Eden

PhD Workshop tomorrow 16:40-17:00

Bernhard Pickenbrock: Development of a multi-core implementation of Eden

