
HParC language



Presentation Transcript


  1. HParC language

  2. Background • Shared memory level: multiple separated shared memory spaces • Message passing level-1: a fast level of k separate message passing segments • Message passing level-2: a slow level of message passing

  3. Proposed solution • The target architecture is a grid composed of MCs. • Scoping rules that clearly separate MP constructs from SM constructs. • An extensive operation set supporting serialization and deserialization of complex messages, so that a whole complex message can be sent in one operation (contrast this with the MPI sketch below).
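For contrast, here is a minimal sketch, not taken from the original slides, of what sending one such complex message looks like in plain C with MPI, serializing a struct by hand with MPI_Pack/MPI_Unpack; the struct sample_msg and the helper names are hypothetical. HParC's built-in operation set is intended to hide exactly this kind of boilerplate.

    #include <mpi.h>

    /* Hypothetical "complex message": a fixed header plus a variable-length payload. */
    typedef struct { int id; double weight; int n; int data[64]; } sample_msg;

    /* Sender: serialize the struct field by field into one buffer, then send once. */
    static void send_msg(const sample_msg *m, int dest, char *buf, int bufsize) {
        int pos = 0;
        MPI_Pack(&m->id,     1,    MPI_INT,    buf, bufsize, &pos, MPI_COMM_WORLD);
        MPI_Pack(&m->weight, 1,    MPI_DOUBLE, buf, bufsize, &pos, MPI_COMM_WORLD);
        MPI_Pack(&m->n,      1,    MPI_INT,    buf, bufsize, &pos, MPI_COMM_WORLD);
        MPI_Pack(m->data,    m->n, MPI_INT,    buf, bufsize, &pos, MPI_COMM_WORLD);
        MPI_Send(buf, pos, MPI_PACKED, dest, /*tag=*/0, MPI_COMM_WORLD);
    }

    /* Receiver: one receive, then unpack the fields in the same order. */
    static void recv_msg(sample_msg *m, int src, char *buf, int bufsize) {
        int pos = 0;
        MPI_Recv(buf, bufsize, MPI_PACKED, src, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Unpack(buf, bufsize, &pos, &m->id,     1,    MPI_INT,    MPI_COMM_WORLD);
        MPI_Unpack(buf, bufsize, &pos, &m->weight, 1,    MPI_DOUBLE, MPI_COMM_WORLD);
        MPI_Unpack(buf, bufsize, &pos, &m->n,      1,    MPI_INT,    MPI_COMM_WORLD);
        MPI_Unpack(buf, bufsize, &pos, m->data,    m->n, MPI_INT,    MPI_COMM_WORLD);
    }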

  4. Scoping rules problems • See the attached document cl.pdf.

  5. Nesting rules problem • Do we allow an MP_PF to access shared variables defined in an outer parallel construct? If so, how do we support coherent views in the caches of different MC machines? • Can a message queue variable defined in an outer MP_PF be used by an inner MP_PF when an SM_PF separates the two MP_PFs? • Can a shared variable defined in an outer SM_PF be used by an inner SM_PF when an MP_PF separates the two SM_PFs?

  6. HParC parallel constructs • mparfor/mparblock • Threads cannot access shared memory (i.e., cannot reference nonlocal variables); they can only send and receive messages. • Threads are evenly distributed among all machines in the system. • parfor/parblock • Threads can access shared memory and use message passing. • Threads are distributed among the machines of only one MC out of the set of clusters constituting the target architecture.

  7. Code example on HParC

     #define N number_of_MCs
     mparfor(int i=0; i<N; i++) using Q {       // N message-passing threads sharing the message queue array Q
         int A[N], sum = 0;
         if (i < N-1) {                         // producer threads: fill A in parallel, then send it as one message
             parfor(int j=0; j<N; j++)
                 A[j] = f(i, j);
             Q[N-1] = A;                        // enqueue the whole array for thread N-1
         } else {                               // consumer thread: z pool threads sum up the incoming arrays
             int z = rand() % N;
             parfor(int j=0; j<z; j++) {
                 message m;
                 int s = 0, k;
                 for (int t=0; t<N/z; t++) {
                     m = Q[N-1];                // receive one array from the queue
                     for (k=0; k<N; k++)
                         s += m[k];
                 }
                 faa(&sum, s);                  // fetch-and-add the partial result into the shared sum
             }
         }
     }

  8. OpenMP enhanced with MPI • The parallel directives of OpenMP enable parallel constructs similar to those available in HParC: atomically executed basic arithmetic expressions, synchronization primitives, and various types of shared variables that help adjust shared-memory usage. • MPI is a language-independent communications protocol. It supports point-to-point and collective communication, but it is a framework that provides an extensive API rather than a language enhancement. • Tailoring these two programming styles in a single program is not an easy task (see the sketch below). • MPI constructs are intended to be used only at thread-wide scope. • The dynamic joining of new threads to the MPI realm is not straightforward.
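As a rough, hypothetical illustration of mixing the two styles (not taken from the slides), the following C program computes partial sums with OpenMP threads inside each MPI process and combines them with MPI_Reduce; note that MPI must be initialized with an explicit thread-support level before any OpenMP region interacts with it.

    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        /* Request a thread-support level: only the main thread will call MPI. */
        int provided;
        MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);

        int rank, nprocs;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nprocs);

        /* Shared-memory level: the OpenMP threads of this process sum their share. */
        long local_sum = 0;
        #pragma omp parallel for reduction(+:local_sum)
        for (int i = rank; i < 1000000; i += nprocs)
            local_sum += i;

        /* Message-passing level: combine the per-process sums at rank 0. */
        long total = 0;
        MPI_Reduce(&local_sum, &total, 1, MPI_LONG, MPI_SUM, 0, MPI_COMM_WORLD);
        if (rank == 0) printf("total = %ld\n", total);

        MPI_Finalize();
        return 0;
    }

Even in this tiny case the MPI initialization, rank arithmetic and reduction bookkeeping live outside the OpenMP constructs, which is the coupling difficulty the slide points to.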

  9. Comparison with HParC • The code is far more complex and less readable. • The MPI usage demands a lot of supporting directives. • The communication procedures demand low-level address information, binding the application to the hardware architecture. • Lines 11, 12, 14 are replaced in HParC by the simple declaration “using Ql”. • The parfor of HParC implies natural synchronization, so there is no need for lines 8 and 24. • The declaration of communication groups is performed asymmetrically by different threads (lines 25-26 and 28-29), while in HParC the message-passing queue is part of the parallel construct and is accessed in a symmetric manner (a hypothetical fragment of such group setup is sketched below).
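The hybrid listing that those line numbers refer to is not reproduced in this transcript. As a rough, hypothetical C fragment showing the kind of supporting directives meant, creating an explicit communication group in MPI typically looks like the following, whereas in HParC the queue named in the using clause plays this role implicitly:

    /* Hypothetical fragment: build a communicator covering ranks 0..2 only. */
    MPI_Group world_group, sub_group;
    MPI_Comm  sub_comm;
    int       ranks[] = {0, 1, 2};                       /* explicit, low-level member list */

    MPI_Comm_group(MPI_COMM_WORLD, &world_group);
    MPI_Group_incl(world_group, 3, ranks, &sub_group);
    MPI_Comm_create(MPI_COMM_WORLD, sub_group, &sub_comm);
    /* ... point-to-point or collective communication over sub_comm ... */
    MPI_Group_free(&sub_group);
    if (sub_comm != MPI_COMM_NULL) MPI_Comm_free(&sub_comm);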

  10. PGAS languages • Partitioned Global Address Space languages assume that distinct memory areas, each local to one machine, are logically assembled into a global memory address space. • Remote DMA operations are used to simulate shared memory among distinct MC computers (a rough sketch follows below). • They have no scoping rules that impose locality on shared variables.
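PGAS implementations differ, but as a rough, hypothetical sketch of the underlying idea, MPI's one-sided (RMA) operations in C let one process read another process's memory without the target posting a receive, which is how remote DMA can make distributed memory look shared (run with at least two processes):

    #include <mpi.h>
    #include <stdio.h>

    #define N 64

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Each process exposes a local array as a window of remotely accessible memory. */
        int local[N];
        for (int i = 0; i < N; i++) local[i] = rank * 100 + i;
        MPI_Win win;
        MPI_Win_create(local, N * sizeof(int), sizeof(int),
                       MPI_INFO_NULL, MPI_COMM_WORLD, &win);

        MPI_Win_fence(0, win);                 /* open an access epoch */
        int remote_val = 0;
        if (rank == 0)                         /* rank 0 reads element 5 of rank 1's window */
            MPI_Get(&remote_val, 1, MPI_INT, 1, 5, 1, MPI_INT, win);
        MPI_Win_fence(0, win);                 /* complete all outstanding transfers */

        if (rank == 0) printf("read %d from rank 1\n", remote_val);
        MPI_Win_free(&win);
        MPI_Finalize();
        return 0;
    }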

  11. Comparison with X10 • X10 is a Java-based language. • X10 can be used to program a cluster of MCs, giving a fully separate shared-memory scope at each MC. • Instead of nested MP levels it uses a global, separate shared-memory area. • It has a foreach construct similar to the SM_PF, allowing full nesting, including the declaration of local variables that are shared among the threads spawned by inner foreach constructs.

  12. The X10 code vs. HParC

     ateach ((i) in Dist.makeUnique()) {                      // i iterates over all places
         if (here != Place.FIRST_PLACE) {
             val A = Rail.make[Int](N, (i:Int) => 0);         // creates a local array at places 2..n
             finish foreach ((j) in [0..N-1]) A[j] = f(i,j);  // create a thread to fill in each A
             at (Place.FIRST_PLACE) atomic queue.add[A];      // go to Place1 and place A into the queue
         } else {                                             // at Place.FIRST_PLACE
             shared sum = 0;
             finish foreach ((k) in [0..K-1]) {               // create k pool threads to sum up the A's
                 while (true) {
                     var A;
                     when ((nProcessed == N-1) || !queue.isEmpty()) {  // all processed, or queue non-empty
                         if (nProcessed == N-1) break;
                         A = queue.remove() as ValRail[Int];           // copy the A array
                         nProcessed++;
                     }
                     var s = 0;
                     for ((i) in 0..N-1) s += A[i];
                     atomic sum += s;
                 }
             }
         }
     }

  13. Comparison with HParC • We have to spawn an additional thread in order to update the globally shared “queue”. • The update command implicitly copies the potentially large array “A” and sends it over the network.
