Porting NANOS on SDSM

Presentation Transcript

  1. Porting NANOS on SDSM • Goal: porting a shared memory environment to distributed memory • What is missing from current SDSMs? • Christian Perez

  2. Who am I? • December 1999: PhD at LIP, ENS Lyon, France • Data-parallel languages, distributed memory, load balancing, preemptive thread migration • Winter 1999/2000: TMR at UPC • OpenMP, Nanos, SDSM • October 2000: INRIA researcher • Distributed programs, code coupling

  3. Contents • Motivation • Related work • Nanos execution model (NthLib) • Nanos on top of 2 SDSMs (JIAJIA & DSM-PM2) • Missing SDSM functionality • Conclusion

  4. Motivation • OpenMP: an emerging standard • simplicity (no data distribution) • Clusters of machines (mono- or multiprocessor) • excellent performance/price ratio • OpenMP on top of a cluster!

  5. OpenMP / Cluster: how? • OpenMP paradigm: shared memory • Cluster paradigm: message passing • Use a software DSM system! • Hardware DSM system: SCI (write: 2 µs) • specific hardware • not yet stable

  6. Related work • Several OpenMP/DSM implementations • OpenMP NOW!, Omni • But: • modification of OpenMP semantics • one level of parallelism • no exploitation of high performance networks

  7. OpenMP on classical DSM • Compiler extracts shared data from the stack • Expensive local variable creation • shared memory allocation • Modification of the OpenMP standard: • default should be private instead of shared variables • New synchronization primitives: • condition variables & semaphores

  8. OpenMP on classical DSM • One level of parallelism (SPMD)
     Source:
       !$omp parallel do
       do i = 1, 4
         x(i) = x(i) + x(i+1)
       end do
     Translation:
       call schedule(lb, ub, …)
       do i = lb, ub
         x(i) = x(i) + x(i+1)
       end do
       call dsm_barrier()   ! barrier
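
     A hedged C sketch of this translation, filling in the block scheduling that the slide hides behind schedule(lb, ub, …); dsm_barrier() is the slide's name, and its existence and signature are assumptions:

       /* Each node computes its own block [lb, ub) and then meets the
        * others at a DSM-wide barrier (assumed primitive). */
       extern void dsm_barrier(void);

       void spmd_loop(double *x, int n, int node, int nodes)
       {
           int chunk = (n + nodes - 1) / nodes;        /* block schedule */
           int lb = node * chunk;
           int ub = (lb + chunk < n) ? lb + chunk : n; /* exclusive bound */

           for (int i = lb; i < ub; i++)               /* this node's share */
               x[i] = x[i] + x[i + 1];                 /* needs x[0..n] mapped */

           dsm_barrier();                              /* as on the slide */
       }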

  9. Omni compilation approach [figure taken from pdplab.trc.rwcp.or.jp/pdperf/Omni/wgcc2k/]

  10. Our goals • Support OpenMP standard • High performance • Allow exploitation of • multithreading (SMP) • high performance networks

  11. Nanos OpenMP compiler • Converts an OpenMP program to a task graph • Communication via shared memory
       !$omp parallel do
       do i = 1, 4
         x(i) = x(i) + x(i+1)
       end do
     [task graph: one task for i = 1,2 and one for i = 3,4]

  12. NthLib runtime support • The Nanos compiler generates intermediate code • Communication still via shared memory
       call nthf_depadd(…)
       do nth_p = 1, proc
         nth = nthf_create_1s(…, f, …)
       end do
       call nth_block()

       subroutine f(…)
         x(i) = x(i) + x(i+1)

  13. NthLib details • Assumes it runs on top of kernel threads • Provides user-level threads (QT) • Stack management (allocation) • Stack initialization (arguments) • Explicit context switch
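
     Not NthLib/QT source: a minimal POSIX ucontext analogy showing the three mechanisms this slide lists (stack allocation, stack/argument initialization, explicit context switch):

       #include <ucontext.h>
       #include <stdio.h>
       #include <stdlib.h>

       static ucontext_t main_ctx, task_ctx;

       static void task(int arg)                 /* runs on its own stack */
       {
           printf("nano-task running, arg=%d\n", arg);
           swapcontext(&task_ctx, &main_ctx);    /* explicit switch back */
       }

       int main(void)
       {
           getcontext(&task_ctx);
           task_ctx.uc_stack.ss_sp = malloc(64 * 1024);   /* stack allocation */
           task_ctx.uc_stack.ss_size = 64 * 1024;
           task_ctx.uc_link = &main_ctx;
           makecontext(&task_ctx, (void (*)(void))task, 1, 42); /* argument init */

           swapcontext(&main_ctx, &task_ctx);    /* explicit context switch */
           printf("back in main\n");
           return 0;
       }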

  14. NthLib queues • Global/local queues • Thread descriptor queues: rich functionality • Work descriptor queues: high performance

  15. NthLib: memory management [figure: nano-thread descriptor (successors, mutual exclusion), stack and guard zone packed into an mmap-allocated, SLOT_SIZE-aligned slot]
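
     A minimal sketch of such slot allocation, assuming a power-of-two SLOT_SIZE; the constant and helper names are illustrative, not NthLib's:

       #include <stdint.h>
       #include <sys/mman.h>

       #define SLOT_SIZE ((uintptr_t)(64 * 1024))   /* assumed slot size */
       #define SLOT_MASK (~(SLOT_SIZE - 1))

       /* Over-allocate with mmap, then unmap the head and tail so a
        * single SLOT_SIZE-aligned slot remains. */
       static void *alloc_slot(void)
       {
           uint8_t *raw = mmap(NULL, 2 * SLOT_SIZE, PROT_READ | PROT_WRITE,
                               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
           if (raw == MAP_FAILED)
               return NULL;
           uintptr_t base = ((uintptr_t)raw + SLOT_SIZE - 1) & SLOT_MASK;
           if (base > (uintptr_t)raw)
               munmap(raw, base - (uintptr_t)raw);
           munmap((void *)(base + SLOT_SIZE),
                  (uintptr_t)raw + 2 * SLOT_SIZE - (base + SLOT_SIZE));
           return (void *)base;
       }

     Because the slot is aligned, any stack pointer inside it recovers the slot base, and hence the descriptor, with a single mask; the later DSM-PM2 slides build on exactly this trick.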

  16. Porting NthLib to SDSM • Data consistency • Shared memory management • Nano-threads • JIAJIA implementation • DSM-PM2 implementation • Summary of DSM requirements

  17. Data consistency • Mutual exclusion for defined data structures → acquire/release • User-level shared memory data → barrier

  18. Data consistency • Mutual exclusion for defined data structures → acquire/release • User-level shared memory data → barrier [figure: successive barrier points where shared data is made consistent]

  19. Shared memory management • Asynchronous shared memory allocation • Alignment parameter (> PAGE_SIZE) • Global variables / common declarations → not yet supported

  20. Nano-threads • Run-to-block execution model • Shared stacks (parent/child relationship) • Implicit thread migration (scheduler)

  21. JIAJIA • Developed in China by W. Hu, W. Shi & Z. Tang • Public-domain DSM • User-level DSM • DSM: lock/unlock, barrier, condition variables • MP: send/receive, broadcast, reduce • Solaris, AIX, Irix, Linux, NT (not distributed)
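
     A hedged usage sketch of the JIAJIA primitives this slide lists; the jia_* names and the jiapid variable follow JIAJIA's documentation, but the header name and exact signatures are assumptions to check against the distribution:

       #include <jia.h>          /* assumed JIAJIA header name */

       int main(int argc, char **argv)
       {
           jia_init(argc, argv);                        /* start the DSM */

           int *x = (int *)jia_alloc(64 * sizeof(int)); /* shared pages */

           jia_lock(0);                                 /* acquire lock 0 */
           x[jiapid] = jiapid;                          /* jiapid: node id */
           jia_unlock(0);                               /* release publishes writes */

           jia_barrier();                               /* all nodes synchronize */

           jia_exit();
           return 0;
       }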

  22. JIAJIA: memory allocation • No control of memory alignment (x2) • Synchronous memory allocation primitive → development of an RPC version • based on the send/receive primitives • addition of a user-level message handler → problems: • global lock • interference with JIAJIA blocking functions
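
     An illustrative sketch of such an RPC-style allocator (hypothetical names, not JIAJIA source): a user-level message handler on the serving node answers allocation requests arriving over send/receive:

       #include <stddef.h>

       enum { MSG_ALLOC_REQ = 1, MSG_ALLOC_REP = 2 };

       struct alloc_req { size_t size; size_t align; };
       struct alloc_rep { void *addr; };

       /* Assumed stand-ins for the DSM's send/receive layer and for a
        * local carve-out of the shared heap. */
       extern void msg_recv(int *from, int *tag, void *buf, size_t len);
       extern void msg_send(int to, int tag, const void *buf, size_t len);
       extern void *shared_heap_carve(size_t size, size_t align);

       /* User-level message handler loop on the allocating node. */
       void alloc_server(void)
       {
           for (;;) {
               int from, tag;
               struct alloc_req req;
               msg_recv(&from, &tag, &req, sizeof req);
               if (tag == MSG_ALLOC_REQ) {
                   struct alloc_rep rep = { shared_heap_carve(req.size, req.align) };
                   msg_send(from, MSG_ALLOC_REP, &rep, sizeof rep);
               }
           }
       }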

  23. JIAJIA: discussion • Global barrier for data synchronization → no multiple levels of parallelism • Not thread-aware → no efficient use of SMP nodes

  24. DSM-PM2 • Developed at LIP by G. Antoniu (PhD student) • Public domain • User level, a module of PM2 • Generic, multi-protocol DSM • DSM: lock/unlock • MP: LRPC • Linux, Solaris, Irix (32 bits)

  25. PM2 organization [architecture figure: PM2 and DSM built on the MARCEL thread library (mono, SMP, activation) and the MAD1/MAD2 communication layers (TCP, PVM, MPI, SCI, VIA, SBP, BIP), with the TBX/NTBX toolboxes] http://www.pm2.org

  26. DSM-PM2: memory allocation • Only static memory allocation → built a dynamic memory allocation primitive • centralized memory allocation • LRPC to node 0 → integration of an alignment parameter • Summer 2000: dynamic memory allocation ready!
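
     The client side of that scheme, as a sketch with hypothetical names (not the DSM-PM2 API): every node forwards its request to node 0 through an LRPC, and the returned address is valid everywhere in the shared space:

       #include <stddef.h>

       struct alloc_req { size_t size; size_t align; };
       struct alloc_rep { void *addr; };

       enum { SVC_ALLOC = 1 };

       /* Assumed synchronous LRPC primitive. */
       extern void lrpc_call(int node, int service,
                             const void *in, size_t in_len,
                             void *out, size_t out_len);

       void *dsm_malloc(size_t size, size_t align)
       {
           struct alloc_req req = { size, align };
           struct alloc_rep rep;
           lrpc_call(0, SVC_ALLOC, &req, sizeof req, &rep, sizeof rep);
           return rep.addr;    /* same address on every node */
       }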

  27. DSM-PM2: marcel descriptor [figure: marcel_t at a page boundary, reached as (sp & MASK) + SLOT_SIZE] • NthLib requirement: one kernel thread → many nano-threads

  28. DSM-PM2: marcel descriptor [figure: a marcel_t* stored at the page boundary, so the descriptor is reached as *((sp & MASK) + SLOT_SIZE)]
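
     A sketch of the two lookup schemes on slides 27-28; the names follow the slides, but the exact offsets inside the slot are assumptions:

       #include <stdint.h>

       #define SLOT_SIZE ((uintptr_t)(64 * 1024))   /* assumed */
       #define MASK (~(SLOT_SIZE - 1))

       typedef struct marcel { int id; /* ... */ } marcel_t;

       /* Slide 27: one nano-thread per slot; the descriptor sits at a
        * fixed offset from the masked stack pointer. */
       static marcel_t *self_direct(uintptr_t sp)
       {
           return (marcel_t *)(((sp & MASK) + SLOT_SIZE) - sizeof(marcel_t));
       }

       /* Slide 28: many nano-threads per kernel thread; the slot top
        * holds a marcel_t* that the scheduler updates at each
        * nano-thread context switch. */
       static marcel_t *self_indirect(uintptr_t sp)
       {
           marcel_t **slot = (marcel_t **)(((sp & MASK) + SLOT_SIZE)
                                           - sizeof(marcel_t *));
           return *slot;
       }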

  29. DSM-PM2: discussion • Using page-level sequential consistency • + no need for barriers (multiple levels of parallelism) • – false sharing → dedicated stack layout [figure: marcel_t* and padding up to the next page boundary]

  30. DSM-PM2: discussion (cont.) • No alternate stack for the signal handler → prefetch pages before a context switch: O(n) → pad to the next page before opening parallelism [figure: shared data padded to the next page boundary]

  31. DSM-PM2 improvements • Availability of an asynchronous DSM malloc • Lazy data consistency protocols under evaluation • eager consistency, multiple writers • scope consistency • Support for stacks in shared memory (Linux)

  32.–37. DSM-PM2 shared stack support [figure sequence stepping through the mechanism: the slot found via (sp & MASK) + SLOT_SIZE holds the marcel_t descriptor and the nano-thread stack, with a dedicated SEGV stack for fault handling]
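
     Why the dedicated SEGV stack matters: if the current stack is itself a DSM page, the SIGSEGV handler that fetches pages cannot run on it. A minimal POSIX illustration with sigaltstack (an analogy, not DSM-PM2 source):

       #include <signal.h>
       #include <stdlib.h>
       #include <unistd.h>

       static void segv_handler(int sig, siginfo_t *si, void *ctx)
       {
           /* A real SDSM would fetch the page for si->si_addr here. */
           (void)sig; (void)si; (void)ctx;
           const char msg[] = "fault handled on the alternate stack\n";
           write(2, msg, sizeof msg - 1);   /* async-signal-safe output */
           _exit(0);
       }

       int main(void)
       {
           stack_t ss;
           ss.ss_sp = malloc(SIGSTKSZ);     /* the dedicated SEGV stack */
           ss.ss_size = SIGSTKSZ;
           ss.ss_flags = 0;
           sigaltstack(&ss, NULL);

           struct sigaction sa;
           sa.sa_sigaction = segv_handler;
           sa.sa_flags = SA_SIGINFO | SA_ONSTACK;  /* run on the alt stack */
           sigemptyset(&sa.sa_mask);
           sigaction(SIGSEGV, &sa, NULL);

           *(volatile int *)0x42 = 0;       /* trigger a fault */
           return 1;
       }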

  38. DSM requirement • Support for static global shared variables • Efficient code • remove one level of indirection • enable use of a classical compiler • Support for common → « sharedization » of already allocated memory:
       dsm_to_shared(void* p, size_t size);
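
     A usage sketch for that proposed call (the call is the slide's proposal, not an existing API): an ordinary linked global, such as a Fortran common block, is handed to the DSM after startup:

       #include <stddef.h>

       extern void dsm_to_shared(void *p, size_t size);  /* proposed API */

       static double x[1024];            /* ordinary global data */

       void share_globals(void)
       {
           dsm_to_shared(x, sizeof x);   /* now kept consistent by the DSM */
       }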

  39.–42. DSM requirement • Support for multiple levels of parallelism • Partial barrier → group management • Dependencies support • like acquire/release but without a lock [figure sequence: partial barriers over thread groups; dependence primitives start(1), start(2), stop(1), stop(2), update(1,2)]
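
     A hypothetical C interface making this wish-list concrete; none of these calls exist in JIAJIA or DSM-PM2, they only name the requirements:

       typedef struct dsm_group dsm_group_t;

       /* Partial barrier: only the threads registered in g synchronize,
        * so an inner parallel region need not stop the whole machine. */
       void dsm_group_barrier(dsm_group_t *g);

       /* Dependence support, "like acquire/release but without a lock":
        * dsm_stop(i) publishes the updates of region i, dsm_start(i)
        * waits for them, dsm_update(i, j) forwards them from i to j. */
       void dsm_start(int region);
       void dsm_stop(int region);
       void dsm_update(int from, int to);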

  43. Summary of DSM requirements • Support for static global shared variables → « sharedization » of already allocated memory • Acquire/release primitives • Partial barrier → group management • Asynchronous shared memory allocation • Alignment parameter for memory allocation • Threads (SMP nodes) • Optimized stack management

  44. Conclusion • Successfully ported Nanos to 2 DSMs → JIAJIA & DSM-PM2 • DSM requirements to obtain performance → support for the MIMD model → automatic thread migration • Performance?