Porting NANOS on SDSM

Presentation Transcript

  1. Porting NANOS on SDSM • Goal: porting a shared memory environment to distributed memory • What is missing from current SDSMs? • Christian Perez

  2. Who am I? • December 1999: PhD at LIP, ENS Lyon, France • Data-parallel languages, distributed memory, load balancing, preemptive thread migration • Winter 1999/2000: TMR at UPC • OpenMP, Nanos, SDSM • October 2000: INRIA researcher • Distributed programs, code coupling

  3. Contents • Motivation • Related work • Nanos execution model (NthLib) • Nanos on top of 2 SDSMs (JIAJIA & DSM-PM2) • Missing SDSM functionality • Conclusion

  4. Motivation • OpenMP: an emerging standard • simplicity (no data distribution) • Clusters of machines (mono- or multiprocessor) • excellent performance/price ratio • OpenMP on top of a cluster!

  5. OpenMP / Cluster: how? • OpenMP paradigm: shared memory • Cluster paradigm: message passing • Use a software DSM system! • Hardware DSM system: SCI (write: 2 µs) • specific hardware • not yet stable

  6. Related work • Several OpenMP/DSM implementations • OpenMP NOW!, Omni • But: • modification of OpenMP semantics • one level of parallelism • no exploitation of high performance networks

  7. OpenMP on classical DSM • Compiler extracts shared data from the stack • Expensive local variable creation • shared memory allocation • Modification of the OpenMP standard: • default should be private instead of shared variables • New synchronization primitives: • condition variables & semaphores

  8. OpenMP on classical DSM • One level of parallelism (SPMD)
     Source:
       !$omp parallel do
       do i = 1, 4
         x(i) = x(i) + x(i+1)
       end do
     Translation:
       call schedule(lb, ub, …)
       do i = lb, ub
         x(i) = x(i) + x(i+1)
       end do
       call dsm_barrier()   ! barrier
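
     A hedged C sketch of this translation, filling in the block scheduling that the slide hides behind schedule(lb, ub, …); dsm_barrier() is the slide's name, and its existence and signature are assumptions:

       /* Each node computes its own block [lb, ub) and then meets the
        * others at a DSM-wide barrier (assumed primitive). */
       extern void dsm_barrier(void);

       void spmd_loop(double *x, int n, int node, int nodes)
       {
           int chunk = (n + nodes - 1) / nodes;        /* block schedule */
           int lb = node * chunk;
           int ub = (lb + chunk < n) ? lb + chunk : n; /* exclusive bound */

           for (int i = lb; i < ub; i++)               /* this node's share */
               x[i] = x[i] + x[i + 1];                 /* needs x[0..n] mapped */

           dsm_barrier();                              /* as on the slide */
       }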

  9. Omni compilation approach [figure taken from pdplab.trc.rwcp.or.jp/pdperf/Omni/wgcc2k/]

  10. Our goals • Support OpenMP standard • High performance • Allow exploitation of • multithreading (SMP) • high performance networks

  11. Nanos OpenMP compiler • Converts an OpenMP program to a task graph • Communication via shared memory
       !$omp parallel do
       do i = 1, 4
         x(i) = x(i) + x(i+1)
       end do
     [task graph: one task for i = 1,2 and one for i = 3,4]

  12. NthLib runtime support • The Nanos compiler generates intermediate code • Communication still via shared memory
       call nthf_depadd(…)
       do nth_p = 1, proc
         nth = nthf_create_1s(…, f, …)
       end do
       call nth_block()

       subroutine f(…)
         x(i) = x(i) + x(i+1)

  13. NthLib details • Assumes it runs on top of kernel threads • Provides user-level threads (QT) • Stack management (allocation) • Stack initialization (arguments) • Explicit context switch
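
     Not NthLib/QT source: a minimal POSIX ucontext analogy showing the three mechanisms this slide lists (stack allocation, stack/argument initialization, explicit context switch):

       #include <ucontext.h>
       #include <stdio.h>
       #include <stdlib.h>

       static ucontext_t main_ctx, task_ctx;

       static void task(int arg)                 /* runs on its own stack */
       {
           printf("nano-task running, arg=%d\n", arg);
           swapcontext(&task_ctx, &main_ctx);    /* explicit switch back */
       }

       int main(void)
       {
           getcontext(&task_ctx);
           task_ctx.uc_stack.ss_sp = malloc(64 * 1024);   /* stack allocation */
           task_ctx.uc_stack.ss_size = 64 * 1024;
           task_ctx.uc_link = &main_ctx;
           makecontext(&task_ctx, (void (*)(void))task, 1, 42); /* argument init */

           swapcontext(&main_ctx, &task_ctx);    /* explicit context switch */
           printf("back in main\n");
           return 0;
       }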

  14. NthLib queues • Global/local queues • Thread descriptor queues: rich functionality • Work descriptor queues: high performance

  15. NthLib: memory management [figure: nano-thread descriptor (successors, mutual exclusion), stack and guard zone packed into an mmap-allocated, SLOT_SIZE-aligned slot]
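
     A minimal sketch of such slot allocation, assuming a power-of-two SLOT_SIZE; the constant and helper names are illustrative, not NthLib's:

       #include <stdint.h>
       #include <sys/mman.h>

       #define SLOT_SIZE ((uintptr_t)(64 * 1024))   /* assumed slot size */
       #define SLOT_MASK (~(SLOT_SIZE - 1))

       /* Over-allocate with mmap, then unmap the head and tail so a
        * single SLOT_SIZE-aligned slot remains. */
       static void *alloc_slot(void)
       {
           uint8_t *raw = mmap(NULL, 2 * SLOT_SIZE, PROT_READ | PROT_WRITE,
                               MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
           if (raw == MAP_FAILED)
               return NULL;
           uintptr_t base = ((uintptr_t)raw + SLOT_SIZE - 1) & SLOT_MASK;
           if (base > (uintptr_t)raw)
               munmap(raw, base - (uintptr_t)raw);
           munmap((void *)(base + SLOT_SIZE),
                  (uintptr_t)raw + 2 * SLOT_SIZE - (base + SLOT_SIZE));
           return (void *)base;
       }

     Because the slot is aligned, any stack pointer inside it recovers the slot base, and hence the descriptor, with a single mask; the later DSM-PM2 slides build on exactly this trick.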

  16. Porting NthLib to SDSM • Data consistency • Shared memory management • Nano-threads • JIAJIA implementation • DSM-PM2 implementation • Summary of DSM requirements

  17. Data consistency • Mutual exclusion for defined data structures → acquire/release • User-level shared memory data → barrier

  18. Data consistency • Mutual exclusion for defined data structures → acquire/release • User-level shared memory data → barrier [figure: successive barrier points where shared data is made consistent]

  19. Shared memory management • Asynchronous shared memory allocation • Alignment parameter (> PAGE_SIZE) • Global variables / common declarations → not yet supported

  20. Nano-threads • Run-to-block execution model • Shared stacks (parent/child relationship) • Implicit thread migration (scheduler)

  21. JIAJIA • Developed in China by W. Hu, W. Shi & Z. Tang • Public-domain DSM • User-level DSM • DSM: lock/unlock, barrier, condition variables • MP: send/receive, broadcast, reduce • Solaris, AIX, Irix, Linux, NT (not distributed)
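
     A hedged usage sketch of the JIAJIA primitives this slide lists; the jia_* names and the jiapid variable follow JIAJIA's documentation, but the header name and exact signatures are assumptions to check against the distribution:

       #include <jia.h>          /* assumed JIAJIA header name */

       int main(int argc, char **argv)
       {
           jia_init(argc, argv);                        /* start the DSM */

           int *x = (int *)jia_alloc(64 * sizeof(int)); /* shared pages */

           jia_lock(0);                                 /* acquire lock 0 */
           x[jiapid] = jiapid;                          /* jiapid: node id */
           jia_unlock(0);                               /* release publishes writes */

           jia_barrier();                               /* all nodes synchronize */

           jia_exit();
           return 0;
       }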

  22. JIAJIA: memory allocation • No control of memory alignment (x2) • Synchronous memory allocation primitive → development of an RPC version • based on the send/receive primitives • addition of a user-level message handler → problems: • global lock • interference with JIAJIA blocking functions
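
     An illustrative sketch of such an RPC-style allocator (hypothetical names, not JIAJIA source): a user-level message handler on the serving node answers allocation requests arriving over send/receive:

       #include <stddef.h>

       enum { MSG_ALLOC_REQ = 1, MSG_ALLOC_REP = 2 };

       struct alloc_req { size_t size; size_t align; };
       struct alloc_rep { void *addr; };

       /* Assumed stand-ins for the DSM's send/receive layer and for a
        * local carve-out of the shared heap. */
       extern void msg_recv(int *from, int *tag, void *buf, size_t len);
       extern void msg_send(int to, int tag, const void *buf, size_t len);
       extern void *shared_heap_carve(size_t size, size_t align);

       /* User-level message handler loop on the allocating node. */
       void alloc_server(void)
       {
           for (;;) {
               int from, tag;
               struct alloc_req req;
               msg_recv(&from, &tag, &req, sizeof req);
               if (tag == MSG_ALLOC_REQ) {
                   struct alloc_rep rep = { shared_heap_carve(req.size, req.align) };
                   msg_send(from, MSG_ALLOC_REP, &rep, sizeof rep);
               }
           }
       }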

  23. JIAJIA: discussion • Global barrier for data synchronization → no multiple levels of parallelism • Not thread-aware → no efficient use of SMP nodes

  24. DSM-PM2 • Developed at LIP by G. Antoniu (PhD student) • Public domain • User level, a module of PM2 • Generic, multi-protocol DSM • DSM: lock/unlock • MP: LRPC • Linux, Solaris, Irix (32 bits)

  25. PM2 organization [architecture figure: PM2 and DSM built on the MARCEL thread library (mono, SMP, activation) and the MAD1/MAD2 communication layers (TCP, PVM, MPI, SCI, VIA, SBP, BIP), with the TBX/NTBX toolboxes] http://www.pm2.org

  26. DSM-PM2: memory allocation • Only static memory allocation → built a dynamic memory allocation primitive • centralized memory allocation • LRPC to node 0 → integration of an alignment parameter • Summer 2000: dynamic memory allocation ready!
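
     The client side of that scheme, as a sketch with hypothetical names (not the DSM-PM2 API): every node forwards its request to node 0 through an LRPC, and the returned address is valid everywhere in the shared space:

       #include <stddef.h>

       struct alloc_req { size_t size; size_t align; };
       struct alloc_rep { void *addr; };

       enum { SVC_ALLOC = 1 };

       /* Assumed synchronous LRPC primitive. */
       extern void lrpc_call(int node, int service,
                             const void *in, size_t in_len,
                             void *out, size_t out_len);

       void *dsm_malloc(size_t size, size_t align)
       {
           struct alloc_req req = { size, align };
           struct alloc_rep rep;
           lrpc_call(0, SVC_ALLOC, &req, sizeof req, &rep, sizeof rep);
           return rep.addr;    /* same address on every node */
       }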

  27. DSM-PM2: marcel descriptor [figure: marcel_t at a page boundary, reached as (sp & MASK) + SLOT_SIZE] • NthLib requirement: one kernel thread → many nano-threads

  28. DSM-PM2: marcel descriptor [figure: a marcel_t* stored at the page boundary, so the descriptor is reached as *((sp & MASK) + SLOT_SIZE)]
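
     A sketch of the two lookup schemes on slides 27-28; the names follow the slides, but the exact offsets inside the slot are assumptions:

       #include <stdint.h>

       #define SLOT_SIZE ((uintptr_t)(64 * 1024))   /* assumed */
       #define MASK (~(SLOT_SIZE - 1))

       typedef struct marcel { int id; /* ... */ } marcel_t;

       /* Slide 27: one nano-thread per slot; the descriptor sits at a
        * fixed offset from the masked stack pointer. */
       static marcel_t *self_direct(uintptr_t sp)
       {
           return (marcel_t *)(((sp & MASK) + SLOT_SIZE) - sizeof(marcel_t));
       }

       /* Slide 28: many nano-threads per kernel thread; the slot top
        * holds a marcel_t* that the scheduler updates at each
        * nano-thread context switch. */
       static marcel_t *self_indirect(uintptr_t sp)
       {
           marcel_t **slot = (marcel_t **)(((sp & MASK) + SLOT_SIZE)
                                           - sizeof(marcel_t *));
           return *slot;
       }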

  29. DSM-PM2: discussion • Using page-level sequential consistency • + no need for barriers (multiple levels of parallelism) • – false sharing → dedicated stack layout [figure: marcel_t* and padding up to the next page boundary]

  30. DSM-PM2: discussion (cont.) • No alternate stack for the signal handler → prefetch pages before a context switch: O(n) → pad to the next page before opening parallelism [figure: shared data padded to the next page boundary]

  31. DSM-PM2 improvements • Availability of an asynchronous DSM malloc • Lazy data consistency protocols under evaluation • eager consistency, multiple writers • scope consistency • Support for stacks in shared memory (Linux)

  32.–37. DSM-PM2 shared stack support [figure sequence stepping through the mechanism: the slot found via (sp & MASK) + SLOT_SIZE holds the marcel_t descriptor and the nano-thread stack, with a dedicated SEGV stack for fault handling]
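
     Why the dedicated SEGV stack matters: if the current stack is itself a DSM page, the SIGSEGV handler that fetches pages cannot run on it. A minimal POSIX illustration with sigaltstack (an analogy, not DSM-PM2 source):

       #include <signal.h>
       #include <stdlib.h>
       #include <unistd.h>

       static void segv_handler(int sig, siginfo_t *si, void *ctx)
       {
           /* A real SDSM would fetch the page for si->si_addr here. */
           (void)sig; (void)si; (void)ctx;
           const char msg[] = "fault handled on the alternate stack\n";
           write(2, msg, sizeof msg - 1);   /* async-signal-safe output */
           _exit(0);
       }

       int main(void)
       {
           stack_t ss;
           ss.ss_sp = malloc(SIGSTKSZ);     /* the dedicated SEGV stack */
           ss.ss_size = SIGSTKSZ;
           ss.ss_flags = 0;
           sigaltstack(&ss, NULL);

           struct sigaction sa;
           sa.sa_sigaction = segv_handler;
           sa.sa_flags = SA_SIGINFO | SA_ONSTACK;  /* run on the alt stack */
           sigemptyset(&sa.sa_mask);
           sigaction(SIGSEGV, &sa, NULL);

           *(volatile int *)0x42 = 0;       /* trigger a fault */
           return 1;
       }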

  38. DSM requirement • Support for static global shared variables • Efficient code • remove one level of indirection • enable use of a classical compiler • Support for common → « sharedization » of already allocated memory:
       dsm_to_shared(void* p, size_t size);
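
     A usage sketch for that proposed call (the call is the slide's proposal, not an existing API): an ordinary linked global, such as a Fortran common block, is handed to the DSM after startup:

       #include <stddef.h>

       extern void dsm_to_shared(void *p, size_t size);  /* proposed API */

       static double x[1024];            /* ordinary global data */

       void share_globals(void)
       {
           dsm_to_shared(x, sizeof x);   /* now kept consistent by the DSM */
       }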

  39.–42. DSM requirement • Support for multiple levels of parallelism • Partial barrier → group management • Dependencies support • like acquire/release but without a lock [figure sequence: partial barriers over thread groups; dependence primitives start(1), start(2), stop(1), stop(2), update(1,2)]
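
     A hypothetical C interface making this wish-list concrete; none of these calls exist in JIAJIA or DSM-PM2, they only name the requirements:

       typedef struct dsm_group dsm_group_t;

       /* Partial barrier: only the threads registered in g synchronize,
        * so an inner parallel region need not stop the whole machine. */
       void dsm_group_barrier(dsm_group_t *g);

       /* Dependence support, "like acquire/release but without a lock":
        * dsm_stop(i) publishes the updates of region i, dsm_start(i)
        * waits for them, dsm_update(i, j) forwards them from i to j. */
       void dsm_start(int region);
       void dsm_stop(int region);
       void dsm_update(int from, int to);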

  43. Summary of DSM requirements • Support for static global shared variables → « sharedization » of already allocated memory • Acquire/release primitives • Partial barrier → group management • Asynchronous shared memory allocation • Alignment parameter for memory allocation • Threads (SMP nodes) • Optimized stack management

  44. Conclusion • Successfully ported Nanos to 2 DSMs → JIAJIA & DSM-PM2 • DSM requirements to obtain performance → support for the MIMD model → automatic thread migration • Performance?