
OpenMP for Networks of SMPs Y. Charlie Hu, Honghui Lu, Alan L. Cox, Willy Zwaenepoel

Presentation Transcript


  1. OpenMP for Networks of SMPs Y. Charlie Hu, Honghui Lu, Alan L. Cox, Willy Zwaenepoel. ECE1747 – Parallel Programming, Vicky Tsang

  2. Background • Published in the Journal of Parallel and Distributed Computing, vol. 60 (12), pp. 1512-1530, December 2000 • Work to further improve TreadMarks • Presents an alternative solution to MPI

  3. Roadmap • Motivation • Solution • OpenMP API • TreadMarks • OpenMP Translator • Performance Measurement • Results • Conclusion

  4. Motivation • To enable the programmer to rely on a single, standard, shared-memory API for parallelization within and between multiprocessors. • To provide a standard alternative to MPI?

  5. Solution • Presents the first system that implements OpenMP on a network of shared-memory multiprocessors • Implemented via a translator converting OpenMP directives to calls in modified TreadMarks • Modified TreadMarks uses POSIX threads for parallelism within an SMP node

  6. Solution • Original version of TreadMarks: • A Unix process was executed on each processor of the multiprocessor node, and communication between processes was achieved through message passing • This fails to take advantage of the hardware shared memory within a node

  7. Solution • Modified version of TreadMarks • POSIX threads used to implement parallelism • OpenMP threads within a multiprocessor share a single address space • Positive: • Reduces the number of changes to TreadMarks to support multithreading on a multiprocessor • OS maintains the coherence of page mappings automatically • Negative: • More difficult to provide uniform sharing of memory between threads on the same node and threads on different nodes

  8. OpenMP API • Three kinds of directives: • Parallelism/work sharing • Data environment • Synchronization • Based on a fork-join model • Sequential code sections executed by master thread • Parallel code sections are executed by all threads, including the master thread

  9. OpenMP API • Parallel directive – all threads perform the same computation • Work sharing directive – the computation is divided among the threads • Data environment directive – controls the sharing of program variables • Synchronization directive – controls the synchronization between threads
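
To make the four directive kinds concrete, here is a small, self-contained OpenMP/C fragment. It is illustrative only and is not taken from the paper or the presentation.

#include <stdio.h>
#include <omp.h>

#define N 1000

int main(void)
{
    double a[N];
    double sum = 0.0;
    int i;

    /* Parallelism + work sharing: fork a team of threads and divide the
     * loop iterations among them (the fork-join model of slide 8). */
    /* Data environment: a is shared, i is private, sum is a reduction. */
    #pragma omp parallel for shared(a) private(i) reduction(+:sum)
    for (i = 0; i < N; i++) {
        a[i] = (double)i;
        sum += a[i];
    }
    /* Implicit join: only the master thread continues from here. */

    /* Synchronization: the critical directive serializes the prints. */
    #pragma omp parallel
    {
        #pragma omp critical
        printf("thread %d sees sum = %f\n", omp_get_thread_num(), sum);
    }
    return 0;
}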

  10. TreadMarks • User-level software distributed shared memory (SDSM) system • Provides a global shared address space on top of physically distributed memories • Key functions performed are memory coherence and synchronization

  11. TreadMarks – Memory Coherence • Minimizes the amount of communication performed to maintain memory consistency by: • a lazy implementation of release consistency • reducing the impact of false sharing by allowing multiple concurrent writers to modify a page • Propagation of consistency information is postponed until the time of an acquire
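
A minimal sketch of how these acquire/release points appear to the programmer. The primitive names (Tmk_lock_acquire, Tmk_lock_release, Tmk_barrier) follow the TreadMarks API described in the literature, but the declarations and the shared-counter setup below are assumptions made for illustration, not code from the paper.

/* Assumed declarations; the real prototypes live in the TreadMarks header. */
extern void Tmk_lock_acquire(unsigned id);
extern void Tmk_lock_release(unsigned id);
extern void Tmk_barrier(unsigned id);

extern int *shared_counter;   /* assumed to point into the shared address
                                 space, e.g. allocated by process 0        */

void increment_counter(void)
{
    /* Lazy release consistency: the write below is not propagated when the
     * lock is released.  The next process that acquires lock 0 receives the
     * consistency information at acquire time and invalidates (or updates)
     * its copy of the affected page. */
    Tmk_lock_acquire(0);
    (*shared_counter)++;
    Tmk_lock_release(0);

    /* A barrier also acts as an acquire/release point for every process. */
    Tmk_barrier(0);
}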

  12. TreadMarks - Synchronization • Barrier implemented as acquire and release messages • Governed by a centralized manager

  13. TreadMarks – Modifications for OpenMP • Inclusion of two primitives: • Tmk_fork • Tmk_join • All threads are created at the start of the program’s execution to minimize overhead. • Slave threads block during sequential execution until the master thread issues the next Tmk_fork.
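
Tmk_fork and Tmk_join are the primitives named on the slide; the slave-thread wait loop below only sketches the pattern they imply (threads created once at startup, slaves sleeping through sequential sections). The condition-variable bookkeeping and all other names are assumptions, not the paper's implementation.

#include <pthread.h>

typedef void (*work_fn)(void *);

static pthread_mutex_t fork_mtx  = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  fork_cond = PTHREAD_COND_INITIALIZER;
static unsigned fork_generation = 0;    /* bumped by the master's Tmk_fork  */
static unsigned done_count      = 0;    /* polled by the master's Tmk_join  */
static work_fn  pending_work    = NULL; /* the outlined parallel region     */
static void    *pending_arg     = NULL;

/* Each slave thread is created once at program start and then loops:
 * it blocks during sequential execution and wakes only when the master
 * issues the next fork. */
static void *slave_loop(void *unused)
{
    unsigned seen = 0;
    (void)unused;
    for (;;) {
        pthread_mutex_lock(&fork_mtx);
        while (fork_generation == seen)          /* sequential section: sleep */
            pthread_cond_wait(&fork_cond, &fork_mtx);
        seen = fork_generation;
        work_fn fn  = pending_work;
        void   *arg = pending_arg;
        pthread_mutex_unlock(&fork_mtx);

        fn(arg);                                 /* run the parallel region   */

        pthread_mutex_lock(&fork_mtx);
        done_count++;                            /* lets the master's join    */
        pthread_cond_broadcast(&fork_cond);      /* observe completion        */
        pthread_mutex_unlock(&fork_mtx);
    }
    return NULL;
}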

  14. TreadMarks – Modifications for Networks of Multiprocessors • POSIX threads enable data sharing among the processors within a node; data that must remain private to a thread, such as message buffers, was moved into thread-private data structures. • A per-page mutex was added to allow greater concurrency in the page fault handler. • The TreadMarks synchronization functions were modified to use POSIX thread-based synchronization between processors within a node and the existing TreadMarks synchronization between nodes. • A second mapping was added for the memory that is shared between nodes, so that a shared page can be updated through the second mapping while the first mapping remains invalid until the update is complete. This reduces the number of page protection operations performed by TreadMarks.
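
The last bullet (the second mapping) can be sketched with standard POSIX calls: the same physical pages are mapped twice, so the runtime can write updates through one mapping while the application's mapping stays protected until the update is complete. This is an illustration under assumed names and sizes, not the paper's code.

#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

#define SHARED_SIZE (1 << 20)     /* placeholder size for the shared heap  */

static char *app_view;            /* mapping the OpenMP threads access     */
static char *runtime_view;        /* second mapping used to apply updates  */

int map_shared_heap(void)
{
    /* Back both mappings with the same object so they alias the same pages. */
    int fd = shm_open("/tmk_demo_heap", O_CREAT | O_RDWR, 0600);
    if (fd < 0 || ftruncate(fd, SHARED_SIZE) < 0)
        return -1;

    /* Application view: starts inaccessible; an access faults into the
     * handler that drives the coherence protocol. */
    app_view = mmap(NULL, SHARED_SIZE, PROT_NONE, MAP_SHARED, fd, 0);

    /* Runtime view: always writable, so incoming updates can be applied
     * without issuing extra mprotect() calls on the application view. */
    runtime_view = mmap(NULL, SHARED_SIZE, PROT_READ | PROT_WRITE,
                        MAP_SHARED, fd, 0);

    close(fd);
    return (app_view == MAP_FAILED || runtime_view == MAP_FAILED) ? -1 : 0;
}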

  15. OpenMP Translator • Synchronization directives translate directly into TreadMarks synchronization operations. • The compiler translates code sections marked with parallel directives into fork-join code. • Data environment directives are implemented to work with both TreadMarks and POSIX threads, hiding the interface differences from the programmer.
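
A rough picture of what such a translator produces for a parallel loop: the loop body is outlined into a function and the directive is replaced with fork/join calls. Tmk_fork and Tmk_join are the primitives from the paper, but their signatures here, the argument struct, and the block scheduling are assumptions made to keep the sketch self-contained.

/* Original OpenMP source:
 *
 *     #pragma omp parallel for
 *     for (i = 0; i < n; i++)
 *         a[i] = b[i] + c[i];
 */

/* Assumed signatures for the fork/join primitives named in the paper. */
extern void Tmk_fork(void (*fn)(void *, int, int), void *arg);
extern void Tmk_join(void);

struct region0_args { double *a, *b, *c; int n; };

/* Outlined parallel region: each thread computes a contiguous block. */
static void region0(void *p, int thread_id, int nthreads)
{
    struct region0_args *r = p;
    int chunk = (r->n + nthreads - 1) / nthreads;
    int lo = thread_id * chunk;
    int hi = (lo + chunk < r->n) ? lo + chunk : r->n;
    for (int i = lo; i < hi; i++)
        r->a[i] = r->b[i] + r->c[i];
}

/* What the call site looks like after translation. */
void translated(double *a, double *b, double *c, int n)
{
    struct region0_args args = { a, b, c, n };
    Tmk_fork(region0, &args);    /* wake the slave threads              */
    /* the master thread would execute its own share of region0 here   */
    Tmk_join();                  /* wait for all slaves to finish       */
}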

  16. Performance Measurement • Platform • IBM SP2 consisting of four SMP nodes • Per node: • Four IBM PowerPC 604 processors • 1 GB memory • Running AIX 4.2

  17. Performance Measurement • Applications • SPLASH-2 Barnes-Hut • NAS 3D-FFT • SPLASH-2 CLU • SPLASH-2 Water • Red-Black SOR • TSP • Modified Gram-Schmidt (MGS)

  18.–21. Results (figures not transcribed)

  22. Conclusion • Enables the programmer to rely on a single, standard, shared-memory API for parallelization within and between multiprocessors. • Using the hardware shared memory within each node reduced the amount of data and the number of messages transmitted. • The speedups of the multithreaded TreadMarks codes on four four-way SMP SP2 nodes are within 7-30% of the MPI versions.

  23. Critique • The solution makes it easier to parallelize programs across networks of multiprocessors when peak speedup is not critical • OpenMP is easier on the programmer, but the resulting speedup is still not as good as MPI’s

  24. Critique • Issues: • AIX has an inefficient implementation of page protection (mprotect) operations • The paper claims that other Unix variants, including Linux, use data structures that handle mprotect operations more efficiently • Why wasn’t the solution implemented on another platform? • The paper does not present a compelling motivation for choosing this solution over MPI.

  25. Thank You
