
TreadMarks: Shared Memory Computing on Networks of Workstations



  1. TreadMarks: Shared Memory Computing on Networks of Workstations C. Amza, A. L. Cox, S. Dwarkadas, P. J. Keleher, H. Lu, R. Rajamony, W. Yu, W. Zwaenepoel, Rice University

  2. INTRODUCTION • Distributed shared memory is a software abstraction allowing a set of workstations connected by a LAN to share a single paged virtual address space • Key issue in building a software DSM is minimizing the amount of data communication among the workstation memories

  3. Why bother with DSM? • Key idea is to build fast parallel computers that are • Cheaper than shared memory multiprocessor architectures • As convenient to use

  4. Conventional parallel architecture [figure: four CPUs, each with its own cache, all connected to a single shared memory]

  5. Today’s architecture • Clusters of workstations are much more cost effective • No need to develop complex bus and cache structures • Can use off-the-shelf networking hardware • Gigabit Ethernet • Myrinet (1.5 Gb/s) • Can quickly integrate newest microprocessors

  6. Limitations of cluster approach • Communication within a cluster of workstations is through message passing • Much harder to program than concurrent access to a shared memory • Many big programs were written for shared memory architectures • Converting them to a message passing architecture is a nightmare

  7. Distributed shared memory [figure: the workstations' main memories combine into a single shared global address space; DSM = one shared global address space]

  8. Distributed shared memory • DSM makes a cluster of workstations look like a shared memory parallel computer • Easier to write new programs • Easier to port existing programs • Key problem is that DSM only provides the illusion of having a shared memory architecture • Data must still move back and forth among the workstations

  9. Munin • Developed at Rice University • Based on software objects (variables) • Used the processor's virtual memory hardware to detect accesses to the shared objects • Included several techniques for reducing consistency-related communication • Only ran on top of the V kernel

  10. Munin main strengths • Excellent performance • Portability of programs • Allowed programs written for a multiprocessor architecture to run on a cluster of workstations with a minimum number of changes ("dusty decks")

  11. Munin main weakness • Very poor portability of Munin itself • Depended on some features of the V kernel • Not maintained since the late 80's

  12. TreadMarks • Provides DSM as an array of bytes • Like Munin, • Uses release consistency • Offers a multiple-writer protocol to fight false sharing • Runs at user level on a number of UNIX platforms • Offers a very simple user interface (sketched below)
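
  Before the examples, here is a sketch of the Tmk_ calls that appear in this talk, collected in one place. The exact types and signatures are approximations reconstructed from the examples, not the official TreadMarks header:

      /* Approximate TreadMarks interface as used in the examples below. */
      extern unsigned Tmk_proc_id;          /* this process's id, 0 .. Tmk_nprocs-1 */
      extern unsigned Tmk_nprocs;           /* number of processes in the run */
      void  Tmk_startup(void);              /* join the parallel computation */
      void  Tmk_exit(void);                 /* leave the parallel computation */
      void  Tmk_barrier(unsigned id);       /* block until all processes arrive */
      void  Tmk_lock_acquire(unsigned id);  /* enter a critical section */
      void  Tmk_lock_release(unsigned id);  /* leave a critical section */
      void *Tmk_malloc(unsigned size);      /* allocate from the shared address space */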

  13. First example: Jacobi iteration • Illustrates the use of barriers • A barrier is a synchronization primitive that forces processes accessing it to wait until all processes have reached it • Forces processes to wait until all of them have completed a specific step

  14. Jacobi iteration: overall organization [figure: the rows of the array are divided into bands, one per processor, Proc 0 … Proc n-1] • Operates on a two-dimensional array • Each processor works on a specific band of rows • Boundary rows are shared

  15. Jacobi iteration: overall organization • During each iteration step, each array element is set to the average of its four neighbors • Averages are stored in a scratch matrix and copied later into the shared matrix

  16. Jacobi iteration: the barriers • Mark the end of each computation phase • Prevent processes from continuing the computation before all other processes have completed the previous phase and the new values are "installed" • Include an implicit release() followed by an implicit acquire() • To be explained later

  17. Jacobi iteration: declarations
      #define M
      #define N
      float *grid;          // shared array
      float scratch[M][N];  // private array

  18. Jacobi iteration: startup
      main() {
          Tmk_startup();
          if (Tmk_proc_id == 0) {
              grid = Tmk_malloc(M*N*sizeof(float));
              initialize grid;
          } // if
          Tmk_barrier(0);
          length = M/Tmk_nprocs;
          begin = length*Tmk_proc_id;
          end = length*(Tmk_proc_id + 1);

  19. Jacobi iteration: main loop
          for (number of iterations) {
              for (i = begin; i < end; i++)
                  for (j = 0; j < N; j++)
                      scratch[i][j] = (grid[i-1][j] + … + grid[i][j+1])/4;
              Tmk_barrier(1);
              for (i = begin; i < end; i++)
                  for (j = 0; j < N; j++)
                      grid[i][j] = scratch[i][j];
              Tmk_barrier(2);
          } // main loop
      } // main

  20. Second example: TSP • Traveling salesman problem • Finding the shortest path through a number of cities • Program keeps a queue of partial tours • Most promising tours at the head

  21. TSP: declarations
      queue_type *Queue;
      int *Shortest_length;
      int queue_lock_id, min_lock_id;

  22. TSP: startup
      main() {
          Tmk_startup();
          queue_lock_id = 0;
          min_lock_id = 1;
          if (Tmk_proc_id == 0) {
              Queue = Tmk_malloc(sizeof(queue_type));
              Shortest_length = Tmk_malloc(sizeof(int));
              initialize Queue and Shortest_length;
          } // if
          Tmk_barrier(0);

  23. TSP: while loop
          while (true) {
              Tmk_lock_acquire(queue_lock_id);
              if (queue is empty) {
                  Tmk_lock_release(queue_lock_id);
                  Tmk_exit();
              } // if
              Keep adding to queue until a long promising tour appears at the head
              Path = Delete the tour from the head
              Tmk_lock_release(queue_lock_id);

  24. TSP: end of main
              length = recursively try all cities not on Path, find the shortest tour length
              Tmk_lock_acquire(min_lock_id);
              if (length < *Shortest_length)
                  *Shortest_length = length;
              Tmk_lock_release(min_lock_id);
          } // while
      } // main

  25. Critical sections • All accesses to shared variables are surrounded by a pair
      Tmk_lock_acquire(lock_id);
      …
      Tmk_lock_release(lock_id);

  26. Implementation Issues • Consistency issues • False sharing

  27. Consistency model (I) • Shared data are sometimes replicated • To speed up read accesses • All workstations must share a consistent view of all data • Strict consistency is not possible

  28. Consistency model (II) • Various authors have proposed weaker consistency models • Cheaper to implement • Harder to use in a correct fashion • TreadMarks uses software release consistency • Only requires the memory to be consistent at specific synchronization points

  29. SW release consistency (I) • Well-written parallel programs use locks to achieve mutual exclusion when they access shared variables • P(&mutex) and V(&mutex) • lock(&csect) and unlock(&csect) • acquire( ) and release( ) • Unprotected accesses can produce unpredictable results

  30. SW release consistency (II) • SW release consistency only guarantees correctness of operations performed within an acquire/release pair • No need to export the new values of shared variables until the release • Must guarantee that the workstation has received the most recent values of all shared variables when it completes an acquire

  31. SW release consistency (III)
      First process:
          shared int x;
          acquire();
          x = 1;
          release();   // export x=1
      Second process:
          shared int x;
          acquire();   // wait for new value of x
          x++;
          release();   // export x=2

  32. SW release consistency (IV) • Must still decide how to release updated values • TreadMarks uses lazy release: • Delays propagation until an acquire is issued • Its predecessor Munin used eager release: • New values of shared variables were propagated at release time

  33. SW release consistency (V) [figure contrasting eager release, which propagates updates to all sharers at release time, with lazy release, which delays propagation until the next acquire]

  34. False sharing [figure: one process accesses x, another accesses y, and both variables lie on the same page] • The page containing x and y will move back and forth between the main memories of the workstations
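
  To make the picture concrete, here is a minimal sketch of a layout that produces false sharing; the names and layout are illustrative:

      /* x and y are logically unrelated, but they sit on the same page. */
      /* Under a single-writer protocol, every write by either process   */
      /* forces the whole page to migrate between the two workstations.  */
      struct {
          int x;       /* written only by process 0 */
          int y;       /* written only by process 1 */
      } shared_block;  /* both fields fit on one page */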

  35. Multiple-writer protocol (I) • Designed to fight false sharing • Uses a copy-on-write mechanism • Whenever a process is granted access to write-shared data, the page containing these data is marked copy-on-write • The first attempt to modify the contents of the page results in the creation of a copy of the unmodified page (the twin)
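
  A minimal sketch of the copy-on-write step, assuming a 4 KB page and a simple per-page twin table; on_first_write() and the table are illustrative names, not TreadMarks source:

      #include <stdlib.h>
      #include <string.h>
      #include <sys/mman.h>

      #define PAGE_SIZE 4096
      #define NPAGES    1024        /* size of the shared region, in pages */

      static char *twin_of[NPAGES]; /* one twin slot per shared page */

      /* Called from the fault handler on the first write to a protected page. */
      void on_first_write(char *page, int page_index)
      {
          /* Save an exact copy of the page before any modification. */
          twin_of[page_index] = malloc(PAGE_SIZE);
          memcpy(twin_of[page_index], page, PAGE_SIZE);
          /* Open the page so the faulting write can be restarted. */
          mprotect(page, PAGE_SIZE, PROT_READ | PROT_WRITE);
      }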

  36. Creating a twin

  37. Multiple-writer protocol (II) • At release time, TreadMarks • Performs a word-by-word comparison of the page and its twin • Stores the diff in the space used by the twin page • Informs all processors having a copy of the shared data of the update • These processors will request the diff the first time they access the page
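
  A minimal sketch of diff creation, assuming 32-bit words and a simple (offset, new value) pair encoding; the actual TreadMarks diff format may differ:

      #include <stddef.h>
      #include <stdint.h>

      #define PAGE_WORDS (4096 / sizeof(uint32_t))

      /* Compare a dirty page against its twin word by word and emit one */
      /* (word offset, new value) pair per modified word. Returns the    */
      /* number of pairs written into diff[].                            */
      size_t make_diff(const uint32_t *page, const uint32_t *twin,
                       uint32_t diff[2 * PAGE_WORDS])
      {
          size_t n = 0;
          for (size_t i = 0; i < PAGE_WORDS; i++)
              if (page[i] != twin[i]) {
                  diff[n++] = (uint32_t)i;   /* where the page changed */
                  diff[n++] = page[i];       /* what it changed to */
              }
          return n / 2;
      }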

  38. Creating a diff

  39. Example [figure: before the first write access the page holds x = 1, y = 2, and an identical twin is saved; after the write, comparing the page (x = 3, y = 2) with its twin yields a diff recording that the new value of x is 3]

  40. Multiple-writer protocol (III) • TreadMarks could, but does not, check for conflicting updates to write-shared pages

  41. The TreadMarks system • Runs entirely at user level • Links to programs written in C, C++ and Fortran • Uses UDP/IP for communication (or AAL3/4 if machines are connected by an ATM LAN) • Uses the SIGIO signal to speed up processing of incoming requests • Uses the mprotect() system call to control access to shared pages
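
  A minimal sketch of the access-detection mechanism this enables: a shared page is protected with mprotect(), the first touch raises SIGSEGV, and the handler (where a real DSM would create a twin or fetch diffs) re-opens the page before the faulting instruction is restarted. The region size and handler body are illustrative; this is not TreadMarks source:

      #include <signal.h>
      #include <stdint.h>
      #include <string.h>
      #include <sys/mman.h>
      #include <unistd.h>

      static long pagesz;

      static void fault_handler(int sig, siginfo_t *info, void *ctx)
      {
          /* Round the faulting address down to its page boundary. */
          char *page = (char *)((uintptr_t)info->si_addr & ~(uintptr_t)(pagesz - 1));
          /* A real DSM would create a twin and/or fetch diffs here. */
          mprotect(page, pagesz, PROT_READ | PROT_WRITE);
      }

      int main(void)
      {
          struct sigaction sa;
          pagesz = sysconf(_SC_PAGESIZE);
          memset(&sa, 0, sizeof sa);
          sa.sa_sigaction = fault_handler;
          sa.sa_flags = SA_SIGINFO;
          sigaction(SIGSEGV, &sa, 0);

          /* Stand-in for one page of the shared address space. */
          char *p = mmap(0, pagesz, PROT_NONE,
                         MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
          p[0] = 1;   /* faults once; the handler opens the page; the write is retried */
          return 0;
      }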

  42. Performance evaluation (I) • Long discussion of two large TreadMarks applications

  43. Performance evaluation (II) • A previous paper compared performance of TreadMarks with that of Munin • Munin performance typically was within 5 to 33% of the performance of hand-coded message passing versions of the same programs • TreadMarks was almost always better than Munin with one exception: • A 3-D FFT program

  44. Performance Evaluation (III) • The 3-D FFT program was an iterative program that read some shared data outside any critical section • Doing otherwise would have been too costly • Munin used eager release, which ensured that the values read were not far from their true values • Not true for TreadMarks!

  45. Other DSM Implementations (I) • Sequentially-consistent software DSM (IVY): • Sends messages to the other copies of a page at each write • Much slower • Software release consistency with eager release (Munin)

  46. Other DSM Implementations (II) • Entry consistency (Midway): • Requires each variable to be associated with a synchronization object (typically a lock) • Acquire/release operations on a given synchronization object only involve the variables associated with that object • Requires less data traffic • Does not handle dusty decks well

  47. Other DSM Implementations (III) • Structured DSM systems (Linda): • Offer the programmer a shared tuple space accessed through specific synchronized methods • Require a very different programming style

  48. CONCLUSIONS • Can build an efficient DSM entirely in user space • Modern UNIX systems offer all the required primitives • Software release consistency model works very well • Lazy release is almost always better than eager release
