
Bridging the Gap Between Distributed Shared Memory and Message Passing






  1. Bridging the Gap Between Distributed Shared Memory and Message Passing
  Holger Karl
  Department of Computer Science, Courant Institute of Mathematical Sciences, New York University
  Partially supported by DARPA/Rome Laboratory and Deutsche Forschungsgemeinschaft

  2. Overview
  • Motivation
  • Related approaches
  • Charlotte
  • Annotating Charlotte
  • Some experiments
  • Summary

  3. Motivation
  • Scenario
    • Distributed computing using the World Wide Web
    • Use Java to overcome heterogeneity, security concerns, and program-installation problems
    • Idea: volunteer computing
      • Applets can participate in distributed applications
  • Challenges
    • Only make use of standard Web browsers/Java virtual machines
    • Consider the Web's and Java's idiosyncrasies
      • Inherent faultiness: machines come and go
      • The host-of-origin security policy imposes a star-like communication topology
    • Provide a simple programming interface, e.g., DSM
      • No hardware support (e.g., page protection) is available
    • Do not forget efficiency

  4. Related Approaches

  5. Charlotte - Programs

  public class Matrix extends Droutine {
      // define routine
      public void drun (int n, int Id) {
          // do computation
      }
      …
      public void run () {
          …
          parBegin();
          addRoutine (this, Size);
          parEnd();
          …
      }
  }

  • Alternating sequential and parallel steps
    • Sequential steps executed by a manager application
    • Parallel steps executed by worker applets
  • Routines are defined in parallel steps
    • Routines are methods of objects derived from the Charlotte class Droutine

  6. Charlotte - Memory Semantics
  • Memory is partitioned into private and shared parts
  • Shared memory with CRCW-common semantics
  • Implemented at the object level
    • Charlotte provides a class for every primitive type: int → Dint, float → Dfloat, etc.
  • The manager has the master copy of memory
  • Shared objects have get() and set() methods (see the sketch below)
    • get() fetches invalid data from the manager if necessary; the latency is amortized by bringing "pages" of shared data from the manager
    • set() marks data as modified, to be flushed to the manager at the end of the parallel step
  • Updates from different routines are incorporated atomically at the end of the parallel step
    • Allows eager scheduling for fault tolerance
    • Routines can be executed multiple times without violating exactly-once semantics
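  The following is a minimal sketch of how such a checked shared-integer class might look; the field names and the fetchPageFromManager() hook are assumptions for illustration, not Charlotte's actual implementation:

  // Illustrative sketch only; field names and the fetch hook are assumed.
  public class Dint {
      private int value;
      private boolean valid;      // false until fetched from the manager
      private boolean modified;   // true once set() has been called

      public int get () {
          if (!valid) {
              fetchPageFromManager();  // amortizes the round-trip over a "page" of objects
              valid = true;
          }
          return value;
      }

      public void set (int v) {
          value = v;
          modified = true;  // flushed to the manager at the end of the parallel step
      }

      private void fetchPageFromManager () { /* runtime-system call, omitted */ }
  }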

  7. Charlotte - Matrix Multiplication

  public class Matrix extends Droutine {
      // routine i computes row i of the result matrix C
      public void drun (int n, int i) {
          for (int j = 0; j < Size; j++) {
              int sum = 0;
              for (int k = 0; k < Size; k++)
                  sum += A[i][k].get() * B[k][j].get();
              C[i][j].set(sum);
          }
      }
      …
  }

  8. Charlotte - Pros and Cons
  • Pros
    • Simple programming model
    • Well-defined DSM semantics
    • Adaptive parallelism
    • Fault tolerance with respect to worker crashes
  • Cons
    • Efficiency problems
      • Shared data is accessed via method invocations
      • Loading data from the manager incurs latency
      • Choosing a good page size is difficult
      • Sending modified data back to the manager requires inspecting status flags
    • Reason: the runtime system does not know which data is read or modified by routines
  • Make this information explicit in Charlotte programs!

  9. Annotations

  public class Matrix extends Droutine {
      public void drun (int n, int i) { … }

      public Locations dloc_read (int n, int i) {
          Locations loc = new Locations();
          loc.add (B);      // every routine reads all of B
          loc.add (A[i]);   // routine i reads only row i of A
          return loc;
      }
      …
  }

  • Specify read and write locations for routines
    • Use methods of class Droutine: dloc_read() and dloc_write()
  • First step: correctness-preserving annotations
    • Only send a routine's read data at the beginning of the routine
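  For the matrix example, the matching write annotation would name row i of the result matrix; a plausible sketch, assuming the same Locations interface as above:

  public Locations dloc_write (int n, int i) {
      Locations loc = new Locations();
      loc.add (C[i]);   // routine i writes only row i of the result
      return loc;
  }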

  10. Relying on Annotations
  • Second step: rely on annotations
    • Data transfer between manager and worker happens only according to annotations
    • No if-statement in get() necessary
    • No status update in set() necessary
    • No latencies for requesting data updates
    • But: wrong annotations lead to wrong results
  • Use Uint (unchecked) instead of Dint, etc. (see the sketch below)
    • Identical interface
    • Remaining overhead: a method invocation for each data access
  • Charlotte's distributed types (Dint) and unchecked distributed types (Uint) can be freely mixed
    • Simple to switch back and forth
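  Since the runtime now trusts the annotations, the unchecked accessors can drop the validity check and the status update; a minimal sketch of what Uint might reduce to (illustrative, not Charlotte's actual code):

  // Illustrative sketch only: with data movement fully described by
  // annotations, the unchecked accessors reduce to plain field access.
  public class Uint {
      private int value;

      public int get () { return value; }      // no validity check needed
      public void set (int v) { value = v; }   // no modified flag needed
  }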

  11. Sharing Primitive Types
  • Third step: use primitive types (int) instead of objects (Uint)
    • Possible since get() and set() are trivial for Uint
    • Possible since annotations describe data movement completely
    • Avoids method-invocation overhead
  • But: a different interface
    • No get()/set() → syntactic changes (see the sketch below)
    • Call-by-reference / call-by-value
  • Particularly suited to arrays of distributed types
  • Freely mix Dint, Uint, and shared int
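  A sketch of the matrix routine at this level, assuming A, B, and C are now plain int[][] fields whose movement is fully described by dloc_read()/dloc_write():

  // Sketch only: assumes A, B, C are plain int[][] fields covered by annotations.
  public void drun (int n, int i) {
      for (int j = 0; j < Size; j++) {
          int sum = 0;
          for (int k = 0; k < Size; k++)
              sum += A[i][k] * B[k][j];   // plain array access, no get()
          C[i][j] = sum;                  // plain assignment, no set()
      }
  }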

  12. Additional Optimizations
  • The manager keeps track of each worker's valid data set
  • Use routines' read sets to select the routine to give to a worker
    • Choose the routine with the minimal amount of missing data (see the sketch below)
    • Example: 3 routines with read sets 20-80, 90-150, and 160-210; Worker 1 holds valid data 1-100, Worker 2 holds 150-200
  • Data can be cached at workers between parallel steps
    • Requires data to be declared unmodified at the beginning of the parallel step
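  A sketch of this selection heuristic, assuming data sets are represented as simple integer intervals; the Range type and its interval arithmetic are assumptions for illustration:

  // Illustrative sketch of locality-aware routine selection.
  record Range (int lo, int hi) {
      int size () { return hi - lo + 1; }
      int overlap (Range other) {
          return Math.max(0, Math.min(hi, other.hi) - Math.max(lo, other.lo) + 1);
      }
  }

  static int pickRoutine (Range workerValid, Range[] readSets) {
      int best = -1, bestMissing = Integer.MAX_VALUE;
      for (int r = 0; r < readSets.length; r++) {
          // missing data = read-set size minus its overlap with the worker's valid set
          int missing = readSets[r].size() - readSets[r].overlap(workerValid);
          if (missing < bestMissing) { bestMissing = missing; best = r; }
      }
      return best;
  }

  With the example above, Worker 1 (valid 1-100) is given the routine reading 20-80 (no missing data), while Worker 2 (valid 150-200) is given the routine reading 160-210 (only 10 of 51 elements missing).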

  13. Discussion
  • Gradual incorporation of semantic knowledge into Charlotte programs
  • Great flexibility to mix different levels of annotation semantics in one program
    • Easy to switch back and forth among these levels
  • Close to message-passing behavior
  • Charlotte's nice properties are preserved (fault tolerance, adaptive parallelism)
  • Is it efficient?

  14. Annotating DSM Code
  • Munin
    • Annotates data with expected access patterns; the system chooses appropriate consistency protocols
  • Aurora
    • Object-based system in C++; different consistency models are dynamically selected at runtime
  • CRL, Cid
    • Libraries of C functions; explicit mapping of shared memory into local memory
  • Jade
    • Annotations used to extract parallelism from sequential code
  • All lack the flexibility to mix different guarantee levels

  15. Experiments - Setup
  • Multiplication of 200x200 integer matrices
  • PentiumPro 200 machines at NYU with Fast Ethernet (100 Mbit/s)
  • Pentium 90 at Humboldt University Berlin
  • Kaffe Virtual Machine version 0.92 (Java JIT compiler) under Linux
  • Sequential runtimes: 8.1 sec (P90), 2.3 sec (PPro 200)
  • Ping between NYU and HUB: typically 130 msec

  16. Experiments
  • Multiply two 200x200 integer matrices
  • Runtimes for
    • Standard Charlotte (Dint)
    • Standard Charlotte plus correctness-preserving annotations (Dint+A)
    • Unchecked classes (Uint)
    • Shared primitive types (int)
    • A message-passing implementation (mes.pas.)
  • Shared primitive types are competitive with message passing
  • Reasonable absolute speedups (compared to the sequential runtime)

  17. Experiments (contd.)
  • Ratios between the various optimizations
    • Annotations/presending give a factor of about three
    • Getting rid of objects is another factor of two
    • Altogether: up to nine times faster than standard Charlotte
    • And better scalability

  18. Experiments (contd.)
  • Manager at NYU, worker at HU Berlin
  • Similar behavior, with a slightly smaller improvement

  19. Experiments (contd.)
  • Colocation and caching
    • Example: multiply matrix A by two different matrices B1 and B2
  • Small impact in a LAN environment
  • Up to 25% improvement for high-latency connections
  • Colocation results in a smaller standard deviation of runtimes

  20. Future Work
  • Investigating more elaborate examples
    • Data with both regular and irregular access patterns
  • Compiler-generated annotations
  • New programming interface
    • No nested parallelism
    • parBegin()/parEnd() makes some annotations awkward
  • Overlapping communication and computation
    • Multiple worker threads in one virtual machine
  • Applying annotations to Calypso (a page-based DSM system)
  • Down the road: use annotations for QoS

  21. Summary
  • Annotations describe the read and write sets of parallel routines
    • Correctness-preserving
    • Correctness-sensitive
    • Or directly sharing primitive types
  • These levels can be mixed in a single program, with easy switching back and forth between them
    • Great flexibility for the programmer
  • Maintains Charlotte's advantages, such as adaptive parallelism and fault tolerance
  • Performance competitive with message-passing programs
