
Distributed Shared Memory (part 1)



Presentation Transcript


  1. Distributed Shared Memory (part 1)

  2. Distributed Shared Memory (DSM) [diagram: processors proc0 … procN, each with a local memory mem0 … memN, connected by a network that together present a single shared memory]

  3. Shared memory programming • Standard: pthreads • Synchronization primitives: • Barriers • Locks • Semaphores
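
For reference, a minimal sketch of these pthread primitives (the worker function and initialization details are illustrative, not from the slides):

     #include <pthread.h>
     #include <semaphore.h>

     pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;  /* lock */
     pthread_barrier_t bar;       /* barrier: initialize with the thread count */
     sem_t sem;                   /* semaphore: initialize with sem_init()     */

     void *worker(void *arg) {
         pthread_mutex_lock(&lock);      /* critical section, one thread at a time */
         /* ... update shared data ... */
         pthread_mutex_unlock(&lock);

         pthread_barrier_wait(&bar);     /* block until all threads arrive */

         sem_wait(&sem);                 /* P: decrement or block */
         sem_post(&sem);                 /* V: increment, possibly waking a waiter */
         return NULL;
     }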

  4. Sequential SOR

     for some number of timesteps/iterations {
         for (i = 1; i < n; i++)
             for (j = 1; j < n; j++)
                 temp[i][j] = 0.25 * (grid[i-1][j] + grid[i+1][j] +
                                      grid[i][j-1] + grid[i][j+1]);
         for (i = 1; i < n; i++)
             for (j = 1; j < n; j++)
                 grid[i][j] = temp[i][j];
     }

  5. Parallel SOR with Barriers (1 of 2)

     void *sor(void *arg) {
         int slice = (int)(long)arg;                /* this thread's slice index */
         int from  = (slice * (n-1)) / p + 1;       /* first row of the slice */
         int to    = ((slice+1) * (n-1)) / p + 1;   /* one past the last row */
         for some number of iterations {
             …
         }
     }

  6. Parallel SOR with Barriers (2 of 2)

     for (i = from; i < to; i++)
         for (j = 1; j < n; j++)
             temp[i][j] = 0.25 * (grid[i-1][j] + grid[i+1][j] +
                                  grid[i][j-1] + grid[i][j+1]);
     barrier();
     for (i = from; i < to; i++)
         for (j = 1; j < n; j++)
             grid[i][j] = temp[i][j];
     barrier();
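
To make the decomposition concrete, one possible driver (a sketch, not from the slides): it assumes grid, temp, n, and p are globals as above, and that barrier() wraps pthread_barrier_wait on the barrier initialized here.

     #include <pthread.h>

     pthread_barrier_t bar;                     /* backs the barrier() calls above */

     int main(void) {
         pthread_t thr[p];
         pthread_barrier_init(&bar, NULL, p);   /* all p workers meet at each barrier */
         for (long s = 0; s < p; s++)
             pthread_create(&thr[s], NULL, sor, (void *)s);  /* pass slice index */
         for (long s = 0; s < p; s++)
             pthread_join(thr[s], NULL);
         return 0;
     }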

  7. Differences between SMP and Software DSM • Delay: much higher latency shifts tradeoffs, such as block size • Software-based: misses trap to software, so read/write misses are costly • Goals of caching differ: in a multiprocessor it is performance; in a distributed system it is transparency • Bus vs. long networks: a bus provides cheap serialization and broadcast, which a general network does not.

  8. Consequent differences in protocols and applications • Bigger block size • Cost amortization and a higher hit ratio for larger blocks • Reduced overhead • But as a consequence: • Migration vs. replication • False sharing increases • The DSM protocol is more complex: it must handle lost, corrupted, and out-of-order packets • All of the above, coupled with the cost of traps, makes SDSM consistency much more expensive!

  9. Results of high consistency costs • Manage sharing more carefully • Align data to page boundaries

  10. Consistency Models • Sequential Consistency • All processors observe the same order of memory operations • That order must correspond to some serial interleaving • The only ordering constraint is that each processor's own reads/writes appear in its program order; there are no restrictions on the relative ordering between processors.
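
A standard illustration (hypothetical variables, not from the slides): x and y are shared and initially 0.

     /* Thread 1 */            /* Thread 2 */
     x = 1;                    y = 1;
     r1 = y;                   r2 = x;

     /* Under sequential consistency, (r1,r2) = (0,0) is impossible: it would
        require both reads to precede both writes, and no single interleaving
        that respects each thread's program order allows that. */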

  11. Common consistency protocols • Write update • Multicast the update to all replicas • Write invalidate • Invalidate the cached copies held by other processors (p2, p3 in the slide's figure) • A later access to X by p2/p3 then misses • Valid data is fetched from another cache
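
A minimal sketch of the two write handlers (everything here — page_t, word_t, the messaging helpers, and copyset, the set of processors holding a replica — is illustrative, not from the slides):

     typedef enum { WRITE_UPDATE, WRITE_INVALIDATE } proto_t;

     /* Hypothetical handler run when the local processor writes to a page. */
     void handle_local_write(proto_t proto, page_t *pg, word_t val) {
         if (proto == WRITE_UPDATE) {
             write_local(pg, val);
             multicast_update(pg->copyset, val);   /* push new value to replicas */
         } else {
             multicast_invalidate(pg->copyset);    /* replicas drop their copies */
             pg->copyset = only_self();            /* sole valid copy is now local */
             write_local(pg, val);
         }
     }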

  12. Conventional Implementation • As proposed by Li & Hudak (ACM TOCS ’89; earlier version in PODC ’86). • Use virtual memory to implement sharing. • Shared memory is divided up by virtual-memory pages. • Use a single-writer, multiple-reader write-invalidate coherence protocol. • Keep pages in one of three states: • invalid, read-only, read-write
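
The virtual-memory mechanics can be sketched with mprotect and a SIGSEGV handler (illustrative only: dsm_fetch and the fixed page size are assumptions, and a real system must also distinguish read faults from write faults):

     #include <signal.h>
     #include <sys/mman.h>

     #define PAGE_SIZE 4096              /* assumed page size */

     extern void dsm_fetch(void *page);  /* hypothetical: run the coherence
                                            protocol over the network */

     /* Page states map onto protection bits:
          invalid    -> PROT_NONE                (any access traps)
          read-only  -> PROT_READ                (writes trap)
          read-write -> PROT_READ | PROT_WRITE   (no traps)        */

     static void dsm_fault(int sig, siginfo_t *si, void *ctx) {
         void *page = (void *)((unsigned long)si->si_addr & ~(PAGE_SIZE - 1UL));
         dsm_fetch(page);                /* bring the page up to date */
         mprotect(page, PAGE_SIZE, PROT_READ | PROT_WRITE);   /* upgrade state */
     }

     void dsm_init(void) {
         struct sigaction sa;
         sa.sa_sigaction = dsm_fault;    /* faults on protected pages land here */
         sa.sa_flags = SA_SIGINFO;
         sigemptyset(&sa.sa_mask);
         sigaction(SIGSEGV, &sa, NULL);
     }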

  13. Example [diagram: shared memory across proc0 … procN; the next slides step through accesses to it]

  14. Example: Read Access Hit [diagram]

  15. Example: Write Access Hit [diagram]

  16. Example: Read Access Miss [diagram]

  17. Example: Read Fault [diagram]

  18. Example: Replication on Read [diagram]

  19. Example: Write Access Miss [diagram]

  20. Example: Write Fault [diagram]

  21. Example: Write Invalidation [diagram]

  22. Example: Write Access to Read-Only [diagram]

  23. Example: Write Fault [diagram]

  24. Example: Write Invalidation [diagram]
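
The transitions those slides walk through can be summarized in code (a sketch; the two messaging helpers are hypothetical):

     typedef enum { INVALID, READ_ONLY, READ_WRITE } page_state_t;

     extern void fetch_copy_from_owner(void);    /* hypothetical messaging */
     extern void invalidate_other_copies(void);  /* hypothetical messaging */

     /* Local page-state transitions under single-writer / multiple-reader
        write-invalidate. */
     page_state_t on_read_fault(page_state_t s) {
         if (s == INVALID)
             fetch_copy_from_owner();    /* replicate the page locally */
         return READ_ONLY;               /* later reads hit */
     }

     page_state_t on_write_fault(page_state_t s) {
         if (s == INVALID)
             fetch_copy_from_owner();    /* need the current data first */
         invalidate_other_copies();      /* enforce the single writer */
         return READ_WRITE;              /* this processor becomes owner */
     }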

  25. How to Remember Locations? • Broadcast on miss (as in SMP). • Static home. • Dynamic home or owner.

  26. Ownership and Owner Location • Owner is the last writer. • Owner maintains the copyset (the set of processors holding a copy). • Every processor maintains a probable owner (not always the real owner).

  27. Ownership Location • Every read or write miss is sent to (local) probable owner. • If owner, handle appropriately, else forward to probable owner.

  28. Ownership Modification • If write miss, new writer becomes owner, and all forwarders set probable owner to requester. • If read miss, set probable owner to responding processor.

  29. Example • Initially, owner(page0) = p0, and probable owner(page0) = p0 everywhere. • Write miss by p1: p1 sends a message to its probable owner (p0); it is handled there; the new owner is p1, and probable owner(page0) on p0 becomes p1. • Read miss by p2: p2 sends a message to its probable owner (p0); the request is forwarded to the probable owner (p1) and handled there; probable owner(page0) on p2 becomes p1.
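
A sketch of the forwarding logic each processor runs (names are illustrative; forward and send_page stand in for the DSM's messaging):

     typedef struct {
         int probable_owner;   /* this processor's guess for the page */
         int is_owner;         /* is this processor the real owner?   */
     } page_info_t;

     void handle_request(page_info_t *pg, int requester, int is_write) {
         if (!pg->is_owner) {
             forward(pg->probable_owner, requester, is_write);  /* chase the chain */
             if (is_write)
                 pg->probable_owner = requester;  /* requester will become owner */
             return;
         }
         send_page(requester);                    /* reply with the data */
         if (is_write) {
             pg->is_owner = 0;                    /* ownership moves to the writer */
             pg->probable_owner = requester;
         }
     }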

  30. Implementing synchronization • Synchronization (barriers, locks) is likewise implemented with messages.

  31. Barriers • Designate one processor as barrier manager. • When a process waits at a barrier, it sends an arrival message to the barrier manager and waits. • When barrier manager has received all messages, it sends a departure message to all processes.
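
A sketch of slide 31's scheme (send/recv are hypothetical stand-ins for the system's messaging):

     /* Manager: collect one arrival per process, then release everyone. */
     void barrier_manager(int nprocs) {
         for (int arrived = 0; arrived < nprocs; arrived++)
             recv_arrival();             /* blocking receive */
         for (int p = 0; p < nprocs; p++)
             send_departure(p);          /* wake every waiter */
     }

     /* Client: announce arrival, then block until released. */
     void barrier_wait(void) {
         send_arrival(manager);
         recv_departure();
     }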

  32. Locks • Designate one process as the lock manager for a particular lock. • When a process wants to acquire the lock, it sends an acquire message to the manager and waits. • The manager forwards the message to the last acquirer. • If the lock is free, a lock grant message is sent. • If the lock is held, the request is held until the lock is free, and the grant is sent then.
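
A sketch of that protocol (illustrative names; the send_* calls stand in for messaging):

     int last_acquirer;            /* manager state: tail of the request chain */
     int lock_held, pending = -1;  /* per-process state at the current holder  */

     /* Manager: forward the request to the last acquirer, who will grant it. */
     void manager_handle_acquire(int requester) {
         send_forward(last_acquirer, requester);
         last_acquirer = requester;     /* requester is now last in the chain */
     }

     /* Holder side: grant immediately if free, else remember the request. */
     void holder_handle_forward(int requester) {
         if (lock_held)
             pending = requester;       /* grant at release time */
         else
             send_grant(requester);
     }

     void release_lock(void) {
         lock_held = 0;
         if (pending >= 0) {
             send_grant(pending);       /* pass the lock along */
             pending = -1;
         }
     }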

  33. Problem: False Sharing • Concurrent access to different data within the same consistency unit. • With page as consistency unit, lots of opportunity for false sharing. • Two flavors: • read-write • write-write

  34. Read-Write False Sharing [diagram: x and y laid out on the same page]

  35. Read-Write False Sharing (Cont.) [diagram: one process repeatedly writes and then reads x while another reads y, up to a synch point]

  36. Read-Write False Sharing (Cont.) [diagram, continued]

  37. Write-Write False Sharing [diagram: one process repeatedly writes x while another writes y, up to a synch point]

  38. Summary • Software shared memory on distributed memory hardware. • Uses virtual memory. • Home migration to improve locality • important because of high latencies. • Sequential consistency suffers from false sharing
