
ECE 1747: Parallel Programming

Presentation Transcript


  1. ECE 1747: Parallel Programming Basics of Parallel Architectures: Shared-Memory Machines

  2. Two Parallel Architectures • Shared memory machines. • Distributed memory machines.

  3. Shared Memory: Logical View [Figure: processors proc1, proc2, proc3, …, procN all connected to a single shared memory space]

  4. Shared Memory Machines • Small number of processors: shared memory with coherent caches (SMP). • Larger number of processors: distributed shared memory with coherent caches (CC-NUMA).

  5. SMPs • 2- or 4-processor PCs are now commodity hardware. • Good price/performance ratio. • Memory is sometimes the bottleneck (see later). • Typical price (8-node): ~ $20-40k.

  6. Physical Implementation [Figure: processors proc1…procN, each with a private cache (cache1…cacheN), connected by a shared-memory bus to a single shared memory]

  7. Shared Memory Machines • Small number of processors: shared memory with coherent caches (SMP). • Larger number of processors: distributed shared memory with coherent caches (CC-NUMA).

  8. CC-NUMA: Physical Implementation [Figure: processors proc1…procN, each with a private cache (cache1…cacheN) and a local memory (mem1…memN), connected by an interconnect]

  9. Caches in Multiprocessors • Suffer from the coherence problem: • the same line appears in two or more caches • one processor writes a word in the line • the other processors can now read stale data • Leads to the need for a coherence protocol • avoids coherence problems • Many exist; we will just look at a simple one.

  10. What is coherence? • What does it mean to be shared? • Intuitively, read last value written. • Notion is not well-defined in a system without a global clock.

  11. The Notion of “last written” in a Multi-processor System [Figure: four processor timelines; P1 and P2 each perform w(x), P0 and P3 each perform r(x); without a global clock it is unclear which write is “last”]

  12. The Notion of “last written” in a Single-machine System [Figure: a single timeline with w(x), w(x), r(x), r(x) in order; on one machine all accesses fall in one global order, so “last written” is unambiguous]

  13. Coherence: a Clean Definition • Is achieved by referring back to the single machine case. • Called sequential consistency.

  14. Sequential Consistency (SC) • Memory is sequentially consistent if and only if it behaves “as if” the processors were executing in a time-shared fashion on a single machine.

  15. Returning to our Example [Figure: the four-processor timeline from slide 11 revisited: P1 and P2 write x, P0 and P3 read x; SC requires the outcome to match some single interleaving of these accesses]

  16. Another Way of Defining SC • All memory references of a single process execute in program order. • All writes are globally ordered.

  17. SC: Example 1 Initial values of x and y are 0. [Figure: two-processor execution with operations w(x,1), w(y,1), r(x), r(y)] What are the possible final values?
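
A minimal sketch in C (pthreads) of this kind of two-thread litmus test. The assignment of operations to threads is an assumption for illustration: thread 0 performs w(x,1) then r(y), and thread 1 performs w(y,1) then r(x). Under sequential consistency the outcome r1 = r2 = 0 is impossible; note that real compilers and CPUs do not guarantee SC for plain variables, so actual hardware would need atomics or fences to enforce this.

    #include <pthread.h>
    #include <stdio.h>

    int x = 0, y = 0;          /* shared variables, both initially 0 */
    int r1, r2;                /* values read by the two threads */

    void *thread0(void *arg) { (void)arg; x = 1; r1 = y; return NULL; }  /* w(x,1); r(y) */
    void *thread1(void *arg) { (void)arg; y = 1; r2 = x; return NULL; }  /* w(y,1); r(x) */

    int main(void) {
        pthread_t t0, t1;
        pthread_create(&t0, NULL, thread0, NULL);
        pthread_create(&t1, NULL, thread1, NULL);
        pthread_join(t0, NULL);
        pthread_join(t1, NULL);
        /* Under SC, any interleaving preserves each thread's program order,
           so at least one write precedes the corresponding read:
           (r1,r2) may be (0,1), (1,0) or (1,1), but never (0,0). */
        printf("r1=%d r2=%d\n", r1, r2);
        return 0;
    }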

  18. SC: Example 2 [Figure: two-processor execution with operations w(x,1), w(y,1), r(y), r(x)]

  19. SC: Example 3 [Figure: two-processor execution with the same operations w(x,1), w(y,1), r(y), r(x)]

  20. SC: Example 4 [Figure: two-processor execution on a single variable: operations r(x), w(x,1), w(x,2), r(x)]

  21. Implementation • Many ways of implementing SC. • In fact, implementations sometimes enforce even stronger conditions. • We will look at a simple one: the MSI protocol.

  22. Physical Implementation [Figure: the bus-based machine again: processors proc1…procN, each with a private cache (cache1…cacheN), connected by a bus to a single shared memory]

  23. Fundamental Assumption • The bus is a reliable, ordered broadcast bus. • Every message sent by a processor is received by all other processors in the same order. • Also called a snooping bus • Processors (or caches) snoop on the bus.

  24. States of a Cache Line • Invalid • Shared • read-only, one of many cached copies • Modified • read-write, sole valid copy

  25. Processor Transactions • processor read(x) • processor write(x)

  26. Bus Transactions • bus read(x) • asks for copy with no intent to modify • bus read-exclusive(x) • asks for copy with intent to modify

  27. State Diagram: Step 0 [Figure: three states, I (Invalid), S (Shared), M (Modified); no transitions yet. The edges added in the following steps are labeled “observed event / resulting bus action”.]

  28. State Diagram: Step 1 Add I --PrRd/BuRd--> S: a processor read miss issues a bus read and the line becomes Shared.

  29. State Diagram: Step 2 Add PrRd/- : a read hit on a Shared line needs no bus transaction.

  30. State Diagram: Step 3 Add I --PrWr/BuRdX--> M: a write miss issues a bus read-exclusive and the line becomes Modified.

  31. State Diagram: Step 4 Add S --PrWr/BuRdX--> M: a write to a Shared line must first gain exclusive ownership.

  32. State Diagram: Step 5 Add PrWr/- on M: reads and writes hit in a Modified line with no bus traffic.

  33. State Diagram: Step 6 Add M --BuRd/Flush--> S: a snooped bus read forces the owner to flush the line and drop to Shared.

  34. State Diagram: Step 7 Add BuRd/- on S: a snooped bus read leaves other Shared copies untouched.

  35. State Diagram: Step 8 Add S --BuRdX/- --> I: a snooped bus read-exclusive invalidates Shared copies.

  36. State Diagram: Step 9 Add M --BuRdX/Flush--> I: a snooped bus read-exclusive forces the owner to flush the line and invalidate its copy.
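
The same protocol can be written down as a per-line transition function. A minimal sketch in C, assuming a single cache line; bus_read, bus_read_exclusive, and flush are illustrative stand-ins for the real bus actions:

    typedef enum { INVALID, SHARED, MODIFIED } line_state;   /* the three MSI states */

    /* Stand-ins for the bus transactions this cache would issue. */
    static void bus_read(void)           { /* place a BuRd on the bus */ }
    static void bus_read_exclusive(void) { /* place a BuRdX on the bus */ }
    static void flush(void)              { /* supply / write back the modified line */ }

    /* Processor-side events */
    static line_state on_proc_read(line_state s) {
        if (s == INVALID) { bus_read(); return SHARED; }   /* PrRd/BuRd */
        return s;                                          /* PrRd/- : read hit in S or M */
    }
    static line_state on_proc_write(line_state s) {
        if (s != MODIFIED) bus_read_exclusive();           /* PrWr/BuRdX from I or S */
        return MODIFIED;                                   /* PrWr/- : write hit in M */
    }

    /* Snooped bus events from other processors */
    static line_state on_bus_read(line_state s) {
        if (s == MODIFIED) { flush(); return SHARED; }     /* BuRd/Flush */
        return s;                                          /* BuRd/- : S stays S, I stays I */
    }
    static line_state on_bus_read_exclusive(line_state s) {
        if (s == MODIFIED) flush();                        /* BuRdX/Flush */
        return INVALID;                                    /* BuRdX/- : S (or I) goes to I */
    }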

  37. In Reality • Most machines use a slightly more complicated protocol (4 states instead of 3). • See architecture books (MESI protocol).

  38. Problem: False Sharing • Occurs when two or more processors access different data in the same cache line, and at least one of them writes. • Leads to a ping-pong effect.

  39. False Sharing: Example (1 of 3)
  #pragma omp parallel for schedule(cyclic)   /* cyclic distribution of iterations; schedule(static,1) in standard OpenMP */
  for( i=0; i<n; i++ )
      a[i] = b[i];
  • Let’s assume: • p = 2 • an element of a takes 4 words • a cache line has 32 words

  40. False Sharing: Example (2 of 3) [Figure: one cache line holding a[0]…a[7]; with the cyclic schedule and p = 2, the even-indexed elements are written by processor 0 and the odd-indexed elements by processor 1]

  41. False Sharing: Example (3 of 3) [Figure: the cache line ping-pongs between P0 and P1: each write (a[0], a[1], a[2], a[3], a[4], a[5], …) invalidates the other processor’s copy and forces the data to be transferred back]
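
Two common remedies, sketched below in C/OpenMP; the array sizes, the per-thread counter example, and the 64-byte cache-line size are assumptions for illustration, not from the slides:

    #include <omp.h>

    #define N 1024
    double a[N], b[N];

    /* Remedy 1: give each processor a contiguous block of iterations
       (the default static schedule), so each cache line of a[] is
       written by only one processor. */
    void copy_blocked(void) {
        #pragma omp parallel for schedule(static)
        for (int i = 0; i < N; i++)
            a[i] = b[i];
    }

    /* Remedy 2: pad/align per-thread data to a cache-line boundary
       (64 bytes assumed) so no two threads write the same line. */
    struct padded_counter {
        _Alignas(64) long value;
    };
    struct padded_counter counts[8];   /* one element per thread, one line each */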

  42. Summary • Sequential consistency. • Bus-based coherence protocols. • False sharing.

  43. Algorithms for Scalable Synchronization on Shared-Memory Multiprocessors J.M. Mellor-Crummey, M.L. Scott (MCS Locks)

  44. Introduction • Busy-waiting techniques – heavily used in synchronization on shared memory MPs • Two general categories: locks and barriers • Locks ensure mutual exclusion • Barriers provide phase separation in an application

  45. Problem • Busy-waiting synchronization constructs tend to: • have a significant impact on network traffic due to cache invalidations • suffer from contention, which leads to poor scalability • Main cause: spinning on remote variables

  46. The Proposed Solution • Minimize access to remote variables • Instead, spin on local variables • Claim: • it can be done entirely in software (no need for fancy and costly hardware support) • spinning on local variables minimizes contention and allows for good scalability and good performance (see the sketch below)
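
A minimal sketch of the local-spinning idea as an array-based queueing lock in C11 atomics: each waiter spins on its own cache-line-aligned slot, and the releaser flips exactly one remote flag. This is only an illustration of the principle, not the MCS list-based lock itself; the 64-byte line size and MAXTHREADS bound are assumptions.

    #include <stdatomic.h>

    #define MAXTHREADS 64              /* upper bound on concurrent waiters (assumed) */
    #define LINE 64                    /* assumed cache-line size in bytes */

    typedef struct {
        struct { _Alignas(LINE) atomic_int must_wait; } slot[MAXTHREADS];
        atomic_uint next;              /* next slot to hand out */
    } alock_t;

    void alock_init(alock_t *l) {
        for (int i = 0; i < MAXTHREADS; i++)
            atomic_init(&l->slot[i].must_wait, i != 0);   /* slot 0 starts as the holder */
        atomic_init(&l->next, 0);
    }

    unsigned alock_acquire(alock_t *l) {
        unsigned me = atomic_fetch_add(&l->next, 1) % MAXTHREADS;
        while (atomic_load(&l->slot[me].must_wait))
            ;                          /* spin on my own slot: stays in my cache */
        return me;                     /* caller passes this back to release */
    }

    void alock_release(alock_t *l, unsigned me) {
        atomic_store(&l->slot[me].must_wait, 1);                      /* re-arm my slot */
        atomic_store(&l->slot[(me + 1) % MAXTHREADS].must_wait, 0);   /* wake the next waiter */
    }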

  47. Spin Lock 1: Test-and-Set Lock • Repeatedly test-and-set a boolean flag indicating whether the lock is held • Problem: contention for the flag (read-modify-write instructions are expensive) • Causes lots of network traffic, especially on cache-coherent architectures (because of cache invalidations) • Variation: test-and-test-and-set – less traffic
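
A sketch of both variants in C11 atomics (an atomic_flag for plain test-and-set, an atomic_bool for the test-and-test-and-set variant); illustrative code, not taken from the paper:

    #include <stdatomic.h>
    #include <stdbool.h>

    /* Plain test-and-set: every spin iteration is a read-modify-write,
       generating bus traffic and invalidations. */
    atomic_flag tas_lock = ATOMIC_FLAG_INIT;

    void tas_acquire(void) {
        while (atomic_flag_test_and_set(&tas_lock))
            ;   /* spin */
    }
    void tas_release(void) {
        atomic_flag_clear(&tas_lock);
    }

    /* Test-and-test-and-set: spin with plain reads (which hit in the local
       cache) and attempt the expensive read-modify-write only when the
       lock appears free. */
    atomic_bool ttas_lock = false;

    void ttas_acquire(void) {
        for (;;) {
            while (atomic_load(&ttas_lock))
                ;   /* read-only spinning on the cached copy */
            if (!atomic_exchange(&ttas_lock, true))
                return;   /* acquired */
        }
    }
    void ttas_release(void) {
        atomic_store(&ttas_lock, false);
    }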

  48. Test-and-Set with Backoff Lock • Pause between successive test-and-set attempts (“backoff”) • T&S with backoff idea:
  while test&set(L) fails {
      pause(delay);
      delay = delay * 2;
  }
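
The same idea in C11 atomics; the microsecond delays, the cap, and the use of usleep are assumptions for illustration:

    #include <stdatomic.h>
    #include <unistd.h>                 /* usleep */

    atomic_bool bo_lock = false;        /* shared lock word, false = free */

    void backoff_acquire(void) {
        unsigned delay = 1;             /* initial pause, in microseconds */
        while (atomic_exchange(&bo_lock, true)) {   /* test&set(L) failed */
            usleep(delay);              /* pause(delay) */
            if (delay < 1024)
                delay *= 2;             /* delay = delay * 2, with a cap */
        }
    }
    void backoff_release(void) {
        atomic_store(&bo_lock, false);
    }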

  49. Spin Lock 2: The Ticket Lock • Two counters (nr_requests and nr_releases) • Lock acquire: fetch-and-increment on the nr_requests counter; the processor then waits until its “ticket” equals the value of the nr_releases counter • Lock release: increment the nr_releases counter

  50. Spin Lock 2: The Ticket Lock • Advantage over T&S: polls with read operations only • Still generates lots of traffic and contention • Can further improve by using backoff
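
A minimal C11-atomics sketch using the two counters named on the slides; the backoff improvement mentioned above is omitted:

    #include <stdatomic.h>

    typedef struct {
        atomic_uint nr_requests;        /* next ticket to hand out */
        atomic_uint nr_releases;        /* ticket currently being served */
    } ticket_lock_t;

    void ticket_acquire(ticket_lock_t *l) {
        unsigned my_ticket = atomic_fetch_add(&l->nr_requests, 1);   /* fetch-and-increment */
        while (atomic_load(&l->nr_releases) != my_ticket)
            ;                           /* poll with read operations only */
    }

    void ticket_release(ticket_lock_t *l) {
        atomic_fetch_add(&l->nr_releases, 1);   /* admit the holder of the next ticket */
    }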
