
Multiprocessing



Presentation Transcript


  1. Multiprocessing & Cache Coherency

  2. What is multiprocessing (REVIEW)
  • Computer system – supports several simultaneous processes
  • All OSes support multiprocessing
  • More complex – must share system resources
  • ILP running out of steam
  • Today's CPUs are Chip MultiProcessors (CMPs)

  3. Multiple Processes – One CPU (review)
  [Figure: each task has its own stack, priority, and saved CPU registers (its context) in memory; the processor holds the registers of the currently running task.]

  4. Context-Switch to Share the CPU (review)
  [Figure: timeline of time-slices separated by context switches.]
  • Time-slicing
  • Time-slice: period of time a task runs before a context switch
  • Driven by hardware interrupts from the system timer and kernel scheduling
  • Preemption: the currently running task is halted and switched out by a higher-priority task
  • Typical in embedded and real-time systems

  5. Process State (review)
  • A process can be in one of many states
  [State diagram: task create moves Dormant → Ready; a context switch moves Ready ↔ Running; a running task can wait for an event (→ Waiting, back to Ready when the event occurs), delay for n ticks (→ Delayed, back to Ready when the delay expires), or be interrupted (→ Interrupted, resuming Running on return); task delete returns any state to Dormant.]

  6. Extensions of the Memory System to Scale
  [Figure: left – centralized memory ("dance hall", UMA): processors P1..Pn with caches ($) reach shared memory modules (Mem) through an interconnection network; right – distributed memory (NUMA): each processor–cache pair has local memory and communicates through the interconnection network.]

  7. Symmetric Multiprocessors
  [Figure: processors on a CPU–memory bus with memory; a bridge connects an I/O bus carrying I/O controllers for graphics output, networks, and disks.]
  • Symmetric: all memory is equally far away from all processors
  • Any processor can do any I/O (set up a DMA transfer)

  8. Bus-Based Symmetric Shared Memory
  [Figure: processors P1..Pn with caches ($) on a shared bus with memory and I/O devices.]
  • On chip → building blocks for larger systems; already on the desktop
  • Attractive for servers and parallel programs
  • Fine-grain resource sharing
  • Uniform access via loads/stores
  • Automatic data movement and coherent replication in caches
  • Cheap and powerful extension
  • Normal uniprocessor mechanisms to access data

  9. SMP Example: Connecting IBM Power Chips
  • 8-way SMP
  • Each CMP has 2 cores

  10. Parallel Programming Models
  • Programming model: languages and libraries create an abstract view of the machine
  • Control: how is parallelism created; operation ordering; synchronization control
  • Data: private vs. shared; how is shared data accessed/communicated
  • Synchronization: what operations can be used; what are the atomic (indivisible) operations?

  11. Programming Model 1: Shared Memory
  [Figure: threads P0..Pn, each with a private variable i, all reading and writing a shared variable s in shared memory.]
  • Program: collection of threads with private variables AND shared variables (e.g., static variables, shared common blocks)
  • Threads communicate implicitly by writing/reading shared variables
  • Thread coordination by synchronizing on shared variables

  12. Synchronization Techniques
  • Mutexes – mutual exclusion locks (binary semaphores): threads are mostly independent but must access common data
    lock *l = alloc_and_init(); /* shared */
    lock(l);
    ... access data ...
    unlock(l);
  • Barrier – global (coordinated) synchronization; simplest use: all threads hit the same one
    work_on_my_subgrid();
    barrier;
    read_neighboring_values();
    barrier;
  • Need atomic operations bigger than loads/stores: atomic swap, test-and-test-and-set
  • Transactional memory: hardware equivalent of optimistic concurrency; solves many parallel-programming problems

  13. Programming Model 2: Message Passing
  [Figure: processes P0..Pn, each with private memory (its own s and i), connected by a network; P1 executes send(Pn, s) and Pn executes receive(P1, s).]
  • Program: a collection of processes, usually fixed at program startup
  • Each process has a local address space – NO shared data
  • Logically shared data is partitioned among the processes
  • Processes communicate by explicit send/receive pairs
  • Coordination is implicit in every communication event
  • MPI (Message Passing Interface) is the most commonly used software

  14. MPI – De Facto Standard
  • MPI has become the de facto standard for parallel computing using message passing
  • Example (FYI):
    for (i = 1; i < numprocs; i++) {
        sprintf(buff, "Hello %d! ", i);
        MPI_Send(buff, BUFSIZE, MPI_CHAR, i, TAG, MPI_COMM_WORLD);
    }
    for (i = 1; i < numprocs; i++) {
        MPI_Recv(buff, BUFSIZE, MPI_CHAR, i, TAG, MPI_COMM_WORLD, &stat);
        printf("%d: %s\n", myid, buff);
    }
  • Pros and cons of standards: MPI is the standard for development in the HPC community → portability; but the MPI standard is built on mid-80s technology

  15. Shared Memory vs. Message Passing
  • Advantages of shared memory: implicit communication (loads/stores); low overhead when cached
  • Disadvantages of shared memory: complex to scale well; requires synchronization operations
  • Advantages of message passing: explicit communication (sending/receiving of messages); easier to control data placement (no automatic caching)
  • Disadvantages of message passing: high message-passing overhead; complex to program
  • Due to CMPs, cache-coherent shared memory will be the dominant form of multiprocessor

  16. Caches and Cache Coherence
  • Caches play a key role: they reduce average data access time and reduce bandwidth demands on the shared interconnect
  • But private processor caches create a problem: copies of a variable can be present in multiple caches, and a write by one processor may not become visible to others, which keep a stale value in their caches
  • Solution: cache-snooping architectures & protocols

  17. Example Cache Coherence Problem
  [Figure: processors P1, P2, P3 with caches ($) on a bus with memory and I/O devices; memory holds u:5. Events: (1) P1 reads u; (2) P3 reads u; (3) P3 writes u = 7; then P1 and P2 read u.]
  • Processors see different values for u after event 3
  • With write-back caches, the value written back to memory depends on which cache flushes, and when
  • Processes accessing main memory see a stale value

  18. Problems with Parallel I/O
  [Figure: processor and cache on the memory bus with physical memory; DMA transfers move data between the disk and the cached portions of a page.]
  • Memory → disk: physical memory may be stale if the cache copy is dirty
  • Disk → memory: the cache may hold stale data and not see the memory writes
  • Use non-cacheable pages to solve

  19. Snoopy Cache-Coherence Protocols
  [Figure: cache line with state, address (tag), and data fields.]
  • The cache controller "snoops" all transactions on the shared bus
  • A transaction is relevant if it is for a block the cache contains
  • Take action to ensure coherence – invalidate, update, or supply the value – depending on the state of the block and the protocol

  20. Write-through Invalidate Protocol
  [Figure: processors P1..Pn with state/tag/data caches ($) on a bus with memory and I/O devices.]
  • Basic bus-based protocol: each processor has a cache with per-block state; all transactions over the bus are snooped
  • Writes invalidate all other caches: there can be multiple simultaneous readers of a block, but a write invalidates them
  • Two states per block in each cache: state bits are associated with blocks that are in the cache; other blocks can be seen as being in the invalid (not-present) state in that cache

  21. Example: Write-through Invalidate
  [Figure: same setup as slide 17; this time, when P3 writes u = 7 (event 3), the write goes through to memory and invalidates the other cached copies, so the subsequent reads return u = 7.]

  22. Write-through vs. Write-back
  • The write-through protocol is simple: every write is observable
  • But every write goes on the bus → only one write can take place at a time in any processor
  • Uses a lot of bandwidth! Example: 200 MHz, dual issue, CPI = 1, 15% stores of 8 bytes → 30 M stores per second per processor → 240 MB/s per processor!

  23. Invalidate vs. Update
  • Basic question of program behavior: is a block written by one processor later read by others before it is overwritten?
  • Invalidate – yes: readers will take a miss; no: multiple writes without additional traffic
  • Update – yes: avoids misses on later references; no: multiple useless updates

  24. Coherent Memory System
  • Reading a location should return the latest value written by any process
  • Easy in uniprocessors, except for I/O; I/O is infrequent, so software solutions work (e.g., non-cacheable operations)
  • The coherence problem is more pervasive and more performance-critical in multiprocessors

  25. Coherence Means "As If No Cache Exists"
  1. Operations issued by any process occur in the order issued by that process, and
  2. The value returned by a read is the last value written to that location in the serial order
  • Two necessary features:
  • Write propagation: a written value must become visible to others
  • Write serialization: writes to a location are seen in the same order by all – if I see w1 after w2, you should not see w2 before w1

  26. Two Hardware Cache Coherence Solutions
  • "Snoopy" schemes: rely on broadcast to observe all coherence traffic; well suited to buses and small-scale systems
  • Directory schemes: use centralized information to avoid broadcast; scale well to large numbers of processors

  27. Snoopy Cache Protocols
  • All coherence-related activity is broadcast to all processors on a bus (e.g., the MESI protocol)
  • Each processor monitors ("snoops") bus actions and reacts when activity is relevant to its current cache contents:
  • If another processor wishes to write to a line, you may need to "invalidate" (i.e., discard) the copy in your own cache
  • If another processor wishes to read a line for which you have a dirty copy, you may need to supply it

  28. MESI Invalidate Cache Protocol
  [Figure: each cache line has a tag – address tag plus state bits.]
  • 4 states (per cache block/line):
  • Invalid (I)
  • Shared (S): two or more caches have a copy
  • Modified (M, dirty): exactly one cache has a modified copy
  • Exclusive (E): only this cache has a copy, and it is unmodified
  • Implemented in most commercial processors: Core Duo, Core 2, IBM Power, ...

  29. MESI Protocol
  • Modified / Exclusive / Shared / Invalid
  • Upon loading, a line is marked E; subsequent reads are OK; a write marks it M
  • If another processor's load is seen, mark the line S
  • A write to an S line sends I (invalidate) to all other caches and marks the line M
  • If another processor reads an M line, write it back and mark it S
  • A read or write to an I line misses

  30. Snoop with Level-2 Caches Possible
  [Figure: four CPUs, each with an L1 and an L2 cache; a snooper sits on each L2.]
  • Processors have two-level caches
  • Inclusion property: entries in the IL1 & DL1 are also in the L2
  • Invalidation in L2 → invalidation in L1
  • Snooping on L2 does not affect CPU–L1 bandwidth

  31. Cache Coherent System Summary
  • Provide a set of states, a state transition diagram, and actions
  • Managing the coherence protocol means:
  • (0) Determine when to invoke the coherence protocol
  • (a) Find info about the state of the block in other caches to determine the action – whether we need to communicate with other cached copies
  • (b) Locate the other copies
  • (c) Communicate with those copies (invalidate/update)
  • (0) is done the same way on all systems: the state of the line is maintained in the cache, and the protocol is invoked if an "access fault" occurs on the line
  • Different approaches are distinguished by how they do (a) to (c)

  32. Bus-based Coherence Summary
  • All of (a), (b), (c) are done through broadcast on the bus: the faulting processor sends out a "search," and the others respond to the probe and take the necessary action
  • Could be done in a scalable network too: broadcast to all processors and let them respond
  • Conceptually simple, but broadcast doesn't scale with p: on a bus, bus bandwidth doesn't scale; on a scalable network, every fault leads to at least p network transactions
  • Scalable coherence: can use the same cache states and state transition diagram, with different mechanisms to manage the protocol

  33. A More Scalable Coherency Approach: Directories
  • Every memory block/line has an associated directory entry that tracks the copies of the cached block and their states
  • On a miss, find the directory entry and communicate only with the nodes that have copies
  • In scalable networks, communication with the directory and the copies is through network transactions
  • Alternatives exist for organizing the directory information; for k processors: with each cache block in memory, k presence bits and 1 dirty bit; with each cache block in a cache, 1 valid bit and 1 dirty (owner) bit

  34. Directory Operation
  • Read from memory by processor i:
  • If dirty-bit OFF then { read from main memory; turn p[i] ON; }
  • If dirty-bit ON then { recall the line from the dirty proc; update memory; turn dirty-bit OFF; turn p[i] ON; supply the recalled data to i; }
  • Write to memory by processor i:
  • If dirty-bit OFF then { send invalidations to all caches that have the block; turn dirty-bit ON; supply data to i; turn p[i] ON; ... }
  • (As before: for k processors, each cache block in memory has k presence bits and 1 dirty bit; each cache block in a cache has 1 valid bit and 1 dirty (owner) bit)
