
Software Distributed Shared Memory (SDSM): MultiView SDSM, false sharing. Solution: MultiView.



Presentation Transcript


  1. Software Distributed Shared Memory (SDSM): MultiView
     • SDSM, false sharing.
     • Solution: MultiView.
     • Granularity adaptation.
     • Integrated services.
     Ayal Itzkovitz, Assaf Schuster
     DSM Innovations - MultiView

  2. A multi-core system (simplified)
     [Figure: four cores sharing one local memory]
     • A parallel program may spawn processes (threads) in order to utilize all computing units
     • Processes communicate through shared memory, physically located on the local machine

  3. A distributed system
     [Figure: cores, each with its own local memory, connected by a network; a virtual shared memory spans the nodes]
     • Emulation of the same programming paradigm
     • Ultimately: no changes to source/binary code

  4. The First SDSM System
     • The first SDSM system, Ivy [Li & Hudak, Yale, ‘86]
     • Strict memory semantics (Lamport’s sequential consistency)
     • Page-based: memory pages as units of sharing
     • The major performance limitation: page size → false sharing
     • Page size – 4K (and more)
     • Average object size – 28 bytes → about 150 objects on a page
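The "about 150 objects on a page" figure follows from dividing the page size by the average object size. A minimal sketch of that arithmetic (the function name `objects_per_page` is an illustrative assumption, not part of any DSM system):

```c
#include <assert.h>
#include <stddef.h>

/* Number of whole objects of obj_size bytes that fit on one page.
   Hypothetical helper, used only to reproduce the slide's estimate. */
static size_t objects_per_page(size_t page_size, size_t obj_size)
{
    assert(obj_size > 0);
    return page_size / obj_size;
}
```

With a 4096-byte page and 28-byte objects this yields 146, which the slide rounds to about 150. Any two of those objects written by different hosts make the whole page bounce between them.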

  5. Object Distribution

  6. Object Distribution – Memory View
     [Figure; label: Network]

  7. False Sharing
     “…the conventional wisdom remains that the overhead of false sharing […] in page-based consistency protocols is the primary factor limiting the performance of software SDSM” [Amza, Cox, Rajamani, and Zwaenepoel, PPoPP ‘97]
     “[The] conventional wisdom holds that fine-grain performance and false sharing doom page-based approaches” [Buck and Keleher, IPPS ‘98]

  8. Solution: The MultiView Approach
     • “MultiView and Millipage – Fine-grain Sharing in Page-based SDSMs” [Itzkovitz and Schuster, OSDI ‘99]
     • Implement small-size pages through special memory configuration
     Other Goals:
     • Without compromising the strict memory consistency [ICS’04, EuroPar’04]
     • Utilizing low-latency networks (Myrinet, VIA/ServerNet-II, Infiniband) [Hot-Interconnects’03, IPDPS’04]
     • Transparency [EuroPar’03]
     • Adaptive sharing granularity [ICPP’00, IPDPS’01 best paper]
     • Maximize locality through migration and load sharing [DISC’01]
     • Additional “service layers” (garbage collection, data-race detection) [JPDC’01, JPDC’02]

  9. The Traditional Memory Layout
       struct a { … };
       struct b;
       int x, y, z;
       main() {
           w = malloc(sizeof(struct a));
           v = malloc(sizeof(struct a));
           u = malloc(sizeof(struct b));
           …
       }
     [Figure: variables x, y, z, w, v, u in the traditional memory layout]

  10. The MultiView Technique
      [Figure: variables x, y, z, w, v, u in the traditional layout and in the MultiView layout]

  11. The MultiView Technique
      [Figure: traditional vs. MultiView layout; x, y, z carry protections RW, NA, R]
      • Protection is now set independently
      • Variables reside in the same page but are not shared

  12. The MultiView Technique
      [Figure: traditional vs. MultiView layout; the memory object is mapped through View 1, View 2, View 3]

  13. The MultiView Technique
      [Figure: MultiView memory layout; View 1, View 2, View 3 map the same memory object holding x, y, z, w, v, u]

  14. The MultiView Technique
      [Figure: Host A and Host B, each with a memory object mapped through View 1, View 2, View 3; per-view protections R, RW, NA]

  15. The MultiView Technique
      [Figure: Host A and Host B with View 1, View 2, View 3; per-view protections R, RW, NA]

  16. Enabling Technology
      • Memory-mapped I/O, created for inter-process communication
      [Figure: shared memory object]

  17. Implementation: Millipage
      [Figure: shared memory object]
      • A shared memory object can be used by a single process to provide the desired functionality
      • Windows-NT (Solaris, BSD, Linux)
      • CreateFileMapping(), MapViewOfFileEx() for allocating views
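The slide names the Windows calls. On the POSIX systems it lists in parentheses, a comparable effect can be sketched with `mmap()` over one file descriptor. This is a minimal illustration, not Millipage's code: the helper names `create_memory_object`, `map_view`, and `views_alias_same_memory` are assumptions, and an unlinked temporary file in `/tmp` stands in for the shared memory object.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <sys/mman.h>
#include <unistd.h>

/* Create an unlinked temporary file of `size` bytes that plays the role
   of the memory object. Returns a file descriptor, or -1 on failure. */
static int create_memory_object(size_t size)
{
    char path[] = "/tmp/mv_objXXXXXX";
    int fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);                 /* the object now lives only through fd */
    if (ftruncate(fd, (off_t)size) != 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/* Map one more view of the whole object. Each call returns a different
   virtual address range backed by the same physical pages. */
static void *map_view(int fd, size_t size, int prot)
{
    void *p = mmap(NULL, size, prot, MAP_SHARED, fd, 0);
    return p == MAP_FAILED ? NULL : p;
}

/* Write through a read-write view and read the byte back through a
   separate read-only view. Returns 1 if both views alias the same memory. */
static int views_alias_same_memory(void)
{
    size_t size = (size_t)sysconf(_SC_PAGESIZE);
    int fd = create_memory_object(size);
    if (fd < 0)
        return 0;
    char *v1 = map_view(fd, size, PROT_READ | PROT_WRITE);
    char *v2 = map_view(fd, size, PROT_READ);
    int ok = 0;
    if (v1 && v2) {
        assert(v1 != v2);         /* distinct addresses, same storage */
        v1[0] = 'x';
        ok = (v2[0] == 'x');
    }
    if (v1) munmap(v1, size);
    if (v2) munmap(v2, size);
    close(fd);
    return ok;
}
```

Because each view has its own page-table entries, `mprotect()` on one view leaves the others untouched, which is what lets variables on the same physical page carry different protections.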

  18. Transparency
        mat = malloc(lines*sizeof(int*));
        for(i=0;i<lines;i++) mat[i] = malloc(cols*sizeof(int));
        …
        mat[i][j] = mat[i-1][j]+mat[i][j-1];
        …

        mat = malloc(lines*cols*sizeof(int));
        …
        mat[i][j] = mat[i-1][j]+mat[i][j-1];
        …
      • 1999:
      • Minipages are allocated at malloc time (via malloc-like API)
      • Allocation routines should be slightly modified
      • SOR and LU have not been modified at all
      • WATER: changed ~20 lines out of 783 lines
      • IS: changed 5 lines out of 93 lines
      • TSP: changed ~15 lines out of ~400 lines
      • 2003: complete transparency
      • Through binary instrumentation/interception of OS calls

  19. SOR SPLASH-II Benchmark

  20. Performance with Fixed Granularity (NBodyW on 8 nodes)

  21. False Sharing vs. Prefetching (WATER)

  22. Adapting Granularity
      [Figure; axes: shared data elements, application run time]
      • Adaptation is dynamic, automatic, transparent

  23. Performance (VIA/ServerNet-II, 2004)
      [Figure: two panels, "Water-nsq speedup (one thread per node)" and "Water-nsq speedup (two threads per node)"; x-axis: nodes (1 to 12); y-axis: speedup; legend: SC/MV - fine granularity, HLRC, Mixed consistency, SC/MV - best static granularity, SC/MV - dynamic granularity]

  24. Integrating Data Race Detection
      • Detection in application variable granularity

  25. Integrating Distributed Garbage Collection (Remote Reference Counting)
      • Collection in native application granularity.

  26. Questions?

  27. Types of Parallel Systems
      [Figure; axes: Communication Efficiency, Scalability]
      1. In-core multi-threading
      2. Multi-core/SMP multi-threading
      3. Tightly-coupled cluster, customized interconnect (SGI’s Altix)
      4. Tightly-coupled cluster, off-the-shelf interconnect (InfiniBand)
      5. WAN, Internet, Grid, peer-to-peer
      Traditionally: 1+2 are programmable using shared memory, 3+4 are programmable using message passing, in 5 peer processes communicate with central control only.
      HDSM: systems in 3 move towards presenting a shared memory interface to a physically distributed system.
      What about 4, 5? Software Distributed Shared Memory = SDSM

  28. Matrix Multiplication
        A = malloc(MATSIZE);
        B = malloc(MATSIZE);
        C = malloc(MATSIZE);
        parfor(n) mult(A, B, C);

        mult(id):
          for (line=N*id .. N*(id+1))
            for(col=0..N)
              C[line,col] = multline(A[line],B[col]);
      [Figure: two threads; A and B are read-only matrices (R), C is the write matrix (W)]
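The pseudocode gives each worker a contiguous band of rows of C, so A and B are only read and each row of C has exactly one writer. A minimal C sketch of that partitioning follows. The constants `DIM`, `WORKERS`, and `ROWS_PER_WORKER` and the function `matmul_demo` are assumptions for illustration, and the workers are run one after another instead of in parallel.

```c
#include <assert.h>
#include <stddef.h>

enum { DIM = 4, WORKERS = 2, ROWS_PER_WORKER = DIM / WORKERS };

/* Dot product of row `line` of A with column `col` of B. */
static int multline(int A[DIM][DIM], int B[DIM][DIM], size_t line, size_t col)
{
    int sum = 0;
    for (size_t k = 0; k < DIM; k++)
        sum += A[line][k] * B[k][col];
    return sum;
}

/* Worker `id` computes only its own band of rows of C. */
static void mult(size_t id, int A[DIM][DIM], int B[DIM][DIM], int C[DIM][DIM])
{
    for (size_t line = ROWS_PER_WORKER * id; line < ROWS_PER_WORKER * (id + 1); line++)
        for (size_t col = 0; col < DIM; col++)
            C[line][col] = multline(A, B, line, col);
}

/* Multiply the identity matrix by B and check that C equals B.
   Returns 1 on success. */
static int matmul_demo(void)
{
    int A[DIM][DIM] = {{0}}, B[DIM][DIM], C[DIM][DIM] = {{0}};
    for (size_t i = 0; i < DIM; i++) {
        A[i][i] = 1;
        for (size_t j = 0; j < DIM; j++)
            B[i][j] = (int)(i * DIM + j);
    }
    for (size_t id = 0; id < WORKERS; id++)   /* stands in for parfor */
        mult(id, A, B, C);
    for (size_t i = 0; i < DIM; i++)
        for (size_t j = 0; j < DIM; j++)
            if (C[i][j] != B[i][j])
                return 0;
    return 1;
}
```

False sharing appears when the boundary between two workers' bands of C falls inside one page: neither worker touches the other's rows, yet the page is written by both.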

  29. Matrix Multiplication
      [Figure: A x B = C on two hosts connected by a network; A and B are each sent once; page protections RO and RW]

  30. Matrix Multiplication
      [Figure: A x B = C on two hosts connected by a network; accesses R, R, W; page protections RO and RW]

  31. Matrix Multiplication - False Sharing
      [Figure: A x B = C on two hosts connected by a network; A and B are each sent once; page protections RO, RW, and NA]

  32. Matrix Multiplication - False Sharing
      [Figure: A x B = C on two hosts connected by a network; page protections RO, RW, and NA]

  33. Matrix Multiplication - False Sharing
      [Figure: A x B = C on two hosts connected by a network; page protections RO, RW, and NA]

  34. Matrix Multiplication - False Sharing
      [Figure: A x B = C on two hosts connected by a network; accesses R, R, W; page protections RO and RW]

  35. First Approach: Weak Semantics
      • Example - Release Consistency:
      • Allow multiple writers to a page (assume exclusive update for any portion of the page)
      • Each page has a twin copy
      • At synchronization time, all pages perform “diff” with their twins, and send diffs to managers
      • Managers hold master copies
      [Figure: two hosts, each with an RW page and its twin; "apply diff" arrows to the master copy]
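The twin-and-diff mechanism can be sketched in a few functions. This is a simplified illustration under stated assumptions, not the protocol of any particular system: the page is 16 bytes (`DEMO_PAGE_SIZE`), diffs are byte-granular, and the names `make_twin`, `compute_diff`, `apply_diff`, and `twin_diff_demo` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

enum { DEMO_PAGE_SIZE = 16 };   /* tiny page, for illustration only */

/* A diff records every byte that changed relative to the twin. */
struct diff {
    size_t count;
    size_t offset[DEMO_PAGE_SIZE];
    unsigned char value[DEMO_PAGE_SIZE];
};

/* Taken before the first write: a pristine copy of the page. */
static void make_twin(unsigned char *twin, const unsigned char *page)
{
    memcpy(twin, page, DEMO_PAGE_SIZE);
}

/* At synchronization time: compare the page against its twin. */
static void compute_diff(struct diff *d, const unsigned char *page,
                         const unsigned char *twin)
{
    d->count = 0;
    for (size_t i = 0; i < DEMO_PAGE_SIZE; i++) {
        if (page[i] != twin[i]) {
            d->offset[d->count] = i;
            d->value[d->count] = page[i];
            d->count++;
        }
    }
}

/* At the manager: merge one writer's changes into the master copy. */
static void apply_diff(unsigned char *master, const struct diff *d)
{
    for (size_t i = 0; i < d->count; i++)
        master[d->offset[i]] = d->value[i];
}

/* Two writers update disjoint bytes of the same page; both updates
   survive in the master copy. Returns 1 on success. */
static int twin_diff_demo(void)
{
    unsigned char master[DEMO_PAGE_SIZE] = {0};
    unsigned char page_a[DEMO_PAGE_SIZE], page_b[DEMO_PAGE_SIZE];
    unsigned char twin_a[DEMO_PAGE_SIZE], twin_b[DEMO_PAGE_SIZE];
    struct diff da, db;

    memcpy(page_a, master, DEMO_PAGE_SIZE);
    memcpy(page_b, master, DEMO_PAGE_SIZE);
    make_twin(twin_a, page_a);
    make_twin(twin_b, page_b);

    page_a[0] = 1;              /* writer A touches byte 0 */
    page_b[8] = 2;              /* writer B touches byte 8 */

    compute_diff(&da, page_a, twin_a);
    compute_diff(&db, page_b, twin_b);
    apply_diff(master, &da);
    apply_diff(master, &db);

    return da.count == 1 && db.count == 1 && master[0] == 1 && master[8] == 2;
}
```

Note that `compute_diff` scans the whole page even though each writer changed a single byte, which is the memory and processing overhead the next slide points to.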

  36. First Approach: Weak Semantics
      • Allow memory to reside in an inconsistent state for time intervals
      • Enforce consistency only at synchronization points
      • Reaching a consistent view of the memory requires computation
      • Reduces (but does not always eliminate) false sharing
      • Reduces number of protocol messages
      • Weak memory semantics
      • Involves both memory and processing time overhead
      • Still: coarse-grain sharing (why diff at locations not touched?)

  37. Software DSM Evolution - Weak Semantics
      • IVY, ‘86 (Li & Hudak, Yale)
      Page-grain: relaxed consistency
      • Munin, ‘92: Release Cons. (Rice)
      • Midway, ‘93: Entry Cons. (CMU)
      • Treadmarks, ‘94: Lazy Release Cons. (Rice)
      • Brazos, ‘97: Scope Cons. (Rice)

  38. Software DSM Evolution - Multithreading
      • IVY, ‘86 (Li & Hudak, Yale)
      Page-grain: relaxed consistency
      • Munin, ‘92: Release Cons. (Rice)
      • Midway, ‘93: Entry Cons. (CMU)
      • Treadmarks, ‘94: Lazy Release Cons. (Rice)
      • Brazos, ‘97: Scope Cons. (Rice)
      Multithreading
      • CVM, Millipede, ‘96: multi-protocol (Maryland, Technion)
      • Quarks, ‘98: protocol latency hiding (Utah)

  39. Second Approach: Code Instrumentation
      • Example - Binary Rewriting:
      • Wrap each load and store with instructions that check whether the data is available locally
      Source:
        line += 3;
        v = v - line;
      After compilation:
        load r1, ptr[line]
        load r2, ptr[v]
        add r1, 3h
        store r1, ptr[line]
        sub r2, r1
        store r2, ptr[v]
      After code instrumentation:
        push ptr[line]
        call __check_r
        load r1, ptr[line]
        push ptr[v]
        call __check_r
        load r2, ptr[v]
        add r1, 3h
        push ptr[line]
        call __check_w
        store r1, ptr[line]
        push ptr[line]
        call __done
        sub r2, r1
        push ptr[v]
        call __check_w
        store r2, ptr[v]
        push ptr[v]
        call __done
      After optimization:
        push ptr[line]
        call __check_w
        load r1, ptr[line]
        push ptr[v]
        call __check_w
        load r2, ptr[v]
        add r1, 3h
        store r1, ptr[line]
        push ptr[line]
        call __done
        sub r2, r1
        store r2, ptr[v]
        push ptr[v]
        call __done
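The same idea can be expressed at the C level: every load and store goes through a wrapper that consults a software access table before touching memory. This sketch is only an analogy to the instrumented assembly above. The block size, the table, and the names `check_access`, `checked_load`, `checked_store`, and `fetch_block` are assumptions, and `fetch_block` merely upgrades the state where a real system would run its coherence protocol.

```c
#include <assert.h>
#include <stddef.h>

enum { BLOCK_SIZE = 8, NUM_BLOCKS = 4, MEM_SIZE = BLOCK_SIZE * NUM_BLOCKS };
enum access { ACC_NONE, ACC_READ, ACC_WRITE };

static unsigned char shared_mem[MEM_SIZE];
static enum access block_state[NUM_BLOCKS];   /* starts as ACC_NONE */
static unsigned fetch_count;                  /* counts simulated protocol actions */

/* Stand-in for the DSM protocol: obtain the block with the wanted rights. */
static void fetch_block(size_t block, enum access wanted)
{
    block_state[block] = wanted;
    fetch_count++;
}

/* The check inserted before each access (compare __check_r / __check_w). */
static void check_access(size_t addr, enum access wanted)
{
    size_t block = addr / BLOCK_SIZE;
    if (block_state[block] < wanted)
        fetch_block(block, wanted);
}

static unsigned char checked_load(size_t addr)
{
    check_access(addr, ACC_READ);
    return shared_mem[addr];
}

static void checked_store(size_t addr, unsigned char value)
{
    check_access(addr, ACC_WRITE);
    shared_mem[addr] = value;
}

/* Runs "line += 3; v = v - line;" with line at address 0 and v at
   address 8, i.e. in two different blocks. Returns 1 if the results and
   the number of simulated fetches match expectations. */
static int instrumentation_demo(void)
{
    checked_store(0, (unsigned char)(checked_load(0) + 3));
    checked_store(8, (unsigned char)(checked_load(8) - checked_load(0)));
    return shared_mem[0] == 3
        && shared_mem[8] == (unsigned char)(0 - 3)
        && fetch_count == 4;
}
```

The four fetches are a read miss and a write upgrade for each of the two blocks. Every access also pays for the check itself even when the data is local, which is the single-machine overhead listed among the costs of this approach.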

  40. Second Approach: Code Instrumentation
      • Provides fine-grain access control, thus avoids false sharing
      • Bypasses the page protection mechanism
      • Usually, fixed granularity for all application data (still, false sharing)
      • Needs a special compiler or binary-level rewriting tools
      • Cost:
      • High overheads (even on single machine)
      • Inflated code
      • Not portable (among architectures)

  41. Software DSM Evolution
      • IVY, ‘86 (Li & Hudak, Yale)
      Page-grain: relaxed consistency
      • Munin, ‘92: Release Cons. (Rice)
      • Midway, ‘93: Entry Cons. (CMU)
      • Treadmarks, ‘94: Lazy Release Cons. (Rice)
      • Brazos, ‘97: Scope Cons. (Rice)
      Fine-grain: code instrumentation
      • Blizzard, ‘94: binary instrumentation (Wisconsin)
      • Shasta, ‘97: transparent, works for commercial apps (Digital WRL)
      Multithreading
      • CVM, Millipede, ‘96: multi-protocol (Maryland, Technion)
      • Quarks, ‘98: protocol latency hiding (Utah)

  42. MultiView - Overheads
      • Application: traverse an array of integers, all packed up in minipages
      • The number of minipages is derived from the value of max views in page
      • Limitations of the experiments:
      • 1.63GB contiguous address space available
      • Up to 1664 views
      • → Need 64 bits!!!
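The address-space limit follows from the fact that each view maps the whole memory object at its own virtual address range, so n views consume n times the object size in virtual addresses while physical memory stays constant. A minimal sketch of the arithmetic, assuming a 1 MB memory object (an inference from the two figures on the slide, not something the slide states) and a hypothetical helper name:

```c
#include <assert.h>
#include <stdint.h>

/* Virtual address space consumed when an object of object_bytes is
   mapped through num_views separate views. */
static uint64_t address_space_bytes(uint64_t object_bytes, uint64_t num_views)
{
    return object_bytes * num_views;
}
```

Under that assumption, 1664 views of a 1 MB object need 1664 MB, or 1.625 GB, which matches the roughly 1.63GB of contiguous address space reported. Larger objects or more views quickly exceed what a 32-bit address space can offer.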

  43. MultiView - Overheads
      [Figure; x-axis: number of views]
      • As expected, committed (physical) memory is constant
      • Only a negligible overhead (< 4%), due to TLB misses

  44. MultiView - Taking it to the extreme
      [Figure; series: 1MB, 2MB, 4MB, 8MB]
      • Beyond critical points overhead becomes substantial
      • Number of minipages at critical points is 128K
      • Slowdown due to L2 cache exhausted by PTEs

  45. MultiView - Taking it to the extreme
      [Figure; series: 1MB, 2MB, 4MB, 8MB; label: SDSM]
      • Beyond critical points overhead becomes substantial
      • Number of minipages at critical points is 128K
      • Slowdown due to L2 cache exhausted by PTEs

  46. The Transparent DSM: System Initialization
      • For most DSM systems, initialization is an almost trivial task
      • The transparent DSM system cannot use such a simple solution
      • In order to initialize a DSM system transparently we have to inject the initialization code into the loaded application

  47. Standard Initialization
      Standard C application:
        crtStartup:
          …
          call c_init
          …
          call main
          …
        main:
          …
          application code
          …
      • crtStartup is the entry point of the executable. It is startup code from the C standard library, identical for all C applications.
      • call c_init: initialize the C runtime library
      • call main: start the application. This instruction lies at a fixed offset from crtStartup. We denote this offset as main_call_offset

  48. Transparent DSM System Initialization
      Injected DLL:
        DllMain:
          …
          crtStartup = get_entry_point();
          mainPtr = *(crtStartup + main_call_offset);
          *(crtStartup + main_call_offset) = hookedMain;
          …
        mainPtr dd NULL
        hookedMain:
          dsm_init(…);
          dsm_create_thread(…,mainPtr,…);
          …
      Application:
        crtStartup:
          …
          call c_init
          …
          call main
          …
        main:
          …
          application code
          …
      [Figure: the call to main in crtStartup is redirected to hookedMain]
      • The OS passes control to DllMain() after the DLL has been loaded
      • The main thread is resumed
      • Initialize the C runtime library
      • Initialize the DSM system (the OS API is intercepted, globals are moved to the DSM)
      • The application main thread is created using the DSM system thread creation API
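The hooking pattern on this slide (save the original target, overwrite the call slot, run initialization, then hand control to the saved target) can be modeled in portable C with a function pointer in place of the patched call instruction. Everything below is a hypothetical model: `startup_main_slot` stands in for the call at `crtStartup + main_call_offset`, `install_hook` for the work done in `DllMain`, and `hooked_main` calls the saved entry point directly where the slide creates a DSM thread.

```c
#include <assert.h>

typedef int (*main_fn)(void);

/* Stand-in for the patchable "call main" slot in the startup code. */
static main_fn startup_main_slot;
/* Saved original target (mainPtr on the slide). */
static main_fn saved_main;
static int dsm_initialized;

/* The unmodified application entry point. It reports whether the DSM
   layer was already initialized when it started running. */
static int app_main(void)
{
    return dsm_initialized ? 42 : -1;
}

/* Replacement target: initialize first, then run the original main. */
static int hooked_main(void)
{
    dsm_initialized = 1;          /* stands in for dsm_init(...) */
    return saved_main();          /* the slide uses dsm_create_thread here */
}

/* What the injected DLL does on load: remember the original target and
   redirect the slot to the hook. */
static void install_hook(void)
{
    saved_main = startup_main_slot;
    startup_main_slot = hooked_main;
}

/* Stand-in for crtStartup: runtime init would happen here, then the
   call through the (possibly patched) slot. */
static int crt_startup(void)
{
    return startup_main_slot();
}

static int hook_demo(void)
{
    startup_main_slot = app_main; /* as laid out by the linker */
    install_hook();               /* DLL loaded before the main thread resumes */
    return crt_startup();
}
```

The application code is untouched: `app_main` still believes it was called by the C runtime, yet it only runs after initialization has completed.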

  49. SDSMs on Emerging Fast Networks
      • Fast networking is an emerging technology
      • MultiView provides only one aspect: reducing message sizes
      • The next magnitude of improvement shifts from the network layer to the system architectures and protocols that use those networks
      • Challenges:
      • Efficiently employ and integrate fast networks
      • Provide a “thin” protocol layer: reduce protocol complexity, eliminate buffer copying, use home-based management, etc.

  50. Adding the Privileged View
      [Figure: memory object holding x, y, z; application views with protections RW, NA, R; the privileged view with protection RW]
      • Constant Read/Write permissions
      • Separate application threads from SDSM injected threads
      • Atomic updates
      • DSM threads can access (and update) memory while application threads are prohibited
      • Direct send/receive
      • Memory-to-memory
      • No buffer copying
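The atomic-update idea can be sketched with two POSIX mappings of one object: the application view starts with no access, the privileged view is always read-write, and the DSM side fills in data through the privileged view before opening the application view. This is an illustration under assumptions, not the system's code: the names `struct views`, `open_views`, `install_update`, and `privileged_view_demo` are hypothetical, and an unlinked temporary file stands in for the memory object.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

struct views {
    char *app;      /* application view: protection changes over time */
    char *priv;     /* privileged view: constant read/write */
    size_t size;
};

/* Map one page-sized object twice. Returns 0 on success, -1 on failure. */
static int open_views(struct views *v)
{
    char path[] = "/tmp/mv_privXXXXXX";
    int fd = mkstemp(path);
    if (fd < 0)
        return -1;
    unlink(path);
    v->size = (size_t)sysconf(_SC_PAGESIZE);
    if (ftruncate(fd, (off_t)v->size) != 0) {
        close(fd);
        return -1;
    }
    v->app  = mmap(NULL, v->size, PROT_NONE, MAP_SHARED, fd, 0);
    v->priv = mmap(NULL, v->size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);                    /* mappings stay valid after close */
    return (v->app == MAP_FAILED || v->priv == MAP_FAILED) ? -1 : 0;
}

/* DSM side: write the incoming data through the privileged view while
   the application view is still inaccessible, then grant read access.
   Application threads never observe a partially updated page. */
static int install_update(struct views *v, const char *data, size_t len)
{
    if (len > v->size)
        return -1;
    memcpy(v->priv, data, len);
    return mprotect(v->app, v->size, PROT_READ);
}

/* Returns 1 if the application view shows the data installed through
   the privileged view. */
static int privileged_view_demo(void)
{
    struct views v;
    if (open_views(&v) != 0)
        return 0;
    int ok = install_update(&v, "fresh", 6) == 0
          && strcmp(v.app, "fresh") == 0;
    munmap(v.app, v.size);
    munmap(v.priv, v.size);
    return ok;
}
```

A receive buffer handed to the network layer can point straight into the privileged view, which is how the design avoids an intermediate copy.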
