
Concurrent Data Structures in Architectures with Limited Shared Memory Support



Presentation Transcript


  1. Distributed Computing and Systems, Chalmers University of Technology, Gothenburg, Sweden
  Concurrent Data Structures in Architectures with Limited Shared Memory Support
  Ivan Walulya, Yiannis Nikolakopoulos, Marina Papatriantafilou, Philippas Tsigas

  2. Concurrent Data Structures
  • Parallel/concurrent programming: threads/processes share data through a uniform address space (shared memory)
  • Inter-process/thread communication and synchronization
  • Both a tool and a goal

  3. Concurrent Data Structures: Implementations
  • Coarse-grained locking: easy but slow
  • Fine-grained locking: fast/scalable, but error-prone (deadlocks)
  • Non-blocking:
    • Atomic hardware primitives (e.g. TAS, CAS)
    • Good progress guarantees (lock-/wait-freedom)
    • Scalable
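
To make concrete the kind of primitive the lock-based designs later in the talk build on, here is a minimal sketch of a test-and-set spinlock using C11 atomics. The SCC variants use the hardware TAS register of a core rather than a cache-coherent atomic flag, so treat this as an illustration of the idea only.

```c
#include <stdatomic.h>

/* Illustrative TAS spinlock (C11 atomics). The SCC queues in this
 * talk use the per-core hardware test-and-set register instead. */
typedef struct { atomic_flag flag; } tas_lock_t;

static void tas_lock_init(tas_lock_t *l)    { atomic_flag_clear(&l->flag); }

static void tas_lock_acquire(tas_lock_t *l)
{
    while (atomic_flag_test_and_set(&l->flag))
        ;  /* spin until the flag was observed clear */
}

static void tas_lock_release(tas_lock_t *l) { atomic_flag_clear(&l->flag); }
```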

  4. What’s happening in hardware?
  • Multi-cores → many-cores
  • “Cache coherency wall” [Kumar et al 2011]
  • Shared address space will not scale
  • Universal atomic primitives (CAS, LL/SC) harder to implement
  • Shared memory → message passing
  [Figure: core with local cache and shared cache]

  5. Networks on Chip (NoC)
  • Short distances between cores
  • Message passing model support
  • Shared memory support
  • Eliminated cache coherency
  • Limited support for synchronization primitives
  [Figure: core with local cache and shared cache]
  Can we have data structures that are fast, scalable, and with good progress guarantees?

  6. Outline
  • Concurrent Data Structures
  • Many-core architectures
    • Intel’s SCC
  • Concurrent FIFO Queues
  • Evaluation
  • Conclusion

  7. Single-chip Cloud Computer (SCC)
  • Experimental processor by Intel
  • 48 independent x86 cores arranged on 24 tiles
  • NoC connects all tiles
  • Test-and-set register per core

  8. SCC: Architecture Overview
  • Message Passing Buffer (MPB): 16 KB per tile
  • Memory controllers: to private & shared main memory

  9. Programming Challenges on the SCC
  • Message passing, but the MPB is too small for large data transfers
  • Data replication is difficult
  • No universal atomic primitives (CAS), hence no wait-free implementations [Herlihy91]

  10. Outline
  • Concurrent Data Structures
  • Many-core architectures
    • Intel’s SCC
  • Concurrent FIFO Queues
  • Evaluation
  • Conclusion

  11. Concurrent FIFO Queues
  • Main idea:
    • Data are stored in shared off-chip memory
    • Message passing for communication/coordination
  • Two design methodologies:
    • Lock-based synchronization (2-lock Queue)
    • Message passing-based synchronization (MP-Queue, MP-Acks)

  12. 2-lock Queue
  • Array-based, in shared off-chip memory (SHM)
  • Head/tail pointers in MPBs
  • One lock for each pointer [Michael&Scott96]
  • TAS-based locks on two cores (see the layout sketch below)
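
A minimal sketch of this layout in C. All names here (node_t, queue_ptr_t, the dirty flag used by the optimized algorithms on the next slides) are illustrative, not the paper's actual code; the TAS lock is the sketch from slide 3.

```c
#define QUEUE_CAPACITY 1024      /* illustrative size */

/* One fixed-size node, living in shared off-chip memory (SHM). */
typedef struct {
    volatile int dirty;          /* set once the payload is valid */
    int          payload;
} node_t;

typedef struct {
    node_t nodes[QUEUE_CAPACITY];
} queue_shm_t;                   /* the array in SHM */

/* Head or tail: an index plus its TAS lock, placed in the MPB of
 * one of two cores (one core per pointer). */
typedef struct {
    tas_lock_t   lock;
    volatile int index;
} queue_ptr_t;
```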

  13. 2-lock Queue: “Traditional” Enqueue Algorithm
  • Acquire lock
  • Read & update tail pointer (MPB)
  • Add data (SHM)
  • Release lock
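
In the illustrative C of the previous sketch, the traditional version keeps the slow off-chip data write inside the critical section (queue-full handling omitted):

```c
/* "Traditional" enqueue: the SHM write happens under the tail lock,
 * so concurrent enqueuers serialize on the slow off-chip access. */
void enqueue_traditional(queue_shm_t *q, queue_ptr_t *tail, int value)
{
    tas_lock_acquire(&tail->lock);             /* 1. acquire lock   */
    int slot = tail->index;                    /* 2. read & update  */
    tail->index = (slot + 1) % QUEUE_CAPACITY; /*    tail (MPB)     */
    q->nodes[slot].payload = value;            /* 3. add data (SHM) */
    tas_lock_release(&tail->lock);             /* 4. release lock   */
}
```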

  14. 2-lock Queue: Optimized Enqueue Algorithm
  • Acquire lock
  • Read & update tail pointer (MPB)
  • Release lock
  • Add data to node (SHM)
  • Set memory flag to dirty
  Why the flag? No cache coherency!
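
Continuing the same illustrative sketch, the optimized version moves the slow SHM write out of the critical section and publishes it via the dirty flag instead:

```c
/* Optimized enqueue: only the cheap MPB index update is done under
 * the lock. Without cache coherency, dequeuers cannot tell when the
 * off-chip payload write has landed, so the dirty flag publishes it. */
void enqueue_optimized(queue_shm_t *q, queue_ptr_t *tail, int value)
{
    tas_lock_acquire(&tail->lock);             /* 1. acquire lock    */
    int slot = tail->index;                    /* 2. read & update   */
    tail->index = (slot + 1) % QUEUE_CAPACITY; /*    tail (MPB)      */
    tas_lock_release(&tail->lock);             /* 3. release lock    */
    q->nodes[slot].payload = value;            /* 4. add data (SHM)  */
    q->nodes[slot].dirty = 1;                  /* 5. flag node ready */
}
```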

  15. 2-lock Queue: Dequeue Algorithm
  • Acquire lock
  • Read & update head pointer
  • Release lock
  • Check flag
  • Read node data
  What about progress?
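
The matching dequeue in the same sketch; the spin in step 4 is exactly where progress can suffer, since a dequeuer handed a slot whose enqueuer never completes will wait forever:

```c
/* Dequeue: reserve a slot under the head lock, then wait outside the
 * lock for that slot's dirty flag before reading the payload. */
int dequeue(queue_shm_t *q, queue_ptr_t *head)
{
    tas_lock_acquire(&head->lock);             /* 1. acquire lock   */
    int slot = head->index;                    /* 2. read & update  */
    head->index = (slot + 1) % QUEUE_CAPACITY; /*    head (MPB)     */
    tas_lock_release(&head->lock);             /* 3. release lock   */
    while (!q->nodes[slot].dirty)              /* 4. check flag     */
        ;  /* spin until the enqueuer flags the node */
    int value = q->nodes[slot].payload;        /* 5. read node data */
    q->nodes[slot].dirty = 0;                  /* recycle the slot  */
    return value;
}
```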

  16. 2-lock Queue: Implementation
  [Figure: head/tail pointers in the MPBs, data nodes in SHM]
  • Locks? On which tile(s)?

  17. Message Passing-based Queue
  • Data nodes in SHM
  • Access coordinated by a server core that keeps the head/tail pointers
  • Enqueuers/dequeuers request access through dedicated slots in the MPB
  • Successfully enqueued data are flagged with a dirty bit

  18. MP-Queue
  [Figure: enqueuers and dequeuers post ENQ/DEQ requests to the server, which replies with tail/head positions; the enqueuer then adds its data while the dequeuer spins on the node’s flag]
  • What if an enqueue fails and its node is never flagged? Blocking is only “pairwise”: just the one dequeuer waiting on that node blocks.
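
A hypothetical sketch of the server side of this protocol, reusing the SHM layout above; the request encoding, slot handshake, and polling loop are assumptions for illustration, not the paper's implementation:

```c
/* Request codes and the per-client MPB slot (one per core). */
enum { REQ_NONE = 0, REQ_ENQ, REQ_DEQ };

typedef struct {
    volatile int type;   /* request posted by the client core     */
    volatile int slot;   /* server's reply: SHM index to use/read */
} mpb_slot_t;

/* The server is the only core touching head/tail, so no locks: it
 * polls the MPB slots and hands out indices. An enqueuer then writes
 * its data to SHM and sets the node's dirty flag; a dequeuer spins
 * on that flag, which is where the pairwise blocking arises. */
void mp_queue_server(mpb_slot_t slots[], int nclients)
{
    int head = 0, tail = 0;
    for (;;) {
        for (int i = 0; i < nclients; i++) {
            if (slots[i].type == REQ_ENQ) {
                slots[i].slot = tail;
                tail = (tail + 1) % QUEUE_CAPACITY;
                slots[i].type = REQ_NONE;   /* reply delivered */
            } else if (slots[i].type == REQ_DEQ) {
                slots[i].slot = head;
                head = (head + 1) % QUEUE_CAPACITY;
                slots[i].type = REQ_NONE;
            }
        }
    }
}
```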

  19. Adding Acknowledgements
  • No more flags! The enqueuer sends an ACK when done
  • The server maintains a private queue of pointers in SHM
  • On ACK: the server adds the data location to its private queue
  • On dequeue: the server returns only ACKed locations

  20. MP-Acks
  [Figure: enqueuers post ENQ and later ACK to the server; dequeuers post DEQ; the server tracks head/tail and the ACKed locations]
  • No blocking between enqueues and dequeues
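
A sketch of how the server's handling could change with ACKs, extending the hypothetical request codes above; the private ready-queue is the server's SHM queue of ACKed locations, and all details are again illustrative:

```c
enum { REQ_ACK = 3 };   /* new request: "my SHM write is done" */

/* Handle one client request. `ready` is the server's private queue
 * (in SHM) of ACKed node indices; only those are handed to dequeuers,
 * so no dequeuer ever spins on an unfinished enqueue. */
void mp_acks_server_step(mpb_slot_t *req, int ready[],
                         int *rhead, int *rtail, int *tail)
{
    switch (req->type) {
    case REQ_ENQ:                               /* hand out a raw slot */
        req->slot = *tail;
        *tail = (*tail + 1) % QUEUE_CAPACITY;
        break;
    case REQ_ACK:                               /* publish ACKed slot  */
        ready[*rtail] = req->slot;
        *rtail = (*rtail + 1) % QUEUE_CAPACITY;
        break;
    case REQ_DEQ:
        if (*rhead != *rtail) {                 /* only ACKed slots    */
            req->slot = ready[*rhead];
            *rhead = (*rhead + 1) % QUEUE_CAPACITY;
        } else {
            req->slot = -1;                     /* -1: currently empty */
        }
        break;
    }
    req->type = REQ_NONE;
}
```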

  21. Outline
  • Concurrent Data Structures
  • Many-core architectures
    • Intel’s SCC
  • Concurrent FIFO Queues
  • Evaluation
  • Conclusion

  22. Evaluation
  Benchmark:
  • Each core performs Enq/Deq operations at random
  • High/low contention
  • Performance? Scalability?
  • Is it the same for all cores?

  23. Measures
  • Throughput: data structure operations completed per time unit [Cederman et al 2013]
  • Fairness: relates the operations completed by each core i to the average operations per core
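
A small hypothetical post-processing routine for these measures. The fairness line uses an assumed formulation (the slowest core's operation count relative to the per-core average, matching the quantities named on the slide); the precise definition is the one in [Cederman et al 2013].

```c
#include <stdio.h>

/* n[i] = operations completed by core i during a run of `seconds`.
 * Throughput is total operations per time unit; the fairness value
 * below is an assumed formulation, not necessarily the paper's. */
void report_measures(const long n[], int ncores, double seconds)
{
    long total = 0, min = n[0];
    for (int i = 0; i < ncores; i++) {
        total += n[i];
        if (n[i] < min) min = n[i];
    }
    double avg = (double)total / ncores;    /* average ops per core */
    printf("throughput: %.0f ops/s\n", (double)total / seconds);
    printf("fairness:   %.3f\n", (double)min / avg); /* 1.0 = fair */
}
```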

  24. Throughput – High Contention

  25. Fairness – High Contention

  26. Throughput vs. Lock Location

  27. Throughput vs. Lock Location

  28. Conclusion
  • Lock-based queue
    • High throughput
    • Less fair
    • Sensitive to lock locations, NoC performance
  • MP-based queues
    • Lower throughput
    • Fairer
    • Better liveness properties
    • Promising scalability

  29. Thank you! ivanw@chalmers.se ioaniko@chalmers.se

  30. Backup slides

  31. Experimental Setup
  • 533 MHz cores, 800 MHz mesh, 800 MHz DDR3
  • Randomized Enq/Deq operations
  • High/low contention
  • One thread per core
  • 600 ms per execution
  • Averaged over 12 runs

  32. Concurrent FIFO Queues
  • Typical 2-lock queue [Michael&Scott96]
