
Scalable Synchronization Algorithms in Multi-core Processors


Presentation Transcript


  1. Scalable Synchronization Algorithms in Multi-core Processors • Jeremy Denham • April 7, 2008

  2. Outline • Motivation • Background / Previous work • Experimentation • Results • Questions

  3. Motivation • Modern processor design trends are primarily concerned with the multi-core design paradigm. • Still figuring out what to do with them • Different way of thinking about “shared-memory multiprocessors” • Distributed apps? • Synchronization will be important.

  4. Background / Previous work • Algorithms for Scalable Synchronization on Shared-Memory Multiprocessors, Mellor-Crummey & Scott, 1991. • Scalable, busy-wait synchronization algorithms • No memory or interconnect contention • O(1) remote memory references per lock acquisition or barrier episode • Spin locks and barriers

  5. Spin Locks • “Spin” on lock by busy-waiting until available. • Typically involves “fetch-and-Φ” operations • Must be atomic!

  6. Simple spin locks • “Test-and-set” • Needs processor support to make it atomic • “fetch-and-store” • xchg works on x86 • Loop until the lock is acquired • Expensive! • Frequently accessed, too • Heavy interconnect traffic
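
A minimal sketch of the test-and-set spin lock described above, written with C11 atomics as a stand-in for the atomic xchg the slide mentions; the names (tas_lock, tas_acquire, tas_release) are illustrative, not from the slides:

    /* Test-and-set spin lock sketch (C11 atomics). Every spin iteration
     * is an atomic read-modify-write, which is why this lock generates
     * heavy interconnect traffic under contention. */
    #include <stdatomic.h>

    typedef struct { atomic_flag held; } tas_lock;

    static void tas_init(tas_lock *l) { atomic_flag_clear(&l->held); }

    static void tas_acquire(tas_lock *l) {
        /* test_and_set returns the previous value: loop until it was clear. */
        while (atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire))
            ;   /* busy-wait */
    }

    static void tas_release(tas_lock *l) {
        atomic_flag_clear_explicit(&l->held, memory_order_release);
    }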

  7. Ticket lock • Can reduce fetch-and-Φ ops to one per lock acquisition • FIFO service guarantee • Two counters • Requests • Releases • fetch_and_increment the request counter • Wait until the release counter reflects your turn • Still problematic…
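
A corresponding ticket-lock sketch, again with C11 atomics and illustrative names: one fetch-and-increment per acquisition and FIFO order by construction, but every waiter still spins on the same shared release counter:

    #include <stdatomic.h>

    typedef struct {
        atomic_uint request;   /* next ticket to hand out (init to 0) */
        atomic_uint release;   /* ticket currently being served (init to 0) */
    } ticket_lock;

    static void ticket_acquire(ticket_lock *l) {
        /* One atomic fetch-and-increment per acquisition. */
        unsigned my_turn = atomic_fetch_add_explicit(&l->request, 1,
                                                     memory_order_relaxed);
        /* Spin until the release counter reaches our ticket. */
        while (atomic_load_explicit(&l->release, memory_order_acquire) != my_turn)
            ;
    }

    static void ticket_release(ticket_lock *l) {
        atomic_fetch_add_explicit(&l->release, 1, memory_order_release);
    }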

  8. Queue-based approach • T.E. Anderson • Incoming processes put themselves in the queue • Lock holder hands off the lock to next in queue • Faster than ticket, but more space
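
A sketch in the spirit of Anderson's array-based queue lock (C11 atomics); the fixed slot count and the missing per-slot cache-line padding are simplifications:

    #include <stdatomic.h>
    #include <stdbool.h>

    #define MAX_SLOTS 64   /* assumed upper bound on concurrent contenders */

    typedef struct {
        atomic_bool has_lock[MAX_SLOTS]; /* ideally one slot per cache line */
        atomic_uint next_slot;
    } anderson_lock;

    static void anderson_init(anderson_lock *l) {
        for (int i = 0; i < MAX_SLOTS; i++)
            atomic_store(&l->has_lock[i], false);
        atomic_store(&l->has_lock[0], true);   /* slot 0 starts as the holder */
        atomic_store(&l->next_slot, 0);
    }

    static unsigned anderson_acquire(anderson_lock *l) {
        unsigned slot = atomic_fetch_add(&l->next_slot, 1) % MAX_SLOTS;
        while (!atomic_load_explicit(&l->has_lock[slot], memory_order_acquire))
            ;                                  /* spin on our own slot only */
        return slot;                           /* caller hands this to release */
    }

    static void anderson_release(anderson_lock *l, unsigned slot) {
        atomic_store_explicit(&l->has_lock[slot], false, memory_order_relaxed);
        atomic_store_explicit(&l->has_lock[(slot + 1) % MAX_SLOTS], true,
                              memory_order_release);       /* hand off to next */
    }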

  9. MCS List-based Queuing Lock • FIFO Guarantee • Local spinning! • Small constant amount of space • Cache coherence a non-issue

  10. Details • Each processor allocates a record • next link • boolean flag • Adds to queue • Spins locally • Owner passes lock to next user in queue as necessary
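
The per-processor record and hand-off described above, as a hedged C11 sketch of the MCS list-based queuing lock; each thread supplies its own mcs_node and spins only on its own flag:

    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stddef.h>

    typedef struct mcs_node {
        struct mcs_node *_Atomic next;    /* next link */
        atomic_bool              locked;  /* boolean flag, spun on locally */
    } mcs_node;

    typedef struct { mcs_node *_Atomic tail; } mcs_lock;   /* NULL when free */

    static void mcs_acquire(mcs_lock *l, mcs_node *me) {
        atomic_store(&me->next, NULL);
        atomic_store(&me->locked, true);
        mcs_node *prev = atomic_exchange(&l->tail, me);    /* add to queue */
        if (prev != NULL) {
            atomic_store(&prev->next, me);                 /* link behind predecessor */
            while (atomic_load_explicit(&me->locked, memory_order_acquire))
                ;                                          /* local spin */
        }
    }

    static void mcs_release(mcs_lock *l, mcs_node *me) {
        mcs_node *succ = atomic_load(&me->next);
        if (succ == NULL) {
            mcs_node *expected = me;
            /* No visible successor: try to make the lock free again. */
            if (atomic_compare_exchange_strong(&l->tail, &expected, NULL))
                return;
            /* A successor is mid-enqueue; wait for its link to appear. */
            while ((succ = atomic_load(&me->next)) == NULL)
                ;
        }
        atomic_store_explicit(&succ->locked, false, memory_order_release); /* hand off */
    }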

  11. Barriers • Mechanism for “phase separation” • Block processes from proceeding until all others have reached a checkpoint • Designed for repetitive use

  12. Centralized Barriers • “Local” and “global” sense • As processor arrives • Reverse local sense • Signal its arrival • If last, reverse global sense • Else spin • Lots of spinning…
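
A sense-reversing centralized barrier sketch (C11 atomics) matching the local/global sense description above; each thread keeps its own local_sense, e.g. in thread-local storage:

    #include <stdatomic.h>
    #include <stdbool.h>

    typedef struct {
        atomic_int  count;   /* threads that have not yet arrived (init to P) */
        atomic_bool sense;   /* global sense, flipped by the last arriver */
        int         P;       /* total number of threads */
    } central_barrier;

    /* Each thread's local_sense must start as false if sense starts false. */
    static void central_barrier_wait(central_barrier *b, bool *local_sense) {
        *local_sense = !*local_sense;                       /* reverse local sense */
        if (atomic_fetch_sub(&b->count, 1) == 1) {          /* last to arrive */
            atomic_store(&b->count, b->P);                  /* reset for reuse */
            atomic_store_explicit(&b->sense, *local_sense, memory_order_release);
        } else {
            while (atomic_load_explicit(&b->sense, memory_order_acquire)
                   != *local_sense)
                ;                                           /* everyone spins here */
        }
    }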

  13. Dissemination Barriers • Barrier information is “disseminated” algorithmically • At each synchronization stage k, processor i signals processor (i + 2^k) mod P, where P is the number of processors • Similarly, processor i continues when it is signaled by processor (i - 2^k) mod P • log(P) operations on critical path, P log(P) remote operations
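
A dissemination-barrier sketch (C11 atomics) of the signaling pattern above; counters stand in for the paper's parity/sense flags so the barrier stays reusable, and the thread index i and thread count P are passed in by the caller:

    #include <stdatomic.h>

    #define MAX_P      64
    #define MAX_ROUNDS 7                      /* >= ceil(log2(MAX_P)) */

    /* flags[i][k] counts signals delivered to thread i in round k. */
    static atomic_int flags[MAX_P][MAX_ROUNDS];

    static void dissemination_barrier(int i, int P) {
        for (int k = 0, dist = 1; dist < P; k++, dist <<= 1) {
            /* Signal processor (i + 2^k) mod P ... */
            atomic_fetch_add_explicit(&flags[(i + dist) % P][k], 1,
                                      memory_order_release);
            /* ... and wait to be signaled by (i - 2^k) mod P. */
            while (atomic_load_explicit(&flags[i][k], memory_order_acquire) == 0)
                ;
            atomic_fetch_sub_explicit(&flags[i][k], 1, memory_order_relaxed);
        }
    }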

  14. Tournament Barriers • Tree-based approach • Outcome statically determined • “Roles” for each round • “loser” notifies “winner,” then drops out • “winner” waits to be notified, participates in next round • “champion” sets global flag when over • log(P) rounds • Heavy interconnect traffic…
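
A tournament-barrier sketch (C11 atomics) with the statically assigned winner/loser roles and the champion's global flag described above; it assumes P is a power of two and that each thread's local_sense starts as false:

    #include <stdatomic.h>
    #include <stdbool.h>

    #define MAX_P      64
    #define MAX_ROUNDS 7

    static atomic_bool arrived[MAX_P][MAX_ROUNDS]; /* loser -> winner notifications */
    static atomic_bool champion_flag;              /* flipped by the champion */

    static void tournament_barrier(int i, int P, bool *local_sense) {
        *local_sense = !*local_sense;
        for (int k = 0, dist = 1; dist < P; k++, dist <<= 1) {
            if (i % (2 * dist) == 0) {
                /* Winner: wait for this round's statically assigned loser. */
                while (atomic_load_explicit(&arrived[i][k], memory_order_acquire)
                       != *local_sense)
                    ;
            } else {
                /* Loser: notify the winner, then drop out and await release. */
                atomic_store_explicit(&arrived[i - dist][k], *local_sense,
                                      memory_order_release);
                break;
            }
        }
        if (i == 0)      /* champion won every round: set the global flag */
            atomic_store_explicit(&champion_flag, *local_sense, memory_order_release);
        else
            while (atomic_load_explicit(&champion_flag, memory_order_acquire)
                   != *local_sense)
                ;
    }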

  15. MCS Approach • Also tree-based • Local spinning • O(P) space for P processors • (2P – 2) network transactions • O(log P) network transactions on critical path

  16. The idea • Use two P-node trees • “child-not-ready” flag for each child present in parent • When all children have signaled arrival, parent signals its parent • When root detects all children have arrived, signals to the group that it can proceed to next barrier.
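
A simplified C11 sketch of that idea, using a binary arrival tree plus a binary wakeup tree so each thread spins only on its own flags; the actual MCS barrier uses a 4-ary arrival tree with packed child-not-ready flag words, omitted here for brevity:

    #include <stdatomic.h>
    #include <stdbool.h>

    #define MAX_P 64

    static atomic_int  not_ready[MAX_P];    /* children of node i not yet arrived */
    static atomic_bool wakeup[MAX_P];       /* set by parent to release node i */
    static int         num_children[MAX_P];

    static void tree_barrier_init(int P) {
        for (int i = 0; i < P; i++) {
            int c = 0;
            if (2 * i + 1 < P) c++;
            if (2 * i + 2 < P) c++;
            num_children[i] = c;
            atomic_store(&not_ready[i], c);
            atomic_store(&wakeup[i], false);
        }
    }

    static void tree_barrier_wait(int i, int P) {
        /* Arrival: wait for our children, reset, then signal our parent. */
        while (atomic_load_explicit(&not_ready[i], memory_order_acquire) != 0)
            ;
        atomic_store(&not_ready[i], num_children[i]);       /* reset for reuse */
        if (i != 0)
            atomic_fetch_sub_explicit(&not_ready[(i - 1) / 2], 1,
                                      memory_order_release);
        /* Wakeup: the root proceeds at once; everyone else waits for its
         * parent, then releases its own children. */
        if (i != 0) {
            while (!atomic_load_explicit(&wakeup[i], memory_order_acquire))
                ;
            atomic_store(&wakeup[i], false);                 /* reset for reuse */
        }
        if (2 * i + 1 < P)
            atomic_store_explicit(&wakeup[2 * i + 1], true, memory_order_release);
        if (2 * i + 2 < P)
            atomic_store_explicit(&wakeup[2 * i + 2], true, memory_order_release);
    }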

  17. MCS Results • Experiments done on BBN Butterfly 1 and Sequent Symmetry Model B machines • BBN • Supports up to 256 processor nodes • 8 MHz MC68000 • Sequent • Supports up to 30 processor nodes • 16 MHz Intel 80386 • Most concerned with Sequent

  18. Sequent Spin Locks

  19. Sequent Spin Locks cont’d

  20. Sequent Barriers

  21. My Experiments • Want to extend to multi-core machines • Scalability of limited usefulness (not that many cores) • Shared resources • Core load

  22. Equipment • Intel Core 2 Duo T5200 processor (Centrino Duo platform) • Two cores • 1.60 GHz per core • 2MB L2 Cache • Windows Vista • 2GB DDR2 Memory

  23. Experimental Procedure • Evaluate basic and MCS approaches • Simple and complex evaluations • Core pinning • Load ramping

  24. Challenges • Code porting • Lots of Linux-specific code • Win32 Thread API • Esoteric… • How to pin a thread to a core? • Timing • Win32 μsec-granularity measurement • Surprisingly archaic C code
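
For the two Win32 pieces mentioned above, a hedged sketch: pinning the calling thread to one core with SetThreadAffinityMask and taking microsecond-scale timings with QueryPerformanceCounter; the benchmark body itself is omitted:

    #include <windows.h>
    #include <stdio.h>

    int main(void) {
        /* Pin the calling thread to core 0 (bit 0 of the affinity mask). */
        if (SetThreadAffinityMask(GetCurrentThread(), 1) == 0) {
            fprintf(stderr, "SetThreadAffinityMask failed: %lu\n", GetLastError());
            return 1;
        }

        LARGE_INTEGER freq, start, end;
        QueryPerformanceFrequency(&freq);     /* counter ticks per second */
        QueryPerformanceCounter(&start);

        /* ... lock or barrier benchmark would run here ... */

        QueryPerformanceCounter(&end);
        double usec = (end.QuadPart - start.QuadPart) * 1e6 / (double)freq.QuadPart;
        printf("elapsed: %.3f microseconds\n", usec);
        return 0;
    }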

  25. Progress • Spin lock base code ported • Barriers nearly done • Simple experiments for spin locks done • More complex on the way

  26. Results • Simple spin lock tests • Simple lock outperforms MCS on: • Empty Critical Section • Simple FP Critical Section • Single core • Dual core • More procedural overhead for MCS on small scale • Next steps: • More threads! • More critical section complexity

  27. Questions?
