
ECE 569 High Performance Processors and Systems


Presentation Transcript


  1. ECE 569 High Performance Processors and Systems • Administrative • HW3: presentations • Topics? • Presentation dates? Options: Tues 3/11, Tues 3/18, Thurs 3/20 • Rank your choice: 1st, 2nd, 3rd • ECE 569 -- 04 Mar 2014

  2. Speculative Lock Elision: Enabling Highly Concurrent Multithreaded Execution • By: Ravi Rajwar and James Goodman, U. of Wisconsin, Madison • Where: 34th International Symposium on Microarchitecture (MICRO), Dec 2001, Austin TX

  3. High performance requires parallel execution • Typically threads running across cores [Figure: threads T1-T4, four cores, shared memory]

  4. Locks are often needed to ensure correct execution • e.g. when threads are accessing a shared resource • Locks can significantly impact performance [Figure: threads T1-T4 with Lock / Unlock, four cores, shared memory]
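
As a concrete picture of the Lock / Unlock pattern on this slide, here is a minimal sketch in Python. The function name `run_counter` and its parameters are illustrative assumptions, not from the slides; the shared resource is just an integer counter.

```python
import threading

def run_counter(num_threads=4, increments=10_000):
    """Each thread adds to one shared counter; the lock serializes every update."""
    counter = 0
    lock = threading.Lock()

    def worker():
        nonlocal counter
        for _ in range(increments):
            with lock:        # Lock
                counter += 1  # critical section: read-modify-write of shared data
            # Unlock happens when the with-block exits

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter
```

Every thread pays for the acquire and release on every update, whether or not another thread was actually inside the critical section at that moment. That cost is what the rest of the talk tries to avoid.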

  5. Key insight: • Acquiring the lock is often unnecessary for correct execution • Why is the lock unnecessary? First… • Threads often don't enter the critical section at the same time [Figure: threads T1-T4 with Lock / Unlock]

  6. Why is the lock unnecessary? Second… • Threads access different parts of the shared resource • Example: hash table • The lock is there to prevent a race condition, but a race occurs only when two accesses hash to the same index [Figure: T2 and T3 accessing hash(x) and hash(y)]
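
The hash table case can be sketched in a few lines of Python. The toy hash function and both function names below are assumptions made for illustration; the slides do not specify a hash function.

```python
def bucket_index(key, num_buckets=8):
    """Toy hash (an assumption): sum of character codes modulo the table size."""
    return sum(ord(c) for c in key) % num_buckets

def needs_serialization(key_a, key_b, num_buckets=8):
    """Two updates can race only if they land in the same bucket."""
    return bucket_index(key_a, num_buckets) == bucket_index(key_b, num_buckets)
```

With this toy hash, `"x"` and `"y"` fall into different buckets, so a single table-wide lock would serialize two updates that never touch the same data. Keys such as `"ab"` and `"ba"` collide, and only then does the lock do useful work.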

  7. Idea: • Let hardware dynamically identify lock • HW speculatively executes critical section without lock • If HW detects memory conflict, discards state & re-executes with lock • If HW reaches unlock, commits state & skips unlock • Uses existing cache coherence protocols and speculation support: new HW not required!

  8. How? • Identify the lock: the lock acquire is the store in a Load-Locked/Store-Conditional pair • Execute: like branch prediction, speculatively execute the critical section without the lock • Detect: use cache coherence to detect a conflict, i.e. (1) data read speculatively is modified by another thread, or (2) data written speculatively is read or written by another thread; on conflict, discard state & re-execute with the lock • Identify the unlock: a store to the same location as the Store-Conditional; at that point skip the unlock, commit state & exit speculation
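
The four steps can be modeled in software. This is a rough Python sketch under stated assumptions, not the paper's hardware design: `SpeculativeRegion`, `observe_remote`, and `run_critical_section` are hypothetical names, memory is a plain dict, and coherence traffic from other threads is passed in as a list of `(address, is_write)` pairs.

```python
class SpeculativeRegion:
    """Software model of one thread's elided critical section (hypothetical)."""

    def __init__(self, memory):
        self.memory = memory    # shared dict: address -> value
        self.read_set = set()   # addresses read speculatively
        self.write_buffer = {}  # speculative writes, invisible until commit
        self.aborted = False

    def load(self, addr):
        self.read_set.add(addr)
        if addr in self.write_buffer:
            return self.write_buffer[addr]
        return self.memory[addr]

    def store(self, addr, value):
        self.write_buffer[addr] = value

    def observe_remote(self, addr, is_write):
        """Stands in for a coherence message caused by another thread's access."""
        # Conflict (2): data we wrote is read or written by another thread.
        # Conflict (1): data we read is written by another thread.
        if addr in self.write_buffer or (is_write and addr in self.read_set):
            self.aborted = True

    def commit(self):
        """Reaching the unlock: publish buffered writes unless a conflict was seen."""
        if self.aborted:
            self.write_buffer.clear()  # discard speculative state
            self.read_set.clear()
            return False
        self.memory.update(self.write_buffer)
        return True


def run_critical_section(memory, body, remote_accesses=()):
    """Try the critical section without the lock; on conflict, redo it 'with the lock'."""
    region = SpeculativeRegion(memory)
    body(region)
    for addr, is_write in remote_accesses:
        region.observe_remote(addr, is_write)
    if region.commit():
        return "elided"
    # Fallback path: in this model, holding the lock just means running undisturbed.
    locked = SpeculativeRegion(memory)
    body(locked)
    locked.commit()
    return "locked"
```

A short usage example: an increment of `"a"` commits without the lock when the only remote traffic touches `"b"`, and falls back to the locked path when another thread reads `"a"`.

```python
mem = {"a": 1, "b": 10}

def increment_a(region):
    region.store("a", region.load("a") + 1)

run_critical_section(mem, increment_a, [("b", True)])   # no conflict, lock elided
run_critical_section(mem, increment_a, [("a", False)])  # conflict, retried with lock
```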

  9. Silent store-pair elision: • The acquire store and the release store form a pair that leaves the lock value unchanged • By skipping both stores, the lock remains unlocked so no one waits!
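
A tiny Python sketch of why the store pair is "silent". The constants `FREE` and `HELD` and the function `lock_values_seen` are illustrative assumptions; the function lists the values the lock word takes, which is what other threads could observe.

```python
FREE, HELD = 0, 1

def lock_values_seen(elide):
    """Return the sequence of values the lock word holds during one critical section."""
    lock = FREE
    seen = [lock]
    if not elide:
        lock = HELD  # acquire store
        seen.append(lock)
    # ... critical section runs here ...
    if not elide:
        lock = FREE  # release store restores the original value
        seen.append(lock)
    return seen
```

Without elision the lock word goes free, held, free. With elision it stays free the whole time, and since the final value is the same in both cases, dropping the pair of stores changes nothing that another thread can see afterward.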

  10. Results? [Figure: results chart, not reproduced; taller is better] • CMP: chip multiprocessor • SMP: shared-memory multiprocessor • DSM: distributed shared memory

  11. Results? [Figure: normalized execution time per benchmark, showing the portion spent accessing & waiting on locks; we want < 1.0, lower is better] • None were slower • Many were faster • A few significantly faster
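
For reading the chart, it may help to spell out the metric. The slide does not give the definition, so the following assumes the usual one: the time with lock elision divided by the time of the baseline run with ordinary locking. The symbols are introduced here for illustration only.

```latex
T_{\text{norm}} = \frac{T_{\text{SLE}}}{T_{\text{base}}},
\qquad
\text{speedup} = \frac{T_{\text{base}}}{T_{\text{SLE}}} = \frac{1}{T_{\text{norm}}}
```

Under that assumption, a made-up value of $T_{\text{norm}} = 0.8$ would correspond to a speedup of $1/0.8 = 1.25$, and $T_{\text{norm}} = 1.0$ means no change.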

  12. Benchmarks: • Barnes: high lock and data contention • Cholesky: shared work queues with high contention • Mp3D: frequent locking but with little contention in critical section • Radiosity: shared work queues with high contention • Water-nsq: little contention • Ocean-cont: conditional locking

  13. Summary: • It works! • Has been added to recent high-end Intel processors (as Hardware Lock Elision, part of Intel TSX) • Performance gains: • Less waiting for locks ==> more parallelism • Fewer lock acquisitions ==> less waiting & cache disruption ==> reduced latency • Fewer accesses to locks ==> less memory traffic

  14. Drawbacks? Costs? • ? • ?

  15. Presentation • Always have: • Motivation • Idea • Results • Pros and cons • Any insights beyond what paper says • Text on slides + Visuals • Demos if possible
