
A Systematic Methodology to Develop Resilient Cache Coherence Protocols

Konstantinos Aisopos (Princeton, MIT), Li-Shiuan Peh (MIT)



  1. A Systematic Methodology to Develop Resilient Cache Coherence Protocols • Konstantinos Aisopos (Princeton, MIT) • Li-Shiuan Peh (MIT)

  2. Motivation • The CMP era is here, enabled by aggressive transistor scaling • Shrinking transistor dimensions → unreliable silicon (10K-100K FITs; frequency of errors: months) [1,2] • [Figure: CMP tile with core (P), private cache (P$), shared cache (S$), and NIC, connected through a network of routers (R)] • [1] R. Baumann (TI), IEEE Design & Test of Computers, vol. 22 (3), 2005 • [2] J. Graham (MoSys), EE Times, 2002

  3. Motivation • The CMP era is here, enabled by aggressive transistor scaling • Shrinking transistor dimensions → unreliable silicon (10K-100K FITs; frequency of errors: months) • Goal: a resilient cache coherence protocol • Loss of a single coherence message: deadlock • [Figure: a request and its data response crossing the router network of the CMP]

  4. Outline • Motivation • Methodology • Walkthrough: a resilient transaction • Defining resilience properties • Enforcing resilience properties • Evaluation • Overhead • Performance • Conclusions

  5. Walkthrough Example: transaction • 1. initiator sends request to the directory • 2. directory forwards request to the sharers • 3. sharers invalidate their copy and acknowledge • 4. request completes and initiator sends unblock to the dir • 5. dir updates sharing vector and may now process succeeding requests • [Figure: message diagram among initiator R, directory (dir), and sharers S1, S2, with messages request (M), ack, unblock; state labels shown: dir S{ }, BM, M{ }; R: S, SM, M; S1 and S2: S, I]
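
The five steps on this slide can be traced with a short script. This is a minimal sketch, not the authors' protocol: the function name `upgrade_to_m`, the message labels, and the choice to send acks directly to the initiator are assumptions made for illustration.

```python
def upgrade_to_m(initiator, sharers):
    """Trace a fault-free upgrade to M as (source, destination, message) tuples."""
    cache = {node: "S" for node in sharers}   # every sharer starts in S
    trace = []

    # 1. initiator sends request to the directory
    cache[initiator] = "SM"
    trace.append((initiator, "dir", "request_M"))
    directory = "BM"                          # directory blocks the line

    # 2. directory forwards request to the other sharers
    others = sorted(node for node in sharers if node != initiator)
    for node in others:
        trace.append(("dir", node, "request_M"))

    # 3. sharers invalidate their copy and acknowledge (assumed: ack goes to initiator)
    for node in others:
        cache[node] = "I"
        trace.append((node, initiator, "ack"))

    # 4. request completes and initiator sends unblock to the directory
    cache[initiator] = "M"
    trace.append((initiator, "dir", "unblock"))

    # 5. directory records the new owner and can serve later requests
    directory = "M"
    return trace, cache, directory


trace, cache, directory = upgrade_to_m("R", ["R", "S1", "S2"])
for src, dst, msg in trace:
    print(f"{src} -> {dst}: {msg}")
```

Every message in the trace is a single point of failure in this baseline: if any one of them is dropped, the loop above has no way to make progress, which is the deadlock the Motivation slides describe.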

  6. Walkthrough Example: resilient transaction • 1. initiator sends request to the directory • 2. request is lost • 3. initiator resends request after a timeout • 4. directory forwards request to the sharers • (…transaction continues identically as before) • [Figure: R in SM sends request (M) to dir twice; the first copy is lost, the second is forwarded to S1 and S2]
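
The resend-after-timeout step can be sketched as a small simulation. The `LossyChannel` class, the `issue_request` helper, and the use of the send result as a stand-in for "a response arrived before the timeout" are all hypothetical; the slides do not specify how the timeout is implemented.

```python
from dataclasses import dataclass, field


@dataclass
class LossyChannel:
    """Hypothetical channel that drops the first `drops` messages sent on it."""
    drops: int
    delivered: list = field(default_factory=list)

    def send(self, msg):
        if self.drops > 0:
            self.drops -= 1          # message lost in the network
            return False
        self.delivered.append(msg)
        return True


def issue_request(channel, msg, max_attempts=5):
    """Send `msg`, regenerating it after each simulated timeout.

    Returns the number of attempts used, or None if every attempt was lost.
    """
    for attempt in range(1, max_attempts + 1):
        if channel.send(msg):
            return attempt           # stands in for "response arrived in time"
        # no response before the timeout: loop around and resend
    return None


channel = LossyChannel(drops=1)
attempts = issue_request(channel, ("request", "M", "R"))
print(attempts, channel.delivered)
```

The initiator can only resend because it still remembers the request, which is why it has to stay in a transient state (SM here) until the transaction is known to be complete.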

  7. Walkthrough Example: resilient transaction • 1. initiator resends its request • [Figure: R in SM resends request (M) to dir, which is already in BM with sharing vector S{R,S1,S2}; acks from S1 and S2 are in flight]

  8. Walkthrough Example: resilient transaction • 1. initiator resends its request • tolerate a duplicate request: (1) transition to the same state (2) generate the same messages • [Figure: directory state machine with states S, BS, BM, M and messages request (S), request (M), unblock; a duplicate request (M) arriving in BM must lead back to BM]

  9. Walkthrough Example: resilient transaction • 1. initiator resends its request • 2. directory forwards the request to sharers (again) • [Figure: dir in BM with sharing vector S{R,S1,S2} forwards request (M) to S1 and S2 a second time]

  10. Walkthrough Example: resilient transaction • tolerate a duplicate request: (1) transition to the same state (2) generate the same messages • [Figure: sharer state machine; request (M) moves S to I and sends ack, and a duplicate request (M) arriving in I stays in I and sends ack again]

  11. Walkthrough Example: resilient transaction • 1. initiator resends its request • 2. directory forwards the request to sharers (again) • 3. sharers acknowledge (again) • (…transaction completes identically as before) • [Figure: R moves from SM to M after the repeated acks from S1 and S2 arrive]
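
The "same state, same messages" rule for duplicate requests can be illustrated with two toy handlers. This is a sketch under assumptions: the `Directory` and `Sharer` classes, their method names, and the decision to ignore conflicting requests are illustrative and not taken from the slides.

```python
class Directory:
    """Toy directory entry for one cache line (hypothetical model)."""

    def __init__(self, sharers):
        self.state = "S"                 # line is currently shared
        self.sharers = set(sharers)
        self.pending = None              # initiator of the in-flight request

    def on_request_m(self, initiator):
        """Handle request (M); a duplicate repeats the original behaviour."""
        if self.state == "S":
            self.state = "BM"            # block until unblock arrives
            self.pending = initiator
        elif self.state == "BM" and initiator == self.pending:
            pass                         # duplicate: stay in BM
        else:
            return []                    # conflicting request: not modelled here
        targets = sorted(self.sharers - {initiator})
        return [("request_M", node) for node in targets]


class Sharer:
    """Toy sharer cache for the same line."""

    def __init__(self):
        self.state = "S"

    def on_request_m(self):
        self.state = "I"                 # first copy: S -> I; duplicate: I -> I
        return "ack"                     # acknowledged both times


directory = Directory({"R", "S1", "S2"})
first = directory.on_request_m("R")
second = directory.on_request_m("R")     # duplicate after a timeout
sharer = Sharer()
acks = [sharer.on_request_m(), sharer.on_request_m()]
print(first == second, directory.state, acks)
```

Because both handlers are idempotent for the replicated request, the initiator does not need to know which message was lost; resending the original request re-drives the whole transaction.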

  12. Outline • Motivation • Methodology • Walkthrough: a resilient transaction • Defining resilience properties • Enforcing resilience properties • Evaluation • Overhead • Performance • Conclusions

  13. Defining the Resilience Properties • message loss => transaction suspended • the requestor regenerates its request after a timeout • each node that receives the regenerated request repeats: - the same state transition - the same outgoing messages • [Figure: requestor R sends a request that triggers responses at two further nodes]

  14. Defining the Resilience Properties • Property 1: initiator remains transient throughout the transaction • Property 2: replicate msgs roll back to the same earlier state • Property 3: retain information to regenerate msgs • [Figure: initiator timeline stable → (request) → transient → … → transient → (last message) → stable, alongside a receiving node whose state A handles msgA and msgB from senders X and Y]

  15. Outline • Motivation • Methodology • Walkthrough: a resilient transaction • Defining resilience properties • Enforcing resilience properties • Evaluation • Overhead • Performance • Conclusions

  16. Enforcing Property 1 • Property 1: the initiator remains transient throughout a transaction to be able to resend lost messages • [Figure: initiator timeline stable → (request) → transient → … → transient → (last message) → stable]

  17. Enforcing Property 1 • Property 1: the initiator remains transient throughout a transaction to be able to resend lost messages • Enforcement: - detect every outgoing message that transitions the initiator to a stable state - replace the stable state with a transient state, and wait for done • Counter-example: stable → (request) → transient → (response) → initiator sends unblock to dir and becomes stable; the initiator cannot resend unblock • [Figure: after the change, the initiator stays transient after sending unblock and becomes stable only when done arrives]
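
The two enforcement steps for Property 1 can be written as a mechanical rewrite of a state-machine table. This is a minimal sketch under assumptions: the dictionary encoding of the FSM, the `timeout` and `done` event names, and the generated state names are hypothetical and not from the slides.

```python
def enforce_property_1(fsm, stable):
    """Return a copy of `fsm` in which no transition both sends a message
    and lands in a stable state.

    fsm maps (state, event) -> (next_state, [outgoing messages]).
    """
    out = {}
    for (state, event), (nxt, msgs) in fsm.items():
        if msgs and nxt in stable:
            # Step 1: detected a message that takes the initiator to a stable state.
            waiting = f"{state}_{event}_to_{nxt}_wait"   # hypothetical name
            # Step 2: land in a transient state instead, and wait for done.
            out[(state, event)] = (waiting, msgs)
            out[(waiting, "timeout")] = (waiting, msgs)  # resend the last message
            out[(waiting, "done")] = (nxt, [])
        else:
            out[(state, event)] = (nxt, msgs)
    return out


initiator = {
    ("S", "store"): ("SM", ["request_M"]),
    ("SM", "response"): ("M", ["unblock"]),   # the counter-example transition
}
resilient = enforce_property_1(initiator, stable={"S", "M", "I"})
for key, value in resilient.items():
    print(key, "->", value)
```

In the rewritten table the initiator can still regenerate unblock after a timeout, because it has not yet left the transient state that remembers it.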

  18. Enforcing Property 2 • Property 2: replicate messages roll back to the earlier state the original message transitioned to • [Figure: msgA takes the node to state A; a later replicate of msgA returns it to A]

  19. Enforcing Property 2 • Property 2: replicate messages roll back to the earlier state the original message transitioned to • Problem: T1 or T2? • Enforcement: disassociate branches after merging point • [Figure: from state S, two branches pass through T1 and T2 and merge into TM, so a replicate msgA arriving in TM cannot tell whether to roll back to T1 or T2; splitting TM into TM1 and TM2 keeps the branches apart]
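
The "T1 or T2?" ambiguity at a merging point can be checked mechanically. The sketch below assumes each FSM branch is given as a list of (message, resulting state) steps; the helper `rollback_targets` and the messages `msgX` and `msgY` are hypothetical, and the reading of the figure (branches through T1 and T2 merging into TM) is an interpretation of the slide.

```python
def rollback_targets(paths, msg):
    """For each state, collect the states that `msg` originally led to
    on any path reaching it. More than one target means the roll-back
    for a replicate of `msg` is ambiguous in that state."""
    targets = {}
    for path in paths:
        landed = None
        for message, state in path:
            if message == msg:
                landed = state               # where the original msg took us
            if landed is not None:
                targets.setdefault(state, set()).add(landed)
    return targets


def ambiguous_states(paths, msg):
    return sorted(s for s, t in rollback_targets(paths, msg).items() if len(t) > 1)


# Two branches that merge into a single state TM.
merged = [
    [("msgA", "T1"), ("msgX", "TM")],
    [("msgA", "T2"), ("msgY", "TM")],
]
# The same branches kept apart after the merging point.
split = [
    [("msgA", "T1"), ("msgX", "TM1")],
    [("msgA", "T2"), ("msgY", "TM2")],
]
print(ambiguous_states(merged, "msgA"))
print(ambiguous_states(split, "msgA"))
```

Splitting the merged state costs one extra state per branch, which is consistent with the growth in state counts reported in the overhead evaluation.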

  20. Enforcing Property 3 • Property 3: retain info to regenerate every outgoing message, in case a replicate request is received • [Figure: a sharer in M receives request (M) from R, sends its unique data, and moves to I; a replicate request (M) then finds the sharer in I without the data]

  21. Enforcing Property 3 • Property 3: retain info to regenerate every outgoing message, in case a replicate request is received • [Figure: on request (M) the sharer moves from M to transient state TI and retains the data after sending unique data; R moves to TM and sends invalidate permission; the sharer then moves to I and sends invalidate ack]
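
The retained-data handshake on this slide can be sketched as a toy cache model. The class `OwnerCache`, its method names, and the choice to re-send invalidate ack for a duplicate invalidate permission are assumptions for illustration only.

```python
class OwnerCache:
    """Toy sharer holding the only valid copy of a line (hypothetical model)."""

    def __init__(self, data):
        self.state = "M"
        self.data = data

    def on_request_m(self):
        if self.state == "M":
            self.state = "TI"                    # transient: keep the data for now
        if self.state == "TI":
            return ("unique_data", self.data)    # same reply for a replicate request
        return None                              # already invalid: nothing to send

    def on_invalidate_permission(self):
        if self.state == "TI":
            self.state = "I"
            self.data = None                     # now safe to drop the copy
        return "invalidate_ack"                  # assumed: re-acked on duplicates


owner = OwnerCache(data=42)
reply_1 = owner.on_request_m()
reply_2 = owner.on_request_m()                   # replicate request after a loss
state_before = owner.state
ack = owner.on_invalidate_permission()
print(reply_1, reply_2, state_before, owner.state, ack)
```

The extra round trip is the three-way handshake that the backup slides name as one source of performance overhead: the data stays in the sharer's cache until the requester confirms it arrived.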

  22. Outline • Motivation • Methodology • Walkthrough: a resilient transaction • Defining resilience properties • Enforcing resilience properties • Evaluation • Overhead • Performance • Conclusions

  23. Evaluation: Overhead • broadcast-based protocol (AMD Hammer, MOESI) • directory-based protocol (static directory node, MESI) • stable and transient states: 9 to 17 states (4 to 5 bits) • 12 to 22 cache states (4 to 5 bits) • No state was introduced into the critical path of serving a request

  24. Evaluation: Overhead • Miss Status Holding Register (MSHR) fields: 13 bits (0 to 2^13), 6 bits, 1 bit, 64 bits • 11 bytes • 4-32 entries • total storage overhead: < 0.5 KB / core (worst-case: 2 KB / core) (*) • (*) assuming a 64-node CMP with in-order cores
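
The MSHR figures on this slide can be cross-checked with a little arithmetic. The sketch assumes that the four field widths are per-entry additions and that "11 bytes" is their sum rounded up to whole bytes; the slide does not state this explicitly, and the worst-case 2 KB figure is not derived here.

```python
import math

FIELD_BITS = [13, 6, 1, 64]                 # field widths listed on the slide

entry_bits = sum(FIELD_BITS)                # 84 bits
entry_bytes = math.ceil(entry_bits / 8)     # rounded up to whole bytes

for entries in (4, 32):                     # entry counts listed on the slide
    total = entries * entry_bytes
    print(f"{entries} entries x {entry_bytes} B = {total} B")
```

Under these assumptions the 32-entry case stays below 512 bytes, which agrees with the "< 0.5 KB / core" total on the slide.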

  25. Evaluation: Performance Simulator: Wisconsin Multifacet GEMS

  26. Evaluation: Performance • directory protocol • metric: runtime overhead vs. non-resilient baseline (lower is better) • [Figure: bar chart over SPLASH (fft, fmm, lu, radix, water-nsq, water-sp) and PARSEC (blackscholes, canneal, fluidanimate, swaptions, x264) benchmarks plus AVERAGE; labeled values: 11%, 7.4%, 3.5%, 1.8%, 1.4%, 1.1%]

  27. Evaluation: Performance • broadcast protocol • metric: runtime overhead vs. non-resilient baseline • [Figure: bar chart over SPLASH (fft, fmm, lu, radix, water-nsq, water-sp) and PARSEC (blackscholes, canneal, fluidanimate, swaptions, x264) benchmarks plus AVERAGE; labeled values: 56%, 51%, 20.4%, 5.1%, 2.4%, 0.5%]

  28. Outline • Motivation • Methodology • Walkthrough: a resilient transaction • Defining resilience properties • Enforcing resilience properties • Evaluation • Overhead • Performance • Conclusions

  29. Conclusions • We have presented a generic methodology: coherence protocol -> resilient coherence protocol, by enforcing 3 properties • minimal hardware overhead (< 2 KB / node) • small performance overhead • directory-based protocol: 1.4% (1 fault / msec) • broadcast-based protocol: 2.4% (1 fault / msec)

  30. Thank You! Questions?

  31. BACKUP SLIDES

  32. Why performance overhead? • transactions last longer => a request may have to wait for outstanding conflicting requests to complete • data remain in caches for longer (3-way handshake) => longer cache replacement duration • more messages are injected in the NoC => higher network traffic => higher average NoC latency

  33. Transaction Duration • B: baseline protocol, no faults • R: resilient protocol, 1 fault / 10 μsec • L1: transaction served by sharer's L1 • L2: transaction served by directory (L2) • [Figure: transaction duration chart; labeled increases: +18%, +12%]

  34. Transaction Duration • B: baseline protocol, no faults • R: resilient protocol, 1 fault / 10 μsec • L1: transaction served by sharer's L1 • L2: transaction served by directory (L2) • large working sets, shared data => high number of requests (high traffic) • (!) retransmissions saturate the network • [Figure: transaction duration chart; labeled values: 11%, 24%]

  35. Network Traffic • [Figure: network traffic on the most congested link and averaged over all links]

  36. Enforcing the Resilience Properties • P2: A single message type transits to a unique state in every FSM branch • Case 2: identical messages in same branch • [Figure: after request (M), initiator R counts acks: SM + acks=0 → SM + acks=1 → SM + acks=2 → M; in general, msgA from X leads to T count=1 and msgA from Y leads to T count=2]

  37. Enforcing the Resilience Properties • P2: A single message type transits to a unique state in every FSM branch • Case 2: identical messages in same branch • [Figure: counter vs. per-sender bit vector; msgA from X gives T count=1 or T [XYZ=100]; msgA from Y then gives T count=2 or T [XYZ=110]]

  38. Enforcing the Resilience Properties • P2: A single message type transits to a unique state in every FSM branch • Case 2: identical messages in same branch • [Figure: with a duplicate msgA from X, the counter still advances to T count=2, while the bit vector stays at T [XYZ=100]]
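
The difference between counting messages and recording their senders can be shown in a few lines. This is a sketch: the function names are hypothetical, and the three-node sender set X, Y, Z simply follows the `XYZ` labels on the slides.

```python
def count_msgs(senders):
    """Counter-based tracking: every arrival advances the count,
    so a duplicate is indistinguishable from a new message."""
    return len(senders)


def track_msgs(senders, nodes=("X", "Y", "Z")):
    """Bit-vector tracking: one bit per sender, so setting the same
    bit twice leaves the state unchanged."""
    bits = {node: 0 for node in nodes}
    for sender in senders:
        bits[sender] = 1
    return "".join(str(bits[node]) for node in nodes)


# msgA from X, then msgA from Y: both schemes agree two senders were seen.
print(count_msgs(["X", "Y"]), track_msgs(["X", "Y"]))
# msgA from X, then a duplicate msgA from X: only the bit vector is unaffected.
print(count_msgs(["X", "X"]), track_msgs(["X", "X"]))
```

The bit vector makes the duplicate roll back to the state the original message produced, which is what Property 2 requires; the price is one bit per potential sender instead of a small counter.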

  39. [Figure: nodes numbered 0 to 63]
