This study presents a systematic methodology for developing resilient cache coherence protocols for chip multiprocessors (CMPs). Aggressive transistor scaling makes silicon unreliable, and the loss of a single coherence message can deadlock the protocol. The methodology defines three resilience properties and enforces them on an existing protocol; the evaluation covers hardware overhead and runtime performance. A walkthrough of a resilient transaction illustrates the key steps.
A Systematic Methodology to Develop Resilient Cache Coherence Protocols
Konstantinos Aisopos (Princeton, MIT), Li-Shiuan Peh (MIT)
Motivation
• CMP era is here…
• Enabled by aggressive transistor scaling
• shrinking transistor dimensions => unreliable silicon [1,2]
• (10K-100K FITs, frequency of errors: months)
[Diagram: CMP built from tiles (C) connected by routers (R); each tile contains P, P$, S$, and a NIC]
[1] R. Baumann (TI), IEEE Design & Test of Computers, vol. 22 (3), 2005
[2] J. Graham (MoSys), EE Times, 2002
Motivation (continued)
• Goal: resilient cache coherence protocol
• loss of a single coherence message => deadlock
[Diagram: request and data messages travelling between tiles through the routers (R)]
Outline
• Motivation
• Methodology
  - Walkthrough: a resilient transaction
  - Defining resilience properties
  - Enforcing resilience properties
• Evaluation
  - Overhead
  - Performance
• Conclusions
Walkthrough Example: transaction
1. initiator sends request to the directory
2. directory forwards request to the sharers
3. sharers invalidate their copy and acknowledge
4. request completes and initiator sends unblock to the dir
5. dir updates sharing vector and may now process succeeding requests
[Diagram: initiator R, directory dir, sharers S1 and S2; messages: request (M), ack, unblock; states shown: R in S, SM, M; dir in S{ }, BM, M{ }; S1 and S2 in S, I]
Walkthrough Example: resilient transaction
1. initiator sends request to the directory
2. request is lost
3. initiator resends request after a timeout
4. directory forwards request to the sharers
(…transaction continues identically as before)
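The timeout-and-resend steps above can be sketched in a few lines of Python. This is a minimal illustration, not the protocol's implementation: `LossyChannel`, `send_with_resend`, and `max_attempts` are hypothetical names, and a failed `send` stands in for a timeout expiring without a response.

```python
class LossyChannel:
    """Hypothetical channel that silently drops its first `drops` messages."""

    def __init__(self, drops):
        self.drops = drops
        self.delivered = []

    def send(self, message):
        """Return True if the message was delivered, False if it was lost."""
        if self.drops > 0:
            self.drops -= 1
            return False
        self.delivered.append(message)
        return True


def send_with_resend(channel, message, max_attempts=5):
    """Resend `message` until it is delivered; return the number of attempts."""
    for attempt in range(1, max_attempts + 1):
        if channel.send(message):
            return attempt
        # Assumption: a failed send models the initiator's timeout expiring,
        # so the loop simply resends the same request.
    raise TimeoutError(f"{message!r} not delivered after {max_attempts} attempts")


channel = LossyChannel(drops=1)          # the first request is lost
attempts = send_with_resend(channel, "request (M)")
print(attempts, channel.delivered)       # 2 ['request (M)']
```

The initiator keeps the request around until it is delivered, which is only possible while it stays in a transient state.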
Walkthrough Example: resilient transaction
1. initiator resends its request
[Diagram: R (in SM) sends request (M) to dir again; dir states shown: S{R,S1,S2}, BM; acks from S1 and S2]
Walkthrough Example: resilient transaction
1. initiator resends its request
• tolerate a duplicate request:
  (1) transit to same state
  (2) generate the same messages
[Diagram: directory FSM fragment with states S, BM, BS, M and transitions labeled request (S), request (M), unblock; the open question is the transition for a duplicate request (M) received in BM]
Walkthrough Example: resilient transaction
1. initiator resends its request
2. directory forwards the request to sharers (again)
[Diagram: dir (in BM) forwards request (M) to S1 and S2 a second time]
Walkthrough Example: resilient transaction
• tolerate a duplicate request:
  (1) transit to same state
  (2) generate the same messages
[Diagram: sharer FSM for S1 and S2: request (M) moves S to I and emits ack; a duplicate request (M) emits ack again]
Walkthrough Example: resilient transaction
1. initiator resends its request
2. directory forwards the request to sharers (again)
3. sharers acknowledge (again)
(…transaction completes identically as before)
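A duplicate-tolerant handler can be sketched as follows. This is an illustrative model only: the class names, the tuple message format, and the choice to ignore a conflicting request are assumptions made for the example, not details taken from the slides.

```python
class Directory:
    """Toy directory that treats a duplicate request (M) like the original."""

    def __init__(self, sharers):
        self.state = "S"
        self.sharers = set(sharers)
        self.requestor = None  # initiator of the transaction in flight

    def on_request_m(self, requestor):
        """Return the forwarded messages as (message, destination) pairs."""
        if self.state == "S":
            self.state = "BM"
            self.requestor = requestor
        elif requestor != self.requestor:
            return []  # conflicting request: would have to wait (not modelled)
        # Original and duplicate both end in BM and generate the same forwards.
        return [("request (M)", s) for s in sorted(self.sharers - {requestor})]


class SharerCache:
    """Toy sharer: S -> I on the first request, I -> I on a duplicate."""

    def __init__(self, name):
        self.name = name
        self.state = "S"

    def on_request_m(self):
        self.state = "I"
        return ("ack", self.name)  # the same ack either way


directory = Directory(["R", "S1", "S2"])
first = directory.on_request_m("R")
duplicate = directory.on_request_m("R")
print(first == duplicate, directory.state)  # True BM
```

Because the handlers are idempotent, the initiator can resend without knowing which message of the transaction was lost.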
Defining the Resilience Properties
• message loss => transaction suspended
• the requestor regenerates its request after timeout
• each node that receives the regenerated request repeats its original actions:
  - same state transition
  - same outgoing messages
[Diagram: requestor R sends a request; a chain of nodes returns responses to R]
Defining the Resilience Properties
• Property 1: initiator remains transient throughout the transaction
• Property 2: replicate msgs roll-back to same earlier state
• Property 3: retain information to regenerate msgs
[Diagram: initiator FSM (stable, transient, …, transient, stable; from request to last message) annotated with Property 1; receiver FSMs handling msgA and msgB annotated with Properties 2 and 3]
Enforcing Property 1
• Property 1: the initiator remains transient throughout a transaction to be able to resend lost messages
[Diagram: stable -> (request) -> transient -> … -> transient -> (last message) -> stable]
Enforcing Property 1
• Property 1: the initiator remains transient throughout a transaction to be able to resend lost messages
• counter-example: on receiving the response, the initiator sends unblock to the dir and moves to a stable state; if unblock is lost, the initiator cannot resend it
• Enforcement:
  - detect every outgoing message that transits the initiator to stable state
  - replace the stable with a transient state, and wait for done
[Diagram: stable -> (request) -> transient -> … -> (response, unblock to dir) -> transient -> (done) -> stable]
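The two enforcement steps can be sketched as a rewrite over a transition table. This is a minimal sketch under stated assumptions: the `Transition` tuple, the `_wait_done` state naming, and the explicit `timeout` self-loop are hypothetical choices for illustration; the slides only name the `done` message.

```python
from typing import NamedTuple, Optional


class Transition(NamedTuple):
    state: str            # current state of the initiator
    event: str            # incoming event or message
    sends: Optional[str]  # outgoing message, if any
    next_state: str


def enforce_property_1(transitions, stable_states):
    """Keep the initiator transient after any send that used to end in a stable state."""
    result = []
    for t in transitions:
        if t.sends is not None and t.next_state in stable_states:
            waiting = f"{t.next_state}_wait_done"  # hypothetical transient state name
            result.append(t._replace(next_state=waiting))
            # Assumption: on a timeout the initiator resends the same message.
            result.append(Transition(waiting, "timeout", t.sends, waiting))
            # Only the receiver's `done` lets the initiator become stable.
            result.append(Transition(waiting, "done", None, t.next_state))
        else:
            result.append(t)
    return result


original = [
    Transition("S", "store", "request (M)", "SM"),
    Transition("SM", "response", "unblock", "M"),
]
for t in enforce_property_1(original, stable_states={"I", "S", "M"}):
    print(t)
```

In this example only the second transition is rewritten, since it is the one that sends `unblock` and lands in the stable state M.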
Enforcing Property 2
• Property 2: replicate messages roll-back to the earlier state the original message transitioned to
[Diagram: state A with the original msgA and a replicate msgA]
Enforcing Property 2
• Property 2: replicate messages roll-back to the earlier state the original message transitioned to
• problem: the branches through T1 and T2 merge into a single state TM, so a replicate msgA received in TM cannot tell whether to roll back to T1 or T2
• enforcement: disassociate branches after merging point (TM is split into TM1 and TM2)
[Diagram: FSM with branches S -> T1 -> TM and S -> T2 -> TM, shown before and after splitting TM into TM1 and TM2]
Enforcing Property 3
• Property 3: retain info to regenerate every outgoing message, in case a replicate request is received
• example: a Sharer in M receives request (M), sends the unique data to R, and moves to I; the unique data can then no longer be regenerated
[Diagram: R, dir, Sharer; messages: request (M), unique data; Sharer states M, I]
Enforcing Property 3
• Property 3: retain info to regenerate every outgoing message, in case a replicate request is received
• the Sharer retains the unique data in a transient state TI before moving to I
[Diagram: R, dir, Sharer; messages: request (M), unique data, invalidate permission, invalidate ack; Sharer states M, TI, I; R state TM]
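The idea of retaining the data can be sketched as below. The message directions are an assumption made for this example: here the sharer receives the invalidate permission and answers with the invalidate ack, which is one plausible reading of the diagram. The class and method names are hypothetical.

```python
class RetainingSharer:
    """Toy sharer that keeps the only copy of a line until told it may drop it."""

    def __init__(self, data):
        self.state = "M"
        self.data = data

    def on_request_m(self):
        """Answer a request (M); a replicate request gets the same data again."""
        if self.state == "M":
            self.state = "TI"  # transient: the data is retained, not dropped
        if self.state == "TI":
            return ("unique data", self.data)
        return None  # already invalidated: nothing left to send

    def on_invalidate_permission(self):
        """Assumed handshake step: only now may the sharer drop the data."""
        if self.state == "TI":
            self.state = "I"
            self.data = None
        return "invalidate ack"  # repeated unchanged for a duplicate permission


sharer = RetainingSharer(data=b"cache line")
first = sharer.on_request_m()
replicate = sharer.on_request_m()
print(first == replicate, sharer.state)   # True TI
print(sharer.on_invalidate_permission(), sharer.state)  # invalidate ack I
```

The cost of this design is that the line occupies the sharer's cache for longer than in the baseline protocol.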
Evaluation: Overhead
• protocols: broadcast-based protocol (AMD Hammer, MOESI); directory-based protocol (static directory node, MESI)
• state counts (stable + transient): 9 to 17 states (4 to 5 bits); 12 to 22 cache states (4 to 5 bits)
• No state was introduced into the critical path of serving a request
Evaluation: Overhead
Miss Status Holding Register (MSHR)
• added fields: 0 to 2^13 (13 bits), 6 bits, 1 bit, 64 bits
• 11 bytes per entry, 4-32 entries (*)
• total storage overhead: < 0.5 KB / core (worst-case: 2KB / core) (*)
(*) assuming a 64-node CMP with in-order cores
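The storage figures can be cross-checked with a short calculation. This is only a sanity check under assumptions: the four field widths are taken to belong to one MSHR entry, an entry is rounded up to whole bytes, and 1 KB is taken as 1024 bytes. The worst-case 2KB figure is not derived here.

```python
import math

# Field widths in bits, as listed on the slide (field names are not given there).
FIELD_BITS = [13, 6, 1, 64]

bits_per_entry = sum(FIELD_BITS)                # 84 bits
bytes_per_entry = math.ceil(bits_per_entry / 8)  # rounds up to 11 bytes


def storage_bytes(entries):
    """Added storage for an MSHR with the given number of entries."""
    return entries * bytes_per_entry


print(bits_per_entry, bytes_per_entry)        # 84 11
print(storage_bytes(4), storage_bytes(32))    # 44 352
print(storage_bytes(32) < 0.5 * 1024)         # True
```

With 32 entries the added storage is 352 bytes, which is consistent with the stated bound of less than 0.5 KB per core.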
Evaluation: Performance Simulator: Wisconsin Multifacet GEMS
Evaluation: Performance
• metric: runtime overhead vs. non-resilient baseline (lower is better)
• directory protocol
[Bar chart: benchmarks fft, fmm, lu, radix, water-nsq, water-sp (SPLASH) and blackscholes, canneal, fluidanimate, swaptions, x264 (PARSEC), plus AVERAGE; value labels: 11%, 7.4%, 3.5%, 1.8%, 1.4%, 1.1%]
Evaluation: Performance
• metric: runtime overhead vs. non-resilient baseline
• broadcast protocol
[Bar chart: benchmarks fft, fmm, lu, radix, water-nsq, water-sp (SPLASH) and blackscholes, canneal, fluidanimate, swaptions, x264 (PARSEC), plus AVERAGE; value labels: 56%, 51%, 20.4%, 5.1%, 2.4%, 0.5%]
Conclusions
We have presented a generic methodology:
• coherence protocol -> resilient coherence protocol
  …by enforcing 3 properties
• minimal hardware overhead (<2KB / node)
• small performance overhead
  - directory-based protocol: 1.4% (1 fault / msec)
  - broadcast-based protocol: 2.4% (1 fault / msec)
Thank You! Questions?
BACKUP SLIDES
Why performance overhead?
• transactions last longer => a request may have to wait for outstanding conflicting requests to complete
• data remain in caches for longer (3-way handshake) => longer cache replacement duration
• more messages are injected in the NoC => more network traffic => higher average NoC latency
Transaction Duration
• B: baseline protocol, no faults
• R: resilient protocol, 1 fault/10μsec
• L1: transaction served by sharer's L1
• L2: transaction served by directory (L2)
[Bar chart: transaction duration, B vs. R; value labels: +18%, +12%]
Transaction Duration
• B: baseline protocol, no faults
• R: resilient protocol, 1 fault/10μsec
• L1: transaction served by sharer's L1
• L2: transaction served by directory (L2)
• large working sets, shared data => high number of requests (high traffic)
• (!) retransmissions saturate network
[Bar chart: transaction duration, B vs. R; value labels: 11%, 24%]
Network Traffic
[Charts: most congested link; average over all links]
Enforcing the Resilience Properties
• P2: A single message type transits to a unique state in every FSM branch
• Case 2: identical messages in same branch
• example: R sends request (M) and collects acks: SM + acks=0 -> (ack) -> SM + acks=1 -> (ack) -> SM + acks=2 -> … -> M
[Diagram: generic FSM in which msgA from X leads to T1 (T, count=1) and msgA from Y leads to T2 (T, count=2)]
Enforcing the Resilience Properties
• P2: A single message type transits to a unique state in every FSM branch
• Case 2: identical messages in same branch
• msgA from X, then msgA from Y:
  - counter: T count=1, then T count=2
  - bit-vector: T [XYZ=100], then T [XYZ=110]
Enforcing the Resilience Properties
• P2: A single message type transits to a unique state in every FSM branch
• Case 2: identical messages in same branch
• msgA from X, then a duplicate msgA from X:
  - counter: T count=1, then T count=2 (the duplicate is counted again)
  - bit-vector: T [XYZ=100], then T [XYZ=100] (the duplicate has no effect)
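The difference between the two encodings can be shown with a small sketch. The class names are hypothetical; the senders X, Y, Z and the `XYZ` bit order follow the slide.

```python
class CounterTracker:
    """Counts received msgA messages; cannot tell a duplicate from a new sender."""

    def __init__(self):
        self.count = 0

    def on_msg(self, sender):
        self.count += 1


class BitVectorTracker:
    """Keeps one bit per sender, so a duplicate from the same sender changes nothing."""

    def __init__(self, senders):
        self.bits = {s: 0 for s in senders}  # dicts preserve insertion order

    def on_msg(self, sender):
        self.bits[sender] = 1

    def encoded(self):
        return "".join(str(bit) for bit in self.bits.values())


counter = CounterTracker()
vector = BitVectorTracker(["X", "Y", "Z"])
for sender in ["X", "X"]:  # msgA from X, then a duplicate msgA from X
    counter.on_msg(sender)
    vector.on_msg(sender)

print(counter.count)     # 2
print(vector.encoded())  # 100
```

The bit-vector costs one bit per possible sender instead of a logarithmic counter, in exchange for making the state reached by msgA from X unique.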