
Memory Sharing Predictor: The key to speculative Coherent DSM






Presentation Transcript


  1. Memory Sharing Predictor: The key to speculative Coherent DSM An-Chow Lai, Babak Falsafi, Purdue University

  2. Organization • Introduction • Directory based cache coherence • Pattern Based Message Predictors • Memory Sharing Predictors • Vector Memory Sharing Predictors • Speculative Coherent Operations • Performance Analysis • Results • Summary & Conclusions

  3. Introduction • Distributed Shared Memory (DSM) Multiprocessors: • Provide a logical shared address space over physically distributed memory • Programming is easier than with explicit message passing • Non-Uniform Memory Access (the bottleneck): remote accesses are far slower than local accesses.

  4. Efforts to eliminate this difference: • Custom-designed motherboards: forfeit the excellent cost/performance of off-the-shelf motherboards • Reduce remote access frequency • Reduce coherence protocol overhead: requires complex adaptive coherence protocols • Existing predictors: directed at specific sharing patterns known a priori • Pattern-based predictors: • Dynamically adapt to an application's sharing pattern at runtime • Do not modify the base coherence protocol • Memory Sharing Predictors & Vector Memory Sharing Predictors: • Topic of this paper • Improvements on the general pattern-based predictors proposed by Mukherjee & Hill

  5. Directory based cache coherence (Figure: four nodes, each with a processor & caches, memory, I/O, and a directory, connected by an interconnection network)

  6. Directory based cache coherence • Directory-based cache coherence protocols • Each home node maintains sharing information for its memory blocks • Modeled as a finite state machine: the states are directory states, the actions are coherence messages • This paper uses a half-migratory protocol • A speculative coherent DSM must accurately predict remote accesses and perform coherence actions in a timely manner. (Figure: directory protocol transitions for a remote read request)
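The directory's state machine can be sketched as a toy full-map write-invalidate directory (a simplification for illustration; the paper's half-migratory protocol is not reproduced here, and the state and message names are assumptions, not the paper's):

```python
# Toy full-map write-invalidate directory (illustrative only; the paper
# uses a "half migratory" variant that behaves differently on writes).
class Directory:
    def __init__(self):
        self.state = "Uncached"     # Uncached | Shared | Modified
        self.sharers = set()        # node ids holding a read-only copy
        self.owner = None           # exclusive owner while Modified

    def read(self, node):
        """Handle a remote read request; return messages the home sends."""
        msgs = []
        if self.state == "Modified":
            msgs.append(("writeback-request", self.owner))  # fetch dirty copy
            self.owner = None
        self.state = "Shared"
        self.sharers.add(node)
        msgs.append(("send-block", node))
        return msgs

    def write(self, node):
        """Handle a remote write request; invalidate all other copies."""
        msgs = [("invalidate", s) for s in self.sharers if s != node]
        if self.state == "Modified" and self.owner != node:
            msgs.append(("invalidate", self.owner))
        self.state, self.owner, self.sharers = "Modified", node, set()
        msgs.append(("send-block-exclusive", node))
        return msgs
```

On a remote read of a dirty block the home first recalls the owner's copy; on a write it invalidates every sharer before granting exclusivity, which is exactly the overhead the predictors later try to hide.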

  7. Pattern Based Message Predictors • Predicts the sender and type of the next incoming message for a particular block • Structure: similar to a two-level branch predictor • History table: captures the most recent sequence of incoming messages for every memory block • Pattern table: records all observed sequences of coherence messages for every memory block (an entry maps a sequence of messages to a predicted message) (Figure: a two-level message predictor)

  8. Pattern Based Message Predictors (contd.) (Figure: the Message History Table (MHT) holds a Message History Register (MHR) per block, a sequence of <sender, type> entries) • The history register depth is the number of past messages it keeps track of • Deeper history => more accurate prediction, no race conditions • Deeper history => larger pattern history table => higher cost
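The two-level structure above can be sketched as a small software model (a minimal sketch in the COSMOS style; the class name, last-value update policy, and per-block dictionaries are illustrative assumptions, not the paper's hardware):

```python
from collections import defaultdict, deque

# Sketch of a two-level (history register + pattern table) message
# predictor; each message is a <sender, type> pair as on the slide.
class MessagePredictor:
    def __init__(self, depth=1):
        self.depth = depth
        self.mhr = defaultdict(lambda: deque(maxlen=depth))  # per-block MHR
        self.pt = defaultdict(dict)  # per-block {history tuple: next message}

    def predict(self, block):
        """Predict the next <sender, type> message for `block`, or None."""
        return self.pt[block].get(tuple(self.mhr[block]))

    def observe(self, block, msg):
        """Train on the actual incoming message, then shift it into the MHR."""
        hist = tuple(self.mhr[block])
        if len(hist) == self.depth:
            self.pt[block][hist] = msg  # last-value update of the pattern table
        self.mhr[block].append(msg)
```

With depth 1, after seeing the alternating sequence (P1 read, P2 write, P1 read, ...), the pattern table learns that a P2 write follows a P1 read and vice versa.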

  9. Memory Sharing Predictors • Shortcomings of the general message predictor: • Invalidation acknowledgement messages may arrive in any order, and so may interfere with the prediction of the more important request messages • They increase the number of pattern table entries (almost doubling it) • They increase the number of bits needed to encode the messages (three requests & two acks) • Observations: • To eliminate the coherence overhead of remote accesses, it is only necessary to predict memory request messages (read, write, upgrade) • Predicting coherence acknowledgement messages is pure overhead, as they are always expected to arrive in response to a coherence action

  10. Memory Sharing Predictors • MSP addresses these issues by predicting only the memory request messages: • Since acknowledgements are eliminated, all effects of their possible reordering are eliminated • Only 2 bits are required to encode a message, compared to 3 for the general predictor

  11. VMSP: A Vector MSP • Observations: • A full-map protocol allows multiple processors to simultaneously cache a read-only copy of a memory block • A predictor must identify the sharers, not maintain the order in which they read • Optimization to MSP to get VMSP: • Rather than record and predict read requests as individual pattern table entries, encode a sequence of read requests as a bit vector, just as the directory maintains its list of sharers
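The vector encoding can be illustrated with a small helper (a sketch with hypothetical names; it assumes a 16-node machine as in the evaluation and simply folds runs of reads into a sharer bit vector):

```python
# Sketch of VMSP's key idea: collapse a run of read requests into one
# sharer bit vector (like a directory's full-map vector) instead of
# recording each read as a separate, ordered pattern-table entry.
def encode_reads(requests, n_nodes=16):
    """Fold consecutive reads into bit vectors; keep writes/upgrades as-is."""
    pattern, vector = [], 0
    for node, msg_type in requests:
        if msg_type == "read":
            vector |= 1 << node          # the order of readers is discarded
        else:
            if vector:
                pattern.append(("readers", vector))
                vector = 0
            pattern.append((node, msg_type))
    if vector:
        pattern.append(("readers", vector))
    return pattern
```

Because the reads commute into one vector, any reordering of the same set of readers produces the identical pattern entry, which is why VMSP is immune to read-request reordering.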

  12. Vector Memory Sharing Predictor (contd.) • Benefits: • Reduces the number of pattern table entries • Eliminates the effect of read reordering on table size • Effective history depth: a single vector entry covers all the sharers • Beneficial when the number of readers is large (> (2+n)/2+log(n)).

  13. Triggering Request Speculation • Important considerations: • Predict what remote memory requests arrive • Predict when remote accesses arrive • Execute the necessary coherence actions (Figure: a speculative coherent DSM node and its coherence hardware)

  14. Triggering Request Speculation • A) What remote memory request arrives: relatively simple, read from the pattern history table (which stores what memory accesses take place) • B) When it arrives: harder • Early speculation may take the block away from its readers • Late speculation may incur additional delay and limit the DSM's ability to hide coherence overhead • Timing was not a problem in COSMOS: all coherence messages were predicted but not sent, and a predicted message was acted on only after the previous message arrived. Since MSP keeps no coherence acknowledgement messages in the history table, timing becomes a problem.

  15. Triggering Request Speculation • Two ways to overcome this: • 1) Speculative Write Invalidation (SWI): • Based on a common memory access pattern, the producer/consumer scenario: the producer writes to a memory block and then no longer accesses it until it has been read by the consumers. Common in parallel commercial database servers. • MSP predicts that a processor is done writing when the processor writes to some other memory location • Maintain an early write-invalidate (EWI) table storing the last address written by each processor • When a processor's entry in the EWI table changes, trigger a speculative write invalidate and the predicted subsequent reads.
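The EWI heuristic above can be sketched as follows (illustrative names; the real table is a small hardware structure, and the addresses here are arbitrary examples):

```python
# Sketch of the early write-invalidate (EWI) heuristic: assume a producer
# is done writing a block when it writes to a *different* address.
class EarlyWriteInvalidate:
    def __init__(self):
        self.last_write = {}   # processor id -> last block address written

    def on_write(self, proc, addr):
        """Return the block to speculatively invalidate, or None."""
        prev = self.last_write.get(proc)
        self.last_write[proc] = addr
        if prev is not None and prev != addr:
            return prev        # trigger speculative invalidate + pushed reads
        return None
```

Repeated writes to the same block keep the entry stable; the first write elsewhere is the signal that the producer has moved on, so the previous block can be invalidated and pushed to its predicted readers.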

  16. Comparison with the general message predictor (Figure: two timelines with P2 as writer, P3 as directory, P1 as reader. Without speculation: P2 writes A; P1's read triggers invalidate, writeback, and send block. With speculation: P2's write to B triggers the speculative invalidate and writeback, prefetching starts, the block is sent to P1 early, and P1's read hits.)

  17. Question? • What happens if, while speculatively-read data is in flight from P3 to P1, P1 has already made its request for the data?

  18. Question? • What happens if, while speculatively-read data is in flight from P3 to P1, P1 has already made its request for the data? • The DSM node receiving the speculative message drops it, avoiding any modification to the protocol.

  19. Question? • What happens if P1 makes its read request before P2 does the second write?

  20. Question? • What happens if P1 makes its read request before P2 does the second write? • 2) First Read: • If SWI fails, then when the first read request arrives, all the remaining predicted reads are triggered.

  21. Speculative Coherence Operations • Final action: • Execute a coherence action speculatively • Verify the accuracy of the predictor • Requirements: • Co-exist with the base coherence protocol without any protocol modifications • MSP simply advises the protocol to execute coherence operations. Any misspeculation results in additional coherence operations but no interference with protocol functionality • e.g., a premature write invalidation results in an additional read/write request by the producer • MSP advises the protocol to send read-only copies of the block to the predicted requesters.

  22. Verification of accuracy • A reference bit is kept in the remote cache for every speculatively placed block • On an actual reference, the remote cache clears the bit, verifying that the access occurred • On invalidation of the block, the reference bit is sent along with the invalidation message • The MSP at the home node examines this bit and removes mispredicted messages.
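A minimal model of the reference-bit handshake (the class and method names are illustrative assumptions; only the bit's set/clear/report behavior follows the slide):

```python
# Sketch of misprediction detection with a per-block reference bit in the
# remote cache: set on a speculative fill, cleared on an actual access,
# and returned to the home node with the invalidation message.
class SpeculativeBlock:
    def __init__(self):
        self.ref_bit = True        # set: placed speculatively, not yet used

    def access(self):
        self.ref_bit = False       # a demand access verifies the speculation

    def invalidate(self):
        """Return True if the speculative fill was never used (mispredict)."""
        return self.ref_bit
```

If the bit comes back set, the home node's MSP knows the push was wasted and prunes the corresponding prediction.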

  23. Performance Analysis • Performance depends on: • Speculation accuracy • Reduction in latency on successful speculation • Misspeculation penalty • Speculation opportunity: a computation-intensive application will benefit little from speculation • Assumptions: • When a speculative memory request is successfully executed, the entire remote latency is hidden • Misspeculation only slows the remote access; it does not increase the request frequency

  24. Performance • Performance model: • c : application's communication ratio • f : fraction of speculatively executed requests over all received requests • p : request prediction accuracy • laccess : local access latency • raccess : remote access latency • rtl : raccess / laccess • n : misspeculation penalty factor • N : number of remote requests on the critical path

  25. Performance • Communication speedup is given by:

  comm_speedup = (comm time w/o speculation) / (comm time w/ speculation)
               = N·raccess / [ (1-f)·N·raccess + f·N·(p·laccess + (1-p)·n·raccess) ]
               = 1 / [ (1-f) + f·(p/rtl + n·(1-p)) ]

  • Total speedup is given by:

  total_speedup = (total execution time w/o speculation) / (total execution time w/ speculation)
                = 1 / [ (1-c) + c/comm_speedup ]
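The model transcribes directly into code (a sketch using the slide's own symbols):

```python
# The slide's analytic speedup model. Parameters: f = fraction of
# speculatively executed requests, p = prediction accuracy, rtl =
# remote-to-local latency ratio, n = misspeculation penalty factor,
# c = communication ratio.
def comm_speedup(f, p, rtl, n):
    return 1.0 / ((1 - f) + f * (p / rtl + n * (1 - p)))

def total_speedup(c, f, p, rtl, n):
    # Amdahl-style combination over the communication fraction c.
    return 1.0 / ((1 - c) + c / comm_speedup(f, p, rtl, n))
```

With perfect prediction (f = p = 1) the communication speedup equals rtl, since every remote access collapses to a local one; with p = 0 and n = 2, communication slows down by 2x, which matches the slowdowns the next slide reports for low accuracy.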

  26. Speedup vs. various parameters (Figure: potential speedup in a speculative coherent DSM)

  27. Speedups • Prediction accuracy plays the prominent role in speedup • A low prediction accuracy (10-50%) results in slowdown due to high speculation overhead, while a high prediction accuracy (90%) yields speedup even for moderate communication ratios • At high prediction accuracies, the slowdown due to an increasing misspeculation penalty is not significant • f, the fraction of speculated requests, is a measure of how many request messages it takes to learn and predict. For rapidly changing patterns, even at high prediction accuracy, the performance improvement will not be significant • A speculative coherent protocol impacts clusters most because of their high rtl ratio

  28. Simulation & results • Wisconsin Wind Tunnel II simulating a CC-NUMA with 16 nodes, interconnected through hardware DSM boards over a low-latency switched network • Full-map write-invalidate protocol with 32-byte coherence blocks • Benchmarks: appbt, barnes, em3d, moldyn, ocean, tomcatv, unstructured

  29. Results Base predictor accuracy comparison (history depth 1)

  30. Results • Em3d and Moldyn exhibit producer/consumer sharing with little read sharing => low impact of read ordering => high accuracy with MSP • Unstructured exhibits wide read-sharing in its producer/consumer phase, so MSP achieves a prediction accuracy of less than 65% while VMSP achieves almost 85%

  31. Results Prediction accuracy with varying history depths

  32. Results Messages predicted (correctly predicted) for a history depth of 1

  33. Results Predictor storage overhead

  34. Results • All predictors use 4 bits to encode a processor id • COSMOS uses 3 bits to encode the message type => 7 bits per history table entry and 14 bits per PTE => (7+14) bits per block • MSP and VMSP use 2 bits to encode a message type • MSP: 12 bits per PTE => (6+12) bits per block • VMSP: 18 bits per history table entry and (18+6) bits per PTE => (18+24) bits per block (in VMSP a read vector is always followed by a write/upgrade and vice versa, so a PTE contains at most one entry) • MSP and VMSP require less storage compared to COSMOS
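The per-block arithmetic on this slide can be reproduced directly (constants are taken from the slide; note the overall storage comparison also depends on how many pattern-table entries each predictor needs per block, which these per-block figures alone do not capture):

```python
# Per-block storage, following the slide: bits per block = one history
# table entry + one pattern table entry (PTE).
PID_BITS = 4                       # processor id

def cosmos_bits():
    hist = PID_BITS + 3            # 3-bit message type -> 7-bit entry
    return hist + 2 * hist         # PTE holds two messages: 7 + 14 = 21

def msp_bits():
    hist = PID_BITS + 2            # 2-bit message type -> 6-bit entry
    return hist + 2 * hist         # 6 + 12 = 18

def vmsp_bits():
    hist = 18                      # vectored history entry (slide's figure)
    return hist + (18 + 6)         # one (18+6)-bit PTE: 18 + 24 = 42
```

So per block, MSP needs 18 bits against COSMOS's 21; VMSP's wider vector entries are offset by needing far fewer pattern-table entries.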

  35. Summary and Conclusion • Proposed the Memory Sharing Predictor to predict and execute coherence operations speculatively • MSP eliminates acknowledgement messages from the pattern tables and increases prediction accuracy from 81% to 86% • VMSP further improves accuracy up to 93% using compact vector representations and eliminating perturbations due to read request reorderings • VMSP also reduces implementation storage • High-accuracy predictors are key to a high-performance speculative coherent DSM

  36. Discussions
