Cache Coherence Mechanisms (Research project) CSCI-5593

Cache Coherence Mechanisms(Research project)CSCI-5593 Prepared By Sultan Almakdi, AbdulwahabAlazeb, Mohammed Alshehri

Outline:

Introduction • Modern systems depend on using shared memory multiprocessors to increase the speed of execution time. • Each processor has its own private cache. • Cache is very important since it is used to improve and speedup the processing time. That is because of read or writes that can be completed in just a few cycles by the CPU. • There might be multiple copies of same data in different caches. • The Big Question: How to keep all of those different copies of data consistent ?

Importance of Cache Regarding Performance • Cache minimizes the average latency • Main memory access costs from 100 to 1000 cycles • Cache reduces the latency down to a small number of cycles • Cache minimizes the average bandwidthrequired to access main memory • Reduce access to shared bus or interconnect. • Cache allows for automatic migration of data • Data is moved closer to processor • Cache automatically replicates the data • Replication is done based upon need • Processors can share data efficiently But private caches can create a problem!!

The Cache Coherence Problem • Imaginewhat would happen when different processors read and write to same memory location? • Private processor caches cause a problem: • Multiple copies of data in different processor’s caches might cause having different values for the same data. • When any processor modifies its copy of data, it might NOT become visible to others. • This would result in the other processors having invalid value of the data in their caches

Example

Cache Coherence What Does Cache Coherence Mean? • Cache coherence is the process of ensuring that all shared data are consistent. • In order to get a correct execution, coherencemust be enforced between the caches.

A memory is coherent ifthe following conditions fulfilled : Write propagation: When any processor writes, time elapses, the written value must become visible to others. Write serialization: If two processors want to write to the same location in the same time, they will be seen in the same order by all processors.

Cache Coherence • There are 4 primary design issues for the coherence mechanisms must be considered to get the best performance: • Coherence detection strategy. • Coherence enforcement strategy. • How the precision of block-sharing information. • Cache block size.

Four primary design issues • Coherence detection strategy: • Incoherentmemoryaccessescan bedetected. • This could occur at run-time orcompile-time. • Precision of block-sharing information: • How precise the coherence detection strategy will be. • There is a trade-off between performance and implementation cost.

Four primary design issues(cont..) • Cache block size: • How the cache block size effects the performance of the memory system. • CoherenceEnforcement strategy: • Toupdateorinvalidate data tomake sure that ”invalid data" will not be readbyanyprocessor.

Cache Coherence Solutions • Software Solutions: Software solutions rely on Operating system and Compiler. • Hardware Solutions: We will Focus on hardware solutions because they are more common and related to our course.

Hardware Solutions • Two basic methods for Cache –Memory coherence: • Write back The memory is updated only when the block in the cache is being replaced. • Write through The memory is updated every time the cache is updated.

Hardware Solutions • Two basic methods for cache –cache coherence: • Write-invalidate: • The processor invalidates all copies of data blocks and hence is granted complete access of the block • Write-update: • When a processor writes on the data blocks, all copies of the block are also updated.

Cache Coherence Protocols • The methods explained in the last slide are most commonly used by the following mechanisms: • Snoop-based protocol • Directory-basedprotocol

Snoopy-based protocol • Snoopy-based coherence protocol is a very popular in multi-core system since it is simple and has low overhead. • Busallows each processor to monitor all of the transactions to the shared memory. • A controller “snooper”: is used in each cache to response to other processors requests and bus. • Snooping protocol is fast when we have enough bandwidth, and provides low average miss latency.

Cont.. • All coherence transactions are broadcasted, so all are seen by all other processors. • In case cache snooper sees a write on the bus, it will invalidate the line out of its cache if it is present. • In case a cache snooper sees a read request on the bus, it checks to see if it has the most recent copy of data, and if so, responds to the bus request.

Snoop-based protocol(cont..) • Two major methods are used by Snoop-based protocol: • Write-invalidate. • Write-update.

Snoop-based protocol(cont..) • On the case of Write update Protocol: • Write to the blocks that are sharing the data • Then broadcast on bus and processors. • Snoop and update all of the blocks copies. • The memory is always kept freshly updated • This method is not preferred since it needs a broadcast for each step of write, which need more bandwidth and lead to more traffic.

Snoop-based protocol(cont..) • On the case of Write Invalidate Protocol: • It has one writer and many readers • In order to write to shared data: • An invalidate is sent to all caches which snoop and invalidate any copies. • When Read Miss occurs: • Write-back: snoop in caches to find most recent copy. • It is used in Most modern multicore systems since it has less bus traffic.

Some Snoop Cache Types“based on the block states” • Basic Protocol • Modified, Shared and Invalid • Berkeley Protocol • Owned Exclusive, Owned Shared, Shared and Invalid • Illinois Protocol • Private Dirty, Private Clean, Shared and Invalid • MESI Protocol • Modified, exclusive, Shared and Invalid

Snoopy-based protocol • Each block of main memory is in one state: • Cleanin all caches and up-to-date in memory (shared) • Dirtyin exactly one cache (exclusive) • Un-cachedNot in any cache • Each cache block can be in one of following states : • Modified: Only the valid copy in any cache and its value is different from the main memory copy. • Shared: A valid copy, but other caches may also have it. • Invalid: block has no valid data. • Exclusive: This copy has not been modified yet but it is the only valid copy in any cache.

Processor 1 Processor 2 Processor 3 Processor 4 caches caches caches caches snooper snooper snooper snooper Bus Main Memory

Example 1: If processor 1wants to read block A: • Read Hit: If block A is in its own cache, there will be read hit. • Read miss: • If block A is not in its own cache. Therefore, it will send broadcast to see: • If any other cache has valid copy of this block if so, it will get it from there. • If not, it will get it from the main memory as the following:

Read miss Processor 1 Processor 2 Processor 3 Processor 4 Read A caches caches caches caches S Controller Bus If processor1 wants to read block A from main memory Main Memory Clean A A

Read miss Processor 1 Processor 2 Processor 3 Processor 4 Read A caches caches caches caches S A S A Controller Bus Example 2: If processor 3 wants to read block A and it is on cache of P1 Main Memory Clean A

Write miss Processor 1 Processor 2 Processor 3 Processor 4 Write A A S I caches A caches caches caches M A I S Controller Bus Example 3: If processor 4 wants to write on block A and it is on cache of P1 and P3 Dirty Main Memory A

Read miss Processor 1 Processor 2 Processor 3 Processor 4 Read A caches caches S caches A caches M I S A A A I A Write- Back To main memory Example 4: If processor 3 has an invalid copy of block A and wants to read block A and there is modified copy of it on cache of P4 Dirty Clean Main Memory A

Requests from the processor:

Requests from the bus:

Important Observations • If any processor now wants to write in its block, it has to upgrade its block state from shared to exclusive copy. • By write-back method, the main memory will be updated once the processor which has a modified copy wants to change its state to shared state.

Directory-basedprotocol • Each processor (or cluster of processors) has its own memory • The directory is also distributed along with the corresponding memory • Each processor has: • Fast access to its local memory • Slower access to “remotememory which is located at other processors • The physical address is enough to determine the location of memory. • Processing nodes • Are connected with a scalable interconnect, resulting in routing of the messages from sender to receiver instead of broadcasting. • Cannot snoop anymore, thus records of sharing state is now kept in the directory in order to track them.

Directory-based protocol(Cont..) • Typically three processors involved: • Local node: where a request originates • Home node: contains the memory location of an address • Remote node: contains a copy of the cache block, either exclusive or shared.

Directory-based protocol(Cont..) • cache states: • Shared: • At least one processor has cached data • Memory is up-to-date • Any processor is able to read the block • exclusive: • Only one processor (the owner) has the data cached. • Memory will be staled. • only that processor can write to it • Invalid (Un-cached): • No processor has the data cached. • In order to keep tracking which processors have data in shared state, Bit vector is used in which 1means the processor has the data.

Directory-based protocol(Cont..) Processor 1 & its Caches Processor 2 & its Caches Processor 3 & its Caches Processor 4 & its Caches 1 Memory 2 2 Memory 3 0 Memory 1 3 Memory 4 I / O I / O I / O I / O Directory Directory Directory Directory Interconnection network

Directory-based protocol(Cont..) • Assuming processor 1 wants to read block Aand from the address of the block “2.5 GB”: • The processor1 will recognize that block A is in the memory of processor 3. • The processor 1 will send request to the node 3. • The directory of node 3 will check the state of this block and make sure it is in the shared state and keep tracking of this block

Read miss S Processor 1 & its Caches Processor 2 & its Caches Processor 3 & its Caches Processor 4 & its Caches Read A Address 2.5 GB A A 0 Memory 1 2 Memory 3 3 Memory 4 1 Memory 2 I / O I / O I / O I / O S: p1 Directory Directory Directory Directory Interconnection network If P1wants to read block Aand from the address of the block “2.5 GB” the processor recognizes that it is in the memory of processor 3, so the processor 1 will send request to the node 3 Then send a copy of A to P1 then put the block in shared state and keep tracking it Then the directory of node 3 will check the state of this block and make sure it is in the shared state

Directory-based protocol(Cont..) • Assuming now processor 2wants to read block Aand from the address of the block “2.5 GB”: • The processor will recognize that it is in the memory of processor 3, also it is in shared state with processor 1. • The processor 2 will send request to the node 3. • Then the directory of node 3 will check the state of this block and make sure it is in the shared state and keep tracking of this block

Read miss Processor 1 & its Caches Processor 2 & its Caches Processor 3 & its Caches Processor 4 & its Caches S S Read A Address 2.5 GB A A A 0 Memory 1 1 Memory 2 3 Memory 4 2 Memory 3 I / O I / O I / O I / O S: p1 , P2 Directory Directory Directory Directory Interconnection network Then will put the block in shared state and keep tracking it Directory check the state of the block

Example 3:Assuming now processor 4 wants to WRITE in block A and from the address of the block “2.5 GB”1. The processor recognizes that it is in the memory of processor 3, also it is in shared state with processor 1, and 2.2. The processor 4will send request to the node 3.

Then the directory of node 3 will check the state of this block and make sure it is in the shared state after that will send node to node request to P1 and P2 to change the state of A from share to invalid and wait for ACKsince there is no Bus used here. The directory will be updated by deleting the state of block copy of P1, P2 and putting the copy of block for P4 in Exclusive state And keep tracking of this block.

Write miss I S S E I ACK ACK Write A Address 2.5 GB A A Processor 1 & its Caches Processor 2 & its Caches Processor 3 & its Caches Processor 4 & its Caches A 2 Memory 3 0 Memory 1 3 Memory 4 1 Memory 2 Node to node message to P2 to change the state Node to node message to P1 to change the state E: p4 S: p1 , P2 I / O I / O I / O I / O Directory Directory Directory Directory Interconnection network The directory will update its by deleting the state of block copy of P1, P2 and putting the copy of block for P4 in Exclusive state And keep tracking of this block. Then the directory of node 3 will check the state of this black and make sure it is in the shared state after that will send node to node request to P1 and P2 to change the state of A from share to invalid and wait for ACK If processor 4 wants to WRITE in block Aand from the address of the black “2.5 GB” the processor recognizes that it is in the memory of node 3 So, the processor 4 will send request to the node 3

Example 4: • Assuming now processor 1wants to READ block ABUT its copy is invalid . So, from the address of the block “2.5 GB” • The processor recognizes that it is in the memory of processor 3, BUT it is in Exclusivestate withprocessor 4, so the processor 1 will send request to the node 3. • Then the directory of node 3 will check the state of this block and find out it is in Exclusivestate withp4

3. So, node 3 will forward the request to node 4 which will change the block state to shared and by write back technique it will update the memory of node 3 by the updated copy of block A. 4. After that either node 3 (Home node ) or the node 4 (Remote node ) will send the copy of block A to the node 1 (Local node ) 5. Finally, the directory of node 3 will update its table and keep tracking of this block.

Read miss I S M S I Read A Address 2.5 GB A A A A Processor 1 & its Caches Processor 2 & its Caches Processor 3 & its Caches Processor 4 & its Caches A A E: p4 S: p1,P4 3 Memory 4 0 Memory 1 2 Memory 3 1 Memory 2 I / O I / O I / O I / O Read A for P1 Address 2.5 GB Directory Directory Directory Directory Interconnection network processor 1 wants to read block ABUT its copy is invalid . 3, so the processor 1 will send request to the node 3whichwill check the state of this black and find out it is in Exclusivestate withp4 Node 3 will forward the request to node 4 which will change the block state to shared and by write back technique will update the memory of node 3 by the updated copy of block A. After that either node 3 (Home node ) or the node 4 (Remote node ) will send the copy of block A to the node 1 (Local node ) Finally, the directory of node 3 will update its table and keep tracking of this block.

Directory Actions • If block is in un-cached state: • Read miss: send data, make block shared • Write miss: send data, make block exclusive • If block is in shared state: • Read miss: send data, add node to sharers list • Write miss: send data, invalidate sharers, make exclusive • If block is in exclusive state: • Read miss: ask owner for data, write-back to memory, send data, make shared, add node to sharers list • Data write back: write to memory, make un-cached • Write miss: ask owner for data, write to memory, send data, update identity of new owner, remain exclusive

Snoopy-Based Advantages and Disadvantages • Adv: • The average miss latency is low, especially for cache-to-cache misses. • In case of having small number of processors, snoopy will be fast. • Dis: • The cache coherence overhead and the speed of shared buses limit the bandwidth needed to broadcast messages to all processors. • Regarding power dissipation, the efficiency is fairly low • For large systems, it is not scale since each request will be broadcasted to all processors. • Buses have limitations for scalability: • Physical (number of devices that can be attached) • Performance (contention on a shared resource: the bus)

Directory-Based Advantages and Disadvantages: • Adv: • The scale much better than snoopy protocols. • It can exploit random point-to-point interconnects • Dis: • The directory access and the extra interconnect traversal is on the critical path of cache to cache misses. • It involves the storage and manipulation of directory state

Observation • Snoopy based protocol outperforms directory based in case of high bandwidth. • As the number of processors are increasing, directory based outperforms snoopy based protocol.

Cache Coherence Mechanisms (Research project) CSCI-5593

Cache Coherence Mechanisms (Research project) CSCI-5593

Presentation Transcript

Extra Cache Coherence Examples

Directory-Based Cache Coherence

Cache Coherence Schemes for Multiprocessors

Cache Coherence for GPU Architectures

Cache coherence

Cache Coherence

Cache Coherence in NUMA Machines

Cache coherence for CMPs

The Cache-Coherence Problem

Cache Coherence CS433 Spring 2001

Cache Coherence

Cache coherence, etc… - MIMD –

Cache Coherence Protocols

The Cache-Coherence Problem

Power Efficient Cache Coherence

Directory-based Cache Coherence

Cache Coherence

The Cache-Coherence Problem

The Cache-Coherence Problem

Example Cache Coherence Problem

The Cache-Coherence Problem