
Lecture 3. Directory-based Cache Coherence


Presentation Transcript


  1. COM503 Parallel Computer Architecture & Programming
  Lecture 3. Directory-based Cache Coherence
  Prof. Taeweon Suh, Computer Science Education, Korea University

  2. Scalability Problem with Bus
  • Bus-based machines are essentially based on broadcasting via bus
  • Bus bandwidth limitation as the number of processors increases
    • Bandwidth = #wires × clock freq.
    • Wire length grows with #processors
    • Load (capacitance) grows with #processors
  • Snoopy bus could delay memory accesses

  3. Distributed Shared Memory
  • Goal: Scale up machines
    • Distribute memory
    • Use scalable point-to-point interconnection providing
      • Bandwidth that scales linearly with #nodes
      • Memory access latency that grows sub-linearly with #nodes
  [Figure: nodes P1 … Pn, each with a cache ($), a communication assist (CA), and local memory (Mem), connected by a scalable interconnection network]

  4. Distributed Shared Memory
  • Important issue
    • How to provide caching and coherence in hardware on a machine with physically distributed memory, without the benefits of a globally snoopable interconnect such as a bus
    • Not only must the memory access latency and bandwidth scale well, but so must the protocols used for coherence, at least up to the scales of practical interest
  • The most common approach for full hardware support for cache coherence: directory-based cache coherence

  5. A Scalable Multiprocessor with Directories
  • Scalable cache coherence is typically based on the concept of a directory
    • Maintain the state of each memory block explicitly in a place called a directory
    • The directory entry for a block is co-located with the main memory holding the block
    • Its location can be determined from the block address
  [Figure: nodes P1 … Pn, each with a cache ($), a communication assist (CA), and memory with its directory, connected by a scalable interconnection network]

  6. 2-level Cache Coherent Systems
  • An approach that is (was?) popular takes the form of a 2-level protocol hierarchy
    • Each node of the machine is itself a multiprocessor
    • The caches within a node are kept coherent by an inner protocol
    • Coherence across the nodes is maintained by an outer protocol
  • A common organization
    • Inner protocol: snooping protocol
    • Outer protocol: directory-based protocol

  7. ccNUMA
  • Main memory is physically distributed and has non-uniform access costs to a processor
  • Architectures of this type are often called cache-coherent, non-uniform memory access (ccNUMA)
  • More generally, systems that provide a shared address space programming model with physically distributed memory and coherent replication are called distributed shared memory (DSM) systems

  8. DSM (NUMA) Machine Examples
  • Nehalem-based systems with QPI (QuickPath Interconnect)
  [Figure: Nehalem-based Xeon 5500 system architecture with QPI links; source: http://www.qdpma.com/systemarchitecture/SystemArchitecture_QPI.html]

  9. Directory-based Approach
  • Three important functions upon a cache miss
    • Finding out enough information about the state of the memory block
    • Locating other copies if needed (e.g., to invalidate them)
    • Communicating with the other copies (e.g., obtaining data from them, or invalidating or updating them)
  • In snoopy protocols, all three functions are done by the broadcast on the bus
    • Broadcasting in DSM machines generates a large amount of traffic
    • On a p-node machine, at least p network transactions on every miss
    • So, it does not scale
  • A simple directory approach:
    • Look up the directory to find out the state of the block in other caches
    • The location of the copies is also found from the directory
    • The copies are communicated with using point-to-point network transactions in an arbitrary interconnection network, without resorting to broadcast

  10. Terminology
  • Home node is the node in whose main memory the block is allocated
  • Dirty node is the node that has a copy of the block in its cache in modified (dirty) state
    • The home node and the dirty node for a block may be the same
  • Owner node is the node that currently holds the valid copy of a block and must supply the data when needed
    • This is either the home node (when the block is not in dirty state in any cache) or the dirty node
  • Exclusive node is the node that has a copy of the block in its cache in an exclusive (dirty or clean) state
  • Local node (or requesting node) is the node containing the processor that issues a request for the block
  • Blocks whose home is the requesting processor's node are called locally allocated (or local blocks), whereas all others are called remotely allocated (or remote blocks)
  [Figure: nodes P1 … Pn, each with a cache ($), a communication assist (CA), and memory with its directory, connected by a scalable interconnection network]

  11. A Directory Structure
  • A natural way to organize a directory
    • Maintain the directory together with a block in main memory at the home node
  • A simple organization of the directory entry for a block is a bit vector of n presence bits (one for each of the n nodes) together with state bits
    • For simplicity, let's assume that there is only one state bit (dirty)
  [Figure: nodes P1 … Pn with caches ($), CAs, and memories; each memory holds a directory with one entry per block, consisting of a dirty bit and n presence bits]
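
  To make this organization concrete, here is a minimal C sketch of such a directory entry; the type and helper names (dir_entry_t, has_copy, and so on) and the 64-node limit are illustrative assumptions, not part of the lecture.

      #include <stdbool.h>
      #include <stdint.h>

      #define MAX_NODES 64                 /* assumed machine size (illustrative) */

      /* One directory entry per memory block, kept at the home node:
       * a dirty bit plus one presence bit per node. */
      typedef struct {
          bool     dirty;                  /* ON: exactly one cache holds a modified copy */
          uint64_t presence;               /* bit i ON: node i (i < MAX_NODES) has a copy */
      } dir_entry_t;

      static bool has_copy(const dir_entry_t *e, int node) { return (e->presence >> node) & 1; }
      static void set_copy(dir_entry_t *e, int node)       { e->presence |=  (1ULL << node); }
      static void clear_copy(dir_entry_t *e, int node)     { e->presence &= ~(1ULL << node); }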

  12. A Directory Structure (Cont.)
  • If the dirty bit is ON, then only one node (the dirty node) should be caching that block and only that node's presence bit should be ON
  • A read miss can be serviced by looking up the directory to see which node has a dirty copy of the block or if the block is valid in main memory at the home node
  • A write miss can be serviced by looking up the directory as well to see which nodes are the sharers that must be invalidated

  13. Basic Operation
  [Figure: basic directory operation between the requesting node and the home node]

  14. A Read Miss (at node i) Handling
  • If the dirty bit is OFF:
    • The home CA obtains the block from main memory, supplies it to the requestor in a reply transaction, and turns the ith presence bit (presence[i]) ON
  • If the dirty bit is ON:
    • The home CA responds to the requestor with the identity of the node whose presence bit is ON (i.e., the owner node)
    • The requestor then sends a request network transaction to that owner node
    • At the owner, the cache changes its state to shared and supplies the block both to the requesting node and to main memory at the home node
    • The requesting node stores the block in its cache in shared state
    • At the home memory, the dirty bit is turned OFF and presence[i] is turned ON
  (CA: Communication Assist)
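
  Below is a minimal C sketch of the home node's side of this read-miss handling, under the bit-vector organization sketched earlier; reply_with_data() and reply_with_owner() are hypothetical stand-ins for the CA's network transactions.

      #include <stdbool.h>
      #include <stdint.h>
      #include <stdio.h>

      typedef struct { bool dirty; uint64_t presence; } dir_entry_t;

      /* Hypothetical stand-ins for CA network transactions. */
      static void reply_with_data(int req, uint64_t addr)
      { printf("data reply to node %d for block 0x%llx\n", req, (unsigned long long)addr); }

      static void reply_with_owner(int req, int owner)
      { printf("to node %d: owner of the block is node %d\n", req, owner); }

      /* Handle a read miss for block 'addr' issued by node i, at the home node. */
      void home_handle_read_miss(dir_entry_t *e, uint64_t addr, int i)
      {
          if (!e->dirty) {
              /* Memory is up to date: reply with data and record the new sharer. */
              reply_with_data(i, addr);
              e->presence |= (1ULL << i);
          } else {
              /* Dirty implies exactly one presence bit is ON: that node is the owner. */
              int owner = 0;
              while (((e->presence >> owner) & 1) == 0)
                  owner++;
              reply_with_owner(i, owner);
              /* The requestor then fetches the block from the owner; once the owner's
               * revision reaches home, the directory turns dirty OFF and presence[i] ON. */
          }
      }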

  15. A Write Miss (at node i) Handling
  • If the dirty bit is OFF:
    • Main memory has a clean copy of the data
    • Invalidation request transactions must be sent to all nodes j for which presence[j] is ON
    • The home node supplies the block to the requesting node (node i) together with the presence bit vector
    • The CA at the requestor sends invalidation requests to the required nodes and waits for acknowledgment transactions from those nodes
    • The directory entry is cleared, leaving only presence[i] and the dirty bit ON
    • The requestor places the block in its cache in dirty state
  • If the dirty bit is ON:
    • The block is first delivered from the dirty node (whose presence bit is ON) using network transactions
    • The cache in the dirty node changes its state to invalid, and then the block is supplied to the requesting processor
    • The requesting processor places the block in its cache in dirty state
    • The directory entry at the home node is updated so that only presence[i] and the dirty bit are ON
  (CA: Communication Assist)
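
  A companion C sketch of the home node's handling of a write miss when the dirty bit is OFF (the dirty-ON case mostly forwards the request to the owner); reply_with_data_and_sharers() is again a hypothetical stand-in for the actual network transactions.

      #include <stdbool.h>
      #include <stdint.h>
      #include <stdio.h>

      typedef struct { bool dirty; uint64_t presence; } dir_entry_t;

      /* Hypothetical stand-in: home sends the data plus the sharer vector to node i;
       * node i's CA then invalidates the sharers and collects acknowledgments. */
      static void reply_with_data_and_sharers(int i, uint64_t addr, uint64_t sharers)
      {
          printf("to node %d: data for block 0x%llx, sharers to invalidate = 0x%llx\n",
                 i, (unsigned long long)addr, (unsigned long long)sharers);
      }

      /* Handle a write miss for block 'addr' issued by node i, at the home node,
       * for the case where the dirty bit is OFF (memory holds a clean copy). */
      void home_handle_write_miss_clean(dir_entry_t *e, uint64_t addr, int i)
      {
          uint64_t sharers = e->presence & ~(1ULL << i);   /* everyone except the writer */
          reply_with_data_and_sharers(i, addr, sharers);

          /* Clear the entry, leaving only presence[i] and the dirty bit ON. */
          e->presence = (1ULL << i);
          e->dirty    = true;
      }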

  16. Block Replacement Issues in Cache
  • Replacement of a dirty block in node i
    • The dirty data being replaced is written back to main memory at the home node
    • The directory is updated to turn off the dirty bit and presence[i]
  • Replacement of a shared block in node i
    • A message may or may not be sent to the directory at the home node to turn off the corresponding presence bit
    • If the message was sent, an invalidation will not be sent to this node the next time the block is written
    • This message is called a replacement hint (whether or not it is sent does not affect the correctness of the protocol or the execution)
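
  A small C sketch of how the home directory might process the two replacement messages described above (dirty writeback versus optional replacement hint); write_block_to_memory() is a hypothetical helper.

      #include <stdbool.h>
      #include <stdint.h>

      typedef struct { bool dirty; uint64_t presence; } dir_entry_t;

      /* Hypothetical helper: commit written-back data to home main memory. */
      void write_block_to_memory(uint64_t addr, const void *data);

      /* Node i replaces a dirty block: the data arrives with the writeback message. */
      void home_handle_dirty_writeback(dir_entry_t *e, uint64_t addr, int i, const void *data)
      {
          write_block_to_memory(addr, data);
          e->dirty = false;
          e->presence &= ~(1ULL << i);
      }

      /* Node i replaces a shared block and (optionally) sends a replacement hint.
       * Correctness does not depend on this message; it only avoids a useless
       * invalidation the next time the block is written. */
      void home_handle_replacement_hint(dir_entry_t *e, int i)
      {
          e->presence &= ~(1ULL << i);
      }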

  17. Scalability of Directory-based Protocol
  • The main goal of using directory protocols is to allow cache coherence to scale beyond the number of processors that may be sustained by a bus
  • It is important to understand the scalability of directory protocols in terms of both performance and storage overhead for directory information
  • The major performance scaling issues for a protocol are how the latency and bandwidth demands scale with the number of processors used
    • The bandwidth demands are governed by the number of network transactions generated per miss (multiplied by the frequency of misses), and the latency by the number of these transactions that are in the critical path of the miss
    • These quantities are affected both by the directory organization and by how well the flow of network transactions is optimized in the protocol

  18. Reducing Latency
  [Figure: baseline vs. optimized message flow for a remote read miss; the optimized flow was adopted in the Stanford DASH. RemRd: Remote Read request]

  19. Race Conditions
  • What if multiple requests for a block are in transit?
  • Solution
    • The directory controller can serialize transactions for a block through an additional status bit (busy or lock bit)
    • If the busy bit is set and the home node receives another request for the same block, 2 options:
      • Reject incoming requests for as long as the busy bit is set, i.e., send a NACK (negative acknowledgement) to the requests
      • Queue incoming requests – bandwidth saved and request latency kept to a minimum, but design complexity increased (queue overflow)
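
  The busy-bit idea can be sketched in C as follows, assuming a per-entry busy flag; send_nack() and process_request() are hypothetical placeholders, and the queueing option is only indicated in a comment.

      #include <stdbool.h>
      #include <stdint.h>

      typedef struct {
          bool     busy;        /* a transaction for this block is already in progress */
          bool     dirty;
          uint64_t presence;
      } dir_entry_t;

      typedef struct { int src_node; uint64_t addr; int type; } request_t;

      /* Hypothetical helpers. */
      void send_nack(int node, uint64_t addr);                  /* requestor retries later */
      void process_request(dir_entry_t *e, const request_t *r); /* normal miss handling    */

      /* Serialize transactions for a block: while the entry is busy, NACK newcomers
       * instead of letting two transactions for the same block interleave. */
      void home_receive_request(dir_entry_t *e, const request_t *r)
      {
          if (e->busy) {
              send_nack(r->src_node, r->addr);   /* option 1: reject while busy           */
              return;                            /* (option 2 would queue r here instead) */
          }
          e->busy = true;                        /* lock the entry                        */
          process_request(e, r);                 /* busy is cleared when the transaction  */
                                                 /* for this block completes              */
      }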

  20. Issue with Turning Off the Busy Bit
  • When to turn off the busy bit?
    • Turn off the busy bit before the home responds to the requester (local node)?
  • Race condition example
    • In (c), what if a RemRd arrives at the local node before the UpgrAck reaches it?
    • This race condition occurs if subtransactions follow physically different paths
  • One solution
    • Send a NACK to the RemRd request
  (RemRd: Remote Read request)

  21. Scalability of Directory-based Protocol
  • Storage is affected only by how the directory information is organized
  • For the simple bit vector organization, the number of presence bits needed grows linearly with
    • #nodes: p presence bits per memory block for p nodes
    • Main memory size: one bit vector per memory block
  • It could lead to a potentially large storage overhead for the directory
  • Examples
    • With a 64-byte block size and 64 processors, what is the directory overhead as a fraction of the non-directory (i.e., data) storage?
      • 64 bits for each 64-byte block: 64b/64B = 1/8 = 12.5%
    • What if there are 256 processors with the same block size?
      • 256 bits for each 64-byte block: 256b/64B = 1/2 = 50%!
    • What if there are 1024 processors?
      • 1024 bits for each 64-byte block: 1024b/64B = 2 = 200%!!!
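
  A tiny C program reproducing these overhead numbers for the full bit-vector organization (presence bits per block divided by data bits per block; the dirty/state bits are ignored for simplicity):

      #include <stdio.h>

      /* Directory overhead of the full bit-vector scheme: p presence bits per block
       * relative to block_bytes * 8 data bits per block. */
      static double bitvector_overhead(int p, int block_bytes)
      {
          return (double)p / (double)(block_bytes * 8);
      }

      int main(void)
      {
          int procs[] = { 64, 256, 1024 };
          for (int i = 0; i < 3; i++)
              printf("%4d processors, 64B blocks: %.1f%%\n",
                     procs[i], 100.0 * bitvector_overhead(procs[i], 64));
          return 0;   /* prints 12.5%, 50.0%, 200.0%, as on the slide */
      }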

  22. Alternative Directory Organizations
  • Fortunately, there are many other ways to organize directory information that improve the scalability of directory storage
  • The different organizations naturally lead to different high-level protocols with different ways of addressing the protocol functions

  23. Alternative Directory Organizations
  • The 2 major classes of alternatives for finding the source of the directory information for a block
    • Hierarchical directory schemes
    • Flat directory schemes

  24. Hierarchical Directory Schemes
  • A hierarchical scheme organizes the processors as the leaves of a logical tree (which need not be binary)
  • An internal node stores the directory entries for the memory blocks local to its children
    • A directory entry essentially indicates which of its child subtrees are caching the block and whether some subtrees outside it are caching the block
  • Finding the directory entry of a block involves a traversal of the tree until the entry is found
    • Inclusion is maintained between the directory nodes at levels k and k+1, where the root is at the highest level
    • In the worst case, you may have to go to the root to find the directory entry
  • These organizations are not popular since their latency and bandwidth characteristics tend to be much worse than those of flat schemes

  25. Flat, Memory-Based Directory Schemes
  • The bit vector organization is the most straightforward way to store directory information in a flat, memory-based scheme
  • Performance characteristics on writes
    • The number of network transactions per invalidating write grows only with the number of actual sharers
    • But, because the identities of all sharers are available at the home node, invalidations can be sent in parallel
    • The number of fully serialized network transactions in the critical path is thus not proportional to the number of sharers, reducing latency
  • The main disadvantage of the bit vector organization is storage overhead

  26. Flat, Memory-Based Directory Schemes
  • 2 ways to reduce the overhead
    • Increase the cache block size
    • Put multiple processors (rather than just one) in a node that is visible to the directory protocol (that is, use a 2-level protocol)
    • Example: with 4-processor nodes and 128-byte cache blocks, the directory memory overhead for a 256-processor machine is 6.25% (256/4 = 64 nodes, so 64 presence bits per 128 × 8 = 1024 data bits)
  • However, these methods reduce the overhead by only a small constant factor
  • Total directory storage is still proportional to P × (P × m)
    • P is #nodes
    • P × m is the total #memory blocks in the machine
    • m is the #memory blocks in local memory

  27. Reducing Directory Overhead
  • Reducing the directory width
    • Directory width is reduced by using limited pointer directories
    • Most of the time, only a few caches have a copy of a block when the block is written
    • Limited pointer schemes maintain a fixed number of pointers, each pointing to a node that currently caches a copy of the block
    • Each pointer takes log2(P) bits of storage for P nodes
      • For example, for a machine with 1024 nodes, each pointer needs 10 bits, so even having 100 pointers uses less storage than a full bit vector scheme
    • Since they can keep track of only i copies precisely, these schemes need some kind of backup or overflow strategy for the situation when more than i readable copies are cached
  • Reducing the directory height
    • Directory height can be reduced by organizing the directory itself as a cache
    • Only a small fraction of the memory blocks will actually be present in caches at a given time, so most of the directory entries will be unused anyway

  28. Limited Pointer Directories
  • Assume that the directory can track i copies
    • In other words, the directory can keep i pointers
  • Pointer overflow handling options
    • If there are more than i sharers, fall back to broadcast from the (i+1)th request onward
    • Victimize (invalidate) one pointer and reuse it for the new node
    • Let software handle overflow (used by MIT Alewife)
      • When the hardware pointers are exhausted, a trap to a software handler is taken
      • The handler allocates a new pointer in regular (non-directory) memory
      • The software handler is needed
        • When the number of sharing nodes exceeds the number of hardware pointers available
        • When a block is modified and some pointers have been allocated in memory
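
  A minimal C sketch of a limited-pointer directory entry and its overflow detection; NPTR and the field names are illustrative, and the three overflow strategies listed above are only referenced in comments.

      #include <stdbool.h>

      #define NPTR 4                       /* i: hardware pointers per entry (assumed) */

      typedef struct {
          bool dirty;
          bool overflow;                   /* set once more than NPTR sharers exist    */
          int  nsharers;                   /* number of valid pointers                 */
          int  ptr[NPTR];                  /* node IDs of up to NPTR sharers           */
      } limited_dir_entry_t;

      /* Record a new sharer; return false on pointer overflow so the caller can apply
       * one of the strategies from the slide: broadcast on the next write, victimize
       * an existing pointer, or trap to a software handler (as in MIT Alewife). */
      bool add_sharer(limited_dir_entry_t *e, int node)
      {
          for (int k = 0; k < e->nsharers; k++)
              if (e->ptr[k] == node)
                  return true;             /* already tracked */
          if (e->nsharers < NPTR) {
              e->ptr[e->nsharers++] = node;
              return true;
          }
          e->overflow = true;              /* more than NPTR sharers: overflow */
          return false;
      }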

  29. Flat, Cache-Based Directory Schemes
  • There is still a home main memory for the block
  • However, the directory entry at the home node does not contain the identities of all sharers, but only a pointer to the first sharer in the list plus a few state bits
    • This pointer is called the head pointer for the block
  • The remaining nodes caching that block are joined together in a distributed, doubly-linked list
    • A cache that contains a copy of the block also contains pointers to the next and previous caches that have a copy
    • They are called the forward and backward pointers
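
  A minimal C sketch of the state kept by such a cache-based scheme: a head pointer per block at the home node, plus forward/backward pointers per cached copy. NIL and the field names are illustrative.

      #define NIL (-1)                       /* "no node" marker (illustrative)       */

      /* Directory entry at the home node: only the first sharer plus state bits. */
      typedef struct {
          int head;                          /* node ID of the first sharer, or NIL   */
          int state;                         /* a few state bits                      */
      } home_entry_t;

      /* Per-block bookkeeping in each sharer's cache: the links of the distributed,
       * doubly-linked sharing list. */
      typedef struct {
          int fwd;                           /* next cache holding a copy, or NIL     */
          int bwd;                           /* previous cache (or the home), or NIL  */
      } cache_link_t;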

  30. A Read Miss Handling
  • The requesting node sends a network transaction to the home memory to find out the identity of the head node
  • If the head pointer is null (no sharers), the home replies with the data
  • If the head pointer is not null, the requester must be added to the list of sharers
    • The home responds to the requestor with the head pointer
    • The requestor sends a message to the head node, asking to be inserted at the head of the list and hence to become the new head node
    • The head pointer at home now points to the requestor
    • The forward pointer of the requestor's cache entry points to the old head node
    • The backward pointer of the old head node points to the requestor
  • The data for the block is provided by the home if it has the latest copy, or by the head node, which always has the latest copy
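
  The pointer updates for inserting the requestor at the head of the sharing list can be sketched in C as below; node IDs index illustrative per-node arrays, and the network transactions between the requestor, the home, and the old head are elided.

      #define NNODES 16                      /* assumed machine size (illustrative)   */
      #define NIL    (-1)

      static int head = NIL;                 /* head pointer at the home (one block)  */
      static int fwd[NNODES];                /* forward pointer per node's cache      */
      static int bwd[NNODES];                /* backward pointer per node's cache     */

      /* Pointer manipulation for a read miss by node 'req': insert it as the new
       * head of the sharing list. In the real protocol each step is a separate
       * network transaction between the requestor, the home, and the old head. */
      void list_insert_at_head(int req)
      {
          int old_head = head;

          fwd[req] = old_head;               /* requestor's forward ptr -> old head   */
          bwd[req] = NIL;                    /* requestor is now the first sharer     */
          if (old_head != NIL)
              bwd[old_head] = req;           /* old head's backward ptr -> requestor  */
          head = req;                        /* head pointer at home -> requestor     */
      }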

  31. A Write Miss Handling
  • The writer obtains the identity of the head node from the home
  • It then inserts itself into the list as the head node
    • If the writer was already in the list as a sharer and is now performing an upgrade, it is deleted from its current position in the list and inserted as the new head
  • The rest of the distributed linked list is traversed node by node via network transactions to find and invalidate the cached copies of the block
    • If a block that is written is shared by three nodes A, B, and C, the home only knows about A, so the writer sends an invalidation message to it; the identity of the next sharer B can only be known once A is reached, and so on
    • Acknowledgements for these invalidations are sent to the writer
  • If the data for the block is needed by the writer, it is provided by either the home or the head node, as appropriate
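
  A matching C sketch of the node-by-node invalidation traversal; send_invalidate() is a hypothetical stand-in for the per-hop invalidation request and its acknowledgement, and fwd[] is assumed to have been built up (NIL-terminated) by list insertions as in the previous sketch.

      #include <stdio.h>

      #define NNODES 16
      #define NIL    (-1)

      static int fwd[NNODES];                /* forward pointers of the sharing list  */

      /* Hypothetical stand-in: invalidate the copy at 'node' and learn the identity
       * of the next sharer, which only 'node' itself knows (its forward pointer). */
      static int send_invalidate(int node)
      {
          printf("invalidate copy at node %d\n", node);
          return fwd[node];
      }

      /* Invalidate every sharer downstream of the writer, one hop at a time;
       * each sharer's acknowledgement (elided here) is sent back to the writer. */
      void invalidate_sharing_list(int first_sharer)
      {
          int cur = first_sharer;
          while (cur != NIL)
              cur = send_invalidate(cur);
      }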

  32. Pros and Cons of Flat, Cache-based Schemes
  • Cons
    • Latency: the number of messages per invalidating write is in the critical path (one hop per sharer)
  • Pros
    • The directory overhead is small
    • The linked list records the order in which accesses were made to memory for the block, thus making it easier to provide fairness in a protocol
    • Sending invalidations is not centralized at the home but rather distributed among the sharers, thus reducing the bandwidth demands placed on a particularly busy home CA
  (CA: Communication Assist)

  33. IEEE 1596-1992 SCI
  • Scalable Coherent Interface (SCI)
    • Manipulating insertion into and deletion from distributed linked lists can lead to complex protocol implementations
    • These complexity issues have been greatly alleviated by the formalization and publication of a standard for a cache-based directory organization and protocol: IEEE 1596-1992 Scalable Coherent Interface (SCI)
  • Several commercial machines use this protocol
    • Sequent NUMA-Q, 1996
    • Convex Exemplar, Convex Computer Corp., 1993
    • Data General, 1996

  34. Correctness Issues
  • Read Section 8.4 in Culler's book
    • Serialization to a location for coherence
    • Serialization across locations for sequential consistency
    • Deadlock, livelock, and starvation

  35. Example: Write Atomicity
  • A common solution for write atomicity in an invalidation-based scheme is for the current owner of a block (the main memory or the processor holding the dirty copy in its cache) to provide the appearance of atomicity by not allowing access to the new value by any process until all invalidation acknowledgements for the write have returned
