
Multiprocessor and Thread-Level Parallelism-II Chapter 4

Presentation Transcript


  1. Multiprocessor and Thread-Level Parallelism-II Chapter 4 Dr. Anilkumar K.G

  2. Cache Coherence • In computing, cache coherence (also cache coherency) refers to the consistency of shared data stored in the local caches of processors in a multiprocessing system • Cache coherence is a special case of memory coherence • When clients in a system maintain caches of a common memory resource, problems may arise with inconsistent data • Different processors may access values at the same memory location • If an update made by one processor at time t only becomes visible to the other processors at some time t + x, where x is unknown, we have the cache coherence problem Dr. Anilkumar K.G
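To make the problem concrete, here is a minimal sketch in Python with no coherence mechanism at all; the class and variable names are illustrative only. Two processors keep private copies of a shared location X, so P2 can still read the old value after P1's write.

```python
# A minimal sketch (not any real protocol) of the coherence problem described above:
# two processors keep private copies of location X, so P2 can read a stale value
# after P1 writes. All names here are hypothetical.

memory = {"X": 0}

class ProcessorCache:
    """A private cache with no coherence mechanism at all."""
    def __init__(self, name):
        self.name = name
        self.lines = {}          # address -> locally cached value

    def read(self, addr):
        if addr not in self.lines:            # cold miss: fetch from memory
            self.lines[addr] = memory[addr]
        return self.lines[addr]

    def write(self, addr, value):
        self.lines[addr] = value              # write hits only the local copy
        memory[addr] = value                  # write-through to memory,
                                              # but other caches are never told

p1, p2 = ProcessorCache("P1"), ProcessorCache("P2")
p2.read("X")          # P2 caches X = 0
p1.write("X", 42)     # P1 updates X at time t
print(p2.read("X"))   # P2 still sees 0 at time t + x: the coherence problem
```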

  3. Cache Coherence Problem • Cache coherence problems can arise in shared-memory multiprocessors when more than one processor cache holds a copy of a data item • Upon a write (update) of this shared data in one cache, the other processor caches must update or invalidate their copies; otherwise the caches are left in an incoherent state • The writing processor gains exclusive access to the cache line and completes its writes into the cache line without generating external traffic • This leads to a cache coherence problem if this dirty (written) cache block later modifies memory without informing the other processors that hold the same block of data in their caches Dr. Anilkumar K.G

  4. Cache Coherence Problem • Referring to the "Multiple Caches of Shared Resource" figure 4.a, if the top client has a copy of a memory block from a previous read and the bottom client changes that memory block, the top client could be left with a stale (invalid) copy in its cache without any notification of the change • The cache coherence condition exists to manage such conflicts and maintain consistency between the caches and memory Dr. Anilkumar K.G

  5. Cache Coherence Figure 4.a Dr. Anilkumar K.G

  6. Two Types of Cache Coherence Protocols • Directory based • Snooping Dr. Anilkumar K.G

  7. Directory Based Cache Coherence Protocol • In a directory based cache coherence protocol, the directory relieves the processor caches of broadcasting on every memory request by keeping track of which caches hold each memory block • The directory tracks which processors have cached a block of memory • The directory contains information for every cache block in the system • The directory acts as a filter through which a processor must get permission to load a data entry (data block) from memory into its cache • Memory block transfers happen under the control of the directory Dr. Anilkumar K.G

  8. Directory Based Cache Coherence Protocol • A simple directory structure is shown in figure 4.b; it has one directory entry per block of memory • Each directory entry contains one presence bit per processor cache • A state bit indicates whether the block is uncached, shared, or held exclusively by one cache Dr. Anilkumar K.G
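The following is a small sketch of such a directory entry, assuming only the layout just described (one presence bit per processor cache plus a state field); the `DirectoryEntry` and `BlockState` names are illustrative, not taken from any particular machine.

```python
from dataclasses import dataclass, field
from enum import Enum

class BlockState(Enum):        # the three directory states named on the slide
    UNCACHED = "uncached"
    SHARED = "shared"
    EXCLUSIVE = "exclusive"

@dataclass
class DirectoryEntry:
    """One entry per memory block: a presence bit per processor cache
    plus a state field, as in figure 4.b."""
    num_processors: int
    state: BlockState = BlockState.UNCACHED
    presence: list = field(default_factory=list)   # presence[i] is True if cache i holds the block

    def __post_init__(self):
        self.presence = [False] * self.num_processors

    def sharers(self):
        return [i for i, bit in enumerate(self.presence) if bit]

# Example: a 4-processor system in which caches 0 and 2 currently hold the block.
entry = DirectoryEntry(num_processors=4)
entry.state = BlockState.SHARED
entry.presence[0] = entry.presence[2] = True
print(entry.state.value, entry.sharers())   # shared [0, 2]
```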

  9. Directory Based Cache Coherence Protocol (Figure 4.b) Dr. Anilkumar K.G

  10. Directory Based Cache Coherence Protocols • The directory indicates whether a memory block is up to date in memory, or which cache holds the current copy of the block • When a cache miss occurs, the local node (the processor that issues the request for the block) sends a request over the network to the home node (the node whose main memory holds the block) • On a write miss, the directory identifies the copies of the block, and an invalidation or update network transaction is sent to those copies (to the caches that hold the same data block) • Since the invalidations or updates travel to the multiple copies over disjoint paths in the network, determining when the write has completed (committed) becomes harder Dr. Anilkumar K.G
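A hedged sketch of how a home node's directory might service a write miss under the simplified assumptions above: it looks up the sharers, sends them invalidations, and grants the requester exclusive ownership. The `Directory` class and the message names are hypothetical, not from any real machine.

```python
class Directory:
    def __init__(self, num_processors, num_blocks):
        # per-block state: ["uncached" | "shared" | "exclusive", set of sharer ids]
        self.entries = {b: ["uncached", set()] for b in range(num_blocks)}

    def write_miss(self, block, requester):
        state, sharers = self.entries[block]
        messages = []
        if state == "shared":
            # invalidate every copy except the requester's
            messages = [("invalidate", block, p) for p in sharers if p != requester]
        elif state == "exclusive":
            # ask the current owner to write back / hand over the block
            owner = next(iter(sharers))
            messages = [("fetch_and_invalidate", block, owner)]
        self.entries[block] = ["exclusive", {requester}]
        return messages

d = Directory(num_processors=4, num_blocks=8)
d.entries[3] = ["shared", {0, 2}]
print(d.write_miss(block=3, requester=2))   # [('invalidate', 3, 0)]
```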

  11. Directory Based Cache Coherence Protocol • The advantage of directories is that they keep track of exactly which nodes have copies of a memory block, eliminating the need for a broadcast and causing less network usage • On a read miss, the request for a block will either be satisfied by main memory, or the directory will tell the requester where to go to retrieve the missed block • On a write miss, the needed block can be located through the directory, and it is also possible to see whether the block is shared or not • A directory based coherence protocol has a higher implementation overhead than a snooping protocol, but it can scale to a large number of processing nodes (microprocessors + caches + memories) Dr. Anilkumar K.G

  12. Snooping Cache Coherence Protocol • A snoopy cache coherence protocol relies on all caches monitoring a common or global bus that connects all processors to a shared memory (Fig. 4.c) • In this protocol, the bus plays an important role: each device on the bus can observe every bus transaction (called snooping!) • When a processor issues a request to its own cache, the cache controller examines the state of the cache and takes suitable action, • which may include generating bus transactions to access memory • Cache coherence is maintained by having all cache controllers “snoop” on the bus and monitor the bus transactions Dr. Anilkumar K.G

  13. Snooping Cache Coherence Protocol (Figure 4.c) Dr. Anilkumar K.G

  14. Snooping Cache Coherence Protocol • A snooping cache controller takes action if a bus transaction is relevant to it – that is, if it involves a memory block of which it has a copy in its cache • There is no centralized state in a snooping protocol (each cache snoops for itself) • Thus, P1 may take action (in Figure 4.c), such as invalidating or updating its copy, when it sees a write from P3 • The key properties of a bus that supports a coherence protocol are the following: • First, all transactions that appear on the bus are visible to all cache controllers • Second, transactions are visible to all controllers in the same order (the bus order) Dr. Anilkumar K.G

  15. Snooping Cache Coherence Protocols • In a snooping cache system, every cache controller observes every write on the bus. If a cache has a copy of the block, it either invalidates or updates its copy • Protocols that invalidate copies on a write are commonly referred to as invalidation-based protocols, whereas protocols that update the other cached copies are called update-based protocols • In either case, the next time the processor with the copy accesses the block, it will see the most recent value, either through a miss or because the updated value is already available in its cache Dr. Anilkumar K.G

  16. Snooping Cache Coherence Protocol • While a snoopy cache-coherence protocol obtains data quickly, it consumes a lot of bus BW due to the broadcast nature of its requests • As a result, snoopy protocols are generally limited to small-scale multiprocessor systems (Why?) • Snooping protocols became popular with multiprocessors that use microprocessors and caches attached to a single shared memory (UMA or symmetric memory systems) • For example, PCs Dr. Anilkumar K.G

  17. Cache Write Through Scheme • With a write through cache scheme, every processor write updates the local cache and also generates a global bus write that: • Updates main memory • Invalidates/updates all other caches holding that item • Advantage: • Simple to implement • Disadvantages: • Since about 15% of references are writes, this cache scheme consumes tremendous bus BW • Thus only a few processors can be supported Dr. Anilkumar K.G
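The bandwidth concern can be made concrete with a rough back-of-envelope calculation; every number below except the 15% write fraction quoted on the slide is an assumption chosen only to show the shape of the argument.

```python
# Back-of-envelope illustration of the bandwidth point above. All rates and sizes
# here are assumptions for illustration.

refs_per_sec_per_cpu = 200e6      # assumed memory references per second per processor
write_fraction = 0.15             # "15% of references are writes" (from the slide)
bytes_per_bus_write = 8           # assumed data payload per write-through bus transaction
bus_bandwidth = 1e9               # assumed bus capacity: 1 GB/s

traffic_per_cpu = refs_per_sec_per_cpu * write_fraction * bytes_per_bus_write
max_cpus = bus_bandwidth / traffic_per_cpu
print(f"{traffic_per_cpu/1e6:.0f} MB/s of write traffic per CPU "
      f"-> only about {max_cpus:.1f} processors fit on the bus")
# ~240 MB/s per CPU -> roughly 4 processors, before counting any read-miss traffic
```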

  18. Cache Write Back (ownership) Scheme • Under a write back cache scheme, once a single cache has ownership of a block, processor writes go only to that cache and so do not cause any memory or bus BW traffic • Later, the dirty cache block is moved to main memory or a victim cache upon request • Most bus-based multiprocessors (with a snoopy protocol) nowadays use the cache write back scheme! • Why? Dr. Anilkumar K.G

  19. Invalidation and Update Strategies • With invalidation on a write, all other caches with a copy of the shared block are invalidated (except the cache with exclusive write access!) • With an update on a write (by broadcasting an update message), all other caches with a copy of the memory block are updated - the update-based protocol • How is update possible? • Update is wasteful when a processor performs multiple writes before the data is read by another processor, since only the final value is actually needed • Overall, invalidation schemes are the more popular default Dr. Anilkumar K.G

  20. Write Invalidate in Snooping Protocol • In a snooping protocol, there are two ways to maintain cache coherence: • One way is to ensure that a processor has exclusive access to a shared data item before it writes that item into its cache • Exclusive access ensures that no other readable or writable copies of the item exist when the write occurs (during the write, all other cached copies of the item are invalidated by the protocol) • This style of protocol is called a write invalidate protocol because it invalidates other copies on a write • It is the most common protocol, both for snooping and directory schemes • Figure 4.4 shows an example of an invalidation protocol for a snooping bus with write-back caches Dr. Anilkumar K.G

  21. (Figure 4.4) Dr. Anilkumar K.G

  22. Write Invalidate in Snooping Protocol • As per Figure 4.4, to see how the write invalidation protocol ensures coherence, consider a write followed by a read by another processor: • Since the write requires exclusive access, any copy held by a reading processor must be invalidated (by the invalidation protocol) • Thus, when the read occurs, it misses in the cache (because its data was invalidated!) and is forced to fetch a new copy of the data (reading the data block from memory after the update), as shown in Fig 4.4 • For a write, it is required that the writing processor have exclusive access to the shared block (preventing any other processor from being able to write simultaneously) • If two processors do attempt to write the same data simultaneously, one of them wins the race, causing the other processor’s copy to be invalidated • For the other processor to complete its write, it must obtain a new (updated) copy of the data from memory • Therefore, the write invalidate protocol enforces write serialization (How?) Dr. Anilkumar K.G

  23. Implementation Techniques of Invalidation Protocol (in Snoopy case) • The key to implementing an invalidation protocol in a small-scale multiprocessor is the use of the bus, or another broadcast medium, to perform invalidates: • The processor simply acquires bus access and broadcasts the block address (tag) to be invalidated on the bus • All processors continuously snoop on the bus, watching the addresses (of the data that is going to be invalidated) • Each processor checks whether the address on the bus is in its cache • If so, the corresponding data in the cache is invalidated (by clearing its valid control bit) • On a write to a shared block, the writing processor must acquire bus access to broadcast the block invalidation – Important! Dr. Anilkumar K.G
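A minimal sketch of this broadcast-and-snoop step, with illustrative `Bus` and `SnoopingCache` classes; a real controller operates on tags and cache lines in hardware, not on a Python dictionary.

```python
class Bus:
    def __init__(self):
        self.caches = []

    def broadcast_invalidate(self, writer, block_addr):
        # every controller snoops the address; all copies except the writer's are killed
        for cache in self.caches:
            if cache is not writer:
                cache.snoop_invalidate(block_addr)

class SnoopingCache:
    def __init__(self, bus):
        self.valid = {}           # block_addr -> valid bit
        self.bus = bus
        bus.caches.append(self)

    def write(self, block_addr, shared):
        if shared:
            # must win the bus before the write can complete (write serialization)
            self.bus.broadcast_invalidate(self, block_addr)
        self.valid[block_addr] = True

    def snoop_invalidate(self, block_addr):
        if block_addr in self.valid:
            self.valid[block_addr] = False    # clear the valid bit, as on the slide

bus = Bus()
c0, c1 = SnoopingCache(bus), SnoopingCache(bus)
c0.valid[0x40] = c1.valid[0x40] = True        # both caches share block 0x40
c0.write(0x40, shared=True)
print(c0.valid[0x40], c1.valid[0x40])         # True False
```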

  24. Implementation Techniques of Invalidation Protocol (in Snoopy case) • If two processors attempt to write shared blocks at the same time, their attempts to broadcast an invalidate operation are serialized when they arbitrate for the bus (i.e., the order depends on who gets bus access) • The first processor to obtain bus access will cause any other copies of the block it is writing to be invalidated • If the second processor was attempting to write the same block, the serialization enforced by the bus also serializes their writes • One implication of this scheme is that a write to a shared data item cannot actually complete until the processor obtains bus access • Gaining the bus before a write operation is therefore important (Why?) Dr. Anilkumar K.G

  25. Implementation Techniques of Invalidation Protocol (in Snoopy case) • In addition to invalidating outstanding copies of a cache block on a write, we also need to locate a data item on a cache miss • In a write-through cache, it is easy to find the most recent value of a data item, since all written data are available in memory • the most recent value of a data item can always be fetched from main memory • What about a write-back cache system? • The updated block must be written back to memory or to a victim cache! Dr. Anilkumar K.G

  26. Implementation Techniques of Invalidation Protocol (in Snoopy case) • For a write-back cache, the problem of finding the most recent data value is harder • since the most recent value of a data item can be in a cache rather than in memory • Write-back caches can use the same snooping scheme both for cache misses and for writes • If a processor finds that it has a dirty copy of the requested cache block, it can provide that cache block directly in response to the read request (similar to the use of a victim cache!) • Then there is no need to access memory for that block (an advantage!) Dr. Anilkumar K.G

  27. Implementation Techniques of Invalidation Protocol (in Snoopy case) • The additional complexity comes from having to retrieve the cache block from another processor’s cache, which can take longer than retrieving it from the shared memory if the processors are on separate chips • Write-back caches generate lower memory BW requirements, hence they can support larger numbers of processors Dr. Anilkumar K.G

  28. Implementation Techniques of Invalidation Protocol (in Snoopy case) • The cache tags can be used to implement the snooping process • The valid bit (V) of each block makes invalidation easy • Read misses, whether caused by an invalidation or by some other event, are straightforward to handle since they rely on the normal snooping capability • For writes, we check whether any other copies of the block are cached (shared by other CPUs, indicated by a share bit); if there are no other cached (shared) copies, then the write need not be broadcast on the bus in a write-back cache - the write can proceed immediately! • That is, the write is possible without passing through the bus (network) • This reduces both the time taken by the write and the required BW Dr. Anilkumar K.G

  29. Implementation Techniques of Invalidation Protocol (in Snoopy case) • To track whether or not a cache block is shared, we can add an extra state bit associated with each cache block • Alongside the valid bit (V) and the dirty bit (D), we add a shared bit (S) indicating whether the block in the cache is shared • When a write to a block in the shared state occurs, the cache’s processor generates an invalidation message on the bus and marks its block as exclusive (by setting S = 0) • All caches that share the block snoop the invalidation message broadcast by the owner of the block • The processor with the exclusive copy of a cache block for a write is called the owner of the cache block • The owner is the one that gains exclusive access prior to the write Dr. Anilkumar K.G
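A small sketch of these per-block tag bits and of the shared-to-exclusive transition a writing cache performs; the field and function names are illustrative.

```python
from dataclasses import dataclass

@dataclass
class CacheLineTags:
    """The per-block tag bits discussed above: valid (V), dirty (D) and the added shared (S) bit."""
    tag: int
    valid: bool = False    # V
    dirty: bool = False    # D
    shared: bool = True    # S: block may also live in other caches

def local_write(line: CacheLineTags, send_invalidate) -> None:
    """What the writing (owner) cache does to its own copy on a write."""
    if line.shared:
        send_invalidate(line.tag)   # broadcast the invalidation on the bus
        line.shared = False         # S = 0: the block is now exclusive here
    line.valid = True
    line.dirty = True               # write-back cache: memory is now stale

line = CacheLineTags(tag=0x1A0, valid=True)
local_write(line, send_invalidate=lambda tag: print(f"invalidate {tag:#x} on bus"))
print(line)   # shared=False, dirty=True: this cache is now the owner
```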

  30. Implementation Techniques of Invalidation Protocol (in Snoopy case) • When an invalidation is sent to the other processors, the state of the owner’s cache block is changed from shared to unshared (or exclusive) by changing the ‘S’ bit in the cache HW • If another processor later requests this (updated) cache block, the state must be made shared again • The snooping cache knows when the exclusive cache block has been requested by another processor • And the state is then made shared again Dr. Anilkumar K.G

  31. Implementation Techniques of Invalidation Protocol (in Snoopy case) • Every bus transaction must check the cache block address (tag), which would normally interfere with processor-cache accesses • One way to reduce this interference is to duplicate the cache tags • The interference can also be reduced in a multilevel cache by directing the snoop requests to the L2 cache • For this scheme to work, every entry in the L1 cache must also be present in the L2 cache, a property called the inclusion property • If a snoop request gets a hit in the L2 cache, the L2 must arbitrate for the L1 cache to update the state and possibly retrieve the data Dr. Anilkumar K.G

  32. An Example Snooping Coherence Protocol • A snooping coherence protocol is usually implemented by incorporating a finite-state controller in each node • This controller responds to requests from the processor and from the bus, changing the state of the selected cache block, as well as using the bus to access data or to invalidate it • Logically, imagine a separate controller being associated with each block; that is, snooping operations or cache requests for different blocks can proceed independently • In actual implementations, a single controller allows multiple operations to distinct blocks to proceed in interleaved fashion Dr. Anilkumar K.G

  33. An Example Snooping Coherence Protocol • The simple snooping coherence protocol has three states: • Invalid, • Shared and • Modified (exclusive) Dr. Anilkumar K.G

  34. An Example Snooping Coherence Protocol • The shared state indicates that the cache block is potentially shared (available to other caches) • The modified state implies that the block is exclusive (S = 0, meaning not shared!) • Figure 4.5 shows the requests generated by the processor-cache module in a node (in the top half of the table) as well as those coming from the bus (in the bottom half of the table) • The snooping protocol is for a write-back cache but is easily changed to work for a write-through cache • by reinterpreting the modified state as an exclusive state and updating the cache on writes in the normal fashion for a write-through cache Dr. Anilkumar K.G
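A hedged sketch of the processor-side transitions of this three-state (Invalid / Shared / Modified) protocol, written as a simple lookup table; it is a simplification in the spirit of Figures 4.5 and 4.6, not a faithful copy of them, and the bus-side (snooped) transitions are omitted.

```python
def cpu_event(state, event):
    """Return (next_state, bus_action) for one cache block on a CPU-side read or write."""
    table = {
        ("Invalid",  "read"):  ("Shared",   "read miss on bus"),
        ("Invalid",  "write"): ("Modified", "write miss on bus"),
        ("Shared",   "read"):  ("Shared",   None),
        ("Shared",   "write"): ("Modified", "invalidate on bus"),
        ("Modified", "read"):  ("Modified", None),
        ("Modified", "write"): ("Modified", None),
    }
    return table[(state, event)]

state = "Invalid"
for ev in ["read", "write", "read"]:
    state, action = cpu_event(state, ev)
    print(f"{ev:5s} -> {state:8s} bus: {action}")
# read  -> Shared   bus: read miss on bus
# write -> Modified bus: invalidate on bus
# read  -> Modified bus: None
```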

  35. An Example Snooping Coherence Protocol • The most common extension of this protocol is the addition of an exclusive state, which describes a block that is unmodified (clean) but held in only one cache • When an invalidate or a write miss (issued when exclusive access is needed) is placed on the bus, any processors with copies of the cache block invalidate it (by clearing the V bit) • For a write-through cache, the data for a write miss can always be retrieved from memory • For a write miss in a write-back cache, if the block is held modified in exactly one cache, that cache writes the block back to memory; otherwise the data can be read from memory • Figure 4.6 shows a finite-state transition diagram for a single cache block using a write invalidation protocol and a write-back cache Dr. Anilkumar K.G

  36. (Figure 4.6) Dr. Anilkumar K.G

  37. An Example Snooping Coherence Protocol • The state in each node of the diagram represents the state of the selected cache block for the processor or bus request specified • All the states in this cache protocol would also be needed in a uniprocessor cache, where they would correspond to the invalid, valid, and dirty states • Figure 4.7 shows how the state transitions in the right half of Figure 4.6 are combined with those in the left half of the figure to form a single state diagram for each cache block Dr. Anilkumar K.G

  38. (Figure 4.7) Dr. Anilkumar K.G

  39. An Example Snooping Coherence Protocol • To understand this protocol, observe that any valid cache block is either in the shared state in one or more caches or in the exclusive state in exactly one cache • Any transition to the exclusive state (which is required for a processor to write to the block) requires an invalidate or write miss to be placed on the bus, causing all other caches to make their copies of the block invalid • Finally, if a read miss occurs on the bus to a block in the exclusive state, the cache with the exclusive copy changes its state from exclusive to shared Dr. Anilkumar K.G

  40. An Example Snooping Coherence Protocol • The snooping protocol assumes that operations are atomic, meaning that an operation completes in such a way that no intervening operations can occur • For example, the protocol described assumes that detecting a write miss, acquiring the bus, and receiving a response occur as a single atomic action • Non-atomic actions introduce the possibility that the protocol can deadlock • Meaning that it reaches a state from which it cannot continue Dr. Anilkumar K.G

  41. Constructing a Small-scale Multiprocessor • Constructing a small-scale (2 or 4 processor) multiprocessor has become easy • For example, the Intel Pentium 4 and Xeon and the AMD Opteron processors are designed for use in cache-coherent multiprocessors and have an external interface that supports snooping and allows two to four processors to be directly connected • They also have larger on-chip caches to reduce bus utilization Dr. Anilkumar K.G

  42. Limitations in Symmetric Shared-memory Multiprocessors and Snooping Protocols • As the number of processors in a multiprocessor grows, or as the memory demands of each processor grow, any centralized resource in the system can become a bottleneck • In a bus-based (snoopy) multiprocessor, the bus must support both the coherence traffic and the normal memory traffic arising from the caches • Likewise, if there is a single memory unit, it must accommodate all processor requests • How can a designer increase the memory BW to support faster processors? Dr. Anilkumar K.G

  43. Limitations in Symmetric Shared-memory Multiprocessors and Snooping Protocols • To increase the communication BW between processors and memory, designers have used multiple buses as well as interconnection networks such as crossbars • In such designs, the memory system can be configured into multiple physical banks, so as to boost the effective memory BW while retaining uniform access time to memory • Figure 4.8 shows a multiprocessor with uniform memory access using an interconnection network rather than a bus to access a multi-bank memory Dr. Anilkumar K.G

  44. (Figure 4.8) Dr. Anilkumar K.G

  45. Performance of Symmetric Shared-memory Multiprocessors • Shared or coherence misses can be broken into two separate sources: • The first source is the so-called true sharing misses, which arise from the communication of data through the cache coherence mechanism • In an invalidation-based protocol, the first write by a processor to a shared cache block causes an invalidation to establish ownership of that block; this invalidation produces a miss (an invalidation write miss!) • When another processor later attempts to read a modified word in that cache block, a miss occurs (an invalidation read miss!) • Both these write and read misses are classified as true sharing misses since they arise directly from the sharing of data among processors Dr. Anilkumar K.G

  46. Performance of Symmetric Shared-memory Multiprocessors • The second effect, called a false sharing miss, arises from the use of an invalidation-based coherence algorithm with a single valid bit per cache block • False sharing occurs when a block is invalidated because some word in the block, other than the one being read, is written into • If the word written into is actually used by the processor that received the invalidate, then the reference was a true sharing reference and would have caused a miss in any case • If the word being written and the word being read are different, and the invalidation does not cause a new value to be communicated but only causes an extra cache miss, then it is a false sharing miss Dr. Anilkumar K.G

  47. Performance of Symmetric Shared-memory Multiprocessors • Assume that words x1 and x2 are in the same cache block, which is in the shared state in the caches of both P1 and P2. Assuming the following sequence of events, identify each miss as a true sharing miss, a false sharing miss or a hit. Any miss that would occur if the block size were one word is designated a true sharing miss. Dr. Anilkumar K.G
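As a worked illustration of the definitions above (this sequence is illustrative only and is not necessarily the one tabulated in the figure on the next slide): if P1 writes x1, the miss is a true sharing miss, because the block must be invalidated in P2 and the written word itself is the one being shared; if P2 then reads x2, the resulting miss is a false sharing miss, because x2 was not modified and the block was invalidated only by P1's write to the other word x1; if P2 instead accesses x2 while its copy of the block is still valid, the access is simply a hit.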

  48. Performance of Symmetric Shared-memory Multiprocessors Dr. Anilkumar K.G

  49. Distributed Shared Memory and Directory-Based Coherence • A snooping protocol requires communication with all caches on every cache miss, including writes of potentially shared data • The absence of any centralized data structure that tracks the state of the caches is the fundamental advantage of a snooping-based scheme – it is what makes snooping inexpensive • The alternative to a snoop-based coherence protocol is a directory protocol • A directory protocol keeps the state of every block that may be cached • Information in the directory includes which caches have copies of the block, whether it is dirty, and so on • A directory protocol can also be used to reduce the BW demands in a centralized shared-memory machine Dr. Anilkumar K.G

  50. Distributed Shared Memory and Directory-Based Coherence • The simplest directory implementations associate an entry in the directory with each memory block • In such implementations, the amount of directory information is proportional to the product of the number of memory blocks and the number of processors • This overhead is not a problem for multiprocessors with fewer than about 200 processors, because the directory overhead with a reasonable block size will be tolerable • To prevent the directory from becoming a bottleneck, the directory is distributed along with the memory • So that different directory accesses can go to different directories, just as different memory requests go to different memories Dr. Anilkumar K.G
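To see what "proportional to the product of the number of memory blocks and the number of processors" means in practice, here is a quick calculation; the block size and processor counts are assumptions chosen only for illustration.

```python
# Directory storage overhead as a fraction of memory, for a bit-vector directory
# with one presence bit per processor plus a state bit per block. The 64-byte
# block size and the processor counts are assumptions.

block_bytes = 64                         # assumed cache block size
for processors in (16, 64, 200):
    bits_per_entry = processors + 1      # one presence bit per processor + a state bit
    overhead = bits_per_entry / (block_bytes * 8)
    print(f"{processors:3d} processors: directory adds {overhead:.1%} on top of memory")
# 16 processors: ~3.3%, 64: ~12.7%, 200: ~39.3% -> the overhead grows linearly with
# the processor count; a larger block size shrinks the fraction correspondingly.
```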
