ccNUMA: Cache Coherent Non-Uniform Memory Access
Chris Coughlin
MSCS521, Prof. Ten Eyck
Spring 2004
Let’s First Talk About Computer Architectures
In 1966, Michael Flynn proposed a classification for computer architectures based on the number of instruction streams and data streams (Flynn’s Taxonomy).
SISD (Single Instruction Stream, Single Data Stream)
• A single-processor computer (uniprocessor) in which a single stream of instructions is generated from the program.
SIMD (Single Instruction Stream, Multiple Data Stream)
• Each instruction is executed on a different set of data by different processors. (Used for vector and array processing.)
MISD (Multiple Instruction Stream, Single Data Stream)
• Each processor executes a different sequence of instructions on the same data stream.
• Has never been commercially implemented.
MIMD (Multiple Instruction Stream, Multiple Data Stream)
• Each processor has a separate program.
• An instruction stream is generated from each program.
• Each instruction operates on different data.
Multiprocessors
• The idea behind multiprocessors is to create powerful computers by connecting many smaller ones.
• Computational speed is increased by using multiple processors operating together on a single problem.
• A parallel processing program is a single program that runs on multiple processors simultaneously.
• The overall problem is split into parts, each of which is performed by a separate processor in parallel.
• In addition to a faster solution, parallel processing may also yield a more precise solution, since more computation fits into the same amount of time.
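The "split into parts" idea can be sketched in a few lines of Python. This is only an illustration of the decomposition: the names `split` and `parallel_sum` are made up for this example, and because it uses threads in CPython it demonstrates the structure of a parallel program rather than a real speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def split(data, parts):
    """Divide data into `parts` nearly equal contiguous chunks."""
    size, extra = divmod(len(data), parts)
    chunks, start = [], 0
    for i in range(parts):
        # The first `extra` chunks each take one additional element.
        end = start + size + (1 if i < extra else 0)
        chunks.append(data[start:end])
        start = end
    return chunks


def parallel_sum(data, workers=4):
    """Sum each chunk on its own worker, then combine the partial sums."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        partials = list(pool.map(sum, split(data, workers)))
    return sum(partials)
```

Each worker handles one part of the problem, and the final `sum(partials)` is the step that combines the partial results into a single answer.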
MIMD Systems
Shared Memory Multiprocessor System
• Multiple processors are connected to multiple memory modules such that each processor can access any other processor’s memory module. This multiprocessor employs a shared address space (also known as a single address space).
• Communication is implicit with loads and stores: there is no explicit recipient of a shared memory access.
• Processors may communicate without necessarily being aware of one another.
• A single image of the operating system runs across all the processors.
MIMD Systems (cont.)
Multicomputer
• A term for parallel processors with separate, private address spaces (not accessible by the other processors in the system).
• Processors communicate by message passing: the messages carry data from one processor to another as dictated by the program.
• The nodes are complete computers, each consisting of a processor and local memory, connected through an interconnection network (e.g. a LAN).
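As a rough sketch of the message-passing style, the example below models a node whose data is private and can only be reached through explicit messages. The queues stand in for the interconnection network; `worker` and `run_message_passing` are hypothetical names, and a real multicomputer would use separate machines rather than threads.

```python
import queue
import threading


def worker(inbox, outbox):
    """A node with private memory: it sees only what arrives in its inbox."""
    private_total = 0
    while True:
        msg = inbox.get()
        if msg is None:          # sentinel value: the sender is done
            break
        private_total += msg
    outbox.put(private_total)    # explicit send of the result


def run_message_passing(values):
    """Send each value to the worker node and wait for its reply."""
    to_worker, from_worker = queue.Queue(), queue.Queue()
    node = threading.Thread(target=worker, args=(to_worker, from_worker))
    node.start()
    for v in values:
        to_worker.put(v)         # explicit recipient, unlike a shared store
    to_worker.put(None)
    result = from_worker.get()
    node.join()
    return result
```

Compare this with the shared memory model, where the sender would simply store to an address and never name a recipient.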
Computer Architecture Classifications
Processor Organizations
  SISD (Single Instruction, Single Data Stream)
    Uniprocessor
  SIMD (Single Instruction, Multiple Data Stream)
    Vector Processor
    Array Processor
  MISD (Multiple Instruction, Single Data Stream)
  MIMD (Multiple Instruction, Multiple Data Stream)
    Shared Memory (tightly coupled)
    Multicomputer (loosely coupled)
Note: We will expand on this later.
Back to Shared Memory Multiprocessors
Two styles: UMA and NUMA
UMA (Uniform Memory Access)
• The time to access main memory is the same for all processors since they are equally close to all memory locations.
• Machines that use UMA are called Symmetric Multiprocessors (SMPs).
• In a typical SMP architecture, all memory accesses are posted to the same shared memory bus.
• Contention: as more CPUs are added, competition for access to the bus leads to a decline in performance.
• Thus, scalability is limited to about 32 processors.
Shared Memory Multiprocessors (cont.)
NUMA (Non-Uniform Memory Access)
• Since memory is physically distributed, it is faster for a processor to access its own local memory than non-local memory (memory local to another processor or shared between processors).
• Unlike in SMPs, not all processors are equally close to all memory locations.
• A processor’s own internal computations can be done in its local memory, leading to reduced memory contention.
• Designed to surpass the scalability limits of SMPs.
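The cost of non-local accesses can be estimated with a simple weighted average. The sketch below is a back-of-the-envelope model, not a measurement: the function name and the latencies used in the example (100 ns local, 300 ns remote) are assumed values chosen only for illustration.

```python
def average_access_time(local_ns, remote_ns, local_fraction):
    """Weighted average memory latency on a NUMA machine.

    `local_fraction` is the share of accesses that hit local memory;
    the remainder go to another processor's memory.
    """
    if not 0.0 <= local_fraction <= 1.0:
        raise ValueError("local_fraction must be between 0 and 1")
    return local_fraction * local_ns + (1.0 - local_fraction) * remote_ns


# Assumed latencies: the more accesses stay local, the closer the
# average gets to the local latency.
mostly_local = average_access_time(100, 300, 0.9)
half_remote = average_access_time(100, 300, 0.5)
```

Under these assumed numbers, keeping 90% of accesses local gives an average near 120 ns, while a 50/50 split gives 200 ns, which is why data placement matters on NUMA machines.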
Communication and Connection Options for Multiprocessors
Multiprocessors come in two main configurations: a single-bus connection and a network connection. The choice of the communication model and the physical connection depends largely on the number of processors in the organization. The scalability of NUMA makes it well suited to a network configuration. UMA, however, is best suited to a bus connection.
A Multiprocessor Bus Configuration
The single-bus design is limited in terms of scalability. The largest number of processors in a commercial product using this configuration is 36 (SGI Power Challenge).
A Multiprocessor Network Configuration
The network-connected processor design is very scalable. Since each processor has its own memory, the network connection is only used for communication between processors.
A Quick Look at Cache
• Modern processors use a faster, smaller cache memory to act as a buffer for slower, larger memory.
• Caches exploit the principle of locality in memory accesses.
  Temporal locality: if data is referenced, it will tend to be referenced again soon after.
  Spatial locality: data is more likely to be referenced soon if data near it was just referenced.
• Caches hold recently referenced data, as well as data near the recently referenced data.
• This can lead to performance increases by reducing the need to access main memory on every reference.
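The effect of locality can be seen with a toy simulator. The sketch below assumes a direct-mapped cache with 8 blocks of 4 addresses each; these sizes, the function name `hit_rate`, and the three access patterns are all made up for illustration.

```python
def hit_rate(addresses, num_blocks=8, block_size=4):
    """Simulate a direct-mapped cache; return the fraction of hits.

    `addresses` must be a non-empty sequence of integer addresses.
    """
    slots = [None] * num_blocks      # block number held in each cache slot
    hits = 0
    for addr in addresses:
        block = addr // block_size   # memory block containing addr
        index = block % num_blocks   # the one slot this block can occupy
        if slots[index] == block:
            hits += 1
        else:
            slots[index] = block     # miss: fetch the whole block
    return hits / len(addresses)


sequential = list(range(64))            # spatial locality
repeated = [0, 1, 2, 3] * 16            # temporal locality
strided = [i * 32 for i in range(64)]   # no reuse; all map to slot 0

rates = {
    "sequential": hit_rate(sequential),  # 0.75: 3 of every 4 accesses hit
    "repeated": hit_rate(repeated),      # 63/64: only the first access misses
    "strided": hit_rate(strided),        # 0.0: every access misses
}
```

The sequential pattern benefits because each miss brings in three neighbouring addresses; the strided pattern has neither kind of locality and never hits.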
What is ccNUMA?
• The cc in ccNUMA stands for cache coherent.
• The use of cache memory in modern computer architectures leads to the cache coherence problem.
• The problem can occur when two or more processors reference the same shared data. If one processor modifies its copy of the data, the other processors will have stale copies of the data in their caches.
• Machines that are cache coherent ensure that a processor accessing a memory location receives the most up-to-date version of the data.
• Cache coherence is maintained by software, special-purpose hardware, or both.
• NUMA systems that maintain cache coherence are referred to as ccNUMA machines.
• Since few applications still exist for non-cache-coherent NUMA machines, the terms NUMA and ccNUMA are used interchangeably.
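The stale-copy situation described above can be reproduced with two private caches that have no coherence mechanism at all. This is a deliberately naive model: `IncoherentCache` and `stale_read_demo` are hypothetical names, and the caches are write-through only to keep the example short.

```python
class IncoherentCache:
    """A private write-through cache with no coherence mechanism."""

    def __init__(self, memory):
        self.memory = memory     # shared main memory (a dict)
        self.lines = {}          # this processor's private copies

    def read(self, addr):
        if addr not in self.lines:
            self.lines[addr] = self.memory[addr]   # miss: load from memory
        return self.lines[addr]

    def write(self, addr, value):
        self.lines[addr] = value
        self.memory[addr] = value                  # nobody else is told


def stale_read_demo():
    memory = {"x": 0}
    p0, p1 = IncoherentCache(memory), IncoherentCache(memory)
    p0.read("x")
    p1.read("x")                 # both processors now cache x = 0
    p0.write("x", 42)
    # p1 still returns its old cached copy even though memory holds 42.
    return p0.read("x"), p1.read("x"), memory["x"]
```

`stale_read_demo()` returns `(42, 0, 42)`: the second processor keeps reading 0, which is exactly what a cache coherent machine must prevent.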
Computer Architecture Classifications (revisited)
Processor Organizations
  SISD (Single Instruction, Single Data Stream)
    Uniprocessor
  SIMD (Single Instruction, Multiple Data Stream)
    Vector Processor
    Array Processor
  MISD (Multiple Instruction, Single Data Stream)
  MIMD (Multiple Instruction, Multiple Data Stream)
    Shared Memory (tightly coupled)
      UMA (SMP)
      NUMA
        ccNUMA
    Multicomputer (loosely coupled)
Cache Coherency Protocols
Snooping protocol
• A bus-based method in which cache controllers monitor the bus for activity and update or invalidate cache entries as necessary.
• Two types:
  Write-invalidate: the writing processor sends an invalidation signal over the bus. All other caches check to see if they have a copy of the cache block. If they do, the block containing the data gets invalidated. The writing processor then changes its local copy.
  Write-update: the writing processor broadcasts the new data over the bus and all copies are updated with the new value.
• Most commercial machines use write-invalidate to preserve bandwidth.
• Write-update has the advantage of making the new values appear in the caches sooner.
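The two snooping variants can be compared with a small simulation. This is a simplified sketch under stated assumptions: the caches are write-through, each "block" is a single address, and the class and function names are invented for this example. Real protocols track per-block states and usually use write-back caches.

```python
class SnoopingBus:
    """Shared bus that every cache controller monitors (snoops)."""

    def __init__(self, policy):
        if policy not in ("invalidate", "update"):
            raise ValueError(policy)
        self.policy = policy
        self.memory = {}
        self.caches = []

    def broadcast_write(self, writer, addr, value):
        self.memory[addr] = value              # write-through, for simplicity
        for cache in self.caches:
            if cache is writer or addr not in cache.lines:
                continue
            if self.policy == "invalidate":
                del cache.lines[addr]          # drop the stale copy
            else:
                cache.lines[addr] = value      # refresh the copy in place


class SnoopingCache:
    def __init__(self, bus):
        self.bus = bus
        self.lines = {}
        self.misses = 0
        bus.caches.append(self)

    def read(self, addr):
        if addr not in self.lines:
            self.misses += 1
            self.lines[addr] = self.bus.memory.get(addr, 0)
        return self.lines[addr]

    def write(self, addr, value):
        self.bus.broadcast_write(self, addr, value)
        self.lines[addr] = value


def run_snooping(policy):
    """Return the value p1 reads after p0's write, and p1's miss count."""
    bus = SnoopingBus(policy)
    p0, p1 = SnoopingCache(bus), SnoopingCache(bus)
    p0.read("x")
    p1.read("x")                 # both caches now hold x
    p0.write("x", 7)
    return p1.read("x"), p1.misses
```

In this model both policies let `p1` read the new value 7, but under write-invalidate `p1` takes a second miss to refetch it, while under write-update the new value is already in its cache.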
Cache Coherency Protocols (cont.)
Directory-based protocol
• A central directory maintains the information about which memory locations are being shared in multiple caches and which are contained in just one processor’s cache.
• On any memory access, it knows the caches that need to be updated or invalidated.
• It is used by all software-based implementations of shared memory.
• It is a scalable scheme that is suitable for a network configuration.
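A minimal sketch of the directory idea follows, assuming a single central directory, an invalidate-on-write policy, and write-through caches. The class names and the `messages` counter are illustrative only; they are not taken from any particular machine.

```python
class Directory:
    """Tracks, per address, which caches hold a copy (the sharer set)."""

    def __init__(self):
        self.memory = {}
        self.sharers = {}        # addr -> set of caches holding a copy
        self.messages = 0        # point-to-point invalidations sent

    def read(self, cache, addr):
        self.sharers.setdefault(addr, set()).add(cache)
        return self.memory.get(addr, 0)

    def write(self, cache, addr, value):
        # Only the caches recorded as sharers are contacted; no broadcast.
        for other in self.sharers.get(addr, set()) - {cache}:
            other.lines.pop(addr, None)
            self.messages += 1
        self.sharers[addr] = {cache}
        self.memory[addr] = value


class DirectoryCache:
    def __init__(self, directory):
        self.directory = directory
        self.lines = {}

    def read(self, addr):
        if addr not in self.lines:
            self.lines[addr] = self.directory.read(self, addr)
        return self.lines[addr]

    def write(self, addr, value):
        self.directory.write(self, addr, value)
        self.lines[addr] = value


def directory_demo(num_caches=8):
    """Two caches share x, a third writes it; count invalidations sent."""
    directory = Directory()
    caches = [DirectoryCache(directory) for _ in range(num_caches)]
    caches[0].read("x")
    caches[1].read("x")
    caches[2].write("x", 5)
    return directory.messages, caches[1].read("x")
```

With eight caches, the write triggers only two invalidation messages (one per actual sharer) instead of a broadcast to all seven other caches, which is the property that makes the scheme suitable for a network configuration.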
A Side-Effect of Cache Coherency
False sharing
• Caches are organized into blocks of contiguous memory locations, mainly because programs tend to exhibit spatial locality of reference.
• It is therefore possible for two processors to share the same cache block without sharing the same memory location within the block.
• If one processor writes to its own part of the block, the other processor’s entire copy of the block, including the memory location it was accessing, gets updated or invalidated.
• Unnecessary invalidations can affect performance.
• It is up to the programmer to detect and avoid it.
• Compiler-based solutions are being researched.
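False sharing can be made concrete by counting invalidations at block granularity. The sketch below assumes a write-invalidate protocol and uses made-up addresses and block sizes; `count_invalidations` is a hypothetical helper, not a real tool.

```python
def count_invalidations(writes, block_size):
    """Count invalidations under write-invalidate for a given block size.

    `writes` is a sequence of (processor, address) pairs. A write
    invalidates the block in every other cache that currently holds it.
    """
    holders = {}                 # block number -> processors caching it
    invalidations = 0
    for proc, addr in writes:
        block = addr // block_size
        current = holders.get(block, set())
        invalidations += len(current - {proc})
        holders[block] = {proc}  # the writer now holds the only valid copy
    return invalidations


# Processor 0 only ever writes address 0; processor 1 only writes address 1.
# They never touch the same location.
adjacent = [(0, 0), (1, 1)] * 100
# Same pattern, but processor 1's variable is moved to address 4 (padding).
padded = [(0, 0), (1, 4)] * 100

falsely_shared = count_invalidations(adjacent, block_size=4)   # 199
after_padding = count_invalidations(padded, block_size=4)      # 0
```

With a 4-address block, the two variables land in the same block and almost every write invalidates the other processor's copy. Padding the data so that each variable sits in its own block removes the invalidations entirely, which is the usual manual fix.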
ccNUMA Implementations
Stanford Dash
• Dash stands for Directory Architecture for Shared Memory.
• First operational machine to use scalable directory-based cache coherence.
SGI Origin 2000 (Silicon Graphics Inc.)
• Can support up to 1024 processors.
• SGI claims it accounts for over 95% of worldwide shipments of ccNUMA-based systems.
IBM’s LA (Local Access) ccNUMA