Chapter 4 Multiprocessors Dr. Bernard Chen Ph.D. University of Central Arkansas Spring 2010
Outline • Trend for Multiprocessors • Multiprocessors Model • Multiprocessors Performance • Cache Coherence
Uniprocessor Performance • "Electronic circuits are ultimately limited in their speed of operation by the speed of light… and many of the circuits were already operating in the nanosecond range." W. Jack Bouknight, 1972
Uniprocessor Performance From Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th edition, 2006
Multiprocessor • "We are dedicating all of our future product development to multicore designs. … This is a sea change in computing" Paul Otellini, President, Intel (2005) • All microprocessor companies have switched to MP (2X CPUs / 2 yrs)
Multiprocessor • The importance of multiprocessors grew throughout the 1990s as designers sought a way to build servers • The slowdown in uniprocessor performance, arising from diminishing returns in exploiting ILP, is leading to an era where multiprocessors play a major role
Multiprocessor • Major reasons are: • Growth in data-intensive applications • Databases, file servers, … • Growing interest in servers and server performance • Increasing desktop performance is less important • Outside of graphics • Improved understanding of how to use multiprocessors effectively • Especially servers, where there is significant natural TLP • Advantage of leveraging design investment by replication • Rather than unique design
Multiprocessor • However, we are left with two problems: • Multiprocessor architecture is a large and diverse field, and much of the field is in its youth • Broad coverage would necessarily entail discussing approaches that may not stand the test of time • Therefore, we focus on the mainstream of multiprocessor design: multiprocessors with a small to medium number of processors (4 to 32)
Outline • Trend for Multiprocessors • Multiprocessors Model • Multiprocessors Performance • Cache Coherence
Flynn's Taxonomy M. J. Flynn, "Very High-Speed Computing Systems," Proc. of the IEEE, vol. 54, no. 12, pp. 1901-1909, Dec. 1966.
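The figure from the original slide is omitted; for reference, the taxonomy classifies machines by their instruction and data streams: • SISD: single instruction stream, single data stream (the uniprocessor) • SIMD: single instruction stream, multiple data streams (vector and multimedia machines) • MISD: multiple instruction streams, single data stream (no commercial machine of this type) • MIMD: multiple instruction streams, multiple data streams (the multiprocessors discussed below)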
SIMD • SIMD exploits data-level parallelism: a single instruction stream operates on multiple data items in parallel
SIMD • Output of the example program (screenshot omitted)
SIMD • Some other MPI basic codes • MPI_Send(&phase2_n_H,1,MPI_INT,id^i,1,MPI_COMM_WORLD); • MPI_Recv(&phase2_n_H,1,MPI_INT,id^i,1,MPI_COMM_WORLD,&stat); • MPI_Bcast(&centroid[i][0],ws*20+1,MPI_INT,0,MPI_COMM_WORLD); • MPI_Reduce(&local_lower[0], &global_lower[0], num_cluster, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
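A minimal self-contained sketch showing how the broadcast/reduce calls above fit together (the variable names are illustrative, not from the original program); compile with mpicc and launch with mpirun:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int id, np;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &id);   /* this process's rank */
    MPI_Comm_size(MPI_COMM_WORLD, &np);   /* total number of processes */

    int n = 0;
    if (id == 0)
        n = 100;                          /* root initializes the data */

    /* Root (rank 0) broadcasts n to every process. */
    MPI_Bcast(&n, 1, MPI_INT, 0, MPI_COMM_WORLD);

    int local = id + 1;                   /* each rank contributes rank + 1 */
    int total = 0;

    /* Sum every rank's local value into total on rank 0. */
    MPI_Reduce(&local, &total, 1, MPI_INT, MPI_SUM, 0, MPI_COMM_WORLD);

    if (id == 0)
        printf("n = %d, sum of (rank+1) = %d\n", n, total);

    MPI_Finalize();
    return 0;
}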
Flynn's Taxonomy • MIMD: each processor fetches its own instructions and operates on its own data • MIMD exploits thread-level parallelism • MIMD offers flexibility • MIMD can build on the cost-performance advantages of off-the-shelf processors
Back to Basics • "A parallel computer is a collection of processing elements that cooperate and communicate to solve large problems fast." • Parallel Architecture = Computer Architecture + Communication Architecture • 2 classes of multiprocessors WRT memory: • Centralized Memory Multiprocessor • A few dozen processor chips at most; small enough to share a single, centralized memory • Physically Distributed-Memory Multiprocessor • Larger number of chips and cores • BW demands force memory to be distributed among the processors
Centralized Memory Multiprocessor • With large caches, a single memory can satisfy the memory demands of a small number of processors • Can scale to a few dozen processors by using a switch and many memory banks • Although scaling beyond that is technically conceivable, it becomes less attractive as the number of processors sharing the centralized memory increases
Physically Distributed-Memory Multiprocessor • Pro: reduces latency of local memory accesses • Con: communicating data between processors becomes more complex
2 Models for Communication and Memory Architecture • In the first kind, communication occurs through a shared address space • Centralized-memory multiprocessors use this type of communication and are called symmetric shared-memory multiprocessors (SMPs)
2 Models for Communication and Memory Architecture • In the first kind, communication occurs through a shared address space • Even physically separate memories can be addressed as one logically shared address space • Meaning that a memory reference can be made by any processor to any memory location (assuming it has the access rights) • Multiprocessors built this way are called distributed shared memory (DSM)
2 Models for Communication and Memory Architecture • Communication occurs through a shared address space (via loads and stores): shared-memory multiprocessors, either • symmetric shared memory (centralized-memory MP) • distributed shared memory (distributed-memory MP) • Communication occurs by explicitly passing messages among the processors: message-passing multiprocessors (distributed-memory MP)
Outline • Trend for Multiprocessors • Multiprocessors Model • Multiprocessors Performance • Cache Coherence
Multiprocessors Performance • Amdahl's Law: Speedup = 1 / ((1 - Fraction_parallel) + Fraction_parallel / N), where N is the number of processors
Example • For 2 processors, 50% of the work is parallelizable; what is the speedup? Speedup = 1 / ((1 - 0.5) + 0.5/2) = 1/0.75 = 1.333
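To make the arithmetic reusable, here is a minimal C sketch of Amdahl's Law (an illustration, not from the slides; amdahl_speedup is a hypothetical helper name):

#include <stdio.h>

/* Amdahl's Law: speedup = 1 / ((1 - f) + f / n) for parallel fraction f
   and n processors. */
double amdahl_speedup(double f, int n)
{
    return 1.0 / ((1.0 - f) + f / n);
}

int main(void)
{
    printf("%.3f\n", amdahl_speedup(0.50, 2));     /* the example above: 1.333 */
    printf("%.2f\n", amdahl_speedup(0.9975, 100)); /* ~80x, see the next slide */
    return 0;
}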
Challenges of Parallel Processing • First challenge is % of program inherently sequential • Suppose 80X speedup from 100 processors. What fraction of original program can be sequential? • 10% • 5% • 1% • 0.5% • 0.25%
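Working backward through Amdahl's Law confirms the last choice: 80 = 1 / ((1 - f) + f/100) gives (1 - f) + f/100 = 1/80 = 0.0125, so 1 - 0.99f = 0.0125 and f ≈ 0.9975. The parallel fraction must be 99.75%, leaving only 0.25% that can be sequential.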
Challenges of Parallel Processing • Second challenge: application parallelism comes primarily via new algorithms that have better parallel performance • The impact of long remote latency is addressed both by the architect and by the programmer • Today's lecture: HW to help latency via caches
Outline • Trend for Multiprocessors • Multiprocessors Model • Multiprocessors Performance • Cache Coherence
What is Multiprocessor Cache Coherence • Informally, we could say that a memory system is coherent if any read of a data item returns the most recently written value of the data item • However, two formal definitions are required • Coherence: defines what values can be returned by a read • Consistency: determines when a written value will be returned by a read
Defining Coherent Memory System • Preserve Program Order: A read by processor P to location X that follows a write by P to X, with no writes of X by another processor occurring between the write and the read by P, always returns the value written by P • Coherent view of memory: A read by a processor to location X that follows a write by another processor to X returns the written value if the read and write are sufficiently separated in time and no other writes to X occur between the two accesses • Write serialization: 2 writes to the same location by any 2 processors are seen in the same order by all processors • If not, a processor could keep the value 1, since it saw that as the last write • For example, if the values 1 and then 2 are written to a location, processors can never read the value of the location as 2 and then later read it as 1
Defining Coherent Memory System (Translated) • A read follows a write in the same processor • A read follows a write by another CPU • Two writes
Defining Consistent Memory System • Coherence and consistency are complementary: • Coherence defines the behavior of reads and writes to the same location • While consistency defines the behavior of reads and writes with respect to other memory locations
Defining Consistent Memory System • A write does not complete (and allow the next write to occur) until all processors have seen the effect of that write • The processor does not change the order of any write with respect to any other memory access • If a processor writes location A followed by location B, any processor that sees the new value of B must also see the new value of A (see the sketch below)
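A small C11 sketch of this ordering requirement (a hypothetical example, not from the slides): the writer's two writes stay in order, so a reader that sees the flag must also see the data. Compile with -pthread.

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

int data = 0;
atomic_int flag;   /* zero-initialized; acts as location B */

void *writer(void *arg)            /* processor P1 */
{
    (void)arg;
    data = 42;                     /* write to A: the payload */
    atomic_store(&flag, 1);        /* write to B: ordered after the write to A */
    return NULL;
}

void *reader(void *arg)            /* processor P2 */
{
    (void)arg;
    while (atomic_load(&flag) == 0)
        ;                          /* spin until the new value of B is visible */
    printf("data = %d\n", data);   /* a consistent system must print 42, never 0 */
    return NULL;
}

int main(void)
{
    pthread_t w, r;
    pthread_create(&w, NULL, writer, NULL);
    pthread_create(&r, NULL, reader, NULL);
    pthread_join(w, NULL);
    pthread_join(r, NULL);
    return 0;
}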
2 Classes of Cache Coherence Protocols • Snooping — Every cache with a copy of data also has a copy of sharing status of block, but no centralized state is kept • Directory based — Sharing status of a block of physical memory is kept in just one location, the directory
Snooping • Snooping — Every cache with a copy of data also has a copy of sharing status of block, but no centralized state is kept • All caches are accessible via some broadcast medium (a bus or switch) • All cache controllers monitor or snoop on the medium to determine whether or not they have a copy of a block that is requested on a bus or switch access
Snooping • Write through: the information is written both to the block in the cache and to the block in the lower-level memory • Write back: the information is written only to the block in the cache. The modified cache block is written to main memory only when it is replaced or needed
Snooping • The key in snooping is the "bus", used to perform invalidations • To perform an invalidate, the processor simply acquires bus access and broadcasts the address to be invalidated on the bus • If two processors attempt to write to the same location, whoever gets the bus first wins
Snoopy Cache-Coherence Protocols • Cache Controller “snoops” all transactions on the shared medium (bus or switch)
Example: Write-thru Invalidate (figure: processors P1, P2, P3, each with a cache $, on a shared bus with memory and I/O devices; P1 and P3 read u = 5 into their caches, then P3 writes u = 7 and the remaining reads of u follow) • The write by P3 must invalidate before step 3 • Write update uses more broadcast medium BW; all recent MPUs use write invalidate
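A toy C model of the write-invalidate behavior in this example (an illustrative sketch under simplifying assumptions, not the slides' protocol): on a write, the writer broadcasts the block's address on the "bus" and every other cache drops its copy, so later reads miss and fetch the new value.

#include <stdio.h>

enum state { INVALID, VALID };     /* per-block cache state */

#define NPROC 3
enum state cache[NPROC];           /* state of the block u in each cache */
int cached[NPROC];                 /* each cache's copy of u's value */
int memory = 5;                    /* u's value in memory (write-through) */

void bus_invalidate(int writer)
{
    for (int p = 0; p < NPROC; p++)
        if (p != writer)
            cache[p] = INVALID;    /* snoop hit: drop the stale copy */
}

int read_block(int p)
{
    if (cache[p] == INVALID) {     /* miss: fetch the block from memory */
        cached[p] = memory;
        cache[p] = VALID;
    }
    return cached[p];
}

void write_block(int p, int v)
{
    bus_invalidate(p);             /* acquire the bus, broadcast the address */
    cached[p] = v;
    cache[p] = VALID;
    memory = v;                    /* write-through to memory */
}

int main(void)
{
    printf("P1 reads u = %d\n", read_block(0));  /* 5 */
    printf("P3 reads u = %d\n", read_block(2));  /* 5 */
    write_block(2, 7);                           /* P3 writes u = 7 */
    printf("P1 reads u = %d\n", read_block(0));  /* 7, not the stale 5 */
    return 0;
}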
Locate up-to-date copy of data • Write-through: get the up-to-date copy from memory • Write-through is simpler if there is enough memory BW • Write-back is harder • The most recent copy can be in a cache rather than in memory
Locate up-to-date copy of data • Can use the same snooping mechanism • Snoop every address placed on the bus • If a processor has a dirty copy of the requested cache block, it provides it in response to a read request and aborts the memory access • Complexity comes from retrieving the cache block from a processor cache, which can take longer than retrieving it from memory • Write-back needs lower memory bandwidth, so it can support larger numbers of faster processors; most multiprocessors use write-back
A commercial workload • Server: AlphaServer 4100 • Each processor issues up to four instructions per clock cycle and runs at 300MHz • Each processor has a three-level cache hierarchy: • L1 consists of a pair of 8KB direct-mapped on-chip caches (instruction and data) • L2 is a 96KB on-chip three-way set-associative cache • L3 is an off-chip direct-mapped 2MB cache • The latency for an access to L2 is 7 cycles, L3 is 21 cycles, and main memory access is 80 clock cycles
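A small sketch of how these latencies combine into an average memory access time (the 7-, 21-, and 80-cycle latencies are from the slide; the 1-cycle L1 hit time and all miss rates are hypothetical placeholders, not measurements):

#include <stdio.h>

int main(void)
{
    double l1_hit = 1.0;                     /* assumed 1-cycle L1 hit time */
    double m1 = 0.10, m2 = 0.30, m3 = 0.20;  /* hypothetical miss rates */

    /* Each level's latency is paid only on a miss at the level above it. */
    double amat = l1_hit + m1 * (7.0 + m2 * (21.0 + m3 * 80.0));
    printf("AMAT = %.2f cycles\n", amat);    /* 1 + 0.1*(7 + 0.3*37) = 2.81 */
    return 0;
}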
Limitations in Symmetric Shared-Memory Multiprocessors and Snooping Protocols • In the simple case of a bus-based multiprocessor, the bus must support both coherence traffic and normal memory traffic • If there is a single memory unit, it must accommodate all processor requests
Limitations in Symmetric Shared-Memory Multiprocessors and Snooping Protocols • To increase the communication bandwidth between processors and memory, designers used multiple buses as well as interconnection networks (such as crossbars or small point-to-point networks) • In such designs, the memory system can be configured into multiple physical banks • The next figure represents a midpoint between centralized shared memory and distributed shared memory
Limitations in Symmetric Shared-Memory Multiprocessors and Snooping Protocols (figure omitted: a multiprocessor whose memory system is configured as multiple banks, a midpoint between centralized and distributed shared memory)