chapter4 multiprocessors l.
Download
Skip this Video
Loading SlideShow in 5 Seconds..
Chapter4 Multiprocessors PowerPoint Presentation
Download Presentation
Chapter4 Multiprocessors

Loading in 2 Seconds...

play fullscreen
1 / 77

Chapter4 Multiprocessors - PowerPoint PPT Presentation


  • 139 Views
  • Updated on

Chapter4 Multiprocessors. Dr. Bernard Chen Ph.D. University of Central Arkansas. Outline. Trend for Multiprocessors Multiprocessors Model Cache Coherence. Uniprocessor Performance.

loader
I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
capcha
Download Presentation

Chapter4 Multiprocessors


An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
    Presentation Transcript
    1. Chapter4 Multiprocessors Dr. Bernard Chen Ph.D. University of Central Arkansas

    2. Outline • Trend for Multiprocessors • Multiprocessors Model • Cache Coherence

    3. Uniprocessor Performance • Electronic circuits are ultimately limited un their speed of operation by the speed of light… and many of the circuits were already operating in the nanosecond range W.Jack Bouknight, 1972

    4. Uniprocessor Performance From Hennessy and Patterson, Computer Architecture: A Quantitative Approach, 4th edition, 2006

    5. Multiprocessor • “We are dedicating all of our future product development to multicore designs. … This is a sea change in computing” Paul Otellini, President, Intel (2005) • All microprocessor companies switch to MP (2X CPUs / 2 yrs)

    6. Multiprocessor • The importance of multiprocessors was growing throughout the 1990s as designers sought a way to build a server • The slowdown in uniprocessor performance arising from diminishing returns in exploiting ILP • Leading to an era where multiprocessors plays a major role

    7. Multiprocessor • Major reasons are: • Growth in data-intensive applications • Data bases, file servers, … • Growing interest in servers, server performance • Increasing desktop performance less important • Improved understanding in how to use multiprocessors effectively • Especially server where significant natural TLP • Advantage of leveraging design investment by replication • Rather than unique design

    8. Multiprocessor • However, we are left with two problems: • Multiprocessor architecture is a large and diverse field, and much of the field is in its youth • Broad coverage would necessarily entail discussing approaches that may not stand the test of time • Therefore, we focus on the mainstream of multiprocessor design: multiprocessors with small to medium number of processors (4~32 processors)

    9. Outline • Trend for Multiprocessors • Multiprocessors Model • Cache Coherence

    10. Flynn’s Taxonomy M.J. Flynn, "Very High-Speed Computers", Proc. of the IEEE, V 54, 1900-1909, Dec. 1966.

    11. SIMD • SIMD  Data Level Parallelism

    12. SIMD • Output

    13. SIMD MPI_Send • Some other MPI basic functions • int MPI_Send(void *buf,intcount,MPI_Datatypedatatype,intdest,inttag,MPI_Commcomm); • buf • [in] initial address of send buffer (choice) • count • [in] number of elements in send buffer (nonnegative integer) • datatype • [in] datatype of each send buffer element (handle) • dest • [in] rank of destination (integer) • tag • [in] message tag (integer) • comm • [in] communicator (handle) Code example: http://mpi.deino.net/mpi_functions/MPI_Send.html

    14. SIMD MPI_Recv • int MPI_Recv(void *buf,intcount,MPI_Datatypedatatype,intsource,inttag,MPI_Commcomm,MPI_Status *status); • buf • [out] initial address of receive buffer (choice) • count • [in] maximum number of elements in receive buffer (integer) • datatype • [in] datatype of each receive buffer element (handle) • source • [in] rank of source (integer) • tag • [in] message tag (integer) • comm • [in] communicator (handle) • status • [out] status object (Status)

    15. SIMD MPI_Reduce • int MPI_Reduce(void *sendbuf,void *recvbuf,intcount,MPI_Datatypedatatype,MPI_Opop,introot,MPI_Commcomm);Parameters • sendbuf • [in] address of send buffer (choice) • recvbuf • [out] address of receive buffer (choice, significant only at root) • count • [in] number of elements in send buffer (integer) • datatype • [in] data type of elements of send buffer (handle) • op • [in] reduce operation (handle) • root • [in] rank of root process (integer) • comm • [in] communicator (handle)

    16. Flynn’s Taxonomy • MIMD  Each processor fetches its own instructions and operates on its own data. • MIMD operates thread-level parallelism • MIMD offers flexibility • MIMD can build on the cost-performance advantages

    17. Back to Basics • “A parallel computer is a collection of processing elements that cooperate and communicate to solve large problems fast.” • Parallel Architecture = Computer Architecture + Communication Architecture • 2 classes of multiprocessors WRT memory: • Centralized Memory Multiprocessor • < few dozen processor chips Small enough to share single, centralized memory • Physically Distributed-Memory multiprocessor • Larger number chips and cores • BW demands  Memory distributed among processors

    18. Centralized Memory Multiprocessor • Large caches  single memory can satisfy memory demands of small number of processors • Can scale to a few dozen processors by using a switch and by using many memory banks • Although scaling beyond that is technically conceivable, it becomes less attractive as the number of processors sharing centralized memory increases

    19. Eniac (350 op/s) 1946 - (U.S. Army photo)

    20. Arkansas Star

    21. ASCI White (12.3 teraops/sec)IBM Jun28 2000 Mega flops = 10^6 flops = 2^20 Giga = 10^9 = billion = 2^30 Tera = 10^12 = trillion = 2^40 Peta = 10^15 = quadrillion = 2^50 Exa = 10^18 = quintillion = 2^60

    22. Physically Distributed-Memory multiprocessor • Pro: Reduces latency of local memory accesses • Con: Communicating data between processors more complex

    23. 2 Models for Communication and Memory Architecture • The first kind, communication occurs through a shared address space. • Centralized memory processor utilized this type of communication, named symmetric shared memory multiprocessors

    24. 2 Models for Communication and Memory Architecture • Even the physically separate memories can be addressed as one logically shared space • Meaning that the memory reference can be made by any processor to any memory location, (assume it has the access right) • These multiprocessors are called distributed shared memory (DSM)

    25. 2 Models for Communication and Memory Architecture • Communication occurs through a shared address space (via loads and stores): shared memory multiprocessorseither • symmetric shared memory (centralized memory MP) • distributed shared memory (distributed memory MP) • Communication occurs by explicitly passing messages among the processors: message-passing multiprocessors, distributed memory MP

    26. Outline • Trend for Multiprocessors • Multiprocessors Model • Cache Coherence

    27. What is Multiprocessor Cache Coherence

    28. What is Multiprocessor Cache Coherence • Informally, we could say that a memory system is coherent if any read of a data item returns the most recently written value of the data item • However, two formal definitions are required • Coherence: defines what values can be returned by a read • Consistency: determines when a written value will be returned by a read

    29. Defining Coherent Memory System • Preserve Program Order: A read by processor P to location X that follows a write by P to X, with no writes of X by another processor occurring between the write and the read by P, always returns the value written by P • Coherent view of memory: Read by a processor to location X that follows a write by anotherprocessor to X returns the written value if the read and write are sufficiently separated in time and no other writes to X occur between the two accesses • Write serialization: 2 writes to same location by any 2 processors are seen in the same order by all processors • If not, a processor could keep value 1 since saw as last write • For example, if the values 1 and then 2 are written to a location, processors can never read the value of the location as 2 and then later read it as 1

    30. Defining Coherent Memory System (Translated) • A read follows a write in the same processor • A read follows a write by another CPU • Two writes

    31. 2 Classes of Cache Coherence Protocols • Snooping — Every cache with a copy of data also has a copy of sharing status of block, but no centralized state is kept • Directory based — Sharing status of a block of physical memory is kept in just one location, the directory

    32. Snooping • Snooping — Every cache with a copy of data also has a copy of sharing status of block, but no centralized state is kept • All caches are accessible via some broadcast medium (a bus or switch) • All cache controllers monitor or snoop on the medium to determine whether or not they have a copy of a block that is requested on a bus or switch access

    33. Snooping (write back)

    34. Snooping • Write through: the information is written to both the block in the cache and to the block in the lower-level memory • Write back: the information is only to the block in the cache. The modified cache block is written to main memory only when it is replaced or needed

    35. Snooping (write through)

    36. Snooping • The key in Snooping is the “bus”, to perform invalidations • To perform invalidate, the processor simply acquires bus access and broadcasts the address to be invalidated on the bus • If two processors attempt to write on the same lactation, who get the bus first wins

    37. Snoopy Cache-Coherence Protocols • Cache Controller “snoops” all transactions on the shared medium (bus or switch)

    38. u = ? u = ? u = 7 5 4 3 1 2 u u u :5 :5 :5 u = 7 Example: Write-thru Invalidate • The write step by p3 Must invalidate before step 3 • Write update uses more broadcast medium BW all recent MPUs use write invalidate P P P 2 1 3 $ $ $ I/O devices Memory

    39. Locate up-to-date copy of data • Write-through: get up-to-date copy from memory • Write through simpler if enough memory BW • Write-back harder • Most recent copy can be in a cache

    40. Locate up-to-date copy of data • Can use same snooping mechanism • Snoop every address placed on the bus • If a processor has dirty copy of requested cache block, it provides it in response to a read request and aborts the memory access • Complexity from retrieving cache block from a processor cache, which can take longer than retrieving it from memory • Write-back needs lower memory bandwidth  Support larger numbers of faster processors  Most multiprocessors use write-back

    41. A commercial workload • Server: AlphaServer 4100 • Each processor issues up to four instructions per clock cycle and runs at 300MHz • Each processor has a three-level cache hierarchy: • L1 consists of a pair of 8KB direct-mapped on-chip cache • L2 is a 96KB on-chip three-way set associative cache • L3 is a off-chip direct-mapped 2MB cache • The latency for an access to L2 is 7 cycles, L3 is 21 cycles, and main memory access is 80 clock cycles

    42. A commercial workload

    43. 2 Classes of Cache Coherence Protocols • Snooping — Every cache with a copy of data also has a copy of sharing status of block, but no centralized state is kept • Directory based — Sharing status of a block of physical memory is kept in just one location, the directory

    44. Directory-Based Cache Coherence Protocols • To implement the operations, a directory must track the state of each cache block: • Shared (S): one or more processors have the block cached, and the value is up-to-date • Uncached (U): no processor has a copy of the cache block • Modified/Executed (E): exactly one processor has a copy of the cache block. The processor is called the owner of the block

    45. Directory-based Protocol Interconnection Network Directory Directory Directory Local Memory Local Memory Local Memory Cache Cache Cache CPU 0 CPU 1 CPU 2

    46. Directory-based Protocol CPU 0 CPU 1 CPU 2 Interconnection Network Bit Vector X U 0 0 0 Directories X 7 Memories Caches

    47. CPU 0 Reads X Read Miss CPU 0 CPU 1 CPU 2 Interconnection Network X U 0 0 0 Directories X 7 Memories Caches

    48. CPU 0 Reads X CPU 0 CPU 1 CPU 2 Interconnection Network X S 1 0 0 Directories X 7 Memories Caches