Principles of High Performance Computing (ICS 632)


  1. Principles of High Performance Computing (ICS 632) Concurrent Computers

  2. Concurrency and Computers • Concurrency occurs at many levels in computer systems • Within a CPU • Within a “Box” • Across Boxes

  4. Concurrency within a “Box” • Two main techniques • SMP • Multi-core • Let’s look at both of them

  5. Multiple CPUs • We have seen that there are many ways in which a single-threaded program can in fact achieve some amount of true concurrency in a modern processor • ILP, vector instructions • On a hyper-threaded processor, a single-threaded program can also achieve some amount of true concurrency • But there are limits to these techniques, and many systems provide increased true concurrency by using multiple CPUs

  6. SMPs • Symmetric Multi-Processors • often mislabeled as “Shared-Memory Processors”, a usage that has now become tolerated • Processors are all connected to a single memory • Symmetric: each memory cell is equally close to all processors • Many dual-proc and quad-proc systems • e.g., for servers • [Figure: processors P1 … Pn all connected to a single main memory]

  7. Multi-core processors • We’re about to enter an era in which all computers will be SMPs • This is because soon all processors will be multi-core • Let’s look at why we have multi-core processors

  8. Moore’s Law • Many people interpret Moore’s law as “computers get twice as fast every 18/24 months” • which is not what the law says • The law is about transistor density • And this (wrong) interpretation no longer holds anyway • If it did, we should have 20GHz processors right now • And we don’t!

  9. No more Moore? • We are used to getting faster CPUs all the time • We are used to them keeping up with ever more demanding software • Known as “Andy giveth, and Bill taketh away” • Andy Grove • Bill Gates • It’s a nice way to force people to buy computers often • But basically, our computers get better, do more things, and it just happens automatically • Some people call this the “performance free lunch” • Conventional wisdom: “Not to worry, tomorrow’s processors will have even more throughput, and anyway today’s applications are increasingly throttled by factors other than CPU throughput and memory speed (e.g., they’re often I/O-bound, network-bound, database-bound).”

  10. Commodity improvements • There are three main ways in which commodity processors keep improving: • Higher clock rates • More aggressive instruction reordering and more concurrent functional units • Bigger/faster caches • All applications can easily benefit from these improvements • at the cost of perhaps a recompilation • Unfortunately, the first two are hitting their limits • Higher clock rates lead to high heat and power consumption • No more instruction reordering is possible without compromising correctness

  11. Is Moore’s law not true? • Ironically, Moore’s law is still true • The density indeed still doubles • But its wrong interpretation is not • Clock rates no longer double • But we can’t let this happen: computers have to get more powerful • Therefore, the industry has thought of a new way to improve them: multi-core • Multiple CPUs on a single chip • Multi-core adds another level of concurrency • But unlike, say, multiple functional units, it is hard for compilers to exploit automatically • Therefore, programmers need to be trained to develop code for multi-core platforms (see the sketch below) • See ICS432
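To make this concrete, here is a minimal sketch in C using POSIX threads: the concurrency must be expressed explicitly by the programmer in order to use multiple cores. The work function, array size, and thread count are invented for the example.

    #include <pthread.h>
    #include <stdio.h>

    #define N 1000000
    #define NTHREADS 4

    static double a[N];

    /* Each thread fills its own contiguous chunk of the array. */
    static void *fill_chunk(void *arg) {
        long id = (long)arg;
        long begin = id * (N / NTHREADS);
        long end = (id + 1) * (N / NTHREADS);
        for (long i = begin; i < end; i++)
            a[i] = 2.0 * i;
        return NULL;
    }

    int main(void) {
        pthread_t threads[NTHREADS];
        /* Unlike ILP, this concurrency is created explicitly by the programmer. */
        for (long t = 0; t < NTHREADS; t++)
            pthread_create(&threads[t], NULL, fill_chunk, (void *)t);
        for (long t = 0; t < NTHREADS; t++)
            pthread_join(threads[t], NULL);
        printf("a[N-1] = %f\n", a[N - 1]);
        return 0;
    }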

  12. Shared Memory and Caches? • When building a shared memory system with multiple processors / cores, one key question is: where does one put the cache? • Two options • [Figure, left (Shared Cache): processors P1 … Pn connect through a switch to a single shared cache in front of main memory] • [Figure, right (Private Caches): each processor P1 … Pn has its own cache ($), and the caches reach main memory through an interconnection network]

  13. Shared Caches • Advantages • Cache placement identical to single cache • Only one copy of any cached block • Can’t have different values for the same memory location • Good interference • One processor may prefetch data for another • Two processors can each access data within the same cache block, enabling fine-grain sharing • Disadvantages • Bandwidth limitation • Difficult to scale to a large number of processors • Keeping all processors working in cache requires a lot of bandwidth • Size limitation • Building a fast large cache is expensive • Bad interference • One processor may flush another processor’s data

  14. Shared Caches • Shared caches have had a strange history • Early 1980s • Alliant FX-8 • 8 processors with a crossbar to an interleaved 512KB cache • Encore & Sequent • among the first 32-bit microprocessor systems • two procs per board with a shared cache • Then they disappeared • Only to reappear in recent MPPs • Cray X1: shared L3 cache • IBM Power 4 and Power 5: shared L2 cache • Typical multi-proc systems do not use shared caches • But they are common in multi-core systems

  15. Caches and multi-core • Typical multi-core architectures use distributed L1 caches • But lower levels of caches are shared • [Figures: Core #1 and Core #2, each with its own L1 cache; in the second diagram both L1 caches sit on top of a shared L2 cache]

  16. Multi-proc & multi-core systems • [Figure: Processor #1 and Processor #2, each with two cores (Core #1, Core #2); each core has its own L1 cache, each processor has a shared L2 cache, and both processors share the RAM]

  17. Private caches • The main problem with private caches is that of memory consistency • Memory consistency is jeopardized by having multiple caches • P1 and P2 both have a cached copy of a data item • P1 writes to it, possibly writing through to memory • At this point P2 holds a stale copy • When designing a multi-processor system, one must ensure that this cannot happen • By defining protocols for cache coherence (illustrated in the sketch below)
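To make the stale-copy scenario concrete, here is a minimal sketch in C (pthreads plus C11 atomics; the variable name and value are invented for the example). One thread plays the role of P1 and updates a shared item while the other, playing P2, keeps reading it; the cache coherence protocol, together with the visibility guarantees of the atomic operations, is what lets P2 eventually observe the new value rather than spin forever on a stale cached copy.

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>

    /* Shared data item: both processors may hold it in their private caches. */
    static atomic_int x = 0;

    static void *writer(void *arg) {          /* plays the role of P1 */
        (void)arg;
        atomic_store(&x, 42);                 /* update the (cached) copy */
        return NULL;
    }

    static void *reader(void *arg) {          /* plays the role of P2 */
        (void)arg;
        /* Without coherence, this loop could keep reading a stale cached value. */
        while (atomic_load(&x) != 42)
            ;
        printf("reader saw the update\n");
        return NULL;
    }

    int main(void) {
        pthread_t t1, t2;
        pthread_create(&t2, NULL, reader, NULL);
        pthread_create(&t1, NULL, writer, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        return 0;
    }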

  18. Snoopy Cache-Coherence • The memory bus is a broadcast medium • Caches contain information on which addresses they store • each cache line carries a state, an address tag, and the data • The cache controller “snoops” all transactions on the bus • A transaction is relevant if it involves a cache block currently contained in this cache • The controller then takes action to ensure coherence • invalidate, update, or supply the value • [Figure: processors P0 … Pn, each with its cache ($), snooping memory operations on the shared memory bus in front of the memory modules]
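As a rough sketch of the idea, here is what the snooping logic might look like in C for a simplified MSI-style invalidation protocol. This is purely illustrative: the state names, the flush helper mentioned in a comment, and the transitions are a textbook simplification, not any specific machine's protocol.

    #include <stdbool.h>
    #include <stdio.h>

    /* Per-line state kept by each private cache. */
    typedef enum { INVALID, SHARED, MODIFIED } line_state_t;

    typedef struct {
        unsigned long tag;      /* which memory block this line holds */
        line_state_t  state;
        /* ... data bytes omitted ... */
    } cache_line_t;

    /* Called by the cache controller for every transaction snooped on the bus. */
    void snoop(cache_line_t *line, unsigned long bus_addr, bool bus_write) {
        if (line->state == INVALID || line->tag != bus_addr)
            return;                        /* not a relevant transaction */
        if (bus_write) {
            line->state = INVALID;         /* another processor writes: our copy is stale */
        } else if (line->state == MODIFIED) {
            /* Another processor reads a block we modified: supply the value
               (e.g., a hypothetical flush_line_to_bus(line)) and drop to SHARED. */
            line->state = SHARED;
        }
    }

    int main(void) {
        cache_line_t l = { 0x40, SHARED };
        snoop(&l, 0x40, true);             /* another processor writes block 0x40 */
        printf("state after snooped write: %s\n",
               l.state == INVALID ? "INVALID" : "not INVALID");
        return 0;
    }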

  19. Limits of Snoopy Coherence • Assume: a 4 GHz processor • => 16 GB/s instruction bandwidth per processor (32-bit instructions) • => 9.6 GB/s data bandwidth per processor, at 30% load-stores of 8-byte elements • Suppose a 98% instruction hit rate and a 90% data hit rate • => 320 MB/s instruction bandwidth to memory per processor • => 960 MB/s data bandwidth to memory per processor • => 1.28 GB/s combined bandwidth to memory per processor • Assuming 10 GB/s bus bandwidth, 8 processors will saturate the bus • [Figure: processors and their caches on a shared bus to the memory modules; 25.6 GB/s between each processor and its cache, 1.28 GB/s from each cache to memory]
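The arithmetic behind these numbers, written out as a small sketch in C (the parameters are the ones assumed on the slide):

    #include <stdio.h>

    int main(void) {
        double clock_hz    = 4e9;    /* 4 GHz processor                      */
        double inst_bytes  = 4.0;    /* 32-bit instructions                  */
        double ls_fraction = 0.30;   /* 30% of instructions are loads/stores */
        double elem_bytes  = 8.0;    /* ... of 8-byte elements               */
        double inst_hit    = 0.98;   /* instruction cache hit rate           */
        double data_hit    = 0.90;   /* data cache hit rate                  */
        double bus_bw      = 10e9;   /* shared bus bandwidth (bytes/s)       */

        double inst_bw = clock_hz * inst_bytes;                /* 16 GB/s  */
        double data_bw = clock_hz * ls_fraction * elem_bytes;  /* 9.6 GB/s */

        /* Only cache misses generate traffic on the shared bus. */
        double mem_bw = inst_bw * (1 - inst_hit) + data_bw * (1 - data_hit);

        printf("per-processor bus traffic: %.2f GB/s\n", mem_bw / 1e9);    /* 1.28 */
        printf("processors to saturate the bus: %.1f\n", bus_bw / mem_bw); /* ~7.8 */
        return 0;
    }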

  20. Sample Machines • Intel Pentium Pro Quad • Coherent • 4 processors • Sun Enterprise server • Coherent • Up to 16 processor and/or memory-I/O cards

  21. Directory-based Coherence • Idea: Implement a “directory” that keeps track of where each copy of a data item is stored • The directory acts as a filter • processors must ask permission before loading data from memory into their cache • when an entry is changed, the directory either updates or invalidates the cached copies • Eliminates the overhead of broadcasting/snooping, and thus its bandwidth consumption • But it is slower in terms of latency • Used to scale up to numbers of processors that would saturate the memory bus • (a sketch of a directory entry follows below)
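A minimal sketch in C of what a directory entry might look like, assuming a full bit vector of sharers per memory block (real machines use many variations; the state names and processor count are invented for the example):

    #include <stdint.h>
    #include <stdio.h>

    #define NPROCS 64                 /* assumed machine size */

    /* One directory entry per memory block. */
    typedef enum { UNCACHED, SHARED_CLEAN, EXCLUSIVE_DIRTY } dir_state_t;

    typedef struct {
        dir_state_t state;
        uint64_t    sharers;          /* bit i set => processor i holds a copy */
    } dir_entry_t;

    /* On a write by processor p, invalidate every other cached copy. */
    void dir_handle_write(dir_entry_t *e, int p) {
        uint64_t others = e->sharers & ~(1ULL << p);
        for (int i = 0; i < NPROCS; i++)
            if (others & (1ULL << i))
                printf("send invalidate to processor %d\n", i); /* stand-in for a real message */
        e->sharers = 1ULL << p;       /* only the writer keeps a copy */
        e->state   = EXCLUSIVE_DIRTY;
    }

    int main(void) {
        dir_entry_t e = { SHARED_CLEAN, (1ULL << 0) | (1ULL << 3) };
        dir_handle_write(&e, 0);      /* processor 0 writes the block */
        return 0;
    }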

  22. Example machine • SGI Altix 3000 • A node contains up to 4 Itanium 2 processors and 32GB of memory • Uses a mixture of snoopy and directory-based coherence • Up to 512 processors that are cache coherent (global address space is possible for larger machines)

  23. Sequential Consistency? • A lot of hardware and technology to ensure cache coherence • But the sequential consistency model may be broken anyway • The compiler reorders/removes code • Prefetch instructions cause reordering • The network may reorder two write messages • Basically, a bunch of things can happen • Virtually all commercial systems give up on the idea of maintaining strong sequential consistency

  24. Weaker models • The programmer must program against memory models weaker than Sequential Consistency • This is done by following some rules • Avoid race conditions • Use system-provided synchronization primitives (see the sketch below) • We will see how to program shared-memory machines • ICS432 is “all” about this • We’ll just do a brief “review” in 632
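For instance, here is a minimal sketch in C of protecting a shared counter with a system-provided primitive (a pthread mutex) instead of reasoning about the hardware memory model directly:

    #include <pthread.h>
    #include <stdio.h>

    static long counter = 0;
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void *work(void *arg) {
        (void)arg;
        for (int i = 0; i < 100000; i++) {
            /* The lock both prevents the race and provides the ordering
               guarantees that the weak memory model does not. */
            pthread_mutex_lock(&lock);
            counter++;
            pthread_mutex_unlock(&lock);
        }
        return NULL;
    }

    int main(void) {
        pthread_t t[4];
        for (int i = 0; i < 4; i++) pthread_create(&t[i], NULL, work, NULL);
        for (int i = 0; i < 4; i++) pthread_join(t[i], NULL);
        printf("counter = %ld\n", counter);   /* always 400000 */
        return 0;
    }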

  25. Concurrency and Computers • We will see computer systems designed to allow concurrency (for performance benefits) • Concurrency occurs at many levels in computer systems • Within a CPU • Within a “Box” • Across Boxes

  26. Multiple boxes together • Example • Take four “boxes” • e.g., four Intel Itaniums bought from Dell • Hook them up to a network • e.g., a switch bought from Cisco, Myricom, etc. • Install software that allows you to write/run applications that can utilize these four boxes concurrently (see the sketch below) • This is a simple way to achieve concurrency across computer systems • Everybody has heard of “clusters” by now • They are basically like the above example and can be purchased already built from vendors • We will talk about this kind of concurrent platform at length during this class
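The usual software for this is a message-passing library such as MPI. A minimal sketch in C of a program that runs as one process per box (launched, for example, as mpirun -np 4 ./hello):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* which process am I?       */
        MPI_Comm_size(MPI_COMM_WORLD, &size);   /* how many processes total? */
        printf("Hello from process %d of %d\n", rank, size);
        MPI_Finalize();
        return 0;
    }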

  27. Multiple Boxes Together • Why do we use multiple boxes? • Every programmer would rather have an SMP/multi-core architecture that provides all the power/memory she/he needs • The problem is that single boxes do not scale to meet the needs of many scientific applications • Can’t have enough processors or powerful enough cores • Can’t have enough memory • But if you can live with a single box, do it! • We will see that single-box programming is much easier than multi-box programming

  28. Where does this leave us? • So far we have seen many ways in which concurrency can be achieved/implemented in computer systems • Within a box • Across boxes • So we could look at a system and just list all the ways in which it does concurrency • It would be nice to have a great taxonomy of parallel platforms in which we can pigeon-hole all (past and present) systems • Provides simple names that everybody can use and understand quickly

  29. Taxonomy of parallel machines? • It’s not going to happen • Just last year, Gordon Bell and Jim Gray published an article in Comm. of the ACM discussing what the taxonomy should be • Dongarra, Sterling, etc. answered, telling them they were wrong, saying what the taxonomy should be instead, and proposing a new multi-dimensional scheme! • Both papers agree that most terms (e.g., MPP) are conflated, misused, etc. • Complicated by the fact that concurrency appears at so many levels • Example: A 16-node cluster, where each node is a 4-way multi-processor, where each processor is hyperthreaded, has vector units, and is fully pipelined with multiple, pipelined functional units

  30. Taxonomy of platforms? • We’ll look at one traditional taxonomy • We’ll look at current categorizations from Top500 • We’ll look at examples of platforms • We’ll look at interesting/noteworthy architectural features that one should know as part of one’s parallel computing culture

  31. The Flynn taxonomy • Proposed in 1966!!! • Functional taxonomy based on the notion of streams of information: data and instructions • Platforms are classified according to whether they have a single (S) or multiple (M) stream of each of the above • Four possibilities • SISD (sequential machine) • SIMD • MIMD • MISD (rare, no commercial system... systolic arrays)

  32. SIMD • [Figure: a single Control Unit fetches and decodes a single stream of instructions and broadcasts each one to many Processing Elements] • PEs can be deactivated and activated on-the-fly • Vector processing (e.g., vector add) is easy to implement on SIMD (see the sketch below) • Debate: is a vector processor an SIMD machine? • often confused • strictly not true according to the taxonomy (it’s really SISD with pipelined operations) • but it’s convenient to think of the two as equivalent
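A rough sketch in C of the SIMD execution model: one broadcast operation (the loop body) applied by every processing element to its own data, with a mask array standing in for PEs that have been deactivated. This is purely illustrative, not any machine's instruction set.

    #include <stdio.h>

    #define NPE 8    /* pretend the machine has 8 processing elements */

    int main(void) {
        double a[NPE] = {1, 2, 3, 4, 5, 6, 7, 8};
        double b[NPE] = {8, 7, 6, 5, 4, 3, 2, 1};
        double c[NPE] = {0};
        int active[NPE] = {1, 1, 1, 0, 1, 1, 0, 1};   /* some PEs deactivated */

        /* Conceptually this is ONE broadcast instruction: every active PE
           performs the same add on its own element, in lockstep. */
        for (int pe = 0; pe < NPE; pe++)
            if (active[pe])
                c[pe] = a[pe] + b[pe];

        for (int pe = 0; pe < NPE; pe++)
            printf("c[%d] = %g\n", pe, c[pe]);
        return 0;
    }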

  33. MIMD • Most general category • Pretty much every supercomputer in existence today is a MIMD machine at some level • This limits the usefulness of the taxonomy • But you had to have heard of it at least once because people keep referring to it, somehow... • Other taxonomies have been proposed, none very satisfying • Shared- vs. Distributed- memory is a common distinction among machines, but these days many are hybrid anyway

  34. A host of parallel machines • There are (have been) many kinds of parallel machines • For the last 12 years their performance has been measured and recorded with the LINPACK benchmark, as part of Top500 • It is a good source of information about what machines are (were) and how they have evolved • Note that it’s really about “supercomputers” http://www.top500.org

  35. LINPACK Benchmark? • LINPACK: “LINear algebra PACKage” • A FORTRAN library • Matrix multiply, LU/QR/Cholesky factorizations, eigensolvers, SVD, etc. • LINPACK Benchmark • Dense linear system solve with LU factorization • (2/3) n^3 + O(n^2) floating-point operations • Measure: MFlops • The problem size n can be chosen • You have to report the best performance for the best n, and the n that achieves half of the best performance (the rate is derived as in the sketch below)
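A small sketch in C of how the reported rate follows from the flop count; the problem size and timing are made-up example values, and the O(n^2) term is dropped:

    #include <stdio.h>

    int main(void) {
        double n = 10000.0;     /* chosen problem size (example value) */
        double seconds = 65.0;  /* measured solve time (example value) */

        /* A dense LU-based solve costs (2/3) n^3 + O(n^2) floating-point
           operations; the lower-order term is ignored in this sketch. */
        double flops = (2.0 / 3.0) * n * n * n;

        printf("Rate: %.1f MFlops\n", flops / seconds / 1e6);
        return 0;
    }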

  36. What can we find on the Top500?

  37.-41. Pies • [Figures: pie charts breaking down the Top500 machines; not reproduced in this transcript]

  42. Platform Architectures

  43. SIMD Machines • ILLIAC-IV, TMC CM-1, MasPar MP-1 • Expensive logic for the CU, but there is only one • Cheap logic for the PEs, and there can be a lot of them • 32 procs on 1 chip for the MasPar, a 1024-proc system with 32 chips that fit on a single board! • 65,536 processors for the CM-1 • Thinking Machines’ gimmick was that the human brain consists of many simple neurons that are turned on and off, and so was their machine • CM-5 • hybrid SIMD and MIMD • Death • The machines are no longer popular, but the programming model is • Vector processors are often labeled SIMD because that’s in effect what they do, but they are not SIMD machines • Led to the MPP terminology (Massively Parallel Processor) • Ironic because none of today’s “MPPs” are SIMD

  44. Clusters, Constellations, MPPs • [Figure: nodes P0, P1, …, Pn, each with its own memory and a network interface (NI), connected by an interconnect] • These are the only 3 categories today in the Top500 • They all belong to the Distributed Memory model (MIMD) (with many twists) • Each processor/node has its own memory and cache but cannot directly access another processor’s memory • nodes may be SMPs • Each “node” has a network interface (NI) for all communication and synchronization (see the sketch below) • So what are these 3 categories?
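Because a node cannot read another node's memory directly, data moves through explicit messages over the NI. A minimal sketch in C using MPI point-to-point calls (the buffer contents and message tag are arbitrary):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank;
        double buf[4] = {1.0, 2.0, 3.0, 4.0};

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            /* Node 0 cannot write into node 1's memory: it sends a message. */
            MPI_Send(buf, 4, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            /* Node 1 receives the data into its own local memory. */
            MPI_Recv(buf, 4, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("node 1 received %g %g %g %g\n", buf[0], buf[1], buf[2], buf[3]);
        }

        MPI_Finalize();
        return 0;
    }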

  45. Clusters • 80% of the Top500 machines are labeled as “clusters” • Definition: Parallel computer system comprising an integrated collection of independent “nodes”, each of which is a system in its own right, capable of independent operation, and derived from products developed and marketed for other standalone purposes • A commodity cluster is one in which both the network and the compute nodes are available in the market • In the Top500, “cluster” means “commodity cluster” • A well-known type of commodity cluster is the “Beowulf-class PC cluster”, or “Beowulf”

  46. What is Beowulf? • An experiment in parallel computing systems • Established a vision of low-cost, high-end computing with public domain software (and led to software development) • Tutorials and a book on best practices for building such platforms • Today, by “Beowulf cluster” one means a commodity cluster that runs Linux and GNU-type software • Project initiated by T. Sterling and D. Becker at NASA in 1994

  47. Constellations??? • Commodity clusters that differ from the previous ones by the dominant level of parallelism • Clusters consist of nodes, and nodes are typically SMPs • If there are more procs in a node than nodes in the cluster, then we have a constellation • Typically, constellations are space-shared among users, with each user running OpenMP on a node, although an app could run on the whole machine using MPI/OpenMP • To be honest, this term is not very useful and not widely used

  48. MPP???????? • Probably the most imprecise term for describing a machine (isn’t a 256-node cluster of 4-way SMPs massively parallel?) • May use proprietary networks, vector processors, as opposed to commodity components • Cray T3E, Cray X1, and the Earth Simulator are distributed memory machines, but the nodes are SMPs • Basically, everything that’s fast and not commodity is an MPP, in terms of today’s Top500 • Let’s look at these “non-commodity” things • People’s definition of “commodity” varies

  49. Vector Processors • Vector architectures were based on a single processor • Multiple functional units • All performing the same operation • Instructions may specify large amounts of parallelism (e.g., 64-way) but the hardware executes only a subset in parallel • Historically important • Overtaken by MPPs in the 90s, as seen in the Top500 • Re-emerging in recent years • At a large scale in the Earth Simulator (NEC SX6) and Cray X1 • At a small scale in SIMD media extensions to microprocessors (see the sketch below) • SSE, SSE2 (Intel: Pentium/IA64) • Altivec (IBM/Motorola/Apple: PowerPC) • VIS (Sun: Sparc) • Key idea: The compiler does some of the difficult work of finding parallelism, so the hardware doesn’t have to
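As an illustration of such media extensions, here is a small sketch in C using SSE intrinsics to add two float arrays four elements at a time (the array size is chosen as a multiple of 4 for simplicity):

    #include <xmmintrin.h>   /* SSE intrinsics */
    #include <stdio.h>

    #define N 16             /* multiple of 4 for simplicity */

    int main(void) {
        float a[N], b[N], c[N];
        for (int i = 0; i < N; i++) { a[i] = (float)i; b[i] = 2.0f * i; }

        /* Each SSE instruction operates on 4 packed single-precision floats. */
        for (int i = 0; i < N; i += 4) {
            __m128 va = _mm_loadu_ps(&a[i]);
            __m128 vb = _mm_loadu_ps(&b[i]);
            __m128 vc = _mm_add_ps(va, vb);
            _mm_storeu_ps(&c[i], vc);
        }

        printf("c[N-1] = %g\n", c[N - 1]);   /* 15 + 30 = 45 */
        return 0;
    }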

  50. Vector Processors • Advantages • quick fetch and decode of a single instruction for multiple operations • the instruction provides the processor with a regular source of data, which can arrive at each cycle and be processed in a pipelined fashion • The compiler does the work for you, of course • Memory-to-memory machines • no registers • can process very long vectors, but startup time is large • appeared in the 70s and died in the 80s • Cray, Fujitsu, Hitachi, NEC
