  1. Parallel Computer Organization and Design EDA282

  2. Why Study Parallel Computers? Almost ALL computers are now parallel Understanding hardware is important for producing good software (converse also true!) It’s fun!

  3. Logistics • EL43 1:15-3:00 T/Th (often 1:15 F, too) • Expected participation • Attend lectures, participate in discussion • Complete labs (including a satisfactory writeup) — dates/times TBD • Read papers • Complete quizzes • Write (short) survey article (in teams) • Finish (short) take-home exam • Canvas course-management system • https://canvas.instructure.com/courses/777378 • Link: http://www.cse.chalmers.se/~mckee/eda282

  4. Personnel • Prof. Sally McKee • Office hours: arrange meetings via email • Available for discussions after class • mckee@chalmers.se • Jacob Lidman • lidman@chalmers.se

  5. Course Materials “Parallel Computer Organization and Design” by Dubois, Annavaram, Stenström (at Cremona) Research and survey papers (linked to web page)

  6. Course Structure/Contents • Intro today • Programming models • Data parallelism • Shared address spaces • Message passing • Hybrid • Design principles/tradeoffs (this is the bulk of the material) • Small-scale systems • Scalable systems • Interconnects

  7. For Each Big Topic, We’ll Discuss . . . • History • How concepts originated in old machines • How they show up in current machines • Basics required in any parallel machine • Memory coherence • Communication • Synchronization

  8. How Did We Get Here?
  • Transistor count doubling every ~2 years
  • Transistor feature sizes shrinking
  • Costs changing
  • Clock speeds hitting limits
  • Parallelism per processor increasing
  Looking at trends is important when designing new systems!

  9. Costs of Parallel Machines Things to keep in mind when designing a machine . . .
  • What does it cost to design the mechanism?
  • What does it cost to verify?
  • What does it cost to manufacture?
  • What does it cost to test?
  • What does it cost to program it?
  • What does it cost to deploy (turn on)?
  • What does it cost to keep it running? (power costs, maintenance)
  • What does it cost to use it?
  • What does it cost to dispose of it at the end of its lifetime? (how long is a "lifetime"?)

  10. Interesting Questions (i.e., course content) • What do we mean by parallel? • Task parallelism (SPMD, MPMD) • Data parallelism (SIMD) • Thread parallelism (Hyperthreading, SMT) • How do the processors coordinate their work? • Shared memory/message passing • Interconnection network (at least one!) • Synchronization primitives • Many combinations/variations • What’s the best way to put these pieces together? • What do you want to run? • How fast do you have to run it? • How much can you spend? • How much energy can you use?

  11. Moore’s Law: Transistor Counts

  12. Feature Sizes

  13. Costs: Apple

  14. Costs of Consumer Electronics Today

  15. History
  • Pascal adding machine, 1642
  • Leibniz adder/multiplier, ~1670
  • Babbage analytical engine, 1837 (punch cards, memory, printer!)
  • Hollerith punch cards, 1890 (used for US census data)
  • Aiken digital computer, 1940s (Harvard)
  • Von Neumann stored-program computer, 1945
  • Eckert/Mauchly ENIAC GP computer, 1946

  16. Evolution of Electronic Computers • Vacuum tubes replaced by transistors, late 1950s • Smaller, faster, more versatile logic elements • Lower power • Longer lifetime • Integrated Circuits, late 1960s • Many transistors fabricated on silicon substrate • Wires plated in place • Lower price • Smaller size • Lower failure rate • LSI/VLSI/microprocessors, 1970s • 1000s of interconnected transistors etched into silicon • Could check 8 switches at once → 8-bit “byte”

  17. History of Supercomputers • IBM 7030 Stretch, 1961 • 2K sq. ft. • Fastest computer in world at time • Slower than expected! • Cost initially $13M, dropped to $8.5M • Instruction pipelining, prefetching/decoding, memory interleaving • CDC 6600, 1964 • Size ~= 4 filing cabinets • Cost $8M ($60M today) • 40 MHz, 3 MFLOPS at peak • Freon cooled • CPU == 10 FUs, multiple PCBs • 60-bit words/regs

  18. History of Supercomputers (2) • Cray 1, 1976 • 64-bit words • 80 MHz • 136 MFLOPS! • Speed-critical parts placed inside • 1662 PCBs w/ 144 ICs • 80 sold in 10 years • $5-8M ($25M now)

  19. History of Supercomputers (3) • Cray XMP, 1982 • Up to 4 CPUs in 1 chassis • Up to 16M 64-bit words (128 MB, all SRAM!) • Up to 32 1.2GB disks • 105 MHz • Up to 800 MFLOPS (200/CPU) • Double memory bandwidth wrt Cray 1 • Cray 2, 1985 • Again ICs packed on logic boards • Again, horseshoe shape • Boards packed tightly — submerged in Fluorinert to cool (see http://archive.computerhistory.org/resources/text/Cray/Cray.Cray2.1985.102646185.pdf) • Up to 8 CPUs, 1.9 GFLOPS • Mainstream software/Unix System V OS

  20. History of Supercomputers (4) • Intel Paragon, 1989 • i860-based • 32- or 64-bit • Up to 4K CPUs • 2D MIMD topology • Poor memory bandwidth utilization • ASCI Red, 1996 • First to use off-the-shelf CPUs (Pentium Pros, Xeons) • 6K CPUs • Broke 1 TFLOP barrier • Cost $46M ($67M now) • Upgrade had 9298 Xeons for 3.1 TFLOPS • Over 1 MW power!

  21. History of Supercomputers (5) • Hitachi SR2201, 1996 • H-shaped chassis • 2048 CPUs • 600 GFLOPS peak • Other similar machines (many Japanese) • 100s of CPUs • 2D or 3D networks (e.g., Cray torus) • MIMD • Seymour Cray leaves Cray Research • Cray Computer Corp (CCC) • Cray 3: first gallium arsenide chips • Cray 4 failed → bankruptcy • SRC Computers (see http://www.srccomp.com/about/aboutus.asp)

  22. Biggest Machine Today Sequoia, an IBM BlueGene/Q machine at the U.S. Dept. of Energy Lawrence Livermore National Lab

  23. Types of Parallelism • Instruction-Level Parallelism (ILP) • Superscalar issue • Out-of-order execution • Very Long Instruction Word (VLIW) • Thread-Level Parallelism (TLP) • Loop-level • Multithreading • Explicit • Speculative • Simultaneous/Hyperthreading • Task-Level Parallelism • Program-Level Parallelism • Data-Level Parallelism

  24. Parallelism in Sequential Programs
  for i = 0 to N-1
    a[(i+1) mod N] := b[i] + c[i];
  for i = 0 to N-1
    d[i] := C*a[i];

  Data dependencies between the loops:
  Iteration:     0     1     ...  N-1
  Loop 1 writes  a[1]  a[2]  ...  a[0]
  Loop 2 reads   a[0]  a[1]  ...  a[N-1]

  • Programming model: C (sequential)
  • Architecture: superscalar
  • ILP
  • Communication through registers
  • Synchronization through pipeline interlocks
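
A minimal C sketch of the two loops on this slide, with the cross-iteration dependence called out in comments (the array size, the initial values of b and c, and the printout are made up for illustration):

  #include <stdio.h>

  #define N 8
  #define C 2.0

  int main(void) {
      double a[N], b[N], c[N], d[N];
      for (int i = 0; i < N; i++) { a[i] = 0.0; b[i] = i; c[i] = 1.0; }

      /* Loop 1: iteration i writes a[(i+1) mod N], so iteration N-1 wraps
         around and writes a[0]. */
      for (int i = 0; i < N; i++)
          a[(i + 1) % N] = b[i] + c[i];

      /* Loop 2: reads every a[i]; element a[i] is not ready until the
         corresponding write in loop 1 has run, which is the data
         dependence shown in the table above. */
      for (int i = 0; i < N; i++)
          d[i] = C * a[i];

      for (int i = 0; i < N; i++) printf("d[%d] = %.1f\n", i, d[i]);
      return 0;
  }

A superscalar core exploits only the ILP within and across these iterations; the slides that follow parallelize the same loops explicitly.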

  25. Parallel Programming Models • Extend semantics to express • Units of parallelism • Instructions • Threads • Programs • Communication and coordination between units via • Registers • Memory • I/O

  26. Model vs. Architecture
  [Layer diagram: parallel applications (CAD, databases, scientific modeling, multiprogramming) → programming models (shared address space, message passing, data parallel) → compiler or library → communication abstraction (user/system boundary) → operating system support → (hardware/software boundary) → communication hardware → physical communication medium]
  • Communication abstraction supports the model
  • Communication architecture (ISA + comm/sync) implements part of the model
  • HW/SW boundary defines which parts of the comm architecture are implemented in hardware and which in software

  27. Shared Address Space Model
  [Figure: P processors (P P P) connected to a shared Memory]
  for_all i = 0 to P-1
    for j = i0[i] to in[i]
      a[(j+1) mod N] := b[j] + c[j];
  barrier;
  for_all i = 0 to P-1
    for j = i0[i] to in[i]
      d[j] := C*a[j];
  • Communication abstraction supported by HW/SW interface
  • TLP
  • Communication/coordination among threads via shared global address space
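
A shared-address-space sketch of the same computation, written here in C with OpenMP as one possible realization (the slide's explicit block partition i0[i]..in[i] is replaced by OpenMP's default loop scheduling; N and the initialization are illustrative):

  #include <stdio.h>

  #define N 1024
  #define C 2.0

  int main(void) {
      static double a[N], b[N], c[N], d[N];   /* one shared address space */
      for (int i = 0; i < N; i++) { b[i] = i; c[i] = 1.0; }

      /* Each iteration writes a distinct element a[(i+1) mod N], so the
         iterations can be divided freely among threads. */
      #pragma omp parallel for
      for (int i = 0; i < N; i++)
          a[(i + 1) % N] = b[i] + c[i];
      /* The implicit barrier at the end of the parallel loop guarantees all
         of a[] is written before any thread starts loop 2 (the slide's
         "barrier"). */

      #pragma omp parallel for
      for (int i = 0; i < N; i++)
          d[i] = C * a[i];

      printf("d[0] = %.1f\n", d[0]);
      return 0;
  }

Threads communicate simply by reading and writing the shared a[]; no data is copied. Compile with, e.g., gcc -fopenmp.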

  28. Message Passing Model
  for_all i = 0 to P-1
    for j = i0[i] to in[i]
      index = (j+1) mod N;
      a[index] := b[j] + c[j];
      if j = in[i] then
        send(a[index], (j+1) mod P, a[j]);
  end_for
  barrier;
  for_all i = 0 to P-1
    for j = i0[i] to in[i]
      if j = i0[i] then
        recv(tmp, (P+j-1) mod P, a[j]);
      d[j] := C * tmp;
  end_for
  • Process-level parallelism (separate addr spaces)
  • Communication/coordination via messages
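
A message-passing sketch of the same scheme, written here with MPI in C as one possible realization (the block arithmetic, the assumption that N is divisible by the number of ranks, and the use of MPI_Sendrecv instead of separate send/recv calls are mine, not the slide's):

  #include <stdio.h>
  #include <mpi.h>

  #define N 1024
  #define C 2.0

  int main(int argc, char **argv) {
      MPI_Init(&argc, &argv);
      int rank, P;
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &P);

      int chunk = N / P;                    /* assumes N % P == 0 */
      int lo = rank * chunk, hi = lo + chunk - 1;

      static double a[N], b[N], c[N], d[N]; /* private to each process */
      for (int j = 0; j < N; j++) { b[j] = j; c[j] = 1.0; }

      double boundary = 0.0;                /* element owned by the next rank */
      for (int j = lo; j <= hi; j++) {
          double val = b[j] + c[j];
          if (j == hi) boundary = val;      /* (hi+1) mod N is the next rank's lo */
          else         a[j + 1] = val;      /* stays in this rank's block */
      }

      /* Exchange boundary elements instead of sharing memory: send ours to
         the next rank, receive a[lo] from the previous rank. */
      MPI_Sendrecv(&boundary, 1, MPI_DOUBLE, (rank + 1) % P, 0,
                   &a[lo],    1, MPI_DOUBLE, (rank + P - 1) % P, 0,
                   MPI_COMM_WORLD, MPI_STATUS_IGNORE);

      for (int j = lo; j <= hi; j++)
          d[j] = C * a[j];

      if (rank == 0) printf("d[%d] = %.1f\n", lo, d[lo]);
      MPI_Finalize();
      return 0;
  }

Using the combined MPI_Sendrecv avoids having to reason about the ordering of separate blocking send and recv calls. Build with mpicc and run with mpirun -np <P>.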

  29. Data Parallelism (SIMD)
  parallel (i: 0 -> N-1)
    a[(i+1) mod N] := b[i] + c[i];
  parallel (i: 0 -> N-1)
    d[i] := C * a[i];
  • Programming model
  • Operations done in parallel on multiple data elements
  • Single thread of control
  • Architectural model
  • Array of simple, cheap processors w/ little memory
  • Attached to control proc that issues instructions
  • Specialized + general comm, cheap sync
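
A data-parallel sketch of the same two loops in C, using an OpenMP simd directive as a stand-in for the slide's array-processor model: a single thread of control, with the per-element work done by vector lanes (N and the initialization are illustrative; the pragma is a hint, and the loops are also correct if it is ignored):

  #include <stdio.h>

  #define N 1024
  #define C 2.0f

  int main(void) {
      static float a[N], b[N], c[N], d[N];
      for (int i = 0; i < N; i++) { b[i] = i; c[i] = 1.0f; }

      /* One logical operation applied to all elements. */
      #pragma omp simd
      for (int i = 0; i < N; i++)
          a[(i + 1) % N] = b[i] + c[i];

      #pragma omp simd
      for (int i = 0; i < N; i++)
          d[i] = C * a[i];

      printf("d[1] = %.1f\n", d[1]);
      return 0;
  }

Compile with, e.g., gcc -fopenmp-simd.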

  30. Coarser-Grain Data Parallelism
  • Single-Program Multiple-Data (SPMD)
  • More broadly applicable than SIMD

  31. Creating a Parallel Program • Identify work that can be done in parallel • Computation • Data access • I/O • Partition work/data among entities • Processes • Threads • Manage data access, comm, sync
  Speedup(P) = Performance(P)/Performance(1) = Time(1)/Time(P)
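
A worked example with hypothetical numbers: if the sequential program takes Time(1) = 120 s and the 8-process version takes Time(8) = 20 s, then Speedup(8) = 120/20 = 6, i.e., a parallel efficiency of 6/8 = 75%.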

  32. Steps • Decomposition • Assignment • Orchestration • Mapping • Can be done by • Programmer • Compiler • Runtime • Hardware (speculatively)

  33. Parallelization
  [Figure: Sequential computation → (Decomposition) → Tasks → (Assignment) → Processes P0 P1 P2 P3 → (Orchestration) → Parallel program → (Mapping) → Processors. Decomposition and assignment are architecture independent; orchestration and mapping are architecture dependent.]

  34. Concepts • Task • Arbitrary piece of work from computation • Sequentially executed • Could be fine- or coarse-grained • Process (or thread) • What gets executed by a core • Abstract entity that performs tasks assigned to it • Processes comm & sync to perform tasks • Processor (core) • Physical engine on which processes run • Virtualized machine view for programmer

  35. Decomposition • Purpose: Break up computation into tasks to be divided among processes • Tasks may become available dynamically • Number of available tasks may vary with time • i.e., identify concurrency and decide level at which to exploit it • Goal: keep processes busy, but keep management reasonable • Number of tasks creates upper bound on speedup • Too many tasks requires too much coordination

  36. Assignment • Specify mechanism to divide work among processes • Strive for balance • Reduce communication, management • Structured approach recommended • Inspect code • Apply well known heuristics • Programmer focuses on decomp/assign 1st • Largely independent of architecture/programming model • Choice of primitives (cost/complexity) affects decisions • Architects assume program(mer) does decent job

  37. Orchestration • Purpose • Name data, structure comm/sync • Organize data structures, schedule tasks (temporally) • Goals • Reduce costs of comm/sync from processor POV • Improve data locality • Reduce overhead of managing parallelism • Choices depend heavily on comm abstraction, efficiency of primitives • Architects must provide appropriate, efficient primitives

  38. Mapping • Two aspects • Which processes to run on same processor • Which process runs on which processor • One extreme: space-sharing • Partition machine s.t. only 1 app at a time in a subset • Pin processes to cores, or let OS balance workloads (see the affinity sketch below) • Another extreme • Control complete resource management in OS • Use performance techniques for dynamic balancing • Real world is between the two • User specifies desires in some aspects • System may ignore them
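
A hypothetical, Linux-only sketch of the "pin processes to cores" option, using sched_setaffinity to bind the calling process to one core (the helper name pin_to_core and the choice of core 0 are mine, for illustration; a real mapper would pick cores per process):

  #define _GNU_SOURCE
  #include <sched.h>
  #include <stdio.h>
  #include <unistd.h>

  /* Restrict the calling process to a single core; returns 0 on success. */
  static int pin_to_core(int core) {
      cpu_set_t set;
      CPU_ZERO(&set);
      CPU_SET(core, &set);
      return sched_setaffinity(0 /* this process */, sizeof(set), &set);
  }

  int main(void) {
      long ncores = sysconf(_SC_NPROCESSORS_ONLN);
      int core = 0;                     /* e.g., chosen by a user-level mapper */

      if (pin_to_core(core) != 0) {
          perror("sched_setaffinity");  /* the OS may refuse; fall back to its scheduler */
          return 1;
      }
      printf("pinned to core %d of %ld; the OS will not migrate this process\n",
             core, ncores);
      /* ... run this process's assigned tasks here ... */
      return 0;
  }

The other extreme on the slide corresponds to skipping the call entirely and letting the OS place and migrate processes dynamically.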

  39. High-Level Goals • High performance • Low resource usage • Low development effort • Low power consumption • Implications for algorithm designers and architects • Algorithm designers: high-performance, low resource needs • Architects: high-performance, low cost, reduced programming effort

  40. Costs of Parallel Machines Things to keep in mind when designing a machine . . .
  • What does it cost to design the mechanism?
  • What does it cost to verify?
  • What does it cost to manufacture?
  • What does it cost to test?
  • What does it cost to program it?
  • What does it cost to deploy (turn on)?
  • What does it cost to keep it running? (power costs, maintenance)
  • What does it cost to use it?
  • What does it cost to dispose of it at the end of its lifetime? (how long is a "lifetime"?)
