
CPS 258 Announcements

Explore parallel architectures, levels of parallelism, memory organization, network topologies, programming modes, performance measures, and optimization techniques.


Presentation Transcript


  1. CPS 258 Announcements • http://www.cs.duke.edu/~nikos/cps258 • Lecture calendar with slides • Pointers to related material

  2. Parallel Architectures (continued)

  3. Parallelism Levels • Job • Program • Instruction • Bit

  4. Parallel Architectures • Pipelining • Multiple execution units • Superscalar • VLIW • Multiple processors

  5. Pipelining Example
     for i = 1:n
       z(i) = x(i) + y(i);
     end
     Prologue • Loop body • Epilogue
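  A hedged C sketch (my own code, not from the slides) of what the prologue / loop body / epilogue structure looks like once this loop is software-pipelined: the prologue fills the pipeline with the first loads, the steady-state body overlaps the loads for iteration i+1 with the add and store for iteration i, and the epilogue drains the last result.

     /* Illustrative only: the name vadd and the scheduling are assumptions. */
     void vadd(const double *x, const double *y, double *z, int n) {
         if (n <= 0) return;
         /* Prologue: issue the first loads to fill the pipeline. */
         double a = x[0], b = y[0];
         int i;
         /* Loop body (steady state): the loads for iteration i+1 are in
            flight while iteration i finishes its add and store. */
         for (i = 0; i < n - 1; ++i) {
             double a_next = x[i + 1];
             double b_next = y[i + 1];
             z[i] = a + b;
             a = a_next;
             b = b_next;
         }
         /* Epilogue: drain the pipeline with the last add and store. */
         z[n - 1] = a + b;
     }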

  6. Generic Computer • CPU • Memory • Bus

  7. Memory Organization • Distributed memory • Shared memory

  8. Shared Memory

  9. Distributed Memory

  10. Interleaved Memory

  11. Network Topologies • Ring • Torus • Tree • Star • Hypercube • Cross-bar
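  A small illustrative sketch (my addition, not from the slides) of why the hypercube is attractive: with 2^d nodes each node needs only d links, and its neighbors are found by flipping one address bit at a time.

     #include <stdio.h>

     /* Print the d neighbors of node id in a d-dimensional hypercube:
        neighbor k differs from id in exactly bit k. */
     void hypercube_neighbors(unsigned id, int d) {
         for (int k = 0; k < d; ++k)
             printf("node %u <-> node %u (dimension %d)\n",
                    id, id ^ (1u << k), k);
     }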

  12. Flynn’s Taxonomy • SISD • SIMD • MISD • MIMD

  13. Programming Modes • Data Parallel • Message Passing • Shared Memory • Multithreaded (control parallelism)
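  As a hedged illustration of two of these modes (my own code, not from the slides), here is the same reduction written for shared memory with OpenMP and for message passing with MPI; only standard calls are used, and the MPI version assumes MPI_Init/MPI_Finalize are handled by the caller.

     #include <omp.h>
     #include <mpi.h>

     /* Shared memory (OpenMP): all threads see the same array and split
        the iterations of one loop among themselves. */
     double sum_shared(const double *x, int n) {
         double s = 0.0;
         #pragma omp parallel for reduction(+:s)
         for (int i = 0; i < n; ++i)
             s += x[i];
         return s;
     }

     /* Message passing (MPI): each process owns only its slice of the
        data, and the partial sums are combined by an explicit message. */
     double sum_distributed(const double *local_x, int local_n) {
         double local = 0.0, global = 0.0;
         for (int i = 0; i < local_n; ++i)
             local += local_x[i];
         MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
         return global;   /* result is valid on rank 0 */
     }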

  14. Performance measures • FLOPS • Theoretical peak vs. actual • MFLOPS, GFLOPS, TFLOPS • Speedup(P) = Execution time in 1 proc / time in P procs • Benchmarks • LINPACK • LAPACK • SPEC (Standard Performance Evaluation Corporation)
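  As a reminder of how the theoretical figure is obtained (a standard formula, not stated on the slide):

     peak FLOPS = (number of cores) × (clock rate) × (floating-point operations per cycle per core)

  while the actual rate of a kernel is the number of floating-point operations it performs divided by its measured run time.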

  15. Speedup • Speedup(P) = Best execution time in 1 proc/time in P procs • Parallel Efficiency(P) = Speedup(P)/P

  16. Example Suppose a program runs in 10 sec and 80% of the time is spent in a subroutine F that can be perfectly parallelized. What is the best speedup I can achieve?
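  Worked out (my own arithmetic, anticipating the next slide): the sequential part takes 0.2 × 10 = 2 sec and the parallelizable part 8 sec, so with P processors

     T(P) = 2 + 8/P sec,   Speedup(P) = 10 / (2 + 8/P) < 10/2 = 5

  Even with unlimited processors the speedup cannot exceed 5.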

  17. Amdahl’s Law Speedup is limited by the fraction of the execution time that must be executed sequentially
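  In formula form (the standard statement of the law, with s the sequential fraction of the execution time):

     Speedup(P) = 1 / (s + (1 - s)/P) ≤ 1/s

  For the example above, s = 0.2 and the bound is 5.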

  18. “Secrets” to Success • Overlap communication with computation • Communicate minimally • Avoid synchronizations • T = t_comp + t_comm + t_sync

  19. Processors • CISC • Many and complex/multicycle instructions • Few registers • Direct access to memory • RISC • Few “orthogonal” instructions • Large register files • Access to memory only through L/S units

  20. Common μProcessors • Intel X86 • Advanced Micro Devices • Transmeta Crusoe • PowerPC • SPARC • MIPS

  21. Cache Memory Hierarchies • Memory speed improves much more slowly than processor speed • Memory Locality • Spatial • Temporal • Data Placement • Direct mapping • Set associative • Data Replacement

  22. Example • Matrix multiplication • As dot products • As sub-matrix products
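  A hedged sketch of the two formulations (illustrative code, not from the slides): the dot-product form computes each entry as an inner product and gets little cache reuse out of B, while the sub-matrix (blocked) form multiplies tiles that fit in cache. BS is an assumed block size and, to keep the sketch short, it is assumed to divide n.

     /* C = A*B for n x n row-major matrices, as n^2 dot products.
        B is walked down its columns, so spatial locality is poor. */
     void matmul_dot(int n, const double *A, const double *B, double *C) {
         for (int i = 0; i < n; ++i)
             for (int j = 0; j < n; ++j) {
                 double s = 0.0;
                 for (int k = 0; k < n; ++k)
                     s += A[i*n + k] * B[k*n + j];
                 C[i*n + j] = s;
             }
     }

     /* Same product as BS x BS sub-matrix (tile) products, so each tile
        of A, B and C is reused while it is resident in cache.
        Assumes BS divides n. */
     #define BS 64
     void matmul_blocked(int n, const double *A, const double *B, double *C) {
         for (int i = 0; i < n*n; ++i) C[i] = 0.0;
         for (int ii = 0; ii < n; ii += BS)
             for (int kk = 0; kk < n; kk += BS)
                 for (int jj = 0; jj < n; jj += BS)
                     for (int i = ii; i < ii + BS; ++i)
                         for (int k = kk; k < kk + BS; ++k) {
                             double a = A[i*n + k];
                             for (int j = jj; j < jj + BS; ++j)
                                 C[i*n + j] += a * B[k*n + j];
                         }
     }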

  23. Vector Architectures • Single Instruction Multiple Data • Exploit uniformity of operations • Multiple execution units • Pipelining • Hardware assisted loops • Vectorizing compilers

  24. Compiler techniques for vectorization • Scalar expansion • Statement reordering • Loop transformations: distribution, reordering, merging, splitting, skewing, unrolling, peeling, collapsing
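  A hedged before/after sketch of the first technique, scalar expansion (my own example and names, not from the slides): a scalar that is assigned in every iteration is expanded into an array, so no iteration writes a location another iteration reads and the loop can be vectorized.

     /* Before: the single scalar t is written and read by every iteration. */
     void smooth_before(int n, const double *a, const double *b, double *c) {
         double t;
         for (int i = 0; i < n; ++i) {
             t = a[i] + b[i];
             c[i] = t * t;
         }
     }

     /* After scalar expansion: t becomes the array t_vec, so the iterations
        are independent and can be executed in vector registers. */
     void smooth_after(int n, const double *a, const double *b,
                       double *c, double *t_vec) {
         for (int i = 0; i < n; ++i) {
             t_vec[i] = a[i] + b[i];
             c[i] = t_vec[i] * t_vec[i];
         }
     }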

  25. Epilogue • Distributed memory systems win • The memory hierarchy is critical to performance • Compilers do a good job at exploiting instruction-level parallelism, but programmers are still important • System modeling is still inadequate for tuning to optimal performance
