
Emerging Challenges & Opportunities in Parallel Computing: The Cretaceous Redux?

Bruce Hendrickson, Senior Manager for Math & Computer Science, Sandia National Laboratories, Albuquerque, NM; University of New Mexico, Computer Science Dept.


Presentation Transcript


  1. Emerging Challenges & Opportunities in Parallel Computing: The Cretaceous Redux? Bruce Hendrickson, Senior Manager for Math & Computer Science, Sandia National Laboratories, Albuquerque, NM; University of New Mexico, Computer Science Dept.

  2. The Relationship Between Theory & Practice in Parallel Computing: Plus a Silly Metaphor. Bruce Hendrickson, Senior Manager for Math & Computer Science, Sandia National Laboratories, Albuquerque, NM; University of New Mexico, Computer Science Dept.

  3. Outline • Theory and practice in parallel computing are estranged • Emerging applications will challenge the status quo • Architectural changes will add further disruption • These forces will create rich opportunities for the theory community

  4. Parallel Computing Theory is Robust • Theoretical foundations • P-Completeness [Cook’73] • Boolean Circuits [Borodin’77] • PRAMs [Fortune and Wyllie’78] • NC and P-Completeness [Pippenger/Cook’79] • Technology-informed theoretical models • Fixed interconnection machines, e.g. hypercubes [many] • LOGP [Culler, et al.’93] • Bulk Synchronous Parallel [Gerbessiotis & Valiant’92] • “Practical” ideas with strong theoretical underpinnings • PGAS Languages [several] • CILK [Leiserson’s group’95] • 21 years of SPAA, 28 years of PODC, etc…

  5. Sandia is a Leader in Parallel Computing [Timeline figure, 1987 to 2009, designed by Rolf Riesen, July 2005] • Machines shown: iPSC-860, Red Storm, ASCI Red, CM-2, nCUBE-2, Paragon, Cplant • R&D 100 awards: Dense Solvers, Xyce, Allocator, Storage, Parallel Software, Signal Processing, Catamount, Salvo, Trilinos, Meshing, Aztec • Gordon Bell Prizes (three) • Patents: Meshing, Data Mining, Partitioning, Parallel Software, Paving • World records: 143 GFlops, 281 GFlops, Teraflops • Other: Karp Challenge, Mannheim SuParCup, SC96 Gold Medal (Networking), Fernbach Award

  6. Theory at Sandia • Sandia designs, procures, programs, runs & treasures big parallel computers • Sandia has at least 200 PhDs working on parallel computing • Mostly physics & engineering degrees • But many computer scientists as well • Very few of these practitioners could define a PRAM • Let alone explain NC! • None use CILK or UPC • What’s wrong with this picture!?

  7. Elements of Parallel Computing Practice • Clusters • “Killer micros” enable commodity-based parallel computing • Attractive price and price/performance • Stable model for algorithms & software • MPI • Portable and stable programming model and language • Allowed for huge investment in software • Bulk-Synchronous Parallel Programming (BSP) • Basic approach to almost all successful MPI programs • Compute locally; communicate; repeat • Excellent match for clusters+MPI • Good fit for many scientific applications • Algorithms • Stability of the above allows for sustained algorithmic research
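The "compute locally; communicate; repeat" pattern from slide 7 can be sketched without any MPI installation. The following is a minimal single-process simulation, not an MPI program: the function names, the chunked 1D layout, and the Jacobi smoothing kernel are all illustrative assumptions, chosen only because a halo exchange is a familiar instance of a bulk-synchronous superstep.

```python
def serial_jacobi(values, steps, left=0.0, right=0.0):
    """Reference: `steps` sweeps of 1D Jacobi smoothing on one array."""
    values = list(values)
    for _ in range(steps):
        padded = [left] + values + [right]
        values = [(padded[i - 1] + padded[i + 1]) / 2.0
                  for i in range(1, len(padded) - 1)]
    return values


def bsp_jacobi(chunks, steps, left=0.0, right=0.0):
    """Simulate the same sweeps as bulk-synchronous supersteps.

    chunks: one non-empty list per simulated process (hypothetical layout).
    Each superstep exchanges halo values, then updates every chunk locally.
    """
    chunks = [list(c) for c in chunks]
    p = len(chunks)
    for _ in range(steps):
        # Communicate: each process receives one boundary value per neighbor.
        halos = []
        for r in range(p):
            lo = chunks[r - 1][-1] if r > 0 else left
            hi = chunks[r + 1][0] if r < p - 1 else right
            halos.append((lo, hi))
        # Implicit barrier: no process computes until all halos are in place.
        # Compute locally: each process touches only its own chunk and halos.
        updated = []
        for r in range(p):
            lo, hi = halos[r]
            padded = [lo] + chunks[r] + [hi]
            updated.append([(padded[i - 1] + padded[i + 1]) / 2.0
                            for i in range(1, len(padded) - 1)])
        chunks = updated
    return chunks
```

Because communication is confined to one bulk phase per superstep, the partitioned run reproduces the serial result exactly. That regularity is what makes the pattern a good fit for clusters plus MPI, and it is what the irregular applications on later slides lack.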

  8. A Virtuous Circle… …or a vicious noose? [Cycle diagram] • Architectures: Commodity Clusters • Programming Models: Explicit Message Passing • Software: MPI • Algorithms: Bulk Synchronous Parallel

  9. PETSc Trilinos MPI Applications LAMMPS Linpack

  10. CILK PRAM LOGP UPC

  11. Existing Applications Are Evolving • Leading edge scientific applications increasingly include: • Adaptive, unstructured data structures • Complex, multiphysics simulations • Multiscale computations in space and time • Complex synchronizations (e.g. discrete events) • These raise significant parallelization challenges • Limited by memory, not processor performance • Unsolved micro-load balancing problems • Finite degree of coarse-grained parallelism • Bulk synchronous parallel not always appropriate • These changes will stress existing approaches to parallelism

  12. New Applications are Emerging: E.g. Network Science • Graphs are ideal for representing entities and relationships • Rapidly growing use in biological, social, environmental, and other sciences • The way it was: Zachary’s karate club (|V|=34) • The way it is now: Twitter social network (|V|≈200M)

  13. Emerging New Scientific Questions • New algorithms • Community detection, centrality, graph generation, etc. • Right set of questions and concepts still unknown. • Statistics, machine learning, anomaly detection, etc. • New issues • Noisy, error-filled data. What can we conclude robustly? • Temporal evolution of networks. • New science • Social dynamics and ties to technology & media • Large economic, social, political consequences • Parallel computing needed for big data and/or fast response

  14. Computational Challenges for Network Science • Minimal computation to hide access time • Runtime is dominated by latency • Random accesses to global address space • Parallelism is very fine grained and dynamic • Access pattern is data dependent • Prefetching unlikely to help • Usually only want small part of cache line • Potentially abysmal locality at all levels of memory hierarchy • Many algorithms are not bulk synchronous • Approaches based on virtuous circle don’t work!
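The locality points on slide 14 ("usually only want small part of cache line", "potentially abysmal locality") can be made concrete with a small counting exercise. This is a sketch under stated assumptions: a line of 8 words (for example 64-byte lines holding 8-byte words) is an illustrative choice, the helper names are hypothetical, and the uniformly random index stream is only a stand-in for data-dependent graph accesses.

```python
import random


def cache_lines_touched(indices, words_per_line=8):
    """Count the distinct cache lines touched by a stream of word indices."""
    return len({i // words_per_line for i in indices})


def line_utilization(indices, words_per_line=8):
    """Fraction of the words in the touched lines that were actually requested."""
    requested = len(set(indices))
    fetched = cache_lines_touched(indices, words_per_line) * words_per_line
    return requested / fetched


n_words = 1 << 20   # size of the (assumed) address space, in words
k = 1024            # number of accesses in each stream

sequential = list(range(k))                              # streaming access
scattered = random.Random(0).sample(range(n_words), k)   # graph-like access

print("sequential:", cache_lines_touched(sequential), "lines,",
      round(line_utilization(sequential), 3), "utilization")
print("scattered: ", cache_lines_touched(scattered), "lines,",
      round(line_utilization(scattered), 3), "utilization")
```

The sequential stream uses every word of every line it fetches, while the scattered stream fetches close to one full line per requested word, so most of the memory bandwidth is spent on data the algorithm never reads.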

  15. Locality Challenges [Figure with labeled regions: "What we traditionally care about", "Emerging Codes", "What industry cares about"] From: Murphy and Kogge, On The Memory Access Patterns of Supercomputer Applications: Benchmark Selection and Its Implications, IEEE T. on Computers, July 2007

  16. A Renaissance in Architecture Research • Good news • Moore’s Law marches on • Real estate on a chip is essentially free • Major paradigm change – huge opportunity for innovation • Bad news • Power considerations limit the improvement in clock speed • Parallelism is only viable route to improve performance • Current response, multicore processors • Computation/Communication ratio will get worse • Makes life harder for applications • Long-term consequences unclear

  17. Example: AMD Opteron

  18. Example: AMD Opteron [Die diagram] • Region caption: Memory (Latency Avoidance) • Labeled blocks: L1 D-Cache, L2 Cache, L1 I-Cache

  19. Example: AMD Opteron [Die diagram] • Region captions: Memory (Latency Avoidance); Out-of-Order Exec, Load/Store, Mem/Coherency (Latency Tolerance) • Labeled blocks: Load/Store Unit, L1 D-Cache, L2 Cache, I-Fetch/Scan/Align, L1 I-Cache, Memory Controller

  20. Example: AMD Opteron [Die diagram] • Region captions: Memory (Latency Avoidance); Out-of-Order Exec, Load/Store, Mem/Coherency (Latency Tolerance); Memory and I/O Interfaces • Labeled blocks: Load/Store Unit, L1 D-Cache, L2 Cache, Bus, DDR, HT, I-Fetch/Scan/Align, L1 I-Cache, Memory Controller

  21. Example: AMD Opteron [Die diagram] • Region captions: Memory (Latency Avoidance); Out-of-Order Exec, Load/Store, Mem/Coherency (Latency Tolerance); Memory and I/O Interfaces; COMPUTER • Labeled blocks: FPU Execution, Int Execution, Load/Store Unit, L1 D-Cache, L2 Cache, Bus, DDR, HT, I-Fetch/Scan/Align, L1 I-Cache, Memory Controller • Thanks to Thomas Sterling

  22. Architectural Wish List for Graphs • Low latency / high bandwidth • For small messages! • Latency tolerant • Light-weight synchronization mechanisms for fine-grained parallelism • Global address space • No graph partitioning required • Avoid memory-consuming profusion of ghost-nodes • No local/global numbering conversions • One machine with these properties is the Tera MTA-2 • And its successor the Cray XMT
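To illustrate why slide 22 lists ghost nodes as a cost of partitioning, the sketch below counts them for a given vertex-to-partition assignment. The function name and the ring example are hypothetical; a "ghost" is taken here to mean a remotely owned vertex that is adjacent to at least one locally owned vertex, which is the usual meaning in distributed-memory graph codes.

```python
def count_ghosts(edges, owner):
    """Return {partition: number of ghost vertices it must store}.

    edges: iterable of (u, v) pairs of an undirected graph.
    owner: dict mapping each vertex to its partition id.
    """
    ghosts = {part: set() for part in set(owner.values())}
    for u, v in edges:
        if owner[u] != owner[v]:
            # Each side needs a local copy of the other side's endpoint.
            ghosts[owner[u]].add(v)
            ghosts[owner[v]].add(u)
    return {part: len(g) for part, g in ghosts.items()}


# A 16-vertex ring split over 4 partitions in two different ways.
n, parts = 16, 4
ring = [(i, (i + 1) % n) for i in range(n)]
contiguous = {v: v // (n // parts) for v in range(n)}   # blocks of 4 neighbors
round_robin = {v: v % parts for v in range(n)}          # neighbors never co-located

print("contiguous: ", count_ghosts(ring, contiguous))
print("round robin:", count_ghosts(ring, round_robin))
```

A ring has an obvious good partition, so the contiguous assignment keeps ghosts small. Graphs with poor locality rarely offer such a partition, and their ghost counts behave more like the round-robin case, where each partition stores more ghosts than owned vertices.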

  23. How Does the MTA/XMT Work? • Latency tolerance via massive multi-threading • Context switch every tick • Global address space, hashed to reduce hot-spots • No cache or local memory. • Multiple outstanding loads • Remote memory request doesn’t stall processor • Other streams work while your request gets fulfilled • Light-weight, word-level synchronization • Minimizes conflicts, enables parallelism • Flexible dynamic load balancing • Thread virtualization • Futures
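The idea of hiding latency with many threads, as described on slide 23, can be approximated by a back-of-the-envelope throughput model. This is an assumed textbook-style model, not a specification of the MTA or XMT: each stream alternates between `work_cycles` of issue and `latency_cycles` of waiting on memory, and the processor switches to another ready stream at no cost.

```python
import math


def processor_utilization(streams, work_cycles, latency_cycles):
    """Fraction of cycles doing useful work under the simple model above."""
    if streams < 1 or work_cycles <= 0 or latency_cycles < 0:
        raise ValueError("need streams >= 1, work_cycles > 0, latency_cycles >= 0")
    return min(1.0, streams * work_cycles / (work_cycles + latency_cycles))


def streams_to_hide_latency(work_cycles, latency_cycles):
    """Smallest number of streams that keeps the processor fully busy."""
    return math.ceil((work_cycles + latency_cycles) / work_cycles)


# Illustrative numbers only: 1 cycle of work per 99-cycle memory reference.
for streams in (1, 10, 100):
    print(streams, "streams ->", processor_utilization(streams, 1, 99))
print("streams needed:", streams_to_hide_latency(1, 99))
```

In this model a single stream leaves the processor almost entirely idle, and utilization grows linearly with the number of streams until the latency is fully covered. That is the sense in which a remote memory request "doesn't stall the processor" when enough other streams have work.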

  24. Case Study: Single Source Shortest Path [Plot: time (s) vs. number of processors, for PBGL SSSP and MTA SSSP] • Parallel Boost Graph Library (PBGL) • Lumsdaine, et al., on Opteron cluster • Some graph algorithms can scale on some inputs • PBGL - MTA Comparison on SSSP • Erdős-Rényi random graph (|V|=2^28) • PBGL SSSP can scale on non-power law graphs • Order of magnitude speed difference • 2 orders of magnitude efficiency difference • Big difference in power consumption • [Lumsdaine, Gregor, H., Berry, 2007]
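For readers unfamiliar with the problem in this case study, the following is a minimal sequential reference for single source shortest path. It is plain Dijkstra with a binary heap and is not the parallel algorithm benchmarked on either machine; the adjacency-list format and the function name are assumptions made for illustration.

```python
import heapq


def sssp(adj, source):
    """Shortest distances from `source` over non-negative edge weights.

    adj: dict mapping a vertex to a list of (neighbor, weight) pairs.
    Returns a dict of distances for reachable vertices only.
    """
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry; a shorter path to u was already found
        for v, w in adj.get(u, ()):
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist


graph = {
    "a": [("b", 1), ("c", 4)],
    "b": [("c", 2), ("d", 6)],
    "c": [("d", 3)],
}
print(sssp(graph, "a"))
```

Even in this small form the access pattern is data dependent: which vertex is visited next, and which adjacency lists are read, depends on distances computed so far. That property is what makes the problem hard to partition and a natural stress test for both architectures.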

  25. Disruptive Architectures Multicore New Apps

  26. What Happens Next? • Virtuous circle will not survive the coming disruptions • New programming models, languages, algorithms and abstractions will be needed • But MPI cannot die • Billions of dollars in investment in software • “I don’t know what the parallel programming language of the future will look like, but I know it will be called MPI” • Luckily, theory is forever …

  27. Rebuilding the Foundations • Applied parallel computing will need new ideas to continue moving forward • Ideas and tools from theory community can: • Provide abstractions to manage hardware complexity • Underlie robust algorithm development and analysis • Suggest new programming models and abstractions • Point towards new architectural features • Support efficient utilization of resources • Provide underpinnings for the future of applied parallel computing

  28. Conclusions • Applied parallel computing is facing unprecedented challenges • Multi-core processors • Disruptive architectural innovations • Demands of emerging applications • Theory can provide reliable light in the coming darkness • Theoretical insights are resilient to technology changes • Theory community will have new opportunities • Provide robust foundation for future progress • Become central to applied parallel computing • This is a great time to be doing parallel computing!

  29. Thanks • Cevdet Aykanat, Michael Bender, Jon Berry, Rob Bisseling, Erik Boman, Bill Carlson, Ümit Çatalürek, Edmond Chow, Karen Devine, Iain Duff, Danny Dunlavy, Alan Edelman, Jean-Loup Faulon, John Gilbert, Assefaw Gebremedhin, Mike Heath, Paul Hovland, Vitus Leung, Simon Kahan, Pat Knupp, Tammy Kolda, Gary Kumfert, Fredrik Manne, Michael Mahoney, Mike Merrill, Richard Murphy, Esmond Ng, Ali Pınar, Cindy Phillips, Steve Plimpton, Alex Pothen, Robert Preis, Padma Raghavan, Steve Reinhardt, Suzanne Rountree, Rob Schreiber, Viral Shah, Jonathan Shewchuk, Horst Simon, Dan Spielman, Shang-Hua Teng, Sivan Toledo, Keith Underwood, etc.
