
Application Scalability and High Productivity Computing


Presentation Transcript


  1. Application Scalability and High Productivity Computing • Nicholas J. Wright, John Shalf, Harvey Wasserman • Advanced Technologies Group, NERSC/LBNL

  2. NERSC: the National Energy Research Scientific Computing Center • Mission: accelerate the pace of scientific discovery by providing high performance computing, information, data, and communications services for all DOE Office of Science (SC) research • The production computing facility for DOE SC • Part of the Berkeley Lab Computing Sciences Directorate, alongside the Computational Research Division (CRD) and ESnet

  3. NERSC is the Primary Computing Center for the DOE Office of Science • NERSC serves a large population: over 3,000 users, 400 projects, 500 codes • NERSC serves the DOE SC mission: allocated by DOE program managers; not limited to the largest-scale jobs; not open to non-DOE applications • Strategy: science first • Requirements workshops by office • Procurements based on science codes • Partnerships with vendors to meet science requirements

  4. NERSC Systems for Science
     • Large-Scale Computing Systems
       • Franklin (NERSC-5): Cray XT4; 9,532 compute nodes, 38,128 cores; ~25 Tflop/s on applications, 356 Tflop/s peak
       • Hopper (NERSC-6): Cray XE6; Phase 1: Cray XT5, 668 nodes, 5,344 cores; Phase 2: 1.25 Pflop/s peak (late-2010 delivery)
     • Clusters (140 Tflop/s total)
       • Carver: IBM iDataplex cluster
       • PDSF (HEP/NP): ~1K-core throughput cluster
       • Magellan: cloud testbed, IBM iDataplex cluster
       • GenePool (JGI): ~5K-core throughput cluster
     • NERSC Global Filesystem (NGF): uses IBM's GPFS; 1.5 PB capacity; 5.5 GB/s of bandwidth
     • Analytics: Euclid (512 GB shared memory); Dirac GPU testbed (48 nodes)
     • HPSS Archival Storage: 40 PB capacity; 4 tape libraries; 150 TB disk cache

  5. NERSC Roadmap • (figure: peak Teraflop/s over time against the Top500 trend) • Franklin (N5): 19 TF sustained, 101 TF peak • Franklin (N5) + QC: 36 TF sustained, 352 TF peak • Hopper (N6): >1 PF peak • NERSC-7: 10 PF peak • NERSC-8: 100 PF peak • NERSC-9: 1 EF peak • Users expect a 10x improvement in capability every 3-4 years • How do we ensure that users' performance follows this trend and their productivity is unaffected?

  6. Hardware Trends: The Multicore Era • Moore's Law continues unabated • Power constraints mean that core counts, not clock speeds, will double every 18 months • Memory capacity is not doubling at the same rate, so GB/core will decrease • Power is the leading design constraint • (figure courtesy of Kunle Olukotun, Lance Hammond, Herb Sutter, and Burton Smith)

  7. … and the power costs will still be staggering • From Peter Kogge, DARPA Exascale Study • $1M per megawatt per year, even with cheap power!
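As a rough sanity check of that figure (the ~$0.11/kWh industrial electricity rate below is an assumption of ours, not a number from the study):

\[
1\,\mathrm{MW} \times 8760\,\mathrm{h/yr} = 8.76 \times 10^{6}\,\mathrm{kWh/yr}, \qquad
8.76 \times 10^{6}\,\mathrm{kWh} \times \$0.11/\mathrm{kWh} \approx \$0.96\mathrm{M/yr}.
\]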

  8. Changing Notion of "System Balance" • If you pay 5% more to double the FPUs and get a 10% improvement, it's a win (despite lowering your % of peak performance) • If you pay 2x more for memory bandwidth (power or cost) and get 35% more performance, it's a net loss (even though % of peak looks better) • Real example: we could give up ALL of the flops to improve memory bandwidth by 20% on the 2018 system • We have a fixed budget, so sustained-to-peak flop rate is the wrong metric if flops are cheap • Balance means balancing your checkbook and balancing your power budget • Requires application co-design to make the right trade-offs (see the budget-arithmetic sketch below)
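A rough sketch, in C, of the budget arithmetic behind the first two bullets above. The metric (sustained performance per unit of total system cost) and the assumption that the quoted cost deltas apply to the whole budget are ours, for illustration only; the performance and cost percentages are the ones on the slide.

/* Sketch under illustrative assumptions: score design options by sustained
 * performance per unit of total system cost, since the budget is fixed.
 * Percent-of-peak is deliberately ignored, which is the slide's point.   */
#include <stdio.h>

static double perf_per_cost(double relative_perf, double relative_cost)
{
    return relative_perf / relative_cost;
}

int main(void)
{
    double baseline   = perf_per_cost(1.00, 1.00);
    double double_fpu = perf_per_cost(1.10, 1.05);  /* +5% cost, +10% perf */
    double double_bw  = perf_per_cost(1.35, 2.00);  /* 2x cost,  +35% perf */

    printf("baseline        : %.3f\n", baseline);           /* 1.000  */
    printf("double the FPUs : %.3f (win)\n", double_fpu);   /* ~1.048 */
    printf("2x memory BW    : %.3f (loss)\n", double_bw);   /* 0.675  */
    return 0;
}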

  9. Summary: Technology Trends • Number of cores ↑ (flops will be "free") • Memory capacity per core ↓ • Memory bandwidth per core ↓ • Network bandwidth per core ↓ • I/O bandwidth ↓

  10. Navigating Technology Phase Transitions • (figure: the roadmap of slide 5, peak Teraflop/s over time against the Top500 trend, annotated with programming-model eras) • COTS/MPP + MPI → COTS/MPP + MPI (+ OpenMP) → GPU (CUDA/OpenCL) or Manycore (BG/Q, R) → Exascale + ???

  11. Application Scalability • How can a user continue to be productive in the face of these disruptive technology trends?

  12. Source of Workload Information • Documents: 2005 DOE Greenbook; 2006-2010 NERSC Plan; LCF studies and reports; workshop reports; 2008 NERSC assessment • Allocations analysis • User discussion

  13. New Model for Collecting Requirements • Joint DOE Program Office / NERSC workshops • Modeled after the ESnet method • Two workshops per year • Describe science-based needs over 3-5 years • Case-study narratives • First workshop: BER, May 7-8

  14. Numerical Methods at NERSC (caveat: survey data from ERCAP requests)

  15. Application Trends • Weak scaling: time to solution is often a non-linear function of problem size • Strong scaling: latency or the serial fraction will get you in the end (see the sketch below) • Add features to models: "new" weak scaling • (figures: performance vs. number of processors for the weak- and strong-scaling cases)
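"Latency or the serial fraction will get you in the end" is Amdahl's law. A minimal sketch contrasting strong scaling (Amdahl: fixed problem size) with weak scaling (Gustafson: problem grows with the machine); the serial fraction s = 1% is an arbitrary value chosen for illustration, not a NERSC measurement.

/* Sketch: speedup under strong scaling (Amdahl) vs. weak scaling (Gustafson)
 * for a serial fraction s. The value of s is illustrative only.            */
#include <stdio.h>

int main(void)
{
    const double s = 0.01;                          /* serial fraction      */
    const int procs[] = {1, 64, 1024, 16384};

    for (int i = 0; i < 4; i++) {
        int p = procs[i];
        double strong = 1.0 / (s + (1.0 - s) / p);  /* fixed problem size   */
        double weak   = s + (1.0 - s) * p;          /* problem grows with p */
        printf("p = %6d   strong = %8.1f   weak = %10.1f\n", p, strong, weak);
    }
    return 0;   /* strong-scaling speedup saturates near 1/s = 100 */
}

No matter how many processors are added, the strong-scaling speedup never exceeds 1/s, which is why adding features to models ("new" weak scaling) is an attractive way to keep larger machines productive.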

  16. Develop Best Practices in Multicore Programming • The NERSC/Cray Programming Models "Center of Excellence" combines: • LBNL strength in languages, tuning, and performance analysis • Cray strength in languages, compilers, and benchmarking • Goals: • Immediate: training material for Hopper users on hybrid OpenMP/MPI • Long term: input into the exascale programming model • (figure legend: OpenMP thread parallelism)

  17. Develop Best Practices in Multicore Programming • Conclusions so far: • Mixed OpenMP/MPI saves significant memory • Running-time impact varies with the application • 1 MPI process per socket is often good (see the skeleton below) • To run on Hopper next: 12 vs. 6 cores per socket, and Gemini vs. Seastar • (figure legend: OpenMP thread parallelism)
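A minimal hybrid MPI + OpenMP skeleton in the spirit of the training material described above, illustrating the "one MPI process per socket, OpenMP threads inside it" layout the conclusions recommend. The reduction it performs is a placeholder for real work, and none of the specifics are taken from the actual NERSC material; rank and thread placement would come from OMP_NUM_THREADS and the job launcher.

/* Minimal hybrid MPI + OpenMP skeleton: intended to be launched with one
 * MPI rank per socket and OMP_NUM_THREADS set to the cores per socket.  */
#include <mpi.h>
#include <omp.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int provided, rank, nranks;

    /* FUNNELED: only the thread that called MPI_Init_thread makes MPI calls. */
    MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nranks);

    double local = 0.0;
    #pragma omp parallel reduction(+:local)
    {
        /* Each OpenMP thread works on its share of this rank's data;
         * the arithmetic here is just a placeholder for real work.   */
        local += omp_get_thread_num() + 1.0;
    }

    double global = 0.0;
    MPI_Reduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("ranks = %d, threads per rank = %d, sum = %g\n",
               nranks, omp_get_max_threads(), global);

    MPI_Finalize();
    return 0;
}

Running one rank per socket keeps a single copy of MPI's per-process state (buffers, connection tables, replicated application data) per socket rather than per core, which is where the memory savings noted above come from.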

  18. Co-Design: Eating our own dogfood

  19. Inserting Scientific Apps into the Hardware Development Process • Research Accelerator for Multi-Processors (RAMP) • Simulate hardware before it is built! • Break the slow feedback loop for system designs • Enables tightly coupled hardware/software/science co-design (not possible using the conventional approach)

  20. Summary • Disruptive technology changes are coming • By exploring new programming models (and revisiting old ones) and hardware/software co-design, we hope to ensure that scientists' productivity remains high!
