Stampede Update

Presentation Transcript


  1. Stampede Update. Jay Boisseau, CASC Meeting, October 4, 2012

  2. TACC’s Stampede: The Big Picture
     • Dell, Intel, and Mellanox are vendor partners
     • Almost 10 PF peak performance in the initial system (2013)
     • 2+ PF of Intel Xeon E5 (several thousand nodes)
     • 7+ PF of Intel Xeon Phi (MIC) coprocessors (1st in the world!)
     • 14+ PB disk, 150+ GB/s I/O bandwidth
     • 250+ TB RAM
     • 56 Gb/s Mellanox FDR InfiniBand interconnect
     • 16 x 1 TB large shared-memory nodes
     • 128 NVIDIA Kepler K20 GPUs for remote visualization
     • Software for high-throughput computing

  3. What We Like About MIC
     • Intel® MIC is based on x86 technology: x86 cores with caches and cache coherency, plus a SIMD instruction set
     • Programming for MIC is similar to programming for CPUs
       • Familiar languages: C/C++ and Fortran
       • Familiar parallel programming models: OpenMP & MPI
       • MPI on the host and on the coprocessor
       • Any code can run on MIC, not just kernels
     • Optimizing for MIC is similar to optimizing for CPUs: make use of existing knowledge!

  4. Coprocessor vs. Accelerator
     • Architecture: x86 with coherent caches vs. streaming processors with shared memory and caches
     • HPC programming model: extensions to C/C++/Fortran vs. CUDA/OpenCL
     • Threading/MPI: OpenMP and multithreading vs. threads in hardware; MPI on host and/or MIC vs. MPI on host only
     • Programming details: offloaded regions vs. kernels
     • Support for any code (serial, scripting, etc.): yes vs. no
     • Native mode: any code may be "offloaded" as a whole to the coprocessor

  5. Key Questions
     • Why do we need to use a thread-based programming model?
     • MPI programming: Where do we place the MPI tasks? How do we communicate between host and MIC?

  6. Programming MIC with Threads
     Key aspects of the MIC coprocessor:
     • A lot of local memory, but even more cores
     • 100+ threads or tasks on a MIC
     • Severe limitation of the available memory per task: a few GB of memory shared by ~100 tasks ➞ some tens of MB per MPI task
     • One MPI task per core? Probably not.
     • Threads (OpenMP, etc.) are a must on MIC (see the hybrid MPI + OpenMP sketch below)
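A minimal hybrid MPI + OpenMP sketch of the pattern this slide argues for: a few MPI tasks per coprocessor, each fanning out into many OpenMP threads so that per-task memory is shared rather than split across ~100 ranks. The rank and thread counts are left to the launcher and are illustrative assumptions, not Stampede defaults.

    /* Hybrid MPI + OpenMP sketch: few MPI ranks, many threads per rank.
     * Rank/thread counts are illustrative assumptions only. */
    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank, nranks;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nranks);

        /* Each MPI task fans out into OpenMP threads, so its memory is
         * shared by many threads instead of being divided among many ranks. */
        #pragma omp parallel
        {
            #pragma omp single
            printf("Rank %d of %d using %d threads\n",
                   rank, nranks, omp_get_num_threads());
        }

        MPI_Finalize();
        return 0;
    }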

  7. “Offloaded” Execution Model
     • MPI tasks execute on the host
     • Directives “offload” OpenMP code sections to the MIC
     • Communication between MPI tasks on hosts through MPI
     • Communication between host and coprocessor through “offload” semantics
     • Code modifications: “offload” directives inserted before OpenMP parallel regions
     • One executable (a.out) runs on host and coprocessor

  8. MIC Training: Offloading (Base Program)
     Example program; the offload engine is a device. Objective: offload foo and do OpenMP on the device.

         #include <omp.h>
         #define N 10000

         void foo(double *, double *, double *, int);

         int main(){
             int i; double a[N], b[N], c[N];
             for(i=0;i<N;i++){ a[i]=i; b[i]=N-1-i; }
             foo(a,b,c,N);
         }

         void foo(double *a, double *b, double *c, int n){
             int i;
             for(i=0;i<n;i++){ c[i]=a[i]*2.0e0 + b[i]; }
         }

  9. MIC Training: Offloading (Function Offload)
     Requirements: direct the compiler to offload the function or block, “decorate” the function and its prototype, and use the usual OpenMP directives on the device (a concrete spelling of the placeholder pragmas is sketched below).

         #include <omp.h>
         #define N 10000

         #pragma omp <offload_function_spec>
         void foo(double *, double *, double *, int);

         int main(){
             int i; double a[N], b[N], c[N];
             for(i=0;i<N;i++){ a[i]=i; b[i]=N-1-i; }
             #pragma omp <offload_this>
             foo(a,b,c,N);
         }

         #pragma omp <offload_function_spec>
         void foo(double *a, double *b, double *c, int n){
             int i;
             #pragma omp parallel for
             for(i=0;i<n;i++){ c[i]=a[i]*2.0e0 + b[i]; }
         }
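The pragmas on this slide are placeholders. As one possible concrete spelling, Intel's offload pragmas of that era (built with the Intel compiler) looked roughly like the sketch below; the data-transfer clauses (in/out with length) are our assumption rather than something shown on the slide.

    /* One concrete spelling of the placeholders above, using Intel's
     * offload pragmas. The in/out length clauses are assumptions. */
    #include <omp.h>
    #define N 10000

    __attribute__((target(mic)))              /* "decorate" prototype for the device */
    void foo(double *, double *, double *, int);

    int main() {
        int i; double a[N], b[N], c[N];
        for (i = 0; i < N; i++) { a[i] = i; b[i] = N-1-i; }

        #pragma offload target(mic) in(a, b : length(N)) out(c : length(N))
        foo(a, b, c, N);                      /* offload this call to the coprocessor */
        return 0;
    }

    __attribute__((target(mic)))              /* "decorate" the definition as well */
    void foo(double *a, double *b, double *c, int n) {
        int i;
        #pragma omp parallel for              /* usual OpenMP on the device */
        for (i = 0; i < n; i++) { c[i] = a[i]*2.0e0 + b[i]; }
    }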

  10. Symmetric Execution Model
     • MPI tasks on host and coprocessor are equal (symmetric) members of the MPI communicator
     • Same code on the host processor and the MIC processor
     • Communication between any MPI tasks through regular MPI calls: Host ⬌ Host, Host ⬌ MIC, MIC ⬌ MIC (a minimal sketch follows)
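A minimal sketch of the symmetric model: the same MPI source runs on hosts and coprocessors, and every rank, wherever it lives, is an equal member of MPI_COMM_WORLD. The assumption that the program is built twice (a host binary and a MIC binary) and that the MPI launcher places ranks on both reflects typical usage, not details given on the slide.

    /* Symmetric-mode sketch: identical source runs as MPI ranks on hosts
     * and coprocessors; all ranks share one MPI_COMM_WORLD. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank, size, len;
        char host[MPI_MAX_PROCESSOR_NAME];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        MPI_Get_processor_name(host, &len);

        /* Host<->host, host<->MIC, and MIC<->MIC ranks all communicate
         * through the same ordinary MPI calls. */
        printf("Rank %d of %d running on %s\n", rank, size, host);

        MPI_Finalize();
        return 0;
    }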

  11. Getting Access
     • Production begins January 7
     • XSEDE allocation requests for January are being accepted now, through October 15
     • Early user access in November/December (informal; contact me)
     • Training in December (1 workshop) and January (2 workshops)

  12. Stampede and Data-Driven Science
     • Stampede is configured for both simulation-based and data-driven science
     • Stampede’s architecture and configuration offer diverse, fully integrated capabilities:
       • Intel Xeon has great memory bandwidth and handles ‘complex’ code well
       • Intel Xeon Phi has 50+ cores and can process lots of data concurrently
       • Big file systems, high I/O bandwidth, HTC scheduling, large shared-memory nodes, vis nodes, data software, data apps support…

  13. Stampede and Data-Driven Science
     • Complementary storage resources:
       • Corral: data management systems (iRODS, DBs, etc.), 10+ PB and growing in 4Q12
       • Global file system (1Q13): more persistent storage, 20+ PB and expandable
       • Ranch: archival system, 50 PB growing to 100 PB

  14. Requests from You
     • Do you have data-driven science problems that you want to run on Stampede? These can help us with configuration, policies, training, docs, etc.
     • Do you have any suggestions for such policies, software, support, etc. for data-driven science?

  15. New Data-Focused Computing System
     • TACC will deploy a new system with a different configuration than Stampede: lots of local disk and memory per node, different usage policies and software set
     • Still collecting requirements
     • Also connected to all of the data storage resources on the previous slide
     • Pilot in 4Q12 or 1Q13, full system in 3Q13
