Stampede Update

Presentation Transcript


  1. Stampede Update. Jay Boisseau, CASC Meeting, October 4, 2012

  2. TACC’s Stampede: The Big Picture
     • Dell, Intel, and Mellanox are vendor partners
     • Almost 10 PF peak performance in the initial system (2013)
     • 2+ PF of Intel Xeon E5 (several thousand nodes)
     • 7+ PF of Intel Xeon Phi (MIC) coprocessors (1st in the world!)
     • 14+ PB disk, 150+ GB/s I/O bandwidth
     • 250+ TB RAM
     • 56 Gb/s Mellanox FDR InfiniBand interconnect
     • 16 x 1 TB large shared-memory nodes
     • 128 NVIDIA Kepler K20 GPUs for remote visualization
     • Software for high-throughput computing

  3. What We Like About MIC
     • Intel® MIC is based on x86 technology: x86 cores with caches and cache coherency, plus a SIMD instruction set
     • Programming for MIC is similar to programming for CPUs
       • Familiar languages: C/C++ and Fortran
       • Familiar parallel programming models: OpenMP & MPI
       • MPI on the host and on the coprocessor
       • Any code can run on MIC, not just kernels
     • Optimizing for MIC is similar to optimizing for CPUs: make use of existing knowledge!

  4. Coprocessor vs. Accelerator
     • Architecture: x86 with coherent caches vs. streaming processors with shared memory and caches
     • HPC programming model: extensions to C/C++/Fortran vs. CUDA/OpenCL
     • Threading/MPI: OpenMP and multithreading vs. threads in hardware; MPI on host and/or MIC vs. MPI on host only
     • Programming details: offloaded regions vs. kernels
     • Support for any code (serial, scripting, etc.): yes vs. no
     • Native mode: any code may be "offloaded" as a whole to the coprocessor

  5. Key Questions
     • Why do we need to use a thread-based programming model?
     • MPI programming: Where do we place the MPI tasks? How do we communicate between host and MIC?

  6. Programming MIC with Threads
     Key aspects of the MIC coprocessor:
     • A lot of local memory, but even more cores
     • 100+ threads or tasks on a MIC
     • Severe limitation of the available memory per task: a few GB of memory shared by ~100 tasks ➞ some tens of MB per MPI task
     • One MPI task per core? Probably not.
     • Threads (OpenMP, etc.) are a must on MIC (see the hybrid MPI + OpenMP sketch below)
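A minimal hybrid MPI + OpenMP sketch of the pattern this slide argues for: a few MPI tasks per coprocessor, each fanning out into many OpenMP threads so that per-task memory is shared rather than split across ~100 ranks. The rank and thread counts are left to the launcher and are illustrative assumptions, not Stampede defaults.

    /* Hybrid MPI + OpenMP sketch: few MPI ranks, many threads per rank.
     * Rank/thread counts are illustrative assumptions only. */
    #include <mpi.h>
    #include <omp.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank, nranks;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &nranks);

        /* Each MPI task fans out into OpenMP threads, so its memory is
         * shared by many threads instead of being divided among many ranks. */
        #pragma omp parallel
        {
            #pragma omp single
            printf("Rank %d of %d using %d threads\n",
                   rank, nranks, omp_get_num_threads());
        }

        MPI_Finalize();
        return 0;
    }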

  7. “Offloaded” Execution Model
     • MPI tasks execute on the host
     • Directives “offload” OpenMP code sections to the MIC
     • Communication between MPI tasks on hosts through MPI
     • Communication between host and coprocessor through “offload” semantics
     • Code modifications: “offload” directives inserted before OpenMP parallel regions
     • One executable (a.out) runs on host and coprocessor

  8. MIC Training: Offloading (Base Program)
     Example program; the offload engine is a device. Objective: offload foo and do OpenMP on the device.

         #include <omp.h>
         #define N 10000

         void foo(double *, double *, double *, int);

         int main(){
             int i; double a[N], b[N], c[N];
             for(i=0;i<N;i++){ a[i]=i; b[i]=N-1-i; }
             foo(a,b,c,N);
         }

         void foo(double *a, double *b, double *c, int n){
             int i;
             for(i=0;i<n;i++){ c[i]=a[i]*2.0e0 + b[i]; }
         }

  9. MIC Training: Offloading (Function Offload)
     Requirements: direct the compiler to offload the function or block, “decorate” the function and its prototype, and use the usual OpenMP directives on the device (a concrete spelling of the placeholder pragmas is sketched below).

         #include <omp.h>
         #define N 10000

         #pragma omp <offload_function_spec>
         void foo(double *, double *, double *, int);

         int main(){
             int i; double a[N], b[N], c[N];
             for(i=0;i<N;i++){ a[i]=i; b[i]=N-1-i; }
             #pragma omp <offload_this>
             foo(a,b,c,N);
         }

         #pragma omp <offload_function_spec>
         void foo(double *a, double *b, double *c, int n){
             int i;
             #pragma omp parallel for
             for(i=0;i<n;i++){ c[i]=a[i]*2.0e0 + b[i]; }
         }
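The pragmas on this slide are placeholders. As one possible concrete spelling, Intel's offload pragmas of that era (built with the Intel compiler) looked roughly like the sketch below; the data-transfer clauses (in/out with length) are our assumption rather than something shown on the slide.

    /* One concrete spelling of the placeholders above, using Intel's
     * offload pragmas. The in/out length clauses are assumptions. */
    #include <omp.h>
    #define N 10000

    __attribute__((target(mic)))              /* "decorate" prototype for the device */
    void foo(double *, double *, double *, int);

    int main() {
        int i; double a[N], b[N], c[N];
        for (i = 0; i < N; i++) { a[i] = i; b[i] = N-1-i; }

        #pragma offload target(mic) in(a, b : length(N)) out(c : length(N))
        foo(a, b, c, N);                      /* offload this call to the coprocessor */
        return 0;
    }

    __attribute__((target(mic)))              /* "decorate" the definition as well */
    void foo(double *a, double *b, double *c, int n) {
        int i;
        #pragma omp parallel for              /* usual OpenMP on the device */
        for (i = 0; i < n; i++) { c[i] = a[i]*2.0e0 + b[i]; }
    }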

  10. Symmetric Execution Model
     • MPI tasks on host and coprocessor are equal (symmetric) members of the MPI communicator
     • Same code on the host processor and the MIC processor
     • Communication between any MPI tasks through regular MPI calls: Host ⬌ Host, Host ⬌ MIC, MIC ⬌ MIC (a minimal sketch follows)
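A minimal sketch of the symmetric model: the same MPI source runs on hosts and coprocessors, and every rank, wherever it lives, is an equal member of MPI_COMM_WORLD. The assumption that the program is built twice (a host binary and a MIC binary) and that the MPI launcher places ranks on both reflects typical usage, not details given on the slide.

    /* Symmetric-mode sketch: identical source runs as MPI ranks on hosts
     * and coprocessors; all ranks share one MPI_COMM_WORLD. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank, size, len;
        char host[MPI_MAX_PROCESSOR_NAME];

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);
        MPI_Get_processor_name(host, &len);

        /* Host<->host, host<->MIC, and MIC<->MIC ranks all communicate
         * through the same ordinary MPI calls. */
        printf("Rank %d of %d running on %s\n", rank, size, host);

        MPI_Finalize();
        return 0;
    }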

  11. Getting Access
     • Production begins January 7
     • XSEDE allocation requests for January are being accepted now, through October 15
     • Early user access in November/December (informal; contact me)
     • Training in December (1 workshop) and January (2 workshops)

  12. Stampede and Data-Driven Science
     • Stampede is configured for both simulation-based and data-driven science
     • Stampede’s architecture and configuration offer diverse, fully integrated capabilities:
       • Intel Xeon has great memory bandwidth and handles ‘complex’ code well
       • Intel Xeon Phi has 50+ cores and can process lots of data concurrently
       • Big file systems, high I/O bandwidth, HTC scheduling, large shared-memory nodes, vis nodes, data software, data apps support…

  13. Stampede and Data-Driven Science
     • Complementary storage resources:
       • Corral: data management systems (iRODS, DBs, etc.), 10+ PB and growing in 4Q12
       • Global file system (1Q13): more persistent storage, 20+ PB and expandable
       • Ranch: archival system, 50 PB growing to 100 PB

  14. Requests from You
     • Do you have data-driven science problems that you want to run on Stampede? These can help us with configuration, policies, training, docs, etc.
     • Do you have any suggestions for such policies, software, support, etc. for data-driven science?

  15. New Data-Focused Computing System
     • TACC will deploy a new system with a different configuration than Stampede: lots of local disk and memory per node, different usage policies and software set
     • Still collecting requirements
     • Also connected to all of the data storage resources on the previous slide
     • Pilot in 4Q12 or 1Q13, full system in 3Q13
