HPC-Colony

Presentation Transcript


  1. www.HPC-Colony.org, June 2005 PI Meeting. Terry Jones (LLNL, Coordinating PI), Laxmikant Kale (UIUC, PI), Jose Moreira (IBM, PI), Celso Mendes (UIUC), Sayantan Chakravorty (UIUC), Todd Inglett (IBM), Andrew Tauferner (IBM)

  2. Outline • Recap Colony Project Goals & Approach • Status Update • Identify possibilities for Collaborations • Identify resources which may be of interest to other Projects

  3. Recap Colony Project Goals & Approach • Status Update • Identify possibilities for Collaborations • Identify resources which may be of interest to other Projects

  4. Colony Project Overview. Title: Services and Interfaces to Support Systems with Very Large Numbers of Processors. Collaborators: Lawrence Livermore National Laboratory (at mtg: Terry Jones), University of Illinois at Urbana-Champaign (at mtg: Laxmikant Kale, Celso Mendes), International Business Machines. Topics • Scalable Load Balancing • OS mechanisms for Migration • Processor Virtualization for Fault Tolerance • Single system management space • Parallel Awareness and Coordinated Scheduling of Services • Linux OS for Blue Gene-like machines

  5. Motivation & Goals. Parallel Resource Management • Today, application programmers must explicitly manage these complex resources. We address scaling and porting issues by delegating resource management tasks to a sophisticated parallel OS. • "Managing resources" includes balancing CPU time, network utilization, and memory usage across the entire machine. Global System Management • Linux Everywhere • Enhance operating system support for parallel execution by providing coordinated scheduling and improved management services for very large machines. • Virtual memory management across the entire system. Fault Tolerance • Scalable strategies for dealing with faults. Checkpoint/restart may not be adequate in all cases (and it may be too expensive).

  6. Approach • Top Down • Our work will start from an existing full-featured OS and remove excess baggage with a “top down” approach. • Processor virtualization • One of our core techniques: the programmer divides the computation into a large number of entities, which are mapped to the available processors by an intelligent runtime system. • Leverage Advantages of Full Featured OS & Single System Image • Applications on these extreme-scale systems will benefit from extensive services and interfaces; managing these complex systems will require an improved “logical view” • Utilize Blue Gene • Suitable platform for ideas intended for very large numbers of processors

  7. Recap Colony Project Goals & Approach • Status Update • Identify possibilities for Collaborations • Identify resources which may be of interest to other Projects

  8. Recent Accomplishments: Project Overview • Started consolidating previous fault-tolerance work • Checkpoint/Restart Techniques • In-Memory Checkpointing • Preliminary Sender-based Message Logging Scheme • Continued parallel resource-management work • Measurement-based load balancing • New classes of balancers designed & implemented • Support for object & thread migration on BG/L • Global system management work • Linux on BG/L compute nodes • performance comparison measurements • parallel aware scheduling

  9. Fault Tolerance Work: Status • Current focus: Proactive fault tolerance • Approach: Migrate tasks away from processors where faults are imminent • Rationale: Many faults may have advance indicators • Hardware / O.S. notification • Basis: Processor virtualization (many virtual processors mapped to physical processors) • Implementation: "anytime-migration" support in Charm++ and AMPI (for MPI codes) used for processor evacuation (see the sketch below) • Tests: Jacobi (C, MPI), Sweep3d (Fortran, MPI)
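
A minimal sketch of the evacuation idea, assuming a simple round-robin map from virtual processors (VPs) to physical processors (PEs); the names and data structures here are illustrative only and are not the Charm++/AMPI migration API.

```cpp
// Illustrative sketch (hypothetical names, not the Charm++/AMPI API): many
// virtual processors are mapped onto physical processors; when a fault warning
// arrives for one PE, its VPs are remapped onto the surviving PEs.
#include <cstdio>
#include <vector>

int main() {
    const int num_pes = 8, num_vps = 32;               // e.g., 32 VPs on 8 PEs
    std::vector<int> vp_to_pe(num_vps);
    for (int vp = 0; vp < num_vps; ++vp)                // initial round-robin map
        vp_to_pe[vp] = vp % num_pes;

    const int warned_pe = 3;                            // PE with an imminent fault
    int target = 0;
    for (int vp = 0; vp < num_vps; ++vp) {
        if (vp_to_pe[vp] == warned_pe) {                // evacuate this VP
            if (target == warned_pe) target = (target + 1) % num_pes;
            vp_to_pe[vp] = target;                      // migrate object/thread here
            target = (target + 1) % num_pes;
        }
    }
    for (int vp = 0; vp < num_vps; ++vp)
        std::printf("VP %2d -> PE %d\n", vp, vp_to_pe[vp]);
    return 0;   // a subsequent load-balancing step refines this naive placement
}
```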

  10. Fault Tolerance Work • Example: Jacobi-relaxation 2D (C, MPI); a minimal kernel sketch follows below • 6,000x6,000 dataset, run on 8 Xeon processors, with 7 processors remaining after evacuation. (Figure annotations: fault warning, migration, load balancing.)
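
For reference, a minimal 2-D Jacobi relaxation kernel of the kind used in this test, written as plain C++ with MPI and a 1-D row decomposition; the grid size and iteration count below are illustrative, not the 6,000x6,000 configuration from the slide. Under AMPI the same code can be run with more virtual processors than physical processors.

```cpp
// Minimal 2-D Jacobi relaxation with a 1-D row decomposition and halo exchange.
// Sizes are illustrative; boundary values are simply held at zero.
#include <mpi.h>
#include <vector>

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    const int N = 512;              // global grid dimension (assumed divisible by size)
    const int rows = N / size;      // interior rows owned by this rank
    std::vector<double> cur((rows + 2) * N, 0.0), next((rows + 2) * N, 0.0);
    const int up = (rank == 0) ? MPI_PROC_NULL : rank - 1;
    const int down = (rank == size - 1) ? MPI_PROC_NULL : rank + 1;

    for (int iter = 0; iter < 100; ++iter) {
        // Exchange ghost rows with the neighbors above and below.
        MPI_Sendrecv(&cur[1 * N], N, MPI_DOUBLE, up, 0,
                     &cur[(rows + 1) * N], N, MPI_DOUBLE, down, 0,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Sendrecv(&cur[rows * N], N, MPI_DOUBLE, down, 1,
                     &cur[0 * N], N, MPI_DOUBLE, up, 1,
                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        // 5-point stencil update on the interior points.
        for (int i = 1; i <= rows; ++i)
            for (int j = 1; j < N - 1; ++j)
                next[i * N + j] = 0.25 * (cur[(i - 1) * N + j] + cur[(i + 1) * N + j] +
                                          cur[i * N + j - 1] + cur[i * N + j + 1]);
        cur.swap(next);
    }
    MPI_Finalize();
    return 0;
}
```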

  11. Fault Tolerance Work • Example: sweep3d • Processor-utilization snapshots. (Figure annotations: load from failed processor (1); prior to fault; after first fault/migration; after load balancing.)

  12. Fault Tolerance: Recent Accomplishments • Proactive Fault-Tolerance Lessons: • Evacuation time is proportional to dataset size per processor and to network speed – good scalability • Subsequent load balancing step is critical for good performance • Current scheme can tolerate multiple faults, as long as they are not simultaneous • Current Status: • Working on selecting appropriate load balancers • Analyzing how to improve performance between evacuation and the next load-balancing step • Reducing time between notice of impending failure and a stable post-evacuation state • Integrating into the regular Charm++/AMPI distribution

  13. Parallel Resource Management Work • Recent focus: Preliminary work on advanced load balancers (a concept sketch follows below) • Multi-phase load balancing • balance each program phase separately • phases may be identified by user-inserted calls • sample results following • Asynchronous load balancing • hiding load-balancing overhead • overlap of load balancing and computation • Topology-aware load balancing • considers network topology • major goal: optimize the hop-bytes metric • Ongoing work
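
A sketch of the multi-phase idea under simple assumptions: per-phase loads have already been measured, and each phase gets its own greedy assignment so that neither phase serializes. The data layout and function names are illustrative, not the Charm++ load-balancer interface.

```cpp
// Sketch: balance each program phase separately (hypothetical data layout,
// not the Charm++ load-balancer interface).
#include <algorithm>
#include <cstdio>
#include <vector>

// Greedily assign the objects of one phase to the least-loaded processor.
std::vector<int> balance_phase(const std::vector<double>& obj_load, int num_pes) {
    std::vector<double> pe_load(num_pes, 0.0);
    std::vector<int> assignment(obj_load.size());
    for (size_t o = 0; o < obj_load.size(); ++o) {
        int best = std::min_element(pe_load.begin(), pe_load.end()) - pe_load.begin();
        assignment[o] = best;
        pe_load[best] += obj_load[o];
    }
    return assignment;
}

int main() {
    // Measured load of 6 objects in each of two phases (illustrative numbers).
    std::vector<double> phase1 = {5, 5, 1, 1, 1, 1};
    std::vector<double> phase2 = {1, 1, 5, 5, 1, 1};
    auto map1 = balance_phase(phase1, 2);   // each phase gets its own mapping,
    auto map2 = balance_phase(phase2, 2);   // so neither phase is left imbalanced
    for (size_t o = 0; o < map1.size(); ++o)
        std::printf("object %zu: phase1 -> PE %d, phase2 -> PE %d\n", o, map1[o], map2[o]);
    return 0;
}
```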

  14. Parallel Resource Management Work • Example: phase load balancing. (Timeline figure for Processors 0 and 1 over Phases 1 and 2, comparing three cases: the original execution; good balancing but bad performance; phase balancing.)

  15. Parallel Resource Management Work • Example: synthetic 2-phase program • Tests: 32 PEs, Xeon cluster. (Figure labels, in order: utilization 70%; no load-balance; greedy load-balancer; utilization 60%; multi-phase load-balancer.)

  16. Parallel Resource Management: Recent Accomplishments • Load Balancing Trade-offs: • Trade-off in centralized load balancers • good balance can be achieved • expensive memory and network usage for LB • Trade-off in fully distributed load balancers • cheap in terms of memory/network usage • may be slow to achieve acceptable balance • Current Status: • Working on hybrid load-balancing schemes (see the sketch below) • Idea: divide processors into a hierarchy, balance load at each level first, then across levels • Integrating balancers into the regular Charm++/AMPI
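
A toy sketch of the hierarchical idea under stated assumptions (a fixed group size and synthetic per-PE loads): load is first evened out within each group, and only the much smaller set of group averages is then balanced across groups. This illustrates the trade-off, not the actual Charm++ hybrid balancer.

```cpp
// Sketch of a two-level (hybrid) scheme: balance inside each group first,
// then only exchange load between groups if the group averages differ.
// Illustrative only; not the Charm++ hierarchical balancer.
#include <cstdio>
#include <numeric>
#include <vector>

int main() {
    std::vector<double> pe_load = {9, 7, 2, 2, 4, 4, 4, 4};  // 8 PEs, synthetic loads
    const int group_size = 4;
    const int num_groups = static_cast<int>(pe_load.size()) / group_size;

    std::vector<double> group_avg(num_groups);
    for (int g = 0; g < num_groups; ++g) {
        double sum = std::accumulate(pe_load.begin() + g * group_size,
                                     pe_load.begin() + (g + 1) * group_size, 0.0);
        group_avg[g] = sum / group_size;
        // Level 1: balance within the group (cheap, local communication only).
        for (int p = 0; p < group_size; ++p)
            pe_load[g * group_size + p] = group_avg[g];
    }
    // Level 2: balance the (much smaller) set of group averages across groups.
    double global_avg = std::accumulate(group_avg.begin(), group_avg.end(), 0.0) / num_groups;
    for (int g = 0; g < num_groups; ++g)
        std::printf("group %d: local avg %.2f, target after cross-group step %.2f\n",
                    g, group_avg[g], global_avg);
    return 0;
}
```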

  17. Global System Management: Recent Accomplishments • Defined team for Blue Gene/L Linux Everywhere • Research and development of Linux Everywhere • Limitations of the Compute Node Kernel • Motivation for Linux on the compute nodes • Challenges of a Linux solution for Blue Gene/L • Initial experiments and results • Measurements for Parallel Aware Scheduling

  18. Global System Management: Linux Everywhere Solution for BG/L • Would significantly extend the spectrum of applications that can be easily ported to Blue Gene/L – more penetration in industrial and commercial markets • Multiple processes/threads per compute node • Leverage the set of middleware (libraries, run-time) from the I/O node on the compute node • An alternative to cross-compilation environments • Platform for research on extreme scalability of Linux • Possible solution for a follow-on machine (Blue Gene/P)

  19. Global System Management: Challenges of Linux Everywhere • Scalability challenges: • Will the more asynchronous nature of Linux lead to scalability problems? • Sources of asynchronous events in Linux: timer interrupts, TLB misses, process/thread scheduler, I/O (device) events • Parallel file system for 100,000+ processors? • If Linux cannot scale to the 10,000-100,000 processor range, can it be a solution for smaller systems? • Reliability challenges: • Despite some difficulties with the file system and Ethernet devices, Linux on BG/L has not presented reliability problems • Test, test, test, and more testing…

  20. Global System Management: Full Linux on Blue Gene Nodes • 5 out of 8 NAS class A benchmarks fit in 256MB • Same compile (gcc -O3) and execute options, and similar libraries (GLIBC) • No daemons or extra processes running in Linux; mainly user-space code • Performance difference suspected to be caused by the handling of TLB misses

  21. Global System Management: Impact of TLB Misses (single node) • Normally, CNK does not have TLB misses (memory is directly mapped) – we added a TLB handler with the same overhead as Linux (~160 cycles) and varied the page size • CNK with a 4kB page performs just like Linux • CNK with a 64kB page performs just like CNK without TLB misses. (A back-of-the-envelope sketch follows below.)
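
A back-of-the-envelope sketch of why larger pages help, assuming a streaming sweep that takes roughly one TLB miss per page touched; the ~160-cycle miss cost is the figure quoted on the slide, while the 256 MB data size is only an illustrative assumption.

```cpp
// Back-of-the-envelope sketch: for a streaming sweep over a data set, roughly
// one TLB miss is taken per page touched, so the miss overhead falls linearly
// with page size. The ~160-cycle handler cost is from the slide; the data-set
// size is an illustrative assumption.
#include <cstdio>

int main() {
    const double miss_cost_cycles = 160.0;        // handler cost quoted on the slide
    const long long data_bytes = 256LL << 20;     // assume 256 MB swept once
    const long long page_sizes[] = {4LL << 10, 64LL << 10, 1LL << 20};

    for (long long page : page_sizes) {
        long long misses = data_bytes / page;                 // ~1 miss per page
        double overhead_cycles = misses * miss_cost_cycles;
        std::printf("page %6lld kB: ~%9lld misses, ~%.2e cycles of TLB overhead per sweep\n",
                    page >> 10, misses, overhead_cycles);
    }
    return 0;
}
```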

  22. Global System Management: Impact of TLB Misses (multi-node) • CNK with 64kB pages performs within 2% of no TLB misses • CNK with a 1MB page performs just like CNK without TLB misses

  23. Global System Management: Linux Everywhere Conclusions • A Linux Everywhere (I/O and compute nodes) solution for Blue Gene/L can significantly increase the spectrum of applications for Blue Gene/L • Significant challenges remain in scaling Linux to 100,000 processors, primarily due to interference on running applications • Large pages are effective in reducing the impact of TLB misses – 64kB pages seem like a good target • Next steps: • Implement large-page support in Linux for Blue Gene/L • Study other sources of interference (timers) • Devise a file system solution • Decide what to do about the lack of coherence

  24. Global System Management: Parallel Aware Scheduling, Miranda (Instability & Turbulence) • High-order hydrodynamics code for computing fluid instabilities and turbulent mix • Results @ 1024 tasks (favored priority: 41, unfavored priority: 100, percent to application: 99.995%, total duration: 20 seconds): without co-scheduler, mean 452.52, standard deviation 108.45; with co-scheduler, mean 254.45, standard deviation 5.45

  25. Recap Colony Project Goals & Approach • Status Update • Identify possibilities for Collaborations • Identify resources which may be of interest to other Projects

  26. Needs • Early fault indicators • What kinds of fault indicators can the O.S. provide? • Can these be provided in a uniform interface? (a hypothetical interface sketch follows below) • O.S. support for task migration • Is it possible to allocate/reserve virtual memory consistently and "globally"? • How to handle thread migration on 32-bit systems? • O.S. support for load balancing • Production of updated computational-load and communication information • Fast data collection (via extra-network support?) • Interfaces between schedulers and jobs
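
One way such a uniform fault-indicator interface could look, sketched as hypothetical C++; none of these names are an existing OS API, they only illustrate the kind of notification the runtime would like to receive from the operating system.

```cpp
// Hypothetical sketch of a uniform fault-indicator interface (not an existing
// OS API): the runtime registers a callback and the OS reports impending
// faults with a source, a confidence, and an estimated time to failure.
#include <cstdio>
#include <functional>

enum class FaultSource { Thermal, Memory, Network, Disk, Unknown };

struct FaultWarning {
    int node_id;                 // node the warning refers to
    FaultSource source;          // hardware subsystem raising the indicator
    double confidence;           // 0.0 - 1.0, how certain the predictor is
    double seconds_to_failure;   // estimated time left to evacuate
};

// In a real system this would be provided by the OS; here it is only a stub
// that immediately delivers one simulated warning.
void register_fault_callback(const std::function<void(const FaultWarning&)>& cb) {
    FaultWarning w{17, FaultSource::Thermal, 0.9, 30.0};
    cb(w);
}

int main() {
    register_fault_callback([](const FaultWarning& w) {
        std::printf("warning: node %d, confidence %.2f, ~%.0f s to failure -> start evacuation\n",
                    w.node_id, w.confidence, w.seconds_to_failure);
    });
    return 0;
}
```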

  27. Collaboration Opportunities • Extending Linux for extreme scale • Addressing OS interference • Addressing how system resources are managed (processor virtualization and SSI strategies) • Migration (for both load balancing and run-through-failure) • How far can we push a full-featured OS?

  28. Recap Colony Project Goals & Approach • Status Update • Identify possibilities for Collaborations • Identify resources which may be of interest to other Projects

  29. We can provide… • Several “scalable” applications from LLNL and UIUC • A full-featured OS comparison point (if given a metric or application) • Feedback on how novel ideas affect the key aspects of our approach (e.g. “That would affect our model of process migration in the following way…”) And, as mentioned earlier, we’re very interested in collaborating in any of a number of areas…

  30. Extra Viewgraphs

  31. Fault Tolerance Work • Example: sweep3d (Fortran, MPI) • 150³ case, run on 8 Xeon processors with 32 VPs; 6 processors remaining after evacuation

  32. Parallel Resource Management Work • Example: LeanMD, communication-aware load balancing • HCA 30K-atom dataset • Run on PSC LeMieux

  33. Parallel Resource Management Work • Example: topology-aware load balancing • LeanMD code, 1024-processor run with the HCA atom benchmark

  34. Global System Management: Limitations of the CNK Solution • Less than half of the Linux system calls are supported – in particular, the following are missing: • Process and thread creation • Server-side sockets • Shared memory segments / memory-mapped files • CNK requires a cross-compilation environment, which has been a challenge for some applications • CNK also requires its own separate set of run-time libraries – a maintenance and test issue • We could keep extending CNK, but that would eventually defeat its very reason for being (simplicity) • Instead, we want to investigate a "Linux Everywhere" solution for Blue Gene/L – Linux on the I/O and compute nodes

  35. Global System Management: Linux on the Compute Nodes • Our priority is to guarantee scaling of "well behaved" Linux use on the compute nodes: • Small number of processes/threads • No daemons (leave them to the I/O nodes) • Rely on MPI for high-speed communication • Ideally, MPI applications with one or two tasks per node will perform just as well with Linux as with the Compute Node Kernel • TCP and UDP support over the Blue Gene/L networks will be important for a broader set of applications that do not use MPI • For now, our study has focused on reducing the impact of TLB misses

  36. Global System Management: Blue Gene/L Compute Node Kernel • The Blue Gene/L Compute Node Kernel was developed from scratch – simple and deterministic for scalability and reliability (about 13,000 lines of code) • Every user-level thread is backed by a processor – deterministic execution and no processor sharing • The only timer interrupt is counter virtualization every 6 seconds, which is synchronous across the partition • No TLB misses – memory is directly mapped • This deterministic execution has been the key to scalability of (for example) the FLASH code • Implements a subset of Linux system calls – complex calls (mostly I/O) are actually executed on the I/O node • GNU and XL run-time libraries ported to CNK – many applications have required just "compile and go"

  37. Global System Management: Blue Gene/L Kernels • In Blue Gene/L, Linux is used in a limited role • Runs on the I/O nodes, in support of file I/O, socket operations, job control, and process debugging • Linux on the I/O nodes acts as an extension of the Compute Node Kernel – operations not directly supported by the CNK (e.g., file I/O) are function-shipped for execution on the I/O node (a toy sketch follows below) • Linux is also used on the front-end nodes (job compilation, submission, debugging) and service nodes (machine control, machine monitoring, job scheduling) • Compute nodes run the lightweight Compute Node Kernel for reliability and scalability
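
A toy sketch of the function-shipping idea under stated assumptions: the compute-node side packs an I/O system call into a small request, and a local handler stands in for the I/O-node daemon. The message layout and names are invented for illustration and are not the CNK protocol.

```cpp
// Toy sketch of function shipping (invented message format, not the CNK
// protocol): the compute node packs an I/O system call into a request and a
// handler, standing in here for the I/O-node daemon, performs it.
#include <cstdio>
#include <cstring>
#include <string>

struct IoRequest {            // what the compute node would send over the network
    int syscall_id;           // e.g., 1 = write
    int fd;                   // target descriptor (handling elided in this stub)
    char data[128];
    int length;
};

// On the real machine this runs on the I/O node; here it is a local stand-in.
int io_node_handler(const IoRequest& req) {
    if (req.syscall_id == 1)
        return static_cast<int>(fwrite(req.data, 1, req.length, stdout));
    return -1;                // unsupported call
}

// The compute-node side: pack the call, "ship" it, wait for the result.
int shipped_write(int fd, const std::string& text) {
    IoRequest req{1, fd, {}, static_cast<int>(text.size())};
    std::memcpy(req.data, text.data(), text.size());
    return io_node_handler(req);     // network send/receive elided
}

int main() {
    int written = shipped_write(1, "hello from a compute node\n");
    std::fprintf(stderr, "bytes written via I/O node: %d\n", written);
    return 0;
}
```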

  38. Global System Management: Scalability of the CNK Solution (FLASH)

  39. Global System Management: Parallel Aware Scheduling, Miranda (Instability & Turbulence) • High-order hydrodynamics code for computing fluid instabilities and turbulent mix • Employs FFTs and band-diagonal matrix solvers to compute spectrally accurate derivatives, combined with high-order integration methods for time advancement • Contains solvers for both compressible and incompressible flows • Has been used primarily for studying Rayleigh-Taylor (R-T) and Richtmyer-Meshkov (R-M) instabilities, which occur in supernovae and Inertial Confinement Fusion (ICF)

  40. Top Down (start with Full-Featured OS) • Why? Broaden domain of applications that can run on the most powerful machines through OS support • More general approaches to processor virtualization, load balancing and fault tolerance • Increased interest in applications such as parallel discrete event simulation • Multi-threaded apps and libraries • Why not? Difficult to sustain performance with increasing levels of parallelism • Many parallel algorithms are extremely sensitive to serializations • We will address this difficulty with parallel aware scheduling Question: How much should system software offer in terms of features? Answer: Everything required, and as much desired as possible

  41. Parallel Resource Management: Processor Virtualization • Divide the computation into a large number of pieces • Independent of the number of processors • Let the runtime system map objects to processors • Implementations: Charm++, Adaptive MPI (AMPI). (Figure: user view of many objects vs. the system implementation mapping them onto processors P0, P1, P2; a concept sketch follows below.)
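
The user view vs. system view can be sketched as follows; this is plain C++ with invented names (not the actual Charm++ chare interface), showing many small work objects created independently of the machine size and a runtime-owned map deciding where each one runs.

```cpp
// Sketch of processor virtualization (invented names, not the Charm++ API):
// the programmer creates many small work objects, independent of the processor
// count, and a runtime-owned map decides where each object currently runs.
#include <cstdio>
#include <vector>

struct WorkPiece {          // one of the many user-defined entities
    int id;
    double state;           // state that would be packed/unpacked on migration
    void compute() { state += id * 0.5; }
};

int main() {
    const int num_pieces = 64;       // chosen by the problem, not the machine
    const int num_pes = 3;           // P0, P1, P2 as in the slide's figure
    std::vector<WorkPiece> pieces;
    for (int i = 0; i < num_pieces; ++i) pieces.push_back({i, 0.0});

    // Runtime-owned mapping; the runtime may change it at load-balancing time.
    std::vector<int> piece_to_pe(num_pieces);
    for (int i = 0; i < num_pieces; ++i) piece_to_pe[i] = i % num_pes;

    for (int i = 0; i < num_pieces; ++i) pieces[i].compute();
    std::printf("piece 10: state %.1f, runs on PE %d in the current mapping\n",
                pieces[10].state, piece_to_pe[10]);
    return 0;
}
```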

  42. Parallel Resource Management: AMPI, MPI with Virtualization • Each MPI process is implemented as a user-level thread embedded in a Charm++ object. (Figure: MPI "processes" implemented as virtual processes (user-level migratable threads) mapped onto the real processors.)

  43. Research Using Processor Virtualization. Efficient resource management • Dynamic load balancing based on object migration • Optimized inter-processor communication • Measurement-based runtime decisions • Communication volumes • Computational loads • Focus on highly scalable strategies • Centralized → Distributed → Hybrid. Fault-tolerance approaches for large systems • Proactive reaction to impending faults • Migrate objects when a fault is imminent • Keep "good" processors running at full pace • Refine load balance after migrations • Appropriate for system failures affecting a small subset of a large job • Automatic checkpointing / fault detection / restart • In-memory checkpointing of objects (see the sketch below) • Using message logging to tolerate frequent faults in a scalable fashion
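
A sketch of the double in-memory checkpointing idea under simple assumptions: each processor keeps its own checkpoint in memory and also places a copy on a "buddy" processor, so a single failure can be recovered from memory without touching the file system. The names and layout are illustrative, not the Charm++ implementation.

```cpp
// Sketch of double in-memory checkpointing (illustrative, not the Charm++
// implementation): each PE keeps its own checkpoint in memory and also sends a
// copy to a buddy PE, so a single failure loses no checkpoint data.
#include <cstdio>
#include <vector>

int main() {
    const int num_pes = 4;
    std::vector<std::vector<double>> local(num_pes);       // local[pe]: pe's own checkpoint
    std::vector<std::vector<double>> buddy_copy(num_pes);  // buddy_copy[b]: checkpoint of PE (b-1) held on PE b

    for (int pe = 0; pe < num_pes; ++pe) {
        std::vector<double> state = {pe * 1.0, pe * 2.0};   // synthetic object state
        local[pe] = state;                                   // in-memory copy on the owner
        buddy_copy[(pe + 1) % num_pes] = state;              // second copy on the buddy PE
    }

    const int failed_pe = 2;                                 // simulate a single failure
    // The buddy (PE 3) still holds PE 2's checkpoint in memory; restart from it
    // without touching the file system.
    const std::vector<double>& recovered = buddy_copy[(failed_pe + 1) % num_pes];
    std::printf("recovered %zu checkpointed values for PE %d from its buddy\n",
                recovered.size(), failed_pe);
    return 0;
}
```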

  44. Leverage Advantages of Full Featured OSs • Make Available on BlueGene/L Compute Nodes • Enable a large scale test bed for our work • Increase flexibility for target applications

  45. Global System Management: Single System Image • Move from a physical view of the machine to a logical view of the machine • In a single process space model, all processes running on the parallel machine belong to a single pool of processes. A process can interact with any other process independent of their physical location on the machine. The single process space mechanism will implement the following process services: • By coalescing several individual operations into one collective operation, and raising the semantic level of the operations, the aggregate model will allow us to address the performance issues that arise.

  46. Global System Management: The Logical View… • Single Process Space (a hypothetical interface sketch follows below) • Process query: these services provide information about which processes are running on the machine, who they belong to, how they are grouped into jobs, and how many resources they are using. • Process creation: these services support the creation of new processes and their grouping into jobs. • Process control: these services support suspension, continuation, and termination of processes. • Process communication: these services implement communications between processes. • Single File Space • Any fully qualified file name (e.g., "/usr/bin/ls") represents exactly the same file to all the processes running on the machine. • Single Communication Space • We will provide mechanisms by which any two processes can establish a channel between them.
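
One way the listed process services could be expressed, sketched as a hypothetical header-style C++ interface; the names are invented to illustrate the single-process-space idea and are not an existing Colony or Linux API.

```cpp
// Hypothetical interface sketch for the single-process-space services listed
// above (query, creation, control, communication). Names are invented for
// illustration; this is not an existing Colony or Linux API.
#include <cstdint>
#include <string>
#include <vector>

struct GlobalProcessInfo {
    std::uint64_t global_pid;   // unique across the whole machine
    std::uint64_t job_id;       // which job the process belongs to
    int node;                   // physical node it currently runs on
    std::string owner;          // user the process belongs to
    double cpu_seconds;         // resource usage
};

class SingleProcessSpace {
public:
    virtual ~SingleProcessSpace() = default;
    // Process query: who is running, where, in which job, using what resources.
    virtual std::vector<GlobalProcessInfo> query(std::uint64_t job_id) = 0;
    // Process creation: start count processes as one job, machine-wide.
    virtual std::uint64_t create_job(const std::string& executable, int count) = 0;
    // Process control: suspend / continue / terminate, independent of location.
    virtual bool signal_job(std::uint64_t job_id, int signal) = 0;
    // Process communication: a channel between any two global pids.
    virtual int open_channel(std::uint64_t from_pid, std::uint64_t to_pid) = 0;
};
```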

  47. Utilize Blue Gene • Port, optimize, and scale existing Charm++/AMPI applications on Blue Gene/L • Molecular Dynamics: NAMD (Univ. Illinois) • Collection of (charged) atoms, with bonds • Simulations with millions of timesteps desired • Cosmology: PKDGRAV (Univ. Washington) • N-Body problem, with gravitational forces • Simulation and analysis/visualization done in parallel • Quantum Chemistry: CPAIMD (IBM/NYU/others) • Car-Parrinello Ab Initio Molecular Dynamics • Fine-grain parallelization, long-time simulations • Most of these efforts will leverage current collaborations/grants

  48. Task Migration Time in Charm++ • Migration time for 5-point stencil, 16 processors

  49. Task Migration Time in Charm++ • Migration time for 5-point stencil, 268 MB total data
