CIFTS Coordinated Infrastructure for Fault Tolerant Systems

Presentation Transcript


  1. CIFTS Coordinated Infrastructure for Fault Tolerant Systems

  2. Agenda • The Problem and the purpose • The CIFTS framework • The CIFTS team • Getting Involved

  3. Current HPC Systems • Top 500 statistics • Performance growth • 35.86 TF/s (2002) to 280 TF/s (2007) • Average node count growth • 128-258 (2002) to 1024-2048 (2007) *“A Power-aware Run-Time System for High-Performance Computing”, Chung-hsing Hsu and Wu-chun Feng, IEEE International Supercomputing Conference (SC), 2005

  4. Downtime Cost “Faults directly impact system downtime and TCO” *“A Power-aware Run-Time System for High-Performance Computing”, Chung-hsing Hsu and Wu-chun Feng, IEEE International Supercomputing Conference (SC), 2005

  5. Fault Tolerance in HPC • Available for some HPC components • Storage (RAID variations) and file systems (dCache, TeraGrid FS, Panasas, IBRIX, BulkFS) • Checkpointing software (application checkpointing, e.g. BLCR, Condor; operating system checkpointing, e.g. TICK) • Software built using hardware technologies like lmsensors, OpenMPI, BMC and other monitoring software like Ganglia • Middleware (FT-MPI, MPICH-V, FE-MPI, FT ARMCI) Components mostly deal with faults on an individual basis! Global sharing of fault information is missing!

  6. A typical scenario MPI Application (job1) Job Scheduler detects “communication failure” with node X Launches MPI Job 1 MPI Aborts! Launches MPI Job 2 Application Aborts! More failures Other software on the cluster is unaware of this MPI job failure, and of the reason for it!

  7. The CIFTS Framework Operating System System Management hardware Applications System Monitoring software Universal Logger Networking libraries Fault Tolerant Backplane Event Analysis File Systems Job Scheduler/ Resource manager Automatic Actions Operating systems HPC Middleware Linear Algebra Libraries Diagnostics Tools Autonomics System components, libraries and applications

  8. CIFTS - Usage Scenario Job Scheduler Parallel FS Launch jobs with NFS file system IO node failure. File system down Migrates existing jobs File System shares this information Application Checkpoints itself MPI-IO Prints a coherent error message Application Checkpoints itself

  9. CIFTS - Usage Scenario Diagnostics Utility Hardware sensor detects increasing disk temperature on node X Runs scripts for further root-causing Sensor shares this knowledge Job Scheduler Does not launch jobs on node X until further diagnosis Parallel FS MPI Application Prepare for I/O data migration from node X Starts Checkpointing Starts Checkpointing

  10. Lifecycle of a component interaction with FTB • Component instances connect to the distributed Fault Tolerant Backplane • Step 1: Register with FTB • Step 2: Subscribe for events • Step 3: Publish events • Step 4: Deregister from FTB
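
The four lifecycle steps on slide 10 can be illustrated with a toy, in-process backplane. This is only a sketch under stated assumptions: every `sketch_*` name is hypothetical, nothing here is part of the FTB API, and the real backplane is distributed across agents rather than a single array of clients. It also assumes that a publisher does not receive its own events, which the slides do not state.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_MAX_CLIENTS 8   /* arbitrary limit for this sketch */

/* Hypothetical callback type: invoked once per delivered event. */
typedef void (*sketch_callback_t)(const char *event_name, void *arg);

typedef struct {
    int in_use;                  /* 1 while the client is registered */
    sketch_callback_t callback;  /* NULL until the client subscribes */
    void *arg;                   /* passed back to the callback */
} sketch_client_t;

typedef struct {
    sketch_client_t clients[SKETCH_MAX_CLIENTS];
} sketch_backplane_t;

/* Returns 1 if the handle refers to a currently registered client. */
static int sketch_valid(const sketch_backplane_t *bp, int handle)
{
    return handle >= 0 && handle < SKETCH_MAX_CLIENTS
        && bp->clients[handle].in_use;
}

/* Step 1: register. Returns a handle (slot index), or -1 if full. */
int sketch_register(sketch_backplane_t *bp)
{
    assert(bp != NULL);
    for (int i = 0; i < SKETCH_MAX_CLIENTS; i++) {
        if (!bp->clients[i].in_use) {
            bp->clients[i].in_use = 1;
            bp->clients[i].callback = NULL;
            bp->clients[i].arg = NULL;
            return i;
        }
    }
    return -1;
}

/* Step 2: subscribe. Returns 0 on success, -1 on a bad handle or callback. */
int sketch_subscribe(sketch_backplane_t *bp, int handle,
                     sketch_callback_t cb, void *arg)
{
    assert(bp != NULL);
    if (!sketch_valid(bp, handle) || cb == NULL)
        return -1;
    bp->clients[handle].callback = cb;
    bp->clients[handle].arg = arg;
    return 0;
}

/* Step 3: publish. Delivers the event to every other subscribed client.
   Returns the number of deliveries, or -1 on a bad handle. */
int sketch_publish(sketch_backplane_t *bp, int handle, const char *event_name)
{
    assert(bp != NULL && event_name != NULL);
    if (!sketch_valid(bp, handle))
        return -1;
    int delivered = 0;
    for (int i = 0; i < SKETCH_MAX_CLIENTS; i++) {
        sketch_client_t *c = &bp->clients[i];
        if (i != handle && c->in_use && c->callback != NULL) {
            c->callback(event_name, c->arg);
            delivered++;
        }
    }
    return delivered;
}

/* Step 4: deregister. Returns 0 on success, -1 on a bad handle. */
int sketch_deregister(sketch_backplane_t *bp, int handle)
{
    assert(bp != NULL);
    if (!sketch_valid(bp, handle))
        return -1;
    bp->clients[handle].in_use = 0;
    bp->clients[handle].callback = NULL;
    return 0;
}

/* Example callback: counts events in the int that arg points to. */
void sketch_count_events(const char *event_name, void *arg)
{
    (void)event_name;
    (*(int *)arg)++;
}
```

In terms of the usage scenario on slide 8, a job scheduler would register and subscribe with a callback, and a file system would register and publish a failure event; the scheduler's callback then runs without the two components knowing about each other directly.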

  11. Delving deeper in FTB framework Component Instance Component Instance Register Register FTB Agent Subscribe to a set of events FTB Agent Publish event FTB Agent FTB Agent FTB Agent FTB Agent FTB Agent Register Component Instance Publish event

  12. FTB Internal Architecture Layers Component 1 Component n FTB Agent FTB Client API Client Library Linux BGL CRAY FTB Manager API Manager Library Manager Library Network Network Network Module1 Network Module2 Network Module1 Network Module2 Component software stack FTB Agent software stack

  13. What you need to know! Component 1 Component n FTB Agent Just the FTB Client API Client Library Linux BGL CRAY FTB Manager API Manager Library Manager Library Network Network Network Module1 Network Module2 Network Module1 Network Module2 Component software stack FTB Agent software stack

  14. CIFTS API* Snapshot
     • FTB_Init (IN FTB_comp_info_t *comp_info, OUT FTB_client_handle_t *client_handle, OUT char *error_msg)
     • FTB_Publish_event (IN FTB_client_handle_t handle, IN char *event_name, IN FTB_event_data_t *datadetails, OUT char *error_msg)
     • FTB_Create_mask (INOUT FTB_event_mask_t *evt_mask, IN char *field_name, IN char *field_val, OUT char *error_msg)
     • FTB_Subscribe (IN FTB_client_handle_t chandle, IN FTB_event_mask_t *event_mask, OUT FTB_subscribe_handle_t *shandle, OUT char *error_msg, IN int (*callback)(OUT FTB_catch_event_info_t *, OUT void*), IN void *arg)
     • FTB_Poll_for_event (IN FTB_subscribe_handle_t shandle, OUT FTB_catch_event_info_t *catch_event, OUT char *error_msg)
     • FTB_Finalize (IN FTB_client_handle_t handle)
     *Work in progress

  15. FTB-enabled Software -- Planned BLCR FT-LA PVFS Fault Tolerant Backplane ROMIO ScaLAPACK Cobalt MVAPICH2 MPICH2 CCA Applications ZeptoOS OpenMPI LAMMPS LAM/MPI NWChem SWIM IPS

  16. Status • Alpha version in progress • Demos available on the SC exhibit floor • Client API to be finalized by Q4 CY07 • Beta release targeted for Q1 CY08 • Platforms supported: Linux clusters, IBM BGL, Cray XT

  17. CIFTS team
     • Argonne National Laboratory: Pete Beckman, Rinku Gupta, Ewing Lusk, Rob Ross, Rajeev Thakur
     • Indiana University: Andrew Lumsdaine
     • Lawrence Berkeley National Laboratory: Paul Hargrove
     • Oak Ridge National Laboratory: Al Geist, David Bernholdt, Pratul Agarwal, Scott Hampton, Byung-Hoon Park, Aniruddha Shet
     • Ohio State University: D.K. Panda
     • University of Tennessee, Knoxville: Jack Dongarra

  18. Call for Action FT-LA Lustre GPFS Intel MLK BLCR IBRIX GFS ScaLAPACK Polyserv Fault Tolerant Backplane Panasas PVFS MAUI Cobalt ROMIO Condor MVAPICH2 MPICH-MX SGE LSF MPICH2 LAMMPS ZeptoOS Intel MPI PBS/Pro Global Arrays OpenMPI SWIM IPS NWChem SLURM Linux LAM/MPI Other Applications Scali MPI Star-CD LS-Dyna MM5 BLAST Eclipse Fluent

  19. Need more information? • SC’07 Exhibit floor • Demos and/or talks at ANL, ORNL and LBNL booth • CIFTS website • http://www.mcs.anl.gov/research/cifts/ • CIFTS wiki • http://wiki.mcs.anl.gov/cifts • CIFTS mailing list • cifts_discuss@googlegroups.com

  20. Discussion Topics • Need for CIFTS infrastructure in enterprise environments • Requirements/constraints for adoption of CIFTS? • …

  21. Backup

  22. CIFTS - The working view PVFS Universal Logger Checkpoint Restart System Resource Manager/JS Event Analysis System Components Automatic Actions Diagnostics Tools Bootstrap Server Middleware Like MPI MPI-IO Linear Algebra Libraries Autonomics Libraries and Applications

  23. Building an FTB-enabled sample component • List the events you may want to publish in an XML file (for convenience) • Use the API to make the component FTB-enabled • Publish and subscribe to events

  24. FTB-Enabled Component Development (Step 1) STEP 1: Create an XML file outlining the publishable events

        <ftb_component_details>
          <namespace>ftb.ftb_examples.watchdog</namespace>
          <publish_event>
            <event_name>WATCH_DOG_EVENT</event_name>
            <event_severity>Info</event_severity>
            <event_desc>This event is used by watchdog</event_desc>
          </publish_event>
          <publish_event> … </publish_event>
        </ftb_component_details>

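  25. Developing an FTB-enabled component (Step 2) STEP 2: Enabling your FTB component

        #include <string.h>   /* strcpy */
        #include "libftb.h"
        #include "ftb_event_def.h"
        #include "ftb_throw_events.h"

        int main (int argc, char *argv[])
        {
            /* The slide omits these declarations. The types follow the API
               snapshot on slide 14; the buffer size is an assumption. */
            FTB_comp_info_t cinfo;
            FTB_client_handle_t handle;
            FTB_subscribe_handle_t shandle;
            FTB_event_mask_t mask;
            FTB_catch_event_info_t caught_event;
            FTB_event_data_t event_data = {0};
            FTB_event_data_t *publish_event_data = &event_data;
            char err_msg[1024];

            /* Describe this component instance */
            strcpy(cinfo.comp_namespace, "FTB.FTB_EXAMPLES.Watchdog");
            strcpy(cinfo.schema_ver, "0.5");
            strcpy(cinfo.inst_name, "watchdog");
            strcpy(cinfo.jobid, "watchdog-111");
            strcpy(cinfo.catch_style, "FTB_POLLING_CATCH");

            /* Register with FTB and declare the events this component can publish */
            FTB_Init(&cinfo, &handle, err_msg);
            FTB_Register_publishable_events(handle, ftb_ftb_examples_watchdog_events,
                                            FTB_FTB_EXAMPLES_WATCHDOG_TOTAL_EVENTS, err_msg);

            /* Subscribe to all events; no callback, since events are polled */
            FTB_Create_mask(&mask, "all", "init", err_msg);
            FTB_Subscribe(handle, &mask, &shandle, err_msg, NULL, NULL);

            /* Publish one event, then poll for a caught event */
            FTB_Publish_event(handle, "WATCH_DOG_EVENT", publish_event_data, err_msg);
            FTB_Poll_for_event(shandle, &caught_event, err_msg);

            /* Deregister from FTB */
            FTB_Finalize(handle);
            return 0;
        }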

  26. Developing an FTB-enabled component (Step 2, contd.) Creating your subscribe event mask
     • Create a mask to catch all events
        1. FTB_Create_mask(&mask, "all", "init", err_msg);
     • Create a mask to catch "WATCH_DOG_EVENT"
        1. FTB_Create_mask(&mask, "all", "init", err_msg);
        2. FTB_Create_mask(&mask, "event_name", "WATCH_DOG_EVENT", err_msg);
     • Create a mask to catch events of severity fatal
        1. FTB_Create_mask(&mask, "all", "init", err_msg);
        2. FTB_Create_mask(&mask, "severity", "FTB_FATAL", err_msg);
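
The mask recipes on slide 26 suggest a two-stage pattern: `"all"`/`"init"` resets the mask so it catches everything, and each later call narrows it by one field. The slides do not say how FTB implements this, so the following is only a minimal model of that reading. All `sketch_*` names and both struct layouts are hypothetical and are not the FTB types.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical stand-in for an event mask (not FTB_event_mask_t).
   An empty string in a field means "match any value". */
typedef struct {
    char event_name[32];
    char severity[16];
} sketch_mask_t;

/* Hypothetical stand-in for a published event. */
typedef struct {
    const char *event_name;
    const char *severity;
} sketch_event_t;

/* Mirrors the calling pattern on the slide: ("all", "init") resets the
   mask to match everything; "event_name" and "severity" narrow it.
   Returns 0 on success, -1 for an unknown field name. */
int sketch_create_mask(sketch_mask_t *mask, const char *field, const char *value)
{
    assert(mask != NULL && field != NULL && value != NULL);
    if (strcmp(field, "all") == 0) {
        mask->event_name[0] = '\0';
        mask->severity[0] = '\0';
        return 0;
    }
    if (strcmp(field, "event_name") == 0) {
        snprintf(mask->event_name, sizeof mask->event_name, "%s", value);
        return 0;
    }
    if (strcmp(field, "severity") == 0) {
        snprintf(mask->severity, sizeof mask->severity, "%s", value);
        return 0;
    }
    return -1;
}

/* An event matches when every non-empty mask field equals the
   corresponding event field (fields are combined with AND). */
bool sketch_mask_matches(const sketch_mask_t *mask, const sketch_event_t *ev)
{
    assert(mask != NULL && ev != NULL);
    bool name_ok = mask->event_name[0] == '\0'
                || strcmp(mask->event_name, ev->event_name) == 0;
    bool severity_ok = mask->severity[0] == '\0'
                    || strcmp(mask->severity, ev->severity) == 0;
    return name_ok && severity_ok;
}
```

Under this model, calling `sketch_create_mask(&m, "all", "init")` followed by `sketch_create_mask(&m, "severity", "FTB_FATAL")` yields a mask that accepts any event name but only fatal severity, which is the behavior the third recipe on the slide describes.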

  27. Developing an FTB-enabled component (Step 3) STEP 3: Provide options for the end user to compile your code with FTB • Modify configure.in and the makefiles so that your code can be compiled against FTB • ./configure --with-ftb=<PATH to FTB install directory>

  28. Setting up FTB environment Compiling FTB • Download FTB • ./configure --with-platform=linux --with-bstrap-name=hostname • make • make install

  29. Using FTB Starting FTB • ./ftb_database_server • ./ftb_agent on all Linux nodes • Run your component executables Connection Topology • Each FTB agent contacts the bootstrap DB server • The bootstrap server provides a parent address to the agent
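
The connection topology on slide 29 says only that an agent contacts the bootstrap server and receives a parent address. One way this could work is sketched below, assuming the server arranges agents in a fixed-fanout tree in arrival order. That assignment policy, the fanout value, and every `sketch_*` name are assumptions made for illustration; the slides do not describe the actual FTB policy.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_MAX_AGENTS 64  /* arbitrary capacity for this sketch */
#define SKETCH_FANOUT 2       /* assumed children per agent; not from the slides */

/* Hypothetical bootstrap-server state: agent addresses in arrival order. */
typedef struct {
    const char *addrs[SKETCH_MAX_AGENTS];
    size_t count;
} sketch_bootstrap_t;

/* An agent contacts the server with its own address. The server stores the
   address and writes the parent's address to *parent_out (NULL for the
   first agent, which becomes the root of the tree).
   Returns 0 on success, -1 if the table is full. */
int sketch_register_agent(sketch_bootstrap_t *bs, const char *addr,
                          const char **parent_out)
{
    assert(bs != NULL && addr != NULL && parent_out != NULL);
    if (bs->count >= SKETCH_MAX_AGENTS)
        return -1;
    size_t idx = bs->count;
    /* Heap-style layout: the agent in slot i has its parent in slot (i - 1) / fanout. */
    *parent_out = (idx == 0) ? NULL : bs->addrs[(idx - 1) / SKETCH_FANOUT];
    bs->addrs[idx] = addr;
    bs->count++;
    return 0;
}
```

With a fanout of 2, the first agent becomes the root, the second and third agents are told to connect to the first, and the fourth is told to connect to the second. The bootstrap server is only needed at join time in this sketch; events would then flow along the agent tree.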

  30. Open Issues We don’t know the answers to these questions, so we should not be discussing them in the BOF? • Policy management • Global knowledge of component prioritization for handling events • How can components announce their FT capabilities? • How can components request action from other components? • How do we establish scoping of events?
