13 May 2014 : Supervision Workshop (AtelierSupervision)

Presentation Transcript


  1. 13 May 2014 : Supervision Workshop (AtelierSupervision)
     Morning : outgoing messages
     • IPSL : general introduction : supervision of the computing and post-processing chain at TGCC and IDRIS
     • IPSL : needs and outline solutions for sending information messages from the computing centres to IPSL
     • IDRIS : existing tools and constraints
     • TGCC : existing tools and constraints
     • Discussion
     • Action plan
     Afternoon : incoming jobs
     • IPSL : needs for relaunching computing and/or post-processing jobs from IPSL
     • IDRIS : job submission : which solutions? which APIs? which constraints?
     • TGCC : job submission : which solutions? which APIs? which constraints?
     • Discussion
     • Action plan

  2. ANR MN2013 CONVERGENCE
     T0 : management
     T2 : towards a high-resolution coupled model
     T1 : platform
     • Improving coupled model parallelism in terms of computing and memory
     • Managing efficiently input and restart files
     • Integrating parallel interpolation mechanisms in XIOS
     • Parallel component coupling
     Diagram labels : ensemble of tools, different configurations, different resolution, set of simulations, set of diagnostics, assessment
     T3 : runtime environments
     • Process assignment
     • Optimization, Load balancing
     • Climate Simulations Supervision
     IPSL implementation
     T4 : Big Data management and analytics of Climate Simulations
     • XIOS implemented within project models
     • XIOS a bridge towards standardisation
     • Data and metadata services
     • Big Data Analytics
     GAME-CERFACS implementation
     T5 : CliMAF: a framework for climate models evaluation and analysis
     • General driver and upstream user interface
     • Services layer
     • Visualization tools
     • Evaluation and monitoring diagnostics

  3. Task 3.3 : Runtime Environment Leaders : Arnaud Caubel and Marie-Alice Foujols Contributors : IPSL, CERFACS, IDRIS, CNRS-GAME Help and expertise : TGCC, MDLS

  4. Task 3.3 : Climate Simulations Supervision
     Objective : libIGCM self-healing application : more reliability, less human intervention
     Diagram labels : User, Supervisor, Launch a simulation with libIGCM, Commands, Jussieu, IDRIS or TGCC, Computing, Post-processing, ?

  5. Task 3.3 : Climate Simulations Supervision
     Context
     • one simulation
     • 3 weeks running, 100 000 files, 25 TB, 1000 jobs : 40 computing and post-processing
     • static workflow vs dynamic workflow
     Development of a supervisor agent
     • detect and understand failure events
     • understand the ultimate goals of the workflow
     • re-plan, re-schedule, re-map the workflow
     Tasks for the supervisor
     • events logged in a comprehensive call tree (job submission, work to be done, each cp, ...)
     • reliable lightweight communication channel between client agents and server agents (RabbitMQ implementation of AMQP)
     • call tree traversal capabilities to determine the checkpoint restart
     • autonomous rescheduling of necessary jobs
     • monitoring capabilities : coloured graphs with all jobs and their status
     • regression test handling capabilities
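The call tree traversal listed among the supervisor tasks can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the `Event` structure, the `checkpoint` flag, and the job names are hypothetical and are not taken from libIGCM or from the supervisor design. The idea is only to walk the logged events in order and remember the last restartable point seen before the first failure.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Event:
    """One node of the event call tree (hypothetical structure)."""
    name: str
    status: str = "ok"        # "ok" or "failed"
    checkpoint: bool = False  # True if a restart can begin from this node
    children: List["Event"] = field(default_factory=list)


def find_restart_point(root: Event) -> Optional[Tuple[str, Optional[str]]]:
    """Pre-order walk of the tree.

    Returns (failed event, last checkpoint seen before it), or None when
    no event in the tree has failed.
    """
    last_checkpoint = None
    stack = [root]
    while stack:
        node = stack.pop()
        if node.status == "failed":
            return node.name, last_checkpoint
        if node.checkpoint:
            last_checkpoint = node.name
        # Reverse so that children are visited in their logged order.
        stack.extend(reversed(node.children))
    return None


# Hypothetical log: the rebuild attached to the second period failed.
tree = Event("simulation", children=[
    Event("job_period_1", checkpoint=True, children=[Event("rebuild_1")]),
    Event("job_period_2", checkpoint=True,
          children=[Event("rebuild_2", status="failed")]),
    Event("job_period_3", checkpoint=True),
])
print(find_restart_point(tree))
```

A real supervisor would also need to decide which downstream jobs to resubmit once the restart point is known; this sketch stops at locating it.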

  6. Task 3.3 : Climate Simulations Supervision

  7. Task 3.3 : Climate Simulations Supervision
     Additional manpower :
     • Fixed-term contract (CDD), 21 person-months (pm), IPSL (tasks 3.1, 3.2, 3.3) + CDD/IDRIS, 6 pm
     • Subcontractor, IPSL, 42 pm (task 3.3)
     • TGCC/CEA : service contract ?
     Success criteria :
     • A significant number of “standard” (i.e. “non-expert”) users of Earth System models launch typical climate simulations (including developments done in this WP) using the libIGCM runtime environment on HPC centres (IDRIS and TGCC)
     Identified risks :
     • if it is not possible to install the supervisor agent : lighter installation with warnings instead of corrections
     • the supervisor must be as transparent as possible : lighter usage, i.e. de/activation of main tasks/secondary tasks
     Planning for next 6 months :
     • Meeting/workshop to be planned to discuss “Supervisor Design” (task 3.3)

  8. Workflow diagram
     Computing job : Job_EXP00, repeated (PeriodLength, PeriodNb)
     Post-processing jobs :
     • RebuildFrequency : rebuild
     • PackFrequency : pack_debug, pack_restart, pack_output
     • SeasonalFrequency : create_se, then atlas
     • TimeSeriesFrequency : create_ts, then monitoring
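To make the chaining of these jobs concrete, the sketch below encodes the workflow as a dependency graph and derives a valid execution order. The exact dependencies are an assumption read off the diagram (for example that pack_output waits for rebuild, and that pack_debug and pack_restart follow the computing job directly); libIGCM itself may wire them differently. It uses only `graphlib` from the Python standard library (Python 3.9 or later).

```python
from graphlib import TopologicalSorter

# job -> set of jobs that must finish first (assumed from the diagram)
DEPENDS_ON = {
    "Job_EXP00": set(),
    "rebuild": {"Job_EXP00"},
    "pack_debug": {"Job_EXP00"},
    "pack_restart": {"Job_EXP00"},
    "pack_output": {"rebuild"},
    "create_ts": {"pack_output"},
    "create_se": {"pack_output"},
    "monitoring": {"create_ts"},
    "atlas": {"create_se"},
}


def downstream(job: str) -> set:
    """Return every job that directly or indirectly depends on `job`."""
    found = set()
    frontier = [job]
    while frontier:
        current = frontier.pop()
        for name, deps in DEPENDS_ON.items():
            if current in deps and name not in found:
                found.add(name)
                frontier.append(name)
    return found


# One valid execution order: predecessors always come before dependents.
order = list(TopologicalSorter(DEPENDS_ON).static_order())
print(order)

# Jobs a supervisor would have to reconsider if a rebuild fails.
print(sorted(downstream("rebuild")))
```

With such a graph, a failed rebuild immediately identifies the post-processing jobs that must not start, or must be relaunched, which is the kind of information a supervisor needs for rescheduling.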

  9. Generic job : AA_Job PeriodLength

  10. TGCC computers and file system in a nutshell (October 2013)
      Computers :
      • login : curie front-end, airain front-end
      • compute : curie thin nodes (-q standard), curie large nodes (-q xlarge), curie hybrid nodes (-q hybrid), airain nodes
      File system :
      • $HOME (small precious files, saved space) : sources, small results
      • $CCCWORKDIR : IGCM_OUT : MONITORING/ATLAS ; dods/work (dods_cp)
      • $SCRATCHDIR : temporary REBUILD ; IGCM_OUT : files to be packed, outputs of post-proc jobs
      • $CCCSTOREDIR : IGCM_OUT : packed results (Output, Analyse SE and TS) ; dods/store (dods_cp) ; ccc_hsm get ; HPSS : robotic tapes
      Other diagram labels : cp, quotas
      Legend : temporary space, non saved space, saved space, space on tapes, visible from www

  11. TGCC workflow
      • Compute (curie) : Job_EXP00, one job per PeriodLength ; $SCRATCHDIR/IGCM_OUT/.../REBUILD
      • Post (curie) : rebuild (RebuildFrequency) ; $SCRATCHDIR/IGCM_OUT/XXX/Output, $SCRATCHDIR/IGCM_OUT/XXX/Restart, Debug
      • Post (curie) : pack_restart, pack_debug, pack_output (PackFrequency ; tools : ncrcat, tar) ; $CCCSTOREDIR/IGCM_OUT/.../RESTART, DEBUG and $CCCSTOREDIR/IGCM_OUT/XXX/Output
      • Post (curie) : create_ts (TimeSeriesFrequency), then monitoring ; create_se (SeasonalFrequency), then atlas
      • TS and SE : $CCCSTOREDIR/IGCM_OUT/… → dods/store
      • MONITORING and ATLAS : $CCCWORKDIR → dods/work
      • DodsCopy=TRUE/FALSE

  12. IDRIS computers and file system in a nutshell (October 2013)
      Computers :
      • login : turing front-end, adapp front-end
      • compute : turing compute, adapp compute, ada compute
      File system :
      • $HOME (small precious files, saved space) : sources, small results
      • $WORKDIR : temporary REBUILD ; IGCM_OUT : files to be packed, outputs of post-proc jobs
      • $TMPDIR
      • gaya (mfput/mfget, dmput/dmget ; robotic tapes) : IGCM_OUT : Output, Analyse, MONITORING/ATLAS
      • dods (dods_cp)
      Legend : temporary space, non saved space, saved space, space on tapes, visible from www

  13. IDRIS workflow
      • Compute (ada) : Job_EXP00, one job per PeriodLength ; $WORKDIR/IGCM_OUT/.../REBUILD
      • Post (adapp) : rebuild (RebuildFrequency) ; $WORKDIR/IGCM_OUT/XXX/Output, $WORKDIR/IGCM_OUT/XXX/Restart, Debug
      • Post (adapp) : pack_restart, pack_debug, pack_output (PackFrequency ; tools : ncrcat, tar) ; gaya:IGCM_OUT/.../RESTART, DEBUG and gaya:IGCM_OUT/XXX/Output
      • Post (adapp) : create_ts (TimeSeriesFrequency), then monitoring ; create_se (SeasonalFrequency), then atlas
      • gaya:IGCM_OUT/… → dods.idris.fr
      • DodsCopy=TRUE/FALSE

  14. CM5AEH01 : 1850-2350

  15. CM5AEH01 : 500 years, 1850-2349
      • 500 years
      • PeriodLength=1M → 240 computing jobs, PeriodNb=12, 60
      • RebuildFrequency=1Y → 432 rebuild
      • PackFrequency=10Y → 43 pack_debug, 43 pack_restart, 43 pack_output
      • SeasonalFrequency=50Y → 8 create_se and 32 atlas
      • TimeSeriesFrequency=10Y → 757 create_ts and 43 monitoring
      • 12 manual interventions : 12/1641 = 0.73%
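The 1641 in the intervention rate is the sum of all job counts listed on this slide. A short Python check, using only the numbers given above (the dictionary keys are just labels chosen for this sketch):

```python
# Job counts as listed on the slide for simulation CM5AEH01.
job_counts = {
    "computing": 240,
    "rebuild": 432,
    "pack_debug": 43,
    "pack_restart": 43,
    "pack_output": 43,
    "create_se": 8,
    "atlas": 32,
    "create_ts": 757,
    "monitoring": 43,
}
manual_interventions = 12

total_jobs = sum(job_counts.values())
rate = manual_interventions / total_jobs

print(f"{total_jobs} jobs, {rate:.2%} needing manual intervention")
# prints: 1641 jobs, 0.73% needing manual intervention
```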

  16. CM5AEH01 : RunChecker.job

  17. CM5AEH01 : errors encountered
      Computing job errors :
      • 2123-11 and 2130-07 : Fatal : error writing restartphy, job blocked for a few hours → clean_month and relaunch
      • 2206-04 : Fatal : SLURM error → clean_month and relaunch
      • 2249-03 : Fatal : blocked for 3 hours, killed → clean_month and relaunch
      Post-processing job errors :
      • 1999, 2000, 2118 and 2127 : pack_restart (1999) and rebuild hit the time limit → pack_r and rebuild relaunched and, if needed, pack_output (2119, 2129)
      • 2166 and 2174 : rebuild KO : IGCM_sys_rebuild[1860]: /ccc/cont003/home/dsm/p86ipsl/X64_CURIE/bin/rebuild: cannot execute [Permission denied] → rebuild relaunched
      • 2059, 2079, 2119 and 2129 : pack_output launched too early → pack_output relaunched
      Other errors :
      • 13 monitoring KO : unstable environment problems (nco) between 10/3 and 30/31
      • 1 sub rebuild KO, resource temporarily unav : 3 attempts in libIGCM v2.2
      • IDRIS : all jobs disappeared
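Most of the errors listed here were fixed by the same few manual actions, which is what a supervisor could automate. The sketch below is a hypothetical rule table in Python that maps a log line to a suggested recovery action. The patterns and action strings are illustrative assumptions built from the messages quoted on this slide, not an existing libIGCM interface, and the fallback mirrors the "warning instead of correction" mode mentioned among the identified risks.

```python
import re

# (pattern, suggested action) pairs; illustrative only, not a libIGCM API.
RECOVERY_RULES = [
    (re.compile(r"error writing restartphy"),
     "clean_month, then relaunch computing job"),
    (re.compile(r"SLURM", re.IGNORECASE),
     "clean_month, then relaunch computing job"),
    (re.compile(r"time limit", re.IGNORECASE),
     "relaunch post-processing job"),
    (re.compile(r"Permission denied"),
     "relaunch rebuild"),
]


def suggest_action(log_line: str) -> str:
    """Return the first matching recovery action, or a warning fallback."""
    for pattern, action in RECOVERY_RULES:
        if pattern.search(log_line):
            return action
    # Unknown failure: warn the user instead of attempting a correction.
    return "notify user"


line = ("IGCM_sys_rebuild[1860]: /ccc/cont003/home/dsm/p86ipsl/"
        "X64_CURIE/bin/rebuild: cannot execute [Permission denied]")
print(suggest_action(line))
print(suggest_action("something never seen before"))
```

A table like this only covers failures that have already been observed, so the fallback to a human remains necessary for cases such as the disappearance of all jobs.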

  18. 13 May 2014 : Supervision Workshop (Atelier Supervision)
      Morning : outgoing messages
      • IPSL : general introduction : supervision of the computing and post-processing chain at TGCC and IDRIS
      • IPSL : needs and outline solutions for sending information messages from the computing centres to IPSL
      • IDRIS : existing tools and constraints
      • TGCC : existing tools and constraints
      • Discussion
      • Action plan
      Afternoon : incoming jobs
      • IPSL : needs for relaunching computing and/or post-processing jobs from IPSL
      • IDRIS : job submission : which solutions? which APIs? which constraints?
      • TGCC : job submission : which solutions? which APIs? which constraints?
      • Discussion
      • Action plan