Grid Checkpointing Architecture

Grid Checkpointing Architecture. Radosław Januszewski, CoreGrid Summer School 2007. Motivation: Grids are complex and therefore prone to errors. The distributed nature of the Grid makes scheduling system maintenance hard.

Presentation Transcript


  1. Grid Checkpointing Architecture Radosław Januszewski CoreGrid Summer School 2007

  2. motivation • Grids are complex and therefore prone to errors. • The distributed nature of the Grid makes scheduling system maintenance hard. • Each uncoordinated power-down or failure results in the loss of the currently running applications. • Lost computation time means additional cost! European Research Network on Foundations, Software Infrastructures and Applications for large scale distributed, GRID and Peer-to-Peer Technologies

  3. goal • To enhance the reliability, fault-tolerance and robustness of the Grid computing environment.

  4. the solution • Grid Checkpointing Architecture (GCA): a proposal for the placement, functionality and interaction schemes of checkpointing services in the Grid environment

  5. grid - model

  6. GCA in the Grid

  7. Proof of concept – the goals • check whether the GCA survives contact with reality • build the PoC on a real-life installation • the Grid with the GCA should provide additional value compared with the "traditional" approach

  8. GCA proof of concept installation

  9. involved elements • GUI: command line, GridSphere, Migrating Desktop • Broker: GRMS • Local Resource Manager: Globus + TORQUE • Core service: SGIckpt

  10. Bottom-up approach • How to make the checkpointer work with the local resource manager?

  11. pbs/torque special features • action checkpoint • action restart • action checkpoint_abort

  12. config

         $action checkpoint 0 !/usr/pbs/bin/pbs-mom-checkpoint.sh %globid %jobid %sid %taskid %path
         $action restart 0 !/usr/pbs/bin/pbs_restart_test.sh %path %taskid
         $restart_transmogrify true
         $action checkpoint_abort 0 !/usr/pbs/bin/pbs-mom-checkpoint-and-stop.sh %globid %jobid %sid %taskid %path

      A detailed description is available at http://checkpointing.psnc.pl
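The `$action` lines on slide 12 bind a checkpoint, restart or checkpoint-and-abort event to an external script and pass it job details through `%` placeholders. The real substitution happens inside `pbs_mom`; the Python sketch below only illustrates the idea. The helper `expand_action` and all sample values are hypothetical and are not taken from the presentation or from TORQUE.

```python
import shlex


def expand_action(line, values):
    """Split a script-style $action line and substitute its %placeholders.

    Hypothetical illustration only: pbs_mom performs this step internally.
    """
    parts = line.split()
    # Expected layout: $action <event> <timeout> !<script> [args...]
    if len(parts) < 4 or parts[0] != "$action" or not parts[3].startswith("!"):
        raise ValueError(f"not a script-style $action line: {line!r}")
    event, timeout = parts[1], int(parts[2])
    argv = [parts[3][1:]]  # drop the leading '!' that marks an external script
    for token in parts[4:]:
        if token.startswith("%"):
            argv.append(str(values[token[1:]]))  # %jobid -> values["jobid"]
        else:
            argv.append(token)
    return event, timeout, argv


# The checkpoint line from slide 12.
line = ("$action checkpoint 0 !/usr/pbs/bin/pbs-mom-checkpoint.sh "
        "%globid %jobid %sid %taskid %path")

# Made-up job details, for illustration only.
values = {
    "globid": "none",
    "jobid": "42.node-03",
    "sid": 12345,
    "taskid": 1,
    "path": "/var/spool/ckpt/42",
}

event, timeout, argv = expand_action(line, values)
print(event, timeout, shlex.join(argv))
```

The argument order matters: the checkpoint scripts receive the global id first and the checkpoint path last, while the restart script receives the path first.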

  13. Broker – local RM connectivity

  14. problem • The checkpointer: a service or resource?

  15. job description with checkpointing

         <grmsJob appid="matrix_demo_submit">
           <task taskid="matrix" persistent="true" crucial="true">
             <resource>
               <localrmname>pbs</localrmname>
             </resource>
             <executable type="multiple" count="1">
               <execfile name="matrixi">
                 <url>gsiftp://xxx.xxx.xxx.xxxl//home/user/povray</url>
               </execfile>
             </executable>
             <other>
               <grms_id>${JOB_ID}</grms_id>
               <checkpointable>true</checkpointable>
               <period>1</period>
             </other>
           </task>
         </grmsJob>
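The job description on slide 15 marks a task as checkpointable through the `<checkpointable>` and `<period>` elements inside `<other>`. As a minimal sketch, a description with the same structure could be generated with Python's standard `xml.etree.ElementTree`. The function name `build_job`, its parameters and the example URL are assumptions made for illustration; only the element and attribute names come from the slide.

```python
import xml.etree.ElementTree as ET


def build_job(appid, taskid, execfile_name, url, period=1):
    """Build a GRMS-style job description mirroring the layout on slide 15.

    Hypothetical helper: only the tag and attribute names are from the slide.
    """
    job = ET.Element("grmsJob", appid=appid)
    task = ET.SubElement(job, "task", taskid=taskid,
                         persistent="true", crucial="true")

    # Target the PBS/TORQUE local resource manager, as on the slide.
    resource = ET.SubElement(task, "resource")
    ET.SubElement(resource, "localrmname").text = "pbs"

    executable = ET.SubElement(task, "executable", type="multiple", count="1")
    execfile = ET.SubElement(executable, "execfile", name=execfile_name)
    ET.SubElement(execfile, "url").text = url

    # Checkpointing options live in the <other> section.
    other = ET.SubElement(task, "other")
    ET.SubElement(other, "grms_id").text = "${JOB_ID}"
    ET.SubElement(other, "checkpointable").text = "true"
    ET.SubElement(other, "period").text = str(period)
    return job


# Placeholder URL, not a real endpoint.
job = build_job("matrix_demo_submit", "matrix", "matrix_long",
                "gsiftp://example.org//home/user/matrix_long")
print(ET.tostring(job, encoding="unicode"))
```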

  16. the end-user point of view

  17. manual scenario

  18. manual scenario - restart

  19.

         <grmsJob appid="matrix_demo_resume">
           <task taskid="matrix" persistent="true" crucial="true">
             <resource>
               <hostname>node-03.checkpointing.psnc.pl</hostname>
               <localrmname>pbs</localrmname>
             </resource>
             <executable type="multiple" count="1">
               <execfile name="matrix_long">
                 <url>gsiftp://xxx.xxx.xxx.xxx//home/xxxxxx/test_apps/matrix_long</url>
               </execfile>
             </executable>
             <other>
               <grms_id>${JOB_ID}</grms_id>
               <recovery>true</recovery>
               <ckpt_id>1179315947518_matrix_demo_submit_0459</ckpt_id>
               <checkpointable>true</checkpointable>
               <period>1</period>
             </other>
           </task>
         </grmsJob>
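Compared with the submit description on slide 15, the resume description on slide 19 adds `<recovery>` and `<ckpt_id>` to `<other>` and pins the task to a host through `<hostname>`. The sketch below derives such a resume description from a submit description. The function `make_resume`, the trimmed `SUBMIT` document and its placeholder URL are assumptions for illustration; the checkpoint id and hostname are the ones shown on the slide.

```python
import copy
import xml.etree.ElementTree as ET

# Trimmed-down submit description; the URL is a placeholder.
SUBMIT = """
<grmsJob appid="matrix_demo_submit">
  <task taskid="matrix" persistent="true" crucial="true">
    <resource>
      <localrmname>pbs</localrmname>
    </resource>
    <executable type="multiple" count="1">
      <execfile name="matrix_long">
        <url>gsiftp://example.org//home/user/matrix_long</url>
      </execfile>
    </executable>
    <other>
      <grms_id>${JOB_ID}</grms_id>
      <checkpointable>true</checkpointable>
      <period>1</period>
    </other>
  </task>
</grmsJob>
"""


def make_resume(job, ckpt_id, hostname=None, appid=None):
    """Return a copy of `job` that asks for recovery from `ckpt_id`.

    Hypothetical helper; the input element is left unmodified.
    """
    resumed = copy.deepcopy(job)
    if appid is not None:
        resumed.set("appid", appid)
    task = resumed.find("task")
    other = task.find("other")

    # Drop stale recovery settings so the function can be applied repeatedly.
    for tag in ("recovery", "ckpt_id"):
        for element in other.findall(tag):
            other.remove(element)

    # Insert the recovery elements right after <grms_id>, as on slide 19.
    grms_id = other.find("grms_id")
    index = list(other).index(grms_id) + 1 if grms_id is not None else 0
    recovery = ET.Element("recovery")
    recovery.text = "true"
    checkpoint = ET.Element("ckpt_id")
    checkpoint.text = ckpt_id
    other.insert(index, recovery)
    other.insert(index + 1, checkpoint)

    # Optionally pin the restart to a specific node.
    if hostname is not None:
        host = ET.Element("hostname")
        host.text = hostname
        task.find("resource").insert(0, host)
    return resumed


job = ET.fromstring(SUBMIT.strip())
resumed = make_resume(job,
                      ckpt_id="1179315947518_matrix_demo_submit_0459",
                      hostname="node-03.checkpointing.psnc.pl",
                      appid="matrix_demo_resume")
print(ET.tostring(resumed, encoding="unicode"))
```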

  20. failure – end-user view

  21. problem • This semi-automatic solution is not optimal. • How can automatic job failure handling be added without introducing new functionality in the Broker? • Use workflows!

  22. the workflow • Problem: this broker cannot model loops
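The transcript does not show how the missing loop construct was worked around. One common approach, given here purely as an assumption and not as the presenter's method, is to unroll "retry until success" into a fixed chain of tasks, where each restart step is triggered by the failure of the previous one and resumes from the latest checkpoint. The `Step` type, the `FAILED` trigger name and `unroll_retries` are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Step:
    task_id: str
    runs_after: Optional[str]  # predecessor task; None for the first attempt
    trigger: Optional[str]     # predecessor state that starts this step
    recover: bool              # True if the step restarts from a checkpoint


def unroll_retries(base_id: str, max_restarts: int) -> List[Step]:
    """Unroll a retry loop into a linear chain of at most max_restarts restarts.

    Hypothetical sketch: a loop-free workflow can only approximate
    "retry until success" with a bounded number of attempts.
    """
    if max_restarts < 0:
        raise ValueError("max_restarts must be non-negative")
    steps = [Step(f"{base_id}_0", None, None, False)]
    for attempt in range(1, max_restarts + 1):
        steps.append(Step(task_id=f"{base_id}_{attempt}",
                          runs_after=f"{base_id}_{attempt - 1}",
                          trigger="FAILED",
                          recover=True))
    return steps


steps = unroll_retries("matrix", 3)
for step in steps:
    print(step)
```

The trade-off of unrolling is that the number of automatic restarts must be fixed when the workflow is submitted.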

  23. automatic scenario

  24. end-user point of view

  25. the benefits • user: a more robust and fault-tolerant Grid environment • sysadmin: much easier system management thanks to the automatic checkpoint and recovery mechanism

  26. Thank you!