
Towards Highly Available OSG (Open Science Grid)

VISHAL RAMPURE, Louisiana Tech University



  1. Towards Highly Available OSG (Open Science Grid) VISHAL RAMPURE Louisiana Tech University

  2. OVERVIEW • INTRODUCTION • HA-OSCAR • OSG services • Proposed work • Fault Tolerance • Data Replication • Recovery • Conclusion

  3. INTRODUCTION • High Availability: a system or component that is continuously operational for a long period of time. • With respect to a grid, high availability refers to the continuous operation of a grid resource. • In other words, the uptime of a service provided by a grid resource is maintained at a high level.
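The "uptime maintained at a high level" idea can be made concrete with the standard steady-state availability ratio, MTTF / (MTTF + MTTR). The following is a minimal sketch; the function names and the sample figures are illustrative assumptions and do not come from the presentation.

```python
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: fraction of time the service is up.

    mttf_hours: mean time to failure (average uptime between failures)
    mttr_hours: mean time to repair (average downtime per failure)
    """
    return mttf_hours / (mttf_hours + mttr_hours)


def downtime_per_year_hours(avail: float) -> float:
    """Expected downtime over a 365-day year for a given availability."""
    return (1.0 - avail) * 365 * 24


if __name__ == "__main__":
    # Hypothetical figures: one failure roughly every 999 hours, 1 hour to repair.
    a = availability(mttf_hours=999.0, mttr_hours=1.0)
    print(f"availability = {a:.4f}")
    print(f"downtime/year = {downtime_per_year_hours(a):.2f} h")
```

Shortening the repair time (for example through automatic failover) raises availability even when the failure rate itself does not change.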

  4. Continued… • If a system providing services crashes, those services are unavailable until the system is repaired. • HA is mostly provided through redundancy. • A system failure is detected by continuously monitoring the system's health, and the service is restarted on another system automatically, without human intervention.

  5. HA-OSCAR (High Availability Open Source Cluster Application Resources) • HA-OSCAR is an open source project that aims to combine high availability and performance computing in a single solution. • The motivation is to enhance the Beowulf cluster system for critical-grade applications. • Component redundancy is adopted to provide high availability on an HA-OSCAR cluster.

  6. Continued… • HA-OSCAR incorporates the following features to eliminate single points of failure in an HA-OSCAR cluster: • Self-healing mechanism • Failure detection • Recovery • Automatic failover • Automatic fail-back

  7. HA-OSCAR FAILOVER STRATEGY

  8. HA-OSCAR VS BEOWULF CLUSTER [Figure: side-by-side comparison; panels: HA-OSCAR, Beowulf]

  9. OSG SERVICES • The Open Science Grid (OSG) is a distributed computing infrastructure for large-scale scientific research. • OSG provides many services, some optional and some critical. • The Virtual Data Toolkit (VDT) provides the basic grid infrastructure, with critical services such as GRAM and GridFTP and optional services such as UberFTP and MonALISA.

  10. Some of the services provided by OSG • Virtual Data Toolkit • GT4 services • GRAM • GridFTP • Condor • VOMS • Monitoring and Information Services • Core MIS • GridCat • MonALISA

  11. Continued… • Resource Selection Service • Condor Match Making Service • Storage Services • Disk Resource Manager

  12. Proposed Work • Our aim is to provide an HA-enabled infrastructure that monitors the critical services on a grid resource. • We intend to provide an HA-enabled grid infrastructure with three functionalities: • Fault Tolerance • Data Replication • Recovery

  13. How do we intend to make OSG Fault Tolerant? • Fault tolerance is the ability of a system to respond gracefully to an unexpected hardware or software failure. • There are many levels of fault tolerance, the lowest being the ability to continue operation in the event of a power failure. • Many fault-tolerant systems mirror all operations; that is, every operation is performed on two or more duplicate systems, so if one fails the other can take over.
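The mirroring idea on this slide can be sketched as a store that applies every write to all duplicate systems and reads from the first one that is still up. This is only an illustration: `Replica` and `MirroredStore` are hypothetical names, the replicas are in-memory dictionaries, and resynchronizing a replica that missed writes while it was down is not handled.

```python
class Replica:
    """One of the duplicate systems; holds its own copy of the data."""

    def __init__(self, name):
        self.name = name
        self.data = {}
        self.alive = True

    def apply(self, key, value):
        if not self.alive:
            raise RuntimeError(f"{self.name} is down")
        self.data[key] = value

    def read(self, key):
        if not self.alive:
            raise RuntimeError(f"{self.name} is down")
        return self.data[key]


class MirroredStore:
    """Performs every write on all live replicas; reads from the first live one."""

    def __init__(self, replicas):
        self.replicas = list(replicas)

    def write(self, key, value):
        applied = 0
        for replica in self.replicas:
            try:
                replica.apply(key, value)
                applied += 1
            except RuntimeError:
                continue  # skip a failed replica, keep mirroring to the others
        if applied == 0:
            raise RuntimeError("no replica available")

    def read(self, key):
        for replica in self.replicas:
            try:
                return replica.read(key)
            except RuntimeError:
                continue  # fall through to the next duplicate system
        raise RuntimeError("no replica available")


if __name__ == "__main__":
    primary, mirror = Replica("primary"), Replica("mirror")
    store = MirroredStore([primary, mirror])
    store.write("job-42", "running")
    primary.alive = False          # simulate a failure of one system
    print(store.read("job-42"))    # served by the surviving mirror
```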

  14. Continued… • We intend to provide a fault-tolerant OSG system using the HA-OSCAR infrastructure. • There would be two identical systems: one active (primary) and the other passive (standby). The standby system continuously monitors the health of the primary system. • The standby takes over the functionality of the primary as soon as it detects a failure.
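The active/standby arrangement above can be sketched as a heartbeat timeout: the standby records the last time it heard from the primary and promotes itself once that signal is overdue. This is a minimal sketch and not HA-OSCAR's actual mechanism; the `StandbyMonitor` class, the timeout value, and the explicit `now` timestamps are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class StandbyMonitor:
    """Standby-side view of the primary's health (hypothetical helper)."""

    timeout: float               # seconds without a heartbeat before takeover
    last_heartbeat: float = 0.0  # timestamp of the most recent heartbeat
    role: str = "standby"

    def on_heartbeat(self, now: float) -> None:
        """Record that the primary was seen alive at time `now`."""
        self.last_heartbeat = now

    def check(self, now: float) -> str:
        """Promote the standby to active if the heartbeat is overdue."""
        if self.role == "standby" and now - self.last_heartbeat > self.timeout:
            self.role = "active"
        return self.role


if __name__ == "__main__":
    monitor = StandbyMonitor(timeout=3.0)
    monitor.on_heartbeat(now=10.0)
    print(monitor.check(now=12.0))  # heartbeat still fresh
    print(monitor.check(now=14.0))  # heartbeat overdue, standby takes over
```

Timestamps are passed in explicitly so the logic can be exercised without waiting on a real clock; a deployment would feed it a monotonic clock and a real health signal. Fail-back is deliberately left out of this sketch.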

  15. How do we intend to provide Data Replication? • Data replication is the copying of data to and from sites to improve local service response times and availability; it is frequently employed as part of a backup and recovery strategy. • In case a disaster occurs, recovery ability and speed are critical. • Every time HA-OSCAR is completely re-installed or the kernel is updated, before-and-after ghost images are saved.

  16. Continued… • A snapshot of the old and new kernel is taken and gzipped, and the image is sent to the secondary head node as well as to a predefined disaster recovery site. • Important OSG data, as well as application and configuration files, can also be included in the ghost image. • Running jobs are checkpointed and replicated on the standby node so that they can be recovered quickly after a failure.
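The snapshot, gzip, and send steps can be sketched as follows. This is an illustration only: local directories stand in for the secondary head node and the disaster recovery site, the `snapshot_and_replicate` helper is hypothetical, and the SHA-256 digest is an added convenience for checking the copies rather than something the slides describe.

```python
import gzip
import hashlib
import shutil
import tempfile
from pathlib import Path
from typing import List


def snapshot_and_replicate(image: Path, destinations: List[Path]) -> str:
    """Gzip `image`, copy the archive to every destination, return its SHA-256."""
    archive = image.with_name(image.name + ".gz")
    with open(image, "rb") as src, gzip.open(archive, "wb") as dst:
        shutil.copyfileobj(src, dst)           # stream, so large images fit in memory
    digest = hashlib.sha256(archive.read_bytes()).hexdigest()
    for dest in destinations:
        dest.mkdir(parents=True, exist_ok=True)
        shutil.copy2(archive, dest / archive.name)
    return digest


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        image = root / "kernel.img"            # stand-in for a ghost image
        image.write_bytes(b"example image contents")
        digest = snapshot_and_replicate(
            image,
            [root / "secondary_head_node", root / "disaster_recovery_site"],
        )
        print("archive sha256:", digest)
```

In a real deployment the copy step would go over the network; the checksum lets each receiving site confirm that its copy matches what the primary sent.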

  17. How do we intend to provide Recovery mechanism? • We intend to provide an automatic fault detection and recovery mechanism for an HA-enabled grid (OSG). • Critical grid services such as GridFTP and globus-gatekeeper can be continuously monitored. • The failure of these services will generate an alert, upon which the standby can take over. • The checkpointed jobs replicated on the standby node would then be restarted.
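The recovery flow on this slide (monitor the critical services, raise an alert, let the standby take over, restart the checkpointed jobs) can be sketched as a small controller. Everything here is a hypothetical illustration: the `FailoverController` class, the boolean probe callables, and the job names are assumptions, and real probes would query the actual GridFTP and globus-gatekeeper services.

```python
from typing import Callable, Dict, List


class FailoverController:
    """Polls service probes on the primary and switches nodes on failure."""

    def __init__(self, probes: Dict[str, Callable[[], bool]],
                 checkpointed_jobs: List[str]):
        self.probes = probes                        # service name -> health check
        self.checkpointed_jobs = list(checkpointed_jobs)
        self.active_node = "primary"
        self.alerts: List[str] = []
        self.restarted: List[str] = []

    def poll(self) -> str:
        """Run every probe once and return the node that is now active."""
        failed = [name for name, probe in self.probes.items() if not probe()]
        if failed and self.active_node == "primary":
            # Failure detected: alert, fail over, restart replicated jobs.
            self.alerts.append("service failure: " + ", ".join(failed))
            self.active_node = "standby"
            self.restarted = list(self.checkpointed_jobs)
        elif not failed and self.active_node == "standby":
            # Primary services are healthy again: fail back.
            self.active_node = "primary"
        return self.active_node


if __name__ == "__main__":
    health = {"gridftp": True, "globus-gatekeeper": True}
    controller = FailoverController(
        {name: (lambda name=name: health[name]) for name in health},
        checkpointed_jobs=["job-1", "job-2"],
    )
    print(controller.poll())        # all services healthy
    health["gridftp"] = False       # simulate a GridFTP failure
    print(controller.poll(), controller.alerts, controller.restarted)
    health["gridftp"] = True        # primary repaired
    print(controller.poll())
```

Injecting the probes as callables keeps the failover logic independent of how each service is actually checked.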

  18. Critical Service Monitoring & Failover-Failback

  19. CONCLUSION • Providing an HA-enabled grid infrastructure would make an OSG grid resource more reliable. • The HA-OSCAR solution for an OSG site manager provides better availability, self-healing, and fault tolerance. • HA-OSCAR ensures that interruptions to critical grid and cluster services are minimized.
