
Adding High Availability to Condor Central Manager

Adding High Availability to Condor Central Manager. Artyom Sharov, Technion – Israel Institute of Technology, Haifa.


Presentation Transcript


  1. Adding High Availability to Condor Central Manager Artyom Sharov, Technion – Israel Institute of Technology, Haifa Artyom Sharov, Technion, Haifa

  2. Condor Pool without High Availability [Diagram: Central Manager (Negotiator, Collector); Startd and Schedd nodes]

  3. Why Highly Available CM? • Central Manager is a single point of failure • No additional matches are possible • Condor tools do not work • Unfair resource sharing and user priorities • Our goal: continuous pool functioning in case of failure

  4. Highly Available Condor Pool [Diagram: Highly Available Central Manager; Startd and Schedd nodes]

  5. Solution Requirements • Automatic failure detection • Transparent failover • “Split brain” reconciliation • Persistency of CM state • No changes to CM code

  6. Condor Pool with HA [Diagram: Highly Available Central Manager; labels: Replicator and HAD (three pairs), Collector (three), Negotiator (one)]

  7. HA – Election + Main [Sequence diagram: Backup 1, Backup 2, Backup 3] • #1: Election messages; one backup: “I win”, Raise Negotiator; the other two: “I lose” • #2: “I am alive” (Active)

  8. HA – Crash [Sequence diagram: Active, Backup 1, Backup 2] • #3: Failure detection; Election messages; one backup: “I win”, Raise Negotiator; the other: “I lose” • #4: “I am alive” (Active)
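The failure detection step on slides 7 and 8 can be pictured as a heartbeat timeout. The following is only a minimal sketch under stated assumptions: the slides show "I am alive" messages and a "Failure detection" event, but not the mechanism, so the `FailureDetector` class, its `timeout` parameter, and the rule "silence longer than the timeout means failure" are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class FailureDetector:
    """Hypothetical backup-side tracker of "I am alive" messages."""

    timeout: float            # assumed: seconds of silence tolerated
    last_heard: float = 0.0   # time of the most recent "I am alive"

    def on_alive(self, now: float) -> None:
        # Record the arrival time of an "I am alive" message.
        self.last_heard = now

    def active_failed(self, now: float) -> bool:
        # Assumed rule: the active CM is considered failed once it has
        # been silent for longer than the timeout.
        return now - self.last_heard > self.timeout


detector = FailureDetector(timeout=5.0)
detector.on_alive(now=10.0)
print(detector.active_failed(now=14.0))  # False: 4 s of silence
print(detector.active_failed(now=16.0))  # True: 6 s of silence
```

Passing `now` explicitly keeps the sketch deterministic; a real daemon would read a clock instead.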

  9. Replication – Main + Joining [Sequence diagram: Active, Backup, Joining] • #1: State update • #2: Solicit version; Solicit version reply; Pick Best Replica; Downloading request • #3: State update
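The "Pick Best Replica" step on slide 9 could look roughly like the sketch below. It assumes that each solicit-version reply carries an integer version and that "best" means the highest version; the slide states neither, and the names `VersionReply` and `pick_best_replica` and the host names are hypothetical.

```python
from typing import NamedTuple, Sequence


class VersionReply(NamedTuple):
    replica: str   # hypothetical host name of the replying replica
    version: int   # assumed: monotonically increasing state version


def pick_best_replica(replies: Sequence[VersionReply]) -> VersionReply:
    """Return the reply with the highest version (assumed meaning of 'best')."""
    if not replies:
        raise ValueError("no replica answered the version solicitation")
    return max(replies, key=lambda reply: reply.version)


replies = [VersionReply("cm1", 41), VersionReply("cm2", 42)]
best = pick_best_replica(replies)
print(f"download state from {best.replica} (version {best.version})")
```

A joining replica would then send its downloading request to the chosen host.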

  10. Replication – Crash [Sequence diagram: Active, Backup 1, Backup 2] • #4: State update; Failure detection • #5: State update (Active)

  11. Configuration • Stabilization time • Depends on number of CMs and network performance • HAD_CONNECT_TIMEOUT – upper bound on the time to establish a TCP connection • Example: with HAD_CONNECT_TIMEOUT = 2 and 2 CMs, a new Negotiator is guaranteed to be up and running within 48 seconds • Replication frequency • REPLICATION_INTERVAL
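The slide gives a single data point for stabilization time: 48 seconds for a timeout of 2 and 2 CMs. The sketch below assumes the bound grows linearly in both quantities with a constant factor of 12, which reproduces that example (12 × 2 × 2 = 48). The factor and the linear form are assumptions, not something the slide states, and `stabilization_bound` is a hypothetical helper.

```python
# Assumption: chosen so that the slide's example (timeout 2, 2 CMs) gives 48.
STABILIZATION_FACTOR = 12


def stabilization_bound(num_cms: int, had_connect_timeout: int) -> int:
    """Assumed upper bound, in seconds, until a new Negotiator is running."""
    if num_cms < 1 or had_connect_timeout < 1:
        raise ValueError("both arguments must be positive")
    return STABILIZATION_FACTOR * num_cms * had_connect_timeout


print(stabilization_bound(num_cms=2, had_connect_timeout=2))  # 48
```

Under this assumption, adding a third CM with the same timeout would raise the bound, which matches the slide's remark that stabilization time depends on the number of CMs.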

  12. Testing • Automatic distributed testing framework: simulation of node crashes, network disconnections, network partitions and merges • Extensive testing: • distributed testing on 5 machines in the Technion • interactive distributed testing in Wisconsin pool • automatic testing with NMI framework

  13. HA in Production • Already deployed and fully functioning for more than a year in • Technion • GLOW, UW • California Department of Water Resources, Delta Modeling Section, Sacramento, CA • Hartford Life • Cycle Computing • Additional commercial users

  14. Usability and Administration • HAD Monitoring System • Configuration/administration utilities • Detailed manual section • Full support by Technion team

  15. Future Work • HA in WAN • HAIFA – High Availability Is For Anyone • HA for any Condor service (e.g., HA for schedd) • More consistency schemes and HA semantics • Dynamic registration of services requiring HA • Dynamic addition/removal of replicas • More details in "Materializing Highly Available Grids" - hot topic paper, to appear in HPDC 2006.

  16. Collaboration with Condor Team • Ongoing collaboration for 3 years • Compliance with Condor coding standards • Peer-reviewed code • Integration with NMI framework • Automation of testing • Open-minded attitude of Condor team to numerous requests and questions • Unique experience of working with a large peer-managed group of talented programmers

  17. Collaboration with Condor Team This work was a collaborative effort of: • Distributed Systems Laboratory in Technion • Prof. Assaf Schuster, Gabi Kliot, Mark Zilberstein, Artyom Sharov • Condor team • Prof. Miron Livny, Nick, Todd, Derek, Greg, Anatoly, Peter, Becky, Bill, Tim

  18. You Should Definitely Try It • Part of the official 6.7.18 development release • Will soon appear in stable 6.8 release • More information: • http://dsl.cs.technion.ac.il/projects/gozal/project_pages/ha/ha.html • http://dsl.cs.technion.ac.il/projects/gozal/project_pages/replication/replication.html • more details + configuration in my tutorial • Contact: • {gabik,marks,sharov}@cs.technion.ac.il • condor-users@cs.wisc.edu

  19. In case of time

  20. Replication – “Split Brain” [Diagram: Active 1 and Active 2, each with HAD and Replication; Merge of networks] • “I am alive, Active 2” and “I am alive, Active 1” are exchanged • Decision making at Active 1: my ID > ‘Active 2’ ID, I am a leader • Decision making at Active 2: my ID < ‘Active 1’ ID, give up
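The decision rule on slide 20 compares the two actives' IDs after a network merge. A minimal sketch of that comparison follows; the function name `resolve_split_brain`, the integer ID type, and the treatment of equal IDs as an error are assumptions made for illustration.

```python
def resolve_split_brain(my_id: int, other_id: int) -> str:
    """Apply the slide's rule: the active with the larger ID stays leader."""
    if my_id == other_id:
        # Assumption: IDs are unique, so a tie signals a misconfiguration.
        raise ValueError("two actives must not share an ID")
    return "leader" if my_id > other_id else "give up"


# Both actives run the same rule on the other's "I am alive" message
# and reach complementary decisions.
print(resolve_split_brain(my_id=7, other_id=3))  # leader
print(resolve_split_brain(my_id=3, other_id=7))  # give up
```

Because the comparison is symmetric, exactly one of the two actives remains leader without any further message exchange.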

  21. Replication – “Split Brain” [Diagram: Active and Backup, each with HAD and Replication; Merge of networks] • “You’re leader” • Merging versions from two pools • ‘Active 2’ last version before merge • State update

  22. HAD State Diagram

  23. RD State Diagram