GOCDB failover status and plans

COD-19, 01/04/2009

G.Mathieu, A.Cavalli, C.Peter, P.Sologna


Assessment and progress

  • Last week's outage at RAL

    • a good (!) use case for testing our procedures and listing improvements

  • DNS aspect

    • new DNS machine at CNAF


Last RAL outage

  • Timeline

    • 05:20 UTC - power glitch at RAL

    • 08:00 - start of failover process

    • 09:20 - DNS switch complete

    • 10:00 - failover working properly

    • 13:25 - reverse DNS switch
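
The intervals between the timeline entries can be worked out with a few lines of Python. This is only an illustrative sketch: it assumes every time on the slide is UTC and falls on the same day (only the first entry is explicitly marked UTC), and the helper names `parse` and `elapsed` are made up for this example.

```python
from datetime import datetime, timedelta

# Timeline entries copied from the slide.
# Assumption: all times are UTC and on the same day.
TIMELINE = [
    ("05:20", "power glitch at RAL"),
    ("08:00", "start of failover process"),
    ("09:20", "DNS switch complete"),
    ("10:00", "failover working properly"),
    ("13:25", "reverse DNS switch"),
]


def parse(hhmm: str) -> datetime:
    """Parse an HH:MM string; the date part is a placeholder and cancels out."""
    return datetime.strptime(hhmm, "%H:%M")


def elapsed(start: str, end: str) -> timedelta:
    """Return the time between two HH:MM entries on the same day."""
    return parse(end) - parse(start)


if __name__ == "__main__":
    previous = None
    for time, event in TIMELINE:
        gap = "" if previous is None else f" (+{elapsed(previous, time)})"
        print(f"{time} {event}{gap}")
        previous = time
    # Start of the failover process until the failover worked properly.
    print("failover process:", elapsed("08:00", "10:00"))
```

Under these assumptions, the failover process itself (08:00 to 10:00) accounts for the 2h quoted in the post mortem, while the gap between the power glitch and the start of the process is 2h40.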


Post mortem

  • Good things

    • failover worked

    • DNS swap quick, efficient and transparent

    • Good synchronisation

    • CNAF IRC channel was useful

  • Problems encountered

    • Problems with CNAF DB schema

    • DB connection from ITWM to RAL

    • SSL issues

    • The complete swap took a rather long time overall (2h)


Proposed improvements (1)

  • Improve manual process

    • Reduce the number of people needed: any one of several people should be able to carry out the whole chain alone

    • Create scripts to reduce number of actions

  • Sort out CNAF schema issue

    • Improve current synchronisation mechanism

  • Contacts and documentation

    • Keep a list of phone contacts and alternative mail addresses to use in case the main mail system does not work

    • Document all processes


Proposed improvements (2)

  • Regular tests

    • Test CNAF replica DB

    • ITWM web interface

    • All possible scenarios

  • Configuration improvements

    • Simplify configuration file

    • Have the service itself publish the fact that it is in read-only mode

  • Automation

    • Work with OAT monitoring group

    • Automate DB switch

    • Automate portal switch the same way
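
As a rough illustration of what an automated switch could look like, the sketch below switches to the replica only after several consecutive failed probes of the primary, and exposes a read-only flag that the service could publish. Everything here is hypothetical: the class name `FailoverMonitor`, the threshold value and the "primary"/"replica" labels are assumptions made for this example, not part of the GOCDB setup or of any OAT monitoring tool.

```python
from dataclasses import dataclass


@dataclass
class FailoverMonitor:
    """Hypothetical failover decision helper (illustration only)."""

    threshold: int = 3       # assumed: consecutive failed probes before switching
    failures: int = 0        # current run of consecutive failures
    active: str = "primary"  # "primary" (read-write) or "replica" (read-only)

    def record(self, probe_ok: bool) -> str:
        """Feed one probe result for the primary; return the site to serve from."""
        if probe_ok:
            self.failures = 0
        else:
            self.failures += 1
            if self.failures >= self.threshold:
                self.active = "replica"
        return self.active

    def restore(self) -> None:
        """Switch back to the primary; kept as an explicit, deliberate step."""
        self.failures = 0
        self.active = "primary"

    def status(self) -> dict:
        """Status the service could publish, including the read-only flag."""
        return {"active": self.active, "read_only": self.active == "replica"}
```

A single failed probe does not trigger a switch in this sketch, and a successful probe while on the replica does not switch back automatically; the return to the primary stays a separate step, in the same way the reverse DNS switch was a distinct action during the outage.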


Actions list (1)

  • Doc and processes

    • Gilles to draft process + test documentation

    • Christian to add [email protected] tests to ITWM procedures

    • All: provide contacts (phone, alternate mail, etc.)

  • Access to machines

  • Scripting

    • Gilles to write scripts to change GOC portal conf

    • Peter/Ale to write DNS configuration scripts
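
A configuration-switching script of the kind listed above might look like the following minimal sketch. The slides do not describe the format of the GOC portal configuration, so the INI layout, the `database` section and the `host` and `read_only` keys are all assumptions chosen for illustration, as are the example host names.

```python
import configparser
import io


def switch_database(conf_text: str, new_host: str, read_only: bool) -> str:
    """Return a copy of an INI-style configuration pointing at another database.

    Assumption: the configuration has a [database] section with `host`
    and `read_only` keys. Raises KeyError if the section is missing.
    """
    parser = configparser.ConfigParser()
    parser.read_string(conf_text)
    parser["database"]["host"] = new_host
    parser["database"]["read_only"] = "true" if read_only else "false"
    out = io.StringIO()
    parser.write(out)
    return out.getvalue()


if __name__ == "__main__":
    # Hypothetical configuration and host names, for demonstration only.
    original = "[database]\nhost = primary-db.example.org\nread_only = false\n"
    print(switch_database(original, "replica-db.example.org", read_only=True))
```

Keeping the change in one function that takes and returns text makes it easy to test without touching the live file, which fits the goal of letting one person run the whole chain alone.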


Actions list (2)

  • Improvements on CNAF-RAL DB sync

    • Gilles to provide a dump to CNAF whenever the schema changes

    • Peter/Ale/Gilles to study encryption solution to secure the dump

    • Gilles to check the dump solution is valid

    • Peter/Ale to implement new procedures

    • Ale to do speed tests in different scenarios
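
One simple way to check that a dump arrived unchanged is to compare checksums on both sides. The sketch below is an assumption about how such a check could be done, using SHA-256 from the Python standard library; it verifies integrity only and is not the encryption solution mentioned in the action list. The function names and the idea of sending the expected digest alongside the dump are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def dump_is_intact(path: Path, expected_hex: str) -> bool:
    """Compare the file's digest with the digest published by the sender."""
    return sha256_of(path) == expected_hex.strip().lower()
```

The sending site would compute the digest before transfer and the receiving site would call `dump_is_intact` before loading the dump into the replica database.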


Actions list (3)

  • Test

  • Test again

  • Re-test

    • Test

    • Test

  • Test (if there is some time left)

