GOCDB failover status and plans

COD-19, 01/04/2009

G.Mathieu, A.Cavalli, C.Peter, P.Sologna



Assessment and progress

  • Last week's outage at RAL

    • a good (!) use case for testing our procedures and listing improvements

  • DNS aspect

    • new DNS machine at CNAF


Last RAL outage

  • Timeline

    • 5:20 UTC - Power glitch at RAL

    • 8:00 - Failover process started

    • 9:20 - DNS switch complete

    • 10:00 - Failover working properly

    • 13:25 - Reverse DNS switch
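
The intervals implied by this timeline can be worked out with a short sketch. It assumes that every entry is in UTC and on the same day (the slide labels only the first time as UTC); the variable names are illustrative.

```python
from datetime import datetime

# Times copied from the timeline above; assumed to be same-day UTC.
GLITCH = "05:20"            # power glitch at RAL
FAILOVER_START = "08:00"    # failover process started
DNS_SWITCHED = "09:20"      # DNS switch complete
FAILOVER_OK = "10:00"       # failover working properly
DNS_REVERSED = "13:25"      # reverse DNS switch


def minutes_between(start: str, end: str) -> int:
    """Return the whole minutes elapsed between two HH:MM times on one day."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


# Time between the glitch and the start of the failover process.
reaction_delay = minutes_between(GLITCH, FAILOVER_START)
# Duration of the swap itself (the "2h" quoted in the post mortem).
swap_duration = minutes_between(FAILOVER_START, FAILOVER_OK)
# Total time from the glitch until the failover was working properly.
time_to_recover = minutes_between(GLITCH, FAILOVER_OK)
# Time during which DNS pointed at the failover instance.
time_on_failover_dns = minutes_between(DNS_SWITCHED, DNS_REVERSED)

print(reaction_delay, swap_duration, time_to_recover, time_on_failover_dns)
```

Under these assumptions the swap took 120 minutes, but it started 160 minutes after the glitch, so 280 minutes passed before the failover was working properly.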


Post mortem

  • Good things

    • failover worked

    • DNS swap quick, efficient and transparent

    • Good synchronisation

    • CNAF IRC channel was useful

  • Problems encountered

    • Problems with CNAF DB schema

    • DB Connection from ITWM to RAL

    • SSL issues

    • The complete swap took a rather long time (2h)


Proposed improvements (1)

  • Improve manual process

    • Reduce the number of people needed: any one of several people should be able to carry out the whole chain alone

    • Create scripts to reduce number of actions

  • Sort out CNAF schema issue

    • Improve current synchronisation mechanism

  • Contacts and documentation

    • Keep a list of phone contacts and alternative mail addresses to use in case the main mail system does not work

    • Document all processes


Proposed improvements (2)

  • Regular tests

    • Test CNAF replica DB

    • ITWM web interface

    • All possible scenarios

  • Configuration improvements

    • Simplify configuration file

    • Have the service itself publish the fact that it is in read-only mode

  • Automation

    • Work with OAT monitoring group

    • Automate DB switch

    • Automate portal switch the same way
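
One possible shape for an automated switch is sketched below. It is not the team's design: the class name, the threshold of three failed checks and the status fields are all hypothetical, and it assumes that the replica is served read-only. The switch back to the master is deliberately left manual, as it was during the outage.

```python
from dataclasses import dataclass


@dataclass
class FailoverDecider:
    """Hypothetical helper that decides when to move from master to replica."""

    failure_threshold: int = 3      # assumed value, not taken from the slides
    consecutive_failures: int = 0
    on_replica: bool = False

    def record(self, master_ok: bool) -> str:
        """Record one health-check result and return the action to take."""
        if master_ok:
            # A healthy check resets the counter; switching back stays manual.
            self.consecutive_failures = 0
            return "stay"
        self.consecutive_failures += 1
        if not self.on_replica and self.consecutive_failures >= self.failure_threshold:
            self.on_replica = True
            return "switch_to_replica"
        return "stay"

    def status(self) -> dict:
        """What the service could publish about itself (assumed fields)."""
        return {
            "serving_from": "replica" if self.on_replica else "master",
            "read_only": self.on_replica,
        }


decider = FailoverDecider(failure_threshold=3)
# Two failures, one recovery, then three failures in a row.
actions = [decider.record(ok) for ok in (False, False, True, False, False, False)]
print(actions[-1], decider.status())
```

Requiring several consecutive failures avoids switching on a single missed check; the right threshold and check interval would have to come out of the regular tests.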


Actions list (1)

  • Doc and processes

    • Gilles to draft process + test documentation

    • Christian to add [email protected] tests to ITWM procedures

    • All: provide contacts (phone, alternate mail, etc.)

  • Access to machines

    • Christian to give failover team access to [email protected]

    • Gilles to give failover team access to gocdb@ral

  • Scripting

    • Gilles to write scripts to change GOC portal conf

    • Peter/Ale to write DNS configuration scripts
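
As an illustration of what a DNS configuration script might do, the sketch below builds a command batch for the BIND `nsupdate` tool that repoints an alias at another host. It assumes the zone accepts dynamic updates and that the service name is a CNAME, neither of which the slides state; the function name and every host name are placeholders.

```python
def build_nsupdate_batch(alias: str, target: str, zone: str,
                         server: str, ttl: int = 60) -> str:
    """Return nsupdate commands that repoint the CNAME `alias` at `target`.

    Hypothetical helper: all names passed in are placeholders, and the
    resulting text would be fed to `nsupdate` by the operator.
    """
    for name in (alias, target, zone):
        if not name.endswith("."):
            raise ValueError(f"expected a fully qualified name ending in '.': {name!r}")
    lines = [
        f"server {server}",
        f"zone {zone}",
        f"update delete {alias} CNAME",            # drop the current alias record
        f"update add {alias} {ttl} CNAME {target}",  # point the alias at the new host
        "send",
    ]
    return "\n".join(lines) + "\n"


# Placeholder names only; the real hosts are not given in the slides.
batch = build_nsupdate_batch(
    alias="service.example.org.",
    target="replica.example.org.",
    zone="example.org.",
    server="ns.example.org",
)
print(batch)
```

The same function with the master host as `target` would produce the reverse switch, so one script could cover both directions.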


Actions list (2)

  • Improvements on CNAF-RAL DB sync

    • Gilles to provide a dump to CNAF whenever the schema changes

    • Peter/Ale/Gilles to study encryption solution to secure the dump

    • Gilles to check the dump solution is valid

    • Peter/Ale to implement new procedures

    • Ale to do speed tests in different scenarios
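
Providing a dump "whenever the schema changes" requires a way to notice a change. One simple approach, sketched here as an assumption rather than the agreed procedure, is to fingerprint the schema definition text and compare it with the fingerprint recorded at the last dump. How the schema text is exported is not covered by the slides, and the function names are illustrative.

```python
import hashlib
from typing import Optional


def schema_fingerprint(ddl: str) -> str:
    """Return a SHA-256 fingerprint of schema text, ignoring whitespace layout."""
    normalized = "\n".join(
        " ".join(line.split()) for line in ddl.splitlines() if line.strip()
    )
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def dump_needed(current_ddl: str, last_fingerprint: Optional[str]) -> bool:
    """True when no dump was ever recorded or the schema text has changed."""
    return last_fingerprint is None or schema_fingerprint(current_ddl) != last_fingerprint


# Illustrative schema text, not the real GOCDB schema.
old_ddl = "CREATE TABLE site (\n  id INT\n);"
reindented_ddl = "CREATE   TABLE site (\n\n    id INT\n);\n"
changed_ddl = "CREATE TABLE site (\n  id INT,\n  name VARCHAR(64)\n);"

last = schema_fingerprint(old_ddl)
print(dump_needed(reindented_ddl, last), dump_needed(changed_ddl, last))
```

Normalising whitespace before hashing keeps a purely cosmetic re-export from triggering a new dump, while any added or altered column does.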


Actions list (3)

  • Test

  • Test again

  • Re-test

    • Test

    • Test

  • Test (if there is some time left)

