
Emergency Database Failover: Impacts & Recovery Plan



Presentation Transcript


  1. Emergency Database Failover: Impacts & Recovery Plan
     Aaron Smallwood – ERCOT IT
     Joel Mickey – ERCOT Market Operations

  2. Emergency Database Failover
     • Summary:
        • ERCOT conducted an emergency database failover on April 21st, 2008, following a hardware failure
        • While ERCOT does perform controlled database failovers monthly, this one was different due to the nature of the hardware failure
        • Normally, the database is ‘stopped’ at one site and then ‘started’ at the other in a controlled manner
        • In this case, the database ‘hung’, meaning that it became unresponsive and data could not be written to or read from the database
     • The impacts:
        • Transactions were prevented from updating downstream databases
        • The lack of transaction updates left a gap in the transactional records of the downstream databases (out of sync)
        • The affected extracts for April 21st through April 30th are listed in the market notices for the incident
     • ERCOT considers this to be an isolated incident and not a systemic problem
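The difference between a database that has been stopped and one that has ‘hung’ can be illustrated with a timeout-based probe. This is a minimal sketch, not ERCOT's monitoring: the `probe` function, the `check` callable, and the timeout values are all hypothetical names chosen for illustration. The idea is that a hung database neither answers nor fails, so only a deadline can reveal it.

```python
import concurrent.futures
import time


def probe(check, timeout_s):
    """Classify a health check as 'responsive', 'hung', or 'error'.

    `check` is a hypothetical callable that performs one small read or
    write against the database and returns when it completes.
    """
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(check)
    try:
        future.result(timeout=timeout_s)
        return "responsive"
    except concurrent.futures.TimeoutError:
        # No answer within the deadline: the call neither succeeded nor failed.
        return "hung"
    except Exception:
        # The call failed quickly, e.g. the database was stopped or refused the request.
        return "error"
    finally:
        # Do not wait for a stuck worker thread before returning.
        pool.shutdown(wait=False)


if __name__ == "__main__":
    print(probe(lambda: None, timeout_s=1.0))
    print(probe(lambda: time.sleep(0.5), timeout_s=0.05))
```

A stopped database usually produces a fast error, which a controlled failover procedure can act on; a hung one produces nothing until the deadline expires.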

  3. Recovery Plan
     • Goal:
        • From a restored copy of the production database, recover the transactions that are missing in downstream databases and are needed to perform price adjustment calculations
     • Plan:
        • Build an environment identical to the production environment
           • Servers, storage, applications
        • Restore data to its pre-crash state (4/21)
           • Over 20 TB of data to restore from tape (in progress)
        • Using the restored environment and data, extract the transactions missing from downstream databases and then roll forward all subsequent transactions
        • ERCOT Market Operations will then review the data for reasonableness and approve the data for reporting and settlement
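The extract-and-roll-forward step of the plan can be sketched in a few lines. This is an illustration under stated assumptions, not ERCOT's actual tooling: it assumes every transaction carries a unique, ordered identifier (`txn_id`, a hypothetical field), and it models the downstream database as a plain dictionary. The `Txn` class and both function names are invented for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Txn:
    txn_id: int   # hypothetical ordered transaction identifier
    payload: str  # stand-in for the transaction contents


def find_missing(restored, downstream_ids):
    """Return transactions in the restored copy that are absent downstream, in id order."""
    missing = [t for t in restored if t.txn_id not in downstream_ids]
    return sorted(missing, key=lambda t: t.txn_id)


def roll_forward(downstream, missing):
    """Apply missing transactions in order and return how many were applied.

    Transactions already present are skipped, so re-running is harmless.
    """
    applied = 0
    for t in missing:
        if t.txn_id not in downstream:
            downstream[t.txn_id] = t.payload
            applied += 1
    return applied


if __name__ == "__main__":
    restored = [Txn(1, "a"), Txn(2, "b"), Txn(3, "c"), Txn(4, "d")]
    downstream = {1: "a", 4: "d"}  # transactions 2 and 3 never arrived
    gap = find_missing(restored, downstream.keys())
    print([t.txn_id for t in gap], roll_forward(downstream, gap))
```

Applying the gap in identifier order and skipping anything already present keeps the replay repeatable, which matters when the restored data still has to be reviewed for reasonableness before it is approved.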

  4. Questions
     • Actions to prevent future occurrences:
        • Nodal market databases will be on newer hardware with more fault tolerance and redundancy
        • Potential re-architecture of system integration between the databases
        • Lessons learned are being documented, but there is no plan yet
           • Resources are focused on the data recovery efforts
     • Questions:
        • When will non-spinning reserve price adjustments for PRR 650 be completed?
           • When the transactional data has been restored, reviewed, and approved
        • What is the timeline?
           • The environment build is complete; we anticipate that the data restore from tape will be the task that takes the longest
           • We are estimating weeks, not months, to complete the plan
           • Unknowns include the amount of time needed to restore from tape and the quality of the data once it has been restored
           • Market notices will continue to be sent to indicate status
