Emergency Database Failover: Impacts & Recovery Plan

Presentation Transcript


  1. Emergency Database Failover: Impacts & Recovery Plan
     Trey Felton – ERCOT IT

  2. Synopsis
     ISM – Information Services Master Database
     DB – Database
     EDW – Electronic Data Warehouse

  3. Synopsis: Failover
     • Emergency DB failover on April 21st, 2008
     • Market DB (which feeds ISM) became unresponsive
     • Data could not be written/read
     • Synchronization issues caused a 24 hr gap in data
     • Propagated through to ISM
     [Diagram label: Out of synch (24 hrs)]

  4. Synopsis: Failover
     • Physical Standby brought online
     • ISM rebuilt through Source data to recover affected extracts

  5. Impacts
     • Impacts:
       • Market transactions were prevented from updating ISM through Logical Standby
         • Market DB utilizes a standby to prevent outages / performance degradations
         • Logical Standby (RSS) became out of synch with Physical Standby by 24 hrs
           • April 22 at 11:14am through April 21 at 10:44am
         • Other DBs feeding ISM continued normally (only Market DB was out of synch)
       • Priority of rebuild led to the Standby being rebuilt before the RSS
         • Market DB has to be kept up
         • This prolonged the outage to the EDW and affected extracts
       • Prices had to be recalculated and extracts restored from Source
         • Price adjustments for NSRS were completed June 5th
         • Missing extracts for April 21 - April 30 were completed on July 1st
     • Why did recovery take so long?
       • ISM generates up to 25-35 GB of data per day
       • Data restored from Source back to April 1st
       • 120 Terabytes had to be restored in order to roll forward through the transaction gap
       • Archive log changes applied during 24-hour gap
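
The out-of-synch window on this slide can be checked with simple timestamp arithmetic. The following is a minimal sketch, not ERCOT's monitoring code: it assumes the two timestamps on the slide bound the gap in 2008, and the function names `sync_lag` and `is_out_of_sync` and the 30-minute alert threshold are hypothetical.

```python
from datetime import datetime, timedelta


def sync_lag(applied_a: datetime, applied_b: datetime) -> timedelta:
    """Return the absolute time difference between two replicas' last-applied points."""
    return abs(applied_a - applied_b)


def is_out_of_sync(lag: timedelta, threshold: timedelta) -> bool:
    """Flag a replica whose lag exceeds the (hypothetical) alert threshold."""
    return lag > threshold


# Timestamps taken from the slide; the year 2008 is assumed from the failover date.
gap_start = datetime(2008, 4, 21, 10, 44)
gap_end = datetime(2008, 4, 22, 11, 14)

lag = sync_lag(gap_end, gap_start)
print(lag)  # 1 day, 0:30:00
print(is_out_of_sync(lag, threshold=timedelta(minutes=30)))  # True
```

Under these assumptions the window is 24 hours 30 minutes, in line with the "24 hrs" figure quoted on the slide.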

  6. Emergency Database Failover
     • All data was restored with 100% accuracy
     • The affected market systems that caused the April failure:
       • Run the balancing energy and ancillary services markets
       • Not used for wholesale batch or the retail markets
     • ERCOT considers this to be an isolated incident and not a systemic problem

  7. Going Forward
     • Actions to prevent future occurrences:
       • Nodal market DBs will utilize newer Hardware
         • More fault tolerance
         • Redundancy
       • Change of architecture in the replication process for Nodal
         • Proof of Concept recently introduced into the Nodal market systems
         • Testing underway
       • ERCOT is conducting a risk/cost analysis of several options for these Zonal systems
         • To be presented to TAC in August
       • New Backups / Recovery Procedures
         • Project initiated to stabilize our database backup procedures
         • Shorter recovery time
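
To make the "shorter recovery time" goal concrete, restore duration can be estimated from data volume and sustained throughput. This is a rough sketch only: the 120 TB figure comes from the Impacts slide, while the throughput values and the helper name `restore_hours` are hypothetical and do not describe ERCOT's actual hardware.

```python
def restore_hours(volume_tb: float, throughput_mb_per_s: float) -> float:
    """Estimate hours needed to restore volume_tb terabytes at a sustained
    throughput in MB/s, using decimal units (1 TB = 1,000,000 MB)."""
    if throughput_mb_per_s <= 0:
        raise ValueError("throughput must be positive")
    total_mb = volume_tb * 1_000_000
    return total_mb / throughput_mb_per_s / 3600


# 120 TB is the restore volume quoted in the presentation;
# the throughput values below are purely illustrative assumptions.
for mb_per_s in (250, 500, 1000):
    print(f"{mb_per_s} MB/s -> {restore_hours(120, mb_per_s):.1f} hours")
```

The estimate ignores the time needed to apply archive logs and to roll forward, so it is a lower bound on recovery time under the stated assumptions.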

  8. Data Recovery
     NOTICE DATE: July 1, 2008
     NOTICE TYPE: W-A042308-48 UPDATE Extracts - Wholesale
     CLASSIFICATION: Public
     SHORT DESCRIPTION: ERCOT has completed recovery of the missing data for April 21 through April 30, 2008.
     INTENDED AUDIENCE: QSEs
     DAY AFFECTED: April 21 through April 30, 2008
     LONG DESCRIPTION: ERCOT conducted an emergency database failover on April 21, 2008 following a hardware failure. This database failover resulted in an out-of-synch data problem from April 21 through April 30. ERCOT developed a phased process to attempt to thoroughly recover the missing data. The missing data has been recovered for the following extracts. A market notice will be sent when the extracts are expected to be posted.
     • Act_Res_Output
     • Ancillary_Services_Daily
     • Bids_and_Schedules_Daily
     • Forecast_Data_Daily
     • Market_Information_Daily
     • Sched_and_Actual_Load
     • Self_Sch_Energy_Services
     • ASDEPLOYMENTS
