
Gemini OSU - UKLC Update



Presentation Transcript


  1. Gemini OSU - UKLC Update Annie Griffith December 2007

  2. Discussion Items Focus of discussions will be around the system elements of the recent Gemini incident:
  • Summary of findings to date
  • Review of learning points
  • Lessons learned applicable to UKLTR
  • Open discussion

  3. Overview of events
  • 21st Oct – upgraded system implemented; API errors identified
  • 22nd Oct – issue with shippers able to view other shippers’ data; shipper access revoked
  • 24th Oct – code fix for data view implemented; internal National Grid access only
  • 26th Oct – external on-line service restored
  • 1st Nov – hardware changes implemented to external service
  • 2nd Nov – API service restored; further intermittent outage problem occurring on APIs
  • 5th Nov – last outage on API service recorded at 13:00; root cause analysis still underway

  4. Summary - Causes Two problems identified:
  • Application code construct – associated with high-volume instantaneous concurrent usage of the same transaction type. Fix deployed 05:00 23/10/07.
  • API error – associated with saturation usage, manifesting as “memory leakage” that builds up over time and eventually results in loss of service. Indications are that this is an error in a 3rd-party system software product. Investigations continuing.
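The first cause above – a defect triggered only by many users driving the same transaction type at the same instant – is a classic class of concurrency bug. As a minimal illustrative sketch (hypothetical code, not the Gemini application, which is not shown in these slides), a barrier-synchronised test harness can fire many threads through one transaction simultaneously, maximising the chance of surfacing this kind of defect before go-live:

```python
import threading

def run_concurrent(txn, n_threads=50, iterations=100):
    """Drive n_threads workers through the same transaction at once.
    The barrier releases all threads together, maximising overlap and
    the chance of exposing concurrent-usage defects."""
    barrier = threading.Barrier(n_threads)
    errors = []

    def worker():
        barrier.wait()              # all threads start together
        for _ in range(iterations):
            try:
                txn()
            except Exception as exc:  # record rather than mask failures
                errors.append(exc)

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return errors

# Toy "transaction": a lock-guarded counter, so the invariant holds.
counter = 0
lock = threading.Lock()

def deposit():
    global counter
    with lock:
        counter += 1

errors = run_concurrent(deposit, n_threads=10, iterations=100)
assert not errors and counter == 10 * 100  # invariant checked after the run
```

Removing the lock from `deposit` is the kind of construct such a harness is designed to catch: the run completes, but the post-run invariant check fails intermittently.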

  5. Fixes since Go-live
  • Since 4th November: 10 application defects – all minor, all fixed
  • No outstanding application errors

  6. Gemini OSU Testing Extensive testing programme:
  • 2 months integration and system testing
  • 6 weeks OAT performance testing – volume testing at 130% of current user load
  • 8 weeks UAT
  • 4 weeks shipper trials (voluntary) – 3 participants
  • 7 weeks dress rehearsal – focus was on the actions needed to complete the technical hardware upgrade across multiple servers and platforms

  7. Testing Lessons Learnt
  • UAT – each functional area tested discretely. Issues around concurrent usage were unknown and therefore not specifically targeted for testing. “Field” testing of the system under fully loaded conditions may have highlighted this problem, but this is not certain.
  • OAT – although volume and stress testing completed successfully, reliability/soak testing over a prolonged period was not undertaken.
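The OAT gap above matters because a “memory leakage” failure of the kind described on slide 4 builds up gradually and passes short stress tests. A minimal sketch of what prolonged soak testing adds (hypothetical example code, not part of the Gemini test suite): exercise the system repeatedly while sampling retained memory, and flag a steadily rising trend.

```python
import tracemalloc

def soak(fn, cycles=1000, sample_every=100):
    """Run fn repeatedly and sample currently-retained traced memory.
    A series that keeps rising across samples suggests a leak that a
    short volume/stress test would never see."""
    tracemalloc.start()
    samples = []
    for i in range(1, cycles + 1):
        fn()
        if i % sample_every == 0:
            current, _peak = tracemalloc.get_traced_memory()
            samples.append(current)
    tracemalloc.stop()
    return samples

# Toy workload that deliberately retains memory on every call.
leaky_store = []

def leaky_call():
    leaky_store.append(bytearray(1024))  # 1 KB retained per call

samples = soak(leaky_call, cycles=500, sample_every=100)
# Retained memory rises at every sample point -> leak flagged
assert all(later > earlier for earlier, later in zip(samples, samples[1:]))
```

For a real service the same idea applies at system level: sample process RSS (or equivalent) at intervals over hours or days of representative load, rather than Python-level allocations over a loop.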

  8. Other Observations
  • Communications during incident – undersold the scale of the change
  • Engagement – were the right individuals/forums involved?
  • Planning for failure… as well as success

  9. UKLTR – What’s different?
  • Main workhorse of the system is batch processing: predictable transaction volumes, far easier to replicate in load and volume testing, easy to verify outputs
  • Shipper interaction is batch driven
  • Low volume of on-line users
  • Doesn’t have the same level of real-time/instantaneous transaction criticality
  • Ability to do more verification following cut-over before releasing data from the upgraded system to the outside world

  10. UKLTR – Lessons to be applied
  • Plan for failure – differing levels, problems vs. incidents; technical and resource planning
  • Fully prepared incident management procedure established in advance and understood by all parties – escalation routes, communications mechanisms
  • Status communications to be issued during outage period – milestone updates? Who to?
  • Fall-back options – the old system provides a straightforward option; however, once interface data has been propagated to other systems we will be in a “fix-forward” situation

  11. Discussion?
