
Data Center Outage BRIEFING



Presentation Transcript


  1. Data Center Outage BRIEFING • Information and Educational Technology • January 10–11, 2014

  2. Agenda • Review of Events • Cause Analysis and Current Efforts • Communications • Vulnerabilities • Mitigation Plans • Lessons Learned • Communication Improvements

  3. Review of Events • Summary: 3 incidents (Friday, Jan 10: virtualization and uConnect firewall; Saturday, Jan 11: virtualization) • Virtualization outage: affected most major systems on campus; some mitigation lessened impact on Saturday • uConnect firewall outage: extended email, authentication and DNS service outage for uConnect users (additional 4 hours)

  4. Outage Timeline: 3 incidents (timeline figure; labels as recovered from the slide) • Days: Friday, January 10th; Saturday, January 11th • Milestones: most services restored; most services restored except uConnect; all services restored • Incident 1, Virtualization Outage: VM hosts rebooted; VM guests started to restore services • Incident 2, uConnect Firewall Outage: firewall failover to secondary without success; hard power cycle restores firewall and uConnect services • Incident 3, Virtualization Outage: VM hosts rebooted; VM guests started to restore services • Other labels: Virtualization Degradation (critical services stable); CAS & Smartsite restored; email routing restored

  5. Services Impacted • Admissions • Banner • Central Authentication Services (CAS)* • Computing Accounts • Electronic Death Registry System • Data Center File Services • DaFIS • DavisMail • Data Center Virtualization • Final Grade Submission • Geckomail • Kuali Financial Services • Identity and Access Management • IET Web Sites • MyInfoVault • MyUCDavis • ServiceNow and SSC Case Management • Shibboleth • Smartsite* • Time Reporting System • Web Content Management System • uConnect Services • UC Davis Directory Listings • UC Davis Home Site. (* CAS was restored on physical hardware on Friday at 1:40 pm, which in turn restored dependent services such as departmental applications and Smartsite.)

  6. Communications • Regular outage communication channels were unavailable: email; websites (status page, www.ucdavis.edu) • Communications issued: automated notices on IT-Express phone system (updated 3 times); Twitter updates (8 on 01/10, 5 on 01/11); progress updates on Status web page (status.ucdavis.edu) starting Friday mid-afternoon; email to 300+ campus technologists (01/11)

  7. Vulnerabilities • Hardware is redundant, but many services are hosted in a single location on a single SAN • Critical uConnect directory services reside on a single network • The system status page depends on the local infrastructure • IET is not aware of all critical services that rely on its infrastructure

  8. Mitigation Plans • SAN Software Upgrade completed • Implement diversification for critical services (Authentication, uConnect Directory Services, Status Page, WWW) • Integrate cloud services to improve diversity • Develop process to identify critical campus services dependent on IET infrastructure.
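The last mitigation item, identifying critical services that depend on IET infrastructure, can be approached as a dependency-graph walk. The following is a minimal sketch under stated assumptions, not IET's actual process: the `DEPENDS_ON` map is illustrative only (the service names echo the slides, but the edges are assumed, apart from Smartsite's dependence on CAS noted on slide 5), and `impacted_by` is a hypothetical helper name.

```python
from collections import defaultdict, deque


def impacted_by(depends_on, failed):
    """Return every service that directly or transitively depends on `failed`."""
    # Invert the edges: component -> set of services that require it.
    dependents = defaultdict(set)
    for service, requirements in depends_on.items():
        for requirement in requirements:
            dependents[requirement].add(service)

    # Breadth-first walk outward from the failed component.
    impacted = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for service in dependents[current]:
            if service not in impacted:
                impacted.add(service)
                queue.append(service)
    return impacted


# Illustrative (assumed) dependency map: service -> components it requires.
DEPENDS_ON = {
    "virtualization": ["san"],
    "cas": ["virtualization"],
    "smartsite": ["cas"],
    "status-page": ["virtualization"],
    "cloud-status-page": [],  # hosted off-site, so no local dependency
}

if __name__ == "__main__":
    for component in ("san", "cas"):
        print(component, "->", sorted(impacted_by(DEPENDS_ON, component)))
```

Running the walk for every shared component (SAN, network, firewall) would surface single points of failure such as a status page that goes down with the infrastructure it reports on.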

  9. Lessons Learned • Move from disaster recovery to business continuity • Normal communication channels were unavailable, and IET was not prepared for that • Need communication and decision-making protocols for when normal channels are unavailable

  10. Communication Improvements • Review service outage communication protocols, contacts, and venues • Ensure multiple modes of communication (text, cell, email, web, phone, social media) are available; leverage new WarnMe system extension for non-emergency notifications • Closer collaboration with Emergency Manager and StratComm • Ensure broad awareness of outage communication channels • Launch cloud-based status page – Status Page I/O • Leverage AggieFeed for broader communication
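The "multiple modes of communication" item can be illustrated with a small fan-out routine that attempts every channel and records which ones failed, so that one unavailable channel (email, during this outage) does not block the others. This is only a sketch: `broadcast` and the three sender functions are hypothetical stand-ins, not real integrations with WarnMe, AggieFeed, Twitter, or any campus system.

```python
def broadcast(message, channels):
    """Send `message` over every channel; return (delivered, failed) name lists."""
    delivered, failed = [], []
    for name, send in channels:
        try:
            send(message)
            delivered.append(name)
        except Exception as exc:  # one broken channel must not stop the rest
            failed.append((name, str(exc)))
    return delivered, failed


# Hypothetical senders used only for illustration.
sent_log = []


def send_email(message):
    # Simulates the email path being down, as it was during the outage.
    raise ConnectionError("mail relay unreachable")


def send_sms(message):
    sent_log.append(("sms", message))


def send_social(message):
    sent_log.append(("social", message))


CHANNELS = [("email", send_email), ("sms", send_sms), ("social", send_social)]

if __name__ == "__main__":
    ok, bad = broadcast("Virtualization outage: services being restored", CHANNELS)
    print("delivered:", ok)
    print("failed:", bad)
```

Returning the failed channels explicitly gives the people coordinating the response a record of which audiences were not reached and may need a follow-up by another route.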

  11. Status Page I/O

  12. Architecture
