Data Center Outage BRIEFING

Information and Educational Technology

January 10–11, 2014

Agenda
  • Review of Events
  • Cause Analysis and Current Efforts
  • Communications
  • Vulnerabilities
  • Mitigation Plans
  • Lessons Learned
  • Communication Improvements
Review of Events:
  • Summary: 3 incidents
    • Friday, Jan 10: Virtualization and uConnect firewall
    • Saturday, Jan 11: Virtualization
  • Virtualization outage affected most major systems on campus
    • Some mitigation lessened impact on Saturday
  • uConnect firewall outage
    • Extended email, authentication and DNS service outage for uConnect users (additional 4 hours)
Outage Timeline: 3 incidents
[Timeline diagram; labels recovered from the slide]
  • Friday, January 10th
    • Incident 1: Virtualization Outage
      • VM hosts rebooted
      • VM guests started to restore services
      • CAS & Smartsite restored
      • Email routing restored
      • Most services restored except uConnect
    • Incident 2: uConnect Firewall Outage
      • Firewall failover to secondary without success
      • Hard power cycle restores firewall and uConnect services
      • All services restored
  • Saturday, January 11th
    • Incident 3: Virtualization Outage
      • VM hosts rebooted
      • VM guests started to restore services
      • Most services restored
      • Virtualization Degradation (critical services stable)
Services Impacted
  • Admissions
  • Banner
  • Central Authentication Services (CAS)*
  • Computing Accounts
  • Electronic Death Registry System
  • Data Center File Services
  • DaFIS
  • DavisMail
  • Data Center Virtualization
  • Final Grade Submission
  • Geckomail
  • Kuali Financial Services
  • Identity and Access Management
  • IET Web Sites
  • MyInfoVault
  • MyUCDavis
  • ServiceNow and SSC Case Management
  • Shibboleth
  • Smartsite*
  • Time Reporting System
  • Web Content Management System
  • uConnect Services
  • UC Davis Directory Listings
  • UC Davis Home Site

* CAS was restored to physical hardware on Friday at 1:40 pm, which restored dependent services such as departmental applications and Smartsite.

Communications
  • Regular outage communication channels were unavailable
    • Email
    • Websites (status page, www.ucdavis.edu)
  • Communications issued
    • Automated notices on IT-Express phone system (updated 3 times)
    • Twitter updates (8 on 01/10; 5 on 01/11)
    • Progress updates on Status web page (status.ucdavis.edu) starting Friday mid-afternoon
    • Email to 300+ campus technologists (01/11)
Vulnerabilities
  • Hardware is redundant, but many services are hosted in a single location on a single SAN
  • Critical uConnect directory services reside on a single network
  • The system status page is dependent on the local infrastructure
  • IET is not aware of all critical services that rely on our infrastructure
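A check for the first vulnerability can be automated once a service inventory exists. The following is a minimal sketch, not IET's tooling: the `Instance` record, the service names, and the location and SAN names are all hypothetical. It flags any service whose deployed copies share one location or one SAN.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instance:
    """One deployed copy of a service (all values below are made up)."""
    service: str
    location: str
    san: str


def single_points_of_failure(instances):
    """Return {service: [shared attributes]} for services lacking diversity."""
    by_service = {}
    for inst in instances:
        by_service.setdefault(inst.service, []).append(inst)

    findings = {}
    for service, copies in by_service.items():
        shared = []
        # A single distinct value means every copy fails together.
        if len({c.location for c in copies}) == 1:
            shared.append("location")
        if len({c.san for c in copies}) == 1:
            shared.append("san")
        if shared:
            findings[service] = shared
    return findings


# Hypothetical inventory: names are illustrative only.
inventory = [
    Instance("status-page", "datacenter-a", "san-1"),
    Instance("auth", "datacenter-a", "san-1"),
    Instance("auth", "datacenter-a", "san-1"),
    Instance("www", "datacenter-a", "san-1"),
    Instance("www", "cloud-region-x", "cloud-storage"),
]

report = single_points_of_failure(inventory)
```

In this made-up inventory, `auth` has two copies but is still flagged, because both sit in the same location on the same SAN. That is the "redundant hardware, single location" pattern the slide describes.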
Mitigation Plans
  • SAN Software Upgrade completed
  • Implement diversification for critical services (Authentication, uConnect Directory Services, Status Page, WWW)
  • Integrate cloud services to improve diversity
  • Develop process to identify critical campus services dependent on IET infrastructure.
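One way to approach the last item is to record what each service relies on and then walk that map outward from a piece of infrastructure. The sketch below assumes a hand-maintained dependency map. The function name `dependents_of` and the map itself are hypothetical, and the edges only loosely follow the briefing (for example, Smartsite and departmental applications depending on CAS).

```python
from collections import deque


def dependents_of(component, depends_on):
    """Return every service that directly or transitively depends on component.

    depends_on maps a service name to the list of things it relies on.
    """
    # Invert the mapping: for each dependency, which services rely on it?
    relied_on_by = {}
    for service, deps in depends_on.items():
        for dep in deps:
            relied_on_by.setdefault(dep, set()).add(service)

    # Breadth-first walk outward from the failed component.
    affected = set()
    queue = deque([component])
    while queue:
        current = queue.popleft()
        for service in relied_on_by.get(current, ()):
            if service not in affected:
                affected.add(service)
                queue.append(service)
    return affected


# Illustrative map; treat every edge as an assumption for the example.
depends_on = {
    "CAS": ["Virtualization"],
    "Smartsite": ["CAS"],
    "Departmental applications": ["CAS"],
    "Status page": ["Virtualization"],
}

impacted = dependents_of("Virtualization", depends_on)
```

Here `impacted` contains all four services, including the two that reach the virtualization layer only through CAS. Indirect dependencies like these are the ones an inventory of direct dependencies would miss.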
Lessons Learned
  • Move from disaster recovery to business continuity
  • Normal communication channels were unavailable
  • Need communication and decision-making protocols for when normal channels are unavailable
    • We were not prepared for normal channels being unavailable
Communication Improvements
  • Review service outage communication protocols, contacts, and venues
  • Ensure multiple modes of communication (text, cell, email, web, phone, social media) are available; leverage new WarnMe system extension for non-emergency notifications
    • Closer collaboration with Emergency Manager and StratComm
  • Ensure broad awareness of outage communication channels
  • Launch cloud-based status page – StatusPage.io
  • Leverage AggieFeed for broader communication
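The "multiple modes of communication" item amounts to sending each notice over every channel independently, so that one channel being down does not block the rest. The sketch below illustrates that idea under stated assumptions: `broadcast` and the stand-in senders are hypothetical, and a real setup would call an SMS gateway, a social media API, and so on.

```python
def broadcast(message, channels):
    """Try every channel; return (delivered, failed) lists of channel names.

    channels is a list of (name, send) pairs, where send(message) raises
    on failure. A failure on one channel does not stop the others.
    """
    delivered, failed = [], []
    for name, send in channels:
        try:
            send(message)
            delivered.append(name)
        except Exception:
            failed.append(name)
    return delivered, failed


# Stand-in senders for the sketch (hypothetical, no real services called).
sent_log = []


def make_sender(channel):
    def _send(message):
        sent_log.append((channel, message))
    return _send


def send_down(message):
    raise ConnectionError("channel unavailable")


channels = [
    ("email", send_down),        # hosted on the affected infrastructure
    ("status page", send_down),  # likewise unavailable in this scenario
    ("twitter", make_sender("twitter")),
    ("phone", make_sender("phone")),
]

delivered, failed = broadcast("Virtualization outage under investigation", channels)
```

In this scenario the notice still goes out by Twitter and phone while email and the status page are down. The `failed` list tells operators which channels need attention.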