
Planning for LCG Emergencies HEPiX, Fall 2005 SLAC, 13 October 2005

David Kelsey, CCLRC/RAL, UK, d.p.kelsey@rl.ac.uk




Presentation Transcript


  1. Planning for LCG Emergencies
     HEPiX, Fall 2005, SLAC, 13 October 2005
     David Kelsey, CCLRC/RAL, UK, d.p.kelsey@rl.ac.uk

  2. LHC Tier 0/1/2 Network Architecture David Kelsey, LCG Emergencies

  3. Background
     • Computing and networking are essential
     • Tier 0 (CERN) and 12 Tier 1 sites are critical for data taking
     • 10 Gbps optical private link to each T1
     • The T1s collectively keep a second copy of the raw data
     • The T1s play a vital role in (re)processing and providing access to derived data
     • During data taking, can cope with a Tier 0 - Tier 1 link down for 12 hours to < few days. All T1s down – very bad!
     • LCG MoU requires avg T1 uptime during data taking: 99%
     • LCG TDR says
       • “Special attention needs to be paid to the security aspects of the Tier-0, the Tier-1s and their network connections to maintain these essential services during or after an incident so as to reduce the effect on LHC data taking.”
     • LCG is also essential for analysis
     • Need to keep the Grid running at all times
     • Therefore must deal quickly with incidents
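
To give the 99% uptime figure a sense of scale, the sketch below converts an average-uptime target into a downtime budget. The 100-day data-taking period is a hypothetical number chosen for illustration, and the function name is ours; the slides do not say how the MoU measures or averages uptime.

```python
def downtime_budget_hours(uptime_target: float, period_days: float) -> float:
    """Hours of downtime allowed over period_days at an average uptime target.

    uptime_target is a fraction (0.99 for 99%). period_days is an assumption
    supplied by the caller; the slides give no length for a data-taking period.
    """
    if not 0.0 <= uptime_target <= 1.0:
        raise ValueError("uptime_target must be a fraction between 0 and 1")
    if period_days < 0:
        raise ValueError("period_days must be non-negative")
    return (1.0 - uptime_target) * period_days * 24.0


if __name__ == "__main__":
    # Hypothetical 100-day data-taking period at the MoU's 99% average.
    budget = downtime_budget_hours(0.99, 100)
    print(f"Downtime budget: about {budget:.0f} hours")
```

Under that assumption the budget is roughly 24 hours, so a single 12-hour outage at one Tier 1 would use about half of it.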

  4. Security Incident Response
     • Joint (LCG/EGEE) Security Policy Group & EGEE Operational Security Coordination Team
       • Based Security Incident Response Policy and procedures on work of Open Science Grid
     • Agreement on Incident Response: see https://edms.cern.ch/document/428035/
     • Sites must
       • Take local action to prevent disruption
       • Report to local security officers
       • Report to others via Grid Incident Response mail list
     • “Volunteer” incident response team created when needed

  5. Incident classification
     • High: (team leader required)
       • The incident could lead to exploitation of the trust fabric, i.e. user and host identities, or the incident could lead to instability of the overall Grid, or a denial-of-service is in progress against all replicas of a given Grid service.
     • Medium: (team leader required if widespread)
       • The incident affects an instance of a Grid service, but Grid stability is not at risk, or a denial-of-service affects one replica of a given Grid service, or a local attack compromised a privileged user account.
     • Low: (team leader probably not required)
       • A local attack compromised individual user, non-privileged credentials, or a denial-of-service attack or compromise affects only local grid resources.
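
The three classes above can be read as a decision rule that checks the High criteria first, then Medium, then falls back to Low. The following is a minimal sketch of that reading. The field names, the `Incident` type and the fallback to Low for anything unmatched are our assumptions for illustration, not part of the agreed policy.

```python
from dataclasses import dataclass
from enum import Enum


class Severity(Enum):
    HIGH = "high"
    MEDIUM = "medium"
    LOW = "low"


@dataclass
class Incident:
    """Hypothetical flags mirroring the criteria listed on the slide."""
    trust_fabric_at_risk: bool = False          # user or host identities exploitable
    grid_stability_at_risk: bool = False        # overall Grid could become unstable
    dos_against_all_replicas: bool = False      # DoS on every replica of a service
    service_instance_affected: bool = False     # one instance of a Grid service hit
    dos_against_one_replica: bool = False       # DoS on a single replica
    privileged_account_compromised: bool = False
    widespread: bool = False                    # spans more than one site


def classify(incident: Incident) -> Severity:
    """Check the High criteria first, then Medium; everything else is Low."""
    if (incident.trust_fabric_at_risk
            or incident.grid_stability_at_risk
            or incident.dos_against_all_replicas):
        return Severity.HIGH
    if (incident.service_instance_affected
            or incident.dos_against_one_replica
            or incident.privileged_account_compromised):
        return Severity.MEDIUM
    # Assumption: incidents matching none of the above are treated as Low.
    return Severity.LOW


def team_leader(incident: Incident) -> str:
    """Team-leader guidance attached to each class on the slide."""
    severity = classify(incident)
    if severity is Severity.HIGH:
        return "required"
    if severity is Severity.MEDIUM:
        return "required" if incident.widespread else "not required"
    return "probably not required"
```

For example, `classify(Incident(dos_against_one_replica=True))` yields `Severity.MEDIUM`, and a team leader is called for only if the incident is also flagged as widespread.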

  6. Emergency procedures
     • JSPG discussed this at last meeting (Sep 2005)
       • Started from point of view of Security incidents
       • But quickly realised that other disasters are also likely, so should deal with these too
     • Very early overview of the issues at this point
       • Certainly no plan yet
     • Invite feedback from HEPiX
       • There must be lots of site-based plans
     • JSPG will produce a draft emergency plan (and address policy issues)
     • Grid Operations and OSCT will need to define the details

  7. JSPG discussion topics
     • What is the scope?
       • LCG vs EGEE?
       • Critical: Tier 0/1, data taking, data integrity
     • Inter-site information flow
       • This is the critical point to be tackled
       • Users, Sys Admins and Managers
     • External information
       • including interface(s) to the Press
     • How do we keep the infrastructure operational?
       • Is this the aim?
     • What do we take down?
       • And who decides?
     • Can optical private networks remain up?
       • And are they sufficient for LCG data taking?
     • How do we deal with Tier 2 problems?

  8. LCG/EGEE Emergency Procedures
     Denise Heagerty, CERN

  9. When are emergency procedures required?
     • Emergency procedures are required to cover the following cases:
       • Incident response plans cannot be followed: critical parts of the infrastructure are unavailable (e.g. mailing lists)
       • Incident response plans are inappropriate: e.g. need to rapidly inform large parts of the community beyond the security contacts, or incident communication channels are compromised
     • Examples
       • Major power cut at Site A lasted several days
       • Cable cut network access to Site B
       • Major worm disrupted network access at Site C
       • Security incident blocks user access to accounts at Site D
       • Wide area exploit of the (homogeneous) security fabric

  10. What is needed in an emergency?
     • Out of band communication channels
       • Alternative service providers (Internet, telephony)
       • Alternative contact details (e-mail, chat, …)
       • Alternative technology
     • Clear decision-making roles
       • There is no time for consensus during a crisis
       • Usual decision-making process needs to be bypassed
     • Clear information flow and roles
       • For at least management, users, the press
       • Reduce the risk of miscommunication
     • Disaster Recovery Plan
       • Definition of critical infrastructure to be kept running or repaired quickly
       • Dependencies and sequence must be clear for restoring services
       • Mailing lists (at CERN) are key to restoring communication
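
One way to make "dependencies and sequence" explicit in a recovery plan is to record what each service needs and derive the restore order from that record. The sketch below uses Python's standard `graphlib` module (Python 3.9+). The service names and their dependencies are hypothetical placeholders, not an inventory of the real LCG infrastructure.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical dependency map: each service lists what must be running first.
DEPENDS_ON = {
    "network": set(),
    "dns": {"network"},
    "mailing_lists": {"network", "dns"},
    "grid_services": {"network", "dns", "mailing_lists"},
}


def restore_order(depends_on: dict) -> list:
    """Return services in an order that restores every dependency first.

    Raises ValueError if the map contains a circular dependency, since no
    valid restore sequence exists in that case.
    """
    try:
        return list(TopologicalSorter(depends_on).static_order())
    except CycleError as exc:
        raise ValueError(f"circular dependency in recovery plan: {exc.args[1]}") from exc


if __name__ == "__main__":
    for step, service in enumerate(restore_order(DEPENDS_ON), start=1):
        print(step, service)
```

With this placeholder map the mailing lists come back before the Grid services that rely on them for coordination. A plan with a circular dependency is rejected rather than silently ordered.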

  11. Some ideas to stimulate discussion
     • Define an emergency advisory committee?
       • Members, mandate
       • Goal is to ensure rapid and appropriate decisions
     • Assure information flow
       • E.g. update DNS servers to point to temporary (web) servers
       • Pre-record messages on telephone help services
     • Prepare alternative communication channels
       • E.g. commercial conference call facilities
       • Alternative Internet providers (e-mail addresses, chat, phone, …)
     • When/do we return to normal Incident Response?

  12. Final words
     • LCG needs a written plan
       • Clear definition of roles
       • Operations staff need to know what to do
       • Training
     • The sites need to agree to policy and procedures
       • Recognise the powers of operations staff
     • Sites already have their own internal plans
       • Now trying to extend to the Grid
     • Feedback and advice are welcome!
