1 / 26

ATLAS Central Services and Computing Security

ATLAS Central Services and Computing Security. Flavia Donno CERN/IT-ES-VOS. Outline. Security in ATLAS: can this model be exported to other LHC experiments? Why we do it How we do it: policies and plan Handling a security incident in ATLAS

topper
Download Presentation

ATLAS Central Services and Computing Security

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. ATLAS Central Services and Computing Security Flavia Donno CERN/IT-ES-VOS Flavia Donno, Tier1 Service Coordination

  2. Outline • Security in ATLAS: can this model be exported to other LHC experiments? • Why we do it • How we do it: policies and plan • Handling a security incident in ATLAS • The ATLAS Central Services Operations Team. Can this evolve into a general services for LHC experiments? • Who we are and what we do • Some details: service inventory and the web redirector • Better interface and documentation to CERN/IT available tools. FlaviaDonno, Tier1 Service Coordination 9/8/2014 2

  3. ATLAS Security: The Goal • Minimize ATLAS central services unavailability during data taking, mitigating vulnerability and improving operation and management of ATLAS services. Preserve ATLAS reputation worldwide. • The focus is on security • Security cannot be enforced without good service management practices Policies • Strategy: increase robustness and availability of critical services while containing vulnerabilities of less critical ones Plan • Other sites supporting ATLAS can follow the same policies where applicable FlaviaDonno, Tier1 Service Coordination

  4. ATLAS Security: references and documentation • Various CERN IT and wLCG/EGEE documents define policies and good practices: • VOBox Service Level Agreement (CERN-IT/FIO-FS) • https://twiki.cern.ch/twiki/pub/FIOgroup/FsSLA/sla-v1.2.1.pdf • VOBox Security Recommendations and Questionnaire (wLCG Joint Security Policy Group) • https://edms.cern.ch/file/639856/0.6/VO-Box-security-policy-0-6.pdf • GRID system administrators best practice and guidelines (EGEE security group) • http://rss-grid-security.cern.ch/glite.php?display=1 • SECURITY POLICIES AND PLAN FOR CERN ATLAS CENTRAL SERVICES • https://twiki.cern.ch/twiki/pub/Atlas/ATLASInternalSecurityPlan/ATLAS_Security_PlanV20.pdf • Special thanks to Sebastian Lopienski, Romain Wartel and Stefan Lueders FlaviaDonno, Tier1 Service Coordination

  5. Key figures and roles FlaviaDonno, Tier1 Service Coordination

  6. The policies • Sensors are provided to warn SM of warranty expirations. • Changes cannot be made to existing hardware configurations. FlaviaDonno, Tier1 Service Coordination

  7. The policies FlaviaDonno, Tier1 Service Coordination

  8. The Service Documentation Card If you edit your service Twiki page, you are presented With the edit window. At the bottom you find The button “Add Form” (*) https://twiki.cern.ch/twiki/bin/view/Atlas/ATLASServiceDocumentationTemplate No need to copy and past. Just follow the instructions to add a form in your twiki. FlaviaDonno, Tier1 Service Coordination

  9. The policies • (**) https://twiki.cern.ch/twiki/bin/view/Atlas/CentralServicesManagementPoliciesAndProcedures • Procedures and good practices are listed in the twiki. It is ongoing work. • Distribution through rpms (ATLAS software repository) • SLC5 migration • ATLAS secure external configuration repository (based on SINDES) • Protecting DB passwords • Good practices (aliases, http vs https and CERN SSO, svn checkout, …) • (***) The VOBox service is provided as a business hours service on CERN working days. • For 24/7 coverage procedures can be provided for the operators (OP) • The VOC or AM must react to monitoring alarms escalated to them. • The VOC or AM should be careful not to take actions that could raise alarms FlaviaDonno, Tier1 Service Coordination

  10. The policies • System level security compliance is monitored by the CERN IT security team • SM/AM must proactive check for security related updates • Housekeeping of the machines is the responsibility of the VOC and AM • (logs rotation, tmp space management, etc.) FlaviaDonno, Tier1 Service Coordination

  11. The policies [a] Report a Computer Security Incident http://lcg.web.cern.ch/LCG/incident.htm [b] LCG Agreement on Incident Response https://edms.cern.ch/document/428035 FlaviaDonno, Tier1 Service Coordination

  12. More on Security in ATLAS • The AM must make available to the VOC and VO operators the list of possible threats/risks and actions taken to mitigate them (software updates and security patches, etc.). In particular services should depend as little as possible on specific package versions, allowing for automatic updates. • The AM is responsible to train ATLAS operators to identify anomalies with the services. • The SM is responsible for monitoring the service for the correct behavior (load, performance, available free space, etc. – it might reveal an intrusion) • The VOC will regularly review machine configurations to identify possible security threats (limited interactive or root access, interactive access [or sudo privileges!!!] to generic accounts, open doors, access control lists, un-needed services, world-writable files/directories [used in init.d], etc.) - we work with FIO to make this process semi-automatic. • The VOC might run security challenges vs. specific services if needed. • VOC, AM and SM are responsible to follow recommended procedures: • http://rss-grid-security.cern.ch/glite.php FlaviaDonno, Tier1 Service Coordination

  13. The plan FlaviaDonno, Tier1 Service Coordination

  14. The plan …such tasks will be repeated every three months. FlaviaDonno, Tier1 Service Coordination

  15. Handling a security incident in ATLAS • Report all suspicious security incidents to the ATLAS VOC: atlas-adc-central-services@cern.ch and to Computer.Security@cern.ch or phone 70500. Please use caution and discretion. • The ATLAS VOC will coordinate the over incident response process in Atlas: in particular, the VOC will contact the AM/SM for the specific service. The AM/SM will be contacted both via phone (whenever provided) and e-mail. The subject of the e-mail is of the form “[Important|Severe]: SI#” where # is an internal sequence number. In case of a severe incident we expect AM/SM to respond within 1 hour. • The service will be removed from the public network(restricting to local access if possible) weighting this with the impact to the experiment and the organization. Depending on the severity of the incident, services can be shut down either in agreement with AM/SM or, in case of unresponsiveness, independently (contacting ATLAS management and informing user communities). FlaviaDonno, Tier1 Service Coordination

  16. Handling a security incident in ATLAS • As in case of normal maintenance interventions, if specific procedures are provided by the AM/SM concerning the unavailability of the service, the ATLAS VOC will follow them (i.e. informing user communities, advertising downtime, redirecting dependent services, operation procedures, etc.) • In agreement with the AM/SM and if the nature of the incident will allow it, the service could be partially restored on hot spares, if previously arranged. • The original machines where the service was running and intruded will be made available to the CERN Security Team for analysis. However, it is the responsibility of the AM and SM to investigate the incident with the advice and guidance of the CERN Security Team. FlaviaDonno, Tier1 Service Coordination

  17. Handling a security incident in ATLAS • Once the service vulnerability is mitigated by the FM/AM/SM, the ATLAS VOC will coordinate with the CS for a service examination/scan to make sure the mitigation is effective. • Once the vulnerability problem is solved the service is restored. FlaviaDonno, Tier1 Service Coordination

  18. The C entralServices Operations Team C O S FlaviaDonno, Tier1 Service Coordination • Started with B. Koblitz who left in July 2009 • The team: Flavia Donno (CERN-IT/GS) – coordination Serguei Baranov (JINR - Russia) Sergey Makarychev (ITEP - Russia) - here till end of December 2009 Alexey Buzykaev (BINP - Russia) - left on November 18th, 2009 9/8/2014 19

  19. The ATLAS Central Services Team FlaviaDonno, Tier1 Service Coordination • The mandate: • ATLAS contact for CERN/IT (Alarm Tickets, security, etc.) • Management of ATLAS Central Services according to the SLA between CERN-IT and ATLAS (hardware requests, quattor, best practices, etc.) • Providing assistance with machine, software and service management to ATLAS service developers and providers (distribution practices, software versions, rebooting, etc.) • Provision of general frameworks, tools and practices to guarantee availability and reliability of services (squid/frontier distribution, web redirector, hot spares, sensors, better documentation, etc.) • Collection of information and operational statistics (service inventory, machine usage, etc.) • Enforcing the application of security policies • Training of newcomers in the team • Spreading knowledge among service developers and providers about the tools available within CERN/IT (Shibboleth2, SLS, etc.) • Participating in the work of the ADC Operations Team • And more … 9/8/2014 20

  20. The ATLAS Central Services Team FlaviaDonno, Tier1 Service Coordination • Communication channels: • Support requests are submitted here: https://savannah.cern.ch/support/?group=atlascsops • You can contact us via e-mail: atlas-adc-central-services@cern.ch • Documentation: • Useful twiki pages: https://twiki.cern.ch/twiki/bin/view/Atlas/CentralServicesManagementPoliciesAndProcedures https://twiki.cern.ch/twiki/bin/view/Atlas/ADCMachinesHowto • Public Machine and Service Inventory: https://twiki.cern.ch/twiki/bin/view/Atlas/DistributedComputingMachines • Security Document: https://twiki.cern.ch/twiki/pub/Atlas/ATLASInternalSecurityPlan/ATLAS_Security_Planv20.pdf 9/8/2014 21

  21. Service Inventory https://twiki.cern.ch/twiki/bin/view/Atlas/DistributedComputingMachines FlaviaDonno, Tier1 Service Coordination

  22. Service Inventory • We manage a total of 184 machines • [virtual/development/production] • Production • 13 machines for Run Time Tester Service • 64 machine for the ATLAS Build services • 53 machines used for production services (DQM, Site Services, Central Catalogues, T0, etc.) • Development • 23 Development Machines • 20 Virtual Machines (on 3 physical ones) • Spares • 5 spares • 6 unknown 9/8/2014 24 FlaviaDonno, Tier1 Service Coordination

  23. Availability in ATLAS • Normal operations are complemented by the following actions to ensure availability of services: • Every ~6 months the ATLAS VOC will exercise rebooting and moving of some critical, load-balanced services on hot spares in order to be prepared for emergencies. • In coordination with the Security Team the ATLAS VOC performs regular checks on the existing services for signaled vulnerabilities. • The ATLAS VOC makes available web framework (such as the ATLAS web redirector) services to AM/SM. Such frameworks ensure a secure and CERN supported environment where to run web services (Shibboleth, CERN approved software packages, safe services etc.). In case the web redirector cannot be used, recommendations will be given to avoid common problems. • The ATLAS VOC advices and provides support on specific software packages and their versions to be used on specific platforms (Python 2.6, emacs, mod_python, mod_wsgi, Django, etc.). • The ATLAS VOC provides sensors for most common alarm needs. FlaviaDonno, Tier1 Service Coordination

  24. The web redirector atmbadm https://atlas-minimumbias.cern.ch atmbsrv https://atlas-trigconf.cern.ch https://atlas-runquery.cern.ch attrcadm attrcsrv atrqsrv atrqadm Server accounts Administration accounts FlaviaDonno, Tier1 Service Coordination 9/8/2014 26

  25. Conclusions • Security is important both for obtaining what we are working for (physics results) and for the good reputation of ATLAS and all LHC experiments • Please, help us enforcing it! • The ATLAS security model is general and can be applied to other LHC experiments • Central Services Operations Teams have an important role for the smooth running of experiment services • The goal is to improve service availability, reliability and management • We depend on CERN/IT, but a lot to be done • Can the ATLAS model be expanded in order to offer a common services to all LHC experiments? FlaviaDonno, Tier1 Service Coordination 9/8/2014 27

  26. C O S … comments/suggestions/corrections ? Please, send e-mail to Flavia.Donno@cern.ch FlaviaDonno, Tier1 Service Coordination

More Related