Report on second TN Disconnection test of January 28th 2014



1. Report on second TN Disconnection test of January 28th 2014. Presented by Alastair Bland (BE/CO) to the January 25th 2014 LBOC. Based on the 2013 slides to the TIOC and FOM by Stefan Lüders (CERN Computer Security Officer) and his January 30th 2014 Computing and Network Infrastructure for Controls minutes: http://information-technology.web.cern.ch/about/meeting/cnic-meetings

2. Outline
   • CERN Networking overview
   • Objective of test and disclaimer
   • TN Disco Procedure
   • Findings 2014
   • Mitigation needed before next test
   • Findings 2013 still to be fixed
   • Next steps

3. CERN Networking
   “Trusted” Bypass List (~1200 hosts, sorted by function): AB FEC DEV, AB FESA DEV, AB LINUX DEV, AB MISC DEV, AB PO FECS, AB WIN DEV, BE OPERATIONS, BE VM DEVELOPMENT, DIP GPN HOSTS, EN-CV TEMPORARY PATCHING, EN-ICE APPLICATION GATEWAYS, GS-ASE CSAM - NO TN, GS-ASE-AAS SERVERS - NO_TN, GS-SE PLC SERVERS TRUSTED BY TN, ISOLDE NO TN, IT BACKUP SERVERS, IT CC AFS, IT CC CDB, IT CC CONSOLE SERVICE, IT CC CVS, IT CC SVN, IT DB ACCELERATOR, IT DB MANAGEMENT, IT DB WEB, IT LICENCE SERVERS, IT LINUXSOFT, IT NICE CMF SERVICES, IT SECURITY BACKENDS, IT TELEPHONE SERVERS, LCR FOR 936, LCR FOR BI, LCR FOR BT, LCR FOR CCR, LCR FOR ICE, LCR FOR RF, LCR FOR STI, LCR FOR VACUUM, NICE_CA, NICE_DFS, NICE_DOMAINCONTROLLERS, NICE_MAIL_MX, NICE_PRINTING, NICE_TS_INFRASTRUCTURE, NICE_XLDAP, PH EXPERIMENTS, SC GPRS, TN APPLICATION GATEWAYS, TN INTERNET PUBLISHERS, TN WEB SERVERS, TS-CV SERVERS - NO_TN, TS-EL SERVERS - NO_TN, TS-FM SERVERS - NO_TN, TS-HE GPRS
   “Exposed” List (~90 hosts, sorted by function): AB LINUX OPER, AB WIN OPER, CMW PUBLISHERS, GS-SE MASTER PLCS EXPOSED TO GPN, LASER EXPOSED, ORACLE APPLICATION SERVERS, POST MORTEM PUBLISHERS, SC DOSE-READERS, SUSIDB-ACCESS-CONTROL, TIMING EVENT SERVERS
   • The TN should not depend on the GPN by design.
   • In principle, gateways are only necessary for developing or configuring systems and should not be needed for running accelerators / technical infrastructure (a minimal reachability probe is sketched after this slide).
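Since the point of the bypass and exposed lists is to enumerate cross-network dependencies, a quick sanity check before such a test is to probe, from a TN console, the GPN-side services it is believed to need. Below is a minimal sketch (Python 3 assumed available on the console); the three targets are services mentioned later in this report, the ports are assumed defaults, and the script is not an official CERN tool.

    #!/usr/bin/env python3
    """Minimal dependency probe, run from a TN host.  Illustrative only."""
    import socket

    GPN_SERVICES = [
        ("LDAP (phonebook)", "xldap.cern.ch", 389),
        ("SSO / winservices-soap", "winservices-soap.web.cern.ch", 443),
        ("TIMWEB help alarms", "oraweb.cern.ch", 443),
    ]

    def reachable(host, port, timeout=2.0):
        """True if a TCP connection to host:port succeeds within `timeout` seconds."""
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            return False

    if __name__ == "__main__":
        for name, host, port in GPN_SERVICES:
            status = "OK" if reachable(host, port) else "UNREACHABLE"
            print("%-25s %s:%d -> %s" % (name, host, port, status))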

4. Objective & Disclaimer
   • Reassure people that a disconnection does not do harm.
   • “Understand the extent to which control systems connected to the TN depend on external services (GPN, CERN b513 CC) and in how far these systems are able to run autonomously in case the GPN is not available. Given past experience, the dependency should be rather low, but hidden dependencies might exist.”
   • Confirm that “TN disconnection” is a valid preventive action in case of major security incidents on the GPN or in the CC.
   DISCLAIMER
   • We are in the middle of LS1.
   • Not all systems were up and running, and few were in operational mode.
   • Thus, the findings of the second TN Disco Test were limited.

5. TN Disco Procedure
   The FOM, TIOC and LS1 Committee approved the test for Tuesday January 28th 2014. The FOM added that a final test in May/June/July might be difficult to schedule. The IEFC, LMC and LCSP were informed, a Note d’Intervention was issued and the CERN Status Board was updated (https://cern.service-now.com/service-portal/view-outage.do?from=CSP-Service-Status-Board&&n=OTG6329).
   The procedure applied was (a schematic sketch of the staged schedule follows after this slide):
   • Before Jan. 27th: Inform all affected parties.
   • Jan. 27th: Freeze the TN-GPN gate and clone the “TN BYPASS LIST”.
   • Jan. 28th 9:00: Inform TI. Start of TN Disco Test.
   • 9:15: De-trust the individual hosts in the TN BYPASS LIST as well as the following LANDB sets used for development or GPN experiments: AB FEC DEV, AB FESA DEV, AB LINUX DEV, AB MISC DEV, AB PO FECS, AB WIN DEV, BE VM DEVELOPMENT, EN-CV TEMPORARY PATCHING, EN-ICE APPLICATION GATEWAYS, ISOLDE NO TN, IT CC SVN, LCR FOR 936, LCR FOR BI, LCR FOR BT, LCR FOR CCR, LCR FOR ICE, LCR FOR RF, LCR FOR STI, LCR FOR VACUUM, NICE_TS_INFRASTRUCTURE, PH EXPERIMENTS, TN APPLICATION GATEWAYS, TN WEB SERVERS.
   • 9:30: De-trust the following sets for data export: TN EXPOSED TO GPN, DIP GPN HOSTS, TN INTERNET PUBLISHERS.
   • 9:45: De-trust the following sets for misc. IT services: IT CC AGILE INFR, IT CC CONSOLE SERVICE, IT SECURITY BACKENDS, IT TELEPHONE SERVERS, NICE_PRINTING.
   • 10:00: De-trust the following sets for Windows: IT NICE CMF SERVICES, NICE_DFS.
   • 10:40: De-trust the following sets for Linux: IT CC AFS, IT LICENCE SERVERS, IT LINUXSOFT.
   • 10:54: De-trust the following sets for operations: BE OPERATIONS, IT BACKUP SERVERS, IT CC CDB, IT DB ACCELERATOR, IT DB MANAGEMENT, IT DB WEB.
   • 11:18: De-trust the following sets for IT basics: NICE_CA, NICE_DOMAINCONTROLLERS, NICE_MAIL_MX, NICE_XLDAP.
   • 11:30: De-trust the following sets for areas where there is no TN: GS-ASE CSAM - NO TN, GS-ASE-AAS SERVERS - NO_TN, GS-SE PLC SERVERS TRUSTED BY TN, SC GPRS, TS-CV SERVERS - NO_TN, TS-EL SERVERS - NO_TN, TS-FM SERVERS - NO_TN, TS-HE GPRS.
   • 11:30: Cut the power to the two TN-GPN gates (in the CCR and b513).
   • 13:00: Re-establish power and re-establish the status quo.
   • 13:00: Inform TI. End of TN Disco Test.
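For illustration only, the staged de-trusting above can be summarised as a small script. The schedule below is a subset of the slide's timetable and detrust_landb_set is a hypothetical placeholder, not the real LANDB or network-service interface used by IT/CS.

    #!/usr/bin/env python3
    """Schematic sketch of the staged de-trusting during the test."""

    SCHEDULE = [
        ("09:15", "development / GPN experiments", ["AB FEC DEV", "AB FESA DEV", "TN WEB SERVERS"]),
        ("09:30", "data export", ["TN EXPOSED TO GPN", "DIP GPN HOSTS", "TN INTERNET PUBLISHERS"]),
        ("09:45", "misc. IT services", ["IT SECURITY BACKENDS", "NICE_PRINTING"]),
        ("11:30", "cut power to the TN-GPN gates", []),
    ]

    def detrust_landb_set(name):
        # Placeholder: the real test used the network database / CS procedures.
        print("  de-trusting LANDB set: %s" % name)

    def run_schedule(schedule):
        for when, label, sets in schedule:
            print("%s  %s" % (when, label))
            for s in sets:
                detrust_landb_set(s)
            # In the real test the operators simply waited for the next slot;
            # here the steps just run back-to-back.

    if __name__ == "__main__":
        run_schedule(SCHEDULE)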

6. 2014 Findings
   • Linux PCs without AFS worked well. Those with AFS, however, suffered severely from dependencies on the local AFS client as well as some hidden dependencies on Kerberos. Thus, without the AFS service proper accelerator operation is NOT possible (a small console self-check sketch follows after this slide).
   • The Windows console reboot sequence is now acceptable (below 5 minutes reboot time when the GPN and especially DFS are missing), indicating that caching of credentials and using local Domain Controllers worked well. [The BE/CO DHCP server was used for this test, see below.]
   • However, rebooting normal Windows consoles was not possible because they could not obtain a new IP address, as the DHCP servers are all on the GPN. Thus, without dedicated DHCP servers on the TN proper accelerator operation is NOT possible.
   • Login to the Controls Database forms worked, thanks to work by BE/CO/DA since 2013.
   • RBAC authentication depends on the CERN IT web service. Thus, without RBAC proper accelerator operation is NOT possible.
   • The TI operators observed a hiccup with the LHC Laser console, but this was not reproducible. Apart from this, all TI applications that were open at the time appeared to work fine (PVSS supervisions, TIM, ENS and PSEN).
   • For the Access System’s PS domain, a few dependencies on GPN services were discovered; they shall be mitigated during this year by using TN equivalents.
   • Power converter controls were unaffected, but suffered from the Linux & CCM problems.
   • Tests from the Linac control room in building 363 confirmed that in principle operation is possible even during a disconnection, but diagnostics and development would be inhibited.
   • The LHC experiments saw no problems.
   • Development and testing was not possible from offices for many colleagues, as the GPN does not host a sufficiently independent development and test environment.
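A rough self-check along the lines of these findings, for a Linux console, could look like the sketch below: it reports whether /afs is mounted (the AFS/Kerberos dependency that blocks logins) and whether the RBAC-related web service on the GPN is still reachable. This is an illustrative Python 3 sketch, not part of any CERN deployment.

    #!/usr/bin/env python3
    """Rough console self-check inspired by the 2014 findings."""
    import socket

    def has_afs_mount(mounts_file="/proc/mounts"):
        """True if an AFS filesystem is currently mounted."""
        try:
            with open(mounts_file) as f:
                for line in f:
                    fields = line.split()
                    if len(fields) >= 3 and (fields[1] == "/afs" or fields[2] == "afs"):
                        return True
        except OSError:
            pass
        return False

    def can_reach(host, port, timeout=2.0):
        """True if a TCP connection succeeds within `timeout` seconds."""
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            return False

    if __name__ == "__main__":
        print("AFS mounted (blocks PAM/login if GPN is cut):", has_afs_mount())
        print("winservices-soap reachable (needed by RBAC): ",
              can_reach("winservices-soap.web.cern.ch", 443))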

7. Mitigation needed before next test (1/2)
   • (Richard Scrivens, Alastair Bland, Arne Wiebalck) Restarting SLC6 Linux PCs hung due to a blocking OpenAFS client (v1.6) which turned out to be “unkillable”. When shutting down systems, “afsumount” blocks forever. SLC5 worked better. Arne confirmed that they have a patch which corrects this.
   • (Peter Sollander, Richard Scrivens, Alastair Bland, Arne Wiebalck, Jarek Polok) Login to Linux consoles and servers failed, as any activity invoking the PAM stack (pam_krb5.so) was blocked because it “sees” an /afs mount and tries to execute an AFS syscall. This includes failing to start CCM, the Diamon console, Java, as well as PVSS 3.8 and Labview from the CCM menus, and sudo, ssh, su, … (already started programs continued working; kinit worked; systems without AFS were fine, too).
   • (Richard Scrivens, Alastair Bland) Resolving hostnames via “host” was not working properly (strangely, “nslookup” behaved better). This is possibly due to different configuration files, or a difference in using TCP vs. UDP.
   • (Alastair Bland, David Gutierrez Rueda) With the current DHCP servers all connected to the GPN, it won’t be possible to get a lease for new devices or for devices that need to renew their lease. This blocked any reboot of Windows PCs as they “forget” their current IP address. Linux PCs were fine. For mitigation, a redundant pair of DHCP servers connected to the TN would be needed (possibly in b513 and b874).
   • The DHCP lease time is fixed to seven days, and properly configured devices shall regularly try during the second 3.5 days of this period to get a new lease. Thus, this is fine as long as the disconnection lasts less than 3.5 days (see the lease-time sketch after this slide).
   • (Peter Sollander) The TI piquet webpage (http://abop-piquets.web.cern.ch/ABOP-piquets/) and the TIMWEB help alarm pages (e.g. https://oraweb.cern.ch/pls/timw3/helpalarm.AlarmQuery?p_header=Y) were not available as these are hosted on the GPN.
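The DHCP point is simple lease arithmetic, made explicit in the sketch below: with a seven-day lease and renewal starting at half of it, the worst-placed host at the moment of disconnection still has 3.5 days of valid address left, which is where the "less than 3.5 days" limit comes from. Pure illustration, no real DHCP involved.

    #!/usr/bin/env python3
    """Back-of-the-envelope check of the DHCP lease argument."""
    from datetime import timedelta

    LEASE = timedelta(days=7)       # lease time quoted on the slide
    RENEW_START = LEASE / 2         # clients start renewing after 3.5 days

    def remaining_lease(time_since_last_renewal):
        """Lease time left for a host that last renewed this long ago."""
        left = LEASE - time_since_last_renewal
        return max(left, timedelta(0))

    if __name__ == "__main__":
        # Time since the last successful renewal at the moment of disconnection;
        # at most 3.5 days if DHCP was healthy until then.
        for days in (0.0, 1.0, 2.0, 3.5):
            left = remaining_lease(timedelta(days=days))
            print("renewed %.1f days ago -> address valid for another %s" % (days, left))
        print("worst-case margin if the disconnection starts now:", LEASE - RENEW_START)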

8. Mitigation needed before next test (2/2)
   • (Piotr Golonka, Peter Sollander) The PVSS device faceplate could not be opened as it tries to connect to the HelpAlarm DB (Oracle). This puts the UI on hold until the connection is successful or times out (5-10 minutes on Windows PCs, a couple of seconds on Linux PCs); see the timeout sketch after this slide.
   • (Richard Scrivens, Alastair Bland) It was not possible to open a FESA Navigator window due to an exception in thread "JavaFX Application Thread": java.lang.Error: Unable to load configuration file (http://wwwpsco.cern.ch/private/java/fesa/3.1/Fesa3NavigatorNext/fesa.cfg).
   • (Stephen Page, Alastair Bland) EquipState did not work, probably due to an AFS dependency.
   • (Alastair Bland, Anthony Rey) All Fesa 2.10 (and perhaps Fesa 3) tools could not be started. This is probably caused by wrong configurations within Fesa which create dependencies on the GPN.
   • (Piotr Golonka) Starting PVSS UIs on Cryo consoles is delayed waiting for LDAP requests to time out. A fix is ready but pending deployment.
   • (Uwe Epting, Michal Kwiatek) CV consoles were not able to start some applications because they required DFS. Probably the DFS caching was not properly configured.
   • (Uwe Epting, Stefan Lueders, Timo Hakulinen) As the application gateways for EN/CV were located on the GPN, any interventions from Windows PC terminals in the field were inhibited. It was therefore suggested to deploy similar, dedicated Windows 2008 WTS for GS/ASE and EN/CV on the TN.
   • (Piotr Golonka) As expected, the retrieval of PVSS historical events is not possible as the corresponding DB is connected to the GPN. The archiving/retrieval of historical data will be solved with the new DBs placed in the TN (to be put in production in a few days).
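The faceplate issue is a classic synchronous-connect timeout problem. The sketch below shows the mitigation pattern in Python: probe the database endpoint with an explicit short timeout and degrade gracefully instead of letting the UI hang for the full OS connect timeout. The host and port are invented placeholders, not the real HelpAlarm DB.

    #!/usr/bin/env python3
    """Fail-fast probe before opening a GUI's database connection."""
    import socket

    HELPALARM_DB = ("helpalarm-db.example.cern.ch", 1521)   # hypothetical Oracle listener

    def db_available(host, port, timeout=2.0):
        """Fail fast instead of blocking the UI for minutes."""
        try:
            with socket.create_connection((host, port), timeout=timeout):
                return True
        except OSError:
            return False

    def open_faceplate():
        if db_available(*HELPALARM_DB):
            print("HelpAlarm DB reachable -- loading alarm help as usual")
        else:
            print("HelpAlarm DB unreachable -- opening faceplate without alarm help")

    if __name__ == "__main__":
        open_faceplate()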

9. 2013 Findings still valid (1/2)
   • (IT/OIS) Dependency on CERN SSO/winservices-soap for certain web applications:
      • (Wojciech Sliwinski) Login to RBAC as well as the e-logbook depends on https://winservices-soap.web.cern.ch. In the absence of the central web servers (i.e. webr10), logins were impossible.
      • (Anna Suwalska) Login to the TIM Reference tools, based on the NICE web service, was not possible.
      • (Anna Suwalska) Login to the TIM Viewer applications did not work during the disconnection test as they are based on RBAC. This implied that accesses to the CTF3 and SPS tunnels could not be granted.
      • (Jean-Michel Nonglaton) CTF3 operations had similar issues with login to RBAC on Diamon and Oasis.
      • (Mark Buttner) For LASER/DIAMON, similar RBAC problems were reported when renewing the token; their GUI was blocked (see the token-renewal sketch after this slide).
      • (Wojciech Sliwinski) This RBAC dependency will be changed during LS1 to a BE/TN Active Directory instance. The issue is followed up in Jira: http://issues/browse/RBAC-488.
   • (Vito Baggiolini) The “phonebook” command on Linux failed as it points to $ldapserver="xldap.cern.ch".
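The LASER/DIAMON token problem suggests a graceful-degradation pattern: when the renewal endpoint is unreachable, keep using the cached token until it actually expires instead of blocking the GUI. The sketch below illustrates the idea; the class and functions are invented, not the real RBAC API.

    #!/usr/bin/env python3
    """Graceful degradation when the token-renewal endpoint is unreachable."""
    import time

    class CachedToken:
        def __init__(self, value, lifetime_s):
            self.value = value
            self.expires_at = time.time() + lifetime_s

        def still_valid(self):
            return time.time() < self.expires_at

    def renew_token():
        """Placeholder for the real call to the RBAC web service; here it
        always fails, as it did during the disconnection test."""
        raise ConnectionError("winservices-soap.web.cern.ch unreachable")

    def get_token(cached):
        try:
            return renew_token()              # would return a fresh CachedToken
        except ConnectionError:
            if cached is not None and cached.still_valid():
                return cached                 # degrade gracefully instead of blocking
            raise                             # nothing valid left: report the failure

    if __name__ == "__main__":
        cached = CachedToken("existing-token", lifetime_s=3600)
        print("using token:", get_token(cached).value)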

10. 2013 Findings still valid (2/2)
   • (Pierre Baehler) The CPS/PSB/AD/LEIR Tomoscope application invokes “Mathematica”. With the corresponding license server not available, beam tuning will not be possible.
   • (Jean-Michel Nonglaton) CTF3 reported (expected) problems with (compiled) Matlab due to missing AFS (and license servers).
   • (IT/OIS) DFS home folders are not available and, thus, remote access to shared project folders is inhibited. Existing user profiles, however, are not affected as these are created and stored locally.
   • (David Gutierrez Rueda) Neither modifications to the network configuration nor monitoring of the equipment (and, subsequently, interventions) would be possible, as LANDB and the Spectrum servers are currently on the GPN.
   • (Timo Hakulinen, Rui Nunes) For the LHC/SPS/PS access system, access points as well as individual surface door controllers would work in off-line mode (with their local database of users and rights). Thus, any access requests entered in IMPACT sufficiently in advance of the TN Disco Test are not affected. However, if IMPACT requests are made during the Disco Test, they cannot be propagated to the access system and access will therefore be denied. The same holds for newly registered people. Local panel-PC displays on access points will not be able to display live information. (A toy model of this off-line fallback follows after this slide.)
   [The 2013 findings which are (currently) tolerated have not been included in the list above.]
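The access-system behaviour, i.e. falling back to a locally cached snapshot of users and rights when the central service is unreachable, can be illustrated with the toy model below. Names and data are invented; the real controllers obviously do much more.

    #!/usr/bin/env python3
    """Toy model of the access points' off-line fallback."""

    LOCAL_RIGHTS_SNAPSHOT = {      # cloned to the access points before the test
        "jdoe":   {"SPS tunnel"},
        "asmith": {"CTF3", "SPS tunnel"},
    }

    def central_lookup(user, zone):
        """Placeholder for the on-line query; unreachable during the test."""
        raise ConnectionError("central access database not reachable")

    def access_allowed(user, zone):
        try:
            return central_lookup(user, zone)
        except ConnectionError:
            # Off-line mode: rely only on the pre-disconnection snapshot.
            return zone in LOCAL_RIGHTS_SNAPSHOT.get(user, set())

    if __name__ == "__main__":
        print(access_allowed("jdoe", "SPS tunnel"))   # True: right existed before the test
        print(access_allowed("jdoe", "CTF3"))         # False: not in the snapshot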

11. Next Steps
   • (CNIC) Fix issues related to Windows DHCP and the Linux AFS dependency.
   • (BE) Fix issues related to RBAC availability.
   • (BE-TE-EN management) Define the operation level in case of a TN/GPN disconnection (a schematic sketch of this decision logic follows after this slide):
      • Scenario 1: Immediately stop any beam and put accelerators in a safe mode.
      • Scenario 2: Keep operation as usual; stop only if the disco lasts more than NN mins.
      • Scenario 3: Depending on the machine mode, either stop LHC beam (e.g. if not yet in physics) or keep physics mode until EIC/experiments detect a non-safe situation.
      • Scenario 4: (other scenarios as defined by the accelerator sector)
   • (CNIC) Once defined, provide cost estimates of mitigations and fixes.
   • (BE-TE-EN management) Decide which scenario to implement, taking external costs of e.g. the IT department into account.
   • (CNIC) Organise and coordinate the implementation of mitigations and fixes.
   • (CNIC) Re-conduct the TN Disco Test in mid-2014 with mitigations in place and more systems operational and online.
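To make the scenario discussion concrete, the sketch below renders the four options as a decision function. The "NN minutes" threshold is deliberately left as a parameter because the slide does not define it, and the mapping is illustrative, not an agreed policy.

    #!/usr/bin/env python3
    """Schematic rendering of the disconnection scenarios."""

    def reaction(scenario, disco_minutes=0, nn_minutes=None, in_physics=False):
        if scenario == 1:
            return "stop any beam immediately and put accelerators in a safe mode"
        if scenario == 2:
            if nn_minutes is None:
                raise ValueError("NN minutes is not yet defined by management")
            return "keep operation as usual" if disco_minutes <= nn_minutes else "stop"
        if scenario == 3:
            return "keep physics mode" if in_physics else "stop LHC beam"
        return "to be defined by the accelerator sector"

    if __name__ == "__main__":
        print(reaction(1))
        print(reaction(3, in_physics=True))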

12. A big THANK YOU…
   … to all those fine people being present in the CCC during the two tests, checking and helping:
   • Richard Scrivens (BE/ABP)
   • Vito Baggiolini, Mark Buttner, Pierre Charrue, Jean-Michel Elyn, Luigi Gallerani, Enzo Genuardi, Mikhail Grozak, Jose Rolland Lopez De Coca, Nicolas de Metz-Noblat, Stephen Page, Louis Pereira, Jakub Wozniak, Zornitsa Zaharieva (BE/CO)
   • Pierre Freyermuth, Julien Pache, Laurette Ponce, Peter Sollander (BE/OP)
   • Pierre Carbonez (DGS/RP), Gustavo Segura Millan (DGS/SEE)
   • Uwe Epting (EN/CV), Piotr Golonka (EN/ICE)
   • Timo Hakulinen (GS/ASE)
   • David Gutierrez Rueda, Jean-Michel Jouanigot (IT/CS), Stefan Lüders (IT/DI), Arne Wiebalck (IT/DSS), Michal Kwiatek, Jarek Polok (IT/OIS)
   • Fabien Antoniotti, Helder Filipe Carvalho Pereira (TE/VSC)
   • Fabien Chevet (TE/CRG), Bozhidar Ivanov Panev (TE/MPE)
