
Service Resilience


Presentation Transcript


  1. Service Resilience Introduction & overview • GridPP22 – UCL • 1st April 2009

  2. Overview • What does it mean to be resilient - hummm • How are we doing - ooohhh • What areas do we need to consider - aaahhh • What can we do - yeeha

  3. COST vs RISK Being resilient is the ability to rapidly adapt and respond to a change. What we can achieve depends on how much it will cost vs the impact of a failure. Svalbard seed vault, Spitsbergen

  4. Does being resilient mean you never fail? Blackout sequence of events, August 14, 2003 (times in EDT):
  • 12:15 p.m. Incorrect telemetry data renders inoperative the state estimator, a power-flow monitoring tool operated by the Ohio-based Midwest Independent Transmission System Operator (MISO). An operator corrects the telemetry problem but forgets to restart the monitoring tool.
  • 1:31 p.m. The Eastlake, Ohio generating plant shuts down. The plant is owned by FirstEnergy, an Akron, Ohio-based company that had experienced extensive recent maintenance problems.
  • 2:02 p.m. The first of several 345 kV overhead transmission lines in northeast Ohio fails due to contact with a tree in Walton Hills, Ohio.
  • 2:14 p.m. An alarm system fails at FirstEnergy's control room and is not repaired.
  • 2:27 p.m. A second 345 kV line fails due to contact with a tree.
  • 3:05 p.m. A 345 kV transmission line known as the Chamberlain-Harding line fails in Parma, south of Cleveland, due to a tree.
  • 3:17 p.m. Voltage dips temporarily on the Ohio portion of the grid. Controllers take no action.
  • 3:32 p.m. Power shifted by the first failure onto another 345 kV power line, the Hanna-Juniper interconnection, causes it to sag into a tree, bringing it offline as well. While MISO and FirstEnergy controllers concentrate on understanding the failures, they fail to inform system controllers in nearby states.
  • 3:39 p.m. A FirstEnergy 138 kV line fails.
  • 3:41 p.m. A circuit breaker connecting FirstEnergy's grid with that of American Electric Power is tripped as a 345 kV power line (Star-South Canton interconnection) and fifteen 138 kV lines fail in rapid succession in northern Ohio.
  • 3:46 p.m. A sixth 345 kV line, the Tidd-Canton Central line, trips offline.
  • 4:06 p.m. A sustained power surge on some Ohio lines begins an uncontrollable cascade after another 345 kV line (Sammis-Star interconnection) fails.
  • 4:09:02 p.m. Voltage sags deeply as Ohio draws 2 GW of power from Michigan, creating simultaneous undervoltage and overcurrent conditions as power attempts to flow in such a way as to rebalance the system's voltage.
  • 4:10:34 p.m. Many transmission lines trip out, first in Michigan and then in Ohio, blocking the eastward flow of power around the south shore of Lake Erie. Suddenly bereft of demand, generating stations go offline, creating a huge power deficit. In seconds, power surges in from the east, overloading east-coast power plants whose generators go offline as a protective measure, and the blackout is on.
  • 4:10:37 p.m. The eastern and western Michigan power grids disconnect from each other. Two 345 kV lines in Michigan trip. A line that runs from Grand Ledge to Ann Arbor known as the Oneida-Majestic interconnection trips. A short time later, a line running from Bay City south to Flint in Consumers Energy's system known as the Hampton-Thetford line also trips.
  • 4:10:38 p.m. Cleveland separates from the Pennsylvania grid.
  • 4:10:39 p.m. 3.7 GW of power flows from the east along the north shore of Lake Erie, through Ontario to southern Michigan and northern Ohio, a flow more than ten times greater than the condition 30 seconds earlier, causing a voltage drop across the system.
  • 4:10:40 p.m. Flow flips to 2 GW eastward from Michigan through Ontario (a net reversal of 5.7 GW of power), then reverses back westward again within half a second.
  • 4:10:43 p.m. International connections between the United States and Canada begin failing.
  • 4:10:45 p.m. Northwestern Ontario separates from the east when the Wawa-Marathon 230 kV line north of Lake Superior disconnects. The first Ontario power plants go offline in response to the unstable voltage and current demand on the system.
  • 4:10:46 p.m. New York separates from the New England grid.
  • 4:10:50 p.m. Ontario separates from the western New York grid.
  • 4:11:57 p.m. The Keith-Waterman and Bunce Creek-Scott 230 kV lines and the St. Clair-Lambton #1 and #2 345 kV lines between Michigan and Ontario fail.
  • 4:12:03 p.m. Windsor, Ontario and surrounding areas drop off the grid.
  • 4:12:58 p.m. Northern New Jersey separates its power grids from New York and the Philadelphia area, causing a cascade of failing secondary generator plants along the Jersey coast and throughout the inland west.
  • 4:13 p.m. End of cascading failure. 256 power plants are off-line, 85% of which went offline after the grid separations occurred, most due to the action of automatic protective controls.

  5. But surely a Grid is inherently reliable? [Slide diagram of things that go wrong: BDII updates, network reconfiguration, power, resources, patches, upgrades, WMS faults]

  6. Major service incidents Q109 • In case you did not know, major incidents are now logged by WLCG. GridPP now also logs incidents in the GridPP twiki to document the problem and follow up. Compiled by J Shiers “Most disasters are the result of a collection of relatively small events happening at the same time in response to a common trigger”. IBM

  7. Reliability according to SAM

  8. So what can we do? • Duplicate services or machines • Increase the hardware's capacity (to handle faults) • Use (good) fault detection • Implement automatic restarts • Provide fast intervention • Fully investigate failures • Report bugs -> ask for better middleware “The best defense is a good offense”. Being proactive here can save time, money and effort. There are several areas we need to consider; here are some major ones. NB: industry tackles this via these technology domains: storage, network, facilities, server, database management system, application architecture, middleware and systems management.

  9. Redundancy – or duplicate. There is a thin line between resilience and seamless recovery. • Hardware: PSUs, disks • Metadata-stuff: back up databases regularly • Services: install multiple instances (e.g. CEs) • Remove single copies of important data • Ensure catalogues are replicated Two pilots. Two meals. No problems!
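Backing up the metadata databases regularly is the one item above that is easy to automate from day one. Below is a minimal sketch of a nightly dump job, assuming a MySQL-backed catalogue or SE headnode; the database names, paths and retention period are illustrative placeholders, not a prescribed GridPP setup.

```bash
#!/bin/bash
# Nightly metadata database dump - illustrative sketch only.
# Assumes MySQL credentials are held in root's ~/.my.cnf (mode 600),
# so nothing sensitive appears on the command line.

BACKUP_DIR=/var/backups/dbdumps     # local staging area (placeholder path)
KEEP_DAYS=14                        # how long to keep local copies
STAMP=$(date +%Y%m%d)

mkdir -p "$BACKUP_DIR"

# Database names are placeholders (e.g. a DPM typically has cns_db and dpm_db).
for db in cns_db dpm_db; do
    mysqldump --single-transaction "$db" | gzip \
        > "$BACKUP_DIR/${db}-${STAMP}.sql.gz" \
        || { echo "dump of $db failed" >&2; exit 1; }
done

# A backup that lives only on the failed host is no backup at all:
# copy the dumps to a second machine here (scp/rsync), then expire old ones.
find "$BACKUP_DIR" -name '*.sql.gz' -mtime +"$KEEP_DAYS" -delete
```

Run from cron (e.g. a 03:00 entry calling this script) and, just as importantly, test a restore occasionally.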

  10. A network example Type “service resilience” into Google and Edinburgh does pretty well. In physically doubling up there is a risk….

  11. Does simply duplicating make you resilient? An aside: Do we take “pairs” redundancy for granted? To look at this sign you would think that men are more resilient than women! But most of us know that this is not true.

  12. Does simply duplicating make you resilient? An aside: Do we take “pairs” redundancy for granted? To look at this sign you would think that men are more resilient than women! But most of us know that this is not true. Fairer!

  13. Does simply duplicating make you resilient? An aside: Do we take “pairs” redundancy for granted? To look at this sign you would think that men are more resilient than women! But most of us know that this is not true. PS. Anybody know what the Koreans had in mind with this one?

  14. … no. Bigger-picture solutions can have more impact! Some (obvious) things that are not often said but which need attention. Ensuring a good machine room environment has more advantages than simply putting good equipment into a bad situation. • Don't overload machine rooms • Ensure adequate power and cooling Having good security is a major component in any service resilience strategy. • Physical security means people won't just walk off with your dual-PSU super whatsit. • Keeping systems well patched and monitored is good sysadmin practice… and generally assumed. Since you are now thinking about the bigger picture we can move on to some of the strategies for making the systems/services more resilient.

  15. Increase the hardware capacity • “Throw better hardware at the problem” • Seen for storage headnodes • Used to address CE problems • Recommended for the WMS • Slowly happening for WNs • Doing this may help in the short-term but if there are middleware/OS problems then they need to be addressed. Look at the BDII changes after indexing. We need to be smart about following this route. The experiments/users need to look at how their code and models stress the systems. • We have yet to see full user loads on our sites. What spikes in usage can current hardware tolerate?

  16. Good fault detection Decent internal and external monitoring – preferably with alarms. If we cannot prevent all faults then at least we can spot them and react quickly. But beware the bodge: “I heard a remark the other day that seemed stupid on the surface, but when I really thought about it I realized it was completely idiotic and irresponsible. The remark was that it’s better to crash and let Watson report the error than it is to catch the exception and try to correct it”. Eric Brechner (Microsoft Engineering Excellence Center) • Handling exceptions rather than addressing the underlying problem turns an easily debugged problem into a real mess or possibly a security issue. • There are hardware fault detection modules that isolate problems. Cost? • We are looking increasingly at better monitoring. Nagios alerting on passive and active probes is now key to reducing operations effort.
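As an illustration of the kind of active probe Nagios can run, here is a minimal sketch of a plugin-style check, assuming a BDII-like service (an slapd process answering LDAP on port 2170); the process name, port and the use of the coreutils `timeout` utility are assumptions to adapt per service. Nagios interprets the exit code: 0 OK, 1 WARNING, 2 CRITICAL, 3 UNKNOWN.

```bash
#!/bin/bash
# Minimal Nagios-plugin-style active check (sketch).
# Exit codes follow the Nagios convention: 0=OK, 1=WARNING, 2=CRITICAL, 3=UNKNOWN.

SERVICE="site BDII"   # label used in the alert text (placeholder)
PROCESS="slapd"       # daemon to look for (assumption)
HOST="localhost"
PORT=2170             # BDII LDAP port

# Is the daemon running at all?
if ! pgrep -x "$PROCESS" > /dev/null; then
    echo "CRITICAL: $SERVICE - $PROCESS not running"
    exit 2
fi

# A hung daemon is as bad as a dead one: probe the port with a short timeout.
# (Assumes the coreutils 'timeout' utility is available.)
if ! timeout 10 bash -c "exec 3<>/dev/tcp/$HOST/$PORT" 2>/dev/null; then
    echo "WARNING: $SERVICE - process up but port $PORT not answering"
    exit 1
fi

echo "OK: $SERVICE answering on port $PORT"
exit 0
```

The point is that the probe, not a user ticket, is what raises the alarm.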

  17. Auto restarts & fast intervention “Auto-restart: If the wash process is interrupted due to power failure, this feature restarts the machine from the point when the cycle was interrupted. Saves you the trouble of restarting as well as reprogramming the machine. This feature is particularly useful in areas facing frequent power cuts”. So, bearing in mind the point about fault detection (and as long as there is decent logging of the problem and a mechanism is in place to stop restarts spiralling), there is a place for restarts in keeping our infrastructure running. Good use: cron jobs are currently used in a number of areas to restart daemons that have fallen over. Bad use: using restarts to get around code that has a large memory leak. • Strategies for a quick swap: swapping dodgy drives in an array; fast instancing of a new service (e.g. bringing up a new replacement CE) • Switching to a backup. The GOCDB, for example, uses an Oracle 11g cluster at RAL, with a replica web portal in Germany and a replica database in Italy.
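As a sketch of the "restart, but don't spiral" point: a cron-driven wrapper that restarts a fallen daemon, logs what it did, and gives up after a few attempts so a human gets involved. The daemon name, init script and paths are placeholders, not a recommended GridPP configuration.

```bash
#!/bin/bash
# Auto-restart wrapper (sketch): restart a dead daemon, log it, and cap the
# number of automatic restarts so a flapping service is escalated to a human.
# All names and paths below are illustrative.

DAEMON="mydaemon"                       # placeholder daemon name
INIT="/etc/init.d/$DAEMON"
LOG="/var/log/${DAEMON}-autorestart.log"
COUNTER="/var/run/${DAEMON}-restarts"   # reset this (e.g. daily) once the cause is understood
MAX_RESTARTS=3

# If the service reports healthy there is nothing to do.
"$INIT" status > /dev/null 2>&1 && exit 0

count=$(cat "$COUNTER" 2>/dev/null || echo 0)
if [ "$count" -ge "$MAX_RESTARTS" ]; then
    echo "$(date) $DAEMON down, restart cap reached - leaving it for a human" >> "$LOG"
    exit 1
fi

echo "$(date) $DAEMON down, restart attempt $((count + 1))/$MAX_RESTARTS" >> "$LOG"
"$INIT" restart >> "$LOG" 2>&1
echo $((count + 1)) > "$COUNTER"
```

Called every few minutes from cron; the cap and the log are what separate a useful restart from the "bad use" above.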

  18. Clear documentation • How can we expect the services to run properly if they are not used correctly? • Running a service properly means understanding the configuration options* which means… • Read The Manual and if something is not clear then report it to the authors so that the next reader does not run into the same problem. • For users it means knowing how to correctly use the system – remember those biomed jobs that locked up the SEs of some sites? * There are thoughts to better validate configs. For example YAIM confirming externally before allowing a new LFC instance.
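In the spirit of validating the configuration before applying it, here is a small sketch of a pre-flight check on a YAIM site-info.def; the default path and the required-variable list are illustrative assumptions, not YAIM's actual validation rules.

```bash
#!/bin/bash
# Pre-flight sanity check before running YAIM (sketch).
# The path and required-variable list below are assumptions for illustration.

SITE_INFO=${1:-/opt/glite/yaim/etc/site-info.def}
REQUIRED="SITE_NAME BDII_HOST CE_HOST MYSQL_PASSWORD"

[ -r "$SITE_INFO" ] || { echo "cannot read $SITE_INFO" >&2; exit 1; }

fail=0
for var in $REQUIRED; do
    if ! grep -Eq "^[[:space:]]*${var}=" "$SITE_INFO"; then
        echo "missing or commented out: $var" >&2
        fail=1
    fi
done

if [ "$fail" -ne 0 ]; then
    echo "fix site-info.def (and re-read the manual) before running yaim" >&2
    exit 1
fi
echo "basic checks passed"
```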

  19. Fully investigate, understand and report problems Sometimes the REAL problem will not be apparent unless one really delves into the logs. Improving resilience means catching bugs early (so perhaps we need to do more testing before we trust a production release … but note that several recent problems are intermittent and only apparent on some production instances and only then after some time), and recording them. Of course we then have to trust that bugs are followed up… “The lcg-CE has fundamental scalability problems due to the GRAM-2 protocol. Condor-G has mitigated these problems via the introduction of "grid_monitor" functionality and corresponding GRAM-2 enhancements. Andrey Kiryanov has further mitigated the problems by introducing new daemons running as root (globus-job-manager-marshal, globus-gass-cache-marshal and globus-gma), but the architecture remains fragile …” M. Litmaath

  20. Clear communications are essential • [Insert your preferred word here] HAPPENS! • If a site is going down then make sure that the users are informed early enough to avoid being stuck. This means scheduling downtimes well in advance. • If there is an unforeseen problem, then where possible keep users (and colleagues) updated. If a site is known to have problems it can be removed from VO production systems before it causes problems. • Some problems are going to impact other sites and in these cases other sites can benefit from your knowledge. Mails to the lists help but blogging about the problem allows you to show plots, log extracts and have something that is more searchable (though we need to improve on this last aspect). Sysadmins rule!

  21. Some specifics – constraints and dependencies We do not work in isolation!
  • FTS: awaiting automated failover. Currently 5 frontends with round robin. Oracle RAC for the database.
  • WMS: redundant frontends with two independent LBs. Other instances at T2s. Proxy renewal vs multiple VOMS servers is coming.
  • LFC: multiple frontends and use of Oracle Server for the DB. [Offsite backup? CNAF looking at Data Guard]
  • tBDII (top-level BDII): several across sites. Use a load-balanced rotating alias (a quick check of such an alias is sketched below).
  • MyProxy: dual machine with round robin [use of MyProxy lists. DNS load-balanced setup. Linux-HA setup]
  • CE: multiple CEs but a single scheduler.
  • WNs: job interference and resource exhaustion.
  • VOBoxes: easy installs but …
  • Databases: have RACs. Oracle Data Guard is a next step.
  • Networking: LANs moving to dual links to a central switch. T1 OPN options often reviewed.
  • Tier-2 SEs: [distribute the headnodes vs hot spares]
  • GOCDB: Oracle cluster. Replica at CNAF.
  • CA: [specialist components]
  • (Experiment) components outside of the UK…
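Several of the services above rely on a round-robin or rotating DNS alias. A quick hedged sketch of how to check that the alias really has more than one member, and that each member actually answers (hostnames and addresses are placeholders, not real GridPP endpoints):

```bash
# 1. Does the alias return more than one address?
host lcg-bdii.example.ac.uk
#   lcg-bdii.example.ac.uk has address 192.0.2.10
#   lcg-bdii.example.ac.uk has address 192.0.2.11

# 2. Probe each member individually: a dead node can hide behind a healthy alias.
for h in bdii01.example.ac.uk bdii02.example.ac.uk; do
    ldapsearch -x -LLL -h "$h" -p 2170 -b o=grid -s base 'objectclass=*' \
        > /dev/null && echo "$h OK" || echo "$h FAILED"
done
```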

  22. Some things to think about • Ways to integrate the CE and batch systems. Are the schedulers a single point of failure and how can we reduce this risk? • Do we need to look again at WN resource contention? Do some sites really need to kill jobs that very briefly exceed some resource limit (like memory)? • Can we make better use of virtualisation? For WNs it allows for more repeatable and predictable environments. • Are resilience options being properly enabled? For example, are sites using the alternative BDII settings in LCG_GFAL_INFOSYS: LCG_GFAL_INFOSYS=our-bdii:2170,neighbour-bdii:2170,..... (see the sketch below)? • How secure are our sites? Having good security is a core component in running a resilient service • Do we have sufficient variety in terms of suppliers and technologies? • How should we reduce the risks associated with a new roll-out? • Do we all understand the experiment priorities? • What resilience options are there around our people!?
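On the BDII fallback point, a hedged sketch of what to check on a worker node: that LCG_GFAL_INFOSYS really lists more than one endpoint, and that each listed endpoint answers (hostnames are placeholders for your own and a neighbouring site's top-level BDII):

```bash
# Comma-separated list of BDII endpoints (placeholder hostnames); the later
# entries are the fallbacks.
export LCG_GFAL_INFOSYS=our-bdii.example.ac.uk:2170,neighbour-bdii.example.ac.uk:2170

# Verify each configured endpoint, so the fallback is not a dead letter.
IFS=',' read -ra ENDPOINTS <<< "$LCG_GFAL_INFOSYS"
for ep in "${ENDPOINTS[@]}"; do
    ldapsearch -x -LLL -h "${ep%%:*}" -p "${ep##*:}" -b o=grid -s base 'objectclass=*' \
        > /dev/null && echo "$ep OK" || echo "$ep FAILED"
done
```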

  23. The sort of recommendations we are looking for… “If you find any problem with the load on the DPM head node or you are worried about a single point of failure you should: - avoid to run the head node as a disk server or gridFTP door - try to run the DB on a different node You may as well run the Name Server (DPNS) on a different host, actually the DPNS can be configured as a set of servers connected to a DNS load balancing system: this would allow to not only distribute the load but also have several machines offering the same service in a transparent way. This could be also easily done for the SRM servers if needed but this requires small modifications to the existing code base. Currently the DPM daemon itself cannot be easily distributed/replicated but this feature request is part of the workplan. As you can see above the DB itself could be the single point of failure. This can easily be solved with Oracle, but could probably be solved as well by using recent versions of MySQL or Postgres. However we haven't started to test/use the replication features of MySQL/Postgres”. Jean-Philippe Baud

  24. Summary • Building a resilient infrastructure is a gradual process (as SAM history shows). We’ll always see outages but less often. • There is a lot we can do: • Make hardware and middleware services redundant • Uncover and report bugs • Detect faults early and use restarts • Communicate about problems (inc. with documentation) • The middleware has some resilience features… but at this meeting we particularly want to discuss and share deployment best practices

  25. Service Resilience Backup talk • GridPP22 – UCL • 2nd April 2009
