Improving Air Traffic Safety: Challenges and Solutions

Experiences from real projects. HIØ presentation, 17.04.07. Ph.D. Bjørn Axel Gran, Division Head, ICT Risk and Dependability, bjorn.axel.gran@hrp.no, 69212395. IFE, Safety-MTO, Halden

Content and Objective • Work on improving and assessing the safety of current air-traffic-management (ATM) systems has identified challenges and solutions relevant to assessing the safety of such systems. The aim of the lecture is to present some of these experiences. • Short presentation of IFE • MTO – Man – Technology – Organisation • Total system and Operational Risk • Examples of activities/experiences by IFE • Challenges and Lessons Learned

Institute for Energy Technology • Established 1948 • 2nd largest research institute in Norway • Ca. 520 employees, ca. 260 in Halden • Budget ca. NOK 500 million • 5 sectors • Energy, Environmental Technology and Physics • Nuclear Technology • Petroleum Technology • Nuclear Safety and Reliability • Safety MTO [Map labels: Tromsø, Trondheim, Halden, Bergen]

The OECD Halden Reactor Project • International research program on safe and reliable operation of nuclear plants in cooperation with OECD-NEA • joint undertaking established 1958 • 20 participating countries • > 100 organisations • 3-year research programs

Safety - MTO • Located in Halden • ca. 80 employees • 5 departments • Operation Centers • Computerised Operation Support Systems • ICT Risk & Dependability • Industrial Psychology • Visual Interface Technology • Main Activities Safety MTO • Research through the ”Halden Reactor Project”. • Research programs, consultancy and development.

IFE-projects towards Norwegian ATM • Ph.D. cooperative project with the M-ADS system from KDA as case (1998-2002, NTNU 2002:35 B.A. Gran) • Analysis (FTA) of a solution, Park Air Systems (2005) • Risk and safety analysis (PSSA, FTA, SW AL, ..) for Park Air Systems related to the modernisation of ”ground-to-air communications system connecting pilots with air-traffic controllers” for United Kingdom's National Air Traffic Services (NATS) (2005 - ...) • Support in safety assessments, Artech, (2006 - ...) • Safety assessments in relation to operational use of SCAT-1, Avinor (2005 - ..)

Air traffic – current state • Air traffic has grown by more than 50% over the last decade. • According to numbers from Eurocontrol (the European Organization for the Safety of Air Navigation): • Europe now has close to 8.5 million flights per year • and up to 28,000 flights on the busiest days. • Eurocontrol expects that today’s traffic will have doubled by 2020, • and that one has to plan for a tripled capacity. • Further proposed objectives are • reducing the costs, • reducing environmental consequences, • and improving the safety.

Long and short term initiatives • The Single European Sky ATM Research Programme (SESAR) is the answer from the European Commission to these issues. • However, until the programme • is defined, • and the aviation players (e.g. civil and military, legislators, industry, operators, users, ground and airborne) have committed to it, • and started implementing it, • current systems, with ongoing improvements, have to be able to handle the increased load and safety expectations until the middle of the next decade. • The development of ATM systems has other characteristics: • they are complex and software intensive • they consist of parts which are commercial-off-the-shelf, COTS (e.g. operating systems, libraries) • they consist of legacy code, developed by the use of a variety of techniques, and with varying documentation.

Our approach to risk and dependability

MTO – Man Technology Organisation Safety – Man – Technology – Organisation • The concept assumes that one can improve the safety by addressing • the operator/man /human, • the technology, • and the organisation / procedures together, as one total system. • Their interaction and ability to work well on all areas critical for the safety of the total system have to be optimised. The areas include design, operation, maintenance, management, supervision and control. • Similar definitions in ESARR4 / SAM and by ATM-actors.

”Operational Risk” “Risk assessment and mitigation in ATM” according to ESARR 4: • concerns the use of a quantitative risk-based approach in ATM when introducing and/or planning changes to the ATM System. • covers the human, procedural and equipment (hardware, software) elements of the ATM System as well as its environment of operations. • covers the complete life-cycle of the ATM System, and, in particular, of its constituent parts. This means: • to have control over all factors which can lead to unwanted events, addressing the system in operational use within an operational environment.

Components – a few buzzwords • HRA • Human Reliability Assessment • ORG • Organisation, staffing, shift, changes and transition • V&V • Usability-testing, validation of design, test and evaluation of operator support systems • CREATE • Visualisation and testing of design proposals, iterative designs • SW • Assurance Level, Software Engineering, DO-178B, ED109, ESARR 6 • SAFETY • ESARR 4, FHA, PSSA, SSA • RISK • Hazop, FMEA, FTA, Reliability assessments, Risk-based analysis

Total system vs. Components Our experience is that the analysis of components depends on understanding the total system, e.g.: • Introducing new technology (new equipment) leads to a need to change procedures. • Acceptance of new/changed technology demands that the applied operational mitigations are acceptable. • Staffing and procedures depend on knowledge about technical failure modes, error messages and back-up systems. • Risk, dependability, and related changes in man, technology or organisation have to be communicated addressing the total system and different actors. • Dependability (such as safety) of components is measured in terms of unwanted events in the total system.

Software (SW) • SW Assurance Level • e.g. applying SW01 or IEC61508 • as part of a Safety Case • evaluation based upon independent barriers • Software Engineering • Dependable and Risk-Informed Requirements Engineering • Characteristics of dependent failures (including common cause failures) in computerized systems with complex software structure. • Experiences with use of ATM Standards • ESARR 6 • DO-178B • ED109

Safety • Analysis process according to ESARR 4 and standards for risk management. • Documentation: • FHA • PSSA • SSA • Safety Case [Figure: risk management process: Establish the context; Identify Risks; Analyse Risks (Frequency, Consequences, Level of risk); Evaluate Risks; Accept Risks? Yes / No; Treat Risks]


Risk • Methods (SAM): • Hazop • FMECA • FTA • ETA • Reliability • Risk-informed MTO approach [Figure: risk-informed MTO model with the elements Staff selection, Principles and procedures, Education and training, Inspectorate, Safety Req. Specification, First line actor(s), Goals, standards and resources of the organisation, Supervision, Hazards, Interfaces, Outcome, Risks, Process, Reporting, System design, Maintenance, Evaluation]

HRA (Human Reliability Analysis) • Human reliability analysis, humans as part of barriers • Halden project: Empirical data for • improving use of HRA-methods • improving HRA-methods • validating HRA-methods • Activities towards industry: • assess the human contribution to barriers (event trees, fault trees) • assess human performance shaping factors (PSF) • ex: “Statoil Operasjonell Tilstand Sikkerhet”. Barrier analysis, risk contributing factors

Organisation • Staffing • Shift work • Change and transition [Figure: change as a transition from old state to new state. Driving forces: technology, economy, capacity, etc. Resistance forces? Individual: skills, competence, etc.; motivation, predictability? Team: roles, responsibilities, etc.; status, conflicts? Organisation: structure, reporting, procedures; salary, job security?]

Verification & Validation • Usability-testing of human-machine-interfaces • Validation of control room design • Test and evaluation of operator support systems

CREATE • Visualisation of design proposals for control room/centre • Testing of design proposals vs. standards • Supporting an iterative design process

Challenges and Lessons Learned • Risk Assessment: when, what to do, resulting in • The risk of simplified assessments • Configuration management of your documents • 3 needs of traceability • Assessing the software

When to do a Risk Assessment? ESARR 4 requires that a risk assessment is undertaken if there are: • changes in the context of the system (e.g. definitions of sectors, capacity, staffing, etc.), • changes in rules and regulations, procedures or work processes, • development of new equipment/systems (ATM systems or infrastructure), • changes in existing systems.

What should the Risk Assessment include? According to ESARR 4, the risk assessment shall include: • a definition of scope, • safety objectives, • safety requirements and identification of risk mitigations, • an assessment of whether the system fulfils the safety objectives and safety requirements (evidence).

What will the Risk Assessment result in? Traditionally the risk assessment will end up in a functional hazard assessment (FHA), a preliminary system safety assessment (PSSA) or a system safety assessment (SSA), depending on the phase of the system development: • system planning, providing the safety objectives, with the assessment documented in an FHA. • system design, providing the safety requirements and risk mitigations, with the assessment documented in a PSSA. • system implementation, installation and operation, providing the evidence that the safety objectives and safety requirements are met, with the assessment documented in an SSA.

There are more stakeholders! • The system provider: • the management • the development team • the safety team • Have different focus: • selling the product, deadlines • functionality • mitigations • The customer: • the management • the systems users • the safety team • Have also different focus: • contract, delivery, deadlines • functionality • arguments, traceability • The consultants supporting: • the development • the provider safety assessment • the customer safety assessment • Also have different agendas: • solutions, technology • methods, concepts • standards, best practice

So, which RA is delivered? • There are several alternatives: • the one according to the provider's in-house practice • the one according to the contract (defined by the customer) • the one similar to the one the provider delivered last time • the one proposed by the customer (which may be taken from a different branch...) • Example: • redundant Transmitters • redundant Monitors • If the Monitors say T-1 has failed • Then use T-2 [Figure: Transmitter 1, Transmitter 2, Monitor 1, Monitor 2]

The “last time” solution • Deliver the FTA [Figure: fault tree with top event “System Fails” and events “Both Transmitters failed” and “Both Monitors failed”] • .. also correct when there is only one transmitter or one monitor! • But it does not include the scenario where transmitter 1 works, but the monitors (wrongly) decide to switch to transmitter 2, and transmitter 2 fails. • Complex systems give complex assessments: • applying a simplified safety assessment also requires identifying the conditions under which it is ok to simplify. • these conditions may require complex arguments ...
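The gap in the “last time” fault tree can be made concrete by enumerating component states. The sketch below is illustrative only. It assumes that the simplified tree combines its two events with an OR gate, that a failed monitor raises a false "T-1 has failed" alarm, and that the system switches to T-2 as soon as either monitor raises the alarm. None of these assumptions is stated on the slide, and all function names are hypothetical.

```python
from itertools import product


def simplified_tree(t1_ok, t2_ok, m1_ok, m2_ok):
    """Top event of the simplified tree (OR gate assumed):
    both transmitters failed, or both monitors failed."""
    return (not t1_ok and not t2_ok) or (not m1_ok and not m2_ok)


def system_fails(t1_ok, t2_ok, m1_ok, m2_ok):
    """Behavioural model under the stated assumptions."""
    # A healthy monitor reports the truth about T-1;
    # a failed monitor is assumed to raise a false alarm.
    m1_alarm = (not t1_ok) or (not m1_ok)
    m2_alarm = (not t1_ok) or (not m2_ok)
    # Assumed switching rule: use T-2 if any monitor raises the alarm.
    use_t2 = m1_alarm or m2_alarm
    active_ok = t2_ok if use_t2 else t1_ok
    return not active_ok


def missed_scenarios():
    """States (t1_ok, t2_ok, m1_ok, m2_ok) in which the system fails
    although the simplified tree does not predict a failure."""
    return [
        state
        for state in product([True, False], repeat=4)
        if system_fails(*state) and not simplified_tree(*state)
    ]


if __name__ == "__main__":
    for state in missed_scenarios():
        print(dict(zip(("t1_ok", "t2_ok", "m1_ok", "m2_ok"), state)))
```

Under these assumptions the missed states are exactly those where transmitter 1 works, transmitter 2 has failed, and a single monitor wrongly triggers the switch, which is the scenario named on the slide.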

What defines the RA? • safety objectives / FHA • customer safety requirements • integrity requirements • safety and assurance plan • roles and responsibilities • standards • organisation • safety lifecycle • safety documentation • safety work assumptions

Which documents to produce? • Programme Safety Management and Assurance Plan • Functional Hazard Analysis Report • Safety Assumptions • Safety Plan • Preliminary System Safety Assessment Report • Hazard Log • Supplier Safety Case 2 • Safety Argument (CAE Trees) • Supplier Safety Case 3 • System Safety Assessment Report

Including the documents • PSSA (Preliminary System Safety Assessment) • includes FTAs • in accordance with the system design/architecture • based upon FMEAs for sub-parts (such as COTS) • including probabilities on events (can we get them?) • in accordance with the design/implementation • in accordance with the requirements • in accordance with common cause failures assessment • identifying mitigations, i.e. new/changed requirements • supported by sensitivity assessments

For small changes • this is relatively straightforward, • and one can complete one phase or risk assessment / document before the next is started. • In reality there are more hidden challenges.

Example: a small upgrade of an existing system • An ATM system has been in use for a long time, and an upgrade of some functionality is wanted. • According to ESARR 4 an assessment of the changes is required: • “Are the safety objectives and safety requirements influenced by the change?” • collect the needed evidence. • This requires that both safety objectives and safety requirements for the original system are available. • One then meets a challenge like getting the system users (such as NATS in the UK or Avinor in Norway) to provide the safety objectives, • or the developer has to assume a set of safety objectives for the system. • In both cases, a new risk assessment process with impact on the assessment of the risk assessment is started. • In addition one runs the project risk that the newly defined safety objectives are not met by the original system one is upgrading.

Example: a major modernization • The modernized ATM system includes • existing equipment, • replacement of existing components, • development of new components, • the possibility of using COTS (wanted both by the developers and the users, since it has advantages with respect to price, availability and technology, but it also comes with some disadvantages). • The customer will most likely provide the safety objectives and an FHA. • Then one starts the development of new components, and the identification of safety requirements and mitigations. • One has one risk assessment process, and everything runs smoothly.

The surprises evolve • One starts to include old components and external components. • They were developed for another set of safety objectives. • The evidence for the fulfilment of their safety objectives and safety requirements is not satisfactory. • The developer meets challenges like: • replacing the external component with other components, for which the safety requirements are met, • compensating for the weaknesses in the external components in the development of the new components, • getting the customer to re-evaluate the safety objectives. • This illustrates that the definition of FHA, PSSA and SSA is not that clear. • These challenges can be met in a satisfactory and effective way when the customers and developers work together through all the phases.

Needs of traceability The need for traceability is a topic of increased focus and importance: • ensure effective communication in relation to requirements elicitation and analysis, • understandability of requirements to all parties, • traceability of requirements through the different design phases. • the need for traceability between the risk assessment results and the system development, “best practice within reliability engineering” • have traceability between the identified mitigations for each failure mode/hazard and the system module/item where the mitigation should be implemented. • These mitigations should also result in the definition of new safety requirements. • When there is a good trace between the safety requirements, the high level software requirements and the low level software requirements, this allows for traceability from the failure modes and mitigations back to the safety objectives. • It also makes it possible to perform tests on the different levels, providing evidence that the mitigations are correctly implemented • pinpointed in guidance to FMEA and FMECA (BS5760, 1991). • one of the applied principles within the CORAS tool (CORAS, 2003).

A third need of traceability • The need for traceability between the risk assessment results themselves. • The failure modes and mitigations might for example be inserted in a fault tree, addressing the possible causes for an unwanted event – the event that a safety objective is not fulfilled. • That means, traversing the fault trees upward from a failure mode or a mitigation, we get another trace to the safety objective. • If the failure mode belongs to more fault trees, we get traces to more safety objectives. • Ideally, these results should now match the list of safety objectives we had from tracing the failure mode and mitigation backwards, through the system modules/items. • In practice there will be mismatches.
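The two traces described above can be compared mechanically. The following is a minimal sketch, assuming the traceability data is available as simple mappings. The identifiers (FM-1, SR-1, SO-1, audio_router and so on) and the function names are hypothetical and only serve to illustrate the comparison.

```python
def objectives_via_modules(failure_mode, mode_to_modules,
                           module_to_requirements, requirement_to_objectives):
    """Trace backwards: failure mode -> module/item -> safety requirement
    -> safety objective."""
    objectives = set()
    for module in mode_to_modules.get(failure_mode, ()):
        for requirement in module_to_requirements.get(module, ()):
            objectives.update(requirement_to_objectives.get(requirement, ()))
    return objectives


def objectives_via_fault_trees(failure_mode, fault_trees):
    """Trace upwards: every fault tree containing the failure mode leads to
    the safety objective whose violation is that tree's top event."""
    return {objective for objective, events in fault_trees.items()
            if failure_mode in events}


def trace_mismatch(failure_mode, mode_to_modules, module_to_requirements,
                   requirement_to_objectives, fault_trees):
    """Report safety objectives reached by only one of the two traces."""
    via_modules = objectives_via_modules(
        failure_mode, mode_to_modules, module_to_requirements,
        requirement_to_objectives)
    via_trees = objectives_via_fault_trees(failure_mode, fault_trees)
    return {
        "only_via_modules": via_modules - via_trees,
        "only_via_fault_trees": via_trees - via_modules,
    }


# Hypothetical example data.
MODE_TO_MODULES = {"FM-1": ["audio_router"]}
MODULE_TO_REQUIREMENTS = {"audio_router": ["SR-1", "SR-2"]}
REQUIREMENT_TO_OBJECTIVES = {"SR-1": ["SO-1"], "SR-2": ["SO-2"]}
FAULT_TREES = {"SO-1": {"FM-1", "HW-3"}, "SO-3": {"FM-1"}}

if __name__ == "__main__":
    print(trace_mismatch("FM-1", MODE_TO_MODULES, MODULE_TO_REQUIREMENTS,
                         REQUIREMENT_TO_OBJECTIVES, FAULT_TREES))
```

In this made-up data set the two traces agree on SO-1 only, so SO-2 and SO-3 would each need to be questioned at their source.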

Possible causes of mismatches • is there a clear relation between safety objectives and safety requirements? • is there a clear relation between a safety requirement and a system module/item? • is the failure mode/mitigation related to only one system model/item? • in which sub-branches of the fault tree are the failure mode and mitigation assigned? Each of these four questions points to a source of uncertainty: • One safety objective can be covered by several safety requirements. • Some safety requirements might address other safety objectives as well. • One safety requirement can be traced to several system models/items. • The complexity of the model – requirement relations hides that the same failure mode might be present with respect to several models / safety requirements. • A failure mode can be inserted in a number of branches in the fault tree, independent of the safety objective, but on the basis of how the local (or system) effect is described in the FMEA. • The construction of a fault tree is an iterative process which depends to a large extent on the ingenuity of the expert who makes the fault tree. • Problems also occur in the analysis of systems in which the same equipment is used at different times and in different configurations for different tasks.

Our experience: • Instead of only updating one assessment result on the basis of another result, there has to be a mechanism for reporting back, questioning and eventually updating the source of the change request. • The challenge with this is that it requires a system for assessing and reporting not only the change in a risk assessment result, such as the FMEA, but also the impact of the change. [Figure: relations between Safety Objective, Safety requirements, Fault Tree, System model, Local Effect, and Failure modes and Mitigations]

Assessing the software code • The system safety requirements derived in the PSSA phase apply to software, hardware, environment and procedures. • The software requirements shall include both software assurance levels and functional safety requirements. • The equipment development and engineering process shall then assure that the requirements are met once the safety process has defined them. • The safety process shall validate the process carried out by the engineering and development teams to make sure that the safety requirements are met with sufficient confidence. • In most cases it is not possible to provide an absolute guarantee that a system will never fail. • The approach adopted in practice consists of gathering an adequate amount of evidence to support the claim that the risk associated with using the system is acceptable. • More elaborate evidence is required for the most severe unwanted events.

The extent of evidence required is described through assurance levels (AL). • The argument of safety can be based on three types of evidence: • testing, • field experience, • and analysis. • Assurance levels are derived per safety requirement, where the requirement giving the most stringent assurance level will determine the assurance level for that module. • If no partitioning is claimed within the same processor, all software running on the same CPU will acquire the most stringent assurance level derived for any module on that CPU. • To establish the required amount of evidence it is first necessary to determine the severity of possible unwanted events.
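The propagation rule for assurance levels can be sketched as follows. This is an illustration under stated assumptions, not a prescribed procedure: it assumes the ED-109 numbering referred to later in the talk, where AL 1 is the most stringent level, so "most stringent" is the smallest number. Module names and function names are hypothetical.

```python
def module_assurance_level(requirement_levels):
    """Most stringent AL among the safety requirements allocated to a module.
    Assumption: a lower number means a more stringent level (AL 1 highest)."""
    return min(requirement_levels)


def cpu_assurance_levels(modules_on_cpu, partitioned=False):
    """Return the AL each module must be developed to.

    modules_on_cpu maps a module name to the ALs derived from its
    safety requirements. Without claimed partitioning, every module on
    the CPU inherits the most stringent AL found on that CPU.
    """
    per_module = {name: module_assurance_level(levels)
                  for name, levels in modules_on_cpu.items()}
    if partitioned:
        return per_module
    most_stringent = min(per_module.values())
    return {name: most_stringent for name in per_module}


# Hypothetical example: two modules sharing one CPU.
EXAMPLE_CPU = {"voice_switch": [3, 2], "logging": [5]}

if __name__ == "__main__":
    print(cpu_assurance_levels(EXAMPLE_CPU))
    print(cpu_assurance_levels(EXAMPLE_CPU, partitioned=True))
```

The example shows why partitioning matters: without it, the hypothetical logging module has to meet AL 2 although its own requirement only calls for AL 5.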

Assessment The primary goal of a high-level analysis is to identify critical failure modes for the various software modules, as well as mitigating factors: • Define and describe relevant scenarios. • system high-level design descriptions, detailed knowledge of the software, low level design, functional models • For each functional model, a scenario is described using message sequence charts that identify the software modules involved in the scenario. • For each scenario, failure modes are identified. • For each failure mode both local effects and system effects shall be identified, together with any existing mitigating factors or means for detection. • The failure modes, together with the identified mitigating factors, shall then be organized in fault trees, • providing an overview of which combinations of failure occurrences might cause the critical system level events. • A derivation of software assurance levels can be based on the safety requirements and an overview of the available mitigations for the most demanding requirements.

Finding the failure modes • A functional FMEA can be regarded as an FMEA of the software. • The main functions of the system were identified by the logical grouping of the system level requirements, • presented as a series of use cases, • e.g. for the purpose of communicating the results. • Each main function was described with the aid of a sequence diagram. • Each sequence diagram showed the flow of data and control within the system as a sequence of steps. • Each step represented the transmission and reception of data and control from one process to another. • Further details were added. • For each sequence diagram a number of possible failure modes were identified. This: • takes time. • requires detailed knowledge of the processes. • can be supported by applying guidewords as for a Hazop process.

How to determine the assurance levels? IEC 61508 Part 5 Annex E • a qualitative method for determining integrity levels based on accident severity. • a typical ATM system worst-case relevant accident would be multiple deaths due to loss of aircraft in flight or due to collision whilst taxiing on ground. • The accident severity in this case would be catastrophic, • which equates to the highest integrity level, i.e. AL 1 if using ED-109 (ED-109). • If the system only contributes indirectly to the aircraft accident, and one applies the guidance in ED109, the worst-case assurance level should be AL 2. • Depending upon the number of additional protection measures that are included within the overall system architecture, IEC 61508 allows the AL to be reduced further for lower level safety related functions. • This methodology is based on giving each mitigation a quantitative value of 0.1. • In some cases, this methodology is very conservative. For some operational mitigations, such an assurance level reduction is more reasonable.

Derivation of ALs based on the cut sets Challenges: • a fault tree typically consists of both hardware and software failure modes, and also represents the configuration of components in the system. • several cut sets may include the same software failure mode, but different hardware failure modes within the same replaceable unit. • one cut set may address several safety requirements. The number of additional events in the cut sets was counted for each cut set. • This represents the number of events, e.g. failure of mitigations, that have to happen concurrently with the software failure in order to lead to the occurrence of the top event. • We could then compare with a “target number of events in a cut set”, • the number of mitigations or additional events that have to be present for arriving at e.g. AL 4. • A requirement with an associated failure rate of 1e-7/h required 2 mitigations, • 1e-6/h required one mitigation, • 1e-5/h required none. • This provided a means to assess the relevance of applied mitigations, • particularly operational ones.
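The counting step above can be sketched as follows. The target table is taken from the slide (1e-7/h: 2, 1e-6/h: 1, 1e-5/h: 0) and is keyed by the base-10 exponent of the rate to avoid floating-point keys. The event names, the function names and the example cut sets are hypothetical.

```python
# Target number of additional events in a cut set, keyed by the base-10
# exponent of the required failure rate per hour (values from the slide).
TARGET_ADDITIONAL_EVENTS = {-7: 2, -6: 1, -5: 0}


def additional_events(cut_set, software_failure):
    """Number of events that must occur together with the software failure."""
    return len(set(cut_set) - {software_failure})


def check_cut_sets(cut_sets, software_failure, rate_exponent):
    """For each cut set containing the software failure mode, return
    (sorted events, number of additional events, target met?)."""
    target = TARGET_ADDITIONAL_EVENTS[rate_exponent]
    results = []
    for cut_set in cut_sets:
        if software_failure not in cut_set:
            continue  # cut set does not involve this software failure mode
        count = additional_events(cut_set, software_failure)
        results.append((sorted(cut_set), count, count >= target))
    return results


# Hypothetical minimal cut sets for one top event.
EXAMPLE_CUT_SETS = [
    {"SW-F1", "HW-A", "OP-1"},   # software failure plus two further events
    {"SW-F1", "HW-B"},           # software failure plus one further event
    {"HW-C", "HW-D"},            # pure hardware cut set, ignored here
]

if __name__ == "__main__":
    for events, count, ok in check_cut_sets(EXAMPLE_CUT_SETS, "SW-F1", -7):
        print(events, count, "meets target" if ok else "below target")
```

The table values appear consistent with crediting each additional event with a factor of 0.1, as mentioned for the IEC 61508 method, but that reading is an interpretation and is not stated for this table in the source.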

Analysis of source code • Assume one or a few failure modes are driving the assurance level above the level one is able to demonstrate. • One option is to prove analytically that the actual failure mode(s) are not present in the source code at all. • One way of practically doing this is through software fault tree analysis, SFTA (Leveson 1991). • The principle of the SFTA is to start out by assuming that the unwanted event, in terms of a hazardous output, has occurred and then trace backward to either: • Find paths through the code from specific inputs to these outputs, or • Demonstrate that no such paths exist, • Or determine the exact conditions that must be true in order to cause an event. • In the SFTA most of the “events” will be zero-probability events, • either because they represent a contradiction, • or because they cannot contribute to the top event.

Experiences from the analysis • The goal of the analysis was to show that transmitters are activated if and only if the PTT is pressed, i.e. that transmission cannot occur unless the operator intends it to do so. • From the SFTAs we were able to prove that Tx is active if and only if the PTT is pressed. Thus, the goal of the analysis was reached. • The analysis was fairly quick to perform, the documentation was not too long, and the work could be done in parallel with other analyses, e.g. the derivation of assurance levels. • On the other hand, performing such analysis for larger pieces of software, or for more unwanted events for the same code, would be too costly with respect to the time needed. • It can also be questioned whether the results of such analysis would be easily communicated.

Conclusion • ATM is an emerging branch where safety is in focus • Safety assessment by well-known methods and practices, but often with: • complex systems • complex solutions • complex assessments • providing challenges and new needs
