Experiences from real projects HIØ presentation 17.04.07

Ph.D. Bjørn Axel Gran

Division Head ICT Risk and Dependability

[email protected]

69212395

IFE, Safety-MTO, Halden

Content and Objective
  • Work on improving and assessing the safety of current air-traffic-management (ATM) systems has identified some challenges and solutions relevant for assessing the safety of such systems. The aim of the lecture is to present some of these experiences.
  • Short presentation of IFE
  • MTO – Man – Technology - Organisation
  • Total system and Operational Risk
    • Examples of activities/experiences at IFE
  • Challenges and Lessons Learned
Institute for Energy Technology
[Figure labels: Tromsø, Trondheim, Halden, Bergen]
  • Established 1948
  • 2nd largest research institute in Norway
  • Ca. 520 employees, ca. 260 in Halden
  • Budget: ca. NOK 500 million
  • 5 sectors
    • Energy, Environmental Technology and Physics
    • Nuclear Technology
    • Petroleum Technology
    • Nuclear Safety and Reliability
    • Safety MTO
The OECD Halden Reactor Project
  • International research program on safe and reliable operation of nuclear plants in cooperation with OECD-NEA
  • joint undertaking established 1958
  • 20 participating countries
  • > 100 organisations
  • 3-year research programs
Safety - MTO
  • Located in Halden
  • ca. 80 employees
  • 5 departments
    • Operation Centers
    • Computerised Operation Support Systems
    • ICT Risk & Dependability
    • Industrial Psychology
    • Visual Interface Technology
  • Main Activities Safety MTO
    • Research through the ”Halden Reactor Project”.
    • Research programs, consultancy and development.
IFE-projects towards Norwegian ATM
  • Ph.D. cooperative project with the M-ADS system from KDA as case (1998-2002, NTNU 2002:35 B.A. Gran)
  • Analysis (FTA) of a solution, Park Air Systems (2005)
  • Risk and safety analysis (PSSA, FTA, SW AL, ..) for Park Air Systems related to the modernisation of the “ground-to-air communications system connecting pilots with air-traffic controllers” for the United Kingdom's National Air Traffic Services (NATS) (2005 - ...)
  • Support in safety assessments, Artech, (2006 - ...)
  • Safety assessments in relation to operational use of SCAT-1, Avinor (2005 - ..)
Air traffic – current state
  • Air traffic has grown by more than 50% over the last decade.
  • According to numbers from Eurocontrol (the European Organization for the Safety of Air Navigation):
    • Europe now has close to 8.5 million flights per year,
    • and up to 28,000 flights on the busiest days.
    • Eurocontrol expects that today’s traffic will have doubled by 2020,
    • and that one has to plan for a tripled capacity.
  • Further proposed objectives include
    • reducing the costs,
    • reducing environmental consequences,
    • and improving the safety.
Long and short term initiatives
  • The Single European Sky ATM Research Programme (SESAR) is the European Commission's answer to these issues.
  • However, until the program
    • is defined,
    • and the aviation players (e.g. civil and military, legislators, industry, operators, users, ground and airborne) have committed to it,
    • and started implementing it,
  • current systems, with ongoing improvements, have to be able to handle the increased load and safety expectations until the middle of the next decade.
  • The development of ATM systems has other characteristics:
    • they are complex and software intensive
    • they consist of parts which are commercial-off-the-shelf, COTS (e.g. operating systems, libraries)
    • they consist of legacy code, developed using a variety of techniques, and with varying documentation.
MTO – Man Technology Organisation

Safety – Man – Technology – Organisation

  • The concept assumes that one can improve safety by addressing the following together, as one total system:
    • the operator / man / human,
    • the technology,
    • and the organisation / procedures.
  • Their interaction, and their ability to work well in all areas critical for the safety of the total system, have to be optimised. The areas include design, operation, maintenance, management, supervision and control.
  • Similar definitions are found in ESARR 4 / SAM and are used by ATM actors.
“Operational Risk”

“Risk assessment and mitigation in ATM” according to ESARR 4:

  • concerns the use of a quantitative risk-based approach in ATM when introducing and/or planning changes to the ATM System.
  • covers the human, procedural and equipment (hardware, software) elements of the ATM System as well as its environment of operations.
  • covers the complete life-cycle of the ATM System, and, in particular, of its constituent parts.

This means:

  • having control of all factors which can lead to unwanted events, addressing the system in operational use within an operational environment.
Components – a few buzzwords
  • HRA
    • Human Reliability Assessment
  • ORG
    • Organisation, staffing, shift, changes and transition
  • V&V
    • Usability-testing, validation of design, test and evaluation of operator support systems
  • CREATE
    • Visualisation and testing of design proposals, iterative designs
  • SW
    • Assurance Level, Software Engineering, DO-178B, ED109, ESARR 6
  • SAFETY
    • ESARR 4, FHA, PSSA, SSA
  • RISK
    • Hazop, FMEA, FTA, Reliability assessments, Risk-based analysis
Total system vs. Components

Our experience is that the analysis of components depends upon understanding the total system, e.g.:

  • Introducing new technology (new equipment) creates a need to change procedures.
  • Acceptance of new/changed technology demands that the applied operational mitigations are acceptable.
  • Staffing and procedures depend on knowledge about technical failure modes, error messages and back-up systems.
  • Risk, dependability, and related changes in man, technology or organisation have to be communicated, addressing the total system and the different actors.
  • Dependability (such as safety) of components is measured in terms of unwanted events in the total system.
Software (SW)
  • SW Assurance Level
    • e.g. applying SW01 or IEC61508
    • as part of a Safety Case
    • evaluation based upon independent barriers
  • Software Engineering
    • Dependable and Risk-Informed Requirements Engineering
    • Characteristics of dependent failures (including common cause failures) in computerized systems with complex software structure.
  • Experiences with use of ATM Standards
    • ESARR 6
    • DO-178B
    • ED109
Safety
[Flowchart: Establish the context -> Identify risks -> Analyse risks (frequency, consequences, level of risk) -> Evaluate risks -> Accept risks? (Yes / No) -> Treat risks]
  • Analysis process according to ESARR 4 and standards for risk management.
  • Documentation:
    • FHA
    • PSSA
    • SSA
    • Safety Case
Risk
[Figure: recoverable labels include Inspectorate; Safety Req. Specification; First line actor(s); Goals, standards and resources of the organisation; Supervision; Hazards; Interfaces; Outcome; Risks; Process; Reporting; System design; Maintenance; Evaluation; and staff-related boxes (staff, education, selection and training, principles and procedures)]
  • Methods (SAM):
    • Hazop
    • FMECA
    • FTA
    • ETA
    • Reliability
  • Risk-informed MTO approach
HRA (Human Reliability Analysis)
  • Human reliability analysis, humans as part of barriers
  • Halden project: Empirical data for
    • improving use of HRA-methods
    • improving HRA-methods
    • validating HRA-methods
  • Activities towards industry:
    • assess the human contribution to barriers (event trees, fault trees)
    • assess human performance shaping factors (PSF)
    • ex: “Statoil Operasjonell Tilstand Sikkerhet”. Barrier analysis, risk contributing factors
Organisation
[Figure: change as a transition from old state to new state. Driving forces: technology, economy, capacity, etc. Resistance forces? Ind.: skills, competence, etc.; motivation, predictability? Team: roles, responsibilities, etc.; status, conflicts? Org.: structure, reporting, procedures; salary, job security?]
  • Staffing
  • Shift work
  • Change and transition
Verification & Validation
  • Usability-testing of human-machine-interfaces
  • Validation of control room design
  • Test and evaluation of operator support systems
CREATE
  • Visualisation of design proposals for control room/centre
  • Testing of design proposals vs. standards
  • Supporting an iterative design process
Challenges and Lessons Learned
  • Risk assessment: when to do it, what to include, what it results in
  • The risk of simplified assessments
  • Configuration management of your documents
  • Three needs for traceability
  • Assessing the software
When to do a Risk Assessment?

ESARR 4 requires that a risk assessment is undertaken if there are:

  • changes in the context of the system (e.g. definitions of sectors, capacity, staffing, etc.),
  • changes in rules and regulations, procedures or work processes,
  • development of new equipment/systems (ATM systems or infrastructure),
  • changes in existing systems.
What should the Risk Assessment include?

According to ESARR 4, the risk assessment shall include:

  • a definition of scope,
  • safety objectives,
  • safety requirements and identification of risk mitigations,
  • an assessment of whether the system fulfils the safety objectives and safety requirements (evidence).
What will the Risk Assessment result in?

Traditionally the risk assessment ends up in a functional hazard assessment (FHA), a preliminary system safety assessment (PSSA) or a system safety assessment (SSA), depending on the phase of the system development:

  • system planning, which provides the safety objectives, with the assessment documented in an FHA.
  • system design, which provides the safety requirements and risk mitigations, with the assessment documented in a PSSA.
  • system implementation, installation and operation, which provides the evidence that the safety objectives and safety requirements are met, with the assessment documented in an SSA.
There are more stakeholders!
  • The system provider
    • the management
    • the development team
    • the safety team
    • These have different focus: selling the product, deadlines; functionality; mitigations
  • The customer
    • the management
    • the system users
    • the safety team
    • These also have different focus: contract, delivery, deadlines; functionality; arguments, traceability
  • The consultants supporting
    • the development
    • the provider safety assessment
    • the customer safety assessment
    • These also have different agendas: solutions, technology; methods, concepts; standards, best practice
So, which RA is delivered?
[Figure labels: Transmitter 1, Transmitter 2, Monitor 1, Monitor 2]
  • There are several alternatives:
    • the one according to the provider's in-house practice
    • the one according to the contract (defined by the customer)
    • the one similar to the one the provider delivered last time
    • the one proposed by the customer (which may be taken from a different branch...)
  • Example:
    • redundant Transmitters
    • redundant Monitors
    • If Monitors say T-1 has failed
    • Then use T-2
The “last time” solution
  • Deliver the FTA:
    • [Fault tree: top event “System fails”; inputs “Both transmitters failed” and “Both monitors failed”]
    • .. also correct when there is only one transmitter or one monitor!
  • But it does not include the scenario where
    • transmitter 1 works, but
    • the monitors decide (wrongly) to switch to transmitter 2,
    • and transmitter 2 fails.
  • Complex systems give complex assessments:
    • applying a simplified safety assessment also requires identifying the conditions under which it is acceptable to simplify,
    • and these conditions may require complex arguments ...
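The gap in the “last time” fault tree can be made concrete by enumerating component states. The sketch below is illustrative only: the three monitor verdicts, the switching rule, the reading of the fault tree as an OR of its two inputs, and all function names are assumptions, not taken from the assessed system.

```python
from itertools import product

# Hypothetical monitor behaviours: report the true state of T-1,
# stay silent (both monitors failed), or wrongly report T-1 as failed.
MONITOR_VERDICTS = ("correct", "silent", "false_alarm")


def selected_transmitter(t1_ok, verdict):
    """Switching rule from the example: if the monitors say T-1 has failed, use T-2."""
    if verdict == "correct":
        says_t1_failed = not t1_ok
    elif verdict == "silent":
        says_t1_failed = False
    else:  # false_alarm
        says_t1_failed = True
    return 2 if says_t1_failed else 1


def system_fails(t1_ok, t2_ok, verdict):
    """Behavioural model: the system fails when the transmitter in use has failed."""
    in_use = selected_transmitter(t1_ok, verdict)
    return not (t1_ok if in_use == 1 else t2_ok)


def simplified_fta(t1_ok, t2_ok, verdict):
    """Simplified tree (assumed OR gate): both transmitters failed OR both monitors failed."""
    both_tx_failed = (not t1_ok) and (not t2_ok)
    both_monitors_failed = verdict == "silent"
    return both_tx_failed or both_monitors_failed


def missed_scenarios():
    """States where the system fails although the simplified tree predicts no failure."""
    states = product((True, False), (True, False), MONITOR_VERDICTS)
    return [
        (t1_ok, t2_ok, verdict)
        for t1_ok, t2_ok, verdict in states
        if system_fails(t1_ok, t2_ok, verdict) and not simplified_fta(t1_ok, t2_ok, verdict)
    ]


print(missed_scenarios())
```

Under these assumptions the only state the simplified tree misses is the one described above: transmitter 1 works, the monitors raise a false alarm, and transmitter 2 has failed.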

What defines the RA?
  • safety objectives / FHA
    • customer safety requirements
    • integrity requirements
  • safety and assurance plan
    • roles and responsibilities
    • standards
    • organisation
    • safety lifecycle
    • safety documentation
    • safety work assumptions
Which documents to produce?
  • Programme Safety Management and Assurance Plan
  • Functional Hazard Analysis Report
  • Safety Assumptions
  • Safety Plan
  • Preliminary System Safety Assessment Report
  • Hazard Log
  • Supplier Safety Case 2
  • Safety Argument (CAE Trees)
  • Supplier Safety Case 3
  • System Safety Assessment Report
Including the documents
  • PSSA (Preliminary System Safety Assessment)
    • includes FTAs
      • in accordance with the system design/architecture
      • based upon FMEAs for sub-parts (such as COTS)
        • including probabilities of events (can we get them?)
        • in accordance with the design/implementation
          • in accordance with the requirements
      • in accordance with common cause failures assessment
      • identifying mitigations, i.e. new/changed requirements
      • supported by sensitivity assessments
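How event probabilities, a common cause failure assessment and a sensitivity assessment fit together in a fault tree can be sketched numerically. Everything here is an assumption made for illustration: the tree (redundant transmitters and redundant monitors under an OR gate), the beta-factor model for common cause failures, the independence of basic events, and all probabilities.

```python
from math import prod


def p_or(probs):
    """OR gate for independent events."""
    return 1.0 - prod(1.0 - p for p in probs)


def p_and(probs):
    """AND gate for independent events."""
    return prod(probs)


def redundant_pair(p_unit, beta):
    """Failure of a redundant pair with a beta-factor common cause contribution.

    A fraction beta of the unit failure probability is treated as common cause
    (fails both units at once); the remainder is treated as independent.
    """
    independent = p_and([(1.0 - beta) * p_unit] * 2)
    common_cause = beta * p_unit
    return p_or([independent, common_cause])


def top_event(p_tx, p_mon, beta):
    """Top event: both transmitters failed OR both monitors failed."""
    return p_or([redundant_pair(p_tx, beta), redundant_pair(p_mon, beta)])


def sensitivity(params, factor=10.0):
    """Ratio of the top-event probability when one parameter is multiplied by factor."""
    base = top_event(**params)
    ratios = {}
    for name in params:
        varied = dict(params)
        varied[name] = min(1.0, params[name] * factor)
        ratios[name] = top_event(**varied) / base
    return ratios


# Hypothetical input values, not measured data.
PARAMS = {"p_tx": 1e-3, "p_mon": 1e-3, "beta": 0.1}
print(top_event(**PARAMS))
print(sensitivity(PARAMS))
```

With these made-up numbers the common cause term dominates the redundant pairs, which is why the question of whether credible probabilities can be obtained matters so much for the result.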
For small changes
  • this is relatively straightforward,
  • and one can complete one phase or risk assessment / document before the next is started.
  • In reality, more challenges are hidden.
Example: a small upgrade of an existing system represents
  • An ATM system has been in use for a long time, and an upgrade of some functionality is wanted.
  • According to ESARR 4 an assessment of the changes is required:
    • “Are the safety objectives and safety requirements influenced by the change?”
    • collect the needed evidence.
  • This requires that both safety objectives and safety requirements for the original system exist.
    • One meets a challenge like getting the system users (such as NATS in the UK or Avinor in Norway) to provide the safety objectives,
    • or the developer has to assume a set of safety objectives for the system.
  • In both cases, a new risk assessment process is started, with impact on the assessment of the change.
  • In addition one runs the project risk that the newly defined safety objectives are not met by the original system one is upgrading.
Example: a major modernization
  • The modernized ATM system includes
    • existing equipment,
    • replacement of existing components,
    • development of new components,
    • the possibility of using COTS (wanted by both the developers and the users, since it has advantages with respect to price, availability and technology, but it also has some disadvantages).
  • The customer will most likely provide the safety objectives and an FHA.
  • Then one starts the development of new components, and the identification of safety requirements and mitigations.
  • One has one risk assessment process, and everything runs smoothly.
The surprises evolve
  • One starts to include old components and external components.
    • They were developed for another set of safety objectives.
    • The evidence that they fulfil their safety objectives and safety requirements is not satisfactory.
  • The developer meets challenges like:
    • replacing the external component with other components, for which the safety requirements are met,
    • compensating for the weaknesses in the external components in the development of the new components,
    • getting the customer to re-evaluate the safety objectives.
  • This illustrates that the definition of FHA, PSSA and SSA is not that clear.
  • These challenges can be met in a satisfactory and effective way when the customers and developers work together through all the phases.
Needs for traceability

The need for traceability is a topic of increased focus and importance:

  • The need to ensure effective communication in relation to requirements elicitation and analysis:
    • understandability of requirements to all parties,
    • traceability of requirements through the different design phases.
  • The need for traceability between the risk assessment results and the system development, “best practice within reliability engineering”:
    • There should be traceability between the identified mitigations for each failure mode/hazard and the system module/item where the mitigation should be implemented.
    • These mitigations should also result in the definition of new safety requirements.
    • When there is a good trace between the safety requirements, the high level software requirements and the low level software requirements, this allows for traceability from the failure modes and mitigations back to the safety objectives.
    • It also makes it possible to perform tests on the different levels, providing evidence that the mitigations are correctly implemented.
    • This is pointed out in guidance on FMEA and FMECA (BS5760, 1991),
    • and is one of the applied principles within the CORAS tool (CORAS, 2003).
A third need of traceability
  • The need for traceability between the risk assessment results themselves.
    • The failure modes and mitigations might for example be inserted in a fault tree, addressing the possible causes for an unwanted event – the event that a safety objective is not fulfilled.
    • That means, traversing the fault trees upward from a failure mode or a mitigation, we get another trace to the safety objective.
    • If the failure mode belongs to more fault trees, we get traces to more safety objectives.
    • Ideally, these results should now match the list of safety objectives we had from tracing the failure mode and mitigation backwards, through the system modules/items.
  • In practice there will be mismatches.
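The comparison of the two traces can be sketched as a set difference. The data model below (plain dictionaries) and every identifier such as FM-1, SR-3 or SO-1 are hypothetical; the sketch only shows the kind of check that reveals a mismatch.

```python
def objectives_via_design(failure_mode, fm_modules, module_reqs, req_objectives):
    """Trace backwards: failure mode -> module/item -> safety requirement -> safety objective."""
    objectives = set()
    for module in fm_modules.get(failure_mode, ()):
        for req in module_reqs.get(module, ()):
            objectives.update(req_objectives.get(req, ()))
    return objectives


def objectives_via_fault_trees(failure_mode, tree_events, tree_objective):
    """Trace upwards: every fault tree containing the failure mode leads to its safety objective."""
    return {
        tree_objective[tree]
        for tree, events in tree_events.items()
        if failure_mode in events
    }


def trace_mismatch(failure_mode, fm_modules, module_reqs, req_objectives,
                   tree_events, tree_objective):
    """Safety objectives reached by only one of the two traces."""
    design = objectives_via_design(failure_mode, fm_modules, module_reqs, req_objectives)
    trees = objectives_via_fault_trees(failure_mode, tree_events, tree_objective)
    return {"only_via_design": design - trees, "only_via_fault_trees": trees - design}


# Hypothetical example data.
FM = "FM-1"
FM_MODULES = {"FM-1": ["voice switch"]}
MODULE_REQS = {"voice switch": ["SR-3", "SR-7"]}
REQ_OBJECTIVES = {"SR-3": ["SO-1"], "SR-7": ["SO-2"]}
TREE_EVENTS = {"FT-A": {"FM-1", "HW-4"}, "FT-C": {"FM-1"}}
TREE_OBJECTIVE = {"FT-A": "SO-1", "FT-C": "SO-3"}

print(trace_mismatch(FM, FM_MODULES, MODULE_REQS, REQ_OBJECTIVES,
                     TREE_EVENTS, TREE_OBJECTIVE))
```

In this made-up data set the two traces agree on SO-1 only; SO-2 and SO-3 are each reached by just one trace and would need to be investigated.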
Possible causes of mismatches
  • is there a clear relation between safety objectives and safety requirements?
  • is there a clear relation between a safety requirement and a system module/item?
  • is the failure mode/mitigation related to only one system model/item?
  • in which sub-branches of the fault tree are the failure mode and mitigation assigned?

Each of these four questions points to a source of uncertainty:

  • One safety objective can be covered by several safety requirements.
  • Some safety requirements might address other safety objectives as well.
  • One safety requirement can be traced to more system models/items.
  • The complexity of the model – requirement relations hide that the same failure mode might be present with respect to more models / safety requirements.
  • A failure mode can be inserted in a number of branches in the fault tree, independent of the safety objective, but on the basis of how the local (or system) effect is described in the FMEA.
  • The construction of a fault tree is an iterative process which depends to a large extent on the ingenuity of the expert who makes the fault tree.
  • Problems also occur in the analysis of systems in which the same equipment is used at different times and in different configurations for different tasks.
Our experience
[Figure: relations between Safety objective, Safety requirements, Fault tree, System model, Local effect, and Failure modes and mitigations]
  • Instead of only updating one assessment result on the basis of another, there has to be a mechanism for reporting back, questioning and eventually updating the source of the change request.
  • The challenge with this is that it requires a system for assessing and reporting not only the change in a risk assessment result, such as the FMEA, but also the impact of the change.
Assessing the software code
  • The system safety requirements derived in the PSSA phase apply to software, hardware, environment and procedures.
  • The software requirements shall include both software assurance levels and functional safety requirements.
  • The equipment development and engineering process shall then assure that the requirements are met once the safety process has defined them.
  • The safety process shall validate the process carried out by the engineering and development teams to make sure that the safety requirements are met with sufficient confidence.
    • In most cases it is not possible to provide an absolute guarantee that a system will never fail.
    • The approach adopted in practice consists of gathering an adequate amount of evidence to support the claim that the risk associated with using the system is acceptable.
    • More elaborate evidence is required for the most severe unwanted events.
The extent of evidence required is described through assurance levels (AL).
  • The argument of safety can be based on three types of evidence:
    • testing,
    • field experience,
    • and analysis.
  • Assurance levels are derived per safety requirement, where the requirement giving the most stringent assurance level will determine the assurance level for that module.
  • If no partitioning is claimed within the same processor, all software running on the same CPU will acquire the most stringent assurance level derived for any module on that CPU.
  • To establish the required amount of evidence it is first necessary to determine the severity of possible unwanted events.
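The two inheritance rules above (per module, and per CPU when no partitioning is claimed) can be written down as a minimal sketch. It assumes that a lower number means a more stringent level, in line with AL 1 being the highest level under ED-109; the requirement, module and CPU names are hypothetical.

```python
def module_levels(req_levels, module_reqs):
    """Each module takes the most stringent level among its safety requirements.

    Assumption: a lower number is more stringent (AL 1 is the highest level).
    """
    return {
        module: min(req_levels[req] for req in reqs)
        for module, reqs in module_reqs.items()
    }


def effective_levels(module_al, cpu_modules, partitioned_cpus=frozenset()):
    """Without claimed partitioning, all software on a CPU takes the most stringent level there."""
    effective = {}
    for cpu, modules in cpu_modules.items():
        if cpu in partitioned_cpus:
            for module in modules:
                effective[module] = module_al[module]
        else:
            strictest = min(module_al[module] for module in modules)
            for module in modules:
                effective[module] = strictest
    return effective


# Hypothetical allocation of requirements to modules and modules to CPUs.
REQ_LEVELS = {"SR-1": 2, "SR-2": 4, "SR-3": 3}
MODULE_REQS = {"keying": ["SR-1", "SR-2"], "logging": ["SR-2"], "display": ["SR-3"]}
CPU_MODULES = {"cpu-a": ["keying", "logging"], "cpu-b": ["display"]}

levels = module_levels(REQ_LEVELS, MODULE_REQS)
print(effective_levels(levels, CPU_MODULES))
print(effective_levels(levels, CPU_MODULES, partitioned_cpus={"cpu-a"}))
```

In this example the logging module is raised from AL 4 to AL 2 merely by sharing a CPU with the keying module, which shows what a partitioning claim is worth.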
Assessment

The primary goal of a high-level analysis is to identify critical failure modes for the various software modules, as well as mitigating factors:

  • Define and describe relevant scenarios.
    • system high-level design descriptions, detailed knowledge of the software, low level design, functional models
  • For each functional model, a scenario is described using message sequence charts that identify the software modules involved in the scenario.
  • For each scenario failure modes are identified.
  • For each failure mode both local effects and systems effects shall be identified, together with any existing mitigating factors or means for detection.
  • The failure modes, together with the identified mitigating factors, shall then be organised in fault trees,
    • providing an overview of which combinations of failure occurrences might cause the critical system level events.
  • A derivation of software assurance levels can be based on the safety requirements and an overview of the available mitigations for the most demanding requirements.
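The "combinations of failure occurrences" that a fault tree provides are its minimal cut sets. A minimal sketch of deriving them follows; the nested-tuple representation of the tree and the event names are assumptions made for this example, and real tools use more efficient algorithms.

```python
from itertools import product


def cut_sets(node):
    """Return the cut sets of a fault tree given as nested (gate, children) tuples.

    A basic event is a string; a gate is ("AND", [...]) or ("OR", [...]).
    """
    if isinstance(node, str):
        return [frozenset([node])]
    gate, children = node
    child_sets = [cut_sets(child) for child in children]
    if gate == "OR":
        # Any cut set of any child causes the gate output.
        return [cs for sets in child_sets for cs in sets]
    if gate == "AND":
        # One cut set from every child must occur together.
        return [frozenset().union(*combo) for combo in product(*child_sets)]
    raise ValueError(f"unknown gate type: {gate}")


def minimal_cut_sets(node):
    """Drop every cut set that strictly contains another cut set."""
    sets = set(cut_sets(node))
    minimal = [s for s in sets if not any(other < s for other in sets)]
    return sorted(minimal, key=lambda s: (len(s), sorted(s)))


# Hypothetical tree: a software failure mode needs its mitigation to fail as well.
TREE = ("OR", [
    ("AND", ["sw_failure", "watchdog_fails"]),
    ("AND", ["sw_failure", "watchdog_fails", "operator_misses_alarm"]),
    "hw_failure",
])

for cs in minimal_cut_sets(TREE):
    print(sorted(cs))
```

The three-event combination is dropped because it contains a two-event cut set, so it adds nothing to the overview of what can cause the top event.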
Finding the failure modes
  • A functional FMEA can be regarded as an FMEA of the software.
  • The main functions of the system were identified by the logical grouping of the system level requirements,
    • presented as a series of use cases,
    • e.g. for the purpose of communicating the results.
  • Each main function was described with the aid of a sequence diagram.
    • Each sequence diagram showed the flow of data and control within the system as a sequence of steps.
    • Each step represented the transmission and reception of data and control from one process to another.
    • Further details were added.
  • For each sequence diagram a number of possible failure modes were identified.
    • This takes time.
    • It requires detailed knowledge of the processes.
    • It can be supported by applying guidewords, as in a Hazop process.
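Applying guidewords to the steps of a sequence diagram can be sketched as a simple cross product that yields candidate failure modes for the analysts to review. The guideword list and the two example steps are hypothetical, not the ones used in the project.

```python
from itertools import product

# Hypothetical guidewords, in the style of a Hazop.
GUIDEWORDS = ("no", "late", "corrupted", "unintended")


def candidate_failure_modes(steps, guidewords=GUIDEWORDS):
    """Cross every sequence-diagram step with every guideword.

    Each step is (sender, receiver, message). The result is a list of
    candidates to be reviewed, not a finished FMEA.
    """
    return [
        {
            "step": index,
            "guideword": guideword,
            "failure_mode": f"{guideword} '{message}' from {sender} to {receiver}",
        }
        for (index, (sender, receiver, message)), guideword
        in product(enumerate(steps, start=1), guidewords)
    ]


# Hypothetical two-step sequence.
STEPS = [
    ("operator position", "voice switch", "PTT pressed"),
    ("voice switch", "transmitter", "key transmitter"),
]

for row in candidate_failure_modes(STEPS):
    print(row["step"], row["failure_mode"])
```

The generated list only prompts the discussion; judging which candidates are credible still requires the detailed process knowledge mentioned above.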
How to determine the assurance levels?

IEC 61508 Part 5 Annex E

  • a qualitative method for determining integrity levels based on accident severity.
  • a typical ATM system worst-case relevant accident would be multiple deaths due to loss of aircraft in flight or due to collision whilst taxiing on ground.
    • The accident severity in this case would be catastrophic,
    • This equates to the highest integrity level, i.e. AL 1 if using ED-109 (ED-109).
    • If the system only contributes indirectly to the aircraft accident, and one applies the guidance in ED-109, the worst-case assurance level should be AL 2.
  • Depending upon the number of additional protection measures that are included within the overall system architecture, IEC 61508 allows the AL to be reduced further for lower level safety related functions.
    • This methodology is based on giving each mitigation a quantitative value of 0.1.
    • In some cases, this methodology is very conservative. For some operational mitigations, such an assurance level reduction is more reasonable.
Derivation of ALs based on the cut sets

Challenges:

  • a fault tree typically consists of both hardware and software failure modes, and also represents the configuration of components in the system.
  • several cut sets may include the same software failure mode, but different hardware failure modes within the same replaceable unit.
  • one cut set may address several safety requirements.

The number of additional events in the cut sets was counted for each cut set.

  • This represents the number of events, e.g. failure of mitigations, that have to happen concurrently with the software failure in order to lead to the occurrence of the top event.
  • We could then compare with a “target number of events in a cut set”,
    • the number of mitigations or additional events that have to be present for arriving at e.g. AL 4.
      • A requirement with an associated failure rate of 1e-7/h required 2 mitigations,
      • 1e-6/h required one mitigation,
      • 1e-5/h required none.
  • This provided a means of assessing the relevance of applied mitigations,
    • particularly operational ones.
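The counting described above can be sketched as follows. The target table is taken from the three figures given for AL 4; the cut sets, event names and function names are hypothetical, and the sketch deliberately does not extrapolate beyond the stated rates.

```python
# (required failure rate per hour, target number of additional events) for AL 4,
# as stated in the text.
AL4_TARGETS = [(1e-5, 0), (1e-6, 1), (1e-7, 2)]


def target_additional_events(required_rate):
    """Look up how many mitigations must accompany the software failure."""
    for rate, mitigations in AL4_TARGETS:
        if required_rate >= rate:
            return mitigations
    raise ValueError("required rate is below the range given in the text")


def assess_cut_sets(cut_sets, software_event, required_rate):
    """For each cut set containing the software failure, count the additional events.

    Returns (sorted events, number of additional events, target met?) per cut set.
    """
    target = target_additional_events(required_rate)
    report = []
    for cut_set in cut_sets:
        if software_event not in cut_set:
            continue
        additional = len(cut_set) - 1
        report.append((sorted(cut_set), additional, additional >= target))
    return report


# Hypothetical cut sets for one top event.
CUT_SETS = [
    frozenset({"SW-1", "MIT-A", "MIT-B"}),
    frozenset({"SW-1", "MIT-A"}),
    frozenset({"HW-1"}),
]

for events, additional, ok in assess_cut_sets(CUT_SETS, "SW-1", 1e-7):
    print(events, additional, "meets target" if ok else "too few mitigations")
```

A cut set that falls short of the target points either to a missing mitigation or to a software failure mode that drives the assurance level above AL 4.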
Analysis of source code
  • Assume one or a few failure modes are driving the assurance level above the level one is able to demonstrate.
    • One option is to prove analytically that the actual failure mode(s) are not present in the source code at all.
    • One way of practically doing this is through software fault tree analysis (Leveson 1991).
  • The principle of the SFTA is to start out by assuming that the unwanted event, in terms of a hazardous output, has occurred and then trace backward to either:
    • find paths through the code from specific inputs to these outputs,
    • demonstrate that no such paths exist,
    • or determine the exact conditions that must be true in order to cause the event.
  • In the SFTA most of the “events” will be zero-probability events,
    • either because they represent a contradiction,
    • or because they can not contribute to the top event.
Experiences from the analysis
  • The goal of the analysis was to show that transmitters are activated if and only if the PTT is pressed, i.e. that transmission cannot occur unless the operator intends it to do so.
  • From the SFTAs we were able to prove that Tx is active if and only if the PTT is pressed. Thus, the goal of the analysis was reached.
  • The analysis was fairly quick to perform, the documentation was not too long, and it could be done in parallel with other analyses, e.g. the derivation of assurance levels.
  • On the other hand, performing such an analysis for larger pieces of software, or for more unwanted events for the same code, would be too costly in terms of the time needed.
  • It can also be questioned whether the results of such an analysis would be easily communicated.
Conclusion
  • ATM is an emerging branch where safety is in focus
  • Safety assessment uses well-known methods and practices, but often involves:
    • complex systems
    • complex solutions
    • complex assessments
  • providing challenges and new needs