
Critical Systems

Presentation Transcript


  1. Critical Systems
  • Critical Systems: systems whose failure may result in serious damage:
  • Safety Critical: a failure may result in human injury or environmental damage, which may in turn injure humans or other living organisms (e.g. an embedded system monitoring a dialysis machine)
  • Mission Critical: a failure may result in not achieving an important goal (e.g. an unmanned spacecraft navigation system; an election poll tabulating system)
  • Business Critical: a failure may result in loss of business or loss of customers (e.g. a credit card payment system; a general ledger)

  2. Areas of Concern for Critical Systems
  • Include the whole system:
  • hardware: all the hardware components must be maintained and kept free from failure (possibly even with redundant components)
  • power: every utility that services the system, such as electricity, must be available (possibly with a backup power supply system)
  • software: all the software components, from operating system, database, network, and interfaces all the way to the actual application, must be operational (possibly with system backup & recovery functions that allow restart from the last software checkpoint, or with user "undo" options)
  • user: all the users must be trained and know how to interface with the system under both normal and failure situations

  3. Non-Dependable System: Progression of "Negative" Impacts
  • Users evade and bypass certain functions (loss of face)
  • System is constantly under fix and revision (costly support)
  • System is returned and a refund is demanded (loss of customers)
  • Lawsuits, reparation money, or business discontinuance (loss of business)

  4. Quality and Dependability
  • For Critical Systems, quality means dependability and must have the following general characteristics:
  • Availability: the system is up and operational when the user demands it.
  • Reliability: the system performs as stated and prescribed by the user and reference documentation.
  • Safety: the system performs without threatening the well-being of people and/or the environment.
  • Security: the system is resistant to accidental or intentional undesired intrusion.

  5. Reliability and Availability
  • Reliability: strictly speaking, the characteristic that the system performs according to the specification.
  • Availability: the characteristic that the system is operational when requested.
  • Reliability may be viewed as a superset characteristic of Availability. But consider the following two systems (compared numerically in the sketch below):
  • a system A which is "relatively unreliable," with a failure once a month, but which can be fixed and recovered in 5 to 10 minutes (e.g. a transaction capacity problem that is fixed as soon as more resources are allocated)
  • alternatively, a system B which is "relatively reliable," with a failure once a year, but which cannot be fixed and recovered in less than 2 days (e.g. a design error that escaped inspection and test)
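
The following back-of-the-envelope sketch (an illustration, not part of the original slides) compares the two scenarios using the common steady-state estimate availability = MTBF / (MTBF + MTTR); the failure and repair figures are assumptions taken from the bullets above.

```python
# Back-of-the-envelope availability comparison for systems A and B above.
# Availability is commonly estimated as MTBF / (MTBF + MTTR).
# All figures are illustrative assumptions, expressed in minutes.

def availability(mtbf_min: float, mttr_min: float) -> float:
    """Steady-state availability from mean time between failures and mean time to repair."""
    return mtbf_min / (mtbf_min + mttr_min)

# System A: fails about once a month (~30 days), recovers in ~10 minutes.
a = availability(mtbf_min=30 * 24 * 60, mttr_min=10)

# System B: fails about once a year (~365 days), recovers in ~2 days.
b = availability(mtbf_min=365 * 24 * 60, mttr_min=2 * 24 * 60)

print(f"System A availability: {a:.4%}")  # ~99.98%
print(f"System B availability: {b:.4%}")  # ~99.46%
```

Note how the "relatively unreliable" system A comes out more available than the "relatively reliable" system B: availability depends on recovery time, not just on how often the system fails.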

  6. Other Considerations
  • We need to consider both the user environment and the users' (normal) usage pattern:
  • if an error does not result in a defect in an area highly used by the users, or in the "normal" path of the users, then it may never be detected. Thus the system may be considered available (for the highly used paths) and reliable (rarely runs into a failure).
  • We also need to understand that not all system failures are the same. We need to consider the severity of a failure, not just the number of failures, in terms of the effect the failure has on the users' environment.
  • ***(your thought?) Does this mean that there is no need to fix problems that are not on the "normal" path, or that we should assign low severity levels to those bugs?***

  7. Some Metrics for Reliability
  • Mean Time to Failure (MTTF): the average time between failures.
  • Given observed failures f1 through fn, record the time from start (no failures) to the first failure, t1, and all subsequent between-failure times, t2 through tn.
  • The mean of these between-failure times, (sum of t1 through tn)/n, is used as the mean-time-to-failure measure and possibly as an estimator for the next failure time t(n+1) (see the sketch below).
  • Rate of Failure: the frequency of failure occurrences per unit of time.
  • Given a fixed unit of system usage time, the number of failures is recorded.
  • The rate of failure is expected to follow a roughly exponential curve: the rate is high at first and then decreases after all the normal paths have been exercised and fixed.
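
A minimal computational sketch of the MTTF calculation just described; the failure times in the example are made-up figures for illustration only.

```python
# Sketch of the MTTF calculation above. `failure_times` are assumed to be
# cumulative clock times (e.g. hours of operation) at which failures
# f1..fn were observed.

def mttf(failure_times: list[float]) -> float:
    """Mean of the between-failure intervals t1..tn; t1 runs from time 0."""
    intervals = []
    previous = 0.0
    for t in failure_times:
        intervals.append(t - previous)  # ti = time since the previous failure
        previous = t
    return sum(intervals) / len(intervals)

# Example: failures observed at 100, 250, and 430 hours of operation.
# The intervals are 100, 150, and 180 hours, so MTTF = 430 / 3 ≈ 143.3 hours,
# which can also serve as a crude estimate of the next interval t(n+1).
print(mttf([100.0, 250.0, 430.0]))
```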

  8. Safety
  • Safety is the attribute whereby the system is capable of operating without threatening people or the environment.
  • Primary Safety-Critical: systems, such as embedded process control systems, whose failure to perform their tasks may cause direct injury to people or the environment.
  • Secondary Safety-Critical: systems whose erroneous output, such as populating a medical database with erroneous information, can indirectly cause injury to people or the environment.
  • A reliable system is critical to safety, but not always enough: a fault-tolerant system may be deemed reliable because it has redundancy capability, but it may still be unsafe if the switchover time is too long.

  9. Some Causes of Unsafe Systems
  • Incomplete requirements specification, where not all the critical situations have been described.
  • Erroneous specification, where a critical condition is not characterized correctly.
  • External behavior (e.g. hardware) that causes an unanticipated condition for the software. (It is almost impossible to describe all possible external conditions.)
  • User/operator errors which cause a combination of conditions that results in an unanticipated case, leading to a malfunction.

  10. Safety-Criticality Needs
  • Hazard Avoidance: mechanisms that require a duplicative step before committing, such as asking the user to re-confirm a command before actually deleting a file or activating a process (see the sketch below).
  • Hazard Detection and Removal: detecting a hazard and removing it before a potential accident, such as a process control system where constant pressure gauging can detect potential problems.
  • Damage Limitation: minimizing damage from an accident that has happened, such as automatic stoppage of elevators when fire is detected by a building maintenance system.
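
As a rough illustration of the hazard-avoidance idea, here is a hypothetical sketch (the function name and prompt wording are inventions for this example) in which a destructive command is committed only after an explicit re-confirmation:

```python
# Hazard avoidance via re-confirmation: a destructive action is committed
# only after the user explicitly repeats (confirms) the command.
# `confirmed_delete` is a hypothetical example, not a standard API.

import os

def confirmed_delete(path: str) -> bool:
    """Delete `path` only after the user explicitly re-confirms the command."""
    answer = input(f"Really delete {path}? Type 'yes' to confirm: ")
    if answer.strip().lower() != "yes":
        print("Delete cancelled.")  # hazard avoided: no commit without confirmation
        return False
    os.remove(path)
    return True
```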

  11. Security
  • Security is an attribute of the system that addresses the extent to which the system protects itself from external attacks.
  • Non-Malicious Intention:
  • Authorized access to the system for only limited functions which expands to functions beyond those limits, but no harm is done to the system or the owner of the system.
  • Unauthorized access to the system which violates privacy, but does no harm to the system itself.
  • Malicious Intention:
  • Authorized or unauthorized access to the system followed by intentionally performing activities which result in harming the system or the owner of the system.

  12. Security Violation Damages
  • Denial of Service: the normal processing of the system is disabled by saturating the system's capacity or tying up all system resources; thus availability becomes a problem.
  • Corruption of Data or Programs: the data or the program logic of the system may be altered, which makes the system unreliable and, possibly, unsafe.
  • Exposure of Information: the information managed by the system is made available to unauthorized parties, which can harm both the system and people; thus making the system unreliable and unsafe.

  13. Security-Critical Systems Need
  • Vulnerability Avoidance: disabling potentially exploitable weaknesses of the system, such as not connecting directly to an external network, or encrypting the data.
  • Attack Detection and Neutralization: including functions to detect an attack and to remove the intrusion before any harm is done, such as a virus checker that looks at "attachments" and strips them if there is any possibility of a threat (see the sketch below).
  • Exposure Limitation: minimizing harm if an exposure (successful attack) takes place, such as instituting a regular system backup policy that allows system recovery when needed.
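
A toy sketch of the attachment-stripping idea above, assuming a simple extension blocklist; the list of "suspect" extensions and the representation of attachments as file names are illustrative assumptions, not a real virus checker:

```python
# Attack detection and neutralization (toy version): strip attachments whose
# file extension is on a blocklist before the message is delivered.
# The blocklist contents are an illustrative assumption.

SUSPECT_EXTENSIONS = {".exe", ".vbs", ".js", ".scr", ".bat"}

def strip_suspect_attachments(attachments: list[str]) -> list[str]:
    """Keep only attachments whose file extension is not on the blocklist."""
    kept = []
    for name in attachments:
        ext = "." + name.rsplit(".", 1)[-1].lower() if "." in name else ""
        if ext in SUSPECT_EXTENSIONS:
            print(f"Stripped potentially dangerous attachment: {name}")
        else:
            kept.append(name)
    return kept

print(strip_suspect_attachments(["report.pdf", "invoice.exe", "notes.txt"]))
# -> Stripped potentially dangerous attachment: invoice.exe
#    ['report.pdf', 'notes.txt']
```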

  14. Metrics for Safety and Security
  • There are no clear metrics for safety and security, except for a set of approaches that define some measure of the level of protection:
  • avoidance
  • detection and removal
  • limitation of damage and recovery

  15. Back to Requirements
  • The attributes of a critical system need to be stated in the requirements specification under the "non-functional" category.
  • For reliability and availability, numerical metrics are preferred (e.g. "availability of at least 99.9% over any 30-day period").
  • For safety and security, a description of some level of protection needs to be stated, possibly also in the functional category.
