
Presentation Transcript


  1. Software in Practice: a series of four lectures on why software projects fail, and what you can do about it - with particular emphasis on safety-critical systems. Martyn Thomas, Founder: Praxis High Integrity Systems Ltd; Visiting Professor of Software Engineering, Oxford University Computing Laboratory

  2. Lecture 1: What is the problem with software? • The state of practice • Scale • Complexity • What does testing tell us?

  3. When I started in 1969 ... an IBM 360/65 provided a computing service for thousands of users. Now I have more computing power in my phone.

  4. The Software Crisis • First stored-program digital computer, Manchester 1948 • First commercial business computer, LEO, 1951 • We are still in the very early stages of software engineering ... • … like studying civil engineering when Archimedes was still alive! • NATO Software Engineering conferences in 1968 and 1969 to address the growing crisis in software dependability.

  5. 1972 Turing Award Lecture: "The vision is that, well before the 1970s have run to completion, we shall be able to design and implement the kind of systems that are now straining our programming ability at the expense of only a few percent in man-years of what they cost us now, and that, besides that, these systems will be virtually free of bugs." - E. W. Dijkstra

  6. Software in the 21st Century • Fifty years on, yet still at the beginning. • We are planning drive-by-wire cars, guiding themselves on intelligent roads • We are dreaming if we believe we can build such real-world systems safely, with today’s attitudes to software engineering. • We have still not achieved Dijkstra’s vision of thirty years ago!

  7. Thirty years later … most computing system projects fail • Project cancellation • Major cost or time overrun • Much less functionality than planned • Inadequate security • Major usability problems • Excessive maintenance / upgrade costs • Serious in-service failure. I'll talk about some specific failures in later lectures.

  8. most software projects fail • Cancelled before delivery: 31% • Overran timescales and costs, or delivered greatly reduced functionality: 53% • On time and on budget: 16% • Mean time overrun: 190% • Mean cost overrun: 222% • Mean functionality delivered: 60% • large companies did much worse than smaller ones • more recent figures are better, but still poor. Source: The CHAOS Report (1995), http://www.standishgroup.com

  9. most computing projects fail • Of 1027 projects, 130 (12.7%) succeeded • Of those 130: • 2.3% were development projects • 18.2% were maintenance projects • 79.5% were data-conversion projects • Of the 500+ development projects in the sample, 3 (0.6%) succeeded. Source: BCS Review 2001, page 62.

  10. Why does it happen? Because: • scale matters. Small processes don’t scale up • process matters. Most developers lack discipline • rigour matters. Most developers are afraid of mathematics • engineering is conservative, whereas the software industry is ruled by fashion • CAA licensing system; C vs Ada at Lockheed Martin; eXtreme this, Agile that ... Who can make things better? You!

  11. Scale • How many valid paths are there through a 200-line module? • We have found around 750,000 • How big are modern systems? • Windows is ~100M LoC • Oracle talk about a “gigaLoC” code base. • How many paths is that? • How many do you think they have tested? • What proportion will ever be executed? (A rough illustration of how path counts explode follows below.)
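
As a back-of-the-envelope illustration (mine, not from the slides), the short Python sketch below shows why path counts explode: every independent two-way decision roughly doubles the number of distinct paths, before loops and nesting are even considered.

```python
# Illustrative sketch only (hypothetical example, not Praxis's analysis tooling):
# each independent two-way branch in sequence doubles the number of distinct paths.

def count_paths(branch_points: int) -> int:
    """Paths through code containing `branch_points` sequential if/else decisions."""
    return 2 ** branch_points

if __name__ == "__main__":
    for k in (10, 20, 30):
        print(f"{k:2d} sequential decisions -> {count_paths(k):,} paths")
    # 10 -> 1,024; 20 -> 1,048,576; 30 -> 1,073,741,824 paths.
    # Loops and nested conditions push even a 200-line module into the hundreds
    # of thousands of feasible paths, so exhaustive path testing is hopeless.
```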

  12. A medium-scale system: En Route ATC at Swanwick

  13. RS/6000 workstations

  14. Control Room

  15. Airspace

  16. A medium-sized system • 114 controller workstations • 20 supervisory/management positions • 10 engineering positions • a 48-workstation simulator • two 15-workstation test systems • 2.5 million lines of software • >500 processors

  17. Operational data • 1,667,381 flights in 2002 • Continuous operation, with one 3-hour failure • (other flight delays were caused by NAS failures at West Drayton)

  18. Challenges for the future • Current ATC safety depends on the controller’s ability to clear their sector using radio alone. • Future traffic growth requires more than 10 aircraft on frequency; controllers would be overloaded. • So future ATC will depend on automatic systems, which must not fail. • Target? At least the avionics standard: 10^-8 pfh (probability of failure per hour). • No current air traffic management systems are built to such standards. This could be your job in 3 years’ time.

  19. How can we be sure a system works? • Assurance: showing that a system works • Much harder than just developing a system that works • You need to generate evidence that it works • What evidence is sufficient? • How safe or reliable is a system that has never failed? • What evidence does testing provide? • How can we do better?

  20. How safe is a system that has never failed? • If it has run for n hours without failure, and if the operating conditions remain much the same, the best estimate for the probability of failure in the next n hours is 0.5 • To show that a system has a pfh of <10^-4 with 50% confidence, we need about 14 months of fault-free testing (10,000 hours is 13.89 months).
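
As a worked check of that last figure (my arithmetic, not part of the original slide): a pfh below 10^-4 corresponds to on the order of 10^4 failure-free hours of operation, and converting 10,000 hours of continuous running into 30-day months gives:

```latex
10^{4}\ \text{h} \;=\; \frac{10\,000}{24}\ \text{days} \;\approx\; 417\ \text{days}
\;\approx\; \frac{417}{30}\ \text{months} \;\approx\; 13.9\ \text{months} \;\approx\; 14\ \text{months}
```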

  21. What evidence does testing provide? “Testing shows the presence, not the absence, of bugs” - Dijkstra • We cannot test every path. • Testing individual operations or boundary conditions may find faults, but such tests provide no evidence of pfh. • Statistical testing, under operational conditions, provides evidence of pfh. • But it takes a very long time.

  22. Statistical testing • To show an MTBF of n hours, with 99% confidence, takes around 10n hours of testing with no faults found. So avionics (10^-8 pfh) would need around 10^9 hours (>100,000 years). • With good prior evidence, e.g. from a strong process, using a Bayesian approach may reduce this to ~10,000 years • Actual testing is trivially short by comparison.
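
The year figures are straightforward unit conversions (my arithmetic, taking the slide’s factor of roughly 10n as given; the exact multiplier depends on the statistical model used):

```latex
10^{-8}\ \text{pfh} \;\Rightarrow\; \text{MTBF} \approx 10^{8}\ \text{h}, \qquad
10 \times 10^{8}\ \text{h} \;=\; 10^{9}\ \text{h} \;\approx\; \frac{10^{9}}{8\,766\ \text{h/year}} \;\approx\; 114\,000\ \text{years}
```

A ten-fold reduction from a Bayesian argument still leaves about 10^8 hours, i.e. roughly 11,000 years of fault-free operation - the order of the ~10,000 years quoted on the slide.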

  23. Summary • Developing reliable software is difficult because of the size and complexity of real-life systems. • The software industry is very young, amateurish and immature. Most significant projects overrun dramatically (and unnecessarily) or totally fail. • In future lectures, I will explore why some failures have occurred (Therac, Ariane, LAS, Taurus …) and talk about what you need to know if you are to become a professional amongst all these amateurs.
