During the 1980s, a computer-controlled radiation therapy machine called the Therac-25, operated through a VT100 terminal, massively overdosed six people, causing deaths and serious injuries. After an extensive investigation, MIT professor Nancy Leveson put forward a number of lessons for safety-critical systems development, including:
- Overconfidence – Engineers tend to ignore software in system safety analyses
- Reliability versus safety – False confidence grows with successes
- Defensive design – Software must have robust error handling (see the sketch after this list)
- Eliminate root causes – Patching symptoms does not increase safety
- Complacency – Prefer proactive development to reactive
- Bad risk assessments – Analyses make invalid independence claims
- Investigate – Apply analysis procedures when any accidents arise
- Ease versus safety – Ease of use may conflict with safety goals
- Oversight – Government oversight and mandated software development guidelines are needed
- Reuse – Extensively exercised software is not guaranteed to be safe
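Of these, defensive design is concrete enough to illustrate with code. The Python sketch below shows one reading of "robust error handling": validate every input and refuse to proceed on anything that cannot be shown safe, rather than assuming the operator or the hardware has already caught the mistake. All names, limits, and modes here are hypothetical, chosen only for illustration.

```python
MAX_DOSE_CGY = 200            # hypothetical per-treatment limit, for illustration only
VALID_MODES = {"electron", "xray"}


class TreatmentFault(Exception):
    """Raised whenever a requested treatment cannot be shown to be safe."""


def set_treatment(mode: str, dose_cgy: float) -> dict:
    """Accept a treatment request only if every parameter passes validation."""
    if mode not in VALID_MODES:
        raise TreatmentFault(f"unknown mode {mode!r}; refusing to proceed")
    if not (0 < dose_cgy <= MAX_DOSE_CGY):
        raise TreatmentFault(f"dose {dose_cgy} cGy outside (0, {MAX_DOSE_CGY}]")
    # Only after every check passes is the request handed on to the beam controller.
    return {"mode": mode, "dose_cgy": dose_cgy}


print(set_treatment("xray", 180))        # accepted
# set_treatment("xray", 25000)           # raises TreatmentFault instead of proceeding
```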
The remaining lesson concerned inadequate software engineering practices. In particular, the investigation found that basic software engineering principles were violated in the Therac-25's development, such as:
- Documentation – Write formal, up-front design specifications
- Quality assurance – Apply rigorous quality assurance practices
- Design – Avoid dangerous coding practices and keep designs simple
- Errors – Include error detection methods and software audit trails (see the sketch after this list)
- Testing – Subject software to extensive testing and formal analysis
- Regression – Apply regression testing for all software changes
- Interfaces – Carefully design input screens, messages, and manuals
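The "Errors" item also lends itself to a small sketch. The code below pairs error detection with a software audit trail using Python's standard logging module: every accepted edit and every rejected one is recorded, so an investigator can later reconstruct what the software did and why. Again, the function and field names are hypothetical.

```python
import logging

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(message)s",
)
audit_log = logging.getLogger("treatment.audit")


def apply_edit(settings: dict, field: str, value: float) -> dict:
    """Apply one edited field, recording both accepted and rejected changes."""
    if field not in settings:
        # Error detection: an unknown field is rejected, and the rejection
        # itself becomes part of the audit trail.
        audit_log.error("rejected edit: unknown field %r", field)
        raise ValueError(f"unknown field {field!r}")
    old = settings[field]
    settings[field] = value
    audit_log.info("edit accepted: %s changed from %r to %r", field, old, value)
    return settings


settings = {"mode": "xray", "dose_cgy": 180.0}
apply_edit(settings, "dose_cgy", 150.0)   # logged as accepted
# apply_edit(settings, "dose", 150.0)     # logged as rejected, then raises
```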
In 2017, Leveson revisited those lessons and concluded that modern software systems still suffer from the same issues. In addition, she noted:
- Error prevention and detection must be included from the outset.
- Software designs are often unnecessarily complex.
- Software engineers and human factors engineers must communicate more.
- Blame still falls on operators rather than interface designs.
- Overconfidence in reusing software remains rampant.
Whatever the reasons (market pressures, rushed processes, inadequate certification, fear of being fired, or poor project management), Leveson's insights are still being ignored. For example, after the first fatal Boeing 737 MAX crash, why was the entire fleet not grounded indefinitely? Why was it not grounded after an Indonesian safety committee report uncovered multiple failures, or after an off-duty pilot had to help avert a crash? What analysis procedures failed to prevent the second fatal Boeing 737 MAX crash?