Planning for Disaster

Alan Perlis once said:

I think that it’s extraordinarily important that we in computer science keep fun in computing. When it started out, it was an awful lot of fun. Of course, the paying customers got shafted every now and then, and after a while we began to take their complaints seriously. We began to feel as if we really were responsible for the successful, error-free perfect use of these machines. I don’t think we are.

This is a nice sentiment, perhaps even a defensible one if we interpret it narrowly as talking about academic computer science. On the other hand, probably within 20 years, there’s going to be a major disaster accompanied by the loss of many lives that is caused more or less directly by software defects. I don’t know what exactly will happen, but something related to transportation or SCADA seems likely. At that point we can expect things to hit the fan. I’m not optimistic that, as a group, computer scientists and computing professionals can prevent this disaster from happening: the economic forces driving automation and system integration are too strong. But of course we should try. We also need to think about what we’re going to do if, all of a sudden, a lot of people expect us to start producing computer systems that actually work, and perhaps hold us accountable when we fail to do so.

Obviously I don’t have the answers, but here are a few thoughts.

We know that it is possible to create safety-critical software that (largely) works. Generally this happens when the organizations creating the software are motivated not only by market forces, but also by significant regulatory pressure. Markets (and the humans that make them up) are not very good at analyzing low-probability risks. A huge amount of critical software is created by organizations that are subject to very little regulatory pressure.
It is difficult to tell people that something is a bad idea when they very much want it to be a good idea. We should get used to doing this, following Parnas’s courageous example.
It is difficult to tell people that something is going to be slow and expensive to create, when they very much want it to be quick and cheap. We need to get used to saying that as well.
We can and should take responsibility for our work. I was encouraged by the field’s generally very positive response to The Moral Character of Cryptographic Work. Computer scientists and computing professionals almost always think that their particular technology makes the world better or is at worst neutral — but that is clearly not always the case. Some of this could be taught.
We need to be educating CS students in methods for creating software that works: testing, specification, code review, debugging, and formal methods. You’d think this is obvious but students routinely get a CS degree without having done any sort of serious software testing.

Finally, let’s keep in mind that causes are tricky. A major human disaster usually has a complex network of causes, perhaps simply because any major disaster with a single cause would indicate that the system had been designed very poorly. Atomic Accidents makes it clear that most nuclear accidents have been the result of an anti-serendipitous collection of poor system design, unhappy evolution and maintenance, and failure-enhancing responses by humans.

Acknowledgments: Pascal Cuoq and Kelly Heller gave some great feedback on drafts of this piece.

February 10, 2016

regehr

Computer Science, Futurist, Software Correctness

8 responses to “Planning for Disaster”

Anders says:

February 11, 2016 at 9:05 am

“We can and should take responsibility for our work.”
I find it a bit sad that we still need to discuss this after ~60 years of software development…
bcs says:

February 11, 2016 at 9:23 am

I think you’re also missing another (doomed to be very unpopular) point: we can’t make sure such a disaster *never* happens, and furthermore we shouldn’t make that a goal.

Even human life, while very valuable, isn’t *infinitely* valuable and after the first 3 or 4, adding the next 9 to the reliability starts becoming more and more expensive. At some point things cross over. Furthermore, at some point even things like the *opportunity cost* of pushing up reliability become prohibitive. Is it really worth forcing my users to wait a collective man-century for my program to load in order to possibly save a single life? What about a man-millennium? Alternatively, if my goal is nothing more than to save as many lives as I can, at some point I’d be better off shipping the software, with possible bugs, and donating the money that I didn’t spend adding that 6th 9 to a well run charity.

Now I’ll grant that just about any software developed today (with the, I hope, exception of the stuff used to throw large amounts of metal through the sky) is far enough from that turning point that it hardly even matters, but if the TSA is any indication, the general political environment can’t be trusted to have any clue when to say “good enough” so we probably need to be laying the groundwork for when we are at that point.
Steve Walk says:

February 11, 2016 at 9:25 am

comp.risks is a good news group to follow
for examples of software gone bad.
regehr says:

February 11, 2016 at 2:03 pm

Anders, I know!

bcs, that’s all true.

Steve yes, tons of good stuff there.
Stuart Dootson says:

February 11, 2016 at 3:18 pm

Having direct experience of safety critical in aerospace, and second hand experience in other fields (electrical machine control, where I’ve seen some scary software-related failures, and marine systems), I can do nothing but agree. I know what is done in aerospace to certify a 10^-9 failures/hour (although let’s not get into the believability of failure rates of that magnitude…) and I see what’s commonly done in the other domains, and just how big the gap between the two is… And it scares me…

And then I see all the talk of driverless cars, autonomous ships and such, and I get *really* scared!
Joe Duarte says:

February 16, 2016 at 3:26 am

Interesting post. Have you seen Michael Barr’s audit of Toyota’s electronic throttle control source code? (Slide version: http://www.safetyresearch.net/Library/BarrSlides_FINAL_SCRUBBED.pdf)

He also has a great presentation called Killer Apps, about software applications that have known body counts: http://www.barrgroup.com/files/killer_apps_barr_keynote_eelive_2014.pdf

I wonder about the talent pipeline for critical embedded software development. It doesn’t seem to be an area that interests young software developers very much. The brilliant programmers seem to go to Google, Apple, Docker, mobile, maybe game development. I sometimes wonder who writes the firmware for my mouse, my Blu-ray player, or my car’s brakes, and how they got into it. It’s not the sexy stuff with the three-year vesting and big windfall.

When I consulted at a nuclear power plant in 2014, I was stunned by the lack of talent. It was an IT organization where many of the staff *didn’t know anything about computers* – they were just random people who started somewhere else in the big utility and ended up parked in IT. The developers were lifers who were there for the pseudo-government pension and stability, and were out of touch with the software industry as a whole. They tried to make me use IE8.
regehr says:

February 16, 2016 at 3:39 am

Joe, thanks– I know Michael Barr’s work but had not seen those slide decks.

Also this is good stuff, a transcript of his testimony as an expert witness in Bookout vs. Toyota:

http://www.safetyresearch.net/Library/Bookout_v_Toyota_Barr_REDACTED.pdf
Mate Soos says:

February 22, 2016 at 3:04 pm

Joe-> Wow, haven’t seen that slide deck but read the whole deposition that regehr points to. Amazing stuff that should be taught in its entirety. Integrity, precision, and purely technical, a true engineer.

But this: “You’d think this is obvious but students routinely get a CS degree without having done any sort of serious software testing.” is the most scary. It makes no sense, either, as a good tester is worth its weight in gold.

Lots will argue that management/production pressure is driving quality down. My point of view is that it’s the lack of even basic know-how — that would allow for significantly better testing efficiency and hence better trade-offs — that is driving SW quality down.