The Piano Test for Program Verification

[Update from Feb 1 2011: I’ve written a new post that adds some technical details.]

Here’s a little thought experiment:

You’re spending the day visiting the Utah computer science department. We’re chatting and I describe a software verification project where my group has proved that some particular version of OpenSSH is memory safe: it will never execute an array bounds violation, null pointer dereference, or similar. Since you are perhaps slightly gullible, you believe me. Now I ask you to look up and for the first time, you notice that a large piano is hanging from my office ceiling, directly above your head. The rope from which the piano is hanging is securely tied off to an I-beam, so there’s no reason to worry about it falling on you accidentally. However, the rope passes through a small box that is connected to my desktop computer using a USB cable. I explain to you that this box contains a very sharp, remotely-activated knife that can instantly sever the rope when commanded to by the computer (don’t worry about this software being wrong, I verified it too).

Now, since you believe my verification result, you should be perfectly willing to keep sitting in your chair while I attack OpenSSH using some sort of highly advanced fuzz tool that tries to crash it by generating evil inputs. Of course, I have rigged OpenSSH’s SIGSEGV handler so that if it ever fires, it will command the USB knife to sever the rope — and you will be squished. Do you keep sitting in the chair?

My opinion is that a person would have to be a bit foolish to sit under the piano. But forget about my work: how many verification researchers believe in their own work to this level? Probably not too many. The first problem is one of semantics: the word “verified” is used extremely loosely by members of the verification community (I recently reviewed a paper that kept saying things like “we rigorously prove” and “we formally verify” — despite the fact that the paper contained nothing but proof sketches). The second problem is deeper: it is extremely difficult to argue convincingly that a verification result is correct. By “correct” I mean not only “mathematically sound” but also “the result is faithful to the intent of the verification effort.”

Although the picture I’ve painted about dropping a piano is silly, the situation is realistic. For example, every time you use the transportation system in a developed country, you are gambling that the benefit of using the system outweighs the risk that some embedded computer will malfunction, harming you. Of course, we willingly make this bet every day, but not because the stoplights and stability control systems are verified (they aren’t) but because we assume that if these systems worked well yesterday, they’ll work well today. The adage (supposedly from NASA) is “test what you fly, fly what you test” and it captures a fundamental fact about how we create reliable engineered systems: systems are tested in realistic environments and failures are rigorously studied and the results are fed back into the design of the next iteration of the system. This is not incompatible with formal verification, but it’s also not a point of view that comes out very often in the verification literature.

January 27, 2011

regehr

Computer Science, Software Correctness

10 responses to “The Piano Test for Program Verification”

Tweets that mention Embedded in Academia : The Piano Test for Program Verification — Topsy.com says:

January 27, 2011 at 3:18 pm

[…] This post was mentioned on Twitter by Marc Ruef, John Regehr. John Regehr said: The Piano Test for Program Verification: Hereâ€™s a little thought experiment: Youâ€™re spending the day visiting th… http://bit.ly/huXQPt […]
Zach Tatlock says:

January 27, 2011 at 3:33 pm

Thanks for the piano story. We have an analogous “bullet test” here at UCSD 🙂

I’m not sure the “test what you fly, fly what you test” principle works for building all reliable software systems. It seems to hinge on the notion of “realistic environment,” which is presumably narrower than the most general environment. Unfortunately, for many software systems every possible input is realistic. This is particularly true for internet-facing programs that must weather a relentless torrent of attacks by smart, well-funded criminal organizations. Program verification by proof is the only approach I know to ensure that a system will handle *any* input.

Of course, as you point out, program verification merely demonstrates that the implementation enjoys all the same bugs as the spec 🙂 While getting specs right remains a hard problem, at least program verification removes an entire class of bugs arising from a discrepancy between the spec and implementation.

Anyway, thanks for the great blog!
regehr says:

January 27, 2011 at 5:49 pm

Hi Zach- Thanks for the comments. I agree that my use of “test what you fly” here was strained, but I’m pretty sure I can defend it better in a longer post that should appear soon.
Pascal Cuoq says:

January 27, 2011 at 6:32 pm

> the stoplights and stability control systems are verified (they arenâ€™t)

I am pretty sure they are verified. It’s clear from context that you mean “formally verified” here, which indeed they aren’t with perhaps just a couple of exceptions. And the verification they were subjected too probably took place in the framework of some sort of certification. “Certification” is another word that formal methods researchers tend to use with a fuzzy, wishful meaning very different from the one engineers mean.

In real life, a system is certified if you have convinced the applicable certification authority (here, the certification authority for sitting under pianos) that people should be allowed to sit under your model of piano and OpenSSH.

It is true that certification usually entails the presence of a feedback loop from verification to design. For critical systems, they also hate to rely on this feedback loop. A quote I like that I think captures the spirit:

“A discovery of an error during verification testing is considered very serious, and its origin studied very carefully to avoid such mistakes in the future. Such unexpected errors have been found only about six times in all the programming and program changing (for new or altered payloads) that has been done.”

It’s from:
http://science.ksc.nasa.gov/shuttle/missions/51-l/docs/rogers-commission/Appendix-F.txt

There is testing, in realistic environments, to find flaws improve design. And there is verification testing, a second and independent â€”I’m tempted to say “adversarial”â€” check. You are not supposed to find flaws during the latter.
regehr says:

January 28, 2011 at 10:53 am

Hi Pascal- Thanks for the comments! You’re right, I should be clear about terms in a blog post where I’m complaining about other people’s sloppy use of terms :). The part about hating to rely on the feedback loop is very interesting. On the other hand, the _Space Systems Failures_ book that I blogged about a few days ago makes it clear that iteration is the only way that space systems ever go anywhere, so it’s possible that Feynman was overstating the role of the adversarial part of the process.
DM says:

February 1, 2011 at 8:18 am

Oh, regarding transportation systems, in fact, you trust that the development processes mandated by DO-178B have imbued the developers with enough cautiousness and even paranoia that they have designed a system that they reasonably understand and in which they have plugged most holes.

This is not necessarily a good assumption to make, as the Qantas Flight 72 episode demonstrated.
Madan Musuvathi says:

February 1, 2011 at 1:11 pm

Hi John,
I am a disbeliever in the absoluteness of verification myself. But, I would like to still argue that one has to distinguish between mathematically proving an algorithm to be sound and trusting that its implementation is correct. The former is a mathematical pursuit within certain assumptions which may or many not hold in practice, and the latter is an engineering pursuit. Both have their merits. The mathematicians hope is that proving sound algorithms will help engineers help build better verification/bug-finding tools.
regehr says:

February 1, 2011 at 1:54 pm

Hi Madan- You’re right, of course. I tend to not point this kind of thing out explicitly when writing, but hopefully nobody thinks this is because I don’t value the work on the theoretical side.
Alastair Reid says:

February 4, 2011 at 3:39 am

Next time I visit, I think I’ll bring a range of musical instruments of increasing size (penny whistle, clarinet, cello, …) – so that I can gradually build trust in your verification process. And I might bring some friends and invest in several pianos – so that I can compile OpenSSH with several different compilers and watch the results (from a safe distance).

More seriously, just how much of the system would you claim to have verified? Just the code? The compiler? Libraries? OS? Bootloader? CPU? Firmware in the network card? Firmware in the hard drive? Would you verify these additional components to show that they don’t interfere with OpenSSH execution or to show that they cannot be infected with viruses which can, for example, exploit the network card’s unrestricted access to system memory to do anything they like.
Sandro says:

February 9, 2011 at 9:29 pm

While NASA’s approach works for closed systems, like shuttles, traffic systems, etc., with constrained well-defined inputs and outputs, I don’t think it’s realistic for open systems, like Internet-facing software. Better to use a memory safe language with a strong type system to exploit for lightweight verification, and fall back on theorem proving only when you really need it.

The work on refinement types is very promising in this regard for a next gen language, since it expands the scope of program properties that can be easily verified without resorting to theorem proving.

I do like your piano test though, because it simply highlights how young verification is as an engineering discipline, in that we we don’t have enough verified software in the whole stack to have confidence in the actual final result.