Status of Software Testing

The other day I received a query from some software engineering researchers who are compiling sort of a survey paper about the status of software testing research; here are the questions, with my answers. I’m interested to hear what the rest of you think about this stuff.

What do you think are the most significant contributions to testing since 2000, whether from you or from other researchers?

These would be my choices:

* delta debugging
* symbolic and concolic testcase generation: Klee, SAGE, etc.
* general-purpose open source execution monitoring tools: Valgrind, “clang -fsanitize=undefined”, etc.
* incremental improvements in random testing: QuickCheck and its variants, etc.
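
For readers who have not seen delta debugging, here is a rough sketch of the idea in Python. It is a simplified variant of the ddmin loop, not a faithful reproduction of any published tool; the names `ddmin` and `fails` are made up for illustration, and `fails` stands in for whatever check tells you the bug still reproduces.

```python
def ddmin(items, fails):
    """Shrink `items` to a smaller list for which `fails` still returns True.

    `fails` is a hypothetical predicate supplied by the caller; it should
    return True when the given input still triggers the bug.
    """
    assert fails(items), "the starting input must trigger the failure"
    n = 2  # how many chunks to split the input into
    while len(items) >= 2:
        chunk = max(1, len(items) // n)
        subsets = [items[i:i + chunk] for i in range(0, len(items), chunk)]
        reduced = False
        for i, subset in enumerate(subsets):
            # Everything except the i-th chunk.
            complement = [x for j, s in enumerate(subsets) if j != i for x in s]
            if fails(subset):
                # One chunk alone reproduces the bug: keep only that chunk.
                items, n, reduced = subset, 2, True
                break
            if fails(complement):
                # The bug survives without this chunk: drop it.
                items, n, reduced = complement, max(n - 1, 2), True
                break
        if not reduced:
            if n >= len(items):
                break  # already tried removing single elements
            n = min(len(items), n * 2)  # retry with finer chunks
    return items


# Toy failure: the "bug" needs both 3 and 7 to be present in the input.
print(ddmin(list(range(10)), lambda xs: 3 in xs and 7 in xs))
```

In practice the predicate would run the program under test on the candidate input, which is why reduction can be slow and why the chunking strategy matters.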

What do you think are the biggest open challenges and opportunities for future research in this area?

Well, as far as I can tell, gaining confidence in the correctness and security of a piece of software is still both expensive and difficult. Not only that, but this process remains an art. We need to continue making it into a science. Just as a random example, if I want to build a compiler there are about 25 books I can read, all of which cover different aspects of a well-understood body of knowledge. They will tell me the various parts of a compiler and how I should put them together. If I want to become certain that a piece of software does what I hope it does, what books should I read? It’s not even clear. It’s not that there aren’t any good books about testing, but rather that even the good ones fail to contain actionable general-purpose recipes.

Of course not all software is testable. My work on testing GCC has convinced me of this (although I already knew it intellectually). I think there are lots of opportunities in learning how to create testable and/or verifiable software. For example, what could I accomplish if I had a really good concolic tester to help with unit testing? Hopefully, with a fairly low investment I’d end up with good test cases and also good contracts that would be a first step towards verification if I wanted to go in that direction. I think we can take a lesson from CompCert: Xavier didn’t just prove the thing correct, but rather pieced together translation validation and verification depending on which one was appropriate for a given task. Of course we can throw testing into the mix when those other techniques are too difficult. There’s a lot of synergy between these various ways of gaining confidence in software but at present verification and testing are largely separate activities (not least because the verification tools are so hard to use, but that’s a separate topic).

One of the biggest unsolved challenges is finding hidden functionality (whether deliberately or accidentally inserted) in software that operates across a trust boundary. Of course hidden functionality is difficult because the specification gives us few or no clues about where to find it; it’s all about the implementation. There’s no silver bullet for this problem but one idea I like (but haven’t worked on at all) is using continuous mathematics to model the system behavior and then focusing testing effort around discontinuities. This research is of the character that I’m trying to describe.

January 22, 2014

regehr

Computer Science, Software Correctness

14 responses to “Status of Software Testing”

Bryan Pendleton says:

January 22, 2014 at 4:29 pm

I’d nominate JUnit, for popularizing and making mainstream the writing of automated test suites.

According to http://c2.com/cgi/wiki?TenYearsOfTestDrivenDevelopment, JUnit.org went online on August 16th, 2000. I think I first remember using it around 2002, so I was a little late to the party.

Still, it’s been a great 13.5 years for that tool!
regehr says:

January 22, 2014 at 5:20 pm

Bryan, thanks, I think these unit test frameworks have had a big impact. I sort of grew up as a programmer in the early/mid 90s before these were popular and I still tend to roll my own unit testers — probably not the best strategy.
Gabriel Ferrer says:

January 22, 2014 at 7:51 pm

John, I had the same experience relative to automated unit testing, for the same reason. I’ve recently become an ardent user of JUnit, as a result of my need to teach it. I figured it was not principled for me to require students to use something I was not using myself. I’ve found it far nicer than rolling my own framework.

In Python, a similar framework is part of the standard library (http://docs.python.org/3.3/library/unittest.html).
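
A minimal sketch of what a test looks like with that standard-library module is below. The `clamp` function and the `ClampTest` class are hypothetical, invented only to have something to test.

```python
import unittest


def clamp(value, low, high):
    """Hypothetical function under test: restrict value to [low, high]."""
    return max(low, min(value, high))


class ClampTest(unittest.TestCase):
    def test_value_inside_range_is_unchanged(self):
        self.assertEqual(clamp(5, 0, 10), 5)

    def test_value_below_range_is_raised_to_low(self):
        self.assertEqual(clamp(-3, 0, 10), 0)

    def test_value_above_range_is_lowered_to_high(self):
        self.assertEqual(clamp(42, 0, 10), 10)
```

Saved in a file, this can be run with `python -m unittest`, which discovers the `TestCase` subclass and reports each test method separately.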
Windows Programmer says:

January 23, 2014 at 6:55 am

Debugging != testing (re: delta debugging).

What do you think of tools like Coverity, i.e. static verifiers?

Another interesting development: Intel’s MPX capability.
Kelly says:

January 23, 2014 at 8:24 am

I have benefited from Valgrind, and I read about Klee and was impressed. I second those nominations for sure! An area that is still a “ghetto” in terms of testability is relational SQL code. I’d love to see more action there: http://pqdtopen.proquest.com/results.html?keywords=Relational%20data 🙂
Alex Groce says:

January 23, 2014 at 10:48 am

John, I got the same survey. Interestingly my answers to #1 were exactly yours, minus Valgrind and company. Even that’s mainly because I think of Valgrind etc. for some reason as being some other kind of really useful research, but not testing research, even though that’s sort of nonsense now that I think about it.

For the second part, I first said that it would be good if experts in testing, at least, had more than a very vague guess when someone asks “will testing technique X work for program P?” Even nicer would be testing tools that could actually say “yes, I worked fairly well there” or “I didn’t work well for this program + spec + harness, here’s a mitigation approach I can suggest” or “give up, try another method; I’m hopeless for this program.”

Then I said I expected the main problem in general would remain the current main problem, which is that in many cases we have to actually test a system, which means selecting a relatively tiny subset of the ludicrously large set of possible executions to observe and check. How best to do that strikes me as something we’ll continue to improve over the next 14 years, but we almost certainly won’t consider “done” at any point I’ll ever see.
Alex Groce says:

January 23, 2014 at 10:50 am

As to delta-debugging not being testing, I think DD makes some testing methods practical that are not otherwise. If you are doing random testing and don’t have some kind of reducer in your workflow, you’re not going to be a happy person, and if you send unreduced test cases to developers they will join that unhappy camp (or merrily ignore you, since you might as well be speaking some kind of crazy moon language).
Alex Groce says:

January 23, 2014 at 10:57 am

Windows Programmer,

I don’t know about John, but if the question hadn’t come from testing researchers specifically asking for a session on testing research, I might say that static analysis tools becoming (1) good and (2) commercially available, supported, and in my experience moderately well adopted in industry was the biggest concrete gain. I’m not sure how much of the boom in static analysis was due to major research advances vs. lots of hard engineering work, though the answer is at least “some.” Certainly at JPL, using Coverity (and some other tools) on the Curiosity software was probably a bigger gain than any testing advance I saw there. The exception would be modules like the file system, on which the testing/verification experts turned their full attention, but that doesn’t scale to the whole project.
regehr says:

January 23, 2014 at 2:07 pm

Windows Programmer: My position on static analysis is the same as Alex’s: it’s a major advance that obviates some testing work, but it’s not really testing. On the other hand, I do indeed consider delta debugging to be a key part of the testing process. YMMV.
regehr says:

January 23, 2014 at 2:08 pm

Alex, nice — maybe you can send me your full response (or post it here).
Jesse Ruderman says:

January 23, 2014 at 3:03 pm

I’d add in:
* Continuous integration, pioneered by Mozilla (with Tinderbox and later Buildbot)
* Smart, targeted fuzzing. My custom fuzzers have found thousands of bugs in Firefox.
* Differential testing

Alex is spot on that automated testcase reduction is part of what makes compiler/API fuzzing practical. So are:
* Replay debugging, for intermittent bugs
* Code archaeology tools, so I can find the right developer to ask for a fix (bisect, blame, pickaxe)
* Developer habits: writing assertions; creating testing APIs; fixing “minor” differential testing bugs so more bugs can be found
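
To make the differential testing item concrete, here is a small sketch in Python. It is only an illustration under assumptions: `insertion_sort` is a hypothetical implementation under test, Python's built-in `sorted` plays the role of the reference, and `differential_test` is a made-up helper name.

```python
import random


def insertion_sort(values):
    """Hypothetical implementation under test."""
    result = []
    for v in values:
        i = len(result)
        # Walk left past every element larger than v.
        while i > 0 and result[i - 1] > v:
            i -= 1
        result.insert(i, v)
    return result


def differential_test(impl_a, impl_b, trials=1000, seed=0):
    """Return the first random input on which the two disagree, or None."""
    rng = random.Random(seed)  # seeded so a failure can be replayed
    for _ in range(trials):
        data = [rng.randint(-50, 50) for _ in range(rng.randint(0, 20))]
        # Pass copies so neither implementation can mutate the other's input.
        if impl_a(list(data)) != impl_b(list(data)):
            return data
    return None


print(differential_test(insertion_sort, sorted))
```

No specification is needed beyond "the two should agree", and any disagreeing input that comes back is a natural candidate for automated reduction before it goes to a developer.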
Alex Groce says:

January 23, 2014 at 3:55 pm

Jesse,

I counted “smart targeted fuzzing” as part of my “taking random testing seriously” part. The exact wording in my response was something like: “the field has finally moved from papers on ‘random testing: worth doing at all?’ to taking a widely used approach seriously.” I’d count differential testing, except I assume the idea’s been around forever, right? I knew to use it in testing things when I started grad school around 1999.

Good replay is so critical I basically don’t research any testing target where I can’t get it, yeah. Of course, that’s the luxury of being an academic, I can pick my enemies.
Alex Groce says:

January 23, 2014 at 3:56 pm

Come to think of it, McKeeman is a 1998 paper.
Jeffrey Bosboom says:

January 24, 2014 at 11:40 pm

How about bug bounties/the black market in vulnerabilities? They’re a social strategy, not a technology, but they’ve recruited lots of eyeballs to code that would receive much less testing effort otherwise (as well as causing organizations to prioritize internal testing).