Academic Bug-Finding Projects in the Long Run

A number of computer science researchers, including me, have made careers out of creating tools that automate, at least partially, the process of finding bugs in computer programs. Recent work like this can be found in almost any good systemsy conference proceedings such as SOSP, OSDI, ASPLOS, ICSE, or PLDI. Examples go back many years, but a good early-modern paper in this mold is “WARLOCK–A Static Data Race Detection Tool” from USENIX 1993.

How do bug-finding papers get published? Two things are required. First, bugs should be found. A few bugs are fine if they are high-value, deep, or otherwise interesting, but it’s also ok to find a large number of garden-variety bugs. It’s not required that the bugs are turned into bug reports that are well-received by the people who maintain the buggy software, but that is definitely a plus. It’s not even required that a bug-finding tool (or algorithm) finds any bugs at all, though I personally don’t have much use for that sort of paper. The second thing that is required is some level of novelty in the bug-finding tool. For example, the novel thing about Csmith was taking undefined and unspecified behavior seriously in a program generator and, in particular, the way that it interleaves program generation and static analysis.

So now you know how it works: someone creates a tool, finds some bugs, writes a paper, and that’s that. The problem, as I’ve increasingly come to realize, is that this cycle is not as useful as we would hope. For example, let’s take one of my tools, IOC (here I’m using “my” in a bit of a broad sense: I wrote very little of IOC, though I did cause it to be created). We created the tool, found some integer overflow bugs, wrote a paper. At this point, if I want to maximize things about me that can be measured (papers, citations, grants, etc.), then I’ll stop right there and move on to the next project. But this is a problem — at the time my paper is published, I’ll have only found and reported a tiny fraction of the integer overflow bugs lurking in all the C/C++ codes that are out there. Furthermore, even if I find and report, for example, all of the integer overflows in the Perl interpreter or the Linux kernel, it’s not like these code bases are going to stay clean. Overflow mistakes are so easy to make that developers will just add them back in unless they get frequent feedback from a tool. So, how can we extract more benefit from a tool?

  • One possibility is that my actual tool doesn’t matter since people will read the paper and implement similar techniques. From a selfish point of view this is the best case since my work has impact without my doing any additional work, but given the flood of CS papers, it’s a lot to expect that busy people in the real world will re-implement all of the useful bug-finding work that’s out there.
  • If I open-source my tool, people can use it. However, the bar is pretty high if (as is the case for IOC) people have to build a patched Clang, build their project using the new compiler and some special flags, run some tests, and then interpret the tool’s output. Some academic bug-finding tools are easier to use than this (and in fact IOC will be easier to use if we can get it included in LLVM), some are harder, and many are impossible since the source is not available or the tool is effectively unusable.
  • If I create a company based around my tool (think Coverity) then it becomes widely available, but so far I haven’t been interested in facing the big lifestyle change that comes along with a startup. Also, I believe that the products of publicly funded research should be open source.

The point is that the hit-and-run style of bug-finding research is depressingly inadequate as a way of improving software quality in the long run. We need to do better.

20 thoughts on “Academic Bug-Finding Projects in the Long Run”

  1. IOC has a distinct advantage in that it is built on a commonly used framework (LLVM) and as such has a clear path to common use. How much can that situation be generalized?

  2. Disclaimer: I am a layman. Why isn’t there a tool that includes all these algorithms? Maybe a core part, with every new algorithm implemented as an extension.

    Then, we (as programmers) could add to our build process the execution of this tool…

  3. I agree, but I would say that the problem is much more prevalent than only in the ‘bug-finding’ research area. Think, in all of CS, of the number of papers without even a decent implementation, or an implementation so unstable as to be unusable even if the researcher makes it available.

  4. There’s another variant of the “open source” option, which is to not only open source, but continue to maintain the software, and make it easy to build and use for others. This is how LLVM started, before Apple took over, and it’s still true for software from ACL2 to GHC Haskell to Coq.

  5. John,

    The interesting part of your research, to software developers, is the information on occurrences of the various constructs being searched for (the kind of tools you have produced are not new, just not published).

    The faults you have found are only a problem if they have serious consequences and software engineering has a missing dead bodies problem.

    If you want the results of your research to be taken up by industry you need to follow through on the faults you have found to find out what real-world consequences they have had; find enough dead and maimed people and perhaps something will happen, better yet find somebody of influence who is losing money because of these faults.

  6. I think giving the tools to students for doing class projects (e.g., improve them in some way) is one option.

  7. When we (PLT at Rice) started to develop static debugging tools [*] in the early 90s, I faced a similar dilemma though not the same. Our target language was Scheme and our tools were based on some form of type inference. Soft Scheme [Andrew Wright] was the first reasonably complete tool that we could use on full-fledged Chez Scheme programs. It uncovered two distinct problems: (1) we weren’t really able to port this to other Scheme implementations easily (and this may be easier in the C world) and (2) it was unbelievably difficult to use the system. I may have been the only real programmer who used it on an almost daily basis for a year and a bit.

    For the next project, I set the goal of solving the second problem. I wanted second- or third-year undergraduates at Rice to be able to use the tool. My expectation was that pushing the project to that level would teach us something and would help get the idea spread into other IDEs. The first part happened, but the payoff was much lower than expected (one or two additional publications). The key insight we took away was that the average programmer (as say represented by a near-graduate of a decent undergraduate CS program) needs a LOT MORE automation than static debuggers provided. The second part never happened. To this day, DrRacket [formerly DrScheme] appears to be the only IDE that can overlay value flow graphs on top of a program to explain a bug. Why this is, I don’t know. Everyone — from simple undergrads to highly experienced research colleagues — loved this feature. I thought it would take off.

    Eventually we tried to reuse the tool and extend it to cope with modules. At that point, we ran into yet another problem — we needed a different way of analyzing programs. Even though the tool had experienced some bit rot, the real obstacle was that the next tool needed a different architecture.

    To this day I am hoping for a fourth solution. Someone gets an NSF infrastructure grant and builds a framework into which others plug in their bug-finding analyses. The infrastructure provides a slick API and may even come with a test bed of “debuggable” benchmarks. People who want to publish a bug-finding paper must develop their tool sufficiently well to plug it in and to report on user experiences outside their group. That should at least become the standard. If they can’t, they need to write a three-page justification.

    If the tool takes off, someone should get an NSF industry collaboration grant. The tool software industry would pay to get its hands on ideas from this group early, not when the paper has been polished and published. Indeed, good software houses may end up advertising their version of the tool by the time the paper appears. And if it really all goes well, this enterprise becomes self-sustaining.

    Just some experiences and thoughts.

  8. I would go even further than Serge and say this issue is one of the core tensions in most research fields. Software engineering tools is an area where the effort required to get a new idea implemented and widely used is _low_ enough that it’s frustrating when it doesn’t happen. Having said that, based on my experience working at a company that develops bug finding tools of the Coverity sort, it takes a huge amount of effort to maintain such tools. Programming languages and environments are moving targets, no tools actually 100% adhere to standards, and every user has their own quirky hacks they’ve added. In order to be useful, tools have to deal with all this crap. This grungy maintenance stuff is way too much work for any individual researcher to take on. I think researchers who care about direct practical impact should band together around tools like the Clang Static Analyzer, and consider contributing to the maintenance of such tools part of their job. One could even go so far as to penalize authors that reinvent wheels instead of using popular frameworks (when the idea in question fits in an existing framework, obviously).

  9. The problem you identify is pervasive across all computer science / EE research areas. A vanishingly small fraction of research papers get incorporated into products, for all the reasons you outline. Your list of options is good, but incomplete.

    – Establish a deep collaboration with a company who builds a relevant product and who might be interested in your help getting your ideas productized. Unless you are very fortunate, the benefit here is the satisfaction (and credibility) of seeing your ideas impacting the real world, not extra $$$ or pubs. Assuming you do not sign over some form of exclusivity, this does not interfere (or advance) your goal of open sourcing the output of publicly funded research.

    – Similar to the above, a graduating student who goes to work for the right company is a great form of technology transfer. Of course, this option is not (nor should it be) totally under your control.

    – Go to work for a company. You do not need to dive in with both feet like me, which is at least as big a change as starting your own company — there’s always sabbaticals or leaves of absence (esp. for you post-tenure folks).

    It’s refreshing to hear that somebody really cares about their impact, not just pub count. It would be nice if more departments rewarded researchers for tech transfer.

  10. BCS, good point, some bug-finders are far harder to deploy than IOC. Of course, I’d argue that in the short run, a bug-finder that can’t be disguised as a compiler isn’t going anywhere.

    Seo, thanks, I hadn’t seen DACA.

    Dimitrios, my guess is that at present no such common framework is possible. In the future we probably will not see a single common framework, but perhaps there could be a small number of them. Clearly LLVM will be one of them.

    Daniel, agreed.

  11. Serge, definitely you’re right. I wanted to single out bug-finding because it has this interesting characteristic where we can eliminate all members of a class of bugs, only to have them sneak back in later. But maybe that’s not a very important distinction.

    Sam, right. But that’s a lot of work. I’m just whining here.

  12. Derek, the “missing dead bodies problem” is a cute phrase, but you risk overstating the case. Bugs cost money, period. Integer overflows, to take the running example, are a common root cause for exploitable vulnerabilities. These have a real economic cost. No bodies are needed.

  13. Matthias, that’s a great story. Usability problems with static analysis results are one of the reasons I’ve been doing more work on testing lately. If you give me a problem justified by some crazy path, and if past such bug reports have been hard to process and have sometimes been false alarms, I’ll probably quit using the tool. However, if you give me a test case that triggers a bug, as a right-thinking developer I must pay attention.

    Clever test case generators in the style of Klee and SAGE seem to me to represent one of the best ways for static analysis people to channel their energy.

  14. Anon, agreed on all counts (however, penalizing researchers for reinventing wheels is a game I wouldn’t care to play — however useful it might be).

  15. John, thanks for adding to the list!

    There may be incentive problems in implementing the “collaborate with industry” route. Compiler companies have been pretty uninterested in my tool, Csmith, which can find serious flaws in their tools. My take on the situation is that it is their customers who would benefit from these bugs being fixed, not the companies. It’s sort of like having a tool that can find safety flaws in automobiles — a car company may not be very interested in learning about these.

  16. Back when I was writing my article on GCC and PalmSource compiler testing, I asked a similar question of Mark Mitchell, the head GCC maintainer. He pointed out that “there’s no shortage of bugs in the bug database, so it’s not as if we don’t have enough things to fix,” and advised checking some of my tests into the C Torture suite. I was only able to justify spending a limited amount of time on this, most of which I spent failing to get DejaGnu to perform as documented. (There would have been obstacles after that as well, since my tests were for our particular runtime.) But I still think Mark was right on the general approach: The ultimate goal of a bug report should be, not a one-time fix, but a test case which will be run on every future build. I hope to do this in my new job with some of your bugs, but it will probably take me a while to get to that point.

  17. Flash, I’d be interested to talk more about this.

    Beyond regressions, something that might be interesting to add to the LLVM test suite would be a collection of Csmith outputs that cover a lot of branches not covered by existing LLVM tests.

  18. I have been thinking about this question from the other side: as an industrial researcher, what academic work would help our product divisions find bugs or write code more efficiently and, since the primary measure of my performance is tech transfer, how can I get the divisions to give the most promising a try?
    I’ve introduced the compiler teams to CSmith, and John came and gave a talk, but, to be honest, I was more interested in doing research on making processors/programs faster than more reliable. Then I realized that much more than 50% of the engineering effort in my company is spent on testing/validation/verification, but R&D was doing virtually nothing to help with this. So I’ve been changing my research focus.
    So what should we be using?
    – I know there’s a new CSmith release to try out. (We work on 3 compilers)
    – We ought to give IOC a try – what kind of code is it best for (compilers? device drivers? Linux libraries?)
    – I think Klee ought to be good for generating tests for device drivers (as a hardware company, we write a lot of device drivers)
    – I’m looking very seriously at Peter Sewell’s work on generating litmus tests for weak memory models. But are litmus tests (designed to describe/explore the architecture) effective at finding the corner cases in a complex memory implementation where everything is written for speed and low-power?
    – I don’t know what to use to find lock-related bugs.
    – I don’t know what to use to find memory model bugs (i.e., where the weak memory model bites). (We port and optimize open-source code like Linux libraries, the Linux kernel, etc.)
    – Are there any tools that help isolate problems in JITs/VMs?
