Static Analysis Benchmarks

Many programmers would agree that static analysis is pretty awesome: it can find code defects that are very hard to find using testing and walkthroughs. On the other hand, some scientific validation of the effectiveness of static analysis would be useful. For example, this nice 2004 paper found that when five analyzers were turned loose against some C programs containing buffer overflows, only one of them was able to beat random guessing in a statistically significant fashion.

More recently, in 2014, some researchers at the Toyota InfoTechnology Center published a paper evaluating six static analyzers using a synthetic benchmark suite containing more than 400 pairs of C functions, where each pair contains one function exemplifying a defect that should be found using static analysis and the other function, while looking fairly similar, does not contain the defect. The idea is that a perfect tool will always find the bug and not find the not-bug, if you see what I mean. The results show that different tools have different strengths; for example, CodeSonar dominates in detection of concurrency bugs but PolySpace and QA-C do very well in finding numerical bugs. With a few exceptions, the tools do a good job in avoiding false positives. At this point I feel like I have to refer Coverity’s classic writeup about this kind of thing just because it’s awesome.

Shin’ichi Shiraishi, the lead author of the Toyota paper, contacted me asking if I’d help them release their benchmarks as open source, and I have done this: the repository is here, and here is a brief document describing the benchmarks. (I just want to make it clear that while I own the Github repo, the work here was entirely done by Shin’ichi and his colleagues — I added the simple autotooling, fixed a few portability problems in the benchmarks such as using intptr_t for integers that will be cast to and from pointer values, and did some other lightweight cleanup.) Anyway, I am very happy that Shin’ichi and his organization were able to release these benchmarks and we hope that they will be useful to the community.

Update from 2/25/15: Whoops — it looks like Coverity/Synopsys has a DeWitt clause in their EULA and based on this, they sent me a cease and desist notice. In accordance with their request, I have removed Shin’ichi’s group’s paper from the Github repo and also removed the link to that paper from this piece.

February 6, 2015

regehr

Computer Science, Software Correctness

13 responses to “Static Analysis Benchmarks”

Ben says:

February 6, 2015 at 12:48 pm

First I’m going to totally self-indulgently give myself a shout-out for having implemented a goodly chunk of CodeSonar’s concurrency bug detection.

Having spent a few years looking at multithreading bugs in lots of different software I am now totally convinced that with a few notable exceptions, mainstream software should not use threads. Processes for parallelism and some kind of non-parallel abstraction for multitasking!
bcs says:

February 6, 2015 at 12:58 pm

I’ve heard it claimed that the primary result from meta research on automated bug finding is that the more things you do to find bugs, the more bugs you find.

Assuming this is true, the solution would seem to be; get the cost of applying multiple systems to be sub linear.
regehr says:

February 6, 2015 at 2:05 pm

Ben, awesome!

I totally agree about threads. I was just telling my OS class this yesterday. In my own programming life I always go for process-level parallelism.
regehr says:

February 6, 2015 at 2:09 pm

Hi bcs, I agree that multiple tools are good, though at some point we need to ask “would our time/money be better spent on X” where X is more code reviews, more testing, more aggressive configurations of tools we’re already using, etc.

Also I’ve found that while using a static analyzer, you start to conform it its desires, for example by avoiding idioms you know it doesn’t like. When using 7 or 8 tools, this contortions required to avoid false positives might become confusing or painful.
bcs says:

February 6, 2015 at 6:44 pm

@Ben, I’d not go that far. Multiple threads/fibers in a single process with immutable global data, thread owned local data and pass-by-value and/or transfered-of-ownership IPC primitives will give most of the same benefits along with much lower overhead for things like communication and concurrency ramp-up/down.

@regehr: good points, however I’d lump all those into the “costs” I’m suggesting should be driven down. Specific to the idioms bit, driving that down would require reducing the false positive rate, which is more tolerable if you known a bunch of other tools are running, and/or having a “tuning knob” to control that pickiness.
regehr says:

February 7, 2015 at 11:25 am

bcs, I’d agree that threads can be used properly — but in the wild we don’t see this happening too often.
Ben says:

February 7, 2015 at 12:15 pm

@bcs, All the strategies you mentioned for making multithreading more reliable make sense. The trump card argument against threads in my opinion is that most mainstream software these days uses huge chunks of third-party library/framework code. Do you think all of those components are so nicely thread-sanitary? I’ll take the other side of that bet any day of the week.

To mitigate the overhead of processes you can:
1) Open windows of shared memory specifically for the data that really needs to be shared for performance reasons.
2) Be relatively conservative about spawning and killing workers. I think Chrome’s process-per-tab model is a nice existence proof that process overhead is not necessarily insurmountable.
regehr says:

February 7, 2015 at 3:01 pm

make -j4 is another nice example where process-level parallelism works. Both Csmith and C-Reduce make excellent use of processes too.
Mate Soos says:

February 10, 2015 at 10:56 am

Or we could abandon certain languages and go for, e.g. Go, where threads are easy to use. Or Erlang. Or a bunch of others. I feel like this threading issue is fully merged with pthreads. Sure, if we give people grenades without safety pins, they’ll blow themselves up real quick. Let’s not forget that there are grenade types out there that have safety pins, though.
Amy says:

February 10, 2015 at 1:39 pm

Props to Ben. More than double the next-highest concurrency score! A bit of self-indulgence is well-deserved.

(PS I hear the documentation on CodeSonar’s concurrency functionality is also v. good 😉
Ben says:

February 10, 2015 at 2:50 pm

@Mate, If by “pthreads” you mean {pthreads, win32 threads, Java threads, C/C++11 threads, .NET threads, …}, then sure.

Go is a kind of funny example, because its community talks a lot about the CSP model, but the language supports shared memory. See e.g.:
https://golang.org/doc/articles/race_detector.html

So I’m not convinced that goroutines are a huge improvement over regular old threads.

Erlang is kind of a funny example because its community encourages programming with immutable data structures, which certainly has a big impact on concurrency issues. However:
1) It remains to be seen whether all immutable data structures all the time can work well for wide swaths of mainstream software development.
2) Erlang has shared memory too:
http://yarivsblog.blogspot.com/2008/05/erlang-does-have-shared-memory.html

Using processes instead of threads is the strategy for improving the reliability of parallel software that is by far the easiest to adapt to mainstream software engineering practices.

And threads are an even less appropriate choice if you’re trying to make software that’s multitasking, as opposed to parallel. The best choice out there today is coroutines (e.g. generators in JavaScript, or async/await in .NET), though I believe there’s room for improvement.
Jason says:

February 12, 2015 at 11:05 am

I disagree with the benchmark.

Consider how many bugs -Wall or various linting tools can catch, while also flagging all similar looking functions. So long as it is simple to fix the bug-free function to not be flagged, but not simple to fix the buggy one to not be flagged (without also fixing the bug), you have a useful tool.

It may not be useful for large existing codebases, since nobody wants to go through every lint error, but it can be highly useful for new codebases.
regehr says:

February 12, 2015 at 4:18 pm

Jason, we’d need to look at specific examples before I can say how much I agree with you, but I understand what you’re saying. But also, there are clearly warnings that are driven by deeper, more powerful static analyses where eliminating the warning may not be so simple, and in those sorts of cases it is essential to keep the number of false positives low. See the Coverity article for more detail on this.