Counting Compiler Crashes – Embedded in Academia

This is a bit of material from my GCC Summit talk that I thought was worth bringing out in a separate post.

As an experiment, we compiled 1,000,000 randomly generated C programs using GCC 3.[0-4].0, GCC 4.[0-5].0, LLVM-GCC 1.9, LLVM-GCC 2.[0-8], and Clang 2.[6-8]. All compilers targeted x86 and were given the -O0, -O1, -O2, -Os, and -O3 options. We then asked the question: In how many distinct ways can each compiler be made to crash? We count crash bugs by looking for unique “assertion failure” strings in LLVM output and “internal compiler error” strings in GCC output. This count is conservative: a compiler will typically also have a number of crashes due to null pointer dereferences and other memory safety violations, and we don’t try to tell these apart. Here are the three crash messages we got from Clang 2.8:

Statement.cpp:944: void Statement::post_creation_analysis(std::vector<const Fact*, std::allocator<const Fact*> >&, CGContext&) const: Assertion `0' failed.

StatementIf.cpp:81: static StatementIf* StatementIf::make_random(CGContext&): Assertion `ok' failed.

Block.cpp:512: bool Block::find_fixed_point(std::vector<const Fact*, std::allocator<const Fact*> >, std::vector<const Fact*, std::allocator<const Fact*> >&, CGContext&, int&, bool) const: Assertion `0' failed.
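The counting scheme described above can be sketched in a few lines of Python. This is only an illustration, not necessarily the scripts used for the experiment: the function names, the regular expressions, and the sample outputs are all assumptions made for the example, and real compiler messages may be formatted differently.

```python
import re
from collections import Counter

# Assumed message shapes (hypothetical patterns, not taken from the experiment).
ICE_RE = re.compile(r"internal compiler error: (.*)")
ASSERT_RE = re.compile(r"Assertion .* failed")


def crash_signature(output):
    """Return a string identifying the crash in one compiler run, or None."""
    for line in output.splitlines():
        match = ICE_RE.search(line)
        if match:
            # Drop the input-file prefix so different test programs that
            # hit the same bug produce the same signature.
            return match.group(1).strip()
        if ASSERT_RE.search(line):
            # Assertion messages name the compiler's own source location,
            # so the whole line serves as the signature.
            return line.strip()
    return None


def count_crashes(outputs):
    """Map each distinct crash signature to the number of runs that hit it."""
    counts = Counter()
    for output in outputs:
        signature = crash_signature(output)
        if signature is not None:
            counts[signature] += 1
    return counts


# Made-up compiler outputs, for illustration only.
sample_outputs = [
    "t1.c:3:1: internal compiler error: in foo, at bar.c:10",
    "t2.c:7:5: internal compiler error: in foo, at bar.c:10",
    "Foo.cpp:12: void f(): Assertion `ok' failed.",
    "",  # a run that compiled cleanly
]
sample_counts = count_crashes(sample_outputs)
print(len(sample_counts), "distinct crash bugs in",
      sum(sample_counts.values()), "crashing runs")
```

The number of keys in the returned counter is the "distinct crash bugs" figure. A crash that prints neither kind of message, such as a plain segmentation fault, is not counted at all, which is why the count is conservative.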

Here are the GCC results:

And here are the LLVM-GCC results:

The thing I really like about these results is the way they show that the latest versions of GCC and LLVM are very solid. Both teams have put a tremendous amount of work into their tools.

The “bugs fixed” annotations in the graphs refer to fixes to bugs that we found (using random testing) and reported. We had hoped to establish some sort of nice causal link between fixing these bugs and improving the resistance of compilers to crashing, but these graphs are a long way from showing anything like that. There is just a lot going on in each of these projects, and the experiment is too uncontrolled to give us any kind of solid evidence.

Are compiler crashes good or bad? Certainly they’re a pain when you’re just trying to get work done, but overall they’re good. First, they can almost always be worked around by changing the optimization level. Second, a crash is the compiler’s way of failing fast, which is good. If that assertion hadn’t been there, it’s entirely possible the compiler would have run to completion, generating incorrect code.

Update from 11/6/2010: Here are the graphs showing rate of crashing, as opposed to number of distinct crash bugs, that Chris asked about in a comment.

Although it’s not totally clear what to read into these numbers, they do seem to tell a good story. GCC steadily improves during the 3.x series, regresses following the major changes that went into 4.0.0, then (more or less) steadily improves again. Similarly, LLVM starts off with a rather high rate of crashes and improves over time. Both compilers are exceptionally solid in their latest releases (note the log scale: both compilers have reduced their crash rates by at least 3 orders of magnitude).
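To make the difference between the two metrics concrete: the crash rate is the fraction of compiler runs that crash, and an improvement in "orders of magnitude" is the base-10 logarithm of the ratio between two rates. Here is a small Python sketch. The numbers are invented for illustration and are not read off the graphs, and the function names are hypothetical.

```python
import math


def crash_rate(crashing_runs, total_runs):
    """Fraction of compiler invocations that crashed."""
    return crashing_runs / total_runs


def orders_of_magnitude(old_rate, new_rate):
    """How many factors of ten the crash rate dropped (log10 of the ratio).

    Both rates must be positive; a release with no observed crashes
    would need separate handling.
    """
    if old_rate <= 0 or new_rate <= 0:
        raise ValueError("rates must be positive")
    return math.log10(old_rate / new_rate)


# Invented example: 50,000 crashing runs out of 1,000,000 for an old
# release, and 50 out of 1,000,000 for a new one.
old = crash_rate(50_000, 1_000_000)
new = crash_rate(50, 1_000_000)
print(round(orders_of_magnitude(old, new), 2))
```

With these made-up numbers the drop works out to about three orders of magnitude. The same two releases could still show similar counts of distinct crash bugs, since one frequently triggered bug can dominate the rate.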

3 responses to “Counting Compiler Crashes”

Chris Lattner says:

November 5, 2010 at 10:23 pm

This is indeed very interesting. A random thought though: how does the “number of distinct crashes” correspond to the total number of failures you see across the 1M tests? As a silly comparison, you show 27 distinct failures in LLVM 1.9 vs 10 distinct failures with GCC 3.0.0. Is this actually because there are 2.7x as many failing testcases, or is this because LLVM has more distinct assertions than GCC, so you have finer-grained grouping of failure modes? It would be interesting to see # total failures as another bar for each compiler (with its own scale) on the same chart.
regehr says:

November 6, 2010 at 9:13 pm

Hi Chris– I’ll make a new post with these graphs. The problem with the number of failures is that those numbers are extremely dependent on what our generated code looks like. For example, one of the LLVM versions (2.2 maybe) looks really bad by the “crash rate” metric but it’s just because most of our programs contain bitfields (or whatever the triggering feature is, I forget…).
regehr says:

November 6, 2010 at 9:38 pm

Hi Chris- I’ve updated this post instead of making a new one. Would be interested to hear your thoughts on these numbers.