Are Compilers Getting More or Less Reliable?


Earlier this summer, Miod Vallat of the OpenBSD project posted some thoughts about compiler reliability. His thesis is that in the good old days of GCC 2.5 through 2.7, the compiler pretty much just worked and you could trust it to emit correct code. Subsequently, the level of code churn in GCC increased for several reasons:

  • new standards appeared, such as C++98
  • more effort was made to exploit platform-specific performance opportunities
  • more people started working on the compiler

The result is that “gcc had bugs, and you were expected to accept that and cope with them.” Upgrading the compiler is one way to get bugs fixed, but this is a mixed bag since new bugs are also added. What is needed, he argues, are compiler versions with long-term support.

This post is a collection of responses to Miod’s post and also a bit of an attempt to figure out if open-source compilers have really gotten less reliable over the past 20 years.

First of all, to my eye the GCC project already does a good job of supporting its compilers. For example, 4.4.0 was released in April 2009 and 4.4.7 in March 2012: almost three years of bug-fix releases. As a frequent reporter of compiler bugs, I can confirm that it is common for the GCC developers to take a bug report against trunk and backport the fix to at least one of the maintenance branches. So what else is needed here? Longer maintenance periods? More fixes being backported? I’d appreciate hearing from people about this. Miod’s mail indicates that he has some specific complaints:

… gcc developers have been eager to release “bug free” new versions by enforcing a policy that only “regressions” would get fixed, and spending more time changing their definition of “regression” or trying to explain why regressions weren’t, so as not to need to fix them in any stable release.

I would tend to respect the opinions of an OpenBSD port maintainer, but on the other hand my observation is that the GCC people are in fact putting quite a bit of effort into the maintenance branches. Of course one could always argue that they are not doing enough.

The overall context for Miod’s mail is an ongoing discussion in the OpenBSD community about switching over to Clang as the default system compiler. One problem with this idea is that at present, the LLVM project does not have any maintenance branches at all. So realistically this switch would seem to be a step backward in terms of compiler stability. I’m not complaining about the reliability of Clang (see below for more details on this); rather, my group writes code that interfaces with LLVM/Clang and our experience is that these projects move forward very quickly.

The second thing I wanted to say is that I think we can add even more items to Miod’s list of reasons why GCC might be getting buggier over time:

  • GCC is now significantly larger than it was: 1300 KLOC for 4.8.1 vs. 331 KLOC for GCC 2.7.2.3 — that’s a million new lines of code for bugs to hide in, and much of this code is as hairy as you’ll see anywhere
  • the codes being compiled now are significantly larger and more complex than they were 20 years ago during the GCC 2.5 through 2.7 era; large and complex codes are more likely to trigger compiler bugs
  • the scope of compiler optimizations, in terms of how much code each optimization considers, has increased substantially in the last 20 years — this makes it harder to get the optimizations right
  • automatically generating code that uses SIMD instructions is tricky, and we do a lot more of that now
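
To make that last point concrete, here is the kind of loop (a hand-written example, not Csmith output) that a modern auto-vectorizer turns into SIMD code; the alignment checks, unrolled vector body, and scalar tail the compiler must generate are exactly the places where vectorizer bugs hide:

void saxpy(int n, float a, float *restrict y, const float *restrict x) {
  /* gcc -O3 or clang -O2 will typically compile this loop using SSE or
     AVX instructions; the compiler must also emit a correct scalar
     cleanup loop for when n is not a multiple of the vector width */
  for (int i = 0; i < n; i++)
    y[i] = a * x[i] + y[i];
}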

After writing this list I feel like we’re lucky to have compilers that work at all. But are there any trends indicating that GCC should be getting more reliable? I think so:

  • vastly improved compiler warnings, static analyzers, and dynamic tools such as Valgrind support easy detection of some kinds of bugs that used to be difficult
  • GCC’s move to an SSA-based internal representation, which caused many bugs at first, has permitted optimizations to be cleaned up
  • GCC’s test suite has gotten better
  • GCC attempts to fail rapidly when something goes wrong by using assertions much more heavily than it used to; I count about 12,000 assertions now, compared to fewer than 500 in 2.7.2.3 (these counts are very rough, using grep)
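
A minimal sketch of that last point (the IR data structure and names here are invented for illustration, not GCC’s): each pass checks its invariants eagerly, so corrupt internal state crashes the compiler instead of quietly flowing into miscompiled output.

#include <assert.h>

enum opcode { OP_CONST, OP_ADD };

struct node {
  enum opcode op;
  int value;              /* meaningful only for OP_CONST */
  struct node *lhs, *rhs; /* meaningful only for OP_ADD */
};

static int fold(const struct node *n) {
  assert(n);                     /* fail fast on a null IR node */
  switch (n->op) {
  case OP_CONST:
    return n->value;
  case OP_ADD:
    assert(n->lhs && n->rhs);    /* an add must have two operands */
    return fold(n->lhs) + fold(n->rhs);
  }
  assert(!"unhandled opcode");   /* unknown op: crash, don't guess */
  return 0;
}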

So now we’re ready to tackle the original question: Was GCC 2.7 really more reliable than our modern open-source compilers? Answering this kind of question is hard and in fact I don’t think anyone — in industry or in academia — has published quantitative information comparing the reliability of various compilers. It just isn’t done. My method (which has some obvious shortcomings that I’ll discuss later) was to take three compilers — GCC 2.7.2.3 for x86, a recent Clang snapshot for x86-64, and a recent GCC snapshot for x86-64 — and use them to compile 100,000 random programs generated by Csmith. Each program was compiled at -O0, -O1, -O2, -Os (if supported — GCC 2.7 did not have this flag yet), and one of -O3 or -Ofast. If any compiler crashed or failed to terminate within three minutes, this was noted. Then the compiled programs were run and their checksums were compared. These checksums are a feature of programs generated by Csmith; they are computed by looking at the values in all global variables just before the randomly-generated program terminates. The intent of the checksum is to support easy differential testing where we don’t know the correct result for a test case but we do know that the result is not supposed to change across compilers or optimization options. If changing the optimization level changed a checksum, the compiler was considered to have emitted incorrect code. For more details see the Csmith paper.
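
To make the checksum idea concrete, here is a hand-written sketch (not actual Csmith output; the simple mixing function stands in for the CRC computation that Csmith really emits):

#include <stdio.h>

/* stand-ins for the random program's global variables */
static int g_a = 1, g_b = -7, g_c = 42;

int main(void) {
  /* the randomly generated computation would go here, mutating the globals */
  g_a = g_b * g_c + g_a;

  /* just before exit, hash every global and print the result; if this
     output changes across compilers or optimization levels, someone
     miscompiled the program */
  unsigned crc = 0;
  crc = (crc ^ (unsigned)g_a) * 16777619u;
  crc = (crc ^ (unsigned)g_b) * 16777619u;
  crc = (crc ^ (unsigned)g_c) * 16777619u;
  printf("checksum = %X\n", crc);
  return 0;
}

Because Csmith is engineered to avoid undefined and unspecified behavior, a checksum difference points at a compiler bug rather than at the test program.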

Here are the results:

              crashes  timeouts  wrong code
GCC 2.7.2.3     13866        68         239
GCC r202341         4         0           3
Clang r190166       0         0           7

Here’s an example of a program that, when compiled by GCC 2.7.2.3 at -O2, causes the compiler to run for more than three minutes (in fact, it runs for more than 60 minutes — at which point I got bored and killed it):

int x0;
int x1() {
  for (;;) {
    x0 = 0;
    for (; x0;) {
      int x2 = 1;
      if (x2) return 0;
    }
  }
}

I didn’t investigate the crash bugs or wrong-code bugs in GCC 2.7; these are probably of very little interest now. It’s impressive that Clang never crashed in 100,000 test cases. The four crashes found in the current GCC during this testing run were all variants of this known problem; a patch exists, but it hasn’t been applied to trunk yet. The instances of incorrect code generation in Clang were due to previously unknown bugs that I reported as PR 17158 and PR 17179, which are triggered respectively by the following programs.

PR 17158:

int printf(const char *, ...);
struct x0 {
  int x1;
  char x2;
};
struct x3 {
  struct x0 x4;
  char x5;
} x6;
int main() {
  struct x3 x7 = { { 0, 0 }, 1 };
  x6 = x7;
  x7.x4.x2;
  printf("%d\n", x6.x5);
  return 0;
}

PR 17179:

int printf(const char *, ...);
int x0, x1, x2;
int main (void) {
  int x3;
  for (x1 = 0; x1 != 52; x1++) {
    for (x2 = 0; x2 <= 0; x2 = 1) {
      int x4 = x0 ^ x1;
      x3 = x4;
    }
  }
  printf("%d\n", x3);
  return 0;
}

The GCC miscompilations were also due to previously unknown bugs; I reported them as PR 58364, PR 58365, and PR 58385. Here are the two shorter ones.

PR 58364:

int printf(const char *, ...);
int x3 (int x4) { 
  return x4 < 0 ? 1 : x4; 
}
int x0 = 1, x1, x2;
int main (void) {
  if (x3(x0 > x2 == (x1 = 0))) {
    printf("x\n");
  }
  return 0;
}

PR 58385:

int printf(const char *, ...);
int x0, x1 = 1;
int x2() {
  x1 = 0;
  return 0;
}
int main() {
  ((0 || x0) & x2() >= 0) <= 1 && 1;
  printf("%d\n", x1);
  return 0;
}

The three new wrong-code bugs in GCC were all fixed within a few days of being reported. So, I ran another 100,000 random programs against an updated snapshot and here are the results:

              crashes  timeouts  wrong code
GCC r202599         9         0           0

Out of the nine crashes, six were due to the PR 57393 problem mentioned above and the other three were due to null pointer dereference issues. The GCC bug database already contains a number of open bugs for segfaults in the compiler, so I’m not in any big rush to report these.

I did not run another Clang test since neither of the new bugs in that compiler has been fixed yet.

Let’s get back to the original question: Has the reliability of open source compilers decreased since GCC 2.7? This does not appear to be the case. In fact, this experiment indicates that the opposite might be true. Of course we need to be careful in interpreting these results, since randomly generated C programs are not necessarily representative: they could trigger bugs that no real C program triggers, while also missing all of the bugs that Miod is complaining about. On the other hand, take a look at the C programs above that trigger the new wrong-code bugs and try to convince yourself that these patterns will never occur during compilation of real code. A different source of bias in these results is that GCC 2.7 was never tested with Csmith, but the current versions of GCC and Clang have been subjected to extensive random testing.

Getting back to Miod’s point about the need for LTS versions of compilers, I completely agree: this should be done. Five or even ten years of support would not be too long, especially in the embedded systems world where old compilers often persist long past their expire-by date. I suspect the main problem is that compiler maintenance is too difficult and unsexy to attract many volunteers. Of course people could be paid to do it, but as far as I know nobody has stepped up to fund this sort of effort.

Miod’s mail received some comments at Hacker News.

Updates from Sep 16:

  • good reading: “FreeBSD 10 Alpha Built With clang”
  • I added a bit of explanation of Csmith’s checksumming strategy
  • I added the test case text for four of the five new wrong-code bugs just because I think they’re interesting

20 responses to “Are Compilers Getting More or Less Reliable?”

  1. On the topic of csmith results matching the real world: How hard would it be to take a selection of bug reports from the “wild” and check if csmith would/could ever produce them?

    It seems to me that it might be worthwhile running csmith in 3+ modes:

    – the current mode, tuned (IIRC) to maximize some combination of coverage and bug provocation.
    – a mode tuned to mimic the structures and usage of a number of real world code bases.
    – a mode tuned to mimic other people’s bug report cases.

    Ideally the last two would be corpus driven so as to reuse work and also so that groups in industry can tune to match their own code bases.

  2. What about the quality of the commercial compilers? My impression is that VC++ is constrained by extremely strict standards of reliability because it can never, ever break Windows, Office, and SQL Server, which are huge and unorthodox codebases.

    They even had to make static data folding in the linker opt-in because they found a single internal case where that would break a program (that program was illegal, but still). http://blogs.msdn.com/b/vcblog/archive/2013/09/11/introducing-gw-compiler-switch.aspx The CLR was relying on compiled methods having different function pointers.
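
    (An illustrative sketch, not the actual CLR case: function-level folding of byte-identical code, e.g. via the MSVC linker’s /OPT:ICF, can make distinct function pointers compare equal, which the C standard forbids.)

    #include <stdio.h>

    int f(int x) { return x + 1; }
    int g(int x) { return x + 1; }

    int main(void) {
      /* with identical-code folding enabled, this can print "folded",
         even though pointers to distinct functions must compare unequal */
      printf("%s\n", (f == g) ? "folded" : "distinct");
      return 0;
    }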

  3. Hi tobi, we’ve found bugs in every commercial compiler we’ve tested.

    I got started on this entire line of research due to how crappy some expensive embedded compilers were. They were MUCH worse than any 2.x, 3.x, or 4.x version of GCC that I’ve used.

    MSVC is quite good, though we did find bugs. We just never tested it very much since none of us like writing test automation code that works on Windows.

    Intel CC is probably the most reliable compiler we’ve tested, but again we found bugs. I think I heard that they do some random testing internally. We never tested it very much since we didn’t get a lot of feedback from the compiler team.

    In general, to make this kind of testing work really well, you need access to development versions of the compilers, so we’ve focused nearly 100% of our effort on open source tools. These are the tools that I want to improve anyway.

    That /Gw switch is pretty funny!

  4. Hi bcs, we’ve thought about this sort of thing, but it’s really hard to know what characteristics of actual source code we should mimic. Whatever characteristics we choose, we can’t prove that they are the ones that matter for bug-triggering purposes.

    So anyway this is one of those things that’s fun to think about but so far I haven’t been convinced that I want to work on it…

  5. You raise an interesting point, which got me thinking: wouldn’t a study of which characteristics are most important be interesting in and of itself? Particularly if it’s different for different compilers.

    I’d think for that study, your existing set of bug reports might make a better corpus than real world code as you can compare its generation rate to the original generator’s baseline.

  6. A minor suggestion: as a non-C++ developer, this part threw me:

    “Then the compiled programs were run and their checksums were compared. If changing the optimization level changed a checksum, the compiler was considered to have emitted incorrect code”

    Not being familiar with Csmith, I initially read that as comparing the checksums of the output binaries, which of course doesn’t make sense since you’d expect those to change. You might add an explanation that Csmith generates programs that output a checksum to verify functional correctness, to make this article accessible to a wider technical audience.

  7. Excellent read.

    Although it would require (far) more work, testing less widely used architectures would make for a good comparison. From OpenBSD experience, gcc bugs on i386 are relatively rare. On m68k, etc., optimization bugs are much more frequent. The older or less common CPUs often have bizarre register constraints that make for extra trouble. (I know you mentioned getting started with embedded compilers; how does gcc targeting these embedded platforms compare?)

    From an OS developer perspective, a crash is relatively easy to deal with: dial down to -O1 or -O0 if necessary and move on. The misoptimizations are the pernicious bugs that drive developers crazy looking for source bugs that don’t exist. (I note that gcc 2.7 crashed far more often than it generated bad code, making it “better” in relative ratio terms.)

    (Another point of contention Miod didn’t mention, but which I think fits in with your research, is the increasingly aggressive optimizer behavior with respect to undefined behaviors such as strict aliasing and type punning. For better or worse, there’s a lot of “working” code that makes “C is portable assembler” type assumptions, and having it break in subtle ways whenever the compiler gets upgraded is also bad.)
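
    (For concreteness, a sketch of the kind of code I mean; it is undefined under strict aliasing, and an optimizer that assumes an unsigned store cannot modify a float may print the stale value at -O2 while -O0 prints the new one:)

    #include <stdio.h>

    int main(void) {
      float f = 2.0f;
      unsigned *p = (unsigned *)&f; /* punning through the wrong type: UB;
                                       also assumes 32-bit unsigned and
                                       IEEE single-precision float */
      *p ^= 0x80000000u;            /* flip the IEEE sign bit "in place" */
      printf("%f\n", f);            /* -2.0 naively, but the compiler may
                                       legally assume f was never touched */
      return 0;
    }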

  8. Hi tedu, it’s definitely true that miscompilations are the worst case. Both GCC and LLVM have had good success in making their tools crash rapidly by putting in lots of assertions, and hopefully that trend will continue.

    The problems caused by compilers exploiting undefined behaviors are really troublesome since there’s no end in sight. We seem to have abandoned the “portable assembler” option; the only other solution I can see is improving tools such as Clang’s “-fsanitize” family of checkers. It’s not clear how useful these will be to kernel developers, unfortunately.

    We’ve done some testing of compilers for architectures other than x86 and x64 and certainly you’re correct that they are buggier! Also, this isn’t really very difficult as long as some sort of emulator is available. However, I decided to focus less on that kind of testing since people aren’t always very eager to fix bugs that don’t affect that many people. Maybe sometime when x86 and x86-64 are free of bugs that Csmith can find, it will be time to start beating on ARM and some other architectures. I don’t even want to test m68k…

  9. I wonder if the real issue Miod is running into is that the C language is losing its ability to be a “portable assembler” language as the various compilers become smarter and more able to exploit the hardware through global analysis. For those of us who must develop on multiple different platforms but need, for example, the ability to force a function to have a unique memory address (such as the CLR issue referenced above), the evolution in recent years of compilers that take very significant advantage of understanding large amounts of the program simply makes this task more difficult, and the various techniques for forcing the compiler into behaving less intelligently have become substantially less reliable.

  10. I’m a bit surprised that fuzz-testing compilers with randomly generated code is not done on a regular basis; I thought this was kind of a bare minimum. In fact, using only one fuzz-tester is usually not enough, since in my experience it typically doesn’t give enough variability of input. Many fuzzers by different people (it matters who writes them, as different people think about the problem differently and hence create fuzzers that stress different parts of the system) could do an even better job, I’m sure.

    I personally use fuzz-testing for my SAT solver, and different fuzzers really make a huge difference. If one fuzzer can no longer crash/fault the system even after days of running, another fuzzer typically can crash it under a couple of minutes.

  11. Hi Mate, of course I agree that compilers should be fuzzed regularly and that multiple fuzzers are always complementary even if we don’t always understand exactly why :).

    My guess is that the GCC and LLVM people just have their hands full at all times. They see the value of fuzzing, and they are generally happy to fix bugs that come from fuzzers, but they are just too busy to go off and dig up some fuzzers and run them.

    Something I’ve been planning to do for a while is to make an unattended fuzzing farm for GCC and LLVM and just let them run continuously. Then, every day or every week they can send me a mail letting me know about any new problems that have been discovered. You can imagine doing something similar with a collection of SAT solvers…

  12. Hi dcw, I expect that there is some of that going on. In fact, Miod’s comment about compiler developers “trying to explain why regressions weren’t” sounds precisely like a disagreement about what the compiler’s obligations are with respect to a particular piece of code. In general, developers (and especially kernel developers) are writing programs for a slightly different language than the one the compiler developers are implementing. As you say, this drift has become quite a bit more serious in the last 10 years. This is a real problem, and as long as the compiler developers are committed to maximizing performance of generated code, there aren’t any really good solutions other than writing static and dynamic checkers for the behaviors that are no longer guaranteed.

    One thing that I haven’t done yet is to go back and look at all bugs that Miod has reported to the GCC project and look at the resolution of each one. I suspect this would clarify the situation.

  13. Would it be interesting to gather up several different compilers’ regression test suites and cross-test each compiler against the others’ regression tests?

  14. Hi Charles, it might be, but on the other hand some tests for each compiler are testing extensions that are not implemented in other compilers, so it might be a lot of work to figure out which of the test cases are supposed to pass for the other compiler.

    On the other hand, I know of some non-GCC projects that have used GCC’s torture tests, but I imagine this only makes sense when GCC compatibility is the goal, as opposed to C99 or whatever.

  15. You could almost certainly test other architectures fairly easily by integrating QEMU into your test harnesses.

  16. Hi kme, that’s right: qemu plus binfmt_misc makes it super easy to test compilers for some architectures. For others we have to crank up an emulator. This is one of the benefits of Csmith emitting a checksum: for a random embedded processor without a good I/O library, we just stash the checksum in a register and then ask for a register dump after the program has terminated.
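
    A hypothetical sketch of that trick (the register choice and asm syntax are illustrative, ARM here; the details vary by target and harness):

    #include <stdint.h>

    /* no usable printf on this target: park the checksum in a fixed
       register and spin, then have the emulator dump the registers */
    void report_checksum(uint32_t crc) {
      __asm__ volatile("mov r4, %0" : : "r"(crc) : "r4");
      for (;;)
        ; /* hang here; the test harness now reads r4 from a register dump */
    }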

  17. It would also be interesting to give QEMU a shot with x86-to-x86 translation, and see whether it gives any false positives. ARM is generally a well-maintained QEMU target, but the same is probably not true of all the others.

  18. Hi Paolo, that’s a good idea, though I’ve found that Csmith + a compiler results in a pretty bad random tester for emulators, simulators, etc. There’s something that happens during compilation where the important randomness is lost, if that makes sense. To effectively test QEMU and related tools, a customized machine code fuzzer is needed, for example see here:

    http://roberto.greyhats.it/pubs/issta09.pdf