Before a tool such as a compiler is used as a critical component in an important software project, we’d like to know that the tool is suitable for its intended use. This is particularly important for embedded systems where the compiler is unlikely to be as thoroughly tested as a desktop compiler and where the project is very likely to be locked in to a specific compiler (and even a specific version, and a specific set of optimization flags).
Turns out, it’s hard to certify a compiler in any meaningful way. In practice, the most popular way to do it is to purchase a commercial test suite and then to advertise that it was used. For example, the Keil compiler validation and verification page mentions that the Plum Hall C validation suite has been used. This is all good but it’s sort of a minimal kind of certification in the sense that a highly broken compiler can pass this (or any) fixed set of tests. At some level, I think these validation suites are mainly used for two purposes. First, I’m sure they catch a lot of interesting edge-case bugs not caught by vanilla code. Second, using the test suite serves as an advertisement: “We care about details; we have enough money to buy the expensive tests, and we have enough engineers to fix the resulting issues.”
A random test case generator like Csmith can serve as a useful complement to fixed test suites. Whereas Plum Hall’s tests were written by hand and are primarily intended to check standards conformance, a properly designed random tester will deeply exercise internal code generation logic that may not be as strong as it should be.
If Csmith were used as a certification tool, a compiler vendor would get to advertise a fact like “Our compiler successfully translates 1,000,000 tests generated by Csmith 2.1.” This is a high standard; my guess is that no existing C compiler other than CompCert would meet it. Actually, CompCert would fail as well, but in a different sense: it does not handle the entire subset of C that Csmith generates by default.
For several years I’ve gently pushed the idea of Csmith as a compiler certification tool and have gotten some pushback. Random testing makes people uncomfortable just because it is random. For example, it may suddenly find a bug that has been latent for a long time. Traditional test suites, of course, will never do this—they only find latent bugs if you add new test cases (or if you modify the system in a way that makes the bug easier to trigger). People want to avoid unnecessary surprises near the end of a release cycle. This objection, though legitimate, is easy to fix: we can simply specify a starting PRNG seed and then the suite of 1,000,000 random programs suddenly becomes fixed. It retains its stress-testing power but will no longer surprise.
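To make this concrete, here is a minimal sketch of how such a fixed suite might be generated. It assumes Csmith accepts --seed and -o options (check csmith --help for the exact spelling in your version) and that csmith is on the PATH:

```python
# Generate a reproducible suite of Csmith programs from a fixed starting seed.
# Because the seed determines the program, the same seeds regenerate the same
# suite (for a given Csmith version).
import subprocess

def generate_suite(num_tests, start_seed=1, prefix="fixed_suite"):
    for seed in range(start_seed, start_seed + num_tests):
        subprocess.run(
            ["csmith", "--seed", str(seed), "-o", f"{prefix}_{seed}.c"],
            check=True,
        )

if __name__ == "__main__":
    generate_suite(100)  # scale up to 1,000,000 for the real thing
```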
I’ve also received some non-technical pushback about Csmith that I don’t understand as well since people don’t tend to want to articulate it. My inference is that Csmith is a bit threatening since it represents an open-ended time sink for engineers. It’s hard to know ahead of time how deep the rabbit hole of compiler bugs goes. People would rather use their valuable engineers to make customers happy in a more direct fashion, for example supporting new targets or implementing new optimizations.
My hope has been that one high-profile compiler company will take the time to make (for example) 1,000,000 test cases work, and then advertise this fact. At that point, other companies would sense a certification gap and would then be motivated to also use Csmith in a high-profile way. So far it hasn’t happened.
The way that Csmith has found its way into compiler companies is not by top-down fiat (“Use this tool!”) but rather from the bottom up. Compiler engineers—probably the people most concerned with compiler quality—have run it probably mainly out of curiosity at first, have found compiler bugs, and then have kept using it. Unfortunately, due to the nature of this bottom-up process I generally get only indirect, informal indications that it is happening at all.
32 responses to “Certifying Compilers Using Random Testing”
This year’s ICFP has a paper “Experience Report – a Do-It-Yourself High-Assurance Compiler” that explores similar themes:
http://www.cs.indiana.edu/~lepike/pubs/pike-icfp12.pdf
Here’s the talk:
http://www.youtube.com/watch?v=7zXhP--9axQ
Thanks Robby. Love the organization of ICFP content at Youtube. I need to watch Peter’s keynote as well.
Peter’s talk was excellent. Malcolm Wallace does those videos, and we owe him big time.
Is there a way to make Csmith only generate meaningful bugs (in the sense that a customer might stumble upon them)? That might increase adoption.
Or are most bugs equally meaningful in the sense that they are customer-facing problems?
As a compiler user I certainly don’t want my vendor to produce a perfect compiler. That would be a waste of time. I want it to be high quality, but not perfect. I’d rather have new features. I can work around the occasional bug every 6 months.
Re: “only generate meaningful bugs.”
One could imagine changing Csmith to be driven more by dialects and idioms. For example, instead of letting Csmith generate loops with random loop bounds, one could cause it to generate more idiomatic “for (0..N)” loops. Instead of generating random control flows, restrict it to structured control flows. Etc.
I think that this idea is worth exploring. I don’t know if it will generate more “meaningful” bugs or not.
I don’t even know if the following question makes sense, but anyway:
How do you know that Csmith is “correct”? In other words, say you find a bug with Csmith, how do you know it is the compiler’s fault, instead of Csmith’s fault?
After all, Csmith is a C++ program, compiled by a compiler, perhaps the same compiler that you’re trying to test.
John: We should trademark names for a few of these fixed-seed test suites, so that we can cash in.
“Validated by the Csmith 0xDEADC0DE Megatest(TM).”
Hi Tobi, as you might expect “meaningful bugs” is very hard to characterize. For example, let’s say we go with Eric’s suggestions and only generate code that looks like what people write. Then, someone will go and generate C code from Matlab or SCADE and it will get miscompiled.
My personal feeling is that most of the bugs we find could be triggered by real code, but I cannot prove this.
Anyway, your larger point, that the irrelevance of random bugs may be an argument against Csmith, is a good one. On the other hand, real compiler developers have shown a remarkable willingness to fix bugs found by Csmith. This is a tricky issue, but I think most right-thinking developers will just fix a bug regardless of where it comes from.
Alejandro, we do indeed sometimes find Csmith bugs, but they are greatly outnumbered by compiler bugs. This isn’t because we’re smarter but rather because Csmith is much smaller than any real compiler.
The process is to take a maybe-bug, reduce it using C-Reduce, and then use the C standard (or a tool based on the standard) to decide if the resulting program is valid or invalid. Either way, someone has a bug to fix.
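For the wrong-code case, the reduction step is driven by an “interestingness” test. Below is a minimal sketch; the compilers, flags, and file name are placeholders, and it deliberately omits the validity check mentioned above (in practice you also need to confirm the reduced program has no undefined behavior, e.g. with a checking tool, or C-Reduce will happily reduce you into invalid code):

```python
#!/usr/bin/env python3
# Sketch of a C-Reduce interestingness test for a wrong-code bug: exit 0 iff
# the two compilers still accept the program and disagree about its output.
# Invoke as: creduce ./interesting.py small.c
import subprocess
import sys

FILE = "small.c"   # C-Reduce runs this script in a directory containing FILE
TIMEOUT = 10       # seconds; reduced programs sometimes stop terminating

def output_of(cc, exe):
    """Compile FILE with cc and run it; return stdout, or None on failure."""
    try:
        build = subprocess.run([cc, "-O2", "-o", exe, FILE],
                               capture_output=True, timeout=TIMEOUT)
        if build.returncode != 0:
            return None
        run = subprocess.run(["./" + exe], capture_output=True, timeout=TIMEOUT)
        return run.stdout if run.returncode == 0 else None
    except subprocess.TimeoutExpired:
        return None

out_a = output_of("gcc", "a.gcc")
out_b = output_of("clang", "a.clang")
sys.exit(0 if out_a is not None and out_b is not None and out_a != out_b else 1)
```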
Eric, +1.
@Alejandro: You are right—an apparent compiler bug might not be a bug in the compiler at all, but a problem with the test case generated by Csmith.
Csmith goes to great lengths to generate test programs that are valid. Our published PLDI papers describe the techniques that we use.
But of course there could be (are) bugs in Csmith itself that would cause it to generate invalid test cases. In practice, we discover these when we examine the test case.
John will say more, I’m sure. I have to catch a ride…
Why bother with fixed seeds? Just publish every test case you know of that has ever tripped up any compiler (including hand reduced cases and maybe even the steps during the automatic reduction runs) and call that “CSmith Level 0”.
If someone wants more, start publishing other test cases (filtered to terminating cases) and call it “CSmith Level 1” with 10^1 times as many cases, “Level 2” with 10^2 times as many, etc.
…. and if you do start publishing them, charge just enough (via the honor system) to pay for the hosting.
Publicly ‘certify’ a couple open-source compilers, and the proprietary vendors will then have a competitive deficit to make up. How many tests can Clang or GCC pass right now? If you keep a site with a running counter for each, that pretty publicly challenges others to ‘do better’. I’m imagining a graph output with “bugs found” on the X axis, and “cumulative tests run” on the Y axis, with a series for each compiler/version that you run. A higher line means a more robust compiler.
We’ve used CSmith to find bugs in an LLVM-based compiler of ours; indeed we found a few, one of our major fixes being disabling one of LLVM’s platform-independent optimization passes (in that compiler, we’re stuck at LLVM 2.4, so I think the LLVM bug is of no general interest).
We’re rather far from passing 1M CSmith tests, and we’ve stopped chasing bugs with CSmith for the moment. The reason is that we think of compiler bugs as a development speed problem more than a correctness problem. That’s because every program compiled with this compiler is also compiled with at least two other compilers, and the results of all builds are routinely and automatically compared.
So basically we believe that most bugs are developers’ bugs and a few are compiler bugs. Either way, bugs are ultimately found if and only if they manifest themselves on our test data, so what matters most for correctness is collecting and using sufficient test data.
Compiler bugs, however, are worse than developers’ bugs not in their effect on correctness, but in their effect on development speed. A developer might spend a lot of time chasing a compiler bug; it could happen to more than one developer, especially since a compiler bug may not be reported once a workaround is found; and the developer will generally lose confidence and feel uncomfortable if frequently exposed to compiler bugs.
So we want to have as few compiler bugs as possible, but with this worldview, we start seeing diminishing returns pretty early.
I should mention that we aren’t a compiler vendor widely marketing tools for third-party platforms, so this is another factor (in addition to having automated verification ingrained rather deeply into the tools and counting on it for correctness more than on anything else).
Regarding the “open-ended time sink” problem – I think C-Reduce would help a lot with that since it is the manual reduction process that seems most painful to us. We haven’t integrated C-Reduce yet and we should.
Many thanks for making these tools available!
I fail to see what is so unnecessary or even undesirable about finding failures that point to latent and previously unknown faults in your compiler late in the development cycle. I completely agree that the earlier a fault is found the better. However, when left with the choice of finding it late or not finding it at all, I would always pick the first.
Whether or not the fault should be fixed is an entirely different question, which would depend on the severity of the fault compared to the cost of fixing it.
Would using fixed-seed PRNGs in this manner mean that even minor changes to Csmith that changed the order in which random numbers are consumed would result in completely different sets of generated test cases?
We would be tied down to a particular version of Csmith, in the sense that if compiler vendors started to care about the test cases generated by a given version and seed, they would have very little incentive to switch to a higher-quality suite generated by a subsequent version of Csmith.
Hi Yossi, I agree that (at least in many cases) compiler bugs mainly add development overhead. This is particularly true for compiler crashes, which happen pretty often due to compilers’ liberal use of assertions. The exception would be a project like the Linux kernel that is always built using GCC in practice. For example, a few years ago the Linux people ran into the problem of the compiler deleting null checks. This wasn’t a compiler bug, but it’s a very similar issue (the kind of thing that could easily happen due to compiler bugs) and it resulted in exploitable vulnerabilities. They have also had compiler bug problems, I believe, but I don’t have a reference handy.
I like bcs’ idea, also. Why not have a repository (call it the “Refined Doom Csmith Core Tests” to follow Eric’s idea) of a minimized version of every test case Csmith has ever produced that exposed a valid compiler bug. At JPL, testing file systems, we built something like that — the “regression of doom” — it ran in about 40 seconds and had source coverage considerably better than a day’s worth of random testing.
The caveat is that a fixed set of “this has crashed someone” bugs is not, in my experience, that great at finding new bugs. It will do a lot for new compilers, but for an existing compiler it won’t work as well as new random tests, probably. There’s a bit about the same issue for regressions in manual testing in Whittaker’s Exploratory Testing book.
Hi Li, yes: even minor Csmith changes will tend to change the test cases being generated.
Maybe we should (as BCS suggests) simply distribute Csmith programs instead of encouraging people to use Csmith. One nice advantage of this is we can choose the most interesting programs to distribute, for example those that result in good branch coverage of common compilers.
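As a sketch, that selection could just be a greedy set-cover loop over per-test coverage. branch_coverage() below is a placeholder that would have to be wired up to gcov or llvm-cov data from an instrumented compiler build:

```python
# Greedily pick the tests that add the most previously-unseen branch coverage.
def branch_coverage(test_file):
    """Placeholder: return the set of compiler branches exercised when an
    instrumented compiler build compiles test_file (e.g. parsed from gcov)."""
    raise NotImplementedError

def select_tests(test_files, budget):
    coverage = {t: branch_coverage(t) for t in test_files}
    covered, chosen = set(), []
    remaining = list(test_files)
    for _ in range(budget):
        if not remaining:
            break
        best = max(remaining, key=lambda t: len(coverage[t] - covered))
        if not (coverage[best] - covered):
            break  # nothing left adds new coverage
        chosen.append(best)
        covered |= coverage[best]
        remaining.remove(best)
    return chosen
```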
Hi Phil, the really inflammatory thing to do would be to post “correctness benchmark results” for GCC, LLVM, and a bunch of commercial compilers. If the commercial tools look bad, hopefully people would notice. I have not done this because I’m not that interested in annoying the compiler companies any more than I have to, and also I might not even be on solid legal ground. That is, if the EULA prevents posting benchmark results, am I prohibited from posting reliability information? Hard to say.
Alex and BCS, the reduced test cases lose a lot of the stress testing power that is originally present in Csmith output. Of course we could save the non-reduced tests. This is probably worth doing but on the other hand it’s so easy to generate fresh tests that I do that in practice…
Yeah, that’s my point about the Whittaker/etc. bit. The reduced tests may have very good total coverage, but they are probably mostly useful as a shakedown for new compilers. I wouldn’t consider them much of a certification, but they are pre-simplified, etc. for use on a compiler that’s never had to man up to random testing before.
They were probably more useful for file systems than for compilers — the big win with them at JPL was that file systems tend to have some bugs that “reopen” over and over, and may take a while to detect. Most common scenario: a change made to optimize performance or to save flash writes results in rename being broken (the normal state of rename is “broken” until late in development). The “bottled death” regression would find lots of rename breakage before a commit, in 40-50 seconds, while fresh random tests might take an hour or two to find that bug. Do compilers have recurring issues like that that have a decent chance of being exposed? Are those probably already handled by the commercial validation suites or gcc regressions? Probably.
I think Csmith could have a hand in coming up with candidates for a “suite of tests for C/C++”. There are organizational hurdles to getting a standardized set of tests for any language, but it seems that C/C++ need such a suite more urgently than others, since there are so many vagaries and special cases. A very large set of microprograms would help guard against potential misinterpretations of the specifications for these languages. Presumably each test could be justified through a formal chain of reasoning from the language spec. Doing so for randomly generated programs could be hard, though. Maybe combine Csmith with CompCert and some formal description of the spec?
I agree that any static set should consist mostly of un-reduced cases. OTOH, also including the reduced set would provide a sort of “low-hanging fruit first” opportunity.
Maybe part of the pushback against random tests is that regressions in a compiler are much more interesting/important than code that has always been wrongly handled? No compiler vendor wants to ship a new minor version only to find that some important customer’s major application that’s been in the field for years is miscompiled now, but if you’ve always miscompiled some corner case then you know not many people have run into it or you’d have got bug reports…
…on the other hand, a wrapper script that said “only report failures which did not fail on the most recent released version” would be pretty trivial, I guess. I think if I were trying to add Csmith to the testing regime for an existing compiler that’s where I’d start.
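Something like this, perhaps; the compiler paths are placeholders, and it only catches compile-time failures (wrong-code bugs would need a run-and-compare step on top):

```python
# Only report Csmith-generated test cases that the most recent released
# compiler handles but the version under test does not.
import subprocess

OLD_CC = "/opt/release/bin/cc"   # last released version (placeholder path)
NEW_CC = "/opt/trunk/bin/cc"     # version under test (placeholder path)
TIMEOUT = 30

def compiles_ok(cc, src):
    try:
        r = subprocess.run([cc, "-O2", "-c", src, "-o", "/dev/null"],
                           capture_output=True, timeout=TIMEOUT)
        return r.returncode == 0
    except subprocess.TimeoutExpired:
        return False

def is_new_regression(src):
    return compiles_ok(OLD_CC, src) and not compiles_ok(NEW_CC, src)
```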
Hi pm215, you have read our minds, it looks like. Right now we’re working on a “bug disambiguator” that takes a pile of test cases that trigger failures and attempts to narrow it down to a minimal subset that triggers only previously unknown bugs. As you might expect, this is inexact and heuristic.
This is part of my long-term project to turn Csmith (or any other random tester) into a turnkey package that only tells you things you actually need to know about the system being tested.
Another possible twist on Csmith as a turnkey package is to examine some code-base, and then generate tests focused on the constructs seen in the codebase of interest. That way, a team working on some embedded project could get testing for their compiler tailored to the code they’re going to compile with it.
Phil, I think that approach has a lot of merit. Something a lot like what you suggest has been tried recently and successfully for Javascript:
http://www.st.cs.uni-saarland.de/~kim/2012/04/27/fuzzing-with-code-fragments-usenix-security-2012/
I believe their current work is extending this to work for C code; we will have to wait and see how the results look!
Regarding matching an existing code base: given an “accent vector”, how hard would it be to make Csmith write code with the same “accent” as that code base, some sort of Markov-chain-like thing? Maybe treat the X ancestors and preceding siblings in the AST as the context. I’d bet LLVM could be used to do the corpus processing.
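As a rough illustration of the corpus-processing half (using pycparser here rather than LLVM, and assuming the sources can be preprocessed by the system cpp, since pycparser can’t handle arbitrary headers), an “accent vector” could start out as just a frequency table of parent/child AST node kinds:

```python
# Count (parent construct, child construct) pairs over a corpus of C files.
# The resulting frequency table could then bias a generator toward the
# corpus's idioms. Requires pycparser.
import sys
from collections import Counter

from pycparser import parse_file

def walk(node, parent_kind, counts):
    kind = type(node).__name__
    counts[(parent_kind, kind)] += 1
    for _name, child in node.children():
        walk(child, kind, counts)

def accent_vector(filenames):
    counts = Counter()
    for fn in filenames:
        ast = parse_file(fn, use_cpp=True)  # preprocess, then parse
        walk(ast, "<root>", counts)
    total = sum(counts.values())
    return {pair: n / total for pair, n in counts.items()}

if __name__ == "__main__":
    vec = accent_vector(sys.argv[1:])
    for (parent, child), freq in sorted(vec.items(), key=lambda kv: -kv[1])[:20]:
        print(f"{parent:>24} -> {child:<24} {freq:.4f}")
```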