Translation Validation of Bounded Exhaustive Test Cases


This piece is jointly authored by Nuno Lopes and John Regehr.

Compilers should be correct, but formally verifying a production-quality compiler implementation is not straightforward: it is just too difficult to recover the high-level algorithms from an enormous mess of arithmetic, loops, and memory side effects. One solution is to write a new compiler, such as CompCert, that is designed to be verified. Alternatively, we can keep a large, low-level code base such as GCC or LLVM and settle for weaker forms of validation than formal verification. This piece is about a new way to do the second thing. Our focus is the middle-end optimizers, which seem to be the most difficult part of a compiler to get right. The target is LLVM.

End-to-end compiler testing, supported by a random source code generator like Csmith, is great, but it only gets us so far. The expressiveness of the program generator is one limitation, but a more serious problem is the normalization that happens in the compiler frontend: there are a lot of valid code patterns that Clang will never emit (for example, arithmetic on 4-bit integers, which LLVM supports but Clang essentially never generates) and that are therefore impossible to test by driving Clang. As a Clang user you may not care about this, but as LLVM people we want the middle-end optimizations to be free of logic errors, and the non-Clang-emittable code matters in practice since there are lots of frontends out there besides Clang.

The first step is to generate lots of LLVM IR. Rather than creating a relatively small number of large functions, as Csmith would do, this IR generator creates lots of tiny functions: it uses bounded exhaustive test generation to create every LLVM function up to a certain size. A fun thing about this kind of generator is its choose() operator. In random mode, choose() returns a random number; in exhaustive mode, it uses fork() to explore all alternatives. While this isn’t the most efficient way to do search, leveraging the OS keeps the generator very simple. The most vexing thing about this design is letting it use multiple cores without turning it into a fork bomb; the current version doesn’t contain code that tries to do this.
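
Here is a minimal sketch (ours, not opt-fuzz’s actual code; the Exhaustive flag and the decision to wait on each child are assumptions) of how a fork()-based choose() can work:

    #include <cstdio>
    #include <cstdlib>
    #include <sys/types.h>
    #include <sys/wait.h>
    #include <unistd.h>

    static bool Exhaustive = true; // assumed mode flag

    // Return a value in [0, n). In random mode, pick one alternative; in
    // exhaustive mode, fork() so that every alternative is explored by
    // its own process.
    static int choose(int n) {
      if (!Exhaustive)
        return rand() % n;
      for (int i = 0; i < n - 1; ++i) {
        pid_t kid = fork();
        if (kid == 0)
          return i;               // child continues with alternative i
        waitpid(kid, nullptr, 0); // parent waits: serial, but no fork bomb
      }
      return n - 1;               // parent takes the last alternative
    }

    int main() {
      // Every process reaching this point holds one complete sequence of
      // choices; with 2 * 3 alternatives, six processes each print a line.
      int op = choose(2);
      int arg = choose(3);
      printf("op=%d arg=%d\n", op, arg);
      return 0;
    }

Note the tension mentioned above: waiting on each child, as this sketch does, prevents a fork bomb but also serializes the search, whereas never waiting uses every core and promptly overwhelms the machine.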

The next step is to run LLVM optimizations on the generated functions. One thing we want to try is the collection of passes that implements “-O2,” but it also makes sense to run some individual passes, since sequences of passes can miss bugs: early passes can destroy constructs that would trigger bugs in later ones, and late passes can clean up problems introduced by earlier ones. In fact, both of those things seem to happen quite often.

We end up with lots of pairs of unoptimized and optimized LLVM functions. The obvious thing to do is run them with the same inputs and make sure that the outputs are the same, but that only works when the executions encounter no undefined behaviors. Solutions to the UB problem include:

  1. Generate UB-free code, as Csmith does. At the level of these tiny functions that would be a major handicap on the generator’s expressiveness, and we definitely do not wish to do it.
  2. Create an LLVM interpreter that detects UB instead of silently skipping over it. The rule is that the optimizer is allowed to preserve UB or remove it, but never to add it. In other words, the correctness criterion for any compiler transformation isn’t input/output equivalence but rather input/output refinement (there’s a sketch of this check just after this list). Someone needs to write this interpreter, perhaps using lli as a starting point (though the last time we looked, lli’s slow/simple interpreter mode had suffered some bit rot).
  3. Formally verify the refinement relation using Alive. This is better than an interpreter because Alive verifies the optimization for all inputs to the function, but worse because Alive supports only a loop-free subset of LLVM.
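
To make the refinement criterion from option 2 concrete, here is a minimal sketch (the Result encoding is ours, and it deliberately ignores undef and poison, which need a more refined treatment than a single “UB” outcome):

    #include <cstdint>
    #include <optional>

    // Outcome of running a function on one input: a value, or nullopt
    // meaning "this execution triggered undefined behavior".
    using Result = std::optional<uint64_t>;

    // Refinement, not equivalence: the optimizer may preserve or remove
    // UB but never add it, and on UB-free inputs the optimized code must
    // produce the same value as the original.
    bool refines(Result source, Result target) {
      if (!source)
        return true;                      // source had UB: anything goes
      return target && *target == *source; // source defined: must match
    }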

We chose option three. The Alive language isn’t LLVM but rather an LLVM-like DSL; even so, it is not too hard to automatically translate the supported subset of LLVM into Alive.

In the configuration that we tested (2- and 4-bit integers; three instructions per function; select included, but no real control flow, floating point, memory, or vectors) about 44.8 million functions are generated and binned into 1000 files. We identified seven configurations of the LLVM optimizers that we wanted to test: -O2, SCCP, GVN, NewGVN, Reassociate, InstSimplify, and InstCombine. Then, to make the testing happen, we allocated 4000 CPU cores (in an Azure HPC cluster) to batch-process the resulting 7000 combinations of file and optimizer configuration (1000 files × 7 configurations). Each combination takes between one and two hours, depending on how many functions are transformed and how long Alive takes to verify the changes.

If we could generate all LLVM functions and verify that each of them is optimized correctly, then we’d have done formal verification under another name. Of course, practically speaking, there’s a massive combinatorial explosion and we can only scratch the surface. Nevertheless, we found bugs. They fall into two categories: those that we reported and that were fixed, and those that cannot be fixed at this time.

We found six fixable LLVM bugs. The most common problem was transformations wrongly preserving the nsw/nuw/exact attributes that enable undefined behaviors in some LLVM instructions. This occurred with InstCombine [1], GVN [1], and Reassociate [1,2]. InstSimplify generated code that produces the wrong output for some inputs [1]. Finally, we triggered a crash in llc [1].
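
The linked reports contain the actual test cases; as a generic illustration of why flag preservation is delicate (this example is ours, not one of the reported bugs), consider reassociating nsw additions at 4 bits:

    #include <cstdio>

    // In 4-bit two's complement the signed range is [-8, 7]; an add
    // carrying the nsw ("no signed wrap") flag yields poison when its
    // mathematical result leaves that range.
    static bool nswOverflow4(int a, int b) {
      int sum = a + b;
      return sum < -8 || sum > 7;
    }

    int main() {
      // Rewriting (a + b) + c as a + (b + c) while keeping nsw on both
      // adds is unsound: with a = -1, b = 7, c = 1 neither add in the
      // original overflows, but the rewritten inner add computes
      // 7 + 1 = 8, which is out of range, so the "optimized" code is
      // poison on an input where the original was well defined.
      int a = -1, b = 7, c = 1;
      printf("original:  %s\n", nswOverflow4(a, b) || nswOverflow4(a + b, c)
                                    ? "poison" : "defined");
      printf("rewritten: %s\n", nswOverflow4(b, c) || nswOverflow4(a, b + c)
                                    ? "poison" : "defined");
      return 0;
    }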

The unfixable bugs stem from problems with LLVM’s undefined behavior model. One way to fix these bugs is to delete the offending optimizations, but some of them are considered important. You might be tempted to instead fix them by tweaking the LLVM semantics in such a way that all of the optimizations currently performed by LLVM are valid. We believe this to be impossible: that there does not exist a useful and consistent semantics that can justify all of the observed optimizations.

A common kind of unfixable bug shows up in LLVM’s simplification logic for select: it transforms “select %X, undef, %Y” into “%Y”. This is incorrect (more details in the post linked above) and, worse, has been shown to trigger end-to-end miscompilations [1]. Another source of problems is the differing branch semantics that different parts of LLVM assume; these can also cause end-to-end miscompilations [1,2].

In summary, this is a kind of compiler testing that should be done: it’s relatively easy, and the resulting failing test cases are always small and understandable. If someone builds a UB-aware LLVM interpreter, then no tricky formal-methods-based tools are required. This method could easily be extended to cover other compilers.

There are some follow-on projects that would most likely provide a good return on investment. Our test cases will reveal many, many instances where an LLVM pass erases a UB flag that it could have preserved; these could be turned into patches. We can do differential testing of passes against their replacements (for example, NewGVN vs. GVN) to look for precision regressions. The set of instructions that we generate should be extended; for example, opt-fuzz already has some limited support for control flow.

The code to run these tests is here.


11 responses to “Translation Validation of Bounded Exhaustive Test Cases”

  1. How critical was the exhaustiveness? That is, you guys generated 44.8 million functions. Presumably, in fact, you don’t need all of those to completely test what you tested. But without some abstraction we can’t get, there’s no way to know how much we needed. Fine.

    However, for the bugs you found, do you guys have a reliable way of binning the functions that expose them by bug? If so, one way to estimate the redundancy of the exhaustive tests is to look at the smallest bin size. That is, if you have a bug with only one example among the 45M functions, it argues that either you really need them all, or at least that the abstraction needed to collapse redundancy is fairly complicated (since I assume any function you generate has some very, very similar functions in the set).

  2. On the other hand, if your smallest bin is about 500 functions, maybe something with bigger size and less exhaustiveness is worth figuring out.

  3. Hi Alex, my impulse here is to burn cycles testing stuff rather than burning brainpower figuring out where we can cut corners.

    But on the other hand, if we want to push past 4 or 5 instructions and/or wider integers, we’ll obviously need to give up on exhaustiveness.

    But you know (I think) my top-level view of bugs, which is that they don’t have models. So it’s not at all clear that what we learn from this crop of bugs would carry over to any other crop of them.

  4. Fair enough. Yeah, exhaustive when you can, definitely. I’m just worried about bugs whose shortest instance is 7 instructions; I am guessing this won’t cut it. Also, doesn’t Alive’s runtime go up after 6 or so instructions, making you need to prune even more?

  5. I’m not sure I have a good handle on Alive runtimes, but it definitely goes through the roof if you have a wide divide. Just depends a lot on the details.

    My sense is that for 7+ instructions we just go to random sampling. So probably I should make another 50M functions that are quite a bit larger than these, and we’ll check out their bug-finding power.

    But there’s a bit of a problem: take a look at the structure of opt-fuzz; it’s a chain of choose() operations. In random mode the outputs will be skewed heavily by the order in which I chained things. Obviously this doesn’t matter in exhaustive mode. To get a good random generator out of this code I’ll need to either restructure it or otherwise reshape the search tree.

    Also, I’ve already cut corners in a few important cases, such as never generating instructions where all inputs are constants, since those trivially fold to something shorter. So we’ll miss bugs in the constant folder, which I’m not really worried about.

  6. Would one way of fixing the ‘select’ simplification issue be to add an instruction like:

    %result = unpoison %value

    that yields the value it was passed, unless the value passed is poison, in which case it yields undef?

  7. Yeah, I know (from our past discussion) about the skew problem once you sample. I was thinking of throwing out likely-redundant prefixes or something. Also, is the generation and storage time the problem with scaling up a few instructions, or the Alive runtime? I don’t have a feel for how soon you need all the SSD space on the planet or something.

  8. kme, where have you been??? It took us quite a bit of time to arrive at basically the solution you describe; see our PLDI 2017 paper.

  9. Alex, I’m currently running some experiments with somewhat larger values of N. The configuration reported in the post hadn’t quite reached the point of no return, but going to 4 instructions may walk us into the many-TB range unless compensating shortcuts are taken elsewhere. My sense is that a pretty dominant proportion of optimizations require only 3-4 instructions to trigger.

  10. Hi John, could you explain a bit more why looking at all the “-O2” passes together is interesting? My first intuition would be that running one pass over a generated function should result in another function that you also generated. Hence, it should be enough to look at all passes in isolation.

    Another question: how would you pick the function inputs when using an interpreter to detect UB? It seems you’d need to run the interpreter with all possible combinations of inputs. Thinking about this, I wonder whether you might be able to use/adapt a symbolic execution tool (e.g., KLEE) to do the UB detection?

  11. Hi Jeroen, our list of individual passes includes just a few that we guess will be most interested in these short functions, but -O2 includes many more passes that may do something with them, so it seems good to look at it too.

    Regarding function inputs for an interpreter, I’d probably go with a few random choices, although exhaustive testing will be possible for functions with few enough bits of input. It is lucky that LLVM supports (and optimizes) arbitrary-width integers; other compilers will not be able to play this trick, and there we’ll be stuck guessing a few 32-bit or 64-bit values.

    In any case (as long as we stay within its supported subset of LLVM) Alive is the right solution, since it does precisely the refinement check that we want; I don’t see KLEE as being particularly helpful there.

    But I’ll tell you a way that KLEE can help! When we generate functions to optimize, we must include lots of different choices of constants (ideally, all of them) because we don’t know a priori what the preconditions for the optimizations are. This wastes a lot of time. Better would be to generate functions containing symbolic constants and then use KLEE to find concrete constants that drive the different paths through the optimizer. I’d love to experiment with this; I’m lacking only time right now.