Alive2 Part 1: Introduction

[This piece is co-authored by Nuno P. Lopes and John Regehr.]

Compiler bugs threaten the correctness of almost any computer system that uses compiled code. Translation validation is a path towards reliably correct compilation that works by checking that an individual execution of the compiler did the right thing. We created a tool, Alive2, that takes an unoptimized function in LLVM intermediate representation (IR) and either proves that it is refined by the optimized version of the function, or else shows a set of inputs that illustrate a failure of refinement.

We work in terms of refinement, rather than equivalence, because compilers often change the meaning of the code they are optimizing — but in allowed ways. For example, adding two variables of “int” type in a C or C++ program means something like “add the two quantities, returning the mathematical result as long as it is within an implementation-specified range that is no smaller than -32767..32767.” When integer addition is lowered from C or C++ to LLVM IR, this very restrictive range for validity will typically be refined to be -2147483648..2147483647. Then, when LLVM IR is lowered to a target platform such as ARM or x86-64, the behavior of additions whose results fall out of bounds is refined from “undefined” to “two’s complement wraparound.” Both of these refinements are perfectly legal, but would be incorrect if the transformation went in the other direction. Non-trivial refinement is ubiquitous during compilation and we have to take it into account in a translation validation tool.

Let’s look at an example of refinement and also learn how to invoke Alive2 and interpret its results. Here’s a pair of functions in LLVM IR:

define i16 @src(i16 %x, i16 %y) {
    %r = sdiv i16 %x, %y
    ret i16 %r
}

define i16 @tgt(i16 %x, i16 %y) {
    %z = icmp eq i16 %y, 0
    br i1 %z, label %zero, label %nonzero
zero:
    ret i16 8888
nonzero:
    %q = sdiv i16 %x, %y
    ret i16 %q
}

The first function, src, simply returns the quotient of its arguments. The second function, tgt, returns the quotient of its arguments unless the second argument is zero, in which case it returns 8888. It turns out that src is refined by tgt. This works because division by zero in src has undefined behavior: LLVM makes no promise whatsoever about the behavior of a program that divides a number by zero. tgt, on the other hand, has a specific behavior when its second argument is zero, and behaves the same as src otherwise. This is the essence of a refinement relationship. If we run the refinement check in the other direction, Alive2 will tell us that src does not refine tgt.

To see these results for yourself, you can build Alive2 (making sure to enable translation validation at CMake time) and then invoke our alive-tv tool like this:

alive-tv foo.ll

Alternatively, you can simply visit our Compiler Explorer (CE) instance. Let’s talk about this user interface for a second. When you load the page you should see something like this:

The large pane on the left is where you type or paste LLVM IR. The large pane on the right shows alive-tv’s output in the case where it is able to parse the LLVM code successfully. The “share” pulldown at the top provides a short link to what you are currently seeing: this is an excellent way to show other people your results. If your LLVM code contains an error, the right pane will say “Compilation failed” and the text at the bottom of the right pane will change from “Output (0/0)” to something like “Output (0/3)”. Click on this text and a third pane will open showing the error output. You should then see something like this:

After fixing the error you can close this new pane.

Returning to our example, here’s Compiler Explorer with the example already loaded. We’ve also given alive-tv the --bidirectional command line option, which causes it to run the refinement check in both directions. The output therefore has two parts; this is the first one:

----------------------------------------
define i16 @src(i16 %x, i16 %y) {
%0:
  %r = sdiv i16 %x, %y
  ret i16 %r
}
=>
define i16 @tgt(i16 %x, i16 %y) {
%0:
  %z = icmp eq i16 %y, 0
  br i1 %z, label %zero, label %nonzero

%nonzero:
  %q = sdiv i16 %x, %y
  ret i16 %q

%zero:
  ret i16 8888
}
Transformation seems to be correct!

“Transformation seems to be correct!” indicates that Alive2 thinks that src is refined by tgt. The arrow (=>) points in the direction of the refinement relation. The second part of the output is:

----------------------------------------
define i16 @tgt(i16 %x, i16 %y) {
%0:
  %z = icmp eq i16 %y, 0
  br i1 %z, label %zero, label %nonzero

%nonzero:
  %q = sdiv i16 %x, %y
  ret i16 %q

%zero:
  ret i16 8888
}
=>
define i16 @src(i16 %x, i16 %y) {
%0:
  %r = sdiv i16 %x, %y
  ret i16 %r
}
Reverse transformation doesn't verify!
ERROR: Source is more defined than target

Example:
i16 %x = undef
i16 %y = #x0000 (0)

Source:
i1 %z = #x1 (1)
i16 %q = #xffff (65535, -1)	[based on undef value]

Target:
i16 %r = #xffff (65535, -1)

Notice that now the arrow points from tgt to src. This time Alive2 says:

Reverse transformation doesn't verify!
ERROR: Source is more defined than target

(Here in the reverse direction, “source” unfortunately refers to “tgt” and “target” refers to “src”.)

Interpreting a counterexample from Alive2 isn’t always easy. Part of the blame falls on LLVM’s tricky undefined behavior semantics, but also there are some UX details that we hope to improve over time. The main thing to remember is that a counterexample is an example: it represents a single execution which illustrates the failure of refinement. The most important part of a counterexample is the set of input values to the IR being checked:

i16 %x = undef
i16 %y = #x0000 (0)

These are, of course, the arguments to the functions. Given a set of input values, you should ideally be able to execute both the unoptimized and optimized functions in your head or on paper, in order to see the difference. Alas, this depends on the functions being fairly small and also on you having a good understanding of LLVM’s undefined behavior model.

The next part of the output is designed to help you think through the execution of the code; it provides intermediate values produced as the source and target execute the counterexample:

Source:
i1 %z = #x1 (1)
i16 %q = #xffff (65535, -1)	[based on undef value]

Target:
i16 %r = #xffff (65535, -1)

Now that we’re seen all of the parts of Alive2’s output, let’s work through this specific counterexample. The key is passing 0 as the argument to the %y parameter; the value of %x doesn’t matter at all. Then, tgt branches to %zero and returns 8888. src, on the other hand, divides by zero and becomes undefined. This breaks that rule that optimizations can make code more defined, but not less defined. We plan to augment Alive2’s counterexamples with some indication of where control flow went and at what point the execution became undefined. Also, we should suppress printing of values that occur in dead code. Alternately, and perhaps preferably, we could feed Alive2’s counterexample to a separate UB-aware LLVM interpreter (a tool that doesn’t yet exist, but should).

In general, when refinement fails, it means that there exists at least one valuation of the inputs to the function such that one of these conditions holds:

the source and target return different concrete answers
the source function returns a concrete answer but the target function is undefined
there is a failure of refinement in the memory locations touched by the function

We’ll explain memory refinement in a later post in this series.

Alive2’s predecessor, Alive, was also a refinement checker. It was based on the idea that a user-friendly domain-specific language could make it easier for compiler developers to try out new optimizations. This language supported omitting some details found in real LLVM IR, such as how wide, in bits, an operation being optimized is. Alive2 still supports this domain-specific language, but much of our recent work has focused on handling LLVM IR directly. This means that we are working at a slightly lower level of abstraction than before, but it allows us to run our tool on large amounts of code found in the wild.

One of our main targets for Alive2 has been LLVM’s unit test suite: a collection of about 22,000 files containing LLVM IR. Each file specifies that one or more LLVM optimization passes should be run, and then checks that the file gets optimized as expected. When we run Alive2 over the test suite, we ignore the expected result and instead simply check that the optimized code refines the original code. Perhaps surprisingly, this has caught quite a few bugs: 27 so far. Why would the test suite itself, when all tests are in a passing state, be capable of triggering bugs in LLVM? This can happen for several reasons, but the most common one is that LLVM’s test oracles — which decide whether the compiler behaved correctly or not — are generally coarse-grained. For example, a unit test about optimizing division by power-of-two into right shift might simply test for the presence of a shift instruction in the optimized code. Alive2, on the other hand, acts as a fine-grained oracle: if the optimized code does not preserve the meaning of the original code for even one choice of input values, this fact will be detected and reported. Here’s an example of a bug we found this way.

Alive2 handles a large subset of LLVM, but not all of it. Right now there are three main limitations. First, loops are only partially analyzed. Alive2 unrolls loops once, so any miscompilation that only manifests when the loop runs for more than one iteration cannot be detected. Second, interprocedural optimizations are not supported. Although Alive2 can handle function calls in the code being analyzed, it does not look across functions. So a function inlining transformation cannot be analyzed, nor can something like removing an unused global variable. Third, some kinds of code will reliably cause the underlying SMT solver, Z3, to take too long to return a result. Floating point operations, wide integer divisions, and complex memory operations are common culprits.

This post has introduced Alive2 and shown off some of its basic capabilities. Subsequent posts in this series will cover more detailed technical topics such as Alive2’s memory model, verifying vector and floating point optimizations, and finding bugs in LLVM backends and also binary lifters that turn object code into LLVM IR.

March 12, 2020

regehr

Compilers, Computer Science, Software Correctness

8 responses to “Alive2 Part 1: Introduction”

John Payson says:

March 13, 2020 at 10:58 am

One problem I’ve observed with clang and gcc is that different stages of optimization seem to have conflicting ideas about what constructs will be treated as defined when passed to downstream stages, and what constructs must be treated as defined when they come from upstream stages. For example, if a piece of code is conditionally executed when (uintptr_t)(p+i) and (uintptr_t)(q+5) happen to be equal, clang will seem to assume that lvalue p[i] may be replaced with lvalue q[5]–something that would be fine if no downstream stages try to make any assumptions about lvalue q[5] being unable to access some things that could be accessed by p[i].

Could you demonstrate how your tool could be used to determine what will happen with a particular piece of C code? For example, given https://godbolt.org/z/gg7x5Z (which demonstrates the above bug) could you demonstrate whether your tool would find the problem? Note that if x happens to be 5 elements, y happens to immediately follow it, and i is zero, evaluation of x+5 will yield a pointer which legitimately points just past the last value of x, but which cannot be legitimately dereferenced. While nothing in the Standard would mandate that the uintptr_t representation of that pointer compare equal to that of y, I think it’s pretty clear that the two possible correct behaviors of the function would be to leave y[0] holding 1 and return 1, or set y[0] to 2 and return 2. It would be possible for the comparison to report the two uintptr_t values equal, and if that happens, code would set y[0] to 2 but return 1, which is an incorrect behavior.
BCS says:

March 15, 2020 at 5:56 pm

How long does it take to run all 22k test cases through Alive2? Is there anyone with an intent to wire it in as a CI/presubmit check? (If that would take too long to do directly; How stable are the test cases input/outputs? Would a cache of cases tested, to avoid re-inspecting know good refinements, be a meaningful optimization?) If that is done; have any of the LLVM owners expressed interest in collecting test cases where Alive2 is the only applied oracle?

Really cool stuff. It’s really encouraging to see this sort of thing making the shift from academic theory to industrial application.
regehr says:

March 16, 2020 at 10:31 am

Hi John, this tool only handles LLVM IR and can’t help with your example, which is about C++ and assembly.

Regarding different parts of the compiler having different views of the semantics, this can definitely happen. I don’t know that much about the internals of GCC, but in LLVM it should be a bug any time that two parts of the compiler disagree about the meaning of LLVM IR. So in principle all of these things can be fixed.
regehr says:

March 16, 2020 at 10:37 am

Hi BCS, when the solver timeout is fairly short (I believe it is set to 1 second by default), all 22k test cases can be processed by alive2 in a matter of minutes, given a fast multicore. But it’s probably too resource-intensive as a pre-commit check. As you say, caching could definitely speed this up in the common case.

My feeling is it’s too early to wire this into the main LLVM workflow, but that in the medium term something like that might be really useful.

The problem with running Alive2 over the test suite is that this makes the quality of the unit tests into a critical bottleneck for finding bugs. We definitely do not want this bottleneck in practice. Rather, we need smart fuzzers coming up with new tests and then our tool can sit there and look for bugs. This turns everything back into an offline problem. I don’t think that’s a big deal as long as issues get reported within a few days, while things are fresh in developers’ minds.
Jordan Rose says:

March 17, 2020 at 11:11 am

UX suggestion: note which value is the return value in the failure output (if relevant).
regehr says:

March 18, 2020 at 10:58 am

Thanks Jordan!
Dan says:

March 29, 2020 at 5:11 am

Interesting, and pretty cool that it is automated to such an extent!
> Alive2, on the other hand, acts as a fine-grained oracle: if the optimized code does not preserve the meaning of the original code for even one choice of input values, this fact will be detected and reported.

So Alive2 is capable of detecting the absence of a refinement? Of course, in general this is undecidable, so I guess sometimes it has to give up on showing a refinement?
regehr says:

March 29, 2020 at 11:10 pm

Hi Dan, the problem being solved by Alive2 is decidable because it deals with loops in a pretty restricted fashion. However, even this decidable problem is often intractable in practice, so Alive2 definitely gives up. In practice this manifests as a solver timeout.