An important tool for reasoning about programs without running them is dataflow analysis, which can be used to compute facts such as:
- an integer-valued variable is non-negative at a particular program point
- a conditional branch always goes one way
- an indirect write cannot modify a particular array
Dataflow facts drive optimizations such as constant propagation and dead code elimination. Outside of compilers, dataflow analyses are important in tools that prove safety properties, such as absence of undefined behaviors in C or C++ programs. Dataflow analysis is backed by a rich academic literature and tool writers have decades of experience implementing it. This set of lecture notes is my favorite introduction to dataflow analysis.
Despite all of the nice theorems, dataflow analyses are hard to get right, in part because people are just not very good at reasoning about properties that hold over sets of program executions. As a simple exercise in this kind of thinking, try this: you have a variable A whose value is known to be in the range [0..1000] and a variable B that is known to be in the range [10000..10050]. The program computes the bitwise xor of A and B and assigns the result into C. Provide tight bounds on the values that C can take. If you find this to be easy, that’s great — but consider that you then have to make this work for all intervals and after that there are about 9,999 similarly fiddly things left to implement and any mistakes you make are likely to result in miscompilations.
So now let’s finally get to the point of this post. The point is that since dataflow is hard, if you implement a dataflow analysis, you should also implement a dynamic checker that looks for cases where the analysis has come to the wrong conclusion. To keep going with the running example, the program being analyzed contains this line of code:
// A in [0..1000] and B in [10000..10050] C = A ^ B;
If our interval analysis says that C has to be in the range [9225..12255], we would rewrite the code to add an assertion:
// A in [0..1000] and B in [10000..10050] C = A ^ B; assert(C >= 9225 && C <= 12255);
Now we can run the instrumented program and, if we’re very lucky, it will execute this code with values such as A = 830 and B = 10041, making C = 9223, triggering an assertion violation telling us that our analysis is unsound and giving us something to debug. The other way to debug this sort of problem is to backtrack from a miscompilation — an experience enjoyed by basically nobody.
Every dataflow analysis that I’ve implemented or supervised the implementation of has gotten a dynamic checker. This is a great testing technique that has numerous advantages. It finds bugs in practice. It can be automated, either using a program generator like Csmith or else using a regular test suite. The annotated programs serve as a useful way to look at the precision of an analysis: if not very many assertions show up, or if the bounds are very loose, then we probably need to do some tuning for precision. Finally, if we don’t find any bugs, then the dataflow analysis is probably (please excuse me, formal methods people) correct enough for practical purposes.
Of course, not everything that can be computed by a dataflow analysis can be expressed in code. For example, in C and C++ there’s no lightweight way to assert that a pointer refers to valid storage or that a variable is uninitialized. E-ACSL is a neat piece of related work in this general area.
Just recently I’ve been looking at the efficiency of integer overflow checking using LLVM and since then I wrote some code that uses integer range dataflow facts to remove overflow checks that can be proved to not fire. Here’s the under-review patch. The ranges are computed by an LLVM pass called LazyValueInfo (LVI), which better be correct if we’re going to rely on it, so I wrote a little LLVM pass that does the necessary checking. It processes each function in between the optimization and code generation phases by:
- Making a list of instructions that LVI knows something about.
- Creating a new basic block that executes a trap instruction.
- Conditionally jumping to the trap block if the actual value produced by any instruction is out of bounds with respect to the predicted value range.
(If you build this pass, you’ll need a slightly hacked LLVM, see the README.) Although it might seem tempting to run the optimizers again on the instrumented code in order to clean it up a bit, this would be a very bad idea: the very dataflow analyses that we’re trying to check up on would be used to drive optimizations, which could easily end up incorrectly deleting our checks.
So far, some testing using Csmith hasn’t turned up any bugs in LVI, which is great. Less great is the fact that it drops precision all over the place: a lot of tuning up is needed before it is the basis for a really strong redundant overflow check remover.
The technique I’m describing is not as widely known or as widely used as it should be, considering how easy and effective it is. The best answer is that C = [9216..10239].
UPDATE: The latest version of my LLVM dataflow checker also validates the results of computeKnownBits(), which tries to figure out which bits of a value must be either zero or one.
Here’s a bit of further reading about how these techniques (and others) were applied to the Frama-C static analyzer.
7 responses to “Checking Up on Dataflow Analyses”
How hard would it be to have your pass dump self-contained snippets of the code being compiled, along with the assertions, and then subjected them to the same type of checking that ALIVe does for instruction combiner transformations? Assuming you can apply that to check the soundness of the analysis, how hard would it be to reformulate the assertion to check them for tightness?
As a first pass, I’d think acyclic, unconditional computation graphs (of limited size) with only unconstrained inputs should be easily identifiable and may generally be tractable to process.
I think it would be more helpful if this post or the notes you point to had some citations to the large amount of work in this field.
For example, here is a C/C++ dataflow analysis tool we wrote that found hundreds of remotely exploitable bugs in Debian with a true positive rate of 50% (quite high); disclaimer: I am the maintainer of the project:
Elsa/Oink
http://danielwilkerson.com/oink/
https://github.com/dsw/oink-stack/
13 October 2014: After the Heartbleed bug came out, someone at a government lab that will not let me use their name wrote me (initially on 18 April 2014), saying:
Yes, you are interpreting me correctly. CQual++ found heartbleed while the commercial tools I tried did not. I also applied CQual++ to an important internal project and found it very effective (though a bit difficult to run and interpret) at identifying places where sanitization routines weren’t being called consistently.
and
I’ve been applying cqual to a fairly significant library that we’ve been developing that does a fair amount of both network and filesystem related stuff. I don’t think I’ve hit any false positives. There were a number of places where input was being validated but not marked as such which of course caused issue reports but that’s not really a false positive.
Daniel, that’s a great story about heartbleed! You’ll have to excuse me for somehow not knowing that you folks were working on Elsa/Oink again, this is good to hear.
I agree that a big literature review about dataflow would be excellent but this isn’t the place.
bcs, you have concisely described a project that I have been dreaming about doing, a translation validation tool for LLVM-level optimizations. There are a few parts of this that I don’t really know how to do, such as establishing a mapping between the pre-optimization and post-optimization code. Also, the LLVM passes aren’t setup to do just one transformation at a time, which I think is necessary to make this work. But maybe these problems aren’t too difficult. Anyway, the thing you describe would be extremely useful since it would avoid the (difficult, slow) step of making Alive patterns from C++ code in the LLVM tree.
Do you have any more information about how you handled bitwise operations in a numerical domain? To me it seems non-trivial since you cannot do the bit operations using only the min/max values (e.g., logical AND with [-2, 3] would use the value 0 to produce the minimal result and (perhaps?) -1 to produce the maximum result assuming twos compliment). All the paper’s I’ve run into use some additional abstract domain in a reduced product.
Thanks for the interesting post.
Hi markus, making good abstract transfer functions can be really hard. I computed the optimal example for this post by brute force. In the past I’ve worked on other ways of deriving these functions.
http://www.cs.cornell.edu/courses/cs711/2005fa/papers/rr-asplos04.pdf
https://www.cs.utah.edu/~regehr/papers/lctes06_2/fp019-regehr.pdf
We are not exactly still “working on” Elsa and Oink, however I keep them maintained in the sense that if people report bugs, such as flex changed and the code no longer builds without some little change, then I test those and if they build on my machine I incorporate them. That is, we are not really using Elsa and Oink for anything, but if someone is, I’m happy to help them. My experience suggests that our dataflow analysis is still better than that of Clang.
Scott’s design for Elsa where he uses a GLR parser generator, Elkhound, to parse C++ means that extending Elsa to parse the newer versions of C++ would be straightforward; similarly extending the Oink dataflow analysis for the new AST nodes would also be straightforward.