Dataflow analysis is a form of static analysis: a way to compute properties that hold over all possible executions of a program. For example, if a compiler can conclude that an expression always evaluates to the value 5, this fact can be used to avoid computing the expression at runtime.
The math behind dataflow analysis is probably one of the ten most important results in theoretical computer science. It supports a strong separation of concerns, where a generic dataflow framework takes care of the bookkeeping and a collection of transfer functions implements the specific analysis.
Once a framework exists, writing analyses is easy and entertaining. It’s great to form a little hypothesis like “we could get rid of many array bounds checks by keeping track of integer intervals,” write a few transfer functions, and then analyze some large programs in order to validate or disprove the hypothesis. On the other hand, writing a dataflow analysis without a framework is a hellish amount of work.
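To make the intervals hypothesis concrete, here is a minimal sketch of what a couple of transfer functions might look like. The representation (a `(lo, hi)` pair) and the function names are illustrative, not the API of any particular framework.

```python
# A hypothetical interval analysis fragment: abstract values are pairs
# (lo, hi) meaning "the runtime value lies in [lo, hi]".

def interval_add(a, b):
    # Transfer function for x = y + z: interval addition.
    (alo, ahi), (blo, bhi) = a, b
    return (alo + blo, ahi + bhi)

def interval_join(a, b):
    # At a control-flow merge, take the least interval containing both.
    (alo, ahi), (blo, bhi) = a, b
    return (min(alo, blo), max(ahi, bhi))

# If y is in [0, 3] and z is in [1, 2], then y + z is in [1, 5];
# merging [0, 3] with [5, 7] loses precision, giving [0, 7].
```

A bounds check on `a[i]` can then be dropped whenever the interval computed for `i` is contained in `[0, len(a) - 1]` — exactly the sort of hypothesis a good framework makes cheap to test.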
I recently read this paper which evaluates a new idea for analyzing integer ranges. The authors implemented their analysis over the LLVM IR, which is nice since LLVM serves as a partial framework, permitting them to analyze some of the SPEC benchmarks. On the other hand, significant limitations of their analysis (intraprocedural, limited analysis of globals and pointed-to values) would seem to indicate that LLVM does not yet serve as a great framework for this sort of research. The result is that even though the authors of this paper have done the right thing, their evaluation remains tantalizing at best. My former student Nathan Cooprider had similar problems implementing dataflow analyses using CIL, which saved us probably 18 months of work and made his project feasible. Even so, CIL had numerous shortcomings as a dataflow framework and a lot of effort was required to get things working.
A good dataflow framework should:
- Contain front-ends for one or more real programming languages.
- Be open source and well documented.
- Expose a clean intermediate representation.
- Make easy things easy. A user should have to write only a minimal amount of code to support a simple static analysis such as detecting null pointers or determining signedness of integers.
- Make it easy to write more than one kind of static analysis. For example, an integer range analysis is quite different from a live variables analysis, and both are very different from a pointer analysis.
- Provide easy access to source-level information such as lines, columns, variable names, etc.
- Heavily optimize the program prior to analysis. One of the easiest ways to increase the precision of an analysis is to run it after a comprehensive collection of optimization passes (such as those supported by LLVM) has removed dead code, useless indirection, and other junk that tends to pollute analysis results.
- Permit analyses to learn from the results of other analyses.
- Provide good support for turning analysis results into assertions; Nathan probably couldn’t have worked the bugs out of his analyzers without doing this.
- Provide efficient fixpoint iteration algorithms.
- Make it easy to write transformations that consume analysis results.
- Provide transparent support for interprocedural and whole program analysis.
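Several of these items (efficient fixpoint iteration, letting analyses feed each other, keeping the bookkeeping out of the user's hands) center on the generic engine. As a rough sketch of what that engine does, here is the classic worklist algorithm for a forward analysis; the CFG encoding and parameter names are made up for illustration, not any real framework's API.

```python
from collections import deque

def forward_analyze(cfg, entry, init, transfer, join, bottom):
    # Generic forward dataflow solver via worklist iteration.
    #   cfg:      dict mapping each node to its list of successors
    #   transfer: per-node transfer function on abstract states
    #   join:     merge operator applied at control-flow joins
    #   bottom:   least element of the abstract domain
    preds = {n: [] for n in cfg}
    for n, succs in cfg.items():
        for s in succs:
            preds[s].append(n)
    state_in = {n: bottom for n in cfg}
    state_in[entry] = init
    state_out = {n: bottom for n in cfg}
    work = deque(cfg)
    while work:
        n = work.popleft()
        if n != entry:
            # Recompute the input state by merging predecessor outputs.
            acc = bottom
            for p in preds[n]:
                acc = join(acc, state_out[p])
            state_in[n] = acc
        new_out = transfer(n, state_in[n])
        if new_out != state_out[n]:
            # Output changed: successors must be revisited.
            state_out[n] = new_out
            for s in cfg[n]:
                if s not in work:
                    work.append(s)
    return state_in, state_out
```

Termination requires a finite-height abstract domain or a widening operator; real frameworks also order the worklist sensibly (e.g., reverse postorder) rather than iterating chaotically, which is exactly the kind of engineering a framework should do once so analysis writers never have to.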
It is possible that Soot and Rose meet all of these requirements. CIL does not. I don’t think LLVM is too far off. It seems likely that good analysis frameworks for functional languages exist, but I haven’t paid attention to them.
An important criterion that I forgot to list is the ability of a dataflow framework to provide knobs for tuning precision against resource usage. Examples of this include configurable widening thresholds and tunable degrees of path and context sensitivity.
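As a concrete illustration of one such knob, here is a hedged sketch of threshold-based widening for the intervals domain. The threshold set and names are invented for illustration; real analyzers typically pick thresholds from constants appearing in the program being analyzed.

```python
import math

def widen(old, new, thresholds=(-64, 0, 64)):
    # Widening for intervals: when a bound keeps growing across loop
    # iterations, jump it to the next configured threshold instead of
    # creeping toward it one step at a time, falling back to infinity
    # when no threshold remains. More thresholds mean more precision
    # but more iterations before the analysis stabilizes.
    (olo, ohi), (nlo, nhi) = old, new
    if nlo >= olo:
        lo = olo                      # lower bound is stable
    else:
        below = [t for t in thresholds if t <= nlo]
        lo = max(below) if below else -math.inf
    if nhi <= ohi:
        hi = ohi                      # upper bound is stable
    else:
        above = [t for t in thresholds if t >= nhi]
        hi = min(above) if above else math.inf
    return (lo, hi)
```

For a loop counter whose interval grows from [0, 10] to [0, 12], this jumps straight to [0, 64]; if the bound later exceeds every threshold it goes to infinity and iteration stops growing. Exposing the threshold set as a knob is one way a framework can let users trade precision against resource usage.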