Skip to content

Modern Dataflow Frameworks Wanted

Dataflow analysis, or static analysis, is a way to compute properties that hold over all possible executions of a program. For example, if a compiler can conclude that an expression always evaluates to the value 5, this fact can be used to avoid computing the expression at runtime.

The math behind dataflow analysis is probably one of the ten most important results in theoretical computer science. It supports a strong separation of concerns where a generic dataflow framework takes care of the bookkeeping and a collection of transfer functions implement the specific analysis.

Once a framework exists, writing analyses is easy and entertaining. It’s great to form a little hypothesis like “we could get rid of many array bounds checks by keeping track of integer intervals,” write a few transfer functions, and then analyze some large programs in order to validate or disprove the hypothesis. On the other hand, writing a dataflow analysis without a framework is a hellish amount of work.

I recently read this paper which evaluates a new idea for analyzing integer ranges. The authors implemented their analysis over the LLVM IR, which is nice since LLVM serves as a partial framework, permitting them to analyze some of the SPEC benchmarks. On the other hand, significant limitations of their analysis (intraprocedural, limited analysis of globals and pointed-to values) would seem to indicate that LLVM does not yet serve as a great framework for this sort of research. The result is that even though the authors of this paper have done the right thing, their evaluation remains tantalizing at best. My former student Nathan Cooprider had similar problems implementing dataflow analyses using CIL, which saved us probably 18 months of work and made his project feasible. Even so, CIL had numerous shortcomings as a dataflow framework and a lot of effort was required to get things working.

A good dataflow framework should:

  1. Contain front-ends for one or more real programming languages.
  2. Be open source and well documented.
  3. Expose a clean intermediate representation.
  4. Make easy things easy. A user should have to write only a minimal amount of code to support a simple static analysis such as detecting null pointers or determining signedness of integers.
  5. Make it easy to write at least one kind of static analysis. For example, an integer range analysis is quite different from a live variables analysis, and both are very different from a pointer analysis.
  6. Provide easy access to source-level information such as lines, columns, variable names, etc.
  7. Heavily optimize the program prior to analysis. One of the easiest ways to increase the precision of an analysis is to run it after a comprehensive collection of optimization passes (such as those supported by LLVM) has removed dead code, useless indirection, and other junk that tends to pollute analysis results.
  8. Permit analyses to learn from the results of other analyses.
  9. Provide good support for turning analysis results into assertions; Nathan probably couldn’t have worked the bugs out of his analyzers without doing this.
  10. Provide efficient fixpoint iteration algorithms.
  11. Make it easy to write transformations that consume analysis results.
  12. Provide transparent support for interprocedural and whole program analysis.

It is possible that Soot and Rose meet all of these requirements. CIL does not. I don’t think LLVM is too far off. It seems likely that good analysis frameworks for functional languages exist, but I’ve paid no attention.

One of the biggest holes that I see right now is a good unified framework for static analysis of languages such as JavaScript, Python, Perl, and PHP. The heavy reliance on strings, hashes, and dynamic types in these languages makes them a lot different (and in some ways a lot more difficult) than traditional compiled languages. I’m not totally sure that the differences between these languages can be abstracted away in a clean fashion, but it seems worth trying.

UPDATE:

Pascal Cuoq points out the Frama-C meets all of the criteria at least to some extent. Ed Schwartz points to CPAChecker which looks interesting though I know nothing about it.

An important criterion that I forgot to list is the ability of a dataflow framework to provide knobs for tuning precision against resource usage. Examples of this include configurable widening thresholds and tunable degrees of path and context sensitivity.

{ 18 } Comments

  1. Edward | September 30, 2012 at 4:22 pm | Permalink

    Hoopl is worth taking a look at, though it fails a lot of your requirements.

    Interpreted languages are a bit harder here, since they don’t usually have a nice IR, right? (They usually have bytecodes with a metric ton of primitive operations.)

  2. Robby | September 30, 2012 at 4:45 pm | Permalink

    CRI proposals are due in about a month.

  3. regehr | September 30, 2012 at 5:03 pm | Permalink

    Robby, I’ve had terrible luck at CRI in the past and don’t feel super incentivized to try again. I don’t mean to complain that my proposals weren’t funded (though they were not) but rather that the reviews didn’t make it sound like the proposals were evaluated in a serious fashion. I think it may simply be difficult to get software infrastructure proposals funded. I know there have been successes but this may have been by people better at the proposals game than I.

    Edward, I’ve quickly looked at Hoopl and it seems quite cool, though I didn’t read it carefully enough to distill out the key ideas. I’m guessing that the native IR for most scripting languages would be unsuitable for analysis and that a new one needs to be invented.

  4. Pascal Cuoq | September 30, 2012 at 5:24 pm | Permalink

    Hi John!

    You wrote:

    “Heavily optimize the program prior to analysis. One of the easiest ways to increase the precision of an analysis is to run it after a comprehensive collection of optimization passes (such as those supported by LLVM) has removed dead code, useless indirection, and other junk that tends to pollute analysis results.”

    Another thing that a “comprehensive collection of optimization passes” tends to do is erase bugs that were unambiguously present in the source code. It all depends what you want to do, but if you want to claim, say, that tool X is “is dynamically sound and complete with respect to some [...] undefined behaviors”, then either the optimization passes should not erase any of the targeted undefined behavior, or you will have to twist the definition of “sound” a bit.

    Never having used it, I am no expert of LLVM’s transformations, for instance, but the experts I met told me that its comprehensive collection of transformations was an issue for soundness of analyses based on it. One example was, for integer overflows, and if I remember correctly, a snippet like:

    unsigned int x =0x70000000;
    unsigned int y =0x70000000;
    unsigned int z = (unsigned int) ((int) x + (int) y);

    There was another example from another expert that involved variable uninitializedness and, in specific circumstances, the impossibility to detect such at the IR level.

    I am also curious about the issues you found in CIL, having written my own dataflow analysis with it. My issues were that it erased bugs, but you seem to imply that something else in it was annoying you.

    I would argue that the framework in which I worked, http://frama-c.com/ , based on a modified version of CIL, satisfies to some extent each of your listed items.

    1- C99 minus Variable Length Arrays, with more front-ends being developed that will roughly target the same AST.

    2- Open Source, yes. Well-documented, partial credit for effort.

    3- A fair amount of non-bug-erasing normalization is done. No bugs are not erased to the best of my knowledge.

    4- It takes a short time to implement, say, a verifier of constant-time execution for cryptographic functions: http://blog.frama-c.com/index.php?post/2011/12/31/Do-not-use-AES-in-a-context-where-timing-attacks-are-possible

    5- The framework makes it easy to write a large number of analyses, as demonstrated by the number of analyses available. Specific control-flow and data-flow analyses are given the spotlight in http://www.erts2012.org/Site/0P2RUC89/5C-3.pdf

    6- Lines and variable names are available. For C, reliably indicating columns requires writing one’s own pre-processor, which would be a valuable but large undertaking.

    7-8- There is a slicer in the framework if you want to use it. Apart from that, pointers can be resolved in advance of the analysis being implemented, which simplifies it in many cases.

    9- Any plug-in can insert assertions in the AST. Assertions that are in the executable subset of the annotation language can be converted to executable C assertions automatically from there. It is also possible to hack away something quick-and-dirty as in http://www.cs.utah.edu/~regehr/papers/nfm12.pdf , for “Printing internal invariants in executable C form”, but yes, one shouldn’t have to do that.

    10- This is probably a weak point. Let us do some comparisons when a comparable framework for C is available.

    11- The existing analyses already consume the results of one another. Brand-new program transformations such as http://www.rennes.supelec.fr/ren/rd/cidre/tools/sidan/pubs.html can be designed to take advantages of the same results.

    12- I didn’t think it was much work to write my whole-program analysis, but I have only written one, so again, comparisons should be made when a comparable framework is available.

  5. anonymouse | September 30, 2012 at 5:40 pm | Permalink

    YES! Please god yes.

    If there was a plus one, I would press that too :P

  6. Ed Schwartz | September 30, 2012 at 6:06 pm | Permalink

    Have you tried CPAChecker? I haven’t, but I like their paper. I think the implementation is based on CIL.

    http://cpachecker.sosy-lab.org/

  7. Carter Schonwald | September 30, 2012 at 6:16 pm | Permalink

    @edward i’m still picking up hoopl and that piece of the haskell ecosystem, what does it lack?

  8. regehr | September 30, 2012 at 9:03 pm | Permalink

    Hi Pascal, at the time that we used CIL I wasn’t yet thinking very hard about the undefined behavior issue. But CIL is very invasive in this respect. Like LLVM, you have to think about it as a compiler: it will pick a single interpretation for a program that may, in its original form, admit many interpretations (including being totally undefined). So any conclusions that you arrive at are predicated on CIL’s choices. If a different compiler picks a different interpretation, all bets are off. If I understand correctly, you have worked around most or all of these issues in Frama-C?

    Our main “problem” with CIL was simply that it failed to be a dataflow framework. Rather, it gives an AST upon which a CFG must be built and a lot of other work done in order to build an abstract interpreter. This was quite a few years ago, though, and it may have improved since then.

    I haven’t looked into extending Frama-C, but can you give me an idea how much code would be required to for example implement the bitwise abstract domain where every bit is independently 0/1/top/bottom? Can I just implement the transfer functions and go?

    Finally, Frama-C has an important benefit that meant to put in the list, but forgot: it provides nice knobs for tuning precision vs. resource usage.

  9. regehr | September 30, 2012 at 9:05 pm | Permalink

    Ed, thanks for the link, I had not heard of CPAChecker but will read up on it!

  10. Andreas Zwinkau | October 1, 2012 at 2:28 am | Permalink

    Note that there are tradeoffs. For example, 6 and 7 somewhat contradict each other. Optimizations might introduce (SSA) values, which cannot be tracked back to variables or even positions in the source file.

  11. Pascal Cuoq | October 1, 2012 at 10:12 am | Permalink

    > CIL is very invasive in this respect [...] If I understand correctly, you have worked around most or all of these issues in Frama-C?

    Yes, we have worked around all the issues that one may care about in Frama-C. The code is normalized but enough information is retained for the verification that the non-normalized code was actually defined. The penultimate normalization bug was found with Csmith and fixed recently:

    http://bts.frama-c.com/view.php?id=1059

    The last remaining bug is:

    http://bts.frama-c.com/view.php?id=69

    This one is rather obscure. While revisiting it, I noticed that the situation had changed in C11. Our normalization is still wrong (it erases bugs that are definitely in the source code). Here is a question that I had while re-investigating this:

    http://stackoverflow.com/q/12676411/139746

    ______

    On the subject of CIL, when I started using it, a function’s CFG was built automatically from the AST, and there were functors to build forward or backward dataflow analyses with ease. The functors’ interfaces were designed for intraprocedural analyses, but it only took ten lines of Caml or so to make an interprocedural context-sensitive analysis from them (with the let module OCaml construct).

  12. Alex Groce | October 1, 2012 at 10:19 am | Permalink

    In CIL’s case, bug erasing is a “feature” right? CIL can be used as a framework for this kind of thing, but it’s always seemed to me to be best as a source->source transformer that reduces ambiguity. The problem, though, is that if you don’t compile the CIL version of the code, you may have different behavior — at JPL we discussed (but didn’t do it, for various good reasons) using the CIL version of some code rather than original source. CIL’s great for transforming at source level for dynamic analysis, at least, because the canonical version doesn’t usually introduce bugs, unless the underlying program is exploiting undefined/impl. defined behavior and getting lucky. I have yet to use Frama-C, is writing analyses there comparable to writing in CIL? I.e., am I writing an ocaml program?

  13. Pascal Cuoq | October 1, 2012 at 11:31 am | Permalink

    @Alex Groce

    Writing a Frama-C plug-in is very much comparable to writing a CIL-based analyzer. A Frama-C plug-in can either be a static analysis or a source-to-source transformation. The same normalizations that CIL provides are applied to the AST, but the information is kept on the side that some of these transformations may have erased bugs.

    – A code transformation plug-in will usually work on the normalized AST and ignore the extraneous information.

    – An analysis plug-in can check that the transformation was indeed correct (e.g. x = ++(*p) + ++ (*q); is transformed into ++(*p); ++(*q); x = *p + *q; and a note is made on the side that p and q must not alias at this point. An analysis plug-in can try to check this condition). Alternately, the analysis plug-in can ignore the issue and blissfully work on the normalized code. The rationale may be that another plug-in can always be used to check for these bugs (the value analysis plug-in detects most of them).

    In short, Frama-C provides the best of both normalization and soundness with respect to the C standard. If all the normalization conditions are checked, the program can be compiled with any compliant C compiler and should behave in a defined manner (modulo small details of which http://bts.frama-c.com/view.php?id=69 is an example).

  14. Justin Bailey | October 1, 2012 at 1:52 pm | Permalink

    Edward already mentioned Hoopl. I wrote an MS thesis that used Hoopl extensively (“Using Dataflow Optimization Techniques with a
    Monadic Intermediate Language”). Chapter 3 gives an introduction to using the Hoopl API, where I describe a simple dead-code elimination pass over a C function. Feel free to check it out at http://mil.codeslower.com/thesis.pdf.

  15. Paul Biggar | October 1, 2012 at 3:14 pm | Permalink

    For my PhD, I wrote a static analyzer and compiler for PHP (http://phpcompiler.org). I wasted a lot of time trying to find “the one true dataflow framework”, which I think doesn’t and probably cant exist. I believe you can really only share a dataflow engine with another analysis if you have an agreed understanding of what and how to perform the analysis.

    For example, suppose you want to implement constant-propagation in an existing dataflow engine. The engine completely constrains how you implement it – you can’t do SCCP in a traditional dataflow-based engine, nor Pioli-Hind-style CCP+pointer-analysis in an intraprocedural engine.

    For my PHP compiler, the semantics of PHP forced my hand to implement my own analysis framework from scratch, which performed multiple analyses at the same time. As we performed interprocedural pointer analysis, we relied on results from type and constant propagation that were performed at the same time. This produced a nice framework where it was easy to include other analyses, but only if they used the symbolic-analysis style we used. A live-ness analysis (or anything needing to go “backwards”) would have been impossible.

    Similarly, you ask about a framework for analysis of dynamic languages like PHP, Perl, Javascript, etc. I don’t believe this is possible. The differences are too great – they all have very different object models, they’re fundamentally different in how they represent variables (Perl and PHP allow you to dynamically modify the symbol table, for example).

    Anyway, if you’re interested in the challenge for PHP and how we made our framework, check out my Thesis (http://paulbiggar.com/research/#phd-dissertation), esp chapters 2 and 6.

  16. regehr | October 1, 2012 at 9:43 pm | Permalink

    Paul, thanks for the link, this is great stuff. You’re right that there’s no one true framework and I agree that the design space for analyses is quite large. On the other hand, frameworks like Soot and SUIF seem to have been successful in significantly reducing the amount of work required to implement new analyses and that’s the real goal here. Modular analyses will never be as fast (and probably not as precise) as the kind of carefully designed combined analyses you are talking about, and that’s not a problem for prototyping purposes.

  17. Nuno Lopes | October 4, 2012 at 1:19 pm | Permalink

    CPAchecker is a software model checker. It uses CIL for parsing C.
    The basic algorithm at its core is the typical CEGAR loop (abstraction refinement) and linear interpolation and so on.
    It’s pretty heavy technology that more “classic” dataflow analysis doesn’t do (nor wants to). It has a dataflow engine, but probably it can only use the abstraction type it was designed for (predicate abstraction).

    (disclosure: I’m a developer of a similar tool)

  18. Manu Sridharan | October 4, 2012 at 1:49 pm | Permalink

    I’ll throw in a plug for WALA:

    http://wala.sf.net

    We’ve been using it to build production-quality program analyses at Watson for years. It has highly-tuned dataflow and constraint solvers, including a tabulation solver for interprocedural analyses. Issues are (1) you’ll have to code in some JVM language and (2) we don’t have a C frontend. But, we’ve been working on cleanly separating out analysis utilities that are language-independent (like the solvers) and putting them in their own project, com.ibm.wala.util. So, if someone has code for building IRs, CFGs, etc. for C code, in principle combining that with our solvers should be easy. I’m happy to try to help if someone wants to push on this.