Dataflow Analyses and Compiler Optimizations that Use Them, for Free


Compilers can be improved over time, but this is a slow process. “Proebsting’s Law” is an old joke suggesting that advances in compiler optimization will double the speed of a computation every 18 years — but if anything this is optimistic. Slow compiler evolution is never a good thing, but it is particularly problematic in today’s environment of rapid innovation in GPUs, TPUs, and other entertaining platforms.

One of my research group’s major goals is to create technologies that enable self-improving compilers. Taking humans out of the compiler-improvement loop will make this process orders of magnitude faster, and the resulting compilers will tend to be correct by construction. One such technology is superoptimization, where we use an expensive search procedure to discover optimizations that are missing from a compiler. Another is generalization, which takes a specific optimization (perhaps, but not necessarily, discovered by a superoptimizer) and turns it into a broadly applicable form suitable for inclusion in a production compiler.

Together with a representative benchmark suite, superoptimization + generalization will result in a fully automated self-improvement loop for one part of an optimizing compiler: the peephole optimizer. In the rest of this piece I’ll sketch out an expanded version of this self-improvement loop that includes dataflow analyses.

The goal of a dataflow analysis is to compute useful facts that are true in every execution of the program being compiled. For example, if we can prove that x is always in the range [5..15], then we don’t need to emit an array bound check when x is used as an index into a 20-element array. This particular dataflow analysis is integer range analysis, and compilers such as GCC and LLVM perform it during every optimizing compile. Another analysis — one that LLVM leans on particularly heavily — is “known bits,” which tries to prove that individual bits of SSA values are zero or one in all executions.
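To make the rest of this concrete, here is a minimal sketch of what facts in these two domains might look like, for unsigned values at a small fixed bitwidth. The names and representation are made up for illustration; this is not LLVM’s actual code.

    # Hypothetical, simplified representations of the two abstract domains
    # mentioned above, for unsigned values of a small fixed width.
    from dataclasses import dataclass

    WIDTH = 8              # a small bitwidth keeps later examples easy to check
    MASK = (1 << WIDTH) - 1

    @dataclass(frozen=True)
    class Range:
        lo: int            # inclusive lower bound
        hi: int            # inclusive upper bound

        def contains(self, x: int) -> bool:
            return self.lo <= x <= self.hi

    @dataclass(frozen=True)
    class KnownBits:
        zeros: int         # bit set here => that bit is 0 in every execution
        ones: int          # bit set here => that bit is 1 in every execution

        def contains(self, x: int) -> bool:
            return (x & self.zeros) == 0 and (x & self.ones) == self.ones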

Out in the literature we can find a huge number of dataflow analyses, some of which are useful for optimizing some kinds of code — but it’s hard to know which ones to actually implement. We can try out different ones, but it’s a lot of work to implement even one new dataflow analysis in a production compiler. The effort can be divided into two major parts. First, implementing the analysis itself, which requires creating an abstract version of each instruction in the compiler’s IR: these are called dataflow transfer functions. For example, to implement the addition operation for integer ranges, we can use [lo1, hi1] + [lo2, hi2] = [lo1 + lo2, hi1 + hi2] as the transfer function. But even this particularly easy case becomes trickier if we have to handle overflows, and then writing a correct and precise transfer function for bitwise operators is much less straightforward. Similarly, consider writing a correct and precise known bits transfer function for multiplication. This is not easy! Then, once we’ve finished this job, we’re left with the second piece of work, which is to implement optimizations that take advantage of the new dataflow facts.
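Here is roughly what the easy case looks like, continuing the toy Range sketch from above. The overflow handling shown is the simplest sound choice rather than the most precise one.

    TOP = Range(0, MASK)   # "we know nothing": every WIDTH-bit value is possible

    def add_ranges(a: Range, b: Range) -> Range:
        # textbook rule: [lo1, hi1] + [lo2, hi2] = [lo1 + lo2, hi1 + hi2] ...
        lo, hi = a.lo + b.lo, a.hi + b.hi
        if hi > MASK:
            # ... but once the addition can wrap, those bounds are no longer
            # sound; a fancier transfer function could return a wrapped
            # interval here instead of giving up entirely
            return TOP
        return Range(lo, hi)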

Can we automate both of these pieces of work? We can! There’s an initial bit of work in creating a representation for dataflow facts and formalizing their meaning that cannot be automated, but this is not difficult stuff. Then, to automatically create the dataflow transfer functions, we turn to this very nice paper, which synthesizes them by squeezing the synthesized code between a hard soundness constraint and a soft precision constraint. Basically, every dataflow analysis ends up making approximations, but these approximations can only be in one direction, or else analysis results can’t be used to justify compiler optimizations. The paper leaves some work to be done in making this all practical in a production compiler, but it looks to me like this should mainly be a matter of engineering.
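At a toy bitwidth we can state both constraints exhaustively, which gives a feel for the game the synthesizer is playing: soundness is non-negotiable, while imprecision is merely penalized. This continues the sketch above and is only illustrative.

    def is_sound(transfer, concrete_op, a: Range, b: Range) -> bool:
        # hard constraint: the abstract result must cover every concrete result
        out = transfer(a, b)
        return all(out.contains(concrete_op(x, y) & MASK)
                   for x in range(a.lo, a.hi + 1)
                   for y in range(b.lo, b.hi + 1))

    def imprecision(transfer, concrete_op, a: Range, b: Range) -> int:
        # soft constraint: how many values does the abstract result admit
        # that no execution can actually produce?
        out = transfer(a, b)
        exact = {concrete_op(x, y) & MASK
                 for x in range(a.lo, a.hi + 1)
                 for y in range(b.lo, b.hi + 1)}
        return sum(1 for v in range(out.lo, out.hi + 1) if v not in exact)

    # add_ranges is sound, and in the non-overflowing case it is also exact
    assert is_sound(add_ranges, lambda x, y: x + y, Range(5, 15), Range(0, 4))
    assert imprecision(add_ranges, lambda x, y: x + y, Range(5, 15), Range(0, 4)) == 0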

One property of instruction-level transfer functions is that precision gets lost across instruction boundaries. We can mitigate this by finding collections of instructions commonly found together (such as those implementing a minimum or maximum operation) and synthesizing a transfer function for the aggregate operation. We can also gain back precision by special-casing the situation where both arguments to an instruction come from the same source. We don’t tend to do these things when writing dataflow transfer functions by hand, but in an automated workflow they would be no problem at all. Another thing we’d like to automate is creating efficient and precise product operators that allow dataflow analyses to exchange information with each other.
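Continuing the toy sketch, here is what those two precision-recovering tricks could look like: an exact transfer function for an aggregate minimum (typically a compare plus a select in the IR), and a special case for subtraction when both operands are the same SSA value. These are hypothetical illustrations, not code from any compiler.

    def min_ranges(a: Range, b: Range) -> Range:
        # exact for the aggregate "minimum" pattern: the smaller of two values
        # is bounded by the smaller lower bound and the smaller upper bound
        return Range(min(a.lo, b.lo), min(a.hi, b.hi))

    def sub_ranges_same_source(a: Range) -> Range:
        # special case for "x - x" where both operands are the same SSA value;
        # the generic subtraction rule could only conclude
        # [a.lo - a.hi, a.hi - a.lo], which merely contains zero
        return Range(0, 0)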

Given a collection of dataflow transfer functions, creating a dataflow analysis is a matter of plugging them into a generic dataflow framework that applies transfer functions until a fixpoint is reached. This is all old hat. The result of a dataflow analysis is a collection of dataflow facts attached to each instruction in a file that is being compiled.
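For completeness, here is a sketch of that generic part, continuing the toy interval domain from above and using a made-up IR where each instruction is a (result, opcode, operands) triple. It is the standard worklist algorithm, not any particular compiler’s implementation.

    def join(a: Range, b: Range) -> Range:
        # least upper bound in the interval lattice:
        # the smallest range covering both arguments
        return Range(min(a.lo, b.lo), max(a.hi, b.hi))

    def analyze(instructions, transfer_for):
        """instructions: list of (result, opcode, operand_names) triples;
        transfer_for: maps an opcode to its transfer function
        (a phi node's transfer function can simply be join)."""
        facts = {name: TOP for name, _, _ in instructions}  # start knowing nothing
        users = {}
        for inst in instructions:
            for operand in inst[2]:
                users.setdefault(operand, []).append(inst)

        worklist = list(instructions)
        while worklist:
            name, opcode, operands = worklist.pop()
            new = transfer_for[opcode](*(facts.get(o, TOP) for o in operands))
            if new != facts[name]:
                # the fact changed, so everything that uses it must be revisited
                facts[name] = new
                worklist.extend(users.get(name, []))
        return facts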

To automatically make use of dataflow facts to drive optimizations, we can use a superoptimizer. For example, we taught Souper to use several of LLVM’s dataflow results. This is easy stuff compared to creating a superoptimizer in the first place: basically, we can reuse the same formalization of the dataflow analysis that we already created in order to synthesize transfer functions. Then, the generalization engine also needs to fully support dataflow analyses; our Hydra tool already does a great job at this, and there are plenty of details in the paper.

Now that we’ve closed the loop, let’s ask: are there interesting dataflow analyses missing from LLVM that we should implement? Of course I don’t know for sure, but one domain that I’ve long been interested in trying out is “congruences,” where for a variable v we try to prove that it always satisfies v = ax + b for a pair of constants a and b. This sort of domain is useful for tracking values that point into an array of structs, where a is the struct size and b is the offset of one of its fields.
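Here is a quick sketch of what this domain and a couple of its transfer functions could look like; again, this is illustrative pseudocode rather than a proposal for a real implementation. A fact (a, b) means the value always has the form a*x + b for some integer x, i.e. it is congruent to b modulo a (with a = 0 meaning the value is exactly the constant b).

    from dataclasses import dataclass
    from math import gcd

    @dataclass(frozen=True)
    class Congruence:
        a: int   # stride; a == 0 means the value is exactly the constant b
        b: int   # offset, kept in [0, a) whenever a > 0

    def make(a: int, b: int) -> Congruence:
        return Congruence(a, b % a if a else b)

    def add(c1: Congruence, c2: Congruence) -> Congruence:
        # (a1*x + b1) + (a2*y + b2) is (b1 + b2) plus a multiple of gcd(a1, a2)
        return make(gcd(c1.a, c2.a), c1.b + c2.b)

    def mul(c1: Congruence, c2: Congruence) -> Congruence:
        # expanding (a1*x + b1) * (a2*y + b2), every non-constant term is a
        # multiple of a1*a2, a1*b2, or a2*b1
        return make(gcd(gcd(c1.a * c2.a, c1.a * c2.b), c2.a * c1.b), c1.b * c2.b)

    # a pointer of the form 16*i + 4 (the offset-4 field in a 16-byte struct),
    # plus any multiple of 16, still points at an offset-4 field
    assert add(make(16, 4), make(16, 0)) == make(16, 4)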

Our current generation of production compilers, at the implementation level, is somewhat divorced from the mathematical foundations of compilation. In the future we’ll instead derive parts of compiler implementations — such as dataflow analyses and peephole optimizations — directly from these foundations.


One response to “Dataflow Analyses and Compiler Optimizations that Use Them, for Free”

  1. Is peephole optimization an example of the kind of search problem that suffers from getting stuck in local maxima?

    If so, it seems like an autonomously expanding set of generated optimizations is likely to shift the complexity towards (and expand the complexity of) the “meta” problem of what order to attempt to apply things in, which I could easily see being a chaotic system (applying B before A prevents Z from ever being needed and makes a never-before-seen pattern X a high-leverage target). Just offhand, automating *that* seems like a field with a lot of different opportunities for research. (E.g. it seems like a problem space that could have a lot of overlap with problems worked on in AlphaGo and that could use some of the same classes of solutions.)

    And then there is the meta-meta problem: can superoptimizers get trapped in local maxima with regards to the set of available optimizations?
