Souper Results 2

The Souper superoptimizer has made some progress since my last post about it.

We wrote compiler drivers that usually reduce the problem of building a project with Souper to setting CC=sclang and CXX=sclang++. Souper now uses Redis to cache optimizations, so even if the initial build of a program using Souper is slow, subsequent builds will be pretty fast. We fixed several problems that were preventing Souper from building largish programs like LLVM and GCC. This works now and, as far as we know, Souper can be used to optimize arbitrary LLVM code.

Souper now understands the ctpop, ctlz, cttz, and bswap intrinsics. It no longer generates only i1 values, but rather synthesizes constant values of any width. Constant synthesis is not fast and it requires a solver with good support for quantifiers, currently only Z3 (synthesizing constants without quantifiers isn’t hard, we just haven’t implemented that yet). Here’s a list of constants synthesized while building LLVM with Souper. The left side of each line is the number of times the constant on the right side was synthesized. i1 constants dominate but it’s fun to see, for example, that Souper was able to synthesize the 64-bit value 90112 four times. Where did that come from?
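The quantifier-free approach mentioned in passing can be sketched as a counterexample-guided loop. This is only an illustration, not Souper's implementation: the spec and the template are invented, and brute-force search over 8-bit values stands in for the two solver queries.

```python
W = 8  # toy bit-width; Souper synthesizes constants up to 64 bits

def spec(x):
    """Invented specification: clear the low four bits of x."""
    return x & 0b11110000

def cand(x, c):
    """Candidate template x & C; the task is to find the constant C."""
    return x & c

def synthesize():
    # Counterexample-guided loop: guess a constant consistent with the
    # examples seen so far, then search for an input where the guess
    # disagrees with the spec. Brute force replaces the solver here.
    examples = [0]
    while True:
        c = next(c for c in range(1 << W)
                 if all(cand(x, c) == spec(x) for x in examples))
        cex = next((x for x in range(1 << W) if cand(x, c) != spec(x)), None)
        if cex is None:
            return c          # no counterexample: C works for every input
        examples.append(cex)  # refine the guess using the counterexample
```

For this toy spec the loop converges to C = 0xF0 after a handful of counterexamples; no quantified query is ever issued.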

Souper has two main use cases. First, application developers can use Souper directly to optimize code they are compiling. Second, LLVM developers can use Souper to learn about optimizations missed by the existing optimization passes. We’re trying to make it useful to both of these audiences.

To make Souper more useful for compiler developers, we implemented a C-Reduce-like reducer for Souper optimizations. This is necessary because Souper extracts and attempts to optimize pieces of LLVM that are as large as possible, meaning that its optimizations often contain extraneous material. A reduced optimization has the handy invariant that no path condition, UB qualifier (nsw, nuw, exact), or leaf instruction can be removed without breaking the optimization. We did some cross-checking between Souper and Alive, as a sanity check on both tools. Additionally, we convert each Souper optimization back into LLVM and run it through opt -O3 in order to weed out any optimizations that LLVM already knows how to do. For example, Souper loves to prove that icmp eq %0, %0 can be simplified to 1. This is not useful.
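A greedy reduction loop of roughly this shape maintains exactly that invariant. This is a sketch with invented names, not Souper's actual code; `verifies` stands in for the solver check that a candidate is still a valid optimization.

```python
def reduce_opt(components, verifies):
    """Greedy C-Reduce-style reduction sketch (names invented).

    `components` is the set of removable parts of an optimization (path
    conditions, UB qualifiers, leaf instructions); `verifies` stands in
    for the solver call that checks a candidate is still valid.
    """
    components = set(components)
    changed = True
    while changed:
        changed = False
        for c in sorted(components):
            trial = components - {c}
            if verifies(trial):      # removal preserved the optimization
                components = trial   # keep the smaller version
                changed = True
    return components
```

On termination, removing any single remaining component makes `verifies` fail, which is the invariant described above.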

While building LLVM, ~16,000 Souper optimizations fire. Some of these optimizations are duplicates (presumably due to inlining and header inclusion); ~7000 of them are distinct. After reduction there are ~4000 distinct optimizations, and LLVM does not know how to perform ~1500 of them. Even 1500 optimizations is a lot of work to look through, and of course not all of them matter.

To help figure out which optimizations matter, we implemented two kinds of optimization profiling. The first is static profiling, which counts the number of times an optimization is applied at compile time. Implementing optimizations with a high static profile count would tend to reduce the size of the compiler’s generated code. Second, we implemented dynamic profiling, which counts the number of times each optimized piece of code is executed. This is accomplished by instrumenting the compiled program so that it dumps dynamic profile information to a Redis server using an atexit() handler. Implementing optimizations with a high dynamic profile count would tend to decrease the runtime of generated code. Of course, all the standard caveats about profile-driven optimization apply here.

Also keep in mind that Souper is extremely specific while compilers are less so: there is a many-to-one relationship between optimizations discovered by Souper and optimizations you would implement in LLVM. Therefore, it may well be the case that there are collections of low-ranking Souper optimizations that would rank highly if considered as a group, and that could all be implemented by a single LLVM transformation. We’ve experimented a bit with trying to automatically aggregate similar Souper optimizations, but so far I haven’t been too happy with the results.
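The dynamic-profiling instrumentation can be sketched as follows. The names here are invented, and the real instrumentation sends the counts to a Redis server rather than printing them.

```python
import atexit
from collections import Counter

# One counter per optimized code site; Souper keys counters by an
# identifier for the optimization that fired at that site.
site_counts = Counter()

def dump_profile():
    # Runs at program exit; the real pass ships these totals to Redis.
    for site, count in sorted(site_counts.items()):
        print(f"site {site} executed {count} times")

atexit.register(dump_profile)

def optimized_site(site_id):
    """Stands in for the increment injected at an optimized site."""
    site_counts[site_id] += 1

for _ in range(10):
    optimized_site(0)   # a hot site: high dynamic profile count
optimized_site(1)       # a cold site: low dynamic profile count
```

Ranking optimizations by these totals is what produces the dynamic profile counts discussed below.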

If we take a Souper-optimized LLVM and use it to build SPEC CPU 2006, this is the optimization with the highest dynamic profile count; it is executed ~286 million times:

%0:i64 = var
%1:i64 = and 15:i64, %0
%2:i1 = eq 0:i64, %1
pc %2 1:i1
%3:i64 = and 7:i64, %0
%4:i1 = eq 0:i64, %3
cand %4 1:i1

The first four lines tell us that the arbitrary 64-bit value %0 is known to have zeros in its four bottom bits. The last three lines tell us that — of course — %0 has zeros in its three bottom bits. LLVM doesn’t understand this yet, leading to a lot of unnecessary conditional jumps.
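The implication is easy to sanity-check exhaustively at a small width (a check, not the 64-bit proof Souper gets from the solver):

```python
# If (x & 15) == 0 then the low four bits of x are zero, so in
# particular the low three are: (x & 7) == 0. The set bits of 7 are a
# subset of the set bits of 15, which is the whole argument.
assert all(x & 7 == 0 for x in range(1 << 16) if x & 15 == 0)
```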

Here’s the collection of Souper optimizations that are discovered while building LLVM/Clang/Compiler-RT r222538:

The clang binary from a “Release” build with Souper is about 800 KB smaller than the clang built without Souper. Please let us know about any bugs in the output above, including missed optimizations (but don’t tell us about missing vector, FP, or memory optimizations; we know those are not supported yet). In the course of this work Raimondas ran across a Z3 bug; luckily he caught it by cross-checking Souper’s results using a different solver, instead of having to debug the resulting miscompilation.

The main thing that Souper does not do, that you would expect a superoptimizer to do, is to synthesize sequences of instructions. Much of our work over the last six months has been building infrastructure to support instruction synthesis, and almost all of that is now in place. Synthesis is our next major piece of work.

In the meantime, Peter has run Souper over libgo. I would like to build something a bit bigger such as Chromium. If you have a recipe for that, please drop me a line. I got as far as noticing that Chromium builds its own LLVM, at which point my allergy to build systems kicked in. Integrating Souper into a build of the Rust compiler might also produce interesting results; it should be as easy as starting Redis and making sure our opt pass gets loaded in the right places.

Souper is by Peter Collingbourne at Google, by my postdoc Raimondas Sasnauskas, by Yang Chen at nVidia, by my student Jubi Taneja, by Jeroen Ketema at Imperial College London, and by me.

18 thoughts on “Souper Results 2”

  1. Hi,

    I’d be glad to help you get a chromium build going. If sclang understands clang flags, your best bet is probably:

    cd path/to/chromium/src
    GYP_DEFINES='clang_use_chrome_plugins=0 make_clang_dir=/abs/path/to/llvm-build werror=' build/gyp_chromium
    head out/Release/build.ninja # check that cc, cxx point at your compiler
    ninja -C out/Release chrome

    Please reach out via email or find me (thakis) on #chromium in irc.freenode.net.

  2. My first thought on ranking results for investigation is to see how many different code bases find them. Happily, that doesn’t require big code bases, just lots of them.

  3. 90112 is 0x16000, which is a nicely round number. Presumably it’s the virtual address or the size of something.

  4. Great work! This is a dream come true ( see https://gcc.gnu.org/ml/gcc-help/2011-04/msg00254.html ).

    I love that you use redis to cache optimizations. Such a database could be built and maintained. It could even be partially parametric and then would apply to more (though search times and insert times would be longer). I’m really looking forward to this. Reminds me of Nathan Fain using such a database for compiled embedded code so people could do decompilation by simply looking up a piece of compiled code in the database. People could send in compiled-original code pairs and improve the DB.

    I also like that STP was the system you used to find the bug in Z3 (full disclosure: I am a developer of STP). I would be even happier if you could put someone on STP to get it to the point where you could replace Z3 with it. Such work would allow the results to be used in open-source projects.

    Good luck and looking forward to your future results!

  5. Hi Mate, thanks for that link! STP has been great, it’s usually extremely fast. Adding generalization to the optimization database is one of my long-term goals. It will definitely be interesting.

  6. The top few smallest optimizations seem to say that “udiv %anything, 0” is 0. Can you explain this please? The LLVM LangRef says that division by zero leads to undefined behavior. Thanks.

  7. Hi Jay, “undefined behavior” is just a way to say “anything may happen if you do this”. So it is perfectly legal for Souper to give a zero result for these operations. There are, of course, stronger things that it could do, such as synthesizing the undef value or the unreachable instruction. It just doesn’t yet know how to do these things.

  8. I looked at this one which is close to the top of both the static and dynamic lists:

    %0:i1 = var
    %1:i64 = var
    %2:i1 = ult 2305843009213693951:i64, %1
    %3:i1 = or %0, %2
    %4:i64 = select %3, 2305843009213693951:i64, %1
    %5:i1 = ult 2305843009213693951:i64, %4
    cand %5 0:i1

    I turned it into this LLVM IR:

    define i1 @foo(i1 %r0, i64 %r1) {
    %r2 = icmp ult i64 2305843009213693951, %r1
    %r3 = or i1 %r0, %r2
    %r4 = select i1 %r3, i64 2305843009213693951, i64 %r1
    %r5 = icmp ult i64 2305843009213693951, %r4
    ret i1 %r5
    }

    Running opt -O1 happily reduces this to:

    define i1 @foo(i1 %r0, i64 %r1) {
    ret i1 false
    }

    In fact these passes are enough: -instcombine -gvn -instcombine

    So LLVM already knows this optimization. Why isn’t it getting it in your case? My guess is that some of the values in the original code have multiple uses, uses not shown in your example. For example, consider the following variant:

    declare void @bar(i64)
    define i1 @foo(i1 %r0, i64 %r1) {
    %r2 = icmp ult i64 2305843009213693951, %r1
    %r3 = or i1 %r0, %r2
    %r4 = select i1 %r3, i64 2305843009213693951, i64 %r1
    %r5 = icmp ult i64 2305843009213693951, %r4
    call void @bar(i64 %r4)
    ret i1 %r5
    }

    Then opt doesn’t simplify it at all, even at -O3. For the original IR, instcombine transforms

    %r4 = select i1 %r3, i64 2305843009213693951, i64 %r1
    %r5 = icmp ult i64 2305843009213693951, %r4

    into

    %r51 = icmp ugt i64 %r1, 2305843009213693951
    %not.r3 = xor i1 %r3, true
    %r5 = and i1 %r51, %not.r3

    which is what enables the other passes to turn the whole function into a constant. In the variant with an additional use of %r4 it doesn’t do this because it would have to preserve the select in order to pass it to @bar, and having both the select and the transformed code would result in the function having more instructions and doing more work.

    So I think it is essential to keep track of how many “external” uses each instruction has (i.e. uses that exist outside the code snippet you are analysing), and teach the super-optimizer that simplified code must continue to compute any values with external uses.

  9. Hi Duncan, here’s what we turn it into:

    @dummy1 = global i32 0, align 4
    @dummy2 = global i32 0, align 4
    @dummy3 = global i32 0, align 4

    define void @foo(i64 %x2, i1 %x1) {
    foo0:
    %0 = icmp ult i64 2305843009213693951, %x2
    %1 = or i1 %x1, %0
    %2 = select i1 %1, i64 2305843009213693951, i64 %x2
    %3 = icmp ult i64 2305843009213693951, %2
    %cand = icmp eq i1 0, %3
    br i1 %cand, label %return, label %dead
    return:
    %dummy1w = atomicrmw add i32* @dummy1, i32 1 monotonic
    ret void
    dead:
    %dummy2w = atomicrmw add i32* @dummy2, i32 1 monotonic
    ret void
    }

    Is this translation bogus somehow? The extra icmp comes from the fact that we’re not always computing an i1. Of course we could add a special case for i1, maybe that’s a good idea. I would not have expected it to matter, though.

    Then I get this:

    $ llvm-as foo.ll -o - | opt -O3 | llvm-dis
    ; ModuleID = '
    target datalayout = "e-m:e-i64:64-f80:128-n8:16:32:64-S128"

    @dummy1 = global i32 0, align 4
    @dummy2 = global i32 0, align 4
    @dummy3 = global i32 0, align 4

    ; Function Attrs: nounwind
    define void @foo(i64 %x2, i1 %x1) #0 {
    foo0:
    %0 = icmp ugt i64 %x2, 2305843009213693951
    %1 = or i1 %0, %x1
    %.not = icmp ult i64 %x2, 2305843009213693952
    %cand = or i1 %1, %.not
    br i1 %cand, label %return, label %dead

    return: ; preds = %foo0
    %dummy1w = atomicrmw add i32* @dummy1, i32 1 monotonic
    ret void

    dead: ; preds = %foo0
    %dummy2w = atomicrmw add i32* @dummy2, i32 1 monotonic
    ret void
    }

    attributes #0 = { nounwind }

  10. Hi Duncan, you said:

    “I think it is essential to keep track of how many “external” uses each instruction has (i.e. uses that exist outside the code snippet you are analysing), and teach the super-optimizer that simplified code must continue to compute any values with external uses.”

    I agree that we need to be careful about this sort of thing, but I don’t think Souper has any problems so far. Turning an LLVM instruction into a constant value should always be profitable, right? So I don’t think tracking the number of uses matters yet. Certainly it will matter once we start doing instruction synthesis.

    Optimization profitability is something that I’m trying to figure out right now. I’d love to get your feedback on this. Most of the profitability I’ve seen in InstCombine has been really simple.

    Regarding correctness, as in “teach the super-optimizer that simplified code must continue to compute any values with external uses” — I don’t think this is a problem for us. We simply replace an instruction with a constant value and let LLVM clean up the rest.

  11. Hi John and Duncan,

    I’m pretty sure I’ve fixed the most egregious cases of this going wrong with r222928. It is a little conservative because I wanted to ensure that we wouldn’t end up with a net increase of IR.

  12. Hi John, re comment 11, yes, it looks like the extra icmp made a difference. With that extra icmp, in your optimized IR you get these two comparisons:

    %0 = icmp ugt i64 %x2, 2305843009213693951
    %.not = icmp ult i64 %x2, 2305843009213693952

    These look temptingly similar but are not quite the same. Without the extra icmp you also get two comparisons after running instcombine, but they are *exactly* the same. GVN then merges them into one comparison, and the next run of instcombine reduces everything down to a constant.

  13. Thanks Duncan!

    I was just rereading your 2011 superoptimizer slides, there’s a lot of great information about profitability in there.

  14. Hi John, re comment 12. Something very similar to constant folding but a bit more general is what my super-optimizer does: look for reductions to sub-expressions that are already present, i.e. that don’t require synthesizing any instructions. For example X + (Y - X) -> Y.

    This discovered many optimizations that LLVM wasn’t doing. But for many of them it turned out that LLVM did know them, but wasn’t doing them due to all the extra uses floating around (causing it to think that the intermediate reductions needed for the overall optimization weren’t a win). As you correctly point out, if you are only reducing to a constant, or to an existing expression, you don’t have to worry about uses. So my remark about uses was silly given the current state of souper.

    The only other kind of simplification my super-optimizer could handle was detecting “unused arguments”. For example, consider X - (X + Y). This simplifies to -Y, but as -Y is a new expression my super-optimizer couldn’t handle that. However it could detect that the value of X - (X + Y) doesn’t depend on the value of X. You can therefore deduce that you can replace X with a constant (e.g. 0, but could be anything) without changing the value of the expression; this shows the expression is equal to 0 - (0 + Y), aka -Y. The main problem with this last kind of simplification is that it is not free, because you synthesize a new expression. However the X + (Y - X) -> Y kind of simplification is free, like folding to a constant: always a win.
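    As a quick sanity check, both identities hold exactly in fixed-width wrapping arithmetic; an exhaustive 8-bit check (not a 64-bit proof, but reassuring) confirms them:

```python
W = 8
MASK = (1 << W) - 1  # emulate W-bit wrapping (two's complement) arithmetic

for x in range(1 << W):
    for y in range(1 << W):
        # X + (Y - X) -> Y: reduces to an already-present value
        assert (x + ((y - x) & MASK)) & MASK == y
        # X - (X + Y) -> -Y: needs the new expression -Y synthesized
        assert (x - ((x + y) & MASK)) & MASK == (-y) & MASK
```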

    Or is it? In fact this kind of simplification tends to increase register pressure by making expressions like Y live longer. This can cause extra stack spills and worse code, even though everything looks much better at the IR level. My benchmarking showed that this was sometimes a real issue (mostly it wasn’t though). Reducing to a constant isn’t always a win either (depends a lot on the architecture), as the backend guys will be happy to tell you.

    How can you tell from the IR if something is a win or not? It is essentially not possible: the problem is that the IR is far far away from the machine, and knowing that “doing this optimization causes you to run out of registers” is hard (instcombine has a few hacks to reduce register pressure but they are really nasty). This is why instcombine doesn’t have a sophisticated cost model: you are so far away from the machine that saying “fewer instructions is better” is about as good as it gets.

    That’s why some people have a different viewpoint about instcombine and other early LLVM passes: that they aren’t about producing faster code, they are about canonicalization: recognizing that all kinds of exotic expressions are actually computing the same thing, and turning them all into that thing, making life easier for the rest of the compiler. From this viewpoint safe optimizations are those that
    1) fold to a constant (though even these are not always a win…), or
    2) can be undone by the target.
    I.e. optimizations need to be in some sense invertible.

    If you want sophisticated cost models and all that, you need to be working at the level of the backend, with the SelectionDAG and target specific nodes. This might be something to look into.

  15. Duncan, thanks for the thoughts. I think we are generally in agreement about all of this. The IR-level superoptimizer is necessary because it often enables DCE and other optimizations, but it should not try to be too clever.

    A target-specific superoptimizer would be really useful, I think, and also it sounds fun to work on!
