A Few Synthesizing Superoptimizer Results


For this post, I crippled Souper by disabling its path conditions and limiting the depth of harvested expressions to two LLVM instructions. The first goal was to create a nice easy burn-in test for Souper’s instruction synthesizer, which uses a variant of this method; the second goal was to see if depth-limited, path-condition-free expressions would result in optimizations that are easier to understand and implement in LLVM.

We harvested LLVM expressions from SPEC CINT 2006, leaving out omnet, which caused a compiler crash. For each of the ~30,000 unique harvested expressions, we ran Souper’s synthesizer, which attempts to create a functionally equivalent but cheaper version of the expression. The results are ranked by the average of each optimization’s rank by static profile count and its rank by dynamic profile count.

Here are the results.

One thing you’ll notice is a lot of this sort of thing:

%0:i8 = var
%1:i8 = and 1:i8, %0
%2:i1 = ne 0:i8, %1
infer %2
%3:i1 = trunc %0
result %3

The backends don’t usually care which of these they get, but even so this seems like a nice easy simplification to implement at the IR level.
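
To make this concrete, here is a minimal sketch of what the rewrite might look like using LLVM’s PatternMatch helpers; the function name and where such a thing would live are my own invention, not actual InstCombine code:

#include "llvm/IR/IRBuilder.h"
#include "llvm/IR/Instructions.h"
#include "llvm/IR/PatternMatch.h"

using namespace llvm;
using namespace llvm::PatternMatch;

// (x & 1) != 0  -->  trunc x to i1
static Value *simplifyLowBitTest(ICmpInst &Cmp, IRBuilder<> &B) {
  Value *X;
  // Match: icmp ne (and X, 1), 0 -- the constant sits on the RHS of the
  // and after canonicalization.
  if (Cmp.getPredicate() == ICmpInst::ICMP_NE &&
      match(Cmp.getOperand(0), m_And(m_Value(X), m_One())) &&
      match(Cmp.getOperand(1), m_Zero()))
    return B.CreateTrunc(X, B.getInt1Ty()); // keep only the low bit
  return nullptr;
}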

My favorite optimizations from this batch are where Souper uses bit-level tricks to remove select instructions; here’s an example where the optimized version seems to lead to better code on the architectures that I’ve tried:

%0:i1 = var
%1:i64 = var
%2:i64 = select %0, %1, 1:i64
%3:i1 = eq 0:i64, %2
infer %3
%4:i64 = zext %0
%5:i1 = ult %1, %4
result %5
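
To see why this is correct: when %0 is true the select yields %1 and zext(%0) is 1, so both sides reduce to asking whether %1 is zero; when %0 is false the select yields 1, which is nonzero, and %1 u< 0 is false for every %1. Here is a quick brute-force sanity check of the identity at a reduced 8-bit width (my own illustration, not part of Souper):

#include <cassert>
#include <cstdint>

int main() {
  for (unsigned c = 0; c <= 1; ++c) {      // the i1 condition %0
    for (unsigned x = 0; x < 256; ++x) {   // %1, shrunk to 8 bits
      uint8_t sel = c ? (uint8_t)x : 1;    // %2 = select %0, %1, 1
      bool lhs = (sel == 0);               // %3 = eq 0, %2
      bool rhs = (uint8_t)x < (uint8_t)c;  // %5 = ult %1, zext(%0)
      assert(lhs == rhs);
    }
  }
  return 0;
}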

On the other hand, it is clear that there are many select-removal “optimizations” in this batch that are just not a good idea. We are not finding it easy to automatically choose which of two LLVM expressions is preferable. For now we just say that most instructions cost 1, select and div/rem cost 3, and constants and phis are free. Obviously this is just a first hack. We’d appreciate feedback on this aspect of the work (and others, of course). And obviously let us know if you find bugs in the results.
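
For the curious, the cost model amounts to something like the sketch below, written against a hypothetical opcode enum rather than Souper’s real data structures:

enum class Op { Const, Phi, Select, UDiv, SDiv, URem, SRem,
                Add, Sub, And, Or, Xor, Shl, LShr, AShr /* ... */ };

unsigned instCost(Op k) {
  switch (k) {
  case Op::Const:
  case Op::Phi:
    return 0; // constants and phis are free
  case Op::Select:
  case Op::UDiv: case Op::SDiv:
  case Op::URem: case Op::SRem:
    return 3; // select and div/rem are penalized
  default:
    return 1; // everything else
  }
}

A model this coarse has no idea whether, say, a select will end up as a branch or a cmov on a given target, which is part of why some of the select removals are questionable.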

We might ask what can cause Souper to not find an optimization. First, Souper’s grab-bag of components for synthesis contains one of each of the following: shl, lshr, ashr, and, or, xor, add, sub, slt, ult, sle, ule, eq, ne, and constant. If an optimization requires, for example, two xors, we won’t find it. Of course we could give the synthesizer more components to work with but that slows it down. Sext/zext/trunc aren’t considered basic components by Souper, but are added as necessary to make bitwidths match up. This leads us to the second cause of missed optimizations: Souper won’t make up bitwidths not found in the original code. So, if there’s a trick where the arguments need to be extended from 32 bits to 64 and then truncated later, we won’t find that. Again this isn’t hard to fix, but it does appear to be hard to fix without slowing down synthesis a lot. Finally, if the solver (Z3 here) uses too much time or memory, or if some limits on loops inside the synthesizer are exceeded, we’ll miss out on whatever we might have found if we were willing to use more resources.

I’d like to distill the synthesis problems that come from optimizing LLVM code into some sort of synthesis benchmark that researchers can use to evaluate their work. We would say that your synthesis engine is better than ours (on this benchmark) if it can save more total static or dynamic profile weight in a given amount of time, or if it can produce equivalent results in less time. Of course we’ll have to agree on a cost model. I attended the 2015 SMT Workshop the other day (where I talked about Souper!) and it sounds like there is plenty of interest in the solver community in providing better support for synthesis, which is great!

We have a medium-term goal of taking an optimization discovered by Souper and generalizing it into something that is very close to what an LLVM developer would implement. Since Souper has no capacity for abstraction, we’ll turn its optimizations into Alive patterns and then derive not only any preconditions that need to hold before the optimization can fire but also code for converting constant values on the left-hand side into constants on the right-hand side. For example, let’s say that Souper finds this optimization:

%1 = lshr i32 %in, 3
%out = shl nuw i32 %1, 3
  =>
%out = and i32 4294967288, %in

The general form that we’d like to derive is just:

%1 = lshr %in, C
%out = shl %1, C
  =>
%out = and -(1<<C), %in

Can we derive -(1<<C) from scratch? Time will tell. If this works, it’ll be useful for two reasons. First, it’ll effectively deduplicate Souper’s output: many specific optimizations will all boil down to the same general one. Second, the optimization will be easier to implement in LLVM if a human doesn’t have to do the generalization work. Some work by Nuno Lopes contains the basic ideas that we’ll use.
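
As a quick sanity check on the generalized right-hand side (my own test, not part of Souper or Alive), masking with -(1<<C) really does match the lshr-then-shl pair for 32-bit values, and at C = 3 the mask is exactly the 4294967288 that Souper found:

#include <cassert>
#include <cstdint>

int main() {
  const uint32_t samples[] = {0u, 1u, 7u, 8u, 0xdeadbeefu, 0x80000000u, 0xffffffffu};
  for (unsigned C = 0; C < 32; ++C) {
    uint32_t mask = -(1u << C);              // the generalized constant -(1<<C)
    if (C == 3)
      assert(mask == 4294967288u);           // the constant in the concrete version
    for (uint32_t x : samples)
      assert(((x >> C) << C) == (x & mask)); // lshr-then-shl == and with mask
  }
  return 0;
}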

Something else I’ve been thinking about is the fact that all compilers have weaknesses of the kind that we point out in this post. The OPTGEN paper has some concrete data for GCC, LLVM, and Intel CC. I’ve heard that the Microsoft C/C++ compiler could stand to be improved in this respect too. CompCert needs a ton of work, the last time I checked. Wouldn’t it be cool if the work we’re doing for Souper and Alive could apply more broadly than just to LLVM? One observation is that if we deal only with integer values, the IR semantics in each of these compilers is probably broadly the same (leaving out some tricky issues like IR-level undefined behaviors). There’s probably no fundamental reason we can’t just reuse integer optimizations across multiple compilers.

Here are a few offers I’d like to extend:

  1. If you are an LLVM frontend developer, I’d like to run Souper on some representative code from your frontend. It would be easiest if you harvested optimized LLVM code into Souper yourself (I’ll have to write up instructions) but if not, you can send me some bitcode files — in this case we won’t get dynamic profile information. A few GB would not be too much. The effect will be that we can help the LLVM middle-end optimizers do a better job on code that you care about.
  2. If you develop a compiler other than LLVM, even a closed-source one, let’s figure out a way to translate an integer subset of your IR into Souper, so we can look for missed optimizations. My guess is that if you target a subset of Souper that doesn’t have phis or path conditions, this isn’t very difficult. You will basically end up doing the same kind of depth-bounded expression harvesting that I did.
  3. If you are doing research on program synthesis, and if you might be interested in working towards some benchmarks about optimizing LLVM code, let me know and let’s figure out how to make that happen.

The next set of results that I post will be similar to these, but they’ll show off Souper’s ability to exploit the results of one or two of LLVM’s dataflow analyses. Finally I want to add that although I’ve been writing up these posts, Souper is a group effort.


5 responses to “A Few Synthesizing Superoptimizer Results”

  1. Regarding getting a better cost function, how hard would it be to synthesize large-scale random benchmarks? That is, generate random-but-“correct” segments of IR, compile them to the arch of choice via just the backend, and benchmark them on real hardware. Given enough of those, training a cost function should be possible via standard turnkey methods. (And re-running, for new architectures or any time something changes, should become just a matter of compute time.)

    Regarding self-imposed resource limits, are those limits scaled from profiling data?

  2. One problem with using LLVM IR for timing data is that LLVM doesn’t implement some particularly common assembly operations as single IR instructions: most notably, there’s no rotate-left/rotate-right intrinsic, and operations that use the carry bit are also difficult (I’m not even sure the MachineInst lowering can select these).

  3. Hi bcs, I like the idea of scaling resources with profile counts, hadn’t thought of that! So far all we’ve done is just stop running synthesis for stuff that isn’t running that much.

    Regarding learning a cost function, I’ve thought about that a bit. I think it’s moderately painful. For example, does the model we train up on a Haswell work pretty OK for most x64s or do we need to find a bunch of other implementations of that architecture? Also, accurately benchmarking codes that contain just a few instructions isn’t easy due to pipeline effects. Also I’m not too excited about tracking down and taming a bunch of ARM implementations. But anyhow I agree that at some level it’s the right answer, I’m just not sure how to make it work with reasonable effort.

  4. Joshua, yeah, and (referring to comments #1 and #3 now) it might not be easy to train up a cost model that is reliably sensitive to things like rotate idioms.

    Fundamentally, LLVM-level optimization is tough and I think Souper’s primary goal should be to get rid of instructions without screwing up the canonicalization. The secondary goal is to replace expensivish stuff with cheaper stuff. I’m psyched about some of the select removals since cmov isn’t as cheap as regular ALU instructions, though I understand that it has gotten much cheaper in recent years.

  5. I’d think the way to tame the cost function problem would be to encapsulate it as best you can so that it’s easy for others to run (if someone is interested in having a good cost function for arch-x, they likely have the compiler and hardware setup needed for arch-x). The random-snippets part should be rather arch-agnostic, as should the part that runs on the CPU, and the ML part can be done offline, so it certainly is. If the whole process can be reduced to “run program X on the arch-of-interest with the compiler as a flag, then run program Y on the file generated from X,” then getting results for different arches becomes an exercise in crowdsourcing.

    In one regard, this sort of automation is very necessary: the weight function will depend not just on the CPU but on the arch-specific optimizations in the back-end, which implies that the function should be retrained periodically and that the middle-end optimizations should then be re-evaluated (which really argues for the middle-end being generated from something like Alive data).

    As to accuracy, there is no particular reason the sequences need be short. And you can always fold them into a loop to make them longer. The more interesting generation problem would likely be how to generate random and interesting branchy code.

    An interesting exercise, once the function training is turnkey, would be to cross-check the models: e.g. compare the r² for the Haswell training data fed into the Haswell model with the r² for the Haswell training data fed into the Ivybridge model. This would provide an objective measure of similarity between different CPUs, which could answer your question about generality.