Skip to content

A Few Synthesizing Superoptimizer Results

For this post, I crippled Souper by disabling its path conditions and limiting the depth of harvested expressions to two LLVM instructions. The first goal was to create a nice easy burn-in test for Souper’s instruction synthesizer, which uses a variant of this method; the second goal was to see if depth-limited, path-condition-free expressions would result in optimizations that are easier to understand and implement in LLVM.

We harvested LLVM expressions from SPEC CINT 2006, leaving out omnet, which caused a compiler crash. For each of the ~30,000 unique harvested expressions, we ran Souper’s synthesizer, which attempts to create a functionally equivalent but cheaper version of the expression. The results are ranked by the average of the optimization’s rank in terms of static and dynamic profile counts.

Here are the results

One thing you’ll notice is a lot of this sort of thing:

%0:i8 = var
%1:i8 = and 1:i8, %0
%2:i1 = ne 0:i8, %1
infer %2
%3:i1 = trunc %0
result %3

The backends don’t usually care which of these they get, but even so this seems like a nice easy simplification to implement at the IR level.

My favorite optimizations from this batch are where Souper uses bit-level tricks to remove select instructions; here’s an example where the optimized version seems to lead to better code on the architectures that I’ve tried:

%0:i1 = var
%1:i64 = var
%2:i64 = select %0, %1, 1:i64
%3:i1 = eq 0:i64, %2
infer %3
%4:i64 = zext %0
%5:i1 = ult %1, %4
result %5

On the other hand, it is clear that there are many select-removal “optimizations” in this batch that are just not a good idea. We are not finding it easy to automatically choose which of two LLVM expressions are preferable. For now we just say that most instructions cost 1, select and div/rem cost 3, and constants and phis are free. Obviously this is just a first hack. We’d appreciate feedback on this aspect of the work (and others, of course). And obviously let us know if you find bugs in the results.

We might ask what can cause Souper to not find an optimization. First, Souper’s grab-bag of components for synthesis contains one of each of the following: shl, lshr, ashr, and, or, xor, add, sub, slt, ult, sle, ule, eq, ne, and constant. If an optimization requires, for example, two xors, we won’t find it. Of course we could give the synthesizer more components to work with but that slows it down. Sext/zext/trunc aren’t considered basic components by Souper, but are added as necessary to make bitwidths match up. This leads us to the second cause of missed optimizations: Souper won’t make up bitwidths not found in the original code. So, if there’s a trick where the arguments need to be extended from 32 bits to 64 and then truncated later, we won’t find that. Again this isn’t hard to fix, but it does appear to be hard to fix without slowing down synthesis a lot. Finally, if the solver (Z3 here) uses too much time or memory, or if some limits on loops inside the synthesizer are exceeded, we’ll miss out on whatever we might have found if we were willing to use more resources.

I’d like to distill the synthesis problems that come from optimizing LLVM code into some sort of synthesis benchmark that researchers can use to evaluate their work. We would say that your synthesis engine is better than ours (on this benchmark) if it can save more total static or dynamic profile weight in a given amount of time, or if it can produce equivalent results in less time. Of course we’ll have to agree on a cost model. I just attended the 2015 SMT Workshop the other day (where I talked about Souper!) and it sounds like there is plenty of interest in the solver community in providing better support for synthesis, which is great!

We have a medium-term goal of taking an optimization discovered by Souper and generalizing it into something that is very close to what an LLVM developer would implement. Since Souper has no capacity for abstraction, we’ll turn its optimizations into Alive patterns and then derive not only any preconditions that need to hold before the optimization can fire but also code for converting constant values on the left hand side into constants on the right hand side. For example let’s say that Souper finds this optimization:

%1 = lshr i32 %in, 3
%out = shl nuw i32 %1, 3
  =>
%out = and i32 4294967288, %in

The general form that we’d like to derive is:
just

%1 = lshr %in, C
%out = shl %1, C
  =>
%out = and -(1<<C), %in

Can we derive -(1<<C) from scratch? Time will tell. If this works, it’ll be useful for two reasons. First, it’ll effectively deduplicate Souper’s output: many specific optimizations will all boil down to the same general one. Second, the optimization will be easier to implement in LLVM if a human doesn’t have to do the generalization work. Some work by Nuno Lopes contains the basic ideas that we’ll use.

Something else I’ve been thinking about is the fact that all compilers have weaknesses of the kind that we point out in this post. The OPTGEN paper has some concrete data for GCC, LLVM, and Intel CC. I’ve heard that the Microsoft C/C++ compiler could stand to be improved in this respect too. CompCert needs a ton of work, the last time I checked. Wouldn’t it be cool if the work we’re doing for Souper and Alive could apply more broadly than just to LLVM? One observation is that if we deal only with integer values, the IR semantics in each of these compilers is probably broadly the same (leaving out some tricky issues like IR-level undefined behaviors). There’s probably no fundamental reason we can’t just reuse integer optimizations across multiple compilers.

Here are a few offers I’d like to extend:

  1. If you are an LLVM frontend developer, I’d like to run Souper on some representative code from your frontend. It would be easiest if you harvested optimized LLVM code into Souper yourself (I’ll have to write up instructions) but if not, you can send me some bitcode files — in this case we won’t get dynamic profile information. A few GB would not be too much. The effect will be that we can help the LLVM middle-end optimizers do a better job on code that you care about.
  2. If you develop a compiler other than LLVM, even a closed-source one, let’s figure out a way to translate an integer subset of your IR into Souper, so we can look for missed optimizations. My guess is that if you target a subset of Souper that doesn’t have phis or path conditions, this isn’t very difficult. You will basically end up doing the same kind of depth-bounded expression harvesting that I did.
  3. If you are doing research on program synthesis, and if you might be interested in working towards some benchmarks about optimizing LLVM code, let me know and let’s figure out how to make that happen.

The next set of results that I post will be similar to these, but they’ll show off Souper’s ability to exploit the results of one or two of LLVM’s dataflow analyses. Finally I want to add that although I’ve been writing up these posts, Souper is a group effort.

Nibble Sort Denouement

Back in January my nibble sort contest resulted in entries that dramatically exceeded my expectations. Since then I’ve been trying to write up a post explaining the various strategies that people used and since you don’t care about my excuses I won’t tell you them, but I never got it written. However! I want to point everyone to three great blog entries:

The thing that I liked most about this contest is that the fastest entries ended up being way faster than I predicted — mostly because I didn’t understand how excellent the Haswell vector unit is. I’ve sent my standard programming contest prize (Amazon gift certificate covering the cost of a paper copy of Hacker’s Delight) to Alexander and Jeff (Jerome’s non-SIMD entry, alas, arrived a bit after the contest expired). Thanks again everyone!

Final note: as a compiler researcher I think these contest entries could be very useful as a case study in how variance among algorithms and among implementations of algorithms can lead to extreme performance differences. In the long run we need compilers that understand how to automatically switch between different versions such as these. Regular compiler technology won’t cut it, nor will (at present) program synthesis. But it’s a good goal.

Defending Against Compiler-Based Backdoors

Scotty Bauer (a Utah grad student), Pascal Cuoq, and I have an article in the latest PoC||GTFO about introducing a backdoor into sudo using a compiler bug. In other words, the C code implementing sudo does not contain a backdoor, but a backdoor appears when sudo is built using a particular compiler version (clang 3.3, here). The advantages of this kind of backdoor include subtlety, deniability, and target-specificity.

Of course I have no idea who, if anyone, is trying to use compiler bugs to create backdoors. However if I worked for the offensive wing of a well-funded security service, this would be one of the tools in my toolbox. But that’s neither here nor there. What follows are some ideas about how various groups of people can help defend against these backdoors.

Compiler Developers

  • Fix known miscompilation bugs as rapidly as possible
  • Consider back-porting fixes to miscompilation bugs into maintenance releases
  • Go looking for trouble using fuzz tools

Maintainers of Open Source Packages

  • Be suspicious of baroque patch submissions
  • Consider rewriting patches

OS Packagers

  • Assess compilers for reliability before selecting a system compiler
  • Aggressively test compiled code for all platforms before deployment
  • Run a trusting trust test on the system compiler (our attack in the PoC||GTFO article isn’t a trusting trust attack– but this won’t hurt)

End User

  • Recompile the system from source, using a different compiler or compiler version
  • Run your own acceptance tests on precompiled applications

Researchers

  • Create practical proved-correct compilers — which won’t contain the kind of bug that we exploited
  • Create practical translation validation schemes — which would detect the miscompilation as it occurs
  • Create practical N-version programming systems — which would detect the backdoor as it executes

Overall, this kind of attack is not easy to defend against, and my guess is that most instances of it (if any exist) will never be detected.

Generating a Random Program vs. Generating All Programs

Generating all possible inputs — up to some maximum length — to a software system is one way of creating test cases, and this technique even has a name: bounded exhaustive testing. Back when we were doing Csmith, my then-student Yang Chen spent a while on a bounded exhaustive C program generator which was in fact pretty decent at finding compiler bugs and had the added advantage of not requiring a test case reducer: if you try test cases in order of increasing size, you’re guaranteed that the first input that triggers a bug is already minimal. In the end, however, Csmith worked better and we dropped the exhaustive generator.

Lately I’ve been working on a little exhaustive LLVM function generator. (NOTE: be careful about running this program, it not only writes a lot of files but is also a barely-controlled forkbomb. It would be better to just download a tarball of its output.) Here I don’t want to write about what we’ve learned using it, but rather about writing such a tool.

The first observation is obvious: the difference between a random generator and an exhaustive generator is simply the implementation of the choice operator: the random tool makes a random choice and the exhaustive generator makes every choice. Here you can see my choice implementation which uses fork() since it is simple and fairly efficient.

The rest of this piece is about the less obvious differences between writing a random generator and writing an exhaustive generator. First, most random program generators contain code like this:

restart:
  ch1 = choose(I);
  ch2 = choose(J);
  if (!compatible(ch1, ch2))
    goto restart;

In other words, when the generator gets into a bind, it simply restarts an entire operation. As long as this will eventually succeed, there’s no problem, and in practice a generator that can use this idiom will be moderately simpler than one that doesn’t use it. For exhaustive generation there are two problems. First, wasting choices increases the generator’s branching factor, potentially creating a huge increase in the number of states that must be explored. Second, the exhaustive generator might not even terminate, depending on what kind of “fuel” it uses. There’s no problem if the fuel is the number of choices, but if we want to use a different, user-meaningful kind of fuel (LLVM instructions, for my generator above) then this idiom leads to an infinite fork loop.

To write the same sequence of choices without a restart loop, we end up with this kind of code:

  ch1 = choose(I);
  ch2 = choose(J - incompatible_choices(ch1));
  ch2 = expand(ch1, ch2);

In other words, we need to always choose from a densely populated range of choices and then expand the choice out to the original, sparser choice space. Why does this sort of thing happen in practice? Let’s say that a function in the program we’re generating takes two arguments that should be distinct. It’s easy to generate two random arguments and then restart if they happen to be the same, but slightly harder to generate a random argument and then a second random argument that is guaranteed to be different from the first. The painfulness increases rapidly with the sophistication of the constraints that need to be respected. I’m not sure that it would have been possible to implement Csmith in the non-restart style.

The next issue is the balancing of probabilities that is required to create a good random program generator. This is really hard, we spent an enormous amount of time tuning Csmith so that it often generates something useful. Doing a bad job at this means that a fuzzer — even if it is in principle capable of generating interesting outputs — almost never does that. For example, the main value generation function in my LLVM generator is structured as a sequence of coin tosses. This is fine for exhaustive generation but creates an extremely skewed distribution of programs when used in fuzzer mode.

The final difference between random and exhaustive generation is that exhaustive generators end up being highly sensitive to incidental branching factors. For example, if my LLVM function generator is permitted to simply return one of its arguments (with the actual contents of the function being dead) the number of programs that it generates (for a given number of instructions) increases significantly, with no added testing benefit. Similarly, if we generate instructions that fold (for example, “add 1, 1″) we get another massive increase in the number of outputs (though these are not useless since they test the folder — something that we might or might not be interested in doing). A massive blowup comes from the necessity of choosing every possible value for each literal constant. This is a bit of a showstopper when generating C programs but is not so bad in LLVM since we can limit inputs to be, for example, three bits wide.

In summary, although in principle a fuzzer and an exhaustive generator are the same program with different implementations of the choice operator, in practice the concerns you face while writing one are significantly different from the other. Neither is strictly easier than the other to create.

What afl-fuzz Is Bad At

American fuzzy lop is a polished and effective fuzzing tool. It has found tons of bugs and there are any number of blog posts talking about that. Here we’re going to take a quick look at what it isn’t good at. For example, here’s a program that’s trivial to crash by hand, that afl-fuzz isn’t likely to crash in an amount of time you’re prepared to wait:

#include <stdlib.h>
#include <stdio.h>

int main(void) {
  char input[32];
  if (fgets(input, 32, stdin)) {
    long n = strtol(input, 0, 10);
    printf("%ld\n", 3 / (n + 1000000));
  }
  return 0;
}

The problem, of course, is that we’ve asked afl-fuzz to find a needle in a haystack and its built-in feedback mechanism does not help guide its search towards the needle. Actually there are two parts to the problem. First, something needs to recognize that divide-by-zero is a crash behavior that should be targeted. Second, the search must be guided towards inputs that result in a zero denominator. A finer-grained feedback mechanism, such as how far the denominator is from zero, would probably do the job here. Alternatively, we could switch to a different generation technology such as concolic testing where a solver is used to generate test inputs.

The real question is how to get the benefits of these techniques without making afl-fuzz into something that it isn’t — making it worse at things that it is already good at. To see why concolic testing might make things worse, consider that it spends a lot of time making solver calls instead of actually running tests: the base testing throughput is a small fraction of what you get from plain old fuzzing. A reasonable solution would be to divide up the available CPU time among strategies; for example, devote one core to concolic testing and seven to regular old afl-fuzz. Alternatively, we could wait for afl-fuzz to become stuck — not hitting any new coverage targets within the last hour, perhaps — and then switch to an alternate technique for a little while before returning to afl-fuzz. Obviously the various search strategies should share a pool of testcases so they can interact (afl-fuzz already has good support for communication among instances). Hopefully some concolic testing people will spend a bit of time bolting their tools onto afl-fuzz so we can play with these ideas.

A variant of the needle-in-a-haystack problem occurs when we have low-entropy input such as the C++ programs we would use to test a C++ compiler. Very few ascii strings are C++ programs and concolic testing is of very little help in generating interesting C++. The solution is to build more awareness of the structure of C++ into the testcase generator. Ideally, this would be done without requiring the somewhat elaborate effort we put into Csmith. Something like LangFuzz might be a good compromise. (Of course I am aware that afl-fuzz has been used against clang and has found plenty of ways to crash it! This is great but the bugs being found are not the kind of semantic bugs that Csmith was designed to find. Different testcase generators solve different problems.)

So we have this great fuzzing tool that’s easy to use, but it also commonly runs up against testing problems that it can’t solve. A reasonable way forward will be for afl-fuzz to provide the basic framework and then we can all build extensions that help it reach into domains that it couldn’t before. Perhaps Valgrind is a good analogy: it provides not only an awesome baseline tool but also a rich infrastructure for building new tools.

[Pascal Cuoq gave me some feedback on this post including a bunch of ideas about afl-fuzz-related things that he’ll hopefully blog about in the near future.]

UB Canaries

If you report an undefined behavior bug, a common reaction from software developers is “So what? Our code works just fine.” As a random example, here is a discussion I had with Rasmus Lerdorf about five years ago about some UBs in the PHP interpreter. One might point out that it wasn’t a very mature exchange but I wasn’t even 40 yet at the time. (Earlier I had an example here from OpenSSL but this one is more suitable.)

An idea I’ve kicked around for a long time, but only got around to implementing this week, is a collection of canaries for undefined behavior: little test programs that automate the process of determining whether a given compiler configuration is willing to exploit particular UBs. Here’s the UB-canary repo. The idea behind canaries is that since they’re easy to run, they provide a quick and easy way to figure out which UBs the users of a particular compiler should be especially worried about. The definition of “exploit” requires a bit of care: exploitation only makes sense in terms of an expectation held by a C/C++ programmer. For example, I might expect that when a negative number is shifted to the left, the result is the same as if the number had been cast to unsigned before being left-shifted. Of course we need to be completely clear with ourselves that any such expectation has nothing to do with the language standard and everything to do with what a particular compiler happens to do, either because the providers of that compiler are unwilling to exploit that UB or just because they have not gotten around to exploiting it yet. When no real guarantee from the compiler provider exists, we like to say that as-yet unexploited UBs are time bombs: they’re waiting to go off next month or next year when the compiler gets a bit more aggressive.

Below are some results from various versions of GCC and LLVM on Ubuntu Linux. Each compiler was tested at -O0, -O1, -O2, -Os, and -O3; the number in each part of the table indicates the number of optimization options at which the compiler was willing to exploit the UB. Most compilers were targeting x86 but for some of the older versions of GCC (4.0, 4.1, 4.2) I could no longer get that to work due to some sort of libc problem, so they were targeting x86-64. I doubt that affects the results much.

Here are the results for GCC:

4.0 4.1 4.2 4.3 4.4 4.5 4.6 4.7 4.8 4.9
addr-null 5 5 5 5 5 5 0 0 0 0
array-oob 0 0 0 0 0 0 0 0 0 0
dangling-pointer 4 4 4 4 4 4 4 4 4 4
int-min-mod-minus-1 5 5 5 5 5 5 5 5 5 5
memcpy-overlap 5 5 5 5 5 5 5 5 5 5
modify-string-literal 5 5 5 5 5 5 5 5 5 5
pointer-casts 4 4 4 4 4 0 0 0 0 0
shift-by-bitwidth 2 2 2 3 3 3 3 3 4 4
signed-integer-overflow 2 5 3 3 3 3 3 3 3 3
signed-left-shift 0 0 0 0 0 0 0 0 0 0
strict-aliasing 3 3 3 0 2 0 3 3 3 3
uninitialized-variable 0 0 0 1 1 1 1 1 1 1

And for LLVM:

2.9 3.0 3.1 3.2 3.3 3.4 3.5 3.6
addr-null 0 0 0 0 0 0 0 0
array-oob 0 0 0 0 0 0 0 0
dangling-pointer 3 3 3 3 3 3 3 3
int-min-mod-minus-1 5 5 5 5 5 5 5 5
memcpy-overlap 5 5 5 5 5 5 5 5
modify-string-literal 5 5 5 5 5 5 5 5
pointer-casts 5 5 5 5 5 5 5 5
shift-by-bitwidth 3 3 3 3 3 3 3 3
signed-integer-overflow 4 4 4 4 4 4 4 4
signed-left-shift 0 0 0 0 0 0 0 0
strict-aliasing 3 3 3 3 3 3 3 3
uninitialized-variable 4 4 4 4 4 4 4 4

Click on the links to see the corresponding assumptions. It is interesting that GCC shows a lot more variance in UB exploitation than LLVM; I suspect that this is simply because the GCC versions shown here span a longer period of time than the LLVM versions (~10 years vs. ~4 years). It is also interesting that UB exploitation is not getting (as one might have feared) uniformly worse. Although overlapping memcpy() calls were always exploited by the compilers tested here, that is not the case on my Mac. I believe there have been compilers that exploited the signed-left-shift UBs but either my canaries aren’t good enough to find them or else those behaviors did not happen to be seen in the releases I choose here. Modifying a string literal never works on a modern Linux machine but it continues to work on embedded platforms that lack memory protection (I just saw some code doing this the other day). The INT_MIN % -1 case is interesting since it has a sensible answer and also my reading of the C99 standard is that this construct is not undefined; it is explicitly undefined in the newer standards.

Please let me know if:

  • You notice a bug in any UB canary or in the (admittedly crappy) driver script.
  • You know of a compiler that exploits an UB but where I don’t have a canary detecting the exploitation. A pull request would be ideal.

Canary code comes from a variety of sources and I have tried to give credit in the individual files. UB canaries are related to several ongoing UB-protection efforts but I want to in particular mention the Cerberus C Survey that is being run by Peter Sewell’s research group at Cambridge. Their goal is to create a mechanized semantics for C as it is really used, as opposed to the somewhat fictitious language described in the standards. If you are a C/C++ developer who cares about UB, please take a few minutes to fill out the survey.

Inexpensive CPU Monster

Rather than using the commercial cloud, my group tends to run day-to-day jobs on a tiny cluster of machines in my office and then to use Emulab when a serious amount of compute power is required. Recently I upgraded some nodes and thought I’d share the specs for the new machines on the off chance this will save others some time:

processor Intel Core i7-5820K $ 380
CPU cooler Cooler Master Hyper 212 EVO $ 35
mobo MSI X99S SLI Plus $ 180
RAM CORSAIR Vengeance LPX 16GB (4 x 4GB) DDR4 SDRAM 2400 $ 195
SSD SAMSUNG 850 Pro Series MZ-7KE256BW $ 140
video PNY Commercial Series VCG84DMS1D3SXPB-CG GeForce 8400 $ 49
case Corsair Carbide Series 200R $ 60
power supply Antec BP550 Plus 550W $ 60

These machines are well-balanced for the kind of work I do, obviously YMMV. The total cost is about $1100. These have been working very well with Ubuntu 14.04. They do a full build of LLVM in about 18 minutes, as opposed to 28 minutes for my previously-fastest machines that were based on the i7-4770. I’d be interested to hear where other research groups get their compute power — everything in the cloud? A mix of local and cloud resources? This is an area where I always feel like there’s plenty of room for improvement.

Booster Test

Ever since learning that the space shuttle booster motors were manufactured and tested at ATK in Promontory Utah — not too far from where I live — I wanted to see one of the tests. I didn’t manage to do that before the shuttle program was shut down, but today I got to see something better: a test of an SLS booster, which is about 25% more powerful than an STS booster and more than twice as powerful as one of the big F-1 engines from the Saturn V.

Here’s a close-up video. On the other hand, this one shows what the test was like from the viewing area, in particular the 8 seconds it took the noise to reach us. The sound was very impressive, with enough low-frequency power to make my clothing vibrate noticeably, but it was not anywhere close to painfully loud. The flame was, however, painfully bright to look at. The nozzle was being vectored around during the test (I hadn’t realized that the solid rockets participate in guidance) but that wasn’t easy to see from a distance.

NASA socials give some inside access to people like me (and you, if you live in the USA and want to sign up next time) who have no official connection to the space program. Yesterday we got to tour the plant where the boosters are made. It was great to learn about techniques for mixing, casting, and curing huge amounts of propellant without getting air bubbles or other imperfections into the mix and without endangering workers. The buildings in this part of ATK have escape slides from all levels and are surrounded by big earthworks to deflect potential explosions upwards. It was also really cool to see the hardware for hooking boosters to the main rocket, for vectoring nozzles, and things like that. Alas, we weren’t allowed to take pictures on the tour.

ATK’s rocket garden at sunrise:

And the main event:

Adandoned Mineshaft

Due to my 8-year-old’s obsession with Minecraft, the abandoned mineshafts found in the game are an everyday topic of discussion around the house. He is saving up to buy a pickaxe — no joke. Since we needed a day trip for the long weekend, I thought we’d visit some actual mines in the Silver Island Mountains near the Utah-Nevada border. We followed an old road most of the way up a minor mountain and found the remains of either a processing facility or living quarters, and several mine shafts higher up. One of these apparently goes all the way through the mountain but we didn’t do much more than poke our heads in, since it was difficult to gauge how dangerous it was. This area was mined from the 1870s through the 1930s. Since the pathway connecting the mine shafts with the highest vehicle access was too narrow and steep for vehicles, we were forced to assume that ore was transported downhill using some combination of human and animal power — the level of physical effort implied here seemed to be lost on the 8-year-old, who wants a pickaxe more than ever. We saw a pair of pronghorn antelope — but failed to get decent pictures.

Lamus Peak:

Built-up trail used to move ore downhill through rugged country:

Mine shaft:

Another one, with some nice copper ore right at the entrance:

And a lot of nothing in the distance:

Instruction Synthesis is Fun and Weird

Synthesis is sort of a hot topic in PL research lately. It basically means “implement a specification automatically.” Of course, at some level this isn’t very different from what compilers have been doing for ages, if we consider the source language program to be the specification, but when we’re doing synthesis the connotation is that the specification isn’t directly implementable using standard compiler techniques.

As of recently, we have an LLVM synthesizer for Souper. The basic method is from this excellent PLDI paper. In brief, you give the synthesis algorithm a specification and a library of components and it either gives you a way to wire together the components into a circuit that implements the specification or else it fails. Some other time I might explain the algorithm in detail, but here I wanted to give an example where we’re trying to figure out if a 64-bit word contains 16 repetitions of the same nibble value (this computation figured into many nibble sort submissions).

The specification in C is:

int nibs_same(unsigned long word) {
  unsigned long n00 = (word >> 60) & 0xf;
  unsigned long n01 = (word >> 56) & 0xf;
  unsigned long n02 = (word >> 52) & 0xf;
  unsigned long n03 = (word >> 48) & 0xf;
  unsigned long n04 = (word >> 44) & 0xf;
  unsigned long n05 = (word >> 40) & 0xf;
  unsigned long n06 = (word >> 36) & 0xf;
  unsigned long n07 = (word >> 32) & 0xf;
  unsigned long n08 = (word >> 28) & 0xf;
  unsigned long n09 = (word >> 24) & 0xf;
  unsigned long n10 = (word >> 20) & 0xf;
  unsigned long n11 = (word >> 16) & 0xf;
  unsigned long n12 = (word >> 12) & 0xf;
  unsigned long n13 = (word >>  8) & 0xf;
  unsigned long n14 = (word >>  4) & 0xf;
  unsigned long n15 = (word >>  0) & 0xf;
  return
    (n00 == n01) & (n00 == n02) & (n00 == n03) & (n00 == n04) &
    (n00 == n05) & (n00 == n06) & (n00 == n07) & (n00 == n08) &
    (n00 == n09) & (n00 == n10) & (n00 == n11) & (n00 == n12) &
    (n00 == n13) & (n00 == n14) & (n00 == n15);
}

I won’t bother to give the LLVM version of this but it’s not too pretty, and obviously there’s not much that standard compiler optimization techniques can do to make it prettier. On the other hand, there’s a much better way to implement this specification:

int nibs_same(uint64_t word) {
  return word == rotate_right(word, 4);
}

LLVM doesn’t have a rotate instruction so the best we’ll do there is something like this:

define i32 @nibs_same(i64 %word) #0 {
  %1 = shl i64 %word, 60
  %2 = lshr i64 %word, 4
  %3 = or i64 %1, %2
  %4 = icmp eq i64 %3, %word
  %5 = zext i1 %4 to i32
  ret i32 %5
}

The question is: Will Souper generate this code from scratch? Turns out it does not, but on the other hand it generates this:

define i32 @nibs_same(i64 %word) #0 {
  %1 = lshr i64 %word, 28
  %2 = shl i64 %1, 32
  %3 = or i64 %1, %2
  %4 = icmp eq i64 %word, %3
  %5 = zext %4 to i32
  ret i32 %5
}

Or in C:

(((x >> 28) << 32) | (x >> 28)) == x

If you take a minute to look at this code, you’ll see that it is not implementing a rotate, but rather something odd and not obviously correct. It took me a while to convince myself that this is correct, but I believe that it is. The correctness argument uses these observations: (1) every nibble is compared to some other nibble and (2) it is irrelevant that the bitwise-or is presented with a pair of overlapping nibbles. Here’s the picture I drew to help:

0000000000111111111122222222223333333333444444444455555555556666
0123456789012345678901234567890123456789012345678901234567890123
                            >> 28
                            000000000011111111112222222222333333
                            012345678901234567890123456789012345
                            << 32
 00000011111111112222222222333333
 45678901234567890123456789012345
                               |
                             000000000011111111112222222222333333
                             012345678901234567890123456789012345
                              ==?
 0000000000111111111122222222223333333333444444444455555555556666
 0123456789012345678901234567890123456789012345678901234567890123
 

Also Alive can help:

  %1x = shl i64 %word, 60
  %2x = lshr i64 %word, 4
  %3x = or i64 %1x, %2x
  %4x = icmp eq i64 %3x, %word
  %5 = zext i1 %4x to i32
=>
  %1 = lshr i64 %word, 28
  %2 = shl i64 %1, 32
  %3 = or i64 %1, %2
  %4 = icmp eq i64 %word, %3
  %5 = zext %4 to i32

Done: 1
Optimization is correct!

The drawback of Souper's code is that since it doesn't include a rotate operation that can be recognized by a backend, the resulting assembly won't be as good as it could have been. Of course if LLVM had a rotate instruction in the first place, Souper would have used it. Additionally, if we posit a superoptimizer-based backend for LLVM, it would be able to find the rotate operation that is well-hidden in Souper's synthesized code.

I find it kind of magical and wonderful that a SAT solver can solve synthesis problems like this. Souper's synthesizer is very alpha and I don't recommend using it yet, but the code is here if you want to take a look. Raimondas, my heroic postdoc, did the implementation work.

I'll finish up by noting that my first blog entry was posted five years ago today. Happy birthday, blog.

Update from Feb 10: A reader sent this solution which I hadn't seen before:

(x & 0x0FFFFFFFFFFFFFFF) == (x >> 4)

I think Souper missed it since we didn't include an "and" in its bag of components for this particular run.