Instruction Synthesis is Fun and Weird


Synthesis is sort of a hot topic in PL research lately. It basically means “implement a specification automatically.” Of course, at some level this isn’t very different from what compilers have been doing for ages, if we consider the source language program to be the specification, but when we’re doing synthesis the connotation is that the specification isn’t directly implementable using standard compiler techniques.

As of recently, we have an LLVM synthesizer for Souper. The basic method is from this excellent PLDI paper. In brief, you give the synthesis algorithm a specification and a library of components and it either gives you a way to wire together the components into a circuit that implements the specification or else it fails. Some other time I might explain the algorithm in detail, but here I wanted to give an example where we’re trying to figure out if a 64-bit word contains 16 repetitions of the same nibble value (this computation figured into many nibble sort submissions).

The specification in C is:

int nibs_same(unsigned long word) {
  unsigned long n00 = (word >> 60) & 0xf;
  unsigned long n01 = (word >> 56) & 0xf;
  unsigned long n02 = (word >> 52) & 0xf;
  unsigned long n03 = (word >> 48) & 0xf;
  unsigned long n04 = (word >> 44) & 0xf;
  unsigned long n05 = (word >> 40) & 0xf;
  unsigned long n06 = (word >> 36) & 0xf;
  unsigned long n07 = (word >> 32) & 0xf;
  unsigned long n08 = (word >> 28) & 0xf;
  unsigned long n09 = (word >> 24) & 0xf;
  unsigned long n10 = (word >> 20) & 0xf;
  unsigned long n11 = (word >> 16) & 0xf;
  unsigned long n12 = (word >> 12) & 0xf;
  unsigned long n13 = (word >>  8) & 0xf;
  unsigned long n14 = (word >>  4) & 0xf;
  unsigned long n15 = (word >>  0) & 0xf;
  return
    (n00 == n01) & (n00 == n02) & (n00 == n03) & (n00 == n04) &
    (n00 == n05) & (n00 == n06) & (n00 == n07) & (n00 == n08) &
    (n00 == n09) & (n00 == n10) & (n00 == n11) & (n00 == n12) &
    (n00 == n13) & (n00 == n14) & (n00 == n15);
}

I won’t bother to give the LLVM version of this but it’s not too pretty, and obviously there’s not much that standard compiler optimization techniques can do to make it prettier. On the other hand, there’s a much better way to implement this specification:

int nibs_same(uint64_t word) {
  return word == rotate_right(word, 4);
}

LLVM doesn’t have a rotate instruction so the best we’ll do there is something like this:

define i32 @nibs_same(i64 %word) #0 {
  %1 = shl i64 %word, 60
  %2 = lshr i64 %word, 4
  %3 = or i64 %1, %2
  %4 = icmp eq i64 %3, %word
  %5 = zext i1 %4 to i32
  ret i32 %5
}

The question is: Will Souper generate this code from scratch? Turns out it does not, but on the other hand it generates this:

define i32 @nibs_same(i64 %word) #0 {
  %1 = lshr i64 %word, 28
  %2 = shl i64 %1, 32
  %3 = or i64 %1, %2
  %4 = icmp eq i64 %word, %3
  %5 = zext %4 to i32
  ret i32 %5
}

Or in C:

(((x >> 28) << 32) | (x >> 28)) == x

If you take a minute to look at this code, you’ll see that it is not implementing a rotate, but rather something odd and not obviously correct. It took me a while to convince myself that this is correct, but I believe that it is. The correctness argument uses these observations: (1) every nibble is compared to some other nibble and (2) it is irrelevant that the bitwise-or is presented with a pair of overlapping nibbles. Here’s the picture I drew to help:

0000000000111111111122222222223333333333444444444455555555556666
0123456789012345678901234567890123456789012345678901234567890123
                            >> 28
                            000000000011111111112222222222333333
                            012345678901234567890123456789012345
                            << 32
00000011111111112222222222333333
45678901234567890123456789012345
                              |
                            000000000011111111112222222222333333
                            012345678901234567890123456789012345
                             ==?
0000000000111111111122222222223333333333444444444455555555556666
0123456789012345678901234567890123456789012345678901234567890123

Also Alive can help:

  %1x = shl i64 %word, 60
  %2x = lshr i64 %word, 4
  %3x = or i64 %1x, %2x
  %4x = icmp eq i64 %3x, %word
  %5 = zext i1 %4x to i32
=>
  %1 = lshr i64 %word, 28
  %2 = shl i64 %1, 32
  %3 = or i64 %1, %2
  %4 = icmp eq i64 %word, %3
  %5 = zext %4 to i32

Done: 1
Optimization is correct!

The drawback of Souper's code is that since it doesn't include a rotate operation that can be recognized by a backend, the resulting assembly won't be as good as it could have been. Of course if LLVM had a rotate instruction in the first place, Souper would have used it. Additionally, if we posit a superoptimizer-based backend for LLVM, it would be able to find the rotate operation that is well-hidden in Souper's synthesized code.

I find it kind of magical and wonderful that a SAT solver can solve synthesis problems like this. Souper's synthesizer is very alpha and I don't recommend using it yet, but the code is here if you want to take a look. Raimondas, my heroic postdoc, did the implementation work.

I'll finish up by noting that my first blog entry was posted five years ago today. Happy birthday, blog.

Update from Feb 10: A reader sent this solution which I hadn't seen before:

(x & 0x0FFFFFFFFFFFFFFF) == (x >> 4)

I think Souper missed it since we didn't include an "and" in its bag of components for this particular run.

,

20 responses to “Instruction Synthesis is Fun and Weird”

  1. The best definition of synthesis I’ve heard was given by David Bacon, quoting Eran Yahav: “a synthesizer is just a compiler that doesn’t work”. The meaning, as I understood it, is that once a tool is a total function on a certain kind of specification producing an implementation (for example, GCC), then we call it a compiler. This is clearly not true for tools like the one you demonstrate, so we call that a synthesizer.

    Maybe someone who knows more about synthesis than can be learned by going to PLDI talks has a better definition, though. 🙂

  2. Sam there is certainly some truth to that definition, though the actual synthesis people would stress that that style of specification is different: it might be some examples, some random math, or whatever.

  3. I agree (although Eran is an actual synthesis person). It feels (to an outsider) like there are two kinds of synthesis: one, where there’s a legitimate aspiration to compiler status (such as the one described in your post), and the other, where there are lots of wrong answers that satisfy the input you provided. Example-driven synthesis falls into the second category.

    On the other hand, it seems like the techniques used are pretty similar today across both kinds. But if we can’t speculate about other people’s research areas in blog comment sections, what are they for?

  4. I like to think of synthesizers as tools that transform a relation into a function (in the mathematical sense). Because every function is also a relation, it follows that all compilers are synthesizers, but the converse does not hold.

  5. Ha, and I thought my ((x & 15) * 0x01111111111111111) == x was clever. Rotate by 4! Of course, so much simpler, better, and faster.

    It makes me think of the fastest C code in that repo (and good C code in general, like the stuff from K&R). It’s fast, alright, but that’s secondary. What amazes me is how the code is straight to the point. It’s almost rude!

  6. Carlos, I suspect that the rotate method will be optimal for any compiler that turns it into a basic hardware instruction. Personally, I still have a soft spot for the formula I posted on the contest page
    ((uint64_t) (x * 0xfffffffffffffff1U)) < 16U
    (here written in a more portable format) which I also thought was pretty clever.

  7. I like the multiply one too. I’ll check to see if Souper can find it (should be possible if we remove shifts from its box of components).

  8. Here’s the Alive version of the multiply version:

    %1a = shl i64 %word, 60
    %2a = lshr i64 %word, 4
    %3a = or i64 %1a, %2a
    %4a = icmp eq i64 %3a, %word
    %5 = zext i1 %4a to i32
    =>
    %1 = mul i64 %word, 18446744073709551601
    %2 = icmp ult %1, 16
    %5 = zext %2 to i32

    Done: 1
    Optimization is correct!

  9. “The correctness argument uses these observations: (1) every nibble is compared to some other nibble and (2) it is irrelevant that the bitwise-or is presented with a pair of overlapping nibbles.”

    Nitpick/clarification/elaboration for anyone still confused: (a) in addition to (1), it’s also necessary that the graph of nibble-comparisons is connected. (b) The reason the overlap is irrelevant is that only 15 comparisons are necessary between the 16 nibbles. (The overlap nibble doesn’t compute anything useful.) The part to the left of the overlap guarantees that n0 == n1 == … n7 and the part to the right of the overlap guarantees that n1 == n8, n2 == n9, … n8 == n15. I drew it on paper to show myself that all the nibbles are included and that the graph is connected.

    Cool work!

  10. Vijay, in fact some of our test cases for Souper come from obfuscators — the solver can see through a lot of crap.

    On the other hand (as hinted at in this post) the synthesis engine appears to have promise as an obfuscator, although so far all of our obfuscation has been undesired and unintentional.

  11. How did you chose the contents of the “component library”? Is there some automatic selection or is it a manual/human selection process aiming to fit the expected output?

  12. I also had to write out Souper’s shifting solution. Focusing on nibbles (represented below by capital letters) eased things (IMO):
    word = ABCDEFGHIJKLMNOP
    word >> 28 = 0000000ABCDEFGHI
    word <> 28) | (word <> 28) | (word << 32) == word ABCDEFGHIJKLMNOP == IJKLMNO(A|P)BCDEFGHI
    A == I == P (== I) == B == J == C == K == D == L == E == M == F == N == G == O == H == (A | P) (== I)
    and this simply says that all the nibbles are equal. From a naive operation counting perspective this solution is no worse than the rotate one, the latter only really benefiting from the processor instruction. I’m impressed!

  13. Mark, we are still figuring that out. It is somewhat tricky since, for example, a 16-bit adder can’t be wired up to a 32-bit input. Also, the presence of trunc / sext / zext instructions in LLVM mean that we might want to be performing, for example, 64-bit math even if all inputs and outputs are 32 bits. The most obvious answer to your question “Give the synthesizer all components it might possibly need” is not a good answer since query size is quadratic in the number of components. It’s not clear there is a principled solution to this problem, we’ll probably just settle on a standard bag of components that fit the available bitwidths and then hope it suffices.

  14. Strange, my formatting got screwed up, but I can’t see why. Somehow I lost the line computing word << 32 and other statement got merged. At any rate, it's the definition of word and the string of equalities that matter.

  15. Joshua, thanks, fixed. BTW here’s how to get Alive to prove correctness (of the 15-F version, of course, though the constant is obfuscated here):

    %1a = shl i64 %word, 60
    %2a = lshr i64 %word, 4
    %3a = or i64 %1a, %2a
    %4a = icmp eq i64 %3a, %word
    %5 = zext i1 %4a to i32
    =>
    %1 = and i64 %word, 1152921504606846975
    %2 = lshr i64 %word, 4
    %3 = icmp eq %1, %2
    %5 = zext %3 to i32