[Although this post was written to stand by itself, it builds on the previous one. It is authored by Jubi Taneja, Zhengyang Liu, and John Regehr.]
When designing computer systems, it can be useful to avoid specifying behaviors too tightly. For example, we might specify that a math library function only needs to return a result that is within some distance of the true answer, in order to avoid paying the cost of a fully precise answer when that precision is not needed. Similarly, when a processor boots we might specify that some of its registers or memory are in unknown configurations, to avoid burdening the processor designers with bringing up all hardware in known states. In hardware design we often refer to don’t-care terms in a functional specification. Undefined behavior in C and C++ has a similar rationale: by avoiding specification of an implementation’s behaviors in numerous inconvenient situations, more efficient implementations become possible.
A fun thing that optimizing compilers can do is to automatically infer when the full power of some operation is not needed, in which case it may be that the operation can be replaced by a cheaper one. This happens in several different ways; the method we care about today is driven by LLVM’s demanded bits static analysis, whose purpose is to prove that certain bits of an SSA value are irrelevant. For example, if a 32-bit value is truncated to 8 bits, and if that value has no other uses, then the 24 high bits of the original value clearly do not matter.
Here is a demanded-bits-based optimization that looks for a call to the bswap intrinsic where only a single byte of the result is demanded, and then simply shifts that one byte into the necessary position. This isn’t an important optimization for targets that have a fast bswap instruction, but it would be useful for targets that require bswap to be open-coded. In this example, a full-power 64-bit bswap for 32-bit ARM (foo1) requires more than a dozen instructions but a bswap where only the low byte is demanded (foo2) requires just two instructions.
In this post we’ll present a few examples where the precision of LLVM’s demanded bits analysis compares unfavorably with a highly precise demanded bits that we created using an SMT solver. Some of these situations may be worth fixing. The solver-based algorithm is pretty straightforward. For every bit that is an input to a computation, it asks two questions:
- If this bit is forced to zero, does the resulting computation have the same behavior as the original, for all valuations of the remaining input bits?
- If this bit is forced to one, does the resulting computation have the same behavior as the original, for all valuations of the remaining input bits?
If both answers are “yes” then that input bit is not demanded. Otherwise it is demanded.
Side note: “demanded bits” is a slightly confusing name for this analysis. Since static analyses always approximate, we can say that they have a may direction and a must direction. If demanded bits says that a bit is demanded, this actually means “it may be demanded, or perhaps not: this analysis is incapable of saying for sure.” On the other hand, if it says that a bit is not-demanded, then this should be interpreted as “this bit’s value must not matter, and we’re confident that we could provide a formal proof of this.” The problem here is that it is more intuitive to name an analysis after its must direction, since this is the reliable information that we can directly use to drive optimizations. So, in other words, this analysis should have been called a don’t-care analysis, as is traditional. We are getting off of our soapbox now.
Ok, now on to the examples. The code will be in Souper syntax, but hopefully it is close enough to LLVM IR that nobody will have trouble seeing what’s going on. All of these program fragments are ones that we encounter while compiling SPEC CPU 2017. The LLVM that we’re comparing against is from January 20, 2020.
When a number is multiplied by a constant, some of its high bits may not affect the answer:
%0:i32 = var %1:i32 = mul 40:i32, %0 demanded-bits from Souper: for %0 : 00011111111111111111111111111111 demanded-bits from compiler: for %0 : 11111111111111111111111111111111
Let’s pause here for a second and assume that you don’t believe us. And why should you? This material can be counterintuitive and we make mistakes all the time, like anyone else. So here’s a simple program you can use to check this result using exhaustive testing instead of a solver. With a bit of effort you should be able to use it to double-check all of the results in this post.
It is also the case that some low bits of a number may not matter, when it is going to be divided by a constant:
%0:i32 = var %1:i32 = udiv %0, 1000:i32 demanded-bits from Souper: for %0 : 11111111111111111111111111111000 demanded-bits from compiler: for %0 : 11111111111111111111111111111111
This one is cute:
%0:i32 = var %1:i32 = srem %0, 2:i32 demanded-bits from Souper: for %0 : 10000000000000000000000000000001 demanded-bits from compiler: for %0 : 11111111111111111111111111111111
And so is this one:
%0:i32 = var %1:i32 = mul %0, %0 demanded-bits from Souper: for %0 : 01111111111111111111111111111111 demanded-bits from compiler: for %0 : 11111111111111111111111111111111
Another flavor of lost precision comes from comparison instructions:
%0:i8 = var %1:i1 = ult %0, 8:i8 demanded-bits from Souper: for %0 : 11111000 demanded-bits from compiler: for %0 : 11111111
This one is fun, only the sign bits matters:
%0:i8 = var %1:i1 = slt %0, 0:i8 demanded-bits from Souper: for %0 : 10000000 demanded-bits from compiler: for %0 : 11111111
The above represent some low-hanging fruit. We’re leaving out some UB-related imprecision since that stuff is fiddly to work with, and also a lot of imprecision that requires looking at multiple instructions together. (Unhappily, abstract transfer functions are non-compositional: even when a collection of maximally precise functions is applied to a collection of instructions, the result is often not maximally precise.) Here are two quick examples:
%0:i16 = var %1:i16 = add 64:i16, %0 %2:i16 = and 448:i16, %1 demanded-bits from Souper: for %0 : 0000000111000000 demanded-bits from compiler: for %0 : 0000000111111111
And second:
%0:i32 = var %1:i1 = eq 0:i32, %0 %2:i32 = select %1, 1:i32, %0 %3:i1 = ne 0:i32, %2 demanded-bits from Souper: for %0 : 00000000000000000000000000000000 demanded-bits from compiler: for %0 : 11111111111111111111111111111111
Here we can see that %2 can never be zero, since %2 takes the value of %0 when %0 is not zero, and takes the value 1 otherwise. This analysis result is particularly satisfying because the special case of no demanded bits means that some instructions become dead. In other words, as far as this little collection of instructions is concerned, there wasn’t any need to compute %0 in the first place, because every one of the 232 values that it could take results in %3 being true.
Which of these imprecisions are worth fixing? That is harder to answer. The potential benefit is obvious: it may allow additional optimizations to be performed. The costs of increasing precision include increased compile time, the overhead of maintaining the new code, and the risk of miscompilation if someone writes an unsound transfer function.
3 responses to “Precision Opportunities for Demanded Bits in LLVM”
This seems like a portion of the compiler that would very much benefit from a non-imperative implementation. If better rules could be implemented as pattern & substitutions and those considered in a composable way (trivially easy and compact to add and remove) then real world exploration of the value of these vs. the downsides you list might be much more practical.
– Turning them on/off to see the effect on compile time would be easy (at least at build time).
– The maintenance overhead would be minimized during the exploration phases.
– Taking each one by it self could also significantly reduce the cost of verification.
IIRC you were involved in some work in that direction for the instruction combiner pass? Did anything ever come of that?
Hi Benjamin, absolutely. Some sort of DSL would be ideal for this stuff. I think the problem is that there’s quite a bit of work needed, both initially and on an ongoing basis, to create and maintain such a thing, and that that LLVM community would be correct to be suspicious of it, unless there was someone who was clearly capable of maintaining this indefinitely.
Replacing InstCombine with code generated from a DSL is still something I’m very much interested in, and I believe the LLVM community is also interested, but there’s the same problem.
> Undefined behavior in C and C++ has a similar rationale: by avoiding specification of an implementation’s behaviors in numerous inconvenient situations, more efficient implementations become possible.
Have you read the published Rationale for the C Standard? The way it describes UB is rather different from what the above description would suggest. Because different implementations are used to accomplish different kinds of tasks, some of which require the ability to do things not provided for by the Standard, implementations are given the freedom to either process various actions in a predictable fashion which would facilitate such tasks, or to perform optimizations on the assumption that code will not perform such actions, depending upon which approach would allow their customers to most effectively accomplish what needs to be done.
While it may be interesting from an academic standpoint to examine what the Standard does and does not attempt to require in various situations, it is important to remember that the Standard does not attempt to mandate that implementations be suitable for any particular purpose. Indeed, the authors acknowledge when discussing translation limits that it doesn’t even mandate that implementations be suitable for any useful purpose. “While a deficient implementation could probably contrive a program that meets this requirement, yet still succeed in being useless, the C89 Committee felt that such ingenuity would probably require more work than making something useful.”
Consider, for example, a function like “unsigned mul_mod_65536(unsigned short x, unsigned short y) { return (x*y) & 0xFFFFu;}” The authors of the Rationale describe, starting on page 44 line 20, how that would be processed by “most current implementations” in situations where the Standard would pose no requirements. I think it entirely plausible that the authors of the Standard would not have wanted to require a compiler for a 36-bit ones’-complement Univac with 18-bit “unsigned short” to generate the code needed to ensure a correct mod-65536 result in cases where the mathematical value of the product would be in the range 0x800000000u to 0xFFFFFFFFFu, but that text on page 44 would make no sense if the authors of the Standard didn’t expect that commonplace implementations for 32-bit platforms would process the code usefully for all possible inputs.
With regard to the specific subject of this article, does LLVM maintain dependency information in cases where no bits of a value end up being used? By my understanding of the Standard, if some action is sequenced before a read of “x”, such an action would also be sequenced before the write performed by the assignment expression “y = x | intExpr;” even if a compiler determines that “intExpr” will always be -1. There may not be any need to actually read the value of x, but the dependency should still remain. Does LLVM maintain dependency information in such cases?