Efficient Integer Overflow Checking in LLVM


(Here’s some optional background reading material.)

We want fast integer overflow checking. Why? First, if the undefined behavior sanitizers go faster then testing goes faster. Second, when overhead drops below a certain point people will become willing to use UBSan to harden production code against integer overflows. This is already being done in parts of Android. It isn’t the kind of thing to do lightly: some of LLVM’s sanitizers, such as ASan, will increase an application’s attack surface. Even UBSan in trapping mode — which does not increase attack surface that I know of — could easily enable remote DoS attacks. But this is beside the point.

Let’s start with this function:

int foo(int x, int y) {
  return x + y; 
}

When compiled with trapping integer overflow checks (as opposed to checks that provide diagnostics and optionally continue executing) Clang 3.8 at -O2 gives:

foo:
        addl    %esi, %edi
        jo      .LBB0_1
        movl    %edi, %eax
        retq
.LBB0_1:
        ud2

This code is efficient; here Chandler Carruth explains why. On the other hand, signed integer overflow checking slows down SPEC CINT 2006 by 11.8% overall, with slowdown ranging from negligible (GCC, Perl, OMNeT++) to about 20% (Sjeng, H264Ref) to about 40% (HMMER).

Why do fast overflow checks add overhead? The increase in object code size due to overflow checking is less than 1% so there’s not going to be much trouble in the icache. Looking at HMMER, for example, we see that it spends >95% of its execution time in a function called P7Viterbi(). This function can be partially vectorized, but the version with integer overflow checks doesn’t get vectorized at all. In other words, most of the slowdown comes from integer overflow checks interfering with loop optimizations. In contrast, GCC and Perl probably don’t get much benefit from advanced loop optimizations in the first place, hence the lack of slowdown there.

Here I’ll take a second to mention that I had to hack SPEC slightly so that signed integer overflows wouldn’t derail my experiments. The changes appear to be performance-neutral. Only GCC, Perl, and H254Ref have signed overflows. Here’s the patch for SPEC CPU 2006 V1.2. All performance numbers in this post were taken on an i7-5820K (6-core Haswell-E at 3.3 GHz), running Ubuntu 14.04 in 64-bit mode, with frequency scaling disabled.

Now for the fun part: making code faster by removing overflow checks that provably don’t fire. At -O0 the SPEC INT 2006 benchmarks have 67,678 integer overflow checks, whereas at -O3 there are 38,527. So that’s nice: LLVM 3.8 can already get rid of 43% of naively inserted checks.

Let’s look at some details. First, the good news: all of the overflow checks in these functions are optimized away by LLVM:

int foo2(int x, int y) {
  return (short)x + (short)y;
}

int foo3(int x, int y) {
  return (long)x + (long)y;
}

int foo4(int x, int y) {
  return (x >> 1) + (y >> 1);
}

int32_t foo5(int32_t x, int32_t y) {
  const int32_t mask = ~(3U << 30);
  return (x & mask) + (y & mask);
}

int32_t foo6(int32_t x, int32_t y) {
  const int32_t mask = 3U << 30;
  return (x | mask) + (y | mask);
}

int32_t foo7(int32_t x, int32_t y) {
  const int32_t mask = 1U << 31;
  return (x | mask) + (y & ~mask);
}

See for yourself here. The common theme across these functions is that each overflow check can be seen to be unnecessary by looking at the high-order bits of its inputs. The code for these optimizations is here.

Now on the other hand, LLVM is unable to see that these functions don’t require overflow checks:

int foo8(int x) {
  return x < INT_MAX ? x + 1 : INT_MAX;
}

int foo9(int x, int y) {
  if (x < 10 && x > -10 && y < 10 && y > -10)
    return x + y;
  return 0;
}

Here’s another one where the check doesn’t get eliminated:

void foo10(char *a) {
  for (int i = 0; i < 10; i++)
    a[i] = 0;
}

There’s good news: Sanjoy Das has a couple of patches that, together, solve this problem. Their overall effect on SPEC is to reduce the overhead of signed integer overflow checking to 8.7%.

Finally, this function should get one overflow check rather than three:

unsigned foo11(unsigned *a, int i) {
  return a[i + 3] + a[i + 5] + a[i + 2];
}

This example (or something like it) is from Nadav Rotem on twitter; his redundant overflow check removal pass for Swift eliminates this sort of thing at the SIL level. I’ve done a bit of work on bringing these ideas to LLVM, and will hopefully have more to write about that later on.

In summary, signed integer overflow checks in LLVM are fast but they get in the way of the optimizers, which haven’t yet been systematically taught to see through them or eliminate them. There’s plenty of low-hanging fruit, and hopefully we can get the overhead down to the point where people can turn on overflow checking in production codes without thinking too hard about the tradeoffs.

Addenda:

I haven’t forgotten about Souper! We’ve taught it to eliminate integer overflow checks. I’ll write about that later too.

See Dan Luu’s article on this topic.


11 responses to “Efficient Integer Overflow Checking in LLVM”

  1. Mill Computing is designing a new CPU architecture that natively implements interger overflow checks. Since their high-end CPU is expected to perform better than Intel Xeon, I see a bright future for Mill Computing.
    On a Mill CPU on simply does
    addsx b0, b1

    A Mill can also do a “widening add” where the result is two times as wide as the two original operands and therefore never overflows. And a saturating add where the result is maxint if it overflows.

    I expect that other CPU manufacturers will implement new instructions as well.

  2. Can you expand on what you mean by “some of LLVM’s sanitizers, such as ASan, will increase an application’s attack surface”? Is this just referring to the fact that with an ASan binary any detected memory error (even if a ‘mostly harmless’ one – say a one-past-the-end read where the value doesn’t end up affecting application control flow) results in an process abort and is thus an easy DoS? But much of that time a memory corruption is exploitable given sufficient effort/ingenuity, and of the options for “incorrect results due to memory corruption” I’d typically prefer “ASan crash” to “exec rm -rf /” or “open remote shell” or the like. So I don’t see how ASan increases the attack surface; it just converts arbitrary code executions into crashes.

    I guess you might mean that libasan is additionally linked into the process, and perhaps the ASan runtime itself could become the target of a memory corruption attack… ?

  3. I think you buried the lede here…

    Technically speaking, what can SPEC say it honestly benchmarks if it invokes UB in the process of doing it?

  4. Mike, yeah, it’s a bit worrisome that SPEC has undefined behavior, this means that the benchmarks are effectively written in a dialect of C that is widely implemented but is not described in any C standard. But anyway, basically all nontrivial C/C++ code has undefined behavior so there’s not much they could have done about it.

  5. > basically all nontrivial C/C++ code has undefined behavior

    Understood, but OTOH, most C/C++ code out there is not written for the express purpose of proving out the quality of a compiler/CPU implementation. Code of that sort should be held to a higher standard and the SPEC maintainers should take your discovery very seriously.

  6. Mike, I agree that benchmark code has to be held to a high standard– if they execute UB (and still expect certain outputs) this isn’t fair to compiler developers. In fact, I believe it has happened multiple times that the GCC or LLVM devs have had to revert correct compiler changes that happened to break SPEC.

  7. Fun way to see an interger overload error :
    Open MS Word or Excel
    Open VBA
    Press ctrl-G to get “immediate” window
    note: “?” means output to window
    type “? ( 25000 + 25000 )”
    It generates an overflow error 🙂

  8. > This code is efficient; here Chandler Carruth explains why.

    One additional nit: The add and jo will definitely get fused into one uop, but that uop can only be executed by 2 functional units in Haswell (ports 0 and 6), as opposed to 4 for the unchecked add (0,1,5,6). So, even if all loop optimization blocks are fixed, there might be a bit of overhead left. I would be surprised if that’s larger than 1% though — contention for arithmetic units is a fairly rare bottleneck.

  9. I remember coding the function ‘checkRippleForAdd’ in that overflow check function. It was a TODO in the code base. Never realized the potential of such checks when i wrote the code because it was just checking and not modifying/creating any IR. It took some more reading to understand its significance.