How Should You Write a Fast Integer Overflow Check?


Detecting integer overflow in languages that have wraparound semantics (or, worse, undefined behavior on overflow, as in C/C++) is a pain. The issue is that programming languages do not provide access to the hardware overflow flag that is set as a side effect of most ALU instructions. With the condition codes hidden, overflow checks become not only convoluted but also error-prone. An obvious solution — writing overflow checks in assembly — is undesirable for portability reasons and also it is not as performant as we would hope: assembly has opaque semantics to the higher-level language and the overflow checks cannot be removed in cases where the overflow can be shown to not happen.

In this post we’ll look at various implementations for this addition function:

/*
 * perform two's complement addition on the first two arguments;
 * put the result in the location pointed to by the third argument;
 * return 1 if overflow occurred, 0 otherwise
 */
int checked_add(int_t, int_t, int_t *);

This operation closely mimics a hardware add instruction. Here’s an optimal (at least in the number of instructions) implementation of it for the 64-bit case for x86-64 using the GCC/Linux calling convention:


checked_add:
        xor     %eax, %eax
        addq    %rsi, %rdi
        movq    %rdi, (%rdx)
        seto    %al
        ret


The problem is that compilers aren’t very good at taking an implementation of checked_add() in C/C++ and turning it into this code.

This is perhaps the simplest and most intuitive way to check for overflow:

int checked_add_1(int_t a, int_t b, int_t *rp) {
  wideint_t lr = (wideint_t)a + (wideint_t)b;
  *rp = lr;
  return lr > MAX || lr < MIN;
}

The only problem here is that we require a wider integer type than the one being added. When targeting x86-64, both GCC and Clang support 128-bit integers:

typedef int64_t int_t;
typedef __int128 wideint_t;
#define MAX INT64_MAX
#define MIN INT64_MIN

On the other hand, neither compiler supports 128-bit integers when generating code for a 32-bit target.

If we can’t or don’t want to use the wider type, we can let a narrow addition wrap around and then compare the signs of the operands and result:

int checked_add_2(int_t a, int_t b, int_t *rp) {
  int_t r = (uint_t)a + (uint_t)b;
  *rp = r;
  return 
    (a < 0 && b < 0 && r >= 0) ||
    (a >= 0 && b >= 0 && r < 0);
}

This code is the equivalent bitwise version:

#define BITS (8*sizeof(int_t))

int checked_add_3(int_t a, int_t b, int_t *rp) {
  uint_t ur = (uint_t)a + (uint_t)b;
  uint_t sr = ur >> (BITS-1);
  uint_t sa = (uint_t)a >> (BITS-1);
  uint_t sb = (uint_t)b >> (BITS-1);
  *rp = ur;
  return 
    (sa && sb && !sr) ||
    (!sa && !sb && sr);
}

What if we don’t have a wider type and also we don’t want to use unsigned math? Then we call this function:

int checked_add_4(int_t a, int_t b, int_t *rp) {
  if (b > 0 && a > MAX - b) {
    *rp =  (a + MIN) + (b + MIN);
    return 1;
  }
  if (b < 0 && a < MIN - b) {
    *rp =  (a - MIN) + (b - MIN);
    return 1;
  }
  *rp = a + b;
  return 0;
}

It looks odd but I believe it to be correct. You can find all of these functions (plus a crappy test harness) in this file and x86-64 assembly code for all widths in this file. Let me know if you have a checked add idiom that’s different from the ones here.

As a crude measure of code quality, let’s look at the number of instructions that each compiler emits, taking the smallest number of instructions across -O0, -O1, -O2, -Os, and -O3. Here I’m targeting x86-64 on Linux and using the latest releases of the compilers: Clang 3.4 and GCC 4.9.0. The results from compilers built from SVN the other day are not noticeably different.

8 bits 16 bits 32 bits 64 bits
GCC checked_add_1() 9 9 11 18
GCC checked_add_2() 14 21 21 21
GCC checked_add_3() 18 18 18 18
GCC checked_add_4() 12 12 16 23
Clang checked_add_1() 5 5 5 14
Clang checked_add_2() 16 16 14 14
Clang checked_add_3() 20 23 14 14
Clang checked_add_4() 18 18 18 22
optimal 5 5 5 5

The only real success story here is Clang and checked_add_1() for 8, 16, and 32-bit adds. The optimization responsible for this win can be found in a function called ProcessUGT_ADDCST_ADD() at line 1901 of this file from the instruction combiner pass. Its job — as you might guess — is to pattern-match on code equivalent to (but not identical to — some other passes have already transformed the code somewhat) checked_add_1(), replacing these operations with an intrinsic function that does a narrower add and separately returns the overflow bit. This intrinsic maps straightforwardly onto a machine add and an access to the hardware overflow flag, permitting optimal code generation.

Across both compilers, checked_add_1() gives the best results. Since it is also perhaps the most difficult check to mess up, that’s the one I recommend using, and also it has fairly straightforward equivalents for subtract, multiply, and divide. The only issue is that a double-width operation cannot be used to perform safe addition of 64-bit quantities on a 32-bit platform, since neither Clang nor GCC supports 128-bit math on that platform.

Ideally, authors of safe math libraries would spend some time creating code that can be optimized effectively by both GCC and LLVM, and then we could all just use that library. So far there’s one such example: Xi Wang’s libo, which avoids leaning so hard on the optimizer by directly invoking the llvm.x.with.overflow intrinsics. Unfortunately there’s no equivalent mechanism in GCC (that I know of, at least) for guaranteeing fast overflow checks.


27 responses to “How Should You Write a Fast Integer Overflow Check?”

  1. checked_add_3 can be improved:

    int checked_add_5(int_t a, int_t b, int_t *rp) {
    int_t ur = (uint_t)a + (uint_t)b;
    *rp = ur;
    return (~(a ^ b) & (ur ^ a)) < 0;
    }

    This gives 8 instructions with my version of gcc and clang for 32 and 64 bit ints. This slightly improves the best case for GCC.

  2. Usually instruction count is not a performance metric. It would be nice to see some runtime comparisons.

  3. Since such an operation is likely to be inlined in practice, it would be slightly more realistic to do the comparison on inlined code. In particular, the “overflow” return value will typically only be used in a conditional branch, so reifying it into an actual integer register value (as opposed to a branching flag) is a waste on many architectures.

    Try something like

    if (checked_add_N(a, b, &c))
    crashandburn();

    where crashandburn is declared noreturn.

  4. Perhaps use a compiler builtin for the purpose?

    bool __builtin_sadd_overflow (int x, int y, int *sum)

    Hopefully, this is “optimal” for a given architecture and data type, and may be recognised specially by optimizers?

  5. And what about:

    int checked_add_2(int_t a, int_t b, int_t *rp) {
    *rp = a+b;
    return (a>0 && b>0 && *rp < 0) ? 1 : 0 ;
    }

    Isn't this valid?

  6. Ouch, I forgot the negative numbers in my prvious comment, let me rewrite:

    Something like if “sign of a and b is the same but the sign or rp is different, then overflow. (you could do that with a multiplications if “a * b * (*rp) overflow, but I believe this is faster (I am going to try it my self and I will add some results).

    int checked_add_2(int_t a, int_t b, int_t *rp) {
    *rp = a+b;
    return (a>>7 == b>>7 && (*rp)>>7 != b<<7) ? 1 : 0 ;
    }

    There might be a typo but I think the idea is clear (unless any error in my thinking process)…

  7. Meanwhile developing I just noticed that I shifted only 7 bits, it should be size(int_t-1)… I was just considering one byte integers…

  8. Finally I have came up with a different approach using xors, it seems the neater and the faster, here I leave the comment I have left in HN.

    ——-

    I have added a way of checking the overflow and I have tested it:
    #define BITS (8*sizeof(int))-1

    int checked_add(int a, int b, int *rp)
    {
    *rp = a+b;
    return ((a^b^(*rp)) < 0) ? 1 : 0 ;
    }
    It is around 30% faster than the fastest in the blog and I believe the code is way cleaner. What we are doing is checking if all of the values has the same sign (the "^" is just a "xor"), if so, no overflow in other case, overflow.
    This is the testing program (g++ compliant):
    #include
    #include
    #include

    #define BITS (8*sizeof(int))-1

    int checked_add2(int a, int b, int *rp) {
    uint ur = (uint)a + (uint)b;
    uint sr = ur >> (BITS-1);
    uint sa = (uint)a >> (BITS-1);
    uint sb = (uint)b >> (BITS-1);
    *rp = ur;
    return
    (sa && sb && !sr) ||
    (!sa && !sb && sr);
    }

    int checked_add(int a, int b, int *rp)
    {
    *rp = a+b;
    return ((a^b^(*rp)) < 0) ? 1 : 0 ;
    }

    int main(int argc, char* argv[])
    {
    int a, b, c;
    long clong;
    srand(time(NULL));

    for(unsigned int i = 0; i < 50000000; ++i)
    {
    bool overflow = false;
    a = rand();
    b = rand();
    clong = (long)a + (long)b;

    overflow = checked_add(a,b,&c);

    if(clong != (long)c && !overflow)
    printf("Overflow not detected %i + %i = %i \n", a, b, c);

    }

    return 0;
    }

    Time for "checked_add": 2.694s
    Time for "checked_add2": 3.200s

    ———–

  9. I came up with a slightly different variation:

    int checked_add_6(int_t a, int_t b, int_t* rp) {
    static const uint_t mask = uint_t(1) << (8 * sizeof(int_t) – 1);
    *rp = (uint_t)a + (uint_t)b;
    return ((uint_t)a & mask) == ((uint_t)a & mask)
    && ((uint_t)a & mask) != ((uint_t)*rp & mask);
    }

    This seems to be about the same speed as version 1, marginally faster than Peter De Wachter's version, and significantly faster than any of the others. Timings (nanoseconds per call, using the latest Clang with -O2 on a Xeon 3500 CPU; they vary by a ns or two from run to run but these are typical):

    Function 8 bit 16 bit 32 bit 64 bit
    checked_add_1 58 58 29 74
    checked_add_2 64 64 36 76
    checked_add_3 65 63 36 77
    checked_add_4 69 66 42 83
    checked_add_5 59 58 32 74
    checked_add_6 57 58 30 72

    Interesting that the 32-bit functions are always substantially faster than any of the others, even though they're identical at the C level and similar in instruction count. I suppose this means the CPU architecture is optimized for 32-bit arithmetic and needs to do a certain amount of bit twiddling at the microcode level to handle the other sizes.

    (Sorry about the layout, I don't know what sort of formatting your comment form accepts, if any.)

  10. …Oops, just realized I made a typo in my code,it checks a twice instead of a and b in the second to last line. The fixed code runs at about the same speed as versions 2 and 3 (not really surprising, I suspect all three are optimized to essentially the same code). It looks like Peter’s version is the fastest one that doesn’t rely on the existence of 128-bit integers.

  11. Nice article!

    I think such codes are perfect candidates for superoptimization (like by using superopt or better, superopt-va). Could we, at some point, have selective superoptimization with GCC and LLVM? I know that there was at least one GSoC project on superoptimization for LLVM.

    Best regards
    Nikolaos Kavvadias

  12. It’s been pointed on on Hacker News that in my version the expression (~(a ^ b) & (ur ^ a))
    can be written as ((ur ^ a) & (ur ^ b))
    which eliminates the binary negation. In theory that should benchmark slightly better.

  13. Peter– the code from your comment 12 isn’t correct if we want the function to return 0 or 1. a and b need to be cast to unsigned before xor’ing them to avoid smearing the 1 bits and returning -1.

    Of course anything except 0 counts as “true” in C but still I think it’s friendly to return 0 or 1 if possible from a function like this!

  14. Nikolaos, superoptimization is definitely something that I am thinking about here! The SoC superoptimization project is not able to find this one, but I know how to make it work, I think.

  15. Regarding performance vs. code size, my intent was simply to capture the idea of when the compiler did the right thing vs. generating lots of crap. Of course running the benchmarks is a good idea, perhaps I’ll do a followup post with some numbers.

    Thanks everyone!

  16. Hi kiBytes, you’re not allowed to execute this code at the start of the function:

    *rp = a+b;

    It’s undefined behavior in the overflow case.

  17. Here is a version for 32 and 64 bit types that doesn’t require two’s complement arithmetic, casting, or a larger width type:

    http://www.eliteraspberries.com/files/checked_add.c

    It is also checked by Frama-C:

    frama-c -wp -wp-proof alt-ergo checked_add.c

    It compiles to about 30 instructions, but I suspect it is one of the fastest implementations because the majority of integers are of opposite signs.

  18. I’ve been thinking about this for a while, and I’ve come to the conclusion that arguably the way major programming languages implement integer arithmetic is wrong. There are, as I see it, basically four types of integers:
    1. Arbitrary-precision integers
    2. Wrapped arithmetic
    3. Trapping on overflow / checked overflow
    4. Saturated arithmetic

    The first type is favored by a few dynamic programming languages, but is typically less well-favored in C-ish languages. The fact that arithmetic in C mostly follows #2 (except for the utter weirdness that is signed overflow) I think mostly follows from it being the easiest to implement in hardware. But it’s not clear to me that #2 is the most useful model for programmers, and the fact that many languages make it hard to implement #3 or #4 means that integer overflow vulnerabilities are arguably much more common than they should be.

  19. Joshua, I agree with this. In fact I have a partly-written post about the design space for integer math which basically presents your list in a lot more detail. Hopefully I’ll be able to finish this up soon.

    My view is that wrapping integers are a disaster for code correctness and that trapping would be preferable. I’m bummed that recent systems languages like Go and Rust have wrapping integers.

    But really, both wrapping and trapping are unpleasant. What we really want is to be able to reason about integers using lightweight and medium-weight formal methods in order to show that they do not overflow and also that integers used as array indices are in-bounds. This is harder of course…

  20. Wrapped arithmetic definitely has a steady place in the useful design space, for implementing functions like hashes and ciphers. Though I guess neither of those sensibly wraps *signed* integers, so perhaps those semantics should be restricted only to unsigned integer types?

  21. Phil, yeah, wrapped operations are definitely useful, but I’m not super sure about unsigned types…

    One option would be to eliminate unsigned entirely, as in Java. In this case there would be separate wrapping operators that look different from the default trapping operators.

    Another option would be to have unsigned types but do a better job segregating them from signed types than C does, perhaps requiring explicit casts and definitely trapping on any lossy conversion.

  22. @Joshua Cranmer, blame the x86 and ARM CPU designers: if they included a “trap on integer overflow” mode on each integer operation in the ISA (like the MIPS does), then I think that modern language designers would use it, and that would be a great improvement(*)!

    But no .. let’s optimize for C and only for C, eh?
    It’s quite annoying because it would be very cheap to implement, in fact the biggest issue is the needed bit in the ISA to distinguish trapping/wrapping operation..

    *: of course it’s only a beginning: checking array access is another issue, use after free, etc.

  23. I don’t understand how the line
    if (b < 0 && a < MIN – b) {
    in checked_add_4() is not undefined behaviour. Should that be MIN + b?