Counting to 4 Billion Really Fast

Before teaching today I wrote a silly program to make sure that the bitwise complement of any 32-bit value is equal to one less than the two’s complement negation of that value:

#include <stdio.h>
#include <limits.h>
#include <assert.h>

void foo (unsigned x)
  unsigned x1 = ~x;
  unsigned x2 = -x - 1;
  assert (x1 == x2);

int main (void)
  unsigned x = 0;
  long checked = 0;
  while (1) {
    foo (x);
    if (x==UINT_MAX) break;
  printf ("checked %ld values\n", checked);
  return 0;

When the program terminated without any perceptible delay, I figured there was a bug, but nope: the code is good. It turns out that both GCC and Clang optimize the program into effectively this:

int main (void)
  printf ("checked 4294967296 values\n");
  return 0;

The surprise (for me) was that at -O1 — which traditionally does not enable interprocedural optimizations or aggressive loop transformations — both compilers looked inside the function foo() closely enough to figure out that it is a nop, and also that both compilers were able to predict that a not-traditionally-structured loop executes 2^32 times. I do so many posts about compiler bugs here that I figured a bit of antidote would be nice.

This is gcc 4.2.1 and Apple LLVM version 4.2.


11 responses to “Counting to 4 Billion Really Fast”

  1. I’m guessing inlining+DCE is now common and “basic”, but the compile-time evaluation of ‘checked’ is cool! Is there a well-established dataflow algorithm for that?

    Did the code work properly once, I assume, you changed the last line of foo to something like ” return x1 == x2″, then asserted on that in the main loop?

    BTW, this is a much cleaner way of doing fast loops than I had seen before : calloc() an array such that it ends at a page boundary, then set permissions on the next page such that an access will cause an exception, then finish the loop in the handler!

  2. Hi Eric, I believe that both GCC and LLVM use “scalar evolution” passes to compute loop counts. But I don’t know who developed these techniques and when.

  3. Forgive my asking, but are you sure assert is even called? (It needs -Og or #define _DEBUG.) Maybe f00() was a NOP because without assert it has no outside effects.

    Also, calling printf with just a string isn’t a very good optimisation because printf compares every character to ‘%’! (I know you meant to say printf(“checked %ld values\n”, UINT_MAX);).

    Eric, that’s a nice trick! But I’d hardly call it clean ;-). Is it actually faster? I imagine it only really works if you have to loop over the same array many times.

  4. Hi Magnus, I did double check that assertion by making it fail. GCC and Clang (for versions and platforms that I use, at least) only omit assertion checks when NDEBUG is defined.

    I was expecting both compilers to turn the printf into a puts, but neither one did that.

  5. Rewriting standard library functions seems a dubious pastime to me beyond builtins and obvious arithmetical optimizations. The assumptions compilers can make about them are limited. I wonder if rewriting printf() to puts() would actually pay off — nominally puts() is simpler since it skips the formatting, but you’d expect the I/O to be the bottleneck anyway. And the compiler needs to check that the string contains no formatting specifiers and the return value is not used (since the return value of puts() cannot be used to emulate the return value of printf()). All that for a library that for all you know might implement puts() by calling printf()…

  6. Jeroen, I’m pretty sure that previous versions of GCC did the puts() optimization. Perhaps they backed off due to problems like the ones you mention. The real benefit of this replacement would be on embedded systems where (assuming the libraries are designed well or the linker is pretty smart) there would be some chance of not even linking printf() into the executable.

    How many codes have you seen that checked the return value of printf()? I’m not sure I’ve ever seen it done “for real”.