The Little C Function From Hell

The other day a student and I were trying to understand a subtle part of the C standard. Often, the easiest way to clarify this kind of issue is to recognize that compiler writers have already grappled with it — so just write some code and see what various compilers do with it. I wrote this function:

int foo (char x) {
  char y = x;
  return ++x > y;
}

Since the expression ++x evaluates to the incremented value of x, it is clear that this function should return “1” for most values of x. The question is: what does it compute for CHAR_MAX? One possibility is that the function reliably returns “0” for that input; the other is that ++x is undefined on platforms where char is a signed type. For completeness, here’s the test harness that prints foo()’s output for all possible inputs:

#include <stdio.h>
#include <limits.h>

int main (void) {
  int i;
  for (i=CHAR_MIN; i<=CHAR_MAX; i++) {
    printf ("%d ", foo(i));
    if ((i&31)==31) printf ("\n");
  }
  return 0;
}

This code only works if char is narrower than int, but I’ve never seen a platform where that is not the case.
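
If you want a build to fail loudly on such a platform instead of silently misbehaving, the assumption can be spelled out with a C11 static assertion (not part of the original harness):

#include <limits.h>

_Static_assert (CHAR_MAX < INT_MAX, "char must be narrower than int");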

The first sign that something odd was going on appeared when I compiled the code using Clang:

regehr@home:~$ clang -O foo.c -o foo
regehr@home:~$ ./foo
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

Not cool — the function is supposed to mostly return true. But then I realized that my default Clang was out of date (2.7) so I tried a very recent Clang snapshot (rev 126534):

regehr@home:~$ clang -O0 foo.c -o foo
regehr@home:~$ ./foo
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0
regehr@home:~$ clang -O1 foo.c -o foo
regehr@home:~$ ./foo
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

The result (look at the last character) changed when we changed the optimization level — this is OK if incrementing CHAR_MAX is undefined, and is a compiler bug otherwise.

The Intel C compiler (12.0.2 for x86-64) has similar behavior:

[regehr@bethe ~]$ icc -O0 foo.c -o foo
[regehr@bethe ~]$ ./foo
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0
[regehr@bethe ~]$ icc -O foo.c -o foo
[regehr@bethe ~]$ ./foo
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

A very recent GCC (rev 170512 for x86) gives a consistent output:

regehr@home:~$ current-gcc -O0 foo.c -o foo
regehr@home:~$ ./foo
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0
regehr@home:~$ current-gcc -O2 foo.c -o foo
regehr@home:~$ ./foo
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0

However, the story changes if we ask it not to perform function inlining:

regehr@home:~$ current-gcc -O2 -fno-inline foo.c -o foo
regehr@home:~$ ./foo
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

So far, Clang 2.7 appears to be wrong, but all other observed behavior is consistent with ++x being undefined when x is CHAR_MAX. Then I tried CompCert and things took a turn for the worse:

regehr@home:~$ ccomp foo.c -o foo
regehr@home:~$ ./foo
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1

This is very odd because CompCert contains a verified version of C’s tricky implicit casts — the exact thing that foo() was designed to test in the first place.

To make a long story short: when the “char” type is narrower than the “int” type and when x has type “signed char” and value CHAR_MAX, ++x is well-defined by both ANSI C and C99. We know that it is well-defined because:

  1. The standard says: “The expression ++E is equivalent to (E+=1).”
  2. The standard says: “A compound assignment of the form E1 op= E2 differs from the simple assignment expression E1 = E1 op (E2) only in that the lvalue E1 is evaluated only once.”
  3. The RHS of this simple assignment expression, “E1 op (E2)”, is subject to C’s “usual arithmetic conversions.”
  4. The usual arithmetic conversions ensure that two operands to a “+” operator of type signed char are both promoted to signed int before the addition is performed.
  5. When int is wider than char, there is no possibility of the resulting addition overflowing. Thus, the behavior is well-defined and every correct compiler must emit this output:

1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0

In other words, to evaluate ++x > y when both variables have type signed char and value CHAR_MAX:

  1. Signed char 127 is promoted to signed int 127.
  2. Signed int 127 is incremented to signed int 128.
  3. Signed int 128 is cast to signed char -128, which is the new value for x and also the value of the subexpression ++x. (Update: As Mans points out in comment #5, the result of this type cast is implementation defined. All common C implementations define the behavior to be truncation of a 2’s complement integer.)
  4. Signed char -128 is promoted to signed int -128.
  5. Signed int -128 > signed int 127 is evaluated to 0.
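
Spelled out as C, with both variables holding CHAR_MAX, the whole evaluation looks like this (a sketch assuming 8-bit char, int wider than char, and the common implementation-defined 2’s complement truncating conversion):

int t = (int) x + 1;       /* steps 1-2: promote, then add; 127 + 1 = 128, no overflow */
x = (signed char) t;       /* step 3: implementation-defined conversion yields -128 */
return (int) x > (int) y;  /* steps 4-5: promote again; -128 > 127 evaluates to 0 */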

But what about the fact that none of the four compilers I commonly use reliably returns the correct result?

  • In GCC the bug is known and has existed for some time.
  • In LLVM/Clang the bug was not known but was fixed in less than 24 hours.
  • The Intel compiler is wrong too.
  • In CompCert there is no bug. However, there is an unfortunate interaction between its definition of “char” to be unsigned and the signed values for CHAR_MIN and CHAR_MAX found in my Linux machine’s limits.h file. Verifying a compiler is an extremely difficult problem and verifying its entire environment (header files, libraries, assemblers, I/O routines, etc.) is pretty much an open problem. This is why we test.

That’s a lot of trouble being caused by a two-line function. C may be a small language, but it’s not a simple one.

[Thanks to Chucky Ellison from UIUC and to Xavier Leroy for helping me puzzle through all this.]

How to Write a C/C++ Compiler That Respects Volatile

The volatile type qualifier in C/C++ means roughly that accesses to the qualified object happen on the actual machine as they do in the abstract machine. I’ve written about volatile pretty extensively, so I won’t repeat myself.

An interesting problem with volatile is that in practice, compilers fail to respect it: they add, remove, and reorder accesses in ways that break the rules. This happens because the rules for translating accesses to volatile-qualified objects are very different from the rules for accessing regular C variables: the compiler has nearly complete freedom to add, remove, and reorder non-volatile variable accesses so long as this doesn’t change the program’s externally observable behavior. Thus, many optimization passes require special cases like this:

if (!var->is_volatile()) transform_code();

The problem is that every optimization developer must add these extra safety checks every time — any omission is likely to break the properties that volatile is intended to preserve.

A few years ago Eric Eide and I observed that the rules for accessing volatile objects are very similar to the rules for manipulating function calls. In other words, when the compiler lacks any special knowledge about a called function, it must not add, remove, or reorder function calls. This led to the idea that if compiler writers would simply model volatile accesses as function calls, all of those special cases in the optimization passes would go away. We tested this idea by writing a source-to-source transformer for C code that turned accesses to volatiles into calls to automatically generated helper functions. In other words, if x is defined as “volatile int x;” then this code:

y=x;

becomes:

y=__volatile_read_int(&x);
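
The generated helpers themselves are trivial; they might look like this (a sketch, since the exact code our tool emitted differed in details; __volatile_write_int is the analogous helper for stores):

int __volatile_read_int (volatile int *p) {
  return *p;   /* the one remaining volatile access */
}

void __volatile_write_int (volatile int *p, int v) {
  *p = v;
}

The key property is that the volatile access now hides behind a call that the optimizers must treat conservatively.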

Our idea mostly worked. What I mean is that many, but not all, problems in miscompilation of accesses to volatiles went away after transforming programs. Our hypothesis was that when wrapper functions didn’t work, it was always because the compiler was performing a regular miscompilation (i.e., generating the wrong code in a way that has nothing to do with volatile). But we couldn’t back this up since we lacked a correct compiler and we also didn’t have time to manually inspect thousands of failed test cases.

So far this is all old news, but there has been a very nice new development in volatile-land. CompCert now implements a proved volatile semantics that works like this:

  1. Accesses to volatile-qualified objects are turned into function calls in the frontend.
  2. The target-independent optimization passes totally ignore these function calls.
  3. In the backend, the function calls are turned into inline code that performs the accesses.

This is basically the strategy that Eric and I came up with, but with a nice improvement: much of the overhead associated with actual function calls is avoided due to the late inline substitution. The overhead of function calls — particularly in terms of code size — would be significant for small embedded applications that consist mainly of accesses to device registers. Some overhead will likely remain due to the calling convention and because CompCert must pessimistically assume that a helper function has updated any global variables that happen to be cached in registers. Hopefully the CompCert folks will quantify the overheads of various alternatives in a paper sometime.

We looked for volatile bugs in CompCert and found a few: they were in, for example, the unproved frontend code that expands C-level struct assignments into Clight. After Xavier fixed these bugs (I think there were two of this ilk) we can no longer find volatile bugs in CompCert. Now we finally come to the point of this post:

Out of the dozen-odd compilers we have tested, there is only one — CompCert — that reliably respects C’s volatile qualifier.

This is an impressive result.

Update from March 1 2011:

My description of CompCert is a bit out of date. Xavier Leroy explains:

One little correction, though: the handling of volatiles you describe is that of CompCert 1.7.

In the latest release CompCert 1.8, I improved the generated code by, in essence, inlining the _volatile_read and _volatile_write functions after optimizations are done, but before register allocation. (In reality, it’s done a bit differently, through a notion of “inlinable builtin functions” of which the volatile operations is an instance.)

This way, the generated code isn’t constrained by the calling conventions: the compiler knows that the caller-save registers are not destroyed by the _volatile_* functions, and that these “functions” can take their arguments in any register, not just those dictated by the calling conventions.

This sounds like exactly the right solution: not only does it give us correct code and optimization passes that are free of volatile-related clutter, but the performance and size of the generated code should be very good.

The Synergy Between Delta Debugging and Compiler Optimization

Before reporting a compiler bug, it’s best to reduce the size of the failure-inducing input. For example, this morning I reported an LLVM bug where the compiler enters an infinite loop when compiling this C code:

static int foo (int si1, int si2) {
  return si1 - si2;
}
void bar (void) {
  unsigned char l_197;
  for (;; l_197 = foo (l_197, 1))
    ;
}

My bug report was a good one because the test program looks pretty close to minimal; it was reduced manually and automatically from code that was originally 25 KB. Testcase reduction is so important that the GCC folks have an entire document devoted to the topic, including instructions for how to use the Berkeley Delta implementation.

The Berkeley Delta is pretty awesome except when it is not. It often gets stuck at local minima that cannot be avoided except by making coordinated changes to multiple parts of a program. For example, it will never remove an argument from a function definition and also from all calls to the function. Therefore, a person works with delta like so:

  1. Run delta
  2. Do more reduction by hand
  3. Goto step 1

After spending more hours on this kind of activity than I’d care to admit, I realized that most of the stuff I was doing resembled compiler optimizations:

  • inlining a function
  • deleting a dead function argument
  • propagating a constant
  • reducing the dimension of an array
  • removing a level of indirection from a pointer

I think the similarity is a useful observation, but there are important differences between an optimizer and a testcase reducer:

  1. An optimizer must be semantics-preserving, whereas a testcase reducer doesn’t need to be
  2. An optimization is desirable when it speeds up the code, whereas a reduction step is desirable when it both simplifies the test case and also doesn’t make the bug disappear

Thus, to reuse an optimizer as a reducer, we would simply need to drive it in a typical Delta loop, being ready to back out any transformation which makes the bug go away or fails to make the test case simpler. I haven’t had time to do this; hopefully someone else will.
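
The driver itself would be simple. Here is a minimal sketch in C, where interestingness.sh and apply_pass are hypothetical stand-ins for a script that checks whether the bug still reproduces and for an optimizer pass run as a source-to-source transform (the check that a kept transform actually made the test case simpler is elided):

#include <stdio.h>
#include <stdlib.h>

#define NPASSES 8   /* hypothetical number of available reduction passes */

/* Exit code 0 means the compiler bug still reproduces on test.c. */
static int still_interesting (void) {
  return system ("./interestingness.sh test.c") == 0;
}

int main (void) {
  char cmd[64];
  int progress;
  do {
    progress = 0;
    for (int p = 0; p < NPASSES; p++) {
      system ("cp test.c backup.c");           /* save the current variant */
      snprintf (cmd, sizeof cmd, "./apply_pass %d test.c", p);
      system (cmd);                            /* try one transformation */
      if (still_interesting ())
        progress = 1;                          /* keep the simplification */
      else
        system ("cp backup.c test.c");         /* back it out */
    }
  } while (progress);                          /* stop at a fixpoint */
  return 0;
}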

A Critical Look at the SCADE Compiler Verification Kit

While searching for related work on compiler testing and certification, I ran across the SCADE Compiler Verification Kit: a collection of SCADE-generated C code along with some test vectors. The idea is to spot compiler problems early and in a controlled fashion by testing the compiler using the kinds of C code that SCADE generates.

SCADE is a suite of tools for building safety critical embedded systems; it generates C code that is then cross-compiled for the target platform using a convenient compiler. Using C as a portable assembly language is a fairly common strategy due to the large variety of embedded platforms out there.

Given that there is a lot of variation in the quality of embedded C compilers (in other words, there are some really buggy compilers out there), something like the Compiler Verification Kit (CVK) probably had to be created. It’s a great idea and I’m sure it serves its purpose. Unfortunately, the developers over-state their claims. The CVK web page says:

Once this verification of the compiler is complete, no further low-level testing is needed on the SCADE-generated object code. [emphasis is theirs]

This is saying that since the CVK tests all language features and combinations of language features used by SCADE, the CVK will reveal all compiler bugs that matter. This claim is fundamentally wrong. Just as a random example, we’ve found compiler bugs that depend on the choice of a constant used in code. Is the CVK testing all possible choices of constants? Of course not.

The web page also states that the CVK has:

…the test vectors needed to achieve 100% MC/DC of structural code coverage.

No doubt this is true, but it is misleading. 100% coverage of the generated code is not what is needed here. 100% coverage of the compiler under test would be better, but even that would be insufficient to guarantee the absence of translation bugs. Creating safety-critical systems that work is a serious business, and it would be nice if the tool vendors in this sector took a little more trouble to provide accurate technical claims.

A Quick Update to Comparing Compiler Optimizations

Saturday’s post on Comparing Compiler Optimizations featured some work done by my student Yang Chen. Since then:

  • There has been some discussion of these problems on the GCC mailing list; some of the problems are already in the Bugzilla and a new PR was filed. A patch fixing the problem is already available!
  • On Sunday night Chris Lattner left a comment saying that two of the Clang performance problems have been fixed (the other two are known problems and will take more time to fix).

Have I ever said how great open source is? Of course, there’s nothing stopping a product group from fixing a bug just as quickly — but the results would not be visible to the outside world until the next release. When improvements to a piece of software depend on a tight feedback loop between developers and outsiders, it’s pretty difficult to beat the open source model.

Yang re-tested Clang and here are the results.

Example 4

Previously, Clang’s output required 18 cycles to execute. The new output is both faster (4 cycles) and smaller than GCC’s or Intel CC’s:

tf_bH:
        movq    %rsi, last_tf_arg_u(%rip)
        imull   $43691, %esi, %eax      # imm = 0xAAAB
        shrl    $17, %eax
        ret

Nice!
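
The imull/shrl pair is the standard reciprocal trick: 43691 * 3 = 2^17 + 1, so (c * 43691) >> 17 equals c / 3 for every 16-bit c. For the skeptical, this is cheap to check exhaustively:

#include <assert.h>

int main (void) {
  unsigned c;
  for (c = 0; c <= 0xFFFF; c++)
    assert ((c * 43691u) >> 17 == c / 3);
  return 0;
}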

Example 7

Previously, Intel CC’s 8-cycle code was winning against Clang’s 24 and GCC’s 21 cycles. Clang now generates this code:

crud:
	cmpb	$33, %dil
	jb	.LBB0_4
	addb	$-34, %dil
	cmpb	$58, %dil
	ja	.LBB0_3
	movzbl	%dil, %eax
	movabsq	$288230376537592865, %rcx
	btq	%rax, %rcx
	jb	.LBB0_4
.LBB0_3:
	xorl	%eax, %eax
	ret
.LBB0_4:
	movl	$1, %eax
	ret

Although this is several instructions shorter than Intel CC’s output, it still requires 10 cycles to execute on our test machine, as opposed to ICC’s 8 cycles. Perhaps this is in the noise, but Yang spent a while trying to figure it out anyway. It seems that if you replace the btq instruction with a testq (and adjust the subsequent branch appropriately), Clang’s output also executes in 8 cycles. I won’t pretend to understand this.

Example 6 (Updated Dec 17)

This has been fixed in GCC, see the discussion here. The failure to optimize was pretty interesting: some hacking in GCC’s profile code caused some breakage that led it to infer that the division instructions in this function are “cold” and not worth optimizing to multiplies. This is a great example of how the lack of good low-level metrics makes it difficult to detect subtle regressions. Of course this would have been noticed eventually by someone who reads assembly, but there are probably other similar problems that are below the noise floor when we look only at results from high-level benchmarks.

Comparing Compiler Optimizations

[Update from Dec 14: Some of these problems have already been fixed! Details here.]

[This is a guest post by my grad student Yang Chen, describing one of the projects he’s been working on lately. I elaborated on some of Yang’s points, edited, and did some formatting for WordPress.]

Our goal is to help C compiler developers find areas where their optimizer could stand to be improved. We do this by taking some open source applications and, for each application:

  1. Select a number of benchmark functions. For now, all benchmarks are leaves: they make no function calls. There are a few other restrictions too.
  2. Instrument the source code so that all inputs to benchmark functions (including those accessed through pointers) are logged.
  3. Run whatever test suite comes with the application, logging inputs to benchmark functions. Logging stops after a configurable maximum number of distinct inputs has been seen.
  4. Extract standalone versions of the benchmark functions.
  5. For each compiler under test, compile each extracted benchmark with a test harness. Run the resulting code on an otherwise quiescent machine in order to measure the performance of the benchmark function in isolation.
  6. Look for outlier functions where one compiler created significantly faster code than another compiler — these are potentially interesting and we inspect them manually. Some examples are below.

When we report the number of cycles required to execute a function, this is an average over all logged inputs.

The test machine is built around an Intel Core i7-920 processor, which has 4 cores running at 2.67GHz and 8MB of shared L3 cache. It has 6 GB of RAM. It runs the x86-64 version of Ubuntu 9.10. Performance is measured using the RDTSC instruction. The test process is pinned to a single processor and all CPU and OS power management functionality is turned off.
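
For reference, the cycle counter is read with the usual x86 idiom; a helper like this is a reasonable sketch (not Yang’s actual harness):

#include <stdint.h>

/* Read the processor's time-stamp counter (GCC-style inline assembly). */
static inline uint64_t rdtsc (void) {
  uint32_t lo, hi;
  __asm__ __volatile__ ("rdtsc" : "=a" (lo), "=d" (hi));
  return ((uint64_t) hi << 32) | lo;
}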

The compiler options we used are intended to:

  1. Optimize for maximum speed
  2. Give the tools a fighting chance at making equivalent assumptions about the hardware platform

Clang

Svn head built on Dec 4, 2010:

$ clang -v
clang version 2.9 (trunk 120838)
Target: x86_64-unknown-linux-gnu
Thread model: posix

Invocation:

clang -O3 -mssse3 -march=core2 -mtune=core2 -fomit-frame-pointer -fno-stack-protector -w

GCC

Svn head built on Dec 4, 2010:

$ current-gcc -v
Using built-in specs.
COLLECT_GCC=./compilers/bin/current-gcc
COLLECT_LTO_WRAPPER=/uusoc/exports/scratch/chenyang/performance_tests/compilers/libexec/gcc/x86_64-unknown-linux-gnu/4.6.0/lto-wrapper
Target: x86_64-unknown-linux-gnu
Configured with: ../configure --prefix=/uusoc/exports/scratch/chenyang/performance_tests/compilers --program-prefix=current- --enable-lto --enable-languages=c,c++,lto
Thread model: posix
gcc version 4.6.0 20101204 (experimental) (GCC)

Invocation:

current-gcc -O3 -mfpmath=sse -mssse3 -march=core2 -mtune=core2 -fno-stack-protector -fomit-frame-pointer -w

Intel

Version is:

$ icc -V
Intel(R) C Intel(R) 64 Compiler Professional for applications running on Intel(R) 64, Version 11.1 Build 20100414
Copyright (C) 1985-2010 Intel Corporation. All rights reserved.
FOR NON-COMMERCIAL USE ONLY

Invocation:

icc -O3 -static -no-prec-div -xHost -mssse3 -fno-stack-protector -fomit-frame-pointer -w

Example 1

(Here and in other examples, we have edited the assembly code for clarity. Also note that we are showing the source code as seen by the compilers being tested. It is not only preprocessed, but also has been parsed by our benchmark extraction tool, which is based on CIL. CIL does a few simple transformations that make C code a lot uglier sometimes.)

This function is setboundaries() from zic.c in postgresql-9.0.0.

 1 typedef long int64;
 2 typedef int64 zic_t;
 3 void setboundaries (void);
 4 zic_t max_time;
 5 zic_t min_time;
 6 void
 7 setboundaries (void)
 8 {
 9   int i;
10
11   min_time = -1L;
12   i = 0;
13   while (i < 63)
14     {
15       min_time *= 2L;
16       i++;
17     }
18   max_time = -(min_time + 1L);
19   return;
20 }

GCC and Intel CC generate code that executes in 124 cycles and 160 cycles, respectively. Clang’s output, on the other hand, requires only 4 cycles because it can statically evaluate the loop. GCC and ICC can statically evaluate many loops, but not this one.

GCC output:

setboundaries:
        movl    $63, %eax
        movq    $-1, %rdx
.L2:    addq    %rdx, %rdx
        subl    $1, %eax
        jne     .L2
        movq    %rdx, min_time(%rip)
        notq    %rdx
        movq    %rdx, max_time(%rip)
        ret

Clang output:

setboundaries:
        movabsq $-9223372036854775808, %rax # imm = 0x8000000000000000
        movq    %rax, min_time(%rip)
        movabsq $9223372036854775807, %rax # imm = 0x7FFFFFFFFFFFFFFF
        movq    %rax, max_time(%rip)
        ret

Example 2

This function is mad_synth_mute() from synth.c in mad-0.14.2b from the MiBench embedded benchmark suite.

 1 typedef int mad_fixed_t;
 2 struct mad_pcm
 3 {
 4   unsigned int samplerate;
 5   unsigned short channels;
 6   unsigned short length;
 7   mad_fixed_t samples[2][1152];
 8 };
 9 struct mad_synth
10 {
11   mad_fixed_t filter[2][2][2][16][8];
12   unsigned int phase;
13   struct mad_pcm pcm;
14 };
15 void mad_synth_mute (struct mad_synth *synth);
16 void
17 mad_synth_mute (struct mad_synth *synth)
18 {
19   unsigned int ch;
20   unsigned int s;
21   unsigned int v;
22
23   ch = 0U;
24   while (ch < 2U)
25     {
26       s = 0U;
27       while (s < 16U)
28 	{
29 	  v = 0U;
30 	  while (v < 8U)
31 	    {
32 	      synth->filter[ch][1][1][s][v] = 0;
33 	      synth->filter[ch][1][0][s][v] = 0;
34 	      synth->filter[ch][0][1][s][v] = 0;
35 	      synth->filter[ch][0][0][s][v] = 0;
36 	      v++;
37 	    }
38 	  s++;
39 	}
40       ch++;
41     }
42   return;
43 }

The questions here are: Can the compiler recognize that the loops are equivalent to a bzero()? And, how fast of a bzero() can be generated?

GCC’s code executes in 256 cycles because the loop is completely unrolled and SSE instructions are used. Intel CC’s code uses SSE but does less unrolling, and takes 308 cycles. Clang produces an obvious translation of the source code — locality is poor and 1032 cycles are required.
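
For what it’s worth, the opportunity the compilers are being offered is easy to state at the source level: the four interleaved stores cover every element of filter for both values of ch, and array storage is contiguous, so the whole triple loop is equivalent to a single call:

memset (synth->filter, 0, sizeof synth->filter);   /* needs <string.h> */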

GCC’s code:

mad_synth_mute:
        pxor    %xmm0, %xmm0
        movdqu  %xmm0, (%rdi)
        movdqu  %xmm0, 16(%rdi)
        movdqu  %xmm0, 32(%rdi)
        movdqu  %xmm0, 48(%rdi)
        movdqu  %xmm0, 64(%rdi)
        ... (obvious sequence of movdqu instructions elided) ....
        movdqu  %xmm0, 3936(%rdi)
        movdqu  %xmm0, 3952(%rdi)
        movdqu  %xmm0, 3968(%rdi)
        movdqu  %xmm0, 3984(%rdi)
        movdqu  %xmm0, 4064(%rdi)
        movdqu  %xmm0, 4080(%rdi)
        ret

Example 3

This is an initialization function taken from pyexpat.c in Python 2.7.

 1 char template_buffer[257];
 2 void
 3 init_template_buffer (void)
 4 {
 5   int i;
 6
 7   i = 0;
 8   while (i < 256)
 9     {
10       template_buffer[i] = (char) i;
11       i++;
12     }
13   template_buffer[256] = (char) 0;
14   return;
15 }

Here Clang generates the obvious looping code requiring 532 cycles: 2 cycles per loop iteration plus some overhead. GCC generates very good SSE code requiring 86 cycles. Intel CC generates code that executes in 18 cycles:

init_template_buffer:
        movl      $269488144, %eax                              #6.3
        movd      %eax, %xmm0                                   #6.3
        pshufd    $0, %xmm0, %xmm1                              #6.3
        movdqa    _2il0floatpacket.0(%rip), %xmm0               #6.3
        xorl      %eax, %eax                                    #6.3
..B1.2:                         # Preds ..B1.2 ..B1.1
        movdqa    %xmm0, template_buffer(%rax)                  #7.5
        paddb     %xmm1, %xmm0                                  #7.5
        movdqa    %xmm0, 16+template_buffer(%rax)               #7.5
        paddb     %xmm1, %xmm0                                  #7.5
        movdqa    %xmm0, 32+template_buffer(%rax)               #7.5
        paddb     %xmm1, %xmm0                                  #7.5
        movdqa    %xmm0, 48+template_buffer(%rax)               #7.5
        paddb     %xmm1, %xmm0                                  #7.5
        movdqa    %xmm0, 64+template_buffer(%rax)               #7.5
        paddb     %xmm1, %xmm0                                  #7.5
        movdqa    %xmm0, 80+template_buffer(%rax)               #7.5
        paddb     %xmm1, %xmm0                                  #7.5
        movdqa    %xmm0, 96+template_buffer(%rax)               #7.5
        paddb     %xmm1, %xmm0                                  #7.5
        movdqa    %xmm0, 112+template_buffer(%rax)              #7.5
        paddb     %xmm1, %xmm0                                  #7.5
        addq      $128, %rax                                    #6.3
        cmpq      $256, %rax                                    #6.3
        jb        ..B1.2        # Prob 99%                      #6.3
        movb      $0, 256+template_buffer(%rip)                 #10.3
        ret                                                     #11.3

Here ICC has performed an impressive 16-way vectorization and 8-way unrolling; the result takes a bit of work to understand.

The 269488144 loaded into eax is 0x10101010. The pshufd instruction (shuffle packed doublewords) loads the value 0x10101010 10101010 10101010 10101010 into xmm1. The next instruction loads xmm0 with 0x0f0e0d0c 0b0a0908 07060504 03020100.

Now the purpose of the first movdqa in the loop is clear: it loads the values {0,1,2, …, 15} into the first 16 bytes of  template_buffer. The paddb (packed add, byte-wise) adds 0x10 to each byte of xmm0, leaving its contents 0x1f1e1d1c 1b1a1918 17161514 13121110. These are moved into the next 16 bytes of template_buffer, and so on. The loop body executes only twice before the entire array is initialized.
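
In C terms, the generated loop is doing something like this (a schematic reconstruction that ignores the 8-way unrolling):

unsigned char chunk[16];   /* one SSE register's worth of bytes */
int base, j;
for (j = 0; j < 16; j++) chunk[j] = (unsigned char) j;                   /* the packed constant */
for (base = 0; base < 256; base += 16) {
  for (j = 0; j < 16; j++) template_buffer[base + j] = (char) chunk[j];  /* movdqa store */
  for (j = 0; j < 16; j++) chunk[j] += 0x10;                             /* paddb */
}
template_buffer[256] = 0;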

Although this translation seems very fragile, icc can generate equivalent object code for most versions of this code where N, X, and Y are constants:

for (i=0; i<N; i++) a[i] = i*X + Y;


Example 4

This function is tf_bH() from _ctypes_test.c in Python 2.7.

 1 unsigned long long last_tf_arg_u;
 2 unsigned short
 3 tf_bH (signed char x, unsigned short c)
 4 {
 5   last_tf_arg_u = (unsigned long long) c;
 6   return ((unsigned short) ((int) c / 3));
 7 }

Here GCC and Intel CC both generate code that executes in 6 cycles, whereas Clang’s takes 18.

GCC’s output:

tf_bH:
	movzwl	%si, %eax
	movq	%rax, last_tf_arg_u(%rip)
	movzwl	%si, %eax
	imull	$43691, %eax, %eax
	shrl	$17, %eax
	ret

Clang’s:

tf_bH:                                  # @tf_bH
	movq	%rsi, last_tf_arg_u(%rip)
	movw	%si, %ax
	movw	$-21845, %cx            # imm = 0xFFFFFFFFFFFFAAAB
	mulw	%cx
	andl	$65534, %edx            # imm = 0xFFFE
	movl	%edx, %eax
	shrl	%eax
	ret

At first glance, both compilers have done the right thing: turned the divide-by-constant into a multiply. The difference in execution times is explained by the Intel Optimization Reference Manual, Assembly/Compiler Coding Rule 21:

Favor generating code using imm8 or imm32 values instead of imm16 values. If imm16 is needed, load equivalent imm32 into a register and use the word value in the register instead.

We verified that this is the problem by changing the imm16 value to imm32 by hand and re-running the performance test.

Example 5

This function is get_domain() from tree-call-cdce.c in GCC 4.5.0.

 1 struct input_domain
 2 {
 3   int lb;
 4   int ub;
 5   unsigned char has_lb;
 6   unsigned char has_ub;
 7   unsigned char is_lb_inclusive;
 8   unsigned char is_ub_inclusive;
 9 };
10 typedef struct input_domain inp_domain;
11 inp_domain
12 get_domain (int lb, unsigned char has_lb, unsigned char lb_inclusive,
13 	    int ub, unsigned char has_ub, unsigned char ub_inclusive)
14 {
15   inp_domain domain;
16
17   domain.lb = lb;
18   domain.has_lb = has_lb;
19   domain.is_lb_inclusive = lb_inclusive;
20   domain.ub = ub;
21   domain.has_ub = has_ub;
22   domain.is_ub_inclusive = ub_inclusive;
23   return (domain);
24 }

This function simply loads its arguments into a struct. GCC pushes the values through memory, taking 35 cycles. Intel CC generates code that uses memory 6 times (as opposed to 8 for GCC), requiring 24 cycles. Clang assembles the struct using registers, requiring only 12 cycles.

GCC’s code:

get_domain:
	movl	%edi, -24(%rsp)
	movl	%ecx, -20(%rsp)
	movb	%sil, -16(%rsp)
	movq	-24(%rsp), %rax
	movb	%r8b, -15(%rsp)
	movb	%dl, -14(%rsp)
	movb	%r9b, -13(%rsp)
	movl	-16(%rsp), %edx
	ret

Clang’s:

get_domain:
	movl	%edi, %eax
	shlq	$32, %rcx
	leaq	(%rcx,%rax), %rax
	shll	$16, %edx
	orl	%esi, %edx
	shll	$8, %r8d
	orl	%edx, %r8d
	shll	$24, %r9d
	movl	%r9d, %edx
	orl	%r8d, %edx
	ret

Example 6

This function is php_filter_parse_int() from logical_filters.c in PHP 5.3.3.

  1 int
  2 php_filter_parse_int (char const *str, unsigned int str_len, long *ret)
  3 {
  4   long ctx_value;
  5   int sign;
  6   int digit;
  7   char const *end;
  8   int tmp;
  9   char const *tmp___0;
 10   char const *tmp___1;
 11
 12   sign = 0;
 13   digit = 0;
 14   end = str + str_len;
 15   switch ((int) *str)
 16     {
 17     case 45:
 18       sign = 1;
 19     case 43:
 20       str++;
 21     default:;
 22       break;
 23     }
 24   if ((unsigned long) str < (unsigned long) end)
 25     {
 26       if ((int const) *str >= 49)
 27 	{
 28 	  if ((int const) *str <= 57)
 29 	    {
 30 	      if (sign)
 31 		{
 32 		  tmp = -1;
 33 		}
 34 	      else
 35 		{
 36 		  tmp = 1;
 37 		}
 38 	      tmp___0 = str;
 39 	      str++;
 40 	      ctx_value = (long) (tmp * (int) ((int const) *tmp___0 - 48));
 41 	    }
 42 	  else
 43 	    {
 44 	      return (-1);
 45 	    }
 46 	}
 47       else
 48 	{
 49 	  return (-1);
 50 	}
 51     }
 52   else
 53     {
 54       return (-1);
 55     }
 56   if (end - str > 19)
 57     {
 58       return (-1);
 59     }
 60   while ((unsigned long) str < (unsigned long) end)
 61     {
 62       if ((int const) *str >= 48)
 63 	{
 64 	  if ((int const) *str <= 57)
 65 	    {
 66 	      tmp___1 = str;
 67 	      str++;
 68 	      digit = (int) ((int const) *tmp___1 - 48);
 69 	      if (!sign)
 70 		{
 71 		  if (ctx_value <=
 72 		      (9223372036854775807L - (long) digit) / 10L)
 73 		    {
 74 		      ctx_value = ctx_value * 10L + (long) digit;
 75 		    }
 76 		  else
 77 		    {
 78 		      goto _L;
 79 		    }
 80 		}
 81 	      else
 82 		{
 83 		_L:
 84 		  if (sign)
 85 		    {
 86 		      if (ctx_value >=
 87 			  ((-0x7FFFFFFFFFFFFFFF - 1) + (long) digit) / 10L)
 88 			{
 89 			  ctx_value = ctx_value * 10L - (long) digit;
 90 			}
 91 		      else
 92 			{
 93 			  return (-1);
 94 			}
 95 		    }
 96 		  else
 97 		    {
 98 		      return (-1);
 99 		    }
100 		}
101 	    }
102 	  else
103 	    {
104 	      return (-1);
105 	    }
106 	}
107       else
108 	{
109 	  return (-1);
110 	}
111     }
112   *ret = ctx_value;
113   return (1);
114 }

The assembly is long and not that interesting. The upshot is that Clang and Intel CC are able to turn the divisions by 10 at lines 72 and 87 into integer multiplies, whereas GCC leaves division instructions in the generated code. It is not clear why this happens; as we saw above, other divisions-by-constant are optimized successfully by this compiler. The result is that Clang’s and ICC’s code executes in 41 cycles whereas GCC’s takes 133.

Example 7

This function is crud() from ident.c in git-1.7.3.2.

1 int crud (unsigned char c)
2 {
3   return (((((((((((int) c <= 32 || (int) c == 46) || (int) c == 44)
4 		 || (int) c == 58) || (int) c == 59) || (int) c == 60)
5 	      || (int) c == 62) || (int) c == 34) || (int) c == 92)
6 	   || (int) c == 39) != 0);
7 }

GCC and Clang generate the obvious cascaded comparisons and their code executes in 21 and 24 cycles, respectively. Intel CC is far more clever and its code requires only 8 cycles:

crud:
        movzbl    %dil, %edi
        cmpl      $32, %edi
        jle       ..B1.4
        addl      $-34, %edi
        cmpl      $64, %edi
        jae       ..B1.5
        movl      $1, %eax
        movl      %edi, %ecx
        shlq      %cl, %rax
        movq      $0x400000017001421, %rdx
        testq     %rdx, %rax
        je        ..B1.5
..B1.4:
        movl      $1, %eax
        ret
..B1.5:
        xorl      %eax, %eax
        ret

Since the range of characters being checked for is less than 64, ICC is able to pre-compute a bitmap with 1s in positions corresponding to the desired characters.
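
Written back out as C, ICC’s version amounts to this (a sketch; the 64-bit constant has a 1 bit at offset c - 34 for each character in the original list):

int crud_fast (unsigned char c) {
  unsigned long long mask = 0x400000017001421ULL;
  unsigned d;
  if (c <= 32) return 1;
  d = (unsigned) c - 34;            /* addl $-34, %edi; out-of-range c wraps far above 64 */
  if (d >= 64) return 0;            /* cmpl $64; jae */
  return (int) ((mask >> d) & 1);   /* the shlq/testq pair, expressed as a shift */
}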

Conclusions

Our idea is that using a small collection of compilers and a large collection of realistic benchmark functions, we can get the compilers to reveal performance limitations in each other. Once the basic infrastructure is in place, results can be regenerated with almost no manual effort. Although there are obvious limitations of this approach — it cannot spot an optimization missed by all compilers, it does not take the larger execution context of a benchmark function into account, etc. — we hope that it will become a useful part of compiler developers’ arsenal against performance problems.

These are very early results, selected from only about 1400 benchmark functions. As we remove limitations of the tool, it should become possible to harvest and test hundreds of thousands of benchmark functions. At that point, it may become useful to incorporate profile data into our metric for interesting functions, in order to direct compiler developers towards missed optimizations that are causing the most real slowdown.

Wanted: Invariant-Based Synchronization

Although a significant fraction of the programming languages community works on detecting race conditions in multi-threaded software, I haven’t been able to get very excited about this. Certainly race-free programs have some nice properties, but race freedom is neither a necessary nor a sufficient condition for concurrency correctness. This research area doesn’t feel fundamental to me, but rather like an evolutionary stage that has already worn out its welcome.

What I’d like to see — both as a developer and as a researcher — is a much more concrete link between concurrency control and program correctness. For example, in invariant-based synchronization, we specify the invariants that must hold over concurrent data structures and then use partially automated methods to derive locking solutions that guarantee that the invariants are maintained. Coarse-grained solutions are easy to find, and of course fine-grained solutions require more work. The scarcity of papers on this kind of idea has surprised me for a while now. I predict that ideas similar to invariant-based synchronization will become a popular area of research when people get over races and transactions, probably in the 2015-2020 time frame.
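
To make the idea concrete, here is its flavor in a toy example I just made up. The invariant is declared by the programmer; the coarse-grained lock and the guard inside record_get are the kind of thing the partially automated method would derive:

#include <pthread.h>
#include <assert.h>

/* Declared invariant over the shared structure: puts >= gets.
   (Initialize lock with PTHREAD_MUTEX_INITIALIZER.) */
struct channel_stats {
  pthread_mutex_t lock;
  long puts;
  long gets;
};

void record_put (struct channel_stats *s) {
  pthread_mutex_lock (&s->lock);
  s->puts++;
  assert (s->puts >= s->gets);   /* invariant holds when the lock is released */
  pthread_mutex_unlock (&s->lock);
}

void record_get (struct channel_stats *s) {
  pthread_mutex_lock (&s->lock);
  if (s->puts > s->gets)         /* guard chosen so the update preserves the invariant */
    s->gets++;
  assert (s->puts >= s->gets);
  pthread_mutex_unlock (&s->lock);
}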