Does Portable Byte-Order Code Optimize?


When reading data whose byte-ordering may differ from the host computer’s, Rob Pike advocates writing portable code as opposed to using #ifdefs to optimize the case where the encoded data’s byte ordering matches the host. His arguments seem reasonable and it’s definitely a win if the compiler can transparently recognize the useless data extraction operations and optimize them away. Rob’s code is basically this:

int read_little_endian_int (unsigned char *data)
{
  int i = 
    (data[0]<<0) | (data[1]<<8) | 
    (data[2]<<16) | (data[3]<<24);
  return i;
}

On a little-endian host, this code can be simplified to just a 32-bit load. But do compilers actually perform this optimization? It turns out not. Clang (built today from trunk) gives this:

read_little_endian_int:
 movzbl (%rdi), %ecx
 movzbl 1(%rdi), %eax
 shll $8, %eax
 orl %ecx, %eax
 movzbl 2(%rdi), %ecx
 shll $16, %ecx
 orl %eax, %ecx
 movzbl 3(%rdi), %eax
 shll $24, %eax
 orl %ecx, %eax
 ret

GCC (built a few days ago) and Intel’s compiler (12.0.5) produce very similar code.

Does the sub-optimality of this code matter? Probably not if we’re reading from a slow device, but it does matter if we’re reading from RAM. My Core i7 920 (running in 64-bit mode) can copy a large block of data using 32-bit operations at about 7 GB/s, but reaches only about 2 GB/s when running the data through the code emitted by Clang. The results for GCC and Intel CC are similar.

But now the plot thickens slightly. If we wish to check that the function above is implemented and compiled correctly, we might write a program with this main function:

int main (void)
{
  int i = INT_MIN;
  while (1) {
    int i2 = read_little_endian_int ((unsigned char *)&i);
    if (i != i2) {
      printf ("oops %d != %d\n", i, i2);
    }
    if (i == INT_MAX) break;
    i++;
  }
  return 0;
}

GCC and Intel CC compile this in the obvious way, but Clang (at -O3) turns this main function into a nop:

 main:
 xorl %eax, %eax
 ret

Now why would Clang understand that read_little_endian_int is a nop when it is used in this context, but fail to emit the obvious code for the function itself? I’m not sure what to make of this. Maybe someone familiar with LLVM’s optimizers can help out here. The full code is here if anyone wants to play with it.

,

12 responses to “Does Portable Byte-Order Code Optimize?”

  1. In general, optimizing the original code to a single 32-bit load on little endian targets is not safe: the pointer may not be aligned. However, even after marking the pointer aligned, clang still misses the optimization.

    -Chris

  2. Also, to answer the question: clang can successfully forward loads and stores when the accessed object is known. For example, it turns this into a simple copy. This is why it gets your more complicated example:

    static int read_little_endian_int (unsigned char *data) {
    int i =
    (data[0]<< 0) | (data[1]<< 8) |
    (data[2]<<16) | (data[3]<<24);
    return i;
    }

    int test(int X) {
    return read_little_endian_int(&X);
    }

  3. Finally, Clang also optimizes the opposite: turning this code into a “bswapl” on x86:

    static int read_big_endian_int (unsigned char *data)
    {
    int i =
    (data[0]<< 24) | (data[1]<< 16) |
    (data[2]<< 8 ) | (data[3]<<0);
    return i;
    }

    int test(int X) {
    return read_big_endian_int(&X);
    }

    The biggest problem with relying on this sort of thing is that deep dependence on optimizer details makes your code less portable. This can still make sense if you a) don't care about portability, or b) don't care about performance.

    However, you can also just use ntohl, __builtin_bswap, and related functions depending on the domain.

    -Chris

  4. I think Rob’s argument for using the portable form is that more a argument for what semantics to think with rather than an argument for a given implementation.

    If that’s valid, might it be worth adding the functions as compiler intrinsics or as part of the standard lib? The people who write the compiler should know what is and isn’t safe and should be able to write a function that works fast and correctly under all cases.

  5. Chris:

    I thought I was taking crazy pills until I saw your post. I’m astonished that Rob Pike doesn’t know about ntohl/htonl etc. Those and related functions/macros were the very first thought to run through my head.

    -Ryan

  6. I think something more important is missing in Rob’s article: you should not leave any kind of raw low level loader code right in the middle of the interesting parts. So instead of copy pasting (probably introducing some bugs in the process) (a[0] << 24) | (a[1] << 16) | (a[2] << 8) | (a[3] << 0) everywhere — that is better than using #ifdef everywhere — you just write something like cpu_from_be32_unaligned(a); then there is no problem to use a single #ifdef in the implementation!

    But copy pasting (a[0] << 24) | (a[1] << 16) | (a[2] << 8) | (a[3] < there will not be less bugs than if bswap on a word was used (if allowed for any alignment).

    So what is really important is to only store in data structure being worked on (not mere storage) in native ordering, except maybe on very very special condition (and only if they are very well documented). In all cases, you have to correctly identify where byte ordering is native and where it is constrained; and this really is the *only* thing that matters: once you have that and nice accessing primitives there really is no reason not to correctly optimize them if necessary, maybe using bswap on some hosts and not on others.

    What really caused Adobe problems is that they insufficiently defined their file format in the first place and probably did not use any data serialization at all, then latter tried to reuse the same highly buggy code on an architecture with a different byte swaping that exposed the problem. THEN maybe they tried to solve the problem in a moronic way. There is no way one can be sure they put some #if BIG_ENDIAN everywhere; in the first place they probably had none, and that is probably the source root of the bugs.

    The remark in this article that Portable Byte-Order Code [does not] Optimize is also very important for e.g. stuff written to device registers. You might need to read/write 32 bits in only one access: in this case you will probably store an X-Endian word in memory then load it back byte by byte; depending on your processor that might stall for extended amounts of time (device register access will probably be slow in the first place, but why should add an additionnal BIG penalty even if byte swaping was not even needed in the first place?)

    Real world successful operating systems are not only portable (well, some of them…). They are also fast. And when possible they are made of non bloated binaries. |(a[] << x) constructs might be right in some case, but that does not mean bswap should be forbidden. It clearly has its place, and there is no silver bullet.

  7. I might be missing something, but if all you’re doing is copying data from A to B, why on earth would you care about its endianness and not just use memcpy()? Surely Pike’s point is that if you actually care about the endianness of your data, you’re probably doing something with it which is complicated enough to overwhelm the overhead of reading it a byte at a time?

  8. gwenhwyfaer, my post here is motivated by being a compiler nerd, not by any strong belief that this optimization has to happen.

  9. So far, every comment made here has completely missed the point.

    The solution to this is to never have to write such code to begun with. The endianness of a word is an intrinsic attribute of the word itself. Just as one should provide short vs long attributes where out matters, one should provide big vs little attributes as well. It really is that simple.

    So, knowing that this is a perennial problem, WTH hasn’t the compiler industry responded with a real solution?