
Comparing Compiler Optimizations

[Update from Dec 14: Some of these problems have already been fixed! Details here.]

[This is a guest post by my grad student Yang Chen, describing one of the projects he's been working on lately. I elaborated on some of Yang's points, edited, and did some formatting for WordPress.]

Our goal is to help C compiler developers find areas where their optimizer could stand to be improved. We do this by taking some open source applications and, for each application:

  1. Select a number of benchmark functions. For now, all benchmarks are leaves: they make no function calls. There are a few other restrictions too.
  2. Instrument the source code so that all inputs to benchmark functions (including those accessed through pointers) are logged.
  3. Run whatever test suite comes with the application, logging inputs to benchmark functions. Logging stops after a configurable maximum number of distinct inputs has been seen.
  4. Extract standalone versions of the benchmark functions.
  5. For each compiler under test, compile each extracted benchmark with a test harness. Run the resulting code on an otherwise quiescent machine in order to measure the performance of the benchmark function in isolation.
  6. Look for outlier functions where one compiler created significantly faster code than another compiler — these are potentially interesting and we inspect them manually. Some examples are below.

When we report the number of cycles required to execute a function, this is an average over all logged inputs.

The test machine is built around an Intel Core i7-920 processor, which has 4 cores running at 2.67GHz and 8MB of shared L3 cache. It has 6 GB of RAM. It runs the x86-64 version of Ubuntu 9.10. Performance is measured using the RDTSC instruction. The test process is pinned to a single processor and all CPU and OS power management functionality is turned off.

The compiler options we used are intended to:

  1. Optimize for maximum speed
  2. Give the tools a fighting chance at making equivalent assumptions about the hardware platform

Clang

Svn head built on Dec 4, 2010:

$ clang -v
clang version 2.9 (trunk 120838)
Target: x86_64-unknown-linux-gnu
Thread model: posix

Invocation:

clang -O3 -mssse3 -march=core2 -mtune=core2 -fomit-frame-pointer -fno-stack-protector -w

GCC

Svn head built on Dec 4, 2010:

$ current-gcc -v
Using built-in specs.
COLLECT_GCC=./compilers/bin/current-gcc
COLLECT_LTO_WRAPPER=/uusoc/exports/scratch/chenyang/performance_tests/compilers/libexec/gcc/x86_64-unknown-linux-gnu/4.6.0/lto-wrapper
Target: x86_64-unknown-linux-gnu
Configured with: ../configure --prefix=/uusoc/exports/scratch/chenyang/performance_tests/compilers --program-prefix=current- --enable-lto --enable-languages=c,c++,lto
Thread model: posix
gcc version 4.6.0 20101204 (experimental) (GCC)

Invocation:

current-gcc -O3 -mfpmath=sse -mssse3 -march=core2 -mtune=core2 -fno-stack-protector -fomit-frame-pointer -w

Intel

Version is:

$ icc -V
Intel(R) C Intel(R) 64 Compiler Professional for applications running on Intel(R) 64, Version 11.1 Build 20100414
Copyright (C) 1985-2010 Intel Corporation. All rights reserved.
FOR NON-COMMERCIAL USE ONLY

Invocation:

icc -O3 -static -no-prec-div -xHost -mssse3 -fno-stack-protector -fomit-frame-pointer -w

Example 1

(Here and in other examples, we have edited the assembly code for clarity. Also note that we are showing the source code as seen by the compilers being tested. It has not only been preprocessed, but also parsed by our benchmark extraction tool, which is based on CIL. CIL does a few simple transformations that sometimes make C code a lot uglier.)

This function is setboundaries() from zic.c in postgresql-9.0.0.

typedef long int64;
typedef int64 zic_t;
void setboundaries (void);
zic_t max_time;
zic_t min_time;
void
setboundaries (void)
{
  int i;

  min_time = -1L;
  i = 0;
  while (i < 63)
    {
      min_time *= 2L;
      i++;
    }
  max_time = -(min_time + 1L);
  return;
}

GCC and Intel CC generate code that executes in 124 cycles and 160 cycles, respectively. Clang’s output, on the other hand, requires only 4 cycles because Clang can statically evaluate the loop. GCC and ICC can statically evaluate many loops, but not this one.

GCC output:

setboundaries:
        movl    $63, %eax
        movq    $-1, %rdx
.L2:    addq    %rdx, %rdx
        subl    $1, %eax
        jne     .L2
        movq    %rdx, min_time(%rip)
        notq    %rdx
        movq    %rdx, max_time(%rip)
        ret

Clang output:

setboundaries:
        movabsq $-9223372036854775808, %rax # imm = 0x8000000000000000
        movq    %rax, min_time(%rip)
        movabsq $9223372036854775807, %rax # imm = 0x7FFFFFFFFFFFFFFF
        movq    %rax, max_time(%rip)
        ret

Example 2

This function is mad_synth_mute() from synth.c in mad-0.14.2b from the MiBench embedded benchmark suite.

typedef int mad_fixed_t;
struct mad_pcm
{
  unsigned int samplerate;
  unsigned short channels;
  unsigned short length;
  mad_fixed_t samples[2][1152];
};
struct mad_synth
{
  mad_fixed_t filter[2][2][2][16][8];
  unsigned int phase;
  struct mad_pcm pcm;
};
void mad_synth_mute (struct mad_synth *synth);
void
mad_synth_mute (struct mad_synth *synth)
{
  unsigned int ch;
  unsigned int s;
  unsigned int v;

  ch = 0U;
  while (ch < 2U)
    {
      s = 0U;
      while (s < 16U)
        {
          v = 0U;
          while (v < 8U)
            {
              synth->filter[ch][1][1][s][v] = 0;
              synth->filter[ch][1][0][s][v] = 0;
              synth->filter[ch][0][1][s][v] = 0;
              synth->filter[ch][0][0][s][v] = 0;
              v++;
            }
          s++;
        }
      ch++;
    }
  return;
}

The questions here are: can the compiler recognize that the loops are equivalent to a bzero()? And, if so, how fast a bzero() can it generate?

GCC’s code executes in 256 cycles because the loop is completely unrolled and SSE instructions are used. Intel CC’s code uses SSE but does less unrolling, and takes 308 cycles. Clang produces an obvious translation of the source code; locality is poor and 1032 cycles are required.

GCC’s code:

mad_synth_mute:
        pxor    %xmm0, %xmm0
        movdqu  %xmm0, (%rdi)
        movdqu  %xmm0, 16(%rdi)
        movdqu  %xmm0, 32(%rdi)
        movdqu  %xmm0, 48(%rdi)
        movdqu  %xmm0, 64(%rdi)
        ... (obvious sequence of movdqu instructions elided) ....
        movdqu  %xmm0, 3936(%rdi)
        movdqu  %xmm0, 3952(%rdi)
        movdqu  %xmm0, 3968(%rdi)
        movdqu  %xmm0, 3984(%rdi)
        movdqu  %xmm0, 4064(%rdi)
        movdqu  %xmm0, 4080(%rdi)
        ret

Example 3

This is an initialization function taken from pyexpat.c in Python 2.7.

char template_buffer[257];
void
init_template_buffer (void)
{
  int i;

  i = 0;
  while (i < 256)
    {
      template_buffer[i] = (char) i;
      i++;
    }
  template_buffer[256] = (char) 0;
  return;
}

Here Clang generates the obvious looping code requiring 532 cycles: 2 cycles per loop iteration plus some overhead. GCC generates very good SSE code requiring 86 cycles. Intel CC generates code that executes in 18 cycles:

init_template_buffer:
        movl      $269488144, %eax                              #6.3
        movd      %eax, %xmm0                                   #6.3
        pshufd    $0, %xmm0, %xmm1                              #6.3
        movdqa    _2il0floatpacket.0(%rip), %xmm0               #6.3
        xorl      %eax, %eax                                    #6.3
..B1.2:                         # Preds ..B1.2 ..B1.1
        movdqa    %xmm0, template_buffer(%rax)                  #7.5
        paddb     %xmm1, %xmm0                                  #7.5
        movdqa    %xmm0, 16+template_buffer(%rax)               #7.5
        paddb     %xmm1, %xmm0                                  #7.5
        movdqa    %xmm0, 32+template_buffer(%rax)               #7.5
        paddb     %xmm1, %xmm0                                  #7.5
        movdqa    %xmm0, 48+template_buffer(%rax)               #7.5
        paddb     %xmm1, %xmm0                                  #7.5
        movdqa    %xmm0, 64+template_buffer(%rax)               #7.5
        paddb     %xmm1, %xmm0                                  #7.5
        movdqa    %xmm0, 80+template_buffer(%rax)               #7.5
        paddb     %xmm1, %xmm0                                  #7.5
        movdqa    %xmm0, 96+template_buffer(%rax)               #7.5
        paddb     %xmm1, %xmm0                                  #7.5
        movdqa    %xmm0, 112+template_buffer(%rax)              #7.5
        paddb     %xmm1, %xmm0                                  #7.5
        addq      $128, %rax                                    #6.3
        cmpq      $256, %rax                                    #6.3
        jb        ..B1.2        # Prob 99%                      #6.3
        movb      $0, 256+template_buffer(%rip)                 #10.3
        ret                                                     #11.3

Here ICC has performed an impressive 16-way vectorization and 16-way unrolling; the result takes a bit of work to understand.

The 269488144 loaded into eax is 0x10101010. The pshufd instruction (shuffle packed doublewords) loads the value 0x10101010 10101010 10101010 10101010 into xmm1. The next instruction loads xmm0 with 0x0f0e0d0c 0b0a0908 07060504 03020100.

Now the purpose of the first movdqa in the loop is clear: it loads the values {0,1,2, …, 15} into the first 16 bytes of template_buffer. The paddb (packed add, byte-wise) adds 0x10 to each byte of xmm0, leaving its contents 0x1f1e1d1c 1b1a1918 17161514 13121110. These are moved into the next 16 bytes of template_buffer, and so on. The loop body executes only twice before the entire array is initialized.

Although this translation seems very fragile, icc can generate equivalent object code for most versions of this code where N, X, and Y are constants:

for (i=0; i<N; i++) a[i] = i*X + Y;


Example 4

This function is tf_bH() from _ctypes_test.c in Python 2.7.

unsigned long long last_tf_arg_u;
unsigned short
tf_bH (signed char x, unsigned short c)
{
  last_tf_arg_u = (unsigned long long) c;
  return ((unsigned short) ((int) c / 3));
}

Here GCC and Intel CC both generate code that executes in 6 cycles, whereas Clang’s takes 18.

GCC’s output:

tf_bH:
	movzwl	%si, %eax
	movq	%rax, last_tf_arg_u(%rip)
	movzwl	%si, %eax
	imull	$43691, %eax, %eax
	shrl	$17, %eax
	ret

Clang’s:

tf_bH:                                  # @tf_bH
	movq	%rsi, last_tf_arg_u(%rip)
	movw	%si, %ax
	movw	$-21845, %cx            # imm = 0xFFFFFFFFFFFFAAAB
	mulw	%cx
	andl	$65534, %edx            # imm = 0xFFFE
	movl	%edx, %eax
	shrl	%eax
	ret

At first glance, both compilers have done the right thing: turned the divide-by-constant into a multiply. The difference in execution times is explained by the Intel Optimization Reference Manual, Assembly/Compiler Coding Rule 21:

Favor generating code using imm8 or imm32 values instead of imm16 values. If imm16 is needed, load equivalent imm32 into a register and use the word value in the register instead.

We verified that this is the problem by changing the imm16 value to imm32 by hand and re-running the performance test.

Example 5

This function is get_domain() from tree-call-cdce.c in GCC 4.5.0.

struct input_domain
{
  int lb;
  int ub;
  unsigned char has_lb;
  unsigned char has_ub;
  unsigned char is_lb_inclusive;
  unsigned char is_ub_inclusive;
};
typedef struct input_domain inp_domain;
inp_domain
get_domain (int lb, unsigned char has_lb, unsigned char lb_inclusive,
            int ub, unsigned char has_ub, unsigned char ub_inclusive)
{
  inp_domain domain;

  domain.lb = lb;
  domain.has_lb = has_lb;
  domain.is_lb_inclusive = lb_inclusive;
  domain.ub = ub;
  domain.has_ub = has_ub;
  domain.is_ub_inclusive = ub_inclusive;
  return (domain);
}

This function simply loads its arguments into a struct. GCC pushes the values through memory, taking 35 cycles. Intel CC generates code that uses memory 6 times (as opposed to 8 for GCC), requiring 24 cycles. Clang assembles the struct using registers, requiring only 12 cycles.

GCC’s code:

get_domain:
	movl	%edi, -24(%rsp)
	movl	%ecx, -20(%rsp)
	movb	%sil, -16(%rsp)
	movq	-24(%rsp), %rax
	movb	%r8b, -15(%rsp)
	movb	%dl, -14(%rsp)
	movb	%r9b, -13(%rsp)
	movl	-16(%rsp), %edx
	ret

Clang’s:

get_domain:
	movl	%edi, %eax
	shlq	$32, %rcx
	leaq	(%rcx,%rax), %rax
	shll	$16, %edx
	orl	%esi, %edx
	shll	$8, %r8d
	orl	%edx, %r8d
	shll	$24, %r9d
	movl	%r9d, %edx
	orl	%r8d, %edx
	ret

Example 6

This function is php_filter_parse_int() from logical_filters.c in PHP 5.3.3.

  1 int
  2 php_filter_parse_int (char const *str, unsigned int str_len, long *ret)
  3 {
  4   long ctx_value;
  5   int sign;
  6   int digit;
  7   char const *end;
  8   int tmp;
  9   char const *tmp___0;
 10   char const *tmp___1;
 11
 12   sign = 0;
 13   digit = 0;
 14   end = str + str_len;
 15   switch ((int) *str)
 16     {
 17     case 45:
 18       sign = 1;
 19     case 43:
 20       str++;
 21     default:;
 22       break;
 23     }
 24   if ((unsigned long) str < (unsigned long) end)
 25     {
 26       if ((int const) *str >= 49)
 27 	{
 28 	  if ((int const) *str <= 57)
 29 	    {
 30 	      if (sign)
 31 		{
 32 		  tmp = -1;
 33 		}
 34 	      else
 35 		{
 36 		  tmp = 1;
 37 		}
 38 	      tmp___0 = str;
 39 	      str++;
 40 	      ctx_value = (long) (tmp * (int) ((int const) *tmp___0 - 48));
 41 	    }
 42 	  else
 43 	    {
 44 	      return (-1);
 45 	    }
 46 	}
 47       else
 48 	{
 49 	  return (-1);
 50 	}
 51     }
 52   else
 53     {
 54       return (-1);
 55     }
 56   if (end - str > 19)
 57     {
 58       return (-1);
 59     }
 60   while ((unsigned long) str < (unsigned long) end)
 61     {
 62       if ((int const) *str >= 48)
 63 	{
 64 	  if ((int const) *str <= 57)
 65 	    {
 66 	      tmp___1 = str;
 67 	      str++;
 68 	      digit = (int) ((int const) *tmp___1 - 48);
 69 	      if (!sign)
 70 		{
 71 		  if (ctx_value <=
 72 		      (9223372036854775807L - (long) digit) / 10L)
 73 		    {
 74 		      ctx_value = ctx_value * 10L + (long) digit;
 75 		    }
 76 		  else
 77 		    {
 78 		      goto _L;
 79 		    }
 80 		}
 81 	      else
 82 		{
 83 		_L:
 84 		  if (sign)
 85 		    {
 86 		      if (ctx_value >=
 87 			  ((-0x7FFFFFFFFFFFFFFF - 1) + (long) digit) / 10L)
 88 			{
 89 			  ctx_value = ctx_value * 10L - (long) digit;
 90 			}
 91 		      else
 92 			{
 93 			  return (-1);
 94 			}
 95 		    }
 96 		  else
 97 		    {
 98 		      return (-1);
 99 		    }
100 		}
101 	    }
102 	  else
103 	    {
104 	      return (-1);
105 	    }
106 	}
107       else
108 	{
109 	  return (-1);
110 	}
111     }
112   *ret = ctx_value;
113   return (1);
114 }

The assembly is long and not that interesting. The upshot is that Clang and Intel CC are able to turn the divisions by 10 at lines 72 and 87 into integer multiplies, whereas GCC leaves division instructions in the generated code. It is not clear why this happens; as we saw above, other divisions-by-constant are optimized successfully by this compiler. The result is that Clang’s and ICC’s code executes in 41 cycles whereas GCC’s takes 133.

Example 7

This function is crud() from ident.c in git-1.7.3.2.

int crud (unsigned char c)
{
  return (((((((((((int) c <= 32 || (int) c == 46) || (int) c == 44)
                 || (int) c == 58) || (int) c == 59) || (int) c == 60)
              || (int) c == 62) || (int) c == 34) || (int) c == 92)
           || (int) c == 39) != 0);
}

GCC and Clang generate the obvious cascaded comparisons and their code executes in 21 and 24 cycles, respectively. Intel CC is far more clever and its code requires only 8 cycles:

crud:
        movzbl    %dil, %edi
        cmpl      $32, %edi
        jle       ..B1.4
        addl      $-34, %edi
        cmpl      $64, %edi
        jae       ..B1.5
        movl      $1, %eax
        movl      %edi, %ecx
        shlq      %cl, %rax
        movq      $0x400000017001421, %rdx
        testq     %rdx, %rax
        je        ..B1.5
..B1.4:
        movl      $1, %eax
        ret
..B1.5:
        xorl      %eax, %eax
        ret

Since the characters being tested for (beyond the initial c <= 32 check) span a range of fewer than 64 values, ICC is able to precompute a 64-bit bitmap with 1s in positions corresponding to the desired characters.

Conclusions

Our idea is that using a small collection of compilers and a large collection of realistic benchmark functions, we can get the compilers to reveal performance limitations in each other. Once the basic infrastructure is in place, new results can be generated with almost no manual effort. Although there are obvious limitations of this approach — it cannot spot an optimization missed by all compilers, it does not take the larger execution context of a benchmark function into account, etc. — we hope that it will become a useful part of compiler developers’ arsenal against performance problems.

These are very early results, selected from only about 1400 benchmark functions. As we remove limitations of the tool, it should become possible to harvest and test hundreds of thousands of benchmark functions. At that point, it may become useful to incorporate profile data into our metric for interesting functions, in order to direct compiler developers towards missed optimizations that are causing the most real slowdown.

{ 16 } Comments

  1. kl | December 11, 2010 at 7:22 pm | Permalink

    It would be cool if one could assemble binary consisting of fastest versions of functions from all compilers.

  2. Kevin Bowling | December 12, 2010 at 11:58 am | Permalink

    Thanks, it’s good to see how far Clang has progressed!

  3. Fuse | December 12, 2010 at 12:43 pm | Permalink

    Clang has -O4.

  4. regehr | December 12, 2010 at 1:14 pm | Permalink

    Hi Fuse- clang’s -O4 turns on link-time optimization and is irrelevant here.

  5. damg | December 12, 2010 at 6:05 pm | Permalink

    I wonder if GCC’s -Ofast (new in 4.6) would make a difference with these tests.

  6. regehr | December 12, 2010 at 9:25 pm | Permalink

    kl- I had the same thought. It doesn’t seem very hard, and could even be valuable within a single compiler since neither -O1 nor -O2 uniformly wins, for example. It would also be possible to assemble the smallest binary, or the binary minimizing some weighted combination of size and execution time.

    damg- I think -Ofast is just -O3 plus -ffast-math, so probably it would not have made any difference for these examples (I think fast-math only affects FP code and these are all integer codes).

  7. Chris Lattner | December 13, 2010 at 1:50 am | Permalink

    FWIW, these two are now implemented in mainline LLVM:

    #4: We now generate slightly better code than GCC (two instructions shorter):

    _tf_bH:
    movq _last_tf_arg_u(%rip), %rax
    movq %rsi, (%rax)
    imull $43691, %esi, %eax ## imm = 0xAAAB
    shrl $17, %eax
    ret

    #7: We now generate apparently better code than ICC:

    _crud:
    cmpb $33, %dil
    jb LBB0_4
    addb $-34, %dil
    cmpb $58, %dil
    ja LBB0_3
    movzbl %dil, %eax
    movabsq $288230376537592865, %rcx
    btq %rax, %rcx
    jb LBB0_4
    LBB0_3:
    xorl %eax, %eax
    ret
    LBB0_4:
    movl $1, %eax
    ret

    -Chris

  8. regehr | December 13, 2010 at 3:16 pm | Permalink

    Chris- Fast work! We’ll post an update soon.

  9. cesium62 | December 14, 2010 at 2:53 pm | Permalink

    for the first piece of code, it would be cool if the compiler noticed that it could use a NOT instruction to generate a constant from a value in an existing register instead of storing the 8-byte constant in the instruction stream.

  10. cesium62 | December 14, 2010 at 3:22 pm | Permalink

    In the first two examples, the programmer used poor code that the compiler reverse-engineered into a simpler approach. It would be nice if the compilers could act as pair-programmers in these cases and suggest to the human the simpler approach.

  11. regehr | December 14, 2010 at 3:27 pm | Permalink

    Hi cesium62- This would be definitely nice, but I think in practice it’s difficult to do this well since poor code often happens by accident due to inlining, macros, etc.

    Also, once the compiler is strong enough to recognize an idiom, the motivation to report it is gone – changing the code would only benefit competing compilers that cannot yet recognize the purpose of the code.

  12. Nuno Lopes | December 15, 2010 at 12:29 pm | Permalink

    Awsome post as usual, John!
    Thank you for these delighting posts! Keep them coming :)

  13. Ben L. Titzer | December 15, 2010 at 1:42 pm | Permalink

    This post pretty much convinced me to wait before writing my own backend for Virgil…I can’t spend the next dozen years catching up on all these crazy idioms and vectorization.

  14. regehr | December 15, 2010 at 6:09 pm | Permalink

    Ben- I agree: in the future there will be only a handful of seriously good backends for each architecture. What about just compiling Virgil to LLVM? You could have fun with all the high-level optimizations and hopefully LLVM would take it from there.

  15. Ben L. Titzer | December 15, 2010 at 8:56 pm | Permalink

    Yup, already working on it ;-)


{ 1 } Trackback

  1. [...] Embedded in Academia : Comparing Compiler Optimizations. [...]