Benchmarking Short Codes on Modern Processors

Around 15 years ago, as a newish graduate student, I got access to a Pentium-based Linux machine. One of the coolest things about this machine was the new RDTSC instruction that measured the number of clock ticks since the processor had been reset. This could be used to directly observe cache misses, branch mispredictions, and other low-level performance phenomena; it was a little like looking through a microscope for the first time. I ended up using the timestamp counter for a lot of things during the next few years, including in my Hourglass tool.

Lately my student Yang Chen has been using RDTSC to measure the execution times of function calls that take on the order of 10 to 1000 cycles to execute. Unfortunately, this activity is no longer nearly as straightforward as it once was. This post lists some things we learned. It’s somewhat specific to GCC and Linux.

Enabling Basic Processor-Level Timing Stability

In the BIOS, turn off hyperthreading and anything having to do with turbo mode and frequency scaling. Turn off all OS-level frequency scaling as well; this is of course specific to the OS. These instructions worked fine for us.

Avoiding Scheduling and Multicore Problems

The scheduler can mess up benchmarking in two main ways. First, it can simply take the processor away from the code being benchmarked. Second, it can migrate code to a different processor, slowing it down and also causing problems because the timestamp counters are typically not closely synchronized across CPUs. There seem to be three approaches to dealing with context switches messing up timing:

  1. Ignore them. This is reasonable when timing events that are very short because context switches aren’t that fast; they’ll show up as massive outliers that are easy to discard.
  2. Minimize their impact by raising the priority of running code using sched_setscheduler() and by pinning code to a single processor using the taskset command. Both of these are easy in Linux, although priority elevation requires root privileges.
  3. Lock out the scheduler by disabling interrupts. This can be done from user-mode using the iopl(3) call, or can it be done by running code in a kernel module. Either way, root privileges are required. Incautious disabling of interrupts is a good way to necessitate hard reboots of a machine.

Dealing with Out-of-Order Execution

The RDTSC instruction is not serializing: it can be reordered significantly with respect to other in-flight instructions, causing serious (apparent) timing anomalies. Recently this fantastic white paper addressing this issue was released by Intel. It would have saved Yang and me a ton of time if it had existed six months earlier. The short answer is that if you’re running a newish processor in 64-bit mode, just use these functions to start and stop the timer:

typedef unsigned long long ticks;

static __inline__ ticks start (void) {
  unsigned cycles_low, cycles_high;
  asm volatile ("CPUID\n\t"
		"RDTSC\n\t"
		"mov %%edx, %0\n\t"
		"mov %%eax, %1\n\t": "=r" (cycles_high), "=r" (cycles_low)::
		"%rax", "%rbx", "%rcx", "%rdx");
  return ((ticks)cycles_high << 32) | cycles_low;
}

static __inline__ ticks stop (void) {
  unsigned cycles_low, cycles_high;
  asm volatile("RDTSCP\n\t"
	       "mov %%edx, %0\n\t"
	       "mov %%eax, %1\n\t"
	       "CPUID\n\t": "=r" (cycles_high), "=r" (cycles_low):: "%rax",
	       "%rbx", "%rcx", "%rdx");
  return ((ticks)cycles_high << 32) | cycles_low;
}

Why use RDTSC to start the timer and RDTSCP to stop it? The white paper explains this and contains a lot of other good details.

Other Precautions

  • Code and data should be cache-line aligned when possible.
  • The machine under test should be quiescent. It could also be booted in single-user mode, but we don’t bother.
  • Turning off ASLR seems like a good idea, though we have not done the measurements to see how much it affects repeatability.
  • It used to be the case that paging would affect timing results, so we would use the mlockall() call to pin a process’s pages into RAM. Machines have so much memory these days that this hardly seems necessary.
  • It’s possible to boot a multicore with one or more processors disabled; this will reduce or eliminate TLB shootdowns, other cache invalidation traffic, and memory system contention. Again, we don’t bother.

Analyzing Data

The last important piece of the puzzle is to use robust statistical methods. Never take the average of a series of timing values: their distribution is almost always skewed by massive outliers. The minimum value may or may not be safe to use — there could be outliers on the low end due to migrations between unsynchronized processors. A good starting point would be to look at data points at the 25th, 50th, and 75th percentiles; if these are not very close together, something is probably amiss.

Any time you get timing data from a real processor, it must be scrutinized carefully to see if it makes sense. A good understanding of modern processor architectures really helps. We’ve seen odd things happen such as a machine using frequency scaling when we thought it had been permanently disabled. Either a kernel upgrade or similar disturbed our configuration or we simply messed up.

Summary

Modern computers create a fairly hostile environment for accurate benchmarking of short-duration events. To do it right, you want to use a dedicated machine, carefully navigate the issues I’ve outlined (and whatever others I’ve missed), and develop a benchmarking procedure that includes plenty of sanity checks.

Automatic Advising

A comment on my previous post Interviewing PhD Students suggested that advising students can and should be automated. Although at first I didn’t take it seriously, this idea started to grow on me once I realized that a person could probably be an above-average advisor by simply doing nothing that is actively harmful and also saying the following things at vaguely appropriate times:

  • Sure, I’d be happy to sign that piece of paper without looking at it closely.
  • Is there a thesis statement here?
  • How’s the writing going?
  • How are the experiments going?
  • I think you’d better cite a few more papers in chapter 7.
  • What research question is this addressing?
  • There’s no such thing as writers’ block, but there is such a thing as people who don’t want to write very badly.
  • You are not allowed to say “the stupid reviewer is wrong about X” but rather “I failed to explain X clearly enough for the reviewer.”
  • Do you really need to implement that, or do you just want to?
  • What’s the null hypothesis?
  • Why does this graph: contain no axis labels / connect unrelated points with lines / start the y-axis somewhere besides zero / use 3D / confuse me so much?
  • What have you learned that you didn’t already know?
  • What have you learned that I didn’t already know?
  • What have you learned that nobody knew?
  • Whose life have you made better so far?

Now if only paper-writing, proposal-writing, and teaching could be automated so easily.

A Quick Update to Comparing Compiler Optimizations

Saturday’s post on Comparing Compiler Optimizations featured some work done by my student Yang Chen. Since then:

  • There has been some discussion of these problems on the GCC mailing list; some of the problems are already in the Bugzilla and a new PR was filed. A patch fixing the problem is already available!
  • On Sunday night Chris Lattner left a comment saying that two of the Clang performance problems have been fixed (the other two are known problems and will take more time to fix).

Have I ever said how great open source is? Of course, there’s nothing stopping a product group from fixing a bug just as quickly — but the results would not be visible to the outside world until the next release. When improvements to a piece of software depend on a tight feedback loop between developers and outsiders, it’s pretty difficult to beat the open source model.

Yang re-tested Clang and here are the results.

Example 4

Previously, Clang’s output required 18 cycles to execute. The new output is both faster (4 cycles) and smaller than GCC’s or Intel CC’s:

tf_bH:
        movq    %rsi, last_tf_arg_u(%rip)
        imull   $43691, %esi, %eax      # imm = 0xAAAB
        shrl    $17, %eax
        ret

Nice!

Example 7

Previously, Intel CC’s 8-cycle code was winning against Clang’s 24 and GCC’s 21 cycles. Clang now generates this code:

crud:
	cmpb	$33, %dil
	jb	.LBB0_4
	addb	$-34, %dil
	cmpb	$58, %dil
	ja	.LBB0_3
	movzbl	%dil, %eax
	movabsq	$288230376537592865, %rcx
	btq	%rax, %rcx
	jb	.LBB0_4
.LBB0_3:
	xorl	%eax, %eax
	ret
.LBB0_4:
	movl	$1, %eax
	ret

Although this is several instructions shorter than Intel CC’s output, it still requires 10 cycles to execute on our test machine, as opposed to ICC’s 8 cycles. Perhaps this is in the noise, but Yang spent a while trying to figure it out anyway. It seems that if you replace the btq instruction with a testq (and adjust the subsequent branch appropriately), Clang’s output also executes in 8 cycles. I won’t pretend to understand this.

Example 6 (Updated Dec 17)

This has been fixed in GCC, see the discussion here. The failure to optimize was pretty interesting: some hacking in GCC’s profile code caused some breakage that lead it to infer that the division instructions in this function are “cold” and not worth optimizing to multiplies. This is a great example of how the lack of good low-level metrics makes it difficult to detect subtle regressions. Of course this would have been noticed eventually by someone who reads assembly, but there are probably other similar problems that are below the noise floor when we look only at results from high-level benchmarks.

Sometimes the Name Is the Contribution

Every now and then I read or re-read a famous, influential paper and realize (or at least suspect strongly) that it did not — at the time it was published — contain any new ideas. My guess is that a paper like this can become highly cited for one or more of the following reasons:

  • It popularized an existing idea that was not widely known.
  • It was well written, and packaged up existing knowledge in a convenient, pleasing way.
  • It provided a good name for a concept that previously lacked one.

This last issue, naming, is the one that fascinates me. Somehow, ideas are not quite real until they are given a label. (An unfortunate side effect is that many researchers have realized this, and assign names to trivial algorithms and silly results.) In the near future I’ll write a detailed post about a paper whose contribution is naming a known technique. In the meantime, I’d love to hear readers’ suggestions about papers like this.

Note that I am not mocking or belittling this kind of paper. On the contrary, the authors of these papers deserve every bit of recognition they get: the timely synthesis, packaging, and naming of existing ideas is a difficult art and most of us would love to be able to pull this off just once or twice in a career.

Comparing Compiler Optimizations

[Update from Dec 14: Some of these problems have already been fixed! Details here.]

[This is a guest post by my grad student Yang Chen, describing one of the projects he’s been working on lately. I elaborated on some of Yang’s points, edited, and did some formatting for WordPress.]

Our goal is to help C compiler developers find areas where their optimizer could stand to be improved. We do this by taking some open source applications and, for each application:

  1. Select a number of benchmark functions. For now, all benchmarks are leaves: they make no function calls. There are a few other restrictions too.
  2. Instrument the source code so that all inputs to benchmark functions (including those accessed through pointers) are logged.
  3. Run whatever test suite comes with the application, logging inputs to benchmark functions. Logging stops after a configurable maximum number of distinct inputs has been seen.
  4. Extract standalone versions of the benchmark functions.
  5. For each compiler under test, compile each extracted benchmark with a test harness. Run the resulting code on an otherwise quiescent machine in order to measure the performance of the benchmark function in isolation.
  6. Look for outlier functions where one compiler created significantly faster code than another compiler — these are potentially interesting and we inspect them manually. Some examples are below.

When we report the number of cycles required to execute a function, this is an average over all logged inputs.

The test machine is built around an Intel Core i7-920 processor, which has 4 cores running at 2.67GHz and 8MB of shared L3 cache. It has 6 GB of RAM. It runs the x86-64 version of Ubuntu 9.10. Performance is measured using the RDTSC instruction. The test process is pinned to a single processor and all CPU and OS power management functionality is turned off.

The compiler options we used are intended to:

  1. Optimize for maximum speed
  2. Give the tools a fighting chance at making equivalent assumptions about the hardware platform

Clang

Svn head built on Dec 4, 2010:

$ clang -v
clang version 2.9 (trunk 120838)
Target: x86_64-unknown-linux-gnu
Thread model: posix

Invocation:

clang -O3 -mssse3 -march=core2 -mtune=core2 -fomit-frame-pointer -fno-stack-protector -w

GCC

Svn head built on Dec 4, 2010:

$ current-gcc -v
Using built-in specs.
COLLECT_GCC=./compilers/bin/current-gcc
COLLECT_LTO_WRAPPER=/uusoc/exports/scratch/chenyang/performance_tests/compilers/libexec/gcc/x86_64-unknown-linux-gnu/4.6.0/lto-wrapper
Target: x86_64-unknown-linux-gnu
Configured with: ../configure –prefix=/uusoc/exports/scratch/chenyang/performance_tests/compilers –program-prefix=current- –enable-lto –enable-languages=c,c++,lto
Thread model: posix
gcc version 4.6.0 20101204 (experimental) (GCC)

Invocation:

current-gcc -O3 -mfpmath=sse -mssse3 -march=core2 -mtune=core2 -fno-stack-protector -fomit-frame-pointer -w

Intel

Version is:

$ icc -V
Intel(R) C Intel(R) 64 Compiler Professional for applications running on Intel(R) 64, Version 11.1 Build 20100414
Copyright (C) 1985-2010 Intel Corporation. All rights reserved.
FOR NON-COMMERCIAL USE ONLY

Invocation:

icc -O3 -static -not-prec-div -xHost -mssse3 -fno-stack-protector -fomit-frame-pointer -w

Example 1

(Here and in other examples, we have edited the assembly code for clarity. Also note that we are showing the source code as seen by the compilers being tested. It is not only preprocesssed, but also has been parsed by our benchmark extraction tool which is based on CIL. CIL does a few simple transformations that make C code a lot uglier sometimes.)

This function is setboundaries() from zic.c in postgresql-9.0.0.

 1 typedef long int64;
 2 typedef int64 zic_t;
 3 void setboundaries (void);
 4 zic_t max_time;
 5 zic_t min_time;
 6 void
 7 setboundaries (void)
 8 {
 9   int i;
10
11   min_time = -1L;
12   i = 0;
13   while (i < 63)
14     {
15       min_time *= 2L;
16       i++;
17     }
18   max_time = -(min_time + 1L);
19   return;
20 }

GCC and Intel CC generate code that executes in 124 cycles and 160 cycles, respectively. Clang’s output, on the other hand, requires only 4 cycles because it can statically evaluate the loop. GCC and ICC can statically evaluate many loops, but not this one.

GCC output:

setboundaries:
        movl    $63, %eax
        movq    $-1, %rdx
.L2:    addq    %rdx, %rdx
        subl    $1, %eax
        jne     .L2
        movq    %rdx, min_time(%rip)
        notq    %rdx
        movq    %rdx, max_time(%rip)
        ret

Clang output:

setboundaries:
        movabsq $-9223372036854775808, %rax # imm = 0x8000000000000000
        movq    %rax, min_time(%rip)
        movabsq $9223372036854775807, %rax # imm = 0x7FFFFFFFFFFFFFFF
        movq    %rax, max_time(%rip)
        ret

Example 2

This function is mad_synth_mute() from synth.c in mad-0.14.2b from the MiBench embedded benchmark suite.

 1 typedef int mad_fixed_t;
 2 struct mad_pcm
 3 {
 4   unsigned int samplerate;
 5   unsigned short channels;
 6   unsigned short length;
 7   mad_fixed_t samples[2][1152];
 8 };
 9 struct mad_synth
10 {
11   mad_fixed_t filter[2][2][2][16][8];
12   unsigned int phase;
13   struct mad_pcm pcm;
14 };
15 void mad_synth_mute (struct mad_synth *synth);
16 void
17 mad_synth_mute (struct mad_synth *synth)
18 {
19   unsigned int ch;
20   unsigned int s;
21   unsigned int v;
22
23   ch = 0U;
24   while (ch < 2U)
25     {
26       s = 0U;
27       while (s < 16U)
28 	{
29 	  v = 0U;
30 	  while (v < 8U)
31 	    {
32 	      synth->filter[ch][1][1][s][v] = 0;
33 	      synth->filter[ch][1][0][s][v] = 0;
34 	      synth->filter[ch][0][1][s][v] = 0;
35 	      synth->filter[ch][0][0][s][v] = 0;
36 	      v++;
37 	    }
38 	  s++;
39 	}
40       ch++;
41     }
42   return;
43 }

The questions here are: Can the compiler recognize that the loops are equivalent to a bzero()? And, how fast of a bzero() can be generated?

GCC’s code executes in 256 cycles because the loop is completely unrolled and SSE instructions are used. Intel CC’s code uses SSE but does less unrolling, and takes 308. Clang produces an obvious translation of the source code — locality is poor and 1032 cycles are required.

GCC’s code:

mad_synth_mute:
        pxor    %xmm0, %xmm0
        movdqu  %xmm0, (%rdi)
        movdqu  %xmm0, 16(%rdi)
        movdqu  %xmm0, 32(%rdi)
        movdqu  %xmm0, 48(%rdi)
        movdqu  %xmm0, 64(%rdi)
        ... (obvious sequence of movdqu instructions elided) ....
        movdqu  %xmm0, 3936(%rdi)
        movdqu  %xmm0, 3952(%rdi)
        movdqu  %xmm0, 3968(%rdi)
        movdqu  %xmm0, 3984(%rdi)
        movdqu  %xmm0, 4064(%rdi)
        movdqu  %xmm0, 4080(%rdi)
        ret

Example 3

This is an initialization function taken from pyexpat.c in Python 2.7.

 1 char template_buffer[257];
 2 void
 3 init_template_buffer (void)
 4 {
 5   int i;
 6
 7   i = 0;
 8   while (i < 256)
 9     {
10       template_buffer[i] = (char) i;
11       i++;
12     }
13   template_buffer[256] = (char) 0;
14   return;
15 }

Here Clang generates the obvious looping code requiring 532 cycles: 2 cycles per loop iteration plus some overhead. GCC generates very good SSE code requiring 86 cycles. Intel CC generates code that executes in 18 cycles:

init_template_buffer:
        movl      $269488144, %eax                              #6.3
        movd      %eax, %xmm0                                   #6.3
        pshufd    $0, %xmm0, %xmm1                              #6.3
        movdqa    _2il0floatpacket.0(%rip), %xmm0               #6.3
        xorl      %eax, %eax                                    #6.3
..B1.2:                         # Preds ..B1.2 ..B1.1
        movdqa    %xmm0, template_buffer(%rax)                  #7.5
        paddb     %xmm1, %xmm0                                  #7.5
        movdqa    %xmm0, 16+template_buffer(%rax)               #7.5
        paddb     %xmm1, %xmm0                                  #7.5
        movdqa    %xmm0, 32+template_buffer(%rax)               #7.5
        paddb     %xmm1, %xmm0                                  #7.5
        movdqa    %xmm0, 48+template_buffer(%rax)               #7.5
        paddb     %xmm1, %xmm0                                  #7.5
        movdqa    %xmm0, 64+template_buffer(%rax)               #7.5
        paddb     %xmm1, %xmm0                                  #7.5
        movdqa    %xmm0, 80+template_buffer(%rax)               #7.5
        paddb     %xmm1, %xmm0                                  #7.5
        movdqa    %xmm0, 96+template_buffer(%rax)               #7.5
        paddb     %xmm1, %xmm0                                  #7.5
        movdqa    %xmm0, 112+template_buffer(%rax)              #7.5
        paddb     %xmm1, %xmm0                                  #7.5
        addq      $128, %rax                                    #6.3
        cmpq      $256, %rax                                    #6.3
        jb        ..B1.2        # Prob 99%                      #6.3
        movb      $0, 256+template_buffer(%rip)                 #10.3
        ret                                                     #11.3

Here ICC has performed an impressive 16-way vectorization and 16-way unrolling; the result takes a bit of work to understand.

The 269488144 loaded into eax is 0x10101010. The pshufd instruction (shuffle packed doublewords) loads the value 0x10101010 10101010 10101010 10101010 into xmm1. The next instruction loads xmm0 with 0x0f0e0d0c 0b0a0908 07060504 03020100.

Now the purpose of the first movdqa in the loop is clear: it loads the values {0,1,2, …, 15} into the first 16 bytes of  template_buffer. The paddb (packed add, byte-wise) adds 0x10 to each byte of xmm0, leaving its contents 0x1f1e1d1c 1b1a1918 17161514 13121110. These are moved into the next 16 bytes of template_buffer, and so on. The loop body executes only twice before the entire array is initialized.

Although this translation seems very fragile, icc can generate equivalent object code for most versions of this code where N, X, and Y are constants:

for (i=0; i<N; i++) a[i] = iX+Y;


Example 4

This function is tf_bH() from _ctypes_test.c in Python 2.7.

 1 unsigned long long last_tf_arg_u;
 2 unsigned short
 3 tf_bH (signed char x, unsigned short c)
 4 {
 5   last_tf_arg_u = (unsigned long long) c;
 6   return ((unsigned short) ((int) c / 3));
 7 }

Here GCC and Intel CC both generate code that executes in 6 cycles, whereas Clang’s takes 18.

GCC’s output:

tf_bH:
	movzwl	%si, %eax
	movq	%rax, last_tf_arg_u(%rip)
	movzwl	%si, %eax
	imull	$43691, %eax, %eax
	shrl	$17, %eax
	ret

Clang’s:

tf_bH:                                  # @tf_bH
	movq	%rsi, last_tf_arg_u(%rip)
	movw	%si, %ax
	movw	$-21845, %cx            # imm = 0xFFFFFFFFFFFFAAAB
	mulw	%cx
	andl	$65534, %edx            # imm = 0xFFFE
	movl	%edx, %eax
	shrl	%eax
	ret

At first glance, both compilers have done the right thing: turned the divide-by-constant into a multiply. The difference in execution times is explained by the Intel Optimization Reference Manual, Assembly/Compiler Coding Rule 21:

Favor generating code using imm8 or imm32 values instead of imm16 values. If imm16 is needed, load equivalent imm32 into a register and use the word value in the register instead.

We verified that this is the problem by changing the imm16 value to imm32 by hand and re-running the performance test.

Example 5

This function is get_domain() from tree-call-cdce.c in GCC 4.5.0.

 1 struct input_domain
 2 {
 3   int lb;
 4   int ub;
 5   unsigned char has_lb;
 6   unsigned char has_ub;
 7   unsigned char is_lb_inclusive;
 8   unsigned char is_ub_inclusive;
 9 };
10 typedef struct input_domain inp_domain;
11 inp_domain
12 get_domain (int lb, unsigned char has_lb, unsigned char lb_inclusive,
13 	    int ub, unsigned char has_ub, unsigned char ub_inclusive)
14 {
15   inp_domain domain;
16
17   domain.lb = lb;
18   domain.has_lb = has_lb;
19   domain.is_lb_inclusive = lb_inclusive;
20   domain.ub = ub;
21   domain.has_ub = has_ub;
22   domain.is_ub_inclusive = ub_inclusive;
23   return (domain);
24 }

This function simply loads its arguments into a struct. GCC pushes the values through memory, taking 35 cycles. Intel CC generates code that uses memory 6 times (as opposed to 8 for GCC), requiring 24 cycles. Clang assembles the struct using registers, requiring only 12 cycles.

GCC’s code:

get_domain:
	movl	%edi, -24(%rsp)
	movl	%ecx, -20(%rsp)
	movb	%sil, -16(%rsp)
	movq	-24(%rsp), %rax
	movb	%r8b, -15(%rsp)
	movb	%dl, -14(%rsp)
	movb	%r9b, -13(%rsp)
	movl	-16(%rsp), %edx
	ret

Clang’s:

get_domain:
	movl	%edi, %eax
	shlq	$32, %rcx
	leaq	(%rcx,%rax), %rax
	shll	$16, %edx
	orl	%esi, %edx
	shll	$8, %r8d
	orl	%edx, %r8d
	shll	$24, %r9d
	movl	%r9d, %edx
	orl	%r8d, %edx
	ret

Example 6

This function is php_filter_parse_int() from logical_filters.c in PHP 5.3.3.

  1 int
  2 php_filter_parse_int (char const *str, unsigned int str_len, long *ret)
  3 {
  4   long ctx_value;
  5   int sign;
  6   int digit;
  7   char const *end;
  8   int tmp;
  9   char const *tmp___0;
 10   char const *tmp___1;
 11
 12   sign = 0;
 13   digit = 0;
 14   end = str + str_len;
 15   switch ((int) *str)
 16     {
 17     case 45:
 18       sign = 1;
 19     case 43:
 20       str++;
 21     default:;
 22       break;
 23     }
 24   if ((unsigned long) str < (unsigned long) end)
 25     {
 26       if ((int const) *str >= 49)
 27 	{
 28 	  if ((int const) *str <= 57)
 29 	    {
 30 	      if (sign)
 31 		{
 32 		  tmp = -1;
 33 		}
 34 	      else
 35 		{
 36 		  tmp = 1;
 37 		}
 38 	      tmp___0 = str;
 39 	      str++;
 40 	      ctx_value = (long) (tmp * (int) ((int const) *tmp___0 - 48));
 41 	    }
 42 	  else
 43 	    {
 44 	      return (-1);
 45 	    }
 46 	}
 47       else
 48 	{
 49 	  return (-1);
 50 	}
 51     }
 52   else
 53     {
 54       return (-1);
 55     }
 56   if (end - str > 19)
 57     {
 58       return (-1);
 59     }
 60   while ((unsigned long) str < (unsigned long) end)
 61     {
 62       if ((int const) *str >= 48)
 63 	{
 64 	  if ((int const) *str <= 57)
 65 	    {
 66 	      tmp___1 = str;
 67 	      str++;
 68 	      digit = (int) ((int const) *tmp___1 - 48);
 69 	      if (!sign)
 70 		{
 71 		  if (ctx_value <=
 72 		      (9223372036854775807L - (long) digit) / 10L)
 73 		    {
 74 		      ctx_value = ctx_value * 10L + (long) digit;
 75 		    }
 76 		  else
 77 		    {
 78 		      goto _L;
 79 		    }
 80 		}
 81 	      else
 82 		{
 83 		_L:
 84 		  if (sign)
 85 		    {
 86 		      if (ctx_value >=
 87 			  ((-0x7FFFFFFFFFFFFFFF - 1) + (long) digit) / 10L)
 88 			{
 89 			  ctx_value = ctx_value * 10L - (long) digit;
 90 			}
 91 		      else
 92 			{
 93 			  return (-1);
 94 			}
 95 		    }
 96 		  else
 97 		    {
 98 		      return (-1);
 99 		    }
100 		}
101 	    }
102 	  else
103 	    {
104 	      return (-1);
105 	    }
106 	}
107       else
108 	{
109 	  return (-1);
110 	}
111     }
112   *ret = ctx_value;
113   return (1);
114 }

The assembly is long and not that interesting. The upshot is that Clang and Intel CC are able to turn the divisions by 10 at lines 72 and 87 into integer multiplies, whereas GCC leaves division instructions in the generated code. It is not clear why this happens; as we saw above, other divisions-by-constant are optimized successfully by this compiler. The result is that Clang’s and ICC’s code executes in 41 cycles whereas GCC’s takes 133.

Example 7

This function is crud() from ident.c in git-1.7.3.2.

1 int crud (unsigned char c)
2 {
3   return (((((((((((int) c <= 32 || (int) c == 46) || (int) c == 44)
4 		 || (int) c == 58) || (int) c == 59) || (int) c == 60)
5 	      || (int) c == 62) || (int) c == 34) || (int) c == 92)
6 	   || (int) c == 39) != 0);
7 }

GCC and Clang generate the obvious cascaded comparisons and their code executes in 21 and 24 cycles, respectively. Intel CC is far more clever and its code requires only 8 cycles:

crud:
        movzbl    %dil, %edi
        cmpl      $32, %edi
        jle       ..B1.4
        addl      $-34, %edi
        cmpl      $64, %edi
        jae       ..B1.5
        movl      $1, %eax
        movl      %edi, %ecx
        shlq      %cl, %rax
        movq      $0x400000017001421, %rdx
        testq     %rdx, %rax
        je        ..B1.5
..B1.4:
        movl      $1, %eax
        ret
..B1.5:
        xorl      %eax, %eax
        ret

Since the range of characters being checked for is less than 64, ICC is able to pre-compute a bitmap with 1s in positions corresponding to the desired characters.

Conclusions

Our idea is that using a small collection of compilers and a large collection of realistic benchmark functions, we can get the compilers to reveal performance limitations in each other. Once the basic infrastructure is in place, results can be recompiled with almost no manual effort. Although there are obvious limitations of this approach — it cannot spot an optimization missed by all compilers, it does not take the larger execution context of a benchmark function into account, etc. — we hope that it will become a useful part of compiler developers’ arsenal against performance problems.

These are very early results, selected from only about 1400 benchmark functions. As we remove limitations of the tool, it should become possible to harvest and test hundreds of thousands of benchmark functions. At that point, it may become useful to incorporate profile data into our metric for interesting functions, in order to direct compiler developers towards missed optimizations that are causing the most real slowdown.

Taking Grants for Granted

Proposal-writing talent is unevenly distributed among researchers, and so is grant money. Furthermore, there are strong positive feedback loops where grant money leads to more publications, bigger group size, and increased reputation, all of which make it easier to get subsequent awards.

It struck me, then, while skimming the list of Google Research Award winners from 2010, that just about everyone on this list (certainly the people I know on the list) is already highly effective at getting grant money. In general (I claim) the research output of good grant-writers is not limited by money, but rather by the fact that there are no more hours in a day and by other personal and institutional factors. $6M of Google’s money spread over 112 awards is about $50,000 each: a relatively small amount that will not make a qualitative difference in the kind of research performed by a well-established, highly-funded scholar. It seems likely that Google could find a way to give $6M to the research community that would lead to a much greater impact — for example, giving $50K to each of 112 postdocs or first-year assistant professors. For these people, the money may well create a qualitative difference in their work, providing part of a year of funding for the postdoc or a year of graduate research assistant support for the prof.

Related: Michael Nielsen, in a great blog post, observes that that best results might be obtained by giving out grant money randomly.

The Official Utah State Firearm

From yesterday’s Salt Lake Tribune I learned that Utah may get a state gun: the Browning M1911 semiautomatic pistol, designed by Utahn John Browning. Utah would be the first state to get its own gun. While this is an interesting idea, I’m not sure that .45 is the right caliber:

  • The Utah state animal is the Rocky Mountain elk. Shouldn’t the state gun be a good choice for hunting the state animal? Personally I would not take a .45 on an elk hunt. On the other hand, plinking at seagulls (the state bird) with a .45 would work nicely.
  • Sure, the .45 has ample stopping power, but we might hope that the state gun would be suitable for carrying by any adult resident of the state. The .45’s recoil might be on the heavy side for a hundred-pound recent high school graduate. A 9mm or .38 special may offer a better compromise between effectiveness and usability.
  • According to the company web site, Browning’s handgun offerings are all in .22, 9mm, and .40 — they don’t even make a .45 caliber pistol anymore. Let’s support local industry and choose a state firearm that is still manufactured here (if any are).

I also learned from yesterday’s Trib article that the Utah State Cooking Pot is the dutch oven.