A Quick Update to Comparing Compiler Optimizations

Saturday’s post on Comparing Compiler Optimizations featured some work done by my student Yang Chen. Since then:

There has been some discussion of these problems on the GCC mailing list; some of the problems are already in Bugzilla, and a new PR was filed. A patch fixing the problem is already available!
On Sunday night Chris Lattner left a comment saying that two of the Clang performance problems have been fixed (the other two are known problems and will take more time to fix).

Have I ever said how great open source is? Of course, there’s nothing stopping a product group from fixing a bug just as quickly — but the results would not be visible to the outside world until the next release. When improvements to a piece of software depend on a tight feedback loop between developers and outsiders, it’s pretty difficult to beat the open source model.

Yang re-tested Clang and here are the results.

Example 4

Previously, Clang’s output required 18 cycles to execute. The new output is both faster (4 cycles) and smaller than GCC’s or Intel CC’s:

tf_bH:
        movq    %rsi, last_tf_arg_u(%rip)
        imull   $43691, %esi, %eax      # imm = 0xAAAB
        shrl    $17, %eax
        ret

Nice!
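
The constant 43691 (0xAAAB) followed by a right shift of 17 looks like a reciprocal multiplication. Since 3 × 43691 = 2^17 + 1, the multiply-and-shift agrees with unsigned division by 3 for every input below 65536. The C sketch below checks that arithmetic exhaustively. The function names are hypothetical, and reading Example 4 as a 16-bit division by 3 is an assumption based only on the constants in the listing, not on the original source.

```c
#include <stdint.h>

/* Multiply-and-shift form, using the two constants from the listing above.
   For x <= 65535 the product fits in 32 bits, so nothing wraps. */
uint32_t div3_mulshift(uint32_t x)
{
    return (x * 43691u) >> 17;   /* 43691 == 0xAAAB */
}

/* Counts the 16-bit inputs for which the multiply-and-shift form
   disagrees with ordinary unsigned division by 3. */
unsigned count_mismatches(void)
{
    unsigned bad = 0;
    for (uint32_t x = 0; x <= 0xFFFFu; x++) {
        if (div3_mulshift(x) != x / 3u)
            bad++;
    }
    return bad;
}
```

Calling count_mismatches() returns 0: the rounding error introduced by the approximate reciprocal stays below 1/6 over this range, which is too small to carry the quotient to the next integer.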

Example 7

Previously, Intel CC’s 8-cycle code was winning against Clang’s 24 and GCC’s 21 cycles. Clang now generates this code:

crud:
	cmpb	$33, %dil
	jb	.LBB0_4
	addb	$-34, %dil
	cmpb	$58, %dil
	ja	.LBB0_3
	movzbl	%dil, %eax
	movabsq	$288230376537592865, %rcx
	btq	%rax, %rcx
	jb	.LBB0_4
.LBB0_3:
	xorl	%eax, %eax
	ret
.LBB0_4:
	movl	$1, %eax
	ret

Although this is several instructions shorter than Intel CC’s output, it still requires 10 cycles to execute on our test machine, as opposed to ICC’s 8 cycles. Perhaps this is in the noise, but Yang spent a while trying to figure it out anyway. It seems that if you replace the btq instruction with a testq (and adjust the subsequent branch appropriately), Clang’s output also executes in 8 cycles. I won’t pretend to understand this.
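
For readers who do not want to decode the listing by hand, here is a C rendering of what Clang's output appears to compute. It is a reading of the assembly, not the original source of crud, and the names crud_bitmask and CRUD_MASK are made up for illustration. Decoding the 64-bit constant shows bits 0, 5, 10, 12, 24, 25, 26, 28 and 58 set, which after adding back the offset of 34 correspond to the characters " ' , . : ; < > and backslash.

```c
#include <stdint.h>

/* Bit i is set when character (34 + i) is accepted.
   The constant is copied from the movabsq in the listing above. */
#define CRUD_MASK 288230376537592865ULL

int crud_bitmask(unsigned char c)
{
    if (c < 33)                                /* cmpb $33 / jb: below '!' returns 1 */
        return 1;
    unsigned char i = (unsigned char)(c - 34); /* addb $-34, wraps modulo 256 */
    if (i > 58)                                /* cmpb $58 / ja: outside the mask */
        return 0;
    return (int)((CRUD_MASK >> i) & 1u);       /* btq: test bit i of the mask */
}
```

The last line could equally be written as (CRUD_MASK & (1ULL << i)) != 0, which is closer in shape to the testq variant mentioned above. In C the two forms are equivalent, so any difference in cycle counts comes from the instruction the compiler selects rather than from the logic.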

Example 6 (Updated Dec 17)

This has been fixed in GCC; see the discussion here. The failure to optimize was pretty interesting: some hacking in GCC’s profile code caused some breakage that led it to infer that the division instructions in this function are “cold” and not worth optimizing to multiplies. This is a great example of how the lack of good low-level metrics makes it difficult to detect subtle regressions. Of course this would have been noticed eventually by someone who reads assembly, but there are probably other similar problems that are below the noise floor when we look only at results from high-level benchmarks.

December 14, 2010

regehr

Compilers, Computer Science

8 responses to “A Quick Update to Comparing Compiler Optimizations”

C. Bergström (CTO – PathScale) says:

December 15, 2010 at 3:57 am

Would you be willing to run the tests for path64?

git clone git://github.com/path64/compiler.git

I’ll run it independently myself, but I’d like to see our results in the races and how fast we can fix any issues that pop up.

Thanks

./C
regehr says:

December 15, 2010 at 8:32 am

Sure, we can try path64. Our loose requirements are a compiler we can build and a vaguely gcc-compatible command line system.
Ben L. Titzer says:

December 17, 2010 at 12:19 am

BTW, I’m curious how you arrived at the cycle counts. Are you surrounding the code snippets with a counted loop and averaging out the cycles for a large number of iterations?
regehr says:

December 17, 2010 at 10:08 am

Hi Ben- Yeah, that’s basically the way we do it. I’ll ask Yang to give more details, it took a surprising amount of work to get stable, useful data out of RDTSC. We’re happy to share code if you’re interested.
radu says:

December 29, 2010 at 8:56 am

Hi there! Interesting stuff! Would you care to post the code for cycle counting? Cheers!
regehr says:

December 29, 2010 at 10:52 am

Hi Radu- See this post:

http://blog.regehr.org/archives/330

If you still have questions, please leave another comment.
radu says:

December 30, 2010 at 3:04 am

Thanks!
Chris Lattner says:

January 8, 2011 at 2:25 pm

As of r123089, LLVM now compiles example 2 into:

_mad_synth_mute:                        ## @mad_synth_mute
        pushq   %rax
        movl    $4096, %esi             ## imm = 0x1000
        callq   ___bzero
        popq    %rax
        ret

Which seems better than what GCC or ICC produce. The only example Clang doesn’t handle now is “Example 3”, due to lack of an autovectorizer.