A Quick Update to Comparing Compiler Optimizations


Saturday’s post on Comparing Compiler Optimizations featured some work done by my student Yang Chen. Since then:

  • There has been some discussion of these problems on the GCC mailing list; some of the problems are already in the Bugzilla and a new PR was filed. A patch fixing the problem is already available!
  • On Sunday night Chris Lattner left a comment saying that two of the Clang performance problems have been fixed (the other two are known problems and will take more time to fix).

Have I ever said how great open source is? Of course, there’s nothing stopping a product group from fixing a bug just as quickly — but the results would not be visible to the outside world until the next release. When improvements to a piece of software depend on a tight feedback loop between developers and outsiders, it’s pretty difficult to beat the open source model.

Yang re-tested Clang and here are the results.

Example 4

Previously, Clang’s output required 18 cycles to execute. The new output is both faster (4 cycles) and smaller than GCC’s or Intel CC’s:

tf_bH:
        movq    %rsi, last_tf_arg_u(%rip)
        imull   $43691, %esi, %eax      # imm = 0xAAAB
        shrl    $17, %eax
        ret

Nice!

Example 7

Previously, Intel CC’s 8-cycle code was winning against Clang’s 24 and GCC’s 21 cycles. Clang now generates this code:

crud:
	cmpb	$33, %dil
	jb	.LBB0_4
	addb	$-34, %dil
	cmpb	$58, %dil
	ja	.LBB0_3
	movzbl	%dil, %eax
	movabsq	$288230376537592865, %rcx
	btq	%rax, %rcx
	jb	.LBB0_4
.LBB0_3:
	xorl	%eax, %eax
	ret
.LBB0_4:
	movl	$1, %eax
	ret

Although this is several instructions shorter than Intel CC’s output, it still requires 10 cycles to execute on our test machine, as opposed to ICC’s 8 cycles. Perhaps this is in the noise, but Yang spent a while trying to figure it out anyway. It seems that if you replace the btq instruction with a testq (and adjust the subsequent branch appropriately), Clang’s output also executes in 8 cycles. I won’t pretend to understand this.

Example 6 (Updated Dec 17)

This has been fixed in GCC, see the discussion here. The failure to optimize was pretty interesting: some hacking in GCC’s profile code caused some breakage that lead it to infer that the division instructions in this function are “cold” and not worth optimizing to multiplies. This is a great example of how the lack of good low-level metrics makes it difficult to detect subtle regressions. Of course this would have been noticed eventually by someone who reads assembly, but there are probably other similar problems that are below the noise floor when we look only at results from high-level benchmarks.

,

8 responses to “A Quick Update to Comparing Compiler Optimizations”

  1. BTW, I’m curious how you arrived at the cycle counts. Are you surrounding the code snippets with a counted loop and averaging out the cycles for a large number of iterations?

  2. Hi Ben- Yeah, that’s basically the way we do it. I’ll ask Yang to give more details, it took a surprising amount of work to get stable, useful data out of RDTSC. We’re happy to share code if you’re interested.

  3. As of r123089, LLVM now compiles example 2 into:

    _mad_synth_mute: ## @mad_synth_mute
    pushq %rax
    movl $4096, %esi ## imm = 0x1000
    callq ___bzero
    popq %rax
    ret

    Which seems better than what GCC or ICC produce. The only example Clang doesn’t handle now is “Example 3”, due to lack of an autovectorizor.