Saturday’s post on Comparing Compiler Optimizations featured some work done by my student Yang Chen. Since then:
- There has been some discussion of these problems on the GCC mailing list; some of them were already in Bugzilla and a new PR was filed. A patch fixing one of the problems is already available!
- On Sunday night Chris Lattner left a comment saying that two of the Clang performance problems have been fixed (the other two are known problems and will take more time to fix).
Have I ever said how great open source is? Of course, there’s nothing stopping a product group from fixing a bug just as quickly — but the results would not be visible to the outside world until the next release. When improvements to a piece of software depend on a tight feedback loop between developers and outsiders, it’s pretty difficult to beat the open source model.
Yang re-tested Clang and here are the results.
Example 4
Previously, Clang’s output required 18 cycles to execute. The new output is both faster (4 cycles) and smaller than GCC’s or Intel CC’s:
tf_bH:
        movq    %rsi, last_tf_arg_u(%rip)
        imull   $43691, %esi, %eax      # imm = 0xAAAB
        shrl    $17, %eax
        ret
Nice!
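For anyone wondering why a multiply and a shift can stand in for a divide here: 43691 is (2^17 + 1) / 3, so multiplying by it and shifting right by 17 computes an unsigned divide-by-3 exactly for any 16-bit input. The magic constant strongly suggests that is what the source code does, though I'm inferring it from the assembly; here's a quick sketch (names and types are mine, not from the test program) that checks the identity exhaustively:

#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Reciprocal-multiplication sketch: 43691 == (1 << 17) / 3 rounded up,
   so (x * 43691) >> 17 equals x / 3 for every 16-bit x. */
static uint32_t div3_by_multiply(uint16_t x)
{
    return ((uint32_t)x * 43691u) >> 17;
}

int main(void)
{
    for (uint32_t x = 0; x <= 0xFFFF; x++)
        assert(div3_by_multiply((uint16_t)x) == x / 3u);
    printf("multiply-and-shift matches x/3 for all 16-bit inputs\n");
    return 0;
}

Replacing the divide with a multiply and a shift is a standard strength reduction; on most x86 chips the integer divide is far slower, which is where the cycle savings come from.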
Example 7
Previously, Intel CC’s 8-cycle code was winning against Clang’s 24 and GCC’s 21 cycles. Clang now generates this code:
crud:
        cmpb    $33, %dil
        jb      .LBB0_4
        addb    $-34, %dil
        cmpb    $58, %dil
        ja      .LBB0_3
        movzbl  %dil, %eax
        movabsq $288230376537592865, %rcx
        btq     %rax, %rcx
        jb      .LBB0_4
.LBB0_3:
        xorl    %eax, %eax
        ret
.LBB0_4:
        movl    $1, %eax
        ret
Although this is several instructions shorter than Intel CC’s output, it still requires 10 cycles to execute on our test machine, as opposed to ICC’s 8 cycles. Perhaps this is in the noise, but Yang spent a while trying to figure it out anyway. It seems that if you replace the btq instruction with a testq (and adjust the subsequent branch appropriately), Clang’s output also executes in 8 cycles. I won’t pretend to understand this.
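For reference, what Clang emits here is a range check followed by a 64-bit bitmap lookup: characters below 33 are accepted right away, characters in the range 34 through 92 are tested against the bits of that movabsq constant, and everything else is rejected. This is not the original source for crud(), but C in roughly this shape produces the same structure:

#include <stdint.h>

/* Sketch of the idiom in Clang's output above (not the original crud()):
   accept anything below 33, then test a 64-bit bitmap for c - 34.
   The mask is the constant from the movabsq in the listing. */
static int crud_like(unsigned char c)
{
    if (c < 33)
        return 1;                  /* cmpb $33, %dil / jb .LBB0_4 */
    c -= 34;                       /* addb $-34, %dil             */
    if (c > 58)
        return 0;                  /* cmpb $58, %dil / ja .LBB0_3 */
    const uint64_t mask = 288230376537592865ull;
    return (mask >> c) & 1;        /* btq %rax, %rcx / jb .LBB0_4 */
}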
Example 6 (Updated Dec 17)
This has been fixed in GCC, see the discussion here. The failure to optimize was pretty interesting: a change to GCC’s profile code introduced breakage that led it to infer that the division instructions in this function are “cold” and not worth optimizing into multiplies. This is a great example of how the lack of good low-level metrics makes it difficult to detect subtle regressions. Of course this would have been noticed eventually by someone who reads assembly, but there are probably other similar problems that are below the noise floor when we look only at results from high-level benchmarks.
8 responses to “A Quick Update to Comparing Compiler Optimizations”
Would you be willing to run the tests for path64?
git clone git://github.com/path64/compiler.git
I’ll run it independently myself, but I’d like to see our results in the races and how fast we can fix any issues that pop up.
Thanks
./C
Sure, we can try path64. Our loose requirements are a compiler we can build and a vaguely gcc-compatible command line interface.
BTW, I’m curious how you arrived at the cycle counts. Are you surrounding the code snippets with a counted loop and averaging out the cycles for a large number of iterations?
Hi Ben- Yeah, that’s basically the way we do it. I’ll ask Yang to give more details; it took a surprising amount of work to get stable, useful data out of RDTSC. We’re happy to share code if you’re interested.
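Here’s a rough sketch of the counted-loop idea, with a stand-in for the code under test. It is not the harness Yang actually used, and it glosses over the serialization, warm-up, and pinning issues that make getting stable RDTSC numbers hard:

#include <stdint.h>
#include <stdio.h>
#include <x86intrin.h>                    /* __rdtsc() on GCC/Clang, x86 only */

/* Stand-in for the snippet being measured. */
static unsigned code_under_test(unsigned char c)
{
    return c < 33;
}

#define ITERS 10000000ull

int main(void)
{
    volatile unsigned sink = 0;           /* keep the calls from being optimized away */
    uint64_t start = __rdtsc();
    for (uint64_t i = 0; i < ITERS; i++)
        sink += code_under_test((unsigned char)i);
    uint64_t stop = __rdtsc();
    printf("about %.2f cycles per iteration (including loop overhead)\n",
           (double)(stop - start) / (double)ITERS);
    return 0;
}

In practice you also want to subtract out the loop overhead and pin the process to a single core, since raw RDTSC readings are noisy.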
Hi there! Interesting stuff! Would you care to post the code for cycle counting? Cheers!
Hi Radu- See this post:
http://blog.regehr.org/archives/330
If you still have questions, please leave another comment.
Thanks!
As of r123089, LLVM now compiles example 2 into:
_mad_synth_mute: ## @mad_synth_mute
pushq %rax
movl $4096, %esi ## imm = 0x1000
callq ___bzero
popq %rax
ret
This seems better than what GCC or ICC produce. The only example Clang doesn’t handle now is “Example 3”, due to the lack of an autovectorizer.