Compiler Optimizations are Awesome

This piece, which I hadn’t gotten around to writing until now since I thought it was all pretty obvious, explains why Daniel J. Bernstein’s talk, The death of optimizing compilers (audio) is wrong, and in fact compiler optimizations are extremely wonderful and aren’t going anywhere.

First, the thesis of the talk is that almost all code is either hot, and therefore worth optimizing by hand, or else cold, and therefore not worth optimizing at all (even with -O). Daniel Berlin, a compiler person at Google, has looked at the data and disagrees. We can also refute Bernstein’s argument from first principles: the kind of people who can effectively hand-optimize code are expensive and not incredibly plentiful. If an optimizing compiler can speed up code by, for example, 50%, then suddenly we need to optimize a lot less code by hand. Furthermore, hand-optimized code has higher ongoing maintenance costs than does portable source code; we’d like to avoid it when there’s a better way to meet our performance goals.

Second, size matters. Most of the computers in the world are embedded and many of these are storage-constrained. Compiler optimization reduces code size and this phenomenon is completely independent of the hot/cold issue. Without optimization we’d have to buy more expensive deeply-embedded processors that have more on-chip flash memory, and we’d also have to throw away many of those 16 GB phones that are cheap and plentiful and fairly useful today.

Third, most future software isn’t written in C and C++ but rather in higher-level languages, which more or less by definition rely on the optimizer to destroy abstraction layers, do compile-time memory management, etc.

Finally, I claim that the economics of compiler optimization are excellent. A lot of dollars are spent each year making code run faster, either by buying hardware resources or by paying programmers to write faster code. In contrast, there are probably a few thousand people actively doing compiler optimization work, and just about everyone benefits from this. If we can centralize on fewer compiler infrastructures, like GCC and LLVM and V8, then the economics get even better.

In summary, of course there’s plenty of hot code that wants to be optimized by hand, and of course there’s plenty of cold code that sees little benefit due to optimizing compilers. But neither of these facts forms an argument against optimizing compilers, which are amazingly useful and will continue to be for the indefinite future.

33 Replies to “Compiler Optimizations are Awesome”

  1. Another reason for optimizing compilers: It allows us to write simple and obvious code expecting that the compiler will do the optimization. This both reduces the initial defect count and makes the code more maintainable in the long-run. Note that these are also often the simplest optimizations (e.g. a calculation inside a loop where none of the inputs to the calculation change during loop execution) that reduce the need to create intermediate variables that can obscure the ultimate function of the code.

  2. A maybe even better argument: portability.

    If I write code to some reasonable trade off between measured performance with optimizations on $RANDOM_MACHINE and readability, I can with reasonable certainty expect it to be reasonably performant on almost all machines. If I hand tune for the best possible performance (with or without optimization on) on that same single machine, I am probably working counter to my self for all the systems I’m not testing on.

  3. As I said in a comment after the talk (too bad it wasn’t recorded), one of his primary examples of code that was so complicated it needed to be optimized by had was the LuaJIT interpreter loop, ie the core of _an optimizing compiler_.

    Also, his suggestion for what to do about this was to build tools that allowed higher-level specification of the tight assembly loops he wanted to write, ie an optimizing compiler.

  4. I must agree. Our resources are so constrained it will not run without maximum optimization. Sure we can write code that runs but it is very hard to read and finding people good enough to maintain it is hard. Even if the supply of great people was unlimited it is much cheaper to let the compiler do the work. The quality of optimizers has increased to the point hand optimizing is rarely needed and next year when we move to a new processor we will mostly have configuration code and low level drivers to change.

  5. I read the slides and reading this made me think that you didn’t understood the author.

    The point of the author is that as computers become faster, the programs increasingly spend shorter time in the anything else but the core functionality of the program. It will get so bad that there are no gains at all in optimizing the slowest parts of the program.

    But then the optimizing compiler cannot optimize the hottest loop alone, because the hand written code may still provide a 10x boost.

    Writing assembly is not practical because there are too many targets to support, but the traditional whole program kind of compilers won’t do it either.

    I see this problem already on my 7 years old computer. The small hot loop does all the work in many of my programs. It must be even worse on a modern superscalar.

    To understand this further, you should note to yourself that writing the code such that it can be compiled is already an optimization in its own. Writing code in C++ is a premature optimization.

  6. Sam, that’s pretty awesome. Too bad to get confirmation that this talk wasn’t recorded, I’m sure it was a lot more entertaining the reading those slides.

  7. Disagreement because you don’t happen to like the insights is not very productive.

    You can look into the CPU profiler output of various programs.

    If you see a profile where the hot routine of the program takes 30% of the time, and you don’t manage to increase the load, you have found a program that still benefits from the modern compilers.

    But if you find a program where 90% of the time is spent in 1% of the program, despite optimizations and smaller inputs. Then you have the case explained by the presenter and it’s likely that program won’t benefit from optimizing compilers as much.

    Now you should know that you can put any routine into a loop to make a program that requires optimization, but even then you may find out cases where the amount if data handled by the inner loop simply overtakes the show.

    And for clarity, there’s a good rule of thumb in predicting how much an optimization is worth: “If that portion of code didn’t run at all” is the upper bound of how much you can improve performance by focusing on that portion of code alone.

    But if you really don’t like these insights, there’s always the AVR to target. They don’t seem to go anywhere anytime soon, and there every function surely takes too much time to run.

  8. Henri, as I said very clearly in the post, there’s nothing wrong with optimizing hot loops by hand, and of course we should do this when machine-optimized code does not meet our performance goals. But this doesn’t make optimizing compilers a bad idea in general.

  9. Having read the slides and the discussion at Hacker News, both arguments strike me as weak.

    Optimizing compilers are awesome. It just isn’t practical to write every single thing in hand-tuned assembly (hand-tuned for which architecture?) for hot code, and for cold code a lot of other concerns apply. In fact, the dichotomy between hot and cold code is frequently false – a secret to optimizing large systems that few people know is at some point the system evolves to the extent that the only way to make progress is to optimize 1% at a time, 100 times in a row.

    Having said that, the oft taken position that Daniel seems to present that compilers are better than humans at optimizing code is false. It’s not the case that we don’t need optimizing compilers, but a number of crucial optimizations in hot code can only be done by humans, and looking at the optimization quality over the last 10 years, I don’t anticipate this changing. (source: I’ve spent the last 10 years working on performance critical portions of video game engines, among other things)

    To get efficient code, however, you have to have a human who understands how the compiler generates code and what kind of code can be efficient, and the compiler that can generate efficient code. There are cases where a human can beat the compiler, and there are cases where a compiler can beat the human, but efficiently writing performant code is only possible with both parties.

  10. The difference between optimizing and non-optimizing compilers can be restated in very basic terms: either you can get a relatively complex C++11 project to fit and run sensibly on an AVR or a legacy 8-bit target, or you don’t. On small targets like that, all code is hot. If you don’t optimize, it won’t fit. And that’s that.

    And there’s something to be said for being able to fit a self-contained program within the L1 cache of modern desktop CPUs. It’s an instant small integer factor jump in performance, and enables novel approaches e.g. to simulation of small embedded devices by binary-translating their optimized compiler-generated code to a desktop host, and then running extensive fuzz tests etc.

  11. The quoted Knuth bit in Bernstein’s talk continues, btw, the following: »premature optimization is the root of all evil. Yet we should not pass up our opportunities in that critical 3 %.«

  12. I think most people who spend a lot of time hand coding tight loops in assembly would be better off learning more about optimizing compilers and their limitations – and working *with* the compiler until it produces near-optimal code.

    It’s amazing what a few well-placed ‘restrict’ keywords and some copying into temporaries will do for clearing up that alias-analysis that is needed to get optimal code.

  13. Tim, as I said, I was at the talk, John’s summary is accurate, and the same problems he identifies were clear to the audience at the time.

  14. Thanks John/Sam. I wasn’t at the talk and neither am I a compiler expert so I don’t have anything more to add. I have been aware of DJB’s work and musings for 20+ years at this point (yes since he was a student at NYU) and usually he has good reasons for his arguments (even if they may be controversial or obtuse). I personally didn’t find the ycombinator thread conclusive in findings.

  15. Morten,

    I don’t think anybody disagrees with that. My personal principal (primarily working on embedded systems) is to write the simplest high level code possible and carefully add compiler hints as needed.

    I’ll just refer to the final comments in the ycombinator thread. It seems that we are debating [generally] about the interpretation of the talk than the spirit of the talk, as far as I can tell.

    lamacase 774 days ago [-]

    The second-to-last slide ends with:
    “ideally, a language should be designed so that an optimizing compiler can describe its optimizations in the source language. Of course!”
    and
    “The programmer using such a system will write his beautifully-structured, but possibly inefficient, program P; then he will interactively specify transformations that make it efficient. Such a system will be much more powerful and reliable than a completely automatic one.”
    Which I find to be a much more agreeable conclusion than
    “everybody should write in-line assembly for inner-loops”

    DannyBee 774 days ago [-]

    On the first, so he wants something like stratego (or any of the other failed systems that do this and were determined not useful) 🙂
    On the second, i’m going to just disagree.

  16. Sam,

    > Also, his suggestion for what to do about this was to build tools that allowed higher-level specification of the tight assembly loops he wanted to write, ie an optimizing compiler.

    Optimising compilers used to be able to do that. Today, gcc and llvm are much smarter. They’re so smart that, in many cases, it’s now easier to write correct, fast assembly code than it is to write C or C++ code that elicits fast and correct results.

  17. > Daniel Berlin, a compiler person at Google, has looked at the data and disagrees.
    Completely agree with DannyBee (for context, my team’s mandate at Google is to collect and analyze fleetwide profiling data). We even published a representative sample in a paper a few years back, showing just how flat our callgraphs are [1] (fig. 1, 2, 3).

    Also, there are plenty of folks tracking gains from compiler improvements alone (on the same HW and workloads). CPU DB has a nice example [2] (fig. 15).

    [1] https://research.google.com/pubs/archive/44271.pdf
    [2] http://dl.acm.org/ft_gateway.cfm?id=2181798&ftid=1198533&dwn=1

  18. tgm, I think on this blog we can blame that on the C and C++ language specifiers.

  19. Re the economics of compiler optimization, there is this old paper by Arch Robinson from Intel: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.85.1534&rep=rep1&type=pdf which tries to reason that optimizations are but a small part of the overall compiler product and that it’s not clear that a commercial compiler vendor would or should treat them as importantly as other things.

    Is it still relevant? This is from the point of view of Intel making ICC, and predates the current world where Google, Apple et al are effectively funding LLVM/Clang/GCC (and distorting the true economics perhaps by skewing development towards what they need for their programs).

  20. RE: “Third, most future software isn’t written in C and C++ but rather in higher-level languages, . . . ”
    And all of the projects that I’ve worked on in C/C++ were written in C/C++ because time and space were critical.

    I also agree with several commentators above: the logic should be written clearly for the benefit of future maintainers. The optimizer should make the logic efficient. Granted, good developers know how the compilers works and take advantage of those features, but if I have to do the optimizations by hand, most of my code won’t be readable to those that inherit it.

  21. Svilen, thanks for those links, they’re great.

    exbadger, I used to love that paper, but haven’t gone back to it for a number of years. I think it still applies but as you say there are some distorting effects.

  22. @Svilen Kanev In the figure 3: Even with those flat callbacks, where your 400 functions take 80% of the time, you have about 1000 functions that combined take less than 5% time to run.

    Otherwise that data is impressive. It shows out that you can unlikely make a 0.5% improvement by changing only an one function. Though the three first figures doesn’t tell whether you could change the whole structure of the program to optimize it. But maybe that’s not possible in the first place.

    But this is not all of Google that you present. You have tons of more software than this deployed. Your warehouse is the hot loop, yet, it still isn’t completely flat.

  23. I’m convinced by the third argument: economics.

    But, the flat profile of Google’s code seems irrelevant to the discussion here. Wouldn’t you expect any mature codebase to have a flat profile? If it had hotspots, then they were noticed and fixed.

    In other words, Bernstein may talk about how profiles look like when you first churn out code, while Google’s flat profiles are for mature code. No contradiction.

  24. Sam, no C standard requires compilers to abuse undefined behaviour—or to introduce it when compiling a C program where none exists.

  25. Real compiler optimizations that do not change the behaviour of programs are useful, but not a replacement for source-level optimization by programmers. E.g., looking at the example from “Writing Efficient Programs” (1982), the 1982-vintage source-level optimizations provide a speedup by a factor of 2.7 over gcc-5.2.0 -O3 (and more for clang -O3). I then looked at optimizing this further, but had to switch to assembly language (or, equivalently, architecture-specific intrinsics) in order to get any further, and, with the help of others, got another factor of 2 out of this example that gcc and clang were unable to find.

    By contrast, “optimizations” that assume that programs do not exercise undefined behaviour usually provide hardly any speedup even for code which they do not break. E.g., in the example above, these “optimizations” produced between a factor of 1.05 *slowdown* and a factor 1.04 speedup on the various programs (there are several, corresponding to different levels of source-level optimization).

    That would not be so bad, but the advocates of these “optimizations” tell us that the our options are to either disable optimization (which also disables real optimization, and may not even help), or to “sanitize” the programs, which requires a lot of effort (and not just on the hot code), and permanently increases the maintenance cost, throwing that benefit of optimization out of the window. And worse, when “sanitizing” code, it can become slower; I expect that a Forth system written in “sanitized” C would be at least 5 times slower than Gforth compiled with gcc-2.95.

    Concerning the link to Daniel Berlin’s posting, I don’t see any data there, only claims without data to back it up. His claim about interpreters and Unladen Swallow is interesting with respect to the “sanitizing” topic above, because one thing that Unladen Swallow (like many other high-performance interpreters) did, was to use threaded code, which cannot be expressed in standard C, and is therefore undefined behaviour.

    Data on the example from “Writing efficient programs” can be found in:
    http://www.complang.tuwien.ac.at/kps2015/proceedings/KPS_2015_submission_29.pdf
    and the further optimizations in news: ff.

  26. Anton, complaining about optimizations isn’t the way forward.

    The way forward is to figure out the semantics for C that you want, write them down, and then try to build a consensus around them.

  27. I doubt that is the way forward (although I plan to write a paper on it), but anyway, that was not the point of my comment.

    My main point is that compiler optimizations are not a replacement for source-level optimizations by the programmer. Looking at http://www.complang.tuwien.ac.at/anton/bentley.pdf they actually seem to be mostly orthogonal: Cases where a source-level optimization did something that was automatically done by the compilers show up as flat sections for the -O3 lines, but not as flat for the -O0 lines. That is the case for the optimization between tsp4 and tsp5 (inlining a function), but not for the other source-level optimizations in this example.

  28. Hi Anton, I think everyone is in agreement that compiler optimizations are not a replacement for source-level optimizations.

Comments are closed.