A misunderstanding that I sometimes run into when teaching programming is that the compiler can and should guess what the programmer means. This isn’t usually quite what people say, but it’s what they’re thinking. A great example appeared in a message sent to the avr-gcc mailing list. The poster had upgraded his version of GCC, causing a delay loop that previously worked to be completely optimized away. (Here’s the original message in the thread and the message I’m quoting from.) The poster said:
… the delay loop is just an example, of how simple, intuitive code can throw the compiler into a tizzy. I’ve used SDCC(for mcs51) where the compiler ‘recognises’ code patterns, and says “Oh, I know what this is – it’s a delay loop! – Let it pass.”(for example).
Another example comes from the pre-ANSI history of C, before the volatile qualifier existed. The compiler would attempt to guess whether the programmer was accessing a hardware register, in order to avoid optimizing away the critical accesses. I’m sure there are more examples out there (if you know of a good one, please post a comment or mail me).
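The delay-loop failure can be sketched in a few lines of C. This is an illustrative reconstruction, not the poster’s actual code: with optimizations enabled, a compiler is entitled to delete the first loop entirely, because its counter has no observable effect; declaring the counter volatile forces every load and store to happen, which is the standard-sanctioned way to express the intent.

```c
#include <stdint.h>

/* Naive delay loop: the counter is dead after the loop, so an
   optimizing compiler may legally remove the whole loop. */
void delay_naive(uint32_t n) {
    for (uint32_t i = 0; i < n; i++) {
        /* empty body -- nothing observable happens */
    }
}

/* Making the counter volatile obliges the compiler to perform each
   increment and comparison, so the loop survives optimization. */
void delay_volatile(uint32_t n) {
    for (volatile uint32_t i = 0; i < n; i++) {
        /* still empty, but the accesses to i are now side effects */
    }
}
```

Both versions behave identically at -O0; the difference only shows up once the optimizer is allowed to reason about dead values.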
The problem with this kind of thinking is that the correctness of code now depends on the compiler correctly guessing what the programmer means. Since the heuristics used to guess aren’t specified or documented, they are free to change across compilers and compiler versions. In the short term, these hacks are attractive, but in the long run they’re little disasters waiting to happen.
One of the great innovations of the past 50 years of programming language research was to separate the semantics of a language from its implementation. This permits the correctness of an implementation to be judged solely by how closely it conforms to the standard, and it also permits programs to be reasoned about as mathematical objects. C is not the language best suited to mathematical reasoning, but there are some excellent research projects that do exactly this. For example, Michael Norrish’s PhD thesis formalized C in the HOL theorem prover, and Xavier Leroy’s CompCert compiler provably preserves the meaning of a C program as it is translated into PPC or ARM assembly.
Of course, the intent of the programmer does matter sometimes. First, a well-designed programming language takes programmer intent into account and makes programs mean what they look like they mean. Second, intent is important for readability and maintainability. In other words, there are usually many ways to accomplish a given task, and good programmers choose one that permits subsequent readers of the code to easily grasp what is being done, and why. But the compiler does not, and should not, care about intent.
9 responses to “The Compiler Doesn’t Care About Your Intent”
One of the cases where a compiler absolutely *does* need to be concerned with the intent of the programmer is when presenting diagnostic information related to programming errors.
You write: “Xavier Leroy’s CompCert compiler provably preserves the meaning of a C program as it is translated into PPC or ARM assembly.”
That’s not really true. If the CompCert compiler could actually be proved correct, then it would (by definition) be bug-free… and it’s clearly not bug-free yet, as seen from the number of bugfixes mentioned in its changelogs.
This isn’t Xavier Leroy’s fault; it’s mostly the fault of the C committee for not specifying C’s syntax or semantics very well (from a theoretician’s perspective; there are huge gaping holes in C that are generally handwaved away as “oh, the compiler vendors all know how to deal with that”). But still, you shouldn’t say that CompCert *does* preserve correctness; merely that it *tries to* preserve correctness.
Hi Anonymoose- I don’t quite agree with your characterization. First, looking through that changelog, I didn’t see any fixes for wrong-code bugs. It looked like feature additions and fixes for crash bugs (if CompCert crashes, no proof is emitted — it’s not wrong). Did I miss something? Second, as I understand the situation, laxity in the C standard is not a problem. The possibilities for errors in CompCert are: the PPC semantics could be wrong, the CLight semantics could be wrong, or the C-to-CLight translator could contain bugs. Since CIL is used to translate C into CLight, and CIL has historically contained plenty of bugs, CIL is almost certainly the weakest part of CompCert. I believe Xavier will have eliminated most of the dependency on CIL for the next release.
Good article, but I want to clarify a small detail so that people don’t make mistakes: volatile isn’t a portable way to do register access in C, for various reasons.
(we could debate whether C90 or C99 define volatile in a way that makes it suitable for register access, but in today’s real world of massively out-of-order CPUs even in embedded systems, multiprocessing, and lots of caches and buses, it just doesn’t work anymore, and the amount of work to put into compilers AND systems to make it work again is ridiculously disproportionate to what it would gain us compared to the solution that does work today: accessors written in inline assembly language, featuring the right amount of memory-barrier black magic to enforce that transactions reach the correct bus in order)
Hi Xilun- When talking about high-performance processors, you are perfectly correct.
However, there are a lot of microcontrollers that are in-order and lack interesting memory subsystems. These are commonly programmed in C and volatile is useful, by itself, in that domain.
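On such in-order microcontrollers, the idiom is a volatile access to a memory-mapped register. A minimal sketch (the register name, polarity, and the use of a pointer parameter rather than a fixed hardware address are all assumptions made so the code is self-contained):

```c
#include <stdint.h>

/* Busy-wait until bit 0 of a hypothetical status register is set.
   The volatile qualifier tells the compiler that the pointed-to byte
   can change outside the program's control; without it, the load
   could be hoisted out of the loop, spinning forever on a stale value. */
void wait_ready(volatile uint8_t *status_reg) {
    while ((*status_reg & 0x01u) == 0) {
        /* each iteration re-reads the register */
    }
}
```

On real hardware the parameter would typically be a cast of a fixed address, e.g. `(volatile uint8_t *)0x4000` for some device-specific location.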
Another example of programmers wanting the C compiler to generate code based on guesses rather than on specified behavior is the use of structs in network protocol processing. Most protocol stacks written in C represent protocol headers by overlaying a struct on a byte array in memory and then using the struct to access individual header fields. This is very convenient: the programmer does not have to think about how to extract the correct fields, and it provides a nice syntax for header field accesses. It works well on many platforms and architectures, but when the code is ported to a platform with weird word sizes or arcane memory alignment requirements, it typically breaks. Specifically, DSP processors that have 16-bit or 32-bit chars break a lot of these structs.
The problem is that the C compiler does not have to abide by any specific rules when it chooses how to lay out a struct in memory. Thus the compiler may choose a layout that does not match the specified protocol header. At this point the programmer usually wishes that the compiler could guess the programmer’s intent and just provide the layout that matches the protocol header…
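The contrast can be sketched with a made-up three-field header. The struct version is at the compiler’s mercy (it will likely insert a padding byte before `length`, and field bytes come out in host order); the accessor functions instead read bytes at fixed offsets with network byte order spelled out, so their result is the same on every conforming platform:

```c
#include <stdint.h>

/* Overlay style: convenient, but the layout is implementation-defined.
   Many compilers pad this to 8 bytes, not the 7 bytes on the wire. */
struct hdr {
    uint8_t  type;
    uint16_t length;
    uint32_t seq;
};

/* Portable style: extract big-endian fields from explicit offsets. */
static uint16_t get_be16(const uint8_t *p) {
    return (uint16_t)(((uint16_t)p[0] << 8) | p[1]);
}

static uint32_t get_be32(const uint8_t *p) {
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16)
         | ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}
```

The accessor style costs a few extra lines per field but removes any dependence on the compiler guessing the intended wire layout.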
TinyOS and its C variant nesC take a step away from depending on this quirk of the C language, by the way, by providing a language construct for specifying protocol headers.
Hi Adam- Yeah, it’s ironic that the prevalent systems programming language makes it hard to specify portable data layouts.
Do you know about this work? Is it a piece of the solution?
I mostly like nesC’s network types, but I dislike the way that they make it a bit tricky to reason about overheads.
Good point. Thanks