Proposal for a Friendly Dialect of C


[This post is jointly authored by Pascal Cuoq, Matthew Flatt, and John Regehr.]

In this post, we will assume that you are comfortable with the material in all three parts of John’s undefined behavior writeup and also with all three parts of Chris Lattner’s writeup about undefined behavior. Additionally, this paper is excellent background reading.

C compilers generate faster and smaller code by assuming that the compiled program will never execute an undefined behavior (UB). Each time a compiler exploits a new kind of UB or increases the optimizer’s reach in other ways, some code that previously worked will break. Thus, compiler upgrades cause consistent low-grade headaches for developers and maintainers of large bodies of C code. It’s not hard to find reasonable objections to this kind of thing as well as irate bug reports. The code that gets broken by the UB-aware compiler is incorrect according to the standard, but following all of the rules in the C standard in a large code base is brutally difficult and, practically speaking, few programmers are capable of it. For example, a sufficiently advanced compiler can break six of the nine well-worn C programs in SPEC CINT 2006 by only exploiting integer undefined behaviors. The problem is that the ostensible user base for C — people implementing low-level systems code — is not necessarily well served by creeping undefined-behavior exploitation. In short, modern C is not a friendly programming language.

When developers are not 100% certain that their code is free of undefined behaviors, one thing they do is add compiler-specific flags that disable certain UB-based optimizations. For example, PostgreSQL uses the -fwrapv option, which tells GCC to implement two’s complement wrapping behavior for signed integer overflows. For analogous reasons, the Linux kernel uses -fno-strict-aliasing and -fno-delete-null-pointer-checks. The problem with these sorts of flags is that they are compiler-specific, the flags don’t necessarily mean the same thing across compiler versions, the flags individually don’t provide much protection against UB exploitation, and developers must watch out for new kinds of breakage and new flags to add to configuration scripts.

As Chris Lattner says at the end of his third post on this topic, using various -f flags amounts to selecting a different dialect of C. Instead of having programmers learn, mix, match, and track various -f flags, we propose defining a friendly dialect of C that trades some optimization capability for ease of reasoning. This friendly dialect might be supported through a -std=friendly-c flag (if you’ll indulge the idea that the friendly dialect could be a standard) that merely implies a group of -f flags for a given version of GCC or LLVM. The flag would be otherwise orthogonal to code generation options, such as -O2. Our goal is to combine

  • minimal additional effort for compiler developers by — as much as possible — simply requiring that they provide behaviors that are already present or are at least easily available; with

  • minimal slowdown when compared to maximum UB-aware compiler optimizations by (1) not requiring any UBSan-like dynamic checks to be added and (2) disabling only optimizations that provide a bad value proposition in terms of performance vs. friendliness.

As a starting point, we imagine that friendly C is like the current C standard, but replacing many occurrences of “X has undefined behavior” with “X results in an unspecified value”. That adjustment alone can produce a much friendlier language. In other cases, we may be forced to refer to machine-specific details that are not features of the C abstract machine, and we are OK with that.

Here are some features we propose for friendly C:

  1. The value of a pointer to an object whose lifetime has ended remains the same as it was when the object was alive.
  2. Signed integer overflow results in two’s complement wrapping behavior at the bitwidth of the promoted type.
  3. Shift by negative or shift-past-bitwidth produces an unspecified result.
  4. Reading from an invalid pointer either traps or produces an unspecified value. In particular, all but the most arcane hardware platforms can produce a trap when dereferencing a null pointer, and the compiler should preserve this behavior.
  5. Division-related overflows either produce an unspecified result or else a machine-specific trap occurs.
  6. If possible, we want math- and memory-related traps to be treated as externally visible side-effects that must not be reordered with respect to other externally visible side-effects (much less be assumed to be impossible), but we recognize this may result in significant runtime overhead in some cases.
  7. The result of any signed left-shift is the same as if the left-hand shift argument was cast to unsigned, the shift performed, and the result cast back to the signed type.
  8. A read from uninitialized storage returns an unspecified value.
  9. It is permissible to compute out-of-bounds pointer values including performing pointer arithmetic on the null pointer. This works as if the pointers had been cast to uintptr_t. However, the translation from pointer math to integer math is not completely straightforward since incrementing a pointer by one is equivalent to incrementing the integer-typed variable by the size of the pointed-to type.
  10. The strict aliasing rules simply do not exist: the representations of integers, floating-point values and pointers can be accessed with different types.
  11. A data race results in unspecified behavior. Informally, we expect that the result of a data race is the same as in C99: threads are compiled independently and then data races have a result that is dictated by the details of the underlying scheduler and memory system. Sequentially consistent behavior may not be assumed when data races occur.
  12. memcpy() is implemented by memmove(). Additionally, both functions are no-ops when asked to copy zero bytes, regardless of the validity of their pointer arguments.
  13. The compiler is granted no additional optimization power when it is able to infer that a pointer is invalid. In other words, the compiler is obligated to assume that any pointer might be valid at any time, and to generate code accordingly. The compiler retains the ability to optimize away pointer dereferences that it can prove are redundant or otherwise useless.
  14. When a non-void function returns without returning a value, an unspecified result is returned to the caller.

In the interest of creating a readable blog post, we have kept this discussion informal and have not made any attempt to be comprehensive. If this proposal gains traction, we will work towards an implementable specification that addresses all 203 items listed in Annex J of the C11 standard. A friendly C++ could also be defined but that is a bigger job.

We are not trying to fix the deficiencies of the C language nor making an argument for or against C. Rather, we are trying rescue the predictable little language that we all know is hiding within the C standard. This language generates tight code and doesn’t make you feel like the compiler is your enemy. We want to decrease the rate of bit rot in existing C code and also to reduce the auditing overhead for safety-critical and security-critical C code. The intended audience for -std=friendly-c is people writing low-level systems such as operating systems, embedded systems, and programming language runtimes. These people typically have a good guess about what instructions the compiler will emit for each line of C code they write, and they simply do not want the compiler silently throwing out code. If they need code to be faster, they’ll change how it is written.

Related reading:

We appreciate feedback.


47 responses to “Proposal for a Friendly Dialect of C”

  1. Why isn’t #14 a compile time error?

    What implications exist around linking friendly-c code with machiavellian-c? I.e. put the performance critical kernels in there own module and work realy hard to make it correct.

  2. This makes perfect sense to me, and it can/should even be driven at the C standard body level. If you’re willing to concede that the “friendly” dialect only needs to support 2’s complement integer machines (something that C has apparently been unwilling to narrow its scope to), then this is a very straight-forward technical exercise.

    The challenge is the bike shedding and exact specification of the details, which is something a committee should be pretty good at.

    -Chris

  3. Fantastic! If you’re committed to replacing “undefined behavior” with “unspecified values,” I suspect it’s achievable within WG14.

    I worked on something similar. Unfortunately, I was mired by a few bike shed issues: e.g. Unicode identifiers and eliminating syntax ambiguities. I’d love to participate.

  4. Could #14 (non-void function returns without a value) please be an error? I’ve never heard any good reason why it shouldn’t be an error and have seen plenty of grief because of it.

  5. @bcs A function that does not terminate is allowed to return type int and not to contain any return statement. It is only reaching the final } of a non void-returning function (other than main) that is undefined according to the C standard. Hence, dynamic behavior.

  6. As an embedded system implementor, I strongly endorse this idea. I would also like to see the behavior of padding bytes specified: specifically, that the padding bytes in compound literals and objects initialized with a designated initializer are all zero, and that assignment to the elements of an object does not disturb the padding bytes (modulo padding bytes in union members that overlap other representations). This would inhibit the compiler from using a larger-sized store instruction to store a smaller-sized value by taking advantage of the undefined behavior of padding bytes.

    In concert with this specification of padding behavior, I’d also like to see empty compound literals/designated initializers supported for the purpose of easily initializing an object with automatic storage duration to all zeros (including any padding bytes).

  7. > replacing many occurrences of “X has undefined behavior” with “X results in an unspecified value”

    How costly would it be to remove *all* UB? (Is there any case where it’s not _possible_ to replace it with something milder like unspecified / implementation-defined results?)

    (This idea has been bouncing around my head as well. I suspect a significant part of C’s success is due to the fact that compilers historically didn’t take advantage of all this undefined behavior.)

    Also: how much harder would a friendly C++ be? My superficial impresssion is that C++ doesn’t add its own UB but only adopts it from C, but maybe this is mistaken.

  8. Another “unfriendly” (in that it doesn’t guarantee naive-compiler-like behavior) feature of C is “sequence points”. One could argue that this is a much harder mistake to make than the ones you mention; most code that gets broken is very unnatural. Even so, this seems like a good candidate to change if you want the language to be truly “friendly.”

  9. I believe (14) can lead to surprising slow-downs. When you write through *any* pointer you could validly reach any address. That means that a write clobbers *all* cached reads. You could write anything through any pointer and force the compiler to reload *all* other values it might have cached in registers from memory. Now, everything aliases everything else. Except, if provably not in simple cases.

  10. UB is the only thing that has kept C compilers ahead of other languages in terms of performance, and by extension, the only thing that has kept C alive for so long. If you remove all of the small competitive advantages that keep C “fast”, then it may as well not exist as a language, since it has *very* little else going for it, except for providing an system ABI for others to build on.

    I hate to sound like a broken record, but the solution isn’t to wrap dangerous tools in 10 layers of protective foam — it’s to keep them locked away, where only responsible adults (who understand the full ramifications of it’s use) can reach. The kind of code that is difficult to get right in C (copious passing and sharing of heap allocated structures), should just be written in a different language entirely. If this means repeatedly telling GitHub hipsters and Hacker News kids “don’t use C, you’ll put your eye out you blithering idiot”, then that’s still a better solution than this.

  11. Chris, thanks for the feedback. I am very willing to assume 2’s complement (despite the continued existence of at least one ones’ complement architecture that has a C compiler).

    The bikeshedding issue is a concern for sure…

  12. glaebhoerl, C++ is an odd beast that seems to be slowly but surely diverging from C in important respects. You’re right that many of the C++ UBs are rooted in C but surely there are many more that would have to be carefully mined out of the standard. Unlike C, C++ has no handy Appendix J (last time I looked, at least).

  13. Hi tobi, the pointer-related parts of friendly C are the trickiest. The new dialect will only be acceptable if it does not break standard (not type-based) alias analyses.

  14. Jack, the motivation here is the huge amount of existing C code that keeps getting broken by compiler upgrades.

    Keeping people who shouldn’t be using C from using C is a separate issue, and it seems clear to me that usage of C in new software projects has cratered in recent years, so I think this process is taking care of itself just fine.

  15. I write embedded C code professionally, and I like your proposal.

    Comments:
    * Would an uninitialized bool be required to be either true or false, or can it have any int value?
    * For (4), you may want to allow side effect of reading from invalid pointers. It may point to memory that is mapped to a hardware register where reading changes its state.

  16. Jack, signed integer overflow is *not* the undefined behavior that allows compiler optimizations to keep up. It makes only a slight difference, especially since the managed languages that are the primary performance competitors – C# and Java – have the wraparound semantics that this proposal specifies.

    These ideas leave plenty of undefined behavior. Accessing freed memory or going past the end of an array would be as undefined as ever.

    I disagree with some things on this list:

    memcpy and memmove ought to remain distinct, but I think that an overlapping memcpy ought to result in unspecified garbage being written rather than undefined behavior.

    I think the aliasing rules need some refinement, but we systems programmers need to come to a truce with compiler optimizer designers so as to not destroy C’s performance, because unlike signed integer overflow, the aliasing rules actually do matter for C’s performance.

    I think your changes need to account for pointers to memory-mapped I/O registers. Perhaps the standard should required that all such pointers be labeled “volatile”? At least, the registers with side effects from reading, or whose value may change upon reading, and for writing registers of any type except write-combining and write-through?

    If we’re going to ditch non-two’s-complement systems, we might as well define signed shift right as sign-extending.

    Over-shifting should probably be either unspecified, or perhaps specified as being one of two possibilities chosen by the implementation: 1. Over-shifting results in shifting by modulo the bit size of the promoted type. 2. Over-shifting results in 0 when shifting left, 0 when unsigned shifting right, and (input < 0) ? -1 : 0 when signed shifting right.

    By the way, here is something I use to show why signed integer overflow is a real pain in the ass to deal with, even when you're aware of it and try to use unsigned integer types to avoid it: http://pastebin.com/nLzRGHr0

  17. Since the proposal attempts to make C code easier to reason about, you should call it “Reasonable-C”.

  18. What does #11 mean (“A data race results in unspecified behavior.”)? How is “unspecified behavior” any different from “undefined behavior”? If the behavior is unspecified, it could still be “nasal demons” or “order a burrito”, as far as I can see.

    Did you intend to require something more like “A data race results in the storage unit on which the race occurs containing an indeterminate value”, perhaps?

  19. Richard, yes, what you said is better, but I suspect a lot more refinement is required for that one and for the pointer behaviors.

  20. C# and Java define over-shifting to be equivalent to (x >> (amount & 31)). If the compiler can prove that ((amount & 31) == amount) there won’t even be a performance difference. For constant shift amounts (common) there is no difference.

    Even C# has unspecified behavior. (int.MinValue / -1) is implementation defined (with certain choices to pick).

  21. Oh my but this would make a lot of embedded developers happy. I consult in the embedded industry and find that a large fraction of embedded developers think that this is already how C works. Type-based aliasing rules are probably the biggest source of bugs there; embedded programmers love to cast values willy-nilly and get surprised when it bites them.

    Note that it isn’t even necessary for compilers to start exploiting new undefined parts of the standard to break code. I saw someone turn on intermodule optimizations, and the compiler started optimizing away on hundreds of aliasing bugs that were previously invisible to the optimizer. They ended up disabling the strict aliasing rules after I explained what they were (nobody at that shop had ever even heard of them).

  22. Myria, I would say something like “an invalid pointer can be assumed to never alias with anything”. Though I can understand it’s not ideal, since you can have a function

    void foo(int * a, float * b)

    and ideally you’d want the compiler to assume a and b don’t alias because seriously why would they ever, but assuming a and b can point wherever, then they can alias.

    Then again, we have the restrict keyword and I would think that assuming a and b alias won’t lead to overly slow code.

    This is basically the problem with cache invalidation.

  23. tobi: Even “unspecified behavior” is better than “undefined behavior”. “Unspecified” results means that the value must be *something*, or in the case of INT_MIN / -1, could throw a trap. These are defined behaviors, even if they vary among implementations, or could vary depending upon the conditions in which they’re encountered in the code. The point is that they have effects that can be known and accounted for by the systems engineer, whereas current compilers are essentially unpredictable with their undefined behavior.

    For example, by defining division by zero or signed division overflow (INT_MIN / -1) as unspecified, we know that either we’ll simply get a garbage answer, or the program will trigger SIGFPE / STATUS_INTEGER_DIVIDE_BY_ZERO / STATUS_INTEGER_OVERFLOW (POSIX/NT case 1/NT case 2). We wouldn’t have to worry about the compiler pruning off a chunk of code that it figured it could optimize out because it could prove that it would only ever execute if the divisor were zero. If we have a SIGFPE signal handler in place to recover from divide by zero and resume, the compiler would have just broken our code. This kind of thing happens in low-level code.

  24. This is very cool. A friendly C compiler is something that I would be very interested in using. Additionally, I hope that the existence of a friendly C compiler would help to increase the dialogue amongst developers concerning undefined behavior.

  25. “Macro assembler C” seems more appropriate as a name than “friendly C”, I understand embedded devices people may want exactly that.

    Majority of proposals are harmless, but e.g. 10 is going to hurt big times at any attempt to meaningfully formally analyse/verify program behavior. C would probably be better off to go the opposite direction: move towards forbidding aliasing unless explicitly declared.

  26. What about specifying the function argument evaluation order as left to right, with sequence points between each pair? Do the same for operators (but let the associativity double as evaluation order, so right-to-left for assignment).

    The current rules are mainly a hold-over from the days of stack-based calling conventions, architectures favouring push/pop, and simplistic compilers. With modern compilers and ABIs, I doubt we would see much if any difference in performance. In particular, the as-if rule would still hold.

  27. Mattias, here we wanted to only deal with undefined behavior, but of course any friendly (or even sane) dialect of C would force a specific function evaluation order.

    Helge, it is absolutely not the case that friendly C will be harder to analyze/verify than regular C.

  28. As a TA for an undergraduate systems course, some of these, especially with regards to overflow and division, are some of the hardest concepts for students to grasp. Once they know about the wrapping behavior of 2’s complement they expect the same overflow to happen in C. Similar issues occur with shifting and division once they understand some of the hardware beneath. For those arguing against the creation of this standard, I argue that if not for industry then certainly for teaching this is necessary. With a friendlier standard, we can focus on systems concepts before delving into the intricacies of C. When students are ready to fall into C’s spike traps I often refer them to this blog’s posts on undefined behaviors and their dangers. But there should be some time of learning and enjoyment before that happens.

  29. In a past life I worked on implementing C for some oddball processors, for which the traditional rule of ‘you get what the hardware gives you’ was in effect. I like this proposal, with a few nits.

    For signed integer overflow, I suggest mandating two’s-complement wrapping only for intN_t types, which are already guaranteed to have a two’s-complement representation if they exist (§7.20.1.1). For ‘familiar’ environments there is no practical difference. For other signed integers, overflow should produce an implementation-defined value (stronger than unspecified; I have in mind DSPs with saturating arithmetic) or generate an implementation-defined trap. Similarly for signed shifts.

    I agree with Myria on mempcy(). Maximally fast copy is too important to disappear, so making memcpy() equivalent to memmove() will simply cause it to reappear under a new name.

  30. The “arbitrary types may alias” bit definitely hurts. Not in a theoretical sense, but in a very practical sense.

    To reason about a program you usually reason about invariants of “types” and why they hold, but note that these “semantic types” need not have a simple correspondence to “syntactic types” in the program. What adhering to strict aliasing rules gains is that syntactic and semantic types are considerably better aligned, while giving up on strict aliasing allows to entirely severe the connection between the two.

    In a theoretical sense the difference does not matter, in a practical sense the code produced by developers under the different regimes differ a lot. Don’t forget the social aspect of what (dis)incentive language rules provide to developers, and this has long-term impact on analyzability.

  31. I also agree with Myria on memcpy():
    > memcpy and memmove ought to remain distinct, but I think that an overlapping memcpy ought to result in unspecified garbage being written rather than undefined behavior.

    My experience is with Linux kernel code, where memcpy and memmove have very different runtime behaviors (falling within Myria’s specification).

    I have questions on how you’d actually specify this:

    > 11. A data race results in unspecified behavior

    At PLDI’08, Hans Boehm and Sarita Adve explained (in “Foundations of the C++ Concurrency Memory Model”, http://dl.acm.org/citation.cfm?id=1375591) that the optimizer behavior you describe can cause wild branches (if you have a data race on a value which is used to index a jump table, and bounds checking looks safe to drop). See their Sec. 5.
    They say “It seems difficult to explain either behavior in a language standard in any way other than to leave it completely undefined.”

  32. John, the lack of sequence points between evaluation of arguments and operands causes undefined behaviour if an object is written more than once or both read and written.

  33. For the assignment operators, if the assigned value is further used, C11 allows the compiler to either read back the value from the left hand side or use the value written to the LHS.

    That is, for

    a = b = c;

    the compiler can either do:

    b = c;
    a = b;

    or

    tmp = c;
    b = tmp;
    a = tmp;

    The former version (reading back) causes major issues with volatile objects and if the compiler chooses to read back the value (which, for example gcc does), then it can lead to situations where the generated code is not just counterintuitive and unpredictable, but actually not compliant with the standard itself (and apparently gcc is happy with that).

    That self-contradictory behaviour can be resolved by mandating that the value of an assignment expression is the value that gets assigned to the LHS, cast to the type of the LHS. That is, the compiler can not read back a volatile object to obtain the value (but it can read back a non-volatile object, if it likes).

  34. With respect to a friendly C++, instead of scouring the standard up front for any instances of behavior left undefined (a difficult task, to say the least), could this perhaps be done in an evolutionary manner in collaboration with compiler writers?

    I.e., we may not know all instances of UB that exist in the C++ standard, but individual compilers might know what UB they currently take advantage of: so, in the first instance, define friendly C++ by specifying less-undefined behaviors for these. Then, as the compiler discovers / takes advantage of more UB in later versions over time, it would also at the same time update its friendly C++ dialect accordingly, with better-defined behaviors for those as well.

    Or is this not how it works, and compilers can be taking advantage of UB without even knowing about it?

  35. Hi John,

    It would be so wonderful if Friendly C supported comparing signed and unsigned integer types and gave unsurprising results!

    This is actually quite easy to do. E.g.

    signed == unsigned

    can be desugared to:

    signed >= 0 && signed == unsigned

    signed < unsigned

    can be desugared to

    signed < 0 || signed < unsigned

    and so on…

    I ran into this problem when introducing unsigned types in Virgil and this is how I solved it.

  36. I would love it if this proposal gains traction. Every so often I get so disgusted with C in trying to do my cross-platform embedded work with a modicum of correctness that I’m compelled to search the internet for an alternative. During one such search a year ago I came across Dr. Regehr’s most excellent blog. Most of the objections that I read above are addressed therein and I suggest people read and understand it before blurting.

  37. @Mattias Engdegård, C11 and C++11 already eliminated sequence points in a way that fixes that issue. f(++i, ++i) uses unspecified order, but does not produce undefined behavior.

  38. @Seth, Really? I think the standard means something serious happens when using the word unspecified. In this particular example, I think there is a data race.

  39. glaebhoerl: It’s not really feasible to remove all undefined behaviour without requiring rigorous runtime checking (which John specifically rules out in the goals), because if you do things like write through wild pointers you could potentially overwrite control structures or the program text itself.

  40. Ben L. Titzer: I believe an unstated goal of this is that the “friendly C” language be a strict superset of existing standard C – that is, any existing conforming C program would continue to be conforming and have the same semantics under “friendly C”, although the reverse might not be true. Your suggested change to the comparison operators would break this.

  41. @kme, I totally agree. I think this is very important, and the article talks about UB and optimization. This is a dialect of C, not a new language — it should be more or less the “classic C”. It should remain lightweight and it can’t be type safe. It must change semantics of conforming programs — if it does any of these, then it may not be called C at all.

  42. Seth, thank you – I based my understanding on C99. Does it apply to operators as well, such as a++ * a++? (I still believe it would be beneficial to get rid of the order unspecifity, although the “friendly C” manifest here is all about UB.)