The Problem with Friendly C


I’ll assume you’re familiar with the Proposal for Friendly C and perhaps also Dan Bernstein’s recent call for a Boring C compiler. Both proposals are reactions to creeping exploitation of undefined behaviors as C/C++ compilers get better optimizers. In contrast, we want old code to just keep working, with latent bugs remaining latent.

After publishing the Friendly C Proposal, I spent some time discussing its design with people, and eventually I came to the depressing conclusion that there’s no way to get a group of C experts (even if they are knowledgeable, intelligent, and otherwise reasonable) to agree on the Friendly C dialect. There are just too many variations, each with its own set of performance tradeoffs, for consensus to be possible. To get a taste of this, notice that in the comments for the Friendly C post, several people disagree with what I would consider an extremely non-controversial design choice for Friendly C: memcpy() should have memmove() semantics. Another example is what should be done when a 32-bit integer is shifted by 32 places (this is undefined behavior in C and C++). Stephen Canon pointed out on Twitter that there are many programs typically compiled for ARM that would fail if this produced something besides 0, and there are also many programs typically compiled for x86 that would fail if this evaluated to something other than the original value. So what is the friendly thing to do here? I’m not even sure; perhaps the baseline should be “do what gcc 3 would do on that target platform.” The situation gets worse when we start talking about a friendly semantics for races and OOB array accesses.
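
To make the shift disagreement concrete, here is a minimal sketch in C of the two candidate semantics for a 32-bit left shift. The function names `shl_zero_fill` and `shl_masked` are hypothetical, chosen only for illustration; neither comes from the Friendly C proposal. Both are well defined for every shift count, yet they disagree on a shift by 32, which is exactly the choice a single dialect would have to make.

```c
#include <stdint.h>

/* Hypothetical candidate A: any count of 32 or more yields 0.
   This is the result the ARM-targeted programs mentioned above
   expect from a shift by 32. */
static uint32_t shl_zero_fill(uint32_t x, unsigned n) {
    return n >= 32u ? 0u : x << n;
}

/* Hypothetical candidate B: only the low five bits of the count
   are used, so a shift by 32 returns x unchanged. This is the
   result the x86-targeted programs mentioned above expect. */
static uint32_t shl_masked(uint32_t x, unsigned n) {
    return x << (n & 31u);
}
```

With `x = 0x12345678`, `shl_zero_fill(x, 32)` gives 0 while `shl_masked(x, 32)` gives `0x12345678`; for counts below 32 the two functions agree.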

I don’t even want to entertain the idea of a big family of Friendly C dialects, each making some niche audience happy; that is not really an improvement over our current situation.

Luckily there’s an easy way forward, which is to skip the step where we try to get consensus. Rather, an influential group such as the Android team could create a friendly C dialect and use it to build the C code (or at least the security-sensitive C code) in their project. My guess is that if they did a good job choosing the dialect, others would start to use it, and at some point it becomes important enough that the broader compiler community can start to help figure out how to better optimize Friendly C without breaking its guarantees, and maybe eventually the thing even gets standardized. There’s precedent for organizations providing friendly semantics; Microsoft, for example, provides stronger-than-specified semantics for volatile variables by default on platforms other than ARM.

Since we published the Friendly C proposal, people have been asking me how it’s going. This post is a long-winded way of saying that I lost faith in my ability to push the work forward. However, I still think it’s a great idea and that there are people besides me who can make it happen.


14 responses to “The Problem with Friendly C”

  1. Especially in security, people learn from reverse engineering, then learn how to write C to produce their assembly. It’s common for any programmer to learn by example instead of by reading a spec. The fallacy is that even if you know C and you know assembly perfectly well, you still don’t know how one will translate into the other. I think it’s that gap which is behind the complaints about undefined behaviour specifically, rather than about logic bugs in general.

    I don’t think the problems you’ve encountered are about your personal ability. Which behaviours are defined and undefined have been considered thoroughly by the language committee and well informed by users and implementors and hardware vendors. It’s unlikely there’s better to be done.

    If what you really want is less undefined behaviour, there’s always Java.

  2. At this point I think we can conclude that the language committee is the problem. Is there really any need for a simple cast from float to int to be able to corrupt memory? Would anyone who really cares about writing correct code, or about creating a language that people can use to write correct code, want to make memory corruption an acceptable outcome of a float to int conversion? (Or in the 190 or whatever other undefined behavior cases?)
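
For readers unsure why a float to int conversion is on the list at all: in C, converting a floating-point value to an integer type is undefined when the truncated value does not fit in the target type. The following is a minimal sketch of a guarded conversion. The name `double_to_int_checked` is hypothetical, and the bounds check assumes a 32-bit `int`, so that both limits are exactly representable as `double`.

```c
#include <limits.h>
#include <stdbool.h>

/* Hypothetical helper: stores the truncated value in *out and returns
   true when d converts to int without undefined behavior; returns
   false for NaN and for values whose truncation is out of range.
   Assumes int is 32 bits wide, so INT_MIN - 1.0 and INT_MAX + 1.0
   are exact in double. */
static bool double_to_int_checked(double d, int *out) {
    if (d != d) {                        /* NaN compares unequal to itself */
        return false;
    }
    if (d <= (double)INT_MIN - 1.0 || d >= (double)INT_MAX + 1.0) {
        return false;                    /* truncated value would not fit */
    }
    *out = (int)d;                       /* in range: truncates toward zero */
    return true;
}
```

A `float` argument converts to `double` exactly, so the same helper covers the float case raised in the comment.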

  3. Nick, this isn’t about my preferences, this is about the huge mountains of C code that we can’t (yet) stop using.

    Sometime during the 2000-2010 period the programming world seemed to largely stop using C for projects where it was a bad idea; that was really a great thing to see.

  4. What I would want is for the compiler to just be able to emit a warning for all possible undefined or compiler-defined behaviour it encounters. This can be a lot (if you shift left by n and the compiler cannot prove that n is less than 32, it will have to warn, etc. etc. etc.), but the user should then be able to turn off the warning or have it be treated as an error, as they want. Alternatively, only let it warn when it uses undefined behaviour to optimize something (this can still be a lot though).

  5. Gerben: the compiler can’t possibly detect all undefined behaviours. For example, ‘a << b’ is defined or not depending on the runtime value of b.

    On the other hand, it's already possible to instrument your code in order to detect at least some of these undefined behaviours using Clang's -fsanitize=undefined (which has also been ported to GCC I think), but this is not a complete solution.
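
A minimal sketch of the point about runtime values: nothing in the function below tells the compiler whether the shift is defined, because that depends on the arguments each caller passes. The function name `shift_left` and the file name `shift.c` used in the note that follows are hypothetical.

```c
/* Whether this shift is defined depends entirely on runtime values:
   b must be non-negative and less than the width of int, a must be
   non-negative, and the result must fit in int. None of that is
   visible from this function alone. */
int shift_left(int a, int b) {
    return a << b;
}
```

Built with something like `clang -fsanitize=undefined shift.c`, a call such as `shift_left(1, 32)` should be reported at run time instead of silently continuing, while `shift_left(1, 3)` is well defined and returns 8. The check only fires on paths that actually execute, which is why this is not a complete solution.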

  6. “perhaps the baseline should be ‘do what gcc 3 would do on that target platform.’”

    Using GCC as the baseline for anything should be a criminal offense, enough to put you behind bars for the next 30 years, pondering what you did wrong.

  7. @Manuel

    I think that the sentence “if you shift left by n and the compiler cannot prove that n is less than 32, it will have to warn, etc.” in Gerben’s proposal makes it clear that you are not adding information.

    @Gerben

    This exists; it is called “a sound static analyzer for undefined behavior”. It’s in a separate static analyzer instead of a compiler because in order to produce useful results (i.e. other than listing every dangerous operation in the program, which is not only long but, like Manuel’s comment, doesn’t add information) it needs to have very different constraints. In particular, whether a function invokes undefined behavior depends on the state other functions have left the variables in, so it’s impossible to have “separate analysis” like you have “separate compilation”: the analyzer needs to see the entire program in order to have a chance to produce useful results. Even so, lots of warnings tend to be emitted for programs that are correct for subtle reasons (for instance, accesses that could be out of bounds in zlib do not invoke UB only because an incremental Huffman tree is a correct Huffman tree at each step of the decompression. No general-purpose static analyzer can capture this sort of invariant, and most programmers are barely able to justify such reasoning rigorously either, even in their own code, hence the bugs in too-sophisticated-for-its-own-good code).

  8. Not sure how close or important the connection is, but I was reminded of the Markdown mess:

    http://blog.codinghorror.com/standard-markdown-is-now-common-markdown/

    It seems that the resolution of that situation was basically the 2nd to last paragraph of the post. Some important stakeholders (github, stackoverflow, etc (I think)) made a decision and went with it. Others are welcome to join or not.

    What’s kind of interesting is that Markdown is a dramatically less complicated language ecosystem than C.

  9. Hi John – I thought your interest in C was purely pragmatic re: the legacy code base you’ve mentioned now and again. Doesn’t a proposed dialect of C imply new code bases written in the dialect? Why would we want more C?

    If you’re still talking about fixing C in, oh, 2018, I’m going to be so bummed. At this point, I think the US government should be funding massive projects to develop new programming languages and operating systems as fulfillment of its national defense responsibilities. We’re being cleaned out by APTs.

    Relatedly, I’m very confused by why Microsoft is suddenly giving a detailed post-mortem on how they built a secure OS and safe C#, only to apparently abandon it. I’m surprised that they’re not afraid of Apple or someone else running away with their work: http://joeduffyblog.com/2015/12/19/safe-native-code/

  10. There seems to be some confusion here between “implementation defined” and “undefined” behavior.

    The original reason there were implementation defined sections in the “standard” was because people couldn’t/didn’t agree on a standard way to handle those points. So no surprise there.

    If there are places like “shift by 32” where the behavior is undefined, they should, where possible, be changed to implementation defined.

    You’ve discussed undefined behaviour before. For me, undefined behaviour, “if you use that construct the behaviour of our compiler is undefined”, is much worse than implementation defined behavior.

  11. I think a “Bastard C” compiler would be more educational than a “Friendly C” compiler — one that unceremoniously core-dumped on every encountered UB rather than blithely continuing on.

    Granted, performance would be lacking in spots but it would also be very educational. My experience suggests most devs wouldn’t know UB code if it jumped up and bit them on the nose. I suspect there is very little code out there that is _designed_ to invoke UB — much more common that it happens by accident.
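
As a rough sketch of what this trap-on-UB idea means for a single operation, the hypothetical helper below (the name `shl_or_die` is an assumption made for illustration) aborts the program instead of performing an out-of-range shift. A real compiler mode would have to insert checks like this for every operation that can be undefined, which is where the performance cost comes from.

```c
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: refuse to continue when the shift count would
   make a 32-bit shift undefined. abort() raises SIGABRT, which
   typically terminates the process and may leave a core dump. */
static uint32_t shl_or_die(uint32_t x, unsigned n) {
    if (n >= 32u) {
        fprintf(stderr, "shift count %u is out of range\n", n);
        abort();
    }
    return x << n;
}
```

In-range calls behave like an ordinary shift: `shl_or_die(3u, 1)` returns 6, while `shl_or_die(1u, 32)` terminates the program.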

  12. Mike M, LLVM’s ubsan is basically the mean C compiler. This is the right answer when we are willing to go fix all the bugs. Friendly C is the right answer for legacy code that we don’t want to touch at all, if possible.

    David G, you are right that we want to turn as much UB as possible into implementation-defined behavior.