Under what circumstances is it acceptable for a programming language to admit undefined behaviors? Here I mean undefined behavior in the C/C++ sense where, for example, “anything can happen” when you use an uninitialized variable. In my opinion, five conditions need to be fulfilled.
First, the undefined behavior must provide a significant, robust performance win. By “significant” I mean well into double-digit percent speedup and by “robust” I mean that a lot of programs should receive the benefit. The undefinedness of out-of-bounds array accesses in C/C++ clearly satisfies this criterion. The undefinedness of signed overflow does not (I think) because you’ll need to look pretty hard before you can find an important application that gets double-digit slowdown if you compile it using GCC or Clang’s -fwrapv flag. People I trust have told me that these programs do exist, but I doubt that most of us use them on a daily basis.
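For concreteness, here is the kind of loop (a made-up sketch, not drawn from any real application) where the no-overflow assumption can pay off: because i * stride is assumed never to wrap, the compiler may strength-reduce the indexing to a simple 64-bit pointer increment, whereas under -fwrapv the product must wrap modulo 2^32, which blocks that transformation.

void scale(float *a, int n, int stride)
{
  /* without -fwrapv, i * stride is assumed never to overflow */
  for (int i = 0; i < n; i++)
    a[i * stride] *= 2.0f;
}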
Second, behavior should only be undefined as a last resort—when it cannot be reasonably phrased as an unspecified or implementation-defined behavior. In C/C++, an unspecified behavior is one where the implementation chooses from a set of well-defined behaviors, but there is no requirement that the choice is documented or consistent. An implementation-defined behavior is one that must be documented and consistent. As a random example, in C99 a program’s behavior is undefined if “the specification of a function type includes any type qualifiers.” Making this implementation-defined would not seem to lose anything. Many of C99’s 191 undefined behaviors have a similar character.
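For the curious, about the only way to even write such a type is through a typedef; a contrived example (mine, not the standard’s):

typedef int func(void);
const func f;  /* undefined behavior per C99 6.7.3: a qualified function type */

An implementation that simply ignored or rejected the qualifier would lose nothing, which is the point.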
Third, undefined behavior is only acceptable when a sound, complete, efficient, and widely available dynamic checker exists (of course static checkers are good too). Preferably these checkers should be accessible using commonly-available compiler flags. For example, gcc -Wall serves as a reliable static checker for many undefined behaviors in C99 (but not for any of the worst ones, of course). To meet this criterion, C/C++ would have to be modified to make memory safety easier to enforce, particularly in the presence of separate compilation. IOC is dynamically sound and complete with respect to some of C/C++’s integer undefined behaviors. More such tools are needed.
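As a tiny example of the sort of thing a static checker can catch, recent GCC and Clang report the uninitialized read below under -Wall (GCC may additionally need optimization turned on to see it):

int f(void)
{
  int x;
  return x + 1;  /* undefined: x is read before it is ever assigned */
}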
Fourth, the undefined behavior needs to be comprehensible and local. By “local” I mean that we need to be able to blame a very small amount of code for any bug relating to undefined behavior. Most of C/C++’s undefined behaviors (even the ones related to integer overflow and array accesses) are perfectly understandable and local. On the other hand, C99’s strict aliasing rules are an abysmal failure in this regard—as far as I can tell, not even compiler implementors fully understand how these interact with rules for accessing union fields. The undefinedness of non-terminating loops in C++11 is another nasty failure to meet this criterion because the undefined behavior is a property of all code that is reachable from the loop.
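To illustrate the murkiness (a sketch, not a definitive reading of the standard): both functions below reinterpret a float’s bytes, but the second plainly violates the aliasing rules, while the union version is exactly where implementations and the standard’s wording have not always agreed.

union pun { float f; unsigned u; };  /* assumes float and unsigned have the same size */

unsigned bits_via_union(float x)
{
  union pun p;
  p.f = x;
  return p.u;              /* GCC documents this as OK; the standard is murkier */
}

unsigned bits_via_cast(float x)
{
  return *(unsigned *)&x;  /* undefined: violates the strict aliasing rules */
}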
Finally, the undefined behavior should only punish programs that engage in risky behavior. For example, making data races undefined is OK if we consider multithreading to be risky (as I do). On the other hand, undefined integer overflows are not OK because few people would consider i++ to be a risky behavior. The best way to satisfy this criterion is by making the behavior defined by default, but then providing a greppable loophole. For example, the reinterpret_cast in C++ seems like a nice move in this direction. A special no-overflow qualifier for induction variables would probably give all of the performance benefits that we currently get from undefined signed overflows without any of the global punishment.
One might interpret these guidelines as saying “undefined behavior is never OK.” However, that is not what I intend. Niche languages for low-level, performance-critical applications—embedded systems and OS kernels, mainly—will have undefined behaviors for the foreseeable future and I think that is fine. On the other hand, the current situation in C and C++ is not fine and we are paying for it every day.
19 responses to “When is Undefined Behavior OK?”
Wouldn’t -ftrapv be a much more crucial test than -fwrapv? I would expect the latter to be nearly free on two’s-complement architectures (though quite costly on non-two’s-complement architectures, but those appear to be extremely rare nowadays). I would be quite surprised if -ftrapv DIDN’T have a major cost in most programs.
I’d take issue with the 3rd point. I agree that it is HIGHLY desirable, but I don’t think it should be required. Strike “efficient” and you might talk me into it, but it would still be an uphill battle.
To pick a concrete example: is it possible to create such a checker for data races? Can you take an arbitrary bit of threaded code that runs to termination in a fixed interval and is fed fixed inputs, and prove that it will never encounter a data race?
(I can’t find the exact post but this guy http://bartoszmilewski.com/ is (or was) working on some stuff that’s supposed to help find race bugs but, IIRC, doesn’t claim to be a strong detector.)
Hi bcs, I don’t really believe in static race detection; I’ll settle for good dynamic tools. For dynamic race detection, I’d hope to get soundness and completeness at about a 2x slowdown. Same for dynamic memory safety.
Hi Matthias, I used -fwrapv as the example since it should be cheap and it eliminates that one kind of undefined behavior.
Regarding the expense of -ftrapv, not too long ago I was talking to Nickolai Zeldovich (one of the authors of the MIT undefined behavior paper mentioned in my previous post) and he mentioned that they have an implementation of integer NaNs that is very efficient. This isn’t quite the same thing as -ftrapv, but it seems to mean that if you don’t mind deferring the traps for a while, you can get similar semantics cheaply.
@bcs, @regehr –
Regarding static analysis & data races – have either of you used Gimpel’s Flexelint (or PC-Lint) v.9? It has a feature that helps identify potential data races for shared variables in multi-threaded code.
See the 2nd item under http://www.gimpel.com/html/lintdev.htm
All the standard disclaimers apply. Proper configuration / setup is crucial (e.g. identifying the “head” of a thread, identifying mutex lock/unlock routines, etc). There are some false positives that the tool can’t possibly rule out (IMO better than the opposite, hiding real problems).
I could write a lot more about it, especially from the perspective of an embedded systems developer who deals with things like data shared between interrupts & threads, nested interrupts, mutexes vs. interrupt disabling, etc. But the point is that the tool really does help identify unsafe read & write accesses to shared data, and I’m all for anything that helps automate the identification of such problems.
(BTW, I have no affiliation with the company or the product, just a long-time satisfied user.)
While this might have been true once, it certainly isn’t true any more. The cost of accessing memory — even on access patterns designed to maximize the cost — is so high that bounds-checking is lost in the noise.
Here’s a little program, designed to maximize the cost of bounds-checking.
http://pastebin.com/HhMVcx5L. It allocates an array of 400 million entries, and then walks sequentially over the array, once with bounds checking and once without. (It also does a pass before each run, to smooth out the cache a bit.) I compiled it using the command line gcc -O0 -std=c99 blah.c.
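In outline, the benchmark looks roughly like this (a reconstruction from the description; the pastebin has the exact code):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define SIZE 400000000

int main(void)
{
  int *a = malloc(SIZE * sizeof *a);
  if (!a) return 1;

  for (int i = 0; i < SIZE; i++)  /* warm-up pass */
    a[i] = i;

  clock_t t0 = clock();
  long checked = 0;
  for (int i = 0; i < SIZE; i++) {
    if (i < SIZE)                 /* the bounds check */
      checked += a[i];
  }
  clock_t t1 = clock();
  long unchecked = 0;
  for (int i = 0; i < SIZE; i++)
    unchecked += a[i];
  clock_t t2 = clock();

  printf("checked %.2fs, unchecked %.2fs\n",
         (double)(t1 - t0) / CLOCKS_PER_SEC,
         (double)(t2 - t1) / CLOCKS_PER_SEC);
  free(a);
  return 0;
}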
Note that:
1. This is *without* any optimizations, so this is even before you start doing dataflow optimizations to eliminate unneeded checks.
2. Streaming access to an array is the most cache-favorable pattern, so this program should *maximize* the relative costs of bounds checks versus memory accesses.
The measured difference? 1.03 seconds (with bounds-checking) versus 1.02 seconds (without), for a program designed to maximize the cost. That’s less than 1%.
Hi Dan, I used PC-Lint many years ago and loved it, though it did not yet have race detection then.
I agree that these tools are great. What I meant to say in the post is that for large, pointer-heavy, thread-heavy applications (Firefox, for example) I don’t see static analysis as fundamentally making it easier to create correct concurrent code. Sure, it’ll find bugs, but it’s not going to find the hardest ones—and these are what suck up our time as developers.
Hi Neel.
What bounds checking tool are you using here?
Although I’d be happy to be proven wrong, I don’t believe for a second that your 1% result will scale to real applications.
A bit more on this topic here:
http://blog.regehr.org/archives/715
I didn’t use a specific tool. I just wrote a bit of C code that walks over an array in a for-loop twice. Just click on the link, and you can see the code I used. One pass does a bounds-check in the loop, and the other doesn’t. Basically, I wrote a microbenchmark designed to maximize the cost of performing bounds check computations.
As an aside, this measurement does scale to whole languages. The Ocaml compiler has a flag for turning off bounds checks, and the performance differences you see for full applications are basically in line with the microbenchmark I posted.
The code in the post you link to isn’t slow because of bounds checking. It’s slow because retrofitting dynamic bounds checks onto C, specifically, requires adding huge amounts of runtime metadata. You can’t do bounds checking on array accesses in C without fat pointers, because pointers and arrays are identified with one another in the language. Tripling the size of every single pointer in your program (so that you always have a base+bound available in addition to the address) is not cheap — in fact, I was actually surprised at how low the measured costs in your blog post were!
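For readers who haven’t run into the term, a fat pointer carries the bounds of its referent along with the address; a minimal sketch (no particular tool’s actual representation):

#include <stdlib.h>

typedef struct {
  char *addr;   /* the pointer value itself */
  char *base;   /* start of the underlying object */
  char *bound;  /* one past the end of the underlying object */
} fat_ptr;

char fat_load(fat_ptr p)
{
  if (p.addr < p.base || p.addr >= p.bound)
    abort();    /* out-of-bounds access caught at run time */
  return *p.addr;
}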
If you’re designing a new language — even a new systems language — you really can regard bounds checking as free.
(As a meta-comment, I liked your response. I basically always assume the hypothetical in my conclusion — “if you’re designing a new language” — since I’m a PL guy who likes writing compilers. It was nice to be reminded that not everyone makes the same automatic assumptions!)
Hi Neel, it sounds like we are in agreement here after all :).
Indeed, retrofitting memory safety into C/C++ is a depressing engineering problem. I’m only interested in it because it may be the most economical way forward, from the spot we are currently in (where we have these big C/C++ codes that nobody wants to rewrite).
Did you ever look at Deputy? This is my all-time favorite bounds checking tool for C because it leverages programmer knowledge in a clever way in order to avoid the fat pointer problem. But it was a few releases short of being really usable at the time that its author graduated and moved on to other things.
I don’t think bounds checking is expensive. It certainly is not in Java and modern JVMs. Most programs are not array-intensive enough for bounds checking to account for a significant fraction of time. For those that are, even the most basic bounds-check elimination techniques within loops get rid of most of the overhead. E.g. I did pretty extensive benchmarking on the Virgil compiler, which implements bounds-check elimination only for trivially counted loops; I have trouble finding or creating a benchmark where more than 2% of execution time can be attributed to the (remaining) bounds checks (measured by the execution-time differences among the unoptimized, checks-optimized-in-loops, and checks-disabled configurations). Only the most array-intensive, super-tight loops have measurable overheads. Even then, more advanced loop optimization techniques can further eliminate these overheads, including loop versioning, fission, peeling, and rotation.
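To make the trivially-counted case concrete, here is a sketch in C (not Virgil) with the implicit check spelled out; the per-iteration test can essentially be replaced by a single comparison of n against len outside the loop:

#include <stdlib.h>

int sum(const int *a, int len, int n)
{
  int s = 0;
  for (int i = 0; i < n; i++) {
    if (i < 0 || i >= len)  /* what a safe language checks on every access */
      abort();
    s += a[i];
  }
  return s;
}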
Of course there are a few numerical programs like this, but in my experience the vast majority of programs these days spend their time navigating tangles of object pointers and calling through hierarchies of forwarder methods, copying data from one format to another, and no longer in hot array loops.
Bounds check elimination is basically a solved problem for safe languages. It just requires compiler smarts that the newest languages haven’t gotten around to employing. If you have some heavy numerical floating point problem that needs mega optimization, then write its kernel in FORTRAN, or CUDA, and be done with it.
And if we are talking about bounds checking pointers, I say forget it; pointers should be confined to a small unsafe runtime system and the rest of the application code should be in a safe language (or safe subset), after the fashion of C# and Modula-3 before it. Even Ocaml and Haskell have facilities for these purposes.
Hi Ben, as you imply the actual language design space here is huge. Of course bounds checking is just a small part of the overall safety problem. I’m not yet convinced that D or Virgil or Go or whatever is a reasonable replacement for C/C++ for hefty system-level implementation tasks.
Argument evaluation order used to be undefined (technically unspecified, but bear with me) for good performance reasons, but those reasons have largely disappeared and the benefit is probably barely detectable today. It may be a low-hanging fruit for those who want to give C more predictable semantics.
Research proposition: alter GCC or Clang to force left-to-right or right-to-left evaluation and measure. As always, the compiler is free to do things in any order as long as it does not make a difference to the outcome.
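A minimal illustration of the latitude compilers currently have (the program may print “1 2” or “2 1”, depending on which order the compiler picks):

#include <stdio.h>

static int counter = 0;
static int next(void) { return ++counter; }

int main(void)
{
  /* the evaluation order of printf's arguments is unspecified */
  printf("%d %d\n", next(), next());
  return 0;
}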
Hi, Neel – one comment, though: Your case is *not* the worst case for bounds checking. With one array, the base+bound information is going to be in L1 every time you want it. A much worse case would be a set of 1 million arrays each a bit larger than 64 bytes in size (a cache line), which you access at random. Bounds checking in this case would require two cache misses per access instead of one.
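Concretely, the access pattern would be something like this sketch (sizes illustrative, and assuming the length lives in a header word at the start of each array, so the check and the element usually land in different cache lines):

#include <stdlib.h>

#define NARRAYS   1000000
#define NACCESSES 100000000L

/* a header word plus enough elements to spill past one 64-byte line */
struct arr { int len; int elems[24]; };

int main(void)
{
  struct arr **tbl = malloc(NARRAYS * sizeof *tbl);
  if (!tbl) return 1;
  for (int i = 0; i < NARRAYS; i++) {
    tbl[i] = calloc(1, sizeof **tbl);
    if (!tbl[i]) return 1;
    tbl[i]->len = 24;
  }

  long sum = 0;
  for (long k = 0; k < NACCESSES; k++) {
    struct arr *a = tbl[rand() % NARRAYS];  /* assumes RAND_MAX >= NARRAYS */
    int j = 23;                  /* an element far from the header */
    if (j >= a->len) abort();    /* the check touches the header's cache line */
    sum += a->elems[j];          /* the access touches a second line */
  }
  return (int)(sum & 1);         /* keep sum live */
}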
Hi Dave, you’re right!
I wonder why this doesn’t show up in measurements of Ocaml code, though. Offhand, I would guess that arrays in Ocaml are mostly used for numerics, so there aren’t many arrays of arrays of arrays and so on.
This suggests something like HarfBuzz might be a good candidate for measuring the true worst-case impact, since IIRC it does text shaping with lots of nested lookup tables.
The tool Bartosz Milewski is involved in is a dynamic tool. It snapshots the system state and then, via some VM-based tooling, runs a threaded program through different interleavings and (IIRC) attempts to detect differing end results.
Neel: The reason you saw almost no performance hit is because the code is *far* from being “worst case”. Let’s look at your “checked” version:
for (int i = 0; i < SIZE; i++) {
    if (i < SIZE) {
Any good compiler can see that the if is totally redundant (it will always be true while the for loop is still executing) and eliminate the check. If you want it to be “worst case”, make a global static array, pass i to a function that is *not* inlined, and access the array through the passed value, so that the test actually has to be done.
@Valdis: I turned off optimizations precisely to ensure that this check was not being eliminated.
In fact, most decent compilers won’t just eliminate the bounds check — they’ll delete all the code in my little benchmark, since none of the loops make writes to variables live outside the loop!
@Neel Krishnaswami: also each time you use a different array, you have to load its bound and so I agree with Dave Andersen, the issue is when you use several arrays, especially on register starved architecture like x86.
To answer the original question: uninitialised variables should have performance advantages, especially for big structures. Now, 10% improvement or more? Probably only on ‘toy’ benchmarks.