Skip to content

Nibble Sort Programming Contest

The problem is to sort the 4-bit pieces of a 64-bit word with (unsigned) smaller values towards the small end of the word. The nibble sort of 0xbadbeef is 0xfeedbba000000000. The function you implement will perform this sorting operation on a buffer of 1024 64-bit integers:

void nibble_sort(unsigned long *buf); 

I’ll give a small prize to the submitter of the entry that performs this operation the fastest on my Core i7-4770. If the winner makes non-trivial use of SIMD, I’ll give a prize to that entry and also to the fastest non-SIMD entry. I’ll use GCC 4.9.2 to compile your C99 code, which may use SIMD intrinsics and also inline assembly if you choose. There’s no requirement for portability. The machine runs Ubuntu 14.04 in 64-bit mode. If you require any command line options other than “gcc -std=gnu99 -O3″ you need to tell me that. You may assume that the buffer starts at the start of a cache line. The buffer will be loaded with uniformly chosen random values. Submit code by emailing me. Submit entries before the end of January 2015.

Here’s a slow (but hopefully correct) reference implementation takes about 180 380 microseconds to do the job:

int read_nibble(unsigned long w, int i) {
  assert(i >= 0 && i < 16);
  unsigned long res = w >> (i * 4);
  return res & 0xf;

void write_nibble(unsigned long *w, int i, int v) {
  assert(i >= 0 && i < 16);
  unsigned long mask = 0xf;
  mask <<= (i * 4);
  *w &= ~mask;
  unsigned long prom = v;
  prom <<= (i * 4);
  *w |= prom;

unsigned long nibble_sort_word(unsigned long arg) {
  for (int i = 0; i < 16; ++i) {
    int min = i;
    for (int j = i+1; j < 16; ++j) {
      if (read_nibble(arg, j) < read_nibble(arg, min))
        min = j;
    if (min != i) {
      int tmp = read_nibble(arg, i);
      write_nibble(&arg, i, read_nibble(arg, min));
      write_nibble(&arg, min, tmp);
  return arg;

void nibble_sort(unsigned long *buf) {
  for (int i=0; i<1024; i++)
    buf[i] = nibble_sort_word(buf[i]);

Buying Into Open Source Security

If you were given the opportunity to spend USD 100 million over five years to maximally improve the security of open source software, what would you do? Let’s just assume that the money comes with adequate administrative staff to manage awards and contracts so you can focus on technical issues. A few ideas:

  • Bug bounties, skewed towards remotely exploitable bugs and towards ubiquitous infrastructure such as OpenSSL and the Linux kernel. To get rid of USD 100 M in five years, we’ll probably need to make the bounties very large by current standards, or else give out a lot of them.
  • Contracts for compatible rewrites of crufty-but-important software in safe languages.
  • Contracts for aggressive cleanup and refactoring of things like OpenSSL.
  • Contracts for convincing demonstrations of the security of existing codes, in cases where it seems clear that rewrites are undesirable or impractical. These demonstrations might include formal verification, high-coverage test suites, and thorough code inspections.
  • Research grants and seed funding towards technologies such as unikernels, static analyzers, fuzzers, homomorphic encryption, Qubes/Bromium-kinda things, etc. (just listing some of my favorites here).
  • Contracts and grants for high-performance, open-source hardware platforms.

This post is motivated by the fact that there seems to be some under-investment in security-critical open source components like Bash and OpenSSL. I’ll admit that it is possible (and worrying) that USD 100 M isn’t enough to make much of a dent in our current and upcoming problems.

Testing with Pictures

Testing code is fun and hard and looking at the problem in different ways is always good. Here’s a picture representing the behavior of a saturating subtraction operation, where the horizontal axes represent the inputs and the output is vertical:

And here are some of the functions handed in by my students in the fall:

The last one represents one of the most common failure modes: failing to account for the asymmetry where INT_MIN has no additive inverse. The thing that I like about these images is how glaringly obvious the bugs are.

Hey, who knew Gnuplot could do animated gifs now? I guess old dogs can learn new tricks.

(Note: I posted some of these a few years ago, but I like the pictures so much I wanted to do it again.)

Inversions in Computing

Some computer things change very slowly; for example, my newish desktop at home has a PS/2 port. Other things change rapidly: my 2010 iPad is kind of a stone-age relic now. This kind of differential progress creates some funny inversions. A couple of historical examples:

  • Apparently at one point in the 80s or 90s (this isn’t a firsthand story– I’d appreciate recollections or citations) the processor available in an Apple printer was so fast that people would offload numerical computations to their printers.
  • I spent the summer of 1997 working for Myricom. Using the then-current Pentium Pro machines, you could move data between two computers faster than you could do a local memcpy(). I’m pretty sure there was something wrong with the chipset for these processors, causing especially poor memcpy() performance, but I’ve lost the details.

What are the modern examples? A few come to mind:

Anyhow, I enjoy computing inversions since they challenge our assumptions.

Souper Results 2

The Souper superoptimizer has made some progress since my last post about it.

We wrote compiler drivers that usually reduce the problem of building a project with Souper to make CC=sclang CXX=sclang++. Souper now uses Redis to cache optimizations so that even if the initial build of a program using Souper is slow, subsequent builds will be pretty fast. We fixed several problems that were preventing Souper from building largish programs like LLVM and GCC. This works now and, as far as we know, Souper can be used to optimize arbitrary LLVM code.

Souper now understands the ctpop, ctlz, cttz, and bswap intrinsics. It no longer generates only i1 values, but rather synthesizes constant values of any width. Constant synthesis is not fast and it requires a solver with good support for quantifiers, currently only Z3 (synthesizing constants without quantifiers isn’t hard, we just haven’t implemented that yet). Here’s a list of constants synthesized while building LLVM with Souper. The left side of each line is the number of times the constant on the right side was synthesized. i1 constants dominate but it’s fun to see, for example, that Souper was able to synthesize the 64-bit value 90112 four times. Where did that come from?

Souper has two main use cases. First, application developers can use Souper directly to optimize code they are compiling. Second, LLVM developers can use Souper to learn about optimizations missed by the existing optimization passes. We’re trying to make it useful to both of these audiences.

To make Souper more useful for compiler developers, we implemented a C-Reduce-like reducer for Souper optimizations. This is necessary because Souper extracts and attempts to optimize pieces of LLVM that are as large as possible, meaning that its optimizations often contain extraneous material. A reduced optimization has the handy invariant that no path condition, UB qualifier (nsw, nuw, exact), or leaf instruction can be removed without breaking the optimization. We did some cross-checking between Souper and Alive, as a sanity check on both tools. Additionally, we convert each Souper optimization back into LLVM and run it through opt -O3 in order to weed out any optimizations that LLVM already knows how to do. For example, Souper loves to prove that icmp eq %0, %0 can be simplified to 1. This is not useful.

While building LLVM, ~16,000 Souper optimizations fire. Some of these optimizations are duplicates (presumably due to inlining and header inclusion); ~7000 of them are distinct. After reduction there are ~4000 distinct optimizations and LLVM does not know how to perform ~1500 of them. Even 1500 optimizations is lot of work to look through and of course not all of them matter. To help figure out which optimizations matter, we implemented two kinds of optimization profiling. The first is static profiling, which counts the number of times an optimization is applied at compile time. Implementing optimizations with a high static profile count would tend to reduce the size of the compiler’s generated code. Second, we implemented dynamic profiling, which counts the number of times each optimized piece of code is executed. This is accomplished by instrumenting the compiled program so that it dumps dynamic profile information to a Redis server using an atexit() handler. Implementing optimizations with a high dynamic profile count would tend to decrease the runtime of generated code. Of course, all standard caveats about profile-driven optimization apply here. Also keep in mind that Souper is extremely specific while compilers are less so: there is a many-to-one relationship between optimizations discovered by Souper and optimizations you would implement in LLVM. Therefore, it may well be the case that there are collections of low-ranking Souper optimizations that would rank highly if considered as a group, and that could all be implemented by a single LLVM transformation. We’ve experimented a bit with trying to automatically aggregate similar Souper optimizations, but so far I haven’t been too happy with the results.

If we take a Souper-optimized LLVM and use it to build SPEC CPU 2006, this is the optimization with the highest dynamic profile count; it is executed ~286 million times:

%0:i64 = var
%1:i64 = and 15:i64, %0
%2:i1 = eq 0:i64, %1
pc %2 1:i1
%3:i64 = and 7:i64, %0
%4:i1 = eq 0:i64, %3
cand %4 1:i1

The first four lines tell us that the arbitrary 64-bit value %0 is known to have zeros in its four bottom bits. The last three lines tell us that — of course — %0 has zeros in its three bottom bits. LLVM doesn’t understand this yet, leading to a lot of unnecessary conditional jumps.

Here’s the collection of Souper optimizations that are discovered while building LLVM/Clang/Compiler-RT r222538:

The clang binary from a “Release” build with Souper is about 800 KB smaller than the clang built without Souper. Please let us know about any bugs in the output above, including missed optimizations (but don’t tell us about missing vector, FP, or memory optimizations, we know that those are not supported yet). In the course of this work Raimondas ran across a Z3 bug; luckily he caught it by cross-checking Souper’s results using a different solver, instead of having to debug the resulting miscompilation.

The main thing that Souper does not do, that you would expect a superoptimizer to do, is to synthesize sequences of instructions. Much of our work over the last six months has been building infrastructure to support instruction synthesis, and almost all of that is now in place. Synthesis is our next major piece of work.

In the meantime, Peter has run Souper over libgo. I would like to build something a bit bigger such as Chromium. If you have a recipe for that, please drop me a line. I got as far as noticing that Chromium builds its own LLVM at which point my allergy to build systems kicked in. Integrating Souper into a build of the Rust compiler might also produce interesting results; it should be as easy as starting Redis and making sure our opt pass gets loaded in the right places.

Souper is by Peter Collingbourne at Google, by my postdoc Raimondas Sasnauskas, by Yang Chen at nVidia, by my student Jubi Taneja, by Jeroen Ketema at Imperial College London, and by me.

Partial Evaluation and Immutable Servers

Although I haven’t figured out exactly what immutability means for a server (I’m probably just being picky) the general idea of rebuilding a system from spec rather than evolving it with one-off hacks is very appealing. Lately I’ve been thinking about what could be accomplished if the system compiler were able to take advantage of certain kinds of immutability. One kind of technique that would be enabled is partial evaluation. Let’s look at a simple example starting with an integer power function I found on the web:

long powi(long x, long n) {
  assert(n >= 0);
  long  p = x, r = 1;
  while (n > 0) {
    if (n % 2 == 1)
      r *= p;
    p *= p;
    n /= 2;
  return r;

This function compiles to 20+ instructions. On the other hand, the compiler is able to do considerably better for this special case:

long cubei(long x) {
  return powi(x, 3);

GCC’s output:

   movq   %rdi, %rax
   imulq  %rdi, %rax
   imulq  %rdi, %rax

Here the C compiler has partially evaluated powi() with respect to the constant second argument. The assert() is gone, the loop is gone, etc. This is a very simple example. At the other extreme, people like to say that if you partially evaluate an interpreter with respect to a particular input, you get a compiler. Think, for a minute, about what kind of partial evaluator we would need to have in order to specialize a C interpreter with respect to the powi() code in such a way that we could honestly say that we’ve compiled it. The tool that would support this job is not so easy to create.

Ok, back to immutable servers. What we are looking for is programs in our server image that process immutable or constrained inputs. For example we want to try to show that:

  • A daemon, say Redis, is always started using the same configuration file
  • For a pair of programs that communicate through a pipe, only a small subset of the full set of commands is ever sent
  • Only a subset of the OS kernel’s system calls are invoked
  • A program (bash, hopefully) is never invoked at all

Next, we partially evaluate the system with respect to these constant or bounded inputs. If we do this properly, we would expect quite a bit of code handling general cases would fall away, leaving only the specific code needed for our server. This is basically just a big global tree-shaking operation.

Why would we do this? There are two reasons to cut away code and data that we don’t need. First, it reduces unnecessary attack surfaces. Second, it makes the resulting images smaller and faster. We can ship them around more easily and they use less RAM while running.

Partial evaluation is a very old idea, and the idea of applying it to systems software is not new either. Here’s a good piece of work, and here’s another one that I haven’t read carefully, but that seems reasonable at first glance. Why have these approaches not taken the world by storm? My guess is that it’s just difficult to get good results. In many cases we’re going to be dealing with strings and pointers, and it is very common to run into insurmountable problems when trying to reason about the behavior of programs in the presence of strings and pointers. Consider, for example, a Python script that makes a string using stuff it found in a file, stuff it got over the network, and a few regular expressions. What does the string do when we exec() it?

On the other hand, in the last decade or so SAT/SMT/string solvers have become very powerful, as have symbolic execution techniques. The cloud has created use cases for partial evaluation that did not exist earlier. Security is a worse problem than ever. Compilers are much better. Perhaps it’s time to try again. It is clear that we can’t just point the partial evaluator at our Docker image and expect great things. We’ll need to help it understand what parts of the system are immutable and we’ll also need to incrementally refactor parts of the system to make them cooperate with the partial evaluator. Anyway, research isn’t supposed to be easy.

I’ll finish up by mentioning that there’s a different way to get the same benefits, which is to assemble a system out of a collection of components in such a way that you don’t need a brilliant compiler to eliminate code that you didn’t mean to include. Rather, you avoid including that code in the first place. This is more or less Mirage’s design point. Both approaches seem worth pursuing.

Inward vs. Outward Facing Research

One of the things I like to think about while watching research talks is whether the work faces inward or outward. Inward facing research is mostly concerned with itself. A paper that uses most of its length to prove a theorem would be an example, as would a paper about a new operating system that is mainly about the optimizations that permit the system to perform well. Outward facing research is less self-aware, it is more about how the piece of work fits into the world. For example, our mathematical paper could be made to face outwards by putting the proof into an appendix and instead discussing uses of the new result, or how it relates to previous work. The OS paper could demonstrate how users and applications will benefit from the new abstractions. Computer science tends to produce a mix of outward and inward facing research.

Next let’s turn to the question of whether a given paper or presentation should be inward or outward facing. This is subjective and contextual so we’ll do it using examples. First, the mathematical paper. If the proof is the central result and it gives us new insights into the problem, then of course all is as it should be. Similarly, if the operating system’s use case is obvious but the optimizations are not, and if performance is the critical concern, then again no problem. On the other hand, researchers have a tendency to face inward even when this is not justified. This is natural: we know more about our research’s internal workings than anyone else, we find it fascinating (or else we wouldn’t be doing it), we invent some new terminology and notation that we like and want to show off, etc. — in short, we get caught up in the internal issues that we spend most of our time thinking about. It becomes easy to lose track of which of these issues other people need to know about and which ones should have stayed in our research notebooks. Let’s say that we’re working on a new kind of higher-order term conflict analysis (just making this up, no offense to that community if it exists). One way to structure a paper about it would be to discuss the twists and turns we took while doing the work, to perform a detailed comparison of the five variants of the conflict analysis algorithm that we created, and to provide a proof that the analysis is sound. Alternatively, if the running time of the analysis isn’t actually that important, we could instead use some space demonstrating that a first-order analysis is wholly unsuitable for solving modern problems stemming from the big data revolution. Or, it might so happen that the analysis’s soundness is not the main concern, in which case we can use that space a better way.

I hope it is becoming clear that while some work is naturally inward facing and some outward facing, as researchers we can make choices about which direction our work faces. The point of this piece is that we should always at least consider making our work more outward facing. The cost would be that some of our inner research monologue never sees the light of day. The benefit is that perhaps we learn more about the world outside of our own work, helping others to understand its importance and helping ourselves choose more interesting and important problems to work on.

Fall in City Creek Canyon

I’ve lived in Utah for a while now, in three different houses, but always a short walk from City Creek Canyon. This drainage starts right at the edge of downtown SLC and goes 14 miles up into the Wasatch Range. A service road provides easy walking access all year, although the upper parts are not plowed in winter. In summer, bikes are permitted on odd days; on even days there is light car traffic. Bikes are allowed and cars forbidden every day in fall, winter, and spring (though sometimes there are vehicles going to and from the water treatment plant a few miles up the canyon). The lower part of the canyon is heavily walked on nice days, for example by worker bees from downtown on their lunch break. The upper canyon receives light usage and there are many miles of trails and off-trail routes in upper City Creek where you are much more likely to see an elk or a moose than a person. Several of my favorite local mountains, Dude Peak, Burro Peak, Grandview Peak, and Little Black Mountain all overlook the upper canyon. Here are a few pictures from a bike ride the other morning.

Fun with Shellshock

[I don’t seem to be getting blog entries written lately. The semester has turned out to be surprisingly busy and, also, I’m working on a few longer pieces that have ended up being harder to write than I’d hoped. Anyhow, the piece below isn’t the sort of thing I usually post, you can think of it as sort of a guest post. The context is the recent Bash bug which — unlike Heartbleed — completely failed to stir up a pile of “here’s how to find it using static analysis” posts, for reasons that Pascal explains very nicely.]

A3 Mitigation of Shellshock Vulnerability
Aaron Paulos, Brett Benyo, Partha Pal, Shane Clark, Rick Schantz (Raytheon BBN Technologies)
Eric Eide, Mike Hibler, David Johnson, John Regehr (University of Utah)

[Distribution Statement “A” (Approved for Public Release, Distribution Unlimited)]


The shellshock/Bash bug has been in the news a lot recently and it seemed like a great opportunity for us to test our A3 fully automated repair technology against a real zero-day attack. We found that the mandatory mediation policy enforced by A3 blocked the effect of the injected command attack. The policy violation triggered A3 to automatically explore and repair the underlying security hole. A3 took around 2 minutes to automatically find a repair using virtual machine introspection to insert a system call block, preventing a sys_clone call made by Bash, and an additional 1.5 minutes to find a source code repair in the Bash code. The A3 shellshock experiment is an example that illustrates the recent progress made by the survivability and resiliency research community to automate post-incident response management and to reduce the time to patch.


We have been developing the A3 (Advanced Adaptive Applications) Environment for the past four years as part of the DARPA Clean-slate design of Resilient, Adaptive, Secure Hosts (CRASH) program. A3 aims to make network facing services and applications resilient against zero-day attacks through the use of containerization, mandatory I/O mediation, execution introspection, and defensive adaptation. Recently, our focus has been on automatically reasoning about attack manifestations and dynamically producing new network filters, system call policies, and even source patches to mitigate the underlying vulnerability. A3’s adaptive experimentation utilizes record and replay, machine learning algorithms, and execution tracing across the OS and application boundaries.

For the shellshock experiment, we applied A3 to a simple app store web application built on a standard LAMP stack with a vulnerable Bash version. It took us a few hours to get the source, build environment, and regression tests for Bash 4.2 into the “laboratory” area of the A3 environment (i.e., a set up for in-situ and online testing of new security adaptations of the protected application). This was only necessary to generate a source code level repair; generating the system call block repair did not require any code, build environment, or regression tests.

Constructing an attack to exploit the vulnerability was trivial. We simply inserted an exploit that attempted to cat a “passwd” file into a GET request:

GET /appstore/index.php HTTP/1.1
User-Agent: () { :;}; /bin/cat /home/mitll/passwd > /tmp/hello.txt
Accept: */* 

This style of attack was chosen because it mimicked what hackers attempted during a capture the flag experiment. We launched the attack, and watched A3 work.

First, A3’s mandatory mediation blocked the attack because the attack was trying to access a directory that is not allowed by the mediation policy of the protected application. It is not guaranteed that all attacks will be stopped there, of course — mediation policies are not guaranteed to be perfect, and the attack may involve operations that are permitted but cause an undesired effect at a later stage. However, the unauthorized access attempt triggered A3’s automated repair process, much like a later-stage undesired condition would. A3 took ~2 minutes to find a repair using virtual machine introspection to block a sys_clone call made by Bash. This was accomplished by replaying the attack within A3 (i.e., in the “laboratory” area), running a full system call analysis, and testing system call block policies for any unique calls or call parameters found. A3 took an additional ~1.5 minutes to find a source code repair in the Bash code by analyzing the call stack when the sys_clone call was attempted. Below are the call stack and a slightly more readable figure for our particular attack payload:

#0  0x00007f17a8f5f936 in __libc_fork () at ../nptl/sysdeps/unix/sysv/linux/x86_64/../fork.c:131
#1  0x0000000000448ebe in make_child (command=0xc7eb08 "/bin/cat /home/mitll/passwd > /tmp/hello.txt", async_p=0) at jobs
#2  0x000000000043a271 in execute_disk_command (words=0xc7a688, redirects=0xc7e688, command_line=0xc7ea48 "/bin/cat /home
/mitll/passwd > /tmp/hello.txt", pipe_in=-1, pipe_out=-1, async=0, fds_to_close=0xc7a4c8, cmdflags=0) at execute_cmd.c:46
#3  0x0000000000438fd0 in execute_simple_command (simple_command=0xc7e648, pipe_in=-1, pipe_out=-1, async=0, fds_to_close
=0xc7a4c8) at execute_cmd.c:3977
#4  0x0000000000433179 in execute_command_internal (command=0xc7e608, asynchronous=0, pipe_in=-1, pipe_out=-1, fds_to_clo
se=0xc7a4c8) at execute_cmd.c:735
#5  0x0000000000435d26 in execute_connection (command=0xc7e708, asynchronous=0, pipe_in=-1, pipe_out=-1, fds_to_close=0xc
7a4c8) at execute_cmd.c:2319
#6  0x00000000004334d4 in execute_command_internal (command=0xc7e708, asynchronous=0, pipe_in=-1, pipe_out=-1, fds_to_clo
se=0xc7a4c8) at execute_cmd.c:891
#7  0x0000000000487ee3 in parse_and_execute (string=0xc7dc08 "HTTP_USER_AGENT () { :;}; /bin/cat /home/mitll/passwd > /tm
p/hello.txt", from_file=0x7fff289f6c4e "HTTP_USER_AGENT", flags=5) at evalstring.c:319
#8  0x000000000043af8c in initialize_shell_variables (env=0x7fff289f50e0, privmode=0) at variables.c:350
#9  0x000000000041de8f in shell_initialize () at shell.c:1709

Looking for a place to stop the manifestation, A3 developed the following patch at line 3979 in execute_cmd.c, which just unconditionally skips the function call leading directly to our observed attack. This repair does not fix the Bash parser, but instead disables functionality that is unnecessary for processing legitimate requests by the protected application (app store running on the LAMP stack).

if (0) {  
	result = execute_disk_command ( words, simple_command->redirects, 
					  command_line, pipe_in, pipe_out, 
					  async, fds_to_close,

For this experiment we started with a single malicious request sent to the application and A3 used benign traffic and a subset of the tests shipped with Bash to reason about and develop its patch. We are not claiming that the A3-derived code repair is the right fix (although it is fairly close to the location of the proposed fix). With a little more time and tweaking (e.g., additional attack attempts trying to cause different manifestations, regression tests), we can refine it further.

What we are claiming is that A3 was able to automatically localize and find a patch that makes the protected application (our LAMP exemplar) resilient in seconds. If the adversary tries another exploit and causes an undesired condition in the protected application, A3 will find a refinement. A3’s explanation also provides a wealth of localization and causal relation information along with the patch by outputting the malicious message and the full call stack. This can be extremely helpful for a human developer trying to address the problem.

Vulnerabilities and attacks relying on arcane parsing bugs or obscure protocol features seem to get all the attention these days. However, progress is being made in faster, more efficient and more effective ways to deal with these thorny issues as well. The ability to block attack manifestations, and deliver useful debugging/forensic information along with repair candidates in the form of code patches has great potential to mitigate some of the major issues faced with network-facing software today, including the large average lifespan of zero-day vulnerabilities, difficulty in pinpointing vulnerable code, patch validation, and time and level of expertise needed to keep ubiquitous services and infrastructure like OpenSSL and Bash safe.

Further Information:

If you are interested in learning more about the A3 project, we have a list of published papers available at the project page. For more details on the repair technology, we are working on a paper that includes more technical details and experiments run with other bugs. For information about the CRASH program, contact the DARPA Public Affairs office at

Proposal for a Friendly Dialect of C

[This post is jointly authored by Pascal Cuoq, Matthew Flatt, and John Regehr.]

In this post, we will assume that you are comfortable with the material in all three parts of John’s undefined behavior writeup and also with all three parts of Chris Lattner’s writeup about undefined behavior. Additionally, this paper is excellent background reading.

C compilers generate faster and smaller code by assuming that the compiled program will never execute an undefined behavior (UB). Each time a compiler exploits a new kind of UB or increases the optimizer’s reach in other ways, some code that previously worked will break. Thus, compiler upgrades cause consistent low-grade headaches for developers and maintainers of large bodies of C code. It’s not hard to find reasonable objections to this kind of thing as well as irate bug reports. The code that gets broken by the UB-aware compiler is incorrect according to the standard, but following all of the rules in the C standard in a large code base is brutally difficult and, practically speaking, few programmers are capable of it. For example, a sufficiently advanced compiler can break six of the nine well-worn C programs in SPEC CINT 2006 by only exploiting integer undefined behaviors. The problem is that the ostensible user base for C — people implementing low-level systems code — is not necessarily well served by creeping undefined-behavior exploitation. In short, modern C is not a friendly programming language.

When developers are not 100% certain that their code is free of undefined behaviors, one thing they do is add compiler-specific flags that disable certain UB-based optimizations. For example, PostgreSQL uses the -fwrapv option, which tells GCC to implement two’s complement wrapping behavior for signed integer overflows. For analogous reasons, the Linux kernel uses -fno-strict-aliasing and -fno-delete-null-pointer-checks. The problem with these sorts of flags is that they are compiler-specific, the flags don’t necessarily mean the same thing across compiler versions, the flags individually don’t provide much protection against UB exploitation, and developers must watch out for new kinds of breakage and new flags to add to configuration scripts.

As Chris Lattner says at the end of his third post on this topic, using various -f flags amounts to selecting a different dialect of C. Instead of having programmers learn, mix, match, and track various -f flags, we propose defining a friendly dialect of C that trades some optimization capability for ease of reasoning. This friendly dialect might be supported through a -std=friendly-c flag (if you’ll indulge the idea that the friendly dialect could be a standard) that merely implies a group of -f flags for a given version of GCC or LLVM. The flag would be otherwise orthogonal to code generation options, such as -O2. Our goal is to combine

  • minimal additional effort for compiler developers by — as much as possible — simply requiring that they provide behaviors that are already present or are at least easily available; with

  • minimal slowdown when compared to maximum UB-aware compiler optimizations by (1) not requiring any UBSan-like dynamic checks to be added and (2) disabling only optimizations that provide a bad value proposition in terms of performance vs. friendliness.

As a starting point, we imagine that friendly C is like the current C standard, but replacing many occurrences of “X has undefined behavior” with “X results in an unspecified value”. That adjustment alone can produce a much friendlier language. In other cases, we may be forced to refer to machine-specific details that are not features of the C abstract machine, and we are OK with that.

Here are some features we propose for friendly C:

  1. The value of a pointer to an object whose lifetime has ended remains the same as it was when the object was alive.
  2. Signed integer overflow results in two’s complement wrapping behavior at the bitwidth of the promoted type.
  3. Shift by negative or shift-past-bitwidth produces an unspecified result.
  4. Reading from an invalid pointer either traps or produces an unspecified value. In particular, all but the most arcane hardware platforms can produce a trap when dereferencing a null pointer, and the compiler should preserve this behavior.
  5. Division-related overflows either produce an unspecified result or else a machine-specific trap occurs.
  6. If possible, we want math- and memory-related traps to be treated as externally visible side-effects that must not be reordered with respect to other externally visible side-effects (much less be assumed to be impossible), but we recognize this may result in significant runtime overhead in some cases.
  7. The result of any signed left-shift is the same as if the left-hand shift argument was cast to unsigned, the shift performed, and the result cast back to the signed type.
  8. A read from uninitialized storage returns an unspecified value.
  9. It is permissible to compute out-of-bounds pointer values including performing pointer arithmetic on the null pointer. This works as if the pointers had been cast to uintptr_t. However, the translation from pointer math to integer math is not completely straightforward since incrementing a pointer by one is equivalent to incrementing the integer-typed variable by the size of the pointed-to type.
  10. The strict aliasing rules simply do not exist: the representations of integers, floating-point values and pointers can be accessed with different types.
  11. A data race results in unspecified behavior. Informally, we expect that the result of a data race is the same as in C99: threads are compiled independently and then data races have a result that is dictated by the details of the underlying scheduler and memory system. Sequentially consistent behavior may not be assumed when data races occur.
  12. memcpy() is implemented by memmove(). Additionally, both functions are no-ops when asked to copy zero bytes, regardless of the validity of their pointer arguments.
  13. The compiler is granted no additional optimization power when it is able to infer that a pointer is invalid. In other words, the compiler is obligated to assume that any pointer might be valid at any time, and to generate code accordingly. The compiler retains the ability to optimize away pointer dereferences that it can prove are redundant or otherwise useless.
  14. When a non-void function returns without returning a value, an unspecified result is returned to the caller.

In the interest of creating a readable blog post, we have kept this discussion informal and have not made any attempt to be comprehensive. If this proposal gains traction, we will work towards an implementable specification that addresses all 203 items listed in Annex J of the C11 standard. A friendly C++ could also be defined but that is a bigger job.

We are not trying to fix the deficiencies of the C language nor making an argument for or against C. Rather, we are trying rescue the predictable little language that we all know is hiding within the C standard. This language generates tight code and doesn’t make you feel like the compiler is your enemy. We want to decrease the rate of bit rot in existing C code and also to reduce the auditing overhead for safety-critical and security-critical C code. The intended audience for -std=friendly-c is people writing low-level systems such as operating systems, embedded systems, and programming language runtimes. These people typically have a good guess about what instructions the compiler will emit for each line of C code they write, and they simply do not want the compiler silently throwing out code. If they need code to be faster, they’ll change how it is written.

Related reading:

We appreciate feedback.