Design and Evolution of C-Reduce (Part 1)

[This piece is posted in parallel on the IEEE Software blog. Karim Ali helped copyedit it.]

Since 2008, my colleagues and I have developed and maintained C-Reduce, a tool for programmatically reducing the size of C and C++ files that trigger compiler bugs. C-Reduce also usually does a credible job reducing test cases in languages other than C and C++; we’ll return to that later.

Why Reduce Test Cases?

Here’s a typical C-Reduce output, as found in the LLVM bug database:

int a[1];
int b;
void c() {
  void *d = b = 0;
  for (;; b++)
    a[b] = (int) d++;
}

Compiling this code at the -O2 optimization level causes LLVM to crash. The bug report doesn’t contain the original, unreduced test case, but most likely it was larger.

A reduced test case is preferable because:

  • it usually gets the compiler to misbehave quickly, reducing the number of function calls, memory allocations, etc. that the compiler developer has to step through and reason about while debugging
  • the reduced file contains little code not directly related to triggering the bug, reducing the likelihood that compiler developers will be distracted by extraneous features of the test case
  • reduced test cases for the same bug often look similar to each other, whereas this is not normally true for unreduced files that trigger the same bug
  • there is often little or no discernible similarity between an unreduced source file and its reduced version, making it easier for compiler bugs triggered by proprietary code to be reported externally

The minimum-sized input triggering any particular compiler bug can be found using a trivial algorithm: exhaustive search of text strings of increasing size. This is, of course, almost always intractable. In practice, test case reduction proceeds in the other direction: starting with a large, failure-inducing test case, incrementally making it smaller until a local minimum is reached.

A Bit of Background

The history of automated test case reduction does not seem to be well documented, but several examples can be found in software testing papers from the 1990s, such as
Differential Testing for Software and Massive Stochastic Testing of SQL. Test-case reduction was first studied in its own right in 2000 when Hildebrandt and Zeller introduced Delta Debugging: a general-purpose technique for test case reduction. Their algorithm uses a greedy search where a series of “variants” (our term, not theirs, for partially-reduced test case candidates) is produced by removing chunks of the test case. As reduction progresses, the chunk size is reduced, until it reaches some minimum-sized unit, such as a single line, token, or character. When no minimum-sized chunk can be removed from the test case without breaking the property that it triggers the bug, the Delta Debugger terminates. Almost all subsequent test-case reduction work, including ours, builds upon this work.

Towards C-Reduce

I became interested in test case reduction when my colleagues and I began to find a lot of bugs in C compilers using random testing. We found so many bugs that reporting them became bottlenecked on reducing the bug-triggering programs. Since I was the one reporting the bugs we found, I was the one who felt the pain of manual test-case reduction, and it quickly got old. I eventually reported around 500 compiler bugs and I could not have done this without first creating C-Reduce.

At the time, the best open-source implementation of Delta Debugging, from UC Berkeley, was line-based and contained a significant innovation over the original algorithm: it could reorganize a file in such a way that all nested curly braces deeper than a configurable level would appear on a single line. Thus, at level zero, entire functions would be placed on the same line, enabling the line-based reducer to remove an entire function at once. At higher nesting levels, functions would be split across lines, enabling finer-grained reduction. This worked well but the tool ended up being inadequate for my needs: it got stuck at local minima that were often orders of magnitude larger than what could be achieved when reducing by hand.

The limiting factor in this existing tool (“Delta” from now on) was obvious: it was not able to exploit enough of the structure of the file being reduced. For example, it could usually not do much to simplify arithmetic expressions. These sorts of simplifications tend to have a cascading effect: eliminating the last use of a variable allows its definition to be eliminated, etc. The obvious path forward was to write a new tool that solved a reduction problem that Delta could not solve, and then to alternate running this tool and Delta until a global fixpoint was reached. I did this, adding more and more reduction techniques over time. At some point I implemented a line-elimination pass in my new reducer, at which point Delta was subsumed and could be dropped.

We ended up keeping two elements of Delta’s design. First, the configurable hierarchical reformatting of a test case based on curly brace nesting. This technique, followed by removing contiguous chunks of code, is still one of C-Reduce’s most useful first lines of attack on a test case. Second, Delta’s mechanism for determining whether a given variant is “interesting.” An interesting variant is used as the basis for further reduction steps; an uninteresting variant is a dead end, and is discarded. Delta determined interestingness by invoking a user-supplied program — typically a shell script — whose process exit code determines the interestingness of the current variant. The flexibility afforded by this small element of user extensibility ends up being extremely useful. For example, the interestingness test can discard test cases that trigger certain compiler warnings, it can attempt to disambiguate different crash bugs, etc.

It is more challenging to reduce test cases that cause the compiler to emit incorrect object code than it is to reduce test cases that merely cause the compiler to crash. C-Reduce itself is agnostic about the character of the bug of interest: we push all of the difficulties in reducing miscompilation triggers into the interestingness test, which should try to answer questions such as:

  • is the variant well-defined by the C or C++ standard?
  • does the variant avoid depending on behaviors that are unspecified by the C or C++ standard?
  • does the buggy compiler turn the variant into an executable?
  • does this executable terminate within a specified time?
  • does the reference compiler (assumed to not contain the bug of interest) turn the variant into an executable?
  • does this executable also terminate within a specified time?
  • does the behavior of the two executables differ in a way that indicates that a miscompilation occurred?

The variant is interesting if the answer to all of these questions is “yes.”

The hardest part of reducing programs that trigger miscompilation bugs is ensuring that variants avoid undefined behaviors (such as invalid pointer operations) and do not rely on unspecified behaviors (such as the order of evaluation of function arguments). A test case doing one of these things is ill-formed and can accomplish nothing beyond annoying compiler developers. Empirically, if undefined behavior is not actively avoided during test-case reduction, C-Reduce will almost certainly introduce it. The practical solution is to employ suitable static and dynamic analysis tools, in order to avoid these ill-formed variants. Since no single tool detecting all undefined behaviors in C and C++ programs exists, a hybrid approach involving multiple tools is typically used in practice. This is not completely satisfying, but it works well enough that C-Reduce can reliably produce useful reduced test cases for miscompilation bugs in C and C++ compilers.
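
Putting these pieces together, a minimal interestingness test might look something like the sketch below. It is written in C here only to match the rest of this piece (in practice these tests are almost always shell scripts), and C-Reduce cares only about the exit code, with zero meaning "interesting." The file name small.c, the choice of gcc as the reference compiler, the hypothetical buggy-clang binary, and the 10-second timeouts are all placeholders.

#include <stdlib.h>
#include <sys/wait.h>

// run a shell command and return its exit status (nonzero also covers failure)
static int run(const char *cmd) {
  int rc = system(cmd);
  return (rc != -1 && WIFEXITED(rc)) ? WEXITSTATUS(rc) : 1;
}

int main(void) {
  // try to rule out ill-formed variants: build with sanitizers and make sure
  // nothing gets reported at runtime (a real test would bring in more tools)
  if (run("clang -O0 -w -fsanitize=undefined,address -o san small.c")) return 1;
  if (run("timeout 10 ./san > san.out 2> san.err")) return 1;
  if (run("test ! -s san.err")) return 1;
  // the reference compiler must produce an executable that terminates
  if (run("gcc -O0 -w -o ref small.c")) return 1;
  if (run("timeout 10 ./ref > ref.out")) return 1;
  // so must the compiler containing the bug of interest
  if (run("buggy-clang -O2 -w -o bug small.c")) return 1;
  if (run("timeout 10 ./bug > bug.out")) return 1;
  // interesting only if the two executables behave differently
  return run("cmp -s ref.out bug.out") ? 0 : 1;
}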

Writing good interestingness tests for miscompilations takes a bit of practice. More than one user has described C-Reduce as being something like the sorcerer’s apprentice: it does an excellent job reducing according to the criteria it is given, but if these criteria contain any kind of loophole, C-Reduce is likely to find it. For example, it is easy to accidentally write a test that believes the empty file is interesting. Additionally, a good interestingness test is written in such a way that its expected-case runtime is minimized by asking the quickest and most-likely-to-fail questions first.

From the start, C-Reduce was designed to produce a very small final reduced test case, even when this would make C-Reduce run for longer than we liked. This is based on the premise that we should burn cycles instead of human time, and that reporting a compiler bug is rarely on the critical path; we can often afford to wait for a better result. The consequences of this decision can be seen in Tables 1 and 2 of this paper that evaluates several test-case reduction methods: C-Reduce produces the smallest final output, but takes more time to do so.

A Modular, Domain-Independent Reducer Core

Although C-Reduce started out as a pet project solving a specific problem, it evolved into a research project involving a number of my colleagues, whose top-level goal was to produce an effective and usable reducer for C and C++ code as found in the wild. The first research contribution to come out of this effort was a way to achieve a clean mechanism/policy separation in a test case reducer. Previous reduction techniques had all baked specific transformations into the overall search strategy. This impedes extensibility, which we believe to be crucial. The structure that we ended up with is a small core that invokes a collection of pluggable transformation passes until a global fixpoint is reached.

The API for C-Reduce passes is simple but — like many simple things — required a lot of iterations before it felt finished. It is based on the ideas that transformation passes should be stateless and that every pass should implement a linear sequence of transformations, each of which results in a variant that may or may not be interesting. The interface is as follows:

state new(filename, option) : Return a fresh state object. Each pass uses this state to keep track of where it is in the sequence of transformations that it is capable of performing. These states may contain arbitrary data items; the C-Reduce core treats them as opaque. A typical pass stores some kind of cursor — a byte offset, token offset, line number, position in a tree traversal, etc. — in the state object.

The file referred to by filename is logically part of the state object even though it resides in the filesystem instead of memory. Of course it would not be difficult to pass the contents of the file around as a memory object but this would be slow when these objects are large (C-Reduce is frequently invoked on multi-megabyte preprocessed C++ files).

The “option” is used to select among different behaviors implemented by a composite pass.

state advance(filename, option, state) : Return a new state object referring to the next transformation opportunity following the one referenced by the state object passed as a parameter.

result transform(filename, option, state) : Modify the file in-place, selecting the transformation instance referred to by the state object. The result takes one of three values:

  • OK : the transformation succeeded
  • STOP : no more transformation instances remain for this pass
  • ERROR : something went wrong; for example, an external tool crashed, a working file or directory could not be created, etc.

(The API contains one additional method, which checks whether a pass’s external dependencies are satisfied, that doesn’t matter here.)

Our experience has been that every transformation pass that we wanted has been easy to implement behind this API.
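
To make the interface concrete, here is a sketch of a pass that tries to delete one line of the test case per transformation. The real passes are written in Perl and C++; this C version flattens the opaque state down to a single line index and does only minimal error handling.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

enum result { OK, STOP, ERROR };
typedef struct { long line; } state;   // cursor: which line to try deleting next

state pass_new(const char *filename, const char *option) {
  (void)filename; (void)option;
  return (state){ .line = 0 };
}

state pass_advance(const char *filename, const char *option, state s) {
  (void)filename; (void)option;
  s.line++;                            // move to the next deletion opportunity
  return s;
}

enum result pass_transform(const char *filename, const char *option, state s) {
  (void)option;
  enum { MAX_LINES = 1 << 20 };
  char **lines = malloc(MAX_LINES * sizeof *lines);
  if (!lines) return ERROR;
  FILE *in = fopen(filename, "r");
  if (!in) { free(lines); return ERROR; }
  char buf[65536];
  long n = 0;
  while (n < MAX_LINES && fgets(buf, sizeof buf, in))
    lines[n++] = strdup(buf);
  fclose(in);
  enum result res = OK;
  if (s.line >= n) {
    res = STOP;                        // no deletion opportunities remain
  } else {
    FILE *out = fopen(filename, "w");  // rewrite the file in place...
    if (!out) {
      res = ERROR;
    } else {
      for (long i = 0; i < n; i++)
        if (i != s.line)               // ...dropping exactly one line
          fputs(lines[i], out);
      fclose(out);
    }
  }
  while (n) free(lines[--n]);
  free(lines);
  return res;
}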

The C-Reduce core implements this algorithm:

current = original_test_case
do
  size_at_start = size(current)
  foreach (p, option) in pass_list
    state = p::new(current, option)
    do
      variant = current         // this is a file copy operation
      result = p::transform(variant, option, state)
      if result == ERROR
        report_problem_in_pass(p, option)
      if result == OK
        if is_interesting(variant)
          current = variant     // also a file copy
        else
          state = p::advance(current, option, state)
    while result == OK
while size(current) < size_at_start

The termination argument for C-Reduce is:

  1. Since the outermost loop requires the size of the test case to decrease monotonically, it can only execute as many times as the size (in bytes) of the unreduced test case. In practice it executes many fewer times than this.
  2. The loop over passes terminates because the pass list is immutable once C-Reduce is initialized.
  3. Each iteration of the innermost loop either advances the state object or else (by selecting an interesting variant) removes one transformation opportunity. Either way, the number of remaining transformations decreases by one.
  4. The interestingness test is, at worst, terminated (using OS support for killing all processes in a process group) after a configurable timeout.

In practice the weak link in this argument is case 3, which is vulnerable to bugs in passes. C-Reduce terminates robustly by abandoning passes when they appear to be behaving unreasonably.

The C-Reduce core does not insist that transformations make the test case smaller, and in fact quite a few of its passes potentially increase the size of the test case; their goal is to eliminate sources of coupling within the test case, unblocking progress in other passes.

The sequence of transformation passes is carefully orchestrated such that passes that are likely to give the biggest wins -- such as those that remove entire functions -- run first; otherwise the tool would end up spending days or weeks doing silly things such as trying to shrink numeric constants in a huge source file. Shrinking numbers is useful, and it should be done, but only after many other reduction mechanisms have run to completion.

C-Reduce's collection of cooperating passes, with heavy phase-ordering constraints, is highly reminiscent of how a modern optimizing compiler works. However, only a small proportion of the transformation passes is intended to be semantics-preserving in the sense that a compiler's optimization passes must be. In this domain, we only want to preserve enough semantics that we can probabilistically avoid breaking whatever property makes a test case interesting.

A consequence of writing a modular reducer is that once we came up with the right API for writing passes, we were free to write a lot of passes. My colleagues and I spent several years doing this and we ended up with:

  • 35 passes, implemented in Perl, that include heuristics such as removing lines, removing various kinds of matched delimiters (and perhaps also the text between them), and shrinking integer values
  • 6 passes that invoke external utilities such as unifdef, a partial evaluator for the C preprocessor language, a lexer for C and C++ that supports various token-level reduction transformations, and pretty-printing utilities that make the reduced test case more pleasant to look at
  • 69 passes, implemented in C++, that use LLVM's Clang front end as a library for source-to-source transformation of C and C++ code; these include function inlining, partial template instantiation, scalar replacement of aggregates, copy propagation, and eliminating levels of a class hierarchy.

The actual number of dynamic passes is larger than the total of these numbers since some passes can be invoked in different modes using the "option" parameter mentioned above.

In this piece, we looked at why we had to create C-Reduce and at the modular structure that was key to making it solve the problems that we wanted to solve. In Part 2, I'll describe how C-Reduce improves reduction times using multiple cores and why C-Reduce usually does a good job reducing test cases in languages other than C and C++; finally, I'll discuss a few open research problems in test case reduction.

It’s Time for a Modern Synthesis Kernel

Alexia Massalin’s 1992 PhD thesis has long been one of my favorites. It promotes the view that operating systems can be much more efficient than then-current operating systems via runtime code generation, lock-free synchronization, and fine-grained scheduling. In this piece we’ll only look at runtime code generation, which can be cleanly separated from the other aspects of this work.

Runtime Code Generation in Ring 0

The promise of kernel-mode runtime code generation is that we can have very fast, feature-rich operating systems by, for example, not including code implementing generic read() and write() system calls, but rather synthesizing code for these operations each time a file is opened. The idea is that at file open time, the OS has a lot of information that it can use to generate highly specialized code, eliding code paths that are provably not going to execute.

Runtime code generation was a well-known idea in 1992, but it wasn’t used nearly as widely as it is today. In 2019, of course, just-in-time compilers are ubiquitous. However, operating system kernels still do not use runtime code generation very much, with a few exceptions such as:

  • several OS kernels, including Linux, have a simple JIT compiler in their BPF implementation
  • VMware used to use dynamic code generation to performantly virtualize OS kernels on x86 chips that lacked hardware virtualization extensions; I doubt that this is commonly used any longer
  • pre-NT Windows kernels would dynamically generate bitblit code. I learned this in a talk by a VMware employee; this code generation was apparently a debugging issue for VMware since it would fight with their own runtime code generator. Some details can be found in this post. The old paper about the origins of this technique in the Xerox Alto is a classic.
  • TempleOS, as explained in this nice writeup, made heavy use of dynamic code generation

Anyway, back to Synthesis. The OS and code generators were all written, from scratch, in 68020 assembly language. How do we translate Massalin’s ideas to 2019? Most likely by reusing an existing code generator and OS. For most of this piece I’ll assume that that’s what we want to do, but we’ll also briefly touch on customized alternatives.

Code Generator Requirements

The particular technology that we use for runtime code generation isn’t that important, but for now let’s imagine using LLVM. This means that the pieces of the kernel that we wish to specialize will need to be shipped as bitcode, and then we’ll ask LLVM to turn it into object code as needed. LLVM has lots of great optimization passes, from which we could pick a useful subset, and it is not hard to use in JIT mode. On the other hand, LLVM isn’t as fast as we’d like and also it has a large footprint. In production we’d need to think carefully whether we wanted to include a big chunk of non-hardened code in the kernel.

What optimizations are we expecting the code generator to perform? Mostly just the basic ones: function inlining, constant propagation, and dead code elimination, followed by high-quality instruction selection and register allocation. The hard part, as we’re going to see, is convincing LLVM that it is OK to perform these optimizations as aggressively as we want. This is an issue that Massalin did not need to confront: her kernel was designed in such a way that she knew exactly what could be specialized and when. Linux, on the other hand, was obviously not created with staged compilation in mind, and we’re going to have to improvise somewhat if we want this to work well.

My guess is that while LLVM would be great for prototyping purposes, for deployment we’d probably end up either reusing a lighter-weight code generator or else creating a new one that is smaller, faster, and more suitable for inclusion in the OS. Performance of runtime code generation isn’t just a throughput issue, there’ll also be latency problems if we’re not careful. We need to think about the impact on security, too.

Example: Specializing write() in Linux

Let’s assume that we’ve created a version of Linux that is capable of generating a specialized version of the write() system call for a pipe. This OS needs — but we won’t discuss — a system call dispatch mechanism to rapidly call the specialized code when it is available. In Synthesis this was done by giving each process its own trap vector.

Before we dive into the code, let’s be clear about what we’re doing here: we are pretending to be the code generator that is invoked to create a specialized write() method. Probably this is done lazily at the time the system call is first invoked using the new file descriptor. The specialized code can be viewed as a cached computation, and as a bonus this cache is self-invalidating: it should be valid as long as the file descriptor itself is valid. (But later we’ll see that we can do a better job specializing the kernel if we support explicit invalidation of runtime-generated code.)

If you want to follow along at home, I’m running Linux 5.1.14 under QEMU, using these instructions to single-step through kernel code, and driving the pipe logic using this silly program.
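
The linked program isn’t reproduced here, but all the driver needs to do is hammer on the pipe write path; a minimal guess at it looks like this (writing one byte and then reading it back so the pipe buffer never fills):

#include <unistd.h>

int main(void) {
  int fds[2];
  char c = 'x';
  if (pipe(fds))
    return 1;
  for (;;) {
    if (write(fds[1], &c, 1) != 1)   // one-byte write: the path we want to trace
      return 1;
    if (read(fds[0], &c, 1) != 1)    // drain it so write() never blocks
      return 1;
  }
}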

Skipping over the trap handler and such, ksys_write() is where things start to happen for real:

ssize_t ksys_write(unsigned int fd, const char __user *buf, size_t count)
{
	struct fd f = fdget_pos(fd);
	ssize_t ret = -EBADF;

	if (f.file) {
		loff_t pos = file_pos_read(f.file);
		ret = vfs_write(f.file, buf, count, &pos);
		if (ret >= 0)
			file_pos_write(f.file, pos);
		fdput_pos(f);
	}

	return ret;
}

At this point the “fd” parameter can be treated as a compile-time constant, but of course “buf” and “count” cannot. If we turn “fd” into a constant, will LLVM be able to propagate it through the remaining code? It will, as long as:

  1. We inline all function calls.
  2. Nobody takes the address of “fd”.

It’s not that calls and pointers will always block the optimizer, but they complicate things by bringing interprocedural analysis and pointer analysis into the picture.

Our goal is going to be to see whether the code generator can infer the contents of the struct returned from fdget_pos(). (You might wonder why performance-sensitive code is returning a “struct fd” by value. Turns out this struct only has two members: a pointer and an integer.)

The call to fdget_pos() goes to this code:

static inline struct fd fdget_pos(int fd)
{
	return __to_fd(__fdget_pos(fd));
}

and then here:

unsigned long __fdget_pos(unsigned int fd)
{
	unsigned long v = __fdget(fd);
	struct file *file = (struct file *)(v & ~3);

	if (file && (file->f_mode & FMODE_ATOMIC_POS)) {
		if (file_count(file) > 1) {
			v |= FDPUT_POS_UNLOCK;
			mutex_lock(&file->f_pos_lock);
		}
	}
	return v;
}

and then (via a trivial helper that I’m not showing) here:

static unsigned long __fget_light(unsigned int fd, fmode_t mask)
{
	struct files_struct *files = current->files;
	struct file *file;

	if (atomic_read(&files->count) == 1) {
		file = __fcheck_files(files, fd);
		if (!file || unlikely(file->f_mode & mask))
			return 0;
		return (unsigned long)file;
	} else {
		file = __fget(fd, mask, 1);
		if (!file)
			return 0;
		return FDPUT_FPUT | (unsigned long)file;
	}
}

Keep in mind that up to here, we haven’t seen any optimization blockers. In __fget_light(), we run into our first interesting challenge: “current” is a macro that returns a pointer to the running process’s PCB (in Linux the PCB, or process control block, is a “task_struct” but I’ll continue using the generic term). The current macro ends up being a tiny bit magical, but its end result can be treated as a constant within the context of a given process. There is no way a code generator like LLVM will be able to reach this conclusion, so we’ll need to give it some help, perhaps by annotating certain functions, macros, and struct fields as returning values that are constant over a given scope. This is displeasing but it isn’t clear there’s any easier or better way to achieve our goal here. The best we can hope for is that the annotation burden is close to proportional to the number of data types in the kernel; if it ends up being proportional to the total amount of code then our engineering effort goes way up.

Now, assuming that we can treat “current” as a compile-time constant, we’re immediately faced with a similar question: is the “files” field of the PCB constant? It is (once the process is initialized) but again there’s not going to be any easy way for our code generator to figure this out; we’ll need to rely on another annotation.
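
For concreteness, one way such an annotation might look is sketched below. This is purely hypothetical: Linux has no such attribute, and Clang’s generic “annotate” attribute is used here only as a convenient carrier for metadata that a specializing code generator could consume.

#define __stable_per_process __attribute__((annotate("stable-per-process")))

struct task_struct {
	/* ... */
	/* does not change after process initialization, so the specializer may
	   fold loads of this field to a constant for a given process */
	struct files_struct *files __stable_per_process;
	/* ... */
};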

Continuing, the “count” field of files is definitely not a constant: this is a reference count on the process’s file descriptor table. A single-threaded Linux process will never see count > 1, but a multi-threaded process will. (Here we need to make the distinction between open file instances, which are shared following a fork, and the file descriptor table, which is not.) The fast path here is exploiting the insight that if our process is single-threaded we don’t need to worry about locking the file descriptor table, and moreover the process is not going to stop being single-threaded during the period where we rely on that invariant, because we trust the currently running code to not do the wrong thing.

Here our specializing compiler has a fun policy choice to make: should it specialize for the single threaded case? This will streamline the code a bit, but it requires the generated code to be invalidated later on if the process does end up becoming multithreaded — we’d need some collection of invalidation hooks to make that happen.

Anyhow, let’s continue into __fcheck_files():

static inline struct file *__fcheck_files(struct files_struct *files, unsigned int fd)
{
	struct fdtable *fdt = rcu_dereference_raw(files->fdt);

	if (fd < fdt->max_fds) {
		fd = array_index_nospec(fd, fdt->max_fds);
		return rcu_dereference_raw(fdt->fd[fd]);
	}
	return NULL;
}

At this point we’re in deep “I know what I’m doing” RCU territory and I’m going to just assume we can figure out a way for the code generator to do what we want, which is to infer that this function returns a compile-time-constant value. I think this’ll work out in practice, since even if the open file instance is shared across processes, the file cannot be truly closed until its reference count goes to zero. Anyway, let’s move forward.

Next, we’re back in __fget_light() and then __fdget_pos(): our code generator should be able to easily fold away the remaining branches in these functions. Finally, we return to line 4 of ksys_write() and we know what the struct fd contains, making it possible to continue specializing aggressively. I don’t think making this example any longer will be helpful; hopefully the character of the problems we’re trying to solve is now apparent.
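
To make the payoff concrete, here is roughly the code we would hope the specializer ends up with for this file descriptor, assuming a single-threaded process, FMODE_ATOMIC_POS clear, and all of the constants discussed above successfully folded. This is purely illustrative: the function name and the folded pointer value are invented, and the reference-count and fdput bookkeeping are assumed to have folded away.

ssize_t ksys_write_for_fd3(const char __user *buf, size_t count)
{
	struct file *file = (struct file *)0xffff888003c5e700; /* folded at codegen time */
	loff_t pos = file->f_pos;
	ssize_t ret = vfs_write(file, buf, count, &pos);

	if (ret >= 0)
		file->f_pos = pos;
	return ret;
}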

In summary, we saw four kinds of variables in this exercise:

  1. Those such as the “fd” parameter to write() that the code generator can see are constant at code generation time.
  2. Those such as the “current” pointer that are constant, but where the code generator cannot see this fact for one reason or another. To specialize these, we’ll have to give the compiler extra information, for example using annotations.
  3. Those such as the “count” field of the “files_struct” that are not actually constant, but that seem likely enough to remain constant that we may want to create a specialized version treating them as constants, and then be ready to invalidate this code if the situation changes.
  4. Those that are almost certainly not worth trying to specialize. For example, the “count” parameter to write() is not likely to remain constant over a number of calls.

Writing one byte to a pipe from a single-threaded process executes about 3900 instructions on Linux 5.1.14 (this is just in ksys_write(), I didn’t measure the trapping and untrapping code). The Synthesis thesis promises an order of magnitude performance improvement. Can specialization reduce the fast path on this system call to 390 instructions? It would be fun to find out.

I’ll finish up this example by observing that even though I chose to present code from the filesystem, I think it’s network stack code that will benefit from specialization the most.

Discussion

I have some experience with OS kernels other than Linux, and my belief is that attempting to dynamically specialize any mainstream, production-grade OS other than Linux would run into the same issues we just saw above. At the level the code generator cares about, there just isn’t much effective difference between these OSes: they’re all big giant blobs of C with plentiful indirection and domain-specific hacks.

If our goal is only to create a research-grade prototype, it would be better to start with something smaller than Linux/Windows/Darwin so that we can refactor specialization-unfriendly parts of the OS in a reasonable amount of time. xv6 is at the other extreme: it is super easy to hack on, but it is so incredibly over-simplified that it could not be used to test the hypothesis “a realistic OS can be made much faster using specialization.” Hilariously, an xv6+LLVM system would be about 0.15% OS code and 99.85% compiler code. Perhaps there’s a middle ground that would be a better choice, Minix or OpenBSD or whatever.

Given two developers, one who knows LLVM’s JIT interfaces and one who’s a good Linux kernel hacker, how long would it take to bring up a minimally ambitious dynamically specializing version of Linux? I would guess this could be done in a week or two; there’s not really anything too difficult about it (it’s easy to say this while blogging, of course). The problem is that this would not give good results: only the very easiest specialization opportunities will get spotted by the runtime code generator. But perhaps this would generate enough interest that people would keep building on it.

Do we want to do specialization work on C code? No, not really, it’s just that every one of our production-grade kernels is already written in it. A fun but engineering-intensive alternative would be to create a new, specialization-friendly kernel in whatever programming language looks most suitable. Functional languages should offer real advantages here, but of course there are issues in using these languages to create a performant OS kernel. Perhaps Mirage is a good starting point here, it is already all about specialization — but at system build time, not at runtime.

An ideal programming environment for a modern Synthesis kernel would provide tool and/or language support for engineering specialization-friendly kernel code. For example, we would identify a potential specialization point and then the tools would use all of our old friends — static analysis, dynamic analysis, symbolic execution, etc. — to show us what data items fall into each of the four categories listed in the last section, and provide us with help in refactoring the system so that specialization can work better. A tricky thing here is taking into account the different kinds of concurrency and synchronization that happen in a sophisticated OS.

Some useful questions to ask (and of course we’re always asking these same things when doing OS and compiler research) are: How are we supposed to think about a dynamically specializing OS kernel? What are the new abstractions, if any? Specialization could really benefit from some sort of first-class “code region over which these values are effectively constant” and then also “but the constant-ness is invalidated by this set of events.”

Why Now?

The literature on dynamic specialization of OS code is interesting: it looks like there was a flurry of interest inspired by Synthesis in the mid/late 90s. Many of these papers had Calton Pu, Massalin’s thesis supervisor, on the author list. Not a whole lot has happened in this area since then, as far as I know. The only paper I can think of about optimistic OS specialization is this one; it’s a nice paper, I recommend it. Static OS specialization, on the other hand, is what unikernels are all about, so there’s been quite a bit of work done on this.

It seems like time to revive interest in dynamic OS specialization because:

  • Most of the processor speed wins lately are application specific; the cores that execute OS code are not getting noticeably faster each year, nor do they seem likely to. In fact, way back in 1989 John Ousterhout argued that increases in processor speed weren’t benefiting OS code as much as other kinds of code.
  • OSes have slowed down recently to mitigate side channel attacks. Maybe we can get some of that speed back using dynamic specialization.
  • OSes are way bloatier than they were in the 90s, increasing the potential benefits due to specialization.
  • Compiler technology is far ahead of where it was in the 90s, with off-the-shelf toolkits like LLVM providing high-quality solutions to many of the problems we’d run into while prototyping this work.

I’d like to thank Perry Metzger who suggested this piece and also provided feedback on a draft of it. Perry worked with Alexia back in the day and hopefully he’ll also write about this topic.

Finally, I don’t want to give the impression that I’m summarizing a research proposal or an in-progress project. This is the kind of thing I love to think about, is all.

Floating the Dirty Devil River

Packrafts are tough, light, individual-sized inflatable boats that people use to put together really amazing wilderness trips by combining rafting, hiking, and sometimes even biking. I’ve had sort of a low-grade obsession with packrafting since 2009 when I ran into some people in Alaska who were on their way to hike up and over the Brooks Range and paddle down to the Arctic Ocean. (I’m 99.9% certain it was these folks.) But anyway, I never managed to actually do any packrafting until spring break this year when — along with my friends Brian and Tim — I floated a section of the Dirty Devil River in southern Utah.

The Dirty Devil is a little tricky: it’s a seasonal desert river that, on average, only has enough flow for floating during February and March — and February can be really cold. Luckily the university’s spring break is in March, usually a superb time to be in the desert: the days are getting longer and warmer, the nights are still cold, and there’s not a single bug to be found. Usually you get snowed on for a couple hours, but the snow doesn’t stick around.

None of us had so much as touched a packraft, but all of us had canoed or kayaked before, so we figured a class 1-2 river should be doable. The Dirty Devil flows through a canyon 1500 feet (450 meters) deep and, once we started, there was no realistic possibility of escape except by finishing the trip 36 miles later.

We started out on a spectacular morning:

To set up a shuttle we drove down the rough but super fun Poison Springs Road; here Brian’s cramming a few last things into my vehicle with the DD running narrow and fast in the background:

There was quite a bit more water in the channel than we’d expected; it turned out that Happy Canyon, about eight miles upstream, was in the middle of a minor flash flood.

Next we returned to the paved road, drove a few miles, then down to the top of the western end of the Angel Trail (one of a number of trails with that name). Nobody else was there, and we didn’t see anyone until the last day of our trip. At this point the weather was deteriorating:

And soon we were hiking in the snow:

The snow didn’t last long, but it remained windy and not far above freezing; we were happy to have wetsuits and puffy jackets. We put in across the river from Robbers Roost Canyon, where Butch Cassidy and his gang used to hide out:

We paddled to Angel Cove, which has good camping and a nice spring. With the braided river and the wind it took forever to go just a couple of miles. We were pretty happy to get there, get into dry clothes, and make some dinner. Also, the rest of our trip lacked reliable water sources (water from the DD is nasty), so we purified enough water here to last the remaining days.

The next morning it was still cold, but nice and sunny:

We saw both sides of a ridge called the Saw Tooth, which the river meanders around:


We also passed No Man’s Canyon and Larry Canyon, both of which would have been great to spend a few days hiking, but we hadn’t been able to budget time for that.

The opaque water was hard to read and we did an awful lot of pulling our boats around the river trying to find the channel, which moved from bank to bank every few hundred yards:

This was exhausting and there was plenty of quicksand too. Time flew by as we moved down the river at less than a walking pace and it was around 3pm before we noticed that we should stop for lunch.

The scenery continued to develop:

We pushed late into the day:

And finally stopped at a nice camp site on the river bank where we watched the last of the sun on the canyon walls:

There was copious driftwood so we had a fire:

Morning was good:


Since we hadn’t made good time the previous day I got us up early, but it was just too cold to get into the water and anyway our wetsuits were frozen stiff. We had another cup of coffee and waited for the sun:

It ended up being warmer than the previous day — just about perfect weather — and for whatever reason we made much better time than we had earlier. At this innocuous-looking rest stop I hit some ugly quicksand and told the others to land somewhere else:

Brian landed a little ways to one side and just walked up the bank, but Tim got mired much worse than I had; it probably took him 15 minutes to get out. We couldn’t even help.

Over the course of the day we passed Twin Corral Box Canyon and Sam’s Mesa Box Canyon, both huge and seldom-visited places I’d have been happy to spend a couple days exploring, but again we wanted to keep on schedule. Travel was generally easy and also the river gained some fun riffles as it started to drop more rapidly through the sandstone layers. It wasn’t possible to avoid hitting rocks since we couldn’t see through the water, but packrafts seem to be very tough. They’re also quite stable, I only worried about flipping once or twice, and never did.

By late afternoon we reached Happy Canyon — named, apparently, by old-time cowboys who were happy to find a way out of it — which was still running from its flash flood three days earlier:

No shortage of quicksand here either, ugh!

We didn’t take much time to explore, but rather found a place to camp nearby. By this point we were only eight miles from our takeout and I finally stopped worrying about getting trapped in the canyon if one of the boats popped or got lost. For whatever reason this night was extra cold; my water bottle was more than half frozen in the morning, meaning it had probably gotten below 25 degrees Fahrenheit (about -4 °C).

Nice light on the walls:

Happy Canyon has some of the finest walk-through narrows anywhere on the Colorado Plateau:

After this excursion it was time to pack up camp and float some more. We immediately hit a series of rock gardens:

Progress was overall slower than the previous day, but we moved along and the scenery kept being good.

And suddenly we reached the takeout, where we had adult beverages and loaded a lot of extremely sandy gear into Brian’s truck:

And then back to pick up my truck, which we reached just in time for sunset:

Here in the middle-right you can see the sawtooth ridge that we had floated past a few days earlier:

And finally, as tradition dictates, we stopped for burgers at Ray’s in Green River. This was a super good trip, but fairly difficult physically and perhaps a bit much to bite off for a group of 40-something first-time packrafters.

Verifying Popcount

Popcount is the function that returns the number of set bits in its argument. Showing that a popcount implementation does what it claims to do has become one of my favorite examples to use when I need to quickly show students how we can reason about programs mathematically. Something like a selection sort is probably the more standard example here, but sorting gets into some subtleties that I’d rather reserve for a second example. Popcount is so easy that this whole example can be done in a half hour.

The programming language and the particular algorithm don’t matter too much, here we’ll use what is perhaps the simplest popcount implementation in C or C++:

int popcount(uint64_t x) {
  int c = 0;
  for (int i = 0; i < 64; i++) {
    c += x & 1;
    x >>= 1;
  }
  return c;
}

This works, of course, because it scans the entire argument, incrementing the counter each time it finds a set bit. To show that it works in a more formal fashion, we can use methods more or less out of A Discipline of Programming, which allow us to combine obvious effects of individual statements into (hopefully) the desired high-level property, which we might as well annotate the program with right now:

int popcount(uint64_t x) {
  int c = 0;
  for (int i = 0; i < 64; i++) {
    c += x & 1;
    x >>= 1;
  }
  // c = POPCOUNT(orig_x)
  return c;
}

Here we’ve used a couple of tricks that need to be fully explained when giving this example to students who aren’t used to this kind of reasoning. First, since x is modified as the program executes, we’ve used orig_x as a shorthand for the original value of x when it was passed as an argument. (This syntax is a trick I learned from ACSL.) Second, we’ve used the POPCOUNT function: the mathematical specification of popcount, which is a completely different thing from the C code we are trying to verify. Consistently applying the correct rules to mathematical and programmatic elements isn’t as straightforward as we would hope, and in fact quite a few results from the bad old days of non-mechanized program verification are wrong in subtle ways.

To make our proof go, we’ll need to find the right loop invariant: a statement that is true before the loop executes and also true after every loop iteration. The trick when explaining this task to students is helping them realize that (for a simple example like this) if they understand how the algorithm works, they effectively already know the loop invariant. They just need some help in teasing it out into a usable form. One of the stumbling blocks is that every loop has many invariants, but most of them (for example c ≥ 0) aren’t helpful. We can avoid a lot of silly loop invariants by observing that the right invariant will be strong enough to imply the correctness criterion for the program while also being weak enough to already be true before the first loop iteration, but I’ve found that negative constraints like these are not very helpful for students. The trick, of course, is to capture the abstract fact that is being held in balance as the loop modifies pieces of program state, and to make a lecture like this work, we need to avoid the appearance of pulling this fact out of thin air. It defeats the entire purpose of this discussion if students walk away feeling like there was a magical step in the middle.

So anyway, by hook or by crook we need to get the audience to realize that the key to this algorithm is the loop invariant c + POPCOUNT(x) == POPCOUNT(orig_x). At this point we’re over the hump and we should mark up the code some more:

int popcount(uint64_t x) {
  int c = 0;
  // c + POPCOUNT(x) == POPCOUNT(orig_x) ???
  for (int i = 0; i < 64; i++) {
    // c + POPCOUNT(x) == POPCOUNT(orig_x)
    c += x & 1;
    x >>= 1;
    // c + POPCOUNT(x) == POPCOUNT(orig_x) ???
  }
  // c + POPCOUNT(x) == POPCOUNT(orig_x) => c = POPCOUNT(orig_x) ???
  return c;
}

Here I’ve used question marks to indicate a proposition that is still in doubt. Now we’ve broken up the problem into manageable pieces and can visit them in turn:

  • The loop invariant is true before the loop executes because c = 0 and x = orig_x.
  • To see that the loop invariant is preserved by the loop body, we analyze two cases: x & 1 is either zero or one. If it is zero, then c is unchanged and also, clearly, shifting x right by one place does not change POPCOUNT(x). If it is 1 then c is incremented by one and shifting x right by one place has decreased its popcount by one. So we can see that either way, the invariant is maintained.
  • To prove that the loop invariant implies the overall correctness condition, we must observe that x = 0 after being shifted right 64 times. This is the sort of situation where we’d better be clear about the rules for shifting signed vs. unsigned values in C or C++.

Of course we’ve skipped over a lot of smaller steps, but this argument captures the essence of why the algorithm works. When presenting this particular algorithm, we can safely skip the termination proof since i is not modified in the loop body.

An entertaining variant is Kernighan’s popcount:

int popcount(uint64_t x) {
  int c = 0;
  while (x) {
    x &= x - 1;
    c++;
  }
  return c;
}

I didn’t use it in this post because I feel like it’s useful to show the analysis by cases in the simpler algorithm. Wikipedia has some faster but considerably more intricate algorithms, which are the ones used in practice on processors that do not have a native popcount instruction.
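
For what it’s worth, the same loop invariant drives a proof of Kernighan’s version too; here’s a sketch with my annotations added:

int popcount(uint64_t x) {
  int c = 0;
  // c + POPCOUNT(x) == POPCOUNT(orig_x)
  while (x) {
    x &= x - 1;  // clears the lowest set bit, so POPCOUNT(x) drops by exactly one
    c++;         // ...and this restores the invariant
  }
  // x == 0, so the invariant gives c == POPCOUNT(orig_x)
  return c;
}

The termination argument is slightly less trivial than before: the loop ends because POPCOUNT(x) strictly decreases on every iteration.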

Anyway, I presented this material in a software engineering class yesterday, and realized that I’d used this example a couple of times before, so it seemed worth writing up here.

Explaining Code using ASCII Art

People tend to be visual: we use pictures to understand problems. Mainstream programming languages, on the other hand, operate in an almost completely different kind of abstract space, leaving a big gap between programs and pictures. This piece is about pictures drawn using a text character set and then embedded in source code. I love these! The other day I asked around on Twitter for more examples and the responses far exceeded expectations (thanks everyone!). There are a ton of great examples in the thread; here I’ve categorized a few of them. Clicking on an image takes you to the repository it came from.

Data Structures

One of the most common kinds of ASCII art in code is illustrating the shape of a data structure.

The example I started with comes from LLVM:

The layout of a data structure in the Jikes RVM:

A tree rotate in Musl:

Double-ended queue from the Rust library:

Swift compiler internals:

Malloc header layout:

State Machines

JavaScript profiling:

RPCs in Cloud Spanner:

I/O stream states:

Logical Structure in the Problem Domain

Control flow in an NWScript program being decompiled:

ECC internals:

Formatting numbers:

A quantum circuit:

Balancing memory management objectives in an OS kernel:

Subtyping relations (this is a very cool special case where the ASCII art is also code):

The format of a DBF file:

A lookup table for image processing:

Shape of a color function:

Structure of a URI:

A very quick tutorial on undo systems from emacs:

Geometry

Attitude control in the Apollo Guidance Computer (!!!):

Image tiling:

Boomerang trajectories in Nethack:

Rendering CSS borders:

Quadtrees:

Speed control in a milling machine:

Scrolling web pages:

I hope you enjoyed these as much as I did!

Walking or Biking to NSF

Since the National Science Foundation funds a large fraction of academic computer science research in the USA, we often end up traveling to Washington to visit the NSF. This post is just to say that if you are traveling light, if you need some exercise, if you have a bit of free time, and if the weather is good, it is feasible and pleasant to walk or bike from Reagan Washington National Airport (DCA) to the new NSF location in Alexandria.

The short version is:

  • Don’t follow Google Maps in pedestrian mode, which finds a not-great route
  • Do follow the Mount Vernon Trail, which connects DCA and Old Town Alexandria in a pretty direct fashion

I’ve walked this route in both directions; it’s roughly five miles and takes me 80 or 90 minutes (I’m a pretty fast walker so ymmv). I have not biked it, but Capital Bikeshare has several stations near DCA and quite a few stations scattered around Alexandria — so this seems like a great option, assuming that your luggage is in a backpack or something else hands-free.

Here’s an overview of the whole route:


To help orient yourself, the beltway runs across the bottom of the image, the Potomac River runs north to south, you can see DCA’s runways towards the top, and Old Town Alexandria is in the triangular area in between the Potomac, the beltway, and the red walking route.

The only potentially tricky bit is locating the Mt Vernon Trail. Starting anywhere in the airport, walk to Terminal A and go outside to the semicircle-shaped pickup/dropoff area, and then find a sidewalk going roughly southwest. The road you want to walk along is South Smith Boulevard. Don’t cross under the Metro tracks and also don’t go down a level to Thomas Avenue, which gives access to some of the buildings right next to the airport terminal. This may be helpful:

In less than half a mile, as you reach the end of the airport-related buildings, the Mount Vernon Trail, paved with a dotted yellow center line, is easy to see and you should move over to it here (open kmz files in Google Earth). This is all pretty obvious on the ground, and there are some helpful signs. Whatever you do, don’t cross the George Washington Parkway — the busy divided highway that runs nearby.

Follow the Mount Vernon Trail past Daingerfield Island to Alexandria, at which point the trail forks and the Mount Vernon Trail heads over towards the Alexandria waterfront. This is a nice walk if you have time, but it’s out of the way and you can get to NSF more quickly by moving to East Abingdon Drive here and then winding your way through residential Alexandria towards King Street and then NSF.

As long as you have some sort of map this whole route is very easy to follow, as is the reverse route back to the airport.

Synthesizing Constants

(See this blog post for a short introduction to synthesis, or this paper for a long one.)

In this piece I want to discuss an aspect of program synthesis that sounds like it should be easy, but isn’t: synthesizing constant values. For example, consider trying to synthesize an optimized x86-64 implementation of this code:

long foo(long x) {
    return x / 52223;
}

The desired output is something like:

foo(long):
        movabs  rdx, -6872094784941870951
        mov     rax, rdi
        imul    rdx
        lea     rax, [rdx+rdi]
        sar     rdi, 63
        sar     rax, 15
        sub     rax, rdi
        ret

Assume for a moment that we know that this sequence of instructions will work, and we only lack the specific constants:

foo(long):
        movabs  rdx, C1
        mov     rax, rdi
        imul    rdx
        lea     rax, [rdx+rdi]
        sar     rdi, C2
        sar     rax, C3
        sub     rax, rdi
        ret

We need to find a 64-bit constant and two 6-bit constants such that the assembly code returns the same result as the C code for every value of the function parameter. Of course there’s a well-known algorithm that GCC and LLVM use to compute these constants, but the synthesizer doesn’t have it: it must operate from first principles.

An even more difficult example is a 64-bit Hamming weight computation, where (assuming we don’t want to use vector instructions) we might hope to synthesize something like this:

popcount64(unsigned long):
        movabs  rdx, 6148914691236517205
        mov     rax, rdi
        shr     rax
        and     rax, rdx
        movabs  rdx, 3689348814741910323
        sub     rdi, rax
        mov     rax, rdi
        shr     rdi, 2
        and     rax, rdx
        and     rdi, rdx
        add     rdi, rax
        mov     rax, rdi
        shr     rax, 4
        add     rax, rdi
        movabs  rdi, 1085102592571150095
        and     rax, rdi
        movabs  rdi, 72340172838076673
        imul    rax, rdi
        shr     rax, 56
        ret

Again assuming that we know that this sequence of instructions will work, we still need to come up with 280 bits worth of constant values. One can imagine even more challenging problems where we want to synthesize an entire lookup table.

The running example I’ll use in this post is much easier:

uint32_t bar1(uint32_t x) {
  return ((x << 8) >> 16) << 8;
}

The code we want to synthesize is:

uint32_t bar2(uint32_t x) {
  return x & 0xffff00;
}

That is, we want to find C in this skeleton:

uint32_t bar3(uint32_t x) {
  return x & C;
}

Basic Constant Synthesis

We need to solve an “exists-forall” problem: does there exist a C such that the LHS (left-hand side, the original code) is equivalent to the RHS (right-hand side, the optimized code) for all values of x? In other words:
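
\[ \exists C.\; \forall x.\; \mathrm{LHS}(x) = \mathrm{RHS}(x, C) \]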

(Here we’ll play fast and loose with the fact that in the real world we check refinement instead of equivalence — this doesn’t matter for purposes of this post.)

The specific formula that we want an answer for is:
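
\[ \exists C.\; \forall x.\; ((x \ll 8) \gg 16) \ll 8 \;=\; x \mathbin{\&} C \]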

A SAT solver can natively solve either an exists query (this is the definition of SAT) or else a forall query (by seeing if there exists a solution to the negated proposition). But, by itself, a SAT solver cannot solve an exists-forall query in one go. An SMT solver, on the other hand, can natively attack an exists-forall query, but in practice we tend to get better results by doing our own quantifier elimination based only on SAT calls. First, we ask the solver if there exists a C and x that make the LHS and RHS equivalent. If not, synthesis fails. If so, we issue a second query to see if C works for all values of x. If so, synthesis succeeds. If not, we add a constraint that this value of C doesn’t work, and we start the process again. The problem is that each pair of queries only rules out a single choice for C, making this process equivalent, in the worst case, to exhaustive guessing, which we cannot do when C is wide. We need to do better.

Reducing the Number of Counterexamples using Specialization

Souper uses a technique that appears to be common knowledge among people who do this kind of work; unfortunately I don’t know where it was first published (Section 4.2 of this paper is one place it has appeared). The trick is simple: we choose a few values of x and use them to specialize both the LHS and RHS of the equation; you can think of this as borrowing some constraints from the forall phase and moving them into the exists phase (which becomes an “exists-forsome” query, if you will). In many cases this adds enough extra constraints that the solver can arrive at a workable constant within a few iterations. If we specialize x with 0 and -1 then in the general case we get:
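
\[ \exists C.\; \mathrm{LHS}(0) = \mathrm{RHS}(0, C) \;\wedge\; \mathrm{LHS}(-1) = \mathrm{RHS}(-1, C) \]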

After constant folding, our running example becomes:
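
\[ \exists C.\; (0 = 0 \mathbin{\&} C) \;\wedge\; (\mathtt{0x00FFFF00} = \mathtt{0xFFFFFFFF} \mathbin{\&} C) \]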

The first choice, 0, turns out to be an unlucky one: after further simplification it comes out to “0 = 0”, giving the solver no extra information. The second choice, on the other hand, is extremely lucky: it can be rewritten as “0x00FFFF00 = C” which simply hands us the answer. In most cases things won’t work out so nicely, but specializing the LHS and RHS with fixed inputs helps enormously in practice.

Some open questions remain: How many specific values should we try? How should these values be chosen? An obvious constraint is that the values chosen should be as different from each other as possible, though it isn’t clear what distance metric should be used. A related but less obvious criterion (this is Nuno Lopes’s idea, and we haven’t tried it out yet in Souper) is that the values should exercise as many different behaviors of each instruction as possible. For example, an addition instruction should have example inputs that overflow and also inputs that don’t. This requirement can be satisfied syntactically for instructions that are close to the inputs, but solver calls (or other symbolic methods) will be required to reach instructions that consume the outputs of other instructions. (In Souper’s case, solver calls are required anyway if there are any path conditions, which place arbitrary restrictions on the input space.) It also seems possible that we could tailor the input values to maximize how much of the RHS folds away (the LHS always folds completely down to a constant, of course).

Synthesis at Reduced Bitwidth

Instead of solving a difficult synthesis problem, we can instead solve a problem that is structurally identical, but uses narrower bitvectors (even two or three bits are often enough to capture the essence of a computation). This results in far better solver performance and, typically, a narrow synthesis result that does not contain constants will also work at the original, wider bitwidth (though obviously this needs to be verified by the solver). When the narrow result has constants, there’s a problem: we lack a principled method for deriving the wider constants from the narrow ones. One approach is to use heuristics. This paper suggests the following methods:

Here you should read BV(x,y) as “a bitvector of length y holding value x.” So, for example, rule 1 would extend an 8-bit variable containing 8 to a 16-bit variable containing 16. Rule 4 would extend an 8-bit variable containing 8 to a 16-bit variable containing 8. These seem reasonable, but notice that none of them would help with our running example.
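
As an illustration only (the text above spells out the effect of rules 1 and 4 but not their general form, so this is just one plausible reading, and the remaining rules are omitted), two such widening heuristics might look like:

#include <cstdint>

// Hypothetical helpers for widening a constant found at a narrow bitwidth.
// Rule-1-style: scale the value in proportion to the bitwidth, so 8 at
// 8 bits becomes 16 at 16 bits.
uint64_t widen_by_scaling(uint64_t c, unsigned narrow_bits, unsigned wide_bits) {
  return c * (wide_bits / narrow_bits);
}

// Rule-4-style: keep the value and zero-extend it, so 8 at 8 bits stays
// 8 at 16 bits.
uint64_t widen_by_zero_extension(uint64_t c, unsigned narrow_bits,
                                 unsigned wide_bits) {
  (void)narrow_bits;
  (void)wide_bits;
  return c;
}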

In Souper we have the additional problem that the code we’re trying to optimize usually contains constants, presenting us with a constant-narrowing problem that is analogous to the constant-widening problem we just discussed, but worse, because now any dodgy heuristics we come up with sit in front of the expensive synthesis process instead of behind it. It isn’t clear to us that there’s any satisfactory solution to this problem.

Getting More out of Counterexamples

In standard constant synthesis, each failed guess only adds the constraint that that particular constant doesn’t work — this fails to make much of a dent in an exponential search space. This paper and this one propose extracting additional information from each counterexample, cutting out a larger part of the search space. This also seems like an excellent direction.

Symbolic Constants

An alternate approach that avoids both the narrowing and widening problems would be to treat constants symbolically. This, however, creates two new problems: deriving preconditions for optimizations, and deriving functions that compute the RHS constants in terms of elements of the LHS (constants, widths, etc.). It seems clear that in some cases this will be very difficult; for example, imagine trying to derive, from first principles, the algorithm for computing the output constants for an arbitrary value of the LHS constant C:

long foo(long x) {
    return x / C;
}

Alive-Infer helps derive preconditions but it does not yet try to come up with the functions for computing constants in the synthesized output.

Since in some use cases (including my “destroy InstSimplify and InstCombine” dream) we want the generalized optimization anyway, this seems like a great direction for future work.

Conclusion

All of the techniques described in this post could be implemented inside the solver or outside it. Solvers like CVC4 already have sophisticated internal support for synthesis. The advantage of pushing more work into the solver is that it can exploit internal levers to guide the search, reuse data structures across synthesis iterations, and do other clever things that aren’t exposed by the solver’s API. The disadvantage is that we might have high-level information about the problem domain that would help the solver do its job better, but that is difficult to communicate to it effectively.

In summary, constant synthesis is a hard problem that hasn’t received a lot of attention yet. My group is actively working on this and we have some ideas that aren’t mature enough to share yet. Please drop me a line if you know of any good techniques that I’ve missed here. Also, I’d be interested to hear from anyone working on benchmarks for constant synthesis.

Finally, tools such as Rosette and Sketch know how to synthesize constants.

Learning When Values are Changed by Implicit Integer Casts

C and C++ perform implicit casts when, for example, you pass an integer-typed variable to a function that expects a different type. When the target type is wider, there’s no problem, but when the target type is narrower, or is the same width but of the opposite signedness, integer values may silently change when the type changes. For example, this program:

#include <iostream>

int main() {
  int x = -3;
  unsigned y = x;
  std::cout << y << "\n";
}

prints 4294967293. Like unsigned integer wraparound (and unlike signed integer overflow), these changes of value are not undefined behavior, but they may be unintentional and may also be bugs. Clang recently gained support for dynamically detecting these value changes and either emitting a diagnostic or terminating the program. Fine-grained flags are available for controlling the individual checks, but you can enable all of them (plus some others, such as the signed overflow and unsigned wraparound checks) using -fsanitize=integer.

Now we get:

$ clang++ implicit2.cpp -fsanitize=integer
$ ./a.out 
implicit2.cpp:5:16: runtime error: implicit conversion from type 'int' of value -3 (32-bit, signed) to type 'unsigned int' changed the value to 4294967293 (32-bit, unsigned)
4294967293
$ 

Here’s a truncation example:

int main() {
  int x = 300;
  unsigned char c = x;
}

It gives:

$ clang++ implicit3.cpp -fsanitize=integer
$ ./a.out 
implicit3.cpp:3:21: runtime error: implicit conversion from type 'int' of value 300 (32-bit, signed) to type 'unsigned char' changed the value to 44 (8-bit, unsigned)
$

To suppress the diagnostic, we can make the conversion explicit using a cast:

int main() {
  int x = 300;
  unsigned char c = (unsigned char)x;
}

Different parts of this functionality landed in Clang before and after the 7.0.0 release; to get everything, you will want to build Clang+LLVM from a revision later than 1 Nov 2018 (or else wait for the 8.0.0 release in a few months).

How would you use these checks? Generally, they should be part of a testing campaign, perhaps in support of a code audit. If you enable them on a non-trivial code base, you will run into diagnostics that do not correspond to bugs, just because real C and C++ programs mix integer types so freely. I suggest starting with -fsanitize=implicit-integer-truncation. Let’s look at doing this to Clang+LLVM itself and then compiling a hello world program: here’s the output.

I’d be happy to hear about any interesting bugs located using these checks, if anyone wants to share.

Finally, see this tweet by Roman Lebedev (the author of these checks) showing that the runtime impact of -fsanitize-trap=implicit-conversion is very low.

What’s the difference between an integer and a pointer?

(This piece is an alternate introduction and advertisement for a soon-to-be-published research paper.)

In an assembly language we typically don’t have to worry very much about the distinction between pointers and integers. Some instructions happen to generate addresses whereas others behave arithmetically, but underneath there’s a single data type: bitvectors. At the opposite end of the PL spectrum, a high-level language won’t offer opportunities for pointer/integer confusion because the abstractions are completely firewalled off from each other. Also, of course, a high-level language may choose not to expose anything that resembles a pointer.

The problematic case, then, is the performance-oriented systems programming language: C, C++, Rust, and a few others. The issue is that:

  1. When possible, they pretend to be high-level, in order to justify optimizations that would be unsound at the assembly level. For example, they assume that a pointer derived from one heap allocation cannot alias a pointer derived from a different heap allocation (see the sketch just after this list).
  2. When necessary, they allow the programmer to manipulate pointers in low-level ways, such as inspecting the representation of a pointer, creating a new pointer at some offset from an existing one, storing unrelated data in unused parts of a pointer, and even fabricating a pointer out of whole cloth.
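
To make point 1 concrete, here is a minimal hypothetical example (mine, not from the paper): because p and q come from distinct heap allocations, a compiler is allowed to assume that the store through q cannot change *p, and it will typically fold the final load down to the constant 1.

#include <cstdio>
#include <cstdlib>

// The two malloc results are assumed not to alias, so the store through q
// cannot affect *p; an optimizing compiler may return the constant 1 here
// without reloading *p.
int distinct_allocations() {
  int *p = static_cast<int *>(std::malloc(sizeof(int)));
  int *q = static_cast<int *>(std::malloc(sizeof(int)));
  if (!p || !q)
    return -1;
  *p = 1;
  *q = 2;          // assumed not to alias *p
  int result = *p; // may be folded to 1 at compile time
  std::free(p);
  std::free(q);
  return result;
}

int main() {
  std::printf("%d\n", distinct_allocations());
}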

These goals conflict: the better a language is for justifying high-level optimizations, the worse it seems to be for implementing an OS kernel, and vice versa. There are a few ways we might deal with this:

  • We could require the high-level language to call out to low-level code written in a different language, as Java does with its JNI.
  • We could create a special low-level mode for a safe language where it can break the normal rules, as Rust does with its unsafe code.
  • We could go ahead and freely mix high-level and low-level code, hoping that the compiler can sort it all out, as C and C++ do.

Let’s look at an example:

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h> // for uintptr_t

int main(void) {
  int *x = malloc(4);
  *x = 3;
  int *y = malloc(4);
  *y = 5;
  uintptr_t xi = (uintptr_t)x;
  uintptr_t yi = (uintptr_t)y;
  uintptr_t diff = xi - yi;
  printf("diff = %ld\n", (long)diff);
  int *p = (int *)(yi + diff);
  *p = 7;
  printf("x = %p, *x = %d\n", x, *x);
  printf("y = %p, *y = %d\n", y, *y);
  printf("p = %p, *p = %d\n", p, *p);
}

When run (on my Mac) it gives:

$ clang-6.0 -O3 mem1d.c ; ./a.out
diff = -96
x = 0x7fcb0ec00300, *x = 7
y = 0x7fcb0ec00360, *y = 5
p = 0x7fcb0ec00300, *p = 7
$ gcc-8 -O3 mem1d.c ; ./a.out
diff = -96
x = 0x7f93b6400300, *x = 7
y = 0x7f93b6400360, *y = 5
p = 0x7f93b6400300, *p = 7

What happened here? First, we grabbed a couple of heap cells and initialized their contents. Second, we used integer math to fabricate a pointer p that has the same value as x, and stored through that pointer. Third, we asked the compiler what contents it thinks are in the locations pointed to by x, y, and p. Both GCC and Clang/LLVM are of the opinion that the store through p changed the value referred to by x. This is probably the result that we had hoped for, but let’s look into it a bit more deeply.

First, does this C program execute undefined behavior? It does not! Computing out-of-bounds pointers is UB but we are free to do whatever we like with integer-typed values that were derived from pointers. Second, does the C standard mandate the behavior that we saw here? It does not! In fact it gives only very sparse guidance about the behavior of integers derived from pointers and vice versa. You can read about that here.

Now let’s modify the program a tiny bit:

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h> // for uintptr_t

int main(void) {
  int *x = malloc(4);
  *x = 3;
  int *y = malloc(4);
  *y = 5;
  uintptr_t xi = (uintptr_t)x;
  uintptr_t yi = (uintptr_t)y;
  uintptr_t diff = xi - yi;
  printf("diff = %ld\n", (long)diff);
  int *p = (int *)(yi - 96); // <--- CHANGED!
  *p = 7;
  printf("x = %p, *x = %d\n", x, *x);
  printf("y = %p, *y = %d\n", y, *y);
  printf("p = %p, *p = %d\n", p, *p);
}

The only difference is on the line marked CHANGED: instead of adding diff to yi to compute p, we plugged in the observed value of the difference, -96. Now we’ll run it:

$ clang-6.0 -O3 mem1d2.c ; ./a.out
diff = -96
x = 0x7ff21ac00300, *x = 7
y = 0x7ff21ac00360, *y = 5
p = 0x7ff21ac00300, *p = 7
$ gcc-8 -O3 mem1d2.c ; ./a.out
diff = -96
x = 0x7fce5e400300, *x = 3
y = 0x7fce5e400360, *y = 5
p = 0x7fce5e400300, *p = 7

Clang behaves the same as before, but GCC no longer believes that storing through p results in x being overwritten, despite the fact that x and p have the same value. What is going on is that as far as GCC is concerned, since we computed the address of x through guesswork instead of through an actual observation, our guess is effectively illegitimate. (If this guessing game is too simple for you, there are entertaining variations that avoid the need for information from a previous run of the program, such as guessing addresses using an exhaustive loop or making huge allocations so that the resulting addresses are constrained by the pigeonhole principle.) Here again we are operating well outside of the C standard and are rather in a zone where compiler writers need to make decisions that balance optimization power against developers’ expectations. I like this example because it is weird and perhaps surprising.

What should we take away from this? First, it is a big mistake to try to understand pointers in a programming language as if they follow the same rules as pointer-sized integer values. Even people who have come to terms with UB and its consequences are often surprised by this. Second, these issues are not specific to C and C++, but rather are going to create problems in any language that wants highly optimizing compilers but also permits low-level pointer manipulation.

So what are the actual rules that Clang and GCC are following when compiling these examples? The short answer is that it’s complicated and not well documented. Some useful discussion can be found in this piece about pointer provenance. For a longer answer that is specific to LLVM, see this paper that will get presented at OOPSLA in November. Also, one of my coauthors wrote a blog post about this material.

Future Directions for Optimizing Compilers

I wanted to write a manifesto-ish sort of piece about what compilers should look like in the future. Nuno was the obvious coauthor, since I’ve talked with him about this topic so much that I’m no longer sure which parts started out as his ideas and which were mine. The article didn’t end up feeling like a good fit for a series of blog posts, so we’ve placed it on arXiv: Future Directions for Optimizing Compilers. The first paragraph of the paper gives the main idea:

As software becomes larger, programming languages become higher-level, and processors continue to fail to be clocked faster, we’ll increasingly require compilers to reduce code bloat, eliminate abstraction penalties, and exploit interesting instruction sets. At the same time, compiler execution time must not increase too much and also compilers should never produce the wrong output. This paper examines the problem of making optimizing compilers faster, less buggy, and more capable of generating high-quality output.

We plan to keep revising the paper and would be happy to receive feedback on it.