It’s Time for a Modern Synthesis Kernel

Alexia Massalin’s 1992 PhD thesis has long been one of my favorites. It argues that operating systems can be made much more efficient than the systems of its day through runtime code generation, lock-free synchronization, and fine-grained scheduling. In this piece we’ll only look at runtime code generation, which can be cleanly separated from the other aspects of this work.

Runtime Code Generation in Ring 0

The promise of kernel-mode runtime code generation is that we can have very fast, feature-rich operating systems by, for example, not including code implementing generic read() and write() system calls, but rather synthesizing code for these operations each time a file is opened. The idea is that at file open time, the OS has a lot of information that it can use to generate highly specialized code, eliding code paths that are provably not going to execute.

Runtime code generation was a well-known idea in 1992, but it wasn’t used nearly as widely as it is today. In 2019, of course, just-in-time compilers are ubiquitous. However, operating system kernels still do not use runtime code generation very much, with a few exceptions such as:

  • several OS kernels, including Linux, have a simple JIT compiler in their BPF implementation
  • VMware used to use dynamic code generation to performantly virtualize OS kernels on x86 chips that lacked hardware virtualization extensions; I doubt that this is commonly used any longer
  • pre-NT Windows kernels would dynamically generate bitblit code. I learned this in a talk by a VMware employee; this code generation was apparently a debugging issue for VMware since it would fight with their own runtime code generator. Some details can be found in this post. The old paper about the origins of this technique in the Xerox Alto is a classic.
  • TempleOS, as explained in this nice writeup, made heavy use of dynamic code generation

Anyway, back to Synthesis. The OS and code generators were all written, from scratch, in 68020 assembly language. How do we translate Massalin’s ideas to 2019? Most likely by reusing an existing code generator and OS. For most of this piece I’ll assume that that’s what we want to do, but we’ll also briefly touch on customized alternatives.

Code Generator Requirements

The particular technology that we use for runtime code generation isn’t that important, but for now let’s imagine using LLVM. This means that the pieces of the kernel that we wish to specialize will need to be shipped as bitcode, and then we’ll ask LLVM to turn it into object code as needed. LLVM has lots of great optimization passes, from which we could pick a useful subset, and it is not hard to use in JIT mode. On the other hand, LLVM isn’t as fast as we’d like and also it has a large footprint. In production we’d need to think carefully whether we wanted to include a big chunk of non-hardened code in the kernel.

What optimizations are we expecting the code generator to perform? Mostly just the basic ones: function inlining, constant propagation, and dead code elimination, followed by high-quality instruction selection and register allocation. The hard part, as we’re going to see, is convincing LLVM that it is OK to perform these optimizations as aggressively as we want. This is an issue that Massalin did not need to confront: her kernel was designed in such a way that she knew exactly what could be specialized and when. Linux, on the other hand, was obviously not created with staged compilation in mind, and we’re going to have to improvise somewhat if we want this to work well.
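
To make those basic optimizations concrete, here is a small user-space sketch, not kernel code. The toy_file struct, the TOY_APPEND flag, and both functions are invented for illustration. toy_write_generic() re-tests a flag on every call; toy_write_append_only() is a hand-written stand-in for the residual code a specializer could plausibly emit once it knows the flag is set and will stay set.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical flag and file object; these are not Linux definitions. */
enum { TOY_APPEND = 1u << 0 };

struct toy_file {
	unsigned flags;	/* fixed at open time */
	size_t pos;	/* current offset */
	size_t size;	/* bytes stored so far, always <= sizeof data */
	char data[64];
};

/* Generic path: consults flags on every call. */
long toy_write_generic(struct toy_file *f, const char *buf, size_t n)
{
	assert(f != NULL && buf != NULL);
	if (f->flags & TOY_APPEND)
		f->pos = f->size;
	if (f->pos > sizeof f->data || n > sizeof f->data - f->pos)
		return -1;
	memcpy(f->data + f->pos, buf, n);
	f->pos += n;
	if (f->pos > f->size)
		f->size = f->pos;
	return (long)n;
}

/*
 * Stand-in for specializer output, assuming flags == TOY_APPEND for the
 * lifetime of the file: the flag test is folded away, and because pos is
 * always reset to size, the first bounds check is dead code.
 */
long toy_write_append_only(struct toy_file *f, const char *buf, size_t n)
{
	assert(f != NULL && buf != NULL);
	if (n > sizeof f->data - f->size)
		return -1;
	memcpy(f->data + f->size, buf, n);
	f->size += n;
	f->pos = f->size;
	return (long)n;
}
```

The two functions behave identically on any file opened with TOY_APPEND, which is the property a real specializer would have to preserve.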

My guess is that while LLVM would be great for prototyping purposes, for deployment we’d probably end up either reusing a lighter-weight code generator or else creating a new one that is smaller, faster, and more suitable for inclusion in the OS. Performance of runtime code generation isn’t just a throughput issue; there’ll also be latency problems if we’re not careful. We need to think about the impact on security, too.

Example: Specializing write() in Linux

Let’s assume that we’ve created a version of Linux that is capable of generating a specialized version of the write() system call for a pipe. This OS needs — but we won’t discuss — a system call dispatch mechanism to rapidly call the specialized code when it is available. In Synthesis this was done by giving each process its own trap vector.

Before we dive into the code, let’s be clear about what we’re doing here: we are pretending to be the code generator that is invoked to create a specialized write() method. Probably this is done lazily at the time the system call is first invoked using the new file descriptor. The specialized code can be viewed as a cached computation, and as a bonus this cache is self-invalidating: it should be valid as long as the file descriptor itself is valid. (But later we’ll see that we can do a better job specializing the kernel if we support explicit invalidation of runtime-generated code.)
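
The lazy, self-invalidating cache described above can be sketched in user-space C. Everything here is hypothetical: toy_proc, toy_open(), toy_close(), toy_write(), and the two write variants are invented names, and "generating code" is simulated by picking a precompiled function, since portable C cannot emit machine code.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_FDS 8

enum toy_kind { TOY_NONE = 0, TOY_PIPE, TOY_REGULAR };

typedef long (*toy_write_fn)(int fd, const char *buf, size_t n);

struct toy_proc {
	enum toy_kind kind[TOY_MAX_FDS];	/* what each descriptor refers to */
	toy_write_fn fast_path[TOY_MAX_FDS];	/* cached specialized code, or NULL */
	unsigned generated;			/* how often code was "generated" */
};

/* Stand-ins for specialized code; they only report how many bytes they took. */
long toy_pipe_write(int fd, const char *buf, size_t n)
{
	(void)fd; (void)buf;
	return (long)n;
}

long toy_regular_write(int fd, const char *buf, size_t n)
{
	(void)fd; (void)buf;
	return (long)n;
}

/* Simulated code generator: choose a variant from what is known about fd. */
toy_write_fn toy_specialize(struct toy_proc *p, int fd)
{
	assert(fd >= 0 && fd < TOY_MAX_FDS);
	p->generated++;
	return p->kind[fd] == TOY_PIPE ? toy_pipe_write : toy_regular_write;
}

int toy_open(struct toy_proc *p, enum toy_kind kind)
{
	for (int fd = 0; fd < TOY_MAX_FDS; fd++) {
		if (p->kind[fd] == TOY_NONE) {
			p->kind[fd] = kind;
			p->fast_path[fd] = NULL;	/* generate lazily, on first write */
			return fd;
		}
	}
	return -1;
}

/* Closing the descriptor is the only invalidation this cache needs. */
void toy_close(struct toy_proc *p, int fd)
{
	assert(fd >= 0 && fd < TOY_MAX_FDS);
	p->kind[fd] = TOY_NONE;
	p->fast_path[fd] = NULL;
}

long toy_write(struct toy_proc *p, int fd, const char *buf, size_t n)
{
	if (fd < 0 || fd >= TOY_MAX_FDS || p->kind[fd] == TOY_NONE)
		return -1;	/* plays the role of -EBADF */
	if (p->fast_path[fd] == NULL)
		p->fast_path[fd] = toy_specialize(p, fd);
	return p->fast_path[fd](fd, buf, n);
}
```

The first toy_write() on a descriptor pays for specialization and later calls reuse the cached pointer. A real system would also need the fast dispatch mechanism mentioned earlier, which this sketch folds into an ordinary function call.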

If you want to follow along at home, I’m running Linux 5.1.14 under QEMU, using these instructions to single-step through kernel code, and driving the pipe logic using this silly program.

Skipping over the trap handler and such, ksys_write() is where things start to happen for real:

ssize_t ksys_write(unsigned int fd, const char __user *buf, size_t count)
{
	struct fd f = fdget_pos(fd);
	ssize_t ret = -EBADF;

	if (f.file) {
		loff_t pos = file_pos_read(f.file);
		ret = vfs_write(f.file, buf, count, &pos);
		if (ret >= 0)
			file_pos_write(f.file, pos);
		fdput_pos(f);
	}

	return ret;
}

At this point the “fd” parameter can be treated as a compile-time constant, but of course “buf” and “count” cannot. If we turn “fd” into a constant, will LLVM be able to propagate it through the remaining code? It will, as long as:

  1. We inline all function calls.
  2. Nobody takes the address of “fd”.

It’s not that calls and pointers will always block the optimizer, but they complicate things by bringing interprocedural analysis and pointer analysis into the picture.

Our goal is going to be to see whether the code generator can infer the contents of the struct returned from fdget_pos(). (You might wonder why performance-sensitive code is returning a “struct fd” by value. Turns out this struct only has two members: a pointer and an integer.)

The call to fdget_pos() goes to this code:

static inline struct fd fdget_pos(int fd)
{
	return __to_fd(__fdget_pos(fd));
}

and then here:

unsigned long __fdget_pos(unsigned int fd)
{
	unsigned long v = __fdget(fd);
	struct file *file = (struct file *)(v & ~3);

	if (file && (file->f_mode & FMODE_ATOMIC_POS)) {
		if (file_count(file) > 1) {
			v |= FDPUT_POS_UNLOCK;
			mutex_lock(&file->f_pos_lock);
		}
	}
	return v;
}

and then (via a trivial helper that I’m not showing) here:

static unsigned long __fget_light(unsigned int fd, fmode_t mask)
{
	struct files_struct *files = current->files;
	struct file *file;

	if (atomic_read(&files->count) == 1) {
		file = __fcheck_files(files, fd);
		if (!file || unlikely(file->f_mode & mask))
			return 0;
		return (unsigned long)file;
	} else {
		file = __fget(fd, mask, 1);
		if (!file)
			return 0;
		return FDPUT_FPUT | (unsigned long)file;
	}
}
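
An aside on the `v & ~3` mask in __fdget_pos() and the `FDPUT_FPUT | (unsigned long)file` return value above: both rely on low-bit pointer tagging. A suitably aligned pointer has zero low bits, so a couple of flags can ride along in the same word. Here is a minimal user-space sketch of the trick, with invented names (toy_obj, TOY_FLAG_A, TOY_FLAG_B, toy_pack(), toy_unpack()) standing in for the kernel's:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical flag bits carried in the two low bits of the word. */
#define TOY_FLAG_A 1u
#define TOY_FLAG_B 2u

struct toy_obj { long payload; };	/* toy_pack() checks the alignment it needs */

/* Two-member result: a pointer plus a small integer of flags. */
struct toy_ref {
	struct toy_obj *obj;
	unsigned int flags;
};

/* Pack a pointer and up to two flag bits into one word. */
uintptr_t toy_pack(struct toy_obj *obj, unsigned int flags)
{
	uintptr_t v = (uintptr_t)obj;

	assert((v & 3) == 0);		/* requires at least 4-byte alignment */
	assert((flags & ~3u) == 0);	/* only two bits are available */
	return v | flags;
}

/* Split the word back into its pointer and flag parts. */
struct toy_ref toy_unpack(uintptr_t v)
{
	struct toy_ref r = {
		.obj = (struct toy_obj *)(v & ~(uintptr_t)3),
		.flags = (unsigned int)(v & 3),
	};
	return r;
}
```

If a specializer can prove the packed word is constant, both the mask and any later tests of the flag bits become candidates for folding.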

Keep in mind that up to here, we haven’t seen any optimization blockers. In __fget_light(), we run into our first interesting challenge: “current” is a macro that returns a pointer to the running process’s PCB (in Linux the PCB, or process control block, is a “task_struct” but I’ll continue using the generic term). The current macro ends up being a tiny bit magical, but its end result can be treated as a constant within the context of a given process. There is no way a code generator like LLVM will be able to reach this conclusion, so we’ll need to give it some help, perhaps by annotating certain functions, macros, and struct fields as returning values that are constant over a given scope. This is displeasing but it isn’t clear there’s any easier or better way to achieve our goal here. The best we can hope for is that the annotation burden is close to proportional to the number of data types in the kernel; if it ends up being proportional to the total amount of code then our engineering effort goes way up.
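
One possible shape for such annotations, sketched in user-space C. The SPEC_CONST_PROCESS macro and its annotation string are hypothetical; nothing like them exists in Linux. Under Clang the macro expands to the annotate attribute, which attaches a string to a declaration so that a later pass can look for it; under other compilers it expands to nothing. The toy structs only mimic the shape of the PCB and its file table.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical annotation meaning "constant for the lifetime of the
 * current process". Not a real Linux macro.
 */
#if defined(__clang__)
#define SPEC_CONST_PROCESS __attribute__((annotate("spec_const:process")))
#else
#define SPEC_CONST_PROCESS /* expands to nothing elsewhere */
#endif

/* Toy stand-ins that only mimic the shape of files_struct and task_struct. */
struct toy_files {
	int count;	/* a reference count: deliberately NOT annotated */
};

struct toy_task {
	struct toy_files *files SPEC_CONST_PROCESS;	/* fixed once the process is set up */
	int pid;
};

struct toy_task toy_current_task;	/* stand-in for the running task */

/* Stand-in for `current`; the annotation marks its result as per-process constant. */
SPEC_CONST_PROCESS struct toy_task *toy_current(void)
{
	return &toy_current_task;
}

/* If both annotations are trusted, this chain of loads folds to one address. */
struct toy_files *toy_current_files(void)
{
	struct toy_task *t = toy_current();

	assert(t != NULL);
	return t->files;
}
```

The annotations are inert as far as the C semantics go. The assumption is that a specializing pass would read them and treat the marked loads as constants when generating code for a particular process.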

Now, assuming that we can treat “current” as a compile-time constant, we’re immediately faced with a similar question: is the “files” field of the PCB constant? It is (once the process is initialized) but again there’s not going to be any easy way for our code generator to figure this out; we’ll need to rely on another annotation.

Continuing, the “count” field of files is definitely not a constant: this is a reference count on the process’s file descriptor table. A single-threaded Linux process will never see count > 1, but a multi-threaded process will. (Here we need to make the distinction between open file instances, which are shared following a fork, and the file descriptor table, which is not.) The fast path here is exploiting the insight that if our process is single-threaded we don’t need to worry about locking the file descriptor table, and moreover the process is not going to stop being single-threaded during the period where we rely on that invariant, because we trust the currently running code to not do the wrong thing.

Here our specializing compiler has a fun policy choice to make: should it specialize for the single threaded case? This will streamline the code a bit, but it requires the generated code to be invalidated later on if the process does end up becoming multithreaded — we’d need some collection of invalidation hooks to make that happen.
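
A minimal sketch of what such invalidation hooks might look like, again in user-space C with invented names (toy_spec, the TOY_EV_* events, toy_register(), toy_raise_event()). Each piece of generated code records the events that would break its assumptions; raising an event drops every dependent fast path so the next call falls back to the generic code.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical events that can falsify a specialization assumption. */
enum {
	TOY_EV_THREAD_CREATED = 1u << 0,	/* process is no longer single-threaded */
	TOY_EV_FD_CLOSED      = 1u << 1,
};

typedef int (*toy_fn)(int arg);

struct toy_spec {
	toy_fn fast;		/* specialized code, or NULL once invalidated */
	toy_fn generic;		/* always-correct fallback */
	unsigned depends_on;	/* mask of events that invalidate `fast` */
};

#define TOY_MAX_SPECS 4
struct toy_spec *toy_registry[TOY_MAX_SPECS];
size_t toy_registry_len;

/* Register generated code together with the events it depends on. */
int toy_register(struct toy_spec *s)
{
	assert(s != NULL && s->generic != NULL);
	if (toy_registry_len == TOY_MAX_SPECS)
		return -1;
	toy_registry[toy_registry_len++] = s;
	return 0;
}

/* Invalidation hook: drop every fast path that assumed `event` never happens. */
void toy_raise_event(unsigned event)
{
	for (size_t i = 0; i < toy_registry_len; i++)
		if (toy_registry[i]->depends_on & event)
			toy_registry[i]->fast = NULL;
}

/* Dispatch: use the specialized code while it is still valid. */
int toy_call(const struct toy_spec *s, int arg)
{
	return s->fast ? s->fast(arg) : s->generic(arg);
}

/* Stand-ins: same result, but counters show which path actually ran. */
unsigned toy_locked_calls, toy_lockless_calls;
int toy_lookup_locked(int fd)   { toy_locked_calls++; return fd; }
int toy_lookup_lockless(int fd) { toy_lockless_calls++; return fd; }
```

A real kernel would also have to make sure no CPU is still executing the code being dropped, and would need its own synchronization around the registry; this sketch ignores both.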

Anyhow, let’s continue into __fcheck_files():

static inline struct file *__fcheck_files(struct files_struct *files, unsigned int fd)
{
	struct fdtable *fdt = rcu_dereference_raw(files->fdt);

	if (fd < fdt->max_fds) {
		fd = array_index_nospec(fd, fdt->max_fds);
		return rcu_dereference_raw(fdt->fd[fd]);
	}
	return NULL;
}

At this point we’re in deep “I know what I’m doing” RCU territory and I’m going to just assume we can figure out a way for the code generator to do what we want, which is to infer that this function returns a compile-time-constant value. I think this’ll work out in practice, since even if the open file instance is shared across processes, the file cannot be truly closed until its reference count goes to zero. Anyway, let’s move forward.

Next, we’re back in __fget_light() and then __fdget_pos(): our code generator should be able to easily fold away the remaining branches in these functions. Finally, we return to ksys_write(), just after the call to fdget_pos(), and we know what the struct fd contains, making it possible to continue specializing aggressively. I don’t think making this example any longer will be helpful; hopefully the character of the problems we’re trying to solve is now apparent.

In summary, we saw four kinds of variables in this exercise:

  1. Those such as the “fd” parameter to write() that the code generator can see are constant at code generation time.
  2. Those such as the “current” pointer that are constant, but where the code generator cannot see this fact for one reason or another. To specialize these, we’ll have to give the compiler extra information, for example using annotations.
  3. Those such as the “count” field of the “files_struct” that are not actually constant, but that seem likely enough to remain constant that we may want to create a specialized version treating them as constants, and then be ready to invalidate this code if the situation changes.
  4. Those that are almost certainly not worth trying to specialize. For example, the “count” parameter to write() is not likely to remain constant over a number of calls.
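
The four categories map fairly directly onto what a specializer would do with each value. A tiny sketch of that mapping, with invented enum and function names; the `optimistic` parameter is an assumption standing in for the policy choice about category 3:

```c
#include <assert.h>

/* The four kinds of values from the list above; names are invented here. */
enum toy_binding {
	TOY_VISIBLY_CONSTANT,	/* 1: e.g. the fd argument */
	TOY_ANNOTATED_CONSTANT,	/* 2: e.g. the current pointer */
	TOY_LIKELY_CONSTANT,	/* 3: e.g. the files_struct reference count */
	TOY_DYNAMIC,		/* 4: e.g. the count argument */
};

enum toy_action {
	TOY_FOLD,		/* substitute the value, no further work */
	TOY_FOLD_IF_ANNOTATED,	/* substitute only when an annotation vouches for it */
	TOY_FOLD_WITH_GUARD,	/* substitute, and arrange for invalidation */
	TOY_LEAVE,		/* keep it as a runtime value */
};

/* Decide how to treat a value, given whether we specialize optimistically. */
enum toy_action toy_policy(enum toy_binding b, int optimistic)
{
	switch (b) {
	case TOY_VISIBLY_CONSTANT:
		return TOY_FOLD;
	case TOY_ANNOTATED_CONSTANT:
		return TOY_FOLD_IF_ANNOTATED;
	case TOY_LIKELY_CONSTANT:
		return optimistic ? TOY_FOLD_WITH_GUARD : TOY_LEAVE;
	case TOY_DYNAMIC:
		return TOY_LEAVE;
	}
	assert(0 && "unknown binding kind");
	return TOY_LEAVE;
}
```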

Writing one byte to a pipe from a single-threaded process executes about 3900 instructions on Linux 5.1.14 (this is just in ksys_write(), I didn’t measure the trapping and untrapping code). The Synthesis thesis promises an order of magnitude performance improvement. Can specialization reduce the fast path on this system call to 390 instructions? It would be fun to find out.

I’ll finish up this example by observing that even though I chose to present code from the filesystem, I think it’s network stack code that will benefit from specialization the most.

Discussion

I have some experience with OS kernels other than Linux, and my belief is that attempting to dynamically specialize any mainstream, production-grade OS other than Linux would run into the same issues we just saw above. At the level the code generator cares about, there just isn’t much effective difference between these OSes: they’re all big giant blobs of C with plentiful indirection and domain-specific hacks.

If our goal is only to create a research-grade prototype, it would be better to start with something smaller than Linux/Windows/Darwin so that we can refactor specialization-unfriendly parts of the OS in a reasonable amount of time. xv6 is at the other extreme: it is super easy to hack on, but it is so incredibly over-simplified that it could not be used to test the hypothesis “a realistic OS can be made much faster using specialization.” Hilariously, an xv6+LLVM system would be about 0.15% OS code and 99.85% compiler code. Perhaps there’s a middle ground that would be a better choice, Minix or OpenBSD or whatever.

Given two developers, one who knows LLVM’s JIT interfaces and one who’s a good Linux kernel hacker, how long would it take to bring up a minimally ambitious dynamically specializing version of Linux? I would guess this could be done in a week or two, there’s not really anything too difficult about it (it’s easy to say this while blogging, of course). The problem is that this would not give good results: only the very easiest specialization opportunities will get spotted by the runtime code generator. But perhaps this would generate enough interest that people would keep building on it.

Do we want to do specialization work on C code? No, not really, it’s just that every one of our production-grade kernels is already written in it. A fun but engineering-intensive alternative would be to create a new, specialization-friendly kernel in whatever programming language looks most suitable. Functional languages should offer real advantages here, but of course there are issues in using these languages to create a performant OS kernel. Perhaps Mirage is a good starting point here, it is already all about specialization — but at system build time, not at runtime.

An ideal programming environment for a modern Synthesis kernel would provide tool and/or language support for engineering specialization-friendly kernel code. For example, we would identify a potential specialization point and then the tools would use all of our old friends — static analysis, dynamic analysis, symbolic execution, etc. — to show us what data items fall into each of the four categories listed in the last section, and provide us with help in refactoring the system so that specialization can work better. A tricky thing here is taking into account the different kinds of concurrency and synchronization that happen in a sophisticated OS.

Some useful questions to ask (and of course we’re always asking these same things when doing OS and compiler research) are: How are we supposed to think about a dynamically specializing OS kernel? What are the new abstractions, if any? Specialization could really benefit from some sort of first-class “code region over which these values are effectively constant” and then also “but the constant-ness is invalidated by this set of events.”

Why Now?

The literature on dynamic specialization of OS code is interesting: it looks like there was a flurry of interest inspired by Synthesis in the mid/late 90s. Many of these papers had Calton Pu, Massalin’s thesis supervisor, on the author list. Not a whole lot has happened in this area since then, as far as I know. The only paper I can think of about optimistic OS specialization is this one; it’s a nice paper, I recommend it. Static OS specialization, on the other hand, is what unikernels are all about, so there’s been quite a bit of work done on this.

It seems like time to revive interest in dynamic OS specialization because:

  • Most of the processor speed wins lately are application specific; the cores that execute OS code are not getting noticeably faster each year, nor do they seem likely to. In fact, way back in 1989 John Ousterhout argued that increases in processor speed weren’t benefiting OS code as much as other kinds of code.
  • OSes have slowed down recently to mitigate side channel attacks. Maybe we can get some of that speed back using dynamic specialization.
  • OSes are way bloatier than they were in the 90s, increasing the potential benefits due to specialization.
  • Compiler technology is far ahead of where it was in the 90s, with off-the-shelf toolkits like LLVM providing high-quality solutions to many of the problems we’d run into while prototyping this work.

I’d like to thank Perry Metzger who suggested this piece and also provided feedback on a draft of it. Perry worked with Alexia back in the day and hopefully he’ll also write about this topic.

Finally, I don’t want to give the impression that I’m summarizing a research proposal or an in-progress project. This is the kind of thing I love to think about, is all.

Join the Conversation

17 Comments

  1. This kind of makes me wish things were written in C++. As far as experimenting goes C++ templates would offer an interesting opportunity to test the effect of specializing over values in small domains. Say take a function like this:

    int SomeFunction(bool is_single_threaded, /* other args */) { … }

    And replace it with something like this:

    template <bool is_single_threaded> int SomeFunction(/* other args */) { … }

    Then simply referencing both variations would give compile time specialization almost for free. There are a lot of problems that doesn’t solve (picking and invalidating the correct variation, etc.), but it would avoid any complexity imposed by trying to run a compiler in kernel mode (which frankly seems like a rather laborious yet boring portion of the research).

  2. Could someone manually generate this optimized ksys_write (or probably starting with the trap handler), add it as a new syscall, modify the test app to use the new syscall, run the test process, stop it immediately after it starts, manually patch all the process constants using a kernel debugger or whatever, and then unstop it to let it run the timing loop so we can find out whether the optimized version is really x10 faster? No need for JIT, just time it once. How hard can it be? No, I am not volunteering, I know nothing about kernel development.

  3. I actually implemented something very similar to this with Linux a few years ago.

    It used Andi Kleen’s kernel-LTO patch set as a starting point, and then used perf profiling to determine dynamically-hot codepaths, which were then recompiled with very aggressive optimizations and some auxiliary hint information (scripted out of perf) to enable speculative devirtualization of indirect calls (even cross-module ones), resulting in an equivalent new codepath with pretty much everything inlined together (from syscall entry points all the way down to device drivers), specialized for that workload on that system. That code was then built into a module and spliced into the running system via the livepatch mechanism.

    At the time the results weren’t quite dramatic enough to justify pursuing it further (or maybe I wasn’t testing it on the right workloads), but with the advent of Spectre and such increasing the cost of indirect calls non-trivially I wonder if it might look better now…

  4. This is a wonderful idea, and I hope many people start working on it right away.
    Although Massalin has never published her code, according to my memory of her thesis, Synthesis’s runtime code generation was mostly extremely simple, more like linking than what we think of as “code generation” — it copied a template method into the appropriate slot in the newly-generated quaject, then overwrote specific bytes in the generated code with pointers to the relevant callout (or, in some cases, the initial value of an instance variable for that quaject). Parts of the code that did not benefit from being specialized in this way were factored into ordinary functions the quaject method would just call.
    This meant that only a small amount of code was generated for each quaject, and the runtime code generation was very nearly as fast as memcpy(), which meant that it was reasonable to use it on every quaject instantiation.
    Massalin also talked about applying some optimizations to the generated code, such as the constant-folding and dead-code removal John mentions, but I have the intuition that only a minority of quaject instantiations involved such more aggressive optimizations. Since she never published Synthesis, it’s impossible to know for sure. (I’m not questioning her integrity or claiming that the impressive benchmarks reported in her dissertation are faked; I’m saying that we unfortunately can’t see the exact mixture of interesting things you need to do to get those kickass benchmarks; so, like an insecure Intel CPU, I’m reduced to speculation.)
    Later implementations inspired by Massalin’s approach included Engler’s VCODE (which, to my knowledge, has also never been published; Engler’s PLDI paper cites Massalin in the second sentence of the abstract), which was used to implement Engler’s `C, and GNU Lightning (inspired by Engler’s published papers about VCODE), used in a number of modern JIT compilers.
    I suspect that, by contrast, John’s idea of using LLVM is inevitably going to have much higher overhead — if only from the CPU cache devastation brought about by any LLVM invocation — so will only be a win for much-longer-lived objects, where the large instantiation overhead can be amortized over a much larger number of invocations. An intermediate approach like Engler’s `C might be more broadly applicable.
    John suggests this early on in his “for deployment” comment, but I think that it’s probably necessary for prototyping too, since the objective of the whole exercise would be to get an order-of-magnitude speedup, and the objective of the prototype would be to find out if that’s a plausible result. A prototype that makes all your programs run slower due to LLVM wouldn’t provide any useful evidence about that.
    I asked Perry what he thought about the above, and he replied with this gem:

    So you’re correct that the code generation was mostly “template instantiation”. I think that was key to having calls like open() function in reasonable time. I also suspect LLVM is a blunt instrument for this work. That said, it would have been difficult for anyone but Massalin to work with the approach in Synthesis. It was very much the product of a person who was both intensely brilliant and completely comfortable with writing “weird code” in the instruction set they were working in.
    So there’s then the question of how one can take the ideas from Synthesis and make them a practical thing that ordinary programmers could build and contribute to. And that is almost certainly going to involve compiler tooling. As a prototype, making this work by using LLVM is probably a good approach. Ultimately, I think that one is going to have to do the magic at kernel build time and have something fairly lightweight happen at runtime. But to figure out what that is, one needs to play. And the easiest tools right now for playing involve LLVM. If, for example, you can use LLVM successfully to specialize a write call’s instruction path down an order of magnitude or more, or to do similar things in the networking code, one can then ask how to do this better.
    There are, of course, a number of interesting paths towards playing with this. I suspect that none of them end anywhere near where they start. But the only way to see what might be possible with much better tooling is to start, and you have to start somewhere.
    BTW, I think the time is right, or even over-right, for this. Single processor core performance is stalled out, and while in 1992 one could just say “well, we’ll have another factor of ten performance improvement in a few years, who needs the trouble”, that’s no longer the case. Note that this argument also applies, to a considerable extent, to other parts of the modern software ecosystem. When you can’t just say “we could spend a couple of months optimizing this, but the next generation of processors will be out by then”, things change.
    Anyway, not sure if this answers your call for comments, but if you are interested in specific areas around this, I’ve no shortage of opinions. Many would say I have far too many opinions…
    You can quote any subset or the entire thing. So long as you don’t distort my intent I have no problem with people using my words that way.

  5. GraalVM seems like an interesting possibility for this if one of the old Java OS experiments could be resurrected.
    As a bonus Graalvm can interpret LLVM bitcode so porting device drivers should be technically possible.
    But then how do you know the JIT won’t leak sensitive info?

  6. Thanks for the link, Greg!

    Hi Benjamin, I agree there’s plenty of pain in putting a compiler in the kernel, but the hypothesis is that’s where the magic happens and compile-time specialization just won’t do it.

  7. ZT, yeah, an experiment like that is a good idea, but there are some tricky parts in the details. For example let’s say we patch in constant 1 and constant 2, and then the code checks them for equality. We need a real compiler, running late, to take advantage of that information, if we do the ahead of time pretend job we’ll definitely miss opportunities.

    Zev, cool! Are more details available anywhere?

  8. I love this idea! I think if somebody wanted to start it, it would be good to look into the experience that people had trying equivalent approaches in the language virtual machine and database space. Eg this paper:

    Carl-Philip Hänsch, Thomas Kissinger, Dirk Habich, Wolfgang Lehner:
    Plan Operator Specialization using Reflective Compiler Techniques

    They are doing something extremely similar to what you propose (using llvm to specialize existing C++ code bases) but for a database.

    I personally am a bit less confident that it’s possible to get fantastic results using the existing C code bases of operating systems. This is informed by the fact that doing the same in the virtual machine space largely failed. Eg unladen swallow tried to optimize Python by taking CPython and trying to inline together enough bytecode implementations using llvm to try and get improved performance. This never worked particularly well. However, I think it might be possible to get better results if you are prepared to, as you describe, pragmatically add annotations that guide the specialization process to the source code.

  9. Put me in the “skeptical that this is going to have measurable benefits” cohort. I would like some examples in which the CPU cost in the kernel is the primary contributor to the end-to-end cost.

    Take this syscall as an example, reducing the number of instructions from 3900 -> 390 would be very impressive and would reduce the OS execution cost (1 usec -> 100 ns) by a large amount. But, the real cost of this syscall is not these instructions, but rather the time to move bytes to the storage device — 5 millisec on an HDD and maybe 50 usec on an SSD. So, in the best case, less than 2% improvement.

    Similarly, low-latency research has shown that the bottleneck is the standard TCP/IP protocol, not just the large software stack that implements it. I have no doubt there are many opportunities to specialize that code, which has the generality of a protocol intended to move bytes cross country on phone lines. But, in data centers, by using simpler protocols, it is possible to achieve near-wire transmission costs on today’s systems (e.g. “R2P2: Making RPCs first-class datacenter citizens” in ATC ’19).

    [BTW Runtime code generation would be a lot easier to experiment with using a user-level network device like DPDK, rather than mucking around in a kernel.]

  10. GraalVM also points to a way this might be done in the C world: specify a series of specialization points statically, and delimit the places where the code will be dynamically unfolded and compiled. With a very lightweight compiler and some advance work this might look a bit like filling in an IR-level code template and then generating machine code from the IR.

  11. Hi Jim! I agree there are plenty of paths through the kernel where the cost isn’t dominated by CPU, and also plenty of opportunities to use streamlined protocols instead of compiler tricks. An example where I think specialization may help is in filesystem-intensive workloads where caches are already working as intended. These workloads can involve long sequences of silly little system calls, so fusing multiple calls may be needed also. And the benefit won’t just be in terms of CPU cycles, avoiding the need to load and store predictable data should take some heat off the dcache.

    Thanks for the pointer Jan-Willem! I agree we’ll need techniques to keep the compiler part from working too hard here, or else there’ll be too many cases where we can’t recoup the compilation cost.

  12. Cool ideas! I’ve been working on a project to do something very similar (dynamically specialize an existing C codebase) but in a different domain (Python performance). I’ve built a JIT-creation engine that can ingest a C codebase (or really, any LLVM bitcode), take some lightweight annotations (such as immutability guarantees that the compiler cannot prove but the developer knows), and dynamically specialize the code.
    https://github.com/nitrousjit/nitrous/

    This is born out of my work on Python performance, but my theory is that a low-effort pluggable JIT engine would be useful in other contexts. I didn’t even think about OSes, this is a really intriguing possibility.

    Right now the major blocker is LLVM compilation performance — optimizing and compiling a sequence that’s several thousand instructions long can easily take 100ms with LLVM. So far the best idea I have to avoid this is to do aggressive caching.

    If anyone is interested in actually tackling this OS-specialization problem, let me know since I’d be interested in collaborating. I do share some other commenters’ worry, though, that there may not be sufficient CPU overheads to make these kinds of approaches worth it, but I would love to be convinced otherwise.

  13. So far the improvement is about 50% speedup on the official Python microbenchmarks, compared to PyPy’s 4.4x. I’m optimistic about the situation — there’s a lot more that can be done, since I started with a fully-working system and am making it faster over time, rather than starting with a fast system and making it more compatible over time. Plus I’m optimistic that my approach will scale better to larger and more realistic programs: PyPy has an issue where it is not anywhere near 4.4x faster on real programs, and is often slower than CPython.
