Sensor Network Technology in Vinge’s A Deepness in the Sky

An important function of science fiction is to help us understand sociological, technological, and other aspects of our future. A really good SF novel — like some of those produced by Asimov, Clarke, Heinlein, Le Guin, Niven, and Vinge — is so full of ideas and possibilities that the reader’s mind is expanded a little.

Although there have been some astonishing predictive successes, such as H. G. Wells’ descriptions of the tank and the atomic bomb, only a small fraction of predictions in SF literature have come true. Even so, those of us working in the technology research sector — which I view basically as applied science fiction — would do well to scour the SF literature for ideas and use cases that can help drive and make sense of the low-level technical work that ends up consuming most of our working lives. Neuromancer is a good example of a book that was often wrong at the technical level, but its overall effect was so amazing that — even 25 years later — a technical person like me would be foolish not to have read it.

It’s puzzling that not many people working on sensor networks seem to have read Vernor Vinge’s novel A Deepness in the Sky. So far, out of the people I’ve talked to about it, only Kris Pister has read this book. Of course, Kris was one of the originators and major early drivers of the smart dust vision. Perhaps most researchers don’t have the time or inclination to read SF? In any case, around the time Vinge’s book came out (it was published in 1999), the “smart dust” vision was just getting off the ground with the help of DARPA’s NEST (Networked Embedded Systems Technology) program. The research group at UVA that I was a member of at the time was funded under NEST but, alas, I already had a PhD thesis topic and didn’t have the sense to switch over to this new and exciting area. It was only in 2001-2002, while working as a postdoc at Utah, that I got involved in sensor network research.

But back to the book. One of its central ideas is that following the Dawn Age — where we are now — humanity entered the Age of Failed Dreams where it became clear that many of humanity’s more ambitious goals — nanotechnology, artificial intelligence, anti-gravity, and sustained civilization — were unattainable. A Deepness in the Sky is set 8000 years later: humans are crawling around a limited part of the galaxy at well below the speed of light, using technologies not unfamiliar to us now. One of these technologies is the sensor network: a vast collection of tiny networked embedded systems, each of which contains a weak computer, a small power source, a packet radio, and some sensors and actuators.

The purpose of this piece is to compare and contrast Vinge’s vision of sensor networks with what we have right now, and to use this analysis to identify important open problems in sensor networking. I’ll do this by looking at some of the properties and capabilities that Vinge ascribes to his sensornet nodes, which are called Qeng Ho localizers (QHL from now on).

Physical Properties

The QHL is black and about 1 mm long by 0.4 mm wide. The device’s thickness is not specified, but since it is difficult to detect on human skin and floats around in normal indoor air currents, it would probably have to be pretty thin. For purposes of discussion, let’s say it’s 0.1 mm thick.

At 2 mm by 2.5 mm, the Berkeley Spec, produced around 2002, is the smallest sensor network node that I’ve heard of. Unfortunately, the smallest actual effective node based on a Spec was quite a bit larger than the Spec itself, since it needed (at least) an external power source and antenna. This node type was never mass-produced. Today’s sensor network nodes, particularly when packaged and combined with a realistic battery, are quite bulky (I like to tell people: “It’s not smart dust if it hurts when someone throws it at you!”).

Energy

The QHL is powered by weak microwave pulses, preferably occurring around 10 times per second. It has very limited onboard energy storage and its internal capacitor runs out in a matter of minutes if it fails to receive energy from the environment. This is in sharp contrast with today’s battery-heavy sensor nodes that can often run for years with appropriate duty cycling. Since the QHL does not use exotic energy storage mechanisms (antimatter, nuclear, etc.), its capacity is bounded above by the amount of energy a 0.04 mm³ capacitor can store. The capacity is bounded below by the amount of energy required to self-destruct a node with a flash of light and “a tiny popping sound.”

Today’s production ultracapacitors store at most about 30 Wh/kg. Some web searching failed to turn up a convincing upper limit on the energy density of an ultracapacitor, so let’s say it’s 500 Wh/kg: on a par with some of today’s best battery materials (we’ll also say 500 Wh/kg ≈ 500 Wh/l). This gives only 20 μWh of stored energy: enough to power one of today’s most energy-efficient microcontrollers for about a minute.
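For concreteness, here is that back-of-the-envelope arithmetic as a small C sketch. The node dimensions and energy density are the guesses from above; the 1 mW active draw for an efficient microcontroller is my own assumption.

#include <stdio.h>

int main(void) {
    /* Back-of-the-envelope energy budget for a QHL-sized capacitor.
       All inputs are guesses, not measured values. */
    double volume_mm3 = 1.0 * 0.4 * 0.1;                 /* guessed node dimensions */
    double density_wh_per_l = 500.0;                     /* optimistic ultracap density */
    double density_wh_per_mm3 = density_wh_per_l / 1e6;  /* 1 l = 10^6 mm^3 */

    double stored_wh = volume_mm3 * density_wh_per_mm3;  /* ~2e-5 Wh = 20 uWh */
    double stored_j = stored_wh * 3600.0;                /* ~72 mJ */

    double mcu_power_w = 1e-3;                           /* assumed ~1 mW active draw */
    double runtime_s = stored_j / mcu_power_w;

    printf("stored energy: %.0f uWh (%.0f mJ)\n", stored_wh * 1e6, stored_j * 1e3);
    printf("runtime at 1 mW: about %.0f seconds\n", runtime_s);
    return 0;
}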

Why don’t the QHLs have photovoltaic elements? Perhaps this would have interfered with the plot or else Vinge didn’t want to get into details that didn’t push the plot along. In any case, an alternative energy harvesting mechanism would seem like an obvious win.

Processing

All we know about the QHL’s compute capability is that “each was scarcely more powerful than a Dawn Age computer”: a PC from our era. So its core either runs at a couple of GHz or — more likely — it contains a large number of weaker, specialized processing units. This makes sense both from an energy efficiency point of view, and also because the QHL is more or less a fixed-function device. It is depicted as being retaskable, but not necessarily supporting arbitrary new functionality.

Omnidirectional Imaging

One of the most interesting capabilities of the QHL is that it can capture real-time, omnidirectional video. A device with similar capabilities based on today’s technology would be bulky; it would probably use six wide-FOV cameras. How could a small black speck accomplish this task? Well, lensless imaging is not in itself challenging: pinhole cameras work perfectly well, although their inefficiency would be a problem in low-light environments.

My napkin-sketch design for a sensor node with omnidirectional imaging has an outer skin made of an LCD-like material that can act as an electronic shutter; it displays a very rapidly changing pattern of slits. Below this skin is a thin layer of transparent material, under which is the light sensor. Due to the moving-slit design, the sensor’s sampling rate has to be much faster than the node’s effective frame rate, but that shouldn’t be a problem. The digital signal processing requirements for backing out diffraction patterns and otherwise turning the sensor data into high-quality omnidirectional video would be hellish. Probably there’s a nice pile of CS and EE PhD theses in there.

The small size of the QHL would limit its imaging quality due to both small aperture and limited light-gathering power. Both problems could be solved using distributed imaging. In the simple case, photons are simply integrated over many nodes. Since there are many points of view, it would be possible to perform very accurate reconstruction of the 3-D positions of objects. An even more interesting capability — if the localization and time synchronization across nodes are extremely tight — would be for a group of nodes to act as an optical interferometer. In other words, if you scattered QHLs over an area 10m in diameter, you’d get a 10m synthetic-aperture telescope. Obviously this would only work on a very stable surface, and probably only in a vacuum. There would have to be a lot of nodes to get a decent amount of light gathering power.

Other Sensors

The non-imaging sensors on the QHL seem relatively boring: temperature, pressure, vibration — these are on sensornet nodes today.

Actuation

Vinge gives few details about the QHL’s actuators, but we know it is capable of inducing signals in computer networks and in the human nervous system. The likely mechanism would be induction using magnetic fields. Unless this actuator is incredibly sophisticated, a complex task such as implementing a heads-up display via the optic nerve would seem to require a collection of cooperating nodes.  Perhaps to keep the story simple, Vinge shows a pair of nodes performing this function.

An intriguing capability ascribed to the QHL is building up layers of grit near a switch, to impede the switch’s operation. Presumably, nodes used their actuators to move fragments of ferrous material (maybe QHL nodes themselves were sacrificed) into the appropriate location. This activity is depicted as being very slow.

It would be interesting for nodes to use their magnetic actuators to power up other nodes. There would be substantial inefficiencies, but if one node finds itself with lots of extra power (for example, if it is sitting in a patch of bright light) then spreading the energy around would be the right thing to do.

Mobility

The QHL is not depicted as being able to move itself; it is moved passively around its environment by riding on humans, air currents, etc. However, at least a bit of mobility is implied by the presence of a magnetic actuator and by the QHL’s ability to create a gritty surface on a switch.

RF Communication

The radio communication range of the QHL is not specified, but around 10m seems like a reasonable guess. Since nodes are too small for a traditional antenna, presumably they contain an internal fractal antenna. Probably the antenna has some beamforming capability since this, combined with localization, would permit highly energy-efficient communication with neighbors, and would also make it more difficult for eavesdroppers to detect communicating nodes. The inter-node bandwidth is not specified, but clearly it is sufficient to transmit multiple video streams. The frequencies used by the QHL are not discussed.

Protocol Design

There is very little discussion about the protocols used by the QHLs, beyond this:

Ten Ksec after the microwave pulses were turned on, Ritser reported that the net had stabilized and was providing good data. Millions of processors, scattered across a diameter of four hundred meters.

Ten Ksec is about 2 hours and 47 minutes. This seems like a fairly long time for a network to stabilize, but then again “millions” is a lot of nodes. Obviously the QHLs’ algorithms scale very well, but beyond that it sounds like the route formation and other parts of the protocol stack are not radically different from things that have already been implemented by the sensornet community.

Localization

The QHL determines its position relative to other nodes using time of flight of a radio signal. This is obviously the way to go; today’s signal-strength-based localization methods are akin to standing with a friend in a large, echoey building and trying to figure out how far apart you are by shouting. This kind of works, but not well.

The problem with time of flight is that it requires part of the sensor network node to be clocked very fast: on the order of 100 GHz to get millimeter-level localization. This doesn’t seem like a fundamental problem because only a handful of gates need to be clocked this fast; they can latch high-fidelity time values and then slower parts of the processor can digest them at their leisure.
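As a quick sanity check on that number, here is the clock-period arithmetic, under the assumption that range resolution is roughly the distance light travels in one clock period:

#include <stdio.h>

int main(void) {
    /* Distance resolution of time-of-flight ranging, assuming resolution
       is roughly one clock period of light travel time. */
    const double c = 3.0e8;                        /* speed of light, m/s */
    const double freqs_ghz[] = { 1.0, 10.0, 100.0, 300.0 };

    for (int i = 0; i < 4; i++) {
        double period_s = 1.0 / (freqs_ghz[i] * 1e9);
        printf("%5.0f GHz clock -> ~%.1f mm per clock period\n",
               freqs_ghz[i], c * period_s * 1000.0);
    }
    return 0;
}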

Time synchronization is not discussed but the localization capability implies that the QHL net is very tightly time synchronized.

Software Architecture

The main thing we know about the QHL’s software is that it is large and complex enough to defeat a years-long “software archeology” effort directed at determining whether the nodes can be trusted (in the end, they cannot). The QHL is self-documenting: it will provide technical manuals when asked, but there are at least three versions of the manuals, some of which are quite coy about the nodes’ actual capabilities. There are plenty of trapdoors. All of this deception is based on the idea that there may be circumstances where it is possible or necessary to trick people into deploying a sensor network of QHLs and then use its ubiquitous monitoring and actuation capabilities to take over.

I shouldn’t need to point out that “Programmer-Archaeologist” is a chilling, but very likely, future job description. Programmer-at-Arms is another great job title coined by Vinge.

Internal Structure

The internal design of the QHL is not discussed, but we can make a few guesses:

  • Its skin is dominated by sensing surface.
  • Its processing elements are as close as possible to the sensing surface, both to facilitate heat transfer and to reduce the distance that high-bandwidth raw sensor data must travel. Thus, the processors are probably structured as a shell just inside the skin.
  • By volume, the interior is dominated by the capacitor, nonvolatile memory, and processing elements. Presumably these can be packed together in clever ways, saving space and weight when compared to a design that segregates them.
  • The fractal antenna is spread throughout the interior of the node, but takes up only a small fraction of its volume.

Implications for Sensor Network Research

Among the Qeng Ho localizer’s capabilities, these seem to motivate hard but exciting research problems:

  • Reading and inducing signals in nerves and wires that are close to nodes.
  • Scaling protocols to millions of nodes.
  • Robust, precise, and energy-efficient time-of-flight based localization.
  • Efficient and precise omnidirectional beamforming from sub-mm scale antennas.
  • Efficient energy harvesting from microwave sources that are well below levels that would harm animals and cause other problems.
  • Omnidirectional video in real-time.
  • Light integration and other distributed image processing across large numbers of nodes.

Social Implications

I’d be remiss if I ignored one of Vinge’s larger messages, which is that cheap, effective sensor nodes lead to a government with the capacity for ubiquitous surveillance, which leads to a police state, which leads to societal collapse or worse. Can we do sensor network research in good conscience? Well, on one hand Vinge makes a valid point. On the other hand, the human race faces many dangers over the next hundred years and this one strikes me as probably not any more serious than a number of others. The problem, as always, is that the effect of a technology on society is tied up in complicated ways with how the technology ends up being used. There are intricate feedback loops and the overall effect is very hard to predict. Our hope, of course, is that in the time frame that sensor network researchers are targeting (5-50 years), the benefits outweigh the costs.

Acknowledgment: Neal Patwari helped me understand time-of-flight localization.

Going Going Gone

Technically my sabbatical ended during the summer, but yesterday it really ended when I gave two 75-minute lectures back to back. On a normal Tuesday, this would be followed by a 75-minute embedded systems lab but the students and I get a free pass for a week or two while the lab admins get all of their software straightened out. This teaching load would be a cake walk for a high school teacher, but it’s tough work for an R1 professor. Of course, given that I had 482 consecutive days away from teaching, I’m in even less of a position to complain…

Overall I’m happy to be back. During my sabbatical I avoided my office fairly rigorously, but now I’m enjoying talking to my colleagues face to face again. Since I atypically failed to move out of town for my sabbatical (Sarah had to teach last year, and our sabbaticals are perfectly out of phase) the question was: what would I do? A number of people suggested that I write a book, but I’m really happy that I did not. First, this would have been a most unwelcome task hanging over my head all year. Second, Phil Koopman wrote my book. Well, not really, but his new book Better Embedded System Software (I’ll review it here soon) has a very similar focus to what I’d have written.

The idea that I brought to my in-town sabbatical was that it gave me license to skip out of everything I didn’t feel like doing (faculty meetings, reading groups, etc.) while still participating in any events (distinguished lectures, hiring dinners, etc.) that seemed either appealing or else truly important. This worked out pretty well, although I think my time management skills (never my strong point) atrophied over the past year. We’ll see if I can get them back.

During my sabbatical I got a lot of good research done and a lot of good thinking done. I read some technical books that had been on the back burner for too long. I spent more time with my kids than I had previously, and managed to be outdoors and totally out of contact with the world for two weeks last summer and then a week this summer.  All these things were good. On the less satisfactory side, I didn’t manage to become a proficient user of a semi-automated theorem prover, as I had hoped, I didn’t explore some new research directions that I had wanted to, and I was often disappointingly connected to my regular job. Sarah and I will have to figure out how to sync up our sabbaticals so we can leave town next time.

Mac and Cheese++

The ability to make macaroni and cheese is a basic parental survival skill. The problem is, the boxed Kraft stuff tastes of chemicals. Although the Annie’s and Whole Foods brands are better — without being overly expensive — I’m still turned off by that packaged powdered cheese food stuff. Eventually I realized that since the water-boiling and pasta-cooking steps dominate the time required to cook mac and cheese, making it from scratch is no slower than making it from a box.

The simplest version of the recipe below gives a result that looks about like boxed mac and cheese while tasting far better. If you use high-quality cheeses and throw in onions, peas, white wine, and some ham or bacon, the resulting dish approaches the savory goodness of a simple risotto.

This is one of those recipes where no measuring whatsoever is required. However, I’ll give approximate amounts calibrated to match a half-pound of dry pasta.

  1. Before even thinking about the cheese sauce, get the water heating. Don’t skimp on the amount and make sure not to overcook the pasta.
  2. Heat 2-3 tablespoons of fat or oil over low heat in a small saucepan. I usually use olive oil, butter, or a combination. If Sarah’s not looking I may sneak in some bacon fat.
  3. Add salt and ground pepper.
  4. Optionally: Add a few tablespoons of diced onion and/or garlic.
  5. Optionally: Add a few tablespoons of diced cooked ham or bacon.
  6. Optionally: Add a big handful of frozen peas or shelled edamame.
  7. When the fat is hot and any extras have cooked a bit, add 1.5 teaspoons of all-purpose flour. Mix well. Most white sauce recipes call for more flour than this, but a teaspoon or so is enough to make a smooth cheese sauce.
  8. Next, add some liquid. Milk, water, white wine, or chicken stock will all work. A few tablespoons is enough. Stir well until a nice creamy, bubbling white sauce is formed. It shouldn’t be lumpy.
  9. Next, add at least 1.5 cups of grated cheese. Pretty much anything will work but the better the cheese, the better the final product. This is a good time to clean up those small fragments of cheese that often lurk in the fridge. Cheddar, perhaps mixed with some parmesan, is ideal. When I use gruyère, the kids say the resulting dish is “stinky but good.”
  10. Stir until the cheese melts into the white sauce, making a nice gooey mess. Don’t overheat this, or cook it any longer than it takes to make a smooth cheese sauce.
  11. Fold the cheese sauce into the drained pasta, which ideally became ready just as the cheese sauce did.

Yum! Kids are the excuse for this dish but I’d cook it for myself in a heartbeat.

What Blogs Are Good For

My colleague Suresh (of Geomblog) likes to say that blogging is passé. The first time I heard this it annoyed me because — dammit — I’ve been blogging for only about six months. It took me a while to figure out that blogs are irrelevant and I could care less if they’re passé. The important thing is reading and writing essays, and blogs are a perfectly good way to do this. This realization also explains my gut reaction that Twitter is stupid; you can’t write much of an essay in 140 characters. (I realize that my gut reaction is probably wrong… it’s just going to take me a while to come to terms with Twitter.)

Most of us, when we hear “essay,” experience flashbacks to those awful essay-writing exercises from secondary school where teachers forced us to use note cards and all that nonsense. I’d argue that it’s best to get over these feelings and embrace the essay. I was forced to do this around 1997 when I started my (no longer maintained) book pages, and realized that any piece of writing that doesn’t fulfill the requirements for being an essay also is not a real book review. On a blog, pretty much any kind of essay — persuasive, narrative, photo, etc. — can make a good post. When reading others’ blogs, I usually skip any entry that isn’t at least roughly in the form of an essay.

The Future of Compiler Correctness

Notes:

  • This piece is mainly about compilers used for safety-critical or otherwise-critical systems. The intersection of formal methods and compilers for scripting languages is the empty set.
  • Readers may be interested in a companion piece The Future of Compiler Optimization.

A half-century of experience in developing compilers has not resulted in a body of widely-used techniques for creating compilers that are correct. Rather — as most embedded system developers would be happy to tell you — it is not uncommon to run across nasty bugs in compilers and related tools, even when a lot of money has been spent on them. This piece examines the situation and explores some ideas that will be helpful during the next 50 years.

“Can we create a correct compiler?” is the wrong question to ask because clearly we can, if we choose an easy source language or a very simple style of translation. A better question is:

Can we create a trustworthy compiler that generates high-quality object code and is otherwise practical?

Let’s look a little closer at the different parts of the problem.

A correct compiler is one that, for every valid program in its input language, provably translates it into object code that implements the computation that the language’s semantics ascribe to that program. A trustworthy compiler is one that convinces us — somehow or other — that it has produced a correct translation. Creating a trustworthy compiler is easier than creating a correct one.

A compiler that produces high quality object code is one that supports at least the basic optimizations: constant propagation, dead code elimination, register allocation, instruction scheduling, various peephole optimizations, etc. I’m being a bit vague about “high quality” since in some cases it means small code, in some cases fast code, but is usually a balance between these goals. The degree of speedup due to optimization is highly variable but in general we expect program speed to at least double.
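For readers who don’t spend their days inside a compiler, here is a tiny before/after illustration of two of those basic optimizations, constant propagation and dead code elimination (the function names are mine, and the “after” version is only the moral equivalent of what a compiler emits):

int before(void) {
    int w = 10;            /* w is always constant    */
    int unused = w * 3;    /* computed but never used */
    return w + 32;
}

/* After constant propagation and dead code elimination, the compiler
   effectively emits the equivalent of: */
int after(void) {
    return 42;
}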

A practical compiler is one that:

  • accepts a useful programming language
  • targets at least one useful hardware architecture
  • compiles at an acceptable speed
  • can be extended and maintained
  • integrates with other development tools: debuggers, IDEs, linkers and loaders, etc.
  • produces usable error messages and warnings
  • is affordable to its intended audience
  • has technical support available

People developing production-quality compilers make many concessions in the name of practicality. Of course, they wouldn’t call it “practicality” but rather something like “creating a real compiler instead of a stupid toy.”

The “high quality object code” requirement excludes very simple compilers like TCC from the discussion. TCC is a C compiler that performs a local translation: each line of code (more or less) is independently turned into object code. This kind of compiler can be made to be relatively practical and I believe it could be made trustworthy (though TCC and other simple compilers have been quite buggy the few times I’ve run tests on them). However, I’m not ready to believe that developers — even safety-critical developers, most of the time — are willing to throw away the 2x to 5x performance improvements that are gained by permitting the compiler to look across multiple lines of code.

Now, to the core of the issue. Today there are two paths to trustworthy, practical, and performant compilers.

The High Road: Verification and Translation Validation

Verified compilation is the gold standard. It sets a very high bar, requiring a mathematical description of the source language, a mathematical description of the target language, and a machine-checkable proof that all programs in the source language are correctly translated into object code. Translation validation is an easier but still very difficult goal where an individualized proof for each program in the source language is produced as a side effect of compilation.

Although high-road topics have been studied by the research community for many years, only one verified compiler has ever been produced that I would call useful: CompCert, from INRIA. CompCert takes a low-level language that is similar to a subset of C and compiles it into PowerPC or ARM assembly language in a way that is provably correct. The compiler is pretty fast and the generated code is good (comparable to “gcc -O1”). CompCert applies most of the standard optimizations; function inlining is the most significant omission. Recently (though perhaps not yet in the latest released version) CompCert gained support for volatile-qualified variables, which are important in the embedded, safety-critical domain.

CompCert is substantially correct (there are gaps in its modeling of machine details and its frontend still has some unproved parts). Its code quality is good. However, it misses out on some aspects of practicality. Most importantly, it is not clear that mortal developers can maintain and extend it, though this paper represents very interesting work in that direction. CompCert is almost certainly not the place to experiment with interesting new optimizations, or with code generation for new instruction set extensions, etc. For now, it is just too difficult to create verified versions of these things.

The Low Road: Testing

Although vendors test compilers before releasing them, the kind of testing that matters most is testing through usage: does the compiler, during a multi-year development cycle, show itself to be free of crashes and to reliably produce correct object code? Of course, the important thing is not to have a correct compiler — these do not exist — but rather to have a good idea of the situations in which it is wrong. It is commonplace for large embedded projects to have a few modules that have to be compiled with optimizations turned off, to avoid crashes and wrong-code bugs. Other modules will contain code to work around bugs, such as chaining together 32-bit adds if 64-bit math is buggy. Nobody likes these workarounds, but they are considered to be an acceptable (or at least unavoidable) part of the development process.
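To make the second kind of workaround concrete, here is a minimal sketch of 64-bit addition pieced together from 32-bit operations, keeping the suspect 64-bit code path out of the picture entirely. The function name and interface are invented for illustration; real workarounds are specific to the tool and the bug.

#include <stdint.h>

/* Add two 64-bit values represented as (hi, lo) pairs using only
   32-bit arithmetic, to avoid a hypothetical buggy 64-bit add. */
void add64_via_32(uint32_t a_hi, uint32_t a_lo,
                  uint32_t b_hi, uint32_t b_lo,
                  uint32_t *sum_hi, uint32_t *sum_lo)
{
    uint32_t lo = a_lo + b_lo;
    uint32_t carry = (lo < a_lo) ? 1u : 0u;  /* unsigned wraparound => carry out */
    *sum_lo = lo;
    *sum_hi = a_hi + b_hi + carry;
}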

Is trustworthiness through usage good or bad? It’s a bit of both. It’s good because it usually works in practice. Also, it is cheap: to create a reliable system we have to do nearly continuous testing at multiple levels of granularity anyway. If we’re also testing the compiler more or less as a side effect, no big deal.

Trustworthiness through use is bad because I think it’s fair to say that we (both in industry and academia) do not have a good understanding of its limits. It works due to a variant of the locality principle: the programs we write in the future are likely to resemble those we wrote in the past. If past programs have been translated correctly, it’s likely that future ones will be too. But what does “likely” mean here? What does “resemble” really mean? Can we count on these properties? In a nutshell: No. This is one of the things that keeps developers, testers, and managers up late at night. To summarize: we are creating life-critical systems using tools that demonstrably contain silent wrong-code bugs. We are comfortable doing so because we test these systems heavily, because this methodology has worked in the past, and because we don’t have a better alternative.

In a trustworthiness-through-usage world, upgrading the compiler should be a very scary thing to do. In fact, the compiler is very seldom (many people would say “never”) upgraded in the middle of a development cycle for a critical system.

What Good Is a Verified Compiler?

We can ask: If you drop a verified compiler into a typical mission-critical, security-critical, or safety-critical system development effort, what fraction of system testing effort would go away? Very little. First of all, what does verified even mean? As I mentioned earlier, a verified compiler is one that can show a mathematical correspondence between a model of the source language and a model of the target language. But is that correspondence actually meaningful and useful in terms of meeting the goals of an X-critical system? For example, is the target model complete and accurate? Or does it formalize the previous version of the instruction set and a restricted version of the source language? Is the target model specific enough to capture details like stack overflow or does it assume infinite memory? Second, as we have seen, real systems only become reliable through massive testing efforts. Most of this testing is unrelated to compiler reliability.

The real value provided by a verified compiler is that, looking forward, it can be used as part of a verified development stack. These don’t exist, but realistic examples will appear during the next 25 years. For example, we might develop a system in a future Matlab-like high-level language. A verified Matlab-to-C compiler gives us an optimized C implementation, and then a verified C-to-ARM compiler gives us verified object code. The RTOS will be verified and so will various properties of the hardware. In principle, these proofs can all be stacked, giving us a top-to-bottom proof that the system works as advertised, based on very few residual assumptions. These assumptions would include “our proof checker is correct,” “the fab implemented the chip we specified,” and very little else.

Now we can ask: What fraction of testing effort will be eliminated by a fully verified tool stack? It’s hard to say, but I think a 30% reduction in testing effort is about the most anyone could possibly hope for. Even so, the savings due to a 30% reduction would be enormous.

Experience with a Verified Compiler

My group has spent the last few years finding crash and wrong-code bugs in C compilers. We do this by generating random C code, compiling it using different compilers (or different flags to the same compiler), and seeing if the computation that we specified retains the same meaning in all compiled versions. I’m making it sound easy, but generating “conforming” C codes — those whose meaning cannot be changed by a correct compiler — required several person-years (pointers are hard), and there were some other problems to solve before we could make this all work well.
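The overall shape of the testing loop is simple, even if the program generator is not. Here is a stripped-down sketch of a differential-testing driver; the random-program generator (gen_random_c), the choice of compilers and flags, and the convention that each generated program prints a checksum are all assumptions made for illustration.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Read the single line of output (a checksum) produced by a test binary.
   Uses POSIX popen/pclose. */
static void run(const char *cmd, char *out, size_t len) {
    FILE *p = popen(cmd, "r");
    if (!p) { perror("popen"); exit(1); }
    if (!fgets(out, (int)len, p)) out[0] = '\0';
    pclose(p);
}

int main(void) {
    char sum1[128], sum2[128];
    for (int i = 0; i < 100; i++) {
        (void)system("./gen_random_c > test.c");        /* hypothetical generator */
        (void)system("gcc -O2 test.c -o test_gcc");
        (void)system("clang -O2 test.c -o test_clang");
        run("./test_gcc", sum1, sizeof sum1);
        run("./test_clang", sum2, sizeof sum2);
        if (strcmp(sum1, sum2) != 0)
            printf("iteration %d: checksums differ -- possible wrong-code bug\n", i);
    }
    return 0;
}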

Although we love compilers, and we are on friendly terms with as many compiler development teams as possible, when we put on our testing hats, there is only one goal: break the compiler. This is how testing has to work; it is an inherently adversarial activity. From this point of view, CompCert can be viewed as an affront. It spits in the face of our tool, which is designed to create maximally evil C code and which has found silent, previously unknown wrong-code bugs in every single C compiler we have tested, including a number that are routinely used for creating safety-critical embedded systems. So what happens when our unstoppable tool meets an immovable wall of proofs? This is what we set out to determine.

I expected that we would find some flaws in CompCert’s formalisms. My meager experience in formal specification is that getting a large specification right is heroically difficult — perhaps not much easier than getting a complicated software system right in the first place. So far, however, we have found no errors in the CompCert PowerPC specification, their C-light specification, or in their proof strategy. This is pretty amazing and my opinion of the job done by the CompCert team is astronomically high.

Even so, we did find previously unknown wrong code bugs in CompCert. Three of these were in the unverified front end and were basically typechecker bugs. These resulted in silent generation of wrong code. The other three bugs involved emission of incorrect PPC assembly code where CompCert generated a number that was too large for some fixed-size field in the instruction encoding. These bugs were not silent: the assembler would die with an error. Why didn’t the PPC specification include constraints on the range of these fields?  Basically the CompCert authors weren’t worried about this case because assemblers reliably catch this kind of error. The front-end bugs are more worrisome and just in the last week or so, Xavier Leroy has replaced a large chunk of unproved code in the CompCert frontend with verified code. This is very cool.

The result from our CompCert testing that I find most interesting is that the CompCert silent wrong-code bugs are qualitatively different from the bugs we find in other optimizing compilers:

  • CompCert bugs: These are in an early phase of the compiler and can be found using very simple test cases. These kinds of bugs are not often found in mature compilers (CompCert isn’t, yet). It’s quite possible that an exhaustive test suite could be created to find all instances of this kind of bug.
  • Bugs in other compilers such as LLVM, GCC, Intel CC: These are in the middle-ends and back-ends of the compilers and often require fairly large test cases (on the order of 10 lines). It is almost certainly impossible to create any fixed test suite that can find all of these kinds of bugs because the number of 10-line C programs is, for practical purposes, infinite.

I take two lessons away from our experience with CompCert. First, formal specification and verification are not a replacement for testing; the techniques are complementary. Second, formal verification can change the tester’s job and make it easier. In the case of CompCert, verification has eliminated all of the complicated optimizer bugs. In contrast, even after a few years of effort we have not been able to ferret out all instances of this kind of bug in some traditional compilers.

Predictions

OK, this post is getting long, let’s get to the point. Here are some predictions.

Prediction 1: No Compiler Is Considered Trustworthy Without Extensive Random Testing

Our random tester finds silent wrong-code bugs in all C compilers we have tested; a total of 270 crash and wrong code bugs so far. I hope that at some point in the future, compiler developers will create, with our help, a compiler that we cannot break. But this hasn’t yet happened (though CompCert may well reach this goal soon). Moreover, there are many ways to increase the power of our tester:

  • Increase the expressiveness of our program generator; there are still many C programs that we cannot even in principle generate, such as those containing unions or making interesting use of the comma operator.
  • Become smarter about how we explore the space of C programs; the conventional wisdom among software testing researchers (and I suspect there is a large element of truth to it) is that plain old random testing wastes a lot of time repeatedly exploring the same part of the test space. Solutions include uniform random sampling (difficult) and bounded exhaustive testing (slow).
  • Start using white-box testing techniques, which consider the structure of the compiler when generating test cases. The goal might be to generate a test suite that gives full branch coverage of a compiler.

Basically we have gotten excellent results using perhaps the stupidest possible technique. Actually that’s not right. Rather, you might say that we have focused all our efforts on creating a random code generator that is reasonably expressive and that targets the parts of the program space where our intuition told us compiler bugs would be found. That intuition was totally correct.

The prediction — which I do not think is even remotely radical — is that in the future the kinds of random testing I’m talking about here will be considered mandatory for any compiler that people might want to trust for X-critical systems.

Prediction 2: Practical Compiler Verification Work Will Converge Around a Small Number of IRs

Formal verification increases the effort required to create a system by at least an order of magnitude. This bar is simply too high to be tolerated by very many research groups. But consider the alternative where a high quality IR (intermediate representation) such as LLVM, or at least a stable and useful subset of it, is formalized and becomes the basis for a large fraction of verified compiler work. Now, one team can create a verified x86 backend, another can create a verified PowerPC backend, and both of these efforts would be far less work than was already put into CompCert. Similarly, my guess is that some of the existing LLVM optimization passes could be extended to support translation validation without being totally redesigned. Other passes like the useful-but-dreadful instruction combiner (which was the single largest C++ file in LLVM before they refactored it) will have to be scrapped and replaced.

The result of this line of work will be a (perhaps small) subset of the LLVM passes that are proof-producing, permitting verified optimizing compilers to be constructed with far less effort than was required to do CompCert. My belief is that this kind of centralized infrastructure is critical to lowering the barriers to entry for verified compiler work.

Prediction 3: Testing and Verification Will Merge

The path to trustworthy and practical compilers will be through interesting combinations of testing and verification techniques. Here are two examples of directions that may end up being promising.

Example 1

Assume that we have three independently developed compilers for some programming language, all of which are pretty good but none of which is totally trustworthy. The classic way to exploit independent versions is n-version programming. To use it, we can run three copies of our system (each built using a different compiler) on three different processors, and then take a vote before performing any externally-visible action. This strategy also provides margin against hardware failure provided that we can make the voter highly reliable. On the downside, running three copies wastes a lot of hardware and energy and also requires the replicas to be deterministic. Determinism may actually be pretty hard to achieve in this case. First, the compilers will variously create faster or slower code, so our voter will have to have some timing tolerance. Second, if nodes are concurrent, small timing nondeterminisms will bootstrap themselves into significant behavioral differences as context switches cause significant events to be reordered. Third, programming languages like C and C++ permit compilers to make choices about significant program behaviors such as order of evaluation of arguments to a function.
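For reference, the voter itself can be quite small; the hard parts are the timing tolerances and determinism issues described above. Here is a minimal 2-of-3 voter sketch, with the data types and fault-handling convention invented for illustration:

/* Vote on one externally visible action among three replicas. Each
   replica's output carries a validity flag so a late or crashed replica
   can be excluded. Returns 1 and sets *out on a majority, 0 otherwise. */
typedef struct { int value; int valid; } replica_out_t;

int vote_2_of_3(replica_out_t a, replica_out_t b, replica_out_t c, int *out) {
    if (a.valid && b.valid && a.value == b.value) { *out = a.value; return 1; }
    if (a.valid && c.valid && a.value == c.value) { *out = a.value; return 1; }
    if (b.valid && c.valid && b.value == c.value) { *out = b.value; return 1; }
    return 0;  /* no majority: escalate to higher-level fault handling */
}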

How can we do better than node-level replication? If we had a powerful equivalence checker for binaries, we could simply use it to prove that the three systems all implement the same computation and then run only one of them. However, a tool capable of proving that three diverse binary images are equivalent is unlikely to appear anytime soon. But what about proving that short sequences of instructions are equivalent? For example, proving that

mov $0, r1

and

xor r1, r1

are equivalent is trivial, assuming that we don’t care about what happens to the condition code flags. Can we scale this checker up to tens of instructions, for example using a clever SAT encoding? Probably. Hundreds or thousands? Perhaps. What could we do with a tool that can reliably prove the equivalence or inequivalence of sequences of hundreds of instructions? If every function in the system we’re verifying is small enough, it should suffice to individually perform equivalence verification of functions. For systems that contain functions that are too large, we could automatically split the source code for the too-big functions into smaller functions — there are well-known techniques for accomplishing this kind of thing — until the pieces are compiled into object code small enough that their equivalence can be verified. I’m leaving out a lot of minor pieces such as verifying the function-splitter, but overall I believe this strategy could be made to work, giving a great deal of extra trustworthiness to a collection of legacy compilers. It seems elegant that we could bootstrap a scalable and workable translation validation mechanism from a poorly-scalable equivalence checker.
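To give the flavor of such a checker, here is a toy version: it models a machine with two 8-bit registers and no condition codes, and decides equivalence of two instruction sequences by brute force over every initial state. The enumeration stands in for the SAT encoding a real tool would use, and the instruction sequences are hard-coded for illustration.

#include <stdint.h>
#include <stdio.h>

typedef struct { uint8_t r1, r2; } state_t;

/* Each "instruction sequence" is a function from machine state to machine state. */
static state_t seq_mov(state_t s) { s.r1 = 0;      return s; }  /* mov $0, r1 */
static state_t seq_xor(state_t s) { s.r1 ^= s.r1;  return s; }  /* xor r1, r1 */
static state_t seq_add(state_t s) { s.r1 += s.r2;  return s; }  /* add r1, r2 */
static state_t seq_or(state_t s)  { s.r1 |= s.r2;  return s; }  /* or  r1, r2 */

/* Brute-force equivalence check over all 2^16 initial states. */
static int equivalent(state_t (*a)(state_t), state_t (*b)(state_t)) {
    for (int r1 = 0; r1 < 256; r1++)
        for (int r2 = 0; r2 < 256; r2++) {
            state_t s = { (uint8_t)r1, (uint8_t)r2 };
            state_t sa = a(s), sb = b(s);
            if (sa.r1 != sb.r1 || sa.r2 != sb.r2)
                return 0;
        }
    return 1;
}

int main(void) {
    printf("mov $0,r1 vs xor r1,r1: %s\n",
           equivalent(seq_mov, seq_xor) ? "equivalent" : "different");
    printf("add r1,r2 vs or r1,r2 : %s\n",
           equivalent(seq_add, seq_or) ? "equivalent" : "different");
    return 0;
}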

Example 2

As I’ve said, having used a compiler successfully in the past does not provide a rigorous basis for believing it will work in the future. Nevertheless, there’s something undeniably appealing about trustworthiness through usage: why should we need to prove something about all paths through the compiler when we’re going to exercise only an infinitesimally small fraction of them?

It should be feasible to perform “just enough testing” to gain confidence that a particular source program is compiled correctly by a particular compiler configuration. To perform just enough testing, we’d again break the application under test into small pieces. For each, we’d compile it and perform some sort of intelligent white-box testing of the resulting object code. If we can test it thoroughly enough, and if we can choose a compiler configuration where optimizations are not performed across code chunks in harmful ways, maybe we can start to say something about the correctness of the translation.

Prediction 4: Formal Semantics Will Be Constructed Earlier

Most programming languages intended for safety-critical use will be developed in conjunction with a formal semantics: a machine-readable way to assign a mathematical meaning to each program in the language. Verified compilation is, of course, nonsensical without a formal semantics. A completely separate benefit is that a document like the C99 standard would have benefited enormously if its construction had had input from an experienced semanticist. Certain shady parts of the design would have had light shone upon them much earlier, and hopefully these could have been cleared up prior to finalizing the standard. The current situation is that the standard not only contains many dark corners, but also some real absurdities that compiler developers have been forced to ignore.

The Future of Compiler Optimization

Also see The Future of Compiler Correctness.

Compiler optimizations are great: developers can write intuitive code in high-level languages, and still have them execute in a reasonably fast way. On the other hand, progress in optimization research is excruciatingly slow despite hundreds of papers being published on the topic every year. Proebsting’s Law speculates that improvements in optimization will double the performance of a program in 18 years — a humorous reference to the better-known Moore’s Law where the time constant is only 18 months. Of course, good optimization research often focuses not so much on speeding up dusty-deck codes, but rather on taking programming idioms previously thought to be impossibly slow — infinitely ranged integers, for example — and making them practical.

This piece is about the future of compiler optimization: some areas that I predict will be interesting and relevant during the next 25 years or so. At the bottom are a few anti-predictions: areas I think are red herrings or played out.

Verification and Optimization Will Join Forces

Currently, optimization tools are totally divorced from verification tools. This is odd because both kinds of tools are doing more or less the same thing: attempting to compute and make use of program properties that are true over all possible executions. Let’s contrast the two approaches:

Optimizer: Attempts to infer program properties from source code in a sound way, so that semantics-preserving program transformations can be performed. Since optimizers must run quickly and get no help from developers, it is very uncommon for deep properties to be discovered. Rather, optimizers uncover low-level and superficial facts about programs: “w is always constant” or “y and z do not alias.”

Verifier: Starts with high-level program properties specified by developers and attempts to show that they are consistent with the code. Runs on the unoptimized program text, so must repeat much of the work done by the optimizer. The high-level properties and intermediate mid-level properties are unavailable for purposes of generating better code.

So in one case the properties are high-level and given, and they are checked. In the other case they are low-level and inferred, and are used to drive transformations. I predict that moving forward, it will turn out to be extremely useful to have interplay between these. Here are a few examples:

  • Verification often relies on separation arguments: claims that two pieces of code access non-overlapping parts of the machine’s state. Separation supports a variety of optimizations including deterministic parallel execution.
  • Showing that a computation executes without causing any exceptional conditions (divide by zero, null pointer access, etc.) is an important part of verification. These properties can also be used to optimize, for example by making it easier to show that computations can be safely reordered, eliminated, or duplicated.
  • Designing algorithms and data structures for hashing, searching, sorting, and the like is not very entertaining or productive. One good way to avoid this is the Perl approach: create sophisticated and well-tuned algorithms and just use them everywhere. In the non-scripting domain, a bit more finesse is required. If I’ve properly annotated my code to support verification, the properties (stability, etc.) that I expect from my sorting algorithms will be available to the compiler. This, probably combined with profile data, can be used to choose the best sort implementation for my program (a minimal sketch of this kind of selection appears after this list). Applied at multiple levels of abstraction — but mostly behind the scenes as part of the runtime library — algorithm selection will be a useful technique for increasing developer productivity and application performance. It will be particularly important in parallel and distributed programming. There has been work on automated algorithm selection, but nothing I’ve read has been convincing. Probably this is a bit of a thankless research area: it won’t give massive speedups on any particular application, but rather will make life slightly to moderately easier for a large number of programmers.
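Here is a minimal sketch of what specification-driven algorithm selection might look like at the library level. The spec structure, the thresholds, and the choice of fallback algorithms are all invented for illustration:

#include <stdlib.h>
#include <string.h>

/* Properties that might be extracted from a verified specification. */
typedef struct { int stability_required; size_t typical_n; } sort_spec_t;

/* A simple stable sort; assumes element size <= 256 bytes for this sketch. */
static void insertion_sort(void *base, size_t n, size_t sz,
                           int (*cmp)(const void *, const void *)) {
    char *a = base, tmp[256];
    for (size_t i = 1; i < n; i++) {
        memcpy(tmp, a + i * sz, sz);
        size_t j = i;
        while (j > 0 && cmp(a + (j - 1) * sz, tmp) > 0) {
            memcpy(a + j * sz, a + (j - 1) * sz, sz);
            j--;
        }
        memcpy(a + j * sz, tmp, sz);
    }
}

/* Dispatch based on the spec (and, in a real system, profile data). */
void sort_selected(void *base, size_t n, size_t sz,
                   int (*cmp)(const void *, const void *),
                   const sort_spec_t *spec) {
    if (spec->stability_required || spec->typical_n < 32)
        insertion_sort(base, n, sz, cmp);   /* stable, fine for small inputs */
    else
        qsort(base, n, sz, cmp);            /* typically faster, not stable  */
}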

As a side benefit, specification-driven optimizers will help convince developers to write the specs in the first place. Currently, it’s hard to convince a substantial fraction of developers that serious testing is a good idea, much less formal specification of properties of their code. However, people love to write fast code and if it can be shown that specifications help the optimizer, specification will gain traction.

Am I saying that in the future there will be a single tool that both optimizes and verifies code? While that is an interesting idea, it’s not likely in the short term. A more practical way to permit these tools to interoperate is to define a data format so they can exchange information about program properties.

Decision Procedures Will Be Heavily Used

Decision procedures for boolean satisfiability and SMT problems have become very powerful during the last 10 years. The basic idea is that many interesting problem instances — even when the general case is intractable or undecidable — can be automatically decided in practice. Verifying a program requires answering lots of (often silly, but occasionally quite deep) questions and decision procedures have emerged as a key way to get these answers in a relatively easy way. Of course, optimizers also ask a lot of extremely silly questions about the programs they’re optimizing and decision procedures can help there as well. The reason decision procedures are not currently used much in compilers is that fast compile time is a top-priority goal.

A great example of using a decision procedure in an optimizer is the peephole superoptimizer, which uses a SAT solver to answer questions of the form “Given this sequence of x86 instructions, is there a shorter sequence that has the same effect?” Computing the answer may be slow, but the results can be stored in a lookup table, supporting rapid reuse. The results in the peephole superoptimizer paper were not very impressive: basically they did a fine job automatically generating known optimizations, but didn’t discover a lot of new optimizations.
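As a toy illustration of the superoptimizer idea, the following sketch searches a tiny made-up instruction set for a single instruction that reproduces a two-instruction sequence (complement then increment), checking candidates exhaustively over 8-bit values. A real peephole superoptimizer works on x86 sequences and uses a SAT solver plus a lookup table, as described above; everything here is simplified for illustration.

#include <stdint.h>
#include <stdio.h>

typedef uint8_t (*insn_fn)(uint8_t);

static uint8_t op_not(uint8_t r)     { return (uint8_t)~r; }
static uint8_t op_neg(uint8_t r)     { return (uint8_t)-r; }
static uint8_t op_inc(uint8_t r)     { return (uint8_t)(r + 1); }
static uint8_t op_shl1(uint8_t r)    { return (uint8_t)(r << 1); }
static uint8_t op_addself(uint8_t r) { return (uint8_t)(r + r); }

static insn_fn insns[]    = { op_not, op_neg, op_inc, op_shl1, op_addself };
static const char *name[] = { "not",  "neg",  "inc",  "shl 1", "add r,r"  };

/* The sequence we would like to shorten: "not r; inc r" (two's-complement negation). */
static uint8_t target(uint8_t r) { return op_inc(op_not(r)); }

int main(void) {
    int n = (int)(sizeof insns / sizeof insns[0]);
    for (int i = 0; i < n; i++) {
        int matches = 1;
        for (int v = 0; v < 256; v++)   /* exhaustive check over all 8-bit inputs */
            if (insns[i]((uint8_t)v) != target((uint8_t)v)) { matches = 0; break; }
        if (matches)
            printf("'%s' is equivalent to the two-instruction sequence\n", name[i]);
    }
    return 0;
}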


I believe a much more effective superoptimizer could be created by:

  • Operating on small (~5 nodes) sub-graphs of the program dependency graph, as opposed to short linear sequences of instructions. Sequences of instructions often mix irrelevant computations, whereas sub-graphs of the PDG are dependent by definition, and therefore are likely to encode interesting, optimizable computations.
  • Operating at the level of an intermediate representation such as LLVM, as opposed to assembly code. First, more high-level information is preserved (is overflow behavior defined? is the memory access volatile?). Second, register allocation, instruction scheduling, and other target-specific optimizations have not yet been performed. The superoptimizer is likely to mess these up, or at least to be constrained by them in incidental ways. Third, the non-superoptimized optimization passes can clean up any left-over junk, getting a bit of extra mileage.
  • Harnessing specifications, when available. The peephole superoptimizer effectively uses x86 instructions as the specification, which forces incidental implementation decisions to be fixed in stone, limiting the effectiveness of the tool. If I have a formal specification of the hash function I want to implement, we might as well start from there instead of committing to an implementation and then trying to optimize it. The Denali superoptimizer was based on this idea, but their article fails to report any nontrivial results. The approach seems reasonable and the paper doesn’t admit to any fatal flaws, so we’re left to guess if they simply decided to stop working on Denali too early, or if there’s a major hidden problem. It could easily be that their approach works much better now simply because the decision procedures are a decade better.

My random guess is that about half of LLVM’s optimizations could be superseded by a strong superoptimizer. This would have major advantages in code maintainability, and the superoptimizer would keep getting stronger as time went by, since it would always be learning new tricks.

Decision procedures can be used in other parts of compilers; this is perhaps just the most obvious one.

Compilers Will Rely on Models and Feedback

Existing compilers are based on simple hard-coded heuristics for applying optimizations: “perform inline substitution of a function when its body is less than 17 instructions, unless it is called from more than 25 sites.” Sometimes, these work perfectly well; for example, it never hurts to replace an always-constant variable with the constant. On the other hand, many little decisions made by a compiler, such as which loops to unroll, are not well-suited to simple heuristics. In these cases, the compiler needs an answer to the question: is this transformation a win? The solution, unfortunately, is a bit complicated. First, we either need a model of how the optimization works, or else we can just try it and see — but being ready to roll back the transformation if it’s not profitable. Second, we need either a model of the target platform or else we can just run our code directly and see if it works better. Third, we require a model of the system developers’ preferences: did they want small code? Fast code? This model will be trivial in many cases, but it can be interesting in the embedded world where I may absolutely require my code size to be less than 32 KB (since that is how much ROM I have) but not really care if it’s 1 byte less or 10 KB less. Does there exist an embedded compiler today that can operate on this kind of high-level constraint? Not that I know of.
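A sketch of the contrast, using the invented inlining heuristic quoted above and an equally hypothetical feedback-driven interface:

/* Today: a hard-coded rule of thumb for inlining decisions. */
typedef struct {
    int body_size;    /* instructions in the callee's body */
    int call_sites;   /* number of static call sites       */
} callee_info_t;

int should_inline_heuristic(const callee_info_t *c) {
    return c->body_size < 17 && c->call_sites <= 25;
}

/* Tomorrow: try the transformation, measure it against the developer's
   stated goals, and keep it only if it pays off. */
typedef struct {
    long code_size_limit;   /* e.g. 32 KB of ROM                            */
    double min_speedup;     /* e.g. 1.0 means "at least don't slow me down" */
} goals_t;

int keep_inlining_decision(long resulting_code_size, double measured_speedup,
                           const goals_t *g) {
    if (resulting_code_size > g->code_size_limit)
        return 0;                              /* hard resource constraint  */
    return measured_speedup >= g->min_speedup; /* soft performance objective */
}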

My prediction is that a compiler that makes pervasive use of models and feedback will be able to generate significantly better code than today’s compilers, even using only the optimization passes that are currently available. The benefit comes from making thousands of correct decisions about which optimizations to apply, instead of making 80% good decisions and being slightly to totally wrong the rest of the time. For platforms with interesting performance characteristics and for embedded systems with hard resource constraints, the benefits may be significant.

Optimizations Will Emit Proof Witnesses

To support routine compiler debugging as well as more ambitious goals such as translation validation, it will become more common to implement optimization passes that emit proof witnesses: data that can be used to build proofs that the passes didn’t change the meaning of the code while transforming it. Currently this is painful, but a variety of technologies will make it easier. Any optimizer based on an SMT solver can take advantage of proof-producing SMT. Research by people like Sorin Lerner is producing infrastructure for creating proof-producing optimizations.

A Few IRs Will Win

A compiler is built around its intermediate representations (IRs). Ten years ago, it didn’t seem clear that the “IR to rule them all” was achievable. This still isn’t totally clear, but for many purposes LLVM looks good enough. It adequately fills the low-level niche. A few more IRs, perhaps one for GPU-style computing and one that can capture interesting properties of functional languages, are needed, but not very many. Substantial engineering wins will be obtained if the compiler community centralizes its efforts around a handful of well-designed IRs. This will largely happen and nobody will ever write an x86 register allocator again, unless they really feel like it.

Anti-Predictions

There will be few, if any, major improvements in the speedup provided by the optimizations that have historically been most important: register allocation, alias analysis, function inlining, etc. The engineering of register allocators and such will no doubt improve, but the quality of the generated code will not.

The distinction between online (JVM style) vs. offline (C++ compiler style) optimization is not fundamental and will not seem like a big deal in the long run. Rather, large collections of optimizers will be built around IRs like LLVM, and people will be able to create AOT and JIT compilers, as well as link-time optimizers and whatever else seems useful, simply by choosing suitable components from the toolkit.

Finally — and this is a prediction that might actually be wrong, though I don’t think so — machine learning in compilers will not end up being fundamental. Machine learning permits a model to be inferred from a collection of data. To be sure, there are parts of the compiler (phase ordering, tuning the inliner, etc.) where we cannot be bothered to build a good model by hand or from first principles, and so machine learning will be used. However, I predict that the benefits coming from these techniques will always be low-grade. To take an example, consider phase ordering. People have shown great speedups from using machine learning on, for example, DSP codes. However, what is really being bought is fast compile times: if I’m willing to wait a bit, I could always get a good optimization result by running all of my optimization passes until a fixpoint is reached. Similarly, I might be able to get great optimization results by using machine learning to tune my function inlining pass. However, if I’m willing to wait a bit, I can always get the same or better results by speculatively inlining, and then backing out from decisions that don’t seem to be working well.

Get Outside

Readers will have picked up that I’m a little fanatical about hiking. Going to the gym would probably be the most efficient way to stay in shape, but there’s a problem — I dislike exercising in a gym. Going beyond odd personal preferences, I consider hiking time to be one of the most important parts of my work day: it’s a great time to go over ideas and make decisions about research directions without being distracted by anything other than the occasional rattlesnake. So I was interested to read this NYT article (PDF here) about researchers who believe that getting away from distractions may have quantifiable benefits. Among them is Dave Strayer, one of my wife’s colleagues, who is best known for his result that driving while talking on a cell phone is about as bad as driving drunk. This looks like a fun research direction, and it should come as no surprise that a professor from the U of Utah is involved — SLC has perhaps the best access to outdoor activities of any major metropolitan area in the USA. With any luck, in ten years, “advice for assistant professors” will commonly include something about getting outside every day and being totally out of contact for a few weeks each year.