Software Engineering Takeaways

I had a great time this spring teaching a software engineering course for a new professional masters degree program created by my department. Since I didn’t use slides or hand out lecture notes, some students were asking if maybe I could write up a summary of what I wanted them to learn in the course. This sounded like a good idea and I figured I’d make a blog post out of it.

This course didn’t have a textbook, but we ended up going through much of Writing Solid Code and also several chapters of Code Complete. I particularly like Appendix D of the 2nd edition of Writing Solid Code, where Maguire describes how he conducts an interview for a programming position. The students also completed about a dozen (mostly) programming assignments, individually and in groups, that I’m not going to talk about here.


Being good at testing is sort of a superpower and computer science programs usually don’t do a great job teaching it. And testing truly is hard to teach since there’s a lot to it and so much of testing is either highly situational or else is about engaging the problem with the right mindset and plenty of energy: you can’t test well unless you truly want to make things fail. My approach to teaching this material is to focus on what I think of as testing’s three pillars:

  1. Test cases. These come from regressions, from systematic or randomized test-case generators, from requirements and specifications, and from corner cases in the implementation. In a few special situations, exhaustive testing works. Test cases should rarely be discarded but they can be segregated, for example into fast and slow ones, which run at different times. Test cases should exercise all relevant sources of input: not just file inputs, but also environment variables, user inputs, and things like the Windows registry, if the system under test depends on those things. The time at which parts of the test case arrive may be important. Though I forgot to give it to the class, Cem Kaner’s What Is a Good Test Case? is excellent.
  2. Oracles. Test cases are only useful insofar as we can tell if the system under test processes them correctly. Some oracles are easy and broadly applicable, such as detecting crashes, excessive resource use, and violations of programming language rules (e.g. using ASan and UBSan). The most useful oracles, however, are specific to the system under test. The one we see most often is to specify, by hand, the expected behavior of the system, based on our understanding of the requirements or specification. Differential testing — checking one implementation of a spec against another — is highly powerful, and more broadly applicable than it first appears: the reference implementation doesn’t need to implement the full functionality of the system, and sometimes it is simply a different mode of the same system. For example, an optimizing compiler can be tested against a non-optimizing version of itself. Function-inverse pairs make an excellent oracle, when applicable. In metamorphic testing we change the test case (add irrelevant levels of nesting to a C program, or remove dead code from it) in a way that shouldn’t change how it works. Assertions and checkReps are extremely valuable oracles. Finding good oracles requires creativity and attention to detail, but the potential rewards are high.
  3. Coverage. Every serious programming language has one or more tools for measuring code coverage, with the goal of finding code not exercised by a test suite. There are a lot of variants on coverage, but in practice we seldom see anything beyond line or branch coverage. If the coverage induced by a test suite is bad, the test suite itself is bad. If the coverage is good, then the test suite might be good, but this is not a sure thing.

Testing should be as frictionless as possible: investments in automation and parallelization often pay off. In class we saw how incredibly easy it is to run the tests for a project in Github on every commit using Travis.

One of my favorite kinds of testing is burning in an ADT implementation — this exercise brings the three pillars of testing together in a very satisfying way.

How SQLite is Tested is great supplemental reading.

Tool and Library Ecosystems

One of the defining characteristics of modern software engineering is its reliance on tooling to help us create large projects rapidly, with lots of high-quality libraries to build on. Every programming language has one or more collections of tools, each of which forms a more or less coherent ecosystem for getting programming tasks done: editing, refactoring, linting, profiling, building, packaging, measuring coverage, debugging, etc. The tricky bit for newcomers is to rapidly learn the shape of an ecosystem in order to avoid wasting time doing jobs that could be better accomplished with tool support.


If testing is the reason that large software sometimes works, then modularity is the reason it can be constructed in the first place. Modularity is kind of a tough topic to cover in a course but we did spend time on API design, where I mainly wanted to convey that it’s hard to get right and there are plenty of pitfalls to avoid: excess mutability, unclear ordering constraints among API calls, long lists of parameters, poor choices for default values, hard-to-follow memory ownership protocols, cryptic names, and unwise exposure of implementation details. Something I tried to show the class is how a small group of people working closely together on a well-defined task can almost always succeed, but that — without some care and experience — multiple small groups who have accomplished tasks separately will invariably have problems integrating things they have developed separately.

Code Reviews

Testing only gets us so far and a huge number of problems can be headed off by putting two or more sets of eyes on code before it lands, whether this is done in Github or in a meeting room. We did a bit of code reviewing in class, though I feel like we should have done this more. Some aspects I find important are trying to stay constructive and objective, having clear goals for the code review, and making sure that outcomes get written down and acted upon.

Coding Styles

My view is that the details aren’t nearly as important as just having and following a coding style. This include source code formatting details, identifier naming, language features to avoid, etc. This can result in a code base that is more approachable to newcomers and where some kinds of bugs are easier to spot. The various Google documents seem as good examples as any. Coding styles are particularly important for languages that are older, cruftier, and less safe. For example, LLVM lives in a C++ style that I generally find to be comfortable and tasteful.

Source Control Systems

Git is more or less the only game in town so there’s not much to talk about there. The thing I tried to convince the students here is that we don’t just “use git.” Rather, it’s crucial to define a sensible git-based workflow and get everyone onboard with it. The main thing is that for a variety of common use cases, we have a fairly mechanical set of steps to follow; git is confusing and badly designed enough that workflow innovation is best left to people with a lot of experience.

Critical Systems and Responsibility

We are far past the point of pretending that software is either ethically neutral or uniformly positive, and the 90s-era Internet optimism sounds pretty naive at this point. A significant fraction of today’s CS students are going to work on, at some point, software that is very much ethically non-neutral. This is a good post on that topic. We read various documents related to the Toyota unintended acceleration case and discussed other software safety problems like the Therac-25.

Backwards Compatibility

People tend to underestimate its importance and there are great stories on both sides. Microsoft’s commitment is incredible: a long time ago I had access to the Windows 2000 sources and ran across crazy things, like a copy of Windows 3.1 (in x86 assembly!) and routines for detecting legacy applications and implementing various buggy behaviors that they depended on. On the other side we have, for example, the Python 2/3 debacle.

Static Analysis

I tried to convince students that static analysis is useful, starting with compiler warnings. We read A Few Billion Lines of Code Later and they spent some time looking at issues pointed out by the Clang static analyzer.

Engineering for Performance

The message here is a bit subtle: we want to discourage premature optimization while still encouraging people to learn to build software that isn’t fundamentally too slow to meet its goals. So on one hand we don’t want to write everything in C “for speed” but on the other hand we need to avoid showstoppers such as Python, a bottleneck data structure stored on disk, a quadratic algorithm, etc.

For performance tuning, of course measurement is everything, so we need profilers and maybe also hardware-specific utilities like perf. It doesn’t hurt to know the strengths and weaknesses of both the compiler and the hardware platform, and we should always be ready to read some assembly. Chapters 25 and 26 of Code Complete 2e are a good resource for all of this, though they do not present a very nuanced view of what one can expect the compiler to accomplish.

Reporting Bugs

Getting a software developer to stop doing what they wanted to do, and to spend the day fixing a defect instead, can be difficult, but there are a few simple ingredients that can make bug reports vastly more effective. The report should be framed politely and matter-of-factly; implying that the developers are incompetent is rarely helpful. The issue must be reproducible and the bug report must describe all of the circumstances necessary to make the bug show itself. However, all nonessential circumstances should be left out, including any part of the test case that is not necessary for triggering the bug.

Software Process

There are plenty of software development process models, but I guess I don’t find any of these to be particularly compelling, so we didn’t spend much time on them. On the other hand, the elements of software process — estimation, requirements, modeling, architecture, implementation, quality assurance, maintenance, etc. — are hard to argue with. I spend a bit of effort trying to prepare students for the huge diversity in software development organizations they might see out in the world, from startups to banks to Googles to IBMs. I tried to get them used to the fact that in some cases management may have wildly unrealistic ideas about software development and that requirements can change at the drop of a hat.

The Human Element

Effectively reviewing someone’s code requires a light touch; you have to critique the code rather than judging its author. On the other side, it can be very difficult to gracefully process criticism of a piece of code that you’ve bled and sweated over for months. A bug report is really just an argument that a person should make a change to a piece of software. It is hard to implement a piece of software that solves someone’s problem without putting yourself in their place. Similarly, it is very difficult to solve a problem for someone you don’t like or respect or at least empathize with to some degree. UI design requires understanding the needs of people who don’t share your own abilities and cultural background. Ethical concerns are common, and probably should be thought about a lot more often than they are. Team dynamics are complex and managing people is really not easy. In summary, software development is fundamentally a human endeavor and we’d all do well to keep that in mind.

Paths to External Engagement in Computer Science Research

The other day I wrote a post imploring academic computer scientists to at least occasionally break out of their research bubbles and engage with real engineering problems where their research matters. That post was, admittedly, a bit facile about how one might make this engagement happen. This piece suggests some ways. I don’t claim any particular authority or expertise, though I have — except where noted — tried out all of the suggestions below. Perhaps others will chime in with what has worked for them.

Talk to a lot of people and create a network. Networking is a really important skill. Back in the day Phil Agre’s Networking on the Network was what students would read. It predates modern social networks but contains a lot of timeless advice and in any case I don’t know of a better choice.

Attend conferences outside the academic mainstream. In some cases, such as SIGGRAPH and Supercomputing, there’s plenty of non-academic content at the conference you’re attending anyhow, but most of us aren’t so lucky. There’s a really wide range of industrial conferences; some of them definitely won’t be very interesting for academics so you need to do some homework ahead of time to find the one where the right people will be. For my current work, the LLVM Developers Meeting is the main event, almost all of the community’s heavy hitters are there and the technical sessions are amazing. The security community has plenty of great non-academic events. I’ve heard good things about !!Con and Strange Loop. In any case, the point of attending these conferences (besides the technical content) is to meet people who aren’t professors or students.

Spend a day visiting a company or a government agency. You need an invitation, but you can invite yourself if you can make a contact for example via your advisor, a mutual friend, or by meeting people at conferences. Talk to people there, get a sense of what they’re doing with their days, what they’re worried about, what they’re frustrated with. Give a talk about your work if possible, but this isn’t necessary. It often makes sense to do these visits when you’re already traveling.

Spend longer visiting a company or government agency. Depending on your career stage this could be an internship, a postdoc, a sabbatical, a summer, or a leave of absence. This is a chance to work closely with people for an extended period of time. A lot of people do this at some point in their careers and I think it’s a really good idea.

Engage on twitter. It’s a weird, crowded, noisy place and it can take a while to find the right people to follow and longer to get them to follow you. The advantage of Twitter is that there’s a huge number of smart people are out there and communicating with them is almost frictionless.

Blog. People are far more likely to read a blog entry than a paper, in my experience. Also, the readership is different, because non-academics are even less likely to read a paper than academics are. Realistically, starting a blog only makes sense if you have a fairly consistent stream of things to say that don’t fit into tweets and don’t really belong in academic papers. Building an audience takes time and requires a certain amount of regularity in writing; these don’t necessarily fit in very well with the academic binge model of working that many of us subscribe to. Another issue is that blogging doesn’t pay the bills, academically speaking — you should only do it because you want to, not because you expect any direct benefit to your career. I waited until getting tenure to start a blog for this reason, and also to make sure that I had at least a few years’ worth of ideas lined up.

Find people who are great at external engagement. Emulate them, collaborate with them, or join them. The Racket folks are amazing at this.

Release software. Put your stuff on Github, polish it up, and then tell people about it. Get users, accept pull requests, respond to feedback, fix bugs, add features, cut releases, and repeat. Either your code will provide people with a good value proposition or it won’t — either way you learn something. The caveats are that building a user base takes time, creating realistically usable software is like 25 times as much work as creating research-grade crapware, and only a small subset of computer science professors will value your contributions in this area. But it is enormously fun and anyway you don’t want to make the mistake of caring too much what professors think.

Engage with existing open source software. For many of us, there’s an obvious open source project that we could be contributing to or otherwise helping out. Find that project and read their mailing lists, look into the issue tracker, build and use the code, read the code, and maybe submit a patch or two. Beyond these things, consider attending their meetings or BoF sessions, if these exist. A reasonable long-term goal would be to use your work to make a significant improvement to the open source program.

Start a company. This one I haven’t done, though I know many people who have. It is a somewhat extreme option, as much a lifestyle choice as research engagement strategy. Details are out of scope of this post and anyway I don’t know anything about doing this.

Ok, with all that said, I’ll make a prediction or two about what will happen if you follow these suggestions. First, you’ll often find it frustrating and some of the time you invest will be wasted. I’ve burned months on things that never bore the tiniest fruit; if I knew how to tell you to avoid this, I certainly would. Second, you’ll discover that the problems that people are having out there aren’t the ones that you would have predicted, nor are they the ones that your CS friends and colleagues predicted. You need to learn to listen to people, but often even the people having problems aren’t actually having the problems that they think they’re having (anyone who has worked tech support will tell you this is the case more often than not). You need to learn to observe carefully and read between the lines to figure out what’s really going on. Third, at some point you will run into the distinction between problem-driven research and solution-driven research. The former is like trying to cure cancer or put a person on Mars: the problem is everything and you’ll consider any solution that might work. The latter is where your research hammer is a damn good one and you’re never going to put it down: if it can’t solve someone’s problem, you’ll move on and find a different problem. Obviously there’s a spectrum — but you’ll need to decide where on it you sit.

Closing the Loop: The Importance of External Engagement in Computer Science Research

Computer scientists tend to work by separating the essence of a problem from its environment, solving it in an abstract form, and then figuring out how to make the abstract solution work in the real world. For example, there is an enormous body of work on solving searching and sorting problems and in general it applies to finding and rearranging things regardless of whether they live in memory, on disk, or in a filing cabinet.

To paint a bit of a caricature, we have the real world where:

  • all problems are engineering problems, and every engineering problem is distinct from the rest in small or large ways
  • abstractions are leaky: machines fail, bits flip, packets drop, and bugs exist
  • requirements are many-dimensional, diverse, and poorly specified
  • systems often involve human beings who are notoriously hard to reason about

Overall, engineering problems are messy and we usually can’t prove anything about anything, any more than we could prove the correctness and optimality of a bridge or lawnmower.

On the other hand, in the abstract problem space, we’ve dropped every detail that we don’t wish to reason about but retained (ideally) all of the important characteristics of the problem. We’re now dealing with a model — sometimes a formal mathematical model, other times an executable model such as a kernel or simulation — of part or all of the problem. Models can be rigorously analyzed for efficiency, optimality, correctness, and whatever else. Furthermore, connections between problems that initially seem very different may become more apparent in the abstract form.

This process of lifting engineering challenges into abstract problems, solving them, and applying the results — I’ll call it the computer science research loop — is so integral to the DNA of computer science research that it is simply assumed; people have a hard time imagining any other way to work. Also, it has been incredibly successful, which is unsurprising since we inherited it from mathematics where it had been giving good results ever since some prehistoric person observed that you could talk about the number three without actually having three things in front of you.

Here’s the loop in its simplest form:

Here are some ways the research loop shows up in areas that I’m familiar with:

  • In compilers we can develop a toy language that is much easier to compile but that (we argue) retains the essential features of real programming languages.
  • In compilers we can compile a benchmark suite instead of real applications, arguing that our results will translate over into practice.
  • In resource scheduling research it is typical to abstract away all details of the jobs being scheduled.
  • In databases or operating systems we can create a transaction engine or OS kernel that supports only a tiny subset of the features provided by SQL Server or Linux, arguing that the advantages displayed by our simplified model would not disappear if we took the trouble to implement all the boring stuff.

In all cases the goal is to elide details that make our work harder, but without oversimplifying. This piece is about an avoidable but undesirable second-order effect: it is common for both edges of the computer science research loop to be weaker than they could be.

The concrete-to-abstract edge suffers when people working on the abstract side don’t have deep expertise in the concrete problems they’re trying to solve, and it also tends to weaken over time as the character of the problem drifts, causing assumptions on the abstract side to be invalidated. The abstract side has a kind of inertia: infrastructure builds up in code, formalisms, books and papers, and mental models. It requires significant time, energy, and risk-taking to throw away parts of the abstract infrastructure and create fresh pieces. Abstractions that are out of touch with real-world problems can linger, producing papers and PhD theses, for decades.

The abstract-to-concrete edge of the research loop is also problematic: solving real engineering problems, or interacting with the people whose jobs are to solve those problems, can be difficult and time-consuming. It is generally much easier to work purely on the abstract side, and in fact our field’s mathematical roots encourage this behavior. Abstract work is, of course, fine as long as someone else is doing the grungy part, but in many cases that never happens because the abstract side has drifted off to the side of the real problems, becoming more elaborate and complex over time as the easy problems get mined out, and in the end there’s no realistic prospect of applying it.

I believe these issues cause computer science research, overall, to be less interesting and impactful than it could be. I also believe that mitigating the problem isn’t that difficult and that doing so tends to make researchers’ careers a lot more fun and rewarding.

The solution is for researchers to engage with the world outside of their research bubble. Working on real-time scheduling? Then go find some drone software and help its authors or users avoid deadline misses, or else contribute scheduling code to Linux. Working on concurrency control in databases? Figure out a way to integrate the new scheme into MySQL or something, instead of waiting for them to read your paper. Working on finding bugs in software? Report the bugs to the people who maintain the code and watch their reactions. It is particularly important that students do these things, first because their intuitions often aren’t as well-developed and second because I’ve noticed that quite a few CS graduate students are quietly and privately wondering if their work is good for anything in the end. It turns out there’s a way to answer this question: engage with the people whose problems you are solving. As a bonus you’ll publish fewer papers.

It is not the case that that every piece of research should be applied research. Rather, good pure research usually stems from direct personal experience with problems on the engineering side of our world. It’s a bit of a subtle point: doing the grungy boots-on-ground work is how we build our intuitions about what kinds of solutions actually work vs. sounding good on paper. It is hard — though not impossible — to skip this step and still do great work.

Going a bit further, my experience is that much of the interesting action in research happens on the abstract-to-concrete edge of the CS research loop, even though this work is not glamorous or well-rewarded by program committees or grant panels. Even the old workhorses like sorting an array or implementing a key-value map became dramatically more interesting and complex in the context of a real machine, operating system, compiler, and workload.

Concretely, here are some things to look for that might indicate that a research community needs to tighten up its loop:

  • few members of the community are plugged into the concrete problem domain, and are providing fresh insights from developments there
  • few members of the community are moving abstract results into practice
  • members of the community are mainly interested in impressing each other (or, equivalently, papers that demonstrate external impact are not highly valued)
  • the community rewards complex solutions because they are innately interesting, as opposed to rewarding simple solutions because they have engineering merit
  • years of results cluster around the same baseline or benchmark set, instead of continually moving the bar higher

In conclusion, the tension between mathematics and engineering is alive and well in our field. My belief is that more of us, perhaps even most of us, should be skilled in, and actively working in, both modes of thinking.

Also see this followup post.

The Basic Toolbox

This post is aimed at computer science students.

In the software engineering course I’m teaching this spring, I often find myself saying things like “you need to know a scripting language” or “everyone should be able to run a code coverage tool.” Finally, the other day, a student stopped me and asked for the whole list. In other words, what — in my opinion — is the collection of tools that someone graduating with a CS degree should know how to use. Of course I couldn’t answer this on the spot but I’ve been thinking about it since then. The basic idea is that for most any common situation, you should have a decent tool at hand and be able to start solving problems with it without too much fumbling around. (Keep in mind that this is a wish list for self-study: I doubt that any CS program teaches all of these. Also, I didn’t have all of these tool skills when I got my undergraduate CS degree, though I did by the time I got a PhD.)

A version control system: Git is the obvious choice; the main thing you should have is a basic Github-centric workflow including pull requests, remotes, dealing with merge conflicts, etc.

A text editor: We all end up using different editors from time to time, but we should each have a solid default choice that does a good job with most editing tasks. It should highlight and indent any common programming language, integrate with a spellchecker, easily load gigantic files, have nice regex-based search and replace, etc. There are plenty of choices, many CS people migrate to vim or emacs.

A graphing program: I routinely use gnuplot, graphviz, and Powerpoint to make figures. Lots of people like matplotlib.

A presentation tool: Powerpoint, Keynote, Google Slides, something LaTeX based, etc.

An interactive debugger for native executables: LLDB, GDB, something IDE-based.

A generic build system: Make, CMake, etc.

A scripting language: This is for low-grade automation, quick and dirty data analysis tasks, etc. Python and JavaScript would seem like natural choices. Around 20 years ago I was an intern at a networking company and my supervisor popped out of a meeting with some data concerning switch errors, and asked me to do some analysis to locate the underlying pattern. I wasn’t sure how; he handed me a Perl book and I was able to get the job done before the meeting ended.

A shell language: This is probably bash or PowerShell, but there are plenty of other choices. There’s some overlap with scripting languages but I think there are two distinct niches here: a shell language is for scripting a smallish number of commands, doing a bit of error checking, and perhaps looping or interacting with the user slightly This sort of job is a bit too cumbersome in Python, Perl, or JavaScript.

A systems language: This is for creating servers, daemons, and other code that wants to go fast, use little memory, have few dependencies, and interact tightly with the OS. C or C++ would be the obvious choices, but Rust and Go may be fine too.

A workhorse language: This is your default programming language for most tasks, it should have a huge collection of high-quality libraries, be pretty fast, run on all common platforms, have a great tool ecosystem, etc. Racket, Java, Scala, OCaml, C#, Swift, or Haskell would be great — even C++ would work.

A pocket calculator: This is your go-to REPL for basic arithmetic and conversions between number representations, it should be near-instantaneous to get answers. For reasons I no longer remember, I use gdb for this — typically multiple times in any work day. Old standbys like bc and dc also seem like bad choices. I’m curious what other people do here.

Tools for Programming Languages

There’s no reason these days to use a language that doesn’t have a good tool ecosystem. For any given language you should know how to use its interactive debugger, static and dynamic bug-finding tools, a profiler, a code coverage tool, a build system, a package manager, and perhaps a random test-case generator.

Secondary Tools

There are a lot of other tools that could have gone into my basic toolbox, such as a data analysis tool, a browser language, a cloud-based testing service, a statistics language, a typesetting system, a spreadsheet, a database, and a GUI builder/toolkit. I don’t consider these as fundamental; of course, your mileage may vary.

Trust Boundaries in Software Systems

One of the big things that has changed in computer science education over the last 20 years is that it is now mandatory to prepare students for writing software that lives in a hostile environment. This content can’t be limited to a computer security course, it has to be spread throughout the curriculum. My experience, based on talking to people, looking through textbooks, and looking at lecture material on the web, is that overall we’re not yet doing a great job at this.

When teaching this subject, I’ve started using trust boundaries as an organizing principle. A trust boundary exists any time we (the system designers or system owners) trust code, data, or human actors on one side of an interface more than we trust the other side of the interface. Students need to be able to recognize, understand, fortify, and stress-test the trust boundaries in any system they have a stake in.

Trust boundaries aren’t hard to find: We just need to ask questions like “What are the consequences if this code/data became horribly malicious? Is that likely? Can we defend against it? Do we want to defend against it?” It is easy to conclude, for example, that a demonic garbage collector or OS kernel might not be something that we wish to defend against, but that we had better fortify our systems against toxic PNG files that we load from random web sites.

Some basic observations about trust boundaries:

  1. They’re everywhere, even inside code written by a single person. Anytime I put an assertion into my code, it’s a tacit acknowledgment that I don’t have complete trust that the property being asserted actually holds.
  2. The seriousness of trust boundaries varies greatly, from mild mistrust within a software library all the way to major safety issues where a power plant connects to the internet.
  3. They change over time: a lot of our security woes stem from trust boundaries becoming more serious than they had been in the past. Email was not designed for security. The NSA wasn’t ready for Snowden. Embedded control systems weren’t intended to be networked. Libraries for decoding images, movies, and other compressed file formats that were developed in the 90s were not ready for the kinds of creative exploits that they faced later on.
  4. If you fail to recognize and properly fortify an important trust boundary, it is very likely that someone else will recognize it and then exploit it.

To deal with trust boundaries, we have all the usual techniques and organizing principles: input sanitization, defense in depth, sandboxing, secure authentication, least privilege, etc. The issue that I’m trying to respond to with this post is that, in my experience, it doesn’t really work to hand students these tools without some sort of framework they can use to help figure out where and when to deploy the different defenses. I’d be interested to hear how other CS instructors are dealing with these issues.

Stories Behind Papers: Integer Overflow

A couple months ago Jean Yang and Vijay Chidambaram had a Twitter discussion about the stories behind research efforts that you might hear over coffee, but that usually don’t get written up. Vijay started a series of posts containing these. I thought I’d write up a couple of them myself. Alas, none will be particularly dramatic. This seems like a good one to start with.

Around the mid/late 2000s — perhaps starting with Nearly All Binary Searches and Mergesorts are Broken — I got interested in integer overflow bugs. At this point the security aspect of integer bugs in C and C++ was receiving plenty of attention, but I didn’t feel like very many people were looking at the broader issue of logic errors stemming from integer overflows. Even in functional languages with super-serious type systems and a focus on correctness, integer overflow was (and is) an often-neglected issue. These problems are fundamentally difficult to deal with at compile time.

Anyhow, the thing that really got me motivated was the very limited understanding of undefined behavior that seemed to be par for the course in those days. Additionally, most of the existing research tools for detecting or mitigating integer overflows were operating on compiled code, while I believed this problem needed to be attacked near the source level.

By summer 2010 my student Peng Li (now at Baidu USA) had a modified version of Clang that emitted dynamic checks for integer overflow, divide by zero, value-losing typecasts, shifts past bitwidth, and that kind of thing into a compiled C or C++ program. We used this to test open source software and it turned out that basically all programs were executing a constant stream of undefined behavior. I fired off a few dozen bug reports. Since UB wasn’t widely understood at that time, many developers still had the attitude “that is OK since we did it intentionally” or “I am allowed to expect signed overflow in C/C++ to have two’s complement behavior because that is what the hardware does.” See for example the discussions that happened at PostgreSQL and PHP.

In early 2011 I visited Grigore Rosu’s group at UIUC to learn about their awesome new KCC tool. We needed something like this to filter out undefined programs that C-Reduce was creating while making minimal versions of bug-triggering programs from Csmith. During this visit I happened to be able to grab a few minutes with Vikram Adve and learned that he and his student Will Dietz were also working on LLVM-based integer overflow detection, and they also had a working prototype. Yikes! This wasn’t even close to the worst case scenario — which would have been learning about their work once they had a paper accepted somewhere — but it’s never fun to be scooped. Competition in research may occasionally be useful as a forcing function, but I am personally uninterested in competing. If smart people are working on a problem, I would like to leave them to it, they’ll most likely come up with a good solution. There are way too many other fun and important problems to work on for competing on any single problem to be attractive. Anyhow, luckily, Vikram and Will were happy to collaborate, so we started working together. I’m still happy with the resulting paper and I’m confident that it is better than what either of the groups would have produced working on its own.

One of our goals all along was to get integer overflow checks into Clang. This took a while longer and Will did most of the legwork. The job was made easier by the fact that by this time there was plenty of momentum towards dynamic undefined behavior detection in LLVM. For example, ASan was already part of the tree. There was an existing -fcatch-undefined-behavior flag that we fit into, but this pretty rapidly (in time for LLVM 3.3, I believe) got phased out in favor of the -fsanitize=undefined usage that Clang still uses.

Overall, dynamic detection of integer-related undefined behaviors in C/C++ is not difficult, but convincing people to take these bugs seriously was a long struggle and the broader issue of how integer overflows relate to program bugs is interesting and deep. Fundamentally, people usually do not want, and are not good at reasoning about, fixed-width integers, but on the other hand we don’t want to just put bignums everywhere in our low-level programming languages. The other thing I take away from this effort is how a lucky break and a willingness to collaborate were really important in making the work successful, and in fact my group continues to collaborate with Vikram’s.

A Conversation about Teaching Software Engineering

For better or worse, my impressions of software engineering as a field were shaped by a course I took as an undergrad that I thought was mostly not very interesting or useful. We spent a lot of time on waterfalls and stuff, while not covering testing in any detail. For the final project in the class we had to develop an application using a CASE tool (very hip at the time) where we described the class hierarchy using a GUI and then the tool generated skeletal C++ for us to fill in. Since we knew nothing about designing class hierarchies and also the tool was weird and buggy this all went about as disastrously as you would expect. In the end I learned quite a lot, but the lessons were probably not those intended by the instructor.

24 years later I’m teaching a software engineering class — this probably wouldn’t even happen if my department had any real software engineering faculty! Even so, I’m a true believer: I love the material and feel more strongly about its importance than I do about my more usual subjects like compilers and operating systems. I ignore software process and focus entirely on building skills and habits that I feel will come in handy in any software engineer’s career. If you like it, put a test on it. Read code. Review code. Refactor code. Write assertions. Adhere to coding standards. Design an API. Use a coverage tool, a bug-finding tool, a version control tool, a fuzz tool, a CI tool, to good effect. Repeat until end of semester. No doubt there’s room for improvement but the material seems solid.

Over Christmas break I had a beer with Daniel Dunbar, who I should have met long before, but somehow hadn’t. Daniel has done super impressive stuff: he was one of the original Klee authors and also was an early Clang implementer. I told him about my approach to teaching software engineering and picked his brain a bit about the sorts of things he wished people with CS degrees were better at doing. Of course I wasn’t taking notes and forgot most of it. So I mailed:

As I mentioned the other week, I’m teaching a software engineering course this semester, and rather than focusing on any kind of academic approach to this subject, I’ll try to teach them a lot of the real world business of making software that works, starting with all the basics like testing, coverage, assertions, and code reviews.

But you also mentioned some things that I agreed with but that I might not have thought of, or prioritized — the things that you wish people were already good at when you interview them, but they probably aren’t. Is there any chance I could get you to very briefly summarize those things, or point me to a resource you like on this subject? I want to make sure to cover, at least quickly, all of the main high points in this class.

Daniel graciously sent a long reply and also allowed me to reproduce it here:

To start with, I don’t think I know of any resources on the subject. It’s amazing how much obvious little stuff one has to know, but forgets about. Git is a huge source of “things I use every day but forgot I had to learn”, for better or worse.

I guess if I had to come up with a list off the cuff:

  1. The experience of maintaining software over time. I think we spend most of our time working with existing code bases and figuring out how to integrate changes into them. This area has a lot of related topics:
    • How do you figure out where to make a change? Tools here include debugging existing workflows to find where something happens, code search, git blame, git grep, git log -G, etc
    • How do you manage making incremental changes? I am a huge believer in always doing incremental work. How do you build a feature while always keeping the code working? Tools here include feature flags, adaptors and stub implementations, forwarding implementations, A/B testing before/after change.
    • How do you find the source of regressions? Tools here are basically bisection, git log -G.
    • How do you handle technical debt? What counts as technical debt? What kinds of debt are painful versus not?

    I feel like I probably have read things I liked on these topics, but none of the links are coming to mind now.

  2. The experience of making technical decisions. This is a *huge* part of development. Topics here:
    • How do you evaluate choices for a dependency? Tools here include benchmarking, analysis of the code, analysis of the software maintainability, etc
    • How do you evaluate when to adopt a dependency versus write your own? Topics here are NIH versus opportunity cost on innovation.
    • How do you convince people to follow a particular choice? Presenting or writing coherent write-ups on engineering tradeoffs is a really under appreciated skill which can have a big impact (the cost of bad decisions are high).
  3. How do you debug things? Another big part of development. I would emphasize debug here not just in the “how do I get this working” but even deeper in the “how do I understand what is really happening?”
    • I find that many people tend to give up at really understanding what is going on. “Oh XYZ didn’t work, so I did PDQ'”. The people who don’t give up usually end up understanding a lot more about computers, and then do a better job maturing over their career.
    • Maybe it would be good to teach people (a) don’t give up, and (b) here are all the tools you can use when you might want to give up. A lot of the time the tool people know is stackoverflow, but past that they are lost.
    • Things here include hardware watchpoints, reading the source code, disassembly and reverse engineering.
  4. How do you estimate the time to develop software? This is a huge part of a business, people will always want you to do this. Even just getting students to start to think about the process would be good, asking them to make estimates and compare results to them.
    • I have no advice on how to teach this because I still learning a lot here.
  5. How do you review code? What makes good review?
    • When is coding style important versus not? What are the pros/cons?
    • Does review catch bugs, or not? Are certain review styles more effective?
  6. How do people do release management? This is such an amazingly huge part of what we spend time on, and one that receives very little attention.
    • Do you release from trunk? If so, how do you ensure quality?
    • Do you have stable release branches? If so, how do you ensure bugs are actually fixed? Are people cherry picking fixes? What can go wrong? (I remember once cherry picking a fix that happened to merge incorrectly, but the patch applied an in a way that was still valid C++ code (an if {} clause ended up inside another one). The result was a clang that miscompiled itself.)
    • How do you deal with complicated merge conflicts?
    • People like Nicole Forsgren have research here which it would at least be nice for people to be exposed to.

If I had to come up with the kinds of potential exercises I would love it if someone was trying (no idea if they actually would work):

  1. Take a bug in a complex code base at some revision, and ask people to find it and fix it. Compare answers to the one the project actually adopted. A bug where there were several obvious fixes with tradeoffs would be a good point for discussion.
  2. Take a new feature which was added to some project, and study how it was done. Not to toot my own horn–its just an example I am familiar with–we migrated Clang to go from producing .s files to doing the assembly in memory. This involved lots of incremental refactoring, a clear switch over at one point (feature flag), A/B testing to compare old to new (i.e. we chose to shoot for binary equivalence of .o files, simply so we could easily test it — made for lots of extra engineering, but easier to guarantee correct). One could dig up how this went from a theoretical concept on a mailing list to an implemented feature.

    The best thing here would be to create a hypothetical project which is a mirror of a real project (but don’t make this clear at first) and ask people to design some extension of it. Then, compare the results to what the project actually decided to do, and the discussion around it. For example, analyze what things the project owners antagonized over that the students didn’t think of, and try and figure out why not.

  3. Do some systematic study of PRs as a “literature” exercise. How does tone impact response, how do people handle criticism, etc.
  4. Do an exercise where teams are forced to produce a project over the lifetime of the course. The exercises should be small and easier, but they have to be built into the same code base. Something that forces people to make software releases would be nice (dunno if there is a way to do this in such a way that people can choose whether or not to use “release branches”, but in a way that has tradeoffs in either direction).

Wow, this is a lot of material, way more than we cover in depth in a semester. Even so, I find it super useful as a high-level vision for the kinds of things students should end up at least having been exposed to.

Some Goals for High-impact Verified Compiler Research

I believe that translation validation, a branch of formal methods, is just about ready for widespread use. Translation validation means proving that a particular execution of a compiler did the right thing, as opposed to proving once and for all that every execution of a compiler will do the right thing. These are very different. Consider some obstacles to once-and-for-all verification of a tool like GCC or LLVM where:

  • Most of execution is spent processing pointer soup where correctness depends on poorly documented and incredibly detailed properties of the soup.
  • Hundreds or thousands of analyses and transformations are performed, and due to performance constraints the compiler implementation entangles them in a way that is nearly impossible to disentangle.
  • The implementation language is usually unsafe, forcing any formal verification effort to spend an outrageous amount of effort proving properties of the compiler code, such as memory safety, that are incidental to the task of interest. A convincing formal verifier for the subset of C++ that LLVM is written in doesn’t even exist.
  • For some compiler algorithms like register allocation, it appears to be fundamentally easier to check the result than it is to prove that the right result is always computed. For example, CompCert uses this approach (or did the last time I looked).
  • The compiler is under active, rapid development. Any proof would have to be redone, likely incurring significant effort, for every release.

So it’s clear that once-and-for-all formal verification of LLVM or GCC is never going to happen, the costs ludicrously outweigh the benefits. Translation validation, on the other hand, is already to some extent practical, see for example this effort to prove that the seL4 object code refines its C source code. (Refinement just means that C gives a typical program many different meanings and we need to prove that the compiler has picked one of them.)

Other recent, LLVM-based work in this area includes Program Analysis for Compiler Validation (2008), Evaluating Value-Graph Translation Validation for LLVM (2011), Equality-Based Translation Validator for LLVM (2011), Formal Verification of SSA-Based Optimizations for LLVM (2013), An Extensible Verified Validator For Simple Optimizations in LLVM (2014), and Black-box equivalence checking across compiler optimizations (2017).

This work is awesome but research tools don’t, by themselves, stop people from being burned by compiler bugs. One way to make things better is to combine translation validation with aggressive testing, like we did here, and then make sure any resulting bugs get fixed. Better yet, we can try to push a translation validation out into the world so that anyone can use it. It’s time for this to happen. The rest of this piece is some thoughts about how that should work.

Goal 1: Ease of Use

The only thing an application developer should need to do is add a compiler flag like this:

clang++ -O -tv file.cpp


rustc -O -tv

and then the compiler either validates, or fails to validate, its translation. It has to be this easy.

Goal 2: Near zero overhead for compiler developers

Translation validation can’t get in the way of normal development for a production compiler: it has to be almost entirely on the side. This doesn’t mean, however, that the compiler can’t help out the validator, but rather that this has to happen in non-invasive ways. For example, certain optimizations on nested loops that are hard to validate might need to emit a bit of extra debug info or optimization remarks or whatever, to help the validator piece together what happened.

Goal 3: Performance

Since translation validation will result in a lot of solver calls, it is going to be somewhat slow, probably well over an order of magnitude slower than regular compilation. A fairly easy way to speed it up would be to add a (persistent, networked) caching layer to exploit the fact that most parts of most code bases don’t change very often. We’ve had good luck using this kind of a cache for Souper, which is also slow due to making many solver calls.

Goal 4: Multiple Validators

Research tends to move rapidly when there is a level playing field and a clearly-defined goal, allowing different groups to compete or cooperate, as they see fit. Competition can be particularly motivating, see for example SMT-COMP.

The primary metric for choosing a winner in a translation validation competition is the number of functions validated for compilation of a given benchmark using a particular LLVM version and optimization level. Verification time would be a good secondary metric.

To ensure a fair competition, it would be best for all validators to be using the same semantics for the source and target languages. This isn’t so straightforward: all too often these mathematical artifacts end up not being readable or reusable since they are deeply embedded in the implementation of a formal methods tool (this is unfortunately the case for Alive, for example). A canonical, readable, writable, and reusable semantics for each of C, C++, Rust, Swift, LLVM IR, x86-64, etc. is something we should be spending significant resources on. This sort of thing is what I’m talking about.


Just to be clear, beyond the Alive-based work referenced above, I’m not working on, nor do I have any plans to work on, translation validation. Rather, it is clearly the right way to gain confidence that a production-grade compiler has done its job. The technologies are in reach and we should be working to deploy them widely.

The Real Problem with the US News Rankings

The latest list of Best Global Universities for Computer Science from US News has not been well received. For example, the Computing Research Association issued a statement saying that “Anyone with knowledge of CS research will see these rankings for what they are — nonsense — and ignore them. But others may be seriously misled.” The CRA statement identifies these problems with the US News rankings:

  • They ignore conference publications (many areas of CS publish primarily in conferences).
  • US News doesn’t even say which venues are used to compute the publication-based part of the ranking function.
  • The reputation-based part of the rankings doesn’t make much sense given the diverse, global nature of the computer science research community.

An additional problem is that it seems to be pretty easy to game this ranking system using money. For example, King Abdulaziz University (Jeddah, Saudi Arabia) has adopted hiring practices that appear to be designed to do this. Their CS department is ranked #13, compared for example to CMU at #22 and Illinois at #46. I’m trying to avoid being USA-centric and elitist here, but based on some web searches, it is just not possible to objectively rate the CS department at KAU as being better than the one at CMU. US News explains their ranking methodology here.

To summarize, US News is designed to make money, not to do the CS community any favors. Universities are going to try to maximize their rankings. It’s a pretty banal situation all around.

What I wanted to talk about today is the function of rankings. What are we supposed to do with them? The conclusion I’ve come to is that a closed, opaque ranking such as the one from US News is only good for one thing: codifying and reinforcing a pecking order so that it can be used by people who don’t need or want any more information than a total ordering. This might include, for example, university administrators who would like to know if sending additional resources to a department resulted in a measurable and externally-visible improvement.

The reason everyone’s annoyed with US News is that they’ve upended the established pecking order. But here’s the thing: they could fix this tomorrow and their opaque rankings would still be worthless for people who care about what’s behind the rankings, as opposed to being interested in ranking for its own sake. There has to be a better way.

In contrast, let’s take a look at CSRankings, a site that Emery Berger put together using publicly available data from DBLP. This ranking assigns credit to departments based on the number of top-tier papers published by their full-time faculty, credit for which is split among authors. There’s a FAQ giving additional details. (There’s a lot of quibbling that could be done about how this all works; I’m not too interested in that.)

The thing that makes this ranking different for practical purposes isn’t the openness of the algorithm and the data set, but rather the way the web site allows us to explore the data. Let’s say that I’m a prospective graduate student interested in operating systems and formal verification. The first thing I can do is select only those areas — now the site shows me the departments that have people who tend to publish heavily in conferences such as SOSP and CAV. Second, I can click on an individual department and see who the key players are in those areas. Third, I can go to these people’s home pages, Google Scholar pages, etc. in order to see what they are specifically doing, and finally I can read their code and papers. I would argue that this is a fundamentally different use of a ranking system: the purpose is to guide me towards details that matter, not to hide everything behind a number.

In summary, I find the complaints about the US News rankings to be a bit off the mark, since even a fixed version of them will provide no insights and no information beyond an opaque ordering. It would just be confirming the status quo instead of refuting it, as their current rankings do. That is what some people want, but it is of little use to faculty and students in the field. A better use of rankings is to serve as a guide for further exploration — for this to happen, the rankings need to be open and connected to more detailed sources of information. CSRankings accomplishes this and it is the tool you should use to explore the productivity of computer science departments. If you don’t like it, you can try to convince Emery to do things differently or else create your own ranking.

The Dreaded Practice Talk

[I wrote a post with the same title in 2010; this is an updated version.]

In a week you’ll be giving a talk about your work to 600 people at a conference, or perhaps to five people who will sign off (or not) on your thesis. Depending on your area and the type of talk, the questions following the talk may not be very friendly. What should you do? Practice, practice, practice.

A practice talk is usually given to a small audience anywhere between a few weeks and a few hours before an important talk. It is followed by a feedback session that can easily last five times longer than the talk itself did. Often, multiple practice talks are necessary before the presentation becomes really polished and good.

This post is about getting maximum benefit from a practice talk — this is important because they are very time consuming.

The speaker needs to:

  • Have a legible slide number on every slide. If these aren’t there, people taking notes can’t easily refer back to specific slides later on.
  • Reserve a room, acquire a projector, and have everything setup and ready to go at the arranged time. Have all of the adapter dongles that you need on hand. If anyone is calling in remotely, this should also be taken care of by the speaker or by someone who has agreed to help the speaker, and it needs to be done before the talk is scheduled to start.
  • Have practiced the talk alone first. It helps to have memorized what to say when transitioning between slides. Memorizing an entire talk is usually overkill. Focus on transitions and on getting the talk started smoothly; most of us have a much easier time continuing to talk about a topic than getting started.
  • Have an appropriate number of slides. Speakers vary widely in terms of delivery speed and amount of content per slide, but 1.5 to 2 minutes per slide is probably about right. In realistic situations you will be cut off if you exceed your time budget. At proposals and defenses there is usually not a strict time budget, but going over time is strongly frowned upon.
  • Have a pen and paper available to take notes after the talk. You cannot remember 150 detailed suggestions about things to change.
  • Arrange for someone to time the talk. Sometimes it is helpful to get timings on individual slides.
  • Act on the feedback that is given.

Each member of the audience must:

  • Listen to the talk as if it were being given for real. Interrupting the speaker should be handled according to whatever protocol will be in force during the real talk. Generally this means few or no interruptions.
  • Arrive with a pen and paper, or equivalent note-taking gear.
  • Provide detailed feedback in a constructive and respectful fashion.

In my group this is usually the procedure:

  1. I give a bit of context: remind everyone what the speaker needs to accomplish, what kind of background and temperament the audience is likely to have, etc.
  2. I introduce the speaker.
  3. The talk is given, minimizing interruptions to get a good timing estimate.
  4. Starting with students, the audience asks questions as if they had just heard the real version of the talk. The speaker responds accordingly.
  5. Starting with students, the audience makes general comments about the delivery of the talk.
  6. We go through the talk slide by slide, giving feedback and trying to figure out what to add, delete, change around, etc.

Finally, a bit of advice on making slides:

  • Don’t put text too close to the edges of slides; some projection systems crop a bit.
  • Colors often look different when they go through a projector, and low-contrast colors can be completely invisible on a screen. Use a small number of very high-contrast colors. I typically use black on white for almost everything with some bright red or blue for emphasis.
  • Minimize the number of animations.