Writing Solid Code Week 4

This week most of the students in the class (but not everyone, since we ran out of time) presented their arguments about the quality of various UNIX utility programs. This was a bit more interesting than it might otherwise have been since some of the utilities running on our student lab machines are hilariously old versions. Students subjected the tools to a wide variety of stress tests: fuzzing them, building them with Clang’s undefined behavior sanitizer, running them under Valgrind, testing them with Murphy (an awesome tool that I haven’t had time to try, but plan to), and more that I’m forgetting now. Additionally, they inspected the source code, looked at how many bugs have been fixed in the utilities in recent years, etc. The point of this exercise was to get people used to the idea of assessing an externally-produced body of code using whatever methods are suitable and available.

The other thing we did was begin the next project. This one will have two phases but I’ve only told the students about the first one. With me playing the program manager, we developed the specification. They will be developing this program in groups of three: a tester, a compression person, and an I/O person. I’m hoping that this code lends itself to writing assertions; if not, we’ll do something like balanced binary search trees later on, since you can’t avoid assertions when writing those.

Automatically Entering the Grand C++ Error Explosion Competition

G++ can be comically verbose; developers sometimes like to wallpaper their cubes with choice error messages from Boost or STL programs. The Grand C++ Error Explosion Competition asks the question: how large can we make the ratio between error output and compiler input?

I’m not much of a C++ person but when the contest was announced I was doing some experiments in using C-Reduce as a way to search for C++ programs that have interesting properties. Of course, we usually use C-Reduce to search for small programs, but Alex and I have been using it (and other reducers) to find, for example, programs that cause interesting parts of the compiler to execute. It only took a minute or two to set up C-Reduce so that its goal was to maximize the GCEEC’s fitness function (a sketch of that setup appears below). I started it running on four C++ files; after a few days three of the reductions didn’t show signs of terminating but the fourth one — some random part of the LLVM backend — reduced to this:

struct x0 struct A<x0(x0(x0(x0(x0(x0(x0(x0(x0(x0(_T1,x0 (_T1> <_T1*, x0(_T1*_T2> 
binary_function<_T1*, _T2, x0{ }

Somewhat surprisingly, there aren’t any templates here. When compiled using G++ 4.8.1 (I’m using the one that comes with Ubuntu 13.10 on x86-64) we get 5 MB of output. It wasn’t too hard to (1) clean this program up a bit and (2) recognize that the repeated (x0 substring is important. Thus, my entry to the GCEEC was:

struct x struct z<x(x(x(x(x(x(x(x(x(x(x(x(x(x(x(x(x(x(x(x(x(x(y,x(y><y*,x(y*w>v<y*,w,x{}

Every added (x approximately doubles the size of the error output. It was tricky to choose the right number of these substrings to include since I wanted to bump up against the timeout without pushing past it. But really, at this point the competition became a lot less interesting because we can pick a target ratio of output to input and trivially craft an input that reaches the target (assuming we don’t run into implementation limits). So the contest is basically a bandwidth contest where the question is: How many bytes can we get out of G++ on the specified platform within the 5 minute timeout? At this point the winner depends on how many cores are available, the throughput of Linux pipes, etc., which isn’t too satisfying.
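
For concreteness, the C-Reduce setup I mentioned above amounts to an “interestingness test” that C-Reduce runs after each transformation, where exit status 0 means the variant is still interesting; C-Reduce then shrinks the file while keeping that property true. The sketch below shows the idea, written here in C (in practice a short shell script is the usual choice); the file name input.cpp and the threshold are made up and this is not my actual test.

/* Hypothetical interestingness test: "interesting" (exit 0) iff the ratio
   of G++'s diagnostic output to the size of input.cpp exceeds a threshold. */
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>

#define RATIO_THRESHOLD 1000.0   /* made-up target ratio */

int main(void) {
    struct stat st;
    if (stat("input.cpp", &st) != 0 || st.st_size == 0)
        return 1;                            /* no input file: not interesting */

    /* Compile the candidate and count the bytes of diagnostic output. */
    FILE *p = popen("g++ -c input.cpp 2>&1", "r");
    if (!p)
        return 1;
    long out_bytes = 0;
    while (fgetc(p) != EOF)
        out_bytes++;
    pclose(p);

    double ratio = (double)out_bytes / (double)st.st_size;
    return ratio >= RATIO_THRESHOLD ? 0 : 1;
}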

I was a little bummed because I didn’t need to use a trick I had been saving up, which was to give the C++ file a name that is 255 characters long — this is useful because the name of the source file is repeated many times in the error output (and the length of the source file name is not part of the fitness function). However, it was delightful to read the other contest entries which used some nice tricks I wouldn’t have thought of.

Would it be fun to repeat this contest for Clang++ or MSVC++? Also, why is G++ so verbose? My guess is that its error reporting logic should be running (to whatever extent this is possible) much earlier in the compilation process, before templates and other things have been expanded. Also, it would probably be useful to enforce a limit on the length of any given error message printed by the compiler on the basis that nobody is interested in anything past the first 10 KB or whatever.

Writing Solid Code Week 3

This week we finished up the triangle classifier. Not all students’ implementations pass all test cases, which is totally fine with me as long as people learned some lessons about when not to use floating point (I did). On Tuesday we also chose a smallish UNIX utility for each student to evaluate as if it were going to be used as mission-critical software (not sure how our lives would depend on less but that’s beside the point). Next Tuesday we’ll read some students’ arguments and decide whether the evidence is convincing.
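
To illustrate the kind of lesson I mean, here is a toy example of my own (not anyone’s submission): the sides 0.1, 0.2, 0.3 describe a degenerate “triangle” whose vertices are collinear, but a naive double-precision check accepts it because 0.1 + 0.2 rounds up to a value slightly larger than 0.3.

#include <stdio.h>

int main(void) {
    double a = 0.1, b = 0.2, c = 0.3;  /* collinear: a + b equals c in real arithmetic */
    if (a + b > c)                     /* true in IEEE double: 0.1 + 0.2 is about 0.30000000000000004 */
        printf("accepted as a scalene triangle (wrong)\n");
    else
        printf("rejected as degenerate (right)\n");
    return 0;
}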

Thursday I lectured about assertions, one of my favorite topics. For several years I’ve been avoiding writing a blog post about assertions because there’s plenty of good information out there, but I really do need to write that post now that I’ve sort of collected all my thoughts in one place. The triangle classifier offered poor opportunities for writing assertions, but later on we’ll play with codes where assertions are basically mandatory. Anyway, I’ll try to get that written up soon.
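
To give a flavor of what that later material looks like, here is a minimal sketch of my own (not something from the lecture): an assertion-heavy routine states its precondition and postcondition explicitly so that violations fail fast during testing.

#include <assert.h>
#include <stddef.h>

/* Helper: is a[0..n-1] sorted in non-decreasing order? */
static int is_sorted(const int *a, size_t n) {
    for (size_t i = 1; i < n; i++)
        if (a[i - 1] > a[i])
            return 0;
    return 1;
}

/* Insert v into the sorted array a of length n; a must have room for n+1 elements. */
void sorted_insert(int *a, size_t n, int v) {
    assert(is_sorted(a, n));          /* precondition: input is sorted */
    size_t i = n;
    while (i > 0 && a[i - 1] > v) {
        a[i] = a[i - 1];
        i--;
    }
    a[i] = v;
    assert(is_sorted(a, n + 1));      /* postcondition: still sorted, with v added */
}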

Status of Software Testing

The other day I received a query from some software engineering researchers who are compiling sort of a survey paper about the status of software testing research; here are the questions, with my answers. I’m interested to hear what the rest of you think about this stuff.

What do you think are the most significant contributions to testing since 2000, whether from you or from other researchers?

These would be my choices:

  • delta debugging
  • symbolic and concolic testcase generation — Klee, SAGE, etc.
  • general-purpose open source execution monitoring tools — Valgrind, “clang -fsanitize=undefined”, etc. (see the small example after this list)
  • incremental improvements in random testing — QuickCheck and its variants, etc.
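
To make the execution-monitoring item concrete, here is a tiny example of my own (not part of the survey): signed integer overflow is undefined behavior in C, and a build with “clang -fsanitize=undefined” reports the overflow at run time instead of letting the program silently misbehave.

#include <limits.h>
#include <stdio.h>

int main(void) {
    int x = INT_MAX;
    x = x + 1;           /* undefined behavior; the sanitizer flags this at run time */
    printf("%d\n", x);
    return 0;
}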

What do you think are the biggest open challenges and opportunities for future research in this area?

Well, as far as I can tell, gaining confidence in the correctness and security of a piece of software is still both expensive and difficult. Not only that, but this process remains an art. We need to continue making it into a science. Just as a random example, if I want to build a compiler there are about 25 books I can read, all of which cover different aspects of a well-understood body of knowledge. They will tell me the various parts of a compiler and how I should put them together. If I want to become certain that a piece of software does what I hope it does, what books should I read? It’s not even clear. It’s not that there aren’t any good books about testing, but rather that even the good ones fail to contain actionable general-purpose recipes.

Of course not all software is testable. My work on testing GCC has convinced me of this (although I already knew it intellectually). I think there are lots of opportunities in learning how to create testable and/or verifiable software. For example, what could I accomplish if I had a really good concolic tester to help with unit testing? Hopefully, with a fairly low investment I’d end up with good test cases and also good contracts that would be a first step towards verification if I wanted to go in that direction. I think we can take a lesson from CompCert: Xavier didn’t just prove the thing correct, but rather pieced together translation validation and verification depending on which one was appropriate for a given task. Of course we can throw testing into the mix when those other techniques are too difficult. There’s a lot of synergy between these various ways of gaining confidence in software but at present verification and testing are largely separate activities (not least because the verification tools are so hard to use, but that’s a separate topic).

One of the biggest unsolved challenges is finding hidden functionality (whether deliberately or accidentally inserted) in software that operates across a trust boundary. Of course hidden functionality is difficult because the specification gives us few or no clues about where to find it; it’s all about the implementation. There’s no silver bullet for this problem but one idea I like (but haven’t worked on at all) is using continuous mathematics to model the system behavior and then focusing testing effort around discontinuities. Research along those lines has the kind of character I’m trying to describe.

Writing Solid Code Week 2

First, I wanted to thank everyone for the great discussion on the post about week 1. My single favorite thing about blogging is the discussion and involvement that these posts sometimes generate.

During week 2 we worked on issues relating to the triangle classifier. As several commenters predicted, getting correct results using floating point code was not easy. In fact, as of last Thursday, the only FP implementation that passed all test cases was one supplied by reader David Eisenstat.

On Tuesday we went through the code of a randomly selected student; here it is. Josh’s code is — we believe — correct. On Thursday we went through David Eisenstat’s integer-based code, which was useful for several reasons. First, sometimes it’s easier to review code written by someone who isn’t actually present. Second, David’s code uses a couple of little C tricks — stringification and use of “static” in an array parameter — that I wanted to show the class. Third, except for a couple of little warts (I’d have preferred library calls for sorting an array and for converting a string to an int) his code is strong.

We also went through a couple of people’s test scripts. Most students in the class used Python for test automation; a minority used Bash, one person used Ruby, and one student embedded his triangle code in a cool iOS app that classifies triangles interactively. My voting-based tester is the only Perl code anyone has committed to the repo. Perl seems totally dead as far as younger developers are concerned, which makes me a tiny bit sad. I asked the students to wrap up their triangle classifiers so we can move on.
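
For readers who haven’t seen those two tricks, here is a minimal sketch of my own (not David’s code): the preprocessor’s # operator turns a macro argument into a string literal, and “static” inside an array parameter’s brackets (a C99 feature) tells the compiler that callers must pass a non-null pointer to at least that many elements.

#include <stdio.h>

#define STRINGIFY(x) #x              /* stringification: STRINGIFY(foo) becomes "foo" */

/* The parameter declaration promises at least 3 readable ints. */
int all_equal(const int sides[static 3]) {
    return sides[0] == sides[1] && sides[1] == sides[2];
}

int main(void) {
    int t[3] = { 5, 5, 5 };
    printf("%s(t) = %d\n", STRINGIFY(all_equal), all_equal(t));
    return 0;
}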

The project that I want to assign for week 3 is only about testing. I plan to give each student a random UNIX utility (grep, sed, ed, bc, etc.) based on the premise that it is a product that their company is evaluating for production use. Each student is to perform some testing and then either argue that the tool is good enough for serious use or that it is too buggy to be adopted. It doesn’t matter which conclusion they arrive at as long as the conclusion is supported by evidence. A negative argument could be supported by bugs or deviations from the spec; a positive argument could be supported by a solid testing strategy, coverage data, etc. Of course I pointed the class to the classic Bart Miller papers evaluating the robustness of UNIX tools.
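
In the spirit of those papers, the simplest possible starting point is to feed a utility a stream of random bytes and watch for crashes and hangs. A throwaway generator along these lines (purely illustrative; it is not part of the assignment) can be piped into the tool under test, for example ./noise 100000 | bc.

/* noise.c: write N random bytes to stdout for crude robustness testing */
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(int argc, char **argv) {
    long n = (argc > 1) ? atol(argv[1]) : 100000;
    srand((unsigned)time(NULL));
    for (long i = 0; i < n; i++)
        putchar(rand() & 0xff);
    return 0;
}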

So far this class has been lots of fun and at least some of the students are enjoying it. I’ve gotten kind of obsessed with it and have been spending lots of time dreaming up projects that we probably won’t have time for.

Writing Solid Code Week 1

This semester I’m teaching a new undergraduate course called Writing Solid Code. The idea is to take a lot of lessons I’ve learned on this subject — testing, debugging, defensive coding, code reviews, etc. — and try to teach them. Since I don’t have any slides or even any lecture notes that are good enough to put on the web, I thought it would be useful to summarize each week in a blog post.

One thing that I wanted to do was create an environment where cheating is impossible. To accomplish this, all student code is going into a public GitHub repo, and everyone is free to look at everyone else’s code if they wish to. In other words, cheating is impossible because the actions that would normally be considered cheating aren’t. Of course, I am encouraging students to write their own code. In a self-selected group of students I don’t expect problems. If people want to exchange ideas or test cases or even subroutines, then great.

The other thing I wanted to do was get away from the kind of grading where a failing test case detracts from a student’s grade. So I am not going to do that. Rather, everyone starts with a default grade of “A” and they only lose this grade if they fail to attend class, fail to participate in activities such as code reviews, or fail to submit assignments. I am counting on people’s competitive spirit to drive them towards creating code that is at least as solid as their peers’ code. This has worked well in the past and I expect it to work again.

The class is going to be structured around a set of small coding exercises, each designed to reinforce some aspects of writing solid code. Each exercise will be iterated until all (or as many as possible) students’ codes pass all test cases that they — and I — can think of. If this takes too long I’ll start making the projects easier; if (as I expect) we can converge quickly, I’ll make the projects more complicated. This structure is designed to go against the grain of the typical fire-and-forget coding style that CS students are trained to adopt, a style that I’m convinced builds very poor habits. The idea that I want the students to get out of these exercises is that in most cases creating a piece of code is the easy part; the hard part is figuring out what the code is supposed to do in the first place and also making sure that the code actually works as intended.

The first exercise is the venerable triangle classifier. The students, with a bit of guidance from me — their program manager — developed the spec in class on Tuesday. Their code is due next Tuesday, along with a collection of test cases that achieves 100% partition coverage as well as 100% coverage of feasible branches. This exercise is in C. The language isn’t very relevant, but C does enjoy good tool support. We may switch languages later on.
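
For anyone who hasn’t seen the exercise, a bare-bones classifier looks something like the sketch below. This is only meant to show the shape of the problem; it is not the spec the class developed, which pins down many more details (input format, error handling, and so on).

#include <stdio.h>

enum kind { INVALID, EQUILATERAL, ISOSCELES, SCALENE };

enum kind classify(long a, long b, long c) {
    long t;
    /* sort so that c holds the largest side */
    if (a > b) { t = a; a = b; b = t; }
    if (b > c) { t = b; b = c; c = t; }
    if (a > b) { t = a; a = b; b = t; }

    if (a <= 0 || a + b <= c)   /* non-positive side, or triangle inequality fails */
        return INVALID;         /* (ignores possible overflow of a + b for huge inputs) */
    if (a == c)
        return EQUILATERAL;     /* smallest equals largest, so all three sides are equal */
    if (a == b || b == c)
        return ISOSCELES;
    return SCALENE;
}

int main(void) {
    printf("%d %d %d %d\n",
           classify(3, 4, 5),   /* SCALENE */
           classify(2, 2, 2),   /* EQUILATERAL */
           classify(2, 2, 3),   /* ISOSCELES */
           classify(1, 2, 3));  /* INVALID (degenerate) */
    return 0;
}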

Since we don’t yet have any code to look at, I lectured on testing today. Basically I focused on the three hard problems in testing:

  • Generating test inputs
  • Determining if the result of a test is acceptable
  • Determining what a collection of passing test cases means

This was more or less an abbreviated version of the material I tried to cover in the early parts of my Udacity course. I wish that I had a nice collection of notes about this but I don’t (yet).