This week most of the students in the class (but not all of them, since we ran out of time) presented their arguments about the quality of various UNIX utility programs. This was a bit more interesting than it might otherwise have been since some of the utilities running on our student lab machines are hilariously old versions. Students subjected the tools to a wide variety of stress tests and analyses such as fuzzing, Valgrind, building them with Clang's undefined behavior sanitizer, Murphy (an awesome tool that I haven't had time to try, but plan to), and more that I'm forgetting now. Additionally, they inspected the source code, looked at how many bugs have been fixed in the utilities in recent years, etc. The point of this exercise was to get people used to the idea of assessing an externally-produced body of code using whatever methods are suitable and available.
The other thing we did was begin the next project. This one will have two phases, but I've only told the students about the first one. With me playing the program manager, we developed the specification. They will be developing this program in groups of three: a tester, a compression person, and an I/O person. I'm hoping that this code lends itself to writing assertions; if not, we'll do something like balanced binary search trees later on, since you can't avoid assertions when writing those.
7 responses to “Writing Solid Code Week 4”
Re opportunities for assertions: the priority queue should offer a few. The codeword tree is naturally built bottom-up during compression and top-down during decompression. With parent and child pointers and possibly two aggregated fields (frequency and subtree minimum), there will be a lot of opportunities for assertions, as I found out while writing the fuzzer for http://www.davideisenstat.com/dtree/ .
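As a rough illustration (not the dtree code itself; the struct layout and field names here are purely hypothetical), assertions over such a node might look like this:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical Huffman codeword tree node; field names are illustrative only. */
struct node {
    struct node *parent, *left, *right;
    unsigned long freq;   /* aggregated frequency of this subtree */
};

/* Check the invariants that should hold at every node of the codeword tree. */
static void check_node(const struct node *n, const struct node *expected_parent)
{
    assert(n->parent == expected_parent);
    /* Every node is either a leaf or has exactly two children. */
    assert((n->left == NULL) == (n->right == NULL));
    if (n->left != NULL) {
        /* The aggregated frequency must equal the sum of the children's. */
        assert(n->freq == n->left->freq + n->right->freq);
        check_node(n->left, n);
        check_node(n->right, n);
    }
}
```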
Some nitpicking comments on the spec follow.
The -t mode looks like a pain to implement — attempt to decompress fully (not just the stuff before the compressed bits), then, if that doesn’t work, run the first compression phase. I think that I would have split it into -tc and -td and required the latter to succeed if and only if the header was OK.
“-1 exit code” After masking, the exit code reported will be 255.
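A minimal illustration of why (the fork/wait scaffolding here is just for demonstration):

```c
#include <stdio.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

int main(void)
{
    pid_t pid = fork();
    if (pid == 0)
        exit(-1);                      /* the status is truncated to its low 8 bits */
    int status;
    waitpid(pid, &status, 0);
    if (WIFEXITED(status))
        printf("%d\n", WEXITSTATUS(status));   /* prints 255 */
    return 0;
}
```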
"padded with 1-7 bits whose value is unspecified" This strikes me as incongruous with the effort made to make the table of codewords deterministic. Perhaps the decompressor must accept any values for these bits while the compressor must make them zeros, just as the decompressor must accept any valid table while the compressor's output is determined.
On a related note, is it explicitly required that bits be packed into bytes in a little-endian fashion?
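By little-endian I mean that the first code bit lands in the least significant bit of the first output byte; a tiny sketch of that convention (illustrative only, not taken from the spec):

```c
#include <stddef.h>
#include <stdint.h>

/* Illustrative little-endian bit packer: bit i of the stream lands in
   bit (i % 8) of byte (i / 8).  buf is assumed to be zero-initialized. */
struct bitwriter {
    uint8_t *buf;
    size_t nbits;   /* number of bits written so far */
};

static void put_bit(struct bitwriter *w, int bit)
{
    if (bit)
        w->buf[w->nbits / 8] |= (uint8_t)(1u << (w->nbits % 8));
    w->nbits++;
}
```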
“No string of 0s and 1s can exceed length 256” This limit could be 255 (excluding newline and NUL).
And now for some nits on nits:
What happens if the input file has 2^64 bytes? (This is implausible now, but the same failure mode has cropped up in file formats with smaller limits.)
The behavior of huff is presumably undefined to some degree if there are concurrent modifications to the relevant bits of the filesystem.
Is the input file guaranteed to be seekable?
Are the calls to malloc() etc. mandated to be reasonable? Otherwise, we could malloc() until failure and then exit(-1).
David, thanks for the comments! Yes, hopefully the trees will create some natural opportunities for assertions. I'll take a look at your tree fuzzer; it sounds interesting. Maybe we'll even look at it in class.
Some of the issues you mention are ones that I will be happy for the students to encounter naturally, especially the ordering of bits in a byte, which I had explicitly decided not to mention in the spec since the students didn't notice the issue in class. I'm trying to keep them out of deep trouble, but not all trouble, if you see what I mean. Other issues I've fixed in a commit just now.
My tree fuzzer is a somewhat horrifying piece of code (long, heavily templated, uncommented) that compares whether two implementations, one simple (A, naive) and one fast (B, self-adjusting), are doing the same thing. The main lesson, I think, is to use general-purpose graph algorithms first to check whether the node pointers make a tree — bugs in my tree operations led to every kind of local topological anomaly possible!
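A sketch of what such a check might look like (illustrative only, not the actual fuzzer code): count the nodes reachable through child pointers while marking them, then compare the total against the number of nodes allocated.

```c
#include <assert.h>
#include <stddef.h>

struct node {
    struct node *left, *right;
    int visited;   /* scratch flag, cleared before each check */
};

/* Returns the number of nodes reachable from n, asserting that nothing is
   reached twice; a repeat visit means sharing or a cycle, so the pointers
   do not form a tree. */
static size_t count_reachable(struct node *n)
{
    if (n == NULL)
        return 0;
    assert(!n->visited);
    n->visited = 1;
    return 1 + count_reachable(n->left) + count_reachable(n->right);
}

/* Usage: assert(count_reachable(root) == number_of_allocated_nodes); */
```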
The other point of interest is that I used archetype classes (named Free*, after free objects in abstract algebra) to ensure that the user’s data were being operated on in the limited way described in the documentation.
I notice that the Murphy paper includes the line “Practically everything we tested failed with the ‘eagain’ and ‘eintr’ gremlins enabled”. I’m not surprised; the handling of interrupted syscalls is by far the most spectacularly awful wart in the traditional unix APIs, making it easy and natural to write unreliable code and annoying and tedious to write reliable code.
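For context, the usual workaround is the familiar retry-until-complete wrapper around each call, roughly like this (a common idiom, not anything specific to Murphy or the course materials):

```c
#include <errno.h>
#include <unistd.h>

/* Retry read() whenever it is interrupted by a signal (EINTR), and keep
   going after short reads until we have len bytes, EOF, or a real error. */
static ssize_t read_fully(int fd, void *buf, size_t len)
{
    size_t got = 0;
    while (got < len) {
        ssize_t n = read(fd, (char *)buf + got, len - got);
        if (n < 0) {
            if (errno == EINTR)
                continue;   /* interrupted: just try again */
            return -1;      /* real error */
        }
        if (n == 0)
            break;          /* EOF */
        got += (size_t)n;
    }
    return (ssize_t)got;
}
```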
pm215, in a different class I'm teaching, we're looking at xv6, a modernized implementation of UNIX v6. The funny thing is, it doesn't even look like it would be very hard to avoid some of these awful API choices by adding a bit of kernel complexity. For example, avoiding short reads/writes in the pipe implementation requires adding only a handful of lines of code. The KISS idea was definitely not applied profitably here.
I’d be interested to know how old something should be for you to find it “hilariously old”.
Some of the code that I maintain was written in 1990 or before, and it's still going strong.
Hi Magnus, I get your point, but I found it ridiculous to be running these old UNIX utilities (yes, from the 90s) when the newer versions are almost certainly better. Of course not all software needs to be fixed if it ain’t broke.