Reading code is an important skill that doesn’t get enough emphasis in CS programs. There are three main aspects:
- the external view of the code: documentation, comments, APIs, white papers, information from developers, etc.
- the static view: reading the code like a book
- the dynamic view: reading the code as it executes, probably with help from a debugging tool
Of course these aren’t totally exclusive. For example, reading code like a book has a dynamic aspect because our brains serve as crude interpreters.
Ideally we’ll be able to focus on a module that is small enough to understand all at once and that has relatively clean interfaces to the rest of the system. If the code seems too big or complicated to understand as a whole, perhaps it can be broken up. If the modularity that we want just isn’t there, we may be stuck trying to understand a piece of functionality that is buried in an ocean of complexity, which is not very fun.
We should be explicit about our goals. Are we reading the code for general enlightenment? In order to decide whether to hire the person who wrote the code? In order to begin refactoring or adding functionality? To look for bugs? Keep in mind that if we’re looking for bugs, code reviews are a whole separate thing.
Before reading any code we’ll certainly want to look over any documentation and also skim the comments in the code: sometimes there’s halfway-decent documentation hiding in the middle of a big source file. If the code is very well-known, like the Linux kernel, there’ll be plenty of books and web pages for us to look at. If not, perhaps there’s a specification, a white paper, a README, or similar. Often, even if there’s no documentation, the code will be implementing known algorithms that we can brush up on. Some kinds of domain-specific codes (signal processing, feedback control, storage management) will be very hard to understand if we lack basic domain knowledge, so again we may need to hit the books for a little while before getting back to the code.
Often it’s a good idea to build and run the code before starting to read it, or at least be sure that someone else has done it. This is because we often run across attractive-looking open source code on the web that isn’t worth reading because it’s not even finished, and we might as well discover that fact as early as possible. Just the other day I got suckered by a GitHub project that promised to do exactly what I wanted, but that was badly broken.
Reading code is far easier if we come into it with an understanding of the patterns that are being used. Therefore, we can expect that in any given domain the first few pieces of code we read are going to require a lot of work and after that things should be easier. The tricky thing about patterns is that sometimes they are sitting there on the surface (like gotos in kernel code or explicit reference counts) but other times they are buried deeply enough that a lot of digging is needed in order to uncover them. Making matters worse, it’s not uncommon to see a badly implemented pattern whose author didn’t even realize which pattern they were trying to implement. In systems code we sometimes see code that is half-assedly transactional. Also, see Greenspun’s 10th Rule of Programming.
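To make the "gotos in kernel code" pattern concrete, here is a minimal sketch of goto-based cleanup. The record type and the function names are made up for illustration and are not taken from any real codebase. Once we’ve seen this shape a few times, a ladder of labels at the bottom of a function stops looking alarming and starts reading as "undo everything done so far, in reverse order."

```c
#include <stdlib.h>

/* Hypothetical record type, used only for this illustration. */
struct record {
    char *name;
    int *values;
};

/*
 * Allocate a record with a zeroed name buffer and n zeroed values.
 * On failure, jump to the label that undoes exactly the work that
 * has been completed so far.
 */
struct record *record_create(size_t name_len, size_t n)
{
    struct record *r = malloc(sizeof *r);
    if (r == NULL)
        goto fail;

    r->name = calloc(name_len + 1, 1);
    if (r->name == NULL)
        goto fail_free_record;

    r->values = calloc(n, sizeof *r->values);
    if (r->values == NULL)
        goto fail_free_name;

    return r;

    /* Cleanup runs in reverse order of acquisition. */
fail_free_name:
    free(r->name);
fail_free_record:
    free(r);
fail:
    return NULL;
}

/* Release everything that record_create acquired. */
void record_destroy(struct record *r)
{
    if (r == NULL)
        return;
    free(r->values);
    free(r->name);
    free(r);
}
```

When reading code written this way, the labels give us a free summary of which resources the function acquires and in what order.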
It is important to learn to recognize code that contains little information. Java seems to be particularly prone to this (for example, I assume this class is an obscure joke, but perhaps not…) but all languages have it. A pernicious kind of low-information code occurs in C where the verbose code is actually easy to get wrong, as in the well-known bug struct foo *x = (struct foo *) malloc (sizeof (struct foo *)), which allocates room for a pointer rather than for the struct itself.
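Here is a small sketch of why that malloc line is wrong and of one common way to write it with less repetition. The layout of struct foo and the helper names are assumptions made up for this example; only the shape of the bug comes from the line above.

```c
#include <stdlib.h>

/* Hypothetical layout; any struct larger than a pointer shows the problem. */
struct foo {
    int id;
    double payload[16];
};

/* What the buggy call actually requests: the size of a pointer. */
size_t buggy_request_size(void)
{
    return sizeof (struct foo *);
}

/* What the author meant to request: the size of the struct. */
size_t intended_request_size(void)
{
    return sizeof (struct foo);
}

/*
 * Less verbose and harder to get wrong: the type is named once, and
 * sizeof *x follows the declared type of x automatically.
 */
struct foo *foo_alloc(void)
{
    struct foo *x = malloc(sizeof *x);
    return x; /* may be NULL; the caller must check */
}
```

The shorter form carries the same information with fewer places for the type name to be mistyped, which is the point about low-information code.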
One way to start reading a piece of code is to create an annotated call graph. If tool support is available then great, but if not this isn’t too much trouble to do by hand. Also, a by-hand call graph is a good way to start getting a general feel for the code. Annotations on the call graph might include:
- potential trouble spots: extra-snarly code, inline assembly, code containing comments such as “you are not expected to understand this”
- entry points to the module we’re reading, and exit points from it
- resource allocations and deallocations
- error-handling paths
- accesses to important mutable global state
Putting together a good static call graph may not be so easy if the code is functional, object-oriented, event-driven, or uses function pointers. In this case building the call graph may become a dynamic problem.
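As a small illustration of why function pointers get in the way, consider the sketch below. The command table and the handler names are hypothetical. The only call in dispatch is indirect, so reading that function alone doesn’t tell us which handler runs. We have to go find the table, or run the code and watch.

```c
#include <stddef.h>
#include <string.h>

typedef int (*handler_fn)(int);

/* Hypothetical handlers; their bodies are placeholders. */
static int handle_read(int arg)  { return arg + 1; }
static int handle_write(int arg) { return arg * 2; }

struct command {
    const char *name;
    handler_fn handler;
};

static const struct command commands[] = {
    { "read",  handle_read  },
    { "write", handle_write },
};

/*
 * The callee is selected by a string that is only known at run time,
 * so the call site below names no function at all.
 */
int dispatch(const char *name, int arg)
{
    for (size_t i = 0; i < sizeof commands / sizeof commands[0]; i++) {
        if (strcmp(commands[i].name, name) == 0)
            return commands[i].handler(arg);
    }
    return -1; /* unknown command */
}
```

On a by-hand call graph, one reasonable annotation for dispatch is an edge to "anything in the commands table" rather than to a specific function.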
With a call graph in hand (or without one) we should try to get a sense of the code’s control flow structure. Is it a library? An event loop? One out of a stack of layers? Does it use threads, and how? What sort of error handling does it use?
What are the main data structures used by the code we’re looking at? Which of these are shared with callers or callees? Where are they allocated, freed, and modified? What are the crucial data invariants? Are the data structures (and their algorithms) basically the textbook versions or are there interesting quirks in the implementations?
Strictly speaking, we don’t need to run code in order to understand it. In practice, being able to run code is a lifesaver for several reasons. First, an actual execution follows a single path through the code, permitting us to ignore code not touched on that path. Second, if the computer is executing the code then the interpreter in our brains can take a rest and we can focus on other things. Third, if we have formed a bad hypothesis about the code, running the code is a good way to conclusively refute that hypothesis.
What is the dynamic view in practice? We can use a debugger to single-step through code, we can set breakpoints and watchpoints, we can add debugging printouts, we can add assertions corresponding to conjectured invariants, and we can write unit tests for the code to make sure we really understand it.
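Here is a minimal sketch of adding assertions for conjectured invariants. The ring buffer is a made-up stand-in for whatever data structure we are reading. Suppose that after reading the code we believe that head always stays below the capacity and that count never exceeds it. Writing that belief down as a check and calling it around every operation means that a wrong guess fails loudly the first time the code runs.

```c
#include <assert.h>
#include <stddef.h>

#define RING_CAPACITY 4

/* Hypothetical fixed-size ring buffer. */
struct ring {
    int slots[RING_CAPACITY];
    size_t head;  /* index of the oldest element */
    size_t count; /* number of elements currently stored */
};

/* Our conjectured invariants, written down so the machine can check them. */
static void ring_check(const struct ring *r)
{
    assert(r->head < RING_CAPACITY);
    assert(r->count <= RING_CAPACITY);
}

/* Returns 1 on success, 0 if the buffer is full. */
int ring_push(struct ring *r, int value)
{
    ring_check(r);
    if (r->count == RING_CAPACITY)
        return 0;
    r->slots[(r->head + r->count) % RING_CAPACITY] = value;
    r->count++;
    ring_check(r);
    return 1;
}

/* Returns 1 and stores the oldest element in *out, or 0 if empty. */
int ring_pop(struct ring *r, int *out)
{
    ring_check(r);
    if (r->count == 0)
        return 0;
    *out = r->slots[r->head];
    r->head = (r->head + 1) % RING_CAPACITY;
    r->count--;
    ring_check(r);
    return 1;
}
```

If an assertion like this fires, we have learned something either way: either the code has a bug or our mental model of it was wrong.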
Finally, it’s often a good idea to change the code: add a bit of functionality, fix a bug, do some refactoring, etc. If we’ve successfully understood the code, we’ll be able to do this without too much trouble.
This has been a bit of a brain dump, not a checklist as much as a collection of things to keep in mind when starting out on a code-reading project. I’d be happy to get suggestions for improvement.