Fun at the UNIX Terminal Part 1

This post is aimed at kids, like the 6th graders who I was recently teaching about programming in Python. It is more about having fun than about learning, but I hope that if you enjoy playing around at the UNIX terminal, you’ll eventually learn to use this kind of system for real. Keep in mind this immortal scene from Jurassic Park.

To run the commands in this post, you’ll need a UNIX machine: either Linux or Mac OS X will work. You’ll also need the ability to install software. There are two options:

  • Install precompiled binaries using a package manager, I’ll give command lines for Homebrew on OS X and for Apt on Ubuntu Linux. You’ll need administrator access to run Apt or to install Homebrew, but you do not need administrator access to install packages after Homebrew has been installed. Other versions of Linux have their own package managers and they are all pretty easy to use.
  • Build a program from source and install it in your home directory. This does not require administrator access but it’s more work and I’m not going to go into the details, though I hope to do this in a later post.

ROT13 using tr

The tr utility should be installed by default on an OS X or Ubuntu machine. It translates the characters in a string into different characters according to rules that you provide. To learn more about tr (or any other command in this post) type this command (without typing the dollar sign):

$ man tr

This will show you the UNIX system’s built-in documentation for the command.

In this and subsequent examples, I’ll show text that you should type on a line starting with a dollar sign, which is the default UNIX prompt. Text printed by the system will be on lines not starting with a dollar sign.

We’re going to use tr to encrypt some text as ROT13, which simply moves each letter forward in the alphabet by 13 places, wrapping around from Z to A if necessary. Since there are 26 letters, encrypting twice using ROT13 gives back the original text. ROT13 is fun but you would not want to use it for actual secret information since it is trivial to decrypt. It is commonly used to make it hard for people to accidentally read spoilers when discussing things like movie plot twists.

Type this:

$ echo 'Hello this is a test' | tr 'A-Za-z' 'N-ZA-Mn-za-m'
Uryyb guvf vf n grfg

Now to decrypt:

$ echo 'Uryyb guvf vf n grfg' | tr 'A-Za-z' 'N-ZA-Mn-za-m'
Hello this is a test

Just two more things before moving on to the next command.

First, the UNIX pipe operator (the “|” character in the commands above, which looks a little bit like a piece of pipe) is plumbing for UNIX commands: it “pipes” the output of one command to the input of a second command. We’ll be using it quite a bit.

Second, how exactly did we tell tr to implement ROT13? Well, the first argument, ‘A-Za-z’, gives it a set of characters to work with. Here A-Z stands for A through Z and a-z stands for a through z (computers treat the capital and lowercase versions of letters as being separate characters). So we are telling tr that it is going to translate any letter of the alphabet and leave any other character (spaces, punctuation, numbers, etc.) alone. The second argument to tr, ‘N-ZA-Mn-za-m’, specifies a mapping for the characters in the first argument. So the first character in the first argument (A) will be translated to the first character of the second argument (N), and so on. We could just as easily use tr to put some text in all uppercase or all lowercase, you might try this as an exercise.

fortune

Tragically, this command isn’t installed by default on a Mac or on an Ubuntu Linux machine. On a Mac you can install it like this:

brew install fortune

If this doesn’t work then you need to get Homebrew setup, try this page.

On Ubuntu try this:

sudo apt-get install fortune

The “sudo” command will ask you to enter your password before running the next command, apt-get, with elevated privileges, in order to install the fortune program in a system directory that you are normally not allowed to modify. This will only work if your machine has been specifically configured to allow you to run programs with elevated privileges.

In any case, if you can’t get fortune installed, don’t worry about it, just proceed to the next command.

Fortune randomly chooses from a large collection of mildly humorous quotes:

$ fortune 
I have never let my schooling interfere with my education.
		-- Mark Twain

say

This command is installed by default on a Mac; on Ubuntu you’ll need to type “sudo apt install gnustep-gui-runtime”.

Type this:

$ say "you just might be a genius"

Make sure you have sound turned up.

The Linux say command, for whatever reason, requires its input to be a command line argument, so we cannot use a pipe to send fortune’s output to say. So this command will not work on Linux (though it does work on OS X):

$ fortune | say

However, there’s another trick we can use: we can turn the output of one command into the command-line arguments for another command by putting the first command in parentheses and prefixing this with a dollar sign. So this will cause your computer to read a fortune aloud:

$ say $(fortune)

Another way to accomplish the same thing is to put fortune’s output into a file and then ask say to read the file aloud:

$ fortune > my_fortune.txt
$ say -f my_fortune.txt

Here the greater-than symbol “redirects” the output of fortune into a file. Redirection works like piping but the output goes to a file instead of into another program. It is super useful.

If you run “say” on both a Linux box and a Mac you will notice that the Mac’s speech synthesis routines are better.

cowsay

The extremely important cowsay command uses ASCII art to show you a cow saying something. Use it like this:

$ fortune | cowsay
 __________________________________ 
/ What is mind? No matter. What is \
| matter? Never mind.              |
|                                  |
\ -- Thomas Hewitt Key, 1799-1875  /
 ---------------------------------- 
        \   ^__^
         \  (oo)\_______
            (__)\       )\/\
                ||----w |
                ||     ||

Both Homebrew and Apt have a package called “cowsay” that you can install using the same kind of command line you’ve already been using.

Cowsay has some exciting options, such as “-d” which makes the cow appear to be dead:

$ fortune | cowsay -d
 _________________________________________ 
/ Laws are like sausages. It's better not \
| to see them being made.                 |
|                                         |
\ -- Otto von Bismarck                    /
 ----------------------------------------- 
        \   ^__^
         \  (xx)\_______
            (__)\       )\/\
             U  ||----w |
                ||     ||

Use “man cowsay” to learn more.

ponysay

Don’t install ponysay unless you feel that cowsay is too restrictive. Also, it isn’t available as a precompiled package. You can build it from source code by first installing the “git” package using apt-get or brew and then running the following commands:

$ git clone https://github.com/erkin/ponysay.git
$ cd ponysay
$ ./setup.py --freedom=partial --private install

This procedure puts ponysay into an odd location, but whatever. Here (assuming Linux, on a Mac you’ll need to pipe a different command’s output to ponysay) a cute pony tells us the prime factorization of a number:

figlet

Figlet (actually called FIGlet but that’s not what you type to run the command) prints text using large letters comprised of regular terminal characters. For example:

$ whoami | figlet
                     _          
 _ __ ___  __ _  ___| |__  _ __ 
| '__/ _ \/ _` |/ _ \ '_ \| '__|
| | |  __/ (_| |  __/ | | | |   
|_|  \___|\__, |\___|_| |_|_|   
          |___/                 

Figlet has lots of options for controlling the appearance of its output. For example, you can change the font:

$ echo 'hello Eddie' | figlet -f script
 _          _   _          ___                    
| |        | | | |        / (_)   |     |  o      
| |     _  | | | |  __    \__   __|   __|      _  
|/ \   |/  |/  |/  /  \_  /    /  |  /  |  |  |/  
|   |_/|__/|__/|__/\__/   \___/\_/|_/\_/|_/|_/|__/

Another command, toilet, is similar to figlet but has even more options. Install both of these programs using the same kinds of commands we’ve already been using to talk to package managers.

lolcat

The UNIX “cat” program prints a file, or whatever is piped into it, to the terminal. Lolcat is similar but it prints the text in awesome colors:

bb

The bb program doesn’t seem to be available from Homebrew, but on Ubuntu you can install it using “sudo apt-get install bb”. It is a seriously impressive ASCII art demo.

rig

You know how lots of web sites want you to sign up using your name and address, but your parents hopefully have trained you not to reveal your identity online? Well, the rig utility can help, it creates a random identity:

$ rig
Juana Waters
647 Hamlet St
Austin, TX  78710
(512) xxx-xxxx

The zip codes and telephone area codes are even correct. For some reason rig will never generate an identity that lives in Utah.

bc

The bc program is a calculator but unlike almost every other calculator you use, it can handle numbers of unlimited size (or, more precisely, numbers limited only by the amount of RAM in your computer) without losing precision. Try this:

$ echo '2 ^ 100' | bc
1267650600228229401496703205376

Unfortunately bc does not have a built-in factorial function but you can write one easily enough using bc’s built-in programming language. Start bc in interactive mode (this will happen by default if you don’t pipe any text into bc) by just typing “bc”. Then enter this code:

define fact(n) {
  if (n < 2) return 1;
  return n * fact(n - 1);
}

Now you can compute very large factorials:

fact(1000)
40238726007709377354370243392300398571937486421071463254379991042993\
85123986290205920442084869694048004799886101971960586316668729948085\
58901323829669944590997424504087073759918823627727188732519779505950\
99527612087497546249704360141827809464649629105639388743788648733711\
91810458257836478499770124766328898359557354325131853239584630755574\
09114262417474349347553428646576611667797396668820291207379143853719\
58824980812686783837455973174613608537953452422158659320192809087829\
73084313928444032812315586110369768013573042161687476096758713483120\
25478589320767169132448426236131412508780208000261683151027341827977\
70478463586817016436502415369139828126481021309276124489635992870511\
49649754199093422215668325720808213331861168115536158365469840467089\
75602900950537616475847728421889679646244945160765353408198901385442\
48798495995331910172335555660213945039973628075013783761530712776192\
68490343526252000158885351473316117021039681759215109077880193931781\
14194545257223865541461062892187960223838971476088506276862967146674\
69756291123408243920816015378088989396451826324367161676217916890977\
99119037540312746222899880051954444142820121873617459926429565817466\
28302955570299024324153181617210465832036786906117260158783520751516\
28422554026517048330422614397428693306169089796848259012545832716822\
64580665267699586526822728070757813918581788896522081643483448259932\
66043367660176999612831860788386150279465955131156552036093988180612\
13855860030143569452722420634463179746059468257310379008402443243846\
56572450144028218852524709351906209290231364932734975655139587205596\
54228749774011413346962715422845862377387538230483865688976461927383\
81490014076731044664025989949022222176590433990188601856652648506179\
97023561938970178600408118897299183110211712298459016419210688843871\
21855646124960798722908519296819372388642614839657382291123125024186\
64935314397013742853192664987533721894069428143411852015801412334482\
80150513996942901534830776445690990731524332782882698646027898643211\
39083506217095002597389863554277196742822248757586765752344220207573\
63056949882508796892816275384886339690995982628095612145099487170124\
45164612603790293091208890869420285106401821543994571568059418727489\
98094254742173582401063677404595741785160829230135358081840096996372\
52423056085590370062427124341690900415369010593398383577793941097002\
77534720000000000000000000000000000000000000000000000000000000000000\
00000000000000000000000000000000000000000000000000000000000000000000\
00000000000000000000000000000000000000000000000000000000000000000000\
0000000000000000000000000000000000000000000000000000

While we're at it, you should figure out why the factorial of any large number contains a lot of trailing zeroes.

Conclusion

We've only scratched the surface, I'll share more entertaining UNIX commands in a followup post. Some of these programs I hadn't even heard of until recently, a bunch of people on Twitter gave me awesome suggestions that I've used to write this up. If you want to see a serious (but hilariously outdated) video about what you can do using the UNIX command line, check out this video in which one of the original creators of UNIX shows how to build a spell checker by piping together some simple commands.

Teaching Python Informally to Kids

For the last few months I’ve been running a “coding club” for my son’s sixth-grade class. Once a week, the interested students (about 2/3 of the class) stick around for an hour after school and I help them learn to program. The structure is basically like the lab part of a programming class, without the lecture. I originally had planned to do some lecturing but it soon became clear that at the end of a full day of school (these kids get on the bus around 7:00am, ugh) there was very little attention span remaining. So instead I decided to give them a sheet of problems to work on each week and I spend the hour walking around helping whoever needs help.

One of my main goals is to avoid making anyone hate programming. There are three parts to this. First, I’m not forcing them to do anything, but rather providing suggestions and guidance. Second, there’s no assessment going on. I’ve never been too comfortable with the grading part of teaching, actually. It can really get in the way of learning. Third, I’m encouraging them to work together. I’ve long noticed (starting when I was a CS student) that most learning occurs between students in the lab. Not as much learning happens in the lecture hall.

For a curriculum, we started out with turtle graphics, which I’ve always found to be wonderful. Python’s turtle library is clunkier than Logo and also is abysmally slow, but the advantages (visual debugging, many kids enjoy graphics) outweigh the problems. Fractals (Sierpinski triangle, dragon curve) were a good way to introduce recursion. We spent three or four weeks doing turtle stuff before moving on to simple crypto, which didn’t work as well. On the other hand, the students did really well with mathy code, here’s the handout from one of the math weeks.

Some observations:

  • One of my favorite moments so far has been when a couple of students implemented rot13 functions and used it to send rude messages to each other.
  • It’s pretty important to take a snack break.
  • Although the rockstar / 10x programmer idea is out of favor, it is absolutely clear that there is a huge amount of variation (much more than 10x) in how easily a group of twenty 10-11 year olds learn programming. Some kids are teaching themselves to open files and parse text while others are stuck on basic syntax issues.
  • Syntax, which computer professionals more or less learn to overlook, is hugely important.
  • I’ve always disliked Python’s significant whitespace and watching kids struggle with it has made me hate it even more.

Anyhow, this has been a fun teaching exercise and hopefully it is benefiting the kids.

A Tourist’s Guide to the LLVM Source Code

In my Advanced Compilers course last fall we spent some time poking around in the LLVM source tree. A million lines of C++ is pretty daunting but I found this to be an interesting exercise and at least some of the students agreed, so I thought I’d try to write up something similar. We’ll be using LLVM 3.9, but the layout isn’t that different for previous (and probably subsequent) releases.

I don’t want to spend too much time on LLVM background but here are a few things to keep in mind:

  • The LLVM core doesn’t contain frontends, only the “middle end” optimizers, a pile of backends, documentation, and a lot of auxiliary code. Frontends such as Clang live in separate projects.
  • The core LLVM representation lives in RAM and is manipulated using a large C++ API. This representation can be dumped to readable text and parsed back into memory, but this is only a convenience for debugging: during a normal compilation using LLVM, textual IR is never generated. Typically, a frontend builds IR by calling into the LLVM APIs, then it runs some optimization passes, and finally it invokes a backend to generate assembly or machine code. When LLVM code is stored on disk (which doesn’t even happen during a normal compilation of a C or C++ project using Clang) it is stored as “bitcode,” a compact binary representation.
  • The main LLVM API documentation is generated by doxygen and can be found here. This information is very difficult to make use of unless you already have an idea of what you’re doing and what you’re looking for. The tutorials (linked below) are the place to start learning the LLVM APIs.

So now on to the code. Here’s the root directory, it contains:

  • bindings that permit LLVM APIs to be used from programming languages other than C++. There exist more bindings than this, including C (which we’ll get to a bit later) and Haskell (out of tree).
  • cmake: LLVM uses CMake rather than autoconf now. Just be glad someone besides you works on this.
  • docs in ReStructuredText. See for example the Language Reference Manual that defines the meaning of each LLVM instruction (GitHub renders .rst files to HTML by default; you can look at the raw file here.) The material in the tutorial subdirectory is particularly interesting, but don’t look at it there, rather go here. This is the best way to learn LLVM!
  • examples: This is the source code that goes along with the tutorials. As an LLVM hacker you should grab code, CMakeLists.txt, etc. from here whenever possible.
  • include: The first subdirectory, llvm-c, contains the C bindings, which I haven’t used but look pretty reasonable. Importantly, the LLVM folks try to keep these bindings stable, whereas the C++ APIs are prone to change across releases, though the pace of change seems to have slowed down in the last few years. The second subdirectory, llvm, is a biggie: it contains 878 header files that define all of the LLVM APIs. In general it’s easier to use the doxygen versions of these files rather than reading them directly, but I often end up grepping these files to find some piece of functionality.
  • lib contains the real goodies, we’ll look at it separately below.
  • projects doesn’t contain anything by default but it’s where you checkout LLVM components such as compiler-rt (runtime library for things like sanitizers), OpenMP support, and the LLVM C++ library that live in separate repos.
  • resources: something for Visual C++ that you and I don’t care about (but see here).
  • runtimes: another placeholder for external projects, added only last summer, I don’t know what actually goes here.
  • test: this is a biggie, it contains many thousands of unit tests for LLVM, they get run when you build the check target. Most of these are .ll files containing the textual version of LLVM IR. They test things like an optimization pass having the expected result. I’ll be covering LLVM’s tests in detail in an upcoming blog post.
  • tools: LLVM itself is just a collection of libraries, there isn’t any particular main function. Most of the subdirectories of the tools directory contain an executable tool that links against the LLVM libraries. For example, llvm-dis is a disassembler from bitcode to the textual assembly format.
  • unittests: More unit tests, also run by the check build target. These are C++ files that use the Google Test framework to invoke APIs directly, as opposed to the contents of the “test” directory, which indirectly invoke LLVM functionality by running things like the assembler, disassembler, or optimizer.
  • utils: emacs and vim modes for enforcing LLVM coding conventions; a Valgrind suppression file to eliminate false positives when running make check in such a way that all sub-processes are monitored by Valgrind; the lit and FileCheck tools that support unit testing; and, plenty of other random stuff. You probably don’t care about most of this.

Ok, that was pretty easy! The only thing we skipped over is the “lib” directory, which contains basically everything important. Let’s look its subdirectories now:

  • Analysis contains a lot of static analyses that you would read about in a compiler textbook, such as alias analysis and global value numbering. Some analyses are structured as LLVM passes that must be run by the pass manager; others are structured as libraries that can be called directly. An odd member of the analysis family is InstructionSimplify.cpp, which is a transformation, not an analysis; I’m sure someone can leave a comment explaining what it is doing here (see this comment). I’ll do a deep dive into this directory in a followup post.
  • AsmParser: parse textual IR into memory
  • Bitcode: serialize IR into the compact format and read it back into RAM
  • CodeGen: the LLVM target-independent code generator, basically a framework that LLVM backends fit into and also a bunch of library functions that backends can use. There’s a lot going on here (>100 KLOC) and unfortunately I don’t know very much about it.
  • DebugInfo is a library for maintaining mappings between LLVM instructions and source code locations. There’s a lot of good info in these slides from a talk at the 2014 LLVM Developers’ Meeting.
  • ExecutionEngine: Although LLVM is usually translated into assembly code or machine code, it can be directly executed using an interpreter. The non-jitting interpreter wasn’t quite working the last time I tried to use it, but anyhow it’s a lot slower than running jitted code. The latest JIT API, Orc, is in here.
  • Fuzzer: this is libFuzzer, a coverage-guided fuzzer similar to AFL. It doesn’t fuzz LLVM components, but rather uses LLVM functionality in order to perform fuzzing of programs that are compiled using LLVM.
  • IR: sort of a grab-bag of IR-related code, with no other obvious unifying theme. There’s code for dumping IR to the textual format, for upgrading bitcode files created by earlier versions of LLVM, for folding constants as IR nodes are created, etc.
  • IRReader, LibDriver, LineEditor: almost nobody will care about these and they contain hardly any code anyway.
  • Linker: An LLVM module, like a compilation unit in C or C++, contains functions and variables. The LLVM linker combines multiple modules into a single, larger module.
  • LTO: Link-time optimization, the subject of many blog posts and PhD theses, permits compiler optimizations to see through boundaries created by separate compilation. LLVM can do link-time optimization “for free” by using its linker to create a large module and then optimize this using the regular optimization passes. This used to be the preferred approach, but it doesn’t scale to huge projects. The current approach is ThinLTO, which gets most of the benefit at a small fraction of the cost.
  • MC: compilers usually emit assembly code and let an assembler deal with creating machine code. The MC subsystem in LLVM cuts out the middleman and generates machine code directly. This speeds up compiles and is especially useful when LLVM is used as a JIT compiler.
  • Object: Deals with details of object file formats such as ELF.
  • ObjectYAML seems to support encoding object files as YAML. I do not know why this is desirable.
  • Option: Command line parsing
  • Passes: part of the pass manager, which schedules and sequences LLVM passes, taking their dependencies and invalidations into account.
  • ProfileData: Read and write profile data to support profile-guided optimizations
  • Support: Miscellaneous support code including APInts (arbitrary-precision integers that are used pervasively in LLVM) and much else.
  • TableGen: A wacky Swiss-army knife of a tool that inputs .td files (of which there are more than 200 in LLVM) containing structured data and uses a domain-specific backend to emit C++ code that gets compiled into LLVM. TableGen is used, for example, to take some of the tedium out of implementing assemblers and disassemblers.
  • Target: the processor-specific parts of the backends live here. There are lots of TableGen files. As far as I can tell, you create a new LLVM backend by cloning the one for the architecture that looks the most like yours and then beating on it for a couple of years.
  • Transforms: this is my favorite directory, it’s where the middle-end optimizations live. IPO contains interprocedural optimizations that work across function boundaries, they are typically not too aggressive since they have to look at a lot of code. InstCombine is LLVM’s beast of a peephole optimizer. Instrumentation supports sanitizers. ObjCARC supports this. Scalar contains a pile of compiler-textbooky kinds of optimizers, I’ll try to write a more detailed post about the contents of this directory at some point. Utils are helper code. Vectorize is LLVM’s auto-vectorizer, the subject of much work in recent years.

And that’s all for the high-level tour, hope it was useful and as always let me know what I’ve got wrong or left out.

Undefined Behavior: Not Just for Programming Languages

This is an oldie but goodie. Start with this premise:
a = b
Multiply both sides by a:
a2 = ab
Subtract b2 from both sides:
a2 – b2 = ab – b2
Factor the left side:
(a + b)(a – b) = ab – b2
Factor the right side:
(a + b)(a – b) = b(a – b)
Divide both sides by (a – b) and cancel:
a + b = b
Substitute b for a:
b + b = b
Finally, let b = 1 and simplify:
2 = 1

I ran into this derivation when I was nine or ten years old and it made me deeply uneasy. The explanation, that you’re not allowed to divide by (a – b) because this term is equal to zero, seemed to raise more questions than it answered. How are we supposed to keep track of which terms are equal to zero? What if something is equal to zero but we don’t know it yet? What other little traps are lying out there, waiting to invalidate a derivation? This was one of many times where I noticed that in school they seemed willing to teach the easy version, and that the real world was never so nice, even in a subject like math where — you would think — everything is clean and precise.

Anyway, the point is that undefined behavior has been confusing people for well over a thousand years — we shouldn’t feel too bad that we haven’t gotten it right in programming languages yet.

Vigorous Public Debates in Academic Computer Science

The other day a non-CS friend remarked to me that since computer science is a quantitative, technical discipline, most issues probably have an obvious objective truth. Of course this is not at all the case, and it is not uncommon to find major disagreements even when all parties are apparently reasonable and acting in good faith. Sometimes these disagreements spill over into the public space.

The purpose of this post is to list a collection of public debates in academic computer science where there is genuine and heartfelt disagreement among intelligent and accomplished researchers. I sometimes assign these as reading in class: they are a valuable resource for a couple of reasons. First, they show an important part of science that often gets swept under the rug. Second, they put discussions out into the open where they are widely accessible. In contrast, I’ve heard of papers that are known to be worthless by all of the experts in the area, but only privately — and this private knowledge is of no help to outsiders who might be led astray by the bad research. For whatever reasons (see this tweet by Brendan Dolan-Gavitt) the culture in CS does not seem to encourage retracting papers.

I’d like to fill any holes in this list, please leave a comment if you know of a debate that I’ve left out!

Here are some more debates pointed out by readers:

Advanced Compilers Weeks 3-5

This continues a previous post.

We went through the lattice theory and introduction to dataflow analysis parts of SPA. I consider this extremely good and important material, but I’m afraid that the students looked pretty bored. It may be the case that this material is best approached by first looking at practical aspects and only later going into the theory.

One part of SPA that I’m not super happy with is the material about combining lattices (section 4.3). This is a useful and practical topic but the use cases aren’t really discussed. In class we went through some examples, for example this function that cannot be optimized by either constant propagation or dead code elimination alone, but can be optimized by their reduced product: conditional constant propagation. Which, as you can see, is implemented by both LLVM and GCC. Also, this example cannot be optimized by either sign analysis or parity analysis, but can be optimized using their reduced product.

We didn’t go into them, but I pointed the class to the foundational papers for dataflow analysis and abstract interpretation.

I gave an assignment to implement subtract and bitwise-and transfer functions for the interval abstract domain for signed 5-bit integers. The bitwidth is small so I can rapidly do exhausive testing of students’ code. Their subtract had to be correct and maximally precise — about half of the class accomplished this. Their bitwise-and had to be correct and more precise than always returning top, and about half of the class accomplished this as well (a maximally precise bitwise-and operator for intervals is not at all easy — try it!). Since not everyone got the code right, I had them fix bugs (if any) and resubmit their code for this week. I hope everyone will get it right this time! Also I will give prizes to students whose bitwise-and operator is on the Pareto frontier (out of all submitted solutions) for throughput vs precision and code size vs precision. Here are the results with the Pareto frontier in blue and the minimum and maximum precision in red (narrower intervals are better).

Impressively, student k implemented an optimally precise bitwise-and transfer function! Student c’s transfer function returned an answer other than top only for intervals of width 1. Mine (labeled JOHN) looked at the number of leading zeroes in both operands.

We looked at the LLVM implementation of the bitwise domain (“known bits”, they call it) which lives in ValueTracking.cpp. This analysis doesn’t have a real fixpoint computation, it rather simply walks up the dataflow graph in a recursive fashion, which is a bit confusing since it is a forward dataflow analysis that looks at nodes in the backward direction. The traversal stops at depth 6, and isn’t cached, so the code is really very easy to understand.

We started to look at how LLVM works, I went partway through some lecture notes by David Chisnall. We didn’t focus on the LLVM implementation yet, but rather looked at the design, with a bit of focus on SSA, which is worth spending some time on since it forms the foundation for most modern compilers. I had the students read the first couple of chapters of this drafty SSA book.

Something I’d appreciate feedback on is what (besides SSA) have been the major developments in ahead-of-time compiler technology over the last 25 years or so. Loop optimizations and vectorization have seen major advances of course, as have verified compilers. In this class I want to steer clear of PL-level innovations.

Finally, former Utah undergrad and current Googler Chad Brubaker visited the class and gave a guest lecture on UBSan in production Android: very cool stuff! Hopefully this motivated the class to care about using static analysis to remove integer overflow checks, since they will be doing assignments on that very topic in the future.

Teaching C

The other day Neel Krishnaswami mentioned that he’s going to be teaching the C class at Cambridge in the fall, and asked if I had any advice about that topic. Of course I do! In fact the response got so long that it ended up being a blog post.

My main idea is that we need to teach C in a way that helps students understand why a very large fraction of critical software infrastructure, including almost all operating systems and embedded systems, is written in C, while also acknowledging the disastrously central role that it has played in our ongoing computer security nightmare.

There’s a lot of reading material out there. For the basics, I still recommend that students purchase K&R. People say good things about C Programming: A Modern Approach; I’ve only skimmed it. For advanced C I’ve not read a better book than Expert C Programming, though like K&R it is fairly old. The Practice of Programming is a really great book though it’s not completely specific to C. I haven’t read all of it but from what I’ve seen Modern C is a very good resource, with AFAIK the best treatment of undefined behavior of any C book. The C FAQ contains lots of good material.

For supplemental reading, of course the students need to look at all three parts of Chris Lattner’s writeup about undefined behavior, and mine as well.

What version of C should we teach? Probably a common subset of C99 and C11. In a first C class there’s no need to go into advanced C11 features such as the concurrent memory model.

We’d like students to be able to answer the question: Is C an appropriate choice for solving this problem? We’ll want some lecture material about C’s place in the modern world and we also need to spend time reading some high-quality C code, perhaps starting with Redis, Musl, or Xv6. Musl, in particular, is a good match for teaching since it contains lots of cute little functions that can be understood in isolation. From any such function we can launch a discussion about tradeoffs between portability, efficiency, maintainability, testability, etc. If Rich Felker (the Musl author) did something a certain way, there’s probably a good reason for it and we should be able to puzzle it out. We can also use Matt Godbolt’s super awesome compiler explorer to look at the code generated by various compilers. C’s lightweight-to-nonexistent runtime support is one of its key advantages for real-world system building, and it also means that generated code can be understood without thinking about something like a garbage collector.

We probably also need to spend a bit of time looking at bad old C, the kind that makes the world work even though we’re not proud of it. We can find files in OpenSSL and in the PHP interpreter that would singe your brain despite getting run billions of times a day, or we can always pick on an old standby like glibc — worth looking at just for the preprocessor abuse. But perhaps I am being uncharitable: Pascal Cuoq (reading a draft of this piece) correctly points out that “even what seems like plain stupidity often stems from engineering trade-offs. Does the project try to remain compilable under MS-DOS with DJGPP, with C90 compilers, under VMS, or all three at the same time?” And it is true that we would do well to help students understand that real-world engineering constraints do not often resemble the circumstances that we lead them to expect in school.

The second big thing each student should learn is: How can I avoid being burned by C’s numerous and severe shortcomings? This is a development environment where only the paranoid can survive; we want to emphasize a modern C programming style and heavy reliance on the (thankfully excellent) collection of tools that is available for helping us develop good C.

Static analysis is the first line of defense; the students need to use a good selection of -W flags and then get used to making things compile without warnings. A stronger tool such as the Clang static analyzer should also be used. On the dynamic side, all code handed in by students must be clean as far as ASan, UBSan, and MSan are concerned. tis-interpreter holds code to an even higher standard; I haven’t had students use this tool yet but I think it’s a great thing to try. Since dynamic testing is limited by the quality of the test cases, the students need to get used to using the output of a code coverage tool to find gaps in test coverage. Lots of coverage tools for C are available but I usually just use gcov since it is ubiquitous and hassle-free.

Teaching undefined behavior using sanitizers is a piece of cake: the tool gives students exactly the feedback that they need. The other way of teaching undefined behavior, by looking at its consequences, is something that we should spend a bit of time on, but it requires a different kind of thinking and we probably won’t expect the majority of students to pick up on all the subtleties — even seasoned professional C programmers are often unaware of these.

Detecting errors and doing something about them is a really important part of programming that we typically don’t teach much about in school. Since C is designed to avoid sweeping these problems under the rug, a C class is a great place to get students started on the right track. They should have to implement a goto chain.

Something I’m leaving out of this post is the content of the assignments that we give the students — this mostly depends on the specific goals of the course and how it fits into the broader curriculum (In what year are students expected to take the class? What kind of background do they have in math and science? What languages do they already know?). I’ve always taught C as a side effect of teaching operating systems, embedded systems, or something along those lines. In a course where the primary goal is C we have more freedom, and could look at more domains. Image processing and cryptographic algorithms would be really fun, for example, and even the old standby, data structures, can be used to good effect in class.

I’m also leaving out build systems and version control. They should use these.

In some courses I will give students access to the test infrastructure that will be used to grade their code. This makes assignments a lot more fun, and makes students a lot more happy. Other times I will give them a few test cases and keep the good tests (and the fuzzers) for myself. The idea is to make the assignment not only about implementation but also about testing. This stresses students out but it’s far more realistic.

Pascal remarks that “C is mostly taught very badly, and a student who aims at becoming good at maintaining C code will need to unlearn much that they have (typically) been told in class.” This is regrettably true — a lot of instructors learned C in previous decades and then they teach an outdated language, for example failing to discourage preprocessor abuse. The most serious common failing is to leave students unaware of their side of the bargain when the deal with a C compiler. I am talking of course about undefined behavior (and, to a lesser extent, unspecified and implementation-defined behavior). As a concrete example, I have taught numerous classes based on Computer Systems: A Programmer’s Perspective. In most respects this is an excellent book, but (even in the 3rd edition) it not only ignores undefined behavior but, worse, explicitly teaches students that signed integers in C have two’s complement behavior on overflow:

This claim that positive signed overflow wraps around is neither correct by the C standard nor consistent with the observed behavior of either GCC or LLVM. This isn’t an acceptable claim to make in a popular C-based textbook published in 2015. While I can patch problems in the book during lecture, that isn’t very satisfying, and not all instructors have the time and expertise.

One might argue that we shouldn’t be teaching C any longer, and I would certainly agree that C is probably a poor first or second language. On the other hand, even if we were in a position where no new projects should be written in C (that day is coming, but slowly — probably at least a decade off), we’re still going to be stuck maintaining C for many decades. A random CS graduate has pretty good odds of running into C during her career. But beyond that, even after we replace C, the systems programming niche will remain. A lot of what we learn when we think we’re learning C is low-level programming and that stuff is important.

Thanks to Pascal Cuoq and Robby Findler for commenting on drafts of this piece.