Software Engineering Takeaways

I had a great time this spring teaching a software engineering course for a new professional masters degree program created by my department. Since I didn’t use slides or hand out lecture notes, some students were asking if maybe I could write up a summary of what I wanted them to learn in the course. This sounded like a good idea and I figured I’d make a blog post out of it.

This course didn’t have a textbook, but we ended up going through much of Writing Solid Code and also several chapters of Code Complete. I particularly like Appendix D of the 2nd edition of Writing Solid Code, where Maguire describes how he conducts an interview for a programming position. The students also completed about a dozen (mostly) programming assignments, individually and in groups, that I’m not going to talk about here.

Testing

Being good at testing is sort of a superpower and computer science programs usually don’t do a great job teaching it. And testing truly is hard to teach since there’s a lot to it and so much of testing is either highly situational or else is about engaging the problem with the right mindset and plenty of energy: you can’t test well unless you truly want to make things fail. My approach to teaching this material is to focus on what I think of as testing’s three pillars:

  1. Test cases. These come from regressions, from systematic or randomized test-case generators, from requirements and specifications, and from corner cases in the implementation. In a few special situations, exhaustive testing works. Test cases should rarely be discarded but they can be segregated, for example into fast and slow ones, which run at different times. Test cases should exercise all relevant sources of input: not just file inputs, but also environment variables, user inputs, and things like the Windows registry, if the system under test depends on those things. The time at which parts of the test case arrive may be important. Though I forgot to give it to the class, Cem Kaner’s What Is a Good Test Case? is excellent.
  2. Oracles. Test cases are only useful insofar as we can tell if the system under test processes them correctly. Some oracles are easy and broadly applicable, such as detecting crashes, excessive resource use, and violations of programming language rules (e.g. using ASan and UBSan). The most useful oracles, however, are specific to the system under test. The one we see most often is to specify, by hand, the expected behavior of the system, based on our understanding of the requirements or specification. Differential testing — checking one implementation of a spec against another — is highly powerful, and more broadly applicable than it first appears: the reference implementation doesn’t need to implement the full functionality of the system, and sometimes it is simply a different mode of the same system. For example, an optimizing compiler can be tested against a non-optimizing version of itself. Function-inverse pairs make an excellent oracle, when applicable. In metamorphic testing we change the test case (add irrelevant levels of nesting to a C program, or remove dead code from it) in a way that shouldn’t change how it works. Assertions and checkReps are extremely valuable oracles. Finding good oracles requires creativity and attention to detail, but the potential rewards are high.
  3. Coverage. Every serious programming language has one or more tools for measuring code coverage, with the goal of finding code not exercised by a test suite. There are a lot of variants on coverage, but in practice we seldom see anything beyond line or branch coverage. If the coverage induced by a test suite is bad, the test suite itself is bad. If the coverage is good, then the test suite might be good, but this is not a sure thing.

Testing should be as frictionless as possible: investments in automation and parallelization often pay off. In class we saw how incredibly easy it is to run the tests for a project in Github on every commit using Travis.

One of my favorite kinds of testing is burning in an ADT implementation — this exercise brings the three pillars of testing together in a very satisfying way.

How SQLite is Tested is great supplemental reading.

Tool and Library Ecosystems

One of the defining characteristics of modern software engineering is its reliance on tooling to help us create large projects rapidly, with lots of high-quality libraries to build on. Every programming language has one or more collections of tools, each of which forms a more or less coherent ecosystem for getting programming tasks done: editing, refactoring, linting, profiling, building, packaging, measuring coverage, debugging, etc. The tricky bit for newcomers is to rapidly learn the shape of an ecosystem in order to avoid wasting time doing jobs that could be better accomplished with tool support.

Modularity

If testing is the reason that large software sometimes works, then modularity is the reason it can be constructed in the first place. Modularity is kind of a tough topic to cover in a course but we did spend time on API design, where I mainly wanted to convey that it’s hard to get right and there are plenty of pitfalls to avoid: excess mutability, unclear ordering constraints among API calls, long lists of parameters, poor choices for default values, hard-to-follow memory ownership protocols, cryptic names, and unwise exposure of implementation details. Something I tried to show the class is how a small group of people working closely together on a well-defined task can almost always succeed, but that — without some care and experience — multiple small groups who have accomplished tasks separately will invariably have problems integrating things they have developed separately.

Code Reviews

Testing only gets us so far and a huge number of problems can be headed off by putting two or more sets of eyes on code before it lands, whether this is done in Github or in a meeting room. We did a bit of code reviewing in class, though I feel like we should have done this more. Some aspects I find important are trying to stay constructive and objective, having clear goals for the code review, and making sure that outcomes get written down and acted upon.

Coding Styles

My view is that the details aren’t nearly as important as just having and following a coding style. This include source code formatting details, identifier naming, language features to avoid, etc. This can result in a code base that is more approachable to newcomers and where some kinds of bugs are easier to spot. The various Google documents seem as good examples as any. Coding styles are particularly important for languages that are older, cruftier, and less safe. For example, LLVM lives in a C++ style that I generally find to be comfortable and tasteful.

Source Control Systems

Git is more or less the only game in town so there’s not much to talk about there. The thing I tried to convince the students here is that we don’t just “use git.” Rather, it’s crucial to define a sensible git-based workflow and get everyone onboard with it. The main thing is that for a variety of common use cases, we have a fairly mechanical set of steps to follow; git is confusing and badly designed enough that workflow innovation is best left to people with a lot of experience.

Critical Systems and Responsibility

We are far past the point of pretending that software is either ethically neutral or uniformly positive, and the 90s-era Internet optimism sounds pretty naive at this point. A significant fraction of today’s CS students are going to work on, at some point, software that is very much ethically non-neutral. This is a good post on that topic. We read various documents related to the Toyota unintended acceleration case and discussed other software safety problems like the Therac-25.

Backwards Compatibility

People tend to underestimate its importance and there are great stories on both sides. Microsoft’s commitment is incredible: a long time ago I had access to the Windows 2000 sources and ran across crazy things, like a copy of Windows 3.1 (in x86 assembly!) and routines for detecting legacy applications and implementing various buggy behaviors that they depended on. On the other side we have, for example, the Python 2/3 debacle.

Static Analysis

I tried to convince students that static analysis is useful, starting with compiler warnings. We read A Few Billion Lines of Code Later and they spent some time looking at issues pointed out by the Clang static analyzer.

Engineering for Performance

The message here is a bit subtle: we want to discourage premature optimization while still encouraging people to learn to build software that isn’t fundamentally too slow to meet its goals. So on one hand we don’t want to write everything in C “for speed” but on the other hand we need to avoid showstoppers such as Python, a bottleneck data structure stored on disk, a quadratic algorithm, etc.

For performance tuning, of course measurement is everything, so we need profilers and maybe also hardware-specific utilities like perf. It doesn’t hurt to know the strengths and weaknesses of both the compiler and the hardware platform, and we should always be ready to read some assembly. Chapters 25 and 26 of Code Complete 2e are a good resource for all of this, though they do not present a very nuanced view of what one can expect the compiler to accomplish.

Reporting Bugs

Getting a software developer to stop doing what they wanted to do, and to spend the day fixing a defect instead, can be difficult, but there are a few simple ingredients that can make bug reports vastly more effective. The report should be framed politely and matter-of-factly; implying that the developers are incompetent is rarely helpful. The issue must be reproducible and the bug report must describe all of the circumstances necessary to make the bug show itself. However, all nonessential circumstances should be left out, including any part of the test case that is not necessary for triggering the bug.

Software Process

There are plenty of software development process models, but I guess I don’t find any of these to be particularly compelling, so we didn’t spend much time on them. On the other hand, the elements of software process — estimation, requirements, modeling, architecture, implementation, quality assurance, maintenance, etc. — are hard to argue with. I spend a bit of effort trying to prepare students for the huge diversity in software development organizations they might see out in the world, from startups to banks to Googles to IBMs. I tried to get them used to the fact that in some cases management may have wildly unrealistic ideas about software development and that requirements can change at the drop of a hat.

The Human Element

Effectively reviewing someone’s code requires a light touch; you have to critique the code rather than judging its author. On the other side, it can be very difficult to gracefully process criticism of a piece of code that you’ve bled and sweated over for months. A bug report is really just an argument that a person should make a change to a piece of software. It is hard to implement a piece of software that solves someone’s problem without putting yourself in their place. Similarly, it is very difficult to solve a problem for someone you don’t like or respect or at least empathize with to some degree. UI design requires understanding the needs of people who don’t share your own abilities and cultural background. Ethical concerns are common, and probably should be thought about a lot more often than they are. Team dynamics are complex and managing people is really not easy. In summary, software development is fundamentally a human endeavor and we’d all do well to keep that in mind.

The Basic Toolbox

This post is aimed at computer science students.

In the software engineering course I’m teaching this spring, I often find myself saying things like “you need to know a scripting language” or “everyone should be able to run a code coverage tool.” Finally, the other day, a student stopped me and asked for the whole list. In other words, what — in my opinion — is the collection of tools that someone graduating with a CS degree should know how to use. Of course I couldn’t answer this on the spot but I’ve been thinking about it since then. The basic idea is that for most any common situation, you should have a decent tool at hand and be able to start solving problems with it without too much fumbling around. (Keep in mind that this is a wish list for self-study: I doubt that any CS program teaches all of these. Also, I didn’t have all of these tool skills when I got my undergraduate CS degree, though I did by the time I got a PhD.)

A version control system: Git is the obvious choice; the main thing you should have is a basic Github-centric workflow including pull requests, remotes, dealing with merge conflicts, etc.

A text editor: We all end up using different editors from time to time, but we should each have a solid default choice that does a good job with most editing tasks. It should highlight and indent any common programming language, integrate with a spellchecker, easily load gigantic files, have nice regex-based search and replace, etc. There are plenty of choices, many CS people migrate to vim or emacs.

A graphing program: I routinely use gnuplot, graphviz, and Powerpoint to make figures. Lots of people like matplotlib.

A presentation tool: Powerpoint, Keynote, Google Slides, something LaTeX based, etc.

An interactive debugger for native executables: LLDB, GDB, something IDE-based.

A generic build system: Make, CMake, etc.

A scripting language: This is for low-grade automation, quick and dirty data analysis tasks, etc. Python and JavaScript would seem like natural choices. Around 20 years ago I was an intern at a networking company and my supervisor popped out of a meeting with some data concerning switch errors, and asked me to do some analysis to locate the underlying pattern. I wasn’t sure how; he handed me a Perl book and I was able to get the job done before the meeting ended.

A shell language: This is probably bash or PowerShell, but there are plenty of other choices. There’s some overlap with scripting languages but I think there are two distinct niches here: a shell language is for scripting a smallish number of commands, doing a bit of error checking, and perhaps looping or interacting with the user slightly This sort of job is a bit too cumbersome in Python, Perl, or JavaScript.

A systems language: This is for creating servers, daemons, and other code that wants to go fast, use little memory, have few dependencies, and interact tightly with the OS. C or C++ would be the obvious choices, but Rust and Go may be fine too.

A workhorse language: This is your default programming language for most tasks, it should have a huge collection of high-quality libraries, be pretty fast, run on all common platforms, have a great tool ecosystem, etc. Racket, Java, Scala, OCaml, C#, Swift, or Haskell would be great — even C++ would work.

A pocket calculator: This is your go-to REPL for basic arithmetic and conversions between number representations, it should be near-instantaneous to get answers. For reasons I no longer remember, I use gdb for this — typically multiple times in any work day. Old standbys like bc and dc also seem like bad choices. I’m curious what other people do here.

Tools for Programming Languages

There’s no reason these days to use a language that doesn’t have a good tool ecosystem. For any given language you should know how to use its interactive debugger, static and dynamic bug-finding tools, a profiler, a code coverage tool, a build system, a package manager, and perhaps a random test-case generator.

Secondary Tools

There are a lot of other tools that could have gone into my basic toolbox, such as a data analysis tool, a browser language, a cloud-based testing service, a statistics language, a typesetting system, a spreadsheet, a database, and a GUI builder/toolkit. I don’t consider these as fundamental; of course, your mileage may vary.

Trust Boundaries in Software Systems

One of the big things that has changed in computer science education over the last 20 years is that it is now mandatory to prepare students for writing software that lives in a hostile environment. This content can’t be limited to a computer security course, it has to be spread throughout the curriculum. My experience, based on talking to people, looking through textbooks, and looking at lecture material on the web, is that overall we’re not yet doing a great job at this.

When teaching this subject, I’ve started using trust boundaries as an organizing principle. A trust boundary exists any time we (the system designers or system owners) trust code, data, or human actors on one side of an interface more than we trust the other side of the interface. Students need to be able to recognize, understand, fortify, and stress-test the trust boundaries in any system they have a stake in.

Trust boundaries aren’t hard to find: We just need to ask questions like “What are the consequences if this code/data became horribly malicious? Is that likely? Can we defend against it? Do we want to defend against it?” It is easy to conclude, for example, that a demonic garbage collector or OS kernel might not be something that we wish to defend against, but that we had better fortify our systems against toxic PNG files that we load from random web sites.

Some basic observations about trust boundaries:

  1. They’re everywhere, even inside code written by a single person. Anytime I put an assertion into my code, it’s a tacit acknowledgment that I don’t have complete trust that the property being asserted actually holds.
  2. The seriousness of trust boundaries varies greatly, from mild mistrust within a software library all the way to major safety issues where a power plant connects to the internet.
  3. They change over time: a lot of our security woes stem from trust boundaries becoming more serious than they had been in the past. Email was not designed for security. The NSA wasn’t ready for Snowden. Embedded control systems weren’t intended to be networked. Libraries for decoding images, movies, and other compressed file formats that were developed in the 90s were not ready for the kinds of creative exploits that they faced later on.
  4. If you fail to recognize and properly fortify an important trust boundary, it is very likely that someone else will recognize it and then exploit it.

To deal with trust boundaries, we have all the usual techniques and organizing principles: input sanitization, defense in depth, sandboxing, secure authentication, least privilege, etc. The issue that I’m trying to respond to with this post is that, in my experience, it doesn’t really work to hand students these tools without some sort of framework they can use to help figure out where and when to deploy the different defenses. I’d be interested to hear how other CS instructors are dealing with these issues.

A Conversation about Teaching Software Engineering

For better or worse, my impressions of software engineering as a field were shaped by a course I took as an undergrad that I thought was mostly not very interesting or useful. We spent a lot of time on waterfalls and stuff, while not covering testing in any detail. For the final project in the class we had to develop an application using a CASE tool (very hip at the time) where we described the class hierarchy using a GUI and then the tool generated skeletal C++ for us to fill in. Since we knew nothing about designing class hierarchies and also the tool was weird and buggy this all went about as disastrously as you would expect. In the end I learned quite a lot, but the lessons were probably not those intended by the instructor.

24 years later I’m teaching a software engineering class — this probably wouldn’t even happen if my department had any real software engineering faculty! Even so, I’m a true believer: I love the material and feel more strongly about its importance than I do about my more usual subjects like compilers and operating systems. I ignore software process and focus entirely on building skills and habits that I feel will come in handy in any software engineer’s career. If you like it, put a test on it. Read code. Review code. Refactor code. Write assertions. Adhere to coding standards. Design an API. Use a coverage tool, a bug-finding tool, a version control tool, a fuzz tool, a CI tool, to good effect. Repeat until end of semester. No doubt there’s room for improvement but the material seems solid.

Over Christmas break I had a beer with Daniel Dunbar, who I should have met long before, but somehow hadn’t. Daniel has done super impressive stuff: he was one of the original Klee authors and also was an early Clang implementer. I told him about my approach to teaching software engineering and picked his brain a bit about the sorts of things he wished people with CS degrees were better at doing. Of course I wasn’t taking notes and forgot most of it. So I mailed:

As I mentioned the other week, I’m teaching a software engineering course this semester, and rather than focusing on any kind of academic approach to this subject, I’ll try to teach them a lot of the real world business of making software that works, starting with all the basics like testing, coverage, assertions, and code reviews.

But you also mentioned some things that I agreed with but that I might not have thought of, or prioritized — the things that you wish people were already good at when you interview them, but they probably aren’t. Is there any chance I could get you to very briefly summarize those things, or point me to a resource you like on this subject? I want to make sure to cover, at least quickly, all of the main high points in this class.

Daniel graciously sent a long reply and also allowed me to reproduce it here:

To start with, I don’t think I know of any resources on the subject. It’s amazing how much obvious little stuff one has to know, but forgets about. Git is a huge source of “things I use every day but forgot I had to learn”, for better or worse.

I guess if I had to come up with a list off the cuff:

  1. The experience of maintaining software over time. I think we spend most of our time working with existing code bases and figuring out how to integrate changes into them. This area has a lot of related topics:
    • How do you figure out where to make a change? Tools here include debugging existing workflows to find where something happens, code search, git blame, git grep, git log -G, etc
    • How do you manage making incremental changes? I am a huge believer in always doing incremental work. How do you build a feature while always keeping the code working? Tools here include feature flags, adaptors and stub implementations, forwarding implementations, A/B testing before/after change.
    • How do you find the source of regressions? Tools here are basically bisection, git log -G.
    • How do you handle technical debt? What counts as technical debt? What kinds of debt are painful versus not?

    I feel like I probably have read things I liked on these topics, but none of the links are coming to mind now.

  2. The experience of making technical decisions. This is a *huge* part of development. Topics here:
    • How do you evaluate choices for a dependency? Tools here include benchmarking, analysis of the code, analysis of the software maintainability, etc
    • How do you evaluate when to adopt a dependency versus write your own? Topics here are NIH versus opportunity cost on innovation.
    • How do you convince people to follow a particular choice? Presenting or writing coherent write-ups on engineering tradeoffs is a really under appreciated skill which can have a big impact (the cost of bad decisions are high).
  3. How do you debug things? Another big part of development. I would emphasize debug here not just in the “how do I get this working” but even deeper in the “how do I understand what is really happening?”
    • I find that many people tend to give up at really understanding what is going on. “Oh XYZ didn’t work, so I did PDQ'”. The people who don’t give up usually end up understanding a lot more about computers, and then do a better job maturing over their career.
    • Maybe it would be good to teach people (a) don’t give up, and (b) here are all the tools you can use when you might want to give up. A lot of the time the tool people know is stackoverflow, but past that they are lost.
    • Things here include hardware watchpoints, reading the source code, disassembly and reverse engineering.
  4. How do you estimate the time to develop software? This is a huge part of a business, people will always want you to do this. Even just getting students to start to think about the process would be good, asking them to make estimates and compare results to them.
    • I have no advice on how to teach this because I still learning a lot here.
  5. How do you review code? What makes good review?
    • When is coding style important versus not? What are the pros/cons?
    • Does review catch bugs, or not? Are certain review styles more effective?
  6. How do people do release management? This is such an amazingly huge part of what we spend time on, and one that receives very little attention.
    • Do you release from trunk? If so, how do you ensure quality?
    • Do you have stable release branches? If so, how do you ensure bugs are actually fixed? Are people cherry picking fixes? What can go wrong? (I remember once cherry picking a fix that happened to merge incorrectly, but the patch applied an in a way that was still valid C++ code (an if {} clause ended up inside another one). The result was a clang that miscompiled itself.)
    • How do you deal with complicated merge conflicts?
    • People like Nicole Forsgren have research here which it would at least be nice for people to be exposed to.

If I had to come up with the kinds of potential exercises I would love it if someone was trying (no idea if they actually would work):

  1. Take a bug in a complex code base at some revision, and ask people to find it and fix it. Compare answers to the one the project actually adopted. A bug where there were several obvious fixes with tradeoffs would be a good point for discussion.
  2. Take a new feature which was added to some project, and study how it was done. Not to toot my own horn–its just an example I am familiar with–we migrated Clang to go from producing .s files to doing the assembly in memory. This involved lots of incremental refactoring, a clear switch over at one point (feature flag), A/B testing to compare old to new (i.e. we chose to shoot for binary equivalence of .o files, simply so we could easily test it — made for lots of extra engineering, but easier to guarantee correct). One could dig up how this went from a theoretical concept on a mailing list to an implemented feature.

    The best thing here would be to create a hypothetical project which is a mirror of a real project (but don’t make this clear at first) and ask people to design some extension of it. Then, compare the results to what the project actually decided to do, and the discussion around it. For example, analyze what things the project owners antagonized over that the students didn’t think of, and try and figure out why not.

  3. Do some systematic study of PRs as a “literature” exercise. How does tone impact response, how do people handle criticism, etc.
  4. Do an exercise where teams are forced to produce a project over the lifetime of the course. The exercises should be small and easier, but they have to be built into the same code base. Something that forces people to make software releases would be nice (dunno if there is a way to do this in such a way that people can choose whether or not to use “release branches”, but in a way that has tradeoffs in either direction).

Wow, this is a lot of material, way more than we cover in depth in a semester. Even so, I find it super useful as a high-level vision for the kinds of things students should end up at least having been exposed to.

The Dreaded Practice Talk

[I wrote a post with the same title in 2010; this is an updated version.]

In a week you’ll be giving a talk about your work to 600 people at a conference, or perhaps to five people who will sign off (or not) on your thesis. Depending on your area and the type of talk, the questions following the talk may not be very friendly. What should you do? Practice, practice, practice.

A practice talk is usually given to a small audience anywhere between a few weeks and a few hours before an important talk. It is followed by a feedback session that can easily last five times longer than the talk itself did. Often, multiple practice talks are necessary before the presentation becomes really polished and good.

This post is about getting maximum benefit from a practice talk — this is important because they are very time consuming.

The speaker needs to:

  • Have a legible slide number on every slide. If these aren’t there, people taking notes can’t easily refer back to specific slides later on.
  • Reserve a room, acquire a projector, and have everything setup and ready to go at the arranged time. Have all of the adapter dongles that you need on hand. If anyone is calling in remotely, this should also be taken care of by the speaker or by someone who has agreed to help the speaker, and it needs to be done before the talk is scheduled to start.
  • Have practiced the talk alone first. It helps to have memorized what to say when transitioning between slides. Memorizing an entire talk is usually overkill. Focus on transitions and on getting the talk started smoothly; most of us have a much easier time continuing to talk about a topic than getting started.
  • Have an appropriate number of slides. Speakers vary widely in terms of delivery speed and amount of content per slide, but 1.5 to 2 minutes per slide is probably about right. In realistic situations you will be cut off if you exceed your time budget. At proposals and defenses there is usually not a strict time budget, but going over time is strongly frowned upon.
  • Have a pen and paper available to take notes after the talk. You cannot remember 150 detailed suggestions about things to change.
  • Arrange for someone to time the talk. Sometimes it is helpful to get timings on individual slides.
  • Act on the feedback that is given.

Each member of the audience must:

  • Listen to the talk as if it were being given for real. Interrupting the speaker should be handled according to whatever protocol will be in force during the real talk. Generally this means few or no interruptions.
  • Arrive with a pen and paper, or equivalent note-taking gear.
  • Provide detailed feedback in a constructive and respectful fashion.

In my group this is usually the procedure:

  1. I give a bit of context: remind everyone what the speaker needs to accomplish, what kind of background and temperament the audience is likely to have, etc.
  2. I introduce the speaker.
  3. The talk is given, minimizing interruptions to get a good timing estimate.
  4. Starting with students, the audience asks questions as if they had just heard the real version of the talk. The speaker responds accordingly.
  5. Starting with students, the audience makes general comments about the delivery of the talk.
  6. We go through the talk slide by slide, giving feedback and trying to figure out what to add, delete, change around, etc.

Finally, a bit of advice on making slides:

  • Don’t put text too close to the edges of slides; some projection systems crop a bit.
  • Colors often look different when they go through a projector, and low-contrast colors can be completely invisible on a screen. Use a small number of very high-contrast colors. I typically use black on white for almost everything with some bright red or blue for emphasis.
  • Minimize the number of animations.

Fun at the UNIX Terminal Part 1

This post is aimed at kids, like the 6th graders who I was recently teaching about programming in Python. It is more about having fun than about learning, but I hope that if you enjoy playing around at the UNIX terminal, you’ll eventually learn to use this kind of system for real. Keep in mind this immortal scene from Jurassic Park.

To run the commands in this post, you’ll need a UNIX machine: either Linux or Mac OS X will work. You’ll also need the ability to install software. There are two options:

  • Install precompiled binaries using a package manager, I’ll give command lines for Homebrew on OS X and for Apt on Ubuntu Linux. You’ll need administrator access to run Apt or to install Homebrew, but you do not need administrator access to install packages after Homebrew has been installed. Other versions of Linux have their own package managers and they are all pretty easy to use.
  • Build a program from source and install it in your home directory. This does not require administrator access but it’s more work and I’m not going to go into the details, though I hope to do this in a later post.

ROT13 using tr

The tr utility should be installed by default on an OS X or Ubuntu machine. It translates the characters in a string into different characters according to rules that you provide. To learn more about tr (or any other command in this post) type this command (without typing the dollar sign):

$ man tr

This will show you the UNIX system’s built-in documentation for the command.

In this and subsequent examples, I’ll show text that you should type on a line starting with a dollar sign, which is the default UNIX prompt. Text printed by the system will be on lines not starting with a dollar sign.

We’re going to use tr to encrypt some text as ROT13, which simply moves each letter forward in the alphabet by 13 places, wrapping around from Z to A if necessary. Since there are 26 letters, encrypting twice using ROT13 gives back the original text. ROT13 is fun but you would not want to use it for actual secret information since it is trivial to decrypt. It is commonly used to make it hard for people to accidentally read spoilers when discussing things like movie plot twists.

Type this:

$ echo 'Hello this is a test' | tr 'A-Za-z' 'N-ZA-Mn-za-m'
Uryyb guvf vf n grfg

Now to decrypt:

$ echo 'Uryyb guvf vf n grfg' | tr 'A-Za-z' 'N-ZA-Mn-za-m'
Hello this is a test

Just two more things before moving on to the next command.

First, the UNIX pipe operator (the “|” character in the commands above, which looks a little bit like a piece of pipe) is plumbing for UNIX commands: it “pipes” the output of one command to the input of a second command. We’ll be using it quite a bit.

Second, how exactly did we tell tr to implement ROT13? Well, the first argument, ‘A-Za-z’, gives it a set of characters to work with. Here A-Z stands for A through Z and a-z stands for a through z (computers treat the capital and lowercase versions of letters as being separate characters). So we are telling tr that it is going to translate any letter of the alphabet and leave any other character (spaces, punctuation, numbers, etc.) alone. The second argument to tr, ‘N-ZA-Mn-za-m’, specifies a mapping for the characters in the first argument. So the first character in the first argument (A) will be translated to the first character of the second argument (N), and so on. We could just as easily use tr to put some text in all uppercase or all lowercase, you might try this as an exercise.

fortune

Tragically, this command isn’t installed by default on a Mac or on an Ubuntu Linux machine. On a Mac you can install it like this:

brew install fortune

If this doesn’t work then you need to get Homebrew setup, try this page.

On Ubuntu try this:

sudo apt-get install fortune

The “sudo” command will ask you to enter your password before running the next command, apt-get, with elevated privileges, in order to install the fortune program in a system directory that you are normally not allowed to modify. This will only work if your machine has been specifically configured to allow you to run programs with elevated privileges.

In any case, if you can’t get fortune installed, don’t worry about it, just proceed to the next command.

Fortune randomly chooses from a large collection of mildly humorous quotes:

$ fortune 
I have never let my schooling interfere with my education.
		-- Mark Twain

say

This command is installed by default on a Mac; on Ubuntu you’ll need to type “sudo apt install gnustep-gui-runtime”.

Type this:

$ say "you just might be a genius"

Make sure you have sound turned up.

The Linux say command, for whatever reason, requires its input to be a command line argument, so we cannot use a pipe to send fortune’s output to say. So this command will not work on Linux (though it does work on OS X):

$ fortune | say

However, there’s another trick we can use: we can turn the output of one command into the command-line arguments for another command by putting the first command in parentheses and prefixing this with a dollar sign. So this will cause your computer to read a fortune aloud:

$ say $(fortune)

Another way to accomplish the same thing is to put fortune’s output into a file and then ask say to read the file aloud:

$ fortune > my_fortune.txt
$ say -f my_fortune.txt

Here the greater-than symbol “redirects” the output of fortune into a file. Redirection works like piping but the output goes to a file instead of into another program. It is super useful.

If you run “say” on both a Linux box and a Mac you will notice that the Mac’s speech synthesis routines are better.

cowsay

The extremely important cowsay command uses ASCII art to show you a cow saying something. Use it like this:

$ fortune | cowsay
 __________________________________ 
/ What is mind? No matter. What is \
| matter? Never mind.              |
|                                  |
\ -- Thomas Hewitt Key, 1799-1875  /
 ---------------------------------- 
        \   ^__^
         \  (oo)\_______
            (__)\       )\/\
                ||----w |
                ||     ||

Both Homebrew and Apt have a package called “cowsay” that you can install using the same kind of command line you’ve already been using.

Cowsay has some exciting options, such as “-d” which makes the cow appear to be dead:

$ fortune | cowsay -d
 _________________________________________ 
/ Laws are like sausages. It's better not \
| to see them being made.                 |
|                                         |
\ -- Otto von Bismarck                    /
 ----------------------------------------- 
        \   ^__^
         \  (xx)\_______
            (__)\       )\/\
             U  ||----w |
                ||     ||

Use “man cowsay” to learn more.

ponysay

Don’t install ponysay unless you feel that cowsay is too restrictive. Also, it isn’t available as a precompiled package. You can build it from source code by first installing the “git” package using apt-get or brew and then running the following commands:

$ git clone https://github.com/erkin/ponysay.git
$ cd ponysay
$ ./setup.py --freedom=partial --private install

This procedure puts ponysay into an odd location, but whatever. Here (assuming Linux, on a Mac you’ll need to pipe a different command’s output to ponysay) a cute pony tells us the prime factorization of a number:

figlet

Figlet (actually called FIGlet but that’s not what you type to run the command) prints text using large letters comprised of regular terminal characters. For example:

$ whoami | figlet
                     _          
 _ __ ___  __ _  ___| |__  _ __ 
| '__/ _ \/ _` |/ _ \ '_ \| '__|
| | |  __/ (_| |  __/ | | | |   
|_|  \___|\__, |\___|_| |_|_|   
          |___/                 

Figlet has lots of options for controlling the appearance of its output. For example, you can change the font:

$ echo 'hello Eddie' | figlet -f script
 _          _   _          ___                    
| |        | | | |        / (_)   |     |  o      
| |     _  | | | |  __    \__   __|   __|      _  
|/ \   |/  |/  |/  /  \_  /    /  |  /  |  |  |/  
|   |_/|__/|__/|__/\__/   \___/\_/|_/\_/|_/|_/|__/

Another command, toilet, is similar to figlet but has even more options. Install both of these programs using the same kinds of commands we’ve already been using to talk to package managers.

lolcat

The UNIX “cat” program prints a file, or whatever is piped into it, to the terminal. Lolcat is similar but it prints the text in awesome colors:

bb

The bb program doesn’t seem to be available from Homebrew, but on Ubuntu you can install it using “sudo apt-get install bb”. It is a seriously impressive ASCII art demo.

rig

You know how lots of web sites want you to sign up using your name and address, but your parents hopefully have trained you not to reveal your identity online? Well, the rig utility can help, it creates a random identity:

$ rig
Juana Waters
647 Hamlet St
Austin, TX  78710
(512) xxx-xxxx

The zip codes and telephone area codes are even correct. For some reason rig will never generate an identity that lives in Utah.

bc

The bc program is a calculator but unlike almost every other calculator you use, it can handle numbers of unlimited size (or, more precisely, numbers limited only by the amount of RAM in your computer) without losing precision. Try this:

$ echo '2 ^ 100' | bc
1267650600228229401496703205376

Unfortunately bc does not have a built-in factorial function but you can write one easily enough using bc’s built-in programming language. Start bc in interactive mode (this will happen by default if you don’t pipe any text into bc) by just typing “bc”. Then enter this code:

define fact(n) {
  if (n < 2) return 1;
  return n * fact(n - 1);
}

Now you can compute very large factorials:

fact(1000)
40238726007709377354370243392300398571937486421071463254379991042993\
85123986290205920442084869694048004799886101971960586316668729948085\
58901323829669944590997424504087073759918823627727188732519779505950\
99527612087497546249704360141827809464649629105639388743788648733711\
91810458257836478499770124766328898359557354325131853239584630755574\
09114262417474349347553428646576611667797396668820291207379143853719\
58824980812686783837455973174613608537953452422158659320192809087829\
73084313928444032812315586110369768013573042161687476096758713483120\
25478589320767169132448426236131412508780208000261683151027341827977\
70478463586817016436502415369139828126481021309276124489635992870511\
49649754199093422215668325720808213331861168115536158365469840467089\
75602900950537616475847728421889679646244945160765353408198901385442\
48798495995331910172335555660213945039973628075013783761530712776192\
68490343526252000158885351473316117021039681759215109077880193931781\
14194545257223865541461062892187960223838971476088506276862967146674\
69756291123408243920816015378088989396451826324367161676217916890977\
99119037540312746222899880051954444142820121873617459926429565817466\
28302955570299024324153181617210465832036786906117260158783520751516\
28422554026517048330422614397428693306169089796848259012545832716822\
64580665267699586526822728070757813918581788896522081643483448259932\
66043367660176999612831860788386150279465955131156552036093988180612\
13855860030143569452722420634463179746059468257310379008402443243846\
56572450144028218852524709351906209290231364932734975655139587205596\
54228749774011413346962715422845862377387538230483865688976461927383\
81490014076731044664025989949022222176590433990188601856652648506179\
97023561938970178600408118897299183110211712298459016419210688843871\
21855646124960798722908519296819372388642614839657382291123125024186\
64935314397013742853192664987533721894069428143411852015801412334482\
80150513996942901534830776445690990731524332782882698646027898643211\
39083506217095002597389863554277196742822248757586765752344220207573\
63056949882508796892816275384886339690995982628095612145099487170124\
45164612603790293091208890869420285106401821543994571568059418727489\
98094254742173582401063677404595741785160829230135358081840096996372\
52423056085590370062427124341690900415369010593398383577793941097002\
77534720000000000000000000000000000000000000000000000000000000000000\
00000000000000000000000000000000000000000000000000000000000000000000\
00000000000000000000000000000000000000000000000000000000000000000000\
0000000000000000000000000000000000000000000000000000

While we're at it, you should figure out why the factorial of any large number contains a lot of trailing zeroes.

Conclusion

We've only scratched the surface, I'll share more entertaining UNIX commands in a followup post. Some of these programs I hadn't even heard of until recently, a bunch of people on Twitter gave me awesome suggestions that I've used to write this up. If you want to see a serious (but hilariously outdated) video about what you can do using the UNIX command line, check out this video in which one of the original creators of UNIX shows how to build a spell checker by piping together some simple commands.

Teaching Python Informally to Kids

For the last few months I’ve been running a “coding club” for my son’s sixth-grade class. Once a week, the interested students (about 2/3 of the class) stick around for an hour after school and I help them learn to program. The structure is basically like the lab part of a programming class, without the lecture. I originally had planned to do some lecturing but it soon became clear that at the end of a full day of school (these kids get on the bus around 7:00am, ugh) there was very little attention span remaining. So instead I decided to give them a sheet of problems to work on each week and I spend the hour walking around helping whoever needs help.

One of my main goals is to avoid making anyone hate programming. There are three parts to this. First, I’m not forcing them to do anything, but rather providing suggestions and guidance. Second, there’s no assessment going on. I’ve never been too comfortable with the grading part of teaching, actually. It can really get in the way of learning. Third, I’m encouraging them to work together. I’ve long noticed (starting when I was a CS student) that most learning occurs between students in the lab. Not as much learning happens in the lecture hall.

For a curriculum, we started out with turtle graphics, which I’ve always found to be wonderful. Python’s turtle library is clunkier than Logo and also is abysmally slow, but the advantages (visual debugging, many kids enjoy graphics) outweigh the problems. Fractals (Sierpinski triangle, dragon curve) were a good way to introduce recursion. We spent three or four weeks doing turtle stuff before moving on to simple crypto, which didn’t work as well. On the other hand, the students did really well with mathy code, here’s the handout from one of the math weeks.

Some observations:

  • One of my favorite moments so far has been when a couple of students implemented rot13 functions and used it to send rude messages to each other.
  • It’s pretty important to take a snack break.
  • Although the rockstar / 10x programmer idea is out of favor, it is absolutely clear that there is a huge amount of variation (much more than 10x) in how easily a group of twenty 10-11 year olds learn programming. Some kids are teaching themselves to open files and parse text while others are stuck on basic syntax issues.
  • Syntax, which computer professionals more or less learn to overlook, is hugely important.
  • I’ve always disliked Python’s significant whitespace and watching kids struggle with it has made me hate it even more.

Anyhow, this has been a fun teaching exercise and hopefully it is benefiting the kids.

A Tourist’s Guide to the LLVM Source Code

In my Advanced Compilers course last fall we spent some time poking around in the LLVM source tree. A million lines of C++ is pretty daunting but I found this to be an interesting exercise and at least some of the students agreed, so I thought I’d try to write up something similar. We’ll be using LLVM 3.9, but the layout isn’t that different for previous (and probably subsequent) releases.

I don’t want to spend too much time on LLVM background but here are a few things to keep in mind:

  • The LLVM core doesn’t contain frontends, only the “middle end” optimizers, a pile of backends, documentation, and a lot of auxiliary code. Frontends such as Clang live in separate projects.
  • The core LLVM representation lives in RAM and is manipulated using a large C++ API. This representation can be dumped to readable text and parsed back into memory, but this is only a convenience for debugging: during a normal compilation using LLVM, textual IR is never generated. Typically, a frontend builds IR by calling into the LLVM APIs, then it runs some optimization passes, and finally it invokes a backend to generate assembly or machine code. When LLVM code is stored on disk (which doesn’t even happen during a normal compilation of a C or C++ project using Clang) it is stored as “bitcode,” a compact binary representation.
  • The main LLVM API documentation is generated by doxygen and can be found here. This information is very difficult to make use of unless you already have an idea of what you’re doing and what you’re looking for. The tutorials (linked below) are the place to start learning the LLVM APIs.

So now on to the code. Here’s the root directory, it contains:

  • bindings that permit LLVM APIs to be used from programming languages other than C++. There exist more bindings than this, including C (which we’ll get to a bit later) and Haskell (out of tree).
  • cmake: LLVM uses CMake rather than autoconf now. Just be glad someone besides you works on this.
  • docs in ReStructuredText. See for example the Language Reference Manual that defines the meaning of each LLVM instruction (GitHub renders .rst files to HTML by default; you can look at the raw file here.) The material in the tutorial subdirectory is particularly interesting, but don’t look at it there, rather go here. This is the best way to learn LLVM!
  • examples: This is the source code that goes along with the tutorials. As an LLVM hacker you should grab code, CMakeLists.txt, etc. from here whenever possible.
  • include: The first subdirectory, llvm-c, contains the C bindings, which I haven’t used but look pretty reasonable. Importantly, the LLVM folks try to keep these bindings stable, whereas the C++ APIs are prone to change across releases, though the pace of change seems to have slowed down in the last few years. The second subdirectory, llvm, is a biggie: it contains 878 header files that define all of the LLVM APIs. In general it’s easier to use the doxygen versions of these files rather than reading them directly, but I often end up grepping these files to find some piece of functionality.
  • lib contains the real goodies, we’ll look at it separately below.
  • projects doesn’t contain anything by default but it’s where you checkout LLVM components such as compiler-rt (runtime library for things like sanitizers), OpenMP support, and the LLVM C++ library that live in separate repos.
  • resources: something for Visual C++ that you and I don’t care about (but see here).
  • runtimes: another placeholder for external projects, added only last summer, I don’t know what actually goes here.
  • test: this is a biggie, it contains many thousands of unit tests for LLVM, they get run when you build the check target. Most of these are .ll files containing the textual version of LLVM IR. They test things like an optimization pass having the expected result. I’ll be covering LLVM’s tests in detail in an upcoming blog post.
  • tools: LLVM itself is just a collection of libraries, there isn’t any particular main function. Most of the subdirectories of the tools directory contain an executable tool that links against the LLVM libraries. For example, llvm-dis is a disassembler from bitcode to the textual assembly format.
  • unittests: More unit tests, also run by the check build target. These are C++ files that use the Google Test framework to invoke APIs directly, as opposed to the contents of the “test” directory, which indirectly invoke LLVM functionality by running things like the assembler, disassembler, or optimizer.
  • utils: emacs and vim modes for enforcing LLVM coding conventions; a Valgrind suppression file to eliminate false positives when running make check in such a way that all sub-processes are monitored by Valgrind; the lit and FileCheck tools that support unit testing; and, plenty of other random stuff. You probably don’t care about most of this.

Ok, that was pretty easy! The only thing we skipped over is the “lib” directory, which contains basically everything important. Let’s look its subdirectories now:

  • Analysis contains a lot of static analyses that you would read about in a compiler textbook, such as alias analysis and global value numbering. Some analyses are structured as LLVM passes that must be run by the pass manager; others are structured as libraries that can be called directly. An odd member of the analysis family is InstructionSimplify.cpp, which is a transformation, not an analysis; I’m sure someone can leave a comment explaining what it is doing here (see this comment). I’ll do a deep dive into this directory in a followup post.
  • AsmParser: parse textual IR into memory
  • Bitcode: serialize IR into the compact format and read it back into RAM
  • CodeGen: the LLVM target-independent code generator, basically a framework that LLVM backends fit into and also a bunch of library functions that backends can use. There’s a lot going on here (>100 KLOC) and unfortunately I don’t know very much about it.
  • DebugInfo is a library for maintaining mappings between LLVM instructions and source code locations. There’s a lot of good info in these slides from a talk at the 2014 LLVM Developers’ Meeting.
  • ExecutionEngine: Although LLVM is usually translated into assembly code or machine code, it can be directly executed using an interpreter. The non-jitting interpreter wasn’t quite working the last time I tried to use it, but anyhow it’s a lot slower than running jitted code. The latest JIT API, Orc, is in here.
  • Fuzzer: this is libFuzzer, a coverage-guided fuzzer similar to AFL. It doesn’t fuzz LLVM components, but rather uses LLVM functionality in order to perform fuzzing of programs that are compiled using LLVM.
  • IR: sort of a grab-bag of IR-related code, with no other obvious unifying theme. There’s code for dumping IR to the textual format, for upgrading bitcode files created by earlier versions of LLVM, for folding constants as IR nodes are created, etc.
  • IRReader, LibDriver, LineEditor: almost nobody will care about these and they contain hardly any code anyway.
  • Linker: An LLVM module, like a compilation unit in C or C++, contains functions and variables. The LLVM linker combines multiple modules into a single, larger module.
  • LTO: Link-time optimization, the subject of many blog posts and PhD theses, permits compiler optimizations to see through boundaries created by separate compilation. LLVM can do link-time optimization “for free” by using its linker to create a large module and then optimize this using the regular optimization passes. This used to be the preferred approach, but it doesn’t scale to huge projects. The current approach is ThinLTO, which gets most of the benefit at a small fraction of the cost.
  • MC: compilers usually emit assembly code and let an assembler deal with creating machine code. The MC subsystem in LLVM cuts out the middleman and generates machine code directly. This speeds up compiles and is especially useful when LLVM is used as a JIT compiler.
  • Object: Deals with details of object file formats such as ELF.
  • ObjectYAML seems to support encoding object files as YAML. I do not know why this is desirable.
  • Option: Command line parsing
  • Passes: part of the pass manager, which schedules and sequences LLVM passes, taking their dependencies and invalidations into account.
  • ProfileData: Read and write profile data to support profile-guided optimizations
  • Support: Miscellaneous support code including APInts (arbitrary-precision integers that are used pervasively in LLVM) and much else.
  • TableGen: A wacky Swiss-army knife of a tool that inputs .td files (of which there are more than 200 in LLVM) containing structured data and uses a domain-specific backend to emit C++ code that gets compiled into LLVM. TableGen is used, for example, to take some of the tedium out of implementing assemblers and disassemblers.
  • Target: the processor-specific parts of the backends live here. There are lots of TableGen files. As far as I can tell, you create a new LLVM backend by cloning the one for the architecture that looks the most like yours and then beating on it for a couple of years.
  • Transforms: this is my favorite directory, it’s where the middle-end optimizations live. IPO contains interprocedural optimizations that work across function boundaries, they are typically not too aggressive since they have to look at a lot of code. InstCombine is LLVM’s beast of a peephole optimizer. Instrumentation supports sanitizers. ObjCARC supports this. Scalar contains a pile of compiler-textbooky kinds of optimizers, I’ll try to write a more detailed post about the contents of this directory at some point. Utils are helper code. Vectorize is LLVM’s auto-vectorizer, the subject of much work in recent years.

And that’s all for the high-level tour, hope it was useful and as always let me know what I’ve got wrong or left out.

Undefined Behavior: Not Just for Programming Languages

This is an oldie but goodie. Start with this premise:
a = b
Multiply both sides by a:
a2 = ab
Subtract b2 from both sides:
a2 – b2 = ab – b2
Factor the left side:
(a + b)(a – b) = ab – b2
Factor the right side:
(a + b)(a – b) = b(a – b)
Divide both sides by (a – b) and cancel:
a + b = b
Substitute b for a:
b + b = b
Finally, let b = 1 and simplify:
2 = 1

I ran into this derivation when I was nine or ten years old and it made me deeply uneasy. The explanation, that you’re not allowed to divide by (a – b) because this term is equal to zero, seemed to raise more questions than it answered. How are we supposed to keep track of which terms are equal to zero? What if something is equal to zero but we don’t know it yet? What other little traps are lying out there, waiting to invalidate a derivation? This was one of many times where I noticed that in school they seemed willing to teach the easy version, and that the real world was never so nice, even in a subject like math where — you would think — everything is clean and precise.

Anyway, the point is that undefined behavior has been confusing people for well over a thousand years — we shouldn’t feel too bad that we haven’t gotten it right in programming languages yet.

Vigorous Public Debates in Academic Computer Science

The other day a non-CS friend remarked to me that since computer science is a quantitative, technical discipline, most issues probably have an obvious objective truth. Of course this is not at all the case, and it is not uncommon to find major disagreements even when all parties are apparently reasonable and acting in good faith. Sometimes these disagreements spill over into the public space.

The purpose of this post is to list a collection of public debates in academic computer science where there is genuine and heartfelt disagreement among intelligent and accomplished researchers. I sometimes assign these as reading in class: they are a valuable resource for a couple of reasons. First, they show an important part of science that often gets swept under the rug. Second, they put discussions out into the open where they are widely accessible. In contrast, I’ve heard of papers that are known to be worthless by all of the experts in the area, but only privately — and this private knowledge is of no help to outsiders who might be led astray by the bad research. For whatever reasons (see this tweet by Brendan Dolan-Gavitt) the culture in CS does not seem to encourage retracting papers.

I’d like to fill any holes in this list, please leave a comment if you know of a debate that I’ve left out!

Here are some more debates pointed out by readers: