Teaching C

The other day Neel Krishnaswami mentioned that he’s going to be teaching the C class at Cambridge in the fall, and asked if I had any advice about that topic. Of course I do! In fact the response got so long that it ended up being a blog post.

My main idea is that we need to teach C in a way that helps students understand why a very large fraction of critical software infrastructure, including almost all operating systems and embedded systems, is written in C, while also acknowledging the disastrously central role that it has played in our ongoing computer security nightmare.

There’s a lot of reading material out there. For the basics, I still recommend that students purchase K&R. People say good things about C Programming: A Modern Approach; I’ve only skimmed it. For advanced C I’ve not read a better book than Expert C Programming, though like K&R it is fairly old. The Practice of Programming is a really great book though it’s not completely specific to C. I haven’t read all of it but from what I’ve seen Modern C is a very good resource, with AFAIK the best treatment of undefined behavior of any C book. The C FAQ contains lots of good material.

For supplemental reading, of course the students need to look at all three parts of Chris Lattner’s writeup about undefined behavior, and mine as well.

What version of C should we teach? Probably a common subset of C99 and C11. In a first C class there’s no need to go into advanced C11 features such as the concurrent memory model.

We’d like students to be able to answer the question: Is C an appropriate choice for solving this problem? We’ll want some lecture material about C’s place in the modern world and we also need to spend time reading some high-quality C code, perhaps starting with Redis, Musl, or Xv6. Musl, in particular, is a good match for teaching since it contains lots of cute little functions that can be understood in isolation. From any such function we can launch a discussion about tradeoffs between portability, efficiency, maintainability, testability, etc. If Rich Felker (the Musl author) did something a certain way, there’s probably a good reason for it and we should be able to puzzle it out. We can also use Matt Godbolt’s super awesome compiler explorer to look at the code generated by various compilers. C’s lightweight-to-nonexistent runtime support is one of its key advantages for real-world system building, and it also means that generated code can be understood without thinking about something like a garbage collector.

We probably also need to spend a bit of time looking at bad old C, the kind that makes the world work even though we’re not proud of it. We can find files in OpenSSL and in the PHP interpreter that would singe your brain despite getting run billions of times a day, or we can always pick on an old standby like glibc — worth looking at just for the preprocessor abuse. But perhaps I am being uncharitable: Pascal Cuoq (reading a draft of this piece) correctly points out that “even what seems like plain stupidity often stems from engineering trade-offs. Does the project try to remain compilable under MS-DOS with DJGPP, with C90 compilers, under VMS, or all three at the same time?” And it is true that we would do well to help students understand that real-world engineering constraints do not often resemble the circumstances that we lead them to expect in school.

The second big thing each student should learn is: How can I avoid being burned by C’s numerous and severe shortcomings? This is a development environment where only the paranoid can survive; we want to emphasize a modern C programming style and heavy reliance on the (thankfully excellent) collection of tools that is available for helping us develop good C.

Static analysis is the first line of defense; the students need to use a good selection of -W flags and then get used to making things compile without warnings. A stronger tool such as the Clang static analyzer should also be used. On the dynamic side, all code handed in by students must be clean as far as ASan, UBSan, and MSan are concerned. tis-interpreter holds code to an even higher standard; I haven’t had students use this tool yet but I think it’s a great thing to try. Since dynamic testing is limited by the quality of the test cases, the students need to get used to using the output of a code coverage tool to find gaps in test coverage. Lots of coverage tools for C are available but I usually just use gcov since it is ubiquitous and hassle-free.
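As a rough sketch of what this workflow looks like on a single function, consider the snippet below. The function name `sum_prefix` and the file name `sum.c` are made up for illustration, and the flags in the comment are the usual GCC/Clang spellings; other compilers will differ.

```c
/*
 * Hypothetical file sum.c. One plausible set of checks (GCC/Clang):
 *   warnings:        cc -std=c11 -Wall -Wextra -Wpedantic -c sum.c
 *   static analysis: clang --analyze sum.c
 *   ASan + UBSan:    add -g -fsanitize=address,undefined when building tests
 *   MSan:            add -g -fsanitize=memory (Clang only, separate build)
 *   coverage:        add --coverage, run the tests, then inspect with gcov
 */
#include <stddef.h>

/*
 * Sum the first n elements of a[], which holds len elements.
 * A draft that loops with `i <= n`, or that trusts n blindly, reads past
 * the end of the array; ASan reports that kind of access, but only if
 * some test actually drives execution down that path.
 */
long sum_prefix(const int *a, size_t len, size_t n)
{
    if (n > len)
        n = len;                /* clamp instead of reading out of bounds */

    long total = 0;
    for (size_t i = 0; i < n; i++)
        total += a[i];
    return total;
}
```

A test that only ever calls `sum_prefix(a, 3, 2)` never exercises the clamp. The coverage report would show that line as unexecuted, which is the cue to add a case such as `sum_prefix(a, 3, 10)`.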

Teaching undefined behavior using sanitizers is a piece of cake: the tool gives students exactly the feedback that they need. The other way of teaching undefined behavior, by looking at its consequences, is something that we should spend a bit of time on, but it requires a different kind of thinking and we probably won’t expect the majority of students to pick up on all the subtleties — even seasoned professional C programmers are often unaware of these.

Detecting errors and doing something about them is a really important part of programming that we typically don’t teach much about in school. Since C is designed to avoid sweeping these problems under the rug, a C class is a great place to get students started on the right track. They should have to implement a goto chain.
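For readers who have not seen the idiom, here is a minimal sketch of a goto chain. The `struct image` type and the functions `image_create` and `image_destroy` are invented for this example. Each resource acquired gets a label, and a failure jumps to the label that releases everything acquired so far, in reverse order.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical type, for illustration only. */
struct image {
    size_t width, height;
    unsigned char *pixels;      /* width * height bytes */
    unsigned char *mask;        /* width * height bytes */
};

/* Returns a zero-filled image, or NULL on any failure, leaking nothing. */
struct image *image_create(size_t width, size_t height)
{
    struct image *img;
    size_t n;

    /* Reject empty images and sizes whose product would overflow size_t. */
    if (width == 0 || height == 0 || width > SIZE_MAX / height)
        goto err;
    n = width * height;

    img = malloc(sizeof *img);
    if (img == NULL)
        goto err;

    img->pixels = calloc(n, 1);
    if (img->pixels == NULL)
        goto err_free_img;

    img->mask = calloc(n, 1);
    if (img->mask == NULL)
        goto err_free_pixels;

    img->width = width;
    img->height = height;
    return img;

    /* Cleanup runs in reverse order of acquisition and falls through. */
err_free_pixels:
    free(img->pixels);
err_free_img:
    free(img);
err:
    return NULL;
}

void image_destroy(struct image *img)
{
    if (img == NULL)
        return;
    free(img->mask);
    free(img->pixels);
    free(img);
}
```

The point of the pattern is that there is exactly one cleanup path, so adding a third buffer later means adding one check and one label rather than editing every early return.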

Something I’m leaving out of this post is the content of the assignments that we give the students — this mostly depends on the specific goals of the course and how it fits into the broader curriculum (In what year are students expected to take the class? What kind of background do they have in math and science? What languages do they already know?). I’ve always taught C as a side effect of teaching operating systems, embedded systems, or something along those lines. In a course where the primary goal is C we have more freedom, and could look at more domains. Image processing and cryptographic algorithms would be really fun, for example, and even the old standby, data structures, can be used to good effect in class.

I’m also leaving out build systems and version control. They should use these.

In some courses I will give students access to the test infrastructure that will be used to grade their code. This makes assignments a lot more fun, and makes students a lot more happy. Other times I will give them a few test cases and keep the good tests (and the fuzzers) for myself. The idea is to make the assignment not only about implementation but also about testing. This stresses students out but it’s far more realistic.

Pascal remarks that “C is mostly taught very badly, and a student who aims at becoming good at maintaining C code will need to unlearn much that they have (typically) been told in class.” This is regrettably true: a lot of instructors learned C in previous decades and now teach an outdated language, for example failing to discourage preprocessor abuse. The most serious common failing is to leave students unaware of their side of the bargain when they deal with a C compiler. I am talking of course about undefined behavior (and, to a lesser extent, unspecified and implementation-defined behavior). As a concrete example, I have taught numerous classes based on Computer Systems: A Programmer’s Perspective. In most respects this is an excellent book, but (even in the 3rd edition) it not only ignores undefined behavior but, worse, explicitly teaches students that signed integers in C have two’s complement behavior on overflow:

This claim that positive signed overflow wraps around is neither correct by the C standard nor consistent with the observed behavior of either GCC or LLVM. This isn’t an acceptable claim to make in a popular C-based textbook published in 2015. While I can patch problems in the book during lecture, that isn’t very satisfying, and not all instructors have the time and expertise.
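Since signed overflow is undefined rather than wrapping, portable code has to rule it out before doing the arithmetic. The sketch below shows one way to do that for `int` multiplication using only division-based range checks. The function name `checked_mul` is made up for this example, and GCC and Clang also offer `__builtin_mul_overflow` as a non-standard shortcut.

```c
#include <limits.h>
#include <stdbool.h>

/*
 * Store a * b in *result and return true, or return false (leaving
 * *result untouched) if the mathematical product does not fit in int.
 * Every check is done with divisions that cannot themselves overflow,
 * so no undefined behavior occurs on any input.
 */
bool checked_mul(int a, int b, int *result)
{
    if (a > 0) {
        if (b > 0) {
            if (a > INT_MAX / b)        /* positive * positive too large */
                return false;
        } else {
            if (b < INT_MIN / a)        /* positive * non-positive too small */
                return false;
        }
    } else {
        if (b > 0) {
            if (a < INT_MIN / b)        /* non-positive * positive too small */
                return false;
        } else {
            if (a != 0 && b < INT_MAX / a)  /* negative * negative too large */
                return false;
        }
    }
    *result = a * b;                    /* safe: the product is in range */
    return true;
}
```

Compare this with the tempting after-the-fact test `if (a * b < 0)`, which evaluates the overflowing multiplication first; by then the program has already executed undefined behavior and an optimizer is entitled to assume the check never fires.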

One might argue that we shouldn’t be teaching C any longer, and I would certainly agree that C is probably a poor first or second language. On the other hand, even if we were in a position where no new projects should be written in C (that day is coming, but slowly — probably at least a decade off), we’re still going to be stuck maintaining C for many decades. A random CS graduate has pretty good odds of running into C during her career. But beyond that, even after we replace C, the systems programming niche will remain. A lot of what we learn when we think we’re learning C is low-level programming and that stuff is important.

Thanks to Pascal Cuoq and Robby Findler for commenting on drafts of this piece.

33 responses to “Teaching C”

  1. I was hoping you would mention the dynamic checkers, ASAN etc. My fear is people will start using these but not understand how to really use them properly. Since they’re dynamic checkers you really need a good testing harness and test cases to go with them. The bugs ASAN/TSAN etc catch are usually strange boundary conditions that won’t get triggered by the normal grading test cases the TAs will use. You really need a fuzzer.

    I think a good exercise for any C class is to hand out a buggy program and have the students write test cases to try and catch all the bugs. Not only that, they should be able to understand why each bug was triggered and write patches. Perhaps in one assignment students design some socket program that must allocate/destroy buffers and do math on said buffers via user input. In the next assignment each student gets another student’s program and tries to break it.

  2. As another possible example code, how about SQLite? It’s often held up as one of the best and most-thoroughly tested C code bases in existence. You’ve also used it for examples in other work, too.

  3. “A lot of what we learn when we think we’re learning C is low-level programming and that stuff is important.”

    This is the key part here. If you’re just teaching “coding” to school kids or whatever, it’s acceptable to pick something accessible depending on age-group and/or prior experience. But if you’re preparing future computer scientists/engineers (as in a CSE program in college) there’s no excuse to not teach how computers actually work. And that’s best done with a low level language like C, working both at the kernel level (i.e. involving direct interactions with real or simulated hardware) and at the user level just above the kernel.

    We need more people as practicing software engineers who have the capability to understand issues at those levels even if they end up using a higher level stack for building business logic for whatever their application requires.

  4. I feel like it’s worthwhile talking about how to avoid undefined behaviour that the sanitisers *don’t* check, but I think it’s quite easy to accidentally invoke UB when it comes to issues like strict aliasing without realising it, *especially* when you’re relatively inexperienced with C.

  5. You mentioned that “that day is coming” regarding no projects should be written in C. What existing (or at least developing) alternatives do you see? I mean languages that can produce code at least as fast as C does, and that are at least as low-level as C (so that little to no ASM needs to be used for stuff like a kernel). Because the only one I can think of right now is Rust, but it doesn’t seem to be a fair replacement for C.

    Also, as you correctly mentioned (and I guess it should be noted in your article), the behavior of signed integer overflow in C is undefined by C standard. But I’m not quite sure though that GCC does something undefined (at least for x86 architecture). Seems like at least for GCC x86 it wraps exactly as that book mentions (and of course we still shouldn’t rely on that).

  6. Sam, see the linked code examples! GCC for x86 is not wrapping those multiplications around to negative values.

    In an optimistic future, Rust could be a complete replacement for C within a decade. I don’t know of any other languages that might realistically be used to write an OS kernel or hypervisor within that time frame.

  7. Hi Geoffrey, you’re right that strict aliasing is still a big problem. In a first C class I would probably mention the problem and recommend that everyone compiles their code with -fno-strict-aliasing unless they are certain that the code base contains zero problematic type casts.

  8. Phil, SQLite is an amazing code base in many ways, but I do not find it pleasing to read in the same way that those other code bases are. I can’t speak for others, but my own preference would be to steer clear of it in a class like the one we’re talking about here.

  9. Scott, I agree that bug-finding exercises can be good. I’ve tried running build-it / break-it cycles in classes. It can work but it’s pretty difficult to run it properly.

  10. Great post.

    I don’t think C is going away even in 10 years. To cite just one example, we have a current explosion in GPUs for HPC. And for HPC, realtime, and big-memory stuff (where data layout and custom allocation schemes matter), C is still a fine choice to get close to the HW. Of course, with great power comes great responsibility–to learn the craft of software engineering and use the tooling available to not add to the security mess or write horrific code for someone else to maintain later.

    In my experience, I’ve worked with far too many “engineers” who were never taught or never bothered to learn on their own how the HW really works. How does that fancy string split function really do its work, in say Python. These “engineers” have no idea; “That’s just a function call in Python|Ruby|Java|C++ (STL)” Just horrible.

    I could not agree more with Chetan’s wise comment:
    “We need more people as practicing software engineers who have the capability to understand issues at those [low] levels even if they end up using a higher level stack for building business logic for whatever their application requires.”

    If you’ve never built a binary search tree or {insert your favorite data structure} or implemented a hashing algorithm with bit manipulations, you generally will lack so much knowledge about how hardware works and therefore how to write good software, even if at a higher level.

    “The C programming language’s only failing is giving you access to what is really there, and telling you the cold hard raw truth. C gives you the red pill. C pulls the curtain back to show you the wizard. C is truth.” — Zed Shaw (c.learncodethehardway.org)

  11. Ada is realistically better than C for writing kernels, hypervisors etc… and it’s here now.

  12. At the moment, Rust definitely seems like the only language on a trajectory to possibly replace C, but mainly for social reasons rather than technical ones. You could replace it today with Ada, or even something like Modula-2, if people weren’t predisposed to dismiss them as options.

    I find it hilarious when people think of Pascal as some old, slow, useless language that would be a terrible choice for anything, yet still think of C as a realistic language to choose for development today. Not that Pascal is without problems, but it was the default systems language for years on the Mac and was not too long ago one of the major business software implementation languages.

    We want so badly to believe that the language/OS situation today is based to a large degree on some contest of technical merit, but from my experience, it’s mostly an accident of social dynamics.

    Anyway, I think your outline for how to approach teaching C is right on target. Thanks for taking the time to flesh it out in blog form!

  13. Folks have been trying to replace C for a long time, to no avail. The author mentioned embedded systems, but how do you write an executive or state-machine for a 16 bit micro and then port it to dozens of different architectures from a variety of microprocessor vendors? How do you integrate optimized and necessary assembler into another language as easily as C?

    Dream on smarty pants. C will be fundamental long after you are dead. It gets rediscovered again and again and again.

    College kids are gullible.

  14. I think the why of undefined behaviour should be explained, otherwise students will just say “that’s stupid” and give up. For example: hot inner loops that are 30% faster if the compiler can assume addition will never overflow, different CPU architectures, and so on. Also, talk about some of the, let’s say ‘fun’, cases where undefined behaviour has caused a programme to blow up, to keep students interested and engaged.

    I think that as soon as C functions (i.e. sub-routines) are taught, the mechanics of the stack (stack pointer, return address, local variables) should be taught, including how it all works with recursive functions. This can also be a good time to show one of the several ways that pointers can be invalidated (pointer to a local variable that goes out of scope).

  15. SQLite is an idiomatic and polarizing codebase. For example, its maintainer chooses not to use NULL. It’s carefully written and battle-tested software, but people are increasingly unhappy with the coding style.

  16. I agree with David A’s and Chetan’s comments that teaching C should be accompanied by teaching how it works at the HW/low level. This gives a holistic view of the system and encourages students to write good C code that gets maximum performance from the HW. A good example would be demonstrating the performance difference when a loop traverses an array in row-major rather than column-major order.

  17. Regarding replacement of C, I agree with others that Ada is a viable alternative, often dismissed because of non technical assumptions. Ada can be used to write low level stuff, has good IDE and testing environment, Ada 2012 has built-in contracts and Ada has even a subset with precisely defined semantics and proving capabilities: SPARK!

    Regarding the C teaching, maybe it would be worth skimming through rules to write safety-critical C code, like Holzmann’s Power of Ten (https://en.wikipedia.org/wiki/The_Power_of_10:_Rules_for_Developing_Safety-Critical_Code) and MISRA-C rules.

  18. James, have you ever heard of Forth? Works on an awful lot of 8- and 16-bit micros and generally has an integrated assembler; even better-integrated than the one in C. The Philae Lander, which touched down on Comet 67P in 2014, was programmed in Forth; it’s not commonly used today, but not quite dead either!

    A variant of Modula-2 (called Modula-GM) was used for the engine management microcontrollers in many GM vehicles from ’86 to ’05. It inspired the “Structured Text” form of PLC programming. It’s certainly a lot easier to write a compiler for than C. It’s got a more precise ISO standard than C has; it’s defined via the formal language VDM.

    Ada is used in microcontrollers for all sorts of embedded applications. It’s also ISO standardized and has the SPARK variant for formal analysis along with things like the Ravenscar profile for safety-critical, hard real-time systems.

    There’s absolutely nothing special about C with regard to writing microcontroller executive and state machine code, aside from the industry currently being ridiculously myopic in its recognition of other languages. C’s shortcomings are especially obvious when you try to port code written for one quirky microcontroller environment to another. Not to mention the inanity of trying to write portable, optimizable C code for DSP processors.

    Speaking of which, learners of C ought to be exposed to a variety of CPU architectures to see why, even today, you can’t assume ‘char’ is always signed or unsigned or that it’s any particular number of bits wide. There are a number of TI DSP cores that are in quite a lot of consumer devices today that have 16- or 32-bit wide ‘char’ types. People, for some reason, want to write C code for them. Go figure!

    As an embedded systems programmer, I honestly have no idea why the industry puts up with it. My guess is that the EE folks that previously dominated the writing of embedded control software like it better than assembly, haven’t been shown any other options, and don’t care enough about software to look for or learn anything better. Yet, various parts of the industry are perfectly willing to adopt sub-languages of C (such as MISRA and other strict coding standards) that, while preventing some iffy code constructs, also end up requiring as much learning/training to adopt as would be required to learn a reasonable language, and constrain programs to a stilted and sprawling style that tends to obscure the meaning of the code in convoluted control flow logic or linter-exception comments. It’s like Stockholm Syndrome or something.

  19. @Levi,

    All good, but antidotal, items you strung together in your response; but it does not in any way substantiate a reason to abandon C. Furthermore, as EE and other hardware oriented folks continue to push DSP, FPGA, SoC and a myriad of GPU and other memory centric hardware, “C” is where they meet and can work together. Real embedded products mostly use C and assembler. A few pennies saved on a board design translate into millions of dollars for a given product. Most C/asm embedded developers greatly appreciate new, particularly “bolt-on” codes, but those will *NEVER replace C*, imho.

    These systems I’m talking about dominate the number of actual processors in the wild. So, teaching C, with all of its variances and nuances, is warranted. Please stop trying to kill the goose that definitely laid the golden eggs of modern computing.

  20. A couple of comments about the post:

    1. Rust, as much as I love it, is not a replacement for C; it’s a replacement for C++. If you have any qualms about using C++ for a project, many of the same arguments will apply to Rust.

    2. Are you going to mention any of the advantages of C? Why would anyone want to use it, particularly if you have spent all that time on the difficulties? And if you do, how can you get the most out of it? Many of the most important C idioms go against the “best practices” of other languages.

  21. @James,

    The reasons to move away from C (not “abandon”; it’s always going to be around in at least a legacy capacity) are obvious to anyone who studies the literature regarding software reliability and the defect rates in C code. Good education will help reduce that, but there are plenty of reasons why C code is particularly hard to get right. When the design of your tool makes it more error-prone than others, you don’t make excuses for it because it’s done so much for you the way it is–you fix it, or design a better one. The C standards committee has made it clear that it’s not going to change in any significant way, so replacement (for at least some purposes) is what needs to happen.

    There’s a lot to like about C, but it’s not magic and it’s not even very unique as a language. Just because it’s what people use now doesn’t mean that it’s inherently specially suited for its role. It’s not a “golden goose”; the actual “golden goose” is cheap programmable semiconductors and the rate at which we’ve been improving their capabilities. Whether you program them in C, Rust, Forth, Ada or Modula-2, they get the job done about the same as long as the code is correct. But the complexity in microcontroller code is going up, and so are software defects, and these buggy controllers are going into more and more devices in our lives. They’re already driving our cars!

    This demands a multi-front attack on the software reliability issues:

    1. Education about how to reduce defects in C
    2. Better tools for reducing defects in C code
    3. Serious effort towards replacing C with something inherently less error-prone and more tool-friendly

    Any plan that relies completely on #1 is a total non-starter. Getting people to produce reliable C code takes a lot of training and experience, and those are really expensive. Demand for embedded programmers is going to go *up* and demand for efficiency in software production isn’t going to leave development solely in the hands of experienced C programmers or produce magical quantities of training/hiring budget.

    John is already doing an admirable job of advancing the state of the art in #2, but there are inherent limits to the analysis that can be done due to the nature of C. The tools that exist now are mostly some combination of expensive, hard to use, and hard to interpret the results from correctly. These problems are not going to go away, although the efforts of people like John will hopefully reduce the cost and training required to use them effectively. They will be critical in maintaining our legacy C codebases, which as I think we agree will be with us for a very long time.

    The only long-term solution is to supplement them with #3; people who refuse to see the writing on the wall for C and boldly proclaim that it can never be replaced aren’t doing anyone any favors in this regard. C is a legacy language, albeit one with a particularly grand legacy that will remain with us for a long time. Moving on from it is not an insult to C or the many people who use it today, simply a sign of progress and increased maturity for the programming industry. I am okay with skepticism about any particular candidate for successor to C (personally, I hope there are many rather than just one, so we don’t get stuck in another tool-rut like this), but the notion that C will never be replaced just has to die.

  22. Some [old] references that I like:

    Hatton L. Safer C: Developing Software for High Integrity and Safety-Critical Systems. McGraw-Hill, 1995. ISBN 0-07-707640-0.
    Discussion of what makes C software safe or unsafe combined with surveys of large codebases to find the most common kinds of unsafe practices. His focus is on best bang for the buck – focus your efforts on weeding out the most serious and most likely faults.

    Eide E, Regehr J. Volatiles are Miscompiled, and What to Do About It. ACM, 2008.
    Important if the course includes projects where use of volatile is called for. Introduces the concept that a compiler might be buggy.

    Oram A, Wilson G. Beautiful Code: Leading Programmers Explain How They Think. O’Reilly Books, 2007. ISBN 0-596-51004-7.
    Beauty comes from union of design with use of common language idioms. Minority of examples are in C.

    Spinellis D. Code Reading: The Open Source Perspective. Addison-Wesley, 2003. ISBN 0-201-79940-5.
    How to read code; how to assess if it is good or bad; how to use code reading to improve your own code. Most examples are in C.

  23. Some curriculum topics that perhaps might be covered in a C course. Some is specific to C; some is system programming.
    > Two’s-Complement Notation (and that its use isn’t guaranteed in C). How to portably convert between specified interface’s two’s complement encoding and native integer encoding.
    > ASCII Encoding (and that its use isn’t guaranteed in C).
    > Endianness; how to portably convert between specified interface’s endianness and native endianness.
    > Limitations of floating-point arithmetic (as per What Every Computer Scientist Should Know About Floating-Point Arithmetic).
    > Memory-Mapped I/O versus Port-Mapped I/O and how each might be implemented in C. Importance of segregating nonportable code. Importance of volatile. Examples of read/write registers, read-only registers, and write-only registers.
    > Interrupts. What is an interrupt. How to write an interrupt handler in C; need for an assembly language wrapper. Mixing C and assembler. Segregating nonportable code.
    > Polled I/O versus Interrupt-Driven I/O.
    > Synchronous versus Asynchronous I/O.
    > Y2K, D10K: bugs created by earlier generations. What are today’s incipient Y2K bugs?
    > Data integrity strategies: Hamming codes, checksums, CRCs, etc. Engineering tradeoffs between implementation effort, runtime load, size of the check values, and sensitivity to corruption.
    > Perhaps a simple foreground/background scheduler to teach some of the basics of context switching.

  24. I know you’ve already rejected a suggestion of SQLite, but I would put in another vote for it, as I personally learnt a lot from that code base in my early days and still consider it exemplary. IMHO a good example of modern C written with regard to safety and security is the core of the Signal protocol.

  25. @James,

    First, I assume you meant “anecdotal” not “antidotal”. The use of Ada in practically the entire aerospace and railway industry, is not an anecdote. It’s concrete proof that Ada is reliable and efficient, up to hard-realtime use cases.

    Of course there are costs in changing languages: re-training, re-tooling, etc… but the tooling is available today. If you were to point out an issue, it’s that most of these compilers and verification tools are proprietary and very expensive, with the exception of GNAT Ada, whose standard library is straight GPL which some would find unacceptable.

  26. OTOH, CompCert is the only C compiler that I’m aware of having been formally verified, on par with Ada ones, and that’s proprietary and expensive too.

  27. Hi!

    @regehr, related to – “…even if we were in a position where no new projects should be written in C (that day is coming, but slowly — probably at least a decade off)…”, which programming language do you support(suggest) as a replacement of C?

    Thank you!

  28. I wish all my firmware developers could have taken a course as you describe. Heck, I wish I had had a course like that. 30 years of embedded development in ‘C’ and I still get surprised by some things the compilers do. I would think with all the embedded software these days there would be a major in embedded development, not just a course or two.

  29. “I’m also leaving out build systems and version control. They should use these.”
    The problem is while students should use those, almost nobody actually teaches them how to use them.

    I’m not sure at what level one should teach people about real build systems (time is limited after all), but if all they ever see are their handwritten 5 line makefiles they’ll be unpleasantly surprised if they ever work on a larger project.

    I remember debugging/adapting HotSpot’s makefiles and I really wish I wouldn’t have had to learn all those details and idioms on my own.

  30. John, I think it would be fruitful to expose students to the recent work on nailing down C semantics from the Cambridge group, including the results of their survey on “C in Practice”: http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2013.pdf

    And also teach the emerging secure coding standard for C, the one that is being incorporated into the ISO C standard. It might be a TR or something. I don’t have a link, but I think it draws from the CERT standard, and means to bring that into the forefront, into the core C standard.

    I agree with others that C might not go away as soon as we’d like. This is because no one is seriously trying to make it go away. Rust is a fairly low key, low cost effort, and because its syntax isn’t informed by usability research, and seems to be very challenging, it probably won’t see wide adoption. (I know that it’s not normal for a PL to be informed by empirical research, much less usability research, but we should start setting higher standards, and such research would help a new language for sure.)

    We need a focused effort to build a C successor, or several of them. If I were the US government, I would have done this a decade ago. You remember how 2016 was The Future? In the 1980s, I doubt they would have predicted that we’d be using C in 2016. It’s sad that such a primitive language is still taught, and that we’ve allowed it to retain any performance advantage over alternatives. I will say that I lay some of the blame on academic CS and PL researchers for stalling progress in computing by promoting C and Unix so much. All the things Rob Pike said back in 2000 still apply (http://herpolhode.com/rob/utah2000.pdf) It’s long past time to move *forward*. I also think C and its offshoots have contributed to the huge gender imbalance in programming, which I’m going to do some research on this year — I think there are some distinctive things about men that make us willing to traffic in things like C. Were women in charge, I think they would have fixed this a long time ago.

  31. Perhaps “when the deal with a C compiler” should be “when the*y* deal with a C compiler”?