A Tourist’s Guide to the LLVM Source Code

In my Advanced Compilers course last fall we spent some time poking around in the LLVM source tree. A million lines of C++ is pretty daunting but I found this to be an interesting exercise and at least some of the students agreed, so I thought I’d try to write up something similar. We’ll be using LLVM 3.9, but the layout isn’t that different for previous (and probably subsequent) releases.

I don’t want to spend too much time on LLVM background but here are a few things to keep in mind:

The LLVM core doesn’t contain frontends, only the “middle end” optimizers, a pile of backends, documentation, and a lot of auxiliary code. Frontends such as Clang live in separate projects.
The core LLVM representation lives in RAM and is manipulated using a large C++ API. This representation can be dumped to readable text and parsed back into memory, but this is only a convenience for debugging: during a normal compilation using LLVM, textual IR is never generated. Typically, a frontend builds IR by calling into the LLVM APIs, then it runs some optimization passes, and finally it invokes a backend to generate assembly or machine code. When LLVM code is stored on disk (which doesn’t even happen during a normal compilation of a C or C++ project using Clang) it is stored as “bitcode,” a compact binary representation.
The main LLVM API documentation is generated by doxygen and can be found here. This information is very difficult to make use of unless you already have an idea of what you’re doing and what you’re looking for. The tutorials (linked below) are the place to start learning the LLVM APIs.
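To make the "lives in RAM, can be dumped to text and parsed back" point concrete, here is a toy sketch in Python. Nothing in it is LLVM API: the Instr class, the dump and parse functions, and the one-line-per-instruction text format are all made up for illustration, and real LLVM IR is far richer. The only idea it shows is that the in-memory objects are the real representation, and the text is a convenience that round-trips.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    """Hypothetical in-memory instruction (not an LLVM class)."""
    dest: str    # name of the value this instruction defines
    op: str      # operation name, e.g. "add"
    args: tuple  # operand names or integer literals, kept as strings


def dump(instrs):
    """Serialize the in-memory instruction list to readable text."""
    return "\n".join(f"{i.dest} = {i.op} {', '.join(i.args)}" for i in instrs)


def parse(text):
    """Rebuild the in-memory form from text produced by dump()."""
    instrs = []
    for line in text.splitlines():
        dest, rhs = line.split(" = ", 1)
        op, _, rest = rhs.partition(" ")
        args = tuple(a.strip() for a in rest.split(",")) if rest else ()
        instrs.append(Instr(dest, op, args))
    return instrs


# A "frontend" builds the program by constructing objects, not by writing text.
program = [
    Instr("%sum", "add", ("%a", "%b")),
    Instr("%twice", "mul", ("%sum", "2")),
]

text = dump(program)          # only needed when a human wants to look
round_tripped = parse(text)   # text back into objects
```

In this sketch, as in the description above, nothing forces the text step to happen at all: a pass could work on program directly.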

So now on to the code. Here’s the root directory, which contains:

bindings that permit LLVM APIs to be used from programming languages other than C++. There exist more bindings than this, including C (which we’ll get to a bit later) and Haskell (out of tree).
cmake: LLVM uses CMake rather than autoconf now. Just be glad someone besides you works on this.
docs in ReStructuredText. See for example the Language Reference Manual that defines the meaning of each LLVM instruction (GitHub renders .rst files to HTML by default; you can look at the raw file here.) The material in the tutorial subdirectory is particularly interesting, but don’t look at it there, rather go here. This is the best way to learn LLVM!
examples: This is the source code that goes along with the tutorials. As an LLVM hacker you should grab code, CMakeLists.txt, etc. from here whenever possible.
include: The first subdirectory, llvm-c, contains the C bindings, which I haven’t used but look pretty reasonable. Importantly, the LLVM folks try to keep these bindings stable, whereas the C++ APIs are prone to change across releases, though the pace of change seems to have slowed down in the last few years. The second subdirectory, llvm, is a biggie: it contains 878 header files that define all of the LLVM APIs. In general it’s easier to use the doxygen versions of these files rather than reading them directly, but I often end up grepping these files to find some piece of functionality.
lib contains the real goodies; we’ll look at it separately below.
projects doesn’t contain anything by default but it’s where you check out LLVM components such as compiler-rt (runtime library for things like sanitizers), OpenMP support, and the LLVM C++ library that live in separate repos.
resources: something for Visual C++ that you and I don’t care about (but see here).
runtimes: another placeholder for external projects, added only last summer; I don’t know what actually goes here.
test: this is a biggie. It contains many thousands of unit tests for LLVM, which get run when you build the check target. Most of these are .ll files containing the textual version of LLVM IR. They test things like an optimization pass having the expected result. I’ll be covering LLVM’s tests in detail in an upcoming blog post.
tools: LLVM itself is just a collection of libraries; there isn’t any particular main function. Most of the subdirectories of the tools directory contain an executable tool that links against the LLVM libraries. For example, llvm-dis is a disassembler from bitcode to the textual assembly format.
unittests: More unit tests, also run by the check build target. These are C++ files that use the Google Test framework to invoke APIs directly, as opposed to the contents of the “test” directory, which indirectly invoke LLVM functionality by running things like the assembler, disassembler, or optimizer.
utils: emacs and vim modes for enforcing LLVM coding conventions; a Valgrind suppression file to eliminate false positives when running make check in such a way that all sub-processes are monitored by Valgrind; the lit and FileCheck tools that support unit testing; and plenty of other random stuff. You probably don’t care about most of this.
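The .ll tests in the test directory embed their expected results as comments that FileCheck compares against a tool's output. Here is a much-simplified Python stand-in for that idea, assuming only the basic behavior that each CHECK pattern must appear in the output, in order. The function names check_patterns and output_matches are hypothetical, the two "outputs" are hand-written strings rather than anything produced by a real tool, and the real FileCheck supports much more (regexes, other directive kinds, and so on).

```python
def check_patterns(test_source):
    """Collect the text after each '; CHECK:' marker, in file order."""
    marker = "; CHECK:"
    return [line.split(marker, 1)[1].strip()
            for line in test_source.splitlines() if marker in line]


def output_matches(test_source, tool_output):
    """True if every pattern occurs in tool_output, each after the previous one."""
    pos = 0
    for pattern in check_patterns(test_source):
        found = tool_output.find(pattern, pos)
        if found == -1:
            return False
        pos = found + len(pattern)  # later patterns must match further along
    return True


# A toy test file: the expectations live in comments next to the input IR.
TEST = """\
; CHECK: define i32 @f
; CHECK: ret i32 3
define i32 @f() {
  %x = add i32 1, 2
  ret i32 %x
}
"""

# Hand-written stand-ins for what an optimizer might or might not print.
optimized = "define i32 @f() {\n  ret i32 3\n}\n"
unoptimized = "define i32 @f() {\n  %x = add i32 1, 2\n  ret i32 %x\n}\n"

passes = output_matches(TEST, optimized)
fails = output_matches(TEST, unoptimized)
```

The appeal of this style is that the input and the expected result sit in one file, so a test for an optimization is just a small piece of IR plus a few comment lines.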

Ok, that was pretty easy! The only thing we skipped over is the “lib” directory, which contains basically everything important. Let’s look at its subdirectories now:

Analysis contains a lot of static analyses that you would read about in a compiler textbook, such as alias analysis and loop detection. Some analyses are structured as LLVM passes that must be run by the pass manager; others are structured as libraries that can be called directly. An odd member of the analysis family is InstructionSimplify.cpp, which is a transformation, not an analysis; I’m sure someone can leave a comment explaining what it is doing here (see this comment). I’ll do a deep dive into this directory in a followup post.
AsmParser: parse textual IR into memory
Bitcode: serialize IR into the compact format and read it back into RAM
CodeGen: the LLVM target-independent code generator, basically a framework that LLVM backends fit into and also a bunch of library functions that backends can use. There’s a lot going on here (>100 KLOC) and unfortunately I don’t know very much about it.
DebugInfo is a library for maintaining mappings between LLVM instructions and source code locations. There’s a lot of good info in these slides from a talk at the 2014 LLVM Developers’ Meeting.
ExecutionEngine: Although LLVM is usually translated into assembly code or machine code, it can be directly executed using an interpreter. The non-jitting interpreter wasn’t quite working the last time I tried to use it, but anyhow it’s a lot slower than running jitted code. The latest JIT API, Orc, is in here.
Fuzzer: this is libFuzzer, a coverage-guided fuzzer similar to AFL. It doesn’t fuzz LLVM components, but rather uses LLVM functionality in order to perform fuzzing of programs that are compiled using LLVM.
IR: sort of a grab-bag of IR-related code, with no other obvious unifying theme. There’s code for dumping IR to the textual format, for upgrading bitcode files created by earlier versions of LLVM, for folding constants as IR nodes are created, etc.
IRReader, LibDriver, LineEditor: almost nobody will care about these and they contain hardly any code anyway.
Linker: An LLVM module, like a compilation unit in C or C++, contains functions and variables. The LLVM linker combines multiple modules into a single, larger module.
LTO: Link-time optimization, the subject of many blog posts and PhD theses, permits compiler optimizations to see through boundaries created by separate compilation. LLVM can do link-time optimization “for free” by using its linker to create a large module and then optimize this using the regular optimization passes. This used to be the preferred approach, but it doesn’t scale to huge projects. The current approach is ThinLTO, which gets most of the benefit at a small fraction of the cost.
MC: compilers usually emit assembly code and let an assembler deal with creating machine code. The MC subsystem in LLVM cuts out the middleman and generates machine code directly. This speeds up compiles and is especially useful when LLVM is used as a JIT compiler.
Object: Deals with details of object file formats such as ELF.
ObjectYAML seems to support encoding object files as YAML. I do not know why this is desirable.
Option: Command line parsing
Passes: part of the pass manager, which schedules and sequences LLVM passes, taking their dependencies and invalidations into account.
ProfileData: Read and write profile data to support profile-guided optimizations
Support: Miscellaneous support code including APInts (arbitrary-precision integers that are used pervasively in LLVM) and much else.
TableGen: A wacky Swiss-army knife of a tool that inputs .td files (of which there are more than 200 in LLVM) containing structured data and uses a domain-specific backend to emit C++ code that gets compiled into LLVM. TableGen is used, for example, to take some of the tedium out of implementing assemblers and disassemblers.
Target: the processor-specific parts of the backends live here. There are lots of TableGen files. As far as I can tell, you create a new LLVM backend by cloning the one for the architecture that looks the most like yours and then beating on it for a couple of years.
Transforms: this is my favorite directory; it’s where the middle-end optimizations live. IPO contains interprocedural optimizations that work across function boundaries; they are typically not too aggressive since they have to look at a lot of code. InstCombine is LLVM’s beast of a peephole optimizer. Instrumentation supports sanitizers. ObjCARC supports Objective-C’s automatic reference counting. Scalar contains a pile of compiler-textbooky kinds of optimizers; I’ll try to write a more detailed post about the contents of this directory at some point. Utils contains helper code. Vectorize is LLVM’s auto-vectorizer, the subject of much work in recent years.
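Going back to the InstructionSimplify oddity under Analysis: the rule given in the comments on this post is that the simplifier may only return constants or values that already exist in the program, and it doesn't mutate the IR itself. Here is a toy Python sketch of what a function obeying that rule could look like. Const, Arg, BinOp, and simplify are hypothetical names, not LLVM's classes or API, and the arithmetic uses unbounded Python integers rather than LLVM's fixed-width ones.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Const:
    value: int


@dataclass(frozen=True)
class Arg:
    name: str


@dataclass(frozen=True)
class BinOp:
    op: str      # "add", "sub", or "mul" in this toy
    lhs: object
    rhs: object


def simplify(inst):
    """Return a constant or an already-existing value equal to inst, else None.

    It never builds a new BinOp and never edits inst, which is what lets it
    count as an analysis in this toy model.
    """
    if not isinstance(inst, BinOp):
        return None
    lhs, rhs = inst.lhs, inst.rhs
    if isinstance(lhs, Const) and isinstance(rhs, Const):
        # Both operands known: the answer is a constant.
        if inst.op == "add":
            return Const(lhs.value + rhs.value)
        if inst.op == "sub":
            return Const(lhs.value - rhs.value)
        if inst.op == "mul":
            return Const(lhs.value * rhs.value)
    if inst.op == "add" and rhs == Const(0):
        return lhs           # x + 0 is just x, an existing value
    if inst.op == "mul" and rhs == Const(1):
        return lhs           # x * 1 is just x
    if inst.op == "mul" and rhs == Const(0):
        return Const(0)      # x * 0 is the constant 0
    if inst.op == "sub" and lhs == rhs:
        return Const(0)      # x - x is the constant 0
    return None              # anything else would need new instructions


x = Arg("x")
same_value = simplify(BinOp("add", x, Const(0)))      # the existing value x
folded = simplify(BinOp("mul", Const(6), Const(7)))   # a constant
no_answer = simplify(BinOp("add", x, Arg("y")))       # None: nothing simpler exists
```

Rewriting x * 2 into a shift, by contrast, would require creating a new instruction, so in this model it belongs to a transformation (that is InstCombine's kind of job), and a separate transformation pass is what actually replaces instructions with whatever simplify returns.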

And that’s all for the high-level tour, hope it was useful and as always let me know what I’ve got wrong or left out.

January 5, 2017

regehr

Compilers, Computer Science, Education

10 responses to “A Tourist’s Guide to the LLVM Source Code”

Mathieu Stumpf Guntz says:

January 6, 2017 at 2:29 am

Hi, would you allow me to translate your article to French, preferably under a free license like CC-by-sa?
Ted Mielczarek says:

January 6, 2017 at 2:54 am

I don’t know for certain, but I highly suspect that ObjectYAML is used for the `tbd` files that Apple ships in newer OS X/iOS SDKs in place of dylibs. There seems to be precisely zero official documentation about this, but see: http://stackoverflow.com/a/32115656 .
regehr says:

January 6, 2017 at 7:23 am

Ted, thanks!

Mathieu, certainly.
Al Peterson says:

January 6, 2017 at 8:40 am

Prof. Regehr, you rock! Thank you for this informative article! 🙂
Nick Lewycky says:

January 6, 2017 at 9:02 am

“An odd member of the analysis family is InstructionSimplify.cpp, which is a transformation, not an analysis; I’m sure someone can leave a comment explaining what it is doing here.”

It doesn’t mutate the IR itself. The rule for llvm::SimplifyInstruction is that it may only return constants or existing Values in the program, which meets the requirements for an Analysis. The pass which calls SimplifyInstruction on every instruction is a transformation pass under lib/Transforms/Utils/SimplifyInstructions.cpp.
Peter Goodman says:

January 6, 2017 at 9:26 am

This is a great website for browsing the LLVM source code, as well as things like the Linux kernel: https://code.woboq.org/llvm/llvm/
Stijn says:

January 6, 2017 at 2:03 pm

Where it says “resources: something for Visual C++ that you and I don’t care about.”, I would rather have liked it said “resources: the version resource definition used for Windows binaries”.
regehr says:

January 6, 2017 at 5:58 pm

Aha, thanks Nick!
Quentin Carbonneaux says:

January 7, 2017 at 1:16 pm

Hi John, thanks for that great overview! This is my first comment but I’ve been enjoying your work and blog for many months now.

I hop in because you mention that you use LLVM in a compiler class, it made me think that I should introduce my own work. For more than a year I have been developing (in parallel with my PhD work) a very small compiler backend in C [1]. My goal was always educational (i.e. learn about ssa and more generally, state of the art compiler techniques), and it has been a very rich experience.

As it is now, the compiler might not be suitable to study many advanced compiler techniques. However, I found it very convenient for experimentation, and it showed me many times what it takes to go from a research paper to a practical artifact.

I would love to see this backend used in a compiler class. For example, LLVM could still be used as a model of serious industrial implementation, but for student projects my plainer C backend could lower the barrier of entry significantly and let them focus on the actual compiler techniques.

For a concrete example of how little code a complex optimization pass might require, you can take a look at how copy elimination is implemented [2]. As it is, it works on programs with cycles of copies and phi nodes.

I look forward to hearing your thinking on this! If you have any suggestions on what could make it more fit for education, I’d be happy to incorporate them.

[1] http://c9x.me/compile/
[2] http://c9x.me/git/?p=qbe.git;a=blob;f=copy.c;hb=HEAD

PS: You might worry about the generation of the IL, but this will not be a problem for long as two projects close to an early release are using my backend for machine code generation.
regehr says:

January 7, 2017 at 5:28 pm

Quentin, QBE looks very cool! It is certainly something that I might use in a class. Along those same lines, I used Xv6 in a small Advanced OS class a few years ago and it worked really well. There’s something nice about understanding most of a small system instead of a tiny part of a huge system.