In my Advanced Compilers course last fall we spent some time poking around in the LLVM source tree. A million lines of C++ is pretty daunting but I found this to be an interesting exercise and at least some of the students agreed, so I thought I’d try to write up something similar. We’ll be using LLVM 3.9, but the layout isn’t that different for previous (and probably subsequent) releases.
I don’t want to spend too much time on LLVM background but here are a few things to keep in mind:
- The LLVM core doesn’t contain frontends, only the “middle end” optimizers, a pile of backends, documentation, and a lot of auxiliary code. Frontends such as Clang live in separate projects.
- The core LLVM representation lives in RAM and is manipulated using a large C++ API. This representation can be dumped to readable text and parsed back into memory, but this is only a convenience for debugging: during a normal compilation using LLVM, textual IR is never generated. Typically, a frontend builds IR by calling into the LLVM APIs, then it runs some optimization passes, and finally it invokes a backend to generate assembly or machine code. When LLVM code is stored on disk (which doesn’t even happen during a normal compilation of a C or C++ project using Clang) it is stored as “bitcode,” a compact binary representation.
- The main LLVM API documentation is generated by doxygen and can be found here. This information is very difficult to make use of unless you already have an idea of what you’re doing and what you’re looking for. The tutorials (linked below) are the place to start learning the LLVM APIs.
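To make the "in-memory IR that can be dumped to text and parsed back" idea concrete, here is a deliberately tiny toy model in Python. Everything in it (`Instruction`, `Function`, `Builder`, `dump`, `parse`) is a hypothetical name invented for this sketch; none of it is the LLVM API, and the text format only loosely imitates real LLVM IR.

```python
# Toy model only: these classes are hypothetical and are NOT the LLVM API.
from dataclasses import dataclass, field


@dataclass
class Instruction:
    result: str     # name of the value this instruction defines, e.g. "%t"
    opcode: str     # e.g. "add"
    ty: str         # e.g. "i32"
    operands: list  # operand names or literals, kept as strings


@dataclass
class Function:
    name: str
    body: list = field(default_factory=list)


class Builder:
    """Appends instructions to a function, the way a frontend would."""

    def __init__(self, function):
        self.function = function

    def emit(self, result, opcode, ty, *operands):
        self.function.body.append(Instruction(result, opcode, ty, list(operands)))
        return result  # hand the name back so it can feed later instructions


def dump(function):
    """Serialize the in-memory form to readable text."""
    lines = [f"define @{function.name} {{"]
    for inst in function.body:
        ops = ", ".join(inst.operands)
        lines.append(f"  {inst.result} = {inst.opcode} {inst.ty} {ops}")
    lines.append("}")
    return "\n".join(lines)


def parse(text):
    """Rebuild the in-memory form from text produced by dump()."""
    lines = text.splitlines()
    function = Function(lines[0].split()[1].lstrip("@"))
    for line in lines[1:-1]:
        result, rest = line.strip().split(" = ", 1)
        opcode, ty, ops = rest.split(" ", 2)
        function.body.append(Instruction(result, opcode, ty, ops.split(", ")))
    return function


# A "frontend" builds the function through the API; no text is involved.
f = Function("f")
b = Builder(f)
t = b.emit("%t", "add", "i32", "%a", "%b")
b.emit("%r", "mul", "i32", t, "2")

# Text only shows up when we ask for it, e.g. for debugging.
text = dump(f)
print(text)
```

The point of the sketch is just that the text is a view of the data structure, not the data structure itself: `parse(dump(f))` gives back an equal object, and a compiler that never calls `dump` never produces any text.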
So now on to the code. Here’s the root directory. It contains:
- bindings that permit LLVM APIs to be used from programming languages other than C++. There exist more bindings than this, including C (which we’ll get to a bit later) and Haskell (out of tree).
- cmake: LLVM uses CMake rather than autoconf now. Just be glad someone besides you works on this.
- docs: documentation in reStructuredText. See for example the Language Reference Manual that defines the meaning of each LLVM instruction (GitHub renders .rst files to HTML by default; you can look at the raw file here). The material in the tutorial subdirectory is particularly interesting, but don’t read it there; rather, go here. This is the best way to learn LLVM!
- examples: This is the source code that goes along with the tutorials. As an LLVM hacker you should grab code, CMakeLists.txt, etc. from here whenever possible.
- include: The first subdirectory, llvm-c, contains the C bindings, which I haven’t used but look pretty reasonable. Importantly, the LLVM folks try to keep these bindings stable, whereas the C++ APIs are prone to change across releases, though the pace of change seems to have slowed down in the last few years. The second subdirectory, llvm, is a biggie: it contains 878 header files that define all of the LLVM APIs. In general it’s easier to use the doxygen versions of these files rather than reading them directly, but I often end up grepping these files to find some piece of functionality.
- lib contains the real goodies; we’ll look at it separately below.
- projects doesn’t contain anything by default, but it’s where you check out LLVM components that live in separate repos, such as compiler-rt (runtime library for things like sanitizers), OpenMP support, and the LLVM C++ standard library (libc++).
- resources: something for Visual C++ that you and I don’t care about (but see here).
- runtimes: another placeholder for external projects, added only last summer; I don’t know what actually goes here.
- test: this is a biggie. It contains many thousands of unit tests for LLVM, which get run when you build the check target. Most of these are .ll files containing the textual version of LLVM IR. They test things like an optimization pass having the expected result. I’ll be covering LLVM’s tests in detail in an upcoming blog post.
- tools: LLVM itself is just a collection of libraries; there isn’t any particular main function. Most of the subdirectories of the tools directory contain an executable tool that links against the LLVM libraries. For example, llvm-dis is a disassembler from bitcode to the textual assembly format.
- unittests: More unit tests, also run by the check build target. These are C++ files that use the Google Test framework to invoke APIs directly, as opposed to the contents of the “test” directory, which indirectly invoke LLVM functionality by running things like the assembler, disassembler, or optimizer.
- utils: emacs and vim modes for enforcing LLVM coding conventions; a Valgrind suppression file to eliminate false positives when running make check in such a way that all sub-processes are monitored by Valgrind; the lit and FileCheck tools that support unit testing; and plenty of other random stuff. You probably don’t care about most of this.
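Since the .ll tests and FileCheck came up above, here is a rough Python sketch of the core idea behind a FileCheck-style test: the expected output is written as `; CHECK:` comments inside the test file, and the checker verifies that those patterns show up, in order, in a tool's output. This is a simplification I made up for illustration. The real FileCheck supports regexes, `CHECK-NEXT`, and much more, and can match several patterns on one line. The function names `check_lines` and `file_check` are hypothetical, and the two "tool outputs" are hard-coded strings rather than the result of actually running `opt`.

```python
def check_lines(test_source, prefix="; CHECK:"):
    """Collect the expected patterns embedded in the test file as comments."""
    return [line.split(prefix, 1)[1].strip()
            for line in test_source.splitlines() if prefix in line]


def file_check(test_source, tool_output):
    """Return True if every pattern appears, in order, on successive lines.

    Simplification: plain substring matching, and each pattern must match
    on a later line than the previous one.
    """
    out_lines = tool_output.splitlines()
    pos = 0
    for pattern in check_lines(test_source):
        # Scan forward until some line contains the pattern.
        while pos < len(out_lines) and pattern not in out_lines[pos]:
            pos += 1
        if pos == len(out_lines):
            return False  # ran out of output before finding the pattern
        pos += 1
    return True


# A test in the style of the .ll files: adding zero should be optimized away.
TEST_FILE = """\
; RUN: opt -instcombine -S < %s | FileCheck %s
define i32 @f(i32 %x) {
; CHECK: define i32 @f
; CHECK: ret i32 %x
  %y = add i32 %x, 0
  ret i32 %y
}
"""

# Hypothetical outputs, hard-coded here instead of running the optimizer.
OPTIMIZED = """\
define i32 @f(i32 %x) {
  ret i32 %x
}
"""

UNOPTIMIZED = """\
define i32 @f(i32 %x) {
  %y = add i32 %x, 0
  ret i32 %y
}
"""

print(file_check(TEST_FILE, OPTIMIZED))
print(file_check(TEST_FILE, UNOPTIMIZED))
```

Keeping the expectations next to the input, in the same file, is what makes it practical to have thousands of these tests: each one is a single small text file.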
Ok, that was pretty easy! The only thing we skipped over is the “lib” directory, which contains basically everything important. Let’s look at its subdirectories now:
- Analysis contains a lot of static analyses that you would read about in a compiler textbook, such as alias analysis and global value numbering. Some analyses are structured as LLVM passes that must be run by the pass manager; others are structured as libraries that can be called directly. An odd member of the analysis family is InstructionSimplify.cpp, which is a transformation, not an analysis; I’m sure someone can leave a comment explaining what it is doing here (see this comment). I’ll do a deep dive into this directory in a followup post.
- AsmParser: parse textual IR into memory
- Bitcode: serialize IR into the compact format and read it back into RAM
- CodeGen: the LLVM target-independent code generator, basically a framework that LLVM backends fit into and also a bunch of library functions that backends can use. There’s a lot going on here (>100 KLOC) and unfortunately I don’t know very much about it.
- DebugInfo is a library for maintaining mappings between LLVM instructions and source code locations. There’s a lot of good info in these slides from a talk at the 2014 LLVM Developers’ Meeting.
- ExecutionEngine: Although LLVM IR is usually translated into assembly code or machine code, it can be directly executed using an interpreter. The non-jitting interpreter wasn’t quite working the last time I tried to use it, but anyhow it’s a lot slower than running jitted code. The latest JIT API, Orc, is in here.
- Fuzzer: this is libFuzzer, a coverage-guided fuzzer similar to AFL. It doesn’t fuzz LLVM components, but rather uses LLVM functionality in order to perform fuzzing of programs that are compiled using LLVM.
- IR: sort of a grab-bag of IR-related code, with no other obvious unifying theme. There’s code for dumping IR to the textual format, for upgrading bitcode files created by earlier versions of LLVM, for folding constants as IR nodes are created, etc.
- IRReader, LibDriver, LineEditor: almost nobody will care about these and they contain hardly any code anyway.
- Linker: An LLVM module, like a compilation unit in C or C++, contains functions and variables. The LLVM linker combines multiple modules into a single, larger module.
- LTO: Link-time optimization, the subject of many blog posts and PhD theses, permits compiler optimizations to see through boundaries created by separate compilation. LLVM can do link-time optimization “for free” by using its linker to create a large module and then optimize this using the regular optimization passes. This used to be the preferred approach, but it doesn’t scale to huge projects. The current approach is ThinLTO, which gets most of the benefit at a small fraction of the cost.
- MC: compilers usually emit assembly code and let an assembler deal with creating machine code. The MC subsystem in LLVM cuts out the middleman and generates machine code directly. This speeds up compiles and is especially useful when LLVM is used as a JIT compiler.
- Object: Deals with details of object file formats such as ELF.
- ObjectYAML seems to support encoding object files as YAML. I do not know why this is desirable.
- Option: Command line parsing
- Passes: part of the pass manager, which schedules and sequences LLVM passes, taking their dependencies and invalidations into account.
- ProfileData: Read and write profile data to support profile-guided optimizations
- Support: Miscellaneous support code including APInts (arbitrary-precision integers that are used pervasively in LLVM) and much else.
- TableGen: A wacky Swiss-army knife of a tool that inputs .td files (of which there are more than 200 in LLVM) containing structured data and uses a domain-specific backend to emit C++ code that gets compiled into LLVM. TableGen is used, for example, to take some of the tedium out of implementing assemblers and disassemblers.
- Target: the processor-specific parts of the backends live here. There are lots of TableGen files. As far as I can tell, you create a new LLVM backend by cloning the one for the architecture that looks the most like yours and then beating on it for a couple of years.
- Transforms: this is my favorite directory; it’s where the middle-end optimizations live. IPO contains interprocedural optimizations that work across function boundaries; they are typically not too aggressive since they have to look at a lot of code. InstCombine is LLVM’s beast of a peephole optimizer. Instrumentation supports sanitizers. ObjCARC supports Objective-C’s Automatic Reference Counting (ARC). Scalar contains a pile of compiler-textbooky kinds of optimizers; I’ll try to write a more detailed post about the contents of this directory at some point. Utils contains helper code. Vectorize is LLVM’s auto-vectorizer, the subject of much work in recent years.
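If "peephole optimizer" is unfamiliar, the following toy Python sketch shows the general technique: look at one instruction at a time and rewrite it into something cheaper when a local pattern matches. It is emphatically not how InstCombine is implemented. The tuple representation, the `peephole` function, and the three rewrite rules are assumptions chosen for illustration, and the real thing has thousands of patterns and works on LLVM's actual IR data structures.

```python
# Toy peephole optimizer over a hypothetical instruction format.
# An instruction is a tuple (result, opcode, lhs, rhs); operands are
# either value names (str) or integer constants (int).

def peephole(instructions):
    replaced = {}  # value name -> simpler value that stands in for it
    out = []
    for result, opcode, lhs, rhs in instructions:
        # Operands may refer to values that earlier rewrites eliminated.
        lhs = replaced.get(lhs, lhs)
        rhs = replaced.get(rhs, rhs)
        if opcode == "add" and rhs == 0:
            replaced[result] = lhs               # x + 0  ==>  x
        elif opcode == "mul" and rhs == 1:
            replaced[result] = lhs               # x * 1  ==>  x
        elif opcode == "mul" and rhs == 2:
            out.append((result, "shl", lhs, 1))  # x * 2  ==>  x << 1
        else:
            out.append((result, opcode, lhs, rhs))
    return out, replaced


code = [
    ("%t0", "add", "%x", 0),
    ("%t1", "mul", "%t0", 2),
    ("%t2", "sub", "%t1", "%y"),
]
optimized, replaced = peephole(code)
for inst in optimized:
    print(inst)
```

In this example the add of zero disappears entirely, and the multiply that used its result is rewritten to shift `%x` directly. Each rule only looks at a tiny window of code, which is why this kind of optimizer can afford to have a huge number of rules.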
And that’s all for the high-level tour. I hope it was useful, and as always, let me know what I’ve got wrong or left out.