Discovering New Instructions


Sometimes I wonder what instruction sets are supposed to look like. That is, what instructions would there be if computers were redesigned by smart people who understood our fabrication capabilities and who knew what we wanted to accomplish using computers, but who didn’t care about backwards compatibility and who hadn’t seen our architectures? We can get little glimpses into that world by looking at network processors, DSPs, GPUs, ISA extensions like SSE4 and NEON, extensible processors like Tensilica’s, and others. But still, these are too rooted in the past.

Although the machine code emitted by our compilers is inextricably tied to our processors, perhaps this code can still be leveraged to discover new instructions. As a thought experiment, let’s start with a collection of executables whose performance we care about. Preferably, some of these will have been compiled from programs in Haskell, OCaml, and other languages that are not well-represented in today’s benchmark suites. We’ll run these programs in a heavily instrumented execution environment that creates a dynamic dataflow graph for the computation; the excellent Redux paper shows how this can be done. Next, we’ll need to clean up the dataflow graphs. First, we rewrite processor-specific operations (condition code dependencies, CISC instructions, etc.) into a simpler, portable form. Next, we optimize away as much dead, redundant, and vacuous code as possible, including, hopefully, all irrelevancies such as stack frame manipulation, dynamic linking, and garbage collection. The result — perhaps — will be something beautiful: the essence of the original computation, stripped of all sequential constraints, processor-specific operations, and bookkeeping. Of course this dataflow graph has some limitations. First, it only encodes the meaning of a single execution of the program. Second, it encodes a lot of incidental facts about the computation, such as the bitwidths of all integer operations, the specific hashing methods used, etc. We’ll just have to live with these problems. The Redux paper contains a great example where factorial codes written in C, in Haskell, and in a stack machine are shown to all produce basically the same dataflow graph.
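
To make the instrumentation step a bit more concrete, here is a minimal sketch, in Python, of how a linear execution trace can be turned into a dynamic dataflow graph. The toy trace format and the names in it are invented for illustration; a real tool would work from Redux-style instrumentation of actual machine code.

    from collections import namedtuple

    # Hypothetical trace record: one entry per executed operation, where 'srcs'
    # lists the indices of the earlier trace entries whose results it consumed.
    TraceOp = namedtuple("TraceOp", "op srcs")

    def build_dataflow_graph(trace):
        # Each executed operation becomes a node; edges run from the producers
        # of its operands to the operation itself. The sequential order of the
        # trace is discarded -- only value flow is kept.
        nodes = [entry.op for entry in trace]
        edges = [(src, i) for i, entry in enumerate(trace) for src in entry.srcs]
        return nodes, edges

    # e.g. the trace of "t2 = t0 + t1; t3 = t2 * t2"
    trace = [TraceOp("load", []), TraceOp("load", []),
             TraceOp("add", [0, 1]), TraceOp("mul", [2, 2])]
    print(build_dataflow_graph(trace))
    # (['load', 'load', 'add', 'mul'], [(0, 2), (1, 2), (2, 3), (2, 3)])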

So now we have a collection of giant dataflow graphs: one for each execution of each program that we’re interested in. Our goal is to design an instruction set that can compute these dataflow graphs. Trivially, this can be done by partitioning the graphs into very small units of computation corresponding to a RISC instruction set. But that ISA is boring and won’t show any performance wins. To do better we’ll use a search-based strategy to find subgraphs that:

  • occur a large number of times across the collection of dataflow graphs — these are operations that are executed frequently by our workloads
  • contain a lot of parallelism — making them good candidates for hardware acceleration
  • contain a lot of internal symmetry — supporting SIMD-style execution
  • have a small number of dynamic inputs and outputs
  • rely on a small number of constants
  • do not contain dependency chains that are too long — we don’t want to create instructions that are too slow

I think this can be done; none of these properties seems particularly difficult to test for. The major problem necessitating cleverness will be the huge size of the dataflow graphs. We’ll end up with a list of candidate instructions ranked by some fitness function, such as performance or code density. We can build as many of these into our new ISA as we have hardware budget for.
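
To make the ranking step concrete, here is one possible fitness function, again only a sketch in Python: the candidate representation, the port-budget thresholds, and the scoring formula are assumptions invented for illustration rather than a tested heuristic.

    def score_candidate(subgraph, occurrences, max_inputs=3, max_outputs=1):
        # 'subgraph' maps each node in the candidate to the list of nodes it
        # reads from; 'occurrences' counts how often this shape was matched
        # across the collection of dataflow graphs.
        nodes = set(subgraph)
        inputs = {p for preds in subgraph.values() for p in preds if p not in nodes}
        consumed_inside = {p for preds in subgraph.values() for p in preds}
        outputs = nodes - consumed_inside   # crude: values that escape the candidate

        # Longest internal dependency chain, a stand-in for instruction latency.
        memo = {}
        def depth(n):
            if n not in memo:
                memo[n] = 1 + max((depth(p) for p in subgraph[n] if p in nodes),
                                  default=0)
            return memo[n]
        critical_path = max(depth(n) for n in nodes)

        if len(inputs) > max_inputs or len(outputs) > max_outputs:
            return 0.0   # too many register ports to be a plausible instruction
        # Frequent, wide-and-shallow candidates score highest.
        return occurrences * len(nodes) / critical_path

    # A fused multiply-add matched 10,000 times: three inputs, one output.
    print(score_candidate({"mul": ["a", "b"], "add": ["mul", "c"]}, 10000))

A fitness function of this shape rewards exactly the properties listed above: frequency, parallelism (wide and shallow beats long and thin), and small operand counts; symmetry and constant counts could be folded in the same way.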

Would this method discover saturating arithmetic instructions when applied to signal processing codes? Would it find clever ways to optimize bounds checks and exception handling in high-level programming languages? It’s possible (though I’d be super disappointed) that the new machines would just be incidental variations on existing RISC and CISC designs. If this happened, I would suspect that we had failed to abstract away a sufficient number of processor artifacts. Or, perhaps it was a mistake to compile our computations to an existing machine architecture before building the dataflow graphs. Rather, perhaps we should start with a source-level program and its operational semantics, unrolling it into a dataflow graph without any compilation to machine code. This avoids ties to our existing processors, but also risks coming up with graphs that are very hard to map back onto actual hardware. Of course, many languages don’t even have a real semantics, but researchers are diligently working on that sort of thing. An even more difficult option would be to build up the functional representation of a source program (or executable) without running it, but this has the disadvantage of losing the “profile data” that is built into a dynamic dataflow graph — we’d need to add that in separately.
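
As a toy illustration of the source-level alternative, here is a sketch in Python that unrolls the evaluation of a tiny invented expression language directly into a dataflow graph, with no machine code anywhere; a real version would work from a formal operational semantics.

    # Expressions: ('const', c), ('var', name), ('add', e1, e2), ('mul', e1, e2).
    def unroll(expr, env, graph):
        # Evaluate 'expr' while appending one graph node per evaluation step;
        # returns (value, node id) so callers can wire up their own node.
        kind = expr[0]
        if kind == "const":
            graph.append(("const", []))
            return expr[1], len(graph) - 1
        if kind == "var":
            graph.append(("input", []))
            return env[expr[1]], len(graph) - 1
        va, na = unroll(expr[1], env, graph)
        vb, nb = unroll(expr[2], env, graph)
        graph.append((kind, [na, nb]))
        return (va + vb if kind == "add" else va * vb), len(graph) - 1

    g = []
    unroll(("mul", ("add", ("var", "x"), ("const", 1)),
                   ("add", ("var", "x"), ("const", 1))), {"x": 3}, g)
    # g: [('input', []), ('const', []), ('add', [0, 1]),
    #     ('input', []), ('const', []), ('add', [3, 4]), ('mul', [2, 5])]

Note that, just as with the machine-level graphs, the result records facts about one particular execution; nothing about it, however, is tied to an existing ISA.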

An aspect of this exercise that I find interesting is that it gives insight into what our processors really do. Many years ago (I don’t have a reference handy, unfortunately) a study showed that computers spend most of their time in sorting algorithms. That cannot be true anymore — but what does the average mobile phone do? What about the average supercomputer chip? The average machine in Google or Amazon’s cloud? Of course we know the answers at a high level, but are there surprises lurking inside the computations? I would expect so — it’s tough to take a close look at a complicated system and not learn something new. Are there new instructions out there, waiting to be discovered, that can help these workloads execute more efficiently? I have to think so, at least for workloads that are not well-represented in Intel’s, AMD’s, and ARM’s benchmark suites.


17 responses to “Discovering New Instructions”

  1. I’ve heard that a very large fraction of supercomputer time is spent in LAPACK. Not surprising, and I suspect there are some very detailed breakdowns of exactly how much time is spent where for that class of program.

  2. Good post, and I agree.
    I would add that the ISA depends strongly on the micro-architectural assumptions of the computer in question.
    For example, is indirect memory access supported or not?

    “Understanding Sources of Inefficiency in General-Purpose Chips” ( http://www-vlsi.stanford.edu/papers/rh_isca_10.pdf ) does some of what you describe, albeit from an ECE point of view, and it was done in collaboration with Tensilica.

    Funny you should ask this question, as my supervisor and I have a paper coming out at FPGA 2012 describing a soft processor (Octavo) whose ISA was built up from scratch. The result can recapitulate MIPS-ish ISAs, but without using immediate literals or indirect memory access. We’re still figuring out all the consequences.

  3. You’ve pretty much completely described my fourth-year engineering project. However, since we only had four months to work on it, I didn’t get to be as fancy as I would have liked. We did, however, find a new instruction with a performance benefit.

    I’m not 100% sure I’m remembering correctly, but the new instruction involved loading a value from memory and adding an immediate value to it. This was on ARM7, and gave about a 10% speed increase for an edge detector program. (We wanted to do a video codec, but the simulator we used was far too slow. It took over 30 minutes to decode 5 seconds of video!)

  4. Eric, thanks for the reference. Tensilica’s stuff is definitely cool. I’d appreciate a copy of your paper, when you get a chance.

    Re. indirect memory access, I was thinking it would be best to first synthesize register-only instructions. Maybe some manual tweaking would be needed to figure out addressing mode concerns.

    Ryan, cool!

    bcs, yeah, it seems likely that there exist workloads that have been studied so hard that we’re not going to get mileage out of them. On the other hand, notice that the dynamic dataflow graph approach may shine a light into some previously unseen corners, since it is not constrained by things like how close together dependent instructions appear in the instruction stream. So it may find interesting interprocedural inefficiencies.

  5. Daniel, I kept up a bit on the TRIPS project while it was going on. I wish there were a nice retrospective on the project — I didn’t see one in the list of publications.

  6. I am sorry to take a slightly contrarian position, but most of the ideas that impressed me in instruction set design would not have been found using your method, and seem quite revolutionary in comparison to the ideas I would expect to emerge the way you describe.

    I hope I am not mischaracterizing your idea by saying that it’s about looking for computational idioms: what are we computing often?
    The ideas that impressed me took architectural constraints into account (what can current and future technology offer?) at the same time.

    – The conditional move. I don’t know who invented it, and it is obvious in retrospect, but I’m not sure that you would find it using the “what are we computing often?” question. The point is that it’s an idiom that can often be forced to appear even if it does not occur naturally. (See the sketch after this list.)

    – The PowerPC’s dot and non-dot instructions (the one writes to the flags and the other doesn’t). Call it a hack over an existing design if you want, but I think it’s brilliant to continue sharing most of two distinct computations, while allowing one result to be completely ignored and not to overwrite its output port. You might actually discover this idea if you looked at dataflow graphs at a very small scale, but you might never do that fearing to let existing designs influence the graphs too much.

    – Going further, I believe there is an instruction set that has several flag registers. You specify which one to use as output for each arithmetic operation. Surely, I didn’t invent this? Sounds like it might be IA64. If it does not enlarge instructions too much, it must allow some great tricks. Again, it’s more the sort of thing that the compiler would attempt to make appear.

    – The SPARC’s register windows are a brilliant idea entirely born from architectural constraints.

    – The branch delay slot, on the other hand, does not sound like something that would age well. I do not know how easily compilers or humans were able to fill it, but a 1-instruction effort seems futile with all modern desktop processors being deeply pipelined. So you can’t win every time even if you look at the problem from the other end.
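
    To make the conditional-move point above concrete, here is the branchless-select pattern that if-conversion produces and that a conditional-move instruction implements in a single operation (a Python sketch with invented names):

        def select(cond, a, b):
            # Branch-free "a if cond else b" for Python ints. A conditional move
            # computes this select in one step, which is why a compiler can force
            # the idiom to appear even when the source code contains a branch.
            mask = -int(cond)             # all-ones when cond is true, zero otherwise
            return (a & mask) | (b & ~mask)

        assert select(True, 7, 9) == 7 and select(False, 7, 9) == 9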

  7. If I understood John’s description, he means to track dynamic data dependencies, even through the heap and stack. In that case, it could be very interesting to see if the abstraction layers could be peeled away, even through many layers of object-orientation.

  8. (1) Why on Earth doesn’t x86 have an integer multiply-accumulate instruction?

    (2) Just for fun, instead of having our silly Z/2^L Z arithmetic, can you imagine if you had binary Galois fields instead? With such arithmetic, you would have that (x * y) / x = y for all non-zero values of x **even if x*y overflows**.
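
    For concreteness, here is a small GF(2^8) multiply in Python (the AES reduction polynomial 0x11B is just one possible choice of field), with a brute-force check that multiplication by a non-zero element loses no information:

        def gf256_mul(a, b, poly=0x11B):
            # Carry-less multiplication in GF(2^8), reduced by the polynomial
            # x^8 + x^4 + x^3 + x + 1. Unlike Z/2^L arithmetic, every non-zero
            # element has a multiplicative inverse, so x*y determines y.
            r = 0
            for _ in range(8):
                if b & 1:
                    r ^= a
                b >>= 1
                a <<= 1
                if a & 0x100:
                    a ^= poly
            return r

        x = 0x53
        assert len({gf256_mul(x, y) for y in range(256)}) == 256  # a bijection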

  9. This problem has been tackled before, mostly in the CAD community.

    An example of early work:

    Ing-Jer Huang, Alvin M. Despain: Generating instruction sets and microarchitectures from applications. ICCAD 1994: 391-396

    The most widely cited paper on the topic:

    Kubilay Atasu, Laura Pozzi, Paolo Ienne: Automatic application-specific instruction-set extensions under microarchitectural constraints. DAC 2003: 256-261

    Detailed survey paper:

    Carlo Galuzzi, Koen Bertels: The Instruction-Set Extension Problem: A Survey. TRETS 4(2): 18 (2011)

    Most of the key issues, such as finding repeating patterns across a variety of data flow graphs, have been studied by others as well. In the PL&S community, you could look at Proebsting’s work on superoperators, for example. The CAD community also has a lot of papers that focus on isomorphism and other similarity metrics.

    Handling long dependency chains is not a problem, as long as you are not opposed to creating multi-cycle instructions that stall your pipeline — similar to floating-point or division instructions.

    ----

    The real limiting factor in doing the proposed study properly is access to representative benchmarks. Since this work originated on the embedded side of the spectrum, the question has more or less been answered for typical embedded workloads such as EEMBC and MediaBench. I’m not sure if anyone bothered to do it for SPEC.

  10. @Pascal Cuo, I disagree with some of the examples given:
    - I’m not especially impressed with the PPC ISA: the Alpha had no flag registers, as the designers thought that they could become a bottleneck.
    - The register window, while innovative, is a bit like the branch delay slot: it works well on one implementation, but on other implementations with different technology it can become a liability, as it makes all register accesses indirect.

    One thing you didn’t list that I would add is MIPS’ two versions of integer arithmetic instructions, one of which traps on integer overflow and one of which sets the flag register: very useful for implementing sane integer semantics the way Ada does.

  11. For some purposes it might be more worthwhile to optimise for code size than to try to cherry-pick individual instructions.

    Modern OOO processors tend to play pretty fast and loose with the information exposed in an ISA, caching it, rewriting it into microcode, etc. Thus a highly specialised instruction stands at least some chance of becoming obsolete over time as the processor implementation evolves, whereas compactness doesn’t seem likely to ever become undesirable so long as caches are still a thing.

  12. @Daniel Lemire

    (2) Galois multiplication is not nearly as common as integer multiplication. If the divisibility property is important, then you should work modulo a large prime close to 2^n. However, the number-theoretic characterization of division is usually more important than its algebraic properties.

    By the way, the AES-NI instruction set does allow faster multiplication over a Galois field.

  13. Another line of similar work is Bracy’s Mini-Graphs (e.g., Dataflow Mini-Graphs: Amplifying Superscalar Capacity and Bandwidth, MICRO, 2004). I was surprised that it doesn’t seem to be mentioned in the Galuzzi & Bertels survey.

    Scott Mahlke has also done a raft of related work.

  14. John: I think you’re right that we need to look at instruction sequences, dataflow, etc., but there are a lot of tradeoffs here.

    Your intended microarchitecture matters: If you plan to use register renaming, you want to avoid instructions with large numbers of destination registers and favor conditional move instructions over conditional instructions.
    (I also don’t know which processor added conditional move – but there’s a good chance it was a processor with register renaming.)

    If you’re more focused on energy efficiency, you might choose more complex instructions requiring greater logic depth – but that’s going to hurt your high frequency implementations.

    How long you’re going to have to live with your design choices also matters. If you’re commercially successful, you’ll be living with them for 20-30 years, so decisions which make a huge amount of sense for today’s microarchitecture may be things you regret in 15 years’ time. So you’ve got to anticipate all the various scaling trends (Moore’s law, memory wall, power wall, dark silicon, …) but also produce a design that you can make well today. You’ve also got to leave room to add new instructions in the future.

    Do you want to define subsets of your architecture for use in cost/area-constrained niches like microcontrollers? (Microcontrollers are often implemented on older processes like 180nm or 130nm instead of the 45nm or 32nm used for high end cores so, while Moore’s law applies, they’re behind the curve.)
    And if you do aim to service multiple niches, then you have to consider a wider range of workloads – embedded code is very different from running a web browser.

    Do you want to support high performance computing? How are you going to support the massive parallelism and energy efficiency required for exascale machines? With 1,000,000 or more processors, reliability is an issue – what will you handle in software and what in hardware? Do you handle it in user-level code or only at the OS layer?

    And then you have to produce compilers – some degree of completeness and regularity are a good idea. In a regular architecture you can exploit these by first generating a direct version of the code (using simple instructions) and then applying a series of peepholes to replace them with more complex compound instructions. If the ISA lacked the simple instructions, you would have to leap straight for the complex instructions. (I’m especially thinking of vectorization here – you have a complex transformation with many side-conditions which prevent it from being applied so you want to start by targeting a simple ISA, and then be able to apply a series of transformations to improve on it. Too many optimality islands will be hard to target.)
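
    As a tiny sketch of that strategy, here is a peephole pass in Python over an invented three-address IR (the opcode names are hypothetical); it happens to fuse the same load-plus-immediate-add pattern that an earlier commenter found profitable on ARM7:

        def fuse_load_addi(code):
            # Replace "load t, [addr]" followed by "addi d, t, imm" with one
            # compound op. A real pass would also check that t is dead after
            # the addi before deleting the plain load.
            out, i = [], 0
            while i < len(code):
                a = code[i]
                b = code[i + 1] if i + 1 < len(code) else None
                if b and a[0] == "load" and b[0] == "addi" and b[2] == a[1]:
                    out.append(("load_addi", b[1], a[2], b[3]))
                    i += 2
                else:
                    out.append(a)
                    i += 1
            return out

        code = [("load", "t2", "[r1]"), ("addi", "t3", "t2", 4), ("mul", "t4", "t3", "t3")]
        print(fuse_load_addi(code))
        # [('load_addi', 't3', '[r1]', 4), ('mul', 't4', 't3', 't3')]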

    Do you even have the workloads you need to produce the traces? SPEC is hopelessly unrepresentative of real programs. JITs are changing rapidly, runtime code generation seems to be making a comeback. Languages are becoming more dynamic (more branches, more indirect branches) – but will that trend reverse as designers accept that Moore’s law is ending (if it is) and programmers have to take an ax to today’s bloatware? In the time it takes to launch a new processor, get compilers, OSes, etc. fully up to speed, get enough design wins to reach high volume, etc. all the key parts of a web browser will be completely rewritten. So just as you have to anticipate the hardware trends, you have to predict some of the software trends in the absence of actual workloads.