CPU registers and OCaml

pcwalton · on June 21, 2015

I talked with Chris Lattner a few years ago about register allocation and he had an interesting perspective. In his view (from what I remember; my recollection could be somewhat incorrect) most of the classical academic research on it is solving the wrong problem. Most formulations of register allocation are graph coloring algorithms designed to answer the question "is this function colorable using K registers without spilling?" The algorithms for handling the case when you do spill usually don't have as much thought put into them. But this is emphasizing the wrong aspect of the problem in practice; for almost any interesting function on the commonly used architectures (x86/ARM), the answer to the theoretical question is trivially "no" (because there are relatively few registers), and the most important practical problem is how to spill effectively (including deciding whether to spill versus split versus rematerialize, etc.) As I understand it, that's the idea behind LLVM's (relatively) new greedy allocator [1]: the graph-coloring part of the problem is simple, and the focus is on spilling and splitting, problems that the classical academic literature has tended to put less emphasis on.

[1]: http://blog.llvm.org/2011/09/greedy-register-allocation-in-l...

DannyBee · on June 21, 2015

I'm going to guess you have misunderstood what Chris said (with no offense meant) :)

Spilling is actually a very well known and very well studied problem, and in fact, most compilers and research spend a lot of time figuring out how to place spill code and coealesce copies, not how to color things. Chris knows this. I had discussions with him (many many years ago, before he ever started at Apple) about GCC's register allocation approach vs what he was thinking of doing for LLVM.

There are actually only so many good approaches to spill code /rematerialization/live range splitting you can use. At some point, it's really just a large integer-linear programming problem.

Note that coloring is essentially free on SSA form. It's linear time or better.

pcwalton · on June 21, 2015

Regardless, the whole point of the greedy allocator is to throw out the fancy list of active intervals that linear scan maintains in favor of doing something simpler that allows more flexibility in spilling/splitting, right?

For example:

"Live ranges being spilled without being split first cause the mess that the rewriter is working so hard to clean up. We would much rather split them into smaller pieces that might be assignable, but this would require the linear scan algorithm to backtrack. This is very expensive, and full live range splitting isn't really feasible with linear scan."

i.e. a weakness of linear scan is that the active list approach helps it solve the coloring problem quickly and accurately, but it comes with a big drawback that live range splitting is hard, which is the wrong tradeoff in practice.

DannyBee · on June 21, 2015

"Regardless, the whole point of the greedy allocator is to throw out the fancy list of active intervals that linear scan maintains in favor of doing something simpler that allows more flexibility in spilling/splitting, right?"

Kinda. It's not that simple, either :)

Linear scan is actually just bad all around, IMHO. It's meant for JITs that want to value speed above all else. At the time, I expect Chris was hopeful better spilling algorithms would take care of the fact that Linear Scan sucks.

It's not effective at solving the coloring problem quickly and accurately either :). It's kinda a worst-of-both worlds algorithms

Greedy is closer to what other compilers do: optimistic coloring, splitting if necessary, then remat/spilling.

So it's more accurate to say "LLVM made a bad choice in linear scan. They later fixed that mistake". I"m not away of any other high performance compiler that went down the linear scan path ever, because they didn't think it would work out.

pcwalton · on June 21, 2015

Huh, I was under the impression that linear scan was more well-regarded than it is, then. I guess I hang around JIT folks and LLVM hackers too much. :)

DannyBee · on June 21, 2015

So, there are three main papers on linear scan: Poletto and Sarkar, Traub, Holloway, and Smith, and nowadays, wimmer and franz

Sarkar and Poletto invented it for JITs. They were comparing it against literal chaitin-briggs style graph coloring (this was IBM, and IBM obviously did it that way, since IBM invented it).

The implementation they compared against was not particularly efficient (it happened at a time when I was at IBM research, so i've seen the code).

That said, linear scan made sense for a JIT at that time.

Compared to standard graph coloring algorithms of the time, you got somewhere between 10-30% crappier code, but the algorithm was a lot faster.

(There are papers that seemed to cherry pick linear scan result data to show better results. However, the consensus for implementers was it generated significantly worse code, and that was the tradeoff for faster).

Around this time, everyone started redoing everything in SSA, and noticed some things. Among them, that register allocation on SSA seemed easier.

It took until ~2005, but Sebastian Hack and Bouchez, et al then both independently proved that SSA generates chordal interference graphs, which means you can calculate the number of colors it requires (and color them) in polynomial time.

As an aside, interestingly, outside of special case graphs, you can't even estimate the chromatic number (min number of colors needed) of a graph sanely. The chromatic number is greater than or equal to the clique number, and even that can't be approximated sanely.

See, e.g, http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.16.9...

So this is why so many people were so interested in finding alternatives to graph coloring.

But then it turns out, we can graph color really quickly on SSA.

(Then you still have "out-of-ssa related problems, but ...)

Once that was proven, by 2007/2008, the world had changed.

People were designing brand new SSA based allocators. An intern of mine at the time was working on implementing alternatives, and came up with: http://www.cs.ucla.edu/~palsberg/paper/PereiraPalsberg08.pdf

Here's an abstract:

  For SPEC CPU2000, the compilation time of our implementation
  is as fast as that of the extended version of linear scan
  used by LLVM. Our implementation produces x86 code that is 
  of similar quality to the code produced by the slower, 
  state-of-the-art iterated register coalescing of George and 
  Appel with the extensions proposed by Smith, Ramsey, and    
  Holloway in 2004."

As you can see, at that point, there is no point in linear scan :)

As an aside, he also proved optimal live-range-splitting generates elementary graphs, which give you nice properties like being able to color using register pairs and under various alignment constraints, in linear time. (this would still be NP-complete on chordal graphs)

Since then, things have gotten better and better.

Quentin Colombet's(who wrote the current register allocator in LLVM) thesis is a good read for current techniques in spilling: https://tel.archives-ouvertes.fr/tel-00764405v1/document

sanjoy_das · on June 22, 2015

Another interesting data point is libFirm: http://pp.ipd.kit.edu/firm/ which regalloc's directly over SSA.

DannyBee · on June 22, 2015

Yes, it uses either chordal coloring (written by sebastian hack, who published one of the earlier mentioned graph coloring on SSA papers) or PBQP.

I'm pretty sure LLVM's PBQP implementation is more advanced.

UniQP · on June 22, 2015

What do you mean by more advanced?

The LLVM implementation is more mature from a software point of view, but libFirm's implementation supports more PBQP reductions, and thus may find a better PBQP solution.

jfb · on June 22, 2015

Thank you for this, it was fascinating.

DannyBee · on June 22, 2015

Sure.

If you want a good survey of the world of optimal register allocation/scheduling, someone just wrote one.

http://arxiv.org/pdf/1409.7628.pdf

It's pretty darn well written, and covers the world as it exists today.

The author later went to publish a nice thesis on optimal integrated RA + scheduling. https://www.sics.se/~rcas/publications/TRITA-ICT-ECS-AVH-14-...

cwzwarich · on June 21, 2015

The basic linear scan algorithm generates poor code, so you almost always want to extend it with additional heuristics. If any of these heuristics involve reversing a tentative decision then you have to rewind the state of the entire register allocator, because the linear scan algorithm is designed to process the function in one direction. This coarse-grained backtracking is fairly expensive. If you use a better algorithm that lets you consider tentative decisions on a more individual basis, then you scale better than linear scan with these extensions.

Because of the poor output code quality of the linear scan allocator (even with all of the additional heuristics), there was a post-pass that performed local optimizations. Over time it accumulated a lot of special cases to target particular poor code patterns, and it became more expensive than just using a register allocator that produced better code in the first place.

userbinator · on June 22, 2015

The fact that x86 is not a "pure" register-only architecture also matters --- it also has register-memory/memory-register operations, and push/pop (1 byte each) for efficient saving and restoring of values.

I've looked at a lot of compiler-generated code, and compared to hand-written Asm (which I have also done quite a bit of), register allocation is one of the things that compilers are very noticeably worse at. Thanks to their use of spilling, which doesn't really apply to something like x86, they shuffle data between registers and memory far more than they have to, instead of keeping values in memory or registers and using memory-register operations (which automatically "allocate" one of the many unnamed registers on the CPU, via register renaming.) A nested use structure could be handled easily with push/pop. It's good to see that compilers are slowly moving in the direction of allocating registers like a human programmer would.

cwzwarich · on June 21, 2015

There are very few academic papers that just consider the coloring problem of register allocation. One of the reasons for tis is that there is essentially no progress to be made on the instances of graph coloring that arise in register allocation. As far as I know, all graph coloring register allocators use the greedy coloring algorithm given by Chaitin, which was actually first published by Kempe in the 19th century. There are better graph coloring algorithms, but the interference graphs that arise from real programs are sufficiently uninteresting that they don't matter. And if you're using an SSA IR in your backend, things get even simpler: the greedy algorithm is optimal, and you can perfectly color the interference graph without even constructing it.

Most of the work on register allocation (both in academia and production compilers) happens on the aspects of the problem that are not reflected in the graph coloring formulation:

1) Copy elimination. You can approximate this problem by adding affinity weights to edges of the interference graph and asking for a coloring that minimizes the cost of unfulfilled affinities. This doesn't actually represent the problem as it exists on a real computer, because you can sometimes get rid of copies by other means than just choosing a lucky assignment, e.g. you can restructure the program to move a copy outside of a hot loop. This is especially relevant given the use of SSA form in modern compilers, because if you do nothing then you will end up inserting copies all over the place to implement phis.

2) Spill code insertion. If you perform a coloring of the interference graph and find that you don't have enough physical registers, then you have to restructure the program to spill registers to the stack and reload them. The particulars of which variables to spill and where to insert the stores and loads are very important for code quality. This includes things like rematerialization that let you insert other instruction sequences instead of loads to recreate the spilled value.

3) Live range splitting. In a sense, the idea that live range splitting is a distinct problem from the other problems of register allocation is relevant more to the implementation of register allocators than register allocation as an abstract optimization problem. If your allocator operates on a per-variable basis, then it often makes sense to split a variable into two separate variables and insert a copy merging the two live ranges. For example, a variable might be used frequently in one control-flow region and not used at all down another, and you want the register allocator to be able to make distinct decisions about those two regions. This problem is worsened by compiler optimizations, because compilers can often realize that multiple variables in a function have the same value and optimize them to use a common definition.

The area where academic register allocation research has the largest gap from actual practice is probably the consideration of architectural constraints on registers like instructions with fixed registers, pairs of adjacent registers, etc. and register files that have overlapping registers. You can always insert copies around an instruction to satisfy almost any constraint, but that doesn't work out so well. :) Most of the academic work I have seen on this problem is either overly simplified and doesn't handle all of the constraints that befall a production compiler, or it is considerably expensive to implement. This gets worse when you write a portable compiler; while the constraints of any single architecture may be manageable, it's harder when you have to handle the union of the constraints imposed by all architectures that you want to support. The aliasing between the various ARM floating-point register files was a strong motivator in the design of LLVM's greedy allocator.

The LLVM 'greedy' allocator is probably poorly named. It started out as a simple greedy register allocator, and over time it gained mechanisms like register eviction that make it a non-greedy algorithm. You don't have to squint too hard at it before it looks like a variation of priority allocation:

http://dl.acm.org/citation.cfm?id=88621

The allocator described in that paper uses an interference graph for liveness and interference checks and a different method of choosing the priorities, but if you abstract away those details it's pretty similar.

DannyBee · on June 21, 2015

"If the code has more than that many variables, OCaml compiler has to park the extra variables in memory and this parking is called spilling."

This is just wrong. It would be accurate to say "If the code has more than that number of variables live at the same time".

If the computations are not live at the same time, they can share a register.

gct · on June 21, 2015

The notion that there's only 13/16 registers (assuming x86) hasn't been true for a long time. There's hundreds in the latest cores from Intel. It's true there's only 13/16 names for them, but with register renaming there's way more places to actually put data than that.

rayiner · on June 21, 2015

Register renaming solves a different problem than the one outlined in the article.

Renaming allows to to eliminate write after read dependencies. E.g. Computing a value, storing it to AX, doing something with AX, computing another value, storing it to AX, doing something with AX. If the two computations are independent, the AX dependency is false and the OOO core can execute them concurrently by storing the results into two different rename registers and rewriting the "doing something with AX" to source their inputs from those two different registers instead.

In terms of live ranges, the first value stored in AX has a live range that does not overlap that if the second value stored in AX. So the compiler can reuse AX for those two values without problems.

That's not true in the scenario of the article without moving code around. The live ranges overlap--you have to keep around two or more distinct values before any of them are used. You can't store the results into the same register without clobbering one of the values. So the compiler is forced to introduce a memory operation by spilling a value, and the CPU can't do anything to eliminate that after the fact.

tempodox · on June 22, 2015

Those are interesting observations. When doing something like `let rec loop ...`, I got into the habit of using parameters for only those entities that might change from one iteration to the next. The rest (as being constant / invariant across iterations) dwells in the closure env, if only for the sake of making the source code shorter. The article shows nicely how this obvious mental shortcut has its (equally obvious) costs.

It's also a nice practical demonstration of how reading disassemblies is still an integral part of understanding a language, even this relatively abstract ML descendant.

cosmicexplorer · on June 21, 2015

Why wouldn't the OCaml compiler optimize away the variable usage in the first example? It seems an easy optimization to make if they're only used once.

rayiner · on June 21, 2015

Look at the live ranges in the function (the range between where a variable is defined and where it is used, i.e. the range over which the variable has a value that must be preserved because it will be used in the future). Generally register allocation algorithms will be able to assign different variables to the same register if their live ranges do not interfere (i.e. overlap). If live ranges overlap, that means their value needs to be preserved simultaneously, and they can't use the same register.

In the code, there is a lot of overlapping live ranges. E.g. ax and xx. Those can be eliminated by moving code around, but most register allocators do not move code.

cosmicexplorer · on June 21, 2015

Is there not like a "-O3" option to try aggressive code manipulation like that? Or is moving code around just not done at all because it's difficult to do correctly?

pcwalton · on June 21, 2015

I think rematerialization covers most of the cases in which you would benefit from doing that (such as ax and xx), and rematting is standard in advanced compilers like GCC/LLVM.

pcwalton · on June 21, 2015

A lot of register allocators can rematerialize, though, and that can have much the same effect. (ax and xx should be candidates for rematting.)

rayiner · on June 21, 2015

Good point. Though rematerialization is still pretty rare considering the entire universe of compilers (e.g. Academic projects like Ocaml was).

pcwalton · on June 21, 2015

Yeah, I was always under the impression that ocamlopt is fairly Plan 9-like; it produces medium quality code quickly, and doesn't have the full suite of compiler optimizations that projects like LLVM or HotSpot have.

_19qg · on June 21, 2015

There is <> missing in line 3 of the first code snippet.

a-dub · on June 22, 2015

Does OCaml generate code that makes use of SIMD?

froh42 · on June 21, 2015

Use a better algorithm instead of worrying about processor registers.

And if you start considering the computer architecture think about caches, memory latency and bursting first. Use a cache-conscious data structure.

Optimizing at the register level is ridiculous, this is the least place where you get significant returns for the effort you spend.

Oh, and make it correct first, make it fast second. Use a profiler.

brudgers · on June 21, 2015

There are places where slow correct is worse than fast wrong [for some definitions of "correct" and/or "wrong"]. Janestreet works where perfect is the enemy of the good, and are probably looking at registers because they have already squeezed caches for performance juice. This is a post where the quant context matters. It's not a sophomore CS student's blog.