Wednesday, April 27, 2011

final report

Our project final report is available here.

Our source code, benchmarking suite, and so on are available here.

Monday, April 25, 2011

milestone report

Here is our milestone report from April 15 (about 1.5 weeks ago). (better late than never, right?)

What we have done so far:
  1. explored LLVM's infrastructure enough to determine that its backend is not flexible/amenable enough to make global register allocation (at least in the more sophisticated form we want to do it) feasible for a 6-week project.
  2. devised a bottom-up-and-top-down algorithm for performing GRA, in a manner that supports rewriting calling conventions per-function
  3. planned new infrastructure for our undergrad compiler to support the algorithm
The major change, of course, is that we have elected to switch from LLVM to POBBLES, the C0 compiler we implemented in 15-411. We missed the milestone, which was to have a basic implementation (with few features) for LLVM working correctly.

We expect infrastructure to be much easier to manipulate in our compiler than in LLVM. In essence, we are tearing out the back-most parts (register allocation and code generation - everything up through liveness and function inlining will remain intact) and replacing them with new pipeline stages. A vague checklist of what needs to be done (a sketch of item 1 follows the list):
  1. find and iterate over strongly connected components of a call graph
  2. register allocator support for inserting conflict-resolving register move instructions, and for allocating on "virtual" and "hardware" registers at the same time
  3. main GRA pass, which will propagate constraints top-down and bottom-up, invoke the register allocator as necessary, and keep track of register-used sets to determine when saving and restoring registers is needed
  4. code generation backend rewritten to respect the calling conventions we decide on instead of the standard x86 one
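
For item 1, here is a minimal sketch of bottom-up SCC iteration (C++ purely for illustration; the adjacency-list call graph and function names are stand-ins, not our actual infrastructure). Conveniently, Tarjan's algorithm emits SCCs callees-first, which is exactly the bottom-up order the GRA pass wants:

#include <cstdio>
#include <vector>

// call graph as adjacency lists: calls[f] lists the functions f calls
static std::vector<std::vector<int> > calls;
static std::vector<int> index_of, lowlink, on_stack, stk;
static int next_index;

static void strongconnect(int f) {
    index_of[f] = lowlink[f] = next_index++;
    stk.push_back(f);
    on_stack[f] = 1;
    for (size_t i = 0; i < calls[f].size(); ++i) {
        int g = calls[f][i];
        if (index_of[g] < 0) { // unvisited callee: recurse
            strongconnect(g);
            if (lowlink[g] < lowlink[f]) lowlink[f] = lowlink[g];
        } else if (on_stack[g] && index_of[g] < lowlink[f]) {
            lowlink[f] = index_of[g]; // back edge into the current SCC
        }
    }
    if (lowlink[f] == index_of[f]) { // f is the root of an SCC: pop it
        printf("SCC:");
        int g;
        do {
            g = stk.back(); stk.pop_back();
            on_stack[g] = 0;
            printf(" f%d", g);
        } while (g != f);
        printf("\n");
    }
}

int main() {
    // example: f0 calls f1, and f1/f2 are corecursive
    calls.assign(3, std::vector<int>());
    calls[0].push_back(1);
    calls[1].push_back(2);
    calls[2].push_back(1);
    index_of.assign(3, -1);
    lowlink.assign(3, 0);
    on_stack.assign(3, 0);
    next_index = 0;
    for (int f = 0; f < 3; ++f)
        if (index_of[f] < 0) strongconnect(f);
    // prints the corecursive pair {f2, f1} before its caller {f0}:
    // callees first, so the GRA pass can consume SCCs as they appear
    return 0;
}
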
This week we have been devising the algorithm in more depth and putting new infrastructure in place to support it. Next week will hopefully be for implementing the algorithm, with the goal of "proof-of-concept" functionality (our original milestone, minus the LLVM part). The week after will be for adding features, polishing details of the implementation, and benchmarking. (We may wish to continue benchmarking and polishing during exam week.)

For benchmarking, we plan to re-use the benchmark suite from 15-411, which has several complicated computation-heavy programs, such as a Collatz conjecture solver and Hunt the Wumpus. If this proves to be insufficient (and probably even if it doesn't), we will spend some time generating more test cases. (Note: I expect the "coming up with a new algorithm" part will take much energy away from benchmarking, so I personally would be proud to have a full-featured correct implementation even without thorough benchmarking.)

Monday, April 11, 2011

the algorithmic muse

in general, global register allocation must be a bottom-up pass along the call graph: whenever you visit a function, you must already have visited all functions that it calls. the reason is that when GRAing on a function, you will end up rewriting its calling convention (arguments go here, expect the return value to go here), and any function that calls it has to know the new convention before you can register-allocate on it. here is a short example.

int pow2(int x) {
    return 1 << x;
}

int main(int argc) {
    return pow2(argc);
}


bottom-up means you visit pow2 first. the fact that you're going to use the shift instruction imposes the constraint that x must be assigned to %rcx (x86's variable shift count must live in %cl, the low byte of %rcx). given that, and nothing else, we determine the calling convention to be simple:

<pow2>: # argument in %rcx
    mov $1, %rax
    shl %cl, %rax # shift count in %cl (hardware constraint)
    ret # return value in %rax

and can next allocate on its calling functions:

<main>: # argument in %rdi (or wherever)
    mov %rdi, %rcx # load argument
    call pow2
    ret # return value already in %rax

(note: not really much benefit in this case, but you can imagine that if main did something complicated between entry and the call site, having the calling-convention information (as mandated by the hardware constraint) would help.)

you may notice a couple things:
  1. with corecursive functions, you can't visit the callee before the caller, since each is both. as mentioned in our roadmap, this will be another "next step" - we believe the right solution is that, since corecursive functions will need to push registers anyway, you can just cut your losses and not bother considering them against each other.
  2. what if a caller function has a hardware-constrained register that overlaps the call site of a callee, instead of the other way around? when register-allocating on the callee, you would want to know to avoid that register if you can, so the calling function will not have to push and pop around the call-site. this means that constraints around call sites need to be propagated top-down, in addition to all constraints being propagated bottom-up.
  3. in the example, the register storing the return value was actually unconstrained, and i chose %rax arbitrarily. if something in the calling function wanted to use that register around the call site (say, calling another function beforehand which itself returns in %rax, and having its return value be live across the second call site), that would also constitute a constraint of the same sort as in point #2.
so... a complete GRA algorithm must propagate register constraints both upwards and downwards, in order to satisfy all cases. here's an example.

int add(int x, int y) {
    return x+y;
}

int pow2(int x) {
    return 1<<x;
}

int main(int argc) {
    int t1 = add(argc, 42);
    int t2 = pow2(t1);
    return t1+t2;
}

running through the output of a "satisfactory" algorithm step-by-step:
  1. [bottom-up] add is unconstrained. pow2 wants its first argument in %rcx, and its return value is unconstrained. main (being a special case, since it is called from outside) must have its first argument in %rdi, and its return value in %rax.
  2. [register allocation] in main, t1 wants to be in either %rcx (for pow2) or in %rax (for the return), and t2 is unconstrained. (note: only bother register allocating on functions where new constraints have appeared since the last pass.) now, if register allocation finds/introduces a want-want conflict, we should resolve it immediately, instead of keeping it vague, because it means there must be a register move in this function at some point or another. by fiat, we'll say %rcx wins.
  3. [top-down] add wants its first argument in %rdi (from main), its second argument is unconstrained, and its return value wants to be %rcx.
  4. [register allocation] in add, x wants to be either %rdi or %rcx. arbitrary resolution chooses %rdi.
  5. because no further information needs to travel along the call graph, we start filling in unconstrained registers. start by saying add's second argument will be %r8. propagation causes no new constraints or conflicts, so move on: pow2's return value can also be %r8, since the value it held (add's second argument) is no longer live by that point. (how do we know we can re-use this register? see open questions below...) again, no new conflicts are introduced, and we have nothing unresolved, so we are done.
generated code will look like this:

<add>: # arg1: %rdi; arg2: %r8; return: %rcx
    mov %rdi, %rcx
    add %r8, %rcx
    ret

<pow2>: # arg1: %rcx; return: %r8
    mov $1, %r8
    shl %cl, %r8 # shift count in %cl
    ret

<main>: # arg1: %rdi; return: %rax
    mov $42, %r8
    call add
    call pow2
    mov %rcx, %rax
    add %r8, %rax
    ret
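
to make the want-want machinery in steps 2 and 4 concrete, here is a tiny runnable toy (C++ purely for illustration; all names are hypothetical, and this is not our real implementation) that merges t1's want-set across its call edges to a fixpoint and then resolves the conflict by fiat:

#include <cstdio>
#include <set>
#include <string>

// a caller temp, or a callee argument/return slot, with the set of
// registers it would like to end up in
struct Slot {
    std::set<std::string> wants;
};

// a call edge ties a caller temp to a callee slot; merging want-sets in
// both directions plays the role of the bottom-up and top-down passes
bool merge(Slot &a, Slot &b) {
    size_t before = a.wants.size() + b.wants.size();
    a.wants.insert(b.wants.begin(), b.wants.end());
    b.wants.insert(a.wants.begin(), a.wants.end());
    return a.wants.size() + b.wants.size() != before; // anything new?
}

int main() {
    // model main's t1 from the example: tied to pow2's argument (which
    // wants %rcx) and to main's return (which wants %rax)
    Slot t1, pow2_arg, main_ret;
    pow2_arg.wants.insert("%rcx");
    main_ret.wants.insert("%rax");

    // propagate to a fixpoint (note | rather than ||, so both merges run)
    while (merge(t1, pow2_arg) | merge(t1, main_ret)) {}

    // two registers in one want-set is a want-want conflict: a register
    // move must exist somewhere, so resolve immediately, by fiat
    const char *pick = t1.wants.count("%rcx") ? "%rcx"
                                              : t1.wants.begin()->c_str();
    printf("t1 -> %s (%u wants, so a mov is needed)\n",
           pick, (unsigned)t1.wants.size());
    return 0;
}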


open questions:
  1. somehow, when resolving want-want conflicts during the per-function "intermissions" between call-graph passes, we will need to insert register move instructions. these will replace the register moves that the code generation backend currently emits to respect calling conventions.
  2. when filling in unconstrained registers, how do you know when you can re-use a register? this should be done via conventional register allocation, but when do you run it?
  3. could there be a situation where a called function unavoidably clobbers a register that the caller needs (such as with idiv...)? you can imagine that if left-shift clobbered %rcx, main's t1 would be quite unhappy. how do we deal with this?
  4. interestingly, the need for a temp to be in %rcx for pow2's shift appears in a completely different function. this is an artifact of the conflict-resolution strategy i chose, and i'm not sure if it's ideal, though (in this case at least) it produces just as good results. is it useful to pursue an algorithm that would "make more sense"?

Sunday, April 3, 2011

Global Optimizations in the LLVM Backend

Recall that LLVM structures the entire contents of a program as a Module object, which contains a list of Functions, each representing a function of that program. In doing global register allocation, we are performing a global optimization that requires access not only to individual functions, but also to a call graph that connects them in an easily accessible manner.
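
For reference, walking a Module's Functions is straightforward (a minimal sketch against the 2011-era C++ API; listFunctions is our hypothetical name):

#include "llvm/Module.h"
#include "llvm/Support/raw_ostream.h"

// print the name of every Function in a Module
void listFunctions(llvm::Module &M) {
    for (llvm::Module::iterator F = M.begin(), E = M.end(); F != E; ++F)
        llvm::errs() << "function: " << F->getName() << "\n";
}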

Although the LLVM compiler is well designed for global passes (or, as LLVM calls them, ModulePasses) over its high-level abstract syntax form, the current version of LLVM seemingly has little support for optimizations that require access to multiple functions in the compilation backend. For example, to write an alias analysis optimization, one would write a class that extends ModulePass. Unfortunately, the backend of LLVM is designed to completely separate functions from each other. Currently, instruction selection, register allocation, and assembly/bytecode output are written as a set of MachineFunctionPasses, i.e. passes that independently operate on the machine IR representation of individual functions. All target-dependent optimizations are likewise performed on a function-by-function level and are not inter-procedural in any noticeable way.
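
To make the isolation concrete, here is the skeleton such a pass extends (a minimal sketch against the 2011-era API; DemoPass is a hypothetical name). Note that runOnMachineFunction receives exactly one MachineFunction and has no handle to any other:

#include "llvm/CodeGen/MachineFunctionPass.h"
#include "llvm/CodeGen/MachineFunction.h"

using namespace llvm;

namespace {
struct DemoPass : public MachineFunctionPass {
    static char ID;
    DemoPass() : MachineFunctionPass(ID) {}
    virtual bool runOnMachineFunction(MachineFunction &MF) {
        // only MF is visible here; no other function's machine code
        // is reachable from this pass
        return false; // we changed nothing
    }
};
char DemoPass::ID = 0;
}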

The implementation of the LLVM backend currently lacks support for global optimizations like inter-procedural register allocation. Most backend work is performed on MachineFunction objects, which hold the target-dependent intermediate representation and code form of their associated Functions. Each Function therefore has a single associated MachineFunction, and the same MachineFunction is used throughout instruction selection, register allocation, and assembly output.

However, MachineFunctions are allocated by the MachineFunctionAnalysis class, which is a FunctionPass. This may not initially seem like an issue, but MachineFunctions are allocated in the runOnFunction() method and persist across the entire backend compilation process. The issue arises when we consider how passes are managed: there is no easy way of accessing another MachineFunction while modifying the current one. Worse yet, as programmers, we have no way of knowing whether the other MachineFunctions have even been allocated at that point. Depending on the implementation of the Pass Manager, a compiler could feasibly translate Functions into MachineFunctions one by one, performing instruction selection, register allocation, and assembly generation on each before proceeding to the next Function. Under this design of LLVM, we simply cannot make any assumptions.

It turns out that modifying LLVM to add support for global backend optimizations is not actually that tricky. The MachineFunctionAnalysis code should be modified to perform all MachineFunction allocations at once, giving the programmer the ability to access a specific MachineFunction at will. Because we want to spend more time on the register allocation problem itself instead of dealing with the LLVM infrastructure, we believe the following may be the simplest way to get around such issues (a rough sketch follows the list).

  1. Modify MachineFunctionAnalysis, initializing the object with a mapping of Functions to MachineFunctions and adding getMachineFunction(Function &), which will return a Function's associated MachineFunction.
  2. Write register allocation as a ModulePass, requiring that the MachineFunctionAnalysis and CallGraph passes have been run prior.
  3. Have runOnModule traverse the CallGraph and perform analysis on strongly connected components in a bottom-up fashion.
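
A rough sketch of how steps 1-3 might fit together (hedged against the 2011-era API: GlobalRegAlloc and allocateSCC are our hypothetical names, and the getMachineFunction call assumes the modification from step 1; as-is, the pass manager would also need convincing to let a ModulePass use MachineFunctionAnalysis, which is precisely the infrastructure issue described above):

#include "llvm/Pass.h"
#include "llvm/Module.h"
#include "llvm/Analysis/CallGraph.h"
#include "llvm/ADT/SCCIterator.h"
#include "llvm/CodeGen/MachineFunctionAnalysis.h" // assumed modified per step 1
#include <vector>

using namespace llvm;

namespace {
struct GlobalRegAlloc : public ModulePass {
    static char ID;
    GlobalRegAlloc() : ModulePass(ID) {}

    virtual void getAnalysisUsage(AnalysisUsage &AU) const {
        AU.addRequired<CallGraph>();               // step 2
        AU.addRequired<MachineFunctionAnalysis>(); // step 2
    }

    virtual bool runOnModule(Module &M) {
        CallGraph &CG = getAnalysis<CallGraph>();
        // step 3: scc_iterator yields call-graph SCCs bottom-up, i.e.
        // each SCC is visited only after everything it calls into
        for (scc_iterator<CallGraph*> I = scc_begin(&CG), E = scc_end(&CG);
             I != E; ++I)
            allocateSCC(*I);
        return true;
    }

    void allocateSCC(const std::vector<CallGraphNode*> &SCC) {
        for (size_t i = 0; i < SCC.size(); ++i) {
            Function *F = SCC[i]->getFunction();
            if (!F || F->isDeclaration())
                continue; // external or declaration-only node
            // step 1's accessor would hand us the machine code:
            // MachineFunction &MF =
            //     getAnalysis<MachineFunctionAnalysis>().getMachineFunction(*F);
            // ... global register allocation on MF goes here ...
        }
    }
};
char GlobalRegAlloc::ID = 0;
}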

Thursday, March 17, 2011

the drawing board

for my own convenience, here is a useful link.

Project Proposal: Interprocedural Register Allocation for LLVM

Overview

There is much to be said in the land of register allocation for seeking to avoid having to save and restore callee-save and caller-save registers. Interprocedural register allocation is a strategy that compares the register usage of multiple functions and allocates registers for several functions at once to minimise overlap. This can eliminate register saves and restores around function boundaries, saving on both execution time and stack usage.

Schedule

We aim to determine whether cross-function register allocation is an effective optimisation. In theory, we expect it to reduce the number of register saves and restores for many functions, and potentially also to reduce stack usage. These benefits may be tested with a suite of computationally expensive programs and a suite of memory-intensive programs, respectively (though we expect the latter may be infeasible to pursue in the time provided, since heap usage will not be affected). It may also be prudent to compare this optimisation's effectiveness against straightforward function inlining. Our target featuresets are as follows:
  1. 75%: Cross-function register allocation works on simple programs, and its effectiveness can be measured with a rudimentary benchmarking suite.
  2. 100%: Support for programs including recursive call graphs (will require graph analysis regarding strongly connected components), more polished test suite.
  3. 125%: Support for one or more of the following possibilities: Cross-module link time optimisation, caller stack slot re-usage within callees, additional architectures
A week-by-week estimated progress schedule:
  1. Research into LLVM's infrastructure, general planning of implementation details
  2. Implementation
  3. Implementation, possibly starting on benchmark strategies
  4. Wrapping up basic implementation, developing benchmark suite - by April 14th (the checkpoint), we hope to have met the 75% goal.
  5. Polishing basic implementation, adding recursion support, more sophisticated benchmarking, possibly considering one or two of the 125% goals
  6. Final polishing, further consideration of a 125% goal
Software & Implementation Details

Our code will be written in the LLVM framework. All software that will be used as part of this project is either already included in the LLVM source tree or shall be self-supplied.

We will initially be targeting the x86-64 architecture. x86-64 provides 16 general-purpose registers that may be used by the compiler, subject to the calling conventions specified by the System V AMD64 ABI. While it is recommended that compilers generally follow calling conventions, interprocedural register allocation seeks to decrease the number of loads and stores (pushes and pops) in emitted code. We will therefore selectively choose to ignore certain conventions.

The CSD fish machines (fish.ics.cs.cmu.edu) currently run Intel Xeon E5520 CPUs. We also both own machines that contain Intel Core 2 Duo CPUs.

Related Work

Peter A. Steenkiste and John L. Hennessy. A simple interprocedural register allocation algorithm and its effectiveness for Lisp. ACM Trans. Program. Lang. Syst., 11:1--32, January 1989.

David W. Wall. Global register allocation at link time. SIGPLAN Not., 21:264--275, July 1986.

The above papers consider register allocation in an interprocedural framework. We will take these papers into consideration when designing our implementation in LLVM.

Furthermore, as a version of global register allocation has recently been implemented in GCC, we will be looking closely at the interprocedural register allocation work done there, as well as at other research on the topic.