Monday, April 11, 2011

the algorithmic muse

in general, global register allocation must be a bottom-up pass along the call graph: whenever you visit a function, you must already have visited every function it calls. the reason is that when GRAing on a function, you will end up rewriting its calling convention (arguments go here, expect the return value to go there), and any function that calls it has to know the new convention before you can register-allocate on it. here is a short example.

int pow2(int x) {
    return 1 << x;
}

int main(int argc) {
    return pow2(argc);
}


bottom-up means you visit pow2 first. the variable-shift instruction requires its count to be in %cl (the low byte of %rcx), which constrains x to be assigned to %rcx. given that, and nothing else, we determine the calling convention to be simple:

<pow2>: # argument in %rcx
    mov $1, %rax
    shl %cl, %rax # count must be in %cl, i.e. %rcx (hardware constraint)
    ret # return value in %rax

and can next allocate on its calling functions:

<main>: # argument in %rsi (or wherever)
    mov %rsi, %rcx # load argument
    call pow2
    ret # return value already in %rax

(note: not much benefit in this case, but you can imagine that if main did something complicated between entry and the call site, having the calling convention information (as mandated by the hardware constraint) ahead of time would help.)
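
(for concreteness, the bottom-up order is just a post-order walk of the call graph. here's a minimal sketch of that driver - the function table, visit(), and allocate_registers() are made-up names standing in for the real thing:)

#include <stdbool.h>
#include <stdio.h>

struct func {
    const char *name;
    int callees[4]; /* indices into the function table */
    int n_callees;
};

/* stand-in for the real per-function allocator, which is where the
   calling convention actually gets rewritten */
static void allocate_registers(struct func *f) {
    printf("allocating on %s\n", f->name);
}

static void visit(struct func *funcs, int i, bool *done) {
    if (done[i]) return;
    done[i] = true; /* marking early means corecursive cycles just stop */
    for (int c = 0; c < funcs[i].n_callees; c++)
        visit(funcs, funcs[i].callees[c], done);
    allocate_registers(&funcs[i]); /* every callee has been visited by now */
}

int main(void) {
    /* the example above: main (index 0) calls pow2 (index 1) */
    struct func funcs[2] = {
        { "main", { 1 }, 1 },
        { "pow2", { 0 }, 0 },
    };
    bool done[2] = { false, false };
    visit(funcs, 0, done); /* visits pow2 first, then main */
    return 0;
}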

you may notice a couple things:
  1. with corecursive functions, you can't visit the callee before the caller, since each is both. as mentioned in our roadmap, this will be another "next step" - we believe the right solution is that, since they'll need to push registers anyway, you can just cut your losses and not bother weighing them against each other.
  2. what if a caller function has a hardware-constrained register that overlaps the call site of a callee, instead of the other way around? when register-allocating on the callee, you would want to know to avoid that register if you can, so the calling function will not have to push and pop around the call-site. this means that constraints around call sites need to be propagated top-down, in addition to all constraints being propagated bottom-up.
  3. in the example, the register that stores the return value was actually unconstrained, and i chose %rax arbitrarily. if something in the calling function wanted to use that register around the call site (say, calling another function beforehand, which itself returns in %rax, and having its return value be live across the second call site), that would also constitute a constraint of the same sort as in point #2. (a sketch of how such constraints might be recorded follows this list.)
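
(points #2 and #3 describe soft preferences rather than hard hardware constraints, so the bookkeeping needs to distinguish the two. a minimal sketch of one way to record them - struct constraint, resolve(), and friends are made up for illustration, not the real compiler's:)

#include <stdint.h>
#include <stdio.h>

enum reg { RAX, RCX, RSI, R8, NREGS };
static const char *reg_names[NREGS] = { "%rax", "%rcx", "%rsi", "%r8" };

struct constraint {
    int must;       /* hard hardware constraint, or -1 for none */
    uint32_t wants; /* registers this value would like (point #3) */
    uint32_t avoid; /* registers live around a call site (point #2) */
};

static int resolve(const struct constraint *c) {
    if (c->must >= 0) return c->must;
    for (int r = 0; r < NREGS; r++) /* prefer a wanted, non-avoided register */
        if ((c->wants >> r & 1) && !(c->avoid >> r & 1)) return r;
    for (int r = 0; r < NREGS; r++) /* otherwise anything not avoided */
        if (!(c->avoid >> r & 1)) return r;
    return -1; /* everything conflicts: a move or spill is forced */
}

int main(void) {
    /* pow2's shift count: hard-constrained to %rcx by the hardware */
    struct constraint x = { RCX, 0, 0 };
    /* a return value whose caller keeps %rax live across the call site
       (point #2 arriving top-down): steer away from %rax if possible */
    struct constraint ret = { -1, 0, 1u << RAX };
    printf("x -> %s\n", reg_names[resolve(&x)]);
    printf("ret -> %s\n", reg_names[resolve(&ret)]);
    return 0;
}
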
so... a complete GRA algorithm must propagate register constraints both upwards and downwards, in order to satisfy all cases. here's an example.

int add(int x, int y) {
    return x+y;
}

int pow2(int x) {
    return 1<<x;
}

int main(int argc) {
    int t1 = add(argc, 42);
    int t2 = pow2(t1);
    return t1+t2;
}

running through the output of a "satisfactory" algorithm step-by-step:
  1. [bottom-up] add is unconstrained. pow2 wants its first argument in %rcx, and its return value is unconstrained. main (being a special case) must have its first argument in %rsi, and its return value in %rax.
  2. [register allocation] in main, t1 wants to be either in %rcx (for pow2) or in %rax (for the return), and t2 is unconstrained. (note: only bother register-allocating on functions where new constraints have appeared since the last pass.) now, if register allocation finds or introduces a want-want conflict, we should resolve it immediately, instead of keeping it vague, because it means there must be a register move in this function at some point or another. by fiat, we'll say %rcx wins.
  3. [top-down] add wants its first argument in %rsi (from main), its second argument is unconstrained, and its return value wants to be %rcx.
  4. [register allocation] in add, x wants to be either %rsi or %rcx. arbitrary resolution chooses %rsi.
  5. because no further information needs to travel along the call graph, we start filling in unconstrained registers. start by saying add's second argument will be %r8. propagation causes no new constraints or conflicts, so move on: pow2's return value can also be %r8, since the 42 we put there for add is no longer live by then. (how do we know we can re-use this register? see open questions below...) again, no new conflicts are introduced, and nothing is left unresolved, so we are done. (a sketch of this driver loop follows the walkthrough.)
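
(pulling the walkthrough together: the driver alternates bottom-up and top-down passes, with a re-allocation "intermission" after each, until nothing changes. a sketch of the shape of that loop, with stubbed-out passes - every name here is hypothetical:)

#include <stdbool.h>
#include <stdio.h>

/* stubbed-out passes; each propagation pass returns true if it changed
   any constraint anywhere in the program */
static bool propagate_bottom_up(void) { puts("bottom-up pass"); return false; }
static bool propagate_top_down(void)  { puts("top-down pass");  return false; }

/* the per-function "intermission": re-allocate only the functions whose
   constraints changed, resolving want-want conflicts by fiat */
static void reallocate_dirty(void) { puts("re-allocate dirty functions"); }

/* step 5: fill in whatever is still unconstrained */
static void assign_free_registers(void) { puts("fill in unconstrained regs"); }

static void global_register_allocate(void) {
    bool changed;
    do {
        changed  = propagate_bottom_up();
        reallocate_dirty();
        changed |= propagate_top_down();
        reallocate_dirty();
    } while (changed);
    assign_free_registers();
}

int main(void) { global_register_allocate(); return 0; }
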
generated code will look like this:

<add>: # arg1: %rsi; arg2: %r8; return: %rcx
    mov %rsi, %rcx
    add %r8, %rcx
    ret

<pow2>: # arg1: %rcx; return: %r8
    mov $1, %r8
    shl %cl, %r8 # count in %cl/%rcx (hardware constraint)
    ret

<main>: # arg1: %rsi; return: %rax
    mov $42, %r8 # add's second argument; argc is already in %rsi
    call add # t1 comes back in %rcx, already where pow2 wants it
    call pow2 # t2 comes back in %r8
    mov %rcx, %rax
    add %r8, %rax
    ret


open questions:
  1. somehow, when resolving want-want conflicts during the per-function "intermissions" between call graph passes, we will need to insert register move instructions. these replace the register moves that a conventional code generation backend would emit to satisfy a fixed calling convention. (a toy sketch of this follows the list.)
  2. when filling in unconstrained registers, how do you know when you can re-use a register? this should be done via conventional register allocation, but when do you run it?
  3. could there be a situation where a called function unavoidably clobbers a register that the caller needs (such as with idiv...)? you can imagine that if left-shift clobbered %rcx, main's t1 (live in %rcx across the call to pow2) would be quite unhappy. how do we deal with this?
  4. interestingly, the need for a temp to be in %rcx for pow2's shift appears in a completely different function. this is an artifact of the conflict-resolution strategy i chose, and i'm not sure if it's ideal, though (in this case at least) it produces just as good results. is it useful to pursue an algorithm that would "make more sense"?
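
(for question #1, a toy sketch of what recording such a move might look like - resolve-by-fiat is the strategy from the walkthrough, and everything here is made up for illustration:)

#include <stdio.h>

/* a resolved want-want conflict: two registers pulled on one value,
   and the winner was chosen by fiat */
struct resolved { const char *value, *winner, *loser; };

/* the losing want still has to be satisfied somewhere, so record an
   explicit move for the backend to place */
static void emit_mov(struct resolved r) {
    printf("%s: insert mov %s, %s\n", r.value, r.winner, r.loser);
}

int main(void) {
    /* main's t1 from the example: %rcx beat %rax, and the resulting
       mov %rcx, %rax is exactly the one in the generated code above */
    struct resolved t1 = { "t1", "%rcx", "%rax" };
    emit_mov(t1);
    return 0;
}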
