Register Allocation in Compilers: How Variables Fit into CPU Registers

Discover how compilers solve register allocation — mapping thousands of virtual registers to 16 (x86-64) or 31 (ARM64) physical ones. See graph coloring, spilling, and rematerialization with real Godbolt examples.

arm64 c-c clang code-generation compiler gcc graph-coloring llvm optimization register-allocation spilling systems-programming aarch64 register callee-saved-register caller-saved-register function-prologue function-epilogue basic-block

In Part 5 of this series, we watched the compiler backend translate platform-independent LLVM IR into target-specific machine instructions. But we deliberately sidestepped the hardest problem in the entire backend: register allocation in compilers. LLVM IR hands the backend thousands of virtual registers — %1, %2, %3, stretching into the hundreds for any non-trivial function. Meanwhile, your x86-64 CPU offers exactly 16 general-purpose registers. ARM64 is more generous with 31, but it is still a finite resource constrained by strict calling convention rules.

This is the fundamental tension at the heart of register allocation in compilers: how do you map an unbounded set of variables onto a brutally limited set of physical storage locations, while keeping your program as fast as possible? Get it wrong, and variables spill to the stack — RAM that is orders of magnitude slower than a register. Get it right, and your code flies.

1. The Register Allocation Problem in Compilers

Before diving into algorithms, let’s understand exactly why register allocation in compilers is such a difficult problem. It is not a simple matter of assigning names — several constraints make this a deeply challenging optimization puzzle.

Register Classes Are Separate Pools

A CPU does not have one monolithic set of registers. Instead, registers are split into distinct classes that serve different purposes. On x86-64, you have 16 general-purpose registers (RAX through R15), 16 SSE/AVX floating-point registers (XMM0–XMM15, or 32 with AVX-512), and various special-purpose registers. On ARM64, you get 31 general-purpose registers (X0–X30) and 32 SIMD/FP registers (V0–V31). The register allocator must respect these boundaries — an integer variable cannot be placed in a floating-point register.

Calling Conventions Shrink the Available Pool

Not all registers are freely available to the allocator. As we explored in Assembly Part 2: Calling Conventions, the ABI splits registers into caller-saved (volatile) and callee-saved (non-volatile) categories. Caller-saved registers like RDI and RSI on x86-64 are destroyed across function calls, so any value in them must be saved before a call instruction. Callee-saved registers like RBX and R12–R15 persist across calls, but using them forces the compiler to insert save/restore code in the function prologue and epilogue.

The following table summarizes the key differences between the two major architectures’ register files and how they impact register allocation in compilers:

Feature	x86-64	ARM64
General-Purpose Registers	16 (RAX–R15)	31 (X0–X30)
FP / SIMD Registers	16 (XMM0–15) / 32 (AVX-512)	32 (V0–V31)
Caller-Saved (Volatile)	RAX, RDI, RSI, RDX, RCX, R8–R11	X0–X15
Callee-Saved (Non-Volatile)	RBX, RBP, R12–R15	X19–X28
Link Register	❌ (return address on stack)	X30 (LR) — avoids stack write for leaf functions

ARM64’s nearly double-sized register file gives the compiler significantly more room to work with. This architectural advantage means ARM64 code inherently suffers fewer register spills for the same source code — a direct consequence of RISC philosophy prioritizing a large, uniform register set.

2. Liveness Analysis — When Are Variables Alive?

Before the compiler can assign physical registers, it must answer a fundamental question for every variable: when is this variable alive? A variable is considered live from the point where it is first defined (written) to the point where it is last used (read). This span is called its live range or live interval.

The reason live ranges matter is simple: if two variables are alive at the same time, they interfere — they cannot share the same physical register. If variable v1 is alive from instruction 1 to instruction 5, and variable v2 is alive from instruction 3 to instruction 7, then both are alive simultaneously during instructions 3–5. They must be placed in different registers.

Consider this small function and its corresponding live ranges:


int example(int a, int b, int c) {
  int v1 = a + 1;       // Instruction 1: v1 defined
  int v2 = b + 2;       // Instruction 2: v2 defined
  int v3 = v1 + v2;     // Instruction 3: v1, v2 used; v3 defined
  int v4 = c + 3;       // Instruction 4: v4 defined
  int v5 = v3 + v4;     // Instruction 5: v3, v4 used; v5 defined
  return v5;            // Instruction 6: v5 used
}

The live range timeline for each variable looks like this:

Instruction

v1:

v2:

v3:

v4:

v5:

From this diagram, we can extract the interference relationships. Variables v1 and v2 overlap during instructions 2–3, so they interfere. Variables v3 and v4 overlap during instructions 4–5, so they also interfere. However, v1 and v4 never overlap — they could potentially share the same register. This observation is the key insight that drives the next step: building an interference graph.

Liveness analysis is performed by iterating backward through each basic block, tracking which variables are used before being redefined. In LLVM, the LiveIntervals analysis pass builds these intervals for every virtual register in the function, forming the foundation for the register allocation compiler phase.

3. Register Allocation via Graph Coloring

The classic academic approach to register allocation in compilers models the problem as graph coloring. The idea is elegant: build a graph where each node represents a variable’s live range, and draw an edge between any two nodes that interfere (are alive at the same time). Then, color this graph with K colors, where K equals the number of available physical registers. If you can color the graph so that no two adjacent nodes share a color, you have a valid register assignment.

Using our example from above, the interference graph looks like this:

Interference Graph (K = 3 registers available)

Interfering pairs:

• v1 ↔ v2 (live at insns 2–3)
• v3 ↔ v4 (live at insns 4–5)
• No other overlaps

Coloring Solution (2 registers needed)

v1 → R1

v2 → R2

v3 → R1

v4 → R2

v5 → R1

Notice that v1 and v3 receive the same color (R1) even though v3 = v1 + v2 reads v1 at the same instruction where v3 is defined. This works because the read of v1 completes before the write to v3 — the old value is consumed before the new one takes the register. The same logic applies at instruction 5: v4 and v5 share R1 because v4’s read finishes before v5’s write. With just 2 registers, we have successfully allocated 5 variables — no spills needed.

Interactive Lab: Liveness & Register Allocation Visualizer

Step through each instruction to watch live ranges form, interference pairs light up in red when two variables are simultaneously alive, and register assignments materialize. Try the “Triangle” preset to see what happens when spilling is unavoidable.

Why K-Coloring Is NP-Hard

In theory, graph coloring for K ≥ 3 is NP-complete — no known polynomial-time algorithm can solve all instances. However, Chaitin’s algorithm (1981) uses a clever heuristic that works remarkably well in practice: find a node with fewer than K edges, remove it from the graph (push it onto a stack), and repeat. If you get stuck — every remaining node has K or more edges — you must spill one of them. After simplification, pop nodes off the stack and assign colors greedily. This simplify-select loop produces near-optimal results for the structured interference graphs that real programs generate.

4. LLVM’s Greedy Register Allocator

While graph coloring is the textbook approach, LLVM’s production register allocation compiler pass uses a more sophisticated algorithm called RAGreedy — the Greedy Register Allocator. It is the default allocator for all optimized builds and is designed for the messy realities of production code generation.

Priority Queue and Spill Weights

Instead of processing live intervals in a fixed order, RAGreedy places all virtual register intervals into a priority queue. The priority is determined by the spill weight — a metric calculated from use density, loop nesting depth, and interval size. High-weight intervals (frequently used variables inside hot loops) are allocated first, ensuring they get the best registers. Low-weight intervals (rarely used variables in cold paths) are allocated last and are the first candidates for spilling if pressure is high.

Live Range Splitting — The Secret Weapon

This is where RAGreedy truly shines. When an interval cannot fit into any available register, the allocator does not immediately spill the entire variable. Instead, it attempts to split the live range into smaller sub-intervals. Each sub-interval can then be assigned to a different physical register, or only the cold portions are spilled while the hot-loop portion stays in a register. This is a global operation that works across basic block boundaries — something the classic graph coloring approach struggles with.

Eviction

When the allocator encounters a conflict, it evaluates whether to evict an existing assignment. If the current interval has a higher spill weight than the occupant of a desired register, the allocator kicks out the lower-priority interval and re-enqueues it. This allows globally better decisions — a high-priority loop variable can claim a register from a lower-priority variable that was allocated earlier.

LLVM also provides alternative allocators for different use cases. The -regalloc=fast allocator is used for -O0 debug builds — it is extremely fast but produces poor code. The -regalloc=basic allocator is a simple educational implementation. You can switch between them using llc or Clang’s -mllvm flag to observe the difference in output quality. More details on LLVM’s backend architecture can be found in the official LLVM Code Generator documentation.

5. Spilling and Rematerialization — When Registers Run Out

No matter how clever the allocator is, sometimes there simply are not enough registers. When the number of simultaneously live variables exceeds the physical register count, the compiler must spill — store a register’s value to a stack slot in memory and reload it later when needed.

The Cost of Spilling

Spilling inserts two types of instructions into your code. A store (spill) writes the register value to the stack: mov [rbp-8], rax on x86-64, or str x0, [sp, #16] on ARM64. A load (reload) fetches it back: mov rax, [rbp-8] or ldr x0, [sp, #16]. Even when the data hits L1 cache (~4 cycles latency), this is infinitely slower than a register access (0 cycles — the value is already in the execution pipeline). Heavy spilling can turn a tight inner loop into a memory-bound bottleneck.

If you have read Assembly Part 3: Stack Frames, you know that a function’s stack frame grows to accommodate local variables. Spilled registers are precisely the reason those stack frames expand — every spill slot adds 8 bytes (on a 64-bit architecture) to the frame size, which the prologue must reserve via sub rsp, N or sub sp, sp, #N.

Rematerialization — The Antidote to Spilling

For certain values, the compiler has a smarter option than spilling: rematerialization. Instead of paying the cost of a store and a load, the compiler simply recomputes the value wherever it is needed. This works for values that are cheap to recalculate — constants, frame pointer offsets, simple arithmetic like x + 1, or addresses derived from global variables.

Consider a constant like 42. Spilling it would require a store and a load (two memory operations). Rematerializing it requires a single mov eax, 42 or mov w0, #42 — a 1-cycle operation that uses zero memory bandwidth. LLVM’s RAGreedy automatically identifies rematerializable values during the allocation process, preferring rematerialization over spilling whenever the recomputation cost is lower than the memory access cost.

6. Register Allocation in Action: -O0 vs -O2 on Godbolt

The best way to understand register allocation in compilers is to see it yourself. Let’s take a function with deliberately high register pressure — 8 parameters and 8 intermediate variables — and compare how the compiler handles it at different optimization levels.

heavy_compute.c c


int heavy_compute(int a, int b, int c, int d,
                int e, int f, int g, int h) {
  int v1 = a * b + c;
  int v2 = d - e + f;
  int v3 = g * h - a;
  int v4 = b + c * d;
  int v5 = e - f * g;
  int v6 = h + a - b;
  int v7 = c * d + e;
  int v8 = f - g + h;

  return v1 + v2 + v3 + v4 + v5 + v6 + v7 + v8;
}

At -O0: Stack Spill City

Compile this with gcc -O0 or clang -O0 on x86-64, and you will see a massive wall of mov instructions to and from the stack. Every single parameter and local variable gets its own stack slot. The compiler does not even attempt to keep values in registers between operations — it writes each intermediate result to memory and reloads it immediately before the next use. This is -regalloc=fast in action: maximum debuggability, minimum performance.

You will see output resembling this pattern:

x86-64 assembly (-O0) asm


# x86-64, -O0: every variable lives on the stack
mov    dword ptr [rbp - 4], edi    # spill parameter 'a'
mov    dword ptr [rbp - 8], esi    # spill parameter 'b'
mov    dword ptr [rbp - 12], edx   # spill parameter 'c'
...
mov    eax, dword ptr [rbp - 4]    # reload 'a'
mov    ecx, dword ptr [rbp - 8]    # reload 'b'
imul   eax, ecx                    # multiply a * b
mov    ecx, dword ptr [rbp - 12]   # reload 'c'
add    eax, ecx                    # add 'c'
mov    dword ptr [rbp - 36], eax   # spill v1 to stack

Every operation involves a round-trip to memory. The stack frame for this simple function balloons to 60+ bytes.

At -O2: Registers Only

Now compile the same function with -O2. The output is dramatically different. The greedy allocator performs liveness analysis, discovers that most variables’ live ranges do not overlap, and packs everything into registers. On x86-64, you will typically see the entire computation happen in registers like eax, ecx, edx, and r8d–r10d with zero stack spills. On ARM64, the function is even cleaner — with 31 GPRs available, every variable fits comfortably in registers with room to spare.

Try it yourself on Compiler Explorer — click the links below to see each scenario live:

Interactive: x86-64 at -O0 — Stack Spills Everywhere
Interactive: x86-64 at -O2 — Registers Only, Zero Spills
Interactive: x86-64 vs ARM64 at -O2 — The Register File Advantage

The Cross-Platform Register Advantage

The x86-64 vs ARM64 comparison at -O2 is particularly revealing for understanding register allocation in compilers. With 8 function parameters, x86-64’s System V ABI passes the first 6 in registers (RDI through R9) and pushes parameters 7 and 8 onto the stack. The register allocation compiler pass must then reload those two values from memory. ARM64’s AAPCS, by contrast, passes all 8 parameters in registers (X0–X7) — no stack access needed at all. This is a tangible example of how CPU architecture directly shapes the quality of the compiler’s register allocation output.

7. Practical Tips: Helping the Register Allocator

While the register allocator is highly automated, there are coding patterns that can make its job easier — or harder.

Reduce live range overlap. Define variables close to where they are used, and avoid hoisting definitions to the top of a function. The shorter a variable’s live range, the less likely it is to interfere with other variables.
Minimize variables live across function calls. Any variable that is alive across a call instruction must be saved in a callee-saved register or spilled. If you can restructure code so that expensive computations happen after a call rather than before, you reduce the number of values the allocator needs to preserve.
Trust the optimizer. Manually “optimizing” C code by reusing variables (e.g., temp = a; ... temp = b;) does not help — it actually hurts. The compiler works best with SSA-style code where each variable has a single, clear definition. Let the optimizer decide what to merge.
Profile before you worry. Modern allocators like RAGreedy handle register pressure remarkably well. Unless profiling reveals spilling in hot loops, the allocator is almost certainly making better decisions than manual intervention would. Link-time optimization (Part 4: Optimization Passes) can further reduce pressure through cross-module inlining, which eliminates function call boundaries and their associated register save/restore costs.

8. Conclusion

Register allocation in compilers sits at the intersection of mathematical elegance and hardware reality. Through sophisticated heuristics like LLVM’s greedy allocator, compilers map thousands of variables onto a handful of registers, ensuring your data stays where the CPU can access it fastest.

In Part 7: Linking Revisited, we will examine how the linker finally stitches these isolated object files into a single executable.