Lesson 6.4
Scalar Unboxing
Using type information to eliminate tagged-union overhead. How Monk reached C-speed on scalar benchmarks.
Every integer was paying a tax it didn't need to.
Before Phase 6, every Monk value -- integers, floats, booleans, strings, arrays -- was represented as a MonkValue tagged union in the generated C. The runtime needed this to dispatch operations dynamically: when you write a + b, the generated C calls monk_add(a, b), which checks the tag, extracts the values, adds them, and wraps the result back.
That's a lot of work for something a C compiler knows at compile time. An integer addition in C is one instruction. In Monk, it was a function call with a tag check, two struct field extractions, an addition, and a struct construction.
The first benchmark results showed the cost clearly:
| Benchmark | C | Monk (before) | vs C |
|---|---|---|---|
| fibonacci (n=35) | 16 ms | 28 ms | 1.7× |
| mandelbrot (800²×50) | 14 ms | 17 ms | 1.2× |
| matmul (400²) | 10 ms | 142 ms | 14× |
Fibonacci and mandelbrot were already fast after an earlier -O3 -flto optimization pass. Matmul was still 14× C -- each array access was paying for tag checks and deep copies.
The type checker changes everything. It knows that n is an int. It knows that x + y is int + int = int. All of that checking and dispatching is pointless. The tag is known statically. We don't need the union at all.
Unboxing: emit raw C types instead of tagged unions.
Scalar unboxing is the idea of using int64_t, double, or bool in the generated C instead of MonkValue, whenever the checker has determined the type statically. The tag is no longer needed at runtime -- so we don't emit one.
Here's the same Monk function, generated with and without unboxing:
Without unboxing (Phase 1–5):
static MonkValue _monk_fib(MonkValue mk_n) {
if (monk_is_truthy(monk_lte(mk_n, monk_int(1)))) {
return mk_n;
}
MonkValue _t0 = monk_sub(mk_n, monk_int(1));
MonkValue _t1 = _monk_fib(_t0);
MonkValue _t2 = monk_sub(mk_n, monk_int(2));
MonkValue _t3 = _monk_fib(_t2);
return monk_add(_t1, _t3);
} With unboxing (Phase 6):
static int64_t _monk_fib(int64_t mk_n) {
if (mk_n <= 1) {
return mk_n;
}
int64_t _t0 = mk_n - 1;
int64_t _t1 = _monk_fib(_t0);
int64_t _t2 = mk_n - 2;
int64_t _t3 = _monk_fib(_t2);
return _t1 + _t3;
}
This is identical to what a C programmer would write by hand. Every monk_* call is gone. The raw C operators are back. The C compiler can now optimize this the same way it would optimize native C code -- and it does.
The storageKind abstraction.
The unboxing logic lives in src/codegen/unbox.go. Its central concept is storageKind -- a small enum that classifies how a variable should be stored in the generated C:
type storageKind int
const (
storeBoxed storageKind = iota // MonkValue (default, the old way)
storeInt // int64_t
storeFloat // double
storeBool // bool (C _Bool)
)
For each variable declaration, the code generator asks: "what storage kind does the checker assign to this?" If the type is KindInt, it uses storeInt and emits int64_t. If it's unknown or compound (array, record), it falls back to storeBoxed and emits MonkValue.
The same logic applies to expressions. When emitting a binary operation, unbox.go checks both operands' storage kinds. If both are storeInt, it emits a + b. If either is boxed, it falls back to monk_add(a, b).
What gets unboxed.
Scalar unboxing applies to all statically-typed scalar values. Specifically:
Variables.
let n int = 5 emits int64_t mk_n = 5; instead of MonkValue mk_n = monk_int(5);
Arithmetic.
Binary operations between two scalar operands use raw C operators. a + b * c becomes (a + (b * c)), not nested monk_add calls.
Function signatures.
Functions whose parameters and return type are all scalar get unboxed C signatures: static int64_t fib(int64_t n) instead of static MonkValue fib(MonkValue n).
Conditions.
if and while conditions that evaluate to a typed boolean skip the monk_is_truthy() call and emit the C expression directly.
Casts.
to_int(float_var) and to_float(int_var) inline as direct C casts when the argument is already a known scalar, skipping the runtime call.
The results: scalar benchmarks hit C parity.
| Benchmark | C | Before unboxing | After unboxing |
|---|---|---|---|
| fibonacci (n=35) | 16 ms | 28 ms (1.7×) | 17 ms (1.0×) |
| mandelbrot (800²×50) | 14 ms | 17 ms (1.2×) | 15 ms (1.0×) |
| leibniz π (50M iter) | 28 ms | — | 29 ms (1.0×) |
| trial_primes (<200k) | 6 ms | — | 6 ms (1.0×) |
| matmul (400²) | 10 ms | 142 ms (14×) | 122 ms (12×) |
Fibonacci, mandelbrot, leibniz, trial_primes -- all scalar compute -- hit C parity. The generated code is structurally identical to hand-written C and GCC optimizes it the same way.
Matmul barely moved. The hot path is array indexing -- each matrix[i][j] access still goes through the boxed MonkValue array representation. Scalar unboxing only helps scalars. Arrays need their own unboxing -- a separate, harder problem involving the C runtime's memory layout.
Arrays stay boxed for now. Typed array unboxing (backed int[] with int64_t* instead of MonkValue*) is the next performance frontier. It requires a new C struct (MonkIntArray), new codegen paths, and runtime changes. See spec/ARCHITECTURE_DECISIONS.md §4A.
What's still in progress: typed arrays.
Scalar unboxing was the first optimization enabled by the type system. Array unboxing is the second. The current state:
Inline array element access. int[] element reads emit .int_val struct-field access instead of monk_array_get. Done.
Typed array backing store. int[] backed by int64_t* instead of MonkValue*. New MONK_INT_ARRAY kind in the runtime. Done.
Unboxed for-loop variables. Loop variable over a typed array (for x in int_arr) uses a raw int64_t, no boxing per element. Done.
Copy-on-write for arrays. Share backing storage on assign, copy only on mutation. Makes let b = a O(1) instead of O(n). Not yet done.
Bounds-check elision. Typed array accesses inside provably-safe loops skip the runtime bounds check. A lightweight static analysis (gen_bounds.go) tracks constant values, array lengths, and loop-counter ranges — if it can prove 0 ≤ idx < arr.length, the check is dropped. Done.
These remaining items are documented in spec/ARCHITECTURE_DECISIONS.md with full design notes. They're not blocked -- they're just the next items in the queue.
Key takeaways
Before unboxing, every value paid the cost of a tagged union -- function calls, tag checks, struct packing -- even when the type was statically known.
Scalar unboxing uses the checker's Info to emit int64_t, double, or bool instead of MonkValue for statically-typed scalars.
Scalar-heavy benchmarks (fibonacci, mandelbrot, leibniz) hit C parity after unboxing. The generated C is structurally identical to hand-written C.
Array-heavy benchmarks (matmul) remain slow. Bounds-check elision is done; copy-on-write for arrays is the last open item on this front.