Lesson 6.4

Scalar Unboxing

Using type information to eliminate tagged-union overhead. How Monk reached C-speed on scalar benchmarks.

go c 15 min read

Every integer was paying a tax it didn't need to.

Before Phase 6, every Monk value -- integers, floats, booleans, strings, arrays -- was represented as a MonkValue tagged union in the generated C. The runtime needed this to dispatch operations dynamically: when you write a + b, the generated C calls monk_add(a, b), which checks the tag, extracts the values, adds them, and wraps the result back.

That's a lot of work for something a C compiler knows at compile time. An integer addition in C is one instruction. In Monk, it was a function call with a tag check, two struct field extractions, an addition, and a struct construction.

The first benchmark results showed the cost clearly:

Benchmark	C	Monk (before)	vs C
fibonacci (n=35)	16 ms	28 ms	1.7×
mandelbrot (800²×50)	14 ms	17 ms	1.2×
matmul (400²)	10 ms	142 ms	14×

Fibonacci and mandelbrot were already fast after an earlier -O3 -flto optimization pass. Matmul was still 14× C -- each array access was paying for tag checks and deep copies.

The type checker changes everything. It knows that n is an int. It knows that x + y is int + int = int. All of that checking and dispatching is pointless. The tag is known statically. We don't need the union at all.

Unboxing: emit raw C types instead of tagged unions.

Scalar unboxing is the idea of using int64_t, double, or bool in the generated C instead of MonkValue, whenever the checker has determined the type statically. The tag is no longer needed at runtime -- so we don't emit one.

Here's the same Monk function, generated with and without unboxing:

Without unboxing (Phase 1–5):

static MonkValue _monk_fib(MonkValue mk_n) {
    if (monk_is_truthy(monk_lte(mk_n, monk_int(1)))) {
        return mk_n;
    }
    MonkValue _t0 = monk_sub(mk_n, monk_int(1));
    MonkValue _t1 = _monk_fib(_t0);
    MonkValue _t2 = monk_sub(mk_n, monk_int(2));
    MonkValue _t3 = _monk_fib(_t2);
    return monk_add(_t1, _t3);
}

With unboxing (Phase 6):

static int64_t _monk_fib(int64_t mk_n) {
    if (mk_n <= 1) {
        return mk_n;
    }
    int64_t _t0 = mk_n - 1;
    int64_t _t1 = _monk_fib(_t0);
    int64_t _t2 = mk_n - 2;
    int64_t _t3 = _monk_fib(_t2);
    return _t1 + _t3;
}

This is identical to what a C programmer would write by hand. Every monk_* call is gone. The raw C operators are back. The C compiler can now optimize this the same way it would optimize native C code -- and it does.

The storageKind abstraction.

The unboxing logic lives in src/codegen/unbox.go. Its central concept is storageKind -- a small enum that classifies how a variable should be stored in the generated C:

type storageKind int

const (
    storeBoxed  storageKind = iota  // MonkValue (default, the old way)
    storeInt                         // int64_t
    storeFloat                       // double
    storeBool                        // bool (C _Bool)
)

For each variable declaration, the code generator asks: "what storage kind does the checker assign to this?" If the type is KindInt, it uses storeInt and emits int64_t. If it's unknown or compound (array, record), it falls back to storeBoxed and emits MonkValue.

The same logic applies to expressions. When emitting a binary operation, unbox.go checks both operands' storage kinds. If both are storeInt, it emits a + b. If either is boxed, it falls back to monk_add(a, b).

What gets unboxed.

Scalar unboxing applies to all statically-typed scalar values. Specifically:

Variables.

let n int = 5 emits int64_t mk_n = 5; instead of MonkValue mk_n = monk_int(5);

Arithmetic.

Binary operations between two scalar operands use raw C operators. a + b * c becomes (a + (b * c)), not nested monk_add calls.

Function signatures.

Functions whose parameters and return type are all scalar get unboxed C signatures: static int64_t fib(int64_t n) instead of static MonkValue fib(MonkValue n).

Conditions.

if and while conditions that evaluate to a typed boolean skip the monk_is_truthy() call and emit the C expression directly.

Casts.

to_int(float_var) and to_float(int_var) inline as direct C casts when the argument is already a known scalar, skipping the runtime call.

The results: scalar benchmarks hit C parity.

Benchmark	C	Before unboxing	After unboxing
fibonacci (n=35)	16 ms	28 ms (1.7×)	17 ms (1.0×)
mandelbrot (800²×50)	14 ms	17 ms (1.2×)	15 ms (1.0×)
leibniz π (50M iter)	28 ms	—	29 ms (1.0×)
trial_primes (<200k)	6 ms	—	6 ms (1.0×)
matmul (400²)	10 ms	142 ms (14×)	122 ms (12×)

Fibonacci, mandelbrot, leibniz, trial_primes -- all scalar compute -- hit C parity. The generated code is structurally identical to hand-written C and GCC optimizes it the same way.

Matmul barely moved. The hot path is array indexing -- each matrix[i][j] access still goes through the boxed MonkValue array representation. Scalar unboxing only helps scalars. Arrays need their own unboxing -- a separate, harder problem involving the C runtime's memory layout.

Arrays stay boxed for now. Typed array unboxing (backed int[] with int64_t* instead of MonkValue*) is the next performance frontier. It requires a new C struct (MonkIntArray), new codegen paths, and runtime changes. See spec/ARCHITECTURE_DECISIONS.md §4A.

What's still in progress: typed arrays.

Scalar unboxing was the first optimization enabled by the type system. Array unboxing is the second. The current state:

✓

Inline array element access. int[] element reads emit .int_val struct-field access instead of monk_array_get. Done.

✓

Typed array backing store. int[] backed by int64_t* instead of MonkValue*. New MONK_INT_ARRAY kind in the runtime. Done.

✓

Unboxed for-loop variables. Loop variable over a typed array (for x in int_arr) uses a raw int64_t, no boxing per element. Done.

○

Copy-on-write for arrays. Share backing storage on assign, copy only on mutation. Makes let b = a O(1) instead of O(n). Not yet done.

✓

Bounds-check elision. Typed array accesses inside provably-safe loops skip the runtime bounds check. A lightweight static analysis (gen_bounds.go) tracks constant values, array lengths, and loop-counter ranges — if it can prove 0 ≤ idx < arr.length, the check is dropped. Done.

These remaining items are documented in spec/ARCHITECTURE_DECISIONS.md with full design notes. They're not blocked -- they're just the next items in the queue.

Key takeaways

Before unboxing, every value paid the cost of a tagged union -- function calls, tag checks, struct packing -- even when the type was statically known.

Scalar unboxing uses the checker's Info to emit int64_t, double, or bool instead of MonkValue for statically-typed scalars.

Scalar-heavy benchmarks (fibonacci, mandelbrot, leibniz) hit C parity after unboxing. The generated C is structurally identical to hand-written C.

Array-heavy benchmarks (matmul) remain slow. Bounds-check elision is done; copy-on-write for arrays is the last open item on this front.