Lesson 4.4

Building the Code Generator

The emit pattern, temp variables, source mapping, and the bugs you will encounter.

build c 20 min read

The emitter: a struct with a string builder.

Think of the code generator as a writer at a typewriter. It has a blank page (a string builder) and types C code onto it line by line, indented properly. That's the emitter pattern.

The emitter struct holds a few things: the output buffer where C code accumulates, the current indentation level (incremented when entering a block, decremented when leaving), a counter for generating unique temp variable names, and a map of function names for hoisting.

// Pseudocode -- the emitter's shape
Emitter {
    output: StringBuilder
    indent: int
    tempCounter: int
    functions: Map<string, string>  // monk name -> C name
}

emit(text):
    output.write(indentSpaces + text + newline)

emitNode(node):
    switch node.type:
        case NumberExpr:  return "monk_int(" + node.value + ")"
        case StringExpr:  return "monk_string("" + node.value + "")"
        case BinaryExpr:  return emitBinary(node)
        case VarDecl:     emitVarDecl(node)
        case IfStmt:      emitIfStmt(node)
        // ... one case per AST node type

The key insight: some nodes are statements (they emit lines) and some are expressions (they return a C string to be embedded in a larger line). An IfStmt emits multiple lines. A NumberExpr returns "monk_int(42)" that the caller embeds where needed.

The two-pass structure in practice.

The top-level generate function orchestrates both passes:

// Pseudocode -- the two passes
generate(program):
    emit('#include "runtime.h"')
    emit("")

    // Pass 1: hoist functions
    for each statement in program.statements:
        if statement is VarDecl and value is FunctionExpr:
            name = "_fn_" + statement.name
            functions[statement.name] = name
            emitFunction(name, statement.value)
            emit("")

    // Pass 2: emit main
    emit("int main(int argc, char *argv[]) {")
    indent++
    emit("monk_init(argc, argv);")

    for each statement in program.statements:
        if statement is VarDecl and value is FunctionExpr:
            continue  // already hoisted
        emitNode(statement)

    // Cleanup: free all top-level variables
    for each variable declared in main:
        emit("monk_free(" + variable + ");")
    emit("return 0;")
    indent--
    emit("}")

Pass 1 scans for function declarations and skips everything else. Pass 2 skips function declarations and emits everything else. Same AST, two different filters.

Temp variables: when expressions need storage.

Some expressions can be emitted inline. monk_int(42) doesn't need storage -- you just drop it into the surrounding expression. But compound expressions need intermediate results.

show(add(1, 2))

This cannot be emitted as a single C expression because add(1, 2) returns a MonkValue that needs to be freed later. So the code generator breaks it into steps:

MonkValue _tmp_1 = _fn_add(monk_int(1), monk_int(2));
monk_show(_tmp_1);
monk_free(_tmp_1);

The temp variable _tmp_1 holds the result so it can be passed to monk_show and then freed. The name comes from incrementing the counter: _tmp_1, _tmp_2, _tmp_3, and so on. The counter is global, so names never collide.

A deeper nesting produces more temps:

show(add(mul(2, 3), 4))

Becomes:

MonkValue _tmp_1 = _fn_mul(monk_int(2), monk_int(3));
MonkValue _tmp_2 = _fn_add(_tmp_1, monk_int(4));
monk_show(_tmp_2);
monk_free(_tmp_1);
monk_free(_tmp_2);

The temp counter is shared across the entire program. If the first expression uses _tmp_1 through _tmp_5, the next expression starts at _tmp_6. This guarantees uniqueness without any scope analysis.

Source mapping: when C errors point to Monk.

When the compiled binary crashes, the error message should reference the original .monk file, not the generated C. The C preprocessor has a directive for exactly this: #line.

#line 7 "hello.monk"
MonkValue x = monk_int(42);
#line 8 "hello.monk"
monk_show(x);

The #line 7 "hello.monk" directive tells the C compiler: "from here on, pretend we're on line 7 of hello.monk." If the C code on the next line triggers a warning, assertion failure, or runtime error, the reported location will be hello.monk:7 instead of output.c:23.

This works because every AST node carries a Pos{Line, Column} from the parser. Before emitting the C code for each statement, the code generator emits a #line directive using that position. Cheap to generate, invaluable for debugging.

The memory safety pattern: a real bug story.

Early in Monk's development, the code generator had a use-after-free bug in compound assignments. The generated code for x += 1 looked like this:

// THE BUG: free before compute
monk_free(x);
x = monk_deep_copy(monk_add(x, monk_int(1)));
// x was freed on the line above, but monk_add still reads it

The problem: x is freed before monk_add(x, monk_int(1)) executes. The monk_add call reads freed memory. Sometimes it works (the memory hasn't been reused yet). Sometimes it corrupts data. Sometimes it segfaults. The worst kind of bug -- intermittent.

The fix established a pattern used everywhere in the code generator:

// THE FIX: compute, then free, then assign
MonkValue _new = monk_add(x, monk_int(1));
monk_free(x);
x = _new;

Compute the new value while the old value is still alive.

Free the old value.

Assign the new value to the variable.

This three-step pattern appears in every reassignment: plain assignment, compound operators (+=, -=), and loop variable updates. Get it wrong once, and you introduce silent memory corruption.

Testing: compile it, run it, check the output.

You might think code generator tests would check the generated C text. They don't. Checking text is brittle -- a renamed temp variable or a reordered #line directive breaks the test even though the program is correct.

Instead, Monk's code generator tests are integration tests. Each test:

Takes a .monk source string.

Parses it to an AST.

Generates C code.

Compiles the C with cc.

Runs the resulting binary.

Checks that stdout matches the expected output.

This catches bugs that text-based tests miss: memory corruption (the binary crashes), wrong operator precedence (the output is a different number), and missing frees (detectable with sanitizers).

# Run the codegen tests
go test ./codegen/ -v

Monk has 35 integration tests covering arithmetic, strings, control flow, functions, arrays, records, error handling, and nested expressions.

Common bugs and how to spot them.

Use-after-free

Freeing a value and then reading it. Symptom: intermittent crashes or garbage output. Fix: always compute-then-free-then-assign.

Missing free

Allocating a temp variable and never freeing it. Symptom: memory leak, no crash. Fix: every _tmp_N must have a matching monk_free(_tmp_N).

Double free

Freeing the same value twice. Symptom: crash, usually immediate. Fix: track ownership. Once freed, the variable should never be freed again.

Wrong temp variable scope

A temp variable allocated inside a loop but freed outside it (or vice versa). Symptom: variable not found at compile time, or freed too early. Fix: temps should be allocated and freed in the same scope.

Your task

Write a test that compiles this Monk program, runs the binary, and verifies the output is 7:

let x = 1 + 2 * 3
show(x)

This exercises the full pipeline: parsing (operator precedence ensures 2 * 3 evaluates first), codegen (temp variables for the intermediate results), compilation (the generated C must be valid), and execution (the binary must print 7).

Key takeaways

The emitter is a struct with a string builder. It writes C code line by line, tracking indentation and temp variable names.

Temp variables hold intermediate expression results. A global counter ensures unique names.

#line directives map C errors back to the original .monk file. Every AST node carries its source position.

Tests are integration tests: parse, generate, compile, run, check stdout. Not text comparison.