Lesson 4.4
Building the Code Generator
The emit pattern, temp variables, source mapping, and the bugs you will encounter.
The emitter: a struct with a string builder.
Think of the code generator as a writer at a typewriter. It has a blank page (a string builder) and types C code onto it line by line, indented properly. That's the emitter pattern.
The emitter struct holds a few things: the output buffer where C code accumulates, the current indentation level (incremented when entering a block, decremented when leaving), a counter for generating unique temp variable names, and a map of function names for hoisting.
// Pseudocode -- the emitter's shape
Emitter {
output: StringBuilder
indent: int
tempCounter: int
functions: Map<string, string> // monk name -> C name
}
emit(text):
output.write(indentSpaces + text + newline)
emitNode(node):
switch node.type:
case NumberExpr: return "monk_int(" + node.value + ")"
case StringExpr: return "monk_string("" + node.value + "")"
case BinaryExpr: return emitBinary(node)
case VarDecl: emitVarDecl(node)
case IfStmt: emitIfStmt(node)
// ... one case per AST node type
The key insight: some nodes are statements (they emit lines) and some are expressions (they return a C string to be embedded in a larger line). An IfStmt emits multiple lines. A NumberExpr returns "monk_int(42)" that the caller embeds where needed.
The two-pass structure in practice.
The top-level generate function orchestrates both passes:
// Pseudocode -- the two passes
generate(program):
emit('#include "runtime.h"')
emit("")
// Pass 1: hoist functions
for each statement in program.statements:
if statement is VarDecl and value is FunctionExpr:
name = "_fn_" + statement.name
functions[statement.name] = name
emitFunction(name, statement.value)
emit("")
// Pass 2: emit main
emit("int main(int argc, char *argv[]) {")
indent++
emit("monk_init(argc, argv);")
for each statement in program.statements:
if statement is VarDecl and value is FunctionExpr:
continue // already hoisted
emitNode(statement)
// Cleanup: free all top-level variables
for each variable declared in main:
emit("monk_free(" + variable + ");")
emit("return 0;")
indent--
emit("}") Pass 1 scans for function declarations and skips everything else. Pass 2 skips function declarations and emits everything else. Same AST, two different filters.
Temp variables: when expressions need storage.
Some expressions can be emitted inline. monk_int(42) doesn't need storage -- you just drop it into the surrounding expression. But compound expressions need intermediate results.
show(add(1, 2))
This cannot be emitted as a single C expression because add(1, 2) returns a MonkValue that needs to be freed later. So the code generator breaks it into steps:
MonkValue _tmp_1 = _fn_add(monk_int(1), monk_int(2));
monk_show(_tmp_1);
monk_free(_tmp_1);
The temp variable _tmp_1 holds the result so it can be passed to monk_show and then freed. The name comes from incrementing the counter: _tmp_1, _tmp_2, _tmp_3, and so on. The counter is global, so names never collide.
A deeper nesting produces more temps:
show(add(mul(2, 3), 4)) Becomes:
MonkValue _tmp_1 = _fn_mul(monk_int(2), monk_int(3));
MonkValue _tmp_2 = _fn_add(_tmp_1, monk_int(4));
monk_show(_tmp_2);
monk_free(_tmp_1);
monk_free(_tmp_2); The temp counter is shared across the entire program. If the first expression uses _tmp_1 through _tmp_5, the next expression starts at _tmp_6. This guarantees uniqueness without any scope analysis.
Source mapping: when C errors point to Monk.
When the compiled binary crashes, the error message should reference the original .monk file, not the generated C. The C preprocessor has a directive for exactly this: #line.
#line 7 "hello.monk"
MonkValue x = monk_int(42);
#line 8 "hello.monk"
monk_show(x);
The #line 7 "hello.monk" directive tells the C compiler: "from here on, pretend we're on line 7 of hello.monk." If the C code on the next line triggers a warning, assertion failure, or runtime error, the reported location will be hello.monk:7 instead of output.c:23.
This works because every AST node carries a Pos{Line, Column} from the parser. Before emitting the C code for each statement, the code generator emits a #line directive using that position. Cheap to generate, invaluable for debugging.
The memory safety pattern: a real bug story.
Early in Monk's development, the code generator had a use-after-free bug in compound assignments. The generated code for x += 1 looked like this:
// THE BUG: free before compute
monk_free(x);
x = monk_deep_copy(monk_add(x, monk_int(1)));
// x was freed on the line above, but monk_add still reads it
The problem: x is freed before monk_add(x, monk_int(1)) executes. The monk_add call reads freed memory. Sometimes it works (the memory hasn't been reused yet). Sometimes it corrupts data. Sometimes it segfaults. The worst kind of bug -- intermittent.
The fix established a pattern used everywhere in the code generator:
// THE FIX: compute, then free, then assign
MonkValue _new = monk_add(x, monk_int(1));
monk_free(x);
x = _new; Compute the new value while the old value is still alive.
Free the old value.
Assign the new value to the variable.
This three-step pattern appears in every reassignment: plain assignment, compound operators (+=, -=), and loop variable updates. Get it wrong once, and you introduce silent memory corruption.
Testing: compile it, run it, check the output.
You might think code generator tests would check the generated C text. They don't. Checking text is brittle -- a renamed temp variable or a reordered #line directive breaks the test even though the program is correct.
Instead, Monk's code generator tests are integration tests. Each test:
Takes a .monk source string.
Parses it to an AST.
Generates C code.
Compiles the C with cc.
Runs the resulting binary.
Checks that stdout matches the expected output.
This catches bugs that text-based tests miss: memory corruption (the binary crashes), wrong operator precedence (the output is a different number), and missing frees (detectable with sanitizers).
# Run the codegen tests
go test ./codegen/ -v Monk has 35 integration tests covering arithmetic, strings, control flow, functions, arrays, records, error handling, and nested expressions.
Common bugs and how to spot them.
Use-after-free
Freeing a value and then reading it. Symptom: intermittent crashes or garbage output. Fix: always compute-then-free-then-assign.
Missing free
Allocating a temp variable and never freeing it. Symptom: memory leak, no crash. Fix: every _tmp_N must have a matching monk_free(_tmp_N).
Double free
Freeing the same value twice. Symptom: crash, usually immediate. Fix: track ownership. Once freed, the variable should never be freed again.
Wrong temp variable scope
A temp variable allocated inside a loop but freed outside it (or vice versa). Symptom: variable not found at compile time, or freed too early. Fix: temps should be allocated and freed in the same scope.
Your task
Write a test that compiles this Monk program, runs the binary, and verifies the output is 7:
let x = 1 + 2 * 3
show(x)
This exercises the full pipeline: parsing (operator precedence ensures 2 * 3 evaluates first), codegen (temp variables for the intermediate results), compilation (the generated C must be valid), and execution (the binary must print 7).
Key takeaways
The emitter is a struct with a string builder. It writes C code line by line, tracking indentation and temp variable names.
Temp variables hold intermediate expression results. A global counter ensures unique names.
#line directives map C errors back to the original .monk file. Every AST node carries its source position.
Tests are integration tests: parse, generate, compile, run, check stdout. Not text comparison.