Lesson 0.3

Reading the Source

A developer's tour of the Go codebase. Where the pipeline lives in code, how data moves between packages, and where to start reading.

go 20 min read

Start here: main.go.

Everything begins in src/main.go. The runCompile function is the entire compiler pipeline in nine lines:

func runCompile(source, outPath string) error {
    prog, err := syntax.Parse(source)        // source text → AST
    if err != nil { return err }

    info, err := types.Check(prog)           // AST → typed Info
    if err != nil { return err }

    return codegen.GenerateWithTypes(prog, outPath, info)
    // ↑ typed AST + Info → C source, then cc compiles it
}

Three function calls, three packages. The compiler is that flat. Each package is a pure function — it takes the previous stage's output and returns the next stage's output. No global state.

monk check vs monk build vs monk run. check runs just the first two calls (parse + type-check, no codegen). build runs all three. run runs all three, executes the binary, then deletes it. The CLI flags in src/main.go all route to the same runCompile function.

The four packages.

syntax/ 11 files — lexer, parser, AST

Transforms source text into a typed AST tree. Input: a string. Output: *syntax.Program, a slice of syntax.Stmt nodes.

token.go — TokenKind constants + LookupIdent
scanner.go — Scan() → []Token (hand-written, no regex)
ast.go / ast_expr.go / ast_stmt.go — node structs
parser.go / parse_stmt.go / parse_expr.go / parse_type.go — recursive descent
types/ 6 files — static type checker

Walks the AST, infers and validates types. Input: *syntax.Program. Output: *types.Info, the type-annotation map codegen reads.

types.go — Kind enum, Type struct, AssignableTo
checker.go — Info, Check(), scope chain, builtin declarations
stmts.go / exprs.go — one check/infer function per node kind
returns.go — all-paths-return analysis
codegen/ 13 files — AST → C emitter

Emits C source. Input: *syntax.Program + *types.Info. Output: a .c file (or native binary after cc runs).

gen.go — generator struct, GenerateWithTypes entry, driver loop
gen_stmt.go / gen_expr.go — statement + expression emitters (boxed path)
gen_func.go — function hoisting, trampolines, closure capture
unbox.go — typed (unboxed) emission paths + storageKind
capture.go — free-variable analysis for closures
gen_helpers.go — mangleName, cString, builtinMap
gen_access.go — typed-array + typed-record field fast paths
gen_optimize.go — COW elision, string append, pure-builtin inlining
gen_escape.go — escape analysis: direct-call closures → stack frames
gen_bounds.go — bounds-check elision for provably-safe array access
gen_flow.go — break/continue/return with unboxed scalar cleanup
gen_module.go — multi-module codegen (GenerateModules)
runtime/ C library — compiled into every Monk binary

Not Go. A small C library linked into every binary. Defines MonkValue, arithmetic, string ops, array ops, and the setjmp-based error handling. Codegen's generated C includes runtime.h and calls these functions.

The three structs that cross package boundaries.

Each pipeline stage hands something to the next. Understanding these three types tells you how the whole compiler fits together.

1. *syntax.Program — the AST root

Just a slice of statements. Every statement is a syntax.Stmt interface; every expression is a syntax.Expr interface. The concrete types are in ast_stmt.go and ast_expr.go.

// ast.go
type Program struct {
    Stmts []Stmt
}

// ast_stmt.go — examples
type VarDeclStmt struct { Name string; Value Expr; Type *TypeExpr; IsConst bool; Pos Pos }
type IfStmt      struct { Condition Expr; Then *BlockStmt; Else Stmt; Pos Pos }
type ForStmt     struct { VarName string; Iterable Expr; Body *BlockStmt; Pos Pos }

// ast_expr.go — examples
type BinaryExpr  struct { Left, Right Expr; Op TokenKind; Pos Pos }
type CallExpr    struct { Callee Expr; Args []Expr; Pos Pos }
type FuncExpr    struct { Params []Param; ReturnType TypeExpr; Body *BlockStmt; Pos Pos }

This Monk program produces a Program with three statements — one per line:

for item in scores {   // → ForStmt { VarName:"item", Iterable:IdentExpr("scores"), Body:... }
    show(item)         //     body contains ExprStmt { CallExpr { Callee:IdentExpr("show"), Args:[IdentExpr("item")] } }
}
let total = add(1, 2)  // → VarDeclStmt { Name:"total", Init:CallExpr { Callee:IdentExpr("add"), Args:[NumberExpr(1),NumberExpr(2)] } }
if total > 10 {        // → IfStmt { Condition:BinaryExpr(>, IdentExpr("total"), NumberExpr(10)), Then:..., Else:nil }
    show("big")
}

2. *types.Info — the type annotation map

The checker annotates every AST node with its type, then hands this map to codegen. Codegen uses it to decide "boxed MonkValue or raw int64_t?"

// checker.go
type Info struct {
    // Every Expr node → its computed type
    Types map[syntax.Expr]*Type

    // Every VarDeclStmt → its resolved type (annotation || inferred)
    Decls map[*syntax.VarDeclStmt]*Type

    // Every FuncExpr → its full signature (params + return)
    // Used by codegen to decide whether to emit an unboxed function
    Funcs map[*syntax.FuncExpr]*Type
}

Concretely: compile this Monk program and the checker populates info.Types with an entry for every expression node.

// Monk source
let x int = 42          // VarDeclStmt — Decls[this] = Int
let y = x + 1           // VarDeclStmt — Decls[this] = Int (inferred)
                        // BinaryExpr(+) — Types[this] = Int
                        // IdentExpr(x)  — Types[this] = Int
                        // NumberExpr(1) — Types[this] = Int

const add = (a int, b int) int { return a + b }
// FuncExpr — Funcs[this] = (int,int)->int
// Codegen sees Funcs[this] is all-scalar → emits int64_t add(int64_t, int64_t)

Codegen looks up the same AST pointer it already has. The map[syntax.Expr]*Type key is a pointer comparison — the exact same node object the parser built, so there's no name resolution needed.

3. *types.Type — the type representation

A single struct covers all 9 kinds. Most fields are zero-valued for kinds that don't use them.

// types.go
type Kind int
const (
    KindAny Kind = iota   // unknown / dynamic
    KindInt               // int64_t
    KindFloat             // double
    KindStr               // char*
    KindBool              // bool
    KindNone              // void / null
    KindArray             // []T
    KindRecord            // {field: T, ...}
    KindFunc              // (T1, T2) -> T3
)

type Type struct {
    Kind       Kind
    Optional   bool          // T? — the orthogonal nullable flag
    Elem       *Type         // for KindArray
    Fields     []RecordTypeField  // for KindRecord
    RecordName string        // for named records ("Point", "User")
    Params     []*Type       // for KindFunc
    Return     *Type         // for KindFunc
    MinParams  int           // for KindFunc with defaults
}

Each Monk declaration resolves to one of these kinds:

let x int = 42 KindInt { Kind: KindInt }
let ratio float = 1.5 KindFloat { Kind: KindFloat }
let name string = "Monk" KindStr { Kind: KindStr }
let done bool = false KindBool { Kind: KindBool }
let maybe int? = none KindInt? { Kind: KindInt, Optional: true }
let scores int[] = [1, 2, 3] KindArray { Kind: KindArray, Elem: &Type{Kind:KindInt} }
type Point { x int, y int } KindRecord { Kind: KindRecord, RecordName:"Point", Fields:[...] }
const add = (a int, b int) int { ... } KindFunc { Kind: KindFunc, Params:[Int,Int], Return:&Int }

Why pointer keys in Info.Types? The AST uses pointer identity — each node is allocated exactly once during parsing. Using the pointer as a map key is an O(1) lookup with zero ambiguity: there's no chance of two different BinaryExpr nodes at the same line/column being confused. It also means codegen doesn't need to re-walk the AST to find nodes — it already holds the pointers.

The generator struct: codegen's working state.

Codegen's entire state is one struct. Understanding its fields explains most of what gen.go and gen_stmt.go are doing.

// gen.go
type generator struct {
    // Output buffers: functions are hoisted above main()
    funcs  strings.Builder  // static C function definitions
    body   strings.Builder  // main() body

    // Type info from checker
    info   *types.Info

    // Storage maps: what C type does each variable use?
    // storeBoxed → MonkValue, storeInt → int64_t, etc.
    storage map[string]storageKind   // current scope's variable kinds

    // Function tracking
    funcCount  int                        // counter for unique names
    funcNames  map[string]string          // monk name → C name
    fnStorage  map[string]funcStorage     // C name → param/return kinds
    funcHasCapture map[string]bool        // does this func close over anything?
    funcDefaults   map[string][]syntax.Expr  // default param expressions

    // Current function context
    retStorage      storageKind  // return type of the function being emitted
    currentCaptures []string     // captured variable names (for save-back on return)

    // Bounds-check elision (nil when no type info)
    constVals map[string]int64    // compile-time constant variables (e.g. let N = 400)
    arrayLens map[string]int64    // statically known lengths of typed array variables
    varBounds map[string][2]int64 // inclusive [lo, hi] range for while-loop counters

    // Temp variable counter
    tempCount int
}

The key insight: storage is the map from variable name → C type. When codegen decides to emit int64_t x = ... instead of MonkValue x = ..., it sets storage["mk_x"] = storeInt. Later, when that variable is read by emitExpr, it boxes it back up.

After compiling this program, the generator's storage map looks like this:

let count int = 0          // storage["mk_count"] = storeInt   → emits int64_t
let name = "Monk"          // storage["mk_name"]  = storeBoxed  → emits MonkValue
let pi float = 3.14        // storage["mk_pi"]    = storeFloat  → emits double

const add = (a int, b int) int { return a + b }
// fnStorage["_monk_func_1"] = funcStorage{params:[storeInt,storeInt], ret:storeInt}
// → emits int64_t _monk_func_1(int64_t mk_a, int64_t mk_b) { ... }

Why saveStorage / restoreStorage?

An if branch may introduce a variable with a certain storage kind. That variable doesn't exist in the other branch. If we don't snapshot and restore, a variable promoted to storeInt in the then-branch would be treated as int64_t in the else-branch where it was never declared. saveStorage clones the map; restoreStorage reverts it. Called around every if-branch, while-body, and for-body.

Two emission paths: boxed vs unboxed.

This is the most non-obvious part of codegen. There are two ways to emit any expression:

Boxed path — emitExpr(e) in gen_expr.go

Always returns a C expression of type MonkValue. Every arithmetic operation goes through a runtime call like monk_add(a, b). Correct for everything; required when the caller needs a MonkValue.

// gen_expr.go — boxed addition
case syntax.Plus:
    tmp := g.newTemp()
    return fmt.Sprintf(
        "({MonkValue %s=%s; %s.kind==MONK_STRING ? "+
        "monk_string_concat(%s,%s) : monk_add(%s,%s);})",
        tmp, left, tmp, tmp, right, tmp, right)

Unboxed path — emitExprTyped(e) in unbox.go

Returns (string, storageKind) — the C expression AND what type it produces. When both sides are storeInt, addition becomes raw +. No function call, no tagged union.

// unbox.go — unboxed addition (int + int)
case syntax.Plus:
    if lk == storeInt && rk == storeInt {
        return fmt.Sprintf("(%s + %s)", lc, rc), storeInt
    }

The two paths coexist. When an unboxed value ends up in a boxed context (e.g. passed to a builtin that expects MonkValue), boxExpr wraps it. When a boxed value feeds into an unboxed path, unboxExpr extracts the raw field. The function coerce(code, from, to) handles both directions.

The same expression emits differently depending on what the checker knows:

Monk source
let x int = 1 + 2
Unboxed path — checker knows both sides are int
int64_t mk_x = (1 + 2);
Monk source
let result = getData() + 1
Boxed path — getData() returns MonkValue, type unknown at compile time
MonkValue _tmp_1 = _monk_func_getData(_self_getData, NULL, 0); MonkValue mk_result = monk_add(_tmp_1, monk_int(1)); monk_free(_tmp_1);

When does unboxing activate? storageFor(t *types.Type) in unbox.go maps a types.Type to a storageKind. If the type is KindIntstoreInt. If it's KindFloatstoreFloat. For a function to be fully unboxed, ALL params AND the return must be raw scalars — checked by deriveFuncStorage in gen_func.go. One boxed param forces the whole function to stay boxed.

Function hoisting and the two-buffer trick.

C requires a function to be declared before it's called. Monk allows forward references and mutual recursion. The solution: two output buffers.

// gen.go — the generator has two output buffers
type generator struct {
    funcs  strings.Builder  // hoisted C functions go here
    body   strings.Builder  // main() body goes here
    ...
}

// gen.go — final assembly
func (g *generator) emit() string {
    return preamble + g.funcs.String() +
           "int main(void) {
" +
           g.body.String() +
           "    return 0;
}
"
}

This Monk program:

const double = (n int) int { return n * 2 }
show(double(21))

Produces two output buffers that get concatenated:

g.funcs — written first, appears above main()
static int64_t _monk_func_1(int64_t mk_n) { return (mk_n * 2); }
g.body — written second, goes inside main()
MonkValue mk_double = monk_make_function(_monk_func_1_thunk, NULL, 0); MonkValue _tmp_1 = _monk_func_1(21); monk_show(_tmp_1); monk_free(_tmp_1); monk_free(mk_double);

When codegen encounters let add = (a int, b int) int { ... }, it immediately writes the C function definition into g.funcs (above main), then writes MonkValue mk_add = monk_make_function(...); into g.body. When later code calls add(3, 4), the C function is already declared above.

Recursive self-reference works because emitVarDecl pre-registers the C name in g.funcNames before hoisting the body. So when the body refers to fib, emitCall finds it in funcNames and emits a direct C call.

The trampoline

Every hoisted function also gets a trampoline — a thin wrapper that matches the MonkFuncPtr signature MonkValue(*)(MonkFunction*, MonkValue*, int64_t). This is what monk_call uses for indirect calls (passing functions as values). Direct calls bypass the trampoline entirely.

Checker non-obvious details.

Why three sub-passes in checkProgram?

Type declarations (type Point = ...) must be resolved before function signatures, because a function may take a Point param. Function signatures must be hoisted before any statement is checked, because fib may call itself on line 2. So the order is: (1) types, (2) function signatures, (3) all statements.

// This works — fib calls itself before it's "defined"
const fib = (n int) int {
    if n <= 1 { return n }
    return fib(n - 1) + fib(n - 2)  // pass 2 hoisted fib's signature; pass 3 checks this call
}

// This works — distance takes a Point, defined after the function
const distance = (p Point) float { return sqrt(p.x * p.x + p.y * p.y) }
type Point { x float, y float }  // pass 1 resolves this; pass 2 can use it in distance's signature

The sticky typeErr in the parser

parseTypeExpr cannot return an error — it's called from inside expression parsers that don't propagate errors. Instead, it sets p.typeErr on the Parser struct. After every parseStmt call, parseProgram checks p.typeErr and surfaces it as a first-class error.

// This triggers p.typeErr — "int?" is a valid type, "int??" is not
let x int?? = none   // parseTypeExpr sees "??" and sets p.typeErr
                     // parseProgram surfaces it: "invalid type annotation: int??"

For-loop scope splitting

The loop variable is declared const in a scope that is the parent of the body scope. If both were in the same scope, a let i = ... inside the body would shadow the const binding and let the user "reassign" the loop variable. Separate scopes prevent this. See checkFor in types/stmts.go.

for i in scores {
    let i = 99      // OK — "let i" in the body scope shadows the loop var (separate scope)
    show(i)         // prints 99
}

for i in scores {
    i = 99          // ERROR — "i" in the loop scope is const; assignment rejected
}

Suggested reading order for contributors.

1

syntax/token.go — 5 min. Every keyword and operator in one file. Understand the vocabulary before the grammar.

2

syntax/ast_stmt.go + ast_expr.go — 10 min. The shape of every node. Read the struct definitions; skip the marker methods.

3

types/types.go — 10 min. The Kind enum and AssignableTo. The 7-rule compatibility table determines what the checker allows everywhere.

4

src/main.go runCompile — 5 min. The pipeline as code. See how the three packages are wired together.

5

codegen/gen.go — 20 min. The generator struct and GenerateWithTypes. Understand the two-buffer layout before reading the emitters.

6

codegen/gen_stmt.go → gen_expr.go → unbox.go — the two emission paths side by side. Read emitVarDecl first — it shows all three paths (typed array, scalar probe, plain boxed) in one function.

7

runtime/runtime.h — 10 min. See what MonkValue actually looks like in C. Then codegen's emit strings become concrete.