Lesson 1.3
Building the Lexer
A step-by-step walkthrough of how to implement Monk's lexer, from single characters to multi-token lookahead. TDD all the way.
The plan.
We're building the lexer in stages, each one handling more of the language. At every step: write a test first, then make it pass.
The scanner skeleton.
The lexer is a struct that holds the source string, a cursor position, and the current line/column. It has one public method: NextToken() — call it repeatedly and it returns the next token until EOF.
The core loop looks at the character under the cursor, decides what kind of token is starting, reads until it's complete, then returns it. In pseudocode:
func NextToken() Token {
    skipWhitespace()
    ch := currentChar()
    switch {
    case ch == 0:      return token(EOF)
    case ch == '(':    return token(LPAREN)
    case ch == '+':    return token(PLUS)
    case isLetter(ch): return readIdentifier()
    case isDigit(ch):  return readNumber()
    case ch == '"':    return readString()
    default:           return error("unexpected character")
    }
}
Calling NextToken() repeatedly on this Monk source:
let x = 42 + 1
produces this token stream:
LET      let    isLetter() → readIdentifier() → lookupKeyword() finds "let"
IDENT    x      isLetter() → readIdentifier(), "x" not in keyword table
ASSIGN   =      ch == '=', next char is not '=' → ASSIGN (not EQUAL_EQUAL)
NUMBER   42     isDigit() → readNumber()
PLUS     +      ch == '+'
NUMBER   1      isDigit() → readNumber()
EOF             ch == 0

Test-first. Before writing any of this, write a test: feed in "(", expect one LPAREN token followed by EOF. Make it pass. Then add ")". Then "()". Build up one test at a time.
Lookahead: one-or-two-character tokens.
When the lexer sees =, it can't immediately emit ASSIGN. It needs to peek at the next character:
=   Next is '='? → EQUAL_EQUAL.  Otherwise → ASSIGN.
!   Next is '='? → BANG_EQUAL.   Otherwise → BANG.
<   Next is '='? → LESS_EQUAL.   Next is '<'? → SHIFT_LEFT.   Otherwise → LESS.
-   Next is '>'? → ARROW.        Next is '='? → MINUS_ASSIGN. Otherwise → MINUS.
This is a helper you'll use a lot: "if the next character matches, advance and return the longer token; otherwise return the shorter one." In Go, it's a simple if peek() == '=' check.
Your task
Write a test for "<= >= == != ->". The lexer should produce LESS_EQUAL, GREATER_EQUAL, EQUAL_EQUAL, BANG_EQUAL, ARROW, EOF. Make it pass.
Whitespace and comments.
Whitespace (spaces, tabs, carriage returns) is skipped before every token. Newlines are skipped too, but they also increment the line counter and reset the column.
Comments start with // and run to the end of the line. When the lexer sees /, it peeks: next is /? Consume everything until newline. Next is =? Emit SLASH_ASSIGN. Otherwise, emit SLASH.
Monk has no multi-line comments. No /* */. This is deliberate — it keeps the lexer simple and avoids the nesting problem.
The nesting problem: should /* a /* b */ c */ be one comment or two? Different languages answer differently. Monk sidesteps the question entirely.
Identifiers and keywords.
An identifier starts with a letter or underscore, then continues with letters, digits, or underscores. The scanner reads the whole word, then looks it up in the keyword table.
func readIdentifier() Token {
    start := position
    // isLetter() treats '_' as a letter, so identifiers
    // can start with or contain underscores
    for isLetter(currentChar()) || isDigit(currentChar()) {
        advance()
    }
    word := source[start:position]
    kind := lookupKeyword(word) // returns IDENT if not a keyword
    return token(kind, word)
}
The same readIdentifier() call handles all three cases — it reads the characters, then the keyword table decides the kind:
let           LET      In keyword table → keyword token
x             IDENT    Not in table → identifier
fibonacci42   IDENT    Letters + digits, not in table → identifier
return        RETURN   In keyword table → keyword token
returnValue   IDENT    Starts with "return" but not equal to it → identifier

The keyword lookup is a map from string to token type. 27 entries. No special logic — just a table lookup.
Your task
Test: "let x = if true return" should produce LET, IDENT("x"), ASSIGN, IF, TRUE, RETURN, EOF. Note that x is an identifier, but let, if, true, and return are keywords.
Numbers.
Number scanning has the most edge cases. The lexer needs to handle:
42          Plain integer
3.14        Float — has a decimal point
1.23e5      Scientific notation
0xFF        Hex — starts with 0x
0b1010      Binary — starts with 0b
0o77        Octal — starts with 0o
1_000_000   Underscores — visual separator, ignored
The strategy: start reading digits. If you hit 0x/0b/0o, switch to the appropriate digit set. If you hit a . followed by a digit, it's a float. If you hit e or E, it's scientific notation. Underscores are consumed but don't affect the value.
Trap: the dot ambiguity. Is 42.method() the number 42 followed by DOT and a method call? Or the start of float 42.m...? The lexer peeks after the dot: if the next character is a digit, it's a float. Otherwise, the dot belongs to the next token.
Strings.
Two kinds: double-quoted strings and backtick template literals.
Double-quoted strings start and end with ". The lexer reads character by character, looking for the closing quote. If it hits a backslash, the next character is an escape: \n becomes newline, \" becomes a literal quote, \\ becomes a literal backslash. An unrecognized escape is an error.
Template literals start and end with backtick. Simpler — no escape sequences, multiline by default. Read until the closing backtick.
The biggest error to catch: an unterminated string (the file ends before the closing quote). Report the error with the position where the string started, not where it ran out.
Your task
Test: "hello \"world\"\n" should produce one STRING token whose value is hello "world"\n (with the actual newline character, not the two characters \ and n).
Error reporting.
A good lexer doesn't crash on bad input. It reports the error and keeps scanning — so the user sees all the problems in one pass, not one at a time.
Errors to catch:
@, #, $   not part of Monk's syntax
"hello    no closing quote before EOF or newline
0xGG      invalid hex digits
"\q"      unrecognized escape sequence

The strategy: emit an ILLEGAL token with an error message, then continue scanning from the next character. The parser will see the ILLEGAL token and know something went wrong.
Error recovery is kindness. A lexer that stops at the first error wastes the user's time. Show everything that's wrong in one shot. They'll thank you.
The full picture.
When you're done, the lexer handles all of Monk's syntax. Feed in any valid Monk program, get a correct token stream. Feed in garbage, get error tokens with useful positions.
Here's a real test to aim for:
const greet = (name string) string {
    return "Hello, " + name
}
let msg = greet("world")
// msg is "Hello, world"

This should produce 26 tokens (including EOF). If your lexer handles this correctly, Phase 1 is done.
Final exercise
Write the test for the program above. List every token you expect. Then verify your lexer produces exactly that sequence. This is the Phase 1 acceptance test.
Key takeaways
Build the lexer in layers: single chars, then two-char operators, then whitespace/comments, then identifiers, then numbers, then strings.
The core is a switch on the current character with peek-ahead for multi-character tokens.
Keywords are identifiers matched against a lookup table. Not special-cased in the scanner.
Report errors with ILLEGAL tokens and keep scanning. Show all problems in one pass.