Lesson 1.2

Designing Monk's Token Set

Every keyword, operator, and delimiter in the language, mapped to a token type. This is the contract between the lexer and the parser.

monk 15 min read

Why define tokens before writing the lexer.

The token set is a contract. The lexer promises to produce these types. The parser promises to consume them. If you get this list right, the lexer and parser can be built independently — even by different people.

Monk has ~60 token types. That sounds like a lot, but most are single characters (+, (, ,) or reserved words (let, if, return). The interesting ones are the literals — numbers, strings, identifiers.

Keywords.

Reserved words that cannot be used as variable names. The lexer reads an identifier, then checks: is it in the keyword table? If yes, emit the keyword token. If no, emit an identifier.

Declarations & control flow

let const if else for in while break continue return

Error handling

guard against throw

Modules & types

type use export from as

Logical operators (keyword form)

and or not is

Literals (also keywords)

true false none

Reserved for future

ref async await

Keywords vs identifiers. The lexer doesn't hardcode each keyword in the scanner. It reads a word, then does a lookup in a map. This keeps the scanner simple — the only decision is "letter? keep reading." The keyword check happens once, at the end.
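
The read-then-look-up approach can be sketched in a few lines of Python. This is a minimal illustration, not Monk's actual lexer: the names KEYWORDS and scan_word, the IDENT type name, the uppercase naming of keyword token types, and the rule that a word is made of letters, digits, and underscores are all assumptions made for the example.

```python
# Hypothetical keyword table: maps each reserved word to a token type name.
# Naming the type after the uppercased word is an assumption for this sketch.
KEYWORDS = {
    word: word.upper()
    for word in (
        "let const if else for in while break continue return "
        "guard against throw "
        "type use export from as "
        "and or not is "
        "true false none "
        "ref async await"
    ).split()
}


def scan_word(source: str, start: int) -> tuple:
    """Read one word beginning at `start`; return (type, text, next_index)."""
    end = start
    # The scanner's only decision: is this character still part of the word?
    while end < len(source) and (source[end].isalnum() or source[end] == "_"):
        end += 1
    text = source[start:end]
    # A single lookup at the end decides keyword vs identifier.
    return KEYWORDS.get(text, "IDENT"), text, end


print(scan_word("let x", 0))       # ('LET', 'let', 3)
print(scan_word("letter = 1", 0))  # ('IDENT', 'letter', 6)
```

Note that "letter" stays an identifier even though it starts with "let": the lookup only happens after the whole word has been read.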

Operators.

The tricky part: multi-character operators. When the lexer sees =, it needs to peek at the next character. Is it ==? Or just =?

Arithmetic

+ PLUS

- MINUS

* STAR

/ SLASH

% PERCENT

Comparison

== EQUAL_EQUAL

!= BANG_EQUAL

< LESS

> GREATER

<= LESS_EQUAL

>= GREATER_EQUAL

Assignment

= ASSIGN

+= PLUS_ASSIGN

-= MINUS_ASSIGN

*= STAR_ASSIGN

/= SLASH_ASSIGN

%= PERCENT_ASSIGN

Logical & bitwise

&& AND_AND

|| OR_OR

! BANG

& AMP

| PIPE

^ CARET

~ TILDE

<< SHIFT_LEFT

>> SHIFT_RIGHT

Special

-> ARROW function type signature

? QUESTION optional type suffix

Maximal munch in practice. When the lexer sees -, it peeks ahead. Next char is >? Emit ARROW. Next char is =? Emit MINUS_ASSIGN. Otherwise, just MINUS. Two-character operators always beat one-character operators.
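
One way to express this peek-ahead is to try the two-character table first and fall back to the one-character table. The sketch below is an assumption about how that could look in Python. The token names come from the operator lists above, while scan_operator, the two table names, and the ILLEGAL fallback type are invented for the example.

```python
# Two-character operators are tried first, so they win over their prefixes.
TWO_CHAR = {
    "==": "EQUAL_EQUAL", "!=": "BANG_EQUAL",
    "<=": "LESS_EQUAL", ">=": "GREATER_EQUAL",
    "+=": "PLUS_ASSIGN", "-=": "MINUS_ASSIGN", "*=": "STAR_ASSIGN",
    "/=": "SLASH_ASSIGN", "%=": "PERCENT_ASSIGN",
    "&&": "AND_AND", "||": "OR_OR",
    "<<": "SHIFT_LEFT", ">>": "SHIFT_RIGHT", "->": "ARROW",
}
ONE_CHAR = {
    "+": "PLUS", "-": "MINUS", "*": "STAR", "/": "SLASH", "%": "PERCENT",
    "<": "LESS", ">": "GREATER", "=": "ASSIGN", "!": "BANG",
    "&": "AMP", "|": "PIPE", "^": "CARET", "~": "TILDE", "?": "QUESTION",
}


def scan_operator(source: str, pos: int) -> tuple:
    """Scan one operator at `pos` (which must be a valid index).

    Returns (type, text, next_index).
    """
    pair = source[pos:pos + 2]  # peek: current character plus the next one
    if pair in TWO_CHAR:
        return TWO_CHAR[pair], pair, pos + 2
    ch = source[pos]
    if ch in ONE_CHAR:
        return ONE_CHAR[ch], ch, pos + 1
    # ILLEGAL is a hypothetical catch-all type, not part of the lesson's list.
    return "ILLEGAL", ch, pos + 1


print(scan_operator("->", 0))   # ('ARROW', '->', 2)
print(scan_operator("-= 1", 0)) # ('MINUS_ASSIGN', '-=', 2)
print(scan_operator("- 1", 0))  # ('MINUS', '-', 1)
```

Slicing past the end of a Python string is safe, so a lone "-" as the last character of the source still falls through to MINUS.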

Delimiters.

Simple. One character, one token. No ambiguity.

( ) LPAREN / RPAREN params, calls, grouping

{ } LBRACE / RBRACE blocks, record literals

[ ] LBRACKET / RBRACKET arrays, index access

, COMMA separators (trailing allowed)

. DOT record field access

: COLON record field definitions

Literals.

These are the tokens that carry a value, not just a type.

Integers

64-bit signed. Multiple bases. Underscore separators for readability.

42 0xFF 0b1010 0o77 1_000_000
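
Turning the raw text of an integer token into a value might look like the sketch below. It assumes lowercase prefixes only (0x, 0b, 0o), accepts underscores anywhere among the digits, and rejects values outside the signed 64-bit range. The lesson does not spell out any of those rules, and parse_int_literal is a hypothetical helper name.

```python
# Assumed prefix-to-base mapping; only lowercase prefixes are handled here.
PREFIX_BASES = {"0x": 16, "0b": 2, "0o": 8}
INT64_MIN, INT64_MAX = -(2**63), 2**63 - 1


def parse_int_literal(text: str) -> int:
    """Convert the raw text of an integer token into its value."""
    base = PREFIX_BASES.get(text[:2], 10)
    digits = text[2:] if base != 10 else text
    # Underscores are only for readability, so drop them before converting.
    value = int(digits.replace("_", ""), base)
    if not INT64_MIN <= value <= INT64_MAX:
        raise ValueError(f"integer literal out of 64-bit range: {text}")
    return value


print(parse_int_literal("0xFF"))       # 255
print(parse_int_literal("0b1010"))     # 10
print(parse_int_literal("0o77"))       # 63
print(parse_int_literal("1_000_000"))  # 1000000
```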

Floats

64-bit. Decimal point or scientific notation.

3.14 -0.5 1.23e5 1e-3
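
A decimal scanner has to decide between INT and FLOAT after reading the digits. Here is a rough sketch using a regular expression. It assumes a leading minus sign is lexed separately as MINUS, that FLOAT is the token type name, and that a decimal point or exponent is what makes a number a float. Prefixed integers such as 0xFF are out of scope for this helper, and scan_decimal is a made-up name.

```python
import re

# digits, optional fractional part, optional exponent (sign not included)
DECIMAL_NUMBER = re.compile(r"\d[\d_]*(?:\.\d[\d_]*)?(?:[eE][+-]?\d+)?")


def scan_decimal(source: str, start: int) -> tuple:
    """Scan a decimal number at `start`; return (type, text, next_index)."""
    match = DECIMAL_NUMBER.match(source, start)
    if match is None:
        raise ValueError(f"no number at index {start}")
    text = match.group()
    # A decimal point or an exponent marks the literal as a float.
    is_float = any(c in text for c in ".eE")
    return ("FLOAT" if is_float else "INT"), text, match.end()


print(scan_decimal("3.14", 0))        # ('FLOAT', '3.14', 4)
print(scan_decimal("1e-3", 0))        # ('FLOAT', '1e-3', 4)
print(scan_decimal("1_000_000", 0))   # ('INT', '1_000_000', 9)
```

The regular expression is itself a form of longest match: the fractional part and exponent are consumed only when they are fully present, so "1.foo" yields the integer 1 and leaves the dot for the next token.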

Strings

Double-quoted. Escape sequences: \n \t \" \\

"Hello, world!" "line\nbreak" ""

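Scanning a double-quoted string with the four escape sequences listed above could be sketched as follows. The helper scan_string and the choice to return the decoded value (rather than the raw text) are assumptions, and the sketch does not decide whether a raw newline inside double quotes should be an error.

```python
# The four escape sequences from the lesson: \n \t \" \\
ESCAPES = {"n": "\n", "t": "\t", '"': '"', "\\": "\\"}


def scan_string(source: str, start: int) -> tuple:
    """Scan a string whose opening quote is at `start`.

    Returns (decoded_value, next_index).
    """
    chars = []
    pos = start + 1  # skip the opening quote
    while pos < len(source):
        ch = source[pos]
        if ch == '"':
            return "".join(chars), pos + 1  # skip the closing quote
        if ch == "\\":
            pos += 1  # look at the character after the backslash
            if pos >= len(source) or source[pos] not in ESCAPES:
                raise ValueError(f"bad escape sequence at index {pos}")
            ch = ESCAPES[source[pos]]
        chars.append(ch)
        pos += 1
    raise ValueError("unterminated string")


print(scan_string('""', 0))             # ('', 2)
print(scan_string('"a\\"b" rest', 0))   # ('a"b', 6)
```
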
Template literals

Backtick-delimited. Multiline. No interpolation. Same type as string.

`This is a
multiline string`
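
Because a template literal can span lines, the scanner also has to keep the line counter in step. A minimal sketch, assuming no escape processing inside backticks (the lesson does not say either way) and assuming STRING is the name of the shared string token type. scan_template is a hypothetical helper.

```python
def scan_template(source: str, start: int, line: int) -> tuple:
    """Scan a backtick literal whose opening backtick is at `start`.

    Returns (type, body, next_index, line_after).
    """
    end = source.find("`", start + 1)  # no interpolation: just find the close
    if end == -1:
        raise ValueError("unterminated template literal")
    body = source[start + 1:end]
    # Every newline inside the body moves the lexer one line further down.
    return "STRING", body, end + 1, line + body.count("\n")


print(scan_template("`This is a\nmultiline string` x", 0, 1))
# ('STRING', 'This is a\nmultiline string', 28, 2)
```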

What a token looks like in code.

Every token the lexer produces carries three things:

Type: what kind of token (LET, PLUS, INT, etc.)

Literal: the raw text ("let", "+", "42", "hello")

Position: line and column (line 3, col 12)

The parser never looks at the source string again. Everything it needs is in the token stream.
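
As a rough sketch, the three fields could be modelled like this in Python. The Token class, its field names, the make_token helper, and the 1-based line and column numbering are illustrative assumptions, not Monk's actual definitions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    type: str     # what kind of token, e.g. "LET", "PLUS", "INT"
    literal: str  # the raw source text, e.g. "let", "+", "42"
    line: int     # 1-based line number (assumed convention)
    column: int   # 1-based column number (assumed convention)


def make_token(type_: str, source: str, start: int, end: int) -> Token:
    """Build a token for source[start:end], deriving its position from `start`."""
    line = source.count("\n", 0, start) + 1
    # rfind returns -1 when no newline precedes `start`, giving column start + 1.
    column = start - source.rfind("\n", 0, start)
    return Token(type_, source[start:end], line, column)


src = "let x = 1\nreturn x"
print(make_token("RETURN", src, 10, 16))
# Token(type='RETURN', literal='return', line=2, column=1)
```

Recomputing the position from the start of the source on every token is fine for a demonstration. A real lexer would more likely keep running line and column counters as it advances.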

Key takeaways

Monk has roughly 60 to 70 token types: the lists above give 28 keywords, 28 operators, and 9 delimiters, plus identifiers and literals.

Keywords are just identifiers that happen to match a reserved word table. The scanner doesn't special-case them.

Multi-character operators use peek-ahead. Longest match wins (maximal munch).

Every token carries its type, literal text, and source position.