Lesson 1.2
Designing Monk's Token Set
Every keyword, operator, and delimiter in the language, mapped to a token type. This is the contract between the lexer and the parser.
Why define tokens before writing the lexer.
The token set is a contract. The lexer promises to produce these types. The parser promises to consume them. If you get this list right, the lexer and parser can be built independently — even by different people.
Monk has ~60 token types. That sounds like a lot, but most are single characters (+, (, ,) or reserved words (let, if, return). The interesting ones are the literals — numbers, strings, identifiers.
Keywords.
Reserved words that cannot be used as variable names. The lexer reads an identifier, then checks: is it in the keyword table? If yes, emit the keyword token. If no, emit an identifier.
let const if else for in while break continue return
guard against throw
type use export from as
and or not is
true false none
ref async await
Keywords vs identifiers. The lexer doesn't hardcode each keyword in the scanner. It reads a whole word, then looks it up in a map. This keeps the scanner simple: the only decision while scanning is "letter? keep reading." The keyword check happens once, at the end of the word.
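The lookup described above can be sketched in a few lines of Python. This is a minimal illustration, not Monk's actual source: the names `KEYWORDS` and `scan_word`, the `IDENT` token type, and the rule that words consist of letters, digits, and underscores are all assumptions made for the example. Keyword token types are written as the upper-cased word, matching the `LET` style used later in this lesson.

```python
# Hypothetical sketch: KEYWORDS, scan_word, and the IDENT type name are
# illustrative assumptions, not taken from Monk's implementation.
KEYWORDS = {
    word: word.upper()
    for word in (
        "let const if else for in while break continue return "
        "guard against throw "
        "type use export from as "
        "and or not is "
        "true false none "
        "ref async await"
    ).split()
}


def scan_word(source: str, start: int) -> tuple[str, str, int]:
    """Read one word beginning at `start`.

    Returns (token_type, lexeme, index_after_word).
    """
    end = start
    # Assumption: a word is letters, digits, and underscores.
    while end < len(source) and (source[end].isalnum() or source[end] == "_"):
        end += 1
    lexeme = source[start:end]
    # The keyword check happens once, after the whole word has been read.
    token_type = KEYWORDS.get(lexeme, "IDENT")
    return token_type, lexeme, end
```

Because the check runs on the complete word, a name like `letter` is never mistaken for `let` followed by `ter`.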
Operators.
The tricky part: multi-character operators. When the lexer sees =, it needs to peek at the next character. Is it ==? Or just =?
+    PLUS
-    MINUS
*    STAR
/    SLASH
%    PERCENT
==   EQUAL_EQUAL
!=   BANG_EQUAL
<    LESS
>    GREATER
<=   LESS_EQUAL
>=   GREATER_EQUAL
=    ASSIGN
+=   PLUS_ASSIGN
-=   MINUS_ASSIGN
*=   STAR_ASSIGN
/=   SLASH_ASSIGN
%=   PERCENT_ASSIGN
&&   AND_AND
||   OR_OR
!    BANG
&    AMP
|    PIPE
^    CARET
~    TILDE
<<   SHIFT_LEFT
>>   SHIFT_RIGHT
->   ARROW      function type signature
?    QUESTION   optional type suffix
Maximal munch in practice. When the lexer sees -, it peeks ahead. Next char is >? Emit ARROW. Next char is =? Emit MINUS_ASSIGN. Otherwise, just MINUS. When both could match, the two-character operator always beats the one-character operator.
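A minimal Python sketch of this peek-ahead, assuming a table-driven scanner. The token names come from the operator list in this lesson; the function name `scan_operator` and the two-table layout are hypothetical choices for the example.

```python
# Token names are from the lesson's operator list; the table layout and the
# function name are illustrative assumptions.
TWO_CHAR = {
    "==": "EQUAL_EQUAL", "!=": "BANG_EQUAL",
    "<=": "LESS_EQUAL", ">=": "GREATER_EQUAL",
    "+=": "PLUS_ASSIGN", "-=": "MINUS_ASSIGN", "*=": "STAR_ASSIGN",
    "/=": "SLASH_ASSIGN", "%=": "PERCENT_ASSIGN",
    "&&": "AND_AND", "||": "OR_OR",
    "<<": "SHIFT_LEFT", ">>": "SHIFT_RIGHT", "->": "ARROW",
}
ONE_CHAR = {
    "+": "PLUS", "-": "MINUS", "*": "STAR", "/": "SLASH", "%": "PERCENT",
    "<": "LESS", ">": "GREATER", "=": "ASSIGN", "!": "BANG",
    "&": "AMP", "|": "PIPE", "^": "CARET", "~": "TILDE", "?": "QUESTION",
}


def scan_operator(source: str, pos: int) -> tuple[str, str, int]:
    """Scan one operator at `pos`; return (token_type, lexeme, next_index)."""
    # Peek: try the two-character form first so the longest match wins.
    pair = source[pos:pos + 2]
    if pair in TWO_CHAR:
        return TWO_CHAR[pair], pair, pos + 2
    ch = source[pos]
    if ch in ONE_CHAR:
        return ONE_CHAR[ch], ch, pos + 1
    raise ValueError(f"unexpected character {ch!r} at index {pos}")
```

Trying the longer table first is what implements maximal munch here: `->` and `-=` are found before the scanner ever considers a bare `-`.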
Delimiters.
Simple. One character, one token. No ambiguity.
( )   LPAREN / RPAREN       params, calls, grouping
{ }   LBRACE / RBRACE       blocks, record literals
[ ]   LBRACKET / RBRACKET   arrays, index access
,     COMMA                 separators (trailing allowed)
.     DOT                   record field access
:     COLON                 record field definitions
Literals.
These are the tokens that carry a value, not just a type.
Integers
64-bit signed. Multiple bases. Underscore separators for readability.
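As a rough illustration of those three rules, here is one way an integer lexeme could be turned into a value in Python. The function name `parse_int_literal` is hypothetical, and two behaviors are assumptions rather than documented Monk rules: only lowercase prefixes (`0x`, `0b`, `0o`) are recognized, and out-of-range values are rejected with an error.

```python
# Hypothetical sketch; prefix handling and overflow behavior are assumptions.
INT64_MIN = -(2 ** 63)
INT64_MAX = 2 ** 63 - 1

BASES = {"0x": 16, "0b": 2, "0o": 8}


def parse_int_literal(lexeme: str) -> int:
    """Convert an integer lexeme such as '0xFF' or '1_000_000' to its value."""
    # Underscores are only for readability, so drop them first.
    digits = lexeme.replace("_", "")
    base = BASES.get(digits[:2], 10)
    if base != 10:
        digits = digits[2:]  # strip the base prefix
    value = int(digits, base)  # raises ValueError on invalid digits
    if not INT64_MIN <= value <= INT64_MAX:
        raise ValueError(f"integer literal out of 64-bit range: {lexeme}")
    return value
```

Python integers are unbounded, so the explicit range check stands in for the 64-bit limit; a lexer written in a language with fixed-width integers would detect overflow differently.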
42   0xFF   0b1010   0o77   1_000_000
Floats
64-bit. Decimal point or scientific notation.
3.14   -0.5   1.23e5   1e-3
Strings
Double-quoted. Escape sequences: \n \t \" \\
"Hello, world!"   "line\nbreak"   ""
Template literals
Backtick-delimited. Multiline. No interpolation. Same type as string.
`This is a
multiline string`
What a token looks like in code.
Every token the lexer produces carries three things:
Type: LET, PLUS, INT, etc.
Literal: "let", "+", "42", "hello"
Position: line 3, col 12
The parser never looks at the source string again. Everything it needs is in the token stream.
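One possible shape for such a token, sketched as a Python dataclass. The field names (`type`, `literal`, `line`, `col`) and the helper `describe` are assumptions for illustration; the lesson only specifies that a token carries a type, its literal text, and a source position.

```python
from dataclasses import dataclass


# Hypothetical sketch: field names and the describe() helper are assumptions.
@dataclass(frozen=True)
class Token:
    type: str     # token type, e.g. "LET", "PLUS", "INT"
    literal: str  # exact source text, e.g. "let", "+", "42"
    line: int     # source line of the first character
    col: int      # source column of the first character


def describe(token: Token) -> str:
    """Format a token for error messages or debugging output."""
    return f"{token.type} {token.literal!r} at {token.line}:{token.col}"


tok = Token("INT", "42", line=3, col=12)
print(describe(tok))  # INT '42' at 3:12
```

Making the token immutable (`frozen=True`) is a design choice for the sketch: once the lexer has emitted a token, nothing downstream should need to change it.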
Key takeaways
Monk has ~60 token types: 28 keywords, 28 operators, 9 delimiters, plus identifiers and literals.
Keywords are just identifiers that happen to match a reserved word table. The scanner doesn't special-case them.
Multi-character operators use peek-ahead. Longest match wins (maximal munch).
Every token carries its type, literal text, and source position.