Lesson 1.2

Designing Monk's Token Set

Every keyword, operator, and delimiter in the language, mapped to a token type. This is the contract between the lexer and the parser.

monk 15 min read

Why define tokens before writing the lexer.

The token set is a contract. The lexer promises to produce these types. The parser promises to consume them. If you get this list right, the lexer and parser can be built independently — even by different people.
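
One way to pin the contract down is a single enumeration that both the lexer and the parser import. The sketch below is only an illustration in Python (the lesson does not say what language Monk's lexer is written in). It lists a handful of the types from this lesson; IDENT, FLOAT, STRING, and EOF are assumed names that the lesson does not spell out.

```python
from enum import Enum, auto


class TokenType(Enum):
    # Value-carrying tokens (IDENT, FLOAT, STRING are assumed names)
    IDENT = auto()
    INT = auto()
    FLOAT = auto()
    STRING = auto()
    # A few keywords
    LET = auto()
    IF = auto()
    RETURN = auto()
    # A few operators and delimiters
    PLUS = auto()
    ASSIGN = auto()
    EQUAL_EQUAL = auto()
    LPAREN = auto()
    RPAREN = auto()
    COMMA = auto()
    # Assumed end-of-input marker
    EOF = auto()
```

With the full list in one place, the lexer can only emit members of `TokenType`, and the parser only has to handle those members.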

Monk has ~60 token types. That sounds like a lot, but most are single characters (such as +, ( and ,) or reserved words (let, if, return). The interesting ones are the literals (numbers and strings) and identifiers.

Keywords.

Reserved words that cannot be used as variable names. The lexer reads an identifier, then checks: is it in the keyword table? If yes, emit the keyword token. If no, emit an identifier.

Declarations & control flow
let const if else for in while break continue return
Error handling
guard against throw
Modules & types
type use export from as
Logical operators (keyword form)
and or not is
Literals (also keywords)
true false none
Reserved for future
ref async await

Keywords vs identifiers. The lexer doesn't hardcode each keyword in the scanner. It reads a word, then does a lookup in a map. This keeps the scanner simple — the only decision is "letter? keep reading." The keyword check happens once, at the end.
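
The read-then-look-up approach can be sketched as follows. This is a minimal Python illustration, not Monk's actual scanner: `scan_word` is a hypothetical helper, the token name is assumed to be the upper-cased keyword, and allowing digits and underscores after the first letter is also an assumption.

```python
# Reserved words from the keyword tables in this lesson.
KEYWORDS = frozenset({
    "let", "const", "if", "else", "for", "in", "while",
    "break", "continue", "return",
    "guard", "against", "throw",
    "type", "use", "export", "from", "as",
    "and", "or", "not", "is",
    "true", "false", "none",
    "ref", "async", "await",
})


def scan_word(source, start):
    """Read one word beginning at `start` (which must point at a letter).

    Returns (token_type, text, next_index).
    """
    end = start
    # The only scanning decision: is this character still part of the word?
    while end < len(source) and (source[end].isalnum() or source[end] == "_"):
        end += 1
    text = source[start:end]
    # One lookup at the end decides keyword vs identifier.
    token_type = text.upper() if text in KEYWORDS else "IDENT"
    return token_type, text, end
```

Because the check runs on the whole word, `letter` comes out as an identifier even though it starts with `let`.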

Operators.

The tricky part: multi-character operators. When the lexer sees =, it needs to peek at the next character. Is it ==? Or just =?

Arithmetic
+ PLUS
- MINUS
* STAR
/ SLASH
% PERCENT
Comparison
== EQUAL_EQUAL
!= BANG_EQUAL
< LESS
> GREATER
<= LESS_EQUAL
>= GREATER_EQUAL
Assignment
= ASSIGN
+= PLUS_ASSIGN
-= MINUS_ASSIGN
*= STAR_ASSIGN
/= SLASH_ASSIGN
%= PERCENT_ASSIGN
Logical & bitwise
&& AND_AND
|| OR_OR
! BANG
& AMP
| PIPE
^ CARET
~ TILDE
<< SHIFT_LEFT
>> SHIFT_RIGHT
Special
-> ARROW function type signature
? QUESTION optional type suffix

Maximal munch in practice. When the lexer sees -, it peeks ahead. Next char is >? Emit ARROW. Next char is =? Emit MINUS_ASSIGN. Otherwise, just MINUS. Two-character operators always beat one-character operators.
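
A table-driven version of that peek-ahead might look like the sketch below. The operator names come from the tables above; `scan_operator` and the ILLEGAL fallback type are assumptions made for this example.

```python
# Two-character operators are tried first, so the longest match wins.
TWO_CHAR = {
    "==": "EQUAL_EQUAL", "!=": "BANG_EQUAL",
    "<=": "LESS_EQUAL", ">=": "GREATER_EQUAL",
    "+=": "PLUS_ASSIGN", "-=": "MINUS_ASSIGN", "*=": "STAR_ASSIGN",
    "/=": "SLASH_ASSIGN", "%=": "PERCENT_ASSIGN",
    "&&": "AND_AND", "||": "OR_OR",
    "<<": "SHIFT_LEFT", ">>": "SHIFT_RIGHT",
    "->": "ARROW",
}
ONE_CHAR = {
    "+": "PLUS", "-": "MINUS", "*": "STAR", "/": "SLASH", "%": "PERCENT",
    "<": "LESS", ">": "GREATER", "=": "ASSIGN", "!": "BANG",
    "&": "AMP", "|": "PIPE", "^": "CARET", "~": "TILDE", "?": "QUESTION",
}


def scan_operator(source, pos):
    """Return (token_type, text, next_index) for the operator at `pos`."""
    pair = source[pos:pos + 2]  # current char plus one char of lookahead
    if pair in TWO_CHAR:
        return TWO_CHAR[pair], pair, pos + 2
    ch = source[pos]
    if ch in ONE_CHAR:
        return ONE_CHAR[ch], ch, pos + 1
    # Assumed fallback for characters that start no operator.
    return "ILLEGAL", ch, pos + 1
```

At the end of input the slice is only one character long, so it never matches a two-character entry and the code falls through to the single-character table.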

Delimiters.

Simple. One character, one token. No ambiguity.

( ) LPAREN / RPAREN params, calls, grouping
{ } LBRACE / RBRACE blocks, record literals
[ ] LBRACKET / RBRACKET arrays, index access
, COMMA separators (trailing allowed)
. DOT record field access
: COLON record field definitions

Literals.

These are the tokens that carry a value, not just a type.

Integers

64-bit signed. Multiple bases. Underscore separators for readability.

42 0xFF 0b1010 0o77 1_000_000
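
To see what those spellings mean as values, here is a small Python sketch. It leans on Python's own `int(text, 0)`, which infers the base from a 0x, 0b, or 0o prefix and accepts underscores between digits. Monk's exact rules (for example, where underscores may appear) are not stated in this lesson, so treat `parse_int_literal` as a hypothetical helper with borrowed rules.

```python
INT64_MAX = 2**63 - 1


def parse_int_literal(text):
    """Convert the raw text of an unsigned integer literal to its value."""
    # base=0: infer the base from the prefix, allow underscores between digits.
    value = int(text, 0)
    # Enforce the 64-bit signed upper bound described in the lesson.
    if value > INT64_MAX:
        raise ValueError("integer literal out of 64-bit range: " + text)
    return value
```

For example, `parse_int_literal("0xFF")` gives 255 and `parse_int_literal("1_000_000")` gives 1000000.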

Floats

64-bit. Decimal point or scientific notation.

3.14 -0.5 1.23e5 1e-3
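
One way to tell an integer from a float while scanning is to note whether a fraction or an exponent was consumed. The sketch below uses a regular expression for that. It makes several assumptions: `scan_number` and the FLOAT type name are hypothetical, base prefixes and underscores are left out for brevity, and a leading minus (as in -0.5) is treated as a separate MINUS token, since the lesson does not say whether the sign belongs to the literal.

```python
import re

# Unsigned decimal number: digits, optional fraction, optional exponent.
NUMBER = re.compile(r"\d+(\.\d+)?([eE][+-]?\d+)?")


def scan_number(source, pos):
    """Return (token_type, text, next_index) for the number at `pos`."""
    match = NUMBER.match(source, pos)
    if match is None:
        raise ValueError("no number at index %d" % pos)
    # A fraction or an exponent makes the literal a float.
    is_float = match.group(1) is not None or match.group(2) is not None
    return ("FLOAT" if is_float else "INT"), match.group(0), match.end()
```

Requiring digits after the dot and after the exponent marker means `1.` and `1e` fall back to the integer `1`, leaving the rest for the next token.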

Strings

Double-quoted. Escape sequences: \n \t \" \\

"Hello, world!" "line\nbreak" ""

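Scanning a double-quoted string with the four escapes listed above can be sketched like this. `scan_string` is a hypothetical helper; it returns the decoded value, which is separate from the raw literal text the token keeps. Whether Monk decodes escapes in the lexer or later is not stated in this lesson.

```python
# The four escape sequences from the lesson: \n \t \" \\
ESCAPES = {"n": "\n", "t": "\t", '"': '"', "\\": "\\"}


def scan_string(source, pos):
    """Scan a string whose opening quote is at `pos`.

    Returns (decoded_value, index_after_closing_quote).
    """
    if source[pos] != '"':
        raise ValueError("scan_string must start at an opening quote")
    chars = []
    i = pos + 1
    while i < len(source):
        ch = source[i]
        if ch == '"':
            return "".join(chars), i + 1
        if ch == "\\":
            i += 1  # look at the character after the backslash
            if i >= len(source) or source[i] not in ESCAPES:
                raise ValueError("unknown or incomplete escape sequence")
            chars.append(ESCAPES[source[i]])
        else:
            chars.append(ch)
        i += 1
    raise ValueError("unterminated string literal")
```

An escaped quote is consumed inside the escape branch, so it never ends the string early.
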
Template literals

Backtick-delimited. Multiline. No interpolation. Same type as string.

`This is a
multiline string`
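
Because a template literal can span lines, the lexer has to keep its line counter in sync while scanning one. A minimal sketch, assuming no escape processing inside backticks (the lesson only says there is no interpolation) and using a hypothetical `scan_template` helper:

```python
def scan_template(source, pos, line):
    """Scan a backtick-delimited literal whose opening backtick is at `pos`.

    Returns (text, next_index, line_after), where `line_after` is the
    line number once the literal has been consumed.
    """
    end = source.find("`", pos + 1)
    if end == -1:
        raise ValueError("unterminated template literal")
    text = source[pos + 1:end]
    # Every newline inside the literal advances the line counter.
    return text, end + 1, line + text.count("\n")
```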

What a token looks like in code.

Every token the lexer produces carries three things:

Type: what kind of token (LET, PLUS, INT, etc.)
Literal: the raw text ("let", "+", "42", "hello")
Position: line and column (line 3, col 12)

The parser never looks at the source string again. Everything it needs is in the token stream.
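
As a concrete picture, the three fields could be modelled like this. It is a Python sketch with assumed field names and 1-based positions; the lesson does not fix either, and IDENT is an assumed type name.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    type: str     # what kind of token, e.g. "LET", "PLUS", "INT"
    literal: str  # the raw source text, e.g. "let", "+", "42"
    line: int     # 1-based line number (assumed convention)
    column: int   # 1-based column number (assumed convention)


# Tokens a lexer might produce for the source line: let x = 42
tokens = [
    Token("LET", "let", 1, 1),
    Token("IDENT", "x", 1, 5),
    Token("ASSIGN", "=", 1, 7),
    Token("INT", "42", 1, 9),
]
```

Freezing the dataclass keeps tokens immutable, so the parser can pass them around without worrying that something rewrites a position or literal.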

Key takeaways

1. Monk has ~60 token types: 28 keywords, ~26 operators, 9 delimiters, plus identifiers and literals.

2. Keywords are just identifiers that happen to match a reserved word table. The scanner doesn't special-case them.

3. Multi-character operators use peek-ahead. Longest match wins (maximal munch).

4. Every token carries its type, literal text, and source position.