Lesson 1.1

What Is a Lexer?

The compiler's first step. Raw text goes in, meaningful chunks come out. This is where every programming language begins.


Reading before understanding.

When you read a sentence, your brain does two things. First, it identifies words — the letters "c-a-t" become the word "cat." Then it understands what "cat" means in context. These are separate processes, and your brain is so fast you don't notice the split.

A compiler does the same thing, but slower and more deliberately. The lexer handles the first step: recognizing words. It doesn't know what the program means — that's the parser's job. The lexer just reads characters and groups them into tokens.

Token: the smallest meaningful unit in source code. A keyword like let, a number like 42, an operator like +. Each one is a token.
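In code, a token can be modeled as a small record. Here is a minimal sketch in Python; the field names are illustrative, not part of any particular lexer:

```python
from dataclasses import dataclass

@dataclass
class Token:
    type: str     # e.g. "LET", "IDENT", "INT", "PLUS"
    literal: str  # the exact source text, e.g. "let", "name", "42"
    line: int     # 1-based line where the token starts
    col: int      # 1-based column where the token starts

# The number 42 at line 1, column 9 becomes:
tok = Token("INT", "42", 1, 9)
```

Carrying the literal text alongside the type means later stages never have to look back at the raw source to know what the token said.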

What the lexer sees.

To you, this is a variable declaration:

let name = "Monk"

To the lexer, it's a stream of 17 characters:

l e t · n a m e · = · " M o n k "

The lexer scans left to right, grouping characters into tokens:

LET IDENT "name" ASSIGN STRING "Monk"

Whitespace is consumed but produces no token. It's just a separator.

How scanning works.

The lexer keeps a cursor — a position in the source string. It looks at the current character and decides what kind of token is starting. Then it advances the cursor until the token is complete.

1. Letter? Keep reading letters and digits. When you stop, check if it's a keyword (let, if, return). If not, it's an identifier.

2. Digit? Keep reading digits (and maybe a dot for floats, or 0x/0b prefixes). Produce a number token.

3. Quote? Read everything until the matching closing quote. Handle escape sequences (\n, \"). Produce a string token.

4. Symbol? Check if the next character extends it (= could become ==, - could become ->). Produce the longest matching operator.

5. Whitespace or comment? Skip it. Advance the cursor. No token produced.

That's the entire algorithm. A big switch on the current character, repeated until you hit the end of the file (then emit an EOF token).
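The whole loop can be sketched in a few dozen lines. This is a minimal Python illustration of the five cases above, not a real Monk lexer: the token names and keyword list are assumptions, and floats, escape sequences, and position tracking are omitted for brevity.

```python
KEYWORDS = {"let", "const", "if", "return"}

TWO_CHAR = {"==": "EQ", "<=": "LESS_EQUAL", ">=": "GREATER_EQUAL", "->": "ARROW"}
ONE_CHAR = {"=": "ASSIGN", "+": "PLUS", "-": "MINUS", "<": "LESS", ">": "GREATER",
            "(": "LPAREN", ")": "RPAREN", "{": "LBRACE", "}": "RBRACE", ",": "COMMA"}

def lex(source):
    tokens, i = [], 0
    while i < len(source):
        c = source[i]
        if c.isspace():                          # case 5: whitespace -> no token
            i += 1
        elif source[i:i + 2] == "//":            # case 5: line comment -> no token
            while i < len(source) and source[i] != "\n":
                i += 1
        elif c.isalpha() or c == "_":            # case 1: keyword or identifier
            start = i
            while i < len(source) and (source[i].isalnum() or source[i] == "_"):
                i += 1
            word = source[start:i]
            tokens.append((word.upper() if word in KEYWORDS else "IDENT", word))
        elif c.isdigit():                        # case 2: integer literal
            start = i
            while i < len(source) and source[i].isdigit():
                i += 1
            tokens.append(("INT", source[start:i]))
        elif c == '"':                           # case 3: string literal (no escapes)
            i += 1
            start = i
            while i < len(source) and source[i] != '"':
                i += 1
            tokens.append(("STRING", source[start:i]))
            i += 1                               # consume the closing quote
        elif source[i:i + 2] in TWO_CHAR:        # case 4: try two chars first...
            tokens.append((TWO_CHAR[source[i:i + 2]], source[i:i + 2]))
            i += 2
        elif c in ONE_CHAR:                      # ...then one (maximal munch)
            tokens.append((ONE_CHAR[c], c))
            i += 1
        else:
            raise SyntaxError(f"unexpected character {c!r}")
    tokens.append(("EOF", ""))
    return tokens
```

Running `lex('let name = "Monk"')` yields the token stream from earlier: `LET`, `IDENT "name"`, `ASSIGN`, `STRING "Monk"`, then `EOF`.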

Maximal munch. The lexer always reads the longest possible token. When it sees <=, it produces one LESS_EQUAL token, not a LESS followed by an ASSIGN. This is called the "maximal munch" rule and every real lexer uses it.
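One simple way to implement maximal munch is to keep the operator table sorted longest-first and take the first match. A sketch, with an illustrative operator list:

```python
# Sorted longest-first, so a two-character match always beats its one-character prefix.
OPERATORS = ["<=", ">=", "==", "->", "<", ">", "=", "-"]

def match_operator(source, i):
    """Return the longest operator starting at position i, or None."""
    for op in OPERATORS:
        if source.startswith(op, i):
            return op
    return None
```

At position 2 of `"x <= y"`, this returns `"<="` rather than `"<"`, because `"<="` is tried first.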

A complete example.

Here's a small Monk program:

const add = (a int, b int) int {
    return a + b
}
let result = add(3, 4)  // 7

The lexer produces this token stream:

CONST IDENT "add" ASSIGN LPAREN IDENT "a" IDENT "int" COMMA IDENT "b" IDENT "int" RPAREN IDENT "int" LBRACE
RETURN IDENT "a" PLUS IDENT "b"
RBRACE
LET IDENT "result" ASSIGN IDENT "add" LPAREN INT 3 COMMA INT 4 RPAREN
EOF

The comment // 7 is discarded. It carries no meaning for the compiler.

What the lexer does NOT do.

No grammar.

It doesn't know that let x = starts a variable declaration. That's parsing.

No type checking.

let x int = "hello" lexes just fine. The lexer doesn't know types are wrong.

No scope.

It doesn't know if a variable has been declared. foobar is just an identifier, even if nothing named foobar exists.

The lexer's only concern: is this sequence of characters a valid token? If not, report an error with the line and column number. That's it. Everything else is someone else's problem.

Line and column tracking.

Every token remembers where it came from — the line number and column in the source file. When the parser or type checker finds an error later, it uses this position to print a useful error message:

error: undefined variable "nme"
  --> main.monk:3:9
  |
3 | let x = nme + 1
  |         ^^^

The lexer tracks this by counting newlines (increment line, reset column) and incrementing the column for every other character. Cheap to do, invaluable for debugging.
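The bookkeeping fits naturally into the cursor itself. A minimal sketch (the class and method names are illustrative):

```python
class Cursor:
    def __init__(self, source):
        self.source = source
        self.pos = 0    # index into the source string
        self.line = 1   # 1-based line of the current character
        self.col = 1    # 1-based column of the current character

    def advance(self):
        """Consume one character, updating line and column."""
        c = self.source[self.pos]
        self.pos += 1
        if c == "\n":
            self.line += 1   # newline: the next character...
            self.col = 1     # ...starts a new line at column 1
        else:
            self.col += 1
        return c

cur = Cursor("ab\ncd")
for _ in range(3):
    cur.advance()
# After consuming "a", "b", and "\n", the cursor points at "c":
# cur.line == 2, cur.col == 1
```

A token's position is simply the `(line, col)` snapshot taken before its first character is consumed.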

Key takeaways

1. A lexer turns a string of characters into a stream of tokens. Characters in, tokens out.

2. It works by scanning left to right with a cursor, matching the longest possible token at each position.

3. Whitespace and comments are consumed but don't produce tokens.

4. Every token carries its position (line, column) for error reporting downstream.