A tiny source code example
Let’s start with a very simple source code fragment that will be fed to the interpreter or compiler:
int result = (12 + 56) / 3 * 2;
The first step is lexing, or lexical analysis, also called scanning.
The lexer reads the source code character by character and groups the characters into meaningful units, the words of the language. Each word becomes a token, and the language’s grammar later describes how tokens may be combined.
The term lexical analysis comes from the Greek word “lexis”, which means “word”.
Some tokens consist of only a single character, like & or ; or *. Others are several characters long, for example number or string literals like 126354 or “this is a string”. Some characters usually carry no meaning for the later stages, such as whitespace or comments. The lexer outputs only the meaningful tokens and skips the rest. It is also the lexer’s job to report character sequences it does not recognize, because they are not part of the language. These errors are called lexical errors.
A token usually carries some useful metadata as well, such as its length, its start position and its line number in the source code. This extra data helps generate meaningful error messages.
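To make the metadata idea concrete, here is a minimal Python sketch. The Token class, its field names and the describe helper are hypothetical, invented for illustration; real lexers store and report this information in many different ways.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    kind: str    # token category, e.g. "NUMBER" (illustrative name)
    text: str    # the exact characters matched in the source
    line: int    # 1-based line number
    column: int  # 1-based column where the token starts

    @property
    def length(self) -> int:
        return len(self.text)


def describe(source: str, token: Token, message: str) -> str:
    """Build an error message that points at the token in its source line."""
    source_line = source.splitlines()[token.line - 1]
    marker = " " * (token.column - 1) + "^" * token.length
    return (
        f"line {token.line}, column {token.column}: {message}\n"
        f"{source_line}\n"
        f"{marker}"
    )


source = "int result = (12 + 56) / 3 * 2;"
token = Token(kind="NUMBER", text="56", line=1, column=20)
print(describe(source, token, "example message"))
```

The line and column are enough to find the offending source line, and the length tells us how many characters to underline.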
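Let’s see which tokens we get when we run a lexical analysis on the source example from earlier. Thirteen tokens come out, in this order, and the whitespace between them is skipped:
int   result   =   (   12   +   56   )   /   3   *   2   ;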
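The tokens for the example can be reproduced with a small hand-written lexer. What follows is only a sketch in Python under some assumptions of mine: the token kinds (KEYWORD, IDENTIFIER, NUMBER, SYMBOL), the keyword set, the list of single-character tokens and the // comment syntax are all made up for this example and do not describe any particular language.

```python
from typing import Iterator, NamedTuple


class Token(NamedTuple):
    kind: str
    text: str
    line: int    # 1-based
    column: int  # 1-based


class LexicalError(Exception):
    """Raised when the lexer meets a character it does not understand."""


KEYWORDS = {"int"}              # assumed: just enough for the example
SINGLE_CHAR = set("=()+-*/;&")  # assumed single-character tokens


def tokenize(source: str) -> Iterator[Token]:
    pos, line, line_start = 0, 1, 0
    while pos < len(source):
        ch = source[pos]
        column = pos - line_start + 1
        if ch == "\n":                      # newline: skip, track the line number
            pos += 1
            line += 1
            line_start = pos
        elif ch.isspace():                  # other whitespace: skip
            pos += 1
        elif source.startswith("//", pos):  # comment: skip to the end of the line
            while pos < len(source) and source[pos] != "\n":
                pos += 1
        elif ch.isdigit():                  # number literal: consume all digits
            start = pos
            while pos < len(source) and source[pos].isdigit():
                pos += 1
            yield Token("NUMBER", source[start:pos], line, column)
        elif ch.isalpha() or ch == "_":     # keyword or identifier
            start = pos
            while pos < len(source) and (source[pos].isalnum() or source[pos] == "_"):
                pos += 1
            text = source[start:pos]
            kind = "KEYWORD" if text in KEYWORDS else "IDENTIFIER"
            yield Token(kind, text, line, column)
        elif ch in SINGLE_CHAR:             # single-character token
            pos += 1
            yield Token("SYMBOL", ch, line, column)
        else:                               # anything else is a lexical error
            raise LexicalError(
                f"line {line}, column {column}: unexpected character {ch!r}"
            )


for token in tokenize("int result = (12 + 56) / 3 * 2;"):
    print(f"{token.kind:<10} {token.text!r:<10} line {token.line}, column {token.column}")
```

Note that the comment check comes before the single-character check, otherwise the first / of a comment would be emitted as a division symbol. Feeding the lexer something like "int x = 1 @ 2;" raises a LexicalError, because @ is not in the assumed character set.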
Lexical analysis is actually quite simple; there is not much more to it.
Next time I will talk about parsing.
Stay tuned. I’ll be back.