Attila's Blog

About writing your own programming language in C, Go and Swift

Lexing ASPL

Aug 9, 2018 Comments

Categories: aspl
Tags: lexing

We have our minimalistic framework in place, we can now start writing our compiler. The first step in any compiler is lexing. The lexer takes the source code as input and outputs tokens that makes up the grammar of the given language.

Let’s write our lexer for ASPL!

Lexeme versus token

Let’s have a look on this line of ASPL code to understand what lexing is about.

var breakfast = "bacon & eggs";

This is a variable definition in ASPL. It is done via the var keyword. But how do we know if we are defining a variable or say printing a string? We know that, because we have the var keyword in the source code. var has a specific meaning. If it was abc in its place it wouldn’t mean anything to us.


Our job during the lexical analysis is that we loop through the list of individual characters in the source code and we try to group them into sequences that mean something to us. These sequences are called lexemes.

Lexical grammar

The rules that determine how individual characters are grouped into lexemes are called lexical grammar. ASPL’s lexical grammar is simple enough to be classified as regular language, and thus it is possible to use regular expressions to recognize all lexemes of the language. Tools like Lex are designed to let you do this. Since we want to learn how to write a language, we are going to write our own lexer by hand.

From lexemes to tokens

Important to mention that lexemes are not the same as tokens. A lexeme is only the raw substring of the source code we recognized as something meaningful. To become tokens we need to gather additional information and attach them to the lexeme.

What kind of information? Let’s see.

Type information

First and foremost we need to know what the lexeme represents. Is it a string? a reserved word? an integer? a comparison operator?

To make this easy, we will assign a specific identifier to every kind of possible lexeme in ASPL.

typedef enum token_type_t {




} TokenType;

Value information

There are lexemes for value literals: numbers and strings. Other than knowing that the lexeme is a string or number we have to know their value as well.

Location information

Correct error reporting is extremely important if we’d like to create a usable language. For this purpose we need to store the location of the lexeme to be able to point it out to the developers if they made a mistake. They need to know what line in the source code they have to look at.


What we do is to grab all the extra information I mentioned, along with the lexeme and put them into a specific data structure.

typedef struct token_t {
    TokenType type;
    const char* start;
    const char* line_start;
    const char* error_message;
    uint64_t length;
    uint64_t line;
} Token;

This is a token. This is how we work with lexemes during the compilation process.


We know what we need to produce during lexing, so let’s write the lexer!

The process is simple:

  • We need to loop through the source code characters
  • Recognize the current lexeme
  • Wrap it into a token
  • Start over until the we reach EOF

For our lexer to work, we need to know the current position in the source code, where the current lexeme starts, and for error reporting it is useful to know what is the current line and maybe where the current line starts.

We wrap all this into a data structure which serves as our lexer. We will have three public functions that drive the lexer. One for initializing the lexer data structure, one for releasing memory after we are done with lexing, and the one that does the work and returns us the tokens one by one. This implementation is an on-demand lexer, because we don’t do the whole thing at once, but one token at a time instead.

typedef struct lexer_t {
    const char* token_start;
    const char* current_position;
    const char* current_line_start;
    uint64_t line;
} Lexer;

extern void lexer_init(Lexer* lexer, const char* source);
extern void lexer_free(Lexer* lexer);
extern Token* lexer_next_token(Lexer* lexer);

The implementation needs some helper functions.

static Token *create_token(Lexer *lexer, TokenType type) {
    Token *token = NEW(Token);

    token->type = type;
    token->start = lexer->token_start;
    token->length = lexer->current_position - lexer->token_start;
    token->line = lexer->line;
    token->line_start = lexer->current_line_start;
    token->error_message = NULL;

    return token;

static Token *create_error_token(Lexer *lexer, const char *message) {
    Token *token = NEW(Token);

    token->type = TOKEN_ERROR;
    token->start = lexer->token_start;
    token->length = lexer->current_position - lexer->token_start;
    token->line = lexer->line;
    token->line_start = lexer->current_line_start;
    token->error_message = message;

    return token;

static void delete_token(Token *token) {
    if (token != NULL) {
        DELETE_TYPE(Token, token);

These are for creating and deleting tokens, this code is self–explanatory. Next, some functions to help determine where we are.

static bool is_at_end(Lexer *lexer) {
    return *lexer->current_position == '\0';

static char peek(Lexer *lexer) {
    return *lexer->current_position;

static char peek_next(Lexer *lexer) {
    if (is_at_end(lexer)) {
        return '\0';

    return *(lexer->current_position + 1);

static char advance(Lexer *lexer) {
    return *(lexer->current_position - 1);

static bool match(Lexer *lexer, char c) {
    if (is_at_end(lexer)) {
        return false;

    if (*lexer->current_position != c) {
        return false;


    return true;

Function is_at_end tells us when we reach the end of the source. peek and peek_next are something called lookahead. It means we can look ahead of the current position without consuming those symbols. It is necessary to be able to recognize lexemes made of multiple characters. advance will make one step forward in the source and consume the current character. match will tell us if the current character is the one we are looking for, it is like a conditional advance.

The next group of helpers will help us determine what kind of lexeme we are talking about.

static bool is_digit(char c) {
    return c >= '0' && c <= '9';

static bool is_alpha(char c) {
    return (c >= 'a' && c <= 'z') ||
           (c >= 'A' && c <= 'Z') ||
           c == '_';

static bool is_alphanumeric(char c) {
    return is_alpha(c) || is_digit(c);

static bool is_whitespace(Lexer *lexer) {
    char c = peek(lexer);
    return c == ' ' || c == '\r' || c == '\t' || c == '\n';

static bool is_comment(Lexer *lexer) {
    char current = peek(lexer);
    char next = peek_next(lexer);
    return current == '/' && (next == '/' || next == '*');

Function is_digit checks if the current character is a number. is_alpha checks for letters. is_alphanumeric is a combination of the former two. is_whitespace recognizes whitespace characters. is_comment checks if the current character is the start of a comment line or block.

Now, we have all the helpers we need, so let’s jump into actually lexing the source. First, let’s see how to recognize strings.

static Token *scan_string(Lexer *lexer) {
    while (peek(lexer) != '"' && !is_at_end(lexer)) {
        char c = advance(lexer);

        if (c == '\n') {
            lexer->current_line_start = lexer->current_position;

    if (is_at_end(lexer)) {
        return create_error_token(lexer, "Unterminated string");


    return create_token(lexer, TOKEN_STRING);

We look for character " which indicates the start of a string. We keep going until we find EOF or the closing ". In case of a multi-line string, we check for \n and increase the line number if we find any. Also set the start of the current line in this case. If we reach EOF we create and return an error token, otherwise we return the recognized string token.

Let’s see how to recognize numbers.

static Token *scan_number(Lexer *lexer) {
    while (is_digit(peek(lexer))) {

    if (peek(lexer) == '.' && is_digit(peek_next(lexer))) {

        while (is_digit(peek(lexer))) {

    return create_token(lexer, TOKEN_NUMBER);

We look for digits or . which indicates a floating point number. If we find a . we keep looking for more digits. Return the number token at the end.

Next is recognizing keywords and identfiers. How can we differentiate between them? They look the same. Well, we need some preparation first.

typedef struct keyword_t {
    const char *name;
    size_t length;
    TokenType type;
} Keyword;

static const Keyword keywords[] = {
        {"class",  5, TOKEN_CLASS},
        {"else",   4, TOKEN_ELSE},
        {"false",  5, TOKEN_FALSE},
        {"for",    3, TOKEN_FOR},
        {"func",   4, TOKEN_FUNC},
        {"if",     2, TOKEN_IF},
        {"nil",    3, TOKEN_NIL},
        {"print",  5, TOKEN_PRINT},
        {"return", 6, TOKEN_RETURN},
        {"super",  5, TOKEN_SUPER},
        {"this",   4, TOKEN_THIS},
        {"true",   4, TOKEN_TRUE},
        {"var",    3, TOKEN_VAR},
        {"while",  5, TOKEN_WHILE},
        {NULL,     0, TOKEN_EOF}

We define a keyword structure which holds all the reserved words of ASPL. We will check against them when looking for identifiers so we can decode what we have found is an identifier or a reserved word.

static Token *scan_identifier(Lexer *lexer) {
    while (is_alphanumeric(peek(lexer))) {

    TokenType type = TOKEN_IDENTIFIER;

    size_t length = lexer->current_position - lexer->token_start;

    for (const Keyword *keyword = keywords; keyword->name != NULL; keyword++) {
        if (length == keyword->length && memcmp(lexer->token_start, keyword->name, length) == 0) {
            type = keyword->type;

    return create_token(lexer, type);

We look for letters or digits and keep going until we find something other than these two. At the end we compare our identifier against the reserved word database, and return the token for the appropriate reserved word if there is a match. We return an identifier otherwise.

I know, you think you just found a bug, right? Identifiers cannot start with a number! True, but it’s gonna be OK, because the main loop that drives the whole lexing process will call this function only if the first character is a letter.

Next we filter out any noise from our source code, namely comments and whitespace.

static void skip_whitespace(Lexer *lexer) {
    for (;;) {
        char c = peek(lexer);

        switch (c) {
            case ' ':
            case '\r':
            case '\t':
            case '\n':
                lexer->current_line_start = lexer->current_position;

static Token *skip_comment(Lexer *lexer) {
    if (peek(lexer) == '/' && peek_next(lexer) == '/') {
        while (peek(lexer) != '\n' && !is_at_end(lexer)) {

        if (peek(lexer) == '\n') {
    } else if (peek(lexer) == '/' && peek_next(lexer) == '*') {
        bool finished = false;

        while (!finished) {
            char c = advance(lexer);

            if (is_at_end(lexer)) {
                return create_error_token(lexer, "Unterminated comment");

            if (c == '\n') {
                lexer->current_line_start = lexer->current_position;

            if (peek(lexer) == '*' && peek_next(lexer) == '/') {
                finished = true;

    return NULL;

It’s easier to skip the whitespace, so start with that. skip_whitespace will look for any whitespace characters (' ', '\r', '\t', '\n') and just skips over them. '\n' is special, when we find one we have to set the current line number and the start of the current line as well.

Lexing comments is a bit more complex. skip_comment supports both // and /* */ style comments. // is easier to recognize. We just look for //, and if found skip everytrhing until the end of the line or EOF. As usual, we set the line number if we hit a \n character. /**/ is harder to deal with because it has to handle multi-line comments as well. If we reach EOF before the comment is closed we return an error token. Also, check for new lines via \n, and do the usual housekeeping. We do not return a valid token when successfully finished lexing a comment, because comments are not useful lexemes in ASPL, so we just ignore them.

An important note: this implementation does not handle embedded comments like /* ... /* ... */ ... */. It is the programmers responsibility to avoid writing such code. Why? Because it was easier this way.

Next up, our public functions that communicate with the outer world.

void lexer_init(Lexer *lexer, const char *source) {
    lexer->token_start = source;
    lexer->current_position = source;
    lexer->current_line_start = source;
    lexer->line = 1;

void lexer_free(Lexer *lexer) {
    lexer_init(lexer, NULL);

They are for creating and destroying lexer instances. lexer_free seems to be unnecessary, right? It is for the moment. We will keep it though, because we might need to extend the lexer with some dynamic structures later, and in this case we will have a place to release that memory when destroying the lexer instance.

There is only one function left: the heart of the lexer. This is what connects all the helpers and drives the lexing process.

Token *lexer_next_token(Lexer *lexer) {
    bool comment;
    bool whitespace;

    do {
        Token *error = skip_comment(lexer);

        if (error != NULL) {
            return error;

        comment = is_comment(lexer);
        whitespace = is_whitespace(lexer);
    } while (comment || whitespace);

    lexer->token_start = lexer->current_position;

    if (is_at_end(lexer)) {
        return create_token(lexer, TOKEN_EOF);

    char c = advance(lexer);

    switch (c) {
        case '(':
            return create_token(lexer, TOKEN_LEFT_PAREN);
        case ')':
            return create_token(lexer, TOKEN_RIGHT_PAREN);
        case '{':
            return create_token(lexer, TOKEN_LEFT_BRACE);
        case '}':
            return create_token(lexer, TOKEN_RIGHT_BRACE);
        case ';':
            return create_token(lexer, TOKEN_SEMICOLON);
        case ',':
            return create_token(lexer, TOKEN_COMMA);
        case '.':
            return create_token(lexer, TOKEN_DOT);
        case '-':
            if (match(lexer, '-')) {
                return create_token(lexer, TOKEN_MINUS_MINUS);

            return create_token(lexer, TOKEN_MINUS);
        case '+':
            if (match(lexer, '+')) {
                return create_token(lexer, TOKEN_PLUS_PLUS);

            return create_token(lexer, TOKEN_PLUS);
        case '/': {
            return create_token(lexer, TOKEN_SLASH);
        case '*':
            return create_token(lexer, TOKEN_STAR);
        case '?':
            return create_token(lexer, TOKEN_QUESTION);
        case ':':
            return create_token(lexer, TOKEN_COLON);
        case '!':
            if (match(lexer, '=')) {
                return create_token(lexer, TOKEN_BANG_EQUAL);

            return create_token(lexer, TOKEN_BANG);
        case '=':
            if (match(lexer, '=')) {
                return create_token(lexer, TOKEN_EQUAL_EQUAL);

            return create_token(lexer, TOKEN_EQUAL);
        case '<':
            if (match(lexer, '=')) {
                return create_token(lexer, TOKEN_LESS_EQUAL);

            return create_token(lexer, TOKEN_LESS);
        case '>':
            if (match(lexer, '=')) {
                return create_token(lexer, TOKEN_GREATER_EQUAL);

            return create_token(lexer, TOKEN_GREATER);
        case '&':
            if (match(lexer, '&')) {
                return create_token(lexer, TOKEN_AND_AND);
        case '|':
            if (match(lexer, '|')) {
                return create_token(lexer, TOKEN_PIPE_PIPE);
        case '"':
            return scan_string(lexer);
            if (is_digit(c)) {
                return scan_number(lexer);
            } else if (is_alpha(c)) {
                return scan_identifier(lexer);


    return create_error_token(lexer, "Unexpected character");

We start with skipping all whitespace and comments that might be at the current position. Don’t forget, this is an on-demand implementation, so there is no guarantee that we are at the beginning of the source code, or that we have any source code at all. If you are in REPL, and just press enter, we will try lexing an empty source code string. Because of this, we also check if we reached EOF already. If not, we check the next character in a giant switch.

Recognizing single character tokens is straightforward. Check the character for a match, and we are done. Checking for two-character tokens is where we need to use match. If we do not find any special character we just check for numbers and identifiers. At the end of the switch if there is still no match we return an error token because it means we found something that is not supported in ASPL.

We are done, we now have a fully functional lexer. Before we can try it out, we need some additional setup to do. In create_token we allocated memory with NEW. This is a macro I created for helping dynamic memory allocations.

// Created by Attila Haluska on 8/2/18.


#include <stddef.h>


#define NEW(type)                                               (type*)memory_reallocate(NULL, 0, (size_t)sizeof(type))
#define NEW_WITH_SIZE(type, size)                               (type*)memory_reallocate(NULL, 0, (size_t)((size) * sizeof(type)))
#define DELETE_TYPE(type, pointer)                              memory_reallocate((pointer), sizeof(type), 0)

extern void* memory_reallocate(void* original_memory_location, size_t current_size, size_t new_size);

#endif //ASPL_C_MEMORY_H

The macro itself is straightforward. memory_reallocate is the main dynamic memory allocator we will use throught the whole implementation.

// Created by Attila Haluska on 8/2/18.

#include "memory_management/memory.h"

#include <memory.h>
#include <stdlib.h>

#include "common.h"

void* memory_reallocate(void* original_memory_location, size_t current_size, size_t new_size) {
    if (new_size == 0) {
        return NULL;

    void* new_memory_location = realloc(original_memory_location, new_size);

    if (new_memory_location == NULL) {

    return new_memory_location;

If you know C you may notice that I call free explicitly when the new size is zero. In the C90 standard if size is zero realloc behaves the same as free, so explicitly calling free might seem unnecessary. This behavior has changed in C99/C11 though, and now if size is zero the behavior is implementation dependent, and the standard does not guarantee freeing the previously allocated memory block anymore, and thus an explicit call to free is necessary.

The last thing we need is a helper function to be able to show the contents of the current token.

// Created by Attila Haluska on 8/2/18.

#ifndef ASPL_C_DEBUG_H
#define ASPL_C_DEBUG_H

#include "forward_declarations.h"

extern void debug_token(Token* token);

#endif //ASPL_C_DEBUG_H

It is called debug_token, and the imlementation is like this:

// Created by Attila Haluska on 8/2/18.

#include "debug/debug.h"

#include <stdio.h>
#include <string.h>

#include "compiler/lexer.h"
#include "memory_management/memory.h"

static char *debug_token_type(const Token *token) {
    switch (token->type) {
        case TOKEN_LEFT_PAREN:
            return "TOKEN_LEFT_PAREN";
        case TOKEN_RIGHT_PAREN:
            return "TOKEN_RIGHT_PAREN";
        case TOKEN_LEFT_BRACE:
            return "TOKEN_LEFT_BRACE";
        case TOKEN_RIGHT_BRACE:
            return "TOKEN_RIGHT_BRACE";
        case TOKEN_COMMA:
            return "TOKEN_COMMA";
        case TOKEN_DOT:
            return "TOKEN_DOT";
        case TOKEN_MINUS:
            return "TOKEN_MINUS";
        case TOKEN_MINUS_MINUS:
            return "TOKEN_MINUS_MINUS";
        case TOKEN_PLUS:
            return "TOKEN_PLUS";
        case TOKEN_PLUS_PLUS:
            return "TOKEN_PLUS_PLUS";
        case TOKEN_SEMICOLON:
            return "TOKEN_SEMICOLON";
        case TOKEN_SLASH:
            return "TOKEN_SLASH";
        case TOKEN_STAR:
            return "TOKEN_STAR";
        case TOKEN_QUESTION:
            return "TOKEN_QUESTION";
        case TOKEN_COLON:
            return "TOKEN_COLON";
        case TOKEN_BANG:
            return "TOKEN_BANG";
        case TOKEN_BANG_EQUAL:
            return "TOKEN_BANG_EQUAL";
        case TOKEN_EQUAL:
            return "TOKEN_EQUAL";
        case TOKEN_EQUAL_EQUAL:
            return "TOKEN_EQUAL_EQUAL";
        case TOKEN_GREATER:
            return "TOKEN_GREATER";
            return "TOKEN_GREATER_EQUAL";
        case TOKEN_LESS:
            return "TOKEN_LESS";
        case TOKEN_LESS_EQUAL:
            return "TOKEN_LESS_EQUAL";
        case TOKEN_IDENTIFIER:
            return "TOKEN_IDENTIFIER";
        case TOKEN_STRING:
            return "TOKEN_STRING";
        case TOKEN_NUMBER:
            return "TOKEN_NUMBER";
        case TOKEN_AND_AND:
            return "TOKEN_AND_AND";
        case TOKEN_CLASS:
            return "TOKEN_CLASS";
        case TOKEN_ELSE:
            return "TOKEN_ELSE";
        case TOKEN_FALSE:
            return "TOKEN_FALSE";
        case TOKEN_FUNC:
            return "TOKEN_FUNC";
        case TOKEN_FOR:
            return "TOKEN_FOR";
        case TOKEN_IF:
            return "TOKEN_IF";
        case TOKEN_NIL:
            return "TOKEN_NIL";
        case TOKEN_PIPE_PIPE:
            return "TOKEN_PIPE_PIPE";
        case TOKEN_PRINT:
            return "TOKEN_PRINT";
        case TOKEN_RETURN:
            return "TOKEN_RETURN";
        case TOKEN_SUPER:
            return "TOKEN_SUPER";
        case TOKEN_THIS:
            return "TOKEN_THIS";
        case TOKEN_TRUE:
            return "TOKEN_TRUE";
        case TOKEN_VAR:
            return "TOKEN_VAR";
        case TOKEN_WHILE:
            return "TOKEN_WHILE";
        case TOKEN_ERROR:
            return "TOKEN_ERROR";
        case TOKEN_EOF:
            return "TOKEN_EOF";

    return "TOKEN_UNKNOWN";

void debug_token(Token *token) {
    char* lexeme = NEW_WITH_SIZE(char, token->length + 1);
    memcpy(lexeme, token->start, token->length);
    lexeme[token->length] = '\0';
    fprintf(stdout, "Token{%s, '%s'}\n", debug_token_type(token), lexeme);
    DELETE_TYPE(char, lexeme);

Now, we can try the lexer out. Let’s fill in the missing pieces in main.c. We need to do this in two places.

In run_file:

static void run_file(const char* file_path) {
    const char* source = read_file(file_path);

    Lexer lexer;
    lexer_init(&lexer, source);

    Token* token = lexer_next_token(&lexer);

    while (token->type != TOKEN_EOF) {
        token = lexer_next_token(&lexer);


    free((void*) source);

And in run_repl:

static void run_repl() {
    char line[MAX_LINE_LENGTH];
    memset(line, 0, MAX_LINE_LENGTH);

    fprintf(stdout, "\nASPL REPL\n\nType 'Ctrl+C', 'quit' or 'q' to exit.\n\naspl> ");

    while (true) {
        if (fgets(line, MAX_LINE_LENGTH, stdin) == NULL) {
            fprintf(stdout, "\n");

        if (memcmp(line, "quit", 4) != 0 && memcmp(line, "q", 1) != 0) {
            Lexer lexer;
            lexer_init(&lexer, line);

            Token* token = lexer_next_token(&lexer);

            while (token->type != TOKEN_EOF) {
                token = lexer_next_token(&lexer);


            fprintf(stdout, "aspl> ");
        } else {
            fprintf(stdout, "Bye\n");

If you build the project and start up the REPL you can try it out.


Type 'Ctrl+C', 'quit' or 'q' to exit.

aspl> var breakfast = "bacon";
Token{TOKEN_VAR, 'var'}
Token{TOKEN_IDENTIFIER, 'breakfast'}
Token{TOKEN_EQUAL, '='}
Token{TOKEN_STRING, '"bacon"'}
Token{TOKEN_EOF, ''}
aspl> q

Process finished with exit code 0

Grab the source code of the current state of the project if you like.

Next time, we start implementing parsing.

Stay tuned. I’ll be back.

Tags: lexing

← The interpreter framework Introduction to context-free grammars and parsing theory →

comments powered by Disqus