Attila's Blog

About writing your own programming language in C, Go and Swift

Parsing arithmetic expressions

Aug 28, 2018 Comments

Categories: aspl

Last time we fixed the grammar for basic arithmetic expressions. We also decided that a top-down operator precedence parser would be the best for parsing ASPL. Let’s start writing some code then!

Let’s grab the source code we have so far and add some new features.


We are going to need the rules of precedence. Let’s write it first.

// Created by Attila Haluska on 8/14/18.


typedef enum precedence_t {
    PREC_NONE = 0,
    PREC_COMMA,         // ,
    PREC_ASSIGNMENT,    // =
    PREC_TERNARY,       // ?:
    PREC_OR,            // or
    PREC_AND,           // and
    PREC_EQUALITY,      // == !=
    PREC_COMPARISON,    // < > <= >=
    PREC_TERM,          // + -
    PREC_FACTOR,        // * /
    PREC_EXPONENT,      // ^
    PREC_UNARY_PREFIX,  // ! - +
    PREC_UNARY_POSTFIX, // ! - +
    PREC_CALL,          // . () []
} Precedence;


These are all the different precedence levels we will use in ASPL but for now we are going to need only two of them, PREC_TERM and PREC_FACTOR.

Abstract Syntax Tree

Before we start working on the parser, let’s think about how we want to store our AST nodes. In C, there is no OOP, so we need to store a hierarchy of related objects without creating an object–oriented hierarchy of related objects. :)

// Created by Attila Haluska on 8/13/18.


#include <stdint.h>

#include "forward_declarations.h"

typedef enum ast_node_type_t {
} AstNodeType;

typedef struct ast_node_t {
    AstNodeType type;
    Token* token;
    union {
        struct {
            AstNode* left;
            AstNode* right;
        } binaryInfix;
        int64_t integer;
        long double double_precision;
    } as;
} AstNode;

#endif //ASPL_C_AST_NODE_H

First, we declare a new enum AstNodeType for the various AST node types. For now, we will have only three AST node types, one for integer literals, one for floating point literals, and one for binary expressions.

The AstNode struct is the interesting one. We store the type of the node and the associated token as well. Then, to minimize memory waste, we embed a union. The point of a union data structure in C is that it is capable of holding multiple different types of data but at any given time it can hold only one of them. All contained variables start at the same memory address, so it is not possible for a union to hold multiple different values at the same time. In higher level languages its equivalent may be called a variant or any.

The union will hold the actual data for the current node type. Currently we have three available types in it, an integer, a double, and a struct for representing a binary expression.

We need some helpers as well to aid us creating different types of AST nodes, and to help managing their memory allocations.

extern AstNode* ast_node_make_integer(Token* token);
extern AstNode* ast_node_make_double(Token* token);
extern AstNode* ast_node_make_binary_infix_expression(Token* token, AstNode* left, AstNode* right);
extern void ast_node_free(AstNode* node);

ast_node_make_integer creates an integer node, ast_node_make_double creates a floating point node, and ast_node_make_binary_infix_expression creates a binary expression node. Finally ast_node_free is used to release the dynamically allocated memory for any kind of AST nodes.

The implementation of these functions is rather straightforward, I am not going to waste too much time explaining.

// Created by Attila Haluska on 8/13/18.

#include "compiler/parser/ast/ast_node.h"

#include <stdlib.h>

#include "compiler/lexer/token.h"
#include "memory_management/memory.h"

AstNode* ast_node_make_integer(Token* token) {
    AstNode* node = NEW(AstNode);

    char* end;

    node->type = NODE_INTEGER;
    node->token = token;
    node->as.integer = strtoll(node->token->start, &end, 10);

    return node;

AstNode* ast_node_make_double(Token* token) {
    AstNode* node = NEW(AstNode);

    char* end;

    node->type = NODE_DOUBLE;
    node->token = token;
    node->as.double_precision = strtold(node->token->start, &end);

    return node;

AstNode* ast_node_make_binary_infix_expression(Token* token, AstNode* left, AstNode* right) {
    AstNode* node = NEW(AstNode);

    node->token = token;
    node->as.binaryInfix.left = left;
    node->as.binaryInfix.right = right;

    return node;

static void ast_node_free_integer_node(AstNode* node) {
    DELETE_TYPE(Token, node->token);
    DELETE_TYPE(AstNode, node);

static void ast_node_free_double_node(AstNode* node) {
    DELETE_TYPE(Token, node->token);
    DELETE_TYPE(AstNode, node);

static void ast_node_free_binary_infix_node(AstNode* node) {
    DELETE_TYPE(Token, node->token);
    DELETE_TYPE(AstNode, node);

void ast_node_free(AstNode* node) {
    switch (node->type) {
        case NODE_INTEGER:
        case NODE_DOUBLE:

Forward declarations

All right, we have an increasing number of different public data structures in the application. Oftentimes we would just want to use pointer references to these structures in other header files, so it makes sense to create forward declarations out of these.

// Created by Attila Haluska on 8/2/18.


// token.h
typedef enum token_type_t TokenType;
typedef struct token_t Token;

// lexer.h
typedef struct lexer_t Lexer;

// ast_node.h
typedef struct ast_node_t AstNode;

// precedence.h
typedef enum precedence_t Precedence;


This is just to make it easier to include pointer references to existing data structures in other header files. This will somewhat speed the compilation up, although it is quite insignificant in such small applications as this one.

Parsing tokens

Now that we have our prerequisites, we can start working on the parser itself. Its header is very simple, there is only one function exported for the application to use.

// Created by Attila Haluska on 8/13/18.


//#include <stdbool.h>

#include "forward_declarations.h"

extern AstNode* parser_parse(const char* source);

#endif //ASPL_C_PARSER_H

parser_parse expects the raw source code, and it will parse it into an AST tree. Note, that from this point we won’t need to use the lexer itself, because it will be embedded into the parser and the parser will use it directly to read the tokens on–demand.

Now, let’s see the most imprtant part of this post, the parser’s implementation. First, we need to declare a data structure for the parser.

typedef struct parser_t {
    Token *previous_token;
    Token *current_token;
    bool had_error;
    bool panic_mode;
} Parser;

This will be used during parsing to hold the current state of parsing. We need to store two tokens, the current and the previous one. Why? Because we need a lookahead in the parser as well.

Remember that I mentioned lookahead, when we implemented the lexer? In case of complex tokens, we needed a way to look ahead of the current symbol to help figure out what the token really is.

For example when the lexer sees the = symbol, it needs to look ahead and check the next one without consuming it. If the next symbol is = again, the lexer knows that the token is a comparison token. Otherwise, it’s an assignment token.

We need the same feature in the parser as well. This can be implemented in several different ways. The one I chose is to fire up the lexer before the parsing starts, so that the lexer is always one token ahead of the parser. This means that the parser will most often work with the previous token, and the lookahead token is going to be the current token.

The parser also has two flags for error handling. had_error will be used later to help decide if an error happened during parsing. panic_mode will be used for synchronizing the parser in case of an error.

What does it mean? When there is an error during parsing it means that the parser lost its way of understanding the structure of the source code, and it won’t be able to recognize the correct order of tokens anymore. We need to find a fixed point in the source code where the parser could pick up and continue.

Why, do you ask? It doesn’t make any sense, does it? We should just print an error message and stop the process right? Well, not really. Imagine your favorite compiler displaying you only one error message and aborting every time you make an error. It would be extremely annoying.

Imagine you made several errors. You try to parse the code. It reports an error. You fix that one, parse again. It reports another error, you fix that as well, and so on…

I think the right approach is to try gathering every possible error we possibly can before aborting the parsing process. The more information we can show, the easier for the user to fix them. This is why we will use the panic_mode flag to indicate that we had a problem, and we need to find the next “fixed” point in the source to continue the parsing from. We will talk about this more when we get to the point to actually implement it, right now we just put the basics in place.

Next we need a parsing context.

typedef struct parsing_context_t {
    Lexer lexer;
    Parser parser;
} ParsingContext;

ParsingContext is the data we will pass to every parsing related function. You may think of it as this or self in object–oriented languages. It provides the data the functions can work on.

Next, we define the structure of our parse functions.

typedef AstNode* (*NudParseFunction)(ParsingContext *context);
typedef AstNode* (*LedParseFunction)(ParsingContext *context, AstNode* left);

This declares two function pointer types. We need to have two different types of parse functions to be able to differentiate between the prefix and infix form of some tokens.

For example the - token can be used in both forms.

x = y - z

In this expression the - token is in infix position, and it means subtraction.


In this expression the - token is in prefix position, and in this case it means negation.

To be able to handle the difference, top-down operator precedence parsers allow to have different treatments of the same tokens in infix and prefix positions. In TDOP terminology the handler — the associated parse function, that is — of a token in prefix position is called nud or null denotation, and the handler of the token in infix position is called led or left denotation.

This is exactly what we’ve declared above. NudParseFunction is for the null denotation, LedParseFunction for the left denotation.

Next is a data structure to group our parse functions together.

typedef struct parse_rule_t {
    NudParseFunction nud;
    LedParseFunction led;
    Precedence precedence;
} ParseRule;

This structure holds the nud and led function for any given token, and also the associated precedence level.

These are all the data types we need for parsing. We can start working on the implementation.

Parsing helper functions

Dynamic memory allocation

As usual, we start with memory allocation.

static void parser_init(Parser* parser) {
    parser->previous_token = NULL;
    parser->current_token = NULL;
    parser->had_error = false;
    parser->panic_mode = false;

static void parser_free(Parser* parser) {
    DELETE_TYPE(Token, parser->current_token);

These are for initializing and releasing the parser.

inline static Token *get_current_token(ParsingContext *context) {
    return context->parser.current_token;

inline static Token *get_previous_token(ParsingContext *context) {
    return context->parser.previous_token;

These are two small inline helpers to allow us writing slightly less code. Not terribly important though. :) They return the current and previous tokens from the context.

Error reporting

Next, some minimal error handling and reporting.

static void error_at_token(ParsingContext *context, const Token *token, const char *message) {
    if (context->parser.panic_mode) {

    context->parser.panic_mode = true;
    context->parser.had_error = true;

    fprintf(stderr, "[line %llu] Error", token->line);

    if (token->type == TOKEN_EOF) {
        fprintf(stderr, " at the end of file");
    } else {
        fprintf(stderr, " at '%.*s'", (int) token->length, token->start);

    fprintf(stderr, ": %s\n", message);

static void error_at_previous_token(ParsingContext *context, const char *message) {
    error_at_token(context, get_previous_token(context), message);

static void error_at_current_token(ParsingContext *context, const char *message) {
    error_at_token(context, get_current_token(context), message);

error_at_token will print a formatted error message in case of parsing or lexing errors. If you remember, the lexer does not have any built–in error reporting, but it will create TOKEN_ERROR tokens when it encounters an error, and lets the parser to report those too. In these cases the error_message property of the token will be printed to the output. error_at_previous_token and error_at_current_token just helpers to make it easier to select the token the error is coming from.

Communicating with the lexer

Just like we had an advance function in the lexer, we write its equivalent in the parser as well.

static void advance_to_next_token(ParsingContext *context) {
    context->parser.previous_token = context->parser.current_token;

    for (;;) {
        context->parser.current_token = lexer_next_token(&context->lexer);

        if (get_current_token(context)->type != TOKEN_ERROR) {

        error_at_current_token(context, get_current_token(context)->error_message);

We save the current token into the previous field and ask for the next one from the lexer, and store it in the current token field. If we get a TOKEN_ERROR, we report it.

Parsing expressions

That’s all about helper functions. Now let’s see the details of top-down operator precedence parsing.

We are going to have one function to do all the expression parsing.

static AstNode* parse_expression(ParsingContext *context) {
	// What to put here?

This function will be able to parse any kind of expressions. In the last post we discussed the problem of precedence, remember? We will need something that can properly handle the rules of precedence. This is going to be it:

static AstNode* parse_precedence(ParsingContext *context, Precedence precedence) {
	// What to put here?

This function will allow us to parse expressions while honoring the rules of precedence at the same time. I mentioned that parse_expression should be able to parse any kind of expressions. How can we do this? Easy. By allowing it to parse expressions of the lowest level of precedence, which means every other levels are automatically included. We can now fill in the implementation detail in the parse_expression function.

static AstNode* parse_expression(ParsingContext *context) {
    return parse_precedence(context, PREC_ASSIGNMENT);

This essentially says to parse anything regardless of its precedence. Note that I am using PREC_ASSIGNMENT instead of PREC_COMMA. This is because PREC_COMMA is not needed anywhere, and there are no expressions (nor they ever planned to be) with this level. This level might get removed later.

Just as an overview, let’s see one more time what we are going to parse in this post:

  • Integer literals: 12, 36, 8584, etc.
  • Floating point literals: 13.46, 1.00, etc.
  • The four common arithmetic expressions
    • Addition: 15 + 43
    • Subtraction: 456 - 34
    • Multiplication: 75 * 43
    • Division: 24 / 2

We are going to create a table from the tokens and associated parsing functions. Some tokens will only have nud functions, and some both nud and led functions.

Parsing prefix expressions

Before we fill in the blank in our parse_expression function, let’s see how to parse number literals. They are prefix expressions, because they start with a given token, and there is nothing after them that needs further processing. To parse these, we need to provide a function of type NudParseFunction.

static AstNode* parse_number_literal_expression(ParsingContext *context) {
    Token *token = get_previous_token(context);

    if (token_is_double(token)) {
        return ast_node_make_double(token);
    } else {
        return ast_node_make_integer(token);

We assume the token for the number literal has already been consumed and stored in the previous_token field, so we retrieve that. Also, we use the two helpers from ast_node.h to convert the lexeme into integers or doubles. token_is_double is a helper in token.h to help decide if we have an integer or floating point lexeme in the token.

Later we will introduce additional prefix expressions:

  • string literals
  • keywords like true, false, this, nil
  • identifiers (variable names)
  • grouping (parentheses)
  • unary prefix operators, like -, --, ++

Parsing infix expressions

We are down to binary expressions. They consist of binary operators that are in infix position. These parsing functions will be a little bit different. With infix expressions we know from the first token what we are parsing. In contrast, with infix expressions we don’t even know that we are in the middle of a binary expression until after the left operand is already had parsed and we’ve found the binary operator in the middle.

Let’s demonstrate this with an example.

12 + 24

If we walk through the parse process we have to take the following steps:

  1. We call parse_expression which will notice that the first token is an integer, so it calls the parse_number_literal_expression function, and that returns an integer AST node.

OK, what’s next? We know that the next token is +. From this symbol we realize that we are inside of a binary expression, right? Then we must also realize, that the previous expression we parsed must be the left operand of this binary expression. All we need to do is to call a function associated with parsing binary expressions, and we have to pass the left operand to it, so it can properly parse the binary expression and create the AST node. To parse this we have to create a function of type LedParseFunction. Here it is:

static AstNode* parse_binary_operator_expression(ParsingContext *context, AstNode* left) {
    Token* previous = get_previous_token(context);
    ParseRule *rule = get_parse_rule(previous);
    AstNode* right = parse_precedence(context, rule->precedence);
    return ast_node_make_binary_infix_expression(previous, left, right);

By the time this functon is called, we already parsed the left operand, and consumed the token that indicates the type of expression. We can retrieve it from the previous token. Technically, the parse_binary_operator_expression function will parse only the remainder of the expression, that is the right operand. For this, we need to know the precedence level of the current expression. We can get it from the get_parse_rule helper function.

We also see what I was referring to, when I said that our grammar is recursive. We know, that the right–hand operand of the expression can be any expression that has a lower precedence than the precedence of the binary operator we are currently parsing. To achieve this, we have to call parse_precedence again, and pass our current level of precedence as a lower limit.

Here is an example:

2 * 3 + 4

When we parse the * expression, we have to make sure that the right–hand operand will be 3 and NOT 3 + 4. When we call parse_precedence with precedence PREC_FACTOR, it is higher than the precedence of PREC_TERM, so the function will make sure that it will not include the + expression.

Now we know how parse_precedence should work, let’s see the actual implementation:

static AstNode* parse_precedence(ParsingContext *context, Precedence precedence) {
    NudParseFunction prefix_rule = get_parse_rule(get_previous_token(context))->nud;

    if (prefix_rule == NULL) {
        error_at_previous_token(context, "Expect expression");

    AstNode* node = prefix_rule(context);

    while (precedence < get_precedence(get_current_token(context))) {
        LedParseFunction infix_rule = get_parse_rule(get_previous_token(context))->led;
        node = infix_rule(context, node);

    return node;

This function is the heart of a pratt parser. As you can see the implementation itself is much simpler than the theory behind it would suggest.

It starts with advancing to the next token. Then, we start by parsing the prefix expression. By definition, the first token has to belong to a prefix expression. Always. If there is no parse rule for the given token, it has to be an error token, so we report it. Otherwise, we call the nud function and parse the expression.

When we got out prefix expression, we move forward and search for the infix parser for the next token. If we find one, it means that the prefix expression might be an operand to it. The while loop will make sure that the precedence we have as a parameter, is the lower limit of our search. As far as the current precedence is less than the one associated with the next token, we include the next token as part of the current expression.

This way we make sure that the right–hand operand in an expression like 2 + 3 * 4 will be 3 * 4, but in an expression like 2 * 3 + 4 it will be 3.

Lastly, let’s see the implementation of the remaining helper functions we’ve already talked about.

static ParseRule *get_parse_rule(Token *token) {
    return &parse_rules[token->type];

static Precedence get_precedence(Token *token) {
    ParseRule *rule = get_parse_rule(token);
    return rule->precedence;

These are very simple, no explanation needed. The last thing we need is the table of parse rules I was referring to earlier.

static ParseRule parse_rules[] = {
        {NULL,                            NULL,  PREC_NONE},    // TOKEN_LEFT_PAREN,
        {NULL,                            NULL,  PREC_NONE},    // TOKEN_RIGHT_PAREN,
        {NULL,                            NULL,  PREC_NONE},    // TOKEN_LEFT_BRACE,
        {NULL,                            NULL,  PREC_NONE},    // TOKEN_RIGHT_BRACE,
        {NULL,                            NULL,  PREC_NONE},    // TOKEN_COMMA,
        {NULL,                            NULL,  PREC_NONE},    // TOKEN_DOT,
        {NULL, parse_binary_operator_expression, PREC_TERM},    // TOKEN_MINUS,
        {NULL,                            NULL,  PREC_NONE},    // TOKEN_MINUS_MINUS,
        {NULL, parse_binary_operator_expression, PREC_TERM},    // TOKEN_PLUS,
        {NULL,                            NULL,  PREC_NONE},    // TOKEN_PLUS_PLUS,
        {NULL,                            NULL,  PREC_NONE},    // TOKEN_SEMICOLON,
        {NULL, parse_binary_operator_expression, PREC_FACTOR},  // TOKEN_SLASH,
        {NULL, parse_binary_operator_expression, PREC_FACTOR},  // TOKEN_STAR,
        {NULL,                            NULL,  PREC_NONE},    // TOKEN_QUESTION,
        {NULL,                            NULL,  PREC_NONE},    // TOKEN_COLON,
        {NULL,                            NULL,  PREC_NONE},    // TOKEN_AND_AND,
        {NULL,                            NULL,  PREC_NONE},    // TOKEN_PIPE_PIPE,

        {NULL,                            NULL,  PREC_NONE},    // TOKEN_BANG,
        {NULL,                            NULL,  PREC_NONE},    // TOKEN_BANG_EQUAL,
        {NULL,                            NULL,  PREC_NONE},    // TOKEN_EQUAL,
        {NULL,                            NULL,  PREC_NONE},    // TOKEN_EQUAL_EQUAL,
        {NULL,                            NULL,  PREC_NONE},    // TOKEN_GREATER,
        {NULL,                            NULL,  PREC_NONE},    // TOKEN_GREATER_EQUAL,
        {NULL,                            NULL,  PREC_NONE},    // TOKEN_LESS,
        {NULL,                            NULL,  PREC_NONE},    // TOKEN_LESS_EQUAL,

        {NULL,                            NULL,  PREC_NONE},    // TOKEN_IDENTIFIER,
        {NULL,                            NULL,  PREC_NONE},    // TOKEN_STRING,
        {parse_number_literal_expression, NULL,  PREC_NONE},    // TOKEN_NUMBER,

        {NULL,                            NULL,  PREC_NONE},    // TOKEN_CLASS,
        {NULL,                            NULL,  PREC_NONE},    // TOKEN_ELSE,
        {NULL,                            NULL,  PREC_NONE},    // TOKEN_FALSE,
        {NULL,                            NULL,  PREC_NONE},    // TOKEN_FUNC,
        {NULL,                            NULL,  PREC_NONE},    // TOKEN_FOR,
        {NULL,                            NULL,  PREC_NONE},    // TOKEN_IF,
        {NULL,                            NULL,  PREC_NONE},    // TOKEN_NIL,
        {NULL,                            NULL,  PREC_NONE},    // TOKEN_PRINT,
        {NULL,                            NULL,  PREC_NONE},    // TOKEN_RETURN,
        {NULL,                            NULL,  PREC_NONE},    // TOKEN_SUPER,
        {NULL,                            NULL,  PREC_NONE},    // TOKEN_THIS,
        {NULL,                            NULL,  PREC_NONE},    // TOKEN_TRUE,
        {NULL,                            NULL,  PREC_NONE},    // TOKEN_VAR,
        {NULL,                            NULL,  PREC_NONE},    // TOKEN_WHILE,

        {NULL,                            NULL,  PREC_NONE},    // TOKEN_ERROR,
        {NULL,                            NULL,  PREC_NONE},    // TOKEN_EOF

Here we register our parsing functions, both nud and led, along with the associated precedence level.

Our parser is now complete. Now we need the public interface, so we can use it from outside.

AstNode *parser_parse(const char *source) {
    ParsingContext context = {
            .lexer = {

            .parser = {


    lexer_init(&context.lexer, source);


    AstNode* expression = parse_expression(&context);


    return expression;

Here, we create the parse context, fire up the lexer by reading one token ahead of time, and start parsing. The result is an AST node tree, containing every expressions we could parse from the source code. At the end, we free up the memory and do some cleanup.

Let’s test if it works! We will need a new debug function to pretty–print our AST nodes.

extern void debug_ast_node(AstNode* node);

I won’t explain the implementation, it’s just some string manipulation you can check in the source if you download it.

Last thing is to modify our main.c to use the parser instead of the lexer from now going forward.

Here is the modified run_repl function:

static void run_repl() {
    char line[MAX_LINE_LENGTH];
    memset(line, 0, MAX_LINE_LENGTH);

    fprintf(stdout, "\nASPL REPL\n\nType 'Ctrl+C', 'quit' or 'q' to exit.\n\naspl> ");

    while (true) {
        if (fgets(line, MAX_LINE_LENGTH, stdin) == NULL) {
            fprintf(stdout, "\n");

        if (memcmp(line, "quit", 4) != 0 && memcmp(line, "q", 1) != 0) {
            AstNode* node = parser_parse(line);

            fprintf(stdout, "aspl> ");
        } else {
            fprintf(stdout, "Bye\n");

and the modified run_file function:

static void run_file(const char* file_path) {
    char* source = read_file(file_path);

    AstNode* node = parser_parse(source);

    DELETE_TYPE(char, source);

If you execute the application and feed some valid arithmetic expression to it, you should see the AST tree printed. Note, we did not implement grouping (), so that won’t work properly.

Fire it up, and type this for example:


Type 'Ctrl+C', 'quit' or 'q' to exit.

aspl> 4 + 3 * 2 - 6 / 3

and you should see the result:

BinaryInfixOperatorExpressionNode {
  BinaryInfixOperatorExpressionNode {
    IntegerExpressionNode {
    Operator +
    BinaryInfixOperatorExpressionNode {
      IntegerExpressionNode {
      Operator *
      IntegerExpressionNode {
  Operator -
  BinaryInfixOperatorExpressionNode {
    IntegerExpressionNode {
    Operator /
    IntegerExpressionNode {

that translates to

4 + (3 * 2) - (6 / 3)

which means our parser correctly handles the rules of precedence. Try using parentheses as well:

aspl> 2 + (3 + 5)
[line 1] Error at '(': Expect expression

Process finished with exit code 11

So, error reporting also works. That’s it for now.

Grab the source code of the current state of the project if you like.

Next time, we will extend the parser to handle grouping via (), and we add support for negation (the unary prefix - operator) as well.

Stay tuned. I’ll be back.

← Fixing the ambiguity in the grammar Parsing grouping and negation →

comments powered by Disqus