Compiler basics

Well, I am back from a long vacation, ready to continue from where we stopped back in October. As a recap, we learned how to store bytecode along with its constants. All this time we wrote bytecode by hand because we did not have any means to generate it for us. This is going to change now.

Today we start implementing the basics of the compiler, and by the end of this post, we will be able to compile the four basic arithmetic expressions into bytecode. Exciting, isn’t it? Let’s jump into it.

We start by adding debugger support for our supported arithmetic operations, so we can see the debugger print the proper bytecode. Open debug.c and add the following statements to the debug_disassemble_instruction:

case OP_ADD:
    return simple_instruction("OP_ADD", offset);
case OP_SUBTRACT:
    return simple_instruction("OP_SUBTRACT", offset);
case OP_MULTIPLY:
    return simple_instruction("OP_MULTIPLY", offset);
case OP_DIVIDE:
    return simple_instruction("OP_DIVIDE", offset);
case OP_NEGATE:
    return simple_instruction("OP_NEGATE", offset);

With this addition we now have debugger support for these operations. Now we make a small modification to the parser as well. Replace the function signature in parser.h with this:

extern bool parser_parse(const char* source, AstNode** out);

The reason is simple: with this change, we can have some basic error checking built into the function. If it returns false we know parsing has failed and we cannot proceed with the compilation.

The new implementation of this function now looks like this:

bool parser_parse(const char* source, AstNode** out) {
    if (out != NULL) {
        ParsingContext context = {
                .lexer = {

                },
                .parser = {

                }
        };

        lexer_init(&context.lexer, source);
        parser_init(&context.parser);

        advance_to_next_token(&context);

        *out = parse_expression(&context);

        parser_free(&context.parser);
        lexer_free(&context.lexer);

        return !context.parser.had_error;
    }

    return true;
}

We also add a bit more precise error reporting as well in the error_at_token function, making a distinction between parse errors and syntax errors. The difference between syntax and parse errors is this: Syntax errors are issues happening in the lexer when the source code contains unsupported characters. On the other hand, a parse error happens in the parser, and it means that the source code contains only valid character sequences, but in such an incorrect order that it violated the grammar rules of the language. We will signal this to the users with this small change:

if (token->type == TOKEN_ERROR) {
    fprintf(stderr, "[line %llu] Syntax error", token->line);
} else {
    fprintf(stderr, "[line %llu] Parse error", token->line);
}

We finished all the preliminary work, and we can now start working on the compiler. But before we start coding let’s answer a couple of questions as a recap:

What is the job of the compiler? Well, in ASPL the compiler is responsible for generating bytecode from the source code.
How? The compiler will internally use the parser to parse the source code into AST nodes, and the parser will internally use the lexer to create tokens from the source code. After the compiler has access to the whole AST tree, it starts visiting each of the nodes and will translate them into bytecode. Important note: Visiting the AST nodes is a recursive process.

Okay, enough of the theory. Let’s see the compiler’s implementation. First, we create two files, compiler.h and compiler.c.

The header is very simple, it has only one function.

//
// Created by Attila Haluska on 9/8/18.
//

#ifndef ASPL_C_COMPILER_H
#define ASPL_C_COMPILER_H

#include <stdbool.h>

#include "forward_declarations.h"

extern bool compiler_compile_source(const char* source, BytecodeArray** bytecode);

#endif //ASPL_C_COMPILER_H

compiler_compile_source is the only public function of the compiler, this is the only way to communicate with the compiler from the outside. It expects the source code string as an input, and a pointer to a BytecodeArray* to place the generated bytecode into. By the double pointer, you can see that bytecode is an out parameter, and must point to a properly initialized BytecodeArray* variable to work properly.

Now we have a look inside compiler.c as well. First, we need a compilation context, which is a data structure that holds all the information necessary for the compiler to work properly.

typedef struct compilation_context_t {
    BytecodeArray* current_bytecode;
    AstNode* current_node;
    bool had_error;
} CompilationContext;

It contains a pointer to the current bytecode, the current AST node, and a flag to signal if there is an error in the process.

Next, let’s add error reporting to the compiler as well.

static void error(CompilationContext* context, const char* message) {
    context->had_error = true;

    fprintf(stderr, "[line %llu] Compile error", context->current_node->token->line);

    if (context->current_node->token->type == TOKEN_EOF) {
        fprintf(stderr, " at the end of file");
    } else {
        fprintf(stderr, " at '%.*s'", (int) context->current_node->token->length, context->current_node->token->start);
    }

    fprintf(stderr, ": %s\n", message);
}

Important to note that error is used only to report compiler errors, errors that happen during compilation, and not during lexing or parsing. We already have reporting in place to report that kind of errors. This works due to the fact that lexing and parsing happens before compiling, and if there are any problems reported during those phases, the compilation won’t even start.

As I mentioned, the compilation consists of two steps:

Visiting all AST nodes
Translating them to bytecode

Let’s have a look at the implementation in reverse, so we see the bytecode translation step first.

First and foremost, we need a function that can add something to a bytecode array.

static void emit_byte(CompilationContext* context, uint8_t byte) {
    bytecode_array_add_bytecode(context->current_bytecode, byte, context->current_node->token->line);
}

emit_byte simply utilizes the bytecode_array_add_bytecode function to add one byte of information and the current line number to the current bytecode array held by the current compilation context. Now that we have this, we can start building upon it.

static void emit_bytes(CompilationContext* context, uint8_t byte1, uint8_t byte2) {
    emit_byte(context, byte1);
    emit_byte(context, byte2);
}

emit_bytes is mostly used to emit a bytecode along with its byte-sized operand. Think about emitting bytecode for a constant. How would you do it? Let’s see.

Emitting code for constants consists of two steps. First, to actually create the constant, i.e. to add it to the constant storage of the current bytecode array.

static uint8_t make_constant(CompilationContext* context, Value* value) {
    uint16_t constant = bytecode_array_add_constant(context->current_bytecode, value);

    if (constant > CONSTANTS_MAX) {
        error(context, "Too many constants in one bytecode");
    }

    return (uint8_t) constant;
}

And the second step is to emit the OP_CONSTANT opcode along with the index of the just-created constant.

static void emit_constant(CompilationContext* context, Value* value) {
    emit_bytes(context, OP_CONSTANT, make_constant(context, value));
}

The way this works is that we create the constant and put it into the current bytecode array’s constant storage. That operation will give us back the location index of the constant, which can be used to locate and retrieve the value stored in that constant. With this information, we can now emit the OP_CONSTANT opcode and pass the index as its parameter. This will instruct the VM to load the constant at the “index-th” location from the bytecode array.

This is all we need for now to be able to generate bytecode for the four arithmetic operations. Let’s see the implementation of the second part now when we visit all AST nodes.

The entry point of the visitor implementation is a function that will have a switch that decides what kind of node the current node is and calls the appropriate visitor function for it.

static void compile_ast_node(CompilationContext* context) {
    switch (context->current_node->type) {
        case NODE_INTEGER:
            compile_integer_node(context);
            break;
        case NODE_DOUBLE:
            compile_double_node(context);
            break;
        case NODE_BINARY_INFIX_OPERATOR:
            compile_binary_infix_node(context);
            break;
        case NODE_UNARY_PREFIX_OPERATOR:
            compile_unary_prefix_node(context);
            break;
    }
}

For now, we only support numbers (both integer and double types), binary infix operators (+, -, /, *), and unary prefix operators (right now it’s only the - operator for negation).

Let’s see how to emit code for numbers.

static void compile_integer_node(CompilationContext* context) {
    Value* value = value_new_integer_value(context->current_node->as.integer);
    emit_constant(context, value);
}

static void compile_double_node(CompilationContext* context) {
    Value* value = value_new_double_value(context->current_node->as.double_precision);
    emit_constant(context, value);
}

We know that the current node is a number, so we create a new Value* with the correct number and data type, and we emit a constant for it.

Up next are the arithmetic operators.

Here is when things get interesting a bit. Important to remember that we are writing a stack-based VM, so we have to pay attention and process everything in the correct order.

Let’s try to explain this with an example. Say we are compiling a simple expression:

2 * 3

How are we compiling this into bytecode that will be executed by a stack-based VM?

First, we will push the left-hand side of the expression onto the stack (2). Next, the right-hand side (3). And last, the operation itself. Why? Because this way it will be very easy for the VM to execute this operation. Think about it. We push the two operands first. When the VM reaches the * operation in the bytecode we already have the necessary parameters on the stack. All we need is to pop them, right? After this, we have the two required operands in our hands and all we need to do is to execute the multiplication on those two numbers. Popping the operands also will clean up the stack. When the operation is ready we just push the result onto the stack and we are done.

Let’s see how this is implemented in code.

static void compile_binary_infix_node(CompilationContext* context) {
    AstNode* current_node = context->current_node;

    context->current_node = current_node->as.binary_infix.left;
    compile_ast_node(context);

    context->current_node = current_node->as.binary_infix.right;
    compile_ast_node(context);

    context->current_node = current_node;

    switch (current_node->token->type) {
        case TOKEN_PLUS:
            emit_byte(context, OP_ADD);
            break;
        case TOKEN_MINUS:
            emit_byte(context, OP_SUBTRACT);
            break;
        case TOKEN_STAR:
            emit_byte(context, OP_MULTIPLY);
            break;
        case TOKEN_SLASH:
            emit_byte(context, OP_DIVIDE);
            break;
        default:
            break;
    }
}

First, we save the current node into a temporary variable because we are about to override it. We said that we will first compile the left-hand side of the expression. Let’s make that node the current one then. After it, we call the entry point again (compile_ast_node) because we have no idea what kind of node we are talking about. Here you can see this is the recursive step in the compilation process.

After compiling the left-hand side node, we switch to the right-hand side node, making it current. Call compile_ast_node again and we have compiled both of our operands for our binary infix expression. Now we restore the original node and check what kind of operation we have. Then we emit the proper bytecode for it and we are done.

We are almost there, the only thing we have to do is to implement negation.

static void compile_unary_prefix_node(CompilationContext* context) {
    AstNode* current_node = context->current_node;

    context->current_node = current_node->as.unary_prefix.right;
    compile_ast_node(context);

    context->current_node = current_node;

    switch (context->current_node->token->type) {
        case TOKEN_MINUS:
            emit_byte(context, OP_NEGATE);
            break;
        default:
            break;
    }
}

This works pretty much the same way as compile_binary_infix_node, the difference is that it only has one operand that needs to be compiled before the negation operation.

We finished implementing the compiler that supports +, -, /, * and negation. All is left to implement the public interface of the compiler.

bool compiler_compile_source(const char* source, BytecodeArray** bytecode) {
    if (bytecode != NULL) {
        AstNode* rootNode;

        if (!parser_parse(source, &rootNode)) {
            return false;
        }

        CompilationContext context = {
                .current_bytecode = *bytecode,
                .current_node = rootNode,
                .had_error = false,
        };

        compile_ast_node(&context);

        return !context.had_error;
    }

    return true;
}

Here we start with checking that our bytecode variable points to a valid bytecode array. Then we try to parse the source code into an AST tree. If it fails, we simply return false signaling that the compilation has failed. The parser will take care of reporting any errors that have occurred. If the parsing is successful we then initialize the root compilation context and try to compile the AST tree, and finally, we return with true or false based on the compilation process is being successful or not.

There, we did it! We just wrote a basic bytecode compiler that supports arithmetic operations on numbers! Congrats!

Try it out, shall we? For this we have to modify main.c:

if (memcmp(line, "quit", 4) != 0 && memcmp(line, "q", 1) != 0) {
    BytecodeArray bytecode;
    BytecodeArray* bytecode_pointer = &bytecode;
    bytecode_array_init(bytecode_pointer);

    if (compiler_compile_source(line, &bytecode_pointer)) {
        debug_disassemble_bytecode_array(bytecode_pointer, "test");
    }

    bytecode_array_free(bytecode_pointer);

    fprintf(stdout, "aspl> ");
} else {
    fprintf(stdout, "Bye\n");
    break;
}

First, we initialize a bytecode array to store the result of the compilation. Then we compile the source and debug it if the compiler succeeds. Finally, we free our bytecode array.

If we did everything right here is what we should see when we run the app:

/Users/attilahaluska/Documents/Development/aspl-c/bin/aspl

ASPL REPL

Type 'Ctrl+C', 'quit' or 'q' to exit.

aspl> 2 * 3
== <test> bytecode start ==
[offset]  [line] [opcode]         [constant pos]   [constant value]
00000000       1 OP_CONSTANT                   0   '2'
00000002       | OP_CONSTANT                   1   '3'
00000004       | OP_MULTIPLY
== <test> bytecode end ==
aspl> q
Bye

Process finished with exit code 0

Our debugger shows that the compiler in fact works, and that it generated the correct bytecode. As I described before, the compiler first generated code for the two operands 2 and 3. Then for the * operation. This is all the VM will need to execute this expression.

Let’s see something more complicated:

/Users/attilahaluska/Documents/Development/aspl-c/bin/aspl

ASPL REPL

Type 'Ctrl+C', 'quit' or 'q' to exit.

aspl> 2 + 3 * 4 - 6 / -2
== <test> bytecode start ==
[offset]  [line] [opcode]         [constant pos]   [constant value]
00000000       1 OP_CONSTANT                   0   '2'
00000002       | OP_CONSTANT                   1   '3'
00000004       | OP_CONSTANT                   2   '4'
00000006       | OP_MULTIPLY
00000007       | OP_ADD
00000008       | OP_CONSTANT                   3   '6'
00000010       | OP_CONSTANT                   4   '2'
00000012       | OP_NEGATE
00000013       | OP_DIVIDE
00000014       | OP_SUBTRACT
== <test> bytecode end ==
aspl> q
Bye

Process finished with exit code 0

Try to replay this compilation in your mind to see what’s going on, this is a good way to understand how the VM will actually run our bytecode.

Next time, we will complete the interpreter by implementing the VM as well. When we finish, we will have a fully operational bytecode VM at our hands. Can’t wait to get there! :D

Grab the source code of the current state of the project if you like.

Stay tuned. I’ll be back.

Attila's Blog

About writing your own programming language in C, Go and Swift

Compiler basics