Attila's Blog

About writing your own programming language in C, Go and Swift

A minimal virtual machine - part 1

Jan 14, 2019 Comments

Categories: aspl

Last time we implemented the compiler, so we can now generate bytecode from the source code. To complete the cycle and implement a complete (limited functionality for now) interpreter for ASPL we still need a virtual machine. This is what we will do today.

Preliminary work

As usual, we start with some preliminary setup before jumping into the real implementation.

The first thing that we do is to add some new constants to our common.h to support the work of the virtual machine.

#define COMPILE_ERROR (65)
#define RUNTIME_ERROR (70)

#define DEBUG_TRACE_EXECUTION

COMPILE_ERROR and RUNTIME_ERROR will signal compilation and runtime errors respectively. They are the error codes the executable will return to the operating system if any error happen during executing some source code.

DEBUG_TRACE_EXECUTION is a debug macro. If it is set it will dump the contents of the current stack and the bytecode as they are executed inside the VM. Its main purpose is to aid debugging, but in the beginning, we will also use it to see the result of our arithmetic operations. It is necessary because we do not yet have any means to output anything onto the screen. For that, we will need to create a print or similar instruction, which we will implement a little bit later.

If we look back to the post when we implemented bytecode arrays we created an opcode called OP_RETURN, and we said this instruction will cause the VM to return from the current function. So far we haven’t used this opcode for anything. Now it is the time.

This might seem a little bit strange at first. We don’t have functions yet, so why would we need this? Well, any top-level code that is part of a script, or written directly into the REPL will be treated as if it would be a function. And for now, when the VM executes this opcode it will simply mean that we have reached the end of the program and the VM will return. So for now, the VM simply needs to know when to finish executing the bytecode.

How do we do this? Simple. We just emit this opcode from the end of the compilation process, after everything else was already compiled. Let’s make this simple change in the compiler. This is how the new implementation of compiler_compile_source looks like.

bool compiler_compile_source(const char* source, BytecodeArray** bytecode) {
    if (bytecode != NULL) {
        AstNode* rootNode;

        if (!parser_parse(source, &rootNode)) {
            return false;
        }

        CompilationContext context = {
                .current_bytecode = *bytecode,
                .current_node = rootNode,
                .had_error = false,
        };

        compile_ast_node(&context);
        emit_byte(&context, OP_RETURN);

        return !context.had_error;
    }

    return true;
}

All we did is to make sure that after the AST tree is compiled into bytecode via calling compile_ast_node(&context); we also emit an OP_RETURN in the next line.

Next up is a small change in value.h. We already have IS_DOUBLE and IS_INTEGER macros to see if a Value* is either an integer or a double. To make our lives a bit easier we add a new macro to see if a Value* is a number (either integer or double).

#define IS_NUMBER(value)    (IS_DOUBLE(value) || IS_INTEGER(value))

The virtual machine

That was all the required setup to prepare for the upcoming virtual machine implementation. Let’s start by creating vm.h and vm.c

typedef struct vm_t {
    BytecodeArray code;
} Vm;

extern void vm_init(Vm* vm);
extern void vm_free(Vm* vm);

We start with the basics and will gradually build the VM upon the basics. For now, our VM will be able to store the bytecode generated by the compiler. The VM will internally use the compiler, so there will be no need from outside to interact or create any compilers.

We also create functions to initialize and free our VM.

Next is a public function to actually run the source code:

extern VmInterpretResult vm_interpret(Vm* vm, const char* source);

This will receive an initialized instance of VM and the source code, and will try to execute it. It returns a VmInterpretResult constant whick looks like this:

typedef enum vm_interpret_result_t {
    VM_INTERPRET_OK = 0,
    VM_INTERPRET_COMPILE_ERROR,
    VM_INTERPRET_RUNTIME_ERROR,
} VmInterpretResult;

The result is either OK, a compile error or a runtime error. These errors will be used by the main function to determine the status code to be returned to the operating system.

Running the code will look like this for now:

VmInterpretResult vm_interpret(Vm* vm, const char* source) {
    BytecodeArray bytecode;
    BytecodeArray* bytecode_pointer = &bytecode;

    bytecode_array_init(bytecode_pointer);
    bool compile_successful = compiler_compile_source(source, &bytecode_pointer);

    VmInterpretResult result = VM_INTERPRET_COMPILE_ERROR;

    if (compile_successful) {
        vm->code = bytecode;
        vm->ip = vm->code.code.values;
        result = vm_run(vm);
    }

    bytecode_array_free(bytecode_pointer);

    return result;
}

First, we create and initialize a bytecode array that will store the generated bytecode. Next, we call compiler_compile_source that will compile the source code into bytecode. If the compilation is successful we will set things up in the vm to be ready to run the code. First, we store the compiled bytecode inside the VM, then we call vm_run (which is an internal helper) that will execute the instructions in the bytecode.

Now there is a vm->ip = vm->code.code.values line stuck in between these two. What the heck is that? Well, while executing the instructions the VM needs to keep track of where it is – the location of the instruction currently being executed.

This is going to be a pointer, and points into the bytecode array to some memory location that contains the current instruction. It is going to be a pointer to a byte.

This is called the instruction pointer or ip for short. Also called the program counter. In the line above we initialize the ip to point to the first instruction in the bytecode.

Important: ip points to the instruction to be executed next. This will be true for the entire life of the VM.

Now let’s add the ip pointer to our VM structure:

typedef struct vm_t {
    BytecodeArray code;
    uint8_t* ip;
} Vm;

Here comes the most important function of the VM:

static VmInterpretResult vm_run(Vm* vm) {
    for (;;) {
#if defined DEBUG_TRACE_EXECUTION
        vm_trace_execution(vm);
#endif

        Opcode instruction = (Opcode) vm_read_byte(vm);
        bool shouldReturn = false;
        VmInterpretResult result = VM_INTERPRET_OK;

        vm_opcode_functions[instruction](vm, &shouldReturn, &result);

        if (shouldReturn) {
            return result;
        }
    }
}

vm_run is the “engine” of the VM. This is the function that will keep executing our bytecode. It is basically an endless loop, that will read the next instruction from the bytecode and execute it in every iteration. To read the next instruction we use the vm_read_byte utility function. Next we execute it by vm_opcode_functions[instruction](vm, &shouldReturn, &result);, and finally if the result of the execution instructs the VM to finish running we exit the loop.

Now let’s have a look at the utility functions we haven’t explained yet.

Reading from the bytecode array

There are some utility functions that are used everywhere inside the VM. Let’s create a new file pair vm_functions.h and vm_functions.c to host these.

Let’s think about what we need to be able to execute arithmetic operations. First, we need to be able to read an opcode from the bytecode, right? And we need to read a stored constant as well. Let’s create the necessary utility functions.

This is how the declaration looks like.

extern uint8_t vm_read_byte(Vm* vm);
extern Value* vm_read_constant(Vm* vm);

And the definition:

uint8_t vm_read_byte(Vm* vm) {
    return *vm->ip++;
}

Value* vm_read_constant(Vm* vm) {
    return vm->code.constants.values[vm_read_byte(vm)];
}

These are simple enough.

vm_read_byte reads one byte from the bytecode. Remember that the instruction pointer we initialized (vm->ip) points to the next instruction to be executed. So all we need is to read the current byte from the ip, and make sure ip will point to the next instruction afterward.

vm_read_constant is slightly more complex. Do you remember how the bytecode array structure is built? Let’s recap.

typedef struct bytecode_array_t {
    Uint8Array code;
    Uint64Array lines;
    ValuePointerArray constants;
} BytecodeArray;

It has a code property that contains the bytecode. Also has a constants field that contains all the associated constants. constants itself is a ValuePointerArray that means it stores Value* types, and to actually read them we have to use the values field.

Now we refreshed our knowledge about generic arrays, we need to recap one more thing.

Remember when we learned how to emit bytecode for a constant? This is what we did:

static void emit_constant(CompilationContext* context, Value* value) {
    emit_bytes(context, OP_CONSTANT, make_constant(context, value));
}

So the bytecode for a constant consists of two bytes: OP_CONSTANT, followed by the index of the constant in the constants field of BytecodeArray.

Now we can easily understand what vm_read_constant does. By the time vm_read_constant is executed, the VM has already decoded OP_CONSTANT and called our handler as a result. Now what we need is to read and return the constant value itself. The index of the constant is held in the byte right after the OP_CONSTANT instruction, so the handler will read it, and use it to read the value from the constants field.

Executing instructions

Okay, now we need to figure out how the VM will execute the instructions it reads from the bytecode, so we will understand what vm_opcode_functions[instruction](vm, &shouldReturn, &result); really means. Well, I decided to separate the functions that implement opcodes into their own file. I know this is not the most efficient, but it provides nice separation and makes it easier to understand the code.

Let’s create two files: vm_opcode_functions.h and vm_opcode_functions.c. In the header we declare a function pointer with the following signature:

typedef void (* VmOpcodeFunction)(Vm* vm, bool* shouldReturn, VmInterpretResult* result);

VmOpcodeFunction will be a function that expects a valid VM instance as input and has two output variables shouldReturn and result. Each opcode will have its own VmOpcodeFunction responsible for the execution of given opcode. In other words, the body of each VmOpcodeFunction implements the behavior of the opcode. shouldReturn indicates that the VM should return from the current function, while result holds the actual result of the computation.

Now we declare an array of VmOpcodeFunctions that will hold the function implementations for each opcode.

extern const VmOpcodeFunction vm_opcode_functions[];

vm_opcode_functions holds each function under the same index as the integral value of the corresponding opcode constant. So for example vm_opcode_functions[OP_CONSTANT] contains the implementatin of the OP_CONSTANT opcode.

With this in place we now understand that vm_opcode_functions[instruction](vm, &shouldReturn, &result); simply selects a VmOpcodeFunction from the vm_opcode_functions array which is at the instructionth index and executes it.

Debugging instruction execution

We have the basic VM template in place, but we haven’t implemented any opcodes yet. Before we do that it would make sense to add some kind of debugging machinery to the VM so we would be able to see some textual information as a result of executing bytecodes inside the VM. At the moment we have no other means of figuring out what is going on inside the VM because we do not have anything that could write to the standard output. We will fix this later by implementing a print built-in function.

But until then, let’s make good use of the debugger we created earlier. The idea is that when the DEBUG_TRACE_EXECUTION macro is defined every instruction will be disassembled and printed out as they are executed in the VM. Additionally, we print out the current values on the VM`s stack as well.

This is what the vm_trace_execution does. Let’s see the implementation:

#if defined DEBUG_TRACE_EXECUTION
static void vm_trace_execution(Vm* vm) {
    debug_disassemble_instruction(&vm->code, (int) (vm->ip - vm->code.code.values));
}
#endif

For now, it just wraps the debug_disassemble_instruction function and spits out the bytecode for the current instruction. But what is that pointer magic inside? Simple. debug_disassemble_instruction needs an integer offset starting from the beginning of the bytecode, but our instruction pointer points directly to one of the opcodes, so we have to convert ip back to a relative offset from the beginning of the bytecode.

All right, this post is getting too large, so it is time to stop. I’m gonna cut this into two separate posts so it will be a bit easier to digest it. Next time we will continue by explaining how the VM’s stack will work and how we will build it. We will also finish the VM so it will be able to run actual source code that contains arithmetic operations.

Stay tuned. I’ll be back.

Tags: virtual machine

← Compiler basics A minimal virtual machine - part 2 →

comments powered by Disqus