Attila's Blog

About writing your own programming language in C, Go and Swift

A minimal virtual machine - part 2

Jan 22, 2019 Comments

Categories: aspl

Last time we started working on the VM, we implemented the basic framework for executing bytecode, integrated the compiler into the VM, paved the way for our VM to execute bytecode. We added some helpers to read opcodes and constants from the bytecode and finally implemented a debugger / disassembler feature as well. Today we will finish up the work, and implement the necessary bytecode–executing functions so our VM will be able to execute all promised arithmetic operations.

The Stack

I have mentioned several times that our VM is stack based, but what does it mean in practice? Let’s try to explain it with a small example.

Let’s say we want to execute this piece of code written in ASPL (I know, we do not have any print functions yet, but we will use it for the sake of the example).

4 + 9 + 2 + 5;

If you remember how parsing works, and use the rules of precedence and associativity of the language, here is what’s going to happen:

  1. Evaluate 4
  2. Evaluate 9
  3. Evaluate 4 + 9 ( = 13)
  4. Evaluate 2
  5. Evaluate 13 + 2 ( = 15)
  6. Evaluate 5
  7. Evaluate 15 + 5

Evaluating constants, like 4 and 9 is easy as they do not require any computation from the VM, still, it needs to store these values somewhere so it can use them when evaluating the + operator. Now the question is where to store these “temporary” values, and how? As you have already guessed the answer is the stack, but we should examine this in a bit more details to make it very clear how it really works.

Let’s do an exercise. We start “evaluating” the code and at the same time let’s keep track of our values stored in the VM. The strange | | slots are representing memory locations we use for storing values inside the VM. On the left side are the steps from the source code. On the right are the values we are tracking. A value is added to the storage when it is first produced by either a constant or an addition operation. The length of the bars (| |) tracks for how long the value needs to be stored, and | - | means the value is consumed and the given memory slot is freed.

OP_CONSTANT 4 ................ | 4 |
OP_CONSTANT 9 ................ | 4 | 9 |
OP_ADD ....................... | - | - | 13 |
OP_CONSTANT 2 ................ | - | - | 13 | 2 |
OP_ADD ....................... | - | - | -- | - | 15 |
OP_CONSTANT 5 ................................. | 15 | 5 |
OP_ADD ........................................ | -- | - | 20 |

Representing the evaluation process this way makes it easy to understand how we store these values. However, if we really did it this way that would waste a lot of memory. Let’s rearrange the previous diagram so that when a value gets consumed we really “free” the memory slot it was stored in.

OP_CONSTANT 4 ................ | 4  | - |
OP_CONSTANT 9 ................ | 4  | 9 |
OP_ADD ....................... | 13 | - |
OP_CONSTANT 2 ................ | 13 | 2 |
OP_ADD ....................... | 15 | - |
OP_CONSTANT 5 ................ | 15 | 5 |
OP_ADD ....................... | 20 | - |

Okay, that’s better. What do we see here? First of all, there are no wasteful gaps in the memory anymore. More importantly, the values are consumed from the back towards the front, so values that are stored last are the first to be consumed… last–in, first–out… it’s a stack!

Every time we introduce a value, we push it onto the stack from the right. When they are consumed, they are popped off starting from the rightmost location.

Since our temporary values naturally have a stack-like behavior the VM uses a stack to store them. When an instruction produces a value we push it onto the stack and pop them when they need to be consumed.

The VM’s stack

Now we understand why the VM uses a stack and also how it uses it. It’s time to see how it is actually implemented.

First, we define the maximum size of the stack in vm.h

#define STACK_MAX   (256)

Yes, it is an arbitrary number but it should do for this toy language. If we have more values then STACK_MAX to be pushed onto the stack we will see the famous “stack overflow” error. Of course we could make the stack to grow dynamically, but for now, let’s keep things simple.

Next, let’s add the stack itself. Remember, the stack is supposed to store any kind of ASPL values so it’s going to be a Value* type array.

typedef struct vm_t {
    BytecodeArray code;
    uint8_t* ip;
    Value* stack[STACK_MAX];
    Value** stack_top;
} Vm;

This stack is built on a plain old C array. The bottom of the stack is at index 0 in the array. The stack naturally grows and shrinks as we add and remove values, so we need to have a pointer that shows us where the top of the stack is. This is done by the stack_top pointer.

An important note here: The stack_top pointer actually points to the memory slot past the top value on the stack. In other words it points to the stack location where the next value will be pushed into. This might seem odd, but necessary because in this way we can signal that the stack is empty by pointing at element 0 in the array.

Now that we have a stack we need functions to manage it.

void vm_reset_stack(Vm* vm) {
    vm->stack_top = vm->stack;
}

vm_reset_stack will be used to reset the stack_top pointer and thus effectively clear all values from the stack. As the stack size is predefined we do not necessarily have to worry about actually clearing the memory slots (we will simply set it to NULL without freeing the memory), it is enough to make sure we won’t access them until new values need to be stored in them.

Now we need to be able to push and pop values from the stack.

void vm_push(Vm* vm, Value* value) {
    *vm->stack_top = value;
    vm->stack_top++;
}

Value* vm_pop(Vm* vm) {
    *vm->stack_top = NULL;
    vm->stack_top--;
    Value* value = *vm->stack_top;
    return value;
}

These simply add or remove values, exactly like normal stacks do. Remember, stack_top points to the slot past the top value, exactly where the next value should be placed into. So vm_push simply needs to add the new value to the location pointed by stack_top and then increase it by one. vm_pop does quite the opposite. It decreases the stack_top pointer first, and then returns the value pointed by it.

Sometimes it’s also useful to have a look-ahead feature for the same reasons we had while implementing lexing and parsing.

Value* vm_peek(Vm* vm, uint8_t distance) {
    return vm->stack_top[-1 - distance];
}

vm_peek is able to look ahead on the stack without actually popping off the values from it. This is useful when we want to make sure that the operands of certain operations are of the right kind, for example.

Now that we have a working stack we can finally extend our debugger in vm.c to debug the current state of the stack as well. Here is the new version of vm_trace_execution:

#if defined DEBUG_TRACE_EXECUTION
static void vm_trace_execution(Vm* vm) {
    printf("== Stack ==    ");

    for (Value** slot = vm->stack; slot < vm->stack_top; slot++) {
        printf("[");
        value_print(*slot);
        printf("] ");
    }

    printf("\n");

    debug_disassemble_instruction(&vm->code, (int) (vm->ip - vm->code.code.values));
}
#endif

We added a loop that will go through all occupied slots of the stack and prints their values. We start printing them from the bottom of the stack, so the value on the right side is the one to be popped first. This allows us to observe the effect of each instruction on the stack as they are executed.

A simple arithmetic calculator

We have everything in place to implement all of the supported operations. Let’s see how to execute negation, addition, subtraction, multiplication and division.

Last time we introduced the vm_opcode_functions array that holds the functions for every instruction we have. Let’s start adding the ones we need.

static void op_nop(Vm* vm, bool* should_return, VmInterpretResult* result) {

}

static void op_constant(Vm* vm, bool* should_return, VmInterpretResult* result) {

}

static void op_add(Vm* vm, bool* should_return, VmInterpretResult* result) {

}

static void op_subtract(Vm* vm, bool* should_return, VmInterpretResult* result) {

}

static void op_multiply(Vm* vm, bool* should_return, VmInterpretResult* result) {

}

static void op_divide(Vm* vm, bool* should_return, VmInterpretResult* result) {

}

static void op_negate(Vm* vm, bool* should_return, VmInterpretResult* result) {

}

static void op_return(Vm* vm, bool* should_return, VmInterpretResult* result) {

}

const VmOpcodeFunction vm_opcode_functions[] = {
        op_nop,             // OP_NOP
        op_constant,        // OP_CONSTANT,
        op_add,             // OP_ADD,
        op_subtract,        // OP_SUBTRACT,
        op_multiply,        // OP_MULTIPLY,
        op_divide,          // OP_DIVIDE,
        op_negate,          // OP_NEGATE,
        op_return,          // OP_RETURN
};

There. These are all the instructions we support for now. The implementations are empty, so in the next steps we start adding relevant code that interprets our opcodes.

The No-op instruction

This one, as its name suggests, is a no-operation so we leave it empty. Sometimes it might be useful but for now we won’t use it.

The return instruction

static void op_return(Vm* vm, bool* should_return, VmInterpretResult* result) {
    vm_pop(vm);
    *should_return = true;
    *result = VM_INTERPRET_OK;
}

As we said, the return operation will instruct the VM to return from the current function. For now, we do not have functions, so it will mean to finish executing and let the VM to return. We set should_return to true, and also provide a result of VM_INTERPRET_OK. We also do some cleaning up, and we make sure we leave the stack in a pristine condition, so we pop the current value from it.

Executing constants

In this case, executing means that we take the value of the constant from the bytecode and store it in the stack for further procesing.

static void op_constant(Vm* vm, bool* should_return, VmInterpretResult* result) {
    Value* constant = vm_read_constant(vm);
    vm_push(vm, constant);
}

Executing unary prefix operators

Let’s see how negation is done.

static void op_negate(Vm* vm, bool* should_return, VmInterpretResult* result) {
    if (!IS_NUMBER(vm_peek(vm, 0))) {
        *should_return = true;
        *result = VM_INTERPRET_RUNTIME_ERROR;
        vm_runtime_error(vm, "Operand must be a number");
        return;
    }

    Value* value = vm_pop(vm);

    if (IS_INTEGER(value)) {
        AS_INTEGER(value) = -AS_INTEGER(value);
    } else if (IS_DOUBLE(value)) {
        AS_DOUBLE(value) = -AS_DOUBLE(value);
    }

    vm_push(vm, value);
}

First, we have to make sure that the operand of the negation operation is a number. If not, we return with a runtime error. If the operand is a number we get it from the stack. We negate it according to its type. It’s going to be either an integer or double. Finally, push the result onto the stack.

There is a function we haven’t defined yet, vm_runtime_error:

void vm_runtime_error(Vm* vm, const char* format, ...) {
    va_list args;
    va_start(args, format);
    vfprintf(stderr, format, args);
    va_end(args);
    fputs("\n", stderr);

    size_t instruction = vm->ip - vm->code.code.values - 1;
    fprintf(stderr, "[line %llu] in script\n", vm->code.lines.values[instruction]);

    vm_reset_stack(vm);
}

This one simply displays some error message along with some debug information: the line number and the instruction that caused the error.

Executing binary infix operators

These operations will do almost exactly the same, so we create a helper function that they can call.

void vm_binary_operation(Vm* vm, VmBinaryOperation operation, bool* should_return, VmInterpretResult* result) {
    if (IS_NUMBER(vm_peek(vm, 0)) && IS_NUMBER(vm_peek(vm, 1))) {
        Value* b = vm_pop(vm);
        Value* a = vm_pop(vm);

        if (IS_DOUBLE(a) && !(IS_DOUBLE(b))) {
            b = value_new_double_value((long double) AS_INTEGER(b));
        }

        if (IS_DOUBLE(b) && !(IS_DOUBLE(a))) {
            a = value_new_double_value((long double) AS_INTEGER(a));
        }

        Value* new_value = NULL;

        switch (operation) {
            case VM_BIN_OP_ADD:
                if (IS_DOUBLE(a) && IS_DOUBLE(b)) {
                    new_value = value_new_double_value(AS_DOUBLE(a) + AS_DOUBLE(b));
                } else {
                    new_value = value_new_integer_value(AS_INTEGER(a) + AS_INTEGER(b));
                }
                break;
            case VM_BIN_OP_SUBTRACT:
                if (IS_DOUBLE(a) && IS_DOUBLE(b)) {
                    new_value = value_new_double_value(AS_DOUBLE(a) - AS_DOUBLE(b));
                } else {
                    new_value = value_new_integer_value(AS_INTEGER(a) - AS_INTEGER(b));
                }
                break;
            case VM_BIN_OP_MULTIPLY:
                if (IS_DOUBLE(a) && IS_DOUBLE(b)) {
                    new_value = value_new_double_value(AS_DOUBLE(a) * AS_DOUBLE(b));
                } else {
                    new_value = value_new_integer_value(AS_INTEGER(a) * AS_INTEGER(b));
                }
                break;
            case VM_BIN_OP_DIVIDE:
                if (IS_DOUBLE(a) && IS_DOUBLE(b)) {
                    new_value = value_new_double_value(AS_DOUBLE(a) / AS_DOUBLE(b));
                } else {
                    new_value = value_new_integer_value(AS_INTEGER(a) / AS_INTEGER(b));
                }
                break;
            default:
                break;
        }

        vm_push(vm, new_value);

        return;
    }

    *should_return = true;
    *result = VM_INTERPRET_RUNTIME_ERROR;
    vm_runtime_error(vm, "Operands must be numbers");
}

As for negation, the first thing we do is a type check to make sure the operands are numbers. Then we pop the two operands from the stack. Then we do a double–integer conversion if the type of the operands are different. Then, based on the instruction, we do the desired operation, and finally, push the new value onto the stack. If the operands are not numbers, we return with a runtime error.

Now it’s easy to implement the remaining operators:

static void op_constant(Vm* vm, bool* should_return, VmInterpretResult* result) {
    Value* constant = vm_read_constant(vm);
    vm_push(vm, constant);
}

static void op_add(Vm* vm, bool* should_return, VmInterpretResult* result) {
    vm_binary_operation(vm, VM_BIN_OP_ADD, should_return, result);
}

static void op_subtract(Vm* vm, bool* should_return, VmInterpretResult* result) {
    vm_binary_operation(vm, VM_BIN_OP_SUBTRACT, should_return, result);
}

static void op_multiply(Vm* vm, bool* should_return, VmInterpretResult* result) {
    vm_binary_operation(vm, VM_BIN_OP_MULTIPLY, should_return, result);
}

static void op_divide(Vm* vm, bool* should_return, VmInterpretResult* result) {
    vm_binary_operation(vm, VM_BIN_OP_DIVIDE, should_return, result);
}

Putting it all together

That’s it, we did it! We have a fully working interpreter with a working frontend and backend. Let’s try it out!

In main.c we replace the existing code and put out the shiny new vm to use. Replace run_repl:

static void run_repl() {
    Vm vm;

    vm_init(&vm);

    char line[MAX_LINE_LENGTH];
    memset(line, 0, MAX_LINE_LENGTH);

    fprintf(stdout, "\nASPL REPL\n\nType 'Ctrl+C', 'quit' or 'q' to exit.\n\naspl> ");

    while (true) {
        if (fgets(line, MAX_LINE_LENGTH, stdin) == NULL) {
            fprintf(stdout, "\n");
            break;
        }

        if (memcmp(line, "quit", 4) != 0 && memcmp(line, "q", 1) != 0) {
            vm_interpret(&vm, line);
            fprintf(stdout, "aspl> ");
        } else {
            fprintf(stdout, "Bye\n");
            break;
        }
    }

    vm_free(&vm);
}

…and run_file as well:

static void run_file(const char* file_path) {
    char* source = read_file(file_path);

    Vm vm;
    vm_init(&vm);
    VmInterpretResult result = vm_interpret(&vm, source);
    vm_free(&vm);

    DELETE_TYPE(char, source);

    if (result == VM_INTERPRET_COMPILE_ERROR) {
        exit(COMPILE_ERROR);
    } else if (result == VM_INTERPRET_RUNTIME_ERROR) {
        exit(RUNTIME_ERROR);
    }
}

Now, if we did everything right, the app should compile and we can start testing the language with all kinds of arithmetic operators. We just created a basic arithmetic calculator :D

Let’s test it with something simple!

ASPL REPL

Type 'Ctrl+C', 'quit' or 'q' to exit.

aspl> 1 + 2
== Stack ==    
00000000       1 OP_CONSTANT                   0   '1'
== Stack ==    [1] 
00000002       | OP_CONSTANT                   1   '2'
== Stack ==    [1] [2] 
00000004       | OP_ADD
== Stack ==    [3] 
00000005       | OP_RETURN

…and now with something more complex:

ASPL REPL

Type 'Ctrl+C', 'quit' or 'q' to exit.

aspl> 2 + 4 * 3 - 9 / -3
== Stack ==    
00000000       1 OP_CONSTANT                   0   '2'
== Stack ==    [2] 
00000002       | OP_CONSTANT                   1   '4'
== Stack ==    [2] [4] 
00000004       | OP_CONSTANT                   2   '3'
== Stack ==    [2] [4] [3] 
00000006       | OP_MULTIPLY
== Stack ==    [2] [12] 
00000007       | OP_ADD
== Stack ==    [14] 
00000008       | OP_CONSTANT                   3   '9'
== Stack ==    [14] [9] 
00000010       | OP_CONSTANT                   4   '3'
== Stack ==    [14] [9] [3] 
00000012       | OP_NEGATE
== Stack ==    [14] [9] [-3] 
00000013       | OP_DIVIDE
== Stack ==    [14] [-3] 
00000014       | OP_SUBTRACT
== Stack ==    [17] 
00000015       | OP_RETURN

I know, it’s not much of a language as of right now, but this is still a very important milestone. We have leared how to implement the frontend and backend of a programming language, we know how lexing and parsing works. We know how to compile an AST into bytecode, and finally, we now understand how to execute the bytecode.

That’s a lot of knowledge, and I belive, congratulations are in order. :)

We will not stop here, though. We will keep extending the language and slowly we will implement every feature we planned.

Also, our method will change a bit from now going on.

So far, we focused on learning how to implement specific building blocks of a language, like lexing, parsing, AST, code generation, and code execution in a vm. This was necessary so we could understand the basic steps needed to implement a language. But now we have all this knowledge, so we can start focusing on adding new features to the language.

This means when we decide to add a new feature, like supporting strings or supporting a “print” statement, we will add all the necessary changes in one go, by modifying the grammar rules, changing the lexer, parser, compiler, and vm until we have what we wanted.

Grab the source code of the current state of the project if you like.

Stay tuned. I’ll be back.

Tags: virtual machine

← A minimal virtual machine - part 1 Boolean expressions - extending the grammar →

comments powered by Disqus