Bytecode arrays

Last time we discussed what bytecode is and why we want to use it. Now it is time to learn how to store it effectively, so our future virtual machine will be able to execute it fast.

Let’s start with figuring out how to represent bytecodes. Last time we said that each instruction will be exactly one byte long. To be precise, each instruction has a one–byte operation code (”opcode” for short). The opcode defines what kind of instruction we are dealing with. Let’s create an appropriate data type for opcodes.

//
// Created by Attila Haluska on 9/7/18.
//

#ifndef ASPL_C_VM_OPCODE_H
#define ASPL_C_VM_OPCODE_H

typedef enum opcode_t {
    OP_NOP = 0,
    OP_CONSTANT = 1,
    OP_ADD = 2,
    OP_SUBTRACT = 3,
    OP_MULTIPLY = 4,
    OP_DIVIDE = 5,
    OP_NEGATE = 6,
    OP_RETURN = 7,
} Opcode;

#endif //ASPL_C_VM_OPCODE_H

For now, we define only a handful of instructions.

OP_NOP means “no operation”, it’s going to be useful later
OP_CONSTANT is going to store the operands for our operations
OP_ADD, OP_SUBTRACT, OP_MULTIPLY, OP_DIVIDE are the four basic arithmetic expressions
OP_NEGATE is the opcode for negation
OP_RETURN will cause the VM to return from the current function

We know that bytecode is a series of instructions. We have to store this sequence somehow. Let’s create a new data structure that will hold our bytecode, with the help of our generic containers.

//
// Created by Attila Haluska on 9/7/18.
//

#ifndef ASPL_C_BYTECODE_ARRAY_H
#define ASPL_C_BYTECODE_ARRAY_H

#include <stddef.h>
#include <stdint.h>

#include "forward_declarations.h"
#include "data_structures/uint8_array.h"
#include "data_structures/uint64_array.h"

typedef struct bytecode_array_t {
    Uint8Array code;
    Uint64Array lines;
} BytecodeArray;

#endif //ASPL_C_BYTECODE_ARRAY_H

BytecodeArray consists of two fields. code is a generic container that will hold our opcodes, this is why it stores uint8_t values. lines holds the line numbers for every opcode we store in the container. This container uses uint64_t type, so it can hold quite large line numbers. :) Of course we don’t know ahead of time how many opcodes we have to store, so this container has to be dynamic.

What is nice about this structure is that it’s going to be a continuous sequence of bytes in memory, which will play very nice with the CPU cache. It also gives us O(1) lookup, and O(1) average case append to the end of the container.

Now let’s add the usual memory management routines to it.

extern void bytecode_array_init(BytecodeArray* array) {
    uint8_array_init(&array->code);
    uint64_array_init(&array->lines);
}

extern void bytecode_array_free(BytecodeArray* array) {
    if (array->code.count > 0) {
        uint8_array_free(&array->code);
        uint64_array_free(&array->lines);
    }

    bytecode_array_init(array);
}

extern void bytecode_array_add_bytecode(BytecodeArray* array, uint8_t byte_code, uint64_t line) {
    uint8_array_add_value(&array->code, byte_code);
    uint64_array_add_value(&array->lines, line);
}

bytecode_array_init is for initializing the container. bytecode_array_free to release allocated memory after use. And finally bytecode_array_add_bytecode is to actually add an opcode and line number pair to the container. This is all we need for now.

We have our opcodes and the storage for it. But for now, we don’t have a compiler to produce any bytecode for us, neither a VM to execute it. So, temporarily let’s write some by hand. We also extend our debug facility to be able to print out the opcodes in a well-formatted, human-friendly manner so we can see what is stored inside the bytecode array.

Let’s add the debug feature first. debug.h gets two new functions.

extern void debug_disassemble_bytecode_array(BytecodeArray* array, const char* name);
extern int debug_disassemble_instruction(BytecodeArray* array, int offset);

And in debug.c their implementation:

static int simple_instruction(const char* name, int offset) {
    printf("%s\n", name);
    return offset + 1;
}

int debug_disassemble_instruction(BytecodeArray* array, int offset) {
    printf("%08d ", offset);

    if (offset > 0 && array->lines.values[offset] == array->lines.values[offset - 1]) {
        printf("      | ");
    } else {
        printf("%7lld ", array->lines.values[offset]);
    }

    uint8_t instruction = array->code.values[offset];

    switch (instruction) {
        case OP_NOP:
            return simple_instruction("OP_NOP", offset);
        case OP_RETURN:
            return simple_instruction("OP_RETURN", offset);
        default:
            printf("Unknown opcode %d\n", instruction);
            return offset + 1;
    }
}

void debug_disassemble_bytecode_array(BytecodeArray* array, const char* name) {
    printf("== <%s> bytecode start ==\n", name);
    printf("%8s %7s %-16s\n", "[offset]", "[line]", "[opcode]");

    for (int i = 0; i < array->code.count;) {
        i = debug_disassemble_instruction(array, i);
    }

    printf("== <%s> bytecode end ==\n", name);
}

Nothing fancy, it’s kind of a poor man’s disassembler that can handle ASPL’s bytecode format.

simple_instruction can print debug information about one single instruction, it’s name in this case. It also returns the offset the next instruction will start from.

debug_disassemble_instruction can disassemble one instruction of any kind. It prints the offset of the instruction. Offset is the location of the given opcode in the byte stream. Next, it does some magic to print the actual line number, or a | sign if the instruction is on the same line as the previous one. Next, it reads one byte from offset, this is the current opcode. For each kind of opcodes we will dispatch an appropriate utility function to handle it. For now, the only utility we have is simple_instruction. And finally the function returns the offset of the next instruction.

Why returning offsets instead of just using a good–old for loop, you ask? It’s going to be simpler this way, because later you will see that instructions can have different sizes.

But wait, you said that every opcode is one–byte long, right? Yes. But. Opcodes can and will have operands later, and those operands are considered to be part of the instruction. This is why I use the terms “opcode” and “instruction”, instead of just calling everything in the stream to be an opcode. An instruction consists of exactly one opcode and zero or more operands.

Finally debug_disassemble_bytecode_array is the one function we call to disassemble a whole bytecode array for us. It prints a small header, and then the offset of every instruction, the line number and the name of the opcode.

Now we have the means to debug a bytecode array, let’s try it out! Put this into main.c or actually anywhere else you like:

BytecodeArray bytecode;

bytecode_array_init(&bytecode);

bytecode_array_add_bytecode(&bytecode, OP_NOP, 1);
bytecode_array_add_bytecode(&bytecode, OP_RETURN, 1);
bytecode_array_add_bytecode(&bytecode, OP_RETURN, 2);

debug_disassemble_bytecode_array(&bytecode, "test");

bytecode_array_free(&bytecode);

If you run the interpreter you should see the output of debug_disassemble_bytecode_array:

== <test> bytecode start ==
[offset]  [line] [opcode]        
00000000       1 OP_NOP
00000001       | OP_RETURN
00000002       2 OP_RETURN
== <test> bytecode end ==

Success! We are nicely progressing towards a full bytecode implementation. What is missing for now is to be able to add operands to our instructions.

Why are they necessary? We have an opcode for the addition operation for example, right? It’s OP_ADDITION. Ok, but if I want to actually execute 1 + 2 where do I put my operands 1 and 2? This is exactly what we’ll learn next time.

Grab the source code of the current state of the project if you like.

Stay tuned. I’ll be back.

Attila's Blog

About writing your own programming language in C, Go and Swift

Bytecode arrays