Attila's Blog

About writing your own programming language in C, Go and Swift

Bytecode constants

Oct 29, 2018 Comments

Categories: aspl

Last time, we discussed how we store our bytecode. Our current bytecode storage is somewhat limited, though. We are not yet able to use operands for our operations. It is time to fix this problem.

Constants

So, what is the problem we are facing with exactly? We can store code, right? Then what? Well, indeed, we can store code, but not data. Why do we need that at all?

Let’s demonstrate with an example:

1 + 2

Simple enough, right? Let’s see what the issue is with our current bytecode implementation. We can store + in the bytecode via OP_ADD. But in this source code we have two operands as well, two integer literals, 1 and 2. Without storing them somehow, the VM won’t be able to execute this operation. For our VM to be able to understand these literals we will need a special instruction, one that stores the literals and makes our VM understand what to do.

In the previous post, we added a couple of opcodes without very detailed explanation about their purpose. There was one, called OP_CONSTANT. This is the opcode we will use to store these literals in the bytecode, and this opcode will also make our VM to load and use them.

Any kind of literals that are found in the source code needs to be stored in the bytecode. These literals are called constants.

Theoretically, we have several ways to store this data. For small pieces, like integers we could just write the data right after the opcode, why not? There is a problem, though. This would work only for small, fixed size data pieces.

This won’t work well for large pieces of data, especially if the size is dynamic. Strings, for example. A couple of posts earlier we said that bytecode is called bytecode because we are working with byte-sized pieces of information. Now, a 3Gb piece of text can hardly be called byte sized, right? We need a better way to store constants.

Well, in native code compilers these constants are stored in a separate region of the executable, and there is an instruction to load the constants. This load instruction usually has an operand, that is an offset that points to the memory location of the constant to load.

If you are a little bit familiar with assembly, you may remember writing your instructions into the .text segment, and your initialized data into the .data segment. Same thing we just discussed. Instructions and data are stored separately, and referencing data pieces happens via offsets / pointers.

Most virtual machines are doing something similar. For example, JVM has a so-called constant pool. Honestly, if this works for Java it’s going to be good enough for us as well. For the sake of consistency, we are going to store all of our constants here, even integers.

Values

We need to do a little detour now. We now understand what constants are, and why we need them. We have an idea about how to store them. Now we have to start thinking about how we are going to represent them inside the VM. Why? Because without knowing exactly what their internal structure will look like we won’t be able to store them, right? We cannot work with something if we don’t know what it is.

So let’s talk about data types a bit. ASPL is dynamically typed, which means any variable can hold any of the supported data types. This means we have to find an internal representation that can do this for us.

What do we need exactly? Two pieces of information. The value of the constant, and it’s data type. Why do we need the type? We should be able to decide if an operation is valid, right? For example, if we try to multiply 12 by false, what should be the result? It’s going to be an error because we will check that the data types are not compatible with the multiply operation. We also need to store the value itself. (Thank you, Captain Obvious!)

There are several ways to bundle this information together, but for now, we will go with the simplest (also a classic) solution C can offer: tagged unions.

Our values will consist of two parts. The type information and the actual value. For now, our type tag supports only integers and doubles.

typedef enum value_type_t {
    VAL_DOUBLE,
    VAL_INTEGER,
} ValueType;

This is easy. The trick comes when we want to store the value itself. How can we do it without wasting memory? This is where unions come in the picture.

A union is a special data type, that stores different data types in the same memory location. You can define multiple members, but only one of them can hold a value at any given time (because they all stored at the same memory location).

With this knowledge, let’s see how our tagged union will look like:

typedef struct value_t {
    ValueType type;
    union {
        double double_value;
        int64_t integer_value;
    } as;
} Value;

type is the actual tag, which tells us what the data type will be. as is the actual union that stores our data. This is nice because it’s going to be efficient, and easy to use.

For example, when we store an integer, this is how we can access it:

value->as.integer_value

There are more efficient ways to do this, but it’s going to be good enough for now.

We also introduce some helper functions that make it easier to deal with this new type:

extern Value* value_new_double_value(double value);
extern Value* value_new_integer_value(int64_t value);
extern void value_delete(Value* value);
extern void value_print(const Value* value);

value_new_double_value and value_new_integer_value are for creating new values out of integers and doubles, value_delete is to free their allocated memory, and value_print is to print their value. We will need all this later.

Also a couple of small helper macros to be able to check for a given type, and retrieve the value for that type:

#define IS_DOUBLE(value)    (value)->type == VAL_DOUBLE
#define IS_INTEGER(value)   (value)->type == VAL_INTEGER

#define AS_DOUBLE(value)    (value)->as.double_value
#define AS_INTEGER(value)   (value)->as.integer_value

Constant arrays

Now back to the constants. By using Value we just created, we know exactly how we are going to store our constants. We can now create the actual storage mechanism for them. We will store our constants alongside our bytecode but in another array. So we will need a dynamic array that stores our Value instances. It’s going to be easy by utilizing the generic array implementation we created.

Here is the new storage:

//
// Created by Attila Haluska on 9/7/18.
//

#ifndef ASPL_C_VALUE_POINTER_ARRAY_H
#define ASPL_C_VALUE_POINTER_ARRAY_H

#if defined __GENERIC_ARRAY_TYPE__
    #undef __GENERIC_ARRAY_TYPE__
#endif

#if defined __GENERIC_ARRAY_STRUCT_NAME__
    #undef __GENERIC_ARRAY_STRUCT_NAME__
#endif

#if defined __GENERIC_ARRAY_NAME__
    #undef __GENERIC_ARRAY_NAME__
#endif

#define __GENERIC_ARRAY_TYPE__ Value*
#define __GENERIC_ARRAY_NAME__ ValuePointerArray
#define __GENERIC_ARRAY_STRUCT_NAME__ value_pointer_array

#include "data_structures/generic_array.h"

#endif //ASPL_C_VALUE_POINTER_ARRAY_H

Now we can modify our bytecode array to store the constants as well:

typedef struct bytecode_array_t {
    Uint8Array code;
    Uint64Array lines;
    ValuePointerArray constants;
} BytecodeArray;

We just added an extra container for our constants. We also need a new function so we can actually add some constants.

extern uint16_t bytecode_array_add_constant(BytecodeArray* array, Value* value);

This adds a constant to the bytecode and returns the index (offset) of the newly stored constant, so we know how to retrieve it later.

Now we can store the constants, and we also have the means to execute them, this is why we created the OP_CONSTANT opcode. When the VM sees this instruction it knows that it has to load the constant value from the bytecode array. How exactly? There are going to be some instructions that are not “simple”. By this, I mean they will have operands. Operands are stored immediately after the instruction in the bytecode.

Each opcode knows how many operands it has, and what they are for. The instruction OP_CONSTANT has one single byte operand, and it’s the offset of the constant in the constant storage array, so the VM knows which one to load.

We are almost done. The only thing left is to modify our debugger / disassembler to support constants as well.

First, we change the header we print out in debug_disassemble_bytecode_array in debug.c (- and + means we replace the - line with +):

-    printf("%8s %7s %-16s\n", "[offset]", "[line]", "[opcode]");
+    printf("%8s %7s %-16s %14s %s\n", "[offset]", "[line]", "[opcode]", "[constant pos]", "  [constant value]");

Next we extend debug_disassemble_instruction to support OP_CONSTANT:

case OP_CONSTANT:
    return constant_instruction("OP_CONSTANT", array, offset);

And finally we implement constant_instruction:

static int constant_instruction(const char* name, BytecodeArray* chunk, int offset) {
    uint8_t constant = chunk->code.values[offset + 1];
    printf("%-16s %14d   '", name, constant);
    value_print(chunk->constants.values[constant]);
    printf("'\n");
    return offset + 2;
}

If we did everything right, we should be able to add and debug constants in the bytecode. Let’s test it out. Add this somewhere in the code (mine is in main.c):

BytecodeArray bytecode;

bytecode_array_init(&bytecode);

bytecode_array_add_bytecode(&bytecode, OP_NOP, 1);

Value* value = value_new_integer_value(12);
uint16_t constant = bytecode_array_add_constant(&bytecode, value);

bytecode_array_add_bytecode(&bytecode, OP_CONSTANT, 2);
bytecode_array_add_bytecode(&bytecode, (uint8_t)constant, 2);

bytecode_array_add_bytecode(&bytecode, OP_RETURN, 3);

debug_disassemble_bytecode_array(&bytecode, "test");

bytecode_array_free(&bytecode);

The output of debug_disassemble_bytecode_array should be:

== <test> bytecode start ==
[offset]  [line] [opcode]         [constant pos]   [constant value]
00000000       1 OP_NOP
00000001       2 OP_CONSTANT                   0   '12'
00000003       3 OP_RETURN
== <test> bytecode end ==

This tells us that we have an OP_CONSTANT instruction in the bytecode that puts a constant with a value of 12 at index 0.

If you add this as well before the line with OP_RETURN:

Value* value2 = value_new_integer_value(33);
uint16_t constant2 = bytecode_array_add_constant(&bytecode, value2);

bytecode_array_add_bytecode(&bytecode, OP_CONSTANT, 2);
bytecode_array_add_bytecode(&bytecode, (uint8_t)constant2, 2);

The output is:

== <test> bytecode start ==
[offset]  [line] [opcode]         [constant pos]   [constant value]
00000000       1 OP_NOP
00000001       2 OP_CONSTANT                   0   '12'
00000003       | OP_CONSTANT                   1   '33'
00000005       3 OP_RETURN
== <test> bytecode end ==

We see that we have another constant in the pool now, with a value of 33 at index 1.

That’s all we wanted for today. We now have the means to store compiled code and data. The only problem is, we write our bytecode by hand, which is not going to work :D.

Next time we start working on the basics of the compiler, that will compile our AST into bytecode.

Grab the source code of the current state of the project if you like.

Stay tuned. I’ll be back.

← Bytecode arrays Compiler basics →

comments powered by Disqus