Attila's Blog

About writing your own programming language in C, Go and Swift

Implementing strings

May 12, 2019 Comments

Categories: aspl
Tags: strings

Last time we modified ASPL’s core to add boolean expression support, that means our VM currently supports numbers (double and integer), booleans (true, false) and nil. It is time to tackle an important challenge and add string support as well. Let’s see how.

What is the most important difference between our existing types (numbers, booleans) and strings? Their size. All of them have very well defined size that is set into stone and can never change. Our numbers have a length of 64 bit, booleans and nil are currently stored on 32 bit (which is technically wasting resources, and can be optimized later). Strings are very different though because usually, we don’t know their size in advance. Of course, if there is a string hardcoded in the source code then we have the size at compile time, but those are created by the running application potentially could have unlimited size (well not exactly, the only requirement is that they have to fit in the computer’s memory). Another issue is that not every string is created equal. One might be 5 characters long, others may contain every character from a book. We have to deal with these requirements somehow.

There were languages (e.g. Pascal) that limited the possible length of strings to a certain size. Pascal stored strings in the following format: [size byte][characters...]. The size was stored at the beginning of the string on a single byte, and thus the maximum length of the string was limited to 255 characters.

No worries though, we can do much better than this. Dynamic memory allocation seems like a perfect fit for the job. We are going to store strings on the heap.

Values and Managed Objects

We are going to have to change the way our values work. Currently, every possible value is stored in the value_t structure directly. Also, all out value_t instances live on the stack so that the access to them is fast and we won’t waste resources by storing them on the heap. On the contrary, strings always live on the heap. This means from now we are going to represent our possible values in a 2 level hierarchy. Small values will live in the value_t structure on the stack as before, larger values (like strings) will live on the heap and our value_t structure will just have a pointer to them.

Okay, so what is that managed object I mentioned in the section title? Easy. Every value that is dynamically allocated will live on the heap. We will need a way to manage these objects somehow, right? We need a way to allocate them, track their usage and when they are not needed anymore free them. We will make this happen by implementing a garbage collector on our own. We will call every value (object) that is going to be managed by our garbage collector a managed object. The string type is just the first one of these, later we will add much more, like class instances, functions, etc.

All right, so we have to modify value_t a bit to be able to host strings. First, we add a new value type VAL_OBJECT to value_type_t.

typedef enum value_type_t {
    VAL_INVALID,
    VAL_NIL,
    VAL_DOUBLE,
    VAL_INTEGER,
    VAL_BOOLEAN,
    VAL_OBJECT,
} value_type_t;

Then we include a field into value_t that is a pointer to a managed object.

typedef struct value_t {
    value_type_t type;
    union {
        long double double_value;
        int64_t integer_value;
        bool boolean_value;
        managed_object_t *object_value;
    } as;
} value_t;

Okay, but what is managed_object_t? It is going to be our base data structure to store managed objects. Let’s see how it’s built.

First of all we need a way to defferentiate between different types of managed objects. An enum will do the job just fine. For now, it only has one possible kind of managed object, OBJ_STRING.

typedef enum managed_object_type_t {
    OBJ_STRING,
} managed_object_type_t;

Then we define managed_object_t as well.

typedef struct managed_object_t {
    managed_object_type_t type;
    managed_object_t *next_object;
} managed_object_t;

Wait, something is wrong here. This is not something we can store a string in, right? We need the start of the string in the memory, its length, etc. Also, what is that next_object pointer? I explain everything later, for now let’s just say that managed_object_t is the base of all our future managed objects. Think of it as it were the base class of all managed objects if we used an object–oriented approach. It contains shared data to be included in every future managed object we create.

Now let’s define our Managed String type as well.

typedef struct managed_string_t {
    managed_object_t object;
    bool is_copied;
    uint64_t length;
    char *characters;
} managed_string_t;

Way better, right? This one now has a pointer to the beginning of the string and can store the size as well. Still, why is there a managed_object_t field in our struct? Well, as I mentioned we will use a technique to implement a kind of a poor man’s inheritance. The technique in question is called type punning, and we will use it to build some fake inheritance between managed_object_t and managed_string_t. How does this work? The C language mandates that the fields of a structure are arranged in the memory in the order thay were defined.

Within a structure object, the non-bit-field members and the units in which bit-fields reside have addresses that increase in the order in which they are declared. A pointer to a structure object, suitably converted, points to its initial member (or if that member is a bit-field, then to the unit in which it resides), and vice versa. There may be unnamed padding within a structure object, but not at its beginning.

ISO/IEC 9899:201x 6.7.2.1 15

Now, why is this useful to us? If you look closely, notice that our first field is managed_object_t object in managed_string_t. If the memory layout of this structure is indeed ordered by its fields, that means it is safe to take a pointer to managed_string_t and convert it to managed_object_t and vice versa (of course it’s only safe to do so if we are sure that the pointer actually points to a managed_string_t).

We will use this fact to convert specific managed objects into the “generic” managed_object_t so we have access to its type field. By checking that field we will know exactly what kind of object we are dealing with. This makes it possible to use only a generic managed_object_t pointer in value_t, but we can still store lots of different managed objects while maintaining safe access to them.

Next, we create some functions to manage these new structures. I provide the source code without explanation, the code itself is quite easy to understand.

#define NEW_OBJECT(type, objectType)    (type*)new_object(sizeof(type), (objectType))

static managed_object_t *new_object(size_t size, managed_object_type_t type) {
    managed_object_t *object = (managed_object_t *) memory_reallocate(NULL, 0, size);
    object->type = type;
    return object;
}

managed_string_t *managed_object_new_string(const char *string, int length, bool copy) {
    managed_string_t *s = NEW_OBJECT(managed_string_t, OBJ_STRING);

    char *characters = (char *) string;

    if (copy) {
        characters = NEW_WITH_SIZE(char, length + 1);
        memcpy(characters, string, length);
        characters[length] = '\0';
    }

    s->length = length;
    s->characters = characters;
    s->is_copied = copy;

    return s;
}

void managed_object_delete_string(managed_string_t *string) {
    if (string->characters != NULL) {
        DELETE_DYNAMIC_ARRAY(char, string->characters, string->length + 1);
    }

    DELETE_TYPE(managed_string_t, string);
}

void managed_object_print(managed_object_t *object) {
    if (object != NULL) {
        switch (object->type) {
            case OBJ_STRING: {
                managed_string_t *string = (managed_string_t *) object;
                fprintf(stdout, "%.*s", (int) string->length, string->characters);
                break;
            }
            default:
                break;
        }
    }
}

void managed_object_free(managed_object_t *object) {
    if (object != NULL) {
        switch (object->type) {
            case OBJ_STRING:
                managed_object_delete_string((managed_string_t *) object);
                break;
            default:
                break;
        }
    }
}

void managed_object_free_object_list(managed_object_t *root_object) {
    managed_object_t *current = root_object;
    managed_object_t *next = NULL;

    while (current != NULL) {
        next = current->next_object;
        managed_object_free(current);
        current = next;
    }
}

The only interesting bit here is the managed_object_free_object_list function. This is actually in relation with the next_object field in managed_object_t. We are going to build a very simple linked list of our managed objects, and managed_object_free_object_list is going to be responsible to free all of these dynamic objects before the vm terminates. We will use this method until we implement our full–blown garbage collector, so we won’t leak memory in the meantime.

String support in the virtual machine

Now we have the solution of storing and managing dynamic managed objects. Let’s see what else we need to have full support inside our vm.

Luckily, the lexer already supports strings, so we won’t touch that part. We have to touch–up the parser a little bit, though. We add a function to parse string literals.

static ast_node_t *parse_string_literal_expression(parsing_context_t *context) {
    const token_t token = get_previous_token(context);
    return ast_node_make_string(token);
}

And we put this to use by adding it to parse_rules for the TOKEN_STRING row. We can now lex and parse string literals.

We also have to add a new AST node and its associated functions, because parsing will create a new AST node for our strings.

Let’s add NODE_STRING to ast_node_type_t.

typedef enum ast_node_type_t {
    NODE_INTEGER,
    NODE_DOUBLE,
    NODE_BINARY_INFIX_OPERATOR,
    NODE_UNARY_PREFIX_OPERATOR,
    NODE_BOOLEAN,
    NODE_NIL,
    NODE_STRING,
} ast_node_type_t;

Then we extend ast_node_t and add the following in the union.

struct {
    int length;
    const char *start;
} string;

And of course a new function to create a string node.

ast_node_t *ast_node_make_string(token_t token) {
    ast_node_t *node = NEW(ast_node_t);

    node->type = NODE_STRING;
    node->token = token;
    node->as.string.length = token.length;
    node->as.string.start = token.start;

    return node;
}

There we go. Easy. What else do we need? We have lexing, parsing and the AST node. Next is the bytecode, so we add a function to the compiler.

static void compile_string_node(compilation_context_t *context) {
    managed_string_t *string = managed_object_new_string(context->current_node->as.string.start + 1,
                                                         context->current_node->as.string.length - 2, true);
    managed_object_t *object = (managed_object_t *) string;
    vm_add_managed_object(context->vm, object);
    emit_constant(context, value_new_object_value(object));
}

A little explanation here. What are those + 1 and - 2? Well, the lexer and parser store the whole string literal, including the starting and ending ". We have to get rid of those, the vm won’t use them so we just chop them off. Also worth mentioning that managed_object_new_string will make a copy of the string from the source code. Why? Well, the source code is relatively short–lived, and it is deallocated after the compilation process. It would make little sense to keep it because potentially it could be huge. Easier to get rid of it as soon as possible, and just make a copy of the strings the app will use. Another reason could be for example, if we decided to separate the compilation from the actual execution, and say store the compiled bytecode on disk similarly to Java or Python.

Put the new function to good use by adding it to compile_ast_node.

case NODE_STRING:
    compile_string_node(context);
    break;

One more important change in the compiler. The linked list of objects will be stored in the vm, so we have to provide access to them from inside the compiler, as it will create new managed objects during compilation, and saves them in the linked list by calling vm_add_managed_object. For this, we have to pass in a valid vm instance for the compiler. For this, we first make space for the pointer in compilation_context_t.

typedef struct compilation_context_t {
    vm_t *vm;
    bytecode_array_t *current_bytecode;
    ast_node_t *current_node;
    bool had_error;
} compilation_context_t;

And modify compiler_compile_source to accept a vm_t pointer when compilation is requested by the vm.

extern bool compiler_compile_source(vm_t *vm, const char *source, bytecode_array_t **bytecode_inout);

When the vm compiles the source it passes itself to the compiler.

bool compile_successful = compiler_compile_source(vm, source, &bytecode_pointer);

String operations

Yay, we can now compile strings! What are we going to do with them, though? For now, let’s talk about some useful string manipulations that we will implement in the vm. First, let’s print them, so we can show their value while debugging the bytecode!

Printing string values

For this, we need to extend our value_print function. Remember that inside the vm we will deal with values (as value_t) because that is what’s stored with our bytecode. So, value_print receives a new case.

case VAL_OBJECT:
    managed_object_print(AS_OBJECT(*value));
    break;

And the way to print our string objects looks like this.

void managed_object_print(managed_object_t *object) {
    if (object != NULL) {
        switch (object->type) {
            case OBJ_STRING: {
                managed_string_t *string = (managed_string_t *) object;
                fprintf(stdout, "%.*s", (int) string->length, string->characters);
                break;
            }
            default:
                break;
        }
    }
}

String comparison

Next, ideally we should be able to compare strings for equality, like this.

"hello" == "world"
"apples" != "pears"

It’s quite easy to implement it, as we already have comparison implemented in the vm. All we need is to add some extras to value_equals. It receives a new case.

case VAL_OBJECT: {
    if (AS_OBJECT(*one)->type != AS_OBJECT(*other)->type) {
        return false;
    }

    switch (AS_OBJECT(*one)->type) {
        case OBJ_STRING: {
            managed_string_t *str1 = AS_STRING(*one);
            managed_string_t *str2 = AS_STRING(*other);
            int result = memcmp(str1->characters, str2->characters, str1->length);
            return str1->length == str2->length && result == 0;
        }
    }
}

What do we do here? First off, the two values must have the same type, otherwise they are surely not equal. Next, we check if the length of the two strings are the same. The last, we compare the memory they occupy with memcmp to make sure they are really holding the same characters. Later we will optimize this comparison, but it will serve us well for now.

String concatenation

The last string operation we implement today is concatenation. In ASPL we concat strings by adding them together.

"Hello" + " " + "world!"

Similarly to pretty much all other languages, this operation will give us a new string. We use the same OP_ADD opcode we used for adding numbers. That means we have to add an extra check in the vm so we can differentiate between numbers and strings when we try to add two values. The vm_binary_operation function receives a new else branch.

} else if (IS_STRING(vm_peek(vm, 0)) && IS_STRING(vm_peek(vm, 1)) && operation == VM_BIN_OP_ADD) {
    vm_concatenate_string(vm);
    return;
}

So if the next two values on the stack are strings, and the operation is indeed VM_BIN_OP_ADD, we call vm_concatenate_string.

void vm_concatenate_string(vm_t *vm) {
    managed_string_t *second = AS_STRING(vm_pop(vm));
    managed_string_t *first = AS_STRING(vm_pop(vm));

    uint64_t length = first->length + second->length;
    char *buffer = NEW_WITH_SIZE(char, length + 1);

    memcpy(buffer, first->characters, first->length);
    memcpy(buffer + first->length, second->characters, second->length);

    buffer[length] = '\0';

    managed_string_t *string = managed_object_new_string(buffer, length, false);

    vm_add_managed_object(vm, (managed_object_t *) string);

    vm_push(vm, value_new_object_value((managed_object_t *) string));
}

This function pops the two string values from the stack, allocates memory to hold both of them, copies the contents of both, creates a new managed string, saves it to the vm’s linked list and finally pushes a new string value onto the stack.

There, we have it! We added a basic support for using and manipulating strings in ASPL. The only thing remains is to make sure we won’t leak any memory due to the dynamic memory allocations.

Memory management

I mentioned several times that the vm will maintain a linked list of all managed objects for now. Let’s add this.

typedef struct vm_t {
    bytecode_array_t code;
    uint8_t *ip;
    value_t stack[STACK_MAX];
    value_t *stack_top;
    managed_object_t *managed_objects;
} vm_t;

We have the pointer to the root object in managed_objects. Now we implement vm_add_managed_object.

void vm_add_managed_object(vm_t *vm, managed_object_t *object) {
    if (vm != NULL && object != NULL) {
        object->next_object = vm->managed_objects;
        vm->managed_objects = object;
    }
}

Classic linked list manipulation here. We also have to make sure that the managed_objects field is properly managed by the vm.

void vm_init(vm_t *vm) {
    for (int i = 0; i < STACK_MAX; i++) {
        vm->stack[i] = value_new_invalid_value();
    }

    vm->ip = NULL;
    vm->managed_objects = NULL;

    vm_reset_stack(vm);
}

void vm_free(vm_t *vm) {
    if (vm != NULL) {
        managed_object_free_object_list(vm->managed_objects);
        vm_init(vm);
    }
}

vm_init makes sure that it is NULL, and vm_free calls managed_object_free_object_list to make sure we free all dynamically allocated managed objects.

Testing

This is all we had to do to to make strings work. Let’s see them in action. Build the app, start it up and see what we did.

Let’s see if concatenation works.

ASPL REPL

Type 'Ctrl+C', 'quit' or 'q' to exit.

aspl> "Hello" + " " + "World!" == "Hello World!"           
== <main> bytecode start ==
[offset]  [line] [opcode]         [constant pos]   [constant value]
== Stack ==    
00000000       1 OP_CONSTANT                   0   'Hello'
== Stack ==    [Hello] 
00000002       | OP_CONSTANT                   1   ' '
== Stack ==    [Hello] [ ] 
00000004       | OP_ADD
== Stack ==    [Hello ] 
00000005       | OP_CONSTANT                   2   'World!'
== Stack ==    [Hello ] [World!] 
00000007       | OP_ADD
== Stack ==    [Hello World!] 
00000008       | OP_CONSTANT                   3   'Hello World!'
== Stack ==    [Hello World!] [Hello World!] 
00000010       | OP_EQUAL
== Stack ==    [true] 
00000011       | OP_RETURN
== <main> bytecode end ==
aspl> q
Bye

Now, let’s do a simple comparison that evaluates to true.

ASPL REPL

Type 'Ctrl+C', 'quit' or 'q' to exit.

aspl> "a" == "a"
== <main> bytecode start ==
[offset]  [line] [opcode]         [constant pos]   [constant value]
== Stack ==    
00000000       1 OP_CONSTANT                   0   'a'
== Stack ==    [a] 
00000002       | OP_CONSTANT                   1   'a'
== Stack ==    [a] [a] 
00000004       | OP_EQUAL
== Stack ==    [true] 
00000005       | OP_RETURN
== <main> bytecode end ==

Now, let’s try to compare two different strings.

ASPL REPL

Type 'Ctrl+C', 'quit' or 'q' to exit.

aspl> "a" == "b"
== <main> bytecode start ==
[offset]  [line] [opcode]         [constant pos]   [constant value]
== Stack ==    
00000000       1 OP_CONSTANT                   0   'a'
== Stack ==    [a] 
00000002       | OP_CONSTANT                   1   'b'
== Stack ==    [a] [b] 
00000004       | OP_EQUAL
== Stack ==    [false] 
00000005       | OP_RETURN
== <main> bytecode end ==
aspl> "a" != "b"
== <main> bytecode start ==
[offset]  [line] [opcode]         [constant pos]   [constant value]
== Stack ==    
00000000       1 OP_CONSTANT                   0   'a'
== Stack ==    [a] 
00000002       | OP_CONSTANT                   1   'b'
== Stack ==    [a] [b] 
00000004       | OP_EQUAL
== Stack ==    [false] 
00000005       | OP_NOT
== Stack ==    [true] 
00000006       | OP_RETURN
== <main> bytecode end ==

And finally, let’s make sure that using any other operator than +, == or != gives us an error.

aspl> "a" < 12
== <main> bytecode start ==
[offset]  [line] [opcode]         [constant pos]   [constant value]
== Stack ==    
00000000       1 OP_CONSTANT                   0   'a'
== Stack ==    [a] 
00000002       | OP_CONSTANT                   1   '12'
== Stack ==    [a] [12] 
00000004       | OP_LESS
Operands must be numbers
[line 1] in script

It seems to work properly. Last thing to do is to check if we are leaking any memory.

ASPL REPL

Type 'Ctrl+C', 'quit' or 'q' to exit.

aspl> "Hello" + " " + "World!" != "apples"
--18151-- REDIR: 0x49eef20 (libc.so.6:__memchr_avx2) redirected to 0x483ceb0 (memchr)
--18151-- REDIR: 0x49f6080 (libc.so.6:__memcpy_avx_unaligned_erms) redirected to 0x483f870 (memmove)
--18151-- REDIR: 0x49ef6a0 (libc.so.6:__memcmp_avx2_movbe) redirected to 0x483f030 (bcmp)
--18151-- REDIR: 0x4906470 (libc.so.6:realloc) redirected to 0x483acd0 (realloc)
--18151-- REDIR: 0x49061d0 (libc.so.6:free) redirected to 0x4839910 (free)
--18151-- REDIR: 0x49f2cd0 (libc.so.6:__strchrnul_avx2) redirected to 0x4840360 (strchrnul)
--18151-- REDIR: 0x49f6060 (libc.so.6:__mempcpy_avx_unaligned_erms) redirected to 0x4840470 (mempcpy)
--18151-- REDIR: 0x49f3090 (libc.so.6:__strlen_avx2) redirected to 0x483bc30 (strlen)
== <main> bytecode start ==
[offset]  [line] [opcode]         [constant pos]   [constant value]
== Stack ==    
--18151-- REDIR: 0x49f3220 (libc.so.6:__strnlen_avx2) redirected to 0x483bbd0 (strnlen)
00000000       1 OP_CONSTANT                   0   'Hello'
== Stack ==    [Hello] 
00000002       | OP_CONSTANT                   1   ' '
== Stack ==    [Hello] [ ] 
00000004       | OP_ADD
== Stack ==    [Hello ] 
00000005       | OP_CONSTANT                   2   'World!'
== Stack ==    [Hello ] [World!] 
00000007       | OP_ADD
== Stack ==    [Hello World!] 
00000008       | OP_CONSTANT                   3   'apples'
== Stack ==    [Hello World!] [apples] 
00000010       | OP_EQUAL
== Stack ==    [false] 
00000011       | OP_NOT
== Stack ==    [true] 
00000012       | OP_RETURN
== <main> bytecode end ==
aspl> q
Bye
==18151== 
==18151== HEAP SUMMARY:
==18151==     in use at exit: 0 bytes in 0 blocks
==18151==   total heap usage: 26 allocs, 26 frees, 3,362 bytes allocated
==18151== 
==18151== All heap blocks were freed -- no leaks are possible
==18151== 
==18151== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)
==18151== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)

It seems we are memory leak free!

That’s all for today.

Grab the source code of the current state of the project if you like.

Stay tuned. I’ll be back.

Tags: strings

← Implementing boolean expression support The theoretical background of hash maps →

comments powered by Disqus