Before making a small detour with Valgrind and generic containers, we implemented parsing. A small subset of it anyways. :) In the following couple of posts we are going to start working towards compiling those arithmetic expressions we parsed into bytecode. Before we could compile anything, we need to know what bytecode is, how to create it, and how to store it. In this post we will answer the first of these questions: what is bytecode and why we chose to use it.

Why bytecode?

Let’s do a small recap about why we use bytecode. If you remember, we discussed that we have some alternatives to bytecode.

First, we could simply walk the AST recursively, and execute it. It would be the easiest thing to do. It is very portable. Why not doing it then? Because it is not memory efficient. It is messing up spatial locality. And for all of these reasons it is slow. I mean slooooow.

Also, we could generate machine code instead of bytecode. Advantage? Lightning-fast execution. Think of C++ – like speed. Sounds great, right? It has a problem though. It takes an awful lot of work to generate machine code. Also, it’s not portable. We would have to do it on all supported platforms over and over again.

What is bytecode?

It seems like to me that we should choose something that is halfway between tree–walking and compiling to machine code. Is there something in between these two? Yes, you have guessed right. Bytecode. We love bytecode for a number of reasons. It combines the best of both worlds. It preserves the simplicity and portability of walking the AST. It is also fast. Of course not as fast as machine code, but again, it is a good compromise between the former two.

If we look at its structure, it resembles machine code. It is a linear sequence of binary instructions. It’s dense. This means, it works well with the CPU cache. Has low overhead. The instructions are simple. If you look at x64 assembly, you will see that instructions are oftentimes quite complex. And there are hundreds of them. Our instructions are going to be much simpler. Every instruction is exactly one byte long, hence the name: bytecode. Also, in ASPL we can imlpement the whole language with some sixty or so instructions.

Bytecode is essentially an imaginary (virtual) instruction set for your language. Being virtual, there is nothing that could execute it. We solve this problem by creating an emulator for it. This emulator is a simulated CPU that interprets our bytecode. This virtual CPU is what we call the virtual machine.

Obviously this extra layer adds some overhead, this is the reason bytecode virtual machines are slower than machine code, but we get portability in return. All we need to do is to write our VM in a portable language like C or C++, and we can run our language on any HW we like (those that support C or C++ of course :D).

Next time we will learn how to store bytecode effectively by utilizing our generic containers.

Stay tuned. I’ll be back.

Attila's Blog

About writing your own programming language in C, Go and Swift

What is bytecode

Why bytecode?

What is bytecode?