The easiest way to learn the basics of creating a programming language is examining the necessary steps to transform the source code into executable code. In this post, I’m going to introduce these steps, and I will explain them in detail in future posts. I will also implement these concepts in C.
High-level overview of the transformation process
The process can be divided into 2 main parts.
The first part is about understanding what the source code means. This part is more or less a one-way process. It’s virtually the same for every language. The compiler executes a couple of mandatory steps one after the other and at the end, we have an understanding of the meaning of the source code.
Think of it as transforming the source code into higher level representations. Individual characters become tokens, tokens become expressions, statements, and declarations, these become translation units, translation units getting combined, and at the end, we have a tree that contains each and every meaningful token from our source code. This tree can be used to generate code for the CPU.
These are the steps of the first phase:
- The first step is the lexical analysis, that will produce tokens
- Next is parsing the tokens that creates the abstract syntax tree (AST)
The second part is about transforming the tree to a form that is low-level enough so the CPU can execute it. Think of this phase as transforming the nodes of the tree into lower level representations. Encapsulated within our tree we have a high-level understanding of the code in a form that can be understood by us, humans. By traversing the nodes of the tree we can easily see what the code is intended to do. This representation doesn’t mean anything to the CPU, however. We have to start breaking it down into lower level units which can be executed by the CPU.
In this phase, the process is not necessarily one way only. Certain steps can be ignored or repeated multiple times until we have the executable code.
- We can decide not to do any optimizations
- Or have multiple optimization phases
- Can decide to transform the tree into the source code of another language (transpile)
- We can analyse the nodes and modify them, insert new nodes / delete existing ones based on some criteria
The steps of the second phase are:
- The syntax tree can be directly executed by a tree–walk interpreter. This is probably the easiest, and also the slowest way to execute your code. We could write a tree–walk interpreter but due to its slow execution speed it is not really usable for real–life applications, so I am going to skip implementing one
- Various kinds of optimizations can also take place: Single Static Assignment is a popular one
- The next step is analysis and/or optimization (This is optional). The result of this is usually some kind of intermediate representation (IR), that is lower level than the syntax tree we started with. Optimization and analysis steps can be repeated and every time they will result another IR that usually either lower level than the previous one or the same but with added information. (For example we have a type checking phase that determines if every operand of an expression has the correct data type)
- The final step is code generation
- This is either bytecode to be executed in a virtual machine (This is what you will write in the first phase)
- Or machine code (This is the second phase of your project). Machine code generation can be a multi-step process in itself.
For example, the tree nodes are converted to assembly first, then compiled and likned together into the executable binary.
Next time I will explain what lexical analysis is.
Stay tuned. I’ll be back.