Static Analysis

We are done with the first big part of the compilation process which includes lexing and parsing, the two steps that transform the code into a form that it easy to reason about by humans. With static analysis we start transforming the code into a form that the CPU will understand.

We already have our declarations, statements and expressions, and we understand what they mean. We know that 2 + 2 is an expression that adds 2 and 2 together and will evaluate to the constant 4. We also know that x + y is another expression that adds x and y together, but we don’t yet know what they refer to, all we know that they are identifiers. At this stage our understanding about the code is incomplete and needs further clarification.

The first “clarification” step usually is trying to understand what identifiers like x and y refer to. This is called binding or resolution.

During the resolution step we take each identifier and find out where they are defined and connect the two together. This is where scope will come into play. Scope is a part of the source code where a given identifier can be used to refer to a declaration.

Take this example:

{
    int x = 12;
    int y = 14;
    printf("%d\n", x + y);
}

We know that in C { } denotes a block which in turn denotes a scope. Everything declared in this block can be referred to in the same block, so it is perfectly legal to print the sum of x + y on the third line.

If the language has a static type system then the next step is type checking. Using the previous example, it is not enough that we bound our identifiers to their declarations, we also have to check if their types are compatible with the operation (we try to add them together) we’d like to apply to them. So we check that both x and y are integers and thus it is safe to add them together. If y were to be something incompatible (say a struct) our analysis step would have to throw an error.

We have finished analyzing the syntax tree, and now we have some additional information to store somewhere: the results of the resolution and type checking steps. We have multiple options.

We could store them directly in the syntax tree nodes
We could also move them into their own storage, for example a hash table. In this case the table is called a symbol table
Another option is to start building a data structure that is easier to understand for the CPU. At the beginning of the post, I said this is the main goal of this step, so for our purposes this is the optimal choice. The new data structure is called an intermediate representation or IR.

Next time I will talk about intermediate representations.

Stay tuned. I’ll be back.

Attila's Blog

About writing your own programming language in C, Go and Swift

Static Analysis