Think of the compiler as a pipeline. It has a number of steps, and each of them deals with the result of the previous step. This pipeline can be divided into two big subsystems: the front end and the back end.
Now we are going to dive into the back end and take a closer look at the intermediate representation as well.
The back end deals with the architecture of the target system the source code will run on (object code, machine code, etc.). The front end and the back end are completely independent of each other, and there are very good reasons for this: portability and flexibility.
What does this mean?
Let’s say you want to support not just one but N different languages in your compiler. Also, you want to port your compiler to M different platforms (i386, x86_64, ARM, SPARC, PowerPC, etc.). How would you do that? Without any clever tricks, you’d have to write N * M different compilers to do it. That means N * M different lexers, parsers, analyzers, optimizers, and code generators.
But we can do better than this! If we separate the compiler pipeline into a front end and a back end, our job instantly becomes much easier. As I said before, these two are independent of each other. This means we only have to write one front end for each of the N languages, and one back end for each of the M target architectures.
For the sake of the example, let’s say we support 3 languages on 5 platforms. That would be 5 * 3 = 15 different compiler implementations without the pipeline separation. If our compiler consists of 5 subsystems (lexer, parser, analyzer, optimizer, code generator), it would mean that we have to write 15 * 5 = 75 subsystems. That is an awful lot of work, if you ask me.
If (or rather when) we separate the pipeline, we will write 3 front ends (3 lexers, 3 parsers, 3 analyzers) and 5 back ends (5 optimizers and 5 code generators). That’s 3 * 3 + 5 * 2 = 19 subsystems instead of 75.
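The arithmetic above is easy to check for yourself. Here is a quick sketch (the variable names are mine, chosen just for this illustration):

```python
# Back-of-the-envelope check of the subsystem counts above.
N, M = 3, 5                      # languages, target platforms
front_parts, back_parts = 3, 2   # lexer/parser/analyzer vs. optimizer/code generator

# Without the split: one complete 5-subsystem compiler per language/platform pair.
without_split = N * M * (front_parts + back_parts)

# With the split: front ends and back ends are shared, connected through the IR.
with_split = N * front_parts + M * back_parts

print(without_split, with_split)  # 75 19
```

Adding a new language now costs only one front end (3 subsystems) instead of M new compilers.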
Now there is only one thing missing. It is clear that we need to connect the front end and the back end so that they can communicate. We need some data structure that can carry information from the front to the back. This data structure is called intermediate representation (IR).
As its name suggests, it is designed to be
- language independent,
- platform independent,
- and also conducive to further processing such as optimization,
so it can represent any language on any platform. A popular format for IRs is three-address code.
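To make three-address code concrete, here is a minimal sketch of lowering an expression into it. The nested-tuple AST format and the `t1`, `t2` temporary naming are my own illustrative assumptions, not any real compiler’s data structures:

```python
import itertools

def to_tac(node, code, temps):
    """Emit three-address instructions for an expression AST.

    A node is either a leaf (variable name or constant) or a tuple
    (op, left, right). Returns the name holding the node's value.
    """
    if isinstance(node, (int, str)):      # leaf: constant or variable
        return str(node)
    op, left, right = node
    l = to_tac(left, code, temps)
    r = to_tac(right, code, temps)
    tmp = f"t{next(temps)}"               # fresh temporary for this result
    code.append(f"{tmp} = {l} {op} {r}")  # at most three addresses per line
    return tmp

# Lower a + b * c: each instruction has one operator and at most three names.
code = []
to_tac(("+", "a", ("*", "b", "c")), code, itertools.count(1))
print("\n".join(code))
# t1 = b * c
# t2 = a + t1
```

Notice how the nested source expression flattens into a list of simple instructions: exactly the shape that later optimization and code-generation passes like to work with.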
A well-known IR example is the LLVM IR.
Next time, I will talk about optimization.
Stay tuned. I’ll be back.