Think of the compiler as a pipeline. It has a number of steps, and each of them deals with the result of the previous step. This pipeline can be divided into two big subsystems: the front end and the back end.
Now we are going to dive into the back end and take a closer look at the intermediate representation as well.
The back end deals with the architecture of the target system the source code will run on (object code, machine code, etc.). The front end and the back end are completely independent of each other, and there are very good reasons for this: portability and flexibility.
What does this mean?
Let’s say you want to support not just one but N different languages in your compiler. Also, you want to port your compiler to M different platforms (i386, x86_64, ARM, SPARC, PowerPC, etc.). How would you do that? Without any clever tricks, you’d have to write N * M different compilers to do it. That means N * M different lexers, parsers, analyzers, optimizers, and code generators.
But we can do better than this! If we separate the compiler pipeline into a front end and a back end, our job instantly becomes much easier. As I said before, these two are independent of each other. This means we only have to write one front end for each of the N languages, and one back end for each of the M target architectures.
For the sake of the example, let’s say we support 3 languages on 5 platforms. That would be 5 * 3 = 15 different compiler implementations without the pipeline separation. If our compiler consists of 5 subsystems (lexer, parser, analyzer, optimizer, code generator), it would mean that we have to write 15 * 5 = 75 subsystems. That is an awful lot of work, if you ask me.
If (or rather when) we separate the pipeline, we will write 3 front ends (3 lexers, 3 parsers, 3 analyzers) and 5 back ends (5 optimizers and 5 code generators). That’s 3 * 3 + 5 * 2 = 19 subsystems instead of 75.
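The arithmetic above is easy to check for yourself. Here is a quick sketch (the variable names are mine, chosen just for this illustration):

```python
# Back-of-the-envelope check of the subsystem counts above.
N, M = 3, 5                      # languages, target platforms
front_parts, back_parts = 3, 2   # lexer/parser/analyzer vs. optimizer/code generator

# Without the split: one complete 5-subsystem compiler per language/platform pair.
without_split = N * M * (front_parts + back_parts)

# With the split: front ends and back ends are shared, connected through the IR.
with_split = N * front_parts + M * back_parts

print(without_split, with_split)  # 75 19
```

Adding a new language now costs only one front end (3 subsystems) instead of M new compilers.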
Now there is only one thing missing. It is clear that we need to connect the front end and the back end so that they can communicate. We need some data structure that can carry information from the front to the back. This data structure is called intermediate representation (IR).
As its name suggests, it is designed to be
- language independent,
- platform independent,
- and also conducive to further processing such as optimization,
so it can represent any language on any platform. A popular format for IRs is three-address code.
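To make three-address code concrete, here is a minimal sketch of lowering an expression into it. The nested-tuple AST format and the `t1`, `t2` temporary naming are my own illustrative assumptions, not any real compiler’s data structures:

```python
import itertools

def to_tac(node, code, temps):
    """Emit three-address instructions for an expression AST.

    A node is either a leaf (variable name or constant) or a tuple
    (op, left, right). Returns the name holding the node's value.
    """
    if isinstance(node, (int, str)):      # leaf: constant or variable
        return str(node)
    op, left, right = node
    l = to_tac(left, code, temps)
    r = to_tac(right, code, temps)
    tmp = f"t{next(temps)}"               # fresh temporary for this result
    code.append(f"{tmp} = {l} {op} {r}")  # at most three addresses per line
    return tmp

# Lower a + b * c: each instruction has one operator and at most three names.
code = []
to_tac(("+", "a", ("*", "b", "c")), code, itertools.count(1))
print("\n".join(code))
# t1 = b * c
# t2 = a + t1
```

Notice how the nested source expression flattens into a list of simple instructions: exactly the shape that later optimization and code-generation passes like to work with.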
A well-known IR example is the LLVM IR.
Next time, I will talk about optimization.
Stay tuned. I’ll be back.