Attila's Blog

About writing your own programming language in C, Go and Swift

Introduction to context-free grammars and parsing theory

Aug 16, 2018 Comments

Categories: aspl

Last time, we finished the lexer. We can now create tokens from the source code. Our understanding about the meaning of the source code grows, but it isn’t complete yet. It is time to take it to the next level by parsing the tokens and creating the abstract syntax tree.

Parsing is a lot more complicated than lexing, with several new concepts to grasp. Parsing different parts of the language requires us to utilize different techniques, so we will write the parser in small chunks, gradually adding more and more functionality, until we have a fully functional parser that is capable of parsing the syntax of the whole language.

Of course, we could use some parser generators like Yacc or ANTLR, but where would be the fun in that?

Context-free grammars

In the previous article we used a regular language to define our lexical grammar. This won’t cut it when we start parsing because expressions are recursive, they can be infinitely deeply embedded into each other. For parsing, we have to define another set of rules, the syntactic grammar. For the syntactic grammar we need something more capable, so we are going to move one level up the Chomsky hierarchy, and define a context-free grammar for the parser. They are both part of the group of formal grammars.

In formal language theory, a grammar is a set of rules for “strings” in a formal language. The rules describe how to form “strings” from the language’s “alphabet” that are valid according to the language’s syntax. Each “string” is a series of “letters” from the “alphabet”. A grammar does not describe the meaning of the strings or what can be done with them, only their form. The grammar’s job is to specify which “strings” are valid and which are not.

You notice “strings”, “alphabet” and “letters” are in quotes. This is because these names are part of the terminology, but they have different meanings as we move between the grammars.

Here is a little help understanding it.

 Context-free grammar   Lexical grammar   Syntactic grammar 
 A letter is  A character  A token
 The alphabet is  Characters  Tokens
 A string is  A lexeme  An expression

As you can see, on different levels we use the same terminology but with a different level of granularity.

The rules of grammars

Our grammar could theoretically produce an infinite number of “strings”, so it is not feasible to try writing them all down. We have to think the other way around. We should instead create a finite set of rules, and use them to generate all the valid “strings”. It’s simple: you pick a rule, and just do whatever it tells you to. These rules are called productions because they produce the “strings” of the grammar.

Each rule in a context-free grammar has a head and a body. The head is the name of the rule, and it’s body describes what to generate. It consists of a sequence of symbols. It is a defining feature of context-free grammars to restrict the head to a single symbol. There are more expressive formal grammars, like unrestricted grammars that do not have this restriction. They can be used to generate recursively enumerable languages.

Now, in our context-free grammar, symbols come in two flavors:

  • Terminals. This is a “letter” from the “alphabet”. In our case these will always be tokens coming from the lexer.
  • Nonterminals. These are named references to other rules in the grammar. They are used for rule composition.

To describe the rules, pretty much everyone uses some version of the Backus-Naur form or BNF for short.

So, somehow we need to define the rules for our grammar. Next time, we will come up with a notation that is a modified BNF, and we will define the grammar for expressions in ASPL.

Stay tuned. I’ll be back.

Tags: parsing formal grammars context-free grammar BNF

← Lexing ASPL The grammar of ASPL's expressions →

comments powered by Disqus