Last time we created the basic grammar for ASPL’s arithmetic expressions. We discovered that the grammar has a big problem: it is ambiguous, and it is possible to choose different productions that will lead to the same string at the end. That is a big problem that needs fixing. Let’s do that right now.
- Precedence means “order of operations”. Every type of operation (for example addition, subtraction, multiplication or division) has an associated strength. Stronger operations (means operations with higher precedence) are always evaluated before weaker ones.
- Associativity determines the order how multiple operators of the same precedence are evaluated in the absence of parentheses. We have left-associative and right-associative operators in ASPL.
Left-associativity means that operators on the left are evaluated before operators on the right.
For example, the
+ operator is left-associative, the first (leftmost)
+ is evaluated before the second.
1 + 2 + 3 =
(1 + 2) + 3
Right-associativity is the opposite. In this case, operators on the right are evaluated before operators on the left.
As an example, the
= operator is right-associative, so
x = y = z =
x = (y = z)
By defining precedence and associativity rules in our grammar we can avoid ambiguity, and thus we ensure that there is only one way to evaluate any expression.
ASPL mostly follows the precedence rules of C++.
For now, we define only a tiny subset of these rules to cover the bases.
Note: Primary is the name of the highest level of precedence that includes our literals like numbers and strings, and thus it does not have any associated operators.
Now we just need a way to put these new rules into our existing grammar. Remember how we constructed the
infix_binary := expression operator expression ;
Right now, we allow any expressions on both sides of the operator, but the new rules of precedence limit that. For example, the operands of a
* expression cannot be
+ expressions, because
+ has lower precedence.
What does it mean? It means that in an expression like:
2 * 3 + 4
The operands of
* must be
3, and we cannot have
3 + 4 as the left-side operand, because
+ has lower precedence. To use
3 + 4 as the left-side operand we would have to explicitly allow it by enclosing it in parentheses:
2 * (3 + 4).
For the operands of
* we need something with higher precedence, and in our case, it’s currently the
primary rule, like so:
multiplication := primary ( "*" | "/" ) primary ;
Right? Well, not so much… Can you see what the issue is? If we use this rule, we cannot construct an expression like
3 * 4 * 5, so it breaks associativity. We need something recursive here. How about this one:
multiplication := multiplication ( "*" | "/" ) primary ;
It is recursive now, so we are good, right? Nope. This rule is left-recursive. It means that the nonterminal
multiplication rule appears on the leftmost position in its own production. Since the parser we are going to write is a recursive-descent parser, which is a kind of top-down parser, it works from the top to the bottom. To be able to recognize
multiplication, it has to first recognize
multiplication, and for that, it has to first recognize
multiplication and so on… This rule ends up in an infinite recursion, and this is why recursive-descent parsers cannot handle left-recursion.
We can fix this with a small change:
multiplication := primary ( ( "*" | "/" ) primary )* ;
Let’s rewrite our previous grammar using these new principles:
expression := addition ; addition := multiplication ( ( "+" | "-" ) multiplication )* ; multiplication := primary ( ( "*" | "/" ) primary )* ; primary := literal ; literal := integer | floating_point ; integer := digit+ ; floating_point := digit+ ( "." digit+ )? ; digit := "0" ... "9" ;
We got rid of the
infix_binary rule and created one for each level of precedence. The main
expression rule is now simply a reference to the lowest precedence rule we currently have, that is the
addition rule. Why? Because
addition includes every higher precedence rules as well. We also got rid of the
operator rule, and separated our operators for the multiplication and addition operations. Each binary operator’s operands use the next higher precedence level.
Now we have an unambiguous grammar, and we are ready to create the parser for it.
Top-down operator precedence parsing
There are lots of different parsing algorithms out there, and we are going to write the simplest one that is sufficient for our language. It is a Pratt parser, which is an improved version of a recursive-descent parser, invented by Vaughan Pratt in 1973.
Recursive-descent parsers are usually the simplest to write without using extra tools like parser-generators. It is a kind of top-down parser, because it starts from the top (means the outermost grammar rule), and works its way down into the nested subexpressions until it reached the leaves (the terminal symbols) of the tree.
Recursive-descent parsers are easy to write by hand because each of the grammar’s rules will directly become a function / method in our parser’s code. The symbols of the grammar can be roughly translated into code like this:
|Terminal||Code to consume a token|
|Nonterminal||Call that rule’s function|
|* or +||Loop|
It’s called recursive because when a rule refers to itself either directly or indirectly, it translates to recursive function calls. And it’s called descent because it walks down the grammar.
Traditional recursive-descent parsers do not perform very well when parsing expressions, due to the heavy recursion. This is why we write an improved version, by combining it with a top-down operator precedence parser.
If you’d like to know more about parsing theory, I recommend reading the famous dragon book.
There are some great posts about top-down operator precedence parsers, I highly recommend them to help you understand how they work:
We have everything now to start implementing the parser, so this is what we are going to do next time.
Stay tuned. I’ll be back.