Attila's Blog

About writing your own programming language in C, Go and Swift

The grammar of ASPL's expressions

Aug 22, 2018 Comments

Categories: aspl

It is time to start describing the grammar for ASPL. We start with basic arithmetic expressions that support adding, subtracting, multiplying, and dividing positive numbers only. Later we will add negative numbers and grouping as well. This will be the base of our parser.

First, let’s examine what our expressions are made of. We want to start really small, so this will be a tiny subset of the full grammar.

For now, we will only have:

  • Literals. These are integer and floating point numbers
  • Binary infix expressions. These consist of literals and arithmetic operators for addition +, subtraction -, multiplication * and division /

Based on these elements we can describe the grammar as follows:

expression     := literal | infix_binary ;
literal        := integer | floating_point ;
integer        := digit+ ;
floating_point := digit+ ( "." digit+ )? ;
digit          := "0" ... "9" ;
infix_binary   := expression operator expression ;
operator       := "+" | "-" | "*" | "/" ;

All right, we have the grammar, but what does it mean? Let’s see.

As I mentioned previously, every rule has a head and a body. The head is the name of the rule, and the body describes what the rule is supposed to do.

In our notation the head and body is separated by :=. Nonterminals are all lowercase strings. Terminals are strings enclosed in quotes. Rules are terminated with ; at the end. We allow the selection between a series of productions with the | symbol. ... is used to denote a range with both limits included.

We also borrow some stuff from regular expressions, so we have a familiar syntax. We use ( and ) for grouping. * means the previous symbol or group may repeat zero or more times. + is similar, it requires the preceding production to appear one or more times. And finally ? requires the preceding production to appear zero or one time.

Now we can fully understand our grammar. Let’s build an expression from scratch.

expression := literal | infix_binary ;

It means that an expression is either a literal, or an infix_binary, both of them are referring to rules.

Let’s pick literal:

literal := integer | floating_point ;

We see that a literal is either an integer or a floating point number.

We select floating_point this time:

floating_point := digit+ ( "." digit+ )? ;

Here comes an interesting bit. A floating point number is one or more digits (digit+), optionally followed by a . and at least one or more digits (( "." digit+ )?).

If we choose integer instead, we get this rule:

integer := digit+ ;

It’s much simpler than the floating_point rule, it simply states that an integer is a sequence of one or more digits.

Now we need to know what a digit is:

digit := "0" ... "9" ;

A digit is a character in the range of 0 ... 9.

So far so good. Here comes the most important rule in our tiny grammar:

infix_binary := expression operator expression ;

Here is the point where the grammar gets recursive. An infix_binary rule is combination of an expression, an operator, and another expression, expression being the first rule we started our grammar with. This means it can contain anything we described so far.

The only unknown rule left is operator.

operator := "+" | "-" | "*" | "/" ;

It says an operator is one of the following symbols: +, -, *, /.

If you have sharp eyes you may notice now that we have a problem. This grammar is actually ambiguous. It means, it is possible to choose different productions that will lead to the same string at the end.

I give you an example.

Try creating this expression with our current grammar: 2 * 3 + 4. How many different ways can you produce it? Let’s see.

One way to do it:

  1. Start with the expression rule, and pick infix_binary.
  2. For the left-hand side of the infix_binary, pick infix_binary again.
    1. In the inner infix_binary, pick integer (2) for the left-hand side.
    2. Pick * for the operator.
    3. Pick integer (3) for the right-hand side.
  3. For the operator of the outer infix_binary, pick +.
  4. For the right-hand side of the outer infix_binary, pick integer (4).

Another way:

  1. Start with the expression rule, and pick infix_binary.
  2. Pick integer (2) for the left-hand side.
  3. Pick * for the operator.
  4. Pick infix_binary for the right-hand side.
    1. In the inner infix_binary, pick integer (3) for the left-hand side.
    2. Pick + for the operator.
    3. Pick integer (4) for the right-hand side.

What did we get as a result of these 2 expressions?

(2 * 3) + 4 in the first expression and 2 * (3 + 4) in the second. It’s clear that these two are different, right?

Let’s execute them:

(2 * 3) + 4 = 10 versus 2 * (3 + 4) = 14, so we can say that
(2 * 3) + 4  ≠  2 * (3 + 4)

This is a serious issue because it means that the parser may misunderstand the source code. We have to fix this.

This is exactly what we are going to do next time.

Stay tuned. I’ll be back.

Tags: parsing formal grammars context-free grammar BNF

← Introduction to context-free grammars and parsing theory Fixing the ambiguity in the grammar →

comments powered by Disqus