Attila's Blog

About writing your own programming language in C, Go and Swift

Parsing grouping and negation

Sep 4, 2018 Comments

Categories: aspl

Last time we finished with a fully functional but limited parser. Let’s start adding some extras. Parsing arithmetic expressions still needs two, currently missing features: grouping and negation. This is what we are going to implement today.

Revising the grammar

Before jumping into the coding part, let’s revise our grammar for ASPL’s expressions. Remember the final version of the grammar we currently have? Here it is:

expression     := addition ;
addition       := multiplication ( ( "+" | "-" ) multiplication )* ;
multiplication := primary ( ( "*" | "/" ) primary )* ;
primary        := literal ;
literal        := integer | floating_point ;
integer        := digit+ ;
floating_point := digit+ ( "." digit+ )? ;
digit          := "0" ... "9" ;

We need to modify it, and make place for negation and grouping. Let’s see how.

First, we have to create a new precedence rule for negation. In ASPL, negation has higher precedence than multiplication, so the modified precedence table looks like this:

 Name   Operators   Associativity 
 Primary 
 Negation   -  Right
 Multiplication   * /  Left
 Addition   + -  Left


It is also important to mention that the unary - operator is right-associative. Now that we know where the place of the negation expression is, let’s incorporate it into the grammar.

expression     := addition ;
addition       := multiplication ( ( "+" | "-" ) multiplication )* ;
multiplication := unary ( ( "*" | "/" ) unary )* ;
unary          := "-" unary | primary ;
primary        := literal ;
literal        := integer | floating_point ;
integer        := digit+ ;
floating_point := digit+ ( "." digit+ )? ;
digit          := "0" ... "9" ;

The unary rule comes after the multiplication, because is has higher precedence, and we said that each rule’s operands use the next higher precedence level.

Notice something? Let me help. We said that our parser cannot handle right recursion, remember? Yet, if you look closely, you can see that the new unary rule is in fact right recursive. Did we just mess up or what?

Good news, we didn’t mess up. First, I mentioned that the unary - operator is right-associative, this means we need to have right recursion in this rule. Also, the parser has issues with right recursion when the body of the nonterminal rule starts with itself. Here, the body starts with a terminal, the - token, and the nonterminal is actually at the end of the body. This means the parser won’t have problems with this rule.

As the grouping operator is amongst the ones with the the highest precedence, let’s just put it together with the primaries. The final form of the new grammar is this:

expression     := addition ;
addition       := multiplication ( ( "+" | "-" ) multiplication )* ;
multiplication := unary ( ( "*" | "/" ) unary )* ;
unary          := "-" unary | primary ;
primary        := literal | "(" expression ")" ;
literal        := integer | floating_point ;
integer        := digit+ ;
floating_point := digit+ ( "." digit+ )? ;
digit          := "0" ... "9" ;

Making code changes

We finished with the grammar, now let’s see how all this looks like in code. First, we are going to need to extend our AstNode type to support unary expressions.

typedef enum ast_node_type_t {
    NODE_INTEGER,
    NODE_DOUBLE,
    NODE_BINARY_INFIX_OPERATOR,
    NODE_UNARY_PREFIX_OPERATOR,
} AstNodeType;

typedef struct ast_node_t {
    AstNodeType type;
    Token* token;
    union {
        struct {
            AstNode* left;
            AstNode* right;
        } binary_infix;
        struct {
            AstNode* right;
        } unary_prefix;
        int64_t integer;
        long double double_precision;
    } as;
} AstNode;

First, we add a new node type to AstNodeType to support unary expressions: NODE_UNARY_PREFIX_OPERATOR. After that, we extend AstNode with this:

struct {
    AstNode* right;
} unary_prefix;

Unary expressions have only one operand. Also, negation is a prefix expression, and that means its only operand will be on the right-hand side of the operator, so we call it right in the AST node.

Next, we need to add new functions to create and release AST nodes of this new type.

extern AstNode* ast_node_make_unary_prefix_expression(Token* token, AstNode* right);

This is how its public interface looks like. Let’s see the implementation.

AstNode* ast_node_make_unary_prefix_expression(Token* token, AstNode* right) {
    AstNode* node = NEW(AstNode);

    node->type = NODE_UNARY_PREFIX_OPERATOR;
    node->token = token;
    node->as.unary_prefix.right = right;

    return node;
}

There is nothing new here, it is exactly like all the other ones. Same goes for the function that releases the memory for this type.

static void ast_node_free_unary_prefix_node(AstNode* node) {
    token_free(node->token);
    ast_node_free(node->as.unary_prefix.right);
    DELETE_TYPE(AstNode, node);
}

And finally we have to modify ast_node_free to use the new functions.

void ast_node_free(AstNode* node) {
    switch (node->type) {
        case NODE_INTEGER:
            ast_node_free_integer_node(node);
            break;
        case NODE_DOUBLE:
            ast_node_free_double_node(node);
            break;
        case NODE_BINARY_INFIX_OPERATOR:
            ast_node_free_binary_infix_node(node);
            break;
        case NODE_UNARY_PREFIX_OPERATOR:
            ast_node_free_unary_prefix_node(node);
            break;
    }
}

And the AST node is done. It can now fully support unary prefix expressions of any kind.

We are halfway done with the required modifications. Now we have to add some new parsing functions so the parser will also support our new features.

First, we need two helper functions.

static bool check_current_token_type(ParsingContext* context, TokenType type) {
    return get_current_token(context)->type == type;
}

static void consume_current_token_if_type(ParsingContext* context, TokenType type, const char* message) {
    if (check_current_token_type(context, type)) {
        advance_to_next_token(context);
        token_free(get_previous_token(context));
        return;
    }

    error_at_current_token(context, message);
}

check_current_token_type simply checks if the current token is of a certain type. consume_current_token_if_type checks for a certain token, and if the current token matches it will consume it. By consume I mean it will just read it from the token stream, and “gets rid of it”, by releasing the token.

Why is this good for us, you ask? There are some tokens whose only job is to tell us that we reached the end of an expression or statement. We know that the language’s syntax requires us to terminate a certain kind of expression or syntax with a special token. The parser knows it too, and when it parses that specific expresion it will look for the terminator symbol. But other than knowing that we reached the end of the expression, we do not need the token anymore, so we read it and then simply discard it.

A good example for this is the ) token in grouping expressions. When we encounter the opening ( we know that the current expression is grouping. We also know that by the syntactic rules of the language, we have to close the grouping expression with the ) token. So at the end we specifically look for it, and report an error if it doesn’t exist because in this case the programmer made a syntax error. If we find it, we know that the expression is syntactically correct, and we do not need the ) token anymore.

All right, based on this explanation let’s implement the parse function for grouping expressions. It is very simple.

static AstNode* parse_grouping_expression(ParsingContext* context) {
    token_free(get_previous_token(context));
    AstNode* node = parse_expression(context);
    consume_current_token_if_type(context, TOKEN_RIGHT_PAREN, "Expect ')' after expression");
    return node;
}

We know that it starts with (, and we do not need that symbol for anything else, so let’s release it. After that, we need to figure out what is the expression enclosed in the parentheses. We can figure it out by recursively calling parse_expression. We need this exact function call, because a grouping expression can contain any kind of other expressions. At the end, we also know that grouping must end with the ) symbol, so we look for it. If it isn’t there, it is a syntax error, so we report it. Otherwise, we consume the token and we are done.

Next up is the parse function for unary prefix expressions.

static AstNode* parse_unary_prefix_operator_expression(ParsingContext* context) {
    Token* previous = get_previous_token(context);
    AstNode* right = parse_precedence(context, PREC_UNARY_PREFIX - 1);
    return ast_node_make_unary_prefix_expression(previous, right);
}

We recognize the type of expression when we read the - token. We call the parse_unary_prefix_operator_expression to parse the unary expression. When the function starts, we can get the previous token, which holds the operator for the current unary expression. Next, we have to make sure that the rules of precedence are honored, so instead of calling parse_expression to figure out the right-hand side operand, we call parse_precedence instead with PREC_UNARY_PREFIX - 1 as the parameter, to make sure that it will parse only expressions that have lower precedence than the current unary expression.

That’s all we need to be able to parse unary prefix expressions. The last thing we have to do, is to put them to work, by including them in our parse function registration table.

static ParseRule parse_rules[] = {
        {parse_grouping_expression,       NULL,                                    PREC_CALL},    // TOKEN_LEFT_PAREN,
        {NULL,                            NULL,                                    PREC_NONE},    // TOKEN_RIGHT_PAREN,
        {NULL,                            NULL,                                    PREC_NONE},    // TOKEN_LEFT_BRACE,
        {NULL,                            NULL,                                    PREC_NONE},    // TOKEN_RIGHT_BRACE,
        {NULL,                            NULL,                                    PREC_NONE},    // TOKEN_COMMA,
        {NULL,                            NULL,                                    PREC_NONE},    // TOKEN_DOT,
        {parse_unary_prefix_operator_expression, parse_binary_operator_expression, PREC_TERM},    // TOKEN_MINUS,
        {NULL,                            NULL,                                    PREC_NONE},    // TOKEN_MINUS_MINUS,
        {NULL,                                   parse_binary_operator_expression, PREC_TERM},    // TOKEN_PLUS,
        {NULL,                            NULL,                                    PREC_NONE},    // TOKEN_PLUS_PLUS,
        {NULL,                            NULL,                                    PREC_NONE},    // TOKEN_SEMICOLON,
        {NULL,                                   parse_binary_operator_expression, PREC_FACTOR},  // TOKEN_SLASH,
        {NULL,                                   parse_binary_operator_expression, PREC_FACTOR},  // TOKEN_STAR,
        {NULL,                            NULL,                                    PREC_NONE},    // TOKEN_QUESTION,
        {NULL,                            NULL,                                    PREC_NONE},    // TOKEN_COLON,
        {NULL,                            NULL,                                    PREC_NONE},    // TOKEN_AND_AND,
        {NULL,                            NULL,                                    PREC_NONE},    // TOKEN_PIPE_PIPE,

        {NULL,                            NULL,                                    PREC_NONE},    // TOKEN_BANG,
        {NULL,                            NULL,                                    PREC_NONE},    // TOKEN_BANG_EQUAL,
        {NULL,                            NULL,                                    PREC_NONE},    // TOKEN_EQUAL,
        {NULL,                            NULL,                                    PREC_NONE},    // TOKEN_EQUAL_EQUAL,
        {NULL,                            NULL,                                    PREC_NONE},    // TOKEN_GREATER,
        {NULL,                            NULL,                                    PREC_NONE},    // TOKEN_GREATER_EQUAL,
        {NULL,                            NULL,                                    PREC_NONE},    // TOKEN_LESS,
        {NULL,                            NULL,                                    PREC_NONE},    // TOKEN_LESS_EQUAL,

        {NULL,                            NULL,                                    PREC_NONE},    // TOKEN_IDENTIFIER,
        {NULL,                            NULL,                                    PREC_NONE},    // TOKEN_STRING,
        {parse_number_literal_expression, NULL,                                    PREC_NONE},    // TOKEN_NUMBER,

        {NULL,                            NULL,                                    PREC_NONE},    // TOKEN_CLASS,
        {NULL,                            NULL,                                    PREC_NONE},    // TOKEN_ELSE,
        {NULL,                            NULL,                                    PREC_NONE},    // TOKEN_FALSE,
        {NULL,                            NULL,                                    PREC_NONE},    // TOKEN_FUNC,
        {NULL,                            NULL,                                    PREC_NONE},    // TOKEN_FOR,
        {NULL,                            NULL,                                    PREC_NONE},    // TOKEN_IF,
        {NULL,                            NULL,                                    PREC_NONE},    // TOKEN_NIL,
        {NULL,                            NULL,                                    PREC_NONE},    // TOKEN_PRINT,
        {NULL,                            NULL,                                    PREC_NONE},    // TOKEN_RETURN,
        {NULL,                            NULL,                                    PREC_NONE},    // TOKEN_SUPER,
        {NULL,                            NULL,                                    PREC_NONE},    // TOKEN_THIS,
        {NULL,                            NULL,                                    PREC_NONE},    // TOKEN_TRUE,
        {NULL,                            NULL,                                    PREC_NONE},    // TOKEN_VAR,
        {NULL,                            NULL,                                    PREC_NONE},    // TOKEN_WHILE,

        {NULL,                            NULL,                                    PREC_NONE},    // TOKEN_ERROR,
        {NULL,                            NULL,                                    PREC_NONE},    // TOKEN_EOF
};

For the TOKEN_LEFT_PAREN we register parse_grouping_expression in the nud position, and for TOKEN_MINUS we register parse_unary_prefix_operator_expression also in nud position.

This is all we had to modify. Let’s comple the project and try it out with an example. We start with testing if grouping works.

ASPL REPL

Type 'Ctrl+C', 'quit' or 'q' to exit.

aspl> 1 + 2 * 3
BinaryInfixOperatorExpressionNode {
  IntegerExpressionNode {
    1
  }
  Operator +
  BinaryInfixOperatorExpressionNode {
    IntegerExpressionNode {
      2
    }
    Operator *
    IntegerExpressionNode {
      3
    }
  }
}
aspl> (1 + 2) * 3
BinaryInfixOperatorExpressionNode {
  BinaryInfixOperatorExpressionNode {
    IntegerExpressionNode {
      1
    }
    Operator +
    IntegerExpressionNode {
      2
    }
  }
  Operator *
  IntegerExpressionNode {
    3
  }
}
aspl> q
Bye

Process finished with exit code 0

The first expression translates to 1 + (2 * 3) which is correct. When we use grouping, it translates to (1 + 2) * 3 which is exactly what we wanted. Grouping works properly. Let’s see what’s up with negation. Hit it with some expressions.

ASPL REPL

Type 'Ctrl+C', 'quit' or 'q' to exit.

aspl> -2
UnaryPrefixOperatorExpressionNode {
  Operator -
  IntegerExpressionNode {
    2
  }
}
aspl> 3 - -3
BinaryInfixOperatorExpressionNode {
  IntegerExpressionNode {
    3
  }
  Operator -
  UnaryPrefixOperatorExpressionNode {
    Operator -
    IntegerExpressionNode {
      3
    }
  }
}
aspl>  1 + 2 * 3 - -4 / 2
BinaryInfixOperatorExpressionNode {
  BinaryInfixOperatorExpressionNode {
    IntegerExpressionNode {
      1
    }
    Operator +
    BinaryInfixOperatorExpressionNode {
      IntegerExpressionNode {
        2
      }
      Operator *
      IntegerExpressionNode {
        3
      }
    }
  }
  Operator -
  BinaryInfixOperatorExpressionNode {
    UnaryPrefixOperatorExpressionNode {
      Operator -
      IntegerExpressionNode {
        4
      }
    }
    Operator /
    IntegerExpressionNode {
      2
    }
  }
}
aspl> q
Bye

Process finished with exit code 0

The first one was correctly translated to (-2). The second one becomes 3 - (-3), which is also correct. The last one is 1 + (2 * 3) - ((-4) / 2), also correct.

This was quite easy, wasn’t it? :)

Grab the source code of the current state of the project if you like.

So far, I haven’t talked about memory management a lot, although we should mention it because it is important in a low level language like C. We have to properly manage memory to avoid memory leaks.

Next time I will introduce Valgrind and explain how you can (and should) use it to check if your application is leaking memory.

Stay tuned. I’ll be back.

← Parsing arithmetic expressions Checking for memory leaks with Valgrind →

comments powered by Disqus