Project #1 Notes, Fall 2011

This page is an accumulation of (we hope) helpful notes, hints, and other information for doing Project #1.

  1. Making ASTs
  2. Defining AST subtypes
  3. Making tokens explicitly

Making ASTs

The Horn framework does quite a bit for you, and it is best not to re-invent the wheel or write more code than you must. Here are a couple of sample rules, taken from my solution, that illustrate the kinds of things Horn can do for you.

Calls

    call: primary "("! argument_list ")"! 
                    { $$ = $^(CALL, $*); }
        ;

    argument_list: (expression (","! expression)* ","!?)?
                    /* Trailing comma requires GLR */
                    { $$ = $^(EXPR_LIST, $*); }
        ;
In the prelude, I have defined:
    %token CALL
    %token EXPR_LIST

I never use CALL or EXPR_LIST as tokens (terminal symbols) in the grammar or define them in lexical rules. They are just syntactic categories whose only purpose is to communicate to the tree-building machinery. Giving them %token declarations causes the framework to define these two names as unique integer constants in the apyc-parser.hh file.

Most of my rules look like this: there is very little writing involved. The effect of $* is to gather up the semantic values of all the symbols in the rule into a list, which is then passed to the tree-making routine ($^) with the appropriate operator. All the extra pieces of syntax that aren't supposed to be reflected in the AST get `!'ed away. Only for a few rules will you need to extract .value() explicitly from a particular right-hand-side symbol. My solution has no need for exotic things like the .missing method; in fact, the actions in my grammar rules generally consist of a single statement apiece, and none of those statements is an "if", "while", or "for".
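
Concretely, given a call such as f(x, y) on line 1, these rules produce a node that prints roughly as follows (the exact format of the identifier leaves depends on the spec):

    (call 1 (id 1 f) (expr_list 1 (id 1 x) (id 1 y)))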

Passing through values without change

A parenthesized expression can just look like this:

    parenth_form: "("! expression ")"! ;

Since the two parentheses get tossed out, and there are no tree-forming instructions, the value of the remaining expression simply gets transferred out as the value of the parenth_form.

This convention allows you to have simple rules whose sole purpose is to return a bunch of operands to be gathered up by another rule, as in:

    dict_display: "{"! key_datum_list? "}"!
                    { $$ = $^(DICT_DISPLAY, $*); }
        ;

    key_datum_list: key_datum (","! key_datum)* ","!? 
        ;

Here, key_datum_list exists only to gather up a list of KEY:VALUE items.
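
The key_datum rule itself can then build the individual items, along these lines (the PAIR operator name here is hypothetical; use whatever operator the spec prescribes):

    key_datum: expression ":"! expression
                    { $$ = $^(PAIR, $*); }
        ;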

GLR Parsers

The note on calls, above, mentions GLR parsing, which is invoked with the directive %glr-parser in the prelude (before the first %%). You will find this useful for handling things like targets of assignments. Given a statement such as

    x, y = 3
how do we know that x here is an assignment target rather than an expression formed from an identifier? The parser has to make that decision when it reaches the comma, although the answer becomes knowable only later, on encountering the equals sign. Therefore, the Horn framework (actually, the Bison part of it) will report a conflict here.
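
To see the ambiguity concretely, here is a sketch of the kind of rules involved (the rule names and the ASSIGN operator are hypothetical):

    /* After reading `x', the parser must decide whether to reduce it
       toward target_list (an assignment) or toward expression_list (an
       expression statement) before it can see whether `=' follows. */
    statement: target_list "="! expression_list
                    { $$ = $^(ASSIGN, $*); }
        | expression_list
        ;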

GLR parsers can get around this by trying to parse x both as a target and as an ordinary expression, throwing away whichever parse later encounters a syntax error. However, use this with extreme caution; the system will still report shift-reduce and reduce-reduce conflicts. You will have to convince yourself that these are all expected—that you know why each occurs. Having done that, add declarations to your prelude that tell the framework how many conflicts to expect. Here's what's in my solution:

    %expect-rr 18
    %expect 2
which says to expect 18 reduce-reduce and 2 shift-reduce conflicts. Any other numbers will cause the parser generator to complain, and a discrepancy probably indicates a real error in your grammar. I checked out each of the conflicts above (using the file apyc-parser.output). Most of them are actually the same thing: the parser generator does not know whether something is a target or an expression. The other conflicts involved trailing commas.

In a non-GLR parser, consider all conflicts to be errors. In a GLR parser, do not allow many conflicts; 20 is about the limit, I'd say. Resist the temptation to "fix" a conflict you don't understand by resorting to GLR parsing; it almost certainly won't work, and without thorough testing the resulting problem won't show up.
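
Putting the pieces together, the top of such a prelude might look like this (a sketch; the conflict counts shown are the ones from my solution above):

    %glr-parser
    %expect-rr 18
    %expect 2

    %token CALL
    %token EXPR_LIST

    %%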

Defining AST subtypes

For Project #1, it is not strictly necessary to define different subtypes of AST for different kinds of tree node. This is because all the spec asks at this point is that you print the tree out at the end, and all internal AST nodes have the same printed format:
     (operator line-number operand1 operand2...)
Therefore, you could easily get away with a single Tree type (the one provided for you, in fact), somehow arranging for externalName to know the proper AST operator name to print for each different operator token (perhaps by means of a table?).
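
One sketch of that table idea, written as a free function (CALL and EXPR_LIST are the constants from the prelude; the table and the lookup function themselves are hypothetical):

    struct OpName { int syntax; const char* name; };

    static const OpName opNames[] = {
        { CALL, "call" },
        { EXPR_LIST, "expr_list" },
        /* ... one entry per internal-node operator ... */
    };

    /* Return the external name to print for SYNTAX, or "?" if unknown. */
    const char* externalName (int syntax) {
        for (size_t i = 0; i < sizeof opNames / sizeof opNames[0]; i += 1)
            if (opNames[i].syntax == syntax)
                return opNames[i].name;
        return "?";
    }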

However, I suggested that you might want to create subtypes anyway, since they will come in useful starting in the next project. The skeleton illustrates much of what you need to create an AST subtype. The class Expr_List_AST in exprs.cc shows how to set up a subtype of AST representing the kind of node that prints as (expr_list ...). The NODE_FACTORY macro in particular does some behind-the-scenes magic so that, in the definition of argument_list in the preceding example, the call

       { $$ = $^(EXPR_LIST, $*); }
knows to create an Expr_List_AST for this AST node.
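
A new subtype follows the same pattern. Here is a sketch for call nodes (the NODE_CONSTRUCTORS macro and the AST_Tree base-class name are assumptions patterned on the token macros below; check exprs.cc for the authoritative details):

    /** Represents a function call. */
    class Call_AST : public AST_Tree {
    protected:
        NODE_CONSTRUCTORS (Call_AST, AST_Tree);
    };

    NODE_FACTORY (Call_AST, CALL);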

There are actually analogous macros for tokens. I didn't use them in the skeleton, frankly, because I had only just added them to the Horn framework. But if you want, feel free to replace the Int_Token definition with

    /** Represents an integer literal. */
    class Int_Token : public AST_Token {
    public:

        void print (ostream& out, int indent) { ... }

        TOKEN_CONSTRUCTORS(Int_Token, AST_Token);

        ...
    };

    TOKEN_FACTORY(Int_Token, INT);
removing the constructors, the make method, and the definition of Int_Factory. Again, the TOKEN_FACTORY macro links up the syntax INT (defined by a lexical rule) with the class Int_Token.

Another potentially useful tool is the method post_make, which the Horn framework defines for its tree and token types. This method is called after constructing a new token or tree node. By default, it does nothing, returning the node unchanged, but you may override it to do any additional set-up or perhaps error checking you want for a tree node or token. For example, my Int_Token class defines

    Int_Token* post_make () {
        ...
        return this;
    }
where I've put the code for converting and checking integer literals (which you could also put in the lexer, of course, but I thought I'd concentrate everything having to do with integer literals in the Int_Token class).
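
As an illustration, such a body might convert the token's text and check it, roughly like this (the error interface and the _value field are assumptions, not the skeleton's actual names):

    Int_Token* post_make () {
        /* Copy the text first, since as_chars() need not be NUL-terminated. */
        string text (as_chars (), text_size ());
        errno = 0;
        long v = strtol (text.c_str (), NULL, 10);  /* decimal literals only */
        if (errno != 0)
            error (loc (), "integer literal out of range");  /* assumed interface */
        _value = v;   /* _value: hypothetical field holding the converted result */
        return this;
    }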

Making tokens explicitly

There are one or two places where you will want to create a new token "by hand", rather than using the usual lexical rule's default symbol creation. In our project, assuming that the top-level AST node type is named AST, as in the skeleton, the call needed to do so is

    AST::make_token (SYNTAX, TEXTSIZE, TEXT, OWNER)
where SYNTAX is the syntactic category (one of the upper-case names from a %token declaration or the left side of a lexical rule), TEXT contains the text of the token (type const char*), TEXTSIZE is the length of the text (for generality; the text need not be NUL-terminated), and OWNER is true if the Horn framework is free to delete the text when the token is eventually deallocated. The value of TEXT must be "permanent": it must remain valid for as long as the token exists. The text and length of another token (otherToken->as_chars() and otherToken->text_size()) are acceptable for this purpose (specify OWNER as false in this case). Alternatively, the output of strdup or new char[...] also works (OWNER can be true in those cases). Finally, string literals work as well (with OWNER set to false).
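
For example, to manufacture an integer token from a string literal (assuming AST_Ptr is the skeleton's name for a pointer to AST; adjust if yours differs):

    /* "1" is a string literal, hence permanent, so OWNER is false. */
    AST_Ptr one = AST::make_token (INT, 1, "1", false);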


Page was last modified on Thu Sep 29 15:32:55 2011.
Address comments and questions to cs164@eecs.berkeley.edu