Project #2 Notes, Spring 2011 (revision 3)

This page is an accumulation of (we hope) helpful notes, hints, etc., that might be useful to you in doing Project #2.

1. Things You Can Skimp On

1.1. Least Upper Bound types

The static types of the expressions E1 and E2, E1 or E2, and E1 if B else E2 are all the least upper bounds of the types of E1 and E2. The full definition of this is a little tedious, especially when function types get involved, so we'll test only a simplified version of T=lub(T1,T2):

1.2. Assigning to target lists

Consider an assignment with multiple targets, such as

    x::int, y::bool = 2, 3
You could, of course, notice that 3 will be assigned to a bool variable, and issue an error message. On the other hand, consider
    L = [2, 3]
    x::int, y::bool = L
which is really the same case, semantically. But here, you would generally not expect a compiler to catch the error.

In the view of the fact that you can only do a partial job anyway, I'm not going to worry about these cases (feel free to check if you want, just don't expect us to test for it). To be specific, when faced with multiple targets, just pretend that the right-hand side has type any. You still have to check that the targets are all assignable.

2. C++ Hints and Gotchas

2.1. 1 Dangling pointers (I)

This will cause grief:
  AST_Node*
  afunc (...) {
      char name[10] = "__init__";
      AST_Node* token = make_token(IDENTIFIER, init_name);
      ... token eventually returned from afunc, possibly in a tree.
  }
The problem is that name here is a local variable, whose lifetime is the duration of one call to afunc. Thus, the token created will be stuck pointing to garbage when afunc returns.

2.2. Don't forget 'virtual'

Unlike Java or Python, C++ functions are by default not virtual (so that redeclaring a method in a subtype hides but does not override the ancestor's method.) So if you define a new method in AST_Node, and forget to declare it virtual, you'll end up not calling the overriding methods in your subtypes, when those methods are selected from a pointer whose static type is AST_Node. Be careful; this can be a puzzling bug. If something is not getting called that you expect to be called, check that you've declared it virtual.

2.3. No default initialization of local variables and instance variables

Unlike Java, C++ defines no default initial value for local or heap-allocated variables, or for instance variables within them. Be sure, therefore, to initialize all instance variables in your constructors. The symptoms can be bizarre: segmentation faults, tests for NULL that fail when you don't expect them to, failing sanity checks from the Horn software, etc.

The usual way to initialize instance variables to a simple value is with an initializer, as in

     class Foo : public Bar {
     public:
          Foo (int x) : Bar (x), foo_var (NULL) { }

          ...
     private:
          Foo* foo_var;
     };

2.4. Reference counting

The Horn framework provides a convenient way to release AST storage when it is no longer needed. However, its use requires some care. Basically, the idea is that the framework keeps track of how many ``claims'' there are on each node, where a claim indicates a persistent reference to the node from some someplace else. The functions ref and unref, respectively, make and release a claim on a node. Initially, nodes are created with a single claim to them. As such, you may transfer them to a tree node immediately, turning over that claim to the new parent node:
    aNode->append (MAKE_TREE (...));
(see the section Useful macros (3) for a description of MAKE_TREE). If I replace this a child of aNode, the framework will release aNode's claim on the old child and take over the claim on the new one:
    aNode->replace (i, MAKE_TREE (...));
The previous value of aNode->child(i) is unref'ed if it differs from the new value. When the number of claims on a node reaches 0, the node is released. Thus, the following code is dangerous:
    return MAKE_TREE (NEW, child(0), etc.);
where my caller is doing something like:
    replace (i, child (i)->convert_new ());
Here, I am the old value of child(i) in my caller. The replace method will unref me, which will most likely cause me to be deleted. The destructor for AST_Node will then release my children, including the one I just inserted into the node I returned. Thus, this new node will contain garbage. Typically, the dangling reference thus created does not show itself until much later, in the form of a segfault, a "bad object" message, an abort, or just plain inexplicable spewing of garbage output. The proper way to do this is:
    return MAKE_TREE (NEW, ref (child(0)), etc.);
thus making sure that when I am deleted, my former, and still useful, child is not.

3. Using GDB

3.1. Printing dynamic object types.

The command
   (gdb) set print object on
will cause GDB thenceforth to print the dynamic type of pointers that you print out, so that, e.g.,
   (gdb) p decl
will print out
   $3 = (InstanceDecl *) 0x8102c30
rather than
   $3 = (Decl *) 0x8102c30
and printing out *$3 will show the fields of an InstanceDecl

I suggest also using the command

   (gdb) set print static off
to avoid seeing static fields in every object you print.

3.2. GDB can call functions

Don't forget you can call functions and methods in GDB, as in
   (gdb) p tree->child (0)
Some things don't work too well, such as passing reference parameters (you have to remember to pass their addresses), but integer and pointer arguments should work.

One good way to exploit this feature is to add functions to your program that are never used, except by the debugger. You might have a method for dumping information about a Decl or AST_Node on the standard error, so that you can simply write

   (gdb) call dbg_decl(aDecl)
and get a readable printout. You might provide functions to print C++ strings, vectors, or dictionaries. Template methods or functions can't be called directly in the debugger, but you can provide some debugging functions that call useful instances.

It is best not to rely on overloading for these debugging methods. Give each a distinct name.

3.3. Useful macros (1): Another control structure

The skeleton contains the macro for_each_child. It's possible to define a version that allows you to replace the children you traverse. It's only a minor convenience, because you can easily write:
    for_each_child (c, aTree) {
	aTree->replace (c_i_, new_value);
    } end_for;
(c_i_ gets defined by the macro as the index of the current child.) However, if you are adventurous, you might play around with my extension:
    /** Control structure: 
     *      For each child, VAR, of AST_Node* VAR, ...
     *  replacing the child with the value of VAR at the end of each iteration.
     *    
     *  Usage:
     *      for_each_child_var (C, aTree) {
     *           <operations involving C (an AST_Node*), including assignments>
     *      } end_for;
     */
    #define for_each_child_var(Var, Node)                                   \
	do {                                                                \
	   AST_Node* Var ## _n_ = Node;                                     \
	   AST_Node* Var = NULL;                                            \
	   for (int Var ## _i_ = 0;                                         \
		Var ## _i_ < (int) Var ## _n_->arity ();                    \
		_replace_and_incr (Var ## _n_, Var ## _i_, Var)) {          \
	       Var = Var ## _n_->child (Var ## _i_);

    /** Auxiliary function used by the for_each_child_var */
    static inline int
    _replace_and_incr (AST_Node* node, int& k, AST_Node* new_child)
    {
	int n = new_child == NULL ? 0 
	    : new_child->is_list () ? new_child->arity () 
	    : 1;
	node->replace (k, new_child);
	return k+=n;
    }
As always, don't use it if you don't understand it.

3.4. Useful macros (2): Declaration boilerplate

In project 1, at least one group rediscovered the use of macros to provide "boilerplate" for recurring code fragments. When using the Horn framework, for example, you have to add some particular definitions to subclasses of your node classes to get them to register with the framework, and thus get used to make trees of particular kinds. Each such type looks like this:
   class Child_Node : public Parent_Node {
       ...
   protected:

       /** Create Child_Node with operator OPER and initial capacity N. */
       Child_Node (AST_Node* oper, size_t n) : Parent_Node (oper, n) { }

       /** Create and register exemplar. */
       Child_Node (int syntax) : Parent_Node (syntax) { }

       /** Override make (in the exemplars) to create a Child_Node. */
       Child_Node* make (AST_Node* oper, size_t n);

       static const Child_Node exemplar;
   };

   Child_Node Child_Node::exemplar (some syntactic category);
If you're defining a type that's intended only as a common supertype for a bunch of nodes (and will never be explicitly allocated itself), then you don't need the the definition of make and of exemplar.

Well, this is all pretty tedious, and it looks the same for each type. So why not refactor as a macro? For example, I use the following:

    #define STD_BASE_CONSTRUCTORS(Type, Supertype)				     \
    protected:							   \
                                                                                 \
        Type (AST_Node* oper, size_t n) : Supertype (oper, n) {	   \
        }							   \
                                                                   \
        Type (int syntax) : Supertype (syntax) { }

    #define STD_CONSTRUCTORS(Type, Supertype)			   \
        STD_BASE_CONSTRUCTORS (Type, Supertype)			   \
                                                                   \
        Type* make (AST_Node* oper, size_t n) {			   \
            return new Type (oper, n);				   \
        }							   \
                                                                   \
        static const Type exemplar

    #define STD_EXEMPLAR(Type, Syntax) const Type Type::exemplar (Syntax)
So that now I can write
   class Child_Node : public Parent_Node {
       ...
   protected:

       STD_CONSTRUCTORS (Child_Node, Parent_Node);
   };

   STD_EXEMPLAR (Child_Node, some syntactic category);

Of course, this definition doesn't allow for creating constructors with initializers (see the section " No default initialization of local variables and instance variables"), as in

       Child_Node (AST_Node* oper, size_t n) 
           : Parent_Node (oper, n), my_info (NULL) { }

       private:
           Info* my_info;
You'll have to work out appropriate generalizations yourselves.

3.5. Useful macros (3): Nonstandard expressions

At some point, I should probably augment the Horn framework to make it easier to get at tree-building stuff from outside the .hn files. Until then, I found it convenient to define a simple way to construct trees in my ordinary .cc files. Specifically, I wanted to be able to write things like
    MAKE_TREE (DEF, name, params, return_type, body)
and have it do the right thing. The problem is in allowing a call in which the number of parameters is arbitrary, because in C++ and C, there is no way to tell how many parameters you've been passed (if your signature has a trailing "..."). Basically, I copied stuff from the Horn framework to do this, and declared a function in ast.h:
    /** make_tree(SYNTAX, c1, ..., cn, NULL) creates a tree whose operator
     *  has syntax SYNTAX and whose children are C1 to Cn.  This is 
     *  essentially taken from the Horn parser framework. */
    extern AST_Tree* make_tree (int syntax, ...);
But how to get the NULL at the end? I can ask the programmer to do it, but that's a little error-prone and unsightly. A variadic macro serves to do the job:
    /** MAKE_TREE(SYNTAX, c1, ..., cn) creates a tree whose operator
     *  has syntax SYNTAX and whose children are C1 to Cn.  This is 
     *  essentially taken from the Horn parser framework. */
    #define MAKE_TREE(...) ::make_tree (__VA_ARGS__, NULL)

4. Stylistic Do's and Don'ts

4.1. Use OOP!

If you have code like this in your program:

    if (c->oper ()->as_string () == "lambda")) {
        do something to c...
    }
you are on the wrong track, at least stylistically. One almost never does direct type tests (instanceof in Java) or their moral equivalent (as above) in properly constructed object-oriented code. Instead, behavior that depends on the type of an object corresponds to a virtual method. For example, you might define in the AST_Node class:
   virtual void munge () {
       for_each_child (c, this) {
           c->munge ();
       } end_for;
   }
—so that, by default, munging an AST node simply munges its children— and then define
    class Lambda_Node : public AST_Tree {
        ...
        void munge () {
            Special munging for lambda nodes
            AST_Tree::munge ();  /* C++ version of super.munge (). */
        }

        ...
    };
so that munging a Lambda_Node does some special work first and then does the default of munging its children. What if you want an effect like this:
    void munge2 () {
         AST_Node* myParent = get my parent node somehow;
         if (myParent->oper ()->as_string () == "def")) {
             mungify me...
         } else {
             glorpify me...
         }
    }
? In that case, your code is most likely calling out to be structured differently, such as:
    void munge2 () {
         mungify me...
    }

    void glorp () {
        glorpify me...
    }
Where munge2 and glorp are both virtual functions. It is then your parent node's job to decide whether it really wants to munge or glorp you. If there is normally no difference between the two, your top-level definitions (in AST_Node) might read:
    virtual void munge2 ();
    virtual void glorp ();
with implementations
    void
    AST_Node::munge2 () { default definition; }

    void
    AST_Node::glorp () { munge2 (); }

4.2. Refactor!

Whenever you have recurring (or nearly recurring) code sequences, it's time to consider refactoring—moving recurring code into new functions. For example, I had written
    void
    AST_Tree::print (ostream& out, int indent)
    {
        out << "(" << ast_name () << " " << line_number ();
        for_each_child (c, this) {
            out << endl << setw(indent+2) << "";
            c->print (out, indent+2);
        } end_for;
        out << ")";
    }
and later, in class Id_Node:
    void print (ostream& out, int indent) {
        out << "(id " << line_number () << " ";
        child(0)->print (out, indent+2);
        if (get_decl () != NULL)
            out << " " << get_decl ()->get_index ();
        out << ")";
    }
and then, in class Type_Node:
    void print (ostream& out, int indent) {
        out << "(type " << line_number () << " ";
        child(0)->print (out, indent+2);
        out << ")";
    }
But these are all variations on a theme. So I defined a (non-virtual) protected method, AST_Tree::print_prefix, to catch the common beginning parts:
    void
    AST_Tree::print_prefix (ostream& out, int indent, bool compact)
    {
        out << "(" << ast_name () << " " << line_number ();
        for_each_child (c, this) {
            if (compact)
                out << " ";
            else 
                out << endl << setw(indent+2) << "";
            c->print (out, indent+2);
        } end_for;
    }
so that now, my print methods read:
    void
    AST_Tree::print (ostream& out, int indent)
    {
        print_prefix (out, indent, false);
        out << ")";
    }    

    // In Id_Node:
    void print (ostream& out, int indent) {
        print_prefix (out, indent, true);
        if (get_decl () != NULL)
            out << " " << get_decl ()->get_index ();
        out << ")";
    }

    // In Type_Node:
    void print (ostream& out, int indent) {
        print_prefix (out, indent, true);
        out << ")";
    }

4.3. The Law of Demeter

At one point I found myself writing stuff like this:
     string name = an_id_node->child(0)->as_string ();
—in other words, I wanted the name of an identifier stored in an Id_Node. This code is inside a method other than Id_Node, and yet it has to know something about the object structure of Id_Nodes. Of course, Id_Nodes can occur in places that also allow Typed_Id_Nodes (as in (typed_id N (id N NAME))), in which case, I would have to write
     string name = an_id_node->child(0)->child(0)->as_string ();
and (shudder) choose between one or the other depending on the dynamic type of an_id_node. I had violated the Law of Demeter, which tells me that I should have limited knowledge of how objects not closely related to me represent the information they contain.

In this case, we can fix the problem by overriding as_string on Id_Node and Typed_Id_Node:

     // In Id_Node:
     string as_string () { return child (0)->as_string (); }

     // In Type_Id_Node:
     string as_string () { return child (0)->as_string (); }
                   // my child (0) is an Id_Node!
Now I can rewrite my original code can handle both cases clearly:
     string name = an_id_node->as_string ();

4.4. Inner comments

I am always suspicious of inner comments, as in this excerpt from a recent bug-submit:
       //add a new decl
       current_index_list->push_back(global_decls.size());
           ...
       }
        else {
            //add a new ref instead of decl (if we have "x=5\nx=6") the second x is a ref
            ...
        }
    }
    //check right side to see if its assign.
    // if not assign, send to frame reference creater
There are actually three problems here:

5. Autograder Considerations

5.1. Canonical declarations

The autograder uses pyunparse --canonical --noprint to get a list of declarations that your AST actually uses (and, transitively, that those declarations reference). It reorders the declarations according to the order in which they are encountered in the source in a way that is supposed to be pretty consistent for any correctly decorated program. There are a few considerations, collected below.

5.2. Original Decl cross-reference

The numbers in square braces at the end of each canonical Decl refer to the original declaration for that Decl in your output. They are ignored for grading purposes.

5.3. Order of lambda conversions

When adding defs of functions you create for lambda expressions, put them at the beginning of their block and keep the order of these defs the same as the order of the original lambda expressions they come from.

5.4. Avoid adding id references

One group introduced a typed_id node not present in the source to indicate the type of first method parameters. Don't do this (just keep the type in the Decl itself), since it creates additional cross-references to the type's identifier, which changes the canonical declaration produced.
Page was last modified on Wed Apr 6 12:56:01 2011.
Address comments and questions to cs164@eecs.berkeley.edu