Project #1 Notes, Fall 2006

This page is an accumulation of (we hope) helpful notes, hints, etc., that might be useful to you in doing Project #1.

  1. Handling erroneous cases
  2. Answers to questions
  3. Patterns can't do everything
  4. Lexer start states
  5. Putting text back


Handling erroneous cases

There are several erroneous cases that could be handled either in the parser or the lexer. To allow us to use test cases properly, we have to decide how to handle them. Here is a list of anomalous situations and what to do about them.

  1. Misplaced "import" statements: Any line that begins with import (unindented) is supposed to be an "import statement" with significance to the lexer. If it is not of the proper form ("import", identifier, newline), then the lexer should report an error (and remember, the line can be continued with a backslash!). Any other occurrence of import should be treated as non-erroneous and given to the parser. This is simplest, since the keyword is used elsewhere in the grammar.
  2. Non-matching brackets: The issue here is what to do with cases like this:
        if a[x) = f(q):
    
    or this:
        if a[x) = f(q:
    
    The lexer should treat all left brackets (parenthesis, square bracket, and curly brace) as identical; likewise for right brackets. So the first example is not erroneous (as far as the lexer is concerned) and not continued. The second is not erroneous and is continued.

    When there is an excess of right brackets at some point, count brackets as balanced (and non-erroneous) at that point. Thus,

        if f x ) + (x +
    
    is continued (the first right parenthesis, having no matching left bracket, does not count in bracket matching).
  3. Backslashes at ends of lines in strings: I have changed the description of backslash continuation to match my original intent; see the revised project description.
  4. Bad numeric literals: Consider the integer literal 09, which is octal because it starts with a 0 digit, but which contains a non-octal digit. Do we treat it as two literals (a 0 and a decimal 9), or as an error? There are lots of similar cases. Consider 0xfoo, which could be interpreted as 0xf followed by the identifier oo, or 0x.2, which could be 0, x, and .2.

    Here, I think we'll follow the practice of Python compilers (because it turns out to be easy to do). In general, the left-to-right maximal munch rule applies: the next token in the input is the maximum-length prefix of the unprocessed part of the input that matches a valid token pattern. However, we'll make two exceptions: octal numerals may not be adjacent to non-octal numerals, so 09 is an error. Likewise, hexadecimal literals may not be adjacent to identifiers (including reserved words and keywords), so 0x3and is invalid. These exceptions are easy to encode (there is a sketch just after this list). Just add an extra rule, immediately after your rules for octal and hex literals, that matches, respectively, "octal literal followed by more digits" and "hex literal followed by letters and underscores." The actions for these will, of course, be to report errors. They will always match erroneous cases, since they are longer than the valid literals that prefix such cases. They will not match correct cases, where they match the same number of characters as the legal pattern, because you put these error cases after the legal cases.

    We'll apply the same rule to floating-point numerals. For example, read "1.32.16.8" as the three (non-erroneous) numerals "1.32", ".16", ".8".
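
To make those extra error rules concrete, here is a sketch in JFlex notation; the patterns for the legal literals and the intToken/err helpers are illustrative stand-ins, not the project's actual spec:

    // The error rules go immediately after the legal rules. On input like 09 or
    // 0x3and they match a longer prefix than the legal rules, so they win; on a
    // legal literal they tie at best, and the earlier (legal) rule takes precedence.
    // intToken and err stand for whatever your lexer actually does in these cases.
    0[0-7]*                                  { return intToken(yytext(), 8); }
    0[0-7]*[89][0-9]*                        { err("octal literal with non-octal digit: " + yytext()); }

    0[xX][0-9a-fA-F]+                        { return intToken(yytext(), 16); }
    0[xX][0-9a-fA-F]+[a-zA-Z_][a-zA-Z0-9_]*  { err("hex literal run into identifier: " + yytext()); }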

Answers to questions

This entry is a collection of substantive clarifications in response to students' questions or problems found in the project description.

    Q1: Can we assume that LexTest will always receive "proper" arguments when it's called, that is, can we assume a search path is always given, and that the input files always exist? If not, what error message(s) are we supposed to print out?

    A1: As a general rule, you should always assume that data from outside is suspect, and be prepared to detect errors and recover in some more-or-less graceful fashion. Having said that, however, since LexTest itself is merely a test harness, not part of a real application, the standards of "gracefulness" are low, and I don't really care what kind of message you produce.

    Q2: Let's say you run the command "LexTest test1.py test2.py", and test1.py imports test2.py. Do we treat this as a case of importing the same file twice?

    A2: The spec refers only to importing a file, so technically you should read the file twice in that case.

    Q3: Consider the string '\\012' (representing a four-character string containing backslash followed by the digits 0, 1, and 2). How is this to be distinguished in the output from '\n' (representing a one-character string containing a newline)?

    A3: Good question. I have updated the project description to say that the backslash character in a string should also be printed in octal form (its ASCII value is 92, so that's \134). The description now also makes clear that you should always use three octal digits, even when two will do (so '\a' should be represented by \007, not \07).

    Q4: Suppose that someone writes a literal 1 with 30 leading 0's. Should the lexer reject it? If not, what should it print? Likewise, suppose someone enters 2.3 followed by 30 trailing 0's.

    A4: Of course the lexer should not reject things with leading 0's; they are numerals after all. I think the lexer should do some canonicalization here. So let's say that leading 0s get chopped off and likewise for trailing 0s in floating-point constants. In Java, if you convert these strings to integers and back to strings, you'll get proper formatting (actually, you'll need to convert them to Longs and check that they fit in integers due to some idiotic design decisions in the Java library).
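
    For instance, the integer half of that might look roughly like this in Java (canonicalizeInt is a hypothetical helper; real code would also supply the proper radix for octal and hex lexemes):

        static String canonicalizeInt(String lexeme) {
            long v = Long.parseLong(lexeme);   // happily accepts any number of leading 0s
            if (v > Integer.MAX_VALUE) {
                throw new NumberFormatException("integer literal too large: " + lexeme);
            }
            return Long.toString(v);           // "0000...0001" comes back as "1"
        }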

    Q5: Is the exponent in a floating-point number to be treated as an integer literal (octal if it starts with 0)?

    A5: No. Always interpret the exponent as a decimal numeral. Leading zeroes are ignored, and the 0x notation is not allowed.

    Q6: Should the lexer handle concatenated string literals, reporting

        "Hello, " 'world' """!"""
    
    as if it were one string?

    A6: No. We'll let the parser handle this. Return the strings separately.

    Q7: Should we canonicalize float and int literals when returning their lexemes? E.g., should 1230e-1 be returned to LexTest (or the parser) as 123.0?

    A7: Since it would be nice to get standard output, let's output all floating-point literals using %.16e format (as in printf("%.16e\n",x)), which I believe will produce the same results in most cases for C++ and Java.
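
    In Java, that amounts to something like this (a sketch of the formatting only):

        double x = Double.parseDouble("1230e-1");  // the lexeme from the question
        System.out.printf("%.16e%n", x);           // prints 1.2300000000000000e+02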

    Q8: Should our lexer gracefully handle characters with codes >= 128 (outside the ASCII range)?

    A8: Don't bother.

    Q9: What is the clean way of exiting from LexTest? I found that when I use System.exit(1), the run-check Python program thinks everything is OK.

    A9: That's just because run-check looks at the contents of standard error to see if there is an error. (I did this to accommodate Windows brain damage.) So do a System.exit(1) to indicate errors, but make sure you have output something explanatory to System.err.
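
    In code, that comes to something like this (a sketch; the message text and fileName are hypothetical):

        System.err.println("LexTest: error in " + fileName);  // run-check looks here; fileName is hypothetical
        System.exit(1);                                       // nonzero status for the shell's benefit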

Patterns can't do everything

One of the rules of Pyth requires that if a line break occurs while you are inside an as-yet-unmatched pair of brackets, the line break is simply whitespace. Don't try to use patterns alone to distinguish newline-in-brackets from newline-outside-brackets, however. Finite-state automata cannot count (at least not above the number of states they contain). It's much easier to keep track of the number of unbalanced brackets in your actions, and then use that count in the newline action to decide what to do: return a token, or treat the newline as whitespace. Similarly, you'll want to measure amounts of whitespace and decide what to do with them in actions.
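
For instance, a bracket-counting sketch in JFlex notation; bracketDepth and the Token constants are hypothetical stand-ins, and the right-bracket action builds in the rule from item 2 above that excess right brackets do not count:

    %{
      int bracketDepth = 0;   // unmatched left brackets seen so far
    %}

    %%

    "("   { bracketDepth++; return Token.LPAR; }            // likewise for [ and {
    ")"   { if (bracketDepth > 0) bracketDepth--;           // excess rights don't count
            return Token.RPAR; }                            // likewise for ] and }
    \n    { if (bracketDepth == 0) return Token.NEWLINE;
            // inside brackets: no return, so the newline is mere whitespace
          }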

Actions are also where you manipulate start states (next).

Lexer start states

Both flex and JFlex allow you to define "lexer start states" so as to alter the set of patterns matched (yes, internally these do have something to do with states of the FSA generated by these tools). By defining a state and labeling a pattern or group of patterns with it, you can, in effect, turn those patterns on or off in the actions of your lexer.

An obvious use in this project is to handle indentation. You'd like to be able to have your lexer match indentation and decide whether to return INDENTs or DEDENTs. But there are a couple of problems. First, matching the indentation of an unindented line is difficult. It has zero length and is immediately followed by some token, so the maximal munch rule would normally just match the following token, and ignore any zero-length whitespace rule. But by putting the lexer into a "beginning of line" state when you encounter a newline character (and also initializing the lexer to start in that state at the beginning of input), you can arrange to first match only indentation at the beginning of a line.
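
A sketch of that arrangement in JFlex notation (the BOL state name and the action bodies are hypothetical):

    %state BOL
    %init{
      yybegin(BOL);   // also start in BOL at the very beginning of the input
    %init}

    %%

    // In BOL, leading whitespace is indentation: measure it, then drop back to
    // the normal state. A line with zero indentation never matches this rule;
    // that is what the pushback kludge in the last section below is for.
    <BOL> [ \t]+   { yybegin(YYINITIAL);
                     /* compare yylength() with the current indentation and
                        return INDENT or DEDENT as appropriate */ }

    \n             { yybegin(BOL); /* and return NEWLINE when outside brackets */ }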

Putting text back

Another useful routine for handling indentation is yypushback (JFlex) or yyless (flex). When you execute these in an action, they return some of the matched text to the input stream. You might find this useful, for example, when generating DEDENT tokens. For these, your lexer is supposed to return DEDENT on each of several consecutive calls to the yylex function. But it had to read all the indentation the first time to tell whether it needed to return DEDENT in the first place. These routines allow you to put the indentation back and reprocess it on the next call (presumably, you'll also change some other variables that keep track of unclosed indentation, or you'll get your lexer into an infinite loop).
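
As an illustration, the DEDENT half of the indentation action from the previous section might look roughly like this (indentStack and the Token names are hypothetical, and the error case of dedenting to a width that was never on the stack is omitted):

    %{
      // indentation widths of the open blocks, seeded with 0 for the left margin
      private final java.util.ArrayDeque<Integer> indentStack =
          new java.util.ArrayDeque<>(java.util.Collections.singleton(0));
    %}

    %%

    <BOL> [ \t]+   { int width = yylength();
                     if (width < indentStack.peek()) {
                         // Emit one DEDENT, pop one level, and push the whitespace
                         // back so this same rule runs again on the next yylex call.
                         // Popping is what keeps this from looping forever.
                         indentStack.pop();
                         yypushback(width);
                         return Token.DEDENT;
                     }
                     yybegin(YYINITIAL);
                     if (width > indentStack.peek()) {
                         indentStack.push(width);
                         return Token.INDENT;
                     }
                     // equal width: same block, so the whitespace was just whitespace
                   }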

Kludge warning: You may need to push back a character in cases where you must match the empty string (for example, at the beginning of a line with no indentation where you have to output DEDENT tokens). This is because flex and JFlex refuse to match the empty string. What you do instead is something like this:

    (.|\n)   { yypushback(1); /* do something... */ }

Yucky, but it works.
