Start Conditions in Flex

(from Nathan's discussion section on 1/25/10)

Recognizing String Literals

Many programming languages allow string literals with escapes, e.g., \n for newline. It's cleanest to process these escapes in the lexer before we pass string tokens to the parser; this way the escape recognition and handling code is all contained in one file and the parser's treatment of strings is simple.

One way we can handle string escapes is to recognize a string literal using a single regular expression and post-process it in the lexer. For example, the following could be a Flex rule for string literals in C:

(?x:\"
    (
      [^\"\n\\]
      |
      \\ (
           ['\"?\\abfnrtv]
           |
           [0-7]{1,3}
           |
           x[0-9a-fA-F]+
           |
           u[0-9a-fA-F]{4}
           |
           U[0-9a-fA-F]{8}
         )
    ) *
    \")    { return substitute_escapes(strbuf, yytext + 1, yyleng - 2); }


This rule is readable, thanks to the x option, but still formidably complex. Worse, we must recognize all the escapes a second time in order to select the correct substitutions for them.
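
To make the duplication concrete, here is a minimal sketch in C of what substitute_escapes might look like. The signature, the TOK_STR value, and the hex_value helper are assumptions made for illustration, not part of the original notes. The \u and \U escapes are left out to keep the sketch short. Note how the switch repeats the case analysis that the regular expression already performed.

```c
#include <ctype.h>
#include <stddef.h>

enum { TOK_STR = 258 };   /* hypothetical token code */

/* Hypothetical helper: numeric value of one hexadecimal digit. */
static unsigned hex_value(char c)
{
    if (c >= '0' && c <= '9') return (unsigned)(c - '0');
    return (unsigned)(tolower((unsigned char)c) - 'a') + 10;
}

/* Decode the body of a string literal (quotes already stripped) into out,
   which must have room for len + 1 bytes. Assumes src has already matched
   the lexer pattern, so every escape is well formed. */
int substitute_escapes(char *out, const char *src, size_t len)
{
    size_t i = 0;
    while (i < len) {
        char c = src[i++];
        if (c != '\\' || i >= len) {   /* ordinary character */
            *out++ = c;
            continue;
        }
        c = src[i++];                  /* character after the backslash */
        switch (c) {
        case 'a': *out++ = '\a'; break;
        case 'b': *out++ = '\b'; break;
        case 'f': *out++ = '\f'; break;
        case 'n': *out++ = '\n'; break;
        case 'r': *out++ = '\r'; break;
        case 't': *out++ = '\t'; break;
        case 'v': *out++ = '\v'; break;
        case 'x': {                    /* one or more hex digits */
            unsigned v = 0;
            while (i < len && isxdigit((unsigned char)src[i]))
                v = v * 16 + hex_value(src[i++]);
            *out++ = (char)v;
            break;
        }
        case '0': case '1': case '2': case '3':
        case '4': case '5': case '6': case '7': {
            unsigned v = (unsigned)(c - '0');   /* up to three octal digits */
            int digits = 1;
            while (digits < 3 && i < len && src[i] >= '0' && src[i] <= '7') {
                v = v * 8 + (unsigned)(src[i++] - '0');
                digits++;
            }
            *out++ = (char)v;
            break;
        }
        default:                       /* \' \" \? \\ ; \u and \U not handled */
            *out++ = c;
            break;
        }
    }
    *out = '\0';
    return TOK_STR;
}
```

Every escape listed in the pattern reappears here as a case label, so adding a new escape means editing two places.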


String Rules with Start Conditions

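We can recognize string literals and construct string tokens with simpler rules by making the rules conditional. In Flex, rules that are prefixed with “<SC>” are active only when the lexer is in the start condition named “SC”. Start conditions other than the predefined INITIAL must be declared in the definitions section, with %s for an inclusive condition or %x for an exclusive one; the rules here assume STRING is declared exclusive (%x STRING), so that only the <STRING> rules are active inside a literal. For example: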


<INITIAL>\"            Clear string buffer; BEGIN(STRING);
<STRING>\"             BEGIN(INITIAL); Append '\0' to string buffer; return TOK_STR;
<STRING>\n             error("Missing quote at end of string"); BEGIN(INITIAL);
<STRING>\\['"?\\]      Append yytext[1] to string buffer;
<STRING>\\a            Append '\a' to string buffer;
<STRING>\\b            Append '\b' to string buffer;
<STRING>\\f            Append '\f' to string buffer;
<STRING>\\[0-7]{1,3}   {
                         /* yytext is NUL-terminated; skip the leading backslash. */
                         Append (char) strtol(yytext + 1, 0, 8) to string buffer;
                       }
// ...other escapes...
<STRING>[^\"\n\\]+     Append yytext to string buffer.


Notice that we only have to refer to each escape once, since its interpretation is contained in its own rule.
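
The actions above are written as pseudocode ("Append ... to string buffer"). One possible way to back them with real C is sketched here, assuming a fixed-capacity global buffer. The names strbuf_len, strbuf_clear, strbuf_append_char, and strbuf_append, and the STRBUF_MAX capacity, are hypothetical and chosen only for illustration.

```c
#include <stddef.h>
#include <string.h>

#define STRBUF_MAX 1024          /* assumed fixed capacity */

char strbuf[STRBUF_MAX];         /* contents of the string token being built */
size_t strbuf_len = 0;           /* number of bytes currently stored */

/* "Clear string buffer" */
void strbuf_clear(void)
{
    strbuf_len = 0;
}

/* "Append c to string buffer"; returns 1 on success, 0 if the buffer is full. */
int strbuf_append_char(char c)
{
    if (strbuf_len >= STRBUF_MAX)
        return 0;
    strbuf[strbuf_len++] = c;
    return 1;
}

/* "Append yytext to string buffer"; copies n bytes, or nothing if they do not fit. */
int strbuf_append(const char *text, size_t n)
{
    if (n > STRBUF_MAX - strbuf_len)
        return 0;
    memcpy(strbuf + strbuf_len, text, n);
    strbuf_len += n;
    return 1;
}
```

With helpers like these, the action for <STRING>\\a would become strbuf_append_char('\a'); and the catch-all rule would become strbuf_append(yytext, yyleng);. A real lexer would also need to decide what to do when an append fails, for example by reporting an overlong string.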


Exercise: Using Start Conditions for Syntax Highlighting

Suppose we want to display some source code with strings and comments in a different color from the regular code, and the mechanism we have for displaying colored text is to output control codes that switch the current text color. We can use Flex rules with start conditions to recognize when to output the color codes.

What start conditions do you need? What patterns tell you when to change the condition?
