CS164 - Section 2 - February 3/4, 2004 Announcements: WA1 due Thursday at 12:30pm PA2 due Sunday at 11:59pm (early autograder Saturday 11:59pm) WA2/PA3 out Thursday Send e-mail to me with: 1. Your e-mail address 2. 1 thing you like so far or 1 thing to improve about section Outline: - Flex and JLex (PA1) - Finite Automata (WA1) - Context-Free Grammars (WA2, PA3) ---------------------------------------------------------------------------- 1. Flex and JLex What are Flex and JLex? Flex and JLex are tools that will spit out C or Java code for a scanner (lexer) for you, given a bunch of regular expressions. Flex and JLex take regular expressions, convert them internally to a DFA, and then emit code for that DFA. Recall Compiler Stages: source ==LEXER==> tokens ==PARSER==> abstract syntax tree ==> ... Flex/JLex helps you build your lexer. How? - flex/JLex specification ==flex/JLex==> .c/.java - flex/JLex, given a specification, writes a C/Java source file that when compiled becomes your lexer. Flex syntax for regular expressions: 'a' literal 'a' a* 0 or more a's a+ 1 or more a's a? 0 or 1 a's [ab] (a|b) [^ab] all characters not equal to a or b [a-z] all characters between a and z [a-zA-Z_] all letters plus the underscore [^a-z0-9] all characters not between a-z or 0-9 . all character not equal to \n Flex input format: %{ [headers for lexer, such as #includes] %} [definitions for useful names and delcarations for start conditions] %% [lexer rules] %% [user code to be included with lexer] What does the output of flex look like? Something like: [headers for lexer, such as #includes] int yylex() { [code from your declarations and lexer rules] } [user code to be included with lexer] General Idea of Lexer Rules: You specify regular expressions that when matched execute some action. What action? Usually to return some token that you've identified. Special variables, functions, etc.: Flex Java ---- ---- yytext yytext() The string that was matched, usually the lexeme. yyleng yylength() The length of the string that was matched. yylex Lexer.next_token() The function that gets the next token that you implement with the rules in the specification. In flex, the return value is an integer specifying which token. In JLex, it's a Symbol that has an integer for the token and an Object used to associate a semantic value with the token. yylval N/A A global variable where you assign any semantic value to associate with the current token. Special state in which the lexer starts. (discuss states) Flex Example (cribbed from manual) In this lexer, we want to skip C-style comments and whitespace, recognize integers, floating point numbers, keywords (if, then, else, fi), identifiers (lowercase letters with number starting with lowercase letter), operators (+, -, *, /). Spot the errors in the flex example: %{ /* We need math.h for the atof() call below. */ #include DIGIT [0-9] ID [a-z][a-z0-9]* %} %% ID printf("identifier: %s\n", yytext); [^*]* /* skip */ "*"[^*/]* /* skip */ "*"+"/" BEGIN(INITIAL); "/*" BEGIN(COMMENT); {DIGIT}+ printf("int: %d\n", atoi(yytext)); {DIGIT}+"."{DIGIT}* printf("float: %g\n", atof(yytext)); if|then|else|fi printf("keyword: %s\n", yytext); "+"|"-"|"*"|"/" printf("operator: %s\n", yytext); [ \t\n]+ /* eat up whitespace */ .+ printf("unrecognized: %s\n", yytext); %% int main(int argc, char **argv) { ... } The corrected version of this lexer is available online along with the section notes. Questions: - How do we handle ambiguity? - How do we handle errors? A clarification of the Cool spec: Strings ERROR or OK? "Good or bad" "Good or bad" "Good or \n bad" "Good or bad" "Good or \ bad" "Good or \0 bad" ---------------------------------------------------------------------------- 2. Finite Automata Finite automata can be used to implement regular languages. Theorem: A language is regular if and only if it is accepted by a finite automaton. Formal definition: (\Sigma, S, n, F, \delta) \Sigma is the alphabet S is the set of states n is the start state (n \in S) F is the set of accepting states (F \subseteq S) \delta is the transition function (\delta : (state, input) -> state) Construct DFA for: - (a|b)*c - Strings that contain "ab". - All identifiers except "if" and "fi". NFA -> DFA: Idea: Simulate the NFA by having each state in the DFA be a subset of the states of the NFA. What is the start state in the DFA? A transition S -> S' on input 'a' exists in the DFA iff what? Which states in the DFA are final? Example: Do the NFA -> DFA conversion for the above example for (ab)*c. --------------------------------------------------------------------------- 3. Context-Free Grammars (Intro) After the lexing phase, we now have a stream of tokens. Now what? Parse! - Analogy: We've identified the words in the sentence. Now we want to give structure to determine if they're valid sentences. - What is the output of the parser? - What does the parser filter out? Why aren't regular languages enough? A context-free grammar is N - a set of non-terminals T - a set of terminals S - a start symbol (a non-terminal) \delta - a set of productions - How do we tell if some sequence is in this language? - Let G be a CFG. Then the language of G is: { a_1 ... a_n | S ->* a_1 ... a_n and every a_i is a terminal } Write a grammar for the language of balanced parantheses. Assume that T = {'(', ')'}. Example Programming Language: - This language has two special symbols 't' and 'f'. It has a prefix operator '!' and two infix operators '&&' and '||'. Write a grammar for this language. - Many CFGs describe the same language. The form is important. - What is a derivation? - Show the leftmost derivation of "!t && f || t". - Are there other derivations? - Can you rewrite the grammar so that any derivation gives precedence to the operators in the following order: !, &&, || (That is, ! binds tightest and || is loosest.)