CSE P501 – Compiler Construction
• Scanner
• Regex
• Automata
• Hand-Written Scanner
• Grammars & BNF
• Next
Jim Hogg - UW - CSE P501

Scanner
[Diagram: compiler pipeline. Source → Front End (Scan, Parse, Semantics) → ‘Middle End’ (Optimize) → Back End (Select Instructions, Allocate Registers, Emit) → Target. Data passed between stages: chars → tokens → AST → IR → Machine Code.]
AST = Abstract Syntax Tree
IR = Intermediate Representation

Automatic or Hand-Written?
• Use a scanner-generator (JFlex): regexes that define the tokens (.jflex file) → JFlex → Scanner (.java file)
OR
• Write a scanner, in Java, by hand
  • Easy and enlightening
  • Will see an outline of how, later

Reminder: a token is . . .
Source:
    class C {
      public int fac(int n) {  // factorial
        int nn;
        if (n < 1)
          nn = 1;
        else
          nn = n * (this.fac(n-1));
        return nn;
      }
    }
Char stream (key: ◊ = newline \n, ∙ = space):
    class∙C∙{◊∙∙public∙int∙fac(int∙n)∙{∙∙//∙factorial◊∙∙∙∙int∙nn;◊∙∙∙∙if∙(n∙<∙1)◊∙∙∙∙∙∙nn∙=∙1;◊∙∙∙∙else◊∙∙∙∙∙∙nn∙=∙n∙*∙(this.fac(n-1));◊∙∙∙∙return∙nn;◊∙∙}◊}
Token stream:
    CLASS ID:C LBRACE PUBLIC INT ID:fac LPAREN INT ID:n RPAREN LBRACE INT ID:nn SEMI IF LPAREN ID:n LT ILIT:1 RPAREN ID:nn EQ ILIT:1 SEMI ELSE ID:nn EQ ID:n TIMES LPAREN ID:this DOT ID:fac LPAREN ID:n MINUS ILIT:1 RPAREN RPAREN SEMI RETURN ID:nn SEMI RBRACE RBRACE

A Token in your Java scanner
    class Token {
      public int kind;       // eg: LPAREN, ID, ILIT
      public int line;       // for debugging/diagnostics
      public int column;     // for debugging/diagnostics
      public String lexeme;  // eg: "x", "Total", "(", "42"
      public int value;      // attribute of ILIT
    }
• Obviously this Token is wasteful of memory:
  • lexeme is not required for primitive tokens, such as LPAREN, RBRACE, etc
  • value is only required for ILIT
• But there's only 1 token alive at any instant during parsing, so there is no point refining it into 3 leaner variants!

Typical Tokens
• Operators & Punctuation
  • Single chars: + - * = / ( ] ; :
  • Double chars: :: <= == !=
• Keywords
  • if while for goto return switch void …
• Identifiers
  • A single ID token kind, parameterized by lexeme
• Integer constants
  • A single ILIT token kind, parameterized by int value
See jflex-1.5.0\examples\java\java.flex for a real example

Token Spotting
    if(a<=3)++grades[1];       // what are the tokens? (no spaces)
    public int fac(int n) {    // what are the tokens? (need spaces?)
• Counter-example: fixed-format FORTRAN:
    DO 50 I = 1,99    // DO loop
    DO 50 I = 1.2     // assignment: DO50I = 1.2

Principle of Longest Match
• Scanner should pick the longest possible string to make up the next token (“greedy” algorithm)
• Example: return idx <= iffy; should be scanned into 5 tokens:
    RETURN ID:idx LEQ ID:iffy SEMI
  • <= is one token, not two
  • iffy is an ID, not IF followed by ID:fy

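The longest-match rule can be sketched in a few lines of Java. This is only an illustration, not how JFlex works internally: it tries every rule at the current position with java.util.regex and keeps the longest match. The class name LongestMatchDemo, the rule table and the string token kinds are assumptions made for this sketch. Ties go to the rule listed first, which is the usual convention in lex-style tools, so keywords are listed before ID.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LongestMatchDemo {
    // Hypothetical rule table. Order only matters for ties: keywords come before ID.
    static final Map<String, Pattern> RULES = new LinkedHashMap<>();
    static {
        RULES.put("RETURN", Pattern.compile("return"));
        RULES.put("IF", Pattern.compile("if"));
        RULES.put("ID", Pattern.compile("[a-zA-Z_][a-zA-Z0-9_]*"));
        RULES.put("LEQ", Pattern.compile("<="));
        RULES.put("LT", Pattern.compile("<"));
        RULES.put("SEMI", Pattern.compile(";"));
    }

    static List<String> scan(String input) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            if (Character.isWhitespace(input.charAt(pos))) { pos++; continue; }
            String bestKind = null;
            int bestEnd = pos;
            for (Map.Entry<String, Pattern> rule : RULES.entrySet()) {
                Matcher m = rule.getValue().matcher(input);
                m.region(pos, input.length());
                // lookingAt() only matches starting at the region start
                if (m.lookingAt() && m.end() > bestEnd) {
                    bestKind = rule.getKey();   // strictly longer match wins
                    bestEnd = m.end();
                }
            }
            if (bestKind == null) {
                throw new IllegalArgumentException("no token at index " + pos);
            }
            String lexeme = input.substring(pos, bestEnd);
            tokens.add(bestKind.equals("ID") ? "ID:" + lexeme : bestKind);
            pos = bestEnd;
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(scan("return idx <= iffy;"));
    }
}
```

For iffy, the IF rule matches 2 chars but the ID rule matches 4, so ID wins; for return, both RETURN and ID match 6 chars, so the earlier rule (RETURN) is kept.
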
Regex
• The lexical syntax (the tokens) of most programming languages can be specified using Regular Expressions
  • “REs” in Cooper & Torczon
  • “regex” is more common
• Tokens can be recognized by a deterministic finite automaton (DFA)
• The DFA (a Java class) is almost always generated from regex using a software tool, such as JFlex

Regex Cheat Sheet
• Precedence: * (highest), concatenation, | (lowest)
• Parentheses can be used to group regexes as needed
• Notice meta-characters, in red
• Escaped characters: \* \+ \? \| \. \t \n

Regex Examples
• Check free online regex tutorials if you are rusty. Eg: http://regexone.com/
• Experiment with a regex-capable editor. Eg: http://www.editpadpro.com/

regex
• Defined over some alphabet Σ
  • For programming languages, the alphabet is ASCII or Unicode
• If re is a regular expression, L(re) is the language (set of strings) generated by re

regex macros
• Possible syntax for numeric constants
    Digit  = [0-9]
    Digits = Digit+
    Number = Digits ( \. Digits )? ( [eE] ( \+ | - )? Digits )?
• How would you describe this set in English?
• What are some examples of legal constants (strings) generated by Number?
• Tools like JFlex accept these convenient macros

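To experiment with the Number macro, it can be transcribed into java.util.regex syntax, building the pattern from the same three pieces. A minimal sketch; the class name NumberRegexDemo and the helper isNumber are made up for this example, and JFlex's own macro syntax differs from Java string concatenation.

```java
import java.util.regex.Pattern;

class NumberRegexDemo {
    // Same three macros as on the slide, written as Java strings.
    static final String DIGIT = "[0-9]";
    static final String DIGITS = DIGIT + "+";
    // The literal '.' must be escaped; inside [+-] the '+' needs no escape.
    static final String NUMBER =
        DIGITS + "(\\." + DIGITS + ")?([eE][+-]?" + DIGITS + ")?";

    static final Pattern NUMBER_PATTERN = Pattern.compile(NUMBER);

    /** True if the whole string is a legal Number constant. */
    static boolean isNumber(String s) {
        return NUMBER_PATTERN.matcher(s).matches();
    }

    public static void main(String[] args) {
        String[] samples = { "42", "3.14", "6.02e23", "1E-9", ".5", "1.", "1e" };
        for (String s : samples) {
            System.out.println(s + " : " + isNumber(s));
        }
    }
}
```

The first four samples are legal; .5, 1. and 1e are not, because Digits is required before the point, after the point, and after the exponent marker.
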
Automata
• Finite automata (state machines) can be used to recognize strings generated by regular expressions
• Can build an automaton by hand or automagically
  • Will not build by hand in this course
  • Will use the JFlex tool: given a set of regexes, it generates an automaton recognizer (a Java class)

Finite Automata Terminology

DFA for “cat”
regex = cat
[Diagram: a start state followed by arcs labelled c, a, t, ending in an accepting state (double circle).]

DFA for ILIT
regex = [0-9][0-9]* = [0-9]+
[Diagram: state 1 → state 2 on 0-9; state 2 loops on 0-9.]
We have labelled the states

DFA for ID
• regex = [a-zA-Z_][a-zA-Z0-9_]*
[Diagram: state 0 → state 1 on a-z, A-Z or _; state 1 loops on a-z, A-Z, _ or 0-9.]

DFAs work like this . . .
• Scan the input text string, character-by-character, following the arc/edge corresponding to the character just read
• If there is no arc for the character just read, then either:
  • if you are in an accepting state: you're done. Success!
  • if you are not in an accepting state: you're done. Failure!

DFAs work like this - examples
[Diagram: DFA for alphaid: state 0 → state 1 on a-z; state 1 loops on a-z.]
• Scan "fac(int n);" for the regex alphaid = [a-z]+ (lower-case alphas): we hit "(" and are already in state 1. Success
• Scan "23;" for regex alphaid: there is no arc for "2". We are still in state 0. Failure
• Scan "today" for regex alphaid: we hit end-of-string and are already in state 1. Success
Note: no need to add arcs to the DFA for all error cases - they are implicit

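The three examples above can be replayed with a small table-free DFA runner. This is a sketch for the alphaid DFA only; the names AlphaIdDfa, next and run are invented for the example, and a missing arc is modelled as -1 rather than as an explicit error state.

```java
class AlphaIdDfa {
    static final int START = 0;     // state 0
    static final int ACCEPT = 1;    // state 1, the only accepting state
    static final int NO_ARC = -1;   // "no arc for this character"

    /** Follows the arc for c out of the given state, or reports NO_ARC. */
    static int next(int state, char c) {
        boolean lower = c >= 'a' && c <= 'z';
        // states 0 and 1 each have a single arc, labelled a-z, into state 1
        return lower ? ACCEPT : NO_ARC;
    }

    /** Returns how many chars were consumed on success, or -1 on failure. */
    static int run(String input) {
        int state = START;
        int i = 0;
        while (i < input.length()) {
            int target = next(state, input.charAt(i));
            if (target == NO_ARC) break;   // stop at the first char with no arc
            state = target;
            i++;
        }
        return state == ACCEPT ? i : -1;
    }

    public static void main(String[] args) {
        System.out.println(run("fac(int n);"));  // stops at "(" in state 1
        System.out.println(run("23;"));          // no arc for "2" from state 0
        System.out.println(run("today"));        // end of string in state 1
    }
}
```

Returning the consumed length, rather than a plain boolean, is a design choice that lets a scanner cut the lexeme out of the input after a successful run.
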
Thompson’s Construction: Combining DFAs
[Diagrams: DFA for a (one arc labelled a); DFA for b (one arc labelled b); NFA for ab (the two joined by one ε arc); NFA for a|b (the two placed side by side and joined by four ε arcs).]

Combining DFAs, cont’d
[Diagrams: DFA for a; DFA for b; NFA for a* (the DFA for a plus four ε arcs).]

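The three combining rules can be sketched as Java methods that glue NFA fragments together with ε arcs. This is one common way to write Thompson's construction, under these assumptions: every fragment has exactly one start and one accepting state, '\0' stands in for an ε label, and the names Thompson, Frag, Edge, symbol, concat, alt and star are invented for the sketch.

```java
import java.util.ArrayList;
import java.util.List;

class Thompson {
    static final char EPS = '\0';   // stands in for an ε label (assumption)

    /** One NFA arc: from -label-> to. */
    static final class Edge {
        final int from, to;
        final char label;
        Edge(int from, char label, int to) { this.from = from; this.label = label; this.to = to; }
    }

    /** An NFA fragment with one start state and one accepting state. */
    static final class Frag {
        final int start, accept;
        Frag(int start, int accept) { this.start = start; this.accept = accept; }
    }

    final List<Edge> edges = new ArrayList<>();
    int nextState = 0;

    int newState() { return nextState++; }
    void edge(int from, char label, int to) { edges.add(new Edge(from, label, to)); }

    /** Single symbol: two states joined by one labelled arc. */
    Frag symbol(char c) {
        int s = newState(), t = newState();
        edge(s, c, t);
        return new Frag(s, t);
    }

    /** xy: one ε arc from x's accepting state to y's start state. */
    Frag concat(Frag x, Frag y) {
        edge(x.accept, EPS, y.start);
        return new Frag(x.start, y.accept);
    }

    /** x|y: new start and accepting states, four ε arcs. */
    Frag alt(Frag x, Frag y) {
        int s = newState(), t = newState();
        edge(s, EPS, x.start);  edge(s, EPS, y.start);
        edge(x.accept, EPS, t); edge(y.accept, EPS, t);
        return new Frag(s, t);
    }

    /** x*: new start and accepting states, four ε arcs (enter, skip, repeat, leave). */
    Frag star(Frag x) {
        int s = newState(), t = newState();
        edge(s, EPS, x.start);        edge(s, EPS, t);
        edge(x.accept, EPS, x.start); edge(x.accept, EPS, t);
        return new Frag(s, t);
    }

    public static void main(String[] args) {
        Thompson t = new Thompson();
        Frag f = t.concat(t.symbol('a'), t.star(t.alt(t.symbol('b'), t.symbol('c'))));
        System.out.println(t.nextState + " states, " + t.edges.size() + " arcs, start "
                + f.start + ", accept " + f.accept);
    }
}
```

Building a(b|c)* this way gives 10 states and 12 arcs (3 labelled, 9 ε); the state numbers depend on construction order and need not match any particular diagram.
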
Exercise
Draw the NFA for: b(at|ag) | bug
[Diagram: NFA with arcs labelled b, a, t, a, g (for b(at|ag)) and b, u, g (for bug).]

NFA for a(b|c)*
[Diagram: NFA for a(b|c)*, with arcs labelled a, b and c.]
• To recognize "acb" successfully, we need to do one of:
  • guess the future correctly
  • backtrack and retry if we fail to recognize
  • somehow execute all possible paths
• None of these is attractive! Can we construct an equivalent DFA?

Finite State Automaton (FSA)
• A finite set of states
  • One marked as initial state
  • One or more marked as final states
  • States sometimes labeled or numbered
• A set of transitions from state to state
  • Each labeled with a symbol from Σ, or ε
• Operate by reading input symbols (usually characters)
  • Transition can be taken if labeled with current symbol
  • ε-transition can be taken at any time (free bus ride)
• Accept when final state reached & no more input
  • Scanner uses an FSA as a subroutine – accept longest match from current location each time called, even if more input
• Reject if no transition possible, or no more input and not in final state (DFA)

DFA vs NFA
• Deterministic Finite Automata (DFA)
  • No choice of which transition to take
  • In particular, no ε transitions
  • No guessing
• Non-deterministic Finite Automata (NFA)
  • Choice of transition in at least one case
  • Accepts if some way to reach final state on given input
  • Rejects if no possible way to final state
  • How to implement in software?

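One way to implement an NFA in software is to "execute all possible paths" at once: keep the set of states the NFA could be in, and update the whole set on each input char. The sketch below does this for a(b|c)*. The state numbers (0 to 9) and the ε arcs are an assumption: they follow a standard Thompson-style layout for this expression, and the class and method names are invented for the example.

```java
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.Deque;
import java.util.Set;
import java.util.TreeSet;

class NfaSimulation {
    // Assumed ε arcs: EPS[s] lists the states reachable from s on one ε arc.
    static final int[][] EPS = {
        {}, {2}, {3, 9}, {4, 6}, {}, {8}, {}, {8}, {3, 9}, {}
    };
    static final int START = 0, ACCEPT = 9;

    /** Assumed labelled arcs: 0 -a-> 1, 4 -b-> 5, 6 -c-> 7. Returns -1 if none. */
    static int step(int state, char c) {
        if (state == 0 && c == 'a') return 1;
        if (state == 4 && c == 'b') return 5;
        if (state == 6 && c == 'c') return 7;
        return -1;
    }

    /** All states reachable from 'states' using only ε arcs (including themselves). */
    static Set<Integer> closure(Set<Integer> states) {
        Set<Integer> result = new TreeSet<>(states);
        Deque<Integer> work = new ArrayDeque<>(states);
        while (!work.isEmpty()) {
            int s = work.pop();
            for (int t : EPS[s]) {
                if (result.add(t)) work.push(t);   // visit each state once
            }
        }
        return result;
    }

    /** True if some path through the NFA accepts the whole input. */
    static boolean accepts(String input) {
        Set<Integer> current = closure(Collections.singleton(START));
        for (char c : input.toCharArray()) {
            Set<Integer> moved = new TreeSet<>();
            for (int s : current) {
                int t = step(s, c);
                if (t >= 0) moved.add(t);
            }
            current = closure(moved);   // empty set means every path has died
        }
        return current.contains(ACCEPT);
    }

    public static void main(String[] args) {
        System.out.println(accepts("acb"));
        System.out.println(accepts("ba"));
    }
}
```

No guessing or backtracking is needed, but each input char costs work proportional to the number of NFA states, which is the motivation for converting to a DFA ahead of time.
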
DFAs in Scanners
• We really want a DFA for speed: no backtracking, no guessing, no foretelling the future
• Conversion from regex to NFA is easy, right?
• But how to turn an NFA into an equivalent DFA?
  • Turns out to be obvious (once seen) and easy

NFA to DFA
[Diagram: NFA for a(b|c)*, states 0 to 9, with arcs labelled a, b and c, plus ε edges.]
Starting with the above NFA, we want to 'collapse' the ε edges, ending up with a DFA that recognizes, and rejects, the same char strings. Ideally, we will end up with:
[Diagram: DFA with states 0 and 1: state 0 → state 1 on a; state 1 loops on b and on c.]

NFA to DFA
[Diagram: NFA for a(b|c)*, states 0 to 9.]
• Begin in the Start state
• For each labelled arc leaving that state, what set of states can I reach, along that labelled arc, or along ε transitions?

NFA to DFA
[Diagram: NFA for a(b|c)*, with states named n0 to n9.]

NFA to DFA
[Diagram: DFA for a(b|c)*, states d0 to d3: d0 → d1 on a; three arcs labelled b and three labelled c among d1, d2 and d3.]

NFA to DFA - Even Better
[Diagram: DFA for a(b|c)*, states d0 and d1: d0 → d1 on a; d1 loops on b and on c.]
• Can reduce the number of states further, to yield the above result
• If interested, see books for details
• State minimization is not examined in P501

From NFA to DFA
• Subset construction (equivalence class)
  • Construct DFA from NFA, where each DFA state represents a set of NFA states
• Key idea
  • State of DFA after reading some input is the set of all states the NFA could have reached after reading the same input
• Algorithm: example of a fixed-point computation
• If NFA has n states, DFA has at most 2^n states
  • => DFA is finite, can construct in finite # steps

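The subset construction can be sketched as a worklist loop that runs until no new DFA state appears (the fixed point). The NFA below is a(b|c)* with states 0 to 9; its ε arcs are an assumption (a standard Thompson-style layout), and the names SubsetConstruction, step, closure and build are invented for this sketch.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

class SubsetConstruction {
    // Assumed NFA for a(b|c)*: ε arcs per state, plus three labelled arcs in step().
    static final int[][] EPS = { {}, {2}, {3, 9}, {4, 6}, {}, {8}, {}, {8}, {3, 9}, {} };
    static final char[] ALPHABET = { 'a', 'b', 'c' };

    static int step(int state, char c) {
        if (state == 0 && c == 'a') return 1;
        if (state == 4 && c == 'b') return 5;
        if (state == 6 && c == 'c') return 7;
        return -1;
    }

    /** All NFA states reachable from 'states' along ε arcs only. */
    static Set<Integer> closure(Set<Integer> states) {
        Set<Integer> result = new TreeSet<>(states);
        Deque<Integer> work = new ArrayDeque<>(states);
        while (!work.isEmpty()) {
            int s = work.pop();
            for (int t : EPS[s]) if (result.add(t)) work.push(t);
        }
        return result;
    }

    /** Returns the DFA states (sets of NFA states) as d0, d1, ...; appends arcs to 'arcs'. */
    static List<Set<Integer>> build(List<String> arcs) {
        Map<Set<Integer>, Integer> ids = new LinkedHashMap<>();
        Deque<Set<Integer>> work = new ArrayDeque<>();
        Set<Integer> start = closure(Collections.singleton(0));
        ids.put(start, 0);
        work.add(start);
        while (!work.isEmpty()) {                 // fixed point: stop when no new sets appear
            Set<Integer> from = work.remove();
            for (char c : ALPHABET) {
                Set<Integer> moved = new TreeSet<>();
                for (int s : from) {
                    int t = step(s, c);
                    if (t >= 0) moved.add(t);
                }
                if (moved.isEmpty()) continue;    // no arc: the error case stays implicit
                Set<Integer> to = closure(moved);
                if (!ids.containsKey(to)) {
                    ids.put(to, ids.size());      // a new DFA state
                    work.add(to);
                }
                arcs.add("d" + ids.get(from) + " -" + c + "-> d" + ids.get(to));
            }
        }
        return new ArrayList<>(ids.keySet());
    }

    public static void main(String[] args) {
        List<String> arcs = new ArrayList<>();
        List<Set<Integer>> states = build(arcs);
        for (int i = 0; i < states.size(); i++) System.out.println("d" + i + " = " + states.get(i));
        for (String arc : arcs) System.out.println(arc);
    }
}
```

With this NFA the loop finds four DFA states (d0 to d3) and seven arcs: one labelled a, three labelled b and three labelled c. A DFA state is accepting when its set contains the NFA's accepting state (9 here), which is true of d1, d2 and d3.
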
Build DFA for: b(at|ag) | bug from its NFA
[Diagram: NFA with states 0 to 12 and arcs labelled b, a, t, a, g, b, u, g.]

Hand-Written Scanner
• Idea: show a hand-written DFA for some typical tokens
  • Then use it to construct a hand-written scanner
• Setting: parser calls scanner whenever it wants the next token
  • JFlex provides next_token
• Scanner stores current position in input
• For illustration only. Course project will use the JFlex scanner-generator
• Note - most commercial compilers use hand-written scanners - generally faster

Scanner DFA Example – Part 1
[Diagram: start state 0 loops on whitespace or comments. From it: end of input → Accept EOF; ( → Accept LPAREN; ) → Accept RPAREN; ; → Accept SEMI. The accepting states are numbered 1 to 4.]

Scanner DFA Example – Part 2
[Diagram: states 5 to 10. ! followed by = → Accept NEQ; ! followed by [other] → Accept NOT; < followed by = → Accept LEQ; < followed by [other] → Accept LESS.]

Scanner DFA Example – Part 3
[Diagram: states 11 and 12. A first [0-9] enters a state that loops on [0-9]; [other] then leads to Accept ILIT.]

Scanner DFA Example – Part 4
[Diagram: states 13 and 14. A first [a-zA-Z] enters a state that loops on [a-zA-Z0-9_]; [other] then leads to Accept ID or keyword.]
• Strategies for handling identifiers vs keywords
  • Hand-written scanner: look up identifier-like things in table of keywords
  • Machine-generated scanner: generate DFA with appropriate transitions to recognize keywords

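The hand-written strategy (scan an identifier-like lexeme, then look it up in a table of keywords) might look like this. A minimal sketch: the class KeywordTable, the method classify and the use of strings for token kinds are assumptions for illustration; the keyword list is just the handful that appear in the earlier fac example.

```java
import java.util.HashMap;
import java.util.Map;

class KeywordTable {
    // Hypothetical keyword table: lexeme -> token kind.
    static final Map<String, String> KEYWORDS = new HashMap<>();
    static {
        KEYWORDS.put("class", "CLASS");
        KEYWORDS.put("public", "PUBLIC");
        KEYWORDS.put("int", "INT");
        KEYWORDS.put("if", "IF");
        KEYWORDS.put("else", "ELSE");
        KEYWORDS.put("return", "RETURN");
    }

    /** Classifies an already-scanned identifier-like lexeme. */
    static String classify(String lexeme) {
        String kind = KEYWORDS.get(lexeme);
        return kind != null ? kind : "ID:" + lexeme;   // not a keyword: plain identifier
    }

    public static void main(String[] args) {
        System.out.println(classify("if"));
        System.out.println(classify("iffy"));
    }
}
```

Because the lookup happens only after the whole lexeme has been scanned, longest match is preserved: iffy is looked up as a whole and is not found, so it stays an ID.
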
Scanner – class, ctor, skipWhite
    public class Scanner {
      private String prog;  // the MiniJava program to be scanned
      private int p;        // index in 'prog' of current char

      public Scanner(String prog) {
        this.prog = prog;
        p = 0;
      }

      // advance p past any whitespace, stopping at end of input
      private void skipWhite() {
        while (p < prog.length() && Character.isWhitespace(prog.charAt(p))) p++;
      }

Scanner - id
      private Token id() {
        int pBegin = p;  // remember begin index of id
        while (p < prog.length()) {  // step thru chars of id, stopping at end of input
          char c = prog.charAt(p);   // current char
          if (!(Character.isAlphabetic(c) || Character.isDigit(c) || c == '_')) break;
          p++;
        }
        return new Token(ID, prog.substring(pBegin, p));
      }

Scanner - iLit
      private Token iLit() {
        int pBegin = p;  // remember begin index of lexeme
        int val = 0;     // value of the digits seen so far
        while (p < prog.length() && Character.isDigit(prog.charAt(p))) {  // step thru chars of number
          val = 10 * val + Character.getNumericValue(prog.charAt(p));     // fold in the current digit only
          p++;
        }
        String lex = prog.substring(pBegin, p);
        return new Token(ILIT, lex, val);
      }

Scanner - nextToken
      public Token nextToken() {
        skipWhite();  // returns at prog[p]
        if (p >= prog.length()) return new Token(EOF, "");  // end of input
        char c = prog.charAt(p);  // current char in 'prog'
        char n = (p + 1 < prog.length()) ? prog.charAt(p + 1) : '\0';  // next char in 'prog', if any
        switch (c) {
          case '>':
            if (n == '=') { p += 2; return new Token(GEQ, ">="); }
            else          { p++;    return new Token(GT, ">"); }
          // . . .
          case '+': p++; return new Token(PLUS, "+");
          // . . .
        }  // end of switch

Scanner – nextToken, cont’d
        if (Character.isDigit(c)) {
          return this.iLit();
        } else if (Character.isAlphabetic(c)) {
          return this.id();
        } else {
          p++;  // skip the bad char so the next call makes progress
          return new Token(BAD, "");
        }
      }  // end of nextToken
    }  // end of class Scanner
An entire hand-written scanner for MiniJava takes ~100 lines of Java

Grammars & BNF
• Since the 60s, the syntax of every significant programming language has been specified by a formal grammar
• First done in 1959 with BNF (Backus-Naur Form); used to specify ALGOL 60 syntax
• Borrowed from the linguistics community (Noam Chomsky)

Grammar for a Tiny Language
• program → statement | program statement
• statement → assignStmt | ifStmt
• assignStmt → id = expr ;
• ifStmt → if ( expr ) statement
• expr → id | ilit | expr + expr
• id → a | b | c | i | j | k | n | x | y | z
• ilit → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
Note: often see ::= used instead of →

Example Derivation
    program    ::= statement | program statement
    statement  ::= assignStmt | ifStmt
    assignStmt ::= id = expr ;
    ifStmt     ::= if ( expr ) statement
    expr       ::= id | ilit | expr + expr
    id         ::= a | b | c | i | j | k | n | x | y | z
    ilit       ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
Abbreviated:
    P    → S | P S
    S    → A | I
    A    → id = E ;
    I    → if ( E ) S
    E    → id | ilit | E + E
    id   → [a-z]
    ilit → [0-9]
Input: a = 1 ; if ( a + 1 ) b = 2 ;

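Parse Tree - First Few Steps
    P    → S | P S
    S    → A | I
    A    → id = E ;
    I    → if ( E ) S
    E    → id | ilit | E + E
    id   → [a-z]
    ilit → [0-9]
Input: a = 1 ; if ( a + 1 ) b = 2 ;
[Diagram: partial parse tree over the input, with root P and the nodes S, S, A, id, =, E, ilit and ; built so far.]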