
CSE P501 – Compiler Construction


Presentation Transcript


  1. CSE P501 – Compiler Construction: Scanner • Regex • Automata • Hand-Written Scanner • Grammars & BNF • Next. Jim Hogg - UW - CSE P501

  2. Scanner. [Diagram: compiler pipeline. Source → Front End (chars → Scan → tokens → Parse → AST → Semantics → IR) → ‘Middle End’ (IR → Optimize → IR) → Back End (IR → Select Instructions → Allocate Registers → Emit → Machine Code) → Target.] AST = Abstract Syntax Tree; IR = Intermediate Representation

  3. Automatic or Hand-Written? • Use a scanner-generator, JFlex: regexes that define the tokens (.jflex file) → JFlex → Scanner (.java file) • OR: write a scanner, in Java, by hand • Easy and enlightening • Will see an outline of how, later

  4. Reminder: a token is . . .
      Source:
        class C {
          public int fac(int n) { // factorial
            int nn;
            if (n < 1)
              nn = 1;
            else
              nn = n * (this.fac(n-1));
            return nn;
          }
        }
      Char stream (key: ◊ = newline \n, ∙ = space):
        class∙C∙{◊∙∙public∙int∙fac(int∙n)∙{∙∙//∙factorial◊∙∙∙∙int∙nn;◊∙∙∙∙if(n∙<∙1)◊∙∙∙∙∙∙nn∙=∙1;◊∙∙∙∙else◊∙∙∙∙nn∙=∙n∙*∙(this.fac(n-1));◊∙∙∙∙return∙nn;◊∙∙}◊}
      Token stream:
        CLASS ID:C LBRACE PUBLIC INT ID:fac LPAREN INT ID:n RPAREN LBRACE INT ID:nn SEMI IF LPAREN ID:n LT ILIT:1 RPAREN ID:nn EQ ILIT:1 SEMI ELSE ID:nn EQ ID:n TIMES LPAREN ID:this DOT ID:fac LPAREN ID:n MINUS ILIT:1 RPAREN RPAREN SEMI RETURN ID:nn SEMI RBRACE RBRACE

  5. A Token in your Java scanner
        class Token {
          public int kind;      // eg: LPAREN, ID, ILIT
          public int line;      // for debugging/diagnostics
          public int column;    // for debugging/diagnostics
          public String lexeme; // eg: "x", "Total", "(", "42"
          public int value;     // attribute of ILIT
        }
      • Obviously this Token is wasteful of memory: • lexeme is not required for primitive tokens, such as LPAREN, RBRACE, etc. • value is only required for ILIT • But there's only 1 token alive at any instant during parsing, so there is no point refining it into 3 leaner variants!

  6. Typical Tokens • Operators & Punctuation • Single chars: + - * = / ( ] ; : • Double chars: :: <= == != • Keywords • if while for goto return switch void … • Identifiers • A single ID token kind, parameterized by lexeme • Integer constants • A single ILIT token kind, parameterized by int value. See jflex-1.5.0\examples\java\java.flex for a real example

  7. Token Spotting • if(a<=3)++grades[1]; // what are the tokens? (no spaces) • public int fac(int n) { // what are the tokens? (need spaces?) • Counter-example: fixed-format FORTRAN, where spaces are not significant: • DO 50 I = 1,99 // DO loop • DO 50 I = 1.2 // assignment: DO50I = 1.2

  8. Principle of Longest Match • Scanner should pick the longest possible string to make up the next token (“greedy” algorithm) • Example: return idx <= iffy; should be scanned into 5 tokens: RETURN ID:idx LEQ ID:iffy SEMI • <= is one token, not two • iffy is an ID, not IF followed by ID:fy
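
A minimal sketch of longest match, written for this transcript rather than taken from the course code. The class `LongestMatch`, the method `tokens`, and the `LT` and `BAD:` token spellings are hypothetical; the other token names follow the slide. It handles only the handful of token kinds needed for the slide's example.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: names here are assumptions, not part of the course scanner.
class LongestMatch {
    static List<String> tokens(String src) {
        List<String> out = new ArrayList<>();
        int p = 0;
        while (p < src.length()) {
            char c = src.charAt(p);
            if (Character.isWhitespace(c)) { p++; continue; }
            if (Character.isLetter(c) || c == '_') {
                int begin = p;
                // Keep consuming while the lexeme can still grow.
                while (p < src.length()
                        && (Character.isLetterOrDigit(src.charAt(p)) || src.charAt(p) == '_')) {
                    p++;
                }
                String lexeme = src.substring(begin, p);
                // Decide keyword vs identifier only after the whole lexeme is read,
                // so "iffy" is never split into IF followed by "fy".
                switch (lexeme) {
                    case "return": out.add("RETURN"); break;
                    case "if":     out.add("IF");     break;
                    default:       out.add("ID:" + lexeme);
                }
            } else if (c == '<') {
                // Prefer the two-character operator when it is available.
                if (p + 1 < src.length() && src.charAt(p + 1) == '=') { out.add("LEQ"); p += 2; }
                else { out.add("LT"); p++; }
            } else if (c == ';') {
                out.add("SEMI"); p++;
            } else {
                out.add("BAD:" + c); p++;
            }
        }
        return out;
    }
}
```

Calling `LongestMatch.tokens("return idx <= iffy;")` yields the five tokens listed on the slide: RETURN, ID:idx, LEQ, ID:iffy, SEMI.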

  9. Regex • The lexical structure (the tokens) of most programming languages can be specified using Regular Expressions • “REs” in Cooper & Torczon • “regex” is more common • Tokens can be recognized by a deterministic finite automaton (DFA) • The DFA (a Java class) is almost always generated from the regexes using a software tool, such as JFlex

  10. Regex Cheat Sheet • Precedence: * (highest), concatenation, | (lowest) • Parentheses can be used to group regexes as needed • Notice meta-characters, in red • Escaped characters: \* \+ \? \| \. \t \n
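
The precedence rule can be checked with Java's standard `java.util.regex` package, which follows the same ordering for `*`, concatenation, and `|`. This is a small sketch; the class `RegexPrecedence` and its helper `matches` are hypothetical names used only for illustration.

```java
import java.util.regex.Pattern;

// Illustrative only: a thin wrapper so the precedence cases read clearly.
class RegexPrecedence {
    // True when the whole input string matches the regex.
    static boolean matches(String regex, String input) {
        return Pattern.matches(regex, input);
    }

    public static void main(String[] args) {
        // ab|cd* groups as (ab)|(c(d*)): * binds tighter than concatenation,
        // and concatenation binds tighter than |.
        System.out.println(matches("ab|cd*", "ab"));      // one alternative
        System.out.println(matches("ab|cd*", "cddd"));    // the other alternative
        System.out.println(matches("ab|cd*", "abcd"));    // neither alternative alone
        // Parentheses change the grouping.
        System.out.println(matches("(ab|cd)*", "abcd"));
        // An escaped meta-character matches itself.
        System.out.println(matches("a\\*", "a*"));
    }
}
```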

  11. Regex Examples • Check free online Regex tutorials if you are rusty. Eg: http://regexone.com/ • Experiment with a regex-capable editor. Eg: http://www.editpadpro.com/

  12. regex • Defined over some alphabet Σ • For programming languages, alphabet is ASCII or Unicode • If re is a regular expression, L(re) is the language (set of strings) generated by re

  13. regex macros • Possible syntax for numeric constants: Digit = [0-9]; Digits = Digit+; Number = Digits ( \. Digits )? ( [eE] ( \+ | - )? Digits )? (the literal . and + are escaped because they are meta-characters) • How would you describe this set in English? • What are some examples of legal constants (strings) generated by Number? • Tools like JFlex accept these convenient macros
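
One way to experiment with the Number macro is to build it from string constants in Java, mirroring how the macros compose. This is a sketch using `java.util.regex`, not JFlex syntax; the class `NumberRegex` and the method `isNumber` are hypothetical names.

```java
import java.util.regex.Pattern;

// Illustrative only: the macros from the slide, spelled as Java string constants.
class NumberRegex {
    static final String DIGIT  = "[0-9]";
    static final String DIGITS = DIGIT + "+";
    // Digits ( \. Digits )? ( [eE] ( \+ | - )? Digits )?
    static final String NUMBER =
        DIGITS + "(\\." + DIGITS + ")?([eE][+-]?" + DIGITS + ")?";

    static final Pattern PATTERN = Pattern.compile(NUMBER);

    // True when the whole string is a legal constant under Number.
    static boolean isNumber(String s) {
        return PATTERN.matcher(s).matches();
    }
}
```

Under this definition `42`, `3.14`, `6.02e23` and `1E-9` are accepted, while `.5`, `1.` and `1e` are rejected, because a fraction or exponent, when present, must contain at least one digit and the integer part is mandatory.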

  14. Automata • Finite automata (state machines) can be used to recognize strings generated by regular expressions • Can build an automaton by hand or automagically • Will not build by hand in this course • Will use the JFlex tool: given a set of regexes, it generates an automaton recognizer (a Java class)

  15. Finite Automata Terminology

  16. DFA for “cat” • regex = cat • [Diagram: Start State --c--> --a--> --t--> Accepting State (double circles)]

  17. DFA for ILIT • regex = [0-9][0-9]* = [0-9]+ • [Diagram: state 1 --0-9--> state 2 (accepting); state 2 loops to itself on 0-9] • We have labelled the states

  18. DFA for ID • regex = [a-zA-Z_][a-zA-Z0-9_]* • [Diagram: state 0 --(a-z, A-Z, _)--> state 1 (accepting); state 1 loops to itself on a-z, A-Z, 0-9, _]

  19. DFAs work like this . . . • scan the input text string, character-by-character • following the arc/edge corresponding to the character just read • if there is no arc for the character just read, then, either: • if you are in an accepting state: you're done. Success! • if you are not in an accepting state: you're done. Failure!

  20. DFAs work like this - examples • regex alphaid = [a-z]+ (lower-case alphas) • [Diagram: state 0 --a-z--> state 1 (accepting); state 1 loops to itself on a-z] • Scan "fac(int n);" for the regex alphaid: we hit "(" and are already in state 1. Success • Scan "23;" for the regex alphaid: there is no arc for "2". We are still in state 0. Failure • Scan "today" for the regex alphaid: we hit end-of-string and are already in state 1. Success • Note: no need to add arcs to the DFA for all error cases - they are implicit
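
The three scans above can be replayed with a few lines of Java. This is a minimal sketch of the two-state alphaid DFA, assuming state 0 is the start state and state 1 the only accepting state; the class `AlphaIdDfa` and the method `run` are hypothetical names.

```java
// Illustrative only: a direct simulation of the two-state DFA for [a-z]+.
class AlphaIdDfa {
    // Returns the length of the recognized prefix, or -1 on failure.
    static int run(String input) {
        int state = 0; // start state
        int p = 0;     // index of the next character to read
        while (p < input.length()) {
            char c = input.charAt(p);
            // The only arcs are 0 -> 1 and 1 -> 1, both labelled a-z.
            boolean hasArc = (c >= 'a' && c <= 'z');
            if (!hasArc) break; // no arc: stop and inspect the current state
            state = 1;
            p++;
        }
        // Success only if we stopped in the accepting state.
        return state == 1 ? p : -1;
    }
}
```

`AlphaIdDfa.run("fac(int n);")` returns 3 (the lexeme `fac`), `run("23;")` returns -1, and `run("today")` returns 5.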

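  21. Thompson’s Construction: Combining DFAs • [Diagrams: DFA for a (a single arc labelled a); DFA for b (a single arc labelled b); NFA for ab (the a machine joined to the b machine by an ε arc); NFA for a|b (a new start state with ε arcs into the a and b machines, and ε arcs from each of them into a new accepting state)]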

  22. Combining DFAs, cont’d • [Diagrams: DFA for a; DFA for b; NFA for a* (the a machine wrapped in four ε arcs, so that it can be skipped entirely or repeated)]

  23. Exercise • Draw the NFA for: b(at|ag) | bug • [Diagram residue: arcs labelled b, a, t, a, g and b, u, g]

  24. Exercise, cont’d • Draw the NFA for: b(at|ag) | bug • [Diagram: one branch reads b, then either a t or a g; the other branch reads b u g]

  25. NFA for a(b|c)* • [Diagram: NFA for a(b|c)*] • To recognize "acb" successfully, we need to do one of: • guess the future correctly • backtrack and retry if we fail to recognize • somehow execute all possible paths • None of these is attractive! Can we construct an equivalent DFA?

  26. Finite State Automaton (FSA) • A finite set of states • One marked as initial state • One or more marked as final states • States sometimes labeled or numbered • A set of transitions from state to state • Each labeled with symbol from Σ, or ε • Operate by reading input symbols (usually characters) • Transition can be taken if labeled with current symbol • ε-transition can be taken at any time (free bus ride) • Accept when final state reached & no more input • Scanner uses an FSA as a subroutine – accept longest match from current location each time called, even if more input • Reject if no transition possible, or no more input and not in final state (DFA)

  27. DFA vs NFA • Deterministic Finite Automata (DFA) • No choice of which transition to take • In particular, no ε transitions • No guessing • Non-deterministic Finite Automata (NFA) • Choice of transition in at least one case • Accepts if some way to reach final state on given input • Reject if no possible way to final state • How to implement in software?

  28. DFAs in Scanners • We really want DFA for speed: no backtracking, no guessing, no foretelling the future • Conversion from regex to NFA is easy, right? • But how to turn an NFA into an equivalent DFA? • Turns out to be obvious (once seen) and easy

  29. NFA to DFA • [Diagram: NFA for a(b|c)*, states 0 to 9, with labelled arcs 0 --a--> 1, 4 --b--> 5, 6 --c--> 7 and ε arcs elsewhere] • Starting with the above NFA, we want to 'collapse' the epsilon edges, ending up with a DFA that recognizes, and rejects, the same char strings • Ideally, we will end up with: [Diagram: 0 --a--> 1; state 1 loops to itself on b and on c]

  30. NFA to DFA • [Diagram: NFA for a(b|c)*, states 0 to 9] • Begin in the Start state • For each labelled arc leaving that state, what set of states can I reach, along the labelled arc, or along ε transitions?

  31. NFA to DFA • [Diagram: NFA for a(b|c)* with states labelled n0 to n9; labelled arcs n0 --a--> n1, n4 --b--> n5, n6 --c--> n7]

  32. NFA to DFA • [Diagram: DFA for a(b|c)* with states d0 to d3: d0 --a--> d1; d1 --b--> d2; d1 --c--> d3; d2 --b--> d2; d2 --c--> d3; d3 --b--> d2; d3 --c--> d3]

  33. NFA to DFA - Even Better • [Diagram: DFA for a(b|c)*: d0 --a--> d1; d1 loops to itself on b and on c] • Can reduce the number of states further, to yield the above result • If interested, see books for details • State minimization is not examined in P501

  34. From NFA to DFA • Subset construction (equivalence class) • Construct DFA from NFA, where each DFA state represents a set of NFA states • Key idea • State of DFA after reading some input is the set of all states the NFA could have reached after reading the same input • Algorithm: example of a fixed-point computation • If NFA has n states, DFA has at most 2^n states • => DFA is finite, can construct in finite # steps
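
The subset construction can be sketched in Java as a worklist (fixed-point) loop. Everything named here is an assumption made for illustration: the class `SubsetConstruction`, the `EPS` marker, and in particular the edge table, which encodes the usual Thompson-style NFA for a(b|c)* with states 0 to 9. The original diagram did not survive in this transcript, so the ε arcs are reconstructed rather than copied.

```java
import java.util.*;

// Illustrative only: subset construction over a hard-coded NFA for a(b|c)*.
class SubsetConstruction {
    static final int EPS = 0; // hypothetical marker for an epsilon arc

    // Assumed NFA arcs as {from, label, to}; state 0 is the start, state 9 accepts.
    static final int[][] EDGES = {
        {0, 'a', 1}, {1, EPS, 2}, {2, EPS, 3}, {2, EPS, 9}, {3, EPS, 4}, {3, EPS, 6},
        {4, 'b', 5}, {6, 'c', 7}, {5, EPS, 8}, {7, EPS, 8}, {8, EPS, 3}, {8, EPS, 9}
    };

    // All NFA states reachable from 'states' using epsilon arcs only.
    static Set<Integer> closure(Set<Integer> states) {
        Set<Integer> result = new TreeSet<>(states);
        Deque<Integer> work = new ArrayDeque<>(states);
        while (!work.isEmpty()) {
            int s = work.pop();
            for (int[] e : EDGES)
                if (e[0] == s && e[1] == EPS && result.add(e[2])) work.push(e[2]);
        }
        return result;
    }

    // All NFA states reachable from 'states' along one arc labelled c.
    static Set<Integer> move(Set<Integer> states, char c) {
        Set<Integer> result = new TreeSet<>();
        for (int[] e : EDGES)
            if (e[1] == c && states.contains(e[0])) result.add(e[2]);
        return result;
    }

    // Each DFA state is a set of NFA states; the map holds its outgoing arcs.
    static Map<Set<Integer>, Map<Character, Set<Integer>>> build(String alphabet) {
        Map<Set<Integer>, Map<Character, Set<Integer>>> dfa = new LinkedHashMap<>();
        Deque<Set<Integer>> work = new ArrayDeque<>();
        Set<Integer> start = closure(Collections.singleton(0));
        dfa.put(start, new TreeMap<>());
        work.add(start);
        while (!work.isEmpty()) { // stops when no new state sets appear
            Set<Integer> d = work.poll();
            for (char c : alphabet.toCharArray()) {
                Set<Integer> target = closure(move(d, c));
                if (target.isEmpty()) continue; // implicit error arc
                dfa.get(d).put(c, target);
                if (!dfa.containsKey(target)) {
                    dfa.put(target, new TreeMap<>());
                    work.add(target);
                }
            }
        }
        return dfa;
    }
}
```

With this edge table, `SubsetConstruction.build("abc")` produces four DFA states: {0}, {1,2,3,4,6,9}, {3,4,5,6,8,9} and {3,4,6,7,8,9}. The last three contain NFA state 9 and are therefore accepting. The loop terminates because there are at most 2^n distinct subsets to discover.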

  35. Build DFA for: b(at|ag) | bug from its NFA • [Diagram: NFA for b(at|ag) | bug, with states numbered 0 to 12]

  36. Build DFA for: b(at|ag) | bug from its NFA, cont’d • [Diagram: NFA for b(at|ag) | bug, with states numbered 0 to 12]

  37. Hand-Written Scanner • Idea: show a hand-written DFA for some typical tokens • Then use it to construct a hand-written scanner • Setting: Parser calls scanner whenever it wants the next token • JFlex provides next_token • Scanner stores current position in input • For illustration only. Course project will use the JFlex scanner-generator • Note - most commercial compilers use hand-written scanners - generally faster

  38. Scanner DFA Example – Part 1 • [Diagram: state 0 loops to itself on whitespace or comments; 0 --end of input--> 1, Accept EOF; 0 --(--> 2, Accept LPAREN; 0 --)--> 3, Accept RPAREN; 0 --;--> 4, Accept SEMI]

  39. Scanner DFA Example – Part 2 • [Diagram: 0 --!--> 5; 5 --=--> 6, Accept NEQ; 5 --[other]--> 7, Accept NOT; 0 --<--> 8; 8 --=--> 9, Accept LEQ; 8 --[other]--> 10, Accept LESS]

  40. Scanner DFA Example – Part 3 • [Diagram: 0 --[0-9]--> 11; state 11 loops to itself on [0-9]; 11 --[other]--> 12, Accept ILIT]

  41. Scanner DFA Example – Part 4 • [Diagram: 0 --[a-zA-Z]--> 13; state 13 loops to itself on [a-zA-Z0-9_]; 13 --[other]--> 14, Accept ID or keyword] • Strategies for handling identifiers vs keywords • Hand-written scanner: look up identifier-like things in a table of keywords • Machine-generated scanner: generate a DFA with appropriate transitions to recognize keywords

  42. Scanner – class, ctor, skipWhite
        public class Scanner {
          private String prog; // the MiniJava program to be scanned
          private int p;       // index in 'prog' of current char
          public Scanner(String prog) {
            this.prog = prog;
            p = 0;
          }
          // advance p past whitespace, stopping at end of input
          private void skipWhite() {
            while (p < prog.length() && Character.isWhitespace(prog.charAt(p))) p++;
          }

  43. Scanner - id
          private Token id() {
            int pBegin = p; // remember begin index of id
            // consume letters, digits and '_', stopping at end of input
            while (p < prog.length()) {
              char c = prog.charAt(p); // current char
              if (!(Character.isAlphabetic(c) || Character.isDigit(c) || c == '_')) break;
              p++;
            }
            return new Token(ID, prog.substring(pBegin, p));
          }

  44. Scanner - iLit
          private Token iLit() {
            int pBegin = p; // remember begin index of lexeme
            int val = 0;    // value accumulated so far
            while (p < prog.length() && Character.isDigit(prog.charAt(p))) { // step thru chars of number
              val = 10 * val + Character.getNumericValue(prog.charAt(p)); // fold in the current digit
              p++;
            }
            String lex = prog.substring(pBegin, p);
            return new Token(ILIT, lex, val);
          }

  45. Scanner - nextToken
          public Token nextToken() {
            skipWhite(); // returns at prog[p], or at end of input
            if (p >= prog.length()) return new Token(EOF, "");
            char c = prog.charAt(p); // current char in 'prog'
            char n = (p + 1 < prog.length()) ? prog.charAt(p + 1) : '\0'; // next char in 'prog', if any
            switch (c) {
              case '>':
                if (n == '=') { p++; p++; return new Token(GEQ, ">="); }
                else { p++; return new Token(GT, ">"); }
              // . . .
              case '+': p++; return new Token(PLUS, "+");
              // . . .
            } // end of switch

  46. Scanner – nextToken, cont’d
            if (Character.isDigit(c)) {
              return this.iLit();
            } else if (Character.isAlphabetic(c)) {
              return this.id();
            } else {
              p++; // skip the bad char so the next call makes progress
              return new Token(BAD, "");
            }
          } // end of nextToken
        } // end of class Scanner
      An entire hand-written scanner for MiniJava takes ~100 lines of Java

  47. Grammars & BNF • Since the 60s, the syntax of every significant programming language has been specified by a formal grammar • First done in 1959 with BNF (Backus-Naur Form); used to specify ALGOL 60 syntax • Borrowed from the linguistics community (Noam Chomsky)

  48. Grammar for a Tiny Language • program → statement | program statement • statement → assignStmt | ifStmt • assignStmt → id = expr ; • ifStmt → if ( expr ) statement • expr → id | ilit | expr + expr • id → a | b | c | i | j | k | n | x | y | z • ilit → 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 • Note: often see ::= used instead of →

  49. Example Derivation
      Grammar:
        program ::= statement | program statement
        statement ::= assignStmt | ifStmt
        assignStmt ::= id = expr ;
        ifStmt ::= if ( expr ) statement
        expr ::= id | ilit | expr + expr
        id ::= a | b | c | i | j | k | n | x | y | z
        ilit ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
      Input: a = 1 ; if ( a + 1 ) b = 2 ;
      Abbreviated grammar:
        P → S | P S
        S → A | I
        A → id = E ;
        I → if ( E ) S
        E → id | ilit | E + E
        id → [a-z]
        ilit → [0-9]

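  50. Parse Tree - First Few Steps
      Grammar:
        P → S | P S
        S → A | I
        A → id = E ;
        I → if ( E ) S
        E → id | ilit | E + E
        id → [a-z]
        ilit → [0-9]
      Input: a = 1 ; if ( a + 1 ) b = 2 ;
      [Diagram: partial parse tree. P at the root with children P and S; the inner P derives S, then A, then id = E ; with E deriving ilit]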
