

  1. Programming Languages, 2nd edition, Tucker and Noonan Chapter 3: Lexical and Syntactic Analysis "Syntactic sugar causes cancer of the semicolon." – A. Perlis

  2. Contents 3.1 Chomsky Hierarchy 3.2 Lexical Analysis 3.3 Syntactic Analysis

  3. Lexical Analysis • 3.1 Chomsky Hierarchy of Languages • 3.2 Purpose of Lexical Analysis • Regular Expressions • Regular expressions for the Clite lexicon • Finite State Automata (FSA) • FSA as a basis for a lexical analyzer • Lexical Analyzer (Lexer) Code

  4. 3.1 Chomsky Hierarchy • Each grammar class corresponds to a language class • Regular grammars → lexical grammars • Context-free grammars → programming language syntax • Context-sensitive grammars → able to express some type rules • Unrestricted grammars → most powerful; can express all features of languages such as C/C++

  5. Chomsky Hierarchy • Context-sensitive and unrestricted grammars are not appropriate for developing translators • Given a terminal string ω and an unrestricted grammar G, it is undecidable whether ω is in the language defined by G; even for a context-sensitive grammar G, it is undecidable whether L(G) contains any strings at all. • A problem is decidable if you can write an algorithm that is guaranteed to solve the problem in a finite number of steps.

  6. Regular Grammars (for Lexical Analysis) • In terms of expressive power, equivalent to: • Regular expressions • Finite-state automata

  7. Context-Free Grammars • Capable of expressing concrete syntax of programming languages • Equivalent to • a pushdown automaton • Other grammar levels – beyond the scope of this course; see CS 403 or 603 – also correspond to theoretical machines

  8. 3.2 Lexical Analysis • Input: a sequence of characters (the program) • Discard: whitespace, comments • Output: tokens • Define: A token is a logically cohesive sequence of characters representing a single symbol; e.g. • Identifiers: numberVal • Literals: 123, 5.67, 'x', true • Keywords: bool | char ... • Operators: + - * / ... • Punctuation: ; , ( ) { }

  9. Character Sequences to Be Recognized by the Clite Lexer (tokens + other) • Tokens: Identifiers • Literals • Keywords • Operators • Punctuation • Other: Whitespace (space or tab) • Comments (// to end-of-line) • End-of-line • End-of-file

  10. Ways to Describe Lexical Elements • Natural language descriptions • Regular grammars • Regular expressions • Context-free grammars

  11. Regular Expressions • Regular expressions (regexp) are patterns that describe a particular class of strings • Used for pattern matching • One regexp can describe or match many strings • Used in many text-processing applications • Python, Perl, Tcl, UNIX utilities such as grep all use regular expressions

  12. Using Regular Expressions • An alternative to regular grammars for expressing lexical syntax • Lexical-analyzer generator programs (e.g. Lex) take regular expressions as input and produce C/C++ programs that tokenize text.

  13. With Regular Expressions You Can (http://msdn2.microsoft.com/en-us/library/101eysae(VS.80).aspx) • Test for a pattern within a string (data validation) • For example, you can test an input string to see if a telephone number pattern or a credit card number pattern occurs within the string. • Replace text. • Use a regular expression to identify specific text in a document and either remove it completely or replace it with other text. • Extract a substring from a string based upon a pattern match. • Find specific text within a document or input field.
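
A minimal Java sketch of these three uses (the phone-number pattern and test strings here are invented for illustration; they are not from the MSDN page):

    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class RegexUses {
        public static void main(String[] args) {
            // 1. Test for a pattern within a string (data validation)
            Pattern phone = Pattern.compile("\\d{3}-\\d{3}-\\d{4}");
            System.out.println(phone.matcher("call 555-867-5309 now").find());   // true

            // 2. Replace text that matches the pattern
            System.out.println("call 555-867-5309 now"
                .replaceAll("\\d{3}-\\d{3}-\\d{4}", "<number>"));                // call <number> now

            // 3. Extract the matching substring
            Matcher m = phone.matcher("call 555-867-5309 now");
            if (m.find()) System.out.println(m.group());                         // 555-867-5309
        }
    }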

  14. Regular Expression Notation – page 62 • RegExpr : Meaning • x : a character x • \x : an escape character, e.g., \n or \t • {name} : a reference to a name • M | N : M or N • M N : M followed by N • M* : zero or more occurrences of M • Red characters = metacharacters

  15. RegExpr : Meaning • M+ : one or more occurrences of M • M? : zero or one occurrence of M • [aeiou] : the set of vowels; matches any one of them • [0-9] : the set of digits; matches any one of them ('-' is a metacharacter here) • . : any single character (1-char wildcard) • \d : same as [0-9] • \w : same as [a-zA-Z0-9_] • \s : whitespace: [ \t\n] • These shorthands differ across some regexp implementations

  16. Simple Example • gr[ae]y, (gray|grey) and gr(a|e)y are equivalent regexps. • All three match either "gray" or "grey".

  17. Pattern To Match a Date In the Form yyyy-mm-dd, yyyy.mm.dd, or yyyy/mm/dd • (19|20)\d\d[- /.](0[1-9]|1[012])[- /.](0[1-9]|[12][0-9]|3[01]) • (19|20)\d\d : matches "19" or "20" followed by two digits • [- /.] : matches '-', ' ', '/', or '.' • (0[1-9]|1[012]) : the first option matches a digit between 01 and 09, the second matches 10, 11 or 12. • (0[1-9]|[12][0-9]|3[01]) : the 1st option matches digits 01-09, the 2nd 10-29, and the 3rd matches 30 or 31.
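
The pattern can be checked quickly with Java's regex library (a sketch; the test dates are made up, and note that the pattern validates format only, not the calendar):

    import java.util.regex.Pattern;

    public class DatePattern {
        public static void main(String[] args) {
            // \\d is Java's string-escape form of \d; the [- /.] class is taken verbatim
            Pattern date = Pattern.compile(
                "(19|20)\\d\\d[- /.](0[1-9]|1[012])[- /.](0[1-9]|[12][0-9]|3[01])");
            System.out.println(date.matcher("2007-03-31").matches()); // true
            System.out.println(date.matcher("2007.13.01").matches()); // false: no month 13
            System.out.println(date.matcher("1999/02/30").matches()); // true: format ok, Feb 30 not rejected
        }
    }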

  18. Clite Lexical Syntax: Ancillary Definitions • Category Name : Definition • anyChar : [ -~] // all printable ASCII chars; blank - tilde • letter : [a-zA-Z] • digit : [0-9] • whitespace : [ \t] // blank or tab • eol : \n • eof : \004

  19. Clite Lexical Syntax (regexp metacharacters in red) • Category : Definition • keyword : bool | char | else | false | float | if | int | main | true | while • identifier : {letter}({letter} | {digit})* • integerLit : {digit}+ • floatLit : {digit}+\.{digit}+ • charLit : '{anyChar}' • operator : = | || | && | == | != | < | <= | > | >= | + | - | * | / | ! | [ | ] • separator : ; | , | { | } | ( | ) • comment : //({anyChar} | {whitespace})*{eol}
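
To see how these definitions behave, two of them can be expanded into java.util.regex form (a sketch; Java regexps have no {name} macro facility, so the {letter} and {digit} references are expanded inline):

    import java.util.regex.Pattern;

    public class CliteRegex {
        // identifier: {letter}({letter} | {digit})*
        static final Pattern IDENTIFIER = Pattern.compile("[a-zA-Z]([a-zA-Z]|[0-9])*");
        // floatLit: {digit}+\.{digit}+
        static final Pattern FLOAT_LIT = Pattern.compile("[0-9]+\\.[0-9]+");

        public static void main(String[] args) {
            System.out.println(IDENTIFIER.matcher("numberVal").matches()); // true
            System.out.println(IDENTIFIER.matcher("2bad").matches());      // false: cannot start with a digit
            System.out.println(FLOAT_LIT.matcher("5.67").matches());       // true
            System.out.println(FLOAT_LIT.matcher("123").matches());        // false: integerLit, not floatLit
        }
    }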

  20. Lexical Analyzer Generators • Input: regular expressions • Output: a lexical analyzer • C/C++: Lex, Flex • Java: JLex • Regular grammars or regular expressions are converted to a deterministic finite state automaton (DFSA) and then to a lexical analyzer.

  21. Elements of a Finite State Automaton • Set of states: represented by graph nodes • Input alphabet + unique end-of-input symbol • State transition function: represented as labelled, directed edges (arcs) connecting graph nodes • A unique start state • One or more final states

  22. Deterministic FSA • Definition: A finite state automaton is deterministic if for each state and each input symbol, there is at most one outgoing arc from the state labelled with the input symbol.

  23. A Finite State Automaton for Identifiers • Figure 3.2 (p. 64)

  24. Use a DFSA to recognize (accept) or reject a string • Process the string, one character at a time, by making a series of moves: • Follow the exit arc that corresponds to the leftmost input symbol, thereby consuming it. • If no such arc exists, then either the input is accepted (if you are in a final state) or there is an error. • An input is accepted if, beginning from the start state, the automaton consumes all the input and halts in a final state.

  25. Example • (S, a2i$) ├ (I, 2i$) • ├ (I, i$) • ├ (I, $) • ├ (F, ) • Thus: (S, a2i$) ├* (F, ) • Here S is the start state, I the identifier state, and F the final state; a and i are letters, 2 is a digit, and $ is the end-of-input symbol. • Follow the exit arc that corresponds to the leftmost input symbol, thereby consuming it. • If no such arc exists, then either the input is accepted (if you are in a final state) or there is an error.
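
The identifier automaton of Figure 3.2 is small enough to simulate directly. The sketch below hard-codes its two states; the state numbering and error handling are mine, not the book's (the final state F is reached implicitly when the input runs out in state I):

    public class IdentifierDfsa {
        // state 0 = S (start), state 1 = I (inside an identifier)
        static boolean accepts(String input) {
            int state = 0;
            for (char c : input.toCharArray()) {
                boolean l = Character.isLetter(c), d = Character.isDigit(c);
                if (state == 0 && l)             state = 1; // first symbol must be a letter
                else if (state == 1 && (l || d)) state = 1; // then letters or digits
                else return false;                          // no arc for this symbol: error
            }
            return state == 1; // all input consumed, halted in the accepting configuration
        }

        public static void main(String[] args) {
            System.out.println(accepts("a2i")); // true, as in the (S, a2i$) trace above
            System.out.println(accepts("2ab")); // false: a digit cannot start an identifier
        }
    }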

  26. Practical Issues • Explicit terminator (end-of-input symbol) is used only at end of program, not each token. • The symbols l and d represent an arbitrary letter and digit, respectively. • An unlabelled arc represents any valid input symbol (other than those on labelled arcs leaving the same state).

  27. Practical Issues • When a token is recognized, move to a final state (one with no exit arc) • When a non-token is recognized, move back to the start state • Recognizing EOF means the end of the source code has been reached. • The automaton must be deterministic. • Recognize keywords as identifiers; then do a table look-up to distinguish them.

  28. How It's Used • The lexer is called from the parser. • Parser: • Get next token • Parse next token • The lexer enters the Start state each time the parser calls for a new token • The lexer enters a "Final" state when a legal token has been recognized. The character that causes the transition to the final state may be whitespace, or it may be the first character of the next token.

  29. Figure 3.3 (p. 66) – DFSA token recognizer

  30. Lexer Code • The parser calls the lexer when it needs a new token. • The lexer must remember where it left off. • Sometimes the lexer gets one character ahead in the input; compare ab=13; to ab = 13 ; • In the first case, the identifier ab isn't recognized until the next token, =, is read. • In the second case, blanks signify the ends of tokens.

  31. Lexer Code • Solutions: • a peek function • a pushback function • no symbol consumed by moving out of the start state; i.e., always have the next character available. • When the parser calls the lexer, the lexer already has the first character of the next token, probably in a variable ch
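
A minimal sketch of the lookahead idea using the standard library's PushbackReader (the surrounding code is invented for illustration; it is not the book's lexer):

    import java.io.IOException;
    import java.io.PushbackReader;
    import java.io.StringReader;

    public class Lookahead {
        public static void main(String[] args) throws IOException {
            // The lexer reads one character too far to find the end of ab,
            // then pushes the '=' back so the next token still starts with it.
            PushbackReader in = new PushbackReader(new StringReader("ab=13;"));
            StringBuilder ident = new StringBuilder();
            int c = in.read();
            while (c != -1 && Character.isLetterOrDigit((char) c)) {
                ident.append((char) c);
                c = in.read();
            }
            if (c != -1) in.unread(c);            // the pushback
            System.out.println(ident);            // ab
            System.out.println((char) in.read()); // '=' is still available
        }
    }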

  32. 3.2.3 - From Design to Code • private char ch = ' '; • public Token next( ) { • do { • switch (ch) { • ... • } • } while (true); • } • Figure 3.4: Outline of Next Token Routine

  33. Remarks • Exit do-while loop only when a token is found • Loop exited via a return statement which returns control to the parser • Variable ch must be initialized to a space character; thereafter it always holds the next character to be processed.

  34. Translation Rules • Pages 67-68 give rules for translating the DFSA into code. • A Java tokenizer method for Clite is shown on page 69 (Figure 3.5) • Auxiliary functions are described on pages 68 and 70.

  35. private boolean isLetter(char c) { • return c >= 'a' && c <= 'z' || • c >= 'A' && c <= 'Z'; • }

  36. private String concat(String set) { • StringBuffer r = new StringBuffer(""); • do { • r.append(ch); • ch = nextChar( ); • } while (set.indexOf(ch) >= 0); • return r.toString( ); • }

  37. // bold indicates auxiliary methods • public Token next( ) { • do { if (isLetter(ch)) { // ident or keyword • String spelling = concat(letters + digits); • return Token.keyword(spelling); • } else if (isDigit(ch)) { // numeric literal • String number = concat(digits); • if (ch != '.') // int literal • return Token.mkIntLiteral(number); • number += concat(digits); • return Token.mkFloatLiteral(number); • }

  38. else switch (ch) { • case ' ': case '\t': case '\r': case eolnCh: • ch = nextChar( ); break; • // omitted: '/', comments, '\' • case eofCh: return Token.eofTok; • case '+': ch = nextChar( ); • return Token.plusTok; • … • case '&': check('&'); return Token.andTok; • case '=': return chkOpt('=', Token.assignTok, • Token.eqeqTok);

  39. Source and Tokens • Source: • // a first program • // with 3 comments • int main ( ) { • char c; • int i; • c = 'h'; • i = c + 3; • } // main • Tokens (Token Type : Token): • Keyword : int • Keyword : main • Punctuation : ( • Punctuation : ) • Punctuation : { • Keyword : char • Identifier : c • Punctuation : ; • etc.

  40. Contents 3.1 Chomsky Hierarchy 3.2 Lexical Analysis 3.3 Syntactic Analysis

  41. Syntactic Analysis (The Parser) • Purpose: to recognize source code structure • Input: tokens • Output: parse tree or abstract syntax tree

  42. Parsing Algorithms – two types • Top-down: (recursive descent, LL) • LL = Left-to-right scan of input, Leftmost derivation • Based directly on the BNF grammar for the language • Builds the parse tree in preorder: • begin with the start symbol as the root of the tree • expand downward using specific BNF rules; intermediate tree nodes correspond to language non-terminals • leaves of the parse tree will be terminal symbols (tokens) • The representation of the parse tree may be converted to abstract syntax as parsing proceeds.

  43. Parsing Algorithms – two types • Bottom-up: (LR) • LR = Left-to-right scan of input, Rightmost derivation • start with the leaves (tokens) • group them together to form interior tree nodes to match rules in the grammar • end up at the root of the parse tree • Equivalent to a rightmost derivation in reverse

  44. Partial Example: to parse x*y + z • Top down: begin with exp, expanding exp → exp + term • Bottom up: begin by grouping x * y into a term • (diagram of the two partial parse trees omitted)

  45. Recursive Descent Parsing • A recursive descent parser “builds” the parse tree in a top-down manner • Defines a method/function for each non-terminal to recognize input derivable from that nonterminal • Each method should • Recognize the longest sequence of tokens (in the input stream) derivable from that non-terminal • Return an object which is the root of a subtree.

  46. Token Implementation • Tokens have two parts: • a type (e.g., Identifier, Literal) • a value (e.g., xyz, 3.45)
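
A token can be modeled as a small value class along these lines (a generic sketch, not the book's actual Clite Token class):

    public class Token {
        public enum Type { IDENTIFIER, LITERAL, KEYWORD, OPERATOR, PUNCTUATION }

        private final Type type;     // e.g. Type.LITERAL
        private final String value;  // e.g. "3.45"

        public Token(Type type, String value) { this.type = type; this.value = value; }
        public Type type()    { return type; }
        public String value() { return value; }

        public static void main(String[] args) {
            Token t = new Token(Type.IDENTIFIER, "xyz");
            System.out.println(t.type() + " " + t.value()); // IDENTIFIER xyz
        }
    }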

  47. Auxiliary Functions for the Parser • match(t) compares the current token to the expected token type t • If they match, get the next token and return the matched token's value • Else display a syntax error message. • error( ) displays the error message and exits.

  48. private String match (TokenType t) { • String value = token.value(); • if (token.type().equals(t)) • token = lexer.next(); • // token is a global variable • else • error(t); // function to report an error • return value; • }

  49. Grammar for Parsing Example • (recursion removed, for recursive descent parsing) • Assignment → Identifier = Expression • Expression → Term { AddOp Term } • AddOp → + | - • Term → Factor { MulOp Factor } • MulOp → * | / • Factor → [ UnaryOp ] Primary • UnaryOp → - | ! • Primary → Identifier | Literal | ( Expression )
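
Following the recursive-descent recipe of slide 45, each nonterminal above becomes one method. The sketch below is mine, not the book's parser: it works on a pre-lexed list of token spellings and returns a parenthesized string rendering of the subtree each method built, which makes the precedence visible (error checking is reduced to a minimum):

    import java.util.List;

    public class ExprParser {
        private final List<String> tokens; // e.g. ["x", "*", "y", "+", "z"]
        private int pos = 0;

        ExprParser(List<String> tokens) { this.tokens = tokens; }

        private String peek() { return pos < tokens.size() ? tokens.get(pos) : "<eof>"; }
        private String advance() { return tokens.get(pos++); }

        String expression() {                 // Expression -> Term { AddOp Term }
            String left = term();
            while (peek().equals("+") || peek().equals("-"))
                left = "(" + left + " " + advance() + " " + term() + ")";
            return left;
        }

        String term() {                       // Term -> Factor { MulOp Factor }
            String left = factor();
            while (peek().equals("*") || peek().equals("/"))
                left = "(" + left + " " + advance() + " " + factor() + ")";
            return left;
        }

        String factor() {                     // Factor -> [ UnaryOp ] Primary
            if (peek().equals("-") || peek().equals("!"))
                return "(" + advance() + " " + primary() + ")";
            return primary();
        }

        String primary() {                    // Primary -> Identifier | Literal | ( Expression )
            if (peek().equals("(")) {
                advance();                    // consume '('
                String e = expression();
                advance();                    // consume ')'
                return e;
            }
            return advance();                 // identifier or literal
        }

        public static void main(String[] args) {
            // * binds tighter than + because term() sits below expression()
            System.out.println(new ExprParser(List.of("x", "*", "y", "+", "z")).expression());
            // prints ((x * y) + z)
        }
    }

Because expression() calls term(), which calls factor(), each method recognizes the longest token sequence derivable from its nonterminal and returns (a rendering of) the root of the subtree it built, exactly as slide 45 prescribes.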
