1 / 113

Lecture 3 Introduction to Parsing and Top-Down Parsing

Lecture 3 Introduction to Parsing and Top-Down Parsing. Cheng-Chia Chen. Outlines. Parsing overview Limitation of regular expressions Context-free Grammar Derivation and Parse Tree Ambiguity of CFGs Concrete Parse Tree v.s. Abstract Syntax Tree. Recursive Descent Parsing

kendall
Download Presentation

Lecture 3 Introduction to Parsing and Top-Down Parsing

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Lecture 3 Introduction to Parsing andTop-Down Parsing Cheng-Chia Chen

  2. Outlines • Parsing overview • Limitation of regular expressions • Context-free Grammar • Derivation and Parse Tree • Ambiguity of CFGs • Concrete Parse Tree v.s. Abstract Syntax Tree. • Recursive Descent Parsing • LL(1) parsing

  3. Parsing Overview • What is syntax ? • The way in which words are put together to form phrases, clauses, or sentences. --- Webster’s Dictionary • The function of a parser : • Input: sequence of tokens from lexer • Output: parse tree of the program

  4. ?: INT INT == ID ID Example • Java expr x == y ? 1 : 2 • Parser input ID == ID ? INT : INT • Parser output

  5. Comparison with Lexical Analysis

  6. The Role of the Parser • Not all sequences of tokens are programs . . . • . . . Parser must distinguish between valid and invalid sequences of tokens • We need • A language for describing valid sequences of tokens • A method for distinguishing valid from invalid sequences of tokens

  7. Limitation of regular expression • Recall how we define lexical structures using regular expressions. • Ex: ---- (1) • digits = [0-9]+ • sum = (digits “+”)* digits • match sums of the form: 28 + 301 + 9. • Can we use the same way to define the syntax of a language?

  8. Inadequacy of regular expressions • Consider how to use regular expressions to express sums with parentheses ? • (109+235), • 61, • (1+ (250 + 34)) • A solution: --- (2) • digits = [0-9]+ • sum = expr + expr • expr = “(“ sum “)” | digits • Note the difference b/t (1) and (2). • in (1) digits and sum can be and actually is treated as abbreviations of their right hand side(RHS). • in (2), the LHS name are used recursively at their RHS, and cannot be treated as abbreviations.

  9. Inadequacy of regular expressions • To translate (1) • digits = [0-9]+ • sum = (digits “+”)* digits • into FA, we first translate each definition into normal regular expressions by replacing every reference of definitions by its RHS recursively: • digits = [0-9]+ • sum = ([0-9]+ “+”)* [0-9]+ • and then apply normal procedures to translate regular expressions into DFAs. • But for definitions with recursion like (2), this approach does not work.

  10. --- (2) • digits = [0-9]+ • sum = expr “+” expr • expr = “(“sum“)” | digits • Repalce sum in 3. by their RHS at 2: • expr = “(“expr“+”expr“)” | digits---- (4) • Repace expr at (4) by itself, we get: • expr = “(“ (“(“ expr “+” expr “)” | digits) “+” • (“(“ expr “+” expr “)” | digits) • | digits--- (5) • Conclusion: • It is hopeless to eliminate names in RHS by finite substitutions it there are direct or indirect recursions.

  11. It can be shown that • regular expressions with non-recursive abbreviations do not increase the expressive power of regular expressions, but , • Regular expressions allowing recursive abbreviations do increase the expressive power of regular expressions • --- This formalism is called context-free grammar, • and is just what we need for parsing.

  12. Redundancy induced by recursion • Alternation(|) is not needed: • expr = ab(c|d) e ==> aux = c | d expr = a b aux e • ==> aux=c aux = d expr = a b aux e • Repetition(*) is not needed • expr = (a b c ) * ==> • expr = (a b c) expr --- right recursion • // expr = expr (a b c ) --- left recursion • expr = e.

  13. Context-Free Grammars • Programming language constructs have recursive structure • An EXPR is if EXPR then EXPR else EXPR fi , or while ( EXPR ) EXPR , or … • Context-free grammars are a natural notation for this recursive structure

  14. CFGs (Cont.) • A CFG consists of • A set of terminals T • A set of non-terminals N • A start symbolS (a non-terminal) • A set P of productions, where each rule r is of the form: AssumingX  N X e, or X  Y1 Y2 ... Ynwhere Yi N  T

  15. Notational Conventions: • In these lecture notes • Non-terminals are written upper-case (S, A,B,C,…) • Terminals are written lower-case (a,b,c,…) • X,Y,Z, … range over T U N • a,b,g, … ranges over strings of V U T. • The start symbol (S) is the left-hand side of the first production • Each terminal symbol corresponds to a token type from the lexer. • Each non-terminal symbol corresponds to a symbol occurring at the LHS of a production rule.

  16. Examples of CFGs Expr  if Expr then Expr else Expr | while Expr do Expr | id

  17. Examples of CFGs Simple arithmetic expressions: E  E * E | E + E | ( E ) | id

  18. The Language of a CFG Read productions as replacement rules: X  Y1 … Yn Means X can be replaced by Y1 … Yn X e Means X can be erased (replaced with empty string)

  19. Key Idea • Begin with a string consisting of the start symbol “S” • Replace any non-terminal Xin the string by a right-hand side of some production • Repeat (2) until there are no non-terminals in the string X  Y1 … Yn

  20. The Language of a CFG (Cont.) • Write X1…X2…Xn X1… Xi-1 Y1 … Yn Xi+1…Xn if there is a production Xi Y1 … Yn • More formally : • a X b a g b if ∃ x  g ∈ P. • or define  to be the binary relation • { (a X b,a g b ) | a X b a g b • if ∃ x  g ∈ P. } • on (T U N)*.

  21. The Language of a CFG (Cont.) • Write X1…Xn* Y1 … Ym if X1…Xn…… Y1 … Ym in 0 or more steps • Or formally: • Define * to be the reflexive and transitive closure of the relation . • i.e., a * b iff • a = b or • ∃ a1,a2,…,an (n ≥ 1) such that a a1  …  an = b.

  22. The Language of a CFG Let Gbe a context-free grammar with start symbol S. Then the language L(G) of G is the set: { a | S * a where a is a terminal string. }

  23. Terminals • Terminals are called because there are no rules for replacing them • Once generated, terminals are permanent • Terminals ought to be token types of the language

  24. Examples Strings of balanced parentheses : Two representations of the same grammar G: OR

  25. Arithmetic Example Simple arithmetic expressions: Some elements of the language:

  26. Notes The idea of a CFG is a big step. But: • Membership in a language is just “yes” or “no” • We need also the parse tree of the input • Must handle errors gracefully • Need (tools for) an implementation of CFG’s.

  27. Derivations and Parse Trees A derivation is a sequence of strings over terminals and nonterminals S0, S1 , S2 , … Sn such that 1. Si S i+1 for 0 ≤ i < n. 2. S0 = S is the start symbol. • A derivation can be drawn as a tree • Start symbol is the tree’s root • For a production X  Y1… Yn • add children Y1… Yn to node X

  28. Derivation Example • Grammar • String

  29. Derivation Example (Cont.) E E + E E * E id id id

  30. Derivation in Detail (1) E

  31. Derivation in Detail (2) E E + E

  32. Derivation in Detail (3) E E + E E * E

  33. Derivation in Detail (4) E E + E E * E id

  34. Derivation in Detail (5) E E + E E * E id id

  35. Derivation in Detail (6) E E + E E * E id id id

  36. Properties of parse trees. • A parse tree has • start symbol at the root • terminals or empty nodes at the leaves • Non-terminals at the internal nodes • if internal node X has children Y1,…,Yn, then • X  Y1 Y2… Yn is a production rule. • An in-order traversal of the leaves is the original input. • The parse tree makes explicit the structure of the input string.

  37. Left-most and Right-most Derivations • The example is a right-most derivation • At each step, replace the right-most non-terminal • There is an equivalent notion of a left-most derivation

  38. Right-most Derivation in Detail (1) E

  39. Right-most Derivation in Detail (2) E E + E

  40. Right-most Derivation in Detail (3) E E + E id

  41. Right-most Derivation in Detail (4) E E + E E * E id

  42. Right-most Derivation in Detail (5) E E + E E * E id id

  43. Right-most Derivation in Detail (6) E E + E E * E id id id

  44. Derivations and Parse Trees • Note that right-most and left-most derivations have the same parse tree • The difference is the order in which branches are added

  45. Summary of Derivations • We are not just interested in whether s 2L(G) • We need a parse tree for s • A derivation defines a parse tree • But one parse tree may have many derivations • Left-most and right-most derivations are important in parser implementation

  46. Issues • A parser consumes a sequence of tokens s and produces a parse tree • Issues: • How to recognize that s 2L(G) ? • How to generate a parse tree of s once s  L(G) • Ambiguity: Is there more than one parse tree (interpretation) for some string s ? • Error handling: What should we do if no parse tree exist for an input string.

  47. Ambiguity • Grammar E ! E + E | E * E | ( E ) | int • String int * int + int

  48. Ambiguity (Cont.) This string has two parse trees E E E + E E E * E E int int E + E * int int int int

  49. Ambiguity (Cont.) • A grammar is ambiguous if it has more than one parse tree for some string • Equivalently, there is more than one right-most or left-most derivation for some string • Ambiguity is bad • Leave meaning of some programs ill-defined • Ambiguity is common in programming languages • Arithmetic expressions • IF-THEN-ELSE

  50. Dealing with Ambiguity • There are several ways to handle ambiguity • Most direct method is to rewrite the grammar unambiguously E ! T + E | T T ! int * T | int | ( E ) • Enforces precedence of * over +

More Related