1 / 75

Introduction to Parsing

Introduction to Parsing. Outline. Regular languages revisited Parser overview Context-free grammars (CFG ’ s) Derivations. Languages and Automata. Formal languages are very important in CS Especially in programming languages Regular languages The weakest formal languages widely used

elise
Download Presentation

Introduction to Parsing

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Introduction to Parsing

  2. Outline • Regular languages revisited • Parser overview • Context-free grammars (CFG’s) • Derivations

  3. Languages and Automata • Formal languages are very important in CS • Especially in programming languages • Regular languages • The weakest formal languages widely used • Many applications • We will also study context-free languages

  4. Limitations of Regular Languages • Intuition: A finite automaton that runs long enough must repeat states • Finite automaton can’t remember # of times it has visited a particular state • Finite automaton has finite memory • Only enough to store in which state it is • Cannot count, except up to a finite limit • E.g., language of balanced parentheses is not regular: { (i )i | i ¸ 0}

  5. The Functionality of the Parser • Input: sequence of tokens from lexer • Output: parse tree of the program

  6. ?: INT INT == ID ID Example • Java expr x == y ? 1 : 2 • Parser input ID == ID ? INT : INT • Parser output

  7. Comparison with Lexical Analysis

  8. The Role of the Parser • Not all sequences of tokens are programs . . . • . . . Parser must distinguish between valid and invalid sequences of tokens • We need • A language for describing valid sequences of tokens • A method for distinguishing valid from invalid sequences of tokens

  9. Context-Free Grammars • Programming language constructs have recursive structure • An EXPR is if EXPR then EXPR else EXPR fi , or while EXPR loop EXPR pool , or … • Context-free grammars are a natural notation for this recursive structure

  10. CFGs (Cont.) • A CFG consists of • A set of terminals T • A set of non-terminals N • A start symbolS (a non-terminal) • A set of productions AssumingX 2 N X !e , or X ! Y1 Y2 ... Yn where Yiµ N [ T

  11. Notational Conventions • In these lecture notes • Non-terminals are written upper-case • Terminals are written lower-case • The start symbol is the left-hand side of the first production

  12. Examples of CFGs Expr  if Expr then Expr else Expr | while Expr do Expr | id

  13. Examples of CFGs Simple arithmetic expressions:

  14. The Language of a CFG Read productions as replacement rules: X ! Y1 ... Yn Means X can be replaced by Y1 ... Yn X !e Means X can be erased (replaced with empty string)

  15. Key Idea • Begin with a string consisting of the start symbol “S” • Replace any non-terminal Xin the string by a right-hand side of some production • Repeat (2) until there are no non-terminals in the string

  16. The Language of a CFG (Cont.) More formally, write if there is a production

  17. The Language of a CFG (Cont.) Write if in 0 or more steps

  18. The Language of a CFG Let Gbe a context-free grammar with start symbol S. Then the language of G is:

  19. Terminals • Terminals are called because there are no rules for replacing them • Once generated, terminals are permanent • Terminals ought to be tokens of the language

  20. Examples L(G) is the language of CFG G Strings of balanced parentheses Two grammars: OR

  21. Arithmetic Example Simple arithmetic expressions: Some elements of the language:

  22. Notes The idea of a CFG is a big step. But: • Membership in a language is “yes” or “no”; also need parse tree of the input • Must handle errors gracefully • Need an implementation of CFG’s (e.g., Cup)

  23. More Notes • Form of the grammar is important • Many grammars generate the same language • Tools are sensitive to the grammar • Note: Tools for regular languages (e.g., flex) are also sensitive to the form of the regular expression, but this is rarely a problem in practice

  24. Derivations and Parse Trees A derivation is a sequence of productions A derivation can be drawn as a tree • Start symbol is the tree’s root • For a production add children • to node

  25. Derivation Example • Grammar • String

  26. Derivation Example (Cont.) E E + E E * E id id id

  27. Derivation in Detail (1) E

  28. Derivation in Detail (2) E E + E

  29. Derivation in Detail (3) E E + E E * E

  30. Derivation in Detail (4) E E + E E * E id

  31. Derivation in Detail (5) E E + E E * E id id

  32. Derivation in Detail (6) E E + E E * E id id id

  33. Notes on Derivations • A parse tree has • Terminals at the leaves • Non-terminals at the interior nodes • An in-order traversal of the leaves is the original input • The parse tree shows the association of operations, the input string does not

  34. Left-most and Right-most Derivations • The example is a left-most derivation • At each step, replace the left-most non-terminal • There is an equivalent notion of a right-most derivation

  35. Right-most Derivation in Detail (1) E

  36. Right-most Derivation in Detail (2) E E + E

  37. Right-most Derivation in Detail (3) E E + E id

  38. Right-most Derivation in Detail (4) E E + E E * E id

  39. Right-most Derivation in Detail (5) E E + E E * E id id

  40. Right-most Derivation in Detail (6) E E + E E * E id id id

  41. Derivations and Parse Trees • Note that right-most and left-most derivations have the same parse tree • The difference is the order in which branches are added

  42. Summary of Derivations • We are not just interested in whether s 2L(G) • We need a parse tree for s • A derivation defines a parse tree • But one parse tree may have many derivations • Left-most and right-most derivations are important in parser implementation

  43. Issues • A parser consumes a sequence of tokens s and produces a parse tree • Issues: • How do we recognize that s 2L(G) ? • A parse tree of s describes hows  L(G) • Ambiguity: more than one parse tree (interpretation) for some string s • Error: no parse tree for some string s • How do we construct the parse tree?

  44. Ambiguity • Grammar E ! E + E | E * E | ( E ) | int • String int * int + int

  45. Ambiguity (Cont.) This string has two parse trees E E E + E E E * E E int int E + E * int int int int

  46. Ambiguity (Cont.) • A grammar is ambiguous if it has more than one parse tree for some string • Equivalently, there is more than one right-most or left-most derivation for some string • Ambiguity is bad • Leaves meaning of some programs ill-defined • Ambiguity is common in programming languages • Arithmetic expressions • IF-THEN-ELSE

  47. Dealing with Ambiguity • There are several ways to handle ambiguity • Most direct method is to rewrite the grammar unambiguously E ! T + E | T T ! int * T | int | ( E ) • Enforces precedence of * over +

  48. Ambiguity: The Dangling Else • Consider the grammar E  if E then E | if E then E else E | OTHER • This grammar is also ambiguous

  49. if if E1 if E1 E4 if E2 E3 E2 E3 E4 The Dangling Else: Example • The expression if E1 then if E2 then E3 else E4 has two parse trees • Typically we want the second form

  50. The Dangling Else: A Fix • else matches the closest unmatched then • We can describe this in the grammar E  MIF /* all then are matched */ | UIF /* some then are unmatched */ MIF  if E then MIF else MIF | OTHER UIF  if E then E | if E then MIF else UIF • Describes the same set of strings

More Related