## Parsing


**Parsing**

Giuseppe Attardi, Università di Pisa

**Parsing**

Calculate the grammatical structure of a program, like diagramming sentences, where:
• Tokens = “words”
• Programs = “sentences”

For further information: Aho, Sethi, Ullman, “Compilers: Principles, Techniques, and Tools” (a.k.a. the “Dragon Book”)

**Outline of coverage**
• Context-free grammars
• Parsing
• Tabular Parsing Methods
• One pass
• Top-down
• Bottom-up
• Yacc

**Parser: extracts grammatical structure of program**

[Figure: parse tree of a function definition. Node labels: function-def, name (main), arguments, stmt-list, stmt, expression, operator (<<), variable (cout), string (“hello, world\n”).]

**Context-free languages**

Grammatical structure is defined by a context-free grammar:
• statement → labeled-statement | expression-statement | compound-statement
• labeled-statement → ident : statement | case constant-expression : statement
• compound-statement → { declaration-list statement-list }

“Context-free” = only one non-terminal in the left part of each production.

[Slide annotation: labels marking which symbols are terminals and which are non-terminals.]

**Parse trees**

Parse tree = tree labeled with grammar symbols, such that:
• If a node is labeled A, and its children are labeled x1...xn, then there is a production A → x1...xn
• “Parse tree from A” = root labeled with A
• “Complete parse tree” = all leaves labeled with tokens

**Parse trees and sentences**

[Figure: a parse tree rooted at L, built from the symbols L, ;, E and a, with its leaves marked as the “Frontier”.]

• Frontier of tree = labels on leaves (in left-to-right order)
• Frontier of tree from S is a sentential form
• Frontier of a complete tree from S is a sentence

**Example**

G:
• L → L ; E | E
• E → a | b

Syntax trees from start symbol (L): [Figure: three parse trees rooted at L.]

Sentential forms: a, a;E, a;b;b

**Derivations**

Alternate definition of sentence:
• Given α, β in V*, say α ⇒ β is a derivation step if α = α′Aα″ and β = α′γα″, where A → γ is a production
• β is a sentential form iff there exists a derivation (sequence of derivation steps) S ⇒ … ⇒ β (alternatively, we say that S ⇒* β)

The two definitions are equivalent, but note that there are many derivations corresponding to each parse tree.

**Another example**

H:
• L → E ; L | E
• E → a | b

[Figure: parse trees rooted at L under grammar H.]
**Ambiguity**

[Figure: two different parse trees built from E, +, * and three id leaves.]

• For some purposes, it is important to know whether a sentence can have more than one parse tree
• A grammar is ambiguous if there is a sentence with more than one parse tree
• Example: E → E + E | E * E | id

**Notes**
• If e then if b then d else f
• { int x; y = 0; }
• A.b.c = d;
• Id → s | s.id

E ⇒ E + T ⇒ E + T + T ⇒ T + T + T ⇒ id + T + T ⇒ id + T * id + T ⇒ id + id * id + T ⇒ id + id * id + id

**Ambiguity**
• Ambiguity is a function of the grammar rather than the language
• Certain ambiguous grammars may have equivalent unambiguous ones

**Grammar Transformations**
• Grammars can be transformed without affecting the language generated
• Three transformations are discussed next:
• Eliminating Ambiguity
• Eliminating Left Recursion (i.e. productions of the form A → Aα)
• Left Factoring

**Eliminating Ambiguity**
• Sometimes an ambiguous grammar can be rewritten to eliminate ambiguity
• For example, expressions involving additions and products can be written as follows:
• E → E + T | T
• T → T * id | id
• The language generated by this grammar is the same as that generated by the grammar in slide “Ambiguity”. Both generate id(+id|*id)*
• However, this grammar is not ambiguous

**Eliminating Ambiguity (Cont.)**

[Figure: parse tree under this grammar, with root E → E + T and a product T * id nested below the addition.]

• One advantage of this grammar is that it represents the precedence between operators. In the parse tree, products appear nested within additions

**Eliminating Ambiguity (Cont.)**
• An example of ambiguity in a programming language is the dangling else
• Consider
• S → if b then S else S | if b then S | a

**Eliminating Ambiguity (Cont.)**
• When there are two nested ifs and only one else, the sentence has two parse trees

[Figure: two parse trees for if b then if b then a else a, one attaching the else to the inner if and one attaching it to the outer if.]

**Eliminating Ambiguity (Cont.)**
• In most languages (including C++ and Java), each else is assumed to belong to the nearest if that is not already matched by an else.
This association is expressed in the following (unambiguous) grammar:
• S → Matched | Unmatched
• Matched → if b then Matched else Matched | a
• Unmatched → if b then S | if b then Matched else Unmatched

**Eliminating Ambiguity (Cont.)**
• Ambiguity is a property of the grammar
• It is undecidable whether a context-free grammar is ambiguous
• The proof is by reduction from Post’s correspondence problem
• Although there is no general algorithm, it is possible to isolate certain constructs in productions which lead to ambiguous grammars

**Eliminating Ambiguity (Cont.)**
• For example, a grammar containing the production A → AA | α would be ambiguous, because the substring ααα has two parses:

[Figure: two parse trees for ααα, one grouping the first two A's together and one grouping the last two.]

• This ambiguity disappears if we use the productions
• A → AB | B and B → α
or the productions
• A → BA | B and B → α

**Eliminating Ambiguity (Cont.)**
• Examples of ambiguous productions:
• A → AαA
• A → αA | Aβ
• A → αA | αAβA
• A CF language is inherently ambiguous if it has no unambiguous CFG
• An example of such a language is L = {a^i b^j c^m | i = j or j = m}, which can be generated by the grammar:
• S → AB | DC
• A → aA | ε
• B → bBc | ε
• C → cC | ε
• D → aDb | ε

**Elimination of Left Recursion**
• A grammar is left recursive if it has a nonterminal A and a derivation A ⇒+ Aα for some string α.
• Top-down parsing methods cannot handle left-recursive grammars, so a transformation to eliminate left recursion is needed
• Immediate left recursion (productions of the form A → Aα) can be easily eliminated:
1. Group the A-productions as
• A → Aα1 | Aα2 | … | Aαm | β1 | β2 | … | βn
• where no βi begins with A
2. Replace the A-productions by
• A → β1A′ | β2A′ | … | βnA′
• A′ → α1A′ | α2A′ | … | αmA′ | ε

**Elimination of Left Recursion (Cont.)**
• The previous transformation, however, does not eliminate left recursion involving two or more steps
• For example, consider the grammar
• S → Aa | b
• A → Ac | Sd | ε
• S is left-recursive because S ⇒ Aa ⇒ Sda, but it is not immediately left recursive

**Elimination of Left Recursion (Cont.)**

Algorithm.
Eliminate left recursion

Arrange nonterminals in some order A1, A2, …, An

for i = 1 to n {
• for j = 1 to i - 1 {
• replace each production of the form Ai → Ajγ
• by the productions Ai → δ1γ | δ2γ | … | δkγ
• where Aj → δ1 | δ2 | … | δk are all the current Aj-productions
• }
• eliminate the immediate left recursion among the Ai-productions
}

**Elimination of Left Recursion (Cont.)**
• To show that the previous algorithm actually works, notice that iteration i only changes productions with Ai on the left-hand side. The invariant is that, after iteration i, m > i in all productions of the form Ai → Amα
• Induction proof:
• Clearly true for i = 1
• If it is true for all i < k, then when the outer loop is executed for i = k, the inner loop will remove all productions Ai → Amα with m < i
• Finally, with the elimination of self recursion, m in the Ai → Amα productions is forced to be > i
• At the end of the algorithm, all derivations of the form Ai ⇒+ Amα will have m > i and therefore left recursion is not possible

**Left Factoring**
• Left factoring helps transform a grammar for predictive parsing
• For example, if we have the two productions
• S → if b then S else S
• | if b then S
on seeing the input token if, we cannot immediately tell which production to choose to expand S
• In general, if we have A → αβ1 | αβ2 and the input begins with a non-empty string derived from α, we do not know (without looking further) which production to use to expand A

**Left Factoring (Cont.)**
• However, we may defer the decision by expanding A to αA′
• Then after seeing the input derived from α, we may expand A′ to β1 or to β2
• Left-factored, the original productions become
• A → αA′
• A′ → β1 | β2

**Non-Context-Free Language Constructs**
• Examples of non-context-free languages are:
• L1 = {wcw | w is of the form (a|b)*}
• L2 = {a^n b^m c^n d^m | n ≥ 1 and m ≥ 1}
• L3 = {a^n b^n c^n | n ≥ 0}
• Languages similar to these that are context free:
• L′1 = {wcw^R | w is of the form (a|b)*} (w^R stands for w reversed)
• This language is generated by the grammar S → aSa | bSb | c
• L′2 = {a^n b^m c^m d^n | n ≥ 1 and m ≥ 1}
• This language
is generated by the grammar S → aSd | aAd, A → bAc | bc

**Non-Context-Free Language Constructs (Cont.)**
• L″2 = {a^n b^n c^m d^m | n ≥ 1 and m ≥ 1}
• is generated by the grammar S → AB, A → aAb | ab, B → cBd | cd
• L′3 = {a^n b^n | n ≥ 1}
• is generated by the grammar S → aSb | ab
• This language is not definable by any regular expression

**Non-Context-Free Language Constructs (Cont.)**
• Suppose we could construct a DFSM D accepting L′3.
• D must have a finite number of states, say k.
• Consider the sequence of states s0, s1, s2, …, sk entered by D having read ε, a, aa, …, a^k.
• Since D only has k states, two of the k + 1 states in the sequence have to be equal. Say si = sj with i ≠ j and i ≥ 1 (at most one of the two indices can be 0).
• From si, a sequence of i b's leads to an accepting (final) state. Therefore, the same sequence of i b's will also lead to an accepting state from sj. Therefore D would accept a^j b^i, which means that the language accepted by D is not identical to L′3. A contradiction.

**Parsing**

The parsing problem is: given a string of tokens w, find a parse tree whose frontier is w. (Equivalently, find a derivation of w from the start symbol.)

A parser for a grammar G reads a list of tokens and finds a parse tree if they form a sentence (or reports an error otherwise).

Two classes of algorithms for parsing:
• Top-down
• Bottom-up

**Parser generators**
• A parser generator is a program that reads a grammar and produces a parser
• The best known parser generator is yacc. It produces bottom-up parsers
• Most parser generators, including yacc, do not work for every CFG; they accept a restricted class of CFGs that can be parsed efficiently using the method employed by that parser generator

**Top-down parsing**
• Starting from a parse tree containing just S, build the tree down toward the input. Expand the left-most non-terminal.
• Algorithm: (next slide)

**Top-down parsing (cont.)**

Let input = a1a2...an
current sentential form (csf) = S
loop {
  suppose csf = a1…akAα
  based on ak+1…, choose production A → β
  csf becomes a1…akβα
}

**Top-down parsing example**

Grammar H:
• L → E ; L | E
• E → a | b

Input: a;b

| Parse tree | Sentential form | Input |
| --- | --- | --- |
| root L | L | a;b |
| L expanded to E ; L | E;L | a;b |
| E expanded to a | a;L | a;b |

**Top-down parsing example (cont.)**

| Parse tree | Sentential form | Input |
| --- | --- | --- |
| inner L expanded to E | a;E | a;b |
| E expanded to b | a;b | a;b |

**LL(1) parsing**
• Efficient form of top-down parsing
• Use only the first symbol of the remaining input (ak+1) to choose the next production. That is, employ a function M: N × Σ → P in the “choose production” step of the algorithm.
• When this is possible, the grammar is called LL(1)

**LL(1) examples**
• Example 1: H: L → E ; L | E, E → a | b

Given input a;b, the next symbol is a. Which production to use? Can’t tell. H is not LL(1)

**LL(1) examples**
• Example 2: Exp → Term Exp′, Exp′ → $ | + Exp, Term → id (Use $ for “end-of-input” symbol.)
• Grammar is LL(1): Exp and Term have only one production; Exp′ has two productions but only one is applicable at any time.

**Nonrecursive predictive parsing**
• Maintain a stack explicitly, rather than implicitly via recursive calls
• Key problem during predictive parsing: determining the production to be applied for a non-terminal

**Nonrecursive predictive parsing**
• Algorithm. Nonrecursive predictive parsing
• Set ip to point to the first symbol of w$.
• repeat
• Let X be the top of the stack symbol and a the symbol pointed to by ip
• if X is a terminal or $ then
• if X == a then
• pop X from the stack and advance ip
• else error()
• else // X is a nonterminal
• if M[X,a] == X → Y1 Y2 … Yk then
• pop X from the stack
• push Yk, Yk-1, …, Y1 onto the stack with Y1 on top
• (push nothing if Y1 Y2 … Yk is ε)
• output the production X → Y1 Y2 … Yk
• else error()
• until X == $

**LL(1) grammars**
• No left recursion: A → Aα. If this production is chosen, the parse makes no progress.
• No common prefixes: A → αβ | αγ. Can fix by “left factoring”: A → αA′, A′ → β | γ

**LL(1) grammars (cont.)**
• No ambiguity. The precise definition requires that the production to choose be unique (the “choose” function M is very hard to calculate otherwise)

**Top-down Parsing**

[Figure: parse tree with root L (start symbol and root of parse tree) and children E0 … En, drawn first over the input tokens <t0, t1, …, ti, ...> and then over the remaining tokens <ti, ...>.]

From left to right, “grow” the parse tree downwards ...

**Checking LL(1)-ness**
• For any sequence of grammar symbols α, define the set FIRST(α) ⊆ Σ to be FIRST(α) = { a | α ⇒* aβ for some β }

**LL(1) definition**
• Define: Grammar G = (N, Σ, P, S) is LL(1) iff whenever there are two left-most derivations (in which the leftmost non-terminal is always expanded first)
S ⇒* wAα ⇒ wβα ⇒* wtx
S ⇒* wAα ⇒ wγα ⇒* wty
• it follows that β = γ
• In other words, given
• 1. a string wAα in V* and
• 2. t, the first terminal symbol to be derived from Aα
• there is at most one production that can be applied to A to yield a derivation of any terminal string beginning with wt
• FIRST sets can often be calculated by inspection

**FIRST Sets**
• Exp → Term Exp′
• Exp′ → $ | + Exp
• Term → id
• (Use $ for “end-of-input” symbol)

FIRST($) = {$}
FIRST(+ Exp) = {+}
FIRST($) ∩ FIRST(+ Exp) = {} ⇒ grammar is LL(1)

**FIRST Sets**
• L → E ; L | E
• E → a | b

FIRST(E ; L) = {a, b} = FIRST(E)
FIRST(E ; L) ∩ FIRST(E) ≠ {} ⇒ grammar not LL(1).

**Computing FIRST Sets**
• Algorithm. Compute FIRST(X) for all grammar symbols X
• forall X ∈ V do FIRST(X) = {}
• forall X ∈ Σ (X is a terminal) do FIRST(X) = {X}
• forall productions X → ε do FIRST(X) = FIRST(X) ∪ {ε}
• repeat
• c: forall productions X → Y1Y2 … Yk do
• forall i ∈ [1,k] do
• FIRST(X) = FIRST(X) ∪ (FIRST(Yi) - {ε})
• if ε ∉ FIRST(Yi) then continue c
• FIRST(X) = FIRST(X) ∪ {ε}
• until no more terminals or ε are added to any FIRST set
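
The fixed-point computation above can be sketched in Python as follows. This is a minimal illustration rather than the slides' own code: the dictionary encoding of a grammar, the `EPS` marker, and the names `compute_first` and `first_of_string` are assumptions made for this example. ε-productions are written as empty tuples and handled inside the main loop instead of in a separate initialization pass. The two grammars are the ones used on the “FIRST Sets” slides.

```python
EPS = "ε"  # marker for the empty string (a naming choice for this sketch)


def compute_first(grammar, terminals):
    """Compute FIRST(X) for every grammar symbol X.

    grammar: dict mapping each nonterminal to a list of right-hand sides,
             each a tuple of symbols; the empty tuple () is an ε-production.
    terminals: iterable of terminal symbols.
    """
    first = {t: {t} for t in terminals}   # FIRST(a) = {a} for terminals
    for nonterminal in grammar:
        first[nonterminal] = set()        # start empty, grow to a fixed point

    changed = True
    while changed:                        # "repeat ... until nothing is added"
        changed = False
        for nonterminal, alternatives in grammar.items():
            for rhs in alternatives:
                size_before = len(first[nonterminal])
                for symbol in rhs:
                    first[nonterminal] |= first[symbol] - {EPS}
                    if EPS not in first[symbol]:
                        break             # this Yi cannot vanish: stop here
                else:
                    # Every Yi can derive ε (or rhs is empty), so X can too.
                    first[nonterminal].add(EPS)
                if len(first[nonterminal]) != size_before:
                    changed = True
    return first


def first_of_string(symbols, first):
    """FIRST of a sequence of grammar symbols, given per-symbol FIRST sets."""
    result = set()
    for symbol in symbols:
        result |= first[symbol] - {EPS}
        if EPS not in first[symbol]:
            return result
    result.add(EPS)
    return result


# Grammar H: L -> E ; L | E,  E -> a | b
H = {"L": [("E", ";", "L"), ("E",)], "E": [("a",), ("b",)]}
first_h = compute_first(H, {"a", "b", ";"})

# Grammar Exp: Exp -> Term Exp',  Exp' -> $ | + Exp,  Term -> id
EXP = {
    "Exp": [("Term", "Exp'")],
    "Exp'": [("$",), ("+", "Exp")],
    "Term": [("id",)],
}
first_exp = compute_first(EXP, {"$", "+", "id"})

# The two L-alternatives of H share FIRST symbols, so H is not LL(1).
overlap_h = first_of_string(("E", ";", "L"), first_h) & first_of_string(("E",), first_h)
# The two Exp'-alternatives have disjoint FIRST sets.
overlap_exp = first_of_string(("$",), first_exp) & first_of_string(("+", "Exp"), first_exp)
print(overlap_h, overlap_exp)
```

Comparing the FIRST sets of the alternatives of one nonterminal, as the last lines do, mirrors the intersection test on the “FIRST Sets” slides. It is sufficient for these two grammars because neither contains an ε-production.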