Parsing

Parsing Giuseppe Attardi Università di Pisa

Parsing
Calculate the grammatical structure of a program, like diagramming sentences, where:
• Tokens = “words”
• Programs = “sentences”
For further information: Aho, Sethi, Ullman, “Compilers: Principles, Techniques, and Tools” (a.k.a. the “Dragon Book”)

Outline of coverage • Context-free grammars • Parsing • Tabular Parsing Methods • One pass • Top-down • Bottom-up • Yacc

Parser: extracts grammatical structure of program
[Figure: parse tree of a function definition named main whose statement list holds the expression cout << “hello, world\n”; node labels: function-def, name, arguments, stmt-list, stmt, expression, operator, variable, string]

Context-free languages
Grammatical structure defined by context-free grammar:
statement → labeled-statement | expression-statement | compound-statement
labeled-statement → ident : statement | case constant-expression : statement
compound-statement → { declaration-list statement-list }
“Context-free” = only one non-terminal in left-part
(In the productions above, ident, case, :, { and } are terminals; the hyphenated names are non-terminals.)

Parse trees
Parse tree = tree labeled with grammar symbols, such that:
• If node is labeled A, and its children are labeled x1...xn, then there is a production A → x1...xn
• “Parse tree from A” = root labeled with A
• “Complete parse tree” = all leaves labeled with tokens

Parse trees and sentences
• Frontier of tree = labels on leaves (in left-to-right order)
• Frontier of tree from S is a sentential form
• Frontier of a complete tree from S is a sentence
[Figure: a parse tree from L with its frontier marked]

Example
G: L → L ; E | E
   E → a | b
Syntax trees from start symbol (L): [Figure: three parse trees from L]
Sentential forms: a, a;E, a;b;b

Derivations
Alternate definition of sentence:
• Given α, β in V*, say α ⇒ β is a derivation step if α = α′Aα″ and β = α′γα″, where A → γ is a production
• α is a sentential form iff there exists a derivation (sequence of derivation steps) S ⇒ … ⇒ α (alternatively, we say that S ⇒* α)
Two definitions are equivalent, but note that there are many derivations corresponding to each parse tree

Another example
H: L → E ; L | E
   E → a | b
[Figure: parse trees from L for grammar H]

Ambiguity
• For some purposes, it is important to know whether a sentence can have more than one parse tree
• A grammar is ambiguous if there is a sentence with more than one parse tree
• Example: E → E+E | E*E | id
[Figure: two different parse trees for the same sentence over id, + and *, one with + at the root and one with * at the root]
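The ambiguity of the example grammar can be checked mechanically for short sentences. The following is a minimal sketch, not part of the original slides: the function name count_parse_trees and the token-list representation are assumptions made for illustration. It counts the parse trees of a token string under E → E+E | E*E | id by trying every operator as the root.

```python
from functools import lru_cache


def count_parse_trees(tokens):
    """Count the parse trees of `tokens` under E -> E+E | E*E | id."""
    tokens = tuple(tokens)

    @lru_cache(maxsize=None)
    def count(i, j):
        # Number of distinct parse trees deriving tokens[i:j] from E.
        if j - i == 1:
            return 1 if tokens[i] == "id" else 0
        total = 0
        # Try each operator strictly inside the span as the root of the tree.
        for k in range(i + 1, j - 1):
            if tokens[k] in ("+", "*"):
                total += count(i, k) * count(k + 1, j)
        return total

    return count(0, len(tokens))


# More than one tree for a single sentence means the grammar is ambiguous.
print(count_parse_trees(["id", "+", "id", "*", "id"]))
```

A result greater than 1 for any sentence is enough to show that the grammar is ambiguous; a result of 1 for the sentences tried proves nothing about the grammar as a whole.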

Notes
• If e then if b then d else f
• { int x; y = 0; }
• A.b.c = d;
• Id → s | s.id
• E ⇒ E + T ⇒ E + T + T ⇒ T + T + T ⇒ id + T + T ⇒ id + T * id + T ⇒ id + id * id + T ⇒ id + id * id + id

Ambiguity • Ambiguity is a function of the grammar rather than the language • Certain ambiguous grammars may have equivalent unambiguous ones

Grammar Transformations
• Grammars can be transformed without affecting the language generated
• Three transformations are discussed next:
  • Eliminating Ambiguity
  • Eliminating Left Recursion (i.e. productions of the form A → Aα)
  • Left Factoring

Eliminating Ambiguity
• Sometimes an ambiguous grammar can be rewritten to eliminate ambiguity
• For example, expressions involving additions and products can be written as follows:
  E → E + T | T
  T → T * id | id
• The language generated by this grammar is the same as that generated by the grammar in slide “Ambiguity”. Both generate id(+id|*id)*
• However, this grammar is not ambiguous

Eliminating Ambiguity (Cont.)
• One advantage of this grammar is that it represents the precedence between operators. In the parse tree, products appear nested within additions
[Figure: parse tree under the unambiguous grammar with E → E + T at the root and the product T * id nested below it]

Eliminating Ambiguity (Cont.)
• An example of ambiguity in a programming language is the dangling else
• Consider
  S → if b then S else S | if b then S | a

Eliminating Ambiguity (Cont.)
• When there are two nested ifs and only one else, the sentence has two parse trees
[Figure: two parse trees for if b then if b then a else a, one attaching the else to the inner if and one attaching it to the outer if]

Eliminating Ambiguity (Cont.)
• In most languages (including C++ and Java), each else is assumed to belong to the nearest if that is not already matched by an else. This association is expressed in the following (unambiguous) grammar:
  S → Matched
    | Unmatched
  Matched → if b then Matched else Matched
    | a
  Unmatched → if b then S
    | if b then Matched else Unmatched

Eliminating Ambiguity (Cont.)
• Ambiguity is a property of the grammar
• It is undecidable whether a context-free grammar is ambiguous
• The proof is by reduction from Post’s correspondence problem
• Although there is no general algorithm, it is possible to isolate certain constructs in productions which lead to ambiguous grammars

Eliminating Ambiguity (Cont.)
• For example, a grammar containing the productions A → AA | a would be ambiguous, because the substring aaa has two parses:
[Figure: two parse trees for aaa, one grouping the first two a's under a single A and one grouping the last two]
• This ambiguity disappears if we use the productions
  A → AB | B and B → a
  or the productions
  A → BA | B and B → a

Eliminating Ambiguity (Cont.)
• Examples of ambiguous productions:
  A → AaA
  A → aA | Ab
  A → aA | aAbA
• A CF language is inherently ambiguous if it has no unambiguous CFG
• An example of such a language is L = {a^i b^j c^m | i = j or j = m}, which can be generated by the grammar:
  S → AB | DC
  A → aA | ε    C → cC | ε
  B → bBc | ε   D → aDb | ε

Elimination of Left Recursion
• A grammar is left recursive if it has a nonterminal A and a derivation A ⇒+ Aα for some string α.
• Top-down parsing methods cannot handle left-recursive grammars, so a transformation to eliminate left recursion is needed
• Immediate left recursion (productions of the form A → Aα) can be easily eliminated:
  1. Group the A-productions as
     A → Aα1 | Aα2 | … | Aαm | β1 | β2 | … | βn
     where no βi begins with A
  2. Replace the A-productions by
     A → β1A’ | β2A’ | … | βnA’
     A’ → α1A’ | α2A’ | … | αmA’ | ε
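The two steps above can be sketched in a few lines. This is an illustrative sketch only: the function name, the representation of a right-hand side as a list of symbols (with the empty list standing for ε), and the choice of A' as the new nonterminal's name are assumptions, and it presumes no αi is ε.

```python
def eliminate_immediate_left_recursion(nt, alternatives):
    """Rewrite the productions of `nt` so that none starts with `nt` itself.

    `alternatives` is a list of right-hand sides; each is a list of symbols,
    and the empty list stands for epsilon.
    """
    # Step 1: split into A -> A alpha_i and A -> beta_i.
    alphas = [alt[1:] for alt in alternatives if alt and alt[0] == nt]
    betas = [alt for alt in alternatives if not alt or alt[0] != nt]
    if not alphas:
        return {nt: alternatives}  # nothing to do

    # Step 2: A -> beta_i A'   and   A' -> alpha_i A' | epsilon.
    new_nt = nt + "'"
    return {
        nt: [beta + [new_nt] for beta in betas],
        new_nt: [alpha + [new_nt] for alpha in alphas] + [[]],
    }


# E -> E + T | T   becomes   E -> T E'   and   E' -> + T E' | epsilon
print(eliminate_immediate_left_recursion("E", [["E", "+", "T"], ["T"]]))
```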

Elimination of Left Recursion (Cont.)
• The previous transformation, however, does not eliminate left recursion involving two or more steps
• For example, consider the grammar
  S → Aa | b
  A → Ac | Sd | ε
• S is left-recursive because S ⇒ Aa ⇒ Sda, but it is not immediately left recursive

Elimination of Left Recursion (Cont.)
Algorithm. Eliminate left recursion
Arrange nonterminals in some order A1, A2, …, An
for i = 1 to n {
    for j = 1 to i - 1 {
        replace each production of the form Ai → Aj γ
        by the productions Ai → δ1 γ | δ2 γ | … | δk γ
        where Aj → δ1 | δ2 | … | δk are all the current Aj-productions
    }
    eliminate the immediate left recursion among the Ai-productions
}
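A runnable sketch of the algorithm above follows, applied to the grammar S → Aa | b, A → Ac | Sd | ε. The dictionary representation of the grammar, the helper names, and the A' naming of new nonterminals are assumptions chosen for illustration; the empty list stands for ε.

```python
def remove_immediate(nt, alts):
    """Eliminate immediate left recursion among the productions of `nt`."""
    alphas = [a[1:] for a in alts if a and a[0] == nt]
    betas = [a for a in alts if not a or a[0] != nt]
    if not alphas:
        return {nt: alts}
    new = nt + "'"
    return {nt: [b + [new] for b in betas],
            new: [a + [new] for a in alphas] + [[]]}


def eliminate_left_recursion(grammar, order):
    """`grammar` maps each nonterminal to a list of right-hand sides."""
    g = {nt: [list(a) for a in alts] for nt, alts in grammar.items()}
    for i, ai in enumerate(order):
        for aj in order[:i]:
            expanded = []
            for alt in g[ai]:
                if alt and alt[0] == aj:
                    # Ai -> Aj gamma  becomes  Ai -> delta gamma for each Aj -> delta
                    expanded.extend(delta + alt[1:] for delta in g[aj])
                else:
                    expanded.append(alt)
            g[ai] = expanded
        g.update(remove_immediate(ai, g[ai]))
    return g


grammar = {"S": [["A", "a"], ["b"]],
           "A": [["A", "c"], ["S", "d"], []]}
for nt, alts in eliminate_left_recursion(grammar, ["S", "A"]).items():
    print(nt, "->", " | ".join(" ".join(a) or "ε" for a in alts))
```

With the order S, A the S-productions are left unchanged, and A → Sd is first expanded to A → Aad | bd before the immediate left recursion on A is removed.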

Elimination of Left Recursion (Cont.)
• To show that the previous algorithm actually works, notice that iteration i only changes productions with Ai on the left-hand side, and that after iteration i, m > i in all productions of the form Ai → Am α
• Induction proof:
  • Clearly true for i = 1
  • If it is true for all i < k, then when the outer loop is executed for i = k, the inner loop will remove all productions Ak → Am α with m < k
  • Finally, with the elimination of self recursion, m in the Ak → Am α productions is forced to be > k
• At the end of the algorithm, all derivations of the form Ai ⇒+ Am α will have m > i and therefore left recursion would not be possible

Left Factoring
• Left factoring helps transform a grammar for predictive parsing
• For example, if we have the two productions
  S → if b then S else S
    | if b then S
  on seeing the input token if, we cannot immediately tell which production to choose to expand S
• In general, if we have A → αβ1 | αβ2 and the input begins with a string derived from α, we do not know (without looking further) which production to use to expand A

Left Factoring (Cont.)
• However, we may defer the decision by expanding A to αA’
• Then after seeing the input derived from α, we may expand A’ to β1 or to β2
• Left-factored, the original productions become
  A → αA’
  A’ → β1 | β2
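The transformation can be sketched as a single pass over the alternatives of one nonterminal. This is a simplified illustration rather than a complete algorithm: it groups alternatives by their first symbol, factors out the longest prefix common to each group, and would need to be repeated on the new nonterminals to factor fully. The function names and the list-of-symbols representation (empty list for ε) are assumptions.

```python
def common_prefix(seqs):
    """Longest prefix shared by all symbol lists in `seqs`."""
    prefix = []
    for symbols in zip(*seqs):
        if len(set(symbols)) != 1:
            break
        prefix.append(symbols[0])
    return prefix


def left_factor(nt, alternatives):
    """One pass of left factoring over the productions of `nt`."""
    groups = {}
    for alt in alternatives:
        groups.setdefault(alt[0] if alt else None, []).append(alt)

    result = {nt: []}
    primes = 0
    for first_symbol, group in groups.items():
        if first_symbol is None or len(group) == 1:
            result[nt].extend(group)  # nothing to factor
            continue
        alpha = common_prefix(group)
        primes += 1
        new_nt = nt + "'" * primes
        result[nt].append(alpha + [new_nt])                    # A  -> alpha A'
        result[new_nt] = [alt[len(alpha):] for alt in group]   # A' -> beta_i
    return result


dangling = [["if", "b", "then", "S", "else", "S"],
            ["if", "b", "then", "S"],
            ["a"]]
print(left_factor("S", dangling))
```

For the if-then-else productions this yields S → if b then S S' | a and S' → else S | ε, so the choice between the two forms is deferred until after the inner statement has been read.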

Non-Context-Free Language Constructs
• Examples of non-context-free languages are:
  • L1 = {wcw | w is of the form (a|b)*}
  • L2 = {a^n b^m c^n d^m | n ≥ 1 and m ≥ 1}
  • L3 = {a^n b^n c^n | n ≥ 0}
• Languages similar to these that are context free:
  • L’1 = {w c w^R | w is of the form (a|b)*} (w^R stands for w reversed)
    This language is generated by the grammar S → aSa | bSb | c
  • L’2 = {a^n b^m c^m d^n | n ≥ 1 and m ≥ 1}
    This language is generated by the grammar S → aSd | aAd, A → bAc | bc

Non-Context-Free Language Constructs (Cont.)
• L”2 = {a^n b^n c^m d^m | n ≥ 1 and m ≥ 1}
  is generated by the grammar S → AB, A → aAb | ab, B → cBd | cd
• L’3 = {a^n b^n | n ≥ 1}
  is generated by the grammar S → aSb | ab
• This language is not definable by any regular expression

Non-Context-Free Language Constructs (Cont.)
• Suppose we could construct a DFSM D accepting L’3.
• D must have a finite number of states, say k.
• Consider the sequence of states s0, s1, s2, …, sk entered by D having read ε, a, aa, …, a^k.
• Since D only has k states, two of the k+1 states in the sequence have to be equal. Say, si = sj (i ≠ j, and i ≥ 1; if i = 0, swap the roles of i and j).
• From si, a sequence of i b's leads to an accepting (final) state. Therefore, the same sequence of i b's will also lead to an accepting state from sj. Therefore D would accept a^j b^i, which means that the language accepted by D is not identical to L’3. A contradiction.

Parsing
The parsing problem is: given a string of tokens w, find a parse tree whose frontier is w. (Equivalently, find a derivation of w from the start symbol.)
A parser for a grammar G reads a list of tokens and finds a parse tree if they form a sentence (or reports an error otherwise)
Two classes of algorithms for parsing:
• Top-down
• Bottom-up

Parser generators
• A parser generator is a program that reads a grammar and produces a parser
• The best known parser generator is yacc. It produces bottom-up parsers
• Most parser generators, including yacc, do not work for every CFG; they accept a restricted class of CFGs that can be parsed efficiently using the method employed by that parser generator

Top-down parsing • Starting from parse tree containing just S, build tree down toward input. Expand left-most non-terminal. • Algorithm: (next slide)

Top-down parsing (cont.)
Let input = a1a2...an
current sentential form (csf) = S
loop {
    suppose csf = a1…ak A γ
    based on ak+1…, choose production A → β
    csf becomes a1…ak β γ
}

Top-down parsing example
Grammar H: L → E ; L | E
           E → a | b
Input: a;b
Sentential form    Input
L                  a;b
E;L                a;b
a;L                a;b
[Figure: the parse tree after each step: L alone, then L with children E ; L, then the same tree with E expanded to a]

Top-down parsing example (cont.)
Sentential form    Input
a;E                a;b
a;b                a;b
[Figure: the parse tree after each step: the inner L expanded to E, then that E expanded to b]

LL(1) parsing
• Efficient form of top-down parsing
• Use only first symbol of remaining input (ak+1) to choose next production. That is, employ a function M: Σ × N → P in “choose production” step of algorithm.
• When this is possible, grammar is called LL(1)

LL(1) examples
• Example 1:
  H: L → E ; L | E
     E → a | b
  Given input a;b, so next symbol is a. Which production to use? Can’t tell.
  ⇒ H not LL(1)

LL(1) examples
• Example 2:
  Exp → Term Exp’
  Exp’ → $ | + Exp
  Term → id
  (Use $ for “end-of-input” symbol.)
• Grammar is LL(1): Exp and Term have only one production; Exp’ has two productions but only one is applicable at any time.
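Because one token of lookahead always selects the production in Example 2, the grammar can be parsed by a set of mutually recursive functions, one per nonterminal. The sketch below is an illustration under assumptions: the function and exception names are invented, tokens are plain strings, and the input list is expected to end with "$". It returns the productions used, in leftmost-derivation order.

```python
class ParseError(Exception):
    pass


def parse(tokens):
    """Recursive-descent parser for Exp -> Term Exp', Exp' -> $ | + Exp, Term -> id."""
    pos = 0
    used = []

    def lookahead():
        return tokens[pos] if pos < len(tokens) else None

    def expect(tok):
        nonlocal pos
        if lookahead() != tok:
            raise ParseError(f"expected {tok!r}, found {lookahead()!r}")
        pos += 1

    def exp():
        used.append("Exp -> Term Exp'")
        term()
        exp_prime()

    def exp_prime():
        # The next token alone decides which Exp' production applies.
        if lookahead() == "$":
            used.append("Exp' -> $")
            expect("$")
        elif lookahead() == "+":
            used.append("Exp' -> + Exp")
            expect("+")
            exp()
        else:
            raise ParseError(f"unexpected {lookahead()!r}")

    def term():
        used.append("Term -> id")
        expect("id")

    exp()
    if pos != len(tokens):
        raise ParseError("trailing input after $")
    return used


for production in parse(["id", "+", "id", "$"]):
    print(production)
```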

Nonrecursive predictive parsing • Maintain a stack explicitly, rather than implicitly via recursive calls • Key problem during predictive parsing: determining the production to be applied for a non-terminal

Nonrecursive predictive parsing
Algorithm. Nonrecursive predictive parsing
Set ip to point to the first symbol of w$.
repeat
    Let X be the top-of-stack symbol and a the symbol pointed to by ip
    if X is a terminal or $ then
        if X == a then
            pop X from the stack and advance ip
        else error()
    else // X is a nonterminal
        if M[X,a] == X → Y1 Y2 … Yk then
            pop X from the stack
            push Yk, Yk-1, …, Y1 onto the stack with Y1 on top
            (push nothing if Y1 Y2 … Yk is ε)
            output the production X → Y1 Y2 … Yk
        else error()
until X == $
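A compact sketch of this stack-driven loop is given below. Several things are assumptions made for illustration and are not taken from the slides: the stack is assumed to start as $ with the start symbol on top, the parsing table is written by hand as a dictionary, and the grammar is a variant of the earlier example (Exp → Term Exp', Exp' → + Term Exp' | ε, Term → id) in which $ is only the end marker.

```python
NONTERMINALS = {"Exp", "Exp'", "Term"}

# Hand-written table M[X, a] -> right-hand side (empty list = epsilon).
TABLE = {
    ("Exp", "id"): ["Term", "Exp'"],
    ("Exp'", "+"): ["+", "Term", "Exp'"],
    ("Exp'", "$"): [],
    ("Term", "id"): ["id"],
}


def predictive_parse(tokens, table, start, nonterminals):
    """Return the productions of a leftmost derivation, or raise SyntaxError."""
    stack = ["$", start]            # the top of the stack is the end of the list
    inp = list(tokens) + ["$"]
    ip = 0
    output = []
    while True:
        x, a = stack[-1], inp[ip]
        if x not in nonterminals:   # X is a terminal or $
            if x != a:
                raise SyntaxError(f"expected {x!r}, found {a!r}")
            if x == "$":
                return output       # stack and input exhausted together
            stack.pop()
            ip += 1
        else:                       # X is a nonterminal
            rhs = table.get((x, a))
            if rhs is None:
                raise SyntaxError(f"no table entry for M[{x}, {a}]")
            stack.pop()
            stack.extend(reversed(rhs))   # Y1 ends up on top
            output.append((x, tuple(rhs)))


for lhs, rhs in predictive_parse(["id", "+", "id"], TABLE, "Exp", NONTERMINALS):
    print(lhs, "->", " ".join(rhs) or "ε")
```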

LL(1) grammars
• No left recursion
  A → Aα : If this production is chosen, parse makes no progress.
• No common prefixes
  A → αβ | αγ
  Can fix by “left factoring”:
  A → αA’
  A’ → β | γ

LL(1) grammars (cont.) • No ambiguity Precise definition requires that production to choose be unique (“choose” function M very hard to calculate otherwise)

Top-down Parsing
[Figure: two snapshots of a parse tree rooted at the start symbol L with children E0 … En; the first is shown against the input tokens <t0,t1,…,ti,...>, the second against the remaining input tokens <ti,...>]
From left to right, “grow” the parse tree downwards ...

Checking LL(1)-ness
• For any sequence of grammar symbols α, define set FIRST(α) ⊆ Σ to be
  FIRST(α) = { a | α ⇒* aβ for some β }

LL(1) definition
• Define: Grammar G = (N, Σ, P, S) is LL(1) iff whenever there are two left-most derivations (in which the leftmost non-terminal is always expanded first)
  S ⇒* wAγ ⇒ wαγ ⇒* wtx
  S ⇒* wAγ ⇒ wβγ ⇒* wty
  it follows that α = β
• In other words, given
  1. a string wAγ in V* and
  2. t, the first terminal symbol to be derived from Aγ
  there is at most one production that can be applied to A to yield a derivation of any terminal string beginning with wt
• FIRST sets can often be calculated by inspection

FIRST Sets • ExpTerm Exp’ • Exp’$ | +Exp • Termid • (Use $ for “end-of-input” symbol) FIRST($) = {$} FIRST(+Exp) = {+} FIRST($)  FIRST(+Exp) = {}  grammar is LL(1)

FIRST Sets • L E ; L | EE a | b FIRST(E ; L) = {a, b} = FIRST(E) FIRST(E ; L)  FIRST(E)  {}  grammar not LL(1).

Computing FIRST Sets
Algorithm. Compute FIRST(X) for all grammar symbols X
forall X ∈ V do FIRST(X) = {}
forall X ∈ Σ (X is a terminal) do FIRST(X) = {X}
forall productions X → ε do FIRST(X) = FIRST(X) ∪ {ε}
repeat
    c: forall productions X → Y1Y2 … Yk do
        forall i ∈ [1,k] do
            FIRST(X) = FIRST(X) ∪ (FIRST(Yi) - {ε})
            if ε ∉ FIRST(Yi) then continue c
        FIRST(X) = FIRST(X) ∪ {ε}
until no more terminals or ε are added to any FIRST set
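The fixed-point computation above can be sketched as follows, and then used to repeat the LL(1) check on grammar H. The function names, the dictionary representation of the grammar, and the helper first_of_string (which extends FIRST from single symbols to strings of symbols) are assumptions introduced for illustration; the empty list stands for an ε right-hand side.

```python
EPS = "ε"


def compute_first(grammar, terminals):
    """Compute FIRST(X) for every grammar symbol X by iterating to a fixed point."""
    first = {t: {t} for t in terminals}
    for nt in grammar:
        first[nt] = set()
    changed = True
    while changed:
        changed = False
        for nt, alternatives in grammar.items():
            for alt in alternatives:
                before = len(first[nt])
                for sym in alt:
                    first[nt] |= first[sym] - {EPS}
                    if EPS not in first[sym]:
                        break            # Yi cannot vanish: stop scanning this production
                else:
                    first[nt].add(EPS)   # every Yi can derive ε (or the body is empty)
                if len(first[nt]) != before:
                    changed = True
    return first


def first_of_string(symbols, first):
    """FIRST of a sequence of grammar symbols, given FIRST of each symbol."""
    result = set()
    for sym in symbols:
        result |= first[sym] - {EPS}
        if EPS not in first[sym]:
            return result
    result.add(EPS)
    return result


# Grammar H: L -> E ; L | E     E -> a | b
H = {"L": [["E", ";", "L"], ["E"]], "E": [["a"], ["b"]]}
first = compute_first(H, {"a", "b", ";"})
overlap = first_of_string(["E", ";", "L"], first) & first_of_string(["E"], first)
print(sorted(overlap))  # a non-empty overlap means H is not LL(1)
```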
