## Topic #4: Syntactic Analysis (Parsing)

EE 456 – Compiling Techniques, Prof. Carl Sable, Fall 2003

**Parser**
• Accepts a string of tokens from the lexical analyzer (usually one token at a time)
• Verifies whether or not the string can be generated by the grammar
• Reports syntax errors (and recovers if possible)

**Errors**
• Lexical errors (e.g., misspelled word)
• Syntax errors (e.g., unbalanced parentheses, missing semicolon)
• Semantic errors (e.g., type errors)
• Logical errors (e.g., infinite recursion)

**Error Handling**
• Report errors clearly and accurately
• Recover quickly if possible
• Poor error recovery may lead to an avalanche of errors

**Error Recovery**
• Panic mode: Discard tokens one at a time until a synchronizing token is found
• Phrase-level recovery: Perform a local correction that allows parsing to continue
• Error productions: Augment the grammar to handle predicted, common errors
• Global correction: Use a complex algorithm to compute the least-cost sequence of changes leading to parseable code

**Context Free Grammars**
• CFGs can represent recursive constructs that regular expressions cannot
• A CFG consists of:
  • Tokens (terminals, symbols)
  • Nonterminals (syntactic variables denoting sets of strings)
  • Productions (rules specifying how terminals and nonterminals can combine to form strings)
  • A start symbol (the set of strings it denotes is the language of the grammar)

**Derivations (Part 1)**
• One definition of language: the set of strings that have valid parse trees
• Another definition: the set of strings that can be derived from the start symbol

E → E + E | E * E | (E) | -E | id

E => -E (read "E derives -E")
E => -E => -(E) => -(id)

**Derivations (Part 2)**
• αAβ => αγβ if A → γ is a production and α and β are arbitrary strings of grammar symbols
• If α1 => α2 => … => αn, we say α1 derives αn
• => means derives in one step
• *=> means derives in zero or more steps
• +=> means derives in one or more steps

**Sentences and Languages**
• Let L(G) be the language generated by the grammar G
with start symbol S:
  • Strings in L(G) may contain only tokens of G
  • A string w is in L(G) if and only if S +=> w
  • Such a string w is a sentence of G
• Any language that can be generated by a CFG is said to be a context-free language
• If two grammars generate the same language, they are said to be equivalent

**Sentential Forms**
• If S *=> α, where α may contain nonterminals, we say that α is a sentential form of G
• A sentence is a sentential form with no nonterminals

**Leftmost Derivations**
• Only the leftmost nonterminal in any sentential form is replaced at each step
• A leftmost step can be written as wAγ lm=> wδγ, where A → δ is the production applied
  • w consists of only terminals
  • γ is a string of grammar symbols
• If α derives β by a leftmost derivation, then we write α lm*=> β
• If S lm*=> α, then we say that α is a left-sentential form of the grammar
• Analogous terms exist for rightmost derivations

**Parse Trees**
• A parse tree can be viewed as a graphical representation of a derivation
• Every parse tree has a unique leftmost derivation (not true of every sentence)
• An ambiguous grammar has:
  • more than one parse tree for at least one sentence
  • more than one leftmost derivation for at least one sentence

**Capability of Grammars**
• Can describe most programming language constructs
• An exception: requiring that variables are declared before they are used
• Therefore, the grammar accepts a superset of the actual language
• A later phase (semantic analysis) does type checking

**Regular Expressions vs. CFGs**
• Every construct that can be described by an RE can also be described by a CFG
• Why use REs at all?
  • Lexical rules are simpler to describe this way
  • REs are often easier to read
  • More efficient lexical analyzers can be constructed

**Verifying Grammars**
• A proof that a grammar generates a given language has two parts:
  • Must show that every string generated by the grammar is part of the language
  • Must show that every string that is part of the language can be generated by the grammar
• Rarely done for complete programming languages!

**Eliminating Ambiguity (1)**
stmt → if expr then stmt
     | if expr then stmt else stmt
     | other

• Ambiguous sentence: if E1 then if E2 then S1 else S2

**Eliminating Ambiguity (3)**
stmt → matched | unmatched
matched → if expr then matched else matched | other
unmatched → if expr then stmt | if expr then matched else unmatched

**Left Recursion**
• A grammar is left recursive if it has a nonterminal A such that there exists a derivation A +=> Aα for some string α
• Most top-down parsing methods cannot handle left-recursive grammars

**Eliminating Left Recursion (1)**
A → Aα1 | Aα2 | … | Aαm | β1 | β2 | … | βn

becomes

A → β1A’ | β2A’ | … | βnA’
A’ → α1A’ | α2A’ | … | αmA’ | ε

Harder case:
S → Aa | b
A → Ac | Sd | ε

**Eliminating Left Recursion (2)**
• First arrange the nonterminals in some order A1, A2, …, An
• Apply the following algorithm:

for i = 1 to n {
  for j = 1 to i-1 {
    replace each production of the form Ai → Ajγ
    by the productions Ai → δ1γ | δ2γ | … | δkγ,
    where Aj → δ1 | δ2 | … | δk are the current Aj productions
  }
  eliminate the immediate left recursion among the Ai productions
}

**Left Factoring**
• Rewriting productions to delay decisions
• Helpful for predictive parsing
• Not guaranteed to remove ambiguity

A → αβ1 | αβ2

becomes

A → αA’
A’ → β1 | β2

**Limitations of CFGs**
• Cannot verify repeated strings
  • Example: L1 = {wcw | w is in (a|b)*}
  • Abstracts checking that variables are declared
• Cannot verify repeated counts
  • Example: L2 = {a^n b^m c^n d^m | n ≥ 1 and m ≥ 1}
  • Abstracts checking that the numbers of formal and actual parameters are equal
• Therefore, some checks are put off until semantic analysis

**Top Down Parsing**
• Can be viewed two ways:
  • Attempt to find leftmost
derivation for input string
  • Attempt to create parse tree, starting at the root, creating nodes in preorder
• General form is recursive descent parsing
  • May require backtracking
  • Backtracking parsers are not used frequently because backtracking is rarely needed

**Predictive Parsing**
• A special case of recursive-descent parsing that does not require backtracking
• Must always know which production to use based on the current input symbol
• Can often create an appropriate grammar by:
  • removing left recursion
  • left factoring the resulting grammar

**Transition Diagrams**
• For parsers:
  • One diagram for each nonterminal
  • Edge labels can be tokens or nonterminals
  • A transition on a token means we should take that transition if that token is the next input symbol
  • A transition on a nonterminal can be thought of as a call to a procedure for that nonterminal
• As opposed to lexical analyzers:
  • One (or more) diagrams for each token
  • Labels are symbols of the input alphabet

**Creating Transition Diagrams**
• First eliminate left recursion from the grammar
• Then left factor the grammar
• For each nonterminal A:
  • Create an initial and final state
  • For every production A → X1X2…Xn, create a path from the initial to the final state with edges labeled X1, X2, …, Xn

**Using Transition Diagrams**
• Predictive parsers:
  • Start at the start state of the diagram for the grammar's start symbol
  • From state s with an edge to state t labeled with token a, if the next input token is a:
    • State changes to t
    • Input cursor moves one position right
  • If the edge is labeled by nonterminal A:
    • State changes to the start state for A
    • Input cursor is not moved
    • If the final state of A is reached, then state changes to t
  • If the edge is labeled by ε, state changes to t
• Can be recursive or non-recursive using a stack

**Transition Diagram Example**
Grammar used for the diagrams:
E → TE’
E’ → +TE’ | ε
T → FT’
T’ → *FT’ | ε
F → (E) | id

Obtained from the left-recursive grammar:
E → E + T | T
T → T * F | F
F → (E) | id

(Transition diagrams for E, E’, T, T’, and F not reproduced.)

**Nonrecursive Predictive Parsing (1)**
(Figure not reproduced; surviving labels: Input, Stack.)

**Nonrecursive Predictive Parsing (2)**
• Program considers X, the symbol on top of the stack, and a, the next input symbol
• If X =
a = $, parser halts successfully
• If X = a ≠ $, parser pops X off the stack and advances to the next input symbol
• If X is a nonterminal, the program consults M[X, a] (production or error entry)

**Nonrecursive Predictive Parsing (3)**
• Initialize stack with the start symbol of the grammar (above the end marker $)
• Initialize input pointer to first symbol of input
• After consulting parsing table:
  • If entry is a production, parser replaces top entry of stack with right side of production (leftmost symbol on top)
  • Otherwise, an error recovery routine is called

**FIRST**
• FIRST(α) is the set of all terminals that begin any string derived from α (plus ε if α *=> ε)
• Computing FIRST:
  • If X is a terminal, FIRST(X) = {X}
  • If X → ε is a production, add ε to FIRST(X)
  • If X is a nonterminal and X → Y1Y2…Yn is a production:
    • For all terminals a, add a to FIRST(X) if a is a member of some FIRST(Yi) and ε is a member of all of FIRST(Y1), FIRST(Y2), …, FIRST(Yi-1)
    • If ε is a member of all of FIRST(Y1), FIRST(Y2), …, FIRST(Yn), add ε to FIRST(X)

**FOLLOW**
• FOLLOW(A), for any nonterminal A, is the set of terminals a that can appear immediately to the right of A in some sentential form
• More formally, a is in FOLLOW(A) if and only if there exists a derivation of the form S *=> αAaβ
• $ is in FOLLOW(A) if and only if there exists a derivation of the form S *=> αA

**Computing FOLLOW**
• Place $ in FOLLOW(S)
• If there is a production A → αBβ, then everything in FIRST(β) (except for ε) is in FOLLOW(B)
• If there is a production A → αB, or a production A → αBβ where FIRST(β) contains ε, then everything in FOLLOW(A) is also in FOLLOW(B)

**FIRST and FOLLOW Example**
E → TE’
E’ → +TE’ | ε
T → FT’
T’ → *FT’ | ε
F → (E) | id

FIRST(E) = FIRST(T) = FIRST(F) = {(, id}
FIRST(E’) = {+, ε}
FIRST(T’) = {*, ε}
FOLLOW(E) = FOLLOW(E’) = {), $}
FOLLOW(T) = FOLLOW(T’) = {+, ), $}
FOLLOW(F) = {+, *, ), $}

**Creating a Predictive Parsing Table**
• For each production A → α:
  • For each terminal a in FIRST(α), add A → α to M[A, a]
  • If ε is in FIRST(α), add A → α to M[A, b] for every terminal b in FOLLOW(A)
  • If ε is
in FIRST(α) and $ is in FOLLOW(A), add A → α to M[A, $]
• Mark each undefined entry of M as an error entry (use some recovery strategy)

**Multiply-Defined Entries Example**
S → iEtSS’ | a
S’ → eS | ε
E → b

• M[S’, e] contains both S’ → eS and S’ → ε, because e is in FOLLOW(S’)

**LL(1) Grammars (1)**
• Algorithm covered in class can be applied to any grammar to produce a parsing table
• If the parsing table has no multiply-defined entries, the grammar is said to be “LL(1)”
  • First “L”: left-to-right scanning of input
  • Second “L”: produces leftmost derivation
  • “1”: the number of lookahead symbols needed to make decisions

**LL(1) Grammars (2)**
• No ambiguous or left-recursive grammar can be LL(1)
• Eliminating left recursion and left factoring does not always lead to an LL(1) grammar
• Some grammars cannot be transformed into an LL(1) grammar at all
• Although the example of a non-LL(1) grammar we covered has a fix, there are no universal rules to handle cases like this

**Shift-Reduce Parsing**
• One simple form of bottom-up parsing is shift-reduce parsing
• Starts at the bottom (leaves, terminals) and works its way up to the top (root, start symbol)
• Each step is a “reduction”:
  • Substring of input matching the right side of a production is “reduced”
  • Replaced with the nonterminal on the left of the production
• If all substrings are chosen correctly, a rightmost derivation is traced in reverse

**Shift-Reduce Parsing Example**
S → aABe
A → Abc | b
B → d

Reductions:
abbcde
aAbcde
aAde
aABe
S

S rm=> aABe rm=> aAde rm=> aAbcde rm=> abbcde

**Handles (1)**
• Informally, a “handle” of a string:
  • Is a substring of the string
  • Matches the right side of a production
  • Reduction to left side of production is one step along reverse of rightmost derivation
• Leftmost substring matching right side of production is not necessarily a handle
  • Might not be able to reduce resulting string to start symbol
  • In the example from the previous slide, if we reduce aAbcde to aAAcde, we cannot reduce this to S

**Handles (2)**
• Formally, a handle of a right-sentential form γ:
  • Is a production A → β and a
position of γ where β may be found and replaced with A
  • Replacing β with A yields the previous right-sentential form in a rightmost derivation of γ
• So if S rm*=> αAw rm=> αβw, then A → β in the position following α is a handle of αβw
• The string w to the right of the handle contains only terminals
• Can be more than one handle if the grammar is ambiguous (more than one rightmost derivation)

**Ambiguity and Handles Example**
E → E + E
E → E * E
E → (E)
E → id

E rm=> E + E rm=> E + E * E rm=> E + E * id3 rm=> E + id2 * id3 rm=> id1 + id2 * id3

E rm=> E * E rm=> E * id3 rm=> E + E * id3 rm=> E + id2 * id3 rm=> id1 + id2 * id3

**Handle Pruning**
• Repeat the following process, starting from the string of tokens, until the start symbol is obtained:
  • Locate the handle in the current right-sentential form
  • Replace the handle with the left side of the appropriate production
• Two problems that need to be solved:
  • How to locate the handle
  • How to choose the appropriate production
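
The handle-pruning loop can be illustrated with a small Python sketch. It is not how a real shift-reduce parser locates handles (those use parsing tables); it sidesteps both open problems with a brute-force backtracking search. The grammar is the one from the shift-reduce example (S → aABe, A → Abc | b, B → d). The names `GRAMMAR`, `candidate_reductions`, and `prune` are hypothetical, and the convention that uppercase letters are nonterminals and lowercase letters are terminals is an assumption of this sketch. A reduction is only tried when everything to its right is terminals, so any successful sequence is a rightmost derivation in reverse.

```python
# Grammar from the shift-reduce example. Assumption of this sketch:
# uppercase letters are nonterminals, lowercase letters are terminals.
GRAMMAR = [
    ("S", "aABe"),
    ("A", "Abc"),
    ("A", "b"),
    ("B", "d"),
]
START = "S"


def candidate_reductions(form):
    """Yield every form obtained by one reduction whose right context
    contains only terminals (a necessary property of a handle)."""
    for pos in range(len(form)):
        for head, body in GRAMMAR:
            end = pos + len(body)
            right_context = form[end:]
            if form[pos:end] == body and all(ch.islower() for ch in right_context):
                yield form[:pos] + head + right_context


def prune(form, seen=None):
    """Return the list of forms from `form` up to START, or None if the
    form cannot be reduced to START. Wrong guesses are undone by backtracking."""
    if form == START:
        return [form]
    seen = set() if seen is None else seen
    if form in seen:  # already explored (or on the current path)
        return None
    seen.add(form)
    for reduced in candidate_reductions(form):
        rest = prune(reduced, seen)
        if rest is not None:
            return [form] + rest
    return None


if __name__ == "__main__":
    print(prune("abbcde"))
    print(prune("aAAcde"))
```

Running it on `abbcde` reproduces the reduction sequence from the example slide, while `aAAcde` (the result of reducing the wrong `b`) cannot be reduced to S. That second case shows why a substring that merely matches a right side, even with only terminals to its right, is not necessarily a handle.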