Compiler Construction

Compiler Construction Syntax Analysis Top-down parsing

Syntax Analysis, continued

Syntax analysis • Last week we covered • The goal of syntax analysis • Context-free grammars • Top-down parsing (a simple but weak parsing method) • Today, we will • Wrap up top-down parsing, including LL(1) • Start on bottom-up parsing • Shift-reduce parsers • LR parsers: SLR(1), LR(1), LALR(1)

Top-Down Parsing

Recursive descent (Last Week) • Recursive descent parsers simply try to build a parse tree, top-down, and BACKTRACK on failure. • Recursion and backtracking are inefficient. • It would be better if we always knew the correct action to take. • It would be better if we could avoid recursive procedure calls during parsing. • PREDICTIVE PARSERS can solve both problems.

Predictive parsers • A predictive parser always knows which production to use, so backtracking is not necessary. • Example: for the productionsstmt -> if ( expr ) stmt else stmt | while ( expr ) stmt | for ( stmt expr stmt ) stmt • a recursive descent parser would always know which production to use, depending on the input token.

Transition diagrams • Transition diagrams can describe recursive parsers, just like they can describe lexical analyzers, but the diagrams are slightly different. • Construction: • Eliminate left recursion from G • Left factor G • For each non-terminal A, do • Create an initial and final (return) state • For each production A -> X1 X2 … Xn, create a path from the initial to the final state with edges X1 X2 … Xn.

Using transition diagrams • Begin in the start state for the start symbol • When we are in state s with edge labeled by terminal a to state t, if the next input symbol is a, move to state t and advance the input pointer. • For an edge to state t labeled with non-terminal A, jump to the transition diagram for A, and when finished, return to state t • For an edge labeled ε, move immediately to t. • Example (4.15 in text): parse the string “id + id * id”

Example transition diagrams • An expression grammar with left recursion and ambiguity removed: • E -> T E’ • E’ -> + T E’ | ε • T -> F T’ • T’ -> * F T’ | ε • F -> ( E ) | id Corresponding transition diagrams:

Predictive parsing without recursion • To get rid of the recursive procedure calls, we maintain our own stack.

The parsing table and parsing program • The table is a 2D array M[A,a] where A is a nonterminal symbol and a is a terminal or $. • At each step, the parser considers the top-of-stack symbol X and input symbol a: • If both are $, accept • If they are the same (nonterminals), pop X, advance input • If X is a nonterminal, consult M[X,a]. If M[X,a] is “ERROR” call an error recovery routine. Otherwise, if M[X,a] is a production of he grammar X -> UVW, replace X on the stack with WVU (U on top)

Example • Use the table-driven predictive parser to parseid + id * id • Assuming parsing table Initial stack is $E Initial input is id + id * id $

Building a predictive parse table • We still don’t know how to create M, the parse table. • The construction requires two functions: FIRST and FOLLOW. • For a string of grammar symbols α, FIRST(α) is the set of terminals that begin all possible strings derived from α. If α =*> ε, then ε is also in FIRST(α). • FOLLOW(A) for nonterminal A is the set of terminals that can appear immediately to the right of A in some sentential form. If A can be the last symbol in a sentential form, then $ is also in FOLLOW(A).

How to compute FIRST(α) • If X is a terminal, FIRST(X) = X. • Otherwise (X is a nonterminal), • 1. If X -> ε is a production, add ε to FIRST(X) • 2. If X -> Y1… Yk is a production, then place a in FIRST(X) if for some i, a is in FIRST(Yi) and Y1…Yi-1 =*> ε. • Given FIRST(X) for all single symbols X, • Let FIRST(X1…Xn) = FIRST(X1) • If ε ∈ FIRST(X1), then add FIRST(X2), and so on…

How to compute FOLLOW(A) • Place $ in FOLLOW(S) (for S the start symbol) • If A -> α B β, then FIRST(β)-ε is placed in FOLLOW(B) • If there is a production A -> α B or a production A -> α B β where β =*> ε, then everything in FOLLOW(A) is in FOLLOW(B). • Repeatedly apply these rules until no FOLLOW set changes.

Example FIRST and FOLLOW • For our favorite grammar:E -> TE’E’ -> +TE | εT -> FT’T’ -> *FT’ | εF -> (E) | id • What is FIRST() and FOLLOW() for all nonterminals?

Parse table construction withFIRST/FOLLOW • Basic idea: if A -> α and a is in FIRST(α), then we expand A to α any time the current input is a and the top of stack is A. • Algorithm: • For each production A -> α in G, do: • For each terminal a in FIRST(α) add A -> α to M[A,a] • If ε ∈ FIRST(α), for each terminal b in FOLLOW(A), do: • add A -> α to M[A,b] • If ε ∈ FIRST(α) and $ is in FOLLOW(A), add A -> α to M[A,$] • Make each undefined entry in M[ ] an ERROR

Example predictive parse table construction • For our favorite grammar:E -> TE’E’ -> +TE | εT -> FT’T’ -> *FT’ | εF -> (E) | id • What the predictive parsing table?

LL(1) grammars • The predictive parser algorithm can be applied to ANY grammar. • But sometimes, M[ ] might have multiply defined entries. • Example: for if-else statements and left factoring:stmt -> if ( expr ) stmt optelseoptelse -> else stmt | ε • When we have “optelse” on the stack and “else” in the input, we have a choice of how to expand optelse (“else” is in FOLLOW(optelse) so either rule is possible)

LL(1) grammars • If the predictive parsing construction for G leads to a parse table M[ ] WITHOUT multiply defined entries,we say “G is LL(1)” 1 symbol of lookahead Leftmost derivation Left-to-right scan of the input

LL(1) grammars • Necessary and sufficient conditions for G to be LL(1): • If A -> α | β • There does not exist a terminal a such thata ∈ FIRST(α) and a ∈ FIRST(β) • At most one of α and β derive ε • If β =*> ε, then FIRST(α) does not intersect with FOLLOW(β). This is the same as saying the predictive parser always knows what to do!

Top-down parsing summary • RECURSIVE DESCENT parsers are easy to build, but inefficient, and might require backtracking. • TRANSITION DIAGRAMS help us build recursive descent parsers. • For LL(1) grammars, it is possible to build PREDICTIVE PARSERS with no recursion automatically. • Compute FIRST() and FOLLOW() for all nonterminals • Fill in the predictive parsing table • Use the table-driven predictive parsing algorithm

Bottom-Up Parsing

Bottom-up parsing • Now, instead of starting with the start symbol and working our way down, we will start at the bottom of the parse tree and work our way up. • The style of parsing is called SHIFT-REDUCE • SHIFT refers to pushing input symbols onto a stack. • REDUCE refers to “reduction steps” during a parse: • We take a substring matching the RHS of a rule • Then replace it with the symbol on the LHS of the rule • If you can reduce until you have just the start symbol, you have succeeded in parsing the input string.

Reduction example • S -> aABe • Grammar: A -> Abc | b Input: abbcbcde • B -> d • Reduction steps: abbcbcde • aAbcbcde • aAbcde • aAde • aABe • S <-- SUCCESS! In reverse, the reduction traces out a rightmost derivation.

Handles • The HANDLE is the part of a sentential form that gets reduced in a backwards rightmost derivation. • Sometimes part of a sentential form will match a RHS in G, but if that string is NOT reduced in the backwards rightmost derivation, it is NOT a handle. • Shift-reduce parsing, then, is really all about finding the handle at each step then reducing the handle. • If we can always find the handle, we never have to backtrack. • Finding the handle is called HANDLE PRUNING.

Shift-reduce parsing with a stack • A stack helps us find the handle for each reduction step. • The stack holds grammar symbols. • An input buffer holds the input string. • $ marks the bottom of the stack and the end of input. • Algorithm: • Shift 0 or more input symbols onto the stack, until a handle β is on top of the stack. • Reduce β to the LHS of the appropriate production. • Repeat until we see $S on stack and $ in input.

Shift-reduce example • E -> E + E • Grammar: E -> E * E w = id + id * id • E -> ( E ) • E -> id • STACK INPUT ACTION • 1. $ id+id*id$ shift

Shift-reduce parsing actions • SHIFT: The next input symbol is pushed onto the stack. • REDUCE: When the parser knows the right end of a handle is on the stack, the handle is replaced with the corresponding LHS. • ACCEPT: Announce success (input is $, stack is $S) • ERROR: The input contained a syntax error; call an error recovery routine.

Conflicts during shift/reduce parsing • Like predictive parsers, sometimes a shift-reduce parser won’t know what to do. • A SHIFT/REDUCE conflict occurs when the parser can’t decide whether to shift the input symbol or reduce the current top of stack. • A REDUCE/REDUCE conflict occurs when the parser doesn’t know which of two or more rules to use for reduction. • A grammar whose shift-reduce parser contains errors is said to be “Not LR”

Example shift/reduce conflict • Ambiguous grammars are NEVER LR. • stmt -> if ( expr ) stmt • | if ( expr ) stmt else stmt • | other • If we have a shift-reduce parser in configuration • STACK INPUT • … if ( expr ) stmt else … $ • what to do? • We could reduce “if ( expr ) stmt” to “stmt” (assuming the else is part of a different surrounding if-else statement) • We could also shift the “else” (assuming this else goes with the current if)

Example reduce/reduce conflict • Some languages use () for function calls AND array refs. • stmt -> id ( parameter_list ) • stmt -> expr := expr • parameter_list -> parameter_list , parameter • parameter_list -> parameter • parameter -> id • expr -> id ( expr_list ) • expr -> id • expr_list -> expr_list , expr • expr_list -> expr

Example reduce/reduce conflict • For input A(I,J) we would get token stream id(id,id) • The first three tokens would certainly be shifted: • STACK INPUT • … id ( id , id ) … • The id on top of the stack needs to be reduced, but we have two choices: parameter -> id OR expr -> id • The stack gives no clues. To know which rule to use, we need to look up the first ID in the symbol table to see if it is a procedure name or an array name. • One solution is to have the lexer return “procid” for procedure names. Then the shift-reduce parser can look into the stack to decide which reduction to use.

LR (Bottom-Up) Parsers

Relationship between parser types

LR parsing • A major type of shift-reduce parsing is called LR(k). • “L” means left-to-right scanning of the input • “R” means rightmost derivation • “k” means lookahead of k characters (if omitted, assume k=1) • LR parsers have very nice properties: • They can recognize almost all programming language constructs for which we can write a CFG • They are the most powerful type of shift-reduce parser, but they never backtrack, and are very efficient • They can parse a proper superset of the languages parsable by predictive parsers • They tell you as soon as possible when there’s a syntax error. • DISADVANTAGE: hard to build by hand (we need something like yacc)

LR parsing

LR parsing • The parser’s structure is similar to predictive parsing. • The STACK now stores pairs (Xi, si). • Xi is a grammar symbol. • si is a STATE. • The parse table now has two parts: ACTION and GOTO. • The action table specifies whether to SHIFT, REDUCE, ACCEPT, or flag an ERROR given the state on the stack and the current input. • The goto table specifies what state to go to after a reduction is performed.

Parser configurations • A CONFIGURATION of the LR parser is a pair (STACK, INPUT): ( s0 X1 s1… Xm sm, ai ai+1… an $ ) • The stack configuration is just a list of the states and grammar symbols currently on the stack. • The input configuration is the list of unprocessed input symbols. • Together, the configuration represents a right-sentential form X1… Xm ai ai+1… an (some intermediate step in a right derivation of the input from the start symbol)

The LR parsing algorithm • At each step, the parser is in some configuration. • The next move depends on reading ai from the input and sm from the top of the stack. • If action[sm,ai] = shift s, we execute a SHIFT move, entering the configuration ( s0 X1 s1… Xm sm ai s, ai+1… an $ ). • If action[sm,ai] = reduce A -> β, then we enter the configuration ( s0 X1 s1… Xm-r sm-r A s, ai+1… an $ ), where r = | β | and s = goto[sm-r,A]. • If action[sm,ai] = accept, we’re done. • If action[sm,ai] = error, we call an error recovery routine.

LR parsing example • Grammar: • 1. E -> E + T • 2. E -> T • 3. T -> T * F • 4. T -> F • 5. F -> ( E ) • 6. F -> id

LR parsing example • CONFIGURATIONS • STACK INPUT ACTION • 0 id * id + id $ shift 5

LR grammars • If it is possible to construct an LR parse table for G, we say “G is an LR grammar”. • LR parsers DO NOT need to parse the entire stack to decide what to do (other shift-reduce parsers might). • Instead, the STATE symbol summarizes all the information needed to make the decision of what to do next. • The GOTO function corresponds to a DFA that knows how to find the HANDLE by reading the top of the stack downwards. • In the example, we only looked at 1 input symbol at a time. This means the grammar is LR(1).

How to construct an LR parse table? • We will look at 3 methods: • Simple LR (SLR): simple but not very powerful • Canonical LR: very powerful but too many states • LALR: almost as powerful with many fewer states • yacc uses the LALR algorithm.

Compiler Construction