Compiler Construction

Syntax Analysis

Top-down parsing

Syntax Analysis, continued

- Last week we covered
- The goal of syntax analysis
- Context-free grammars
- Top-down parsing (a simple but weak parsing method)

- Today, we will
- Wrap up top-down parsing, including LL(1)
- Start on bottom-up parsing
- Shift-reduce parsers
- LR parsers: SLR(1), LR(1), LALR(1)

Top-Down Parsing

- Recursive descent parsers simply try to build a parse tree, top-down, and BACKTRACK on failure.
- Recursion and backtracking are inefficient.
- It would be better if we always knew the correct action to take.
- It would be better if we could avoid recursive procedure calls during parsing.
- PREDICTIVE PARSERS can solve both problems.

- A predictive parser always knows which production to use, so backtracking is not necessary.
- Example: for the productions
- stmt -> if ( expr ) stmt else stmt | while ( expr ) stmt | for ( stmt expr stmt ) stmt
- a recursive descent parser would always know which production to use, because each alternative begins with a different input token.

- Transition diagrams can describe recursive parsers, just like they can describe lexical analyzers, but the diagrams are slightly different.
- Construction:
- Eliminate left recursion from G
- Left factor G
- For each non-terminal A, do
- Create an initial and final (return) state
- For each production A -> X1 X2 … Xn, create a path from the initial to the final state with edges X1 X2 … Xn.

- Begin in the start state for the start symbol
- When we are in state s with edge labeled by terminal a to state t, if the next input symbol is a, move to state t and advance the input pointer.
- For an edge to state t labeled with non-terminal A, jump to the transition diagram for A, and when finished, return to state t
- For an edge labeled ε, move immediately to t.
- Example (4.15 in text): parse the string “id + id * id”

- An expression grammar with left recursion and ambiguity removed:
- E -> T E’
- E’ -> + T E’ | ε
- T -> F T’
- T’ -> * F T’ | ε
- F -> ( E ) | id
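Because this grammar has no left recursion and each alternative starts with a distinct token, it can be parsed by recursive descent without backtracking. The following is a minimal sketch, assuming the input is already a list of token strings; the class name `Parser`, the helper `accepts`, and the exception `ParseError` are illustrative names, not part of the slides.

```python
# One procedure per nonterminal of:
#   E -> T E' ; E' -> + T E' | ε ; T -> F T' ; T' -> * F T' | ε ; F -> ( E ) | id
# All names here (Parser, ParseError, accepts) are hypothetical.

class ParseError(Exception):
    pass


class Parser:
    def __init__(self, tokens):
        self.tokens = list(tokens) + ["$"]  # "$" marks end of input
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos]

    def match(self, terminal):
        # Consume one terminal or report a syntax error.
        if self.peek() != terminal:
            raise ParseError(f"expected {terminal!r}, got {self.peek()!r}")
        self.pos += 1

    def E(self):
        self.T()
        self.E_prime()

    def E_prime(self):
        if self.peek() == "+":      # E' -> + T E'
            self.match("+")
            self.T()
            self.E_prime()
        # otherwise E' -> ε: consume nothing

    def T(self):
        self.F()
        self.T_prime()

    def T_prime(self):
        if self.peek() == "*":      # T' -> * F T'
            self.match("*")
            self.F()
            self.T_prime()
        # otherwise T' -> ε

    def F(self):
        if self.peek() == "(":      # F -> ( E )
            self.match("(")
            self.E()
            self.match(")")
        else:                       # F -> id
            self.match("id")


def accepts(tokens):
    parser = Parser(tokens)
    try:
        parser.E()
        parser.match("$")           # all input must be consumed
        return True
    except ParseError:
        return False
```

Each procedure inspects only the next token to choose a production, which is what makes the parser predictive; the recursion is still present, which motivates the explicit stack used by the table-driven version.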

Corresponding transition diagrams:

- To get rid of the recursive procedure calls, we maintain our own stack.

- The table is a 2D array M[A,a] where A is a nonterminal symbol and a is a terminal or $.
- At each step, the parser considers the top-of-stack symbol X and input symbol a:
- If both are $, accept
- If they are the same terminal, pop X, advance input
- If X is a nonterminal, consult M[X,a]. If M[X,a] is “ERROR”, call an error recovery routine. Otherwise, if M[X,a] is a production of the grammar X -> UVW, replace X on the stack with WVU (U on top)

- Use the table-driven predictive parser to parse id + id * id
- Assuming parsing table

Initial stack is $E

Initial input is id + id * id $
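The parsing loop can be sketched as follows. The table `M` below is written out by hand for the expression grammar E -> T E’, E’ -> + T E’ | ε, T -> F T’, T’ -> * F T’ | ε, F -> ( E ) | id, and should be read as an assumption; the function name `ll1_parse` and the choice to return the list of productions used are likewise illustrative.

```python
# Table-driven predictive parsing with an explicit stack.
# M maps (nonterminal, lookahead) to a right-hand side; () stands for ε.
# The table contents are hand-written assumptions for the expression grammar.

NONTERMINALS = {"E", "E'", "T", "T'", "F"}

M = {
    ("E", "id"): ("T", "E'"),      ("E", "("): ("T", "E'"),
    ("E'", "+"): ("+", "T", "E'"), ("E'", ")"): (), ("E'", "$"): (),
    ("T", "id"): ("F", "T'"),      ("T", "("): ("F", "T'"),
    ("T'", "+"): (), ("T'", "*"): ("*", "F", "T'"),
    ("T'", ")"): (), ("T'", "$"): (),
    ("F", "id"): ("id",),          ("F", "("): ("(", "E", ")"),
}


def ll1_parse(tokens):
    stack = ["$", "E"]             # top of stack is the end of the list
    buf = list(tokens) + ["$"]
    i = 0
    used = []                      # productions applied, in order
    while True:
        X, a = stack[-1], buf[i]
        if X == "$" and a == "$":
            return used            # accept
        if X not in NONTERMINALS:
            if X != a:
                raise SyntaxError(f"expected {X!r}, got {a!r}")
            stack.pop()            # matching terminal: pop and advance
            i += 1
        else:
            rhs = M.get((X, a))
            if rhs is None:        # blank table entry means ERROR
                raise SyntaxError(f"no entry M[{X}, {a}]")
            stack.pop()
            stack.extend(reversed(rhs))  # leftmost RHS symbol ends up on top
            used.append((X, rhs))
```

Calling `ll1_parse("id + id * id".split())` returns the productions in the order they were applied, which spells out a leftmost derivation of the input.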

- We still don’t know how to create M, the parse table.
- The construction requires two functions: FIRST and FOLLOW.
- For a string of grammar symbols α, FIRST(α) is the set of terminals that can begin a string derived from α. If α =*> ε, then ε is also in FIRST(α).
- FOLLOW(A) for nonterminal A is the set of terminals that can appear immediately to the right of A in some sentential form. If A can be the last symbol in a sentential form, then $ is also in FOLLOW(A).

- If X is a terminal, FIRST(X) = {X}.
- Otherwise (X is a nonterminal),
- 1. If X -> ε is a production, add ε to FIRST(X)
- 2. If X -> Y1 … Yk is a production, then place a in FIRST(X) if for some i, a is in FIRST(Yi) and Y1 … Y(i-1) =*> ε. If every one of Y1 … Yk derives ε, add ε to FIRST(X).

- Given FIRST(X) for all single symbols X,
- Let FIRST(X1…Xn) = FIRST(X1)
- If ε ∈ FIRST(X1), then add FIRST(X2), and so on…

- Place $ in FOLLOW(S) (for S the start symbol)
- If A -> α B β, then everything in FIRST(β) except ε is placed in FOLLOW(B)
- If there is a production A -> α B or a production A -> α B β where β =*> ε, then everything in FOLLOW(A) is in FOLLOW(B).
- Repeatedly apply these rules until no FOLLOW set changes.

- For our favorite grammar:
- E -> T E’
- E’ -> + T E’ | ε
- T -> F T’
- T’ -> * F T’ | ε
- F -> ( E ) | id
- What are FIRST() and FOLLOW() for each nonterminal?
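One way to answer this is to run the FIRST and FOLLOW rules to a fixed point. The sketch below assumes the expression grammar E -> T E’, E’ -> + T E’ | ε, T -> F T’, T’ -> * F T’ | ε, F -> ( E ) | id, encoded as a dictionary; the names `GRAMMAR`, `first_of_string`, `compute_first`, and `compute_follow` are illustrative.

```python
# Fixed-point computation of FIRST and FOLLOW.
# An empty list is an ε-production. All names are hypothetical.

EPS = "ε"
START = "E"
GRAMMAR = {
    "E":  [["T", "E'"]],
    "E'": [["+", "T", "E'"], []],
    "T":  [["F", "T'"]],
    "T'": [["*", "F", "T'"], []],
    "F":  [["(", "E", ")"], ["id"]],
}


def first_of_string(symbols, first):
    """FIRST(X1 ... Xn), given FIRST for single nonterminals."""
    out = set()
    for X in symbols:
        fx = first[X] if X in GRAMMAR else {X}   # FIRST(terminal) = {terminal}
        out |= fx - {EPS}
        if EPS not in fx:
            return out
    out.add(EPS)                                  # every Xi can vanish
    return out


def compute_first():
    first = {A: set() for A in GRAMMAR}
    changed = True
    while changed:
        changed = False
        for A, prods in GRAMMAR.items():
            for rhs in prods:
                new = first_of_string(rhs, first)
                if not new <= first[A]:
                    first[A] |= new
                    changed = True
    return first


def compute_follow(first):
    follow = {A: set() for A in GRAMMAR}
    follow[START].add("$")
    changed = True
    while changed:
        changed = False
        for A, prods in GRAMMAR.items():
            for rhs in prods:
                for i, B in enumerate(rhs):
                    if B not in GRAMMAR:
                        continue                  # FOLLOW is for nonterminals
                    beta = first_of_string(rhs[i + 1:], first)
                    new = beta - {EPS}
                    if EPS in beta:               # β is empty or β =*> ε
                        new |= follow[A]
                    if not new <= follow[B]:
                        follow[B] |= new
                        changed = True
    return follow


FIRST = compute_first()
FOLLOW = compute_follow(FIRST)
```

For this grammar the computation gives FIRST(E) = FIRST(T) = FIRST(F) = {(, id}, FIRST(E’) = {+, ε}, FIRST(T’) = {*, ε}, FOLLOW(E) = FOLLOW(E’) = {), $}, FOLLOW(T) = FOLLOW(T’) = {+, ), $}, and FOLLOW(F) = {+, *, ), $}.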

- Basic idea: if A -> α and a is in FIRST(α), then we expand A to α any time the current input is a and the top of stack is A.
- Algorithm:
- For each production A -> α in G, do:
- For each terminal a in FIRST(α) add A -> α to M[A,a]
- If ε ∈ FIRST(α), for each terminal b in FOLLOW(A), do:
- add A -> α to M[A,b]
- If ε ∈ FIRST(α) and $ is in FOLLOW(A), add A -> α to M[A,$]
- Make each undefined entry in M[ ] an ERROR

- For our favorite grammar:
- E -> T E’
- E’ -> + T E’ | ε
- T -> F T’
- T’ -> * F T’ | ε
- F -> ( E ) | id
- What is the predictive parsing table?
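The table construction can be sketched directly from the algorithm. To keep the example short, the FIRST and FOLLOW sets for the expression grammar (with E’ -> + T E’ | ε) are written out by hand here rather than computed; the names `PRODUCTIONS`, `build_table`, and `TABLE` are illustrative. Each table cell holds a list of right-hand sides, so a cell with more than one element would be a multiply defined entry.

```python
# Filling M[A, a] from FIRST and FOLLOW. () stands for ε on a right-hand side.
# FIRST/FOLLOW are hand-written for the expression grammar; names are hypothetical.

EPS = "ε"
PRODUCTIONS = [
    ("E", ("T", "E'")),
    ("E'", ("+", "T", "E'")), ("E'", ()),
    ("T", ("F", "T'")),
    ("T'", ("*", "F", "T'")), ("T'", ()),
    ("F", ("(", "E", ")")), ("F", ("id",)),
]
FIRST = {
    "E": {"(", "id"}, "T": {"(", "id"}, "F": {"(", "id"},
    "E'": {"+", EPS}, "T'": {"*", EPS},
}
FOLLOW = {
    "E": {")", "$"}, "E'": {")", "$"},
    "T": {"+", ")", "$"}, "T'": {"+", ")", "$"},
    "F": {"+", "*", ")", "$"},
}


def first_of_string(symbols):
    out = set()
    for X in symbols:
        fx = FIRST.get(X, {X})      # a terminal's FIRST set is itself
        out |= fx - {EPS}
        if EPS not in fx:
            return out
    out.add(EPS)
    return out


def build_table(productions):
    table = {}
    for A, rhs in productions:
        fa = first_of_string(rhs)
        lookaheads = fa - {EPS}
        if EPS in fa:
            # FOLLOW(A) already contains "$" where that applies,
            # so the separate "$" rule needs no extra case here.
            lookaheads |= FOLLOW[A]
        for a in lookaheads:
            table.setdefault((A, a), []).append(rhs)
    return table


TABLE = build_table(PRODUCTIONS)
# Any (A, a) key missing from TABLE is an ERROR entry.
```

For this grammar every cell of `TABLE` has exactly one production, so the parser never has to choose between alternatives.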

- The predictive parsing table construction can be applied to ANY grammar.
- But sometimes, M[ ] might have multiply defined entries.
- Example: for if-else statements and left factoring:
- stmt -> if ( expr ) stmt optelse
- optelse -> else stmt | ε
- When we have “optelse” on the stack and “else” in the input, we have a choice of how to expand optelse (“else” is in FOLLOW(optelse) so either rule is possible)
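The conflict can be made concrete by filling in just the optelse row of the table. This is a small sketch under stated assumptions: stmt is taken to be the start symbol, so FOLLOW(optelse) is assumed to be {else, $}, and the names `ALTERNATIVES`, `optelse_row`, `ROW`, and `CONFLICTS` are illustrative.

```python
# The optelse row of the predictive parsing table.
# Assumption: stmt is the start symbol, so FOLLOW(optelse) = {else, $}.

EPS = "ε"

# FIRST of each alternative of optelse; () stands for the ε-production.
ALTERNATIVES = {
    ("else", "stmt"): {"else"},
    (): {EPS},
}
FOLLOW_OPTELSE = {"else", "$"}


def optelse_row():
    row = {}
    for rhs, first in ALTERNATIVES.items():
        lookaheads = first - {EPS}
        if EPS in first:
            lookaheads |= FOLLOW_OPTELSE   # ε-alternative goes under FOLLOW
        for a in lookaheads:
            row.setdefault(a, []).append(rhs)
    return row


ROW = optelse_row()
CONFLICTS = {a: rhss for a, rhss in ROW.items() if len(rhss) > 1}
```

Here `CONFLICTS` contains the single key "else", listing both optelse -> else stmt and optelse -> ε, while the "$" column holds only the ε-production.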

- If the predictive parsing construction for G leads to a parse table M[ ] WITHOUT multiply defined entries, we say “G is LL(1)”
- First “L”: left-to-right scan of the input
- Second “L”: leftmost derivation
- “1”: 1 symbol of lookahead

- Necessary and sufficient conditions for G to be LL(1):
- Whenever A -> α | β are two distinct productions of G:
- There does not exist a terminal a such that a ∈ FIRST(α) and a ∈ FIRST(β)
- At most one of α and β derives ε
- If β =*> ε, then FIRST(α) does not intersect FOLLOW(A) (and likewise with α and β swapped).

This is the same as saying the predictive parser always knows what to do!

- RECURSIVE DESCENT parsers are easy to build, but inefficient, and might require backtracking.
- TRANSITION DIAGRAMS help us build recursive descent parsers.
- For LL(1) grammars, it is possible to build PREDICTIVE PARSERS with no recursion automatically.
- Compute FIRST() and FOLLOW() for all nonterminals
- Fill in the predictive parsing table
- Use the table-driven predictive parsing algorithm

Bottom-Up Parsing

- Now, instead of starting with the start symbol and working our way down, we will start at the bottom of the parse tree and work our way up.
- The style of parsing is called SHIFT-REDUCE
- SHIFT refers to pushing input symbols onto a stack.
- REDUCE refers to “reduction steps” during a parse:
- We take a substring matching the RHS of a rule
- Then replace it with the symbol on the LHS of the rule

- If you can reduce until you have just the start symbol, you have succeeded in parsing the input string.

- Grammar:
- S -> aABe
- A -> Abc | b
- B -> d
- Input: abbcbcde
- Reduction steps: abbcbcde
- aAbcbcde
- aAbcde
- aAde
- aABe
- S <-- SUCCESS!
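The reduction sequence can be replayed mechanically. In the sketch below, each step names the position of the substring to reduce and the production to apply; the list `HANDLES` is chosen by hand to match the steps above, and the names `reduce_at` and `replay` are illustrative. Sentential forms are plain strings because every grammar symbol here is a single character.

```python
# Replaying the reductions for input abbcbcde.
# GRAMMAR follows the slide; HANDLES is a hand-picked assumption.

GRAMMAR = {"S": ["aABe"], "A": ["Abc", "b"], "B": ["d"]}


def reduce_at(form, pos, lhs, rhs):
    """Replace the occurrence of rhs starting at pos with lhs."""
    if rhs not in GRAMMAR[lhs]:
        raise ValueError(f"{lhs} -> {rhs} is not a production")
    if form[pos:pos + len(rhs)] != rhs:
        raise ValueError(f"{rhs!r} does not occur at position {pos}")
    return form[:pos] + lhs + form[pos + len(rhs):]


# (position, LHS, RHS) for each reduction step, in order.
HANDLES = [
    (1, "A", "b"),
    (1, "A", "Abc"),
    (1, "A", "Abc"),
    (2, "B", "d"),
    (0, "S", "aABe"),
]


def replay(form, handles):
    forms = [form]
    for pos, lhs, rhs in handles:
        form = reduce_at(form, pos, lhs, rhs)
        forms.append(form)
    return forms


FORMS = replay("abbcbcde", HANDLES)
```

Not every matching substring is a good choice: reducing the second b of aAbcbcde with A -> b gives aAAcbcde, from which no sequence of reductions reaches S.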

In reverse, the reduction traces out a rightmost derivation.

- The HANDLE is the part of a sentential form that gets reduced in a backwards rightmost derivation.
- Sometimes part of a sentential form will match a RHS in G, but if that string is NOT reduced in the backwards rightmost derivation, it is NOT a handle.
- Shift-reduce parsing, then, is really all about finding the handle at each step then reducing the handle.
- If we can always find the handle, we never have to backtrack.
- Repeatedly finding and reducing the handle is called HANDLE PRUNING.

- A stack helps us find the handle for each reduction step.
- The stack holds grammar symbols.
- An input buffer holds the input string.
- $ marks the bottom of the stack and the end of input.
- Algorithm:
- Shift 0 or more input symbols onto the stack, until a handle β is on top of the stack.
- Reduce β to the LHS of the appropriate production.
- Repeat until we see $S on stack and $ in input.

- Grammar:
- E -> E + E
- E -> E * E
- E -> ( E )
- E -> id
- Input: w = id + id * id
- STACK INPUT ACTION
- 1. $ id+id*id$ shift

- SHIFT: The next input symbol is pushed onto the stack.
- REDUCE: When the parser knows the right end of a handle is on the stack, the handle is replaced with the corresponding LHS.
- ACCEPT: Announce success (input is $, stack is $S)
- ERROR: The input contained a syntax error; call an error recovery routine.

- Like predictive parsers, sometimes a shift-reduce parser won’t know what to do.
- A SHIFT/REDUCE conflict occurs when the parser can’t decide whether to shift the input symbol or reduce the current top of stack.
- A REDUCE/REDUCE conflict occurs when the parser doesn’t know which of two or more rules to use for reduction.
- A grammar whose shift-reduce parser contains such conflicts is said to be “Not LR”

- Ambiguous grammars are NEVER LR.
- stmt -> if ( expr ) stmt
- | if ( expr ) stmt else stmt
- | other

- If we have a shift-reduce parser in configuration
- STACK: … if ( expr ) stmt
- INPUT: else … $
- what should it do?
- We could reduce “if ( expr ) stmt” to “stmt” (assuming the else is part of a different surrounding if-else statement)
- We could also shift the “else” (assuming this else goes with the current if)

- Some languages use () for function calls AND array refs.
- stmt -> id ( parameter_list )
- stmt -> expr := expr
- parameter_list -> parameter_list , parameter
- parameter_list -> parameter
- parameter -> id
- expr -> id ( expr_list )
- expr -> id
- expr_list -> expr_list , expr
- expr_list -> expr

- For input A(I,J) we would get token stream id(id,id)
- The first three tokens would certainly be shifted:
- STACK: … id ( id
- INPUT: , id ) …
- The id on top of the stack needs to be reduced, but we have two choices: parameter -> id OR expr -> id
- The stack gives no clues. To know which rule to use, we need to look up the first ID in the symbol table to see if it is a procedure name or an array name.
- One solution is to have the lexer return “procid” for procedure names. Then the shift-reduce parser can look into the stack to decide which reduction to use.

LR (Bottom-Up) Parsers

- A major type of shift-reduce parsing is called LR(k).
- “L” means left-to-right scanning of the input
- “R” means rightmost derivation
- “k” means lookahead of k input symbols (tokens); if omitted, assume k=1
- LR parsers have very nice properties:
- They can recognize almost all programming language constructs for which we can write a CFG
- They are the most powerful type of shift-reduce parser, but they never backtrack, and are very efficient
- They can parse a proper superset of the languages parsable by predictive parsers
- They tell you as soon as possible when there’s a syntax error.

- DISADVANTAGE: hard to build by hand (we need something like yacc)

- The parser’s structure is similar to predictive parsing.
- The STACK now stores pairs (Xi, si).
- Xi is a grammar symbol.
- si is a STATE.

- The parse table now has two parts: ACTION and GOTO.
- The action table specifies whether to SHIFT, REDUCE, ACCEPT, or flag an ERROR given the state on the stack and the current input.
- The goto table specifies what state to go to after a reduction is performed.

- A CONFIGURATION of the LR parser is a pair (STACK, INPUT): ( s0 X1 s1… Xm sm, ai ai+1… an $ )
- The stack configuration is just a list of the states and grammar symbols currently on the stack.
- The input configuration is the list of unprocessed input symbols.
- Together, the configuration represents a right-sentential form X1… Xm ai ai+1… an (some intermediate step in a right derivation of the input from the start symbol)

- At each step, the parser is in some configuration.
- The next move depends on reading ai from the input and sm from the top of the stack.
- If action[sm,ai] = shift s, we execute a SHIFT move, entering the configuration ( s0 X1 s1… Xm sm ai s, ai+1… an $ ).
- If action[sm,ai] = reduce A -> β, then we enter the configuration ( s0 X1 s1… Xm-r sm-r A s, ai ai+1… an $ ), where r = | β | and s = goto[sm-r,A]. The input pointer does not advance on a reduce.
- If action[sm,ai] = accept, we’re done.
- If action[sm,ai] = error, we call an error recovery routine.

- Grammar:
- 1. E -> E + T
- 2. E -> T
- 3. T -> T * F
- 4. T -> F
- 5. F -> ( E )
- 6. F -> id

- CONFIGURATIONS
- STACK INPUT ACTION
- 0 id * id + id $ shift 5
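The whole trace can be reproduced with a short driver. The ACTION and GOTO tables below are the SLR(1) tables commonly given for this grammar in textbooks; they are an assumption here, although they agree with the “shift 5” move shown above. The function name `lr_parse` is illustrative, and the stack stores only states, since each state already implies its grammar symbol.

```python
# LR driver loop for the numbered expression grammar.
# ACTION/GOTO are assumed textbook SLR(1) tables; names are hypothetical.

# production number -> (LHS, length of RHS)
PRODUCTIONS = {1: ("E", 3), 2: ("E", 1), 3: ("T", 3),
               4: ("T", 1), 5: ("F", 3), 6: ("F", 1)}


def _reduce_row(n):
    # States that reduce by production n on every symbol in +, *, ), $.
    return {t: f"r{n}" for t in ("+", "*", ")", "$")}


ACTION = {
    0: {"id": "s5", "(": "s4"},
    1: {"+": "s6", "$": "acc"},
    2: {"+": "r2", "*": "s7", ")": "r2", "$": "r2"},
    3: _reduce_row(4),
    4: {"id": "s5", "(": "s4"},
    5: _reduce_row(6),
    6: {"id": "s5", "(": "s4"},
    7: {"id": "s5", "(": "s4"},
    8: {"+": "s6", ")": "s11"},
    9: {"+": "r1", "*": "s7", ")": "r1", "$": "r1"},
    10: _reduce_row(3),
    11: _reduce_row(5),
}
GOTO = {
    0: {"E": 1, "T": 2, "F": 3},
    4: {"E": 8, "T": 2, "F": 3},
    6: {"T": 9, "F": 3},
    7: {"F": 10},
}


def lr_parse(tokens):
    stack = [0]                       # state stack; top is the last element
    buf = list(tokens) + ["$"]
    i = 0
    reductions = []                   # production numbers, in order applied
    while True:
        act = ACTION[stack[-1]].get(buf[i])
        if act is None:               # blank entry: syntax error
            raise SyntaxError(f"unexpected {buf[i]!r} in state {stack[-1]}")
        if act == "acc":
            return reductions
        if act[0] == "s":             # shift: push state, advance input
            stack.append(int(act[1:]))
            i += 1
        else:                         # reduce: pop |β| states, then goto
            n = int(act[1:])
            lhs, length = PRODUCTIONS[n]
            del stack[-length:]
            stack.append(GOTO[stack[-1]][lhs])
            reductions.append(n)      # input pointer is unchanged
```

For `id * id + id` the driver applies productions 6, 4, 6, 3, 2, 6, 4, 1 in that order; read backwards, that sequence is a rightmost derivation of the input.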

- If it is possible to construct an LR parse table for G, we say “G is an LR grammar”.
- LR parsers DO NOT need to scan the entire stack to decide what to do (other shift-reduce parsers might).
- Instead, the STATE symbol summarizes all the information needed to make the decision of what to do next.
- The GOTO function corresponds to a DFA that knows how to find the HANDLE by reading the top of the stack downwards.
- In the example, we only looked at 1 input symbol at a time. This means the grammar is LR(1).

- We will look at 3 methods:
- Simple LR (SLR): simple but not very powerful
- Canonical LR: very powerful but too many states
- LALR: almost as powerful with many fewer states

- yacc uses the LALR algorithm.