- 95 Views
- Uploaded on
- Presentation posted in: General

Winter 2012-2013 Compiler Principles Syntax Analysis (Parsing) – Part 2

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.

- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -

Winter 2012-2013Compiler PrinciplesSyntax Analysis (Parsing) – Part 2

Mayer Goldberg and Roman Manevich

Ben-Gurion University

- Review top-down parsing
- Recursive descent

- LL(1) parsing
- Start bottom-up parsing
- LR(k)
- SLR(k)
- LALR(k)

- Parser repeatedly faces the following problem
- Given program P starting with terminal t
- Non-terminal N with list of possible production rules: N α1 … N αk

- Predict which rule can should be used to derive P

- Define a function for every nonterminal
- Every function work as follows
- Find applicable production rule
- Terminal function checks match with next input token
- Nonterminal function calls (recursively) other functions

- If there are several applicable productions for a nonterminal, use lookahead

E LIT | (E OP E) | not E

LIT true|false

OP and | or | xor

not ( not true or false )

production to apply known from next token

E

E

E =>

notE =>

not ( E OP E ) =>

not ( not E OP E ) =>

not ( not LIT OP E ) =>

not ( not true OP E ) =>

not ( not true or E ) =>

not ( not true or LIT ) =>

not ( not true or false )

not

(

E

OP

E

)

not

LIT

or

LIT

true

false

- Manually constructed
- Recursive descent (previous lecture, review now)

- Generated (this lecture)
- Based on pushdown automata
- Does not use recursion

- Define a function for every nonterminal
- Every function work as follows
- Find applicable production rule
- Terminal function checks match with next input token
- Nonterminal function calls (recursively) other functions

- If there are several applicable productions for a nonterminal, use lookahead

E LIT | (E OP E) | not E

LIT true|false

OP and | or | xor

match(token t) {

if (current == t)

current = next_token()

else

error

}

Variable current holds the current input token

E LIT | (E OP E) | not E

LIT true|false

OP and | or | xor

E() {

if (current {TRUE, FALSE}) // E LIT

LIT();

else if (current == LPAREN)// E ( E OP E )

match(LPAREN); E(); OP(); E(); match(RPAREN);

else if (current == NOT)// E not E

match(NOT); E();

else

error;

}

LIT() {

if (current == TRUE) match(TRUE);

else if (current == FALSE) match(FALSE);

else error;

}

E() {

if (current {TRUE, FALSE})LIT();

else if (current == LPAREN)match(LPARENT);E(); OP(); E();match(RPAREN);

else if (current == NOT)match(NOT); E();

else error;

}

E → LIT

| ( E OP E )

| not E

LIT → true

| false

OP → and

| or

| xor

LIT() {

if (current == TRUE)match(TRUE);

else if (current == FALSE)match(FALSE);

elseerror;

}

OP() {

if (current == AND)match(AND);

else if (current == OR)match(OR);

else if (current == XOR)match(XOR);

elseerror;

}

p. 189

- For simplicity, let’s assume no null production rules
- See book for general case

- Find out the token that can appear first in a rule – FIRST sets

- For every production rule Aα
- FIRST(α) = all terminals that α can start with
- Every token that can appear as first in α under some derivation for α

- In our Boolean expressions example
- FIRST( LIT ) = { true, false }
- FIRST( ( E OP E ) ) = { ‘(‘ }
- FIRST( not E ) = { not }

- No intersection between FIRST sets => can always pick a single rule
- If the FIRST sets intersect, may need longer lookahead
- LL(k) = class of grammars in which production rule can be determined using a lookahead of k tokens
- LL(1) is an important and useful class

Assume no null productions A

Initially, for all nonterminalsA, setFIRST(A) = { t | Atω for some ω }

Repeat the following until no changes occur:for each nonterminal A for each production A Bω set FIRST(A) = FIRST(A) ∪ FIRST(B)

This is known as fixed-point computation

STMT if EXPR then STMT

| while EXPR do STMT

| EXPR ;

EXPR TERM -> id

| zero? TERM

| not EXPR

| ++ id

| -- id

TERM id

| constant

STMT if EXPR then STMT

| while EXPR do STMT

| EXPR ;

EXPR TERM -> id

| zero? TERM

| not EXPR

| ++ id

| -- id

TERM id

| constant

STMT if EXPR then STMT

| while EXPR do STMT

| EXPR ;

EXPR TERM -> id

| zero? TERM

| not EXPR

| ++ id

| -- id

TERM id

| constant

STMT if EXPR then STMT

| while EXPR do STMT

| EXPR ;

EXPRTERM -> id

| zero? TERM

| not EXPR

| ++ id

| -- id

TERM id

| constant

STMT if EXPR then STMT

| while EXPR do STMT

| EXPR ;

EXPR TERM -> id

| zero? TERM

| not EXPR

| ++ id

| -- id

TERM id

| constant

- A grammar is in the class LL(K) when it can be derived via:
- Top-down derivation
- Scanning the input from left to right (L)
- Producing the leftmost derivation (L)
- With lookahead of k tokens (k)
- For every two productions Aα and Aβ we have FIRST(α) ∩ FIRST(β) = {}(and FIRST(A) ∩ FOLLOW(A) = {} for null productions)

- A language is said to be LL(k) when it has an LL(k) grammar
- What can we do if grammar is not LL(k)?

- Pushdown automaton uses
- Prediction stack
- Input stream
- Transition table
- nonterminals x tokens -> production alternative
- Entry indexed by nonterminal N and token t contains the alternative of N that must be predicated when current input starts with t

- Two possible moves
- Prediction
- When top of stack is nonterminal N, pop N, lookup table[N,t]. If table[N,t] is not empty, push table[N,t] on prediction stack, otherwise – syntax error

- Match
- When top of prediction stack is a terminal T, must be equal to next input token t. If (t == T), pop T and consume t
- If (t ≠ T) syntax error

- Prediction
- Parsing terminates when prediction stack is empty
- If input is empty at that point, success. Otherwise, syntax error

Predictive Parsing program

Stack

Output

Parsing Table

(1) E → LIT

(2) E → ( E OP E )

(3) E → not E

(4) LIT → true

(5) LIT → false

(6) OP → and

(7) OP → or

(8) OP → xor

Which rule should be used

Input tokens

Nonterminals

aacbb$

A aAb | c

abcbb$

A aAb | c

- Compute parsing table
- If table is conflict-free then we have an LL(k) parser

- If table contains conflicts investigate
- If grammar is ambiguous try to disambiguate
- Try using left-factoring/substitution/left-recursion elimination to remove conflicts

Sometimes it will be useful to transform a grammar G with start non-terminal S into a grammar G’ with a new start non-terminal S‘ with a new production rule S’ S $where $ is not part of the set of tokens

To parse an input P with G’ we change it into P$

Simplifies top-down parsing with null productions and LR parsing

- No problem with left recursion
- Widely used in practice
- Shift-reduce parsing: LR(k), SLR, LALR
- All follow the same pushdown-based algorithm
- Read input left-to-right producing rightmost derivation
- Differ on type of “LR Items”

- Parser generator CUP implements LALR

- The opposite of derivation is called reduction
- Let Aα be a production rule
- Let βAµ be a sentence
- Replace left-hand side of rule in sentence:βAµ=> βαµ

- A handle is a substring that is reduced during a series of steps in a rightmost derivation

1 + (2) + (3)

E E + (E)

E i

E + (2) + (3)

Each non-leaf node represents a handle

E + (E) + (3)

Rightmost derivation in reverse

E + (3)

E

E + (E)

E

E

E

E

E

1

+

(

2

)

+

3

(

)

To be matched

Already matched

Input

N αβ

Hypothesis about αβ being a possible handle, so far we’ve matched α, expecting to see β

N αβ

Shift Item

N αβ

Reduce Item

Z exprEOF

expr term | expr+ term

term ID | (expr)

Z E $

E T | E + T

T i | ( E )

(just shorthand of the grammar on the top)

Z E $

E T | E + T

T i | ( E )

i

+

i

$

Z E $

Why do we need these additional LR items?

Where do they come from?

What do they mean?

E T

E E + T

T i

T (E )

{ Z E $,

Z E $

E T | E + T

T i | ( E )

E T,

-closure({Z E $}) =

E E + T,

T i ,

T ( E ) }

Given a set S of LR(0) items

If P αNβ is in S

then for each rule N in the grammarS must also contain N

Z E $

E T | E + T

T i | ( E )

i

+

i

$

Remember position from which we’re trying to reduce

Items denote possible future handles

Z E $

E T

E E + T

T i

T ( E )

Z E $

E T | E + T

T i | ( E )

i

+

i

$

Match items with current token

Z E $

T i

Reduce item!

E T

E E + T

T i

T ( E )

Z E $

E T | E + T

T i | ( E )

T

+

i

$

i

Z E $

Reduce item!

E T

E T

E E + T

T i

T ( E )

Z E $

E T | E + T

T i | ( E )

E

+

i

$

T

i

Z E $

Reduce item!

E T

E T

E E + T

T i

T ( E )

Z E $

E T | E + T

T i | ( E )

E

+

i

$

T

i

Z E $

Z E$

E T

E E + T

E E+ T

T i

T ( E )

Z E $

E T | E + T

T i | ( E )

E

+

i

$

T

i

Z E $

Z E$

E E+T

E T

T i

E E + T

E E+ T

T ( E )

T i

T ( E )

Z E $

E T | E + T

T i | ( E )

E

+

T

$

i

T

i

Z E $

Z E$

E E+T

E T

T i

E E + T

E E+ T

T ( E )

T i

T ( E )

Z E $

E T | E + T

T i | ( E )

E

+

T

$

T

i

i

Reduce item!

Z E $

Z E$

E E+T

E E+T

E T

T i

E E + T

E E+ T

T ( E )

T i

T ( E )

Z E $

E T | E + T

T i | ( E )

E

$

E

+

T

T

i

i

Z E $

Z E$

E T

E E + T

E E+ T

T i

T ( E )

Z E $

E T | E + T

T i | ( E )

E

$

E

+

T

T

i

Reduce item!

i

Z E $

Z E$

Z E$

E T

E E + T

E E+ T

T i

T ( E )

Z E $

E T | E + T

T i | ( E )

Z

E

$

E

+

T

Reduce item!

i

T

Z E $

i

Z E$

Z E$

E T

E E + T

E E+ T

T i

T ( E )

- Initial set
- Z is in the start symbol
- -closure({ Zα | Zαis in the grammar } )

- Next set from a set S and the next symbol X
- step(S,X) = { NαXβ | NαXβ in the item set S}
- nextSet(S,X) = -closure(step(S,X))

reduce state

shift state

q6

E T

T

T

q7

q0

T (E)

E T

E E + T

T i

T (E)

Z E$

E T

E E + T

T i

T (E)

(

q5

i

i

T i

E

E

(

(

i

q1

q8

q3

Z E$

E E+ T

T (E)

E E+T

E E+T

T i

T (E)

+

+

$

)

q9

q2

Z E$

T (E)

T

q4

E E + T

empty – error move

ACTION

Table

GOTO Table

- Two types of rows:
- Shift row – tells which state to GOTO for current token
- Reduce row – tells which rule to reduce (independent of current token)
- GOTO entries are blank

+

i

$

input

stack

q0

i

q5

- Input – remainder of text to be processed
- Stack – sequence of pairs N, qi
- N – symbol (terminal or non-terminal)
- qi – state at which decisions are made

- Initial stack contains q0

shift

i

+

i

$

+

i

$

input

input

stack

stack

q0

q0

i

q5

- Two moves: shift and reduce
- Shift move
- Remove first token from input
- Push it on the stack
- Compute next state based on GOTO table
- Push new state on the stack
- If new state is error – report error

ReduceT i

+

i

$

+

i

$

input

input

stack

stack

q0

i

q5

q0

T

q6

- Reduce move
- Using a rule N α
- Symbols in α and their following states are removed from stack
- New state computed based on GOTO table (using top of stack, before pushing N)
- N is pushed on the stack
- New state pushed on top of N

Z E $

E T

E E + T

T i

T ( E )

Warning: numbers mean different things!

rn = reduce using rule number n

sm = shift to state m

top is on the right

Z E $

E T

E E + T

T i

T ( E )

Can make a transition diagram for any grammar

Can make a GOTO table for every grammar

Cannot make a deterministic ACTION table for every grammar

…

T

q0

Z E$

E T

E E + T

T i

T (E)

T i[E]

(

…

q5

i

T i

T i[E]

E

Shift/reduce conflict

…

Z E $

E T

E E + T

T i

T ( E )

T i[E]

…

T

q0

Z E$

E T

E E + T

T i

T (E)

T i[E]

(

…

q5

i

T i

V i

E

reduce/reduce conflict

…

Z E $

E T

E E + T

T i

V iT ( E )

- Any grammar with an -rule cannot be LR(0)
- Inherent shift/reduce conflict
- A – reduce item
- PαAβ – shift item
- A can always be predicted from P αAβ

ACTION

Table

GOTO Table

ACTION table determined only by transition diagram, ignores input

A handle should not be reduced to a non-terminal N if the look-ahead is a token that cannot follow N

A reduce item N α is applicable only when the look-ahead is in FOLLOW(N)

Differs from LR(0) only on the ACTION table

Lookahead token from the input

Remember:

In contrast, GOTO table is indexed by state and a grammar symbol from the stack

Z E $

E T

E E + T

T i

T ( E )

vs.

SLR – use 1 token look-ahead

LR(0) – no look-ahead

… as before…

T i

T i[E]

(0) S’ → S

(1) S → L = R

(2) S → R

(3) L → * R

(4) L → id

(5) R → L

q3

R

S → R

q0

S’ → S

S → L = R

S → R

L → * R

L → id

R → L

S

q1

S’ → S

q9

S → L = R

q2

L

S → L = R

R → L

=

R

q6

S → L = R

R → L

L → * R

L → id

q5

id

*

L → id

id

*

id

q4

L → * R

R → L

L → * R

L → id

*

L

q8

R → L

L

q7

L → * R

R

(0) S’ → S

(1) S → L = R

(2) S → R

(3) L → * R

(4) L → id

(5) R → L

q2

S → L = R

R → L

=

q6

S → L = R

R → L

L → * R

L → id

- S → L = R vs. R → L
- FOLLOW(R) contains =
- S ⇒ L = R ⇒ * R = R

- SLR cannot resolve the conflict either

In SLR: a reduce item N α is applicable only when the look-ahead is in FOLLOW(N)

But FOLLOW(N) merges look-ahead for all alternatives for N

LR(1) keeps look-ahead with each LR item

Idea: a more refined notion of follows computed per item

[L → ● id, *]

[L → ● id, =]

[L → ● id, id]

[L → ● id, $]

[L → id ●, *]

[L → id ●, =]

[L → id ●, id]

[L → id ●, $]

(0) S’ → S

(1) S → L = R

(2) S → R

(3) L → * R

(4) L → id

(5) R → L

- LR(1) item is a pair
- LR(0) item
- Look-ahead token

- Meaning
- We matched the part left of the dot, looking to match the part on the right of the dot, followed by the look-ahead token.

- Example
- The production L id yields the following LR(1) items

- For every [A → α ● Bβ , c] in S
- for every production B→δ and every token b in the grammar such that b FIRST(βc)
- Add [B → ● δ , b] to S

q3

q0

(S → R ∙ , $)

(S’ → ∙ S , $)

(S → ∙ L = R , $)

(S → ∙ R , $)

(L → ∙ * R , = )

(L → ∙ id , = )

(R → ∙ L , $ )

(L → ∙ id , $ )

(L → ∙ * R , $ )

q11

q9

(S → L = R ∙ , $)

(L → id ∙ , $)

R

q1

S

(S’ → S ∙ , $)

id

R

id

L

q2

q12

(S → L ∙ = R , $)

(R → L ∙ , $)

(R → L ∙ , $)

=

id

*

*

q10

(S → L = ∙ R , $)

(R → ∙ L , $)

(L → ∙ * R , $)

(L → ∙ id , $)

L

q6

q4

q5

id

(L → id ∙ , $)

(L → id ∙ , =)

(L → * ∙ R , =)

(R → ∙ L , =)

(L → ∙ * R , =)

(L → ∙ id , =)

(L → * ∙ R , $)

(R → ∙ L , $)

(L → ∙ * R , $)

(L → ∙ id , $)

(L → * ∙ R , $)

(R → ∙ L , $)

(L → ∙ * R , $)

(L → ∙ id , $)

*

L

q8

R

q7

(R → L ∙ , =)

(R → L ∙ , $)

R

q13

(L → * R ∙ , =)

(L → * R ∙ , $)

(L → * R ∙ , $)

q2

(S → L ∙ = R , $)

(R → L ∙ , $)

=

(S → L = ∙ R , $)

(R → ∙ L , $)

(L → ∙ * R , $)

(L → ∙ id , $)

q6

Is there a conflict now?

LR tables have large number of entries

Often don’t need such refined observation (and cost)

LALR idea: find states with the same LR(0) component and merge their look-ahead component as long as there are no conflicts

LALR not as powerful as LR(1)