Leonidas fegaras
Advertisement
This presentation is the property of its rightful owner.
1 / 24

Parsing #1 PowerPoint PPT Presentation

Leonidas Fegaras. Parsing #1. Parser. A parser recognizes sequences of tokens according to some grammar and generates Abstract Syntax Trees (ASTs) A context-free grammar (CFG) has a finite set of terminals (tokens) a finite set of nonterminals from which one is the start symbol

Download Presentation

Parsing #1

An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -

Presentation Transcript


Leonidas fegaras

Leonidas Fegaras

Parsing #1


Parser

Parser

  • A parser recognizes sequences of tokens according to some grammar and generates Abstract Syntax Trees (ASTs)

  • A context-free grammar (CFG) has

    • a finite set of terminals (tokens)

    • a finite set of nonterminals from which one is the start symbol

    • and a finite set of productions of the form:

      A ::= X1 X2 ... Xn

      where A is a nonterminal and each Xi is either a terminal or nonterminal symbol

get token

get next character

AST

scanner

parser

source file

token


Example

Example

  • Expressions:

    E ::= E + T

    | E - T

    | T

    T ::= T * F

    | T / F

    | F

    F ::= num

    | id

  • Nonterminals: E T F

  • Start symbol: E

  • Terminals: + - * / id num

  • Example: x+2*y

  • ... or equivalently:

    E ::= E + T

    E ::= E - T

    E ::= T

    T ::= T * F

    T ::= T / F

    T ::= F

    F ::= num

    F ::= id


Derivations

Derivations

  • Notation:

    • terminals: t, s, ...

    • nonterminals: A, B, ...

    • symbol (terminal or nonterminal): X, Y, ...

    • sequence of symbols: a, b, ...

  • Given a production:

    A ::= X1 X2 ... Xn

    the form aAb => aX1 X2 ... Xnb is called a derivation

  • eg, using the production T ::= T * F we get

    T / F + 1 - x => T * F / F + 1 - x

  • Leftmost derivation: when you always expand the leftmost nonterminal in the sequence

  • Rightmost derivation: ... rightmost nonterminal


Top down parsing

Top-down Parsing

  • It starts from the start symbol of the grammar and applies derivations until the entire input string is derived

  • Example that matches the input sequence id(x) + num(2) * id(y)

    E => E + Tuse E ::= E + T

    => E + T * Fuse T ::= T * F

    => T + T * Fuse E ::= T

    => T + F * Fuse T ::= F

    => T + num * Fuse F ::= num

    => F + num * Fuse T ::= F

    => id + num * Fuse F ::= id

    => id + num * iduse F ::= id

  • You may have more than one choice at each derivation step:

    • my have multiple nonterminals in each sequence

    • for each nonterminal in the sequence, may have many rules to choose from

  • Wrong predictions will cause backtracking

    • need predictive parsing that never backtracks


  • Bottom up parsing

    Bottom-up Parsing

    • It starts from the input string and uses derivations in the opposite directions (from right to left) until you derive the start symbol

    • Previous example:

      id(x) + num(2) * id(y)

      <= id(x) + num(2) * Fuse F ::= id

      <= id(x) + F * Fuse F ::= num

      <= id(x) + T * Fuse T ::= F

      <= id(x) + Tuse T ::= T * F

      <= F + Tuse F ::= id

      <= T + Tuse T ::= F

      <= E + Tuse E ::= T

      <= Euse E ::= E + T

    • At each derivation step, need to recognize a handle (the sequence of symbols that matches the right-hand-side of a production)


    Parse tree

    Parse Tree

    • Given the derivations used in the top-down/bottom-up parsing of an input sequence, a parse tree has

      • the start symbol as the root

      • the terminals of the input sequence as leafs

      • for each production A ::= X1 X2 ... Xn used in a derivation, a node A with children X1 X2 ... Xn

        E

        E T

        T T F

        F F

        id(x) + num(2) * id(y)

    E => E + T

    => E + T * F

    => T + T * F

    => T + F * F

    => T + num * F

    => F + num * F

    => id + num * F

    => id + num * id


    Playing with associativity

    Playing with Associativity

    • What about this grammar?

      E ::= T + E

      | T - E

      | T

      T ::= F * T

      | F / T

      | F

      F ::= num

      | id

    • Right associative

    • Now x+y+z is equivalent to x+(y+z)

    E

    T E

    F T E

    F T

    F

    id(x) + id(y) + id(z)


    Ambiguous grammars

    Ambiguous Grammars

    • What about this grammar?

      E ::= E + E

      | E - E

      | E * E

      | E / E

      | num

      | id

    • Operators + - * / have the same precedence!

    • It is ambiguous: has more than one parse tree for the same input sequence (depending which derivations are applied each time)

    E

    E E

    E E

    id(x) * id(y) + id(z)

    E

    E E

    E E

    id(x) * id(y) + id(z)


    Predictive parsing

    Predictive Parsing

    • The goal is to construct a top-down parser that never backtracks

    • Always leftmost derivations

      • left recursion is bad!

    • We must transform a grammar in two ways:

      • eliminate left recursion

      • perform left factoring

    • These rules eliminate most common causes for backtracking although they do not guarantee a completely backtrack-free parsing


    Left recursion elimination

    Left Recursion Elimination

    • For example, the grammar

      A ::= A a

      | b

      recognizes the regular expression ba*.

    • But a top-down parser may have hard time to decide which rule to use

    • Need to get rid of left recursion:

      A ::= b A'

      A' ::= a A'

      |

      ie, A' parses the RE a*.

    • The second rule is recursive, but not left recursive


    Left recursion elimination cont

    Left Recursion Elimination (cont.)

    • For each nonterminal X, we partition the productions for X into two groups:

      • one that contains the left recursive productions

      • the other with the rest

        That is:

        X ::= X a1

        ...

        X ::= X an

        where a and b are symbol sequences.

    • Then we eliminate the left recursion by rewriting these rules into:

      X ::= b1 X'

      ...

      X ::= bm X'

    X ::= b1

    ...

    X ::= bm

    X' ::= a1 X'

    ...

    X' ::= an X'

    X' ::=


    Example1

    Example

    E ::= T E'

    E' ::= + T E'

    | - T E'

    |

    T ::= F T'

    T' ::= * F T'

    | / F T'

    |

    F ::= num

    | id

    E ::= E + T

    | E - T

    | T

    T ::= T * F

    | T / F

    | F

    F ::= num

    | id


    Example2

    Example

    • A grammar that recognizes regular expressions:

      R ::= R R

      | R bar R

      | R *

      | ( R )

      | char

    • After left recursion elimination:

      R ::= ( R ) R'

      | char R'

      R' ::= R R'

      | bar R R'

      | * R'

      |


    Left factoring

    Left Factoring

    • Factors out common prefixes:

      X ::= a b1

      ...

      X ::= a bn

    • becomes:

      X ::= a X'

      X' ::= b1

      ...

      X' ::= bn

    • Example:

      E ::= T + E

      | T - E

      | T

    E ::= T E'

    E' ::= + E

    | - E

    |


    Recursive descent parsing

    Recursive Descent Parsing

    static void E () { T(); Eprime(); }

    static void Eprime () {

    if (current_token == PLUS)

    { read_next_token(); T(); Eprime(); }

    else if (current_token == MINUS)

    { read_next_token(); T(); Eprime(); };

    }

    static void T () { F(); Tprime(); }

    static void Tprime() {

    if (current_token == TIMES)

    { read_next_token(); F(); Tprime(); }

    else if (current_token == DIV)

    { read_next_token(); F(); Tprime(); };

    }

    static void F () {

    if (current_token == NUM || current_token == ID)

    read_next_token();

    else error();

    }

    E ::= T E'

    E' ::= + T E'

    | - T E'

    |

    T ::= F T'

    T' ::= * F T'

    | / F T'

    |

    F ::= num

    | id


    Predictive parsing using a table

    Predictive Parsing Using a Table

    • The symbol sequence from a derivation is stored in a stack (first symbol on top)

    • if the top of the stack is a terminal, it should match the current token from the input

    • if the top of the stack is a nonterminal X and the current input token is t, we get a rule for the parse table: M[X,t]

    • the rule is used as a derivation to replace X in the stack with the right-hand symbols

    push(S);

    read_next_token();

    repeat

    X = pop();

    if (X is a terminal or '$')

    if (X == current_token)

    read_next_token();

    else error();

    else if (M[X,current_token]

    == "X ::= Y1 Y2 ... Yk")

    { push(Yk);

    ...

    push(Y1);

    }

    else error();

    until X == '$';


    Parsing table example

    Parsing Table Example

    1) E ::= T E' $

    2) E' ::= + T E'

    3) | - T E'

    4) |

    5) T ::= F T'

    6) T' ::= * F T'

    7) | / F T'

    8) |

    9) F ::= num

    10) | id

    num id + - * / $

    E 1 1

    E' 2 3 4

    T 5 5

    T' 8 8 6 7 8

    F 9 10


    Example parsing x 2 y

    Example: Parsing x-2*y$

    top

    Stack current_token Rule

    ExM[E,id] = 1 (using E ::= T E' $)

    $ E' TxM[T,id] = 5 (using T ::= F T')

    $ E' T' FxM[F,id] = 10 (using F ::= id)

    $ E' T' idxread_next_token

    $ E' T'-M[T',-] = 8 (using T' ::= )

    $ E'-M[E',-] = 3 (using E' ::= - T E')

    $ E' T --read_next_token

    $ E' T 2M[T,num] = 5 (using T ::= F T')

    $ E' T' F2M[F,num] = 9 (using F ::= num)

    $ E' T' num2read_next_token

    $ E' T'*M[T',*] = 6 (using T' ::= * F T')

    $ E' T' F **read_next_token

    $ E' T' FyM[F,id] = 10 (using F ::= id)

    $ E' T' idyread_next_token

    $ E' T'$M[T',$] = 8 (using T' ::= )

    $ E'$M[E',$] = 4 (using E' ::= )

    $$stop (accept)


    Constructing the parsing table

    Constructing the Parsing Table

    • FIRST[a] is the set of terminals t that result after a number of derivations on the symbol sequence a

      • ie, a => ... => tb for some symbol sequence b

    • FIRST[ta]={t} eg, FIRST[3+E]={3}

    • FIRST[X]=FIRST[a1]  …  FIRST[an] for each production X ::= ai

    • FIRST[Xa]=FIRST[X] but if X has an empty derivation then FIRST[Xa]=FIRST[X]  FIRST[a]

  • FOLLOW[X] is the set of all terminals that follow X in any legal derivation

    • find all productions Z ::= a X b in which X appears at the RHS; then

      • FIRST[b] must be included in FOLLOW[X]

      • if b has an empty derivation, FOLLOW[Z] must be included in FOLLOW[X]


  • Example3

    Example

    1) E ::= T E' $

    2) E' ::= + T E'

    3) | - T E'

    4) |

    5) T ::= F T'

    6) T' ::= * F T'

    7) | / F T'

    8) |

    9) F ::= num

    10) | id

    FIRST FOLLOW

    E {num,id} {}

    E' {+,-} {$}

    T {num,id} {+,-,$}

    T' {*,/} {+,-,$}

    F {num,id} {+,-,*,/,$}


    Constructing the parsing table cont

    Constructing the Parsing Table (cont.)

    • For each rule X ::= a do:

      • for each t in FIRST[a], add X ::= a to M[X,t]

      • if a can be reduced to the empty sequence, then for each t in FOLLOW[X], add X ::= a to M[X,t]

    FIRST FOLLOW

    E {num,id} {}

    E' {+,-} {$}

    T {num,id} {+,-,$}

    T' {*,/} {+,-,$}

    F {num,id} {+,-,*,/,$}

    1) E ::= T E' $

    2) E' ::= + T E'

    3) | - T E'

    4) |

    5) T ::= F T'

    6) T' ::= * F T'

    7) | / F T'

    8) |

    9) F ::= num

    10) | id

    num id + - * / $

    E 1 1

    E' 2 3 4

    T 5 5

    T' 8 8 6 7 8

    F 9 10


    Another example

    Another Example

    0) G := S $

    1) S ::= ( L )

    2) S ::= a

    3) L ::= S L'

    4) L' ::= , S L'

    5) L' ::=

    G ::= S $

    S ::= ( L )

    | a

    L ::= L , S

    | S

    ( ) a , $

    G 0 0

    S 1 2

    L 3 3

    L' 5 4


    Chapter 3

    LL(1)

    • A grammar is called LL(1) if each element of the parsing table of the grammar has at most one production element

      • the first L in LL(1) means that we read the input from left to right

      • the second L means that it uses left-most derivations only

      • the number 1 means that we need to look one token ahead from the input


  • Login