
COP4020 Programming Languages

Presentation Transcript


  1. COP4020 Programming Languages: Syntax analysis. Prof. Xin Yuan

  2. Overview • Syntax analysis overview • Grammar and context-free grammar • Grammar derivations • Parse trees COP4020 Spring 2014

  3. Syntax analysis • Syntax analysis is done by the parser. • The parser detects whether the program follows the grammar rules and reports syntax errors. • It produces a parse tree from which intermediate code can be generated. [Figure: source program -> lexical analyzer -> parser -> rest of front end -> intermediate code. The lexical analyzer hands a token to the parser on each request for a token; the parser hands a parse tree to the rest of the front end; a symbol table is also shown.]

  4. The syntax of a programming language is described by a context-free grammar, commonly written in Backus-Naur Form (BNF). • Context-free grammars resemble regular expressions as a way of specifying languages, but are more general. • A grammar gives a precise syntactic specification of a language. • For some classes of grammars, tools exist that can automatically construct an efficient parser. These tools can also detect syntactic ambiguities and other problems automatically. • A compiler based on a grammatical description of a language is more easily maintained and updated.

  5. Grammars • A grammar has four components, G = (N, T, P, S): • T is a finite set of tokens (terminal symbols) • N is a finite set of nonterminals • P is a finite set of productions of the form α -> β, where α ∈ (N ∪ T)* N (N ∪ T)* and β ∈ (N ∪ T)* • S is a special nonterminal that is the designated start symbol

  6. Example • Grammar for expressions (T = ?, N = ?, P = ?, S = ?) • Productions: E -> E + E, E -> E - E, E -> ( E ), E -> - E, E -> num, E -> id • How does this correspond to a language? Informally, you expand the nonterminals using the productions until none remain: the resulting sentence (a sequence of tokens) is recognized by the grammar.
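One way to answer the T, N, P, S questions on slide 6 is to write the grammar down as data. The sketch below is only an illustration: the names `N`, `T`, `P`, `S` follow the slide, while the tuple-based encoding and the helper `is_well_formed` are assumptions made for this example.

```python
# Illustrative encoding of the slide-6 expression grammar as G = (N, T, P, S).
# The data layout and the helper name are assumptions, not part of the slides.
N = {"E"}                                  # nonterminals
T = {"+", "-", "(", ")", "num", "id"}      # terminals (tokens)
P = [                                      # productions as (left, right) pairs
    ("E", ("E", "+", "E")),
    ("E", ("E", "-", "E")),
    ("E", ("(", "E", ")")),
    ("E", ("-", "E")),
    ("E", ("num",)),
    ("E", ("id",)),
]
S = "E"                                    # start symbol


def is_well_formed(nonterminals, terminals, productions, start):
    """Check that the start symbol is a nonterminal, that every left-hand
    side is a nonterminal, and that every right-hand side uses known symbols."""
    symbols = nonterminals | terminals
    return start in nonterminals and all(
        lhs in nonterminals and all(sym in symbols for sym in rhs)
        for lhs, rhs in productions
    )


print(is_well_formed(N, T, P, S))
```

Keeping the right-hand sides as tuples of symbols (rather than one string) avoids ambiguity between multi-character tokens such as `num` and single-character ones.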

  7. Language recognized by a grammar • We say "αAβ derives αwβ in one step", denoted αAβ => αwβ, if A -> w is a production and α and β are arbitrary strings of terminal or nonterminal symbols. • We say α1 derives αm if α1 => α2 => … => αm, written α1 =>* αm. • The language L(G) defined by G is the set of strings of terminals w such that S =>* w.
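The one-step derivation relation on slide 7 can be sketched as a small function that, given a sentential form, returns every form reachable by applying one production to one nonterminal occurrence. The function name `one_step` and the tuple encoding are assumptions for illustration; the productions are those of the expression grammar on slide 6.

```python
# Productions of the slide-6 expression grammar, as (left, right) pairs.
P = [
    ("E", ("E", "+", "E")),
    ("E", ("E", "-", "E")),
    ("E", ("(", "E", ")")),
    ("E", ("-", "E")),
    ("E", ("num",)),
    ("E", ("id",)),
]


def one_step(form, productions):
    """Return every sentential form reachable from `form` in one step.

    `form` is a tuple of symbols. For each position holding a nonterminal A
    and each production A -> w, the symbol at that position is replaced by w.
    """
    results = []
    for i, symbol in enumerate(form):
        for lhs, rhs in productions:
            if symbol == lhs:
                results.append(form[:i] + rhs + form[i + 1:])
    return results


for new_form in one_step(("E", "+", "E"), P):
    print(" ".join(new_form))
```

Repeating `one_step` until no nonterminal remains gives the "derives in zero or more steps" relation used to define L(G).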

  8. Example • Productions: A -> aA, A -> bA, A -> a, A -> b • G = (N, T, P, S) • N = ? • T = ? • P = ? • S = ? • What is the language recognized by this grammar?
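To explore the question on slide 8 experimentally, the sketch below enumerates every sentence of bounded length by repeatedly expanding sentential forms. It assumes single-character symbols and relies on the fact that no production in this grammar shrinks a form, so forms longer than the bound can be discarded. The names `PRODUCTIONS`, `START` and `language_up_to` are hypothetical.

```python
from collections import deque

# Slide-8 grammar with single-character symbols: A is the only nonterminal.
PRODUCTIONS = {"A": ["aA", "bA", "a", "b"]}
START = "A"


def language_up_to(max_len):
    """Collect all terminal strings of length <= max_len derivable from START.

    Pruning by length is safe here only because no right-hand side is empty,
    so a sentential form never gets shorter.
    """
    sentences = set()
    seen = {START}
    queue = deque([START])
    while queue:
        form = queue.popleft()
        # Locate the leftmost nonterminal, if any.
        idx = next((i for i, c in enumerate(form) if c in PRODUCTIONS), None)
        if idx is None:
            sentences.add(form)  # only terminals left: this is a sentence
            continue
        for rhs in PRODUCTIONS[form[idx]]:
            new_form = form[:idx] + rhs + form[idx + 1:]
            if len(new_form) <= max_len and new_form not in seen:
                seen.add(new_form)
                queue.append(new_form)
    return sentences


print(sorted(language_up_to(2)))
```

The output suggests the pattern: every nonempty string over a and b appears, which matches the regular expression (a|b)+.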

  9. Chomsky Hierarchy (classification of grammars) • A grammar is said to be • regular if it is right-linear, where each production in P has the form A -> wB or A -> w (here A and B are nonterminals and w is a terminal), or left-linear (A -> Bw or A -> w) • context-free if each production in P is of the form A -> α, where A ∈ N and α ∈ (N ∪ T)* • context-sensitive if each production in P is of the form αAβ -> αγβ, where A ∈ N, α, β ∈ (N ∪ T)* and γ ∈ (N ∪ T)+ • unrestricted if each production in P is of the form α -> β, where α ≠ ε • All languages recognized by regular expressions can be represented by a regular grammar.

  10. A context-free grammar has four components, G = (N, T, P, S): • T is a finite set of tokens (terminal symbols) • N is a finite set of nonterminals • P is a finite set of productions of the form A -> α, where A ∈ N and α ∈ (N ∪ T)* • S is a special nonterminal that is the designated start symbol • Context-free grammars are more expressive than regular expressions. Consider the language {ab, aabb, aaabbb, …}, which no regular expression can describe.
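The language {ab, aabb, aaabbb, …} on slide 10 can be generated by a two-production context-free grammar. The grammar S -> a S b | a b used below is an assumption (the slides do not spell it out), and `matches_s` is a hypothetical name; the sketch simply mirrors the two productions as a recursive check.

```python
def matches_s(text):
    """Return True if `text` is derivable from S, assuming S -> a S b | a b."""
    # Production S -> a b
    if text == "ab":
        return True
    # Production S -> a S b: strip one 'a' in front and one 'b' at the end,
    # then the middle must itself be derivable from S.
    if len(text) >= 4 and text[0] == "a" and text[-1] == "b":
        return matches_s(text[1:-1])
    return False


for candidate in ["ab", "aabb", "aaabbb", "aab", "abab"]:
    print(candidate, matches_s(candidate))
```

The recursion has to remember how many a's were stripped in order to match the same number of b's, which is the kind of unbounded counting a regular expression cannot express.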

  11. BNF Notation (another form of context-free grammar) • Backus-Naur Form (BNF) notation for productions: <nonterminal> ::= sequence of (non)terminals, where • each terminal in the grammar is a token • a <nonterminal> defines a syntactic category • the symbol | denotes alternative forms in a production • the special symbol ε denotes empty

  12. Example
      <Program> ::= program <id> ( <id> <More_ids> ) ; <Block> .
      <Block> ::= <Variables> begin <Stmt> <More_Stmts> end
      <More_ids> ::= , <id> <More_ids> | ε
      <Variables> ::= var <id> <More_ids> : <Type> ; <More_Variables> | ε
      <More_Variables> ::= <id> <More_ids> : <Type> ; <More_Variables> | ε
      <Stmt> ::= <id> := <Exp> | if <Exp> then <Stmt> else <Stmt> | while <Exp> do <Stmt> | begin <Stmt> <More_Stmts> end
      <More_Stmts> ::= ; <Stmt> <More_Stmts> | ε
      <Exp> ::= <num> | <id> | <Exp> + <Exp> | <Exp> - <Exp>

  13. Derivations • From a grammar we can derive strings (= sequences of tokens) • Derivation is the opposite process of parsing • Starting with the grammar's designated start symbol, each derivation step replaces a nonterminal by a right-hand side of a production for that nonterminal • A sentence (in the language) is a sequence of terminals that can be derived from the start symbol. • A sentential form is a sequence of terminals and nonterminals that can be derived from the start symbol.

  14. Example Derivation
      <expression> ::= identifier
                     | unsigned_integer
                     | - <expression>
                     | ( <expression> )
                     | <expression> <operator> <expression>
      <operator> ::= + | - | * | /
      Derivation from the start symbol:
      <expression>
        => <expression> <operator> <expression>
        => <expression> <operator> identifier
        => <expression> + identifier
        => <expression> <operator> <expression> + identifier
        => <expression> <operator> identifier + identifier
        => <expression> * identifier + identifier
        => identifier * identifier + identifier
      • Each step replaces a nonterminal with the right-hand side of one of its productions • The intermediate strings are sentential forms • The final string is the yield

  15. Rightmost versus Leftmost Derivations • When the nonterminal on the far right (left) of a sentential form is replaced in each derivation step, the derivation is called rightmost (leftmost)
      Rightmost derivation (the rightmost nonterminal is replaced at each step):
      <expression>
        => <expression> <operator> <expression>
        => <expression> <operator> identifier
      Leftmost derivation (the leftmost nonterminal is replaced at each step):
      <expression>
        => <expression> <operator> <expression>
        => identifier <operator> <expression>
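The difference between the two derivation orders on slide 15 comes down to which nonterminal occurrence is chosen for replacement. The sketch below makes that choice explicit; `GRAMMAR` encodes the expression grammar from slide 14, and the function name `step` and its `leftmost` flag are assumptions for illustration.

```python
# Expression grammar from slide 14: nonterminal -> list of right-hand sides.
GRAMMAR = {
    "<expression>": [
        ["identifier"],
        ["unsigned_integer"],
        ["-", "<expression>"],
        ["(", "<expression>", ")"],
        ["<expression>", "<operator>", "<expression>"],
    ],
    "<operator>": [["+"], ["-"], ["*"], ["/"]],
}


def step(form, rhs, leftmost=True):
    """Replace the leftmost (or rightmost) nonterminal in `form` with `rhs`.

    Raises ValueError if there is no nonterminal left, or if `rhs` is not a
    right-hand side of the chosen nonterminal.
    """
    positions = [i for i, sym in enumerate(form) if sym in GRAMMAR]
    if not positions:
        raise ValueError("no nonterminal left to replace")
    i = positions[0] if leftmost else positions[-1]
    if rhs not in GRAMMAR[form[i]]:
        raise ValueError(f"{rhs} is not a right-hand side of {form[i]}")
    return form[:i] + rhs + form[i + 1:]


binary = ["<expression>", "<operator>", "<expression>"]
form = step(["<expression>"], binary)
print(" ".join(step(form, ["identifier"], leftmost=False)))  # rightmost step
print(" ".join(step(form, ["identifier"], leftmost=True)))   # leftmost step
```

Both orders can reach the same sentence; they differ only in the sequence of sentential forms visited along the way.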

  16. A Language Generated by a Grammar • A context-free grammar is a generator of a context-free language • The language defined by a grammar G is the set of all strings w that can be derived from the start symbol S: L(G) = { w | S =>* w }
      Example 1:
      <S> ::= a | ‘(’ <S> ‘)’
      L(G) = { a, (a), ((a)), (((a))), … }
      Example 2:
      <S> ::= <B> | <C>
      <B> ::= <C> + <C>
      <C> ::= 0 | 1
      L(G) = { 0+0, 0+1, 1+0, 1+1, 0, 1 }
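The second grammar on slide 16 has no recursion, so its language is finite and can be computed by expanding every alternative. The sketch below does exactly that; the dictionary encoding (with the angle brackets dropped) and the name `expand` are assumptions, and the approach would not terminate on a recursive grammar such as the first example.

```python
from itertools import product

# Second grammar from slide 16, with angle brackets dropped for brevity.
GRAMMAR = {
    "S": [["B"], ["C"]],
    "B": [["C", "+", "C"]],
    "C": [["0"], ["1"]],
}


def expand(symbol):
    """Return the set of terminal strings derivable from `symbol`.

    Terminates only because this particular grammar is not recursive.
    """
    if symbol not in GRAMMAR:          # a terminal derives only itself
        return {symbol}
    strings = set()
    for rhs in GRAMMAR[symbol]:
        # Combine every choice for each symbol of the right-hand side.
        for parts in product(*(expand(s) for s in rhs)):
            strings.add("".join(parts))
    return strings


print(sorted(expand("S")))
```

The result contains the six strings listed on the slide.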

  17. Parse Trees • A parse tree depicts the end result of a derivation • The internal nodes are the nonterminals • The children of a node are the symbols (terminals and nonterminals) on a right-hand side of a production • The leaves are the terminals [Figure: parse tree for identifier * identifier + identifier, with <expression> and <operator> nodes as internal nodes and the tokens as leaves.]

  18. Parse Trees
      <expression>
        => <expression> <operator> <expression>
        => <expression> <operator> identifier
        => <expression> + identifier
        => <expression> <operator> <expression> + identifier
        => <expression> <operator> identifier + identifier
        => <expression> * identifier + identifier
        => identifier * identifier + identifier
      [Figure: the parse tree for this derivation. The root <expression> has the children <expression> <operator> <expression>, where the operator is + and the right child is identifier; the left child expands to identifier * identifier.]
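A parse tree like the one on slide 18 can be written as nested data, and reading its leaves from left to right recovers the derived sentence (the yield). In this sketch the `(label, children)` tuple layout and the name `tree_yield` are assumptions; the tree itself follows the derivation shown on the slide, with + applied at the root.

```python
def leaf_expr():
    """Build the subtree <expression> -> identifier."""
    return ("<expression>", ["identifier"])


# Parse tree for identifier * identifier + identifier, following slide 18:
# the root splits at +, and its left child splits at *.
TREE = (
    "<expression>",
    [
        ("<expression>", [leaf_expr(), ("<operator>", ["*"]), leaf_expr()]),
        ("<operator>", ["+"]),
        leaf_expr(),
    ],
)


def tree_yield(node):
    """Return the leaves of a parse tree, left to right.

    A node is either a terminal (a plain string) or a pair
    (nonterminal_label, list_of_children).
    """
    if isinstance(node, str):
        return [node]
    _label, children = node
    leaves = []
    for child in children:
        leaves.extend(tree_yield(child))
    return leaves


print(" ".join(tree_yield(TREE)))
```

Internal nodes carry nonterminal labels and leaves are terminals, matching the description of parse trees on slide 17.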

  19. Ambiguity • There is another parse tree for the same grammar and input: the grammar is ambiguous • This parse tree is not desired, since it makes + appear to have precedence over * [Figure: the alternative parse tree for identifier * identifier + identifier, in which * is applied at the root and identifier + identifier forms the right subtree.]
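The ambiguity on slide 19 can be checked by counting parse trees. The sketch below is a simplification: it considers only the productions <expression> ::= identifier and <expression> ::= <expression> <operator> <expression>, ignoring unary minus and parentheses, and the name `count_parse_trees` is hypothetical.

```python
OPERATORS = {"+", "-", "*", "/"}


def count_parse_trees(tokens):
    """Count parse trees for `tokens` under the simplified ambiguous grammar
    <expression> ::= identifier | <expression> <operator> <expression>.
    """
    if tokens == ["identifier"]:
        return 1
    total = 0
    # Try every operator as the one applied at the root of the tree.
    for i in range(1, len(tokens) - 1):
        if tokens[i] in OPERATORS:
            left = count_parse_trees(tokens[:i])
            right = count_parse_trees(tokens[i + 1:])
            total += left * right
    return total


sentence = ["identifier", "*", "identifier", "+", "identifier"]
print(count_parse_trees(sentence))
```

A count greater than one for any sentence is enough to show that a grammar is ambiguous; the count grows quickly as more operators are added.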

  20. Ambiguous Grammars • Ambiguous grammar: some string has more than one distinct derivation resulting in different parse trees • A programming language construct should have only one parse tree to avoid misinterpretation by a compiler • For expression grammars, associativity and precedence of operators are used to disambiguate
      <expression> ::= <term> | <expression> <add_op> <term>
      <term> ::= <factor> | <term> <mult_op> <factor>
      <factor> ::= identifier | unsigned_integer | - <factor> | ( <expression> )
      <add_op> ::= + | -
      <mult_op> ::= * | /
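To see how the layered grammar on slide 20 fixes precedence and associativity, here is a minimal recursive-descent sketch. It is not how the course necessarily builds parsers: the left-recursive rules are rewritten as loops (a standard workaround, since direct left recursion would make a recursive-descent parser loop forever), tokens are plain strings, and the tree is returned as nested tuples. All function names are assumptions.

```python
def parse(tokens):
    """Parse a list of token strings; return a nested-tuple tree."""
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def advance():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        return tok

    def expression():
        # <expression> ::= <term> | <expression> <add_op> <term>
        node = term()
        while peek() in ("+", "-"):
            op = advance()
            node = (op, node, term())   # left-associative
        return node

    def term():
        # <term> ::= <factor> | <term> <mult_op> <factor>
        node = factor()
        while peek() in ("*", "/"):
            op = advance()
            node = (op, node, factor())  # left-associative
        return node

    def factor():
        # <factor> ::= identifier | unsigned_integer | - <factor> | ( <expression> )
        tok = peek()
        if tok is None:
            raise SyntaxError("unexpected end of input")
        if tok == "-":
            advance()
            return ("neg", factor())
        if tok == "(":
            advance()
            node = expression()
            if peek() != ")":
                raise SyntaxError("expected ')'")
            advance()
            return node
        if tok in ("+", "*", "/", ")"):
            raise SyntaxError(f"unexpected token {tok!r}")
        return advance()                 # identifier or unsigned_integer

    tree = expression()
    if pos != len(tokens):
        raise SyntaxError(f"unexpected token {tokens[pos]!r}")
    return tree


print(parse(["a", "*", "b", "+", "c"]))
print(parse(["a", "+", "b", "*", "c"]))
```

Because `<term>` sits below `<expression>` in the grammar, multiplication is always grouped before addition, so each input has exactly one tree.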

  21. Ambiguous if-then-else: the "Dangling Else" • A classical example of an ambiguous grammar is the set of productions for if-then-else:
      <stmt> ::= if <expr> then <stmt> | if <expr> then <stmt> else <stmt>
      • It is possible to hack this into unambiguous productions for the same syntax, but the fact that it is not easy indicates a problem in the programming language design • Ada uses different syntax to avoid ambiguity:
      <stmt> ::= if <expr> then <stmt> end if | if <expr> then <stmt> else <stmt> end if
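The dangling-else ambiguity can be demonstrated by counting parse trees for a nested conditional. The sketch below makes two assumptions that are not in the slides: the token `e` stands for any <expr> and the token `s` for an atomic statement. The function name `count_stmt_parses` is hypothetical.

```python
def count_stmt_parses(tokens):
    """Count parse trees for `tokens` under
    <stmt> ::= if <expr> then <stmt> | if <expr> then <stmt> else <stmt> | s
    where 'e' stands for <expr> and 's' is an atomic statement (assumptions).
    """
    if tokens == ["s"]:
        return 1
    if tokens[:3] != ["if", "e", "then"]:
        return 0
    rest = tokens[3:]
    # Alternative 1: if-then without else; the rest is one statement.
    total = count_stmt_parses(rest)
    # Alternative 2: if-then-else; try each 'else' as the one belonging
    # to this 'if'.
    for j, tok in enumerate(rest):
        if tok == "else":
            total += count_stmt_parses(rest[:j]) * count_stmt_parses(rest[j + 1:])
    return total


nested = "if e then if e then s else s".split()
print(count_stmt_parses(nested))
```

The nested statement has two parse trees: one attaches the else to the inner if, the other to the outer if. With Ada-style `end if` terminators, each if is explicitly closed, so the choice disappears.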
