1 / 32

Syntax and Grammars

Syntax and Grammars. Definitions. Syntax is the form of its expressions, statements and program units Semantics is the meaning of those expressions, statements and program units Syntax can be precisely defined, while semantics are harder to describe:

kyran
Download Presentation

Syntax and Grammars

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Syntax and Grammars

  2. Definitions • Syntax is the form of its expressions, statements and program units • Semantics is the meaning of those expressions, statements and program units • Syntax can be precisely defined, while semantics are harder to describe: • syntax easier were to sentence would unimportant this read if be • this sentence would be easier to read if syntax were unimportant • time flies like an arrow, fruit flies like a banana

  3. Context-sensitive vs. context-free • English is context sensitive since the meaning of a word can change based on the words around it • Programming languages are usually context-free • The meaning of the “words” in a program are determined by the syntax rules of the language

  4. Describing Syntax • Two approaches: • Language recognizers • Given a string of characters, determines if the string is valid in the language • Approach taken by the compiler • Language generators • Allows a new valid sentence to be created each time it is used • More easily read and understood • These formalizations aid in compiler design, and language acquisition

  5. Language Generators • Usually called grammars • Chomsky (1956, 1959) • Linguist who described four classes of language generators, two which are useful for programming language syntax: context-free and regular • Interested in the theory • Backus and Naur • Computer scientists describing PL syntax • Interested in the practical aspect of describing ALGOL 58 and ALGOL 60 • BNF (Backus-Naur Form) now the standard

  6. Backus-Naur Form • BNF is a metalanguage • A language that describes another language • Gives details for each of characters, lexemes and tokens • Pieces of syntax • Individual characters e.g. alphanumeric keys • Lexemes • Small units • E.g. numeric literals, operators, special words • Tokens • Categories for lexemes • E.g. identifier, mult_op, semicolon, int_literal

  7. BNF Example – C++ Identifier <identifier> ::= <first_letter> <valid_sequence> <first_letter> ::= <letter> | _ <valid_sequence> ::= <valid_symbol> <valid_sequence> |  <valid_symbol>::= <letter> | <digit> | _ <letter> ::= A | B | C | D | E | F | G | H | I | J | K | L | M | N | O | P | Q | R | S | T | U | V | W | X | Y | Z | a | b | c | d | e | f | g | h | i | j | k | l | m | n | o | p | q | r | s | t | u | v | w | x | y | z <digit> ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

  8. This sequence of steps is called a derivation. Is “R2D2” a valid identifier? <identifier> <first_letter> <valid_sequence> <letter> <valid_sequence> R <valid_sequence> R <valid_symbol> <valid_sequence> R <digit> <valid_sequence> R 2 <valid_sequence> R 2 <valid_symbol> <valid_sequence> R 2 <letter> <valid_sequence> R 2 D <valid_sequence> R 2 D <valid_symbol> <valid_sequence> R 2 D <digit> <valid_sequence> R 2 D 2 <valid_sequence> R 2 D 2 

  9. Parse trees <identifier> <first letter> <valid sequence> <valid symbol> <valid sequence> R . . . <valid digit> 2

  10. Formal Definitions Let S be a set of symbols. • A string overSis a finite sequence of zero or more symbols from the set S. • Symbols whose meaning is predefined are called terminals. • Symbols like A, _, 6, etc. are terminals in our <identifier> BNF. • Symbols whose meanings must be defined are called non-terminals, and are enclosed in angle-brackets (< and >). • Symbols like <identifier>, <letter>, etc. are non-terminals. • Like variables, non-terminals usually describe what they represent. • One symbol is designated as the starting non-terminal. • The symbol <identifier>is the starting non-terminal in our BNF.

  11. Formal Definitions continued • Each non-terminal must be defined by a production (rule): <LHS> ::= <RHS> • where: <LHS>is the nonterminal being defined, • <RHS>is a string of terminals and/or • nonterminalsdefining <LHS>. • Different productions defining the same non-terminal: <NTi> ::= Def1 <NTi> ::= Def2 ... <NTi> ::= Defn • can be written in shorthand using the OR (|) operator: <NTi> ::= Def1 | Def2 | … | Defn

  12. Formal Definitions continued • A BNF is a quadruple: (S, N, P, S), where: S is the set of symbols in the BNF; N is the subset of S that are nonterminals in the language; P is the set of productions defining the symbols in N; S is the element of N that is the starting nonterminal. • Aderivationis a sequence of strings, beginning with the starting nonterminal S, in which each successive string replaces a nonterminal with one of its productions, and in which the final string consists solely of terminals. A derivation is sometimes called a parse.

  13. Formal Definitions continued • A BNF derivation tree (or parse tree) is a tree, such that • the root of the tree is the starting nonterminal (S) in the BNF; • the children of the root are the symbols (L to R) in a production whose <LHS> is S, the starting nonterminal; • each terminal child is a leaf; and • each nonterminal child is the root of a derivation tree for that nonterminal. • The act of building a derivation tree for a sentence (to check its correctness) is called parsing that sentence. The set of valid sentences in a language is the set of all sentences for which a parse tree exists!

  14. Recursive Productions Our identifier-BNF permits identifiers to have arbitrary lengths through the use of recursive productions: <valid_sequence> ::= <valid_symbol> <valid_sequence> |  The first production is recursive because the non-terminal on the <LHS> appears in the production (on the <RHS>). The recursive production provides for unrestricted repetition of the non-terminal being defined, which is useful when what is being defined can be appear 0 or more times. • The e-productionprovides: • a base-casefor trivial instances of the non-terminal, and • an anchorto terminate the recursion.

  15. Writing BNFs: A 3-Step Process A. Start with the non-terminal you’re defining. B. Build the productions to define the non-terminal: 1. Start with the question: What comes first? 2. If a construct is optional: a. create a nonterminal for that construct; b. add a production for the non-optional case; c. add an e-production for the optional case. 3. If a construct can be repeated zero or more times: a. create a nonterminal for that construct; b. add an e-production for the zero-reps case; c. add a recursive production for the other cases. C. For each nonterminal in the <RHS> of every production: repeat step B until all nonterminals have been defined.

  16. Example: C++ if statement A. Create a non-terminal for what we’re defining: <if_stmt> B. Build a production to define it: what comes first? - keyword if - open parentheses - an expression - close parentheses - a statement - (optional) an else and a statement <if_stmt> ::= if ( <expr> ) <stmt> <else_part> C. Repeat B for each undefined non-terminal: <else_part> ::= else <stmt> |  (For now we won’t worry about <expr> and <stmt>)

  17. How are we doing? <if_stmt> ::= if ( <expr> ) <stmt> <else_part> <else_part> ::= else <stmt> |  if ( a < b) if (a < c) S1 else S2 Uh oh…

  18. Ambiguities • If multiple parse trees exist for a valid expression, the grammar is said to be ambiguous • C++ allows this ambiguity in the if expression, and instead uses a semantic rule to resolve the ambiguity: • Anelse associates with the closest prior unterminatedif • Not all ambiguity is ok • Many times, if there is ambiguity, the grammar has been poorly structured

  19. Example <assign> ::= <id> = <expr> <id> ::= A|B|C <expr>::= <expr> + <expr> | <expr> * <expr> | ( <expr> ) | <id> Consider the expression A = B + C * A

  20. From Concepts of Programming Languages, 9th ed. Sebesta

  21. Precedence • Order of evaluations can be built in to the grammar • An expression lower in the parse tree then has higher precedence <expr> ::= … | <add_expr> | … <add_expr> ::= <term> | <add_expr> +<term> | <add_expr> -<term> <factor> | <term> * <factor> | <term> /<factor> | <term> % <factor> <term> ::= <factor> ::= ( <add_expr> ) | <value>

  22. Example: 2+3*4 <expr> <add_expr> <add_expr> + <term> <term> <term> * <factor> <value> <factor> <factor> <value> <value> 4 2 3

  23. Associativity • When two operators in an expression have the same precedence (e.g. + and -, or * and /), a semantic rule is required to indicate which should have precedence • Associativity can matter in computing where it does not matter in mathematics • E.g. 1000000 plus ten 1s on a machine with 7 digits of accuracy • A grammar can imply operator associativity

  24. - is left-associative; which is correct? = is right-associative; which is correct? - - = = 8 - - 2 x = = z 4 2 8 4 y z x y Associativity Examples

  25. Left-recursive productions • Left associativity can be built into a grammar using left-recursive productions <add_expr> ::= <mul_expr> | <add_expr> + <mul_expr> | <add_expr> - <mul_expr> - - 2 8 4

  26. Right-recursive productions • Right associativity can be built into a grammar using right-recursive productions <assign_expr> ::= <lvalue> <assign_op> <assign_expr> = x = y z

  27. Exercise <add_expr> ::= <term> | <add_expr> +<term> | <add_expr> -<term> <factor> | <term> * <factor> | <term> /<factor> | <term> % <factor> <term> ::= <value> ** <factor>| <value> <factor> ::= <value> ::= ( <add_expr> ) | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 Draw the parse tree for: 4 ** 2 ** 3 + 5 * 6 + 7

  28. Extended BNF (EBNF) • No more description power than regular BNF • Increased readability and writability • Variety of implementations • Instead of <> indicating a non-terminal, bold, underline or ‘quote’ terminals • Capitalize first letter of non-terminals

  29. Common extensions • [ and ] surround optional symbols • No productions! • E.g. if_stmt ::= if(expr)stmt[ else stmt ] • { and } surround symbols repeated zero or more times • No recursion! • E.g. ident ::= letter { letter | digit } • ( and ) indicate groupings from which a single choice must be made • E.g. term ::= term (* | / | %) factor

  30. Why BNF? • The recursive productions make it easier for a compiler to parse the “sentences” in the language • Basic parsing algorithm 0. Push S (the starting symbol) onto a stack. 1. Get the first terminal symbol t from the input file. 2. Repeat the following steps: a. Pop the stack into topSymbol; b. If topSymbol is a nonterminal: 1) Choose a production p of topSymbol based on t 2) If p != e: Push p right-to-left onto the stack. c. Else if topSymbol is a terminal && topSymbol == t: Get the next terminal symbol t from the input file. d. Else Generate a ‘parse error’ message. while the stack is not empty.

  31. stack: <var_dec> <type> int id <id_list> , id <id_list> ; ; id id <id_list> ; id <id_list> <var_dec> <var_dec> <id_list> <id_list> ; <var_dec> <id_list> ; ; ; <var_dec> ; <var_dec> <var_dec> <var_dec> <var_dec> t: int Example Suppose our grammar is: <var_dec> ::= <type> id <id_list> ; <var_dec> | e <type> ::= int | char | float | double | … <id_list> ::= , <id> <id-list> | e Let’s parse the declaration: intx, y; assuming that <var_dec> is our starting symbol. <var_dec> id (x) , id (y) ;

  32. Summary • The EBNF is simpler and easier to use, so it is frequently used in language guides/manuals. • The BNF’s recursive- and e-productions simplify the task of parsing, so it is more useful to compiler-writers.

More Related