Syntax and Grammars

Syntax and Grammars

Definitions • Syntax is the form of its expressions, statements and program units • Semantics is the meaning of those expressions, statements and program units • Syntax can be precisely defined, while semantics are harder to describe: • syntax easier were to sentence would unimportant this read if be • this sentence would be easier to read if syntax were unimportant • time flies like an arrow, fruit flies like a banana

Context-sensitive vs. context-free • English is context sensitive since the meaning of a word can change based on the words around it • Programming languages are usually context-free • The meaning of the “words” in a program are determined by the syntax rules of the language

Describing Syntax • Two approaches: • Language recognizers • Given a string of characters, determines if the string is valid in the language • Approach taken by the compiler • Language generators • Allows a new valid sentence to be created each time it is used • More easily read and understood • These formalizations aid in compiler design, and language acquisition

Language Generators • Usually called grammars • Chomsky (1956, 1959) • Linguist who described four classes of language generators, two which are useful for programming language syntax: context-free and regular • Interested in the theory • Backus and Naur • Computer scientists describing PL syntax • Interested in the practical aspect of describing ALGOL 58 and ALGOL 60 • BNF (Backus-Naur Form) now the standard

Backus-Naur Form • BNF is a metalanguage • A language that describes another language • Gives details for each of characters, lexemes and tokens • Pieces of syntax • Individual characters e.g. alphanumeric keys • Lexemes • Small units • E.g. numeric literals, operators, special words • Tokens • Categories for lexemes • E.g. identifier, mult_op, semicolon, int_literal

BNF Example – C++ Identifier <identifier> ::= <first_letter> <valid_sequence> <first_letter> ::= <letter> | _ <valid_sequence> ::= <valid_symbol> <valid_sequence> |  <valid_symbol>::= <letter> | <digit> | _ <letter> ::= A | B | C | D | E | F | G | H | I | J | K | L | M | N | O | P | Q | R | S | T | U | V | W | X | Y | Z | a | b | c | d | e | f | g | h | i | j | k | l | m | n | o | p | q | r | s | t | u | v | w | x | y | z <digit> ::= 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

This sequence of steps is called a derivation. Is “R2D2” a valid identifier? <identifier> <first_letter> <valid_sequence> <letter> <valid_sequence> R <valid_sequence> R <valid_symbol> <valid_sequence> R <digit> <valid_sequence> R 2 <valid_sequence> R 2 <valid_symbol> <valid_sequence> R 2 <letter> <valid_sequence> R 2 D <valid_sequence> R 2 D <valid_symbol> <valid_sequence> R 2 D <digit> <valid_sequence> R 2 D 2 <valid_sequence> R 2 D 2 

Parse trees <identifier> <first letter> <valid sequence> <valid symbol> <valid sequence> R . . . <valid digit> 2

Formal Definitions Let S be a set of symbols. • A string overSis a finite sequence of zero or more symbols from the set S. • Symbols whose meaning is predefined are called terminals. • Symbols like A, _, 6, etc. are terminals in our <identifier> BNF. • Symbols whose meanings must be defined are called non-terminals, and are enclosed in angle-brackets (< and >). • Symbols like <identifier>, <letter>, etc. are non-terminals. • Like variables, non-terminals usually describe what they represent. • One symbol is designated as the starting non-terminal. • The symbol <identifier>is the starting non-terminal in our BNF.

Formal Definitions continued • Each non-terminal must be defined by a production (rule): <LHS> ::= <RHS> • where: <LHS>is the nonterminal being defined, • <RHS>is a string of terminals and/or • nonterminalsdefining <LHS>. • Different productions defining the same non-terminal: <NTi> ::= Def1 <NTi> ::= Def2 ... <NTi> ::= Defn • can be written in shorthand using the OR (|) operator: <NTi> ::= Def1 | Def2 | … | Defn

Formal Definitions continued • A BNF is a quadruple: (S, N, P, S), where: S is the set of symbols in the BNF; N is the subset of S that are nonterminals in the language; P is the set of productions defining the symbols in N; S is the element of N that is the starting nonterminal. • Aderivationis a sequence of strings, beginning with the starting nonterminal S, in which each successive string replaces a nonterminal with one of its productions, and in which the final string consists solely of terminals. A derivation is sometimes called a parse.

Formal Definitions continued • A BNF derivation tree (or parse tree) is a tree, such that • the root of the tree is the starting nonterminal (S) in the BNF; • the children of the root are the symbols (L to R) in a production whose <LHS> is S, the starting nonterminal; • each terminal child is a leaf; and • each nonterminal child is the root of a derivation tree for that nonterminal. • The act of building a derivation tree for a sentence (to check its correctness) is called parsing that sentence. The set of valid sentences in a language is the set of all sentences for which a parse tree exists!

Recursive Productions Our identifier-BNF permits identifiers to have arbitrary lengths through the use of recursive productions: <valid_sequence> ::= <valid_symbol> <valid_sequence> |  The first production is recursive because the non-terminal on the <LHS> appears in the production (on the <RHS>). The recursive production provides for unrestricted repetition of the non-terminal being defined, which is useful when what is being defined can be appear 0 or more times. • The e-productionprovides: • a base-casefor trivial instances of the non-terminal, and • an anchorto terminate the recursion.

Writing BNFs: A 3-Step Process A. Start with the non-terminal you’re defining. B. Build the productions to define the non-terminal: 1. Start with the question: What comes first? 2. If a construct is optional: a. create a nonterminal for that construct; b. add a production for the non-optional case; c. add an e-production for the optional case. 3. If a construct can be repeated zero or more times: a. create a nonterminal for that construct; b. add an e-production for the zero-reps case; c. add a recursive production for the other cases. C. For each nonterminal in the <RHS> of every production: repeat step B until all nonterminals have been defined.

Example: C++ if statement A. Create a non-terminal for what we’re defining: <if_stmt> B. Build a production to define it: what comes first? - keyword if - open parentheses - an expression - close parentheses - a statement - (optional) an else and a statement <if_stmt> ::= if ( <expr> ) <stmt> <else_part> C. Repeat B for each undefined non-terminal: <else_part> ::= else <stmt> |  (For now we won’t worry about <expr> and <stmt>)

How are we doing? <if_stmt> ::= if ( <expr> ) <stmt> <else_part> <else_part> ::= else <stmt> |  if ( a < b) if (a < c) S1 else S2 Uh oh…

Ambiguities • If multiple parse trees exist for a valid expression, the grammar is said to be ambiguous • C++ allows this ambiguity in the if expression, and instead uses a semantic rule to resolve the ambiguity: • Anelse associates with the closest prior unterminatedif • Not all ambiguity is ok • Many times, if there is ambiguity, the grammar has been poorly structured

From Concepts of Programming Languages, 9th ed. Sebesta

Example: 2+3*4 <expr> <add_expr> <add_expr> + <term> <term> <term> * <factor> <value> <factor> <factor> <value> <value> 4 2 3

Associativity • When two operators in an expression have the same precedence (e.g. + and -, or * and /), a semantic rule is required to indicate which should have precedence • Associativity can matter in computing where it does not matter in mathematics • E.g. 1000000 plus ten 1s on a machine with 7 digits of accuracy • A grammar can imply operator associativity

- is left-associative; which is correct? = is right-associative; which is correct? - - = = 8 - - 2 x = = z 4 2 8 4 y z x y Associativity Examples

Left-recursive productions • Left associativity can be built into a grammar using left-recursive productions <add_expr> ::= <mul_expr> | <add_expr> + <mul_expr> | <add_expr> - <mul_expr> - - 2 8 4

Right-recursive productions • Right associativity can be built into a grammar using right-recursive productions <assign_expr> ::= <lvalue> <assign_op> <assign_expr> = x = y z

Exercise <add_expr> ::= <term> | <add_expr> +<term> | <add_expr> -<term> <factor> | <term> * <factor> | <term> /<factor> | <term> % <factor> <term> ::= <value> ** <factor>| <value> <factor> ::= <value> ::= ( <add_expr> ) | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 Draw the parse tree for: 4 ** 2 ** 3 + 5 * 6 + 7

Extended BNF (EBNF) • No more description power than regular BNF • Increased readability and writability • Variety of implementations • Instead of <> indicating a non-terminal, bold, underline or ‘quote’ terminals • Capitalize first letter of non-terminals

Common extensions • [ and ] surround optional symbols • No productions! • E.g. if_stmt ::= if(expr)stmt[ else stmt ] • { and } surround symbols repeated zero or more times • No recursion! • E.g. ident ::= letter { letter | digit } • ( and ) indicate groupings from which a single choice must be made • E.g. term ::= term (* | / | %) factor

Why BNF? • The recursive productions make it easier for a compiler to parse the “sentences” in the language • Basic parsing algorithm 0. Push S (the starting symbol) onto a stack. 1. Get the first terminal symbol t from the input file. 2. Repeat the following steps: a. Pop the stack into topSymbol; b. If topSymbol is a nonterminal: 1) Choose a production p of topSymbol based on t 2) If p != e: Push p right-to-left onto the stack. c. Else if topSymbol is a terminal && topSymbol == t: Get the next terminal symbol t from the input file. d. Else Generate a ‘parse error’ message. while the stack is not empty.

stack: <var_dec> <type> int id <id_list> , id <id_list> ; ; id id <id_list> ; id <id_list> <var_dec> <var_dec> <id_list> <id_list> ; <var_dec> <id_list> ; ; ; <var_dec> ; <var_dec> <var_dec> <var_dec> <var_dec> t: int Example Suppose our grammar is: <var_dec> ::= <type> id <id_list> ; <var_dec> | e <type> ::= int | char | float | double | … <id_list> ::= , <id> <id-list> | e Let’s parse the declaration: intx, y; assuming that <var_dec> is our starting symbol. <var_dec> id (x) , id (y) ;

Summary • The EBNF is simpler and easier to use, so it is frequently used in language guides/manuals. • The BNF’s recursive- and e-productions simplify the task of parsing, so it is more useful to compiler-writers.

Syntax and Grammars

Syntax and Grammars

Presentation Transcript

Parsers and Grammars

Grammars

Grammar and Grammars

Syntax and Context-Free Grammars

Grammar and Grammars

Prolog and grammars

Languages and Grammars

Languages and Grammars

Grammars

Languages and Grammars

Grammars and ambiguity

Session 13 Context-Free Grammars and Language Syntax

Grammars and Parsing

Languages and grammars

Grammars and Automata

Grammars and Parsing

Parsers and Grammars

Syntax and Processing it: Definite Clause Grammars in Prolog (optional material)

Introduction to Syntax and Context-Free Grammars

Grammars and ambiguity

Grammars and Parsing