600 likes | 657 Views
Introduction to syntax and semantics in programming languages, defining language's structure and meaning. Explore language design, syntax, and parsing processes. Examples in C along with tokenizing and lexical analysis concepts.
E N D
Structure of programming languages Syntax and Semantics
Introduction • Syntax: the form or structure of the expressions, statements, and program units • Semantics: the meaning of the expressions, statements, and program units
Introduction • Syntax and semantics provide a language’s definition • Users of a language definition • Other language designers • Implementers • Programmers (the users of the language)
Syntax of a Languages • Motivation for using a subset of C: Grammar Language (pages) Reference Pascal 5 Jensen & Wirth C 6 Kernighan & Richie C++ 22 Stroustrup Java 14 Gosling, et. al. • The grammar fits on one page, so it’s a far better tool for studying language design.
C Grammar: Statements Program int main ( ){ Declarations Statements } Declarations {Declaration } Declaration Type Identifier [ [ Integer ] ]{ , Identifier [ [ Integer ] ] }; Type int | bool | float | char Statements {Statement } Statement ; | Block | Assignment | IfStatement | WhileStatement Block { Statements } Assignment Identifier [ [ Expression ] ]=Expression ; IfStatement if (Expression ) Statement [else Statement ] WhileStatement while (Expression ) Statement
C Grammar: Expressions Expression Conjunction {|| Conjunction } Conjunction Equality {&& Equality } Equality Relation [ EquOp Relation ] EquOp == | != Relation Addition [ RelOp Addition ] RelOp < | <= | > | >= Addition Term { AddOp Term } AddOp + | - Term Factor { MulOp Factor } MulOp * | / | % Factor [ UnaryOp ]Primary UnaryOp - | ! Primary Identifier [[ Expression ]] | Literal | ( Expression ) | Type ( Expression )
Describing Syntax: Terminology • A sentenceis a string of characters over some alphabet • A language is a set of sentences • Alexemeis the lowest level syntactic unit of a language (e.g., *, sum, begin) • A tokenis a category of lexemes (e.g., identifier)
Formal Definition of Languages • Recognizers • A recognition device reads input strings of the language and decides whether the input strings belong to the language • Example: syntax analysis part of a compiler • Generators • A device that generates sentences of a language • One can determine if the syntax of a particular sentence is correct by comparing it to the structure of the generator
Lexical Analysis • The process of converting a character stream into a corresponding sequence of meaningful symbols (called tokens or lexemes) is called tokenizing, lexing or lexical analysis. A program that performs this process is called a tokenizer, lexer, or scanner. • In Scheme, we tokenize (set! x (+ x 1)) as ( set! x ( + x 1 ) ) • Similarly, in Java, we tokenizeSystem.out.println("Hello World!"); as System . out . println ( "Hello World!" ) ;
Lexical Analysis • Lexical analyzer splits it into tokens • Token = sequence of characters (symbolic name) representing a single terminal symbol • Identifiers: myVariable … • Literals: 123 5.67 true … • Keywords: char sizeof … • Operators: + - * / … • Punctuation: ; , } { … • Discards whitespace and comments
Examples of Tokens in C • Lexical analyzer usually represents each token by a unique integer code • “+” { return(PLUS); } // PLUS = 401 • “-” { return(MINUS); } // MINUS = 402 • “*” { return(MULT); } // MULT = 403 • “/” { return(DIV); } // DIV = 404 • Some tokens require regular expressions • [a-zA-Z_][a-zA-Z0-9_]* { return (ID); } // identifier • [1-9][0-9]* { return(DECIMALINT); } • 0[0-7]* { return(OCTALINT); } • (0x|0X)[0-9a-fA-F]+ { return(HEXINT); }
Reserved Keywords in C • auto, break, case, char, const, continue, default, do, double, else, enum, extern, float, for, goto, if, int, long, register, return, short, signed, sizeof, static, struct, switch, typedef, union, unsigned, void, volatile, wchar_t, while • C++ added a bunch: bool, catch, class, dynamic_cast, inline, private, protected, public, static_cast, template, this, virtual and others • Each keyword is mapped to its own token
Whitespace Whitespace is any space, tab, end-of-line character (or characters), or character sequence inside a comment No token may contain embedded whitespace (unless it is a character or string literal) Example: >= one token > = two tokens
program confusing; const true = false; begin if (a<b) = true then f(a) else … Redefining Identifiers can be dangerous
Parsing Process • Call the scanner to get tokens • Build a parse tree from the stream of tokens • A parse tree shows the syntactic structure of the source program. • Add information about identifiers in the symbol table • Report error, when found, and recover from the error
Grammars • A meta-language is a language used to define other languages • A grammar is a meta-language used to define the syntax of a language. It consists of: • Finite set of terminal symbols • Finite set of non-terminal symbols • Finite set of production rules • Start symbol • Language = (possibly infinite) set of all sequences of symbols that can be derived by applying production rules starting from the start symbol Backus-Naur Form (BNF)
Describing Syntax • Backus-Naur Form and Context-Free Grammars (BNF) • Most widely known method for describing programming language syntax • Extended BNF • Improves readability and writability of BNF
Backus-Naur Form (BNF) • Backus-Naur Form (1959) • Invented by John Backus to describe Algol 58 • BNF is equivalent to context-free grammars • BNF is a meta-language used to describe another language • In BNF, abstractions are used to represent classes of syntactic structures--they act like syntactic variables (also called non-terminal symbols)
Consider the grammar: binaryDigit 0 binaryDigit 1 or equivalently: binaryDigit 0 | 1 Here, | is a metacharacter that separates alternatives. Example: Binary Digits
Example: Decimal Numbers • Grammar for unsigned decimal integers • Terminal symbols: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 • Non-terminal symbols: Digit, Integer • Production rules: • Integer Digit | Integer Digit • Digit 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 • Start symbol: Integer • Can derive any unsigned integer using this grammar • Language = set of all unsigned decimal integers
Derivation of 352 as an Integer Production rules: Integer Digit | Integer Digit Digit 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 Integer Integer Digit Integer 2 Integer Digit 2 Integer 5 2 Digit 5 2 3 5 2 Rightmost derivation At each step, the rightmost non-terminal is replaced
+ F id id E * id Parse Tree • A labeled tree in which • the interior nodes are labeled by nonterminals • leaf nodes are labeled by terminals • the children of an interior node represent a replacement of the associated nonterminal in a derivation • corresponding to a derivation E
st ifStatement if elsePart st ( exp ) if true st return else st true return Abstract Syntax Tree for If Statement
The following grammar defines the language of arithmetic expressions with 1-digit integers, addition, and subtraction. Expr Expr + Term | Expr – Term | Term Term 0 | ... | 9 | ( Expr ) Arithmetic Expression Grammar
BNF Fundamentals • Non-terminals: BNF abstractions • Terminals: lexemes and tokens • Grammar: a collection of rules • Examples of BNF rules: <ident_list> → identifier | identifier, <ident_list> <if_stmt> → if <logic_expr> then <stmt>
Regular Expressions • x character x • \x escaped character, e.g., \n • { name } reference to a name • M | N M or N • M N M followed by N • M* 0 or more occurrences of M • M+ 1 or more occurrences of M • [x1 … xn] One of x1 … xn • Example: [aeiou] – vowels, [0-9] - digits
An Example Grammar <program> <stmts> <stmts> <stmt> | <stmt> ; <stmts> <stmt> <var> = <expr> <var> a | b | c | d <expr> <term> + <term> | <term> - <term> <term> <var> | const
Derivation • A derivation is a repeated application of rules, starting with the start symbol and ending with a sentence (all terminal symbols), e.g., <program> => <stmts> => <stmt> => <var> = <expr> => a = <expr> => a = <term> + <term> => a = <var> + <term> => a = b + <term> => a = b + const
S SS | (S)S | () | S SS (S)SS (S)S(S)S (S)S(())S ((S)S)S(())S ((S)())S(())S ((())())S(())S ((())()) (())S ((())())(()) E E O E | (E) | id O + | - | * | / E E O E (E) O E (E O E) O E * ((E O E) O E) O E ((id O E)) O E) O E ((id + E)) O E) O E ((id + id)) O E) O E * ((id + id)) * id) + id Examples
Each step of the derivation is a replacement of the leftmost nonterminals in a sentential form. E E O E (E) O E (E O E) O E (id O E) O E (id + E) O E (id + id) O E (id + id) * E (id + id) * id Each step of the derivation is a replacement of the rightmost nonterminals in a sentential form. E EO E E O id E * id (E) * id (E O E) * id (E O id) * id (E + id) * id (id + id) * id Leftmost Derivation Rightmost Derivation
Parse Tree • A hierarchical representation of a derivation <program> <stmts> <stmt> <var> = <expr> a <term> + <term> <var> const b
CFG For Floating Point Numbers ::= stands for production rule; <…> are non-terminals; | represents alternatives for the right-hand side of a production rule Sample parse tree:
Ambiguity in Expressions • Which operation is to be done first? • solved by precedence • An operator with higher precedence is done before one with lower precedence. • An operator with higher precedence is placed in a rule (logically) further from the start symbol. • solved by associativity • If an operator is right-associative (or left-associative), an operand in between 2 operators is associated to the operator to the right (left). • Right-associated : W + (X + (Y + Z)) • Left-associated : ((W + X) + Y) + Z
Associativity and Precedence • A grammar can be used to define associativity and precedence among the operators in an expression. E.g., + and - are left-associative operators in mathematics; * and / have higher precedence than + and - . • Consider the grammar G1: Expr -> Expr + Term | Expr – Term | Term Term -> Term * Factor | Term / Factor | Term % Factor | Factor Factor -> Primary ** Factor | Primary Primary -> 0 | ... | 9 | ( Expr )
E F F * F X F * F X X ( E ) id4 id3 E + E F F X X id1 id2 Precedence (cont’d) E E + E | E - E | F F F * F | F / F | X X ( E ) | id (id1 + id2) * id3 * id4
E E * E E + E E E E id3 + E E * id1 id2 id1 id2 id3 E E + E F F F * F id1 id2 id3 Precedence E E + E E E - E E E * E E E / E E id E E + E E E - E E F F F * F F F / F Fid
Precedence Associativity Operators 3 right ** 2 left * / % \ 1 left + - Note: These relationships are shown by the structure of the parse tree: highest precedence at the bottom, and left-associativity on the left at each level. Associativity and Precedence for Grammar
Associativity of Operators • Operator associativity can also be indicated by a grammar <expr> -> <expr> + <expr> | const (ambiguous) <expr> -> <expr> + const | const (unambiguous) <expr> <expr> <expr> + const <expr> + const const
E F X / F id4 * X F X id3 ( E ) + E F F X X id2 id1 Associativity • Left-associative operators E E + F | E - F | F F F * X | F / X | X X ( E ) | id (id1 + id2) * id3 / id4 = (((id1 + id2) * id3) / id4)
Ambiguity in Grammars • A grammar is ambiguous if and only if it generates a sentential form that has two or more distinct parse trees
With which ‘if’ does the following ‘else’ associate? if (x < 0) if (y < 0) y = y - 1; else y = 0; Answer: either one! Example of Dangling Else
An Ambiguous Expression Grammar <expr> <expr> <op> <expr> | const <op> / | - <expr> <expr> <expr> <op> <expr> <expr> <op> <op> <expr> <expr> <op> <expr> <expr> <op> <expr> const - const / const const - const / const
E E * E E + E E E E id3 + E E * id1 id2 id1 id2 id3 Example: Ambiguous Grammar E E + E E E - E E E * E E E / E E id
Solving the dangling else ambiguity • Algol 60, C, C++: associate each else with closest if; use {} or begin…end to override. • Algol 68, Modula, Ada: use explicit delimiter to end every conditional (e.g., if…fi) • Java: rewrite the grammar to limit what can appear in a conditional: IfThenStatement -> if ( Expression ) Statement IfThenElseStatement -> if ( Expression )StatementNoShortIf else Statement The category StatementNoShortIf includes all except IfThenStatement.
Ambiguity in Expressions • Which operation is to be done first? • solved by precedence • An operator with higher precedence is done before one with lower precedence. • An operator with higher precedence is placed in a rule (logically) further from the start symbol. • solved by associativity • If an operator is right-associative (or left-associative), an operand in between 2 operators is associated to the operator to the right (left). • Right-associated : W + (X + (Y + Z)) • Left-associated : ((W + X) + Y) + Z
An Unambiguous Expression Grammar • If we use the parse tree to indicate precedence levels of the operators, we cannot have ambiguity <expr> <expr> - <term> | <term> <term> <term> / const| const <expr> <expr> - <term> <term> <term> / const const const
Extended BNF • Optional parts are placed in brackets [ ] <proc_call> -> ident [(<expr_list>)] • Alternative parts of RHSs are placed inside parentheses and separated via vertical bars <term> → <term>(+|-) const • Repetitions (0 or more) are placed inside braces { } <ident> → letter {letter|digit}