Syntax and Semantics in Programming Languages

Structure of programming languages Syntax and Semantics

Introduction • Syntax: the form or structure of the expressions, statements, and program units • Semantics: the meaning of the expressions, statements, and program units

Introduction • Syntax and semantics provide a language’s definition • Users of a language definition • Other language designers • Implementers • Programmers (the users of the language)

Syntax of a Languages • Motivation for using a subset of C: Grammar Language (pages) Reference Pascal 5 Jensen & Wirth C 6 Kernighan & Richie C++ 22 Stroustrup Java 14 Gosling, et. al. • The grammar fits on one page, so it’s a far better tool for studying language design.

C Grammar: Statements Program int main ( ){ Declarations Statements } Declarations  {Declaration } Declaration  Type Identifier [ [ Integer ] ]{ , Identifier [ [ Integer ] ] }; Type  int | bool | float | char Statements  {Statement } Statement  ; | Block | Assignment | IfStatement | WhileStatement Block  { Statements } Assignment  Identifier [ [ Expression ] ]=Expression ; IfStatement  if (Expression ) Statement [else Statement ] WhileStatement  while (Expression ) Statement

C Grammar: Expressions Expression  Conjunction {|| Conjunction } Conjunction  Equality {&& Equality } Equality  Relation [ EquOp Relation ] EquOp  == | != Relation  Addition [ RelOp Addition ] RelOp  < | <= | > | >= Addition  Term { AddOp Term } AddOp  + | - Term  Factor { MulOp Factor } MulOp  * | / | % Factor  [ UnaryOp ]Primary UnaryOp  - | ! Primary  Identifier [[ Expression ]] | Literal | ( Expression ) | Type ( Expression )

Describing Syntax: Terminology • A sentenceis a string of characters over some alphabet • A language is a set of sentences • Alexemeis the lowest level syntactic unit of a language (e.g., *, sum, begin) • A tokenis a category of lexemes (e.g., identifier)

Formal Definition of Languages • Recognizers • A recognition device reads input strings of the language and decides whether the input strings belong to the language • Example: syntax analysis part of a compiler • Generators • A device that generates sentences of a language • One can determine if the syntax of a particular sentence is correct by comparing it to the structure of the generator

Lexical Analysis • The process of converting a character stream into a corresponding sequence of meaningful symbols (called tokens or lexemes) is called tokenizing, lexing or lexical analysis. A program that performs this process is called a tokenizer, lexer, or scanner. • In Scheme, we tokenize (set! x (+ x 1)) as ( set! x ( + x 1 ) )‏ • Similarly, in Java, we tokenizeSystem.out.println("Hello World!"); as System . out . println ( "Hello World!" ) ;

Lexical Analysis • Lexical analyzer splits it into tokens • Token = sequence of characters (symbolic name) representing a single terminal symbol • Identifiers: myVariable … • Literals: 123 5.67 true … • Keywords: char sizeof … • Operators: + - * / … • Punctuation: ; , } { … • Discards whitespace and comments

Examples of Tokens in C

Examples of Tokens in C • Lexical analyzer usually represents each token by a unique integer code • “+” { return(PLUS); } // PLUS = 401 • “-” { return(MINUS); } // MINUS = 402 • “*” { return(MULT); } // MULT = 403 • “/” { return(DIV); } // DIV = 404 • Some tokens require regular expressions • [a-zA-Z_][a-zA-Z0-9_]* { return (ID); } // identifier • [1-9][0-9]* { return(DECIMALINT); } • 0[0-7]* { return(OCTALINT); } • (0x|0X)[0-9a-fA-F]+ { return(HEXINT); }

Reserved Keywords in C • auto, break, case, char, const, continue, default, do, double, else, enum, extern, float, for, goto, if, int, long, register, return, short, signed, sizeof, static, struct, switch, typedef, union, unsigned, void, volatile, wchar_t, while • C++ added a bunch: bool, catch, class, dynamic_cast, inline, private, protected, public, static_cast, template, this, virtual and others • Each keyword is mapped to its own token

Whitespace Whitespace is any space, tab, end-of-line character (or characters), or character sequence inside a comment No token may contain embedded whitespace (unless it is a character or string literal) Example: >= one token > = two tokens

program confusing; const true = false; begin if (a<b) = true then f(a) else … Redefining Identifiers can be dangerous

Parsing Process • Call the scanner to get tokens • Build a parse tree from the stream of tokens • A parse tree shows the syntactic structure of the source program. • Add information about identifiers in the symbol table • Report error, when found, and recover from the error

Grammars • A meta-language is a language used to define other languages • A grammar is a meta-language used to define the syntax of a language. It consists of: • Finite set of terminal symbols • Finite set of non-terminal symbols • Finite set of production rules • Start symbol • Language = (possibly infinite) set of all sequences of symbols that can be derived by applying production rules starting from the start symbol Backus-Naur Form (BNF)

Describing Syntax • Backus-Naur Form and Context-Free Grammars (BNF) • Most widely known method for describing programming language syntax • Extended BNF • Improves readability and writability of BNF

Backus-Naur Form (BNF) • Backus-Naur Form (1959) • Invented by John Backus to describe Algol 58 • BNF is equivalent to context-free grammars • BNF is a meta-language used to describe another language • In BNF, abstractions are used to represent classes of syntactic structures--they act like syntactic variables (also called non-terminal symbols)

Consider the grammar: binaryDigit 0 binaryDigit 1 or equivalently: binaryDigit 0 | 1 Here, | is a metacharacter that separates alternatives. Example: Binary Digits

Example: Decimal Numbers • Grammar for unsigned decimal integers • Terminal symbols: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 • Non-terminal symbols: Digit, Integer • Production rules: • Integer  Digit | Integer Digit • Digit  0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 • Start symbol: Integer • Can derive any unsigned integer using this grammar • Language = set of all unsigned decimal integers

Derivation of 352 as an Integer Production rules: Integer  Digit | Integer Digit Digit  0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 Integer  Integer Digit  Integer 2  Integer Digit 2  Integer 5 2  Digit 5 2  3 5 2 Rightmost derivation At each step, the rightmost non-terminal is replaced

+ F id id E * id Parse Tree • A labeled tree in which • the interior nodes are labeled by nonterminals • leaf nodes are labeled by terminals • the children of an interior node represent a replacement of the associated nonterminal in a derivation • corresponding to a derivation E

Parse Tree for 352 as an Integer

st ifStatement if elsePart st ( exp ) if true st return else st true return Abstract Syntax Tree for If Statement

Parse of the String 5-4+3

BNF Fundamentals • Non-terminals: BNF abstractions • Terminals: lexemes and tokens • Grammar: a collection of rules • Examples of BNF rules: <ident_list> → identifier | identifier, <ident_list> <if_stmt> → if <logic_expr> then <stmt>

Regular Expressions • x character x • \x escaped character, e.g., \n • { name } reference to a name • M | N M or N • M N M followed by N • M* 0 or more occurrences of M • M+ 1 or more occurrences of M • [x1 … xn] One of x1 … xn • Example: [aeiou] – vowels, [0-9] - digits

Derivation • A derivation is a repeated application of rules, starting with the start symbol and ending with a sentence (all terminal symbols), e.g., <program> => <stmts> => <stmt> => <var> = <expr> => a = <expr> => a = <term> + <term> => a = <var> + <term> => a = b + <term> => a = b + const

S  SS | (S)S | () | S  SS  (S)SS (S)S(S)S  (S)S(())S  ((S)S)S(())S  ((S)())S(())S  ((())())S(())S  ((())()) (())S  ((())())(()) E  E O E | (E) | id O  + | - | * | / E  E O E  (E) O E  (E O E) O E * ((E O E) O E) O E  ((id O E)) O E) O E  ((id + E)) O E) O E  ((id + id)) O E) O E * ((id + id)) * id) + id Examples

Each step of the derivation is a replacement of the leftmost nonterminals in a sentential form. E E O E  (E) O E  (E O E) O E  (id O E) O E  (id + E) O E  (id + id) O E  (id + id) * E  (id + id) * id Each step of the derivation is a replacement of the rightmost nonterminals in a sentential form. E  EO E  E O id E * id  (E) * id  (E O E) * id  (E O id) * id  (E + id) * id  (id + id) * id Leftmost Derivation Rightmost Derivation

Parse Tree • A hierarchical representation of a derivation <program> <stmts> <stmt> <var> = <expr> a <term> + <term> <var> const b

CFG For Floating Point Numbers ::= stands for production rule; <…> are non-terminals; | represents alternatives for the right-hand side of a production rule Sample parse tree:

Ambiguity in Expressions • Which operation is to be done first? • solved by precedence • An operator with higher precedence is done before one with lower precedence. • An operator with higher precedence is placed in a rule (logically) further from the start symbol. • solved by associativity • If an operator is right-associative (or left-associative), an operand in between 2 operators is associated to the operator to the right (left). • Right-associated : W + (X + (Y + Z)) • Left-associated : ((W + X) + Y) + Z

Associativity and Precedence • A grammar can be used to define associativity and precedence among the operators in an expression. E.g., + and - are left-associative operators in mathematics; * and / have higher precedence than + and - . • Consider the grammar G1: Expr -> Expr + Term | Expr – Term | Term Term -> Term * Factor | Term / Factor | Term % Factor | Factor Factor -> Primary ** Factor | Primary Primary -> 0 | ... | 9 | ( Expr )

E F F * F X F * F X X ( E ) id4 id3 E + E F F X X id1 id2 Precedence (cont’d) E E + E | E - E | F F  F * F | F / F | X X ( E ) | id (id1 + id2) * id3 * id4

E E * E E + E E E E id3 + E E * id1 id2 id1 id2 id3 E E + E F F F * F id1 id2 id3 Precedence E E + E E  E - E E  E * E E  E / E E  id E E + E E  E - E E  F F F * F F F / F Fid

Precedence Associativity Operators 3 right ** 2 left * / % \ 1 left + - Note: These relationships are shown by the structure of the parse tree: highest precedence at the bottom, and left-associativity on the left at each level. Associativity and Precedence for Grammar

Associativity of Operators • Operator associativity can also be indicated by a grammar <expr> -> <expr> + <expr> | const (ambiguous) <expr> -> <expr> + const | const (unambiguous) <expr> <expr> <expr> + const <expr> + const const

E F X / F id4 * X F X id3 ( E ) + E F F X X id2 id1 Associativity • Left-associative operators E E + F | E - F | F F F * X | F / X | X X ( E ) | id (id1 + id2) * id3 / id4 = (((id1 + id2) * id3) / id4)

Ambiguity in Grammars • A grammar is ambiguous if and only if it generates a sentential form that has two or more distinct parse trees

With which ‘if’ does the following ‘else’ associate? if (x < 0) if (y < 0) y = y - 1; else y = 0; Answer: either one! Example of Dangling Else

An Ambiguous Expression Grammar <expr>  <expr> <op> <expr> | const <op>  / | - <expr> <expr> <expr> <op> <expr> <expr> <op> <op> <expr> <expr> <op> <expr> <expr> <op> <expr> const - const / const const - const / const

E E * E E + E E E E id3 + E E * id1 id2 id1 id2 id3 Example: Ambiguous Grammar E E + E E E - E E E * E E  E / E E  id

Solving the dangling else ambiguity • Algol 60, C, C++: associate each else with closest if; use {} or begin…end to override. • Algol 68, Modula, Ada: use explicit delimiter to end every conditional (e.g., if…fi) • Java: rewrite the grammar to limit what can appear in a conditional: IfThenStatement -> if ( Expression ) Statement IfThenElseStatement -> if ( Expression )StatementNoShortIf else Statement The category StatementNoShortIf includes all except IfThenStatement.

Ambiguity in Expressions • Which operation is to be done first? • solved by precedence • An operator with higher precedence is done before one with lower precedence. • An operator with higher precedence is placed in a rule (logically) further from the start symbol. • solved by associativity • If an operator is right-associative (or left-associative), an operand in between 2 operators is associated to the operator to the right (left). • Right-associated : W + (X + (Y + Z)) • Left-associated : ((W + X) + Y) + Z

An Unambiguous Expression Grammar • If we use the parse tree to indicate precedence levels of the operators, we cannot have ambiguity <expr>  <expr> - <term> | <term> <term>  <term> / const| const <expr> <expr> - <term> <term> <term> / const const const

Extended BNF • Optional parts are placed in brackets [ ] <proc_call> -> ident [(<expr_list>)] • Alternative parts of RHSs are placed inside parentheses and separated via vertical bars <term> → <term>(+|-) const • Repetitions (0 or more) are placed inside braces { } <ident> → letter {letter|digit}

Syntax and Semantics in Programming Languages

Syntax and Semantics in Programming Languages

Presentation Transcript

Java Syntax and Semantics

Syntax and Semantics

Syntax and Semantics

(SYNTAX-DRIVEN) SEMANTICS

Syntax and Semantics

Syntax and Semantics

Syntax Simple Semantics

Introduction: syntax and semantics

Syntax and Semantics

SYNTAX and SEMANTICS

Syntax and Semantics

Semantics 2: Syntax-Semantics Interface

Implementation, Syntax, and Semantics

Syntax and Semantics

Syntax != Semantics

Syntax-Semantics Mapping

C++ Syntax and Semantics

Syntax and semantics of C#.

Syntax Simple Semantics

Syntax and Semantics of Prolog

Syntax versus Semantics

Describing Syntax and Semantics

Sea Ice

Sea Ice