Chapter 3

Chapter 3 Describing Syntax and Semantics

Syntax and Semantics • Syntax of a programming language: the form of its expressions, statements, and program units • Semantics: the meaning of the expressions, statements, and program units • Syntax and semantics provide a language’s definition

An Example of Syntax and Semantic • The syntax of a Java while statement: while (<boolean_expr>) <statement> • The semantic of the above statement: when the current value of the Boolean expression is true, the embedded statement is executed; otherwise, control continues after the while construct. Then control implicitly returns to the Boolean expression to repeat the process

Description Easiness • Describing syntax is easier than describing semantics • partly because a concise and universally accepted notation is available for syntax description • but none has yet been developed for semantics

Definitions of Languages and Sentences • A language, whether natural (such as English) or artificial (such as Java), is a set of strings of characters from some alphabet. • The strings of a language are called sentences or statements.

Syntax Rules of a Language • The syntax rules of a language specify which strings of characters from the language’s alphabet are in the language.

Definitions of Lexemes • The lexeme of a programming language include its • numeric literals • operators • and special words, among others. • (e.g., *, sum, begin) • Alexemeis the lowest level syntactic unit of a language: • for simplicity’s sake, formal descriptions of the syntax of programming languages often do not include description of them • However the description of lexemes can be given by a lexical specification which is usually separate from the syntactic description of the language.

An Informal Definition of a Program • A program could be thought as strings of lexemes rather than of characters.

Lexeme Group • Lexemes are partitioned into groups. • For example, in a programming language, the names of variables, methods, classes, and so forth form a group called identifier. • Each of these groups is represented by a name, or token.

Definitions of Tokens • A tokenis a category of lexemes (e.g., identifier) • a token may have only a single lexeme • For example, the token for the arithmetic operator symbol +, which may have the name plus_op, has just one possible lexeme. • a lexeme of a token can also be called an instance of the token

Example of Tokens • Identifier is a token that can have lexemes, or instances, such as sum and total.

Lexemes and Tokens [csusb], [Lee] • Lexemes: a string of characters in a language that is treated as a single unit in the syntax and semantics. • For example identifiers, numbers, and operators are often lexemes • In a programming language, there are a very large number of lexemes, perhaps even an infinite number; however, there are only a small number of tokens.

Examples of Lexemes and Tokens [Lee] • while (y >= t) y = y - 3 ; will be represented by the set of pairs: a token with only one lexeme

Ways to Define a Language • In general, languages can be formally defined in two distinct ways: by recognition and by generation. • P.S.: Although neither provides a definition that is practical by itself for people trying to lean or even use a programming language.

Language Recognizer • a recognition device that • reads strings of characters from the alphabet of a language and • decides whether an input string was or was not in the language • Example: syntax analysis part of a compiler is a recognizer for the language the compiler translates • are not used to enumerate all of the sentences of a language

Language Generators • a device that generates the sentences of a language. • We can think of the generator as having a button that produces a sentence of the language every time it is pushed • However, the particular sentence that is produced by a generator when its button is pushed is unpredictable • One can determine if the syntax of a particular sentence is correct by comparing it to the structure of the generator • People learn a language from examples of its sentences

Grammars [wiki] • In computer science and linguistics, a formal grammar, or sometimes simply grammar, is a precise description of a formal language — that is, of a set of strings. • Commonly used to describe the syntax of programming languages

Formal Grammars [wiki] • A formal grammar, or sometimes simply grammar, consists of: • a finite set of terminal symbols; • a finite set of nonterminal symbols; • a finite set of production rules with a left- and a right-hand side consisting of a sequence of these symbols • a start symbol.

Grammar Example [wiki] • Nonterminals are usually represented by uppercase letters • terminals by lowercase letters • the start symbol by S. • For example, the grammar with • terminals {a,b}, • nonterminals {S,A,B}, • production rules • SABS • S ε (where ε is the empty string) • BAAB • BS b • Bb bb • Abab • Aa aa • start symbol S, • defines the language of all words of the form anbn (i.e. n copies of a followed by n copies of b).

Formal Grammars and Formal Languages [wiki] • A formal grammar defines (or generates) a formal language, which is a (possibly infinite) set of sequences of symbols that may be constructed by applying production rules to a sequence of symbols which initially contains just the start symbol.

Grammar Classes [wiki] • In the mid-1950s, according to the format of the production rules, Chomsky described four classes of generative devices or grammars that define four classes of languages: • recursively enumerable • context-sensitive • context-free • regular

Application of Context-free and Regular Grammars • The tokens of programming languages can be described by regular grammars. • Whole programming languages, with minor exceptions, can be described by context-free grammars.

Review of Context-Free Grammars • Context-Free Grammars • Developed by Noam Chomsky in the mid-1950s • Language generators, meant to describe the syntax of natural languages • Define a class of languages called context-free languages

Terminology - metalanguage • A metalanguage is a language that is used to describe another language.

Backus-Naur Form (BNF) • Backus-Naur Form (1959) • Invented by John Backus to describe Algol 58 • Most widely known method for describing programming language syntax • BNF is equivalent to context-free grammars • BNF is a metalanguage used to describe another language

Extended BNF • Extended BNF • Improves readability and writability of BNF

BNF Abstraction • BNF uses abstractions for syntactic structures • For example, a simple Javaassignment statement might be represented by the abstraction <assign>. • (pointed brackets are often used to delimit names of abstractions) • In fact, tokens are also abstractions.

Synonym Nonterminals symbols (nonterminals) BNF abstractions Terminals Lexemes and tokens

BNF Rules • A rule, also called a production, has a left-hand side (LHS) and a right-hand side (RHS), and consists of terminal and nonterminal symbols. • An abstraction is defined by a rule. • Examples of BNF rules: <ident_list> → identifier | identifier,<ident_list> <if_stmt> → if <logic_expr> then <stmt> • A grammar is a finite nonempty set of rules.

Multiple Definitions of a Nonterminal • Nonterminal symbols can have two or more distinct definitions, representing two or more possible syntactic forms in the language. • Multiple definitions can be written as a single rule, with different definitions separated by the symbol |.

Example of Multiple Definitions of a Nonterminal • For example, an Adaif statement can be described with the rules: <if_stmt> if <logic_expr> then <stmt> <if_stmt> if <logic_expr> then <stmt> else <stmt> Or with the rule <if_stmt> if <logic_expr> then <stmt> | if <logic_expr> then <stmt> else <stmt>

A Rule Example <assign>  <var> = <expression> • The above rule specifies that the abstraction <assign> is defined as an instance of the of the abstraction <var>, followed by the lexeme =, followed by an instance of the abstraction <expression>. • One example whose syntactic structure is described by the rule is: total = subtotal1 + subtotal2

Expressive Capability of BNF • Although BNF is simple, it is sufficiently powerful to describe nearly all of the syntax of programming languages. • BNF can describe • Lists of similar construct • The order in which different constructs must appear • Nested structures to any depth • Imply operator precedence • Imply operator associativity

Variable-length Lists and Recursive Rules • Variable-length lists in mathematics are often written using an ellipsis (…). • 1,2, … is an example. • BNF does not include the ellipsis, it uses recursive rules as an alternative. • A rule is recursive if its LHS appears in its RHS.

Example of a Recursive Rule • Syntactic lists are described using recursion <ident_list>  ident | ident, <ident_list> the above rule defines <ident_list> as either a single token (identifier) or an identifier followed by a comma followed by another instance of <ident_list>.

Grammar and Derivation • A grammar is a generative device for defining languages. • The sentences of a languages are created through deviations. • A derivation is a repeated application of rules, starting with a special nonterminal of the grammar called the start symbol and ending with a sentence (all terminal symbols)

An Example of Start Symbols • In a grammar for a complete programming language, the start symbol represents a complete program and is usually named <program>.

Example 3.1 - An Example Grammar • What follows is a grammar for a small language. P.S.:<program> is the start symbol.

Language Defined by the Grammar in Example 3.1 • The language in Example 3.1 has only one statement form: assignment. • A program consists of the special word begin, followed by a list of statements separated by semicolons, followed by the special word end. • An expression is either a single variable, or two variables separated by either a + or – operator. • The only variable names in this language are A, B, and C.

An Example Derivation <program>=> begin <stmt_list> end => begin <stmt> ; <stmt_list> end => begin <var> = <expression>; <stmt_list> end => begin A = <expression> ; <stmt_list> end => begin A = <var> + < var > ; <stmt_list> end => begin A = B + <var> ; <stmt_list> end => begin A = B + C ; <stmtjist> end => begin A = B + C ; <stmt> end => begin A = B + C ; <var> := <expression> end => begin A=B + C;B=<expression> end => begin A = B + C ;B= <var> end => begin A = B + C ; B = C end derive

Sentential Form • In the previous slide, each successive string in the sequence is derived from the previous string by replacing one of the nonterminals with one of that nonterminal’s definitions. • Every string of symbols in the derivation is a sentential form.

Order of Derivation • A leftmost derivation is one in which the leftmost nonterminal in each sentential form is the one that is expanded. • The derivation continues until the sentential form contains no nonterminals. • The sentential form, consisting of only terminals, or lexemes, is the generated sentence. • Rightmost derivation • A derivation may be neither leftmost nor rightmost. • However the derivation order has no effect on the language generated by a grammar.

Using the Grammar in Example 3.1 to Generate Different Sentences • By choosing alternative RHSs of rules with which to replace nonterminals in the derivation, different sentences in the language can be generated. • By exhaustively choosing all combinations of choices, the entire language can be generated. • This language, like most others, is infinite, so one cannot generate all the sentences in the language in finite time.

Example 3.2 • This grammar describes assignment statements whose right sides are arithmetic expressions with multiplication and addition operators and parentheses.

The Leftmost Derivation for A=B*(A+C) <assign> => <id> = <expr> => A = <expr> => A = <id> * <expr> => A = B * <expr> => A = B * (< expr >) => A = B * (<id> + <expr>) => A = B * (A + <expr>) => A = B * (A + <id>) => A = B * (A + C)

Parse Trees • One of the most attractive features of grammars is that they naturally describe the hierarchical syntactic structure of the sentences of the languages they define. • These hierarchical structures are called parse trees.

Figure 3.1 - a Parse Tree for A=B*(A+C) • The right side figure is a parse tree. • This parse tree shows the structure of the assignment statement derived previously.

A Derivation and Its Corresponding Parse Tree <assign> => <id> = <expr> => A = <expr> => A=<id>*<expr> => A = B * <expr> => A = B * (< expr >) => A = B* ( <id> + <expr> ) => A = B* (A+<expr> ) => A = B*<A+<id> ) => A=B*(A+C)

Mapping between a Grammar and Its Corresponding Parse Tree • Every internal node of a parse tree is labeled with a nonterminal symbol. • Every leaf is labeled with a terminal symbol. • Every subtree of a parse tree describes one instance of an abstraction in the statement.

Ambiguity in Grammars • A grammar is ambiguous if and only if it generates a sentential form that has two or more distinct parse trees.

Chapter 3

Chapter 3

Presentation Transcript

Chapter 3

Chapter 3

Chapter 3

Chapter 3

Chapter 3

chapter 3

CHAPTER 3-3

Chapter 3-3

Chapter 3

Chapter 3 Chapter 3

CHAPTER 3

Chapter 3

Chapter 3

Chapter 3

Chapter 3

Chapter 3

Chapter 3

Chapter 3

Chapter 3

Chapter 3

CHAPTER 3

Chapter 3