CS 152, Programming Paradigms Fall 2012, SJSU

CS 152, Programming Paradigms, Fall 2012, SJSU, Jeff Smith

Programs as text strings • We can think of programs as strings of • characters, or • tokens (substrings similar to words), or • token types (or typed tokens) • We’ll consider (untyped) tokens first. • So programs are finite sequences of tokens • as English sentences are sequences of words • And languages are sets of programs

Identifying tokens • Splitting a program into tokens is called • lexical analysis, or • lexing, or • scanning • A scanning algorithm may give each token a type, e.g. • identifier, integer literal, addition operator • We’ll touch on scanning algorithms later.

Languages and grammars • A (formal) language is a set of finite strings over a finite alphabet. • An alphabet is a finite set of symbols, e.g. • the ASCII characters • the Unicode characters • the legal token types of Java • but not the infinite set of legal tokens of Java!

Programs and languages • A string is a program in programming language L iff it is a member of L. • So how does one determine whether a string is a member of a language L? • The answer depends on the notions of constituency and linear order.

Grammar rules • A language can be defined in terms of rules (or productions). • These rules specify the constituents of its members (and of their constituents), and the order in which they must appear. • Constituents without their own constituents are called terminals. Terminals must be members of the alphabet.

A sample grammar rule • The rule • <program> ::= begin <block> end . • says that a program may consist of • the terminal "begin" • followed by a block constituent, • followed by the terminal "end" • followed by the terminal "." • Here, terminals correspond to tokens • or more precisely, to token types

Grammar rules and constituents • Terminals like begin and . and end • correspond to token types with just one instance • The constituency of <block> would be given by one or more other rules of the grammar

Grammars • A grammar specifies a language by • listing the legal terminal symbols, • listing the legal nonterminal symbols, • (e.g., <program>, <block>), • listing the rules, and • saying which nonterminal is the start symbol • and thus represents a member of the language

Context-free grammars • The only grammars we will consider are context-free grammars (CFGs). • In a CFG, each rule has • a nonterminal on its left-hand side (LHS) • a string of symbols (terminals or nonterminals) on its right-hand side (RHS).

Notation for rules • Nonterminals may be distinguished from terminals by • delimiting them with angle brackets, or • beginning them with a capital letter, or writing them in italics, or • printing terminals in bold face. • The LHS and RHS of a rule are separated by a "->" or "::=" symbol.

Notation for combining rules • Rules with the same LHS represent optionality, e.g. • <operator> ::= + • <operator> ::= - • Such rules may be combined using a vertical bar convention, e.g. • <operator> ::= + | - • Any number of rules with the same LHS can be combined this way.

Terminology for rules • Grammar rules are sometimes called productions. • Nonterminal symbols are sometimes called variables.

Metasymbols and BNF • In the rule • <operator> ::= + | - • the vertical bar is a metasymbol • it’s neither a terminal nor a nonterminal. • The notational conventions described above are called Backus-Naur form, or BNF.

Grammars and parsing • Any CFG G defines a language L(G) • L(G) is the set of strings that can be generated from G’s start symbol, consistent with G’s rules • Proving that a string is in L(G) is called parsing. • Such a proof may involve either a derivation or a parse tree.

Derivations • In this course we won’t talk much about derivations. • But intuitively, a derivation is a sequence of strings (cf. Figure 5.3, p 210) such that • the first string is the start symbol • the last string is the string to be parsed • every string can be obtained from its predecessor by a rewriting step that is licensed by a rule of the grammar
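
The three conditions on a derivation can be checked mechanically. The following is a minimal sketch, assuming a grammar is stored as a Python dictionary from nonterminals to lists of right-hand sides; the names `GRAMMAR`, `is_step`, and `is_derivation` are illustrative and not from the course text. The grammar used is the unambiguous expression grammar that appears on a later slide.

```python
# Each nonterminal maps to a list of RHSs; each RHS is a list of symbols.
# Any symbol that is not a key of the dictionary is treated as a terminal.
GRAMMAR = {
    "E": [["T"], ["E", "+", "T"]],
    "T": [["F"], ["T", "*", "F"]],
    "F": [["x"], ["y"], ["(", "E", ")"]],
}

def is_step(before, after, grammar):
    """True if `after` is `before` with one nonterminal rewritten by one rule."""
    for i, symbol in enumerate(before):
        for rhs in grammar.get(symbol, []):
            if before[:i] + rhs + before[i + 1:] == after:
                return True
    return False

def is_derivation(strings, start, target, grammar):
    """Check: first string is the start symbol, last is the target,
    and every string follows from its predecessor by one rewriting step."""
    return (strings[0] == [start]
            and strings[-1] == target
            and all(is_step(a, b, grammar)
                    for a, b in zip(strings, strings[1:])))

# A leftmost derivation of x + y * x, one rewriting step per line.
derivation = [s.split() for s in [
    "E", "E + T", "T + T", "F + T", "x + T",
    "x + T * F", "x + F * F", "x + y * F", "x + y * x",
]]

print(is_derivation(derivation, "E", "x + y * x".split(), GRAMMAR))  # True
```

Skipping a step (for example, going from `E` straight to `x`) makes `is_step` fail, since no single rule licenses it.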

Parse trees • In a parse tree (cf. L&L, Section 6.3), • parent nodes correspond to LHSs of rules • children correspond (in order) to RHSs, • leaves correspond to terminals. • The string of leaves (from left to right) is the yield of the parse tree.
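
As an illustration of the yield, here is a small sketch that assumes a parse tree is encoded as nested `(label, children)` tuples with plain strings as leaves. This encoding and the name `tree_yield` are assumptions made for the example, not notation from L&L.

```python
# Parse tree for x + y * x under the rules
#   E -> T | E + T,  T -> F | T * F,  F -> x | y | ( E )
# An interior node is (LHS label, list of children); a leaf is a terminal string.
tree = ("E", [
    ("E", [("T", [("F", ["x"])])]),
    "+",
    ("T", [
        ("T", [("F", ["y"])]),
        "*",
        ("F", ["x"]),
    ]),
])

def tree_yield(node):
    """Collect the leaves of the tree from left to right."""
    if isinstance(node, str):
        return [node]
    _label, children = node
    leaves = []
    for child in children:
        leaves.extend(tree_yield(child))
    return leaves

print(tree_yield(tree))  # ['x', '+', 'y', '*', 'x']
```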

Identifying tokens (again) • We haven’t yet said how to identify tokens, given a program as a character string. • It would be simplest to require tokens to be • single characters, or • delimited by whitespace characters, or • of bounded length • But these restrictions are rare • they greatly inconvenience programmers

Grammars for token types • CFGs may be used for token types, e.g. • <identifier> ::= <letter> • <identifier> ::= <letter> <identifier> • would recognize nonempty strings of letters • But this can introduce ambiguity • e.g., is the string doif 1 token? 2? 3? 4?
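
To make the ambiguity concrete, the sketch below enumerates every way of cutting a string of letters into nonempty pieces, each of which the two <identifier> rules would accept. The function name `splits` is hypothetical.

```python
def splits(s):
    """All ways to cut s into consecutive nonempty pieces
    (for the empty string, the single way is no pieces at all)."""
    if not s:
        return [[]]
    result = []
    for i in range(1, len(s) + 1):
        for rest in splits(s[i:]):
            result.append([s[:i]] + rest)
    return result

candidates = splits("doif")
print(len(candidates))                         # 8
print(sorted({len(c) for c in candidates}))    # [1, 2, 3, 4]
```

So the grammar alone allows one, two, three, or four tokens; a scanner needs an extra convention to pick one of the eight readings.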

Scanning in real parsers • It’s generally efficient for a parser to have a special preprocessing step for scanning. • Scanning algorithms are simplified parsers • using CFGs for token types • such CFGs are generally simple • for details, see CS 154 • and its treatment of regular expressions

Scanning issues • It’s common for scanners to • work left to right • treat whitespace characters as delimiters • otherwise disambiguate by choosing the longest of the possible tokens • determine a type for each token • Types may be singleton types or not, e.g. • a token if might be in a type by itself • a token type identifier might have infinitely many instances
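
The conventions above can be sketched as a small scanner. This is only an illustration under stated assumptions: the token types in `TOKEN_TYPES` are hypothetical, and ties between equally long matches are broken in favor of the type listed first, which is one common choice rather than something the slides prescribe.

```python
import re

# Hypothetical token types. Keywords come first so that they win ties
# against the identifier type when both match the same substring.
TOKEN_TYPES = [
    ("if", r"if"),
    ("do", r"do"),
    ("identifier", r"[A-Za-z]+"),
    ("integer_literal", r"[0-9]+"),
    ("addition_operator", r"\+"),
]

def scan(text):
    """Scan left to right, returning a list of (type, token) pairs."""
    tokens = []
    pos = 0
    while pos < len(text):
        if text[pos].isspace():           # whitespace only delimits tokens
            pos += 1
            continue
        best_type, best_token = None, ""
        for token_type, pattern in TOKEN_TYPES:
            m = re.compile(pattern).match(text, pos)
            # Only a strictly longer match replaces the current best,
            # so the longest token wins and earlier types win ties.
            if m and len(m.group()) > len(best_token):
                best_type, best_token = token_type, m.group()
        if best_type is None:
            raise SyntaxError(f"unexpected character {text[pos]!r} at {pos}")
        tokens.append((best_type, best_token))
        pos += len(best_token)
    return tokens

print(scan("doif + if 42"))
# [('identifier', 'doif'), ('addition_operator', '+'), ('if', 'if'), ('integer_literal', '42')]
```

Here `doif` comes out as a single identifier because the longest candidate is preferred, while a standalone `if` lands in its singleton type.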

Categories of tokens and token types • Keywords • reserved words, predefined identifiers • generally in a type by themselves • Literals (cf. constants) • numeric, string, Boolean, array, enumeration members, lists, … • Identifiers • for variables, functions, data types, …

Typed tokens • Grammar symbols representing token types • are sometimes called preterminals • are treated as terminals by the parser • the tokens themselves are to be recognized by a scanner • may appear as nonleaves in parse trees • with a single child representing the token

Why "lexical"? • CFGs for English can have preterminals • especially for lexical categories • e.g., N (for nouns), V (for verbs), … • Rules for these preterminals form a lexicon • a list of words labeled with their categories • This allows for a simple CFG for English • or at least a healthy fragment of English

A CFG for a fragment of English • S -> NP VP • NP -> Det N • NP -> Det N PP • PP -> P NP • VP -> V NP • Here N, V, P, and Det are preterminals. • the sentence the dog chased a cat would look like Det N V Det N to the parser.
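
A lexicon can be sketched as a simple lookup table. The entries below are an assumed toy lexicon (only the words of the example sentence plus one preposition), and each word is given a single category, which sidesteps words that belong to more than one.

```python
# Toy lexicon: each word is labeled with its preterminal (lexical category).
LEXICON = {
    "the": "Det", "a": "Det",
    "dog": "N", "cat": "N",
    "chased": "V",
    "with": "P",
}

def preterminals(sentence):
    """Replace each word by its category, as the parser would see it."""
    return [LEXICON[word] for word in sentence.split()]

print(preterminals("the dog chased a cat"))  # ['Det', 'N', 'V', 'Det', 'N']
```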

EBNF • Occasionally, certain extensions to BNF notation are convenient. • The term EBNF (for extended Backus-Naur form) is used to cover these extensions. • These extensions introduce new metasymbols, given below with their interpretations.

EBNF constructions • ( ) parentheses, for removing ambiguity, • e.g., (a|b)c vs. a|bc • [ ] brackets, for optionality • 0 or 1 times • { } braces, for indefinite repetition • 0 or more times • Sometimes the first of these is considered part of ordinary BNF.

A very simple grammar • S -> x | x S S • Here, the single terminal represents a token. • This grammar generates all strings of x's of odd length.
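
The odd-length claim can be spot-checked by computing which lengths the grammar can produce. Since x is the only terminal, a string is determined by its length. The sketch below (function name `lengths` is hypothetical) closes the set {1} under the second rule, up to a bound.

```python
def lengths(limit):
    """Lengths up to `limit` of strings generated by S -> x | x S S."""
    found = {1}                          # S -> x
    changed = True
    while changed:
        changed = False
        for a in list(found):
            for b in list(found):
                n = 1 + a + b            # S -> x S S
                if n <= limit and n not in found:
                    found.add(n)
                    changed = True
    return found

print(sorted(lengths(11)))  # [1, 3, 5, 7, 9, 11]
```

The reason is visible in the arithmetic: 1 plus two odd numbers is odd, and every odd n greater than 1 equals 1 + 1 + (n - 2).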

Grammars for algebraic expressions • An ambiguous grammar G: • E -> E + E | E * E | ( E ) | x | y • Here the parentheses aren’t metasymbols • They are terminal symbols of the grammar • Like the other terminals, they represent tokens • An unambiguous grammar for L(G) • E -> T | E + T • T -> F | T * F • F -> x | y | ( E )

Ambiguity • Ambiguity is a property of grammars • A language can have both ambiguous and unambiguous grammars – see the previous slide • A grammar G is ambiguous iff some string in L(G) has two or more legal parse trees • The second grammar on that slide resolves the ambiguity in both associativity and precedence • cf. L&L, Section 6.4
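
One way to see the definition at work is to count parse trees for a given string. The following is a sketch, not an efficient parser: it assumes the grammar has no empty right-hand sides and no cycles of single-symbol rules (true of both expression grammars on the previous slide), and the name `count_trees` is made up for the example.

```python
from functools import lru_cache

AMBIGUOUS = {
    "E": [["E", "+", "E"], ["E", "*", "E"], ["(", "E", ")"], ["x"], ["y"]],
}
UNAMBIGUOUS = {
    "E": [["T"], ["E", "+", "T"]],
    "T": [["F"], ["T", "*", "F"]],
    "F": [["x"], ["y"], ["(", "E", ")"]],
}

def count_trees(grammar, start, tokens):
    """Number of parse trees with root `start` whose yield is `tokens`."""
    tokens = tuple(tokens)

    @lru_cache(maxsize=None)
    def symbol(sym, i, j):
        # Trees rooted at sym covering tokens[i:j], where i < j.
        if sym not in grammar:                        # terminal
            return 1 if j == i + 1 and tokens[i] == sym else 0
        return sum(sequence(tuple(rhs), i, j) for rhs in grammar[sym])

    @lru_cache(maxsize=None)
    def sequence(rhs, i, j):
        # Ways for the symbols of rhs, in order, to cover tokens[i:j].
        if not rhs:
            return 1 if i == j else 0
        first, rest = rhs[0], rhs[1:]
        # Each symbol must cover at least one token (no empty RHSs).
        return sum(symbol(first, i, k) * sequence(rest, k, j)
                   for k in range(i + 1, j - len(rest) + 1))

    return symbol(start, 0, len(tokens))

print(count_trees(AMBIGUOUS, "E", "x + y * x".split()))    # 2
print(count_trees(UNAMBIGUOUS, "E", "x + y * x".split()))  # 1
```

The two trees under the first grammar correspond to grouping as (x + y) * x and as x + (y * x); the second grammar admits only the latter.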

A grammar for a simple class of identifiers • <identifier> ::= <nondigit> • <identifier> ::= <identifier> <nondigit> • <identifier> ::= <identifier> <digit> • Note the absence of rules for <digit> and <nondigit> • These are preterminal symbols • the corresponding token types are to be identified by the scanner rather than the parser

C language if statements (cf. Kernighan & Ritchie) • <selection-statement> ::= • if ( <expression> ) <statement> • [ else <statement> ] | … • <statement> ::= • <compound-statement> | … • <compound-statement> ::= • { [<declaration-list>] [<statement-list>] } • Here the braces are terminal symbols

Ada’s if statements • <if-statement> ::= • if <boolean-condition> then • <sequence-of-statements> • { elsif <boolean-condition> then • <sequence-of-statements> } • [else <sequence-of-statements>] • end if ;

General statements in Ada • <statement> ::= • null | • <assignment-statement> | • <if-statement> | • <loop-statement> | ... • <sequence-of-statements> ::= • <statement> { <statement> }

Translation steps (idealized) • character string • lexical analysis (scanning, tokenizing) • string of tokens • syntactic analysis (parsing) • parse tree (or syntax tree) • semantic analysis, ...

A BNF grammar for the Scheme language • <expression> ::= <atom> | <list> • <atom> ::= <literal> | <identifier> • <list> ::= () | ( <expressions> ) • <expressions> ::= <expression> | <expression> <expressions> • Here <literal> and <identifier> are preterminals; the parentheses are terminals in types by themselves.
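
Because this grammar is so small, a parser for it fits in a few lines. The sketch below makes several simplifying assumptions that are not part of the slide: the scanner just pads parentheses with spaces (so string literals containing spaces or parentheses are not handled), every other token counts as an atom, the empty list is read as the two tokens `(` and `)`, and the result is a nested Python list rather than a full parse tree.

```python
def tokenize(text):
    """Crude scanner: parentheses are tokens, everything else splits on whitespace."""
    return text.replace("(", " ( ").replace(")", " ) ").split()

def parse_expression(tokens, pos):
    """<expression> ::= <atom> | <list>.  Returns (tree, next position)."""
    if pos >= len(tokens):
        raise SyntaxError("unexpected end of input")
    token = tokens[pos]
    if token == "(":
        return parse_list(tokens, pos)
    if token == ")":
        raise SyntaxError("unexpected )")
    return token, pos + 1                 # <atom>

def parse_list(tokens, pos):
    """<list> ::= () | ( <expressions> ).  tokens[pos] is the opening parenthesis."""
    pos += 1                              # consume "("
    items = []
    while pos < len(tokens) and tokens[pos] != ")":
        item, pos = parse_expression(tokens, pos)
        items.append(item)
    if pos >= len(tokens):
        raise SyntaxError("missing )")
    return items, pos + 1                 # consume ")"

def parse(text):
    tokens = tokenize(text)
    tree, pos = parse_expression(tokens, 0)
    if pos != len(tokens):
        raise SyntaxError("extra input after expression")
    return tree

print(parse("(define (sq x) (* x x))"))
# ['define', ['sq', 'x'], ['*', 'x', 'x']]
```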

Two parsing strategies • bottom up (shift-reduce) • match tokens with RHS's of rules • when a full RHS is found, replace it by the LHS • top down (recursive descent) • expand the rules, matching input tokens as predicted by rules

Recursive descent parsing • A recursive descent parser has one recognizer function per nonterminal. • In the simplest case, each recognizer calls the recognizers for the nonterminals on the RHS. • e.g., the rule S -> NP VP would have a recognizer s() with body • np( ); vp( );
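
A possible rendering of this idea for the English fragment given earlier is sketched below. It is a recognizer over strings of preterminals, with one method per nonterminal. Two details go beyond the simplest case and are assumptions of this sketch: preterminals are consumed by a `match` helper, and the two NP rules are merged into NP -> Det N [PP] so that one token of lookahead suffices. The class and function names are illustrative.

```python
class SentenceRecognizer:
    def __init__(self, symbols):
        self.symbols = list(symbols)
        self.pos = 0                      # index of the next unread symbol

    def lookahead(self):
        return self.symbols[self.pos] if self.pos < len(self.symbols) else None

    def match(self, expected):
        if self.lookahead() != expected:
            raise SyntaxError(f"expected {expected}, found {self.lookahead()}")
        self.pos += 1

    def s(self):                          # S -> NP VP
        self.np()
        self.vp()

    def np(self):                         # NP -> Det N [PP]
        self.match("Det")
        self.match("N")
        if self.lookahead() == "P":
            self.pp()

    def pp(self):                         # PP -> P NP
        self.match("P")
        self.np()

    def vp(self):                         # VP -> V NP
        self.match("V")
        self.np()

def is_sentence(symbols):
    """True if the whole symbol string can be generated from S."""
    recognizer = SentenceRecognizer(symbols)
    try:
        recognizer.s()
    except SyntaxError:
        return False
    return recognizer.pos == len(recognizer.symbols)

print(is_sentence("Det N V Det N".split()))  # True
```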

Complications in recursive descent • scanning issues • RHSs with terminals • conflict between rules with the same LHS • optionality • including indefinite repetition • output • error handling

Terminal symbols • Terminal symbols may be handled by matching them with the next unread symbol in the input. • That is, one lookahead symbol is checked. • If there is a match, the next unread symbol is updated. • If not, there is a syntax error in the input.

Example with terminal symbols • For example, the rule F -> ( E ) could give a recognizer f() with body • match( '(' ); • e( ); • match( ')' );

Rule conflict • If there is more than one rule for a nonterminal, a conditional statement can be used. • The condition can involve the lookahead token. • An example is given for the nonterminal factor in L&L, p. 228.

Optionality • Optionality effectively gives multiple rules for the nonterminal on the LHS. • e.g., the ifStatement code, L&L, p. 227 • The same applies to indefinite repetition. • Here the repetition may be handled by a while loop. • e.g. the expr recognizer, L&L p. 229
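
The pieces discussed so far (matching terminals, a conditional on the lookahead token, and a while loop for repetition) can be combined in one small recognizer. This sketch is not the code from L&L; it assumes the expression grammar rewritten in EBNF as E -> T { + T }, T -> F { * F }, F -> x | y | ( E ), and all names are illustrative.

```python
class ExprRecognizer:
    def __init__(self, tokens):
        self.tokens = list(tokens)
        self.pos = 0                       # index of the next unread token

    def lookahead(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def match(self, expected):
        """Consume the next token if it is the expected terminal."""
        if self.lookahead() != expected:
            raise SyntaxError(f"expected {expected!r}, found {self.lookahead()!r}")
        self.pos += 1

    def e(self):                           # E -> T { + T }
        self.t()
        while self.lookahead() == "+":     # repetition becomes a loop
            self.match("+")
            self.t()

    def t(self):                           # T -> F { * F }
        self.f()
        while self.lookahead() == "*":
            self.match("*")
            self.f()

    def f(self):                           # F -> x | y | ( E )
        if self.lookahead() == "(":        # the lookahead picks the rule
            self.match("(")
            self.e()
            self.match(")")
        elif self.lookahead() in ("x", "y"):
            self.match(self.lookahead())
        else:
            raise SyntaxError(f"unexpected token {self.lookahead()!r}")

def accepts(tokens):
    """True if the whole token list is generated from E."""
    recognizer = ExprRecognizer(tokens)
    try:
        recognizer.e()
    except SyntaxError:
        return False
    return recognizer.pos == len(recognizer.tokens)

print(accepts("x + y * ( x + y )".split()))  # True
print(accepts("x + * y".split()))            # False
```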

Rule conflict -- details • If a nonterminal Y has several rules with RHSs α, β, γ, ..., we've seen that Y's recognizer uses a conditional statement. • If the conditional's lookahead symbol • is in First(α), one case will apply • is in First(β), another case will apply • etc. • Here, First(X) is the set of terminals that may begin the yield of X.

The First function • For simple grammars, the First function may be easy to compute by hand. • Tucker & Noonan give a general algorithm for finding First(X) for a symbol X • it works even for sequences of symbols. • Recursive descent works only if First(α) is disjoint from First(β) for all α and β • in the situation of the previous slide
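
For grammars without empty right-hand sides, First can be computed by a simple fixed-point iteration. The sketch below is a simplified stand-in, not Tucker & Noonan's algorithm: it assumes no rule has an empty RHS, so the First set of a RHS is just the First set of its leading symbol. The function names are hypothetical.

```python
GRAMMAR = {
    "E": [["T"], ["E", "+", "T"]],
    "T": [["F"], ["T", "*", "F"]],
    "F": [["x"], ["y"], ["(", "E", ")"]],
}

def first_sets(grammar):
    """First(X) for every nonterminal X, assuming no empty RHSs."""
    first = {nt: set() for nt in grammar}
    changed = True
    while changed:                         # repeat until nothing new is added
        changed = False
        for nt, rules in grammar.items():
            for rhs in rules:
                head = rhs[0]
                new = first[head] if head in grammar else {head}
                if not new <= first[nt]:
                    first[nt] |= new
                    changed = True
    return first

def alternatives_disjoint(nt, grammar, first):
    """True if the RHSs of nt have pairwise disjoint First sets."""
    seen = set()
    for rhs in grammar[nt]:
        head = rhs[0]
        head_first = first[head] if head in grammar else {head}
        if seen & head_first:
            return False
        seen |= head_first
    return True

first = first_sets(GRAMMAR)
print(sorted(first["E"]))                          # ['(', 'x', 'y']
print(alternatives_disjoint("F", GRAMMAR, first))  # True
print(alternatives_disjoint("E", GRAMMAR, first))  # False
```

The rules for F can be told apart by one lookahead token, while the two rules for E cannot, since both alternatives can begin with the same terminals.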

Left recursion • Recursive descent parsing requires the absence of left recursion. • In left recursion, a nonterminal starts the RHS of one or more of its rules, as in • E -> E + E | T • If the lookahead token t is also the first token of a string generated from T, the parser won’t know which E rule to apply.

Another potential problem • Another problem for recursive descent parsers arises from optionality. • Given a rule NP -> Det {Adj} N, there’d be a conflict between parsing rich as an N and as an Adj in a sentence beginning with • the rich • This problem can be dealt with in terms of a Follow function (cf. L&L, p. 232).

Abstract syntax trees • Parse trees aren’t the best interface between syntactic and semantic processing • (Abstract) syntax trees can be better • cf. L&L, p. 216 • For syntax trees (unlike parse trees) • nonterminal symbols needn’t appear • the form isn’t completely determined by the grammar
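
The contrast can be illustrated by collapsing a parse tree into a syntax tree. The encoding below is one assumed design among many (which is the point of the last bullet): parse trees are `(label, children)` tuples with string leaves, and syntax trees are either a leaf string or an `(operator, left, right)` tuple. The function name `to_ast` is hypothetical.

```python
# Parse tree for x + y * x under E -> T | E + T, T -> F | T * F, F -> x | y | ( E ).
parse_tree = ("E", [
    ("E", [("T", [("F", ["x"])])]),
    "+",
    ("T", [
        ("T", [("F", ["y"])]),
        "*",
        ("F", ["x"]),
    ]),
])

def to_ast(node):
    """Drop nonterminal labels, single-child chains, and parentheses."""
    if isinstance(node, str):
        return node
    _label, children = node
    if len(children) == 1:                 # e.g. E -> T: skip the extra level
        return to_ast(children[0])
    if children[0] == "(":                 # F -> ( E ): grouping is already in the tree shape
        return to_ast(children[1])
    left, op, right = children             # E -> E + T or T -> T * F
    return (op, to_ast(left), to_ast(right))

print(to_ast(parse_tree))  # ('+', 'x', ('*', 'y', 'x'))
```

No nonterminal appears in the result, yet the precedence of * over + is still visible in the nesting.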
