
Languages and grammars




  1. Languages and grammars • A (formal) language is a set of finite strings over a finite alphabet. • An alphabet is just a finite set of symbols (e.g., ASCII, Unicode).

  2. Programs and languages • A string is a program in programming language L iff it is a member of L. • Question: how to determine whether a string is a member of a language L? • The answer depends on the notions of constituency and linear order.

  3. Grammar rules • A language can be defined in terms of rules (or productions). • These rules specify the constituents of its members (and of their constituents), and the order in which they must appear. • Pieces without constituents (terminals) must be members of the alphabet.

  4. A sample grammar rule • The rule • <program> ::= begin <block> end . • says that a program may (not must) consist of • the terminal "begin" • followed by a block constituent, • followed by the terminal "end" • followed by the terminal "." • The symbol <block> is a nonterminal -- it does not correspond to a member of the alphabet. • Actually, if the alphabet is ASCII or Unicode, none of the three terminal symbols belongs to the alphabet either, although unlike <block>, these symbols are expected to appear in the program text. We address this issue later.

  5. Grammars • Additional rules can give the constituents of constituents like blocks. • Rules that define a language in this way make up a grammar. • Grammars must also identify • which nonterminals correspond to language members (e.g., <program>) • which symbols are terminal symbols

  6. Context-free grammars • The only grammars we will consider are context-free grammars (CFGs). • In a CFG, each rule has a nonterminal on its left-hand side (LHS), and a string of symbols (terminals or nonterminals) on its right-hand side (RHS). • Nonterminals are also called variables.

  7. Notation for rules • Nonterminals may be distinguished from terminals by • delimiting them with angle brackets, or • beginning them with a capital letter, or writing them in italics, or • printing terminals in bold face. • The LHS and RHS of a rule are separated by a "->" or "::=" symbol.

  8. Notation for combining rules • Rules with the same LHS represent optionality, e.g. • <operator> ::= + • <operator> ::= - • Such rules may be combined using a vertical bar convention, e.g. • <operator> ::= + | - • Any number of rules with the same LHS can be combined this way. • Note that the vertical bar is neither a terminal nor a nonterminal. • Sometimes such a symbol is called a metasymbol. • The notational conventions described above are called Backus-Naur form, or BNF.

  9. Grammar summary • In summary, a grammar is specified by • a finite set of terminals, • a finite set of nonterminals, • a finite set of rules (or productions), and • a start symbol (a nonterminal) • The start symbol tells what it is that the grammar is defining.

  10. Grammars and parsing • Any CFG G defines a language L(G) -- the set of strings that can be generated from its start symbol using its rules. • Proving that a string has the correct constituency and linear ordering properties to be in L(G) is called parsing.
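The notion of "strings that can be generated from the start symbol" can be sketched as a program. The toy grammar here (S -> a | a S, generating nonempty strings of a's) is my own example, not one from the slides:

```python
# Hypothetical toy grammar: S -> a | a S (nonempty strings of a's).
RULES = {"S": [["a"], ["a", "S"]]}

def generate(max_len):
    """Expand sentential forms from the start symbol S; keep all-terminal strings."""
    results, forms = set(), [["S"]]
    while forms:
        form = forms.pop()
        if len(form) > max_len:
            continue                       # bound the search
        # Find the leftmost nonterminal, if any.
        i = next((i for i, sym in enumerate(form) if sym in RULES), None)
        if i is None:
            results.add("".join(form))     # no nonterminals left: a member of L(G)
        else:
            for rhs in RULES[form[i]]:
                forms.append(form[:i] + rhs + form[i + 1:])
    return results

print(sorted(generate(3)))   # the members of L(G) of length at most 3
```

Each step of the loop applies one rule to one sentential form, which is exactly a step of a derivation.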

  11. Parsing and parse trees • One way of summarizing a parse is with a parse tree (cf. T&N, Sec 2.1.3). • Here parent nodes correspond to LHSs of rules, children (in order) to RHSs, and leaves to terminals. • The string of leaves (from left to right) is the yield of the parse tree.

  12. Terminals and the alphabet • We still have not resolved a mismatch between terminals and the alphabet • We have suggested that alphabets are sets of characters. • We have said that terminals must be members of the alphabet. • But our sample terminals (e.g., begin, end) were not characters

  13. Tokens • One way out: replace "begin" by <begin> and add a rule with five RHS terminals, i.e. • <begin> -> b e g i n • However, there are independent reasons to instead treat such substrings as special entities called tokens.

  14. Typed tokens • It helps to identify tokens representing identifiers, integer literals, etc. as instances of a type (or category). • Reserved words like "begin" would be the only instances of their types. • Type names can be represented as strings with the same spelling as the category name, or as members of an enumerated type.

  15. Lexical analysis • Real parsers group characters into (typed) tokens in a preprocessing step called lexical analysis (or scanning). • This step allows CFGs • to have terminals with multiple characters • to treat nonterminals representing types as special terminals called preterminals. • The constituency of preterminals is handled by special scanning rules.
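A sketch of such a scanner; the token types, the reserved-word list, and the use of regular expressions are all assumptions for illustration, not details from the slides:

```python
import re

# Hypothetical token types for a small language; order matters (longest-match
# within each alternative comes from the greedy regexes).
TOKEN_SPEC = [
    ("NUMBER",     r"\d+"),
    ("IDENTIFIER", r"[A-Za-z_]\w*"),
    ("SYMBOL",     r"[+\-*/()=;.]"),
    ("SKIP",       r"\s+"),
]

def scan(text):
    """Group characters into (type, spelling) tokens; reserved words get their own type."""
    reserved = {"begin", "end", "if", "else"}          # assumed reserved words
    pattern = re.compile("|".join(f"(?P<{name}>{rx})" for name, rx in TOKEN_SPEC))
    tokens, pos = [], 0
    while pos < len(text):
        m = pattern.match(text, pos)
        if not m:
            raise SyntaxError(f"bad character at {pos}: {text[pos]!r}")
        kind, spelling = m.lastgroup, m.group()
        if kind == "IDENTIFIER" and spelling in reserved:
            kind = spelling.upper()                    # e.g. "begin" -> type BEGIN
        if kind != "SKIP":                             # discard white space
            tokens.append((kind, spelling))
        pos = m.end()
    return tokens
```

After this step, a multi-character spelling like "begin" reaches the parser as a single terminal, resolving the terminal/alphabet mismatch.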

  16. Why "lexical"? • CFGs for English tend to have preterminals N, V, P, etc. for lexical categories noun, verb, preposition, etc. • Rules for these preterminals form a lexicon -- a list of words in the language labeled with their categories. • Omitting these rules allows a simple CFG to generate much of English.

  17. A CFG for a fragment of English • S -> NP VP • NP -> Det N • NP -> Det N PP • PP -> P NP • VP -> V

  18. EBNF • Occasionally, certain extensions to BNF notation are convenient. • The term EBNF (for extended Backus-Naur form) is used to cover these extensions. • These extensions introduce new metasymbols, given below with their interpretations.

  19. EBNF constructions • ( ) parentheses, for removing ambiguity, e.g. • (a|b)c vs. a | bc • [ ] brackets, for optionality (0 or 1 times) • { } braces, for indefinite repetition (0 or more times) • Sometimes the first of these is considered part of ordinary BNF.

  20. A very simple grammar • S -> x | x S S • This grammar generates all strings of x's of odd length.
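As an illustration (mine, not the slides'), a memoized recursive check can confirm this claim directly from the two rules: a string is derivable from S if it is "x", or if it starts with "x" and the rest splits into two S-derivable pieces.

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def derives(s):
    """Can the grammar S -> x | x S S derive the string s?"""
    if s == "x":                           # rule S -> x
        return True
    # rule S -> x S S: try every split of the remainder into two derivable parts
    return (s.startswith("x")
            and any(derives(s[1:i]) and derives(s[i:])
                    for i in range(2, len(s))))

# The members turn out to be exactly the nonempty strings of x's of odd length:
# each use of S -> x S S adds two x's to a derivation that starts with one.
```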

  21. An ambiguous grammar for algebraic expressions • E -> E + E | E * E • E -> x | y • E -> ( E ) • Note that here the parenthesis symbols are terminal symbols of the grammar (not metasymbols) • An unambiguous grammar for algebraic expressions • E -> T | E + T • T -> F | T * F • F -> x | y | ( E ) • Once again the parenthesis symbols are terminals

  22. A grammar for a simple class of identifiers • <identifier> ::= <nondigit> • <identifier> ::= <identifier> <nondigit> • <identifier> ::= <identifier> <digit> • Note that we assume that digits and nondigits are identified by the scanner, and not the parser
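A sketch of a recognizer for this grammar; taking "nondigit" to mean letters and underscore is my assumption (borrowed from C), not something the slides specify. The left-recursive rules amount to: a nondigit followed by any mix of digits and nondigits.

```python
import string

NONDIGIT = set(string.ascii_letters + "_")   # assumed definition of <nondigit>
DIGIT = set(string.digits)

def is_identifier(s):
    """Recognize <identifier>: a nondigit followed by digits or nondigits."""
    return (len(s) > 0
            and s[0] in NONDIGIT
            and all(c in NONDIGIT or c in DIGIT for c in s[1:]))
```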

  23. if-statements in C • <selection-statement> ::= • if ( <expression> ) <statement> [ else <statement> ] | … • <statement> ::= • <compound-statement> | … • <compound-statement> ::= • { [<declaration-list>] [<statement-list>] } • Here the braces are terminal symbols

  24. if-statements in Ada • <if-statement> ::= • if <boolean-condition> then • <sequence-of-statements> • { elsif <boolean-condition> then • <sequence-of-statements> } • [else <sequence-of-statements>] • end if ;

  25. statements in Ada • <statement> ::= • null | • <assignment-statement > | • <if-statement> | • <loop-statement> | ... • <sequence-of-statements> ::= • <statement> { <statement> } • Translation steps (idealized) • character string • lexical analysis (scanning, tokenizing) • string of tokens • syntactic analysis (parsing) • parse tree (or syntax tree) • semantic analysis, ...

  26. Scanning (lexical analysis) • Scanning could be done by a parser; a special-purpose scanner is generally more efficient • how to recognize tokens • longest substring • white space • The scanner needs to identify categories of tokens for the parser. • Categories of tokens • keywords • reserved words, predefined identifiers • literals (cf. constants) • numeric, string, Boolean, array, enumeration members, Lisp lists, … • identifiers

  27. Two parsing strategies • bottom up (shift-reduce) • match tokens with RHS's of rules • when a full RHS is found, replace it by the LHS • top down (recursive descent) • expand the rules, matching input tokens as predicted by rules

  28. Categories of tokens • keywords • reserved words, predefined identifiers • literals (cf. constants) • numeric, string, Boolean, array, enumeration members, Lisp lists, … • identifiers

  29. Recursive descent parsing • A recursive descent parser has one recognizer function per nonterminal. • In the simplest case, each recognizer calls the recognizers for the nonterminals on the RHS. • e.g., the rule S -> NP VP would have a recognizer s() with body • np(); vp();
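A sketch of such a parser for the slide-17 English fragment. The lexicon and the end marker "$" are my additions, and I fold the two NP rules into Det N [PP], using the lookahead to decide whether a PP follows:

```python
# Hypothetical lexicon: words labeled with their preterminal categories.
LEXICON = {"the": "Det", "a": "Det", "dog": "N", "park": "N",
           "barked": "V", "in": "P"}

def parse(words):
    """Recognize S -> NP VP over the slide-17 grammar; one function per nonterminal."""
    toks = [LEXICON[w] for w in words] + ["$"]   # scanning: words -> preterminals
    pos = 0
    def look():
        return toks[pos]
    def match(cat):
        nonlocal pos
        if look() != cat:
            raise SyntaxError(f"expected {cat}, saw {look()}")
        pos += 1
    def s():
        np(); vp()
    def np():
        match("Det"); match("N")
        if look() == "P":      # the optional PP, folding NP -> Det N | Det N PP
            pp()
    def pp():
        match("P"); np()
    def vp():
        match("V")
    s()
    return look() == "$"       # accept iff all input was consumed
```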

  30. Complications in recursive descent • scanning issues • RHSs with terminals • conflict between two rules with the same LHS • optionality (including indefinite repetition) • output and error handling

  31. Terminal symbols • Terminal symbols may be handled by matching them with the next unread symbol in the input. • That is, one lookahead symbol is checked. • If there is a match, the next unread symbol is updated. • Else there is a syntax error in the input.

  32. Example with terminal symbols • For example, the rule F -> ( E ) could give a recognizer f() with body • match('('); • e(); • match(')');
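The slides use match() without defining it; one plausible implementation (the details here are assumptions, and T&N may do it differently) keeps a position into the token list as the lookahead:

```python
class Parser:
    """Minimal sketch of the lookahead/match machinery behind a recursive descent parser."""
    def __init__(self, tokens):
        self.tokens = list(tokens) + ["$"]   # "$" marks end of input
        self.pos = 0                         # index of the lookahead token
    def look(self):
        return self.tokens[self.pos]
    def match(self, expected):
        """Check the lookahead; on success advance past it, else report a syntax error."""
        if self.look() != expected:
            raise SyntaxError(f"expected {expected!r}, saw {self.look()!r}")
        self.pos += 1
```

With this in hand, the body of f() above is three calls on the parser object: match "(", recognize E, match ")".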

  33. Rule conflict • If there is more than one rule for a nonterminal, a conditional statement can be used. • The condition can involve the lookahead token. • An example is given for the nonterminal "primary" in T&N, p. 79.

  34. Optionality • Optionality (the use of brackets in EBNF) effectively gives multiple rules for the nonterminal on the LHS. • e.g., the "factor" recognizer, T&N, p. 79. • The same applies to indefinite repetition (the use of braces in EBNF). • Here the repetition may be handled by a while loop, (cf. "term", T&N p. 79).

  35. Rule conflict -- details • If a nonterminal Y has several rules with RHSs α, β, γ, ..., we've seen that Y's recognizer uses a conditional statement. • The conditional's first case will be used if the lookahead symbol is in First(α), the second case if it's in First(β), etc. • Here, First(X) is the set of terminals that may begin the yield of X.

  36. The First function • T&N describe an algorithm for computing the First function for any grammar symbol. • It may be used to find all values of First(X). • In recursive descent parsing, First(X) must be disjoint from First(Y) for any two RHSs X and Y (for the same LHS).
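A simplified sketch of how First can be computed as a fixed point (my rendering, not T&N's algorithm; it assumes no rule has an empty RHS, so First of an RHS is determined by its first symbol). It is shown on the unambiguous expression grammar of slide 21:

```python
def first_sets(rules, terminals):
    """rules: {nonterminal: [rhs, ...]}, each rhs a nonempty list of symbols."""
    first = {nt: set() for nt in rules}
    changed = True
    while changed:                        # iterate until no set grows
        changed = False
        for nt, rhss in rules.items():
            for rhs in rhss:
                sym = rhs[0]              # no epsilon rules: only the first symbol matters
                new = {sym} if sym in terminals else first[sym]
                if not new <= first[nt]:
                    first[nt] |= new
                    changed = True
    return first

# The unambiguous grammar from slide 21.
GRAMMAR = {
    "E": [["T"], ["E", "+", "T"]],
    "T": [["F"], ["T", "*", "F"]],
    "F": [["x"], ["y"], ["(", "E", ")"]],
}
TERMINALS = {"x", "y", "(", ")", "+", "*"}
```

For this grammar all three nonterminals get First = {x, y, (}, which is why a lookahead token cannot by itself choose between the two E rules.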

  37. Left recursion • Recursive descent parsing requires the absence of left recursion. • In left recursion, a nonterminal starts the RHS of one of its rules, as in • E -> E + E | T • If t is the first token of a string generated from T, and also the lookahead token, we can't decide which E rule to apply; worse, the recognizer for E would begin by calling itself, recursing forever.
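One standard repair (not from the slides, but consistent with the EBNF of slide 19) is to rewrite left-recursive rules with braces: E -> T | E + T becomes E -> T { + T }, and T -> F | T * F becomes T -> F { * F }. The braces then turn into while loops. A sketch of the resulting recognizer for slide 21's unambiguous grammar:

```python
def recognize(tokens):
    """Recognize E -> T { + T }, T -> F { * F }, F -> x | y | ( E )."""
    toks = list(tokens) + ["$"]
    pos = 0
    def look():
        return toks[pos]
    def match(t):
        nonlocal pos
        if look() != t:
            raise SyntaxError(f"expected {t!r}, saw {look()!r}")
        pos += 1
    def e():
        t()
        while look() == "+":   # { + T }
            match("+"); t()
    def t():
        f()
        while look() == "*":   # { * F }
            match("*"); f()
    def f():
        if look() == "(":
            match("("); e(); match(")")
        elif look() in ("x", "y"):
            match(look())
        else:
            raise SyntaxError(f"unexpected {look()!r}")
    e()
    return look() == "$"
```

The loops consume the repeated + T and * F parts left to right, so no recognizer ever calls itself on an unchanged input.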

  38. Another potential problem • Another problem for recursive descent parsers arises from optionality. • Such a parser using a rule NP -> Det {Adj} N can't tell whether "rich" is an N or an Adj in a sentence beginning with • the rich • This problem can be dealt with in terms of a Follow function (cf. Louden).

  39. Abstract syntax trees • (Abstract) syntax trees may replace parse trees as the interface between syntactic and semantic processing -- cf. T&N, Section 2.5.1. • Symbols irrelevant to semantic processing needn’t appear in syntax trees. • So the form of a syntax tree is not completely determined by the grammar.

  40. Bindings • A variable (cf. T&N, p 88) is an entity with attributes including name, address, type, and value. • Much semantic behavior can be understood in terms of the binding of attributes to their values. • Many differences between programming languages can be understood in terms of when and how such bindings are made.

  41. Static vs. dynamic • The following terms apply to bindings (and many other concepts): • Static • pertaining to compilation time • (or more generally, to a time before execution time) • Dynamic: • pertaining to run time (execution time)

  42. Examples of static bindings • value • of predefined identifiers • of the largest possible int • address • for global variables (relative to the beginning of program storage)

  43. More examples of static bindings • isConstant • for variables • type • for variables • body & arity & return type & parameter type(s) • for functions in C (local or external)

  44. Examples of dynamic bindings • value • for typical variables • address • for local variables (cf. use of new) • parameter value • for functions • method body • for methods in Java

  45. Lvalues and Rvalues • Most languages give different interpretations to variables on different sides of an assignment operator, e.g. • a := b; • The first (the lvalue) refers to an address while the second (the rvalue) refers to a value.

  46. Creating bindings • Bindings for user-defined variables are created by declarations, which for us include • explicit declaration, • implicit declaration, and • definition

  47. Resolving bindings • Many languages allow reuse of names during program execution. • So when a program uses a name, it needs to know what the bindings are for that name then and there. • That is, it needs to know which declaration for the name is being used to determine the bindings.

  48. Scope and scoping policies • Every language has a scoping policy to determine which bindings apply. • For T&N, the scope of a name is that portion of a program for which the name's bindings are in effect. • Scoping policies may specify those program segments that may serve as scopes of bindings -- T&N call these segments simply scopes.

  49. Languages and scopes • What may count as a scope is language-dependent (cf. Table 4.1, p. 90, T&N). • Typical possibilities are • function definitions • class definitions • loops • compilation units • compound statements (blocks)

  50. Overlapping scopes • Languages may allow certain types of scopes to be nested. • Scopes may not otherwise overlap. • So the notion of moving outward from one scope to another is always well-defined.
