Grammars, constituency and order

Grammars, constituency and order • A grammar describes the legal strings of a language in terms of constituency and order. • For example, a grammar for a fragment of English might say that a legal sentence consists of • a noun phrase (subject), • followed by a verb phrase (predicate). • This rule is commonly written as • S → NP VP

Constituents of constituents • The constituents of constituents may be described by other rules. • They would refine, for example, the initial decomposition • [the dog] [chased a cat] • into a complete decomposition • [[the] [dog]] [[chased] [[a] [cat]]] • according to the following rules:

Grammar rules for a fragment of English • S → NP VP • NP → Det N • VP → V NP • N → dog • N → cat • Det → the • Det → a • V → chased

Parse trees (derivation trees) • Hierarchical decompositions of sentences are more commonly expressed by special trees, known as parse trees or derivation trees. • For our sample sentence, the parse tree, written in bracket notation, is • [S [NP [Det the] [N dog]] [VP [V chased] [NP [Det a] [N cat]]]]

Sentences generated by grammars • The grammar with the rules above would also allow, or generate, sentences like • a dog chased a cat • the dog chased a dog • a cat chased the dog • since parse trees could be constructed for these sentences.
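As a small illustration of what "generate" means here, the sketch below enumerates every sentence of the English fragment. The dictionary encoding and the names RULES and expand are assumptions made for this example, not part of the slides; the approach only works because none of these rules is recursive.

```python
from itertools import product

# Rules of the English fragment: each nonterminal maps to its right-hand sides.
RULES = {
    "S": [["NP", "VP"]],
    "NP": [["Det", "N"]],
    "VP": [["V", "NP"]],
    "N": [["dog"], ["cat"]],
    "Det": [["the"], ["a"]],
    "V": [["chased"]],
}

def expand(symbol):
    """Return every word list derivable from `symbol` (finite: no rule is recursive)."""
    if symbol not in RULES:              # a terminal, i.e. an English word
        return [[symbol]]
    results = []
    for rhs in RULES[symbol]:
        # Expand each RHS symbol, then combine the alternatives in order.
        for parts in product(*(expand(s) for s in rhs)):
            results.append([word for part in parts for word in part])
    return results

sentences = [" ".join(words) for words in expand("S")]
for sentence in sentences:
    print(sentence)
```

With four possible noun phrases in subject and object position and a single verb, the fragment generates 4 × 4 = 16 sentences.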

Context-free grammars (CFGs) • In the example above, the alphabet Σ consisted of the set of English words. • A grammar also needs to specify symbols aside from Σ, and rules, so more precisely …

CFGs defined • … a context-free grammar (CFG) consists of • a set T of terminal symbols (analogous to Σ) • a set V of other symbols, called variables (or nonterminals) • a start symbol S, which is a member of V • a collection P of rules (or productions), each with • a left-hand side (LHS) from V, and • a right-hand side (RHS) from (V ∪ T)*
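One way to make the four components concrete is a small record type with a well-formedness check. This is only a sketch: the class name CFG, its field names, and the requirement that T and V be disjoint are assumptions of the example rather than something the definition above states.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CFG:
    terminals: frozenset   # T
    variables: frozenset   # V
    start: str             # S
    rules: tuple           # P, as pairs (lhs, rhs) with rhs a tuple of symbols

    def is_well_formed(self):
        """Check the conditions of the definition (plus disjointness of T and V)."""
        if self.terminals & self.variables:      # assumed: T and V do not overlap
            return False
        if self.start not in self.variables:     # S must be a member of V
            return False
        symbols = self.terminals | self.variables
        # Every LHS is a variable and every RHS is a string over V ∪ T.
        return all(
            lhs in self.variables and all(s in symbols for s in rhs)
            for lhs, rhs in self.rules
        )

ENGLISH = CFG(
    terminals=frozenset({"the", "a", "dog", "cat", "chased"}),
    variables=frozenset({"S", "NP", "VP", "N", "Det", "V"}),
    start="S",
    rules=(
        ("S", ("NP", "VP")), ("NP", ("Det", "N")), ("VP", ("V", "NP")),
        ("N", ("dog",)), ("N", ("cat",)),
        ("Det", ("the",)), ("Det", ("a",)), ("V", ("chased",)),
    ),
)
print(ENGLISH.is_well_formed())
```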

Context freedom • The notion of context freedom means that any category may be expanded in accordance with the rules no matter where it appears. • So for example, the noun phrases that are legal subjects are the same as those that are legal objects • that is, the same as the NPs in the VP → V NP rule.

Rules for a CFG for L(0(1+2)*) • S → 0X • X → λ • X → YX • Y → 1 • Y → 2
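A quick sanity check of these rules is to enumerate what they generate up to a small length and compare it with the regular expression, written below in Python syntax as 0[12]*. The function name generate and the bound MAX_LEN are assumptions of this sketch; the empty string "" stands for the empty right-hand side. Pruning on the number of terminals is enough to make the search finite for this particular grammar, not for grammars in general.

```python
import re
from collections import deque
from itertools import product

# S -> 0X, X -> (empty) | YX, Y -> 1 | 2
RULES = {"S": ["0X"], "X": ["", "YX"], "Y": ["1", "2"]}

def generate(max_len):
    """Strings of length <= max_len derivable from S, using leftmost derivations."""
    found, seen, queue = set(), {"S"}, deque(["S"])
    while queue:
        form = queue.popleft()
        # Position of the leftmost variable, if any.
        idx = next((i for i, c in enumerate(form) if c in RULES), None)
        if idx is None:                  # only terminals left: a generated string
            found.add(form)
            continue
        for rhs in RULES[form[idx]]:
            new = form[:idx] + rhs + form[idx + 1:]
            terminals = sum(1 for c in new if c not in RULES)
            if terminals <= max_len and new not in seen:
                seen.add(new)
                queue.append(new)
    return found

MAX_LEN = 4
expected = {
    "".join(w)
    for n in range(MAX_LEN + 1)
    for w in product("012", repeat=n)
    if re.fullmatch(r"0[12]*", "".join(w))
}
print(generate(MAX_LEN) == expected)
```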

CFGs for palindromes • A CFG for even-length palindromes over {0,1}: • S → λ | 0S0 | 1S1 • A CFG for odd-length palindromes over {0,1}: • S → 0 | 1 | 0S0 | 1S1 • Here we use the common convention allowing several rules with the same LHS to be combined into one, with vertical bars separating the RHSs.

Rules for other 1-variable CFGs • for all palindromes over {0,1}: • S → λ | 0 | 1 | 0S0 | 1S1 • for nonempty sequences of balanced parentheses: • S → ( ) | ( S ) | SS • for {0^n 1^n | n ≥ 0} • S → λ | 0S1 • for { x ∈ {0,1}* | x has as many 0's as 1's} • S → λ | 0S1 | 1S0 | SS
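Because each of these grammars has a single variable, its rules can be read almost directly as a recursive membership test. The sketch below does this for two of them; the function names are made up for the example, and the empty Python string plays the role of the empty right-hand side.

```python
def is_palindrome(x):
    """Mirror the palindrome rules: S -> (empty) | 0 | 1 | 0S0 | 1S1."""
    if len(x) <= 1:
        # Base cases: the empty string, "0", or "1".
        return all(c in "01" for c in x)
    # Recursive cases 0S0 and 1S1: matching outer symbols around a shorter S.
    return x[0] in "01" and x[0] == x[-1] and is_palindrome(x[1:-1])

def is_zeros_then_ones(x):
    """Mirror the rules S -> (empty) | 0S1, which generate n 0's followed by n 1's."""
    if x == "":
        return True
    return len(x) >= 2 and x[0] == "0" and x[-1] == "1" and is_zeros_then_ones(x[1:-1])

print(is_palindrome("0110"), is_palindrome("011"))
print(is_zeros_then_ones("000111"), is_zeros_then_ones("0101"))
```

The grammars containing the rule S → SS do not translate this directly, since a recognizer would also have to guess where to split the string.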

Parse trees and grammars • A parse tree is legal for a CFG iff it satisfies each correspondence: • root ↔ start symbol • parent node ↔ LHS of a grammar rule • child node ↔ symbol from the RHS of a rule whose LHS is the parent node • leaf ↔ terminal symbol (or λ) • Also, the ordering of children of a node must match the ordering of the RHS symbols in the corresponding rule.

Partial parse (derivation) trees • It's convenient to represent the progress of a parse by allowing leaves to be labeled by a nonterminal symbol (and perhaps ignoring the constraint on roots) • In any case, the left-to-right sequence of leaf labels (ignoring those labeled by λ) is called the yield of the parse tree • so the yield of a complete parse tree is a string of terminals, while the yield of a partial tree may also contain nonterminals
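The legality conditions and the notion of yield can be sketched for the English fragment as follows. The tree encoding (a pair of a label and a list of children) and the names leaf, is_legal, and tree_yield are assumptions of this example; the check covers complete trees only and ignores empty leaves, which this grammar does not need.

```python
RULES = {
    "S": [("NP", "VP")],
    "NP": [("Det", "N")],
    "VP": [("V", "NP")],
    "N": [("dog",), ("cat",)],
    "Det": [("the",), ("a",)],
    "V": [("chased",)],
}

def leaf(word):
    """A leaf is a node with no children."""
    return (word, [])

TREE = ("S", [
    ("NP", [("Det", [leaf("the")]), ("N", [leaf("dog")])]),
    ("VP", [("V", [leaf("chased")]),
            ("NP", [("Det", [leaf("a")]), ("N", [leaf("cat")])])]),
])

def is_legal(tree, start="S"):
    """Complete-tree check: root is the start symbol, each parent with its
    ordered children matches a rule, and every leaf is a terminal."""
    def ok(node):
        label, children = node
        if not children:
            return label not in RULES            # leaves must be terminals
        rhs = tuple(child[0] for child in children)
        return rhs in RULES.get(label, []) and all(ok(c) for c in children)
    return tree[0] == start and ok(tree)

def tree_yield(tree):
    """Left-to-right sequence of leaf labels."""
    label, children = tree
    if not children:
        return [label]
    return [w for child in children for w in tree_yield(child)]

print(is_legal(TREE), " ".join(tree_yield(TREE)))
```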

Notational conventions • Lower case letters are interpreted as for DFAs • those near the beginning of the alphabet represent terminals; those near the end of the alphabet represent strings • Capital letters represent nonterminals (variables) • Greek letters represent strings of variables and terminals • so a generic rule looks like A → γ

Derivations and rewrite rules • CFG rules are also rewrite rules. • Here the rule S → NP VP would allow rewriting of S as NP VP • Intuitively, G generates a string x iff x can be derived from S by repeated rewriting • For example, we get the legal derivation S => NP VP => Det N VP => the N VP => the dog VP => the dog V NP => the dog chased NP => the dog chased Det N => the dog chased a N => the dog chased a cat

Leftmost and rightmost derivations • For every parse tree there are unique leftmost and rightmost derivations • The rightmost derivation corresponding to the parse tree above is • S => NP VP => NP V NP => NP V Det N => NP V Det cat => NP V a cat => NP chased a cat => Det N chased a cat => Det dog chased a cat => the dog chased a cat
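The claim that a parse tree determines its leftmost and rightmost derivations can be illustrated by reading both off the tree for "the dog chased a cat". In this sketch the tree encoding (label plus list of children) and the function name derivation are assumptions of the example: at each step the leftmost or rightmost unexpanded internal node is replaced by its children.

```python
def leaf(word):
    return (word, [])

TREE = ("S", [
    ("NP", [("Det", [leaf("the")]), ("N", [leaf("dog")])]),
    ("VP", [("V", [leaf("chased")]),
            ("NP", [("Det", [leaf("a")]), ("N", [leaf("cat")])])]),
])

def derivation(tree, leftmost=True):
    """List the sentential forms of the leftmost (or rightmost) derivation."""
    frontier = [tree]
    steps = [" ".join(node[0] for node in frontier)]
    while True:
        # Indices of frontier nodes that still have children to expand.
        internal = [i for i, node in enumerate(frontier) if node[1]]
        if not internal:
            return steps
        i = internal[0] if leftmost else internal[-1]
        frontier[i:i + 1] = frontier[i][1]       # rewrite the node by its children
        steps.append(" ".join(node[0] for node in frontier))

print(" => ".join(derivation(TREE, leftmost=True)))
print(" => ".join(derivation(TREE, leftmost=False)))
```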

Derivations and parse trees • All but the simplest parse trees will have other associated derivations besides the leftmost and rightmost. • For every derivation there is a unique associated parse tree.

Derivations and sentential forms • The => relation used above can be defined precisely by saying that • αAβ => αγβ iff there is a rule A → γ in G • we may subscript the => symbol by G if there’s doubt about which grammar is being used. • Then using the symbol =>* for the reflexive transitive closure of the => relation, we say • a sentential form for G is a string α from (V ∪ T)* such that S =>* α
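The one-step relation and its closure can be sketched directly for the English fragment. The names successors and derives and the encoding of strings as tuples of symbols are assumptions of this example; the exhaustive search terminates only because this grammar has no recursive rules.

```python
RULES = {
    "S": [("NP", "VP")],
    "NP": [("Det", "N")],
    "VP": [("V", "NP")],
    "N": [("dog",), ("cat",)],
    "Det": [("the",), ("a",)],
    "V": [("chased",)],
}

def successors(form):
    """All forms reachable in one step: rewrite one occurrence of a
    variable A by the right-hand side of some rule for A."""
    out = set()
    for i, symbol in enumerate(form):
        for rhs in RULES.get(symbol, []):
            out.add(form[:i] + rhs + form[i + 1:])
    return out

def derives(source, target):
    """Zero or more steps from source to target (reflexive transitive closure)."""
    seen, stack = {source}, [source]
    while stack:
        form = stack.pop()
        if form == target:
            return True
        for nxt in successors(form) - seen:
            seen.add(nxt)
            stack.append(nxt)
    return False

# "NP chased a cat" is a sentential form; "dog the" is not.
print(derives(("S",), ("NP", "chased", "a", "cat")))
print(derives(("S",), ("dog", "the")))
```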

Context-free languages (CFLs) • Fact: A CFG G with start symbol S licenses a parse tree for w iff S =>* w • Def) L(G) (the language generated by G) is {x | G generates x}, or equivalently {x | G’s start symbol derives x}, or {x ∈ T* | x is a sentential form for G} • A language generated by a context-free grammar is called a context-free language

Ambiguous grammars • Here’s a 1-variable CFG for a subset of algebraic expressions: • E → x | y | E+E | E*E | (E) • Note that this grammar allows multiple parse trees for some strings, like x+y*y. • A grammar with this property is said to be ambiguous.
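To see the ambiguity concretely, one can count the parse trees this grammar allows for a given string. The sketch below does so with a memoized recursion over substrings; the function name count_parses is an assumption of the example, and the string is assumed to contain no spaces.

```python
from functools import lru_cache

def count_parses(s):
    """Number of parse trees for s under E -> x | y | E+E | E*E | (E)."""
    @lru_cache(maxsize=None)
    def count(i, j):
        total = 0
        if s[i:j] in ("x", "y"):                       # E -> x | y
            total += 1
        if j - i >= 3 and s[i] == "(" and s[j - 1] == ")":
            total += count(i + 1, j - 1)               # E -> (E)
        for k in range(i + 1, j - 1):                  # E -> E+E | E*E, split at k
            if s[k] in "+*":
                total += count(i, k) * count(k + 1, j)
        return total
    return count(0, len(s))

print(count_parses("x+y*y"))
print(count_parses("(x+y)*y"))
```

For x+y*y the count is 2: one tree groups the string as x+(y*y), the other as (x+y)*y.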

An unambiguous grammar for algebraic expressions • Rules for an unambiguous grammar for the above language are given below: • E → E + T | T • T → T * F | F • F → x | y | ( E )
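A recursive-descent sketch shows how the three levels E, T, and F force a single structure on each expression. Two assumptions are worth flagging: the left-recursive rules E → E + T and T → T * F are handled by loops rather than by direct recursion, and the function returns a simplified operator tree (nested tuples) rather than the full parse tree. The name parse is made up for the example.

```python
def parse(s):
    """Parse s with E -> E + T | T, T -> T * F | F, F -> x | y | ( E )."""
    pos = 0

    def peek():
        return s[pos] if pos < len(s) else None

    def eat(ch):
        nonlocal pos
        if peek() != ch:
            raise ValueError(f"expected {ch!r} at position {pos}")
        pos += 1

    def expr():                      # E: one or more T's joined by '+', grouped to the left
        node = term()
        while peek() == "+":
            eat("+")
            node = ("+", node, term())
        return node

    def term():                      # T: one or more F's joined by '*', grouped to the left
        node = factor()
        while peek() == "*":
            eat("*")
            node = ("*", node, factor())
        return node

    def factor():                    # F: x, y, or a parenthesized E
        ch = peek()
        if ch in ("x", "y"):
            eat(ch)
            return ch
        if ch == "(":
            eat("(")
            node = expr()
            eat(")")
            return node
        raise ValueError(f"unexpected input at position {pos}")

    tree = expr()
    if pos != len(s):
        raise ValueError(f"unexpected input at position {pos}")
    return tree

print(parse("x+y*y"))
print(parse("(x+y)*y"))
```

For x+y*y the only result is the tree in which * is nested under +, matching the usual precedence.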

Inherent ambiguity • Ambiguity is common in natural languages. • But we don't want it in programming languages! • Often ambiguity can be removed. • i.e., a grammar can be replaced by an unambiguous one, as seen above • But there are languages for which all grammars are ambiguous. • These languages are said to be inherently ambiguous.

Regular languages and CFLs • We’ve already seen examples of CFLs that aren’t regular languages • But it's fairly easy to show that all regular languages are context-free. • The languages {a}, {λ}, and ∅ have grammars with respective productions • S → a • S → λ • [no productions]

All regular languages are CFLs • Suppose L1 and L2 have grammars with respective start symbols S1 and S2 (and no variables in common). • Then we may get grammars with a new start symbol S for their union, for their concatenation, and for L1* by adding the respective productions • S → S1 | S2 • S → S1S2 • S → λ | S1S • So all regular languages are CFLs
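The three constructions can be sketched as functions on grammars stored as dictionaries. Everything named here (union, concatenation, star, language, G1, G2) is an assumption of the example, as is the requirement that the two grammars share no variables and do not already use S. For the star, the sketch adds the rules S → λ | S1 S, with the empty tuple standing for the empty right-hand side. The helper language enumerates short strings by leftmost derivation and is only meant for small examples like these.

```python
from collections import deque

def union(g1, s1, g2, s2, start="S"):
    return {**g1, **g2, start: [(s1,), (s2,)]}       # S -> S1 | S2

def concatenation(g1, s1, g2, s2, start="S"):
    return {**g1, **g2, start: [(s1, s2)]}           # S -> S1 S2

def star(g1, s1, start="S"):
    return {**g1, start: [(), (s1, start)]}          # S -> (empty) | S1 S

def language(rules, start, max_len):
    """Strings of length <= max_len generated from `start` (leftmost derivations)."""
    found, seen, queue = set(), {(start,)}, deque([(start,)])
    while queue:
        form = queue.popleft()
        idx = next((i for i, s in enumerate(form) if s in rules), None)
        if idx is None:                              # no variables left
            found.add("".join(form))
            continue
        for rhs in rules[form[idx]]:
            new = form[:idx] + rhs + form[idx + 1:]
            # Prune forms that already contain too many terminals.
            if sum(s not in rules for s in new) <= max_len and new not in seen:
                seen.add(new)
                queue.append(new)
    return found

G1 = {"S1": [(), ("a", "S1", "b")]}      # n a's followed by n b's
G2 = {"S2": [("c",), ("c", "S2")]}       # one or more c's

print(sorted(language(union(G1, "S1", G2, "S2"), "S", 2)))
print(sorted(language(concatenation(G1, "S1", G2, "S2"), "S", 3)))
print(sorted(language(star(G1, "S1"), "S", 4)))
```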

Grammars for regular languages • Any regular language can be generated by a special type of CFG. • Def) A right-linear grammar is a CFG where the RHS of each rule has the form xB or x, • for x ∈ T* and B ∈ V • Fact: Right-linear grammars generate all and only regular languages

Finding a grammar for a regular language • For a DFA M, consider the grammar G with • T = Σ, V = Q, and start symbol q0 • a rule qi → aj qk for each aj move from qi to qk • a rule qi → aj for each aj move from qi to qk where qk ∈ F • An easy induction shows that δ*(q,x) = p iff q =>* xp • and that, for x ≠ λ, δ*(q,x) ∈ F iff q =>* x • So L(G) = L(M) − {λ}; if q0 ∈ F, adding the rule q0 → λ gives L(G) = L(M)
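The construction can be tried out on a small machine. The DFA below (strings over {0,1} that end in 1) is an invented example, as are the names DELTA, dfa_to_grammar, accepts, and generates. Its start state is not accepting, so the empty string is not in its language and no special treatment of λ is needed.

```python
from itertools import product

# A two-state DFA: state B is reached exactly when the last symbol read was 1.
DELTA = {("A", "0"): "A", ("A", "1"): "B", ("B", "0"): "A", ("B", "1"): "B"}
START, FINALS = "A", {"B"}

def dfa_to_grammar(delta, finals):
    """Rules q -> a p for every move, plus q -> a when p is accepting.
    A rule is stored as (a, p), with p = None for the form q -> a."""
    rules = {}
    for (q, a), p in delta.items():
        rules.setdefault(q, []).append((a, p))
        if p in finals:
            rules[q].append((a, None))
    return rules

def accepts(x):
    q = START
    for a in x:
        q = DELTA[(q, a)]
    return q in FINALS

def generates(rules, start, x):
    """Does the grammar derive x from start? Follow rules q -> a p along x,
    then finish with a rule of the form q -> a on the last symbol."""
    if x == "":
        return False                     # the construction adds no empty rules
    current = {start}
    for a in x[:-1]:
        current = {p for q in current for (b, p) in rules.get(q, [])
                   if b == a and p is not None}
    return any(b == x[-1] and p is None
               for q in current for (b, p) in rules.get(q, []))

GRAMMAR = dfa_to_grammar(DELTA, FINALS)
strings = ["".join(w) for n in range(6) for w in product("01", repeat=n)]
print(all(accepts(x) == generates(GRAMMAR, START, x) for x in strings))
```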

DFAs for right-linear grammars • Conversely, let G be a right-linear grammar • If all strings x on RHSs have length 1, then the construction above can be reversed • and the proof above still holds • If not, then the construction can be modified by adding extra states as in Linz, pp. 91-2 • In either case a DFA can be obtained for L(G)

Regular grammars • Left-linear grammars may be defined by analogy with right-linear grammars • every rule must have a RHS of the form Bx or x • Fact: Left-linear grammars generate all and only regular languages • A CFG is a regular grammar iff it is right-linear or left-linear • so a language has a regular grammar iff it is regular

Backus-Naur form (BNF) • Grammars for programming languages generally use a variant of our CFG notation called BNF. • In BNF the symbol ::= is used instead of the rightward pointing arrow. • In BNF, terminal symbols may be given in bold face, or nonterminals may be delimited by angle brackets, e.g. • <identifier> ::= <letter> <digits>

Common BNF conventions • The vertical bar convention • [ ] brackets • for optionality (0 or 1 times) • { } braces • for indefinite repetition (0 or more times) • ( ) parentheses • for removing ambiguity, e.g., (a|b)c vs. a | bc

A sample grammar in BNF • <conditional> ::= • if <test> then <block> [ else <block> ] endif • <block> ::= begin [<statements>] end • <statements> ::= { <statement> } • <test> ::= <var> <op> <var> • <statement> ::= <var> = <var> • <var> ::= x | y • <op> ::= = | /=
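The bracket and brace conventions map naturally onto an optional branch and a loop in a hand-written recognizer. The sketch below checks whether a string is a <conditional> of the sample grammar; it assumes tokens are separated by whitespace, and the name recognizes_conditional and its helpers are made up for the example.

```python
def recognizes_conditional(text):
    """Return True iff text is a <conditional> of the sample grammar."""
    tokens = text.split()
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def expect(tok):
        nonlocal pos
        if peek() != tok:
            raise ValueError(f"expected {tok!r} at token {pos}")
        pos += 1

    def var():                           # <var> ::= x | y
        nonlocal pos
        if peek() not in ("x", "y"):
            raise ValueError(f"expected a variable at token {pos}")
        pos += 1

    def test():                          # <test> ::= <var> <op> <var>
        nonlocal pos
        var()
        if peek() not in ("=", "/="):    # <op> ::= = | /=
            raise ValueError(f"expected an operator at token {pos}")
        pos += 1
        var()

    def block():                         # <block> ::= begin [<statements>] end
        expect("begin")
        while peek() in ("x", "y"):      # { <statement> }: zero or more times
            var()                        # <statement> ::= <var> = <var>
            expect("=")
            var()
        expect("end")

    try:
        expect("if")
        test()
        expect("then")
        block()
        if peek() == "else":             # [ else <block> ]: zero or one time
            expect("else")
            block()
        expect("endif")
        return pos == len(tokens)
    except ValueError:
        return False

print(recognizes_conditional("if x = y then begin x = y end endif"))
print(recognizes_conditional("if x then begin end endif"))
```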
