Equivalence of CFG's and PDA's

Equivalence of CFG's and PDA's A language is context free if and only if some pushdown automaton recognizes it • As usual with “if and only if” theorems, there are two directions to prove • If a language is context free, then some pushdown automaton recognizes it • If a pushdown automaton recognizes some language, then that language is context free

Only If (CFG to PDA) • Let L = L(G) for some CFG G = (V, Σ, P, S) • Idea: have PDA A simulate leftmost derivations in G, where a left-sentential form (LSF) is represented by: • The sequence of input symbols that A has consumed from its input, followed by… • A's stack, with the top leftmost Example: If (q, abcd, S) ⊢* (q, cd, ABC), then the LSF represented is abABC

Moves of A • If a terminal a is on top of the stack, then there had better be an a waiting on the input. A consumes a from the input and pops it from the stack • The LSF represented doesn't change! • If a variable B is on top of the stack, then PDA A has a choice of replacing B on the stack by the body of any production with head B

Defining the PDA • Define PDA A as follows: • Q contains a single state, q • Σ contains the terminal symbols of the grammar • Γ contains all terminal and non-terminal symbols from the grammar • F is the empty set (A accepts by empty stack) • Start stack symbol is the start symbol of the grammar • δ is defined as follows: • For each production X → α in the grammar, create a move (q, α) in δ(q, ε, X) • For each terminal symbol a in the grammar, create a move δ(q, a, a) = {(q, ε)}

Example S → a | aS | bSS | SSb | SbS PDA A = ({q}, {a,b}, {S,a,b}, δ, q, ∅, S) • δ is defined as δ(q, ε, S) = {(q,a), (q,aS), (q,bSS), (q,SSb), (q,SbS)} δ(q, a, a) = {(q, ε)} δ(q, b, b) = {(q, ε)}
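The construction can be sketched in Python and applied to the example grammar above. This is a minimal sketch, not a definitive implementation: the function name cfg_to_pda, the dictionary encoding of δ, and the use of the empty string for ε are assumptions made here for illustration, and grammar symbols are assumed to be single characters.

```python
from collections import defaultdict

EPS = ""  # assumption: the empty string stands for ε


def cfg_to_pda(productions, terminals, start):
    """Build the one-state PDA (accepting by empty stack) for a CFG.

    productions maps each variable to a list of bodies (strings of
    single-character symbols). Returns (delta, start_stack_symbol), where
    delta maps (state, input_or_EPS, stack_top) to a set of (state, push).
    """
    delta = defaultdict(set)
    # One ε-move per production X -> alpha: replace X on the stack by alpha.
    for head, bodies in productions.items():
        for body in bodies:
            delta[("q", EPS, head)].add(("q", body))
    # One matching move per terminal a: consume a and pop it.
    for a in terminals:
        delta[("q", a, a)].add(("q", EPS))
    return dict(delta), start


grammar = {"S": ["a", "aS", "bSS", "SSb", "SbS"]}
delta, z0 = cfg_to_pda(grammar, {"a", "b"}, "S")
print(sorted(delta[("q", EPS, "S")]))
```

For the example grammar this yields the five ε-moves on S and one matching move for each of a and b.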

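Processing of baa • Generate bSS • Match b • Generate a • Match a • Generate a • Match a [Figure: parse tree with root S and children b, S, S, where each inner S derives a; the leaves b, a, a are matched against the input in order]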
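The generate and match steps above can be replayed by a small nondeterministic search over configurations. This is a sketch under stated assumptions: the table DELTA hard-codes the example PDA (the single state q is omitted), the name accepts is hypothetical, and the pruning step relies on the fact that this particular grammar has no ε-productions.

```python
EPS = ""  # assumption: the empty string stands for ε

# Transition table of the example PDA; the single state q is left implicit.
DELTA = {
    (EPS, "S"): ["a", "aS", "bSS", "SSb", "SbS"],  # generate: expand S
    ("a", "a"): [EPS],                             # match terminal a
    ("b", "b"): [EPS],                             # match terminal b
}


def accepts(w, start="S"):
    """Return True if some run empties the stack exactly when w is consumed."""
    seen = set()
    frontier = [(0, start)]  # (input position, stack as a string, top first)
    while frontier:
        pos, stack = frontier.pop()
        if (pos, stack) in seen:
            continue
        seen.add((pos, stack))
        if not stack:
            if pos == len(w):
                return True
            continue
        # Pruning assumption: no production body is empty, so every stack
        # symbol needs at least one input symbol before it can disappear.
        if len(stack) > len(w) - pos:
            continue
        top, rest = stack[0], stack[1:]
        for push in DELTA.get((EPS, top), []):
            frontier.append((pos, push + rest))
        if pos < len(w):
            for push in DELTA.get((w[pos], top), []):
                frontier.append((pos + 1, push + rest))
    return False


print(accepts("baa"))
```

The search explores every choice of production body for S, which is exactly the nondeterminism of the PDA; the pruning bound only keeps the search finite.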

PDA to CFG • Assume L = N(P), where P = (Q, Σ, Γ, δ, q0, Z0) • Key idea: units of PDA action have the net effect of popping one symbol from the stack, consuming some input, and making a state change. • The triple [qZp] is a CFG variable that generates exactly those strings w such that P can read w from the input, pop Z (net effect), and go from state q to state p. • More precisely, (q, w, Z) ⊢* (p, ε, ε) • As a consequence of the above, (q, wx, Zα) ⊢* (p, x, α) for any x and α.

It's a Zen thing [qZp] is at once a triple involving states and symbols of P, and yet to the CFG we construct it is a single, indivisible object. (OK, I know that's not a Zen thing, but you get the point)

Strategy • A popping rule, e.g., (p, ε) in δ(q,a,Z). • [qZp] → a • Pop Z, consume a, move to state p • A rule that replaces one symbol and state by others, e.g., (p,Y) in δ(q,a,Z). • For all states r in Q: [qZr] → a[pYr] • Pop Z, consume a, move to state p, push Y; Y is later popped while moving from p to r • A rule that replaces one stack symbol by two, e.g., (p,XY) in δ(q,a,Z). • For all states r and s in Q: [qZs] → a[pXr][rYs] • Pop Z, consume a, move to state p, push XY (X on top); X is popped while moving from p to some state r, then Y is popped while moving from r to s • There may be some states r that cannot be reached from p while popping X. True, but this does not affect the grammar: the resulting variables are useless and do not change the language generated
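The three cases above are instances of one general rule, written out here in LaTeX as a summary. The indices k and r_1, …, r_k are notation introduced for this summary; the start productions on the last line are the usual way to complete the grammar and match the first production of the example that follows.

```latex
\begin{align*}
&\text{For each } (p,\, Y_1 Y_2 \cdots Y_k) \in \delta(q, a, Z),\quad
  a \in \Sigma \cup \{\varepsilon\},\ k \ge 0,\ \text{and all } r_1, \dots, r_k \in Q: \\
&\qquad [q Z r_k] \;\to\; a\,[p Y_1 r_1]\,[r_1 Y_2 r_2] \cdots [r_{k-1} Y_k r_k] \\[4pt]
&k = 0: \quad [q Z p] \to a \\
&k = 1: \quad [q Z r_1] \to a\,[p Y_1 r_1] \\
&k = 2: \quad [q Z r_2] \to a\,[p Y_1 r_1]\,[r_1 Y_2 r_2] \\[4pt]
&\text{Start: } S \to [q_0 Z_0 p] \ \text{ for every } p \in Q
\end{align*}
```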

δ(q,a,Z) = {(p,Y)} • Consume a, pop Z, push Y, move to state p [Figure: transition diagram with an arc labeled a, Z / Y from state q to state p; after the move, Y has not yet been processed] • Since we don't know which state the PDA will be in after processing Y, define a production [qZr_n] → a[pYr_n] for each possible final state r_n

Example S → [q0Z0q]; [q0Z0q] → a[q0Ap] [pZ0q]; [q0Z0q] → b[q0Bp] [pZ0q]; [q0Aq] → a[q0Ap] [pAq]; [q0Aq] → b[q0Bp] [pAq]; [q0Bq] → a[q0Ap] [pBq]; [q0Bq] → b[q0Bp] [pBq]; [q0Z0q] → c[q1Z0q]; [q0Aq] → c[q1Aq]; [q0Bq] → c[q1Bq]; [q1Aq1] → a; [q1Bq1] → b; [q1Z0q1] → ε
PDA with transitions • δ(q0,a,Z0) = {(q0,AZ0)} • δ(q0,b,Z0) = {(q0,BZ0)} • δ(q0,a,A) = {(q0,AA)} • δ(q0,b,A) = {(q0,BA)} • δ(q0,a,B) = {(q0,AB)} • δ(q0,b,B) = {(q0,BB)} • δ(q0,c,Z0) = {(q1,Z0)} • δ(q0,c,A) = {(q1,A)} • δ(q0,c,B) = {(q1,B)} • δ(q1,a,A) = {(q1,ε)} • δ(q1,b,B) = {(q1,ε)} • δ(q1,ε,Z0) = {(q1,ε)}
In the above, q and p can each be either q0 or q1

The Full Story If we write out every variable for every possible choice of states, the grammar contains all of the productions below
S → [q0Z0q0]; S → [q0Z0q1]
[q0Z0q0] → a[q0Aq0] [q0Z0q0]; [q0Z0q0] → a[q0Aq1] [q1Z0q0]; [q0Z0q1] → a[q0Aq0] [q0Z0q1]; [q0Z0q1] → a[q0Aq1] [q1Z0q1]
[q0Z0q0] → b[q0Bq0] [q0Z0q0]; [q0Z0q0] → b[q0Bq1] [q1Z0q0]; [q0Z0q1] → b[q0Bq0] [q0Z0q1]; [q0Z0q1] → b[q0Bq1] [q1Z0q1]
[q0Aq0] → a[q0Aq0] [q0Aq0]; [q0Aq0] → a[q0Aq1] [q1Aq0]; [q0Aq1] → a[q0Aq0] [q0Aq1]; [q0Aq1] → a[q0Aq1] [q1Aq1]
[q0Aq0] → b[q0Bq0] [q0Aq0]; [q0Aq0] → b[q0Bq1] [q1Aq0]; [q0Aq1] → b[q0Bq0] [q0Aq1]; [q0Aq1] → b[q0Bq1] [q1Aq1]
[q0Bq0] → a[q0Aq0] [q0Bq0]; [q0Bq0] → a[q0Aq1] [q1Bq0]; [q0Bq1] → a[q0Aq0] [q0Bq1]; [q0Bq1] → a[q0Aq1] [q1Bq1]
[q0Bq0] → b[q0Bq0] [q0Bq0]; [q0Bq0] → b[q0Bq1] [q1Bq0]; [q0Bq1] → b[q0Bq0] [q0Bq1]; [q0Bq1] → b[q0Bq1] [q1Bq1]
[q0Z0q0] → c[q1Z0q0]; [q0Z0q1] → c[q1Z0q1]; [q0Aq0] → c[q1Aq0]; [q0Aq1] → c[q1Aq1]; [q0Bq0] → c[q1Bq0]; [q0Bq1] → c[q1Bq1]
[q1Aq1] → a; [q1Bq1] → b; [q1Z0q1] → ε
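The full expansion can be generated mechanically. The sketch below is one way to do it, under stated assumptions: the function name pda_to_cfg, the encoding of variables as (q, Z, p) tuples, and the encoding of pushed strings as tuples with the new top first are all choices made here for illustration.

```python
from itertools import product

EPS = ""  # assumption: the empty string stands for ε


def pda_to_cfg(states, delta, q0, z0):
    """Triple construction for a PDA that accepts by empty stack.

    delta maps (state, input_or_EPS, stack_top) to a set of (state, pushed).
    Returns a list of (head, body) pairs; the start symbol is the string "S".
    """
    prods = [("S", [(q0, z0, p)]) for p in states]
    for (q, a, z), moves in delta.items():
        for p, pushed in moves:
            k = len(pushed)
            # Guess the state reached after each pushed symbol is popped.
            for rs in product(states, repeat=k):
                chain = (p,) + rs
                body = [a] if a != EPS else []
                body += [(chain[i], pushed[i], chain[i + 1]) for i in range(k)]
                prods.append(((q, z, chain[-1]), body))
    return prods


Q = ["q0", "q1"]
delta = {
    ("q0", "a", "Z0"): {("q0", ("A", "Z0"))},
    ("q0", "b", "Z0"): {("q0", ("B", "Z0"))},
    ("q0", "a", "A"): {("q0", ("A", "A"))},
    ("q0", "b", "A"): {("q0", ("B", "A"))},
    ("q0", "a", "B"): {("q0", ("A", "B"))},
    ("q0", "b", "B"): {("q0", ("B", "B"))},
    ("q0", "c", "Z0"): {("q1", ("Z0",))},
    ("q0", "c", "A"): {("q1", ("A",))},
    ("q0", "c", "B"): {("q1", ("B",))},
    ("q1", "a", "A"): {("q1", ())},
    ("q1", "b", "B"): {("q1", ())},
    ("q1", EPS, "Z0"): {("q1", ())},
}
prods = pda_to_cfg(Q, delta, "q0", "Z0")
print(len(prods))
```

With two states, each move that pushes two symbols contributes four productions, each move that pushes one symbol contributes two, and each popping move contributes one, which together with the two start productions gives the 35 productions listed above.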

Deriving bacab Only some of the productions in the generated grammar can take part in a derivation of a terminal string; the rest are unnecessary
PDA moves: (q0, bacab, Z0) ⊢ (q0, acab, BZ0) ⊢ (q0, cab, ABZ0) ⊢ (q1, ab, ABZ0) ⊢ (q1, b, BZ0) ⊢ (q1, ε, Z0) ⊢ (q1, ε, ε)
Useful productions: S → [q0Z0q1]; [q0Z0q1] → a[q0Aq1] [q1Z0q1]; [q0Z0q1] → b[q0Bq1] [q1Z0q1]; [q0Aq1] → a[q0Aq1] [q1Aq1]; [q0Aq1] → b[q0Bq1] [q1Aq1]; [q0Bq1] → a[q0Aq1] [q1Bq1]; [q0Bq1] → b[q0Bq1] [q1Bq1]; [q0Z0q1] → c[q1Z0q1]; [q0Aq1] → c[q1Aq1]; [q0Bq1] → c[q1Bq1]; [q1Aq1] → a; [q1Bq1] → b; [q1Z0q1] → ε
Corresponding leftmost derivation: S ⇒ [q0Z0q1] ⇒ b[q0Bq1] [q1Z0q1] ⇒ ba[q0Aq1] [q1Bq1] [q1Z0q1] ⇒ bac[q1Aq1] [q1Bq1] [q1Z0q1] ⇒ baca[q1Bq1] [q1Z0q1] ⇒ bacab[q1Z0q1] ⇒ bacab

Deterministic PDAs • Intuitively: never a choice of move • δ(q, a, Z) has at most one member for any q, a, Z (including a = ε). • If δ(q, ε, Z) is nonempty, then δ(q, a, Z) must be empty for all input symbols a. • Why Care? • Parsers, as in YACC, are really DPDA's. • Thus, the question of what languages a DPDA can accept is really the question of what programming language syntax can be parsed conveniently.
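The two conditions above are easy to check on a transition table. The sketch below assumes δ is stored as a dictionary from (state, input or ε, stack top) to a set of moves; the function name is_deterministic and the sample tables are hypothetical.

```python
EPS = ""  # assumption: the empty string stands for ε


def is_deterministic(delta, input_symbols):
    """Check the two DPDA conditions on a transition table."""
    for (q, a, z), moves in delta.items():
        # Condition 1: at most one move per (q, a, Z), including a = ε.
        if len(moves) > 1:
            return False
        # Condition 2: an ε-move on (q, Z) rules out any input-consuming move.
        if a == EPS and moves:
            if any(delta.get((q, b, z)) for b in input_symbols):
                return False
    return True


# A fragment with no choice of move.
det = {
    ("q0", "c", "Z0"): {("q1", "Z0")},
    ("q1", EPS, "Z0"): {("q1", EPS)},
}
# Two ε-moves on the same (q, Z): violates condition 1.
choice = {("q", EPS, "S"): {("q", "a"), ("q", "aS")}}
# An ε-move competing with a move on input a: violates condition 2.
conflict = {
    ("q", EPS, "Z"): {("q", EPS)},
    ("q", "a", "Z"): {("q", EPS)},
}
print(is_deterministic(det, {"a", "b", "c"}))
```

The second table has the shape produced by the CFG to PDA construction, which is why that construction generally yields a nondeterministic PDA.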

Some Language Relationships • Acceptance by empty stack is hard for a DPDA • Once it accepts, it dies and cannot accept any continuation. • Thus, N(P) has the prefix property: if w is in N(P), then wx is NOT in N(P) for any nonempty x. • However, parsers do accept by emptying their stack • Trick: they really process strings followed by a unique endmarker (typically $); e.g., if they accept w$, they consider w to be a correct program.

If L is a regular language, then L is a DPDA language • A DPDA can simulate a DFA, without using its stack (acceptance by final state). • If L is a DPDA language, then L is a CFL that is not inherently ambiguous • A DPDA yields an unambiguous grammar in the standard construction.

Cleaning Up Grammars • We can "simplify" grammars to a great extent, e.g.: • Get rid of useless symbols: those that do not participate in any derivation of a terminal string. • Get rid of ε-productions: those of the form variable → ε. • But you lose the ability to generate ε as a string in the language. • Get rid of unit productions: those of the form variable → variable. • Any CFG can be converted via these and other methods to Chomsky Normal Form • The only production forms are variable → two variables and variable → terminal.

Getting Rid of the Empty String • Empty string is a nuisance with grammars and languages in general • We will look at languages that do not contain ε • No loss of generality: For language L, let G = (V,T,S,P) be a CFG that generates L - {ε} • Modify the grammar by adding a new start variable S0 and the productions S0 → S | ε • The new grammar generates L ∪ {ε}, which is L itself whenever ε is in L • Therefore any non-trivial conclusion we make for L - {ε} should transfer to L

Useless Symbols • In order for a symbol X to be useful, it must: • (1) Derive some terminal string (possibly X is a terminal). • (2) Be reachable from the start symbol; i.e., S ⇒* αXβ. • Note that X wouldn't really be useful if α or β included a symbol that didn't satisfy (1), so it is important that (1) be tested first, and symbols that don't derive terminal strings be eliminated before testing (2).

Finding Symbols That Don't Derive Any Terminal String • Recursive construction: • Basis: A terminal surely derives a terminal string. • Induction: If A is the head of a production whose body is X1X2 …Xk, and each Xi is known to derive a terminal string, then surely A derives a terminal string. • Keep going until no more symbols that derive terminal strings are discovered.

Example S → AB | C; A → 0B | C; B → 1 | A0; C → AC | C1 • Round 1: 0 and 1 are "in." • Round 2: B → 1 says B is in. • Round 3: A → 0B says A is in. • Round 4: S → AB says S is in. • Round 5: Nothing more can be added. • Thus, C can be eliminated, along with any production that mentions it, leaving S → AB; A → 0B; B → 1 | A0.
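The rounds above amount to a fixed-point computation. A minimal Python sketch follows, assuming single-character symbols and a dictionary from each variable to its list of bodies; the names generating_symbols and drop_symbols are hypothetical. A single pass of the loop may add several symbols at once, so its passes do not correspond one-to-one to the rounds on the slide, but the final set is the same.

```python
def generating_symbols(productions, terminals):
    """Fixed-point computation of the symbols that derive a terminal string."""
    generating = set(terminals)  # basis: every terminal is "in"
    changed = True
    while changed:
        changed = False
        for head, bodies in productions.items():
            if head in generating:
                continue
            # Induction: some body consists only of symbols already "in".
            if any(all(x in generating for x in body) for body in bodies):
                generating.add(head)
                changed = True
    return generating


def drop_symbols(productions, keep):
    """Remove every production that mentions a symbol outside keep."""
    return {
        head: [b for b in bodies if all(x in keep for x in b)]
        for head, bodies in productions.items()
        if head in keep
    }


grammar = {"S": ["AB", "C"], "A": ["0B", "C"], "B": ["1", "A0"], "C": ["AC", "C1"]}
gen = generating_symbols(grammar, {"0", "1"})
cleaned = drop_symbols(grammar, gen)
print(sorted(gen))
print(cleaned)
```

On the example grammar, C never enters the set, and dropping it leaves S → AB; A → 0B; B → 1 | A0.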

Finding Symbols That Can't Be Derived From the Start Symbol • Another recursive algorithm: • Basis: S is "in." • Induction: If variable A is in, then so is every symbol in the production bodies for A. • Keep going until no more symbols derivable from S can be found.

Example S → AB; A → 0B; B → 1 | A0 • Round 1: S is in. • Round 2: A and B are in. • Round 3: 0 and 1 are in. • Round 4: Nothing can be added. • In this case, all symbols are derivable from S, so no change to grammar. • Book has an example where not only are there symbols not derivable from S, but you must eliminate first the symbols that don't derive terminal strings, or you get the wrong grammar.
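Reachability from S is a plain graph search. The sketch below assumes the same dictionary encoding with single-character symbols; the function name reachable_symbols is hypothetical. The second grammar is an illustration chosen here, not necessarily the book's example: it is what remains of S → AB | a; A → b after B, which derives no terminal string, has been eliminated.

```python
def reachable_symbols(productions, start):
    """Collect every symbol that appears in some derivation from start."""
    reachable = {start}  # basis: the start symbol is "in"
    worklist = [start]
    while worklist:
        head = worklist.pop()
        # Induction: every symbol in a body of a reachable variable is "in".
        for body in productions.get(head, []):
            for x in body:
                if x not in reachable:
                    reachable.add(x)
                    worklist.append(x)
    return reachable


slide_grammar = {"S": ["AB"], "A": ["0B"], "B": ["1", "A0"]}
print(sorted(reachable_symbols(slide_grammar, "S")))

# Illustrative grammar after removing the non-generating variable B.
after_step_one = {"S": ["a"], "A": ["b"]}
print(sorted(reachable_symbols(after_step_one, "S")))
```

In the second grammar A and b are no longer reachable once S → AB is gone, which is why the test for deriving terminal strings has to run first.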

Eliminating ε-Productions • A variable A is nullable if A ⇒* ε. Find them by a recursive algorithm: • Basis: If A → ε is a production, then A is nullable. • Induction: If A is the head of a production whose body consists of only nullable symbols, then A is nullable. • Once we have the nullable symbols, we can add additional productions and then throw away the productions of the form A → ε for any A.
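The nullable variables can be found with the same kind of fixed-point loop. This is a sketch, assuming single-character symbols and the empty string as the encoding of an ε body; the function name nullable_variables and the sample grammar are hypothetical.

```python
def nullable_variables(productions):
    """Return the set of variables A with A =>* ε."""
    nullable = set()
    changed = True
    while changed:
        changed = False
        for head, bodies in productions.items():
            if head in nullable:
                continue
            # An empty body is the basis case: all() over no symbols is True.
            # Otherwise every symbol of the body must already be nullable.
            if any(all(x in nullable for x in body) for body in bodies):
                nullable.add(head)
                changed = True
    return nullable


# Sample grammar: S -> AB; A -> aAA | ε; B -> bBB | ε
sample = {"S": ["AB"], "A": ["aAA", ""], "B": ["bBB", ""]}
print(sorted(nullable_variables(sample)))
```

Here A and B are nullable by the basis, and S becomes nullable in the next pass because its body AB consists only of nullable symbols.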

If A → X1X2…Xk is a production, add all productions that can be formed by eliminating some or all of those Xi's that are nullable. • But, don't eliminate all k if they are all nullable. Example • If A → BC is a production, and both B and C are nullable, add A → B | C
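The rule above can be sketched by enumerating, for each body, every way of keeping or dropping its nullable symbols. The function name eliminate_epsilon is hypothetical, symbols are assumed to be single characters, and the set of nullable variables is passed in as a parameter so the block stays self-contained.

```python
from itertools import product


def eliminate_epsilon(productions, nullable):
    """Add the shortened bodies and discard every ε body."""
    result = {}
    for head, bodies in productions.items():
        new_bodies = []
        for body in bodies:
            # A nullable symbol may be kept or dropped; others must be kept.
            options = [(x, "") if x in nullable else (x,) for x in body]
            for choice in product(*options):
                candidate = "".join(choice)
                # Skip the empty body: never eliminate all k symbols.
                if candidate and candidate not in new_bodies:
                    new_bodies.append(candidate)
        result[head] = new_bodies
    return result


# Sample grammar: A -> BC; B -> b | ε; C -> c | ε, with A, B, C all nullable.
sample = {"A": ["BC"], "B": ["b", ""], "C": ["c", ""]}
print(eliminate_epsilon(sample, {"A", "B", "C"}))
```

For A → BC with both B and C nullable, this produces A → BC | B | C, matching the example on the slide.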

Eliminating Unit Productions • Eliminate useless symbols and ε-productions. • Discover those pairs of variables (A, B) such that A ⇒* B. • Because there are no ε-productions, this derivation can only use unit productions. • Thus, we can find the pairs by computing reachability in a graph where nodes = variables, and arcs = unit productions. • Replace each combination where A ⇒* B ⇒ α and α is other than a single variable by A → α • I.e., "short circuit" sequences of unit productions, which must eventually be followed by some other kind of production. • Remove all unit productions.
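The graph view above translates directly into code. The sketch below assumes single-character symbols, that every variable appears as a key of the dictionary, and that ε-productions are already gone; the function name eliminate_unit_productions and the expression grammar used as input are illustrative choices.

```python
def eliminate_unit_productions(productions):
    """Short-circuit chains of unit productions, then drop the unit ones."""
    variables = set(productions)

    def is_unit(body):
        return len(body) == 1 and body in variables

    result = {}
    for a in productions:
        # Variables B with A =>* B, found by search along unit productions.
        reach, worklist = {a}, [a]
        while worklist:
            v = worklist.pop()
            for body in productions[v]:
                if is_unit(body) and body not in reach:
                    reach.add(body)
                    worklist.append(body)
        # A -> alpha for every non-unit production B -> alpha with A =>* B.
        bodies = []
        for b in sorted(reach):
            for body in productions[b]:
                if not is_unit(body) and body not in bodies:
                    bodies.append(body)
        result[a] = bodies
    return result


# Sample grammar: E -> T | E+T; T -> F | T*F; F -> a | (E)
sample = {"E": ["T", "E+T"], "T": ["F", "T*F"], "F": ["a", "(E)"]}
print(eliminate_unit_productions(sample))
```

In the sample, E reaches T and F through unit productions, so E inherits the non-unit bodies of both.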

Chomsky Normal Form • Get rid of useless symbols, ε-productions, and unit productions (already done). • Get rid of productions whose bodies are mixes of terminals and variables, or consist of more than one terminal. • Break up production bodies longer than 2. • Result: All productions are of the form A → BC or A → a

No Mixed Bodies • For each terminal a, introduce a new variable A_a, with one production A_a → a. • Replace a in any body where it is not the entire body by A_a. • Now, every body is either a single terminal or it consists only of variables. Example • A → 0B1 becomes A_0 → 0; A_1 → 1; A → A_0 B A_1
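A sketch of this step in Python follows. Because the new variables have multi-character names, bodies are encoded here as lists of symbol strings; the function name separate_terminals is hypothetical, and the sketch assumes no existing variable is already named A_<terminal>. Unlike the slide, it introduces a wrapper variable only for terminals that actually occur in a longer body.

```python
def separate_terminals(productions, terminals):
    """Replace terminals inside longer bodies by wrapper variables A_<a>."""
    result = {}
    wrappers = {}  # terminal -> name of its new variable
    for head, bodies in productions.items():
        new_bodies = []
        for body in bodies:
            if len(body) == 1:
                new_bodies.append(list(body))  # a lone symbol is the entire body
                continue
            new_body = []
            for x in body:
                if x in terminals:
                    wrappers.setdefault(x, "A_" + x)
                    new_body.append(wrappers[x])
                else:
                    new_body.append(x)
            new_bodies.append(new_body)
        result[head] = new_bodies
    # One production A_a -> a for each wrapper that was needed.
    for a, var in wrappers.items():
        result[var] = [[a]]
    return result


sample = {"A": [["0", "B", "1"]], "B": [["1"]]}
print(separate_terminals(sample, {"0", "1"}))
```

The body of B → 1 is left alone because the terminal is the entire body, while A → 0B1 becomes A → A_0 B A_1.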

Making Bodies Short • If we have a production like A → BCDE, we can introduce some new variables that allow the variables of the body to be introduced one at a time. • A body of length k requires k - 2 new variables. Example • Introduce F and G; replace A → BCDE by A → BF; F → CG; G → DE.
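The splitting step can be sketched as a loop that peels off one symbol at a time. Bodies are lists of symbol strings, the function name shorten_bodies is hypothetical, and the fresh variable names N1, N2, … are an assumption of this sketch (they play the role of F and G in the example and are assumed not to clash with existing variables).

```python
def shorten_bodies(productions):
    """Break every body longer than 2 into a chain of length-2 bodies."""
    result = {}
    counter = 0
    for head, bodies in productions.items():
        result.setdefault(head, [])
        for body in bodies:
            current = head
            symbols = list(body)
            while len(symbols) > 2:
                counter += 1
                fresh = f"N{counter}"  # hypothetical fresh variable name
                # Keep the first symbol, defer the rest to the fresh variable.
                result.setdefault(current, []).append([symbols[0], fresh])
                current = fresh
                symbols = symbols[1:]
            result.setdefault(current, []).append(symbols)
    return result


sample = {"A": [["B", "C", "D", "E"]]}
print(shorten_bodies(sample))
```

A body of length 4 needs two fresh variables here, in line with the k - 2 count on the slide.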

Summary Theorem If L is any CFL, there is a grammar G that generates L - {ε}, for which each production is of the form A → BC or A → a, and there are no useless symbols.
