500 likes | 1.18k Views
دانشگاه صنعتی امیر کبیر دانشکده مهندسی کامپیوتر. CYK ) Cocke-Younger-Kasami) Parsing Algorithm. سید محمد حسین معطر پردازش زبان طبیعی. Parsing Algorithms. CFGs are basis for describing (syntactic) structure of NL sentences Thus - Parsing Algorithms are core of NL analysis systems
E N D
دانشگاه صنعتی امیر کبیر دانشکده مهندسی کامپیوتر CYK )Cocke-Younger-Kasami)Parsing Algorithm سید محمد حسین معطر پردازش زبان طبیعی
Parsing Algorithms • CFGs are basis for describing (syntactic) structure of NL sentences • Thus - Parsing Algorithms are core of NL analysis systems • Recognition vs. Parsing: • Recognition - deciding the membership in the language: • Parsing – Recognition+ producing a parse tree forit • Parsing is more “difficult” than recognition? (time complexity) • Ambiguity - an input may have exponentially manyparses
Parsing Algorithms • Parsing General CFLs vs. Limited Forms • Efficiency: • Deterministic (LR) languages can be parsed in linear time • A number of parsing algorithms for general CFLs requireO(n3)time • Asymptotically best parsing algorithm for general CFLs requires O(n2.37), but is not practical • Utility - why parse general grammars and not just CNF? • Grammar intended to reflect actual structure of language • Conversion to CNF completely destroys the parse structure
CYK )Cocke-Younger-Kasami) • One of the earliest recognition and parsing algorithms • The standard version of CYK can only recognize languages defined by context-free grammars in Chomsky Normal Form (CNF). • It is also possible to extend the CYK algorithm to handle some grammars which are not in CNF • Harder to understand • Based on a “dynamic programming” approach: • Build solutions compositionally from sub-solutions • Store sub-solutions and re-use them whenever necessary • Uses the grammar directly (no PDA is used) • Recognition version: decide whether S == > w ?
CYK Algorithm • The CYK algorithm for the membership problem is as follows: • Let the input string be a sequence of n letters a1 ... an. • Let the grammar contain r terminal and nonterminal symbols R1 ... Rr, and let R1 be the start symbol. • Let P[n,n,r] be an array of booleans. Initialize all elements of P to false. • For each i = 1 to n • For each unit production Rj -> ai, set P[i,1,j] = true. • For each i = 2 to n -- Length of span • For each j = 1 to n-i+1 -- Start of span • For each k = 1 to i-1 -- Partition of span • For each production RA -> RB RC • If P[j,k,B] and P[j+k,i-k,C] then set P[j,i,A] = true • If P[1,n,1] is true • Then string is member of language • Else string is not member of language
CYK Pseudocode On input x = x1x2 … xn : for (i = 1 to n) //create middle diagonal for (each var. A) if(Axi) add A to table[i-1][i] for (d = 2 to n) // d’th diagonal for (i = 0 to n-d) for (k = i+1 to i+d-1) for (each var. A) for(each var. B in table[i][k]) for(each var. C in table[k][k+d]) if(ABC) add A to table[i][k+d] return Stable[0][n] ? ACCEPT : REJECT
CYK Algorithm • this algorithm considers every possible consecutive subsequence of the sequence of letters and sets P[i,j,k] to be true if the sequence of letters starting from i of length j can be generated from Rk. • Once it has considered sequences of length 1, it goes on to sequences of length 2, and so on. • For subsequences of length 2 and greater, it considers every possible partition of the subsequence into two halves, and checks to see if there is some production P -> Q R such that Q matches the first half and R matches the second half. If so, it records P as matching the whole subsequence. • Once this process is completed, the sentence is recognized by the grammar if the subsequence containing the entire string is matched by the start symbol
CYK Algorithm for Deciding Context Free Languages Q: Consider the grammar G given by S e | AB | XB T AB | XB X AT A a B b • Is x = aaabb in L(G ) • Is x = aaabbb in L(G )
CYK Algorithm for Deciding Context Free Languages The algorithm is “bottom-up” in that we start with bottom of derivation tree. S e | AB | XB T AB | XB X AT A a B b a a a b b
CYK Algorithm for Deciding Context Free Languages 1) Write variables for all length 1 substrings S e | AB | XB T AB | XB X AT A a B b a a a b b A A A B B
CYK Algorithm for Deciding Context Free Languages 2) Write variables for all length 2 substrings S e | AB| XB T AB| XB X AT A a B b a a a b b A A A B B S,T T
CYK Algorithm for Deciding Context Free Languages 3) Write variables for all length 3 substrings S e | AB | XB T AB | XB X AT A a B b a a a b b A A A B B S,T T X
CYK Algorithm for Deciding Context Free Languages 4) Write variables for all length 4 substrings S e | AB | XB T AB | XB X AT A a B b a a a b b A A A B B S,T T X S,T
CYK Algorithm for Deciding Context Free Languages • Write variables for all length 5 substrings. S e | AB | XB T AB | XB X AT A a B b REJECT! a a a b b A A A B B S,T T X S,T X
CYK Algorithm for Deciding Context Free Languages Now look at aaabbb : S e | AB | XB T AB | XB X AT A a B b a a a b b b
CYK Algorithm for Deciding Context Free Languages 1) Write variables for all length 1 substrings. S e | AB | XB T AB | XB X AT A a B b a a a b b b A A A B B B
CYK Algorithm for Deciding Context Free Languages 2) Write variables for all length 2 substrings. S e | AB| XB T AB| XB X AT A a B b a a a b b b A A A B B B S,T
CYK Algorithm for Deciding Context Free Languages 3) Write variables for all length 3 substrings. S e | AB | XB T AB | XB X AT A a B b a a a b b b A A A B B B S,T T X
CYK Algorithm for Deciding Context Free Languages 4) Write variables for all length 4 substrings. S e | AB | XB T AB | XB X AT A a B b a a a b b b A A A B B B S,T T X S,T
CYK Algorithm for Deciding Context Free Languages 5) Write variables for all length 5 substrings. S e | AB | XB T AB | XB X AT A a B b a a a b b b A A A B B B S,T T X S,T X
CYK Algorithm for Deciding Context Free Languages 6) Write variables for all length 6 substrings. Se | AB | XB T AB | XB X AT A a B b S is included so aaabbb accepted! a a a b b b A A A B B B S,T T X S,T X S,T
CYK Algorithm for Deciding Context Free Languages Can also use a table for same purpose.
CYK Algorithm for Deciding Context Free Languages 1. Variables for length 1 substrings.
CYK Algorithm for Deciding Context Free Languages 2. Variables for length 2 substrings.
CYK Algorithm for Deciding Context Free Languages 3. Variables for length 3 substrings.
CYK Algorithm for Deciding Context Free Languages 4. Variables for length 4 substrings.
CYK Algorithm for Deciding Context Free Languages 5. Variables for length 5 substrings.
CYK Algorithm for Deciding Context Free Languages 6. Variables for aaabbb. ACCEPTED!
Parsing results • We keep the results for every wijin a table. • Note that we only need to fill in entries up to the diagonal – thelongest substring starting ati is of length n-i+1
Constructing parse tree • we need to construct parse trees for string w: • Idea: • Keep back-pointers to the table entries that we combine • At the end - reconstruct a parse from the back-pointers • This allows us to find all parse trees
Ambiguity Efficient Representation of Ambiguities • Local Ambiguity Packing : • a Local Ambiguity - multiple ways to derive the samesubstring from a non-terminal • All possible ways to derive each non-terminalare storedtogether • When creating back-pointers, create a single back-pointerto the “packed” representation • Allows to efficiently represent a very large number ofambiguities (even exponentially many) • Unpacking - producing one or more of the packed parse treesby following the back-pointers.
References • Hopcroft and Ullman,“Intro. to Automata Theory, Lang. and Comp.”Section 6.3, pp. 139-141 • “CYK algorithm ” , Wikipedia, the free encyclopedia • A representationbyZeph Grunschlag