CS730b 김병창 NLP Lab. 1998. 10. 29

Inside-outside reestimation from partially bracketed corpora. F. Pereira and Y. Schabes, ACL 30, 1992. Contents: Motivation; Partially Bracketed Text; Grammar Reestimation; The Inside-Outside Algorithm; The Extended Algorithm; Complexity.


Presentation Transcript


  1. Inside-outside reestimation from partially bracketed corpora. F. Pereira and Y. Schabes, ACL 30, 1992. CS730b, 김병창, NLP Lab., 1998. 10. 29

  2. Contents
     • Motivation
     • Partially Bracketed Text
     • Grammar Reestimation
       • The Inside-Outside Algorithm
       • The Extended Algorithm
       • Complexity
     • Experimental Evaluation
       • Inferring the Palindrome Language
       • Experiments on the ATIS Corpus
     • Conclusions and Further Work
     NLP Lab., POSTECH

  3. Motivation I
     • A very simple method for learning SCFGs [Charniak]
       • Generate all possible SCFG rules
       • Assign some initial probabilities
       • Run the training algorithm on a sample of raw text
       • Remove the rules whose probability ends up at zero
     • Difficulties in using SCFGs
       • Time complexity: $O(n^3 |w|^3)$
         • $n$: the number of nonterminals; $w$: a training sentence
         • cf. $O(s^2 |w|)$ for training an HMM with $s$ states
       • Bad convergence properties
         • The larger the number of nonterminals, the worse
       • Linguistically reasonable structure is inferred only by chance

  4. Motivation II
     • Extension of the Inside-Outside algorithm
       • Infers grammars from a partially parsed corpus
     • Advantages
       • Constituent boundary information is reflected in the grammar
       • Fewer training iterations
       • Better time complexity

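  5. Partially Bracketed Text
     • Example
       • (((VB(DT NNS(IN((NN)(NN CD)))))).)
       • (((List (the fares(for((flight)(number 891)))))).)
     • Notation
       • Corpus $C = \{c \mid c = (w, B)\}$, where $w$ is a string and $B$ is a bracketing of $w$
       • $w = w_1 w_2 \cdots w_i w_{i+1} \cdots w_j \cdots w_{|w|}$
       • A span $(i, j)$ delimits the substring ${}_i w_j = w_{i+1} \cdots w_j$
       • Consistent: no two spans in a bracketing overlap
       • Compatible: the union of two bracketings is consistent
       • Valid: a span is compatible with a bracketing
       • Spans in a derivation $\alpha_0 \Rightarrow \alpha_1 \Rightarrow \cdots \Rightarrow \alpha_m = w$
         • If $j = m$, the span of $w_i$ in $\alpha_j$ is $(i-1, i)$
         • If $j < m$, $\alpha_j = \beta A \gamma$ and $\alpha_{j+1} = \beta X_1 \cdots X_k \gamma$, and the span of $A$ in $\alpha_j$ is $(i_1, j_k)$, where $(i_l, j_l)$ is the span of $X_l$ in $\alpha_{j+1}$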
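
The span definitions on this slide can be sketched in a few lines of Python. This is only an illustration: the function names are made up here, spans are 0-based `(i, j)` pairs as in the notation above, and the bracketing `B` is a hand transcription of the ATIS example sentence.

```python
from typing import Iterable, Set, Tuple

Span = Tuple[int, int]  # (i, j) with 0 <= i < j <= |w|, covering w_{i+1} .. w_j


def overlap(a: Span, b: Span) -> bool:
    """Two spans overlap when they cross, i.e. neither contains the other."""
    (i, j), (k, l) = a, b
    return i < k < j < l or k < i < l < j


def consistent(bracketing: Iterable[Span]) -> bool:
    """A bracketing is consistent when no two of its spans overlap."""
    spans = list(bracketing)
    return not any(overlap(a, b) for a in spans for b in spans)


def compatible(b1: Iterable[Span], b2: Iterable[Span]) -> bool:
    """Two bracketings are compatible when their union is consistent."""
    return consistent(set(b1) | set(b2))


def valid(span: Span, bracketing: Iterable[Span]) -> bool:
    """A span is valid for a bracketing when it is compatible with it."""
    return compatible({span}, bracketing)


# Hand transcription of (((List (the fares(for((flight)(number 891)))))).)
# Tokens: List=1 the=2 fares=3 for=4 flight=5 number=6 891=7 .=8
B: Set[Span] = {(0, 8), (0, 7), (1, 7), (3, 7), (4, 7), (4, 5), (5, 7)}

print(valid((1, 3), B))  # "the fares" crosses nothing in B
print(valid((2, 4), B))  # "fares for" crosses (3, 7)
```

Under these definitions the first query is valid and the second is not, because `(2, 4)` crosses the bracket around "for flight number 891".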

  6. Grammar Reestimation
     • Uses of the reestimation algorithm
       • Parameter estimation for an SCFG derived by other means
       • Grammar inference from scratch
     • Grammar inference
       • Given a set $N$ of nonterminals and a set $\Sigma$ of terminals
         • $n = |N|$, $t = |\Sigma|$
         • $N = \{A_1, \ldots, A_n\}$, $\Sigma = \{b_1, \ldots, b_t\}$
       • A CNF SCFG over $N$, $\Sigma$ has $n^3 + nt$ probabilities
         • $B_{p,q,r}$ on $A_p \to A_q A_r$: $n^3$
         • $U_{p,m}$ on $A_p \to b_m$: $nt$
       • Meaning of the rule probabilities: the intuition of context-freeness
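
As a small illustration of the $n^3 + nt$ parameter count, the sketch below enumerates every CNF rule over $n$ nonterminals and $t$ terminals and draws a random initial assignment. The function names and the normalisation (each nonterminal's binary and unary probabilities sum to 1) are assumptions made for this example, not details taken from the slides.

```python
import random
from itertools import product


def all_cnf_rules(n, t):
    """Enumerate every binary rule A_p -> A_q A_r and unary rule A_p -> b_m."""
    binary = list(product(range(n), repeat=3))    # n^3 triples (p, q, r)
    unary = list(product(range(n), range(t)))     # n*t pairs (p, m)
    return binary, unary


def random_init(n, t, seed=0):
    """Random initial probabilities, normalised separately for each A_p."""
    rng = random.Random(seed)
    B, U = {}, {}
    for p in range(n):
        weights = [rng.random() for _ in range(n * n + t)]
        total = sum(weights)
        probs = iter(w / total for w in weights)
        for q in range(n):
            for r in range(n):
                B[p, q, r] = next(probs)
        for m in range(t):
            U[p, m] = next(probs)
    return B, U


for n, t in [(5, 2), (15, 48)]:
    binary, unary = all_cnf_rules(n, t)
    print(n, t, len(binary) + len(unary))
```

For `n=5, t=2` this gives 135 rules and for `n=15, t=48` it gives 4095, matching the initial grammar sizes quoted on the experiment slides.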

  7. The Inside-Outside Algorithm
     • Definition of inner (e) and outer (f) probabilities
     • [Figure: the inner probability of nonterminal i covers words s..t; the outer probability covers the rest of the sentence, words 1..s-1 and t+1..T, derived from S around i]
     • Special thanks to ohwoog
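
The figure on this slide did not survive extraction. In the usual formulation of the algorithm, and assuming the slide's indices (nonterminal $i$, word positions $s..t$, sentence length $T$, start symbol $S$), the two quantities would read as follows; this is a reconstruction under those assumptions, not a copy of the slide.

```latex
% Inner probability: nonterminal A_i derives the words w_s ... w_t
e(s, t, i) = P\left(A_i \Rightarrow^{*} w_s \cdots w_t \mid G\right)

% Outer probability: S derives everything outside s..t, with A_i left in the gap
f(s, t, i) = P\left(S \Rightarrow^{*} w_1 \cdots w_{s-1} \, A_i \, w_{t+1} \cdots w_T \mid G\right)
```

With these definitions, the product $e(s,t,i)\,f(s,t,i)$ is the probability that $S$ derives the whole sentence using $A_i$ over the words $s..t$.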

  8. The Extended Algorithm
     • Compatible function: 1 if a span is valid for the sentence's bracketing, 0 otherwise
     • Extended algorithm
       • See Table 1
       • Inside probabilities: (1), (2); (2) uses the compatible function
       • Outside probabilities: (3), (4); (4) uses the compatible function
       • Parameter reestimation: (5), (6); same as in the original algorithm
     • Stopping criterion
       • When the change in the cross-entropy estimate becomes negligible
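
A minimal sketch of the inside pass with the compatible function applied, assuming the 0-based span convention of slide 5. The dictionary layout for `B` and `U`, the function names, and the toy grammar are assumptions for illustration; Table 1 of the paper remains the authoritative statement of the equations.

```python
from collections import defaultdict


def overlaps(a, b):
    """True when spans a and b cross each other."""
    (i, j), (k, l) = a, b
    return i < k < j < l or k < i < l < j


def inside(words, B, U, n, bracketing=()):
    """Return I with I[p, i, k] = P(A_p =>* w_{i+1} .. w_k).

    B maps (p, q, r) to P(A_p -> A_q A_r); U maps (p, word) to P(A_p -> word).
    Spans that are not valid for `bracketing` keep probability 0, which is
    the effect of multiplying by the compatible function.
    """
    T = len(words)
    I = defaultdict(float)
    # Base case: spans of length 1 come straight from the unary rules.
    for i in range(1, T + 1):
        for p in range(n):
            I[p, i - 1, i] = U.get((p, words[i - 1]), 0.0)
    # Longer spans, shortest first, skipping spans that cross a bracket.
    for length in range(2, T + 1):
        for i in range(0, T - length + 1):
            k = i + length
            if any(overlaps((i, k), b) for b in bracketing):
                continue
            for (p, q, r), prob in B.items():
                total = 0.0
                for j in range(i + 1, k):
                    total += I[q, i, j] * I[r, j, k]
                I[p, i, k] += prob * total
    return I


# Toy grammar (hypothetical): S -> S S (0.5), S -> a (0.5), with S = A_0.
B = {(0, 0, 0): 0.5}
U = {(0, "a"): 0.5}
sentence = ["a", "a", "a"]

print(inside(sentence, B, U, n=1)[0, 0, 3])                       # both parses
print(inside(sentence, B, U, n=1, bracketing=[(0, 2)])[0, 0, 3])  # one parse
```

In the toy run, bracketing the first two words rules out the right-branching parse, so the sentence's inside probability drops from 0.0625 to 0.03125.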

  9. Complexity
     • Complexity of the original algorithm: $O(|w|^3)$ for each sentence
       • Computing the inside probabilities, computing the outside probabilities, and reestimating the rule probabilities each take $O(|w|^3)$ per sentence
     • Complexity of the extended algorithm: $O(|w|)$ in the best case
       • In the case of a full binary bracketing $B$ of a string $w$
         • $O(|w|)$ spans in $B$
         • Only one split point for each $(i, k)$
         • Each valid span must be a member of $B$
     • Preprocessing
       • Enumerating valid spans and split points
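
The best-case claim can be checked on a small example. The sketch below is a brute-force version of the preprocessing step: it enumerates valid spans and split points by testing every candidate, which is itself not linear time and is meant only to show the counts. The function names and the right-branching bracketing are assumptions for illustration.

```python
def overlaps(a, b):
    """True when spans a and b cross each other."""
    (i, j), (k, l) = a, b
    return i < k < j < l or k < i < l < j


def valid_spans(T, bracketing):
    """All spans (i, k), 0 <= i < k <= T, that cross no span of the bracketing."""
    return {(i, k)
            for i in range(T)
            for k in range(i + 1, T + 1)
            if not any(overlaps((i, k), b) for b in bracketing)}


def split_points(span, valid):
    """Positions j that split `span` into two valid sub-spans."""
    i, k = span
    return [j for j in range(i + 1, k) if (i, j) in valid and (j, k) in valid]


T = 5
# Full binary (right-branching) bracketing of a 5-word sentence.
full = {(0, 5), (1, 5), (2, 5), (3, 5)}

unconstrained = valid_spans(T, set())
constrained = valid_spans(T, full)

print(len(unconstrained))               # T(T+1)/2 spans without brackets
print(len(constrained))                 # 2T-1 spans with a full bracketing
print(split_points((0, 5), constrained))
```

For five words the unconstrained chart has 15 spans, while the fully bracketed one has 9 (the four brackets plus five single-word spans), and the top span has the single split point 1.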

  10. Experimental Evaluation
      • Two experiments
        • Artificial language: palindromes
        • Natural language: Penn Treebank
      • Evaluation
        • Bracketing accuracy
          • Proportion of phrases in an analysis that are compatible with the reference bracketing
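
Bracketing accuracy as described here can be sketched as the fraction of predicted phrases that do not cross any reference bracket. The function name, the input layout (one list of spans per sentence), and the predicted spans in the example are assumptions for illustration only.

```python
def overlaps(a, b):
    """True when spans a and b cross each other."""
    (i, j), (k, l) = a, b
    return i < k < j < l or k < i < l < j


def bracketing_accuracy(predicted, reference):
    """Proportion of predicted spans compatible with the reference bracketing.

    `predicted` and `reference` are parallel lists with one collection of
    (i, j) spans per sentence.
    """
    total = 0
    compatible = 0
    for spans, ref in zip(predicted, reference):
        for span in spans:
            total += 1
            if not any(overlaps(span, r) for r in ref):
                compatible += 1
    return compatible / total if total else 0.0


# Reference: the bracketed ATIS sentence from slide 5 (hand transcription).
reference = [{(0, 8), (0, 7), (1, 7), (3, 7), (4, 7), (4, 5), (5, 7)}]
# Hypothetical parser output: three compatible phrases and one crossing one.
predicted = [[(0, 8), (1, 3), (2, 4), (4, 7)]]

print(bracketing_accuracy(predicted, reference))
```

Here three of the four predicted phrases are compatible, so the score is 0.75.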

  11. Inferring the Palindrome Language
      • $L = \{ww^R \mid w \in \{a, b\}^*\}$
      • Initial grammar: 135 rules ($= 5^3 + 5 \cdot 2$)
      • Training with 100 sentences
      • Inferred grammar: a correct grammar for the palindrome language
      • Bracketing accuracy: above 90% (100% in several cases)
      • With training on unbracketed text: 15% - 69%
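
To make the target language concrete, the sketch below samples strings of the form $ww^R$ and checks membership. The sampler, its length limit, and the function names are assumptions for illustration; the slides do not say how the 100 training sentences were generated.

```python
import random


def sample_palindrome(rng, max_half=5):
    """Draw w over {a, b} with 1 <= |w| <= max_half and return w followed by its reverse."""
    half = [rng.choice("ab") for _ in range(rng.randint(1, max_half))]
    return half + half[::-1]


def in_language(symbols):
    """Membership in L: an even-length palindrome over {a, b}."""
    return (len(symbols) % 2 == 0
            and all(s in "ab" for s in symbols)
            and symbols == symbols[::-1])


rng = random.Random(0)
training = [sample_palindrome(rng) for _ in range(100)]
print(all(in_language(s) for s in training))
```

Every even-length palindrome can be split into a first half $w$ and its reverse, which is why the membership test only needs the length and symmetry checks.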

  12. Experiments on the ATIS Corpus
      • ATIS (Air Travel Information System) corpus: 770 sentences (7812 words)
        • 700 training sentences, 70 test sentences (901 words)
      • Initial grammar: 4095 rules ($= 15^3 + 15 \cdot 48$)
        • 15 nonterminals, 48 terminal symbols (POS tags)
      • Bracketing accuracy: 90.36% after 75 iterations
        • With training on unbracketed text: 37.35%
      • Example (A)
        • (Delta flight number): not compatible
        • (the cheapest): linguistically wrong, owing to lack of information
        • 16 incompatible constituents with $G_R$ (the grammar trained on unbracketed text)
      • Example (B)
        • Fully compatible
        • 9 incompatible constituents with $G_R$

  13. Conclusions and Further Work
      • The use of a partially bracketed corpus can
        • reduce the number of iterations needed for convergence
        • find a good solution
        • infer grammars that specify linguistically reasonable constituent boundaries
        • reduce time complexity (linear in the best case)
      • Further extensions
        • Determining the sensitivity to
          • the initial probability assignments
          • the training corpus
          • missing or misplaced brackets
        • Larger terminal vocabularies
