
Improving Statistical Parsing Using Cross-Corpus Data


Presentation Transcript


  1. Improving Statistical Parsing Using Cross-Corpus Data Xiaoqiang Luo IBM T.J. Watson Research Center (joint work with Min Tang of MIT)

  2. NLP Technologies
  • Statistical parsing
    • Natural language understanding in spoken dialog systems
  • Information extraction and translingual question answering
    • Automatic extraction of entities and relations from text
  • Statistical machine translation
    • Chinese => English, Arabic => English
  • Cross-lingual search
    • Topic detection and tracking
    • Text categorization
    • Multilingual and translingual taxonomies
  • Audio-Indexing
    • Combine speech recognition and search (mono- and cross-lingual)

  3. Content • Motivation • Cross Domain Data • EM Algorithm • Experiments • Future Work

  4. Impact of Training Data

  5. Unsupervised: related work
  • Charniak '97
  • Blum & Mitchell '98: co-training
  • WS02: co-training
  • McCallum and Nigam '98: document classification
  • What we did:
    • Unsupervised adaptation: ASRU '99, ICASSP '00
    • Active learning (ACL '02)

  6. Goal
  • Active learning: select what to annotate
  • This work: make use of cross-domain (cross-corpus) data -- labeled, but for another purpose

  7. Content • Motivation • Cross Domain Data • EM Algorithm • Experiments • Future Work

  8. Cross Domain/Corpus Data • Claim: cross-domain data provides some, but NOT all, information

  9. AP Treebank (parse-tree figure)

  10. AP Treebank – WSJ Style

  11. AP vs. WSJ UPenn TB • Cross-bracketing: (table)

  12. PKU POS data • 1M words freely available (of 50M words total) from Beijing University

  13. PKU -> UPenn Mapping
  PKU: 在_p 这_r 辞旧迎新_l 的_u 美好_a 时刻_n ，_w 我_r …
  UPenn: 在_P 这_BNDRY 辞旧迎新_RM 的_BNDRY 美好_VA 时刻_NN ，_PU 我_BNDRY …
  English gloss: at this goodbye-old-welcome-new 's beautiful moment , I … [at this beautiful moment when we say good-bye to the old year and welcome the new year, I …]
  • Mapping:
    • Map 1-1 and m-1 tags directly
    • Frequent 1-n: use limited context; otherwise leave untagged
    • m-n: keep word boundary, leave untagged
    • Style difference: drop word boundary
  • Result:
    • 93% of words get UPenn tags
    • 6% of words: boundary kept only
    • 1%: no tag, no word boundary
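
  A minimal sketch of the mapping logic described on this slide. The tables are illustrative fragments drawn from slides 13 and 39, not the full mapping used in the talk; the context test in map_contextual is an assumption.

```python
# Sketch of the PKU -> UPenn POS mapping. DIRECT_MAP holds a few
# illustrative 1-1 and m-1 entries; the real tables are larger.

DIRECT_MAP = {               # 1-1 and m-1 tags: always safe to map
    "p": "P",                # preposition
    "n": "NN", "vn": "NN",   # m-1: both n and vn map to NN
    "w": "PU",               # punctuation
    "a": "VA",               # predicate adjective
}

def map_contextual(word, tag, next_tag):
    """Frequent 1-n tags, resolved with limited context.
    Example from slide 39: 的/u -> DEG or DEC (context dependent).
    The context test below is a placeholder assumption."""
    if tag == "u" and word == "的":
        if next_tag is None:
            return None      # not enough context: leave untagged
        return "DEC" if next_tag == "v" else "DEG"
    return None

def map_token(word, tag, next_tag=None):
    """Map one PKU token to a UPenn tag, or keep only the word
    boundary (BNDRY) when no reliable mapping exists (m-n cases)."""
    if tag in DIRECT_MAP:
        return word, DIRECT_MAP[tag]
    mapped = map_contextual(word, tag, next_tag)
    return word, (mapped if mapped else "BNDRY")
```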

  14. Utilize Cross Domain Data
  • Existing information
    • Convert into appropriate format
    • Properties: granularity, reliability, etc.
  • Missing information
    • EM algorithm

  15. Content • Motivation • Cross Domain Data • EM Algorithm • Experiments • Discussion & Future Work

  16. Definitions
  • Incomplete data (partial parse trees): t_p ∈ T_p
  • Complete data (full parse trees): t ∈ T, where t = <t_m, t_p> and t_m is the missing part
  • F : T → T_p, where F(t) = t_p, is a many-to-one relation
  • P(t): distribution on T for a given sentence
  • P(t_p): P(t) induced on T_p

  17. (Figure: a parse tree; solid lines show t_p, dashed lines show the missing part t_m)

  18. Algorithm • Find θ that maximizes P_θ(t_p) given t_p • θ_0 initialized by "seed" data
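
  The update formula itself did not survive transcription; the LaTeX below restates the standard EM formulation implied by the definitions on slide 16 (a reconstruction, not the original slide's formula).

```latex
% EM objective implied by slides 16 and 18. F maps full trees t to
% partial trees t_p, so the observed-data likelihood marginalizes
% over the missing part t_m:
\hat{\theta} = \arg\max_{\theta} P_{\theta}(t_p)
             = \arg\max_{\theta} \sum_{t : F(t) = t_p} P_{\theta}(t)

% EM iteration from \theta_0 (the "seed" model): maximize the expected
% complete-data log-likelihood under the current posterior.
\theta_{k+1} = \arg\max_{\theta}
    \sum_{t : F(t) = t_p} P_{\theta_k}(t \mid t_p) \, \log P_{\theta}(t)
```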

  19. Implementation
  • Constrained decoding
    • Treat partial tree labels (t_p) as constraints
    • Find missing labels (t_m) consistent with t_p
  • Pruning: top-N training
  • Speed-up of the decoder: 2x~16x
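
  As a rough illustration, here is a minimal beam-search sketch of constrained decoding. The parser object and its methods (initial_state, valid_actions, apply, score, tree) are hypothetical stand-ins; the slides do not show the MaxEnt parser's internals.

```python
def constrained_decode(sentence, partial_tree, parser, top_n=10):
    """Return up to top_n full parses consistent with the partial tree.

    partial_tree carries the known labels (t_p); hypotheses that
    contradict any fixed label are pruned from the search.
    """
    beam = [parser.initial_state(sentence)]
    while not all(s.is_complete() for s in beam):
        candidates = []
        for state in beam:
            if state.is_complete():
                candidates.append(state)
                continue
            for action in parser.valid_actions(state):
                new_state = state.apply(action)
                # Constraint check: keep only hypotheses consistent
                # with the labels already fixed by t_p.
                if partial_tree.consistent_with(new_state):
                    candidates.append(new_state)
        # Beam pruning: keep the top-N highest-scoring hypotheses.
        beam = sorted(candidates, key=lambda s: s.score, reverse=True)[:top_n]
    return [(s.tree(), s.score) for s in beam]
```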

  20. The Recipe (flow diagram): Cross Domain Data → Pre-processing → Partial Trees → Constrained Decode (with current Model) → Full Trees → Update Model → repeat
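
  In code form, the recipe is an EM-style loop; this sketch reuses constrained_decode from above, and preprocess/retrain are hypothetical stand-ins for the Pre-processing and Update Model boxes on the slide.

```python
def train_with_cross_corpus_data(seed_model, cross_corpus, n_iters=5):
    """Iteratively grow the parser with partially labeled cross-corpus data."""
    model = seed_model  # theta_0: trained on in-domain "seed" data
    # Pre-processing: convert cross-corpus annotations to partial trees.
    partial = [preprocess(example) for example in cross_corpus]
    for _ in range(n_iters):
        completed = []
        for sentence, t_p in partial:
            # Constrained decode: complete each partial tree t_p into
            # scored full-tree hypotheses consistent with its constraints.
            completed.extend(constrained_decode(sentence, t_p, model))
        # Update model: retrain on the scored full trees.
        model = retrain(model, completed)
    return model
```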

  21. Content • Motivation • Cross Domain Data • EM Algorithm • Experiments • Discussion & Future Work

  22. Experiments Setup
  • MaxEnt parser (Ratnaparkhi '97)
  • Chinese:
    • UPenn (100K + 120K) Treebank
    • PKU (1M): POS
  • English:
    • UPenn (1M) Treebank
    • AP treebank (1M)

  23. Experiment Settings (table; * improved baseline, ** in-domain)

  24. EE-1: amount of supervision

  25. CE-2: with PKU data

  26. CE-2: Relative Error Reduction (% relative error reduction before/after adding 100K words of PKU POS data)

  27. CE-2: PKU data • Lots of partially labeled data helped the 100K model a little

  28. EE-2
  • AP data: use all brackets, or only brackets of the highest constituent
  • Results:
    • Not helpful to the small model
    • Hurt performance if the initial model is well trained
  • Reason:
    • Information is under-used
    • Style differences: some constraints are wrong

  29. Semi-supervised training
  • Cross-domain data
  • Noisy decoding output as training data
  • Training with noisy data
  • Constrain model -- parameter tying

  30. Parameter Tying
  • Decoding results are noisy
  • Constrain the model:
    • Features are classified: f_i ∈ C_j
    • Parameter: p_i' = p_i + d_j for all f_i ∈ C_j
  • Idea: change p_i to p_i' only if evidence is strong
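
  A minimal sketch of this tying scheme, assuming features are pre-assigned to classes; the pooled-evidence update and the strong-evidence threshold below are illustrative assumptions, not the talk's exact estimator.

```python
# Parameter tying: features f_i in the same class C_j share one offset
# d_j, so noisy EM data can only move parameters class-by-class.
from collections import defaultdict

def estimate_class_offsets(evidence, feature_class, min_count=50, step=0.1):
    """Pool per-feature evidence (e.g. gradients from the noisy EM data)
    within each class; a class gets a nonzero offset d_j only if it has
    enough pooled evidence ("change p_i only if evidence is strong")."""
    pooled = defaultdict(float)
    counts = defaultdict(int)
    for f, g in evidence.items():
        c = feature_class[f]
        pooled[c] += g
        counts[c] += 1
    return {c: (step * pooled[c] / counts[c] if counts[c] >= min_count else 0.0)
            for c in pooled}

def apply_tied_update(params, feature_class, offsets):
    """p_i' = p_i + d_j for every feature f_i in class C_j."""
    return {f: p + offsets[feature_class[f]] for f, p in params.items()}
```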

  31. Preliminary Result • Baseline: 200K-word character parser • EM data: Chinese NE data

  32. Result Summary
  • Semi-supervised learning is most helpful when the initial model is insufficiently trained
  • Useful in the early stage of system development

  33. Content • Motivation • Cross Domain Data • EM Algorithm • Experiments • Future Work

  34. Future Work
  • More on constraining the model
  • Induce features
    • Cross-domain data: new features
  • Sample selection
  • Voting (multiple models)
  • Train on partial trees

  35. Acknowledgements
  • Todd Ward (AP data)
  • Fei Xia (PKU -> UPenn mapping)
  • Brian (Chinese NE data)
  • Salim and Todd: ideas, discussions

  36. The End

  37. End of Presentation

  38. Syntactic Parsing Problem

  39. PKU POS data (1M words)
  PKU -> UPenn POS mapping (with the help of Fei Xia):
  -- most tags are 1-1
  -- m-1: vn, n -> NN
  -- 1-m: 的/u -> DEG/DEC (context dependent)
  -- m-n: r -> DT, PN; Rg -> DT, PN
  Other issues:
  -- Word segmentation style: "lname fname" (two tokens) vs. "lnamefname" (one token)

  40. Named Entity Data

  41. Recipe: an example
  PKU: 输入/v 中文/n 是/v 轻而易举/i 的/u 事情/n 。/b
  UPenn: 输入/VV 中文/NN 是/VC 轻而易举/VA 的/DEG 事情/NN 。/PU
  (English: "Entering Chinese is an easy matter.")
  Char: [VV 输_vvb 入_vve ] [NN 中_nnb 文_nne ] …
  Decode: 0.7 [IP [IP [VP [VV 输_vvb 入_vve VV] [NN 中_nnb 文_nne NN] VP] IP] …
          0.3 [IP [NP [VV 输_vvb 入_vve VV] [NN 中_nnb 文_nne NN] NP] …
  Retraining!
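
  The 0.7/0.3 scores above suggest how retraining weights each decoded completion. This sketch shows one way to turn scored decodes into fractional training counts; tree.events() is a hypothetical accessor, and the weighting scheme is an assumption, not the talk's exact MaxEnt update.

```python
from collections import Counter

def weighted_event_counts(decoded):
    """decoded: list of (tree, score) pairs for one sentence, as returned
    by the constrained decoder (e.g. scores 0.7 and 0.3 above)."""
    z = sum(score for _, score in decoded)
    counts = Counter()
    for tree, score in decoded:
        weight = score / z           # normalized posterior weight
        for event in tree.events():  # parser events read off the tree
            counts[event] += weight
    return counts  # fed to the trainer as fractional counts for retraining
```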

  42. Experiment Settings (table; * no chunk labels, † include chunk labels)

  43. CE-2: Use Raw Text

  44. CE-2: Use NE Information

  45. CE-2: Use NE and Word Model
  • NE information alone does not help (so far)
  • Word sense information is important (as shown in CE-1)
  • 1.1% relative improvement with tags from a word model

  46. Improvement of Baseline (table) * Chinese char parser: 5-10% relative

  47. Constrained Decode on Cross Domain Data * Results on the portion of the data that is decodable (66.7%) using beam width 500. † Includes chunk labels.

  48. WSJ and AP Treebanks
  • Similarities:
    • WSJ: 23.8 words/sentence, depth 9, 76.6k PPs
    • AP: 23.7 words/sentence, depth 8, 68.9k PPs
  • Differences:
    • WSJ: 26 labels / 36 tags; ADVP; [NP -> NP + PP]; 17.3k [V+S] or [V+SBAR]'s; 22.0k [S -> VP]
    • AP: 46 labels / 222 tags; Fn/Fr/S; Ti/Tg/Tn; 7.3k [V+Fn] or [V+S]'s
  • Word-level differences: "upper", "regarding", etc.
