
Dependency Parsing: Machine Learning Approaches






Presentation Transcript


  1. January 7, 2008. Dependency Parsing: Machine Learning Approaches. Yuji Matsumoto, Graduate School of Information Science, Nara Institute of Science and Technology (NAIST, Japan)

  2. Basic Language Analyses (POS tagging, phrase chunking, parsing)
Raw sentence: He reckons the current account deficit will narrow to only 1.8 billion in September .
Part-of-speech tagging → POS-tagged sentence:
He/PRP reckons/VBZ the/DT current/JJ account/NN deficit/NN will/MD narrow/VB to/TO only/RB 1.8/CD billion/CD in/IN September/NNP ./.
Base phrase chunking → base-phrase-chunked sentence:
[He]NP [reckons]VP [the current account deficit]NP [will narrow]VP [to]PP [only 1.8 billion]NP [in]PP [September]NP .
Dependency parsing → dependency-parsed sentence (arcs between the segments).

  3. Word Dependency Parsing (unlabeled)
Raw sentence: He reckons the current account deficit will narrow to only 1.8 billion in September .
Part-of-speech tagging → POS-tagged sentence:
He/PRP reckons/VBZ the/DT current/JJ account/NN deficit/NN will/MD narrow/VB to/TO only/RB 1.8/CD billion/CD in/IN September/NNP ./.
Word-dependency-parsed sentence: the same words connected by unlabeled dependency arcs.

  4. Word Dependency Parsing (labeled)
Raw sentence: He reckons the current account deficit will narrow to only 1.8 billion in September .
Part-of-speech tagging → POS-tagged sentence (tags as above).
Word-dependency-parsed sentence: the same words connected by dependency arcs carrying labels such as MOD, COMP, SUBJ, SPEC, S-COMP, and ROOT.

  5. A phrase structure tree and a dependency tree (figure)

  6. Flattened representation of a dependency tree (figure)

  7. Dependency structure: terminology
• Child / Dependent / Modifier: the word at the lower end of the relation
• Parent / Governor / Head: the word it depends on
• The arc label (e.g., SUBJ) names the relation
• The direction of arrows may be drawn from head to child
• When there is an arrow from w to v, we write w→v
• When there is a path (a series of arrows) from w to v, we write w→*v

  8. Definition of Dependency Trees
• Single head: except for the root (EOS), every word has exactly one parent
• Connected: the structure must be a connected tree
• Acyclic: if wi→wj, then it is never the case that wj→*wi
• Projective: if wi→wj, then for every k between i and j, either wk→*wi or wk→*wj holds (no crossing between dependencies)
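The conditions above can be checked mechanically. Below is a minimal sketch (not from the lecture): heads are encoded as a list of parent indices, and projectivity is tested via the equivalent "no two arcs cross" formulation.

```python
def is_tree(heads):
    """Single-head / single-root / acyclic check.
    heads[i] is the parent index of word i, or -1 for the root."""
    if sum(1 for h in heads if h == -1) != 1:
        return False                      # exactly one root allowed
    for i in range(len(heads)):
        seen, j = set(), i
        while heads[j] != -1:             # walk up toward the root
            if j in seen:
                return False              # cycle: wi -> wj and wj ->* wi
            seen.add(j)
            j = heads[j]
    return True

def is_projective(heads):
    """Projectivity is equivalent to 'no two dependency arcs cross'."""
    arcs = [(min(i, h), max(i, h)) for i, h in enumerate(heads) if h >= 0]
    for l1, r1 in arcs:
        for l2, r2 in arcs:
            if l1 < l2 < r1 < r2:         # spans overlap without nesting
                return False
    return True
```

For the sentence on slide 10 ("John saw a man yesterday who walked along the river", with "walked" attached to "man"), the man–walked arc crosses the saw–yesterday arc, so `is_projective` returns False.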

  9. Projective dependency tree (figure)
Projectiveness: every word between the two ends of an arc ultimately depends on one of them; in the figure's sentence, each word in between depends on either "was" or "." (e.g., light →* was).

  10. Non-projective dependency tree (figure)
Example: John/NNP saw/VBD a/DT man/NN yesterday/NN who/WP walked/VBD along/IN the/DT river/NN
Direction of edges: from a child to its parent.

  11. Non-projective dependency tree (figure)
Direction of edges: from a parent to its children.
*taken from: R. McDonald and F. Pereira, "Online Learning of Approximate Dependency Parsing Algorithms," European Chapter of the Association for Computational Linguistics, 2006.

  12. Two Different Strategies for Structured Language Analysis
• Sentences have structure
  • Linear sequences: POS tagging, phrase/named-entity chunking
  • Tree structures: phrase structure trees, dependency trees
• Two statistical approaches to structure analysis
  • Global optimization
    • e.g., Hidden Markov Models and Conditional Random Fields for sequential tagging problems
    • Probabilistic context-free parsing
    • Maximum spanning tree parsing (graph-based)
  • Repetition of local optimization
    • Chunking with Support Vector Machines
    • Deterministic parsing (transition-based)

  13. Statistical dependency parsers
• Global optimization:
  • Eisner (COLING 96, Penn Technical Report 96)
  • McDonald, Crammer & Pereira (ACL 05a, EMNLP 05b, EACL 06)
• Repetition of local optimization:
  • Kudo & Matsumoto (VLC 00, CoNLL 02)
  • Yamada & Matsumoto (IWPT 03)
  • Nivre (IWPT 03, COLING 04, ACL 05)
  • Cheng, Asahara & Matsumoto (IJCNLP 04)

  14. Dependency Parsing as the CoNLL Shared Task
• CoNLL (Conference on Computational Natural Language Learning)
• Multilingual Dependency Parsing Track
  • 10 languages: Arabic, Basque, Catalan, Chinese, Czech, English, Greek, Hungarian, Italian, Turkish
• Domain Adaptation Track
  • Dependency-annotated data in one domain and large unannotated data in other domains (biomedical/chemical abstracts, parent-child dialogue) are available
  • Objective: use large-scale unannotated target-domain data to enhance a dependency parser learned in the original domain so that it works well in the new domain
Nivre, J., Hall, J., Kübler, S., McDonald, R., Nilsson, J., Riedel, S., Yuret, D., "The CoNLL 2007 Shared Task on Dependency Parsing," Proceedings of EMNLP-CoNLL 2007, pp. 915-932, June 2007.

  15. Statistical dependency parsers (to be introduced in this lecture)
• Kudo & Matsumoto (VLC 00, CoNLL 02): Japanese
• Yamada & Matsumoto (IWPT 03)
• Nivre (IWPT 03, COLING 04, ACL 05)
• McDonald, Crammer & Pereira (EMNLP 05a, ACL 05b, EACL 06)
Most of them (except [Nivre 05] and [McDonald 05a]) assume projective dependency parsing.

  16. Japanese Syntactic Dependency Analysis
• Analysis of the relationships between phrasal units ("bunsetsu" segments)
• Two constraints:
  • Each segment modifies one of the segments to its right (Japanese is a head-final language)
  • Dependencies do not cross one another (projectiveness)

  17. An Example of Japanese Syntactic Dependency Analysis
Raw text: 私は彼女と京都に行きます (I go to Kyoto with her.)
Morphological analysis and bunsetsu chunking: 私は / 彼女と / 京都に / 行きます (I / with her / to Kyoto / go)
Dependency analysis: 私は / 彼女と / 京都に / 行きます, with dependency arcs among the segments

  18. Model 1: Probabilistic Model [Kudo & Matsumoto 00]
Input: 私は1 / 彼女と2 / 京都に3 / 行きます4 (I-top / with her / to Kyoto-loc / go)
1. Build a dependency matrix using ME, DT, or SVMs (how probable it is that one segment modifies another):
   modifier 1 → modifiee 2: 0.1, modifiee 3: 0.2, modifiee 4: 0.7
   modifier 2 → modifiee 3: 0.2, modifiee 4: 0.8
   modifier 3 → modifiee 4: 1.0
2. Search for the optimal dependencies that maximize the sentence probability, using CYK or chart parsing.
Output: 私は1 / 彼女と2 / 京都に3 / 行きます4 with the selected dependencies
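For illustration only, here is a brute-force stand-in for the CYK/chart search in step 2: it enumerates every head-final assignment, discards crossing (non-projective) ones, and keeps the assignment with the highest product of matrix probabilities. The real search does this in O(n^3) time; this exponential sketch just makes the objective explicit.

```python
from itertools import product

def best_parse(probs, n):
    """probs[(i, j)] = P(segment i modifies segment j), i < j;
    segment n is the sentence-final head.  Returns (heads, probability)."""
    best, best_p = None, -1.0
    # each segment 1 .. n-1 picks a head somewhere to its right (head-final)
    for heads in product(*[range(i + 1, n + 1) for i in range(1, n)]):
        arcs = list(enumerate(heads, start=1))            # (modifier, head)
        if any(i < k < j < l for i, j in arcs for k, l in arcs):
            continue                                      # crossing arcs: skip
        p = 1.0
        for i, j in arcs:
            p *= probs.get((i, j), 0.0)
        if p > best_p:
            best, best_p = dict(arcs), p
    return best, best_p
```

On the matrix from this slide, the winner is the assignment 1→4, 2→4, 3→4 with probability 0.7 × 0.8 × 1.0 = 0.56.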

  19. Problems of the Probabilistic Model (1)
• Selection of training examples: all pairs of segments in a sentence
  • Depending pairs → positive examples
  • Non-depending pairs → negative examples
  • This produces a total of n(n-1)/2 training examples per sentence (n is the number of segments in the sentence)
• In Model 1:
  • All positive and negative examples are used to train an SVM
  • For a test example, its distance from the separating hyperplane is transformed into a pseudo-probability using the sigmoid function
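The margin-to-probability step can be sketched as follows; the slope and offset parameters here are illustrative assumptions (in practice such parameters are fit on held-out data, as in Platt scaling).

```python
import math

def svm_margin_to_probability(decision_value, a=1.0, b=0.0):
    """Map an SVM decision value (signed distance from the separating
    hyperplane) to a pseudo-probability with the sigmoid function.
    The slope `a` and offset `b` are illustrative defaults."""
    return 1.0 / (1.0 + math.exp(-(a * decision_value + b)))
```

A point on the hyperplane (decision value 0) maps to 0.5; large positive margins approach 1 and large negative margins approach 0.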

  20. Problems of the Probabilistic Model (2)
• The set of training examples is large
• O(n^3) time is necessary for complete parsing
• The classification cost of an SVM is much higher than that of other ML algorithms such as maximum entropy models and decision trees

  21. Model 2: Cascaded Chunking Model [Kudo & Matsumoto 02]
• Parse a sentence deterministically, deciding only whether the current segment modifies the segment immediately to its right
• Training examples are extracted using the same parsing algorithm

  22. Example: Training Phase
Annotated sentence: 彼は1 彼女の2 温かい3 真心に4 感動した。5
(He / her / warm / heart / was moved: "He was moved by her warm heart.")
The sentence is processed in passes; in each pass, every adjacent pair of remaining segments is tagged D (the left segment depends on the right one) or O (it does not), and decided segments are removed:
  彼は1 彼女の2 温かい3 真心に4 感動した。5 → 温かい3 attaches to 真心に4 and is removed
  彼は1 彼女の2 真心に4 感動した。5 → 彼女の2 attaches to 真心に4
  彼は1 真心に4 感動した。5 → 真心に4 attaches to 感動した。5
  彼は1 感動した。5 → D: 彼は1 attaches to 感動した。5
Pairs of tag (D or O) and context features are stored as training data for SVMs; SVM learning runs after accumulation.

  23. Example: Test Phase
Test sentence: 彼は1 彼女の2 温かい3 真心に4 感動した。5
(He / her / warm / heart / was moved: "He was moved by her warm heart.")
The same pass-by-pass procedure is applied, but each tag is now decided by the SVMs built in the training phase:
  彼は1 彼女の2 温かい3 真心に4 感動した。5 → 温かい3 attaches to 真心に4
  彼は1 彼女の2 真心に4 感動した。5 → 彼女の2 attaches to 真心に4
  彼は1 真心に4 感動した。5 → 真心に4 attaches to 感動した。5
  彼は1 感動した。5 → 彼は1 attaches to 感動した。5
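The pass-by-pass procedure of slides 21-23 can be sketched as follows. This is a minimal sketch, not the authors' implementation: `modifies` stands in for the trained SVM (True corresponds to tag D), and, as an assumption consistent with the slide example, a segment tagged D is attached and removed only once the pair to its left is tagged O, so that no remaining segment can still depend on it.

```python
def cascaded_chunking_parse(segments, modifies):
    """Deterministic cascaded-chunking parse (head-final, projective).
    segments: list of segment ids, left to right.
    modifies(left, right): stand-in for the SVM; True means tag D.
    Returns {modifier: head}."""
    heads = {}
    remaining = list(segments)
    while len(remaining) > 1:
        tags = [modifies(remaining[i], remaining[i + 1])
                for i in range(len(remaining) - 1)]          # True = "D"
        keep = []
        for i, seg in enumerate(remaining[:-1]):
            # attach and remove a "D" segment only when the pair on its
            # left is "O" (or it is leftmost), so nothing still depends on it
            if tags[i] and (i == 0 or not tags[i - 1]):
                heads[seg] = remaining[i + 1]
            else:
                keep.append(seg)
        keep.append(remaining[-1])
        if len(keep) == len(remaining):      # guarantee progress anyway
            heads[remaining[-2]] = remaining[-1]
            keep.remove(remaining[-2])
        remaining = keep
    return heads
```

With an oracle classifier built from the gold tree of the slide example, the parser reproduces exactly the five-, four-, three-, and two-segment passes shown above.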

  24. Advantages of the Cascaded Chunking Model
• Efficiency
  • O(n^3) (probabilistic model) vs. O(n^2) (cascaded chunking model)
  • In practice below O(n^2), since most segments modify the segment immediately to their right
  • The set of training examples is much smaller
• Independence from ML methods
  • Can be combined with any ML algorithm that works as a binary classifier
  • Probabilities of dependency are not necessary

  25. Features used in the implementation
Example: 彼の1 友人は2 この本を3 持っている4 女性を5 探している6
(His / friend-top / this book-acc / have / lady-acc / be looking for: "His friend is looking for a lady who has this book.")
For a candidate modifier/head pair ("modify or not?"):
• Static features
  • Of the modifier and the modifiee: head word and functional word (surface form, POS, POS subcategory, inflection type, inflection form), brackets, quotations, punctuation, ...
  • Between the segments: distance, case particles, brackets, quotations, punctuation
• Dynamic features (segments A, B, C in the figure)
  • A, B: static features of the functional word
  • C: static features of the head word

  26. Settings of Experiments
• Kyoto University Corpus 2.0/3.0
  • Standard data set
    • Training: 7,958 sentences / Test: 1,246 sentences
    • Same data as [Uchimoto et al. 98] and [Kudo & Matsumoto 00]
  • Large data set
    • 2-fold cross-validation using all 38,383 sentences
• Kernel function: 3rd-degree polynomial
• Evaluation measures: dependency accuracy and sentence accuracy

  27. Results

  Data set                    Standard (8,000 sentences)      Large (20,000 sentences)
  Model                       Cascaded    Probabilistic       Cascaded    Probabilistic
  Dependency acc. (%)         89.29       89.09               90.45       N/A
  Sentence acc. (%)           47.53       46.17               53.16       N/A
  # training sentences        7,956       7,956               19,191      19,191
  # training examples         110,355     459,105             251,254     1,074,316
  Training time (hours)       8           336                 48          N/A
  Parsing time (sec./sent.)   0.5         2.1                 0.7         N/A

  28. Probabilistic vs. Cascaded Chunking Models

  29. Smoothing Effect (in the cascaded model) • No need to cut off low-frequency words

  30. Combination of features • Polynomial kernels take combinations of features into account (tested with a small corpus of 2,000 sentences)

  31. Deterministic Dependency Parser based on SVMs [Yamada & Matsumoto 03]
• Three possible actions:
  • Right: for the two adjacent words in focus, modification goes from the left word to the right word
  • Left: modification goes from the right word to the left word
  • Shift: no action is taken for the pair, and the focus moves one word to the right
• A Shift covers two situations:
  • There really is no modification relation between the pair
  • There actually is a modification relation, but it must wait until the surrounding analysis has finished
• The second situation can be treated as a separate class (called Wait)
• This process is applied to the input sentence from beginning to end, and repeated until a single word remains
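The Right/Left/Shift loop can be sketched as follows; this is a minimal sketch, with `choose_action` standing in for the SVM action classifier, and words assumed to carry unique identifiers.

```python
def yamada_matsumoto_parse(words, choose_action):
    """Deterministic parse with Right/Left/Shift actions.
    choose_action(left, right) stands in for the trained SVM and returns
    "right", "left", or "shift" for the adjacent pair in focus.
    Returns {child: head}; words must be unique ids."""
    heads = {}
    seq = list(words)
    while len(seq) > 1:
        i, acted = 0, False
        while i < len(seq) - 1:
            action = choose_action(seq[i], seq[i + 1])
            if action == "right":          # left word modifies the right word
                heads[seq[i]] = seq[i + 1]
                del seq[i]
                acted = True
            elif action == "left":         # right word modifies the left word
                heads[seq[i + 1]] = seq[i]
                del seq[i + 1]
                acted = True
            else:                          # shift: move the focus rightward
                i += 1
        if not acted:                      # nothing but shifts: stop
            break
    return heads
```

Driven by an oracle over the gold tree of "the boy hits the dog with a rod" (slides 37-48), this loop reproduces the action sequence shown in those slides and leaves only "hits" in the sequence.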

  32. Right action

  33. Left action

  34. Shift action

  35. The features used in learning
An SVM classifies each configuration either with a 3-class model (right, left, shift) or a 4-class model (right, left, shift, wait).

  36. SVM Learning of Actions
• The best action for each configuration is learned by SVMs
• Since this is a 3-class or 4-class classification problem, either the pairwise or the one-vs-rest method is employed
  • Pairwise method: for each pair of classes, learn an SVM; the best class is decided by voting over all SVMs
  • One-vs-rest method: for each class, an SVM is learned to discriminate that class from the rest; the best class is the one whose SVM gives the highest decision value
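The two multi-class decision schemes can be sketched as follows; the classifier and scorer dictionaries are illustrative stand-ins for trained SVMs.

```python
from itertools import combinations

def pairwise_predict(classifiers, classes, x):
    """Pairwise (one-vs-one) decision by voting:
    classifiers[(a, b)](x) returns the winner among classes a and b."""
    votes = {c: 0 for c in classes}
    for a, b in combinations(classes, 2):
        votes[classifiers[(a, b)](x)] += 1
    return max(classes, key=lambda c: votes[c])

def one_vs_rest_predict(scorers, classes, x):
    """One-vs-rest decision: scorers[c](x) is the decision value of the
    SVM separating class c from the rest; pick the highest-scoring class."""
    return max(classes, key=lambda c: scorers[c](x))
```

With k classes, the pairwise method trains k(k-1)/2 binary SVMs on smaller subsets of the data, while one-vs-rest trains k SVMs that each see all training examples.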

  37.-48. An Example of Deterministic Dependency Parsing (Yamada & Matsumoto Algorithm)
These twelve slides step through the parse of "the boy hits the dog with a rod", highlighting the word pair being considered and the referred context at each step. The action sequence in the figures (omitted here) is: right, right, shift, right, shift, shift, right, left, shift, left, left, and finally end of parsing, when only "hits" remains.

  49. The Accuracy of Parsing
Accuracies for: dependency relation, root identification, complete analysis
• Learned from 30,000 English sentences
• no children: no child information is considered
• word, POS: only word/POS information is used
• all: all information is used

  50. Deterministic linear-time dependency parser based on shift-reduce parsing [Nivre 03, 04]
• There are a stack S and an input queue Q
• Initialization: S[w1] [w2, w3, …, wn]Q
• Termination: S[…] []Q
• Parsing actions:
  • Shift: S[…] [wi, …]Q → S[…, wi] […]Q
  • Left-Arc: S[…, wi] [wj, …]Q → S[…] [wj, …]Q (adds the arc wj→wi)
  • Right-Arc: S[…, wi] [wj, …]Q → S[…, wi, wj] […]Q (adds the arc wi→wj)
  • Reduce: S[…, wi, wj] […]Q → S[…, wi] […]Q
• Though the original parser uses memory-based learning, recent implementations use SVMs to select actions
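A minimal sketch of the four transitions, with `choose_action` standing in for the learned action selector (memory-based learning or SVMs in the actual systems). As an implementation convenience, the sketch starts from an empty stack and shifts w1, which matches the slide's initial configuration after one Shift.

```python
def nivre_parse(words, choose_action):
    """Arc-eager shift-reduce dependency parsing (sketch).
    choose_action(stack, queue, heads) returns "shift", "left",
    "right", or "reduce".  Returns {dependent: head}."""
    stack, queue, heads = [], list(words), {}
    while queue:
        action = choose_action(stack, queue, heads)
        if action == "left" and stack and stack[-1] not in heads:
            heads[stack.pop()] = queue[0]        # Left-Arc: wj -> wi, pop wi
        elif action == "right" and stack:
            heads[queue[0]] = stack[-1]          # Right-Arc: wi -> wj ...
            stack.append(queue.pop(0))           # ... and push wj
        elif action == "reduce" and stack and stack[-1] in heads:
            stack.pop()                          # Reduce: pop wj (head known)
        else:
            stack.append(queue.pop(0))           # Shift
    return heads
```

Each word is shifted and popped at most once, so parsing is linear in the sentence length; the final root is whatever remains on the stack without a head.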
