
Co-training & Self-training for Word Sense Disambiguation



Presentation Transcript


  1. Co-training & Self-training for Word Sense Disambiguation Author: Rada Mihalcea

  2. Introduction • Supervised learning -> best performance, but limited to words for which sense-tagged data is available, and accuracy depends on the amount of labeled data • Methods for building sense classifiers with less annotated data are explored • The applicability of co-training & self-training to supervised word sense disambiguation is investigated • Bootstrapping parameters are tuned for optimal performance

  3. Bootstrapping • For co-training, two independent views are represented by two different feature sets, based on a local versus topical feature split • Self-training requires only one classifier, and no feature split is involved • The class distribution ratio of the labeled data is kept constant to avoid imbalance in the training data • Parameters to be optimized: iterations (I), pool size (P) & growth size (G) (a sketch of the loop follows)
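
Below is a minimal sketch of the co-training loop described above, assuming scikit-learn-style classifiers with predict_proba and a hypothetical select_by_class_ratio helper that enforces the seed class distribution when choosing the G most confident examples per iteration; self-training would use the same loop with a single global classifier and a single feature view. None of these names come from the original system.

```python
import random

def cotrain(labeled, unlabeled, local_clf, topical_clf,
            local_view, topical_view, G, P, I, select_by_class_ratio):
    """labeled: list of (example, sense) pairs; unlabeled: list of examples."""
    for _ in range(I):
        # Train each classifier on its own feature view of the labeled data.
        X, y = [x for x, _ in labeled], [s for _, s in labeled]
        local_clf.fit([local_view(x) for x in X], y)
        topical_clf.fit([topical_view(x) for x in X], y)

        # Draw a random pool of P unlabeled examples.
        pool = random.sample(unlabeled, min(P, len(unlabeled)))

        # Each classifier labels the pool; up to G of its most confident labels
        # are added, preserving the class distribution of the seed data
        # (select_by_class_ratio is a hypothetical helper for that step).
        for clf, view in ((local_clf, local_view), (topical_clf, topical_view)):
            scored = []
            for x in pool:
                probs = clf.predict_proba([view(x)])[0]
                scored.append((probs.max(), x, clf.classes_[probs.argmax()]))
            scored.sort(key=lambda t: t[0], reverse=True)
            for _, x, sense in select_by_class_ratio(scored, G):
                labeled.append((x, sense))
                if x in unlabeled:
                    unlabeled.remove(x)
    return local_clf, topical_clf
```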

  4. Supervised Word Sense Disambiguation • Preprocessing: • Removal of SGML tags • Tokenization • Part-of-speech annotation • Collocation removal • Issues: (1) selection of the best features (2) choice of the learning algorithm (a preprocessing sketch follows)
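
A rough sketch of these preprocessing steps, using NLTK for tokenization and part-of-speech tagging; the slides do not name the original system's tools, so this toolkit choice is an assumption, and collocation handling is left out.

```python
import re
import nltk  # assumes the NLTK tokenizer and tagger models are installed

def preprocess(raw_text):
    # Removal of SGML tags, e.g. <s> ... </s>.
    text = re.sub(r"<[^>]+>", " ", raw_text)
    # Tokenization.
    tokens = nltk.word_tokenize(text)
    # Part-of-speech annotation.
    tagged = nltk.pos_tag(tokens)
    # Collocation handling is omitted from this sketch.
    return tagged

print(preprocess("<s>He sat on the bank of the river.</s>"))
```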

  5. Classifiers • Naive Bayes • Local classifier: uses all local features -> used in co-training • Topical classifier: uses the SK feature (10 keywords per word sense, each occurring at least 3 times in the annotated corpus) -> used in co-training • Global classifier: a combination of the local and topical classifiers -> used in self-training • One classifier is trained for each word in supervised learning => co-training & self-training show heterogeneous behavior, and the best parameters differ for each classifier (see the sketch below)
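
A minimal sketch of how the local and topical Naive Bayes classifiers could be combined into the global classifier, assuming bag-of-feature count vectors and scikit-learn's MultinomialNB; summing log-probabilities across the two views is an assumed combination rule, not one spelled out in the slides.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

class GlobalClassifier:
    """Combines a local-feature and a topical-feature Naive Bayes classifier."""
    def __init__(self):
        self.local_nb = MultinomialNB()    # trained on local features
        self.topical_nb = MultinomialNB()  # trained on topical (SK) features

    def fit(self, X_local, X_topical, y):
        self.local_nb.fit(X_local, y)
        self.topical_nb.fit(X_topical, y)
        return self

    def predict(self, X_local, X_topical):
        # Sum the log-probabilities of the two views (an assumed combination rule).
        scores = (self.local_nb.predict_log_proba(X_local)
                  + self.topical_nb.predict_log_proba(X_topical))
        return self.local_nb.classes_[np.argmax(scores, axis=1)]
```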

  6. Parameter optimization • Determine an optimal parameter setting for each word in the data set • Explore different algorithms for bootstrapping parameter selection: • Best overall parameter setting • Best individual parameter setting • Best per-word parameter setting • A new method with an improved bootstrapping scheme using majority voting across several iterations

  7. Optimal Settings • Measurements are performed on the test set • 40 iterations are run for each setting • Experiments are performed separately for co-training & self-training • The best set of values (G, P, I) is determined for each word (see the search sketch below) • Baseline: the global classifier
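
The per-word search for the best (G, P, I) setting could look like the sketch below; run_bootstrapping and evaluate are hypothetical stand-ins for the training loop and the test-set precision measurement.

```python
from itertools import product

def best_setting_for_word(word_data, growth_sizes, pool_sizes, iteration_counts):
    """Exhaustive search over (G, P, I) combinations for a single word."""
    best_score, best_params = float("-inf"), None
    for G, P, I in product(growth_sizes, pool_sizes, iteration_counts):
        clf = run_bootstrapping(word_data, G=G, P=P, I=I)  # hypothetical helper
        score = evaluate(clf, word_data["test"])           # hypothetical helper
        if score > best_score:
            best_score, best_params = score, (G, P, I)
    return best_params, best_score
```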

  8. Observations • Co-training & self-training have the same performance under optimal settings • Words with high baseline classifier performance show no improvement with either co-training or self-training • There are no commonalities among the optimal parameters of the different classifiers

  9. Empirical Settings • Determining optimal parameter values experimentally may be difficult in practice • 20% of the training data is set aside for determining empirical settings • For each run, G, P, I and the precision of the base classifier & the boosted classifier are recorded • Expt 1: determine the total relative growth in performance for each possible parameter setting, by adding up the relative improvements over all the runs with that setting (see the sketch below) • Next, the value of each parameter is determined independently of the other parameters, following a similar approach
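
A sketch of the Expt 1 selection rule: for every candidate (G, P, I) setting, the relative improvements of the boosted classifier over the base classifier are summed across all runs on the held-out data, and the setting with the highest total growth is kept. The record format for the runs is an assumption.

```python
from collections import defaultdict

def best_empirical_setting(runs):
    """runs: iterable of dicts with keys 'G', 'P', 'I',
    'base_precision' and 'boosted_precision' (assumed record format)."""
    total_growth = defaultdict(float)
    for r in runs:
        relative_improvement = (r["boosted_precision"] - r["base_precision"]) / r["base_precision"]
        total_growth[(r["G"], r["P"], r["I"])] += relative_improvement
    # The setting with the highest total relative growth is selected.
    return max(total_growth, key=total_growth.get)
```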

  10. The value leading to the highest growth is selected • Co-training & self-training share the same set of parameter values that gives the highest growth • The average results are worse than the baseline • Expt 2: the best parameter values are identified for each word • The performance of the base classifier remains better

  11. Majority voting • The bootstrapping learning curves exhibit non-uniform rates of growth followed by decline • The iterations at which the maximum and minimum are reached vary across classifiers • Combining co-training and self-training with majority voting slows the learning rate and produces a larger interval of constant performance • Performance stays above the baseline for a larger interval of iterations (a voting sketch follows)
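
A rough sketch of the majority-voting idea: labels proposed across several consecutive bootstrapping iterations (or by the co-training and self-training classifiers) are pooled, and only the majority sense for each example is kept before it is added to the labeled data; the data layout used here is an assumption.

```python
from collections import Counter

def majority_vote_labels(labels_per_iteration):
    """labels_per_iteration: list of {example_id: sense} dicts, one per iteration
    in the voting window (the window size is an assumption)."""
    votes = {}
    for labels in labels_per_iteration:
        for ex_id, sense in labels.items():
            votes.setdefault(ex_id, []).append(sense)
    # Keep the majority sense for each example across the window.
    return {ex_id: Counter(senses).most_common(1)[0][0]
            for ex_id, senses in votes.items()}
```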

  12. The parameter-setting evaluation is repeated for the smoothed co-training and self-training with majority voting • Co-training results improve • Self-training results do not show significant improvement

  13. Discussions • Dependencies may exist between the two feature views, since they are extracted from the same context • Words with accurate base classifiers show no improvement • Words with a higher number of senses show no improvement • Words with large subsets of their senses belonging to different domains show little or no improvement

  14. Questions?
