
Co-training



  1. Co-training LING 572 Fei Xia 02/21/06

  2. Overview • Proposed by Blum and Mitchell (1998) • Important work: • (Nigam and Ghani, 2000) • (Goldman and Zhou, 2000) • (Abney, 2002) • (Sarkar, 2002) • … • Used in document classification, parsing, etc.

  3. Outline • Basic concept: (Blum and Mitchell, 1998) • Relation with other SSL algorithms: (Nigam and Ghani, 2000)

  4. An example • Web-page classification: e.g., find homepages of faculty members. • Page text: words occurring on that page, e.g., “research interest”, “teaching” • Hyperlink text: words occurring in hyperlinks that point to that page, e.g., “my advisor”

  5. Two views • Features can be split into two sets. • The instance space: X = X1 × X2 • Each example: x = (x1, x2) • D: the distribution over X • C1: the set of target functions over X1. • C2: the set of target functions over X2.

  6. Assumption #1: compatibility • The instance distribution D is compatible with the target function f = (f1, f2) if for any x = (x1, x2) with non-zero probability, f(x) = f1(x1) = f2(x2). • The compatibility of f with D: the probability mass D assigns to examples x = (x1, x2) for which f1(x1) ≠ f2(x2) is zero. ⇒ Each set of features is sufficient for classification.

  7. Assumption #2: conditional independence • x1 and x2 are conditionally independent given the class label: P(x1, x2 | y) = P(x1 | y) · P(x2 | y) for each label y.

  8. Co-training algorithm • Given: a set L of labeled examples and a set U of unlabeled examples. • Create a pool U’ of u examples chosen at random from U. • Loop for k iterations: • Train classifier h1 on view x1 of L; train classifier h2 on view x2 of L. • Let h1 label the p positive and n negative examples from U’ it is most confident about; let h2 do the same. • Add these self-labeled examples to L; replenish U’ with 2p + 2n examples drawn randomly from U.

  9. Co-training algorithm (cont) • Why use U’ in addition to U? • Using U’ yields better results. • Possible explanation: it forces h1 and h2 to select examples that are more representative of the underlying distribution D that generates U. • Choosing p and n: the ratio p:n should match the ratio of positive to negative examples in D. • Choosing the number of iterations and the size of U’.
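A minimal sketch of the loop on slides 8-9, assuming a generic per-view learner; train_view_classifier and its prob(x) method (returning P(positive | x)) are hypothetical helper names, not from Blum and Mitchell (1998):

import random

def co_train(L, U, p=1, n=3, pool_size=75, iterations=30):
    # L: list of ((x1, x2), label) with label in {0, 1}; U: list of (x1, x2) pairs.
    U = list(U)
    random.shuffle(U)
    pool = [U.pop() for _ in range(min(pool_size, len(U)))]       # the pool U'
    for _ in range(iterations):
        h1 = train_view_classifier([(x[0], y) for x, y in L])     # view #1 (page text)
        h2 = train_view_classifier([(x[1], y) for x, y in L])     # view #2 (hyperlink text)
        for h, view in ((h1, 0), (h2, 1)):
            # rank the pool by P(positive | x_view); most confident at the ends
            ranked = sorted(pool, key=lambda x: h.prob(x[view]))
            picks = [(x, 1) for x in ranked[-p:]] + [(x, 0) for x in ranked[:n]]
            for x, y in picks:
                L.append((x, y))                                  # self-labeled example
                pool.remove(x)
        while len(pool) < pool_size and U:                        # replenish U' from U
            pool.append(U.pop())
    return h1, h2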

  10. Intuition behind the co-training algorithm • h1 adds examples to the labeled set that h2 will be able to use for learning, and vice versa. • If the conditional independence assumption holds, then on average each added document will be as informative as a random document, and the learning will progress.

  11. Experiments: setting • 1051 web pages from 4 CS depts, manually labeled into a number of categories, e.g., “course home page” • 263 pages (25%) as test data • The remaining 75% of pages: • Labeled data: 3 positive and 9 negative examples • Unlabeled data: the rest (776 pages) • Two views: • View #1 (page-based): words in the page • View #2 (hyperlink-based): words in the hyperlinks • Learner: Naïve Bayes

  12. Naïve Bayes classifier (Nigam and Ghani, 2000)
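The formulas on this slide correspond to the standard multinomial Naïve Bayes model used as the base learner here; below is a minimal sketch, assuming bag-of-words documents and add-one smoothing (a choice not taken from the slide):

import math
from collections import Counter, defaultdict

def train_nb(docs):                          # docs: list of (list_of_words, label)
    class_count = Counter()
    word_count = defaultdict(Counter)        # per-class word counts
    for words, y in docs:
        class_count[y] += 1
        word_count[y].update(words)
    vocab = {w for counts in word_count.values() for w in counts}

    def log_posterior(words, y):
        total = sum(word_count[y].values())
        lp = math.log(class_count[y] / sum(class_count.values()))        # log prior
        for w in words:                                                  # log likelihoods
            lp += math.log((word_count[y][w] + 1) / (total + len(vocab)))
        return lp

    def classify(words):                     # argmax over classes of the log posterior
        return max(class_count, key=lambda y: log_posterior(words, y))
    return classify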

  13. Experiment: results • p = 1, n = 3 • # of iterations: 30 • |U’| = 75

  14. Questions • Can co-training algorithms be applied to datasets without natural feature divisions? • How sensitive are the co-training algorithms to the correctness of the assumptions? • What is the relation between co-training and other SSL methods (e.g., self-training)?

  15. (Nigam and Ghani, 2000)

  16. EM • Pool the features together. • Use the initial labeled data to get initial parameter estimates. • In each iteration, use all the data (labeled and unlabeled) to re-estimate the parameters. • Repeat until convergence.
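A sketch of this loop, assuming a Naïve-Bayes-style learner that accepts fractionally weighted examples; train_weighted and predict_proba are hypothetical helper names, not from the paper:

def em_ssl(labeled, unlabeled, classes, iterations=10):
    # labeled: list of (x, y); unlabeled: list of x
    model = train_weighted([(x, y, 1.0) for x, y in labeled])    # initial estimates
    for _ in range(iterations):                                  # or: until convergence
        data = [(x, y, 1.0) for x, y in labeled]
        for x in unlabeled:                                      # E-step: soft labels
            probs = model.predict_proba(x)                       # {class: P(class | x)}
            data += [(x, y, probs[y]) for y in classes]
        model = train_weighted(data)                             # M-step: re-estimate
    return model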

  17. Experimental results: WebKB course database • EM performs better than co-training. • Both are close to the supervised method when trained on more labeled data.

  18. Another experiment: the News 2×2 dataset • A semi-artificial dataset • The conditional independence assumption holds. • Co-training outperforms EM and the “oracle” result.

  19. Co-training vs. EM • Co-training splits the features; EM does not. • Co-training uses the unlabeled data incrementally. • EM uses the unlabeled data iteratively, probabilistically labeling all of it at each round.

  20. Co-EM: EM with feature split • Repeat until convergence: • Train the A-feature-set classifier using the labeled data and the unlabeled data with B’s labels. • Use classifier A to probabilistically label all the unlabeled data. • Train the B-feature-set classifier using the labeled data and the unlabeled data with A’s labels. • B re-labels the data for use by A.
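A sketch of co-EM, reusing the hypothetical train_weighted / predict_proba helpers from the EM sketch above:

def co_em(labeled, unlabeled, classes, iterations=10):
    # labeled: list of ((xA, xB), y); unlabeled: list of (xA, xB)
    def train_view(view, soft_labels):
        data = [(x[view], y, 1.0) for x, y in labeled]
        for x, probs in soft_labels:                     # soft labels from the other view
            data += [(x[view], y, probs[y]) for y in classes]
        return train_weighted(data)

    soft = []                                            # no soft labels yet
    for _ in range(iterations):                          # or: until convergence
        h_A = train_view(0, soft)                        # A trained with B's labels
        soft = [(x, h_A.predict_proba(x[0])) for x in unlabeled]
        h_B = train_view(1, soft)                        # B trained with A's labels
        soft = [(x, h_B.predict_proba(x[1])) for x in unlabeled]   # B re-labels for A
    return h_A, h_B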

  21. Four SSL methods • Results on the News 2×2 dataset

  22. Random feature split • Co-training: 3.7% → 5.5% • Co-EM: 3.3% → 5.1% • When the conditional independence assumption does not hold, but there is sufficient redundancy among the features, co-training still works well.

  23. Assumptions • Assumptions made by the underlying classifier (supervised learner): • Naïve Bayes: words occur independently of each other, given the class of the document. • Co-training uses the classifier to rank the unlabeled examples by confidence. • EM uses the classifier to assign probabilities to each unlabeled example. • Assumptions made by SSL method: • Co-training: conditional independence assumption. • EM: maximizing likelihood correlates with reducing classification errors.

  24. Summary of (Nigam and Ghani, 2000) • Comparison of four SSL methods: self-training, co-training, EM, co-EM. • The performance of the SSL methods depends on how well the underlying assumptions are met. • Randomly splitting the features is not as good as a natural split, but it still works if there is sufficient redundancy among the features.

  25. Variations of co-training • Goldman and Zhou (2000) use two learners of different types, but both take the whole feature set. • Zhou and Li (2005) use three learners; if two agree, the data is used to teach the third learner. • Balcan et al. (2005) relax the conditional independence assumption to a much weaker expansion condition.

  26. An alternative? • L → L1, L → L2 • U → U1, U → U2 • Repeat: • Train h1 using L1 on feature set 1 • Train h2 using L2 on feature set 2 • Classify U2 with h1; let U2’ be the subset with the most confident scores; L2 + U2’ → L2, U2 − U2’ → U2 • Classify U1 with h2; let U1’ be the subset with the most confident scores; L1 + U1’ → L1, U1 − U1’ → U1
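A minimal sketch of this variant, reusing the hypothetical train_view_classifier / prob helpers from the co-training sketch; the per-round batch size k is also an assumption:

def co_train_variant(L, U, k=4, iterations=30):
    # Each view keeps its own labeled set (L1, L2) and unlabeled pool (U1, U2);
    # h1's confident labels grow L2 and h2's confident labels grow L1.
    L1, L2 = list(L), list(L)
    U1, U2 = list(U), list(U)
    for _ in range(iterations):
        h1 = train_view_classifier([(x[0], y) for x, y in L1])
        h2 = train_view_classifier([(x[1], y) for x, y in L2])
        # h1 labels the examples in U2 it is most confident about; they join L2
        picks2 = sorted(U2, key=lambda x: abs(h1.prob(x[0]) - 0.5))[-k:]
        L2 += [(x, int(h1.prob(x[0]) > 0.5)) for x in picks2]
        U2 = [x for x in U2 if x not in picks2]
        # h2 does the same for U1, feeding L1
        picks1 = sorted(U1, key=lambda x: abs(h2.prob(x[1]) - 0.5))[-k:]
        L1 += [(x, int(h2.prob(x[1]) > 0.5)) for x in picks1]
        U1 = [x for x in U1 if x not in picks1]
    return h1, h2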

  27. Yarowsky’s algorithm • One-sense-per-discourse → View #1: the ID of the document that a word is in • One-sense-per-collocation → View #2: the local context of the word in the document • Yarowsky’s algorithm is a special case of co-training (Blum & Mitchell, 1998). • Is this correct? No, according to (Abney, 2002).

  28. Summary of co-training • The original paper: (Blum and Mitchell, 1998) • Two “independent” views: split the features into two sets. • Train a classifier on each view. • Each classifier labels data that can be used to train the other classifier. • Extension: • Relax the conditional independence assumptions • Instead of using two views, use two or more classifiers trained on the whole feature set.

  29. Summary of SSL • Goal: use both labeled and unlabeled data. • Many algorithms: EM, co-EM, self-training, co-training, … • Each algorithm is based on some assumptions. • SSL works well when the assumptions are satisfied.

  30. Additional slides

  31. Rule independence • H1 (resp. H2) consists of rules that are functions of X1 (resp. X2) only.

  32. EM: the data is generated according to some simple, known parametric model. • Ex: the positive examples are generated according to an n-dimensional Gaussian D+ centered around an (unknown) center point.
