
Co-training



  1. Co-training LING 572 Fei Xia 02/21/06

  2. Overview • Proposed by Blum and Mitchell (1998) • Important work: • (Nigam and Ghani, 2000) • (Goldman and Zhou, 2000) • (Abney, 2002) • (Sarkar, 2002) • … • Used in document classification, parsing, etc.

  3. Outline • Basic concept: (Blum and Mitchell, 1998) • Relation with other SSL algorithms: (Nigam and Ghani, 2000)

  4. An example • Web-page classification: e.g., find homepages of faculty members. • Page text: words occurring on that page, e.g., “research interest”, “teaching” • Hyperlink text: words occurring in hyperlinks that point to that page, e.g., “my advisor”

  5. Two views • Features can be split into two sets. • The instance space: X = X1 × X2 • Each example: x = (x1, x2) • D: the distribution over X • C1: the set of target functions over X1. • C2: the set of target functions over X2.

  6. Assumption #1: compatibility • The instance distribution D is compatible with the target function f = (f1, f2) if for any x = (x1, x2) with non-zero probability, f(x) = f1(x1) = f2(x2). • The compatibility of f with D: the probability mass D assigns to examples x = (x1, x2) for which f1(x1) ≠ f2(x2) is zero. ⇒ Each set of features is sufficient for classification.

  7. Assumption #2: conditional independence • x1 and x2 are conditionally independent given the class label: P(x1, x2 | y) = P(x1 | y) · P(x2 | y) for each label y.

  8. Co-training algorithm • Given: a set L of labeled examples and a set U of unlabeled examples. • Create a pool U’ of u examples chosen at random from U. • Loop for k iterations: • Train classifier h1 on view x1 of L; train classifier h2 on view x2 of L. • Let h1 label the p positive and n negative examples from U’ it is most confident about; let h2 do the same. • Add these self-labeled examples to L; replenish U’ with 2p + 2n examples drawn randomly from U.

  9. Co-training algorithm (cont) • Why use U’ in addition to U? • Using U’ yields better results. • Possible explanation: it forces h1 and h2 to select examples that are more representative of the underlying distribution D that generates U. • Choosing p and n: the ratio p:n should match the ratio of positive to negative examples in D. • Choosing the number of iterations and the size of U’.
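A minimal sketch of the loop on slides 8-9, assuming a generic per-view learner; train_view_classifier and its prob(x) method (returning P(positive | x)) are hypothetical helper names, not from Blum and Mitchell (1998):

import random

def co_train(L, U, p=1, n=3, pool_size=75, iterations=30):
    # L: list of ((x1, x2), label) with label in {0, 1}; U: list of (x1, x2) pairs.
    U = list(U)
    random.shuffle(U)
    pool = [U.pop() for _ in range(min(pool_size, len(U)))]       # the pool U'
    for _ in range(iterations):
        h1 = train_view_classifier([(x[0], y) for x, y in L])     # view #1 (page text)
        h2 = train_view_classifier([(x[1], y) for x, y in L])     # view #2 (hyperlink text)
        for h, view in ((h1, 0), (h2, 1)):
            # rank the pool by P(positive | x_view); most confident at the ends
            ranked = sorted(pool, key=lambda x: h.prob(x[view]))
            picks = [(x, 1) for x in ranked[-p:]] + [(x, 0) for x in ranked[:n]]
            for x, y in picks:
                L.append((x, y))                                  # self-labeled example
                pool.remove(x)
        while len(pool) < pool_size and U:                        # replenish U' from U
            pool.append(U.pop())
    return h1, h2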

  10. Intuition behind the co-training algorithm • h1 adds examples to the labeled set that h2 will be able to use for learning, and vice versa. • If the conditional independence assumption holds, then on average each added document will be as informative as a random document, and the learning will progress.

  11. Experiments: setting • 1051 web pages from 4 CS depts, manually labeled into a number of categories, e.g., “course home page” • 263 pages (25%) as test data • The remaining 75% of pages: • Labeled data: 3 positive and 9 negative examples • Unlabeled data: the rest (776 pages) • Two views: • View #1 (page-based): words in the page • View #2 (hyperlink-based): words in the hyperlinks • Learner: Naïve Bayes

  12. Naïve Bayes classifier (Nigam and Ghani, 2000)
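The formulas on this slide correspond to the standard multinomial Naïve Bayes model used as the base learner here; below is a minimal sketch, assuming bag-of-words documents and add-one smoothing (a choice not taken from the slide):

import math
from collections import Counter, defaultdict

def train_nb(docs):                          # docs: list of (list_of_words, label)
    class_count = Counter()
    word_count = defaultdict(Counter)        # per-class word counts
    for words, y in docs:
        class_count[y] += 1
        word_count[y].update(words)
    vocab = {w for counts in word_count.values() for w in counts}

    def log_posterior(words, y):
        total = sum(word_count[y].values())
        lp = math.log(class_count[y] / sum(class_count.values()))        # log prior
        for w in words:                                                  # log likelihoods
            lp += math.log((word_count[y][w] + 1) / (total + len(vocab)))
        return lp

    def classify(words):                     # argmax over classes of the log posterior
        return max(class_count, key=lambda y: log_posterior(words, y))
    return classify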

  13. Experiment: results • p = 1, n = 3 • # of iterations: 30 • |U’| = 75

  14. Questions • Can co-training algorithms be applied to datasets without natural feature divisions? • How sensitive are the co-training algorithms to the correctness of the assumptions? • What is the relation between co-training and other SSL methods (e.g., self-training)?

  15. (Nigam and Ghani, 2000)

  16. EM • Pool the features together. • Use the initial labeled data to get initial parameter estimates. • In each iteration, use all the data (labeled and unlabeled) to re-estimate the parameters. • Repeat until convergence.
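A sketch of this loop, assuming a Naïve-Bayes-style learner that accepts fractionally weighted examples; train_weighted and predict_proba are hypothetical helper names, not from the paper:

def em_ssl(labeled, unlabeled, classes, iterations=10):
    # labeled: list of (x, y); unlabeled: list of x
    model = train_weighted([(x, y, 1.0) for x, y in labeled])    # initial estimates
    for _ in range(iterations):                                  # or: until convergence
        data = [(x, y, 1.0) for x, y in labeled]
        for x in unlabeled:                                      # E-step: soft labels
            probs = model.predict_proba(x)                       # {class: P(class | x)}
            data += [(x, y, probs[y]) for y in classes]
        model = train_weighted(data)                             # M-step: re-estimate
    return model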

  17. Experimental results: WebKB course database • EM performs better than co-training. • Both are close to the supervised method when trained on more labeled data.

  18. Another experiment: the News 2×2 dataset • A semi-artificial dataset • The conditional independence assumption holds. • Co-training outperforms EM and the “oracle” result.

  19. Co-training vs. EM • Co-training splits the features; EM does not. • Co-training uses the unlabeled data incrementally. • EM uses the unlabeled data iteratively, probabilistically labeling all of it at each round.

  20. Co-EM: EM with feature split • Repeat until convergence: • Train the A-feature-set classifier using the labeled data and the unlabeled data with B’s labels. • Use classifier A to probabilistically label all the unlabeled data. • Train the B-feature-set classifier using the labeled data and the unlabeled data with A’s labels. • B re-labels the data for use by A.
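A sketch of co-EM, reusing the hypothetical train_weighted / predict_proba helpers from the EM sketch above:

def co_em(labeled, unlabeled, classes, iterations=10):
    # labeled: list of ((xA, xB), y); unlabeled: list of (xA, xB)
    def train_view(view, soft_labels):
        data = [(x[view], y, 1.0) for x, y in labeled]
        for x, probs in soft_labels:                     # soft labels from the other view
            data += [(x[view], y, probs[y]) for y in classes]
        return train_weighted(data)

    soft = []                                            # no soft labels yet
    for _ in range(iterations):                          # or: until convergence
        h_A = train_view(0, soft)                        # A trained with B's labels
        soft = [(x, h_A.predict_proba(x[0])) for x in unlabeled]
        h_B = train_view(1, soft)                        # B trained with A's labels
        soft = [(x, h_B.predict_proba(x[1])) for x in unlabeled]   # B re-labels for A
    return h_A, h_B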

  21. Four SSL methods • Results on the News 2×2 dataset

  22. Random feature split • Co-training: 3.7% → 5.5% • Co-EM: 3.3% → 5.1% • When the conditional independence assumption does not hold, but there is sufficient redundancy among the features, co-training still works well.

  23. Assumptions • Assumptions made by the underlying classifier (supervised learner): • Naïve Bayes: words occur independently of each other, given the class of the document. • Co-training uses the classifier to rank the unlabeled examples by confidence. • EM uses the classifier to assign probabilities to each unlabeled example. • Assumptions made by SSL method: • Co-training: conditional independence assumption. • EM: maximizing likelihood correlates with reducing classification errors.

  24. Summary of (Nigam and Ghani, 2000) • Comparison of four SSL methods: self-training, co-training, EM, co-EM. • The performance of the SSL methods depends on how well the underlying assumptions are met. • Randomly splitting the features is not as good as a natural split, but it still works if there is sufficient redundancy among the features.

  25. Variations of co-training • Goldman and Zhou (2000) use two learners of different types, but both take the whole feature set. • Zhou and Li (2005) use three learners; if two agree, the data is used to teach the third learner. • Balcan et al. (2005) relax the conditional independence assumption to a much weaker expansion condition.

  26. An alternative? • L → L1, L → L2 • U → U1, U → U2 • Repeat: • Train h1 using L1 on feature set 1 • Train h2 using L2 on feature set 2 • Classify U2 with h1; let U2’ be the subset with the most confident scores; L2 + U2’ → L2, U2 − U2’ → U2 • Classify U1 with h2; let U1’ be the subset with the most confident scores; L1 + U1’ → L1, U1 − U1’ → U1
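A minimal sketch of this variant, reusing the hypothetical train_view_classifier / prob helpers from the co-training sketch; the per-round batch size k is also an assumption:

def co_train_variant(L, U, k=4, iterations=30):
    # Each view keeps its own labeled set (L1, L2) and unlabeled pool (U1, U2);
    # h1's confident labels grow L2 and h2's confident labels grow L1.
    L1, L2 = list(L), list(L)
    U1, U2 = list(U), list(U)
    for _ in range(iterations):
        h1 = train_view_classifier([(x[0], y) for x, y in L1])
        h2 = train_view_classifier([(x[1], y) for x, y in L2])
        # h1 labels the examples in U2 it is most confident about; they join L2
        picks2 = sorted(U2, key=lambda x: abs(h1.prob(x[0]) - 0.5))[-k:]
        L2 += [(x, int(h1.prob(x[0]) > 0.5)) for x in picks2]
        U2 = [x for x in U2 if x not in picks2]
        # h2 does the same for U1, feeding L1
        picks1 = sorted(U1, key=lambda x: abs(h2.prob(x[1]) - 0.5))[-k:]
        L1 += [(x, int(h2.prob(x[1]) > 0.5)) for x in picks1]
        U1 = [x for x in U1 if x not in picks1]
    return h1, h2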

  27. Yarowsky’s algorithm • One-sense-per-discourse → View #1: the ID of the document that a word is in • One-sense-per-collocation → View #2: the local context of the word in the document • Yarowsky’s algorithm is a special case of co-training (Blum & Mitchell, 1998). • Is this correct? No, according to (Abney, 2002).

  28. Summary of co-training • The original paper: (Blum and Mitchell, 1998) • Two “independent” views: split the features into two sets. • Train a classifier on each view. • Each classifier labels data that can be used to train the other classifier. • Extension: • Relax the conditional independence assumptions • Instead of using two views, use two or more classifiers trained on the whole feature set.

  29. Summary of SSL • Goal: use both labeled and unlabeled data. • Many algorithms: EM, co-EM, self-training, co-training, … • Each algorithm is based on some assumptions. • SSL works well when the assumptions are satisfied.

  30. Additional slides

  31. Rule independence • H1 (resp. H2) consists of rules that are functions of X1 (resp. X2) only.

  32. EM: the data is generated according to some simple, known parametric model. • Ex: the positive examples are generated according to an n-dimensional Gaussian D+ centered around an (unknown) center point.
