Semi-supervised Learning with Weakly-Related Unlabeled Data: Towards Better Text Categorization

Presentation Transcript

  1. Semi-supervised Learning with Weakly-Related Unlabeled Data: Towards Better Text Categorization • Advisor: Hsin-Hsi Chen • Reporter: Chi-Hsin Yu • Date: 2009.09.24 • From NIPS 2008

  2. Outline • Introduction • Related Work • Review of SVM • SSLW (Semi-supervised Learning with Weakly-Related Unlabeled Data) • Experiments • Conclusion

  3. Introduction • Semi-supervised Learning (SSL) • takes advantage of a large amount of unlabeled data to enhance classification accuracy • Cluster assumption • puts the decision boundary in low-density areas without crossing high-density regions • is only meaningful when the labeled and unlabeled data are closely related • If they are only weakly related, the labeled and unlabeled data could be well separated from each other

  4. Introduction (cont.) • This paper aims to • identify a new data representation (in feature space) • by constructing a new kernel function • Advantages of the new representation • informative to the target class (category) • consistent with the feature coherence patterns exhibited in the weakly related unlabeled data

  5. Related Work • Two types of semi-supervised learning (SSL) • Transductive SSL • predicts labels only for the available unlabeled data • Inductive SSL • also learns a classifier that can be used to predict labels for new data • SSLW

  6. SVM • Notation • $\mathcal{L} = \{(x_1, y_1), \ldots, (x_l, y_l)\}$: labeled documents • $\mathcal{U} = \{(x_{l+1}, y_{l+1}), \ldots, (x_n, y_n)\}$: unlabeled documents • Document-word matrix $D = (d_1, d_2, \ldots, d_n)$, $d_i \in \mathbb{N}^V$ • $V$: the size of the vocabulary • $d_i$: word-frequency vector for document $i$ • Word-document matrix $G = (g_1, g_2, \ldots, g_V)$ • $g_i = (g_{i,1}, g_{i,2}, \ldots, g_{i,n})$ • $K = D^\top D$, $K \in \mathbb{R}^{n \times n}$: pairwise document similarity • $\alpha \circ y = (\alpha_1 y_1, \alpha_2 y_2, \ldots, \alpha_n y_n)$: element-wise product
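
The notation on this slide can be tried out on a toy corpus. The sketch below builds $K = D^\top D$ and the element-wise product $\alpha \circ y$ with NumPy. The matrix values, the function names, and the layout of `G` as the transpose of `D` are illustrative assumptions. The dual objective shown is the textbook SVM dual written in this notation, since the slide's own equation did not survive; it is not taken from the presentation.

```python
import numpy as np

# Toy document-word matrix D: one column per document
# (V = 4 vocabulary words, n = 3 documents). Values are made up.
D = np.array([
    [2, 0, 1],
    [0, 1, 0],
    [1, 1, 0],
    [0, 3, 1],
])

def linear_kernel(D):
    """Return K = D^T D, the n x n matrix of document inner products."""
    return D.T @ D

def dual_objective(alpha, y, K):
    """Textbook SVM dual: sum(alpha) - 0.5 * (alpha∘y)^T K (alpha∘y)."""
    alpha_y = alpha * y  # element-wise product alpha∘y
    return alpha.sum() - 0.5 * alpha_y @ K @ alpha_y

K = linear_kernel(D)
G = D.T  # assumed layout: column i of G holds word i's counts over the n documents

y = np.array([1.0, -1.0, 1.0])       # hypothetical labels
alpha = np.array([0.5, 0.5, 0.0])    # hypothetical dual variables
value = dual_objective(alpha, y, K)
```

Here `K[i, j]` is the inner product of the word-frequency vectors of documents `i` and `j`, so documents that share no words get similarity zero.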

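  7. SSLW • $K = D^\top D \;\rightarrow\; K = D^\top R D$ • $R \in \mathbb{R}^{V \times V}$: word-correlation matrix • Two ways to construct the matrix $R$ • $G = UW$, $W = (w_1, w_2, \ldots, w_V)$ • $w_i$: internal representation of the $i$-th word • $R = W^\top W$, $T = UU^\top$ • the top $p$ right eigenvectors of $G$ • $\alpha_i \ge 0$, $\xi \ge 0$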
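
To make the kernel change on this slide concrete, the sketch below replaces $K = D^\top D$ with $K = D^\top R D$ for a word-correlation matrix $R = W^\top W$. The slide does not say how the factorization $G = UW$ is computed, so a truncated SVD is used here purely as a stand-in assumption. SSLW itself learns $R$ through an optimization that is not reproduced here, and the function names and toy data are hypothetical.

```python
import numpy as np

def word_correlation_from_svd(G, p):
    """Factor G ≈ U W with a rank-p truncated SVD (an assumed stand-in)
    and return U (n x p), W (p x V) and R = W^T W (V x V)."""
    U, s, Vt = np.linalg.svd(G, full_matrices=False)
    U_p = U[:, :p]
    W = s[:p, None] * Vt[:p, :]  # column i of W: internal representation of word i
    R = W.T @ W                  # symmetric positive semi-definite by construction
    return U_p, W, R

def correlated_kernel(D, R):
    """Return K = D^T R D for a V x V word-correlation matrix R."""
    return D.T @ R @ D

# Toy data: V = 4 words, n = 3 documents (values are made up).
D = np.array([
    [2.0, 0.0, 1.0],
    [0.0, 1.0, 0.0],
    [1.0, 1.0, 0.0],
    [0.0, 3.0, 1.0],
])
G = D.T  # assumed layout: one column per word

U_p, W, R = word_correlation_from_svd(G, p=2)
K_new = correlated_kernel(D, R)

# Sanity check: with R = I the original linear kernel is recovered.
K_lin = correlated_kernel(D, np.eye(D.shape[0]))
```

Because $R = W^\top W$ is positive semi-definite, $D^\top R D$ is still a valid kernel matrix, which is what lets it be dropped into a standard SVM solver.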

  8. SSLW (cont.)

  9. SSLW (cont.) • An Efficient Algorithm of SSLW

  10. Experiments • Corpora • Reuters-21578 (9400 docs) • WebKB (4518 docs) • TREC AP88: an external information source for both datasets (1000 randomly selected documents)

  11. Evaluation Methodology • 4 positive + 4 negative samples drawn from each training set • AUR (area under the ROC curve) • AUR averaged over ten runs of each experiment
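
The evaluation protocol on this slide can be sketched as follows: draw a small balanced training sample, score the test documents, compute the area under the ROC curve, and average over repeated runs. The helper names, the ±1 label encoding, and the toy scores are assumptions for illustration. The area is computed with the pairwise-ranking (Mann-Whitney) formulation, which is one standard way to obtain it and not necessarily the one the authors used.

```python
import numpy as np

def sample_training_indices(labels, n_per_class, rng):
    """Pick n_per_class positive and n_per_class negative examples at random."""
    labels = np.asarray(labels)
    pos = rng.choice(np.flatnonzero(labels == 1), size=n_per_class, replace=False)
    neg = rng.choice(np.flatnonzero(labels == -1), size=n_per_class, replace=False)
    return np.concatenate([pos, neg])

def auc(scores, labels):
    """Area under the ROC curve: the fraction of (positive, negative) pairs
    in which the positive example gets the higher score (ties count 0.5)."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    pos = scores[labels == 1]
    neg = scores[labels == -1]
    diff = pos[:, None] - neg[None, :]
    wins = (diff > 0).sum() + 0.5 * (diff == 0).sum()
    return float(wins / (len(pos) * len(neg)))

def mean_auc(runs):
    """Average the AUC over a list of (scores, labels) pairs, one per run."""
    return float(np.mean([auc(s, y) for s, y in runs]))

# Two hypothetical runs with made-up classifier scores.
runs = [
    ([0.9, 0.8, 0.3, 0.1], [1, 1, -1, -1]),
    ([0.9, 0.2, 0.8, 0.1], [1, 1, -1, -1]),
]
average = mean_auc(runs)

# Drawing 4 positive + 4 negative training documents from a toy label vector.
rng = np.random.default_rng(0)
toy_labels = np.array([1] * 10 + [-1] * 10)
train_idx = sample_training_indices(toy_labels, n_per_class=4, rng=rng)
```

In a full experiment each run would resample the training set, retrain the classifier, and score the held-out documents before the average is taken.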

  12. Conclusion • SSLW • Significantly improves both the accuracy and the reliability of text categorization, • given a small training pool and the additional unlabeled data that are weakly related to the test bed.

  13. Thanks!!