
COP5992 – DATA MINING TERM PROJECT RANDOM SUBSPACE METHOD + CO-TRAINING by SELIM KALAYCI


Presentation Transcript


  1. COP5992 – DATA MINING TERM PROJECT RANDOM SUBSPACE METHOD + CO-TRAINING by SELIM KALAYCI

  2. RANDOM SUBSPACE METHOD (RSM) • Proposed by Ho, “The Random Subspace Method for Constructing Decision Forests”, 1998 • Another technique for combining weak classifiers, like bagging and boosting.

  3. RSM ALGORITHM
     1. Repeat for b = 1, 2, ..., B:
        (a) Select an r-dimensional random subspace Xb from the original p-dimensional feature space X.
        (b) Construct a classifier Cb(x) in Xb.
     2. Combine the classifiers Cb(x), b = 1, 2, ..., B, by simple majority voting into a final decision rule.
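The RSM steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the slides do not name a base classifier, so a nearest-centroid rule is swapped in here, and the function names (`rsm_fit`, `rsm_predict`), the defaults for B and r, and the toy data are all hypothetical.

```python
import random
from collections import Counter

def train_centroids(X, y, features):
    """Base classifier (an assumption): per-class centroid on the chosen features."""
    sums, counts = {}, Counter(y)
    for row, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(features))
        for j, f in enumerate(features):
            acc[j] += row[f]
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}

def predict_centroids(centroids, features, row):
    """Predict the class whose centroid is nearest within the subspace."""
    def dist2(c):
        return sum((row[f] - c[j]) ** 2 for j, f in enumerate(features))
    return min(sorted(centroids), key=lambda label: dist2(centroids[label]))

def rsm_fit(X, y, B=25, r=2, seed=0):
    """Build B classifiers, each on r features drawn at random from the p available."""
    rng = random.Random(seed)
    p = len(X[0])
    ensemble = []
    for _ in range(B):
        features = rng.sample(range(p), r)           # random r-dimensional subspace
        model = train_centroids(X, y, features)      # classifier trained in that subspace
        ensemble.append((features, model))
    return ensemble

def rsm_predict(ensemble, row):
    """Combine the B classifiers by simple majority voting."""
    votes = Counter(predict_centroids(model, features, row)
                    for features, model in ensemble)
    return votes.most_common(1)[0][0]

# Toy data (hypothetical): four redundant features, two well-separated classes.
X = [[0, 0, 0, 0], [1, 1, 1, 1], [0, 1, 0, 1],
     [9, 9, 9, 9], [10, 10, 10, 10], [9, 10, 9, 10]]
y = [0, 0, 0, 1, 1, 1]
ensemble = rsm_fit(X, y, B=25, r=2, seed=0)
print(rsm_predict(ensemble, [0.5, 0.5, 0.5, 0.5]),
      rsm_predict(ensemble, [9.5, 9.5, 9.5, 9.5]))
```

An odd B avoids tied votes in the two-class case. A weighted vote would replace the `Counter` tally in `rsm_predict`.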

  4. MOTIVATION FOR RSM • Redundancy in Data Feature Space • Completely redundant feature set • Redundancy is spread over many features • Weak classifiers that have critical training sample sizes

  5. RSM PERFORMANCE ISSUES • RSM Performance depends on: • Training sample size • The choice of a base classifier • The choice of combining rule (simple majority vs. weighted) • The degree of redundancy of the dataset • The number of features chosen

  6. DECISION FORESTS (by Ho) • A combination of trees instead of a single tree • Assumption: Dataset has some redundant features • Works efficiently with any decision tree algorithm and data splitting method • Ideally, look for best individual trees with lowest tree similarity

  7. UNLABELED DATA • Small number of labeled documents • Large pool of unlabeled documents • How to classify unlabeled documents accurately?

  8. EXPECTATION-MAXIMIZATION (E-M)

  9. CO-TRAINING • Blum and Mitchell, “Combining Labeled and Unlabeled Data with Co-Training”, 1998. • Requirements: • Two sufficiently strong feature sets • The two feature sets are conditionally independent given the class label

  10. CO-TRAINING

  11. APPLICATION OF CO-TRAINING TO A SINGLE FEATURE SET
      Algorithm:
      Obtain a small set L of labeled examples
      Obtain a large set U of unlabeled examples
      Obtain two sets F1 and F2 of features that are sufficiently redundant
      While U is not empty do:
          Learn classifier C1 from L based on F1
          Learn classifier C2 from L based on F2
          For each classifier Ci do:
              Ci labels examples from U based on Fi
              Ci chooses the most confidently predicted examples E from U
              E is removed from U and added (with their given labels) to L
      End loop
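A minimal Python sketch of the loop above, under several assumptions not stated in the slides: the base learner is a nearest-centroid rule, "confidence" is taken to be the gap between the squared distances to the two nearest class centroids, and each classifier moves `per_round` examples per pass. All function names, the `per_round` parameter, and the toy data are hypothetical.

```python
def centroids(L, feats):
    """Learn a nearest-centroid classifier from labeled pairs (x, label) on feats."""
    sums, counts = {}, {}
    for x, label in L:
        acc = sums.setdefault(label, [0.0] * len(feats))
        counts[label] = counts.get(label, 0) + 1
        for j, f in enumerate(feats):
            acc[j] += x[f]
    return {lab: [s / counts[lab] for s in acc] for lab, acc in sums.items()}

def label_with_confidence(cents, feats, x):
    """Return (label, margin); a larger margin is treated as higher confidence."""
    d = sorted((sum((x[f] - c[j]) ** 2 for j, f in enumerate(feats)), lab)
               for lab, c in cents.items())
    margin = d[1][0] - d[0][0] if len(d) > 1 else 0.0
    return d[0][1], margin

def co_train(L, U, F1, F2, per_round=1):
    """Grow L by letting two feature views label the unlabeled pool U in turn."""
    L, U = list(L), list(U)
    while U:
        # Learn C1 from L based on F1 and C2 from L based on F2.
        classifiers = [(F, centroids(L, F)) for F in (F1, F2)]
        for feats, cents in classifiers:
            if not U:
                break
            # Ci labels every example in U using only its own features Fi.
            scored = sorted(((label_with_confidence(cents, feats, x), i)
                             for i, x in enumerate(U)),
                            key=lambda t: -t[0][1])
            chosen = scored[:per_round]          # most confidently predicted examples E
            for (label, _), i in chosen:
                L.append((U[i], label))          # E joins L with its predicted labels
            picked = {i for _, i in chosen}
            U = [x for i, x in enumerate(U) if i not in picked]  # E leaves U
    return L

# Toy data (hypothetical): features 0-1 and 2-3 form two redundant views.
L0 = [((0.0, 0.0, 0.0, 0.0), 0), ((10.0, 10.0, 10.0, 10.0), 1)]
U0 = [(1.0, 1.0, 1.0, 1.0), (9.0, 9.0, 9.0, 9.0),
      (2.0, 1.0, 2.0, 1.0), (8.0, 9.0, 8.0, 9.0)]
labeled = dict(co_train(L0, U0, F1=[0, 1], F2=[2, 3]))
print(labeled)
```

Because every pass moves at least one example out of U, the loop terminates once the pool is exhausted. In practice a confidence threshold or an iteration cap is a common alternative stopping rule.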

  12. THINGS TO DO • How can we measure redundancy and use it efficiently? • Can we improve Co-training? • How can we apply RSM efficiently to: • Supervised learning • Semi-supervised learning • Unsupervised learning

  13. QUESTIONS?
