COP5992 – DATA MINING TERM PROJECT RANDOM SUBSPACE METHOD + CO-TRAINING by SELIM KALAYCI


RANDOM SUBSPACE METHOD (RSM)
• Proposed by Ho, “The Random Subspace Method for Constructing Decision Forests”, 1998
• Another technique for combining weak classifiers, like Bagging and Boosting

RSM ALGORITHM
1. Repeat for b = 1, 2, . . ., B:
   (a) Select an r-dimensional random subspace Xb from the original p-dimensional feature space X.
   (b) Construct a classifier Cb(x) in Xb.
2. Combine the classifiers Cb(x), b = 1, 2, . . ., B, by simple majority voting into a final decision rule.
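The RSM steps above can be sketched in a few lines of Python. This is a minimal illustration, not Ho's implementation: the nearest-mean base classifier and the names `train_nearest_mean` and `random_subspace_ensemble` are assumptions chosen to keep the example self-contained. Any weak classifier could be substituted.

```python
import random
from collections import Counter


def train_nearest_mean(X, y, features):
    """Hypothetical base classifier: nearest class mean on the chosen features."""
    counts = Counter(y)
    sums = {}
    for row, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(features))
        for k, j in enumerate(features):
            acc[k] += row[j]
    means = {c: [s / counts[c] for s in acc] for c, acc in sums.items()}

    def predict(x):
        def sq_dist(c):
            return sum((x[j] - m) ** 2 for j, m in zip(features, means[c]))
        return min(sorted(means), key=sq_dist)

    return predict


def random_subspace_ensemble(X, y, r, B, seed=0):
    """Train B classifiers, each on r features drawn at random from the p available."""
    rng = random.Random(seed)
    p = len(X[0])
    classifiers = []
    for _ in range(B):
        features = rng.sample(range(p), r)                        # step 1(a)
        classifiers.append(train_nearest_mean(X, y, features))   # step 1(b)

    def predict(x):
        votes = Counter(clf(x) for clf in classifiers)            # step 2: majority vote
        return votes.most_common(1)[0][0]

    return predict
```

For example, `random_subspace_ensemble(X, y, r=2, B=5)` returns a function that classifies a new feature vector by majority vote over five classifiers, each of which sees only two of the original features.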

MOTIVATION FOR RSM
• Redundancy in the data feature space
  • Completely redundant feature set
  • Redundancy is spread over many features
• Weak classifiers that have critical training sample sizes

RSM PERFORMANCE ISSUES
• RSM performance depends on:
  • Training sample size
  • The choice of a base classifier
  • The choice of combining rule (simple majority vs. weighted)
  • The degree of redundancy of the dataset
  • The number of features chosen
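The difference between the two combining rules mentioned above can be illustrated with a short sketch. The function names and the weights are hypothetical; in practice a weight might be each classifier's accuracy on held-out data, but the slides do not specify a weighting scheme.

```python
from collections import defaultdict


def weighted_vote(predictions, weights):
    """Return the label with the largest total weight (ties broken by label order)."""
    totals = defaultdict(float)
    for label, w in zip(predictions, weights):
        totals[label] += w
    return max(sorted(totals), key=totals.get)


def majority_vote(predictions):
    """Simple majority voting is weighted voting with equal weights."""
    return weighted_vote(predictions, [1.0] * len(predictions))
```

With predictions `["a", "a", "b"]`, simple majority picks `"a"`, while weights of `[0.2, 0.2, 0.9]` let the single stronger classifier win and the result is `"b"`.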

DECISION FORESTS (by Ho)
• A combination of trees instead of a single tree
• Assumption: the dataset has some redundant features
• Works efficiently with any decision tree algorithm and data splitting method
• Ideally, look for the best individual trees with the lowest tree similarity

UNLABELED DATA
• Small number of labeled documents
• Large pool of unlabeled documents
• How to classify unlabeled documents accurately?

EXPECTATION-MAXIMIZATION (E-M)

CO-TRAINING
• Blum and Mitchell, “Combining Labeled and Unlabeled Data with Co-Training”, 1998
• Requirements:
  • Two sufficiently strong feature sets
  • The two feature sets are conditionally independent given the class


APPLICATION OF CO-TRAINING TO A SINGLE FEATURE SET
Algorithm:
Obtain a small set L of labeled examples
Obtain a large set U of unlabeled examples
Obtain two sets F1 and F2 of features that are sufficiently redundant
While U is not empty do:
    Learn classifier C1 from L based on F1
    Learn classifier C2 from L based on F2
    For each classifier Ci do:
        Ci labels examples from U based on Fi
        Ci chooses the most confidently predicted examples E from U
        E is removed from U and added (with their given labels) to L
End loop
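The loop above can be sketched as follows. This is an illustration under stated assumptions, not the presenter's implementation: the nearest-mean classifier, the confidence measure (the gap between the two smallest squared distances), the `per_round` parameter, and all function names are hypothetical. The sketch also assumes that L contains at least two classes.

```python
def fit_means(L, features):
    """Learn per-class means from labeled pairs (x, label), using only `features`."""
    groups = {}
    for x, label in L:
        groups.setdefault(label, []).append([x[j] for j in features])
    return {label: [sum(col) / len(rows) for col in zip(*rows)]
            for label, rows in groups.items()}


def predict(means, features, x):
    """Return (label, confidence); confidence is the margin over the runner-up class."""
    scored = sorted(
        (sum((x[j] - m) ** 2 for j, m in zip(features, mean)), label)
        for label, mean in means.items()
    )
    (d_best, label), (d_next, _) = scored[0], scored[1]
    return label, d_next - d_best


def co_train(labeled, unlabeled, F1, F2, per_round=1):
    """Move examples from U to L until U is empty, alternating between F1 and F2."""
    L, U = list(labeled), list(unlabeled)
    while U:
        # Learn C1 and C2 from the current L, one per feature set.
        models = [(fit_means(L, F), F) for F in (F1, F2)]
        for means, F in models:
            if not U:
                break
            # Rank unlabeled examples by this classifier's confidence.
            ranked = sorted(range(len(U)),
                            key=lambda i: predict(means, F, U[i])[1],
                            reverse=True)
            # Remove the most confident examples from U and add them to L.
            for i in sorted(ranked[:per_round], reverse=True):
                x = U.pop(i)
                L.append((x, predict(means, F, x)[0]))
    return L
```

Calling `co_train(labeled, unlabeled, F1=[0, 1], F2=[2, 3])` returns the enlarged labeled set, where `F1` and `F2` are lists of feature indices. When only a single feature set is available, one option is to draw `F1` and `F2` as random subsets of it, in the spirit of RSM.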

THINGS TO DO
• How can we measure redundancy and use it efficiently?
• Can we improve Co-training?
• How can we apply RSM efficiently to:
  • Supervised learning
  • Semi-supervised learning
  • Unsupervised learning

QUESTIONS?
