##### What are the real challenges in data mining?


**What are the real challenges in data mining?**

Charles Elkan, University of California, San Diego, August 21, 2003

**Bogosity about learning with unbalanced data**

- "The goal is yes/no classification."
  - No: ranking, or probability estimation. Often P(c = minority | x) < 0.5 for all examples x.
- "Decision trees and C4.5 are well-suited."
  - No: model each class separately, then use Bayes' rule: P(c|x) = P(x|c)P(c) / [P(x|c)P(c) + P(x|~c)P(~c)].
  - No: avoid small disjuncts; with naive Bayes, P(x|c) = ∏_i P(x_i | c).
- "Under/over-sampling are appropriate."
  - No: do cost-based example-specific sampling, then bagging.
- "ROC curves and AUC are important."

**Learning to predict contact maps**

(Figure: a protein's 3D structure → distance map → binary contact map. Source: Paolo Frasconi et al.)

**Issues in contact map prediction**

- An ML researcher sees O(n²) non-contacts and O(n) contacts.
- But to a biologist, the concept "an example of a non-contact" is far from natural.
- Moreover, there is no natural probability distribution defining the population of "all" proteins.
- A statistician sees simply O(n²) distance measures, but s/he finds least-squares regression is useless!

**For the rooftop detection task …**

- "We used […] BUDDS to extract candidate rooftops (i.e., parallelograms) from six large-area images. Such processing resulted in 17,289 candidates, which an expert labeled as 781 positive examples and 17,048 negative examples of the concept 'rooftop.'"
- (Source: Learning When Data Sets Are Imbalanced and When Costs Are Unequal and Unknown, Marcus Maloof, this workshop.)

**How to detect faces in real time?**

- Viola and Jones, CVPR '01:
  - Slide a window over the image.
  - 45,396 features per window.
  - Learn a boosted decision-stump classifier.

**UCI datasets are small and not highly unbalanced**

(Source: C4.5 and Imbalanced Data Sets, Nitin Chawla, this workshop.)

**Features of the DMEF and similar datasets**

- At least 10^5 examples and 10^2.5 features.
- No single well-defined target class.
- Interesting cases have frequency < 0.01.
- Much information on costs and benefits, but no overall model of profit/loss.
- Different cost matrices for different examples.
- Most cost-matrix entries are unknown.

**Example-dependent costs and benefits**

- Observations:
  - Loss or profit depends on the transaction size x.
  - Figuring out the full profit/loss model is hard.
  - Opportunity costs are confusing.
  - Creative management transforms costs into benefits.
- How do we account for long-term costs and benefits?

**Correct decisions require correct probabilities**

- Let p = P(legitimate). The optimal decision is "approve" iff
  0.01xp - (1-p)x > (-20)p + (-10)(1-p).
- This calculation requires well-calibrated estimates of p.

**ROC curves considered harmful**

(Source: Medical College of Georgia.)

- "AUC can give a general idea of the quality of the probabilistic estimates produced by the model."
  - No, AUC evaluates only the ranking produced.
- "Cost curves are equivalent to ROC curves."
  - No, a single point on the ROC curve is optimal only if costs are the same for all examples.
- Advice: use $ profit to compare methods.
- Issue: when is a $ difference statistically significant?

**Usually we must learn a model to estimate costs**

- Cost matrix for soliciting donors to a charity.
- The donation amount x is always unknown for test examples, so we must use the training data to learn a regression model that predicts x.

**So, we learn a model to estimate costs …**

- Issue: the subset of the training set with x > 0 is a skewed sample for learning a model to estimate x.
- Reason: donation amount x and probability of donation p are inversely correlated.
- Hence, the training set contains too few examples of large donations compared to small ones.

**The "reject inference" problem**

- Let humans make credit grant/deny decisions.
- Collect data about repay/write-off outcomes, but only for people to whom credit is granted.
- Learn a model from this training data.
- Apply the model to all future applicants.
- Issue: "all future applicants" is a sample from a different population than "people to whom credit is granted."

**Selection bias makes training labels incorrect**

- In the Wisconsin Prognostic Breast Cancer Database, average survival time with chemotherapy is lower (58.9 months) than without (63.1 months)!
- Historical actions are not optimal, but they are not chosen randomly either.
- (Source: William H. Wolberg, M.D.)

**Sequences of training sets**

- Use data collected in 2000 to learn a model; apply this model to select inside the 2001 population.
- Use data about the individuals selected in 2001 to learn a new model; apply this model in 2002.
- And so on…
- Each time a new model is learned, its training set has been created using a different selection bias.

**Let's use the word "unbalanced" in the future**

- Google: searched the web for "imbalanced" … about 53,800 results.
- Searched the web for "unbalanced" … about 465,000 results.

**References**

- C. Elkan. The Foundations of Cost-Sensitive Learning. IJCAI'01, pp. 973-978.
- B. Zadrozny and C. Elkan. Learning and Making Decisions When Costs and Probabilities Are Both Unknown. KDD'01, pp. 204-213.
- B. Zadrozny and C. Elkan. Obtaining Calibrated Probability Estimates from Decision Trees and Naive Bayesian Classifiers. ICML'01, pp. 609-616.
- N. Abe et al. Empirical Comparison of Various Reinforcement Learning Strategies for Sequential Targeted Marketing. ICDM'02.
- B. Zadrozny, J. Langford, and N. Abe. Cost-Sensitive Learning by Cost-Proportionate Example Weighting. ICDM'03.
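The approve/deny rule from the "Correct decisions require correct probabilities" slide can be made concrete. Below is a minimal sketch under the slide's cost model: approving earns a 1% fee (0.01x) on a legitimate transaction and loses x on a fraudulent one, while denying costs 20 for a legitimate transaction and 10 for a fraudulent one. The function name is illustrative, not from the talk.

```python
def approve(p_legitimate, x):
    """Expected-value decision for a transaction of size x, given a
    calibrated estimate p of P(legitimate).  Approve iff the expected
    value of approving exceeds the expected value of denying:
        0.01*x*p - (1-p)*x  >  (-20)*p + (-10)*(1-p)
    """
    p = p_legitimate
    ev_approve = 0.01 * x * p - (1 - p) * x   # fee if legitimate, loss if fraud
    ev_deny = (-20) * p + (-10) * (1 - p)     # fixed costs of denying
    return ev_approve > ev_deny
```

Solving for the break-even point gives p > (x - 10) / (1.01x + 10), which rises with x: a $10 transaction is approved even at p = 0.5, while a $1000 transaction requires p above roughly 0.97. This is why well-calibrated estimates of p, and not merely a good ranking, are needed: no single probability threshold is optimal for all examples.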
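The prescription "do cost-based example-specific sampling, then bagging" corresponds to the cost-proportionate example weighting method of the Zadrozny, Langford, and Abe reference above (ICDM'03). The following is a minimal sketch of that idea, assuming a generic `train` callback that fits one base model on a sample; the helper names are mine, not from the paper.

```python
import random

def rejection_sample(examples, costs, rng):
    """Cost-proportionate rejection sampling: keep each example with
    probability cost / max_cost, so high-cost examples are retained
    more often.  Assumes at least one cost is positive."""
    z = max(costs)
    return [ex for ex, c in zip(examples, costs) if rng.random() < c / z]

def costing(examples, costs, train, n_models=10, seed=0):
    """Fit an ensemble: each member is trained on an independent
    cost-proportionate sample, and predictions are combined by
    majority vote (the bagging step)."""
    rng = random.Random(seed)
    models = [train(rejection_sample(examples, costs, rng))
              for _ in range(n_models)]

    def predict(x):
        votes = [m(x) for m in models]
        return max(set(votes), key=votes.count)  # majority vote

    return predict
```

Unlike plain under/over-sampling, the sampling probability here is example-specific, driven by each example's cost rather than by its class label alone, which matches the slide's point that different examples carry different cost matrices.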