Presentation Transcript

  1. What are the real challenges in data mining? Charles Elkan University of California, San Diego August 21, 2003

  2. Bogosity about learning with unbalanced data • The goal is yes/no classification. • No: ranking, or probability estimation • Often, P(c=minority|x) < 0.5 for all examples x • Decision trees and C4.5 are well-suited • No: model each class separately, then use Bayes’ rule • P(c|x) = P(x|c)P(c) / [ P(x|c)P(c) + P(x|~c)P(~c) ] • No: avoid small disjuncts • With naïve Bayes: P(x|c) = ∏_i P(x_i|c) • Under/over-sampling are appropriate • No: do cost-based example-specific sampling, then bagging • ROC curves and AUC are important
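
The Bayes' rule and naïve Bayes formulas on this slide can be sketched as follows. This is a minimal illustration, not the author's code: it assumes binary features, and the prior and per-feature probabilities are made-up numbers chosen only to show that the minority posterior can stay below 0.5 even when every feature favors the minority class.

```python
from math import prod

def naive_bayes_likelihood(x, feature_probs):
    """P(x|c) = prod_i P(x_i|c) for binary features.

    feature_probs[i] is P(x_i = 1 | c); these values are assumed inputs.
    """
    return prod(q if xi else 1.0 - q for xi, q in zip(x, feature_probs))

def posterior(x, prior_c, probs_c, probs_not_c):
    """P(c|x) by Bayes' rule, with each class modelled separately."""
    num = naive_bayes_likelihood(x, probs_c) * prior_c
    den = num + naive_bayes_likelihood(x, probs_not_c) * (1.0 - prior_c)
    return num / den

# Hypothetical numbers: minority prior of 1%, two binary features,
# both much more likely under the minority class c.
p = posterior([1, 1], prior_c=0.01, probs_c=[0.8, 0.7], probs_not_c=[0.1, 0.2])
print(p)
```

With these assumed numbers the posterior is well below 0.5, so a yes/no classifier would never predict the minority class, while the probability itself is still useful for ranking or for cost-based decisions.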

  3. Learning to predict contact maps • [Figure panels: 3D protein, distance map, binary contact map] (Source: Paolo Frasconi et al.)

  4. Issues in contact map prediction • An ML researcher sees O(n^2) non-contacts and O(n) contacts. • But to a biologist, the concept “an example of a non-contact” is far from natural. • Moreover, there is no natural probability distribution defining the population of “all” proteins. • A statistician sees simply O(n^2) distance measures, but s/he finds least-squares regression is useless!

  5. For the rooftop detection task … • We used […] BUDDS, to extract candidate rooftops (i.e., parallelograms) from six large-area images. Such processing resulted in 17,829 candidates, which an expert labeled as 781 positive examples and 17,048 negative examples of the concept “rooftop.” • (Source: Learning When Data Sets are Imbalanced and When Costs are Unequal and Unknown, Marcus Maloof, this workshop.)

  6. How to detect faces in real time? • Viola and Jones, CVPR ’01: • Slide window over image • 45,396 features per window • Learn boosted decision-stump classifier

  7. UCI datasets are small and not highly unbalanced (Source: C4.5 and Imbalanced Data Sets, Nitin Chawla, this workshop.)

  8. Features of the DMEF and similar datasets • At least 10^5 examples and 10^2.5 features. • No single well-defined target class. • Interesting cases have frequency < 0.01. • Much information on costs and benefits, but no overall model of profit/loss. • Different cost matrices for different examples • Most cost matrix entries are unknown.

  9. Example-dependent costs and benefits • Observations: • Loss or profit depends on the transaction size x. • Figuring out the full profit/loss model is hard. • Opportunity costs are confusing. • Creative management transforms costs into benefits. • How do we account for long-term costs and benefits?

  10. Correct decisions require correct probabilities • Let p = P(legitimate). The optimal decision is “approve” iff • 0.01xp - (1-p)x > (-20)p + (-10)(1-p) • This calculation requires well-calibrated estimates of p.
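
The inequality on this slide can be written out as a small decision function. This is a sketch under the slide's own payoffs (0.01x and -x for approving, -20 and -10 for denying); the function names and the example transaction size of 1000 are illustrative assumptions, and x is taken to be the transaction size mentioned on the previous slide.

```python
def expected_value_approve(p, x):
    """Expected payoff of approving: earn 0.01*x if legitimate, lose x otherwise."""
    return 0.01 * x * p - (1.0 - p) * x

def expected_value_deny(p):
    """Expected payoff of denying: -20 if legitimate, -10 otherwise."""
    return -20.0 * p + -10.0 * (1.0 - p)

def approve(p, x):
    """Approve iff approving has the higher expected payoff."""
    return expected_value_approve(p, x) > expected_value_deny(p)

# Hypothetical transaction of size 1000 with two nearby estimates of p.
print(approve(0.98, 1000))
print(approve(0.95, 1000))
```

For this assumed transaction, moving the estimate of p from 0.98 to 0.95 flips the decision from approve to deny, which is why the slide insists on well-calibrated probabilities.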

  11. ROC curves considered harmful (Source: Medical College of Georgia.) • “AUC can give a general idea of the quality of the probabilistic estimates produced by the model” • No, AUC only evaluates the ranking produced. • “Cost curves are equivalent to ROC curves” • No, a single point on the ROC curve is optimal only if costs are the same for all examples. • Advice: Use $ profit to compare methods. • Issue: When is $ difference statistically significant?
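
The point that AUC only evaluates ranking can be checked with a small sketch. The data below are invented for illustration, AUC is computed by direct pair counting, and the Brier score (mean squared error of the probabilities) is swapped in here as one possible calibration-sensitive measure; the slide itself does not name one.

```python
def auc(scores, labels):
    """Fraction of (positive, negative) pairs ranked correctly; ties count 1/2."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def brier(scores, labels):
    """Mean squared difference between predicted probability and outcome."""
    return sum((s - y) ** 2 for s, y in zip(scores, labels)) / len(labels)

# Hypothetical labels and probability estimates.
labels = [1, 0, 1, 0, 0]
estimates = [0.9, 0.7, 0.6, 0.2, 0.1]
# A strictly increasing distortion: same ranking, different probabilities.
distorted = [s ** 4 for s in estimates]

print(auc(estimates, labels), auc(distorted, labels))
print(brier(estimates, labels), brier(distorted, labels))
```

Both score lists give the same AUC because the ordering is unchanged, while the Brier scores differ, so AUC alone says nothing about whether the probabilities are usable in a cost calculation.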

  12. Usually we must learn a model to estimate costs • Cost matrix for soliciting donors to a charity. • The donation amount x is always unknown for test examples, so we must use the training data to learn a regression model to predict x.
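
A possible shape for the resulting decision rule is sketched below. The slide's actual cost matrix is not reproduced in this transcript, so the sketch assumes a simple one: soliciting costs a fixed mailing cost and yields the donation x if the person donates, while not soliciting yields nothing. The function names, the mailing cost, and the example numbers are all hypothetical.

```python
def expected_gain_from_soliciting(p_donate, predicted_amount, mailing_cost):
    """Expected net gain under the assumed cost matrix.

    p_donate: estimated probability of donating (from a classifier).
    predicted_amount: regression estimate of the donation x.
    mailing_cost: assumed fixed cost of one solicitation.
    """
    return p_donate * predicted_amount - mailing_cost

def should_solicit(p_donate, predicted_amount, mailing_cost):
    """Solicit iff the expected gain beats the zero payoff of not soliciting."""
    return expected_gain_from_soliciting(p_donate, predicted_amount, mailing_cost) > 0.0

# Hypothetical people with the same donation probability but different predicted amounts.
print(should_solicit(0.05, 20.0, mailing_cost=0.68))
print(should_solicit(0.05, 10.0, mailing_cost=0.68))
```

Under these assumptions the decision depends on the product of two learned quantities, so errors in either the probability estimate or the predicted amount change who is solicited.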

  13. So, we learn a model to estimate costs … • Issue: The subset in the training set with x > 0 is a skewed sample for learning a model to estimate x. • Reason: Donation amount x and probability of donation p are inversely correlated. • Hence, the training set contains too few examples of large donations, compared to small ones.

  14. The “reject inference” problem • Let humans make credit grant/deny decisions. • Collect data about repay/write-off, but only for people to whom credit is granted. • Learn a model from this training data. • Apply the model to all future applicants. • Issue: “All future applicants” is a sample from a different population than “people to whom credit is granted.”

  15. Selection bias makes training labels incorrect • In the Wisconsin Prognostic Breast Cancer Database, average survival time with chemotherapy is lower (58.9 months) than without (63.1)! • Historical actions are not optimal, but they are not chosen randomly either. (Source: William H. Wolberg, M.D.)

  16. Sequences of training sets • Use data collected in 2000 to learn a model; apply this model to select inside the 2001 population. • Use data about the individuals selected in 2001 to learn a new model; apply this model in 2002. • And so on… • Each time a new model is learned, its training set has been created using a different selection bias.

  17. Let’s use the word “unbalanced” in the future • Google: Searched the web for imbalanced.  … about 53,800. • Searched the web for unbalanced.  … about 465,000.

  18. C. Elkan. The Foundations of Cost-Sensitive Learning. IJCAI'01, pp. 973-978. • B. Zadrozny and C. Elkan. Learning and Making Decisions When Costs and Probabilities are Both Unknown. KDD'01, pp. 204-213. • B. Zadrozny and C. Elkan. Obtaining Calibrated Probability Estimates from Decision Trees and Naive Bayesian Classifiers. ICML'01, pp. 609-616. • N. Abe et al. Empirical Comparison of Various Reinforcement Learning Strategies for Sequential Targeted Marketing. ICDM'02. • B. Zadrozny, J. Langford, and N. Abe. Cost-Sensitive Learning by Cost-Proportionate Example Weighting. ICDM'03.