Wei Fan Ian Davidson

On Sample Selection Bias and Its Efficient Correction via Model Averaging and Unlabeled Examples Wei Fan Ian Davidson

A Toy Example • Two classes: red and green • red: f2 > f1 • green: f2 <= f1

Unbiased and Biased Samples • Figure panels: “Not-so-biased sampling” and “Biased sampling”

Effect on Learning • Accuracy labels from the figure, in slide order: • Unbiased: 96.9%, 97.1%, 96.405% • Biased: 95.9%, 92.7%, 92.1% • Some techniques are more sensitive to bias than others. • One important question: • How to reduce the effect of sample selection bias?

Normally, banks only have data on their own customers • “Late payment, default” models are computed using their own data • New customers may not completely follow the same distribution. • Is the New Century “sub-prime mortgage” bankruptcy due to bad modeling?

Ubiquitous • Loan Approval • Drug Screening • Weather Forecasting • Ad Campaign • Fraud Detection • User Profiling • Biomedical Informatics • Intrusion Detection • Insurance • etc.

Bias as Distribution • Think of “sampling an example (x,y) into the training data” as an event denoted by random variable s • s=1: example (x,y) is sampled into the training data • s=0: example (x,y) is not sampled. • Think of bias as a conditional probability of “s=1” dependent on x and y • P(s=1|x,y) : the probability for (x,y) to be sampled into the training data, conditional on the example’s feature vector x and class label y.
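The view of bias as a conditional probability P(s=1|x,y) can be simulated directly. The following is a minimal sketch on the toy red/green problem; the function names, the selection probabilities, and the reading of class bias as "selection depends on y only" are assumptions made for illustration, not values from the talk.

```python
import random

def sample_with_bias(universe, p_select, seed=0):
    """Keep (x, y) with probability p_select(x, y): s ~ Bernoulli(P(s=1|x,y))."""
    rng = random.Random(seed)
    return [(x, y) for x, y in universe if rng.random() < p_select(x, y)]

def make_universe(n=10000, seed=1):
    """Toy universe: x = (f1, f2) uniform on the unit square; red if f2 > f1."""
    rng = random.Random(seed)
    points = [(rng.random(), rng.random()) for _ in range(n)]
    return [((f1, f2), "red" if f2 > f1 else "green") for f1, f2 in points]

# Hypothetical selection mechanisms (all numbers are made up).
def no_bias(x, y):
    return 0.2                               # independent of x and y

def feature_bias(x, y):
    return 0.4 if x[0] < 0.5 else 0.05       # depends on x only

def class_bias(x, y):
    return 0.35 if y == "red" else 0.05      # depends on y only

def fraction_red(sample):
    return sum(1 for _, y in sample if y == "red") / len(sample)

universe = make_universe()
for name, p in [("none", no_bias), ("feature", feature_bias), ("class", class_bias)]:
    d = sample_with_bias(universe, p)
    print(name, len(d), round(fraction_red(d), 3))
```

Comparing the printed class fractions against the roughly balanced universe shows how each mechanism distorts the training sample.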

Bias for a Training Set • How is P(s=1|x,y) computed? • Practically, for a given training set D: • P(s=1|x,y) = 1 if (x,y) is sampled into D • P(s=1|x,y) = 0 otherwise • Alternatively, consider every D of a given size that can be sampled “exhaustively” from the universe of examples.

Are Realistic Datasets Biased? • Most datasets are biased. • It is unlikely that each and every feature vector is sampled. • For most problems, there is at least feature bias: • P(s=1|x,y) = P(s=1|x)

Effect on Learning • Learning algorithms estimate the “true conditional probability” • True probability: P(y|x), such as P(fraud|x) • Estimated probability: P(y|x,M), where M is the model built. • Conditional probability in the biased data: • P(y|x,s=1) • Key Issue: • Is P(y|x,s=1) = P(y|x)? • At least for those sampled examples.

Appropriate Assumptions • There are more “good” training examples under feature bias than under either class bias or complete bias. • “Good”: P(y|x,s=1) = P(y|x) • Beware: it is “incorrect” to conclude that P(y|x,s=1) = P(y|x), except under some restricted situations that rarely happen. • For class bias and complete bias, it is hard to derive anything. • It is hard to make any more detailed claims without knowing more about both the sampling process and the true function.

Categorizing into the exact type is difficult. • You don’t know what you don’t know. • Not that bad, since the key issue is the number of examples with “bad” conditional probability. • Small • Large

“Small” Solutions • Average the estimated class probabilities, weighted by the posterior: • P(y|x) = ∫ P(y|x,M) P(M|D) dM • P(M|D): posterior weighting • ∫ ... dM: integration over model space • P(y|x,M): class probability estimated by model M • Averaging removes model uncertainty.
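In practice the integral over model space is approximated by a finite set of models. The sketch below averages per-model class probability estimates, optionally weighted by posterior weights; the function name and all numbers are hypothetical, and uniform weights are used when no posterior is supplied.

```python
def average_class_probabilities(estimates, weights=None):
    """Average per-model estimates P(y|x, M_k), weighted by P(M_k|D).

    estimates: list of dicts mapping label -> probability, one per model.
    weights:   optional posterior weights; uniform if omitted.
    """
    if weights is None:
        weights = [1.0] * len(estimates)
    total = sum(weights)
    labels = estimates[0].keys()
    return {
        y: sum(w * est[y] for w, est in zip(weights, estimates)) / total
        for y in labels
    }

# Three hypothetical models scoring the same example x.
estimates = [
    {"+": 0.9, "-": 0.1},
    {"+": 0.6, "-": 0.4},
    {"+": 0.3, "-": 0.7},
]
uniform = average_class_probabilities(estimates)
weighted = average_class_probabilities(estimates, weights=[0.5, 0.3, 0.2])
print(uniform, weighted)
```

A single badly biased model (here the third one) is outvoted rather than trusted outright, which is the point of not hanging on one tree.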

The paper proves that the expected error of model averaging is less than that of any single model being combined. • What this says: • Compute many models in different ways • Don’t hang on one tree

“Large” Solutions • When too many base models’ estimates are off track, the power of model averaging is limited. • In this case, we need to make smart use of “unlabeled examples” that are unbiased. • Reasonable assumption: unlabeled examples are usually plentiful and easier to get.

How to Use Them • Estimate the “joint probability” P(x,y) instead of just the conditional probability, i.e., • P(x,y) = P(y|x)P(x) • This makes no difference with one model, but it does with multiple models.

Suppose two P(y|x) models estimate P(+|x) = 0.8 and P(+|x) = 0.4 (so P(-|x) = 0.2 and 0.6). • Recall with model averaging: • P(+|x) = (0.8 + 0.4)/2 = 0.6 and P(-|x) = (0.2 + 0.6)/2 = 0.4 • Prediction is + • But if there are also two P(x) models, with probabilities 0.05 and 0.4, then • P(+,x) = 0.05 * 0.8 + 0.4 * 0.4 = 0.2 • P(-,x) = 0.05 * 0.2 + 0.4 * 0.6 = 0.25 • Now the prediction is - instead of + • Key Idea: • Unlabeled examples can be used as “weights” to re-weight the models.
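The arithmetic on this slide can be checked with a few lines of Python. The two conditional estimates (P(+|x) = 0.8 and 0.4) are inferred from the slide's own products and average; the function names are illustrative only.

```python
# Per-model conditional estimates P(y|x, M_k), inferred from the slide's arithmetic.
cond = [{"+": 0.8, "-": 0.2}, {"+": 0.4, "-": 0.6}]
# Per-model feature probability estimates P(x | M_k), as given on the slide.
px = [0.05, 0.4]

def conditional_average(cond):
    """Plain model averaging of P(y|x)."""
    return {y: sum(c[y] for c in cond) / len(cond) for y in cond[0]}

def joint_sum(cond, px):
    """Sum of P(y|x, M_k) * P(x | M_k) over models, as on the slide."""
    return {y: sum(p * c[y] for p, c in zip(px, cond)) for y in cond[0]}

def predict(scores):
    """Pick the label with the highest score."""
    return max(scores, key=scores.get)

avg = conditional_average(cond)
joint = joint_sum(cond, px)
print(avg, predict(avg))      # conditional averaging favours "+"
print(joint, predict(joint))  # joint averaging favours "-"
```

The second model assigns x a much higher P(x), so its conditional estimate dominates the joint score and flips the prediction.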

Improve P(y|x) • Use a semi-supervised discriminant learning procedure (Vittaut et al, 2002) • Basic procedure: • Use learned models to predict unlabeled examples. • Use a random sample of “predicted” unlabeled examples to combine with labeled training data • Re-train the model • Repeat until the predictions on unlabeled examples remain stable.
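The loop described on this slide can be sketched as follows. This is not the procedure of Vittaut et al. in detail: a nearest-centroid learner is swapped in as a stand-in base model, and `sample_size`, `max_rounds`, and the data are assumptions chosen for illustration.

```python
import random

def fit_centroids(data):
    """Stand-in learner: one centroid per class from ((f1, f2), label) pairs."""
    sums = {}
    for (f1, f2), y in data:
        s = sums.setdefault(y, [0.0, 0.0, 0])
        s[0] += f1
        s[1] += f2
        s[2] += 1
    return {y: (s[0] / s[2], s[1] / s[2]) for y, s in sums.items()}

def predict(centroids, x):
    """Label of the nearest centroid (squared Euclidean distance)."""
    return min(centroids, key=lambda y: (x[0] - centroids[y][0]) ** 2
                                        + (x[1] - centroids[y][1]) ** 2)

def self_train(labeled, unlabeled, sample_size=50, max_rounds=20, seed=0):
    rng = random.Random(seed)
    model = fit_centroids(labeled)
    previous = [predict(model, x) for x in unlabeled]
    for _ in range(max_rounds):
        # Random sample of "predicted" unlabeled examples.
        picked = rng.sample(range(len(unlabeled)), min(sample_size, len(unlabeled)))
        pseudo = [(unlabeled[i], previous[i]) for i in picked]
        # Re-train on labeled data plus the pseudo-labeled sample.
        model = fit_centroids(labeled + pseudo)
        current = [predict(model, x) for x in unlabeled]
        if current == previous:   # predictions are stable: stop
            break
        previous = current
    return model, previous

# Tiny biased labeled set (all f1 < 0.5) and an unbiased unlabeled pool.
labeled = [((0.1, 0.3), "red"), ((0.2, 0.9), "red"),
           ((0.3, 0.1), "green"), ((0.4, 0.2), "green")]
pool_rng = random.Random(1)
unlabeled = [(pool_rng.random(), pool_rng.random()) for _ in range(500)]
model, preds = self_train(labeled, unlabeled)
```

The stopping test compares predictions on the whole unlabeled pool between rounds; `max_rounds` guards against a loop that never settles.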

Experiments • Feature Bias Generation • Sort the examples according to feature values • “Chop off” the top.
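A minimal sketch of this generation step, assuming that “the top” means the examples with the largest values of one chosen feature; `feature_index` and `keep_fraction` are hypothetical parameters, not settings reported in the talk.

```python
def feature_biased_sample(data, feature_index, keep_fraction):
    """Sort by one feature, then drop the examples with the largest values."""
    ordered = sorted(data, key=lambda example: example[0][feature_index])
    keep = int(len(ordered) * keep_fraction)
    return ordered[:keep]

# Ten toy examples whose first feature runs from 0.0 to 9.0.
data = [((float(i), 0.0), "red" if i % 2 else "green") for i in range(10)]
biased = feature_biased_sample(data, feature_index=0, keep_fraction=0.5)
print([x[0] for x, _ in biased])
```

Because selection looks only at x, the resulting sample has feature bias: P(s=1|x,y) = P(s=1|x).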

Class Bias Generation • Class bins: one bin per class (e.g., + and -) • Randomly generate a prior class probability distribution P(y). • Just the numbers, such as P(+)=0.1 and P(-)=0.9 • Sample without replacement from the “class bins”
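The class-bin procedure might look like the sketch below. How the random prior is drawn (normalised uniform numbers) and how per-class counts are rounded are assumptions; the slide does not specify either.

```python
import random
from collections import defaultdict

def class_biased_sample(data, sample_size, seed=0):
    """Draw a random class prior P(y), then sample each class bin accordingly."""
    rng = random.Random(seed)
    bins = defaultdict(list)
    for x, y in data:
        bins[y].append((x, y))
    labels = sorted(bins)
    # Assumed scheme: normalise uniform random numbers into a prior.
    raw = [rng.random() for _ in labels]
    prior = {y: r / sum(raw) for y, r in zip(labels, raw)}
    sample = []
    for y in labels:
        # Never ask for more examples than the bin holds.
        n = min(round(prior[y] * sample_size), len(bins[y]))
        sample.extend(rng.sample(bins[y], n))   # without replacement
    return prior, sample

# 100 toy examples, 50 per class, each with a unique feature vector.
data = [((float(i),), "+" if i % 2 else "-") for i in range(100)]
prior, sample = class_biased_sample(data, sample_size=40)
print(prior, len(sample))
```

Selection here depends on the label alone, so the class proportions in the sample follow the generated prior rather than the original data.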

Complete Bias Generation • Recall: the probability of sampling an example (x,y) depends on both x and y. • Easiest simulation: • Sample (x,y) without replacement from the training data.
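The “easiest simulation” amounts to drawing a subset of whole examples. A minimal sketch, with a hypothetical function name and sample size:

```python
import random

def complete_biased_sample(data, sample_size, seed=0):
    """Draw whole (x, y) examples without replacement from the training data."""
    rng = random.Random(seed)
    return rng.sample(data, sample_size)

data = [((float(i), float(i % 7)), "+" if i % 3 else "-") for i in range(60)]
subset = complete_biased_sample(data, sample_size=15)
print(len(subset))
```

Each draw fixes x and y together, so which feature vectors and which labels end up in the sample cannot be separated.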

Feature Bias

Datasets • Adult: 2 classes • SJ: 3 classes • SS: 3 classes • Pendig: 10 classes • ArtiChar: 10 classes • Query: 4 classes • Donation: 2 classes, cost-sensitive • Credit Card: 2 classes, cost-sensitive

Winners and Losers • A single model *never wins* • Winners under feature bias: • Model averaging *with or without* improved conditional probability using unlabeled examples • Joint probability averaging with *uncorrelated* P(y|x) and P(x) models (details in paper) • Winners under class bias: • Joint probability averaging with *correlated* P(y|x) and *improved* P(x) models. • Winner under complete bias: • Model averaging with improved P(y|x)

Summary • According to our definition, sample selection bias is ubiquitous. • Categorizing sample selection bias into 4 types is useful for analysis, but hard to use in practice. • In practice, the key question is the relative number of examples with inaccurate P(y|x). • Small: use model averaging of the conditional probabilities of several models • Medium: use model averaging of improved conditional probabilities • Large: use joint probability averaging of uncorrelated conditional probability and feature probability

When the number is small • The paper proves that the expected error of model averaging is less than that of any single model being combined. • What this says: • Compute models in different ways • Don’t hang yourself on one tree
