**On Sample Selection Bias and Its Efficient Correction via Model Averaging and Unlabeled Examples** — Wei Fan, Ian Davidson

**A Toy Example** • Two classes: red and green • red: f2 > f1 • green: f2 <= f1

**Unbiased and Biased Samples** • (Figure panels: "not-so-biased" sampling vs. biased sampling)

**Effect on Learning** • (Figure: accuracies of three learners — unbiased 96.9%, 97.1%, 96.405% vs. biased 95.9%, 92.7%, 92.1%) • Some techniques are more sensitive to bias than others. • One important question: how to reduce the effect of sample selection bias?

**Normally, banks only have data of their own customers** • "Late payment, default" models are computed using their own data. • New customers may not completely follow the same distribution. • Was the New Century "sub-prime mortgage" bankruptcy due to bad modeling?

**Ubiquitous** • Loan Approval • Drug Screening • Weather Forecasting • Ad Campaigns • Fraud Detection • User Profiling • Biomedical Informatics • Intrusion Detection • Insurance • etc.

**Bias as Distribution** • Think of “sampling an example (x,y) into the training data” as an event denoted by random variable s • s=1: example (x,y) is sampled into the training data • s=0: example (x,y) is not sampled. • Think of bias as a conditional probability of “s=1” dependent on x and y • P(s=1|x,y) : the probability for (x,y) to be sampled into the training data, conditional on the example’s feature vector x and class label y.

**Categorization** • From Zadrozny’04 • No Sample Selection Bias • P(s=1|x,y) = P(s=1) • Feature Bias • P(s=1|x,y) = P(s=1|x) • Class Bias • P(s=1|x,y) = P(s=1|y) • Complete Bias • No more reduction
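The three named bias types can be simulated directly by choosing what P(s=1|x,y) depends on. A minimal Python sketch over a hypothetical one-feature universe (the universe, probabilities, and thresholds below are illustrative, not from the paper):

```python
import random

random.seed(0)

# Hypothetical universe: feature x in [0, 1), label y = 1 iff x > 0.5.
universe = [(i / 100.0, int(i / 100.0 > 0.5)) for i in range(100)]

def sample(universe, p_select):
    """Keep (x, y) with probability P(s=1|x,y) = p_select(x, y)."""
    return [(x, y) for (x, y) in universe if random.random() < p_select(x, y)]

no_bias      = sample(universe, lambda x, y: 0.5)                # P(s=1|x,y) = P(s=1)
feature_bias = sample(universe, lambda x, y: x)                  # P(s=1|x,y) = P(s=1|x)
class_bias   = sample(universe, lambda x, y: 0.9 if y else 0.1)  # P(s=1|x,y) = P(s=1|y)
```

Feature bias over-represents large x; class bias over-represents the positive class.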

**Bias for a Training Set** • How P(s=1|x,y) is computed • Practically, for a given training set D • P(s=1|x,y) = 1: if (x,y) is sampled into D • P(s=1|x,y) = 0: otherwise • Alternatively, consider all datasets of size |D| that could be sampled "exhaustively" from the universe of examples.

**Realistic Datasets are biased?** • Most datasets are biased. • Unlikely to sample each and every feature vector. • For most problems, it is at least feature bias. • P(s=1|x,y) = P(s=1|x)

**Effect on Learning** • Learning algorithms estimate the "true conditional probability" • True probability P(y|x), such as P(fraud|x) • Estimated probability P(y|x,M): M is the model built • Conditional probability in the biased data: P(y|x,s=1) • Key issue: is P(y|x,s=1) = P(y|x)? • At least for the sampled examples.

**Appropriate Assumptions** • More “good training examples” in “feature bias” than both “class bias” and “complete bias”. • “good”: P(y|x,s=1) = P(y|x) • beware: it is “incorrect” to conclude that P(y|x,s=1) = P(y|x) unless under some restricted situations that can rarely happen. • For class bias and complete bias, it is hard to derive anything. • It is hard to make any more detailed claims without knowing more about • Both the sampling process • The true function.

**Categorizing into the exact type is difficult.** • You don’t know what you don’t know. • Not that bad, since the key issue is the number of examples with “bad” conditional probability. • Small • Large

**"Small" Solutions** • Model averaging: average the estimated class probabilities of several models, weighted by each model's posterior. • Posterior weighting integrates over model space: P(y|x,D) = Σ_M P(y|x,M) P(M|D) • Removes model uncertainty by averaging.

**Prove that the expected error of model averaging is less than that of any single model it combines.** • What this says: • Compute many models in different ways • Don't hang yourself on one tree
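A weaker but easy-to-check version of this intuition is Jensen's inequality: the squared error of the averaged estimate never exceeds the average of the individual squared errors. A toy Python check (the numbers are made up; this is intuition only, not the paper's theorem):

```python
# Toy check: error of the average vs. average of the errors (Jensen's inequality).
truth = 0.7                      # hypothetical true P(+|x)
estimates = [0.8, 0.4, 0.9]      # three hypothetical models' estimates of P(+|x)

avg = sum(estimates) / len(estimates)
err_of_avg = (avg - truth) ** 2
avg_of_errs = sum((e - truth) ** 2 for e in estimates) / len(estimates)
# err_of_avg can never exceed avg_of_errs, for any truth and any estimates.
```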

**"Large" Solutions** • When too many base models' estimates are off track, the power of model averaging is limited. • In this case, we need to smartly use "unlabeled examples", which are unbiased. • Reasonable assumption: unlabeled examples are usually plentiful and easier to get.

**How to Use Them** • Estimate the "joint probability" P(x,y) instead of just the conditional probability, i.e., • P(x,y) = P(y|x)P(x) • Makes no difference with a single model, but it matters with multiple models.

**Examples of How This Works** • P1(+|x) = 0.8 and P2(+|x) = 0.4 • P1(-|x) = 0.2 and P2(-|x) = 0.6 • With model averaging, • P(+|x) = (0.8 + 0.4) / 2 = 0.6 • P(-|x) = (0.2 + 0.6) / 2 = 0.4 • Prediction will be +

**But if there are two P(x) models, estimating P(x) = 0.05 and 0.4** • Then • P(+,x) = 0.05 * 0.8 + 0.4 * 0.4 = 0.2 • P(-,x) = 0.05 * 0.2 + 0.4 * 0.6 = 0.25 • Recall with model averaging: • P(+|x) = 0.6 and P(-|x) = 0.4 • Prediction is + • But now the prediction will be – instead of + • Key idea: • Unlabeled examples can be used as "weights" to re-weight the models.
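The arithmetic in this worked example can be replayed directly in a few lines of Python:

```python
# Two conditional models and two density models for the same x (numbers from the slide).
p1_pos, p2_pos = 0.8, 0.4          # P1(+|x), P2(+|x)
p1_neg, p2_neg = 0.2, 0.6          # P1(-|x), P2(-|x)
px1, px2 = 0.05, 0.4               # the two models' estimates of P(x)

# Plain model averaging of conditional probabilities: predicts +.
avg_pos = (p1_pos + p2_pos) / 2    # 0.6
avg_neg = (p1_neg + p2_neg) / 2    # 0.4

# Joint probability averaging: each model's P(y|x) weighted by its own P(x).
joint_pos = px1 * p1_pos + px2 * p2_pos   # 0.2
joint_neg = px1 * p1_neg + px2 * p2_neg   # 0.25 -> now predicts -
```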

**Improve P(y|x)** • Use a semi-supervised discriminant learning procedure (Vittaut et al., 2002) • Basic procedure: • Use the learned models to predict the unlabeled examples. • Combine a random sample of the "predicted" unlabeled examples with the labeled training data. • Re-train the model. • Repeat until the predictions on the unlabeled examples remain stable.
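The basic procedure above can be sketched as a generic self-training loop. The `fit` interface, `sample_frac`, and the data format are assumptions of this sketch, not the authors' exact procedure:

```python
import random

random.seed(1)

def self_train(train, unlabeled, fit, sample_frac=0.2, max_iter=10):
    """Sketch of the semi-supervised loop: predict unlabeled examples,
    mix a random sample of self-labeled examples into the training data,
    re-train, and stop when the predictions stabilize.
    fit(data) -> model; model(x) -> predicted label."""
    model = fit(train)
    prev = None
    for _ in range(max_iter):
        preds = [model(x) for x in unlabeled]
        if preds == prev:          # predictions on unlabeled data are stable
            break
        prev = preds
        pseudo = random.sample(list(zip(unlabeled, preds)),
                               int(sample_frac * len(unlabeled)))
        model = fit(train + pseudo)
    return model
```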

**Experiments** • Feature Bias Generation • Sort the data according to feature values • "Chop off" the top.
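A minimal sketch of this generation scheme; the `keep_frac` parameter and data layout are assumptions, not the paper's exact cutoff:

```python
def feature_biased(data, feature_index, keep_frac=0.75):
    """Simulate feature bias: sort by one feature's values,
    then 'chop off' the top of the ranking."""
    ranked = sorted(data, key=lambda xy: xy[0][feature_index])
    return ranked[: int(keep_frac * len(ranked))]
```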

**Class Bias Generation** • (Figure: "+ – + –" class bins) • Randomly generate a prior class probability distribution P(y). • Just the numbers, such as P(+) = 0.1 and P(-) = 0.9 • Sample without replacement from the "class bins"
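One way to implement the class-bin sampling described above; the data format and helper names are illustrative:

```python
import random

random.seed(2)

def class_biased(data, priors, n):
    """Simulate class bias: put examples into per-class bins, then sample
    without replacement from each bin to match the generated priors P(y)."""
    bins = {}
    for x, y in data:
        bins.setdefault(y, []).append((x, y))
    sample = []
    for y, p in priors.items():
        sample += random.sample(bins[y], min(int(p * n), len(bins[y])))
    return sample
```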

**Complete Bias Generation** • Recall: the probability to sample an example (x,y) is dependent on both x and y. • Easiest simulation: • Sample (x,y) without replacement from the training data.
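The "easiest simulation" amounts to a single draw without replacement; the sample size here is a free parameter of the sketch:

```python
import random

random.seed(3)

def complete_biased(data, n):
    """Simulate complete bias by drawing n examples
    without replacement from the training data itself."""
    return random.sample(data, n)
```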

**Feature Bias**

**Datasets** • Adult: 2 classes • SJ: 3 classes • SS: 3 classes • Pendig: 10 classes • ArtiChar: 10 classes • Query: 4 classes • Donation: 2 classes, cost-sensitive • Credit Card: 2 classes, cost-sensitive

**Winners and Losers** • Single model *never wins* • Under feature bias winners: • model averaging *with or without* improved conditional probability using unlabeled examples • Joint probability averaging with *uncorrelated* P(y|x) and P(x) models (details in paper) • Under class bias winners: • Joint probability averaging with *correlated* P(y|x) and *improved* P(x) models. • Under complete bias: • Model averaging with improved P(y|x)

**Summary** • According to our definition, sample selection bias is ubiquitous. • Categorization of sample selection bias into 4 types is useful for analysis, but hard to use in practice. • In practice, the key question is: relative number of examples with inaccurate P(y|x). • Small: use model averaging of conditional probabilities of several models • Medium: use model averaging of improved conditional probabilities • Large: use joint probability averaging of uncorrelated conditional probability and feature probability

**When the number is small** • Proved in the paper that the expected error of model averaging is less than that of any single model it combines. • What this says: • Compute models in different ways • Don't hang yourself on one tree