# ReverseTesting: An Efficient Framework to Select Amongst Classifiers under Sample Selection Bias - PowerPoint PPT Presentation

1 / 24

ReverseTesting: An Efficient Framework to Select Amongst Classifiers under Sample Selection Bias. Wei Fan IBM T.J.Watson Ian Davidson SUNY Albany. Sampling process. Where Sample Selection Bias Comes From?. Universe of Examples: Joint probability distribution P(x,y) = P(y|x) P(x)

I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.

### Download Presentation

ReverseTesting: An Efficient Framework to Select Amongst Classifiers under Sample Selection Bias

An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.

- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -

## ReverseTesting: An Efficient Framework to Select Amongst Classifiers under Sample Selection Bias

Wei Fan

IBM T.J.Watson

Ian Davidson

SUNY Albany

Sampling process

### Where Sample Selection Bias Comes From?

Universe of Examples:

Joint probability

distribution

P(x,y) = P(y|x) P(x)

DM models this universe

Training

Data

Question:

Is the training data a good sample of the universe?

Algorithm

x

Model

y

Two classes:

red and green

red: f2>f1

green: f2<=f1

### Unbiased & Biased Samples

Biased Sample:

less likely to sample points

close to decision boundary

Rather Unbiased Sample:

evenly distributed

Trained from Unbiased Sample

Trained from Biased Sample

### Single Decision Tree

Error = 2.9%

Error = 7.9%

Trained from Unbiased Sample

Trained from Biased Sample

Error = 3.1%

Error = 4.1%

### What can we observe?

• Sample Selection Bias does affect modeling.

• Some techniques are more sensitive to bias than others.

• Models’ accuracy do get affected.

• One important question:

• How to choose amongst the best classification algorithm, given potentiallybiased dataset?

### Ubiquitous Problem

• Fundamental assumption: training data is an unbiased sample from the universe of examples.

• Catalogue:

• Purchase history is normally only based on each merchant’s own data

• However, may not be representative of a population that may potentially purchase from the merchant..

• Drug Testing:

• Fraud Detection:

• Other examples (see Zadrozny’04 and Smith and Elkan’04)

### Effect of Bias on Model Construction

• Inductive model:

• P(y|x,M): non-trivial dependency on the constructed model M.

• Recall that P(y|x) is the true conditional probability “independent” from any modeling techniques.

• In general, P(y|x,M) != P(y|x).

• If the model M is the “correct model”, sample selection bias doesn’t affect learning. (Fan,Davidson,Zadrozny, and Yu’05)

• Otherwise, it does.

• Key Issues:

• for real-world problems, we normally do not know the relationship between P(y|x,M) and P(y|x).

• No exact idea about where the bias comes from.

### Re-Capping Our focus

• How to choose amongst the best classification algorithm, given potentiallybiased dataset?

• No information on the exactly how the data is biased

• No information on if the learners are affected by the bias.

• No information on true model, P(y|x)

### Failure of Traditional Methods

• Given sample section bias, cross-validation based methods are a bad indicator of which methods are the most accurate.

• Results come next.

### ReverseTesting

• Basic idea: how to use testing data’s feature vector x’s to help ordering different models even when their true labels y are not known.

MA

MBA

MAA

Labeled

test data

MBB

MB

MAB

A

A

DA

B

B

DB

### Basic Procedure

Train

Test

Train

Estimate the performance of MA and MB based on the order of MAA, MAB, MBA and MBB

### Rule

• If “A’s labeled test data” can construct “more accurate models”for both algorithm A and B evaluated on labeled training data, then A is expected to be more accurate.

• If MAA > MAB and MBA > MBB then choose A

• Similarly,

• If MAA < MAB and MBA < MBB then choose B

• Otherwise, undecided.

### Heuristics of ReverseTesting

• Assume that:

• A is more accurate than B

• Use both A and B labeled data to train two models.

• Using A’s data is likely to train a more accurate model than B’s data.

Sparse Region

### CV under-estimate in sparse regions

• 1. Examples in sparse regions are under represented in CV’s averaged results.

• Comparing those examples near the decision boundary

• A model performs badly in these under sample regions are not accurately

• estimated in cross-validation.

• 2. CV could also create “biased folds” in these “sparse” regions.

• Their estimate on biased region itself could also be unreliable.

• 3. No information on how a model behaves on “feature vectors” not represented in

• the training data.

### Decision Boundary of one fold in 10-fold CV

1-fold

Full Training Data

### Desiderata in ReverseTesting

• Not reduce the size of “sparse regions” as 10-fold CV does

• Not use “training model” or something close to training model.

• Utilize “feature vectors” not present in the training dataset.

### C45 Decision Boundary

C45 can never learn such a model from training data

RDT labeled data

C45 labeled data

RDT Data

C45 labeled data

Training Data

C45 labeled data

RDT labeled data

### Model Comparison

• “Feature vectors in testing data” change the “decision boundary.

• The model constructed by algorithm A from A’s own labeled data != original “training model”.

• A’s “inductive bias” is represented in B’s space.

• “Use the changed boundary to include more emphasis on these sparse regions for both A and B re-trained on the two labeled test datasets.

### Summary

• Sample Selection bias is a ubiquitous problem for DM and ML in practice.

• For most applications and modeling, techniques, sample selection bias does affect accuracy.

• Given sample selection bias, CV based method is bad at estimating order.

• ReverseTesting can do a much better job.

• Future work:

• not only orders but also estimates accuracy.