
### ReverseTesting: An Efficient Framework to Select Amongst Classifiers under Sample Selection Bias

Wei Fan, IBM T. J. Watson

Ian Davidson, SUNY Albany

Where Does Sample Selection Bias Come From?

[Figure: the universe of examples is the joint probability distribution P(x,y) = P(y|x) P(x); DM models this universe. The training data is drawn from it, and an algorithm maps x to y through a model.]

Question: is the training data a good sample of the universe?

Unbiased & Biased Samples

- Biased sample: points close to the decision boundary are less likely to be sampled.
- Rather unbiased sample: points are evenly distributed. (A simulation sketch follows this list.)
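To make the distinction concrete, here is a minimal sketch (ours, not from the slides) of simulating this kind of bias with scikit-learn: points near the class boundary enter the training sample with low probability, while the unbiased sample is drawn uniformly. The margin proxy and the 0.5 / 0.1 / 0.9 constants are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification

rng = np.random.default_rng(0)

# Draw a 2-D "universe" of examples from P(x, y).
X, y = make_classification(n_samples=5000, n_features=2,
                           n_informative=2, n_redundant=0,
                           random_state=0)

# Stand-in for distance to the true decision boundary; in a real
# problem this is unknown, we use it here only to inject the bias.
margin = np.abs(X[:, 0] + X[:, 1])

# Biased sample: points near the boundary (small margin) are kept
# with low probability.
keep_prob = np.where(margin < 0.5, 0.1, 0.9)
mask = rng.random(len(X)) < keep_prob
X_biased, y_biased = X[mask], y[mask]

# Rather unbiased sample of the same size, drawn uniformly.
idx = rng.choice(len(X), size=mask.sum(), replace=False)
X_unbiased, y_unbiased = X[idx], y[idx]
```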

What can we observe?

- Sample selection bias does affect modeling.
- Some techniques are more sensitive to bias than others.
- Models' accuracy does get affected.
- One important question: how do we choose the best classification algorithm, given a potentially biased dataset?

Ubiquitous Problem

- Fundamental assumption: training data is an unbiased sample from the universe of examples.
- Catalogue: purchase history is normally based only on each merchant's own data, which may not be representative of the population that might potentially purchase from the merchant.
- Drug testing and fraud detection: similar issues arise.
- Other examples: see Zadrozny'04 and Smith and Elkan'04.

Effect of Bias on Model Construction

- Inductive model: P(y|x,M) has a non-trivial dependency on the constructed model M.
- Recall that P(y|x) is the true conditional probability, independent of any modeling technique.
- In general, P(y|x,M) != P(y|x).
- If the model M is the "correct model", sample selection bias doesn't affect learning (Fan, Davidson, Zadrozny, and Yu'05). Otherwise, it does.
- Key issues: for real-world problems, we normally do not know the relationship between P(y|x,M) and P(y|x), and we have no exact idea where the bias comes from.
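As a concrete illustration of the "correct model" point, consider pure feature bias (a standard case in the literature cited above, stated here as an assumption since the slides do not spell it out): the selection indicator s depends only on x, not on y, so the conditional distribution in the biased sample is unchanged.

```latex
% Feature bias assumption: P(s=1 | x, y) = P(s=1 | x).
% Applying Bayes' rule to the selected (biased) sample:
P(y \mid x, s{=}1)
  = \frac{P(s{=}1 \mid x, y)\, P(y \mid x)}{P(s{=}1 \mid x)}
  = P(y \mid x)
```

A learner flexible enough to represent the true P(y|x) therefore recovers the same conditional from the biased sample; a misspecified model need not.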

Recapping Our Focus

- How do we choose the best classification algorithm, given a potentially biased dataset?
- No information on exactly how the data is biased.
- No information on whether the learners are affected by the bias.
- No information on the true model P(y|x).

Failure of Traditional Methods

- Given sample selection bias, cross-validation-based methods are a poor indicator of which method is most accurate (a sketch of this CV baseline follows below).
- Results come next.
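To make the criticized baseline concrete, here is a minimal sketch of CV-based selection (assuming the `X_biased`, `y_biased` arrays from the earlier sampling sketch; the two candidate learners are arbitrary stand-ins for algorithms A and B): pick whichever algorithm scores best under 10-fold cross-validation on the possibly biased training data.

```python
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# CV-based selection: average k-fold accuracy on the (possibly
# biased) training sample, then pick the best-scoring algorithm.
candidates = {"A (tree)": DecisionTreeClassifier(random_state=0),
              "B (naive Bayes)": GaussianNB()}
cv_scores = {name: cross_val_score(clf, X_biased, y_biased, cv=10).mean()
             for name, clf in candidates.items()}
print(cv_scores, "-> CV picks:", max(cv_scores, key=cv_scores.get))
```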

ReverseTesting

- Basic idea: use the test data's feature vectors x to help rank different models, even when their true labels y are not known.

Basic procedure (train, test, train):

- Train: run algorithm A on the labeled training data to get model MA, and algorithm B to get model MB.
- Test: use MA to label the test data's feature vectors, producing labeled test set DA; use MB to produce DB.
- Train again: train both algorithms on each labeled test set, giving four models: MAA (A trained on DA), MAB (A trained on DB), MBA (B trained on DA), and MBB (B trained on DB).
- Estimate the performance of MA and MB based on the order of MAA, MAB, MBA, and MBB, evaluated on the labeled training data.

Rule

- If "A's labeled test data" can construct "more accurate models" for both algorithms A and B, evaluated on the labeled training data, then A is expected to be more accurate.
- If MAA > MAB and MBA > MBB, then choose A.
- Similarly, if MAA < MAB and MBA < MBB, then choose B.
- Otherwise, undecided. (A code sketch of the full procedure follows this list.)
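The following is a minimal sketch of the procedure and rule above in scikit-learn terms; it is our reading of the slides, not the authors' code. `X_train`/`y_train` are the (possibly biased) labeled training data, `X_test` the unlabeled test feature vectors, and training-set accuracy is used as the comparison metric.

```python
from sklearn.base import clone

def reverse_testing(alg_a, alg_b, X_train, y_train, X_test):
    """Return 'A', 'B', or 'undecided' per the ReverseTesting rule."""
    # Step 1 (train): fit both algorithms on the training data.
    m_a = clone(alg_a).fit(X_train, y_train)
    m_b = clone(alg_b).fit(X_train, y_train)

    # Step 2 (test): label the test feature vectors with each model.
    d_a = m_a.predict(X_test)   # D_A: test data labeled by M_A
    d_b = m_b.predict(X_test)   # D_B: test data labeled by M_B

    # Step 3 (train again): fit both algorithms on each labeled set,
    # then evaluate each resulting model on the labeled training data.
    s_aa = clone(alg_a).fit(X_test, d_a).score(X_train, y_train)  # M_AA
    s_ab = clone(alg_a).fit(X_test, d_b).score(X_train, y_train)  # M_AB
    s_ba = clone(alg_b).fit(X_test, d_a).score(X_train, y_train)  # M_BA
    s_bb = clone(alg_b).fit(X_test, d_b).score(X_train, y_train)  # M_BB

    # Step 4: the selection rule.
    if s_aa > s_ab and s_ba > s_bb:
        return "A"
    if s_aa < s_ab and s_ba < s_bb:
        return "B"
    return "undecided"
```

For example, `reverse_testing(DecisionTreeClassifier(random_state=0), GaussianNB(), X_biased, y_biased, X_test)` would compare the two candidates from the CV sketch, with `X_test` a hypothetical unlabeled test set.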

Heuristics of ReverseTesting

- Assume that A is more accurate than B.
- Label the test data with both A's and B's models, and train on each labeled set.
- Training on A's (more accurately) labeled data is likely to yield a more accurate model than training on B's.

Why CV Won't Work

[Figure: a sparse, under-sampled region of the instance space near the decision boundary.]

CV under-estimates in sparse regions:

1. Examples in sparse regions are under-represented in CV's averaged results. Consider the examples near the decision boundary: a model that performs badly in these under-sampled regions is not accurately estimated by cross-validation.
2. CV can also create "biased folds" in these sparse regions, so its estimate on the biased region itself can be unreliable.
3. CV gives no information on how a model behaves on "feature vectors" not represented in the training data.

Desiderata in ReverseTesting

- Do not reduce the size of "sparse regions", as 10-fold CV does.
- Do not use the "training model", or anything close to it.
- Utilize "feature vectors" not present in the training dataset.

[Figures: the training data, the C4.5 decision boundary, and the test data as labeled by RDT and by C4.5. C4.5 can never learn such a model from the training data alone.]

Model Comparison

- "Feature vectors in the testing data" change the "decision boundary".
- The model constructed by algorithm A from A's own labeled test data is not the original "training model".
- A's "inductive bias" is represented in B's space.
- The changed boundary puts more emphasis on the sparse regions for both A and B, re-trained on the two labeled test datasets.

Summary

- Sample selection bias is a ubiquitous problem for DM and ML in practice.
- For most applications and modeling techniques, sample selection bias does affect accuracy.
- Given sample selection bias, CV-based methods are poor at estimating the accuracy order of algorithms.
- ReverseTesting does a much better job.
- Future work: not only order the algorithms but also estimate their accuracy.
