- 45 Views
- Uploaded on
- Presentation posted in: General

ReverseTesting: An Efficient Framework to Select Amongst Classifiers under Sample Selection Bias

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.

- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -

ReverseTesting: An Efficient Framework to Select Amongst Classifiers under Sample Selection Bias

Wei Fan

IBM T.J.Watson

Ian Davidson

SUNY Albany

Sampling process

Universe of Examples:

Joint probability

distribution

P(x,y) = P(y|x) P(x)

DM models this universe

Training

Data

Question:

Is the training data a good sample of the universe?

Algorithm

x

Model

y

Two classes:

red and green

red: f2>f1

green: f2<=f1

Biased Sample:

less likely to sample points

close to decision boundary

Rather Unbiased Sample:

evenly distributed

Trained from Unbiased Sample

Trained from Biased Sample

Error = 2.9%

Error = 7.9%

Trained from Unbiased Sample

Trained from Biased Sample

Error = 3.1%

Error = 4.1%

- Sample Selection Bias does affect modeling.
- Some techniques are more sensitive to bias than others.
- Models’ accuracy do get affected.

- One important question:
- How to choose amongst the best classification algorithm, given potentiallybiased dataset?

- Fundamental assumption: training data is an unbiased sample from the universe of examples.
- Catalogue:
- Purchase history is normally only based on each merchant’s own data
- However, may not be representative of a population that may potentially purchase from the merchant..

- Drug Testing:
- Fraud Detection:
- Other examples (see Zadrozny’04 and Smith and Elkan’04)

- Inductive model:
- P(y|x,M): non-trivial dependency on the constructed model M.
- Recall that P(y|x) is the true conditional probability “independent” from any modeling techniques.
- In general, P(y|x,M) != P(y|x).

- If the model M is the “correct model”, sample selection bias doesn’t affect learning. (Fan,Davidson,Zadrozny, and Yu’05)
- Otherwise, it does.
- Key Issues:
- for real-world problems, we normally do not know the relationship between P(y|x,M) and P(y|x).
- No exact idea about where the bias comes from.

- How to choose amongst the best classification algorithm, given potentiallybiased dataset?
- No information on the exactly how the data is biased
- No information on if the learners are affected by the bias.
- No information on true model, P(y|x)

- Given sample section bias, cross-validation based methods are a bad indicator of which methods are the most accurate.
- Results come next.

- Basic idea: how to use testing data’s feature vector x’s to help ordering different models even when their true labels y are not known.

MA

MBA

MAA

Labeled

test data

MBB

MB

MAB

A

A

DA

B

B

DB

Train

Test

Train

Estimate the performance of MA and MB based on the order of MAA, MAB, MBA and MBB

- If “A’s labeled test data” can construct “more accurate models”for both algorithm A and B evaluated on labeled training data, then A is expected to be more accurate.
- If MAA > MAB and MBA > MBB then choose A

- Similarly,
- If MAA < MAB and MBA < MBB then choose B

- Otherwise, undecided.

- Assume that:
- A is more accurate than B

- Use both A and B labeled data to train two models.
- Using A’s data is likely to train a more accurate model than B’s data.

Sparse Region

- 1. Examples in sparse regions are under represented in CV’s averaged results.
- Comparing those examples near the decision boundary
- A model performs badly in these under sample regions are not accurately
- estimated in cross-validation.

- 2. CV could also create “biased folds” in these “sparse” regions.
- Their estimate on biased region itself could also be unreliable.

- 3. No information on how a model behaves on “feature vectors” not represented in
- the training data.

1-fold

Full Training Data

- Not reduce the size of “sparse regions” as 10-fold CV does
- Not use “training model” or something close to training model.
- Utilize “feature vectors” not present in the training dataset.

C45 can never learn such a model from training data

RDT labeled data

C45 labeled data

RDT Data

C45 labeled data

Training Data

C45 labeled data

RDT labeled data

- “Feature vectors in testing data” change the “decision boundary.
- The model constructed by algorithm A from A’s own labeled data != original “training model”.
- A’s “inductive bias” is represented in B’s space.
- “Use the changed boundary to include more emphasis on these sparse regions for both A and B re-trained on the two labeled test datasets.

- Sample Selection bias is a ubiquitous problem for DM and ML in practice.
- For most applications and modeling, techniques, sample selection bias does affect accuracy.
- Given sample selection bias, CV based method is bad at estimating order.
- ReverseTesting can do a much better job.
- Future work:
- not only orders but also estimates accuracy.