
Sample Selection Bias – Covariate Shift: Problems, Solutions, and Applications

Wei Fan, IBM T.J.Watson Research

Masashi Sugiyama, Tokyo Institute of Technology

Updated PPT is available:

http://www.weifan.info/tutorial.htm


Overview of Sample Selection Bias Problem


A Toy Example

Two classes:

red and green

red: f2>f1

green: f2<=f1


Unbiased and Biased Samples

Not-so-biased sampling

Biased sampling
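The slides do not spell out the sampling scheme behind the biased sample; the following NumPy sketch illustrates the feature-biased case on this toy example, where the selection probability p_select is an arbitrary illustrative choice that depends on the features only:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy example: two features in [0, 1]; red if f2 > f1, green otherwise.
X = rng.uniform(0.0, 1.0, size=(1000, 2))
y = np.where(X[:, 1] > X[:, 0], "red", "green")

# Feature bias / covariate shift: the chance of entering the training set
# depends on x only, i.e. P(s=1 | x, y) = P(s=1 | x).  Here, purely as an
# illustration, points with small f1 are kept more often.
p_select = np.clip(1.0 - X[:, 0], 0.05, 1.0)
s = rng.uniform(size=len(X)) < p_select

X_train, y_train = X[s], y[s]   # the biased training sample
print(len(X_train), "of", len(X), "examples selected")
```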


Effect on Learning

Accuracies on the toy example (figure labels): unbiased training data gives 96.9%, 97.1%, and 96.405%; biased training data gives 95.9%, 92.7%, and 92.1%.

  • Some techniques are more sensitive to bias than others.

  • One important question:

    • How to reduce the effect of sample selection bias?


Ubiquitous

  • Loan Approval

  • Drug screening

  • Weather forecasting

  • Ad Campaign

  • Fraud Detection

  • User Profiling

  • Biomedical Informatics

  • Intrusion Detection

  • Insurance

  • etc


Face Recognition

  • Sample selection bias:

    • Training samples are taken inside the research lab, where there are few women.

    • Test samples: in the real world, the men-women ratio is almost 50-50.

The Yale Face Database B


Brain-Computer Interface (BCI)

  • Control computers by EEG signals:

    • Input: EEG signals

    • Output: Left or Right

Figure provided by Fraunhofer FIRST, Berlin, Germany


Training

  • Imagine left/right-hand movement following the letter on the screen

Movie provided by Fraunhofer FIRST, Berlin, Germany


Testing: Playing Games

  • “Brain-Pong”

Movie provided by Fraunhofer FIRST, Berlin, Germany


Non-Stationarity in EEG Features

  • Different mental conditions (attention, sleepiness etc.) between training and test phases may change the EEG signals.

[Figures: bandpower differences between training and test phases; features extracted from brain activity during training and test phases]

Figures provided by Fraunhofer FIRST, Berlin, Germany


Robot Control by Reinforcement Learning

  • Let the robot learn how to autonomously move without explicit supervision.

Khepera Robot


Rewards

Robot moves autonomously = goes forward without hitting the wall

  • Give robot rewards:

    • Go forward: Positive reward

    • Hit wall: Negative reward

  • Goal: Learn the control policy that maximizes future rewards


Example

  • After learning:


Policy Iteration and Covariate Shift

  • Policy iteration:

  • Updating the policy corresponds to changing the input distribution!

[Diagram: policy iteration loop: evaluate the control policy, improve the control policy, repeat]


Different Types of Sample Selection Bias


Bias as Distribution

  • Think of “sampling an example (x,y) into the training data” as an event denoted by random variable s

    • s=1: example (x,y) is sampled into the training data

    • s=0: example (x,y) is not sampled.

  • Think of bias as a conditional probability of “s=1” dependent on x and y

  • P(s=1|x,y) : the probability for (x,y) to be sampled into the training data, conditional on the example’s feature vector x and class label y.


Categorization (Zadrozny'04, Fan et al.'05, Fan and Davidson'07)

  • No Sample Selection Bias

    • P(s=1|x,y) = P(s=1)

  • Feature Bias/Covariate Shift

    • P(s=1|x,y) = P(s=1|x)

  • Class Bias

    • P(s=1|x,y) = P(s=1|y)

  • Complete Bias

    • No more reduction


Bias for a Training Set

  • How is P(s=1|x,y) computed?

  • Practically, for a given training set D

    • P(s=1|x,y) = 1: if (x,y) is sampled into D

    • P(s=1|x,y) = 0: otherwise

  • Alternatively, consider that a dataset of D's size could be drawn "exhaustively" from the universe of examples.


Realistic Datasets Are Biased?

  • Most datasets are biased.

  • Unlikely to sample each and every feature vector.

  • For most problems, it is at least feature bias.

    • P(s=1|x,y) = P(s=1|x)


Effect on Learning

  • Learning algorithms estimate the “true conditional probability”

    • True probability P(y|x), such as P(fraud|x)?

    • Estimated probability P(y|x,M): M is the model built.

  • Conditional probability in the biased data.

    • P(y|x,s=1)

  • Key Issue:

    • P(y|x,s=1) = P(y|x) ?
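For feature bias the answer is yes; one line of Bayes' rule (added here to fill in the reasoning) makes this explicit:

\[
P(y \mid x, s=1) = \frac{P(s=1 \mid x, y)\, P(y \mid x)}{P(s=1 \mid x)} = P(y \mid x)
\quad \text{whenever } P(s=1 \mid x, y) = P(s=1 \mid x),
\]

whereas under class bias or complete bias the selection factor does not cancel, so the equality generally fails.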


Bias Resolutions


Heckman's Two-Step Approach

  • Estimate one’s donation amount if one does donate.

  • Accurate estimate cannot be obtained by a regression using only data from donors.

  • First Step: a probit model to estimate the probability of donating.

  • Second Step: a regression model to estimate the donation amount, corrected by the expected error term from the first step under a Gaussian assumption (the two equations are sketched below).
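The equations did not survive the transcript; the standard textbook form of the two steps (a sketch, not necessarily the slide's exact notation) is a probit selection equation \( P(s_i = 1 \mid z_i) = \Phi(z_i \gamma) \) and an outcome equation \( y_i = x_i \beta + \varepsilon_i \) observed only when \( s_i = 1 \). Under jointly Gaussian errors,

\[
E[\,y_i \mid x_i, s_i = 1\,] = x_i \beta + \rho \sigma_\varepsilon\, \lambda(z_i \gamma),
\qquad \lambda(u) = \frac{\phi(u)}{\Phi(u)},
\]

so the second-step regression includes the estimated inverse Mills ratio \( \lambda(z_i \hat\gamma) \) as an extra regressor.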


Covariate Shift or Feature Bias

  • However, there is no chance of generalization if training and test samples have nothing in common.

  • Covariate shift:

    • Input distribution changes

    • Functional relation remains unchanged
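In symbols, covariate shift means that the input densities differ while the conditional (the functional relation) stays the same:

\[
p_{\mathrm{train}}(x) \neq p_{\mathrm{test}}(x),
\qquad
p_{\mathrm{train}}(y \mid x) = p_{\mathrm{test}}(y \mid x).
\]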


Example of Covariate Shift

(Weak) extrapolation: predict output values outside the training region.

[Figure: training samples vs. test samples]


Covariate Shift Adaptation

  • To illustrate the effect of covariate shift, let’s focus on linear extrapolation

[Figure legend: training samples, test samples, true function, learned function]


Generalization Error = Bias + Variance

(The expectation is taken over the output noise.)
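For squared error, the decomposition referred to here is the standard one (what the slide calls "bias" corresponds to the squared-bias term):

\[
\mathbb{E}_{\varepsilon}\!\big[(\hat f(x) - f(x))^2\big]
= \big(\mathbb{E}_{\varepsilon}[\hat f(x)] - f(x)\big)^2
+ \mathbb{E}_{\varepsilon}\!\big[(\hat f(x) - \mathbb{E}_{\varepsilon}[\hat f(x)])^2\big].
\]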


Model Specification

  • A model is said to be correctly specified if the learning target can be expressed by the model for some parameter value.

  • In practice, our model may not be correct.

  • Therefore, we need a theory for misspecified models!


Ordinary Least-Squares (OLS)

  • If the model is correct: OLS minimizes the bias asymptotically.

  • If the model is misspecified: OLS does not minimize the bias even asymptotically.

  • We want to reduce the bias!
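For reference, with a parametric model \( \hat f(x;\theta) \) and training pairs \( \{(x^{tr}_i, y^{tr}_i)\}_{i=1}^{n} \) (notation assumed here), OLS minimizes the unweighted squared error:

\[
\hat\theta_{\mathrm{OLS}} = \operatorname*{argmin}_{\theta} \sum_{i=1}^{n} \big(\hat f(x^{tr}_i;\theta) - y^{tr}_i\big)^2 .
\]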


Law of Large Numbers

  • Sample average converges to the population mean: (1/n) Σ_i g(x_i) → E[g(x)] as n → ∞.

  • We want to estimate the expectation over test input points using only training input points.


Key Trick: Importance-Weighted Average

  • Importance: Ratio of test and training input densities

  • Importance-weighted average:

(cf. importance sampling)
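The two missing formulas are standard: the importance is the density ratio, and reweighting the training-sample average by it recovers the test-input expectation (assuming the training density is positive wherever the test density is):

\[
w(x) = \frac{p_{\mathrm{test}}(x)}{p_{\mathrm{train}}(x)},
\qquad
\frac{1}{n}\sum_{i=1}^{n} w(x^{tr}_i)\, g(x^{tr}_i)
\;\longrightarrow\;
\mathbb{E}_{x \sim p_{\mathrm{test}}}[\,g(x)\,]
\quad (n \to \infty).
\]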


Importance-Weighted LS (IWLS) (Shimodaira, JSPI2000)

  • Even for misspecified models, IWLS minimizes the bias asymptotically.

  • The training input density is assumed to be strictly positive, so that the importance is well defined.

  • We need to estimate the importance in practice.
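Concretely, IWLS weights each training residual by the importance; a sketch consistent with the Shimodaira reference is

\[
\hat\theta_{\mathrm{IWLS}} = \operatorname*{argmin}_{\theta}
\sum_{i=1}^{n} \frac{p_{\mathrm{test}}(x^{tr}_i)}{p_{\mathrm{train}}(x^{tr}_i)}\,
\big(\hat f(x^{tr}_i;\theta) - y^{tr}_i\big)^2 .
\]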


Use of Unlabeled Samples: Importance Estimation

  • Assumption: we have training inputs and (unlabeled) test inputs.

  • Naïve approach: estimate the training and test input densities separately, and take the ratio of the density estimates.

  • This does not work well since density estimation is hard in high dimensions.


Vapnik's Principle

When solving a problem,

more difficult problems shouldn’t be solved.

  • Directly estimating the ratio is easier than estimating the densities!

(e.g., support vector machines)

Knowing the densities ⇒ knowing the ratio (but not vice versa).


Modeling Importance Function

  • Use a linear importance model: a linear combination of basis functions with non-negative coefficients.

  • The test input density is then approximated by the importance model times the training input density.

  • Idea: learn the coefficients so that the approximated test density well approximates the true test density.


Kullback-Leibler Divergence

The KL divergence from the true test density to its approximation decomposes into a constant term (independent of the importance model) plus a relevant term that depends on it.


Learning Importance Function

  • Thus maximizing the relevant term of the KL divergence gives the objective function.

  • Since the approximated test density must integrate to one, we also obtain a normalization constraint (the resulting optimization problem is sketched below).
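Putting the last three slides together, the resulting optimization problem (a reconstruction, with basis functions \( \varphi_\ell \) and coefficients \( \alpha_\ell \ge 0 \)) is

\[
\max_{\alpha \ge 0}\; \sum_{j=1}^{n_{te}} \log \hat w_\alpha(x^{te}_j),
\qquad
\hat w_\alpha(x) = \sum_{\ell=1}^{b} \alpha_\ell\, \varphi_\ell(x),
\qquad
\text{subject to } \frac{1}{n_{tr}} \sum_{i=1}^{n_{tr}} \hat w_\alpha(x^{tr}_i) = 1,
\]

where the constraint makes the approximated test density \( \hat w_\alpha(x)\, p_{\mathrm{train}}(x) \) integrate to (approximately) one.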


KLIEP (Kullback-Leibler Importance Estimation Procedure)

(Sugiyama et al., NIPS2007)

  • Convexity: unique global solution is available

  • Sparse solution: prediction is fast! (A numerical sketch follows.)
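A minimal NumPy sketch of KLIEP with Gaussian basis functions centered at the test points is given below; the kernel width, learning rate, and the simple projected-gradient solver are illustrative choices, not the authors' implementation:

```python
import numpy as np

def gaussian_kernel(X, C, sigma):
    # Pairwise Gaussian kernel values between rows of X and centers C.
    d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def kliep(x_tr, x_te, sigma=1.0, n_iter=2000, lr=1e-3):
    """Estimate w(x) = p_test(x) / p_train(x) with the linear model
    w_alpha(x) = sum_l alpha_l * k(x, c_l), centers c_l at the test points."""
    C = x_te
    Phi_te = gaussian_kernel(x_te, C, sigma)   # n_te x b
    Phi_tr = gaussian_kernel(x_tr, C, sigma)   # n_tr x b
    mean_tr = Phi_tr.mean(axis=0)              # used by the normalization constraint
    alpha = np.ones(C.shape[0]) / C.shape[0]
    for _ in range(n_iter):
        w_te = Phi_te @ alpha
        grad = (Phi_te / w_te[:, None]).mean(axis=0)  # gradient of mean log w(x_te)
        alpha = np.maximum(alpha + lr * grad, 0.0)    # ascent step + non-negativity
        alpha /= mean_tr @ alpha                      # enforce (1/n_tr) sum_i w(x_tr_i) = 1
    return lambda X: gaussian_kernel(np.atleast_2d(X), C, sigma) @ alpha

# Example: 1-D training inputs centered at 0, test inputs centered at 1.
rng = np.random.default_rng(0)
x_tr = rng.normal(0.0, 1.0, size=(200, 1))
x_te = rng.normal(1.0, 1.0, size=(200, 1))
w = kliep(x_tr, x_te)
print(w(x_tr).mean())   # equals 1 by the normalization constraint
```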


Examples


Experiments: Setup

  • Input distributions: standard Gaussians with the following means:

    • Training: mean (0,0,…,0)

    • Test: mean (1,0,…,0)

  • Kernel density estimation (KDE):

    • Separately estimate training and test input densities.

    • Gaussian kernel width is chosen by likelihood cross-validation.

  • KLIEP

    • Gaussian kernel width is chosen by likelihood cross-validation


Experimental Results

  • KDE: error increases as the dimension grows.

  • KLIEP: error remains small even for large dimensions.

[Figure: normalized MSE vs. input dimension for KDE and KLIEP]


Ensemble Methods (Fan and Davidson'07)

  • Averaging of estimated class probabilities, weighted by the model posterior (integration over the model space).

  • Averaging removes model uncertainty.
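In formula form (the slide's equation is reconstructed here from its labels), this is standard Bayesian model averaging: the class probability is integrated over the model space, with each model weighted by its posterior,

\[
P(y \mid x, D) \;=\; \int P(y \mid x, \theta)\, P(\theta \mid D)\, d\theta
\;\approx\; \sum_{k} P(y \mid x, \theta_k)\, P(\theta_k \mid D).
\]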


How to Use Them

  • Estimate the "joint probability" P(x,y) instead of just the conditional probability, i.e.,

    • P(x,y) = P(y|x)P(x)

    • This makes no difference with a single model, but it does with multiple models.


Examples of How This Works

  • P1(+|x) = 0.8 and P2(+|x) = 0.4

  • P1(-|x) = 0.2 and P2(-|x) = 0.6

  • model averaging,

    • P(+|x) = (0.8 + 0.4) / 2 = 0.6

    • P(-|x) = (0.2 + 0.6)/2 = 0.4

    • Prediction will be +


  • But suppose the two models' estimates of P(x) are 0.05 and 0.4.

  • Then

    • P(+,x) = 0.05 * 0.8 + 0.4 * 0.4 = 0.2

    • P(-,x) = 0.05 * 0.2 + 0.4 * 0.6 = 0.25

  • Recall with model averaging:

    • P(+|x) = 0.6 and P(-|x)=0.4

    • Prediction is +

  • But, now the prediction will be – instead of +

  • Key Idea:

    • Unlabeled examples can be used as "weights" to re-weight the models (a numerical sketch follows).
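The arithmetic above can be written out directly; the numbers are taken from the slide, the variable names are made up, and the P(x) values play the role of model weights that could be estimated from unlabeled data:

```python
# Two models' estimates at the same test point x (numbers from the slide).
p_pos = [0.8, 0.4]    # P(+|x) under model 1 and model 2
p_x = [0.05, 0.4]     # P(x) under model 1 and model 2 (e.g., fit on unlabeled data)

# Plain model averaging of the conditional probabilities: predicts "+".
avg_pos = sum(p_pos) / len(p_pos)                              # 0.6

# Joint-probability averaging P(y, x) = P(y|x) * P(x): predicts "-".
joint_pos = sum(py * px for py, px in zip(p_pos, p_x))         # 0.05*0.8 + 0.4*0.4 = 0.20
joint_neg = sum((1 - py) * px for py, px in zip(p_pos, p_x))   # 0.05*0.2 + 0.4*0.6 = 0.25

print(avg_pos, joint_pos, joint_neg)
```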


Structure Discovery (Ren et al.'08)

[Diagram: original dataset → structural discovery → structural re-balancing → corrected dataset]


Active Learning

  • Quality of the learned function depends on the training input locations.

  • Goal: optimize the training input locations.

[Figure: good vs. poor input locations; target vs. learned functions]


Challenges

  • Generalization error is unknown and needs to be estimated.

  • In experiment design, we do not have training output values yet.

  • Thus we cannot use, e.g., cross-validation, which requires output values.

  • Only training input positions can be used in generalization error estimation!


Agnostic Setup

  • The model is not correct in practice.

  • Then OLS is not consistent.

  • Standard “experiment design” method does not work!

(Fedorov 1972; Cohn et al., JAIR1996)


Bias Reduction by Importance-Weighted LS (IWLS)

(Wiens JSPI2001; Kanamori & Shimodaira JSPI2003; Sugiyama JMLR2006)

  • The use of IWLS mitigates the inconsistency problem in the agnostic setup.

  • The importance is known in the active learning setup, since the training input distribution is designed by us!



Model Selection

  • Choice of model is crucial (e.g., polynomial of order 1, 2, or 3).

  • We want to determine the model so that the generalization error is minimized.


Generalization Error Estimation

  • Generalization error is not accessible since the target function is unknown.

  • Instead, we use a generalization error estimate.

[Figures: error vs. model complexity]


Cross-Validation

  • Divide the training samples into k groups.

  • Train a learning machine with k-1 groups.

  • Validate the trained machine using the remaining group.

  • Repeat this for all combinations and output the mean validation error.

  • CV is almost unbiased without covariate shift.

  • But it is heavily biased under covariate shift!

[Diagram: groups 1, 2, …, k-1, k; one group used for validation, the rest for training]


Importance-Weighted CV (IWCV)

(Zadrozny ICML2004; Sugiyama et al., JMLR2007)

  • When testing the classifier in the CV process, we also importance-weight the test error.

  • IWCV gives almost unbiased estimates of the generalization error even under covariate shift.
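In the k-fold form, the only change from ordinary CV is an importance weight on each held-out error; a sketch consistent with the cited references is

\[
\widehat{\mathrm{Gen}}_{\mathrm{IWCV}}
= \frac{1}{k} \sum_{r=1}^{k} \frac{1}{|Z_r|} \sum_{(x, y) \in Z_r}
\frac{p_{\mathrm{test}}(x)}{p_{\mathrm{train}}(x)}\;
\mathrm{loss}\big(\hat f_{\setminus r}(x),\, y\big),
\]

where \( Z_r \) is the r-th held-out set and \( \hat f_{\setminus r} \) is trained on the remaining folds.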

[Diagram: sets 1, 2, …, k-1, k; one set used for testing, the rest for training]


Example of IWCV

  • IWCV gives better estimates of generalization error.

  • Model selection by IWCV outperforms CV!


Reserve Testing (Fan and Davidson'06)

[Diagram: algorithms A and B are trained on the training data, giving models MA and MB; MA and MB label the test data, giving labeled sets DA and DB; both algorithms are then re-trained on DA and DB, giving MAA, MBA, MAB, and MBB, which are evaluated on the labeled training data.]

  • Estimate the performance of MA and MB based on the order of MAA, MAB, MBA, and MBB.


Rule

  • If "A's labeled test data" can construct "more accurate models" for both algorithms A and B, evaluated on the labeled training data, then A is expected to be more accurate.

    • If MAA > MAB and MBA > MBB then choose A

  • Similarly,

    • If MAA < MAB and MBA < MBB then choose B

  • Otherwise, undecided.
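The rule can be stated as a tiny helper (hypothetical function; its inputs are the four accuracies, measured on the labeled training data, of models trained on A-labeled versus B-labeled test data):

```python
def reserve_testing_choice(m_aa, m_ab, m_ba, m_bb):
    """Choose between algorithms A and B from the accuracies of the four
    re-trained models, following the rule on the slide."""
    if m_aa > m_ab and m_ba > m_bb:
        return "A"
    if m_aa < m_ab and m_ba < m_bb:
        return "B"
    return "undecided"

print(reserve_testing_choice(0.9, 0.8, 0.85, 0.7))   # -> A
```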


Why CV Won't Work?

[Figure: sparse region]



Ozone Day Prediction (Zhang et al.'06)

  • Daily summary maps of two datasets from Texas Commission on Environmental Quality (TCEQ)


Challenges as a Data Mining Problem

  • Rather skewed and relatively sparse distribution

    • 2500+ examples over 7 years (1998-2004)

    • 72 continuous features with missing values

    • Large instance space

      • If binary and uncorrelated, 2^72 is an astronomical number

    • 2% and 5% true positive ozone days for 1-hour and 8-hour peak respectively


  • A large number of irrelevant features

    • Only about 10 out of 72 features verified to be relevant,

    • No information on the relevancy of the other 62 features

    • For stochastic problem, given irrelevant features Xir , where X=(Xr, Xir),

      P(Y|X) = P(Y|Xr) only if the data is exhaustive.

    • May introduce overfitting problem, and change the probability distribution represented in the data.

      • P(Y = "ozone day" | Xr, Xir) → 1

      • P(Y = "normal day" | Xr, Xir) → 0


[Figure: training distribution vs. testing distribution (toy illustration with positive and negative examples in regions 1, 2, and 3)]

  • “Feature sample selection bias”.

    • Given 7 years of data and 72 continuous features, it is hard to find many days in the training data that are very similar to a day in the future

    • Given these, 2 closely-related challenges

      • How to train an accurate model

      • How to effectively use a model to predict the future with a different and yet unknown distribution


Reliable Probability Estimation under Irrelevant Features

  • Recall that due to irrelevant features:

    • P(Y = "ozone day" | Xr, Xir) → 1

    • P(Y = "normal day" | Xr, Xir) → 0

  • Construct multiple models

  • Average their predictions

    • P(“ozone”|xr): true probability

    • P(“ozone”|Xr, Xir, θ): estimated probability by model θ

    • MSE_SingleModel: difference between "true" and "estimated".

    • MSE_Average: difference between "true" and "average of many models".

    • Formally show that MSE_Average ≤ MSE_SingleModel.
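The final inequality follows from the convexity of the squared error (Jensen's inequality); writing \( \bar P \) for the average of \( m \) models' estimates and \( P \) for the true probability,

\[
\big(\bar P - P\big)^2
= \Big(\frac{1}{m}\sum_{k=1}^{m} P_{\theta_k} - P\Big)^2
\;\le\; \frac{1}{m}\sum_{k=1}^{m} \big(P_{\theta_k} - P\big)^2,
\]

so the averaged prediction's MSE is no larger than the models' average single-model MSE.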


[Diagram: CV-based decision-threshold selection. For each training set and algorithm, run 10-fold CV; the estimated probability values from the folds (1-fold, 2-fold, …, 10-fold) are concatenated with the true labels into a "probability - true label" file; a precision-recall plot is drawn from it; and the decision threshold VE is chosen.]

  • A CV based procedure for decision threshold selection

  • Prediction with feature sample selection bias


Addressing Data Mining Challenges

Classification on future days: train θ on the whole training set; if P(Y = "ozone days" | X, θ) ≥ VE, predict "ozone days".

  • Prediction with feature sample selection bias

    • Future prediction based on decision threshold selected


Results



Task 1

  • Task 1: Who rated what in 2006

    • Given a list of 100,000 pairs of users and movies, predict for each pair the probability that the user rated the movie in 2006

    • Result: close runner-up, No. 3 out of 39 teams

  • Challenges:

    • Huge amount of data: how to sample the data so that learning algorithms can be applied is critical

    • Complex affecting factors: decrease of interest in old movies, growing tendency of watching (reviewing) more movies by Netflix users


NETFLIX Data Generation Process

[Timeline diagram, 1998 to 2006: periods of movie arrival, user arrival, and no user or movie arrival; 17K movies; the training data run up to 2005, and the qualifier dataset (3M) and Tasks 1 and 2 concern 2006.]


Task 1: Effective Sampling Strategies

[Diagram: from the rating history (user, rating, date records), each movie and each user is assigned a sampling probability (e.g., Movie5 .0011, Movie3 .001, Movie4 .0007; User7 .0007, User6 .00012, User8 .00003), and movie-user pairs are drawn accordingly.]

  • Sampling the movie-user pairs for “existing” users and “existing” movies from 2004, 2005 as training set and 4Q 2005 as developing set

    • The probability of picking a movie was proportional to the number of ratings that movie received; the same strategy was used for users (a sketch follows below).

[Figure: samples drawn from the users-by-movies rating history]
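A minimal sketch of the sampling strategy above; the rating counts and identifiers are made up, and only the "probability proportional to the number of ratings" idea comes from the slide:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical rating counts per movie and per user.
movie_counts = {"Movie3": 1000, "Movie4": 700, "Movie5": 1100}
user_counts = {"User6": 120, "User7": 700, "User8": 30}

def proportional_sample(counts, size):
    keys = list(counts)
    p = np.array([counts[k] for k in keys], dtype=float)
    p /= p.sum()                      # probability proportional to #ratings
    return rng.choice(keys, size=size, p=p)

# Draw movie-user pairs for the training set.
pairs = list(zip(proportional_sample(movie_counts, 5),
                 proportional_sample(user_counts, 5)))
print(pairs)
```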


  • Learning Algorithm:

    • Single classifiers: logistic regression, Ridge regression, decision tree, support vector machines

    • Naïve Ensemble: combining sub-classifiers built on different types of features with pre-set weights

    • Ensemble classifiers: combining sub-classifiers with weights learned from the development set


Brain-Computer Interface (BCI)

  • Control computers by brain signals:

    • Input: EEG signals

    • Output: Left or Right


BCI Results

[Figure: results plotted against the KL divergence from the training to the test input distributions]

  • When KL is large, covariate shift adaptation tends to improve accuracy.

  • When KL is small, no difference.


Robot Control by Reinforcement Learning

  • Swing-up inverted pendulum:

    • Swing-up the pole by controlling the car.

    • Reward:


Results

[Figure: covariate shift adaptation vs. existing methods (a) and (b)]



Wafer Alignment in Semiconductor Exposure Apparatus

  • Recent silicon wafers have a layer structure.

  • Circuit patterns are exposed multiple times.

  • Exact alignment of wafers is very important.


Markers on Wafer

  • Wafer alignment process:

    • Measure marker location printed on wafers.

    • Shift and rotate the wafer to minimize the gap.

  • To speed up the process, it is important to reduce the number of markers to measure.

Active learning problem!


Non-linear Alignment Model

  • When the gap consists only of shift and rotation, a linear model is exact.

  • However, non-linear factors exist, e.g.,

    • Warp

    • Biased characteristic of measurement apparatus

    • Different temperature conditions

  • Exactly modeling non-linear factors is very difficult in practice!

Agnostic setup!


Experimental Results

(Sugiyama & Nakajima ECML-PKDD2008)

  • 20 markers (out of 38) are chosen by experiment design methods.

  • Gaps of all markers are predicted.

  • Repeated for 220 different wafers.

  • Table: mean (standard deviation) of the gap prediction error, i.e., the mean squared error of wafer position estimation.

    • Red: significantly better by the 5% Wilcoxon test.

    • Blue: worse than the baseline passive method.

  • IWLS-based active learning works very well!



Book on Dataset Shift

  • Quiñonero-Candela, Sugiyama, Schwaighofer & Lawrence (Eds.), Dataset Shift in Machine Learning, MIT Press, Cambridge, 2008.

