Unsupervised Learning With Non-ignorable Missing Data


Presentation Transcript

### Unsupervised Learning With Non-ignorable Missing Data

### Outline

### Introduction: The Problem of Missing Data

### Introduction: A Theory of Missing Data

### Introduction: Types of Missing Data: MCAR

### Introduction: Types of Missing Data: MAR

### Introduction: Types of Missing Data: Non-Ignorable

### Introduction: The Effect of Missing Data

### Introduction: Unsupervised Learning and Missing Data

### Introduction: Research Overview

### Introduction: Research Overview

### Missing Data Theory and EM: Notation

### Missing Data Theory and EM: The MAR Assumption

### Missing Data Theory and EM: Observed and Full Likelihood Functions

### Missing Data Theory and EM: Expectation Maximization Algorithm

### Models for Non-Ignorable Missing Data: Review of the Standard Mixture Model

### Models for Non-Ignorable Missing Data: Mixture/Fully Connected Model

### Models for Non-Ignorable Missing Data: Mixture/CPT-v Model

### Models for Non-Ignorable Missing Data: Mixture/CPT-v Model

### Models for Non-Ignorable Missing Data: Mixture/LOGIT-v,mz Model

### Models for Non-Ignorable Missing Data: Mixture/LOGIT-v,mz Model

### Synthetic Data Experiments: Experimental Procedure

### Synthetic Data Experiments: Experiment 1: CPT-v Missing Data

### Synthetic Data Experiments: Experiment 1: Results

### Synthetic Data Experiments: Experiment 2: LOGIT-v,mz Missing Data

### Synthetic Data Experiments: Experiment 2: Results

### Real Data Experiments: Experimental Procedure

### Real Data Experiments: Data Sets

### Real Data Experiments: Results – Marginal Selection Probabilities

### Real Data Experiments: Results – Full Data Log Likelihood

### Conclusions: Summary and Future Work

### The End

Ben Marlin

Sam Roweis

Rich Zemel

Machine Learning Group Talk

University of Toronto

Monday Oct 4, 2004

- Introduction
- Missing Data Theory and EM
- Models for Non-Ignorable Missing Data
- Synthetic Data Experiments
- Real Data Experiments
- Extensions and Future Work
- Conclusions

Missing data is a pervasive problem in machine learning and statistical data analysis.

Most large, complex data sets will contain a certain amount of missing data.

A fundamental question in the analysis of missing data is why the data is missing, and what we have to do about it.

There are extreme examples of data sets in machine learning with upwards of 95% missing data (EachMovie).

Little and Rubin laid out a theory of missing data several decades ago that provides answers to these questions.

They describe a classification of missing data in terms of the mechanism, or process that causes the data to be missing. ie: the generative model for missing data.

They also derive the exact conditions outlining when missing data must be treated specially to obtain correct inferences based on likelihood.

If the missing data can be explained by a simple random process like flipping a single biased coin, the missing data is missing completely at random.

*[Figure: example data matrix (data cases × attributes, ratings 1–6) with entries missing completely at random.]*

If the probability that a data entry is missing depends only on the data entries that are observed, then the data is missing at random.

*[Figure: example data matrix (data cases × attributes, ratings 1–6) with entries missing at random: whether an entry is missing depends only on observed entries.]*

If the probability that a data entry is missing depends on the value of that data entry, then the missing data is non-ignorable.
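The three mechanisms can be illustrated with a small simulation (a sketch, not from the talk; the rating range, observation probabilities, and column choices are illustrative placeholders):

```python
import numpy as np

rng = np.random.default_rng(0)
Y = rng.integers(1, 7, size=(1000, 5))  # complete data: ratings 1..6

# MCAR: one biased coin decides whether each entry is observed.
r_mcar = rng.random(Y.shape) < 0.5

# MAR: whether column 1 is observed depends only on the (always observed)
# column 0, never on the missing values themselves.
r_mar = np.ones(Y.shape, dtype=bool)
r_mar[:, 1] = rng.random(len(Y)) < np.where(Y[:, 0] >= 4, 0.9, 0.1)

# Non-ignorable: whether an entry is observed depends on that entry's own
# value; here high ratings are far more likely to be reported.
r_ni = rng.random(Y.shape) < np.where(Y >= 4, 0.9, 0.1)

# The MCAR sample mean is unbiased; the NI sample mean is biased upward.
print(Y.mean(), Y[r_mcar].mean(), Y[r_ni].mean())
```

Averaging only the observed entries under the non-ignorable mask overestimates the true mean, which is exactly the bias the mean estimation example below demonstrates.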

*[Figure: example data matrix (data cases × attributes, ratings 1–6) with non-ignorable missing entries: whether an entry is missing depends on its own value.]*

If missing data is MCAR or MAR, then inference based on the observed data likelihood will not be biased.

If missing data is non-ignorable, then inference based on the observed data likelihood is provably biased. A simple mean estimation example:

| Sample | Values | Mean |
| --- | --- | --- |
| Data | 5, 6, 4, 5, 6, 3, 3, 4, 8, 5, 6, 4, 4, 7, 6, 5, 4, 5, 2, 6 | 4.90 |
| MCAR | 5, 4, 5, 6, 3, 8, 5, 7, 4, 2 | 4.90 |
| NI | 5, 3, 3, 4, 5, 4, 4, 6, 4, 5, 2 | 4.10 |

This simple mean estimation problem can be interpreted as fitting a normal distribution to the data, a simple unsupervised learning problem.

Just like the mean estimation example, any unsupervised learning algorithm that treats non-ignorable missing data as missing at random will learn biased estimates of model parameters.

The goals of this research project are:

1. Apply the theory developed by Little and Rubin to extend the standard unsupervised learning framework to correctly handle non-ignorable missing data.

2. Apply this extended framework to augment a variety of existing models, and show that tractable learning algorithms can be obtained.

3. Demonstrate that these augmented models outperform standard models on tasks where missing data is believed to be non-ignorable.

The current status of the project:

1. We have been able to augment mixture models to account for non-ignorable missing data.

2. We have derived efficient learning and exact inference algorithms for the augmented models.

3. We have obtained empirical results on synthetic data sets showing the augmented models learn accurately.

4. Preliminary results were recently submitted to AISTATS.

- Complete data matrix.
- Observed elements of the data matrix.
- Missing elements of the data matrix.
- Matrix of response indicators.
- Data model.
- Selection or observation model.

Under this notation the MAR assumption can be expressed as follows:

Basically this says the distribution over the response indicators is independent of the missing data.
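In symbols (notation assumed here, following Little and Rubin: Y split into observed part Y^obs and missing part Y^mis, R the response indicators, and μ the selection model parameters), the MAR assumption reads:

```latex
P(R \mid Y^{obs}, Y^{mis}, \mu) = P(R \mid Y^{obs}, \mu)
```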

The standard procedure for unsupervised learning is to maximize the observed data likelihood. The correct procedure is to maximize the full data likelihood.

In an unsupervised learning setting with non-ignorable missing data, the correct learning procedure is to maximize the expected full log likelihood.
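A sketch of the two objectives for a single case with discrete data (θ the data model parameters, μ the selection model parameters; the notation is assumed, not taken from the slides). The observed data likelihood ignores the response indicators, while the full data likelihood includes the selection model:

```latex
\mathcal{L}_{obs}(\theta) = \log \sum_{y^{mis}} P(y^{obs}, y^{mis} \mid \theta),
\qquad
\mathcal{L}_{full}(\theta, \mu) = \log \sum_{y^{mis}} P(r \mid y^{obs}, y^{mis}, \mu)\, P(y^{obs}, y^{mis} \mid \theta)
```

Under MAR the selection term factors out of the sum, so maximizing either objective gives the same θ; under non-ignorable missingness it does not, and the full likelihood must be used.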

In the work that follows we assume a multinomial mixture model as the data model. It is a simple baseline model that is quite effective in many discrete domains.
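As a baseline, EM for the multinomial mixture on fully observed data can be sketched as follows (a minimal illustration; variable names, smoothing constant, and iteration count are assumptions, and the non-ignorable extensions would augment this with a selection model):

```python
import numpy as np

def em_multinomial_mixture(Y, K, V, iters=100, seed=0):
    """EM for a multinomial mixture, P(y) = sum_z P(z) prod_m P(y_m | z).

    Y: (N, M) integer array with values in 0..V-1, fully observed.
    """
    rng = np.random.default_rng(seed)
    N, M = Y.shape
    theta = rng.dirichlet(np.ones(K))              # mixing proportions P(z)
    beta = rng.dirichlet(np.ones(V), size=(K, M))  # beta[z, m, v] = P(y_m=v | z)
    for _ in range(iters):
        # E-step: responsibilities q(z | y_n) ∝ P(z) * prod_m P(y_nm | z)
        log_r = np.tile(np.log(theta), (N, 1))     # (N, K)
        for m in range(M):
            log_r += np.log(beta[:, m, Y[:, m]]).T
        log_r -= log_r.max(axis=1, keepdims=True)
        q = np.exp(log_r)
        q /= q.sum(axis=1, keepdims=True)
        # M-step: re-estimate parameters from expected counts.
        theta = q.mean(axis=0)
        for m in range(M):
            for v in range(V):
                beta[:, m, v] = q[Y[:, m] == v].sum(axis=0)
        beta += 1e-9                               # smoothing to avoid log(0)
        beta /= beta.sum(axis=2, keepdims=True)
    return theta, beta, q

# Usage: two well-separated clusters (rows of all 0s vs. rows of all 2s).
Y = np.concatenate([np.zeros((100, 4), dtype=int), np.full((100, 4), 2)])
theta, beta, q = em_multinomial_mixture(Y, K=2, V=3)
labels = q.argmax(axis=1)
```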

*[Figure: plate diagram of the standard mixture model — latent variable Zn (parameters q) generating data variables Y1n…YMn (parameters b) for each case n, plate n=1:N.]*

If we fully connect the response indicators to the data variables we get the most general selection model, but it is not tractable.

*[Figure: plate diagram of the Mixture/Fully Connected model — latent variable Zn (parameters q), data variables Ymn (parameters b), and response indicators Rmn connected to all data variables, plates n=1:N and m=1:M.]*

To derive tractable learning and inference algorithms we need to assert further independence relations.

*[Figure: plate diagram of the Mixture/CPT-v model — latent variable Zn, data variables Ymn, and response indicators Rmn, where each Rmn depends only on its own data variable Ymn, plates n=1:N and m=1:M.]*

Exact inference and learning for the Mixture/CPT-v model is only slightly more complex than in a standard mixture model.
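As the name suggests, the CPT-v selection model is a conditional probability table indexed by the value v: each value has its own Bernoulli observation probability (μ is an assumed symbol, shared across items and data cases):

```latex
P(R_{mn} = 1 \mid Y_{mn} = v) = \mu_v
```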

The LOGIT-v,mz model assumes a functional form for the missing data parameters. It is able to model a wider range of effects.

*[Figure: plate diagram of the Mixture/LOGIT-v,mz model — latent variable Zn, data variables Ymn, and response indicators Rmn, where each Rmn depends on its data variable Ymn and on the latent variable Zn, plates n=1:N and m=1:M.]*

Exact inference is still possible, but learning requires gradient based techniques for s and w.
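One plausible parameterization consistent with the description above — a value effect and an item/latent-variable effect combined through a logistic link, with s and w the parameters mentioned as requiring gradient-based learning. The exact functional form here is a sketch, not necessarily the one used in the talk:

```latex
P(R_{mn} = 1 \mid Y_{mn} = v, Z_n = z) = \sigma\!\left(w_v + s_{mz}\right),
\qquad
\sigma(x) = \frac{1}{1 + e^{-x}}
```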

- Sample mixture model parameters from Dirichlet priors.
- Sample 5000 complete data cases from the mixture model.
- Apply each missing data effect and resample complete data to obtain observed data.
- Train each model on observed data only.
- Measure prediction error on complete data set.
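The data-generation half of this procedure can be sketched as follows (dimensions, Dirichlet concentrations, and the value-based observation probabilities are illustrative placeholders; the training and evaluation steps are omitted):

```python
import numpy as np

rng = np.random.default_rng(1)
N, M, V, K = 5000, 8, 5, 3

# 1. Sample mixture model parameters from Dirichlet priors.
theta = rng.dirichlet(np.ones(K))               # P(z)
beta = rng.dirichlet(np.ones(V), size=(K, M))   # P(y_m = v | z)

# 2. Sample 5000 complete data cases from the mixture model
#    (inverse-CDF sampling of each categorical variable).
z = rng.choice(K, size=N, p=theta)
cum = beta.cumsum(axis=2)                       # (K, M, V)
U = rng.random((N, M))
Y = np.minimum((U[:, :, None] > cum[z]).sum(axis=2), V - 1)

# 3. Apply a value-based (CPT-v style) missing data effect: the observation
#    probability of an entry depends on the entry's own value.
mu = np.linspace(0.1, 0.9, V)                   # higher values observed more often
R = rng.random((N, M)) < mu[Y]

# 4./5. Train each model on the observed entries only (Y where R is True),
#       then measure prediction error against the complete data.
print(R.mean())                                 # overall observed fraction
```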

Value Based Effect

Item/Latent Variable Effect

- Train LOGIT-v,mz model on observed data.
- Look at parameters and full likelihood values after training.

- EachMovie Collaborative Filtering Data Set:
- Base: 2.8M Ratings, 73K users, 1.6K movies, 97.6% missing
- Filtering: Min 20 ratings per user.
- Train: 2.1M Ratings, 30K Users, 95.6% missing

- Jester Collaborative Filtering Data Set :
- Base: 900K Ratings, 17K users, 100 jokes, 50.4% missing
- Filtering: Continuous –10 to +10 scale mapped to a discrete 5-point scale.
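The Jester filtering step (mapping continuous –10 to +10 ratings onto a discrete 5-point scale) could look like this; the equal-width bin edges are an assumption, not taken from the talk:

```python
import numpy as np

def discretize(r):
    """Map continuous ratings in [-10, 10] to a 1..5 scale (assumed equal-width bins)."""
    edges = np.array([-6.0, -2.0, 2.0, 6.0])  # hypothetical cut points
    return np.searchsorted(edges, r) + 1

print(discretize(np.array([-10.0, -3.5, 0.0, 4.2, 10.0])))  # → [1 2 3 4 5]
```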

We have shown positive preliminary results on synthetic data with both the CPT-v and the LOGIT-v,mz models. We have shown that the LOGIT-v,mz model does something reasonable on real data.

To show convincing results on real data, we need to look at new procedures for collecting data, and possibly new experimental procedures for validating models under this framework.

We have proposed a framework for dealing with non-ignorable missing data by augmenting existing models with a general selection model.
