Advanced methods and analysis for the learning and social sciences
Download
1 / 50

Advanced Methods and Analysis for the Learning and Social Sciences - PowerPoint PPT Presentation


  • 75 Views
  • Uploaded on

Advanced Methods and Analysis for the Learning and Social Sciences. PSY505 Spring term, 2012 April 9, 2012. Today’s Class. Missing Data and Imputation. Missing Data. Frequently, when collecting large amounts of data from diverse sources, there are missing values for some data sources.

loader
I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
capcha
Download Presentation

PowerPoint Slideshow about ' Advanced Methods and Analysis for the Learning and Social Sciences' - millie


An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript
Advanced methods and analysis for the learning and social sciences

Advanced Methods and Analysis for the Learning and Social Sciences

PSY505Spring term, 2012

April 9, 2012


Today s class
Today’s Class Sciences

  • Missing Data and Imputation


Missing data
Missing Data Sciences

  • Frequently, when collecting large amounts of data from diverse sources, there are missing values for some data sources


Examples
Examples Sciences

  • It’s easy to think of examples; can anyone here give examples from your own current or past research or projects


Classes of missing data
Classes of missing data Sciences

  • Missing all data/“Unit nonresponse”

  • Missing all of one source of data

    • E.g. student did not fill out questionnaire but used tutor

  • Missing specific data/“Item nonresponse”

    • E.g. student did not answer one question on questionnaire

  • Subject dropout/attrition

    • Subject ceased to be part of population during study

      • E.g. student was suspended for a fight


What do we do
What do we do? Sciences


Case deletion
Case Deletion Sciences

  • Simply delete any case that has at least one missing value

  • Alternate form: Simply delete any case that is missing the dependent variable


Case deletion1
Case Deletion Sciences

  • In what situations might this be acceptable?

  • In what situations might this be unacceptable?

  • In what situations might this be practically impossible?


Case deletion2
Case Deletion Sciences

  • In what situations might this be acceptable?

    • Relatively little missing data in sample

    • Dependent variable missing, and journal unlikely to accept imputed dependent variable

    • Almost all data missing for case

      • Example: A student who is absent during entire usage of tutor

  • In what situations might this be unacceptable?

  • In what situations might this be practically impossible?


Case deletion3
Case Deletion Sciences

  • In what situations might this be acceptable?

  • In what situations might this be unacceptable?

    • Data loss appears to be non-random

      • Example: The students who fail to answer “How much marijuana do you smoke?” have lower GPA than the average student who does answer that question

    • Data loss is due to attrition, and you care about inference up until the point of the data loss

      • Student completes pre-test, tutor, and post-test, but not retention test

  • In what situations might this be practically impossible?


Case deletion4
Case Deletion Sciences

  • In what situations might this be acceptable?

  • In what situations might this be unacceptable?

  • In what situations might this be practically impossible?

    • Almost all students missing at least some data


Analysis by analysis case deletion
Analysis-by-Analysis Case Deletion Sciences

  • Common approach

  • Advantages?

  • Disadvantages?


Analysis by analysis case deletion1
Analysis-by-Analysis Case Deletion Sciences

  • Common approach

  • Advantages?

    • Every analysis involves all available data

  • Disadvantages?

    • Difficult to do omnibus or multivariate analyses

    • Are your analyses fully comparable to each other?


Mean substitution
Mean Substitution Sciences

  • Replace all missing data with the mean value for the data set

  • Mathematically equivalent: unitize all variables, and treat missing values as 0


Mean substitution1
Mean Substitution Sciences

  • Advantages?

  • Disadvantages?


Mean substitution2
Mean Substitution Sciences

  • Advantages?

    • Simple to Conduct

    • Essentially drops missing data from analysis without dropping case from analysis entirely

  • Disadvantages?

    • Distorts variances and correlations

    • May bias against significance in an over-conservative fashion


Distortion from mean substitution
Distortion From Mean Substitution Sciences

  • Imagine a sample where the true sample is that 50 out of 1000 students have smoked marijuana

  • GPA

  • Smokers: M=2.6, SD=0.5

  • Non-Smokers: M=3.3, SD=0.5


Distortion from mean substitution1
Distortion From Mean Substitution Sciences

  • However, 30 of the 50 smokers refuse to answer whether they smoke, and 20 of the 950 non-smokers refuse to answer

  • And the respondents who remain are fully representative

  • GPA

  • Smokers: M=2.6

  • Non-Smokers: M=3.3


Distortion from mean substitution2
Distortion From Mean Substitution Sciences

  • GPA

  • Smokers: M=2.6

  • Non-Smokers: M=3.3

  • Overall Average: M=3.285


Distortion from mean substitution3
Distortion From Mean Substitution Sciences

  • GPA

  • Smokers: M=2.6

  • Non-Smokers: M=3.3

  • Overall Average: M=3.285

  • Smokers (Mean Sub): M= 3.02

  • Non-Smokers (Mean Sub): M= 3.3


Mar and mnar
MAR and MNAR Sciences

  • “Missing At Random”

  • “Missing Not At Random”


MAR Sciences

  • Data is MAR if

  • R = Missing data

  • Ycom = Complete data set (if nothing missing)

  • Yobs = Observed data set


MAR Sciences

  • In other words

  • If values for R are not dependent on whether R is missing or not, the data is MAR


Mar and mnar1
MAR and MNAR Sciences

  • Are these MAR or MNAR?

  • Students who smoke marijuana are less likely to answer whether they smoke marijuana

  • Students who smoke marijuana are likely to lie and say they do not smoke marijuana

  • Some students don’t answer all questions out of laziness

  • Some data is not recorded due to server logging errors

  • Some students are not present for whole study due to suspension from school due to fighting


Mar and mnar2
MAR and MNAR Sciences

  • MAR-based estimation may often be reasonably robust to violation of MAR assumption(Graham et al., 2007; Collins et al., 2001)

  • Often difficult to verify for real data

    • In many cases, you don’t know why data is missing…


Mar assuming approaches
MAR-assuming approaches Sciences

  • Single Imputation

  • Multiple Imputation

  • Maximum Likelihood Estimation


Single imputation
Single Imputation Sciences

  • Replace all missing items with statistically plausible values and then conduct statistical analysis

  • Mean substitution is a simple form of single imputation


Single imputation1
Single Imputation Sciences

  • Relatively simple to conduct

  • Probably OK when limited missing data



Other single imputation procedures1
Other Single Imputation Procedures Sciences

  • Hot-Deck Substitution: Replace each missing value with a value randomly drawn from other students (for the same variable)

  • Very conservative; biases strongly towards no effect by discarding any possible association for that value


Other single imputation procedures2
Other Single Imputation Procedures Sciences

  • Linear regression/classification:

  • For missing data for variable X

  • Build regressor or classifier predicting observed cases of variable X from all other variables

  • Substitute predictor of X for missing values


Other single imputation procedures3
Other Single Imputation Procedures Sciences

  • Linear regression/classification:

  • For missing data for variable X

  • Build regressor or classifier predicting observed cases of variable X from all other variables

  • Substitute predictor of X for missing values

  • Limitation: if you want to correlate X to other variables, this will increase the strength of correlation


Other single imputation procedures4
Other Single Imputation Procedures Sciences

  • Distribution-based linear regression/classification:

  • For missing data for variable X

  • Build regressor or classifier predicting observed cases of variable X from all other variables

  • Compute probability density function for X

    • Based on confidence interval if X normally distributed

  • Randomly draw from probability density function of each missing value

  • Limitation: A lot of work, still reduces data variance in undesirable fashions


Multiple imputation
Multiple Imputation Sciences

  • Conduct procedure similar to single imputation many times, creating many data sets

    • 10-20 times recommended by Schafer & Graham (2002)

  • Conduct meta-analysis across data sets, to determine both overall answer and degree of uncertainty


Multiple imputation procedure
Multiple Imputation Procedure Sciences

  • Several procedures – essentially extensions of single imputation procedures

  • One example


Multiple imputation procedure1
Multiple Imputation Procedure Sciences

  • Conduct linear regression/classification

  • For each data set

    • Add noise to each data point, drawn from a distribution which maps to the distribution of the original (non-missing) data set for that variable

    • Note: if original distribution is non-normal, use non-normal noise distribution


Maximum likelihood
Maximum Likelihood Sciences

  • Use expectation maximization to iterate between

  • Estimating the function to predict missing data, using both observed data and predictions of missing data

  • Predicting missing data


Maximum likelihood1
Maximum Likelihood Sciences

  • Limitations:

  • Do not account for noise and variance in observations as well as Multiple Imputation

  • Not very robust to departures from MAR


Mnar estimation
MNAR Estimation Sciences

  • Selection models

    • Predict missingness on a variable from other variables

    • Then attempt to predict missing cases using both the other variables, and the model of situations when the variable is missing


Reducing missing values
Reducing Missing Values Sciences

  • Of course, the best way to deal with missing values is to not have missing values in the first place

  • For more on this, please take PSY503!


Asgn 9
Asgn Sciences. 9

  • Students who completed Asgn. 9, please discuss your solutions

    • Christian

    • Mike Wixon

    • Zak


Wixon solution over fitting
Wixon Sciences Solution: Over-fitting?

  • Why might it be overfit?


Rogoff solution over fitting
Rogoff Sciences Solution: Over-fitting?

  • StandardizedExam =

  • 0.2396 * studentID +

  • -6.6658 * MathCareerInterest +

  • 26.919 * MathCourseInHS +

  • -12.6207


Now to compare the models of wixon versus rogoff
Now to compare the models of SciencesWixon versus Rogoff

  • On a new test set with no missing data


Results r
Results (r) Sciences


Wait what
Wait, what? Sciences


Wait what1
Wait, what? Sciences

  • Correlation in test set between student ID and standardized exam score = -0.0636

  • But coefficient = 0.2396

  • So weighted value ranges from 0.2396 to 11.98, which was not enough to hurt prediction


Asgn 10
Asgn Sciences. 10

  • Questions?

  • Comments?


Next class
Next Class Sciences

  • Wednesday, April 11

  • 3pm-5pm

  • AK232

  • Power Analysis

  • Readings

  • Lachin, J.M. (1981) Introduction to Sample Size Determination and Power Analysis for Clinical Trials Controlled Clinical Trials,2,93-113.[pdf]

  • Cohen, J. (1992) A Power Primer. Psychological Bulletin , 112 (1), 155-159.

  • Assignments Due: None


The end
The End Sciences


ad