DATA PREPARATION AND SCREENING
James G. Anderson, Ph.D.

Importance
• Model specification
• Failure of model fitting
• Problems with parameter estimates
• Problems with tests of significance

Categories of Problems
• Case-related Issues
  • Missing Observations
  • Outliers
• Distributional/Relational Issues
  • Normality
  • Linearity
  • Homoscedasticity

Missing Data
• Missing Completely at Random (MCAR) – Missingness is statistically unrelated to the values that would have been observed.
• Missing at Random (MAR) – Missingness and data values are statistically unrelated conditional on a set of predictor or stratifying variables.
• Nonignorable Missing Data (NMD) – Missingness conveys probabilistic information about the values that would have been observed, beyond the information provided by the observed data.

Methods for Dealing with Missing Data
• Listwise deletion
• Pairwise deletion
• Mean replacement
• Regression replacement
• Pattern matching
• Maximum likelihood

Listwise Deletion (LD)
• Eliminates every observation that has any data value missing.
• Limitations:
  • Discards the other information that the respondent provided
  • Can reduce sample size substantially

Pairwise Deletion (PD)
• Excludes an observation from a calculation only when it is missing a value needed for that particular calculation.
• Limitations:
  • Each mean, variance, covariance, etc. that is calculated may be based on a different sample size.
  • Pairwise deletion may lead to out-of-bounds values, resulting in nonpositive definite/singular covariance matrices, negative variances, etc.
  • Pairwise deletion is not recommended for SEM.
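The difference between the two deletion methods can be made concrete with a small sketch. The data matrix and the helper names `listwise_cov` and `pairwise_cov` are hypothetical, introduced only for illustration; NaN stands for a missing value.

```python
import numpy as np

# Hypothetical data: 6 cases by 3 variables; NaN marks a missing value.
data = np.array([
    [1.0, 2.0, np.nan],
    [2.0, np.nan, 1.0],
    [3.0, 6.0, 2.0],
    [4.0, 8.0, 5.0],
    [5.0, 9.0, 4.0],
    [np.nan, 12.0, 6.0],
])

def listwise_cov(x):
    """Covariance matrix from cases that have no missing value at all."""
    complete = x[~np.isnan(x).any(axis=1)]
    return np.cov(complete, rowvar=False), len(complete)

def pairwise_cov(x):
    """Covariance matrix where each entry uses every case observed on both variables."""
    p = x.shape[1]
    cov = np.empty((p, p))
    n_used = np.empty((p, p), dtype=int)
    for i in range(p):
        for j in range(p):
            both = ~np.isnan(x[:, i]) & ~np.isnan(x[:, j])
            xi, xj = x[both, i], x[both, j]
            n_used[i, j] = both.sum()
            cov[i, j] = ((xi - xi.mean()) * (xj - xj.mean())).sum() / (both.sum() - 1)
    return cov, n_used

lw_cov, lw_n = listwise_cov(data)
pw_cov, pw_n = pairwise_cov(data)

print("Listwise N:", lw_n)
print("Pairwise N per matrix entry:")
print(pw_n)
# A pairwise matrix is not guaranteed to be positive definite;
# inspecting its smallest eigenvalue is one way to check.
print("Smallest eigenvalue (pairwise):", np.linalg.eigvalsh(pw_cov).min())
```

Listwise deletion keeps only the three complete cases here, while each pairwise entry draws on four or five cases, which is why the entries of a pairwise matrix rest on different sample sizes.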

Data Imputation (MI)
• Replaces the missing value with an estimate based on the observed data (e.g., the mean of the variable among those persons who reported it).

Data Imputation (AMOS)
• Regression Imputation. The model is initially fitted with ML. After setting model parameters to their ML estimates, linear regression is used to predict unobserved values for each case as a linear combination of the observed values for the same case.
• Stochastic Regression Imputation. Imputes values for each case by drawing at random from the conditional distribution of the missing values given the observed values, with the unknown model parameters fixed at their ML estimates.
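The contrast between the two regression-based approaches can be sketched as follows. This is a simplified illustration, not the AMOS procedure: it assumes a single fully observed predictor `x`, a partially missing `y`, and an ordinary least-squares fit on the observed cases in place of an ML-fitted structural model. All data and variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible

# Hypothetical data: x is fully observed, y is missing for roughly 30% of cases.
n = 200
x = rng.normal(size=n)
y = 2.0 + 1.5 * x + rng.normal(scale=1.0, size=n)
missing = rng.random(n) < 0.3
y_obs = np.where(missing, np.nan, y)

# Fit y on x using only the cases where y is observed (assumption: OLS, not ML SEM).
seen = ~missing
slope, intercept = np.polyfit(x[seen], y_obs[seen], deg=1)
resid = y_obs[seen] - (intercept + slope * x[seen])
resid_sd = resid.std(ddof=2)  # two estimated coefficients

# Regression imputation: fill in the predicted value itself.
y_reg = y_obs.copy()
y_reg[missing] = intercept + slope * x[missing]

# Stochastic regression imputation: predicted value plus a random residual draw.
y_sto = y_obs.copy()
y_sto[missing] = (intercept + slope * x[missing]
                  + rng.normal(scale=resid_sd, size=missing.sum()))

print("Variance of y, regression imputation:           ", y_reg.var(ddof=1))
print("Variance of y, stochastic regression imputation:", y_sto.var(ddof=1))
```

Plain regression imputation places every imputed value exactly on the fitted line, whereas the stochastic version adds residual noise so that the imputed values scatter around the line as observed values do.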

Data Imputation (AMOS)
• Bayesian Imputation. Like stochastic regression imputation, except that it takes into account the fact that the parameter values are only estimated and not known.

Performance of the Various Methods to Deal with Missing Data
• When the missing data are MCAR (missingness is entirely unrelated statistically to the values that would have been observed):
  • PD, LD, and full information maximum likelihood (FIML) all yield consistent solutions.
  • PD and LD are not as efficient as FIML.
  • MI is consistent for the first moments but yields biased variance and covariance estimates.
  • MI is not recommended for structural equation modeling, which is based on variance and covariance information.
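The point about first moments versus variances can be checked with a tiny worked example, assuming that MI here means replacing each missing value with the observed mean. The numbers are hypothetical.

```python
import numpy as np

# Hypothetical variable with two missing values (NaN).
y = np.array([2.0, 4.0, 6.0, 8.0, np.nan, np.nan])

observed = y[~np.isnan(y)]
filled = np.where(np.isnan(y), observed.mean(), y)  # mean replacement

print("Mean, observed cases:      ", observed.mean())       # 5.0
print("Mean, after imputation:    ", filled.mean())         # 5.0
print("Variance, observed cases:  ", observed.var(ddof=1))  # 20/3, about 6.67
print("Variance, after imputation:", filled.var(ddof=1))    # 4.0
```

The mean is unchanged, but the imputed cases add nothing to the sum of squared deviations while increasing N, so the variance shrinks. Covariances are attenuated in the same way.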

Performance of the Various Methods to Deal with Missing Data
• When the missing data are MAR (missingness and data values are statistically unrelated conditional on a set of predictor or stratifying variables):
  • PD, LD, and MI can produce severely biased results regardless of the sample size.
  • FIML yields parameter estimates that are consistent and efficient.

Performance of the Various Methods to Deal with Missing Data
• When the missing data are nonignorable (missingness conveys probabilistic information about the values that would have been observed):
  • All standard multivariate approaches can yield biased results.
  • There is some evidence, however, that FIML estimates tend to be less biased than those of other methods.
  • FIML is recommended for handling missing data.

NORMALITY
• Many SEM estimation procedures assume multivariate normal distributions.
• Univariate non-normality is indicated when the absolute value of the skew index exceeds 3.0 or that of the kurtosis index exceeds 10.
• Multivariate non-normality can be detected by indices of multivariate skew or kurtosis.
• Non-normal distributions can sometimes be corrected by transforming variables.
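A minimal sketch of the univariate screening rule, assuming the skew index is the third standardized moment and the kurtosis index is the fourth standardized moment minus 3 (so a normal distribution scores 0 on both). The function names and the use of absolute values are assumptions made for illustration.

```python
import numpy as np

def skew_index(x):
    """Third standardized moment (0 for a symmetric distribution)."""
    x = np.asarray(x, dtype=float)
    z = (x - x.mean()) / x.std()  # population SD (ddof=0)
    return float((z ** 3).mean())

def kurtosis_index(x):
    """Fourth standardized moment minus 3 (0 for a normal distribution)."""
    x = np.asarray(x, dtype=float)
    z = (x - x.mean()) / x.std()
    return float((z ** 4).mean() - 3.0)

def flag_nonnormal(x, skew_cut=3.0, kurt_cut=10.0):
    """Apply the cutoffs from the slide to the absolute index values."""
    return abs(skew_index(x)) > skew_cut or abs(kurtosis_index(x)) > kurt_cut

symmetric = [1, 2, 3, 4, 5]
lopsided = [0] * 99 + [100]  # hypothetical variable with one extreme score

for name, sample in [("symmetric", symmetric), ("lopsided", lopsided)]:
    print(name, skew_index(sample), kurtosis_index(sample), flag_nonnormal(sample))
```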

OUTLIERS
• Univariate outliers are scores more than three SDs away from the mean.
  • Detection by inspecting frequency distributions and univariate measures of skewness and kurtosis
• Multivariate outliers may have extreme scores on two or more variables, or their configurations of scores may be unusual.
  • Detection by inspecting indices of multivariate skewness and kurtosis. The squared Mahalanobis distance is distributed as chi-square with df equal to the number of variables.
• Can be remedied by correcting errors, by dropping these cases, or by transforming the variables
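The Mahalanobis check can be sketched as below. The data are hypothetical: twenty cases lying close to the line y = x, plus one case (2, 17) whose scores are each within range but whose configuration is unusual. With two variables the chi-square upper-tail probability has the closed form exp(-d²/2), which is used here; for other df a chi-square distribution function (for example `scipy.stats.chi2.sf`) would be needed. The p < .001 cutoff is an assumption chosen for illustration.

```python
import numpy as np

def mahalanobis_sq(data):
    """Squared Mahalanobis distance of each case from the sample centroid."""
    data = np.asarray(data, dtype=float)
    centered = data - data.mean(axis=0)
    cov_inv = np.linalg.inv(np.cov(data, rowvar=False))
    # d2_i = (x_i - mean)' S^-1 (x_i - mean), computed for all cases at once
    return np.einsum("ij,jk,ik->i", centered, cov_inv, centered)

# Hypothetical data: 20 cases near y = x, then one discordant case.
i = np.arange(20, dtype=float)
regular = np.column_stack([i, i + 0.5 * (-1.0) ** i])
data = np.vstack([regular, [2.0, 17.0]])

d2 = mahalanobis_sq(data)
p_values = np.exp(-d2 / 2.0)  # chi-square upper tail, valid only for df = 2

suspect = int(np.argmax(d2))
print("Largest d2 at case", suspect, "value", d2[suspect], "p =", p_values[suspect])
print("Cases with p < .001:", np.flatnonzero(p_values < 0.001))
```

The discordant case is not a univariate outlier on either variable, yet its squared distance is far larger than that of any other case.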

MULTICOLLINEARITY
• Occurs when intercorrelations among some variables are so high that certain mathematical operations are impossible or results are unstable because denominators are close to 0.
• Indicators: bivariate correlations > 0.85; multiple correlations > 0.90
• May cause a nonpositive definite/singular covariance matrix
• May be due to inclusion of both individual and composite variables
• Detection: Tolerance = 1 − R² < 0.10; Variance Inflation Factor (VIF) = 1/(1 − R²) > 10
• Can be corrected by eliminating or combining redundant variables
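Tolerance and VIF can be computed for every variable at once, as in the sketch below. It relies on the identity that the VIF of variable j equals the j-th diagonal element of the inverse correlation matrix, where R² is the squared multiple correlation of that variable with all the others. The data are hypothetical: two individual variables and a composite that is nearly their sum.

```python
import numpy as np

def tolerance_and_vif(data):
    """Return (tolerance, VIF) for each column of a cases-by-variables array."""
    corr = np.corrcoef(np.asarray(data, dtype=float), rowvar=False)
    vif = np.diag(np.linalg.inv(corr))  # VIF_j = 1 / (1 - R2_j)
    return 1.0 / vif, vif               # tolerance_j = 1 - R2_j

# Hypothetical data: x3 is a composite of x1 and x2 plus a small perturbation.
x1 = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
x2 = np.array([5.0, 3.0, 8.0, 1.0, 7.0, 2.0, 6.0, 4.0])
wiggle = np.array([0.1, -0.1, 0.1, -0.1, 0.1, -0.1, 0.1, -0.1])
x3 = x1 + x2 + wiggle
data = np.column_stack([x1, x2, x3])

tol, vif = tolerance_and_vif(data)
for name, t, v in zip(["x1", "x2", "x3"], tol, vif):
    status = "redundant" if (t < 0.10 or v > 10) else "ok"
    print(f"{name}: tolerance={t:.5f}  VIF={v:.1f}  {status}")
```

Because the composite is almost an exact linear combination of the two individual variables, all three are flagged; dropping the composite (or the individual variables) removes the redundancy.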

RELATIVE VARIANCES
• Covariance matrices in which the ratio of the largest to the smallest variance is greater than 10 are ill scaled.
• Most SEM estimation methods are iterative.
• Estimates may not converge to stable values when variances of observed variables are very different in magnitude.
• To prevent this problem, variables with extremely low or high variances can be rescaled by multiplying or dividing observed scores by a constant. This changes a variable's mean and variance but not its correlations with other variables.
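A short sketch of the variance-ratio check and the effect of rescaling. The two variables and the rescaling constant of 10,000 are hypothetical.

```python
import numpy as np

def variance_ratio(data):
    """Ratio of the largest to the smallest sample variance across columns."""
    variances = np.asarray(data, dtype=float).var(axis=0, ddof=1)
    return float(variances.max() / variances.min())

# Hypothetical data: income in dollars next to a 1-5 rating.
income = np.array([10000.0, 20000.0, 30000.0, 40000.0, 50000.0])
rating = np.array([1.0, 3.0, 2.0, 5.0, 4.0])
raw = np.column_stack([income, rating])

# Rescale income to units of $10,000 by dividing by a constant.
rescaled = np.column_stack([income / 10000.0, rating])

ratio_before = variance_ratio(raw)
ratio_after = variance_ratio(rescaled)
corr_before = np.corrcoef(raw, rowvar=False)
corr_after = np.corrcoef(rescaled, rowvar=False)

print("Variance ratio before rescaling:", ratio_before)
print("Variance ratio after rescaling: ", ratio_after)
print("Correlations unchanged:", np.allclose(corr_before, corr_after))
```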

LINEARITY
• SEMs assume linearity in the relations among the variables.
• Estimation of curvilinear and interactive effects is possible.

VIOLATIONS OF ASSUMPTIONS
• The best-known distribution with no excess kurtosis is the multivariate normal.
• Leptokurtic (more peaked) distributions result in too many rejections of H0 based on the chi-square statistic.
• Platykurtic distributions lead to chi-square estimates that are too low.

VARIABLE SCALES
• SEM in general assumes observed variables are measured on a linear continuous scale.
• Dichotomous and ordinal variables cause problems because correlations/covariances tend to be truncated. These scores are not normally distributed, and responses to individual items may not be very reliable.
• Some SEM programs, like LISCOMP, can analyze dichotomous and ordinal variables.
• PRELIS can be used to prepare a corrected covariance matrix for non-continuous variables.

VIOLATIONS OF ASSUMPTIONS
• High degrees of skewness lead to excessively large chi-square estimates.
• In small samples (N < 100), the chi-square statistic tends to be too large.

Reliability
• The degree to which scores are free from random measurement error
• Reliability measures
  • Internal Consistency Reliability
  • Test-retest Reliability
  • Alternate Forms Reliability

Reliability
• Levels of Reliability
  • 0.90 Excellent
  • 0.80 Very Good
  • 0.70 Adequate
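As an illustration of internal consistency reliability, the sketch below computes Cronbach's alpha, a commonly used coefficient that the slides do not name, so its use here is an assumption. The item scores are hypothetical, and treating the listed levels as lower bounds in `reliability_label` is also an assumption.

```python
import numpy as np

def cronbach_alpha(items):
    """Cronbach's alpha for a respondents-by-items array of scores."""
    items = np.asarray(items, dtype=float)
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1)       # variance of each item
    total_var = items.sum(axis=1).var(ddof=1)   # variance of the total score
    return float(k / (k - 1) * (1.0 - item_vars.sum() / total_var))

def reliability_label(coef):
    """Map a coefficient to the levels on the slide (read as lower bounds)."""
    if coef >= 0.90:
        return "excellent"
    if coef >= 0.80:
        return "very good"
    if coef >= 0.70:
        return "adequate"
    return "below adequate"

# Hypothetical scores: 5 respondents answering 3 items.
scores = np.array([
    [4, 5, 4],
    [2, 2, 3],
    [5, 5, 5],
    [3, 3, 2],
    [1, 2, 1],
])

alpha = cronbach_alpha(scores)
print("alpha =", round(alpha, 3), "->", reliability_label(alpha))
```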

Validity
• Whether the scores measure what they are supposed to measure
• Types of validity
  • Construct Validity (SEM Confirmatory Factor Analysis helps to establish construct validity)
  • Criterion-Related Validity (Correlation with an external standard)
  • Convergent Validity/Discriminant Validity (Can be determined through SEM Confirmatory Factor Analysis)

FORM OF INPUT DATA
• ASCII
• SPSS
• Microsoft Excel 3 through 8
• Microsoft Access (through Access 97)
• Microsoft FoxPro 2.0, 2.5, 2.6
• dBase 3 through 5
• Lotus 1-2-3 with wk1, wk3 and wk4 extensions

Reference • R.B. Kline, “Chapter 3. Data Preparation and Screening,” in Principles and Practice of Structural Equation Modeling, NY: Guilford Press, 2005, pp. 45-62.
