Tony Tam The Chinese University of Hong Kong May 11, 2010

Accounting for Dynamic Selection Bias in Education Transitions: A Large-sample Evaluation of Latent Class Regression Estimators
A similar version was presented at the 2008 RC28 conference in Florence, Italy.

Motivation
• Cameron & Heckman’s (CH98) paper proposed a number of ways to “fix” the identification problem of the sequential logit model of educational transitions (sometimes called the conditional logit model or the Mare model).
• This presentation uses a very large simulated dataset to evaluate CH98’s proposal of applying latent class (LC) regression to fix the problem for cross-sectional survey data. This approach offers a nonparametric control for stable unobserved heterogeneity.
• How well does this proposed remedy work?
• What alternative strategies are available and how well do they work?

Design of the Simulation
• Use simulated data to illustrate the extent to which the selection bias can seriously distort numerical and qualitative results, ignoring for now the issue of sampling variability.
• The main results are based on a large sample of 100,000 individuals with up to 5 transitions.
• Consider a model with only two observed variables and a single source of stable unobserved heterogeneity (Xa):
  • observed parental education (Xe)
  • observed family income (Xi)
  • unobserved respondent’s ability (Xa).
• Xe, Xi, and Xa are uniformly distributed.

Assumptions
• Central to the whole problem of selection bias is stable unobserved heterogeneity, which may also be called stable unobserved ability without loss of generality.
• This stability causes the problem, but it is also a crucial source of identification in the LC regression approach to adjusting for selection bias.
• This or some other assumptions may fail and defeat the LC regression approach. Two other important assumptions are:
  • The random-effect assumption: the stable unobserved heterogeneity is independent of observed variables.
  • The effect parameters are constant across individuals and transitions. This is of course an assumption to which Heckman himself would be the first to strongly object.

Unobserved ability (Xa) may be conceived as a rank order measure (1000 distinct ranks).
• Generating Xa as a uniform distribution facilitates the visualization of dynamic selection across transitions.
• By design, ability (Xa) is generated as statistically independent of parental education (Xe) and income (Xi) in the full population.
• Yet Xa becomes correlated with Xe and Xi after the 1st transition.
• The correlation between Xe and Xi is 0.70.
• Xa is unobservable but has two trichotomous indicators: Xa3 and Xa3e.
• The correlations of Xa with Xa3 and Xa3e are 0.88 and 0.50, respectively.
• Xa3e is a relatively noisy indicator of Xa (reliability=0.25).

Yt* = 100 Xe + 100 Xi + 150 Xa + 50 Vt
Log[P(Y=1)/P(Y=0)] = k*100 Xe + k*100 Xi + k*150 Xa
• Yt* is a latent variable at time t, measured by a binary indicator Y of transition.
• This model is just an example of the logit model on page 273 of CH98, with
  • θ = k*150 Xa, where k is a constant of proportionality
  • Vt following a logistic distribution.
• In this model, not only is ability stable, but its effect on the latent outcome is also stable. As underscored above, this stability is a crucial source of identification.
• This population model implies that the coefficients of Xe and Xi are equal and invariant across all transitions.
• Since scale is underidentified, only the relative sizes of the coefficients are relevant.
• Both Xa and Vt are unobservable.
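The link between the latent-variable equation and the logit equation can be sketched numerically. The sketch below assumes that Vt is a standard logistic draw, that the covariates lie on a 0 to 1 scale, and that a transition threshold enters as an intercept (the slide's logit equation omits it). Under those assumptions the constant of proportionality works out to k = 1/50. The function names are illustrative only.

```python
import math
import random

def transition_prob(xe, xi, xa, threshold, scale=50.0):
    """P(Yt* > threshold) when Vt is standard logistic (an assumption).

    Yt* = 100*xe + 100*xi + 150*xa + scale*Vt, so the log-odds equal
    (100*xe + 100*xi + 150*xa - threshold) / scale, i.e. k = 1/scale.
    """
    index = 100.0 * xe + 100.0 * xi + 150.0 * xa
    return 1.0 / (1.0 + math.exp(-(index - threshold) / scale))

def simulate_rate(xe, xi, xa, threshold, n=100_000, seed=1):
    """Monte Carlo check: draw Vt by inverse CDF and count Yt* > threshold."""
    rng = random.Random(seed)
    index = 100.0 * xe + 100.0 * xi + 150.0 * xa
    hits = 0
    for _ in range(n):
        # Clamp away from 0 and 1 so the log-odds transform stays finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        v = math.log(u / (1.0 - u))  # standard logistic draw
        if index + 50.0 * v > threshold:
            hits += 1
    return hits / n

analytic = transition_prob(0.5, 0.5, 0.5, threshold=150.0)
simulated = simulate_rate(0.5, 0.5, 0.5, threshold=150.0)
```

The analytic and simulated probabilities should agree up to Monte Carlo error, which is the sense in which the latent-variable model "is" a logit model.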

Individuals make a transition if Yt* > threshold
• The threshold varies across transitions and results in the following unconditional rates:
  • 90% enter high school
  • 45% graduate from high school
  • 40% reach some postsecondary education
  • 20% have a college degree
  • 10% have postgraduate education
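The unconditional rates above imply a conditional pass rate at each transition among those still at risk. A minimal sketch of that arithmetic, assuming the listed percentages are shares of the full population and that transitions are strictly sequential:

```python
unconditional = [0.90, 0.45, 0.40, 0.20, 0.10]

def conditional_rates(rates):
    """Convert cumulative (unconditional) rates to per-transition pass rates."""
    out = []
    at_risk = 1.0  # everyone is at risk of the first transition
    for r in rates:
        out.append(r / at_risk)
        at_risk = r  # only those who passed remain at risk
    return out

rates = conditional_rates(unconditional)
```

Under these assumptions, only about half of those at risk pass transitions 2, 4, and 5, while roughly 89% of high school graduates go on to some postsecondary education.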

A Graphical Perspective on Dynamic Selection
[Figure panels: Before Any Transition; Transition 1; Transition 2; Transition 3; Transition 4; Transition 5]
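The dynamic selection shown in these panels can be approximated with a small simulation. This is a simplified sketch, not the presentation's actual data-generating code: it draws Xe, Xi, and Xa as independent uniforms (so the 0.70 correlation between Xe and Xi is not reproduced), redraws Vt independently at every transition, and sets each threshold empirically so that the surviving share matches the target unconditional rate. All names are illustrative.

```python
import math
import random

def pearson(xs, ys):
    """Plain Pearson correlation, written out to avoid extra dependencies."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def simulate_selection(n=50_000, seed=7,
                       targets=(0.90, 0.45, 0.40, 0.20, 0.10)):
    """Return (surviving share, corr(Xa, Xe) among survivors) per transition."""
    rng = random.Random(seed)
    # Assumption: independent uniform(0, 1) draws for (Xe, Xi, Xa).
    at_risk = [(rng.random(), rng.random(), rng.random()) for _ in range(n)]
    summary = []
    for target in targets:
        scored = []
        for xe, xi, xa in at_risk:
            u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
            v = math.log(u / (1.0 - u))  # fresh logistic shock Vt
            y_star = 100 * xe + 100 * xi + 150 * xa + 50 * v
            scored.append((y_star, (xe, xi, xa)))
        # Empirical threshold: keep the top scorers so that the surviving
        # share of the original population equals the target rate.
        scored.sort(key=lambda item: item[0], reverse=True)
        keep = round(target * n)
        at_risk = [person for _, person in scored[:keep]]
        corr = pearson([p[2] for p in at_risk], [p[0] for p in at_risk])
        summary.append((len(at_risk) / n, corr))
    return summary

summary = simulate_selection()
```

Although Xa and Xe are independent in the full population, selecting on a score that rewards both induces a negative correlation between them among survivors, which is the mechanism behind the bias in the conventional estimator.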

Q1. Does CH98’s Solution Work? Table 1
• Presents results from two models: (1) the conventional sequential logit (CSL) model and (2) the CH98 model applied to a pooled file of transition records, using a transition-by-X interaction specification.
• All standard errors have been adjusted for clustering of records within individuals.
• The CH98 proposal is to allow for a dichotomous measure of unobserved heterogeneity, that is, a random-effects intercept with 2 probability masses.
• To facilitate a closer comparison between the LC solution and one that uses a trichotomous indicator of ability, the following LC analysis allows for 3 latent classes.
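The idea of a random-effects intercept with a few probability masses can be sketched as a finite-mixture likelihood for one person's transition records. This is a generic illustration under stated assumptions, not LatentGOLD's estimation routine: slopes are shared across classes, only the intercept differs by class, and all names and argument layouts are hypothetical.

```python
import math

def logistic(z):
    return 1.0 / (1.0 + math.exp(-z))

def person_likelihood(records, betas, class_intercepts, class_weights):
    """Mixture likelihood for one person's sequence of transition records.

    records: list of (y, x) pairs, y in {0, 1}, x a list of predictor values.
    betas: slopes shared by all classes (class-independent).
    class_intercepts / class_weights: one intercept and one probability
    mass per latent class; the weights should sum to 1.
    """
    total = 0.0
    for alpha, weight in zip(class_intercepts, class_weights):
        lik = 1.0
        for y, x in records:
            p = logistic(alpha + sum(b * v for b, v in zip(betas, x)))
            lik *= p if y == 1 else 1.0 - p
        # The same class intercept applies to every record of this person,
        # which is how stability of the heterogeneity enters the model.
        total += weight * lik
    return total

records = [(1, [0.2, 0.4]), (1, [0.2, 0.4]), (0, [0.2, 0.4])]
two_mass = person_likelihood(records, [1.0, 1.0], [-1.0, 1.0], [0.5, 0.5])
```

Because one intercept is shared across all of a person's records, repeated observations on the same individual are what separate the classes from the slopes.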

Table 1. Sequential Logit Models for Pooled Data on Five Transitions, with and without Adjustment for Unobserved Ability
Notes: Col. (1) is the conventional sequential logit (CSL) model for pooled transitions data without adjustment for selection bias. Col. (2) is CH98’s LC regression, but with 3 LC (which turned out no different from 2 LC).

Key Findings from Table 1
• The conventional estimators have much lower S.E.’s than those of CH98.
• However, the conventional estimators are seriously misleading/biased with respect to the generating mechanism underlying the simulated data.
• The CH98 estimators work well for most transitions, but their S.E.’s are quite large.
• By and large, CH98 is vindicated.

LatentGOLD 4.0 Batch Program Codes for Model 2
Just a few items to specify:
• Dataset
• Model: LC regression, # of classes
• Dependent variable
• Independent variables (“predictors”)
• Person ID
• Class independent parameters
• Measurement scale of variables

spss='C:\wk\edutran.sav';
* The following lines (this slide) are technical specifications about the estimation algorithm, not model spec per se except 1st line.
model:'Table1-Model(2)' regression 3 /
  toler=1e-008 tolem=0.01 tolran=1e-005 bayes=1
  bayess2=1 bayeslat=1
  bayespoi=1
  iterem=2500 iternr=1000 itersv=500
  iterboot=500
  nseed=632544 nseedboot=0
  nrand=0
  usemiss=No
  sewald=serobust dummy=first
  outsect=0x1c17
;

dependent DYt; [depvar=binary observed transition status]
replicate pid; [ID to identify data records of same person]
predictor tran Xe XeT2 XeT3 XeT4 XeT5 Xi XiT2 XiT3 XiT4 XiT5; [“predictor” option above specifies the independent variables to include in the logit model.]
attr tran nominal ; [“attr” option is for measurement scale]
attr Xe ordinal ;
attr Xi ordinal ;
attr DYt ordinal ;
attr XeT2 ordinal ;
attr XeT3 ordinal ;
attr XeT4 ordinal ;
attr XeT5 ordinal ;
attr XiT2 ordinal ;
attr XiT3 ordinal ;
attr XiT4 ordinal ;
attr XiT5 ordinal ;
classind tran Xe XeT2 XeT3 XeT4 XeT5 Xi XiT2 XiT3 XiT4 XiT5; [“classind” option specifies which coefficients are class independent; all except the intercept]
End;
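The predictor list above suggests a pooled person-transition file, with one record per person per transition at risk. The sketch below shows one way such records could be built. The layout of edutran.sav is not shown in the slides, so the coding used here (XeTk equals Xe on the record for transition k and 0 otherwise, and likewise for XiTk) is an assumption inferred from the variable names, and the helper name is hypothetical.

```python
def expand_person(pid, xe, xi, outcomes):
    """Expand one person into pooled transition records.

    outcomes: 0/1 results for transitions 1..5 in order; records stop after
    the first failed transition because the person is no longer at risk.
    """
    rows = []
    for t, passed in enumerate(outcomes, start=1):
        row = {"pid": pid, "tran": t, "DYt": passed, "Xe": xe, "Xi": xi}
        # Assumed interaction coding: covariate times a transition dummy,
        # with transition 1 as the reference (no XeT1 / XiT1 columns).
        for k in range(2, 6):
            row[f"XeT{k}"] = xe if t == k else 0.0
            row[f"XiT{k}"] = xi if t == k else 0.0
        rows.append(row)
        if passed == 0:
            break
    return rows

rows = expand_person(pid=1, xe=0.6, xi=0.3, outcomes=[1, 1, 0, 0, 0])
```

With this coding, the coefficient on Xe is the transition-1 effect and each XeTk coefficient is the deviation of the transition-k effect from it.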

Q2. Would the Conventional Control for an Observed Indicator of Ability Help? Table 2
• Model 3 is a simple enhancement of the sequential logit model, only adding Xa3 (a trichotomous indicator of unobserved ability) to sequential logit model 1.
• Model 4 is a simple enhancement of the CH98 model, adding Xa3 to LC regression model 2.
• Model 5 is a direct extension of LC regression model 2, introducing Xa3 as a covariate that helps predict respondent membership in each of the latent classes.
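The difference between Model 4 and Model 5 is where Xa3 enters: as a regressor in the transition logit, or as a predictor of latent class membership. A minimal sketch of the latter, assuming a multinomial-logit form for the class probabilities (a common parameterization for covariates in latent class models). The parameter values and the function name are hypothetical.

```python
import math

def class_probs(xa3, intercepts, slopes):
    """Class membership probabilities given the indicator xa3.

    Assumed form: P(class c | xa3) is proportional to
    exp(intercepts[c] + slopes[c] * xa3).
    """
    scores = [a + b * xa3 for a, b in zip(intercepts, slopes)]
    top = max(scores)  # subtract the max for numerical stability
    expd = [math.exp(s - top) for s in scores]
    total = sum(expd)
    return [e / total for e in expd]

# Hypothetical parameters for 3 classes, with class 1 as the reference.
low = class_probs(0, intercepts=[0.0, 0.0, 0.0], slopes=[0.0, 1.0, 2.0])
high = class_probs(2, intercepts=[0.0, 0.0, 0.0], slopes=[0.0, 1.0, 2.0])
```

In this form the indicator shifts the mixing weights of the latent classes rather than the transition odds directly, so it informs which class a respondent likely belongs to without standing in for ability itself.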

Table 2. Sequential Logit Models for Pooled Data on Five Transitions, with a Crude Instrument for Unobserved Ability
Notes: Xa3 is a trichotomous indicator of unobserved ability (R-sq=.77). Model (3) emulates the strategy of Lucas (2001 AJS) whereas model (4) combines models (2) and (3). Estimates of (3) are rescaled using the first cell as reference to facilitate comparisons of coefficients and SE’s across variables, transitions, and models.

Key Findings from Table 2
• Model (3), like model (1), has the smallest S.E.’s but large and misleading errors in the point estimates.
• Model (4) improves over (2) on S.E.’s and still correctly reflects the true parameters.
• Model (5) is a remarkably consistent and efficient approach to incorporating information on unobserved ability.
• This shows the potential utility of the latent class regression framework, not only to obtain valid estimates but also to incorporate extra identifying information to improve statistical efficiency.

Q3. Can LC Regression Reveal the True Parameter Patterns Without the Interaction Terms? Table 3
• Latent class (LC) regression allows both intercept and slope coefficients to systematically vary across classes/regimes, hence an alternative to the interaction specification of Tables 1 and 2.
• Model (6) is a 3-class LC regression model with Xa3 as a control (among the explanatory variables), analogous to model (4).
• Model (7) is a 3-class LC regression model with Xa3 as a covariate predicting latent class membership, analogous to model (5).

Table 3. Sequential Logit Models for Pooled Data on Five Transitions, with Latent Class Regression Specifications
Note: All estimates are rescaled using the first cell as reference, only to facilitate comparisons of coefficients and SE’s across variables, classes, and models.

Q4. Does a Noisy Instrument for Unobserved Ability Help? Table 4
• Xa3e is a noisy version of Xa3. Both are trichotomous, but the R-sq between Xa3e and Xa is only 0.25, much weaker than the R-sq for Xa3 (0.77).
• Model (6e) is an LC regression model with Xa3e as a control, to be compared to model (6).
• Model (7e) is an LC regression model with Xa3e as a covariate predicting latent class membership, to be compared to model (7).

Table 4. Sequential Logit Models for Pooled Data on Five Transitions, LC Regressions with Noisy Crude IV
Note: Xa3e is a noisy indicator of unobserved ability (R-sq=.25). All estimates are rescaled using the first cell as reference.

Key Findings from Tables 3 and 4
• The LC regression framework is an effective tool for incorporating information on unobserved ability, either as a direct control or as a covariate in a probability model.
• The main lesson: incorporating proxies for stable unobserved heterogeneity as predictors of LC membership appears to provide substantial improvements over the CH98 solution in robustness, statistical consistency, and efficiency. The gains are achievable even with a weak and crude indicator of ability.

Conclusions
• As expected, the conventional estimators of the effects of parental education and income produce seriously misleading/biased results.
• Cameron and Heckman’s (1998) relatively nonparametric adjustment is vindicated.
• Yet there is much room for improvement on statistical efficiency. Introducing an indicator of unobserved heterogeneity proves to be very helpful.
• In particular, the latent class regression approach works very well in recovering the effect parameters, relative to each other and across transitions.
• Surprisingly, statistical consistency and efficiency are not appreciably diminished when a crude but reliable indicator of unobserved ability is replaced with an equally crude and noisy indicator.

• The transition-by-explanatory variable interaction specification appears to be especially susceptible to dynamic selection bias. LC regression effectively avoids the interaction specification.
• In results not shown here, allowing more than three latent classes does not improve model fit and tends to create numerical instability.
• Despite a very large sample size, the standard errors of parameter estimates under most specifications are large, driving home the fact that sample size is not the ultimate determinant of the amount of statistically relevant information.
• This also suggests that adjustment for unobserved heterogeneity is highly demanding of statistical information. Cross-sectional data often provide insufficiently rich information to identify important model parameters, as CH98 explained in detail.
