
Beyond the Generalized Linear Mixed Model: a Hierarchical Bayesian Perspective


Presentation Transcript


  1. Beyond the Generalized Linear Mixed Model: a Hierarchical Bayesian Perspective. KSU Conference on Applied Statistics in Agriculture, April 30, 2012. Robert J. Tempelman, Professor, Department of Animal Science, Michigan State University, East Lansing, MI, USA

  2. “It is safe to say that improper attention to the presence of random effects is one of the most common and serious mistakes in the statistical analysis of data” Littell, R.C., W.W. Stroup and R.J. Freund. SAS System for Linear Models (2002), pg. 92 • This statement was likely intended to apply to biologists analyzing their own data • It surely does not apply to the “experts”…right?

  3. Mixed models in genetics & genomics and agriculture • Have we often thought carefully about how we use mixed models? • Do we sometimes mis-state the appropriate scope of inference? (lapses in both design…and analyses??) • Do we always fully appreciate/stipulate what knowledge we are conditioning on in data analyses? • Are far too many other efficiencies going untapped? • “Shrinkage is a good thing” Allison et al. (2006) • Hierarchical/mixed model inference should be carefully calibrated to exploit data structure while maintaining the integrity of the inference scope and being upfront about any conditioning specifications. • Particularly important with GLMM as opposed to LMM?

  4. Research and Public Credibility. • A disheartening article, “Raise standards for preclinical cancer research” by Glenn Begley in Nature (March 29, 2012) • Out of 53 papers deemed to be landmark studies in the last decade, scientific findings were confirmed in only 6 cases. • Some of these non-reproducible studies have spawned entire fields (and were cited 100s of times!) and triggered subsequent clinical trials. • What happened? 1) some studies were based on a small number of cell lines (narrow scope!); 2) an obsession to provide a ‘perfectly’ clean story; 3) poor data analyses too?

  5. Would having the “data” available in the public domain help? • Maybe not…consider the case of gene expression microarrays. • Data from microarray studies are routinely deposited in public repositories (GEO and ArrayExpress)…most based on very simple designs. • Ioannidis, J. P. A. et al. 2009. Repeatability of published microarray gene expression analyses. Nature Genetics 41: 149-155. • Out of 18 articles on microarray-based gene expression profiling published in Nature Genetics in 2005-2006, only two analyses could be reproduced in principle. • Why? Generally because of incomplete data annotation or specification of data processing and analysis.

  6. Outline of talk • The scope of inference issue! It’s a little murky sometimes; how scope depends on proper VC estimation. • (Generalized) linear mixed models • The implications of good/poor VC estimates on the “proper” scope of inference. • Bayesian methods. • Beyond the generalized linear mixed model • Hierarchical extensions provide a potentially richer class of models…and a calibration of scope of inference that may best match the “intended scope”.

  7. A ridiculously obvious example • Treatment A vs. B; two mice per treatment • Suppose you weigh each mouse 3 times:

            A1      A2      B1      B2
            20.1    21.2    22.1    21.5
            20.0    21.3    22.2    21.5
            20.1    21.3    22.2    21.5
     Mean   20.067  21.267  22.167  21.5

  How many experimental units (biological replicates)? Duh, Rob…it’s n = 2. Are the subsamples (technical replicates) useful? Oh yeah, well sure…they help control measurement error

  8. Rethinking experimental replication (based on a true story!) • Suppose you have one randomly selected (representative) litter with 12 piglets…assign 4 piglets at random to each of 3 treatments (A), (B), and (C):

     A A B C
     C B C B
     B A A C

  Is there any biological replication??? Well, actually no…n = 1. 4 piglets/trt is better than 2 piglets/trt -> but you’re only controlling for “measurement error” or “subsampling” with one litter

  9. Ok, well let’s now replicate. Have three randomly selected (representative) litters with 6 pigs each…assign 2 pigs at random to each of 3 treatments (A), (B), and (C) within each litter…6 pigs per treatment:

     C C A B A A
     B C B B B B
     A A A C C C

  How many experimental replicates per trt…?

  10. Scope of inference • Well, it might depend on your scope of inference… • It’s n = 6 (pigs) if the scope of inference is intended for just those three litters (“narrow scope”). • It’s n = 3 (litters) if the scope of inference is intended for the population of litters from which the three litters are a “random” sample (“broad scope”). • The analysis had better respect the experimental design and the intended scope. • McLean, R. A., W. L. Sanders, and W. W. Stroup. 1991. A unified approach to mixed linear models. American Statistician 45: 54-64.

  11. How do we properly decipher scope of inference? • Hierarchical statistical modeling; SPECIFICALLY, (generalized) linear mixed models • Proper distinction of, say, litters as fixed versus random effects. • Previous (RCBD with subsampling) example: Litter effects as fixed: narrow scope (sometimes desired!); Litter effects as random: broad scope. • Specify Litter*Treatment as the experimental error -> this determines n. • Proper mixed model analysis helps delineate true (experimental) from pseudo (subsampling) replication (see the sketch below).
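As a minimal sketch of that fixed-versus-random distinction (a hypothetical data set pigs with variables trt, litter, and response y; names are illustrative and not from the original slides), the two scopes correspond to two PROC MIXED calls in which the residual captures the subsampling error and trt*litter serves as the experimental error:

    /* Narrow scope: litter effects fixed; inference restricted to these litters */
    proc mixed data=pigs;
      class trt litter;
      model y = trt litter;
      random trt*litter;        /* experimental error term for trt */
    run;

    /* Broad scope: litter effects random; inference to the population of litters */
    proc mixed data=pigs;
      class trt litter;
      model y = trt;
      random litter trt*litter; /* trt*litter remains the experimental error */
    run;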

  12. Scope of inference • The focus of much agricultural research is ordinary replication (Johnson, 2006; Crop Science) • All replication conducted at a single research farm/station. • If the inference scope is for all farms of which the study farm is representative (i.e. farms are random effects), then… • Single farm study -> no replication in principle: treatment x herd serves as the experimental error term. • Treatment inferences may be influenced by management nuances. • Similar criticism could be levied against university research. • Hence, the continued need for multi-state research projects.

  13. Meta scope of inference • Treatments (A, B, and C) cross-classified with 6 “random” farms • Specify farm as fixed: “n” = 12 cows per trt • Specify farm as random: “n” = 6 trt*farm replicates per group. • Shouldn’t treatment effects be expected to be consistent across farms?…

  14. You can never tell even in the “best” of cases… THE CURSE OF ENVIRONMENTAL STANDARDIZATION • “environmental standardization is a cause rather than a cure for poor reproducibility of experimental outcomes”

  15. We shouldn’t aspire for environmental standardization in agricultural research… For the sake of agricultural sustainability & organismic (plant and animal) plasticity…we just can’t! • 2 MORE billion people to feed by 2050…on less land! • “management systems & environments are changing more rapidly than animal populations can adapt to such changes through natural selection” (Hohenboken et al., 2005)…e.g. • Energy policy (corn distiller’s grain) • More intensive management (larger farms) • Climate change • What are the implications for agricultural statisticians? • Even greater importance in terms of using reliable inference procedures AND carefully calibrating the scope of inference.

  16. Desired calibration of mixed model analyses requires reliable inference procedures • The broader the scope, the greater the importance. • Under classical assumptions (CA), inference on treatment effects depends on the reliability of fixed effects and variance component inference. • Linear mixed models (LMM) under CA: no real issues…we’ve already got great software (e.g. PROC MIXED) • E-GLS based on REML works reasonably well. • ANOVA (METHOD=TYPE3) for balanced designs might be even better (Littell et al., 2006); see the sketch below. • A common tool for many agricultural statistical consulting centers
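For reference, a hedged sketch of how the ANOVA-based alternative to the default REML is requested in PROC MIXED (same hypothetical pigs data as above; METHOD=REML is the default):

    /* ANOVA (Type 3) method-of-moments VC estimation for a balanced design */
    proc mixed data=pigs method=type3;
      class trt litter;
      model y = trt;
      random litter trt*litter;
    run;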

  17. Analysis of non-normal data • i.e., binary/binomial/count data…maybe a different story • Fixed effects models (no design structure): generalized linear model (GLM) inference is based on Wald tests/likelihood ratio tests, with nice asymptotic/large-sample properties. • Mixed effects models (treatment and design structure): more at stake with generalized linear mixed models (GLMM). • Asymptotic inference on fixed effects is conditioned upon… asymptotic inference on variance components (VC)

  18. From Rod Little’s 2005 ASA Presidential address • Does asymptotic behavior really depend on just “n”? • Impact of design complexity and number of fixed effects/random effects factors/levels relative to “n”? [Cartoon: “How many more to reach the promised land of asymptotia?” …“Murky sub-asymptotial forests”]

  19. Status of VC inference in GLMM • Default method in SAS PROC GLIMMIX: RSPL (a PQL-based method). • PQL has been discouraged (McCulloch and Searle, 2002), especially with binary data…generally downward bias! • Transformations of count data followed by LMM analyses are sometimes even advocated (Young et al. 1998). • METHOD=LAPLACE and METHOD=QUAD might be better suited (see the sketch below). • Aggh…but “ML”-like rather than “REML”-like. • Also, QUAD can’t be used for all model specifications.
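A minimal sketch of the competing GLIMMIX fits, assuming a hypothetical binary response y with a fixed treatment trt and a random cluster block (data set and variable names are illustrative, not from the slides):

    /* Default: RSPL, a PQL-type pseudo-likelihood method */
    proc glimmix data=mydata method=rspl;
      class trt block;
      model y = trt / dist=binary link=logit;
      random intercept / subject=block;
    run;

    /* ML-like integral approximation; method=quad additionally
       requires this kind of subject= processing */
    proc glimmix data=mydata method=laplace;
      class trt block;
      model y = trt / dist=binary link=logit;
      random intercept / subject=block;
    run;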

  20. How big is this issue? • PQL (RSPL) inference (Bolker et al., 2009): as a “rule”, it works poorly for Poisson data when the mean < 5 or for binomial data when y and n-y are both < 5. • Yet 95% of analyses of binary responses (n=205), 92% of analyses of Poisson responses with means < 5 (n=48), and 89% of analyses of binomial responses with y < 5 used PQL. • Bolker et al. (2009) further claim that 311/537 GLMM analyses used tools inappropriately in some way. • Does this contribute to the research reproducibility issue (over and beyond scope of inference)? Bolker, B.M. et al. 2009. Generalized linear mixed models: a practical guide for ecology and evolution. Trends in Ecology and Evolution 24: 127-135.

  21. Bayesian inference • “small sample inference” • The Bayesian revolution was started in the 1990s by the introduction of simulation-based Markov chain Monte Carlo (MCMC) methods. • It might seem to be “a cure” for asymptotic inference, but there are serious caveats. • MCMC is not a “plug and play” procedure like REML or GLS; there are potential nuances that need to be constantly checked (e.g., did I sample enough?) • Issue of sensitivity to priors…especially in small samples!!! (yet is likelihood inference “objective”?)

  22. Classic Experimental Test Case • Split Plot Design. • Different sizes of experimental units provide a proxy for demonstrating different scopes of inference. • Whole Plot Factor Inference: larger experimental unit (Whole Plot) • Sub Plot Factor Inference: smaller experimental unit (Sub Plot)

  23. Split Plot Design -> Animal Science Example • PEN serves a dual role: experimental unit for diet; block for drug. • Animal is the experimental unit for drug. • 4 animals per pen; 1 of 4 drugs assigned to one animal within each pen; pens numbered within diet. [Layout: Diet 1 and Diet 2, each applied to Pens #1-#3, with animals 1-4 within each pen.]

  24. Split Plot -> Plant Science Example • Field serves a dual role: experimental unit for irrigation level; block for variety. • Plot is the experimental unit for variety. • 4 plots per field; 1 of 4 corn varieties assigned to one plot within each field; fields numbered within irrigation level. [Layout: Irrigation levels 1 and 2, each applied to Fields #1-#3, with plots 1-4 within each field.]

  25. Split plot ANOVA (LMM) • [Split plot ANOVA table omitted.] So inference on Treatment A (whole plot factor) effects should be more sensitive to VC inference than Treatment B (sub plot factor) effects. • Since $\sigma^2_e$ is “constrained” for binary data in GLMM (logit or probit link), it is even less of an issue for Treatment B there…right? (A fitting sketch follows.)
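A hedged sketch of the corresponding split-plot LMM fit for the animal science example of slide 23 (hypothetical data set feedlot; diet = whole plot factor, drug = sub plot factor):

    proc mixed data=feedlot;
      class diet drug pen;
      model y = diet drug diet*drug / ddfm=satterthwaite;
      random pen(diet);   /* whole-plot error; the residual is the split-plot error */
    run;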

  26. A simulation study • Simulate data from a split plot design. • A: whole plot factor, a = 3 levels. • B: sub plot factor, b = 3 levels. • n = number of whole plots (WP) per whole plot factor level (n = 3 in the figure). • Note: if data are binary at the subplot level, they are binomial at the wholeplot level. [Layout: for each level A1-A3, whole plots WP 1-WP 3 each contain subplots B1, B2, B3 in randomized order.]

  27. Simulation study details • Let’s simulate normal ($\ell$) and binary ($y$) data • $\sigma^2_e = 1.00$ (let’s assume known) • $\sigma^2_{wp} = 0.5$ • Convert normal to binary data as $y = I(\ell > 0.5)$ • Note that with binary data, $\sigma^2_e = 1.00$ is not identifiable in a probit link GLMM (likewise for the logit link). • WholePlot factor: binomial; SubPlot factor: binary. • Let’s compare standard errors of differences (A1 vs. A2, B1 vs. B2) as functions of $\sigma^2_{wp}$. [Figure: $\mathrm{Prob}(\ell > 0.5 \mid \mu_{ij})$ plotted against $\mu_{ij}$.] A data-generation sketch follows.
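A minimal SAS data-step sketch of this generative model (one replicate; treatment effects for A and B are set to zero here because the slides do not state effect sizes, so this is the null configuration; the seed and the choice of n = 8 whole plots per A level are illustrative):

    data sim;
      call streaminit(2012);
      do A = 1 to 3;
        do wp = 1 to 8;                        /* n = 8 whole plots per A level */
          u = rand('normal', 0, sqrt(0.5));    /* sigma2_wp = 0.5               */
          do B = 1 to 3;
            e = rand('normal', 0, 1);          /* sigma2_e = 1 on latent scale  */
            l = u + e;                         /* latent normal response        */
            y = (l > 0.5);                     /* binary response y = I(l>0.5)  */
            output;
          end;
        end;
      end;
      keep A wp B l y;
    run;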

  28. SED of differences (A1 vs A2; B1 vs B2) for conventional LMM analysis of Gaussian data (n=8) • No surprises here: WholePlot factor inferences are sensitive to $\sigma^2_{wp}$; SubPlot factor inferences are insensitive to $\sigma^2_{wp}$. [Figure: standard errors of A1 vs A2 and B1 vs B2 plotted against $\sigma^2_{wp}$.]

  29. SED of differences (A1 vs A2; B1 vs B2) for conventional (asymptotic) GLMM analysis of binary data (n=8) • WholePlot factor inferences are sensitive to $\sigma^2_{wp}$. But so are SubPlot factor inferences (albeit less so)! • Does misspecification of VC have stronger implications for GLMM than for LMM? [Figure: standard errors of A1 vs A2 and B1 vs B2 plotted against $\sigma^2_{wp}$.]

  30. Implications • If we underestimate $\sigma^2_{wp}$ -> we understate standard errors and inflate Type I errors for conventional GLMM inference on marginal mean comparisons involving both whole plot AND subplot factors! • The obvious opposite implications hold whenever $\sigma^2_{wp}$ is overestimated. • What kind of performance on VC estimation do we get with METHOD=RSPL (PQL), LAPLACE, QUAD…MCMC methods? • And what about Bayesian (MCMC) methods? Should you use the Bayesian mean? median? Others?

  31. Back to the simulation study • Consider the same 3 x 3 split plot • Consider two different scenarios: n = 16 wholeplots / A level and n = 4 wholeplots / A level. • 20 replicated datasets for each comparison. • Compare LAPLACE, QUAD, and RSPL with Bayesian estimates of $\sigma^2_{wp}$ • Bayesian estimates: mean or median (others?). • Prior on $\sigma^2_{wp}$: [prior density omitted]; the prior variance is not defined for $\nu \le 4$.

  32. Scatterplot of VC estimates from 20 replicated datasets n = 16 wholeplots/ A level Everything lines up pretty well!

  33. Boxplots of VC estimates from 20 replicated datasets, n = 16 wholeplots / A level • Proportion of reps with convergence: 19/20, 16/20, 17/20. • RSPL biased downwards (conventional wisdom)

  34. Scatterplot of VC estimates from 20 replicated datasets n = 4 wholeplots / A level Much less agreement between methods

  35. Boxplots of VC estimates from 20 replicated datasets, n = 4 wholeplots / A level • Proportion of reps with convergence: 16/20, 7/20, 0/20. • Influenced by prior?

  36. Are Bayesian point estimators better/worse than other GLMM VC estimators? • No clear answers • I could have tried different “non-informative” priors and actually gotten rather different point estimates of VC for n = 4. • Implications then for Bayesian inference on fixed treatment effects? • “Embarrassment of riches”: inference is more than just point estimates; it involves entire posterior densities! • Bayesian inferences on fixed treatment effects average out (integrate over) the uncertainty on VC. • n = 4 might be so badly underpowered that the point is moot for this simulation study…but recall Bolker’s review!!!

  37. Any published formal comparisons between GLS/REML/EB(M/PQL) and MCMC for GLMM? • Check Browne and Draper (2006). • Normal data (LMM) • Generally, inferences based on GLS/REML and MCMC are sufficiently close. • Since GLS/REML is faster, it is the method of choice for classical assumptions. • Non-normal data (GLMM). • Quasi-likelihood based methods are particularly problematic in bias of point estimates and interval coverage of variance components. • Some fixed effects are poorly estimated too! • Bayesian methods with certain diffuse priors are well calibrated for both properties for all parameters. • Comparisons with Laplace not done yet (Masters project… anyone?). Browne, W.J. and Draper, D. 2006. A comparison of Bayesian and likelihood-based methods for fitting multilevel models. Bayesian Analysis. 1: 473-514

  38. Why do (some) animal breeders do Bayesian analysis? Consider the linear mixed model (Henderson et al. 1959; Biometrics): $\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{Z}\mathbf{u} + \mathbf{e}$, with $\mathbf{e} \sim N(\mathbf{0}, \mathbf{I}\sigma^2_e)$ and $\mathbf{u} \sim N(\mathbf{0}, \mathbf{A}\sigma^2_u)$, where $\mathbf{y}$ is $n \times 1$, $\boldsymbol{\beta}$ is $p \times 1$ ($p \ll n$), and $\mathbf{u}$ is $q \times 1$ ($q \gg n$); greatest interest is in $\mathbf{u}$, $\sigma^2_u$, and $\sigma^2_e$. • i.e., more animals to genetically evaluate than have data: “animal models” • Somewhat “pathological” models, but mixed model inference is viable (Tempelman and Rosa, 2004): “borrowing of information” • REML seems to work just fine for Gaussian data. • Put Bayes “on the shelf”.
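For context (a standard result that the slide alludes to rather than displays), the “borrowing of information” operates through Henderson’s mixed model equations:

\[
\begin{bmatrix}
\mathbf{X}'\mathbf{X} & \mathbf{X}'\mathbf{Z} \\
\mathbf{Z}'\mathbf{X} & \mathbf{Z}'\mathbf{Z} + \mathbf{A}^{-1}\lambda
\end{bmatrix}
\begin{bmatrix}
\hat{\boldsymbol{\beta}} \\
\hat{\mathbf{u}}
\end{bmatrix}
=
\begin{bmatrix}
\mathbf{X}'\mathbf{y} \\
\mathbf{Z}'\mathbf{y}
\end{bmatrix},
\qquad
\lambda = \frac{\sigma^2_e}{\sigma^2_u},
\]

where the $\mathbf{A}^{-1}\lambda$ term shrinks $\hat{\mathbf{u}}$ toward zero; this shrinkage is what keeps $q \gg n$ tractable.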

  39. For GLMM in animal breeding…it’s hard not to be Bayesian. • Binary or ordinal categorical data: probit link animal models (Tempelman and Rosa, 2004) • PQL methods are completely unreliable in “animal models” • “Restricted” Laplace is a little better but still biased (Tempelman, 1998; Journal of Dairy Science). • Fully Bayes inference using MCMC -> most viable. • Our models are becoming increasingly pathological! • Tens/hundreds of thousands of genetic markers (see later) on each animal for both normal and non-normal traits. • In that case, Bayesian methods become increasingly important…even for normal data. • i.e., asymptotic inference issues due to p…for the same n. Tempelman, R. J. and G. J. M. Rosa. 2004. Empirical Bayes approaches to mixed model inference in quantitative genetics. In: Genetic Analysis of Complex Traits Using SAS, A. M. Saxton (ed.). Cary, NC: SAS Institute Inc., 149-176.

  40. Where is the greatest need for Bayesian modeling? • Multi-stage hierarchical models. • When the classical distributional assumptions do not fit, e.g. $\mathbf{e}$ is NOT $\sim N(\mathbf{0}, \mathbf{R}\sigma^2_e)$ or $\mathbf{u}$ is NOT $\sim N(\mathbf{0}, \mathbf{A}\sigma^2_u)$ • Examples: heterogeneous variances and covariances across environments (a scope of inference issue?); different distributional forms (e.g. heavy-tailed or mixtures for residual/random effects); high-dimensional variable selection models (animal genomics)

  41. Heteroskedastic error (Kizilkaya and Tempelman, 2005) • Given: the residual variance has a certain “heteroskedastic” specification [model equation omitted]; its structure determines the nature of heterogeneous residual variances (see the next slide). • Could do something similar for GLMM (with overdispersion: binomial, Poisson, or ordinal categorical data with > 3 categories).

  42. “Mixed Model” for Heterogeneous Variances • Suppose the residual variance scales multiplicatively, $\sigma^2_{e,kl} = \sigma^2_e \,\gamma_k\, v_l$, with $\sigma^2_e$ as a “fixed” intercept residual variance • $\gamma_k > 0$: $k$th fixed scaling effect (e.g. sex) • $v_l > 0$: $l$th random scaling effect, $v_l \sim IG(\alpha_v, \alpha_v - 1)$ (e.g. herd), so that $E(v_l) = 1$ and $Var(v_l) = 1/(\alpha_v - 2)$ • Adjusting (estimating) $\alpha_v$ calibrates the scope of inference: high $\alpha_v$…less heterogeneity; low $\alpha_v$…higher heterogeneity.
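As a quick check of that $IG(\alpha_v, \alpha_v - 1)$ parameterization (standard inverse-gamma moments, not displayed on the slide):

\[
v \sim IG(\alpha, \beta) \;\Rightarrow\; E(v) = \frac{\beta}{\alpha - 1},
\qquad
Var(v) = \frac{\beta^2}{(\alpha - 1)^2(\alpha - 2)},
\]

so setting $\beta = \alpha_v - 1$ gives $E(v_l) = 1$ and $Var(v_l) = 1/(\alpha_v - 2)$: the single hyperparameter $\alpha_v$ then calibrates the degree of heterogeneity across herds.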

  43. Birthweights in Italian Piedmontese cattle • The lower the $\alpha_v$, the greater the shrinkage in estimated residual variances across herds: a “calibrated pooling” of error degrees of freedom. [Figure: 95% credible intervals in residual variances for birthweights for each of 66 “random” herds.] • Also fitted “fixed effects” of calf sex [estimates omitted].

  44. Heterogeneous bivariate G-side and R-side inferences! • It is well established that joint modeling of correlated traits provides efficiencies, especially with GLMM! • Bello et al. (2010, 2012) investigated the herd-level and cow-level relationships between 305-day milk production and calving interval (CI) as functions of various factors: random (herd) effects and residual (cow) effects. • P.S. Nora has also done this for bivariate Gaussian-binary analyses; see Bello et al. (2012b) in Biometrical Journal.

  45. Herd-Specific and Cow-Specific (Co)variances • [Bivariate model equations omitted: random effects for herd $k$ and residual effects for cow $j$, with herd-specific and cow-specific (co)variance matrices.]

  46. Rewrite this • [Reparameterized bivariate model equations for herd $k$ and cow $j$ omitted.] • Model each of these (co)variance terms as functions of fixed and random effects (in addition to the classical $\boldsymbol{\beta}$ and $\mathbf{u}$)!

  47. Random effect variability in RESIDUAL associations between traits across herds • DIC(M0) - DIC(M1) = 243 • Expected range between extreme herd-years (± 2 sd) = 0.7 d / 100 kg. [Figure: distribution of the increase in # of days of CI per 100 kg herd milk yield, spanning roughly 0.16 to 0.86 on a 0.0-1.0 scale (Ott and Longnecker, 2001).]

  48. So…is research irreproducibility sometimes due to heterogeneity across studies? • and/or a failure to distinguish or calibrate between narrow versus broad scope of inference? • Recall treatment*station (or study) as the “error” term for treatment in a “meta-replicated” study. • But heterogeneity might exist at other levels as well: heterogeneous residual and random effects (co)variances across farms…even overdispersion??? • So inference may require proper calibration at various levels. • Discrete (binary/count) data…recall the problem with VC inference in GLMM? • We may actually need to go even deeper than that!

  49. Rethinking that “perfect” story • Should model heterogeneity (of means, variances, covariances) across studies/farms/times, etc. explicitly with multi-stage models. • Estimates of the corresponding hyperparameters will indicate how “clean” or “messy” the story really is. • Estimate low heterogeneity? HEAVY SHRINKAGE: broad scope and narrow scope inference match up more closely. • Estimate high heterogeneity? LIGHT SHRINKAGE: broad scope inference might be almost pointless…inference should be better calibrated.

  50. “Shrinkage is a good thing” (Allison et al., 2006)! • Neither too broad nor too narrow…but calibrated. • Borrow information across studies/environments. [Diagram: study-specific parameters $\theta_1$ and $\theta_2$ positioned between the meta-estimate $\mathbf{K}'\boldsymbol{\beta}$ (too broad?) and the study-specific estimates $\mathbf{K}'\boldsymbol{\beta} + \mathbf{M}'\mathbf{u}$ (too narrow?); low heterogeneity calibration pulls toward the former, high heterogeneity calibration toward the latter.] • Mixture (including point mass on 0) priors could be considered as well.
