Use of Proc Mixed to Analyze Experimental Data Animal Science 500 Lecture No. October , 2010
GLM and MIXED in SAS • The SAS procedures GLM and MIXED can both be used to fit linear models. • They are commonly used to analyze data from a wide range of experiments. • PROC GLM was designed to fit fixed-effect models. • It was later amended to fit some random-effect models by including the RANDOM statement with the TEST option. • The REPEATED statement in PROC GLM allows repeated measures models with an arbitrary correlation structure for the repeated observations to be estimated and tested.
GLM and MIXED in SAS • PROC MIXED was specifically designed to fit mixed-effect models. • These include fixed effects, random effects, and repeated effects. • It can model data with heterogeneous variances and autocorrelated observations. • PROC MIXED gives the user more flexibility in specifying correlation structures, which is particularly useful in repeated measures and random-effect models.
GLM and MIXED in SAS • SAS has made many advancements to different procedures over the years. • PROC MIXED is not an extension of GLM. • It is based on different statistical principles. • GLM and MIXED use different estimation methods. • GLM uses ordinary least squares (OLS) estimation. • The parameter estimates are the values of the model parameters that minimize the sum of squared differences between the observed and predicted values of the dependent variable. • GLM provides the familiar analysis of variance table, in which the variability in the dependent variable (the total sum of squares) is partitioned into sums of squares due to the different sources (effects) in the model.
GLM and MIXED in SAS • Using PROC MIXED you do not get the analysis of variance table that you may be used to evaluating. • It uses estimation methods based on different principles. • PROC MIXED has three options for the method of estimation. • They are: ML (Maximum Likelihood), • REML (Restricted or Residual maximum likelihood, which is the default method) and • MIVQUE0 (Minimum Variance Quadratic Unbiased Estimation).
Defining fixed or random factor From D. A. Dickey, 2008: SAS Global Forum
Characteristics of Mixed Solutions • Best Linear Unbiased Estimates = BLUE • Best Linear Unbiased Predictions = BLUP
Comparison with GLM • GLM has a RANDOM statement; however, it still fits all effects as if they were fixed.
Usually the tests of interest, even in a mixed model, are differences among the fixed effects. • Hypotheses about the random effects can be tested by using likelihood ratio tests.
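A likelihood ratio test for a random effect can be sketched by fitting the model with and without that effect and comparing the two "-2 Res Log Likelihood" values. The example below is only an illustration: the data set mydata and the variables y, trt, and block are hypothetical, both fits must use the same fixed effects for the REML likelihoods to be comparable, and halving the chi-square p-value is a common adjustment (an assumption here, not something stated on the slide) for testing a single variance on the boundary of zero.

```sas
/* Full model: block is a random effect (names are hypothetical). */
ods output FitStatistics=fit_full;
proc mixed data=mydata method=reml;
  class trt block;
  model y = trt;
  random block;
run;

/* Reduced model: same fixed effects, no random block effect. */
ods output FitStatistics=fit_red;
proc mixed data=mydata method=reml;
  class trt block;
  model y = trt;
run;

/* Difference in -2 Res Log Likelihood, referred to chi-square with 1 df. */
data lrt;
  set fit_full(where=(Descr =: '-2 Res') rename=(Value=full));
  set fit_red (where=(Descr =: '-2 Res') rename=(Value=reduced));
  chisq      = max(reduced - full, 0);
  p_naive    = 1 - probchi(chisq, 1);
  p_boundary = p_naive / 2;   /* boundary adjustment, see note above */
run;

proc print data=lrt noobs;
  var full reduced chisq p_naive p_boundary;
run;
```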
Proc Mixed General Form • PROC MIXED is used to fit models of the form y = Xβ + ZU + e, where y is a vector of responses, X is a known design matrix for the fixed effects, β is a vector of unknown fixed-effect parameters, Z is a known design matrix for the random effects, U is a vector of unknown random-effect parameters, and e is a vector of (normally distributed) random errors.
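The distributional assumptions behind this model can be written out as follows. The symbols G and R are the usual names for the covariance matrices of the random effects and of the errors; they are introduced here for illustration and do not appear on the slide.

```latex
\mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{Z}\mathbf{U} + \mathbf{e},
\qquad
\mathbf{U} \sim N(\mathbf{0}, \mathbf{G}), \quad
\mathbf{e} \sim N(\mathbf{0}, \mathbf{R}), \quad
\operatorname{Cov}(\mathbf{U}, \mathbf{e}) = \mathbf{0}

\operatorname{Var}(\mathbf{y}) = \mathbf{Z}\mathbf{G}\mathbf{Z}' + \mathbf{R}
```

In PROC MIXED the RANDOM statement determines Z and the structure of G, and the REPEATED statement determines the structure of R. With no REPEATED statement, R is taken to be σ²I.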
PROC MIXED Syntax • The PROC MIXED syntax is similar to the syntax of PROC GLM, with a few important differences. • The RANDOM and REPEATED statements are used differently. • Random effects are not listed in the MODEL statement. • GLM has MEANS and LSMEANS statements, whereas MIXED has only the LSMEANS statement. • GLM offers Type I, II, III and IV tests for fixed effects, while MIXED offers Type I, II and III tests (HTYPE= option in the MODEL statement; Type III is the default).
PROC MIXED Syntax • The following is a general form of the PROC MIXED statements:
PROC MIXED options;
  CLASS variable-list;
  MODEL dependent = fixed-effects / options;
  RANDOM random-effects / options;
  REPEATED repeated-effect / options;
  CONTRAST 'label' fixed-effect values | random-effect values / options;
  ESTIMATE 'label' fixed-effect values | random-effect values / options;
  LSMEANS fixed-effects / options;
  MAKE 'table' OUT=SAS-data-set < options >;
RUN;
PROC MIXED Syntax • The CONTRAST, ESTIMATE, LSMEANS, MAKE and RANDOM statements can appear multiple times; all other statements can appear only once. • The PROC MIXED and MODEL statements are required. • The MODEL statement must appear after the CLASS statement if a CLASS statement is used. • The CONTRAST, ESTIMATE, LSMEANS, RANDOM and REPEATED statements must follow the MODEL statement. • CONTRAST and ESTIMATE statements must follow the RANDOM statement if a RANDOM statement is used.
PROC MIXED Syntax • PROC MIXED <options>; • Selected options: • DATA= SAS data set Names SAS data set to be used by PROC MIXED. • The default is the most recently created data set. • METHOD=REML METHOD=ML METHOD=MIVQUE0 • Specifies the estimation method. • REML is the default method.
PROC MIXED Syntax • PROC MIXED <options>; • COVTEST • Prints asymptotic standard errors and Wald Z-tests for the variance-covariance parameter estimates. • For example, if a random effect A is included in the model, then the estimate of the variance of A is printed together with the Wald test of the hypothesis that the variance of A is 0. • The COVTEST option is specified after PROC MIXED and before the semicolon. • For example:
Proc mixed data=mydata method=reml covtest;
PROC MIXED Syntax • CLASS variables; • The CLASS statement is used in the same manner as in other SAS procedures. • It lists the classification variables (categorical independent variables in the model). • For example:
Proc mixed data=mydata covtest;
  Class group gender agecat;
PROC MIXED Syntax • MODEL dependent = fixed effects </options>; • The MODEL statement names a single dependent variable and the fixed effects, that is, the independent variables that are not random. • This differs from GLM, where several dependent variables can be listed on the left-hand side of the MODEL statement; MIXED allows only one. • A separate PROC MIXED step (with its own CLASS, MODEL, RANDOM, REPEATED, etc. statements) is required for each dependent variable. • An intercept is included in the model by default. • The NOINT option can be used to remove the intercept.
PROC MIXED Syntax • Selected options of the MODEL statement: • CHISQ requests that χ² tests (Wald tests) be performed for all fixed effects in addition to the F-tests. • DDFM=method • The DDFM= option specifies the method for computing the denominator degrees of freedom for the tests of fixed effects. • DDFM=RESIDUAL uses the residual degrees of freedom. • DDFM=SATTERTH requests the Satterthwaite approximation for the denominator degrees of freedom. • For balanced designs with random effects, DDFM=SATTERTH produces the same test results as the RANDOM … / TEST option in PROC GLM (if the default METHOD=REML is used in PROC MIXED).
PROC MIXED Syntax • Other DDFM= methods available in the MODEL statement: • DDFM=CONTAIN (containment method) • DDFM=BETWITHIN (between-within method) • DDFM=SATTERTH (Satterthwaite approximation)
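As a small illustration of these MODEL statement options, the sketch below requests Wald chi-square tests and Satterthwaite degrees of freedom together. The data set mydata and the variables y, trt, and block are hypothetical.

```sas
/* Hypothetical data set 'mydata' with response y, treatment trt, and block. */
proc mixed data=mydata method=reml;
  class trt block;
  /* CHISQ adds Wald chi-square tests to the F-tests;
     DDFM=SATTERTH requests Satterthwaite denominator degrees of freedom. */
  model y = trt / chisq ddfm=satterth;
  random block;
run;
```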
PROC MIXED Syntax • RANDOM random effects </options>; • The RANDOM statement defines the random effects in the model. • It can be used to specify traditional variance components (independent random effects with different variances) or to list correlated random effects and specify a covariance structure for them with the TYPE=covariance-structure option. • A variety of structures are available. Those most frequently used include • TYPE=VC, a variance components structure, and • TYPE=UN, an unstructured, that is, arbitrary covariance matrix. • TYPE=VC is the default structure. In the following example, the effect of subject is random.
PROC MIXED Syntax Example •
Proc mixed data=one method=reml covtest;
  Class gender treat subject;
  Model y=gender treat gender*treat / ddfm=satterth;
  Random subject(gender);
Run; Quit;
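For comparison, TYPE=UN in the RANDOM statement is typically used when several random effects on the same subject are expected to be correlated. The sketch below fits a random intercept and a random slope for each animal; the data set growth and the variables y, time, and id are hypothetical and are not taken from the lecture.

```sas
/* Hypothetical data: one row per animal (id) per weighing,
   with weight y and continuous time. */
proc mixed data=growth method=reml covtest;
  class id;
  model y = time / solution ddfm=satterth;
  /* Random intercept and slope for each animal; TYPE=UN lets the two
     random effects have different variances and be correlated. */
  random intercept time / subject=id type=un;
run;
```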
PROC MIXED Syntax • REPEATED repeated effects / options; • The repeated statement is used in PROC MIXED to specify the covariance structure of the error term. • The repeated effect has to be categorical and has to appear in the class statement and the data has to be sorted accordingly. • For example, suppose that each pig or steer was weighed at five equally spaced time points. • The time is the repeated effect and the data has to be sorted by subject and time within each subject. • If time is also used as a continuous independent variable in the model then a new variable, say t, identical to time has to be defined and t should be used in the class and repeated statements.
PROC MIXED Syntax Repeated Example •
Data one;
  Set one;
  T=time;
Run;
Proc sort data=one;
  By group id t;
Run;
Proc mixed data=one covtest;
  Class t group id;
  Model y=group time group*time;
  Repeated t / type=ar(1) subject=id;
Run; Quit;
PROC MIXED Syntax Repeated Example • The TYPE= option in the REPEATED statement specifies the type of the error correlation structure. • The one specified in the above example is the first-order autoregressive structure. • The SUBJECT= option is needed to identify observations that are correlated. • Observations within the same subject are correlated according to the structure specified in TYPE=; observations from different subjects are independent. • The TYPE= option allows for many types of correlation structures. The most commonly used are autoregressive, compound symmetry, Huynh-Feldt, Toeplitz, variance components, unstructured and spatial.
Common covariance structures used with the Repeated statement • Variance Components: • The VC structure is the standard variance components structure and is the default. • Measurements modeled with VC are therefore not correlated. • That is, all covariances are assumed to be 0. • The random error terms are uncorrelated as well.
Common covariance structures used with the Repeated statement • Compound symmetry: • This structure says that the correlations between all pairs of measures are the same. • One reason for its popularity is that in many simple cases it gives the same results as the univariate analysis from pre-PROC MIXED repeated measures ANOVA programs, including SAS's own PROC GLM. • The assumption is not unreasonable when the repeated measures arise from different sets of conditions, such as the response to different treatments. • Its biggest drawback is that it is often unrealistic when the repeated measures are serial measurements, that is, the same response measured over time. • Typically, measurements that are made close together (consecutive measurements, say) will be more highly correlated than measurements made farther apart (the first and last).
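For four repeated measurements on one subject, the compound symmetry (TYPE=CS) covariance matrix can be written as below. The choice of four occasions and the symbol σ₁ for the common covariance are assumptions made for illustration.

```latex
\mathbf{R}_{\mathrm{CS}} =
\begin{pmatrix}
\sigma^2 + \sigma_1 & \sigma_1 & \sigma_1 & \sigma_1 \\
\sigma_1 & \sigma^2 + \sigma_1 & \sigma_1 & \sigma_1 \\
\sigma_1 & \sigma_1 & \sigma^2 + \sigma_1 & \sigma_1 \\
\sigma_1 & \sigma_1 & \sigma_1 & \sigma^2 + \sigma_1
\end{pmatrix}
```

Every pair of measurements has the same correlation, σ₁ / (σ² + σ₁), and only two parameters are estimated regardless of the number of occasions.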
Common covariance structures used with the Repeated statement • Autoregressive (1): • This structure resolves some of the objections to the use of compound symmetry with serial data when the measures are equally spaced over time. • AR(1) says that the correlation between two responses that are t measurements apart is ρ^t. • Since |ρ| is less than 1, the greater the power, the smaller the magnitude. • Thus, the farther apart measurements are, the lower their correlation.
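Written out for four equally spaced measurements (the number of occasions is an assumption for illustration), with ρ the correlation between adjacent measurements, the AR(1) covariance matrix is:

```latex
\mathbf{R}_{\mathrm{AR(1)}} = \sigma^2
\begin{pmatrix}
1 & \rho & \rho^2 & \rho^3 \\
\rho & 1 & \rho & \rho^2 \\
\rho^2 & \rho & 1 & \rho \\
\rho^3 & \rho^2 & \rho & 1
\end{pmatrix}
```

As with compound symmetry, only two parameters (σ² and ρ) are estimated.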
Common covariance structures used with the Repeated statement • Unstructured: • Sometimes, no standard covariance structure seems appropriate. • The unstructured option estimates every variance and covariance individually and lets the data themselves dictate what they should be. • It is the most liberal of all covariance structures, as it allows every term to be different. • The more data that are used to assess the correlation structure, the less data there are to estimate the parameters of the linear model. • An analysis that uses an unstructured covariance matrix will be less powerful than an analysis that uses the proper structure. • The challenge is knowing what the structure is.
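For four repeated measurements (an assumed number, for illustration), the unstructured (TYPE=UN) covariance matrix has a separate parameter for every variance and covariance:

```latex
\mathbf{R}_{\mathrm{UN}} =
\begin{pmatrix}
\sigma_1^2 & \sigma_{21} & \sigma_{31} & \sigma_{41} \\
\sigma_{21} & \sigma_2^2 & \sigma_{32} & \sigma_{42} \\
\sigma_{31} & \sigma_{32} & \sigma_3^2 & \sigma_{43} \\
\sigma_{41} & \sigma_{42} & \sigma_{43} & \sigma_4^2
\end{pmatrix}
```

With t occasions this requires t(t+1)/2 parameters (10 when t = 4), which is why the structure becomes expensive as the number of occasions grows.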
Common covariance structures used with the Repeated statement • Toeplitz: • The TOEP structure is similar to the AR(1) in that all measurements next to each other have the same correlation. • Measurements two apart have the same correlation, different from the first; measurements three apart have the same correlation, different from the first two; etc. • However, the correlations do not necessarily have the same pattern as in the AR(1). Technically, the AR(1) is a special case of the Toeplitz.
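For four equally spaced measurements (an assumed number, for illustration), the Toeplitz (TYPE=TOEP) covariance matrix has one parameter per band:

```latex
\mathbf{R}_{\mathrm{TOEP}} =
\begin{pmatrix}
\sigma^2 & \sigma_1 & \sigma_2 & \sigma_3 \\
\sigma_1 & \sigma^2 & \sigma_1 & \sigma_2 \\
\sigma_2 & \sigma_1 & \sigma^2 & \sigma_1 \\
\sigma_3 & \sigma_2 & \sigma_1 & \sigma^2
\end{pmatrix}
```

Setting σ_k = σ²ρ^k for k = 1, 2, 3 recovers the AR(1) structure, which is the sense in which AR(1) is a special case of the Toeplitz.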
Common covariance structures used with the Repeated statement • What is the proper way to choose among many covariance structures? • Ideally, the covariance structure should be known from previous work or subject matter considerations. • Otherwise, one runs the risk of "shopping" for the structure that leads to a preconceived result. • However, there are many cases where the structure is unknown or where the analyst would like to check to be sure that s/he is not making a mistake (in the manner of checking for an interaction between a factor and a covariate in an analysis of covariance model). • It is common to consider a few likely structures and choose among them according to some measure of fit. • As noted above, the purist statistician would say that this is not really the correct way to proceed, but it is what happens in practice.
Common covariance structures used with the Repeated statement • These measures tend to be composed of two parts: one that rewards the accuracy of the fit and another that penalizes the number of parameters it takes to achieve it. • The most popular of these is the Akaike Information Criterion (AIC). • In the reward portion, the AIC looks at how well the estimated and observed structures agree, or rather the extent to which they differ. • Smaller values are good. • In the penalty portion, the AIC considers how many parameters it takes to achieve the fit. • Thus, one might analyze the data using the CS, AR(1), and UN covariance structures and choose the one for which the AIC is a minimum.
Example (Littell, Milliken, Stroup, Wolfinger) • Consider a multi-location trial (9 locations) with 4 treatments. • The treatments were observed at each of the 9 locations, and at each location an RCB design with 3 blocks was used. • The model is as follows:
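The comparison of CS, AR(1), and UN by AIC might be sketched as follows, reusing the variable names from the repeated-measures example (y, group, time, t, id in data set one). The macro name fitcov and the output data set names are hypothetical. Because the fixed effects are identical in all three fits, the REML-based AIC values are comparable.

```sas
/* Fit the same model under one covariance structure and save its fit statistics. */
%macro fitcov(type, label);
  ods output FitStatistics=fit_&label;
  proc mixed data=one method=reml;
    class t group id;
    model y = group time group*time;
    repeated t / type=&type subject=id;
  run;
%mend fitcov;

%fitcov(cs, cs)
%fitcov(ar(1), ar1)
%fitcov(un, un)

/* Stack the three tables and keep only the AIC rows. */
data fit_all;
  length structure $8;
  set fit_cs(in=a) fit_ar1(in=b) fit_un(in=c);
  if a then structure = 'CS';
  else if b then structure = 'AR(1)';
  else structure = 'UN';
  if Descr =: 'AIC ';
run;

proc print data=fit_all noobs;
  var structure Descr Value;
run;
```

The structure with the smallest AIC would be the candidate choice, subject to the caution above about "shopping" for a structure.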
Writing the model statement
$y_{ijk} = \mu + \tau_i + L_j + R(L)_{jk} + (\tau L)_{ij} + e_{ijk}$
where
$y_{ijk}$ is the observation,
$\mu$ is the overall mean,
$\tau_i$ is the treatment effect,
$L_j$ is the random location effect, $\sim N(0, \sigma_L^2)$,
$R(L)_{jk}$ is the block within location effect, $\sim N(0, \sigma_R^2)$,
$(\tau L)_{ij}$ is the treatment by location effect, $\sim N(0, \sigma_T^2)$, and
$e_{ijk}$ is the random error, $\sim N(0, \sigma^2)$.
SAS code to analyze • The SAS code we'll use to fit the model is the following:
Proc Mixed;
  Class loc block trt;
  Model resp = trt / ddfm=satterth;
  Random loc block(loc) loc*trt;
Run; Quit;
• Note: This code uses the default variance components (VC) structure, which gives us an estimate of each variance component, including the block-within-location variance $\sigma_R^2$. Because the errors are assumed to be independent with a common variance, we do not need to specify a REPEATED statement.
PROC MIXED Syntax • CONTRAST ‘label’ fixed-effect values | random-effect values / options; • ESTIMATE ‘label’ fixed-effect values | random-effect values / options; • The CONTRAST statement is used when there is need for custom hypothesis tests, the ESTIMATE statement, when there is need for custom estimates. Although they were extended in PROC MIXED to include random effects, their use is very similar to the CONTRAST and ESTIMATE statement in PROC GLM. SAS/STAT(R) 9.2 User's Guide, Second Edition
PROC MIXED Syntax • CONTRAST 'label' fixed-effect values | random-effect values / options; • ESTIMATE 'label' fixed-effect values | random-effect values / options; • LABEL is required for every CONTRAST or ESTIMATE statement. It identifies the contrast or estimated parameter in the output. It cannot be longer than 20 characters. • FIXED-EFFECT is the name of an effect appearing in the MODEL statement. • RANDOM-EFFECT is the name of an effect appearing in the RANDOM statement. • VALUES are the coefficients of the contrast to be tested or of the parameter to be estimated.
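A sketch of CONTRAST and ESTIMATE statements for the multi-location example follows. It assumes that trt has four levels, ordered as shown in the "Class Level Information" table, and that the data are in a data set named multiloc; both the data set name and the particular comparisons are hypothetical.

```sas
proc mixed data=multiloc;
  class loc block trt;
  model resp = trt / ddfm=satterth;
  random loc block(loc) loc*trt;
  /* Custom hypothesis tests on the four treatment levels. */
  contrast 'trt1 vs trt2'     trt 1 -1  0  0;
  contrast 'trt1,2 vs trt3,4' trt 1  1 -1 -1;
  /* Custom estimates; DIVISOR=2 turns the sums into a difference of two averages. */
  estimate 'trt1 - trt2'      trt 1 -1  0  0;
  estimate 'mean12 - mean34'  trt 1  1 -1 -1 / divisor=2;
run;
```

The coefficients are matched to the treatment levels in their sorted order, so it is worth checking that order in the output before interpreting a contrast.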
PROC MIXED Syntax • LSMEANS fixed-effects / options; • Its use is similar to that in GLM. • LSMEANS computes the least squares means of fixed effects. • The ADJUST= option requests a multiple comparison adjustment to the p-values for pair-wise comparisons of means. • The following adjustments are available: BON (Bonferroni), DUNNETT, SCHEFFE, SIDAK, SIMULATE, SMM|GT2 and TUKEY. • The ADJUST= option results in all possible pair-wise comparisons. • If only comparisons with a control level are needed, then PDIFF=CONTROL should be used in addition to the ADJUST= option. • The SLICE= option tests the significance of one effect at each level of another effect.
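The LSMEANS options might be combined as in the sketch below, again for the multi-location example. The data set name multiloc and the choice of level '1' as the control are assumptions made for illustration.

```sas
proc mixed data=multiloc;
  class loc block trt;
  model resp = trt / ddfm=satterth;
  random loc block(loc) loc*trt;
  /* All pairwise comparisons with Tukey-adjusted p-values. */
  lsmeans trt / pdiff adjust=tukey;
  /* Comparisons against a control level only, with Dunnett's adjustment;
     treating level '1' as the control is an assumption. */
  lsmeans trt / pdiff=control('1') adjust=dunnett;
run;
```

The SLICE= option applies to interaction effects; in a model containing A*B, for example, `lsmeans A*B / slice=A;` would test B within each level of A.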
PROC MIXED Syntax • MAKE 'table' OUT= SAS-data-set < options >; • The MAKE statement converts any table produced by PROC MIXED into a SAS data set. • The NOPRINT option can be used to prevent printing of the requested table. • Only requested or default output can be converted into a SAS data set. • The P option has to be used in the MODEL statement to produce a data set with predicted values, and • the LSMEANS statement has to be included to output least squares means.
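In more recent SAS releases the Output Delivery System (ODS) is the usual route to the same result. The sketch below saves three of the standard PROC MIXED tables and the predicted values; the data set name multiloc and all output data set names are hypothetical.

```sas
/* Save the least squares means, covariance parameter estimates, and
   Type 3 tests as data sets. */
ods output LSMeans=lsm_out CovParms=cov_out Tests3=tests_out;
proc mixed data=multiloc;
  class loc block trt;
  /* OUTP= writes predicted values and residuals to a data set. */
  model resp = trt / ddfm=satterth outp=pred_out;
  random loc block(loc) loc*trt;
  lsmeans trt;
run;
```

As with MAKE, a table can only be saved if the statements that produce it (here LSMEANS) are part of the step.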
Understanding the PROC MIXED LOG • If the model is fitted successfully, a message stating "convergence criteria met" will appear in the log. • If you see anything else, then either the model is not specified correctly or the assumptions about the data (for example, normality) may not hold. • A common message is "G matrix not positive definite". • You get this because the mean square within subjects is greater than the mean square between subjects, so a variance component estimate is held at zero. • The estimated G matrix can be printed by putting the G option at the end of the RANDOM statement; adding the NOBOUND option to the PROC MIXED statement allows negative variance estimates to be reported. • If the estimate is a small negative value relative to the size of the residual, there is nothing to worry about; if not, the model may need to be changed.
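A sketch of how the G matrix might be inspected, reusing the variable names from the earlier RANDOM statement example (data set one with y, gender, treat, subject):

```sas
/* NOBOUND on the PROC MIXED statement removes the lower bound of zero
   on the variance components, so negative estimates are reported. */
proc mixed data=one method=reml covtest nobound;
  class gender treat subject;
  model y = gender treat gender*treat / ddfm=satterth;
  /* The G option prints the estimated G matrix. */
  random subject(gender) / g;
run;
```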
Understanding the PROC MIXED OUTPUT • Output 56.1.1 Results for Split-Plot Analysis
The Mixed Procedure
Model Information
  Data Set                     WORK.SP
  Dependent Variable           Y
  Covariance Structure         Variance Components
  Estimation Method            REML
  Residual Variance Method     Profile
  Fixed Effects SE Method      Model-Based
  Degrees of Freedom Method    Containment
SAS/STAT(R) 9.2 User's Guide, Second Edition
Understanding the PROC MIXED OUTPUT
Class Level Information
  Class    Levels    Values
  A        3         1 2 3
  B        2         1 2
  Block    4         1 2 3 4
The "Class Level Information" table lists the levels of all variables specified in the CLASS statement. Check this table to make sure that the data are correct.
Understanding the PROC MIXED OUTPUT
Dimensions
  Covariance Parameters     3
  Columns in X              12
  Columns in Z              16
  Subjects                  1
  Max Obs Per Subject       24
The "Dimensions" table lists the magnitudes of various vectors and matrices.
Understanding the PROC MIXED OUTPUT
Number of Observations
  Number of Observations Read        24
  Number of Observations Used        24
  Number of Observations Not Used    0
The "Number of Observations" table shows that all observations read from the data set are used in the analysis.
Understanding the PROC MIXED OUTPUT
Iteration History
  Iteration    Evaluations    -2 Res Log Like    Criterion
  0            1              139.81461222
  1            1              119.76184570       0.00000000
PROC MIXED estimates the variance components for Block, A*Block, and the residual by REML. The REML estimates are the values that maximize the likelihood of a set of linearly independent error contrasts, and they provide a correction for the downward bias found in the usual maximum likelihood estimates. The objective function is -2 times the logarithm of the restricted likelihood, and PROC MIXED minimizes this objective function to obtain the estimates. The minimization method is the Newton-Raphson algorithm, which uses the first and second derivatives of the objective function to iteratively find its minimum. The "Iteration History" table records the steps of that optimization process. For this example, only one iteration is required to obtain the estimates. The Evaluations column reveals that the restricted likelihood is evaluated once for each of the iterations. A criterion of 0 indicates that the Newton-Raphson algorithm has converged.
Understanding the PROC MIXED OUTPUT
Covariance Parameter Estimates
  Cov Parm    Estimate
  Block       62.3958
  A*Block     15.3819
  Residual    9.3611
The REML estimates of the variance components for the random effects Block and A*Block and for the residual are shown in the Estimate column of the "Covariance Parameter Estimates" table.
Understanding the PROC MIXED OUTPUT
Fit Statistics
  -2 Res Log Likelihood       119.8
  AIC (smaller is better)     125.8
  AICC (smaller is better)    127.5
  BIC (smaller is better)     123.9
The "Fit Statistics" table lists several values about the fitted mixed model, including the residual log likelihood. The Akaike (AIC) and Bayesian (BIC) information criteria can be used to compare different models; the ones with smaller values are preferred. The AICC information criterion is a small-sample bias-adjusted form of the Akaike criterion (Hurvich and Tsai 1989).
Understanding the PROC MIXED OUTPUT
Type 3 Tests of Fixed Effects
  Effect    Num DF    Den DF    F Value    Pr > F
  A         2         6         4.07       0.076
  B         1         9         19.39      0.0017
  A*B       2         9         4.02       0.0566
The fixed effects are tested by using Type 3 estimable functions.
Understanding the PROC MIXED OUTPUT • The results from the PROC MIXED analysis are the same as those obtained from the following GLM analysis:
PROC GLM data=sp;
  class A B Block;
  model Y = A B A*B Block A*Block;
  test h=A e=A*Block;
Run; Quit;
Understanding the PROC MIXED OUTPUT • LS Means can be obtained in a similar manner as in GLM • Various mean separation techniques can be used to determine differences between levels of a factor once the factor has been found to be a significant source of variation in the analysis model used to evaluate the data.