Use of Proc GLM to Analyze Experimental Data Animal Science 500 Lecture No. October , 2010
PROC GLM • The GLM procedure uses the method of least squares to fit general linear models. • Among the statistical methods available in PROC GLM are: • Regression, • Analysis of variance, • Analysis of covariance, • Multivariate analysis of variance (MANOVA), • and partial correlation. SAS/STAT(R) 9.22 User's Guide
PROC GLM • PROC GLM analyzes data within the framework of general linear models. • PROC GLM handles models relating one or several continuous dependent variables to one or several independent variables. • The independent variables can be either classification variables, which divide the observations into discrete groups, or continuous variables. SAS/STAT(R) 9.22 User's Guide
PROC GLM • Thus, the GLM procedure can be used for many different analyses, including the following: • simple regression • multiple regression • analysis of variance (ANOVA), especially for unbalanced data • analysis of covariance • response surface models • weighted regression • polynomial regression • partial correlation • multivariate analysis of variance (MANOVA) • repeated measures analysis of variance SAS/STAT(R) 9.22 User's Guide
PROC GLM • PROC GLM enables you to specify any degree of interaction (crossed effects) and nested effects. • It also provides for polynomial, continuous-by-class, and continuous-nesting-class effects. • Through the concept of estimability, the GLM procedure can provide tests of hypotheses for the effects of a linear model regardless of the number of missing cells or the extent of confounding. • PROC GLM displays the sum of squares (SS) associated with each hypothesis tested and, upon request, the form of the estimable functions employed in the test. PROC GLM can produce the general form of all estimable functions. SAS/STAT(R) 9.22 User's Guide
PROC GLM • The REPEATED statement enables you to specify effects in the model that represent repeated measurements on the same experimental unit for the same response, providing both univariate and multivariate tests of hypotheses. • The RANDOM statement enables you to specify random effects in the model; expected mean squares are produced for each Type I, Type II, Type III, Type IV, and contrast mean square used in the analysis. Upon request, tests that use appropriate mean squares or linear combinations of mean squares as error terms are performed. SAS/STAT(R) 9.22 User's Guide
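A minimal sketch of the RANDOM statement, assuming a hypothetical data set named trial in which yield is recorded for each machine and operator, and assuming operators are treated as a random sample of all possible operators. The data set name and the choice of which factor is random are illustrative assumptions, not part of the lecture.

```sas
/* Hypothetical data set TRIAL: yield measured for machine-operator pairs. */
proc glm data=trial;
  class machine operator;
  model yield = machine operator machine*operator;
  /* Declare operator and its interaction with machine as random effects. */
  /* TEST requests F tests that use the expected mean squares to choose   */
  /* the appropriate error term for each effect.                          */
  random operator machine*operator / test;
run;
quit;
```

Without the TEST option, PROC GLM prints the expected mean squares but still tests every effect against the residual mean square.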
PROC GLM • The ESTIMATE statement enables you to specify an L vector for estimating a linear function of the parameters, Lβ. • The CONTRAST statement enables you to specify a contrast vector or matrix for testing the hypothesis that Lβ = 0. When specified, the contrasts are also incorporated into analyses that use the MANOVA and REPEATED statements. • The MANOVA statement enables you to specify both the hypothesis effects and the error effect to use for a multivariate analysis of variance. SAS/STAT(R) 9.22 User's Guide
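The ESTIMATE and CONTRAST statements can be sketched as follows, assuming a hypothetical data set named diets with a three-level treatment factor trt (level 1 taken to be the control) and a response gain. All names and the level ordering are assumptions for illustration; the coefficients must be listed in the sorted order of the class levels.

```sas
/* Hypothetical data set DIETS: trt has three levels (1 = control, 2, 3). */
proc glm data=diets;
  class trt;
  model gain = trt;
  /* L vector (1 -1 0): estimates the difference trt 1 minus trt 2. */
  estimate 'control vs trt 2' trt 1 -1 0;
  /* Tests H0: L*beta = 0, control versus the average of trt 2 and 3. */
  contrast 'control vs others' trt 2 -1 -1;
run;
quit;
```

ESTIMATE reports the value of the linear function with its standard error and t test, while CONTRAST reports a sum of squares and an F test.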
PROC GLM • PROC GLM can create an output data set containing the input data set in addition to predicted values, residuals, and other diagnostic measures. • PROC GLM can be used interactively. After you specify and fit a model, you can execute a variety of statements without recomputing the model parameters or sums of squares. SAS/STAT(R) 9.22 User's Guide
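A short sketch of interactive use, again assuming the hypothetical diets data set with class variable trt and response gain. The point is only that statements submitted after the first RUN reuse the fitted model until QUIT ends the procedure.

```sas
proc glm data=diets;
  class trt;
  model gain = trt;
run;                     /* the model is fitted here; PROC GLM stays active */

  lsmeans trt / pdiff;   /* submitted later, uses the fit already computed */
run;

  means trt / tukey;     /* another follow-up request on the same fit */
run;
quit;                    /* ends the interactive session */
```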
PROC GLM • For analysis involving multiple dependent variables but not the MANOVA or REPEATED statements, a missing value in one dependent variable does not eliminate the observation from the analysis for other dependent variables. PROC GLM automatically groups together those variables that have the same pattern of missing values within the data set or within a BY group. This ensures that the analysis for each dependent variable brings into use all possible observations. SAS/STAT(R) 9.22 User's Guide
Estimable Function • SAS output often shows Non-est in place of an estimate or least-squares mean. • What does this mean?
Estimability • Generalized inverses are used to obtain solutions for effects in general linear models. • There are many generalized inverses. • Many different sets of solutions are possible. • Estimable functions are unique and don't depend on the generalized inverse used to obtain solutions. • To analyze data properly, that is, to answer the hypothesis being tested, the scientist should know which functions of the parameters in the model are being estimated.
Estimability • The hypothesis being tested is NOT about the absolute value for a level of a factor in the model. • Usually we are asking or hypothesizing that two means are different, or that some treatment is different from a control. • Hence the differences are the estimable functions, NOT the values (solutions) for any individual level.
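A small worked illustration of why differences are estimable while individual solutions are not, using a one-way model as an assumed example (the symbol c is an arbitrary constant introduced here):

```latex
\begin{aligned}
E(Y_{ij}) &= \mu + \alpha_i \\
          &= (\mu + c) + (\alpha_i - c) \quad \text{for any constant } c, \\
\alpha_1 - \alpha_2 &= (\alpha_1 - c) - (\alpha_2 - c).
\end{aligned}
```

Shifting $\mu$ up by $c$ and every $\alpha_i$ down by $c$ leaves all expected values unchanged, so the data cannot distinguish between these solutions and neither $\mu$ nor $\alpha_i$ alone is estimable. The level mean $\mu + \alpha_i$ and the difference $\alpha_1 - \alpha_2$ take the same value for every choice of $c$, so they are estimable.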
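The General Linear Model • The main effects general linear model can be parameterized as $Y_{ij} = \mu + \alpha_i + b_j + \varepsilon_{ij}$, where $Y_{ij}$ is the observation at the $i$th level of $\alpha$ and the $j$th level of $b$, $\mu$ is the overall mean (an unknown fixed parameter), $\alpha_i$ is the effect of the $i$th level of $\alpha$ (the deviation of that level's mean from $\mu$), $b_j$ is the effect of the $j$th level of $b$ (the deviation of that level's mean from $\mu$), and $\varepsilon_{ij}$ is the experimental error, distributed $N(0, \sigma^2)$.
The General Linear Model • In matrix terminology, the general linear model may be expressed as • $Y = X\beta + \varepsilon$, where $Y$ is the observed data vector, $X$ is the design matrix, $\beta$ is the vector of unknown fixed-effect parameters, and $\varepsilon$ is the vector of errors.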
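To make the matrix form concrete, here is a sketch for an assumed one-way layout with two levels of $\alpha$ and two observations per level (here the second subscript indexes the replicate, which is an assumption of this example):

```latex
\underbrace{\begin{bmatrix} Y_{11} \\ Y_{12} \\ Y_{21} \\ Y_{22} \end{bmatrix}}_{Y}
=
\underbrace{\begin{bmatrix}
1 & 1 & 0 \\
1 & 1 & 0 \\
1 & 0 & 1 \\
1 & 0 & 1
\end{bmatrix}}_{X}
\underbrace{\begin{bmatrix} \mu \\ \alpha_1 \\ \alpha_2 \end{bmatrix}}_{\beta}
+
\underbrace{\begin{bmatrix} \varepsilon_{11} \\ \varepsilon_{12} \\ \varepsilon_{21} \\ \varepsilon_{22} \end{bmatrix}}_{\varepsilon}
```

The first column of $X$ equals the sum of the other two, so $X'X$ is singular and has no ordinary inverse. This is why a generalized inverse is needed to obtain solutions.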
Programming the General Linear Model • In the GLM procedure, an OUTPUT statement saves the input data plus the residuals, predicted values, studentized residuals, and other diagnostics in a new data set, here called resdat.
PROC GLM;
  class machine operator;
  model yield = machine|operator;
  output out=resdat r=resid p=pred student=stdres rstudent=rstud cookd=cksd h=lev;
Run; Quit;
Assumptions of the general linear model • $E(\varepsilon) = 0$ • $\mathrm{var}(\varepsilon) = \sigma^2 I$ • $\mathrm{var}(Y) = \sigma^2 I$ • $E(Y) = X\beta$
Assumptions of the Linear Regression Model • Linear Functional form • Fixed independent variables • Independent observations • Representative sample and proper specification of the model (no omitted variables) • Normality of the residuals or errors • Equality of variance of the errors (homogeneity of residual variance) • No multicollinearity • No autocorrelation of the errors • No outlier distortion
Explanation of the Assumptions • Linear functional form • A linear model does not detect curvilinear relationships • Independent observations • The data are a representative sample from some larger population • If the observations are not independent, the errors are autocorrelated, which inflates the t, r, and F statistics and in turn distorts the significance tests • Normality of the residuals • Permits proper significance testing, as in ANOVA and other statistical procedures • Equal variance (no heterogeneous variance) • Heteroskedasticity limits generalization and external validity • This too distorts the significance tests being used • Multicollinearity (many of the traits exhibit collinearity) • Makes the parameter estimates unstable and inflates their standard errors • Can prevent the analysis from running or converging (getting your answers) • Outliers • Severe or numerous outliers will distort and may bias the results • If outliers have high influence and the sample is not large enough, they may seriously bias the parameter estimates
SAS test for residual normality
Proc univariate data=resdat normal plot;
  var resid;
Run; Quit;
Graphically examining residuals for homogeneity
Proc gplot data=resdat;
  plot resid * pred;
Run; Quit;
Examine the plot for a lack of pattern.
Testing for outliers
Proc freq data=resdat;
  tables stdres cksd;
Run; Quit;
1. Look for standardized residuals greater than 3.5 or less than -3.5
2. Look for high Cook's D (greater than 4*p/(n-p-1))
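Rather than scanning a frequency table by eye, the two rules above can be applied in a DATA step. This is only a sketch: the macro variables n and p and their values are hypothetical placeholders that must be replaced with the number of observations and the value of p for your own model, and the data set name flagged is made up for illustration.

```sas
%let n = 24;   /* hypothetical: number of observations */
%let p = 4;    /* hypothetical: p used in the Cook's D cutoff */

data flagged;
  set resdat;                        /* output data set from PROC GLM */
  cutoff = 4 * &p / (&n - &p - 1);   /* Cook's D cutoff from the rule above */
  /* Subsetting IF: keep only observations that break either rule. */
  if abs(stdres) > 3.5 or cksd > cutoff;
run;

proc print data=flagged;
  var stdres cksd cutoff;
run;
```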
Class Statement • Variables included in the CLASS statement are referred to as class variables. • Specifies the variables whose values define the subgroup combinations for the analysis. • Represent the various levels of some factors or effects • Treatment (1, …, n) • Season (spring, summer, fall, and winter coded 1 through 4) • Breed • Color • Sex • Line • Day • Laboratory
Evaluating outliers 1. Check coding to spot typos 2. Correct typos 3. If the outlying observation is correct, examine the DFFITS statistic to determine how much influence the outlier has on the fit. It shows the standardized influence of the observation on the fit. If the outlier's influence is large, consider removing it by setting the observation to missing ( . )
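One way step 3 might look in practice is sketched below. The data set names trial, resdat2, and trial_clean, the identifier obs_id, and the observation number 17 are all hypothetical; DFFITS= is one of the keywords accepted by the OUTPUT statement of PROC GLM.

```sas
/* Save DFFITS alongside Cook's D for each observation. */
proc glm data=trial;
  class machine operator;
  model yield = machine|operator;
  output out=resdat2 dffits=dfit cookd=cksd;
run;
quit;

/* If an observation proves too influential, set its response to missing */
/* rather than deleting the record. obs_id and 17 are placeholders.      */
data trial_clean;
  set trial;
  if obs_id = 17 then yield = .;
run;
```

Setting the response to missing keeps the record in the data set but excludes it from the fit.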
PROC GLM Syntax PROC GLM <options> ; CLASS variables </ option> ; MODEL dependent-variables=independent-effects </ options> ;
Class Variables • Are usually things you would like to account for in your model • Can be numeric or character • Can be continuous values • They are generally not used in regression analyses • What meaning would they have?
Class Statement Options • Ascending sorts the class variable in ascending order • Descending sorts the class variable in descending order • Other options on the CLASS statement depend on the procedure (PROC) being used, so they are not all covered here
Discrete Variables • A discrete variable is one that cannot take on all values within the limits of the variable. • Limited to whole numbers • For example, responses to a five-point rating scale can only take on the values 1, 2, 3, 4, and 5. • The variable cannot have the value 1.7. A variable such as a person's height can take on any value. Discrete variables also are of two types: • unorderable (also called nominal variables) • orderable (also called ordinal)
Discrete Variables • Data sometimes called categorical as the observations may fall into one of a number of categories for example: • Any trait where you score the value • Lameness scores • Body condition scores • Soundness scoring • Reproductive • Feet and leg • Behavioral traits • Fear test • Back test • Vocal scores • Body lesion scores
Discrete Variables • When do discrete variables become continuous, or do they? • Is a trait like number born alive considered discrete or continuous?
Example Variables Data: The dependent variable (what is being measured) is aerial biomass, and there are five substrate measurements (the independent variables): • Salinity • Acidity • Potassium • Sodium • Zinc
Covariates • A covariate is an independent variable that contributes variation to the dependent variable of interest. • The researcher wants to account for the covariate differences that occur for each observation. • A covariate may be of direct interest, or it may be a confounding or interacting type of variable
Covariates • Examples • Weight of animal at measurement • Age of animal at measurement • Age of animal at weaning • Parity of sow for number born alive and weaning weight • Days of lactation for milk weight
Covariates • Covariate may influence the dependent variable in the following ways • Linear covariate • Quadratic covariate • Cubic covariate
Covariates • Check to be sure your covariate is significant • If the linear is significant, test the quadratic • If the linear and quadratic are significant sources of variation test the cubic • How do you do that?
Covariates • How do you do that? • Linear: include the variable name in the model, but do not list it in the CLASS statement. • Example: weight • Quadratic: include the variable as weight*weight • Cubic: include the variable as weight*weight*weight
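Put together, a covariate model with linear, quadratic, and cubic terms could be sketched as follows. The data set pigs, the class variable sex, and the response backfat are hypothetical names; weight is the covariate from the slide. The SS1 option is added here as a suggestion because Type I (sequential) sums of squares test each polynomial term after the lower-order terms listed before it.

```sas
/* Hypothetical data set PIGS: backfat adjusted for weight at measurement. */
proc glm data=pigs;
  class sex;                         /* weight is NOT listed here */
  model backfat = sex
                  weight             /* linear covariate    */
                  weight*weight      /* quadratic covariate */
                  weight*weight*weight  /* cubic covariate  */
                  / ss1 ss3;
run;
quit;
```

Following the testing sequence described above, one would drop the cubic term and refit if it is not significant, then do the same for the quadratic term.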
Covariates • Covariate may influence the dependent variable in the following ways • Linear covariate • Independent covariate affects the dependent variable in a linear manner • Quadratic covariate • Independent covariate affects the dependent variable in a quadratic (curvilinear) manner • Indicates there is one turning point (and only one) • Cubic covariate • Independent covariate affects the dependent variable in a cubic manner • Indicates there can be two turning points
Covariates • Covariate may influence the dependent variable in the following ways • Linear covariate • Independent covariate affects the dependent variable in a linear manner • Dependent variable increases or decreases at a constant rate
Covariates • Covariate may influence the dependent variable in the following ways • Quadratic covariate • Independent covariate affects the dependent variable in a quadratic (curvilinear) manner • Indicates there is one turning point (and only one) • The dependent variable may increase at an increasing or a decreasing rate, or decrease at an increasing or a decreasing rate • Or there could be a directional change: the dependent variable increases up to some point and then decreases, or vice versa
Covariates • Cubic covariate • Independent covariate affects the dependent variable in a cubic manner • Indicates there can be two turning points • Essentially the same as quadratic, but the changes can occur at an additional point • The dependent variable may increase at an increasing or a decreasing rate, or decrease at an increasing or a decreasing rate • Or there could be directional changes: the dependent variable increases up to some point and then decreases, or vice versa, and this can happen a second time
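The three shapes can be written out for a single covariate $x$, with the rate of change obtained by differentiation. The coefficients $\beta_0, \dots, \beta_3$ are generic regression coefficients introduced here for illustration.

```latex
\begin{aligned}
\text{Linear:}    \quad & y = \beta_0 + \beta_1 x,
                  & \frac{dy}{dx} &= \beta_1 \\
\text{Quadratic:} \quad & y = \beta_0 + \beta_1 x + \beta_2 x^2,
                  & \frac{dy}{dx} &= \beta_1 + 2\beta_2 x \\
\text{Cubic:}     \quad & y = \beta_0 + \beta_1 x + \beta_2 x^2 + \beta_3 x^3,
                  & \frac{dy}{dx} &= \beta_1 + 2\beta_2 x + 3\beta_3 x^2
\end{aligned}
```

The linear slope is constant. The quadratic slope is zero only at $x = -\beta_1 / (2\beta_2)$, so the response can change direction at most once. The cubic slope is itself a quadratic in $x$, so it can be zero at up to two values of $x$, allowing the response to change direction twice.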
Model Development and Selection of Variables Example: The general problem addressed is to identify important soil characteristics influencing aerial biomass production of marsh grass, Spartina alterniflora.
Example Data Origination (Dr. P. J. Berger) Data: The data were published as an exercise by Rawlings (1988) and originally appeared as a study by Dr. Rick Linthurst, North Carolina State University (1979). The purpose of his research was to identify the important soil characteristics influencing aerial biomass production of the marsh grass, Spartina alterniflora, in the Cape Fear Estuary of North Carolina. The design for collecting data was such that there were three types of Spartina vegetation in each of three locations, and five random sites within each location and vegetation type.
Example Data • Objective: • Find the substrate variable, or combination of variables, showing the strongest relationship to biomass. Or, • From the list of five independent variables of salinity, acidity, potassium, sodium, and zinc, find the combination of one or more variables that has the strongest relationship with aerial biomass. • Find the independent variables that can be used to predict aerial biomass.
Example Data • Class vegetative_type location site • Recall 3 vegetative types evaluated • Recall 3 locations where tests occurred • Recall 5 sites within each location • Model • Biomass = vegetative_type location site(location) vegetative_type*location salinity acidity potassium sodium zinc;
Example Data • Model • Biomass = vegetative_type location site(location) vegetative_type*location salinity acidity potassium sodium zinc; • Assuming each linear effect was significant, you would then need to examine • salinity*salinity • salinity*salinity*salinity • acidity*acidity • acidity*acidity*acidity • Etc.
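Written as a complete step, the model on this slide might look like the sketch below. The data set name spartina is a hypothetical placeholder, the variable names follow the slide, and the single quadratic term shown is only an example of how a higher-order term would be added once its linear term has proved significant.

```sas
/* Hypothetical data set name SPARTINA; variable names follow the slide. */
proc glm data=spartina;
  class vegetative_type location site;
  model biomass = vegetative_type location site(location)
                  vegetative_type*location
                  salinity acidity potassium sodium zinc
                  salinity*salinity;   /* example quadratic term */
run;
quit;
```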
PROC GLM Example • Example: Strawberry yield is modeled as a function of strawberry variety, type of fertilizer, and their interaction.
PROC GLM DATA=berry;
  CLASS fertiliz variety;
  MODEL yield = fertiliz variety fertiliz*variety / SOLUTION;
  LSMEANS fertiliz variety;
Run; Quit;
• The SOLUTION option on the MODEL statement prints the parameter estimates, which is useful for showing the relative effect sizes.
PROC GLM Example Output
General Linear Models Procedure
Class Level Information

Class      Levels   Values
FERTILIZ        2   K N
VARIETY         2   Red Sweet

Number of observations in data set = 24

This section lets us verify that we have two fertilizers and two varieties of interest, and that there are 24 observations in the data. Information about missing observations is also printed here, if applicable.
PROC GLM Example Output
Dependent Variable: YIELD

Source            DF   Sum of Squares   Mean Square   F Value   Pr > F
Model              3       0.87166667    0.29055556      2.59   0.0816
Error             20       2.24666667    0.11233333
Corrected Total   23       3.11833333

R-Square    C.V.       Root MSE    YIELD Mean
0.279530    3.790707   0.3351617   8.8416667

This section shows the ANOVA table, with degrees of freedom (DF), sums of squares, and an F value that tests whether any of the terms in the model are significant. The C.V. (coefficient of variation) is (Root MSE / mean yield) × 100%. R-Square is the model sum of squares divided by the total sum of squares. This is commonly used to evaluate how well the model fits the data, but it should not be the only criterion of fit that you examine.
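The summary statistics can be reproduced by hand from the ANOVA table, which is a useful check on how each one is defined. The arithmetic below uses only the values printed in the output above.

```latex
\begin{aligned}
MS_{\text{Model}} &= \frac{0.87166667}{3} = 0.29055556, \qquad
MS_{\text{Error}} = \frac{2.24666667}{20} = 0.11233333 \\[4pt]
F &= \frac{MS_{\text{Model}}}{MS_{\text{Error}}}
   = \frac{0.29055556}{0.11233333} \approx 2.59 \\[4pt]
R^2 &= \frac{SS_{\text{Model}}}{SS_{\text{Total}}}
     = \frac{0.87166667}{3.11833333} \approx 0.2795 \\[4pt]
\text{Root MSE} &= \sqrt{0.11233333} \approx 0.3352 \\[4pt]
\text{C.V.} &= 100 \times \frac{0.3351617}{8.8416667} \approx 3.79
\end{aligned}
```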