**Topic 7: Analysis of Variance**

**Outline** • Partitioning sums of squares • Breakdown of degrees of freedom • Expected mean squares (EMS) • F test • ANOVA table • General linear test • Pearson correlation / R²

**Analysis of Variance** • Organize results arithmetically • Total sum of squares in Y is SSTO = Σ(Yi - Ȳ)² • Partition this into two sources • Model (explained by regression) • Error (unexplained / residual)

**Total Sum of Squares** • SSTO = Σ(Yi - Ȳ)², with dfT = n-1 and MST = SSTO/dfT • MST is the usual estimate of the variance of Y if there are no explanatory variables • SAS uses the term Corrected Total for this source • The uncorrected total is ΣYi² • “Corrected” means that we subtract off the mean before squaring
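As a quick numerical sketch (in Python with made-up values, since the slides use SAS), the corrected total sum of squares subtracts the mean before squaring, and MST is just the sample variance of Y:

```python
import numpy as np

# Hypothetical response values, for illustration only
y = np.array([2.0, 4.0, 5.0, 7.0, 9.0])

uncorrected = np.sum(y**2)               # uncorrected total: sum of Yi^2
corrected = np.sum((y - y.mean())**2)    # SSTO = sum of (Yi - Ybar)^2
mst = corrected / (len(y) - 1)           # MST = SSTO / (n - 1)

# mst agrees with the usual sample variance of Y
print(uncorrected, corrected, mst)
```

Note that `corrected = uncorrected - n * Ybar**2`, which is why SAS calls this source the Corrected Total.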

**Model Sum of Squares** • dfR = 1 (due to the addition of the slope) • MSR = SSR/dfR • KNNL uses regression for what SAS calls model • So SSR (KNNL) is the same as SS Model

**Error Sum of Squares** • dfE = n-2 (we estimate both slope and intercept) • MSE = SSE/dfE • MSE is an estimate of the variance of Y taking into account (or conditioning on) the explanatory variable(s) • MSE = s²

**ANOVA Table**

| Source     | df  | SS   | MS       |
|------------|-----|------|----------|
| Regression | 1   | SSR  | SSR/dfR  |
| Error      | n-2 | SSE  | SSE/dfE  |
| Total      | n-1 | SSTO | SSTO/dfT |
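The partition in the table can be verified numerically. This is a minimal Python sketch with hypothetical data (the slides themselves use SAS); it checks that SSTO = SSR + SSE:

```python
import numpy as np

# Hypothetical data, for illustration only
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])
n = len(y)

# Least-squares slope and intercept
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean())**2)
b0 = y.mean() - b1 * x.mean()
yhat = b0 + b1 * x

ssto = np.sum((y - y.mean())**2)    # corrected total
sse = np.sum((y - yhat)**2)         # error (residual)
ssr = np.sum((yhat - y.mean())**2)  # regression/model

msr = ssr / 1        # dfR = 1
mse = sse / (n - 2)  # dfE = n - 2

# The partition holds: ssto == ssr + sse (up to floating-point error)
print(ssto, ssr, sse)
```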

**Expected Mean Squares** • MSR and MSE are random variables • E(MSE) = σ² and E(MSR) = σ² + β1²Σ(Xi - X̄)² • When H0: β1 = 0 is true, E(MSR) = E(MSE)
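A small Monte Carlo run makes the EMS claim concrete. This Python sketch (simulated data; the sample size, σ², and seed are our choices, not from the slides) generates data with β1 = 0 and checks that MSR and MSE both average out near σ²:

```python
import numpy as np

rng = np.random.default_rng(0)
n, sigma2, reps = 20, 4.0, 2000
x = np.linspace(0.0, 10.0, n)
sxx = np.sum((x - x.mean())**2)

msr_vals, mse_vals = [], []
for _ in range(reps):
    # H0 is true: beta1 = 0, so Y = beta0 + error
    y = 3.0 + rng.normal(0.0, np.sqrt(sigma2), n)
    b1 = np.sum((x - x.mean()) * (y - y.mean())) / sxx
    b0 = y.mean() - b1 * x.mean()
    resid = y - (b0 + b1 * x)
    msr_vals.append(b1**2 * sxx)               # SSR/1
    mse_vals.append(np.sum(resid**2) / (n - 2))

# Both averages should sit near sigma2 = 4 when H0 is true
print(np.mean(msr_vals), np.mean(mse_vals))
```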

**F test** • F* = MSR/MSE ~ F(dfR, dfE) = F(1, n-2) • See KNNL pgs 69-71 • When H0: β1 = 0 is false, MSR tends to be larger than MSE • We reject H0 when F* is large, i.e., when F* ≥ F(1-α; dfR, dfE) = F(.95; 1, n-2) • In practice we use P-values
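A hedged sketch of the decision rule in Python (hypothetical data; scipy is assumed to be available for the F quantile and P-value):

```python
import numpy as np
from scipy import stats

# Hypothetical data, for illustration only
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.2, 2.9, 4.8, 6.1, 8.3, 9.9])
n = len(y)

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean())**2)
b0 = y.mean() - b1 * x.mean()
yhat = b0 + b1 * x

ssr = np.sum((yhat - y.mean())**2)
sse = np.sum((y - yhat)**2)
msr, mse = ssr / 1, sse / (n - 2)

f_star = msr / mse
f_crit = stats.f.ppf(0.95, 1, n - 2)   # F(.95; 1, n-2)
p_value = stats.f.sf(f_star, 1, n - 2)

# Reject H0: beta1 = 0 when f_star >= f_crit (equivalently p_value <= 0.05)
print(f_star, f_crit, p_value)
```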

**F test** • When H0: β1 = 0 is false, F* has a noncentral F distribution • This can be used to calculate power • Recall t* = b1/s(b1) tests H0: β1 = 0 • It can be shown that (t*)² = F* (pg 71) • The two approaches give the same P-value
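The identity (t*)² = F* is easy to check numerically. A minimal Python sketch with hypothetical data:

```python
import numpy as np

# Hypothetical data, for illustration only
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([1.1, 1.9, 3.2, 3.8, 5.1])
n = len(y)

sxx = np.sum((x - x.mean())**2)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / sxx
b0 = y.mean() - b1 * x.mean()
resid = y - (b0 + b1 * x)

mse = np.sum(resid**2) / (n - 2)
s_b1 = np.sqrt(mse / sxx)   # standard error of b1
t_star = b1 / s_b1          # t statistic for H0: beta1 = 0

msr = b1**2 * sxx           # SSR (dfR = 1), since yhat - ybar = b1*(x - xbar)
f_star = msr / mse          # F statistic

# (t_star)**2 equals f_star up to floating-point error
print(t_star**2, f_star)
```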

**ANOVA Table**

| Source | df  | SS  | MS  | F       | P    |
|--------|-----|-----|-----|---------|------|
| Model  | 1   | SSM | MSM | MSM/MSE | 0.## |
| Error  | n-2 | SSE | MSE |         |      |
| Total  | n-1 |     |     |         |      |

**Note:** Model is used here instead of Regression; this is more similar to SAS.

**Examples** • Tower of Pisa study (n=13 cases)

```sas
proc reg data=a1;
  model lean=year;
run;
```

• Toluca lot size study (n=25 cases)

```sas
proc reg data=toluca;
  model hours=lotsize;
run;
```

**Pisa Output**

**Pisa Output** • (t*)² = (30.07)² = 904.2 ≈ F* (difference due to rounding error)

**Toluca Output**

**Toluca Output** • (t*)² = (10.29)² = 105.88 = F*

**General Linear Test** • A different view of the same problem • We want to compare two models • Yi = β0 + β1Xi + ei (full model) • Yi = β0 + ei (reduced model) • Compare the two models using their error sums of squares; the better model will have a “smaller” mean squared error

**General Linear Test** • Let SSE(F) = SSE for the full model and SSE(R) = SSE for the reduced model • F* = [(SSE(R) - SSE(F)) / (dfR - dfF)] / [SSE(F) / dfF] • Compare with F(1-α; dfR-dfF, dfF)
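The general linear test can be sketched directly from the two models' residuals. A minimal Python illustration (hypothetical data), comparing the full model against the intercept-only reduced model, where the fitted value is simply Ȳ:

```python
import numpy as np

# Hypothetical data, for illustration only
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.3, 2.8, 4.1, 4.9, 6.2, 6.8])
n = len(y)

# Full model: Y = b0 + b1*X
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean())**2)
b0 = y.mean() - b1 * x.mean()
sse_full = np.sum((y - (b0 + b1 * x))**2)   # df_F = n - 2

# Reduced model: Y = b0 only, so the fit is just Ybar
sse_red = np.sum((y - y.mean())**2)          # df_R = n - 1

df_r, df_f = n - 1, n - 2
f_star = ((sse_red - sse_full) / (df_r - df_f)) / (sse_full / df_f)

# Here SSE(R) - SSE(F) = SSTO - SSE = SSR, so f_star matches the ANOVA F
print(f_star)
```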

**Simple Linear Regression** • dfR = n-1 (reduced model), dfF = n-2 (full model) • dfR - dfF = 1 • F* = [(SSTO - SSE)/1] / MSE = SSR/MSE • Same test as before • This approach is more general

**Pearson Correlation** • r is the usual correlation coefficient, r = Σ(Xi - X̄)(Yi - Ȳ) / √[Σ(Xi - X̄)² Σ(Yi - Ȳ)²] • It is a number between –1 and +1 and measures the strength of the linear relationship between two variables

**Pearson Correlation** • Notice that b1 = r(sY/sX), so b1 and r have the same sign and b1 = 0 exactly when r = 0 • The test of H0: β1 = 0 is therefore equivalent to the test of H0: ρ = 0, with t* = r√(n-2)/√(1-r²)
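The equivalence of the two tests can be checked numerically. A minimal Python sketch (hypothetical data) computes the t statistic once from r and once from b1:

```python
import numpy as np

# Hypothetical data, for illustration only
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0])
y = np.array([2.0, 2.5, 4.1, 3.9, 5.6, 6.0, 7.2])
n = len(y)

# t statistic from the correlation: t* = r*sqrt(n-2)/sqrt(1-r^2)
r = np.corrcoef(x, y)[0, 1]
t_from_r = r * np.sqrt(n - 2) / np.sqrt(1 - r**2)

# t statistic from the slope: t* = b1 / s(b1)
sxx = np.sum((x - x.mean())**2)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / sxx
b0 = y.mean() - b1 * x.mean()
mse = np.sum((y - (b0 + b1 * x))**2) / (n - 2)
t_from_b1 = b1 / np.sqrt(mse / sxx)

# The two statistics agree up to floating-point error
print(t_from_r, t_from_b1)
```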

**R² and r²** • R² = SSR/SSTO = 1 - SSE/SSTO, the ratio of explained to total variation

**R² and r²** • We use R² when the number of explanatory variables is arbitrary (simple and multiple regression) • r² = R² only for simple regression • R² is often multiplied by 100 and thereby expressed as a percent

**R² and r²** • R² always increases when additional explanatory variables are added to the model • Adjusted R² “penalizes” larger models • Adjusted R² does not necessarily increase
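A hedged Python sketch (hypothetical data) of R² and adjusted R² for simple regression, also checking that r² = R² in this case:

```python
import numpy as np

# Hypothetical data, for illustration only
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([1.5, 2.1, 3.9, 4.2, 5.8, 6.1])
n, p = len(y), 2   # p = number of regression parameters (intercept + slope)

b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean())**2)
b0 = y.mean() - b1 * x.mean()
sse = np.sum((y - (b0 + b1 * x))**2)
ssto = np.sum((y - y.mean())**2)

r2 = 1 - sse / ssto                              # R^2 = SSR/SSTO
adj_r2 = 1 - (sse / (n - p)) / (ssto / (n - 1))  # penalizes model size

r = np.corrcoef(x, y)[0, 1]
# For simple regression, r**2 equals R^2
print(r2, adj_r2, r**2)
```

Adjusted R² replaces the sums of squares by mean squares, so adding a useless variable (which barely lowers SSE but costs a degree of freedom) can make it go down.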

**Pisa Output** • R-Square = 0.9880 (SAS); by hand, SSM/SSTO = 15804/15997 = 0.9879 (difference due to rounding)

**Toluca Output** R-Square 0.8215 (SAS) = SSM/SSTO = 252378/307203 = 0.8215

**Background Reading** • You may find Sections 2.10 and 2.11 interesting • 2.10 provides cautionary remarks • We will discuss these as they arise • 2.11 discusses the bivariate Normal distribution • Similarities and differences • Confidence interval for r • Program topic7.sas has the code to generate the ANOVA output • Read Chapter 3