Stats 760: Lecture 2

Linear Models


Agenda

  • R formulation

  • Matrix formulation

  • Least squares fit

  • Numerical details – QR decomposition

  • R parameterisations

    • Treatment

    • Sum

    • Helmert


R formulation

  • Regression model y ~ x1 + x2 + x3

  • Anova model y ~ A + B (A, B factors)

  • Model with both factors and continuous variables y ~ A*B*x1 + A*B*x2

    What do these mean? How do we interpret the output?


Regression model

Mean of observation

= b0 + b1x1 + b2x2 + b3x3

Estimate the b’s by least squares, i.e. minimise the sum of squares

sum over i of (yi - b0 - b1xi1 - b2xi2 - b3xi3)2


Matrix formulation

Arrange the data into an n x p matrix X (one column per term, including the column of 1’s) and an n x 1 vector y

Then minimise ||y - Xb||2 = (y - Xb)T(y - Xb) over b


Normal equations

  • The minimising b’s satisfy the normal equations XTXb = XTy

Proof sketch: for any b, ||y - Xb||2 = ||y - Xbhat||2 + ||X(bhat - b)||2, where bhat solves the normal equations. The second term is non-negative, and zero when b = bhat, so bhat minimises the sum of squares.
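As a numerical illustration of what the normal equations say, here is a pure-Python straight-line fit on made-up data (R itself does not solve the equations this way; see the QR slides that follow):

```python
# A sketch: form and solve the normal equations XTX b = XTy directly
# for a tiny straight-line fit. Data are made up for illustration.

def transpose(M):
    return [list(row) for row in zip(*M)]

def matmul(A, B):
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def solve2(A, rhs):
    # Cramer's rule for a 2 x 2 system A b = rhs
    det = A[0][0] * A[1][1] - A[0][1] * A[1][0]
    return [(rhs[0] * A[1][1] - rhs[1] * A[0][1]) / det,
            (A[0][0] * rhs[1] - A[1][0] * rhs[0]) / det]

x = [0.0, 1.0, 2.0, 3.0]
y = [1.0, 3.0, 5.0, 7.0]            # exactly y = 1 + 2x, so the fit is exact
X = [[1.0, xi] for xi in x]         # design matrix: column of 1's, then x

XtX = matmul(transpose(X), X)
XTy = [sum(X[i][j] * y[i] for i in range(len(y))) for j in range(2)]
b = solve2(XtX, XTy)
print(b)                            # [1.0, 2.0]
```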


Solving the equations

  • We could calculate the matrix XTX directly, but this is not very accurate (subject to roundoff error). For example, when fitting polynomials, this method breaks down even for polynomials of quite low degree

  • Better to use the “QR decomposition” which avoids calculating XTX


Solving the normal equations

  • Use “QR decomposition” X=QR

  • X is n x p and must have “full rank” (no column a linear combination of other columns)

  • Q is n x p “orthogonal” (i.e. QTQ = identity matrix)

  • R is p x p “upper triangular” (all elements below the diagonal zero), all diagonal elements positive, so inverse exists
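A minimal sketch of such a factorisation, via classical Gram-Schmidt in pure Python (not the Householder-based routine R itself uses; the design matrix here is made up):

```python
import math

def qr_decompose(X):
    """Classical Gram-Schmidt QR: X (n x p, full rank) = Q R,
    with Q orthogonal (QTQ = I) and R upper triangular."""
    n, p = len(X), len(X[0])
    Q = [[0.0] * p for _ in range(n)]
    R = [[0.0] * p for _ in range(p)]
    for j in range(p):
        # Start from column j of X and subtract its projections
        # onto the already-built orthonormal columns of Q.
        v = [X[i][j] for i in range(n)]
        for k in range(j):
            R[k][j] = sum(Q[i][k] * X[i][j] for i in range(n))
            v = [v[i] - R[k][j] * Q[i][k] for i in range(n)]
        R[j][j] = math.sqrt(sum(vi * vi for vi in v))   # positive diagonal
        for i in range(n):
            Q[i][j] = v[i] / R[j][j]
    return Q, R

# A small full-rank design: intercept plus one covariate.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0]]
Q, R = qr_decompose(X)
```

Gram-Schmidt is the easiest version to read; production code prefers Householder reflections for numerical stability.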


Solving using QR

XTX = RTQTQR = RTR

XTy = RTQTy

Normal equations reduce to

RTRb = RTQTy

Premultiply by inverse of RT to get Rb = QTy

Triangular system, easy to solve
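Solving an upper-triangular system is a single backward pass; a pure-Python sketch on a made-up 2 x 2 example:

```python
def back_substitute(R, rhs):
    """Solve R b = rhs for upper-triangular R with nonzero diagonal,
    working from the last equation upwards."""
    n = len(rhs)
    b = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(R[i][j] * b[j] for j in range(i + 1, n))
        b[i] = (rhs[i] - s) / R[i][i]
    return b

R = [[2.0, 1.0],
     [0.0, 3.0]]
print(back_substitute(R, [4.0, 6.0]))   # [1.0, 2.0]
```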



A refinement

  • We need QTy as well as R

  • Solution: do the QR decomposition of the augmented matrix [X,y]: the first p columns of its triangular factor give R, and the first p entries of its last column give r = QTy

  • Thus, solve Rb = r


What R has to do

When you run lm, R forms the matrix X from the model formula, then fits the model E(Y)=Xb

Steps:

  • Extract X and Y from the data and the model formula

  • Do the QR decomposition

  • Solve the equations Rb = r

  • Solutions are the numbers reported in the summary


Forming X

When all variables are continuous, it’s a no-brainer

  • Start with a column of 1’s

  • Add columns corresponding to the independent variables

    It’s a bit harder for factors


Factors: one way anova

Consider model y ~ a where a is a factor having 3 levels say.

In this case, we

  • Start with a column of ones

  • Add a dummy variable for each level of the factor (3 in all), order is order of factor levels

    Problem: the matrix has 4 columns, but the first is the sum of the last 3, so the columns are not linearly independent

    Solution: Reparametrize!
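The redundancy is easy to verify mechanically; a pure-Python sketch with a hypothetical 3-level factor and two observations per level:

```python
# Build the naive one-way anova design: intercept plus one dummy per level.
levels = [0, 0, 1, 1, 2, 2]   # factor level of each of 6 observations
X = [[1.0] + [1.0 if lev == l else 0.0 for l in range(3)]
     for lev in levels]

# In every row the first column equals the sum of the three dummies,
# so the 4 columns are linearly dependent and XTX is singular.
for row in X:
    assert row[0] == row[1] + row[2] + row[3]
```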


Reparametrizing

  • Let Xa be the last 3 columns (the 3 dummy variables)

  • Replace Xa by XaC (i.e. Xa multiplied by C), where C is a 3 x 2 “contrast matrix” with the properties

    • Columns of XaC are linearly independent

    • Columns of XaC are linearly independent of the column of 1’s

      In general, if a has k levels, C will be k x (k-1)


The “treatment” parametrization

  • Here C is the matrix

    C = 0 0

    1 0

    0 1

    (You can see the matrix in the general case by typing contr.treatment(k) in R, where k is the number of levels)

    This is the default in R
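The same matrix can be generated outside R; a pure-Python sketch (the function name contr_treatment is mine, mirroring R’s contr.treatment):

```python
def contr_treatment(k):
    """k x (k-1) treatment contrasts: level 1 is the baseline,
    and column j is the dummy variable for level j + 2 (1-indexed)."""
    return [[1.0 if i == j + 1 else 0.0 for j in range(k - 1)]
            for i in range(k)]

print(contr_treatment(3))   # [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
```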


Treatment parametrization (2)

  • The model is E[Y] = Xb, where X is (two observations at each level of a)

    1 0 0 (observations at level 1)
    1 0 0
    1 1 0 (observations at level 2)
    1 1 0
    1 0 1 (observations at level 3)
    1 0 1

  • The effect of the reparametrization is to drop the first column of Xa, leaving the others unchanged.


Treatment parametrization (3)

  • Mean response at level 1 is b0

  • Mean response at level 2 is b0 + b1

  • Mean response at level 3 is b0 + b2

  • Thus, b0 is interpreted as the baseline (level 1) mean

  • The parameter b1 is interpreted as the offset for level 2 (difference between levels 1 and 2)

  • The parameter b2 is interpreted as the offset for level 3 (difference between levels 1 and 3)



The “sum” parametrization

  • Here C is the matrix

    C = 1 0

    0 1

    -1 -1

    (You can see the matrix in the general case by typing contr.sum(k) in R, where k is the number of levels)

    To get this in R, you need to use the options function

    options(contrasts=c("contr.sum", "contr.poly"))
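As with the treatment case, the general matrix is easy to generate by hand; a pure-Python sketch mirroring R’s contr.sum (the function name is mine):

```python
def contr_sum(k):
    """k x (k-1) sum-to-zero contrasts: identity coding for the first
    k-1 levels, with the last level coded -1 in every column."""
    return [[-1.0] * (k - 1) if i == k - 1
            else [1.0 if i == j else 0.0 for j in range(k - 1)]
            for i in range(k)]

print(contr_sum(3))   # [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
```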


Sum parametrization (2)

  • The model is E[Y] = Xb, where X is

    1 1 0 (observations at level 1)
    1 1 0
    1 0 1 (observations at level 2)
    1 0 1
    1 -1 -1 (observations at level 3)
    1 -1 -1

  • The effect of this reparametrization is to drop the last column of Xa, and change the rows corresponding to the last level of a.


Sum parameterization (3)

  • Mean response at level 1 is b0 + b1

  • Mean response at level 2 is b0 + b2

  • Mean response at level 3 is b0 - b1 -b2

  • Thus, b0 is interpreted as the average of the 3 means, the “overall mean”

  • The parameter b1 is interpreted as the offset for level 1 (difference between level 1 and the overall mean)

  • The parameter b2 is interpreted as the offset for level 2 (difference between level 2 and the overall mean)

  • The offset for level 3 is - b1 - b2



The “Helmert” parametrization

  • Here C is the matrix

    C = -1 -1

    1 -1

    0 2

    (You can see the matrix in the general case by typing contr.helmert(k) in R, where k is the number of levels)
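The general pattern (column j contrasts a level against the average of all earlier levels) can also be written out directly; a pure-Python sketch mirroring R’s contr.helmert (the function name is mine):

```python
def contr_helmert(k):
    """k x (k-1) Helmert contrasts: in column j (0-indexed), levels
    1..j+1 are coded -1, level j+2 is coded j+1, later levels 0."""
    return [[float(j + 1) if i == j + 1 else (-1.0 if i <= j else 0.0)
             for j in range(k - 1)]
            for i in range(k)]

print(contr_helmert(3))   # [[-1.0, -1.0], [1.0, -1.0], [0.0, 2.0]]
```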


Helmert parametrization (2)

  • The model is E[Y] = Xb, where X is

    1 -1 -1 (observations at level 1)
    1 -1 -1
    1 1 -1 (observations at level 2)
    1 1 -1
    1 0 2 (observations at level 3)
    1 0 2

  • The effect of this reparametrization is to change all the rows and columns.


Helmert parametrization (3)

  • Mean response at level 1 is b0 - b1 -b2

  • Mean response at level 2 is b0 + b1 -b2

  • Mean response at level 3 is b0 + 2b2

  • Thus, b0 is interpreted as the average of the 3 means, the “overall mean”

  • The parameter b1 is interpreted as half the difference between the level 2 mean and the level 1 mean

  • The parameter b2 is interpreted as one third of the difference between the level 3 mean and the average of the level 1 and 2 means



Using R to calculate the relationship between b-parameters and means

Since the least squares solution is b = (XTX)-1XTy, the matrix (XTX)-1XT gives the coefficients we need to find the b’s from the m’s
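A pure-Python sketch of this calculation, for a 3-level treatment-coded design with two observations per level and made-up cell means 10, 12, 15 (so we expect a baseline of 10 and offsets 2 and 5; the Gaussian-elimination solver here is mine, standing in for R’s QR-based solver):

```python
def solve(A, rhs):
    """Solve A x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    M = [row[:] + [r] for row, r in zip(A, rhs)]   # augmented matrix
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x

# Treatment-coded one-way design, two observations per level.
X = [[1, 0, 0], [1, 0, 0], [1, 1, 0], [1, 1, 0], [1, 0, 1], [1, 0, 1]]
y = [10, 10, 12, 12, 15, 15]    # cell means 10, 12, 15

XtX = [[sum(X[i][r] * X[i][c] for i in range(6)) for c in range(3)]
       for r in range(3)]
XTy = [sum(X[i][r] * y[i] for i in range(6)) for r in range(3)]
b = solve(XtX, XTy)
print(b)    # baseline 10, offsets 2 and 5
```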


Example: One way model

  • In an experiment to study the effect of carcinogenic substances, six different substances were applied to cell cultures.

  • The response variable (ratio) is the ratio of damaged to undamaged cells, and the explanatory variable (treatment) is the substance


Data

ratio treatment

0.08 control

+ 49 other control obs

0.08 chloralhydrate

+ 49 other chloralhydrate obs

0.10 diazapan

+ 49 other diazapan obs

0.10 hydroquinone

+ 49 other hydroquinone obs

0.07 econidazole

+ 49 other econidazole obs

0.17 colchicine

+ 49 other colchicine obs


lm output

> cancer.lm<-lm(ratio ~ treatment,data=cancer.df)

> summary(cancer.lm)

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 0.26900 0.02037 13.207 < 2e-16 ***

treatmentcolchicine 0.17920 0.02881 6.221 1.69e-09 ***

treatmentcontrol -0.03240 0.02881 -1.125 0.262

treatmentdiazapan 0.01180 0.02881 0.410 0.682

treatmenteconidazole -0.00420 0.02881 -0.146 0.884

treatmenthydroquinone 0.04300 0.02881 1.493 0.137

---

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.144 on 294 degrees of freedom

Multiple R-Squared: 0.1903, Adjusted R-squared: 0.1766

F-statistic: 13.82 on 5 and 294 DF, p-value: 3.897e-12


Relationship between means and betas

X<-model.matrix(cancer.lm)

coef.mat<-solve(t(X)%*%X)%*%t(X)

> levels(cancer.df$treatment)

[1] "chloralhydrate" "colchicine" "control" "diazapan" "econidazole" "hydroquinone"

> cancer.df$treatment[c(1,51,101,151,201,251)]

control chloralhydrate diazapan hydroquinone econidazole colchicine

Levels: chloralhydrate colchicine control diazapan econidazole hydroquinone

> round(50*coef.mat[,c(1,51,101,151,201,251)])

1 51 101 151 201 251

(Intercept) 0 1 0 0 0 0

treatmentcolchicine 0 -1 0 0 0 1

treatmentcontrol 1 -1 0 0 0 0

treatmentdiazapan 0 -1 1 0 0 0

treatmenteconidazole 0 -1 0 0 1 0

treatmenthydroquinone 0 -1 0 1 0 0


Two factors: model y ~ a + b

To form X:

  • Start with column of 1’s

  • Add XaCa

  • Add XbCb


Two factors: model y ~ a * b

To form X:

  • Start with column of 1’s

  • Add XaCa

  • Add XbCb

  • Add XaCa: XbCb

    (Every column of XaCa multiplied elementwise with every column of XbCb)
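This elementwise construction is mechanical; a pure-Python sketch, using the first six rows of the rats design that appears later (sources Beef/Cereal/Pork, levels High/Low, treatment coding):

```python
def interaction_columns(A, B):
    """Every column of A multiplied elementwise with every column of B,
    row by row."""
    return [[a * b for a in row_a for b in row_b]
            for row_a, row_b in zip(A, B)]

# XaCa (sourceCereal, sourcePork) and XbCb (levelLow) for the rows
# Beef/High, Cereal/High, Pork/High, Beef/Low, Cereal/Low, Pork/Low
XaCa = [[0, 0], [1, 0], [0, 1], [0, 0], [1, 0], [0, 1]]
XbCb = [[0], [0], [0], [1], [1], [1]]
print(interaction_columns(XaCa, XbCb))
# Only the Cereal/Low and Pork/Low rows get a nonzero interaction column.
```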


Two factors: example

Experiment to study weight gain in rats

  • Response is weight gain over a fixed time period

  • This is modelled as a function of diet (Beef, Cereal, Pork) and amount of feed (High, Low)

  • See coursebook Section 4.4


Data

> diets.df

gain source level

1 73 Beef High

2 98 Cereal High

3 94 Pork High

4 90 Beef Low

5 107 Cereal Low

6 49 Pork Low

7 102 Beef High

8 74 Cereal High

9 79 Pork High

10 76 Beef Low

. . . 60 observations in all


Two factors: the model

  • If the (continuous) response depends on two categorical explanatory variables, then we assume that the response is normally distributed with a mean depending on the combination of factor levels: if the factors are A and B, the mean at the i th level of A and the j th level of B is mij

  • Other standard assumptions (equal variance, normality, independence) apply



Decomposition of the means

  • We usually want to split each “cell mean” up into 4 terms:

    • A term reflecting the overall baseline level of the response

    • A term reflecting the effect of factor A (row effect)

    • A term reflecting the effect of factor B (column effect)

    • A term reflecting how A and B interact.


Mathematically…

Overall Baseline: m11 (mean when both factors are at their baseline levels)

Effect of i th level of factor A (row effect): mi1 - m11(The i th level of A, at the baseline of B, expressed as a deviation from the overall baseline)

Effect of j th level of factor B (column effect) : m1j - m11 (The j th level of B, at the baseline of A, expressed as a deviation from the overall baseline)

Interaction: what’s left over (see next slide)


Interactions

  • Each cell (except the first row and column) has an interaction:

    Interaction = cell mean - baseline - row effect - column effect

  • If the interactions are all zero, then the effect of changing levels of A is the same for all levels of B

    • In mathematical terms, mij – mi’j doesn’t depend on j

  • Equivalently, effect of changing levels of B is the same for all levels of A

  • If interactions are zero, relationship between factors and response is simple


Splitting up the mean: rats

Factors are: level (amount of food) and source (diet)

(The slide shows the table of cell means split into baseline, row effect, column effect and interaction terms.)


Fit model

> rats.lm<-lm(gain~source+level + source:level)

> summary(rats.lm)

Coefficients:

Estimate Std. Error t value Pr(>|t|)

(Intercept) 1.000e+02 4.632e+00 21.589 < 2e-16 ***

sourceCereal -1.410e+01 6.551e+00 -2.152 0.03585 *

sourcePork -5.000e-01 6.551e+00 -0.076 0.93944

levelLow -2.080e+01 6.551e+00 -3.175 0.00247 **

sourceCereal:levelLow 1.880e+01 9.264e+00 2.029 0.04736 *

sourcePork:levelLow -3.052e-14 9.264e+00 -3.29e-15 1.00000

---

Signif. codes: 0 `***' 0.001 `**' 0.01 `*' 0.05 `.' 0.1 ` ' 1

Residual standard error: 14.65 on 54 degrees of freedom

Multiple R-Squared: 0.2848, Adjusted R-squared: 0.2185

F-statistic: 4.3 on 5 and 54 DF, p-value: 0.002299


Fitting as a regression model

Note that using the treatment contrasts, this is equivalent to fitting a regression with dummy variables R2, C2, C3

R2 = 1 if obs is in row 2, zero otherwise

C2 = 1 if obs is in column 2, zero otherwise

C3 = 1 if obs is in column 3, zero otherwise

The regression is

Y ~ R2 + C2 + C3 + I(R2*C2) + I(R2*C3)


Notations

For two factors A and B

  • Baseline: m = m11

  • A main effect: ai = mi1-m11

  • B main effect: bj = m1j -m11

  • AB interaction: abij = mij -mi1-m1j +m11

  • Then mij = m + ai + bj + abij
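This decomposition is pure bookkeeping, so it can be checked mechanically; a pure-Python sketch with a made-up 2 x 3 table of cell means:

```python
# mu[i][j] = cell mean at level i of A and level j of B (0-indexed)
mu = [[10.0, 12.0, 15.0],
      [11.0, 16.0, 14.0]]

m = mu[0][0]                               # baseline
a = [mu[i][0] - m for i in range(2)]       # A main effects
b = [mu[0][j] - m for j in range(3)]       # B main effects
ab = [[mu[i][j] - mu[i][0] - mu[0][j] + m  # AB interactions
       for j in range(3)] for i in range(2)]

# Every cell mean is recovered as m + a[i] + b[j] + ab[i][j].
for i in range(2):
    for j in range(3):
        assert abs(mu[i][j] - (m + a[i] + b[j] + ab[i][j])) < 1e-12
```

Note that, by construction, the interactions in the first row and first column are all zero, matching the earlier slide.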



Using R to interpret parameters

>rats.df<-read.table(file.choose(), header=T)

>rats.lm<-lm(gain~source*level, data=rats.df)

>X<-model.matrix(rats.lm)

>coef.mat<-solve(t(X)%*%X)%*%t(X)

>round(10*coef.mat[,1:6])

1 2 3 4 5 6

(Intercept) 1 0 0 0 0 0

sourceCereal -1 1 0 0 0 0

sourcePork -1 0 1 0 0 0

levelLow -1 0 0 1 0 0

sourceCereal:levelLow 1 -1 0 -1 1 0

sourcePork:levelLow 1 0 -1 -1 0 1

>rats.df[1:6,]

gain source level

1 73 Beef High

2 98 Cereal High

3 94 Pork High

4 90 Beef Low

5 107 Cereal Low

6 49 Pork Low

(The six columns of coef.mat shown pick out one observation in each diet/level cell, i.e. the cell means; the rows are the betas.)


X matrix: details (first six rows)

(Intercept) sourceCereal sourcePork levelLow sourceCereal:levelLow sourcePork:levelLow

1 0 0 0 0 0
1 1 0 0 0 0
1 0 1 0 0 0
1 0 0 1 0 0
1 1 0 1 1 0
1 0 1 1 0 1

Column blocks: the column of 1’s, then XaCa, then XbCb, then XaCa:XbCb

