
# AAEC 4302 STATISTICAL METHODS IN AGRICULTURAL RESEARCH - PowerPoint PPT Presentation



## PowerPoint Slideshow about 'AAEC 4302 STATISTICAL METHODS IN AGRICULTURAL RESEARCH' - rae


### AAEC 4302 STATISTICAL METHODS IN AGRICULTURAL RESEARCH

Chapter 7 (7.1 & 7.2): Theory and Application of the Multiple Regression Model

• The multiple regression model aims to include, and should include, all of the independent variables X1, X2, X3, …, Xk that are believed to affect Y

• Their values are taken as given: It is critical that, although X1, X2, X3, …, Xk are believed to affect Y, Y does not affect the values taken by them

• The multiple linear regression model is given by:

Yi = β0 + β1X1i + β2X2i + β3X3i +…+ βkXki + ui

where i=1,…,n represents the observations, k is the total number of independent variables in the model, β0, β1,…, βk are the parameters to be estimated and ui is the disturbance term, with the same properties as in the simple regression model

• In our example we have time-series data with k = 4 independent variables (five parameters, counting the intercept) and n = 21 observations (i = 1, …, 21).

• The model to be estimated, therefore, is

Yi = β0 + β1X1i + β2X2i + β3X3i +β4X4i + ui

• As before:

E[ Yi ]= β0 + β1X1i + β2X2i + β3X3i +…+ βkXki

Yi = E[Yi] + ui, where E[Yi] is the systematic (explainable) component and ui is the unsystematic (random) component of Yi

• And the corresponding prediction of Yi:

$\hat{Y}_i = \hat{\beta}_0 + \hat{\beta}_1 X_{1i} + \hat{\beta}_2 X_{2i} + \hat{\beta}_3 X_{3i} + \hat{\beta}_4 X_{4i}$

• Also as before, the parameters of the multiple regression model (β0, β1, β2, β3, β4) are estimated by minimizing the sum of squared residuals (SSR):

$SSR = \sum_{i=1}^{n} e_i^2 = \sum_{i=1}^{n} (Y_i - \hat{\beta}_0 - \hat{\beta}_1 X_{1i} - \hat{\beta}_2 X_{2i} - \hat{\beta}_3 X_{3i} - \hat{\beta}_4 X_{4i})^2$

• As before, the formulas used to estimate the regression model parameters are the ones that make the SSR as small as possible
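The idea of choosing the estimates that minimize the SSR can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the data are made up (they are not the course's beef data), the helper names `ols` and `ssr` are hypothetical, and the normal equations are solved with a small hand-written elimination routine rather than a statistics package.

```python
# Minimal OLS sketch: solve the normal equations (X'X) b = X'y.
# All data below are invented for illustration only.

def ols(rows, y):
    """Return [b0, b1, ..., bk], the estimates that minimize the SSR."""
    X = [[1.0] + list(r) for r in rows]           # prepend the intercept column
    p = len(X[0])
    # Augmented matrix [X'X | X'y]
    A = [[sum(x[a] * x[b] for x in X) for b in range(p)]
         + [sum(x[a] * yi for x, yi in zip(X, y))] for a in range(p)]
    for col in range(p):                           # Gauss-Jordan, partial pivoting
        pivot_row = max(range(col, p), key=lambda r: abs(A[r][col]))
        A[col], A[pivot_row] = A[pivot_row], A[col]
        pivot = A[col][col]
        A[col] = [v / pivot for v in A[col]]
        for r in range(p):
            if r != col:
                factor = A[r][col]
                A[r] = [vr - factor * vc for vr, vc in zip(A[r], A[col])]
    return [A[r][p] for r in range(p)]

def ssr(b, rows, y):
    """Sum of squared residuals for a candidate coefficient vector b."""
    return sum((yi - b[0] - sum(bj * xj for bj, xj in zip(b[1:], r))) ** 2
               for r, yi in zip(rows, y))

rows = [(1, 2), (2, 1), (3, 4), (4, 3), (5, 6), (6, 5)]     # (X1, X2) pairs
noise = [0.1, -0.2, 0.05, 0.1, -0.1, 0.05]                  # made-up disturbances
y = [3 + 2 * x1 - 1 * x2 + e for (x1, x2), e in zip(rows, noise)]

b_hat = ols(rows, y)
print([round(b, 3) for b in b_hat])
print(ssr(b_hat, rows, y) <= ssr([3, 2, -1], rows, y))      # OLS SSR is never larger
```

Any other coefficient vector, including the values used to generate the data, gives an SSR at least as large as the one obtained at `b_hat`.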

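[Figure: regression surface (plane) E[Y] = B0 + B1X1 + B2X2 over the X1 and X2 axes, with intercept B0, the X1 slope measured by B1, the X2 slope measured by B2, and a disturbance ui shown relative to the plane.]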

• The intercept estimate $\hat{\beta}_0$ is the predicted value of Y when all of the independent variables in the model take a value of zero, which may not be empirically relevant or even meaningful in some cases.

• In our example $\hat{\beta}_0$ is 144.94, which means that if all the independent variables take the value of zero (the prices of beef, chicken and pork are zero cents/lb and the income of the US population is zero dollars per year), then the estimated beef consumption will be 144.94 lbs/year:

$\hat{Y}_i = 144.94 + \hat{\beta}_1 (0) + \hat{\beta}_2 (0) + \hat{\beta}_3 (0) + \hat{\beta}_4 (0) = 144.94$

• In a strictly linear model, β1, β2,..., βk are slope coefficients that measure the change in Y, in units of Y, when the corresponding X (X1, X2,..., Xk) changes by one unit and the values of all of the other independent variables remain constant at any given level (it does not matter which)

• Ceteris paribus (other things being equal)

• In our example:

• $\hat{\beta}_1 = -0.00291$: if the price of beef increases by one cent/lb, beef consumption will decrease by 0.00291 pounds per year, ceteris paribus

• $\hat{\beta}_2 = -0.116$: if the price of chicken increases by one cent/lb, beef consumption will decrease by 0.116 pounds per year, ceteris paribus (Does this result make sense?)

• $\hat{\beta}_3 = 0.3413$: if the price of pork increases by one cent/lb, beef consumption will increase by 0.3413 pounds per year, ceteris paribus (beef and pork are substitutes)

• $\hat{\beta}_4 = 0.3121$: if US income increases by one dollar per year, beef consumption will increase by 0.3121 pounds per year, ceteris paribus
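The ceteris paribus reading of a slope can be checked numerically with the coefficients quoted in the example. The sketch below is illustrative only: the function name `predict_beef` and the input values are hypothetical (the transcript does not give the data), and only the five estimates come from the slides.

```python
# Estimates quoted in the example: intercept, beef, chicken, pork, income.
B = [144.94, -0.00291, -0.116, 0.3413, 0.3121]

def predict_beef(beef_price, chicken_price, pork_price, income):
    """Predicted beef consumption (lbs/year) from the estimated equation."""
    x = [1.0, beef_price, chicken_price, pork_price, income]
    return sum(b * xi for b, xi in zip(B, x))

# Hypothetical regressor values, chosen only to illustrate the arithmetic.
base = predict_beef(250.0, 90.0, 200.0, 30.0)
pork_up = predict_beef(250.0, 90.0, 201.0, 30.0)   # pork +1 cent/lb, others fixed

print(round(pork_up - base, 4))            # equals the pork coefficient
print(predict_beef(0.0, 0.0, 0.0, 0.0))    # equals the intercept
```

The difference between the two predictions is the pork coefficient no matter which base values are chosen, which is what "it does not matter which" means in a strictly linear model.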

• The same key measure of goodness of fit is used in the case of the multiple regression model:

$R^2 = 1 - \frac{\sum_{i=1}^{n} e_i^2}{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}$

• A disadvantage of the regular $R^2$ as a measure of a model’s goodness of fit is that it never decreases (and almost always increases) as independent variables are added to the model, even if those variables cannot be statistically shown to affect Y

• The adjusted or corrected $R^2$, denoted by $\bar{R}^2$, is a better measure for assessing whether adding an independent variable likely increases the ability of the model to predict Y:

$\bar{R}^2 = 1 - \frac{\sum_{i=1}^{n} e_i^2 / (n-k-1)}{\sum_{i=1}^{n} (Y_i - \bar{Y})^2 / (n-1)}$

• $\bar{R}^2$ is always less than $R^2$, unless $R^2 = 1$

• Adjusted R2 lacks the same straightforward interpretation as the regular R2; under unusual circumstances, it can even be negative
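The two goodness-of-fit formulas can be compared side by side. This is a minimal sketch: the observed and fitted values are hypothetical numbers (not the output of an actual regression), and k = 2 is an assumed number of independent variables.

```python
def sums_of_squares(y, y_hat):
    """Return (sum of squared residuals, total sum of squares about the mean)."""
    y_bar = sum(y) / len(y)
    ssr = sum((yi - fi) ** 2 for yi, fi in zip(y, y_hat))
    sst = sum((yi - y_bar) ** 2 for yi in y)
    return ssr, sst

def r_squared(y, y_hat):
    ssr, sst = sums_of_squares(y, y_hat)
    return 1 - ssr / sst

def adjusted_r_squared(y, y_hat, k):
    """Adjusted R2 for a model with k independent variables plus an intercept."""
    n = len(y)
    ssr, sst = sums_of_squares(y, y_hat)
    return 1 - (ssr / (n - k - 1)) / (sst / (n - 1))

# Hypothetical observed and fitted values (n = 8), for illustration only.
y = [10.0, 12.0, 15.0, 11.0, 18.0, 20.0, 17.0, 22.0]
y_hat = [10.5, 12.5, 14.0, 12.0, 17.0, 19.5, 18.0, 21.5]

r2 = r_squared(y, y_hat)
adj_r2 = adjusted_r_squared(y, y_hat, k=2)
print(round(r2, 4), round(adj_r2, 4))
```

Because the adjusted measure divides each sum of squares by its degrees of freedom, it penalizes every additional independent variable and comes out below the regular value here.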

• Any variable that is suspected to directly affect Y, and that did not hold a constant value throughout the sample, should be included in the model

• Excluding such a variable would likely cause the estimates of the remaining parameters to be “incorrect”; i.e. the formulas for estimating those parameters would be biased

• The consequences of including irrelevant variables in the model are less serious; if in doubt, including the variable is preferred

### AAEC 4302 ADVANCED STATISTICAL METHODS IN AGRICULTURAL RESEARCH

Chapter 6.3

Variables & Model Specifications

• In many cases the value of Y in time period t is more likely explained by the value taken by X in the previous time period:

For example, a farmer’s current-year investment decisions might be based on the previous year’s prices, since the current year’s prices are not known when these decisions are made.

• In multiple regression models (i.e. models with more than one explanatory variable), it can be assumed that Y is affected by different lags of X:

• The model can also be estimated using the OLS method (i.e. the previously developed formulas for calculating the parameter estimates)

• It is only necessary to rearrange the data in such a way that the value of Y at time period t coincides with the value of X at time period t-1

Suppose we want to estimate cotton acres planted in the US (Y) as a function of the price of cotton lint (X, cents/lb) in each of the last three years.

What is the interpretation of an estimated coefficient of 1.2 on the price lagged three years ($X_{t-3}$)?

It means that if the price of cotton lint three years ago (t-3) had been 1 cent per pound higher, the number of acres of cotton planted today (time t) would be 1.2 acres higher, holding all the other X’s constant.
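Rearranging the data so that Y at time t lines up with earlier values of X can be sketched as follows. The helper name `lagged_rows` and both series are hypothetical; the point is only the alignment, and the first observations are lost because their lags are not available.

```python
def lagged_rows(y, x, lags):
    """Pair y[t] with (x[t-1], ..., x[t-lags]); the first `lags` periods are dropped."""
    return [(y[t], tuple(x[t - j] for j in range(1, lags + 1)))
            for t in range(lags, len(y))]

# Hypothetical annual series, for illustration only.
price = [60, 65, 58, 70, 72, 68]     # cotton lint price, cents/lb
acres = [11, 12, 13, 12, 14, 15]     # planted acres

rows = lagged_rows(acres, price, lags=3)
for y_t, x_lags in rows:
    print(y_t, x_lags)               # Y_t with (X_{t-1}, X_{t-2}, X_{t-3})
```

Each printed row is one observation for an OLS regression of Y on the three lagged prices; six years of data leave only three usable observations.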

• The first difference of a variable is its change in value from one time period to the next

• First difference on Y: $\Delta Y_t = Y_t - Y_{t-1}$

• First difference on X: $\Delta X_t = X_t - X_{t-1}$

• A first difference is used when you believe that it is not the level of the variable in the previous year that affects Yt, but the change between the previous year and the current year.

Suppose you wanted to estimate the function where investment is a function of the change in GNP (i.e. first difference).

• In economics, the demand for durable goods could be more directly affected by the change in interest rates than by the interest rate level (a first difference in the independent variable)

• In forestry, deforestation (i.e. the change in the forest cover from one year to the next) could be more directly related to the price of wood than total forest cover (a first difference in the dependent variable)
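A first difference is simple to compute before running the regression. In this sketch the GNP and investment figures are invented, and `first_difference` is a hypothetical helper name.

```python
def first_difference(series):
    """Change from one period to the next: series[t] - series[t-1]."""
    return [curr - prev for prev, curr in zip(series, series[1:])]

# Hypothetical annual data, for illustration only.
gnp = [100, 104, 109, 108, 115]
investment = [20, 21, 23, 22, 25]

d_gnp = first_difference(gnp)
# Investment in year t is paired with the change in GNP from t-1 to t,
# so the first year of investment has no matching difference and is dropped.
pairs = list(zip(investment[1:], d_gnp))
print(d_gnp)
print(pairs)
```

Differencing always costs one observation, because the change for the first period cannot be computed.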

### AAEC 4302 ADVANCED STATISTICAL METHODS IN AGRICULTURAL RESEARCH

Chapters 6.4-6.5, 7.4

Variables & Model Specifications

• The reciprocal model specification is:

$Y_i = \beta_0 + \beta_1 (1/X_i) + u_i$

• The relationship between Y and the transformed independent variable (1/X) is linear

• Example: the relation between inflation (INFL) and unemployment (UNEMPL) is specified as reciprocal, using 15 annual observations (1956-1970):

• UINVi = 1/UNEMPLi

• INFLi = B0 + B1*UINVi + ei

• The estimated regression is:

INFLi = -1.984 + 22.234*UINVi

R2 = 0.549 SER = 0.956

• B0 = -1.984

• As UNEMPL increases, INFL decreases and approaches the lower limit of -1.984 percent

• The quantitative implications are understood by comparing the predicted values of INFL for different rates of unemployment

• If UNEMPL = 3%, INFL = -1.984 + 22.234*(1/3) = 5.43%

• If UNEMPL = 4%, INFL = -1.984 + 22.234*(1/4) = 3.57%
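The predictions from the reciprocal model can be reproduced directly from the two estimates quoted above. The function names are hypothetical; the slope expression is just the derivative of B0 + B1/UNEMPL with respect to UNEMPL.

```python
B0, B1 = -1.984, 22.234   # estimates quoted in the example

def predicted_inflation(unemployment_rate):
    """Predicted INFL (percent) for a given UNEMPL (percent)."""
    return B0 + B1 * (1.0 / unemployment_rate)

def slope(unemployment_rate):
    """d(INFL)/d(UNEMPL) = -B1 / UNEMPL**2; it shrinks toward zero as UNEMPL grows."""
    return -B1 / unemployment_rate ** 2

for rate in (3, 4, 8, 100):
    print(rate, round(predicted_inflation(rate), 2), round(slope(rate), 3))
```

The loop shows the two features of the specification: the effect of one more point of unemployment depends on the level of unemployment, and the prediction approaches B0 = -1.984 as unemployment becomes very large.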

• A special type of non-linear relation becomes linear when it is transformed with logarithms

• Specifically, consider

• We take natural logs of both sides of this equation:

• This is also known as the Log-Log or Double-Log specification, because it becomes a linear relation when taking the natural logarithm of both sides

• Also note that in a Log-Log specification all Y and X values must be positive, since the natural logarithm of a non-positive number is not defined

• An important feature is that $\beta_j$ directly measures the elasticity of Y with respect to Xj; i.e. the percentage change in Y when Xj changes by one percent

• Example: a model of the aggregate demand for money (M) in the US

• ln Mi = B0 + B1 ln GNPi + ui

• Estimated regression:

ln Mi = 3.948 + 0.215 ln GNPi

R2 = 0.78 SER = 0.0305

• B1 = 0.215 (0 < B1 < 1): the elasticity of M with respect to GNP is 0.215

• A 5% increase in GNP leads to a 0.215*5 = 1.075% increase in predicted M

• Predicted demand for money when GNP = 1000: ln 1000 = 6.908

ln M = 3.948 + 0.215*6.908 = 5.433

Antilog of 5.433 ≈ 228.8 billion \$
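The money-demand prediction and the elasticity reading can be verified with the two estimates quoted above. The function name `predicted_money` is hypothetical; the units (billions of dollars) follow the slide.

```python
import math

B0, B1 = 3.948, 0.215   # estimates quoted in the example

def predicted_money(gnp):
    """Predicted M: take the antilog (exp) of the fitted value of ln M."""
    return math.exp(B0 + B1 * math.log(gnp))

m_1000 = predicted_money(1000)
m_1050 = predicted_money(1050)                 # GNP 5% higher
pct_change = 100 * (m_1050 / m_1000 - 1)

print(round(m_1000, 1))        # close to 228.8; small differences come from rounding ln M
print(round(pct_change, 3))    # close to 0.215 * 5 = 1.075
```

The exact percentage change is slightly below 1.075 because the elasticity interpretation is an approximation that is most accurate for small percentage changes.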

• A polynomial model specification (with respect to $X_j$ only) is:

An advantage of the polynomial model specification is that it can combine situations in which some of the independent variables are non-linearly related to Y while others are linearly related to Y

• A polynomial model can be estimated by OLS, treating $X_j^2$ as any other independent variable in the multiple regression

• In the example before j = 1, i.e. a polynomial specification with respect to $X_1$ is desired: both $X_1$ and $X_1^2$ would be included as independent variables in the data set given to the Excel program for OLS (linear regression) estimation

Multiple regression example:

Cross-sectional data set with 100 observations

Estimated EARNS function (earnings in thousands of \$):

EARNSi = -9.791 + 0.995 EDi + 0.471 EXPi - 0.00751 EXPSQi

R2 = 0.329 SER = 4.267

B1 = 0.995: holding the level of experience constant, one additional year of education increases earnings by \$995

For a given level of education:

EARNSi = constant + 0.471 EXPi - 0.00751 EXPSQi

where the “constant” depends on the particular value chosen for ED

• Slope = 0.471 + (2)(-0.00751)EXP

• If EXP = 5 years, then

slope = 0.471 + (2)(-0.00751)(5) = 0.396 thousand \$

A man with 5 years of experience will have his earnings increased by about \$396 after gaining one additional year of experience
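The slope of the quadratic earnings function can be evaluated at any level of experience. The coefficients are the ones quoted in the example; the function names are hypothetical, the education level of 12 years is an arbitrary illustration, and the turning point is a consequence of the formula rather than something stated in the slides.

```python
B_CONST, B_ED, B_EXP, B_EXPSQ = -9.791, 0.995, 0.471, -0.00751

def earnings(ed, exp_years):
    """Predicted earnings (thousand $) from the estimated equation."""
    return B_CONST + B_ED * ed + B_EXP * exp_years + B_EXPSQ * exp_years ** 2

def experience_slope(exp_years):
    """Derivative of earnings with respect to experience: 0.471 + 2(-0.00751)EXP."""
    return B_EXP + 2 * B_EXPSQ * exp_years

slope_at_5 = experience_slope(5)
exact_change = earnings(12, 6) - earnings(12, 5)   # ED = 12 is an arbitrary choice
turning_point = -B_EXP / (2 * B_EXPSQ)             # experience where the slope is zero

print(round(slope_at_5, 3))       # 0.396
print(round(exact_change, 3))     # exact one-year change, slightly below the slope
print(round(turning_point, 1))
```

The derivative gives an approximation to the effect of one more year; the exact change from 5 to 6 years is a little smaller because the slope keeps falling as experience rises.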

• Yi = β0 + β1 ln Xi + ui

• ln Yi = β0 + β1 Xi + ui

• ln Earngi = 0.673 + 0.107 Edui

• One additional year of schooling increases earnings by a proportion of about 0.107, i.e. roughly 10.7%

### AAEC 4302 ADVANCED STATISTICAL METHODS IN AGRICULTURAL RESEARCH

Chapter 7.3 Dummy Variables

• In many models, one or more of the independent variables is qualitative or categorical in nature

• Independent variables of this type have to be modeled through dummy variables

• A set of dummy variables is created for each categorical independent variable X in the model, where the number of dummy variables in the set equals the number of categories in which that independent variable is classified

• In our biological example, Yi is the skull length (mm) of the ith mouse, and the independent variables are:

• X1i, sex: male or female (two categories),

• X2i, species (three categories), and

• X3i, age.

• Two dummy variables will be created for X1 (D11 and D12) and three for X2 (D21, D22, and D23)

• In the ith observation (mouse):

• D11i = 1 if sex is male, 0 otherwise;

• D12i = 1 if sex is female, 0 otherwise;

• D21i = 1 if species 1, 0 otherwise;

• D22i = 1 if species 2, 0 otherwise;

• and D23i = 1 if species 3, 0 otherwise.

• The estimated model would be:

• Notice that the dummy variables corresponding to the last categories of X1 and X2 (D12 and D23) have been excluded from the estimated model (any one dummy/category can be excluded, it makes no difference)

• If you don’t exclude a dummy variable from each group, the group will contain redundant information: its dummies always add up to one, which duplicates the intercept.

• Notice that this model actually estimates a different intercept for each observed sex/species combination, while maintaining the same slope parameters for each of the other independent variables in the model (only one, age or X3, in our example)

Model to estimate:

Estimated Model:

• For a male mouse of the first species: D11 = 1, D21 = 1 and D22 = 0, where

D11: 1 if sex = Male, 0 otherwise

D21: 1 if species = 1, 0 otherwise

D22: 1 if species = 2, 0 otherwise

• The coefficient on D11 measures the difference in skull length (for any age) between males and females of any species

• Its estimate, 3.05, means that regardless of age and species, a male mouse will have a skull length 3.05 mm larger than a female mouse

• The coefficient on D21 measures the difference in skull length (for male or female mice of any age) between species one and three

• Its estimate, -4.9, means that a mouse of species 1 will have a skull length 4.9 mm smaller than a mouse of species 3, regardless of sex and age.

• The coefficient on D22 measures the difference in skull length (for male or female mice of any age) between species two and three

• Its estimate, -0.22, means that a mouse of species 2 will have a skull length 0.22 mm smaller than a mouse of species 3, regardless of sex and age.

• The difference between the coefficients on D21 and D22 measures the difference in skull length (for male or female mice of any age) between species one and two

• (-4.9) – (-0.22) = -4.68 means the skull length for species 1 is 4.68 mm shorter than for species 2, regardless of age and sex.

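• A model like the former assumes that sex or species shifts the skull length regression function at the origin (i.e. shifts its intercept), in a parallel fashion, for example:

[Figure: skull length Y (mm) plotted against age, with parallel regression lines for a male and a female of species 3; the male line lies 3.05 mm above the female line.]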
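The dummy-variable coding and the parallel shifts can be sketched as follows. The three dummy coefficients (3.05, -4.9, -0.22) are the ones quoted in the text; the intercept and the age slope are not given in the transcript, so the values used here are placeholders, and the function names are hypothetical.

```python
def dummy_row(sex, species, age):
    """Regressor row [1, D11, D21, D22, age]; female and species 3 are the omitted categories."""
    return [1, int(sex == "male"), int(species == 1), int(species == 2), age]

# Dummy coefficients from the text; intercept and age slope are PLACEHOLDERS.
COEFS = [20.0, 3.05, -4.9, -0.22, 0.5]

def predicted_skull_length(sex, species, age):
    return sum(c * v for c, v in zip(COEFS, dummy_row(sex, species, age)))

age = 10
male_vs_female = predicted_skull_length("male", 3, age) - predicted_skull_length("female", 3, age)
sp1_vs_sp2 = predicted_skull_length("female", 1, age) - predicted_skull_length("female", 2, age)

print(dummy_row("male", 1, 4))       # [1, 1, 1, 0, 4]
print(round(male_vs_female, 2))      # 3.05 at any age or species
print(round(sp1_vs_sp2, 2))          # -4.68 at any age or sex
```

The differences do not depend on the placeholder intercept or age slope, which is exactly the parallel-shift assumption of this specification.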

### Chapter 10

The Normal and t Distributions

• A random variable Z (-∞ < Z < ∞) is said to have a standard normal distribution if its probability distribution is of the form:

$p(Z) = \frac{1}{\sqrt{2\pi}} e^{-Z^2/2}$

The area under p(Z) is equal to 1

Z has μ = 0 and σ = 1 (page 210)

• Pr(Z ≥ 1.5): see Figure 10.1 (a), page 210

• Table A.1 gives Pr(Z ≥ Z*) for positive values of Z*

• Find α such that Pr(Z ≥ Z*) = α, where α is the probability

If Z* = 1.5, then from the table α = 0.067

By symmetry, Pr(Z ≤ -Z*) = Pr(Z ≥ Z*), so

Pr(Z ≤ -1.5) = 0.067

• To determine the probability in two symmetrical tails of the distribution:

|Z| ≥ Z* means Z ≤ -Z* or Z ≥ Z*

• Pr(|Z| ≥ Z*) = Pr(Z ≤ -Z*) + Pr(Z ≥ Z*) = 2Pr(Z ≥ Z*), the area in Fig. 10.1b

The probability of not being in either tail is the unshaded area, or:

Pr(|Z| ≤ Z*) = 1 - Pr(|Z| ≥ Z*)

If Pr(Z ≥ 1.5) = 0.067, then Pr(|Z| ≥ 1.5) = 0.134 and Pr(|Z| ≤ 1.5) = 0.866
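The table look-ups for the standard normal distribution can be reproduced with the complementary error function from Python's standard library. This is only a cross-check of the tabulated values; `upper_tail` is a hypothetical helper name.

```python
import math

def upper_tail(z_star):
    """Pr(Z >= z_star) for a standard normal Z, via the complementary error function."""
    return 0.5 * math.erfc(z_star / math.sqrt(2))

one_tail = upper_tail(1.5)     # Pr(Z >= 1.5)
two_tails = 2 * one_tail       # Pr(|Z| >= 1.5), by symmetry
inside = 1 - two_tails         # Pr(|Z| <= 1.5)

print(round(one_tail, 3), round(two_tails, 3), round(inside, 3))
```

The three printed values match the 0.067, 0.134 and 0.866 obtained from Table A.1.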

• A random variable X (-∞ < X < ∞) is said to have a normal distribution if its probability distribution is of the form:

$p(X) = \frac{1}{b\sqrt{2\pi}} e^{-(X-a)^2/(2b^2)}$

where b > 0 and a can be any value, with μ = a and σ = b

• The standard normal is the member of this family with a = 0 and b = 1, so that μ = 0 and σ = 1

• Figure 10.4 shows different normal distributions, page 214

• All members of the normal distribution family can be viewed as being linear transformations of each other

• Figure 10.5, page 215

• Any normal distribution can be thought of as a linear transformation of the standard normal distribution

• α = Pr(X ≥ Xk) = Pr(Z ≥ Zk), where $Z_k = (X_k - \mu)/\sigma$

• Example: X has a normal distribution with μ = 5 and σ = 2. What is Pr(X ≥ 6)?

$Z_k = (6 - 5)/2 = 0.5$

From Table A.1 we find Pr(Z ≥ 0.5) = 0.309, so Pr(X ≥ 6) = 0.309
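The standardization step can be checked with `statistics.NormalDist` from the Python standard library (available from Python 3.8). The sketch computes the probability twice, once directly for X and once after converting to Z, to show that the two routes agree.

```python
from statistics import NormalDist

mu, sigma, x_k = 5, 2, 6

# Route 1: work with X directly.
p_direct = 1 - NormalDist(mu=mu, sigma=sigma).cdf(x_k)

# Route 2: standardize first, then use the standard normal distribution.
z_k = (x_k - mu) / sigma
p_standardized = 1 - NormalDist().cdf(z_k)

print(z_k, round(p_direct, 3), round(p_standardized, 3))
```

Both routes give 0.309, the value read from Table A.1 for Z = 0.5.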

The t Distribution

• The equation of the probability density function p(t) is quite complex:

p(t) = f (t; df), -∞< t <∞

• t has $\mu = 0$ and $\sigma = \sqrt{df/(df-2)}$ when df > 2

• Probability problems:

Find α such that Pr(t ≥ t*) =α

Table A.2 can be used to find probability

df = 5: Pr(t ≥ 1.5) = 0.097 and Pr(t ≥ 2.5) = 0.027

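• Suppose we have d independent random variables $z_1, z_2, z_3, \ldots, z_d$, each having a standard normal distribution.

• We can define a new random variable, which has a chi-square distribution with d degrees of freedom:

$\chi^2 = \sum_{i=1}^{d} z_i^2$, df = d

Figure 10.6 page 222

$\chi^2$ has $\mu = d$ and $\sigma = \sqrt{2d}$

Find $(\chi^2)_c$ such that $\Pr(\chi^2 \geq (\chi^2)_c) = \alpha$

From Table A.4, if df = 10 and α = 0.10, then $(\chi^2)_c = 15.99$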
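The definition of the chi-square variable as a sum of squared standard normal variables can be illustrated by simulation. This is a rough Monte Carlo sketch, not an exact calculation: the seed, the number of draws and the helper name `chi_square_draw` are arbitrary choices, and the results are only approximate.

```python
import random

def chi_square_draw(d, rng):
    """One draw of a chi-square variable: the sum of d squared standard normal draws."""
    return sum(rng.gauss(0, 1) ** 2 for _ in range(d))

rng = random.Random(0)                        # fixed seed for reproducibility
draws = [chi_square_draw(10, rng) for _ in range(20000)]

mean = sum(draws) / len(draws)
share_above = sum(v >= 15.99 for v in draws) / len(draws)

print(round(mean, 2))          # should be near d = 10
print(round(share_above, 3))   # should be near alpha = 0.10
```

With df = 10 the simulated mean is close to 10 and roughly 10 percent of the draws exceed the critical value 15.99 from Table A.4.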

• Suppose we have two independent random variables $\chi^2_n$ and $\chi^2_d$ having chi-square distributions with n and d degrees of freedom

• A new random variable F can be defined as:

$F = \frac{\chi^2_n / n}{\chi^2_d / d}$

• This random variable has an F distribution with n and d degrees of freedom

• 0 ≤ F < ∞

• Find Fc such that Pr(Fn,d ≥ Fc) =α

• Table A.5 gives the Fc values for n and d when α = 0.05

• Table A.6 gives the Fc values for n and d when α = 0.01

• For F distribution with 5 and 10 df

Fc = 3.33 for α = 0.05

### AAEC 4302 ADVANCED STATISTICAL METHODS IN AGRICULTURAL RESEARCH

Chapter 11:

Sampling Theory in Regression Analysis

• The basic model of simple linear regression states: for a given set of values X, the corresponding values of Y are determined by:

$Y_i = \beta_0 + \beta_1 X_i + u_i$

• There are two parts in the determination of Y: the systematic portion and the random portion, the disturbance ui

• ui is a random variable with a normal probability distribution, E(ui) = 0 and σ(ui) = σu

• Since Yi and ui only differ by a constant, the former implies that the dependent variable also follows a normal probability distribution, with a mean that changes with Xi,

$E[Y_i] = \beta_0 + \beta_1 X_i$

and a standard deviation of σ(Yi) = σu

• Example: Yi = 7 + 12Xi + ui, where ui is normally distributed with E(ui) = 0 and σ(ui) = 5

• If Xi = 5, what can you say about Yi?

• Yi is normally distributed with:

• E[Yi] = 7 + 12(5) = 67

• σ(Yi) = σ(ui) = 5

[Figure: two normal curves. P(Yi): Yi ~ N[67, (5)2], centered at the mean B0 + B1Xi, E[Yi] = 67, with σ(Yi) = σ = 5 and tick marks at 62, 67 and 72. P(ui): ui ~ N[0, (5)2], centered at E[ui] = 0.]

• Consider applying the OLS estimators for simple regression to different sets of data that are generated by the same normal regression model

• Figure 11.3 page 235

• The fact that different values of the estimates of B0 and B1 occur in different samples drawn from the same economic process is called sampling variability

• Relative frequency histograms and frequency distributions

• In the simple linear regression model:

$\hat{\beta}_1 \sim N\left(\beta_1, \frac{\sigma_u^2}{\sum (X_i - \bar{X})^2}\right)$

$\hat{\beta}_0 \sim N\left(\beta_0, \frac{\sigma_u^2 \sum X_i^2}{n \sum (X_i - \bar{X})^2}\right)$

where ~ means “distributed”, N means normal, the first element in parentheses is the mean or expected value of the estimator and the second element is the formula for calculating the variance of the estimator.

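• The standard deviation of the estimator, $\sigma(\hat{\beta}_1) = \sigma_u / \sqrt{\sum (X_i - \bar{X})^2}$, is the standard error of $\hat{\beta}_1$.

• The expression $\sum (X_i - \bar{X})^2$, which appears in $\sigma(\hat{\beta}_1)$, is known as the total variation in X.

Example:

• σu = 5, β0 = 7 and β1 = 12

• Assume the total variation in X equals 9, so that $\sigma(\hat{\beta}_1) = 5/\sqrt{9} = 1.667$

What is the chance that $\hat{\beta}_1$ is between 11 and 13?

$\alpha = \Pr(11 \leq \hat{\beta}_1 \leq 13)$

$= 1 - 2\Pr(\hat{\beta}_1 \geq 13)$

$= 1 - 2\Pr(Z \geq Z_k)$, where $Z_k = (13 - 12)/1.667 = 0.6$

$= 1 - 2\Pr(Z \geq 0.6) = 1 - (2)(0.274) = 0.452$

Thus, the probability α is about 45 percent.

• For a set of data for which the total variation in X is equal to 25:

• The standard error for this case is $\sigma(\hat{\beta}_1) = 5/\sqrt{25} = 1$

• The probability for this case is $\alpha = \Pr(11 \leq \hat{\beta}_1 \leq 13) = 0.68$

• When the standard error is smaller, there is a greater probability that $\hat{\beta}_1$ will take on a value in a given interval centered around the true β1 value

• The smaller the standard error, the more precise $\hat{\beta}_1$ is as an estimator of β1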

• The greater is the total variation in X, the smaller will be the standard error
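The effect of the total variation in X on the precision of the slope estimator can be reproduced numerically. This sketch assumes the sampling distribution stated for the slope estimator (normal, centered at the true β1, with standard error σu divided by the square root of the total variation in X); the function name `prob_within` is hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def prob_within(sigma_u, total_variation_x, half_width):
    """Probability that the slope estimate falls within half_width of the true slope.

    The true slope itself drops out, because the interval is centered on it.
    """
    standard_error = sigma_u / sqrt(total_variation_x)
    z_k = half_width / standard_error
    return 1 - 2 * (1 - NormalDist().cdf(z_k))

p_var_9 = prob_within(sigma_u=5, total_variation_x=9, half_width=1)
p_var_25 = prob_within(sigma_u=5, total_variation_x=25, half_width=1)

print(round(p_var_9, 3))     # about 0.45 (0.452 with the rounded table value 0.274)
print(round(p_var_25, 3))    # about 0.68
```

Raising the total variation in X from 9 to 25 lowers the standard error from 1.667 to 1 and raises the probability of landing between 11 and 13 from about 45 percent to about 68 percent.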