
Lecture Ten

Contingency Tables

Regression

Properties

Assumptions

Violations

Diagnostics

Modeling

ANOVA

Count

Probability

Probability

• Part I: Regression

• properties of OLS estimators

• assumptions of OLS

• pathologies of OLS

• diagnostics for OLS

• Part II: Experimental Method

• Unbiased:

• Note: y(i) = a + b*x(i) + e(i)

• Summing over observations i and dividing by n: ybar = a + b*xbar + ebar

• Recall, the estimator for the slope is: b-hat = Sum[(x(i) - xbar)*(y(i) - ybar)] / Sum[(x(i) - xbar)^2]

• Substituting y(i) - ybar = b*[x(i) - xbar] + [e(i) - ebar] and taking expectations, the error terms drop out (E[e] = 0), so E[b-hat] = b: the estimator is unbiased.

• The dispersion in the estimate for the slope, Var(b-hat) = sigma^2 / Sum[(x(i) - xbar)^2], depends on the unexplained variance sigma^2, and inversely on the dispersion in x.

• The unexplained mean square from the regression is used as the estimate of the variance of e.
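The unbiasedness property can be checked by simulation; a minimal sketch (all parameter values illustrative, not from the lecture):

```python
import numpy as np

# With y(i) = a + b*x(i) + e(i) and E[e] = 0, the OLS slope estimate
# should average out to the true slope b across repeated samples.
rng = np.random.default_rng(0)
a, b = 2.0, 0.5                  # true intercept and slope (illustrative)
x = np.linspace(0, 10, 50)

slopes = []
for _ in range(2000):
    e = rng.normal(0.0, 1.0, size=x.size)    # errors with mean zero
    y = a + b * x + e
    # OLS slope: Sum[(x - xbar)(y - ybar)] / Sum[(x - xbar)^2]
    bhat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
    slopes.append(bhat)

print(np.mean(slopes))   # close to the true slope 0.5
```

The average of the 2000 slope estimates sits very near 0.5, even though any single estimate varies around it.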

Other Properties of Estimators

• Efficiency: makes optimum use of the sample information to obtain estimators with minimum dispersion

• Consistency: As the sample size increases the estimator approaches the population parameter

Outline: Regression

• The Assumptions of Least Squares

• The Pathologies of Least Squares

• Diagnostics for Least Squares

Assumptions

• Expected value of the error is zero: E[e] = 0.

• The error is independent of the explanatory variable: E{e*[x - E(x)]} = 0.

• The errors are independent of one another: E[e(i)e(j)] = 0 for i not equal to j.

• The variance is homoskedastic: E[e(i)^2] = E[e(j)^2] for all i, j.

• The error is normal with mean zero and variance sigma squared: e ~ N(0, sigma^2).

18.4 Error Variable: Required Conditions

• The error e is a critical part of the regression model.

• Four requirements involving the distribution of e must be satisfied.

• The probability distribution of e is normal.

• The mean of e is zero: E(e) = 0.

• The standard deviation of e is sigma_e for all values of x.

• The set of errors associated with different values of y are all independent.

[Figure: The normality of e. At each value x1, x2, x3, the conditional distribution of y is normal, centered on the regression line at m1 = b0 + b1*x1, m2 = b0 + b1*x2, m3 = b0 + b1*x3. The standard deviation remains constant, but the mean value changes with x. From the first three assumptions, y is normally distributed with mean E(y) = b0 + b1*x and a constant standard deviation sigma_e.]

Pathologies

• Cross-section data: the error variance is heteroskedastic. For example, it could vary with firm size. Consequence: not all of the available information is used efficiently, and better estimates of the standard errors of the regression parameters are possible.

• Time series data: the errors are serially correlated, i.e., autocorrelated. Consequence: inefficiency.

Lab 6: Autocorrelation?

Genr: error = resid

Genr: errorlag1 = resid(-1)

error(t) = a + b*error(t-1) + e(t)
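The same check, regressing the residual on its one-period lag, can be sketched outside EViews as well; the AR(1) coefficient and sample size below are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulate AR(1) errors, e(t) = rho*e(t-1) + u(t), so autocorrelation is present.
rho, n = 0.7, 500
u = rng.normal(0.0, 1.0, size=n)
e = np.zeros(n)
for t in range(1, n):
    e[t] = rho * e[t - 1] + u[t]

# Regress error(t) on errorlag1 = error(t-1): the slope estimates rho.
error, errorlag1 = e[1:], e[:-1]
bhat = np.sum((errorlag1 - errorlag1.mean()) * (error - error.mean())) / \
       np.sum((errorlag1 - errorlag1.mean()) ** 2)
print(bhat)   # near 0.7 here; it would be near 0 for independent errors
```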

Pathologies (Cont.)

• The explanatory variable is not independent of the error. Consequence: inconsistency, i.e., larger sample sizes do not lead to lower standard errors for the parameters, and the parameter estimates (slope, etc.) are biased.

• The error is not distributed normally. For example, there may be fat tails. Consequence: using the normal distribution may understate the true 95% confidence intervals.

Pathologies (Cont.)

• Multicollinearity: The independent variables may be highly correlated. As a consequence, they do not truly represent separate causal factors, but instead a common causal factor.

View/Open Selected/One Window/One Group

In Group Window: View/Correlations

View/Open Selected/One Window/One Group

In Group Window: View/Multiple Graphs/Scatter/Matrix of all pairs

18.9 Regression Diagnostics - I

• The three conditions required for the validity of the regression analysis are:

• the error variable is normally distributed.

• the error variance is constant for all values of x.

• The errors are independent of each other.

• How can we diagnose violations of these conditions?

Residual Analysis

• Examining the residuals (or standardized residuals) helps detect violations of the required conditions.

• Example 18.2 – continued:

• Nonnormality:

• Use Excel to obtain the standardized residual histogram.

• Examine the histogram and look for a bell-shaped diagram with a mean close to zero.

Diagnostics (Cont.)

• Multicollinearity may be suspected if the t-statistics for the coefficients of the explanatory variables are not significant but the coefficient of determination is high. The correlations between the explanatory variables can then be calculated to see whether they are high.
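A minimal numerical illustration of this check (simulated variables, not lecture data):

```python
import numpy as np

rng = np.random.default_rng(3)

# Two nearly collinear explanatory variables (simulated for illustration):
# x2 is almost a copy of x1, as when two regressors share a common cause.
x1 = rng.normal(size=200)
x2 = x1 + rng.normal(scale=0.05, size=200)

# A high pairwise correlation flags multicollinearity: the two variables
# cannot be distinguished as separate causal factors in a regression.
corr = np.corrcoef(x1, x2)[0, 1]
print(corr)   # close to 1
```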

Diagnostics

• Is the error normal? Using EViews, with the view menu in the regression window, a histogram of the distribution of the estimated error is available, along with the coefficients of skewness and kurtosis, and the Jarque-Bera statistic testing for normality.
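The Jarque-Bera statistic itself is easy to compute from the skewness and kurtosis of the residuals; a sketch using simulated residuals for illustration:

```python
import numpy as np

def jarque_bera(resid):
    """Jarque-Bera normality statistic: (n/6)*(S^2 + (K - 3)^2 / 4)."""
    r = np.asarray(resid, dtype=float)
    n = r.size
    m = r - r.mean()
    s2 = np.mean(m ** 2)
    skew = np.mean(m ** 3) / s2 ** 1.5       # coefficient of skewness
    kurt = np.mean(m ** 4) / s2 ** 2         # coefficient of kurtosis
    return n / 6.0 * (skew ** 2 + (kurt - 3.0) ** 2 / 4.0)

rng = np.random.default_rng(2)
jb_norm = jarque_bera(rng.normal(size=1000))       # normal errors
jb_fat = jarque_bera(rng.standard_t(3, size=1000)) # fat-tailed errors
print(jb_norm, jb_fat)   # small for the normal sample, large for fat tails
```

Under normality the statistic is approximately chi-squared with 2 degrees of freedom, so large values reject normality.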

Lab 6

Diagnostics (Cont.)

• To detect heteroskedasticity: if there are sufficient observations, plot the estimated errors against the fitted dependent variable.

[Figure: scatterplot of the estimated errors (residuals) against the fitted values y-hat.]

Heteroscedasticity

• When the requirement of a constant variance is violated we have a condition of heteroscedasticity.

• Diagnose heteroscedasticity by plotting the residual against the predicted y.
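A rough numerical version of this diagnostic, comparing residual variance across low and high fitted values (simulated data; an informal check standing in for eyeballing the residual plot, not a formal test):

```python
import numpy as np

rng = np.random.default_rng(4)

# Simulate heteroskedastic errors: the spread grows with x
# (as in the firm-size example from the Pathologies slide).
n = 400
x = np.sort(rng.uniform(1, 10, size=n))
e = rng.normal(0.0, 0.3 * x)          # error std. dev. proportional to x
y = 1.0 + 2.0 * x + e

# Fit OLS, then compare residual variance in the low and high halves
# of the fitted values.
bhat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
ahat = y.mean() - bhat * x.mean()
resid = y - (ahat + bhat * x)
low, high = resid[: n // 2], resid[n // 2:]
print(low.var(), high.var())   # the high half shows much larger variance
```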

[Figure: residuals plotted against the predicted values y-hat.]

Homoscedasticity

• When the requirement of a constant variance is not violated we have a condition of homoscedasticity.

• Example 18.2 - continued

Diagnostics (Cont.)

• Autocorrelation: The Durbin-Watson statistic is a scalar index of autocorrelation, with values near 2 indicating no autocorrelation and values near zero indicating autocorrelation. Examine the plot of the residuals in the view menu of the regression window in EViews.
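The Durbin-Watson statistic can be computed directly from the residuals; a sketch with simulated errors (parameter values illustrative):

```python
import numpy as np

def durbin_watson(resid):
    """DW = sum of squared successive differences over the sum of squares."""
    r = np.asarray(resid, dtype=float)
    return np.sum(np.diff(r) ** 2) / np.sum(r ** 2)

rng = np.random.default_rng(5)
white = rng.normal(size=500)          # independent errors
ar = np.zeros(500)                    # positively autocorrelated errors
for t in range(1, 500):
    ar[t] = 0.8 * ar[t - 1] + rng.normal()

dw_white = durbin_watson(white)
dw_ar = durbin_watson(ar)
print(dw_white, dw_ar)   # near 2 for independence, well below 2 for autocorrelation
```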

Non-Independence of Error Variables

• Data collected over time constitute a time series.

• If the errors are independent, examining the residuals over time should reveal no pattern.

• When a pattern is detected, the errors are said to be autocorrelated.

• Autocorrelation can be detected by graphing the residuals against time.

Non-Independence of Error Variables

Patterns in the appearance of the residuals over time indicate that autocorrelation exists.

[Figure: two plots of residuals against time. Left: note the runs of positive residuals, replaced by runs of negative residuals. Right: note the oscillating behavior of the residuals around zero.]

Fix-Ups

• The error is not distributed normally. For example, in a regression of personal income on explanatory variables, a transformation such as regressing the natural logarithm of income on the explanatory variables may make the error closer to normal.

Fix-Ups (Cont.)

• If the explanatory variable is not independent of the error, look for a substitute that is highly correlated with the dependent variable but is independent of the error. Such a variable is called an instrument.
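A minimal illustration of why an instrument helps (simulated data; the simple instrumental-variables slope cov(z, y)/cov(z, x) is used here rather than full two-stage least squares):

```python
import numpy as np

rng = np.random.default_rng(6)
n = 20000

# Simulate an endogenous regressor: x is correlated with the error e,
# while the instrument z moves x but is independent of e.
z = rng.normal(size=n)
e = rng.normal(size=n)
x = 0.8 * z + 0.6 * e + rng.normal(size=n)   # x depends on the error: OLS is biased
y = 1.0 + 2.0 * x + e                        # true slope is 2

# OLS slope (biased by cov(x, e) / var(x)):
b_ols = np.cov(x, y)[0, 1] / np.var(x, ddof=1)
# IV slope using the instrument z:
b_iv = np.cov(z, y)[0, 1] / np.cov(z, x)[0, 1]
print(b_ols, b_iv)   # b_ols overstates the slope; b_iv is near 2
```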

Data Errors May Lead to Outliers

• Typos may lead to outliers, and looking for outliers is a good way to check for serious typos.

Outliers

• An outlier is an observation that is unusually small or large.

• Several possibilities need to be investigated when an outlier is observed:

• There was an error in recording the value.

• The point does not belong in the sample.

• The observation is valid.

• Identify outliers from the scatter diagram.

• It is customary to suspect that an observation is an outlier if its |standardized residual| > 2.
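That rule of thumb is easy to apply; a sketch with one deliberately corrupted observation (all values illustrative):

```python
import numpy as np

rng = np.random.default_rng(7)

# Simulated sample with one corrupted value (a recording "typo").
x = np.arange(30, dtype=float)
y = 5.0 + 1.5 * x + rng.normal(scale=2.0, size=30)
y[12] = 200.0   # recording error

# Fit OLS and flag points with |standardized residual| > 2.
bhat = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
resid = y - (y.mean() + bhat * (x - x.mean()))
standardized = resid / resid.std(ddof=2)
outliers = np.flatnonzero(np.abs(standardized) > 2)
print(outliers)   # index 12 is flagged
```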

[Figure: scatterplots contrasting an influential observation with an outlier. Some outliers may be very influential: the outlier causes a shift in the regression line.]

Procedure for Regression Diagnostics

• Develop a model that has a theoretical basis.

• Gather data for the two variables in the model.

• Draw the scatter diagram to determine whether a linear model appears to be appropriate.

• Determine the regression equation.

• Check the required conditions for the errors.

• Check for the existence of outliers and influential observations.

• Assess the model fit.

• If the model fits the data, use the regression equation.

Outline

• Critique of Regression

Critique of Regression

• Samples of opportunity rather than random samples

• Uncontrolled Causal Variables

• omitted variables

• unmeasured variables

• Insufficient theory to properly specify regression equation

Experimental Method: Examples

• Deterrence

• Aspirin

• Miles per Gallon

Isaac Ehrlich Study of the Death Penalty: 1933-1969

• Homicide Rate Per Capita

• Control Variables

• probability of arrest

• probability of conviction given charged

• Probability of execution given conviction

• Causal Variables

• labor force participation rate

• unemployment rate

• percent population aged 14-24 years

• permanent income

• trend

Long Swings in the Homicide Rate in the US: 1900-1980

Source: Report to the Nation on Crime and Justice

Source: Isaac Ehrlich, “The Deterrent Effect of Capital Punishment”

• Time period used: 1933-1968

• period of declining probability of execution

• Ehrlich did not include probability of imprisonment given conviction as a control variable

• Causal variables included are unconvincing as causes of homicide

http://www.ojp.usdoj.gov/bjs/

Experimental Method

• Police intervention in family violence


• A 911 call from a family member

• the case is randomly assigned for “treatment”

• A police patrol responds and visits the household

• police calm down the family members

• based on the treatment randomly assigned, the police carry out the sanctions

• To control for unknown causal factors

• assign known numbers of cases, for example equal numbers, to each treatment

• with this procedure, there should be an even distribution of difficult cases in each treatment group

911 Call

(Characteristics of household participants unknown)

Random assignment: code blue or code gold.

Code blue: a patrol responds, settles the household, and verbally warns the husband.

Code gold: a patrol responds, settles the household, and takes the husband to jail for the night.

• Doctors Volunteer

• Randomly assigned to two groups

• treatment group takes an aspirin a day

• the control group takes a placebo (sugar pill) per day

• After 5 years, the 11,037 experimentals have 139 heart attacks (fatal and non-fatal): pE = 139/11,037 = 0.0126.

• After 5 years, the 11,034 controls have 239 heart attacks: pC = 239/11,034 = 0.0217.

• Hypotheses: H0: pC = pE, i.e., pC - pE = 0; Ha: pC - pE not equal to 0.

• Statistic: Z = [(pC-hat - pE-hat) - (pC - pE)] / SE(pC-hat - pE-hat)

• Recall, from the variance for a proportion: SE(pC-hat - pE-hat) = {[pC-hat*(1 - pC-hat)]/nC + [pE-hat*(1 - pE-hat)]/nE}^(1/2)

• = {[0.0217*(1 - 0.0217)/11,034] + [0.0126*(1 - 0.0126)/11,037]}^(1/2)

• = 0.00175, so z = (0.0217 - 0.0126)/0.00175

• z = 5.2
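The aspirin calculation can be reproduced directly from the counts on the slide:

```python
import math

# Aspirin study counts from the slide: two-sample test of proportions.
n_e, k_e = 11037, 139    # treatment (aspirin): heart attacks
n_c, k_c = 11034, 239    # control (placebo): heart attacks

p_e, p_c = k_e / n_e, k_c / n_c
se = math.sqrt(p_c * (1 - p_c) / n_c + p_e * (1 - p_e) / n_e)
z = (p_c - p_e) / se
print(round(z, 1))   # about 5.2: the aspirin group has significantly fewer attacks
```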

Pseudo Experimental Method

Observations assigned to two groups, 12 each:

• The “treatment” group is low temperature: 5 failures.

• The “control” group is high temperature: 2 failures.

• The “experimentals” have 5 failures (yesses): pL = 5/12.

• The controls have 2 failures: pH = 2/12.

Challenger

• Divide the data into two groups:

• 12 low-temperature launches, 53-70 degrees

• 12 high-temperature launches, 70-81 degrees

H0: pL = pH, i.e., pL - pH = 0

HA: pL > pH, i.e., pL - pH > 0

Z = [(5/12 - 2/12) - 0] / {[(5/12)(7/12)/12] + [(2/12)(10/12)/12]}^(1/2)

Z = 0.25/0.178 = 1.40

H0: p(low temp) = p(high temp)

Binomial Prob(k≥5) in 12 Trials, Given p = 2/12
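That binomial tail probability can be computed directly:

```python
from math import comb

# With 12 trials and p = 2/12 (the high-temperature failure rate),
# how likely are 5 or more failures?
n, p = 12, 2 / 12
prob = sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(5, n + 1))
print(prob)   # roughly 0.036: 5 low-temperature failures would be unusual
```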

Experimental Method

• Experimental Design: Paired Comparisons

• comparing mileage for two different brands of gasoline

• control for variation in car and driver by having each cab use both gasolines. Each cab is called a block in the experimental design

• control for weather, traffic, and other factors by assigning different days and times to each cab.

• H0: mean difference = 0; Ha: mean difference not zero.

• t-stat = (sample mean difference - 0)/(sample std. dev. of differences / n^(1/2))

• t-stat = -0.60/(0.61/10^(1/2)) = -0.60/0.193 = -3.11
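The paired-comparison t-statistic follows directly from the summary numbers on the slide:

```python
import math

# Summary statistics from the slide: mean mileage difference -0.60,
# sample standard deviation of the differences 0.61, n = 10 cabs (blocks).
d_bar, s_d, n = -0.60, 0.61, 10

t_stat = (d_bar - 0.0) / (s_d / math.sqrt(n))
print(round(t_stat, 2))   # about -3.11, significant at the 5% level with df = 9
```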

Lab 6 Exercises

Midterm 2000

• (15 points) The following table shows the results of regressing the natural logarithm of California General Fund expenditures, in billions of nominal dollars, against year, beginning in 1968 and ending in 2000. A plot of actual, estimated, and residual values follows.

• How much of the variance in the dependent variable is explained by trend?

• What is the meaning of the F statistic in the table? Is it significant?

• Interpret the estimated slope.

• If General Fund expenditures were $68.819 billion in California for fiscal year 2000-2001, provide a point estimate for state expenditures for 2001-2002.

Midterm 2000 (Cont.)

• A state senator believes that state expenditures in nominal dollars have grown over time at 7% a year. Is the senator in the ballpark, or is his impression significantly below the estimated rate, using a 5% level of significance?

• If you were an aide to the Senator, how might you criticize this regression?
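The mechanics of the senator's question look like this; the slope estimate and its standard error below are invented placeholders, since the midterm's regression table is not reproduced here:

```python
import math

# Hypothetical illustration only: assume an estimated trend slope
# b_hat = 0.075 with standard error 0.002 (both invented numbers).
b_hat, se_b = 0.075, 0.002

# In the log-linear regression ln(y) = a + b*year, the slope is the
# continuous annual growth rate, so the 2001-02 point estimate is:
forecast = 68.819 * math.exp(b_hat)

# Test H0: b = 0.07 (the senator's 7%) against the estimate:
t_stat = (b_hat - 0.07) / se_b
print(forecast, t_stat)
```

With these invented numbers the t-statistic of 2.5 would exceed the 5% critical value, so the senator's 7% figure would be rejected; the actual answer depends on the slope and standard error in the midterm's table.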