- By
**read** - Follow User

- 130 Views
- Uploaded on

Download Presentation
## PowerPoint Slideshow about ' Dealing with data' - read

**An Image/Link below is provided (as is) to download presentation**
Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.

Presentation Transcript

Dealingwith data

- All variables ok? / gettingacquainted
- Base model
- Final model(s)
- Assumptionchecking on final model(s)
- Conclusion(s) / Inference

Bettermodels

Better variables (interaction, transformations)

Assumptionchecking

Outliersandinfluential cases

Creatingsubsets of the data

ASSUMPTION CHECKING AND OTHER NUISANCES

- In regression analysis with Stata
- In logistic regression analysis with Stata
NOTE: THIS WILL BE EASIER IN Stata THAN IT WAS IN SPSS

Assumptions in regression analysis

- No multi-collinearity
- All relevant predictor variables
- included
- Homoscedasticity: all residuals are
- from a distribution with the same variance
- Linearity: the “true” model should be
- linear.
- Independent errors: having information
- about the value of a residual should not
- give you information about the value of
- other residuals
- Errors are distributed normally

NOTHING NEW IN STATA

(NOTE: SLIDE TAKEN LITERALLY FROM MMBR)

Independent errors: havinginformationabout the value of a residualshouldnotgiveyouinformationabout the value of otherresiduals

Detect: askyourselfwhetherit is likelythatknowledgeaboutoneresidualwouldtellyousomethingabout the value of anotherresidual.

Typical cases:

-repeatedmeasures

-clusteredobservations

(peoplewithinfirms /

pupilswithin schools)

Consequences: as forheteroscedasticity

Usually, yourconfidenceintervals are estimatedtoosmall (thinkaboutwhythat is!).

Cure: usemulti-level analyses

part 2 of this course

The rest, in Stata:

Example:

the Stata “auto.dta” data set

sysuse auto

corr (correlation)

vif (variance inflation factors)

ovtest (omitted variable test)

hettest (heterogeneity test)

predict e, resid

swilk (test for normality)

Finding the commands

- “help regress”
- “regress postestimation”
and you will find most of them (and more) there

Multi-collinearity

A strongcorrelationbetweentwoor more of your predictor variables

Youdon’t want it, because:

- It is more difficult to gethigher R’s
- The importance of predictorscanbedifficult to establish (b-hatstend to go to zero)
- The estimatesforb-hats are unstableunderslightly different regressionattempts (“bouncingbeta’s”)
Detect:

- Look at correlation matrix of predictor variables
- calculateVIF-factorswhile running regression
Cure:

Delete variables sothatmulti-collinearitydisappears, forinstancebycombiningtheminto a single variable

Stata: calculating the correlation matrix (“corr” or “pwcorr”) and VIF statistics (“vif”)

Misspecification tests(replaces: all relevant predictor variables included [Ramsey])

Also run “ovtest, rhs” here. Both tests should be non-significant.

Note that there are two ways to interpret

“all relevant predictor variables included”

Homoscedasticity: all residuals are from a distribution with the samevariance

This can be done

in Stata too

(check for yourself)

Consequences: Heteroscedasticiy does notnecessarilylead to biases in yourestimatedcoefficients (b-hat), butit does lead to biases in the estimate of the width of the confidence interval, and the estimation procedure itself is notefficient.

Testing for heteroscedasticity in Stata

- Your residuals should have the same variance for all values of Y hettest
- Your residuals should have the same variance for all values of X hettest, rhs

Errorsdistributednormally

Errorsshouldbedistributednormally

(justthe errors, not the variables themselves!)

Detect: look at the residual plots, test fornormality, or save residualsand test directly

Consequences: rule of thumb: ifn>600, noproblem. Otherwiseconfidenceintervals are wrong.

Cure: try to fit a bettermodel (or use more difficultways of modelinginstead- askan expert).

Errorsdistributednormally

First calculate the residuals (after regress):

predict e, resid

Then test for normality

swilke

logistic regression

with Stata

Note: based on

http://www.ats.ucla.edu/stat/stata/webbooks/logistic/chapter3/statalog3.htm

Assumptions in logistic regression

- Y is 0/1
- Independence of errors (as in multiple regression)
- No cases where you have complete separation
(Stata will try to remove these cases automatically)

- Linearity in the logit (comparable to “the true model should be linear” in multiple regression) – “specification error”
- No multi-collinearity (as in m.r.)

Think!

- What will happen if you try
logit y x1 x2 in this case?

This!

Because all cases with x==1 lead to y==1, the weight of x should be +infinity. Stata therefore rightly disregards these cases.

Do realize that, even though you do not see them in the regression, these are extremely important cases!

(checking for)multi-collinearity

- In regression, we had “vif”
- Here we need to download a command that a user-created: “collin” (try “finditcollin” in Stata)

(checking for)specification error

- The equivalent for “ovtest” is the command “linktest”

(checking for)specification error – part 2

Further things to do:

- Check for useful transformations of variables, and interaction effects
- Check for outliers / influential cases:
1) using a plot of stdres

(against n) and dbeta(against n)

2) using a plot of ldfbeta’s(against n)

3) using regress and diag

(but don’t tell anyone that I suggested this)

Checking for outliers / influential cases

… check the file auto_outliers.do for this …

Dealingwith data

- All variables ok? / gettingacquainted
- Base model
- Final model(s)
- Assumptionchecking on final model(s)
- Conclusion(s) / Inference

Bettermodels

Better variables (interaction, transformations)

Assumptionchecking

Outliersandinfluential cases

Creatingsubsets of the data

Example analyses on ideas.dta

For next week:improve the logisticregressionyou had

Annotated output: as ifyouwriteanexamassignment ...

- Create do-file withcommentsin it
- Run itandaddfurthercomments on the outcomes in the log file
- Submit do-file andlog-file
Useyourownassignment, and the skills youmasteredtoday.

Deadline: comingWednesday

Online also:the taxi tipping data

Download Presentation

Connecting to Server..