
Quantitative data analysis


Module in research methods course for tourism program

Reza Mortazavi

2014

Lecture 4

- Correlation
- When two variables are linearly related (or covary), we say they are correlated, either positively or negatively.

- Correlation is not causation!
- Open the file PLdata.dta
- sum incd1000 age
- scatter incd1000 age, yline(121.6) xline(22.9)

- Interpret the scatterplot.
- twoway scatter incd1000 age, by(female)
- Correlation coefficient
- Measures the strength of linear association between two variables. It always lies in the interval [-1, 1].
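For reference, the Pearson correlation coefficient that pwcorr reports can be written out as follows for n paired observations. The notation (x_i, y_i, and the sample means \bar{x} and \bar{y}) is introduced here for illustration and does not appear on the slides.

```latex
r_{xy} = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
              {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}},
\qquad -1 \le r_{xy} \le 1
```

Values near +1 or -1 mean the points lie close to a rising or falling straight line; values near 0 mean there is little linear association.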

- pwcorr incd1000 age, sig
- H0: no correlation.

- pwcorr incd1000 female,sig

- pwcorr incd1000 education,sig
- pwcorr incd1000 totexpday age, sig star(0.05)
Interpret the output!

- gen neg5incd=-5*incd1000
- What do we expect in terms of correlation between them?
- scatter neg5incd incd1000
- pwcorr neg5incd incd1000,sig
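A short derivation of what to expect in this exercise, using the standard rules Cov(ax, x) = a Var(x) and sd(ax) = |a| sd(x) for a constant a. The symbol a is introduced only for this sketch.

```latex
\operatorname{corr}(a x,\, x)
  = \frac{\operatorname{Cov}(a x,\, x)}{\operatorname{sd}(a x)\,\operatorname{sd}(x)}
  = \frac{a \operatorname{Var}(x)}{|a|\,\operatorname{sd}(x)^2}
  = \frac{a}{|a|},
\qquad a = -5 \;\Rightarrow\; \operatorname{corr}(-5x,\, x) = -1
```

Rescaling a variable changes its units but not the strength of the linear association; only the sign of the multiplier matters.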

- gen x=rnormal()
- gen y=rnormal()
- What do we expect? (two independent variables have been drawn randomly…)
- pwcorr x y,sig

- Zero correlation does not mean independence
- gen seq = int((_n-_N/2))
- gen seqsq=seq^2
- scatter seqsq seq
- pwcorr seqsq seq
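A minimal self-contained variant of this exercise, assuming an empty dataset and a hypothetical 101 observations so that seq runs symmetrically from -50 to 50. With a perfectly symmetric seq, the linear correlation should come out as essentially zero even though seqsq is an exact function of seq.

```stata
clear
set obs 101
gen seq = _n - 51        // hypothetical: integers -50, ..., 50, symmetric around 0
gen seqsq = seq^2        // seqsq is completely determined by seq
correlate seqsq seq      // Pearson correlation, stored in r(rho)
display "r = " %6.4f r(rho)
```

The scatterplot of seqsq against seq is a parabola: a strong relationship that the correlation coefficient does not pick up, because it only measures linear association.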

- Correlation does not imply causation.
- Statistical significance is not the same as practical significance.
- Use common sense when interpreting and drawing conclusions.

- Correlation is about linear association
- Use a scatterplot to discover possible nonlinear association.

- Normally distributed data are assumed.
- The correlation coefficient is sensitive to outliers (extreme values)
- Sometimes a transformation (e.g. logarithmic) of non-normally distributed data gives approximately normal data.
- Non-normal data may be converted into ordinal (ranked) data, and a non-parametric test, Spearman’s rank correlation, may be used.
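The points above can be sketched on simulated data. The variables x and y below are hypothetical and are not part of PLdata.dta; y is deliberately made right-skewed so that a few extreme values dominate the ordinary correlation.

```stata
clear
set seed 2014
set obs 200
gen double x = rnormal()
gen double y = exp(2*x + rnormal())   // hypothetical right-skewed variable with extreme values
pwcorr y x, sig                       // Pearson correlation on the raw, skewed data
gen double lny = ln(y)                // logarithmic transformation
pwcorr lny x, sig                     // Pearson correlation after the transformation
spearman y x                          // rank correlation, unaffected by taking logs
```

Because the logarithm preserves the ordering of the observations, spearman gives the same result for y and for lny, while the Pearson coefficient will typically differ noticeably between the two.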

- Note that the purpose is not to go into all the details of regression analysis. Even though a couple of slides contain algebraic expressions, the exposition is not intended to be technical.
- The purpose is, however, to cover the basics so that you can run your own regression analysis using software and present, interpret and discuss results.

1. Estimate a relationship among some variables, such as y = f(x). Here y is the dependent variable and x is the independent variable.
For example, food consumption or tourism demand depends on income.

2. Forecast or predict the value of one variable, y, based on the value of another variable, x.

- Y is called the dependent variable, response variable, explained variable, output variable or regressand.
- The X’s are called independent variables, predictor variables, explanatory variables, input variables or regressors.
- A model is an abstraction from reality. It is a simplified representation focusing on some features while ignoring details.

y = dollars spent each week on food items.

x = consumer’s weekly income.

The relationship between x and the expected value of y, given x, might be linear: $E(y|x) = \beta_1 + \beta_2 x$

[Figure: Probability distributions $f(y|x)$ of food expenditures y given income x = $480 and x = $800, with conditional means $\mu_{y|x=480}$ and $\mu_{y|x=800}$.]

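[Figure: A linear relationship between average expenditure on food and income. Vertical axis: average expenditure $E(y|x)$; horizontal axis: x (income). The line $E(y|x) = \beta_1 + \beta_2 x$ has intercept $\beta_1$ and slope $\beta_2 = \Delta E(y|x) / \Delta x$.]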

The population parameters $\beta_1$ and $\beta_2$ are unknown population constants.

The formulas that produce the sample estimates $b_1$ and $b_2$ are called the estimators of $\beta_1$ and $\beta_2$.
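Assuming the estimators meant here are the ordinary least squares formulas that Stata's regress command computes, they can be written as follows. The sample means \bar{x} and \bar{y} and the index i are notation added for this sketch.

```latex
b_2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2},
\qquad
b_1 = \bar{y} - b_2\,\bar{x}
```

Plugging one particular sample into these formulas gives the numerical estimates shown in the regression output.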

When $b_1$ and $b_2$ are used to represent the formulas rather than specific values, they are called estimators of $\beta_1$ and $\beta_2$, which are random variables because they are different from sample to sample.
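The sample-to-sample variation can be illustrated with a small simulation. Everything in it is hypothetical: the population values 40 and 0.13, the sample size of 40, and the variable names x and y are chosen only for illustration and do not come from the lecture data.

```stata
set seed 4
forvalues s = 1/3 {
    quietly {
        clear
        set obs 40
        gen double x = 300 + 700*runiform()        // hypothetical weekly income
        gen double y = 40 + 0.13*x + rnormal(0, 30) // hypothetical population line plus noise
        regress y x
    }
    display "sample `s': b1 = " %8.3f _b[_cons] "  b2 = " %6.4f _b[x]
}
```

Each pass draws a new sample from the same population, so the displayed b1 and b2 differ from run to run while the underlying population values stay fixed.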

- twoway (scatter totexpday incd1000) (lfit totexpday incd1000)
- regress totexpday incd1000
- What is the “intercept” here? What does it mean?
- What is the “slope” here? What does it mean?
- Interpret your estimated model!

- When interpreting the results, be careful about the units of measurement.
- regress totexpday inccont
- What is the “intercept” here? What does it mean?
- What is the “slope” here? What does it mean? Compare with the previous model.
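The effect of the units of measurement can be sketched on simulated data. The variables inc_k (income in 1000 SEK), inc (the same income in SEK) and spend are hypothetical stand-ins, not the variables in PLdata.dta.

```stata
clear
set seed 1
set obs 500
gen double inc_k = 100 + 300*runiform()           // hypothetical income in 1000 SEK
gen double inc   = 1000*inc_k                     // the same income in SEK
gen double spend = 400 + inc_k + rnormal(0, 200)  // hypothetical daily expenditure
regress spend inc_k
scalar slope_k = _b[inc_k]
scalar cons_k  = _b[_cons]
regress spend inc
display "slope per 1000 SEK: " slope_k "   1000 x slope per SEK: " 1000*_b[inc]
display "intercepts: " cons_k "   " _b[_cons]
```

Measuring income in SEK instead of 1000 SEK divides the slope by 1000 and leaves the intercept, the t statistics and R-squared unchanged; the fitted relationship is the same, only its units differ.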

- Hypothesis tests:
- Is income (statistically) significantly related to visitors’ expenditures?
- The output table gives us several ways to answer this question.
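One way to read the answer off a standard regression table, sketched under the assumption that the null hypothesis is a zero slope and the test is two-sided at the 5% level:

```latex
H_0:\ \beta_2 = 0 \quad \text{vs.} \quad H_1:\ \beta_2 \neq 0,
\qquad
t = \frac{b_2}{\operatorname{se}(b_2)}
```

Reject H0 at the 5% level if the p-value reported for t (the P>|t| column) is below 0.05 or, equivalently, if the 95% confidence interval for the slope excludes zero. In a regression with a single x, the F statistic in the table header equals the square of t and leads to the same conclusion.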

- How good is our model?
- R-squared
- R-squared = 0.0575 in our example. How can we interpret this number?
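A sketch of how the number can be read, assuming a simple regression with an intercept, in which R-squared equals the squared correlation r between y and x. The sums of squares are named as in Stata's regression output.

```latex
R^2 = \frac{\text{Model SS}}{\text{Total SS}} = 1 - \frac{\text{Residual SS}}{\text{Total SS}},
\qquad
R^2 = 0.0575 \;\Rightarrow\; |r| = \sqrt{0.0575} \approx 0.24
```

Income accounts for roughly 5.75% of the sample variation in daily expenditure, so about 94% is left to other factors.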


- Can we predict totexpday for, say, an average person earning 200000 SEK per year?
Well: 411.123 + 1.03526*200 = 618.18 (income enters in thousands of SEK, hence 200)

This is a point estimate (prediction). We can also calculate, say, a 95% confidence (prediction) interval.

95% PI: (570.1205, 666.2293)
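The steps above can be sketched in Stata on simulated data. The variables inc and spend and all numbers in the data-generating lines are hypothetical; with the lecture data one would run regress totexpday incd1000 first instead. The sketch shows two different intervals: lincom gives a confidence interval for the average expenditure at the chosen income, while the stdf option of predict gives the wider standard error for one individual's expenditure.

```stata
clear
set seed 7
set obs 300
gen double inc   = 50 + 400*runiform()          // hypothetical income in 1000 SEK
gen double spend = 400 + inc + rnormal(0, 300)  // hypothetical daily expenditure
regress spend inc
lincom _cons + 200*inc              // point estimate and 95% CI for the mean at inc = 200
set obs 301                         // append one extra row to predict for
replace inc = 200 in 301
predict double yhat, xb             // fitted value b1 + b2*inc
predict double se_f, stdf           // standard error of an individual forecast
scalar tcrit = invttail(e(df_r), 0.025)
scalar pi_lo = yhat[301] - tcrit*se_f[301]
scalar pi_hi = yhat[301] + tcrit*se_f[301]
display "point: " yhat[301] "  95% PI: " pi_lo " to " pi_hi
```

When R-squared is low, the interval for an individual is much wider than the interval for the average, so it matters which of the two is being reported.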

- regr incd1000 age
- regr incd1000 education
Interpret the results!