IHEID, The Graduate Institute
Academic year 2010-2011
Statistics for International Relations Research I
Dr. Alessandro Nai, visiting professor
Nov. 19, 2010
Lecture 6: Regression analysis I
Lecture content: feedback on Assignment V; introduction to OLS regression analysis
Introduction to OLS regression analysis [i/x]
Inferential statistics
Main goal: draw conclusions about the existence (likelihood) of relationships among variables based on data subject to random variation
Does A affect B?
Is the individuals’ positioning on B affected by their positioning on A?
Introduction to OLS regression analysis [ii/x]
Different statistical tools exist for uncovering the existence of a causal relationship
Introduction to OLS regression analysis [iii/x]
Correlation
Statistical relationship between two scale variables
(see lecture 5)
Regression
Method for modeling the effect of one or more independent scale variables on a dependent scale variable
Introduction to OLS regression analysis [iv/x]
Two major uses for regression models
Prediction analysis:
Develop a formula for making predictions about the dependent variable based on observed values
Ex: predict GNP for next year
Causal analysis:
Independent variables are regarded as causes of the dependent variable
Ex: uncover the causes for a higher criminality rate
Introduction to OLS regression analysis [v/x]
Two main types of regression
OLS (Ordinary Least Squares): linear relationship between variables, scale dependent variable
Logistic regression: curvilinear relationship between variables; dummy dependent variable (binomial logistic regression) or nominal dependent variable (multinomial logistic regression)
(see lecture 8)
All regression models may be bivariate or multivariate
Introduction to OLS regression analysis [vi/x]
Independent variables in (all) regression models may take the following form:
- Scale (optimal measurement level in regressions)
- Ordinal (metric, or close to it)
- Binary (0, 1)
Nominal variables are allowed (almost) only in logistic regressions
Introduction to OLS regression analysis [vii/x]
Why is regression not efficient with qualitative variables?
Introduction to OLS regression analysis [viii/x]
OLS regressions
Dependent variable is scale
Independent variable(s) may be scale, ordinal (metric), or binary
Estimations based on Ordinary Least Squares
Introduction to OLS regression analysis [ix/x]
Ordinary Least Squares (OLS)
Method used to get values for the regression coefficients: slope(s) and intercept
Based on the difference between observed and predicted values
Observed values: values in the database for each unit
Predicted values: for the same units, values predicted by the regression model
Introduction to OLS regression analysis [x/x]
Prediction error
For each unit of observation:
Error = Observed value – Predicted value
The OLS method (on which the regression line is based) selects the model that makes the sum of the squared prediction errors as small as possible
The regression line
Summarizes the relationship between two (scale) variables as linear
Based on OLS estimation (the model that makes the sum of the squared prediction errors as small as possible)
In other terms: the distance between the line and all the observed values is minimized
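This fitting rule can be sketched in a few lines of Python (the data below are made up for illustration): the slope is the covariance of x and y divided by the variance of x, and the intercept makes the line pass through the means of both variables.

```python
def ols_fit(x, y):
    # Slope b = sum((x - mean_x)(y - mean_y)) / sum((x - mean_x)^2);
    # intercept a = mean_y - b * mean_x.
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    sxx = sum((xi - mx) ** 2 for xi in x)
    b = sxy / sxx
    a = my - b * mx
    return a, b

def sse(x, y, a, b):
    # Sum of squared prediction errors: (observed - predicted)^2 over all units.
    return sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))

x = [1, 2, 3, 4, 5]                      # illustrative data
y = [2.1, 3.9, 6.2, 7.8, 10.1]
a, b = ols_fit(x, y)
# Nudging the line away from the OLS solution increases the sum of squared errors:
assert sse(x, y, a, b) <= sse(x, y, a + 0.5, b)
assert sse(x, y, a, b) <= sse(x, y, a, b + 0.1)
```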
An intuitive example
Consider a square wooden board (the Cartesian plane) on which we place some rocks at random
The wooden board has no weight
The rocks all have exactly the same shape and weight
The regression line will be the line of equilibrium of the board
The regression line [iv/xiv]
Bivariate OLS regression: an example
Relationship between the % of female politicians in Parliament and the number of years since women had the right to vote
Null hypothesis: no relationship between the two variables
Working hypothesis: the longer ago women gained the right to vote in a given country, the higher the % of female politicians in its Parliament
[Scatterplot: Y axis = % of women in Parliament; X axis = years since women gained the right to vote]
The regression line [v/xiv]
Expected distribution for the verification of the working hypothesis
The regression line [vii/xiv]
Regression line
Working hypothesis confirmed by observation?
More or less
The regression line [viii/xiv]
The regression line always takes the following algebraic form (regression equation):
y = a + b*x + e
y: dependent variable
x: independent variable
a: intercept (value for y where x=0)
b: slope for x
e: residual (not explained linearly)
The slope (b)
Coefficient that links the two variables
Effect of x on y, given that y=f(x)
Change in y for each unit change in x
Ex: if b = 2, when x increases by 10 units, y increases by 20 units (10*2)
Look particularly at:
- The direction
- The strength
Direction of the slope
(interpretation similar to the distribution in crosstabs built on ordinal variables)
Positive relationship
If x increases, so does y
Negative relationship
If x increases, y decreases
Strength of the slope
Slope = 1
(if x increases by 1 unit, y increases by 1 unit)
Slope = 2
(if x increases by 1 unit, y increases by 2 units)
Slope = 0.5
(if x increases by 1 unit, y increases by 0.5 units)
The regression line [xii/xiv]
SPSS procedure: Analyze / Regression / Linear
The regression line [xiii/xiv]
Back to our example
(H: the longer ago women gained the right to vote in a given country, the higher the % of female politicians in its Parliament)
y = a + b*x
y = 3.58 + 0.17*x
The regression line [xiv/xiv]
General quality of the model
R: strength of the relationship (Pearson’s r)
R square: explanatory power of the model (% of explained variance, here 15.3%)
Standard error of the estimate (Se): mean prediction error
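R square can be recomputed by hand as 1 minus the ratio of the squared prediction errors to the total variation in y. A minimal sketch, using illustrative data and an assumed fitted line (not the lecture's example):

```python
def r_squared(x, y, a, b):
    # R^2 = 1 - SSE/SST: the share of the variation in y explained by the line.
    my = sum(y) / len(y)
    sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    sst = sum((yi - my) ** 2 for yi in y)
    return 1 - sse / sst

# Illustrative data with an assumed fitted line y = 0.5 + 0.42*x:
x = [10, 20, 30, 40]
y = [5.0, 9.0, 12.0, 18.0]
assert abs(r_squared(x, y, 0.5, 0.42) - 0.98) < 1e-6
```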
The multivariate regression model [i/xiii]
Bivariate linear regression
Method for modeling the effect of one independent scale variable (x) on a dependent scale variable (y)
y = f(x)
Multivariate linear regression
Method for modeling the effect of two or more independent scale variables (x, z, …) on a dependent scale variable (y)
y = f(x,z,…)
The multivariate regression model [ii/xiii]
Example: Explain life expectancy for a country
Working hypothesis: life expectancy for a country is positively influenced by the daily supply of proteins and negatively on the illiteracy rate
The model: [diagram: daily supply of proteins (+) and illiteracy rate (-) → life expectancy]
The multivariate regression model [vi/xiii]
As with bivariate models, multivariate models may be summarized through a regression equation
y = a + b1*x + b2*z + … + e
y: dependent variable
x,z: independent variables
a: intercept (value of y where all independent variables equal 0)
b1: slope for x
b2: slope for z
e: residual (not explained linearly)
The multivariate regression model [vii/xiii]
SPSS procedure: Analyze / Regression / Linear
The multivariate regression model [viii/xiii]
The equation is: y = 50.59 - 0.27*x + 0.29*z
If the adult illiteracy rate (x) increases by 1 percentage point, life expectancy decreases by 0.27 years
If the daily per capita supply of proteins (z) increases by 1 gram, life expectancy increases by 0.29 years
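The fitted equation can be checked directly in code; the input values below are illustrative, not from the lecture:

```python
# The lecture's fitted model: y = 50.59 - 0.27*x + 0.29*z, where x is the
# adult illiteracy rate (%) and z the daily per capita protein supply (g).
def life_expectancy(illiteracy_pct, protein_g):
    return 50.59 - 0.27 * illiteracy_pct + 0.29 * protein_g

base = life_expectancy(20, 60)          # illustrative input values
# One extra percentage point of illiteracy lowers the prediction by 0.27 years:
assert abs((life_expectancy(21, 60) - base) + 0.27) < 1e-9
# One extra gram of protein raises it by 0.29 years:
assert abs((life_expectancy(20, 61) - base) - 0.29) < 1e-9
```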
The multivariate regression model [ix/xiii]
Standardized coefficients (betas)
Useful to assess the contribution of each independent variable to the dependent variable
May be compared to each other, unlike the unstandardized coefficients (Bs)
Here, x matters more than z in explaining y
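A beta can be recovered from the unstandardized slope by rescaling it with the standard deviations of the two variables. A minimal sketch (the slope and data below are illustrative, not from the lecture):

```python
import statistics

def standardized_beta(b, x_values, y_values):
    # beta = B * sd(x) / sd(y): expresses the slope in standard-deviation
    # units, so slopes measured in different units become comparable.
    return b * statistics.stdev(x_values) / statistics.stdev(y_values)

# Illustrative: an unstandardized slope of 2.0 where y varies exactly twice
# as much as x gives a beta of 1.0.
assert abs(standardized_beta(2.0, [0, 2], [0, 4]) - 1.0) < 1e-9
```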
The multivariate regression model [x/xiii]
Overall quality of the model
R square (% of explained variance) and the standard error of the estimate have to be interpreted as in bivariate models.
In multivariate models, R is almost never taken into account.
Here, very good model!
The multivariate regression model [xi/xiii]
Problem in multivariate regression models
The logic of “observe first, analyze afterwards” (as in analyses through crosstabs, ANOVA, and bivariate regressions) is complicated
Graphical representation of multivariate models is hard to visualize
The multivariate regression model [xii/xiii]
Example with two scale independent variables
Model adequacy for a specific case
Main idea: uncover whether or not a general regression model is optimal to explain the situation in a given unit of observation
Works only with aggregate data!
With individual observations, it simply does not make sense
Main logic
Compare, for a specific case, the observed value (in the database) with the value predicted by the model (regression equation)
If the two values are close, the model adequacy for that specific case is high
If the values are not close, the model is not optimal for explaining the situation in that specific case
Example of model adequacy:
Relationship between the % of female politicians in Parliament and the number of years since women had the right to vote
Is the model adequacy high for the Swiss case?
Regression equation (overall model):
y = 3.58 + 0.17*x
For Switzerland, the observed value for x is 25 (25 years since the right to vote was granted to women)
Predicted value for y (% of female politicians in Parliament):
y = 3.58 + 0.17*x = 3.58 + 0.17*25 = 7.85%
The observed value for y in the Swiss case (actual % of female politicians in Parliament) is 20.3%
Are the two values close?
To decide, compare the difference with the standard error of the estimate (Se) for the overall model
Main logic:
If the difference between observed and predicted values for y is lower than the Se, the model adequacy is high
For the Swiss case:
Predicted value for y = 7.85%
Observed value for y = 20.3%
Difference (Δ) = 20.3 – 7.85 = 12.45
Se for the overall model = 7.6
In this case, Δ> Se (12.45 > 7.6), therefore the model adequacy for Switzerland is low.
Thank you for your attention!