IHEID - The Graduate Institute Academic year 2010-2011 Statistics for International Relations Research I Dr. NAI Alessandro, visiting professor Nov. 19, 2010 Lecture 6: Regression analysis I
Lecture content • Feedback on Assignment V • Introduction to OLS regression analysis • The regression line • The multivariate regression model • Model adequacy for specific cases
Introduction to OLS regression analysis [i/x] Inferential statistics Main goal: draw conclusions about the existence (likelihood) of relationships among variables based on data subject to random variation Does A affect B? Is the individuals’ positioning on B affected by their positioning on A?
Introduction to OLS regression analysis [ii/x] Different statistical tools exist for uncovering the existence of a causal relationship
Introduction to OLS regression analysis [iii/x] Correlation Statistical relationship between two scale variables (see lecture 5) Regression Method for modeling the effect of one or more independent scale variables on a dependent scale variable
Introduction to OLS regression analysis [iv/x] Two major uses for regression models Prediction analysis: Develop a formula for making predictions about the dependent variable based on observed values Ex: predict GNP for next year Causal analysis: Independent variables are regarded as causes of the dependent variable Ex: uncover the causes of a higher crime rate
Introduction to OLS regression analysis [v/x] Two main types of regression OLS (Ordinary Least Squares): linear relationship between variables, scale dependent variable Logistic regression: curvilinear relationship between variables, dummy (binomial logistic regression) or nominal dependent variable (multinomial logistic regression) (see lecture 8) All regression models may be bi- or multivariate
Introduction to OLS regression analysis [vi /x] Independent variables in (all) regression models may take the following form: - Scale (optimal measurement level in regressions) - Ordinal (metrical, or close) - Binary (0,1) Nominal variables are allowed (almost) only in logistic regressions
Introduction to OLS regression analysis [vii/x] Why is a regression not efficient with qualitative variables?
Introduction to OLS regression analysis [viii /x] OLS regressions Dependent variable is scale Independent variable(s) may be scale, ordinal (metric) and binary Estimations based on Ordinary Least Squares
Introduction to OLS regression analysis [ix/x] Ordinary Least Squares (OLS) Method used to get values for the regression coefficients: slope(s) and intercept Based on the difference between observed and predicted values Observed values: values in the database for each unit Predicted values: for the same units, values predicted by the regression model
Introduction to OLS regression analysis [x/x] Prediction error For each unit of observation: Error = Observed value – Predicted value The OLS method (on which the regression line is based) proposes the model that makes the sum of the squared prediction errors as small as possible
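The OLS idea above can be sketched in a few lines of Python. This is an illustrative sketch only: the data and variable names are invented, not taken from the course.

```python
# Illustrative sketch: fitting a bivariate OLS line with numpy and
# checking that its sum of squared prediction errors is minimal.
import numpy as np

# Hypothetical data (invented for illustration)
x = np.array([10.0, 25.0, 40.0, 60.0, 75.0])
y = np.array([5.0, 9.0, 12.0, 14.0, 16.0])

# OLS formulas: slope b = cov(x, y) / var(x), intercept a = mean(y) - b * mean(x)
b = np.cov(x, y, bias=True)[0, 1] / np.var(x)
a = y.mean() - b * x.mean()

predicted = a + b * x
errors = y - predicted               # observed minus predicted
sse = np.sum(errors ** 2)            # sum of squared prediction errors

# Any other line (here: the slope nudged by 0.01) yields a larger SSE
sse_other = np.sum((y - (a + (b + 0.01) * x)) ** 2)
print(sse < sse_other)               # True
```

With an intercept in the model, the residuals also sum to zero, which is a quick sanity check on any hand-computed fit.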
The regression line [i/xiv] The regression line Summarizes the relationship between two (scale) variables as being linear Based on OLS estimation (the model that makes the sum of the squared prediction errors as small as possible) In other terms: the distance between the line and all the observed values is minimized
The regression line [ii/xiv] An intuitive example Consider a square wooden board (the Cartesian space) on which we randomly place some rocks The wooden board has no weight The rocks have exactly the same shape and weight The regression line will be the line of equilibrium of the board
The regression line [iii/ xiv] Woodenboard Equilibriumpoint (regressionline) Rocks
The regression line [iv/xiv] Bivariate OLS regression: an example Relationship between the % of female politicians in Parliament and the number of years since women had the right to vote Null hypothesis: no relationship between the two variables Working hypothesis: the older women's right to vote in a given country, the higher the % of female politicians in its Parliament
The regression line [v/xiv] Expected distribution for the verification of the working hypothesis [Figure: scatterplot with Y axis = % of women in Parliament, X axis = years since women's right to vote]
The regression line [vii / xiv] Regression line Working hypothesis confirmed by observation? More or less
The regression line [viii/xiv] • The regression line summarizes the relationship between two or more (scale) variables • In a bivariate relationship, by convention: - The independent variable goes on the X axis (horizontal) - The dependent variable goes on the Y axis (vertical) • We therefore say that y is a function of x: y = f(x) • If we have two independent variables (x and z), we say that y = f(x, z)
The regression line [ix/xiv] The regression line always takes the following algebraic form (regression equation): y = a + b*x + e y: dependent variable x: independent variable a: intercept (value of y where x=0) b: slope for x e: residual (not explained linearly)
The regression line [x/xiv] [Figure: regression line y = a + b*x; a is the intercept (where the line crosses the Y axis), and the slope is b = Δy / Δx]
The regression line [ix/xiv] The slope (b) Coefficient that links the two variables Effect of x on y, given that y = f(x) Change in y for each unit change in x Ex: if b=2, when x increases by 10 units, y increases by 20 units (10*2) Look particularly at: - The direction - The strength
The regression line [x/ xiv] Direction of the slope (interpretation similar as the distribution in crosstabs built on ordinal variables) Positive relationship If x increases, so does y Negative relationship If x increases, y decreases
The regression line [xi/xiv] Strength of the slope Slope=1 (if x increases by 1 unit, y increases by 1 unit) Slope=2 (if x increases by 1 unit, y increases by 2 units) Slope=0.5 (if x increases by 1 unit, y increases by 0.5 units)
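The slope examples above reduce to one rule: the predicted change in y is b times the change in x. A minimal sketch (the function name is ours, for illustration):

```python
# Sketch of slope interpretation: predicted change in y = b * Δx
def predicted_change(b, delta_x):
    """Change in y predicted by the regression for a change of delta_x in x."""
    return b * delta_x

print(predicted_change(2.0, 10))    # b=2: x up 10 units -> y up 20 units
print(predicted_change(0.5, 1))     # b=0.5: x up 1 unit -> y up 0.5 units
print(predicted_change(-1.0, 3))    # negative slope: x up 3 -> y down 3
```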
The regression line [xii/ xiv] SPSS procedure: Analyze / Regression / Linear
The regression line [xiii/xiv] Back to our example (H: the older women's right to vote in a given country, the higher the % of female politicians in its Parliament) y = a + b*x y = 3.58 + 0.17*x
The regression line [xiv/ xiv] General quality of the model R: strength of the relationship (Pearson’s r) R square: explanatory power of the model (% of explained variance, here 15.3%) Standard error of the estimate (Se): mean prediction error
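The three fit statistics named above can be computed by hand. A sketch with invented data (not the course data), where the degrees-of-freedom correction n − 2 for Se follows the usual bivariate convention:

```python
# Sketch of R (Pearson's r), R squared, and the standard error of the estimate
import numpy as np

x = np.array([10.0, 25.0, 40.0, 60.0, 75.0])   # hypothetical data
y = np.array([5.0, 9.0, 12.0, 14.0, 16.0])

b = np.cov(x, y, bias=True)[0, 1] / np.var(x)
a = y.mean() - b * x.mean()
residuals = y - (a + b * x)

r = np.corrcoef(x, y)[0, 1]                     # R: strength of the relationship
r_square = r ** 2                               # share of explained variance
n = len(y)
se = np.sqrt(np.sum(residuals ** 2) / (n - 2))  # Se: typical prediction error
```

In the bivariate case, R squared equals 1 minus the ratio of residual to total variance, so both routes give the same number.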
The multivariate regression model [i/xiii] Bivariate linear regression Method for modeling the effect of one independent scale variable (x) on a dependent scale variable (y) y = f(x) Multivariate linear regression Method for modeling the effect of two or more independent scale variables (x, z, …) on a dependent scale variable (y) y = f(x,z,…) “Explanatory model”
The multivariate regression model [ii/xiii] Example: Explain life expectancy for a country Working hypothesis: life expectancy for a country is positively influenced by the daily supply of proteins and negatively by the illiteracy rate The model: [Diagram: Protein supply (+) and Illiteracy rate (−) → Life expectancy]
The multivariate regression model [vi/xiii] As for the bivariate models, multivariate models may be summarized through a regression equation y = a + b1*x + b2*z + … + e y: dependent variable x,z: independent variables a: intercept (value of y where all independent variables equal 0) b1: slope for x b2: slope for z e: residual (not explained linearly)
The multivariate regression model [vii/xiii] SPSS procedure: Analyze / Regression / Linear
The multivariate regression model [viii/xiii] The equation is: y = 50.59 - 0.27*x + 0.29*z If the adult illiteracy rate (x) increases by 1%, life expectancy decreases by 0.27 years If the daily per capita supply of proteins (z) increases by 1 gram, life expectancy increases by 0.29 years
The multivariate regression model [ix/xiii] Standardized coefficients (betas) Useful to assess the contribution of each independent variable to the dependent variable May be compared with each other, unlike the non-standardized coefficients (Bs). Here, x is more important than z in explaining y
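A beta coefficient rescales a slope by the standard deviations involved: beta = b * sd(x) / sd(y), which is what makes betas comparable across variables measured in different units. A sketch with invented numbers standing in for the life-expectancy example:

```python
# Sketch of standardized coefficients: beta_k = b_k * sd(x_k) / sd(y)
import numpy as np

def standardized_beta(b, x, y):
    """Convert an unstandardized slope b into a beta coefficient."""
    return b * np.std(x) / np.std(y)

x = np.array([5.0, 40.0, 15.0, 60.0, 30.0])    # e.g. illiteracy rate (invented)
y = np.array([75.0, 55.0, 70.0, 50.0, 62.0])   # e.g. life expectancy (invented)
beta_x = standardized_beta(-0.27, x, y)        # keeps the sign of the slope
```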
The multivariate regression model [x/xiii] Overall quality of the model R square (% of explained variance) and the standard error of the estimate have to be interpreted as in bivariate models. In multivariate models, R is almost never taken into account. Here, very good model!
The multivariate regression model [xi/xiii] Problem in multivariate regression models The logic “observe before, analyze afterwards” (as in analyses through crosstabs, ANOVA, and bivariate regressions) is complicated Graphical representation of multivariate models is hard to visualize
The multivariate regression model [xii/xiii] Example with two scale independent variables
Model adequacy [i/vii] Model adequacy for a specific case Main idea: uncover whether or not a general regression model is optimal to explain the situation in a given unit of observation Works only with aggregate data! With individual observations, it simply does not make sense
Model adequacy [ii/vii] Main logic Compare, for a specific case, the observed value (in the database) with the value predicted by the model (regression equation) If the two values are close, the model adequacy for that specific case is high If the values are not close, the model is not optimal for explaining the situation in that specific case
Model adequacy [iii/vii] [Figure: observed vs. predicted values; a small gap (Δ) between them means high adequacy, a large gap means low adequacy]
Model adequacy [iv/vii] Example of model adequacy: Relationship between the % of female politicians in Parliament and the number of years since women had the right to vote Is the model adequacy high for the Swiss case?
Model adequacy [v/vii] Regression equation (overall model): y = 3.58 + 0.17*x For Switzerland, the observed value for x is 25 (25 years since the right to vote was granted to women) Predicted value for y (% of female politicians in Parliament): y = 3.58 + 0.17*x = 3.58 + 0.17*25 = 7.85% The observed value for y in the Swiss case (actual % of female politicians in Parliament) is 20.3%
Model adequacy [vi/vii] Are the two values close? To decide, compare the difference with the Standard Error of the Estimate (Se) for the overall model Main logic: If the difference between observed and predicted values for y is lower than the Se, the model adequacy is high
Model adequacy [vii/vii] For the Swiss case: Predicted value for y = 7.85% Observed value for y = 20.3% Difference (Δ) = 20.3 – 7.85 = 12.45 Se for the overall model = 7.6 In this case, Δ > Se (12.45 > 7.6), therefore the model adequacy for Switzerland is low.
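The adequacy check above is a one-line comparison. A sketch using the slide's numbers as given (note that the rounded coefficients 3.58 and 0.17 yield 7.83 rather than the slide's 7.85, presumably because the slide uses unrounded coefficients; the conclusion is unchanged either way):

```python
# Sketch of the adequacy check for the Swiss case, with the slide's numbers
predicted = 3.58 + 0.17 * 25     # value predicted by the overall model
observed = 20.3                  # observed % of female politicians in Parliament
se = 7.6                         # standard error of the estimate (overall model)

delta = observed - predicted
adequacy_high = abs(delta) < se
print(adequacy_high)             # False: the model fits Switzerland poorly
```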
Any questions? Thank you for your attention!