
SOC 206 Lecture 2


Presentation Transcript


  1. SOC 206 Lecture 2 Logic of Multivariate Analysis Multiple Regression

  2. Multivariate Analysis
  • Why multivariate analysis?
  • Nothing happens by a single cause.
  • If it did, it would imply perfect determinism;
  • it would imply perfect/divine measurement;
  • it would be impossible to separate cause from effect (where does the cause end and the effect begin?).
  • Social reality is notoriously multi-causal, even more so than many physical/chemical/biological processes.
  • People are not just objects but also subjects of causal processes – reflexivity, agency, framing, etc. (Some of these are hard to capture in statistical models.)

  3. John Stuart Mill’s 3 Main Criteria of Causation (recall)
  • #1. Empirical Association
  • #2. Appropriate Time Order
  • #3. Non-Spuriousness (Excluding Other Forms of Causation)
  • Mill tells us that even individual causal relationships cannot be established without multivariate analysis (#3).
  • Suppose we suspect X causes Y: Y = f(X, e).
  • Suppose we establish that X is related to Y (#1) and X precedes Y (#2).
  • But what if both X and Y are the result of Z, a third variable?
  • E.g., Academic Performance = f(Poverty, e). If that were true, redistributing income should help academic achievement.
  • But maybe both are the result of parents’ education (a confounding factor).
  [Path diagrams: (a) Poverty → Academic Performance, with error e; (b) Parents’ Education → Poverty (−) and Parents’ Education → Academic Performance (+), with errors e1 and e2.]

  4. Excluding Other Forms of Causation, or Eliminating Confounding Factors
  • Eliminating or “controlling for” other, confounding factors (Z)
  • Experiments – the treatment (X) is introduced by the researcher:
  • 1. Physical control: excluding factors by physical design – physical control of the Zs
  • 2. Randomization: random assignment to treatment and control groups – randomized control of the Zs
  • Observational research – no manipulation by the researcher:
  • 3. Quasi-experiments: found experiments – choosing cases that are “minimum pairs”: the same on most confounding factors (Zs) but different in the treatment (X)
  • 4. Statistical manipulation: removing the effect of Z from the relationship between Y and X
  • Organizing data into groups homogeneous in the control variable Z and looking at the relationship between treatment X and response Y within groups: if Y still moves together with X, it cannot be because both are moved by Z, since Z is constant. If Z were the cause of Y and Z is constant, Y would have to be constant too. (A simulated illustration follows below.)
  • Residualizing X on Z, then residualizing Y on Z. That leaves us with the parts of X and Y that are unrelated to Z. If the two residualized variables still move together, that cannot be because they are moved by Z.
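  To see why holding Z constant removes a spurious X–Y association, here is a minimal simulated sketch in Stata (the variables z, x, y and the seed are invented for illustration; they are not part of the lecture data):

  . clear
  . set obs 1000
  . set seed 206
  . gen z = rnormal()          // confounder (think: parents' education)
  . gen x = -z + rnormal()     // "poverty": driven by z, not by y
  . gen y =  z + rnormal()     // "academic performance": driven by z, not by x
  . correlate x y              // x and y are (spuriously) correlated
  . regress y x z              // once z is controlled, x's coefficient is near 0

  The bivariate correlation between x and y is sizable, yet once z enters the regression the coefficient of x collapses toward zero, because z was the only thing moving both.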

  5. Residualizing
  • Remember: in a regression, the error term is always uncorrelated with the independent variable(s).
  • Residualizing a variable on Z therefore means regressing it on Z and keeping the residual – the part of the variable that is unrelated to Z (see the sketch below).
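  As a concrete sketch using the California schools data introduced later in the lecture (the residual variable names xres and yres are invented): residualizing both API13 and MEALS on AVG_ED and then regressing the residuals on each other should reproduce the multiple-regression coefficient of MEALS from slide 11 (≈ .8188) – the Frisch–Waugh logic behind statistical control.

  . regress MEALS AVG_ED          // residualize X (MEALS) on Z (AVG_ED)
  . predict xres, residuals
  . regress API13 AVG_ED          // residualize Y (API13) on Z (AVG_ED)
  . predict yres, residuals
  . regress yres xres             // slope ≈ .8188, the MEALS coefficient in the multiple regression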

  6. The Importance of Temporal Sequence

  7. Multiple Regression with Two Independent Variables
  • Yi = a + b1Xi + b2Zi + ei
  • or
  • Yi = a + b1X1i + b2X2i + ei
  • To obtain a, b1, and b2, we first calculate β*1 and β*2 from the standardized regression.
  • Then we transform them into their metric equivalents.
  • Finally, we obtain a with the help of the means of Y, X1, and X2: a = mean(Y) − b1·mean(X1) − b2·mean(X2).

  8. Finding the Standardized (Path) Coefficients
  • Start from the standardized regression Zyi = β*1·Zx1i + β*2·Zx2i + ei.
  • We multiply each side by Zx1i, sum across all cases, and divide by n. Since the mean of a product of two Z-scores is a correlation and the error is uncorrelated with the predictors, we get our first normal equation (for the correlation between Y and X1):
  • 1. ryx1 = β*1 + β*2·rx1x2
  • This gives an expression for β*1: β*1 = ryx1 − β*2·rx1x2.
  • We multiply each side by Zx2i and repeat what we did. We get our second normal equation (for the correlation between Y and X2):
  • 2. ryx2 = β*1·rx1x2 + β*2
  • Plugging in for β*1 and solving, both standardized coefficients can be expressed in terms of the three correlations among Y, X1, and X2 (see the numerical check below):
  • β*1 = (ryx1 − ryx2·rx1x2) / (1 − rx1x2²)    β*2 = (ryx2 − ryx1·rx1x2) / (1 − rx1x2²)
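  These formulas can be checked numerically in Stata with the three correlations reported on slide 11 (ryx1 = −.4743 for MEALS, ryx2 = .6706 for AVG_ED, rx1x2 = −.8178; small differences are rounding):

  . display (-.4743 - (.6706)*(-.8178)) / (1 - (-.8178)^2)    // β*1 for MEALS ≈ .2238 (.2235 in the output)
  . display ( .6706 - (-.4743)*(-.8178)) / (1 - (-.8178)^2)   // β*2 for AVG_ED ≈ .8536 (.8534 in the output)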

  9. Finding the Unstandardized (Metric) Coefficients
  • We multiply each standardized coefficient by the ratio of the standard deviation of the dependent variable to that of the independent variable it belongs to: b1 = β*1·(sy/sx1), b2 = β*2·(sy/sx2).
  • Take the two normal equations: ryx1 = β*1 + β*2·rx1x2 and ryx2 = β*1·rx1x2 + β*2.
  • What do we learn from the normal equations?
  • If either β*2 = 0 or rx1x2 = 0, the unconditional effect does not change once we control for X2 (β*1 = ryx1).
  • We get suppression only if β*2 ≠ 0 and rx1x2 ≠ 0, and the two are of opposite signs when the unconditional effect is positive, or of the same signs when the unconditional effect is negative.
  • The correlation (unconditional effect) of X1 or X2 with Y can be decomposed into two parts. Take X1:
  • the direct (or net) effect of X1 on Y (β*1), controlling for X2,
  • and something else: the product of the direct (or net) effect of X2 on Y (β*2) and the correlation between X1 and X2 (rx1x2), the measure of multicollinearity between the two independent variables.

  10. Path Analysis
  • Simple regression: AP = f(P, e1), i.e., Zap = β*′1·Zp + e1
  • Multiple regression: AP = f(P, PE, e), i.e., Zap = β*1·Zp + β*2·Zpe + e
  [Path diagrams: (a) Poverty → Academic Performance (β*′1), with error e1; (b) Poverty → Academic Performance (β*1) and Parents’ Education → Academic Performance (β*2), with error e.]

  11. The Multiple Regression Model
  . regress API13 AVG_ED MEALS, beta

        Source |       SS       df       MS              Number of obs =   10173
  -------------+------------------------------           F(  2, 10170) = 4441.76
         Model |    49544993     2  24772496.5           Prob > F      =  0.0000
      Residual |  56719871.2 10170  5577.17514           R-squared     =  0.4662
  -------------+------------------------------           Adj R-squared =  0.4661
         Total |   106264864 10172  10446.8014           Root MSE      =   74.68

  ------------------------------------------------------------------------------
         API13 |      Coef.   Std. Err.      t    P>|t|                     Beta
  -------------+----------------------------------------------------------------
        AVG_ED |   114.9596   1.695597    67.80   0.000                  .853387
         MEALS |   .8187537   .0461029    17.76   0.000                 .2235364
         _cons |   416.4326   7.135849    58.36   0.000                        .
  ------------------------------------------------------------------------------

  . correlate AVG_ED API13 MEALS, means
  (obs=10173)

      Variable |       Mean    Std. Dev.       Min        Max
  -------------+------------------------------------------------
        AVG_ED |   2.781778     .758739          1          5
         API13 |    784.182    102.2096        311        999
         MEALS |   58.57338     27.9053          0        100

               |   AVG_ED    API13    MEALS
  -------------+---------------------------
        AVG_ED |   1.0000
         API13 |   0.6706   1.0000
         MEALS |  -0.8178  -0.4743   1.0000
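  The metric coefficients and the intercept in this output can be recovered from the beta weights, standard deviations, and means reported above, using the formulas from slides 7 and 9; a quick Stata check:

  . display .853387  * (102.2096/.758739)                      // b for AVG_ED ≈ 114.96
  . display .2235364 * (102.2096/27.9053)                      // b for MEALS  ≈ .8188
  . display 784.182 - 114.9596*2.781778 - .8187537*58.57338    // a ≈ 416.43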

  12. Basic Path Analysis: Poverty
  • Unconditional (total) effect: ryx1 = β*′1 = −.4743
  • In the two-predictor model, the direct effect of Poverty is β*1 = .2235364, and the spurious indirect effect runs through Parents’ Education: rx1x2·β*2 = (−.8178)(.853387).
  [Path diagrams: (a) Poverty → Academic Performance (β*′1 = −.4743), with error e1; (b) Poverty → Academic Performance (β*1 = .2235364) and Parents’ Education → Academic Performance (β*2 = .853387), with Poverty–Parents’ Education correlation rx1x2 = −.8178 and error e. The path through Parents’ Education carries the spurious indirect effect.]

  13. Basic Path Analysis: Parents’ Education
  • Unconditional (total) effect: ryx2 = β*′2 = .6706
  • In the two-predictor model, the direct effect of Parents’ Education is β*2 = .853387, and the (negative) indirect effect runs through Poverty: rx1x2·β*1 = (−.8178)(.2235364).
  [Path diagrams: (a) Parents’ Education → Academic Performance (β*′2 = .6706), with error e1; (b) as on slide 12, with the Poverty path carrying the indirect effect.]
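  Both path decompositions can be verified from the coefficients above (differences are rounding):

  . display .2235364 + (-.8178)*(.853387)     // direct + spurious for Poverty: .2235 − .6979 ≈ −.4743 = ryx1
  . display .853387  + (-.8178)*(.2235364)    // direct + indirect for Parents' Ed.: .8534 − .1828 ≈ .6706 = ryx2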

  14. Fit (R-square)
  • Venn diagram: R-square = unique contribution of X1 + unique contribution of X2 + common contribution of X1 and X2.
  • Multicollinearity: the unique contributions are small and statistically non-significant, yet R-square is large because the common contribution is large.
  [Venn diagrams of Y, X1, and X2, showing overlapping variance shares.]

  15. Nested Regression Equations
  • Comparing theories – how much a theory adds to an already existing one.
  • Calculating the contribution of a set of variables: an F-test on the increase in R²,
  • F = [(R2² − R1²) / (K2 − K1)] / [(1 − R2²) / (N − K2 − 1)]
  • where R1² is the fit of the smaller model and R2² is the fit of the full model,
  • K1 is the number of independent variables in the smaller model and K2 is the number of independent variables in the full model,
  • and N is the sample size. The resulting F statistic has (K2 − K1, N − K2 − 1) degrees of freedom. (A worked example follows below.)
  • Warning: you have to make sure you use the exact same cases for each model!
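  Applied to the two models on slide 19 (R1² = .6370 with K1 = 6, R2² = .6577 with K2 = 13, N = 10082), the test works out as follows:

  . display ((.6577 - .6370)/(13 - 6)) / ((1 - .6577)/(10082 - 13 - 1))   // F(7, 10068) ≈ 87.0

  With 7 and 10068 degrees of freedom this is far beyond any conventional critical value, so the seven added ethnic-composition variables jointly improve the fit.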

  16. Adjusted R-square
  • Adding a new independent variable will always improve fit, even if it is unrelated to the dependent variable.
  • We have to consider the parsimony (number of independent variables) of the model relative to the sample size.
  • For N = 2, a simple regression will always fit perfectly.
  • General rule: N − 1 independent variables will always result in an R-squared of 1, no matter what those variables are.
  • Adjusted R-square: Adj. R² = 1 − (1 − R²)·(N − 1)/(N − K − 1)
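  Checked against the full model on slide 19 (R² = .6577, K = 13, N = 10082):

  . display 1 - (1 - .6577)*(10082 - 1)/(10082 - 13 - 1)   // .6573 ≈ the reported .6572 (difference is rounding of R²)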

  17. Multiple Regression with K Independent Variables
  • Yi = a + b1X1i + b2X2i + .... + bkXki + ei
  • If we standardize Y, X1, …, Xk, turning them into Z-scores, we can re-write the equation as
  • Zyi = β*1·Zx1i + β*2·Zx2i + … + β*k·Zxki + ei
  • To find the coefficients we have to write out k normal equations, one for each correlation between an independent variable and the dependent variable:
  • ryx1 = β*1 + β*2·rx1x2 + ….. + β*k·rx1xk
  • ryx2 = β*1·rx1x2 + β*2 + ….. + β*k·rx2xk
  • ……………….
  • ryxk = β*1·rx1xk + β*2·rx2xk + ….. + β*k
  • and solve the k equations for the k unknowns (β*1, β*2, …, β*k).
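  In matrix form this system is Rxx·β* = rxy, so β* = Rxx^-1·rxy. A minimal Stata sketch for the two-predictor case of slide 11 (the matrix names Rxx, rxy, and beta are invented for illustration):

  . matrix Rxx = (1, -.8178 \ -.8178, 1)      // correlations among the predictors
  . matrix rxy = (.6706 \ -.4743)             // correlations of AVG_ED and MEALS with API13
  . matrix beta = invsym(Rxx) * rxy           // solve the normal equations
  . matrix list beta                          // ≈ (.8536 \ .2238): the beta weights from slide 11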

  18. The Correlations
  . correlate API13 MEALS AVG_ED P_EL P_GATE EMER DMOB PCT_AA PCT_AI PCT_AS PCT_FI PCT_HI PCT_PI PCT_MR
  (obs=10082)

               |    API13    MEALS   AVG_ED     P_EL   P_GATE     EMER     DMOB   PCT_AA   PCT_AI   PCT_AS
  -------------+------------------------------------------------------------------------------------------
         API13 |   1.0000
         MEALS |  -0.4876   1.0000
        AVG_ED |   0.6736  -0.8232   1.0000
          P_EL |  -0.3039   0.6149  -0.6526   1.0000
        P_GATE |   0.2827  -0.1631   0.2126  -0.1564   1.0000
          EMER |  -0.0987   0.0197  -0.0407  -0.0211  -0.0541   1.0000
          DMOB |   0.5413  -0.0693   0.2123   0.0231   0.2198  -0.0487   1.0000
        PCT_AA |  -0.2215   0.1625  -0.1057  -0.0718   0.0334   0.1380  -0.1306   1.0000
        PCT_AI |  -0.1388   0.0461  -0.0246  -0.1510  -0.0812   0.0180  -0.1138  -0.0684   1.0000
        PCT_AS |   0.3813  -0.3031   0.3946  -0.0954   0.2321  -0.0247   0.1620  -0.0475  -0.0902   1.0000
        PCT_FI |   0.1646  -0.1221   0.1687  -0.0526   0.1281   0.0007   0.1203   0.0578  -0.0788   0.2485
        PCT_HI |  -0.4301   0.6923  -0.8007   0.7143  -0.1296  -0.0192  -0.0193  -0.0911  -0.1834  -0.3733
        PCT_PI |  -0.0598   0.0533  -0.0228   0.0286   0.0091   0.0315  -0.0202   0.2195  -0.0311   0.0748
        PCT_MR |   0.1468  -0.3714   0.3933  -0.3322   0.0052   0.0102  -0.0928  -0.0053   0.0667   0.0904

               |   PCT_FI   PCT_HI   PCT_PI   PCT_MR
  -------------+------------------------------------
        PCT_FI |   1.0000
        PCT_HI |  -0.1488   1.0000
        PCT_PI |   0.2769  -0.0763   1.0000
        PCT_MR |   0.0928  -0.4700   0.0611   1.0000

  19. . regress API13 MEALS AVG_ED P_EL P_GATE EMER DMOB if AVG_ED>0 & AVG_ED<6, beta

        Source |       SS       df       MS              Number of obs =   10082
  -------------+------------------------------           F(  6, 10075) = 2947.08
         Model |  65503313.6     6  10917218.9           Prob > F      =  0.0000
      Residual |  37321960.3 10075  3704.41293           R-squared     =  0.6370
  -------------+------------------------------           Adj R-squared =  0.6368
         Total |   102825274 10081  10199.9081           Root MSE      =  60.864

  ------------------------------------------------------------------------------
         API13 |      Coef.   Std. Err.      t    P>|t|                     Beta
  -------------+----------------------------------------------------------------
         MEALS |   .1843877   .0394747     4.67   0.000                 .0508435
        AVG_ED |   92.81476   1.575453    58.91   0.000                 .6976283
          P_EL |   .6984374   .0469403    14.88   0.000                 .1225343
        P_GATE |   .8179836   .0666113    12.28   0.000                 .0769699
          EMER |  -1.095043   .1424199    -7.69   0.000                 -.046344
          DMOB |   4.715438   .0817277    57.70   0.000                 .3746754
         _cons |   52.79082   8.491632     6.22   0.000                        .
  ------------------------------------------------------------------------------

  . regress API13 MEALS AVG_ED P_EL P_GATE EMER DMOB PCT_AA PCT_AI PCT_AS PCT_FI PCT_HI PCT_PI PCT_MR if AVG_ED>0 & AVG_ED<6, beta

        Source |       SS       df       MS              Number of obs =   10082
  -------------+------------------------------           F( 13, 10068) = 1488.01
         Model |    67627352    13     5202104           Prob > F      =  0.0000
      Residual |  35197921.9 10068  3496.01926           R-squared     =  0.6577
  -------------+------------------------------           Adj R-squared =  0.6572
         Total |   102825274 10081  10199.9081           Root MSE      =  59.127

  ------------------------------------------------------------------------------
         API13 |      Coef.   Std. Err.      t    P>|t|                     Beta
  -------------+----------------------------------------------------------------
         MEALS |    .370891   .0395857     9.37   0.000                 .1022703
        AVG_ED |   89.51041   1.851184    48.35   0.000                 .6727917
          P_EL |   .2773577   .0526058     5.27   0.000                 .0486598
        P_GATE |   .7084009   .0664352    10.66   0.000                 .0666584
          EMER |  -.7563048   .1396315    -5.42   0.000                 -.032008
          DMOB |   4.398746   .0817144    53.83   0.000                  .349512
        PCT_AA |  -1.096513   .0651923   -16.82   0.000                -.1112841
        PCT_AI |  -1.731408   .1560803   -11.09   0.000                -.0718944
        PCT_AS |   .5951273   .0585275    10.17   0.000                 .0715228
        PCT_FI |   .2598189   .1650952     1.57   0.116                 .0099543
        PCT_HI |   .0231088   .0445723     0.52   0.604                 .0066676
        PCT_PI |  -2.745531   .6295791    -4.36   0.000                -.0274142
        PCT_MR |  -.8061266   .1838885    -4.38   0.000                -.0295927
         _cons |   96.52733   9.305661    10.37   0.000                        .
  ------------------------------------------------------------------------------

  20. Special Schools (Outliers)

  GOOD ONES
    Residual   Name                                       Tested/Enrolled
    506.0523   Muir Charter                               78/78
    488.5563   SIATech                                    65/66
    342.7693   Escuela Popular/Center for Training and    88/91
    280.2587   YouthBuild Charter School of California    78/78
    246.7804   Oakland Charter Academy                    238/238
    232.4897   Oakland Charter High                       146/146
    230.0739   Opportunities For Learning - Baldwin Par   1434/1442

  BAD ONES
    Residual   Name                                       Tested/Enrolled
   -399.4998   Sierra Vista High (SD)                     14/15
   -342.2773   Baden High (Continuation)                  73/73
   -336.5667   Dover Bridge to Success                    84/88
   -322.1879   Millennium High Alternative                43/49
   -318.0444   Aurora High (Continuation)                 128/131
   -315.5069   Sunrise (Special Education)                34/34
   -311.1326   Nueva Vista High                           20/28
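  A table like this can be generated from the residuals of the regression on slide 19; a minimal sketch, assuming the data contain a school-name variable (SNAME here is an invented name, as the actual variable is not shown):

  . predict r, residuals          // residuals from the last regression
  . gsort -r                      // largest positive residuals first
  . list SNAME r in 1/7           // over-performers ("good ones")
  . gsort r                       // largest negative residuals first
  . list SNAME r in 1/7           // under-performers ("bad ones")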

  21. Multiple Regression Weighted by the Number of Test Takers (TESTED)
  . regress API13 MEALS AVG_ED P_EL P_GATE EMER DMOB PCT_AA PCT_AI PCT_AS PCT_FI PCT_HI PCT_PI PCT_MR if AVG_ED>0 & AVG_ED<6 [aweight = TESTED], beta
  (sum of wgt is 9.0302e+06)

        Source |       SS       df       MS              Number of obs =   10082
  -------------+------------------------------           F( 13, 10068) = 2324.54
         Model |  41089704.2    13  3160746.48           Prob > F      =  0.0000
      Residual |  13689769.3 10068  1359.73076           R-squared     =  0.7501
  -------------+------------------------------           Adj R-squared =  0.7498
         Total |  54779473.6 10081   5433.9325           Root MSE      =  36.875

  ------------------------------------------------------------------------------
         API13 |      Coef.   Std. Err.      t    P>|t|                     Beta
  -------------+----------------------------------------------------------------
         MEALS |   .2401007    .032364     7.42   0.000                 .0828479
        AVG_ED |   83.84621   1.444873    58.03   0.000                 .8044588
          P_EL |   .1605591   .0405248     3.96   0.000                 .0306712
        P_GATE |   .2649964   .0443791     5.97   0.000                 .0317522
          EMER |  -1.527603   .1503635   -10.16   0.000                -.0513386
          DMOB |   3.414537   .0834016    40.94   0.000                 .2212861
        PCT_AA |  -1.275241   .0583403   -21.86   0.000                -.1301146
        PCT_AI |   -1.96138   .2143326    -9.15   0.000                -.0499468
        PCT_AS |   .4787539   .0368303    13.00   0.000                  .082836
        PCT_FI |  -.0272983   .1113346    -0.25   0.806                -.0013581
        PCT_HI |   .0440935   .0351466     1.25   0.210                 .0158328
        PCT_PI |  -2.464109   .5116525    -4.82   0.000                -.0271533
        PCT_MR |  -.5071886   .1678521    -3.02   0.003                -.0187953
         _cons |   220.2237   9.318893    23.63   0.000                        .
  ------------------------------------------------------------------------------

  22. Best Linear Unbiased Estimate (BLUE)
  • Characteristics of OLS if the sample is a probability sample:
  • Unbiased: E(b) equals the population value – the mean of the sample values is the population value.
  • Efficient: minimum variance of b – the sample values are as close to each other as possible.
  • Consistent: as the sample size (n) approaches infinity, the sample value converges on the population value.
  • These hold if the following assumptions are met:
  • The model is complete, linear, and additive.
  • The variables are measured at an interval or ratio scale, and without error.
  • The regression error term is normally distributed, has an expected value of 0, the errors are independent and homoscedastic, and the predictors are unrelated to the error.
  • In a system of interrelated equations, the errors are unrelated to each other.
