Unit 2: Correlation and Causality. The S-030 roadmap: Where’s this unit in the big picture? Unit 1: Introduction to simple linear regression. Unit 2: Correlation and causality. Unit 3: Inference for the regression model. Building a solid foundation. Unit 5: Transformations
[Figure: the S-030 roadmap. Surviving labels: "simple linear regression"; "Inference for the"; "to achieve linearity"; "Evaluating their tenability"; "Adding additional predictors"; "The basics of"; "Statistical control in depth: Correlation and collinearity"; "Generalizing to other types of predictors and effects"; "Categorical predictors I: Dichotomies"; "Categorical predictors II: Polychotomies"]
Read the Pediatrics article
Listen to the NPR Interview
Heredity: relationships between siblings and spouses (Pearson & Lee, 1903, On the laws of inheritance in man, Biometrika)
Learn more about Karl Pearson
A particular transformation that yields a new variable with mean = 0 and sd = 1
FostIQ: mean = 98.11, sd = 15.21
ID   OwnIQ   SOwnIQ   FostIQ   SFostIQ
 1     68   -1.9985     63    -2.3080
 2     71   -1.7943     76    -1.4535
 . . .
25     95   -0.1606     96    -0.1389
26     96   -0.0925     93    -0.3361
 . . .
52    129    2.1539    117     1.2415
53    131    2.2900    132     2.2274
OwnIQ: mean = 97.36, sd = 14.69
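As a minimal sketch of this standardization, the SAS data step below applies z = (value - mean) / sd to the six twin pairs listed in the table. It hard-codes the full-sample means and standard deviations from the slide, pairing 97.36 / 14.69 with OwnIQ and 98.11 / 15.21 with FostIQ, since that pairing is the one that reproduces the tabled z-scores. The dataset name `excerpt` is an assumption made for illustration, as is the choice to hard-code the summary statistics.

```sas
/* Six of the 53 twin pairs, copied from the table (an excerpt, not the full data) */
data excerpt;
  input ID OwnIQ FostIQ;
  /* Standardize: subtract the full-sample mean, divide by the full-sample sd */
  SOwnIQ  = (OwnIQ  - 97.36) / 14.69;
  SFostIQ = (FostIQ - 98.11) / 15.21;
  datalines;
1 68 63
2 71 76
25 95 96
26 96 93
52 129 117
53 131 132
;
run;

/* Print the raw and standardized scores side by side */
proc print data=excerpt noobs;
  format SOwnIQ SFostIQ 8.4;
run;
```

The printed values should agree with the tabled SOwnIQ and SFostIQ to within about 0.001. The small differences arise because the slide's means and sds are themselves rounded. With the complete data file, `proc standard` with `mean=0 std=1` could compute the means and sds directly instead.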
Any re-expression of a variable’s scale
The slope of the standardized regression line assesses the estimated difference in FostIQ (measured in standard deviation units) per standard deviation difference in OwnIQ
Standardized regression line goes precisely through (0,0)
At average X (SOwnIQ=0), we predict average Y (SFostIQ=0)
Pearson product-moment correlation coefficient, r
Does 0.8767 seem familiar?
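The link between the standardized slope and r can be checked directly. The sketch below standardizes both variables with `proc standard`, regresses standardized FostIQ on standardized OwnIQ, and then asks `proc corr` for r on the raw scores. In a simple regression the two numbers coincide and the intercept is zero. Only the six twin pairs shown earlier are used, as a stand-in for the full data, so the value printed here will not be the 0.8767 quoted on the slide. The dataset names `excerpt` and `zscores` are assumptions.

```sas
/* Stand-in data: the six twin pairs shown in the table, not the full 53 */
data excerpt;
  input ID OwnIQ FostIQ;
  datalines;
1 68 63
2 71 76
25 95 96
26 96 93
52 129 117
53 131 132
;
run;

/* Rescale both variables to mean 0 and sd 1; the variable names are kept */
proc standard data=excerpt mean=0 std=1 out=zscores;
  var OwnIQ FostIQ;
run;

/* Regression on the standardized variables: the slope equals r and the
   intercept is 0, apart from floating-point noise */
proc reg data=zscores;
  model FostIQ = OwnIQ;
run;
quit;

/* Pearson correlation on the raw scores, for comparison with the slope */
proc corr data=excerpt;
  var OwnIQ FostIQ;
run;
```

Running the same three steps on the full Burt dataset is how one would check whether 0.8767 is both the standardized slope and the correlation.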
Cool interactive applet for learning more about correlation
Not uncommon in the social sciences, but when r < .2, you have very little explanatory power (R² < 4%)
Covers most “statistically significant” correlations in social sciences, but even when r = .5, you’re only explaining 25% of the variance in Y
Rare in the social sciences, and even when r = .7, you’re still explaining less than ½ the variance in Y
Extremely rare in the social sciences, unless you have aggregate data or a coding problem(!)
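The percentages quoted above come from squaring r. As a small illustration, the data step below squares the three correlations mentioned in these notes; the dataset and variable names are assumptions.

```sas
/* Convert each correlation into the percentage of variance in Y it explains */
data r2;
  input r;
  r_squared     = r**2;             /* proportion of variance explained */
  pct_explained = 100 * r_squared;  /* same quantity as a percentage */
  datalines;
0.2
0.5
0.7
;
run;

proc print data=r2 noobs;
run;
```

The three rows give 4%, 25% and 49%, matching the "R² < 4%", "25%" and "less than ½" statements.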
Another way of thinking about r is as a measure of effect size
You have a sound theory to explain how a change in the predictor produces a change in the outcome
You find the same result in other populations, with different characteristics, at different times
What do we really mean when we say:
“I interpreted…Galton…to mean that there was a category broader than causation, namely correlation…and that this new conception of correlation brought psychology, anthropology, medicine, and sociology in large parts into the field of mathematical treatment. It was Galton who first freed me from the prejudice that sound mathematics could only be applied to natural phenomenon under the category of causation. Here for the first time was a possibility, I will not say a certainty, of reaching knowledge—as valid as physical knowledge was then thought to be—in the field of living forms and above all in the field of human conduct.”
Karl Pearson, 1889
Four criteria for establishing causality
You demonstrate that a change in the predictor actually produces a change in the outcome
No plausible alternative explanation
There’s no rival predictor that can explain away the observed correlation
Highest priorities for design and analysis
and often the hardest to establish
Counterfactual reasoning provides a powerful lens for thinking about these questions
You’d like to know what outcome values these individuals would have had if they had received a “different treatment” (i.e., if they had different predictor values)
Narrative development in bilingual kindergarteners: Can Arthur help? Yuuko Uchikoshi (2005) Developmental Psychology
RQ: Can narrative skills be ‘taught’ via TV to English Language Learners?
Four important attributes of randomized experiments
The researcher actively intervened in the system, actually changing X (the treatment) and seeing what happens to Y
Because of random assignment, groups are guaranteed to be initially equivalent, on average, on all observable (and unobservable) characteristics
The control group provides the ideal counterfactual—our best estimate of what the treatment group would have looked like if it didn’t receive the treatment
Any difference found in Y must be due to the changing of X (the treatment) because there’s no other plausible explanation
“You can’t fix by analysis what you bungled by design…”
Light, Singer and Willett (1990)
Ethics: Morally, there are some treatments to which you can’t expose people
Does radiation cause cancer?
Many would argue that these
can’t be “causes”
When participants (or even researchers) choose, the conclusions are weaker because they’re subject to selection bias
Feasibility: Logistically, there are some treatments to which you can’t assign people
Does education cause increased income?
Time: Practically, some information is better than no information
Does quality child care cause better life outcomes?
Matching (especially propensity score matching) is very popular now
Availability of data: With so much data, shouldn’t we analyze it?
Let’s think about how you might go about doing this
US Committee on Gov’t Reform: “When forced to take legally binding positions, the tobacco industry still does not accept scientific consensus … that…cigarettes cause disease in smokers [and] that environmental tobacco smoke causes disease in nonsmokers.”
Read the Waxman (2002) report Tobacco industry statements in the Department of Justice Lawsuit
But just because we haven’t done an experiment doesn’t mean the correlation isn’t causal
Sample Tobacco Industry Statements
17 September 1982
Heart Attack Study Finds Men Heeding Health Advice Better
A federally financed study of 12,866 men -- half exhorted to improve their health habits and half getting only "usual care" from their doctors -- has produced an unexpected result: Both groups had the same rate of heart attacks, but it was only one-fourth the rate of the general population of the same age.
What happened [is that] almost all Americans were reading and hearing advice to smoke less, eat fewer fats and lower their cholesterol level and blood pressure. Exhorted or not, most of the men in the study and their doctors apparently got the same message, and did even better than the average American.
Find Article on LexisNexis
But not all spurious correlations
SES is often the “3rd variable”
It is easy to prove that the wearing of tall hats and the carrying of umbrellas enlarges the chest, prolongs life, and confers comparative immunity from disease… A university degree, a daily bath, the owning of thirty pairs of trousers, a knowledge of Wagner’s music, a pew in church, anything, in short, that implies more means and better nurture…can be statistically palmed off as a magic spell conferring all sorts of privileges…The mathematician whose correlations would fill a Newton with admiration, may, in collecting and accepting data and drawing conclusions from them, fall into quite crude errors by just such popular oversights --George Bernard Shaw (1906)
There’s a 3rd variable, Z, which causes changes in X and may—or may not—also cause change in Y
Yule (1899) An investigation into the causes of changes in pauperism in England
Yule’s footnote 25
“Strictly speaking, for ‘due to’ read ‘associated with.’ ”
25 Feb 1993
Crack cocaine study faulted on race factor
A study carried out four years ago has created the false perception that crack cocaine smoking is more common among blacks and Hispanics than among white Americans, say scientists who reanalyzed the findings in a new report.
The 1988 National Household Survey on Drug Abuse said that rates of crack use among blacks and Hispanics were twice as high as among whites. But the study failed to take into account social factors such as where the users lived and how easily the drug could be obtained, according to researchers writing in yesterday's issue of the Journal of the American Medical Association. The authors, from Johns Hopkins University, said that when adjusted for those factors, the study found equivalent use of crack among blacks, Hispanics and whites.
"Researchers have the responsibility to go beyond the reporting of racial and ethnic differences" because such findings "are often presented as if a person's race has intrinsic explanatory power," the authors wrote.
There’s a 3rd variable, Z, which is correlated with X and which causes changes in Y, but we don’t know if this explains away the correlation between X & Y
Find Article on LexisNexis
Some confounders don’t just ‘explain away’ the association, they reveal a reversal in the direction of the effect
r = -0.56
Sex bias in graduate admissions:
UC Berkeley (1973)
Learn more about Simpson’s paradox
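To see how such a reversal can arise, here is a minimal sketch with made-up counts. These are NOT the Berkeley admissions data: the two departments and every number are hypothetical, chosen only so that the arithmetic is easy to follow. Within each department women are admitted at a higher rate than men, yet the pooled table shows the opposite, because most women applied to the department that admits few applicants.

```sas
/* Hypothetical admissions counts, invented for illustration only */
data admit;
  input dept $ sex $ admitted $ count;
  datalines;
A M yes 80
A M no 20
A F yes 18
A F no 2
B M yes 4
B M no 16
B F yes 30
B F no 70
;
run;

proc freq data=admit;
  weight count;  /* each row stands for COUNT applicants */
  /* Pooled over departments: row percents give the admission rate by sex */
  tables sex*admitted / nopercent nocol;
  /* Separately by department: the direction of the sex difference flips */
  tables dept*sex*admitted / nopercent nocol;
run;
```

In the pooled table the admission rates are 70% for men (84 of 120) and 40% for women (48 of 120). By department they are 80% vs. 90% in A and 20% vs. 30% in B, with women higher in both.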
X may cause Y or Y may cause X—with the data we have, we just can’t tell
Robinson, W.S. (1950). Ecological Correlations and the Behavior of Individuals. American Sociological Review 15: 351–357.
(Rural, low foreign born, but lots of illiterates)
Correlations with Illiteracy
Unit of analysis
(Urban, lots of foreign born, but also lots of very literate folks)
Aggregate data describe aggregate relationships, not individual level relationships
In recent years, there has been an explosion of interest in the conditions necessary for establishing causal inferences. Different disciplines use different standards and approaches, and there is much to learn by reading broadly. Here are some sources that I find particularly interesting and insightful:
Don’t forget the semicolon at the end of every statement;
options nodate nocenter nonumber;
title1 "Unit 2: IQs of Cyril Burt's identical twins";
footnote1 "Program: Unit 2--Burt analysis.sas";

/* Be sure to update the infile reference to the file's location on your computer */
/* Input Burt data and name variables in dataset */
data one;
  infile "<path to the Burt data file>";  /* placeholder: replace with your own path */
  input ID 1-2 OwnIQ 4-6 FostIQ 8-10;
run;

/* Estimate bivariate correlation between owniq & fostiq (Pearson correlation coefficient) */
proc corr data=one;
  var owniq fostiq;
run;
Don’t forget to specify the location of the raw data, and check that you are indicating the appropriate drive
proc corr estimates bivariate correlations between the variables you specify. Its var statement syntax is var1 var2 var3 … varn (note that it has neither an * (like proc gplot) nor an = (like proc reg))