Download Presentation
## Sections 2.1-2.2

- - - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - - -

**Sections 2.1-2.2**Looking at Data-Relationships**Data with two or more variables:**• Response vs Explanatory variables • Scatterplots • Correlation • Regression line**Association between a pair of variables**• Association: Some values of one variable tend to occur more often with certain values of the other variable • Both variables measured on same set of individuals • Examples: • Height and weight of same individual • Smoking habits and life expectancy • Age and bone-density of individuals**Causation?**• Caution: Often there are spurious, other variables lurking in the background • Shorter women have lower risk of heart attack • Countries with more TV sets have better life expectancy rates • More deaths occur when ice cream sales peak • Just explore association or investigate a causal relationship?**Preliminaries:**• Who are the individuals observed? • What variables are present? • Quantitative or categorical? • Association measures depend on types of variables • Response variable measures outcome of interest • Explanatory variable explains and sometimes causes changes in response variable**Examples**• Different amount of alcohol given to mice, body temperature noted (belief: drop in body temperature with increasing amount of alcohol) • Response variable? • Explanatory variable? • SAT scores used to predict college GPA • Response variable? • Explanatory variable?**Examples**• Does fidgeting keep you slim? • Some people don’t gain weight even when they overeat. Perhaps fidgeting and other “nonexercise activity” explains why, here is the data: • We want to plot Y vs. X • Which is Y? • Which is X?**Things to look for on scatterplot:**• Form (linear, curve, exponential, parabola) • Direction: • Positive Association: Y increases as X increases • Negative Association: Y decreases as X increases • Strength: Do the points follow the form quite closely or scattered? • Outliers: deviations from overall relationship • Let’s look again…**Example: State mean SAT math score plotted against the**percent of seniors taking the exam**Adding a categorical variable or grouping**• May enhance understanding of the data • Categorical variable is (region): • “e” is for northeastern states • “m” is for midwestern states • All others states excluded**Other things:**• Plotting different categories via different symbols may throw light on data • Read examples 2.7-2.9 for more examples of scatterplots • Existence of a relationship does not imply causation • SAT math and SAT verbal scores have strong relationship • But a person’s intelligence is causing both • The relationship does not have to hold true for every subject, it is random**2.2 Correlation**• Linear relationships are quite common • Correlation coefficient r measures strength and direction of a linear relationship between two quantitative variables X and Y • Data structure: (X,Y) pairs measured on n individuals • Weight and blood pressure • Age and bone-density**Correlation (r)**• Lies between -1 and 1 • If switch roles of X and Y r remains the same • Unit free, unaffected by linear transformation • Positive correlation means positive association • negative correlation means negative association • X and Y should both be quantitative • r near 0 implies weak (or no) linear relationship; closer to +1 or -1 suggests very strong linear pattern**Formula:**• Calculation: • Usually by software or calculator • Calculate means and standard deviations of data • Standardize X and Y: • take off respective mean • divide by corresponding standard deviation • Take products of X(standardized)*Y (standardized) for each subject • Add up and divide by n-1**Issues:**• r is affected by outliers • Captures only the strength of the “linear” relationship • it could be true that Y and X have a very strong non-linear relationship but r is close to zero • r = +1 or -1 only when points lie perfectly on a straight line. (Y=2X+3) • SAS program: correlation.doc • proc corr is the procedure**Summary**• Scatterplots: look for form, direction, strength, outliers • Correlation: Numerical measure capturing direction and strength of a linear relationship • Sign of r: direction • Value of r: strength • Always: Plot the data, look at other descriptive measures along with the correlation**Sections 2.3-2.4**Looking at Data-Relationships**2.3 Regression Line**• Straight line which describes best how the response variable y changes when the explanatory variable x changes • We do distinguish between Y and X cannot switch their roles • Equation of straight line: y = a + b x • a is the intercept (where it crosses the y-axis) • b is the slope (rate) • Procedure • calculate best a and b for your data • Find the line that best fits your data • Use this line to predict y for different values of x**Example:Regression line for NEA data. We can predict the**mean fat gain at 400 calories**Prediction and Extrapolation**• Fitted line for NEA data: Pred. fat gain = 3.505 – 0.00344(NEA) • Prediction at 400 calories: Pred. fat gain = 3.505 – 0.00344*400 = 2.13 kg • So when a person’s NEA increases by 400 calories when they overeat, they will have a predicted fat gain of 2.13 kilograms.**Prediction and Extrapolation**• Warning: Extrapolation--predicting beyond the range of the data--is dangerous! • Prediction at 1500 calories Pred. fat gain = 3.505 – 0.00344*1500 = -1.66 kg • So predicting for a 1500 NEA increase when overeating, the prediction is that they will lose 1.66 kilograms of weight • Not trustworthy • Far outside the range of the data**Least Squares Regression (LSR) Line**• The line which makes the sum of squares of the vertical distances of the data points from the line as small as possible • y is the observed (actual) response • ŷ is the predicted response by using the line • Residuals • Error in prediction • y – ŷ**Formula for Least Squares Regression line**Given (explanatory x, response y):**Using the formula:**• Slope: b = -.7786 * 1.1389/257.66 = -0.00344 • Intercept: a = (mean of y) – slope * (mean of x) = 2.388 – (-0.00344)*324.8 = 3.505 • Regression line: Predicted fat gain = 3.505 – 0.00344*cal ŷ = 3.505 – 0.00344x**Example: Predicted values and Residuals**• Predicted fat gain for observation 2 (-57 cal.) ŷ2 = 3.505 – 0.00344*(-57) = 3.70108 kg • Observed fat gain: y2 = 3.0 kg • Residual or error in prediction = y2 - ŷ2 = 3.0 – 3.70108 = -0.70108 kg**Residual practice**• Residual is yi – ŷi • For NEA data observation 14 has NEA = x14 = 580 • Find the predicted value, ŷ14 • Find the residual, y14 - ŷ14**Properties of regression line**• Cannot switch Y and X • Passes through the mean of x and mean of y • Physical interpretation of the slope b: • with one unit increase in X, how much does Y change on average? • Example: NEA data: with 1 calorie increase in NEA, fain gain changes by -0.00344 kg • How about 100 increase in NEA?**Properties (cont.)**• Sign of slope (b) is sign of correlation (r) • captures the direction of linear association • Slope b is affected by scale change but not by a shift (adding or subtracting a constant from all data points) • Convert: X from months to years • Let’s say the slope is 5, when using months • What would the slope be if we used years for X instead? • If Y increases by 5 per month, it’ll increase by ? per year?**Using software**• SAS will evaluate the least squares regression line but you have to know where to find them in the output • Residuals and predicted values are also printed • SAS program : regression.doc • the regression procedure is proc reg • We will do a deeper analysis of regression in chapter 10**Correlation and Regression**• In correlation, X and Y are interchangeable, NOT so in regression. • Slope (b), depends on correlation (r) • R2—Coefficient of Determination • Square of correlation • Fraction of variation in y explained by LSR line • Higher R2 suggests better fit • Example: R2 = 0.6062 for NEA data • means that 60.62% of the unexplained variation in fat gain is explained by your fitted regression line with x = NEA.**R2—another example**• Explains the part of the variation of y which comes from the linear relationship between y and x. In this case between Height and Age. less spread tight fit R2 = 0.989 more scatter more error in prediction R2 = 0.849**2.4 Caveats about correlation & regression**• Residuals can tell us whether we have a good fit • Residual = observed y - predicted y • Residual plot: plot of residuals vs x • Used to assess the fit of regression line • Residuals add up to zero and have a mean of zero • Thus, a fit is considered good if the plot shows a random spread of points about the zero line but without any definitive pattern**Residual plot**• Scatterplot of residuals against explanatory variable • Helps assess the fit of regression line**Outliers and influential observations**• Outliers: Lies outside the pattern of other observations • Y-outliers: large residual • X-outliers: often influential in regression • Influential points: Deleting this point changes your statistical analysis drastically • pull the regression line towards themselves • Least squares regression is NOT robust to presence of outliers**Example: Gesell data**• r = 0.4819 • Subject 15: • Y-outlier • Far from line • High residual • Subject 18: • X-outlier • Close to line • Small residual**Example: Gesell data**• r = 0.4819 • Drop 15: • r = 0.5684 • Drop 18: • r = 0.3837 • Both have some influence, but neither seems excessive**Causation**• Association does not imply causation! • An association between x and y, even if it is very strong, is not itself good evidence that changes in x actually cause changes in y. • Causation: Variable X directly causes a change in Variable Y • Example: • X = plant food • Y = plant’s growth**Common Response**• Other variables may affect the relationship between X and Y • Beware of lurking variables • Example: for children, • X = height • Y = Math Score • Z = Age**Confounding**• Other variables may affect the relationship between X and Y • Can’t separate effects of X and Z on Y • Example: • X = number years of education • Y = income • Z = ??