Chapter 4 Describing the Relation between Two Variables

Section 4.1 Scatter Diagrams and Correlation

The response (dependent) or “output” variable is the variable whose value can be “predicted” or explained by the value of the explanatory/predictor (independent) or “input” variable.

A scatter diagram is a graph that shows the relationship between two quantitative variables. The predictor (independent / “x”) variable is plotted on the horizontal axis, and the response (dependent / “y”) variable is plotted on the vertical axis.

EXAMPLE Drawing and Interpreting a Scatter Diagram The data to the right are based on a study for drilling thru rock. The researchers wanted to determine whether the time it takes to drill thru 5 feet of rock increases with the depth at which the drilling begins. Depth at which drilling begins is the predictor variable, “x”, and time (min) to drill five feet is the response variable “y”. Draw a scatter diagram of the data.

Various Types of Relations in a Scatter Diagram

Two variables that are linearly related are positively correlated when higher values of one variable are associated with higher values of the other (positive slope), and lower values of one variable are associated with lower values of the other. That is, two variables are “positively correlated” if, as one variable increases, the other variable also increases.

Two variables that are linearly related are negatively correlated when higher values of one variable are associated with lower values of the other (negative slope), and lower values of one variable are associated with higher values of the other. That is, two variables are “negatively correlated” if, as one variable increases, the other variable decreases.

The linear correlation coefficient or Pearson Correlation Coefficient is a measure of the strength and direction of the linear relation between two quantitative variables. The Greek letter “ρ” (rho) represents the population correlation coefficient, and “r” represents the sample correlation coefficient.

Linear Correlation Coefficient: a measure of the strength and direction of a linear relationship between two variables. The range of r is from –1 to +1. If r is close to –1, there is a strong negative correlation. If r is close to +1, there is a strong positive correlation. If r is close to 0, there is no linear correlation.

Properties of the Pearson Correlation Coefficient
• –1 ≤ r ≤ 1.
• If r = +1, then a perfect positive correlation exists between the two variables.
• If r = –1, then a perfect negative correlation exists between the two variables.
• The closer r is to +1, the stronger the positive correlation between the two variables.
• The closer r is to –1, the stronger the negative correlation between the two variables.
• If r is close to 0, then little or no evidence exists of linear correlation between the two variables. So r close to 0 does not imply no relation, just no linear relation.
• The “r” coefficient is dimensionless.
• The Pearson correlation coefficient is not resistant. Therefore, just one observation that does not follow the overall data pattern (think outlier) could affect the value of r.
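The properties above can be checked numerically. The following is a minimal sketch, assuming the standard sample formula for r (sum of cross-deviations divided by the square root of the product of the sums of squared deviations). The function name `pearson_r` and all data values are hypothetical; they are not the drilling data.

```python
import math

def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient r for paired data."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length lists with at least 2 pairs")
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sums of cross-deviations and squared deviations from the means.
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data, for illustration only.
x = [1, 2, 3, 4, 5]
y_up = [2, 4, 6, 8, 10]         # perfectly linear, positive slope
y_down = [10, 8, 6, 4, 2]       # perfectly linear, negative slope
y_outlier = [2, 4, 6, 8, -10]   # y_up with its last value replaced by an outlier

r_up = pearson_r(x, y_up)
r_down = pearson_r(x, y_down)
r_outlier = pearson_r(x, y_outlier)
# Rescaling x (a change of units) should leave r unchanged: r is dimensionless.
r_rescaled = pearson_r([12 * value for value in x], y_up)
print(r_up, r_down, r_outlier, r_rescaled)
```

In this sketch a single outlier is enough to turn a perfect positive correlation into a negative one, which is what "not resistant" means in practice.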

EXAMPLE Determining the Pearson Correlation Coefficient • Determine the Pearson correlation coefficient “r” of the drilling data: • By algebra (boo!) • By calculator (yeah!)

Better way …
• Enter all the “x” data in List 1, and all the “y” data in List 2. Make sure you keep the related pairs of x,y data in the same order.
• Set your calculator to “Diagnostics On”.
• Go to STAT: Calc: 4: LinReg: L1, L2.
• Look for the value of Pearson “r”.

TI-84 Line of Regression (LOR)
• X data (horiz) to (L1), Y (vert) data to (L2)
• STAT PLOT: Plot 1: Scatter Plot
• Zoom: 9: Stat
• STAT: Calc: 4: LinReg(ax+b): L1, L2, Y1
This will generate the LOR on the Stat Plot thru the points, and will show the equation at Y1. To predict the “y” value when x = 9, find Y1(9). Y1 is found at VARS, Y-VARS, Func, Y1.

Testing for a Linear Relation
1. Determine the absolute value of the Pearson correlation coefficient: |r|.
2. Find the critical value in Table II from Appendix A (or handout) for the given sample size.
3. If |r| is greater than the critical value, then a usable (make predictions) linear relation exists between the two variables. Otherwise, no linear relation exists.

EXAMPLE Does a Linear Relation Exist? Determine whether a linear relation exists between time and depth of the drilling. What type of relation appears to exist between time to drill five feet and depth at which drilling begins? The Pearson |r| value for the two variables (time/depth) is 0.773. The critical value for n = 12 observations is 0.576. Since 0.773 > 0.576, there is a positive linear correlation between time to drill five feet and depth at which drilling begins. We can use this correlation to make predictions.
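The comparison in this example can be sketched in a few lines. This is only an illustration: the function name `linear_relation_exists` is hypothetical, and the critical value is not computed here. It still has to be looked up in the table for the given sample size.

```python
def linear_relation_exists(r, critical_value):
    """Return True when |r| exceeds the table critical value."""
    return abs(r) > critical_value

# Values quoted in the example: r = 0.773, critical value 0.576 for n = 12.
r = 0.773
critical_value = 0.576

exists = linear_relation_exists(r, critical_value)
direction = "positive" if r > 0 else "negative"
print(exists, direction)
```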

Another way that two variables can be related even though there is not a causal relation is through a “lurking variable”. A lurking variable is related to both the explanatory and response variable. For example, ice cream sales and crime rates have a very high positive correlation. Does this mean that sales of ice cream cause crime rates to go up? No. The lurking variable is temperature. As temperatures rise, both ice cream sales and crime rates rise.

Something to remember… Correlation between variables does not imply “causation” (the independent causes the dependent) unless the results come from a controlled experiment. Correlation of variables in an observational study only implies “association” between the variables and not “causation” of one by the other.

Section 4.2 Least-squares Regression

EXAMPLE Finding an Equation that Describes Linearly Correlated Data Find a linear equation that relates x (predictor variable) and y (response variable) by selecting any two points and finding the equation of the line between those points. Use points: (2, 5.7) and (6, 1.9)

Graph the equation on the scatter diagram. Use the equation to predict y if x = 3 Note: (3, 5.2) is actual data point

The difference between the observed value of y and the predicted value of y is the error, or residual. Using the line and the predicted value at x = 3, with actual data point (3, 5.2): residual = observed y – predicted y = 5.2 – 4.75 = 0.45 (error)
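The two-point line and the residual at x = 3 can be reproduced with a short sketch. The helper name `line_through` is hypothetical; the points (2, 5.7), (6, 1.9) and the observed point (3, 5.2) come from the example.

```python
def line_through(p1, p2):
    """Slope and intercept of the line through two points."""
    (x1, y1), (x2, y2) = p1, p2
    if x1 == x2:
        raise ValueError("a vertical line has no slope-intercept form")
    slope = (y2 - y1) / (x2 - x1)
    intercept = y1 - slope * x1
    return slope, intercept

# Points chosen in the example.
slope, intercept = line_through((2, 5.7), (6, 1.9))

# Predict y at x = 3 and compare with the observed point (3, 5.2).
predicted = slope * 3 + intercept
residual = 5.2 - predicted  # residual = observed y - predicted y
print(round(slope, 2), round(intercept, 2), round(predicted, 2), round(residual, 2))
```

The line works out to y = –0.95x + 7.6, which gives the predicted value 4.75 and the residual 0.45 quoted above.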

Least-Squares Regression Criterion The least-squares regression line (LOR or COBF) is the line that minimizes the sum of the squared errors (residuals). This LOR line minimizes the sum of the squared vertical distances between the observed values of y and those predicted by the line (“y-hat”). In other words: minimize Σ residuals²
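The criterion can be illustrated with a small sketch. It assumes the standard closed-form least-squares formulas (slope = Sxy / Sxx, intercept = ȳ – slope · x̄), which the slides leave to the calculator. The function names and the data are hypothetical; they are not the drilling data.

```python
def least_squares_line(xs, ys):
    """Slope and intercept of the least-squares regression line."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept

def sum_squared_residuals(xs, ys, slope, intercept):
    """Sum of (observed y - predicted y)^2 for a given line."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))

# Hypothetical data, for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

slope, intercept = least_squares_line(xs, ys)
sse_best = sum_squared_residuals(xs, ys, slope, intercept)
# Any other line, such as one with a slightly steeper slope, has a larger sum.
sse_other = sum_squared_residuals(xs, ys, slope + 0.1, intercept)
print(slope, intercept, sse_best, sse_other)
```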

Key Concepts
• LOR stands for “Line of Regression”.
• COBF stands for “Curve of Best Fit”.
• Both terms refer to the Least-Squares Regression Line and are used interchangeably.

EXAMPLE Finding the Least-squares Regression Line • Find the LOR line. • Predict the drilling time if drilling starts at 130 feet. • Is the observed drilling time at 130 feet above, or below predicted? • Draw the LOR on the scatter diagram of the data.

We agree to round the estimates of the slope and intercept to four decimal places. (b) The predicted drilling time is 7.035 minutes. (c) The observed drilling time is 6.93 minutes, which is below the predicted time. The LOR-predicted drilling time is 1.52% above observed.

Interpretation of the Slope: The slope of the LOR is 0.0116. Therefore, for each additional foot of depth at which drilling begins, the time to drill five feet increases by 0.0116 min (~ 0.7 sec), on average.

If the LOR is used to make predictions based on values of the predictor (independent) variable that are significantly outside the observed values, then the researcher is working outside the scope of the model. Never use an LOR to make predictions outside the scope of the model, because the linear relation may no longer hold there.
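One way to guard against this in code is to refuse predictions for x values outside the observed range. The sketch below is only an illustration: the function name `predict_within_scope`, the line, and the data are all hypothetical, and the strict range check is a simplification of “significantly outside”.

```python
def predict_within_scope(x, slope, intercept, observed_xs):
    """Predict y from the line, refusing x values outside the observed range."""
    lowest, highest = min(observed_xs), max(observed_xs)
    if not (lowest <= x <= highest):
        raise ValueError(
            f"x = {x} is outside the observed range [{lowest}, {highest}]"
        )
    return slope * x + intercept

# Hypothetical line and data, for illustration only.
observed_xs = [1, 2, 3, 4, 5]
slope, intercept = 2.0, 1.0

inside = predict_within_scope(3, slope, intercept, observed_xs)  # within scope
try:
    predict_within_scope(50, slope, intercept, observed_xs)
    outside_allowed = True
except ValueError:
    outside_allowed = False  # extrapolation was rejected
print(inside, outside_allowed)
```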

Section 4.3 Diagnostics on the Least-squares Regression (LOR) Line

The coefficient of determination, R², measures the proportion of total variation in the response variable that is explained by the LOR line. The coefficient of determination is a number between 0 and 1, inclusive: 0 ≤ R² ≤ 1. If R² = 0, the LOR has no predictive value. If R² = 1, 100% of the variation in the response variable is explained by the LOR.

Depth at which drilling begins is the predictor variable, “x” Time (min) to drill five feet is the response variable, y.

Sample Statistics
          Mean     Standard Deviation
Depth     126.2    52.2
Time      6.99     0.781

Correlation Between Depth and Time: 0.773

Regression Analysis
The regression equation (LOR) is: y (time) = 0.0116x (depth) + 5.53 (min)

Suppose we were asked to predict the time to drill an additional 5 feet, but we did not know the current depth of the drill. What would be our best “guess”? ANSWER: The mean time to drill additional 5 feet: 6.99 minutes (see Sample Statistics)

Now suppose that we are asked to predict the time to drill an additional 5 feet if we know that the current depth of the drill is 160 feet? ANSWER: Our “guess” increased from 6.99 minutes to 7.39 minutes because we knew the drill depth and the LOR equation.
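The two “guesses” can be compared with a short sketch. The slope, intercept, and mean time below are the rounded values reported in the regression output above; the function name `predict_time` is hypothetical.

```python
# Rounded values from the sample statistics and regression output.
SLOPE = 0.0116      # minutes per foot of starting depth
INTERCEPT = 5.53    # minutes
MEAN_TIME = 6.99    # mean time (min) to drill five feet

def predict_time(depth_ft):
    """LOR-predicted time (min) to drill five feet from a given depth."""
    return SLOPE * depth_ft + INTERCEPT

guess_without_depth = MEAN_TIME        # best guess when depth is unknown
guess_with_depth = predict_time(160)   # best guess when depth is 160 feet
print(guess_without_depth, round(guess_with_depth, 2))
```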

Definitions
• The “observed” value of the response (dependent) variable: y
• The “predicted” value (by the LOR) of the response variable: ŷ (“y-hat”)
• The “mean” value (of all the “y” values) of the response variable: ȳ (“y-bar”)

Total Deviation = Unexplained Deviation + Explained Deviation

Total Variation = Unexplained Variation + Explained Variation
R² = Explained Variation / Total Variation = 1 – (Unexplained Variation / Total Variation)

To determine R² for the linear regression model, simply square the value of the Pearson correlation coefficient “r”. The TI-84 gives you both “r” and R² when you use the “LinReg” subroutine (remember to set “Diagnostics On”).
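The relation between the variation breakdown and the squared correlation coefficient can be checked numerically. The sketch below assumes the standard sums-of-squares definitions of total, explained, and unexplained variation; the function name `variation_breakdown` and the data are hypothetical, not the drilling data.

```python
import math

def variation_breakdown(xs, ys):
    """Return (total, explained, unexplained, r) for a least-squares fit."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    y_hat = [slope * x + intercept for x in xs]  # predicted values
    total = sum((y - y_bar) ** 2 for y in ys)
    explained = sum((yh - y_bar) ** 2 for yh in y_hat)
    unexplained = sum((y - yh) ** 2 for y, yh in zip(ys, y_hat))
    r = sxy / math.sqrt(sxx * total)  # Pearson correlation coefficient
    return total, explained, unexplained, r

# Hypothetical data, for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

total, explained, unexplained, r = variation_breakdown(xs, ys)
r_squared_from_variation = explained / total
r_squared_from_r = r ** 2
print(r_squared_from_variation, r_squared_from_r)
```

Both routes give the same value, up to floating-point rounding, for a least-squares line.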

EXAMPLE Determining the Coefficient of Determination Find and interpret the coefficient of determination for the drilling data. The Pearson correlation coefficient is r = 0.773, so R² = 0.773² = 0.5975 = 59.75%. So, 59.75% of the variation in drilling time is explained by the LOR, that is, by the depth at which drilling begins.

[Scatter diagrams: Data Set A, Data Set B, Data Set C]
• A: 99.99% of the variation in y is explained by the variation in x (LOR)
• B: 94.7% of the variation in y is explained by the variation in x (LOR)
• C: 9.4% of the variation in y is explained by the variation in x (LOR)
