
Chapters 8 and 9: Correlations Between Data Sets


muriel

Presentation Transcript


  1. Chapters 8 and 9: Correlations Between Data Sets Math 1680

  2. Overview • Scatter Plots • Associations • The Correlation Coefficient • Sketching Scatter Plots • Changes of Scale • Summary

  3. Scatter Plots • Often, we are interested in comparing two related data sets • Heights and weights of students • SAT scores and freshman GPA • Age and fuel efficiency of vehicles • We can draw a scatter plot of the data set • Plot paired data points on a Cartesian plane

  4. Scatter Plots • Scatter plot for the heights of 1,078 fathers and their adult sons • From Pearson and Lee's study of family heights

  5. Scatter Plots • What does the dashed diagonal line represent? • Find the point representing a 5'3¼" father who has a 5'6½" son

  6. Scatter Plots • What does the vertical dashed column represent? • Consider the families where the father was 72" tall, to the nearest inch • How tall was the tallest son? • Shortest?

  7. Scatter Plots • Was the average height of the fathers around 64", 68" or 72"? • Was the SD of the fathers’ heights around 3", 6" or 9"?

  8. Scatter Plots • The points form a swarm that is more or less football-shaped • This indicates that there is a linear association between the fathers’ heights and the sons’ heights

  9. Scatter Plots • Short fathers tend to have short sons, and tall fathers tend to have tall sons • We say there is a positive association between the heights of fathers and sons • What would it mean for there to be a negative association between the heights?

  10. Scatter Plots • Does knowing the father’s height give a precise prediction of his son’s height? • Does knowing the father’s height let you better predict his son’s height?

  11. Scatter Plots • We will generally assume the scatter plots are football-shaped • Association is linear in nature • Each data set is approximately normal

  12. Scatter Plots • Key features of scatter plots • Given two data sets X and Y, … • The point of averages is the point (μ_X, μ_Y) • The average of a data set is denoted by μ (Greek mu, for mean) • The subscript indicates which set is being referenced • The point of averages lies at the center of the cloud • Under the normal approximation, the vast majority (about 95%) of the cloud should fall within 2 SD’s of the average for both X and Y
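
As a rough illustration of these two features, the sketch below computes the point of averages and the fraction of each data set lying within 2 SD of its average. The height values and the function names are made up for the example and do not come from the slides; the SD is taken as the population SD (dividing by n), which is an assumption about the course's convention.

```python
from statistics import fmean, pstdev


def point_of_averages(xs, ys):
    """Return (mu_X, mu_Y), the center of the scatter cloud."""
    return fmean(xs), fmean(ys)


def fraction_within_2_sd(values):
    """Fraction of values lying within 2 SDs of their own average."""
    mu = fmean(values)
    sd = pstdev(values)  # population SD (divide by n): an assumed convention
    inside = [v for v in values if abs(v - mu) <= 2 * sd]
    return len(inside) / len(values)


# Hypothetical heights in inches, for illustration only
fathers = [64, 66, 67, 68, 68, 69, 70, 72]
sons = [66, 67, 68, 68, 70, 70, 71, 72]

center = point_of_averages(fathers, sons)
print("point of averages:", center)
print("fathers within 2 SD:", fraction_within_2_sd(fathers))
print("sons within 2 SD:", fraction_within_2_sd(sons))
```

With only eight points the 95% figure is not meaningful; on a large, roughly normal data set the two fractions should come out near 0.95.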

  13. Scatter Plots

  14. Associations • When given a value in one data set, we often want to make a prediction for the other data set • We call our given value the independent variable • We call the value we are trying to predict the dependent variable

  15. Associations • If there is indeed a relationship between the two data sets, we can say various things about their association: • Strong: Knowing X helps you a lot in predicting Y, and vice versa • Weak: Knowing X doesn’t really help you predict Y, and vice versa • Positive: X and Y tend to rise together • The higher in one you look, the higher in the other you should be • Negative: As one of X and Y rises, the other tends to fall • The higher in one you look, the lower in the other you should be

  16. Associations • Positive associations: Study time/final grade, Height/weight, SAT score/GPA, Clouds in sky/chance of rain, Bowling practice/bowling score, Age of husband/age of wife • Negative associations: Age of car/fuel efficiency, Golfing practice/golf score, Dental hygiene/cavities formed, Pollution/air quality, Speed/mile time

  17. Associations • What kind of association is this?

  18. Associations • What kind of association is this?

  19. Associations • Remember that even a very strong association does not necessarily imply a causal relationship • There may be a confounding influence at play

  20. The Correlation Coefficient • While strong/weak and positive/negative give a sense of the association, we want a way to quantify the strength and direction of the association • The correlation coefficient (r) is the statistic which accomplishes this

  21. The Correlation Coefficient • The correlation coefficient is always between –1 and 1 • A positive r means that there is a positive association between the sets • A negative r means that there is a negative association between the sets • If r is close to 0, then there is only a weak association between the sets • If r is close to 1 or –1, then there is a strong association between the sets

  22. The Correlation Coefficient • The following plots have and , with 50 points in them • The only difference between them is the correlation coefficient • Note how the points fall into a line as r approaches 1 or –1

  23. The Correlation Coefficient • To calculate r… • Find the average and SD of each data set • Multiply the data sets pairwise and find the average • The correlation is the average of the product minus the product of the averages, all divided by the product of the SD’s
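
The recipe above can be sketched in a few lines of Python. This is a minimal illustration, not the course's official procedure: the function name `correlation` and the sample data are assumptions made for the example, and the SD is taken as the population SD (dividing by n), which is the convention under which the formula on the slide holds exactly.

```python
from statistics import fmean, pstdev


def correlation(xs, ys):
    """r = (average of products - product of averages) / (product of SDs)."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two data sets of equal length, at least 2 points")
    avg_x, avg_y = fmean(xs), fmean(ys)
    sd_x, sd_y = pstdev(xs), pstdev(ys)  # population SDs (divide by n)
    if sd_x == 0 or sd_y == 0:
        raise ValueError("r is undefined when a data set has SD 0")
    # Multiply the data sets pairwise, then average the products
    avg_product = fmean([x * y for x, y in zip(xs, ys)])
    return (avg_product - avg_x * avg_y) / (sd_x * sd_y)


# Illustrative data, not taken from the slides
x = [1, 3, 4, 5, 7]
y = [5, 9, 7, 1, 13]
r = correlation(x, y)
print(round(r, 4))
```

For this data the averages are 4 and 7, the SDs are 2 and 4, and the average product is 31.2, so r works out to (31.2 - 28) / 8, or about 0.4.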

  24. The Correlation Coefficient • XY: 5, 27, 28, 5, 91

  25. The Correlation Coefficient • Compute r for the following data • r = 1 • r = 0.8214

  26. The Correlation Coefficient • Estimate the correlation

  27. The Correlation Coefficient • Estimate the correlation

  28. Sketching Scatter Plots • The SD line is the line consisting of all the points where the standard score in X equals the standard score in Y • z_X = z_Y • To sketch the SD line, draw a line along the long axis of the football shape • Note that the SD line always goes through the point of averages
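
To make the standard-score condition concrete, the sketch below converts an X value to standard units and returns the Y value with the same standard score, which is the point on the SD line above that X. The averages and SDs used here are hypothetical, and the `negative` flag (which mirrors the line for a negative association) is an addition not stated on the slide.

```python
def standard_score(value, avg, sd):
    """Convert a value to standard units: how many SDs it lies above average."""
    return (value - avg) / sd


def sd_line_y(x, avg_x, sd_x, avg_y, sd_y, negative=False):
    """Y value on the SD line at x, i.e. the y whose standard score matches x's.

    negative=True flips the sign (z_Y = -z_X); this case is an assumption
    added for negative associations and is not covered by the slide.
    """
    z_x = standard_score(x, avg_x, sd_x)
    z_y = -z_x if negative else z_x
    return avg_y + z_y * sd_y


# Hypothetical summary statistics, for illustration only
AVG_X, SD_X = 68.0, 2.5
AVG_Y, SD_Y = 69.0, 3.0

# The SD line passes through the point of averages ...
print(sd_line_y(AVG_X, AVG_X, SD_X, AVG_Y, SD_Y))
# ... and through the point 1 SD above average in both data sets
print(sd_line_y(AVG_X + SD_X, AVG_X, SD_X, AVG_Y, SD_Y))
```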

  29. Sketching Scatter Plots • Given the five-statistic summary (averages, SD’s, and correlation) for a pair of data sets, we can sketch the scatter plot • Plot the point of averages in the center • Mark two SD’s in both directions, on both axes • Plot the point 1 SD above average for both data sets • Draw a line connecting this point and the point of averages • This is the SD line • Draw an ellipse with the SD line as its long axis • Ellipse should go just beyond the 2 SD marks in all directions • The value of r determines how oblong the ellipse is

  30. Sketching Scatter Plots • A study of the IQs of husbands and wives obtained the following results • Husbands: average IQ = 100, SD = 15 • Wives: average IQ = 100, SD = 15 • r = 0.6 • Sketch the scatter plot
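
Before drawing, it can help to list the landmarks the sketch needs. The following is a small sketch using the husband and wife IQ summary above; the function name `sketch_landmarks` and the dictionary keys are hypothetical, and the handling of a negative r (placing the second SD-line point 1 SD below average in Y) is an assumption not covered by the slides.

```python
def sketch_landmarks(avg_x, sd_x, avg_y, sd_y, r):
    """Key points for hand-sketching a football-shaped scatter plot."""
    sign = 1 if r >= 0 else -1  # negative r: assumed to mirror the SD line
    return {
        "point_of_averages": (avg_x, avg_y),
        "x_two_sd_marks": (avg_x - 2 * sd_x, avg_x + 2 * sd_x),
        "y_two_sd_marks": (avg_y - 2 * sd_y, avg_y + 2 * sd_y),
        # Second point on the SD line, 1 SD out from the point of averages
        "sd_line_point": (avg_x + sd_x, avg_y + sign * sd_y),
    }


# Husbands: average 100, SD 15; wives: average 100, SD 15; r = 0.6
landmarks = sketch_landmarks(100, 15, 100, 15, 0.6)
for name, value in landmarks.items():
    print(name, value)
```

The ellipse is then drawn around the SD line through (100, 100) and (115, 115), reaching just past 70 and 130 on each axis. How narrow to make it for r = 0.6 remains a judgment call by eye.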

  31. Changes of Scale • The correlation coefficient is not affected by changes of scale • Moving: adding the same number to all of the values of one variable • Stretching: multiplying all the values of one variable by the same positive number • Would r change if we multiplied by a negative number? • The correlation coefficient is also unaffected by interchanging the two data sets
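
These claims are easy to check numerically. The sketch below recomputes r after moving, stretching, and interchanging the data sets, and also tries a negative multiplier. The data and the helper `correlation` are illustrative assumptions, with the SD taken as the population SD.

```python
import math
from statistics import fmean, pstdev


def correlation(xs, ys):
    """r = (average of products - product of averages) / (product of SDs)."""
    avg_product = fmean([x * y for x, y in zip(xs, ys)])
    return (avg_product - fmean(xs) * fmean(ys)) / (pstdev(xs) * pstdev(ys))


# Illustrative data, not taken from the slides
x = [1, 3, 4, 5, 7]
y = [5, 9, 7, 1, 13]

r = correlation(x, y)
moved = correlation([v + 10 for v in x], y)      # add the same number
stretched = correlation([3 * v for v in x], y)   # multiply by a positive number
flipped = correlation([-2 * v for v in x], y)    # multiply by a negative number
swapped = correlation(y, x)                      # interchange the data sets

print(math.isclose(moved, r, abs_tol=1e-9))      # moving leaves r unchanged
print(math.isclose(stretched, r, abs_tol=1e-9))  # stretching leaves r unchanged
print(math.isclose(flipped, -r, abs_tol=1e-9))   # negative multiplier flips the sign
print(math.isclose(swapped, r, abs_tol=1e-9))    # interchanging leaves r unchanged
```

Multiplying one variable by a negative number reverses the direction of the association, so the size of r stays the same while its sign flips.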

  32. Changes of Scale

  33. Changes of Scale

  34. Changes of Scale • Compute r for each of the following data sets • r = -0.15

  35. Summary • The relationship between two variables, X and Y, can be graphed in a scatter plot • When the scatter plot is tightly clustered around a line, there is a strong linear association between X and Y • A scatter plot can be characterized by its five-statistic summary • Average and SD of the X values • Average and SD of the Y values • Correlation coefficient

  36. Summary • When the correlation coefficient gets closer to 1 or –1, the points cluster more tightly around a line • Positive association has a positive r-value • Negative association has a negative r-value • Calculating the correlation coefficient • Take the average of the product • Subtract the product of the averages • Divide the difference by the product of the SD’s

  37. Summary • The correlation coefficient is not affected by changes of scale or transposing the variables • Correlation does not measure causation!
