
Lecture 8 Sections 3.1-3.2



Presentation Transcript


  1. Lecture 8 Sections 3.1-3.2 Objectives: • Bivariate and Multivariate Data and Distributions • Scatter Plots • Form, Direction, Strength • Correlation • Properties of Correlation

  2. Multivariate Data A multivariate data set consists of observations made simultaneously on two or more variables. One important special case is that of bivariate data, in which observations on only two variables, x and y, are available. We’ll study: 1) the scatter plot: a graphical tool to gain insight into the nature of any relationship between x and y; 2) the correlation coefficient: a numerical measure of how strongly two variables are related; 3) the regression problem: a statistical tool to model the relationship between two variables and to predict y from x.

  3. Scatter Plots A scatter plot is a graphical tool for displaying the association between two quantitative variables measured on the same individuals. You can’t use a scatter plot to display the association between two qualitative variables, or between a qualitative variable and a quantitative variable. A response variable measures or records an outcome of a study. An explanatory variable explains changes in the response variable. Typically, the explanatory (independent) variable is plotted on the x axis, and the response (dependent) variable is plotted on the y axis.

  4. Scatter Plots • After plotting two variables on a scatter plot, we describe the relationship by examining the form, direction, and strength of the association. We look for an overall pattern … • Form: linear, curved, clusters, no pattern • Direction: positive, negative, no direction • Strength: how closely the points fit the “form” • … and deviations from that pattern. • Outliers

  5. Scatter Plots [Figure: three example scatter plots, titled “No relationship”, “Nonlinear”, and “Linear”]

  6. Strength of association The strength of the relationship between the two variables can be seen by how much variation, or scatter, there is around the main form. With a strong relationship, you can get a pretty good estimate of y if you know x. With a weak relationship, for any x you might get a wide range of y values.

  7. Outliers An outlier is a data value that has a very low probability of occurrence (i.e., it is unusual or unexpected). In a scatterplot, outliers are points that fall outside of the overall pattern of the relationship.

  8. Example Forest growth and decline phenomena throughout the world have attracted considerable public and scientific interest. The following observations were taken on y = mean crown dieback (%) and x = soil pH (from the article “Relationships Among Crown Condition, Growth, and Stand Nutrition in Seven Northern Vermont Sugarbushes”, Cana. J. of Forest Res., 1995: 386-397): x: 3.3 3.4 3.4 3.5 3.6 3.6 3.7 3.7 3.8 3.8 3.9 4.0 4.1 4.2 4.3 4.4 4.5 5.0 5.1 y: 7.3 10.8 13.1 10.4 5.8 9.3 12.4 14.9 11.2 8.0 6.6 10.0 9.2 12.4 2.3 4.3 3.0 1.6 1.0 Draw the scatter plot and describe the association.
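A rough way to look at these 19 pairs without a plotting library is a character-grid scatter plot. The sketch below is only an illustration: the helper name `text_scatter` and the grid size are assumptions, not part of the lecture, and a proper plotting tool would give a better picture.

```python
# Data from the slide: x = soil pH, y = mean crown dieback (%).
soil_ph = [3.3, 3.4, 3.4, 3.5, 3.6, 3.6, 3.7, 3.7, 3.8, 3.8,
           3.9, 4.0, 4.1, 4.2, 4.3, 4.4, 4.5, 5.0, 5.1]
dieback = [7.3, 10.8, 13.1, 10.4, 5.8, 9.3, 12.4, 14.9, 11.2, 8.0,
           6.6, 10.0, 9.2, 12.4, 2.3, 4.3, 3.0, 1.6, 1.0]


def text_scatter(xs, ys, width=40, height=12):
    """Render (x, y) pairs as a crude character-grid scatter plot.

    Hypothetical helper for illustration; x increases to the right,
    y increases upward.
    """
    x_min, x_max = min(xs), max(xs)
    y_min, y_max = min(ys), max(ys)
    grid = [[" "] * width for _ in range(height)]
    for x, y in zip(xs, ys):
        # Scale each coordinate onto the grid; row 0 is the top (largest y).
        col = round((x - x_min) / (x_max - x_min) * (width - 1))
        row = round((y_max - y) / (y_max - y_min) * (height - 1))
        grid[row][col] = "*"
    body = "\n".join("|" + "".join(row) for row in grid)
    return body + "\n+" + "-" * width


plot = text_scatter(soil_ph, dieback)
print(plot)
```

In the printed grid the points drift from the upper left toward the lower right, which suggests a negative association between soil pH and crown dieback, with noticeable scatter at low pH.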

  9. Sample Correlation Coefficient Pearson’s sample correlation coefficient r is given by $r = \frac{1}{n-1}\sum_{i=1}^{n}\left(\frac{x_i-\bar{x}}{s_x}\right)\left(\frac{y_i-\bar{y}}{s_y}\right)$, where $\bar{x}$, $\bar{y}$ are the sample means and $s_x$, $s_y$ the sample standard deviations. • The correlation coefficient is a measure of the direction and strength of a linear relationship. • It is calculated using the mean and the standard deviation of both the x and y variables. • Correlation can only be used to describe quantitative variables. Categorical variables don’t have means and standard deviations.
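As a minimal sketch of the calculation described on this slide, the function below standardizes each x and y value with the sample mean and sample standard deviation, multiplies the pairs, and divides the sum by n - 1. The name `pearson_r` is chosen here for illustration; it is not from the lecture.

```python
from statistics import mean, stdev


def pearson_r(xs, ys):
    """Sample correlation via standardized values (illustrative helper)."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long samples with at least 2 points")
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)  # sample standard deviations (divisor n - 1)
    # Sum of products of z-scores, divided by n - 1.
    return sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
               for x, y in zip(xs, ys)) / (n - 1)


# Points lying exactly on a rising or falling line give r = 1 or r = -1
# (up to floating-point rounding).
print(pearson_r([1, 2, 3], [2, 4, 6]))
print(pearson_r([1, 2, 3], [6, 4, 2]))
```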

  10. Properties of Correlation Coefficient, r • The correlation coefficient does not depend on the unit of measurement for either variable. • The correlation coefficient is not affected by the distinction between explanatory and response variables. • The correlation coefficient is always a number between -1 and 1. Values of r near 0 indicate a very weak linear relationship, while values of r close to -1 or 1 indicate a strong linear relationship. Positive r indicates a positive linear association between the variables and negative r indicates a negative linear association. • The correlation coefficient is strongly affected by outliers.
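The first three properties can be checked numerically. The sketch below uses made-up numbers (temperatures and sales figures that are purely hypothetical, not from the lecture) and an illustrative helper `pearson_r`; it changes the unit of x from Celsius to Fahrenheit and swaps the roles of x and y.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Sample correlation as S_xy / sqrt(S_xx * S_yy) (illustrative helper)."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / sqrt(s_xx * s_yy)


# Hypothetical data, invented for this illustration only.
celsius = [10, 15, 20, 25, 30]
sales = [12, 18, 21, 30, 33]
fahrenheit = [c * 9 / 5 + 32 for c in celsius]  # same temperatures, new unit

r_celsius = pearson_r(celsius, sales)
r_fahrenheit = pearson_r(fahrenheit, sales)  # unit change leaves r unchanged
r_swapped = pearson_r(sales, celsius)        # swapping x and y leaves r unchanged

print(r_celsius, r_fahrenheit, r_swapped)
```

All three printed values agree up to rounding, and each lies between -1 and 1.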

  11. "r" ranges from -1 to +1: "r" quantifies the strength and direction of a linear relationship between 2 quantitative variables. Strength: how closely the points follow a straight line. Direction: r is positive when individuals with higher X values tend to have higher values of Y, and negative when higher X values tend to go with lower values of Y.

  12. Outliers Correlations are calculated using means and standard deviations, and thus are NOT resistant to outliers. Just moving one point away from the general trend here weakens the correlation from -0.91 to -0.75.
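The same effect can be reproduced with a small experiment. The numbers below are invented for illustration (they are not the data behind the slide's figure, so the values of r will differ from -0.91 and -0.75); the sketch moves a single point away from a tight downward trend and recomputes r with an illustrative helper `pearson_r`.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Sample correlation as S_xy / sqrt(S_xx * S_yy) (illustrative helper)."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / sqrt(s_xx * s_yy)


# Hypothetical data with a tight negative linear trend.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
y = [9.8, 9.1, 8.2, 6.9, 6.1, 5.2, 3.8, 3.1, 2.2, 0.9]

# Move only the last point far above the trend to create one outlier.
y_with_outlier = y[:-1] + [9.0]

r_before = pearson_r(x, y)
r_after = pearson_r(x, y_with_outlier)
print(f"without outlier: r = {r_before:.2f}")
print(f"with one outlier: r = {r_after:.2f}")
```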

  13. Example In recent years, environmental scientists have mounted a major effort to investigate the sources of acid rain. Nitrates are a major constituent of acid rain, and arsenic has been proposed as a tracer element. The accompanying data on x = nitrate concentration (μM) of a precipitation sample and y = arsenic concentration (nM) are from the article “The Atmospheric Deposition of Arsenic and Association with Acid Precipitation” (Atmospheric Environ., 1988: 937-943): x: 11 13 18 30 36 40 50 58 67 82 91 102 y: 1.1 0.5 2.4 1.2 2.1 1.2 4.0 2.3 1.7 3.7 3.0 3.9 Calculate the correlation coefficient r. The sample correlation coefficient measures the direction and strength of the linear association between two quantitative variables. A value of r close to zero does not rule out a strong relationship between x and y; there could still be a strong relationship, but one that is not linear.
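One way to carry out this calculation is sketched below. It uses the equivalent form r = S_xy / sqrt(S_xx · S_yy), in which the n - 1 divisors of the standard deviations cancel; the helper name `pearson_r` is an assumption made for this illustration.

```python
from math import sqrt

# Data from the slide: x = nitrate (μM), y = arsenic (nM).
nitrate = [11, 13, 18, 30, 36, 40, 50, 58, 67, 82, 91, 102]
arsenic = [1.1, 0.5, 2.4, 1.2, 2.1, 1.2, 4.0, 2.3, 1.7, 3.7, 3.0, 3.9]


def pearson_r(xs, ys):
    """Sample correlation as S_xy / sqrt(S_xx * S_yy) (illustrative helper)."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / sqrt(s_xx * s_yy)


r = pearson_r(nitrate, arsenic)
print(f"r = {r:.3f}")
```

For these twelve pairs r comes out at roughly 0.73, which points to a fairly strong positive linear association between nitrate and arsenic concentration.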

  14. Example The accompanying data on y = glucose concentration (g/L) and x = fermentation time (days) for a particular brand of malt liquor were read from a scatter plot appearing in the article “Improving Fermentation Productivity with Reverse Osmosis” (Food Tech., 1984: 92-96): x: 1 2 3 4 5 6 7 8 y: 74 54 52 51 52 53 58 71 Calculate the correlation coefficient r and state the relationship using the scatter plot.
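A short sketch of this calculation follows, again with an illustrative helper named `pearson_r` that is not part of the lecture. It also reports where the smallest glucose value occurs, as a crude stand-in for looking at the scatter plot.

```python
from math import sqrt

# Data from the slide: x = fermentation time (days), y = glucose (g/L).
days = [1, 2, 3, 4, 5, 6, 7, 8]
glucose = [74, 54, 52, 51, 52, 53, 58, 71]


def pearson_r(xs, ys):
    """Sample correlation as S_xy / sqrt(S_xx * S_yy) (illustrative helper)."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / sqrt(s_xx * s_yy)


r = pearson_r(days, glucose)
day_of_minimum = days[glucose.index(min(glucose))]
print(f"r = {r:.3f}")
print(f"lowest glucose on day {day_of_minimum}")
```

Here r is very close to 0, yet the values fall until day 4 and then rise again. The relationship is strong but curved, so r, which only measures linear association, misses it.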

  15. Population Correlation Coefficient The sample correlation coefficient r measures how strongly the x and y values in a sample of pairs are linearly related. There is an analogous measure of how strongly x and y are related in the entire population of pairs from which the sample (x1,y1),…,(xn,yn) was obtained. It is called the population correlation coefficient and is denoted by ρ. The population correlation coefficient satisfies: • -1 ≤ ρ ≤ 1 • ρ = 1 or -1 if and only if all (x,y) pairs in the population lie exactly on a straight line.

  16. Correlation Not Causation • Correlation between variables need not be the result of a causal link between them. • It is possible to find correlation between variables that in truth have nothing to do with each other. • Association does not imply causation. • Example. Consider x = number of TV sets per person for a country and y = life expectancy. Suppose that r (the correlation between x and y) is large and positive. Could we lengthen people’s lives by shipping them TV sets? • Economic status can produce such a high correlation between x and y: both variables are strongly related to a third variable, such as economic status. Such a variable is called a “lurking variable”.
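The lurking-variable idea can be illustrated with a small simulation. Everything below is hypothetical: the numbers are randomly generated in arbitrary units, not real country data, and the names `wealth`, `tv_sets`, and `life_exp` are chosen only to echo the example. Both simulated variables depend on `wealth` and have no direct effect on each other.

```python
import random
from math import sqrt


def pearson_r(xs, ys):
    """Sample correlation as S_xy / sqrt(S_xx * S_yy) (illustrative helper)."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / sqrt(s_xx * s_yy)


rng = random.Random(0)  # fixed seed so the run is repeatable
n = 1000

# Simulated lurking variable (arbitrary units, purely hypothetical).
wealth = [rng.gauss(0, 1) for _ in range(n)]
# Each variable is wealth plus its own independent noise.
tv_sets = [w + rng.gauss(0, 0.5) for w in wealth]
life_exp = [w + rng.gauss(0, 0.5) for w in wealth]

r_observed = pearson_r(tv_sets, life_exp)

# Remove the wealth component from both variables; only the noise remains.
tv_resid = [t - w for t, w in zip(tv_sets, wealth)]
life_resid = [l - w for l, w in zip(life_exp, wealth)]
r_adjusted = pearson_r(tv_resid, life_resid)

print(f"correlation of TV sets and life expectancy: {r_observed:.2f}")
print(f"after removing wealth from both: {r_adjusted:.2f}")
```

The first correlation is large and positive even though neither simulated variable influences the other; once the shared dependence on wealth is removed, the remaining correlation is close to zero.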
