STAT 111 Introductory Statistics

STAT 111 Introductory Statistics Lecture 2: Distributions and Relationships May 19, 2004

Today’s Topics • Density curves • The normal distribution • Relationships between variables • Correlation • Scatterplots • Regression (if time permits)

Strategy for exploring data on a single quantitative variable • Make a graph of the data, e.g., stemplot or histogram • Look for overall pattern and deviations from pattern • Describe center and spread • Try to describe overall pattern with smooth curve

Mathematical Models and Density Curves • A mathematical model is an idealized description. • A density curve is a mathematical model describing the overall pattern of a distribution – ignores minor irregularities and outliers, though. • Two properties of density curves: • They are always on or above the horizontal axis. • The total area under a density curve is exactly 1. • Areas under the curve represent proportions or relative frequencies of observations.
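The two properties above can be checked numerically. The sketch below uses a hypothetical triangular density on [0, 2] (not from the lecture) and a simple midpoint-rule helper, `area`, to approximate areas under the curve; both names are illustrative assumptions.

```python
def density(x):
    """Hypothetical triangular density on [0, 2] (illustration only)."""
    if 0.0 <= x <= 1.0:
        return x
    if 1.0 < x <= 2.0:
        return 2.0 - x
    return 0.0


def area(f, lo, hi, steps=1000):
    """Approximate the area under f between lo and hi with the midpoint rule."""
    width = (hi - lo) / steps
    return sum(f(lo + (i + 0.5) * width) for i in range(steps)) * width


# The total area under a density curve should be 1.
total_area = area(density, 0.0, 2.0)

# The area over a region is the proportion of observations falling in it.
share_below_half = area(density, 0.0, 0.5)

print(round(total_area, 4), round(share_below_half, 4))
```

Here the area between 0 and 0.5 is read as the proportion of observations falling in that interval.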

Histogram vs. Density Curve • Histograms show either frequencies (counts) or relative frequencies (proportions) for intervals. • Density curves show the proportion of observations in any region using the area under the curve. • Density curves could be considered a limiting case of the histogram, when the amount of data becomes large enough. • Density curves are faster to draw by hand and easier to use compared to histograms.

Measuring Center and Spread of a Density Curve • The mode of a distribution is the point at which the density curve attains its highest value. • The median is the midpoint in the sense that half of the total area under the curve is to its left and half to its right. Also called the equal-areas point. • If the density curve were made of solid material, then the mean is the point at which the curve would balance.

Measuring Center and Spread (cont.) • For a symmetric density curve, the mean is its center of symmetry. • Since half the area is on either side of the center for a symmetric curve, it is also the median. • For a density curve skewed to the right, the area in the long right tail tips the curve more than the same area near the center, so the mean lies to the right of the median.

Measuring Center and Spread (cont.) • For a density curve skewed to the left, the area in the long left tail tips the curve more than the same area near the center, so the mean lies to the left of the median. • In practice, it is difficult to determine either the mean or median of a skewed curve using just your eyes. • Mathematical methods are used to calculate the mean in such cases.
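As a rough illustration of those mathematical methods, the sketch below approximates the mean (balance point) and median (equal-areas point) of a hypothetical right-skewed density, f(x) = 2(1 − x) on [0, 1]. The density and the grid size are assumptions chosen for the example, not part of the lecture.

```python
def density(x):
    """Hypothetical right-skewed density on [0, 1]: tall near 0, long tail to the right."""
    return 2.0 * (1.0 - x) if 0.0 <= x <= 1.0 else 0.0


STEPS = 100_000
WIDTH = 1.0 / STEPS
midpoints = [(i + 0.5) * WIDTH for i in range(STEPS)]

# Mean: the balance point, i.e. the area-weighted average of x.
mean = sum(x * density(x) for x in midpoints) * WIDTH

# Median: the equal-areas point, where the accumulated area first reaches one half.
accumulated = 0.0
median = None
for x in midpoints:
    accumulated += density(x) * WIDTH
    if accumulated >= 0.5:
        median = x
        break

print(f"mean = {mean:.4f}, median = {median:.4f}")
```

For this curve the mean (about 0.333) lies to the right of the median (about 0.293), as expected for a right-skewed density.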

Measuring Center and Spread (cont.) • The quartiles can be found by dividing the area under the curve into 4 (roughly) equal parts. • First quartile Q1: The point that has ¼ of the total area under the curve to its left. • Third quartile Q3: The point that has ¾ of the total area under the curve to its left (or alternatively, ¼ to its right). • The IQR is calculated as we discussed previously (i.e., IQR = Q3 – Q1).

Measuring Center and Spread (cont.) • Since the density curve is an idealized model, we need to distinguish between the mean and standard deviation of the density curve and the numbers $\bar{x}$ and s computed from the actual observations. • For an idealized distribution, then, the usual symbols used are • µ for the mean, and • σ for the standard deviation.

The Normal Distribution • Quite possibly the most important distribution in statistics is the normal distribution. • The normal distribution is described by a density curve that is symmetric, unimodal, and bell-shaped (recall what these terms mean). • The exact shape of any normal density curve is determined by its mean µ and standard deviation σ.

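The Normal Distribution • The height of a normal density curve at any point x is given by $f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2}$, where • µ is the mean • σ is the standard deviation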
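The height of a normal curve can be evaluated directly from the standard normal density formula, f(x) = exp(−(x − µ)² / (2σ²)) / (σ√(2π)). The sketch below is a minimal version; the function name `normal_density` is an illustrative choice.

```python
import math


def normal_density(x, mu, sigma):
    """Height of the N(mu, sigma) density curve at the point x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


# The curve peaks at the mean and is symmetric about it.
peak_standard = normal_density(0.0, 0.0, 1.0)
left = normal_density(-1.0, 0.0, 1.0)
right = normal_density(1.0, 0.0, 1.0)

print(round(peak_standard, 4), round(left, 4), round(right, 4))
```

A larger σ spreads the same total area over a wider range, so the peak is lower.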

Key Characteristics of Normal Distributions • The normal density curve is • Symmetric about the mean • Single-peaked (Unimodal) • Bell-shaped • The mean, median, and mode are all the same • The mean and standard deviation completely specify the curve

Why Normal Distributions are Important • They are good for describing some distributions of real data • Scores on tests taken by many people • Repeated measurements of the same quantity • Characteristics of biological populations • They are good at approximating the results of many kinds of chance outcomes • Many statistical inference procedures based on normal distributions work well for other roughly symmetric distributions.

The 68-95-99.7 Rule (Empirical Rule) • For a normal distribution with mean µ and standard deviation σ, written N(µ, σ): • Approximately 68% of the observations fall within 1σ of the mean µ • Approximately 95% of the observations fall within 2σ of the mean µ • Approximately 99.7% of the observations fall within 3σ of the mean µ
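The three percentages can be reproduced from the normal distribution itself. A minimal sketch, using the fact that for a normal variable the proportion within k standard deviations of the mean equals erf(k / √2); the helper name `share_within` is an illustrative choice.

```python
import math


def share_within(k):
    """Proportion of a normal distribution within k standard deviations of the mean."""
    return math.erf(k / math.sqrt(2.0))


shares = {k: share_within(k) for k in (1, 2, 3)}

for k, share in shares.items():
    print(f"within {k} sigma: {share:.4f}")
```

The results do not depend on µ or σ, which is why the rule applies to every normal distribution.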

The Empirical Rule (cont.)

Empirical Rule Applet • http://www.stat.sc.edu/~west/applets/empiricalrule.html

Standardizing and z-scores • All normal distributions are identical if measured in units of size σ about the mean µ. • Converting to these units is referred to as standardizing. • To standardize a value, subtract the mean of the distribution and then divide by the standard deviation.

Standardizing and z-scores (cont.) • So, the standardized value of an observation x which comes from a distribution with mean µ and standard deviation σ is $z = \frac{x - \mu}{\sigma}$. • We call a standardized value a z-score. • A z-score tells us how many standard deviations our original observation is away from the mean, and also in which direction.

Example: Heights • The heights of young women are approximately normal with µ = 64.5 inches and σ = 2.5 inches. • Suppose we have a female student whose height is 5’3”. What is her z-score? • Suppose we have another student who is 6’1”. What is her z-score? Is there some indication that her height is unusual?
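One way to work this example is to convert each height to inches and standardize it by subtracting the mean and dividing by the standard deviation. The helper name `z_score` is an illustrative choice.

```python
def z_score(x, mu, sigma):
    """Standardize x: subtract the mean, then divide by the standard deviation."""
    return (x - mu) / sigma


MU, SIGMA = 64.5, 2.5  # heights of young women, in inches (from the example)

z_first = z_score(5 * 12 + 3, MU, SIGMA)   # 5'3" = 63 inches
z_second = z_score(6 * 12 + 1, MU, SIGMA)  # 6'1" = 73 inches

print(round(z_first, 2), round(z_second, 2))
```

The first student is 0.6 standard deviations below the mean. The second is 3.4 standard deviations above it; by the 68-95-99.7 rule, a value more than 3σ from the mean is unusual.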

The Standard Normal Distribution • Standardizing transforms all normal distributions into the standard normal distribution, which is the normal distribution N(0, 1) with mean 0 and standard deviation 1. • If a variable X is normally distributed with mean µ and standard deviation σ, then the standardized variable $Z = \frac{X - \mu}{\sigma}$ has the standard normal distribution.

The Standard Normal Table • Table A is a table of areas under the standard normal density curve. The table entry for each value z is the area under the curve to the left of z. • So, Table A can be used to find the proportion of observations of a variable which fall to the left of a specific value z if the variable follows a normal distribution.

More Calculations with the Normal Distribution • As we previously mentioned, the heights of young women are distributed N(64.5,2.5); what is the proportion of young women who are shorter than 5’5” (65 inches)? • What about the proportion of young women who are taller than 5’8” (68 inches)? • Lastly, how about young women who are between 5’3” (63 inches) and 5’7” (67 inches)? What’s their proportion?
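These three proportions can also be computed in software rather than from a printed table. The sketch below uses `statistics.NormalDist` from the Python standard library (Python 3.8 or later), whose `cdf` method returns the area to the left of a value; the variable names are illustrative.

```python
from statistics import NormalDist

# Heights of young women, N(64.5, 2.5) in inches, as stated in the example.
heights = NormalDist(mu=64.5, sigma=2.5)

shorter_than_65 = heights.cdf(65)                      # area to the left of 65
taller_than_68 = 1 - heights.cdf(68)                   # area to the right of 68
between_63_and_67 = heights.cdf(67) - heights.cdf(63)  # area between the two values

print(f"{shorter_than_65:.4f} {taller_than_68:.4f} {between_63_and_67:.4f}")
```

The proportions come out to roughly 0.58, 0.08, and 0.57. Looking up z = 0.2, 1.4, 1.0, and −0.6 in a standard normal table should give matching values.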

Normal Quantile Plot • While some distributions of real data are approximately normal, others are decidedly skewed and hence distinctly non-normal. • So how do we determine whether data are approximately normal? • Histograms and stemplots can reveal non-normal features like outliers, pronounced skewness, and gaps and clusters, but what if both the stemplot and histogram appear symmetric and unimodal?

The Normal Quantile Plot (cont.) • The tool we use most often to check whether data come from a normal distribution is the normal quantile plot. • The basic idea of the normal quantile plot is to rescale the horizontal axis so that a “perfect” standard normal sample would fall along a 45° line (i.e., the line y = x).

The Normal Quantile Plot (cont.) • Generally, we can use software to construct the normal quantile plot, but here is a very simple version of the procedure that is used: • Arrange our observed data in increasing order and keep track of the percentile occupied by each point. • Find the z-scores at these same percentiles. • Plot each data point x against the matching z. If the distribution of the data is close to any normal distribution, the plotted points should fall along a straight line.
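The procedure above can be sketched in a few lines. This version assumes the plotting position (i − 0.5) / n for the i-th smallest of n observations, which is one common convention and not necessarily the one used by any particular software package. The sample values and the function name `normal_quantile_points` are hypothetical.

```python
from statistics import NormalDist


def normal_quantile_points(data):
    """Return (z, x) pairs for a normal quantile plot of the data."""
    ordered = sorted(data)            # step 1: arrange the data in increasing order
    n = len(ordered)
    standard_normal = NormalDist()    # N(0, 1)
    points = []
    for i, x in enumerate(ordered, start=1):
        percentile = (i - 0.5) / n    # assumed plotting position for the i-th value
        z = standard_normal.inv_cdf(percentile)  # step 2: z-score at that percentile
        points.append((z, x))         # step 3: pair each data point with its z
    return points


sample = [61.0, 66.5, 63.2, 64.8, 68.1]  # hypothetical heights, in inches
points = normal_quantile_points(sample)

for z, x in points:
    print(f"z = {z:6.3f}, x = {x}")
```

Plotting x against z for these pairs gives the normal quantile plot; points close to a straight line suggest the data are approximately normal.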

The Normal Quantile Plot: Two Random Data Sets

The Normal Quantile Plot (cont.) • As you might have guessed, the data set on the left is approximately normal: its normal quantile plot is close to a straight line. • On the other hand, you should be able to see from its normal quantile plot that the data set on the right is decidedly non-normal. • Tip: If our data are approximately normal, then all of the points should lie between the two dashed red lines.

Relationships between Two Variables • To study the relationship between two variables, we measure both variables on the same individuals. • This allows us to see the connection between the two variables. • Two variables measured on the same individuals are associated if some values of one variable tend to occur more often with some values of the second variable.

Relationships between Two Variables (cont.) • Two variables are considered positively associated if large values of one tend to accompany large values of the other, and similarly for small values. • Examples: Height and weight, distance and time • Two variables are considered negatively associated if large values of one accompany small values of the other, and vice versa.

Questions to Ask when Examining Relationships • What individuals are being described by the data? • What variables are present? • Which variables are quantitative? Categorical? • Are we trying to explore the nature of the relationship or trying to show that one helps explain variation in the other? I.e., are some variables response variables and others explanatory variables?

Scatterplots • If we have two quantitative variables, one of the best ways to display their relationship is using a scatterplot. • A scatterplot is a two-dimensional plot, where one variable’s values are plotted along the horizontal axis, and the other’s along the vertical axis.

Scatterplot Terminology • The variable plotted on the x-axis is called the • Explanatory variable • Independent variable • Predictor • x variable • The variable plotted on the y-axis is called the • Response variable • Dependent variable • y variable

Questions to Ask when Examining a Scatterplot • What is the overall pattern? • What is the form? Linear or non-linear? Are there clusters? • What is the direction? Positively or negatively associated? • How strong is the relationship? I.e., how closely do the points follow a clear form? • Are there any outliers? That is, are there points that fall outside the overall pattern?

Typical Patterns of Scatterplots [Figure: example scatterplots with the following panel labels] • Positive linear relationship • Negative linear relationship • No relationship • Negative nonlinear relationship • Nonlinear (concave) relationship • A weak linear relationship, where a nonlinear relationship seems to fit the data better

Time Series and Time Plots • One special type of relationship looks at the evolution of a variable over time. A data set that keeps track of a variable’s values along with its time of measurement is a time series. • Examples: Government, economic, and social data • A time plot plots the value of each observation against the time at which it was measured. Time should always be placed on the horizontal scale and the other variable on the vertical scale.

Time Series (cont.) • There are two main types of patterns in a plot of time series • A trend is a persistent, long-term rise or fall. • Example: U.S. GDP is generally increasing over time. • A pattern that repeats at known regular intervals is considered seasonality. • Example: Retail sales tend to be much higher in November and December than in the other months.

Example: SAT Scores • We have a data set that contains 2407 SAT scores, including the math and verbal break-downs. • What sort of relationship do we expect to see between math scores and verbal scores? How strong do we expect this relationship to be? • Should we expect to see any clusters in this data? • Do we expect that one helps explain variation in the other?

Side-by-side Boxplots • When we wish to see the association between two quantitative variables, we use a scatterplot. • What if we instead want to see the relationship between a categorical explanatory variable and a quantitative response variable? • What we need to do is make a side-by-side comparison of the distributions of the response for each category. • Side-by-side boxplots are a useful tool for doing this.

Correlation • Even if there is a linear relationship between two quantitative variables, it is difficult for the eye to judge how strong that relationship is. • Changing plotting scales or increasing the amount of white space around the cloud of points can throw off our visual impressions. • What we need then is a numerical measure to go with our scatterplot.

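Correlation (cont.) • The statistic we use to measure the direction and strength of the linear relationship between two quantitative variables is correlation. • If we have n observations on two variables x and y, then the sample correlation between them is calculated using the formula $r = \frac{1}{n-1} \sum_{i=1}^{n} \left(\frac{x_i - \bar{x}}{s_x}\right)\left(\frac{y_i - \bar{y}}{s_y}\right)$, where $\bar{x}$ and $\bar{y}$ are the sample means and $s_x$ and $s_y$ are the sample standard deviations.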
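The sample correlation is commonly computed as the sum of products of standardized x and y values, divided by n − 1. A minimal sketch under that definition, with a small hypothetical data set:

```python
from statistics import mean, stdev


def correlation(xs, ys):
    """Sample correlation: sum of products of standardized values, divided by n - 1."""
    n = len(xs)
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)  # sample standard deviations
    total = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys))
    return total / (n - 1)


# Hypothetical data, for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

r = correlation(x, y)
print(round(r, 4))
```

Python 3.10 and later also provide `statistics.correlation`, which should give the same value for these data.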

Correlation (cont.) • The correlation r is always a number between -1 and 1. • Positive r indicates positive association, negative r indicates negative association. • A value of r close to 0 indicates a very weak linear relationship. Strength of linear association increases as r approaches -1 or 1. • r = -1 and r = 1 occur only when all the points in a scatterplot lie in a perfectly straight line.

Correlation (cont.) • Correlation is not affected by the distinction between explanatory and response variables. • Because r is calculated using standardized values, it does not change when we change the units of measurement of x, y, or both. • Correlation only measures the linear relationship between two variables. • Correlation is very sensitive to outliers. • Question: Suppose we have two variables such that r = 0. Is it true that they are not associated?
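These properties can be explored with a few lines of code. The sketch below uses hypothetical data and an illustrative `correlation` helper to check that r is unchanged when x and y swap roles or when units change, and it gives one possible way to think about the closing question: a perfect curved relationship, y = x², on x values symmetric about 0.

```python
from statistics import mean, stdev


def correlation(xs, ys):
    """Sample correlation: sum of products of standardized values, divided by n - 1."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    total = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys))
    return total / (len(xs) - 1)


# Hypothetical heights (inches) and weights (pounds), for illustration only.
height_in = [60.0, 62.0, 65.0, 68.0, 70.0]
weight_lb = [110.0, 120.0, 140.0, 155.0, 160.0]

r_original = correlation(height_in, weight_lb)
r_swapped = correlation(weight_lb, height_in)           # roles of x and y exchanged
height_cm = [2.54 * h for h in height_in]               # change of units
r_rescaled = correlation(height_cm, weight_lb)

# A perfect but nonlinear relationship.
x = [-2.0, -1.0, 0.0, 1.0, 2.0]
y = [v * v for v in x]
r_parabola = correlation(x, y)

print(round(r_original, 4), round(r_swapped, 4), round(r_rescaled, 4), round(r_parabola, 4))
```

In the last case y is completely determined by x, yet r is 0, because correlation only measures linear association.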

The Regression Line • A regression line is a straight line that summarizes the linear relationship between two variables. • It describes how a response variable y changes as an explanatory variable x changes. • A regression line is often used as a model to predict the value of the response y for a given value of the explanatory variable x.

The Regression Line (cont.) • We fit a line to data by drawing the line that comes as close as possible to the points. • Once we have a regression line, we can predict the y for a specific value of x. Accuracy depends on how scattered the data are about the line. • Using the regression line to predict far outside the range of values of x used to obtain the line is called extrapolation. This is generally not advised, since such predictions are often inaccurate.

Example: Predicting SAT Math Scores using SAT Verbal Scores • Making a regression line using JMP: Analyze → Fit Y by X → Put the response variable into Y, explanatory variable into X → Hit OK → Double-click the red triangle above the scatterplot → Fit line • Mathematically, a straight line has an equation of the form y = a + bx, where b is the slope and a is the intercept. But how do we determine the value of these two numbers?

The Least-Squares Regression Line • The least-squares regression line of y on x is the line that makes the sum of the squares of the vertical distances of the data points from the line as small as possible. • Mathematically, the line is determined by minimizing $\sum_{i=1}^{n} (y_i - a - b x_i)^2$ over the intercept a and the slope b.

The Least-Squares Regression Line (cont.) • The equation of the least-squares regression line of y on x is $\hat{y} = a + bx$ • The slope is determined using the formula $b = r \frac{s_y}{s_x}$ • The intercept is calculated using $a = \bar{y} - b\bar{x}$
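A minimal sketch of the calculation, assuming the standard least-squares formulas: slope b = r · s_y / s_x and intercept a = ȳ − b · x̄, for the line ŷ = a + bx. The data set and function names are hypothetical.

```python
from statistics import mean, stdev


def correlation(xs, ys):
    """Sample correlation: sum of products of standardized values, divided by n - 1."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_x, s_y = stdev(xs), stdev(ys)
    total = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y) for x, y in zip(xs, ys))
    return total / (len(xs) - 1)


def least_squares_line(xs, ys):
    """Return (intercept a, slope b) of the least-squares line y-hat = a + b x."""
    slope = correlation(xs, ys) * stdev(ys) / stdev(xs)  # b = r * s_y / s_x
    intercept = mean(ys) - slope * mean(xs)              # a = y-bar - b * x-bar
    return intercept, slope


# Hypothetical data, for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

a, b = least_squares_line(x, y)
predicted_at_3 = a + b * 3.0  # prediction inside the observed range of x

print(round(a, 4), round(b, 4), round(predicted_at_3, 4))
```

For these data the fitted line is approximately ŷ = 2.2 + 0.6x.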

Interpreting the Regression Line • The slope b tells us that along the regression line, a change of 1 unit in x corresponds to a change of b units in y. • The least-squares regression line always passes through the point $(\bar{x}, \bar{y})$. • If both x and y are standardized variables, then the slope of the least-squares regression line will be r, and the line will pass through the origin (0,0).
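The last two properties can be checked numerically. The sketch below fits the line by directly minimizing the sum of squared vertical distances, which gives slope Sxy / Sxx (algebraically the same as r · s_y / s_x). The data set and helper names are hypothetical.

```python
from statistics import mean, stdev


def least_squares(xs, ys):
    """Return (intercept, slope) minimizing the sum of squared vertical distances."""
    x_bar, y_bar = mean(xs), mean(ys)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    return intercept, slope


def standardize(values):
    """Convert values to z-scores using the sample mean and standard deviation."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


# Hypothetical data, for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

a, b = least_squares(x, y)
value_at_mean_x = a + b * mean(x)  # should equal the mean of y

# Refit after standardizing both variables.
a_std, b_std = least_squares(standardize(x), standardize(y))

print(round(value_at_mean_x, 4), round(a_std, 4), round(b_std, 4))
```

With standardized variables the intercept is 0 (up to rounding error) and the slope equals the correlation between x and y, which is about 0.77 for these data.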
