
# STAT 111 Introductory Statistics


### STAT 111 Introductory Statistics

Lecture 2: Distributions and Relationships

May 19, 2004

• Density curves

• The normal distribution

• Relationships between variables

• Correlation

• Scatterplots

• Regression (if time permits)

• Make a graph of the data, e.g., stemplot or histogram

• Look for overall pattern and deviations from pattern

• Try to describe overall pattern with smooth curve

• A mathematical model is an idealized description.

• A density curve is a mathematical model describing the overall pattern of a distribution – ignores minor irregularities and outliers, though.

• Two properties of density curves:

• They are always on or above the horizontal axis.

• The total area under a density curve is exactly 1.

• Areas under the curve represent proportions or relative frequencies of observations.
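Both properties, and the fact that areas represent proportions, can be checked numerically. A minimal Python sketch (not part of the lecture) using a made-up triangular density:

```python
# A hypothetical triangular density on [0, 2]: f(x) = x on [0, 1], 2 - x on [1, 2].
def f(x):
    if 0 <= x <= 1:
        return x
    if 1 < x <= 2:
        return 2 - x
    return 0.0  # the curve is on or above the horizontal axis everywhere

# Property 2: the total area under the curve is 1 (approximated by a Riemann sum).
n = 20000
step = 2 / n
area = sum(f(i * step) * step for i in range(n))

# Areas represent proportions: the area to the left of x = 1 is the
# proportion of observations below 1 (here, one half by symmetry).
left = sum(f(i * step) * step for i in range(n // 2))
print(round(area, 3), round(left, 3))
```

The same check works for any candidate density curve: nonnegative everywhere, total area exactly 1.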

### Histogram vs. Density Curve

• Histograms show either frequencies (counts) or relative frequencies (proportions) for intervals.

• Density curves show the proportion of observations in any region using the area under the curve.

• Density curves can be thought of as a limiting case of the histogram, as the amount of data grows large and the class intervals become narrow.

• Density curves are faster to draw by hand and easier to use compared to histograms.

• The mode of a distribution is the point at which the density curve attains its highest value.

• The median is the midpoint in the sense that half of the total area under the curve is to its left and half to its right. Also called the equal-areas point.

• If the density curve were made of solid material, then the mean is the point at which the curve would balance.

• For a symmetric density curve, the mean is its center of symmetry.

• Since half the area is on either side of the center for a symmetric curve, it is also the median.

• For a density curve skewed to the right, the area in the long right tail pulls the balance point toward the tail, so the mean lies to the right of the median.

• For a density curve skewed to the left, the area in the long left tail pulls the balance point toward that tail, so the mean lies to the left of the median.

• In practice, it is difficult to determine either the mean or median of a skewed curve using just your eyes.

• Mathematical methods are used to calculate the mean in such cases.

• The quartiles can be found by dividing the area under the curve into four equal parts.

• First quartile Q1: The point that has ¼ of the total area under the curve to its left.

• Third quartile Q3: The point that has ¾ of the total area under the curve to its left (or alternatively, ¼ to its right).

• The IQR is calculated as we discussed previously (i.e., IQR = Q3 – Q1).

• Since the density curve is an idealized model, we need to distinguish between the mean and standard deviation of the density curve and the numbers x̄ and s computed from the actual observations.

• For an idealized distribution, then, the usual symbols used are

• µ for the mean, and

• σ for the standard deviation.

### The Normal Distribution

• Quite possibly the most important distribution in statistics is the normal distribution.

• The normal distribution is described by a density curve that is symmetric, unimodal, and bell-shaped (recall what these terms mean).

• The shape of all normal density curves is determined by the associated values of the mean µ and standard deviation σ.

### The Normal Distribution

• The height of a normal density curve at any point x is given by

f(x) = (1 / (σ √(2π))) e^(−(x − µ)² / (2σ²))

where µ is the mean and σ is the standard deviation.
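This height function is easy to compute directly. A Python sketch (the function name is my own; the parameters below anticipate the women's-heights example later in the lecture):

```python
import math

def normal_density(x, mu, sigma):
    """Height of the normal density curve:
    f(x) = exp(-(x - mu)**2 / (2 * sigma**2)) / (sigma * sqrt(2 * pi))."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

# The curve is highest at x = mu, where the height is 1 / (sigma * sqrt(2 * pi)).
peak = normal_density(64.5, mu=64.5, sigma=2.5)
print(round(peak, 4))  # about 0.1596
```

Note that the peak height depends only on σ: a larger σ spreads the same total area of 1 over a wider range, so the curve is lower.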

• The normal density curve is

• Single-peaked (Unimodal)

• Bell-shaped

• The mean, median, and mode are all the same

• The mean and standard deviation completely specify the curve

• They are good for describing some distributions of real data

• Scores on tests taken by many people

• Repeated measurements of the same quantity

• Characteristics of biological populations

• They are good at approximating the results of many kinds of chance outcomes

• Many statistical inference procedures based on normal distributions work well for other roughly symmetric distributions.

• For a normal distribution with mean µ and standard deviation σ, written N(µ, σ):

• Approximately 68% of the observations fall within 1σ of the mean µ

• Approximately 95% of the observations fall within 2σ of the mean µ

• Approximately 99.7% of the observations fall within 3σ of the mean µ
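These three proportions can be verified with Python's standard-library `NormalDist` (a sketch for checking the rule, not part of the course software):

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal, N(0, 1)

# Proportion of observations within k standard deviations of the mean.
for k in (1, 2, 3):
    prop = Z.cdf(k) - Z.cdf(-k)
    print(f"within {k} sigma: {prop:.4f}")  # 0.6827, 0.9545, 0.9973
```

The exact values are 68.27%, 95.45%, and 99.73%; the rule rounds them for easy recall.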

### Empirical Rule Applet

• http://www.stat.sc.edu/~west/applets/empiricalrule.html

### Standardizing and z-scores

• All normal distributions are identical if measured in units of size σ about the mean µ.

• Converting to these units is referred to as standardizing.

• To standardize a value, subtract the mean of the distribution and then divide by the standard deviation.

### Standardizing and z-scores (cont.)

• So, the standardized value of an observation x which comes from a distribution with mean µ and standard deviation σ is

z = (x − µ) / σ

• We call a standardized value a z-score.

• A z-score tells us how many standard deviations our original observation is away from the mean, and also in which direction.

### Example: Heights

• The heights of young women are approximately normal with µ = 64.5 inches and σ = 2.5 inches.

• Suppose we have a female student whose height is 5’3”. What is her z-score?

• Suppose we have another student who is 6’1”. What is her z-score? Is there some indication that her height is unusual?
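A Python sketch of both computations (heights converted to inches; the helper name `z_score` is my own):

```python
mu, sigma = 64.5, 2.5  # heights of young women, in inches

def z_score(x):
    """Standardized value: how many SDs x lies from the mean, and in which direction."""
    return (x - mu) / sigma

print(z_score(63))  # 5'3" is 63 inches: z = -0.6, a little below average
print(z_score(73))  # 6'1" is 73 inches: z = 3.4, more than 3 SDs above the mean
```

By the 68–95–99.7 rule, a z-score of 3.4 lies outside the range covering 99.7% of young women, so this height is unusual.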

### The Standard Normal Distribution

• Standardizing transforms all normal distributions into the standard normal distribution, which is the normal distribution N(0, 1) with mean 0 and standard deviation 1.

• If a variable X is normally distributed with mean µ and standard deviation σ, then the standardized variable

Z = (X − µ) / σ

has the standard normal distribution.

### The Standard Normal Table

• Table A is a table of areas under the standard normal density curve. The table entry for each value z is the area under the curve to the left of z.

• So, Table A can be used to find the proportion of observations of a variable which fall to the left of a specific value z if the variable follows a normal distribution.

• As we previously mentioned, the heights of young women are distributed N(64.5,2.5); what is the proportion of young women who are shorter than 5’5” (65 inches)?

• What about the proportion of young women who are taller than 5’8” (68 inches)?

• Lastly, how about young women who are between 5’3” (63 inches) and 5’7” (67 inches)? What’s their proportion?
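Software can compute the same left-tail areas that Table A tabulates. A Python sketch of the three questions using the standard library's `NormalDist`:

```python
from statistics import NormalDist

heights = NormalDist(mu=64.5, sigma=2.5)  # N(64.5, 2.5)

# Left-tail area: proportion shorter than 65 inches (z = 0.2).
shorter_than_65 = heights.cdf(65)                      # about 0.579
# Right-tail area: 1 minus the left-tail area at 68 inches (z = 1.4).
taller_than_68 = 1 - heights.cdf(68)                   # about 0.081
# Area between two values: difference of two left-tail areas.
between_63_and_67 = heights.cdf(67) - heights.cdf(63)  # about 0.567

print(shorter_than_65, taller_than_68, between_63_and_67)
```

The second and third computations show the two standard tricks for using a left-tail table: subtract from 1 for a right tail, and subtract two left-tail areas for a between-two-values region.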

### Normal Quantile Plot

• While some distributions of real data are approximately normal, others are decidedly skewed and hence distinctly non-normal.

• So how do we determine whether data are approximately normal?

• Histograms and stemplots can reveal non-normal features like outliers, pronounced skewness, and gaps and clusters, but what if both the stemplot and histogram appear symmetric and unimodal?

### The Normal Quantile Plot (cont.)

• The tool we use most often to identify if data come from a normal distribution is the normal quantile plot.

• The basic idea of the normal quantile plot is to rescale the horizontal axis so that a “perfect” standard normal sample would fall along a 45° line (i.e., the line y = x).

### The Normal Quantile Plot (cont.)

• Generally, we can use software to construct the normal quantile plot, but here is a very simple version of the procedure that is used:

• Arrange our observed data in increasing order and keep track of the percentile occupied by each point.

• Find the z-scores at these same percentiles.

• Plot each data point x against the matching z. If the distribution of the data is close to any normal distribution, the plotted points should fall along a straight line.
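The three steps above can be sketched in Python (the data set is made up purely for illustration; the (i − 0.5)/n percentile convention is one common choice, not the only one):

```python
from statistics import NormalDist

# Hypothetical sample, purely for illustration.
data = [12.1, 9.8, 11.4, 10.2, 10.9, 9.5, 11.0, 10.5]

# Step 1: arrange the data in increasing order and record each percentile.
xs = sorted(data)
n = len(xs)
# (i - 0.5) / n avoids percentiles of exactly 0 or 1, which have no z-score.
percentiles = [(i - 0.5) / n for i in range(1, n + 1)]

# Step 2: find the z-scores at those same percentiles
# (inverse of the standard normal CDF).
zs = [NormalDist().inv_cdf(p) for p in percentiles]

# Step 3: plot each data point against its matching z; near-normal data
# fall close to a straight line. Here we just print the (z, x) pairs.
for x, z in zip(xs, zs):
    print(f"x = {x:5.1f}   z = {z:+.2f}")
```

In practice software draws the plot for us; the sketch just shows where the horizontal-axis z values come from.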

### The Normal Quantile Plot (cont.)

• As you might have guessed, the normal quantile plot for the data set on the left is approximately normal.

• On the other hand, you should be able to see that the normal quantile plot for the data set on the right is decidedly non-normal.

• Tip: If our data are approximately normal, then all of the points should lie between the two dashed red lines.

• To study the relationship between two variables, we measure both variables on the same individuals.

• This allows us to see the connection between the two variables.

• Two variables measured on the same individuals are associated if some values of one variable tend to occur more often with some values of the second variable.

• Two variables are considered positively associated if large values of one tend to accompany large values of the other, and similarly for small values.

• Examples: Height and weight, distance and time

• Two variables are considered negatively associated if large values of one accompany small values of the other, and vice versa.

• What individuals are being described by the data?

• What variables are present?

• Which variables are quantitative? Categorical?

• Are we trying to explore the nature of the relationship or trying to show that one helps explain variation in the other? I.e., are some variables response variables and others explanatory variables?

### Scatterplots

• If we have two quantitative variables, one of the best ways to display their relationship is using a scatterplot.

• A scatterplot is a two-dimensional plot, where one variable’s values are plotted along the horizontal axis, and the other’s along the vertical axis.

### Scatterplot Terminology

• The variable plotted on the x-axis is called the

• Explanatory variable

• Independent variable

• Predictor

• x variable

• The variable plotted on the y-axis is called the

• Response variable

• Dependent variable

• y variable

• What is the overall pattern?

• What is the form? Linear or non-linear? Are there clusters?

• What is the direction? Positively or negatively associated?

• How strong is the relationship? I.e., how closely do the points follow a clear form?

• Are there any outliers? That is, are there points that fall outside the overall pattern?

### Typical Patterns of Scatterplots

Positive linear relationship

Negative linear relationship

No relationship

Negative nonlinear relationship

Nonlinear (concave) relationship

This is a weak linear relationship. A nonlinear relationship seems to fit the data better.

### Time Series and Time Plots

• One special type of relationship looks at the evolution of a variable over time. A data set that keeps track of a variable’s values along with its time of measurement is a time series.

• Examples: Government, economic, and social data

• A time plot plots the value of each observation against the time at which it was measured. Time should always be placed on the horizontal scale and the other variable on the vertical scale.

### Time Series (cont.)

• There are two main types of patterns in a time series plot:

• A trend is a persistent, long-term rise or fall.

• Example: U.S. GDP is generally increasing over time.

• A pattern that repeats at known regular intervals is considered seasonality.

• Example: Retail sales tend to be much higher in November and December than in the other months.

### Example: SAT Scores

• We have a data set that contains 2407 SAT scores, including the math and verbal breakdowns.

• What sort of relationship do we expect to see between math scores and verbal scores? How strong do we expect this relationship to be?

• Should we expect to see any clusters in this data?

• Do we expect that one helps explain variation in the other?

### Side-by-side Boxplots

• When we wish to see the association between two quantitative variables, we use a scatterplot.

• What if we instead want to see the relationship between a categorical explanatory variable and a quantitative response variable?

• What we need to do is make a side-by-side comparison of the distributions of the response for each category.

• Side-by-side boxplots are a useful tool for doing this.

### Correlation

• Even if there is a linear relationship between two quantitative variables, it is difficult for the eye to judge how strong that relationship is.

• Changing plotting scales or increasing the amount of white space around the cloud of points can throw off our visual impressions.

• What we need then is a numerical measure to go with our scatterplot.

### Correlation (cont.)

• The statistic we use to measure the direction and strength of the linear relationship between two quantitative variables is correlation.

• If we have n observations on two variables x and y, then the sample correlation between them is calculated using the formula

r = (1 / (n − 1)) Σ ((x_i − x̄) / s_x) ((y_i − ȳ) / s_y)

where x̄ and ȳ are the sample means and s_x and s_y are the sample standard deviations.
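The sample correlation can be computed directly from its definition. A Python sketch on toy data (in practice, software such as JMP reports r for us):

```python
import math

def correlation(x, y):
    """Sample correlation r: average product of standardized x and y values,
    using the n - 1 divisor throughout."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sx = math.sqrt(sum((xi - xbar) ** 2 for xi in x) / (n - 1))
    sy = math.sqrt(sum((yi - ybar) ** 2 for yi in y) / (n - 1))
    return sum(((xi - xbar) / sx) * ((yi - ybar) / sy)
               for xi, yi in zip(x, y)) / (n - 1)

# Toy data, purely for illustration.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
print(round(correlation(x, y), 4))  # 0.7746
```

Because r is built from standardized values, rescaling either variable (e.g., inches to centimeters) leaves it unchanged.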

### Correlation (cont.)

• The correlation r is always a number between -1 and 1.

• Positive r indicates positive association, negative r indicates negative association.

• A value of r close to 0 indicates a very weak linear relationship. Strength of linear association increases as r approaches -1 or 1.

• r = -1 and r = 1 occur only when all the points in a scatterplot lie in a perfectly straight line.

### Correlation (cont.)

• Correlation is not affected by the distinction between explanatory and response variables.

• Because r is calculated using standardized units, it is invariant to the rescaling of either x or y.

• Correlation only measures the linear relationship between two variables.

• Correlation is very sensitive to outliers.

• Question: Suppose we have two variables such that r = 0. Is it true that they are not associated?
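The answer is no, and a sketch with made-up data shows why: below, y is completely determined by x, yet r = 0 because the relationship is not linear (r only measures linear association).

```python
# Made-up data: y is a perfect function of x, but the relationship is a parabola.
x = [-2, -1, 0, 1, 2]
y = [xi ** 2 for xi in x]

n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
# The numerator of r: sum of products of deviations. Positive and negative
# products cancel exactly here, so r = 0 despite a perfect association.
num = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
print(num)  # 0
```

So r = 0 rules out a linear relationship, not association in general.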

### The Regression Line

• A regression line is a straight line that summarizes the linear relationship between two variables.

• It describes how a response variable y changes as an explanatory variable x changes.

• A regression line is often used as a model to predict the value of the response y for a given value of the explanatory variable x.

### The Regression Line (cont.)

• We fit a line to data by drawing the line that comes as close as possible to the points.

• Once we have a regression line, we can predict the y for a specific value of x. Accuracy depends on how scattered the data are about the line.

• Using the regression line for prediction for far outside the range of values of x used to obtain the line is called extrapolation. This is generally not advised, since predictions will be inaccurate.

• Making a regression line using JMP:

Analyze → Fit Y by X → Put the response variable into Y, explanatory variable into X → Hit OK → Double-click the red triangle above the scatterplot → Fit line

• Mathematically, a straight line has an equation of the form y = a + bx, where b is the slope and a is the intercept. But how do we determine the value of these two numbers?

• The least-squares regression line of y on x is the line that makes the sum of the squares of the vertical distances of the data points from the line as small as possible.

• Mathematically, the line is determined by minimizing the sum of squared vertical distances,

Σ (y_i − (a + b x_i))²

• The equation of the least-squares regression line of y on x is ŷ = a + bx.

• The slope is determined using the formula b = r (s_y / s_x).

• The intercept is calculated using a = ȳ − b x̄.
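These formulas can be sketched in Python (toy data; the helper name `least_squares` is my own). The slope is computed here as Σ(x_i − x̄)(y_i − ȳ) / Σ(x_i − x̄)², which is algebraically the same as b = r (s_y / s_x):

```python
def least_squares(x, y):
    """Slope b and intercept a of the least-squares line of y on x."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b = sxy / sxx        # equivalent to b = r * (s_y / s_x)
    a = ybar - b * xbar  # forces the line through the point (xbar, ybar)
    return a, b

# Toy data, purely for illustration.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]
a, b = least_squares(x, y)
print(round(a, 2), round(b, 2))  # 2.2 0.6
```

The formula for a makes the balance property explicit: the fitted line always passes through the point of means.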

### Interpreting the Regression Line

• The slope b tells us that along the regression line, a change of 1 unit in x corresponds to a change of b units in y.

• The least-squares regression line always passes through the point (x̄, ȳ).

• If both x and y are standardized variables, then the slope of the least-squares regression line will be r, and the line will pass through the origin (0,0).

• Since standard deviation can never be negative, the signs of r and b will always be the same.

• Hence, if our slope is positive, we have a positive association between our explanatory variable and our response.

• On the other hand, if our slope is negative, then we have a negative association between our explanatory variable and our response.

### Example: SAT Scores Again

• In our SAT data, the math score is the response, and the verbal score is the explanatory variable. The least-squares regression line as reported by JMP is

math = 498.00765 + 0.3167866 verbal

• Hence, in the context of the SAT, if a student’s verbal score is 10 points higher, we predict his math score to be a little more than 3 points higher.

### Example: SAT Scores (cont.)

• Suppose we want to predict using our regression line a student’s math score given that his verbal score was 550.

• The predicted math score then would be

498.00765 + 0.3167866 (550) ≈ 672

• Remember not to extrapolate when you make your predictions.

### Example: SAT Scores (cont.)

• Now, suppose we instead wanted to predict SAT verbal scores using SAT math scores, and suppose that one student had a math score of 670.

• Naively, we would predict the verbal score by taking the inverse of our existing regression line, in which case we would predict a verbal score between 540 and 550.

• It is not quite as simple as this.

### Example: SAT Scores (cont.)

• What we would need to do is re-fit the regression line using math scores as our explanatory variable and verbal scores as our response.

• The new regression line is (from JMP)

verbal = 408.37653 + 0.3901289 math

• So, our predicted verbal score given a math score of 670 would be

408.37653 + 0.3901289 (670) ≈ 670

### Correlation and Regression

• The square of the correlation, r², is the proportion of the variation in the data that is explained by our least-squares regression line.

• r² is always between 0 and 1.

• If r = ±0.7, then r² = 0.49, or about ½ of the variation.

• In our SAT data, r² = 0.1238 (it is the same for both regressions), so our regression line only captures about 12% of the response’s variation.
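A way to see what “proportion of variation explained” means is the standard identity r² = 1 − SSE/SST, where SSE is the sum of squared residuals about the line and SST the sum of squared deviations about ȳ. A Python sketch on made-up data:

```python
# Toy data, purely for illustration.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

n = len(x)
xbar, ybar = sum(x) / n, sum(y) / n
# Least-squares slope and intercept.
b = (sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
     / sum((xi - xbar) ** 2 for xi in x))
a = ybar - b * xbar

sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))  # unexplained variation
sst = sum((yi - ybar) ** 2 for yi in y)                      # total variation
r_squared = 1 - sse / sst
print(round(r_squared, 2))  # 0.6
```

Here r ≈ 0.775, and squaring it gives the same 0.6: the line accounts for 60% of the variation in y, and the remaining 40% is scatter about the line.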