Exploring Descriptive Statistics: Mean, Median, Mode, Variability

Chapter 3 – Descriptive Statistics Numerical Measures

Chapter Outline • Measures of Central Location • Mean • Median • Mode • Percentile (Quartile, Quintile, etc.) • Measures of Variability • Range • Variance (Standard Deviation, Coefficient of Variation)

Recall • A sample is a subset of a population. • Numerical measures calculated for sample data are called sample statistics. • Numerical measures calculated for population data are called population parameters. • A sample statistic is referred to as the point estimator of the corresponding population parameter.

Mean • As a measure of central location, the mean is simply the arithmetic average of all the data values. • The sample mean $\bar{x}$ is the point estimator of the population mean $\mu$.

Sample Mean • $\bar{x} = \frac{\sum x_i}{n}$ • The symbol $\Sigma$ (called sigma) means ‘sum up’. • $x_i$ is the value of the $i$th observation in the sample. • $n$ is the number of observations in the sample.

Population Mean • $\mu = \frac{\sum x_i}{N}$ • The symbol $\Sigma$ (called sigma) means ‘sum up’. • $x_i$ is the value of the $i$th observation in the population. • $N$ is the number of observations in the population. • $\mu$ is pronounced ‘mu’.

Sample Mean • Example: Sales of Starbucks Stores • 50 Starbucks stores are randomly chosen in NYC. The table below shows the sales of those stores in December 2012.

Sample Mean • Example: Sales of Starbucks Stores

Median • The median of a data set is the value in the middle when the data items are arranged in ascending order. • Whenever a data set has extreme values, the median is the preferred measure of central location. • The median is the measure of location most often reported for annual income and property value data. • A few extremely large incomes or property values can inflate the mean, since the calculation of the mean uses all the data items.

Median • For an odd number of observations, the median is the middle value. • Data (7 observations): 26 18 27 12 14 27 19 • In ascending order: 12 14 18 19 26 27 27 • Median = 19

Median • For an even number of observations, the median is the average of the middle two values. • Data (8 observations): 26 18 27 12 14 27 19 30 • In ascending order: 12 14 18 19 26 27 27 30 • Median = (19 + 26)/2 = 22.5

Mean vs. Median • As noted, extreme values can change the mean remarkably, while the median is usually affected much less. In that regard, the median is a better representative of central location. • For the previous example (12 14 18 19 26 27 27 30), the median is 22.5 and the mean is 21.6. • If we add one large number (280) to the data, giving 12 14 18 19 26 27 27 30 280, the median becomes 26 (the value in the middle), but the mean becomes 50.3. • In this case we prefer the median to the mean as a measure of central location.
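The comparison above can be checked with a short sketch. It uses Python's standard `statistics` module; the variable names `data` and `with_outlier` are illustrative and not part of the original slides.

```python
from statistics import mean, median

# The eight observations from the median example, in ascending order.
data = [12, 14, 18, 19, 26, 27, 27, 30]

# The same data with one extreme value appended.
with_outlier = data + [280]

print(mean(data), median(data))                            # 21.625 22.5
print(round(mean(with_outlier), 1), median(with_outlier))  # 50.3 26
```

The mean more than doubles after one extreme value is added, while the median moves only from 22.5 to 26.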

Mode • The mode of a data set is the value that occurs most frequently. • The greatest frequency can occur at two or more different values. • If the data have exactly two modes, the data are bimodal. • If the data have more than two modes, the data are multimodal. • Caution: If the data are bimodal or multimodal, Excel’s MODE function will incorrectly identify a single mode.

Mode • Example: 12 14 18 19 26 27 27 30 • For the example above, 27 shows up twice while all the other data values show up once. So, the mode is 27.
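As a small sketch of the single-mode and multiple-mode cases, Python's standard `statistics` module offers `mode` and `multimode` (the latter requires Python 3.8 or later). The `bimodal` list is a hypothetical variation of the example data, made up here only to show the two-mode case.

```python
from statistics import mode, multimode

data = [12, 14, 18, 19, 26, 27, 27, 30]
print(mode(data))          # 27

# Hypothetical variation: add a second 14 so that two values tie.
bimodal = data + [14]
print(multimode(bimodal))  # [14, 27]
```

`multimode` returns every value that shares the highest frequency, which avoids the single-mode pitfall described in the caution about Excel's MODE function.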

Percentiles • A percentile provides information about how the data are spread over the interval from the smallest value to the largest value. • Admission test scores for colleges and universities are frequently reported in terms of percentiles. • The pth percentile of a data set is a value such that at least p percent of the items are less than or equal to this value and at least (100 - p) percent of the items are more than or equal to this value. • The 50th percentile is simply the median.

Percentiles • Arrange the data in ascending order. • Compute the index $i$, the position of the $p$th percentile: $i = (p/100)n$ • If $i$ is not an integer, round up. The $p$th percentile is the value in the $i$th position. • If $i$ is an integer, the $p$th percentile is the average of the values in positions $i$ and $i+1$.

Percentiles • Find the 75th percentile of the following data: 12 14 18 19 26 27 29 30 • Note: The data are already in ascending order. • $i = (p/100)n = (75/100)(8) = 6$ • Since $i$ is an integer, average the 6th and 7th data values: 75th percentile = (27 + 29)/2 = 28

Percentiles • Find the 20th percentile of the following data: 12 14 18 19 26 27 29 30 • Note: The data are already in ascending order. • $i = (p/100)n = (20/100)(8) = 1.6$, which is rounded up to 2. • So, the 20th percentile is simply the 2nd data value, i.e. 14.
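The percentile procedure can be sketched as a small function. This is a minimal illustration of the round-up / average-adjacent rule described in this chapter; the function name `percentile` is an assumption made here, and library routines generally use other interpolation conventions, so their results may differ.

```python
import math

def percentile(values, p):
    """Return the pth percentile using the chapter's rule (0 < p < 100)."""
    if not 0 < p < 100:
        raise ValueError("p must be strictly between 0 and 100")
    ordered = sorted(values)           # step 1: ascending order
    i = p * len(ordered) / 100         # step 2: index i = (p/100) n
    if i.is_integer():
        i = int(i)
        # i is an integer: average the values in positions i and i + 1
        return (ordered[i - 1] + ordered[i]) / 2
    # i is not an integer: round up and take the value in that position
    return ordered[math.ceil(i) - 1]

data = [12, 14, 18, 19, 26, 27, 29, 30]
print(percentile(data, 75))  # 28.0
print(percentile(data, 20))  # 14
```

Positions in the rule are 1-based, so the code subtracts 1 when indexing the Python list.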

Quartiles • Quartiles are specific percentiles. • First Quartile = 25th percentile • Second Quartile = 50th percentile = Median • Third Quartile = 75th percentile

Measures of Variability • It is often desirable to consider measures of variability (dispersion), as well as measures of central location. • For example, suppose two stocks provide the same average return of 5% a year, but stock A’s return is very stable (close to 5%) while stock B’s return is volatile (it could be as low as –10%). Are you indifferent about which stock to invest in? • For another example, in choosing supplier A or supplier B we might consider not only the average delivery time for each, but also the variability in delivery time for each.

Measures of Variability • Range • Interquartile Range • Variance/Standard Deviation • Coefficient of Variation

Range • The range of a data set is the difference between the largest and smallest data values. • It is the simplest measure of variability. • It is very sensitive to the smallest and largest data values.

Range • Example: 12 14 18 19 26 27 29 30 • Range = largest value – smallest value = 30 – 12 = 18

Interquartile Range • The interquartile range of a data set is the difference between the 3rd quartile and the 1st quartile. • It is the range of the middle 50% of the data. • It overcomes the sensitivity to extreme data values.

Interquartile Range • Example: 12 14 18 19 26 27 29 30 • 3rd Quartile (Q3) = 75th percentile = 28 • 1st Quartile (Q1) = 25th percentile = 16 • Interquartile Range = Q3 – Q1 = 28 – 16 = 12
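The range and interquartile range for this example can be reproduced with a short sketch. The helper `percentile` is an illustrative name, not from the slides; it applies the chapter's percentile rule (round the index up, or average two adjacent positions when the index is an integer).

```python
import math

def percentile(values, p):
    """pth percentile by the chapter's round-up / average-adjacent rule."""
    ordered = sorted(values)
    i = p * len(ordered) / 100
    if i.is_integer():
        i = int(i)
        return (ordered[i - 1] + ordered[i]) / 2
    return ordered[math.ceil(i) - 1]

data = [12, 14, 18, 19, 26, 27, 29, 30]

data_range = max(data) - min(data)   # largest value minus smallest value
q1 = percentile(data, 25)            # first quartile
q3 = percentile(data, 75)            # third quartile
iqr = q3 - q1                        # spread of the middle 50% of the data

print(data_range, q1, q3, iqr)       # 18 16.0 28.0 12.0
```

Because the interquartile range ignores the lowest and highest quarters of the data, replacing 30 with a much larger value would change the range but leave the interquartile range at 12.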

Variance • The variance is a measure of variability that utilizes all the data. • It is based on the difference between the value of each observation ($x_i$) and the mean ($\bar{x}$ for a sample, $\mu$ for a population). • The variance is useful in comparing the variability of two or more variables.

Variance • The variance is the average of the squared differences between each data value and the mean. • The variance is calculated as follows: • $s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1}$ for a sample • $\sigma^2 = \frac{\sum (x_i - \mu)^2}{N}$ for a population

Standard Deviation • The standard deviation of a data set is the positive square root of the variance. • It is measured in the same units as the data, making it easier to interpret than the variance.

Standard Deviation • The standard deviation is computed as follows: • $s = \sqrt{s^2}$ for a sample • $\sigma = \sqrt{\sigma^2}$ for a population

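Variance and Standard Deviation • Example: 12 14 18 19 26 27 29 30 (treated as a sample, $n = 8$, $\bar{x} = 21.875$) • Variance: $s^2 = \frac{\sum (x_i - \bar{x})^2}{n - 1} = \frac{342.875}{7} \approx 48.98$ • Standard Deviation: $s = \sqrt{48.98} \approx 7.00$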
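The sample and population formulas can be compared side by side with Python's standard `statistics` module. This sketch treats the eight example values as a sample and also computes the sample variance by hand to show that the two routes agree; variable names are illustrative.

```python
from statistics import mean, pvariance, stdev, variance

data = [12, 14, 18, 19, 26, 27, 29, 30]

# Manual calculation of the sample variance.
x_bar = mean(data)                                   # 21.875
squared_devs = sum((x - x_bar) ** 2 for x in data)   # 342.875
manual_sample_var = squared_devs / (len(data) - 1)   # divide by n - 1

print(round(manual_sample_var, 2))   # 48.98
print(round(variance(data), 2))      # 48.98 (sample variance, divides by n - 1)
print(round(stdev(data), 2))         # 7.0   (sample standard deviation)
print(pvariance(data))               # population variance, divides by N
```

The population variance is smaller than the sample variance for the same numbers because it divides by $N$ rather than $n - 1$.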

Coefficient of Variation • The coefficient of variation indicates how large the standard deviation is in relation to the mean. • In a comparison between two data sets with different units or with the same units but a significant difference in magnitude, coefficient of variation should be used instead of variance.

Coefficient of Variation • The coefficient of variation is computed as follows: • $\left(\frac{s}{\bar{x}} \times 100\right)\%$ for a sample • $\left(\frac{\sigma}{\mu} \times 100\right)\%$ for a population

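Coefficient of Variation • Example: 12 14 18 19 26 27 29 30 (treated as a sample, $\bar{x} = 21.875$, $s \approx 7.00$) • Coefficient of variation $= \left(\frac{s}{\bar{x}} \times 100\right)\% = \left(\frac{7.00}{21.875} \times 100\right)\% \approx 32.0\%$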

Coefficient of Variation • Example: Height vs. Weight • In a class of 30 students, the average height is 5’5’’ (65 inches) with a standard deviation of 3’’, and the average weight is 120 lbs with a standard deviation of 20 lbs. Question: in which measure (height or weight) do students differ more? • Since height and weight don’t have the same unit, we have to use the coefficient of variation to remove the units before comparing the variation in height and weight. • Height: (3/65 × 100)% ≈ 4.6%. Weight: (20/120 × 100)% ≈ 16.7%. • Students’ weight varies more than their height.
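The height and weight comparison comes down to two divisions, sketched below. The function name `coefficient_of_variation` is illustrative, and the height of 5'5'' is converted to 65 inches so that it shares a unit with the 3-inch standard deviation.

```python
from statistics import mean, stdev

def coefficient_of_variation(std_dev, average):
    """Standard deviation as a percentage of the mean."""
    return std_dev / average * 100

height_cv = coefficient_of_variation(3, 65)     # inches; 5'5" = 65 inches
weight_cv = coefficient_of_variation(20, 120)   # pounds

print(round(height_cv, 1), round(weight_cv, 1))  # 4.6 16.7

# The same function applied to the eight-value example, treated as a sample.
data = [12, 14, 18, 19, 26, 27, 29, 30]
data_cv = coefficient_of_variation(stdev(data), mean(data))
print(round(data_cv, 1))                         # 32.0
```

Both results are unit-free percentages, which is what makes the height and weight figures comparable.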

Measures of Distribution Shape, Relative Location, and Detecting Outliers • Distribution Shape • z-Scores • Chebyshev’s Theorem • Empirical Rule • Detecting Outliers

Distribution Shape: Skewness • An important measure of the shape of a distribution is called skewness. • The formula for the skewness of sample data is $\text{Skewness} = \frac{n}{(n-1)(n-2)} \sum \left(\frac{x_i - \bar{x}}{s}\right)^3$ • Skewness can be easily computed using statistical software.

Distribution Shape: Skewness • Symmetric (not skewed) • Skewness is zero. • Mean and median are equal. (Figure: relative frequency histogram, skewness = 0.)

Distribution Shape: Skewness • Skewed to the left • Skewness is negative. • Mean is usually less than the median. (Figure: relative frequency histogram, skewness = –.33.)

Distribution Shape: Skewness • Skewed to the right • Skewness is positive. • Mean is usually more than the median. (Figure: relative frequency histogram, skewness = .31.)
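The sign of the skewness in the three cases can be illustrated with a small function. This sketch assumes the adjusted sample skewness formula $\frac{n}{(n-1)(n-2)} \sum \left(\frac{x_i - \bar{x}}{s}\right)^3$, which is the form Excel's SKEW function uses; the function name and the three small data sets are hypothetical and chosen only to show each shape.

```python
from statistics import mean, stdev

def sample_skewness(values):
    """Adjusted sample skewness; needs at least 3 values that are not all equal."""
    n = len(values)
    if n < 3:
        raise ValueError("at least three observations are required")
    x_bar = mean(values)
    s = stdev(values)  # sample standard deviation (n - 1 in the denominator)
    return n / ((n - 1) * (n - 2)) * sum(((x - x_bar) / s) ** 3 for x in values)

symmetric = [1, 2, 3, 4, 5]        # hypothetical: mirror image around 3
right_skewed = [1, 2, 3, 4, 15]    # hypothetical: long tail to the right
left_skewed = [1, 12, 13, 14, 15]  # hypothetical: long tail to the left

for name, values in [("symmetric", symmetric),
                     ("right", right_skewed),
                     ("left", left_skewed)]:
    print(name, round(sample_skewness(values), 2), mean(values))
```

In the right-skewed set the mean (5) exceeds the median (3), and in the left-skewed set the mean (11) falls below the median (13), matching the pattern described above.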

Z-Scores • The z-score is often called the standardized value. • It denotes the number of standard deviations a data value $x_i$ is from the mean: $z_i = \frac{x_i - \bar{x}}{s}$ • Excel’s STANDARDIZE function can be used to compute the z-score.

Z-Scores • An observation’s z-score is a measure of the relative location of the observation in a data set. • A data value less than the sample mean has a negative z-score. • A data value greater than the sample mean has a positive z-score. • A data value equal to the sample mean has a z-score of zero.

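Z-Scores • Example: 12 14 18 19 26 27 29 30 (treated as a sample, $\bar{x} = 21.875$, $s \approx 7.00$) • Smallest value: $z = (12 - 21.875)/7.00 \approx -1.41$ • Largest value: $z = (30 - 21.875)/7.00 \approx 1.16$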
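A z-score for every value in the example can be computed in a few lines. This sketch treats the eight values as a sample, so it uses the sample mean and sample standard deviation from Python's `statistics` module; variable names are illustrative.

```python
from statistics import mean, stdev

data = [12, 14, 18, 19, 26, 27, 29, 30]
x_bar = mean(data)   # sample mean
s = stdev(data)      # sample standard deviation

# z-score: how many standard deviations each value lies from the mean.
z_scores = [(x - x_bar) / s for x in data]

for x, z in zip(data, z_scores):
    print(x, round(z, 2))
```

Values below the mean of 21.875 get negative z-scores and values above it get positive ones, and the z-scores sum to zero apart from floating-point rounding.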

Chebyshev’s Theorem • At least $(1 - 1/z^2)$ of the items in any data set will be within $z$ standard deviations of the mean, i.e. between $\bar{x} - zs$ and $\bar{x} + zs$, where $z$ is any value greater than 1. • Chebyshev’s theorem requires $z > 1$, but $z$ need not be an integer.

Chebyshev’s Theorem • At least 55.6% of the data values must be within z = 1.5 standard deviations of the mean. • At least 89% of the data values must be within z = 3 standard deviations of the mean. • At least 94% of the data values must be within z = 4 standard deviations of the mean.

Chebyshev’s Theorem • Example: Given that $\bar{x} = 10$ and $s = 2$, at least what percentage of all the data values fall within 2 standard deviations of the mean? • At least $(1 - 1/2^2) = 1 - 1/4 = 75\%$ of all the data values must be between 6 and 14. • $\bar{x} - zs = 10 - 2(2) = 6$ • $\bar{x} + zs = 10 + 2(2) = 14$
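Chebyshev's bound and the matching interval are simple enough to sketch directly. The function names `chebyshev_bound` and `chebyshev_interval` are illustrative, not from the slides.

```python
def chebyshev_bound(z):
    """Minimum proportion of data within z standard deviations of the mean."""
    if z <= 1:
        raise ValueError("Chebyshev's theorem requires z > 1")
    return 1 - 1 / z ** 2

def chebyshev_interval(average, std_dev, z):
    """Lower and upper limits of the interval mean +/- z standard deviations."""
    return average - z * std_dev, average + z * std_dev

print(chebyshev_bound(2))             # 0.75
print(chebyshev_interval(10, 2, 2))   # (6, 14)

for z in (1.5, 3, 4):
    print(z, round(chebyshev_bound(z) * 100, 1))
```

The loop gives roughly 55.6%, 88.9%, and 93.8% for z = 1.5, 3, and 4. The bound holds for any data set, whatever its shape, which is why it is much looser than figures based on a bell-shaped distribution.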

Empirical Rule • When the data are believed to approximate a bell-shaped distribution, the empirical rule can be used to determine the percentage of data values that must be within a specified number of standard deviations of the mean. • The empirical rule is based on the normal distribution, which is covered in Chapter 6.

Empirical Rule • For data having a bell-shaped distribution: • About 68% of the values of a normal random variable are between $\mu - \sigma$ and $\mu + \sigma$. • About 95% of the values of a normal random variable are between $\mu - 2\sigma$ and $\mu + 2\sigma$. • About 99.7% (almost all) of the values of a normal random variable are between $\mu - 3\sigma$ and $\mu + 3\sigma$.

Empirical Rule (Figure: bell-shaped curve centered at $\mu$ on the $x$ axis, marking $\mu \pm 1\sigma$ (about 68%), $\mu \pm 2\sigma$ (about 95%), and $\mu \pm 3\sigma$ (about 99.7%).)
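The percentages in the empirical rule come from the normal distribution and can be recovered numerically. This sketch uses `NormalDist` from Python's standard `statistics` module (Python 3.8 or later); the helper name `share_within` is an assumption made for illustration.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def share_within(k):
    """Proportion of a normal distribution within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(k, round(share_within(k) * 100, 1))
```

The printed shares are roughly 68.3%, 95.4%, and 99.7%, which the rule rounds to "about 68%", "about 95%", and "almost all".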

Detecting Outliers • An outlier is an unusually small or unusually large value in a data set. • A data value with a z-score less than –3 or greater than +3 might be considered an outlier. • It might be: • An incorrectly recorded data value • A data value that was incorrectly included in the data set. • A correctly recorded data value that belongs in the data set.
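The z-score screen for outliers can be sketched as follows. The data set here is hypothetical, invented only to show a flagged value, and the cutoff of 3 is the rule of thumb stated above; a flagged value still has to be reviewed by hand, since it may be a correctly recorded value that belongs in the data set.

```python
from statistics import mean, stdev

def z_score_outliers(values, cutoff=3):
    """Return the values whose z-score is below -cutoff or above +cutoff."""
    x_bar = mean(values)
    s = stdev(values)
    return [x for x in values if abs((x - x_bar) / s) > cutoff]

# Hypothetical data: 19 values between 10 and 13 plus one unusually large value.
values = [10, 12, 11, 13, 12, 11, 10, 12, 13, 11,
          12, 10, 11, 13, 12, 11, 12, 10, 13, 60]

outliers = z_score_outliers(values)
print(outliers)  # [60]
```

The value 60 has a z-score of about 4.2, so it is the only one flagged.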

Measures of Association Between Two Variables • So far, we have examined numerical methods used to summarize the data for one variable at a time. • Often a manager or decision maker is interested in the relationship between two variables. • Two numerical measures of the relationship between two variables are covariance and correlation coefficient.
