Statistical Review

Statistical Review Measures of Central Location

SALARY.XLS • Lists starting salaries for 190 graduates from an undergraduate school of business. • The data is in the range named Salary on a sheet called Data.

The Mean • We calculate the mean salary by entering the formula “=AVERAGE(Salary)” in cell B6 of the Excel Functions worksheet. • The mean salary is $29,762. • The mean in this example is a “representative” measure because the distribution of salaries is nearly symmetric. • The mean can be misleading due to skewness.

The Median • The median is the “middle” observation when the data are listed from smallest to largest. • If there is an odd number of observations, the median is the middle observation. • If there is an even number of observations, we take the median to be the average of the two middle observations.

The Median -- continued • We calculate the median salary in Example 3.1 by entering the formula “=MEDIAN(Salary)” in cell B7 of the Excel Functions worksheet. • The median in this example is $29,850. • In this case, the mean and the median values are nearly the same because the distribution is approximately symmetric.

The Median -- continued • If the salary distribution were skewed (for example, a few graduates received abnormally large salaries), the mean would be biased upward while the median would not be affected by the unusual values. • Thus, it is better to use the median in characterizing the center of a distribution when that distribution is skewed.

The Mode • The mode is the most frequently occurring value. • If the values are essentially continuous, as with the salaries in Example 3.1, then the mode is essentially irrelevant. There is typically no single value that occurs more than once. • Thus, the mode is not likely to provide much information.

The Mode

SHOES.XLS • This file lists shoe sizes purchased at a shoe store. • We seek to find the best-selling shoe size at this store.

The Mode • The mode is the most frequently occurring value. • If the values are essentially continuous, then the mode is usually not relevant. There is typically no single value that occurs more than once.

Why is the mode relevant here? • Shoe sizes come in discrete increments, rather than a continuum; so it makes sense to find the mode, the size that is requested most often, in this example.

Computing the Mode in EXCEL • The mode can be found in Excel by entering the formula “=MODE(Range of Data)”. • Applying this formula in the present example, we find that size 11 is the most frequently purchased shoe size. • This is also apparent form the histogram on the next slide.

Distribution of Shoe Sizes

Measures of Variability: Variance and Standard Deviation

OTIS4.XLS • Suppose that Otis Elevator is going to stop manufacturing elevator rails. Instead, it is going to buy them from an outside supplier. • Otis would like each rail to have a diameter of 1 inch. • The company has obtained samples of ten elevator rails from each supplier. They are listed in columns A and B of this Excel file.

Which should Otis prefer? • Observe that the mean, median, and mode are all exactly 1 inch for each of the two suppliers. • Based on these measures, the two suppliers are equally good and right on the mark. However, we when we consider measures of variability, supplier 1 is somewhat better than supplier 2. Why?

Explanation • The reason is that supplier 2’s rails exhibit more variability about the mean than do supplier 1’s rails. • If we want rails to have a diameter of 1 inch, then more variability around the mean is very undesirable!

Variance • The most commonly used measures of variability are the variance and standard deviation. • The variance is essentially the average of the squared deviations from the mean. • We say “essentially” because there are two versions of the variance: the population variance and the sample variance.

More on the Variance • The variance tends to increase when there is more variability around the mean. • Indeed, large deviations from the mean contribute heavily to the variance because they are squared. • One consequence of this is that the variance is expressed in squared units (squared dollars, for example) rather than original units.

Standard Deviation • A more intuitive measure of variability is the standard deviation. • The standard deviation is defined to be the square root of the variance. • Thus, the standard deviation is measured in original units, such as dollars, and it is much easier to interpret.

Computing Variance and Standard Deviation in Excel • Excel has built-in functions for computing these measures of variability. • The sample variances and standard deviations of the rail diameters from the suppliers in the present example can be found by entering the following formulas: “=VAR(Supplier1)” in cell E8 and “=STDEV(Supplier1)” in cell E9.

Computing Variances & Standard Deviations -- continued • Of course, enter similar formulas for supplier 2 in cells F8 and F9. • As we mentioned earlier, it is difficult to interpret the variances numerically because they are expressed in squared inches, not inches. • All we can say is that the variance from supplier 2 is considerably larger than the variance from supplier 1.

Interpretation of the Standard Deviation • The standard deviations, on the other hand, are expressed in inches. The standard deviation for supplier 1 is approximately 0.012 inch, and supplier 2’s standard deviation is approximately three times this large. • This is quite a disparity. Hence, Otis will prefer to buy rails from supplier 1.

Interpretation of the Standard Deviation: Rules of Thumb

DOW.XLS • This file contains monthly closing prices for the Dow Jones Index from January 1947 through January 1993. • The monthly returns from the index are also shown starting with February 1947. Each return is the monthly percentage change (expressed) as a decimal) in the index. • How well do the rules of thumb work for these data?

Rules of Thumb • Many data sets follow “rules of thumb”. • Approximately 68% of the observations are within one standard deviation of the mean. • Approximately 95% of the observations are within two standard deviations of the mean. • Approximately 99.7% - almost all - of the observations are within three standard deviations of the mean.

Index Time Series Plot • A time series plot of the index show that the index has been increasingly fairly steadily over the period. • Whenever a series indicates a clear trend such as the index does, most of the measures we have been discussing are less relevant. • For example, the mean of the index for this period has at most historical interest. We are probably more interested in predicting the future of the Dow, and the historical mean has little relevance for predicting the future.

Time Series Plot of Dow Closing Index

Time Series Plot of Dow Returns

Return Time Series Plot • A time series plot of the returns show no obvious trend over the period. • The measures we have been discussing are relevant in discussing the series of returns, which fluctuate around a stable mean. • We first calculate the mean and standard deviation of the returns by using the Excel functions AVERAGE and STDEV in cells B4 and B5. See the table on the next slide.

Rules of Thumb for Dow Jones Data

Returns -- continued • The average return is 0.59% and the standard deviation of about 3.37%. • Therefore, the rules of thumb (if they apply) imply, for example, that about 2/3 of all returns are within the interval 0.59% + 3.37%, that is from -2.78% to 3.95%. • In order to determine if the rules of thumb apply to these returns, we can use a frequency table.

Creating the Frequency Table • We first enter the upper limits of the suitable categories in the range A8:A15. • Any categories can be chosen but it is convenient to choose categories in which each breakpoint is one standard deviation higher than the previous one with the open-ended categories on either end are “more than 3 standard deviations from the mean”. • Next we use the FREQUENCY function to fill in column C. “=FREQUENCY(Returns,Bins)”

Frequency Table continued • Finally, we use the frequencies in column C to calculate the actual percentage of return within k standard deviations of the mean for k=1, k=2 and k=3 and we compare these with the percentages from the rules of thumb. • The agreement between these percentages is not perfect - there are a few more observations within one standard deviation of the mean than the rule of thumb predicts - but in general the rules of thumb work quite well.

Obtaining Summary Measures with Add-Ins

SALARY.XLS • Lists starting salaries for 190 graduates from an undergraduate school of business. • The data is in the range named Salary on a sheet called Data. • We need to find a set of useful summary measures for the salaries.

Summary Statistics • To find the summary statistics of a set of data we can use the Stat-Pro Add-In or Excel’s Analysis ToolPak. In this example we use the Stat-Pro Add-In . • Begin by placing the cursor anywhere within the data range. Then select StatPro/Summary Stats/One-Variable Summary Stats menu item. • Select all variables you want to summarize, and select the summary measures you want to find from the Available Summary Measures dialog box shown on the next slide.

Available Summary Measures

About the Measures • Four measures are selected by default. These are mean, median, Standard Deviation and Count. You can override these. • A typical output appears here. • It includes many of the measures we have discussed plus a few more.

About the Measures -- continued • The mean absolute deviation is similar to the variance except that it is an average of the absolute (note squared) deviations from the mean. • The kurtosis and skewness indicate the relative peakedness of the distribution and its skewness. • By clicking on any of the cells containing the measures (Column B), you’ll see that StatPro provides the formulas for the outputs. (Analysis ToolPak does not do so.)

About the Measures -- continued • The effect of this is that if any of the data changes the summary measures we produced change automatically. • All output is formatted as “numerical” to three decimal places by default. You can reformat them in a more appropriate manner if you would like.

Measures of Association: Covariance and Correlation

EXPENSES.XLS • A survey questions members of 100 households about their spending habits. • The data in this file represent the salary, expense for cultural activities, expense for sports-related activities, and the expense for dining-out for each household over the past year. • Do these variables appear to be related linearly?

Covariance and Correlation • When we need to summarize the relationship between two variables we can use the measures covariance and correlation. We summarize the type of behavior observed in a scatterplot. • Each measures the strength (and direction) of a linear relationship between two numerical variables. • The relationship is “strong” if the points in a scatterplot cluster tightly around some straight line. If this line rises form left to right then the relationship is “positive”. If it falls from left to right then the relationship is “negative”.

Determining Linear Relationships • Scatterplots of each variable versus each other would provide the answer to the question but six scatterplots would be required, one for each pair. • To get a quick indication of possible linear relationships we can use Stat-Proto obtain a table of correlations and/or covariances.

Table of Correlations and Covariances • To get the table, place the cursor anywhere in the data set and use the StatPro/Summary Stats/Correlations, Covariances menu item and proceed in the obvious way.

Relationships • The only relationships that stand out are the positive relationships between salary and cultural expenses and between salary and dining expenses. • The negative relationships are between cultural and sports-related expenses. • To confirm these graphically we show scatterplots of Salary versus Culture and Culture versus Sports

Scatterplot Indicating Positive Relationship

Scatterplot Indicating Negative Relationship

Correlation and Covariance Properties • In general, the following properties are evident from the Table of correlations and covariances. • The correlation between a variable and itself is 1. • The correlation between X and Y is the same as the correlation between Y and X. Therefore, it is sufficient to list the correlations below (or above) the diagonal in the table. (The same is true for the covariances). • The covariance between a variable and itself is the variance of the variable. We indicate this in the heading of the covariance table.

Statistical Review