Initial Data Analysis

Initial Data Analysis Central Tendency

Outline • What is ‘central tendency’? • Classic measures • Mean, Median, Mode • What’s an ‘average’? • Properties of statistics • Sufficiency • Efficiency • Bias • Resistance • Resistant measures

Measures of Central Tendency • While distributions provide an overall picture of some data set, it is sometimes desirable to represent some property of the entire data set using a single statistic • The first descriptive statistics we will discuss are those used to indicate where the ‘center’ of the distribution lies. • The expected value • It is not a value that has to be in the dataset itself • There are different measures of central tendency, each with its own advantages and disadvantages

The Mode • The mode is simply the value of the relevant variable that occurs most often (i.e., has the highest frequency) in the sample • Note that if you have done a frequency histogram, you can often identify the mode simply by finding the value with the highest bar. • However, that will not work when grouping was performed prior to plotting the histogram (although you can still use the histogram to identify the modal group, just not the modal value). • Modes in particular are probably best applied to nominal data

Mode • Advantages • Very quick and easy to determine • Is an actual value of the data • Not affected by extreme scores • Disadvantages • Sometimes not very informative (e.g. cigarettes smoked in a day) • Can change dramatically from sample to sample • Might be more than one (which is more representative?)
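The points above about the mode (it is an actual data value, it suits nominal data, and there may be more than one) can be illustrated with a minimal Python sketch. The helper name `modes` and both data sets are made up for illustration; `statistics.multimode` is the standard-library equivalent in Python 3.8+.

```python
from collections import Counter
from statistics import multimode

def modes(values):
    """Return every value tied for the highest frequency, in first-seen order."""
    counts = Counter(values)
    highest = max(counts.values())
    return [value for value, count in counts.items() if count == highest]

# Hypothetical nominal data: the mode is the only sensible 'center' here.
colors = ["red", "blue", "blue", "green", "red", "blue"]
print(modes(colors))

# Hypothetical scores with a tie: two modes, neither obviously more representative.
scores = [1, 1, 2, 2, 3]
print(modes(scores))
print(multimode(scores))  # standard-library version of the same idea
```

Returning a list rather than a single value makes the "might be more than one" disadvantage visible instead of silently picking one of the tied values.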

The Median • The median is the point corresponding to the score that lies in the middle of the distribution (i.e., there are as many data points above the median as there are below the median). • To find the median, the data points must first be sorted into either ascending or descending numerical order. • The position of the median value can then be calculated using the following formula: • Median position = (N + 1) / 2 • With an even number of data points the position falls halfway between two scores, and the median is taken as their average

Median • Advantage: • Resistant to outliers • Disadvantage: • May not be so informative: • (1, 1, 2, 2, 2, 2, 5, 6, 9, 9, 10) • Does the value of 2 really represent this sample as a whole very well?
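The sort-then-locate procedure for the median can be sketched as follows. This assumes the usual (N + 1) / 2 position rule, with the two middle scores averaged when N is even; the function name `median_by_position` is hypothetical, and `statistics.median` is used only as a cross-check.

```python
from statistics import median

def median_by_position(values):
    """Median via the 1-based position (N + 1) / 2 in the sorted data."""
    ordered = sorted(values)
    n = len(ordered)
    position = (n + 1) / 2
    if position.is_integer():
        # Odd N: the position lands exactly on one score.
        return ordered[int(position) - 1]
    # Even N: the position falls between two scores, so average them.
    lower = ordered[int(position) - 1]
    upper = ordered[int(position)]
    return (lower + upper) / 2

# The sample from the slide above: 11 scores, so the median is the 6th one.
sample = [1, 1, 2, 2, 2, 2, 5, 6, 9, 9, 10]
print(median_by_position(sample), median(sample))
```

With this sample the median is 2 even though nearly half the scores are 5 or larger, which is the "not so informative" case the slide raises.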

The Mean • The most commonly used measure of central tendency is called the mean (denoted X̄ for a sample, and µ for a population). • The mean is the same as what many of us call the ‘average’, and it is calculated in the following manner: • X̄ = ΣX / N (the sum of all the scores divided by the number of scores)

Mode vs. Median vs. Mean • When there is only one mode and the distribution is fairly symmetrical, the three measures (as well as others to be discussed) will have similar values • However, when the underlying distribution is not symmetrical, the three measures of central tendency can be quite different.

Some Visual Demos • Here is a demonstration that allows you to change a frequency histogram while simultaneously noting the effects of those changes on the mean versus the median. • As you use the demo, you should fairly easily be able to think about how these changes are also affecting the mode • Note that the order would go Mode, Median, then Mean in the direction the tail is pointing.
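The ordering noted above can also be checked numerically. The sketch below uses two small made-up data sets, one symmetric and one with a long right tail; the variable names are illustrative only.

```python
from statistics import mean, median, mode

# Hypothetical symmetric, single-peaked data: all three measures agree.
symmetric = [1, 2, 2, 3, 3, 3, 4, 4, 5]

# Hypothetical data with a long right tail: the measures spread out,
# ordered mode < median < mean in the direction of the tail.
right_skewed = [1, 2, 2, 2, 3, 3, 4, 6, 13]

for label, data in [("symmetric", symmetric), ("right-skewed", right_skewed)]:
    print(label, "mode:", mode(data), "median:", median(data), "mean:", mean(data))
```

A single extreme score (13) is enough to pull the mean above the median here, while the mode stays where the scores pile up.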

What’s an average? • We’ve been referring to the mean without qualification, but in fact there are many types of averages, and that is only one • The mean we typically use is the arithmetic mean • Together with the geometric mean and the harmonic mean, it makes up the Pythagorean means • For positive values, the arithmetic mean is greater than or equal to the geometric mean, which is greater than or equal to the harmonic mean • The geometric mean of n values is the nth root of their product • The harmonic mean is the reciprocal of the arithmetic mean of the reciprocals of the values of the variable in question
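The three Pythagorean means can be written directly from the verbal definitions above. This is a minimal sketch for positive values only; the function names and the example values are made up, and the standard library's `statistics.geometric_mean` and `statistics.harmonic_mean` (Python 3.8+) serve as a cross-check.

```python
import math
from statistics import geometric_mean, harmonic_mean

def arithmetic_mean(values):
    """Sum of the values divided by how many there are."""
    return sum(values) / len(values)

def geometric_mean_manual(values):
    """nth root of the product of n positive values."""
    return math.prod(values) ** (1 / len(values))

def harmonic_mean_manual(values):
    """Reciprocal of the arithmetic mean of the reciprocals."""
    return len(values) / sum(1 / v for v in values)

values = [2.0, 4.0, 8.0]  # hypothetical positive scores
am = arithmetic_mean(values)
gm = geometric_mean_manual(values)
hm = harmonic_mean_manual(values)
print(am, gm, hm)

# The ordering AM >= GM >= HM holds for any set of positive values.
assert hm <= gm <= am
```

The three means are equal only when every value is the same; the more spread out the values, the wider the gaps between them.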

More means • The geometric mean is particularly appropriate for exponential-type data • E.g. human population over a period of time • The harmonic mean is good for things like rates and ratios, where an arithmetic mean would actually be incorrect • You will also meet it in ANOVA with unequal sample sizes, where by far the most common procedure uses the harmonic mean of the sample sizes • As a result, an unbalanced design will have less statistical power, because the harmonic mean is pulled toward the smallest sample size
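A short sketch of why these means fit growth and rate data. All numbers below are hypothetical: a population of 100 that doubles and then quadruples, a trip of two 60 km legs driven at 30 and 60 km/h, and two groups of 10 and 50 participants.

```python
import math
from statistics import geometric_mean, harmonic_mean, mean

# Growth: the geometric mean is the constant factor that reproduces the end result.
growth_factors = [2.0, 4.0]
typical_factor = geometric_mean(growth_factors)
final_population = 100 * typical_factor ** len(growth_factors)

# Rates: average speed is total distance / total time, which equals the
# harmonic mean of the speeds when the legs are the same length.
speeds = [30.0, 60.0]
true_average_speed = (60.0 + 60.0) / (60.0 / 30.0 + 60.0 / 60.0)
hm_speed = harmonic_mean(speeds)
am_speed = mean(speeds)  # overstates the real average speed

# Unequal group sizes: the harmonic mean sits much closer to the smaller group.
group_sizes = [10, 50]
hm_n = harmonic_mean(group_sizes)
am_n = mean(group_sizes)

print(final_population, true_average_speed, hm_speed, am_speed, hm_n, am_n)
```

In the group-size case the harmonic mean is under 17 even though the arithmetic mean is 30, which is the sense in which an unbalanced design behaves like a smaller study.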

More means • Weighted averages • Sometimes we will want to weight a measure of some variable by the values of some other variable • E.g. If each person gets a score on several items and we want an average of the total score for each person across the items, we might weight them by 1/variance to give the more consistent scorers more importance in the calculation • The arithmetic mean is a weighted average in which all weights = 1.
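A weighted average can be sketched in a few lines. The scores and variances below are invented, and weighting by 1/variance follows the example on the slide; the function name `weighted_mean` is hypothetical.

```python
import math

def weighted_mean(values, weights):
    """Sum of weight * value, divided by the sum of the weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    return sum(w * v for v, w in zip(values, weights)) / sum(weights)

scores = [10.0, 20.0, 30.0]    # hypothetical scores
variances = [1.0, 4.0, 4.0]    # hypothetical variances; smaller = more consistent
inverse_variance_weights = [1 / v for v in variances]

print(weighted_mean(scores, inverse_variance_weights))  # pulled toward the consistent score
print(weighted_mean(scores, [1, 1, 1]))                 # equal weights: the arithmetic mean
```

Any set of equal weights gives the same result as weights of 1, since only the relative sizes of the weights matter.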

Properties of a Statistic: Sampling Distribution • In order to examine the properties of a statistic we often want to take repeated samples from some population of data and calculate the relevant statistic on each sample. • We can then look at the distribution of the statistic across these samples and ask a variety of questions about it.

Properties of a Statistic • Sufficiency • A sufficient statistic is one that makes use of all of the information in the sample to estimate its corresponding parameter • For example, this property makes the mean more attractive as a measure of central tendency compared to the mode or median. • Unbiasedness • A statistic is said to be an unbiased estimator if its expected value (i.e., the mean of the statistic over repeated samples, such as the mean of many sample means) is equal to the population parameter it is estimating. • As one can see using the resampling procedure, the sample mean can be shown to be an unbiased estimator of the population mean
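The repeated-sampling idea can be simulated directly. The sketch below builds a made-up, right-skewed population, draws many samples from it, and compares the mean of the sample means with the population mean. The population, seed, sample size, and number of samples are all arbitrary assumptions chosen for illustration.

```python
import random
from statistics import mean

rng = random.Random(42)  # fixed seed so the sketch is reproducible

# Hypothetical right-skewed population of 10,000 scores.
population = [rng.expovariate(1 / 10) for _ in range(10_000)]
population_mean = mean(population)

def sampling_distribution(statistic, sample_size, n_samples):
    """Compute `statistic` on repeated random samples (drawn with replacement)."""
    return [
        statistic(rng.choices(population, k=sample_size))
        for _ in range(n_samples)
    ]

sample_means = sampling_distribution(mean, sample_size=20, n_samples=5_000)
mean_of_sample_means = mean(sample_means)

print("population mean:     ", round(population_mean, 2))
print("mean of sample means:", round(mean_of_sample_means, 2))
```

Individual sample means vary a good deal from sample to sample, but their average settles very close to the population mean, which is what unbiasedness describes.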

Properties of a Statistic • Efficiency • The efficiency of a statistic is reflected in the variance that is observed when one examines the statistic over independently chosen samples • Standard error • The smaller the variance, the more efficient the statistic is said to be • Resistance • The resistance of an estimator refers to the degree to which that estimate is affected by extreme values, i.e. outliers • For a resistant estimator, small changes in the data result in only small changes in the estimate • Finite-sample breakdown point • Measure of resistance to contamination • The smallest proportion of observations that, when altered sufficiently, can render the statistic arbitrarily large or small • Median ≈ 1/2 (about half of the observations) • Trimmed mean = whatever the trimming proportion is • Mean = 1/n (a single observation is enough)
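Efficiency and resistance pull in different directions, and both can be seen in a small simulation. This sketch assumes normally distributed data for the efficiency comparison (the case where the mean is the more efficient of the two); the seed, sample size, number of samples, and the five-score example are arbitrary illustrations.

```python
import random
from statistics import mean, median, pvariance

rng = random.Random(7)  # fixed seed so the sketch is reproducible

def variance_across_samples(statistic, sample_size=25, n_samples=4_000):
    """Variance of `statistic` over repeated samples from a standard normal."""
    estimates = [
        statistic([rng.gauss(0, 1) for _ in range(sample_size)])
        for _ in range(n_samples)
    ]
    return pvariance(estimates)

# Efficiency: with normal data the mean varies less from sample to sample.
var_of_means = variance_across_samples(mean)
var_of_medians = variance_across_samples(median)
print("variance of sample means:  ", round(var_of_means, 4))
print("variance of sample medians:", round(var_of_medians, 4))

# Resistance: altering one score out of five (proportion 1/n) wrecks the mean
# but leaves the median where it was.
clean = [6, 8, 10, 12, 16]
contaminated = [6, 8, 10, 12, 1_000_000]
print(mean(clean), median(clean))
print(mean(contaminated), median(contaminated))
```

Making the altered score larger still would push the mean as high as one likes, while the median would not move until about half of the scores had been changed.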

Resistant measures of central tendency • Trimmed mean • Created by “trimming” some percentage of the high and low ends of the data • The median is actually a trimmed estimate • Winsorized mean • M-estimators • Extreme values are given less weight than those closer to the center of the distribution. • May be more robust than mean or median for certain types of “funky” data
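Trimmed and Winsorized means can be sketched as below. Definitions of the trimming amount vary between textbooks and packages; this version assumes `proportion` is the fraction removed (or replaced) at each end, rounded down to a whole number of scores. The function names and the ten scores are made up for illustration.

```python
import math
from statistics import mean, median

def _cut_count(n, proportion):
    """Number of scores affected at EACH end (rounded down)."""
    k = int(n * proportion)
    if not 0 <= proportion < 0.5 or 2 * k >= n:
        raise ValueError("proportion must leave at least one score in the middle")
    return k

def trimmed_mean(values, proportion):
    """Drop the k lowest and k highest scores, then average the rest."""
    ordered = sorted(values)
    k = _cut_count(len(ordered), proportion)
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)

def winsorized_mean(values, proportion):
    """Replace the k lowest/highest scores with their nearest kept neighbour."""
    ordered = sorted(values)
    n = len(ordered)
    k = _cut_count(n, proportion)
    if k > 0:
        ordered[:k] = [ordered[k]] * k
        ordered[n - k:] = [ordered[n - k - 1]] * k
    return sum(ordered) / n

scores = [2, 4, 5, 5, 6, 6, 7, 8, 9, 48]  # hypothetical data with one extreme score
print("mean:               ", mean(scores))
print("median:             ", median(scores))
print("10% trimmed mean:   ", trimmed_mean(scores, 0.10))
print("10% Winsorized mean:", winsorized_mean(scores, 0.10))
```

Trimming discards the extreme scores altogether, whereas Winsorizing keeps the sample size by pulling them in to the nearest retained value, so the two estimates are usually close but not identical.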

Practical Example • Administer the BDI to 10 randomly selected UNT students • 8 of the students scored 25 or less, two scored greater than 45. • 8, 12, 6, 16, 10, 20, 22, 25, 47, 55 • Median = 18 • Mean = 22.1 • Which is more accurate regarding generalization to the ‘typical UNT student’? One that includes: • Two people that perhaps reversed their ratings on the items? • A score that was miskeyed (using the number pad they hit a 4 instead of 1 leading to a score of 47)? • Two people who do not have English as their native language? • Two people that did not answer honestly? • Two people that are actually clinically depressed? • One that is clinically depressed, one that just ‘wants to be different’?
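The figures on this slide can be reproduced in a few lines, along with what happens if the two high scores are set aside. Setting them aside is shown purely for comparison and is not a recommendation; whether that is justified is exactly the question the slide asks.

```python
import math
from statistics import mean, median

bdi_scores = [8, 12, 6, 16, 10, 20, 22, 25, 47, 55]  # the ten scores from the slide

all_median = median(bdi_scores)
all_mean = mean(bdi_scores)

# For comparison only: the same statistics without the two scores above 45.
without_high = [score for score in bdi_scores if score <= 45]
reduced_median = median(without_high)
reduced_mean = mean(without_high)

print("all ten scores:  median", all_median, "mean", all_mean)
print("without the two: median", reduced_median, "mean", reduced_mean)
```

The mean moves by more than seven points when the two scores are dropped, while the median moves by four, which illustrates the difference in resistance between the two measures.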

Practical Example • While many think of outliers as representing the ‘complexity of human nature’, the issue revolves more around inadequate data collection to detect why the score is what it is, and around problematic population description • E.g. my definition of typical UNT student, if such a thing could be said to exist at all, is not one that is on suicide watch • However, the previous problem most likely represents an attempt to generalize to something that doesn’t exist. • Better populations to try and represent: UNT Texans, UNT Psych grad students, UNT international students, UNT students who have visited C & T in the last semester (in which case those would probably not be outliers) etc. • Application to current events: Do you really think there is a ‘middle America’, a ‘female vote’ etc. to which the presidential candidates are trying to appeal? There are demographics, very specific ones yes, but those connotations do little to note the specifics.

Summary • Favoritism for the arithmetic mean is the result of familiarity only, and until you came to this course you would have been hard-pressed to explain your preference outside of arguments from authority • The AM is to be valued for some properties it has relative to other measures (sufficiency, efficiency, unbiasedness), and also rejected for another (it has the least resistance) • In many cases it’s entirely inappropriate to use the AM as it would be a distorted view of central tendency • Which statistics you use to represent your data should be considered as much as the measures themselves.
