
Chapter 1 Looking at Data – Distributions


kamil

Presentation Transcript


  1. Chapter 1: Looking at Data – Distributions

  2. What is statistics? • The science of collecting, organizing, and interpreting numerical facts (data) with the goal of gaining understanding about a problem • Always relate calculations back to the problem at hand as numbers alone are not meaningful • Requires thinking and judgment about data

  3. Variables • A variable is a characteristic of an individual or object of interest (e.g., a person, plant, or animal) • Variables can take on different values for different individuals • Ex.
     Individual   Variable
     Person       Age or height
     Flower       Color
     Bird         Wingspan

  4. Distributions • The distribution of a variable tells us what values the variable takes on (for the group of individuals under consideration) and how often it takes them • Ex. Consider 10 rose bushes in a garden • What colors are represented? • How many of each color?

  5. Variables
     • Categorical: the value falls into one of two or more groups, or categories. Ex. Blood type, hair color
     • Quantitative: takes on numerical values; mathematical operations (such as averaging) make sense. Ex. Height, age, number of credit cards owned
     It makes sense to talk about the average height of the students in the class, but not the average blood type.

  6. 1.1 Displaying Distributions with Graphs • For a categorical variable, the distribution lists the categories and the count or percent of individuals who fall into each one. • How can we visually display this data? • Bar graphs • each category is represented by a bar • Pie charts • The slices must represent parts of one whole
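
As a minimal sketch of what "the distribution of a categorical variable" means, the snippet below tabulates counts and percents for the rose-bush example from slide 4. The ten colors are hypothetical (the slides do not list them) and are there only to make the tabulation concrete.

```python
from collections import Counter

# Hypothetical colors for 10 rose bushes (made up for illustration;
# the slides do not give the actual data).
rose_colors = ["red", "pink", "red", "white", "yellow",
               "red", "pink", "white", "red", "pink"]

counts = Counter(rose_colors)      # category -> count
n = len(rose_colors)
percents = {color: 100 * count / n for color, count in counts.items()}

# List categories from most to least frequent, as in a bar graph sorted by rank.
for color, count in counts.most_common():
    print(f"{color:<7} count={count}  percent={percents[color]:.0f}%")
```

The percents add up to 100 because every bush falls into exactly one category, which is the condition a pie chart needs.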

  7. Example: Top 10 causes of death in the United States 2001 For each individual who died in the United States in 2001, we record what was the cause of death. The table above is a summary of that information.

  8. Bar graphs • Each category is represented by one bar. • The bar’s height shows the count (or sometimes the percentage) for that particular category. • Ex. The number of individuals who died of an accident in 2001 is approximately 100,000. [Figure: Top 10 causes of death in the United States, 2001]

  9. Top 10 causes of death in the United States, 2001 • Bar graph sorted by rank: easy to analyze • Sorted alphabetically: much less useful

  10. Pie charts Each slice represents a piece of one whole. The size of a slice depends on what percent of the whole this category represents. Percent of people dying from top 10 causes of death in the United States in 2000

  11. Make sure your labels match the data. Make sure all percents add up to 100. [Figures: percent of deaths from top 10 causes; percent of deaths from all causes]

  12. How to Chart Quantitative Variables? • Histograms – Numerical analog of bar graph • The range of values a variable can take on is divided into equal-size intervals (bins) • A histogram shows the number of data points (observations) that fall into each interval (bin) • Choosing the bin size is a judgment call

  13. Histogram • Ex. Test 1 scores for 10 statistics students
      Student   Score
      1         75
      2         99
      3         79
      4         71
      5         66
      6         82
      7         89
      8         0
      9         53
      10        73
      [Figure: histogram with 10 bins; horizontal axis: test score; vertical axis: number of students]

  14. What if we change the bin size? [Figure: histogram with 4 bins; horizontal axis: test score; vertical axis: number of students]
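
To make the effect of bin size concrete, here is a small sketch that counts the ten Test 1 scores into equal-width bins. The bin range of 0 to 100 and the helper name `bin_counts` are assumptions; the original figures are not available, so their exact bin edges may differ.

```python
scores = [75, 99, 79, 71, 66, 82, 89, 0, 53, 73]

def bin_counts(values, low, high, n_bins):
    """Count values in n_bins equal-width bins covering [low, high]."""
    width = (high - low) / n_bins
    counts = [0] * n_bins
    for v in values:
        # Fold the top edge into the last bin so that `high` itself is counted.
        index = min(int((v - low) / width), n_bins - 1)
        counts[index] += 1
    return counts

ten_bins = bin_counts(scores, 0, 100, 10)   # bins of width 10
four_bins = bin_counts(scores, 0, 100, 4)   # bins of width 25

# Print a rough text histogram for each choice of bin count.
for counts in (ten_bins, four_bins):
    width = 100 / len(counts)
    for i, c in enumerate(counts):
        print(f"{i * width:5.0f} to {(i + 1) * width:5.0f} | {'*' * c}")
    print()
```

With 10 bins the score of 0 sits visibly apart from the rest; with 4 bins the gap is much less obvious, which is why the bin size is a judgment call.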

  15. Interpreting Histograms • Look for the overall pattern of the data, and for any striking departures from the pattern • Look for outliers, individual values which fall outside the overall pattern of a distribution • Always watch out for outliers, and try to identify and explain them • Ex. Was the statistics test really hard, or were there unusual circumstances for student 8? Did he not show up for class, or did he cheat on his exam? Should he be included in the distribution?

  16. Stem Plots • Separate each observation into a stem (all but the final digit) and a leaf (final digit) • Write the stems in a vertical column with the smallest value at the top and draw a vertical line to the right of the column • Write each leaf in the row to the right of its stem, in increasing order • Note: Some stems may have no leaves

  17. Creating a Stem Plot: Test scores of 10 students
      Student   Score
      1         75
      2         99
      3         79
      4         71
      5         66
      6         82
      7         89
      8         0
      9         53
      10        73
      Scores in order: 0 53 66 71 73 75 79 82 89 99
      Stemplot:
      0 | 0
      1 |
      2 |
      3 |
      4 |
      5 | 3
      6 | 6
      7 | 1 3 5 9
      8 | 2 9
      9 | 9
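
The stem plot procedure can be sketched in a few lines. This is one possible implementation, assuming non-negative whole-number data with the final digit as the leaf; the function name `stemplot` is made up for this example.

```python
def stemplot(values):
    """Return stemplot lines for non-negative integers (leaf = final digit)."""
    leaves = {}
    for v in sorted(values):                 # sorting keeps leaves in increasing order
        leaves.setdefault(v // 10, []).append(v % 10)
    lines = []
    # Include every stem between the smallest and largest, even if it has no leaves.
    for stem in range(min(leaves), max(leaves) + 1):
        row = " ".join(str(leaf) for leaf in leaves.get(stem, []))
        lines.append(f"{stem} | {row}".rstrip())
    return lines

scores = [75, 99, 79, 71, 66, 82, 89, 0, 53, 73]
plot_lines = stemplot(scores)
print("\n".join(plot_lines))
```

Keeping the empty stems 1 through 4 is what makes the score of 0 stand out as a possible outlier.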

  18. More on Stem Plots • Back-to-back stem plots with a common stem may be useful for comparing two related distributions • Stem plots don’t work too well for large data sets • If each stem holds a large number of leaves, you can split each stem into two: • One for leaves 0-4 • One for leaves 5-9 • If observed values have too many digits, trim numbers before making stemplot • Ex. Trim 1234 to 123, then 12 is stem and 3 is leaf. Indicate leaf unit is 10. • See example 1.8 in text

  19. Describing Distributions • Can describe the overall pattern of a distribution by its shape, center, spread and outliers • Center – For now, consider the center the midpoint • Value with approximately half the observations above it and half the observations below it • Spread – For now, describe by indicating smallest and largest values • Shape • How many peaks does the distribution have? • If one, unimodal • If several, multimodal • Is the distribution symmetric? Or skewed? • Outliers – any points that fall far outside the other points • You can use Tukey’s Rule to determine outliers of data

  20. Most common distribution shapes • A distribution is symmetric if the right and left sides of the histogram are approximately mirror images of each other. • A distribution is skewed to the right if the right side of the histogram (side with larger values) extends much farther out than the left side. It is skewed to the left if the left side of the histogram extends much farther out than the right side. • Not all distributions have a simple overall shape, especially when there are few observations. [Figures: symmetric distribution; skewed distribution; complex, multimodal distribution]

  21. Time Plots • A time plot of a variable plots each observation against the time at which it was measured • Time always on horizontal axis! • Look for patterns over time • A trend is a rise or fall that persists over time, despite small irregularities • A pattern that repeats itself at regular intervals of time is called seasonal variation

  22. Ex. Retail price of fresh oranges over time Time is on the horizontal, x axis. The variable of interest—here “retail price of fresh oranges”— goes on the vertical, y axis. This time plot shows a regular pattern of yearly variations. These are seasonal variations in fresh orange pricing most likely due to similar seasonal variations in the production of fresh oranges. There is also an overall upward trend in pricing over time. It could simply be reflecting inflation trends or a more fundamental change in this industry.

  23. Describing Distributions with Numbers • Recall: Distributions of variables are described by shape, center, spread and outliers • We now extend beyond inspecting stemplots and histograms to more precise definitions of center and spread • Measures of center: the mean and the median

  24. The Mean (x-bar) • To find the mean of a set of n observations, x1, x2, x3, … , xn, add their values and divide by the number of observations: $\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n}$ or $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$ • Σ (Sigma) means sum

  25. Example: Test scores on 2nd exam for 10 statistics students • Exam scores: 80, 73, 92, 85, 75, 98, 93, 55, 80, 90 • n = 10 • $\bar{x} = \frac{821}{10} = 82.1$

  26. Note: The mean is sensitive to a few extreme observations • NOT a resistant measure of center • What if there were an 11th student in the class who didn’t show up and received a 0 on the 2nd exam? • How would this affect the mean?
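
The question on this slide can be checked directly. The sketch below computes the mean of the ten exam scores and then the mean after adding the 11th student's score of 0; the helper `mean` is written out by hand only to mirror the "add and divide by n" definition.

```python
exam_scores = [80, 73, 92, 85, 75, 98, 93, 55, 80, 90]

def mean(values):
    """Add the values and divide by the number of observations."""
    return sum(values) / len(values)

mean_10 = mean(exam_scores)          # 821 / 10
mean_11 = mean(exam_scores + [0])    # 821 / 11, after adding the absent student

print(f"mean of 10 scores: {mean_10:.1f}")
print(f"mean of 11 scores: {mean_11:.1f}")
```

A single score of 0 lowers the mean by roughly 7.5 points, which is what "not resistant" means in practice.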

  27. The Median (M) • The median is the midpoint of a distribution • Half the observations are smaller and half the observations are larger than M • To find the median: • Arrange data from smallest to largest • If the number of observations (n) is odd, M is the center observation in the ordered list, located (n+1)/2 observations up from the bottom • If the number of observations (n) is even, M is the mean of the two center observations in the ordered list. M is still located at the (n+1)/2 position

  28. Finding the Median • Consider again exam scores for 10 students: Exam scores: 80, 73, 92, 85, 75, 98, 93, 55, 80, 90 • Arrange data from smallest to largest: 55, 73, 75, 80, 80, 85, 90, 92, 93, 98 • n = 10, so n is even and M is the mean of the 5th and 6th observations in the ordered list. • M is located at (10+1)/2, or 5.5th position in the ordered list • M = (80+85)/2 = 82.5

  29. The median is a more resistant measure of center than the mean. Exam scores (in order): 0, 55, 73, 75, 80, 80, 85, 90, 92, 93, 98 • What happens to M if we include the 11th student who received a 0 in the data set? • There are now 11 data points, so n = 11 and is odd • M is therefore the center observation in the ordered list, located in position (11+1)/2, or 6th position • M = 80
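
The median rule from the previous slides can be sketched as follows. The function `median` here is a hand-written illustration of the odd/even rule, not a library routine.

```python
def median(values):
    """Midpoint of the sorted data: center value if n is odd,
    mean of the two center values if n is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

exam_scores = [80, 73, 92, 85, 75, 98, 93, 55, 80, 90]

median_10 = median(exam_scores)        # n = 10: mean of 5th and 6th values
median_11 = median(exam_scores + [0])  # n = 11: the 6th value

print(median_10, median_11)
```

Adding the score of 0 moves the median only from 82.5 to 80, a much smaller shift than the drop in the mean.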

  30. Comparing the mean and the median • If a distribution is exactly symmetric, the mean and the median are the same. In a skewed distribution, the mean is pulled farther toward the long tail than the median. • The median is a measure of center that is resistant to skew and outliers. The mean is not. [Figures: mean and median for a symmetric distribution; mean and median for left-skewed and right-skewed distributions]

  31. Impact of skewed data • Symmetric distribution (Disease X): mean and median are the same. • Right-skewed distribution (multiple myeloma): the mean is pulled toward the skew.

  32. Measure of spread: the quartiles • The first quartile, Q1, is the value in the sample that has 25% of the data at or below it (it is the median of the lower half of the sorted data, excluding M). • The third quartile, Q3, is the value in the sample that has 75% of the data at or below it (it is the median of the upper half of the sorted data, excluding M). • Ex. Q1 = first quartile = 2.2; M = median = 3.4; Q3 = third quartile = 4.35

  33. Five-number summary and boxplot • Five-number summary: min, Q1, M, Q3, max • Ex. Smallest = min = 0.6; Q1 = first quartile = 2.2; M = median = 3.4; Q3 = third quartile = 4.35; Largest = max = 6.1 [Figure: boxplot of these five values]
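
The data behind the slide's five-number summary are not given, so this sketch applies the same quartile rule (median of each half of the sorted data, excluding M) to the ten exam scores from slide 25 instead. The function names are made up for this example, and statistical software often uses slightly different quartile conventions, so library results may not match exactly.

```python
def median(ordered):
    """Median of an already sorted list."""
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

def five_number_summary(values):
    """Return (min, Q1, M, Q3, max), with the quartiles taken as the medians
    of the lower and upper halves of the sorted data, excluding M itself."""
    ordered = sorted(values)
    n = len(ordered)
    half = n // 2
    lower = ordered[:half]           # values below the median position
    upper = ordered[n - half:]       # values above the median position
    return (ordered[0], median(lower), median(ordered), median(upper), ordered[-1])

exam_scores = [80, 73, 92, 85, 75, 98, 93, 55, 80, 90]
summary = five_number_summary(exam_scores)
print(summary)
```

For an odd number of observations the slicing drops the middle value from both halves, which matches the "excluding M" wording on the quartile slide.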

  34. Boxplots for skewed data • Comparing boxplots for a normal and a right-skewed distribution • Boxplots remain true to the data and clearly depict symmetry or skew.

  35. Suspected Outliers • Outliers are troublesome data points, and it is important to be able to identify them. • One way to raise the flag for a suspected outlier is to measure the distance from the suspicious data point to the nearest quartile (Q1 or Q3). We then compare this distance to the interquartile range (distance between Q1 and Q3). • We call an observation a suspected outlier if it falls more than 1.5 times the size of the interquartile range (IQR) above the third quartile or below the first quartile. This is called the “1.5 * IQR rule for outliers.”

  36. Ex. Q1 = 2.2, Q3 = 4.35 • Interquartile range: Q3 – Q1 = 4.35 − 2.2 = 2.15 • Distance to Q3: 7.9 − 4.35 = 3.55 • Individual #25 has a value of 7.9 years, which is 3.55 years above the third quartile. This is more than 1.5 * IQR = 3.225 years. Thus, individual #25 is a suspected outlier.
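
The 1.5 * IQR rule is easy to express as a check against two cutoffs. The sketch below uses the quartiles from this example (Q1 = 2.2, Q3 = 4.35); the function name `suspected_outlier` is an assumption made for illustration.

```python
def suspected_outlier(value, q1, q3):
    """Flag a value lying more than 1.5 * IQR below Q1 or above Q3."""
    iqr = q3 - q1
    lower_cutoff = q1 - 1.5 * iqr
    upper_cutoff = q3 + 1.5 * iqr
    return value < lower_cutoff or value > upper_cutoff

q1, q3 = 2.2, 4.35

print(suspected_outlier(7.9, q1, q3))   # individual #25, 3.55 years above Q3
print(suspected_outlier(6.1, q1, q3))   # the largest value in the five-number summary
```

The upper cutoff works out to about 7.575 years, so 7.9 is flagged while 6.1 is not.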

  37. Measure of Spread: Standard Deviation • The most common numerical description of a distribution is given by the mean to measure center and the standard deviation (s) to measure spread • Looks at how far observations are from their mean • The variance of a set of observations (s²) is the average of the squares of the deviations of the observations from their mean: $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$

  38. The standard deviation (s) is then given by the square root of the variance: $s = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2}$ • The deviations $x_i - \bar{x}$ are large in magnitude if observations lie far from the mean • Some deviations will be positive and some will be negative, depending on whether the observations are larger or smaller than the mean • The sum of the deviations of the observations from the mean will always be zero • s and s² will be large for widely spread distributions and small if observations do not lie far from the mean

  39. Steps for finding variance and standard deviation:
      1. Find the mean
      2. Subtract the mean from each value
      3. Square each of the results
      4. Add them together
      5. Divide by n-1 (where n is the number of observations). This value is the variance
      6. Take the square root to get the standard deviation
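
The six steps translate almost line for line into code. This is a minimal sketch with hypothetical function names (`sample_variance`, `sample_sd`), shown on the exam scores from slide 25.

```python
import math

def sample_variance(values):
    n = len(values)
    mean = sum(values) / n                     # step 1: find the mean
    deviations = [x - mean for x in values]    # step 2: subtract the mean from each value
    squared = [d ** 2 for d in deviations]     # step 3: square each result
    total = sum(squared)                       # step 4: add them together
    return total / (n - 1)                     # step 5: divide by n - 1

def sample_sd(values):
    return math.sqrt(sample_variance(values))  # step 6: take the square root

exam_scores = [80, 73, 92, 85, 75, 98, 93, 55, 80, 90]
print(f"variance = {sample_variance(exam_scores):.2f}")
print(f"sd       = {sample_sd(exam_scores):.2f}")
```

Python's standard `statistics.variance` and `statistics.stdev` also divide by n - 1, so they should agree with these results.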

  40. Why divide by n-1? • Since the sum of the deviations is zero, the last deviation can be calculated once the other n-1 are known • Thus we say there are only n-1 degrees of freedom • Why emphasize s over s²? • s has the same unit of measurement as the original observations • Natural measure of spread for the Normal distribution (section 1.3)

  41. Women’s height (inches) Calculations … • Mean = 63.4 • Sum of squared deviations from mean = 85.2 • Degrees of freedom (df) = (n − 1) = 13 • s² = variance = 85.2/13 = 6.55 inches squared • s = standard deviation = √6.55 = 2.56 inches

  42. Mean ± 1 s.d. Mean = 63.4 inches s = 2.56 inches

  43. Standard Deviation in the calculator: • Input the values in L1 (under STAT enter) • STAT-CALC-enter-enter • The Sx value is the sample standard deviation

  44. Another Standard Deviation Example Find the SD for 3, 5, 6, 6, 7, 9, 10, 10, 14 Step 1: Find the mean: (3 + 5 + 6 + 6 + 7 + 9 + 10 + 10 + 14) / 9 = 70 / 9 ≈ 7.8

  45. Step 2: Subtract the mean from each value:
      (3 − 7.8) = −4.8
      (5 − 7.8) = −2.8
      (6 − 7.8) = −1.8
      (6 − 7.8) = −1.8
      (7 − 7.8) = −0.8
      (9 − 7.8) = 1.2
      (10 − 7.8) = 2.2
      (10 − 7.8) = 2.2
      (14 − 7.8) = 6.2

  46. Step 3: Square each value (be sure to use parentheses!)
      (−4.8)² = 23.04
      (−2.8)² = 7.84
      (−1.8)² = 3.24
      (−1.8)² = 3.24
      (−0.8)² = 0.64
      (1.2)² = 1.44
      (2.2)² = 4.84
      (2.2)² = 4.84
      (6.2)² = 38.44

  47. Step 4: Add them all together: 23.04 + 7.84 + 3.24 + 3.24 + 0.64 + 1.44 + 4.84 + 4.84 + 38.44 = 87.56
      Step 5: Divide by n-1 (n is the number of observations): 87.56 / 8 = 10.945 (this is the variance)
      Step 6: Take the square root: sqrt(10.945) ≈ 3.31
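
The arithmetic of this worked example can be verified with a short script. It follows the slides in rounding the mean to 7.8 first, then compares against Python's `statistics.stdev`, which uses the unrounded mean; variable names are chosen for this example only.

```python
import math
import statistics

data = [3, 5, 6, 6, 7, 9, 10, 10, 14]

# Following the slides: round the mean to one decimal place before continuing.
rounded_mean = round(sum(data) / len(data), 1)          # 70 / 9, rounded to 7.8
sum_sq = sum((x - rounded_mean) ** 2 for x in data)     # step 4: about 87.56
variance = sum_sq / (len(data) - 1)                     # step 5: about 10.945
sd = math.sqrt(variance)                                # step 6: about 3.31

# With the unrounded mean, the result differs only in later decimal places.
exact_sd = statistics.stdev(data)

print(f"sum of squares = {sum_sq:.2f}")
print(f"variance       = {variance:.3f}")
print(f"sd (rounded mean) = {sd:.2f}, sd (exact mean) = {exact_sd:.2f}")
```

Rounding the mean early changes the sum of squares slightly, but both routes give s ≈ 3.31 to two decimal places.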

  48. Properties of the Standard Deviation • s measures spread about the mean • Only use when mean is measure of center • s = 0 only when there is NO spread • Occurs when all observations have same value • Otherwise, s > 0 • Like the mean, s is not resistant • A few outliers can make s very large • Remember, the deviation is squared!

  49. Choosing among summary statistics • Because the mean is not resistant to outliers or skew, use it to describe distributions that are fairly symmetrical and don’t have outliers. → Plot the mean and use the standard deviation for error bars. • Otherwise use the median in the five-number summary, which can be plotted as a boxplot. [Figures: boxplot; mean ± SD]

  50. What should you use, when, and why? Arithmetic mean or median?
      • Middletown is considering imposing an income tax on citizens. City hall wants a numerical summary of its citizens’ income to estimate the total tax base.
        Mean: Although income is likely to be right-skewed, the city government wants to know about the total tax base.
      • In a study of standard of living of typical families in Middletown, a sociologist makes a numerical summary of family income in that city.
        Median: The sociologist is interested in a “typical” family and wants to lessen the impact of extreme incomes.
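
A small numerical illustration of why the two users want different summaries. The eight incomes below are entirely hypothetical (the slides give no Middletown data); they are chosen to be right-skewed, with one very large value.

```python
import statistics

# Hypothetical, right-skewed family incomes in dollars (made up for illustration).
incomes = [28_000, 35_000, 41_000, 47_000, 52_000, 60_000, 75_000, 1_200_000]

mean_income = statistics.mean(incomes)
median_income = statistics.median(incomes)

# City hall: the mean times the number of families recovers the total tax base.
total_income = mean_income * len(incomes)

print(f"mean   = {mean_income:,.0f}")
print(f"median = {median_income:,.0f}")
print(f"total  = {total_income:,.0f}")
```

The mean is tied directly to the total, which is what city hall needs, while the median stays close to what most of these families actually earn, which is what the sociologist needs.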
