
Topic 5: Exploring Quantitative data


davidgarcia

Presentation Transcript


  1. Topic 5: Exploring Quantitative data

  2. Dot plot, mean, and standard deviation

  3. Data matrix for emails Rows 1, 2, 3, and 3921 of a data matrix are displayed below. It contains data collected on 3,921 emails that were received.

  4. Data matrix for emails Rows 1, 2, 3, and 3921 of a data matrix are displayed below. It contains data collected on 3,921 emails that were received. Quantitative variables

  5. Sample of email data Let’s consider a random sample of 50 emails from the data. Here, rows 1, 2, 3, and 50 of a data matrix are displayed below. It contains data from the randomly selected 50 emails.

  6. Dot plot A dot plot provides a case-by-case view of data for one quantitative variable.

  7. Dot plot A dot plot provides a case-by-case view of data for one quantitative variable.

  8. Dot plot and the mean • The “placement” of data, as seen in a dot plot or some other representation, is called the distribution of the data. • The mean (also called the average) is a common way to measure the center of the distribution. The mean for the data shown on this slide is 10.704.

  9. The mean The sample mean, denoted by $\bar{x}$, can be calculated as $\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n}$, where $x_1, x_2, \ldots, x_n$ represent the $n$ observed values.
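As a minimal sketch of the sample mean calculation, the snippet below adds up the observations and divides by their count. The function name `sample_mean` and the five observations are hypothetical and are not taken from the email data.

```python
def sample_mean(values):
    """Return the sample mean: the sum of the observations divided by n."""
    if len(values) == 0:
        raise ValueError("need at least one observation")
    return sum(values) / len(values)


# Hypothetical observations, chosen only for illustration.
observations = [2, 4, 4, 5, 10]
x_bar = sample_mean(observations)
print(x_bar)  # (2 + 4 + 4 + 5 + 10) / 5 = 5.0
```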

  10. Population mean and estimation • The population mean is also computed the same way, but denoted by μ (the Greek letter mu). It is often not possible to compute μ because data on the entire population is not available. • The sample mean is a sample statistic, and serves as a point estimate of the population mean. This estimate is probably not perfect, but if the sample is representative of the population, it is usually a good estimate.

  11. Distributions with the same mean Each dot plot displays 124 observations and the distributions all have a mean of 6. What makes them different?

  12. Distributions with the same mean Order these distributions from the least spread out to the most spread out. A. B. C.

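  13. Standard Deviation The standard deviation is the typical distance of an observation from the mean. The mean of the distribution is $\bar{x} = 6$ and the sample size is n = 124. The standard deviation is computed as follows: $s = \sqrt{\frac{(x_1 - \bar{x})^2 + (x_2 - \bar{x})^2 + \cdots + (x_n - \bar{x})^2}{n - 1}}$, which here means dividing the sum of squared deviations from 6 by 123 before taking the square root.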

  14. Standard deviation measures spread The standard deviations of the three distributions are given. A. Std. dev. = 1.361 B. Std. dev. = 2.550 C. Std. dev. = 1.482

  15. The standard deviation • The standard deviation of a sample is denoted by s and can be calculated using the formula given on the previous slide. • The standard deviation of the population is computed in a similar way, except we divide by n instead of n-1. The standard deviation of the population is denoted by σ (the Greek letter sigma).
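The difference between dividing by n-1 (sample) and by n (population) can be sketched as follows. The function names and the eight data values are hypothetical, picked so the arithmetic is easy to check by hand: the mean is 5 and the squared deviations sum to 32.

```python
import math


def sum_squared_deviations(values):
    """Sum of squared distances of each observation from the mean."""
    mean = sum(values) / len(values)
    return sum((x - mean) ** 2 for x in values)


def sample_std(values):
    """Sample standard deviation s: divide by n - 1."""
    if len(values) < 2:
        raise ValueError("need at least two observations")
    return math.sqrt(sum_squared_deviations(values) / (len(values) - 1))


def population_std(values):
    """Population standard deviation sigma: divide by n."""
    if len(values) == 0:
        raise ValueError("need at least one observation")
    return math.sqrt(sum_squared_deviations(values) / len(values))


# Hypothetical data with mean 5 and squared deviations summing to 32.
data = [2, 4, 4, 4, 5, 5, 7, 9]
print(population_std(data))  # sqrt(32 / 8) = 2.0
print(sample_std(data))      # sqrt(32 / 7), slightly larger than 2.0
```

For the same data, s is always a little larger than the value obtained by dividing by n, and the gap shrinks as n grows.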

  16. Histograms and the shape of a distribution

  17. Histogram A histogram plots binned counts as bars.

  18. Histograms • A histogram is another way to display the distribution of a quantitative variable. • Better than a stem-and-leaf plot for larger data sets, but doesn’t retain the actual numerical values. Basic Steps for Creating a Histogram • Divide the range of the data (smallest to largest) into classes of equal width. The classes should not overlap. • Count the number of observations that fall into each class. Recall that the counts are also called frequencies. • Draw a horizontal axis and mark off the classes along this axis. • The vertical axis can be the count, the proportion, or the percentage. • Draw a rectangle (a vertical bar) above each class with the height equal to the count, the proportion, or the percentage.
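The basic steps above can be sketched in a few lines: split the range into equal-width, non-overlapping classes, count the observations in each class, and draw one bar per class. The function name `histogram_counts`, the choice to start the first class at the smallest observation, and the ten heights are all assumptions made for illustration.

```python
def histogram_counts(values, bin_width):
    """Count observations in equal-width classes [low, low + bin_width).

    Classes start at the smallest observation and do not overlap.
    Returns a list of (low, high, count) tuples.
    """
    if bin_width <= 0:
        raise ValueError("bin_width must be positive")
    if len(values) == 0:
        return []
    low = min(values)
    n_bins = int((max(values) - low) // bin_width) + 1
    counts = [0] * n_bins
    for x in values:
        counts[int((x - low) // bin_width)] += 1
    return [(low + i * bin_width, low + (i + 1) * bin_width, c)
            for i, c in enumerate(counts)]


# Hypothetical heights in inches.
heights = [60, 61, 61, 63, 64, 64, 64, 66, 67, 70]

# A text histogram: one row per class, bar length equal to the count.
for low, high, count in histogram_counts(heights, bin_width=2):
    print(f"[{low}, {high}): {'#' * count}")
```

Calling `histogram_counts(heights, bin_width=6)` on the same data collapses it into two classes, which illustrates how the bin width can change the story a histogram tells.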

  19. Bin width: height of MAT 117 students Bin width can alter the story we get from the histogram. ½ in. bins 1 in. bins 6 in. bins 33 in. bins

  20. Shape of a Distribution: Modality Does the histogram have a single prominent peak (unimodal), several prominent peaks (bimodal/multimodal), or no apparent peaks (uniform)? Note: To determine modality, step back and imagine a smooth curve over the histogram. Picture the bars as wooden blocks with a limp spaghetti noodle dropped over them; the shape the noodle takes can be viewed as a smooth curve.

  21. Modality: height of MAT 117 students Which bin width most accurately presents the modality? ½ in. bins 1 in. bins 6 in. bins 33 in. bins

  22. Shape of a Distribution: Skewness Is the histogram right skewed, left skewed, or symmetric? Note: Histograms are said to be skewed to the side of the long tail.

  23. Shape of a Distribution: Unusual Observations Are there any unusual observations or potential outliers?

  24. Sample of email data How would you describe the shape of the distribution of the number of characters contained in the emails?

  25. Sample of email data How would you describe the shape of the distribution of the number of characters contained in the emails? Unimodal and right skewed, with a potentially unusual observation at 40,000 characters

  26. Box plot and the five number summary

  27. Percentiles, quartiles, and the median • The p-th percentile is a value such that p percent of observations fall at or below that value. Example data: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11 • 0th percentile: minimum • 50th percentile: median • 100th percentile: maximum

  28. Percentiles, quartiles, and the median • The p-th percentile is a value such that p percent of observations fall at or below that value. Example data: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11 • 0th percentile: minimum • 25th percentile: first quartile (Q1) • 50th percentile: median • 75th percentile: third quartile (Q3) • 100th percentile: maximum

  29. Percentiles, quartiles, and the median • The p-th percentile is a value such that p percent of observations fall at or below that value. Example data: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11 • 0th percentile: minimum • 25th percentile: first quartile (Q1) • 50th percentile: median, also the second quartile (Q2) • 75th percentile: third quartile (Q3) • 100th percentile: maximum

  30. Percentiles, quartiles, and the median • The p-th percentile is a value such that p percent of observations fall at or below that value. Example data: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11 • 0th percentile: minimum • 25th percentile: first quartile (Q1) • 50th percentile: median, also the second quartile (Q2) • 75th percentile: third quartile (Q3) • 100th percentile: maximum
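For the values 1 through 11 used on these slides, the quartiles can be sketched with a "median of each half" rule. This rule is an assumption: the slides do not say which quartile convention they use, and statistical software implements several conventions that can give slightly different answers. The function name `quartiles` is hypothetical.

```python
from statistics import median


def quartiles(values):
    """Return (Q1, median, Q3) using the median-of-halves convention.

    Q1 is the median of the lower half and Q3 the median of the upper
    half. When n is odd, the middle observation is left out of both halves.
    """
    data = sorted(values)
    n = len(data)
    if n < 2:
        raise ValueError("need at least two observations")
    half = n // 2
    lower = data[:half]
    upper = data[half + 1:] if n % 2 == 1 else data[half:]
    return median(lower), median(data), median(upper)


data = list(range(1, 12))  # 1, 2, ..., 11 as on the slides
q1, q2, q3 = quartiles(data)
print(min(data), q1, q2, q3, max(data))  # 1 3 6 9 11
```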

  31. Height of female MAT 117 students

  32. Height of female MAT 117 students

  33. Height of female MAT 117 students The figure marks five values: Min., Q1, Median, Q3, and Max. We want to graphically represent these five numbers, called the five-number summary. This graph is called a box plot. As you can see, there is a bit more to it than just these five numbers.

  34. Box plot: height of female MAT 117 students

  35. Anatomy of the box plot Labeled parts, from low to high: potential outlier, lower whisker, Q1, median, Q3, upper whisker, potential outliers.

  36. IQR, whiskers, and outliers • Between Q1 and Q3 is the middle 50% of the data. The range these data span is called the interquartile range (IQR). IQR = Q3 – Q1 • Whiskers of a box plot can extend up to 1.5 × IQR away from the quartiles: Max upper whisker reach = Q3 + 1.5 × IQR Max lower whisker reach = Q1 – 1.5 × IQR • A potential outlier is an observation beyond the maximum reach of the whiskers. It is an observation that appears to be extreme relative to the rest of the data.
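The whisker and outlier rules can be sketched as below. The heights are hypothetical (they are not the MAT 117 data), the names `quartiles` and `box_plot_summary` are made up for this example, and the quartiles use a median-of-halves convention, which is an assumption since other conventions exist.

```python
from statistics import median


def quartiles(values):
    """Return (Q1, median, Q3) using the median-of-halves convention."""
    data = sorted(values)
    n = len(data)
    if n < 2:
        raise ValueError("need at least two observations")
    half = n // 2
    lower = data[:half]
    upper = data[half + 1:] if n % 2 == 1 else data[half:]
    return median(lower), median(data), median(upper)


def box_plot_summary(values):
    """Compute the IQR, maximum whisker reach, whisker ends, and outliers."""
    data = sorted(values)
    q1, med, q3 = quartiles(data)
    iqr = q3 - q1
    lower_reach = q1 - 1.5 * iqr  # farthest the lower whisker may extend
    upper_reach = q3 + 1.5 * iqr  # farthest the upper whisker may extend
    inside = [x for x in data if lower_reach <= x <= upper_reach]
    return {
        "iqr": iqr,
        "lower_reach": lower_reach,
        "upper_reach": upper_reach,
        # Whiskers stop at the most extreme observations within reach.
        "lower_whisker": min(inside),
        "upper_whisker": max(inside),
        "potential_outliers": [x for x in data
                               if x < lower_reach or x > upper_reach],
    }


# Hypothetical heights in inches.
heights = [58, 62, 63, 64, 64, 65, 66, 67, 68, 70, 79]
summary = box_plot_summary(heights)
print(summary["iqr"], summary["lower_reach"], summary["upper_reach"])  # 5 55.5 75.5
print(summary["potential_outliers"])  # [79]
```

Note that the whiskers are drawn to actual observations (58 and 70 here), not to the maximum reach values themselves.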

  37. Outliers Why is it important to look for outliers? • Identify extreme skew in the distribution. • Identify data collection and entry errors. • Provide insight into interesting features of the data.

  38. Resistant statistics

  39. Extreme Observations: 2006 US household income

  40. Extreme Observations: 2006 US household income

  41. Extreme Observations: 2006 US household income

  42. Extreme Observations: 2006 US household income

  43. Extreme Observations: 2006 US household income

  44. Extreme Observations: 2006 US household income

  45. Quantitative data pairs: scatterplots

  46. Scatterplot A scatterplot provides a case-by-case view of data for two quantitative variables.

  47. Scatterplot A scatterplot provides a case-by-case view of data for two quantitative variables.

  48. Scatterplots: trends Linear trend Nonlinear trend

  49. Scatterplots: trends (continued) Cluster trend No apparent trend

  50. Categorical-quantitative data pairs: comparing groups
