
Chapter 3

Chapter 3. Data Characterization. Types of data measurements: measurements of center and location, and measurements of variation. Measurements for population and sample: in general, we use the same set of measurements for both population and sample.

lopezs

Presentation Transcript


  1. Chapter 3 Data Characterization BUS304 – Data Characterization

  2. Types of Data Measurements • Measurements of Center and Location • Measurements of Variation

  3. Measurements for Population and Sample • In general, we use the same set of measurements for both population and sample. • Population parameters: numerical measurements for a population, usually represented by Greek letters or capital English letters. • “N” for population size; “μ” for population mean • Sample statistics: numerical measurements for a sample, usually represented by lowercase English letters. • “n” for sample size; “x̄” for sample mean

  4. Most commonly used: Mean • Sample mean (“sample average”). Formula: $\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}$ • Population mean (“population average”). Formula: $\mu = \frac{\sum_{i=1}^{N} x_i}{N}$ • Characterizes the center of the data distribution • The most commonly used data measure • Ways to compute the mean: • Use a calculator. • Use Excel (function: AVERAGE).
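
As a minimal sketch of the mean calculation in Python: the data values are hypothetical and only illustrate the arithmetic, which is the same for a sample and a population (only the notation differs).

```python
from statistics import mean

# Hypothetical data, not taken from the slides
values = [3, 4, 5, 6, 7]

def arithmetic_mean(data):
    """Sum of the values divided by the number of values."""
    return sum(data) / len(data)

print(arithmetic_mean(values))  # 5.0
print(mean(values))             # library equivalent of the same calculation
```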

  5. Sensitivity to outliers • Compute the mean for the following 2 groups of data: household income in community a (unit = $10,000) and household income in community b (unit = $10,000). • If the mayor decides to provide more public facilities to poor communities, and the decision is based on whether the mean income in the community is below $50,000 per year, does such a decision make sense?
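
The question can be explored with a small sketch. The income figures below are hypothetical (the slide's own data did not survive); they are chosen so that a single very high income moves the mean above the threshold even though most households sit below it.

```python
# Hypothetical household incomes in units of $10,000
community_a = [3, 4, 4, 5, 4]
community_b = [3, 4, 4, 5, 4, 40]  # same households plus one very high income

THRESHOLD = 5  # $50,000 expressed in units of $10,000

def mean(data):
    return sum(data) / len(data)

def households_below(data, threshold):
    """Count households whose income is strictly below the threshold."""
    return sum(1 for x in data if x < threshold)

for name, incomes in (("a", community_a), ("b", community_b)):
    print(name, mean(incomes), households_below(incomes, THRESHOLD), "of", len(incomes))
```

In this made-up case community b has a mean of $100,000 although four of its six households earn less than $50,000, which is the kind of situation the question is pointing at.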

  6. Exercise: The manager of a small hotel in Foster City, CA, was asked by the corporate VP to analyze the Sunday night registration information for the past eight weeks. Data on three variables were collected: • x1 = total number of rooms rented • x2 = total dollar revenue from the room rentals • x3 = number of customer complaints that came from guests each Sunday • Tasks: • Create a histogram for the distribution of the number of customer complaints per day. • Calculate the average number of rooms rented, the average revenue, and the average number of complaints per day. • Calculate the average number of complaints per room rented. • Explain the difference between “the average complaints per day” and “the average complaints per room rented” from a managerial perspective.

  7. Compute the mean from a frequency table • Below is a frequency table showing the number of days the teams take to finish their projects. • How many days on average does a team take to finish one project? • Create a histogram using the data on the left and locate the mean on the graph. • How would you describe the shape of the histogram? What is the relationship between the mean and the peak? • Use relative frequency to find the mean.
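
A minimal sketch of both routes to the mean of a frequency table, using a hypothetical table (the slide's table is not available): once with raw frequencies, once with relative frequencies.

```python
# Hypothetical frequency table: days to finish -> number of teams
table = {2: 1, 4: 2, 6: 1}

def mean_from_frequencies(freq):
    """Sum of value * frequency, divided by the total frequency."""
    total = sum(freq.values())
    return sum(value * f for value, f in freq.items()) / total

def mean_from_relative_frequencies(freq):
    """Sum of value * relative frequency (relative frequencies add up to 1)."""
    total = sum(freq.values())
    return sum(value * (f / total) for value, f in freq.items())

print(mean_from_frequencies(table))
print(mean_from_relative_frequencies(table))
```

Both functions compute the same quantity; the second simply divides each frequency by the total before multiplying.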

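  8. Estimating the mean from a histogram • Treat the histogram as a frequency table, and use the mid-value to represent each range. • Mathematical expression: $\bar{x} \approx \frac{\sum_i f_i m_i}{n}$ if sample, $\mu \approx \frac{\sum_i f_i m_i}{N}$ if population, where $f_i$ is the frequency and $m_i$ is the mid-value of range $i$.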
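
The mid-value estimate can be sketched as follows. The bins are hypothetical, and the result is only an approximation because every value in a range is treated as if it sat at the middle of that range.

```python
# Hypothetical histogram bins: (lower bound, upper bound, frequency)
bins = [(0, 10, 2), (10, 20, 5), (20, 30, 3)]

def grouped_mean(histogram):
    """Estimate the mean by weighting each bin's mid-value by its frequency."""
    total = sum(freq for _, _, freq in histogram)
    weighted_sum = sum(freq * (low + high) / 2 for low, high, freq in histogram)
    return weighted_sum / total

print(grouped_mean(bins))  # 16.0
```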

  9. Weighted Mean • The simple mean weights each piece of information equally; a weighted mean does not. • E.g. students’ GPA and score calculation. • Weights are subjective. • E.g. different instructors assign different weights to homework and exams. • A frequency table can be considered an example of a weighted mean (higher weight for higher frequency).
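
A short sketch of a weighted mean, using a GPA-style calculation. The grade points and credit hours are hypothetical; the credit hours play the role of the weights.

```python
def weighted_mean(values, weights):
    """Sum of weight * value, divided by the sum of the weights."""
    if len(values) != len(weights):
        raise ValueError("values and weights must have the same length")
    return sum(w * x for x, w in zip(values, weights)) / sum(weights)

# Hypothetical courses: grade points and credit hours
grade_points = [4.0, 3.0, 2.0]
credits = [3, 4, 1]

print(weighted_mean(grade_points, credits))    # 3.25
print(weighted_mean(grade_points, [1, 1, 1]))  # equal weights give the simple mean, 3.0
```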

  10. Exercise: • Estimate the mean based on the following histogram. • There are 30 full-time faculty in CoBA. Their average age was 43 in 2007. In 2008, one new faculty member, age 30, was hired and one faculty member retired at 65. What is the new mean age for CoBA faculty?

  11. Variance • A measure of data spread. • Also called “the average of squared deviations from the mean”. • The larger the variance, the fatter the histogram. • Sample variance: $s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$ • Population variance: $\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$ • Note the difference!

  12. Steps to compute the variance • Identify whether the data are from a population or a sample (the formulae are different). • Use the following table to compute the deviation: • Find the mean. • Find the distance from the mean (fill out the 2nd column), e.g. 5 – 3.833 = 1.167. • Find the squared distance (the 3rd column), e.g. (1.167)² = 1.36. • Add up the 3rd column. • Divide by the population size, or by the sample size – 1.
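
The steps can be sketched in Python as below. The data set is hypothetical (the slide's table is not available), and the `is_sample` flag is an illustrative name for choosing between the two divisors.

```python
def variance(data, is_sample=True):
    """Follow the slide's steps: mean, deviations, squared deviations, sum, divide."""
    n = len(data)
    center = sum(data) / n                         # step 1: the mean
    deviations = [x - center for x in data]        # step 2: distance from the mean
    squared = [d ** 2 for d in deviations]         # step 3: squared distance
    total = sum(squared)                           # step 4: add up the column
    divisor = n - 1 if is_sample else n            # step 5: n - 1 for a sample, N for a population
    return total / divisor

# Hypothetical data
data = [2, 4, 4, 5, 7, 8]

print(variance(data, is_sample=True))   # 4.8
print(variance(data, is_sample=False))  # 4.0
```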

  13. Comparing variance vs. histogram • Find the variance for the following groups of sample data. • Compare the mean and variance. • Create histograms to compare the distributions.

  14. What does variance mean? • Variance indicates variation: • The larger the variance, the more spread out the data. • Indicates unpredictability. • E.g. • Weather data: when the weather changes dramatically, it is hard to predict tomorrow’s temperature. (Looking at temperature data: which has the larger variance, Chicago or San Diego?) • Stock: more risk on returns. • A person’s performance: consistency, emotional… • Other examples?

  15. Use a frequency table to compute the population variance • Compute the weighted average of the squared deviations, using the frequencies as weights.
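
One way to read this slide is that the population variance is a frequency-weighted average of squared deviations. A minimal sketch under that reading, with a hypothetical frequency table:

```python
# Hypothetical frequency table: value -> frequency
table = {2: 1, 4: 2, 6: 1}

def population_variance_from_table(freq):
    """Weighted average of squared deviations, with frequencies as weights."""
    size = sum(freq.values())                               # population size N
    mu = sum(value * f for value, f in freq.items()) / size # weighted mean
    return sum(f * (value - mu) ** 2 for value, f in freq.items()) / size

print(population_variance_from_table(table))  # 2.0
```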

  16. Standard Deviation • Square root of variance. • An indicator of data deviation that can be directly compared to the mean. • Sample variance: $s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$; sample standard deviation: $s = \sqrt{s^2}$ • Population variance: $\sigma^2 = \frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}$; population standard deviation: $\sigma = \sqrt{\sigma^2}$ • Exercise: compute the standard deviation from the histogram on slide no. 5 and locate it on the histogram.
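
A short sketch using Python's standard `statistics` module, which provides both versions: `stdev` and `variance` divide by n - 1 (sample), while `pstdev` and `pvariance` divide by N (population). The data are hypothetical.

```python
import math
from statistics import pstdev, pvariance, stdev, variance

# Hypothetical data
data = [2, 4, 4, 5, 7, 8]

sample_sd = stdev(data)        # square root of the sample variance
population_sd = pstdev(data)   # square root of the population variance

print(sample_sd, population_sd)
# Each standard deviation is the square root of the matching variance
print(math.isclose(sample_sd, math.sqrt(variance(data))))
print(math.isclose(population_sd, math.sqrt(pvariance(data))))
```

Because the standard deviation is in the same units as the data, it can be compared with the mean directly, unlike the variance.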

  17. Empirical Rule • If the data are bell shaped (most of the time), then • about 68% of all data will fall in the range of $\mu \pm 1\sigma$ • about 95% of all data will fall in the range of $\mu \pm 2\sigma$ • about 99.7% of all data will fall in the range of $\mu \pm 3\sigma$
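
The rule can be checked on simulated data. This is only a sketch: the bell-shaped data are generated with `random.gauss`, and the mean of 100 and standard deviation of 15 are arbitrary choices for illustration.

```python
import math
import random

random.seed(0)  # fixed seed so the run is repeatable
data = [random.gauss(100, 15) for _ in range(50_000)]  # simulated bell-shaped data

mu = sum(data) / len(data)
sigma = math.sqrt(sum((x - mu) ** 2 for x in data) / len(data))

def share_within(k):
    """Fraction of values that fall inside mu ± k * sigma."""
    return sum(1 for x in data if abs(x - mu) <= k * sigma) / len(data)

for k in (1, 2, 3):
    print(k, round(share_within(k), 3))
```

The printed shares should land close to 0.68, 0.95 and 0.997; for data that are not bell shaped they can differ substantially.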

  18. Other Numerical Measures • Median • Mode • Range • Percentiles • Quartiles, Interquartile range

  19. Median • The middle value: the value that divides the data in half, with equal numbers of values above and below. • Steps: • Put your data in an ordered array (sort). • If n (or N) is odd, the median is the middle number (i.e. the $\frac{n+1}{2}$th number). • If n (or N) is even, the median is the average of the two middle numbers (i.e. the average of the $\frac{n}{2}$th and the $(\frac{n}{2}+1)$th numbers).
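
The steps translate into a few lines of Python. The example data are hypothetical; note that Python lists are 0-indexed, so the 1-based positions in the steps shift by one.

```python
def median(values):
    """Sort, then take the middle value (odd n) or average the two middle values (even n)."""
    data = sorted(values)
    n = len(data)
    mid = n // 2
    if n % 2 == 1:
        return data[mid]                      # the ((n + 1) / 2)th value, 1-based
    return (data[mid - 1] + data[mid]) / 2    # average of the (n/2)th and (n/2 + 1)th values

print(median([5, 1, 3]))     # 3
print(median([4, 1, 3, 2]))  # 2.5
```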

  20. Sensitivity to outliers • The median is not affected by extreme values. [Figure: three dot plots on a 0 to 10 axis, labeled Median = 3, Median = 2.5, and Median = 3.]

  21. Exercise

  22. Mode • The value that occurs most often. • Steps: • Put your data in an ordered array (sort). • Find the data value(s) that repeat most frequently. • The mode is not affected by extreme values either. [Figure: dot plots and a city list (Boston, Austin, San Diego, Los Angeles) illustrating four cases: no mode, mode = 5, mode = San Diego, and modes = 5 and 9.]
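
A sketch of finding the mode(s) with `collections.Counter`. The data sets are hypothetical stand-ins for the four cases named on the slide, and treating "no value repeats" as "no mode" is a convention assumed here, not something the slide spells out.

```python
from collections import Counter

def modes(values):
    """Return every value tied for the highest count; empty list if nothing repeats."""
    counts = Counter(values)
    if not counts:
        return []
    top = max(counts.values())
    if top == 1:
        return []  # assumed convention: no value repeats, so there is no mode
    return sorted(value for value, count in counts.items() if count == top)

print(modes([0, 1, 2, 3, 4, 5, 6]))                                     # []
print(modes([1, 5, 5, 5, 9]))                                           # [5]
print(modes(["Boston", "Austin", "San Diego", "San Diego", "Los Angeles"]))  # ['San Diego']
print(modes([1, 5, 5, 9, 9]))                                           # [5, 9]
```

The city example shows that the mode, unlike the mean and median, also works for non-numeric data.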

  23. Find Mode and Median from Frequency Table • Below is a frequency table showing the number of days the teams take to finish their projects. • Find the mean, median and mode. • Create a histogram and locate the mean, median and mode. • Describe the shape of the histogram, and find the relationship between mean, median and mode.

  24. Shape of a distribution • Right-skewed (longer tail extends to the right): Mode < Median < Mean • Symmetric: Mean = Median = Mode • Left-skewed (longer tail extends to the left): Mean < Median < Mode • Note that the mean is affected by extreme values the most, so the mean usually leans towards the tail compared to the other two measures.
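
The ordering of the three measures can be seen on two small hypothetical data sets, one with a long right tail and one with a long left tail. This is an illustration of the typical pattern, not a proof; real data can break the ordering.

```python
from statistics import mean, median, mode

# Hypothetical data sets
right_skewed = [1, 2, 2, 3, 4, 5, 11]   # one large value stretches the right tail
left_skewed = [1, 7, 8, 9, 10, 10, 11]  # one small value stretches the left tail

def summarize(data):
    """Return (mode, median, mean) for a data set."""
    return mode(data), median(data), mean(data)

print(summarize(right_skewed))  # mode < median < mean
print(summarize(left_skewed))   # mean < median < mode
```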

  25. Measures of center location: Mean, Median, Mode • The mean is generally used, unless extreme values (outliers) exist; • the next most common is the median, since the median is not sensitive to extreme values; • the mode is sometimes used when one value has a really large frequency. • Think: Are house prices normally right-skewed or left-skewed? What measurement do people normally use to describe the housing market?

  26. Range • Simplest measure of variation • Describes how widely the data spread • Formula: Range = Maximum Value – Minimum Value • Example (data plotted on a 0 to 14 axis): Range = 14 – 1 = 13

  27. Disadvantage of Range • Ignores the way in which data are distributed: two data sets plotted on a 7 to 12 axis with different distributions both have Range = 12 – 7 = 5. • Sensitive to outliers: • 1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,5: Range = 5 – 1 = 4 • 1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,120: Range = 120 – 1 = 119 • Range is affected the most by outliers. Feb 8, 2006
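
The outlier example on this slide can be reproduced directly. The two lists are the slide's own data; only the function name is made up for the sketch.

```python
def data_range(values):
    """Range = maximum value - minimum value."""
    return max(values) - min(values)

# The two data sets from the slide; they differ only in the last value
base = [1] * 6 + [2] * 8 + [3] * 4 + [4, 5]
with_outlier = base[:-1] + [120]

print(data_range(base))          # 4
print(data_range(with_outlier))  # 119
```

Changing one value out of twenty moves the range from 4 to 119, because the range depends only on the two most extreme values.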

  28. Other measures • Percentiles: measure the percentage of data below a value. • E.g. if the 60th percentile is 1240 (SAT score), 60% of students scored less than 1240. Correspondingly, 40% of students scored 1240 or higher. • How to find a percentile? The pth percentile in an ordered array of n values is the value in the ith position, where $i = \frac{p}{100}(n + 1)$.

  29. Example • Find the 80th percentile from the annual income data. • Steps: • Sort the data. • Find the location for the 80th percentile: $i = \frac{80}{100}(100 + 1) = 80.8$. • Find the 80.8th person’s income. Where is the 80.8th person? It should be between the 80th and the 81st, and closer to the 81st. • Combine the 80th and 81st numbers: 80th → 62245, 81st → 63485, 80.8th → 62245 × 20% + 63485 × 80% = 63237 (80% because the decimal is .8).
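
A sketch of this percentile calculation, assuming the position rule i = (p/100)(n + 1), which gives the location 80.8 when the data set has 100 values. Only the 80th and 81st incomes (62245 and 63485) come from the slide; the other 98 values are hypothetical filler so that the list has 100 entries.

```python
def percentile(values, p):
    """pth percentile by linear interpolation at 1-based position i = p * (n + 1) / 100."""
    data = sorted(values)
    n = len(data)
    position = p * (n + 1) / 100
    lower = int(position)          # whole part: index of the value just below
    fraction = position - lower    # decimal part: weight given to the value just above
    if lower < 1:
        return data[0]
    if lower >= n:
        return data[-1]
    return data[lower - 1] * (1 - fraction) + data[lower] * fraction

# 79 hypothetical incomes below, the slide's 80th and 81st values, 19 hypothetical incomes above
incomes = (
    [20000 + 500 * k for k in range(79)]
    + [62245, 63485]
    + [64000 + 1000 * k for k in range(19)]
)

print(round(percentile(incomes, 80)))  # 63237
```

Other software may use a different position rule (for example one based on n - 1), so results can differ slightly between tools.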

  30. Exercise • Find the 25th percentile. • Find the 50th percentile. • Find the 75th percentile. • Explain the meaning of the 50th percentile. Have you learned a similar measurement? • How many people have income levels between the 25th and the 50th percentiles? • How many people have income levels between the 50th and the 75th percentiles?

  31. Quartiles • The 25th, 50th, and 75th percentiles • Called the first, second, and third quartiles, respectively • Written as Q1, Q2, Q3, respectively • The quartiles split the ranked data into 4 equal groups: 25% | Q1 | 25% | Q2 | 25% | Q3 | 25%

  32. Example: Find the first quartile in the data sample: 22 12 14 16 17 16 13 20 18

  33. Interquartile Range • Recall: Range? Disadvantage of range? • Interquartile Range = Q3 – Q1 • Example: 12 13 14 16 16 17 18 20 22 • Q1 = 13.5, Q3 = 19 • Interquartile range = Q3 – Q1 = 19 – 13.5 = 5.5
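
The example can be reproduced with `statistics.quantiles` from the Python standard library. The `method="exclusive"` option places the quartiles at positions based on n + 1, which matches the values shown on the slide; the default in other tools may differ.

```python
from statistics import quantiles

# Data sample from the slides (unsorted; quantiles() sorts internally)
data = [22, 12, 14, 16, 17, 16, 13, 20, 18]

q1, q2, q3 = quantiles(data, n=4, method="exclusive")
iqr = q3 - q1

print(q1, q2, q3)  # 13.5 16.0 19.0
print(iqr)         # 5.5
```

Unlike the range, the interquartile range ignores the lowest and highest quarters of the data, so a single extreme value has little effect on it.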

  34. Summary • Understand and compute the following two sets of data measures: • Measures of central tendency: Mean, Median, and Mode • Measures of variation: Range, Variance, and Standard deviation • Other ways to describe data: Percentiles, Quartiles, Interquartile range
