2.1 Graphs for qualitative variables: • Bar graphs (frequency and relative frequency) • Pie charts • Pareto charts
Graphs for qualitative variables • The values of a qualitative or categorical variable are labels. • The distribution of a categorical variable lists the count or percentage of individuals in each category. Example: a sample of 400 wireless internet users, with category counts 212, 168, and 20.
Frequency Distribution (or Frequency Table) lists each category of data and the number of occurrences for each category of data.
Frequency Distribution: Ages of Best Actresses [Figure: the original data alongside the resulting frequency distribution]
Lower Class Limits Lower Class Limits are the smallest numbers that can actually belong to different classes
Upper Class Limits Upper Class Limits are the largest numbers that can actually belong to different classes
Class Midpoints Class midpoints can be found by adding the lower class limit to the upper class limit and dividing the sum by two. In this example the midpoints are 25.5, 35.5, 45.5, 55.5, 65.5, and 75.5.
Class Width Class width is the difference between two consecutive lower class limits or two consecutive lower class boundaries. In this example every class has width 10.
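The two definitions above can be checked with a short sketch. The class limits used here (21 to 30, 31 to 40, and so on) are an assumption: they are not listed in the text, and were chosen because they reproduce the midpoints 25.5, ..., 75.5 and the class width of 10.

```python
# Assumed class limits (not listed in the text); chosen so that they
# reproduce the midpoints 25.5, ..., 75.5 and the class width of 10.
lower_limits = [21, 31, 41, 51, 61, 71]
upper_limits = [30, 40, 50, 60, 70, 80]

# Class midpoint = (lower class limit + upper class limit) / 2
midpoints = [(lo + hi) / 2 for lo, hi in zip(lower_limits, upper_limits)]

# Class width = difference between two consecutive lower class limits
widths = [b - a for a, b in zip(lower_limits, lower_limits[1:])]

print(midpoints)  # [25.5, 35.5, 45.5, 55.5, 65.5, 75.5]
print(widths)     # [10, 10, 10, 10, 10]
```

Six classes give five consecutive differences, so `widths` has one entry fewer than `midpoints`.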
EXAMPLE Organizing Qualitative Data into a Frequency Distribution The data on the next slide represent the color of M&Ms in a bag of plain M&Ms. Construct a frequency distribution of the color of plain M&Ms.
The relative frequency is the proportion (or percent) of observations within a category and is found using the formula: relative frequency = frequency / sum of all frequencies. • A relative frequency distribution lists the relative frequency of each category of data.
EXAMPLE Organizing Qualitative Data into a Relative Frequency Distribution Use the frequency distribution obtained in the prior example to construct a relative frequency distribution of the color of plain M&Ms.
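As a rough illustration of both examples, the sketch below tallies a frequency distribution and then divides each count by the total to get relative frequencies. The bag of colors is hypothetical and is not the data from the slide.

```python
from collections import Counter

# Hypothetical bag of M&Ms (illustrative only; not the data from the slide)
colors = ["brown", "yellow", "red", "brown", "blue",
          "orange", "green", "brown", "yellow", "red"]

# Frequency distribution: number of occurrences of each category
freq = Counter(colors)

# Relative frequency = frequency / sum of all frequencies
total = sum(freq.values())
rel_freq = {color: count / total for color, count in freq.items()}

# Print a small table: category, frequency, relative frequency
for color, count in freq.most_common():
    print(f"{color:<8}{count:>4}{rel_freq[color]:>8.2f}")
```

The relative frequencies always sum to 1 (or 100%), which is a quick check on the table.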
Bar Graphs A bar graph is constructed by labeling each category of data on either the horizontal or vertical axis and the frequency or relative frequency of the category on the other axis.
EXAMPLE Constructing a Frequency and Relative Frequency Bar Graph Use the M&M data to construct: • a frequency bar graph and • a relative frequency bar graph.
Actresses example Total frequency = 76. Relative frequencies: 28/76 = 37%, 30/76 = 39%, etc.
Frequency bar graph • The horizontal scale represents the classes of data values. • The vertical scale represents the frequencies.
Relative Frequency Graph Has the same shape and horizontal scale as the bar graph, but the vertical scale is marked with relative frequencies instead of actual frequencies
Interpreting Frequency Distributions In later chapters, there will be frequent reference to data with a normal distribution. One key characteristic of a normal distribution is that it has a “bell” shape. • The frequencies start low, then increase to some maximum frequency, then decrease to a low frequency. • The distribution should be approximately symmetric.
Example: “bell” shape
EXAMPLE Comparing Two Data Sets The following data represent the marital status (in millions) of U.S. residents 18 years of age or older in 1990 and 2006. Draw a side-by-side relative frequency bar graph of the data.
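Marital Status in 1990 vs. 2006 [Figure: side-by-side relative frequency bar graph; horizontal axis: marital status; vertical axis: relative frequency; one bar per year, 1990 and 2006]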
Another Example: On the morning of April 10, 1912, the Titanic sailed from the port of Southampton (UK), bound for New York. Altogether there were 2,201 passengers and crew members on board. The table summarizes the survivors of the sinking. Identify the categorical variables.
A Pareto chart is a bar graph where the bars are drawn in decreasing order of frequency or relative frequency.
Pareto Chart
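The ordering that defines a Pareto chart can be sketched in a few lines. The category counts below are hypothetical, and the bars are drawn as text rather than with a plotting library.

```python
# Hypothetical category counts (illustrative only)
counts = {"brown": 12, "yellow": 10, "red": 9, "orange": 6, "blue": 3, "green": 5}

# Pareto ordering: categories sorted by decreasing frequency
pareto = sorted(counts.items(), key=lambda item: item[1], reverse=True)

# Draw one text bar per category, tallest first
for category, count in pareto:
    print(f"{category:<8} {'#' * count} {count}")
```

The same ordering applies to relative frequencies, since dividing every count by the same total does not change the ranking.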
Pie Chart A pie chart is a circle divided into sectors. Each sector represents a category of data. The area of each sector is proportional to the frequency of the category.
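Because sector area is proportional to frequency, each sector's central angle is 360 degrees times the category's relative frequency. The sketch below uses the counts from the sample of 400 wireless internet users; the category labels are placeholders, since the text does not name them.

```python
# Counts from the sample of 400 wireless internet users.
# The labels are placeholders: the text does not name the categories.
counts = {"category A": 212, "category B": 168, "category C": 20}

total = sum(counts.values())

# Central angle of each sector = 360 degrees x relative frequency
angles = {label: 360 * count / total for label, count in counts.items()}

for label, angle in angles.items():
    print(f"{label}: {angle:.1f} degrees")
```

The angles add up to 360 degrees, just as the relative frequencies add up to 100%.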
EXAMPLE Constructing a Pie Chart The following data represent the marital status (in millions) of U.S. residents 18 years of age or older in 2006. Draw a pie chart of the data.
Other example: A graph depicting qualitative data as slices of a pie
2.2 Graphs for quantitative variables: • Histograms (discrete data and continuous data) • Stem-and-leaf plots • Time series • Dot plots • Distributions
Histogram example: CEO salaries Forbes magazine published data on the best small firms in 1993. These were firms with annual sales of more than $5 million and less than $350 million. Firms were ranked by five-year average return on investment. The data extracted are the age and annual salary of the chief executive officer for the first 60 ranked firms. (Data at http://lib.stat.cmu.edu/DASL/DataArchive.html ) Salary of chief executive officer (including bonuses), in $thousands:
145 621 262 208 362 424 339 736 291 58 498 643 390 332 750 368 659 234 396 300 343 536 543 217 298 1103 406 254 862 204 206 250 21 298 350 800 726 370 536 291 808 543 149 350 242 198 213 296 317 482 155 802 200 282 573 388 250 396 572
Drawing a histogram
• Construct a distribution table:
  • Define class intervals or bins (choose intervals of equal width!)
  • Count the percentage of observations in each interval
  • End-point convention: the left endpoint of the interval is included and the right endpoint is excluded, i.e. [a, b)
• Draw the horizontal axis.
• Construct the blocks:
  • Height of block = percentage in the interval divided by the interval width, so that the area of the block equals the percentage. (With equal-width intervals, the heights are proportional to the percentages.)
• The total area under a histogram must be 100%.
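The steps above can be sketched as follows. The observations and bin edges are hypothetical, and the block height is computed on a density scale (percent per unit), which is an assumption made here so that the block areas add up to 100%.

```python
# Hypothetical observations and bin edges (illustrative only)
data = [3, 7, 12, 15, 18, 20, 22, 25, 31, 38]
edges = [0, 10, 20, 30, 40]  # equal-width class intervals [a, b)

n = len(data)
rows = []
for a, b in zip(edges, edges[1:]):
    # Left endpoint included, right endpoint excluded
    count = sum(1 for x in data if a <= x < b)
    percent = 100 * count / n
    # Density scale (an assumption): percent per unit, so area = percent
    height = percent / (b - a)
    rows.append((a, b, percent, height))

for a, b, percent, height in rows:
    print(f"[{a}, {b}): {percent:.0f}% of cases, height {height:.1f}% per unit")

# The block areas (height x width) should add up to 100%
total_area = sum(height * (b - a) for a, b, _, height in rows)
print(total_area)
```

Note that the value 20 falls in [20, 30), not [10, 20), because of the end-point convention.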
[Figure: histogram of the CEO salaries; block labels visible include 30.50%, 23.73%, 3.39%, and 1.70%.] The area of each block represents the percentage of cases in the corresponding class interval (or bin).
Remarks
• A histogram represents percent by area. The area of each block represents the percentage of cases in the corresponding class interval.
• The total area under a histogram is 100%.
• There is no fixed choice for the number of classes in a histogram:
  • If class intervals are too small, the histogram will have spikes;
  • If class intervals are too large, some information will be missed.
  • Use your judgment!
• Typically statistical software will choose the class intervals for you, but you can modify them.
• Let's try various binning levels.
Example: Smoking
In a Public Health Service study, a histogram was plotted showing the number of cigarettes smoked per day by each subject (male current smokers), as shown below. The density is marked in parentheses. The class intervals include the left endpoint, but not the right. (Note: there are 20 cigarettes in a pack.)
• The percentage who smoked at least a pack a day but less than two packs is around: 1.5% 15% 30% 50%
• The percentage who smoked at least a pack a day is around: 1.5% 15% 30% 50%
• The percentage who smoked at least 3 packs a day is around: 0.25 of 1% 0.5 of 1% 10%
• The percentage who smoked 20 cigarettes a day is around: 0.35 of 1% 0.5 of 1% 1.5% 3.5% 10%
Answers:
• At least a pack a day but less than two packs: this is the area of the third block, 1.5 × (40 − 20) = 1.5 × 20 = 30%.
• At least a pack a day: this is the area of the third and fourth blocks, 30 + 0.5 × 40 = 50%.
• At least 3 packs a day: this is the area of the block for 60 or more cigarettes, which is half of the fourth block, 0.5 × 20 = 10%.
• Exactly 20 cigarettes a day: by the left-endpoint convention, 20 belongs to the third block, so the answer is 1.5%.
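The arithmetic in these answers is area = density × width. A minimal sketch follows, using only the two blocks whose endpoints and densities can be read off the answers; the helper `area` is a hypothetical name introduced for illustration.

```python
# Blocks read off the answers above: (left, right, density in % per cigarette)
third_block = (20, 40, 1.5)
fourth_block = (40, 80, 0.5)

def area(block, lo=None, hi=None):
    """Percent of subjects in [lo, hi) within one block (density x width)."""
    left, right, density = block
    lo = left if lo is None else max(lo, left)
    hi = right if hi is None else min(hi, right)
    return density * max(hi - lo, 0)

one_to_two_packs = area(third_block)                        # 1.5 x 20
at_least_one_pack = area(third_block) + area(fourth_block)  # 30 + 0.5 x 40
at_least_three_packs = area(fourth_block, lo=60)            # 0.5 x 20
exactly_20 = area(third_block, lo=20, hi=21)                # 1.5 x 1

print(one_to_two_packs, at_least_one_pack, at_least_three_packs, exactly_20)
# 30.0 50.0 10.0 1.5
```

Treating "exactly 20 cigarettes" as the interval [20, 21) reflects that cigarette counts are whole numbers, so a single value occupies a strip of width 1.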
Using histograms for comparisons Fuel economy for model year 2001 compact and two-seater cars (Table 1.8, p. 38). [Figure: histograms in two panels, city consumption and highway consumption]
Stemplot (or Stem-and-Leaf Plot) Represents data by separating each value into two parts: the stem (the leftmost digits) and the leaf (the rightmost digit). Example: a data value of 147 would have 14 as the stem and 7 as the leaf.
Advantage of Stem-and-Leaf Diagrams over Histograms Once a frequency distribution or histogram of continuous data is created, the raw data are lost (unless reported with the frequency distribution). In a stem-and-leaf plot, by contrast, the raw data can be retrieved from the plot itself.
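A small sketch of a stem-and-leaf plot, including the step that recovers the raw data from it. The data values are hypothetical, and stems with no observations are simply skipped in this version.

```python
from collections import defaultdict

# Hypothetical data values (illustrative only)
data = [147, 152, 138, 149, 155, 131, 147, 160]

# Split each value into stem (leftmost digits) and leaf (rightmost digit)
stems = defaultdict(list)
for value in sorted(data):
    stem, leaf = divmod(value, 10)  # 147 -> stem 14, leaf 7
    stems[stem].append(leaf)

# One line per stem, with its leaves in increasing order
lines = [f"{stem} | {''.join(str(leaf) for leaf in leaves)}"
         for stem, leaves in sorted(stems.items())]
print("\n".join(lines))
# 13 | 18
# 14 | 779
# 15 | 25
# 16 | 0

# The raw data can be rebuilt from the plot: value = 10 x stem + leaf
recovered = [10 * stem + leaf
             for stem, leaves in sorted(stems.items())
             for leaf in leaves]
print(recovered == sorted(data))  # True
```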
Dot plots A dot plot is drawn by placing each observation horizontally in increasing order and placing a dot above the observation each time it is observed.
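The construction can be sketched as a text plot, with one `*` stacked above a value for each time it occurs. The observations are hypothetical, and the column alignment assumes single-digit values.

```python
from collections import Counter

# Hypothetical observations (illustrative only; single-digit values
# keep the text columns aligned)
data = [2, 3, 3, 4, 4, 4, 5, 7]

counts = Counter(data)
values = range(min(data), max(data) + 1)  # horizontal axis, increasing order

# Build the plot from the top row down: a dot at height `level` appears
# above every value observed at least `level` times
rows = []
for level in range(max(counts.values()), 0, -1):
    row = " ".join("*" if counts[v] >= level else " " for v in values)
    rows.append(row.rstrip())
rows.append(" ".join(str(v) for v in values))  # axis labels

print("\n".join(rows))
```

Values that never occur (6 in this sample) still get a position on the axis, so gaps in the data remain visible.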