Chapter 2 ~ Descriptive Analysis & Presentation of Single-Variable Data
Black Bears
Mean: 60.07 inches
Median: 62.50 inches
Range: 42 inches
Variance: 117.681
Standard deviation: 10.85 inches
Minimum: 36 inches
Maximum: 78 inches
First quartile: 51.63 inches
Third quartile: 67.38 inches
Count: 58 bears
Sum: 3438.1 inches
[Histogram of the black bear lengths. Horizontal axis: Length in Inches (30 to 80); vertical axis: Frequency (0 to 20)]
Circle Graphs and Bar Graphs: Graphs that are used to summarize attribute data
Automobiles Sold Last Week
Day          Number Sold
Monday       15
Tuesday      23
Wednesday    35
Thursday     11
Friday       12
Saturday     42
Daily Defect Inspection Report
Defect     Number
Dent       5
Stain      12
Blemish    43
Chip       25
Scratch    40
Others     10
1) Construct a Pareto diagram for this defect report. Management has given the cabinet production line the goal of reducing their defects by 50%.
2) What two defects should they give special attention to in working toward this goal?
1) [Pareto diagram of the defect report. Vertical axes: Count and Percent; bars ordered from most to least frequent defect]
Defect:    Blemish   Scratch   Chip    Stain   Others   Dent
Count      43        40        25      12      10       5
Percent    31.9      29.6      18.5    8.9     7.4      3.7
Cum%       31.9      61.5      80.0    88.9    96.3     100.0
2) The production line should try to eliminate blemishes and scratches. This would cut defects by more than 50%.
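The Pareto table above can be reproduced with a short script. This is a minimal sketch, assuming the six defect counts from the inspection report; the function name `pareto_table` is illustrative and not part of the original slides.

```python
# Defect counts copied from the Daily Defect Inspection Report.
defects = {"Dent": 5, "Stain": 12, "Blemish": 43, "Chip": 25, "Scratch": 40, "Others": 10}

def pareto_table(counts):
    """Return rows (category, count, percent, cumulative percent), largest count first."""
    total = sum(counts.values())
    rows = []
    running = 0
    for name, count in sorted(counts.items(), key=lambda item: item[1], reverse=True):
        running += count
        rows.append((name, count,
                     round(100 * count / total, 1),
                     round(100 * running / total, 1)))
    return rows

rows = pareto_table(defects)
for name, count, pct, cum in rows:
    print(f"{name:<8} {count:>5} {pct:>7.1f} {cum:>7.1f}")
```

The cumulative percent after the first two rows is what answers question 2: blemishes and scratches together account for more than half of all defects.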
Quantitative Data: One reason for constructing a graph of quantitative data is to examine the distribution: is the data compact, spread out, skewed, symmetric, etc.?
Distribution: The pattern of variability displayed by the data of a variable. The distribution displays the frequency of each value of the variable.
Dotplot Display: Displays the data of a sample by representing each piece of data with a dot positioned along a scale. This scale can be either horizontal or vertical. The frequency of the values is represented along the other scale.
2.5   8.9   12.2  4.1   18.1  1.6   12.2  16.9  2.5   3.5
0.4   2.6   2.2   4.0   4.5   6.4   2.9   3.3   4.4   9.2
4.1   0.9   14.5  4.0   0.9   7.2   5.2   1.8   1.5   0.7
3.7   4.2   6.9   15.3  21.8  17.8  7.3   6.8   3.3   7.0
4.0   18.3  8.5   1.4   7.4   4.7   0.7   10.4  3.6
The figure below is a dotplot for the 50 lifetimes:
[Dotplot of the lifetimes. Horizontal scale: 0.0, 4.0, 8.0, 12.0, 16.0, 20.0]
Note: Notice how the data is “bunched” near the lower extreme and more “spread out” near the higher extreme.
Stem-and-Leaf Display: Pictures the data of a sample using the actual digits that make up the data values. Each numerical value is divided into two parts: the leading digit(s) become the stem, and the trailing digit(s) become the leaf. The stems are located along the main axis, and a leaf for each piece of data is located so as to display the distribution of the data.
41 31 33 35 36 37 39 49
33 19 26 27 24 32 40
39 16 55 38 36
Solution:
All the speeds are in the 10s, 20s, 30s, 40s, and 50s. Use the first digit of each speed as the stem and the second digit as the leaf. Draw a vertical line and list the stems, in order to the left of the line. Place each leaf on its stem: place the trailing digit on the right side of the vertical line opposite its corresponding leading digit.
20 Speeds
1 | 6 9
2 | 4 6 7
3 | 1 2 3 3 5 6 6 7 8 9 9
4 | 0 1 9
5 | 5
Note: The display could be constructed so that only five possible values (instead of ten) could fall in each stem. What would the stems look like? Would there be a difference in appearance?
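The construction described in the solution can be sketched in a few lines. This is only an illustration, assuming two-digit values so that the tens digit is the stem and the units digit is the leaf; `stem_and_leaf` is a hypothetical helper name.

```python
from collections import defaultdict

# The 20 speeds listed above.
speeds = [41, 31, 33, 35, 36, 37, 39, 49,
          33, 19, 26, 27, 24, 32, 40,
          39, 16, 55, 38, 36]

def stem_and_leaf(values):
    """Map each stem (tens digit) to its sorted list of leaves (units digits)."""
    stems = defaultdict(list)
    for v in sorted(values):
        stems[v // 10].append(v % 10)
    return dict(stems)

display = stem_and_leaf(speeds)
for stem in sorted(display):
    print(f"{stem} | {' '.join(str(leaf) for leaf in display[stem])}")
```

Splitting each stem into two rows (leaves 0 to 4 and leaves 5 to 9) would give the five-values-per-stem variant the note asks about.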
1. It is fairly typical of many variables to display a distribution that is concentrated (mounded) about a central value and then in some manner be dispersed in both directions. (Why?)
2. A display that indicates two “mounds” may really be two overlapping distributions
3. A back-to-back stem-and-leaf display makes it possible to compare two distributions graphically
4. A side-by-side dotplot is also useful for comparing two distributions
Frequency Distribution: A listing, often expressed in chart form, that pairs each value of a variable with its frequency
Ungrouped Frequency Distribution: Each value of x in the distribution stands alone
Grouped Frequency Distribution: Group the values into a set of classes
1. A table that summarizes data by classes, or class intervals
2. In a typical grouped frequency distribution, there are usually 5 to 12 classes of equal width
3. The table may contain columns for class number, class interval, tally (if constructing by hand), frequency, relative frequency, cumulative relative frequency, and class midpoint
4. In an ungrouped frequency distribution each class consists of a single value
Guidelines for constructing a frequency distribution:
1. All classes should be of the same width
2. Classes should be set up so that they do not overlap and so that each piece of data belongs to exactly one class
3. For problems in the text, 5 to 12 classes are most desirable. The square root of n is a reasonable guideline for the number of classes if n is less than 150.
4. Use a system that takes advantage of a number pattern, to guarantee accuracy
5. If possible, an even class width is often advantageous
Procedure for constructing a frequency distribution:
1. Identify the high (H) and low (L) scores. Find the range. Range = H - L
2. Select a number of classes and a class width so that the product is a bit larger than the range
3. Pick a starting point a little smaller than L. Count from L by the width to obtain the class boundaries. Observations that fall on class boundaries are placed into the class interval to the right.
1) Construct a grouped frequency distribution using the classes 3.7 - <4.7, 4.7 - <5.7, 5.7 - <6.7, etc.
2) Which class has the highest frequency?
6.5 5.0 5.6 7.6 4.8 8.0 7.5 7.9 8.0 9.2
6.4 6.0 5.6 6.0 5.7 9.2 8.1 8.0 6.5 6.6
5.0 8.0 6.5 6.1 6.4 6.6 7.2 5.9 4.0 5.7
7.9 6.0 5.6 6.0 6.2 7.7 6.7 7.7 8.2 9.0
1)
Class          Frequency   Relative    Cumulative       Class
Boundaries     f           Frequency   Rel. Frequency   Midpoint, x
3.7 - <4.7     1           0.025       0.025            4.2
4.7 - <5.7     6           0.150       0.175            5.2
5.7 - <6.7     16          0.400       0.575            6.2
6.7 - <7.7     4           0.100       0.675            7.2
7.7 - <8.7     10          0.250       0.925            8.2
8.7 - <9.7     3           0.075       1.000            9.2
2) The class 5.7 - <6.7 has the highest frequency. The frequency is 16 and the relative frequency is 0.40
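The tallying procedure can be checked with a short script. This is a sketch under two assumptions: the data have one decimal place (so the work can be done in integer tenths, which keeps boundary values such as 5.7 exact), and every value lies inside the classes requested. The function name `grouped_frequency` is illustrative.

```python
# The 40 data values listed above.
data = [
    6.5, 5.0, 5.6, 7.6, 4.8, 8.0, 7.5, 7.9, 8.0, 9.2,
    6.4, 6.0, 5.6, 6.0, 5.7, 9.2, 8.1, 8.0, 6.5, 6.6,
    5.0, 8.0, 6.5, 6.1, 6.4, 6.6, 7.2, 5.9, 4.0, 5.7,
    7.9, 6.0, 5.6, 6.0, 6.2, 7.7, 6.7, 7.7, 8.2, 9.0,
]

def grouped_frequency(values, start, width, n_classes):
    """Return rows (lower, upper, f, rel. freq, cum. rel. freq, midpoint)."""
    lo = round(start * 10)   # class boundaries in integer tenths
    w = round(width * 10)
    counts = [0] * n_classes
    for v in values:
        # A value on a boundary falls in the class to its right.
        counts[(round(v * 10) - lo) // w] += 1
    n = len(values)
    rows, running = [], 0
    for i, f in enumerate(counts):
        running += f
        lower, upper = lo + i * w, lo + (i + 1) * w
        rows.append((lower / 10, upper / 10, f, f / n, running / n, (lower + upper) / 20))
    return rows

table = grouped_frequency(data, start=3.7, width=1.0, n_classes=6)
for lower, upper, f, rel, cum, mid in table:
    print(f"{lower:.1f} - <{upper:.1f}  {f:>3}  {rel:.3f}  {cum:.3f}  {mid:.1f}")
```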
Histogram: A bar graph representing a frequency distribution of a quantitative variable. A histogram is made up of the following components:
1. A title, which identifies the population of interest
2. A vertical scale, which identifies the frequencies in the various classes
3. A horizontal scale, which identifies the variable x. Values for the class boundaries or class midpoints may be labeled along the x-axis. Use whichever method of labeling the axis best presents the variable.
[Histogram: The Hemoglobin Test. Horizontal axis: Blood Test, labeled at the class midpoints 4.2, 5.2, 6.2, 7.2, 8.2, 9.2; vertical axis: Frequency (0 to 15)]
Age            Frequency   Class Midpoint
20 up to 30    34          25
30 up to 40    58          35
40 up to 50    76          45
50 up to 60    187         55
60 up to 70    254         65
70 up to 80    241         75
80 up to 90    147         85
Symmetrical: Both sides of the distribution are identical mirror images. There is a line of symmetry.
Uniform (Rectangular): Every value appears with equal frequency
Skewed: One tail is stretched out longer than the other. The direction of skewness is on the side of the longer tail. (Positively skewed vs. negatively skewed)
J-Shaped: There is no tail on the side of the class with the highest frequency
Bimodal: The two largest classes are separated by one or more classes. Often implies two populations are sampled.
Normal: A symmetrical distribution that is mounded about the mean and becomes sparse at the extremes
Cumulative Frequency Distribution: A frequency distribution that pairs cumulative frequencies with values of the variable
Class                      Relative    Cumulative   Cumulative
Boundaries     Frequency   Frequency   Frequency    Rel. Frequency
0 up to 4      4           0.08        4            0.08
4 up to 8      8           0.16        12           0.24
8 up to 12     8           0.16        20           0.40
12 up to 16    20          0.40        40           0.80
16 up to 20    6           0.12        46           0.92
20 up to 24    3           0.06        49           0.98
24 up to 28    1           0.02        50           1.00
Ogive: A line graph of a cumulative frequency or cumulative relative frequency distribution. An ogive has the following components:
1. A title, which identifies the population or sample
2. A vertical scale, which identifies either the cumulative frequencies or the cumulative relative frequencies
3. A horizontal scale, which identifies the upper class boundaries. Until the upper boundary of a class has been reached, you cannot be sure you have accumulated all the data in the class. Therefore, the horizontal scale for an ogive is always based on the upper class boundaries.
Note: Every ogive starts on the left with a relative frequency of zero at the lower class boundary of the first class and ends on the right with a relative frequency of 100% at the upper class boundary of the last class.
[Ogive: Computer Science Aptitude Test. Horizontal axis: Test Score (0 to 28, at the class boundaries); vertical axis: Cumulative Relative Frequency (0.0 to 1.0)]
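The cumulative columns and the points an ogive would pass through can be generated from the class frequencies. A minimal sketch, assuming the class boundaries and frequencies from the cumulative frequency table above; the variable names are illustrative.

```python
from itertools import accumulate

# Class boundaries (0 up to 4, 4 up to 8, ...) and class frequencies.
boundaries = [0, 4, 8, 12, 16, 20, 24, 28]
frequencies = [4, 8, 8, 20, 6, 3, 1]

n = sum(frequencies)
cumulative = list(accumulate(frequencies))     # running totals
cumulative_rel = [c / n for c in cumulative]   # running proportions

# The ogive starts at zero at the lower boundary of the first class and is
# then plotted at the upper boundary of each class.
ogive_points = [(boundaries[0], 0.0)] + list(zip(boundaries[1:], cumulative_rel))

for score, cum in ogive_points:
    print(f"{score:>2}  {cum:.2f}")
```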
Mean: The type of average with which you are probably most familiar. The mean is the sum of all the values divided by the total number of values, n:
$$\bar{x} = \frac{1}{n}\sum x_i = \frac{1}{n}(x_1 + x_2 + \cdots + x_n)$$
Solution:
$$\bar{x} = \frac{1}{8}(8 + 9 + 3 + 5 + 2 + 6 + 4 + 5) = 5.25$$
Solution:
$$\bar{x} = \frac{1}{8}(8 + 9 + 3 + 5 + 2 + 26 + 4 + 5) = 7.75$$
Note: The mean can be greatly influenced by outliers
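The effect of a single outlier on the mean is easy to confirm numerically. A small sketch using the two data sets from the solutions above, where the only change is that the value 6 becomes 26:

```python
def mean(values):
    """Sum of the values divided by the number of values."""
    return sum(values) / len(values)

original = [8, 9, 3, 5, 2, 6, 4, 5]
with_outlier = [8, 9, 3, 5, 2, 26, 4, 5]   # 6 replaced by 26

print(mean(original))       # 5.25
print(mean(with_outlier))   # 7.75
```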
Median: The value of the data that occupies the middle position when the data are ranked in order according to size
To find the median:
1. Rank the data
2. Determine the depth of the median: $d(\tilde{x}) = \frac{n + 1}{2}$
3. Determine the value of the median
Suppose the data set is {4, 8, 3, 8, 2, 9, 2, 11, 3}:
Solution:
1. Rank the data: 2, 2, 3, 3, 4, 8, 8, 9, 11
2. Find the depth: $d(\tilde{x}) = \frac{9 + 1}{2} = 5$
3. The median is the fifth number from either end in the ranked data: $\tilde{x} = 4$
Suppose the data set is {4, 8, 3, 8, 2, 9, 2, 11, 3, 15}:
1. Rank the data: 2, 2, 3, 3, 4, 8, 8, 9, 11, 15
2. Find the depth: $d(\tilde{x}) = \frac{10 + 1}{2} = 5.5$
3. The median is halfway between the fifth and sixth observations: $\tilde{x} = \frac{4 + 8}{2} = 6$
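The rank, depth, value procedure can be written directly as code. This is a sketch that assumes the depth is (n + 1)/2, which is consistent with the two worked examples (nine values give the fifth position, ten values give a position halfway between the fifth and sixth).

```python
def median(values):
    """Median found by ranking the data and locating the depth (n + 1) / 2."""
    ranked = sorted(values)
    n = len(ranked)
    depth = (n + 1) / 2
    if depth.is_integer():
        # Odd n: the median is the value at that depth (1-based position).
        return ranked[int(depth) - 1]
    # Even n: the median is halfway between the two middle values.
    lower = ranked[int(depth) - 1]
    upper = ranked[int(depth)]
    return (lower + upper) / 2

odd_set = [4, 8, 3, 8, 2, 9, 2, 11, 3]
even_set = [4, 8, 3, 8, 2, 9, 2, 11, 3, 15]
print(median(odd_set), median(even_set))
```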
Mode: The mode is the value of x that occurs most frequently
Note: If two or more values in a sample are tied for the highest frequency (number of occurrences), there is no mode
Midrange: The number exactly midway between the lowest data value L and the highest data value H. It is found by averaging the low and the high values:
$$\text{midrange} = \frac{L + H}{2} = \frac{12.7 + 44.2}{2} = 28.45$$
Range: The difference in value between the highest-valued (H) and the lowest-valued (L) pieces of data:
$$\text{range} = H - L$$
Deviation from the Mean: A deviation from the mean, $x - \bar{x}$, is the difference between the value of x and the mean
Data    Deviation from Mean
12      -5
23      6
17      0
15      -2
18      1
Note: $\sum (x - \bar{x}) = 0$ (Always!)
Mean Absolute Deviation: The mean of the absolute values of the deviations from the mean:
$$\text{mean absolute deviation} = \frac{1}{n}\sum |x - \bar{x}|$$
For the previous example:
$$\frac{1}{n}\sum |x - \bar{x}| = \frac{1}{5}(5 + 6 + 0 + 2 + 1) = \frac{14}{5} = 2.8$$
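A quick numerical check of the deviations and the mean absolute deviation, sketched for the five data values 12, 23, 17, 15, 18 used in the example:

```python
data = [12, 23, 17, 15, 18]

mean = sum(data) / len(data)
deviations = [x - mean for x in data]

# Deviations from the mean always sum to zero, so their plain average is
# useless as a measure of spread; averaging the absolute values avoids this.
total_deviation = sum(deviations)
mean_abs_deviation = sum(abs(d) for d in deviations) / len(data)

print(deviations)
print(total_deviation, mean_abs_deviation)
```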
Sample Variance: The sample variance, $s^2$, is the mean of the squared deviations, calculated using n - 1 as the divisor:
$$s^2 = \frac{1}{n - 1}\sum (x - \bar{x})^2$$
where n is the sample size
Note: The numerator for the sample variance is called the sum of squares for x, denoted SS(x):
$$SS(x) = \sum (x - \bar{x})^2 = \sum x^2 - \frac{1}{n}\left(\sum x\right)^2$$
Standard Deviation: The standard deviation of a sample, s, is the positive square root of the variance:
$$s = \sqrt{s^2}$$
Solutions:
First:
$$\bar{x} = \frac{1}{5}(5 + 7 + 1 + 3 + 8) = 4.8$$
x       x - x̄     (x - x̄)²
5       0.2       0.04
7       2.2       4.84
1       -3.8      14.44
3       -1.8      3.24
8       3.2       10.24
Sum:    24        0         32.80
1) $s^2 = \frac{1}{4}(32.8) = 8.2$
2) $s = \sqrt{8.2} = 2.86$
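Both forms of the sum of squares can be compared numerically. A minimal sketch for the data 5, 7, 1, 3, 8 from the solution above; floating-point rounding means the two results agree only to within a tiny tolerance, so they are rounded for display.

```python
import math

data = [5, 7, 1, 3, 8]
n = len(data)
mean = sum(data) / n

# Definition: sum of squared deviations from the mean.
ss_definition = sum((x - mean) ** 2 for x in data)
# Shortcut form: sum of x squared minus (sum of x) squared over n.
ss_shortcut = sum(x * x for x in data) - sum(data) ** 2 / n

variance = ss_definition / (n - 1)   # divisor n - 1 for a sample
std_dev = math.sqrt(variance)

print(round(ss_definition, 2), round(ss_shortcut, 2))
print(round(variance, 2), round(std_dev, 2))
```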
1. In an ungrouped frequency distribution, use the frequency of occurrence, f, of each observation
2. In a grouped frequency distribution, use the frequency of occurrence associated with each class midpoint:
$$\bar{x} = \frac{\sum xf}{\sum f} \qquad s^2 = \frac{\sum x^2 f - \frac{\left(\sum xf\right)^2}{\sum f}}{\sum f - 1}$$
Solutions:
First:
x       f     xf    x²f
0       15    0     0
1       17    17    17
2       23    46    92
4       5     20    80
5       2     10    50
Sum:    62    93    239
Input the class midpoints or data values into L1 and the frequencies into L2; then continue with
Highlight: L3
Enter: L3 = L1*L2
Highlight: L4
Enter: L4 = L1*L3
Highlight: L5(1) (first position in L5 column)
Enter: L5(1) = sum(L2) (Σf)
To find sum use 2nd “List”>Math>5:sum(
L5(2) = sum(L3) (Σxf)
L5(3) = sum(L4) (Σx²f)
L5(4) = L5(2)/L5(1) to find the mean
L5(5) = (L5(3) - (L5(2))²/L5(1))/(L5(1) - 1) to find the variance
L5(6) = 2nd √ (L5(5)) to find the standard deviation
Let’s work problem 2.108 as an example!
Class Boundaries f
2 – 6 7
6 – 10 15
10 – 14 22
14 – 18 14
18 – 22 2
Step 1: Enter the midpoints into L1
Step 2: Enter the frequencies into L2
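For readers without the calculator, the same list arithmetic can be sketched in Python. The comments map each line to the calculator list it mirrors; the variable names are illustrative, and the class table is the one given for problem 2.108 above.

```python
import math

# Class boundaries and frequencies from the problem 2.108 table.
classes = [((2, 6), 7), ((6, 10), 15), ((10, 14), 22), ((14, 18), 14), ((18, 22), 2)]

midpoints = [(low + high) / 2 for (low, high), _ in classes]   # L1
freqs = [f for _, f in classes]                                # L2
xf = [x * f for x, f in zip(midpoints, freqs)]                 # L3 = L1*L2
x2f = [x * v for x, v in zip(midpoints, xf)]                   # L4 = L1*L3

sum_f = sum(freqs)     # L5(1)
sum_xf = sum(xf)       # L5(2)
sum_x2f = sum(x2f)     # L5(3)

mean = sum_xf / sum_f                                          # L5(4)
variance = (sum_x2f - sum_xf ** 2 / sum_f) / (sum_f - 1)       # L5(5)
std_dev = math.sqrt(variance)                                  # L5(6)

print(sum_f, sum_xf, sum_x2f)
print(round(mean, 2), round(variance, 2), round(std_dev, 2))
```

Because the midpoints stand in for every value in a class, these results are approximations to the statistics of the underlying raw data.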
s² = ∑(x - x̄)²/(n - 1) = 34/4 = 8.5
s = √s² = √8.5 = 2.9
Quartiles: Values of the variable that divide the ranked data into quarters; each set of data has three quartiles
1. The first quartile, Q1, is a number such that at most 25% of the data are smaller in value than Q1 and at most 75% are larger
2. The second quartile, Q2, is the median
3. The third quartile, Q3, is a number such that at most 75% of the data are smaller in value than Q3 and at most 25% are larger
[Figure: ranked data, increasing order]
Percentiles: Values of the variable that divide a set of ranked data into 100 equal subsets; each set of data has 99 percentiles. The kth percentile, Pk, is a value such that at most k% of the data is smaller in value than Pk and at most (100 - k)% of the data is larger.
Note: $\tilde{x} = Q_2 = P_{50}$
To find Pk:
1. Rank the n observations, lowest to highest
2. Compute A = (nk)/100
3. If A is an integer: the depth of Pk is A + 0.5, so Pk is halfway between the values in positions A and A + 1
If A is a fraction: the depth of Pk is the next larger integer
Solutions:
1) k = 25: (20) (25) / 100 = 5, depth = 5.5, Q1 = 6
2) k = 75: (20) (75) / 100 = 15, depth = 15.5, Q3 = 6.95
3) k = 37: (20) (37) / 100 = 7.4, depth = 8, P37 = 6.2
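The percentile procedure can be sketched as a function. The rule used here (depth A + 0.5 when A is an integer, otherwise the next larger integer) is inferred from the depths in the solutions above. The 20-value data set behind those solutions is not reproduced in this section, so the example uses a hypothetical sample of the integers 1 to 20, for which each percentile equals its own depth.

```python
def percentile(values, k):
    """k-th percentile (k an integer from 1 to 99) using the depth rule above."""
    ranked = sorted(values)
    n = len(ranked)
    if (n * k) % 100 == 0:
        # A = nk/100 is an integer: average the values at positions A and A + 1.
        a = n * k // 100
        return (ranked[a - 1] + ranked[a]) / 2
    # A is a fraction: take the value at the next larger integer position.
    b = n * k // 100 + 1
    return ranked[b - 1]

sample = list(range(1, 21))   # hypothetical data, n = 20

print(percentile(sample, 25))   # depth 5.5
print(percentile(sample, 75))   # depth 15.5
print(percentile(sample, 37))   # depth 8
```

Other textbooks and software packages use different interpolation rules, so quartiles computed elsewhere may differ slightly from this method.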
Midquartile: The numerical value midway between the first and third quartile:
$$\text{midquartile} = \frac{Q_1 + Q_3}{2} = \frac{6 + 6.95}{2} = \frac{12.95}{2} = 6.475$$
Note: The mean, median, midrange, and midquartile are all measures of central tendency. They are not necessarily equal. Can you think of an example when they would be the same value?
5-Number Summary: The 5-number summary is composed of:
1. L, the smallest value in the data set
2. Q1, the first quartile (also P25)
3. $\tilde{x}$, the median (also P50 and 2nd quartile)
4. Q3, the third quartile (also P75)
5. H, the largest value in the data set
Box-and-Whisker Display: A graphic representation of the 5-number summary
Solution:
63 64 76 76 81 83 85 86 88 89 90 91 92 93 93 93 94 97 99 99 99 101 108 109 112
z-Score: The position a particular value of x has relative to the mean, measured in standard deviations. The z-score is found by the formula:
$$z = \frac{x - \bar{x}}{s}$$
Solutions:
$$z = \frac{x - \bar{x}}{s} = \frac{33 - 35.6}{7.1} = -0.37$$
33 is 0.37 standard deviations below the mean.
$$z = \frac{x - \bar{x}}{s} = \frac{46 - 35.6}{7.1} = 1.46$$
46 is 1.46 standard deviations above the mean
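The two z-scores can be verified with a one-line function. A sketch assuming the mean of 35.6 and standard deviation of 7.1 that appear in the worked solution:

```python
def z_score(x, mean, s):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / s

z_low = z_score(33, 35.6, 7.1)
z_high = z_score(46, 35.6, 7.1)
print(round(z_low, 2), round(z_high, 2))
```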
Empirical Rule: If a variable is normally distributed, then:
1. Approximately 68% of the observations lie within 1 standard deviation of the mean
2. Approximately 95% of the observations lie within 2 standard deviations of the mean
3. Approximately 99.7% of the observations lie within 3 standard deviations of the mean
1) What percentage of weights fall between 5.7 and 7.3?
2) What percentage of weights fall above 7.7?
Solutions:
1) $(\bar{x} - 2s, \bar{x} + 2s) = (6.5 - 2(0.4), 6.5 + 2(0.4)) = (5.7, 7.3)$
Approximately 95% of the weights fall between 5.7 and 7.3
2) $(\bar{x} - 3s, \bar{x} + 3s) = (6.5 - 3(0.4), 6.5 + 3(0.4)) = (5.3, 7.7)$
Approximately 99.7% of the weights fall between 5.3 and 7.7
Approximately 0.3% of the weights fall outside (5.3, 7.7)
Approximately (0.3/2) = 0.15% of the weights fall above 7.7
Note: The empirical rule may be used to determine whether or not a set of data is approximately normally distributed:
1. Find the mean and standard deviation for the data
2. Compute the actual proportion of data within 1, 2, and 3 standard deviations from the mean
3. Compare these actual proportions with those given by the empirical rule
4. If the proportions found are reasonably close to those of the empirical rule, then the data is approximately normally distributed
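The comparison in steps 1 to 3 can be sketched as follows. The function name `proportions_within` and the tiny sample are hypothetical, chosen only to show the mechanics; a real check needs a much larger data set, and how close counts as "reasonably close" in step 4 remains a judgment call.

```python
import statistics

def proportions_within(values, max_k=3):
    """Proportion of the data within k sample standard deviations of the mean, k = 1..max_k."""
    mean = statistics.mean(values)
    s = statistics.stdev(values)   # sample standard deviation (divisor n - 1)
    n = len(values)
    return {k: sum(1 for x in values if abs(x - mean) <= k * s) / n
            for k in range(1, max_k + 1)}

sample = [1, 2, 3, 4, 5]   # hypothetical data
observed = proportions_within(sample)
expected = {1: 0.68, 2: 0.95, 3: 0.997}   # empirical rule

for k in observed:
    print(k, observed[k], expected[k])
```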
Chebyshev’s Theorem: The proportion of any distribution that lies within k standard deviations of the mean is at least $1 - \frac{1}{k^2}$, where k is any positive number larger than 1. This theorem applies to all distributions of data.
Solutions:
Using k=2: At least 75% of the observations lie within 2 standard deviations of the mean: $1 - \frac{1}{2^2} = 1 - \frac{1}{4} = \frac{3}{4} = 0.75$
Using k=3: At least 89% of the observations lie within 3 standard deviations of the mean: $1 - \frac{1}{3^2} = 1 - \frac{1}{9} = \frac{8}{9} \approx 0.89$
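Chebyshev's bound is simple to tabulate for any k. A short sketch; the function name is illustrative:

```python
def chebyshev_lower_bound(k):
    """Minimum proportion of any distribution within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("Chebyshev's theorem requires k > 1")
    return 1 - 1 / k ** 2

for k in (2, 3):
    print(k, round(chebyshev_lower_bound(k), 2))
```

These bounds (75% and about 89%) are weaker than the empirical rule's 95% and 99.7% because they hold for every distribution, not only for normal ones.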
Good Arithmetic, Bad Statistics
Misleading Graphs
Insufficient Information
Misleading graphs:
1. The frequency scale should start at zero to present a complete picture. Graphs that do not start at zero are used to save space.
2. Graphs that start at zero emphasize the size of the numbers involved
3. Graphs that are chopped off emphasize variation