Chapter 4
Download
1 / 86

- PowerPoint PPT Presentation


  • 99 Views
  • Updated On :

Chapter 4. Univariate Data – Graphical / Numeric Categorical Data Numerical Data Bivariate Data. Summarizing & Exploring Data. Basic Terms.

loader
I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
capcha
Download Presentation

PowerPoint Slideshow about '' - tori


An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript
Chapter 4 l.jpg

Chapter 4

  • Univariate Data – Graphical / Numeric

    • Categorical Data

    • Numerical Data

  • Bivariate Data

Summarizing & Exploring Data


Basic terms l.jpg
Basic Terms

  • A frequency distribution for categorical data is a table that displays the possible categories along with the associated frequencies or relative frequencies.

  • The frequency for a particular category is the number of times the category appears in the data set.


Basic terms3 l.jpg
Basic Terms

  • The relative frequency for a particular category is the fraction or proportion of the time that the category appears in the data set. It is calculated as

  • When the table includes relative frequencies, it is sometimes referred to as a relative frequency distribution.


Classroom data l.jpg
Classroom Data

This slide along with the next contains a data set obtained from a large section of students taking Math320 in the Spring of 1999 and will be utilized throughout this slide show in the examples.


Classroom data continued l.jpg
Classroom Datacontinued


Frequency distribution example l.jpg
Frequency Distribution Example

The data in the column labeled vision is the answer to the question, “What is your principle means of correcting your vision?” The results are tabulated below


Bar chart procedure l.jpg
Bar Chart – Procedure

  • Draw a horizontal line, and write the category names or labels below the line at regularly spaced intervals.

  • Draw a vertical line, and label the scale using either frequency (or relative frequency).

  • Place a rectangular bar above each category label. The height is the frequency (or relative frequency) and all bars have the same width.





Comparative bar charts example l.jpg
Comparative Bar Charts - Example

  • Consider the vision correction for each of the two genders. We then have the following contingencytable.


Comparative bar charts example12 l.jpg
Comparative Bar Charts – Example

  • To compare the vision correction for each gender, we should use relative frequencies because the groups are not the same size. So the following table of relative frequencies is used to draw the comparative bar chart.



Pie charts procedure l.jpg
*Pie Charts - Procedure

  • Draw a circle to represent the entire data set.

  • For each category, calculate the “slice” size.

  • Slice size = category relative frequency x 360

  • Draw a slice of appropriate size for each category.


Pie chart example l.jpg
*Pie Chart - Example

  • Using the vision correction data we have:


Pie chart another example l.jpg
*Pie Chart – Another Example

  • Using the grade data we have:


Dotplots procedure l.jpg
*Dotplots - Procedure

  • Draw a horizontal line and mark it with an appropriate measurement scale.

  • Locate each value in the data set along the measurement, and represent it by a dot. If there are two or more observations with the same value, stack the dots vertically.


Dotplots example l.jpg
*Dotplots - Example

Using the weights of the 79 students

To compare the weights of the males and females we put the dotplots on top of each other, using the same scales.


Stem and leaf l.jpg
Stem and Leaf

A quick technique for picturing the distributional pattern associated with numerical data is to create a picture called a stem-and-leaf diagram (Commonly called a stem plot).

  • We want to break up the data into a reasonable number of groups.

  • Looking at the range of the data, we choose the stems (one or more of the leading digits) to get the desired number of groups.

  • The next digits (or digit) after the stem become(s) the leaf.

  • Typically, we truncate (leave off) the remaining digits.


Stem and leaf20 l.jpg

10

11

12

13

14

15

16

17

18

19

20

3

3

154504

90050

000

05700

0

0

5

0

Stem and Leaf

For our first example, we use the weights of the 25 female students.

150 140 155 195 139

200 157 130 113 130

121 140 140 150 125

135 124 130 150 125

120 103 170 124 160

Choosing the 1st two digits as the stem and the 3rd digit as the leaf we have the following


Stem and leaf21 l.jpg

10

11

12

13

14

15

16

17

18

19

20

3

3

014455

00059

000

00057

0

0

5

0

Probable outliers

Stem and Leaf

Typically we sort the order the stems in increasing order.

We also note on the diagram the units for stems and leaves

Stem: Tens and hundreds digits

Leaf: Ones digit


Stem and leaf gpa example l.jpg
Stem-and-leaf – GPA example

The following are the GPAs for the 20 advisees of a faculty member.

GPA

3.09 2.04 2.27 3.94 3.70 2.69 3.72 3.23

3.13 3.50 2.26 3.15 2.80 1.75 3.89 3.38

2.74 1.65 2.22 2.66

If the ones digit is used as the stem, you only get three groups. You can expand this a little by breaking up the stems by using each stem twice letting the 2nd digits 0-4 go with the first and the 2nd digits 5-9 with the second.

The next slide gives two versions of the stem-and-leaf diagram.


Stem and leaf gpa example23 l.jpg

1L

1H

2L

2H

3L

3H

65,75

04,22,26,27

66,69,74,80

09,13,15,23,38

50,70,72,89,94

1L

1H

2L

2H

3L

3H

67

0222

6678

01123

57789

Stem-and-leaf – GPA example

Stem: Ones digit

Leaf: Tenths and hundredths digits

Stem: Ones digit

Leaf: Tenths digits

Note: The characters in a stem-and-leaf diagram must all have the same width, so if typing use courier.


Comparative stem and leaf diagram student weight comparing two groups l.jpg

3 10

3 11 7

554410 12 145

95000 13 0004558

000 14 000000555

75000 15 0005556

0 16 00005558

0 17 000005555

18 0358

5 19

0 20 0

21 0

22 55

23 79

*Comparative Stem and Leaf DiagramStudent Weight (Comparing two groups)

When it is desirable to compare two groups, back-to-back stem and leaf diagrams are useful. Here is the result from the student weights.

From this comparative stem and leaf diagram, it is clear that the males weigh more (as a group not necessarily as individuals) than the females.


Comparative stem and leaf diagram student age l.jpg

female male

7 1

9999 1 888889999999999999999

1111000 2 00000001111111111

3322222 2 2222223333

4 2 445

2 6

2 88

0 3

3

3

7 3

8 3

4

4

4 4

7 4

*Comparative Stem and Leaf DiagramStudent Age

From this comparative stem and leaf diagram, it is clear that the male ages are all more closely grouped then the females. Also the females had a number of outliers.


Frequency distributions histograms l.jpg
Frequency Distributions & Histograms

  • When working with discrete data, the frequency tables are similar to those produced for qualitative data.

  • For example, a survey of local law firms in a medium sized town gave


Frequency distributions histograms27 l.jpg
Frequency Distributions & Histograms

  • When working with discrete data, the steps to construct a histogram are

  • Draw a horizontal scale, and mark the possible values.

  • Draw a vertical scale and mark it with either frequencies or relative frequencies (usually start at 0).

  • Above each possible value, draw a rectangle whose height is the frequency (or relative frequency) centered at the data value with a width chosen appropriately. Typically if the data values are integers then the widths will be one.


Frequency distributions histograms28 l.jpg
Frequency Distributions & Histograms

  • The number of lawyers in the firm will have the following histogram.


Frequency distributions histograms29 l.jpg
Frequency Distributions & Histograms

  • 50 students were asked the question, “How many textbooks did you purchase last term?” The result is summarized below and the histogram is on the next slide.


Frequency distributions histograms30 l.jpg
Frequency Distributions & Histograms

  • “How many textbooks did you purchase last term?”


Frequency distributions histograms31 l.jpg
Frequency Distributions & Histograms

  • Another version with the scale produced differently.


Frequency distributions histograms32 l.jpg
Frequency Distributions & Histograms

  • When working with continuous data, the steps to construct a histogram are

  • Decide into how many groups or “classes” you want to break up the data. Typically somewhere between 5 and 20. A good rule of thumb is to think having an average of more than 5 per group and break n observations into .

  • Use your answer to help decide the “width” of each group. I.e., You need to determine the width of the intervals that you are determining.

  • Determine the “starting point” for the lowest group.


Example of frequency distribution l.jpg
Example of Frequency Distribution

  • Consider the student weights in the student data set. The data values fall between 103 (lowest) and 239 (highest). The range of the dataset is 239-103=136.

  • There are 79 data values, so to have an average of at least 5 per group, we need 14 or fewer groups. We need to choose a width that breaks the data into 14 or fewer groups. (9 or 10 groups might be good.) Any width 10 or large would be reasonable.


Example of frequency distribution34 l.jpg
Example of Frequency Distribution

  • Choosing a width of 15 we have the following frequency distribution.


Histogram for continuous data l.jpg
Histogram for Continuous Data

  • Mark the boundaries of the class intervals on a horizontal axis

  • Use frequency or relative frequency on the vertical scale.


Histogram for continuous data36 l.jpg
Histogram for Continuous Data

  • The following histogram is for the frequency table of the weight data.


Histogram for continuous data37 l.jpg
Histogram for Continuous Data

  • Another version of a frequency table and histogram for the weight data with a class width of 20.


Histogram for continuous data38 l.jpg
Histogram for Continuous Data

  • The resulting histogram.


Histogram for continuous data39 l.jpg
Histogram for Continuous Data

  • Another version of a frequency table and histogram for the weight data with a class width of 20.


Histogram for continuous data40 l.jpg
Histogram for Continuous Data

  • The corresponding histogram.


Histogram for continuous data41 l.jpg
Histogram for Continuous Data

  • A class width of 15 or 20 seems to work well because all of the pictures tell the same story.

  • The bulk of the weights appear to be centered around 150 lbs with a few values substantially large. The distribution of the weights is unimodal and is positively skewed.


Illustrated distribution shapes l.jpg
Illustrated Distribution Shapes

Unimodal

Bimodal

Multimodal

Skew negatively

Symmetric

Skew positively


Histograms with uneven class widths l.jpg
Histograms with uneven class widths

  • Consider the following frequency histogram of ages based on A with class widths of 2. Notice it is a bit choppy. Because of the positively skewed data, sometimes frequency distributions are created with unequal class widths.


Histograms with uneven class widths44 l.jpg
Histograms with uneven class widths

  • For many reasons, either for convenience or because that is the way data was obtained, the data may be broken up in groups of uneven width as in the following example referring to the student ages.


Histograms with uneven class widths45 l.jpg
Histograms with uneven class widths

  • If a frequency histogram is draw with the heights of the bars being the frequencies, the result is distorted. Notice that it appears that there are a lot of people over 28 when there is only a few.


Histograms with uneven class widths46 l.jpg
Histograms with uneven class widths

  • To correct the distortion, we create a density histogram. The vertical scale is called the density and the density of a class is calculated by

This choice for the density makes the area of the rectangle equal to the relative frequency.


Histograms with uneven class widths47 l.jpg
Histograms with uneven class widths

  • Continuing this example we have


Histograms with uneven class widths48 l.jpg
Histograms with uneven class widths

  • The resulting histogram is now a reasonable representation of the data.


Describing the center of a data set with the arithmetic mean l.jpg

The sample mean of a numerical sample,

x1, x2, x3,…, xn, denoted , is

Describing the Center of a Data Set with the arithmetic mean

The population mean is denoted by m.


Example calculations l.jpg
Example calculations

  • During a two week period 10 houses were sold in Fancytown.

The “average” or mean price for this sample of 10 houses in Fancytown is $291,000


Example calculations51 l.jpg

Outlier

Example calculations

  • During a two week period 10 houses were sold in Lowtown.

The “average” or mean price for this sample of 10 houses in Lowtown is $295,000


Comments l.jpg
Comments

  • In the previous example of the house prices in the sample of 10 houses from town, the mean was affected very strongly by the one house with the extremely high price.

  • The other 9 houses had selling prices around $100,000.

  • This illustrates that the mean can be very sensitive to a few extreme values.


Describing the center of a data set with the median l.jpg
Describing the Center of a Data Set with the median

The sample median is obtained by first ordering the n observations from smallest to largest (with any repeated values included, so that every sample observation appears in the ordered list). Then


Examples of median calculation l.jpg
Examples of median calculation

Consider the housing data. First, we put the data in numerical increasing order to get

225,000 285,000 287,000 287,000 291,000

299,000 300,000 310,000 311,000 315,000

Since there are 10 (even) data values, the median is the mean of the two values in the middle.




Comparing the sample mean sample median57 l.jpg
Comparing the sample mean & sample median

Notice from the preceding pictures that the median splits the area in the distribution in half and the mean is the point of balance.

  • Typically,

  • when a distribution is skewed positively, the mean is larger than the median,

  • when a distribution is skewed negatively, the mean is smaller then the median, and

  • when a distribution is symmetric, the mean and the median are equal.


The trimmed mean a robust statistics l.jpg
The Trimmed Mean – A Robust Statistics

  • A trimmed mean is computed by first ordering the data values from smallest to largest, deleting a selected number of values from each end of the ordered list, and finally computing the mean of the remaining values.

  • The trimming percentage is the percentage of values deleted from each end of the ordered list.



Another example60 l.jpg
Another Example

  • Here’s an example of what happens if you compute the mean, median, and 5% & 10% trimmed means for the Ages for the 79 students taking Data Analysis


Categorical data sample proportion l.jpg

The sample proportion of successes, denoted by p, is

Where S is the label used for the response designated as success.

The population proportion of successes is denoted by p.

Categorical Data – Sample Proportion


Categorical data sample proportion62 l.jpg
Categorical Data – Sample Proportion

If we look at the student data sample, consider the variable gender and treat being female as a success, the sample proportion (of females) is


Describing variability dispersion l.jpg
Describing Variability/Dispersion

  • The simplest numerical measure of the variability of a numerical data set is the range, which is defined to be the difference between the largest and smallest data values.

range = maximum - minimum


Quartiles and the interquartile range l.jpg
Quartiles and the Interquartile Range

  • Lower quartile (Q1) = median of the lower half of the data set.

  • Upper Quartile (Q3) = median of the upper half of the data set.

The interquartile range (IQR), a resistant measure of variability is given by

Iqr = upper quartile – lower quartile

= Q3 – Q1

Note: If n is odd, the median is excluded from both the lower and upper halves of the data.


Quartiles and iqr example l.jpg
Quartiles and IQR Example

  • 15 students with part time jobs were randomly selected and the number of hours worked last week was recorded.

19, 12, 14, 10, 12, 10, 25, 9, 8, 4, 2, 10, 7, 11, 15

The is order in increasing order to get

2, 4, 7, 8, 9, 10, 10, 10, 11, 12, 12, 14, 15, 19, 25


Quartiles and iqr example66 l.jpg

Lower quartile Q1

Upper quartile Q1

Median

Quartiles and IQR Example

With 15 data values, the median is the 8th value. Specifically, the median is 10.

2, 4, 7, 8, 9, 10, 10, 10, 11, 12, 12, 14, 15, 19, 25

Upper Half

Lower Half

Lower quarter = 8 Upper quarter = 14

Iqr = 14 - 8 = 6


Box plots l.jpg
Box plots

  • Constructing a Box plot

    • Draw a vertical (or horizontal) scale.

    • Construct a rectangular box whose lower (or left) edge is at the lower quartile and whose upper (or light) edge is at the upper quartile (so box width = IQR). Draw a horizontal (or vertical) line segment inside the box at the location of the median.

    • Calculate outliers using “fences”.

    • Extend vertical (or horizontal) line segments from each end of the box to the smallest and largest observations in the data set (or fences). (These lines are called whisker.)

    • Represent mild outliers by asterisks and extreme outliers by open circles. Whiskers extend on each end to the most extreme observations that are not outliers.


Outliers l.jpg
Outliers

  • An observations is an outlier if it is more than 1.5 IQR away from the closest end of the box (less than the lower quartile minus 1.5 IQR or more than the upper quartile plus 1.5 IQR.

  • An outlier is extreme if it is more than 3 IQR from the closest end of the box, and it is mild otherwise.


Skeletal boxplot example l.jpg

0 5 10 15 20 25

Skeletal Boxplot Example

  • Using the student work hours data we have


Modified boxplot example l.jpg

Smallest data 20 25

value that isn’t

an outlier

Largest data

value that isn’t

an outlier

Mild

Outlier

Upper quartile + 1.5 IQR = 14 + 1.5(6) = 23

0 5 10 15 20 25

Lower quartile + 1.5 IQR = 14 - 1.5(6) = -1

Upper quartile + 3 IQR = 14 + 3(6) = 32

Modified Boxplot Example

  • Using the student work hours data we have


Modified boxplot example71 l.jpg

17 18 18 18 18 18 19 19 19 19

19 19 19 19 19 19 19 19 19 19

19 19 19 19 19 19 20 20 20 20

20 20 20 20 20 20 21 21 21 21

21 21 21 21 21 21 21 21 21 21

22 22 22 22 22 22 22 22 22 22

22 23 23 23 23 23 23 24 24 24

25 26 28 28 30 37 38 44 47

Lower

Quartile

Median

Upper

Quartile

Moderate Outliers

Extreme Outliers

Modified Boxplot Example

  • Consider the ages of the 79 students from the classroom data set.

IQR = 22 – 19 = 3

Lower quartile – 3 IQR = 10 Lower quartile – 1.5 IQR =14.5

Upper quartile + 3 IQR = 31 Upper quartile + 1.5 IQR = 25.5


Another box plot example l.jpg
Another Box plot Example 19 19

50

45

40

35

30

25

20

15

Here is the same box plot reproduced with a vertical orientation.


Box plot example l.jpg

Largest data 19 19

value that isn’t

an outlier

Smallest data

value that isn’t

an outlier

Mild

Outliers

Extreme

Outliers

Box plot Example

15 20 25 30 35 40 45 50


Comparative boxplot example l.jpg

G 19 19

e

n

d

e

r

Males

Females

100 120 140 160 180 200 220 240

Student Weight

Comparative Boxplot Example


Describing variability l.jpg
Describing Variability 19 19

  • The n deviations from the sample mean are the differences:

Note: The sum of all of the deviations from the sample mean will be equal to 0, except possibly for the effects of rounding the numbers.


Sample variance l.jpg
Sample Variance 19 19

  • The sample variance, denoted s2 is the sum of the squared deviations from the mean divided by n-1.


Sample standard deviation l.jpg
Sample Standard Deviation 19 19

  • The sample standard deviation, denoted s is the positive square root of the sample variance.

The population standard deviation is denoted by s.


Example calculations78 l.jpg
Example calculations 19 19

  • 10 Macintosh Apples were randomly selected and weighed (in ounces).


Calculations revisited l.jpg
Calculations Revisited 19 19

The values for s2 and s are exactly the same as were obtained earlier.


Z scores l.jpg

The 19 19z score corresponding to a particular observation in a data set is

The z score is how many standard deviations the observation is from the mean.

A positive z score indicates the observation is above the mean and a negative z score indicates the observation is below the mean.

Z Scores


Z scores81 l.jpg

Computing the z score is often referred to as 19 19standardization and the z score is called a standardized score.

The formula used with sample data is

Z Scores


Example l.jpg
Example 19 19

A sample of GPAs of 38 statistics students appear below (sorted in increasing order)

2.00 2.25 2.36 2.37 2.50 2.50 2.60 2.67 2.70 2.70 2.75 2.78 2.80 2.80 2.82 2.90 2.90 3.00 3.02 3.07 3.15 3.20 3.20 3.20 3.23 3.29 3.30 3.30 3.42 3.46 3.48 3.50 3.50 3.58 3.75 3.80 3.83 3.97


Example83 l.jpg

2 0 19 19

2 233

2 55

2 667777

2 88899

3 0001

3 2222233

3 444555

3 7

3 889

Stem: Units digit

Leaf: Tenths digit

Example

The following stem and leaf indicates that the data is reasonably symmetric and unimodal.


Example84 l.jpg

Using the formula we compute the z scores and color code the values as we did in an earlier example.

-2.21-1.68 -1.45 -1.43 -1.15 -1.15-0.94 -0.79 -0.73 -0.73 -0.62 -0.56 -0.52 -0.52 -0.47 -0.30 -0.30 -0.09 -0.05 0.06 0.23 0.33 0.33 0.33 0.40 0.52 0.54 0.54 0.80 0.88 0.93 0.97 0.971.14 1.50 1.60 1.67 1.96

Example


Empirical rule l.jpg
Empirical Rule z scores and color code the values as we did in an earlier example.

  • If the histogram of values in a data set is reasonably symmetric and unimodal (specifically, is reasonably approximated by a normal curve), then

    • Approximately 68% of the observations are within 1 standard deviation of the mean.

    • Approximately 95% of the observations are within 2 standard deviation of the mean.

    • Approximately 99.7% of the observations are within 3 standard deviation of the mean.


Example86 l.jpg
Example z scores and color code the values as we did in an earlier example.

Notice that the empirical rule gives reasonably good estimates for this example.


ad