# Descriptive Statistics, Histograms, and Normal Approximations


## Descriptive Statistics, Histograms, and Normal Approximations

Math 1680

### Overview

• Obtaining Data Sets

• Descriptive Statistics

• Histograms

• The Normal Curve

• Standardization

• Normal Approximation

• Summary

### Obtaining Data Sets

• Before we can analyze a data set, we need to have a data set

• How far do you travel to get to class, in miles?

• How tall are you?

• Today, numerical data is easily stored and organized (and even analyzed) by several computer programs

### Obtaining Data Sets

• Notice that in its raw form, the data is difficult to deal with

• By sorting the data, we can get a better picture of its distribution, or shape

• We are often interested in…

• Where the data are centered

• How spread out they are

• With what frequency numbers appear

### Obtaining Data Sets

• Usually, the entire data set is too large to work with directly

• We want ways to summarize the data

• We have quantitative (numerical) and pictorial descriptions available to us

• Descriptive Statistics

• Histograms

### Descriptive Statistics

• We can summarize the data set with a few simple numbers, called descriptive (or summary) statistics

• The first and most commonly used summary statistic is the average (or mean)

• Represents the central tendency of the data set

• Gives an idea of where the bulk of the points lie

• To calculate the average, add up the values of all of the points and divide by the total number of points in the set

### Descriptive Statistics

• Calculate the average of the following data sets

• 60 60 60 60 60

• 18 59 60 63 100

• 18 35 60 87 100

### Descriptive Statistics

• Despite having the same average, the three data sets are clearly different

• The average alone usually does not describe data sets uniquely
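The averaging rule can be checked directly on the three data sets from the exercise. This is a minimal sketch; the function name `average` is just an illustrative choice.

```python
def average(values):
    """Add up all the values and divide by how many there are."""
    return sum(values) / len(values)


# The three data sets from the exercise
data_sets = [
    [60, 60, 60, 60, 60],
    [18, 59, 60, 63, 100],
    [18, 35, 60, 87, 100],
]

averages = [average(d) for d in data_sets]
for d, avg in zip(data_sets, averages):
    print(d, "-> average", avg)
```

All three lists come out to an average of 60, even though their values are spread out very differently.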

### Descriptive Statistics

• The median is another central tendency measure

• The median is the point that splits the sorted data in half: half of the data are less than (or equal to) it, and half are greater than (or equal to) it

• If there are an odd number of data points, then the median is just the number in the middle of the sorted set

• Otherwise, the median is the average of the two points in the middle of the sorted set

### Descriptive Statistics

• Calculate the median of each data set

• 1 4 5 7 10 15 18
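The odd/even rule for the median can be sketched as a short function. The name `median` and the second, six-point list (the same numbers without the 18) are illustrative choices, included to exercise the even case.

```python
def median(values):
    """Middle value of the sorted data; average of the two middle values if the count is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd count: the single middle value
        return ordered[mid]
    # Even count: average the two middle values
    return (ordered[mid - 1] + ordered[mid]) / 2


odd_median = median([1, 4, 5, 7, 10, 15, 18])  # seven points
even_median = median([1, 4, 5, 7, 10, 15])     # six points
print(odd_median, even_median)
```

With seven points the median is the fourth sorted value (7); with six points it is the average of 5 and 7.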

### Descriptive Statistics

• The average is like a balance point

• It represents the place where the data set is equally “heavy” on both sides

• If there are outliers on one side of the data set, the average will be skewed

• The median is more robust

• That is, it is usually less affected by outliers or data entry errors
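A small experiment illustrates the difference in robustness. The data entry error below (63 mistyped as 630) is a hypothetical example, not taken from the slides; the functions come from Python's standard `statistics` module.

```python
import statistics

grades = [18, 59, 60, 63, 100]
# Hypothetical data entry error: 63 mistyped as 630
grades_with_typo = [18, 59, 60, 630, 100]

mean_before = statistics.mean(grades)
mean_after = statistics.mean(grades_with_typo)
median_before = statistics.median(grades)
median_after = statistics.median(grades_with_typo)

print("average:", mean_before, "->", mean_after)
print("median: ", median_before, "->", median_after)
```

The single bad value drags the average from 60 up to 173.4, while the median stays at 60.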

### Descriptive Statistics

• In a certain class of 13 students, 10 showed up for the first exam, while 3 blew it off

• Here are the grades, in order:

• Calculate the class median…

• Including all students: 79

• Not counting those who slept in: 82.5

### Descriptive Statistics

• In a certain class of 13 students, 10 showed up for the first exam, while 3 blew it off

• Here are the grades, in order:

• Calculate the class average…

• Including all students: 62.8

• Not counting those who slept in: 81.7

### Descriptive Statistics

• Suppose the teacher mistyped the grade of 55 as being a 15

• Not counting the sleepers,

• What is the new median? 82.5 (unchanged)

• What is the new average? 77.7 (down from 81.7)

### Descriptive Statistics

• Earlier, we saw that the average did not necessarily uniquely describe a data set

• We use the standard deviation (SD) to measure spread in a data set

• When paired, the average and SD are highly effective summary statistics

### Descriptive Statistics

• The Root-Mean-Square (RMS) measures the typical absolute value of data points in a set

• Calculated by reading its name backwards

• Square all entries in the data set

• Take their mean

• Take the square root of that mean

• Find the average and then the RMS size of the numbers in the list

Average = 0

RMS = 4
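The "read the name backwards" recipe can be sketched as follows. The list `numbers` is an assumption: it is an illustrative list chosen to have average 0 and RMS 4, since the slide's own list is not reproduced here.

```python
import math


def rms(values):
    """Root-mean-square: square the entries, take their mean, then take the square root."""
    squares = [v ** 2 for v in values]          # Square
    mean_square = sum(squares) / len(squares)   # Mean
    return math.sqrt(mean_square)               # Root


# Illustrative list (an assumption, not necessarily the slide's list)
numbers = [1, -3, 5, -6, 3]
avg = sum(numbers) / len(numbers)
rms_size = rms(numbers)
print("Average =", avg, "RMS =", rms_size)
```

The positive and negative entries cancel in the average, but squaring removes the signs, so the RMS still reports a typical size of 4.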

### Descriptive Statistics

• The SD embodies the same concept of “typical” distance

• Where the RMS measures typical distance from 0, the SD measures typical distance from the data set’s average

• This is accomplished by subtracting the average from every data point and then taking the RMS of the differences (or deviations from the mean)

### Descriptive Statistics

• 1 4 5 7 10 15 has an average of 7

• The deviations are then -6 -3 -2 0 3 8

• Note how the subtraction process re-centers the data set so that the average is at 0

### Descriptive Statistics

• Taking the RMS of the deviations gives the standard deviation

• Typically, about two thirds to three quarters of a data set should be within one SD of the mean

• 1 4 5 7 10 15 has an average of 7 and an SD of about 4.5

### Descriptive Statistics

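• Data: 1 4 5 7 10 15

• Average: (1 + 4 + 5 + 7 + 10 + 15)/6 = 42/6 = 7

• Deviations: 1-7 4-7 5-7 7-7 10-7 15-7, which gives -6 -3 -2 0 3 8

• Squared deviations: (-6)² (-3)² (-2)² 0² 3² 8², which gives 36 9 4 0 9 64

• Mean of the squares: (36 + 9 + 4 + 0 + 9 + 64)/6 = 122/6 ≈ 20.3

• Square root: √(20.3) ≈ 4.5

• SD ≈ 4.5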
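The same calculation can be written as a short function. This is a sketch of the divide-by-n SD used on this slide; the name `sd` is an illustrative choice.

```python
import math


def sd(values):
    """SD as the RMS of the deviations from the average (divides by n)."""
    avg = sum(values) / len(values)
    deviations = [v - avg for v in values]
    mean_square = sum(d ** 2 for d in deviations) / len(deviations)
    return math.sqrt(mean_square)


data = [1, 4, 5, 7, 10, 15]
data_sd = sd(data)
print(round(data_sd, 1))
```

For this list the function returns about 4.51, which rounds to the 4.5 found by hand.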

### Descriptive Statistics

• The calculation on the previous slide divides by n (where n is the number of points) and gives the SD of the data set itself. If the goal is instead to use the data as a sample to estimate the SD of a larger population, divide by n-1 instead of n; the result is called the sample SD.

• Many calculators report the sample SD by default.

• In general, the higher a set’s SD, the more spread out its points are

• An SD of 0 indicates that every point in the data set has the same value
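Python's standard `statistics` module offers both versions, which makes the difference easy to see: `pstdev` divides by n and `stdev` divides by n-1. The variable names below are illustrative.

```python
import statistics

data = [1, 4, 5, 7, 10, 15]

# Divides by n: the SD of the data set itself
sd_divide_by_n = statistics.pstdev(data)

# Divides by n - 1: the sample SD, used to estimate a population's SD
sd_divide_by_n_minus_1 = statistics.stdev(data)

print(round(sd_divide_by_n, 2), round(sd_divide_by_n_minus_1, 2))
```

The sample SD (about 4.94) is slightly larger than the divide-by-n SD (about 4.51); the gap shrinks as the number of points grows.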

### Descriptive Statistics

• Calculate the SDs of the data sets

• 60 60 60 60 60: SD = 0

• 18 59 60 63 100: SD ≈ 26.0

• 18 35 60 87 100: SD ≈ 30.7

### Histograms

• Often, we would prefer a pictorial representation of a data set to a two-number summary

• The most common way to graphically represent a data set is to draw a frequency histogram (or just histogram)

### Histograms

• Histograms tend to look like city skylines

• In a histogram, the area under the curve between two points on the horizontal axis represents the proportion of data points between those two points

• Continuing the city skyline analogy, the size of the building determines how many people live there

• A long, low building can house as many people as a thin skyscraper

### Histograms

• To draw a histogram, we first need to organize our data into bins (or class intervals)

• Often, the bins are dictated to us

• If we get to choose them, we try to pick the bins so that they give a fair representation of the data

• Then mark a horizontal axis with the bin values, spacing them correctly

### Histograms

• Often, data is given in percentage form

• If not, divide the number of points in the bin by the number of points in the data set to get the percentage

• Draw a box for each bin so that the area of the box is the percentage of the data in that bin

• To get the correct height of the box, divide the percentage in the bin by the width of the bin
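The height rule can be sketched in a few lines. The bin edges and counts below are hypothetical values made up for illustration, and `histogram_heights` is an illustrative name.

```python
def histogram_heights(bin_edges, counts):
    """Height of each block = (percentage of data in the bin) / (width of the bin)."""
    total = sum(counts)
    heights = []
    for left, right, count in zip(bin_edges, bin_edges[1:], counts):
        percent = 100 * count / total
        heights.append(percent / (right - left))
    return heights


# Hypothetical data: bins 0-10, 10-20, and 20-50 holding 20, 50, and 30 points
bin_edges = [0, 10, 20, 50]
counts = [20, 50, 30]

heights = histogram_heights(bin_edges, counts)
# Area of each block is height times width; the areas should add up to 100%
total_area = sum(h * (r - l) for h, l, r in zip(heights, bin_edges, bin_edges[1:]))
print(heights, total_area)
```

The wide 20-50 bin holds 30% of the data but gets a height of only 1% per unit, which is the "long, low building" of the skyline analogy.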

### Histograms

• Note that the average and median can be visually located on a histogram

• If the histogram were balanced on a see-saw, the fulcrum would meet the histogram at the average

• If you draw a vertical line through the histogram so that it splits the area in half, then the line passes through the median

• On a symmetric histogram, the average and median tend to coincide

• Asymmetric tails pull the average in the direction of the tail

### The Normal Curve

• A great many data sets have similarly-shaped histograms

• SAT scores

• Attendance at baseball games

• Battery life

• Cash flow of a bank

• Heights of adult males/females

### The Normal Curve

• These histograms are similar to one generated by a very special distribution

• It is called the normal distribution, and it is identified by two parameters we are already familiar with

• average

• standard deviation

### The Normal Curve

• This is the standard normal curve, where the average is 0 and the SD is 1

### The Normal Curve

• Though the equation used to draw the curve is not easy to work with, there is a table of values for the standard normal distribution

• We will use this table to find areas under the curve

• The table is on page A-105 of your text

### The Normal Curve

• Properties of the standard normal curve

• The curve is “bell-shaped” with its highest point at 0

• It is symmetric about a vertical line through 0

• The curve approaches the horizontal axis, but the curve and the horizontal axis never meet

### The Normal Curve

• Area underneath the standard normal curve

• Half the area lies to the left of 0; half lies to the right

• Approximately 68% of the area lies between –1 and 1

• Approximately 95% of the area lies between –2 and 2

• Approximately 99.7% of the area lies between –3 and 3
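These three areas can be checked numerically. The sketch below uses `NormalDist` from Python's standard `statistics` module (available in Python 3.8 and later) in place of the printed table; `percent_within` is an illustrative name.

```python
from statistics import NormalDist

# Standard normal curve: average 0, SD 1
std_normal = NormalDist(mu=0, sigma=1)


def percent_within(z):
    """Percentage of the area under the standard normal curve between -z and z."""
    return 100 * (std_normal.cdf(z) - std_normal.cdf(-z))


areas = {z: percent_within(z) for z in (1, 2, 3)}
for z, area in areas.items():
    print(f"between -{z} and {z}: {area:.2f}%")
```

The computed areas are about 68.27%, 95.45%, and 99.73%, matching the rounded figures above.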

### Standardization

• Most data sets do not have a mean of 0 and an SD of 1

• To be able to use the standard normal curve, we’ll need to standardize numbers in the original data set

• To standardize a number, subtract the data set’s average and then divide the difference by the data set’s SD

• Standardizing is basically a change of scale

• Like converting feet to miles

### Standardization

• Suppose there are two different sections of the same course

• The scores for the midterm in each section were approximately normally distributed

• In the first section, the average was 64 and the standard deviation was 5

• Tina scored a 74 in the first section

• In the second section, the average was 72 and the standard deviation was 10

• Jack scored an 82 in the second section

• Which of the two scores is more impressive, relative to the other students in that section?
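One way to compare the two scores is to convert each to standard units using its own section's average and SD. This is a minimal sketch; `standard_units` is an illustrative name.

```python
def standard_units(value, average, sd):
    """Subtract the average, then divide the difference by the SD."""
    return (value - average) / sd


tina_z = standard_units(74, average=64, sd=5)
jack_z = standard_units(82, average=72, sd=10)
print("Tina:", tina_z, "Jack:", jack_z)
```

Both students scored 10 points above their section's average, but Tina is 2 SDs above hers while Jack is only 1 SD above his.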

### Standardization

• Convert the following scores in the first section to standard units (average 64, SD 5)

• Alice got a 50: (50 - 64)/5 = -2.8

• Bob got a 61: (61 - 64)/5 = -0.6

• Carol got a 64: (64 - 64)/5 = 0

• Dan got a 77: (77 - 64)/5 = 2.6

### Standardization

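• In Jack’s section, students with grades between 62 and 82 received a B

• What percentage of students in this section received Bs? About 68.27%, since 62 and 82 are one SD below and one SD above the average of 72

• Is this percentage exact? No, because the scores are only approximately normally distributed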

### Normal Approximation

• According to the HANES study, the average height of U.S. women was 63.5 inches with an SD of 2.5 inches

### Normal Approximation

• The normal curve is a smooth-curve histogram for normally distributed data

• We can estimate percentages within a given range

• Find the area under the curve between those ranges using the standard normal table

### Normal Approximation

• Sometimes this will require cutting and pasting different areas together

• The standard normal table on page A-105 takes a standard score z

• It returns to you the area under the curve between –z and z

### Normal Approximation

• Find the area between –1.2 and 1.2 under the normal curve

76.99%

### Normal Approximation

• Find the area between 0 and 1.65 under the standard normal curve

45.055%

### Normal Approximation

• Find the area between 0 and 3.3 under the standard normal curve

49.9515%

### Normal Approximation

• Find the area between –0.35 and 0.95 under the normal curve

46.58%

### Normal Approximation

• Find the area between 1.2 and 1.85 under the normal curve

8.29%

### Normal Approximation

• Find the area between –2.1 and –1.05 under the normal curve

12.9%

### Normal Approximation

• Find the area to the right of 1 under the normal curve

15.865%

### Normal Approximation

• Find the area to the left of 0.85 under the normal curve

80.235%
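The cut-and-paste steps in these exercises can be sketched with a function that stands in for the printed table. The function `table` is an assumption for illustration: it returns the area between -z and z using `math.erf`, so its values may differ from the rounded table entries in the last decimal place.

```python
import math


def table(z):
    """Stand-in for the table: percent of area between -z and z under the standard normal curve."""
    return 100 * math.erf(abs(z) / math.sqrt(2))


# Between 0 and 1.65: half of the symmetric area
between_0_and_165 = table(1.65) / 2

# Between -0.35 and 0.95: paste together two half-areas
between_m035_and_095 = table(0.35) / 2 + table(0.95) / 2

# Between 1.2 and 1.85: cut the smaller symmetric area out of the larger, then halve
between_12_and_185 = (table(1.85) - table(1.2)) / 2

# To the right of 1: half of what lies outside -1 to 1
right_of_1 = (100 - table(1)) / 2

# To the left of 0.85: the whole left half plus half of the symmetric area
left_of_085 = 50 + table(0.85) / 2

for name, value in [
    ("0 to 1.65", between_0_and_165),
    ("-0.35 to 0.95", between_m035_and_095),
    ("1.2 to 1.85", between_12_and_185),
    ("right of 1", right_of_1),
    ("left of 0.85", left_of_085),
]:
    print(f"{name}: {value:.2f}%")
```

Each result agrees with the table-based answers to within rounding.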

### Normal Approximation

• If a data set is approximately normal in distribution, we can use the normal curve in place of the data set’s histogram

• If you want to estimate the percentage of the data set between two numbers…

• Standardize the numbers to get z scores

• Look each z score up in the standard normal table

• Cut and paste the areas to match the region you originally wanted

• The percentage under the curve will be close to the percentage in the data set
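The procedure can be sketched as a single function. It uses `NormalDist` from Python's standard `statistics` module in place of the printed table; `percent_between` and its parameter names are illustrative choices.

```python
from statistics import NormalDist


def percent_between(low, high, average, sd):
    """Estimate the percentage of an approximately normal data set between low and high."""
    # Step 1: standardize both endpoints to get z scores
    z_low = (low - average) / sd
    z_high = (high - average) / sd
    # Steps 2 and 3: look up the area to the left of each z score and subtract
    std_normal = NormalDist()  # average 0, SD 1
    return 100 * (std_normal.cdf(z_high) - std_normal.cdf(z_low))


# Example: the B range of 62 to 82 in a section with average 72 and SD 10
pct_b = percent_between(62, 82, average=72, sd=10)
print(f"{pct_b:.2f}%")
```

For the B range this gives about 68.27%, the area within one SD of the average.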

### Normal Approximation

• It is generally helpful to sketch the curve first and shade in the desired area

• This will remind you what the target area is

### Normal Approximation

• According to the HANES study, the average height of U.S. women was 63.5 inches with an SD of 2.5 inches

• What percentage of women have heights between 60 and 68 inches?

88.33% (z scores of -1.4 and 1.8)

### Normal Approximation

• According to the HANES study, the average height of U.S. women was 63.5 inches with an SD of 2.5 inches

• What percentage of women are taller than 66 inches?

15.865%

### Normal Approximation

• Sometimes, you will be given a percentage of the data set and want to find the score(s) that mark(s) off that percentage

• Adjust the area to “center” it

• Look up the z score associated with that area in the table

• Unstandardize the z score by multiplying it by the SD and adding the average to the product

### Normal Approximation

• For a certain population of high school students, the SAT-M scores are normally distributed with an average of 500 and an SD of 100

• A certain engineering college will accept only high school seniors with SAT-M scores in the top 5%

• What is the minimum SAT-M score for this program?

665
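Going from a percentage back to a score can be sketched with `NormalDist.inv_cdf` from Python's standard `statistics` module, which plays the role of reading the table in reverse. The function name `cutoff_for_top` is an illustrative choice.

```python
from statistics import NormalDist


def cutoff_for_top(percent, average, sd):
    """Score that marks off the top `percent` of an approximately normal data set."""
    # z score with (100 - percent)% of the area to its left
    z = NormalDist().inv_cdf(1 - percent / 100)
    # Unstandardize: multiply by the SD, then add the average
    return average + z * sd


cutoff = cutoff_for_top(5, average=500, sd=100)
print(round(cutoff, 1))
```

The computed cutoff is about 664.5. The table-based answer of 665 comes from rounding the z score to 1.65.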

### Normal Approximation

• One way to determine how large a number is in the data set is to find its percentile rank

• The kth percentile is the value such that k percent of the data set have values below it

• Percentile ranks can be calculated for any data set

### Normal Approximation

• In one year, the 1600-point SAT scores were approximately normal with an average of 1030 and an SD of 190

• If a student scores a 1460, what is her percentile rank?

98th percentile
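A percentile rank under the normal approximation is the area to the left of the score. This sketch uses `NormalDist` from Python's standard `statistics` module; the variable names are illustrative.

```python
from statistics import NormalDist

# Approximate distribution of SAT scores: average 1030, SD 190
sat = NormalDist(mu=1030, sigma=190)

# Standard score of 1460
z = (1460 - 1030) / 190

# Percentage of scores below 1460 under the normal approximation
percent_below = 100 * sat.cdf(1460)
print(f"z = {z:.2f}, about {percent_below:.1f}% of scores fall below 1460")
```

The score sits a little more than 2.25 SDs above the average, with roughly 98.8% of scores below it.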

### Summary

• It is often useful to describe a data set with summary statistics

• The average and median are central tendency statistics

• The average is more sensitive to outliers

• The standard deviation (SD) is the most common summary statistic for describing a data set’s spread

• The SD is calculated by taking the RMS of the data points’ deviations from the mean

• Most of the points in most data sets will lie within one or two SDs of the average

### Summary

• We can represent a data set graphically by drawing a histogram

• The percentage of the data set in a bin is the area of the histogram over that bin

• The height of each block in a histogram is the percentage of the data in the corresponding bin divided by the width of the bin

• The total area under any histogram is 100%

• The average of a data set is located at the balance point of the histogram

• Long tails pull the average in the direction of the tail

### Summary

• Using the average and SD, we can standardize numbers in the data set

• The standard score (z) of a number is its distance from the average, measured in SDs

• We can also take a standard score and convert it back to a raw score

### Summary

• Many data sets are approximately normal

• We can estimate the percentage of points in a data set that fall between two numbers

• Convert the numbers to standard units

• Find the area under the standard normal curve by using the normal table

• If a data set is approximately normal, we can use the normal table to estimate percentile ranks