
Descriptive Statistics, Histograms, and Normal Approximations

Math 1680

- Obtaining Data Sets
- Descriptive Statistics
- Histograms
- The Normal Curve
- Standardization
- Normal Approximation
- Summary

- Before we can analyze a data set, we need to have a data set
- How far do you travel to get to class, in miles?
- How tall are you?

- Today, numerical data is easily stored and organized (and even analyzed) by several computer programs

- Notice that in its raw form, the data is difficult to deal with
- By sorting the data, we can get a better picture of its distribution, or shape
- We are often interested in…
- Where the data are centered
- How spread out they are
- With what frequency numbers appear

- Usually, the entire data set is too large to work with directly
- We want ways to summarize the data
- We have quantitative (numerical) and pictorial descriptions available to us
- Descriptive Statistics
- Histograms

- We can summarize the data set with a few simple numbers, called descriptive (or summary) statistics
- The first and most often-used summary stat is the average (or mean)
- Represents the central tendency of the data set
- Gives an idea of where the bulk of the points lie

- To calculate the average, add up the values of all of the points and divide by the total number of points in the set

- Calculate the average of the following data sets
- 60 60 60 60 60
- 18 59 60 63 100
- 18 35 60 87 100

- Despite having the same average, the three data sets are clearly different
- The average alone usually does not describe data sets uniquely
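
The exercise above can be checked in a few lines. This is a minimal Python sketch; the function name `average` is our own label for the "add up and divide" recipe, not something from the course.

```python
def average(values):
    """Add up all the values and divide by how many there are."""
    return sum(values) / len(values)

# The three data sets from the exercise.
data_sets = [
    [60, 60, 60, 60, 60],
    [18, 59, 60, 63, 100],
    [18, 35, 60, 87, 100],
]

for data in data_sets:
    print(data, "average =", average(data))  # each average is 60.0
```

All three averages come out to 60, even though the sets are spread out very differently.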

- The median is another central tendency measure
- The median marks the point where exactly half of the data are less than (or equal to) the median
- If there are an odd number of data points, then the median is just the number in the middle of the sorted set
- Otherwise, the median is the average of the two points in the middle of the sorted set

- Calculate the median of each data set
- 1 4 5 7 10 15 18
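
The odd/even rule for the median can be sketched as follows. The function name `median` and the second, six-point example list are our own choices for illustration.

```python
def median(values):
    """Middle value of the sorted data; average of the two middle values if the count is even."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2

print(median([1, 4, 5, 7, 10, 15, 18]))  # 7 (odd count: the middle point)
print(median([1, 4, 5, 7, 10, 15]))      # 6.0 (even count: average of 5 and 7)
```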

- The average is like a balance point
- It represents the place where the data set is equally “heavy” on both sides
- If there are outliers on one side of the data set, the average will be skewed

- The median is more robust
- This means it is usually less affected by outliers or data entry errors
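
A small sketch of what "robust" means in practice, using Python's standard `statistics` module. The data entry error here (18 mistyped as 180) is hypothetical and invented purely for illustration.

```python
import statistics

data = [1, 4, 5, 7, 10, 15, 18]
# Hypothetical data entry error: the 18 is mistyped as 180.
with_typo = [1, 4, 5, 7, 10, 15, 180]

# The median does not move at all.
print(statistics.median(data), statistics.median(with_typo))                  # 7 7
# The average is dragged toward the outlier.
print(round(statistics.mean(data), 1), round(statistics.mean(with_typo), 1))  # 8.6 31.7
```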

- In a certain class of 13 students, 10 showed up for the first exam, while 3 blew it off
- Here are the grades, in order:

- Calculate the class median…
- Including all students
- Not counting those who slept in

79

82.5

- In a certain class of 13 students, 10 showed up for the first exam, while 3 blew it off
- Here are the grades, in order:

- Calculate the class average…
- Including all students
- Not counting those who slept in

62.8

81.7

- Suppose the teacher mistyped the grade of 55 as being a 15
- Not counting the sleepers,

- What is the new median?
- What is the new average?

82.5

77.7

- Earlier, we saw that the average did not necessarily uniquely describe a data set
- We use the standard deviation (SD) to measure spread in a data set
- When paired, the average and SD are highly effective summary statistics

- The Root-Mean-Square (RMS) measures the typical absolute value of data points in a set
- Calculated by reading its name backwards
- Square all entries in the data set
- Take their mean
- Take the square root of that mean
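
The three steps translate directly into code. This is a minimal sketch; the function name `rms` and the example list are our own, with the list chosen only so that the numbers come out round (it is not taken from the slides).

```python
import math

def rms(values):
    """Root-mean-square: square the entries, take their mean, then take the square root."""
    squares = [v * v for v in values]            # Square
    mean_square = sum(squares) / len(squares)    # Mean
    return math.sqrt(mean_square)                # Root

example = [1, -3, 5, -6, 3]  # illustrative list
print(sum(example) / len(example))  # 0.0 (the average)
print(rms(example))                 # 4.0 (the typical size, ignoring sign)
```

Note that the average is 0 because the positive and negative entries cancel, while the RMS still reports a typical size of 4.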

- Find the average and then the RMS size of the numbers of the list

Average = 0

RMS = 4

- The SD embodies the same concept of “typical” distance
- Where the RMS measures typical distance from 0, the SD measures typical distance from the data set’s average
- This is accomplished by subtracting the average from every data point and then taking the RMS of the differences (or deviations from the mean)

- 1 4 5 7 10 15 has an average of 7
- The deviations are then -6 -3 -2 0 3 8
- Note how the subtraction process re-centers the data set so that the average is at 0

- Taking the RMS of the deviations gives the standard deviation
- Normally, about two thirds to three quarters of a data set should be within one SD of the mean

- 1 4 5 7 10 15 has an average of 7 and an SD of about 4.5

1 4 5 7 10 15

Average = (1+4+5+7+10+15)/6 = 7

Deviations: 1-7 4-7 5-7 7-7 10-7 15-7 = -6 -3 -2 0 3 8

Squared deviations: (-6)² (-3)² (-2)² 0² 3² 8² = 36 9 4 0 9 64

Mean of the squares: (36+9+4+0+9+64)/6 = 122/6 ≈ 20.3

SD = √(20.3) ≈ 4.5
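
The same calculation, written out step by step as a Python sketch (variable names are our own):

```python
import math

data = [1, 4, 5, 7, 10, 15]

avg = sum(data) / len(data)                 # 7.0
deviations = [x - avg for x in data]        # -6, -3, -2, 0, 3, 8
squares = [d ** 2 for d in deviations]      # 36, 9, 4, 0, 9, 64
mean_square = sum(squares) / len(squares)   # 122/6
sd = math.sqrt(mean_square)                 # RMS of the deviations

print(avg)                    # 7.0
print(round(mean_square, 1))  # 20.3
print(round(sd, 1))           # 4.5
```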

- What we computed on the previous slide, dividing by n (the number of points), is the SD of the data set itself. If the goal is instead to use this sample to estimate the SD of a larger population, we divide by n-1 instead of n and call the result the sample SD.
- Most calculators actually calculate the sample SD.
- In general, the higher a set’s SD, the more spread out its points are
- An SD of 0 indicates that every point in the data set has the same value
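
Python's standard `statistics` module exposes both versions, which makes the difference easy to see. A short sketch using the six-point data set from the worked example:

```python
import statistics

data = [1, 4, 5, 7, 10, 15]

sd_divide_by_n = statistics.pstdev(data)          # divides by n
sd_divide_by_n_minus_1 = statistics.stdev(data)   # divides by n-1 (sample SD)

print(round(sd_divide_by_n, 2))          # 4.51
print(round(sd_divide_by_n_minus_1, 2))  # 4.94
```

Dividing by n-1 always gives a slightly larger number; the gap shrinks as the data set grows.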

- Calculate the SD’s of the data sets
- 60 60 60 60 60
- 18 59 60 63 100
- 18 35 60 87 100

0

26.0

30.7
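
These answers can be reproduced with a short sketch. It uses the three data sets from the earlier average exercise (the third being 18 35 60 87 100) and the divide-by-n SD via `statistics.pstdev`.

```python
import statistics

data_sets = [
    [60, 60, 60, 60, 60],
    [18, 59, 60, 63, 100],
    [18, 35, 60, 87, 100],
]

# Same average (60) in every case, but very different spreads.
sds = [round(statistics.pstdev(data), 1) for data in data_sets]
print(sds)  # [0.0, 26.0, 30.7]
```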

- Often, we would prefer a pictorial representation of a data set to a two-number summary
- The most common way to graphically represent a data set is to draw a frequency histogram (or just histogram)

- Histograms tend to look like city skylines
- In a histogram, the area under the curve between two points on the horizontal axis represents the proportion of data points between those two points
- Continuing the city skyline analogy, the size of the building determines how many people live there
- A long, low building can house as many people as a thin skyscraper

- To draw a histogram, we first need to organize our data into bins (or class intervals)
- Often, the bins are dictated to us
- If we get to choose them, we try to pick the bins so that they give a fair representation of the data

- Then mark a horizontal axis with the bin values, spacing them correctly

- Often, data is given in percentage form
- If not, divide the number of points in the bin by the number of points in the data set to get the percentage

- Draw a box for each bin so that the area of the box is the percentage of the data in that bin
- To get the correct height of the box, divide the bin's percentage by the width of the bin (so the height is measured in percent per horizontal unit)
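
The height rule can be sketched as below. The bin edges and counts are hypothetical, made up for illustration, and the function name `histogram_heights` is our own.

```python
def histogram_heights(edges, counts):
    """Height of each box = (percentage of data in the bin) / (width of the bin)."""
    total = sum(counts)
    heights = []
    for left, right, count in zip(edges, edges[1:], counts):
        percent = 100 * count / total
        heights.append(percent / (right - left))
    return heights

# Hypothetical bins 0-10, 10-20, 20-40 holding 5, 10, and 5 points.
edges = [0, 10, 20, 40]
counts = [5, 10, 5]
heights = histogram_heights(edges, counts)
print(heights)  # [2.5, 5.0, 1.25] percent per unit

# The areas of the boxes add back up to 100%.
areas = [h * (r - l) for h, l, r in zip(heights, edges, edges[1:])]
print(sum(areas))  # 100.0
```

The first and last bins hold the same share of the data (25%), but the last box is half as tall because its bin is twice as wide.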

- Note that the average and median can be visually located on a histogram
- If the histogram were balanced on a see-saw, the fulcrum would meet the histogram at the average
- If you draw a vertical line through the histogram so that it splits the area in half, then the line passes through the median

- On a symmetric histogram, the average and median tend to coincide
- Asymmetric tails pull the average in the direction of the tail

- A great many data sets have similarly-shaped histograms
- SAT scores
- Attendance at baseball games
- Battery life
- Cash flow of a bank
- Heights of adult males/females

- These histograms are similar to one generated by a very special distribution
- It is called the normal distribution, and it is identified by two parameters we are already familiar with
- average
- standard deviation

- This is the standard normal curve, where the average is 0 and the SD is 1

- Though the equation used to draw the curve is not easy to work with, there is a table of values for the standard normal distribution
- We will use this table to find areas under the curve
- The table is on page A-105 of your text

- Properties of the standard normal curve
- The curve is “bell-shaped” with its highest point at 0
- It is symmetric about a vertical line through 0
- The curve approaches the horizontal axis, but the curve and the horizontal axis never meet

- Area underneath the standard normal curve
- Half the area lies to the left of 0; half lies to the right
- Approximately 68% of the area lies between –1 and 1
- Approximately 95% of the area lies between –2 and 2
- Approximately 99.7% of the area lies between –3 and 3
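
These three areas can be checked numerically instead of with the printed table. A minimal sketch using `statistics.NormalDist` from the Python standard library (Python 3.8 or later); the helper name `area_within` is our own.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def area_within(k):
    """Percentage of the area under the standard normal curve between -k and k."""
    return 100 * (standard_normal.cdf(k) - standard_normal.cdf(-k))

for k in (1, 2, 3):
    print(k, round(area_within(k), 2))
# 1 68.27
# 2 95.45
# 3 99.73
```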

- Most data sets do not have a mean of 0 and an SD of 1
- To be able to use the standard normal curve, we’ll need to standardize numbers in the original data set
- To standardize a number, subtract the data set’s average and then divide the difference by the data set’s SD
- Standardizing is basically a change of scale
- Like converting feet to miles

- Suppose there are two different sections of the same course
- The scores for the midterm in each section were approximately normally distributed
- In the first section, the average was 64 and the standard deviation was 5
- Tina scored a 74 in the first section

- In the second section, the average was 72 and the standard deviation was 10
- Jack scored an 82 in the second section

- Which of the two scores is more impressive, relative to the students in his/her section?
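
One way to compare Tina's and Jack's scores is to put both into standard units. A minimal sketch; the function name `standardize` is our own.

```python
def standardize(value, average, sd):
    """Standard units: how many SDs the value lies above (+) or below (-) the average."""
    return (value - average) / sd

tina = standardize(74, average=64, sd=5)    # first section
jack = standardize(82, average=72, sd=10)   # second section

print(tina)  # 2.0
print(jack)  # 1.0
```

Both students are 10 points above their section's average, but Tina is 2 SDs above while Jack is only 1 SD above, so Tina's score stands out more within her section.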

- Convert the following scores in the first section to standard units
- Alice got a 50
- Bob got a 61
- Carol got a 64
- Dan got a 77

-2.8

-0.6

0

2.6

- In Jack’s section, students with grades between 62 and 82 received a B
- What percentage of students in this section received Bs?
- Is this percentage exact?

68.27%

No

- According to the HANES study, the average height of U.S. women was 63.5 inches with an SD of 2.5 inches

- The normal curve is a smooth-curve histogram for normally distributed data
- We can estimate percentages within a given range
- Find the area under the curve between those ranges using the standard normal table

- This sometimes requires cutting and pasting different areas together
- The standard normal table on page A-105 takes a standard score z
- It returns to you the area under the curve between –z and z

- Find the area between –1.2 and 1.2 under the standard normal curve

76.99%

- Find the area between 0 and 1.65 under the standard normal curve

45.055%

- Find the area between 0 and 3.3 under the standard normal curve

49.9515%

- Find the area between –0.35 and 0.95 under the standard normal curve

46.58%

- Find the area between 1.2 and 1.85 under the normal curve

8.29%

- Find the area between –2.1 and –1.05 under the normal curve

12.9%

- Find the area to the right of 1 under the normal curve

15.865%

- Find the area to the left of 0.85 under the normal curve

80.235%
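
The cut-and-paste steps behind several of these answers can be sketched in code. Here `table_area` is our own stand-in for the printed table: it returns the percentage between –z and z, computed with `math.erf` rather than looked up, so results may differ from the table-based answers in the last digit.

```python
import math

def table_area(z):
    """Stand-in for the table: percentage of area between -z and z."""
    return 100 * math.erf(z / math.sqrt(2))

# Between 0 and 1.65: half of the central area.
zero_to_165 = table_area(1.65) / 2
# Between -0.35 and 0.95: half of each central area, pasted together.
mixed = table_area(0.35) / 2 + table_area(0.95) / 2
# Between 1.2 and 1.85: half of the difference of the central areas.
strip = (table_area(1.85) - table_area(1.2)) / 2
# To the right of 1: half of what is left outside the central area.
right_of_1 = (100 - table_area(1)) / 2
# To the left of 0.85: the whole left half plus half the central area.
left_of_085 = 50 + table_area(0.85) / 2

for area in (zero_to_165, mixed, strip, right_of_1, left_of_085):
    print(round(area, 2))
```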

- If a data set is approximately normal in distribution, we can use the normal curve in place of the data set’s histogram
- If you want to estimate the percentage of the data set between two numbers…
- Standardize the numbers to get z scores
- Look each z score up in the standard normal table
- Cut and paste the areas to match the region you originally wanted
- The percentage under the curve will be close to the percentage in the data set

- It is generally helpful to sketch the curve first and shade in the desired area
- This will remind you what the target area is
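
The procedure above (standardize, look up, combine areas) can be sketched as one function. The name `percent_between` is our own, and `statistics.NormalDist` stands in for the printed table. The example reuses the section with average 72 and SD 10, asking for the share of scores between 62 and 82.

```python
from statistics import NormalDist

def percent_between(low, high, average, sd):
    """Normal approximation to the percentage of a data set between low and high."""
    # Step 1: standardize both endpoints to get z scores.
    z_low = (low - average) / sd
    z_high = (high - average) / sd
    # Steps 2 and 3: area to the left of each z score, then subtract.
    standard_normal = NormalDist()
    return 100 * (standard_normal.cdf(z_high) - standard_normal.cdf(z_low))

print(round(percent_between(62, 82, average=72, sd=10), 2))  # 68.27
```

The result is an estimate: it is only as good as the assumption that the data set is approximately normal.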

- According to the HANES study, the average height of U.S. women was 63.5 inches with an SD of 2.5 inches
- What percentage of women have heights between 60 and 68 inches?

88.33%

- According to the HANES study, the average height of U.S. women was 63.5 inches with an SD of 2.5 inches
- What percentage of women are taller than 66 inches?

15.865%

- Sometimes, you will be given a percentage of the data set
- The goal is then to find the score(s) that mark off that percentage

- Adjust the area to “center” it
- Look up the z score associated with that area in the table
- Unstandardize the z score by multiplying it by the SD and adding the average to the product

- For a certain population of high school students, the SAT-M scores are normally distributed with average 500 and SD=100
- A certain engineering college will accept only high school seniors with SAT-M scores in the top 5%
- What is the minimum SAT-M score for this program?

665
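
Going from a percentage back to a score can also be done numerically. In this sketch `NormalDist.inv_cdf` plays the role of reading the table backwards; the printed table only lists z in steps, which is why the table-based answer rounds z to 1.65.

```python
from statistics import NormalDist

average, sd = 500, 100

# Top 5% means 95% of scores lie below the cutoff.
z = NormalDist().inv_cdf(0.95)
# Unstandardize: multiply by the SD, then add the average.
cutoff = average + z * sd

print(round(z, 3))       # 1.645
print(round(cutoff, 1))  # 664.5, i.e. about 665
```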

- One way to determine how large a number is in the data set is to find its percentile rank
- The kth percentile is the value such that k percent of the data set have values below it

- Percentile ranks can be calculated for any data set

- In one year, the 1600-point SAT scores were approximately normal with an average of 1030 and an SD of 190
- If a student scores a 1460, what is her percentile rank?

98th percentile
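
A sketch of the percentile-rank calculation under the normal approximation, again using `statistics.NormalDist` in place of the table:

```python
from statistics import NormalDist

average, sd = 1030, 190
score = 1460

z = (score - average) / sd                 # standardize the score
percent_below = 100 * NormalDist().cdf(z)  # area to the left of z

print(round(z, 2))              # 2.26
print(round(percent_below, 1))  # 98.8, so the 98th percentile
```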

- It is often useful to describe a data set with summary statistics
- The average and median are central tendency statistics
- The average is more sensitive to outliers

- The standard deviation (SD) is the most common summary statistic for describing a data set’s spread
- The SD is calculated by taking the RMS of the deviations from the mean of each data point in the set
- Most of the points in most data sets will lie within one or two SD’s from the average

- We can represent a data set graphically by drawing a histogram
- The percentage of the data set in a bin is the area of the histogram block over that bin
- The height of each block in a histogram is the percentage of the data in the corresponding bin divided by the width of the bin

- The total area under any histogram is 100%
- The average of a data set is located at the balance point of the histogram
- Long tails pull the average in the direction of the tail

- Using the average and SD, we can standardize numbers in the data set
- The standard score (z) of a number is its distance from the average in terms of SD’s
- We can also take a standard score and convert it back to a raw score

- Many data sets are approximately normal
- We can estimate the percentage of points in a data set that fall between two numbers
- Convert the numbers to standard units
- Find the area under the standard normal curve by using the normal table

- If a data set is approximately normal, we can use the normal table to estimate percentile ranks