Descriptive Statistics, Histograms, and Normal Approximations

Math 1680


Overview

  • Obtaining Data Sets

  • Descriptive Statistics

  • Histograms

  • The Normal Curve

  • Standardization

  • Normal Approximation

  • Summary


Obtaining Data Sets

  • Before we can analyze a data set, we need to have a data set

    • How far do you travel to get to class, in miles?

    • How tall are you?

  • Today, numerical data is easily stored and organized (and even analyzed) by several computer programs


Obtaining Data Sets

  • Notice that in its raw form, the data is difficult to deal with

  • By sorting the data, we can get a better picture of its distribution, or shape

    • We are often interested in…

      • Where the data are centered

      • How spread out they are

      • With what frequency numbers appear


Obtaining Data Sets

  • Usually, the entire data set is too large to work with directly

    • We want ways to summarize the data

    • We have quantitative (numerical) and pictorial descriptions available to us

      • Descriptive Statistics

      • Histograms


Descriptive Statistics

  • We can summarize the data set with a few simple numbers, called descriptive (or summary) statistics

  • The first and most often-used summary stat is the average (or mean)

    • Represents the central tendency of the data set

      • Gives an idea of where the bulk of the points lie

    • To calculate the average, add up the values of all of the points and divide by the total number of points in the set


Descriptive Statistics

  • Calculate the average of the following data sets

    • 60 60 60 60 60

    • 18 59 60 63 100

    • 18 35 60 87 100


Descriptive Statistics

  • Despite having the same average, the three data sets are clearly different

    • The average alone usually does not describe data sets uniquely
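A minimal Python sketch of the averaging rule, applied to the three data sets above. The function name `average` is chosen here for illustration and is not from the slides.

```python
def average(values):
    """Add up all the values and divide by how many there are."""
    return sum(values) / len(values)

data_sets = [
    [60, 60, 60, 60, 60],
    [18, 59, 60, 63, 100],
    [18, 35, 60, 87, 100],
]

averages = [average(d) for d in data_sets]
print(averages)  # [60.0, 60.0, 60.0]
```

All three lists balance at 60 even though their values are spread out very differently.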


Descriptive Statistics

  • The median is another central tendency measure

    • The median splits the sorted data in half: half of the data are less than (or equal to) the median, and half are greater than (or equal to) it

      • If there are an odd number of data points, then the median is just the number in the middle of the sorted set

      • Otherwise, the median is the average of the two points in the middle of the sorted set
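The odd/even rule above can be sketched in Python as follows. The function name `median` and the two short lists are made up for illustration and are not from the slides.

```python
def median(values):
    """Middle value of the sorted list, or the average of the two middle values."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd number of points: take the one in the middle.
        return ordered[mid]
    # Even number of points: average the two in the middle.
    return (ordered[mid - 1] + ordered[mid]) / 2

odd_case = median([3, 1, 2])       # sorted: 1 2 3
even_case = median([4, 1, 3, 2])   # sorted: 1 2 3 4
print(odd_case, even_case)  # 2 2.5
```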


Descriptive Statistics

  • Calculate the median of each data set

    • 1 4 5 7 10 15 18


Descriptive Statistics

  • The average is like a balance point

    • It represents the place where the data set is equally “heavy” on both sides

    • If there are outliers on one side of the data set, the average will be skewed

  • The median is more robust

    • What this means is that it is usually less affected by outliers or data entry errors
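A small Python sketch of what robustness means in practice. The grades below are hypothetical, made up for illustration; they are not the class data used in the slides.

```python
import statistics

# Hypothetical grades (an assumption for this example).
grades = [70, 75, 80, 85, 90]
# The same grades with one data entry error: 90 typed as 9.
with_typo = [70, 75, 80, 85, 9]

# How far each summary statistic moves because of the single error.
mean_shift = statistics.mean(grades) - statistics.mean(with_typo)
median_shift = statistics.median(grades) - statistics.median(with_typo)

print(mean_shift, median_shift)
```

In this made-up example, the one bad entry moves the average by about 16 points, while the median moves by only 5.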


Descriptive Statistics

  • In a certain class of 13 students, 10 showed up for the first exam, while 3 blew it off

    • Here are the grades, in order:

  • Calculate the class median…

    • Including all students

    • Not counting those who slept in

79

82.5


Descriptive Statistics

  • In a certain class of 13 students, 10 showed up for the first exam, while 3 blew it off

    • Here are the grades, in order:

  • Calculate the class average…

    • Including all students

    • Not counting those who slept in

62.8

81.7


Descriptive Statistics

  • Suppose the teacher mistyped the grade of 55 as being a 15

    • Not counting the sleepers,

  • What is the new median?

  • What is the new average?

82.5

77.7


Descriptive Statistics

  • Earlier, we saw that the average did not necessarily uniquely describe a data set

  • We use the standard deviation (SD) to measure spread in a data set

  • When paired, the average and SD are highly effective summary statistics


Descriptive Statistics

  • The Root-Mean-Square (RMS) measures the typical absolute value of data points in a set

    • Calculated by reading its name backwards

      • Square all entries in the data set

      • Take their mean

      • Take the square root of that mean

  • Find the average and then the RMS size of the numbers of the list

Average = 0

RMS = 4
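A sketch of the "read the name backwards" recipe in Python. The list in this example is hypothetical: it was picked so that the average is 0 and the RMS is 4, and it is not necessarily the list used on the slide.

```python
import math

def rms(values):
    """Root-mean-square: square the entries, take their mean, then the square root."""
    squares = [v * v for v in values]            # Square
    mean_square = sum(squares) / len(squares)    # Mean
    return math.sqrt(mean_square)                # Root

# Hypothetical list, chosen only for illustration.
example = [4, -4, 4, -4]
example_average = sum(example) / len(example)
example_rms = rms(example)
print(example_average, example_rms)  # 0.0 4.0
```

The average is 0 because the signs cancel, but the RMS ignores sign and reports the typical size of the entries.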


Descriptive Statistics

  • The SD embodies the same concept of “typical” distance

    • Where the RMS measures typical distance from 0, the SD measures typical distance from the data set’s average

    • This is accomplished by subtracting the average from every data point and then taking the RMS of the differences (or deviations from the mean)


Descriptive Statistics

  • 1 4 5 7 10 15 has an average of 7

  • The deviations are then -6 -3 -2 0 3 8

    • Note how the subtraction process re-centers the data set so that the average is at 0


Descriptive Statistics

  • Taking the RMS of the deviations gives the standard deviation

    • Normally, about two thirds to three quarters of a data set should be within one SD of the mean

  • 1 4 5 7 10 15 has an average of 7 and an SD of about 4.5


Descriptive Statistics

1 4 5 7 10 15

Average = (1+4+5+7+10+15)/6 = 7

Deviations: 1-7 4-7 5-7 7-7 10-7 15-7 = -6 -3 -2 0 3 8

Squared deviations: (-6)² (-3)² (-2)² 0² 3² 8² = 36 9 4 0 9 64

Mean of the squares: (36+9+4+0+9+64)/6 = 122/6 ≈ 20.3

SD = √(20.3) ≈ 4.5
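The same calculation as a short Python sketch: subtract the average from every point, then take the RMS of the deviations. The function name `sd` is chosen here for illustration.

```python
import math

def sd(values):
    """SD as the RMS of the deviations from the average (divides by n)."""
    avg = sum(values) / len(values)
    deviations = [v - avg for v in values]
    mean_square = sum(d * d for d in deviations) / len(deviations)
    return math.sqrt(mean_square)

result = sd([1, 4, 5, 7, 10, 15])
print(round(result, 1))  # 4.5
```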


Descriptive Statistics

  • What we computed on the previous slide is the SD of the list itself, dividing by n (where n is the number of points). However, if the list is a sample and the goal is to estimate the SD of a larger population, we divide by n-1 instead of n and call the result the sample SD.

  • Most calculators actually calculate the sample SD.

  • In general, the higher a set’s SD, the more spread out its points are

  • An SD of 0 indicates that every point in the data set has the same value
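To see the difference between dividing by n and by n-1, here is a sketch using Python's standard `statistics` module on the list 1 4 5 7 10 15. In that module, `pstdev` divides by n and `stdev` divides by n-1.

```python
import statistics

data = [1, 4, 5, 7, 10, 15]

sd_divide_by_n = statistics.pstdev(data)         # divides by n
sd_divide_by_n_minus_1 = statistics.stdev(data)  # divides by n - 1

print(round(sd_divide_by_n, 2), round(sd_divide_by_n_minus_1, 2))  # 4.51 4.94
```

The n-1 version is always a little larger, and the gap shrinks as the number of points grows.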


Descriptive Statistics

  • Calculate the SDs of the data sets

    • 60 60 60 60 60

    • 18 59 60 63 100

    • 18 35 60 87 100

0

26.0

30.7
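As a check on this exercise, here is a sketch that applies `statistics.pstdev` (which divides by n) to the three data sets from the earlier averaging exercise.

```python
import statistics

data_sets = [
    [60, 60, 60, 60, 60],
    [18, 59, 60, 63, 100],
    [18, 35, 60, 87, 100],
]

sds = [round(statistics.pstdev(d), 1) for d in data_sets]
print(sds)  # [0.0, 26.0, 30.7]
```

All three lists have an average of 60, so the SD is what tells them apart.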


Histograms

  • Often, we would prefer a pictorial representation of a data set to a two-number summary

  • The most common way to graphically represent a data set is to draw a frequency histogram (or just histogram)


Histograms

  • Histograms tend to look like city skylines

    • In a histogram, the area under the curve between two points on the horizontal axis represents the proportion of data points between those two points

    • Continuing the city skyline analogy, the size of the building determines how many people live there

      • A long, low building can house as many people as a thin skyscraper


Histograms

  • To draw a histogram, we first need to organize our data into bins (or class intervals)

    • Often, the bins are dictated to us

    • If we get to choose them, we try to pick the bins so that they give a fair representation of the data

  • Then mark a horizontal axis with the bin values, spacing them correctly


Histograms

  • Often, data is given in percentage form

    • If not, divide the number of points in the bin by the number of points in the data set to get the percentage

  • Draw a box for each bin so that the area of the box is the percentage of the data in that bin

    • To get the correct height of the box, divide the percentage of the data in the bin by the width of the bin
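A sketch of the box-height rule in Python. The bin edges and counts are hypothetical, made up for illustration; note that the last bin is twice as wide as the others.

```python
# Hypothetical bins and counts (assumptions for this example).
bin_edges = [0, 10, 20, 40]   # bins: 0-10, 10-20, 20-40
counts = [5, 10, 5]           # number of data points in each bin

total = sum(counts)
heights = []
areas = []
for left, right, count in zip(bin_edges, bin_edges[1:], counts):
    percent = 100 * count / total   # percentage of the data in this bin
    width = right - left
    heights.append(percent / width) # height = percentage / width
    areas.append(percent)           # area of the box = percentage in the bin

print(heights)     # [2.5, 5.0, 1.25]
print(sum(areas))  # 100.0
```

The first and last bins hold the same share of the data (25% each), but the wider bin gets a box half as tall, so the two areas match.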


Histograms

  • Note that the average and median can be visually located on a histogram

    • If the histogram were balanced on a see-saw, the fulcrum would meet the histogram at the average

    • If you draw a vertical line through the histogram so that it splits the area in half, then the line passes through the median

  • On a symmetric histogram, the average and median tend to coincide

  • Asymmetric tails pull the average in the direction of the tail


The Normal Curve

  • A great many data sets have similarly-shaped histograms

    • SAT scores

    • Attendance at baseball games

    • Battery life

    • Cash flow of a bank

    • Heights of adult males/females


The Normal Curve

  • These histograms are similar to one generated by a very special distribution

    • It is called the normal distribution, and it is identified by two parameters we are already familiar with

      • average

      • standard deviation


The Normal Curve

  • This is the standard normal curve, where the average is 0 and the SD is 1


The Normal Curve

  • Though the equation used to draw the curve is not easy to work with, there is a table of values for the standard normal distribution

    • We will use this table to find areas under the curve

    • The table is on page A-105 of your text


The Normal Curve

  • Properties of the standard normal curve

    • The curve is “bell-shaped” with its highest point at 0

    • It is symmetric about a vertical line through 0

    • The curve approaches the horizontal axis, but the curve and the horizontal axis never meet


The Normal Curve

  • Area underneath the standard normal curve

    • Half the area lies to the left of 0; half lies to the right

    • Approximately 68% of the area lies between –1 and 1

    • Approximately 95% of the area lies between –2 and 2

    • Approximately 99.7% of the area lies between –3 and 3
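These three areas can be checked numerically. The sketch below relies on the standard identity that the area under the standard normal curve between -z and z equals erf(z/√2); the function name `central_area` is chosen here for illustration.

```python
import math

def central_area(z):
    """Percentage of the area under the standard normal curve between -z and z."""
    return 100 * math.erf(z / math.sqrt(2))

for z in (1, 2, 3):
    print(z, round(central_area(z), 2))
# 1 68.27
# 2 95.45
# 3 99.73
```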


Standardization

  • Most data sets do not have a mean of 0 and an SD of 1

  • To be able to use the standard normal curve, we’ll need to standardize numbers in the original data set

    • To standardize a number, subtract the data set’s average and then divide the difference by the data set’s SD

    • Standardizing is basically a change of scale

      • Like converting feet to miles


Standardization

  • Suppose there are two different sections of the same course

    • The scores for the midterm in each section were approximately normally distributed

      • In the first section, the average was 64 and the standard deviation was 5

        • Tina scored a 74 in the first section

      • In the second section, the average was 72 and the standard deviation was 10

        • Jack scored an 82 in the second section

    • Which of the two scores is more impressive, relative to the other students in that section?


Standardization

  • Convert the following scores in the first section to standard units

    • Alice got a 50

    • Bob got a 61

    • Carol got a 64

    • Dan got a 77

-2.8

-0.6

0

2.6
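A sketch of standardization in Python, using the first section's average of 64 and SD of 5. The function name `standardize` is chosen here for illustration.

```python
def standardize(value, average, sd):
    """Subtract the average, then divide the difference by the SD."""
    return (value - average) / sd

scores = {"Alice": 50, "Bob": 61, "Carol": 64, "Dan": 77}
z_scores = {name: standardize(score, 64, 5) for name, score in scores.items()}
print(z_scores)  # {'Alice': -2.8, 'Bob': -0.6, 'Carol': 0.0, 'Dan': 2.6}

# Tina (first section) and Jack (second section: average 72, SD 10).
tina_z = standardize(74, 64, 5)
jack_z = standardize(82, 72, 10)
print(tina_z, jack_z)  # 2.0 1.0
```

Both Tina and Jack scored 10 points above their section's average, but in standard units Tina's score stands out more.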


Standardization

  • In Jack’s section, students with grades between 62 and 82 received a B

    • What percentage of students in this section received Bs?

    • Is this percentage exact?

68.27%

No


Normal Approximation

  • According to the HANES study, the average height of U.S. women was 63.5 inches with an SD of 2.5 inches


Normal Approximation

  • The normal curve is a smooth-curve histogram for normally distributed data

    • We can estimate percentages within a given range

      • Find the area under the curve over that range using the standard normal table


Normal Approximation

  • This will sometimes require cutting and pasting different areas together

    • The standard normal table on page A-105 takes a standard score z

    • It returns to you the area under the curve between –z and z
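A sketch of "cutting and pasting" in Python. Here `table_area` is a stand-in for the printed table (an assumption: it computes the area between -z and z from the identity erf(z/√2) rather than reading the text's table), and z = 1.5 is a made-up example value.

```python
import math

def table_area(z):
    """Stand-in for the table: percent of area between -z and z, for z >= 0."""
    return 100 * math.erf(z / math.sqrt(2))

z = 1.5  # hypothetical standard score

middle = table_area(z)              # area between -1.5 and 1.5
zero_to_z = middle / 2              # by symmetry, half of it lies between 0 and 1.5
right_of_z = (100 - middle) / 2     # what is left over, split evenly between two tails
left_of_z = 100 - right_of_z        # everything except the right tail

print(round(middle, 2), round(zero_to_z, 2), round(right_of_z, 2), round(left_of_z, 2))
# 86.64 43.32 6.68 93.32
```

Every other region can be built the same way, by adding or subtracting pieces like these.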


Normal Approximation

  • Find the area between –1.2 and 1.2 under the standard normal curve

76.99%


Normal Approximation

  • Find the area between 0 and 1.65 under the standard normal curve 

45.055%


Normal Approximation

  • Find the area between 0 and 3.3 under the standard normal curve

49.9515%


Normal Approximation

  • Find the area between –0.35 and 0.95 under the standard normal curve

46.58%


Normal Approximation

  • Find the area between 1.2 and 1.85 under the normal curve

8.29%


Normal Approximation

  • Find the area between –2.1 and –1.05 under the normal curve

12.9%


Normal Approximation

  • Find the area to the right of 1 under the normal curve

15.865%


Normal Approximation

  • Find the area to the left of 0.85 under the normal curve

80.235%


Normal Approximation

  • If a data set is approximately normal in distribution, we can use the normal curve in place of the data set’s histogram

  • If you want to estimate the percentage of the data set between two numbers…

    • Standardize the numbers to get z scores

    • Look each z score up in the standard normal table

    • Cut and paste the areas to match the region you originally wanted

      • The percentage under the curve will be close to the percentage in the data set
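The steps above can be sketched in Python with `statistics.NormalDist` from the standard library. This swaps in a different lookup than the text's table: `cdf` returns the area to the left of a z score, so the "cut and paste" step becomes a single subtraction. The function name `percent_between` is chosen here for illustration, and the example reuses the second section's midterm scores (average 72, SD 10).

```python
from statistics import NormalDist

def percent_between(low, high, average, sd):
    """Normal-approximation estimate of the percent of the data between low and high."""
    # Step 1: standardize both endpoints.
    z_low = (low - average) / sd
    z_high = (high - average) / sd
    # Steps 2 and 3: area to the left of z_high minus area to the left of z_low.
    standard = NormalDist()  # average 0, SD 1
    return 100 * (standard.cdf(z_high) - standard.cdf(z_low))

print(round(percent_between(62, 82, 72, 10), 2))  # 68.27
```

The result is an estimate: it is only as good as the assumption that the data are approximately normal.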


Normal Approximation

  • It is generally helpful to sketch the curve first and shade in the desired area

    • This will remind you what the target area is


Normal Approximation

  • According to the HANES study, the average height of U.S. women was 63.5 inches with an SD of 2.5 inches

    • What percentage of women have heights between 60 and 68 inches?

88.33%


Normal Approximation

  • According to the HANES study, the average height of U.S. women was 63.5 inches with an SD of 2.5 inches

    • What percentage of women are taller than 66 inches?

15.865%


Normal Approximation

  • Sometimes, you will be given the percentage of the data set

    • You want to find the score(s) that mark off that percentage

  • Adjust the area to “center” it

    • Look up the z score associated with that area in the table

    • Unstandardize the z score by multiplying it by the SD and adding the average to the product


Normal Approximation

  • For a certain population of high school students, the SAT-M scores are normally distributed with average 500 and SD=100

    • A certain engineering college will accept only high school seniors with SAT-M scores in the top 5%

    • What is the minimum SAT-M score for this program?

665
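A sketch of the same calculation in Python. Instead of the text's table, it uses `NormalDist().inv_cdf` from the standard library, which takes the area to the left of a cutoff and returns the z score, so "top 5%" is entered as 0.95.

```python
from statistics import NormalDist

average, sd = 500, 100

# The top 5% starts where 95% of the area lies to the left.
z_cutoff = NormalDist().inv_cdf(0.95)

# Unstandardize: multiply by the SD, then add the average.
min_score = average + z_cutoff * sd

print(round(z_cutoff, 3), round(min_score, 1))  # 1.645 664.5
```

A table lookup with z rounded to 1.65 gives 665, which agrees to within a point.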


Normal Approximation

  • One way to determine how large a number is in the data set is to find its percentile rank

    • The kth percentile is the value so that k percent of the data set have values below it

  • Percentile ranks can be calculated for any data set


Normal Approximation

  • In one year, the 1600-point SAT scores were approximately normal with an average of 1030 and an SD of 190

    • If a student scores a 1460, what is her percentile rank?

98th percentile
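A sketch of this percentile-rank estimate in Python, again using `NormalDist().cdf` (area to the left of a z score) in place of the text's table.

```python
from statistics import NormalDist

average, sd = 1030, 190
score = 1460

z = (score - average) / sd                  # standardize the score
percent_below = 100 * NormalDist().cdf(z)   # percent of scores below it

print(round(z, 2), round(percent_below, 1))  # 2.26 98.8
```

About 98.8% of scores fall below 1460 under the normal approximation, consistent with the 98th percentile.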


Summary

  • It is often useful to describe a data set with summary statistics

    • The average and median are central tendency statistics

      • The average is more sensitive to outliers

  • The standard deviation (SD) is the most common summary statistic for describing a data set’s spread

    • The SD is calculated by taking the RMS of the deviations from the mean of each data point in the set

    • Most of the points in most data sets will lie within one or two SDs of the average


Summary

  • We can represent a data set graphically by drawing a histogram

    • The percentage of the data set in a bin is the area of the histogram over that bin

    • The height of each block in a histogram is the percentage of the data in the corresponding bin divided by the width of the bin

  • The total area under any histogram is 100%

  • The average of a data set is located at the balance point of the histogram

    • Long tails pull the average in the direction of the tail


Summary

  • Using the average and SD, we can standardize numbers in the data set

    • The standard score (z) of a number is its distance from the average, measured in SDs (negative if the number is below the average)

    • We can also take a standard score and convert it back to a raw score


Summary

  • Many data sets are approximately normal

    • We can estimate the percentage of points in a data set that fall between two numbers

      • Convert the numbers to standard units

      • Find the area under the standard normal curve by using the normal table

    • If a data set is approximately normal, we can use the normal table to estimate percentile ranks
