Exploratory Data Analysis: One Variable

FPP 3-6




Plan of attack
  • Distinguish different types of variables
  • Summarize data numerically
  • Summarize data graphically
  • Use theoretical distributions to potentially learn more about a variable.
The five steps of statistical analyses
  • Form the question
  • Collect data
  • Model the observed data
    • We start with exploratory techniques.
  • Check the model for reasonableness
  • Make and present conclusions
Just to make sure we are on the same page
  • More (or repeated) vocabulary
    • Individuals are the objects described by a set of data
      • examples: employees, lab mice, states…
    • A variable is any characteristic of an individual that is of interest to the researcher. It takes on different values for different individuals
      • examples: age, salary, weight, location…
    • How is this different from a mathematical variable?
Just to make sure we are on the same page #2
  • Measurement: the value of a variable obtained and recorded on an individual
    • Example: 145 recorded as a person’s weight, 65 recorded as the height of a tree, etc.
  • Data is a set of measurements made on a group of individuals
  • The distribution of a variable tells us what values it takes and how often it takes these values
Two Types of Variables
  • a categorical/qualitative variable places an individual into one of several groups or categories
    • examples:
      • Gender, Race, Job Type, Geographic location…
    • JMP calls these variables nominal
  • a quantitative variable takes numerical values for which arithmetic operations such as adding and averaging make sense
    • examples:
      • Height, Age, Salary, Price, Cost…
    • Can be further divided into ordinal and continuous
  • Why two types?
    • Both require their own summaries (graphically and numerically) and analysis.
  • I can’t emphasize enough the importance of identifying the type of variable being considered before proceeding with any type of statistical analysis
Example
  • Age: quantitative
  • Gender: categorical
  • Race: categorical
  • Salary: quantitative
  • Job type: categorical
Variable types in JMP
  • Qualitative/categorical
    • JMP uses Nominal
  • Quantitative
    • Discrete
      • JMP uses Ordinal
    • Continuous
      • JMP uses Continuous
Exploratory data analysis
  • Statistical tools that help examine data in order to describe their main features
  • Basic strategy
    • Examine variables one by one, then look at the relationships among the different variables
    • Start with graphs, then add numerical summaries of specific aspects of the data
Exploratory data analysis: One variable
  • Graphical displays
    • Qualitative/categorical data: bar chart, pie chart, etc.
    • Quantitative data: histogram, stem-leaf, boxplot, timeplot etc.
  • Summary statistics
    • Qualitative/categorical: contingency tables
    • Quantitative: mean, median, standard deviation, range etc.
  • Probability models
    • Qualitative: Binomial distribution (others we won’t cover in this class)
    • Quantitative: Normal curve (others we won’t cover in this class)
Summary table
  • We summarize categorical data using a table. Note that percentages are often called relative frequencies.
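A summary table of this kind can be sketched in a few lines of Python. The degree labels below are made-up illustrative data, not the counts shown on the slide; the variable names are assumptions as well.

```python
from collections import Counter

# Hypothetical categorical data: highest degree for ten individuals
degrees = ["BA", "MA", "BA", "PhD", "HS", "BA", "MA", "HS", "BA", "MA"]

counts = Counter(degrees)   # count per category
n = sum(counts.values())    # total number of individuals

# Relative frequency = count / total
summary = {cat: (k, k / n) for cat, k in counts.items()}

for cat, (k, rel) in sorted(summary.items()):
    print(f"{cat:>4}  count={k:2d}  relative frequency={rel:.0%}")
```

The relative frequencies always add up to 1 (100%), which is a quick check that no category was left out.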
Bar graph
  • The bar graph quickly compares the degrees of the four groups
  • The heights of the four bars show the counts for the four degree categories
Pie chart
  • A pie chart helps us see what part of the whole each group forms
  • To make a pie chart, you must include all the categories that make up a whole
Summary of categorical variables
  • Graphically
    • Bar graphs, pie charts
      • A bar graph is nearly always preferable to a pie chart. It is easier to compare bar heights than slices of a pie
  • Numerically: tables with total counts or percents
Quantitative variables
  • Graphical summary
    • Histogram
    • Stemplots
    • Time plots
    • more
  • Numerical summary
    • Mean
    • Median
    • Quartiles
    • Range
    • Standard deviation
    • more
Histograms

  • The bins are:
    • 3.0 ≤ rate < 4.0
    • 4.0 ≤ rate < 5.0
    • 5.0 ≤ rate < 6.0
    • 6.0 ≤ rate < 7.0
    • 7.0 ≤ rate < 8.0
    • 8.0 ≤ rate < 9.0
    • 9.0 ≤ rate < 10.0
    • 10.0 ≤ rate < 11.0
    • 11.0 ≤ rate < 12.0
    • 12.0 ≤ rate < 13.0
    • 13.0 ≤ rate < 14.0
    • 14.0 ≤ rate < 15.0

Histograms

  • The bins are:
    • 2.0 ≤ rate < 4.0
    • 4.0 ≤ rate < 6.0
    • 6.0 ≤ rate < 8.0
    • 8.0 ≤ rate < 10.0
    • 10.0 ≤ rate < 12.0
    • 12.0 ≤ rate < 14.0
    • 14.0 ≤ rate < 16.0
    • 16.0 ≤ rate < 18.0

Histograms
  • Where did the bins come from?
    • They were chosen rather arbitrarily
  • Does choosing other bins change the picture?
    • Yes!! And sometimes dramatically
  • What do we do about this?
    • Some pretty smart people have come up with some “optimal” bin widths and we will rely on their suggestions
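To see how the choice of bins changes the picture, here is a minimal sketch that counts the same observations with bins of width 1 and then width 2. The `rates` values are made up for illustration and `histogram_counts` is a hypothetical helper, not something from the slides.

```python
import math
from collections import Counter

# Hypothetical rates (made-up values for illustration only)
rates = [3.2, 4.1, 4.8, 5.5, 5.9, 6.3, 6.4, 7.7, 8.0, 9.6, 11.2, 14.5]

def histogram_counts(data, start, width):
    """Count observations per bin [start + k*width, start + (k+1)*width)."""
    counts = Counter(math.floor((x - start) / width) for x in data)
    return {
        (start + k * width, start + (k + 1) * width): c
        for k, c in sorted(counts.items())
    }

narrow = histogram_counts(rates, start=3.0, width=1.0)  # bins 3-4, 4-5, ...
wide = histogram_counts(rates, start=2.0, width=2.0)    # bins 2-4, 4-6, ...

for label, table in (("width 1", narrow), ("width 2", wide)):
    print(label)
    for (lo, hi), c in table.items():
        print(f"  {lo:4.1f} <= rate < {hi:4.1f}: {'*' * c}")
```

With the narrow bins the counts look fairly flat; with the wide bins a peak between 4 and 6 stands out, even though the data are identical.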
Histogram
  • The purpose of a graph is to help us understand the data
  • After you make a graph, always ask, “What do I see?”
  • Once you have displayed a distribution you can see the important features
Histograms
  • We will describe the features of the distribution that the histogram is displaying with three characteristics
  • Shape
    • Symmetric, skewed right, skewed left, uni-modal, multi-modal, bell shaped
  • Center
    • Mean, median
  • Spread (outliers or not)
    • Standard deviation, Inter-quartile range
Histogram vs. Bar graph
  • Spaces mean something in histograms but not in bar graphs
  • Shape means nothing with bar graphs
  • The biggest difference is that they are displaying fundamentally different types of variables
Time Plots
  • Many variables are measured at intervals over time
    • Examples
      • Closing stock prices
      • Number of hurricanes
      • Unemployment rates
  • If the interest in a variable is to see how it changes over time, use a time plot
Time Plots
  • Patterns to look for
    • Patterns that repeat themselves at known regular intervals of time are called seasonal variation
    • A trend is a persistent, long-term rise or fall
Time plots
  • Figure: number of hurricanes each year from 1970 to 1990

Numerical summaries of quantitative variables
  • Want a numerical summary for center and spread
    • Center
      • Mean
      • Median
      • Mode
    • Spread
      • Range
      • Inter-quartile range
      • Standard deviation
    • 5 number summary is a popular collection of the following
      • min, 1st quartile, median, 3rd quartile, max
Mean
  • To find the mean of a set of observations, add their values and divide by the number of observations
    • equation 1: $\bar{x} = \frac{x_1 + x_2 + \cdots + x_n}{n}$
    • equation 2: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
Mean example
  • The average age of 20 people in a room is 25. A 28 year old leaves while a 30 year old enters the room.
    • Does the average age change?
    • If so, what is the new average age?
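One way to reason about this example is through the total: the mean times the number of people gives the sum of the ages, and the swap changes that sum by a known amount. A short sketch of the arithmetic (the variable names are assumptions):

```python
n = 20
old_mean = 25

total = n * old_mean          # sum of all ages before the swap
new_total = total - 28 + 30   # a 28-year-old leaves, a 30-year-old enters
new_mean = new_total / n      # the room still holds 20 people

print(new_mean)
```

The individual ages are never needed; only the total and the count matter for the mean.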
Median
  • The median is the midpoint of a distribution
    • The number such that half the observations are smaller and the other half are larger
    • Also called the 50th percentile or 2nd quartile
  • To compute a median
    • Order observations
    • If number of observations is odd the median is the center observation
    • If number of observations is even the median is the average of the two center observations
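The steps above can be sketched as a small function. The name `median` and the sample lists are illustrative choices, not part of the slides.

```python
def median(values):
    """Median following the steps above: order, then take the middle."""
    ordered = sorted(values)   # step 1: order the observations
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # odd count: the center observation
        return ordered[mid]
    # even count: average of the two center observations
    return (ordered[mid - 1] + ordered[mid]) / 2

print(median([3, 1, 2]))      # odd number of observations
print(median([4, 1, 3, 2]))   # even number of observations
```

Python's standard library offers `statistics.median`, which follows the same rule.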
Median example
  • The median age of 20 people in a room is 25. A 28 year old leaves while a 30 year old enters the room.
    • Does the median age change?
    • If so, what is the new median age?
  • The median age of 21 people in a room is 25. A 28 year old leaves while a 30 year old enters the room.
    • Does the median age change?
    • If so, what is the new median age?
Mean vs Median
  • When the histogram is symmetric, the mean and median are similar
  • The mean and median differ when the histogram is skewed
    • Skewed to the right: the mean is larger than the median
    • Skewed to the left: the mean is smaller than the median
  • The business magazine Forbes estimates that the “average” household wealth of its readers is either about $800,000 or about $2.2 million, depending on which “average” it reports. Which of these numbers is the mean wealth and which is the median wealth? Why?
Mean vs Median
  • Symmetric distribution
Mean vs Median
  • Right skewed distribution
Mean vs Median
  • Left skewed distribution
Extreme example
  • Income in small town of 6 people

$25,000 $27,000 $29,000

$35,000 $37,000 $38,000

  • Mean is $31,833 and median is $32,000
  • Bill Gates moves to town

$25,000 $27,000 $29,000

$35,000 $37,000 $38,000 $40,000,000

  • Mean is $5,741,571 median is $35,000
  • Mean is pulled by the outlier while the median is not. The median is a better measure of center for these data
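The effect of the outlier can be reproduced with Python's `statistics` module, using the incomes listed in this example. The variable names are assumptions.

```python
from statistics import mean, median

# Incomes of the six original residents
incomes = [25_000, 27_000, 29_000, 35_000, 37_000, 38_000]

# The same town after one very large income is added
with_outlier = incomes + [40_000_000]

for label, data in (("before", incomes), ("after", with_outlier)):
    print(f"{label}: mean = {mean(data):,.0f}, median = {median(data):,.0f}")
```

One extra observation moves the mean by millions of dollars, while the median only shifts to the next value in the ordered list.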
Is a central measure enough?
  • A warm, stable climate greatly affects some individuals’ health. Atlanta and San Diego have about equal average temperatures (62° vs. 64°). If a person’s health requires a stable climate, in which city would you recommend they live?
Measures of spread
  • Range:
    • subtract the smallest value from the largest
  • Inter-quartile range:
    • subtract the 1st quartile from the 3rd quartile
  • Standard Deviation (SD):
    • “average” distance from the mean
  • Which one should we use?
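As a rough sketch, the range and inter-quartile range can be computed as follows. The data are hypothetical, and the quartiles come from `statistics.quantiles` with `method="inclusive"`; quartile conventions differ between textbooks and software (JMP included), so other tools may report slightly different values.

```python
from statistics import quantiles

data = [1, 2, 3, 4, 5, 6, 7, 8, 9]   # hypothetical observations

# Range: largest value minus smallest value
data_range = max(data) - min(data)

# Quartiles (one of several conventions), then IQR = Q3 - Q1
q1, q2, q3 = quantiles(data, n=4, method="inclusive")
iqr = q3 - q1

print("range:", data_range)
print("quartiles:", q1, q2, q3)
print("IQR:", iqr)
```

The range depends only on the two most extreme observations, while the IQR ignores the top and bottom quarters of the data, which makes it much less sensitive to outliers.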
Standard Deviation
  • The standard deviation looks at how far observations are from their mean
  • It is the square root of the average squared deviations from the mean
  • Compute distance of each value from mean
  • Square each of these distances
  • Take the average of these squares and square root
  • Often we will use SD to denote standard deviation
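The recipe above translates almost line by line into code. This sketch divides by the number of observations, as the recipe says; many software packages divide by one less than that by default, so their output can differ slightly. The function name `sd` and the sample list are assumptions.

```python
import math

def sd(values):
    """Square root of the average squared deviation from the mean."""
    n = len(values)
    mean = sum(values) / n
    deviations = [x - mean for x in values]   # distance of each value from the mean
    squares = [d ** 2 for d in deviations]    # square each distance
    return math.sqrt(sum(squares) / n)        # average the squares, then square root

print(sd([2, 4, 4, 4, 5, 5, 7, 9]))
```

The standard library's `statistics.pstdev` computes the same quantity.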
Standard deviation
  • Order these histograms by the SD of the numbers they portray. Go from smallest to largest
  • What is a reasonable guess of the SD for each?
Problem from text (p. 74, #2)
  • Which of the following sets of numbers has the smaller SD?
    • a) 50, 40, 60, 30, 70, 25, 75
    • b) 50, 40, 60, 30, 70, 25, 75, 50, 50, 50
  • Repeat for these two sets
    • c) 50, 40, 60, 30, 70, 25, 75
    • d) 50, 40, 60, 30, 70, 25, 75, 99, 1
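After guessing, the answers can be checked numerically. This sketch uses `statistics.pstdev`, which divides by the number of observations; the dictionary layout is just an illustrative choice.

```python
from statistics import mean, pstdev

sets = {
    "a": [50, 40, 60, 30, 70, 25, 75],
    "b": [50, 40, 60, 30, 70, 25, 75, 50, 50, 50],
    "c": [50, 40, 60, 30, 70, 25, 75],
    "d": [50, 40, 60, 30, 70, 25, 75, 99, 1],
}

for name, values in sets.items():
    print(f"{name}) mean = {mean(values):.1f}, SD = {pstdev(values):.2f}")
```

All four lists have the same mean, so the comparison comes down to whether the added values sit at the mean (which pulls the SD down) or far from it (which pushes the SD up).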

More intuition behind the SD
  • This is a variance contest. You must give a list of six numbers chosen from the whole numbers 0, 1, 2, 3, 4, 5, 6, 7, 8, 9 with repeats allowed.
  • Give a list of six numbers with the largest standard deviation such a list described above can possibly have.
  • Give a list of six numbers with the smallest standard deviation such a list can possibly have.
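Since there are only a few thousand distinct lists of six digits, the contest can be settled by brute force once you have made your own guess. The approach below is an illustrative sketch, not a method from the slides; it tries every list (ignoring order) and keeps the ones with the largest and smallest SD.

```python
from itertools import combinations_with_replacement
from statistics import pstdev

# Every list of six numbers from 0..9, repeats allowed, order ignored
candidates = list(combinations_with_replacement(range(10), 6))

largest = max(candidates, key=pstdev)
smallest = min(candidates, key=pstdev)

print("largest SD: ", largest, pstdev(largest))
print("smallest SD:", smallest, pstdev(smallest))
```

The winning list for the largest SD puts half of the numbers at each extreme; for the smallest SD, any list of six identical numbers ties, and the code simply reports the first one it finds.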
Properties of SD
  • SD ≥ 0. (When is SD = 0?)
  • Has the same unit of measurement as the original observations
  • Inflated by outliers
Mean and SD
  • What happens to the mean if you add 5 to every number in a list?
  • What happens to the SD?
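A quick experiment can confirm your answer. The list below is hypothetical; any list of numbers would do.

```python
from statistics import mean, pstdev

values = [1, 3, 4, 8]                 # hypothetical list
shifted = [x + 5 for x in values]     # add 5 to every number

print("mean:", mean(values), "->", mean(shifted))
print("SD:  ", pstdev(values), "->", pstdev(shifted))
```

Adding a constant slides the whole list along the number line: the center moves by that constant, but the distances between the values and their mean do not change.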
Standard deviation
  • SDs are like measurement units on a ruler
  • Any quantitative variable can be converted into “standardized” units
    • These are often called z-scores and are denoted by the letter z
  • Important formula: $z = \frac{\text{value} - \text{mean}}{\text{SD}}$
  • Example
    • ACT versus SAT scores
    • Which is more impressive
      • A 1340 on the SAT, or a 32 on the ACT?
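Comparing the two scores requires the mean and SD of each test, which this transcript does not give. The sketch below uses placeholder values that are purely hypothetical (they are not actual SAT or ACT statistics) to show how standardized units make the comparison possible.

```python
def z_score(value, mean, sd):
    """Number of SDs that `value` lies above (+) or below (-) the mean."""
    return (value - mean) / sd

# HYPOTHETICAL means and SDs, for illustration only
sat_mean, sat_sd = 1000, 200
act_mean, act_sd = 21, 5

z_sat = z_score(1340, sat_mean, sat_sd)   # (1340 - 1000) / 200
z_act = z_score(32, act_mean, act_sd)     # (32 - 21) / 5

print(f"SAT z = {z_sat:.2f}, ACT z = {z_act:.2f}")
```

Under these assumed values, the score with the larger z is further above its own test's average, and in that sense more impressive; with the real means and SDs the conclusion could differ.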
The normal curve
  • When histogram looks like a bell-shaped curve, z-scores are associated with percentages
  • The percentage of the data in between two different z-score values equals the area under the normal curve in between the two z-score values
  • A bit of notation here.
    • N(, ) is short hand for writing normal curve with mean  and standard deviation  (get used to this notation as it will be used fairly regularly through out the course)
Properties of normal curve
  • In the Normal distribution with mean μ and standard deviation σ:
    • 68% of the observations fall within 1σ of μ
    • 95% of the observations fall within 2σ of μ
    • 99.7% of the observations fall within 3σ of μ
  • By remembering these numbers, you can think about Normal curves without constantly making detailed calculations
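These three percentages can be checked against the area under the standard normal curve using `statistics.NormalDist` from Python's standard library. The helper name `within` is an assumption.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0, sigma=1)

def within(k):
    """Area under the normal curve between -k and +k SDs from the mean."""
    return std_normal.cdf(k) - std_normal.cdf(-k)

for k in (1, 2, 3):
    print(f"within {k} SD: {within(k):.4f}")
```

The exact areas are close to, but not exactly, 68%, 95%, and 99.7%; the rounded figures are what make the rule easy to remember.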
Properties of normal curves
  • For a N(0,1) the following holds
IQ
  • A person is considered to have mental retardation when
    • IQ is below 70
    • Significant limitations exist in two or more adaptive skill areas
    • Condition is present from childhood
  • What percentage of people have an IQ that meets the first criterion of mental retardation?
IQ
  • A histogram of all people’s IQ scores has mean μ = 100 and SD σ = 16
  • How to get % of people with IQ < 70
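One way to get this percentage is to convert 70 to a z-score and look up the area to its left under the normal curve. The sketch below assumes IQ scores follow N(100, 16) and uses `statistics.NormalDist` in place of a printed normal table.

```python
from statistics import NormalDist

iq = NormalDist(mu=100, sigma=16)   # assumes IQ follows the normal curve

z = (70 - iq.mean) / iq.stdev       # 70 expressed in standardized units
below_70 = iq.cdf(70)               # area under the curve to the left of 70

print(f"z = {z:.3f}")
print(f"share with IQ < 70: {below_70:.1%}")
```

The result is a little above the 2.5% that the 68-95-99.7 rule gives for 2 SDs below the mean, because 70 is slightly less than 2 SDs below 100.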
More IQ
  • Reggie Jackson, one of the greatest baseball players ever, has an IQ of 140. What percentage of people have bigger IQs than Reggie?
  • Marilyn vos Savant, self-proclaimed smartest person in the world, has a reported IQ of 205. What percentage of people have IQ scores smaller than Marilyn’s score?
  • Mensa is a society for “intelligent people.” To qualify for Mensa, one needs to be in at least the upper 2% of the population in IQ score. What is the score needed to qualify for Mensa?
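These three questions use the normal curve in both directions: from a score to a percentage, and from a percentage back to a score. A minimal sketch, again assuming IQ scores follow N(100, 16):

```python
from statistics import NormalDist

iq = NormalDist(mu=100, sigma=16)   # assumes IQ follows the normal curve

above_140 = 1 - iq.cdf(140)         # share with a bigger IQ than 140
below_205 = iq.cdf(205)             # share with a smaller IQ than 205
mensa_cutoff = iq.inv_cdf(0.98)     # score with 98% below it, 2% above it

print(f"above 140: {above_140:.2%}")
print(f"below 205: {below_205:.10f}")
print(f"Mensa cutoff: {mensa_cutoff:.1f}")
```

A score of 205 is more than 6 SDs above the mean, so the model puts essentially everyone below it; whether the normal curve is trustworthy that far into the tail is a separate question.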
Checking if data follow normal curve
  • Look for symmetric histogram
  • A different method is a normal probability plot. When normal curve is a good fit, points fall on a nearly straight line
Measurement error
  • Measurement error model
    • Measurement = truth + chance error
  • Outliers
  • Bias affects all measurements in the same way
    • Measurement = truth + bias + chance error
  • Often we assume that the chance error follows a normal curve that is centered at 0
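A small simulation illustrates the model. The true value, the bias, and the SD of the chance error below are all hypothetical numbers chosen for the demonstration.

```python
import random
from statistics import mean

rng = random.Random(0)   # fixed seed so the run is repeatable

truth = 10.0   # hypothetical true value
bias = 0.5     # hypothetical bias, the same for every measurement
sigma = 1.0    # hypothetical SD of the chance error

# Measurement = truth + bias + chance error, with chance error drawn from N(0, sigma)
measurements = [truth + bias + rng.gauss(0, sigma) for _ in range(10_000)]

average = mean(measurements)
print(f"average of measurements: {average:.2f}")
print(f"truth: {truth}, truth + bias: {truth + bias}")
```

Averaging many measurements cancels most of the chance error, because it is centered at 0, but the average settles near truth + bias rather than the truth: repeating the measurement does not remove bias.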