# Statistics bootcamp

Statistics bootcamp. Laine Ruus Data Library Service, University of Toronto Rev. 2005-04-26. Outline. Describing a variable Describing relationships among two or more variables. First, some vocabulary.


### Statistics bootcamp

Laine Ruus

Data Library Service, University of Toronto

Rev. 2005-04-26

Outline
• Describing a variable
• Describing relationships among two or more variables
First, some vocabulary
• Variable: In social science research, for each unit of analysis, each item of data (e.g., age of person, income of family, consumer price index) is called a variable.
• Unit of analysis: The basic observable entity being analyzed by a study and for which data are collected in the form of variables. A unit of analysis is sometimes referred to as a case or "observation"
The variable ‘Sex’ can be coded as:
• 1=‘male’ 2=‘female’ 3=‘no response’, or
• 1=‘female’ 2=‘male’, or
• 1=‘male’ 2=‘female’, or
• ‘M’=‘male’ ‘F’=‘female’, or
• ‘male’ ‘female’, or even
• 1=‘yes’ 2=‘no’ 3=‘maybe’
The values a variable can take must be:
• exhaustive: include the characteristics of all cases, and
• mutually exclusive: each case must have one and only one value or code for each variable
What’s wrong with this coding scheme?
• Under \$3,000
• \$3,000-\$7,000
• \$8,000-\$12,000
• \$13,000-\$17,000
• \$18,000-\$22,000
• \$23,000-\$27,000
• \$28,000-\$32,000
• \$33,000-\$37,000
• \$38,000 & over

(Source: Census of Canada, 1961: user summary tapes)
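
One way to test a coding scheme against the two rules (exhaustive, mutually exclusive) is to check whether every possible value falls into exactly one category. The sketch below assumes income is recorded in whole dollars; the sentinel bounds standing in for "Under" and "& over" and the helper `find_gaps_and_overlaps` are illustrative assumptions, not part of the census documentation.

```python
# Income brackets from the 1961 coding scheme, as (low, high) in whole dollars.
# -10**9 and 10**9 are arbitrary stand-ins for "Under ..." and "... & over".
brackets = [
    (-10**9, 2_999), (3_000, 7_000), (8_000, 12_000), (13_000, 17_000),
    (18_000, 22_000), (23_000, 27_000), (28_000, 32_000),
    (33_000, 37_000), (38_000, 10**9),
]

def find_gaps_and_overlaps(brackets):
    """Return (gaps, overlaps) between consecutive brackets, assuming integer values."""
    gaps, overlaps = [], []
    ordered = sorted(brackets)
    for (lo1, hi1), (lo2, hi2) in zip(ordered, ordered[1:]):
        if lo2 > hi1 + 1:
            gaps.append((hi1 + 1, lo2 - 1))        # values no category covers
        elif lo2 <= hi1:
            overlaps.append((lo2, min(hi1, hi2)))  # values two categories cover
    return gaps, overlaps

gaps, overlaps = find_gaps_and_overlaps(brackets)
print("gaps:", gaps)
print("overlaps:", overlaps)
```

Under the whole-dollar assumption, an income of \$7,500 would have no code, so the scheme as printed would not be exhaustive.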

Variables are normally coded numerically, because:
• arithmetic is easier with numbers than with letters
• some characteristics are inherently numeric: age, weight in kilograms or pounds, number of children ever borne, income, value of dwelling, years of schooling, etc.
• space/size of data sets has, until recently, been a major consideration
Three basic types of variables
• Categorical
• Nominal, aka nonorderable discrete

Eg gender, ethnicity, immigrant status, province of residence, marital status, labour force status, etc.

• Ordinal, aka orderable discrete

Eg highest level of schooling, social class, left-right political identification, Likert scales, income groups, age groups, etc.

• Continuous, aka interval, numeric

Eg actual age, income, etc.

Descriptive statistics summarize the properties of a sample of observations
• how the units of observation are the same (central tendency)
• how they are different (dispersion)
• how representative the sample is of the population at large (significance)
Nominal variable:
• Central tendency
• mode
• Dispersion
• frequency distribution
• percentages, proportions, odds
• Index of qualitative variation (IQV)
• Significance
• coefficient of variation (CV)
• Visualization
• bar chart
Mode: the category with the largest number or percentage of observations in a frequency distribution
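
A frequency distribution and its mode can be sketched in a few lines. The marital-status values below are made-up illustrative data, not census figures.

```python
from collections import Counter

# Hypothetical marital-status codes for 10 cases (illustrative data only).
marital_status = ["married", "single", "married", "widowed", "single",
                  "married", "divorced", "married", "single", "married"]

counts = Counter(marital_status)
n = sum(counts.values())

# Frequency distribution with percentages; the denominator is all n cases.
for category, freq in counts.most_common():
    print(f"{category:<10}{freq:>4}{100 * freq / n:>8.1f}%")

# The mode is the category with the largest frequency.
mode_category, mode_freq = counts.most_common(1)[0]
print("mode:", mode_category)
```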
Why the differences?
• What is the population in each table?
• What is in the denominator in each table?
• Which one is correct?

The most important thing to know about any distribution, whether it is a rate, a proportion, or a percent, is what is in the denominator. And it must always be reported.

We can also derive the distribution information from the 2001 individual pumf using a statistical package:
If we weight the distribution, both the frequencies and the percentages will almost match the distribution from the Profile file:
Just a few words on weighting:
• The weight is the inverse of the chance that any member of the population (universe) had of being selected for the sample, i.e. the number of people in the population that each sampled case represents
• In general
• the weight can be used to produce estimates for the total population (population weight)
• and/or the weight can adjust for known deficiencies in the sample (sample weight)
and a few more words on weighting…
• The 2001 census public use microdata file of individuals is a 2.7% sample of the population
• The weight variable (weightp) ranges from 35.545777-39.464996
• Knowing who was excluded from the sample is as important as knowing who was included
And some final words on weighting…
• When to use the population weight variable
• when you are producing frequencies to reflect the frequency in the population
• When you don’t need to use the population weight variable
• when you are producing percentages, proportions, ratios, rates, etc.
• When to use a sample weight variable
• always
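
The effect of a population weight on frequencies versus percentages can be illustrated with a small sketch. The records and weights below are hypothetical (chosen to fall inside the weightp range quoted above), and `distribution` is an illustrative helper, not a function of any statistical package.

```python
# Hypothetical sample records: (marital status, population weight).
sample = [
    ("married", 36.0), ("single", 37.5), ("married", 38.0),
    ("single", 36.5), ("married", 37.0),
]

def distribution(records, weighted):
    """Return {category: (frequency, percent)}, optionally applying the weights."""
    totals = {}
    for category, weight in records:
        totals[category] = totals.get(category, 0.0) + (weight if weighted else 1.0)
    grand_total = sum(totals.values())
    return {c: (f, 100 * f / grand_total) for c, f in totals.items()}

unweighted = distribution(sample, weighted=False)
weighted = distribution(sample, weighted=True)
print("unweighted:", unweighted)
print("weighted:  ", weighted)

# A 2.7% sample implies each case stands for roughly 1 / 0.027 people.
print("implied average weight:", round(1 / 0.027, 2))
```

Because the weights vary so little, the weighted frequencies are scaled up to population size while the percentages stay almost unchanged.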
Proportions, percents, and odds
• Percent: “the percent married in the population 15 years and over is…”

= (#married/population ≥15 years)*100=49.47%

• Proportion: “the proportion married in the population 15 years and over is …”

= (#married/population ≥15 years)=.4947

• Odds: “the odds on being married are…”

=(#married/#not married in the population) or

=proportion married/(1-proportion married)

=.4947/(1-.4947)=.4947/.5053=.9790
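
The three measures are simple transformations of one another. A minimal sketch, using the proportion from the slide:

```python
proportion_married = 0.4947            # from the slide above

percent_married = proportion_married * 100
odds_married = proportion_married / (1 - proportion_married)

print(f"percent:    {percent_married:.2f}%")
print(f"proportion: {proportion_married:.4f}")
print(f"odds:       {odds_married:.4f}")
```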

Coefficient of variation
• measures how representative the variable in the sample is of the distribution in the population
• computed as ((standard deviation/mean)*100)

[we will discuss these measures in the context of continuous variables]

• see Stats Can guidelines in user guides:
• cv < 16.6% is ok to publish; cv > 33.3% do not publish
• SDA reports the cv when generating frequencies
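
As a rough sketch, the coefficient of variation and the two thresholds quoted above could be applied as follows. The income values are hypothetical, and the treatment of the band between 16.6% and 33.3% is an assumption, since the slide does not say what to do there.

```python
import statistics

def coefficient_of_variation(values):
    """CV as defined on the slide: (standard deviation / mean) * 100."""
    return statistics.stdev(values) / statistics.mean(values) * 100

def release_guideline(cv_percent):
    """Apply the two thresholds quoted above; the middle band is an assumption."""
    if cv_percent < 16.6:
        return "ok to publish"
    if cv_percent > 33.3:
        return "do not publish"
    return "use with caution (assumed; not stated on the slide)"

incomes = [28_000, 31_000, 25_000, 40_000, 33_000]   # hypothetical values
cv = coefficient_of_variation(incomes)
print(f"cv = {cv:.1f}% -> {release_guideline(cv)}")
```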
Ordinal variable:
• Central tendency
• median and mode
• Dispersion
• frequency distribution
• range
• percentages/quantiles, proportions, odds
• Index of qualitative variation (IQV)
• Significance
• coefficient of variation (CV)
• Visualization
• histogram
The Median is the value that divides an orderable distribution exactly into halves.

Finding the median is easier if we compute cumulative percentages, eg in Excel

| Age group | % | Cum% |
|---|---|---|
| 0-4 years | 5.65 | 5.65 |
| 5-9 years | 6.59 | 12.24 |
| 10-14 years | 6.84 | 19.08 |
| 15-24 years | 13.36 | 32.44 |
| 25-34 years | 13.31 | 45.75 |
| 35-44 years | 17.00 | 62.75 |
| 45-54 years | 14.73 | 77.48 |
| 55-64 years | 9.56 | 87.04 |
| 65-74 years | 7.14 | 94.18 |
| 75-84 years | 4.43 | 98.61 |
| 85 years and over | 1.39 | 100 |

So how can we describe this distribution, using the vocabulary we have so far?
• What is the mode of this distribution?
• What is the median?
• What is the range?
Percentiles/quantiles
• percentiles/quantiles are the value below which a given percentage of the cases fall
• the median = the 50th percentile
• quartiles are the three values that divide an ordered distribution into 4 groups, each containing 25% of the cases
Interquartile range
• The interquartile range is the difference between the value at 75% of cases, and the value at 25% of cases
• What is the interquartile range for the following distribution?

| Age group | % | Cum% |
|---|---|---|
| 0-4 years | 5.65 | 5.65 |
| 5-9 years | 6.59 | 12.24 |
| 10-14 years | 6.84 | 19.08 |
| 15-24 years | 13.36 | 32.44 |
| 25-34 years | 13.31 | 45.75 |
| 35-44 years | 17.00 | 62.75 |
| 45-54 years | 14.73 | 77.48 |
| 55-64 years | 9.56 | 87.04 |
| 65-74 years | 7.14 | 94.18 |
| 75-84 years | 4.43 | 98.61 |
| 85 years and over | 1.39 | 100 |
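
The median and quartiles of a grouped distribution can be located by walking down the cumulative percentages until the target percentage is reached. The sketch below does this for the age table; the interpolated ages assume cases are spread evenly inside each category, and the width given to the open-ended top category is arbitrary. `quantile` is an illustrative helper.

```python
# Age distribution from the table: (label, lower bound, width in years, percent).
# The width of 15 for "85 years and over" is an arbitrary assumption.
age_groups = [
    ("0-4 years", 0, 5, 5.65), ("5-9 years", 5, 5, 6.59),
    ("10-14 years", 10, 5, 6.84), ("15-24 years", 15, 10, 13.36),
    ("25-34 years", 25, 10, 13.31), ("35-44 years", 35, 10, 17.00),
    ("45-54 years", 45, 10, 14.73), ("55-64 years", 55, 10, 9.56),
    ("65-74 years", 65, 10, 7.14), ("75-84 years", 75, 10, 4.43),
    ("85 years and over", 85, 15, 1.39),
]

def quantile(groups, target_percent):
    """Return (category, interpolated age) where the cumulative % reaches the target."""
    cumulative = 0.0
    for label, lower, width, percent in groups:
        if cumulative + percent >= target_percent:
            # Linear interpolation: assumes ages are spread evenly in the category.
            return label, lower + width * (target_percent - cumulative) / percent
        cumulative += percent
    raise ValueError("target percent exceeds the cumulative total")

q1_label, q1 = quantile(age_groups, 25)
median_label, median = quantile(age_groups, 50)
q3_label, q3 = quantile(age_groups, 75)
print("median category:", median_label, "estimate:", round(median, 1))
print("quartile categories:", q1_label, "and", q3_label)
print("estimated interquartile range:", round(q3 - q1, 1), "years")
```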

Continuous variable:
• Central tendency
• mean, median and mode
• Dispersion
• range
• variance or standard deviation
• quantiles/percentiles
• interquartile range
• Significance
• standard error
• coefficient of variation
• Visualization
• polygon (line graph)
Means, variances, and standard deviations:
• Mean: the average, computed by adding up the values of all observations and dividing by the number of observations (cases)
• Variance: computed by taking each value and subtracting the value of the mean from it, squaring the results, adding them up, and dividing by the number of cases (actually N-1)
• Standard deviation: the square root of the variance, in the same metric as the variable. In a normal distribution, about 68% of the cases fall within one standard deviation above or below the mean.
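
The three definitions translate directly into arithmetic. A minimal sketch on hypothetical ages:

```python
import math

ages = [23, 35, 41, 29, 52, 38, 47]    # hypothetical ages for 7 cases

n = len(ages)
mean = sum(ages) / n
# Sum of squared deviations from the mean, divided by N-1 (sample variance).
variance = sum((x - mean) ** 2 for x in ages) / (n - 1)
# Standard deviation: square root of the variance, in the units of the variable.
std_dev = math.sqrt(variance)

print(f"mean={mean:.2f} variance={variance:.2f} sd={std_dev:.2f}")
```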
Availability of continuous variables in Stats Can products:
• Stats Can rarely publishes truly continuous variables in its aggregate statistics products
• Some exceptions are:
• age by single years (census)
• estimates of population by single years of age (Annual demographic statistics)
Statistics Canada generally reports the distribution of continuous variables as:
• Measures of central tendency
• an ordinal (categorical) variable
• an average (mean) and standard error, variance, or standard deviation
• a median
• rates (other than per 100), eg. children ever born per 1,000 women in the 1991 census 2B profile
• Measures of dispersion
• percentage – essentially, a rate per 100 population. Eg, Incidence of low income as a % of economic families, in the census profiles. This includes employment and unemployment rates
• quantiles (eg quintiles in Income trends in Canada)
• Gini coefficients (a summary measure of income inequality) (eg Income trends in Canada)
• Measures of significance
• standard error
In the following distribution…
• What is the median?
• What is the mean?
• What is the range?
Using the percentages and cumulative percentages:
• What is your best estimate of the interquartile range?
• What is your best estimate of the standard deviation?
• Why is the average (mean) income so much higher than the median?
• See pages 18-19 of your handout
Using the standard error to describe more of the distribution:
• the standard error of the mean measures how much the sample mean would vary from one sample to the next, and therefore how close the mean in the data we are looking at is likely to be to the mean in the population at large
• computed as the standard deviation divided by the square root of N
• the larger the N, the smaller the standard error, and the more confidence we can have in the distribution in the sample as representative of the population
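
The relationship between N and the standard error can be seen by holding the standard deviation fixed and varying the sample size. The standard deviation below is a hypothetical figure.

```python
import math

def standard_error(std_dev, n):
    """Standard error of the mean: standard deviation / sqrt(N)."""
    return std_dev / math.sqrt(n)

std_dev = 30_000.0                      # hypothetical standard deviation of income
for n in (100, 10_000, 1_000_000):
    print(f"N={n:>9,}  standard error={standard_error(std_dev, n):,.2f}")
```

Each hundredfold increase in N cuts the standard error by a factor of ten.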
Confidence intervals
• The standard error makes most sense when we use it to compute a confidence interval around the mean
• For Canada, in the previous example:
• 95% upper confidence limit (UCL)

=(mean)+1.96(standard error)

=29769+1.96(19)= 29769 + 37.24=\$29,806.24

• 95% lower confidence limit (LCL)

=(mean)-1.96(standard error)

=29769 -1.96(19)= 29769 - 37.24 =\$29,731.76

• 1.96 is the Z value that bounds the central 95% of a normal distribution
• The handout contains computed confidence intervals for each of the provinces (p.22)
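
The confidence limits for Canada can be reproduced from the mean and standard error given on the slide:

```python
mean_income = 29_769      # mean from the slide
standard_error = 19       # standard error from the slide
z_95 = 1.96               # Z value for a 95% interval

margin = z_95 * standard_error
lower_limit = mean_income - margin
upper_limit = mean_income + margin
print(f"95% CI: ${lower_limit:,.2f} to ${upper_limit:,.2f}")
```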
How do we interpret this?
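• if we draw repeated random samples from the same population and compute a 95% confidence interval from each one, about 95% of those intervals will contain the population mean
• this is not the same as saying that there is a 95% probability that the population mean falls between \$29,732 and \$29,806: the population mean is fixed, and it is the interval that varies from sample to sample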
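
What "95%" refers to can be checked by simulation: draw many samples from a known population, build an interval from each, and count how often the interval contains the population mean. The population parameters, sample size, and number of trials below are all hypothetical choices.

```python
import math
import random
import statistics

random.seed(0)                          # fixed seed so the run is repeatable
POP_MEAN, POP_SD = 30_000, 8_000        # hypothetical population parameters
N, TRIALS = 200, 1_000

covered = 0
for _ in range(TRIALS):
    sample = [random.gauss(POP_MEAN, POP_SD) for _ in range(N)]
    mean = statistics.fmean(sample)
    se = statistics.stdev(sample) / math.sqrt(N)
    # Does this sample's 95% interval contain the population mean?
    if mean - 1.96 * se <= POP_MEAN <= mean + 1.96 * se:
        covered += 1

coverage = covered / TRIALS
print(f"{coverage:.1%} of the intervals contain the population mean")
```

The share of intervals that capture the population mean should come out close to 95%, while any single interval either contains it or does not.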
Using microdata
• Statistical packages such as SAS, SPSS, etc. will compute the mean, the standard deviation, standard error, etc. for continuous variables
• SDA will only report these measures for variables with fewer than 8,000 values
How many values?
• What is the maximum number of values for the wagesp variable? For the totincp variable?
• If you were to do a frequency distribution of wagesp, how many rows might you have in the distribution?
• How would you go about finding out what the modal category for wagesp is?
• variables are not immutable or unchangeable
• continuous variables can be changed into ordinal, or even nominal variables
• ordinal variables can be changed into nominal variables, and
• nominal variables can be collapsed still further
• nominal and ordinal variables can be combined to create indices or scales
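
Recoding a continuous variable into an ordinal one, and then collapsing it further, might look like the sketch below. The age groups follow the table used earlier; the "15-64" versus "other" split and the function names are illustrative assumptions.

```python
import bisect

# Lower bounds and labels for the age groups used in the earlier table.
LOWER_BOUNDS = [0, 5, 10, 15, 25, 35, 45, 55, 65, 75, 85]
LABELS = ["0-4", "5-9", "10-14", "15-24", "25-34", "35-44",
          "45-54", "55-64", "65-74", "75-84", "85+"]

def age_group(age):
    """Recode a continuous, non-negative age into an ordinal age group."""
    return LABELS[bisect.bisect_right(LOWER_BOUNDS, age) - 1]

def working_age(age):
    """Collapse further into a two-category variable (cut-offs are an assumption)."""
    return "15-64" if 15 <= age <= 64 else "other"

ages = [3, 15, 24, 37, 64, 90]          # hypothetical ages
print([age_group(a) for a in ages])
print([working_age(a) for a in ages])
```

The recode only goes one way: once ages are grouped, the original single years cannot be recovered from the groups.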
Describing relationships among two or more variables:
• Objectives of describing multivariate relationships
• description
• improved ability to predict the value of a variable for a case, by using the value of another variable (or variables)
• examine causation, which requires
• covariation
• the causal variable must occur before the outcome variable in time (temporal precedence)
• A non-spurious relationship
Variables are either…
• Dependent (aka ‘outcome’ variables) or
• Independent (aka ‘causal’ variables)
The same variable can be an independent variable in one hypothesis, and a dependent variable in another hypothesis:
• Does gender make a difference in level of education?

‘level of education’ is dependent; ‘gender’ is independent

• Does level of education make a difference to earned income?

‘earned income’ is dependent; ‘level of education’ is independent

• Is the effect of education on earned income the same for men and women?

‘earned income’ is dependent; ‘level of education’ is independent, and ‘gender’ is the control variable

Statistical measures describe…
• The strength of the relationship between two (or more) variables
• The direction of the relationship between two (or more) variables
• The significance of the relationship between two (or more) variables in the sample vis-à-vis the population
• whether the dependent variable is nominal, ordinal, or continuous
• whether the dependent variable is a dichotomy or a polytomy
• whether the independent variable(s) is a dichotomy, a polytomy, or continuous
• cross-tabulations
• a common census output product (all the Topic-based tabulations in 2001)
• strength and direction:
• odds ratios
• significance:
• chi-square statistic
The chi-square statistic:
• Tells us how likely we are to be wrong if we say there is a relationship between these two variables
• Significance (probability of being wrong)
• Evaluated based on
• Critical value (found in statistics texts)
• Degrees of freedom (df=1)
• Level of significance (a 5% possibility of being wrong is normally acceptable in the social sciences)
So we know:
• Direction & strength: from graphing the odds
• Significance: from a chi-square statistic
• Can be computed in eg Excel
• For the previous table, 2.97 (see handout page 30)
• Critical value: 3.84 with 1 degree of freedom, at .05 significance (ie probability of being wrong)
• The relationship in this table is not statistically significant (2.97 is below the critical value of 3.84), so there is a good chance that it arose by chance, eg from sampling or measurement error
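
The chi-square statistic for a 2x2 table can be computed by hand from observed and expected counts. The counts below are hypothetical and are not the table from the handout; only the critical value of 3.84 comes from the slide.

```python
# Hypothetical 2x2 table of counts: rows = immigrant status, columns = employment.
table = [[90, 10],     # immigrants:     employed, unemployed
         [170, 30]]    # non-immigrants: employed, unemployed

row_totals = [sum(row) for row in table]
col_totals = [sum(col) for col in zip(*table)]
n = sum(row_totals)

# Pearson chi-square: sum of (observed - expected)^2 / expected over all cells.
chi_square = 0.0
for i, row in enumerate(table):
    for j, observed in enumerate(row):
        expected = row_totals[i] * col_totals[j] / n
        chi_square += (observed - expected) ** 2 / expected

degrees_of_freedom = (len(table) - 1) * (len(table[0]) - 1)
CRITICAL_VALUE = 3.84      # df=1, .05 level, as quoted above

# Odds ratio: odds of employment for immigrants relative to non-immigrants.
odds_ratio = (table[0][0] / table[0][1]) / (table[1][0] / table[1][1])

print(f"chi-square={chi_square:.2f} df={degrees_of_freedom} odds ratio={odds_ratio:.2f}")
print("significant" if chi_square > CRITICAL_VALUE else "not significant")
```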
Did you notice?
• In the 1996 data, when we looked only at immigrants, versus non-immigrants, the non-immigrants were more likely to be employed
• When we break it out by visible minority status,
• Non-visible majority: immigrants are more likely to be employed than non-immigrants (11.73/9.16=1.28)
• Visible minorities: immigrants are also more likely to be employed than non-immigrants (6.30/5.52=1.14)
• This is an example of Simpson’s paradox
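
Simpson's paradox is easiest to see with counts. The figures below are invented so that the reversal is visible; they are not the 1996 census figures, and "group A" and "group B" are placeholder labels for the two subgroups.

```python
# Hypothetical counts (employed, unemployed); NOT the census figures from the handout.
counts = {
    "group A": {"immigrant": (190, 10), "non-immigrant": (8460, 540)},
    "group B": {"immigrant": (680, 120), "non-immigrant": (840, 160)},
}

def employment_rate(employed, unemployed):
    """Share of the group that is employed."""
    return employed / (employed + unemployed)

# Within each group, immigrants have the higher employment rate...
for group, by_status in counts.items():
    rates = {s: employment_rate(*c) for s, c in by_status.items()}
    print(group, {s: f"{r:.0%}" for s, r in rates.items()})

# ...but pooling the groups reverses the comparison.
pooled = {}
for status in ("immigrant", "non-immigrant"):
    employed = sum(counts[g][status][0] for g in counts)
    unemployed = sum(counts[g][status][1] for g in counts)
    pooled[status] = employment_rate(employed, unemployed)
print("pooled", {s: f"{r:.0%}" for s, r in pooled.items()})
```

The reversal happens because, in this made-up data, most immigrants belong to the group with the lower employment rate, so the pooled figure is dominated by that group.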
Using a statistical package to examine these relationships:
• Automatically computes a chi-square
• Computes degrees of freedom and probability
• Is sensitive to sample size (so we need to take a subsample)
In the table on the previous slide:
• How many cases are there in this subsample?
• What percentage overall are unemployed?
• What percentage of immigrants are unemployed?
• Is this different from the percentage of non-immigrants that are unemployed?
• What percentage are immigrants? How would you find out?
And tell me what you found:
• How many tables are produced?
• What is the difference between them?
• How many of these tables are statistically significant (based on the Chisq)?
• In which two groups is the relationship the opposite of the majority of the groups?
Comparison of means
• Relationship between a continuous dependent variable, and a dichotomous independent variable
• Significance:
• Student’s t-statistic (similar to Z-statistic)
• Example: average employment income for visible minorities versus all others (2001 census: dimensions series)
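
A comparison of two means can be sketched with the standard library. Note that the helper below uses Welch's unequal-variance form of the t-statistic rather than the pooled-variance Student's form named on the slide (the two give the same statistic when the groups are the same size). The incomes are hypothetical.

```python
import math
import statistics

# Hypothetical employment incomes (in $000s) for two groups.
group_a = [28, 31, 25, 40, 33, 29, 35]
group_b = [34, 38, 30, 45, 41, 36, 39]

def welch_t(sample_1, sample_2):
    """t-statistic for a difference in means, not assuming equal variances."""
    mean_diff = statistics.mean(sample_1) - statistics.mean(sample_2)
    se_diff = math.sqrt(statistics.variance(sample_1) / len(sample_1)
                        + statistics.variance(sample_2) / len(sample_2))
    return mean_diff / se_diff

t = welch_t(group_a, group_b)
print(f"t = {t:.2f}")
```

The sign of t shows which group has the higher mean; its size is then compared with a critical value from a t table for the relevant degrees of freedom.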
Visible minority status and employment income
• In 1996 census, a Dimensions series table showed visible minority status and employment income
• This table not available in 2001 census
• Relationship can only be examined using microdata, or requesting a custom tabulation
ANOVA (analysis of variance)
• When the dependent variable is a continuous variable, and the independent variable is a polytomy
• Significance:
• F-ratio
• Eta-squared (amount of explained variance)
• Example: mean earned income for individual visible minorities
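
A one-way ANOVA splits the total variation into a between-group part and a within-group part; the F-ratio and eta-squared are both built from those two sums of squares. A minimal sketch with hypothetical incomes for three placeholder groups:

```python
import statistics

# Hypothetical earned incomes (in $000s) for three groups.
groups = {
    "group 1": [30, 32, 28, 35],
    "group 2": [40, 38, 42, 44],
    "group 3": [33, 31, 36, 34],
}

all_values = [x for values in groups.values() for x in values]
grand_mean = statistics.mean(all_values)

# Between-group and within-group sums of squares.
ss_between = sum(len(v) * (statistics.mean(v) - grand_mean) ** 2 for v in groups.values())
ss_within = sum((x - statistics.mean(v)) ** 2 for v in groups.values() for x in v)

df_between = len(groups) - 1
df_within = len(all_values) - len(groups)

# F-ratio: between-group mean square over within-group mean square.
f_ratio = (ss_between / df_between) / (ss_within / df_within)
# Eta-squared: share of total variation explained by group membership.
eta_squared = ss_between / (ss_between + ss_within)
print(f"F = {f_ratio:.2f}, eta-squared = {eta_squared:.2f}")
```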
• Correlations
• Strength and direction
• Pearson correlation coefficient
• Significance
• Z-statistics for each pair

(r-to-Z transformation table)
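
The Pearson coefficient and the r-to-Z (Fisher) transformation can both be computed directly, without a lookup table. The paired observations below are hypothetical, and `pearson_r` is an illustrative helper.

```python
import math

# Hypothetical paired observations: years of schooling and income (in $000s).
schooling = [10, 12, 12, 14, 16, 16, 18, 20]
income = [22, 27, 25, 33, 38, 41, 45, 52]

def pearson_r(x, y):
    """Pearson correlation: co-variation divided by the product of the spreads."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

r = pearson_r(schooling, income)
# Fisher r-to-Z transformation; Z * sqrt(n - 3) is approximately standard
# normal when the true correlation is zero.
fisher_z = math.atanh(r)
z_statistic = fisher_z * math.sqrt(len(schooling) - 3)
print(f"r = {r:.3f}, z = {z_statistic:.2f}")
```

The sign of r gives the direction of the relationship, its absolute value the strength, and the Z-statistic is compared with 1.96 for significance at the .05 level.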

Combined effects of one or more independent variables on a continuous dependent variable
• Regression analysis
• Categorical variables must be recoded to dummy variables
• Strength and direction:
• t-statistic for each independent variable
• Significance:
• F-ratio
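
Dummy coding can be illustrated with the simplest possible regression: one 0/1 predictor. The cases below are hypothetical, and a real analysis with several independent variables would use a statistical package rather than this hand calculation.

```python
# Hypothetical cases: (gender, income in $000s).
cases = [("female", 30), ("male", 36), ("female", 34), ("male", 40),
         ("female", 32), ("male", 38)]

# Recode the categorical variable to a 0/1 dummy ("female" is the reference category).
male = [1 if gender == "male" else 0 for gender, _ in cases]
income = [value for _, value in cases]

# Ordinary least squares with a single predictor: slope = cov(x, y) / var(x).
n = len(cases)
mean_x, mean_y = sum(male) / n, sum(income) / n
slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(male, income))
         / sum((x - mean_x) ** 2 for x in male))
intercept = mean_y - slope * mean_x
print(f"income = {intercept:.1f} + {slope:.1f} * male")
```

With a single dummy, the intercept is the mean of the reference category and the slope is the difference between the two group means.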
What you need to remember:
• To examine the relationships between 2 or more variables in Beyond 20/20, the variables must be in the same file and in different dimensions
• If no table exists with those variables, the alternative is to use a relevant microdata file
• To generate the table (or request a custom tabulation from Statistics Canada)
• To compute the more complicated measures of association and significance
What you need to remember (cont’d):
• A user who needs to do correlations or regression analysis needs continuous outcome (dependent) variables