Statistics bootcamp



Statistics bootcamp

Laine Ruus

Data Library Service, University of Toronto

Rev. 2005-04-26


Outline

  • Describing a variable

  • Describing relationships among two or more variables


First, some vocabulary

  • Variable: In social science research, for each unit of analysis, each item of data (e.g., age of person, income of family, consumer price index) is called a variable.

  • Unit of analysis: The basic observable entity being analyzed by a study and for which data are collected in the form of variables. A unit of analysis is sometimes referred to as a case or "observation"


The variable ‘Sex’ can be coded as:

  • 1=‘male’ 2=‘female’ 3=‘no response’, or

  • 1=‘female’ 2=‘male’, or

  • 1=‘male’ 2=‘female’, or

  • ‘M’=‘male’ ‘F’=‘female’, or

  • ‘male’ ‘female’, or even

  • 1=‘yes’ 2=‘no’ 3=‘maybe’


The values a variable can take must be:

  • exhaustive: include the characteristics of all cases, and

  • mutually exclusive: each case must have one and only one value or code for each variable


What’s wrong with this coding scheme?

  • Under $3,000

  • $3,000-$7,000

  • $8,000-$12,000

  • $13,000-$17,000

  • $18,000-$22,000

  • $23,000-$27,000

  • $28,000-$32,000

  • $33,000-$37,000

  • $38,000 & over

    (Source: Census of Canada, 1961: user summary tapes)
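One way to probe a coding scheme against the two requirements (exhaustive, mutually exclusive) is to try assigning a code to a few test values. The sketch below is illustrative only: it assumes incomes are recorded in whole dollars and that each bracket includes both of its printed endpoints, neither of which the slide states. The names BRACKETS and matching_codes are hypothetical.

```python
# Brackets as printed on the slide: (lower bound, upper bound), both inclusive.
# None marks an open-ended bracket. Whole-dollar incomes are an assumption.
BRACKETS = [
    (None, 2999),     # Under $3,000
    (3000, 7000),
    (8000, 12000),
    (13000, 17000),
    (18000, 22000),
    (23000, 27000),
    (28000, 32000),
    (33000, 37000),
    (38000, None),    # $38,000 & over
]

def matching_codes(income):
    """Return the 1-based codes of every bracket that contains `income`."""
    codes = []
    for code, (low, high) in enumerate(BRACKETS, start=1):
        above_low = low is None or income >= low
        below_high = high is None or income <= high
        if above_low and below_high:
            codes.append(code)
    return codes

# An empty list means no code fits; more than one code means an overlap.
for income in (2500, 7000, 7500, 12500, 40000):
    print(income, matching_codes(income))
```

Under these assumptions, a value that returns an empty list shows where the scheme fails to be exhaustive, and a value that returns two codes would show where it fails to be mutually exclusive.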


Variables are normally coded numerically, because:

  • arithmetic is easier with numbers than with letters

  • some characteristics are inherently numeric: age, weight in kilograms or pounds, number of children ever borne, income, value of dwelling, years of schooling, etc.

  • space/size of data sets has, until recently, been a major consideration


Three basic types of variables

  • Categorical

    • Nominal, aka nonorderable discrete

      Eg gender, ethnicity, immigrant status, province of residence, marital status, labour force status, etc.

    • Ordinal, aka orderable discrete

      Eg highest level of schooling, social class, left-right political identification, Likert scales, income groups, age groups, etc.

  • Continuous, aka interval, numeric

    Eg actual age, income, etc.


Descriptive statistics summarize the properties of a sample of observations

  • how the units of observation are the same (central tendency)

  • how they are different (dispersion)

  • how representative the sample is of the population at large (significance)


Nominal variable:

  • Central tendency

    • mode

  • Dispersion

    • frequency distribution

    • percentages, proportions, odds

    • Index of qualitative variation (IQV)

  • Significance

    • coefficient of variation (CV)

  • Visualization

    • bar chart


Mode: the category with the largest number or percentage of observations in a frequency distribution






Why the differences?

  • What is the population in each table?

  • What is in the denominator in each table?

  • Which one is correct?


The most important thing to know about any distribution, whether it is a rate, a proportion, or a percent, is what is in the denominator. And it must always be reported.


We can also derive the distribution information from the 2001 individual PUMF (public use microdata file) using a statistical package:


If we weight the distribution, both the frequencies and the percentages will almost match the distribution from the Profile file:


Just a few words on weighting:

  • The weight reflects the chance that any member of the population (universe) had of being selected for the sample: it is the inverse of that chance, so each sampled case stands for that many members of the population

  • In general

    • the weight can be used to produce estimates for the total population (population weight)

    • and/or the weight can adjust for known deficiencies in the sample (sample weight)


And a few more words on weighting…

  • The 2001 census public use microdata file of individuals is a 2.7% sample of the population

  • The weight variable (weightp) ranges from 35.545777 to 39.464996

  • Knowing who was excluded from the sample is as important as knowing who was included
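The link between the 2.7% sampling fraction and the size of the weights can be checked with a line of arithmetic. The sketch below assumes a simple equal-probability design in which the weight is the inverse of the sampling fraction; the three-case sample and the function name weighted_counts are invented for illustration and do not come from the PUMF.

```python
sampling_fraction = 0.027             # 2.7% sample, from the slide
base_weight = 1 / sampling_fraction   # people represented by one sampled case
print(round(base_weight, 2))

# Hypothetical mini-sample of (category, weight) pairs. Values are invented.
sample = [("married", 36.2), ("single", 38.1), ("married", 37.0)]

def weighted_counts(cases):
    """Sum the weights within each category to estimate population counts."""
    totals = {}
    for category, weight in cases:
        totals[category] = totals.get(category, 0.0) + weight
    return totals

print(weighted_counts(sample))
```

The inverse of the sampling fraction lands inside the weightp range quoted above, which is what one would expect if the individual weights are small adjustments around that base value.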


And some final words on weighting…

  • When to use the population weight variable

    • when you are producing frequencies to reflect the frequency in the population

  • When you don’t need to use the population weight variable

    • when you are producing percentages, proportions, ratios, rates, etc.

  • When to use a sample weight variable

    • always


Proportions, percents, and odds

  • Percent: “the percent married in the population 15 years and over is…”

    = (#married/population 15 years and over)*100 = 49.47%

  • Proportion: “the proportion married in the population 15 years and over is…”

    = (#married/population 15 years and over) = .4947

  • Odds: “the odds on being married are…”

    = (#married/#not married in the population) or

    = proportion married/(1-proportion married)

    = .4947/(1-.4947) = .4947/.5053 = .9790
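The three measures are different presentations of the same two counts, which a short sketch makes explicit. It starts from the proportion quoted on the slide (.4947), since the underlying counts are not given; the function name odds_from_proportion is illustrative.

```python
def odds_from_proportion(p):
    """Convert a proportion (0 <= p < 1) into odds: p / (1 - p)."""
    return p / (1 - p)

proportion_married = 0.4947                    # from the slide
percent_married = proportion_married * 100
odds_married = odds_from_proportion(proportion_married)

print(f"percent:    {percent_married:.2f}%")
print(f"proportion: {proportion_married:.4f}")
print(f"odds:       {odds_married:.4f}")
```

Odds below 1 mean the event is less likely than not; odds of exactly 1 correspond to a proportion of .5.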


Coefficient of variation

  • measures how representative the variable in the sample is of the distribution in the population

  • computed as ((standard deviation/mean)*100)

    [we will discuss these measures in the context of continuous variables]

  • see Stats Can guidelines in user guides:

    • cv< 16.6% is ok to publish, cv>33.3% do not publish

  • SDA reports the cv when generating frequencies
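As a minimal sketch, the formula and the two thresholds quoted above can be written as follows. The input numbers are invented, and the slide gives no rule for values between 16.6% and 33.3%, so the code only flags that case rather than guessing at one.

```python
def coefficient_of_variation(std_dev, mean):
    """CV as a percentage: (standard deviation / mean) * 100."""
    return std_dev / mean * 100

def release_status(cv):
    """Apply the two thresholds quoted on the slide."""
    if cv < 16.6:
        return "ok to publish"
    if cv > 33.3:
        return "do not publish"
    # The slide gives no rule for values between the two thresholds.
    return "between thresholds: consult the user guide"

cv = coefficient_of_variation(std_dev=12.0, mean=100.0)  # invented numbers
print(round(cv, 1), release_status(cv))
```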


Ordinal variable:

  • Central tendency

    • median and mode

  • Dispersion

    • frequency distribution

    • range

    • percentages/quantiles, proportions, odds

    • Index of qualitative variation (IQV)

  • Significance

    • coefficient of variation (CV)

  • Visualization

    • histogram


Example: frequency distribution


Example: relative percentages


The median is the value that divides an orderable distribution exactly into halves.

Finding the median is easier if we compute cumulative percentages, eg in Excel


Age group                %     Cum%

0-4 years             5.65     5.65
5-9 years             6.59    12.24
10-14 years           6.84    19.08
15-24 years          13.36    32.44
25-34 years          13.31    45.75
35-44 years          17.00    62.75
45-54 years          14.73    77.48
55-64 years           9.56    87.04
65-74 years           7.14    94.18
75-84 years           4.43    98.61
85 years and over     1.39   100.00


So how can we describe this distribution, using the vocabulary we have so far?

  • What is the mode of this distribution?

  • What is the median?

  • What is the range?
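One way to work through the first two questions is to let code scan the percentages. The sketch below copies the age distribution from the earlier slide; the function names are illustrative. Note that the categories are not all the same width, so the modal category depends partly on how the ages were grouped.

```python
# Age distribution from the slide: (category, percent of population).
AGE_DISTRIBUTION = [
    ("0-4 years", 5.65), ("5-9 years", 6.59), ("10-14 years", 6.84),
    ("15-24 years", 13.36), ("25-34 years", 13.31), ("35-44 years", 17.00),
    ("45-54 years", 14.73), ("55-64 years", 9.56), ("65-74 years", 7.14),
    ("75-84 years", 4.43), ("85 years and over", 1.39),
]

def modal_category(dist):
    """Category with the largest percentage."""
    return max(dist, key=lambda row: row[1])[0]

def category_at_percentile(dist, percentile):
    """First category whose cumulative percentage reaches `percentile`."""
    cumulative = 0.0
    for category, percent in dist:
        cumulative += percent
        if cumulative >= percentile:
            return category
    return dist[-1][0]  # guards against rounding leaving the total just under 100

print("mode:  ", modal_category(AGE_DISTRIBUTION))
print("median:", category_at_percentile(AGE_DISTRIBUTION, 50))
```

For grouped data like this, the range can only be stated in terms of the categories, from the lowest (0-4 years) to the open-ended highest (85 years and over).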



Percentiles/quantiles

  • percentiles/quantiles are the values below which a given percentage of the cases fall

  • the median = the 50th percentile

  • quartiles are the values of a distribution broken into 4 even intervals each containing 25% of the cases


Interquartile range

  • The interquartile range is the difference between the value at 75% of cases, and the value at 25% of cases

  • What is the interquartile range for the following distribution?


Age group                %     Cum%

0-4 years             5.65     5.65
5-9 years             6.59    12.24
10-14 years           6.84    19.08
15-24 years          13.36    32.44
25-34 years          13.31    45.75
35-44 years          17.00    62.75
45-54 years          14.73    77.48
55-64 years           9.56    87.04
65-74 years           7.14    94.18
75-84 years           4.43    98.61
85 years and over     1.39   100.00
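With grouped data the quartiles can only be estimated. A common approach, sketched below, is linear interpolation within the category where the cumulative percentage crosses 25% or 75%. This assumes cases are spread evenly within each age group, which is an assumption of the sketch and not something the slide states; the names CLASSES and grouped_percentile are illustrative.

```python
# (lower age bound, class width in years, percent). Widths follow the category
# labels on the slide; the open-ended top class has no width.
CLASSES = [
    (0, 5, 5.65), (5, 5, 6.59), (10, 5, 6.84), (15, 10, 13.36),
    (25, 10, 13.31), (35, 10, 17.00), (45, 10, 14.73), (55, 10, 9.56),
    (65, 10, 7.14), (75, 10, 4.43), (85, None, 1.39),
]

def grouped_percentile(classes, percentile):
    """Estimate a percentile by linear interpolation within its class."""
    cumulative = 0.0
    for lower, width, percent in classes:
        if cumulative + percent >= percentile:
            if width is None:
                raise ValueError("percentile falls in the open-ended class")
            return lower + (percentile - cumulative) / percent * width
        cumulative += percent
    raise ValueError("percentages do not reach the requested percentile")

q1 = grouped_percentile(CLASSES, 25)
q3 = grouped_percentile(CLASSES, 75)
print(f"Q1 = {q1:.1f}, Q3 = {q3:.1f}, IQR = {q3 - q1:.1f}")
```

Without interpolation, the answer can still be given at the category level: the 25% point falls in the 15-24 group and the 75% point in the 45-54 group.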


Continuous variable:

  • Central tendency

    • mean, median and mode

  • Dispersion

    • range

    • variance or standard deviation

    • quantiles/percentiles

    • interquartile range

  • Significance

    • standard error

    • coefficient of variation

  • Visualization

    • polygon (line graph)


Means, variances, and standard deviations:

  • Mean: the average, computed by adding up the values of all observations and dividing by the number of observations (cases)

  • Variance: computed by taking each value and subtracting the value of the mean from it, squaring the results, adding them up, and dividing by the number of cases (actually N-1)

  • Standard deviation: the square root of the variance, in the same metric as the variable. In a normal distribution, about 68% of the cases fall within one standard deviation above or below the mean.
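The three definitions translate almost word for word into code. The sketch below uses a handful of invented ages; the function name describe is illustrative.

```python
import math

def describe(values):
    """Return (mean, sample variance, standard deviation) for a list of numbers."""
    n = len(values)
    mean = sum(values) / n
    squared_deviations = [(x - mean) ** 2 for x in values]
    variance = sum(squared_deviations) / (n - 1)   # N-1, as the slide notes
    return mean, variance, math.sqrt(variance)

ages = [23, 35, 41, 52, 64]   # invented values
mean, variance, std_dev = describe(ages)
print(mean, variance, round(std_dev, 2))
```

Because the variance is in squared units (here, years squared), the standard deviation is usually the easier of the two to interpret.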


Availability of continuous variables in Stats Can products:

  • Stats Can rarely publishes truly continuous variables in its aggregate statistics products

  • Some exceptions are:

    • age by single years (census)

    • estimates of population by single years of age (Annual demographic statistics)


Statistics Canada generally reports the distribution of continuous variables as:

  • Measures of central tendency

    • an ordinal (categorical) variable

    • an average (mean) and standard error, variance, or standard deviation

    • a median

    • rates (other than per 100), eg children ever born per 1,000 women in the 1991 census 2B profile

  • Measures of dispersion

    • percentage – essentially, a rate per 100 population. Eg, Incidence of low income as a % of economic families, in the census profiles. This includes employment and unemployment rates

    • quantiles (eg quintiles in Income trends in Canada)

    • Gini coefficients (computed from the coefficient of variation) (eg Income trends in Canada)

  • Measures of significance

    • standard error


In the following distribution…

  • What is the median?

  • What is the mean?

  • What is the range?


Using the percentages and cumulative percentages:

  • What is your best estimate of the interquartile range?

  • What is your best estimate of the standard deviation?

  • Why is the average (mean) income so much higher than the median?

  • See pages 18-19 of your handout


As a percentage distribution



Using the standard error to describe more of the distribution:

  • the standard error of the mean is a measure of how close the mean in the data we are looking at is likely to be to the mean in the population at large

  • computed as the standard deviation divided by the square root of N

  • the larger the N, the smaller the standard error, and the more confidence we can have in the distribution in the sample as representative of the population
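The effect of N on the standard error is easy to see numerically. In the sketch below the standard deviation is an invented figure held constant while the sample size grows; the function name standard_error is illustrative.

```python
import math

def standard_error(std_dev, n):
    """Standard error of the mean: standard deviation / sqrt(N)."""
    return std_dev / math.sqrt(n)

# Invented standard deviation; only the sample size changes between rows.
for n in (100, 10_000, 1_000_000):
    print(n, standard_error(std_dev=25_000, n=n))
```

Each hundredfold increase in N cuts the standard error by a factor of ten, because N enters through its square root.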


For example…


Confidence intervals

  • The standard error makes most sense when we use it to compute a confidence interval around the mean

  • For Canada, in the previous example:

    • 95% upper confidence limit (UCL)

      =(mean)+1.96(standard error)

      =29769+1.96(19)= 29769 + 37.24=$29,806.24

    • 95% lower confidence limit (LCL)

      =(mean)-1.96(standard error)

      =29769 -1.96(19)= 29769 - 37.24 =$29,731.76

  • 1.96 is the Z-value that bounds the central 95% of a normal distribution

  • The handout contains computed confidence intervals for each of the provinces (p.22)
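The Canada calculation above can be reproduced with a small helper, which is also convenient for repeating the exercise province by province. The mean (29769) and standard error (19) are the figures from the slide; the function name confidence_interval is illustrative.

```python
Z_95 = 1.96  # multiplier for a 95% interval under a normal distribution

def confidence_interval(mean, standard_error, z=Z_95):
    """Return (lower, upper) confidence limits around a mean."""
    margin = z * standard_error
    return mean - margin, mean + margin

lower, upper = confidence_interval(mean=29769, standard_error=19)
print(f"LCL = ${lower:,.2f}, UCL = ${upper:,.2f}")
```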


How do we interpret this?

  • if we draw repeated random samples from the same population and compute a 95% confidence interval from each, about 95% of those intervals will contain the population mean; for this sample, the interval runs from $29,732 to $29,806

  • this is not the same as saying that there is a 95% probability that the population mean falls within those two particular limits.


Using microdata

  • Statistical packages such as SAS, SPSS, etc. will compute the mean, the standard deviation, standard error, etc. for continuous variables

  • SDA will only report these measures for variables with fewer than 8,000 values



How many values?

  • What is the maximum number of values for the wagesp variable? For the totincp variable?

  • If you were to do a frequency distribution of wagesp, how many rows might you have in the distribution?

  • How would you go about finding out what the modal category for wagesp is?


Some final points about variables:

  • they are not immutable or un-changeable

  • continuous variables can be changed into ordinal, or even nominal variables

  • ordinal variables can be changed into nominal variables, and

  • nominal variables can be collapsed still further

  • nominal and ordinal variables can be combined to create indices or scales


Describing relationships among two or more variables:

  • Objectives of describing multivariate relationships

    • description

    • improved ability to predict the value of a variable for a case, by using the value of another variable (or variables)

    • examine causation, which requires

      • covariation

      • the causal variable must occur before the outcome variable in time (temporal precedence)

      • A non-spurious relationship


Variables are either…

  • Dependent (aka ‘outcome’ variables) or

  • Independent (eg ‘causal’ variables)


The same variable can be an independent variable in one hypothesis, and a dependent variable in another hypothesis:

  • Does gender make a difference in level of education?

    ‘level of education’ is dependent; ‘gender’ is independent

  • Does level of education make a difference to earned income?

    ‘earned income’ is dependent; ‘level of education’ is independent

  • Is the effect of education on earned income the same for men and women?

    ‘earned income’ is dependent; ‘level of education’ is independent, and ‘gender’ is the control variable


Statistical measures describe…

  • The strength of the relationship between two (or more) variables

  • The direction of the relationship between two (or more) variables

  • The significance of the relationship between two (or more) variables in the sample vis-à-vis the population


The choice of appropriate statistical technique is dependent on:

  • whether the dependent variable is nominal, ordinal, or continuous

  • whether the dependent variable is a dichotomy or a polytomy

  • whether the independent variable(s) is a dichotomy, a polytomy, or continuous


Relationships between two or more nominal or ordinal variables

  • cross-tabulations

    • a common census output product (all the Topic-based tabulations in 2001)

  • strength and direction:

    • odds ratios

  • significance:

    • chi-square statistic





The chi-square statistic:

  • Tells us how likely we are to be wrong if we say there is a relationship between these two variables

  • Significance (probability of being wrong)

  • Evaluated based on

    • Critical value (found in statistics texts)

    • Degrees of freedom (df=1)

    • Level of significance (a 5% possibility of being wrong is normally acceptable in the social sciences)


So we know:

  • Direction & strength: from graphing the odds

  • Significance: from a chi-square statistic

    • Can be computed in eg Excel

    • For the previous table, 2.97 (see handout page 30)

    • Critical value: 3.84 with 1 degree of freedom, at .05 significance (ie probability of being wrong)

    • The relationship in this table is not statistically significant (2.97 is below the critical value of 3.84), so there is a good chance that it is simply the result of chance, eg sampling or measurement error
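The chi-square statistic compares each observed cell count with the count expected if the two variables were unrelated. The sketch below computes the plain Pearson statistic (no continuity correction) for a table given as a list of rows. The 2x2 table used here is invented, since the handout table is not reproduced in this transcript; only the 2.97 and 3.84 figures come from the slide.

```python
def chi_square(table):
    """Pearson chi-square statistic for a table given as a list of rows."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            statistic += (observed - expected) ** 2 / expected
    return statistic

def degrees_of_freedom(table):
    """(rows - 1) * (columns - 1)."""
    return (len(table) - 1) * (len(table[0]) - 1)

CRITICAL_VALUE_DF1_05 = 3.84   # from the slide: 1 df, .05 significance

invented_table = [[10, 20], [30, 40]]   # illustration only, not the handout table
print(round(chi_square(invented_table), 3), degrees_of_freedom(invented_table))

# The handout's statistic, as reported on the slide:
print(2.97 > CRITICAL_VALUE_DF1_05)   # False: not significant at .05
```

A statistic is significant at the chosen level only when it exceeds the critical value for the table's degrees of freedom.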


Adding a control variable in Beyond 20/20:





But when we added visible minority as a control variable…

…it ended up looking like this:


Did you notice?

  • In the 1996 data, when we looked only at immigrants, versus non-immigrants, the non-immigrants were more likely to be employed

  • When we break it out by visible minority status,

    • Non-visible majority: immigrants are more likely to be employed than non-immigrants (11.73/9.16=1.28)

    • Visible minorities: immigrants are also more likely to be employed than non-immigrants (6.30/5.52=1.14)

  • This is an example of Simpson’s paradox
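The two ratios in parentheses above are odds ratios. The sketch below reproduces them, assuming the quoted figures are the odds of being employed for immigrants and for non-immigrants within each group; the slide does not label them explicitly, and the function name odds_ratio is illustrative.

```python
def odds_ratio(odds_group_a, odds_group_b):
    """Ratio of the odds in group A to the odds in group B."""
    return odds_group_a / odds_group_b

# (immigrants, non-immigrants), as quoted on the slide; assumed to be odds
# of being employed.
within_group_odds = {
    "non-visible majority": (11.73, 9.16),
    "visible minorities": (6.30, 5.52),
}

for group, (immigrants, non_immigrants) in within_group_odds.items():
    ratio = odds_ratio(immigrants, non_immigrants)
    print(f"{group}: {ratio:.2f}")   # above 1 means immigrants have higher odds
```

Both within-group ratios are above 1 even though the combined table pointed the other way, which is the reversal that Simpson's paradox describes.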


Using a statistical package to examine these relationships:

  • Automatically computes a chi-square

  • Computes degrees of freedom and probability

  • Is sensitive to sample size (so we need to take a subsample)



In the table on the previous slide:

  • How many cases are there in this subsample?

  • What percentage overall are unemployed?

  • What percentage of immigrants are unemployed?

  • Is this different from the percentage of non-immigrants that are unemployed?

  • What percentage are immigrants? How would you find out?



And tell me what you found:

  • How many tables are produced?

  • What is the difference between them?

  • How many of these tables are statistically significant (based on the Chisq)?

  • In which two groups is the relationship the opposite of the majority of the groups?


Comparison of means

  • Relationship between a continuous dependent variable, and a dichotomous independent variable

  • Significance:

    • Student’s t-statistic (similar to Z-statistic)

  • Example: average employment income for visible minorities versus all others (2001 census: dimensions series)


Visible minority status and employment income

  • In the 1996 census, a Dimensions series table showed visible minority status and employment income

  • This table is not available in the 2001 census

  • Relationship can only be examined using microdata, or requesting a custom tabulation




ANOVA (analysis of variance)

  • When the dependent variable is a continuous variable, and the independent variable is a polytomy

  • Significance:

    • F-ratio

    • Eta-squared (amount of explained variance)

  • Example: mean earned income for individual visible minorities
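Both measures come from splitting the total variation into a between-group part and a within-group part. The sketch below is a minimal one-way ANOVA on invented incomes for three hypothetical groups; it is not the calculation behind any Statistics Canada table, and the function name one_way_anova is illustrative.

```python
def one_way_anova(groups):
    """Return (F-ratio, eta-squared) for a list of groups of numeric values."""
    all_values = [x for group in groups for x in group]
    grand_mean = sum(all_values) / len(all_values)

    # Variation of the group means around the grand mean.
    ss_between = sum(
        len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups
    )
    # Variation of individual values around their own group mean.
    ss_within = sum(
        (x - sum(g) / len(g)) ** 2 for g in groups for x in g
    )

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_ratio = (ss_between / df_between) / (ss_within / df_within)
    eta_squared = ss_between / (ss_between + ss_within)
    return f_ratio, eta_squared

# Invented incomes (in $000s) for three hypothetical groups.
incomes = [[20, 22, 24], [30, 32, 34], [40, 42, 44]]
f_ratio, eta_squared = one_way_anova(incomes)
print(round(f_ratio, 1), round(eta_squared, 3))
```

Eta-squared is the share of total variation that lies between the groups, which is why it is read as the amount of explained variance.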






Pairwise relationships among continuous and dichotomous variables

  • Correlations

  • Strength and direction

    • Pearson correlation coefficient

  • Significance

    • Z-statistics for each pair

      (r-to-Z transformation table)
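The Pearson coefficient summarizes strength and direction in a single number between -1 and +1. The sketch below computes it from first principles on invented data; the variable and function names are illustrative.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    ss_x = sum((x - mean_x) ** 2 for x in xs)
    ss_y = sum((y - mean_y) ** 2 for y in ys)
    return covariation / math.sqrt(ss_x * ss_y)

years_of_schooling = [8, 10, 12, 14, 16]   # invented
income_thousands = [18, 25, 27, 38, 45]    # invented
r = pearson_r(years_of_schooling, income_thousands)
print(round(r, 3))
```

The sign of r gives the direction of the relationship and its absolute value gives the strength; significance still has to be assessed separately.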



Combined effects of one or more independent variables on a continuous dependent variable

  • Regression analysis

  • Categorical variables must be recoded to dummy variables

  • Strength and direction:

    • t-statistic for each independent variable

  • Significance:

    • F-ratio


Techniques we’ve covered:


What you need to remember:

  • To examine the relationships between 2 or more variables in Beyond 20/20, the variables must be in the same file and in different dimensions

  • If no table exists with those variables, the alternative is to use a relevant microdata file

    • To generate the table (or request a custom tabulation from Statistics Canada)

    • To compute the more complicated measures of association and significance


What you need to remember (cont’d):

  • A user who needs to do correlations or regression analysis needs continuous outcome (dependent) variables

  • Information about statistical techniques is readily available on the WWW

