## Statistics bootcamp


Outline

- Describing a variable
- Describing relationships among two or more variables

First, some vocabulary

- Variable: In social science research, for each unit of analysis, each item of data (e.g., age of person, income of family, consumer price index) is called a variable.
- Unit of analysis: The basic observable entity being analyzed by a study and for which data are collected in the form of variables. A unit of analysis is sometimes referred to as a case or "observation"

The variable ‘Sex’ can be coded as:

- 1 = ‘male’, 2 = ‘female’, 3 = ‘no response’, or
- 1 = ‘female’, 2 = ‘male’, or
- 1 = ‘male’, 2 = ‘female’, or
- ‘M’ = ‘male’, ‘F’ = ‘female’, or
- ‘male’, ‘female’, or even
- 1 = ‘yes’, 2 = ‘no’, 3 = ‘maybe’

The values a variable can take must be:

- exhaustive: include the characteristics of all cases, and
- mutually exclusive: each case must have one and only one value or code for each variable

What’s wrong with this coding scheme?

- Under $3,000
- $3,000-$7,000
- $8,000-$12,000
- $13,000-$17,000
- $18,000-$22,000
- $23,000-$27,000
- $28,000-$32,000
- $33,000-$37,000
- $38,000 & over

(Source: Census of Canada, 1961: user summary tapes)
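The two requirements (exhaustive, mutually exclusive) can be checked mechanically. The sketch below transcribes the brackets from the list above and assumes incomes are recorded in whole dollars; the helper name `find_categories` and the $500 test grid are illustrative choices, not part of the census documentation.

```python
# Income brackets as (label, low, high) in whole dollars; None means open-ended.
BRACKETS = [
    ("Under $3,000", None, 2999),
    ("$3,000-$7,000", 3000, 7000),
    ("$8,000-$12,000", 8000, 12000),
    ("$13,000-$17,000", 13000, 17000),
    ("$18,000-$22,000", 18000, 22000),
    ("$23,000-$27,000", 23000, 27000),
    ("$28,000-$32,000", 28000, 32000),
    ("$33,000-$37,000", 33000, 37000),
    ("$38,000 & over", 38000, None),
]


def find_categories(income):
    """Return the labels of every bracket that contains this income."""
    matches = []
    for label, low, high in BRACKETS:
        above_low = low is None or income >= low
        below_high = high is None or income <= high
        if above_low and below_high:
            matches.append(label)
    return matches


# Exhaustive: every income matches at least one bracket.
# Mutually exclusive: no income matches more than one bracket.
test_incomes = range(0, 40001, 500)
uncovered = [x for x in test_incomes if len(find_categories(x)) == 0]
overlapping = [x for x in test_incomes if len(find_categories(x)) > 1]

print("No category:", uncovered)
print("More than one category:", overlapping)
```

Any income listed under "No category" is a case the scheme cannot code.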

Variables are normally coded numerically, because:

- arithmetic is easier with numbers than with letters
- some characteristics are inherently numeric: age, weight in kilograms or pounds, number of children ever borne, income, value of dwelling, years of schooling, etc.
- space/size of data sets has, until recently, been a major consideration

Three basic types of variables

- Categorical
  - Nominal, aka nonorderable discrete, e.g. gender, ethnicity, immigrant status, province of residence, marital status, labour force status, etc.
  - Ordinal, aka orderable discrete, e.g. highest level of schooling, social class, left-right political identification, Likert scales, income groups, age groups, etc.
- Continuous, aka interval, numeric, e.g. actual age, income, etc.

Descriptive statistics summarize the properties of a sample of observations

- how the units of observation are the same (central tendency)
- how they are different (dispersion)
- how representative the sample is of the population at large (significance)

Nominal variable:

- Central tendency
- mode
- Dispersion
- frequency distribution
- percentages, proportions, odds
- Index of qualitative variation (IQV)
- Significance
- coefficient of variation (CV)
- Visualization
- bar chart

Mode: the category with the largest number or percentage of observations in a frequency distribution

Why the differences?

- What is the population in each table?
- What is in the denominator in each table?
- Which one is correct?

The most important thing to know about any distribution,

whether it is a rate, a proportion,

or a percent, is what is in the

denominator. And it must always be reported.

We can also derive the distribution information from the 2001 individual PUMF (public use microdata file) using a statistical package:

If we weight the distribution, both the frequencies and the percentages will almost match the distribution from the Profile file:

Just a few words on weighting:

- The weight is the inverse of the chance that any member of the population (universe) had of being selected for the sample; it tells us how many members of the population each sampled case represents
- In general
- the weight can be used to produce estimates for the total population (population weight)
- and/or the weight can adjust for known deficiencies in the sample (sample weight)

and a few more words on weighting…

- The 2001 census public use microdata file of individuals is a 2.7% sample of the population
- The weight variable (weightp) ranges from 35.545777 to 39.464996
- Knowing who was excluded from the sample is as important as knowing who was included

And some final words on weighting…

- When to use the population weight variable
- when you are producing frequencies to reflect the frequency in the population
- When you don’t need to use the population weight variable
- when you are producing percentages, proportions, ratios, rates, etc.
- When to use a sample weight variable
- always
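A minimal sketch of the difference between weighted and unweighted output. The records are invented for illustration; the weights are chosen near 37 because a 2.7% sample implies an average weight of about 1/0.027 ≈ 37.

```python
from collections import defaultdict

# Hypothetical sample records: (marital status, population weight).
sample = [
    ("married", 37.0),
    ("single", 36.5),
    ("married", 38.0),
    ("single", 37.5),
    ("married", 36.0),
]


def frequencies(records, weighted):
    """Count cases per category, summing weights when weighted is True."""
    totals = defaultdict(float)
    for category, weight in records:
        totals[category] += weight if weighted else 1.0
    return dict(totals)


def percentages(freqs):
    """Convert frequencies to percentages of the total."""
    total = sum(freqs.values())
    return {category: 100.0 * count / total for category, count in freqs.items()}


unweighted = frequencies(sample, weighted=False)
weighted = frequencies(sample, weighted=True)

print("Sample frequencies:     ", unweighted)
print("Population estimates:   ", weighted)
print("Unweighted percentages: ", percentages(unweighted))
print("Weighted percentages:   ", percentages(weighted))
```

The frequencies change by a factor of roughly 37, while the percentages barely move because the weights vary so little from case to case.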

Proportions, percents, and odds

- Percent: “the percent married in the population 15 years and over is…”

= (#married / population 15 years and over) × 100 = 49.47%

- Proportion: “the proportion married in the population 15 years and over is…”

= (#married / population 15 years and over) = .4947

- Odds: “the odds on being married are…”

= (#married / #not married in the population), or

= proportion married / (1 − proportion married)

= .4947 / (1 − .4947) = .4947 / .5053 = .9790
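The conversions above can be written as three small functions. This is a sketch using the slide's proportion of .4947; the function names are illustrative.

```python
def percent(count, total):
    """Share of the total, expressed per 100."""
    return 100.0 * count / total


def proportion_to_odds(p):
    """Odds = proportion in the category / proportion not in it."""
    return p / (1.0 - p)


def odds_to_proportion(odds):
    """Inverse conversion: proportion = odds / (1 + odds)."""
    return odds / (1.0 + odds)


p_married = 0.4947
odds_married = proportion_to_odds(p_married)

print(round(odds_married, 4))
print(round(odds_to_proportion(odds_married), 4))
```

Odds below 1 mean the category is less common than its complement, which matches a proportion just under one half.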

Coefficient of variation

- measures how representative the variable in the sample is of the distribution in the population
- computed as ((standard deviation/mean)*100)

[we will discuss these measures in the context of continuous variables]

- see Stats Can guidelines in user guides:
- CV < 16.6%: OK to publish; CV > 33.3%: do not publish
- SDA reports the cv when generating frequencies
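A sketch of the computation and of the two thresholds quoted above. The slide does not say how to treat values between 16.6% and 33.3%, so the middle branch simply defers to the user guide; the function names are illustrative.

```python
def coefficient_of_variation(standard_deviation, mean):
    """CV as a percentage, using the formula given on the slide."""
    return 100.0 * standard_deviation / mean


def publication_guideline(cv_percent):
    """Apply the two thresholds quoted from the user guides."""
    if cv_percent < 16.6:
        return "ok to publish"
    if cv_percent > 33.3:
        return "do not publish"
    # Assumption: the slide gives no rule for this range.
    return "between thresholds: check the user guide"


cv = coefficient_of_variation(standard_deviation=5.0, mean=50.0)
print(cv, publication_guideline(cv))
```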

Ordinal variable:

- Central tendency
- median and mode
- Dispersion
- frequency distribution
- range
- percentages/quantiles, proportions, odds
- Index of qualitative variation (IQV)
- Significance
- coefficient of variation (CV)
- Visualization
- histogram

The Median is the value that divides an orderable distribution exactly into halves.

Finding the median is easier if we compute cumulative percentages, e.g. in Excel:

| Age group | Percent | Cumulative percent |
|---|---|---|
| 0-4 years | 5.65 | 5.65 |
| 5-9 years | 6.59 | 12.24 |
| 10-14 years | 6.84 | 19.08 |
| 15-24 years | 13.36 | 32.44 |
| 25-34 years | 13.31 | 45.75 |
| 35-44 years | 17.00 | 62.75 |
| 45-54 years | 14.73 | 77.48 |
| 55-64 years | 9.56 | 87.04 |
| 65-74 years | 7.14 | 94.18 |
| 75-84 years | 4.43 | 98.61 |
| 85 years and over | 1.39 | 100 |

So how can we describe this distribution, using the vocabulary we have so far?

- What is the mode of this distribution?
- What is the median?
- What is the range?
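The same cumulative-percentage approach can be scripted. The sketch below uses the age distribution from the slide; `category_at` is an illustrative helper that returns the first category whose cumulative percentage reaches a target.

```python
# (age group, percent of population), transcribed from the table.
AGE_DISTRIBUTION = [
    ("0-4 years", 5.65),
    ("5-9 years", 6.59),
    ("10-14 years", 6.84),
    ("15-24 years", 13.36),
    ("25-34 years", 13.31),
    ("35-44 years", 17.00),
    ("45-54 years", 14.73),
    ("55-64 years", 9.56),
    ("65-74 years", 7.14),
    ("75-84 years", 4.43),
    ("85 years and over", 1.39),
]


def cumulative(distribution):
    """Return (label, cumulative percent) pairs in the original order."""
    running = 0.0
    result = []
    for label, pct in distribution:
        running += pct
        result.append((label, round(running, 2)))
    return result


def category_at(distribution, target_pct):
    """First category whose cumulative percent reaches the target."""
    for label, cum_pct in cumulative(distribution):
        if cum_pct >= target_pct:
            return label
    return distribution[-1][0]


median_category = category_at(AGE_DISTRIBUTION, 50.0)
modal_category = max(AGE_DISTRIBUTION, key=lambda row: row[1])[0]

print("Median falls in:", median_category)
print("Modal category: ", modal_category)
```

Note that the first three age groups span 5 years and the rest span 10, so the modal category partly reflects the width of the groups.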

Percentiles/quantiles

- a percentile (quantile) is the value below which a given percentage of the cases fall
- the median = the 50th percentile
- quartiles are the three values that divide an ordered distribution into four groups, each containing 25% of the cases

Interquartile range

- The interquartile range is the difference between the value at 75% of cases, and the value at 25% of cases
- What is the interquartile range for the following distribution?

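| Age group | Percent | Cumulative percent |
|---|---|---|
| 0-4 years | 5.65 | 5.65 |
| 5-9 years | 6.59 | 12.24 |
| 10-14 years | 6.84 | 19.08 |
| 15-24 years | 13.36 | 32.44 |
| 25-34 years | 13.31 | 45.75 |
| 35-44 years | 17.00 | 62.75 |
| 45-54 years | 14.73 | 77.48 |
| 55-64 years | 9.56 | 87.04 |
| 65-74 years | 7.14 | 94.18 |
| 75-84 years | 4.43 | 98.61 |
| 85 years and over | 1.39 | 100 |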
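With grouped data, the quartiles can only be estimated. One common approach, sketched here, is linear interpolation within the group where the cumulative percentage crosses 25% or 75%. This assumes ages are spread evenly within each group, which is an approximation; the helper name `percentile` is illustrative.

```python
# (lower bound in years, width in years, percent) for each closed age group.
# The open-ended "85 years and over" group is left out because neither
# quartile falls in it and it has no defined width.
GROUPS = [
    (0, 5, 5.65),
    (5, 5, 6.59),
    (10, 5, 6.84),
    (15, 10, 13.36),
    (25, 10, 13.31),
    (35, 10, 17.00),
    (45, 10, 14.73),
    (55, 10, 9.56),
    (65, 10, 7.14),
    (75, 10, 4.43),
]


def percentile(groups, target_pct):
    """Estimate the age below which target_pct percent of cases fall."""
    cumulative = 0.0
    for lower, width, pct in groups:
        if cumulative + pct >= target_pct:
            # Interpolate: how far into this group the target lies.
            fraction = (target_pct - cumulative) / pct
            return lower + width * fraction
        cumulative += pct
    raise ValueError("target lies in the open-ended top group")


q1 = percentile(GROUPS, 25.0)
q3 = percentile(GROUPS, 75.0)
iqr = q3 - q1

print(round(q1, 1), round(q3, 1), round(iqr, 1))
```

Without interpolation, the answer can only be stated in terms of categories: the 25th percentile falls in the 15-24 group and the 75th in the 45-54 group.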

Continuous variable:

- Central tendency
- mean, median and mode
- Dispersion
- range
- variance or standard deviation
- quantiles/percentiles
- interquartile range
- Significance
- standard error
- coefficient of variation
- Visualization
- polygon (line graph)

Means, variances, and standard deviations:

- Mean: the average, computed by adding up the values of all observations and dividing by the number of observations (cases)
- Variance: computed by taking each value and subtracting the value of the mean from it, squaring the results, adding them up, and dividing by the number of cases (N-1 rather than N when working with a sample)
- Standard deviation: the square root of the variance, in the same metric as the variable. In a normal distribution, about 68% of the cases fall within one standard deviation above or below the mean.
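The three definitions translate directly into code. This is a sketch with a small invented data set; it uses the N-1 divisor mentioned above.

```python
import math


def mean(values):
    """Sum of the values divided by the number of cases."""
    return sum(values) / len(values)


def sample_variance(values):
    """Sum of squared deviations from the mean, divided by N-1."""
    m = mean(values)
    return sum((x - m) ** 2 for x in values) / (len(values) - 1)


def standard_deviation(values):
    """Square root of the variance, in the units of the variable."""
    return math.sqrt(sample_variance(values))


# Hypothetical ages of eight children.
ages = [2, 4, 4, 4, 5, 5, 7, 9]

print(mean(ages))
print(round(sample_variance(ages), 3))
print(round(standard_deviation(ages), 3))
```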

Availability of continuous variables in Stats Can products:

- Stats Can rarely publishes truly continuous variables in its aggregate statistics products
- Some exceptions are:
- age by single years (census)
- estimates of population by single years of age (Annual demographic statistics)

Statistics Canada generally reports the distribution of continuous variables as:

- Measures of central tendency
  - an ordinal (categorical) variable
  - an average (mean) and standard error, variance, or standard deviation
  - a median
  - rates with a base other than 100, e.g. children ever born per 1,000 women in the 1991 census 2B profile
- Measures of dispersion
  - percentage: essentially a rate per 100 population, e.g. incidence of low income as a % of economic families, in the census profiles. This includes employment and unemployment rates
  - quantiles (e.g. quintiles in Income trends in Canada)
  - Gini coefficients (computed from the coefficient of variation) (e.g. Income trends in Canada)
- Measures of significance
  - standard error
In the following distribution…

- What is the median?
- What is the mean?
- What is the range?

Using the percentages and cumulative percentages:

- What is your best estimate of the interquartile range?
- What is your best estimate of the standard deviation?
- Why is the average (mean) income so much higher than the median?
- See pages 18-19 of your handout

Using the standard error to describe more of the distribution:

- the standard error of the mean measures how much the sample mean would vary from one random sample to the next, and therefore how close the mean in our data is likely to be to the mean in the population at large
- it is computed as the standard deviation divided by the square root of N
- the larger the N, the smaller the standard error, and the more confidence we can have that the sample mean is representative of the population mean

Confidence intervals

- The standard error makes most sense when we use it to compute a confidence interval around the mean
- For Canada, in the previous example:
- 95% upper confidence limit (UCL)

=(mean)+1.96(standard error)

=29769+1.96(19)= 29769 + 37.24=$29,806.24

- 95% lower confidence limit (LCL)

=(mean)-1.96(standard error)

=29769 -1.96(19)= 29769 - 37.24 =$29,731.76

- 1.96 is the Z-value that bounds the central 95% of a normal distribution
- The handout contains computed confidence intervals for each of the provinces (p.22)
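The Canada figures above can be reproduced in a few lines. The mean (29769) and standard error (19) come from the slide; the function names are illustrative.

```python
import math

Z_95 = 1.96  # Z-value for a 95% interval, as on the slide


def standard_error(standard_deviation, n):
    """Standard deviation divided by the square root of N."""
    return standard_deviation / math.sqrt(n)


def confidence_interval(mean, std_error, z=Z_95):
    """Return (lower limit, upper limit) around the mean."""
    margin = z * std_error
    return mean - margin, mean + margin


lower, upper = confidence_interval(mean=29769, std_error=19)
print(round(lower, 2), round(upper, 2))
```

Because the standard error shrinks with the square root of N, quadrupling the sample size only halves the width of the interval.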

How do we interpret this?

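- if we draw repeated random samples from the same population and compute a 95% confidence interval from each, about 95% of those intervals will contain the population mean total income; $29,732 to $29,806 is the interval from our one sample
- this is not the same as saying that there is a 95% probability that the population mean falls within those two particular limits: the population mean is fixed, and it is the interval that varies from sample to sample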
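What the 95% refers to can be explored with a small simulation: draw many samples from a population whose mean is known, build an interval from each sample, and count how often the interval contains the true mean. The population parameters, sample size, and seed below are arbitrary assumptions chosen for illustration.

```python
import math
import random
import statistics


def coverage(true_mean=50.0, true_sd=10.0, n=100, repeats=2000, seed=1):
    """Share of 95% intervals (one per sample) that contain the true mean."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(repeats):
        sample = [rng.gauss(true_mean, true_sd) for _ in range(n)]
        sample_mean = statistics.mean(sample)
        std_error = statistics.stdev(sample) / math.sqrt(n)
        lower = sample_mean - 1.96 * std_error
        upper = sample_mean + 1.96 * std_error
        if lower <= true_mean <= upper:
            hits += 1
    return hits / repeats


share = coverage()
print(share)
```

The share should come out close to 0.95: the 95% describes the procedure applied across many samples, not any single interval.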

Using microdata

- Statistical packages such as SAS, SPSS, etc. will compute the mean, the standard deviation, standard error, etc. for continuous variables
- SDA will only report these measures for variables with fewer than 8,000 values

How many values?

- What is the maximum number of values for the wagesp variable? For the totincp variable?
- If you were to do a frequency distribution of wagesp, how many rows might you have in the distribution?
- How would you go about finding out what the modal category for wagesp is?

Some final points about variables:

- they are not immutable
- continuous variables can be changed into ordinal, or even nominal variables
- ordinal variables can be changed into nominal variables, and
- nominal variables can be collapsed still further
- nominal and ordinal variables can be combined to create indices or scales
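A sketch of recoding in both directions described above: a continuous variable (age in years) is collapsed into an ordinal one (age groups), and then into a two-category nominal one. The cut points and labels are illustrative assumptions.

```python
import bisect

# Lower bounds of every age group after the first.
AGE_CUTS = [15, 25, 35, 45, 55, 65]
AGE_LABELS = ["0-14", "15-24", "25-34", "35-44", "45-54", "55-64", "65+"]


def age_group(age):
    """Continuous -> ordinal: place an age in its group."""
    return AGE_LABELS[bisect.bisect_right(AGE_CUTS, age)]


def senior_status(age):
    """Continuous -> dichotomy: collapse still further."""
    return "senior" if age >= 65 else "non-senior"


for age in (14, 15, 30, 65):
    print(age, age_group(age), senior_status(age))
```

Recoding in this direction loses information: once ages are grouped, the original values cannot be recovered from the groups.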

Describing relationships among two or more variables:

- Objectives of describing multivariate relationships
- description
- improved ability to predict the value of a variable for a case, by using the value of another variable (or variables)
- examine causation, which requires
- covariation
- the causal variable must occur before the outcome variable in time (temporal precedence)
- A non-spurious relationship

Variables are either…

- Dependent (aka ‘outcome’ variables) or
- Independent (aka ‘causal’ variables)

The same variable can be an independent variable in one hypothesis, and a dependent variable in another hypothesis:

- Does gender make a difference in level of education?

‘level of education’ is dependent; ‘gender’ is independent

- Does level of education make a difference to earned income?

‘earned income’ is dependent; ‘level of education’ is independent

- Is the effect of education on earned income the same for men and women?

‘earned income’ is dependent; ‘level of education’ is independent, and ‘gender’ is the control variable

Statistical measures describe…

- The strength of the relationship between two (or more) variables
- The direction of the relationship between two (or more) variables
- The significance of the relationship between two (or more) variables in the sample vis-à-vis the population

The choice of appropriate statistical technique is dependent on:

- whether the dependent variable is nominal, ordinal, or continuous
- whether the dependent variable is a dichotomy or a polytomy
- whether the independent variable(s) is a dichotomy, a polytomy, or continuous

Relationships between two or more nominal or ordinal variables

- cross-tabulations
- a common census output product (all the Topic-based tabulations in 2001)
- strength and direction:
- odds ratios
- significance:
- chi-square statistic

The chi-square statistic:

- Tells us how likely it is that a relationship as strong as the one in the sample would arise by chance if the two variables were unrelated in the population
- Significance (the probability of being wrong if we say there is a relationship)
- Evaluated based on
- Critical value (found in statistics texts)
- Degrees of freedom: (rows − 1) × (columns − 1), so df = 1 for a 2×2 table
- Level of significance (a 5% possibility of being wrong is normally acceptable in the social sciences)

So we know:

- Direction & strength: from graphing the odds
- Significance: from a chi-square statistic
- Can be computed in eg Excel
- For the previous table, 2.97 (see handout page 30)
- Critical value: 3.84 with 1 degree of freedom, at .05 significance (ie probability of being wrong)
- The relationship in this table is not statistically significant (2.97 is below the critical value of 3.84), so there is a good chance that it is the product of chance, e.g. sampling or measurement error
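The chi-square statistic compares the observed counts in a cross-tabulation with the counts expected if the two variables were unrelated. The sketch below computes it for a 2×2 table of invented counts (the table from the handout is not reproduced here) and compares the result with the critical value of 3.84 quoted above.

```python
CRITICAL_VALUE_DF1_05 = 3.84  # 1 degree of freedom, .05 significance


def chi_square(table):
    """Sum of (observed - expected)^2 / expected over all cells."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            statistic += (observed - expected) ** 2 / expected
    return statistic


def degrees_of_freedom(table):
    """(rows - 1) x (columns - 1)."""
    return (len(table) - 1) * (len(table[0]) - 1)


# Hypothetical counts: rows = two groups, columns = employed / unemployed.
observed_counts = [[10, 20], [30, 40]]

statistic = chi_square(observed_counts)
significant = statistic > CRITICAL_VALUE_DF1_05

print(round(statistic, 4), degrees_of_freedom(observed_counts), significant)
```

The same comparison applied to the slide's value gives 2.97 < 3.84, hence not significant at the .05 level.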

But when we added visible minority as a control variable…

…it ended up looking like this:

Did you notice?

- In the 1996 data, when we looked only at immigrants, versus non-immigrants, the non-immigrants were more likely to be employed
- When we break it out by visible minority status,
- Not a visible minority: immigrants are more likely to be employed than non-immigrants (11.73/9.16=1.28)
- Visible minorities: immigrants are also more likely to be employed than non-immigrants (6.30/5.52=1.14)
- This is an example of Simpson’s paradox
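The reversal can be reproduced with a small set of invented counts. The numbers below are hypothetical and chosen only so that the within-group odds resemble the pattern on the slide; they are not the 1996 census figures.

```python
def odds(employed, unemployed):
    """Odds of being employed."""
    return employed / unemployed


def odds_ratio(group_a, group_b):
    """Odds for group_a divided by odds for group_b."""
    return odds(*group_a) / odds(*group_b)


# Hypothetical (employed, unemployed) counts.
counts = {
    "not a visible minority": {"immigrant": (120, 10), "non-immigrant": (900, 100)},
    "visible minority": {"immigrant": (630, 100), "non-immigrant": (55, 10)},
}

# Odds ratio (immigrant vs non-immigrant) within each group.
within = {
    group: odds_ratio(cells["immigrant"], cells["non-immigrant"])
    for group, cells in counts.items()
}


def pooled(status):
    """Add the two groups together for one immigrant status."""
    employed = sum(cells[status][0] for cells in counts.values())
    unemployed = sum(cells[status][1] for cells in counts.values())
    return employed, unemployed


overall = odds_ratio(pooled("immigrant"), pooled("non-immigrant"))

print({group: round(value, 2) for group, value in within.items()})
print(round(overall, 2))
```

Within each group the ratio is above 1, yet the pooled ratio is below 1, because in these counts most immigrants belong to the group with lower employment odds overall.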

Using a statistical package to examine these relationships:

- Automatically computes a chi-square
- Computes degrees of freedom and probability
- Is sensitive to sample size (so we need to take a subsample)

In the table on the previous slide:

- How many cases are there in this subsample?
- What percentage overall are unemployed?
- What percentage of immigrants are unemployed?
- Is this different from the percentage of non-immigrants that are unemployed?
- What percentage are immigrants? How would you find out?

And tell me what you found:

- How many tables are produced?
- What is the difference between them?
- How many of these tables are statistically significant (based on the Chisq)?
- In which two groups is the relationship the opposite of the majority of the groups?

Comparison of means

- Relationship between a continuous dependent variable, and a dichotomous independent variable
- Significance:
- Student’s t-statistic (similar to Z-statistic)
- Example: average employment income for visible minorities versus all others (2001 census: dimensions series)

Visible minority status and employment income

- In the 1996 census, a Dimensions series table showed visible minority status and employment income
- This table is not available in the 2001 census
- The relationship can only be examined using microdata, or by requesting a custom tabulation

ANOVA (analysis of variance)

- When the dependent variable is a continuous variable, and the independent variable is a polytomy
- Significance:
- F-ratio
- Eta-squared (amount of explained variance)
- Example: mean earned income for individual visible minorities
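A sketch of a one-way ANOVA computed by hand: the F-ratio compares variation between group means with variation within groups, and eta-squared is the share of total variation explained by group membership. The three groups of incomes are invented for illustration.

```python
def anova(groups):
    """Return (F-ratio, eta-squared) for a list of groups of values."""
    all_values = [x for group in groups for x in group]
    grand_mean = sum(all_values) / len(all_values)
    group_means = [sum(group) / len(group) for group in groups]

    # Variation of group means around the grand mean.
    ss_between = sum(
        len(group) * (m - grand_mean) ** 2 for group, m in zip(groups, group_means)
    )
    # Variation of cases around their own group mean.
    ss_within = sum(
        (x - m) ** 2 for group, m in zip(groups, group_means) for x in group
    )

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_ratio = (ss_between / df_between) / (ss_within / df_within)
    eta_squared = ss_between / (ss_between + ss_within)
    return f_ratio, eta_squared


# Hypothetical earned incomes (in $1,000s) for three groups.
incomes = [[20, 22, 24], [30, 32, 34], [40, 42, 44]]

f_ratio, eta_squared = anova(incomes)
print(round(f_ratio, 2), round(eta_squared, 3))
```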

Pairwise relationships among continuous and dichotomous variables

- Correlations
- Strength and direction
- Pearson correlation coefficient
- Significance
- Z-statistics for each pair

(r-to-Z transformation table)
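A sketch of the Pearson correlation coefficient, together with the r-to-Z conversion that a transformation table provides. It is assumed here that the table refers to Fisher's transformation, Z = atanh(r); the data are invented.

```python
import math


def pearson_r(xs, ys):
    """Covariation of x and y, scaled by the spread of each."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariation / (spread_x * spread_y)


def fisher_z(r):
    """Assumed r-to-Z transformation (Fisher): Z = atanh(r)."""
    return math.atanh(r)


# Hypothetical data: years of schooling and income in $1,000s.
schooling = [8, 10, 12, 14, 16]
income = [20, 24, 31, 35, 40]

r = pearson_r(schooling, income)
print(round(r, 3), round(fisher_z(r), 3))
```

The sign of r gives the direction of the relationship and its absolute value (between 0 and 1) gives the strength.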

Combined effects of one or more independent variables on a continuous dependent variable

- Regression analysis
- Categorical variables must be recoded to dummy variables
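- Strength and direction:
- regression coefficient for each independent variable
- R² (share of variance explained by the model)
- Significance:
- t-statistic for each independent variable
- F-ratio for the model as a whole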
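A sketch of the dummy-variable recoding mentioned above, followed by a regression with a single independent variable fitted by ordinary least squares. The data and the helper names are invented; a statistical package would also report the t-statistics and F-ratio, which are omitted here.

```python
def dummy(value, category):
    """Recode a categorical value to 1 (in the category) or 0 (not)."""
    return 1 if value == category else 0


def simple_ols(xs, ys):
    """Return (intercept, slope, R-squared) for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sum(
        (x - mean_x) ** 2 for x in xs
    )
    intercept = mean_y - slope * mean_x
    predicted = [intercept + slope * x for x in xs]
    ss_residual = sum((y - p) ** 2 for y, p in zip(ys, predicted))
    ss_total = sum((y - mean_y) ** 2 for y in ys)
    return intercept, slope, 1 - ss_residual / ss_total


# Hypothetical cases: sex and earned income in $1,000s.
sex = ["male", "female", "male", "female", "male", "female"]
income = [34, 30, 36, 28, 38, 32]

female = [dummy(s, "female") for s in sex]
intercept, slope, r_squared = simple_ols(female, income)

print(intercept, slope, round(r_squared, 3))
```

With a single dummy variable, the intercept is the mean of the reference category (coded 0) and the slope is the difference between the two group means.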

What you need to remember:

- To examine the relationships between 2 or more variables in Beyond 20/20, the variables must be in the same file and in different dimensions
- If no table exists with those variables, the alternative is to use a relevant microdata file
- To generate the table (or request a custom tabulation from Statistics Canada)
- To compute the more complicated measures of association and significance

What you need to remember (cont’d):

- A user who needs to do correlations or regression analysis needs a continuous outcome (dependent) variable
- Information about statistical techniques is readily available on the WWW
