Statistics bootcamp Laine Ruus Data Library Service, University of Toronto For ACCOLEDS 2004 2004-12-07
Outline • Describing a variable • Describing relationships among two or more variables
First, some vocabulary • Variable: In social science research, for each unit of analysis, each item of data (e.g., age of person, income of family, consumer price index) is called a variable. • Unit of analysis: The basic observable entity being analyzed by a study and for which data are collected in the form of variables. A unit of analysis is sometimes referred to as a case or "observation"
‘Sex’ can be coded with values: • 1=‘male’ 2=‘female’ 3=‘no response’ or • 1=‘female’ 2=‘male’ or • 1=‘male’ 2=‘female’ or • ‘M’=‘male’ ‘F’=‘female’ or • ‘male’ ‘female’ or even • 1=‘yes’ 2=‘no’ 3=‘maybe’
The values a variable takes must be: • exhaustive: include the characteristics of all cases, and • mutually exclusive: each case must have one and only one value or code for each variable
What’s wrong with this coding scheme? • Under $3,000 • $3,000-$7,000 • $8,000-$12,000 • $13,000-$17,000 • $18,000-$22,000 • $23,000-$27,000 • $28,000-$32,000 • $33,000-$37,000 • $38,000 & over (Source: Census of Canada, 1961: user summary tapes)
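One way to test a coding scheme against the two rules (exhaustive, mutually exclusive) is to write down each category's bounds and look for values that fall in no category or in more than one. The sketch below is only an illustration: the helper name `find_problems` is hypothetical, incomes are assumed to be whole dollars starting at zero, and only the first four categories of the 1961 scheme are entered.

```python
# Hypothetical helper, not part of any statistical package.
# Each category is (label, low, high) with both bounds inclusive, in whole dollars.
def find_problems(categories):
    """Return (gaps, overlaps) found between consecutive categories."""
    ordered = sorted(categories, key=lambda c: c[1])
    gaps, overlaps = [], []
    for (_, _, high_a), (_, low_b, _) in zip(ordered, ordered[1:]):
        if low_b > high_a + 1:
            # Values between the two categories have no code: not exhaustive.
            gaps.append((high_a + 1, low_b - 1))
        elif low_b <= high_a:
            # Values shared by both categories: not mutually exclusive.
            overlaps.append((low_b, high_a))
    return gaps, overlaps


# First four categories of the scheme, assuming whole-dollar incomes.
scheme_1961 = [
    ("Under $3,000", 0, 2999),
    ("$3,000-$7,000", 3000, 7000),
    ("$8,000-$12,000", 8000, 12000),
    ("$13,000-$17,000", 13000, 17000),
]

gaps, overlaps = find_problems(scheme_1961)
print(gaps)      # [(7001, 7999), (12001, 12999)]
print(overlaps)  # []
```

Under these assumptions, an income of $7,500 has no category to go into, so the scheme is not exhaustive.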
Variables are normally coded numerically, because: • arithmetic is easier with numbers than with letters • some characteristics are inherently numeric: age, weight in kilograms, number of children ever borne, income, value of dwelling, years of schooling, etc. • space/size of data sets has, until recently, been a major consideration
Three basic types of variables • Categorical • Nominal, aka nonorderable discrete Eg gender, ethnicity, immigrant status, province of residence, marital status, labour force status, etc. • Ordinal, aka orderable discrete Eg highest level of schooling, social class, left-right political identification, Likert scales, income groups, age groups, etc. • Continuous, aka interval, numeric Eg age, income, etc.
Descriptive statistics summarize the properties of a sample of observations • how the units of observation are the same (central tendency) • how they are different (dispersion) • how representative the sample is of the population at large (significance)
Nominal variable: • Central tendency • mode • Dispersion • frequency distribution • percentages, proportions, odds • Index of qualitative variation (IQV) • Significance • coefficient of variation (CV) • Visualization • bar chart
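As a rough illustration of the nominal measures listed above, the sketch below computes a frequency distribution, the mode, and the index of qualitative variation for ten made-up cases. It uses the common form of the index, IQV = K(1 - Σp²)/(K - 1), where p is the proportion in each category and K is the number of categories; here K is assumed to be the number of categories actually observed.

```python
from collections import Counter

# Made-up marital status values for ten cases (not real data).
cases = ["married"] * 5 + ["single"] * 3 + ["widowed"] * 2

# Frequency distribution and proportions.
freq = Counter(cases)
n = sum(freq.values())
proportions = {value: count / n for value, count in freq.items()}

# Mode: the most frequent value.
mode = freq.most_common(1)[0][0]

# IQV: 0 when all cases share one value, 1 when cases are spread evenly.
k = len(freq)  # assumption: K = number of observed categories
iqv = k * (1 - sum(p ** 2 for p in proportions.values())) / (k - 1)

print(dict(freq))     # {'married': 5, 'single': 3, 'widowed': 2}
print(mode)           # married
print(round(iqv, 2))  # 0.93
```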
Why the differences? • What is the population in each table? • Which one is correct?
The most important thing to know about any distribution, whether it is a rate, a proportion, or a percent, is what is in the denominator. And it must always be reported.
We can also derive the distribution information from the 1996 census public use microdata file (PUMF) of individuals, using a statistical package:
If we weight the distribution, both the frequencies and the percentages will almost match the distribution from the Profile file:
Just a few words on weighting: • A weight is derived from the chance that any member of the population (universe) had of being selected for the sample: it is the inverse of that probability, i.e. the number of population members that each sampled case represents • In general • the weight can be used to produce estimates for the total population (population weight) • and/or the weight can adjust for known deficiencies in the sample (sample weight)
and a few more words on weighting… • The 1996 census public use microdata file of individuals is a 2.8% sample of the population • Stats Can calls this a ‘self-weighting’ sample: every case has a weight of 36 (1/0.028 ≈ 36) • Knowing who was excluded from the sample is as important as knowing who was included
And some final words on weighting… • When to use the population weight variable • when you are producing frequencies to reflect the frequency in the population • When you don’t need to use the population weight variable • when you are producing percentages, proportions, ratios, rates, etc. • When to use a sample weight variable • always
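The point about percentages can be checked with a small sketch. With a constant population weight, every frequency is multiplied by the same number, so the percentages do not change. The counts below are made up for illustration; only the weight of 36 comes from the slides.

```python
# Made-up unweighted sample counts (not census figures).
sample_counts = {"married": 512, "not married": 488}

WEIGHT = 36  # constant population weight quoted for the 1996 individual file

# Weighted frequencies estimate the counts in the population.
weighted_counts = {value: count * WEIGHT for value, count in sample_counts.items()}


def percentages(counts):
    """Percentage distribution of a frequency table, to one decimal place."""
    total = sum(counts.values())
    return {value: round(100 * count / total, 1) for value, count in counts.items()}


print(weighted_counts)               # {'married': 18432, 'not married': 17568}
print(percentages(sample_counts))    # {'married': 51.2, 'not married': 48.8}
print(percentages(weighted_counts))  # {'married': 51.2, 'not married': 48.8}
```

This only holds when every case carries the same weight. When weights differ from case to case, weighted and unweighted percentages will generally differ.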
Proportions, percents, and odds • Percent: “the percent married in the population 15 years and over is…” = (#married / population ≥15 years)*100 = 51.2% • Proportion: “the proportion married in the population 15 years and over is…” = (#married / population ≥15 years) = .512 • Odds: “the odds on being married are…” = (#married / #not married in the population) or = proportion married/(1 - proportion married) = .512/(1-.512) = .512/.488 = 1.049
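The three measures can be reproduced with a few lines of code. This is a minimal sketch: the function name `describe` and the counts 512 and 1,000 are hypothetical, chosen only so that the proportion matches the .512 used above.

```python
def describe(n_in_category, n_total):
    """Return (proportion, percent, odds) for one category of a variable."""
    proportion = n_in_category / n_total
    percent = 100 * proportion
    # Odds compare the category with everyone else in the denominator.
    odds = n_in_category / (n_total - n_in_category)
    return proportion, percent, odds


# Hypothetical counts: 512 married out of 1,000 persons 15 years and over.
proportion, percent, odds = describe(512, 1000)

print(round(proportion, 3))  # 0.512
print(round(percent, 1))     # 51.2
print(round(odds, 3))        # 1.049

# The odds can equally be computed from the proportion alone.
print(round(proportion / (1 - proportion), 3))  # 1.049
```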
Ordinal variable: • Central tendency • median and mode • Dispersion • frequency distribution • range • percentages/quantiles, proportions, odds • Index of qualitative variation (IQV) • Significance • coefficient of variation (CV) • Visualization • histogram
The median is the value that divides an orderable distribution exactly into halves. Finding the median is easier if we compute cumulative percentages.
Example: cumulative percentages

Age group             %        Cumulative %
0-4 years             6.65       6.65
5-9 years             6.90      13.55
10-14 years           6.91      20.46
15-24 years          13.37      33.83
25-34 years          15.60      49.43
35-44 years          16.85      66.28
45-54 years          12.86      79.14
55-64 years           8.63      87.77
65-74 years           7.15      94.92
75-84 years           3.91      98.83
85 years and over     1.17     100.00
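The cumulative column is simply a running total of the percentage column. A minimal sketch, using the percentages from the example:

```python
from itertools import accumulate

# Percentage distribution by age group, as given in the example.
age_percent = [
    ("0-4 years", 6.65),
    ("5-9 years", 6.90),
    ("10-14 years", 6.91),
    ("15-24 years", 13.37),
    ("25-34 years", 15.60),
    ("35-44 years", 16.85),
    ("45-54 years", 12.86),
    ("55-64 years", 8.63),
    ("65-74 years", 7.15),
    ("75-84 years", 3.91),
    ("85 years and over", 1.17),
]

# Running total, rounded to two decimals to hide floating-point noise.
cumulative = [round(total, 2) for total in accumulate(pct for _, pct in age_percent)]

for (label, pct), cum in zip(age_percent, cumulative):
    print(f"{label:<18} {pct:>6.2f} {cum:>7.2f}")
```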
So how can we describe this distribution, using the vocabulary we have so far? • What is the mode of this distribution? • What is the median? • What is the range?
Percentiles/quantiles • a percentile (quantile) is the value below which a given percentage of the cases fall • the median = the 50th percentile • quartiles are the three values that divide a distribution into 4 groups, each containing 25% of the cases
Interquartile range • The interquartile range is the difference between the value at 75% of cases, and the value at 25% of cases • What is the interquartile range for the following distribution?
Cumulative percentages

Age group             %        Cumulative %
0-4 years             6.65       6.65
5-9 years             6.90      13.55
10-14 years           6.91      20.46
15-24 years          13.37      33.83
25-34 years          15.60      49.43
35-44 years          16.85      66.28
45-54 years          12.86      79.14
55-64 years           8.63      87.77
65-74 years           7.15      94.92
75-84 years           3.91      98.83
85 years and over     1.17     100.00
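With grouped data we cannot read off an exact age at a given percentile, but we can find the category in which it falls: the first category whose cumulative percentage reaches that percentile. The sketch below does this for the quartiles; the function name `category_at` is hypothetical.

```python
from itertools import accumulate

labels = [
    "0-4 years", "5-9 years", "10-14 years", "15-24 years", "25-34 years",
    "35-44 years", "45-54 years", "55-64 years", "65-74 years",
    "75-84 years", "85 years and over",
]
percents = [6.65, 6.90, 6.91, 13.37, 15.60, 16.85, 12.86, 8.63, 7.15, 3.91, 1.17]


def category_at(percentile, labels, percents):
    """Return the first category whose cumulative % reaches the percentile."""
    for label, cumulative in zip(labels, accumulate(percents)):
        if cumulative >= percentile:
            return label
    return labels[-1]  # guard against rounding in the last cumulative value


print(category_at(25, labels, percents))  # 15-24 years
print(category_at(50, labels, percents))  # 35-44 years
print(category_at(75, labels, percents))  # 45-54 years
```

So the middle 50% of cases lies somewhere between the 15-24 and the 45-54 age groups. A narrower estimate would need single years of age or an interpolation assumption within each group.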
Continuous variable: • Central tendency • mean, median and mode • Dispersion • range • variance or standard deviation • quantiles/percentiles • interquartile range • Significance • standard error • coefficient of variation • Visualization • polygon (line graph)
Means, variances, and standard deviations: • Mean: the average, computed by adding up the values of all observations and dividing by the number of observations (cases) • Variance: computed by subtracting the mean from each value, squaring the results, adding them up, and dividing by the number of cases (strictly, by N-1 when the data are a sample) • Standard deviation: the square root of the variance; in a normal distribution, about 68% of the cases fall within one standard deviation above or below the mean
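These definitions can be tried out with Python's standard `statistics` module. The eight values below are made up for illustration. `pvariance` divides by N, while `variance` divides by N-1, which is the usual choice when the data are a sample.

```python
import statistics

# Made-up values for eight cases (not real data).
values = [2, 4, 4, 4, 5, 5, 7, 9]

mean = statistics.mean(values)

# Sum of squared deviations from the mean, computed by hand.
sum_sq = sum((v - mean) ** 2 for v in values)

population_variance = statistics.pvariance(values)  # sum_sq / N
sample_variance = statistics.variance(values)       # sum_sq / (N - 1)
sample_sd = statistics.stdev(values)                # square root of sample variance

print(mean)                       # 5
print(sum_sq)                     # 32
print(population_variance)        # 4
print(round(sample_variance, 3))  # 4.571
print(round(sample_sd, 3))        # 2.138
```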
Availability of continuous variables in Stats Can products: • Stats Can only rarely publishes truly continuous variables • Some exceptions are: • age by single years of age (census) • estimates of population by age (Annual demographic statistics)
Statistics Canada generally reports the distribution of continuous variables as: • Measures of central tendency • an ordinal (categorical) variable • an average (mean) and standard error, variance, or standard deviation • a median • rates (with a base other than 100), e.g. children ever born per 1,000 women in the 1991 census 2B profile • Measures of dispersion • percentage: essentially, a rate per 100 population. E.g., incidence of low income as a % of economic families, in the census profiles. This includes employment and unemployment rates • quantiles (e.g. quintiles in Income trends in Canada) • Gini coefficients (computed from the coefficient of variation) (e.g. Income trends in Canada) • Measures of significance • standard error
For the following distribution… • What is the median? • What is the mean? • What is the range?
Using the percentages and cumulative percentages: • What is your best estimate of the interquartile range? • What is your best estimate of the standard deviation? • Why is the average (mean) income so much higher than the median? • See pages 18-19 of your handout
Using the standard error to describe more of the distribution: • the standard error of the mean is a measure of how much the sample mean would vary from one sample to the next, and therefore of how close the mean in the data we are looking at is likely to be to the mean in the population at large • computed as the standard deviation (the square root of the variance) divided by the square root of N • the larger the N, the smaller the standard error
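The standard error of the mean is conventionally computed as the standard deviation divided by the square root of N. The sketch below shows how it shrinks as N grows; the standard deviation of 20,000 is a made-up figure, and the function name `standard_error` is hypothetical.

```python
import math


def standard_error(std_dev, n):
    """Standard error of the mean: standard deviation / square root of N."""
    return std_dev / math.sqrt(n)


STD_DEV = 20_000  # made-up standard deviation of income, in dollars

for n in (100, 10_000, 1_000_000):
    print(n, standard_error(STD_DEV, n))
# 100 2000.0
# 10000 200.0
# 1000000 20.0
```

Multiplying the sample size by 100 cuts the standard error by a factor of 10, because N enters through its square root.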
Confidence intervals • The standard error makes most sense when we use it to compute a confidence interval around the mean • For Canada, in the previous example: • 95% upper confidence limit (UCL) = (mean) + 1.96(standard error) = 25196 + 1.96(13) = 25196 + 25.48 = $25,221 • 95% lower confidence limit (LCL) = (mean) - 1.96(standard error) = 25196 - 1.96(13) = 25196 - 25.48 = $25,171 • 1.96 is the Z-value that cuts off the central 95% of a normal distribution • The handout contains computed confidence intervals for each of the provinces (p.21)
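The same calculation can be sketched in code, using the mean (25196) and standard error (13) from the Canada example. The function name `confidence_interval` is hypothetical. Instead of the rounded 1.96, it takes the Z-value from the standard library's `NormalDist`, which gives roughly 1.95996 for a 95% interval.

```python
from statistics import NormalDist


def confidence_interval(mean, std_error, level=0.95):
    """Return (lower, upper) limits of a normal-theory confidence interval."""
    # Z-value that leaves (1 - level) / 2 in each tail of a normal distribution.
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return mean - z * std_error, mean + z * std_error


lower, upper = confidence_interval(25196, 13)
print(round(lower), round(upper))  # 25171 25221
```

Passing `level=0.99` would widen the interval, since a larger Z-value is needed to cover 99% of a normal distribution.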
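How do we interpret this? • if we draw repeated random samples from the same population and compute a 95% confidence interval from each, about 95% of those intervals will contain the population mean total income • for our sample, the interval runs from $25,171 to $25,221 • this is not the same as saying that there is a 95% probability that the population mean falls within those two limits: the population mean is fixed, and it is the interval that varies from sample to sample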
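A small simulation can make the repeated-sampling idea concrete. The sketch below draws many random samples from a made-up normal population, computes a 95% confidence interval from each, and counts how often the interval contains the population mean. The population values, sample size, and number of repetitions are all assumptions chosen for illustration.

```python
import math
import random
import statistics

rng = random.Random(2004)         # fixed seed so the run is repeatable
POP_MEAN, POP_SD = 25_000, 20_000  # made-up population mean and std deviation
N, REPS = 200, 1000                # cases per sample, number of samples


def interval_covers_mean():
    """Draw one sample and report whether its 95% CI contains POP_MEAN."""
    sample = [rng.gauss(POP_MEAN, POP_SD) for _ in range(N)]
    mean = statistics.mean(sample)
    std_error = statistics.stdev(sample) / math.sqrt(N)
    return mean - 1.96 * std_error <= POP_MEAN <= mean + 1.96 * std_error


coverage = sum(interval_covers_mean() for _ in range(REPS)) / REPS
print(coverage)  # expected to be close to 0.95
```

Each sample produces a different interval; the 95% describes how often the procedure captures the population mean over many samples.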
Using microdata • Statistical packages such as SAS, SPSS, etc. will compute the mean, the standard deviation, standard error, etc. for continuous variables • SDA will only report these measures for variables with fewer than 8,000 values