Statistics bootcamp

Laine Ruus, Data Library Service, University of Toronto. Rev. 2005-04-26

Outline • Describing a variable • Describing relationships among two or more variables

First, some vocabulary • Variable: In social science research, each item of data recorded for a unit of analysis (e.g., age of person, income of family, consumer price index) is called a variable. • Unit of analysis: The basic observable entity being analyzed by a study and for which data are collected in the form of variables. A unit of analysis is sometimes referred to as a case or an observation.

The variable 'Sex' can be coded as: • 1='male' 2='female' 3='no response', or • 1='female' 2='male', or • 1='male' 2='female', or • 'M'='male' 'F'='female', or • 'male' 'female', or even • 1='yes' 2='no' 3='maybe'

The values a variable can take must be: • exhaustive: include the characteristics of all cases, and • mutually exclusive: each case must have one and only one value or code for each variable

What’s wrong with this coding scheme? • Under $3,000 • $3,000-$7,000 • $8,000-$12,000 • $13,000-$17,000 • $18,000-$22,000 • $23,000-$27,000 • $28,000-$32,000 • $33,000-$37,000 • $38,000 & over (Source: Census of Canada, 1961: user summary tapes)
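One way to test a coding scheme against the two criteria (exhaustive, mutually exclusive) is to run a few values through it. The sketch below is a hypothetical check, not anything taken from the 1961 Census documentation: it assumes incomes can take any dollar value and that each printed bound is inclusive, and the names `GROUPS` and `codes_for` are invented for illustration.

```python
import math

# Income groups as printed on the slide: (label, lower bound, upper bound).
# Assumption: printed bounds are inclusive, except "Under $3,000".
GROUPS = [
    ("Under $3,000", -math.inf, 3_000),
    ("$3,000-$7,000", 3_000, 7_000),
    ("$8,000-$12,000", 8_000, 12_000),
    ("$13,000-$17,000", 13_000, 17_000),
    ("$18,000-$22,000", 18_000, 22_000),
    ("$23,000-$27,000", 23_000, 27_000),
    ("$28,000-$32,000", 28_000, 32_000),
    ("$33,000-$37,000", 33_000, 37_000),
    ("$38,000 & over", 38_000, math.inf),
]


def codes_for(income):
    """Return the labels of every group that the income falls into."""
    matches = []
    for label, low, high in GROUPS:
        if label.startswith("Under"):
            hit = income < high          # strictly below the upper bound
        else:
            hit = low <= income <= high  # both bounds inclusive
        if hit:
            matches.append(label)
    return matches


print(codes_for(5_000))   # exactly one label: coded unambiguously
print(codes_for(7_500))   # empty list: no code covers this income
```

An empty list means the scheme is not exhaustive for that value; a list with more than one label would mean the codes are not mutually exclusive.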

Variables are normally coded numerically, because: • arithmetic is easier with numbers than with letters • some characteristics are inherently numeric: age, weight in kilograms or pounds, number of children ever borne, income, value of dwelling, years of schooling, etc. • space/size of data sets has, until recently, been a major consideration

Three basic types of variables • Categorical, nominal (aka nonorderable discrete), e.g., gender, ethnicity, immigrant status, province of residence, marital status, labour force status, etc. • Categorical, ordinal (aka orderable discrete), e.g., highest level of schooling, social class, left-right political identification, Likert scales, income groups, age groups, etc. • Continuous (aka interval, numeric), e.g., actual age, income, etc.

Descriptive statistics summarize the properties of a sample of observations • how the units of observation are the same (central tendency) • how they are different (dispersion) • how representative the sample is of the population at large (significance)

Nominal variable: • Central tendency: mode • Dispersion: frequency distribution; percentages, proportions, odds; index of qualitative variation (IQV) • Significance: coefficient of variation (CV) • Visualization: bar chart

Mode: the category with the largest number or percentage of observations in a frequency distribution
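As a rough illustration of the mode and of a frequency distribution for a nominal variable, the sketch below uses a handful of made-up responses (they are not census figures). The IQV line uses one common formulation, K(1 - sum of squared proportions)/(K - 1), which is an assumption here since the slides do not give a formula.

```python
from collections import Counter

# Hypothetical responses, invented for illustration only.
responses = ["married"] * 5 + ["single"] * 3 + ["divorced"] + ["widowed"]

freq = Counter(responses)                    # frequency distribution
n = sum(freq.values())                       # denominator: all responses
percent = {cat: 100 * count / n for cat, count in freq.items()}

# Mode: the category with the largest number of observations.
mode_category, mode_count = freq.most_common(1)[0]

# IQV (assumed formulation): 0 when every case is in one category,
# 1 when cases are spread evenly over all K categories.
k = len(freq)
iqv = k * (1 - sum((count / n) ** 2 for count in freq.values())) / (k - 1)

print(freq)
print(percent)
print(mode_category, round(iqv, 3))
```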

The frequencies can be visualized as a bar chart (based on percentages):

The same distribution, from one of the Canadian overview files:

and showing percentages:

Notice the differences:

Why the differences? • What is the population in each table? • What is in the denominator in each table? • Which one is correct?

The most important thing to know about any distribution, whether it is a rate, a proportion, or a percent, is what is in the denominator. And it must always be reported.

We can also derive the distribution information from the 2001 individual pumf using a statistical package:

If we weight the distribution, both the frequencies and the percentages will almost match the distribution from the Profile file:

Just a few words on weighting: • The weight reflects the chance that any member of the population (universe) had of being selected for the sample; it is usually the inverse of that chance, i.e., the number of people in the population that each sampled case represents • In general, the weight can be used to produce estimates for the total population (population weight), and/or to adjust for known deficiencies in the sample (sample weight)

And a few more words on weighting… • The 2001 census public use microdata file of individuals is a 2.7% sample of the population • The weight variable (weightp) ranges from 35.545777 to 39.464996 • Knowing who was excluded from the sample is as important as knowing who was included

And some final words on weighting… • Use the population weight variable when you are producing frequencies that should reflect the frequencies in the population • You don't need the population weight variable when you are producing percentages, proportions, ratios, rates, etc. • Use a sample weight variable, if there is one, always
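The effect of a population weight can be sketched with a few made-up records. A 2.7% sample implies roughly 1/0.027 ≈ 37 people represented by each record, which is in line with the range of weightp; the records and weights below are hypothetical, chosen only to fall in that range.

```python
# Hypothetical microdata records: (marital status, population weight).
records = [
    ("married", 37.0),
    ("married", 36.5),
    ("single", 38.0),
    ("single", 37.5),
    ("married", 37.2),
]

unweighted = {}   # number of sample records per category
weighted = {}     # estimated number of people per category
for status, weight in records:
    unweighted[status] = unweighted.get(status, 0) + 1
    weighted[status] = weighted.get(status, 0.0) + weight


def as_percent(counts):
    """Convert counts to percentages of their own total."""
    total = sum(counts.values())
    return {cat: 100 * value / total for cat, value in counts.items()}


unweighted_pct = as_percent(unweighted)
weighted_pct = as_percent(weighted)

print(unweighted, weighted)
print(unweighted_pct, weighted_pct)
```

Because the weights vary so little, the weighted and unweighted percentages are close, while the weighted frequencies are scaled up to population size.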

Proportions, percents, and odds • Percent: "the percent married in the population 15 years and over is…" = (#married / population 15 years and over) * 100 = 49.47% • Proportion: "the proportion married in the population 15 years and over is…" = (#married / population 15 years and over) = .4947 • Odds: "the odds on being married are…" = (#married / #not married in the population), or = proportion married / (1 - proportion married) = .4947 / (1 - .4947) = .4947 / .5053 = .9790
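The three measures differ only in what goes into the denominator and in scaling. A minimal sketch, using hypothetical counts chosen to reproduce the proportion of .4947 quoted above (the function name `describe_share` is invented for illustration):

```python
def describe_share(count, total):
    """Return (proportion, percent, odds) for `count` cases out of `total`."""
    proportion = count / total
    percent = proportion * 100
    odds = proportion / (1 - proportion)   # same as count / (total - count)
    return proportion, percent, odds


# Hypothetical counts: 4,947 married out of 10,000 people aged 15 and over.
proportion, percent, odds = describe_share(4_947, 10_000)
print(round(proportion, 4), round(percent, 2), round(odds, 4))
```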

Coefficient of variation • measures how representative the variable in the sample is of the distribution in the population • computed as (standard deviation / mean) * 100 [we will discuss these measures in the context of continuous variables] • see the Stats Can guidelines in the user guides: cv < 16.6% is ok to publish; cv > 33.3%, do not publish • SDA reports the cv when generating frequencies
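The formula and the two thresholds quoted on this slide can be written down directly. This is a sketch only: the function names are invented, and the slide does not say what to do between 16.6% and 33.3%, so that case is passed back to the user guide rather than guessed.

```python
def coefficient_of_variation(standard_deviation, mean):
    """CV as a percentage: (standard deviation / mean) * 100."""
    return standard_deviation / mean * 100


def release_guideline(cv):
    """Apply the two thresholds quoted on the slide."""
    if cv < 16.6:
        return "ok to publish"
    if cv > 33.3:
        return "do not publish"
    # The slide gives no rule for this band.
    return "between the thresholds: consult the user guide"


cv = coefficient_of_variation(5, 50)   # hypothetical values
print(cv, release_guideline(cv))
```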

Ordinal variable: • Central tendency: median and mode • Dispersion: frequency distribution; range; percentages/quantiles, proportions, odds; index of qualitative variation (IQV) • Significance: coefficient of variation (CV) • Visualization: histogram

Example: frequency distribution

Example: relative percentages

The median is the value that divides an orderable distribution exactly into halves. Finding the median is easier if we compute cumulative percentages, e.g., in Excel:

Age group                %     Cum%
0-4 years             5.65     5.65
5-9 years             6.59    12.24
10-14 years           6.84    19.08
15-24 years          13.36    32.44
25-34 years          13.31    45.75
35-44 years          17.00    62.75
45-54 years          14.73    77.48
55-64 years           9.56    87.04
65-74 years           7.14    94.18
75-84 years           4.43    98.61
85 years and over     1.39   100.00

So how can we describe this distribution, using the vocabulary we have so far? • What is the mode of this distribution? • What is the median? • What is the range?

A statistical package can also report these measures
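For instance, the cumulative percentages in the age table can be rebuilt and scanned in a few lines of Python. This is a sketch rather than the output of any particular package; the percentages are copied from the table, and the variable names are illustrative.

```python
# Age distribution from the table: (age group, percent of population).
AGE_GROUPS = [
    ("0-4 years", 5.65), ("5-9 years", 6.59), ("10-14 years", 6.84),
    ("15-24 years", 13.36), ("25-34 years", 13.31), ("35-44 years", 17.00),
    ("45-54 years", 14.73), ("55-64 years", 9.56), ("65-74 years", 7.14),
    ("75-84 years", 4.43), ("85 years and over", 1.39),
]

# Build the cumulative percentage column.
table = []
running = 0.0
for label, pct in AGE_GROUPS:
    running += pct
    table.append((label, pct, round(running, 2)))

# Mode: the category with the largest percentage.
mode_group = max(AGE_GROUPS, key=lambda group: group[1])[0]

# Median: the first category whose cumulative percentage reaches 50.
median_group = next(label for label, _, cum in table if cum >= 50)

# Range: from the lowest to the highest category.
range_groups = (AGE_GROUPS[0][0], AGE_GROUPS[-1][0])

for row in table:
    print(row)
print(mode_group, median_group, range_groups)
```

Note that the age groups are not all the same width (5-year groups up to age 14, 10-year groups after), so which category is modal depends partly on how the ages were grouped.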

Percentiles/quantiles • a percentile/quantile is the value below which a given percentage of the cases fall • the median = the 50th percentile • quartiles are the three values that divide a distribution into 4 groups, each containing 25% of the cases
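With individual (ungrouped) values, quartiles can be computed directly. The sketch below uses Python's standard `statistics.quantiles` on a made-up list of ages; statistical packages differ in how they interpolate between cases, so other tools may give slightly different quartiles for the same data.

```python
import statistics

# Hypothetical ages, invented for illustration.
ages = [3, 8, 15, 21, 27, 34, 38, 41, 47, 52, 58, 66, 73]

# n=4 asks for the three cut points that split the cases into quarters.
q1, q2, q3 = statistics.quantiles(ages, n=4)

print(q1, q2, q3)
print(q2 == statistics.median(ages))   # the 2nd quartile is the median
```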

Interquartile range • The interquartile range is the difference between the value at 75% of cases, and the value at 25% of cases • What is the interquartile range for the following distribution?

Age group                %     Cum%
0-4 years             5.65     5.65
5-9 years             6.59    12.24
10-14 years           6.84    19.08
15-24 years          13.36    32.44
25-34 years          13.31    45.75
35-44 years          17.00    62.75
45-54 years          14.73    77.48
55-64 years           9.56    87.04
65-74 years           7.14    94.18
75-84 years           4.43    98.61
85 years and over     1.39   100.00
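With grouped data, the quartiles can only be located to a category, or estimated by interpolating within that category. The sketch below does both for the age table; the interpolation assumes cases are spread evenly within each age group, which is an assumption of this example and not something the table tells us.

```python
# (label, lower bound, width in years, percent); widths follow the labels.
AGE_GROUPS = [
    ("0-4 years", 0, 5, 5.65),
    ("5-9 years", 5, 5, 6.59),
    ("10-14 years", 10, 5, 6.84),
    ("15-24 years", 15, 10, 13.36),
    ("25-34 years", 25, 10, 13.31),
    ("35-44 years", 35, 10, 17.00),
    ("45-54 years", 45, 10, 14.73),
    ("55-64 years", 55, 10, 9.56),
    ("65-74 years", 65, 10, 7.14),
    ("75-84 years", 75, 10, 4.43),
    ("85 years and over", 85, None, 1.39),   # open-ended group
]


def locate(percentile):
    """Return (group label, interpolated age) for a cumulative percentage."""
    below = 0.0   # cumulative percent in all earlier groups
    for label, lower, width, pct in AGE_GROUPS:
        if below + pct >= percentile:
            if width is None:
                return label, None   # cannot interpolate an open-ended group
            return label, lower + (percentile - below) / pct * width
        below += pct
    # Rounding can leave the running total just under 100.
    return AGE_GROUPS[-1][0], None


q1_group, q1_age = locate(25)
q3_group, q3_age = locate(75)
iqr = q3_age - q1_age

print(q1_group, round(q1_age, 1))
print(q3_group, round(q3_age, 1))
print(round(iqr, 1))
```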

Continuous variable: • Central tendency: mean, median and mode • Dispersion: range; variance or standard deviation; quantiles/percentiles; interquartile range • Significance: standard error; coefficient of variation • Visualization: polygon (line graph)

Means, variances, and standard deviations: • Mean: the average, computed by adding up the values of all observations and dividing by the number of observations (cases) • Variance: computed by subtracting the mean from each value, squaring the results, adding them up, and dividing by the number of cases (strictly, by N-1 for a sample) • Standard deviation: the square root of the variance, expressed in the same metric as the variable. In a normal distribution, about 68% of the cases fall within one standard deviation above or below the mean.
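The three definitions can be followed step by step on a small set of made-up values. The sketch computes each quantity by hand and then cross-checks against Python's standard `statistics` module, which also divides by N-1 in `variance` and `stdev`.

```python
import math
import statistics

values = [2, 4, 4, 4, 5, 5, 7, 9]   # hypothetical observations

n = len(values)
mean = sum(values) / n                                # add up, divide by N

squared_deviations = [(v - mean) ** 2 for v in values]
variance = sum(squared_deviations) / (n - 1)          # sample variance (N-1)
standard_deviation = math.sqrt(variance)              # same metric as values

print(mean, round(variance, 3), round(standard_deviation, 3))

# Cross-check against the standard library.
print(statistics.mean(values), statistics.variance(values), statistics.stdev(values))
```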

Availability of continuous variables in Stats Can products: • Stats Can rarely publishes truly continuous variables in its aggregate statistics products • Some exceptions are: • age by single years (census) • estimates of population by single years of age (Annual demographic statistics)

Statistics Canada generally reports the distribution of continuous variables as: • Measures of central tendency: an ordinal (categorical) variable; an average (mean) with standard error, variance, or standard deviation; a median; rates with a base other than 100, e.g., children ever born per 1,000 women in the 1991 census 2B profile • Measures of dispersion: percentages, essentially rates per 100 population, e.g., incidence of low income as a % of economic families in the census profiles (this includes employment and unemployment rates); quantiles (e.g., quintiles in Income trends in Canada); Gini coefficients (computed from the coefficient of variation) (e.g., Income trends in Canada) • Measures of significance: standard error

In the following distribution… • What is the median? • What is the mean? • What is the range?

Using the percentages and cumulative percentages: • What is your best estimate of the interquartile range? • What is your best estimate of the standard deviation? • Why is the average (mean) income so much higher than the median? • See pages 18-19 of your handout

As a percentage distribution

The polygon produced by Beyond 20/20 isn’t very useful…

Using the standard error to describe more of the distribution: • the standard error of the mean is a measure of how close the mean in the data we are looking at is likely to be to the mean in the population at large • computed as the standard deviation divided by the square root of N • the larger the N, the smaller the standard error, and the more confidence we can have in the distribution in the sample as representative of the population
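The relationship between N and the standard error is easy to see numerically. In this sketch the standard deviation of 20,000 is a made-up figure; only the formula (standard deviation divided by the square root of N) comes from the slide.

```python
import math


def standard_error(standard_deviation, n):
    """Standard error of the mean: sd / sqrt(N)."""
    return standard_deviation / math.sqrt(n)


# Same hypothetical standard deviation, two sample sizes.
se_small = standard_error(20_000, 100)
se_large = standard_error(20_000, 10_000)

print(se_small, se_large)
```

Multiplying N by 100 divides the standard error by 10, since the standard error shrinks with the square root of N rather than with N itself.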

For example….

Confidence intervals • The standard error makes most sense when we use it to compute a confidence interval around the mean • For Canada, in the previous example: • 95% upper confidence limit (UCL) = (mean) + 1.96(standard error) = 29769 + 1.96(19) = 29769 + 37.24 = $29,806.24 • 95% lower confidence limit (LCL) = (mean) - 1.96(standard error) = 29769 - 1.96(19) = 29769 - 37.24 = $29,731.76 • 1.96 is the z-value that bounds the central 95% of a normal distribution • The handout contains computed confidence intervals for each of the provinces (p.22)
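The arithmetic for the Canada example can be reproduced as follows. The mean (29769) and standard error (19) are the figures quoted on the slide; the function name is invented for illustration.

```python
def confidence_interval(mean, standard_error, z=1.96):
    """Return (lower limit, upper limit); z=1.96 gives a 95% interval."""
    margin = z * standard_error
    return mean - margin, mean + margin


lcl, ucl = confidence_interval(29_769, 19)
print(round(lcl, 2), round(ucl, 2))
```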

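How do we interpret this? • if we drew repeated random samples from the same population and computed a 95% confidence interval from each, about 95% of those intervals would contain the population mean • in that sense, we can be 95% confident that the population mean total income lies between $29,732 and $29,806 • this is not the same as saying that 95% of sample means, or 95% of individual incomes, fall within those two limits.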
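A small simulation can make the repeated-sampling idea concrete. The sketch below assumes a normally distributed population with made-up parameters, draws many samples, builds a 95% interval (mean ± 1.96 standard errors) from each, and counts how often the interval contains the population mean. Real income distributions are skewed, so this illustrates the logic only.

```python
import math
import random
import statistics

random.seed(2005)   # fixed seed so the run is repeatable

# Hypothetical population parameters and simulation settings.
TRUE_MEAN, TRUE_SD = 30_000, 20_000
SAMPLE_SIZE, REPLICATIONS = 100, 2_000

covered = 0
for _ in range(REPLICATIONS):
    sample = [random.gauss(TRUE_MEAN, TRUE_SD) for _ in range(SAMPLE_SIZE)]
    mean = sum(sample) / SAMPLE_SIZE
    se = statistics.stdev(sample) / math.sqrt(SAMPLE_SIZE)
    # Does this sample's interval contain the population mean?
    if mean - 1.96 * se <= TRUE_MEAN <= mean + 1.96 * se:
        covered += 1

coverage = covered / REPLICATIONS
print(coverage)   # share of intervals that contain the population mean
```

The share should come out close to 0.95: the 95% describes how often the procedure captures the population mean, not where individual sample means or incomes fall.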

Using microdata • Statistical packages such as SAS, SPSS, etc. will compute the mean, the standard deviation, standard error, etc. for continuous variables • SDA will only report these measures for variables with fewer than 8,000 values
