Statistics bootcamp

Statistics bootcamp • Laine Ruus, Data Library Service, University of Toronto • For ACCOLEDS 2004 • 2004-12-07

Outline • Describing a variable • Describing relationships among two or more variables

First, some vocabulary • Variable: In social science research, each item of data collected for each unit of analysis (e.g., age of person, income of family, consumer price index) is called a variable. • Unit of analysis: The basic observable entity being analyzed by a study and for which data are collected in the form of variables. A unit of analysis is sometimes referred to as a "case" or "observation".

‘Sex’ can be coded with values: • 1=‘male’ 2=‘female’ 3=‘no response’ or • 1=‘female’ 2=‘male’ or • 1=‘male’ 2=‘female’ or • ‘M’=‘male’ ‘F’=‘female’ or • ‘male’ ‘female’ or even • 1=‘yes’ 2=‘no’ 3=‘maybe’

The values a variable takes must be: • exhaustive: include the characteristics of all cases, and • mutually exclusive: each case must have one and only one value or code for each variable

What’s wrong with this coding scheme? • Under $3,000 • $3,000-$7,000 • $8,000-$12,000 • $13,000-$17,000 • $18,000-$22,000 • $23,000-$27,000 • $28,000-$32,000 • $33,000-$37,000 • $38,000 & over (Source: Census of Canada, 1961: user summary tapes)
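One way to test a coding scheme like this against the two rules above is to look for values that fall between categories (not exhaustive) or into two categories at once (not mutually exclusive). The sketch below is only illustrative: it assumes incomes are recorded in whole dollars, and the `brackets` list and `find_problems` helper are hypothetical names, not part of the census documentation.

```python
# Income brackets from the slide as (low, high) in whole dollars.
# None marks an open-ended bound; whole-dollar incomes are an assumption.
brackets = [
    (None, 2999),
    (3000, 7000),
    (8000, 12000),
    (13000, 17000),
    (18000, 22000),
    (23000, 27000),
    (28000, 32000),
    (33000, 37000),
    (38000, None),
]

def find_problems(brackets):
    """Compare each bracket with the next one; return (gaps, overlaps)."""
    gaps, overlaps = [], []
    for (_, high), (low, _) in zip(brackets, brackets[1:]):
        if low > high + 1:
            gaps.append((high + 1, low - 1))   # values no bracket covers
        elif low <= high:
            overlaps.append((low, high))       # values two brackets cover
    return gaps, overlaps

gaps, overlaps = find_problems(brackets)
print("uncovered ranges:", gaps)
print("double-covered ranges:", overlaps)
```

Under the whole-dollar assumption, an income of $7,500 has no code at all, so the scheme is not exhaustive.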

Variables are normally coded numerically, because: • arithmetic is easier with numbers than with letters • some characteristics are inherently numeric: age, weight in kilograms, number of children ever borne, income, value of dwelling, years of schooling, etc. • space/size of data sets has, until recently, been a major consideration

Three basic types of variables • Categorical, either nominal (aka nonorderable discrete), e.g. gender, ethnicity, immigrant status, province of residence, marital status, labour force status, etc., or ordinal (aka orderable discrete), e.g. highest level of schooling, social class, left-right political identification, Likert scales, income groups, age groups, etc. • Continuous (aka interval, numeric), e.g. age, income, etc.

Descriptive statistics summarize the properties of a sample of observations • how the units of observation are the same (central tendency) • how they are different (dispersion) • how representative the sample is of the population at large (significance)

Nominal variable: • Central tendency: mode • Dispersion: frequency distribution; percentages, proportions, odds; index of qualitative variation (IQV) • Significance: coefficient of variation (CV) • Visualization: bar chart

Mode: the category with the highest number or percentage of observations
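As a small illustration of these measures for a nominal variable, the sketch below computes a frequency distribution, percentages, the mode, and the IQV. The responses are made-up data. The IQV formula used, K(1 - sum of squared proportions)/(K - 1), is the common textbook definition and is an assumption here, since the slides do not give a formula.

```python
from collections import Counter

# Hypothetical responses for a nominal variable (marital status).
responses = ["married", "single", "married", "widowed", "single",
             "married", "divorced", "married", "single", "married"]

counts = Counter(responses)            # frequency distribution
n = len(responses)
percents = {cat: 100 * k / n for cat, k in counts.items()}

# Mode: the category with the highest number of observations.
mode, mode_count = counts.most_common(1)[0]

# IQV (textbook definition, assumed): 0 = all cases in one category,
# 1 = cases spread evenly over all K categories.
k_categories = len(counts)
sum_sq = sum((c / n) ** 2 for c in counts.values())
iqv = k_categories * (1 - sum_sq) / (k_categories - 1)

print("frequencies:", dict(counts))
print("percentages:", percents)
print("mode:", mode)
print(f"IQV: {iqv:.3f}")
```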

These frequencies can be visualized as a bar chart (based on percentages):

The same distribution, from one of the Nation series files:

and showing percentages:

Notice the differences:

Why the differences? • What is the population in each table? • Which one is correct?

The most important thing to know about any distribution, whether it is a rate, a proportion, or a percent, is what is in the denominator. And it must always be reported.

We can also derive the distribution information from the 1996 census public use microdata file (PUMF) of individuals using a statistical package:

If we weight the distribution, both the frequencies and the percentages will almost match the distribution from the Profile file:

Just a few words on weighting: • A weight is the inverse of the chance that any member of the population (universe) had of being selected for the sample, i.e. the number of population members that each sampled case represents • In general, the weight can be used to produce estimates for the total population (population weight), and/or it can adjust for known deficiencies in the sample (sample weight)

and a few more words on weighting… • The 1996 census public use microdata file of individuals is a 2.8% sample of the population • Stats Can calls this a ‘self-weighting’ sample: every case has a weight of 36 (1/0.028 ≈ 36) • Knowing who was excluded from the sample is as important as knowing who was included

And some final words on weighting… • Use the population weight variable when you are producing frequencies to reflect the frequency in the population • You don’t need the population weight variable when you are producing percentages, proportions, ratios, rates, etc. • Use a sample weight variable always
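A short sketch of why a constant population weight matters for frequencies but not for percentages. The sample counts are made up for illustration; the weight of 36 is the one quoted for the 1996 file.

```python
# Hypothetical unweighted sample counts for a nominal variable.
sample_counts = {"married": 512, "not married": 488}

# Constant population weight quoted in the slides (about 1 / 0.028).
weight = 36

# Weighted frequencies estimate counts in the whole population.
weighted_counts = {cat: n * weight for cat, n in sample_counts.items()}

def percentages(counts):
    """Convert a dict of counts to a dict of percentages."""
    total = sum(counts.values())
    return {cat: 100 * n / total for cat, n in counts.items()}

unweighted_pct = percentages(sample_counts)
weighted_pct = percentages(weighted_counts)

print("weighted frequencies:", weighted_counts)
print("unweighted %:", unweighted_pct)
print("weighted %:  ", weighted_pct)
```

Because every case carries the same weight, the weight cancels out of any percentage, proportion, or ratio. This would not hold for a file where the weight varies from case to case.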

Proportions, percents, and odds • Percent: “the percent married in the population 15 years and over is …” = (# married / population 15 years and over) * 100 = 51.2% • Proportion: “the proportion married in the population 15 years and over is …” = (# married / population 15 years and over) = 0.512 • Odds: “the odds of being married are …” = (# married / # not married in the population) or = proportion married / (1 - proportion married) = 0.512 / (1 - 0.512) = 0.512 / 0.488 = 1.049
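The three measures are simple transformations of one another. A minimal sketch using the proportion from the slide (the function name `odds_to_proportion` is just an illustrative helper):

```python
# Proportion married in the population 15 years and over (from the slide).
proportion_married = 0.512

percent_married = proportion_married * 100
odds_married = proportion_married / (1 - proportion_married)

def odds_to_proportion(odds):
    """Invert the odds formula: p = odds / (1 + odds)."""
    return odds / (1 + odds)

print(f"percent:    {percent_married:.1f}%")
print(f"proportion: {proportion_married:.3f}")
print(f"odds:       {odds_married:.3f}")
print(f"back to proportion: {odds_to_proportion(odds_married):.3f}")
```

Odds slightly above 1 mean that being married is slightly more common than not being married in this population.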

Ordinal variable: • Central tendency: median and mode • Dispersion: frequency distribution; range; percentages/quantiles, proportions, odds; index of qualitative variation (IQV) • Significance: coefficient of variation (CV) • Visualization: histogram

Example: frequency distribution

Example: relative percentages

The Median is the value that divides an orderable distribution exactly into halves. Finding the median is easier if we compute cumulative percentages

Example: cumulative percentages

Age group                 %    Cumulative %
0-4 years              6.65            6.65
5-9 years              6.90           13.55
10-14 years            6.91           20.46
15-24 years           13.37           33.83
25-34 years           15.60           49.43
35-44 years           16.85           66.28
45-54 years           12.86           79.14
55-64 years            8.63           87.77
65-74 years            7.15           94.92
75-84 years            3.91           98.83
85 years and over      1.17          100.00

So how can we describe this distribution, using the vocabulary we have so far? • What is the mode of this distribution? • What is the median? • What is the range?

A statistical package can also report these measures
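As a rough sketch of what a package does with grouped data, the code below takes the age-group percentages from the cumulative percentages example, accumulates them, and reports the group that contains the median (the first group whose cumulative percentage reaches 50) and the modal group. With grouped data only the group can be identified, not an exact age.

```python
# Age-group percentages from the cumulative percentages example.
age_percent = [
    ("0-4 years", 6.65), ("5-9 years", 6.90), ("10-14 years", 6.91),
    ("15-24 years", 13.37), ("25-34 years", 15.60), ("35-44 years", 16.85),
    ("45-54 years", 12.86), ("55-64 years", 8.63), ("65-74 years", 7.15),
    ("75-84 years", 3.91), ("85 years and over", 1.17),
]

cumulative = 0.0
median_group = None
for label, pct in age_percent:
    cumulative += pct
    print(f"{label:<18} {pct:6.2f} {cumulative:7.2f}")
    # The median falls in the first group where the running total hits 50%.
    if median_group is None and cumulative >= 50:
        median_group = label

# Mode: the group with the largest share of cases.
modal_group = max(age_percent, key=lambda row: row[1])[0]

print("median group:", median_group)
print("modal group: ", modal_group)
```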

Percentiles/quantiles • a percentile/quantile is the value below which a given percentage of the cases fall • the median = the 50th percentile • quartiles are the three values that divide a distribution into 4 groups, each containing 25% of the cases

Interquartile range • The interquartile range is the difference between the value at 75% of cases, and the value at 25% of cases • What is the interquartile range for the following distribution?

Cumulative percentages

Age group                 %    Cumulative %
0-4 years              6.65            6.65
5-9 years              6.90           13.55
10-14 years            6.91           20.46
15-24 years           13.37           33.83
25-34 years           15.60           49.43
35-44 years           16.85           66.28
45-54 years           12.86           79.14
55-64 years            8.63           87.77
65-74 years            7.15           94.92
75-84 years            3.91           98.83
85 years and over      1.17          100.00
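One way to estimate the interquartile range from grouped data like this is linear interpolation inside the group that contains each quartile. The sketch below assumes cases are spread evenly within each age group, which is only an approximation. The `percentile_age` helper and the group widths (e.g. 10 years for "15-24 years") are choices made for this illustration.

```python
# (lower bound in years, width in years, percent of cases) for each age group.
# The open-ended "85 years and over" group is omitted; it has no upper bound,
# and neither quartile falls in it.
groups = [
    (0, 5, 6.65), (5, 5, 6.90), (10, 5, 6.91), (15, 10, 13.37),
    (25, 10, 15.60), (35, 10, 16.85), (45, 10, 12.86), (55, 10, 8.63),
    (65, 10, 7.15), (75, 10, 3.91),
]

def percentile_age(groups, target):
    """Estimate the age below which `target` percent of cases fall,
    assuming an even spread of cases within each group."""
    cumulative = 0.0
    for lower, width, pct in groups:
        if cumulative + pct >= target:
            return lower + (target - cumulative) / pct * width
        cumulative += pct
    raise ValueError("target falls in the open-ended top group")

q1 = percentile_age(groups, 25)
q3 = percentile_age(groups, 75)
iqr = q3 - q1

print(f"Q1 is about {q1:.1f} years, Q3 is about {q3:.1f} years")
print(f"interquartile range is about {iqr:.1f} years")
```

Without interpolation, the most that can be said is that the first quartile lies in the 15-24 group and the third quartile in the 45-54 group.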

Continuous variable: • Central tendency: mean, median and mode • Dispersion: range; variance or standard deviation; quantiles/percentiles; interquartile range • Significance: standard error; coefficient of variation • Visualization: polygon (line graph)

Means, variances, and standard deviations: • Mean: the average, computed by adding up the values of all observations and dividing by the number of observations (cases) • Variance: computed by subtracting the mean from each value, squaring the results, adding them up, and dividing by the number of cases (strictly, by N-1 for a sample) • Standard deviation: the square root of the variance; in a normal distribution, about 68% of the cases fall within one standard deviation above or below the mean
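A minimal sketch of these three computations, done both by hand and with Python's standard `statistics` module. The income values are made up for illustration.

```python
import statistics

# Hypothetical incomes for five cases.
incomes = [18000, 22000, 25000, 31000, 54000]
n = len(incomes)

# Mean: sum of the values divided by the number of cases.
mean = sum(incomes) / n

# Variance: squared deviations from the mean, summed, divided by N-1.
manual_variance = sum((x - mean) ** 2 for x in incomes) / (n - 1)

# Standard deviation: square root of the variance.
manual_sd = manual_variance ** 0.5

# The statistics module uses the same N-1 (sample) definitions.
variance = statistics.variance(incomes)
sd = statistics.stdev(incomes)

print(f"mean = {mean:.0f}")
print(f"variance = {variance:.0f} (by hand: {manual_variance:.0f})")
print(f"standard deviation = {sd:.2f} (by hand: {manual_sd:.2f})")
```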

Availability of continuous variables in Stats Can products: • Stats Can only rarely publishes truly continuous variables • Some exceptions are: • age by single years of age (census) • estimates of population by age (Annual demographic statistics)

Statistics Canada generally reports the distribution of continuous variables as: • Measures of central tendency: an ordinal (categorical) variable; an average (mean) and standard error, variance, or standard deviation; a median; rates (with a base other than 100), e.g. children ever born per 1,000 women in the 1991 census 2B profile • Measures of dispersion: percentages, essentially a rate per 100 population, e.g. incidence of low income as a % of economic families, in the census profiles (this includes employment and unemployment rates); quantiles (e.g. quintiles in Income trends in Canada); Gini coefficients, a summary measure of inequality derived from the Lorenz curve (e.g. Income trends in Canada) • Measures of significance: standard error

For the following distribution… • What is the median? • What is the mean? • What is the range?

Using the percentages and cumulative percentages: • What is your best estimate of the interquartile range? • What is your best estimate of the standard deviation? • Why is the average (mean) income so much higher than the median? • See pages 18-19 of your handout

As a percentage distribution

The polygon produced by Beyond 20/20 isn’t very useful…

Using the standard error to describe more of the distribution: • the standard error of the mean is a measure of how close the mean in the data we are looking at is likely to be to the mean in the population at large • computed as the standard deviation divided by the square root of N • the larger the N, the smaller the standard error
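A brief sketch of the usual computation, standard deviation divided by the square root of N, showing how the standard error shrinks as N grows. The standard deviation of 100 and the sample sizes are arbitrary illustrative values.

```python
import math

def standard_error(sd, n):
    """Standard error of the mean: standard deviation / sqrt(N)."""
    return sd / math.sqrt(n)

sd = 100  # hypothetical standard deviation
for n in (25, 100, 400, 10000):
    print(f"N = {n:>5}: standard error = {standard_error(sd, n):.1f}")
```

Quadrupling the sample size halves the standard error, which is why very large samples such as census files produce very small standard errors.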

For example….

Confidence intervals • The standard error makes most sense when we use it to compute a confidence interval around the mean • For Canada, in the previous example: • 95% upper confidence limit (UCL) = (mean) + 1.96(standard error) = 25196 + 1.96(13) = 25196 + 25.48 = $25,221 • 95% lower confidence limit (LCL) = (mean) - 1.96(standard error) = 25196 - 1.96(13) = 25196 - 25.48 = $25,171 • 1.96 is the Z-value that bounds the central 95% of a normal distribution • The handout contains computed confidence intervals for each of the provinces (p.21)
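The arithmetic for the Canada example can be reproduced in a few lines. The mean and standard error are the figures from the slide; the function name `confidence_interval` is just an illustrative helper.

```python
def confidence_interval(mean, standard_error, z=1.96):
    """Return (lower, upper) limits; z = 1.96 gives a 95% interval."""
    margin = z * standard_error
    return mean - margin, mean + margin

# Mean total income and standard error for Canada, from the slide.
lcl, ucl = confidence_interval(25196, 13)

print(f"95% LCL = ${lcl:,.0f}")
print(f"95% UCL = ${ucl:,.0f}")
```

Passing a different `z` (for example 2.576 for a 99% interval) widens or narrows the limits accordingly.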

How do we interpret this? • if we drew repeated random samples from the same population and computed a 95% confidence interval from each, about 95% of those intervals would contain the population mean; for this sample the interval runs from $25,171 to $25,221 • this is not the same as saying that there is a 95% probability that the population mean falls within these two particular limits.

Using microdata • Statistical packages such as SAS, SPSS, etc. will compute the mean, the standard deviation, standard error, etc. for continuous variables • SDA will only report these measures for variables with fewer than 8,000 values
