## Statistics bootcamp

**Statistics bootcamp**

Laine Ruus, Data Library Service, University of Toronto. Rev. 2005-04-26

**Outline**

• Describing a variable
• Describing relationships among two or more variables

**First, some vocabulary**

• Variable: In social science research, for each unit of analysis, each item of data (e.g., age of person, income of family, consumer price index) is called a variable.
• Unit of analysis: The basic observable entity being analyzed by a study and for which data are collected in the form of variables. A unit of analysis is sometimes referred to as a case or “observation”.

**The variable ‘Sex’ can be coded as:**

• 1 = ‘male’, 2 = ‘female’, 3 = ‘no response’, or
• 1 = ‘female’, 2 = ‘male’, or
• 1 = ‘male’, 2 = ‘female’, or
• ‘M’ = ‘male’, ‘F’ = ‘female’, or
• ‘male’, ‘female’, or even
• 1 = ‘yes’, 2 = ‘no’, 3 = ‘maybe’

**The values a variable can take must be:**

• exhaustive: include the characteristics of all cases, and
• mutually exclusive: each case must have one and only one value or code for each variable

**What’s wrong with this coding scheme?**

• Under $3,000
• $3,000-$7,000
• $8,000-$12,000
• $13,000-$17,000
• $18,000-$22,000
• $23,000-$27,000
• $28,000-$32,000
• $33,000-$37,000
• $38,000 & over

(Source: Census of Canada, 1961: user summary tapes)

**Variables are normally coded numerically, because:**

• arithmetic is easier with numbers than with letters
• some characteristics are inherently numeric: age, weight in kilograms or pounds, number of children ever borne, income, value of dwelling, years of schooling, etc.
• space/size of data sets has, until recently, been a major consideration

**Three basic types of variables**

• Categorical
  • Nominal, aka nonorderable discrete. E.g., gender, ethnicity, immigrant status, province of residence, marital status, labour force status, etc.
  • Ordinal, aka orderable discrete. E.g., highest level of schooling, social class, left-right political identification, Likert scales, income groups, age groups, etc.
• Continuous, aka interval, numeric. E.g., actual age, income, etc.

**Descriptive statistics summarize the properties of a sample of observations**

• how the units of observation are the same (central tendency)
• how they are different (dispersion)
• how representative the sample is of the population at large (significance)

**Nominal variable:**

• Central tendency
  • mode
• Dispersion
  • frequency distribution
  • percentages, proportions, odds
  • Index of qualitative variation (IQV)
• Significance
  • coefficient of variation (CV)
• Visualization
  • bar chart

**Mode: the category with the largest number or percentage of observations in a frequency distribution**

**The frequencies can be visualized as a bar chart (based on percentages):**

**The same distribution, from one of the Canadian overview files:**

**Why the differences?**

• What is the population in each table?
• What is in the denominator in each table?
• Which one is correct?

**The most important thing to know about any distribution,** whether it is a rate, a proportion, or a percent, is what is in the denominator.
And it must always be reported.

**We can also derive the distribution information from the 2001 individual PUMF (public use microdata file) using a statistical package:**

**If we weight the distribution, both the frequencies and the percentages will almost match the distribution from the Profile file:**

**Just a few words on weighting:**

• The weight is the inverse of the chance that any member of the population (universe) had of being selected for the sample, i.e. the number of people in the population that each sampled case represents
• In general
  • the weight can be used to produce estimates for the total population (population weight)
  • and/or the weight can adjust for known deficiencies in the sample (sample weight)

**and a few more words on weighting…**

• The 2001 census public use microdata file of individuals is a 2.7% sample of the population
• The weight variable (weightp) ranges from 35.545777 to 39.464996 (roughly 1/0.027)
• Knowing who was excluded from the sample is as important as knowing who was included

**And some final words on weighting…**

• When to use the population weight variable
  • when you are producing frequencies to reflect the frequency in the population
• When you don’t need to use the population weight variable
  • when you are producing percentages, proportions, ratios, rates, etc.
• When to use a sample weight variable
  • always

**Proportions, percents, and odds**

• Percent: “the percent married in the population 15 years and over is…” = (# married / population 15 years and over) × 100 = 49.47%
• Proportion: “the proportion married in the population 15 years and over is…” = (# married / population 15 years and over) = 0.4947
• Odds: “the odds on being married are…” = (# married / # not married in the population), or = proportion married / (1 - proportion married) = 0.4947 / (1 - 0.4947) = 0.4947 / 0.5053 = 0.9790

**Coefficient of variation**

• measures how representative the variable in the sample is of the distribution in the population
• computed as (standard deviation / mean) × 100 [we will discuss these measures in the context of continuous variables]
• see Stats Can guidelines in user guides:
  • CV < 16.6% is OK to publish; CV > 33.3%, do not publish
• SDA reports the CV when generating frequencies

**Ordinal variable:**

• Central tendency
  • median and mode
• Dispersion
  • frequency distribution
  • range
  • percentages/quantiles, proportions, odds
  • Index of qualitative variation (IQV)
• Significance
  • coefficient of variation (CV)
• Visualization
  • histogram

**The median is the value that divides an orderable distribution exactly into halves.** Finding the median is easier if we compute cumulative percentages, e.g. in Excel:

| Age group | % | Cum % |
|---|---|---|
| 0-4 years | 5.65 | 5.65 |
| 5-9 years | 6.59 | 12.24 |
| 10-14 years | 6.84 | 19.08 |
| 15-24 years | 13.36 | 32.44 |
| 25-34 years | 13.31 | 45.75 |
| 35-44 years | 17.00 | 62.75 |
| 45-54 years | 14.73 | 77.48 |
| 55-64 years | 9.56 | 87.04 |
| 65-74 years | 7.14 | 94.18 |
| 75-84 years | 4.43 | 98.61 |
| 85 years and over | 1.39 | 100.00 |

**So how can we describe this distribution, using the vocabulary we have so far?**

• What is the mode of this distribution?
• What is the median?
• What is the range?

**Percentiles/quantiles**

• a percentile/quantile is the value below which a given percentage of the cases fall
• the median = the 50th percentile
• quartiles are the values that divide a distribution into 4 groups, each containing 25% of the cases

**Interquartile range**

• The interquartile range is the difference between the value at 75% of cases and the value at 25% of cases
• What is the interquartile range for the following distribution?

| Age group | % | Cum % |
|---|---|---|
| 0-4 years | 5.65 | 5.65 |
| 5-9 years | 6.59 | 12.24 |
| 10-14 years | 6.84 | 19.08 |
| 15-24 years | 13.36 | 32.44 |
| 25-34 years | 13.31 | 45.75 |
| 35-44 years | 17.00 | 62.75 |
| 45-54 years | 14.73 | 77.48 |
| 55-64 years | 9.56 | 87.04 |
| 65-74 years | 7.14 | 94.18 |
| 75-84 years | 4.43 | 98.61 |
| 85 years and over | 1.39 | 100.00 |

**Continuous variable:**

• Central tendency
  • mean, median and mode
• Dispersion
  • range
  • variance or standard deviation
  • quantiles/percentiles
  • interquartile range
• Significance
  • standard error
  • coefficient of variation
• Visualization
  • polygon (line graph)

**Means, variances, and standard deviations:**

• Mean: the average, computed by adding up the values of all observations and dividing by the number of observations (cases)
• Variance: computed by subtracting the mean from each value, squaring the results, adding them up, and dividing by the number of cases (strictly, by N-1 for a sample)
• Standard deviation: in a normal distribution, about 68% of the cases fall within one standard deviation above or below the mean.
It’s the square root of the variance, in the same metric as the variable.

**Availability of continuous variables in Stats Can products:**

• Stats Can rarely publishes truly continuous variables in its aggregate statistics products
• Some exceptions are:
  • age by single years (census)
  • estimates of population by single years of age (Annual demographic statistics)

**Statistics Canada generally reports the distribution of continuous variables as:**

• Measures of central tendency
  • an ordinal (categorical) variable
  • an average (mean) and standard error, variance, or standard deviation
  • a median
  • rates (with a base other than 100), e.g. children ever born per 1,000 women in the 1991 census 2B profile
• Measures of dispersion
  • percentage: essentially, a rate per 100 population. E.g., incidence of low income as a % of economic families, in the census profiles. This includes employment and unemployment rates
  • quantiles (e.g. quintiles in Income trends in Canada)
  • Gini coefficients (computed from the coefficient of variation) (e.g. Income trends in Canada)
• Measures of significance
  • standard error

**In the following distribution…**

• What is the median?
• What is the mean?
• What is the range?

**Using the percentages and cumulative percentages:**

• What is your best estimate of the interquartile range?
• What is your best estimate of the standard deviation?
• Why is the average (mean) income so much higher than the median?
• See pages 18-19 of your handout

**Using the standard error to describe more of the distribution:**

• the standard error of the mean is a measure of how close the mean in the data we are looking at is likely to be to the mean in the population at large
• computed as the standard deviation divided by the square root of N
• the larger the N, the smaller the standard error, and the more confidence we can have in the distribution in the sample as representative of the population

**Confidence intervals**

• The standard error makes most sense when we use it to compute a confidence interval around the mean
• For Canada, in the previous example:
  • 95% upper confidence limit (UCL) = mean + 1.96 × (standard error) = 29769 + 1.96 × 19 = 29769 + 37.24 = $29,806.24
  • 95% lower confidence limit (LCL) = mean - 1.96 × (standard error) = 29769 - 1.96 × 19 = 29769 - 37.24 = $29,731.76
• 1.96 is the Z value that bounds the central 95% of a normal distribution
• The handout contains computed confidence intervals for each of the provinces (p. 22)

**How do we interpret this?**

• if we draw repeated random samples from the same population and compute a 95% confidence interval from each, about 95% of those intervals will contain the population mean; $29,732 to $29,806 is one such interval
• this is not the same as saying that there is a 95% probability that the population mean falls within those two particular limits.

**Using microdata**

• Statistical packages such as SAS, SPSS, etc. will compute the mean, the standard deviation, standard error, etc. for continuous variables
• SDA will only report these measures for variables with fewer than 8,000 values
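
**A sketch of these computations in code**

The measures described for continuous variables (mean, variance with the N-1 divisor, standard deviation, standard error, and a 95% confidence interval) can be sketched in a few lines of Python using only the standard library. This is an illustration, not the output of SAS, SPSS, or SDA: the function names `describe` and `confidence_limits` and the list of incomes are hypothetical, and the calculation ignores survey weights and design effects. The last lines reproduce the Canada example from the slides (mean 29769, standard error 19).

```python
import math
import statistics


def confidence_limits(mean, se, z=1.96):
    """Lower and upper confidence limits: mean -/+ z * standard error."""
    return mean - z * se, mean + z * se


def describe(values, z=1.96):
    """Summarize a continuous variable (unweighted, simple random sample assumed)."""
    n = len(values)
    mean = statistics.mean(values)
    variance = statistics.variance(values)  # sample variance: divides by N - 1
    sd = math.sqrt(variance)                # same metric as the variable
    se = sd / math.sqrt(n)                  # standard error of the mean
    lcl, ucl = confidence_limits(mean, se, z)
    return {"n": n, "mean": mean, "variance": variance,
            "sd": sd, "se": se, "lcl": lcl, "ucl": ucl}


# Hypothetical incomes, for illustration only (not census data).
incomes = [12000, 18500, 23000, 27500, 31000, 36000, 42000, 58000]
summary = describe(incomes)
print(summary["mean"], round(summary["sd"], 2), round(summary["se"], 2))

# The worked example from the slides: mean = 29769, standard error = 19.
lcl, ucl = confidence_limits(29769, 19)
print(round(lcl, 2), round(ucl, 2))  # 29731.76 29806.24
```

With real microdata, the weight variable would also have to be taken into account before reporting frequencies, which this sketch does not attempt.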