## Statistics bootcamp

Laine Ruus, Data Library Service, University of Toronto
For ACCOLEDS 2004, 2004-12-07

**Outline**
• Describing a variable
• Describing relationships among two or more variables

**First, some vocabulary**
• Variable: In social science research, for each unit of analysis, each item of data (e.g., age of person, income of family, consumer price index) is called a variable.
• Unit of analysis: The basic observable entity being analyzed by a study and for which data are collected in the form of variables. A unit of analysis is sometimes referred to as a case or "observation".

**'Sex' can be coded with values:**
• 1='male' 2='female' 3='no response', or
• 1='female' 2='male', or
• 1='male' 2='female', or
• 'M'='male' 'F'='female', or
• 'male' 'female', or even
• 1='yes' 2='no' 3='maybe'

**The values a variable takes must be:**
• exhaustive: include the characteristics of all cases, and
• mutually exclusive: each case must have one and only one value or code for each variable

**What's wrong with this coding scheme?**
• Under $3,000
• $3,000-$7,000
• $8,000-$12,000
• $13,000-$17,000
• $18,000-$22,000
• $23,000-$27,000
• $28,000-$32,000
• $33,000-$37,000
• $38,000 & over

(Source: Census of Canada, 1961: user summary tapes)

**Variables are normally coded numerically, because:**
• arithmetic is easier with numbers than with letters
• some characteristics are inherently numeric: age, weight in kilograms, number of children ever borne, income, value of dwelling, years of schooling, etc.
• space/size of data sets has, until recently, been a major consideration

**Three basic types of variables**
• Categorical
  • Nominal, aka nonorderable discrete. E.g., gender, ethnicity, immigrant status, province of residence, marital status, labour force status, etc.
  • Ordinal, aka orderable discrete. E.g., highest level of schooling, social class, left-right political identification, Likert scales, income groups, age groups, etc.
• Continuous, aka interval, numeric. E.g., age, income, etc.

**Descriptive statistics summarize the properties of a sample of observations**
• how the units of observation are the same (central tendency)
• how they are different (dispersion)
• how representative the sample is of the population at large (significance)

**Nominal variable:**
• Central tendency
  • mode
• Dispersion
  • frequency distribution
  • percentages, proportions, odds
  • Index of qualitative variation (IQV)
• Significance
  • coefficient of variation (CV)
• Visualization
  • bar chart

**Mode: the category with the highest number or percentage of observations**

**These frequencies can be visualized as a bar chart (based on percentages):**

**Why the differences?**
• What is the population in each table?
• Which one is correct?

**The most important thing to know about any distribution, whether it is a rate, a proportion, or a percent, is what is in the denominator.** And it must always be reported.

**We can also derive the distribution information from the 1996 individual PUMF using a statistical package:**

**If we weight the distribution, both the frequencies and the percentages will almost match the distribution from the Profile file:**

**Just a few words on weighting:**
• A weight is the inverse of the chance that any member of the population (universe) had of being selected for the sample, i.e., the number of population members that each sampled case represents
• In general
  • the weight can be used to produce estimates for the total population (population weight)
  • and/or the weight can adjust for known deficiencies in the sample (sample weight)

**and a few more words on weighting…**
• The 1996 census public use microdata file of individuals is a 2.8% sample of the population
• Stats Can calls this a 'self-weighting' sample: every case has a weight of 36 (1/0.028 ≈ 36)
• Knowing who was excluded from the sample is as important as knowing who was included

**And some final words on weighting…**
• When to use the population weight variable
  • when you are producing
frequencies to reflect the frequency in the population
• When you don't need to use the population weight variable
  • when you are producing percentages, proportions, ratios, rates, etc.
• When to use a sample weight variable
  • always

**Proportions, percents, and odds**
• Percent: "the percent married in the population 15 years and over is …" = (#married / population 15 years and over) × 100 = 51.2%
• Proportion: "the proportion married in the population 15 years and over is …" = (#married / population 15 years and over) = .512
• Odds: "the odds on being married are …" = (#married / #not married in the population), or = proportion married / (1 − proportion married) = .512 / (1 − .512) = .512 / .488 = 1.049

**Ordinal variable:**
• Central tendency
  • median and mode
• Dispersion
  • frequency distribution
  • range
  • percentages/quantiles, proportions, odds
  • Index of qualitative variation (IQV)
• Significance
  • coefficient of variation (CV)
• Visualization
  • histogram

**The median is the value that divides an orderable distribution exactly into halves.** Finding the median is easier if we compute cumulative percentages.

**Example: cumulative percentages**

| Age group | % | Cumulative % |
|---|---|---|
| 0-4 years | 6.65 | 6.65 |
| 5-9 years | 6.90 | 13.55 |
| 10-14 years | 6.91 | 20.46 |
| 15-24 years | 13.37 | 33.83 |
| 25-34 years | 15.60 | 49.43 |
| 35-44 years | 16.85 | 66.28 |
| 45-54 years | 12.86 | 79.14 |
| 55-64 years | 8.63 | 87.77 |
| 65-74 years | 7.15 | 94.92 |
| 75-84 years | 3.91 | 98.83 |
| 85 years and over | 1.17 | 100.00 |

**So how can we describe this distribution, using the vocabulary we have so far?**
• What is the mode of this distribution?
• What is the median?
• What is the range?

**Percentiles/quantiles**
• percentiles/quantiles are the values below which a given percentage of the cases fall
• the median = the 50th percentile
• quartiles are the values that divide a distribution into 4 groups, each containing 25% of the cases

**Interquartile range**
• The interquartile range is the difference between the value at 75% of cases and the value at 25% of cases
• What is the interquartile range for the following distribution?

**Cumulative percentages**

| Age group | % | Cumulative % |
|---|---|---|
| 0-4 years | 6.65 | 6.65 |
| 5-9 years | 6.90 | 13.55 |
| 10-14 years | 6.91 | 20.46 |
| 15-24 years | 13.37 | 33.83 |
| 25-34 years | 15.60 | 49.43 |
| 35-44 years | 16.85 | 66.28 |
| 45-54 years | 12.86 | 79.14 |
| 55-64 years | 8.63 | 87.77 |
| 65-74 years | 7.15 | 94.92 |
| 75-84 years | 3.91 | 98.83 |
| 85 years and over | 1.17 | 100.00 |

**Continuous variable:**
• Central tendency
  • mean, median and mode
• Dispersion
  • range
  • variance or standard deviation
  • quantiles/percentiles
  • interquartile range
• Significance
  • standard error
  • coefficient of variation
• Visualization
  • polygon (line graph)

**Means, variances, and standard deviations:**
• Mean: the average, computed by adding up the values of all observations and dividing by the number of observations (cases)
• Variance: computed by taking each value and subtracting the value of the mean from it, squaring the results, adding them up, and dividing by the number of cases (actually N-1)
• Standard deviation: the square root of the variance. In a normal distribution, about 68% of the cases fall within one standard deviation above or below the mean

**Availability of continuous variables in Stats Can products:**
• Stats Can only rarely publishes truly continuous variables
• Some exceptions are:
  • age by single years of age (census)
  • estimates of population by age (Annual demographic statistics)

**Statistics Canada generally reports the distribution of continuous variables as:**
• Measures of central tendency
  • an ordinal (categorical) variable
  • an average (mean) and standard error, variance, or standard deviation
  • a median
  • rates (other than
per 100), e.g., children ever born per 1,000 women in the 1991 census 2B profile
• Measures of dispersion
  • percentage: essentially, a rate per 100 population. E.g., incidence of low income as a % of economic families, in the census profiles. This includes employment and unemployment rates
  • quantiles (e.g., quintiles in Income trends in Canada)
  • Gini coefficients (computed from the coefficient of variation) (e.g., Income trends in Canada)
• Measures of significance
  • standard error

**For the following distribution…**
• What is the median?
• What is the mean?
• What is the range?

**Using the percentages and cumulative percentages:**
• What is your best estimate of the interquartile range?
• What is your best estimate of the standard deviation?
• Why is the average (mean) income so much higher than the median?
• See pages 18-19 of your handout

**Using the standard error to describe more of the distribution:**
• the standard error of the mean is a measure of how likely it is that the mean in the data we are looking at is the same as or similar to the mean in the population at large
• computed as the standard deviation divided by the square root of N
• the larger the N, the smaller the standard error

**Confidence intervals**
• The standard error makes most sense when we use it to compute a confidence interval around the mean
• For Canada, in the previous example:
  • 95% upper confidence limit (UCL) = mean + 1.96 × (standard error) = 25196 + 1.96 × 13 = 25196 + 25.48 = $25,221
  • 95% lower confidence limit (LCL) = mean − 1.96 × (standard error) = 25196 − 1.96 × 13 = 25196 − 25.48 = $25,171
• 1.96 is the Z-statistic that bounds the central 95% of a normal distribution
• The handout contains computed confidence intervals for each of the provinces (p.21)

**How do we interpret this?**
• if we drew repeated random samples from the same population and computed a 95% confidence interval from each, about 95% of those intervals would contain the population mean; $25,171 to $25,221 is one such interval
• this is not the same as saying that there is a 95% probability that the population mean falls within these two particular limits.

**Using
microdata**
• Statistical packages such as SAS, SPSS, etc. will compute the mean, the standard deviation, standard error, etc. for continuous variables
• SDA will only report these measures for variables with fewer than 8,000 values
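
As a rough illustration of what a statistical package computes for a continuous variable, the following Python sketch derives the mean, median, variance (dividing by N-1), standard deviation, standard error, and 95% confidence limits. Python only stands in here for the packages named above. The `incomes` values are invented for illustration, and the function names `describe` and `confidence_limits` are hypothetical. The final call reuses the mean (25196) and standard error (13) from the Canada example.

```python
import math
import statistics


def confidence_limits(mean, std_error, z=1.96):
    """Return (lower, upper) limits: mean -/+ z * standard error."""
    margin = z * std_error
    return mean - margin, mean + margin


def describe(values, z=1.96):
    """Summarize a continuous variable held in a list of numbers."""
    n = len(values)
    mean = statistics.fmean(values)
    variance = statistics.variance(values)  # sample variance: divides by N-1
    std_dev = math.sqrt(variance)           # square root of the variance
    std_error = std_dev / math.sqrt(n)      # standard deviation / sqrt(N)
    lcl, ucl = confidence_limits(mean, std_error, z)
    return {
        "n": n,
        "mean": mean,
        "median": statistics.median(values),
        "variance": variance,
        "std_dev": std_dev,
        "std_error": std_error,
        "lcl_95": lcl,
        "ucl_95": ucl,
    }


# Invented incomes (an assumption, not census data); one large value skews the mean.
incomes = [12000, 18500, 21000, 25000, 26500, 31000, 40000, 95000]

stats = describe(incomes)
for name, value in stats.items():
    print(f"{name}: {value:,.2f}")

# Confidence limits from the mean and standard error quoted for Canada.
lcl, ucl = confidence_limits(25196, 13)
print(f"Canada example: {lcl:,.2f} to {ucl:,.2f}")
```

In this made-up sample the single large income pulls the mean above the median, which is the pattern the income exercise asks about.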