Statistics bootcamp

Statistics bootcamp. Laine Ruus Data Library Service, University of Toronto For ACCOLEDS 2004 2004-12-07. Outline. Describing a variable Describing relationships among two or more variables. First, some vocabulary.


  1. Statistics bootcamp Laine Ruus Data Library Service, University of Toronto For ACCOLEDS 2004 2004-12-07

  2. Outline • Describing a variable • Describing relationships among two or more variables

  3. First, some vocabulary • Variable: In social science research, for each unit of analysis, each item of data (e.g., age of person, income of family, consumer price index) is called a variable. • Unit of analysis: The basic observable entity being analyzed by a study and for which data are collected in the form of variables. A unit of analysis is sometimes referred to as a case or "observation"

  4. ‘Sex’ can be coded with values: • 1 = ‘male’, 2 = ‘female’, 3 = ‘no response’ or • 1 = ‘female’, 2 = ‘male’ or • 1 = ‘male’, 2 = ‘female’ or • ‘M’ = ‘male’, ‘F’ = ‘female’ or • ‘male’, ‘female’ or even • 1 = ‘yes’, 2 = ‘no’, 3 = ‘maybe’

  5. The values a variable takes must be: • exhaustive: include the characteristics of all cases, and • mutually exclusive: each case must have one and only one value or code for each variable

  6. What’s wrong with this coding scheme? • Under $3,000 • $3,000-$7,000 • $8,000-$12,000 • $13,000-$17,000 • $18,000-$22,000 • $23,000-$27,000 • $28,000-$32,000 • $33,000-$37,000 • $38,000 & over (Source: Census of Canada, 1961: user summary tapes)
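One way to test a coding scheme against the two rules on slide 5 is to run candidate values through it and look for values that receive no code or more than one code. The sketch below does this for the 1961 brackets. It assumes income is recorded in whole dollars; if income had been recorded in rounded thousands, the check would need different test values. The names BRACKETS, matching_codes and check_scheme are illustrative only.

```python
# Brackets from the 1961 scheme as (label, low, high), in whole dollars.
# None means the bracket is open-ended on that side.
BRACKETS = [
    ("Under $3,000", None, 2999),
    ("$3,000-$7,000", 3000, 7000),
    ("$8,000-$12,000", 8000, 12000),
    ("$13,000-$17,000", 13000, 17000),
    ("$18,000-$22,000", 18000, 22000),
    ("$23,000-$27,000", 23000, 27000),
    ("$28,000-$32,000", 28000, 32000),
    ("$33,000-$37,000", 33000, 37000),
    ("$38,000 & over", 38000, None),
]


def matching_codes(income):
    """Return the label of every bracket that contains this income."""
    hits = []
    for label, low, high in BRACKETS:
        above_low = low is None or income >= low
        below_high = high is None or income <= high
        if above_low and below_high:
            hits.append(label)
    return hits


def check_scheme(incomes):
    """Return (values with no code, values with more than one code)."""
    uncoded = [x for x in incomes if len(matching_codes(x)) == 0]
    multi = [x for x in incomes if len(matching_codes(x)) > 1]
    return uncoded, multi


# Test incomes every $500 from $0 to $40,000 (an arbitrary choice).
uncoded, multi = check_scheme(range(0, 40001, 500))
print("not exhaustive for:", uncoded)
print("not mutually exclusive for:", multi)
```

An empty first list means the scheme is exhaustive for the values tested, and an empty second list means it is mutually exclusive for them.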

  7. Variables are normally coded numerically, because: • arithmetic is easier with numbers than with letters • some characteristics are inherently numeric: age, weight in kilograms, number of children ever borne, income, value of dwelling, years of schooling, etc. • space/size of data sets has, until recently, been a major consideration

  8. Three basic types of variables • Categorical • Nominal, aka nonorderable discrete Eg gender, ethnicity, immigrant status, province of residence, marital status, labour force status, etc. • Ordinal, aka orderable discrete Eg highest level of schooling, social class, left-right political identification, Likert scales, income groups, age groups, etc. • Continuous, aka interval, numeric Eg age, income, etc.

  9. Descriptive statistics summarize the properties of a sample of observations • how the units of observation are the same (central tendency) • how they are different (dispersion) • how representative the sample is of the population at large (significance)

  10. Nominal variable: • Central tendency • mode • Dispersion • frequency distribution • percentages, proportions, odds • Index of qualitative variation (IQV) • Significance • coefficient of variation (CV) • Visualization • bar chart

  11. Mode: the category with the highest number or percentage of observations

  12. Mode: the category with the highest number or percentage of observations
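As a rough illustration of the nominal-variable measures listed on slide 10, the sketch below computes a frequency distribution, percentages, the mode, and the index of qualitative variation, using the common formula IQV = K(1 - sum of squared proportions)/(K - 1), where K is the number of categories. The ten marital-status codes are made up for the example and do not come from the census tables.

```python
from collections import Counter

# Hypothetical marital-status codes for ten cases (illustration only).
cases = ["married", "single", "married", "widowed", "married",
         "single", "divorced", "married", "single", "married"]

counts = Counter(cases)              # frequency distribution
n_cases = len(cases)
percentages = {cat: 100 * c / n_cases for cat, c in counts.items()}

# Mode: the category with the highest count.
mode_category, mode_count = counts.most_common(1)[0]

# IQV: 0 when all cases share one category, 1 when cases are spread evenly.
n_categories = len(counts)
sum_squared_props = sum((c / n_cases) ** 2 for c in counts.values())
iqv = n_categories * (1 - sum_squared_props) / (n_categories - 1)

print(dict(counts))
print(percentages)
print("mode:", mode_category)
print("IQV:", round(iqv, 3))
```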

  13. These frequencies can be visualized as a bar chart (based on percentages):

  14. The same distribution, from one of the Nation series files:

  15. and showing percentages:

  16. Notice the differences:

  17. Why the differences? • What is the population in each table? • Which one is correct?

  18. The most important thing to know about any distribution, whether it is a rate, a proportion, or a percent, is what is in the denominator. And it must always be reported.

  19. We can also derive the distribution information from the 1996 individual pumf using a statistical package:

  20. If we weight the distribution, both the frequencies and the percentages will almost match the distribution from the Profile file:

  21. Just a few words on weighting: • A weight is the inverse of the chance that any member of the population (universe) had of being selected for the sample, i.e. the number of population members each sampled case represents • In general • the weight can be used to produce estimates for the total population (population weight) • and/or the weight can adjust for known deficiencies in the sample (sample weight)

  22. and a few more words on weighting… • The 1996 census public use microdata file of individuals is a 2.8% sample of the population • Stats Can calls this a ‘self-weighting’ sample: every case has a weight of 36 • Knowing who was excluded from the sample is as important as knowing who was included

  23. And some final words on weighting… • When to use the population weight variable • when you are producing frequencies to reflect the frequency in the population • When you don’t need to use the population weight variable • when you are producing percentages, proportions, ratios, rates, etc. • When to use a sample weight variable • always
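The sketch below illustrates why a constant population weight matters for frequencies but not for percentages. It assumes a self-weighting sample with the 2.8% sampling fraction mentioned on slide 22, so every case gets the same weight of about 36. The sample of 1,000 cases is invented for the example.

```python
from collections import Counter

SAMPLING_FRACTION = 0.028          # the 2.8% sample from slide 22
WEIGHT = 1 / SAMPLING_FRACTION     # about 35.7, rounded to 36 on the slide

# Hypothetical sample of 1,000 cases (illustration only).
sample = ["married"] * 512 + ["not married"] * 488

sample_counts = Counter(sample)
# Population estimates: each sampled case stands for WEIGHT people.
weighted_counts = {cat: c * WEIGHT for cat, c in sample_counts.items()}


def percentages(counts):
    """Convert a dict of counts to a dict of percentages."""
    total = sum(counts.values())
    return {cat: 100 * c / total for cat, c in counts.items()}


unweighted_pct = percentages(sample_counts)
weighted_pct = percentages(weighted_counts)

print("sample counts:   ", dict(sample_counts))
print("weighted counts: ", {k: round(v) for k, v in weighted_counts.items()})
print("unweighted %:    ", unweighted_pct)
print("weighted %:      ", weighted_pct)
```

Because every case carries the same weight, the weight cancels out of any ratio. With unequal weights, such as a sample weight that corrects for known deficiencies, the percentages would change as well.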

  24. Proportions, percents, and odds • Percent: “the percent married in the population 15 years and over is…” = (#married/population 15 years and over)*100 = 51.2% • Proportion: “the proportion married in the population 15 years and over is…” = (#married/population 15 years and over) = .512 • Odds: “the odds on being married are…” = (#married/#not married in the population) or = proportion married/(1-proportion married) = .512/(1-.512) = .512/.488 = 1.049
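The arithmetic on this slide can be reproduced in a few lines. The counts of 512 married out of 1,000 are chosen only so that the proportion matches the .512 used above; they are not census counts, and the function name describe is illustrative.

```python
def describe(n_with, n_total):
    """Return (proportion, percent, odds) for a count within a total."""
    proportion = n_with / n_total
    percent = 100 * proportion
    odds = proportion / (1 - proportion)   # same as n_with / (n_total - n_with)
    return proportion, percent, odds


proportion, percent, odds = describe(512, 1000)
print("proportion:", proportion)
print("percent:", percent)
print("odds:", round(odds, 3))
```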

  25. Ordinal variable: • Central tendency • median and mode • Dispersion • frequency distribution • range • percentages/quantiles, proportions, odds • Index of qualitative variation (IQV) • Significance • coefficient of variation (CV) • Visualization • histogram

  26. Example: frequency distribution

  27. Example: relative percentages

  28. The Median is the value that divides an orderable distribution exactly into halves. Finding the median is easier if we compute cumulative percentages

  29. Example: cumulative percentages % Cumulative % 0-4 years 6.65 6.65 5-9 years 6.90 13.55 10-14 years 6.91 20.46 15-24 years 13.37 33.83 25-34 years 15.60 49.43 35-44 years 16.85 66.28 45-54 years 12.86 79.14 55-64 years 8.63 87.77 65-74 years 7.15 94.92 75-84 years 3.91 98.83 85 years and over 1.17 100.00

  30. So how can we describe this distribution, using the vocabulary we have so far? • What is the mode of this distribution? • What is the median? • What is the range?

  31. A statistical package can also report these measures
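As a stand-in for a statistical package, the sketch below derives the cumulative percentages from slide 29 and reads off the modal and median age groups. The percentages are copied from that slide; the variable names are illustrative.

```python
from itertools import accumulate

# Age groups and percentages from slide 29.
AGE_GROUPS = [
    ("0-4 years", 6.65), ("5-9 years", 6.90), ("10-14 years", 6.91),
    ("15-24 years", 13.37), ("25-34 years", 15.60), ("35-44 years", 16.85),
    ("45-54 years", 12.86), ("55-64 years", 8.63), ("65-74 years", 7.15),
    ("75-84 years", 3.91), ("85 years and over", 1.17),
]

labels = [label for label, _ in AGE_GROUPS]
pcts = [pct for _, pct in AGE_GROUPS]
cumulative = list(accumulate(pcts))      # running total of the percentages

# Mode: the group with the largest percentage.
mode_group = max(AGE_GROUPS, key=lambda group: group[1])[0]

# Median: the first group whose cumulative percentage reaches 50.
median_group = next(label for label, cum in zip(labels, cumulative)
                    if cum >= 50)

for label, pct, cum in zip(labels, pcts, cumulative):
    print(f"{label:<20}{pct:>7.2f}{cum:>8.2f}")
print("mode:", mode_group)
print("median:", median_group)
print("range:", labels[0], "to", labels[-1])
```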

  32. Percentiles/quantiles • a percentile/quantile is the value below which a given percentage of the cases fall • the median = the 50th percentile • quartiles are the three values that divide an ordered distribution into 4 groups, each containing 25% of the cases

  33. Interquartile range • The interquartile range is the difference between the value at 75% of cases, and the value at 25% of cases • What is the interquartile range for the following distribution?

  34. Cumulative percentages % Cumulative % 0-4 years 6.65 6.65 5-9 years 6.90 13.55 10-14 years 6.91 20.46 15-24 years 13.37 33.83 25-34 years 15.60 49.43 35-44 years 16.85 66.28 45-54 years 12.86 79.14 55-64 years 8.63 87.77 65-74 years 7.15 94.92 75-84 years 3.91 98.83 85 years and over 1.17 100.00
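When only grouped percentages are available, a percentile can be estimated by interpolating inside the group where the cumulative percentage crosses the target. The sketch below does this for the 25th and 75th percentiles of the age distribution. It assumes that ages are spread evenly within each group and that a group such as 15-24 years covers exact ages from 15 up to, but not including, 25. Both are simplifying assumptions, and the function name estimate_percentile is illustrative.

```python
# (lower bound, upper bound, percent) for each age group on the slide.
# The last group is open-ended, so its upper bound is None.
GROUPS = [
    (0, 5, 6.65), (5, 10, 6.90), (10, 15, 6.91), (15, 25, 13.37),
    (25, 35, 15.60), (35, 45, 16.85), (45, 55, 12.86), (55, 65, 8.63),
    (65, 75, 7.15), (75, 85, 3.91), (85, None, 1.17),
]


def estimate_percentile(target):
    """Estimate the age below which `target` percent of cases fall."""
    cum_before = 0.0
    for lower, upper, pct in GROUPS:
        if cum_before + pct >= target:
            if upper is None:
                raise ValueError("cannot interpolate in an open-ended group")
            # Fraction of this group needed to reach the target percentage.
            fraction = (target - cum_before) / pct
            return lower + fraction * (upper - lower)
        cum_before += pct
    raise ValueError("target must be between 0 and 100")


q1 = estimate_percentile(25)
q3 = estimate_percentile(75)
iqr = q3 - q1
print("25th percentile:", round(q1, 1))
print("75th percentile:", round(q3, 1))
print("interquartile range:", round(iqr, 1))
```

Without interpolation, the answer can only be stated in terms of groups: the 25th percentile falls in the 15-24 group and the 75th in the 45-54 group.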

  35. Continuous variable: • Central tendency • mean, median and mode • Dispersion • range • variance or standard deviation • quantiles/percentiles • interquartile range • Significance • standard error • coefficient of variation • Visualization • polygon (line graph)

  36. Means, variances, and standard deviations: • Mean: the average, computed by adding up the values of all observations and dividing by the number of observations (cases) • Variance: computed by taking each value and subtracting the value of the mean from it, squaring the results, adding them up, and dividing by the number of cases (actually N-1 for a sample) • Standard deviation: the square root of the variance. In a normal distribution, about 68% of the cases fall within one standard deviation above or below the mean
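The definitions above translate directly into code. The sketch below computes the variance by hand and compares it with Python's standard `statistics` module, whose `variance` and `stdev` functions divide by N-1 while `pvariance` divides by N. The seven income values are invented for illustration.

```python
import statistics

# Hypothetical incomes in dollars (illustration only).
incomes = [12000, 18000, 21000, 25000, 30000, 34000, 90000]
n = len(incomes)

mean = sum(incomes) / n

# Variance by hand: squared deviations from the mean, summed, over N-1.
squared_deviations = [(x - mean) ** 2 for x in incomes]
manual_variance = sum(squared_deviations) / (n - 1)
manual_sd = manual_variance ** 0.5

# The same measures from the standard library.
sample_variance = statistics.variance(incomes)        # divides by N-1
population_variance = statistics.pvariance(incomes)   # divides by N
sd = statistics.stdev(incomes)
median = statistics.median(incomes)

print("mean:", round(mean, 2))
print("median:", median)
print("variance (N-1):", round(sample_variance, 2))
print("variance (N):", round(population_variance, 2))
print("standard deviation:", round(sd, 2))
```

In this made-up example the single large income pulls the mean well above the median, which is the usual pattern for income data.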

  37. Availability of continuous variables in Stats Can products: • Stats Can only rarely publishes truly continuous variables • Some exceptions are: • age by single years of age (census) • estimates of population by age (Annual demographic statistics)

  38. Statistics Canada generally reports the distribution of continuous variables as: • Measures of central tendency • an ordinal (categorical) variable • an average (mean) and standard error, variance, or standard deviation • a median • rates (with a base other than 100), e.g. children ever born per 1,000 women in the 1991 census 2B profile • Measures of dispersion • percentage, essentially a rate per 100 population, e.g. incidence of low income as a % of economic families, in the census profiles. This includes employment and unemployment rates • quantiles (e.g. quintiles in Income trends in Canada) • Gini coefficients (computed from the coefficient of variation) (e.g. Income trends in Canada) • Measures of significance • standard error

  39. For the following distribution… • What is the median? • What is the mean? • What is the range?

  40. Using the percentages and cumulative percentages: • What is your best estimate of the interquartile range? • What is your best estimate of the standard deviation? • Why is the average (mean) income so much higher than the median? • See pages 18-19 of your handout

  41. As a percentage distribution

  42. The polygon produced by Beyond 20/20 isn’t very useful…

  43. Using the standard error to describe more of the distribution: • standard error of the mean is a measure of how much the mean would vary from one sample to the next, and so of how close the mean in the data we are looking at is likely to be to the mean in the population at large • computed as the standard deviation divided by the square root of the N • the larger the N, the smaller the standard error
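A minimal sketch of the standard error of the mean, using the usual formula of the sample standard deviation divided by the square root of N. The incomes are invented. The larger sample simply repeats the smaller one four times to show the effect of N, which is not how a real larger sample would be obtained.

```python
import math
import statistics


def standard_error(values):
    """Standard error of the mean: sample SD / sqrt(N)."""
    return statistics.stdev(values) / math.sqrt(len(values))


# Hypothetical incomes in dollars (illustration only).
small_sample = [12000, 18000, 21000, 25000, 30000, 34000, 90000]
large_sample = small_sample * 4      # same values, four times the N

se_small = standard_error(small_sample)
se_large = standard_error(large_sample)

print("N =", len(small_sample), "standard error:", round(se_small, 2))
print("N =", len(large_sample), "standard error:", round(se_large, 2))
```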

  44. For example….

  45. Confidence intervals • The standard error makes most sense when we use it to compute a confidence interval around the mean • For Canada, in the previous example: • 95% upper confidence limit (UCL) = (mean)+1.96(standard error) = 25196+1.96(13) = 25196+25.48 = $25,221 • 95% lower confidence limit (LCL) = (mean)-1.96(standard error) = 25196-1.96(13) = 25196-25.48 = $25,171 • 1.96 is the Z value that bounds the central 95% of a standard normal distribution • The handout contains computed confidence intervals for each of the provinces (p.21)
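The calculation for Canada can be checked with a short function. The mean of 25196 and the standard error of 13 are the figures quoted on the slide; the function name confidence_interval is illustrative.

```python
def confidence_interval(mean, standard_error, z=1.96):
    """Return (lower, upper) confidence limits around a mean."""
    margin = z * standard_error
    return mean - margin, mean + margin


lcl, ucl = confidence_interval(25196, 13)
print("95% LCL:", round(lcl))
print("95% UCL:", round(ucl))
```

Passing a different z value gives other confidence levels, for example 2.576 for 99%.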

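  46. How do we interpret this? • if we draw repeated random samples from the same population and compute a 95% confidence interval from each one, about 95% of those intervals will contain the population mean total income • this is not the same as saying that there is a 95% probability that the population mean falls between $25,171 and $25,221: the population mean is fixed, and it is the interval that varies from sample to sample.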
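A small simulation can make the repeated-sampling idea concrete. The sketch below draws many random samples from a hypothetical population, builds a 95% interval from each, and counts how often the interval contains the population mean. The population is assumed to be normal with a mean of 25196 and a standard deviation of 20000. Both the shape and the standard deviation are assumptions made for the example, and real income distributions are skewed.

```python
import math
import random
import statistics


def coverage(pop_mean=25196, pop_sd=20000, n=200, trials=1000,
             z=1.96, seed=1):
    """Share of sample-based 95% intervals that contain pop_mean."""
    rng = random.Random(seed)       # fixed seed so the run is repeatable
    hits = 0
    for _ in range(trials):
        sample = [rng.gauss(pop_mean, pop_sd) for _ in range(n)]
        mean = statistics.fmean(sample)
        se = statistics.stdev(sample) / math.sqrt(n)
        if mean - z * se <= pop_mean <= mean + z * se:
            hits += 1
    return hits / trials


share = coverage()
print("share of intervals containing the population mean:", share)
```

The share should come out close to 0.95, with some run-to-run variation if the seed or the number of trials is changed.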

  47. Using microdata • Statistical packages such as SAS, SPSS, etc. will compute the mean, the standard deviation, standard error, etc. for continuous variables • SDA will only report these measures for variables with fewer than 8,000 values
