Friday 26th February

Analysing Cross-Tabulations Friday 26th February

Outline • Week 9 (The session in two weeks’ time) – Seminar organization • Recap on last week • Cross-tabular analysis • Testing for relationships in cross-tabulations: Chi-Square • Measuring the strength of association: Cramér’s V and Odds Ratios • Choosing statistical tests • Multivariate analysis – a quick overview • What multivariate analysis tells you • Multivariate cross-tabular analysis using Cramér’s V

Week 9 (in two weeks’ time) In the morning session in Week 9, we will be looking at multivariate analysis in published work. In preparation for this I would like you, as individuals or (even better) in groups, to take a look at a published sociological article that employs multivariate analysis. It is up to you to choose an article that you are interested in. In general, any article that says it uses multiple regression or logistic regression should be OK. One way of finding an article is to choose a journal that is available online and leans towards quantitative research, and browse the latest issues. You should read the article, and try to work out: • What is the question that is being addressed (hypothesis or hypotheses)? • What is/are the dependent variable (or variables)? • What is/are the main independent variable(s) (the focus of the article)? • What control variable(s) is/are being used? • Which independent/control variables have significant effects? • What do these effects mean substantively? Do not worry if you do not understand everything. Just work out what you can!

Last week... • Hypothesis testing involves testing the NULL HYPOTHESIS that there is no difference/effect/relationship, e.g. variables are independent. • When we test such a hypothesis we test whether the relationships or differences that we observe in our sample are likely to have occurred if the null hypothesis were true for the population. • The sampling distribution (a distribution based upon the various possible samples) plays an important role in inferential statistics because it allows us to work out how much variation is likely to be produced by sampling error (e.g. if levels of happiness vary a lot, then small gender differences are quite likely to reflect sampling error). • When we find a very low probability (p<0.05, or less than 5%) that we could have found what we found in our sample if the null hypothesis were true we can infer that it is unlikely to be true. • And therefore, we can infer that the ALTERNATIVE HYPOTHESIS (saying that there is a difference/effect) is likely to be true. • This is a sort of backwards logic. So… if you find it easier to think forwards, the simple version (although less technically correct) is that if we find that p<0.05 we have identified a relationship/effect.

Last week... • Last week we looked at tests that had interval-level variables (such as income, or years at an address) as dependent variables. • Specifically we looked at tests that investigated: • Whether a population has a mean that is different from a stated mean (z-tests, or ‘one-sample t-tests’ in SPSS). e.g. Do people on average stay at an address for 10 years? Or whether a population has a proportion in category x that is different from a stated proportion (‘binomial’ tests in SPSS). e.g. Is the proportion of people who have been at their current address for five or more years equal to 50%? • Whether two groups have population means that differ from one another (t-tests, or ‘independent samples t-test’ in SPSS). e.g. Is there a gender difference in mean time spent at current address? • Whether the different categories of a variable (e.g. social class) have population means that differ in some way (ANOVA, or ‘one way ANOVA’ in SPSS). e.g. Do people in different social classes have different average lengths of time at their addresses?

Categorical data analysis • Today we are going to look at relationships between categorical variables (e.g. gender, ‘race’, religious denomination). • When both variables are categorical we cannot produce means. Instead we construct contingency tables that show the frequency with which cases fall into each combination of categories – e.g. ‘man’ and ‘Christian’ (we cannot do this directly with continuous variables, e.g. age, as people often fall into numerous categories, leading to tables that are enormous and unmanageable). • When we conduct statistical analyses of cross-tabulated data, we are trying to work out whether there is any systematic relationship between the variables being analyzed or whether any ‘patterns’ are ‘random’, i.e. only reflect sampling error. • Therefore the tests that we do compare what we find (observe) with what would be expected if there were no relationship – i.e. given the null hypothesis of no relationship in the population.

From: Phoenix, A. 1991. Young Mothers? Cambridge: Polity Press. Table 3.2 in Chapter 3: ‘How the Women Came to be Mothers’ (p61)
MARITAL STATUS (AT CONCEPTION) by ORIENTATION TO PREGNANCY
             Wanted to    Did not     Had not thought   Important
             conceive     mind        about it          not to       TOTAL
Single       4 (8%)       12 (23%)    13 (25%)          24 (45%)     53
Cohabiting   4 (44%)      2 (22%)     1 (11%)           2 (22%)      9
Married      9 (53%)      6 (35%)     0 (0%)            2 (12%)      17
TOTAL        17           20          14                28           79

From: Jupp, P. 1993. ‘Cremation or burial? Contemporary choice in city and village’. In Clark, D. (ed.) The Sociology of Death. Oxford: Blackwell. (Derived from Tables 5 and 6 on pages 177 and 178).
OCCUPATIONAL CLASS by DISPOSAL CHOICE
                Cremation    Burial      TOTAL
Working class   20 (59%)     14 (41%)    34
Middle class    21 (88%)     3 (13%)     24
TOTAL           41           17          58

What can be learned from these cross-tabulations? Do you think that the patterns in the MARITAL STATUS by ORIENTATION TO PREGNANCY and OCCUPATIONAL CLASS by DISPOSAL CHOICE cross-tabulations provide sufficient evidence to conclude that relationships exist? Where can the relationship, if any, be found in the MARITAL STATUS (AT CONCEPTION) by ORIENTATION TO PREGNANCY cross-tabulation?

How do you work out whether a difference between men and women is likely to be due to chance? We use chi-square ($\chi^2$) to look at the difference between what we observe and what would be likely if there were no difference except that generated by chance (i.e. sampling error):
$$\chi^2 = \sum_{i}\sum_{j} \frac{(\text{Observed}_{ij} - \text{Expected}_{ij})^2}{\text{Expected}_{ij}}$$
• For each cell in the table: we work out the frequency that we would expect and see how much the observed frequency differs from this. • We then square this difference and divide by the expected frequency. • We then sum these values. • The observed frequency for each cell is what we see. • The expected frequency is the frequency that you would get in each cell if men and women were exactly as likely as each other to fall into each of the categories of the other variable. It may or may not be clear from this slide that the above formula simply (or not so simply?!) represents the process described in the bullet points, so working through an example may help...

DEGREE SUBJECT by GENDER: ‘BLACK’ GRADUATES.
Subject           Male        Female      Total
Arts              16 (47%)    18 (53%)    34
Sciences          29 (58%)    21 (42%)    50
Social Sciences   42 (53%)    38 (47%)    80
Education         3 (19%)     13 (81%)    16
TOTAL             90 (50%)    90 (50%)    180
The above table is based on a random sample of 180 ‘Black’ graduates born in the UK and aged 25-34 in 1991. (Data adapted from the 1991 Census SARs; ‘Black’ includes Black-African, Black-Caribbean and Black-Other).

‘Expected’ frequencies
Subject           Male        Female      Total
Arts              17 (50%)    17 (50%)    34
Sciences          25 (50%)    25 (50%)    50
Social Sciences   40 (50%)    40 (50%)    80
Education         8 (50%)     8 (50%)     16
TOTAL             90 (50%)    90 (50%)    180

Differences (‘Observed’ minus ‘Expected’)
Subject           Male    Female    Total
Arts              -1      1         0
Sciences          4       -4        0
Social Sciences   2       -2        0
Education         -5      5         0
TOTAL             0       0         0

Squared differences
Subject           Male    Female
Arts              1       1
Sciences          16      16
Social Sciences   4       4
Education         25      25

... divided by ‘Expected’ values and summed
Subject           Male             Female
Arts              1/17 = 0.06      1/17 = 0.06
Sciences          16/25 = 0.64     16/25 = 0.64
Social Sciences   4/40 = 0.10      4/40 = 0.10
Education         25/8 = 3.13      25/8 = 3.13
0.06 + 0.06 + 0.64 + 0.64 + 0.10 + 0.10 + 3.13 + 3.13 = 7.86
7.86 is the value of the chi-square statistic ($\chi^2$) for the original table (cross-tabulation).
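The hand calculation on the preceding slides can be sketched in a few lines of Python. This is an illustration only, not part of the original slides: the variable names are made up for the example, and expected frequencies are computed as row total × column total / N, which reproduces the 50/50 split used above. Because the sketch does not round each cell's contribution before summing, it gives roughly 7.85 rather than the hand-worked 7.86.

```python
# Observed frequencies from the DEGREE SUBJECT by GENDER table: (male, female).
observed = {
    "Arts": (16, 18),
    "Sciences": (29, 21),
    "Social Sciences": (42, 38),
    "Education": (3, 13),
}

n = sum(sum(row) for row in observed.values())
col_totals = [sum(row[j] for row in observed.values()) for j in range(2)]

chi_square = 0.0
for subject, row in observed.items():
    row_total = sum(row)
    for j, obs in enumerate(row):
        # Expected count if subject and gender were unrelated.
        expected = row_total * col_totals[j] / n
        # Squared difference divided by the expected count.
        contribution = (obs - expected) ** 2 / expected
        chi_square += contribution
        print(f"{subject:16s} column {j}: O={obs:3d}  E={expected:5.1f}  "
              f"(O-E)^2/E={contribution:.3f}")

print(f"chi-square = {chi_square:.2f}")
```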

Degrees of freedom If a cross-tabulation has R rows and C columns, the corresponding chi-square statistic has (R-1) x (C-1) ‘degrees of freedom’ i.e. in this case it has (4 - 1) x (2 - 1) = 3 degrees of freedom. Degrees of freedom can be thought of as sources of variation. In an independent samples t-test, the number of sources of variation depends on the numbers of cases in the samples being compared. In a chi-square test, however, it depends on the number of cells in the cross-tabulation being examined.

Chi-square distributions for 1 to 5 degrees of freedom Here, chi-square values of more than 6 are relatively rare. However, higher values of chi-square are more common when the number of degrees of freedom (k in this chart) is larger. Chi-square tables reflect this: it can be seen in these that the values for p=0.05 (the point at which only 5% of cases lie to the right) increase as the degrees of freedom increase.

Checking the chi-square statistic • To check whether a result is statistically significant (i.e. unlikely to simply reflect sampling error) we look up critical values of chi-square in a table. • For 3 d.f. the critical values are 7.81 (p=0.05) and 11.34 (p=0.01). • Because our chi-square value (7.86) is bigger than 7.81 we can say that it is significant at p<0.05. (Since it is not bigger than 11.34 we cannot say that it is significant at p<0.01). • What this means is that we would find a difference in our sample of at least the magnitude of the difference that we observed in the distributions of proportions for men and for women between 1% and 5% of the time if there were no real difference between the distributions of proportions in the population (which is the null hypothesis). • This is rare enough that we consider the null hypothesis to be unlikely. We can therefore reject the null hypothesis and accept the alternative hypothesis of a relationship between gender and subject. • When you use SPSS, it produces an estimate of the precise probability of obtaining a chi-square statistic at least as big as (in this case) 7.86 by chance. A p-value of less than 0.05 is said to be statistically significant (and to imply a significant relationship between the two variables).
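As an alternative to looking up critical values in a table, the tail probability can be computed directly. The slides do not show how SPSS does this; the function below is a standard-library sketch (its name is made up for this example) that uses closed-form expressions for the chi-square upper tail when the degrees of freedom are a positive whole number.

```python
import math


def chi_square_p_value(chi_square, df):
    """Upper-tail probability P(X >= chi_square) for a chi-square
    distribution with a positive integer number of degrees of freedom."""
    if df < 1 or chi_square < 0:
        raise ValueError("df must be >= 1 and chi_square must be >= 0")
    half_x = chi_square / 2.0
    if df % 2 == 0:
        # Even d.f.: exp(-x/2) * sum over i < df/2 of (x/2)^i / i!
        total = sum(half_x ** i / math.factorial(i) for i in range(df // 2))
        return math.exp(-half_x) * total
    # Odd d.f.: start from the 1 d.f. case, then add one term per extra 2 d.f.
    p = math.erfc(math.sqrt(half_x))
    for i in range(1, (df - 1) // 2 + 1):
        p += math.exp(-half_x) * half_x ** (i - 0.5) / math.gamma(i + 0.5)
    return p


# The 'Black' graduates example: chi-square = 7.86 with 3 d.f.
p = chi_square_p_value(7.86, 3)
print(f"p = {p:.3f}")
```

For the example above the result should fall a little under 0.05, consistent with the table look-up (bigger than 7.81 but smaller than 11.34).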

A note on using chi-square • Chi-square tests can be safely used if each cell in the table has an expected value of at least 5. Opinions differ as to what is appropriate where this is not the case: one ‘rule of thumb’ is that no expected values should be less than 1 and no more than 20% of expected values should be less than 5. • The categories must be discrete, i.e. mutually exclusive. No case should fall into more than one category (which might pose some difficulties when carrying out analyses focusing on degree subjects!)
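The rule of thumb above can be checked mechanically. As a rough sketch (variable names are illustrative), the code below computes the expected values for the MARITAL STATUS by ORIENTATION TO PREGNANCY table from Phoenix (1991) and counts how many fall below 5.

```python
# Observed counts: columns are Wanted to conceive, Did not mind,
# Had not thought about it, Important not to.
observed = [
    [4, 12, 13, 24],  # Single
    [4, 2, 1, 2],     # Cohabiting
    [9, 6, 0, 2],     # Married
]

row_totals = [sum(row) for row in observed]
col_totals = [sum(col) for col in zip(*observed)]
n = sum(row_totals)

# Expected count for each cell: row total * column total / N.
expected = [[r * c / n for c in col_totals] for r in row_totals]
cells = [e for row in expected for e in row]

n_below_5 = sum(e < 5 for e in cells)
share_below_5 = n_below_5 / len(cells)

# Rule of thumb: no expected value below 1, and at most 20% below 5.
rule_ok = min(cells) >= 1 and share_below_5 <= 0.20

print(f"smallest expected value: {min(cells):.2f}")
print(f"expected values below 5: {n_below_5} of {len(cells)}")
print("rule of thumb satisfied:", rule_ok)
```

Every cell in the Cohabiting row, and most of the Married row, has an expected value below 5, which is why this table counts as ‘sparse’.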

Chi-square statistics
MARITAL STATUS (AT CONCEPTION) by ORIENTATION TO PREGNANCY: $\chi^2_6 = 24.86$ (p < 0.001) (but the sparseness of cases means that the chi-square statistic is invalid!)
OCCUPATIONAL CLASS by DISPOSAL CHOICE: $\chi^2_1 = 5.58$ (p < 0.05) (but the chi-square statistic for a 2x2 cross-tabulation needs, arguably, to be adjusted using ‘Yates’ correction for continuity’, giving an adjusted value of 4.29 (p < 0.05))
GENDER by SUBJECT, OTHER GRADUATES: $\chi^2_3 = 1542$ (p < 0.0001)
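The effect of Yates’ correction on the OCCUPATIONAL CLASS by DISPOSAL CHOICE table can be reproduced with a short sketch. The function name is made up for this example; the correction is implemented in the usual way, by reducing each absolute difference between observed and expected by 0.5 before squaring.

```python
def chi_square_2x2(a, b, c, d, yates=False):
    """Chi-square for the 2x2 table [[a, b], [c, d]], optionally with
    Yates' continuity correction (each |O - E| reduced by 0.5)."""
    n = a + b + c + d
    row_totals = (a + b, c + d)
    col_totals = (a + c, b + d)
    observed = ((a, b), (c, d))
    total = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            diff = abs(observed[i][j] - expected)
            if yates:
                diff = max(diff - 0.5, 0.0)
            total += diff ** 2 / expected
    return total


# Working class: 20 cremation, 14 burial; middle class: 21 cremation, 3 burial.
plain = chi_square_2x2(20, 14, 21, 3)
corrected = chi_square_2x2(20, 14, 21, 3, yates=True)
print(f"uncorrected = {plain:.2f}, Yates-corrected = {corrected:.2f}")
```

Both values exceed 3.84, the 0.05 critical value for 1 d.f., so the conclusion is the same with or without the correction.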

Marital Status (at Conception) by Orientation to Pregnancy: Collapsed/reduced versions of cross-tabulation
                 Wanted to    Did not     Had not thought about it
                 conceive     mind        + Important not to
Single           4 (8%)       12 (23%)    37 (70%)
Cohab./Married   13 (50%)     8 (31%)     5 (19%)
$\chi^2_2 = 23.46$ (p < 0.0001)

             Wanted to    Did not     Had not thought   Important
             conceive     mind        about it          not to
Cohabiting   4 (44%)      2 (22%)     1 (11%)           2 (22%)
Married      9 (53%)      6 (35%)     0 (0%)            2 (12%)
$\chi^2_3 = 2.72$ (p > 0.05) [N.B. ‘Sparse’ table: chi-square invalid]

Strength of association • Chi-square tells us whether there is a ‘significant’ relationship between two variables (or whether a relationship exists that would have been unlikely to have been found by chance). • However it does not tell us in a clear-cut way how strong this association is, since the size of the chi-square statistic depends in part on the sample size (as well as the cross-tabulation shape) • We will therefore look at two different measures that do tell us about the strength of association: • Cramér’s V. This is a chi-square-based measure that tells us the strength of association in a cross-tabulation. No association is represented by 0 and ‘perfect’ association by 1. • Odds ratios. These tell us the relative odds of an event occurring for different categories or groups of cases (or people). As such they are a quite easy-to-understand way of discussing the strength of association in cross-tabulations.

Example: Sex, Age and Sport. Data from Young People’s Social Attitudes Study 2003 (available from Nesstar!!) [Tables for boys and for girls: not reproduced]

What can we say about these tables? • It looks like boys play sport as part of a club more than girls do. • And it looks like both boys and girls become less likely to play sport as part of a club as they get older. Questions: • Is there a significant relationship between age and sports club membership for boys?/for girls? • Is the association between age and sports club membership stronger/weaker for boys than it is for girls?

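1. Does age significantly affect sports club membership for boys? To answer this question we work out chi-square, by calculating:
$$\begin{aligned}
\chi^2 &= \sum_{i}\sum_{j} \frac{(\text{Observed}_{ij} - \text{Expected}_{ij})^2}{\text{Expected}_{ij}} \\
&= \frac{\left(62 - \frac{95 \times 164}{303}\right)^2}{\frac{95 \times 164}{303}} + \frac{\left(33 - \frac{95 \times 139}{303}\right)^2}{\frac{95 \times 139}{303}} + \frac{\left(68 - \frac{125 \times 164}{303}\right)^2}{\frac{125 \times 164}{303}} \\
&\quad + \frac{\left(57 - \frac{125 \times 139}{303}\right)^2}{\frac{125 \times 139}{303}} + \frac{\left(34 - \frac{83 \times 164}{303}\right)^2}{\frac{83 \times 164}{303}} + \frac{\left(49 - \frac{83 \times 139}{303}\right)^2}{\frac{83 \times 139}{303}} \\
&= \frac{(62-51.4)^2}{51.4} + \frac{(33-43.6)^2}{43.6} + \frac{(68-67.7)^2}{67.7} + \frac{(57-57.3)^2}{57.3} + \frac{(34-44.9)^2}{44.9} + \frac{(49-38.1)^2}{38.1} \\
&= 2.2 + 2.6 + 0 + 0 + 2.6 + 3.1 = 10.5
\end{aligned}$$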

1. Does age significantly affect sports club membership for boys? • Chi-square = 10.5. • d.f. = (r-1) x (c-1) = 1 x 2 = 2 • If we look up the .05 value for 2 degrees of freedom it is 5.99, and the .01 value is 9.21. Since 10.5 is bigger than both of these, it is significant at p < 0.01. • The following SPSS output confirms that, in fact, the p-value is .005, which is less than 0.01 (N.B. the chi-square value without rounding is 10.541).
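The calculation for boys can be reproduced with a small general-purpose function that derives the expected counts from the row and column totals. This is a sketch for illustration (the function name is made up here). The age-group labels are not given in the surviving slide text, so the rows are simply listed in the order used in the calculation above.

```python
def chi_square_test(table):
    """Return (chi-square, degrees of freedom) for a cross-tabulation
    given as a list of rows of observed counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi_square = sum(
        (obs - r * c / n) ** 2 / (r * c / n)
        for row, r in zip(table, row_totals)
        for obs, c in zip(row, col_totals)
    )
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return chi_square, df


# Boys: one row per age group; columns are (plays sport in a club, does not).
boys = [[62, 33], [68, 57], [34, 49]]
chi_square, df = chi_square_test(boys)

print(f"chi-square = {chi_square:.3f}, d.f. = {df}")
# Critical values for 2 d.f. as quoted on the slide.
print("significant at 0.05:", chi_square > 5.99)
print("significant at 0.01:", chi_square > 9.21)
```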

And for girls…? • Work out the chi-square statistic and test whether it is significant for girls…

And for girls…? • The SPSS output for the chi-square test for girls shows a chi-square value of 23.394. • The p-value for this chi-square statistic, with 2 degrees of freedom, is rounded to 0.000 (SPSS only shows you results to 3 decimal places). This means that it is less than 0.001 (or, more precisely, that it is less than 0.0005). • Therefore “Age has a significant effect on whether or not girls play sport in clubs (p < 0.001)”.

2. Is the effect of age stronger/weaker for boys than it is for girls? • The chi-square statistic is bigger for girls than for boys. However, the sample of girls is also bigger (360 as compared to 303), so this will have affected the relative size of the two values. • To work out the strength of association, we need to correct for both sample size and for the table shape (since this also affects the magnitude of chi-square statistics). A frequently-used measure of association is Cramér’s V:
$$V = \sqrt{\frac{\chi^2}{N(L-1)}}$$
where $\chi^2$ is chi-square, N is the sample size, and L is the lesser (smaller) of the number of rows and number of columns. Note: In any table where either the number of rows or the number of columns is equal to 2, Cramér’s V is equal to another measure of association, referred to as phi (or Φ).

Comparing strength of association between age and involvement in sport for boys and for girls
Boys: χ2 = 10.541 Girls: χ2 = 23.394
Cramér’s V values for the two tables are therefore:
Boys: $V = \sqrt{\frac{10.541}{303(2-1)}}$ Girls: $V = \sqrt{\frac{23.394}{360(2-1)}}$

Cramér’s V in SPSS • We can also see Cramér’s V in SPSS output • The value for boys is above and that for girls is below – note that the values of Cramér’s V are the same as those we worked out (0.186 and 0.258), with small differences due to rounding error.

What do the results mean substantively? • We can say that age has a significant effect on boys’ participation in sport. • And that age has a significant and somewhat stronger effect on girls’ participation in sport. • Hence both girls and boys are likely to decrease their participation in sports clubs as they get older but this effect is more pronounced among girls than among boys. • There is thus a (small-ish) gender difference in the relationship between age and participation in sport.

Cramér’s V values
Subject and Gender: ‘Black’ graduates: $V = \sqrt{\frac{7.86}{180(2-1)}} = 0.209$
Subject and Gender: Other graduates: $V = \sqrt{\frac{1542}{17094(2-1)}} = 0.300$
But could this difference just reflect sampling error? Log-linear model: Test for difference between form of Subject/Gender relationship for ‘Black’ graduates and for Other graduates: $\chi^2_3 = 3.67$ (p > 0.05), i.e. not enough evidence to conclude that subject ‘gendering’ varies between ‘Black’ graduates and Other graduates.
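The two Cramér’s V values for the graduate tables can be checked with a short function. This is a sketch (the function and argument names are made up for this example) of the formula V = √(χ² / (N(L − 1))), where L is the smaller of the number of rows and the number of columns. It assumes both Subject by Gender tables have 4 rows and 2 columns, which matches the 3 degrees of freedom quoted for each.

```python
import math


def cramers_v(chi_square, n, n_rows, n_cols):
    """Cramér's V = sqrt(chi-square / (N * (L - 1))),
    where L is the smaller of the number of rows and columns."""
    smaller_dim = min(n_rows, n_cols)
    return math.sqrt(chi_square / (n * (smaller_dim - 1)))


# Subject (4 categories) by Gender (2 categories).
v_black = cramers_v(7.86, 180, 4, 2)
v_other = cramers_v(1542, 17094, 4, 2)
print(f"'Black' graduates: V = {v_black:.3f}")
print(f"Other graduates:   V = {v_other:.3f}")
```

Although the chi-square statistics differ by a factor of nearly 200, the V values are of a similar order, which illustrates how strongly chi-square depends on sample size.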

A different way of measuring strength of association: Odds ratios The odds ratio is calculated as the odds of an outcome, given membership of group a, divided by the odds of that outcome, given membership of group b. Or, in the sports club example, the odds of playing sport if you’re male (or given membership of the group ‘male’), divided by the odds of playing sport if you’re female (or given membership of the group ‘female’).

Working out the odds The odds of an event occurring can be worked out by the number of times that it occurs divided by the number of times that it does not occur. ODDS OF A MALE PLAYING SPORT IN A CLUB = 164/139 = 1.18 This means that a male is 1.18 times as likely to be a member of a sports club as not to be. ODDS OF A FEMALE PLAYING SPORT IN A CLUB = 114/246 = 0.46 This means that a female is 0.46 times as likely to be a member of a sports club as not to be (or less than 50% as likely). N.B. it is often easier to talk about odds of less than 1 the other way around (i.e. 246/114 = 2.16, hence females are more than two times as likely not to take part in sports clubs as they are to take part in them). The ODDS RATIO is the odds of a male playing sport divided by the odds of a female playing sport: 1.18 / 0.46 = 2.57 (or 2.55 if the unrounded odds are used). Therefore the odds that a male is part of a sports club are over two-and-a-half times as great as the odds of a female being part of a sports club.

Odds ratios • When an odds ratio is equal to 1 the two groups are identical (i.e. the odds of the given outcome are the same for each group). • When odds ratios get close to 0 or to infinity the groups are very different (i.e. the odds of the given outcome are very high (or close to certain) for one group and very low (or zero) for the other). • We will revisit odds ratios when we look at logistic regression and log-linear analysis, since both techniques use them. • Note: where the independent variable is in the columns, a mathematical description of the odds ratio is: (a/c) / (b/d), which is the same as: (a*d) / (b*c) • Where the independent variable is in the rows it’s (a/b) / (c/d)
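The odds and odds ratio for sports club membership can be computed directly from the four counts. The sketch below (with illustrative names) also shows that dividing one odds by the other gives the same answer as the cross-product form (a*d) / (b*c). Note that using the unrounded odds gives about 2.55, slightly lower than the 2.57 obtained from the rounded figures 1.18 and 0.46.

```python
def odds(events, non_events):
    """Odds = times the outcome occurs / times it does not occur."""
    return events / non_events


# Sports club membership: (members, non-members).
male_members, male_non_members = 164, 139
female_members, female_non_members = 114, 246

male_odds = odds(male_members, male_non_members)
female_odds = odds(female_members, female_non_members)

# Ratio of the two odds, and the equivalent cross-product form.
odds_ratio = male_odds / female_odds
cross_product = (male_members * female_non_members) / (
    male_non_members * female_members
)

print(f"male odds = {male_odds:.2f}, female odds = {female_odds:.2f}")
print(f"odds ratio = {odds_ratio:.2f}")
```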

Choosing statistical tests Most of the statistical tests/procedures that we have mentioned are only suitable for application in particular situations. You can find this discussed in full by Buckingham and Saunders (2004) ‘Appendix E: Choosing Statistical Tests’ (available online). In order to determine what sort of test to use you need to ask yourself the following questions: • Are the relevant data univariate, bivariate or multivariate? • What type of variable is your ‘focal’ (dependent) variable? • What type(s) of variable is/are your other (i.e. independent) variable(s)? • What do you want to know? (Do you want to infer things about the population? Make a causal argument? Investigate interactions?) Note: Since you can always simplify the level of measurement of a variable (e.g. from interval-ratio to categorical) by recoding, albeit at the cost of losing some of the information contained in the data, you can in theory look at any question via a cross-tabular analysis and carry out any bivariate analysis using a chi-square test.

Multivariate analysis • So far we have tended to concentrate on two-way relationships (e.g. between gender and participation in sports). But we have started to look at three-way relationships. • Social relationships and phenomena are usually more complex than is allowed for in a bivariate analysis. • Multivariate analyses are thus commonly used as a reflection of this complexity. • In the remaining weeks of the term we will look at linear regression, logistic regression and (hierarchical) log-linear models, all types of multivariate analysis. • However, this week we will look briefly at the rationale for multivariate analysis and have a think about cross-tabular techniques for conducting this form of analysis.

Multivariate analysis de Vaus (1996: 198) suggests that we can use multivariate analysis to elaborate bivariate relationships, in order to answer the following questions: • Why does the relationship [between two variables] exist? What are the mechanisms and processes by which one variable is linked to another? • What is the nature of the relationship? Is it causal or non-causal? • How general is the relationship? Does it hold for people in general, or is it specific to certain subgroups? • This is because multivariate analysis enables the identification of: • Spurious relationships • Intervening variables • The replication of relationships • The specification of relationships

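Spurious relationships • A spurious relationship exists where two variables are not causally related but a relationship between them is generated by their relationships with a third variable. • For example: Age affects both Reading ability and Height, which produces a spurious relationship between Reading ability and Height. [Diagram: Age → Reading ability; Age → Height; Reading ability and Height linked by a spurious relationship]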

Intervening variables [Diagram: Ethnicity → Education → Unemployment] • Sometimes, although there is a real (non-spurious) relationship between two variables, we want to establish why that relationship exists. • For example, if we discover that there is a relationship between risk of unemployment and ethnicity, we want to know why that is the case. One possibility is that some ethnic groups have lower educational levels and that this has implications for their ability to get work. In this case education would be an intervening variable. • Intervening variables enable us to answer questions about the bivariate relationship between two variables – suggesting that (in this case) the relationship between ethnicity and unemployment is not direct but (at least in part) occurs via educational levels.

Is it spurious or intervening? When we do statistical tests we will find similar results for a spurious relationship and an intervening variable: In both cases the effect of the independent variable on the dependent variable will be reduced (or removed) once we control for the third variable. So how do we know whether this third variable provides evidence of a spurious relationship or is an intervening variable? • There is no hard-and-fast statistical rule for deciding this. • But if we are suggesting that a variable is intervening, the logic of the process must make sense – i.e. you must have a cogent theoretical reason for thinking that your independent variable affects the intervening variable which in turn affects the dependent variable. • This kind of causal process is easiest to argue for when the timing supports it, i.e. when the intervening variable can be seen to occur in between the independent and dependent variables (e.g. education in the earlier example of the relationship between ethnicity and unemployment).

Replication • Sometimes when we have found a basic (‘zero-order’) relationship between two variables (e.g. ethnicity and unemployment), we want to demonstrate that this relationship exists within different subgroups of the population (e.g. for both men and women; for those of different ages…). • Where the relationship IS replicated we can rule out the possibility that it is produced by the variable in question, either as an intervening variable or in a spurious way.

Specification • Sometimes a particular variable only has an effect in specific situations. The variable that determines these situations is said to ‘interact’ with the independent variable. • For example, de Vaus’s book suggests that going to a religious school makes boys more religious but has little or no effect on girls. • In this case type of school interacts with gender: religious education only affects students’ religiosity in combination with being male.

Specification (interactions) Graphical representation of the relationship between religious education and religiousness, controlling for sex. [Two line charts, each plotting Religiousness (low to high) against ‘How religious was your education?’ (Not at all to Very), with separate lines for boys and girls. Panel titles: ‘Interaction between sex and religiousness of school’ and ‘No interaction’.]

Using Cramér’s V to classify a multivariate situation If we use SPSS to produce a cross-tabulation of two variables, then we can elaborate this relationship by introducing a third variable as a layer variable. Examining the Cramér’s V values for the original cross-tabulation and for the layers of the elaborated cross-tabulation tells us what kind of situation we are looking at: • If the Cramér’s V values for the layers are all similar, then we have a situation of replication. • If the Cramér’s V values are smaller for the layered cross-tabulation than the value for the original cross-tabulation, then we either have a situation where the third variable is acting as an intervening variable, or one where it is inducing a spurious relationship between the original two variables. Deciding between these two options involves reflecting on whether the third variable makes sense conceptually as part of some causal mechanism linking the original two variables.

Using Cramér’s V to classify a multivariate situation (continued) • If the Cramér’s V values for the layered cross-tabulation vary in size, perhaps with some being smaller than the original value and some being as large or larger than it, then the situation is one of specification. • However, if one or more of the Cramér’s V values is larger than the original value, then a failure to take account of the third variable in the first instance may also have been suppressing an underlying relationship between the two variables. • (This latter situation is a variation on the theme of spuriousness: in this case, the absence of a bivariate relationship is spurious rather than the presence of one!)
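The decision rules on these two slides can be sketched as a small function. This is purely illustrative: the function name, the `tolerance` cut-off for what counts as ‘similar’, and the example V values are all assumptions made for this sketch, not values from the slides. The rules themselves do not specify how close is close enough, so in practice this remains a matter of judgement.

```python
def classify_elaboration(original_v, layer_vs, tolerance=0.05):
    """Label an elaborated cross-tabulation from its Cramér's V values.

    `tolerance` is an arbitrary illustrative cut-off for 'similar';
    it is not a statistical test.
    """
    labels = []
    if max(layer_vs) - min(layer_vs) > tolerance:
        # Layer values differ from one another.
        labels.append("specification (interaction)")
    elif all(abs(v - original_v) <= tolerance for v in layer_vs):
        # Layer values all resemble the original value.
        labels.append("replication")
    elif all(v < original_v for v in layer_vs):
        # Layer values are all clearly smaller than the original value.
        labels.append("intervening variable or spurious relationship")
    if any(v > original_v + tolerance for v in layer_vs):
        # At least one layer shows a clearly stronger association.
        labels.append("possible suppression of an underlying relationship")
    return labels


# Hypothetical V values, for illustration only.
print(classify_elaboration(0.30, [0.29, 0.31]))
print(classify_elaboration(0.30, [0.05, 0.08]))
print(classify_elaboration(0.20, [0.05, 0.40]))
```

Choosing between ‘intervening’ and ‘spurious’ still requires the conceptual reasoning described above; the code cannot make that distinction.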

More generally… • Multivariate analyses can utilise a variety of techniques (depending on the form of the data, research questions to be addressed, etc. – we will be looking at multiple (linear) regression, logistic regression and log-linear models), in order to determine whether the relationship between two variables persists or is altered when we ‘control for’ a third (or fourth, or fifth...) variable. • Multivariate analysis can also enable us to establish which variable(s) has/have the greatest impact on a dependent variable – e.g. Is sex more important than ‘race’ in determining income? • It is often important for a multivariate analysis to check for interactions between the effects of independent variables, as discussed earlier under the heading of specification.
