Understanding Contingency Tables: A Practical Guide to Analyzing Dependence of Variables

Contingency (frequency) tables: dependence of two qualitative variables

Examples of problems
• Does the survival of a person sent to a cholera-affected area depend on whether the person has been vaccinated against cholera?
• Is there any connection between hair colour and sex?
• Are parasite species distributed independently of each other?

Contingency table

[Example tables: dependence of survival on vaccination; mutual dependence of two species]

Relationship between two categorical variables in a table
• the case when one of the variables is manipulated
• the case when one of the variables is probably a cause and the other a consequence (response), but the study is based on non-manipulative observations
• and finally, the case when the possible causality is unclear

Basic rules from probability theory
• The probability of the joint occurrence of two independent events is $P_{ij} = P_i \cdot P_j$
• Example: In a population, half of the members are male ($P_{\text{male}} = 0.5$) and a tenth of all individuals are albino ($P_{\text{albino}} = 0.1$). If albinos are equally common in both sexes (i.e. albinism and sex are independent events), then the probability that a randomly chosen individual is an albino male is $P_{\text{male}} \cdot P_{\text{albino}} = 0.5 \cdot 0.1 = 0.05$

Basic rules from probability theory
• The expected number of successes $E(a)$ in $n$ experiments, where the probability of a success is $P_a$, is $E(a) = P_a \cdot n$
• Example: The probability that a mutation occurs is 0.02, so among 100 randomly chosen individuals we expect 2 individuals with this mutation
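The two rules can be checked with a few lines of arithmetic. This is only a sketch; the variable names are chosen here for illustration and the numbers are the ones from the examples above.

```python
# Joint probability of two independent events: P_ij = P_i * P_j
p_male = 0.5
p_albino = 0.1
p_albino_male = p_male * p_albino  # probability of an albino male

# Expected number of successes: E(a) = P_a * n
p_mutation = 0.02
n_individuals = 100
expected_mutants = p_mutation * n_individuals

print(p_albino_male, expected_mutants)
```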

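How do we compute $\chi^2$?
$\chi^2 = \sum_{i,j} \frac{(f_{ij} - E(f_{ij}))^2}{E(f_{ij})}$, where $f_{ij}$ are the observed and $E(f_{ij})$ the expected frequencies.
How do we obtain the expected values? $H_0$ says that the events are independent, so we use the probability of the joint occurrence of two independent events.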

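Calculation of expected values
With the help of the marginal sums (row sums $R_i$, column sums $C_j$, total $n$):
$P_{i\cdot} = R_i / n$, $P_{\cdot j} = C_j / n$, $P_{ij} = P_{i\cdot} \cdot P_{\cdot j}$
$E(f_{ij}) = P_{ij} \cdot n = (R_i / n) \cdot (C_j / n) \cdot n = R_i \cdot C_j / n$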
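A minimal sketch of this calculation, assuming the table is given as a list of rows. The function names and the 2 x 2 counts are hypothetical, not data from the lecture. The statistic computed at the end is the usual Pearson form, the sum of (observed - expected)^2 / expected over all cells.

```python
def expected_counts(table):
    """Expected counts under independence: E(f_ij) = R_i * C_j / n."""
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    n = sum(row_sums)
    return [[r * c / n for c in col_sums] for r in row_sums]


def pearson_chi2(table):
    """Pearson chi-square: sum over cells of (observed - expected)^2 / expected."""
    expected = expected_counts(table)
    return sum(
        (obs - exp) ** 2 / exp
        for obs_row, exp_row in zip(table, expected)
        for obs, exp in zip(obs_row, exp_row)
    )


# Hypothetical counts: rows = vaccinated / not vaccinated, columns = survived / died
observed = [[30, 10], [20, 20]]
expected = expected_counts(observed)
chi2 = pearson_chi2(observed)
print(expected, chi2)
```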

What do I need to know in order to know the result of the complete experiment (given fixed marginal frequencies)?
$df = (c - 1) \cdot (r - 1)$, where $r$ is the number of rows and $c$ is the number of columns.
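For a 2 x 2 table this gives df = 1: with the marginal sums fixed, a single cell determines the other three. The sketch below illustrates that; the helper names and the counts are hypothetical.

```python
def degrees_of_freedom(n_rows, n_cols):
    """df = (c - 1) * (r - 1) for an r x c contingency table."""
    return (n_cols - 1) * (n_rows - 1)


def complete_2x2(a, row_sums, col_sums):
    """Fill in a 2 x 2 table from its top-left cell and the fixed marginal sums."""
    r1, r2 = row_sums
    c1, _c2 = col_sums
    b = r1 - a  # rest of the first row
    c = c1 - a  # rest of the first column
    d = r2 - c  # rest of the second row
    return [[a, b], [c, d]]


table = complete_2x2(30, row_sums=(40, 40), col_sums=(50, 30))
print(degrees_of_freedom(2, 2), table)
```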

Critical value at the 5% level of significance for df = 3.

What we usually write in our paper
The area under the $\chi^2$ density to the right of the observed value is 0.029, so we write: $\chi^2 = 8.99$, df = 3, P = 0.029
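The reported P value can be reproduced without a statistics package. The following is a sketch under the assumption that df is a positive integer; it uses the closed-form upper-tail expressions for the chi-square distribution. The function name `chi2_sf` is chosen here for illustration.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability P(X >= x) of a chi-square variable with integer df >= 1."""
    half = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) times a finite sum of (x/2)^k / k!
        terms = sum(half ** k / math.factorial(k) for k in range(df // 2))
        return math.exp(-half) * terms
    # Odd df: start from the df = 1 tail and add the remaining terms
    tail = math.erfc(math.sqrt(half))
    extra = sum(
        half ** (k - 0.5) / math.gamma(k + 0.5)
        for k in range(1, (df - 1) // 2 + 1)
    )
    return tail + math.exp(-half) * extra


p_value = chi2_sf(8.99, 3)  # about 0.029, the value quoted above
print(round(p_value, 3))
```

With the same function, `chi2_sf(7.815, 3)` comes out very close to 0.05, which is where the 5% critical value for df = 3 lies.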

Even here, Yates' correction is sometimes used (when expected frequencies are extremely low): it gives better protection against Type I error, but a weaker test.
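A sketch of the corrected statistic for a 2 x 2 table, assuming the common form in which 0.5 is subtracted from each absolute difference between observed and expected frequency before squaring. The counts are hypothetical.

```python
def yates_chi2(table):
    """Chi-square with Yates' continuity correction for a 2 x 2 table."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row_sums = (a + b, c + d)
    col_sums = (a + c, b + d)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            exp = row_sums[i] * col_sums[j] / n
            # Shrink each deviation by 0.5 before squaring
            chi2 += (abs(obs - exp) - 0.5) ** 2 / exp
    return chi2


corrected = yates_chi2([[30, 10], [20, 20]])  # hypothetical counts
print(corrected)
```

For these counts the uncorrected statistic is 16/3 (about 5.33), so the correction lowers the test statistic, in line with the "weaker test" remark.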

Another test criterion, which also follows the $\chi^2$ distribution: the so-called likelihood-ratio $\chi^2$ (LR)
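A sketch of the likelihood-ratio statistic, assuming the usual definition: 2 times the sum of observed * ln(observed / expected) over all cells, with empty cells contributing zero. The function name and counts are hypothetical.

```python
import math


def lr_chi2(table):
    """Likelihood-ratio chi-square: 2 * sum(obs * ln(obs / exp)) over all cells."""
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    n = sum(row_sums)
    total = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            if obs > 0:  # a cell with zero observations contributes nothing
                exp = row_sums[i] * col_sums[j] / n
                total += obs * math.log(obs / exp)
    return 2.0 * total


lr = lr_chi2([[30, 10], [20, 20]])  # hypothetical counts
print(lr)
```

For a table that matches its expected frequencies exactly, such as `[[25, 15], [25, 15]]`, the statistic is zero.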

Similar results: the "normal" (Pearson) $\chi^2 = 8.99$

2 by 2 tables
Notice that for the table expected under the null hypothesis, $ad = bc$ holds.
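A short derivation of why this holds, assuming the cells are labelled $a = f_{11}$, $b = f_{12}$, $c = f_{21}$, $d = f_{22}$ and take their expected values $R_i C_j / n$:

```latex
\[
a = \frac{R_1 C_1}{n}, \quad
b = \frac{R_1 C_2}{n}, \quad
c = \frac{R_2 C_1}{n}, \quad
d = \frac{R_2 C_2}{n}
\]
\[
ad = \frac{R_1 C_1 \cdot R_2 C_2}{n^2}
   = \frac{R_1 C_2 \cdot R_2 C_1}{n^2}
   = bc
\]
```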

Statistical and causal dependence
• Causal dependence can be proved only by a manipulative experiment. In a "correct" experiment everyone has to be vaccinated, but half of the subjects get only a placebo (compare what is possible with what statistics demands).

Fundamental rules for the experimenter
• Every treatment has to have its control
• The control differs from the treatment only in the effect that I want to demonstrate (this is often very difficult to achieve)
• I have to have independent replications

Advantages of experiments and observational studies
• Causality can be proved by an experiment
• The range of experimental manipulations is usually limited
• Almost every experimental treatment has side effects, which are sometimes unpredictable

Fisher's exact test
How large is the probability that I get this table, or one that deviates even more, for the given marginal frequencies (provided that the null hypothesis is true; computed with the help of combinatorics)? It is used for 2 x 2 tables when the numbers of observations are low.

If I have a 2 x 2 table, then Fisher's test directly computes the probability of this table and of all tables that are (from the point of view of $H_0$) more extreme. The sum of all these probabilities is the achieved level of significance for a one-tailed test (that is why statistical software also prints 2*p).
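The combinatorics can be sketched as follows. The code assumes that "more extreme" means a larger top-left cell (i.e. the observed association is positive) while all marginal sums stay fixed. Function names and counts are hypothetical.

```python
from math import comb


def table_probability(a, b, c, d):
    """Hypergeometric probability of a 2 x 2 table given its marginal sums."""
    n = a + b + c + d
    return comb(a + b, a) * comb(c + d, c) / comb(n, a + c)


def fisher_one_tailed(a, b, c, d):
    """Sum the probabilities of the observed table and of all more extreme
    tables (larger top-left cell) with the same marginal sums."""
    r1, r2, c1 = a + b, c + d, a + c
    p = 0.0
    for x in range(a, min(r1, c1) + 1):
        # With margins fixed, the top-left cell x determines the other three
        p += table_probability(x, r1 - x, c1 - x, r2 - (c1 - x))
    return p


p_one_tailed = fisher_one_tailed(5, 1, 1, 5)  # hypothetical small sample
print(p_one_tailed, 2 * p_one_tailed)
```

For this table only two tables are summed (top-left cell 5 and 6), giving 36/924 + 1/924.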

Let us compare two tables: $\chi^2$ and the power of the test grow with the number of observations, even though both tables are very probably samples from the same population.
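The effect of sample size can be seen with two hypothetical tables that have identical proportions but differ tenfold in the number of observations. This is an illustrative sketch, not the pair of tables shown in the lecture.

```python
def pearson_chi2(table):
    """Pearson chi-square for a contingency table given as a list of rows."""
    row_sums = [sum(row) for row in table]
    col_sums = [sum(col) for col in zip(*table)]
    n = sum(row_sums)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, obs in enumerate(row):
            exp = row_sums[i] * col_sums[j] / n
            chi2 += (obs - exp) ** 2 / exp
    return chi2


small = [[6, 4], [4, 6]]      # 20 observations
large = [[60, 40], [40, 60]]  # same proportions, 200 observations

chi2_small = pearson_chi2(small)
chi2_large = pearson_chi2(large)
print(chi2_small, chi2_large)
```

With the proportions held constant, the statistic scales in direct proportion to the number of observations.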

Measures of association strength in a 2 x 2 table (independent of sample size)
• $Y = ad/bc = f_{11} f_{22} / (f_{21} f_{12})$, the cross-product ratio. Disadvantage: it is asymmetric, taking values from 0 to 1 for negative association, 1 for independence, and from 1 to $+\infty$ for positive association.
• A second measure ranges from -1 through 0 (independence) to +1; the values -1 and +1 mark the maximal possible association for the given values of the marginal frequencies.
• A third measure also ranges from -1 through 0 (independence) to +1; here -1 and +1 mark the maximal possible association for any values of the marginal frequencies.
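A sketch of the first measure, $ad/bc$, showing that it does not change when all counts are multiplied by the same factor. The function name and the counts are hypothetical.

```python
def cross_product_ratio(table):
    """ad / bc for a 2 x 2 table [[a, b], [c, d]].

    Raises ZeroDivisionError when b or c is zero.
    """
    (a, b), (c, d) = table
    return (a * d) / (b * c)


small = [[6, 4], [4, 6]]
large = [[60, 40], [40, 60]]     # same proportions, ten times the sample
independent = [[25, 15], [25, 15]]  # table with ad = bc

ratio_small = cross_product_ratio(small)
ratio_large = cross_product_ratio(large)
ratio_independent = cross_product_ratio(independent)
print(ratio_small, ratio_large, ratio_independent)
```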

Multidimensional frequency tables
[Table: Species A (present / absent) by Species B (present / absent) by years]
Nowadays generalized linear models are used in these cases.
