Categorical Data. Frühling Rijsdijk 1 & Caroline van Baal 2 1 IoP, London 2 Vrije Universiteit, A’dam Twin Workshop, Boulder Tuesday March 2, 2004. Aims. Introduce Categorical Data Define liability and describe assumptions of the liability model
Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.
Frühling Rijsdijk1 & Caroline van Baal2
1IoP, London 2Vrije Universiteit, A’dam
Twin Workshop, Boulder
Tuesday March 2, 2004
Measuring instrument is able to only discriminate between two or a few ordered categories : e.g. absence or presence of a disease
Data therefore take the form of counts, i.e. the number of individuals within each category
Assumptions:
(1) Underlying normal distribution of liability
(2) The liability distribution has 1 or more thresholds (cutoffs)
2
3

0
1
1
2
3
The standard Normal distributionLiability is a latentvariable, the scale is arbitrary,
distribution is, therefore, assumed to be a
Standard Normal Distribution (SND) or zdistribution:
z0
3
3
Standard Normal Cumulative Probability in righthand tail
(For negative z values, areas are found by symmetry)
Z0 Area
0 .50 50%
.2 .42 42%
.4 .35 35%
.6 .27 27%
.8 .21 21%
1 .16 16%
1.2 .12 12%
1.4 .08 8%
1.6 .06 6%
1.8 .036 3.6%
2 .023 2.3%
2.2 .014 1.4%
2.4 .008 .8%
2.6 .005 .5%
2.8 .003 .3%
2.9 .002 .2%
3
When we have one variable it is possible to find a zvalue (threshold) on the SND, so that the proportion exactly matches the observed proportion of the sample
1.04
Trait2
Trait1
0
1
Trait1
0
1
00
01
545
(.76)
75
(.11)
0
0
10
11
56
(.08)
40
(.05)
1
1
Two categorical traitsWhen we have two categorical traits, the data are
represented in a Contingency Table, containing
cell counts that can be translated into proportions
0 = absent
1 = present
Twin1
0
1
a
b
00
01
0
c
d
10
11
1
Categorical Data for twins:When the measured trait is dichotomous i.e. a disorder either present or absent in an unselected sample of twins:
cell a: number of pairs concordant for unaffected
cell d: number of pairs concordant for affected
cell b/c: number of pairs discordant for the disorder
0 = unaffected
1 = affected
Bivariate Normal (R=0.6) partitioned at threshold 1.4 (zvalue) on both liabilities
Twin1
0
1
a
b
00
01
0
c
d
10
11
1
Correlated dimensions:
The correlation (shape) and the two thresholds determine the relative proportions of observations in the 4 cells of the CT.
Conversely, the sample proportions in the 4 cells can be used to estimate the correlation and the thresholds.
c
c
d
d
a
b
b
a
It is possible to estimate a correlation
between categorical traits from simple
counts (CT), because of the assumptions
we make about their joint distributions
nij is the observed frequency in cell ij
pij is the expected proportion in cell ij
Are calculated by numerical integration of the
bivariate normal over two dimensions:
the liabilities for twin1 and twin2
e.g. the probability that both twins are affected :
Φ is the bivariate normal probability density function, L1and L2 are the liabilities of twin1 and twin2, with means 0, and is the correlation matrix of the two liabilities.
B
a
L2
B
L1
For example: for a correlation of .9 and thresholds (zvalues)
of 1, the probability that both twins are above threshold
(proportion d) is around .12
The probability that both twins are are below threshold
(proportion a) is given by another integral function with
reversed boundaries :
and is around .80 in this example
The model’s failure to predict the observed
data i.e. a bad fitting model,is reflected
in a significant χ²
Expected proportion are calculated by
numerical integration of the MN normal
over n dimensions. In this example it will
be two, the liabilities for twin1 and twin2
is the MN pdf, which is a function of
, the correlation matrix of the variables
By maximizing the likelihood of the data
under a MN distribution, the ML estimate of the
correlation matrix and the thresholds are obtained
Simulated data for 625 MZ and 625 DZ pairs
(h2 = .40 c2 = .20 e2 = .40 > rmz=.60 rdz=.40)
Dichotomized 0 = bottom 88%, 1 = top 12%
This corresponds to threshold (zvalue) of 1.18
Observed counts:
MZDZ
0 1 0 1
0 508 48 0 497 59
1 35 34 1 49 20
Raw ORD File: bin.dat
Scripts: tetracor.mx and ACEbin.mx
Same simulated data
Categorized 0 = bottom 22%, 1 = mid 66%,
2 = top 12%
This corresponds to thresholds (zvalues) of
0.75 1.18
Observed counts:
MZDZ
0 1 2 0 1 2
0 80 58 1 0 63 74 2
1 68 302 47 1 71 289 57
2 1 34 34 2 4 45 20
Raw ORD File: cat.dat
Adjust the correlation and ACE script
2 Categories
Threshold Matrix : 1 x 2
T(1,1) T(1,2) threshold twin1 & twin2
3 Categories
Threshold Matrix : 2 x 2
T(1,1) T(1,2) threshold 1 for twin1 & twin2
T(2,1) T(2,2) threshold 2 for twin1 & twin2
3 Categories
nthresh=2 nvar=2
Matrix T: nthresh nvar (2 x 2)
T(1,1) T(1,2) threshold 1 for twin1 & twin2
T(2,1) T(2,2) increment
L LOW nthresh nthresh
Value 1 T 1 1 to T nthresh nthresh
Threshold Model L*T /
1 0
1 1
t11 t12
t21 t22
t11 t12
t11 + t21 t12 + t22
=
*
The 1250 lines data file (bin.dat) can be summarized like this
G1: Data and model for MZ correlation
DAta NGroups=2 NInput_vars=4
Missing=.
Ordinal File=binF.dat
Labels
zyg bin1 bin2 freq
SELECT IF zyg = 1
SELECT bin1 bin2 freq /
DEFINITION freq /
Begin Matrices;
R STAN 2 2 FREE !Correlation matrix
T FULL nthresh nvar FREE !thresh tw1, thresh tw2
L LOW nthresh nthresh
F FULL 1 1
End matrices;
Value 1 L 1 1 to L nthresh nthresh ! initialize L
COV R / !Predicted Correlation matrix for MZ pairs
Thresholds L*T / !to ensure t1>t2>t3 etc.......
FREQ F /
Example: 2 Variables measured in twins:
For rare disorders (e.g. Schizophrenia), selecting a random sample of twins will lead to the vast majority of pairs being unaffected.
A more efficient design is to ascertain twin pairs through a register of affected individuals.
When an affected twin (the proband) is identified, the cotwin is followed up to see if he or she is also affected.
There are several types of ascertainment
Omission of certain classes from observation leads to an increase
of the likelihood of observing the remaining individuals
Mx corrects for incomplete ascertainment by dividing the likelihood
by the proportion of the population remaining after ascertainment
CT from ascertained data can be analysed in Mx
by simply substituting a –1 for the missing cells
CTable 2 2
1 11
1 13
For a 2 x 2 CT:
3 observed statistics, 3 parameters
(1 correlation, 2 threshold) df=0
any pattern of observed frequencies
can be accounted for, no goodness of fit
of the normal distribution assumption.
This problem is solved when we have a CT
which is at least 3 x 2: df>0
A significant 2 reflects departure from
normality.
Power to detect certain effects increases
with increasing number of categories >
continuous data most powerful
For raw ordinal data analyses, the first category
must be coded 0!
Threshold specification when analyzing CT are
different