

Which SNP genotyping errors are most costly and when?

Stephen J. Finch

Stony Brook University


Acknowledgments

  • Joint work

    • Derek Gordon (Rockefeller University)

    • Sun Jung Kang (Duke University)

  • Five papers are the material for this talk with additional coauthors

    • Michael Nothnagel and Jurg Ott in paper 1

    • Mark Levenstien and Jurg Ott in paper 2

    • Abe Brown and Jurg Ott in paper 4


Acknowledgments

  • Colleagues:

    • Nancy Mendell

    • Kenny Ye

  • Stony Brook students (work in progress)

    • Nathan Tintle (repeated sampling)

    • Qing Wang (LRT for mixtures)

    • Kwangmi Ahn, Rose Saint Fleur

    • Undergraduates: Alex Borress, Josh Ren, Jelani Wiltshire


First Paper

  • Gordon, D., Finch, S.J., Nothnagel, M., Ott, J. (2002). Power and sample size calculations for case-control genetic association tests when errors are present: application to single nucleotide polymorphisms. Human Heredity, 54, 22-33.


Second Paper

  • Gordon D., Levenstien M.A., Finch S.J., and Ott J. (2003). Errors and linkage disequilibrium interact multiplicatively when computing sample sizes for genetic case-control association studies. Pacific Symposium on Biocomputing: 490-501.


Third Paper

  • Kang, S.J., Gordon, D., Finch, S.J. (2004). What SNP Genotyping Errors Are Most Costly for Genetic Association Studies. Genetic Epidemiology, 26, 132-141.


Fourth Paper

  • Kang, S.J., Gordon, D., Brown, A.M., Ott, J., Finch, S.J. (2004). Tradeoff between No-Call Reduction in Genotyping Error Rate and Loss of Sample Size for Genetic Case/Control Association Studies. Pacific Symposium on Biocomputing:


Fifth Paper

  • Kang, S.J., Finch, S.J., Gordon, D. (2004). Quantifying the cost of SNP genotyping errors in genetic model based association studies. Human Heredity, In press.


PAWE Web Site

http://linkage.rockefeller.edu/pawe/pawe.cgi


Review Paper

  • Gordon, D., Finch, S.J. (2004). Factors affecting statistical power to detect genetic association. Submitted for publication.


Background

  • Definition of SNPs

  • SNP genotyping measurements

  • Specification of error models

  • Tests of association

  • Two supplementary measurement approaches


Definition of SNP

  • A gene with two possible alleles (here A and B)

    • A is the more common allele in the controls

  • Three possible genotypes

    • AA, index=1 (more common homozygote)

    • AB, index=2 (heterozygote)

    • BB, index=3 (less common homozygote)


Measure of Cost

  • The percentage increase in the minimum sample size necessary to maintain constant Type I and Type II error rates associated with an increase of 1% in a genotyping error rate is our measure of the cost of a genotyping error.

  • %MSSN is our abbreviation for this measure.


SNP Genotyping Measurements

  • Two dye intensities are measured: R and G.

  • Measurements are typically taken at two or three time points.

  • Ratio F=R/(R+G) is used to classify into genotypes.

  • Genotyping error – event in which an observed genotype is different from the true genotype.





Approaches to Replication

  • Sutcliffe studied the reclassification of subjects using the same classification procedure at all remeasurements.

  • Tenenbein studied the reclassification of subjects using a virtually perfect instrument for the second reclassification.


Regenotyping Results

  • There is a common perception that genotyping error is negligible.

  • One test is to regenotype a set of data.

  • COGA provided such data to the last Genetic Analysis Workshop (GAW).

  • Tintle et al. (2004) analyzed it.



Observations on Table

  • Homozygote to homozygote inconsistencies are extremely rare.

  • CIDR “missing rate” is 6.7%.

  • Affymetrix “missing rate” is 6.1%

  • Double missing rate is 1.7%, much higher than the 0.4% expected under independence, suggesting some subjects may be consistently more difficult to genotype.
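The independence benchmark quoted above can be checked with a single multiplication. A minimal sketch using only the rates given on the slide; the variable names are illustrative and not from the talk:

```python
# Missing-call rates quoted on the slide.
cidr_missing = 0.067
affy_missing = 0.061
observed_double_missing = 0.017

# If missingness were independent across the two platforms, the probability
# that both calls are missing is the product of the marginal rates.
expected_double_missing = cidr_missing * affy_missing

# How many times more often a double-missing call occurs than
# independence would predict.
excess_ratio = observed_double_missing / expected_double_missing

print(f"expected under independence: {expected_double_missing:.2%}")
print(f"observed / expected: {excess_ratio:.1f}")
```

The product rounds to the 0.4% stated on the slide, and the observed rate is several times larger.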


Regenotyping Definitions

  • Consistency: Two genotypes on a SNP for a regenotyped subject exist and are the same.

  • Nonreplication: One genotype on a SNP for a regenotyped subject exists, and data is “missing” for the other genotype. Note that we treat two missing genotypes as replicated.

  • SNP nonreplication rate: the number of nonreplications divided by the sum of the number of replications and the number of nonreplications.
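The definitions above translate into a small counting routine. This is a sketch under one stated assumption: the slides do not say how a pair of non-missing but discordant calls enters the rate, so such pairs are left out of both counts here. The function name and the use of `None` for a missing call are illustrative choices.

```python
def nonreplication_rate(pairs):
    """pairs: list of (call_1, call_2) for one SNP; None marks a missing call."""
    replications = 0
    nonreplications = 0
    for first, second in pairs:
        if first is None and second is None:
            replications += 1        # two missing calls count as replicated
        elif first is None or second is None:
            nonreplications += 1     # exactly one call is missing
        elif first == second:
            replications += 1        # consistency: both calls exist and agree
        # Assumption: discordant non-missing calls are excluded from both
        # counts, because the slides do not define their treatment.
    total = replications + nonreplications
    return nonreplications / total if total else float("nan")


# Hypothetical regenotyping results for five subjects.
pairs = [("AA", "AA"), ("AB", None), (None, None), ("BB", "BB"), ("AA", "AB")]
rate = nonreplication_rate(pairs)
print(rate)
```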


Critical assumptions about errors

  • Regardless of nature of errors, they are random and independent

  • Error model is same for cases (affecteds) and controls (unaffecteds)


Mote-Anderson Model [1965] Penetrance Table (most general)

[Table: true genotype (AA, AB, BB) by observed genotype (AA, AB, BB)]


Simple but Realistic Error Model

  • Homozygote to homozygote error rates set to zero

  • All other error rates set to equal error rate
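The two rules above fix the whole 3 × 3 misclassification matrix once a single rate is chosen. A minimal sketch, assuming rows index the true genotype and columns the observed genotype in the order AA, AB, BB; the function name and the value 0.01 are illustrative:

```python
def simple_error_model(eps):
    """Entry [i][j] is Pr(observe genotype j | true genotype i)."""
    return [
        [1 - eps, eps, 0.0],       # true AA: only AA-to-AB errors
        [eps, 1 - 2 * eps, eps],   # true AB: errors toward either homozygote
        [0.0, eps, 1 - eps],       # true BB: only BB-to-AB errors
    ]


model = simple_error_model(0.01)
for row in model:
    print(row)
```

Each row sums to one, so the diagonal entries are determined by the off-diagonal error rates.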


Three Component Normal Mixture

  • Given AA, F is normal(-Δ, 1)

  • Given AB, F is normal(0, 1)

  • Given BB, F is normal(Δ,1)

  • Symmetric cutpoints create an error model that has equal error rates for all errors except homozygote to homozygote errors.
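The error rates implied by the mixture can be computed from the normal CDF. The sketch below assumes a calling rule that the slides do not spell out: F below -c is called AA, F above c is called BB, and anything in between is called AB, with c = Δ/2 chosen for illustration. The function names are hypothetical.

```python
from math import erf, sqrt


def phi(x):
    """Standard normal CDF."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))


def mixture_error_rates(delta, cut):
    """Error rates when F < -cut is called AA, F > cut is called BB, else AB."""
    hom_to_het = phi(cut + delta) - phi(delta - cut)
    hom_to_hom = 1.0 - phi(cut + delta)
    return {
        "AA->AB": hom_to_het,
        "AA->BB": hom_to_hom,
        "AB->AA": phi(-cut),
        "AB->BB": 1.0 - phi(cut),
        "BB->AB": hom_to_het,   # equal to AA->AB by symmetry
        "BB->AA": hom_to_hom,   # equal to AA->BB by symmetry
    }


rates = mixture_error_rates(delta=4.0, cut=2.0)
for name, value in rates.items():
    print(name, value)
```

With these cutpoints the homozygote-to-heterozygote and heterozygote-to-homozygote rates differ only by the homozygote-to-homozygote rate, which is negligible when Δ is large, so the four remaining rates are equal to a very close approximation.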


Tests of Association

  • Case-control study. The ratio of number of controls to number of cases is k.

  • We use the 2x3 chi-squared test of independence (simplest non-trivial case).

    • Mitra found the noncentrality parameter of the chi-squared test of association, which is needed for power and sample size calculations.

  • Recommended (Sasieni) test is test of trend (Armitage).


Test Statistic

  • Pearson’s χ² on 2 × 3 tables

    Example Table
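As an illustration of the statistic named above, here is a sketch of Pearson's chi-squared computed on a 2 × 3 table of genotype counts. The counts are hypothetical and are not the example table from the talk. The p-value uses the fact that a chi-squared variable with 2 degrees of freedom has upper tail exp(-x/2).

```python
from math import exp


def pearson_chi2_2x3(table):
    """table: two rows (cases, controls) of three genotype counts (AA, AB, BB)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    # With 2 degrees of freedom, P(X > x) = exp(-x / 2).
    p_value = exp(-stat / 2.0)
    return stat, p_value


counts = [[10, 20, 30],   # hypothetical cases
          [20, 20, 20]]   # hypothetical controls
stat, p_value = pearson_chi2_2x3(counts)
print(stat, p_value)
```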


Effect of Misclassification Errors on Tests of Association

  • Bross found that the level of significance is unchanged when the same error mechanism affects cases and controls, and that parameter estimates are biased.

  • Mote and Anderson found that the power is reduced (level of significance constant) when there are misclassification errors.


Notation

Count parameters:

NA = number of cases in the absence of errors

NU = number of controls in the absence of errors

NA* = number of cases in the presence of errors

NU* = number of controls in the presence of errors



Genetic model free parameterization

  • Specify the genotype probabilities directly

    Assuming Hardy Weinberg Equilibrium (HWE), all probabilities specified with two parameters ( p, q ):


Genetic model free parameterization

  • Specify the genotype probabilities directly

    • Not assuming HWE, can specify all probabilities with four parameters:


Genetic Model Specification

  • p1 = allele frequency of SNP marker 1 allele

  • p2 = allele frequency of SNP marker 2 allele = 1- p1

  • pd = allele frequency of disease locus d allele

  • p+ = allele frequency of disease wild-type allele = 1- pd


Genetic Model Specification

  • D = disequilibrium (non-scaled, as defined in Hartl and Clark)

  • DMAX= min (p1 pd, p2 p+)

  • D’=D/ DMAX



Results

Demonstrate analytic solution of asymptotic power using standard chi-square test of genotypic association


Genotype Frequencies in the Presence of Errors
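Observed genotype frequencies follow from the true frequencies and the misclassification probabilities by the law of total probability. The sketch below assumes the simple error model with a single illustrative rate of 0.01 and true frequencies under HWE with minor allele frequency 0.2; the function name is hypothetical.

```python
def observed_frequencies(true_freqs, error_matrix):
    """error_matrix[i][j] = Pr(observe genotype j | true genotype i)."""
    return [
        sum(true_freqs[i] * error_matrix[i][j] for i in range(3))
        for j in range(3)
    ]


eps = 0.01  # illustrative error rate
error_matrix = [
    [1 - eps, eps, 0.0],
    [eps, 1 - 2 * eps, eps],
    [0.0, eps, 1 - eps],
]
true_freqs = [0.64, 0.32, 0.04]  # AA, AB, BB under HWE with B frequency 0.2
obs = observed_frequencies(true_freqs, error_matrix)
print(obs)
```

Errors move probability mass between adjacent genotype classes while the observed frequencies still sum to one.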


Noncentrality Parameter

We assume NU = kNA.

Using Mitra’s work (1958),


Noncentrality Parameter

  • Let λ=kNAg, where g is the bracketed function for genotypes measured without error.

  • Let λ*=kNA*g*, where g* is the bracketed function using frequencies for genotypes observed with error.


To maintain constant asymptotic power

We choose NA* so that λ* = λ.


Paper 1 Findings

  • Noncentrality parameter for the 2x3 chi-squared test of independence from Mitra to describe asymptotic power.

  • Increase in error rate (three error models) requires a corresponding increase in sample size to maintain Type I and Type II error rates.

  • Regression analysis of increase in %MSSN as function of error rate in a number of published models.

  • Interaction of linkage disequilibrium (D) and measure of overall error rate (S).


Paper 2 Findings

  • Linkage Disequilibrium (LD) and errors interact in a non-linear fashion.

  • The increase in sample size necessary to maintain constant asymptotic power and level of significance as a function of S (sum of error rates) is smallest when D’ = 1 (perfect LD).

  • The increase grows monotonically as D’ decreases to 0.5 for all studies.


Paper 3 Method

  • Saturated error model (called Mote-Anderson in PAWE software).

  • Taylor series expansion of the ratio of sample sizes expressed with the non-centrality parameters.

  • The coefficients of each error parameter give the %MSSN for a 1% increase in that error rate.


Recall the Noncentrality Parameters:

  • Let λ=kNAg, where g is the bracketed function.

  • Let λ*=kNA*g*, where g* is the bracketed function using frequencies for genotypes observed with error.

  • Then, when λ= λ* (that is, equal power for both specifications), NA*/NA=g/g*.
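The ratio NA*/NA = g/g* can be evaluated numerically once g is written out. The bracketed function itself did not survive in this transcript, so the sketch below assumes the standard noncentrality parameter of the two-sample Pearson chi-squared test, which gives g = Σ_j (p0j − p1j)² / (p0j + k·p1j) and is consistent with λ = kNAg. The allele frequencies, the error rate, and all function names are illustrative.

```python
def hwe(p):
    """Genotype frequencies (AA, AB, BB) under HWE; p is the B allele frequency."""
    return [(1 - p) ** 2, 2 * p * (1 - p), p ** 2]


def with_errors(freqs, err):
    """Observed frequencies; err[i][j] = Pr(observe j | true i)."""
    return [sum(freqs[i] * err[i][j] for i in range(3)) for j in range(3)]


def g(cases, controls, k=1.0):
    """Assumed bracketed function: sum_j (p0j - p1j)^2 / (p0j + k * p1j)."""
    return sum((a - u) ** 2 / (a + k * u) for a, u in zip(cases, controls))


eps = 0.01  # illustrative error rate in the simple error model
err = [[1 - eps, eps, 0.0], [eps, 1 - 2 * eps, eps], [0.0, eps, 1 - eps]]
cases, controls = hwe(0.20), hwe(0.21)

g_true = g(cases, controls)
g_star = g(with_errors(cases, err), with_errors(controls, err))
ratio = g_true / g_star  # NA*/NA needed to keep lambda* equal to lambda
print(ratio)
```

A ratio above one means more cases are needed to hold power constant once errors are present.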


%MSSN Function

NA*/NA ≈ 1 + C12 ε12 + C13 ε13 + C21 ε21 + C23 ε23 + C31 ε31 + C32 ε32.

Suppose C13 = 7. Then every 1% increase in ε13 requires a 7% increase in sample size to maintain constant power.
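The first-order expansion is just a weighted sum of the error rates. A minimal sketch that reproduces the C13 = 7 illustration; the dictionary keys and function name are hypothetical conventions, not taken from PAWE:

```python
def sample_size_ratio(coefficients, error_rates):
    """First-order approximation NA*/NA = 1 + sum over ij of C_ij * eps_ij."""
    return 1.0 + sum(
        coefficients[key] * error_rates.get(key, 0.0) for key in coefficients
    )


coefficients = {"13": 7.0}   # the slide's hypothetical C13
error_rates = {"13": 0.01}   # a 1% rate of calling true AA as BB
ratio = sample_size_ratio(coefficients, error_rates)
print(f"{(ratio - 1) * 100:.1f}% more subjects needed")
```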


%MSSN Coefficients

  • The %MSSN coefficient associated with the error rate of misclassifying the more common homozygote as the heterozygote is given by


%MSSN Coefficients

  • Similar expressions hold for the other five %MSSN coefficients.




Comparison of Genotype Frequencies

[Table columns: without error, with 1% error, with 3% error]


Sample size in presence of errors

  • Assume we want 0.80 power at 0.05 level of significance. Let k = 1.
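For a target of 0.80 power at the 0.05 level, the required noncentrality parameter λ of the 2-degree-of-freedom test can be found numerically, and the number of cases then follows from λ = kNAg. The sketch below writes the noncentral chi-squared distribution as a Poisson mixture of central chi-squared distributions and solves for λ by bisection. The value of g is a hypothetical placeholder, and the function names are illustrative.

```python
from math import exp, log


def power_df2(lam, alpha):
    """Power of a 2-df chi-squared test with noncentrality parameter lam."""
    half_x = -log(alpha)        # half the critical value: P(X > x) = exp(-x/2)
    half_lam = lam / 2.0
    pois_lam = exp(-half_lam)   # Poisson(half_lam) pmf at j = 0
    term_x = exp(-half_x)       # Poisson(half_x) pmf at 0
    cdf_x = term_x              # Poisson(half_x) cdf at 0
    power = 0.0
    for j in range(400):
        # The upper tail of a central chi-squared with 2 + 2j df at the
        # critical value equals the Poisson(half_x) cdf at j.
        power += pois_lam * cdf_x
        pois_lam *= half_lam / (j + 1)
        term_x *= half_x / (j + 1)
        cdf_x += term_x
    return power


def noncentrality_for_power(target_power, alpha):
    """Bisection on lam; power is increasing in lam."""
    lo, hi = 0.0, 100.0
    for _ in range(100):
        mid = (lo + hi) / 2.0
        if power_df2(mid, alpha) < target_power:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


lam = noncentrality_for_power(0.80, 0.05)
k = 1.0
g_value = 6.1e-4  # hypothetical value of the bracketed function g
cases_needed = lam / (k * g_value)
print(lam, cases_needed)
```

Replacing `g_value` with g* for genotypes observed with error gives the larger NA* directly.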


Cost coefficients for our example

[Table: coefficient by type of error]

  • More common hom to het (C12)

  • Het to more common hom (C21)

  • Het to less common hom (C23)

  • Less common hom to het (C32)


Simplest non-trivial case to develop insights

Assume HWE, cases and controls

pa = 0.2, 0.3, 0.4, 0.5

pu = pa + δ, δ = 0.01

P01 = (1 - pa)^2, P02 = 2(1 - pa)pa, P03 = (pa)^2

P11 = (1 - pu)^2, P12 = 2(1 - pu)pu, P13 = (pu)^2
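The genotype frequencies for this example can be tabulated directly from the HWE formulas. A short sketch; the function and variable names are illustrative:

```python
def hwe(p):
    """Genotype frequencies (AA, AB, BB) under HWE for B allele frequency p."""
    return [(1 - p) ** 2, 2 * (1 - p) * p, p ** 2]


delta = 0.01
table = {}
for pa in (0.2, 0.3, 0.4, 0.5):
    pu = pa + delta
    # First list holds P01, P02, P03 (from pa); second holds P11, P12, P13 (from pu).
    table[pa] = (hwe(pa), hwe(pu))

for pa, (from_pa, from_pu) in table.items():
    print(pa, [round(x, 4) for x in from_pa], [round(x, 4) for x in from_pu])
```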


Results for %MSSN Coefficients, δ = 0.01

[Figure: cost (%MSSN coefficient) versus case SNP minor allele frequency]


Conclusion: What happens to %MSSN coefficients as minor SNP allele frequency approaches 0?

Costly errors are those made on the more common homozygote
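The growth of the cost of errors on the more common homozygote can be seen numerically. The sketch below estimates C12 by a finite difference of the sample size ratio g/g* with respect to the AA-to-AB error rate, under HWE with δ = 0.01 and k = 1. It assumes the same form of g as the standard two-sample chi-squared noncentrality parameter, since the bracketed function is not reproduced in this transcript. All names and the extra grid points 0.05 and 0.01 are illustrative.

```python
def hwe(p):
    """Genotype frequencies (AA, AB, BB) under HWE; p is the B allele frequency."""
    return [(1 - p) ** 2, 2 * p * (1 - p), p ** 2]


def g(cases, controls, k=1.0):
    """Assumed bracketed function: sum_j (p0j - p1j)^2 / (p0j + k * p1j)."""
    return sum((a - u) ** 2 / (a + k * u) for a, u in zip(cases, controls))


def c12(pa, delta=0.01, h=1e-6):
    """Finite-difference estimate of the %MSSN coefficient for AA-to-AB errors."""
    cases, controls = hwe(pa), hwe(pa + delta)

    def shift(freqs):
        # Move a fraction h of true AA genotypes into the observed AB class.
        return [freqs[0] * (1 - h), freqs[1] + freqs[0] * h, freqs[2]]

    ratio = g(cases, controls) / g(shift(cases), shift(controls))
    return (ratio - 1.0) / h


costs = {pa: c12(pa) for pa in (0.5, 0.2, 0.05, 0.01)}
for pa, cost in costs.items():
    print(pa, round(cost, 2))
```

Under these assumptions the coefficient rises steadily as the minor allele frequency shrinks, which matches the conclusion on the slide.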


Extension to non-HWE generalizing example

  • %MSSN coefficients C12 and C13 have infinite limits.

  • Additionally, C23 may have infinite limit.


How to perform calculations in practice?

  • Use PAWE webtool.


Paper 5 Findings

  • %MSSN coefficients with infinite limit hold when studying usual genetic models.

  • Recessive models can have C23 with infinite limit as minor SNP allele frequency goes to zero.

  • Dominant models have a notably different behavior with fewer %MSSN coefficients with infinite limit. Behavior can be more problematic.

  • %MSSN coefficients are complex functions that should be studied on a case-by-case basis.


Paper 5 Definitions

  • Total %MSSN is defined to be




Dominant Model, Total %MSSN


Possible Strategies to Counter Effects of SNP Genotyping Errors

  • Increase sample size to compensate for loss of power. Use small Type I and Type II error rates in designing studies. (This works.)

  • When a three component normal mixture describes the measurements that are the basis of genotyping, use “no-call” rules to lessen error rates and reduce consequent cost.


Possible Strategies to Counter Effects of SNP Genotyping Errors

  • Use the same genotyping classification procedure and regenotype subjects (Tintle’s problem).

  • Use a perfect genotyping classification procedure on some of the subjects (Gordon et al.)


Increase sample size

  • Use PAWE software to identify whether the problem under consideration has the possibility of large %MSSN coefficients.

  • Good design (using small Type I and Type II error rates) can yield protocols that are less sensitive to the consequences of SNP genotyping errors.


Power in presence of errors

  • A study design in which type I error rate is low and power is high is less sensitive to genotyping error rate.




“No-Call” Rules (Paper 4)

  • The gain (less reduction in power) from a reduced error rate using no call is almost exactly balanced by the loss of power due to reduced sample size.

  • That is, there is only so much information in the sample.

  • Conclusion: Use all of the data without resorting to “no call” procedures.


Regenotype Subjects

  • Tintle will report on this approach in the next seminar.


Double Sampling

  • See the following paper.

  • Gordon, D., Yang, Y., Haynes, C., Finch, S.J., Mendell, N.R., Brown, A.M., Haroutunian, V. (2004) "Increasing power for tests of genetic association in the presence of phenotype and/or genotype error by use of double-sampling." Statistical Applications in Genetics and Molecular Biology.


Summary

1. We have described quantitatively the magnitude of the effect of genotype errors on case/control association studies: how much power is lost or, equivalently, how much the sample size must increase to maintain constant power.

- We have quantified this magnitude for the chi-square test of independence (http://linkage.rockefeller.edu/pawe)


Summary

2. Under HWE, cost coefficients of both error types made on the more common homozygote have infinite limits as SNP minor allele frequency approaches 0.


Recommendations

1. Researchers should increase sample size to maintain specification of type I error rate and power in case/control studies.

  • A study design in which type I error rate is low and power is high is less sensitive to genotyping error rate.


Recommendations

2. Researchers designing SNP genotyping technologies should avoid designs where homozygote-to-homozygote misclassifications might occur with non-zero probability.


References

  • Armitage, P. (1955). Tests for linear trends in proportions and frequencies. Biometrics, 11, 375-386.

  • Bross, I. (1954). Misclassification in 2 x 2 tables. Biometrics, 10, 478-486.

  • Hartl, D.L., Clark, A.G. (1989). Principles of population genetics, 2nd ed. Sunderland: Sinauer Associates.

  • Mitra, S.K. (1958). On the limiting power function of the frequency chi-square test. Annals of Mathematical Statistics, 29(4), 1221-1233.

  • Mote, V.L., Anderson, R.L. (1965). An investigation of the effect of misclassification on the properties of chi-square tests in the analysis of categorical data. Biometrika, 52, 95-109.


References

  • Sasieni, P.D. (1997). From genotypes to genes: doubling the sample size. Biometrics, 53(4), 1253-1261.

  • Sutcliffe, J.P. (1965). A probability model for errors of classification. I. General considerations. Psychometrika, 30, 73-96.

  • Sutcliffe, J.P. (1965). A probability model for errors of classification. II. Particular cases. Psychometrika, 30, 129-155.


References

  • Tenenbein, A. (1970). A double sampling scheme for estimating from binomial data with misclassifications. Journal of the American Statistical Association, 65, 1350-1361.

  • Tenenbein, A. (1972). A double sampling scheme for estimating from misclassified multinomial data with applications to sampling inspection. Technometrics, 14, 187-202.

  • Tintle, N., Ahn, K., Mendell, N.R., Gordon, D., Finch, S.J. (2004). Using Replicated SNP Genotypes for CoGA. Genetic Analysis Workshop contribution.

