## Which SNP genotyping errors are most costly and when?

Acknowledgments

- Joint work
- Derek Gordon (Rockefeller University)
- Sun Jung Kang (Duke University)
- Five papers are the material for this talk with additional coauthors
- Michael Nothnagel and Jurg Ott in paper 1
- Mark Levenstien and Jurg Ott in paper 2
- Abe Brown and Jurg Ott in paper 4

Acknowledgments

- Colleagues:
- Nancy Mendell
- Kenny Ye
- Stony Brook students (work in progress)
- Nathan Tintle (repeated sampling)
- Qing Wang (LRT for mixtures)
- Kwangmi Ahn, Rose Saint Fleur
- Undergraduates: Alex Borress, Josh Ren, Jelani Wiltshire

First Paper

- Gordon, D., Finch, S.J., Nothnagel, M., Ott, J. (2002). Power and sample size calculations for case-control genetic association tests when errors are present: application to single nucleotide polymorphisms. Human Heredity, 54, 22-33.

Second Paper

- Gordon D., Levenstien M.A., Finch S.J., and Ott J. (2003). Errors and linkage disequilibrium interact multiplicatively when computing sample sizes for genetic case-control association studies. Pacific Symposium on Biocomputing: 490-501.

Third Paper

- Kang, S.J., Gordon, D., Finch, S.J. (2004). What SNP Genotyping Errors Are Most Costly for Genetic Association Studies. Genetic Epidemiology, 26, 132-141.

Fourth Paper

- Kang, S.J., Gordon, D., Brown, A.M., Ott, J., Finch, S.J. (2004). Tradeoff between No-Call Reduction in Genotyping Error Rate and Loss of Sample Size for Genetic Case/Control Association Studies. Pacific Symposium on Biocomputing:

Fifth Paper

- Kang, S.J., Finch, S.J., Gordon, D. (2004). Quantifying the cost of SNP genotyping errors in genetic model based association studies. Human Heredity, In press.

PAWE Web Site

http://linkage.rockefeller.edu/pawe/pawe.cgi

Review Paper

- Gordon, D., Finch, S.J. (2004). Factors affecting statistical power to detect genetic association. Submitted for publication.

Background

- Definition of SNPs
- SNP genotyping measurements
- Specification of error models
- Tests of association
- Two supplementary measurement approaches

Definition of SNP

- A gene with two possible alleles (here A and B)
- A is the more common allele in the controls
- Three possible genotypes
- AA, index=1 (more common homozygote)
- AB, index=2 (heterozygote)
- BB, index=3 (less common homozygote)

Measure of Cost

- Our measure of the cost of a genotyping error is the percentage increase in the minimum sample size necessary to maintain constant Type I and Type II error rates when a genotyping error rate increases by 1%.
- %MSSN is our abbreviation for this measure.

SNP Genotyping Measurements

- Two dye intensities are measured: R and G.
- Measurements are typically taken at two or three time points.
- Ratio F=R/(R+G) is used to classify into genotypes.
- Genotyping error – event in which an observed genotype is different from the true genotype.

Approaches to Replication

- Sutcliffe studied the reclassification of subjects using the same classification procedure at all remeasurements.
- Tenenbein studied the reclassification of subjects using a virtually perfect instrument for the second reclassification.

Regenotyping Results

- There is a common perception that genotyping error is negligible.
- One test is to regenotype a set of data.
- COGA provided such data to the last GAW.
- Tintle et al. (2004) analyzed it.

Observations on Table

- Homozygote to homozygote inconsistencies are extremely rare.
- CIDR “missing rate” is 6.7%.
- Affymetrix “missing rate” is 6.1%.
- Double missing rate is 1.7%, much higher than the 0.4% expected under independence, suggesting some subjects may be consistently more difficult to genotype.
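The 0.4% figure follows from multiplying the two platform missing rates. A quick check, using only the rates quoted above (the variable names are illustrative):

```python
# Missing-genotype rates quoted on the slide.
cidr_missing = 0.067
affymetrix_missing = 0.061

# If missingness were independent across the two platforms, the chance that
# both calls are missing is the product of the two marginal rates.
expected_double_missing = cidr_missing * affymetrix_missing

# Observed double-missing rate quoted on the slide.
observed_double_missing = 0.017

# How many times larger the observed rate is than the independence prediction.
excess_ratio = observed_double_missing / expected_double_missing

print(f"expected under independence: {expected_double_missing:.2%}")
print(f"observed / expected: {excess_ratio:.1f}")
```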

Regenotyping Definitions

- Consistency: Two genotypes on a SNP for a regenotyped subject exist and are the same.
- Nonreplication: One genotype on a SNP for a regenotyped subject exists, and data are “missing” for the other genotype. Note that we treat two missing genotypes as replicated.
- SNP nonreplication rate: the number of nonreplications divided by the sum of the number of replications and the number of nonreplications.
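A minimal sketch of how these definitions could be applied to paired calls for one SNP. The function name and the use of `None` for a missing call are hypothetical. The slide does not say how two present but different genotypes enter this rate, so the sketch leaves them out of both counts; that choice is an assumption.

```python
def snp_nonreplication_rate(pairs):
    """Nonreplication rate for one SNP.

    `pairs` holds (first_call, second_call) per regenotyped subject,
    with None marking a missing genotype.
    """
    replications = 0
    nonreplications = 0
    for first, second in pairs:
        if first is None and second is None:
            # Two missing genotypes are treated as replicated.
            replications += 1
        elif first is None or second is None:
            # Exactly one genotype exists: a nonreplication.
            nonreplications += 1
        elif first == second:
            # Both exist and agree (consistency): a replication.
            replications += 1
        # Both exist but differ: not counted here (assumption, see lead-in).
    total = replications + nonreplications
    return nonreplications / total if total else float("nan")


example_pairs = [("AA", "AA"), ("AB", None), (None, None), ("AB", "AB"), (None, "BB")]
example_rate = snp_nonreplication_rate(example_pairs)
print(example_rate)
```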

Critical assumptions about errors

- Regardless of nature of errors, they are random and independent
- Error model is same for cases (affecteds) and controls (unaffecteds)

Simple but Realistic Error Model

- Homozygote to homozygote error rates set to zero
- All other error rates set to equal error rate

Three Component Normal Mixture

- Given AA, F is normal(-Δ, 1)
- Given AB, F is normal(0, 1)
- Given BB, F is normal(Δ,1)
- Symmetric cutpoints create an error model that has equal error rates for all errors except homozygote to homozygote errors.
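To see how symmetric cutpoints lead to this error model, the sketch below computes the misclassification probabilities implied by the three normal components. The cutpoints at ±Δ/2 and the value Δ = 4 are assumptions chosen for illustration; the slide does not give them.

```python
from math import erf, inf, sqrt


def normal_cdf(x):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))


def mixture_error_rates(delta, cut=None):
    """P(called genotype | true genotype) for every miscall.

    True genotype means are -delta (AA), 0 (AB), +delta (BB), each with unit
    variance. Calls use symmetric cutpoints -cut and +cut (default delta / 2,
    an assumption).
    """
    c = delta / 2.0 if cut is None else cut
    means = {"AA": -delta, "AB": 0.0, "BB": delta}
    call_regions = {"AA": (-inf, -c), "AB": (-c, c), "BB": (c, inf)}
    rates = {}
    for true_genotype, mu in means.items():
        for called, (low, high) in call_regions.items():
            if called != true_genotype:
                rates[(true_genotype, called)] = (
                    normal_cdf(high - mu) - normal_cdf(low - mu)
                )
    return rates


rates = mixture_error_rates(delta=4.0)
for (true_genotype, called), rate in sorted(rates.items()):
    print(f"{true_genotype} -> {called}: {rate:.3e}")
```

With well-separated components, the homozygote-to-homozygote rates are negligible and the remaining four rates are nearly equal, which is the pattern of the simple error model.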

Tests of Association

- Case-control study. The ratio of number of controls to number of cases is k.
- We use the 2x3 chi-squared test of independence (simplest non-trivial case).
- Mitra found the noncentrality parameter of the chi-squared test of association, which is needed for power and sample size calculations.
- The test recommended by Sasieni is the Armitage test of trend.

Effect of Misclassification Errors on Tests of Association

- Bross found that the level of significance is unchanged when the same error mechanism affects cases and controls, and that parameter estimates are biased.
- Mote and Anderson found that the power is reduced (level of significance constant) when there are misclassification errors.

Notation

Count parameters:

NA = number of cases in the absence of errors

NU = number of controls in the absence of errors

NA* = number of cases in the presence of errors

NU* = number of controls in the presence of errors

Genetic model free parameterization

- Specify the genotype probabilities directly

Assuming Hardy Weinberg Equilibrium (HWE), all probabilities specified with two parameters ( p, q ):

- Not assuming HWE, can specify all probabilities with four parameters:

Genetic Model Specification

- p1 = allele frequency of SNP marker 1 allele
- p2 = allele frequency of SNP marker 2 allele = 1- p1
- pd = allele frequency of disease locus d allele
- p+ = allele frequency of disease wild-type allele = 1- pd

Genetic Model Specification

- D = disequilibrium (non-scaled, as defined in Hartl and Clark)
- DMAX= min (p1 pd, p2 p+)
- D’=D/ DMAX

Results

Demonstrate analytic solution of asymptotic power using standard chi-square test of genotypic association

Noncentrality Parameter

- Let λ=kNAg, where g is the bracketed function for genotypes measured without error.
- Let λ*=kNA*g*, where g* is the bracketed function using frequencies for genotypes observed with error.

To maintain constant asymptotic power

We choose NA* so that λ* = λ.
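A minimal sketch of this calculation. The bracketed function is not reproduced in the transcript, so the sketch assumes the standard noncentrality of the Pearson chi-squared test for two multinomial samples, which gives g = Σ (a_i − u_i)² / (a_i + k·u_i) over the three genotypes, with a_i and u_i the case and control genotype frequencies. The allele frequencies, the 1% error rate, and the starting sample size are illustrative assumptions, and the error matrix is the simple model described earlier (no homozygote-to-homozygote errors, all other rates equal).

```python
def hwe(q):
    """Genotype frequencies (AA, AB, BB) under HWE; q is the B allele frequency."""
    return [(1 - q) ** 2, 2 * q * (1 - q), q ** 2]


def observe(freqs, eps):
    """Observed genotype frequencies; eps[i][j] = P(observed j | true i)."""
    return [sum(freqs[i] * eps[i][j] for i in range(3)) for j in range(3)]


def g(cases, controls, k=1.0):
    """Assumed form of the bracketed function (see lead-in)."""
    return sum((a - u) ** 2 / (a + k * u) for a, u in zip(cases, controls))


def noncentrality(n_cases, cases, controls, k=1.0):
    """lambda = k * N_A * g."""
    return k * n_cases * g(cases, controls, k)


e = 0.01  # illustrative error rate
eps = [
    [1 - e, e, 0.0],
    [e, 1 - 2 * e, e],
    [0.0, e, 1 - e],
]

cases, controls = hwe(0.20), hwe(0.21)  # illustrative allele frequencies
cases_obs, controls_obs = observe(cases, eps), observe(controls, eps)

# Setting lambda* = lambda gives N_A* / N_A = g / g*.
ratio = g(cases, controls) / g(cases_obs, controls_obs)

n_cases = 1000  # hypothetical error-free sample size
n_cases_star = n_cases * ratio
print(f"N_A*/N_A = {ratio:.4f}, N_A* = {n_cases_star:.1f}")
```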

Paper 1 Findings

- Noncentrality parameter for the 2x3 chi-squared test of independence from Mitra to describe asymptotic power.
- Increase in error rate (three error models) requires a corresponding increase in sample size to maintain Type I and Type II error rates.
- Regression analysis of increase in %MSSN as function of error rate in a number of published models.
- Interaction of linkage disequilibrium (D) and measure of overall error rate (S).

Paper 2 Findings

- Linkage Disequilibrium (LD) and errors interact in a non-linear fashion.
- The increase in sample size necessary to maintain constant asymptotic power and level of significance as a function of S (sum of error rates) is smallest when D’ = 1 (perfect LD).
- The increase grows monotonically as D’ decreases to 0.5 for all studies.

Paper 3 Method

- Saturated error model (called Mote-Anderson in PAWE software).
- Taylor series expansion of the ratio of sample sizes expressed with the non-centrality parameters.
- The coefficients of each error parameter give the %MSSN for a 1% increase in that error rate.

Recall the Noncentrality Parameters:

- Let λ=kNAg, where g is the bracketed function.
- Let λ*=kNA*g*, where g* is the bracketed function using frequencies for genotypes observed with error.
- Then, when λ= λ* (that is, equal power for both specifications), NA*/NA=g/g*.

%MSSN Function

NA*/NA ≈ 1 + C12ε12 + C13ε13 + C21ε21 + C23ε23 + C31ε31 + C32ε32.

Suppose C13 = 7. Then every 1% increase in ε13 requires a 7% increase in sample size to maintain constant power.
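The linear approximation can be evaluated directly. The sketch below uses only the supposed value C13 = 7 from the example; the function name and the dictionary keys are hypothetical.

```python
def approx_sample_size_ratio(coefficients, error_rates):
    """First-order approximation: N_A*/N_A ~ 1 + sum of C_ij * eps_ij."""
    return 1.0 + sum(
        c * error_rates.get(key, 0.0) for key, c in coefficients.items()
    )


coefficients = {"13": 7.0}   # C13 = 7, as supposed in the example
error_rates = {"13": 0.01}   # a 1% rate of calling AA as BB

approx_ratio = approx_sample_size_ratio(coefficients, error_rates)
print(f"approximate N_A*/N_A = {approx_ratio:.2f}")
```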

%MSSN Coefficients

- The %MSSN coefficient associated with the error rate of misclassifying the more common homozygote as the heterozygote is given by

%MSSN Coefficients

- Similar expressions hold for the other five %MSSN coefficients.
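The slide's expression is not preserved in this transcript. As a hedged illustration only, the derivation below shows what the coefficient looks like if g is assumed to be the standard Pearson form for two multinomial samples. The symbols a_i, u_i, d_i, and s_i are introduced here and are not from the slides.

```latex
% Assumed form: g = \sum_i d_i^2 / s_i, with d_i = a_i - u_i and s_i = a_i + k u_i,
% where a_i and u_i are the case and control frequencies of genotype i.
% With a single error rate \varepsilon_{12} (AA called as AB), both d and s
% transform linearly because the error model is the same in cases and controls:
\begin{align*}
g^*(\varepsilon_{12})
  &= \frac{d_1^2 (1-\varepsilon_{12})}{s_1}
   + \frac{(d_2 + \varepsilon_{12} d_1)^2}{s_2 + \varepsilon_{12} s_1}
   + \frac{d_3^2}{s_3}, \\
\left.\frac{\partial g^*}{\partial \varepsilon_{12}}\right|_{\varepsilon_{12}=0}
  &= -\frac{d_1^2}{s_1} + \frac{2 d_1 d_2}{s_2} - \frac{s_1 d_2^2}{s_2^2}
   = -s_1 \left( \frac{d_1}{s_1} - \frac{d_2}{s_2} \right)^{2}, \\
\frac{N_A^*}{N_A} = \frac{g}{g^*}
  &\approx 1 + C_{12}\,\varepsilon_{12},
  \qquad
  C_{12} = \frac{s_1}{g} \left( \frac{d_1}{s_1} - \frac{d_2}{s_2} \right)^{2}.
\end{align*}
```

Under the same assumption, replacing the index pair (1, 2) by (i, j) gives the other five coefficients.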

Example of Sample Size increase in presence of errors

Suppose we have:

Comparison of Genotype Frequencies

[Table not preserved. Columns: without error, with 1% error, with 3% error.]

Sample size in presence of errors

- Assume we want 0.80 power at 0.05 level of significance. Let k = 1.

Cost coefficients for our example

[Table not preserved. Columns: coefficient, type of error. Types of error listed:]

- More common hom to het
- Het to more common hom
- Het to less common hom
- Less common hom to het

Simplest non-trivial case to develop insights

Assume HWE in both cases and controls.

pa = 0.2, 0.3, 0.4, 0.5

pu = pa + δ, δ = 0.01

P01 = (1 − pa)², P02 = 2(1 − pa)pa, P03 = (pa)²

P11 = (1 − pu)², P12 = 2(1 − pu)pu, P13 = (pu)²

Conclusion: What happens to %MSSN coefficients as minor SNP allele frequency approaches 0?

Costly errors are those made on the more common homozygote
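The trend can be explored numerically for the allele frequencies listed in the example. This is a sketch under stated assumptions: g is taken to be the standard Pearson form for two multinomial samples (the slide's bracketed function is not preserved), k = 1, and each coefficient is approximated by a finite difference of g/g* in one error rate at a time. All function names are hypothetical.

```python
def hwe(q):
    """Genotype frequencies (AA, AB, BB) under HWE; q is the B allele frequency."""
    return [(1 - q) ** 2, 2 * q * (1 - q), q ** 2]


def g(cases, controls, k=1.0):
    """Assumed form of the bracketed function (see lead-in)."""
    return sum((a - u) ** 2 / (a + k * u) for a, u in zip(cases, controls))


def with_single_error(freqs, i, j, e):
    """Move a fraction e of true genotype i into observed genotype j."""
    out = list(freqs)
    out[i] -= e * freqs[i]
    out[j] += e * freqs[i]
    return out


def mssn_coefficient(cases, controls, i, j, h=1e-6, k=1.0):
    """Finite-difference estimate of d(g/g*)/d(eps_ij) at zero error."""
    g_no_error = g(cases, controls, k)
    g_error = g(
        with_single_error(cases, i, j, h),
        with_single_error(controls, i, j, h),
        k,
    )
    return (g_no_error / g_error - 1.0) / h


delta = 0.01
coefficients = {}
for pa in (0.2, 0.3, 0.4, 0.5):
    cases, controls = hwe(pa), hwe(pa + delta)
    coefficients[pa] = {
        f"C{i + 1}{j + 1}": mssn_coefficient(cases, controls, i, j)
        for i in range(3)
        for j in range(3)
        if i != j
    }

for pa, row in coefficients.items():
    print(pa, {name: round(value, 2) for name, value in row.items()})
```

In this sketch, C12 and C13 (errors made on the more common homozygote) grow as pa falls from 0.5 to 0.2, while the coefficients for errors on the less common homozygote stay comparatively small.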

Extension to non-HWE generalizing example

- %MSSN coefficients C12 and C13 have infinite limits.
- Additionally, C23 may have infinite limit.

How to perform calculations in practice?

- Use PAWE webtool.

Paper 5 Findings

- %MSSN coefficients with infinite limits also arise when studying the usual genetic models.
- Recessive models can have C23 with infinite limit as minor SNP allele frequency goes to zero.
- Dominant models behave notably differently, with fewer %MSSN coefficients having an infinite limit. Their behavior can be more problematic.
- %MSSN coefficients are complex functions that should be studied on a case-by-case basis.

Paper 5 Definitions

- Total %MSSN is defined to be

Possible Strategies to Counter Effects of SNP Genotyping Errors

- Increase sample size to compensate for loss of power. Use small Type I and Type II error rates in designing studies. (This works.)
- When a three component normal mixture describes the measurements that are the basis of genotyping, use “no-call” rules to lessen error rates and reduce consequent cost.

Possible Strategies to Counter Effects of SNP Genotyping Errors

- Use the same genotyping classification procedure and regenotype subjects (Tintle’s problem).
- Use a perfect genotyping classification procedure on some of the subjects (Gordon et al.).

Increase sample size

- Use PAWE software to identify whether the problem under consideration has the possibility of large %MSSN coefficients.
- Good design (using small Type I and Type II error rates) can yield protocols that are less sensitive to the consequences of SNP genotyping errors.

Power in presence of errors

- A study design in which type I error rate is low and power is high is less sensitive to genotyping error rate

“No-Call” Rules (Paper 4)

- The gain (less reduction in power) from a reduced error rate using no call is almost exactly balanced by the loss of power due to reduced sample size.
- That is, there is only so much information in the sample.
- Conclusion: Use all of the data without resorting to “no call” procedures.

Regenotype Subjects

- Tintle will report on this approach in the next seminar.

Double Sampling

- See the following paper.
- Gordon, D., Yang, Y., Haynes, C., Finch, S.J., Mendell, N.R., Brown, A.M., Haroutunian, V. (2004) "Increasing power for tests of genetic association in the presence of phenotype and/or genotype error by use of double-sampling." Statistical Applications in Genetics and Molecular Biology.

Summary

1. We have described quantitatively the magnitude of the effect of genotype errors on case/control association studies: how much power is lost or, equivalently, how much the sample size must increase to maintain constant power.

- We have quantified this magnitude for the chi-square test of independence (http://linkage.rockefeller.edu/pawe)

Summary

2. Under HWE, cost coefficients of both error types made on the more common homozygote have infinite limits as SNP minor allele frequency approaches 0

Recommendations

1. Researchers should increase sample size to maintain specification of type I error rate and power in case/control studies

- A study design in which type I error rate is low and power is high is less sensitive to genotyping error rate

Recommendations

2. Researchers designing SNP genotyping technologies should avoid designs where homozygote-to-homozygote misclassifications might occur with non-zero probability.

References

- Armitage, P. (1955). Tests for linear trends in proportions and frequencies. Biometrics, 11, 375-386.
- Bross, I. (1954). Misclassification in 2 x 2 tables. Biometrics, 10, 478-486.
- Hartl, D.L., Clark, A.G. (1989). Principles of population genetics. 2nd ed. Sunderland: Sinauer Associates.
- Mitra, S.K. (1958). On the limiting power function of the frequency chi-square test. Annals of Mathematical Statistics, 29(4), 1221-1233.
- Mote, V.L., Anderson, R.L. (1965). An investigation of the effect of misclassification on the properties of chisquare-tests in the analysis of categorical data. Biometrika, 52, 95-109.

References

- Sasieni, P.D. (1997). From genotypes to genes: doubling the sample size. Biometrics, 53(4), 1253-61.
- Sutcliffe, J.P. (1965). A probability model for errors of classification. I. General considerations. Psychometrika, 30, 73-96.
- Sutcliffe, J.P. (1965). A probability model for errors of classification. II. Particular cases. Psychometrika, 30, 129-155.

References

- Tenenbein, A. (1970). A double sampling scheme for estimating from binomial data with misclassifications. Journal of the American Statistical Association, 65, 1350-1361.
- Tenenbein, A. (1972). A double sampling scheme for estimating from misclassified multinomial data with applications to sampling inspection. Technometrics, 14, 187-202.
- Tintle, N., Ahn, K., Mendell, N.R., Gordon, D., Finch, S.J. (2004). Using Replicated SNP Genotypes for CoGA. Genetic Analysis Workshop contribution.
