Phasing and Missing data recovery in Family Trios

Presentation Transcript


  1. Phasing and Missing Data Recovery in Family Trios
     D. Brinza, J. He, W. Mao, A. Zelikovsky (CS Department)
     International Workshop on Bioinformatics Research and Applications, May 2005

  2. Overview
     • SNP, Genotypes and Haplotypes
     • Phasing & Missing Data Recovery for Trios
     • Family trios & trio constraints
     • ILP for Pure Parsimony
     • Trio phasing without recombinations

  3. SNP, Genotypes and Haplotypes
     • Length of the human genome ≈ 3 × 10^9
     • Number of single nucleotide polymorphisms (SNPs) ≈ 1 × 10^7
     • SNPs are mostly biallelic, e.g., A/C
     • Minor allele frequency should be considerable, e.g. > 0.1%
     • Difference between all people ≈ 0.25% (between any two ≈ 0.1%)
     • Diploid = two different copies of each chromosome
     • Haplotype = description of a single copy (expensive to obtain)
     • Example: 00110101 (0 is for the major allele, 1 is for the minor allele)
     • Genotype = description of the two copies mixed together
     • Example: 01122110 (0=00, 1=11, 2=01)
     • International HapMap project: www.hapmap.org
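The 0/1/2 genotype encoding can be illustrated with a minimal Python sketch. The function name `genotype` and the two input haplotypes are illustrative choices, not something taken from the slides; missing values are not handled here.

```python
def genotype(h1: str, h2: str) -> str:
    """Combine two 0/1 haplotype strings into a 0/1/2 genotype string."""
    if len(h1) != len(h2):
        raise ValueError("haplotypes must cover the same number of SNPs")
    # Equal alleles give a homozygous site (0 or 1); different alleles give 2.
    return "".join(a if a == b else "2" for a, b in zip(h1, h2))


print(genotype("0011010", "0110110"))
```

Going in this direction is trivial; the phasing problem is the reverse step, where several haplotype pairs can produce the same genotype.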

  4. Population Phasing Problem
     • Given an n × m genotype matrix G (n genotype rows, m SNP columns)
     • Find a 2n × m haplotype matrix H (2n haplotype rows, m SNP columns)
     • Each genotype g is explained by two haplotypes h1, h2:
       h1 = 0011010
       h2 = 0110110
       g  = 0212210
     Remarks:
     • For an individual with k heterozygous sites (2's), 2^(k-1) haplotype pairs are possible solutions
     • This is hopeless without a genetic model
     • Programs: PHASE, HAPLOTYPER, HAP, GERBIL, DPPH, etc.
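The 2^(k-1) count in the remark can be checked by enumeration. The sketch below is only an illustration (the name `phasings` is hypothetical, and genotypes are assumed to contain no missing values): it lists every unordered haplotype pair that explains a genotype.

```python
from itertools import product


def phasings(g: str) -> list:
    """Return all unordered haplotype pairs (h1, h2) that explain genotype g."""
    het = [i for i, site in enumerate(g) if site == "2"]
    if not het:
        return [(g, g)]  # fully homozygous: a single, unambiguous pair
    pairs = []
    # Fix h1 to allele 0 at the first heterozygous site so that each
    # unordered pair is generated exactly once (hence 2^(k-1) pairs).
    for bits in product("01", repeat=len(het) - 1):
        h1, h2 = list(g), list(g)
        for i, allele in zip(het, ("0",) + bits):
            h1[i] = allele
            h2[i] = "1" if allele == "0" else "0"
        pairs.append(("".join(h1), "".join(h2)))
    return pairs


for h1, h2 in phasings("0212210"):
    print(h1, h2)
```

For the genotype 0212210 (k = 3) this yields four pairs, one of which is the pair 0011010 / 0110110.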

  5. Family Trios & Trio Constraints
     • Genotype data commonly come in family trios consisting of two parents and one offspring
     • Trio data allow offspring haplotypes to be recovered with higher confidence
     • Haplotype reconstruction should satisfy the trio constraints
     • Example: if the genotypes are f=22, m=02, k=01, then the haplotypes are
       f1=10  m1=01  k1=01
       f2=01  m2=00  k2=01
     • The ambiguity remains only if f=m=k=22
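The trio constraint can be applied site by site. The following sketch is an assumption-laden illustration rather than the authors' procedure: `phase_trio` is a hypothetical helper, genotypes are assumed to have no missing values, and no recombination is modeled. For each parent it returns the transmitted and the untransmitted haplotype, with `?` at sites where all three genotypes are heterozygous.

```python
# Alleles a parent with a given site genotype can transmit.
ALLELES = {"0": "0", "1": "1", "2": "01"}


def flip(allele: str) -> str:
    return "1" if allele == "0" else "0"


def phase_trio(f: str, m: str, k: str) -> dict:
    """Resolve parental haplotypes from father, mother and offspring genotypes."""
    f_t, f_u, m_t, m_u = [], [], [], []
    for fs, ms, ks in zip(f, m, k):
        # All (paternal allele, maternal allele) choices that yield the child's site.
        options = [(a, b) for a in ALLELES[fs] for b in ALLELES[ms]
                   if (a if a == b else "2") == ks]
        if not options:
            raise ValueError("genotypes violate Mendelian inheritance")
        if len(options) > 1:
            site = ("?", "?", "?", "?")  # happens only when f = m = k = 2
        else:
            a, b = options[0]
            site = (a, flip(a) if fs == "2" else a,
                    b, flip(b) if ms == "2" else b)
        for column, value in zip((f_t, f_u, m_t, m_u), site):
            column.append(value)
    return {"father": ("".join(f_t), "".join(f_u)),
            "mother": ("".join(m_t), "".join(m_u))}


print(phase_trio("22", "02", "01"))
```

For f=22, m=02, k=01 the father's pair comes out as 01 (transmitted) and 10, and the mother's as 01 (transmitted) and 00, so the offspring carries 01 and 01.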

  6. Family Trio Phasing
     • Parental Trio Phasing Problem
       • Given a set of genotypes partitioned into family trios
       • Find for each trio a quartet of parental haplotypes that agree with all three genotypes:
         • Parental haplotypes agree with the parental genotypes
         • Inherited parental haplotypes agree with the offspring genotype
     • General Trio Phasing Problem
       • Find (additionally) for each offspring the "true" recombination of the inherited parental haplotypes

  7. ILP for Parental Trio Phasing
     • For each trio, introduce four template haplotypes over {0,1,2,?}
     • Variables: x for each possible haplotype, y for each 2
     • Objective and constraints: [formulas not preserved in the transcript]

  8. Results

  9. Trio Phasing w/o Crossovers
     Three phasing methods on the real and simulated data sets.
     • Error = % of sites where (the best choice of) inherited paternal and maternal haplotypes disagree with the offspring genotype
     • D = Hamming distance in % between the phased haplotypes and the closest feasible haplotypes
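The two measures defined on this slide can be sketched as follows. This is one reading of the definitions, not the authors' evaluation code: the function names are hypothetical, genotypes are assumed to have no missing values, and "best choice" is taken to mean the minimum over the four ways of picking one haplotype from each parent.

```python
def hamming_percent(h_a: str, h_b: str) -> float:
    """Hamming distance between two equal-length haplotypes, in percent of sites."""
    if len(h_a) != len(h_b):
        raise ValueError("haplotypes must have the same length")
    return 100.0 * sum(a != b for a, b in zip(h_a, h_b)) / len(h_a)


def trio_error_percent(father: tuple, mother: tuple, child: str) -> float:
    """Percent of sites where the best inherited pair disagrees with the child."""
    def mismatches(hp: str, hm: str) -> int:
        # Compare the genotype implied by (hp, hm) with the child's genotype.
        return sum((a if a == b else "2") != g for a, b, g in zip(hp, hm, child))

    best = min(mismatches(hp, hm) for hp in father for hm in mother)
    return 100.0 * best / len(child)


print(hamming_percent("0011", "0111"))
print(trio_error_percent(("01", "10"), ("01", "00"), "01"))
```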

  10. Trio Phasing w/o Crossovers
      [Figure not preserved. Labels: pure parsimonious = no recombinations; trio-feasible phasings; projections = closest trio-feasible; random; PHASE; parent/offspring-feasible phasings]

  11. Missing Data Recovery Problem
      • Real data often miss some SNPs
        • Daly et al. data (Crohn's disease): 10%-16%
        • Gabriel et al. data (HapMap): 7%-10%
      • How to reconstruct missing values?
      • How to verify a reconstruction method?
        • Scramble an extra 10% of the values and reconstruct them
        • Halperin and Karp (2004) report an error rate of 2.8%
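The verification protocol (hide an extra fraction of known values, reconstruct, count mistakes) can be sketched as below. Everything here is illustrative: the function names are hypothetical, and `impute_column_mode` is a deliberately naive stand-in for a real recovery method, not any of the methods discussed in the talk.

```python
import random
from collections import Counter


def mask_entries(genotypes, fraction, rng):
    """Replace a random fraction of the known entries with '?'."""
    known = [(r, c) for r, row in enumerate(genotypes)
             for c, value in enumerate(row) if value != "?"]
    hidden = rng.sample(known, round(fraction * len(known)))
    masked = [list(row) for row in genotypes]
    for r, c in hidden:
        masked[r][c] = "?"
    return ["".join(row) for row in masked], hidden


def impute_column_mode(genotypes):
    """Naive placeholder method: fill each '?' with its column's most common value."""
    filled = [list(row) for row in genotypes]
    for c in range(len(genotypes[0])):
        counts = Counter(row[c] for row in genotypes if row[c] != "?")
        mode = counts.most_common(1)[0][0] if counts else "0"
        for row in filled:
            if row[c] == "?":
                row[c] = mode
    return ["".join(row) for row in filled]


def recovery_error_percent(truth, recovered, hidden):
    """Percent of the hidden entries that were reconstructed incorrectly."""
    if not hidden:
        return 0.0
    wrong = sum(truth[r][c] != recovered[r][c] for r, c in hidden)
    return 100.0 * wrong / len(hidden)


truth = ["0120"] * 10
masked, hidden = mask_entries(truth, 0.10, random.Random(0))
print(recovery_error_percent(truth, impute_column_mode(masked), hidden))
```

Swapping `impute_column_mode` for another recovery function leaves the masking and scoring steps unchanged, which is what makes error rates comparable across methods.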

  12. Results for Trio Missing Data Recovery

  13. Missing Data Recovery Problem

  14. • Diploid - two haplotypes (different copies of each chromosome)
      • SNP - single nucleotide site where two or more different nucleotides occur in a large percentage of the population
        • 0 = wild type/major (frequency) allele
        • 1 = mutation/minor (frequency) allele
      • Haplotype - description of a single copy
        • Example: 00110101 (0 is for the major allele, 1 is for the minor allele)
      • Genotype - description of the two copies mixed together
        • Example: 01122110 (0=00, 1=11, 2=01)

  15. • Formulating the Pure-parsimony Trio Phasing Problem (PTPP) and the Trio Missing Data Recovery Problem (TMDRP)
      • Two new methods, one greedy and one based on integer linear programming (ILP), for solving PTPP and TMDRP
      • New 2-SNP Statistics (2SNP) phasing method for unrelated individuals
      • Extensive experimental validation of the proposed methods and comparison with previously known methods

  16. • PHASE – Bayesian statistical method (Stephens et al., 2001, 2003)
      • HAPLOTYPER – Monte Carlo approach (Niu et al., 2002)
      • Phamily – phases trio families based on PHASE (Ackerman et al., 2003)
      • Greedy method for phasing and missing data recovery (Halperin and Karp, 2004)
      • GERBIL – statistical method using maximum likelihood (ML), MST and expectation-maximization (EM) (Kimmel and Shamir, 2005)
      • SNPHAP – uses ML/EM assuming Hardy-Weinberg equilibrium (Clayton et al., 2004)

  17. • Given a set of family trios of genotypes, each with m sites corresponding to m SNPs:
        • 0 – homozygote with major allele, 1 – homozygote with minor allele, 2 – heterozygote, ? – missing SNP value
      • Find for each trio four haplotypes h1, h2, h3, h4, each with m 0-1 sites, such that:
        • h1 and h2 explain the father's genotype, h3 and h4 explain the mother's genotype, h1 and h3 explain the offspring's genotype
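The feasibility condition on h1, h2, h3, h4 translates directly into a check. In this sketch the names `explains` and `trio_feasible` are illustrative, and a `?` in a genotype is assumed to be compatible with any pair of alleles.

```python
def explains(ha: str, hb: str, g: str) -> bool:
    """True if haplotypes ha and hb explain genotype g ('?' matches anything)."""
    if not (len(ha) == len(hb) == len(g)):
        return False
    return all(s == "?" or (a if a == b else "2") == s
               for a, b, s in zip(ha, hb, g))


def trio_feasible(h: tuple, father: str, mother: str, child: str) -> bool:
    """Check the three conditions on the quartet (h1, h2, h3, h4)."""
    h1, h2, h3, h4 = h
    return (explains(h1, h2, father)      # h1, h2 explain the father
            and explains(h3, h4, mother)  # h3, h4 explain the mother
            and explains(h1, h3, child))  # h1, h3 are inherited by the offspring


print(trio_feasible(("01", "10", "01", "00"), "22", "02", "01"))
```

The order within each parental pair matters: swapping h1 and h2 still explains the father but changes which haplotype the offspring is assumed to inherit.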

  18. • It is easy to find a feasible solution to TPP (there is an exponential number of feasible solutions)
      • We pursue a parsimonious objective, i.e., minimization of the total number of haplotypes
      • A drawback of PP is that the quality of pure parsimony phasing diminishes as the number of SNPs (as well as the number of recombinations) becomes large
        • Partition the genotypes into blocks
        • In the case of trio data we do not have the problem of joining blocks
      • Pure-Parsimony Trio Phasing (PPTP): given 3n genotypes corresponding to n family trios, find the minimum number of distinct haplotypes explaining all trios

  19. • Proposed by Halperin et al. in "Perfect phylogeny and haplotype assignment" (2004)
      • For each trio we introduce four partial haplotypes with SNPs 0, 1 and ?
      • The algorithm iteratively finds the complete haplotype that covers the maximum possible number of partial haplotypes, removes this set of resolved partial haplotypes, and continues in that manner
      • The drawback of this method is that it introduces errors in the trio constraints
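The iterative covering step can be sketched as follows. This is a brute-force illustration under stated assumptions, not the published algorithm: the names are hypothetical, a complete haplotype is taken to cover a partial one when they agree at every non-`?` site, and the best haplotype is found by trying all 2^m candidates, which is only practical for small m.

```python
from itertools import product


def covers(h: str, partial: str) -> bool:
    """True if complete haplotype h agrees with partial at every non-'?' site."""
    return all(p == "?" or p == a for a, p in zip(h, partial))


def greedy_cover(partials: list) -> list:
    """Repeatedly pick the complete haplotype covering the most remaining partials."""
    m = len(partials[0])
    remaining = list(partials)
    chosen = []
    while remaining:
        candidates = ("".join(bits) for bits in product("01", repeat=m))
        best = max(candidates,
                   key=lambda h: sum(covers(h, p) for p in remaining))
        chosen.append(best)
        # Drop the partial haplotypes resolved by the chosen haplotype.
        remaining = [p for p in remaining if not covers(best, p)]
    return chosen


print(greedy_cover(["01?", "0?1", "?11", "100"]))
```

Each partial haplotype is resolved independently of the others from the same trio, which is consistent with the drawback noted on the slide: the resulting assignment may no longer satisfy the trio constraints.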

  20. • For each trio we introduce four template haplotypes over {0,1,2,?}
        • 0, 1 – correspond to fully resolved sites, 2 – appears at SNPs where the corresponding genotype is 2, ? – unconstrained SNPs
      • Variables:
        • for each possible haplotype i, x_i ∈ {0,1}
        • for each heterozygous SNP j in each template, y_j ∈ {0,1}

