Computational Problems in Perfect Phylogeny Haplotyping: Xor-Genotypes and Tag SNPs
Tamar Barzuza (1), Jacques S. Beckmann (2,3), Ron Shamir (4), Itsik Pe’er (5)
(1) Computer Science and Applied Mathematics, Weizmann Institute of Science
(2) Molecular Genetics, Weizmann Institute of Science
(3) Génétique Médicale, Universitätsspital Lausanne
(4) School of Computer Science, Tel-Aviv University
(5) Medical and Population Genetics Group, Broad Institute
Overview
• Introduction
• Xor PPH
  • Theoretical outlines and results
  • Experimental results
• Informative SNPs
  • Theoretical results
• Summary and Future research
SNP – Single nucleotide polymorphism
Six aligned sequences; the alleles at the four variable positions (the SNPs) are listed on the left:
A G A C   AATATATCGCTATCCGTATACCTAATTGGGGGTGTGTGTACGTAATGCTAGCACGCGCGCCAGGATTAGCTGCCACA
T T A C   AATATATCGCTTTCCGTATACCTAATTTGGGGTGTGTGTACGTAATGCTAGCACGCGCGCCAGGATTAGCTGCCACA
T T C C   AATATATCGCTTTCCGTATACCTAATTTGGGGTGTGTGTACGTACTGCTAGCACGCGCGCCAGGATTAGCTGCCACA
A T C T   AATATATCGCTATCCGTATACCTAATTTGGGGTGTGTGTACGTACTGCTAGCACGCGCGCTAGGATTAGCTGCCACA
A G C T   AATATATCGCTATCCGTATACCTAATTGGGGGTGTGTGTACGTACTGCTAGCACGCGCGCTAGGATTAGCTGCCACA
A G C T   AATATATCGCTATCCGTATACCTAATTGGGGGTGTGTGTACGTACTGCTAGCACGCGCGCTAGGATTAGCTGCCACA
SNP – Single nucleotide polymorphism
The same six sequences, restricted to the four SNP positions:
A G A C
T T A C
T T C C
A T C T
A G C T
A G C T
Haplotypes, Genotypes and XOR-Genotypes
SNP:           1    2    3    4
Haplotypes:    A    G    A    C
               T    T    A    C
Genotype:      A/T  T/G  A    C
XOR-Genotype:  Het  Het  Hom  Hom
[Figure: the six haplotypes at SNPs 1-4, with each allele also labeled 0 or 1]
Haplotypes, Genotypes and XOR-Genotypes
SNP:           1  2  3  4
Haplotypes:    1  1  0  1
               0  0  0  1
Genotype:      2  2  0  1     (2 = heterozygous)
XOR-Genotype:  {1, 2}         (the set of heterozygous SNPs)
[Figure: the six haplotypes at SNPs 1-4 in 0/1 encoding]
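The encoding on this slide can be sketched in a few lines of Python. This is a minimal illustration only: the function names `genotype` and `xor_genotype` are hypothetical, and haplotypes are assumed to be equal-length lists of 0/1 values.

```python
def genotype(h1, h2):
    """Genotype of a haplotype pair: the shared allele (0 or 1) where the
    two haplotypes agree, and 2 where they differ (heterozygous)."""
    return [a if a == b else 2 for a, b in zip(h1, h2)]


def xor_genotype(h1, h2):
    """Xor-genotype of a haplotype pair: the set of 1-based SNP indices
    at which the two haplotypes differ (the heterozygous SNPs)."""
    return {i for i, (a, b) in enumerate(zip(h1, h2), start=1) if a != b}


# The haplotype pair shown on the slide.
h1 = [1, 1, 0, 1]
h2 = [0, 0, 0, 1]
print(genotype(h1, h2))      # [2, 2, 0, 1]
print(xor_genotype(h1, h2))  # {1, 2}
```

Note that the xor-genotype discards which allele is carried at the homozygous SNPs; this is the information the three additional genotypes later restore.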
Perfect Phylogeny
[Figure: a perfect phylogeny tree for binary haplotypes on 5 SNPs (SNPs only). The nodes carry haplotypes and each edge is labeled with the SNP that mutates along it: 1: 1→0, 2: 0→1, 3: 0→1, 4: 1→0, 5: 0→1.]
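One standard way to test whether a set of binary haplotypes is compatible with some perfect phylogeny is the four-gamete condition: no pair of SNPs may show all four combinations 00, 01, 10, 11. The slides do not mention this test; the sketch below is an illustration with hypothetical toy data and a hypothetical function name.

```python
from itertools import combinations


def admits_perfect_phylogeny(haplotypes):
    """Four-gamete test on a list of equal-length 0/1 haplotypes.

    Returns False as soon as some pair of SNP columns exhibits all four
    gametes (0,0), (0,1), (1,0), (1,1); returns True otherwise.
    """
    m = len(haplotypes[0])
    for i, j in combinations(range(m), 2):
        gametes = {(h[i], h[j]) for h in haplotypes}
        if len(gametes) == 4:
            return False
    return True


# Hypothetical toy data, not taken from the slides.
compatible = [[0, 0, 0], [1, 0, 0], [1, 1, 0], [0, 0, 1]]
conflicting = [[0, 0], [0, 1], [1, 0], [1, 1]]
print(admits_perfect_phylogeny(compatible))   # True
print(admits_perfect_phylogeny(conflicting))  # False
```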
Previous work
Haplotyping: haplotypes from genotypes
Input: Genotypes G={G1,…,Gn} on SNPs S={s1,…,sm}
Output: The haplotypes H={H1,…,H2n} that gave rise to G
• General heuristics:
  • Clark ’90
  • Excoffier+Slatkin ’95
• PPH: Perfect phylogeny haplotyping (n genotypes, m SNPs):
  • Gusfield 2002: O(nm α(n,m)), by reduction to graph realization
  • Bafna et al. 2002: O(nm²)
  • Eskin et al. 2003: O(nm²)
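Previous work
The graph realization problem:
Input: A hypergraph H=({1,…,m}, P), where P={P1,P2,…,Pn} and Pi ⊆ {1,…,m}
Goal: A tree T=(V,E) whose edges are labeled 1,…,m, such that each Pi labels a path in T
Example: Input: { {1,2}, {2,3} }. Output: [Figure: a tree with edges labeled 1, 2, 3 in which {1,2} and {2,3} are paths]
Algorithms:
• Tutte 1959: O(n²m)
• Gavril and Tamari 1983: O(nm²)
• Bixby and Wagner 1988: O(nm α(n,m))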
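The goal condition can be made concrete with a small checker. The sketch below does not solve graph realization; it only verifies a proposed tree against the input sets. It assumes the tree is given as a dictionary from edge label to its two endpoints and that this dictionary really describes a tree; the names `is_path` and `realizes` are hypothetical.

```python
from collections import Counter


def is_path(tree_edges, labels):
    """True if the edges named in `labels` form a single path in the tree.

    tree_edges maps an edge label to its endpoints (u, v) and is assumed
    to describe a tree. In a tree, k edges are connected exactly when they
    touch k + 1 vertices, and a connected edge set with all degrees <= 2
    is a path.
    """
    if not labels:
        return False
    degree = Counter()
    for label in labels:
        u, v = tree_edges[label]
        degree[u] += 1
        degree[v] += 1
    return len(degree) == len(labels) + 1 and max(degree.values()) <= 2


def realizes(tree_edges, hyperedges):
    """True if every set in `hyperedges` labels a path in the tree."""
    return all(is_path(tree_edges, p) for p in hyperedges)


# The slide's input { {1,2}, {2,3} } against two candidate trees.
chain_123 = {1: ("a", "b"), 2: ("b", "c"), 3: ("c", "d")}  # edges in order 1, 2, 3
chain_132 = {1: ("a", "b"), 3: ("b", "c"), 2: ("c", "d")}  # edges in order 1, 3, 2
print(realizes(chain_123, [{1, 2}, {2, 3}]))  # True
print(realizes(chain_132, [{1, 2}, {2, 3}]))  # False
```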
XPPH - Xor perfect phylogeny haplotyping
Xor-haplotyping: haplotypes from xor-genotypes
Input:
1. Xor-genotype data (can be obtained by DHPLC)
2. Three genotypes
Goal: Resolve the haplotypes and their perfect phylogeny

Example (SNPs 1-4):
Xor-genotype   Genotype
{1, 2}         0/1  0/1  0    1
{2, 4}         0    0/1  0    0/1
{2, 3, 4}      0    0/1  0/1  0/1
{1, 2, 4}      0/1  0/1  0    0/1
{1}            0/1  1    0    0
[Figure: xor-genotypes → ? → genotypes → ? → haplotypes; the haplotype pair of each individual is what has to be resolved]
XPPH - Xor perfect phylogeny haplotyping
Strategy:
1. Input: Xor-genotype data. Goal: Find the perfect phylogeny
2. Additional input: 3 genotypes. Goal: Find the haplotypes
Step 1:
• Xor-genotype = {Het SNPs} = a path in the perfect phylogeny
• Build a tree from its paths: graph realization
• Input reduction: merge SNPs that are equivalent in the xor-data
• Proof: unique graph realization solution ⇒ a perfect phylogeny
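The input reduction can be sketched as follows, under the assumption that two SNPs are "equivalent in the xor-data" when they occur in exactly the same xor-genotypes. The function name `merge_equivalent_snps` and the example data are hypothetical.

```python
from collections import defaultdict


def merge_equivalent_snps(xor_genotypes, m):
    """Group SNPs 1..m that appear in exactly the same xor-genotypes.

    xor_genotypes is a list of sets of 1-based SNP indices. Each returned
    group can be treated as a single SNP for graph realization.
    """
    groups = defaultdict(list)
    for snp in range(1, m + 1):
        # The signature of a SNP is the set of individuals heterozygous at it.
        signature = frozenset(i for i, x in enumerate(xor_genotypes) if snp in x)
        groups[signature].append(snp)
    return sorted(groups.values())


# Hypothetical data: SNPs 2 and 3 always occur together, so they are merged.
print(merge_equivalent_snps([{1, 2, 3}, {2, 3}, {1, 4}], m=4))
# [[1], [2, 3], [4]]
```

On the xor-genotypes of the earlier example slide, {1,2}, {2,4}, {2,3,4}, {1,2,4}, {1}, no two SNPs share a signature, so nothing is merged.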
GREAL
We implemented Gavril & Tamari’s algorithm (’83) for graph realization: O(m²n)
• Finds a graph realization or determines that none exists
• Counts the number of graph realization solutions for the data
• Stable and fast
• Available at http://www.cs.tau.ac.il/~rshamir/greal/
Simulations
• Simulate data of n individuals using Hudson 2002
• Remove all SNPs with <5% minor allele frequency
• Apply GREAL: Is there a single solution?
• Repeat 5000 times for each n
Results
[Figure: the percentage of single solutions vs. sample size; the plot is annotated with R.H. Chung and D. Gusfield 2003]
XPPH: Step 2, from the perfect phylogeny to the haplotypes
Xor-genotypes: {1, 2}, {1, 3}, {2, 3}
[Figure: the perfect phylogeny on SNPs 1, 2, 3 and the haplotypes at its nodes]
Resolution up to bit flipping: gives the haplotypes structure
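XPPH: Step 2, from the perfect phylogeny to the haplotypes
Xor-genotypes: {1, 2}, {1, 3}, {2, 3}
[Figure: the tree on SNPs 1, 2, 3 with one genotyped individual; haplotypes shown as 1 x x, 1 x x and 0 x x]
• SNP #1 homozygous in the genotype: can infer SNP #1 for all haplotypes
• Need individuals with ∩ xor-genotypes (= {het SNPs}) = ∅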
Theorem: ∩ xor-genotypes = ∅ ⇒ there are three xor-genotypes with empty intersection
Proof:
Note: xor-genotypes are tree paths (otherwise: NP-hard)
(1) The intersection of two tree paths is an interval (a contiguous subpath, possibly empty)
(Proof)
(2) Pick X1 arbitrarily and take X1 ∩ X2, X1 ∩ X3, …, X1 ∩ Xn; by (1), each is an interval on the path X1
(3) Let XL be the xor-genotype whose interval ends first along X1, and XR the one whose interval begins last
Since the intersection of all the intervals is empty, the interval of XL ends before the interval of XR begins, hence X1 ∩ XL ∩ XR = ∅
[Figure: the path X1 with the intervals of XL and XR marked on it]
• Find 3 individuals to genotype in O(nm)
• Resolve the haplotypes
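The selection of XL and XR from the proof can be sketched as below. Two assumptions are made that go beyond the slides: the SNPs of X1 are given in their order along the tree path (which requires the phylogeny from step 1), and every intersection with X1 is contiguous in that order, as it must be for tree paths. The function name is hypothetical, and this simple version sorts positions rather than aiming for the O(nm) bound.

```python
def pick_left_and_right(x1_path, others):
    """Return indices (L, R) into `others` with X1 ∩ X_L ∩ X_R = ∅, or None.

    x1_path: list of SNPs of X1 in their order along the tree path.
    others:  list of sets of SNPs (the remaining xor-genotypes).
    """
    pos = {snp: k for k, snp in enumerate(x1_path)}
    intervals = []
    for idx, x in enumerate(others):
        hits = sorted(pos[s] for s in x if s in pos)
        if not hits:
            # X1 and this xor-genotype are already disjoint.
            return idx, idx
        # For tree paths the intersection must be a contiguous interval.
        assert hits[-1] - hits[0] + 1 == len(hits), "intersection is not an interval"
        intervals.append((hits[0], hits[-1], idx))
    left = min(intervals, key=lambda t: t[1])   # interval that ends first
    right = max(intervals, key=lambda t: t[0])  # interval that begins last
    if left[1] < right[0]:
        return left[2], right[2]
    return None  # all intervals overlap: the total intersection is non-empty


# Hypothetical example: X1 runs along SNPs 1, 2, 3, 4, 5.
x1 = [1, 2, 3, 4, 5]
print(pick_left_and_right(x1, [{1, 2, 9}, {4, 5, 8}, {2, 3, 4}]))  # (0, 1)
print(pick_left_and_right(x1, [{2, 3}, {3, 4}]))                   # None
```

In the first call, {1, 2, 9} ends at position 1 of X1 and {4, 5, 8} begins at position 3, so together with X1 they have an empty intersection. In the second call every set contains SNP 3, so no such triple exists.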
Informative SNPs
Input:
1. Haplotypes H={H1,…,Hn} on SNPs S={s1,…,sm}
2. A set of interesting SNPs S" ⊆ S
Output: A minimal set S' ⊆ S\S" that distinguishes the same haplotypes as S"

Example (4 haplotypes as rows, SNPs 1-5 as columns):
1 0 0 0 0
0 0 1 0 0
0 0 0 1 1
0 1 0 1 0

Informative SNPs (Bafna et al. 2003):
• Not perfect phylogeny: NP-hard (MINIMUM TEST SET)
• Perfect phylogeny, 1 interesting SNP: O(nm), Bafna et al. 2003
Informative SNPs: generalization of the previous definition
Input:
1. Haplotypes H={H1,…,Hn} on SNPs S={s1,…,sm}
2. A set of interesting SNPs S" ⊆ S
3. A perfect phylogeny for H
4. A cost function C: S → R+
Output: A set S' ⊆ S\S" of minimal cost that distinguishes the same haplotypes as S"
[Figure: the same 4 haplotypes × 5 SNPs example matrix as on the previous slide]
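The problem statement can be illustrated with an exhaustive search. This is a brute-force sketch that is exponential in the number of SNPs; it is not the dynamic programming algorithm of the talk and it ignores the perfect phylogeny input. It assumes that "distinguishes the same haplotypes" means that every pair of haplotypes separated by some interesting SNP is also separated by some chosen SNP. The haplotype matrix is the slide's example read row by row (an assumption about its layout); the costs and function names are hypothetical.

```python
from itertools import combinations


def distinguished_pairs(haplotypes, snps):
    """Pairs of haplotype indices that differ at some SNP in `snps` (1-based)."""
    n = len(haplotypes)
    return {(a, b) for a, b in combinations(range(n), 2)
            if any(haplotypes[a][s - 1] != haplotypes[b][s - 1] for s in snps)}


def min_cost_informative(haplotypes, interesting, cost):
    """Cheapest subset of the non-interesting SNPs that separates every
    haplotype pair separated by the interesting SNPs (exhaustive search)."""
    m = len(haplotypes[0])
    target = distinguished_pairs(haplotypes, interesting)
    candidates = [s for s in range(1, m + 1) if s not in interesting]
    best, best_cost = None, float("inf")
    for k in range(len(candidates) + 1):
        for subset in combinations(candidates, k):
            c = sum(cost[s] for s in subset)
            if c < best_cost and target <= distinguished_pairs(haplotypes, subset):
                best, best_cost = set(subset), c
    return best, best_cost


H = [[1, 0, 0, 0, 0],
     [0, 0, 1, 0, 0],
     [0, 0, 0, 1, 1],
     [0, 1, 0, 1, 0]]
costs = {1: 3, 2: 1, 3: 3, 5: 1}  # hypothetical costs for the candidate SNPs
print(min_cost_informative(H, interesting={4}, cost=costs))  # ({2, 5}, 2)
```

With SNP 4 as the only interesting SNP, both {1, 3} and {2, 5} separate the required haplotype pairs; under the hypothetical costs the search returns {2, 5}.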
We find an informative SNP set
• Of minimal cost
• For any number of interesting SNPs
• In O(m)
• By a dynamic programming algorithm that climbs up the perfect phylogeny tree
We prove that the definition of informative SNPs generalizes to a more practical definition
• Under the perfect phylogeny model, informative SNPs on genotypes and haplotypes are equivalent
Summary
• Xor-haplotyping:
  • Definition
  • Resolve haplotypes given xor-data and 3 genotypes in O(nm α(m,n))
  • Implementation
  • Experimental results
• Selection of tag SNPs:
  • Generalize to arbitrary cost and many interesting SNPs
  • Find an optimal informative SNP set in O(m) time
  • Combinatorial observation allows practical uses
Future research
• Relax the strong assumption of perfect phylogeny
• Deal with data errors and missing data
• Obtain empirical results for the theoretical work on informative SNPs
  • Preliminary results show that blocks of up to 600 SNPs are distinguishable by ~20 informative SNPs
Theorem: All genotypes are distinct within a block
Proof: Assume to the contrary that two genotypes are equivalent:
[Figure: Haplotype Pair 1 giving Genotype 1 and Haplotype Pair 2 giving Genotype 2]