Efficient Computation of Minimum Recombination With Genotypes (Not Haplotypes)

Yufeng Wu and Dan Gusfield, University of California, Davis. CSB 2006

Haplotypes/Genotypes
• Diploid organisms have two (not identical) copies of each chromosome. A single copy is a haplotype, a vector of 0s and 1s. The mixed description of both copies is a genotype, a vector of 0s, 1s, and 2s. At each site:
• If both haplotypes are 0, the genotype is 0
• If both haplotypes are 1, the genotype is 1
• If one is 0 and the other is 1, the genotype is 2
• Key fact: genotypes are easier to collect, but many downstream applications work better with haplotypes
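The encoding rule above can be sketched in a few lines of Python. The function name `genotype` is a hypothetical helper, not something from the paper; the two haplotypes are the ones shown on the haplotyping slide of this talk.

```python
def genotype(h1, h2):
    """Combine two equal-length 0/1 haplotypes into a 0/1/2 genotype.

    Hypothetical helper: a site where both copies agree keeps that value,
    a site where they differ is recorded as 2.
    """
    if len(h1) != len(h2):
        raise ValueError("haplotypes must have equal length")
    return [a if a == b else 2 for a, b in zip(h1, h2)]


# Haplotype pair from the haplotyping slide (sites 1 to 9).
h1 = [0, 1, 1, 1, 0, 0, 1, 1, 0]
h2 = [1, 1, 0, 1, 0, 0, 1, 0, 0]
print(genotype(h1, h2))  # [2, 1, 2, 1, 0, 0, 1, 2, 0]
```

The mapping is many-to-one: swapping the values at any site coded 2 gives a different haplotype pair with the same genotype, which is why haplotype inference is needed at all.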

Haplotyping
Sites:       1 2 3 4 5 6 7 8 9
Genotype:    2 1 2 1 0 0 1 2 0
Phasing the 2s gives the haplotype pair:
Haplotype 1: 0 1 1 1 0 0 1 1 0
Haplotype 2: 1 1 0 1 0 0 1 0 0
• Haplotype Inference (HI) Problem: given a set of n genotypes, infer the n real haplotype pairs that form the given genotypes

Two-stage Approach
• Given a set of genotypes G, we are interested in downstream problems
• There are many HI solutions for G
• Two-stage approach: first infer the “correct” HI solution from the genotypes, then do the downstream analysis with the inferred haplotypes
• Haplotype inference has been studied extensively and is believed to be accurate to a certain extent

One-stage Approach
• What effect does haplotyping inaccuracy have on downstream questions?
• Our work: directly use genotype data for downstream problems
• Without fixing a choice for the HI solution
• Minimum recombination problem

Recombination: Single Crossover
• Recombination is one of the principal genetic forces shaping variation within species
• Two equal-length sequences generate a third equal-length sequence
• Example: from the parents 110001111111001 and 000110000001111, the recombinant takes the prefix 11000 of the first parent up to the breakpoint and the suffix 0000001111 of the second parent after it
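A minimal sketch of a single crossover, using the two parent sequences from this slide. The function name `recombine` and the convention that `breakpoint` counts the sites taken from the first parent are assumptions made for illustration; the value 5 is read off the slide's prefix 11000.

```python
def recombine(prefix_parent, suffix_parent, breakpoint):
    """Single-crossover recombinant of two equal-length sequences.

    Hypothetical helper: the child copies the first `breakpoint` sites of
    `prefix_parent` and the remaining sites of `suffix_parent`.
    """
    if len(prefix_parent) != len(suffix_parent):
        raise ValueError("parents must have equal length")
    if not 0 < breakpoint < len(prefix_parent):
        raise ValueError("breakpoint must fall strictly inside the sequence")
    return prefix_parent[:breakpoint] + suffix_parent[breakpoint:]


parent_a = "110001111111001"
parent_b = "000110000001111"
child = recombine(parent_a, parent_b, 5)
print(child)  # 110000000001111
```

The child has the same length as both parents, which is the property the slide emphasizes.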

Kreitman’s Data (1983)
0000000011000000001101110111100000000000000
0010000000000000001101110111100000000000000
0000000000000000000000000000000000010000101
0000000000000000110000000000000000010011000
0001100010110011110000000000000000001000000
0010000000000001000000000000001010111000010
0010000000000001000000000000011111101000000
1111100010111001000000000000011111101100000
1111100010111001000000000000011111101100000
1111100010111001000000000000011111101100000
1111111110000101000010001000011111101000000
• Question: what is the minimum number of recombinations needed to derive these sequences?
• Assume at most 1 mutation per site

Minimizing Recombination
• Compute the minimum number of recombinations (Rmin) for deriving a set of haplotypes, assuming at most 1 mutation per site
• NP-hard in general
• Heuristics
• Lower bounds on Rmin

Lower Bounds on Genotypes
• For a particular recombination lower bound method L, what is the range of possible bounds for L over all possible HI solutions?
• MinL(G): minimum of L over all HI solutions for G.
• MaxL(G): maximum of L over all HI solutions for G.
• This paper: the HK bound, the connected component bound, and the relaxed haplotype bound.
• Polynomial-time algorithms for MaxHK and MinCC.
• Heuristic method for the relaxed haplotype bound.

Lower Bound: Incompatibility
M   1 2 3 4 5
a   0 0 0 1 0
b   1 0 0 1 0
c   0 0 1 0 0
d   1 0 1 0 0
e   0 1 1 0 0
f   0 1 1 0 1
g   0 0 1 0 1
Incompatibility Graph (IG): a node for each site, an edge between each incompatible pair of sites
• Two sites (columns) p, q are incompatible if columns p, q contain all four ordered pairs (gametes): 00, 01, 10, 11
• Sites p, q are incompatible ⇒ a recombination must occur between p and q
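The four-gamete test above can be sketched as follows. The function names are hypothetical, and the matrix M is typed in under the assumption that the slide's 35 entries are read row by row for rows a to g.

```python
from itertools import combinations

ALL_GAMETES = {"00", "01", "10", "11"}

# Matrix M from the slide, one string per row (row-wise reading is an assumption).
M = {
    "a": "00010",
    "b": "10010",
    "c": "00100",
    "d": "10100",
    "e": "01100",
    "f": "01101",
    "g": "00101",
}


def incompatible(rows, p, q):
    """True if columns p and q (0-based) contain all four gametes."""
    return {r[p] + r[q] for r in rows} == ALL_GAMETES


def incompatibility_edges(rows):
    """Edges of the incompatibility graph, reported with 1-based site numbers."""
    m = len(rows[0])
    return [(p + 1, q + 1)
            for p, q in combinations(range(m), 2)
            if incompatible(rows, p, q)]


edges = incompatibility_edges(list(M.values()))
print(edges)  # [(1, 3), (1, 4), (2, 5)]
```

Under that reading of M, every edge spans at least one other edge's interior, which is consistent with the HK lower bound of 1 reported for this example.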

HK Bound (1985)
• Arrange the nodes of the incompatibility graph on a line in the order in which the sites appear in the sequence.
• HK bound = maximum number of non-overlapping edges in the incompatibility graph (IG).
• Easy to compute for haplotype data.
• For the example IG on sites 1 to 5: HK lower bound = 1
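One way to compute "maximum number of non-overlapping edges" is the standard greedy for interval scheduling: scan edges by right endpoint and keep every edge that starts at or after the last kept edge ends. This is a sketch of that generic technique, not necessarily how the authors implement it. Treating edges that share an endpoint as non-overlapping is an assumption, inferred from the talk's example E(G) = {12, 23, 35}. The first edge list is what the four-gamete test gives by hand on the example matrix M and is worth re-checking.

```python
def hk_bound(edges):
    """Greedy count of pairwise non-overlapping edges on a line of sites.

    `edges` holds (left, right) site pairs with left < right, sites numbered
    from 1. Two edges may share an endpoint (assumption, see lead-in) but
    their open intervals must not intersect.
    """
    count = 0
    last_right = 0  # right endpoint of the most recently kept edge
    for left, right in sorted(edges, key=lambda e: e[1]):
        if left >= last_right:
            count += 1
            last_right = right
    return count


print(hk_bound([(1, 3), (1, 4), (2, 5)]))  # 1
print(hk_bound([(1, 2), (2, 3), (3, 5)]))  # 3
```

Sorting dominates the cost, so the bound takes O(e log e) time for e edges, which matches the slide's remark that it is easy to compute for haplotype data.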

IG for HI Solutions
Genotypes G (sites 1 to 5): 01010, 10101, 00202, 22200
HI1 (HK = 1): 01010/01010, 10101/10101, 00000/00101, 01000/10100
HI2 (HK = 3): 01010/01010, 10101/10101, 00001/00100, 00000/11100

HK Bounds on Genotypes
• An efficient algorithm for MinHK(G) is already known (Wiuf, 2004).
• This paper: a polynomial-time algorithm for MaxHK(G)
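For a tiny instance, MinHK(G) and MaxHK(G) can be checked by brute force: enumerate every HI solution and take the smallest and largest HK bound. The sketch below does that for the four genotypes used in this talk's example. It is exponential in the number of 2s and is only a sanity check, not the polynomial-time algorithm of Wiuf or of this paper. All function names are hypothetical, and shared endpoints are assumed not to count as overlap.

```python
from itertools import combinations, product

ALL_GAMETES = {"00", "01", "10", "11"}


def phasings(g):
    """All unordered haplotype pairs that explain genotype string g."""
    twos = [i for i, c in enumerate(g) if c == "2"]
    if not twos:
        return [(g, g)]
    pairs = []
    # Fix the first heterozygous site to 0 on h1 to skip mirrored duplicates.
    for bits in product("01", repeat=len(twos) - 1):
        h1, h2 = list(g), list(g)
        for i, b in zip(twos, ("0",) + bits):
            h1[i] = b
            h2[i] = "1" if b == "0" else "0"
        pairs.append(("".join(h1), "".join(h2)))
    return pairs


def hk_bound(haplotypes):
    """HK bound of a haplotype set: greedy over four-gamete edges (0-based)."""
    m = len(haplotypes[0])
    edges = [(p, q) for p, q in combinations(range(m), 2)
             if {h[p] + h[q] for h in haplotypes} == ALL_GAMETES]
    count, last_right = 0, -1
    for p, q in sorted(edges, key=lambda e: e[1]):
        if p >= last_right:
            count, last_right = count + 1, q
    return count


def hk_range(genotypes):
    """(MinHK, MaxHK) over all HI solutions, by exhaustive enumeration."""
    bounds = [hk_bound([h for pair in solution for h in pair])
              for solution in product(*(phasings(g) for g in genotypes))]
    return min(bounds), max(bounds)


G = ["01010", "10101", "00202", "22200"]
print(hk_range(G))  # (1, 3)
```

The two extremes agree with the HK values 1 and 3 reported for the solutions HI1 and HI2 in the talk.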

Maximal Incompatibility Graph MIG(G)
Genotypes G (sites 1 to 5): 01010, 10101, 00202, 22200
• There is an edge between sites p and q if there is a phasing of p, q that makes p and q incompatible
• Each pair of sites is considered independently
• E(G): a maximum-sized set of non-overlapping edges in MIG(G)
• For this G: E(G) = {12, 23, 35}
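A sketch of how MIG(G) and E(G) could be computed for the example. The pairwise test below is one possible reading of "there is a phasing of p, q": a genotype with a 2 at exactly one of the two sites contributes two fixed gametes, while a genotype with 2 at both sites may contribute either {00, 11} or {01, 10}. The function names and the greedy choice of E(G) are illustrative assumptions, not the paper's code.

```python
from itertools import combinations

ALL_GAMETES = {"00", "01", "10", "11"}


def can_be_incompatible(genotypes, p, q):
    """True if some phasing of sites p, q (0-based) yields all four gametes."""
    forced = set()  # gametes present under every phasing
    free = 0        # genotypes that are 2 at both sites
    for g in genotypes:
        a, b = g[p], g[q]
        if a == "2" and b == "2":
            free += 1
        elif a == "2":
            forced |= {"0" + b, "1" + b}
        elif b == "2":
            forced |= {a + "0", a + "1"}
        else:
            forced.add(a + b)
    missing = ALL_GAMETES - forced
    if free >= 2:
        return True  # one genotype supplies {00, 11}, another {01, 10}
    if free == 1:
        return missing <= {"00", "11"} or missing <= {"01", "10"}
    return not missing


def mig_edges(genotypes):
    """Edges of MIG(G) with 1-based site numbers."""
    m = len(genotypes[0])
    return [(p + 1, q + 1) for p, q in combinations(range(m), 2)
            if can_be_incompatible(genotypes, p, q)]


def max_nonoverlapping(edges):
    """Greedy maximum set of edges whose interiors do not overlap."""
    chosen, last_right = [], 0
    for left, right in sorted(edges, key=lambda e: e[1]):
        if left >= last_right:
            chosen.append((left, right))
            last_right = right
    return chosen


G = ["01010", "10101", "00202", "22200"]
print(mig_edges(G))                      # [(1, 2), (1, 3), (1, 5), (2, 3), (3, 5)]
print(max_nonoverlapping(mig_edges(G)))  # [(1, 2), (2, 3), (3, 5)]
```

The selected edges match the slide's E(G) = {12, 23, 35}. Because each pair is tested independently, MIG(G) can contain edges that no single HI solution realizes together.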

Claim: MaxHK(G) = |E(G)|
• MaxHK(G) ≤ |E(G)|, because MIG(G) is a supergraph of IG(H) for any HI solution H
• If we can find an HI solution H in which every pair of sites in E(G) is incompatible, then HK(H) ≥ |E(G)|, so MaxHK(G) ≥ |E(G)|
• Together, MaxHK(G) = |E(G)|

Finding such an H
• Phase sites from left to right.
• Each component in E(G) is a simple path
• Each site is constrained by at most one site to its left

Phasing G for Incompatibility
Site 1 phased: 01010/01010, 10101/10101, 00?0?/00?0?, 0??00/1??00
Site 2 phased: 01010/01010, 10101/10101, 00?0?/00?0?, 00?00/11?00
Site 3 phased: 01010/01010, 10101/10101, 0010?/0000?, 00000/11100
• No matter how a previous site p is phased, we can always phase the current site q to make p, q incompatible

Haplotyping With Minimum Number of Recombinations
• Compute Rmin(G)
• Haplotyping on a network with the fewest recombinations
• NP-hard
• This paper: a branch and bound method computing the exact Rmin(G) for data with a small number of sites
• APOE data: 47 non-trivial genotypes, 9 sites
• Our method: 2 minutes, Rmin(G) = 5

Application: Recombination Hotspot
• Recombination hotspot: a region where the recombination rate is much higher than in neighboring regions
• Previous study (Bafna and Bansal, 2005): a recombination lower bound computed on inferred haplotypes was used to identify recombination hotspots
• Our work: compute the exact Rmin(G) with genotypes for a sliding window of a small number of SNPs to detect recombination hotspots

MS32 data (Jeffreys et al., 2001)
[Figure, two panels: result from haplotypes (Bafna and Bansal, 2005); result from original genotypes (this paper)]

Other Applications
• Finding the true Rmin from genotypes G
• Two-stage approach: run PHASE to get an HI solution H, and compute Rmin(H)
• One-stage approach: directly compute Rmin(G)
• Accuracy of haplotype inference on a minimum network
• Simulation results: comparable, slightly weaker, and not conclusive

Summary
• Main goal of this paper: develop computational tools for the minimum recombination problem with genotypes
• Polynomial-time algorithms for the MaxHK and MinCC problems
• Practical heuristics for other problems
• Simulation results for several application questions are not conclusive
• Our tools facilitate the study of these problems

Thank You
• Software: available upon request
