
Polyploid haplotype assembly



Presentation Transcript


  1. Polyploid haplotype assembly. CSCI2820: Medical Bioinformatics. Derek Aguiar

  2. G`C haplotype phasing
     (Figure: graph on vertices v0, v1, v2, v3 whose edges carry two-variant phasing labels 10, 01, 11 and 00.)

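  3. Edge decidability
     • New to the haplotype assembly model
     • Phasings of edges are no longer disjoint
     (Figure: in the diploid example the two candidate phasings {11, 00} and {01, 10} have shared haplotypes = {}; in the triploid example the candidates {11, 00, 00} and {01, 10, 00} have shared haplotypes = {00}.)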
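The overlap between two candidate edge phasings can be checked with a multiset intersection. The sketch below is illustrative only: the function name `shared_haplotypes` is hypothetical, and the diploid and triploid phasings are one reading of the slide's figure.

```python
from collections import Counter

def shared_haplotypes(phasing_a, phasing_b):
    """Return the haplotypes two edge phasings have in common (as a multiset)."""
    # Counter & Counter keeps each haplotype with the minimum of its two counts.
    common = Counter(phasing_a) & Counter(phasing_b)
    return sorted(common.elements())

# Diploid: the two possible phasings of an edge share nothing.
diploid = shared_haplotypes(["11", "00"], ["01", "10"])

# Triploid (assumed example): both candidate phasings contain haplotype 00.
triploid = shared_haplotypes(["11", "00", "00"], ["01", "10", "00"])

print(diploid)   # []
print(triploid)  # ['00']
```

Because the triploid candidates overlap, a read supporting 00 no longer distinguishes between them, which is why edge decidability becomes an issue in the polyploid model.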

  4. Polyploid edge decidability • Edge phasing is decided by maximum likelihood
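The slides do not state the likelihood model, so the following is only a sketch under assumed conditions: each read is drawn uniformly from one of the k haplotypes of a candidate phasing, and each allele is misread independently with a hypothetical error rate `eps`. All function names, the reads, and the candidate list are made up for illustration.

```python
import math

def read_likelihood(read, phasing, eps):
    """P(read | phasing): average over the haplotypes the read may come from."""
    total = 0.0
    for hap in phasing:
        p = 1.0
        for observed, true in zip(read, hap):
            # Assumed error model: an allele is misread with probability eps.
            p *= (1 - eps) if observed == true else eps
        total += p / len(phasing)
    return total

def log_likelihood(reads, phasing, eps=0.01):
    """Log-likelihood of all reads spanning the edge, assuming independence."""
    return sum(math.log(read_likelihood(r, phasing, eps)) for r in reads)

def most_likely_phasing(reads, candidates, eps=0.01):
    """Pick the candidate phasing with the highest log-likelihood."""
    return max(candidates, key=lambda ph: log_likelihood(reads, ph, eps))

# Hypothetical reads covering both variants of one edge (triploid, k = 3).
reads = ["11", "00", "00", "11", "00", "01"]
candidates = [("11", "00", "00"), ("01", "10", "00")]
best = most_likely_phasing(reads, candidates)
print(best)  # ('11', '00', '00')
```

Working in log space avoids underflow when many reads cover an edge; `eps` must be strictly between 0 and 1 so that no read has zero likelihood.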

  5. Disjoint paths
     • Given k source and sink pairs in a graph, find k pairwise disjoint paths, one connecting each source-sink pair. The paths may be required to be
       • node disjoint
       • edge disjoint
     • If k is part of the input, the problem is NP-complete (hard)
     • If k is fixed (not part of the input), polynomial time solutions exist for undirected graphs (although not practical)

  6. GC subgraph (no conflict)
     • The chain graph is a trellis graph.
     • The chain graph
       • Vertex for each 2 variant phased haplotype, organized in levels
       • Edges connect valid haplotype extensions (shared variant allele)
     • Disjoint paths in trellis graphs are polynomially solvable and algorithms are fast in practice.
     (Figure: cycle on v1, v2, v3; chain graph levels v1v2, v2v3, v3v1, each with haplotypes 11 and 00; sources s1, s0 and sinks t1, t0.)

  7. GC subgraph (conflict)
     • The chain graph
       • Vertex for each 2 variant phased haplotype, organized in levels
       • Edges connect valid haplotype extensions (shared variant allele)
     (Figure: cycle on v0, v1, v2; chain graph levels v0v1 with haplotypes 01 and 10, v1v2 with 11 and 00, v2v0 with 11 and 00; sources s1, s0 and sinks t1, t0.)
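The two chain graphs above can be explored with a small brute-force sketch. It assumes the diploid case with distinct haplotypes per level, treats a path as valid only if it returns to the allele it started with at the first variant, and uses hypothetical function names; it is not the fast trellis algorithm the slides refer to.

```python
from itertools import product

def closed_paths(levels):
    """Paths through the chain graph that close the cycle consistently.

    levels[i] lists the 2 variant haplotypes of the i-th cycle edge; "10" on
    edge (v0, v1) means allele 1 at v0 and allele 0 at v1.
    """
    paths = [[h] for h in levels[0]]
    for level in levels[1:]:
        # Extend a path only when the shared variant carries the same allele.
        paths = [p + [h] for p in paths for h in level if p[-1][1] == h[0]]
    # The last edge returns to the first variant, so its alleles must agree.
    return [p for p in paths if p[-1][1] == p[0][0]]

def disjoint_closed_paths(levels):
    """One closed path per first-level haplotype, pairwise node disjoint, or None."""
    paths = closed_paths(levels)
    per_source = [[p for p in paths if p[0] == h] for h in levels[0]]
    for choice in product(*per_source):
        # Node disjoint: no haplotype of any level is used by two paths.
        if all(len(set(column)) == len(column) for column in zip(*choice)):
            return list(choice)
    return None

no_conflict = [["11", "00"], ["11", "00"], ["11", "00"]]  # levels v1v2, v2v3, v3v1
conflict = [["01", "10"], ["11", "00"], ["11", "00"]]     # levels v0v1, v1v2, v2v0

print(disjoint_closed_paths(no_conflict))  # [['11', '11', '11'], ['00', '00', '00']]
print(disjoint_closed_paths(conflict))     # None
```

In the conflicting cycle, the path starting at 01 reaches v0 with allele 1 and the path starting at 10 reaches it with allele 0, so neither closes and no set of disjoint paths exists.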

  8. Correcting errors
     • Edge with least likely phasing
     • Minimum weighted edge removal (MWER)
     (Figure: conflicting cycle on v0, v1, v2 with its chain graph; the haplotypes shown over v0v1v2 are 100 and 011.)

  9. Polyploid phase extension
     • New to the haplotype assembly model
     • Phase extension is no longer unique
     (Figure: vertices v0, v1, v2 with edge phasings; the haplotypes over v0v1v2 include 110 and entries marked ???.)

  10. G`C haplotype phasing
     (Figure: graph on vertices v0, v1, v2, v3 whose edges carry two-variant phasing labels 10, 01, 11 and 00.)

  11. Haplotype phasing
     (Figure: vertices v3, v1, v2 with edge phasings 11 and 00.)

  12. Polyploid chain
     (Figure: chain graph with levels v1v2, v2v3, v3v1, three sources s0, s1, s2, and nodes annotated with sinks t0, t1, t2 or t1,2.)

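  13. Matchings in bipartite graphs
     • Given an undirected bipartite graph
     • A matching M is a set of edges such that each node appears in at most 1 edge of M (in a perfect matching, exactly 1)
     (Figure: bipartite graph with v1, v2, v3 on one side and v4, v5, v6 on the other.)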

  14. Transformation to max flow
     • A flow is defined for a directed graph with edge capacities.
     • Each edge can accept at most its capacity in integer flow (capacity constraint)
     • Each node (besides source and sink) holds the property that the amount of flow incoming = flow outgoing (flow conservation constraint)
     (Figure: the bipartite graph on v1 to v6 with an added source s and sink t; every edge has capacity 1.)

  15. Transformation to max flow
     • Maximize the flow from source (s) to sink (t)
     • Exercise: prove that a max flow in this reduced graph corresponds to a maximum bipartite matching.

  16. Transformation to max flow
     • Let M be the set of edges between the bipartite levels that are included in the flow.
     • Prove M corresponds to a matching. Every edge has capacity 1, so each node has at most 1 incoming and 1 outgoing edge in the flow (capacity and flow conservation constraints).

  17. Transformation to max flow
     • Prove the matching is maximum (matching -> flow, flow -> matching).
     • Let f be the number of edges of M in the maximum flow. If there were a matching with more than f edges, it would define a flow of value greater than f, so the flow we computed would not have been maximum.
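The reduction in slides 14 to 17 can be sketched as follows. This is a minimal illustration using the Edmonds-Karp variant of augmenting-path max flow; the slides do not name a particular flow algorithm. The node names `"s"` and `"t"`, the function names, and the example edge list are assumptions, since the figure's actual edges are not recoverable from the transcript.

```python
from collections import defaultdict, deque

def max_flow(capacity, source, sink):
    """Edmonds-Karp: augment along shortest paths in the residual graph."""
    residual = defaultdict(lambda: defaultdict(int))
    for (u, v), c in capacity.items():
        residual[u][v] += c
        residual[v][u] += 0  # make sure the reverse edge exists
    value = 0
    while True:
        # Breadth-first search for a path with spare capacity.
        parent = {source: None}
        queue = deque([source])
        while queue and sink not in parent:
            u = queue.popleft()
            for v, c in residual[u].items():
                if c > 0 and v not in parent:
                    parent[v] = u
                    queue.append(v)
        if sink not in parent:
            return value, residual
        # Find the bottleneck capacity along the path.
        bottleneck, v = float("inf"), sink
        while parent[v] is not None:
            bottleneck = min(bottleneck, residual[parent[v]][v])
            v = parent[v]
        # Push flow forward and add it to the reverse edges.
        v = sink
        while parent[v] is not None:
            residual[parent[v]][v] -= bottleneck
            residual[v][parent[v]] += bottleneck
            v = parent[v]
        value += bottleneck

def bipartite_matching(left, right, edges):
    """Reduce matching to max flow with unit capacities, then read off M."""
    capacity = {("s", u): 1 for u in left}
    capacity.update({(v, "t"): 1 for v in right})
    capacity.update({(u, v): 1 for u, v in edges})
    value, residual = max_flow(capacity, "s", "t")
    # A unit edge carries flow exactly when its residual capacity is 0.
    matching = [(u, v) for u, v in edges if residual[u][v] == 0]
    return value, matching

# Hypothetical edges for the bipartite graph on v1..v6.
edges = [("v1", "v4"), ("v1", "v5"), ("v2", "v4"), ("v3", "v5"), ("v3", "v6")]
value, matching = bipartite_matching(["v1", "v2", "v3"], ["v4", "v5", "v6"], edges)
print(value, matching)
```

The flow value equals the number of matched edges, and the unit capacities on the source and sink edges are what force each node into at most one edge of M.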

  18. Polyploid chain
     Lemma: There exists at least one valid phasing of k haplotypes for a cycle c if and only if there exists a valid matching between sink node annotation and chain graph nodes at each level of Gc.
     (Figure: chain graph with levels v1v2, v2v3, v3v1, three sources s0, s1, s2, and nodes annotated with sinks t0, t1, t2 or t1,2.)

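  19. Good matching
     Level v1v2 (Figure: sinks t0, t1, t2 matched to the chain graph nodes of this level.)

  20. Good matching
     Levels v2v3 and v2v0 (Figure: sinks t0, t1, t2 matched to the chain graph nodes of these levels.)

  21. Haplotype phasing
     (Figure: vertices v0, v1, v2 with edge phasings 11, 10, 11, 00, 01 and 00.)

  22. Polyploid chain
     (Figure: chain graph with levels v0v1, v1v2, v2v0, three sources s0, s1, s2, and nodes annotated with sinks t0, t1 and t2; most nodes are annotated t2.)

  23. No valid matching
     Level v0v1 (Figure: sinks t0, t1, t2 cannot all be matched to distinct chain graph nodes of this level.)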
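The per-level test in the lemma can be sketched as a small bipartite matching between sinks and the nodes of one level. The sketch uses a standard augmenting-path matching (Kuhn's algorithm); the function name and both annotation lists are hypothetical, loosely modeled on labels such as t0 and t1,2 rather than copied from the figures.

```python
def match_sinks_to_nodes(annotations, k):
    """Try to give each of the k sinks its own node at one chain graph level.

    annotations[i] is the set of sink indices annotated on node i.
    Returns a dict sink -> node index, or None when no valid matching exists.
    """
    node_of_sink = {}

    def try_assign(node, seen):
        for sink in sorted(annotations[node]):
            if sink in seen:
                continue
            seen.add(sink)
            # Take a free sink, or move its current node to another sink.
            if sink not in node_of_sink or try_assign(node_of_sink[sink], seen):
                node_of_sink[sink] = node
                return True
        return False

    for node in range(len(annotations)):
        try_assign(node, set())
    return node_of_sink if len(node_of_sink) == k else None

# Assumed "good" level: nodes annotated t0, t1,2 and t1,2.
good = match_sinks_to_nodes([{0}, {1, 2}, {1, 2}], k=3)

# Assumed "bad" level: no node is annotated with t1.
bad = match_sinks_to_nodes([{0}, {2}, {2}], k=3)

print(good is not None)  # True
print(bad)               # None
```

Running this check at every level of the chain graph mirrors the lemma: a single level without a valid matching is enough to rule out a valid phasing of the cycle.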

  24. HapCompass Algorithm (polyploid)
     input: sequence reads, variant calls, number of distinct haplotypes k
     output: k haplotypes
       GC ← maximum spanning tree cycle basis
       CC ← set of conflicting cycles with respect to GC
       for c in CC do: resolve(c); GC ← rebuild()
       Gg ← general chain graph
       CN ← set of non-conflicting cycles with respect to GC
       for c in CN do: addEvidence(c)
     output: haplotype assembly according to maximum weight spanning tree of Gg
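Two graph primitives named in the algorithm, a maximum weight spanning tree and the cycle basis it induces, can be sketched as below. This is not the HapCompass implementation: it uses plain Kruskal with union-find, the function names are hypothetical, and the edge weights are made up rather than derived from reads.

```python
def maximum_spanning_tree(nodes, weighted_edges):
    """Kruskal's algorithm with edges taken in decreasing weight order."""
    parent = {v: v for v in nodes}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    tree, rest = [], []
    for u, v, w in sorted(weighted_edges, key=lambda e: e[2], reverse=True):
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            tree.append((u, v, w))
        else:
            rest.append((u, v, w))  # this edge would close a cycle
    return tree, rest

def tree_path(tree, start, goal):
    """Unique path between two nodes of the spanning tree (depth-first search)."""
    adjacent = {}
    for u, v, _ in tree:
        adjacent.setdefault(u, []).append(v)
        adjacent.setdefault(v, []).append(u)
    stack, visited = [(start, [start])], {start}
    while stack:
        node, path = stack.pop()
        if node == goal:
            return path
        for nxt in adjacent.get(node, []):
            if nxt not in visited:
                visited.add(nxt)
                stack.append((nxt, path + [nxt]))
    return None

def cycle_basis(nodes, weighted_edges):
    """Each non-tree edge (u, v) plus the tree path from u to v forms one cycle."""
    tree, rest = maximum_spanning_tree(nodes, weighted_edges)
    return [tree_path(tree, u, v) for u, v, _ in rest]

# Hypothetical compass graph: variants as nodes, made-up evidence as weights.
nodes = ["v0", "v1", "v2", "v3"]
edges = [("v0", "v1", 5), ("v1", "v2", 4), ("v2", "v0", 1),
         ("v2", "v3", 3), ("v3", "v1", 2)]
tree, rest = maximum_spanning_tree(nodes, edges)
cycles = cycle_basis(nodes, edges)
print(tree)    # [('v0', 'v1', 5), ('v1', 'v2', 4), ('v2', 'v3', 3)]
print(cycles)  # [['v3', 'v2', 'v1'], ['v2', 'v1', 'v0']]
```

Each listed cycle is closed by the non-tree edge joining its last node back to its first; these are the cycles the algorithm would then test for conflicts.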

  25. An example compass graph

  26. (Figure panels: compass graph, chain graph, general chain graph.)

  27. (Figure panels: compass graph, chain graph, general chain graph.)

  28. (Figure panels: compass graph, chain graph, general chain graph.)
