HapCompass: A Fast Cycle Basis Algorithm for Accurate Haplotype Assembly of Sequence Data

HapCompass: A Fast Cycle Basis Algorithm for Accurate Haplotype Assembly of Sequence Data. By Derek Aguiar and Sorin Istrail (Brown University), Journal of Computational Biology, June 2012. Presented by KWOK Tsz Piu (Bill), 19/12/2013.

Presentation Transcript


  1. HapCompass: A Fast Cycle Basis Algorithm for Accurate Haplotype Assembly of Sequence Data • By Derek Aguiar and Sorin Istrail (Brown University) • Journal of Computational Biology, June 2012 • Presented by KWOK Tsz Piu (Bill), 19/12/2013

  2. Introduction • Genetic variation is present in the form of single nucleotide polymorphisms (SNPs), insertions/deletions, inversions, translocations, copy number variations, etc. • Given the abundance of SNPs in the human genome and the development of high-throughput genotyping technologies, SNPs have become the marker of choice for understanding human genetic variation.

  3. Introduction • The human genome contains a pair of DNA sequences, one from each parent, called haploid sequences or haplotypes • Haplotypes differ in SNPs, insertions, deletions, … • SNPs are single-bp mutations (~0.1%; non-uniform) • SNP positions contain one of two possible alleles • … ataggtccCtatttcgcgcCgtatacacgggActata … • … ataggtccGtatttcgcgcTgtatacacgggTctata … • … ataggtccCtatttcgcgcCgtatacacgggTctata …

  4. Haplotypes and Genotypes • Haplotype: description of SNP alleles on a chromosome • 0 for major allele, 1 for minor • Diploids: two homologous copies of each autosomal chromosome • One inherited from mother and one from father • Genotype: description of alleles on both chromosomes • 0 - both chromosomes contain the major allele; • 1 - both chromosomes contain the minor allele; • 2 - the chromosomes contain different alleles • Example (genotype + two haplotypes per individual): genotype 021200210, haplotypes 011000110 and 001100010
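The genotype encoding on this slide can be checked mechanically. The following is a minimal sketch, assuming haplotypes are given as strings of 0/1 characters; the function name `genotype_from_haplotypes` is illustrative and not from the paper.

```python
def genotype_from_haplotypes(h1: str, h2: str) -> str:
    """Combine two 0/1 haplotype strings into a 0/1/2 genotype string.

    0 = both chromosomes carry the major allele,
    1 = both carry the minor allele,
    2 = the chromosomes carry different alleles (heterozygous site).
    """
    if len(h1) != len(h2):
        raise ValueError("haplotypes must cover the same SNP positions")
    genotype = []
    for a, b in zip(h1, h2):
        # Equal alleles keep their own code; differing alleles become 2.
        genotype.append(a if a == b else "2")
    return "".join(genotype)


# The two haplotypes from the slide reproduce the slide's genotype.
print(genotype_from_haplotypes("011000110", "001100010"))  # 021200210
```

Note that the mapping is not invertible: a genotype with k heterozygous sites is compatible with 2^(k-1) haplotype pairs, which is why sequence reads are needed to recover the phase.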

  5. Goal of Haplotype assembly • Reconstruct the two haplotypes from the aligned sequence fragments

  6. Goal of Haplotype assembly • Sequence reads are sampled from haploid fragments

  7. Gene‐Disease Association Studies • Haplotypes increase power of association

  8. Haplotype assembly problem • In the absence of errors in the sequenced reads, the correct haplotype assembly is unique. • In practice, the problem becomes finding the haplotype assembly that optimizes a certain objective function • E.g., minimize the number of conflicts with the sequenced reads (minimum error correction, MEC).
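To make the "number of conflicts" objective concrete, here is a small sketch that scores one candidate haplotype pair against a set of reads. It assumes each fragment is a dict mapping SNP index to observed allele (0 or 1); this representation and the names `mismatches` and `mec_score` are illustrative choices, not taken from the paper. Finding the pair that minimizes this score is the hard part and is not attempted here.

```python
def mismatches(fragment, haplotype):
    """Count SNP positions where the fragment disagrees with the haplotype."""
    return sum(1 for pos, allele in fragment.items() if haplotype[pos] != allele)


def mec_score(fragments, h1, h2):
    """Conflicts when each fragment is assigned to its closer haplotype."""
    return sum(min(mismatches(f, h1), mismatches(f, h2)) for f in fragments)


# Hypothetical toy input: three SNPs, four fragments.
fragments = [
    {0: 0, 1: 1},        # agrees with h1
    {1: 1, 2: 0},        # agrees with h1
    {0: 1, 1: 0, 2: 1},  # agrees with h2
    {1: 0, 2: 0},        # one conflict against either haplotype
]
h1 = [0, 1, 0]
h2 = [1, 0, 1]
print(mec_score(fragments, h1, h2))  # 1
```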

  9. Input

  10. Input

  11. Compass Graph • Weight = number of fragments supporting the 00/11 phasing minus number of fragments supporting the 01/10 phasing • Positive => suggests the 00/11 phasing • Negative => suggests the 01/10 phasing • Zero (or small absolute value) => both phasings are plausible.
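A sketch of how such edge weights could be computed from reads follows. It assumes the sign convention that a fragment showing the same allele at two SNPs (00 or 11) votes +1 and a fragment showing different alleles (01 or 10) votes -1; that convention, the dict-based fragment representation, and the name `compass_edge_weights` are assumptions made for illustration.

```python
from collections import defaultdict
from itertools import combinations


def compass_edge_weights(fragments):
    """Return {(i, j): weight} for every SNP pair i < j seen on one fragment."""
    weights = defaultdict(int)
    for frag in fragments:
        # Every pair of SNPs covered by the same fragment casts one vote.
        for i, j in combinations(sorted(frag), 2):
            weights[(i, j)] += 1 if frag[i] == frag[j] else -1
    return dict(weights)


# Hypothetical toy input: fragments map SNP index -> observed allele.
fragments = [
    {0: 0, 1: 0},  # votes for 00/11 between SNP 0 and SNP 1
    {0: 1, 1: 1},  # votes for 00/11 between SNP 0 and SNP 1
    {0: 0, 1: 1},  # votes for 01/10 between SNP 0 and SNP 1
    {1: 0, 2: 1},  # votes for 01/10 between SNP 1 and SNP 2
]
print(compass_edge_weights(fragments))  # {(0, 1): 1, (1, 2): -1}
```

Pairs whose votes cancel are stored with weight 0, matching the "both phasings are plausible" case.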

  12. Compass Graph

  13. Properties of compass graph • There is a unique phasing between two SNPs si and sj if and only if for any two simple edge-disjoint paths p and q in GC between si and sj, the number of negative edges of p plus the number of negative edges of q is even, and p and q include no 0-weight edges. • S1->S2->S4 • S1->S3->S4

  14. Definitions • A conflicting cycle is a simple cycle that: • contains an odd number of negative edges • or has at least one 0-weight edge • A compass graph GC with no conflicting cycles is happy • A happy graph can be uniquely phased • Observation: every spanning tree of a compass graph is a happy graph
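The definition of a conflicting cycle translates directly into a parity check. This is a minimal sketch, assuming a cycle is supplied as the list of its edge weights; the name `is_conflicting` is illustrative.

```python
def is_conflicting(cycle_weights):
    """True if a simple cycle, given by its edge weights, is conflicting."""
    # A 0-weight edge supports neither phasing, so the cycle is ambiguous.
    if any(w == 0 for w in cycle_weights):
        return True
    # Each negative edge flips the phase; an odd number of flips cannot
    # return to the starting allele, so the cycle contradicts itself.
    negatives = sum(1 for w in cycle_weights if w < 0)
    return negatives % 2 == 1


print(is_conflicting([3, -2, -1]))  # False: two flips cancel out
print(is_conflicting([3, -2, 1]))   # True: one unmatched flip
print(is_conflicting([2, 0, 2]))    # True: contains a 0-weight edge
```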

  15. Problem formulations • Target: • Remove conflicting cycles with Minimum weighted edge removal (MWER)

  16. Problem formulations • Target: • Remove conflicting cycles with Minimum weighted edge removal (MWER)

  17. Algorithm 1 • 1. Remove all 0-weight edges from GC. • 2. Construct a maximum spanning tree T. • 3. Mark all conflicting cycles. • 4. Repeat 4.1 and 4.2 until GC is happy: • 4.1. Randomly select a conflicting cycle and remove the edge e with weight closest to 0 on the cycle. • 4.2. Re-mark the conflicting cycles. • 5. Output the phasing corresponding to any spanning tree of GC. • With m = |Ec| and n = |Vc|, time complexity: O(m(m-n+1)^2 + (m-n+1)(m log n))
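The steps of Algorithm 1 can be sketched roughly as below. This is not the authors' implementation. It assumes edges are keyed as (u, v) with u < v, takes "maximum spanning tree" to mean largest absolute weight first, checks only the fundamental cycles closed by non-tree edges (a cycle basis), and rebuilds the tree on every iteration for simplicity. All function names are hypothetical.

```python
import random
from collections import defaultdict, deque


def spanning_forest(nodes, edges):
    """Kruskal's algorithm, taking edges with the largest |weight| first."""
    root = {v: v for v in nodes}

    def find(v):
        while root[v] != v:
            root[v] = root[root[v]]  # path halving
            v = root[v]
        return v

    tree = set()
    for (u, v), _ in sorted(edges.items(), key=lambda kv: -abs(kv[1])):
        ru, rv = find(u), find(v)
        if ru != rv:
            root[ru] = rv
            tree.add((u, v))
    return tree


def tree_parents(nodes, tree):
    """BFS over the forest; returns the parent and depth of every node."""
    adj = defaultdict(list)
    for u, v in tree:
        adj[u].append(v)
        adj[v].append(u)
    parent, depth = {}, {}
    for start in nodes:
        if start in parent:
            continue
        parent[start], depth[start] = None, 0
        queue = deque([start])
        while queue:
            x = queue.popleft()
            for y in adj[x]:
                if y not in parent:
                    parent[y], depth[y] = x, depth[x] + 1
                    queue.append(y)
    return parent, depth


def fundamental_cycle(chord, parent, depth):
    """Edges of the cycle closed by adding a non-tree edge to the tree."""
    u, v = chord
    cycle = [chord]
    while u != v:
        if depth[u] < depth[v]:
            u, v = v, u
        cycle.append((min(u, parent[u]), max(u, parent[u])))
        u = parent[u]  # climb from the deeper endpoint
    return cycle


def make_happy(weights, seed=0):
    rng = random.Random(seed)
    edges = {e: w for e, w in weights.items() if w != 0}   # step 1
    nodes = sorted({x for e in weights for x in e})
    while True:
        tree = spanning_forest(nodes, edges)                # step 2
        parent, depth = tree_parents(nodes, tree)
        conflicting = []                                    # steps 3 and 4.2
        for e in edges:
            if e not in tree:
                cycle = fundamental_cycle(e, parent, depth)
                if sum(edges[c] < 0 for c in cycle) % 2 == 1:
                    conflicting.append(cycle)
        if not conflicting:
            return nodes, edges, tree
        cycle = rng.choice(conflicting)                     # step 4.1
        del edges[min(cycle, key=lambda c: abs(edges[c]))]


def phase_from_tree(nodes, edges, tree):
    """Step 5: walk the tree; a negative edge flips the allele label."""
    parent, depth = tree_parents(nodes, tree)
    hap = {}
    for v in sorted(nodes, key=depth.get):  # parents before children
        p = parent[v]
        if p is None:
            hap[v] = 0  # arbitrary label for each component's root
        else:
            hap[v] = hap[p] ^ int(edges[(min(v, p), max(v, p))] < 0)
    return hap


# Hypothetical toy graph: the triangle 0-1-2 has one negative edge.
weights = {(0, 1): 5, (1, 2): -4, (0, 2): 1, (2, 3): 3}
nodes, edges, tree = make_happy(weights)
print(phase_from_tree(nodes, edges, tree))  # {0: 0, 1: 0, 2: 1, 3: 1}
```

In the toy graph the weakest edge of the conflicting triangle, (0, 2) with weight 1, is removed. The returned labels describe one haplotype; the other is its complement, and labels are only relative within each connected component.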

  18. Algorithm 1

  19. Algorithm 1

  20. Improvement • Idea: • Want to remove edges that are in multiple conflicting cycles • Formulate the problem as a set cover problem: • Sets: edges • Elements: conflicting cycles • Target: find the collection of edges (sets) of minimum weight such that they cover all of the conflicting simple cycles (elements) • Example: Universe = {1, 2, 3, 4, 5} (5 elements) • Sets = {{1, 2, 3}, {2, 4}, {3, 4}, {4, 5}} • Best = {{1, 2, 3}, {4, 5}}
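The slide does not say how the set cover instance is solved. As one possibility, the standard greedy heuristic for weighted set cover is sketched below: it repeatedly picks the set with the lowest weight per newly covered element. This is an approximation and is not guaranteed to be optimal in general. In the haplotype setting each set would be an edge (containing the conflicting cycles it lies on), weighted for example by its absolute edge weight; that mapping and the name `greedy_set_cover` are assumptions made here.

```python
def greedy_set_cover(universe, sets, weights=None):
    """Return indices of chosen sets, picked by lowest cost per new element."""
    if weights is None:
        weights = [1.0] * len(sets)  # unweighted case
    uncovered = set(universe)
    chosen = []
    while uncovered:
        # Only sets that still cover something new are candidates.
        candidates = [i for i, s in enumerate(sets) if uncovered & s]
        if not candidates:
            raise ValueError("the given sets cannot cover the universe")
        best = min(candidates, key=lambda i: weights[i] / len(uncovered & sets[i]))
        chosen.append(best)
        uncovered -= sets[best]
    return chosen


# The example from the slide, with unit weights.
universe = {1, 2, 3, 4, 5}
sets = [{1, 2, 3}, {2, 4}, {3, 4}, {4, 5}]
picked = greedy_set_cover(universe, sets)
print([sorted(sets[i]) for i in picked])  # [[1, 2, 3], [4, 5]]
```

On this small example the greedy choice coincides with the best cover shown on the slide.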

  21. Results • Real data: 1000 Genomes data, chr 22 of NA12878 • FMPR: number of mismatches of each fragment to the haplotypes • BFM: number of fragments that do not perfectly match the haplotypes • Block size = number of SNPs

  22. Results • Simulated data: • Chr 22, NA12878 • 10M simulated reads, error rate = 0.05, read length = 100bp

  23. Results

  24. Conclusion • Haplotype assembly is becoming increasingly important • The cost of sequencing is decreasing • More genome-wide and whole-exome studies are being conducted • A new haplotype assembly algorithm • New formulation of the graph • Some useful observations that make the algorithm work • Quality of SNP calls and sequence base call scores will be included in the future.

  25. Thank you!
