haplotype assembly n.
Download
Skip this Video
Loading SlideShow in 5 Seconds..
Haplotype assembly PowerPoint Presentation
Download Presentation
Haplotype assembly

Loading in 2 Seconds...

play fullscreen
1 / 66

Haplotype assembly - PowerPoint PPT Presentation


  • 352 Views
  • Uploaded on

Haplotype assembly. CSCI2820: Medical Bioinformatics Derek Aguiar. Outline. Haplotype assembly workflow Data generation and preparation Error modeling Problem optimizations Algorithms HapCUT Levy et al. 2007 HapCompass Polyploidy and cancer. Haplotype reconstruction. solution space S

loader
I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
capcha
Download Presentation

Haplotype assembly


An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
    Presentation Transcript
    1. Haplotype assembly CSCI2820: Medical Bioinformatics Derek Aguiar

    2. Outline • Haplotype assembly workflow • Data generation and preparation • Error modeling • Problem optimizations • Algorithms • HapCUT • Levy et al. 2007 • HapCompass • Polyploidy and cancer

    3. Haplotype reconstruction solution space S k-ploidy n 2’s Sequence reads or genotypes Haplotype assembly Haplotype phasing Haplotypes

    4. The haplotype assembly workflow DNA sequencing read alignment; retain SNPs computational modeling of reads and sequencing errors problem optimization haplotypes

    5. The haplotype assembly workflow DNA sequencing read alignment; retain SNPs computational modeling of reads and sequencing errors problem optimization haplotypes

    6. DNA ...A A C G T C A G T C A G C C T... 5.4 thousand bases 150 billion bases 3.2 billion bases

    7. Variation structural variation Single nucleotide variant (SNV)or polymorphism (SNP) del/no del alleles C/T alleles http://en.wikipedia.org/wiki/Single-nucleotide_polymorphism https://sites.google.com/site/lifesciencesinmaine/5-cell-division-reproduction-and-dna

    8. Genotypes and haplotypes Genotype Haplotype ACGCCA ACGCCA CCTTAA {A,C} {C} {G,T} {C,T} {C,A} {A} TGCGGT CCTTAA GGAATT 000001 101111 2 0 2 2 2 1

    9. High-throughput Sequencing • Also termed next-generation sequencing • Illumina • 454 • SOLiD • DNA is fractured, amplified, fixated onto an array, bases are added • Single molecule or 3rd generation technologies Source of bias Error signature Short reads: ~50-200bp (454 can get up to 1kb) Generally more error prone than Sanger Extremely fast and parallel

    10. Wetterstrand KA. DNA Sequencing Costs: Data from the NHGRI Genome Sequencing Program (GSP) Available at: www.genome.gov/sequencingcosts. Accessed April 2013.

    11. http://www.ncbi.nlm.nih.gov/Traces/sra/

    12. Illumina FASTQ format • Four lines per sequence • identifier line • sequence line • + line • base quality line

    13. Illumina File Snippet • @D4ZHLFP1:10:C03D7ACXX:2:2305:7378:90088 2:N:0:ATCACG • CGTGACAATATTTAGCTTTAGATGGAGTTTACCACCCATTTTGGGCTGCATTCCCAAACAACCCGACTCGTCGAAAGCACATCGTGAACGACCGAGGTCGC • + • BBBDDFFFFHGHH@D@GHIGIFIJJJIEEHHHIHIHGHIJJIJFIHII;BFFFBDGIGIGIJJGIAEEDFDBD>BBBA:<AC@@BDCBCD@BB@>BD5399

    14. ID line in detail • @D4ZHLFP1:10:C03D7ACXX:2:2305:7378:90088 2:N:0:ATCACG

    15. Illumina Base QualityEncoding http://en.wikipedia.org/wiki/FASTQ_format

    16. The haplotype assembly workflow DNA sequencing read alignment; retain SNPs computational modeling of reads and sequencing errors problem optimization haplotypes

    17. sequence read alignment • Report all read alignments between sequence reads ri with lengths li for 1≤i≤n and a genome g of length L

    18. sequence read alignment • The real problem is li<<L for 1≤i≤n • What can we do to overcome this problem?

    19. BLAT • Algorithm • Construct an index of non-overlapping k-mersfor the reference • Use index to quickly find regions of the genome similar to the query • Compute an alignment between read and genomic region

    20. SAM/BAM file format • Used to store read alignments • Tab delimited • samtools program used to manipulate SAM/BAM files • Picard java port

    21. VCF file format • Used to store variant calls • Developed as part of the 1000 Genomes Project

    22. Motivation • Sequence reads are sampled from haploid fragments (fi) gap between sequences covered SNPs f1 C T f2 A C ···········A······C······A······T······G·········· ···········C······G······A······A······T·········· ···········C······G······G······A······T·········· C G T f3

    23. The haplotype assembly workflow DNA sequencing read alignment; retain SNPs computational modeling of reads and sequencing errors problem optimization haplotypes

    24. errors • Two fundamental sources of errors • Sequencing • A base was miscalled • An insertion or deletion was miscalled • Read alignment • A read was aligned to the wrong location or the alignment itself is wrong

    25. errors • We can explicitly model sequencing errors based on the technology used to generate the reads • Also, we can use the phred quality scores

    26. relationship between quality score and base call

    27. fragments • For now, we assume the organism we are sequencing has ploidy=2, or 2 sets of chromosomes • Thus, each SNP can only have two alleles • Our computational model of an aligned read is termed a fragment • Given n SNPs, the ith fragment fiis an n-dimensional vector with alphabet {0,1,-} where 0 and 1 represent the major and minor alleles and – represents a gap in coverage

    28. formal definition of input • The input is a matrix M with m rows corresponding to m fragments and n columns corresponding to the n SNPs. f1 A C ⋅⋅⋅A⋅⋅⋅⋅C⋅⋅⋅⋅A⋅⋅⋅⋅T⋅⋅⋅ ⋅⋅⋅C⋅⋅⋅⋅G⋅⋅⋅⋅G⋅⋅⋅⋅A⋅⋅⋅ G A f2 SNP-fragment matrix s1 s2 s3 s4 f1 f2

    29. The haplotype assembly workflow DNA sequencing read alignment; retain SNPs computational modeling of reads and sequencing errors problem optimization haplotypes

    30. The haplotype assembly problem (diploid) • Given: a set aligned sequence reads and variants • Compute: two haplotypes for each chromosome/scaffold

    31. imagine a world without errors How would you assemble the fragments?

    32. fragment conflict • Let fik be the allele or ‘-’ at position k of fragment i, then two fragments conflict if • Informally, two fragments fi and fj are in fragment conflict if they cover a common SNP and have different alleles at that site So, how would you assemble the fragments?

    33. fragment conflict graph Lancia et al., 2001 1 − 0 • The fragment conflict graph is a representation of the fragments which share a SNP but differ in alleles. • The fragment conflict graph GF(VF,EF) has a vertex for every fragment and an edge between fragments that conflict (where d(fi,fj) is the number of fragment conflicts between fi and fj): 1 0 − − 0 0 0 1 − 0 − 1 − 1 1

    34. fragment conflict graph • Find two sets such that there are no fragment conflicts internal to each set (bipartition of GF) • The shores of a bipartition of GF give the haplotype Haplotype A: 1 0 0 1 − 0 1 0 − − 0 0 0 1 − 0 − 1 − 1 1 Haplotype B: 0 1 1

    35. minimum fragment removal • Optimization: Compute a subset of vertices VS in VF such that G(VF-VS) induces a bipartite graph. • Sometimes referred to as the bipartization problem and is NP-hard Lancia et al., 2001 But, exist in real data, sequencing errors do. Hmmmmm. disney.wikia.com

    36. minimum edge removal • Optimization: Compute a subset of edges ES in EF such that G(EF-ES) induces a bipartite graph. • Also referred to as bipartization and NP-hard • But what exactly does an edge removal in GF mean? • More on this later… Lancia et al., 2001

    37. SNP conflict graph • GS(VS,ES) has a vertex for every SNP and an edge between SNPs that exhibit 3 or more haplotypes: Schwartz 2010

    38. minimum error correction • The most used optimization in the literature is the Minimum Error Correction (MEC) • FastHare4,HAPCUT1, HASH2, He et al.3 • Compute a minimum set of error corrections in the fragments (rows) of M such that GF is bipartite. Lancia et al., 2001 1 Bansal, V. and Bafna, V. (2008). Hapcut: an efficient and accurate algorithm for the haplotype assembly problem. Bioinformatics, 24(16), 153–159. 2 Bansal, V., Halpern, A. L., Axelrod, N., and Bafna, V. (2008). An mcmc algorithm for haplotype assembly from whole-genome sequence data. Genome Research, 18(8), 1336–1346. 3 He, D., Choi, A., Pipatsrisawat, K., Darwiche, A., and Eskin, E. (2010). Optimal algorithms for haplotype assembly from whole-genome sequence data. Bioinformatics, 26(12), i183–i190. 4 Panconesi, A. and Sozio, M. (2004). Fast hare: A fast heuristic for single individualsnp haplotype reconstruction. In Proceedings of the 4th International Workshop on Algorithms in Bioinformatics WABI04, volume 3240, pages 266 – 277.

    39. minimum error correction • A single A->T change creates a bipartite GF Lancia et al., 2001 Schwartz 2010

    40. Input v0 v1 CCTA read 1 AGTA read 2 AGGA consensus CGGA read 3

    41. Fragment conflicts • Two fragments conflict if they share a variant, differ in allele • Haplotype assembly = two sets with no internal conflicts (bipartite) 10 01 f1 f3 v0 v1 11 read 1 f2 read 2 read 3

    42. Optimizations • Remove minimum number of • Fragments • Variants • Correct minimum number of errors f1=01 f2=10 f3=11 f3=10 f1 f3 v0 v1 0- 01 01 1- 10 10 f2

    43. Prior Work • Minimum Error Correction (MEC) • Levy et al. 2007, FastHare4,HAPCUT1, HASH2, He et al.3 • Minimum SNP Removal • Minimum Fragment Removal • HapCompass5 1Bansal, V. and Bafna, V. (2008) 2 Bansal, V., Halpern, A. L., Axelrod, N., and Bafna, V. (2008) 3 He, D., Choi, A., Pipatsrisawat, K., Darwiche, A., and Eskin, E. (2010) 4 Panconesi, A. and Sozio, M. (2004) 5 Derek Aguiar and SorinIstrail, (2012)

    44. Levy et al. 2007 • Venter diploid genome • First diploid genome assembly • Greedy haplotype assembly algorithm

    45. Levy et al. 2007 • Initialize set S containing all reads • while S has a read in which all variants are not currently assigned • choose a read containing the most uncalled variants in the set of haplotypes • use read to seed haplotype; assign alleles to haplotype and the compliment alleles to the compliment haplotype • iterate: • extend haplotypes using read with strongest signal (|number of variant agreements for haplotype A - number of variant agreements for haplotype B|)

    46. Levy et al. 2007 (continued) • Iterate: • For each variant, assign allele by majority rule of the reads used to determine the alleles • For each read, assign the read to the haplotype with minimum distance

    47. Levy et al. 2007 example v0 v1 v2 v3 v4 read 0 read 1 read 2 read 3 read 4 hap A hap A’