1 / 46

Biostatistics-Lecture 15 High-throughput sequencing and sequence alignment

Biostatistics-Lecture 15 High-throughput sequencing and sequence alignment. Ruibin Xi Peking University School of Mathematical Sciences. High-throughput sequencing. HTS platforms Roche 454 platforms Illumina / Solexa platforms (most widely used) Applied Biosystem (ABI) SOLiD

deiter
Download Presentation

Biostatistics-Lecture 15 High-throughput sequencing and sequence alignment

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Biostatistics-Lecture 15High-throughput sequencing and sequence alignment Ruibin Xi Peking University School of Mathematical Sciences

  2. High-throughput sequencing • HTS platforms • Roche 454 platforms • Illumina/Solexa platforms (most widely used) • Applied Biosystem (ABI) SOLiD • HelicosHeliSopeTM sequencer(single molecular sequencing) • Life Technologies platforms • The throughput is increasing and the price is dropping • Short read but high throughput

  3. High-throughput sequencing • HTS platforms • Roche 454 platforms • Illumina/Solexa platforms (most widely used) • Applied Biosystem (ABI) SOLiD • HelicosHeliSopeTM sequencer(single molecular sequencing) • Life Technologies platforms • The throughput is increasing and the price is dropping • Short read but high throughput

  4. What sequencing data can do • Detection of genomic variations (Whole genome sequencing or targeted sequencing) • Single nucleotide polymorphisms (SNP) • Copy number variations (CNV) • Structural variations (SV) • Analyze protein interactions with DNA (ChIP-seq) • Whole Transcriptome study (RNA-seq) • DNA methylation study • and many more …..

  5. Sequencing (Illunima) Mardis NatureReivew Genetics (2010)

  6. What the data look like? Fastq Format detail: 1st line: the name of a short read 2nd line: the read itself (a short sequence of A,C,G,T) 3rd line: the name of the short read or plus (+) sign 4th line: the quality score

  7. General strategy for analyzing HTS data • Alignment-based • Assembly-based

  8. Comparing two DNA sequences

  9. How can we evaluate an alignment

  10. How can we evaluate an alignment • Scores: • Mutation (mismatch): 0 • Match: 1 • Gap: -1

  11. Find a best alignment • Exhaustively search all possible alignments • Computationally too expensive!!! • Observation: for a pair (i,j)

  12. Find a best alignment

  13. Find a best alignment—Solution I

  14. Find a best alignment—Solution II

  15. Dynamical programming

  16. Dynamical programming

  17. Dynamical programming

  18. Dynamical programming

  19. Dynamical programming

  20. Dynamical programming

  21. Dynamical programming

  22. Dynamical programming

  23. Semiglobal alignment

  24. Semiglobal alignment

  25. Semiglobal alignment

  26. Semiglobal alignment

  27. Local Alignment

  28. Local Alignment

  29. Local Alignment

  30. Local Alignment

  31. BLAST and BLAT • BLAST: Basic Local Alignment Search Tool • BLAT: BLAST Like Alignment Tool

  32. BLAST • BLAST: • A search algorithm for finding local alignments of two sequences S and T • An associated theory for evaluating the statistical significant • Terminology and notation • S(a,b): scoring system • High-scoring Segment Pair (HSP): • Cannot be extended or shortened without dropping the score

  33. BLAST • The number of HSPs with a score ≥ S approximately follows a Poisson Distribution (under the null hypothesis) with parameter • Assumptions Some probability to take positive score • Based on extreme value theory • E-value • Bit score

  34. BLAST • Algorithm: seed and extend • Build an index for k-mers of the query sequence • Find the hits of the k-mers in the database sequence in the query sequence • Extend the seeds with a score ≥ a threshold and find the HSPs with a score ≥ S • Evaluate the statistical significance

  35. BLAT • Strategy: seed and extend • In the seed stage, detects regions of two sequences that are likely to be homologous • In the extend stage, those regions are examined in detail and alignments are produced. • Index the non-overlapping K-mers of the database sequences instead of the query sequence

  36. BLAT • Strategy: seed and extend • In the seed stage, detects regions of two sequences that are likely to be homologous • In the extend stage, those regions are examined in detail and alignments are produced. • Index the non-overlapping K-mers of the database sequences instead of the query sequence

  37. BLAT • Seeding strategy • Single perfect K-mer matches • Single near perfect K-mer matches • Multiple perfect K-mer matches

  38. BLAT • Some definitions

  39. BLAT • Single perfect match • the probability that a specific K-merin a homologous region of the database matches perfectly the corresponding K-mer in the query • Sensitivity: the probability of a hit (at least one non-overlapping K-mers in the database matches perfectly with the corresponding K-mer in the query) • Specificity: the expected number of non-overlapping K-mers that matches, assuming all letters are equally likely

  40. BLAT • Single perfect match • the probability that a specific K-merin a homologous region of the database matches perfectly the corresponding K-mer in the query • Sensitivity: the probability of a hit (at least one non-overlapping K-mers in the database matches perfectly with the corresponding K-mer in the query) • Specificity: the expected number of non-overlapping K-mers that matches, assuming all letters are equally likely

  41. BLAT • Single imperfect match • The probability • The sensitivity • The specificity

  42. BLAT • Single imperfect match • The probability • The sensitivity • The specificity

  43. BLAT • Multiple perfect Matches • Probability • Sensitivity • Specificity

  44. BLAT • Multiple perfect Matches • Probability • Sensitivity • Specificity

  45. BLAT • Clumping the hits • The hit list L is sorted by database coordinate. • The list L is split into buckets of size 64 kb each, based on the database coordinate. • Each bucket is sorted along the diagonal, i.e. hits are sorted by the value of database position minus query position. • Hits that are within the gap limit are grouped together into proto-clumps. • Hits within proto-clumps are then sorted by their database coordinate and put into real clumps • Clumps within 300 bp or 100 amino acids of each other in the database are merged

  46. BLAT • Nucleotide alignment • A hit list is generated between the query sequence q and the homologous region h in the database, looking for smaller, perfect K-mers. • If a K-mer w in q matches multiple K-mers in h, then w is repeatedly extended by one until the match is unique or exceeds a certain size. • The hits are extended as far as possible, without mismatches • Overlapping hits are merged. • Then extensions using indels followed by matches are considered.

More Related