
Fast Sequence Search Multiple Sequence Alignment

Fast Sequence Search Multiple Sequence Alignment. Xiaole Shirley Liu, STAT115/STAT215, 2010. Outline: fast sequence search; BLAST, statistical significance; BLAST programs; BLAT; global MSA; ClustalW; ClustalW features; ClustalW example.


Presentation Transcript


  1. Fast Sequence Search Multiple Sequence Alignment Xiaole Shirley Liu STAT115/STAT215, 2010

  2. Outline • Fast sequence search • BLAST, statistical significance • BLAST programs • BLAT • Global MSA • ClustalW • ClustalW features • ClustalW example STAT115

  3. Fast Sequence Similarity Search (diagram: a query searched against a sequence DB) • Uses: • Map a sequence to a sequenced genome • Infer the function of an unknown sequence • Find a family of proteins in an organism • Find homologs/orthologs in other organisms • Find sequence mutations or variations (SNPs)

  4. Fast Similar Sequence Search • Can we run Smith-Waterman between the query and every DB sequence? • Yes, but too slow! • General approach • Break query and DB sequences into subsequences and match them • Extend the matched subsequences, filter out hopeless sequences • Use dynamic programming to get the optimal alignment

  5. BLAST • Basic Local Alignment Search Tool • Altschul et al., J Mol Biol. 1990 • One of the most widely used bioinformatics applications • Alignment quality not as good as Smith-Waterman • But much faster, supported at NCBI with a big computer cluster • For tutorials or information: http://www.ncbi.nlm.nih.gov/Education/BLASTinfo/information3.html

  6. BLAST Algorithm Steps • Query and DB sequences are optionally filtered to remove low-complexity regions • E.g. ACACACACA, TTTTTTTTT
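
A minimal sketch of what "low complexity" means, using Shannon entropy of the residue composition. This is a swapped-in simplification for illustration, not BLAST's own filter (BLAST uses the DUST and SEG programs for this step). The function names and the 1.5-bit threshold are assumptions.

```python
import math
from collections import Counter

def shannon_entropy(seq):
    """Entropy, in bits per symbol, of the residue composition of seq."""
    counts = Counter(seq)
    n = len(seq)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def is_low_complexity(seq, threshold=1.5):
    """Flag seq as low complexity; the threshold is an illustrative assumption."""
    return shannon_entropy(seq) < threshold

# The two examples from the slide, plus a more varied sequence for contrast.
for s in ("ACACACACA", "TTTTTTTTT", "ACGTACGTAC"):
    print(s, round(shannon_entropy(s), 3), is_low_complexity(s))
```

A homopolymer has zero entropy and a two-letter repeat has at most one bit, so both slide examples fall below the threshold.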

  7. BLAST Algorithm Steps • Query and DB sequences are optionally filtered to remove low-complexity regions • Break DB sequences into k-mer words and hash their locations to speed later searches • k is usually 11 for DNA/RNA and 3 for protein • E.g. LPPQGLL → LPP, PPQ, PQG, QGL, GLL
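
The word-breaking and hashing step can be sketched as follows. This is an illustrative sketch only: the function names and the sequence name "s1" are hypothetical, and a Python dict stands in for the hash table.

```python
from collections import defaultdict

def kmer_words(seq, k):
    """All overlapping k-mers of seq, in order of position."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def build_word_index(sequences, k):
    """Map each k-mer to the (sequence name, position) pairs where it occurs."""
    index = defaultdict(list)
    for name, seq in sequences.items():
        for pos, word in enumerate(kmer_words(seq, k)):
            index[word].append((name, pos))
    return index

print(kmer_words("LPPQGLL", 3))
index = build_word_index({"s1": "LPPQGLL"}, 3)
print(index["PQG"])
```

Looking up a query word in the index is then a constant-time operation on average, which is what makes the later search steps fast.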

  8. BLAST Algorithm Steps • Query and DB sequences are optionally filtered to remove low-complexity regions • Break DB sequences into k-mer words and hash their locations to speed later searches • For each k-mer in the query, find the possible k-mers that match well with it • “Well” is evaluated by substitution matrices

  9. Scoring K-mer Matches • PEG vs. PQG scored with BLOSUM62: 7 + 2 + 6 = 15

  10. BLAST Algorithm Steps • Only words scoring ≥ the cutoff T are kept • T is usually 11-13; ~50 words make the T cutoff • Note: this is 50 words at every query position • For each DB sequence with a high-scoring word, try to extend it at both ends • E.g. Query: LP PQG LL, DB seq: MP PEG LL, HSP score 9 + 15 + 8 = 32 • Form HSPs (High-scoring Segment Pairs) • Use BLOSUM to score the extended alignment • No gaps allowed
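
The word score and the ungapped extension from the last two slides can be reproduced with a small sketch. Assumptions: the matrix below holds only the five BLOSUM62 entries implied by the slide arithmetic (any other residue pair raises a KeyError), the function names are hypothetical, and the `x_drop` stopping rule is a simplified stand-in for BLAST's actual extension heuristic.

```python
# Only the BLOSUM62 entries needed for the slide example (partial matrix).
BLOSUM62_PARTIAL = {
    ("P", "P"): 7, ("E", "Q"): 2, ("G", "G"): 6,
    ("L", "M"): 2, ("L", "L"): 4,
}

def pair_score(a, b):
    """Symmetric lookup; raises KeyError for pairs outside the partial matrix."""
    if (a, b) in BLOSUM62_PARTIAL:
        return BLOSUM62_PARTIAL[(a, b)]
    return BLOSUM62_PARTIAL[(b, a)]

def word_score(w1, w2):
    """Ungapped score of two equal-length words."""
    return sum(pair_score(a, b) for a, b in zip(w1, w2))

def extend(query, db, qi, di, step, x_drop=20):
    """Best score gained by extending without gaps in one direction."""
    best = running = 0
    while 0 <= qi < len(query) and 0 <= di < len(db):
        running += pair_score(query[qi], db[di])
        best = max(best, running)
        if best - running > x_drop:  # give up once the score has dropped too far
            break
        qi += step
        di += step
    return best

def ungapped_hsp_score(query, db, q_start, d_start, k):
    """Seed score plus the best left and right ungapped extensions."""
    seed = word_score(query[q_start:q_start + k], db[d_start:d_start + k])
    left = extend(query, db, q_start - 1, d_start - 1, -1)
    right = extend(query, db, q_start + k, d_start + k, +1)
    return seed + left + right

print(word_score("PEG", "PQG"))
print(ungapped_hsp_score("LPPQGLL", "MPPEGLL", 2, 2, 3))
```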

  11. BLAST Algorithm Steps • Keep only statistically significant HSPs • Based on the scores of aligning 2 random seqs • Use the Smith-Waterman algorithm to join the HSPs and get the optimal alignment • Gaps are allowed; default gap penalties (-11, -1) for opening and extension

  12. Statistical Significance • Probability that a random alignment gets a score like this or better → p-value • Local similarity scores follow an extreme value distribution (Altschul et al., Nat. Genet. 1994)

  13. Digression: hypothesis testing • Null hypothesis H0 (“nothing special”) • Like a “defendant” presumed innocent • Alternative hypothesis HA • Proven guilty if overwhelming evidence is present

  14. Two Sample t-test • Statistical significance in the two sample problem • Group 1: X1, X2, … Xn1 • Group 2: Y1, Y2, … Yn2 • If Xi ~ Normal(μ1, σ1²), Yi ~ Normal(μ2, σ2²) • Null hypothesis of μ1 = μ2 • Use Welch-t statistic • Check t table for p-val • A gene with small p-val (very big or small t) • Reject null • Significant difference between normal and MM Tongji 2009
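
The slide names the Welch t statistic without showing it. In its standard textbook form (an addition here, not recovered from the slide), with sample means $\bar{X}, \bar{Y}$ and sample variances $s_1^2, s_2^2$ of the two groups:

```latex
t = \frac{\bar{X} - \bar{Y}}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}
```

Unlike the pooled-variance t statistic, this form does not assume $\sigma_1^2 = \sigma_2^2$.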

  15. Permutation Test • Non-parametric method for p-val calculation • Does not assume a normal expression distribution • Does not assume the two groups have equal variance • Randomly permute sample labels, calculate t to form the empirical null t distribution • For the MM study, (14 choose 5) = 2002 different t values from permutation • If the observed t is extremely high/low → differential expression with statistical significance
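
A minimal sketch of an exact permutation test, enumerating every relabeling rather than sampling at random. The expression values are made up for illustration (5 samples versus 9, to match the 14-choose-5 count on the slide), and the function names are hypothetical.

```python
from itertools import combinations
from math import sqrt
from statistics import mean, variance

def welch_t(x, y):
    """Welch t statistic; does not assume equal variances."""
    return (mean(x) - mean(y)) / sqrt(variance(x) / len(x) + variance(y) / len(y))

def permutation_pvalue(x, y):
    """Two-sided p-value from all relabelings; returns (p, number of relabelings)."""
    pooled = list(x) + list(y)
    n = len(pooled)
    observed = abs(welch_t(x, y))
    extreme = total = 0
    for idx in combinations(range(n), len(x)):
        chosen = set(idx)
        gx = [pooled[i] for i in idx]
        gy = [pooled[i] for i in range(n) if i not in chosen]
        total += 1
        # Small tolerance so the observed labeling always counts itself.
        if abs(welch_t(gx, gy)) >= observed - 1e-12:
            extreme += 1
    return extreme / total, total

# Made-up expression values for one gene: 5 samples in one group, 9 in the other.
group1 = [9.1, 8.7, 9.5, 8.9, 9.3]
group2 = [5.0, 5.2, 4.8, 5.1, 4.9, 5.3, 5.05, 4.95, 5.15]
p, n_perm = permutation_pvalue(group1, group2)
print(n_perm, p)
```

With larger groups the number of relabelings grows too fast to enumerate, and a random sample of permutations is used instead.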

  16. Permutation Technique • Compute T0 from the observed labels • Compute T1, T2, T3, … from permuted labels • Compare T0 to the set of T* values

  17. Statistical Significance • Probability that a random alignment gets a score like this or better → p-value • Local similarity scores follow an extreme value distribution (Altschul et al., Nat. Genet. 1994)

  18. Statistical Significance • Actual alignment score S can be normalized • m, n are query and DB length • K, λ are constants • Depend on the substitution matrix and sequence composition • For typical amino acid composition and the PAM250 matrix, K = 0.09, λ = 0.229
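
The normalization formula itself did not survive in the transcript. The standard Karlin-Altschul form, which is consistent with the symbols listed on this slide (an assumption about what the slide showed), is:

```latex
S' = \lambda S - \ln(K m n)
```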

  19. Statistical Significance • Normalized score s can be used to get p-value • When x > 2, probability can be approximated • Another quick check, raw score S/3
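
A sketch of the p-value calculation, assuming the standard extreme value form P(S' ≥ x) = 1 − exp(−e^(−x)) and its approximation e^(−x) for larger x. The slide's own formulas are missing from the transcript, so both expressions are assumptions here, as are the function names. The default K and λ are the PAM250 values quoted on the previous slide.

```python
import math

def normalized_score(S, m, n, K=0.09, lam=0.229):
    """Normalized score S' = lam*S - ln(K*m*n); defaults are the PAM250 constants."""
    return lam * S - math.log(K * m * n)

def p_value(x):
    """P(S' >= x) under the extreme value distribution."""
    return 1.0 - math.exp(-math.exp(-x))

def p_value_approx(x):
    """Approximation intended for x > 2."""
    return math.exp(-x)

# The approximation tightens as x grows.
for x in (2.0, 5.0, 10.0):
    print(x, p_value(x), p_value_approx(x))
```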

  20. BLAST Reporting • Report DB sequences above a threshold • E value: the number of matches expected merely by chance (a count, not a probability like the p-value) • Usually a threshold in [0.05, 10] • Smaller E, more stringent • User selected (just for display): e.g. top 10, 50, 100
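
A sketch of E-value reporting, assuming the standard Karlin-Altschul relations E = K·m·n·e^(−λS) and P = 1 − e^(−E), which are not spelled out on the slide. The hit names, raw scores, and lengths are made up, and the function names are hypothetical.

```python
import math

def e_value(S, m, n, K=0.09, lam=0.229):
    """Expected number of chance matches scoring at least S."""
    return K * m * n * math.exp(-lam * S)

def p_from_e(E):
    """Probability of at least one chance match, given the expected count E."""
    return 1.0 - math.exp(-E)

def report_hits(raw_scores, m, n, threshold=10.0):
    """Names of hits whose E-value is at or below the threshold, best first."""
    scored = [(e_value(S, m, n), name) for name, S in raw_scores.items()]
    return [name for E, name in sorted(scored) if E <= threshold]

# Made-up raw scores for a 250-residue query against a 1,000,000-residue DB.
hits = {"hit_a": 80, "hit_b": 30}
print(report_hits(hits, m=250, n=1_000_000))
```

For small E the two quantities nearly coincide, which is why a small E-value can be read roughly as a p-value.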

  21. Different BLAST Programs • If query is DNA, but known to be coding (e.g. cDNA) • Translate cDNA into protein • Zero gap-extension penalty

  22. PSI-BLAST • Position-Specific Iterative BLAST • Align high-scoring hits in the initial BLAST to construct a profile for the hits • Use the profile for the next BLAST iteration • Find remote homologs or protein families • False positive (FP) sequences can degrade the search quickly (diagram: query with hits Seq1, Seq2, Seq3, Seq4)

  23. Reciprocal BLAST • Search for orthologous sequences between two species • Orthologs: genes related by vertical descent from a common ancestor that encode proteins with the same function in different species • Paralogs: homologous genes that evolved by duplication and code for proteins with similar but not identical functions • Finding the correct orthologous sequence is very important in comparative genomics

  24. Reciprocal BLAST • Search for orthologous sequences between two species • Orthologs: genes related by vertical descent from a common ancestor that encode proteins with the same function in different species • GeneA in Species1, BLAST against Species2 → GeneB • GeneB in Species2, BLAST against Species1 → GeneA • GeneA ↔ GeneB • Also called bi-directional best hit orthologs
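
The bi-directional best hit rule can be sketched as below. This assumes the best hit of every gene in each direction has already been extracted from BLAST output. The gene names and the function name are hypothetical.

```python
def reciprocal_best_hits(best_1_to_2, best_2_to_1):
    """Pairs (g1, g2) where g2 is g1's best hit and g1 is g2's best hit."""
    return sorted(
        (g1, g2)
        for g1, g2 in best_1_to_2.items()
        if best_2_to_1.get(g2) == g1
    )

# Hypothetical best hits from Species1 to Species2 and back.
sp1_to_sp2 = {"GeneA": "GeneB", "GeneC": "GeneD", "GeneE": "GeneB"}
sp2_to_sp1 = {"GeneB": "GeneA", "GeneD": "GeneX"}
print(reciprocal_best_hits(sp1_to_sp2, sp2_to_sp1))
```

Here GeneC and GeneE are dropped because their best hits do not point back at them, which is the situation the rule is meant to filter out (for example, paralogs).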

  25. BLAT • BLAST-Like Alignment Tool • Compared to BLAST, BLAT can align much longer regions (MB) really fast with few resources • E.g. can map a sequence to the genome in seconds on one Linux computer • Allows big gaps (mRNA to genome) • Needs higher similarity (> 95% for DNA and 80% for proteins) for aligned sequences • Basic approach • Break long sequences into blocks • Index k-mers, typically 8-13 • Stitch blocks together for the final alignment

  26. BLAT: Indexing • Genome: cacaattatcacgaccgc • Non-overlapping 3-mers: cac aat tat cac gac cgc • Index: aat 3; gac 12; cac 0,9; tat 6; cgc 15 • cDNA (mRNA -> DNA): aattctcac • Overlapping 3-mers at positions 0-6: aat att ttc tct ctc tca cac • Hits (query position, genome position, difference): aat 0,3 (-3); cac 6,0 (6); cac 6,9 (-3) • Clump: cacAATtatCACgaccgc
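
The indexing example on this slide can be reproduced with a short sketch: non-overlapping k-mers of the genome are indexed, overlapping k-mers of the query are looked up, and hits sharing the same position difference are clumped. The function names are hypothetical, and this leaves out the block stitching that follows.

```python
from collections import defaultdict

def build_index(genome, k):
    """Index non-overlapping k-mers of the genome by start position."""
    index = defaultdict(list)
    for pos in range(0, len(genome) - k + 1, k):
        index[genome[pos:pos + k]].append(pos)
    return dict(index)

def find_hits(query, index, k):
    """(query pos, genome pos, difference) for every overlapping query k-mer."""
    hits = []
    for qpos in range(len(query) - k + 1):
        for gpos in index.get(query[qpos:qpos + k], []):
            hits.append((qpos, gpos, qpos - gpos))
    return hits

def clump_by_diagonal(hits):
    """Group hits that share the same query-minus-genome difference."""
    clumps = defaultdict(list)
    for qpos, gpos, diag in hits:
        clumps[diag].append((qpos, gpos))
    return dict(clumps)

index = build_index("cacaattatcacgaccgc", 3)
hits = find_hits("aattctcac", index, 3)
print(index)
print(hits)
print(clump_by_diagonal(hits))
```

The two hits with difference -3 form the clump shown on the slide; the lone hit with difference 6 has no support and would be discarded.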

  27. BLAT Example • Get result instantly!!

  28. Multiple Sequence Alignment • MSA Uses: • Establish evolutionary relationships (global) • Find conserved nucleotides and amino acids (global) • Characterize signature protein patterns or motifs (local) • Find acceptable substitutions (local) • Protein MSA gold standard: structural alignment

  29. Progressive MSA Method • Progressive: • Heuristic algorithm: an approximation strategy that does not aim for a perfect alignment • Build the alignment with the most related sequences first, then progressively add less-related ones • Manual examination can often improve alignments • ClustalW, NAR 1994 • W stands for weighting: more distant seqs weigh more • Reflects evolutionary distance

  30. ClustalW Steps • Global pairwise alignment for all pairs • Calculate pairwise sequence distances • Approximate evolutionary distance

  31. ClustalW Steps • Construct a tree based on sequence distances • E.g. solve the following distance matrix: A-B 4, A-C 6, A-D 2, B-C 4, B-D 4, C-D 6 (figure: tree over A, B, C, D with branch lengths 3, 1, 2, 1, 1)
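
One way to solve the matrix is neighbor joining, the method ClustalW uses for its guide tree. The sketch below is a plain, unoptimized version. The function name and the internal node labels N0 and N1 are hypothetical, and ties are broken by list order, so the join order may differ from the slides even though the resulting unrooted tree is the same.

```python
def neighbor_joining(names, distances):
    """Return tree edges (node, node, branch length) for a distance matrix.

    distances maps (a, b) pairs to distances; each pair is needed only once.
    """
    d = {frozenset(pair): float(v) for pair, v in distances.items()}
    nodes = list(names)
    edges = []
    next_id = 0
    while len(nodes) > 2:
        n = len(nodes)
        # Net divergence of each node from all the others.
        r = {i: sum(d[frozenset((i, j))] for j in nodes if j != i) for i in nodes}
        # Pick the pair minimizing Q = (n - 2) * d(a, b) - r(a) - r(b).
        best = None
        for x in range(n):
            for y in range(x + 1, n):
                a, b = nodes[x], nodes[y]
                q = (n - 2) * d[frozenset((a, b))] - r[a] - r[b]
                if best is None or q < best[0]:
                    best = (q, a, b)
        _, a, b = best
        dab = d[frozenset((a, b))]
        len_a = dab / 2 + (r[a] - r[b]) / (2 * (n - 2))
        len_b = dab - len_a
        new = f"N{next_id}"
        next_id += 1
        edges.append((new, a, len_a))
        edges.append((new, b, len_b))
        # Distances from the new internal node to every remaining node.
        for c in nodes:
            if c != a and c != b:
                d[frozenset((new, c))] = (
                    d[frozenset((a, c))] + d[frozenset((b, c))] - dab
                ) / 2
        nodes = [c for c in nodes if c != a and c != b] + [new]
    a, b = nodes
    edges.append((a, b, d[frozenset((a, b))]))
    return edges

SLIDE_DISTANCES = {
    ("A", "B"): 4, ("A", "C"): 6, ("A", "D"): 2,
    ("B", "C"): 4, ("B", "D"): 4, ("C", "D"): 6,
}
for edge in neighbor_joining(["A", "B", "C", "D"], SLIDE_DISTANCES):
    print(edge)
```

For this matrix the recovered branch lengths are 1 for A, B, and D, 3 for C, and 2 for the internal branch, matching the numbers in the slide's tree figure.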

  32. ClustalW Steps • Progressively add sequences/alignments by the tree order • Starting from the smallest distance • Add seq to seq, seq to alignment, alignment to alignment • A and D form new node E; calculate the AE and DE distances • Calculate the E consensus, weighted by the AE and DE distances • Calculate the B, C, E pairwise distances • B and E form new node F… (figure: the same tree with internal nodes E and F)

  33. ClustalW Features: Consensus • Consensus is used to represent the aligned sequences; if different, find the AA that maximizes the score • Final score weighted based on branch length • Weight for A = a = 0.2 + 0.3/2 + 0.3/3 = 0.45 • Weight for B = b = 0.1 + 0.3/2 + 0.3/3 = 0.35 • Weight for C = c = 0.5 + 0.3/3 = 0.6 (figure: tree with branch lengths 0.2 to A, 0.1 to B, 0.3 shared by A and B, 0.5 to C, 0.3 shared by A, B, and C)
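
The weights above follow one rule: each branch on the path from a sequence toward the root contributes its length divided by the number of sequences sharing it. A minimal sketch, with the subtree from the slide encoded as nested (branch length, children) tuples; this encoding and the function names are assumptions made for illustration.

```python
def leaf_names(node):
    """Names of all leaves below a (branch_length, name_or_children) node."""
    _, sub = node
    if isinstance(sub, str):
        return [sub]
    return [name for child in sub for name in leaf_names(child)]

def sequence_weights(node, inherited=0.0, weights=None):
    """Weight of each leaf: sum of branch length / number of leaves sharing it."""
    if weights is None:
        weights = {}
    length, sub = node
    share = inherited + length / len(leaf_names(node))
    if isinstance(sub, str):
        weights[sub] = share
    else:
        for child in sub:
            sequence_weights(child, share, weights)
    return weights

# Subtree from the slide: 0.3 shared by A, B, C; 0.3 shared by A, B; then leaves.
subtree = (0.3, [(0.3, [(0.2, "A"), (0.1, "B")]), (0.5, "C")])
print(sequence_weights(subtree))
```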

  34. Scoring an Alignment • Sequence A (weight a) …K… and Sequence B (weight b) …I… are merged and aligned to Sequence C (weight c) …L… and Sequence D (weight d) …V… • Score for aligning the column: [a × c × Score(K,L) + a × d × Score(K,V) + b × c × Score(I,L) + b × d × Score(I,V)] / 4
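
A sketch of the weighted column score. The weights a, b, c reuse the values from the previous slide; d = 0.5 is a made-up value, since the slide gives none. The four substitution scores are included as illustrative values and should be checked against the matrix actually in use. The function name is hypothetical.

```python
# Illustrative substitution scores for the four residue pairs in the column.
EXAMPLE_SCORES = {
    ("K", "L"): -2, ("K", "V"): -2,
    ("I", "L"): 2, ("I", "V"): 3,
}

def column_score(group1, group2, scores):
    """Weighted average pairwise score between two aligned columns.

    Each group is a list of (residue, sequence weight) pairs.
    """
    total = 0.0
    for res1, w1 in group1:
        for res2, w2 in group2:
            total += w1 * w2 * scores[(res1, res2)]
    return total / (len(group1) * len(group2))

column_ab = [("K", 0.45), ("I", 0.35)]  # sequences A and B
column_cd = [("L", 0.6), ("V", 0.5)]    # sequences C and D (d is hypothetical)
print(column_score(column_ab, column_cd, EXAMPLE_SCORES))
```

Dividing by the number of pairs (4 here) keeps scores comparable when groups of different sizes are merged.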

  35. ClustalW Features: Gaps • Sequence specific gap penalties • Penalize gaps more in segments that are less likely to have gaps

  36. Progressive Alignment Limitations • Gaps can proliferate, if not careful • E.g. Align1: ABCD-E → ABC-D-E; Align2: ABC-DE → ABC-D-E • Need many heuristic parameters • Does not guarantee global optimum • Errors in initial alignments are propagated • Manual improvements: • Shift residues from one side of a gap to the other • Reduce gaps

  37. ClustalW Alignment • Column symbols: * identical; : conserved; . semi-conserved

  38. ClustalW Tree • Branch length ~ distance (figure: example tree with labeled branch lengths)

  39. Summary • Fast sequence similarity search • Break seq, hash DB sub-seq, match sub-seq and extend, use DP for optimal alignment • BLAST: most widely used, many applications, with sound statistical foundations • BLAT: aligns a sequence to a genome, fast yet needs higher similarity • Protein global MSA • Progressive heuristic alignment • ClustalW: pairwise, tree, merge alignments • Merge with minimum edit, sequence weighting, sequence/position specific gaps

  40. Acknowledgment • David Mount • Aoife McLysaght • Ir. Brecht Claerhout
