
Profiles and multiple Sequence alignments

Profiles and multiple Sequence alignments. Understanding Bioinformatics, 9th KIAS winter school. Lee, Juyong. Contents: Defining profile; PSSM by PSI-BLAST; Profile HMM; Aligning profiles; PSSM & Profile HMM; Generate multiple sequence alignment; Progressive; Other methods. What is Profile?


Presentation Transcript


  1. Profiles and multiple Sequence alignments Understanding Bioinformatics 9th KIAS winter school Lee, Juyong

  2. Contents • Defining profile • PSSM by PSI-BLAST • Profile HMM • Aligning profiles • PSSM & Profile HMM • Generate multiple sequence alignment • Progressive • Other methods

  3. What is Profile? • Represents the general properties of a set of sequences • A set of sequences contains more information than a single sequence • The environment of each position is taken into account • Two types • Position Specific Scoring Matrix • Profile Hidden Markov Model

  4. Example PSSM

  5. Position specific scoring matrix $> blastpgp -b 0 -j 3 -h 0.001 -d myDB -i mySEQ.fasta -Q myPSSM.mtx -o myMSA.bla

  6. A set of sequences has more information • Are K, I and S meaningful? Are A & T meaningless? • K, I and S are highly conserved! • T at the sixth column is also conserved • 2nd and 4th columns do not show preference • Alignment: K-IAS-- KAI-ST- K-I-ST- KRISS-- K-I-STI K-IAS- KAI-ST

  7. Generating PSSM • Log-odds score m of amino acid a at position u, computed from the multiple sequence alignment • Lack of information should be treated! • Not good: if a is not observed, m → -∞
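The problem named on this slide can be reproduced in a few lines. The sketch below is only an illustration: it takes the five alignment rows from the earlier slide, assumes a uniform background of 1/20 per amino acid, and ignores gaps when computing column fractions. None of these choices is stated in the slides, and the function name is hypothetical.

```python
import math
from collections import Counter

# Alignment rows taken from the earlier slide; '-' marks a gap.
ALIGNMENT = ["K-IAS--", "KAI-ST-", "K-I-ST-", "KRISS--", "K-I-STI"]

# Assumption: uniform background frequency over the 20 amino acids.
BACKGROUND = 1.0 / 20.0

def column_log_odds(alignment, u, residue, background=BACKGROUND):
    """Log-odds score (base 2) of `residue` at 0-based column `u`, without pseudo-counts."""
    # Assumption: gaps are ignored when computing the observed fraction.
    column = [row[u] for row in alignment if row[u] != "-"]
    if not column:
        return float("-inf")
    q = Counter(column)[residue] / len(column)
    return math.log2(q / background) if q > 0 else float("-inf")

print(column_log_odds(ALIGNMENT, 0, "K"))  # K fills the whole first column
print(column_log_odds(ALIGNMENT, 0, "A"))  # A is never observed there: -inf
```

An unobserved residue gets a score of minus infinity, so a single unseen amino acid would veto any otherwise good match. This is the motivation for pseudo-counts.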

  8. Generating PSSM (2) Pseudo-counts • Combine the fraction of amino acid a at position u with the overall distribution of amino acid a • α & β are scaling parameters
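The formula on this slide is not preserved in the transcript. One common way to combine the two quantities it names is a weighted average, sketched below. The form `(alpha * f + beta * p) / (alpha + beta)`, the symbol names `f_ua` and `p_a`, and the example values of α and β are assumptions, not taken from the slide.

```python
import math

def smoothed_probability(f_ua, p_a, alpha, beta):
    """Weighted mix of the observed fraction f_ua and the background p_a (assumed form)."""
    return (alpha * f_ua + beta * p_a) / (alpha + beta)

def smoothed_log_odds(f_ua, p_a, alpha, beta):
    """Base-2 log-odds score computed from the smoothed probability."""
    return math.log2(smoothed_probability(f_ua, p_a, alpha, beta) / p_a)

# Hypothetical values: background 0.05, alpha = 4, beta = 1.
print(smoothed_log_odds(0.0, 0.05, alpha=4, beta=1))  # unobserved residue: finite, negative
print(smoothed_log_odds(1.0, 0.05, alpha=4, beta=1))  # fully conserved residue: positive
```

With β > 0 the score of an unobserved residue stays finite. A larger α relative to β trusts the observed alignment more.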

  9. Generating PSSM (3) More realistic pseudocounts • Use substitution matrix information rather than random alignment! • Pseudo count of amino acid a • F : frequency of amino acid b at u • Formula used in PSI-BLAST
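The slide's formula is missing from the transcript. As a hedged reconstruction based on the published description of PSI-BLAST, not on the slide itself, the substitution-matrix pseudo-count is usually written as below. The symbol names are assumptions: $f_{u,b}$ is the observed frequency of amino acid $b$ at position $u$, $p_b$ the background frequency, $q_{ab}$ the target pair frequency implied by the substitution matrix, $g_{u,a}$ the pseudo-count, $Q_{u,a}$ the estimated probability, and $m_{u,a}$ the resulting log-odds score.

```latex
g_{u,a} = \sum_{b} \frac{f_{u,b}}{p_b}\, q_{ab},
\qquad
Q_{u,a} = \frac{\alpha\, f_{u,a} + \beta\, g_{u,a}}{\alpha + \beta},
\qquad
m_{u,a} = \log \frac{Q_{u,a}}{p_a}
```

Compared with background-only pseudo-counts, an unobserved residue that substitutes easily for the observed ones receives a larger pseudo-count than one that rarely does.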

  10. Example of Pseudocount

  11. PSI-BLAST is a sequence DB search program • Goal : Find sequence homologs! • First, perform a regular BLAST local search • Build a PSSM based on the first round result • Align sequences against the PSSM • Update the sequence alignment! • Do these iteratively!

  12. Sequence Logo

  13. Profile HMM • Represent general property of a set of sequences based on Hidden Markov Model • [Figure: state diagram with transition probabilities (0.4, 0.1, 0.1, 0.6, 0.5, 0.7, 0.4, 0.2, 0.7, 0.3, 0.6); states emit amino acids]

  14. Profile HMM (2) • Alignment: KIA-S- K-AIST KI--ST; KIA-S- K-AIST • [Figure: profile HMM with states Start, M1-M4, I0-I4, D1-D4 and END; emitted residues shown: A, S, K, T, I]

  15. Profile HMM (3) • Alignment: KIA-S- K-AIST KI--ST; KIAS- KI-ST • [Figure: profile HMM with states Start, M1-M4, I0-I4, D1-D4 and END; emitted residues shown: I, S, T, K]

  16. Estimate probabilities • Transition probability between states • Amino acid emission probability

  17. Profile HMM requires a lot of data • Many parameters to be trained • Transition probabilities ~ Nseq * 9 (Nseq: number of positions) • Amino acid emission probabilities ~ Nseq * 20 • For a 100-residue sequence, ~3000 parameters to be tuned • Generally at least 20~30 related sequences are required to build an accurate profile HMM
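The parameter estimate on this slide is simple arithmetic. The snippet below spells it out using the slide's own per-position figures of 9 transitions and 20 emissions. The function name is hypothetical.

```python
def profile_hmm_parameter_count(n_positions, transitions_per_position=9,
                                emissions_per_position=20):
    """Rough parameter count using the per-position figures quoted on the slide."""
    return n_positions * (transitions_per_position + emissions_per_position)

print(profile_hmm_parameter_count(100))  # 2900, i.e. roughly 3000
```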

  18. Many possible paths! We need to score them... • QUERY : KRISS • [Figure: two copies of the profile HMM (Start, M1-M4, I0-I4, D1-D4, END), each showing a different path that emits the query residues K, R, I, S, S]

  19. How to score a sequence against a profile HMM • Two ways to evaluate how well a sequence fits a profile HMM • Through the most probable path • Viterbi algorithm • Faster, less accurate • Consider all possible paths! • Forward (Backward) algorithm • Slower, more accurate

  20. Viterbi algorithm Equivalent to the dynamic programming of pairwise alignment
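A minimal sketch of the Viterbi recursion follows. To keep it short it uses a generic two-state HMM ("M" and "I") rather than the full match/insert/delete topology of a profile HMM, and it has no explicit end state. All state names and probabilities are made up for illustration.

```python
import math

# Toy model (all numbers are assumptions, not from the slides).
STATES = ("M", "I")
START = {"M": 0.9, "I": 0.1}
TRANS = {"M": {"M": 0.8, "I": 0.2}, "I": {"M": 0.6, "I": 0.4}}
EMIT = {"M": {"K": 0.7, "A": 0.1, "S": 0.2}, "I": {"K": 0.2, "A": 0.6, "S": 0.2}}

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return (log probability, state path) of the single most probable path."""
    # best[-1][s]: log probability of the best path emitting the prefix so far and ending in s
    best = [{s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]]) for s in states}]
    back = [{}]
    for x in obs[1:]:
        scores, pointers = {}, {}
        for s in states:
            # Pick the predecessor that maximises the path probability into s.
            prev = max(states, key=lambda r: best[-1][r] + math.log(trans_p[r][s]))
            scores[s] = best[-1][prev] + math.log(trans_p[prev][s]) + math.log(emit_p[s][x])
            pointers[s] = prev
        best.append(scores)
        back.append(pointers)
    # Trace back from the best final state.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for pointers in reversed(back[1:]):
        path.append(pointers[path[-1]])
    path.reverse()
    return best[-1][last], path

log_p, path = viterbi("KAS", STATES, START, TRANS, EMIT)
print(path, math.exp(log_p))
```

Each cell keeps only the maximum over predecessors plus a back-pointer, which is the same max-and-traceback structure as pairwise alignment by dynamic programming.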

  21. Forward algorithm • Consider all possible paths! • Probability of emitting x_i at state S_u
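The forward recursion replaces the maximum of Viterbi with a sum over predecessor states. The sketch below uses a generic two-state HMM with made-up probabilities and no end state, so it is a simplification of a real profile HMM rather than the lecture's own model.

```python
# Toy model (all numbers are assumptions, not from the slides).
STATES = ("M", "I")
START = {"M": 0.9, "I": 0.1}
TRANS = {"M": {"M": 0.8, "I": 0.2}, "I": {"M": 0.6, "I": 0.4}}
EMIT = {"M": {"K": 0.7, "A": 0.1, "S": 0.2}, "I": {"K": 0.2, "A": 0.6, "S": 0.2}}

def forward(obs, states, start_p, trans_p, emit_p):
    """Total probability of `obs`, summed over every state path."""
    # f[s]: probability of emitting the prefix seen so far and being in state s
    f = {s: start_p[s] * emit_p[s][obs[0]] for s in states}
    for x in obs[1:]:
        f = {s: emit_p[s][x] * sum(f[r] * trans_p[r][s] for r in states)
             for s in states}
    return sum(f.values())

print(forward("KAS", STATES, START, TRANS, EMIT))
```

For long sequences the products underflow, so practical implementations work in log space or rescale each column. That detail is omitted here.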

  22. Summary • Profile → General property of a query sequence derived from a set of related sequences • Position Specific Scoring Matrix • Profile Hidden Markov Model • Can find remote sequence homologs • These cannot be detected by pairwise alignment of sequences

  23. Aligning Profiles • Comparing PSSMs • LAMA : no gaps allowed, uses Pearson correlation of scores • Prof_sim : gaps allowed, uses the amino acid distribution at each column • COMPASS : gaps allowed, pseudocounts are used, similar to PSI-BLAST
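To make the "Pearson correlation of scores, no gaps" idea concrete, the sketch below compares two equal-length toy PSSMs column by column. It is written in the spirit of the LAMA bullet only. The averaging step, the 4-letter alphabet and all score values are assumptions, and this is not LAMA's actual scoring procedure.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length score vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)  # undefined if a column is constant

def ungapped_profile_similarity(pssm_a, pssm_b):
    """Mean column-by-column Pearson correlation of two equal-length PSSMs."""
    if len(pssm_a) != len(pssm_b):
        raise ValueError("no gaps allowed: profiles must have the same length")
    return sum(pearson(a, b) for a, b in zip(pssm_a, pssm_b)) / len(pssm_a)

# Toy PSSMs: one row per position, one score per letter of a 4-letter alphabet.
PROFILE_1 = [[3, -1, -2, 0], [-1, 4, 0, -2], [0, -2, 5, -1]]
PROFILE_2 = [[2, 0, -3, 0], [-2, 5, 1, -1], [-1, -1, 4, 0]]

print(ungapped_profile_similarity(PROFILE_1, PROFILE_1))
print(ungapped_profile_similarity(PROFILE_1, PROFILE_2))
```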

  24. Aligning profile HMMs • COACH and HHsearch are available • Can find very remote homologs • Position-dependent gap scoring is possible

  25. Multiple Sequence Alignment- MSA

  26. Why is MSA difficult? • DP for pairwise alignment is easy and applicable • Only three cases • If three sequences... • Seven cases... • For six sequences... • 60TB memory required • DP is impossible • A A A - A A A V V - V - - V L - - L
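The case counts on this slide follow from a simple rule: at each step every non-empty subset of the sequences can advance by one residue while the others take a gap, which gives 2^N - 1 cases for N sequences. The sketch below also counts table cells for an assumed sequence length of 200. The slide does not state which length or cell size lies behind its 60TB figure, so the second number is only an order-of-magnitude illustration.

```python
def cell_neighbour_cases(n_sequences):
    """Predecessor cases per DP cell: every non-empty subset of sequences advances."""
    return 2 ** n_sequences - 1

def dp_table_cells(n_sequences, length):
    """Cells in a full N-dimensional DP table for sequences of equal length."""
    return (length + 1) ** n_sequences

for n in (2, 3, 6):
    # length=200 is an assumption chosen for illustration only
    print(n, cell_neighbour_cases(n), dp_table_cells(n, 200))
```

For six sequences of length 200 the table already has more than 6 x 10^13 cells, so even one byte per cell lands in the tens of terabytes.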

  27. Methods to align sequences • Progressive method • Add one sequence at a time • ClustalW, T-COFFEE, etc. • Iterative method • Deletion and realignment steps are introduced • Prrp, DIALIGN, MUSCLE, etc.

  28. Order is important! • Let's align the following: D-G-D, G-G, D-G-G • Case 1: D-G-D, G-G, D-G-G → --D-G-D --G-G-- D-G-G-- • Case 2: D-G-G, G-G, D-G-D

  29. Determine order! • Build a phylogenetic tree based on the matrix of all pairwise distances
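The slide does not say which tree-building method is used (ClustalW itself uses neighbor-joining). As a simpler stand-in, the sketch below uses UPGMA-style average-linkage clustering to turn a pairwise distance matrix into a merge order. The sequence names and distances are made up, and the function name is hypothetical.

```python
def upgma_merge_order(names, dist):
    """Return the order in which clusters are joined (UPGMA average linkage).

    `dist` maps frozenset({a, b}) to the distance between a and b.
    """
    sizes = {name: 1 for name in names}   # cluster label -> number of members
    d = dict(dist)
    merges = []
    while len(sizes) > 1:
        a, b = sorted(min(d, key=d.get))  # closest pair of clusters
        size_a, size_b = sizes.pop(a), sizes.pop(b)
        new = f"({a},{b})"
        for c in sizes:
            # Size-weighted average of the distances from the two merged clusters.
            d[frozenset((new, c))] = (
                size_a * d[frozenset((a, c))] + size_b * d[frozenset((b, c))]
            ) / (size_a + size_b)
        d = {pair: v for pair, v in d.items() if a not in pair and b not in pair}
        sizes[new] = size_a + size_b
        merges.append((a, b))
    return merges

# Made-up distances between four hypothetical sequences A-D.
DIST = {
    frozenset("AB"): 2, frozenset("CD"): 3,
    frozenset("AC"): 8, frozenset("AD"): 9,
    frozenset("BC"): 8, frozenset("BD"): 9,
}
print(upgma_merge_order(["A", "B", "C", "D"], DIST))
```

A progressive aligner would then align in this order: the closest pair first, and the resulting sub-alignments last.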

  30. Which MSA is better? - Scoring scheme • Usually the sum-of-pairs score is used
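A sum-of-pairs score adds up the pairwise scores of all residue pairs in every column. The sketch below uses made-up match, mismatch and gap values and a toy alignment. Real programs use a substitution matrix and more elaborate gap penalties.

```python
from itertools import combinations

def sum_of_pairs(alignment, match=1, mismatch=-1, gap=-2):
    """Sum-of-pairs score of an alignment given as equal-length rows."""
    total = 0
    for column in zip(*alignment):
        for a, b in combinations(column, 2):
            if a == "-" and b == "-":
                continue              # gap-gap pairs are not scored
            if a == "-" or b == "-":
                total += gap
            elif a == b:
                total += match
            else:
                total += mismatch
    return total

print(sum_of_pairs(["DGD", "DGG", "-GG"]))  # -1
```

Column 1 (D, D, -) contributes 1 - 2 - 2 = -3, column 2 (G, G, G) contributes 3 and column 3 (D, G, G) contributes -1, for a total of -1.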

  31. Scores • ClustalW • Similar to schemes for pairwise alignment • Employs residue-specific gap-opening penalties

  32. Scores (2) • T-COFFEE • Score if aligned column is present in the Library • Diverse alignment • Local & Global

  33. Library Extension of T-COFFEE Different Weights for individual columns

  34. Other methods - DIALIGN • Construct the whole alignment from ungapped local alignments • Find all ungapped alignments and weight them! • Key idea : pairwise alignment can miss biologically important regions

  35. Other methods - SAGA • Genetic Algorithm • Alignment → generation • Evolve through mutation & crossover

  36. Other methods - MSACSA

  37. Thank you!
