
Profiles for Sequences




Presentation Transcript

  1. Profiles for Sequences

  2. Sequence Profiles • Often, sequences are characterized by similarities that are not well captured through matching algorithms. • For example, identification of genes in the presence of exons/introns, gene features (CpG islands, etc.), domain profiles in proteins, among others. • For such sequences, Markov chains provide useful abstractions.

  3. Markov Chains • States: three states (sunny, cloudy, rainy). • State transition matrix: the probability of the weather given the previous day's weather. • Initial distribution: the probability of the system being in each of the states at time 0.
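
The three ingredients of the weather chain (states, transition matrix, initial distribution) can be sketched in a few lines of Python. The slide gives no numbers, so every probability below is an invented, illustrative value, and `chain_probability` is a hypothetical helper name.

```python
# Hypothetical weather Markov chain; all probabilities are made up for illustration.
states = ["sunny", "cloudy", "rainy"]

# Initial distribution: probability of each state at time 0.
initial = {"sunny": 0.5, "cloudy": 0.3, "rainy": 0.2}

# State transition matrix: transition[yesterday][today].
transition = {
    "sunny":  {"sunny": 0.6, "cloudy": 0.3, "rainy": 0.1},
    "cloudy": {"sunny": 0.3, "cloudy": 0.4, "rainy": 0.3},
    "rainy":  {"sunny": 0.2, "cloudy": 0.3, "rainy": 0.5},
}


def chain_probability(path):
    """Probability of observing the given sequence of states."""
    prob = initial[path[0]]
    for prev, curr in zip(path, path[1:]):
        prob *= transition[prev][curr]
    return prob


# P(sunny, sunny, rainy) = 0.5 * 0.6 * 0.1
p = chain_probability(["sunny", "sunny", "rainy"])
print(p)
```

Each row of the transition matrix sums to 1, because the chain must move to some state on the next day.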

  4. Hidden Markov Models • Hidden states: the (true) states of a system that may be described by a Markov process (e.g., the weather). • Observable states: the states of the process that are 'visible'.

  5. Hidden Markov Models • Initial distribution: the initial state probability vector. • State transition matrix. • Emission probabilities: the probability of observing a particular observable state given that the model is in a particular hidden state.

  6. Hidden Markov Models • Figure: an HMM annotated with transition probabilities and output probabilities. • Observed sequences can be scored if their state transitions are known. • The probability of ACCY along this path is: 0.4 × 0.3 × 0.46 × 0.6 × 0.97 × 0.5 × 0.015 × 0.73 × 0.01 × 1 = 1.76 × 10^-6.
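
The arithmetic for the ACCY path can be checked directly. The transcript does not say which factors are transition probabilities and which are output probabilities, so this sketch simply multiplies them in the order given.

```python
import math

# Factors exactly as listed on the slide, in order.
factors = [0.4, 0.3, 0.46, 0.6, 0.97, 0.5, 0.015, 0.73, 0.01, 1.0]

path_probability = math.prod(factors)
print(f"{path_probability:.3g}")
```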

  7. Methods for Hidden Markov Models • Scoring problem: given an existing HMM and an observed sequence, what is the probability that the HMM generates the sequence?
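
The scoring problem is usually solved with the forward algorithm, which sums over all hidden-state paths without enumerating them. The slide does not name the algorithm, so treat this as a minimal sketch. The two-state model (`H` for GC-rich, `L` for AT-rich) and all of its probabilities are hypothetical.

```python
# Hypothetical two-state HMM over DNA; every number is illustrative only.
states = ["H", "L"]
start = {"H": 0.5, "L": 0.5}
trans = {
    "H": {"H": 0.5, "L": 0.5},
    "L": {"H": 0.4, "L": 0.6},
}
emit = {
    "H": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
    "L": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
}


def forward_probability(obs, states, start, trans, emit):
    """P(obs | model), summed over all hidden-state paths."""
    # alpha[s] = P(symbols seen so far, current hidden state is s)
    alpha = {s: start[s] * emit[s][obs[0]] for s in states}
    for symbol in obs[1:]:
        alpha = {
            s: emit[s][symbol] * sum(alpha[r] * trans[r][s] for r in states)
            for s in states
        }
    return sum(alpha.values())


score = forward_probability("GGCA", states, start, trans, emit)
print(score)
```

For long sequences the products underflow, so practical implementations work with logarithms or rescale `alpha` at each step.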

  8. Methods, contd. • Alignment problem: given a sequence, what is the optimal (most probable) state sequence that the HMM would use to generate it?
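
The most probable state sequence can be found with the Viterbi algorithm, a dynamic program that keeps only the best path into each state at each position. This is a minimal sketch. The two-state model (`H`, `L`) and its probabilities are hypothetical.

```python
# Hypothetical two-state HMM over DNA; every number is illustrative only.
states = ["H", "L"]
start = {"H": 0.5, "L": 0.5}
trans = {
    "H": {"H": 0.5, "L": 0.5},
    "L": {"H": 0.4, "L": 0.6},
}
emit = {
    "H": {"A": 0.2, "C": 0.3, "G": 0.3, "T": 0.2},
    "L": {"A": 0.3, "C": 0.2, "G": 0.2, "T": 0.3},
}


def viterbi(obs, states, start, trans, emit):
    """Return (best_path, probability) for the most likely hidden-state path."""
    # best[s] = (probability, path) of the best path that ends in state s
    best = {s: (start[s] * emit[s][obs[0]], [s]) for s in states}
    for symbol in obs[1:]:
        new_best = {}
        for s in states:
            # Choose the predecessor state that maximizes the path probability.
            prob, path = max(
                ((best[r][0] * trans[r][s], best[r][1]) for r in states),
                key=lambda item: item[0],
            )
            new_best[s] = (prob * emit[s][symbol], path + [s])
        best = new_best
    prob, path = max(best.values(), key=lambda item: item[0])
    return path, prob


best_path, best_prob = viterbi("GGCACTGAA", states, start, trans, emit)
print("".join(best_path), best_prob)
```

Unlike the scoring problem, which sums over paths, this replaces the sum with a maximum and remembers which predecessor achieved it.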

  9. Methods, contd. • Training problem: how do we estimate the structure and parameters of an HMM from data?

  10. HMMs: Some Applications • Gene finding and prediction • Protein-profile analysis • Secondary structure prediction • Copy number variation • Characterizing SNPs

  11. Gene Template (Left) (Removed)

  12. HMMs: Applications • Classification: classifying observations within a sequence • Order: a DNA sequence is a set of ordered observations • Structure: can be intuitively defined • Measure of success: number of complete exons correctly labeled • Training data: available from various genome annotation projects

  13. HMMs for Gene Finding • Training: Expectation Maximization (EM) • Parsing: Viterbi algorithm • Figure: an HMM for unspliced genes (x: non-coding DNA; c: coding state)

  14. Genefinders: a Comparison • Legend: Sn = sensitivity; Sp = specificity; Ac = approximate correlation; ME = missing exons; WE = wrong exons • Source: GENSCAN Performance Data, http://genes.mit.edu/Accuracy.html

  15. Protein Profile HMMs • Motivation: given a single amino acid target sequence of unknown structure, we want to infer the structure of the resulting protein, using profile similarity. • What is a profile? • Protein families consist of related sequences and structures • Same function • Clear evolutionary relationship • Patterns of conservation: some positions are more conserved than others

  16. HMMs From Alignment • Alignment of five DNA sequences (the gapped middle columns are modeled as an insertion):
      ACA---ATG
      TCAACTATC
      ACAC--AGC
      AGA---ATC
      ACCG--ATC
      • Figure: an HMM for a DNA motif alignment, with transition probabilities and output probabilities. Transitions are shown as arrows whose thickness indicates their probability. In each state, the histogram shows the probabilities of the four bases.
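
One way to see where the output probabilities come from is to count bases column by column. The sketch below reads the slide's alignment as five rows of nine columns, which is an assumption about the flattened layout. It treats the six ungapped columns as match states and pools the three gapped columns into one insert state. The function and variable names are hypothetical.

```python
from collections import Counter

# Alignment as read from the slide (row layout is an assumption).
alignment = [
    "ACA---ATG",
    "TCAACTATC",
    "ACAC--AGC",
    "AGA---ATC",
    "ACCG--ATC",
]


def column_frequencies(rows, column):
    """Relative base frequencies in one alignment column, ignoring gaps."""
    bases = [row[column] for row in rows if row[column] != "-"]
    counts = Counter(bases)
    return {base: counts[base] / len(bases) for base in "ACGT"}


# Columns 0-2 and 6-8 contain no gaps: one match state per column.
match_columns = [0, 1, 2, 6, 7, 8]
match_emissions = [column_frequencies(alignment, c) for c in match_columns]

# Columns 3-5 contain gaps: pool their bases into a single insert state.
insert_bases = [row[c] for row in alignment for c in (3, 4, 5) if row[c] != "-"]
insert_counts = Counter(insert_bases)
insert_emissions = {b: insert_counts[b] / len(insert_bases) for b in "ACGT"}

for column, freqs in zip(match_columns, match_emissions):
    print(column, freqs)
print("insert", insert_emissions)
```

Raw frequencies assign probability zero to any base not seen in a column; real profile HMMs typically add pseudocounts to avoid that.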

  17. HMMs from Alignments • State types: deletion states, matching states, insertion states • Number of matching states = average sequence length in the family • PFAM: database of protein families (http://pfam.wustl.edu)

  18. Database Searching • Given an HMM, M, for a sequence family, find all members of the family in a database. • LL score: LL(x) = log P(x|M) • The LL score is length dependent, so it must be normalized or converted to a Z-score.
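
The slide does not say how the normalization or the Z-score is computed. One simple reading is sketched here: divide LL by sequence length, then express a candidate's score in standard deviations from the mean of a set of background scores. The helper names and all numbers are hypothetical.

```python
import statistics


def per_residue_score(log_likelihood, length):
    """Length-normalized log-likelihood: LL(x) / len(x)."""
    return log_likelihood / length


def z_score(score, background_scores):
    """Distance of `score` from the background mean, in standard deviations."""
    mean = statistics.mean(background_scores)
    stdev = statistics.stdev(background_scores)
    return (score - mean) / stdev


# Made-up per-residue scores for sequences assumed NOT to belong to the family.
background = [-2.0, -2.2, -1.8, -2.1, -1.9]

# Made-up candidate: LL = -60 over 50 residues.
candidate = per_residue_score(-60.0, 50)
z = z_score(candidate, background)
print(candidate, z)
```

A large positive Z-score suggests that the sequence fits the model much better than typical non-members do.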

  19. Querying a Sequence • Suppose I have a query protein sequence and want to know which family it belongs to. • There can be many paths leading to the generation of this sequence; we need to find all of these paths and sum their probabilities. • Consensus sequence ACAC--ATC: P(ACACATC) = 0.8×1 × 0.8×1 × 0.8×0.6 × 0.4×0.6 × 1×1 × 0.8×1 × 0.8 = 4.7 × 10^-2
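
The consensus-path product can be reproduced by pairing each state's output probability with the probability of the transition leaving it. That pairing is the natural reading of the slide's grouping, but the transcript does not label the factors. The log of the result is the log-likelihood of this one path.

```python
import math

# (output probability, outgoing transition probability) per state on the path.
# The final state has no transition factor on the slide, so 1.0 is used there.
path_steps = [
    (0.8, 1.0),
    (0.8, 1.0),
    (0.8, 0.6),
    (0.4, 0.6),
    (1.0, 1.0),
    (0.8, 1.0),
    (0.8, 1.0),
]

# Summing logs avoids underflow on long sequences.
log_likelihood = sum(math.log(e) + math.log(t) for e, t in path_steps)
probability = math.exp(log_likelihood)
print(f"{probability:.2g}", log_likelihood)
```

This scores a single path only; the full P(x|M) sums over every path that can generate the sequence.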

  20. Multiple Alignments • Try every possible path through the model that would produce the target sequences. • Keep the best one and its probability. • Output: sequence of match, insert, and delete states • Method: Viterbi algorithm (dynamic programming)

  21. HMMs from Unaligned Sequences • Baum-Welch Expectation-maximization method • Start with a model whose length matches the average length of the sequences and with random output and transition probabilities. • Align all the sequences to the model. • Use the alignment to alter the output and transition probabilities • Repeat. Continue until the model stops changing • By-product: a multiple alignment
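
The loop described above (random start, re-estimate, repeat until the model stops changing) can be sketched as follows. To stay short, this uses a plain fully connected three-state HMM, not the match/insert/delete profile architecture. The training sequences, the state count, and all function names are assumptions, and no scaling is applied, so it only suits short sequences.

```python
import math
import random


def forward(obs, states, start, trans, emit):
    alpha = [{s: start[s] * emit[s][obs[0]] for s in states}]
    for sym in obs[1:]:
        prev = alpha[-1]
        alpha.append({s: emit[s][sym] * sum(prev[r] * trans[r][s] for r in states)
                      for s in states})
    return alpha


def backward(obs, states, trans, emit):
    beta = [{s: 1.0 for s in states}]
    for sym in reversed(obs[1:]):
        nxt = beta[0]
        beta.insert(0, {s: sum(trans[s][r] * emit[r][sym] * nxt[r] for r in states)
                        for s in states})
    return beta


def normalize(counts):
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()}


def baum_welch_step(seqs, states, symbols, start, trans, emit):
    """One EM iteration. Returns the new parameters and the log-likelihood
    of the data under the OLD parameters."""
    start_n = {s: 0.0 for s in states}
    trans_n = {s: {r: 0.0 for r in states} for s in states}
    emit_n = {s: {x: 0.0 for x in symbols} for s in states}
    log_lik = 0.0
    for obs in seqs:
        alpha = forward(obs, states, start, trans, emit)
        beta = backward(obs, states, trans, emit)
        total = sum(alpha[-1].values())
        log_lik += math.log(total)
        for t, sym in enumerate(obs):
            for s in states:
                gamma = alpha[t][s] * beta[t][s] / total  # P(state s at t | obs)
                emit_n[s][sym] += gamma
                if t == 0:
                    start_n[s] += gamma
                if t + 1 < len(obs):
                    for r in states:  # expected s -> r transitions at t
                        trans_n[s][r] += (alpha[t][s] * trans[s][r]
                                          * emit[r][obs[t + 1]] * beta[t + 1][r] / total)
    return (normalize(start_n),
            {s: normalize(trans_n[s]) for s in states},
            {s: normalize(emit_n[s]) for s in states},
            log_lik)


rng = random.Random(0)


def random_dist(keys):
    weights = [rng.random() + 0.1 for _ in keys]
    total = sum(weights)
    return {k: w / total for k, w in zip(keys, weights)}


# Hypothetical unaligned training sequences.
sequences = ["ACAATG", "TCAACTATC", "ACACAGC", "AGAATC", "ACCGATC"]
states = ["s0", "s1", "s2"]
symbols = "ACGT"

# Start with random output and transition probabilities.
start = random_dist(states)
trans = {s: random_dist(states) for s in states}
emit = {s: random_dist(symbols) for s in states}

log_likelihoods = []
for _ in range(30):
    start, trans, emit, ll = baum_welch_step(sequences, states, symbols, start, trans, emit)
    log_likelihoods.append(ll)
    # Stop when the model (as measured by the log-likelihood) stops changing.
    if len(log_likelihoods) > 1 and ll - log_likelihoods[-2] < 1e-6:
        break
print(log_likelihoods[0], log_likelihoods[-1])
```

Each iteration cannot decrease the log-likelihood, but the procedure only finds a local optimum, so results depend on the random starting point.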

  22. PHMM Example • An alignment of 30 short amino acid sequences chopped out of an alignment of the SH3 domain. The shaded areas are the most conserved and were represented by the main states in the HMM. The unshaded area was represented by an insert state.
