
sequence analysis - overview

Sequence Analysis - Overview. Multiple Sequence Alignments. Global multiple sequence alignment deals with the entire length of the homologous sequences. Abstractions and representations of multiple sequence alignments may be character based or numeric. Local multiple sequence alignment (generally called pattern identification) deals with a segment (most often without gaps) from the sequences; the sequences need not be homologous over their entire length.

Pat_Xavi

Presentation Transcript


    3. Multiple Sequence Alignments. Global multiple sequence alignment deals with the entire length of the homologous sequences. Abstractions and representations of multiple sequence alignments may be character based or numeric. Local multiple sequence alignment (generally called pattern identification) deals with a segment (most often without gaps) from the sequences; the sequences need not be homologous over their entire length.

    4. Focus: What to Remember. How to abstract an alignment or motif. Which sequence or structural elements are likely to make a good diagnostic pattern or motif. How to discover motifs without previously aligning the sequences. How to recognize a good pattern. How to search for a good pattern.

    5. What are motifs? Why look for them? Motifs are well-conserved regions of sequence, generally organized around one or two very highly conserved residues. The high conservation arises because a mutation within the motif is likely to produce a defective protein, so motifs are likely to be important for structure, function, or both. Motifs are useful for finding sequences that are distant family members and for editing multiple sequence alignments.

    6. Heterotrimeric G-proteins: alpha, beta and gamma subunits

    7. Cellular signaling and G-proteins

    8. G-protein structure

    9. Multiple Sequence Alignment: Fungal G-alpha subunits

    10. Types of Residues. Highly conserved: probably structure or function. Highly mutable: filler. Non-randomly mutated: recognition sites, substrate specificity.

    11. Abstracting and Representing Multiple Sequence Alignments. Consensus sequence: the residue most common at each position of the alignment. Composite sequence (set representation): all residues present at each position in the alignment. Composition matrix: a table showing how many of each residue are present at each position.
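
The three abstractions on this slide can be sketched in a few lines of Python. This is a minimal illustration rather than code from the presentation: the function names and the toy alignment are hypothetical, the alignment is assumed to be gap-free, and ties in the consensus are broken alphabetically, which is an arbitrary choice.

```python
from collections import Counter


def composition_matrix(alignment):
    """Return one Counter per alignment column (residue -> count)."""
    length = len(alignment[0])
    if any(len(seq) != length for seq in alignment):
        raise ValueError("all aligned sequences must have the same length")
    return [Counter(seq[j] for seq in alignment) for j in range(length)]


def consensus_sequence(alignment):
    """Most common residue per column; ties broken alphabetically (assumption)."""
    columns = composition_matrix(alignment)
    return "".join(min(col, key=lambda r: (-col[r], r)) for col in columns)


def composite_sequence(alignment):
    """Set representation: every residue observed in each column."""
    return [sorted(col) for col in composition_matrix(alignment)]


# Hypothetical toy alignment, for illustration only.
alignment = ["GAGES", "GTGKS", "GAGKT"]
print(consensus_sequence(alignment))
print(composite_sequence(alignment))
print(composition_matrix(alignment)[1])
```

The composition matrix is the richest of the three; the consensus and composite forms can both be derived from it, which is why the sketch builds them on top of `composition_matrix`.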

    13. Representing and Abstracting Multiple Sequence Alignments

    14. Consensus sequences from motifs

    15. Composition Matrix

    16. Representations. Position-Specific Scoring Matrix (PSSM): based on log-odds scores; searched with dynamic programming (usually Smith-Waterman); the gap model probably still needs work; a PSSM can be developed from any number of sequences. Hidden Markov Model (HMM): fully probabilistic; uses the maximum likelihood method; the gap model is integral to the entire model; an HMM usually requires a minimum of 50-100 sequences to get a good model.

    17. Position-Specific Scoring Matrix. Idea: perform a large, high-quality multiple sequence alignment and compute a log-odds score for each position. Problem: assuming a normal amino acid distribution, this would require a minimum of 200 sequences, and probably more on the order of 500-1000. There are also implementation problems when q_ij = 0, because the log-odds score is then undefined. Solution: develop a method that uses the known data and a belief about the missing observations (based on the mutational data at the core of the PAM or BLOSUM series of matrices).
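
A minimal sketch of the log-odds idea and of why q_ij = 0 is a problem. Note the substitution: instead of the PAM/BLOSUM-derived pseudo-counts the slide refers to, this sketch uses a simple background-weighted pseudo-count, which is only a stand-in. The function name, the `pseudocount` parameter, and the toy DNA alignment are all hypothetical, and the alignment is assumed to contain no gap characters.

```python
import math
from collections import Counter


def log_odds_pssm(alignment, background, pseudocount=1.0):
    """Build a log2-odds PSSM: score[j][i] = log2(q_ij / p_i).

    q_ij is estimated from column counts plus a background-weighted
    pseudo-count (an ASSUMPTION standing in for substitution-matrix-based
    pseudo-counts), so q_ij is never zero and the log is always defined.
    """
    n_seqs = len(alignment)
    pssm = []
    for j in range(len(alignment[0])):
        counts = Counter(seq[j] for seq in alignment)
        column = {}
        for residue, p_i in background.items():
            q_ij = (counts[residue] + pseudocount * p_i) / (n_seqs + pseudocount)
            column[residue] = math.log2(q_ij / p_i)
        pssm.append(column)
    return pssm


# Hypothetical toy data: uniform DNA background and a four-sequence alignment.
background = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
alignment = ["ACG", "ACG", "ATG", "ACG"]
for column in log_odds_pssm(alignment, background):
    print({residue: round(score, 2) for residue, score in column.items()})
```

With `pseudocount=0`, any residue that never appears in a column would give `math.log2(0)`, which raises an error; that is the implementation problem the slide flags.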

    18. Log-Odds Scoring Matrix

    19. Position Specific Scoring Matrices. Weighted average: similarity matrix. Bayesian approach: mutational frequencies, Dirichlet mixtures. Evolutionary approach.

    20. Weighted average similarity matrix

    21. Gribskov Profile Gap Penalty. Have a preset maximum open/extend gap penalty (Open = -10, Extend = -1). For each position in the profile, define a multiplier to reduce the gap penalties. The multiplier is 100 for positions in which there are no insertions; otherwise it is based on the maximum length of the gap (LGap) across all sequences in the multiple sequence alignment. Equation: multiplier = Gmax / (1.0 + Ginc * LGap), where Gmax = maximum multiplier (default = 33.3) and Ginc = rate at which the multiplier changes (default = 0.1).
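
The multiplier equation is easy to check numerically. The sketch below follows the slide text literally: positions with no insertions get a multiplier of 100, and other positions use Gmax / (1.0 + Ginc * LGap). Treating the multiplier as a percentage of the preset penalties is an assumption made here for illustration; the slide does not say how the multiplier is applied, and the function names are hypothetical.

```python
def gap_multiplier(lgap, gmax=33.3, ginc=0.1, no_gap_value=100.0):
    """Multiplier for one profile position, given its maximum gap length."""
    if lgap < 0:
        raise ValueError("gap length cannot be negative")
    if lgap == 0:
        # Slide text: multiplier is 100 where there are no insertions.
        return no_gap_value
    return gmax / (1.0 + ginc * lgap)


def position_gap_penalties(lgap, open_penalty=-10.0, extend_penalty=-1.0):
    """Scale the preset penalties; percent scaling is an ASSUMPTION."""
    scale = gap_multiplier(lgap) / 100.0
    return open_penalty * scale, extend_penalty * scale


for lgap in (0, 1, 5, 20):
    print(lgap, round(gap_multiplier(lgap), 2), position_gap_penalties(lgap))
```

Under these assumptions, positions where the alignment already contains long gaps become much cheaper to gap again, while gap-free positions keep the full preset penalty.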

    22. Bayesian Mutational Frequencies. Uses predicted counts (pseudo-counts), based on observed replacement frequencies, to create a weighted average of the probability of finding residue i in the position.

    23. Henikoff Algorithm

    24. Bayesian Dirichlet mixtures. K. Sjolander, K. Karplus, M. Brown, R. Hughey, A. Krogh, S. Mian, D. Haussler. 1996. Dirichlet mixtures: a method for improved detection of weak but significant protein sequence homology. CABIOS 12:327-345. http://www.cse.ucsc.edu/research/compbio/dirichlets/index.html Assumes that a selection of residues should be treated as a group. Provides a rigorous statistical treatment of the pseudo-count problem.

    25. Evolutionary Based Method. M. Gribskov, S. Veretnik. 1996. Identification of Sequence Patterns with Profile Analysis. Methods in Enzymology 266:198-212. An evolution-based method: it determines the probability of which residue was the true ancestor, mixes a selection of 20 different matrices based on those probabilities, and directly computes scores based on the mixture.

    26. MakePSSM. For both DNA and proteins. Currently implements Gribskov's average and Bayesian approaches (both mutational and Dirichlet frequencies). A large number of matrices and frequencies are available (BLOSUM and PAM). Numerous gap models: the Gribskov gap model; linear weighting depending on the number of gaps; a model based on extreme values of log-odds scores. Future additions: Gribskov's evolutionary approach.

    27. MakePSSM. Ways to define: use the mutational frequencies that underlie the BLOSUM or PAM matrices (Henikoff approach); use Dirichlet mixtures (nine component).

    28. PSSM for fungal g-alpha motif 1

    29. Searching with a PSSM. Most approaches use the dynamic programming algorithm, usually the Smith-Waterman variant. It is an excellent method for finding distantly related sequences. The gap model is affine, with the open and extend gap penalties a function of their position in the alignment: Gribskov's has a complicated form, while the Henikoffs did not have a gap model. A PSSM can also be used to locate a motif in an alignment and then edit the alignment.
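
For intuition, here is a deliberately simplified search: an ungapped sliding-window scan, not the Smith-Waterman dynamic programming the slide describes. It corresponds to the case with no gap model at all. The function name, the `unknown_score` parameter, and the toy matrix are hypothetical.

```python
def best_ungapped_hit(pssm, sequence, unknown_score=0.0):
    """Return (start, score) of the best-scoring window of len(pssm) residues.

    Simplification: no gaps are allowed, so this is a plain scan rather than
    a Smith-Waterman alignment against the profile.
    """
    width = len(pssm)
    if len(sequence) < width:
        raise ValueError("sequence is shorter than the PSSM")
    best_start, best_score = 0, float("-inf")
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(col.get(res, unknown_score) for col, res in zip(pssm, window))
        if score > best_score:
            best_start, best_score = start, score
    return best_start, best_score


# Hypothetical three-column PSSM that favours the word "ACG".
toy_pssm = [
    {"A": 2, "C": -1, "G": -1, "T": -1},
    {"A": -1, "C": 2, "G": -1, "T": -1},
    {"A": -1, "C": -1, "G": 2, "T": -1},
]
print(best_ungapped_hit(toy_pssm, "TTACGTT"))
```

A gapped search would replace the inner sum with a dynamic programming recurrence whose open and extend penalties vary by profile position.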

    30. Hidden Markov Model (HMM). HMMs are a fully probabilistic model of a family of homologous sequences. HMMs are not a specific method of creating a multiple sequence alignment; after the HMM is created, an alignment can be generated from the HMM and any sequence from the same homologous family. There are several algorithms for creating the HMM from a set of homologous sequences, and each will yield a different HMM and hence different alignments. An HMM can also be calculated from a good alignment created by other means. Creating HMMs requires many sequences (>100).

    31. HMM: Description. An HMM has 3 different kinds of states; a state is a probability model that specifies how frequently different types of sequence residues are found at a specific position of a family of sequences. Main states: probability of a specific sequence residue. Deletion state: probability of no sequence residue. Insertion state: probability of adding extra sequence residues. The states are connected by transition probabilities that determine how frequently you move from the previous state to each of the 3 states at the next location in the description of the sequence family.
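
The state and transition description above can be made concrete with a tiny, hypothetical profile HMM. All probabilities below are invented for illustration, the begin state is treated as a match state, and direct insert-to-delete and delete-to-insert transitions are left out (a common simplification, assumed here). The function computes the log-probability of one given state path; it does not search for the best path.

```python
import math

# Hypothetical two-column profile HMM over DNA (all numbers are made up).
match_emissions = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
]
insert_emission = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
transitions = {
    ("M", "M"): 0.8, ("M", "I"): 0.1, ("M", "D"): 0.1,
    ("I", "M"): 0.6, ("I", "I"): 0.4,
    ("D", "M"): 0.9, ("D", "D"): 0.1,
}


def path_log_prob(path, sequence):
    """Log-probability of emitting `sequence` along `path` (list of M/I/D).

    M emits a residue and advances one model column, I emits a residue
    without advancing, D advances one column without emitting.
    """
    logp, column, pos, previous = 0.0, 0, 0, "M"  # begin treated as M
    for state in path:
        logp += math.log(transitions[(previous, state)])
        if state == "M":
            logp += math.log(match_emissions[column][sequence[pos]])
            column, pos = column + 1, pos + 1
        elif state == "I":
            logp += math.log(insert_emission[sequence[pos]])
            pos += 1
        else:  # "D"
            column += 1
        previous = state
    if column != len(match_emissions) or pos != len(sequence):
        raise ValueError("path does not consume the whole model and sequence")
    return logp


print(path_log_prob(["M", "M"], "AG"))
print(path_log_prob(["M", "I", "M"], "ACG"))
print(path_log_prob(["M", "D"], "A"))
```

Because insertions and deletions are ordinary states with their own probabilities, the gap model is part of the HMM itself rather than a separate penalty scheme.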

    32. Alignment → HMM

    33. HMM: Diagram

    35. HMM: Building an HMM. Several methods are in use for training an HMM. Some training algorithms are similar to the use of dynamic programming in the progressive pairwise method. So far all are, like the progressive pairwise alignment method, prone to being trapped in local minima different from the correct alignment. Using these programs effectively requires some experience and expertise. Best current practice may be to take a carefully crafted alignment and use it to create an HMM for use in database searches and other statistical applications.

    36. Classification Libraries/Patterns. ProSite: composite sequence. Prints: PSSM. Blocks: Henikoff-style PSSM. Pfam: Hidden Markov Models.

    37. Local Multiple Sequence Alignment. Modern programs combine two theoretical methods derived from statistics. EM (Expectation-Maximization) deals with missing data: we know the sequences but don't know where the patterns or motifs are within them. Stochastic sampling reduces the volume of alignment space that must be searched; the number of possible pattern starting points scales with the sequence lengths.

    38. Expectation Maximization (EM). Used to identify conserved domains. Uses sequences that have a common sequence pattern not easily recognized by eye. Iterates two steps: (1) calculate the probability of finding the site at any position in the sequences; (2) use the new counts estimated in step 1 to update the previous set.
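
The two steps can be sketched as a single EM iteration under a one-occurrence-per-sequence assumption. This is an illustrative sketch, not MEME's implementation: the function name, the `pseudocount` parameter, the starting profile, and the toy sequences are hypothetical, and every residue is assumed to appear in `background` with a nonzero probability in the profile.

```python
def em_oops_iteration(sequences, profile, background, pseudocount=0.1):
    """One EM iteration assuming exactly one motif occurrence per sequence."""
    width = len(profile)
    alphabet = sorted(background)

    # Step 1 (E-step): probability of the site starting at each position.
    posteriors = []
    for seq in sequences:
        weights = []
        for start in range(len(seq) - width + 1):
            w = 1.0
            for j in range(width):
                res = seq[start + j]
                w *= profile[j][res] / background[res]
            weights.append(w)
        total = sum(weights)
        posteriors.append([w / total for w in weights])

    # Step 2 (M-step): expected residue counts update the previous profile.
    counts = [{res: pseudocount for res in alphabet} for _ in range(width)]
    for seq, post in zip(sequences, posteriors):
        for start, z in enumerate(post):
            for j in range(width):
                counts[j][seq[start + j]] += z
    new_profile = []
    for col in counts:
        col_total = sum(col.values())
        new_profile.append({res: c / col_total for res, c in col.items()})
    return new_profile, posteriors


# Hypothetical toy data: a weak starting guess that slightly favours "ACG".
background = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
guess = [
    {"A": 0.4, "C": 0.2, "G": 0.2, "T": 0.2},
    {"A": 0.2, "C": 0.4, "G": 0.2, "T": 0.2},
    {"A": 0.2, "C": 0.2, "G": 0.4, "T": 0.2},
]
sequences = ["TTACGTT", "ACGTTTT", "TTTTACG"]
profile, posteriors = em_oops_iteration(sequences, guess, background)
for post in posteriors:
    print([round(p, 3) for p in post])
```

Repeating the call with the returned profile sharpens the posteriors; like any EM procedure it can settle on a local optimum that depends on the starting guess.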

    39. Expectation-Maximization. A good motif or pattern is easy to recognize: it has a high information content (relative entropy against the background). Here p_i is the fraction of residue i in the sequences, q_i,j is the fraction of residue i at position j of the pattern, i is the sequence residue type index, and j is the index of the position within the pattern.
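
The formula from this slide is not preserved in the transcript. A common definition that matches the symbols above is the relative entropy, I = sum over j and i of q_i,j * log2(q_i,j / p_i); the sketch below assumes that form. The function name and the toy columns are hypothetical.

```python
import math


def information_content(q, p):
    """Relative entropy in bits, summed over pattern positions.

    q: list of dicts, q[j][i] = fraction of residue i at position j.
    p: dict, p[i] = background fraction of residue i.
    Terms with q[j][i] == 0 contribute nothing (the limit of x*log x is 0).
    """
    total = 0.0
    for column in q:
        for residue, q_ij in column.items():
            if q_ij > 0:
                total += q_ij * math.log2(q_ij / p[residue])
    return total


# Hypothetical example: one perfectly conserved column, one uninformative one.
p = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
q = [
    {"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0},
    {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
]
print(information_content(q, p))
```

A column that matches the background adds nothing, while a fully conserved column adds the most, which is the sense in which a good pattern stands out.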

    40. Stochastic Sampling. Also known as the Gibbs sampler. There are too many possible motifs to calculate the information for all of them, so the method exploits the memory of empirical position-specific scoring matrix motif representations (profiles): a sequence segment that is part of the pattern used to calculate an empirical log-odds position-specific matrix representation of the pattern will generally have a higher than average or expected score when scored using the matrix.

    41. Stochastic Sampling: Example

    42. Stochastic Sampling: Example

    43. Stochastic Sampling: Example

    44. Stochastic Sampling: Example. Score every segment of the left-out sequence. Use the scores for each segment to randomly select one of the segments. Choose a new sequence to leave out of the data and repeat the process with the already defined sequence segments.
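
The leave-one-out loop described in this example can be sketched as follows. This is a bare-bones illustration, not the code of any published sampler: the function names, the pseudo-count, the fixed iteration count, and the toy sequences are hypothetical, and the left-out sequence is simply cycled in order rather than chosen at random.

```python
import random
from collections import Counter


def build_profile(segments, alphabet, pseudocount=1.0):
    """Column-wise residue frequencies with pseudo-counts (never zero)."""
    total = len(segments) + pseudocount * len(alphabet)
    profile = []
    for j in range(len(segments[0])):
        counts = Counter(seg[j] for seg in segments)
        profile.append({r: (counts[r] + pseudocount) / total for r in alphabet})
    return profile


def gibbs_step(sequences, starts, width, left_out, background, rng):
    """Resample the motif start in one left-out sequence."""
    alphabet = sorted(background)
    segments = [seq[s:s + width]
                for i, (seq, s) in enumerate(zip(sequences, starts))
                if i != left_out]
    profile = build_profile(segments, alphabet)
    seq = sequences[left_out]
    # Score every segment of the left-out sequence as a likelihood ratio.
    weights = []
    for start in range(len(seq) - width + 1):
        ratio = 1.0
        for j, res in enumerate(seq[start:start + width]):
            ratio *= profile[j][res] / background[res]
        weights.append(ratio)
    # Randomly select one segment, with probability proportional to its score.
    new_start = rng.choices(range(len(weights)), weights=weights, k=1)[0]
    updated = list(starts)
    updated[left_out] = new_start
    return updated


def gibbs_sample(sequences, width, background, iterations=200, seed=0):
    """Run the sampler from random starts; returns one start per sequence."""
    rng = random.Random(seed)
    starts = [rng.randrange(len(seq) - width + 1) for seq in sequences]
    for it in range(iterations):
        left_out = it % len(sequences)  # cycle through the sequences
        starts = gibbs_step(sequences, starts, width, left_out, background, rng)
    return starts


# Hypothetical toy data, for illustration only.
background = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
sequences = ["TTACGTT", "ACGTTTT", "TTTTACG", "GGACGGG"]
print(gibbs_sample(sequences, width=3, background=background))
```

Because each segment is drawn at random rather than taken as the maximum, the sampler can escape poor early choices; the result depends on the seed and on the number of iterations.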

    45. Refine the Pattern: Picking the next word

    46. Stochastic Sampling: Example

    47. MEME: Multiple EM for Motif Elicitation. Will locate one or more ungapped patterns (motifs) in a set of sequences. A search is conducted for a range of possible motif widths, and the EM algorithm is used to find the best estimate for the width of the motif. Models: OOPS (one occurrence per sequence), ZOOPS (zero or one occurrence per sequence), TCM (any number of occurrences per sequence). Can use prior knowledge about possible motifs. Will produce a PSSM.

    48. Which Representation Should I Use? The Gribskov profile is the simplest (it has the fewest parameters to fit to the available data) and requires the least data to adequately model these parameters. HMMs are the most complex and require the most data to adequately model the parameters. PSSMs are intermediate. HMMs and PSSMs tend to look like the background when you don't have enough data to adequately model the parameters.

    49. Sequence Logos. Use the information content of a PSSM to make a graphical representation. http://weblogo.berkeley.edu Crooks GE, Hon G, Chandonia JM, Brenner SE. 2004. WebLogo: A sequence logo generator. Genome Research 14:1188-1190. Schneider TD, Stephens RM. 1990. Sequence Logos: A New Way to Display Consensus Sequences. Nucleic Acids Res. 18:6097-6100.

    50. MEME motif patterns for fungal g-alphas

    51. Fungal G-alpha motifs - structure view

    52. MAST: Motif Alignment and Search Tool. Searches through databases to identify motifs in other proteins. Helps in finding distantly related sequences and in finding additional information about the motifs.

    53. MetaMEME. Uses the EM and HMM methods to find motifs. Starts with the EM method of MEME to find most of the motifs. A simplified HMM is produced using the MEME results as prior information. MAST is used to determine the most probable order and spacing of the patterns. This information and a set of modified Dirichlet mixtures are used to train the HMM. The HMM can then be used for database searches.

    54. PSI-BLAST. Searches the database with BLAST, using the query to get a group of sequences that will be used to create a PSSM, then iterates the search using the PSSM. The user selects the sequences to be included for each iteration. Helps find distantly related sequences. The user may input their own PSSM instead of a query sequence. Care is needed in selecting sequences for PSSM creation or for iterations: including the wrong sequences may quickly create artifacts and end up with an incorrect set of sequences.

    55. Glutathione S-Transferase. Detoxifies organic chemicals containing halogens or double bonds by addition of glutathione; the subsequent processing pathway leads to excretion. The catalytic residue (thiol) is from glutathione. Only the cytoplasmic form is presented here. Classified into six groups, initially based on Swiss-Prot database annotation; the exact number of groups is still subject to debate. Found in bacteria and all kinds of eukaryotes. 126 sequences from the Swiss-Prot database.

    56. MEME ZOOPS Motifs for GST

    57. MEME ZOOPS Motifs -- Rat Mu-1

    58. Specialized for Nucleic Acids. AlignACE: Roth, F.P., Hughes, J.D., Estep, P.W., and Church, G.M. 1998. Finding DNA regulatory motifs within unaligned noncoding sequences clustered by whole-genome mRNA quantitation. Nat. Biotech. 16:939-945. BioProspector: Liu, X., Brutlag, D.L., and Liu, J.S. 2001. BioProspector: discovering conserved DNA motifs in upstream regulatory regions of co-expressed genes. Pac. Symp. Biocomput. 2001:127-138. Uses up to a tetranucleotide (3rd order Markov model) background. Gibbs Recursive Sampler: Thompson, W., Rouchka, E.C., and Lawrence, C.E. 2003. Gibbs Recursive Sampler: finding transcription factor binding sites. Nucleic Acids Res. 31:3580-3585. Allows palindromes and spacers in the model.
