
What is a motif?

CZ5226: Advanced Bioinformatics Lecture 4: Motifs and methods for generating motifs Prof. Chen Yu Zong Tel: 6874-6877 Email: csccyz@nus.edu.sg http://xin.cz3.nus.edu.sg Room 07-24, level 7, SOC1, National University of Singapore. What is a motif?.


Presentation Transcript


  1. CZ5226: Advanced Bioinformatics Lecture 4: Motifs and methods for generating motifs Prof. Chen Yu Zong Tel: 6874-6877 Email: csccyz@nus.edu.sg http://xin.cz3.nus.edu.sg Room 07-24, level 7, SOC1, National University of Singapore

  2. What is a motif? • A motif is a sequence pattern that occurs repeatedly in a group of related DNA or RNA or protein or peptide sequences.

  3. Types of motifs and what they mean • Motifs in protein sequences • Structure, function, evolution • Motifs in DNA and RNA sequences • Promoters, transcription factor binding sites, splicing signals • Motifs in MHC-binding peptides • Anchor residue positions, TCR recognition residues

  4. Motifs in Protein Sequences • The leucine zipper may explain how some eukaryotic gene regulatory proteins work. • L-x(6)-L-x(6)-L-x(6)-L • The leucine side chains extending from one alpha-helix interact with those from a similar alpha helix of a second polypeptide, facilitating dimerization
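
The leucine zipper pattern above maps directly onto an ordinary regular expression. A minimal sketch in Python follows; the function name `find_leucine_zippers` and the toy sequence are made up for illustration and are not part of the lecture.

```python
import re

def find_leucine_zippers(sequence):
    """Return 0-based start positions of L-x(6)-L-x(6)-L-x(6)-L matches."""
    # x(6) means "any six residues", written as .{6} in Python regex syntax.
    # The lookahead (?=...) lets overlapping occurrences be reported too.
    return [m.start() for m in re.finditer(r"(?=L.{6}L.{6}L.{6}L)", sequence)]

# Hypothetical sequence built so that a leucine occurs at every 7th position.
toy = "MK" + "LAAAAAA" * 3 + "L" + "GG"
print(find_leucine_zippers(toy))
```

A match only says that the spacing of leucines fits the pattern; it does not show that the region actually forms a zipper.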

  5. Motifs in DNA Sequences

  6. Motifs in DNA Sequences • Promoter regions, e.g. TATA box • Transcription factor binding sites, e.g. Eve in Drosophila: G-G-T-C-C-T-G-G • Cis-Regulatory regions

  7. Motifs in RNA sequences

  8. Motifs in Protein Structures • Protein structure patterns can encode information about protein function. • Structure motifs can be used to improve multiple alignments of protein sequences.

  9. Active site recognition EXAMPLE: CATHEPSIN A, PEPTIDASE FAMILY S10, EC # 3.4.16.5 • 3-D representation • 3D profile (PROCAT)

  10. 1ac5 438 LTFVSVYNASHMVPFDKS 455
      1ivy 419 IAFLTIKGAGHMVPTDKP 436

  11. Motifs in MHC-Binding Peptide

  12. Motifs in MHC Binding Peptides

  13. Motifs in MHC Binding Peptides

  14. What is the goal and method of motif detection? • Perform local multiple sequence alignment to find consensus sequences and common sequence patterns (motifs)

  15. Macromolecular motif recognition 1-D representation: Primary amino acid sequence MIRAAPPPLFLLLLLLLLLVSWASRGEAAPDQDEIQRLPGLAKQPSFRQYSGYLKSSGSKHLHYWFVESQKDPENSPVVLWLNGGPGCSSLDGLLTEHGPFLVQPDGVTLEYNPYSWNLIANVLYLESPAGVGFSYSDDKFYATNDTEVAQSNFEALQDFFRLFPEYKNNKL... Query secondary databases over the Internet Computational sequence analysis http://www.ebi.ac.uk/interpro/

  16. The Three Elements of Pattern Discovery Pattern discovery requires: • A pattern language • This defines what kind of patterns you can find. • An objective function • This defines what makes a pattern “interesting”. • An algorithm • This defines how to search among the possible patterns to find the “interesting” ones.

  17. The Three Elements of Pattern Discovery Pattern discovery requires: • A pattern language • This defines what kind of patterns you can find. • An objective function • This defines what makes a pattern “interesting”. • An algorithm • This defines how to search among the possible patterns to find the “interesting” ones.

  18. Pattern Description Languages • Regular expressions • Profiles • Hidden Markov Models (HMMs) • Motif HMMs • Motif-based HMMs

  19. Macromolecular motif recognition single motif exact regular expression (PROSITE) full domain alignment profile (PROSITE) Hidden Markov Model (Pfam, PROSITE) residue frequency matrices (PRINTS) multiple motifs

  20. Regular Expressions • Regular expressions can be used to describe sequence motifs. • They use a simple syntax to describe patterns. • An example protein pattern: [DENG]-x-[DEN]-x(0,2)-[DENQK]-[LIVFY]

  21. Regular expressions contd. • Basic rules for regular expressions • • Each position is separated by a hyphen “-” • • A symbol X is a regular expression matching itself • • x means ‘any residue’ • • [ ] surround ambiguities - a string [XYZ] matches any of the enclosed symbols • • A string [R]* matches any number of consecutive strings that each match [R] • • { } surround forbidden residues • • ( ) surround repeat counts • Model formation • Restricted to key conserved features in order to reduce the “noise” level • Built by hand in a stepwise fashion from multiple alignments

  22. Motif modelling methods Prosite: Regular expressions CARBOXYPEPT_SER_HIS [LIVF]-x(2)-[LIVSTA]-x-[IVPST]-x-[GSDNQL]-[SAGV]-[SG]-H-x-[IVAQ]-P-x(3)-[PSA] Regular expressions represent features by logical combinations of characters. A regular expression defines a sequence pattern to be matched.
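
The syntax rules listed earlier are simple enough to translate mechanically. The sketch below converts a PROSITE-style pattern to a Python regular expression and tries the CARBOXYPEPT_SER_HIS pattern on the two fragments from slide 10. The helper name `prosite_to_regex` is hypothetical, and the sketch handles only the elements described on the slides (letters, x, [ ], { }, repeat counts), not the full PROSITE syntax.

```python
import re

def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression."""
    parts = []
    for element in pattern.split("-"):
        # Separate the element from an optional repeat count, e.g. x(2), x(0,2).
        m = re.fullmatch(r"([^()]+)(?:\((\d+(?:,\d+)?)\))?", element)
        if m is None:
            raise ValueError(f"cannot parse element {element!r}")
        core, count = m.groups()
        if core == "x":
            core = "."                        # x matches any residue
        elif core.startswith("{") and core.endswith("}"):
            core = "[^" + core[1:-1] + "]"    # { } lists forbidden residues
        # [ ] classes and single letters are already valid regex syntax.
        parts.append(core + ("{" + count + "}" if count else ""))
    return "".join(parts)

CARBOXYPEPT_SER_HIS = ("[LIVF]-x(2)-[LIVSTA]-x-[IVPST]-x-[GSDNQL]-[SAGV]-"
                       "[SG]-H-x-[IVAQ]-P-x(3)-[PSA]")
regex = re.compile(prosite_to_regex(CARBOXYPEPT_SER_HIS))

# Fragments shown on slide 10 (PDB entries 1ac5 and 1ivy).
for name, fragment in [("1ac5", "LTFVSVYNASHMVPFDKS"),
                       ("1ivy", "IAFLTIKGAGHMVPTDKP")]:
    print(name, bool(regex.search(fragment)))
```

Python's `re` module is used here for convenience only; it is not necessarily how PROSITE scanning tools are implemented.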

  23. Regular expressions contd. • Regular expressions, such as PROSITE patterns, are matched to primary amino acid sequences using finite state automata. • Matching is “all-or-none”: a sequence either matches the pattern or it does not.

  24. Profiles • Profiles give weights for each letter. • Example from TRANSFAC: NF-kappab1
      Consensus   G   G   G   G   A   T   Y   C   C   C
      A           0   0   0   2  16   0   0   0   0   0
      C           0   0   0   0   1   0   7  16  18  17
      G          18  18  18  16   0   3   1   0   0   1
      T           0   0   0   0   1  15  10   2   0   0
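
One common way to turn such a count matrix into weights is a log-odds score against a background model. The sketch below does this for the NF-kappab1 counts from the slide. The pseudocount of 1 and the uniform background of 0.25 are assumptions made for illustration; TRANSFAC-based tools may weight the counts differently.

```python
import math

# Counts from the slide, one list per base, one entry per motif position.
COUNTS = {
    "A": [0, 0, 0, 2, 16, 0, 0, 0, 0, 0],
    "C": [0, 0, 0, 0, 1, 0, 7, 16, 18, 17],
    "G": [18, 18, 18, 16, 0, 3, 1, 0, 0, 1],
    "T": [0, 0, 0, 0, 1, 15, 10, 2, 0, 0],
}

def log_odds_matrix(counts, background=0.25, pseudocount=1.0):
    """Return one {base: log2(p / background)} dict per motif position."""
    width = len(next(iter(counts.values())))
    matrix = []
    for j in range(width):
        # The pseudocount keeps unseen bases from getting a score of -infinity.
        total = sum(counts[b][j] for b in counts) + pseudocount * len(counts)
        matrix.append({b: math.log2((counts[b][j] + pseudocount) / total / background)
                       for b in counts})
    return matrix

def score(matrix, site):
    """Sum the position-specific weights along a candidate site."""
    return sum(column[base] for column, base in zip(matrix, site))

pssm = log_odds_matrix(COUNTS)
print(score(pssm, "GGGGATTCCC"))   # close to the consensus
print(score(pssm, "AAAAAAAAAA"))   # unrelated site
```

Unlike a regular expression, the profile gives every candidate site a graded score, so near-matches can be ranked instead of rejected outright.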

  25. Profiles • Profiles are usually created by aligning multiple instances of the motif. • Example: nuclear hormone receptor transcription factor binding site.

  26. Motif modelling methods Prints: Residue frequency matrices
      Motif 1: NPESWTNFANMLW NPYSWVNLTNVLW REYSWHQNHHMIY NEGSWISKGDLLF NPYSWTNLTNVVY NEYSWNKMASVVY NDFGWDQESNLIY NENSWNNYANMIY NEYGWDQVSNLLY NPYAWSKVSTMIY NPYSWNGNASIIY NEYAWNKFANVLF NPYSWNRVSNILY NPYSWNLIANVLY NEYRWNKVANVLF
      Motif 2: LDQPFGTGYSQ VDNPVGAGFSY VDQPVGTGFSL VDQPGGTGFSS IDNPVGTGFSF IDQPTGTGFSV VDQPLGTGYSY IDQPAGTGFSP LESPIGVGFSY LDQPVGSGFSY LDQPVGSGFSY LDQPINTGFSN LDQPIGAGFSY LDAPAGVGFSY LDQPVGAGFSY
      Motif 3: FFQHFPEYQTNDFHIAGESYAGHYIP FFNKFPEYQNRPFYITGESYGGIYVP WVERFPEYKGRDFYIVGESYAGNGLM FLSKFPEYKGRDFWITGESYAGVYIP WFQLYPEFLSNPFYIAGESYAGVYVP FFEAFPHLRSNDFHIAGESYAGHYIP FFRLFPEYKDNKLFLTGESYAGIYIP FLTRFPQFIGRETYLAGESYGGVYVP FFNEFPQYKGNDFYVTGESYGGIYVP WMSRFPQYQYRDFYIVGESYAGHYVP FFRLFPEYKNNKLFLTGESYAGIYIP FFRLFPEYKNNKLFLTGESYAGIYIP WLERFPEYKGREFYITGESYAGHYVP WMSRFPQYRYRDFYIVGESYAGHYVP WFEKFPEHKGNEFYIAGESYAGIYVP
      Motif 4: LAFTLSNSVGHMAP LQFWWILRAGHMVA LMWAETFQSGHMQP LTYVRVYNSSHMVP LQEVLIRNAGHMVP LTFVSVYNASHMVP LTFARIVEASHMVP LTFSSVYLSGHEIP IDVVTVKGSGHFVP MTFATIKGSGHTAE MTFATIKGGGHTAE FGYLRLYEAGHMVP MTFATVKGSGHTAE ITLISIKGGGHFPA MTFATVKGSGHTAE
      • a collection of protein “fingerprints” that exploit groups of motifs to build characteristic family signatures • motifs are encoded in ungapped “raw” sequence format • different scoring methods may be superimposed onto the data, e.g. BLAST • improved diagnostic reliability • mutual context provided by motif neighbours

  27. Motif modelling methods Prosite: Profiles • Feature is represented as a matrix with a score for every possible character. • Matrix is derived from a sequence alignment, e.g.:
      F K L L S H C L L V
      F K A F G Q T M F Q
      Y P I V G Q E L L G
      F P V V K E A I L K
      F K V L A A V I A D
      L E F I S E C I I Q
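
The first step in deriving any such matrix is to tabulate which residues occur in each alignment column. A minimal sketch using the six aligned sequences from the slide is shown below; the function name `column_counts` is hypothetical. The actual PROSITE profile scores also involve substitution matrices and sequence weighting, which this sketch does not attempt.

```python
from collections import Counter

# The six aligned sequences from the slide, one string per row.
ALIGNMENT = [
    "FKLLSHCLLV",
    "FKAFGQTMFQ",
    "YPIVGQELLG",
    "FPVVKEAILK",
    "FKVLAAVIAD",
    "LEFISECIIQ",
]

def column_counts(alignment):
    """Return one Counter of residue counts per alignment column."""
    # zip(*rows) transposes the alignment so each tuple is one column.
    return [Counter(column) for column in zip(*alignment)]

counts = column_counts(ALIGNMENT)
for position, column in enumerate(counts, start=1):
    frequencies = {res: n / len(ALIGNMENT) for res, n in column.items()}
    print(position, frequencies)
```

Column 1 is dominated by F (4 of 6 sequences), so a profile built from this alignment is expected to reward F most strongly at that position.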

  28. Profiles contd. Derived matrix (rows: amino acids; columns: alignment positions 1 to 10):
           1    2    3    4    5    6    7    8    9   10
      A  -18  -10   -1   -8    8   -3    3  -10   -2   -8
      C  -22  -33  -18  -18  -22  -26   22  -24  -19   -7
      D  -35    0  -32  -33   -7    6  -17  -34  -31    0
      E  -27   15  -25  -26   -9   23   -9  -24  -23   -1
      F   60  -30   12   14  -26  -29  -15    4   12  -29
      G  -30  -20  -28  -32   28  -14  -23  -33  -27   -5
      H  -13  -12  -25  -25  -16   14  -22  -22  -23  -10
      I    3  -27   21   25  -29  -23   -8   33   19  -23
      K  -26   25  -25  -27   -6    4  -15  -27  -26    0
      L   14  -28   19   27  -27  -20   -9   33   26  -21
      M    3  -15   10   14  -17  -10   -9   25   12  -11
      N  -22   -6  -24  -27    1    8  -15  -24  -24   -4
      P  -30   24  -26  -28  -14  -10  -22  -24  -26  -18
      Q  -32    5  -25  -26   -9   24  -16  -17  -23    7
      R  -18    9  -22  -22  -10    0  -18  -23  -22   -4
      S  -22   -8  -16  -21   11    2   -1  -24  -19   -4
      T  -10  -10   -6   -7   -5   -8    2  -10   -7  -11
      V    0  -25   22   25  -19  -26    6   19   16  -16
      W    9  -25  -18  -19  -25  -27  -34  -20  -17  -28
      Y   34  -18   -1    1  -23  -12  -19    0    0  -18

  29. Profiles contd. • Inclusion of all possible information to maximise overall signal of protein/domain • i.e., a full representation of features in the aligned sequences • Able to detect distant relationships with only a few well-conserved residues • Position-dependent weights/penalties for all 20 amino acids, gaps, insertions • Dynamic programming algorithms for scoring hits

  30. Hidden Markov Models (HMM) • HMMs generalize the idea of a profile. • They can model insertions and deletions in the sequence as well as the letters at conserved positions. • Profiles can be seen as simple HMMs.

  31. Macromolecular motif recognition • Pfam and Prosite: Hidden Markov Models (HMMs) • Feature is represented by a probabilistic model of interconnecting match, delete or insert states • contains statistical information on observed and expected positional variation - “platonic ideal of protein family” (Figure: HMM with begin state B, match states Mi, insert states Ii, delete states Di and end state E.)

  32. HMM example A possible HMM for the sequence “ACCY”, in which each state carries a set of probabilities. The probability of ACCY is shown as a highlighted path through the model. (Figure labels: probability that an amino acid occurs in a particular state; probability of a transition between states.)
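
The probability of one path through such a model is the product of the transition and emission probabilities along it. The sketch below computes this for “ACCY” on a path through four match states. All numbers are hypothetical placeholders, since the values in the original figure are not available in the transcript.

```python
import math

def path_probability(sequence, emissions, transitions):
    """Probability of emitting `sequence` along a single path of match states.

    emissions[i][residue] is P(residue | match state i).
    transitions[i] is the probability of the i-th transition on the path:
    begin -> M1, M1 -> M2, ..., Mn -> end (len(sequence) + 1 values).
    """
    if len(transitions) != len(sequence) + 1 or len(emissions) != len(sequence):
        raise ValueError("model size does not fit the sequence")
    p = 1.0
    for i, residue in enumerate(sequence):
        p *= transitions[i] * emissions[i].get(residue, 0.0)
    return p * transitions[len(sequence)]

# Hypothetical emission and transition probabilities, for illustration only.
emissions = [{"A": 0.3}, {"C": 0.6}, {"C": 0.5}, {"Y": 0.4}]
transitions = [1.0, 0.5, 0.5, 0.5, 1.0]
print(path_probability("ACCY", emissions, transitions))
```

The full probability of a sequence under an HMM sums over all possible paths (the forward algorithm); the sketch covers only the single highlighted path described on the slide.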

  33. Motif-Based HMMs • Motif-based HMMs are sequence models made by combining one or more motif models. • Motif HMM: Motifs are modeled as profile HMMs without delete or insert states. (Figure: motif HMM with match states M1 M2 M3 M4 M5.)

  34. A Simple Motif-Based HMM • Adding emitting states with self-loops, plus start and end states, turns a motif HMM into a sequence model. • The HMM below models sequences with one occurrence of the motif. (Figure: sequence HMM with states Start, Left Flank, M1 M2 M3 M4 M5, Right Flank, End.)

  35. Motif-Based HMM for Modeling Cis-regulatory Regions With two or more motif models we can make more complicated motif-based HMMs. This sequence model captures motifs on the + and – strand of DNA. It does not capture the order of the motifs.

  36. The Three Elements of Pattern Discovery Pattern discovery requires: • A pattern language • This defines what kind of patterns you can find. • An objective function • This defines what makes a pattern “interesting”. • An algorithm • This defines how to search among the possible patterns to find the “interesting” ones.

  37. Objective functions for Regular Expression Patterns • Possible objective functions are: • Perfect matches only (no mismatches) • Allow a given number of mismatches • Allow a given density of mismatches (or wildcards). • To be interesting, the motif must occur a certain minimum number of times in the data.
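
These objective functions can be sketched as a support count with a mismatch tolerance. In the example below, the function name, the `min_support` threshold and the toy sequences are hypothetical; only the pattern GGTCCTGG comes from the lecture (the Eve site on slide 6).

```python
def count_supporting_sequences(pattern, sequences, max_mismatches=0):
    """Count sequences containing `pattern` with at most `max_mismatches`."""
    def hamming(a, b):
        # Number of positions at which two equal-length strings differ.
        return sum(x != y for x, y in zip(a, b))

    width = len(pattern)
    support = 0
    for seq in sequences:
        windows = (seq[i:i + width] for i in range(len(seq) - width + 1))
        if any(hamming(pattern, w) <= max_mismatches for w in windows):
            support += 1
    return support

# Hypothetical toy data: one exact site, one site with a single mismatch.
sequences = ["AAGGTCCTGGAA", "TTGGTCCTCGTT", "CCCCCCCCCCCC"]
min_support = 2
for k in (0, 1):
    n = count_supporting_sequences("GGTCCTGG", sequences, max_mismatches=k)
    print(k, n, "interesting" if n >= min_support else "not interesting")
```

Loosening the mismatch tolerance raises the support of every pattern, so the minimum-occurrence threshold usually has to be raised along with it.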

  38. Objective functions for profiles and HMMs • Profile- and HMM-based motifs are usually ranked by statistical or information-theoretical measures: • Likelihood ratio (e.g., forward-backward) • Information content (relative entropy) • Maximum a posteriori probability
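
Information content can be illustrated with a short calculation: for each profile column, it is the relative entropy between the observed letter frequencies and the background frequencies. The sketch below assumes a uniform DNA background and uses no pseudocounts; both are simplifying assumptions, and the two example columns are taken from the NF-kappab1 counts on slide 24.

```python
import math

def column_information(counts, background=(0.25, 0.25, 0.25, 0.25)):
    """Relative entropy, in bits, of one profile column versus the background."""
    total = sum(counts)
    bits = 0.0
    for count, q in zip(counts, background):
        if count > 0:                 # a term with p = 0 contributes nothing
            p = count / total
            bits += p * math.log2(p / q)
    return bits

def information_content(columns, background=(0.25, 0.25, 0.25, 0.25)):
    """Sum the per-column relative entropies over the whole profile."""
    return sum(column_information(c, background) for c in columns)

# Columns 1 and 7 of the slide-24 matrix, in the order A, C, G, T.
conserved = [0, 0, 18, 0]
variable = [0, 7, 1, 10]
print(column_information(conserved), column_information(variable))
```

A fully conserved DNA column reaches the maximum of 2 bits under a uniform background, while a column that mirrors the background scores 0.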

  39. Example for profiles: the likelihood ratio • Use the profile to compute the likelihood of the data: Pr(data | profile) • Use the background model to compute the likelihood of the data under the background model: Pr(data | bkgrnd) • The likelihood ratio is: Pr(data | profile) / Pr(data | bkgrnd)
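
In practice the ratio Pr(data | profile) / Pr(data | bkgrnd) is usually computed in log space to avoid numerical underflow. The sketch below assumes the data are a set of aligned motif sites and that positions are independent; the two-column profile and its probabilities are hypothetical.

```python
import math

def log_likelihood_ratio(sites, profile, background):
    """log[ Pr(sites | profile) / Pr(sites | background) ].

    profile[j][base] is the probability of `base` at motif position j;
    background[base] is the background probability of `base`.
    """
    llr = 0.0
    for site in sites:
        for j, base in enumerate(site):
            # Independence turns the ratio of products into a sum of logs.
            llr += math.log(profile[j][base] / background[base])
    return llr

# Hypothetical two-position profile and a uniform background.
profile = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
]
background = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
print(log_likelihood_ratio(["AG", "AG", "AT"], profile, background))
```

A positive value means the sites are better explained by the profile than by the background, and a negative value means the opposite.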

  40. Objective functions for protein structure patterns • Structure motifs are usually evaluated based on the RMS distance • between the pattern and each instance, or, • among all the instances of the pattern.
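
The RMS distance between a pattern and one instance can be sketched as follows. The sketch assumes the two structures have already been superimposed and that atoms are paired by index; finding the optimal superposition is a separate step that is not shown. The coordinates are hypothetical.

```python
import math

def rms_distance(coords_a, coords_b):
    """Root-mean-square distance between two equally long lists of 3-D points."""
    if len(coords_a) != len(coords_b) or not coords_a:
        raise ValueError("need two non-empty coordinate lists of equal length")
    squared = sum((xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
                  for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b))
    return math.sqrt(squared / len(coords_a))

# Hypothetical coordinates: the instance is the pattern shifted by 1 along z.
pattern = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
instance = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0), (0.0, 1.0, 1.0)]
print(rms_distance(pattern, instance))
```

To evaluate a pattern among all of its instances, the same function can be applied to every pair of instances and the results averaged.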

  41. The Three Elements of Pattern Discovery Pattern discovery requires: • A pattern language • This defines what kind of patterns you can find. • An objective function • This defines what makes a pattern “interesting”. • An algorithm • This defines how to search among the possible patterns to find the “interesting” ones.

  42. Algorithms for discovering sequence motifs • Regular expression searches enumerate candidate patterns or extend seeds. • Profile/HMM algorithms use Gibbs sampling or Expectation Maximization (EM). Baum-Welch training of HMMs, which relies on the forward-backward algorithm, is a form of EM.

  43. Regular Expression Discovery: a simple algorithm • Look for DNA 16-mers where (up to) one wild card is allowed in the pattern: • E.g., “T-A-C-X-G-T-A-G-G-C-C-T-A-G-T-T” • There are 4^16 + 16 × 4^15 (about 2.1 × 10^10) possible patterns, a big number. • Idea: Instead of enumerating the possible patterns and counting, just update the counts of appropriate patterns for each 16-mer that actually occurs in the data.

  44. Regular Expression Discovery: a simple algorithm (cont’d) • Run a window of width 16 along the data and, for each 16-mer in the data, e.g. “AGGGTAAAAGCCCCCT”, update the counts of the exact match pattern and each pattern with one wildcard: A-G-G-G-T-A-A-A-A-G-C-C-C-C-C-T, X-G-G-G-T-A-A-A-A-G-C-C-C-C-C-T, A-X-G-G-T-A-A-A-A-G-C-C-C-C-C-T, etc.
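
The counting scheme described above fits in a few lines. The sketch below is one possible reading of it: the function name `count_patterns` is hypothetical, the wildcard is written as X without hyphens, and the input is assumed to contain only A, C, G and T.

```python
from collections import Counter

def count_patterns(sequences, width=16, wildcard="X"):
    """Count exact and one-wildcard patterns for every window in the data."""
    counts = Counter()
    for seq in sequences:
        # Slide a window of the given width along each sequence.
        for i in range(len(seq) - width + 1):
            kmer = seq[i:i + width]
            counts[kmer] += 1                                   # exact pattern
            for j in range(width):                              # one wildcard
                counts[kmer[:j] + wildcard + kmer[j + 1:]] += 1
    return counts

# The 16-mer from the slide updates 17 patterns: itself plus 16 wildcard forms.
counts = count_patterns(["AGGGTAAAAGCCCCCT"])
print(len(counts), counts["XGGGTAAAAGCCCCCT"])
```

Only patterns that actually occur in the data are stored, so memory grows with the size of the data set (at most 17 entries per window) and not with the number of possible patterns.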

  45. Profile discovery algorithms • Profile discovery algorithms for finding sequence motifs mostly use either EM (Expectation Maximization) or Gibbs sampling.

  46. What is Gibbs sampling? • Stochastic optimization method • Works well with local multiple alignment without gaps (motif searching) • Searches for the statistically most probable motifs by sampling random positions instead of going through entire search space

  47. What is the program going to do? • Ask the user for: • file containing multiple DNA or protein sequences • motif width • how many motifs wanted • Calculate the background frequencies of A, C, G, T from all the sequences, e.g. [0.34951456310679613, 0.17799352750809061, 0.21035598705501618, 0.23300970873786409]
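
The background-frequency step can be sketched as below. This is not the lecture's program: the function name and the toy sequences are hypothetical, and the sketch assumes plain sequence strings, ignoring any characters outside the chosen alphabet.

```python
from collections import Counter

def background_frequencies(sequences, alphabet="ACGT"):
    """Return the pooled frequency of each alphabet letter, in alphabet order."""
    counts = Counter()
    for seq in sequences:
        counts.update(seq.upper())
    # Only letters in the alphabet count toward the total (N, gaps are ignored).
    total = sum(counts[letter] for letter in alphabet)
    if total == 0:
        raise ValueError("no alphabet letters found in the input")
    return [counts[letter] / total for letter in alphabet]

# Hypothetical toy input.
print(background_frequencies(["AACG", "TTAA"]))
```

For protein input the same function can be called with the 20-letter amino acid alphabet.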

  48. What is the program going to do? • Generate random start positions for the motif in each sequence. Example: 10 sequences, 30 bp in length, motif width of 7 start = [2, 6, 9, 14, 5, 7, 20, 20, 6, 22] >> random.uniform(0,ceiling) where ceiling=len(sequence)-width
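
A minimal sketch of the random initialisation is shown below. Note one substitution: the slide calls `random.uniform`, which returns a float, whereas this sketch uses `random.randint`, which returns an integer directly and includes both end points, so every valid start position can be drawn. The function name and the optional `rng` argument are hypothetical.

```python
import random

def random_starts(sequences, width, rng=random):
    """Pick one random motif start position (0-based) in each sequence."""
    starts = []
    for seq in sequences:
        ceiling = len(seq) - width      # last position where the motif still fits
        if ceiling < 0:
            raise ValueError("sequence shorter than motif width")
        starts.append(rng.randint(0, ceiling))
    return starts

# Example matching the slide: 10 sequences of 30 bp, motif width 7.
sequences = ["A" * 30] * 10
starts = random_starts(sequences, width=7, rng=random.Random(0))
print(starts)
```

Passing a seeded `random.Random` instance makes a run reproducible, which helps when comparing results between runs of a stochastic method.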

  49. What is the program going to do? 4. Construct position specific score matrix from all sequences except one.
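
The leave-one-out matrix construction might look like the sketch below. The log-odds form, the pseudocount of 1 and all names are assumptions made for illustration, since the transcript does not show how the original program computes its matrix.

```python
import math

def build_pssm(sequences, starts, width, leave_out, background,
               alphabet="ACGT", pseudocount=1.0):
    """Log-odds matrix from the current motif sites of all sequences but one."""
    counts = [{letter: pseudocount for letter in alphabet} for _ in range(width)]
    used = 0
    for index, (seq, start) in enumerate(zip(sequences, starts)):
        if index == leave_out:
            continue                    # the held-out sequence contributes nothing
        used += 1
        for j in range(width):
            counts[j][seq[start + j]] += 1
    total = used + pseudocount * len(alphabet)
    return [{letter: math.log2(column[letter] / total / background[letter])
             for letter in alphabet} for column in counts]

# Hypothetical toy data: three sequences share the site ACG, the fourth is left out.
sequences = ["TTACGTT", "ACGTTTT", "GGGACGG", "CCCCCCC"]
starts = [2, 0, 3, 0]
background = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
pssm = build_pssm(sequences, starts, width=3, leave_out=3, background=background)
print(pssm[0])
```

Leaving one sequence out keeps its current (possibly wrong) site from biasing the matrix that is about to be used to re-score it.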

  50. What is the program going to do? 5. Score the left-out sequence according to the position specific score matrix:
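
Scoring the left-out sequence means evaluating every window of motif width against the matrix. The sketch below does this and then draws a new start position with probability proportional to each window's odds, which is the usual next step in Gibbs sampling; whether the original program does exactly this is an assumption. The two-column matrix is hypothetical and is taken to hold log2-odds scores.

```python
import random

def score_windows(sequence, pssm):
    """Score every window of motif width in `sequence` against the matrix."""
    width = len(pssm)
    return [sum(pssm[j][sequence[i + j]] for j in range(width))
            for i in range(len(sequence) - width + 1)]

def sample_start(scores, rng=random):
    """Draw a start position with probability proportional to 2**score."""
    weights = [2.0 ** s for s in scores]    # turn log2-odds back into odds
    return rng.choices(range(len(scores)), weights=weights, k=1)[0]

# Hypothetical two-column log2-odds matrix favouring the site "AC".
pssm = [
    {"A": 1.0, "C": -1.0, "G": -1.0, "T": -1.0},
    {"A": -1.0, "C": 1.0, "G": -1.0, "T": -1.0},
]
scores = score_windows("GGACG", pssm)
print(scores, sample_start(scores, rng=random.Random(0)))
```

Sampling in proportion to the odds, instead of always taking the best window, is what lets the procedure escape poor starting positions.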
