Profile Searches

Profile Searches. Revised 07/11/06. Overview. Introduction Motif representation Motif screening Motif Databases Exercise. Introduction. Multiple sequence alignment. Features characteristic for the whole family. How to represent the characteristic features?.

Presentation Transcript


  1. Profile Searches Revised 07/11/06

  2. Overview • Introduction • Motif representation • Motif screening • Motif Databases • Exercise

  3. Introduction Multiple sequence alignment Features characteristic for the whole family How to represent the characteristic features? • Motif model: captures the family characteristic features • regular expression, weight matrix, HMM profile

  4. Introduction [Figure: unaligned sequences, multiple sequence alignment, construct model, scan new sequence with the model] • Model: captures the family characteristic features • Used to detect remote homologs of a family

  5. Overview • Introduction • Motif representation • String based representation • Consensus • Regular expression • Probabilistic representation • PSSM • HMM • Profile • Motif screening • Motif Databases • Exercise

  6. HMM [Figure: workflow in three parts (I, II, III): unaligned sequences, multiple sequence alignment, construct model, scan new sequence with the model]

  7. String Based Representation Consensus sequence: • Reductionistic representation of a motif • Most frequent instance is used as a representative • Loss of information Regular expression: • More complex representation allowing motif degeneracy

  8. String Based Representation Consensus CTTAATATTAACTTAAT Regular expression CTTAAKRTTMAYTTAAT
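The degenerate letters in the regular expression above follow the IUPAC nucleotide code (K = G or T, R = A or G, M = A or C, Y = C or T). A minimal sketch of how such a motif could be turned into a regular expression and matched against a sequence; the function names `iupac_to_regex` and `find_motif` are illustrative and do not come from the slides.

```python
import re

# Standard IUPAC nucleotide codes mapped to regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "K": "[GT]", "M": "[AC]",
    "S": "[CG]", "W": "[AT]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}

def iupac_to_regex(motif):
    """Translate a degenerate IUPAC motif into a regular expression string."""
    return "".join(IUPAC[base] for base in motif.upper())

def find_motif(motif, sequence):
    """Return the 0-based start positions of all matches, overlapping ones included."""
    # Wrapping the pattern in a lookahead lets finditer report overlapping hits.
    pattern = re.compile("(?=" + iupac_to_regex(motif) + ")")
    return [m.start() for m in pattern.finditer(sequence.upper())]
```

With this sketch the consensus instance CTTAATATTAACTTAAT matches the motif CTTAAKRTTMAYTTAAT, while a sequence with C at the K position does not, which illustrates that pattern matching tolerates only the degeneracy written into the pattern.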

  9. String Based Representation: DNA motifs [Figure: a signal reaches the cell; a signal motif on the chromosome next to Gene 1, Gene 2, Gene 3, Gene 4; gene, transcription, mRNA, translation, protein]

  10. String Based Representation Sequences involved in enzymatic reactions (PROSITE)

  11. Overview • Introduction • Motif representation • String based representation • Consensus • Regular expression • Probabilistic representation • PSSM • HMM • Profile • Motif screening • Motif Databases • Exercise

  12. PSSM (probabilistic representation) [Figure: alignment (G A A T T C A T G T C A C T T C A T T G), frequency matrix, pseudocounts added, resulting frequency matrix]

  13. PSSM (probabilistic representation) [Figure: alignment (G A A T T C A T G T C A C T T C A T T G) converted into a PSSM and shown as a motif logo] Background model: p(A)=p(C)=p(G)=p(T)=0.25
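The conversion from alignment to PSSM can be sketched as follows. This is an illustration under stated assumptions, not the procedure used for the slide's figure: the pseudocount of 1 per base, the log2 scale and the function name `build_pssm` are choices made here; only the uniform background of 0.25 is taken from the slide.

```python
import math

def build_pssm(alignment, pseudocount=1.0, background=0.25, alphabet="ACGT"):
    """Return one dict per alignment column mapping base -> log2-odds score.

    alignment: equal-length, ungapped sequences.
    pseudocount: added to every base count so no frequency is zero (assumed value).
    """
    n_seqs = len(alignment)
    total = n_seqs + pseudocount * len(alphabet)
    pssm = []
    for column in zip(*alignment):            # iterate over alignment columns
        scores = {}
        for base in alphabet:
            freq = (column.count(base) + pseudocount) / total
            scores[base] = math.log2(freq / background)
        pssm.append(scores)
    return pssm
```

For a hypothetical alignment such as `["ACGT", "ACGA", "ACGT", "TCGT"]`, a base seen in 3 of 4 sequences gets frequency (3+1)/(4+4) = 0.5 and a score of log2(0.5/0.25) = 1 bit, while a base never seen in a column still receives a finite negative score thanks to the pseudocount.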

  14. [Figure: the same msa represented as regular expression, weight matrix (PSSM) and motif logo]

  15. Motif Representation Consensus CTTAATATTAACTTAAT Regular expression CTTAAKRTTMAYTTAAT PSSM (motif logo)

  16. HMM Definition [Figure: profile HMM with begin and end states and, at position j, states Mj, Ij and Dj] • State sequence: path p • Probability of a state depends only on the previous state • Emission probability ek(b): probability that symbol b is seen when in state k • Transition probability akl: probability of moving from state k to state l • A HIDDEN Markov model: it is not possible to tell what state the system is in by looking at the corresponding symbol • Finding the possible paths = decoding

  17. HMM Probabilistic model that represents the alignment of the family • Gapped multiple alignment • Distinct states connected by transition probabilities (i.e. the probability of moving from one state to the next) • The current state is only dependent on the previous state (first order Markov process) • The sequence of states followed in the model is called the path p • Each state has a probability of emitting a certain symbol of the alphabet (A, C, T, G for DNA, or one of the 20 amino acids for proteins): emission probability

  18. HMM • HMM can model any possible sequence • It defines a probability distribution over the whole space of sequences • Training a HMM: search for the parametrisation that makes this distribution peak around members of the family • Parametrisation • Determine model structure • Length of alignment • Number of insert states • Determine the probability parameters

  19. HMM: Training a HMM • Determine structure of the model • Determine emission and transition probabilities • Alignment: ACA---ATG, TCAACTATC, ACAC--AGC, AGA---ATC, ACCG--ATC • E.g. the first column: e1(A) = 4/5; e1(T) = 1/5; e1(C) = 0; e1(G) = 0 • E.g. the second column: e2(A) = 0; e2(T) = 0; e2(C) = 4/5; e2(G) = 1/5 • E.g. the third column: e3(A) = 4/5; e3(T) = 0; e3(C) = 1/5; e3(G) = 0 [Figure: the resulting profile HMM; match states carry the emission probabilities above with zero entries replaced by pseudocounts (PC), the insert state emits A 0.2, C 0.4, G 0.2, T 0.2, and the transitions are labelled 0.4, 0.6, 0.6, 1, 1]
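The emission probabilities quoted for the first three columns can be reproduced by counting residues per column. A minimal sketch using the five aligned sequences from the slide; the function names are illustrative, pseudocounts are left out on purpose, and treating alignment columns 4 to 6 as one insert state is an assumption that happens to reproduce the insert emissions 0.2 / 0.4 / 0.2 / 0.2.

```python
from collections import Counter

# The five aligned sequences shown on the slide.
ALIGNMENT = ["ACA---ATG", "TCAACTATC", "ACAC--AGC", "AGA---ATC", "ACCG--ATC"]

def column_emissions(alignment, col, alphabet="ACGT"):
    """Relative frequency of each base in one column, ignoring gap characters."""
    residues = [seq[col] for seq in alignment if seq[col] != "-"]
    counts = Counter(residues)
    return {base: counts[base] / len(residues) for base in alphabet}

def insert_emissions(alignment, cols, alphabet="ACGT"):
    """Pool the residues of several columns that are modelled by one insert state."""
    residues = [seq[c] for seq in alignment for c in cols if seq[c] != "-"]
    counts = Counter(residues)
    return {base: counts[base] / len(residues) for base in alphabet}
```

Here `column_emissions(ALIGNMENT, 0)` gives A = 0.8 and T = 0.2, matching e1(A) = 4/5 and e1(T) = 1/5. In a real model the zero entries would be replaced by pseudocounts so that unseen residues are not given probability 0.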

  20. Profile representation [Alignment column shown: I A … I S T T V A V A I L T V T V I A I V] • Suppose I (amino acid b) is the ancestor • What is the probability of observing a T (amino acid a) in the first column (position p) of the alignment? • This probability is reflected by the score M: M(p,a) = W(p,b) × Y(a,b), so here M(1,T) = W(1,I) × Y(T,I) • M is dependent on • The observed frequency of I in the first position of the alignment (W) • The probability of mutating I => T (according to PAM) (Y)

  21. Profile representation gaps

  22. Overview • Introduction • Motif representation • Motif screening • Motif Databases • Exercise

  23. HMM [Figure: workflow in three parts (I, II, III): unaligned sequences, multiple sequence alignment, construct profile HMM, scan new sequence with the profile]

  24. Screening

  25. Screening • The multiple alignment of the family is known (Clustal W) • The motif to be detected is known but the multiple alignment does not yet exist • Motifs already described in literature • Construct the multiple alignment, derive the model • Neither the motif nor the multiple alignment exist • Probabilistic motif detection

  26. Screening Genome wide screening • Obtained Motif Model used for genome wide screening (Motif Scanner) • Identification of putative additional targets • Use sliding window • Attribute to each sequence within the sliding window a score • Rank the hits based on their score and select the most promising candidates

  27. Screening • Distinct methods differ in the motif representation and the scoring system used • Consensus Sequence or Regular expression (pattern match) • Very conservative • Do not allow mismatches • PSSM / HMM: more complicated scoring schemes • based on information content • Log likelihood • Less conservative • Difficult choice of threshold score • Tradeoff between sensitivity and selectivity

  28. Screening • Sensitivity (true positive fraction, recall) = TP/(TP+FN) • Specificity = TN/(TN+FP); the false positive fraction is 1 - specificity • Precision = TP/(TP+FP) • FDR (1 - precision) = FP/(TP+FP)
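These four measures follow directly from the counts of true and false positives and negatives. A small sketch; the function name and the example counts in the note below are hypothetical.

```python
def screening_metrics(tp, fp, tn, fn):
    """Compute the screening measures from confusion-matrix counts.

    Assumes every denominator is non-zero (at least one predicted hit,
    one real positive and one real negative).
    """
    return {
        "sensitivity": tp / (tp + fn),   # true positive fraction, recall
        "specificity": tn / (tn + fp),   # 1 - false positive fraction
        "precision": tp / (tp + fp),
        "fdr": fp / (tp + fp),           # 1 - precision
    }
```

For example, a screen that reports 10 hits of which 8 are real, and misses 2 real sites, has precision 0.8, FDR 0.2 and sensitivity 0.8. Raising the score threshold typically trades sensitivity for precision.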

  29. Screening • E-value: the number of hits expected, by chance alone, to have a score equal to or better than the one observed.

  30. Screening with Regular Expression Simple perl scripts

  31. Screening with PSSM • Slide a window of length W over a sequence • Calculate for each subsequence within the window a log-odds score • Background frequency of each of the four nucleotides: 0.25 • Example for a window of 6 positions: odds = (0.6*0.9*0.8*0.97*0.6*0.7)/(0.25^6) ≈ 720, log-odds score = log2(720) ≈ 9.5 • The highest scoring positions correspond to the most likely locations of the motif
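The sliding-window scoring can be sketched as below. The six per-position probabilities are the ones from the slide's example; which bases they belong to is not given, so the consensus `TTGACA` and the even split of the remaining probability over the other three bases are assumptions made only to obtain a complete matrix.

```python
import math

def toy_matrix(consensus, best_probs, alphabet="ACGT"):
    """Hypothetical probability matrix: the consensus base gets the given
    probability, the other three bases share the remainder equally."""
    matrix = []
    for base, p in zip(consensus, best_probs):
        rest = (1.0 - p) / (len(alphabet) - 1)
        matrix.append({b: (p if b == base else rest) for b in alphabet})
    return matrix

def log_odds(window, matrix, background=0.25):
    """Sum of log2(p_motif / p_background) over the positions of one window."""
    return sum(math.log2(col[base] / background)
               for col, base in zip(matrix, window))

def scan(sequence, matrix, background=0.25):
    """Score every window of motif width; return (start, score), best first."""
    width = len(matrix)
    hits = [(i, log_odds(sequence[i:i + width], matrix, background))
            for i in range(len(sequence) - width + 1)]
    return sorted(hits, key=lambda hit: hit[1], reverse=True)

MATRIX = toy_matrix("TTGACA", [0.6, 0.9, 0.8, 0.97, 0.6, 0.7])
```

Scoring the assumed consensus window with `log_odds("TTGACA", MATRIX)` reproduces the slide's arithmetic: the odds ratio is about 720 and its base-2 logarithm is about 9.5 bits. Ranking the output of `scan` and applying a threshold gives the candidate sites.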

  32. Screening with HMM • Does a sequence belong to a family of proteins? • Scoring a sequence with a HMM • aligning the sequence to the HMM • finding the hidden path that generates the sequence • A sequence can be generated by different paths • Enumerate all paths and calculate for each path the probability that it generates the sequence • Viterbi path: most likely path • Total probability that the sequence is generated by the HMM = sum of the probabilities of all possible paths
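The statement that the total probability is the sum over all paths can be checked on a small model. The sketch below uses a hypothetical two-state HMM (a motif-like state M and a background state B, with made-up parameters and no silent delete states, so it is simpler than a profile HMM). It compares explicit enumeration of every path with the forward recursion, which computes the same sum without listing the paths.

```python
from itertools import product

# Hypothetical toy model, not taken from the slides.
STATES = ("M", "B")
START = {"M": 0.5, "B": 0.5}
TRANS = {"M": {"M": 0.8, "B": 0.2},      # TRANS[k][l]: move from state k to l
         "B": {"M": 0.3, "B": 0.7}}
EMIT = {"M": {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
        "B": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}}

def path_probability(seq, path):
    """Joint probability of the sequence and one specific state path."""
    prob = START[path[0]] * EMIT[path[0]][seq[0]]
    for prev, cur, sym in zip(path, path[1:], seq[1:]):
        prob *= TRANS[prev][cur] * EMIT[cur][sym]
    return prob

def total_by_enumeration(seq):
    """Sum over every possible path; cost grows exponentially with len(seq)."""
    return sum(path_probability(seq, path)
               for path in product(STATES, repeat=len(seq)))

def forward(seq):
    """Same total via the forward recursion; cost grows linearly with len(seq)."""
    f = {k: START[k] * EMIT[k][seq[0]] for k in STATES}
    for sym in seq[1:]:
        f = {l: EMIT[l][sym] * sum(f[k] * TRANS[k][l] for k in STATES)
             for l in STATES}
    return sum(f.values())
```

For the six-symbol sequence ATCAGT this toy model has 2^6 = 64 paths, and both functions return the same value up to rounding. For realistic sequence lengths only the recursive version is feasible.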

  33. Screening with HMM Example for 1 path ATCAGT

  34. Screening with HMM [Figure labels: ATT and TTC; A T A - - T] • Calculate the probability of the sequence being generated by the HMM profile of a protein family versus a random model = align the unknown sequence with the HMM • The sequence can be generated by different paths • In practice it is impossible to enumerate all possibilities • What is the most probable path? (Viterbi, backtracking) • What is the total probability? (Forward) • The result is reported as a bit score

  35. Screening with HMM • Hidden Markov model because if we observe a sequence, the path of states that was followed by the Markov model to generate the observed sequence is unknown or hidden. • This hidden path contains the information on how the observed sequence should be aligned with the profile. • Usually a sequence can be generated in multiple ways by the Markov model and more hidden paths (corresponding to distinct alignments) are possible. Usually not all possible paths have an equal probability. Indeed some transitions are not very likely (low transition probability). Usually the path with the highest probability (highest score = most likely path) corresponds to the best alignment.

  36. Screening with HMM [Figure: profile HMM with begin, Mj, Ij, Dj and end states] • Detecting the underlying sequence of states uncovers the most probable path of transitions (decoding) • VITERBI algorithm: most probable path (backtracking) • Start at the first position (state k) • Move to the next most probable state l • Vk(i) is the probability of the most probable path ending in state k at position i of the sequence • Calculate the probability • The Viterbi algorithm yields the most probable path and the probability of this path
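A compact sketch of the Viterbi recursion with backtracking. It is written for a generic HMM without silent delete states, works in log space to avoid underflow, and assumes all probabilities are non-zero; the two-state toy model and its parameters are hypothetical.

```python
import math

# Hypothetical toy model: M = motif-like state, B = background state.
STATES = ("M", "B")
START = {"M": 0.5, "B": 0.5}
TRANS = {"M": {"M": 0.8, "B": 0.2}, "B": {"M": 0.3, "B": 0.7}}
EMIT = {"M": {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
        "B": {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}}

def viterbi(seq, states=STATES, start=START, trans=TRANS, emit=EMIT):
    """Return (most probable state path as a string, its log probability)."""
    # v[k]: log probability of the best path that ends in state k so far.
    v = {k: math.log(start[k]) + math.log(emit[k][seq[0]]) for k in states}
    back = []                                   # one pointer table per step
    for sym in seq[1:]:
        pointers, nxt = {}, {}
        for l in states:
            best = max(states, key=lambda k: v[k] + math.log(trans[k][l]))
            pointers[l] = best
            nxt[l] = v[best] + math.log(trans[best][l]) + math.log(emit[l][sym])
        back.append(pointers)
        v = nxt
    # Backtracking: start from the best final state and follow the pointers.
    last = max(states, key=lambda k: v[k])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return "".join(reversed(path)), v[last]
```

Note that the best predecessor is chosen per state and per position, and the path is only fixed during backtracking. A purely greedy walk from one state to its most probable successor would not in general find the most probable path.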

  37. HMM • Alignment: -ACA---ATG, -TCAACTATC, -ACAC--AGC, -AGA---ATC, -ACCG--ATC • Sequence to score: ACAAG • Calculate score, state 1: S(1) = a(BM) + e(A); S(2) = a(BI) + e(A); S(3) = a(BD) • Maximal score, state M: S(1) = a(BM) + e(A); S(1) = a(BI) + e(A) + a(IM) + e(C) [Figure: profile HMM with begin, Mj, Ij, Dj and end states; labels A, AC, -]

  38. Conclusion • Distinct methods differ in the motif representation and the scoring system used • Consensus Sequence or Regular expression (pattern match) • Very conservative • Do not allow mismatches • PSSM / HMM: more complicated scoring schemes • based on information content • Log likelihood • Less conservative • Difficult choice of threshold score • Tradeoff between sensitivity and selectivity

  39. Overview • Introduction • Motif representation • Motif screening • Motif Databases • Prosite • Blocks • pFAM • Exercise

  40. Pfam • Pfam starts from a set of automatically generated domain alignments (generated by PsiBlast). • From these alignments a HMM is calculated • Subsequently all sequences in the SwissProt database of proteins are classified in protein families • By scoring them with the representative HMMs • Ranking sequences according to their score • separate class members from the other sequences in the database based on a suitable threshold • Pfam 7.0 is such a database that contains a total of 3360 families. Pfam contains multiple protein alignments and profile-HMMs of these families.

  41. Pfam

  42. Pfam • Full: alignment on which the Pfam HMM was based • HMMs for global and fragment search

  43. Pfam Screening a new sequence against Pfam HMMs to classify the novel sequence

  44. Pfam • Scores in Pfam • Raw score (bit score): compares the probability that the sequence was generated by the HMM with the probability that it was generated by a null model • E-value: the number of hits that would be expected to have a score equal to or better than this by chance alone • Each Pfam family has a "trusted cutoff" and a "noise cutoff" • TC1 is the lowest score for sequences included in the family • NC1 is the highest score for sequences not included in the Full alignment
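The bit score can be written as a log-odds ratio of the two probabilities. The notation below is not from the slide: x stands for the query sequence, and the base-2 logarithm is what makes the unit bits.

```latex
S(x) = \log_2 \frac{P(x \mid \text{HMM})}{P(x \mid \text{null})}
```

A positive score means the family HMM explains the sequence better than the null model; a score of 10 bits, for instance, corresponds to an odds ratio of 2^10 = 1024.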

  45. Pfam

  46. PROSITE Patterns (regular expressions) (ScanProsite) • Shorter than Pfam • Enzyme catalytic sites • Prosthetic group attachment sites (heme, pyridoxal-phosphate, biotin, etc) • Amino acids involved in binding a metal ion • Cysteines involved in disulfide bonds • Regions involved in binding a molecule (ADP/ATP, GDP/GTP, calcium, DNA, etc.) or another protein
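PROSITE patterns use their own syntax, which maps closely onto ordinary regular expressions. A minimal sketch of that translation, assuming only the common elements of the syntax (`-` separators, `x` for any residue, `[..]` for allowed residues, `{..}` for excluded residues, `(n)` or `(n,m)` for repeats, `<` and `>` for sequence ends); the function name is illustrative and rarer constructs such as `>` inside square brackets are not handled.

```python
import re

def prosite_to_regex(pattern):
    """Translate a PROSITE-style pattern into a Python regular expression string."""
    regex = pattern.rstrip(".")                            # optional trailing period
    regex = regex.replace("-", "")                         # element separators
    regex = regex.replace("{", "[^").replace("}", "]")     # {AB}: anything but A or B
    regex = regex.replace("(", "{").replace(")", "}")      # x(2,4): repeat counts
    regex = regex.replace("x", ".")                        # x: any residue
    regex = regex.replace("<", "^").replace(">", "$")      # N- and C-terminal anchors
    return regex

def scan_protein(pattern, sequence):
    """Return (start, matched_text) for each non-overlapping match."""
    compiled = re.compile(prosite_to_regex(pattern))
    return [(m.start(), m.group()) for m in compiled.finditer(sequence)]
```

For an illustrative pattern such as `C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H` the function returns `C.{2,4}C.{3}[LIVMFYWC].{8}H.{3,5}H`, which can then be applied to any protein sequence with the standard `re` module.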

  47. PROSITE • Profiles (Profile representation)

  48. PROSITE

  49. BLOCKS • Database of ungapped alignments • Motif models represented as PSSMs
