
Protein Sequence Analysis: Possible Domain Queries

Explore different solutions for identifying functional domains in protein sequences and determine membership in specific protein families.

willier


Presentation Transcript


  1. CSE182-L6 Protein sequence analysis CSE182

  2. Possible domain queries • Case 1: • You have a collection of sequences that belong to a family (contain a functional domain). • Given an ‘orphan’ sequence, does it belong to the family? • There are different solutions depending upon the representation of the domain (patterns/alignments/HMM/profiles) • Case 2: • You have an orphan sequence from an uncharacterized family. Can you identify other members of the family, and create a representation of them (Harder problem).

  3. EX: Innexins • The Macagno lab is studying Gap junction proteins, Innexins (invertebrate analogs of connexins) in Hirudo • Innexins have been found in C. elegans, and Drosophila. • In C. elegans, 25 members of this family have been found, and partially categorized.

  4. Innexins in Hirudo • When certain Innexins are knocked out, they cause serious defects in cells in the ganglia. • The EST database (partial gene sequences) contains a number of putative Innexins, discovered via BLAST. • Project: • Q: Can you confirm that these are Innexins? Can you find more members? (this lecture) • Q: Can you characterize them w.r.t. known innexins in C. elegans and Drosophila? • Q: Use your method for other families of interest, e.g. Netrins and their receptors.

  5. Not all features (residues) are important [Images: skin patterns; facial features]

  6. Protein sequence motifs • Premise: • The sequence of a protein gives clues about its structure and function. • Not all residues are equally important in determining function. • Suppose we knew the key residues of a family. If our query matches in those residues, it is a member. Otherwise, it is not. • The key residues can be identified from structural information, or from conserved residues in an alignment of the family.

  7. Representation of domains/families. • We will consider a number of representations that describe key residues, characteristic of a family • Patterns (regular expressions) • Alignments • Profiles • HMMs • Start with the following: • A collection of sequences with the same function. • Region/residues known to be significant for maintaining structure and function. • Develop a pattern of conserved residues around the residues of interest • Iterate for appropriate sensitivity and specificity

  8. From alignment to patterns (the slide marks a column of interest with *)
       ALRDFATHDDF
       SMTAEATHDSI
       ECDQAATHEAS
     Pattern: ATH-[DE]
     • Search a database with the resulting pattern • Refine pattern to eliminate false positives • Iterate
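The step from alignment to pattern can be sketched in a few lines. This is only an illustration under stated assumptions: the alignment is ungapped, a column counts as "conserved" if it shows at most `max_choices` distinct residues (a hypothetical threshold, not from the slides), and unconstrained flanking columns are trimmed. The function name `conserved_pattern` is made up for this example.

```python
def conserved_pattern(alignment, max_choices=2):
    """Return a PROSITE-style pattern for the conserved core of an ungapped alignment.

    Assumption: a column with more than `max_choices` distinct residues is
    treated as unconstrained and written as 'x'.
    """
    elements = []
    for column in zip(*alignment):
        residues = sorted(set(column))
        if len(residues) == 1:
            elements.append(residues[0])                    # fully conserved
        elif len(residues) <= max_choices:
            elements.append("[" + "".join(residues) + "]")  # small set of alternatives
        else:
            elements.append("x")                            # anything goes
    # Trim unconstrained positions from both ends of the pattern.
    while elements and elements[0] == "x":
        elements.pop(0)
    while elements and elements[-1] == "x":
        elements.pop()
    return "-".join(elements)


alignment = ["ALRDFATHDDF", "SMTAEATHDSI", "ECDQAATHEAS"]
pattern = conserved_pattern(alignment)
print(pattern)  # A-T-H-[DE]
```

On the three sequences from the slide this recovers the same conserved core, written with a dash between every position.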

  9. Regular Expression Patterns • Zinc Finger motif • C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H • 2 conserved C, and 2 conserved H • How can we search a database using these motifs? • The motif is described using a regular expression. What is a regular expression?
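One practical way to search with such a motif is to translate it into the regular-expression syntax of a programming language. The sketch below uses Python's standard `re` module and handles only the pattern elements that appear in the zinc finger motif above (single residues, `x`, bracketed sets, and `(n)` or `(n,m)` repeats). The helper name `prosite_to_regex` and the toy database string are assumptions for illustration.

```python
import re

# One pattern element: a residue, 'x', or a bracketed set, with an optional repeat.
ELEMENT = re.compile(r"^(x|[A-Z]|\[[A-Z]+\])(?:\((\d+)(?:,(\d+))?\))?$")


def prosite_to_regex(pattern):
    """Translate a dash-separated motif into a Python regular expression string."""
    out = []
    for element in pattern.split("-"):
        match = ELEMENT.match(element)
        if match is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        residue, low, high = match.groups()
        atom = "." if residue == "x" else residue   # x means any residue
        if low is None:
            repeat = ""
        elif high is None:
            repeat = "{" + low + "}"
        else:
            repeat = "{" + low + "," + high + "}"
        out.append(atom + repeat)
    return "".join(out)


zinc_finger = "C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H"
regex = prosite_to_regex(zinc_finger)

# Toy "database": a made-up sequence with one motif instance embedded in it.
core = "C" + "AA" + "C" + "AAA" + "L" + "A" * 8 + "H" + "AAA" + "H"
database = "MK" + core + "GG"
hit = re.search(regex, database)
print(regex, hit.start(), hit.group())
```

A library regex engine hides the automaton; the slides that follow describe what such an engine has to do internally.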

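  10. Regular Expressions • Concise representation of a set of strings over an alphabet Σ. • Described by a string over Σ and the operator symbols. • R is a r.e. if and only if it is built by one of the following rules: • R = {ε} • R = {σ}, σ ∈ Σ • R = R1 + R2 (union) • R = R1 · R2 (concatenation) • R = R1* (zero or more repetitions), where R1 and R2 are themselves r.e.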

  11. Regular Expression • Q: Let Σ = {A,C,E} • Is (A+C)*EEC* a regular expression? • Is *(A+C) regular? • Q: When is a string s in a regular expression? • R = (A+C)*EEC* • Is CEEC in R? • AEC? • ACEE?

  12. Regular Expression & Automata • Every R.E. can be expressed by an automaton (a directed graph) with the following properties: • The automaton has a start and end node • Each edge is labeled with a symbol from Σ, or ε • Suppose R is described by automaton A • s ∈ R if and only if there is a path from start to end in A, labeled with s.

  13. Examples: Regular Expression & Automata • (A+C)*EEC* [Figure: automaton with start and end nodes and edges labeled A, C, E, E, C] • Is CEEC in R? • AEC? • ACEE? • ACE?
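The membership questions can be checked mechanically. The sketch below uses Python's `re` module, where union is written `|` rather than the `+` used on the slides; `fullmatch` requires the whole string to be in R, which corresponds to a start-to-end path in the automaton.

```python
import re

# (A+C)*EEC* in slide notation; Python writes union as '|'.
R = re.compile(r"(A|C)*EEC*")

answers = {s: R.fullmatch(s) is not None for s in ["CEEC", "AEC", "ACEE", "ACE"]}
for s, accepted in answers.items():
    print(s, "in R" if accepted else "not in R")
```

Every string in R must contain the two consecutive E's, which is why AEC and ACE are rejected.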

  14. Constructing automata from R.E. • R = {ε} • R = {σ}, σ ∈ Σ • R = R1 + R2 • R = R1 · R2 • R = R1* [Figures: the automaton built for each case]

  15. Regular Expression Matching • Given a database D, and a regular expression R, is a substring of D in R? • Is there a string D[l..c] that is accepted by the automaton of R? • Simpler Q: Is D[1..c] accepted by the automaton of R?

  16. Alg. For matching R.E. • If D[1..c] is accepted by the automaton RA • There is a path labeled D[1]…D[c] that goes from START to END in RA [Figure: path from START to END with edges labeled D[1], D[2], …, D[c]]

  17. Alg. For matching R.E. • If D[1..c] is accepted by the automaton RA • There is a path labeled D[1]…D[c] that goes from START to END in RA • There is a path labeled D[1]..D[c-1] from START to node u, and a path labeled D[c] from u to the END [Figure: path labeled D[1]..D[c-1] from START to u, followed by an edge labeled D[c] from u to END]

  18. D.P. to match regular expression • Define: • A[u,σ] = automaton node reached from u after reading σ • Eps(u): set of all nodes reachable from node u using epsilon transitions. • N[c] = subset of nodes reachable from START node after reading D[1..c] • Q: when is v ∈ N[c]? [Figure: an edge labeled σ from u to v; ε-edges from u to the nodes of Eps(u)]

  19. D.P. to match regular expression • Q: when is v ∈ N[c]? • A: If, for some u ∈ N[c-1] and w = A[u,D[c]], • v ∈ {w} ∪ Eps(w)

  20. Algorithm

  21. The final step • We have answered the question: • Is D[1..c] accepted by R? • Yes, if END ∈ N[c] • We need to answer • Is D[l..c] (for some l, and some c) accepted by R?
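The recurrence for N[c] and the final step can be sketched together. This is a minimal illustration, not the algorithm from the slide (whose body did not survive in the transcript). Assumptions: the automaton is given as a dictionary `delta` mapping (node, symbol) to a node, as in A[u,σ], plus a dictionary `eps` of epsilon edges; the hand-built automaton for (A+C)*EEC* and all node names are hypothetical. To allow a match to begin at any position l, the sketch re-adds START to N[c] after every character, which is one common way to handle the substring question.

```python
def eps_closure(node, eps):
    """Return {node} together with Eps(node): everything reachable via epsilon edges."""
    seen, stack = {node}, [node]
    while stack:
        u = stack.pop()
        for v in eps.get(u, ()):
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen


def match_ends(D, delta, eps, start, end, anywhere=False):
    """Return every 1-based c such that some D[l..c] is accepted.

    With anywhere=False only l = 1 is allowed (the prefix question).
    """
    N = eps_closure(start, eps)                 # N[0]
    ends = []
    for c, symbol in enumerate(D, start=1):
        nxt = set()
        for u in N:                             # u in N[c-1]
            w = delta.get((u, symbol))          # w = A[u, D[c]]
            if w is not None:
                nxt |= eps_closure(w, eps)      # {w} plus Eps(w)
        if end in nxt:
            ends.append(c)
        if anywhere:
            nxt |= eps_closure(start, eps)      # a match may begin at c + 1
        N = nxt
    return ends


# Hypothetical hand-built automaton for (A+C)*EEC*.
DELTA = {("s", "A"): "s", ("s", "C"): "s", ("s", "E"): "e1",
         ("e1", "E"): "e2", ("e2", "C"): "e2"}
EPS = {"e2": ["END"]}

print(match_ends("CEEC", DELTA, EPS, "s", "END"))                    # [3, 4]
print(match_ends("KKAEECK", DELTA, EPS, "s", "END", anywhere=True))  # [5, 6]
```

The whole string is accepted exactly when its length appears in the returned list. Each character costs time proportional to the number of automaton nodes, so the scan is linear in the database length for a fixed pattern.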

  22. Representation 2: Profiles • Profiles versus regular expressions • Regular expressions do not tolerate an occasional mismatch. • The Union operation (I+V+L) does not quantify the relative importance of I, V, L. It could be that V occurs in 80% of the family members. • Profiles capture some of these ideas.

  23. Profiles • Start with an alignment of strings of length m, over an alphabet A. • Build an |A| × m matrix F = (f_ki). • Each entry f_ki represents the frequency of symbol k in position i. [Figure: example alignment with column frequencies 0.71, 0.14, 0.28, 0.14]
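Building the matrix F is a column-by-column count. The sketch below is a minimal version assuming an ungapped alignment and the 20-letter amino-acid alphabet; it reuses the three sequences from the pattern slide as input, and the function name `build_profile` is made up for this example.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(alignment, alphabet=AMINO_ACIDS):
    """Return F as a dict: F[k][i] is the frequency of symbol k in column i."""
    n = len(alignment)          # number of sequences
    m = len(alignment[0])       # alignment length
    F = {k: [0.0] * m for k in alphabet}
    for i, column in enumerate(zip(*alignment)):
        for k, count in Counter(column).items():
            F[k][i] = count / n
    return F


alignment = ["ALRDFATHDDF", "SMTAEATHDSI", "ECDQAATHEAS"]
F = build_profile(alignment)
print(F["A"][5], F["D"][8], F["E"][8])
```

Unlike the pattern ATH-[DE], the profile records that D is twice as frequent as E in that column.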

  25. Scoring matrices • Given a sequence s, does it belong to the family described by a profile? • We align the sequence to the profile, and score it • Let S(i,j) be the score of aligning position i of the profile to residue s_j • The score of an alignment is the sum of column scores. [Figure: profile position i aligned to residue s_j of sequence s]

  26. Scoring Profiles [Figure: scoring matrix relating profile position i, symbol k, frequency f_ki, and sequence s]
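The scoring formula itself did not survive in the transcript, so the following is only one plausible instance. It assumes a log-odds score, S(i,j) = log((f + pseudocount) / (background + pseudocount)), where f is the profile frequency of residue s_j in column i, with a uniform background of 1/20 and a pseudocount of 0.01; both numbers and all names are hypothetical choices. The profile is the conserved ATH-[DE] core from the earlier slide, and the alignment is ungapped.

```python
import math

BACKGROUND = 1 / 20   # assumed uniform background frequency
PSEUDO = 0.01         # assumed pseudocount, avoids log(0) for unseen residues

# Profile columns for the ATH-[DE] core; missing residues have frequency 0.
PROFILE = [{"A": 1.0}, {"T": 1.0}, {"H": 1.0}, {"D": 2 / 3, "E": 1 / 3}]


def column_score(profile, i, residue):
    """S(i, j): score of aligning profile column i to the given residue."""
    f = profile[i].get(residue, 0.0)
    return math.log((f + PSEUDO) / (BACKGROUND + PSEUDO))


def best_window(profile, s):
    """Slide the profile along s without gaps; return (best score, start index)."""
    m = len(profile)
    best = None
    for start in range(len(s) - m + 1):
        score = sum(column_score(profile, i, s[start + i]) for i in range(m))
        if best is None or score > best[0]:
            best = (score, start)
    return best


score, start = best_window(PROFILE, "GGATHEGG")
print(round(score, 2), start)
```

A sequence is then assigned to the family when its best score exceeds a chosen threshold; a single mismatch lowers the score instead of rejecting the sequence outright.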

  27. Domain analysis via profiles • Given a database of profiles of known domains/families, we can query our sequence against each of them, and choose the high scoring ones to functionally characterize our sequences. • What if the sequence matches some other sequences weakly (using BLAST), but does not match any Profile?

  28. Psi-BLAST idea • Iterate: • Find homologs using BLAST on query • Discard very similar homologs • Align, make a profile, search with profile. • Why is this more sensitive? [Figure: query sequence searched against the sequence database]

  29. Psi-BLAST speed • Two time-consuming steps: • Multiple alignment of homologs • Searching with profiles • Does the keyword search idea work? • Pigeonhole principle again: • If a profile of length m must score >= T, • then some sub-profile of length l must score >= lT/m. • Generate all l-mers that score at least lT/m • Search using an automaton • Multiple alignment: • Use ungapped multiple alignments only
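The pigeonhole bound follows from cutting the m columns into m/l disjoint blocks of length l: the block scores add up to the total, so if the total is at least T, the best block scores at least T divided by the number of blocks, which is lT/m. The sketch below checks this numerically on made-up per-column scores, assuming for simplicity that l divides m.

```python
def block_scores(column_scores, l):
    """Sum the per-column scores over consecutive, disjoint blocks of length l."""
    return [sum(column_scores[i:i + l]) for i in range(0, len(column_scores), l)]


column_scores = [2.0, -1.0, 3.0, 0.5, 1.5, 2.0]   # hypothetical column scores
m, l = len(column_scores), 2
T = sum(column_scores)                            # total alignment score
blocks = block_scores(column_scores, l)
bound = l * T / m

print(blocks, bound)
```

Any hit that reaches the full threshold must therefore contain an l-mer passing the lower threshold, so the l-mers can serve as keywords for a fast first-pass scan.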

  30. Representation 3: HMMs • Question: • your ‘friend’ likes to gamble. • He tosses a coin: HEADS, he gives you a dollar. TAILS, you give him a dollar. • Usually, he uses a fair coin, but ‘once in a while’, he uses a loaded coin. • Can you say what fraction of the times he loads the coin?

  31. Representation 3: HMMs • Building good profiles relies upon good alignments. • Difficult if there are gaps in the alignment. • Psi-BLAST/BLOCKS etc. work with gapless alignments. • An HMM representation of Profiles helps put the alignment construction/membership query in a uniform framework.

  32. The generative model • Think of each column in the alignment as generating a distribution. • For each column, build a node that outputs a residue with the appropriate distribution, e.g. Pr[F]=0.71, Pr[Y]=0.14

  33. A simple Profile HMM • Connect nodes for each column into a chain. This chain generates random sequences. • What is the probability of generating FKVVGQVILD? • In this representation • Prob[New sequence S belongs to a family] = Prob[HMM generates sequence S] • What is the difference with Profiles?
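For a plain chain with one node per column, the probability of generating a sequence is the product of the per-column emission probabilities. The sketch below illustrates this on a three-column toy chain; only Pr[F]=0.71 and Pr[Y]=0.14 come from the slides, and every other number is a hypothetical placeholder, so it does not answer the FKVVGQVILD question.

```python
import math

# Toy chain: one emission distribution per alignment column.
# Only Pr[F] and Pr[Y] in the first column are from the slides.
CHAIN = [
    {"F": 0.71, "Y": 0.14, "L": 0.15},
    {"K": 0.5, "R": 0.5},
    {"V": 1.0},
]


def chain_probability(chain, s):
    """Pr[chain generates s]: product of the column emission probabilities."""
    if len(s) != len(chain):
        return 0.0                      # a gapless chain emits exactly one residue per node
    return math.prod(column.get(residue, 0.0) for column, residue in zip(chain, s))


print(chain_probability(CHAIN, "FKV"))
print(chain_probability(CHAIN, "YRV"))
```

For long sequences the product underflows quickly, so practical implementations usually sum log-probabilities instead.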

  34. Profile HMMs can handle gaps • The match states are the same as on the previous page. • Insertion and deletion states help introduce gaps. • A sequence may be generated using different paths.

  35. Example
        A L - L
        A I V L
        A I - L
      • Probability [ALIL] is part of the family? • Note that multiple paths can generate this sequence. • M1I1M2M3 • M1M2I2M3 • In order to compute the probabilities, we must assign probabilities of transition between states

  36. Profile HMMs • Directed Automaton M with nodes and edges. • Nodes emit symbols according to ‘emission probabilities’ • Transition from node to node is guided by ‘transition probabilities’ • Joint probability of seeing a sequence S, and path P • Pr[S,P|M] = Pr[S|P,M] Pr[P|M] • Pr[ALIL AND M1I1M2M3] = Pr[ALIL| M1I1M2M3,M] Pr[M1I1M2M3|M] • Pr[ALIL | M] = ?
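Pr[S | M] is the sum of the joint probability Pr[S,P | M] over all paths P, and the standard way to compute that sum without enumerating paths is the forward algorithm. The sketch below is a simplified illustration: the model has match and insert states only (no silent deletion states), the state names follow the ALIL example, and every transition and emission probability is a made-up number except the M2 emissions, which are taken from the second alignment column.

```python
# Hypothetical transition and emission probabilities for a 3-column profile HMM.
INSERT = {"A": 0.25, "L": 0.25, "I": 0.25, "V": 0.25}
TRANS = {
    "B": {"M1": 1.0},
    "M1": {"I1": 0.2, "M2": 0.8},
    "I1": {"I1": 0.1, "M2": 0.9},
    "M2": {"I2": 0.3, "M3": 0.7},
    "I2": {"I2": 0.1, "M3": 0.9},
    "M3": {"E": 1.0},
}
EMIT = {"M1": {"A": 1.0}, "M2": {"L": 1 / 3, "I": 2 / 3}, "M3": {"L": 1.0},
        "I1": INSERT, "I2": INSERT}


def joint_probability(S, path, trans, emit, begin="B", end="E"):
    """Pr[S, P | M] for a path with one emitting state per symbol of S."""
    p, prev = 1.0, begin
    for symbol, state in zip(S, path):
        p *= trans[prev].get(state, 0.0) * emit[state].get(symbol, 0.0)
        prev = state
    return p * trans[prev].get(end, 0.0)


def forward_probability(S, trans, emit, begin="B", end="E"):
    """Pr[S | M]: sum over all paths, computed one symbol at a time."""
    f = {begin: 1.0}                      # probability mass per current state
    for symbol in S:
        nxt = {}
        for prev, p in f.items():
            for state, t in trans.get(prev, {}).items():
                if state == end:
                    continue              # END emits nothing
                e = emit[state].get(symbol, 0.0)
                if e > 0:
                    nxt[state] = nxt.get(state, 0.0) + p * t * e
        f = nxt
    return sum(p * trans.get(prev, {}).get(end, 0.0) for prev, p in f.items())


p1 = joint_probability("ALIL", ["M1", "I1", "M2", "M3"], TRANS, EMIT)
p2 = joint_probability("ALIL", ["M1", "M2", "I2", "M3"], TRANS, EMIT)
total = forward_probability("ALIL", TRANS, EMIT)
print(p1, p2, total)
```

In this toy model the two paths listed on the example slide are the only ones that can emit ALIL, so the forward total equals their sum.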

  37. Protein structure basics

  38. Side chains determine amino-acid type • The residues may have different properties. • Aspartic acid (D), and Glutamic Acid (E) are acidic residues

  39. Bond angles form structural constraints

  40. Various constraints determine 3d structure • Constraints • Structural constraints due to physicochemical properties • Constraints due to bond angles • H-bond formation • Surprisingly, a few conformations are seen over and over again.

  41. Alpha-helix • 3.6 residues per turn • H-bonds between residue i and residue i+4 stabilize the structure. • First described by Linus Pauling

  42. Beta-sheet • Each strand by itself has 2 residues per turn, and is not stable. • Adjacent strands hydrogen-bond to form stable beta-sheets, parallel or anti-parallel. • Beta sheets have long range interactions that stabilize the structure, while alpha-helices have local interactions.

  43. Domains • The basic structures (helix, strand, loop) combine to form complex 3D structures. • Certain combinations are popular. Many sequences, but only a few folds

  44. 3D structure • Predicting tertiary structure is an important problem in Bioinformatics. • Premise: Clues to structure can be found in the sequence. • While de novo tertiary structure prediction is hard, there are many intermediate, and tractable goals. • The PDB database is a compendium of structures

  45. Searching structure databases • Threading, and other 3d Alignments can be used to align structures. • Database filtering is possible through geometric hashing.

  46. Trivia Quiz • What research won the Nobel prize in Chemistry in 2004? • In 2002?

  47. How are Proteins Sequenced? Mass Spec 101:

  48. Nobel Citation 2002

  49. Nobel Citation, 2002

  50. Mass Spectrometry
