0 Views

Download Presentation
##### CSE182-L6

**An Image/Link below is provided (as is) to download presentation**

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.

- - - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - - -

**CSE182-L6**Protein sequence analysis CSE182**Possible domain queries**• Case 1: • You have a collection of sequences that belong to a family (contain a functional domain). • Given an ‘orphan’ sequence, does it belong to the family? • There are different solutions depending upon the representation of the domain (patterns/alignments/HMM/profiles) • Case 2: • You have an orphan sequence from an uncharacterized family. Can you identify other members of the family, and create a representation of them (Harder problem). CSE182**EX: Innexins**• The Macagno lab is studying Gap junction proteins, Innexins (invertebrate analogs of connexins) in Hirudo • Innexins have been found in C. elegans, and Drosophila. • In C. elegans, 25 members of this family have been found, and partially categorized. CSE182**Innexins in Hirudo**• When certain Innexins are knocked out, they cause serious defects in cells in the ganglia. • The EST database (partial gene sequences) contains a number of putative Innexins, discovered via BLAST. • Project: • Q: Can you confirm that these are Innexins. Can you find more members? (this lecture) • Q: Can you characterize them w.r.t known innexins in C. elegans, and Drosophila? • Q: Use your method for other families of interest. Netrins, and their receptors. CSE182**Not all features(residues) are important**Skin patterns Facial Features CSE182**Protein sequence motifs**• Premise: • The sequence of a protein sequence gives clues about its structure and function. • Not all residues are equally important in determining function. • Suppose we knew the key residues of a family. If our query matches in those residues, it is a member. Otherwise, it is not. • The key residues can be identified if we had structural information, or through conserved residues in an alignment of the family. CSE182**Representation of domains/families.**• We will consider a number of representations that describe key residues, characteristic of a family • Patterns (regular expressions) • Alignments • Profiles • HMMs • Start with the following: • A collection of sequences with the same function. • Region/residues known to be significant for maintaining structure and function. • Develop a pattern of conserved residues around the residues of interest • Iterate for appropriate sensitivity and specificity CSE182**From alignment to patterns*** ALRDFATHDDF SMTAEATHDSI ECDQAATHEAS ATH-[DE] • Search a database with the resulting pattern • Refine pattern to eliminate false positives • Iterate CSE182**Regular Expression Patterns**• Zinc Finger motif • C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H • 2 conserved C, and 2 conserved H • How can we search a database using these motifs? • The motif is described using a regular expression. What is a regular expression? CSE182**Regular Expressions**• Concise representation of a set of strings over alphabet . • Described by a string over • R is a r.e. if and only if CSE182**Regular Expression**• Q: Let ={A,C,E} • Is (A+C)*EEC* a regular expression? • Is *(A+C) regular? • Q: When is a string s in a regular expression? • R =(A+C)*EEC* • Is CEEC in R? • AEC? • ACEE? CSE182**Regular Expression & Automata**• Every R.E can be expressed by an automaton (a directed graph) with the following properties: • The automaton has a start and end node • Each edge is labeled with a symbol from , or • Suppose R is described by automaton A • S R if and only if there is a path from start to end in A, labeled with s. CSE182**Examples: Regular Expression & Automata**• (A+C)*EEC* A C E E start end C • Is CEEC in R? • AEC? • ACEE? • ACE? CSE182**** Constructing automata from R.E • R = {} • R = {}, • R = R1 + R2 • R = R1 · R2 • R = R1* CSE182**Regular Expression Matching**• Given a database D, and a regular expression R, is a substring of D in R? • Is there a string D[l..c] that is accepted by the automaton of R? • Simpler Q: Is D[1..c] accepted by the automaton of R? CSE182**Alg. For matching R.E.**• If D[1..c] is accepted by the automaton RA • There is a path labeled D[1]…D[c] that goes from START to END in RA D[1] D[2] D[c] CSE182**Alg. For matching R.E.**• If D[1..c] is accepted by the automaton RA • There is a path labeled D[1]…D[c] that goes from START to END in RA • There is a path labeled D[1]..D[c-1] from START to node u, and a path labeled D[c] from u to the END u D[1] .. D[c-1] D[c] CSE182**D.P. to match regular expression**u v • Define: • A[u,] = Automaton node reached from u after reading • Eps(u): set of all nodes reachable from node u using epsilon transitions. • N[c] = subset of nodes reachable from START node after reading D[1..c] • Q: when is v N[c] u Eps(u) CSE182**D.P. to match regular expression**• Q: when is v N[c]? • A: If for some u N[c-1], w = A[u,D[c]], • v {w}+ Eps(w) CSE182**Algorithm**CSE182**The final step**• We have answered the question: • Is D[1..c] accepted by R? • Yes, if END N[c] • We need to answer • Is D[l..c] (for some l, and some c) accepted by R CSE182**Representation 2: Profiles**• Profiles versus regular expressions • Regular expressions are intolerant to an occasional mis-match. • The Union operation (I+V+L) does not quantify the relative importance of I,V,L. It could be that V occurs in 80% of the family members. • Profiles capture some of these ideas. CSE182**Profiles**• Start with an alignment of strings of length m, over an alphabet A, • Build an |A| X m matrix F=(fki) • Each entry fki represents the frequency of symbol k in position i 0.71 0.14 0.28 0.14 CSE182**Profiles**• Start with an alignment of strings of length m, over an alphabet A, • Build an |A| X m matrix F=(fki) • Each entry fki represents the frequency of symbol k in position i 0.71 0.14 0.28 0.14 CSE182**Scoring matrices**i • Given a sequence s, does it belong to the family described by a profile? • We align the sequence to the profile, and score it • Let S(i,j) be the score of aligning position i of the profile to residue sj • The score of an alignment is the sum of column scores. s sj CSE182**Scoring Profiles**Scoring Matrix i k fki s CSE182**Domain analysis via profiles**• Given a database of profiles of known domains/families, we can query our sequence against each of them, and choose the high scoring ones to functionally characterize our sequences. • What if the sequence matches some other sequences weakly (using BLAST), but does not match any Profile? CSE182**Psi-BLAST idea**• Iterate: • Find homologs using Blast on query • Discard very similar homologs • Align, make a profile, search with profile. • Why is this more sensitive? Seq Db CSE182**Pigeonhole principle again:**• If profile of length m must score >= T • Then, a sub-profile of length l must score >= lT|/m • Generate all l-mers that score at least lT|/M • Search using an automaton • Multiple alignment: • Use ungapped multiple alignments only Psi-BLAST speed • Two time consuming steps. • Multiple alignment of homologs • Searching with Profiles. • Does the keyword search idea work? CSE182**Representation 3: HMMs**• Question: • your ‘friend’ likes to gamble. • He tosses a coin: HEADS, he gives you a dollar. TAILS, you give him a dollar. • Usually, he uses a fair coin, but ‘once in a while’, he uses a loaded coin. • Can you say what fraction of the times he loads the coin? CSE182**Representation 3: HMMs**• Building good profiles relies upon good alignments. • Difficult if there are gaps in the alignment. • Psi-BLAST/BLOCKS etc. work with gapless alignments. • An HMM representation of Profiles helps put the alignment construction/membership query in a uniform framework. V CSE182**The generative model**• Think of each column in the alignment as generating a distribution. • For each column, build a node that outputs a residue with the appropriate distribution 0.71 Pr[F]=0.71 Pr[Y]=0.14 0.14 CSE182**A simple Profile HMM**• Connect nodes for each column into a chain. Thie chain generates random sequences. • What is the probability of generating FKVVGQVILD? • In this representation • Prob [New sequence S belongs to a family]= Prob[HMM generates sequence S] • What is the difference with Profiles? CSE182**Profile HMMs can handle gaps**• The match states are the same as on the previous page. • Insertion and deletion states help introduce gaps. • A sequence may be generated using different paths. CSE182**Example**A L - L A I V L A I - L • Probability [ALIL] is part of the family? • Note that multiple paths can generate this sequence. • M1I1M2M3 • M1M2I2M3 • In order to compute the probabilities, we must assign probabilities of transition between states CSE182**Profile HMMs**• Directed Automaton M with nodes and edges. • Nodes emit symbols according to ‘emission probabilities’ • Transition from node to node is guided by ‘transition probabilities’ • Joint probability of seeing a sequence S, and path P • Pr[S,P|M] = Pr[S|P,M] Pr[P|M] • Pr[ALIL AND M1I1M2M3] = Pr[ALIL| M1I1M2M3,M] Pr[M1I1M2M3|M] • Pr[ALIL | M] = ? CSE182**Protein structure basics**CSE182**Side chains determine amino-acid type**• The residues may have different properties. • Aspartic acid (D), and Glutamic Acid (E) are acidic residues CSE182**Various constraints determine 3d structure**• Constraints • Structural constraints due to physiochemical properties • Constraints due to bond angles • H-bond formation • Surprisingly, a few conformations are seen over and over again. CSE182**Alpha-helix**• 3.6 residues per turn • H-bonds between 1st and 4th residue stabilize the structure. • First discovered by Linus Pauling CSE182**Beta-sheet**• Each strand by itself has 2 residues per turn, and is not stable. • Adjacent strands hydrogen-bond to form stable beta-sheets, parallel or anti-parallel. • Beta sheets have long range interactions that stabilize the structure, while alpha-helices have local interactions. CSE182**Domains**• The basic structures (helix, strand, loop) combine to form complex 3D structures. • Certain combinations are popular. Many sequences, but only a few folds CSE182**3D structure**• Predicting tertiary structure is an important problem in Bioinformatics. • Premise: Clues to structure can be found in the sequence. • While de novo tertiary structure prediction is hard, there are many intermediate, and tractable goals. • The PDB database is a compendium of structures PDB CSE182**Searching structure databases**• Threading, and other 3d Alignments can be used to align structures. • Database filtering is possible through geometric hashing. CSE182**Trivia Quiz**• What research won the Nobel prize in Chemistry in 2004? • In 2002? CSE182**Nobel Citation 2002**CSE182**Nobel Citation, 2002**CSE182**Mass Spectrometry**CSE182