1 / 42

A Similar Fragments Merging Approach to Learn Automata on Proteins

A Similar Fragments Merging Approach to Learn Automata on Proteins. Goulven KERBELLEC & François COSTE IRISA / INRIA Rennes. Outline of the talk. Protein families signatures Similar Fragment Merging Approach (Protomata-L) Characterization Similar Fragment Pairs (SFPs) Ordering the SFPs

maxine
Download Presentation

A Similar Fragments Merging Approach to Learn Automata on Proteins

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. A Similar Fragments Merging Approach to Learn Automata on Proteins Goulven KERBELLEC & François COSTE IRISA / INRIA Rennes

  2. Outline of the talk • Protein families signatures • Similar Fragment Merging Approach (Protomata-L) • Characterization • Similar Fragment Pairs (SFPs) • Ordering the SFPs • Generalization • Merging of SFP in an automaton • Gap generalization • Identification of Physico-chemical properties • Experiments

  3. Protein families • Amino acid alphabet : • Protein sequence : • Protein data set : {A,R,N,D,C,Q,E,G,H,I,L,K,M,F,P,S,T,W,Y,V} >AQP1_BOVIN MASEFKKKLFWRAVVAEFLAMILFIFISIGSALGFHYPIKSNQTTGAVQDNVKVSLAFGLSI… >AQP1_BOVIN MASEFKKKLFWRAVVAEFLAMILFIFISIGSALGFHYPIKSNQTTGAVQDNVKVSLAFGLSI… >AQP2_RAT MWELRSIAFSRAVLAEFLATLLFVFFGLGSALQWASSPPSVLQIAVAFGLGIGILVQALGH… >AQP3_MOUSE MGRQKELMNRCGEMLHIRYRLLRQALAECLGTLILVMFGCGSVAQVVLSRGTHGGFLT… Common function & Common topology (3D structure)

  4. Characterization of a protein family • ZBT11 ...Csi..CgrtLpklyslriHmlk..H... • ZBT10 ...Cdi..CgklFtrrehvkrHslv..H... • ZBT34 ...Ckf..CgkkYtrkdqleyHirg..H... Zinc Finger Pattern C-x(2,4)-C-x(3)-[LIVMFYWC]-x(8)-H-x(3,5)-H x x x x x x x x x x x x CH x \ / x x Zn x x / \ x CH x x x x

  5. Expressivity classes of patterns PRATTTEIRESIAS PROSITE PROTOMATA-L

  6. Characterization

  7. Dialign: Similar Fragment Pairs • Significantly similar fragment pairs (SFPs) • Natural selection • Important area characterization Data set D:

  8. Ordering the SFPs • Problem : • Solution : ordering the SFPs by scoring each SFP • S(f1,f2)= ? • 3 different scoring functions : • dialign Sd • support Ss • implication Si

  9. Dialign Score • Sd ( f1 , f2 ) = - log P ( L , Sim ) • L = |f1| = |f2| • Sim = Sum of the individual similarity values • P = Probability that a random SFP of the same L has the same S Blossum62similarity

  10. Support Score • Taking into account the representativeness of SFP • Ss (f1,f2,D) = Number of sequences supporting <f1,f2> f1 f2 f <f1,f2> is supported by f with respectthe triangular inequality :Sd(f,f1) + Sd(f,f2)  Sd(f1,f2)

  11. Implication Score • Taking into account a counter-example set N • Discriminative fragments • Lerman index:Si(f1,f2,D,N) = • avec P(X) = -P( Ss(f1,f2,N) ) + P( Ss(f1,f2,D) ) x P(N) P( Ss(f1 ,f2 ,D) ) x |N| |X| |D| + |N|

  12. Generalization

  13. From protein data sets to automata MASEIKLFW F W A L S M I K E

  14. From protein data sets to automata MASEIKLFW MGYEVKYRV F W A L S M I K E R V G Y Y M V K E

  15. Merging SFPs MASEIKLFW MGYEVKYRV F W A L S M I K E R V G Y Y M V K E

  16. F W A M S L [I,V] K E Y Y R V G M Merging SFPs MASEIKLFW MGYEVKYRV

  17. Merging SFPs MASEIKLFW MGYEVKYRV F W A M S L [I,V] K E Y Y R V G M MASEVKLFM MGYEIKYRV MASEIKYRV MGYEVKLFW MASEVKYRV MGYEIKLFW

  18. Protein Sequence Data Set List of SFPs Ordered List of SFPs MCA MERGING Automaton / Regular Expression

  19. Gap Generalization • Merging on themself non-representative transitions • Treat them as "gaps"

  20. Identification of Physico-chemical properties • Similar Fragments ~ potential function area • Amino acids share out the same position • Physicochemical property at play • => Generalization from a group (of amino acids) to a Taylor group I,V C I,Q,W,P aliphatic no information C I,L,V x [I,V] [I,L,V] C C[I,Q,W,P] X {A,R,N,D,C,Q,E,G,H,I,L,K,M,F,P,S,T,W,Y,V}

  21. Likelihood ratio test • To decide if the multi-set A has been generated according to a physico-chemical group G or not by a likelihood ratio test: • Given a threshold , we test the expansion of A to G and reject it when LRG/A < 

  22. Experiments

  23. MIP : the Major Intrinsic Protein Family Family MIP Subfamilies AQP, Glpf, Gla

  24. Set « C» (49 seq) Blast(1<e<100) not MIP Data sets Water-specific Set « W+» (24 seq) Set « W-» (16 seq) Set « M» (44 seq) identity<90% Set « E» (79 seq) Set « T » (159 seq) Set « U » (911 seq) MIP in SWISS-PROT UNIPROT

  25. Experiments • First Common Fragment on a Family • MIP family • Positive set • Comparison with pattern discovery tools • Teiresias • Pratt • Protomata-L (short pattern) • Water-specific Characterization • MIP sub-families • Positive and negative sets • Leave-one-out cross-validation • Protomata-L (short to long pattern)

  26. Set « T » (159 seq) Target set Learning set Learning Set Set « M» (44 seq) First Common Fragment Automaton • Results of 4 patterns scannedon Swiss-Prot protein Database

  27. From short automata to long automata • Previous experiment • only the first SFPs of the ordered list of SFPs • short automaton • first common fragment automaton • Next experiment • larger cut-offs in the list of SFPs • Protomat-L is able to create longer automata with more common subparts • Long patterns are closed of the topoly (3D-structure) of the family

  28. Water-specific characterization • Leave-one-out cross-validation • Learning set • W+ \ Si : Positive learning set • W- \ Sj : Negative learning set • Test set • { Si U Sj } • Control set • Set T • Implication score Set « W+» (24 seq) Set « W-» (16 seq) Set « C» (49 seq)

  29. Leave-one-out cross-validation

  30. Error Correcting Cost • The error correcting cost of a sequence S represents the distance (blossum similarity) between S and the closest sequence given by the automaton A. • Distibution of sequences with long automata (size Approx. 100)

  31. Leave-one-out cross-validationWith Error Correcting Cost

  32. Leave-one-out cross-validation

  33. Conclusion & Perspective • Good characterization of protein family using automata (-> hmm structure) • No need of a multiple alignment • greedy data-driven algorithm • Important subparts localization • Physico-chemical identification and generalization • Counter example sets • Bringing of knowledge is possible in automata (-> 2D structure)

  34. Questions ? ? ? ? ? ? ? ? ? ? ? ? ? ? ? ?

  35. Demo

  36. Protomata-L ’s Approach • First Common Fragment

  37. Protomata-L ’s Approach • To get a more precise automaton

  38. Data set (Protein sequences) Pairs of fragments EXTRACTION Initial Automaton(MCA) SORT MERGING IDENTIFICATION OF PHYSICOCHEMICAL GROUPS IDENTIFICATION OF « GAPS »

  39. Structural discrimination

  40. Generalization of an Aquaporins automaton Aromatique Hydrophobe Non Informatif

  41. Physico-chemical properties identification Ratio likelihood test Aliphatic x Small

More Related