Geometric Crossovers for Supervised Motif Discovery



Presentation Transcript


  1. Geometric Crossovers for Supervised Motif Discovery • Rolv Seehuus, NTNU

  2. Motivation and Scope • Test the applicability of the geometric framework on a supervised motif discovery problem • Compare its merits to those of a previously used operator • In practice, we test on a very easy problem • that existing software can solve easily • Value as a test case • Building block for more complex motif discovery problems that current algorithms cannot solve satisfactorily

  3. Motif Discovery • Has become a standard problem in bioinformatics • Given a set of sequences, figure out what is special about it • …by eliciting motifs in the dataset • Approaches differ by… • Motif model • Learning algorithms • Scoring functions

  4. The Standard Approach • Do analysis of the positive set of sequences • …background distribution… • …information content… • …statistical significance… • Report motif

  5. Motif discovery as a classification problem • Always at least two datasets: The positive, and “the rest” • Choose a negative dataset • Report motifs best suited to discriminate • No need to learn a background model • The statistical significance of the motif can be given • Discriminative motif discovery has received increased attention lately

  6. Classification problem • Protein sequences from the SwissProt database • Classified according to protein family (as specified in the Prosite database) • Selected six families that have previously been shown to be hard to classify under similar circumstances • Some of the families can be said to have an overrepresented motif of the kind we can train on

  7. The Potential Negative Data Set • Huge, compared to the positive • Quite common in bioinformatics, and an interesting problem to cope with in its own right • In the field: • randomly generated sequences • one set of randomly selected sequences • random rearrangement of the positive sequences (data not shown) • The “best practice” was to select the samples randomly from the negative set each generation, so that their number matches the size of the positive set.
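
The per-generation resampling of negatives described on this slide could look roughly like the sketch below. The function name `sample_negatives`, the toy sequence labels and the fixed seed are illustrative assumptions, not taken from the presentation.

```python
import random

def sample_negatives(negatives, n_positive, rng):
    """Draw a fresh negative subset whose size matches the positive set."""
    # Guard against a negative pool smaller than the positive set.
    k = min(n_positive, len(negatives))
    # Sampling without replacement, so no negative appears twice in a batch.
    return rng.sample(negatives, k)

rng = random.Random(0)                       # fixed seed only for repeatability
positives = ["P1", "P2", "P3"]               # hypothetical positive sequences
negatives = [f"N{i}" for i in range(1000)]   # hypothetical, much larger pool

# Called once per generation so each generation sees a different subset.
batch = sample_negatives(negatives, len(positives), rng)
```

Drawing a new subset every generation keeps the classes balanced while still exposing the population to the whole negative pool over time.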

  8. Motif Model • Alphabet: the twenty amino acids • Wildcard (.) • Example: the motif C...C.C..C matches the positive sequence DMEGACGGSCACSTCHVIVDP
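
A minimal sketch of matching under this motif model, assuming a motif is a plain string of amino-acid letters and "." wildcards, so that it can be handed to Python's `re` module unchanged. The helper name `motif_matches` is hypothetical.

```python
import re

def motif_matches(motif, sequence):
    """Return the start indices at which the motif occurs in the sequence.

    Assumes the motif contains only residue letters and '.', so it is
    already a valid regular expression ('.' matches any single residue).
    """
    # A lookahead makes the match zero-width, so overlapping hits are found too.
    pattern = re.compile("(?=" + motif + ")")
    return [m.start() for m in pattern.finditer(sequence)]

# The example from the slide: motif and positive sequence.
hits = motif_matches("C...C.C..C", "DMEGACGGSCACSTCHVIVDP")
```

For the slide's example the motif aligns with the substring CGGSCACSTC, which starts at zero-based index 5.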

  9. Operators on Motifs • Unit edit move as mutation Mut(A) = {Insert, Delete or Replace a token} • Substring Swapping Crossover (for comparison) • Two-point Geometric Crossover
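
The unit edit move used as mutation can be sketched as follows. This assumes motifs are strings over the twenty standard one-letter amino-acid codes plus the "." wildcard, and that the three moves are chosen uniformly; the slides do not state the actual move probabilities.

```python
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
ALPHABET = AMINO_ACIDS + "."  # twenty residues plus the wildcard

def mutate(motif, rng):
    """Apply one unit edit move: insert, delete or replace a single token."""
    ops = ["insert"]
    if motif:
        ops.append("replace")
    if len(motif) > 1:          # never delete the last remaining token
        ops.append("delete")
    op = rng.choice(ops)        # assumption: uniform choice among legal moves

    if op == "insert":
        i = rng.randrange(len(motif) + 1)
        return motif[:i] + rng.choice(ALPHABET) + motif[i:]

    i = rng.randrange(len(motif))
    if op == "delete":
        return motif[:i] + motif[i + 1:]

    # Replace with a different token so the move always changes the motif.
    choices = [t for t in ALPHABET if t != motif[i]]
    return motif[:i] + rng.choice(choices) + motif[i + 1:]

mutant = mutate("C...C.C..C", random.Random(1))
```

Every result is exactly one edit away from its parent, which is what makes the move a unit step in the edit-distance space.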

  10. Geometric Crossover • The search space has a metric • Mutation is a move in the search space • Crossover yields children that lie on a shortest path between the parents in the search space • Successfully applied to other problems

  11. Geometric Crossover for Motifs • View motifs as sequences • Basic assumption: the edit distance is a good way to move around in motif space • A crossover based on the edit distance should then be a good crossover for motif discovery • We (arbitrarily) choose unit costs for insertions, deletions and substitutions
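
A minimal sketch of the metric and of the "child lies on a shortest path between the parents" condition, assuming unit costs as stated on the slide. The names `edit_distance` and `in_segment` and the example strings are illustrative choices, not the author's code.

```python
def edit_distance(a, b):
    """Levenshtein distance with unit costs for insert, delete and substitute."""
    prev = list(range(len(b) + 1))          # distances from "" to prefixes of b
    for i, ca in enumerate(a, start=1):
        curr = [i]                          # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,                # delete ca
                curr[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),   # match (cost 0) or substitute (cost 1)
            ))
        prev = curr
    return prev[-1]

def in_segment(child, parent1, parent2):
    """True if the child lies on some shortest path between the two parents."""
    return (edit_distance(parent1, child) + edit_distance(child, parent2)
            == edit_distance(parent1, parent2))

ok = in_segment("BANANAS", "BANANA", "ANANAS")   # one edit from each parent
bad = in_segment("XYZ", "BANANA", "ANANAS")      # far from both parents
```

A crossover is geometric under this metric when every child it can produce passes the `in_segment` check.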

  12. Sequence Alignment • Alignment: put spaces (-) in both sequences so that they become the same length • Seq1’ = agcacac-a • Seq2’ = a-cacacta • Score: 2 • An optimal alignment is an alignment with minimal score • The score of the optimal alignment of two sequences equals their edit distance • There are often multiple optimal alignments
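
One way to obtain an optimal alignment is the standard dynamic-programming table with a traceback, sketched below under the unit-cost assumption. The function name `align` is hypothetical, and because several optimal alignments usually exist, this sketch simply returns whichever one its tie-breaking order reaches first.

```python
def align(a, b):
    """Return (aligned_a, aligned_b, score) for one optimal unit-cost alignment."""
    n, m = len(a), len(b)
    # cost[i][j] = edit distance between a[:i] and b[:j]
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i
    for j in range(1, m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i][j] = min(
                cost[i - 1][j] + 1,                           # gap in b
                cost[i][j - 1] + 1,                           # gap in a
                cost[i - 1][j - 1] + (a[i - 1] != b[j - 1]),  # match / mismatch
            )

    # Trace back from the bottom-right corner, preferring diagonal moves.
    row_a, row_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + (a[i - 1] != b[j - 1]):
            row_a.append(a[i - 1]); row_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            row_a.append(a[i - 1]); row_b.append("-"); i -= 1
        else:
            row_a.append("-"); row_b.append(b[j - 1]); j -= 1
    return "".join(reversed(row_a)), "".join(reversed(row_b)), cost[n][m]

seq1, seq2, score = align("agcacaca", "acacacta")
```

The returned score is the edit distance, and the number of columns in which the two aligned rows differ equals that score.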

  13. Homologous Crossover • Pick an optimal alignment for two parent sequences • Generate a crossover mask as long as the alignment • Recombine as in traditional crossover • Remove dashes from offspring • Example: Seq1’ = BANANA-, Seq2’ = -ANANAS, Mask = 1101100 • Recombined: SeqA’ = BANANAS, SeqB’ = -ANANA- • After removing dashes: Child1’ = BANANAS, Child2’ = ANANA
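
The recombination step can be sketched as below, taking already aligned parents and a mask as input. The convention that a mask bit of 1 copies from the first parent into the first child is an assumption, chosen because it reproduces the slide's example; the function name `homologous_crossover` is hypothetical.

```python
def homologous_crossover(aligned1, aligned2, mask):
    """Recombine two aligned parents column by column, then drop the dashes."""
    if not (len(aligned1) == len(aligned2) == len(mask)):
        raise ValueError("aligned parents and mask must have the same length")

    # Assumed convention: bit '1' -> first child takes the column from parent 1,
    # bit '0' -> from parent 2; the second child takes the complement.
    child_a = "".join(x if bit == "1" else y
                      for x, y, bit in zip(aligned1, aligned2, mask))
    child_b = "".join(y if bit == "1" else x
                      for x, y, bit in zip(aligned1, aligned2, mask))

    # Gap symbols only exist in the alignment, not in the motifs themselves.
    return child_a.replace("-", ""), child_b.replace("-", "")

# The example from the slide.
children = homologous_crossover("BANANA-", "-ANANAS", "1101100")
```

With the slide's inputs this yields BANANAS and ANANA, each one edit away from both parents.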

  14. Experiments • Compared the two crossovers, run with the same parameters, against mutation only • Ten-fold cross-validation: • Partitioned the datasets into ten pieces • Trained on 9/10ths • Tested the best motif on the remaining test set • Trained on a randomly selected subset of SwissProt • Tested on the entire SwissProt • Fitness: scaled Pearson correlation of the confusion matrix
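
The Pearson correlation of a 2x2 confusion matrix is the phi coefficient, also known as the Matthews correlation coefficient. The sketch below computes it and then rescales it linearly from [-1, 1] to [0, 1]; that particular scaling is an assumption, since the slides do not say how the score was scaled.

```python
import math

def pearson_confusion(tp, fp, tn, fn):
    """Pearson (phi / Matthews) correlation between predicted and true classes."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0   # a row or column of the matrix is empty; treat as uninformative
    return (tp * tn - fp * fn) / denom

def scaled_fitness(tp, fp, tn, fn):
    """Assumed scaling: map the correlation from [-1, 1] onto [0, 1]."""
    return (pearson_confusion(tp, fp, tn, fn) + 1.0) / 2.0

perfect = scaled_fitness(tp=10, fp=0, tn=10, fn=0)
chance = scaled_fitness(tp=5, fp=5, tn=5, fn=5)
```

A motif that matches all positives and no negatives gets the maximum fitness, while one that matches both classes equally often lands in the middle of the range.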

  15. Dynamic behavior during evolution

  16. Maximum Values

  17. Max

  18. Cytochrome • Includes the following fragment of a highly conserved motif: C…CH • which the geometric crossover finds • while substring swapping finds: CH • Conservation of length keeps us in the correct ballpark • CH represents a local maximum for substring swap

  19. Ferredoxin • Contains the following motif: C..C..C...C[PH] • which substring swap finds • while the geometric crossover does not • Conservation of length keeps us from finding the correct motif

  20. Population Means

  21. Means

  22. Classification Performance • Medians of 10 experiments for each family

  23. Classification Performance - II • Similar for all operators • Maybe a slight advantage for the geometric crossover if • a highly conserved motif exists • we have a “ballpark guess” on motif length • Surprisingly, mutation frequently outperforms the other operators

  24. Concluding remarks • The geometric operator is promising but needs work • It is more length preserving than substring swap • The geometric operator needs a good guess on motif length • The edit move might not be optimal for motif discovery? • even though it shows merit for some problems • Our initial assumption implies that insertions/deletions occur as often as replacements in sequence data • we are WAY off on that parameter

  25. Future Work • Synthetic data with known parameters • Include character classes and within-motif gaps in the representation • Modules (composite motifs) • Expand to position weight matrices
