
A Randomized Exhaustive Propositionalization Approach for Molecule Classification

Michele Samorani, Manuel Laguna, Kirk DeLisle, Daniel Weaver


Presentation Transcript


  1. A Randomized Exhaustive Propositionalization Approach for Molecule Classification. Michele Samorani, Manuel Laguna, Kirk DeLisle, Daniel Weaver

  2. Drug discovery

  3. Drug Discovery
     • The process of developing new drugs
     • The cost of developing a drug typically ranges from $500 million to $2 billion
     • Molecule classification is used throughout the process to discriminate between:
       • Active and non-active compounds
       • Toxic and non-toxic compounds
     • During the development of a new drug:
       • Use the experiments done so far to train a classifier
       • Use the classifier to find the promising compounds to test next
     • An ideal classification algorithm:
       • Speeds up the design of new drugs
       • Gives insights into chemical properties

  4. Data Mining in Drug Discovery [Diagram: the chemist designs a compound; its attribute representation is passed to a classifier, which outputs Non-Active (0) or Active (1); in the example shown, the result is Non-Active]

  5. Molecule classification – Binary Fingerprints
     • One of the main attribute representations is the so-called Binary Fingerprint:
       • Every attribute represents the absence/presence (0/1) of a characteristic or a substructure
       • The attributes are pre-defined characteristics
       • The classification process does not find new knowledge; it only identifies which of the pre-defined attributes are most important
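To make the 0/1 representation concrete, here is a minimal sketch of a binary fingerprint. The three keys in `FINGERPRINT_KEYS` are hypothetical placeholders chosen for illustration; real fingerprint schemes define their own, much longer, lists of characteristics.

```python
# Hypothetical pre-defined characteristics (placeholders, not a real fingerprint scheme).
FINGERPRINT_KEYS = ["has_bromine", "has_double_bond", "has_ring"]


def binary_fingerprint(features_present):
    """Return one 0/1 attribute per pre-defined key, in a fixed order."""
    return [1 if key in features_present else 0 for key in FINGERPRINT_KEYS]


# A compound that has a double bond and a ring, but no bromine.
example = binary_fingerprint({"has_double_bond", "has_ring"})
```

Because the keys are fixed in advance, a classifier trained on such vectors can only rank the keys it was given, which is the limitation the slide points out.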

  6. Molecule classification – Binary Fingerprints
     • The focus of this work is NOT on improving the classification procedure
     • The focus is on generating a good attribute representation that yields new knowledge

  7. Propositionalization

  8. Propositionalization
     • The starting point is a database
     • By navigating through the database, new features are generated, each representing the result of an SQL query
     • These features are added to the mining table

  9. [Database schema diagram with 1-to-n and n-to-n relationships between tables; callout on the target table: "It contains the compounds"]

  10. Generating a new attribute
      • Two steps:
        • Find a path that starts from the target table
        • Roll up one simple attribute, through aggregations and refinements, from the last table to the target table
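The first step (finding paths that start from the target table) can be sketched as a breadth-first enumeration over the schema graph. Everything in this block is an assumption for illustration: the three-table `SCHEMA`, the rule that a path never returns to the target table, and the reading of depth as the number of hops in the path.

```python
# Hypothetical schema graph: each table maps to the tables it is linked to.
SCHEMA = {
    "Molecule": ["Atom"],
    "Atom": ["Molecule", "Bond"],
    "Bond": ["Atom"],
}


def paths_from(target, max_depth):
    """Enumerate paths of 1..max_depth hops that start at the target table.

    Tables other than the target may be revisited (assumption), so that
    deeper paths such as Molecule -> Atom -> Bond -> Atom are possible.
    """
    results = []
    frontier = [[target]]
    for _ in range(max_depth):
        next_frontier = []
        for path in frontier:
            for neighbor in SCHEMA[path[-1]]:
                if neighbor == target:
                    continue  # never walk back into the target table
                new_path = path + [neighbor]
                results.append(new_path)
                next_frontier.append(new_path)
        frontier = next_frontier
    return results


depth_two_paths = [p for p in paths_from("Molecule", 2) if len(p) - 1 == 2]
```

Under these assumptions the only two-hop path is Molecule, Atom, Bond, which is the kind of path the roll-up step then aggregates back toward the target table.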

  11. STEP 1: Find a path [schema diagram]

  12. STEP 1: Find a path [schema diagram with a highlighted path]. This path will find attributes of depth 2. Depth = a measure of how complex the attribute is.

  13. STEP 2: Roll-up [schema diagram]. Aggregate to each Atom: count distinct bonds (CDB).

  14. STEP 2: Roll-up [schema diagram showing the CDB attribute]. Aggregate to each Atom: count distinct bonds (CDB).

  15. STEP 2: Roll-up [schema diagram showing the CDB attribute]. max(Atom.CDB) where Atom.ele = 'C'. Attach to the target table: the maximum number of bonds in which a carbon atom participates.
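The roll-up on slides 13 to 15 can be written as a single SQL query: an aggregation (count distinct bonds per atom), a refinement (keep only carbon atoms), and a second aggregation (maximum per molecule). The sketch below runs it on an in-memory SQLite database. The table names, column names, and toy rows are hypothetical; the presentation does not show its actual schema.

```python
import sqlite3

# Hypothetical toy schema and data, not the presentation's actual database.
SETUP = """
CREATE TABLE atom (atom_id INTEGER PRIMARY KEY, mol_id INTEGER, ele TEXT);
CREATE TABLE atom_bond (atom_id INTEGER, bond_id INTEGER);
INSERT INTO atom VALUES (1, 1, 'C'), (2, 1, 'C'), (3, 1, 'O'),
                        (4, 2, 'C'), (5, 2, 'O');
-- Molecule 1: bond 1 joins atoms 1-2, bond 2 joins atoms 2-3.
-- Molecule 2: bond 3 joins atoms 4-5.
INSERT INTO atom_bond VALUES (1, 1), (2, 1), (2, 2), (3, 2), (4, 3), (5, 3);
"""

# Aggregate CDB to each atom, refine to carbon, then aggregate to the molecule.
QUERY = """
SELECT a.mol_id, MAX(c.cdb)
FROM atom AS a
JOIN (SELECT atom_id, COUNT(DISTINCT bond_id) AS cdb
      FROM atom_bond
      GROUP BY atom_id) AS c ON c.atom_id = a.atom_id
WHERE a.ele = 'C'
GROUP BY a.mol_id
ORDER BY a.mol_id
"""


def build_toy_db():
    """Create the in-memory toy database."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SETUP)
    return conn


def max_carbon_cdb(conn):
    """Map each molecule id to max(Atom.CDB) over its carbon atoms."""
    return dict(conn.execute(QUERY).fetchall())


new_attribute = max_carbon_cdb(build_toy_db())
```

Each value in `new_attribute` would become one cell of a new column in the mining table, one row per compound.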

  16. Propositionalization – graphically [Figure: attributes at depths 1, 2, 3, and 4]

  17. Our contribution over traditional propositionalization
      • Our Randomized Exhaustive approach produces:
        • More expressive attributes (Exhaustive)
        • “Deeper” attributes (Randomized)

  18. More expressive attributes – Example
      • Traditional propositionalization algorithms can generate the following attribute:
        • Count the number of double bonds in which each atom participates
        • Compute the maximum
      • But not the following attribute:
        • Count the number of double bonds in which each atom participates
        • Compute the maximum among the oxygen atoms
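The difference between the two attributes is the refinement applied before the final aggregation. A minimal sketch, assuming each atom is already annotated with its double-bond count; the `roll_up` helper and the toy atom list are hypothetical.

```python
def roll_up(atoms, aggregate, refine=None):
    """Aggregate per-atom double-bond counts, optionally over a refined subset.

    atoms: list of (element, double_bond_count) pairs.
    refine: optional predicate on the element; None keeps every atom.
    """
    values = [count for element, count in atoms if refine is None or refine(element)]
    return aggregate(values) if values else None


# Toy molecule (made up): one carbon with 2 double bonds, two oxygens with 1 and 0.
toy_atoms = [("C", 2), ("O", 1), ("O", 0)]

# Attribute described as reachable by traditional propositionalization.
max_over_all_atoms = roll_up(toy_atoms, max)

# Attribute that additionally refines on the element before aggregating.
max_over_oxygen = roll_up(toy_atoms, max, refine=lambda element: element == "O")
```

The two values differ on this toy molecule, so the refined attribute carries information that the unrefined one cannot express.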

  19. Attributes: Traditional vs Exhaustive [Figure with panels: Activity, Mutagenicity]

  20. EXPERIMENTS

  21. Design of the experiments
      • Given an attribute generation strategy:
        • Perform a 10-fold cross-validation using 10 different classifiers (from Weka):
          • MultilayerPerceptron, BayesNet, Bagging, J48, ADTree, REPTree, RandomForest, PART, NNge, Ridor
        • The average accuracy across the folds and across the classifiers is the measure of the performance of the strategy used
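The performance measure is a plain average over a classifiers-by-folds grid of accuracies. The sketch below shows only that arithmetic; it does not call Weka, and the accuracy values are made-up placeholders (two classifiers, two folds) rather than results from the presentation.

```python
from statistics import mean


def strategy_score(fold_accuracies):
    """Average accuracy across all folds of all classifiers.

    fold_accuracies: dict mapping classifier name -> list of per-fold accuracies.
    With the same number of folds per classifier, this equals the mean of the
    per-classifier means.
    """
    return mean(acc for folds in fold_accuracies.values() for acc in folds)


# Placeholder numbers for illustration only, not results from the presentation.
placeholder_results = {
    "J48": [0.70, 0.80],
    "RandomForest": [0.90, 0.80],
}

score = strategy_score(placeholder_results)
```

In the setup described on the slide, the grid would have 10 classifiers and 10 folds, giving 100 accuracies per attribute generation strategy.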

  22. Up to a predefined depth [Figure]. In general, the deeper we go, the higher the accuracy we obtain. Let’s generate attributes at depth > 4.

  23. Up to depth 4 + 1,000 in [5,7]. Generate all attributes up to depth 4 and add 1,000 attributes randomly sampled from depths 5 to 7. [Figure; accuracies shown: 77.93%, 77.72%, 76.30%, 74.95%]
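The selection logic of this strategy (keep everything up to depth 4, then add a random sample of deeper attributes) can be sketched as follows. The function name, its parameters, and the idea of holding deep attributes in an explicit pool are assumptions made for illustration; in practice the deep attributes are presumably too numerous to enumerate and would be sampled by generating random paths instead.

```python
import random


def scan_and_sample(attributes_by_depth, scan_depth=4, sample_depths=(5, 6, 7),
                    n_samples=1000, seed=0):
    """Keep every attribute up to scan_depth, then add a random sample of deeper ones.

    attributes_by_depth: dict mapping depth -> list of attribute names.
    """
    scanned = [attr
               for depth in sorted(attributes_by_depth)
               if depth <= scan_depth
               for attr in attributes_by_depth[depth]]
    deep_pool = [attr
                 for depth in sample_depths
                 for attr in attributes_by_depth.get(depth, [])]
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    sampled = rng.sample(deep_pool, min(n_samples, len(deep_pool)))
    return scanned + sampled


# Toy pool with made-up attribute names: 3 attributes at each depth from 1 to 7.
toy_pool = {d: [f"attr_d{d}_{i}" for i in range(3)] for d in range(1, 8)}
selected = scan_and_sample(toy_pool, n_samples=4)
```

On the toy pool this keeps all 12 attributes of depth 1 to 4 and adds 4 of the 9 attributes found at depths 5 to 7.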

  24. Summary of the results
      • Exhaustive is significantly better than Traditional
      • Sampling deep attributes at the end of the attribute generation procedure is significantly better than continuing to generate non-deep attributes
      • (Significance is measured by the proportion of classifiers that perform better with one strategy than with the other)

  25. Comparison to fingerprints [Figure]. The difference is not significant.

  26. Comparison to fingerprints [Figure]
      • Our method: 2 hours of computing time
      • Fingerprints: years of research effort to identify this attribute representation
      • The difference is not significant

  27. Additional Attribute Generation Strategies • Let’s not sample deep attributes randomly • Strategy 1: find the best mix of depths (scatter search) • Strategy 2: use a Bayesian Network to retrieve attributes with high information gain
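Strategy 2 scores attributes by information gain. The sketch below computes only that score, using the standard entropy-based definition; it does not implement the Bayesian Network that the slide uses to retrieve such attributes. The function names and toy vectors are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(attribute_values, labels):
    """Reduction in label entropy obtained by splitting on the attribute's values."""
    total = len(labels)
    remainder = 0.0
    for value in set(attribute_values):
        subset = [lab for attr, lab in zip(attribute_values, labels) if attr == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder


# Toy data (made up): 1 = active, 0 = non-active.
toy_labels = [0, 0, 1, 1]
gain_informative = information_gain([0, 0, 1, 1], toy_labels)    # splits classes cleanly
gain_uninformative = information_gain([0, 1, 0, 1], toy_labels)  # leaves classes mixed
```

An attribute that separates the classes perfectly recovers the full label entropy, while one that leaves each subset evenly mixed gains nothing.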

  28. New knowledge
      • Although our best method does not improve upon fingerprints, it has the potential to generate new knowledge
      • The attributes used by the classifiers represent important characteristics:
        • The number of bromine atoms
        • The average number of double bonds among atoms other than sulfur (S)
      • These attributes identify structures with characteristics that may prevent mutagenesis

  29. New knowledge
      • But, sometimes, deep attributes are hard to interpret.
      • On Estrogen:
        • Label each atom A as follows: 1) for each atom connected to A, count the bonds in which that atom participates, excluding the bond connecting it to A; 2) sum these counts to obtain the label for A.
        • Label the molecule with the minimum of these labels across all oxygen atoms.
        • Specifically, a high value would represent an oxygen atom that is connected to other atoms participating in a large number of additional bonds, presumably an oxygen atom that is somewhat buried and interacting with highly branched atoms.
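One reading of this deep attribute can be worked through on a small molecular graph. The sketch assumes each bond is a single record between two atoms (so a double bond counts once) and uses a made-up six-atom skeleton; the function name is hypothetical.

```python
def min_oxygen_neighbor_bond_count(elements, bonds):
    """Compute the deep attribute described above for one molecule.

    elements: dict mapping atom id -> element symbol.
    bonds: list of (atom_id, atom_id) pairs, one per bond.
    Returns None when the molecule has no oxygen atom.
    """
    degree = {atom: 0 for atom in elements}
    neighbors = {atom: [] for atom in elements}
    for a, b in bonds:
        degree[a] += 1
        degree[b] += 1
        neighbors[a].append(b)
        neighbors[b].append(a)

    # Label of A: for each neighbor, its bond count minus the bond shared with A.
    labels = {atom: sum(degree[n] - 1 for n in neighbors[atom]) for atom in elements}

    oxygen_labels = [labels[atom] for atom, ele in elements.items() if ele == "O"]
    return min(oxygen_labels) if oxygen_labels else None


# Made-up skeleton: C1-O2-C3, with C3 also bonded to C4 and C5, and C5 bonded to O6.
toy_elements = {1: "C", 2: "O", 3: "C", 4: "C", 5: "C", 6: "O"}
toy_bonds = [(1, 2), (2, 3), (3, 4), (3, 5), (5, 6)]

molecule_label = min_oxygen_neighbor_bond_count(toy_elements, toy_bonds)
```

Here the bridging oxygen (atom 2) gets label 0 + 2 = 2, because its neighbor C3 has two additional bonds, while the terminal oxygen (atom 6) gets label 1, so the molecule is labeled with the minimum, 1.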

  30. Conclusions
      • The current attribute representations (Fingerprints) used for molecule classification do not provide insights into the chemical properties of the compounds
      • Traditional propositionalization approaches do not obtain satisfactory accuracy
      • Our method extends the traditional propositionalization approach and:
        • Obtains an accuracy comparable to Fingerprints
        • Has the potential to find new knowledge
      • Note that our method is applicable to any domain (marketing, medical, etc.)

  31. Future Work
      • Accuracy improvement:
        • Scan & Sample => Scan & Smartly Sample
      • Improve the feature representation:
        • A query-like representation is fine for computer scientists, but chemists would prefer a graphical representation

  32. Thank you for your attention Michael.Samorani@Colorado.edu
