Randomized Exhaustive Propositionalization for Molecule Classification

A Randomized Exhaustive Propositionalization Approach for Molecule Classification
Michele Samorani, Manuel Laguna, Kirk DeLisle, Daniel Weaver

Drug discovery

Drug Discovery • The process of developing new drugs • Developing a drug typically costs between $500 million and $2 billion • Molecule classification is used throughout the process to discriminate between: • Active and non-active compounds • Toxic and non-toxic compounds • During the development of a new drug: • Use the experiments done so far to train a classifier • Use the classifier to find the promising compounds to test next • An ideal classification algorithm: • Speeds up the design of new drugs • Gives insights into chemical properties

Data Mining in Drug Discovery • [Diagram: the chemist designs a compound; its attribute representation is passed to a classifier, which labels it Non-Active (0) or Active (1); in the example shown the outcome is "Non-Active!"]

Molecule classification – Binary Fingerprints • One of the main attribute representations is the so-called binary fingerprint: • Every attribute represents the absence/presence (0/1) of a characteristic or a substructure • The attributes are pre-defined characteristics • The classification process does not find new knowledge; it only identifies which attributes are most important

Molecule classification – Binary Fingerprints • The focus of this work is NOT on improving the classification procedure • It is on how to generate a good attribute representation that generates new knowledge

Propositionalization

Propositionalization • The starting point is a database • By navigating through the database, new features are generated, which represent the result of SQL queries • These features are added to the mining table

[Database schema diagram: tables linked by 1-to-n relationships; the target table contains the compounds]

Generating a new attribute • Two steps: • Find a path that starts from the target table • Roll-up one simple attribute, through aggregations and refinements, from the last table to the target table

STEP 1: Find a path • [Schema diagram with 1-to-n relationships]

STEP 1: Find a path • This path will find attributes of depth 2 • Depth = measure of how complex the attribute is
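Step 1 can be sketched as a walk over the schema graph. The schema diagram did not survive in this transcript, so the three tables below (`Molecule`, `Atom`, `Bond`) and their links are assumptions, as is the reading of depth as the number of table hops from the target table. The function `find_paths` is a hypothetical helper, not the authors' implementation.

```python
from typing import Dict, List, Tuple

# Hypothetical schema (assumption): each table lists the tables it links to.
SCHEMA: Dict[str, List[str]] = {
    "Molecule": ["Atom"],
    "Atom": ["Molecule", "Bond"],
    "Bond": ["Atom"],
}


def find_paths(schema: Dict[str, List[str]], target: str,
               max_depth: int) -> List[Tuple[str, ...]]:
    """Enumerate every walk of 1..max_depth hops starting at the target table.

    Tables may be revisited, so walks longer than the number of tables exist.
    """
    paths: List[Tuple[str, ...]] = []
    frontier: List[Tuple[str, ...]] = [(target,)]
    for _ in range(max_depth):
        next_frontier = []
        for path in frontier:
            for neighbor in schema[path[-1]]:
                next_frontier.append(path + (neighbor,))
        paths.extend(next_frontier)
        frontier = next_frontier
    return paths


paths = find_paths(SCHEMA, "Molecule", max_depth=2)
for p in paths:
    # The number of hops is taken here as the depth of the path.
    print(len(p) - 1, " -> ".join(p))
```

Under these assumptions, the walk Molecule -> Atom -> Bond is the depth-2 path used in the roll-up example on the following slides.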

STEP 2: Roll-up • Aggregate to each Atom: count distinct bonds (CDB)

STEP 2: Roll-up • Aggregate to each Atom: count distinct bonds (CDB) • [Diagram: the Atom table now carries a CDB column]

STEP 2: Roll-up • Attach to the target table: max(Atom.CDB) where Atom.ele = 'C' • That is, the maximum number of bonds in which a carbon atom participates
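The two roll-up steps can be written as a single SQL query. The sketch below uses Python's built-in `sqlite3` module with an assumed schema: the table and column names (`atom`, `bond`, `atom1_id`, `atom2_id`, and so on) are hypothetical, and the two toy molecules (formaldehyde and methanol) are illustrative data, not taken from the study.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE molecule (mol_id INTEGER PRIMARY KEY);
CREATE TABLE atom (atom_id INTEGER PRIMARY KEY, mol_id INTEGER, ele TEXT);
CREATE TABLE bond (bond_id INTEGER PRIMARY KEY,
                   atom1_id INTEGER, atom2_id INTEGER);
""")
conn.executemany("INSERT INTO molecule VALUES (?)", [(1,), (2,)])
conn.executemany("INSERT INTO atom VALUES (?, ?, ?)", [
    # Molecule 1: formaldehyde (one C, one O, two H)
    (1, 1, "C"), (2, 1, "O"), (3, 1, "H"), (4, 1, "H"),
    # Molecule 2: methanol (one C, one O, four H)
    (5, 2, "C"), (6, 2, "O"), (7, 2, "H"), (8, 2, "H"),
    (9, 2, "H"), (10, 2, "H"),
])
conn.executemany("INSERT INTO bond VALUES (?, ?, ?)", [
    (1, 1, 2), (2, 1, 3), (3, 1, 4),             # formaldehyde
    (4, 5, 6), (5, 5, 7), (6, 5, 8), (7, 5, 9),  # methanol, bonds on C
    (8, 6, 10),                                  # methanol, O-H
])

ROLL_UP = """
WITH atom_cdb AS (            -- aggregate to each atom: count distinct bonds
    SELECT a.atom_id, a.mol_id, a.ele,
           COUNT(DISTINCT b.bond_id) AS cdb
    FROM atom a
    LEFT JOIN bond b ON a.atom_id IN (b.atom1_id, b.atom2_id)
    GROUP BY a.atom_id, a.mol_id, a.ele
)
SELECT mol_id, MAX(cdb) AS max_cdb_carbon  -- refine to carbon, then aggregate
FROM atom_cdb
WHERE ele = 'C'
GROUP BY mol_id
ORDER BY mol_id
"""
new_attribute = conn.execute(ROLL_UP).fetchall()
print(new_attribute)
```

Each returned row pairs a molecule with the value of the new attribute, which would then be added as a column of the mining table.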

Propositionalization – graphically • [Figure: attributes generated at depth 1, depth 2, depth 3 and depth 4]

Our contribution over traditional propositionalization • Our Randomized Exhaustive approach produces: • More expressive attributes (Exhaustive) • “Deeper” attributes (Randomized)

More expressive attributes – Example • Traditional propositionalization algorithms can generate the following attribute: • Count the number of double bonds in which each atom participates • Compute the maximum • But not the following attribute: • Count the number of double bonds in which each atom participates • Compute the maximum among the oxygen atoms
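The difference between the two attributes can be seen on a single toy molecule. The sketch below uses carbon dioxide (O=C=O); the data layout (a dict of atoms and a list of `(atom, atom, bond_order)` triples) is an assumption made for illustration.

```python
from collections import Counter

# Toy molecule (assumption): carbon dioxide, O=C=O.
atoms = {1: "O", 2: "C", 3: "O"}
bonds = [(1, 2, 2), (2, 3, 2)]  # (atom, atom, bond order); 2 = double bond

# Step shared by both attributes: double bonds per atom.
double_bonds = Counter()
for a, b, order in bonds:
    if order == 2:
        double_bonds[a] += 1
        double_bonds[b] += 1

# Traditional attribute: maximum over all atoms.
max_all = max(double_bonds[a] for a in atoms)

# Attribute with a refinement at the last step: maximum over oxygen atoms only.
max_oxygen = max(double_bonds[a] for a, ele in atoms.items() if ele == "O")

print(max_all, max_oxygen)
```

Here the carbon atom sets the overall maximum, while the oxygen-only maximum is lower, so the two attributes carry different information about the same molecule.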

Attributes: Traditional vs Exhaustive • [Figure panels: Activity, Mutagenicity]

EXPERIMENTS

Design of the experiments • Given an attribute generation strategy: • Perform a 10-fold cross-validation using 10 different classifiers (from Weka): • MultilayerPerceptron, BayesNet, Bagging, J48, ADTree, REPTree, RandomForest, PART, NNge, Ridor • The performance measure of the strategy is the average accuracy across the folds and across the classifiers
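The evaluation protocol can be sketched in plain Python. The two toy classifiers below (`majority_class` and `one_nearest_neighbor`) are stand-ins chosen for brevity, not the Weka classifiers named on the slide, and the data set is synthetic; only the fold-and-classifier averaging follows the slide.

```python
import random
from collections import Counter
from statistics import mean


def k_fold_indices(n, k, seed=0):
    """Shuffle the indices 0..n-1 and deal them into k disjoint folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]


def strategy_score(X, y, classifiers, k=10, seed=0):
    """Average accuracy over all folds and all classifiers."""
    folds = k_fold_indices(len(X), k, seed)
    accuracies = []
    for fit in classifiers:
        for test_idx in folds:
            held_out = set(test_idx)
            train_idx = [i for i in range(len(X)) if i not in held_out]
            predict = fit([X[i] for i in train_idx],
                          [y[i] for i in train_idx])
            hits = sum(predict(X[i]) == y[i] for i in test_idx)
            accuracies.append(hits / len(test_idx))
    return mean(accuracies)


# Toy stand-ins for the Weka classifiers (assumptions, not the originals).
def majority_class(X_train, y_train):
    label = Counter(y_train).most_common(1)[0][0]
    return lambda x: label


def one_nearest_neighbor(X_train, y_train):
    def predict(x):
        # Hamming distance between binary attribute vectors.
        distances = [sum(a != b for a, b in zip(x, row)) for row in X_train]
        return y_train[distances.index(min(distances))]
    return predict


# Synthetic data: the class equals the first binary attribute.
X = [(i % 2, (i // 2) % 2) for i in range(20)]
y = [row[0] for row in X]
score = strategy_score(X, y, [majority_class, one_nearest_neighbor])
print(f"strategy score: {score:.4f}")
```

Each attribute generation strategy would produce its own `X`, and the resulting scores would then be compared across strategies.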

Up to a predefined depth • In general, the deeper we go, the higher the accuracy we obtain • Let’s generate attributes at depth > 4

Up to depth 4 + 1,000 in [5,7] • Generate all attributes up to depth 4 and add 1,000 attributes randomly sampled from depths 5 to 7 • [Chart values; series labels not recoverable: 77.93%, 77.72%, 76.30%, 74.95%]
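The "scan, then sample" strategy can be sketched as follows. The generator `toy_attributes_at_depth` is a hypothetical placeholder that returns labeled tuples instead of real attributes, and for simplicity the sketch materializes the whole deep pool before sampling, which a real implementation would presumably avoid.

```python
import random


def scan_and_sample(attributes_at_depth, scan_depth=4,
                    sample_depths=(5, 6, 7), n_samples=1000, seed=0):
    """Keep every attribute up to scan_depth, plus a random sample of deep ones."""
    selected = [a for d in range(1, scan_depth + 1)
                for a in attributes_at_depth(d)]
    deep_pool = [a for d in sample_depths for a in attributes_at_depth(d)]
    rng = random.Random(seed)
    # Sample without replacement; take the whole pool if it is small.
    selected += rng.sample(deep_pool, min(n_samples, len(deep_pool)))
    return selected


def toy_attributes_at_depth(depth):
    """Hypothetical generator: 3**depth placeholder attributes per depth."""
    return [(depth, i) for i in range(3 ** depth)]


attributes = scan_and_sample(toy_attributes_at_depth)
shallow = [a for a in attributes if a[0] <= 4]
deep = [a for a in attributes if a[0] >= 5]
print(len(shallow), len(deep))
```

The exhaustive part grows quickly with depth, which is why only a fixed-size sample is drawn beyond depth 4.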

Summary of the results • Exhaustive is significantly better than Traditional • Sampling deep attributes at the end of the attribute generation procedure is significantly better than continuing to generate non-deep attributes • (in terms of the proportion of classifiers that perform better with one strategy than with the other)
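The slides do not name the statistical test behind "significantly better". One standard way to test a proportion of wins is a one-sided sign test, sketched below under that assumption; the counts in the example are made up for illustration and are not results from the study.

```python
from math import comb


def sign_test_p_value(wins, losses):
    """One-sided sign test: P(at least `wins` wins out of n) under a fair coin.

    Ties between the two strategies are assumed to have been dropped.
    """
    n = wins + losses
    return sum(comb(n, k) for k in range(wins, n + 1)) / 2 ** n


# Illustrative counts only: strategy A beats strategy B on 9 of 10 classifiers.
p = sign_test_p_value(wins=9, losses=1)
print(f"p = {p:.4f}")
```

A small p-value indicates that so many wins would be unlikely if the two strategies were equally good.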

Comparison to fingerprints The difference is not significant

Comparison to fingerprints • Our method: 2 hours of computing time • Fingerprints: years of research effort to identify this attribute representation • The difference is not significant

Additional Attribute Generation Strategies • Let’s not sample deep attributes randomly • Strategy 1: find the best mix of depths (scatter search) • Strategy 2: use a Bayesian Network to retrieve attributes with high information gain
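Strategy 2 ranks attributes by information gain. The sketch below shows only how the information gain of one attribute with respect to the class label can be computed; it does not reproduce the Bayesian network part, and the toy vectors are assumptions.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def information_gain(attribute, labels):
    """Entropy of the labels minus their entropy after splitting on the attribute."""
    n = len(labels)
    remainder = 0.0
    for value in set(attribute):
        subset = [l for a, l in zip(attribute, labels) if a == value]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder


# Toy data (assumption): 1 = active, 0 = non-active.
labels = [0, 0, 1, 1]
informative = [0, 0, 1, 1]    # splits the classes perfectly
uninformative = [0, 1, 0, 1]  # says nothing about the class

print(information_gain(informative, labels),
      information_gain(uninformative, labels))
```

Attributes with a higher gain separate active from non-active compounds better and would be preferred when choosing which deep attributes to keep.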

New knowledge • Although our best method does not improve upon fingerprints, it has the potential to generate new knowledge • The attributes used by the classifiers represent important characteristics: • The number of bromine atoms • The average number of double bonds among the atoms other than sulfur (S) • These attributes identify structures with characteristics that may prevent mutagenesis

New knowledge • But, sometimes, deep attributes are hard to interpret. • On Estrogen: • Label each atom A in the following way. 1) Consider the atoms connected to A and, for each of them, count the bonds in which it participates (excluding the bond connecting it to A). 2) Compute the sum of these counts to obtain the label for A. Label the molecule with the minimum of these labels across all oxygen atoms. • Specifically, a high value would represent an oxygen atom that is connected to other atoms participating in a large number of additional bonds, presumably an oxygen atom that is somewhat buried and interacting with highly branched atoms.
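The labeling procedure is easier to follow on a small example. The sketch below implements one reading of the description (each neighbor contributes its bond count minus the bond it shares with A) and applies it to 2-methoxyethanol, a toy molecule chosen for illustration rather than a compound from the study; the function names are hypothetical.

```python
from collections import defaultdict


def build_neighbors(bonds):
    """Map each atom to the set of atoms it is bonded to."""
    neighbors = defaultdict(set)
    for a, b in bonds:
        neighbors[a].add(b)
        neighbors[b].add(a)
    return neighbors


def atom_label(atom, neighbors):
    """Sum, over the atoms bonded to `atom`, of their other bonds."""
    return sum(len(neighbors[n]) - 1 for n in neighbors[atom])


def min_oxygen_label(atoms, bonds):
    """Molecule-level attribute: minimum atom label across oxygen atoms."""
    neighbors = build_neighbors(bonds)
    labels = [atom_label(a, neighbors)
              for a, ele in atoms.items() if ele == "O"]
    return min(labels) if labels else None


# Toy molecule (assumption): 2-methoxyethanol, CH3-O-CH2-CH2-OH.
atoms = {1: "C", 2: "O", 3: "C", 4: "C", 5: "O",
         6: "H", 7: "H", 8: "H", 9: "H", 10: "H",
         11: "H", 12: "H", 13: "H"}
bonds = [(1, 2), (2, 3), (3, 4), (4, 5), (5, 6),   # backbone and O-H
         (1, 7), (1, 8), (1, 9),                   # hydrogens on C1
         (3, 10), (3, 11), (4, 12), (4, 13)]       # hydrogens on C3, C4

neighbors = build_neighbors(bonds)
ether_label = atom_label(2, neighbors)     # oxygen between two carbons
hydroxyl_label = atom_label(5, neighbors)  # oxygen of the OH group
molecule_label = min_oxygen_label(atoms, bonds)
print(ether_label, hydroxyl_label, molecule_label)
```

In this example the ether oxygen, which sits between two carbons, gets a higher label than the terminal hydroxyl oxygen, which matches the "buried oxygen" interpretation given on the slide.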

Conclusions • The current attribute representations (fingerprints) used for molecule classification do not provide insights into the chemical properties of the compounds • Traditional propositionalization approaches do not achieve satisfactory accuracy • Our method extends the traditional propositionalization approach and: • Obtains an accuracy comparable to fingerprints • Has the potential to find new knowledge • Note that our method is applicable to any domain (marketing, medical, etc.)

Future Work • Accuracy improvement: • Scan&Sample => Scan & Smartly Sample • Improve the feature representation • Query-like is ok for computer scientists, but chemists would prefer a graphical representation

Thank you for your attention Michael.Samorani@Colorado.edu
