Kernel Functions for Chemical Classification
Aaron Smalter, Jun Huan, Gerald Lushington
{asmalter,jhuan,glushington}@ku.edu

Chemical and Graph Classification

- Chemicals can be modeled as graphs.
- Vertices and edges correspond to atoms and bonds.
- The graphs are labeled and undirected.

- Graph classification is critical for drug development and screening.
- Sifting through large databases of compounds requires efficiency.
- The costs of chemical manufacture and assay experiments necessitate accuracy.

- Traditional chemical classifiers use vector representations of chemicals, neglecting the rich structure of graph models.

Support Vector Machine

- SVM is a fast, accurate classifier designed for vector data.
- Crucially, SVM accesses data points only through inner products between pairs of input vectors.
- SVM can therefore classify nonlinear data distributions with a linear decision boundary by applying the kernel trick: replacing the inner product with a similarity function K(x,y).
- The key is that this kernel function K can be defined on non-vector data, allowing direct operation on structured data such as graphs.
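
The kernel trick can be made concrete with a small worked example. The sketch below is an illustration only and is not taken from the poster: it assumes a degree-2 polynomial kernel on 2-D points, and the names `poly2_kernel`, `phi`, and `inside_unit_circle` are hypothetical. It checks that K(x,y) equals the inner product of explicitly mapped vectors, and shows how a circular boundary in the input space becomes a linear rule in the mapped space.

```python
import math

def dot(u, v):
    """Ordinary inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def poly2_kernel(x, y):
    """Degree-2 polynomial kernel K(x, y) = (x . y)^2 (an assumed example kernel)."""
    return dot(x, y) ** 2

def phi(x):
    """Explicit feature map whose inner product reproduces poly2_kernel on 2-D inputs."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2) * x1 * x2, x2 * x2)

def inside_unit_circle(x):
    """Linear rule in the mapped space: w . phi(x) < 1 with w = (1, 0, 1).

    In the input space the same rule is x1^2 + x2^2 < 1, which is not linear.
    """
    return dot((1.0, 0.0, 1.0), phi(x)) < 1.0

x, y = (1.0, 2.0), (3.0, -1.0)
kernel_value = poly2_kernel(x, y)        # computed without ever building phi
feature_value = dot(phi(x), phi(y))      # computed in the mapped space
```

Both values agree up to floating-point rounding, which is why a learner that only needs inner products can work with K directly. Nothing in `poly2_kernel` requires its arguments to be vectors in general; any symmetric, positive semi-definite similarity function can take its place.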

Figure 1. Using graphs to model chemicals.
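
As a minimal sketch of the labeled, undirected graph model, the snippet below encodes formaldehyde (H2C=O). The encoding is an assumption made for illustration (the names `atoms`, `bonds`, `neighbors`, and `bond_label` are hypothetical), not the representation used by the authors.

```python
# Vertex labels: one entry per atom, keyed by an arbitrary vertex id.
atoms = {0: "C", 1: "O", 2: "H", 3: "H"}

# Edge labels: each bond is keyed by an unordered pair of vertex ids,
# so (0, 1) and (1, 0) refer to the same undirected edge.
bonds = {
    frozenset((0, 1)): "double",
    frozenset((0, 2)): "single",
    frozenset((0, 3)): "single",
}

def neighbors(v):
    """Return the set of vertices bonded to vertex v."""
    return {u for bond in bonds if v in bond for u in bond if u != v}

def bond_label(u, v):
    """Return the label of the bond between u and v, in either order."""
    return bonds[frozenset((u, v))]
```

Using `frozenset` keys is one simple way to make edge lookups independent of direction.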

Figure 2. A kernel function maps nonlinear data (left) into a linearly separable space (right).

Graph Kernel Functions

- The problem of chemical graph classification changes from finding vector representations of graphs to defining high-quality kernel functions that compare graphs.

- Previous kernel functions:
- Decompose graphs into substructures such as paths, cycles, and trees.
- Optimally assign vertices based on neighborhood similarity.
- Respective limitations are:
- Dependence on the particular decomposition chosen, and the time needed to enumerate patterns.
- Inefficient recursive comparison, and a flaw that means they are not true kernel functions.

Our Work

- We can improve graph kernels with several ideas:
- Embed frequent patterns by using their occurrences as features in the graph. [1]
- Use wavelet functions to compress neighborhood information. [2]
- Avoid finding an optimal assignment by using set matching and summing the kernels between all vertex pairs.
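
The difference between an optimal assignment and a sum over all vertex pairs can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' method: each vertex is described only by its atom label and the counts of its neighbors' labels (a crude stand-in for the pattern and wavelet features above), and every function name is hypothetical.

```python
from collections import Counter
from itertools import permutations

def make_graph(labels, edges):
    """Build (labels, adjacency) for an undirected graph from an edge list."""
    adjacency = {v: set() for v in labels}
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)
    return labels, adjacency

def vertex_features(graph, v):
    """Atom label of v plus counts of its neighbors' labels (assumed feature set)."""
    labels, adjacency = graph
    return labels[v], Counter(labels[u] for u in adjacency[v])

def vertex_kernel(f, g):
    """0 for different atom labels; otherwise 1 + overlap of neighbor-label counts."""
    (label_f, nbrs_f), (label_g, nbrs_g) = f, g
    if label_f != label_g:
        return 0
    return 1 + sum(nbrs_f[a] * nbrs_g[a] for a in nbrs_f)

def all_pairs_kernel(g1, g2):
    """Sum the vertex kernel over every vertex pair; no assignment is searched for."""
    f1 = [vertex_features(g1, v) for v in g1[0]]
    f2 = [vertex_features(g2, v) for v in g2[0]]
    return sum(vertex_kernel(a, b) for a in f1 for b in f2)

def optimal_assignment_score(g1, g2):
    """Brute-force best one-to-one vertex matching, shown for contrast (tiny graphs only)."""
    f1 = [vertex_features(g1, v) for v in g1[0]]
    f2 = [vertex_features(g2, v) for v in g2[0]]
    small, large = sorted((f1, f2), key=len)
    return max(
        sum(vertex_kernel(a, b) for a, b in zip(small, chosen))
        for chosen in permutations(large, len(small))
    )

# Heavy-atom skeletons only (hydrogens omitted for brevity).
methanol = make_graph({0: "C", 1: "O"}, [(0, 1)])
ethanol = make_graph({0: "C", 1: "C", 2: "O"}, [(0, 1), (1, 2)])

k_sum = all_pairs_kernel(methanol, ethanol)
k_opt = optimal_assignment_score(methanol, ethanol)
```

For these two toy graphs the all-pairs sum is 5 and the best matching scores 4. The all-pairs version needs only a double loop over vertices, whereas the brute-force matching grows factorially with graph size. Summing a valid vertex kernel over all pairs also yields a valid kernel, which is not guaranteed when a maximum over assignments is taken.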

Figure 3. Finding an optimal assignment using a bipartite graph.

Figure 4. Frequent patterns annotate graph vertices.

Figure 5. A wavelet function overlays a chemical graph.

Figure 6. Comparison of graph kernels: our GPM method performs best overall.

This work was supported by K-INBRE (NIH/NCRR award #P20 RR016475), the KU CMLD (NIH/NIGM award #P50 GM069663), and NIH grant #R01 GM868665.

[1] A. Smalter, J. Huan, G. Lushington. Chemical Compound Classification with Automatically Mined Structure Patterns. Proc. of the 6th Asia Pacific Bioinformatics Conference (APBC). 2008.

[2] A. Smalter, J. Huan, G. Lushington. Graph Wavelet Alignment Kernels for Drug Virtual Screening. Proc. of the 7th Annual Int. Conf. on Computational Systems Bioinformatics. 2008.