Amit Satsangi amit@cs.ualberta.ca

Novel Approaches for Small Bio-molecule Classification and Structural Similarity Search. Karakoc E., Cherkasov A., and Sahinalp S.C.



  1. Amit Satsangi amit@cs.ualberta.ca Novel Approaches for Small Bio-molecule Classification and Structural Similarity Search. Karakoc E., Cherkasov A., and Sahinalp S.C. CMPUT 605

  2. Background and Focus • Identification of molecules that play an active role in the regulation of biological processes or disease states (e.g. Aspirin) • Structural similarity → similar biological and/or physico-chemical properties (Maggiora et al.) • Classification of a probe compound (unknown bioactivity) • Similarity search among compounds with known bioactivity

  3. Background and Focus • Determining similarity distance measures (SDM) • Using SDM for classification of compounds: k-NN classification • Efficient data structures for fast similarity search: DMVP trees (an improvement over the SCVP trees used previously)

  4. Outline • Similarity measures • Classification techniques • k-NN classifier • DMVP tree • Results, Observations and Conclusion

  5. Similarity between Molecules • Structural similarity: doubly bonded C pair, existence of an aromatic atom, etc. (used in structural similarity search engines) • Similarity of chemical descriptors: atomic weight, hydrophobicity, charge, density, etc. (used in QSAR* tools) *Quantitative Structure-Activity Relationship

  6. Similarity Measures • Tanimoto coefficient T(X,Y), given two descriptor sets X & Y: T(X,Y) = |X ∩ Y| / |X ∪ Y| • X & Y: n-dimensional bit-vectors (representation used by PubChem & some other databases) • Range of the Tanimoto coefficient: [0, 1]
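As a rough illustration of the Tanimoto coefficient on bit-vectors, the sketch below counts the bits set in both vectors and divides by the bits set in at least one. The function name and the convention for two all-zero vectors are assumptions made here, not something the slides specify.

```python
def tanimoto_coefficient(x, y):
    """Tanimoto coefficient of two equal-length bit-vectors (sequences of 0/1)."""
    if len(x) != len(y):
        raise ValueError("bit-vectors must have the same length")
    both = sum(1 for a, b in zip(x, y) if a and b)    # bits set in both vectors
    either = sum(1 for a, b in zip(x, y) if a or b)   # bits set in at least one
    # Assumed convention: two all-zero vectors are treated as identical.
    return 1.0 if either == 0 else both / either
```

Identical vectors give 1 and vectors with no common set bit give 0, which matches the stated range [0, 1].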

  7. Similarity Measures • Tanimoto distance measure: DT(X,Y) = 1 - T(X,Y) • Minkowski distance (Lp): Lp(X,Y) = (Σi |Xi - Yi|^p)^(1/p) • Real-valued data possible
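The two distance measures on this slide can be sketched as follows. This is a minimal version assuming plain Python sequences as descriptor arrays; the function names and the default order p=2 are choices made for this example only.

```python
def tanimoto_distance(x, y):
    """D_T(X, Y) = 1 - T(X, Y) for two equal-length bit-vectors."""
    both = sum(1 for a, b in zip(x, y) if a and b)
    either = sum(1 for a, b in zip(x, y) if a or b)
    # Assumed convention: two all-zero vectors are at distance 0.
    return 0.0 if either == 0 else 1.0 - both / either


def minkowski_distance(x, y, p=2):
    """Minkowski (L_p) distance; works on real-valued descriptor arrays."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)
```

With p=1 the Minkowski distance is the sum of absolute differences, and with p=2 it is the Euclidean distance.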

  8. Classification Techniques • Multiple Linear Regression (MLR) • Linear Discriminant Analysis (LDA) • Artificial Neural Networks (ANN) • Support Vector Machines (SVM) • k-nearest neighbor (k-NN) classification: not used previously

  9. Distance-based Classification • Compounds: s & r • S & R: respective descriptor arrays • If D(S,R) is small, then the bioactivity levels of s & r are similar • Notion of distance → classification of new compounds • Distance measure == metric (conditions), e.g. Hamming distance, Tanimoto distance, etc.

  10. k-NN Classification • Given → bioactivity • To find → a distance measure that separates active and inactive compounds for the training set (an N-dimensional plane) • Problem → easy

  11. k-NN Classification • Given → bioactivity • To find → a distance measure that separates active and inactive compounds for the training set (an N-dimensional plane) • Problem → NP-hard • Solution → use genetic algorithms, heuristic linear search to find the plane

  12. QSAR Approach • Uses a linear combination of descriptors • Assigns a weight W ∈ [0,1] to each dimension • Weighted Minkowski distance of order 1 • Only binary classification considered (active/inactive) • Methods are general
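The weighted Minkowski distance of order 1 (written wL1 later in the deck) can be sketched as below. The function name and argument layout are assumptions for illustration; how the weights themselves are chosen is not shown here.

```python
def weighted_l1_distance(s, r, weights):
    """Weighted Minkowski distance of order 1 between descriptor arrays s and r.

    Each dimension i contributes weights[i] * |s[i] - r[i]|, with every
    weight required to lie in [0, 1].
    """
    if not (len(s) == len(r) == len(weights)):
        raise ValueError("s, r and weights must have the same length")
    if any(w < 0 or w > 1 for w in weights):
        raise ValueError("every weight must lie in [0, 1]")
    return sum(w * abs(a - b) for w, a, b in zip(weights, s, r))
```

A weight of 0 removes a descriptor from the comparison entirely, while a weight of 1 lets it contribute its full absolute difference.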

  13. Parameter Optimization

  14. k-NN Classifier • Set of data elements: {X1, …, Xn} • Query element: Y • Range query → find the Xi such that D(Y,Xi) < R1 (user-defined) • k-NN query → find the k items whose distance to Y is as small as possible
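The two query types can be sketched with a brute-force scan, which is the baseline that tree-based indexes try to beat. Everything here is an illustrative assumption: the function names, the unweighted L1 default distance, and the majority vote used to turn a k-NN query into a classifier.

```python
import heapq
from collections import Counter


def l1_distance(x, y):
    """Unweighted L1 distance, used only as a default for this sketch."""
    return sum(abs(a - b) for a, b in zip(x, y))


def range_query(data, y, radius, dist=l1_distance):
    """Return every element Xi with D(Y, Xi) < radius."""
    return [x for x in data if dist(y, x) < radius]


def knn_query(data, y, k, dist=l1_distance):
    """Return the k elements closest to Y, nearest first."""
    return heapq.nsmallest(k, data, key=lambda x: dist(y, x))


def knn_classify(labelled, y, k, dist=l1_distance):
    """Majority label among the k nearest (vector, label) pairs."""
    nearest = heapq.nsmallest(k, labelled, key=lambda pair: dist(y, pair[0]))
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]
```

Each query here costs one distance computation per data element; the vantage-point trees on the following slides aim to skip most of those computations.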

  15. Data Structures: VP Trees • Vantage Point (VP) tree • Choose an arbitrary data point (called the vantage point) • Binary tree: recursively partitions the dataset into two equal-sized subsets • Zero in on the nearest neighbor
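A basic vantage-point tree can be sketched as follows. This is a generic textbook-style version, not the authors' implementation: the first point of each subset serves as the arbitrary vantage point, the median distance is the split radius, and the search prunes a branch using the triangle inequality, so `dist` must be a metric. All names are hypothetical.

```python
import math


class VPNode:
    def __init__(self, point, radius, inside, outside):
        self.point = point      # vantage point stored at this node
        self.radius = radius    # median distance from the vantage point
        self.inside = inside    # subtree of points with distance < radius
        self.outside = outside  # subtree of points with distance >= radius


def build_vp_tree(points, dist=math.dist):
    """Recursively split the points into two roughly equal halves."""
    if not points:
        return None
    vantage, rest = points[0], points[1:]   # arbitrary choice: first point
    if not rest:
        return VPNode(vantage, 0.0, None, None)
    distances = [dist(vantage, p) for p in rest]
    radius = sorted(distances)[len(distances) // 2]
    inside = [p for p, d in zip(rest, distances) if d < radius]
    outside = [p for p, d in zip(rest, distances) if d >= radius]
    return VPNode(vantage, radius,
                  build_vp_tree(inside, dist), build_vp_tree(outside, dist))


def nearest_neighbor(node, query, dist=math.dist, best=None):
    """Return (distance, point) of the nearest neighbor of query."""
    if node is None:
        return best
    d = dist(query, node.point)
    if best is None or d < best[0]:
        best = (d, node.point)
    # Search the side that contains the query first.
    if d < node.radius:
        first, second = node.inside, node.outside
    else:
        first, second = node.outside, node.inside
    best = nearest_neighbor(first, query, dist, best)
    # By the triangle inequality, the other side can only hold a closer
    # point if the query lies within best[0] of the partition boundary.
    if abs(d - node.radius) <= best[0]:
        best = nearest_neighbor(second, query, dist, best)
    return best
```

The pruning test is what lets the search "zero in" on the nearest neighbor without visiting every point.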

  16. Efficient Data Structures: SCVP Trees • Space Covering Vantage Point tree • Multiple vantage points chosen at each level • No longer a binary tree: multiple branches at each internal node • Multiple inner partitions: the hope is that each data point lies in at least one inner partition

  17. DMVP Tree • Memory requirements of the SCVP tree can be large: redundancy of data elements • Deterministic selection of vantage points • VP minimization: NP-hard • Minimization == weighted set cover problem • Use of a greedy algorithm: O(log l); l < n • Approximates the minimum number of VPs
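The greedy strategy for weighted set cover can be sketched in general form as below. How the DMVP tree maps vantage points onto sets and costs is not given on the slide, so the framing in the comments (each candidate vantage point "covers" some data points) is an assumption, as are all names.

```python
def greedy_weighted_set_cover(universe, subsets, costs):
    """Greedy approximation for weighted set cover.

    universe: iterable of elements to cover (e.g. data points).
    subsets:  dict mapping a candidate name (e.g. a hypothetical vantage
              point id) to the set of elements it covers.
    costs:    dict mapping each candidate name to a positive cost.
    Returns the chosen candidate names in the order they were picked.
    """
    uncovered = set(universe)
    chosen = []
    while uncovered:
        best_name, best_ratio = None, None
        for name, members in subsets.items():
            gain = len(uncovered & members)   # newly covered elements
            if gain == 0:
                continue
            ratio = costs[name] / gain        # cost per newly covered element
            if best_ratio is None or ratio < best_ratio:
                best_name, best_ratio = name, ratio
        if best_name is None:
            raise ValueError("the subsets cannot cover the universe")
        chosen.append(best_name)
        uncovered -= subsets[best_name]
    return chosen
```

At every step the candidate with the lowest cost per newly covered element is picked, which is the standard greedy rule with a logarithmic approximation guarantee.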

  18. Experiments • Five types of bioactivity, viz. antibiotic (520), bacterial metabolite (562), human metabolite (1104), drug (958), drug-like (1202) • 62-dimensional descriptor array (30 QSAR & 32 physico-chemical properties) • k = 1, i.e. one nearest neighbor • Comparison with LDA, MLR, ANN • 70% of the data used for training • wL1 distance is calculated in all cases

  19. Experimental Results • Table 1 shows that in almost all cases, in terms of accuracy, T_P, T_N, F_P, etc., k-NN does better than LDA and MLR • ANN beats k-NN on almost all counts • Pruning: more than 80% for each kind of bioactivity (over brute-force search) • Key point: the k-NN classifier is faster • More than 100 times faster than ANN

  20. Experimental Results • Can calculate the level of bioactivity instead of a yes/no answer • The values of the weights provide insight into the importance of the descriptors for each bioactivity

  21. Observations & Conclusion • Bacterial metabolites & antimicrobial drugs overlap (confirmation) • Human metabolites display distinctive properties • QSAR models for drugs + human metabolites are dominated by a few descriptors • These descriptors are favored by drug developers and natural evolution

  22. Observations & Conclusion • Classification results from k-NN can help rationalize the design and discovery of drugs • The DMVP tree improves the space utilization of the program • Provides a means for fast similarity search • The data structure can be applied to any metric distance, such as wLp and Tanimoto distance

  23. Thank You For Your Attention!
