1 / 23

230 likes | 368 Views

Amit Satsangi amit@cs.ualberta.ca. Novel Approaches for Small Bio-molecule Classification and Structural Similarity Search Karakoc E, Cherkasov A., and Sahinalp S.C. Background and Focus.

Download Presentation
## Amit Satsangi amit@cs.ualberta

**An Image/Link below is provided (as is) to download presentation**
Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.
Content is provided to you AS IS for your information and personal use only.
Download presentation by click this link.
While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.
During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

**Amit Satsangi**amit@cs.ualberta.ca Novel Approaches for Small Bio-molecule Classification and Structural Similarity SearchKarakoc E, Cherkasov A., and Sahinalp S.C. CMPUT 605**CMPUT 605**Background and Focus • Identification of molecules that play an active role in regulation of biological processes or disease states (Aspirin) • Structural similarity Similar biological and/or physico-chemical properties (Maggiora et al.) • Classification of probe compound (unknown bioactivity) • Similarity search amongst compounds with known bioactivity**CMPUT 605**Background and Focus • Determining similarity distance measures (SDM) • Using SDM for classification of compounds—k-NN classification • Efficient data structures for fast similarity search—DMVP trees (an improvement over SCVP trees used previously)**CMPUT 605**Outline • Similarity measures • Classification techniques • k-NN classifier • DMVP tree • Results, Observations and Conclusion**CMPUT 605**Similarity between Molecules • Structural Similarity—doubly bonded C pair, existence of aromatic atom etc. (Used in structural similarity search engines) • Similarity of chemical descriptors—atomic wt., hydrophobicity, charge, density etc. (Used in QSAR* tools) *Quantitative Structure-Activity Relationship**CMPUT 605**Similarity Measures • Tanimoto coefficient T(X,Y)—Given two descriptor sets X & Y: • X & Y: n-dimensional bit-vectors (representation used by PubChem & some other databases) • Range of Tanimoto coefficient: [0, 1]**CMPUT 605**Similarity measures • Tanimoto Dist. Measure: DT(X,Y) = 1 –T(X,Y) • Minkowski distance (LP): • Real valued data possible**CMPUT 605**Classification Techniques • Multiple Linear Regression (MLR) • Linear Discriminant Analysis (LDA) • Artifical Neural Networks (ANN) • Support Vector Machines (SVM) • k-nearest Neighbor (k-NN) classification not used previously.**CMPUT 605**Distance-based Classification • Compounds—s & r • S & R respective descriptor arrays • If D(S,R) is small then bioactivity levels of s & r are similar • Notion of distance classification of new compounds • Distance measure == metric (conditions) e.g. Hamming Distance, Tanimoto distance etc.**CMPUT 605**k-nn Classification • Given Bioactivity • To Find Distance measure that separates active and inactive compounds for the training set N-dimensional plane • Problem Easy**CMPUT 605**k-nn Classification • Given Bioactivity • To Find Distance measure that separates active and inactive compounds for the training set N-dimensional plane • Problem NP-hard • Solution Use Genetic Algorithms, heuristic linear search to find the plane**QSAR approach**• Uses a linear combination of descriptors • Assigns a weight to each dimension , W [0,1] • Weighted Minkowski distance of order 1 • Only binary classification considered (A/I) • Methods are general CMPUT 605**CMPUT 605**Parameter Optimization**CMPUT 605**k-NN Classifier • Set of data elements: {X1, … Xn} • Query element: Y • Range query Find Xi such that D(Y,Xi) < R1 (user defined) • k-nn query Find k items such that their distance to Y is as small as possible**CMPUT 605**Data structures: VP-Trees • Vantage Point (VP) tree • Choose an arbitrary data point (called Vantage Point) • Binary tree—recursively partitions the dataset into two equal sized subsets • Zero in on the nearest neighbor**CMPUT 605**Efficient data structures: SCVP Trees • Space Covering Vantage Point tree • Multiple vantage points chosen at each level • No more a binary tree—multiple branches at each internal node • Multiple inner partitions—hope is that each data point lies in atleast one inner partition**CMPUT 605**DMVP Tree • Memory requirements of SCVP tree can be large—redundancy of data elements • Deterministic selection of Vantage points • VP minimization—NP-Hard • Minimization == Weighted set cover problem • Use of greedy Algorithm: O(log l); l<n • Approximates the min number of VP’s**CMPUT 605**Experiments • Five types of bioactivities viz. being antibiotic (520), bacterial metabolite (562), human metabolite(1104), drug(958), drug-like(1202) • 62 dimensional descriptor array (30 QSAR & 32 physico-chemical properties) • k=1 i.e. one NN • Comparison with LDA, MLR, ANN • 70% data used for training • wL1 distance is calculated in all cases**CMPUT 605**Experimental Results • Table 1 shows that in almost all cases in terms of accuracy, and T_P, T_N, F_P etc. k-NN does better than LDA and MLR • ANN beats k-NN on almost all counts • Pruning—more than 80% in each kind of bioactivity (over brute-force search) • Key point – k-NN classifier is faster • More than 100 times faster than ANN**CMPUT 605**Experimental Results • Can calculate the level of bioactivity instead of a YES/NO • The value of the weights provides insights into the importance of descriptors for each bioactivity**CMPUT 605**Observations & Conclusion • Bacterial metabolites & antimicrobial drugs overlap (confirmation) • Human metabolites display distinctive properties • QSAR models for drugs + human metabolites dominated by few descriptors • These descriptors favored by drug developers and natural evolution**CMPUT 605**Observations & Conclusion • Classification results from k-NN can help rationalize the design and discovery of drugs • DMVP tree improves the space utilization of the program • Provides a means for fast similarity search • Data structure can be applied to any metric distance like wLp and Tanimoto distance**Thank You For Your Attention!**CMPUT 605

More Related