C2D Cheminformatics: Methods, Tools and Results
By OSDD-Cheminformatics team
The burden of TB
About 9 million people were infected with TB in 2009, and 1.7 million died. India is the world's TB capital, with an estimated 1.9 million cases reported every year.
Virtual Screening Data
Building computational models for the drug discovery process.
To screen molecules interacting with the potential TB targets using classifiers.
Dock the selected molecules with the targets to further screen them for leads.
Use cheminformatics techniques such as QSAR, 3D QSAR and ADMET to look for potential leads, and design drugs from those leads by building combinatorial libraries.
Use a previously derived mathematical model that predicts the biological activity of each structure
Run substructure queries to eliminate molecules with undesirable functionality
Use a docking program to identify structures predicted to bind strongly to the active site of a protein (if target structure is known)
Filters remove unwanted structures through a succession of screening methods
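The succession of filters described above can be sketched as a simple cascade. This is a minimal illustration, not the team's pipeline: the record fields (`predicted_activity`, `has_unwanted_group`, `docking_score`), the thresholds and the four example molecules are all hypothetical stand-ins for the outputs of an activity model, a substructure query and a docking program.

```python
from typing import Any, Callable, Dict, List, Tuple

Molecule = Dict[str, Any]  # hypothetical record holding precomputed screening results


def run_cascade(
    molecules: List[Molecule],
    filters: List[Tuple[str, Callable[[Molecule], bool]]],
) -> Tuple[List[Molecule], List[Tuple[str, int]]]:
    """Apply each filter in order and record how many molecules survive it."""
    survivors = list(molecules)
    report = []
    for name, keep in filters:
        survivors = [m for m in survivors if keep(m)]
        report.append((name, len(survivors)))
    return survivors, report


# Assumed thresholds, chosen only for illustration.
FILTERS = [
    ("activity model", lambda m: m["predicted_activity"] >= 0.5),
    ("substructure", lambda m: not m["has_unwanted_group"]),
    ("docking", lambda m: m["docking_score"] <= -7.0),
]

# Made-up library of four molecules.
LIBRARY = [
    {"id": "mol-1", "predicted_activity": 0.9, "has_unwanted_group": False, "docking_score": -8.2},
    {"id": "mol-2", "predicted_activity": 0.2, "has_unwanted_group": False, "docking_score": -9.0},
    {"id": "mol-3", "predicted_activity": 0.8, "has_unwanted_group": True, "docking_score": -8.5},
    {"id": "mol-4", "predicted_activity": 0.7, "has_unwanted_group": False, "docking_score": -5.1},
]

hits, report = run_cascade(LIBRARY, FILTERS)
print([m["id"] for m in hits])
print(report)
```

Ordering the cheap filters first means the expensive docking step only sees the molecules that passed everything before it.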
Polar surface area
SPC : Structure Property Correlation
Molecular descriptors are numerical values that characterize properties of molecules.
The descriptors fall into four classes:
d) Hybrid or 3D Descriptors
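To make the idea of a descriptor as "a numerical value that characterizes a molecule" concrete, here is a minimal sketch that derives three very simple descriptors from a molecular formula. It is an illustration only and not how PowerMV computes its descriptors; the function names and the small atomic-mass table are assumptions, and the parser handles only flat formulas without brackets.

```python
import re

# Standard atomic masses for the few elements used in this example.
ATOMIC_MASS = {"H": 1.008, "C": 12.011, "N": 14.007, "O": 15.999}


def parse_formula(formula: str) -> dict:
    """Count atoms in a flat formula such as 'C6H7N3O' (no brackets supported)."""
    counts = {}
    for symbol, number in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[symbol] = counts.get(symbol, 0) + (int(number) if number else 1)
    return counts


def simple_descriptors(formula: str) -> dict:
    """Return molecular weight, heavy-atom count and N/O heteroatom count."""
    counts = parse_formula(formula)
    weight = sum(ATOMIC_MASS[s] * n for s, n in counts.items())
    return {
        "mol_weight": round(weight, 3),
        "heavy_atoms": sum(n for s, n in counts.items() if s != "H"),
        "hetero_atoms": sum(n for s, n in counts.items() if s in ("N", "O")),
    }


# Isoniazid (C6H7N3O), a first-line TB drug, used here as the example input.
isoniazid = simple_descriptors("C6H7N3O")
print(isoniazid)
```

Each molecule becomes one row of such numbers, which is the form a classifier needs as input.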
According to David Hand et al. (MIT Press, 2001):
“Data mining is the analysis of (often large) observational data sets to find unsuspected relationships and to summarize the data in novel ways that are both understandable and useful to the data owner.”
Data mining ... but why?
Data → Information → Knowledge
SDF file of all compounds
Upload the SDF file
Generate descriptor file
Open the CSV file in Excel
Append the bioassay result corresponding to the compounds
Bioassay result (all)
Select the actives and inactive compounds
Remove the useless attributes
TP %, FP<20%, Accuracy >70%
Apply classifier algorithms
Selection of best classifier model
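The data preparation steps above (append the bioassay result, keep only actives and inactives, remove useless attributes) were done in Excel, but they can also be sketched in a few lines of code. The example below is an assumption-laden illustration: the column names (`CID`, `outcome`, `desc_a` and so on) and the tiny inline CSV data are invented, and "useless attribute" is interpreted here as a descriptor whose value is the same for every retained compound.

```python
import csv
import io

# Invented stand-ins for the descriptor CSV and the bioassay result file.
DESCRIPTORS_CSV = """CID,desc_a,desc_b,desc_c
101,0.5,1,0
102,0.7,1,3
103,0.1,1,2
104,0.9,1,5
"""

BIOASSAY_CSV = """CID,outcome
101,active
102,inactive
103,inconclusive
104,active
"""


def load_rows(text: str) -> list:
    """Read CSV text into a list of dicts keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))


def build_training_table(descriptor_rows: list, assay_rows: list) -> list:
    """Join on CID, keep active/inactive rows, drop constant descriptor columns."""
    outcome_by_cid = {row["CID"]: row["outcome"] for row in assay_rows}
    labelled = []
    for row in descriptor_rows:
        outcome = outcome_by_cid.get(row["CID"])
        if outcome in ("active", "inactive"):
            labelled.append({**row, "outcome": outcome})
    if not labelled:
        return labelled
    features = [k for k in labelled[0] if k not in ("CID", "outcome")]
    constant = {k for k in features if len({r[k] for r in labelled}) == 1}
    return [{k: v for k, v in r.items() if k not in constant} for r in labelled]


table = build_training_table(load_rows(DESCRIPTORS_CSV), load_rows(BIOASSAY_CSV))
for row in table:
    print(row)
```

Doing the join by compound ID rather than by row position avoids mismatches when the two files are sorted differently.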
To get trained in using different classifiers in Weka and analyzing the results
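When analyzing classifier results, the TP rate, FP rate and accuracy all follow from the four cells of the confusion matrix. The sketch below shows the arithmetic and checks the two thresholds stated in the workflow (FP < 20%, accuracy > 70%); the TP threshold is not given in the source, so it is left out. The function names and the example counts are hypothetical.

```python
def rates(tp: int, fp: int, tn: int, fn: int) -> tuple:
    """Return (TP rate, FP rate, accuracy) from confusion-matrix counts."""
    tp_rate = tp / (tp + fn)          # fraction of true actives found
    fp_rate = fp / (fp + tn)          # fraction of true inactives flagged active
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return tp_rate, fp_rate, accuracy


def meets_criteria(fp_rate: float, accuracy: float) -> bool:
    """Check the FP < 20% and accuracy > 70% cut-offs from the workflow."""
    return fp_rate < 0.20 and accuracy > 0.70


# Made-up counts for one model: 100 actives and 200 inactives in the test set.
tp_rate, fp_rate, accuracy = rates(tp=80, fp=30, tn=170, fn=20)
print(f"TP rate {tp_rate:.2%}, FP rate {fp_rate:.2%}, accuracy {accuracy:.2%}")
print("acceptable:", meets_criteria(fp_rate, accuracy))
```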
The P450s are mono-oxygenase enzymes.
They generally interact with flavoprotein and/or iron–sulphur centre redox partners for catalysis.
The Mtb genome sequence: a plethora of P450s.
“P450 dense” by comparison with eukaryotic genomes
Thus, analysis of Mtb CYP51 revealed P420 is an irreversibly inactivated and structurally disrupted species.
Organism          P450s   Genome size       Ratio (P450 : genome)
Humans            57      3.3 billion bp    1 : 58 million bp
D. melanogaster   84      123 million bp    1 : 1.5 million bp
A. thaliana       249     115 million bp    1 : 462,000 bp
M. tuberculosis   20      4.4 million bp    1 : 220,000 bp
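The ratio column is simply the genome size divided by the number of P450 genes. A short sketch of that arithmetic, using the counts and genome sizes from the table (the variable and function names are illustrative):

```python
# (number of P450 genes, genome size in base pairs), taken from the table.
GENOMES = {
    "Humans": (57, 3.3e9),
    "D. melanogaster": (84, 123e6),
    "A. thaliana": (249, 115e6),
    "M. tuberculosis": (20, 4.4e6),
}


def bp_per_p450(p450_count: int, genome_size_bp: float) -> float:
    """Average number of base pairs of genome per P450 gene."""
    return genome_size_bp / p450_count


ratios = {name: bp_per_p450(*values) for name, values in GENOMES.items()}
for name, ratio in ratios.items():
    print(f"{name:16s} 1 : {ratio:,.0f} bp")

# The smallest value marks the most "P450 dense" genome.
densest = min(ratios, key=ratios.get)
print("Most P450-dense:", densest)
```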
Mutations were largely located not in the active site area itself, but instead in regions that are conformationally mobile, where entry and exit of substrate to the active site is facilitated
Thus, acquired resistance could be mediated by mutations that enhance flexibility and conformational rearrangement, leading to increased activity
To develop a model from the AID 899 HTS data to study compound/drug interactions with human CYP450.
a) Drug metabolism b) affecting CYP450
2) It should work against CYP450 of M. tuberculosis
Select active/inactive compounds against human CYP450 from PubChem HTS data
Generate model for lead compound screening
Screen the compounds via model
Select the inactives
Go for testing against mycobacterium CYP450 (model)
Select active lead compound
Go for in silico drug design
In vitro and in vivo studies
To be worked
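The selection logic in the workflow above (keep compounds that are inactive against human CYP450, then keep those predicted active against the mycobacterial CYP450) can be sketched as a two-stage lookup. The prediction dictionaries and compound IDs below are hypothetical; in practice the labels would come from the two trained models.

```python
def select_leads(human_pred: dict, mtb_pred: dict) -> list:
    """Return IDs that are inactive against human CYP450 and active against Mtb CYP450."""
    human_inactive = {cid for cid, label in human_pred.items() if label == "inactive"}
    return sorted(cid for cid in human_inactive if mtb_pred.get(cid) == "active")


# Made-up model outputs for four compounds.
HUMAN_CYP450 = {1: "active", 2: "inactive", 3: "inactive", 4: "inactive"}
MTB_CYP450 = {1: "active", 2: "active", 3: "inactive", 4: "active"}

leads = select_leads(HUMAN_CYP450, MTB_CYP450)
print(leads)
```

Compound 1 is dropped at the first stage even though it is active against the Mtb enzyme, because it also hits the human enzyme.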
Base Classifier and Cost Sensitive Classifier (CSC)
CSC: set the cost factor on false negatives
TP and FP rates increase
So FN is more important than FP
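Why raising the false-negative cost increases both the TP and FP rates can be seen with a minimum-expected-cost decision rule: a compound is predicted active whenever the expected cost of missing it exceeds the expected cost of a false alarm. This is a minimal sketch of that general rule, not Weka's CostSensitiveClassifier code; the probabilities, labels and cost values are invented for illustration.

```python
def predict_active(p_active: float, cost_fn: float, cost_fp: float = 1.0) -> bool:
    """Predict active when the expected cost of a miss exceeds that of a false alarm."""
    return p_active * cost_fn > (1.0 - p_active) * cost_fp


def confusion(probs: list, labels: list, cost_fn: float) -> dict:
    """Count TP, FP, TN and FN for a given false-negative cost."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for p, is_active in zip(probs, labels):
        predicted = predict_active(p, cost_fn)
        if predicted and is_active:
            counts["tp"] += 1
        elif predicted and not is_active:
            counts["fp"] += 1
        elif not predicted and is_active:
            counts["fn"] += 1
        else:
            counts["tn"] += 1
    return counts


# Made-up base-classifier probabilities and true labels (True = active).
PROBS = [0.9, 0.4, 0.2, 0.6, 0.3, 0.1]
LABELS = [True, True, True, False, False, False]

base = confusion(PROBS, LABELS, cost_fn=1.0)    # equal costs: threshold at 0.5
costly = confusion(PROBS, LABELS, cost_fn=5.0)  # a missed active costs 5x a false alarm
print("equal costs:", base)
print("FN cost = 5:", costly)
```

With the higher FN cost the effective probability threshold drops, so more compounds are called active: fewer actives are missed, at the price of more false positives.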
Communication – need alternative to SKYPE
Institutional limitations: ban on media streaming, social networks, chat, etc.
Tried two approaches for processing the AID to obtain train and test data sets.
Method 1: We downloaded the SDF file containing all tested compounds and the bioassay data files for the same compounds, then matched them in MS Excel. The matched data contained active, inactive, inconclusive and discrepancy entries. We selected only the active and inactive compounds and ran them through PowerMV to get a CSV file. After converting it to ARFF, we derived the train and test sets from it. We loaded the two files in Weka and used different algorithms to build the best model.
Method 2: We downloaded the active and inactive SDF files separately from the same PubChem page. After processing in PowerMV, both files were combined into one. The same steps as in Method 1 were then followed.
Problem: The number of final active and inactive compounds differs between the methods.
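One way to investigate such a mismatch is to compare the compound IDs produced by each method, class by class, and list which compounds appear in only one of them. The sketch below assumes each method yields a list of compound IDs per class; the IDs shown are invented and the function name is hypothetical.

```python
def compare_methods(method1: dict, method2: dict) -> dict:
    """For each class, report IDs unique to each method and the shared count."""
    report = {}
    for label in ("active", "inactive"):
        ids1, ids2 = set(method1[label]), set(method2[label])
        report[label] = {
            "only_method1": sorted(ids1 - ids2),
            "only_method2": sorted(ids2 - ids1),
            "shared": len(ids1 & ids2),
        }
    return report


# Made-up compound IDs for the two processing routes.
METHOD_1 = {"active": [11, 12, 13], "inactive": [21, 22, 23, 24]}
METHOD_2 = {"active": [11, 12, 13, 14], "inactive": [21, 22, 24]}

diff = compare_methods(METHOD_1, METHOD_2)
for label, details in diff.items():
    print(label, details)
```

The IDs that show up on only one side are the ones worth checking by hand against the original assay records.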
AID 899 is not curated. The problem has been reported to PubChem. Director will be looking at it.
From the preliminary investigation it is clear that AID 899 is not a properly curated dataset.
In Method 1, many classifiers were applied and the results are presented below.
In Method 2, many more classifiers can still be run and results generated.