260 likes | 421 Views
With the vast number of available molecules in both public and proprietary databases, there is a strong need for chemoinformatics tools that can efficiently rank compounds based on their likelihood of biological activity. This involves a range of methods, including structure-based approaches when X-ray structures are available and ligand-based methods such as similarity searching and 3D pharmacophore searching when only limited data is accessible. Advanced techniques, including machine learning and data fusion from multiple similarity measures, enhance the accuracy of identifying potential drug leads.
E N D
Virtual screening • The huge numbers of molecules available in public and in-house databases means that there is a requirement for tools to rank compounds in order of decreasing probability of activity • Range of methods available, varying in the sophistication and the amount of information that is available • Use of structure-based methods when an X-ray structure for the biological target is available • If this is not the case then must make use of information about (potential) ligands
Ligand-Based Methods • Similarity searching • Use when just a single bioactive reference structure is available • 3D pharmacophore searching • Use when it has been possible to carry out a pharmacophore mapping exercise • Machine learning • Use when a fair number of both actives and inactives have been identified
Similarity Searching: I • Use of a similarity measure to quantify the resemblance between an active target, or reference, structure and each database structure • The similar property principle means that high-ranked structures are likely to have similar activities to that of the target structure • Similarity searching hence provides an obvious way of following-up on an initial active
Similarity searching: II • Many ways in which the similarity between two molecules can be computed • A similarity measure has two components • A structure representation • A similarity coefficient to compare two representations • Most operational systems use similarity measures based on 2D fingerprints and the Tanimoto coefficient
Fragment bit-strings (fingerprints) • Originally developed for 2D substructure search • Similarity is based on the fragments common to two molecules • Widely used in both in-house and commercial chemoinformatics systems
Similarity coefficients • Tanimoto coefficient for binary bit strings • C bits set in common between Target and Database Structure • T bits set in Target • D bits set in Database structure • Values between zero (no bits in common) and unity (identical fingerprints) • Many other, related similarity coefficients exist: • Tversky, cosine, Euclidean distance …..
Combination of search techniques using data fusion: I • Tanimoto/fingerprint measures most common but many other types, e.g., • Computed physicochemical properties • 3D grid describing the molecular electrostatic potential • These reflect different molecular characteristics, so may enhance search performance by using more than one similarity measure • Data fusion or consensus scoring
Combination of search techniques using data fusion: II • Combination of different rankings of the same sets of molecules • Two basic approaches • Generate rankings from the same molecule using different similarity measures (similarity fusion) • Generate rankings from different molecules using the same similarity measure but different molecules (group fusion)
Reference 2 Reference 3 Groupfusion Reference 1
After truncation to required rank Reference 2 Reference 1 Reference 3
Fused Group Fusion Final truncated r = 1000 r = 2000 New Active Active found in earlier list
Group fusion rules • Useful performance increases, even with just 10 actives, as better coverage of structural space with multiple starting points • Improvement most obvious when searching for heterogeneous sets of active molecules • Best results obtained by • Fusing similarity coefficient values, rather than ranks • Re-ranking using the maximum of the similarity values associated with each molecule • Using the Tanimoto coefficient
Turbo similarity searching: I • Similar property principle: nearest neighbours are likely to exhibit the same activity as the reference structure • Group fusion improves the identification of active compounds • Potential for further enhancements by group fusion of rankings from the reference structure and from its assumed active nearest neighbours
Turbo similarity searching: II REFERENCE STRUCTURE RANKED LIST NEAREST NEIGHBOURS
Experimental details • MDL Drug Data report (MDDR) dataset of 11 activity classes and 102K structures • In all, 8294 actives in the 11 classes, with (turbo) similarity searches being carried out using each of these as the reference structure • ECFP_4 fingerprints/Tanimoto coefficient • MAX group fusion on similarity scores • Increasing numbers of nearest neighbours
Rationale for upper bound results • The true actives in the set of assumed actives yield significant enhancements in performance • The true inactives in the set of assumed actives have little effect on performance • Taken together, the two groups of compounds yield the observed net enhancement
Use of machine-learning methods for similarity searching: I • Turbo similarity searching uses group fusion to enhance conventional similarity searching • Machine learning is a more powerful virtual screening tool than similarity searching • But requires a training-set containing known actives and inactives • Given an active reference structure, a training-set can be generated from • Using the k nearest neighbours of the reference structure as the actives • Using k randomly chosen, low-similarity compounds as the inactives
Use of machine-learning methods for similarity searching: II
Results: I • Experiments with the MDDR dataset show that group fusion better than machine-learning methods when averaged over all of the classes • However, group fusion inferior for the most diverse datasets (as measured by the mean pair-wise similarities) • Additional searches using 10 MDDR activity classes that are as structurally diverse as possible
Conclusions: I • Fingerprint-based similarity searching using a known reference structure is long-established in chemoinformatics • When small numbers of actives are available, group fusion will enhance performance when the sought actives are structurally heterogeneous
Conclusions: II • Can also enhance conventional similarity search, even if there is just a single active, by assuming that the nearest neighbours are also active • Can be effected in two ways • Use of group fusion to combine similarity rankings (overall best approach) • Use of substructural analysis to compute fragment weights (best with highly heterogeneous sets of actives)
Soaluntukdipelajari • Tunjukkanperankhemoinformatikdalam QSAR • Data dananalisisdarikhemoinformatik yang banyakdigunakandalam docking molekul • Indekskemiripan (similarity index) banyakdigunakanuntukmendapatkaninformasitentangsenyawabaru yang memilikiaktivitasbiologistinggi. Jelaskansecarasingkatsistemkerjanya • Dalampenemuanobatbaru yang lebihpotensialdari yang sudahdikenal, banyakmemanfaatkankhemoinformatiks. Jelaskandenganbeberapacontoh. • ApaperbedaanpenggunaanKhemoinformatiksdalam QSAR, molecular docking dan similarity searching?