
Protein-Ligand Interaction Prediction: An Improved Chemogenomics Approach

Presentation Transcript


  1. Protein-Ligand Interaction Prediction: An Improved Chemogenomics Approach Laurent Jacob, Jean-Philippe Vert

  2. Introduction • Predicting interactions between small molecules and proteins • Vital to the drug discovery process • Key to understanding biological processes • 3 classes of drug targets • G-protein-coupled receptors (GPCRs) • Enzymes • Ion channels

  3. Classical Methods • Consider each target independently from other proteins • Ligand-based approach • Compare to known ligands of the target • Requires knowledge about other ligands of a given target • Structure-based or docking approaches • Use the 3D structure of the target to determine how well a ligand can bind • Requires the 3D structure of the target • Very time consuming • Cannot be applied if no ligand or 3D structure is known for a given target

  4. Chemogenomics • Chemical space: • set of all small molecules • Biological space: • set of all proteins or protein families • Mine the entire chemical space for interactions with the biological space • Knowledge of some ligands for a target can help to predict ligands for similar targets

  5. Chemogenomic Approaches • Ligand-based chemogenomics • Look at families or subfamilies of proteins • Model ligands at the level of a family • Target-based chemogenomics • Cluster receptors based on ligand binding site similarity • Use known ligands for each cluster to infer shared ligands • Target-ligand approach • Use binding information for targets to predict ligands for another target in a single step

  6. Previous Experiments • Bock and Gough (2005) • Describe ligand-receptor complexes by merging ligand and target descriptors • Use machine learning methods to predict if a ligand-receptor pair forms a complex • Erhan et al. (2006) • Merge a set of ligand descriptors with a set of receptor descriptors in a framework of neural networks and support vector machines • Offers a large flexibility in the choice of descriptors

  7. Proposed Method • Investigates different types of descriptors • Builds upon recent developments in kernel methods • In bio- and cheminformatics • Tests different methods for prediction of ligands • For 3 major classes of targets • Shows that the choice of representation greatly affects accuracy • New kernel based on hierarchies of receptors outperforms all other descriptors • Performs especially well for targets with few or no known ligands

  8. Learning Problem • Given n target/molecule pairs (t1,c1), …, (tn,cn) known to form complexes or not • Each pair is represented by a vector Φ(t,c) • Estimate a linear function • f(t,c) = w⊤Φ(t,c) • Whose sign is used to predict whether a chemical c can bind to a target t • The vector w is estimated from the training set

  9. Vector Representation • Represent a molecule c by a vector lig(c) ∈ R^dc • Encode physicochemical and structural properties • Model interactions between small molecules and a single target • Represent a protein t by a vector tar(t) ∈ R^dt • Capture properties of the protein's sequence or structure • Infer models that predict the structural or functional class of a protein • Need to represent a pair (c,t) in a single vector • Capture interactions between features of the molecule and protein that can be useful predictors • Multiply a descriptor of c with a descriptor of t

  10. Tensor Product • Φ(c,t) = lig(c) ⊗ tar(t) • Represents the set of all possible products of features of c and t • A dc × dt vector • The (i,j)-th entry is the product of the i-th entry of lig(c) by the j-th entry of tar(t) • Size may be prohibitively large • Use kernel methods
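The tensor-product pairing described on this slide can be sketched in a few lines of NumPy; the descriptor vectors here are made-up toy values, not real ligand or target features.

```python
import numpy as np

# Hypothetical small descriptors for illustration:
# a ligand with 3 features and a target with 2 features.
phi_lig = np.array([1.0, 0.0, 1.0])   # e.g. a tiny binary substructure fingerprint
phi_tar = np.array([0.5, 2.0])        # e.g. tiny sequence-derived features

# Tensor product: the (i, j)-th entry is phi_lig[i] * phi_tar[j],
# flattened into a single pair vector of size d_c * d_t.
phi_pair = np.outer(phi_lig, phi_tar).ravel()

print(phi_pair.shape)  # (6,) — i.e. 3 * 2
```

For realistic fingerprint and protein descriptor sizes this vector becomes prohibitively large, which is exactly why the next slide switches to kernels.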

  11. Kernel Trick • Can process large- or infinite-dimensional patterns if the inner product between any two patterns can be computed • Can factorize the inner product between two tensor product vectors • (lig(c) ⊗ tar(t))⊤(lig(c′) ⊗ tar(t′)) • = lig(c)⊤lig(c′) × tar(t)⊤tar(t′) • Obtain the inner product between two tensor products • K((c,t),(c′,t′)) = Kligand(c,c′) × Ktarget(t,t′) • Kligand(c,c′) = lig(c)⊤lig(c′) • Ktarget(t,t′) = tar(t)⊤tar(t′)
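The factorization on this slide (the inner product of two tensor-product vectors equals the product of the two small kernels) can be checked numerically on random vectors:

```python
import numpy as np

rng = np.random.default_rng(0)
lig_c, lig_c2 = rng.normal(size=5), rng.normal(size=5)   # two ligand descriptors
tar_t, tar_t2 = rng.normal(size=4), rng.normal(size=4)   # two target descriptors

# Inner product of the two explicit tensor-product vectors ...
lhs = np.outer(lig_c, tar_t).ravel() @ np.outer(lig_c2, tar_t2).ravel()
# ... equals the product of the ligand kernel and the target kernel,
# so the 5*4-dimensional pair vectors never need to be materialized.
rhs = (lig_c @ lig_c2) * (tar_t @ tar_t2)

assert np.isclose(lhs, rhs)
```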

  12. Kernels For Ligands • There have been impressive advances in the use of SVM in chemoinformatics • Kernels have been designed using: • Physicochemical properties of molecules • 2D or 3D fingerprints • Comparison of 2D and 3D structures of molecules • Detection of common substructures in 2D graphs • Encoding of various properties of 3D structures • Used in single-target virtual screening and prediction of pharmacokinetics and toxicity

  13. Tanimoto Kernel • Classical choice • State-of-the-art performance • Kligand(c,c′) = lig(c)⊤lig(c′) / [lig(c)⊤lig(c) + lig(c′)⊤lig(c′) − lig(c)⊤lig(c′)] • lig(c) is a binary vector • Each bit indicates whether the 2D structure of c contains a given linear path of length l or less as a subgraph • Choose l = 8 • Used the ChemCPP software to compute
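A minimal sketch of the Tanimoto formula from this slide, applied to hypothetical binary path fingerprints (the actual pipeline uses ChemCPP to enumerate the length-8-or-less paths):

```python
import numpy as np

def tanimoto_kernel(X, Y):
    """Tanimoto kernel between rows of binary fingerprint matrices X and Y."""
    X = X.astype(float)
    Y = Y.astype(float)
    inner = X @ Y.T                           # lig(c)^T lig(c')
    norm_x = (X * X).sum(axis=1)[:, None]     # lig(c)^T lig(c)
    norm_y = (Y * Y).sum(axis=1)[None, :]     # lig(c')^T lig(c')
    return inner / (norm_x + norm_y - inner)

# Three toy 4-bit fingerprints (real ones are much longer).
fp = np.array([[1, 1, 0, 1],
               [1, 0, 0, 1],
               [0, 1, 1, 0]])
K = tanimoto_kernel(fp, fp)
# Diagonal entries are 1 (a fingerprint compared to itself);
# off-diagonal entries lie in [0, 1].
```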

  14. Kernels For Targets • SVM and kernel methods are widely used in bioinformatics • Various kernels have been proposed based on: • Amino-acid sequence of proteins • 3D structures of proteins • Pattern of occurrences of proteins in multiple sequenced genomes • Used for various tasks related to structural or functional classification of proteins

  15. Dirac Kernel • KDirac(t,t′) • = 1 if t = t′ • = 0 otherwise • Represents different targets as orthonormal vectors • Orthogonality between two proteins t and t′ implies orthogonality between all pairs (c,t) and (c′,t′) for any two molecules c and c′ • Learning is performed independently for each target protein • Does not share any information about known ligands between different targets

  16. Multitask Kernel • Kmultitask(t,t′) = 1 + KDirac(t,t′) • Removes the orthogonality • Combines target-specific properties of the ligands and general properties across all targets • Allows sharing of information during learning • Preserves the specificities of the ligands for each target • Offers little control over how strongly interactions known for other targets contribute
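The Dirac and multitask kernels from these two slides are simple enough to sketch directly; the target identifiers below are hypothetical placeholders:

```python
import numpy as np

def dirac_kernel(targets_a, targets_b):
    """K_Dirac(t, t') = 1 if the targets are identical, else 0."""
    a = np.asarray(targets_a)[:, None]
    b = np.asarray(targets_b)[None, :]
    return (a == b).astype(float)

def multitask_kernel(targets_a, targets_b):
    """K_multitask(t, t') = 1 + K_Dirac(t, t')."""
    return 1.0 + dirac_kernel(targets_a, targets_b)

targets = ["GPCR_A1", "GPCR_A1", "GPCR_B2"]  # hypothetical target IDs
K = multitask_kernel(targets, targets)
# Identical targets score 2; distinct targets still share a baseline of 1,
# which is what lets information flow between targets during learning.
```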

  17. Mismatch and Local Alignment Kernels • Empirical observations suggest that molecules that bind to t are likely to bind to t′ only if the two targets are similar in terms of structure or evolutionary history • Can be detected by comparing protein sequences • Mismatch kernel: • compares short sequences of amino acids up to some number of mismatches • Choose 3-mers with a maximum of one mismatch • Local alignment kernel: • uses the alignment score between the primary sequences of proteins to measure their similarity

  18. Hierarchy Kernel • Khierarchy(t,t′) = ⟨h(t), h(t′)⟩ • h(t) has a feature for each node in the hierarchy • Set to 1 if the node is part of t's hierarchy • Set to 0 otherwise • Plus one feature that is constantly set to 1 • Uses data from the target itself and, with smaller weight, data from other targets • Performed the best in the experiments
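Because h(t) is a 0/1 indicator over hierarchy nodes plus one constant feature, the inner product reduces to counting shared nodes plus one. A sketch with hypothetical EC-number-style paths:

```python
def hierarchy_kernel(path_t, path_t2):
    """Inner product of hierarchy indicator vectors.

    Each target is described by the list of nodes on its path in the
    hierarchy (e.g. EC-number prefixes). h(t) has a 1 for every node
    on the path plus one constant feature, so the kernel equals the
    number of shared nodes plus 1.
    """
    return len(set(path_t) & set(path_t2)) + 1

# Hypothetical EC-number-style paths for two enzymes.
ec_a = ["1", "1.2", "1.2.2", "1.2.2.1"]
ec_b = ["1", "1.2", "1.2.4", "1.2.4.1"]
print(hierarchy_kernel(ec_a, ec_b))  # shares "1" and "1.2" -> 3
```

Targets that sit close together in the hierarchy share more nodes and therefore get a larger kernel value, which is how known ligands of nearby targets get re-used.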

  19. Enzyme Hierarchy • Enzyme Commission (EC) numbers • International Union of Biochemistry and Molecular Biology (1992) • Classifies enzymes by the chemical reaction they catalyze • Four-level hierarchy • For example, • EC 1 includes oxidoreductases • EC 1.2 includes oxidoreductases that act on the aldehyde or oxo group of donors • EC 1.2.2 has NAD+ or NADP+ as an acceptor • EC 1.2.2.1 catalyzes the oxidation of formate to bicarbonate • Enzymes that are close in the hierarchy should have similar ligands

  20. GPCR Hierarchy • GPCRs are grouped into four classes • Group A: rhodopsin family • Group B: secretin family • Group C: metabotropic family • Group D: regroups more diverse receptors • The KEGG database subdivides the rhodopsin family into three subgroups • Amine receptors • Peptide receptors • Other receptors • And adds a second level of classification based on the type of ligands or known subdivisions

  21. Ion Channel Hierarchy • The KEGG database divides ion channels into 8 classes • Cys-loop superfamily • Glutamate-gated cation channels • Epithelial and related Na+ channels • Voltage-gated cation channels • Related to voltage-gated cation channels • Related to inward rectifier K+ channels • Chloride channels • Related to ATPase-linked transporters • Each class is further subdivided • By, for example, the type of ligands or type of ion passing through the channel

  22. Data Extraction • Extracted compound interaction data from KEGG BRITE database • Known compounds for each target • Type of interaction • Enzymes: inhibitor, cofactor, effector • GPCR: antagonist, full/partial agonist • Ion Channels: pore blocker, positive/negative allosteric modulator, agonist, antagonist • Did not take into account • Orthologs of targets • Enzymes with same EC number • Compounds with no molecular descriptor • Primarily peptides • Targets with no known compounds

  23. Data Points • Generated as many negative ligand-target pairs as known ligand-target pairs • Randomly chose ligands • May produce false negatives • Experimentally confirmed negative pairs would be needed • 2436 data points for enzymes • 675 enzymes, 524 compounds • 798 data points for GPCRs • 100 receptors, 219 compounds • 2230 data points for ion channels • 114 channels, 462 compounds

  24. Known Ligands Distribution of the number of known ligands per target for enzymes, GPCR, and ion channel datasets • Each bar indicates the proportion of targets for which a given number of training points are available • Few compounds are known for most targets Jacob, L. et al. Bioinformatics 2008 24:2149-2156; doi:10.1093/bioinformatics/btn409

  25. Experiments • Experiment 1 • Trained an SVM classifier on • all points involving other targets of the family • plus a fraction of points involving t • Tested on the remaining data points for t • Assesses the accuracy for a given target when using ligands for other targets for training • Experiment 2 • Trained an SVM classifier using only interactions that did not involve t • Tested on data points that did involve t • Simulated making predictions for targets with no known ligands • Measured performance using the area under the ROC curve (AUC)
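The experimental setup (an SVM trained on the product of a ligand kernel and a target kernel, scored by AUC) can be sketched with scikit-learn's precomputed-kernel interface; the descriptors and labels below are synthetic stand-ins, not the KEGG data:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n = 80
X_lig = rng.normal(size=(n, 10))   # stand-in ligand descriptors
X_tar = rng.normal(size=(n, 6))    # stand-in target descriptors
# Synthetic labels driven by a ligand-target feature interaction.
y = np.where(X_lig[:, 0] * X_tar[:, 0] + 0.1 * rng.normal(size=n) > 0, 1, -1)

# Pair kernel = elementwise product of the ligand and target Gram matrices,
# i.e. K((c,t),(c',t')) = Kligand(c,c') * Ktarget(t,t').
K = (X_lig @ X_lig.T) * (X_tar @ X_tar.T)

# Hold out the last 20 pairs, mimicking "test on points involving t".
train, test = np.arange(60), np.arange(60, n)
clf = SVC(kernel="precomputed").fit(K[np.ix_(train, train)], y[train])
scores = clf.decision_function(K[np.ix_(test, train)])
auc = roc_auc_score(y[test], scores)
```

Swapping in a Dirac, multitask, or hierarchy target Gram matrix only changes the second factor of `K`, which is what makes comparing target kernels straightforward.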

  26. Results: Experiment 1 Mean AUC on each dataset with various target kernels • Hierarchy kernel shows significant improvements • Sharing information for known ligands of different targets • Incorporating prior information into the kernels

  27. Gram Matrices Target kernel Gram matrices (Ktar) for ion channels with multitask, hierarchy, and local alignment kernels • Hierarchy kernel adds structure information • Local alignment kernel retains some substructures • For GPCR and enzymes, almost no structure is found by the sequence kernels Jacob, L. et al. Bioinformatics 2008 24:2149-2156; doi:10.1093/bioinformatics/btn409

  28. Relative Improvement Relative improvement of the hierarchy kernel against the Dirac kernel as a function of the number of known ligands for enzymes, GPCR, and ion channel datasets • Strong improvement when few ligands are known • Decreases when enough training points become available • After a certain point, performance is impaired Jacob, L. et al. Bioinformatics 2008 24:2149-2156; doi:10.1093/bioinformatics/btn409

  29. Results: Experiment 2 Mean AUC on each dataset with various target kernels • Dirac kernel showed random behavior • Learning with no training data • Hierarchy kernel still gives reasonable results • 1.7%, 5.1%, 7.2% loss for enzymes, GPCR, and ion channels compared to the first experiment

  30. References • Rognan D: Chemogenomic approaches to rational drug design. Br J Pharmacol 2007, 152:38-52. • Kanehisa M, Goto S, Kawashima S, Nakaya A: The KEGG databases at GenomeNet. Nucleic Acids Res 2002, 30:42-46. • Jacob L, Vert JP: Protein-ligand interaction prediction: an improved chemogenomics approach. Bioinformatics 2008, 24:2149-2156. • Erhan D, L'Heureux PJ, Yue SY, Bengio Y: Collaborative filtering on a family of biological targets. J Chem Inf Model 2006, 46:626-635. • Bock JR, Gough DA: Virtual screen for ligands of orphan G protein-coupled receptors. J Chem Inf Model 2005, 45:1402-1414.
