RECCR Organization and Capabilities Curt Breneman July 17, 2006

Presentation Transcript


  1. RECCR Organization and Capabilities Curt Breneman July 17, 2006

  2. Predictive Cheminformatics: Models and Statistical Methods “If your experiment needs statistics, you ought to have done a better experiment” - Ernest Rutherford “But what if you haven’t done the experiment yet?”

  3. RECCR Center Vision
     • Simulation-based Protein Affinity Descriptors
     • Non-linear Model Building and Validation Methods
     • Creation of Generic Data Mining Tools
     • Protein-DNA Binding and Gene Regulation
     • Bioinformatics
     • Cheminformatics
     • Alignment-free Molecular Property Descriptors
     • Protein Kinetic Stability Prediction
     • Protein Chromatography Modeling
     • Drug Design and QSAR

  4. RECCR Organization

  5. RPI RECCR – Goals
     • RECCR is committed to testing ways of integrating data generators, method developers and data mining groups to provide new tools and workflows
     • RECCR will develop and disseminate new tools for characterizing the interface between chemistry and biology
     • RECCR will create and study protein similarity and regional similarity methods for quantifying protein binding interactions
     • RECCR will create a new type of simulation-based protein solvation descriptor
     • RECCR will validate traditional and novel modeling approaches to determine their applicability for specific kinds of datasets and descriptors
     • RECCR will integrate its software tools into workflow modules using Pipeline Pilot
     • RECCR will disseminate its descriptor technology, modeling methods and results to the world at large through web-based utilities
     • RECCR will offer large-scale cheminformatics computing capability onsite

  13. Evolution of Synergies

  22. RECCR Emphasis #1: Descriptors
     • Descriptor classes: Structural Descriptors, Physicochemical Descriptors, Topological Descriptors, Geometrical Descriptors
     • Example sequence input: AAACCTCATAGGAAGCATACCAGGAATTACATCA…
     • (Diagram labels: Molecular Structures; Descriptors; Model; Activity)

  23. Surface Property Distribution: RECON/TAE Descriptors • Surface histograms represent electronic property distributions to provide data for descriptors. (Figure: PIP (local ionization potential) surface property for a member of the Lombardo blood-brain barrier dataset.)
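
The idea of turning a surface property distribution into histogram descriptors can be sketched as follows. This is only an illustration, not the RECON/TAE implementation: the function name `surface_histogram`, the area weighting, the bin count and the sample values are all assumptions introduced here.

```python
import numpy as np

def surface_histogram(values, areas, bins, value_range):
    """Bin a surface property into a fixed-length descriptor vector.

    values: property value at each surface point (hypothetical input)
    areas:  surface area associated with each point (assumed weighting)
    Returns the fraction of total surface area falling in each bin.
    """
    values = np.asarray(values, dtype=float)
    areas = np.asarray(areas, dtype=float)
    hist, edges = np.histogram(values, bins=bins, range=value_range, weights=areas)
    return hist / areas.sum(), edges

# Three made-up surface points with a property scaled to [0, 1].
descriptor, edges = surface_histogram(
    values=[0.1, 0.2, 0.8], areas=[1.0, 1.0, 2.0], bins=2, value_range=(0.0, 1.0)
)
print(descriptor)
```

Because the bins are fixed in advance, molecules of different sizes map to vectors of the same length, which is what makes such histograms usable as model inputs.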

  24. Wavelet Coefficient Descriptors (WCD) • Wavelet decomposition: creates a set of coefficients that represent a waveform; small coefficients may be omitted to compress data. • Wavelet surface property reconstruction: 1024 raw wavelet coefficients capture the PIP distribution on the molecular surface; 16 coefficients from the S7 and D7 portions of the WCD vector represent surface property densities with >95% accuracy.
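
A minimal sketch of the compression idea on this slide, assuming a Haar wavelet (the slide does not name the wavelet family) and a made-up 1024-point distribution. Seven levels of decomposition on 1024 points leave 8 smooth (S7) and 8 detail (D7) coefficients, which matches the count of 16 quoted above; the accuracy figure on the slide is not reproduced here.

```python
import numpy as np

def haar_decompose(signal, levels):
    """Orthonormal Haar transform: returns (smooth, [d1, d2, ..., d_levels])."""
    smooth = np.asarray(signal, dtype=float)
    details = []
    for _ in range(levels):
        pairs = smooth.reshape(-1, 2)          # length must stay even
        details.append((pairs[:, 0] - pairs[:, 1]) / np.sqrt(2.0))
        smooth = (pairs[:, 0] + pairs[:, 1]) / np.sqrt(2.0)
    return smooth, details

def haar_reconstruct(smooth, details):
    """Invert haar_decompose, coarsest level first."""
    smooth = np.asarray(smooth, dtype=float)
    for d in reversed(details):
        out = np.empty(2 * smooth.size)
        out[0::2] = (smooth + d) / np.sqrt(2.0)
        out[1::2] = (smooth - d) / np.sqrt(2.0)
        smooth = out
    return smooth

# Hypothetical smooth property distribution sampled at 1024 points.
x = np.linspace(0.0, 1.0, 1024)
distribution = np.exp(-((x - 0.4) / 0.15) ** 2) + 0.5 * np.exp(-((x - 0.75) / 0.08) ** 2)

s7, details = haar_decompose(distribution, levels=7)
wcd16 = np.concatenate([s7, details[-1]])      # 8 S7 + 8 D7 coefficients

# Reconstruct from the 16 kept coefficients only (finer details set to zero).
kept = [np.zeros_like(d) for d in details[:-1]] + [details[-1]]
approx = haar_reconstruct(s7, kept)
rel_error = np.linalg.norm(distribution - approx) / np.linalg.norm(distribution)
print(wcd16.shape, rel_error)
```

A smoother wavelet than Haar would likely reconstruct a smooth distribution more faithfully from the same number of coefficients; Haar is used here only because it fits in a few lines.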

  25. RAD: RECON Autocorrelation Descriptors • Patterned after Whole Surface Autocorrelation Descriptors [1] • Uses integrated TAE surfaces and the Gasteiger autocorrelation formula [2]: [equation image not captured] • Function binned by distance between atoms x and y. • References: 1. Breneman, C.M., et al., New developments in PEST shape/property hybrid descriptors. J. Comput.-Aided Mol. Des., 17, 231-240, 2003. 2. Wagener, M., J. Sadowski, and J. Gasteiger, J. Am. Chem. Soc., 117, 7769-7775, 1995.
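
Since the formula itself is missing from the transcript, the sketch below shows only the general shape of a distance-binned autocorrelation: products of a property on atoms x and y are summed over all pairs whose separation falls in a given distance bin. The function name, the optional normalization by pair count, and the toy coordinates are assumptions, not the RAD code.

```python
import numpy as np

def autocorrelation_descriptor(coords, props, bin_edges, normalize=True):
    """Sum p_x * p_y over atom pairs, binned by inter-atomic distance.

    coords:    (n, 3) atomic coordinates
    props:     (n,) per-atom property values (e.g. an integrated surface property)
    bin_edges: increasing distance bin edges
    normalize: divide each bin by its number of pairs (an assumed convention)
    """
    coords = np.asarray(coords, dtype=float)
    props = np.asarray(props, dtype=float)
    i, j = np.triu_indices(len(props), k=1)      # each unordered pair once
    dist = np.linalg.norm(coords[i] - coords[j], axis=1)
    sums, _ = np.histogram(dist, bins=bin_edges, weights=props[i] * props[j])
    if normalize:
        counts, _ = np.histogram(dist, bins=bin_edges)
        sums = np.divide(sums, counts, out=np.zeros_like(sums), where=counts > 0)
    return sums

# Three hypothetical atoms on a line, 1 unit apart, with properties 1, 2, 3.
coords = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [2.0, 0.0, 0.0]]
props = [1.0, 2.0, 3.0]
edges = [0.5, 1.5, 2.5]
rad_raw = autocorrelation_descriptor(coords, props, edges, normalize=False)
rad_norm = autocorrelation_descriptor(coords, props, edges, normalize=True)
print(rad_raw, rad_norm)
```

The result depends only on inter-atomic distances, so it is unchanged by rotating or translating the molecule.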

  26. PEST Hybrid Shape/Property Descriptors • Surface properties and shape information are encoded into alignment-free descriptors. (Figure: PIP vs. segment length)

  27. PPEST Protein Shape/Property Descriptors

  28. PROLICSS: Ligand/Protein Interaction Surfaces • Input: protein-ligand complex geometry • Generate interaction surfaces • Interface surface • Encoding • Extraction • Analysis (Figure, 6CPA: protein surface; ligand surface; ligand surface eclipsed by protein surface)

  29. “Dixel” DNA Descriptors • A “basis set” of all possible nucleotide base pairs with all possible neighbors results in a set of base pair “triplets”. • Ab initio properties of each base pair and its two flanking base pairs (end capped) are computed. • The central base pair is encoded and stored as a RECON “Dixel” object.
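
The triplet bookkeeping described on this slide can be illustrated with a short sketch. It assumes Watson-Crick pairing, so that one strand's letter identifies each base pair, and it does not merge triplets that are equivalent when read from the opposite strand; whether the Dixel basis set does so is not stated. The function names are hypothetical.

```python
from itertools import product

BASES = "ACGT"

def triplet_basis():
    """Every central base with every possible pair of flanking neighbors."""
    return ["".join(t) for t in product(BASES, repeat=3)]

def encode_triplets(sequence):
    """Map each interior position to the triplet centered on it.

    The two end positions are skipped because they lack one neighbor.
    """
    return [(i, sequence[i - 1:i + 2]) for i in range(1, len(sequence) - 1)]

basis = triplet_basis()
encoded = encode_triplets("AAACCT")   # first six bases of the sequence on slide 22
print(len(basis), encoded)
```

In a scheme like this, precomputed properties would be looked up per triplet and assigned to the central base pair; that lookup table is outside the scope of this sketch.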

  30. RECCR Emphasis #2: Model Validation (Diagram labels: Dataset; Training set; Test set; Bootstrap sample k; Predictive Model; Training; Validation; Learning Model; Tuning / Prediction; Prediction; Y-scrambling model validation)
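
Y-scrambling, named in the diagram above, can be sketched as follows: the response values are randomly permuted, the model is refit, and the fit quality on scrambled data is compared with the fit on the real data. The sketch uses ordinary least squares and training-set R² purely for brevity; the models, metrics and data used by RECCR are not shown on the slide, and the synthetic dataset here is made up.

```python
import numpy as np

def fit_r2(X, y):
    """R^2 of an ordinary least-squares fit with intercept (stand-in model)."""
    A = np.column_stack([X, np.ones(len(y))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)

def y_scramble(X, y, n_rounds=100, seed=0):
    """Return R^2 on the real response and on n_rounds permuted responses."""
    rng = np.random.default_rng(seed)
    true_r2 = fit_r2(X, y)
    scrambled = np.array([fit_r2(X, rng.permutation(y)) for _ in range(n_rounds)])
    return true_r2, scrambled

# Synthetic data: 60 samples, 3 descriptors, a real linear relationship plus noise.
rng = np.random.default_rng(1)
X = rng.normal(size=(60, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=60)

true_r2, scrambled_r2 = y_scramble(X, y)
print(true_r2, scrambled_r2.mean(), scrambled_r2.max())
```

If models fit to scrambled responses score nearly as well as the real one, the apparent performance is probably a chance correlation rather than a genuine structure-activity relationship.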

  31. RECCR Emphasis #3: Outreach • Public access to online and downloadable descriptor generation and modeling tools • Provide access to new RPI CCNI/IBM (70 Teraflop) supercluster for near real-time descriptor and predictive modeling calculations • Pre-compute 2D and 3D descriptor sets for a steadily increasing number of PubChem datasets • Participate in modeling competitions (CoEPrA) to demonstrate performance of new methods • New IOS Press Journal “Cheminformatics” • Reviews of current topics • EAB input

  32. New “Cheminformatics” Review Journal • Current “Cheminformatics” Advisory Board • Dan Ortwine – Pfizer Global Research, Ann Arbor, MI • Pat Walters – Vertex Pharmaceuticals, Cambridge, MA • Prof. Dr. Jack A.M. Leunissen – Wageningen University, NL • Frank Leusen – University of Bradford, UK • Mark Embrechts – RPI, Troy, NY • Barry Lavine – Oklahoma State University, OK • Dimitris Agrafiotis – Johnson & Johnson R&D, Exton, PA • Terry Stouch – Lexicon Pharmaceuticals • Yvonne Martin – Abbott Laboratories, Abbott Park, IL • Leah Frye – Schrödinger, Portland, OR • Bob Clark – Tripos Associates, Saint Louis, MO • David Spellmeyer – IBM Almaden Research Center, San Jose, CA • Peter Jurs – Penn State University

  33. RECCR Modules • Data Generators • Model Builders / Method Developers • Model Validators / Utilizers • Targeted Task Models for Cheminformatics Process Development (Bennett) • Mining Complex Patterns (Zaki) • Causal Chemometrics Modeling with Kernel Partial Least Squares and Domain Knowledge Filters (Embrechts) • Elucidation of the Structural Basis of Protein Kinetic Stability (Colon) • Theoretical Characterization of kinetically stable proteins (Garcia) • Chemoselective Displacer Synthesis (Moore) • Cyclazocine QSAR and Synthesis. (Wentland) • Protein Bioseparations (Cramer) • Beyond ATCG: “Dixel” representations of DNA-protein interactions (Breneman) • Protein Dissimilarity Analysis using Shape/Property Descriptors (Breneman) • Molecular Simulation-Based Descriptors (Garde) • Potential of Mean Force Approach for describing Biomolecular Hydration (Garcia)

  34. Current RECCR Software • RECON 5.8 + Analyze w/Outlier detection • RAD • Fast KPLS test set mode with low memory footprint • RECON for MOE • Drop-in interactive or batch RECON 5.8 for MOE 2005+ • RECON 2001 for protein characterization • Property moment descriptors (w/Cramer) • Binding site/ligand scoring using Universal Descriptor Space (w/Tropsha) • CoLIBRi Method (w/Tropsha) • TAE/DIXEL • DNA Characterization and bioinformatics (w/Lawrence) • PEST 4.0 (Compatible with Gaussian98/03 or Jaguar 5.x) • PAD (PEST Autocorrelation Descriptors) • WSAD (Whole-surface Autocorrelation Descriptors) • WCD Wavelet Coefficient Descriptors • PPEST • PROLICSS

  35. Dissemination: RECON/TAE Descriptors in MOE (for CoLiBRi)

  36. Protein Structure “Cleaning” and Data Preparation Tool
     • Select the PDB structure from the Protein Data Bank with the highest resolution and fewest missing amino acids (example: 1POC, bee-venom phospholipase A2)
     • All PDBs have missing atoms, and many have missing amino acids
     • Compute solvent accessible surface
     • Remove water and heteroatoms
     • Use the clean PDB structure and protein sequence to form a homology model of the complete protein structure
     • Encode with pH sensitive properties
     • (Diagram labels: MOE descriptors; pH independent; Surface + charge descriptors; pH dependent properties)
     • Soon available on the RECCR website: http://reccr.chem.rpi.edu
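
The "remove water and heteroatoms" step can be sketched with plain text filtering of PDB records. This is not the RECCR tool: the function name is hypothetical, the example lines are invented rather than taken from 1POC, and dropping every HETATM record is a blunt choice that also discards cofactors and modified residues.

```python
WATER_NAMES = {"HOH", "WAT"}

def strip_water_and_heteroatoms(pdb_lines):
    """Keep all records except HETATM lines and waters written as ATOM lines."""
    kept = []
    for line in pdb_lines:
        record = line[:6].strip()              # PDB record name, columns 1-6
        if record == "HETATM":
            continue
        # Residue name occupies columns 18-20 of ATOM/HETATM records.
        if record == "ATOM" and line[17:20].strip() in WATER_NAMES:
            continue
        kept.append(line)
    return kept

# Invented example records in fixed-width PDB layout.
example = [
    "ATOM      1  N   ALA A   1      11.104   6.134  -6.504  1.00  0.00           N",
    "ATOM      2  O   HOH A 301       1.000   2.000   3.000  1.00  0.00           O",
    "HETATM 1001  O   HOH A 201       4.000   5.000   6.000  1.00  0.00           O",
    "END",
]
cleaned = strip_water_and_heteroatoms(example)
print("\n".join(cleaned))
```

Missing atoms and residues are not repaired by a filter like this; the slide assigns that job to the homology modeling step.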

  37. Dissemination: Protein RECON tool

  38. Capabilities Summary • Novel molecular descriptor tools • KPLS, SVM and specialized kernel-based tools • Model evaluation and visualization • Protein binding site solvation descriptor methods • PROLICSS ligand scoring tool • DNA / RNA informatics tools • Local data generation, synthesis and screening capabilities • Collaborations with CECCR and Cleveland Clinic • “Cheminformatics” journal outreach

  39. ACKNOWLEDGMENTS • Current and Former members of the DDASSL group • Breneman Research Group (RPI Chemistry) • N. Sukumar • M. Sundling • Min Li • Long Han • Jed Zaretski • Theresa Hepburn • Mike Krein • Steve Mulick • Shiina Akasaka • Hongmei Zhang • C. Whitehead (Pfizer Global Research) • L. Shen (BNPI) • L. Lockwood (Syracuse Research Corporation) • M. Song (Synta Pharmaceuticals) • D. Zhuang (Simulations Plus) • W. Katt (Yale University chemistry graduate program) • Q. Luo (J & J) • Embrechts Research Group (RPI DSES) • Tropsha Research Group (UNC Chapel Hill) • Bennett Research Group (RPI Mathematics) • Collaborators: • Tropsha Group (UNC Chapel Hill - CECCR) • Cramer Research Group (RPI Chemical Engineering) • Funding • NIH (GM047372-07) • NIH (1P20HG003899-01) • NSF (BES-0214183, BES-0079436, IIS-9979860) • GE Corporate R&D Center • Millennium Pharmaceuticals • Concurrent Pharmaceuticals • Pfizer Pharmaceuticals • ICAGEN Pharmaceuticals • Eastman Kodak Company • Chemical Computing Group (CCG)

  40. Reserve Slides

  41. Specific AIMS • Aim 1) To form a critical mass of researchers with complementary areas of expertise in chemistry, data mining, bioinformatics, computer science, machine learning, descriptor generation, model building and model validation for the purpose of building a collaborative organization to seed the development of new interdisciplinary methods and hybrid applications. • Aim 2) To identify existing limitations within current data mining and predictive property modeling methods for a wide variety of contemporary cheminformatics and QSPR problems, and to identify and follow promising leads for assessing and/or extending the applicability of those methods. • Aim 3) To create a generic toolkit for evaluating the applicability of a particular chemical property prediction methodology for a given class of problem, and to apply these tools to the molecular design and bioinformatics problems illustrated in the Application Modules presented in this proposal. • Aim 4) Use workshops and Center retreats to identify key interdisciplinary approaches for pilot studies, and to direct resources to advance those project modules. • Aim 5) Disseminate results and algorithms to the chemical community through traditional means, and also by setting up web-based server access to ECCR Center computer resources and software to make it available for use on real-world datasets. • Aim 6) Gather preliminary Cheminformatics results, and develop an agile, effective organizational structure for the ECCR that will support the preparation of a competitive P50 proposal for a Cheminformatics Research Center within two years.

  43. Slides with Copy Only: Sample • As the Molecular Libraries initiative was developed by the NIH, there was a strong sense that readily available computational chemistry tools were insufficient. • There was also concern about data mining methods to extract value out of PubChem. • Extramural grants were awarded to develop important centers for cheminformatics research. • P20 programs: exploratory centers with limited funds for 2 years, with the intent that awardees would develop plans, collaborative teams and early results in order to apply for the larger NIH P50 solicitation (the next stage of the program). • P50s will be 5-year awards. Application to the P50 program is open to everyone as long as they have some preliminary data and a strong collaborative team.
