Overview of Chemical Informatics and Cyberinfrastructure Collaboratory

Overview of Chemical Informatics and Cyberinfrastructure Collaboratory. March 15 2007 Geoffrey Fox Computer Science, Informatics, Physics Pervasive Technology Laboratories Indiana University Bloomington IN 47401 [email protected] http://www.infomall.org http://www.chembiogrid.org.

Overview of Chemical Informatics and Cyberinfrastructure Collaboratory

March 15 2007

Geoffrey Fox

Computer Science, Informatics, Physics

Pervasive Technology Laboratories

Indiana University Bloomington IN 47401

[email protected]



Indiana University Summary

Indiana University is focusing on two major areas:

  • Creating a comprehensive, easily accessible infrastructure for chemoinformatics tools and data sources, linked with PubChem and made available as web services, and partnering with screening centers and other users to demonstrate how this infrastructure can be usefully applied

    • Infrastructure can include any tools, not just ours (commercial/open source, chemoinformatics, bioinformatics, and so on)

    • New, custom applications can be built quickly using existing services in a similar way to Google Maps and other “web 2.0” resources

  • Being a central hub of chemoinformatics education, including offering distance courses on chemoinformatics theory and techniques, practical workshops on using chemoinformatics resources, and freely available web-based educational resources

    • We currently offer a Ph.D., an M.S., and a graduate certificate (distance) in chemical informatics

    • Distance education program allows you to “pick and choose” courses to meet educational needs: certificate is awarded on completion of four courses

CICC Senior Personnel

  • Peter T. Cherbas

  • Mehmet M. Dalkilic

  • Charles H. Davis

  • A. Keith Dunker

  • Kelsey M. Forsythe

  • John C. Huffman

  • Malika Mahoui

  • Daniel J. Mindiola

  • Santiago D. Schnell

  • William Scott

  • Craig A. Stewart

  • David R. Williams

  • Geoffrey C. Fox

  • Mu-Hyun (Mookie) Baik

  • Dennis B. Gannon

  • Kevin E. Gilbert

  • Rajarshi Guha

  • Marlon Pierce

  • Beth A. Plale

  • Gary D. Wiggins

  • David J. Wild

  • Yuqing (Melanie) Wu

From Biology, Chemistry, Computer Science, and Informatics at IU Bloomington and IUPUI (Indianapolis)

Chemical Informatics and Cyberinfrastructure Collaboratory

Funded by the National Institutes of Health




CICC Combines Grid Computing with Chemical Informatics

Large Scale Computing Challenges

Science and Cyberinfrastructure

CICC is an NIH funded project to support chemical informatics needs of High Throughput Cancer Screening Centers. The NIH is creating a data deluge of publicly available data on potential new drugs.

Chemical Informatics is a non-traditional area of high performance computing, but it offers many new, challenging problems to investigate.

OSCAR-mined molecular signatures can be clustered, filtered for toxicity, and docked onto larger proteins. These are classic “pleasingly parallel” tasks. Top-ranking docked molecules can be further examined for drug potential.

Chemical informatics text analysis programs can process hundreds of thousands of abstracts of online journal articles to extract chemical signatures of potential drugs.

Big Red (and the TeraGrid) will also enable us to perform time-consuming, multi-step quantum chemistry calculations on all of PubMed. Results go back to public databases that are freely accessible to the scientific community.

  • CICC supports the NIH mission by combining state-of-the-art chemical informatics techniques with

    • World class high performance computing

    • National-scale computing resources (TeraGrid)

    • Internet-standard web services

    • International activities for service orchestration

    • Open distributed computing infrastructure for scientists world wide













Indiana University Department of Chemistry, School of Informatics, and Pervasive Technology Laboratories

CICC Web Service Infrastructure

OSCAR Document Analysis

InChI Generation/Search

Computational Chemistry (Gamess, Jaguar, etc.)

Quantum Chemistry

Grid Services

Service Registry

Job Submission and Management

Local Clusters

IU Big Red

TeraGrid, Open Science Grid

Portal Services

RSS Feeds

User Profiles

Collaboration as in Sakai

Web Service Locations

Cambridge University

  • InChI generation / search


  • OpenBabel

Indiana University

  • Clustering

  • VOTables

  • OSCAR3

  • Toxicity classification

  • Database services

  • Statistics services

VCC Laboratory

  • ALogPS


  • CSLS

University of Cologne

  • NMRShiftDB

Where Does The Functionality Come From?

University of Michigan

  • PkCell

Cambridge University

  • InChI generation / search



  • BCI fingerprints

  • DivKMeans

gNova Consulting


  • PubChem

  • PubMed


  • Cheminformatics

European Chemicals Bureau

  • ToxTree toxicity predictions


  • Docking

R Foundation

  • R package

Indiana University

  • VOTables

  • NCI DTP predictions

  • Database services

CICC Infrastructure Vision

  • Drug discovery and other academic chemistry and pharmacology research will be aided by powerful modern information technology: ChemBioGrid, set up as a distributed cyberinfrastructure in the eScience model

  • ChemBioGrid will provide portals (user interfaces) to distributed databases, results of high throughput screening instruments, results of computational chemical simulations and other analyses

  • ChemBioGrid will provide services to manipulate this data and combine in workflows; it will have convenient ways to submit and manage multiple jobs

  • ChemBioGrid will include access to PubChem, PubMed, PubMed Central, the Internet and its derivatives like Microsoft Academic Live and Google Scholar

  • The services include open-source software like CDK, commercial code from vendors such as BCI, OpenEye, Gaussian and Google, and any user-contributed programs

  • ChemBioGrid will define open interfaces to use for a particular type of service allowing plug and play choice between different implementations

Cheminformatics Education at IU

  • Linked to bioinformatics in Indiana University’s School of Informatics

    • School of Informatics degree programs BS, MS, PhD

  • Programs offered at both the Indianapolis (IUPUI) and Bloomington (IUB) campuses

    • Bioinformatics MS and track on PhD

    • Chemical Informatics MS and track on PhD

    • Informatics Undergraduates can choose a chemistry cognate (change to Life Sciences)

  • PhD in Informatics started in August 2005 and offers tracks in

    • bioinformatics; chemical informatics; health informatics; human-computer interaction design; social and organizational informatics; more to come!

  • Good employer interest but modest student understanding of value of Cheminformatics degree

  • 3 core courses in Cheminformatics plus seminar/independent studies

  • Significant interest in distance education version of introductory Cheminformatics course (enrollment promising in Distance Graduate Certificate in Chemical Informatics)

Example: Spreading chemoinformatics education with CIC courseshare

  • We have partnered with the University of Michigan to offer our introductory chemoinformatics (I571) course concurrently at Indiana University and the University of Michigan as a CIC courseshare, so UM pharmacy, chemistry and engineering students can be trained in chemoinformatics techniques for course credit at UM

  • In addition, individual students in academia, government, and small and large life science companies have taken the class remotely from all over the country for credit towards the graduate certificate

  • Uses mixture of web conferencing (Breeze), videoconferencing, and online resources for maximum flexibility

    • Minimally all that is required is a telephone and internet-connected PC

    • Students can replay any of the classes using just a regular PC

    • Most recent course wiki is available at http://cheminfo.informatics.indiana.edu/djwild/I571_2006_wiki

Giving a class remotely to UM students with video and web conferencing

MLSCN Post-HTS Biology Decision Support

Percent Inhibition or IC50 data is retrieved from HTS

Grids can link data analysis (e.g. image processing developed in existing Grids), traditional Chem-informatics tools, and annotation tools (Semantic Web, del.icio.us) to enhance lead ID and SAR analysis

A Grid of Grids linking collections of services at

  • PubChem

  • ECCR centers

  • MLSCN centers

Workflows encoding plate & control well statistics, distribution analysis, etc.

Question: Was this screen successful?

Workflows encoding distribution analysis of screening results

Question: What should the active/inactive cutoffs be?

Question: What can we learn about the target protein or cell line from this screen?

Workflows encoding statistical comparison of results to similar screens, docking of compounds into proteins to correlate binding with activity, literature search of active compounds, etc.

Compounds submitted to PubChem
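The slide does not say which statistics these workflows compute. As a minimal sketch, assuming the widely used Z'-factor as the plate-quality check for "Was this screen successful?" and a mean-plus-n-standard-deviations rule for the active/inactive cutoff, the two questions could be approached as follows. All function names and the choice of statistics are illustrative assumptions, not the CICC workflows themselves.

```python
from statistics import mean, stdev


def z_prime(positive_controls, negative_controls):
    """Z'-factor of a plate: 1 - 3*(sd_pos + sd_neg) / |mean_pos - mean_neg|.

    Assumed quality statistic; values near 1 indicate well-separated controls.
    """
    separation = abs(mean(positive_controls) - mean(negative_controls))
    if separation == 0:
        raise ValueError("control means are identical; Z' is undefined")
    spread = 3 * (stdev(positive_controls) + stdev(negative_controls))
    return 1 - spread / separation


def activity_cutoff(sample_values, n_sd=3.0):
    """Assumed cutoff rule: mean of the sample wells plus n_sd standard deviations."""
    return mean(sample_values) + n_sd * stdev(sample_values)


def call_actives(inhibition_by_compound, cutoff):
    """Return the compound identifiers whose percent inhibition reaches the cutoff."""
    return sorted(c for c, v in inhibition_by_compound.items() if v >= cutoff)
```

A workflow step would typically run `z_prime` per plate first and only pass plates above some agreed threshold on to `activity_cutoff` and `call_actives`.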




Example HTS workflow: finding cell-protein relationships

A protein implicated in tumor growth with known ligand is selected (in this case HSP90 taken from the PDB 1Y4 complex)

The screening data from a cellular HTS assay is similarity searched for compounds with similar 2D structures to the ligand.

Docking results and activity patterns fed into R services for building of activity models and correlations





Similar structures are filtered for drugability, are converted to 3D, and are automatically passed to the OpenEye FRED docking program for docking into the target protein.

Once docking is complete, the user visualizes the high-scoring docked structures in a portlet using the JMOL applet.

Similar structures to the ligand can be browsed using client portlets.
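The 2D similarity search step in this workflow can be illustrated with a small sketch. It assumes that each structure is represented by a fingerprint stored as a set of "on" bit positions and that similarity is measured with the Tanimoto coefficient, a common choice for 2D fingerprints; the slides do not state which fingerprint or measure the CICC services use, and the function names and threshold are hypothetical.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto coefficient between two fingerprints given as sets of on-bits."""
    if not fp_a and not fp_b:
        return 0.0
    shared = len(fp_a & fp_b)
    return shared / (len(fp_a) + len(fp_b) - shared)


def similarity_search(query_fp, library, threshold=0.7):
    """Return (name, score) pairs at or above the threshold, best first.

    `library` maps a compound name to its fingerprint set.
    """
    scored = ((name, tanimoto(query_fp, fp)) for name, fp in library.items())
    hits = [(name, score) for name, score in scored if score >= threshold]
    return sorted(hits, key=lambda hit: hit[1], reverse=True)
```

In the workflow described above, the query fingerprint would come from the known ligand and the library from the compounds in the cellular HTS assay.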

Example: PubDock

  • Database of approximately 1 million PubChem structures (the most drug-like) docked into proteins taken from the PDB

  • Available as a web service, so structures can be accessed in your own programs, or using workflow tools like Pipeline Pilot

  • Several interfaces developed, including one based on Chimera (below) which integrates the database with the PDB to allow browsing of compounds in different targets, or different compounds in the same target

  • Can be used as a tool to help understand molecular basis of activity in cellular or image based assays

Example: R Statistics applied to PubChem data

  • By exposing the R statistical package, and the Chemistry Development Kit (CDK) toolkit as web services and integrating them with PubChem, we can quickly and easily perform statistical analysis and virtual screening of PubChem assay data

  • Predictive models for particular screens are exposed as web services, and can be used either as simple web tools or integrated into other applications

  • Example below uses DTP Tumor Cell Line screens - a predictive model using Random Forests in R makes predictions of probability of activity across multiple cell lines (available at http://rguha.ath.cx/~rguha/ncidtp/dtp)

Varuna environment for molecular modeling (Baik, IU)

[Figure: Varuna architecture diagram. Recoverable labels: Simulation Service (FORTRAN code); DB Service (queries, clustering, curation, etc.); PubChem, PDB, NCI, etc.]

Methods Development at the CICC

  • Tagging methods for web-based annotation exploiting del.icio.us and Connotea

  • Development of QSAR model interpretability and applicability methods

  • RNN-Profiles for exploration of chemical spaces

  • VisualiSAR - SAR through visual analysis

    • See http://www.daylight.com/meetings/mug99/Wild/Mug99.html

  • Visual Similarity Matrices for High Volume Datasets

    • See http://www.osl.iu.edu/~chemuell/new/bioinformatics.php

  • Fast, accurate clustering using parallel Divisive K-means

  • Mapping of Natural Language queries to use cases and workflows

  • Advanced data mining models for drug discovery information

  • Physics-based Scoring Algorithms
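To make the divisive K-means item in the list above concrete, here is a minimal sequential sketch. It assumes the usual bisecting scheme: start with one cluster, then repeatedly split the cluster with the largest within-cluster squared error using 2-means until k clusters exist. The CICC implementation is described as parallel and its details are not given in the slides, so everything below (names, split criterion, seeding) is an assumption for illustration only.

```python
import random


def _sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def _centroid(points):
    n = len(points)
    return tuple(sum(col) / n for col in zip(*points))


def _sse(points):
    """Within-cluster sum of squared distances to the centroid."""
    c = _centroid(points)
    return sum(_sq_dist(p, c) for p in points)


def _two_means(points, rng, iterations=20):
    """Split one cluster in two with plain 2-means (Lloyd's algorithm)."""
    centers = rng.sample(points, 2)
    halves = ([], [])
    for _ in range(iterations):
        halves = ([], [])
        for p in points:
            nearer = 0 if _sq_dist(p, centers[0]) <= _sq_dist(p, centers[1]) else 1
            halves[nearer].append(p)
        if not halves[0] or not halves[1]:
            break  # degenerate split, e.g. duplicate starting centers
        centers = [_centroid(h) for h in halves]
    return halves


def divisive_kmeans(points, k, seed=0):
    """Return up to k clusters (lists of points) by repeated bisection."""
    rng = random.Random(seed)
    clusters = [list(points)]
    while len(clusters) < k:
        splittable = [c for c in clusters if len(c) > 1]
        if not splittable:
            break
        worst = max(splittable, key=_sse)  # split the loosest cluster next
        left, right = _two_means(worst, rng)
        if not left or not right:
            break
        clusters.remove(worst)
        clusters.extend([left, right])
    return clusters
```

Because each bisection only touches the points of one cluster, separate clusters can be split independently, which is one reason this family of methods lends itself to parallel execution.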

What Do You Get in a Web Service?

  • WSDL for all services available

    • Collected on a web page

    • Available in a UDDI repository

  • Javadocs or plain text descriptions

  • Source code and associated unit tests

  • Various client examples

    • Web pages (via PHP)

    • Python

    • Chimera

Web Service Vision

  • Web services provide a neutral approach to exposing functionality

  • You can utilize them in

    • Workflow tools – Pipeline Pilot, Taverna, XBaya

    • Desktop clients – Chimera, custom

    • Web pages

  • They can be located anywhere

    • On your desktop

    • Intranet

    • Internet

Web Service Vision

  • Literally anything can be made into a web service

    • Libraries

    • Standalone programs

    • Commercial code

    • Open-source code

RSS Feeds

  • Provide access to databases via RSS feeds

  • Feeds include 2D/3D structures in CML

  • Viewable in Bioclipse, Jmol as well as Sage etc.

  • Two feeds currently available

    • SynSearch – get structures based on full or partial chemical names

    • DockSearch – get best N structures for a target
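As a rough illustration of consuming such a feed programmatically, the sketch below parses an RSS document whose items carry an embedded CML molecule. The sample feed is invented for this example, and the way the real SynSearch and DockSearch feeds embed CML is not described in the slides, so the element layout here is an assumption; only the CML namespace URI and the Python standard library calls are real.

```python
import xml.etree.ElementTree as ET

CML_NS = {"cml": "http://www.xml-cml.org/schema"}

# Invented sample feed; the layout of the real CICC feeds may differ.
SAMPLE_FEED = """<rss version="2.0">
  <channel>
    <title>Hypothetical structure feed</title>
    <item>
      <title>water</title>
      <molecule xmlns="http://www.xml-cml.org/schema" id="m1">
        <atomArray>
          <atom id="a1" elementType="O"/>
          <atom id="a2" elementType="H"/>
          <atom id="a3" elementType="H"/>
        </atomArray>
      </molecule>
    </item>
  </channel>
</rss>"""


def structures_from_feed(feed_xml):
    """Return (item title, list of element symbols) for each CML molecule found."""
    root = ET.fromstring(feed_xml)
    results = []
    for item in root.iter("item"):
        title = item.findtext("title", default="")
        for molecule in item.iterfind("cml:molecule", CML_NS):
            atoms = molecule.iterfind(".//cml:atom", CML_NS)
            results.append((title, [atom.get("elementType") for atom in atoms]))
    return results
```

Calling `structures_from_feed(SAMPLE_FEED)` walks every `item` and collects the element symbols of its molecule; a real client would fetch the feed text over HTTP first.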

R, CDK & PubChem

  • Goals

    • Access cheminformatics from within R

    • Access PubChem data from within R

  • The rcdk package makes CDK functionality available for cheminformatics within R

  • rpubchem provides access to PubChem compound data and bioassay data

    • Searchable via assay ID, keywords

  • J. Stat. Soft, 2007, 18(6)

Databases

  • Most of our databases aim to add value to PubChem or link into PubChem

    • We maintain a local mirror for testing, data mining

  • 3D structures (MMFF94)

    • Searchable by CID, SMARTS, 3D similarity

  • Docked ligands (FRED)

    • 906K drug-like compounds docked into 7 targets

    • Will eventually cover ~2000 targets

(Cheminformatics) Algorithm Development

  • Goals

    • Focus on interpretability and applicability

    • Devise novel approaches to clustering problems

    • Investigate the utility of low dimensional representations for a variety of problems

  • Examples

    • Ensemble feature selection (JCIM, in press)

    • Cluster counting with R-NN curves (in revision)
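The slides name R-NN curves without defining them. Assuming the usual meaning (for one point, the number of neighbors that fall within radius R, tabulated as R grows from a small fraction up to the full maximum pairwise distance in the dataset), a curve can be computed as in the sketch below. The function name and the step scheme are illustrative assumptions, and the cluster-counting step built on top of these curves is not shown.

```python
import math


def rnn_curve(points, index, steps=10):
    """Neighbor counts for points[index] at radii s/steps of the max pairwise distance.

    Returns a list of `steps` counts, for s = 1 .. steps.
    """
    target = points[index]
    dists = [math.dist(target, p) for i, p in enumerate(points) if i != index]
    d_max = max(math.dist(a, b) for a in points for b in points)
    return [sum(d <= d_max * s / steps for d in dists) for s in range(1, steps + 1)]
```

A point in a dense region gains neighbors at small radii, while an isolated point's curve stays near zero until the radius is large, which is what makes the shape of the curve informative about local structure.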

Chemical Data Mining

  • Working on screening data with Scripps, FL

    • Random forests (modeling & feature selection)

    • Naïve Bayes (modeling)

    • Identifying features indicative of toxicity

    • Domain applicability

  • NCI DTP Cell line activity predictions

    • Random forest models for 60 cell lines

  • All available as

    • downloadable R models

    • web services (supply SMILES, get prediction) with web page clients
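The "supply SMILES, get prediction" pattern raises one practical detail for any client: SMILES strings contain characters such as `#`, `=`, and parentheses that must be escaped before they are sent in a URL. The sketch below only builds a request URL and does not contact any server. The endpoint and parameter names are hypothetical, and the actual CICC services are described elsewhere in the slides as WSDL-based, so a plain query-string interface is an assumption made to keep the example short.

```python
from urllib.parse import urlencode

# Hypothetical endpoint for illustration only; the slides give no service URL.
PREDICT_ENDPOINT = "http://example.org/predict"


def prediction_url(smiles, cell_line, endpoint=PREDICT_ENDPOINT):
    """Build a GET URL for a prediction request, escaping the SMILES string.

    The parameter names `smiles` and `cellLine` are assumptions.
    """
    query = urlencode({"smiles": smiles, "cellLine": cell_line})
    return f"{endpoint}?{query}"
```

Without the escaping done by `urlencode`, a triple bond written as `#` would be treated as the start of a URL fragment and the rest of the structure would never reach the service.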