Complementarity between Public and Commercial Databases: New Opportunities in Medicinal Chemistry Informatics

CHI BioIT World, Boston, 2007

Christopher Southan, Global Compound Sciences, AstraZeneca R&D Mölndal, Sweden

Revolutions in Public Cheminformatics Increasingly Complement Commercial Databases

  • Formal representation of the “missing entity” of chemical structure within the global Web of bioinformatic relationships

  • Ability to search links between biological effects, protein names, sequence data, and chemical information

  • Deposition of HTS results and other types of screening or bioactivity data, directly linked to chemical structure information

  • Proliferation of open cheminformatic tools and downloadable data sets

  • 53 Entrez-selectable compound data sources now in PubChem

Explicit Compound-to-sequence Links

Increasingly available from both commercial and public sources, annotated relationships of the form

document (or database entry) “W” includes assay data “X” that defines compound “Y” as an activity modulator of protein “Z”

provide crucial value in medicinal chemistry informatics.

We selected the following to include in our comparative analysis:

  • ~130,000 cpds, ~1,300 sequences, ~7,000 papers

  • ~1.5 million cpds, ~2,000 sequences, ~20,000 patents and papers

  • ~4,000 cpds, 502 sequences

  • 83 protein targets with bioassay data, and ~6,000 cpds in PDB structures

A Linking Example

Project Objectives and Methods

Objectives:

  • Produce standardised comparisons between public and non-proprietary commercial sources

  • Include databases or subsets with explicit chemistry-target or other types of bioactive links

  • Review similarities and differences in content

Methods:

  • Normalise downloaded sources by removing fragments

  • Derive canonical tautomer

  • Generate, compare and retain unique molecular hashcodes

  • Prepare “all-against-all” content overlap matrix

  • Perform selected merges and Venn-type complete overlaps
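The method steps above can be sketched in a few lines. This is only an illustration, not the authors' pipeline: it assumes each record is already a canonical SMILES string, uses "keep the longest dot-separated component" as a crude stand-in for fragment removal, and uses SHA-1 as a hypothetical hashcode. Tautomer canonicalisation needs a cheminformatics toolkit and is not shown. All function names and the toy data are invented for the example.

```python
import hashlib
from itertools import product


def strip_fragments(smiles: str) -> str:
    """Keep the longest dot-separated component (crude proxy for salt/fragment removal)."""
    return max(smiles.split("."), key=len)


def hashcode(smiles: str) -> str:
    """Hypothetical molecular hashcode: SHA-1 of the (assumed canonical) SMILES."""
    return hashlib.sha1(smiles.encode("utf-8")).hexdigest()


def unique_hashes(records):
    """Normalise each record and retain only the unique hashcodes."""
    return {hashcode(strip_fragments(s)) for s in records}


def overlap_matrix(sources):
    """All-against-all matrix: shared unique compounds for every ordered pair of sources."""
    hashed = {name: unique_hashes(recs) for name, recs in sources.items()}
    return {(a, b): len(hashed[a] & hashed[b]) for a, b in product(hashed, repeat=2)}


# Toy data only; "CCO.Cl" collapses onto "CCO" once the fragment is stripped.
toy_sources = {
    "A": ["CCO", "CCO.Cl", "c1ccccc1"],
    "B": ["c1ccccc1", "CC(=O)O"],
}
matrix = overlap_matrix(toy_sources)
```

The diagonal cells give each source's post-filtration unique count, and the off-diagonal cells give the pairwise overlaps.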

Post-filtration Compound Counts

Source                  Compounds
GVKBio                  1,488,288
GVKBio Journals           542,858
GVKBio Patents          1,034,548
GVKBio Drug                 1,933
WOMBAT                    128,120
PubChem                 7,268,193
PubChem Prous               3,318
PubChem PDB                 5,626
PubChem actives            35,671
PubChem pharmacol           6,070
Bioprint                    2,437
ZINC FDA                    1,200
DrugBank                    3,723
DrugBank small mol          1,018
DrugBank exp drugs          2,737
Dict. Nat. Prod. (DNP)    132,831
MDDR                      159,867
MDDR launched               1,118
CMC                         8,189

All-vs-all Result Matrix

Result Overview

  • Our post-filtration unique compound numbers were typically 5% to 20% lower than those given by the databases

  • This facilitated standardised comparisons

  • Self-comparisons and subset numbers were consistent

  • On a pairwise basis in the 19 × 19 matrix:

    • no single set was entirely covered by any other

    • no cells were null, i.e. every pair of sets shared some content

  • Larger databases showed significant non-overlap suggestive of unique content

  • None of the “known drug” sets overlapped exactly

  • For small differences it was not possible to discriminate between technical errors in structure files and genuinely unique content

  • Relative coverage results should not be taken as an implicit criticism or endorsement of any particular database
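The two pairwise observations (no set fully covered by another, no null cells) are simple set properties. A minimal sketch of how they could be checked, assuming each source has already been reduced to a set of unique compound identifiers; the function name and toy sets are hypothetical:

```python
from itertools import permutations


def matrix_properties(sets_by_name):
    """Return (fully covered pairs, null cells) over all ordered pairs of distinct sources."""
    fully_covered = [
        (a, b)
        for a, b in permutations(sets_by_name, 2)
        if sets_by_name[a] <= sets_by_name[b]  # a is entirely contained in b
    ]
    null_cells = [
        (a, b)
        for a, b in permutations(sets_by_name, 2)
        if not (sets_by_name[a] & sets_by_name[b])  # no shared content
    ]
    return fully_covered, null_cells


# Toy identifiers only: every pair overlaps, and no set is a subset of another.
toy = {"X": {1, 2, 3}, "Y": {3, 4}, "Z": {2, 4, 5}}
covered, empty = matrix_properties(toy)
```

Both lists coming back empty corresponds to the situation described for the 19 × 19 matrix.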

GVKBIO

  • At just under 1.5 million compounds, GVKBIO is divided between journals and patents at an approximately 1:2 ratio, with an overlap of about 89,000

  • GVKBIO covers 93% of WOMBAT and is ~10x larger

  • WOMBAT has captured over 7,000 compounds not found in GVKBIO

  • 29% of GVKBIO is represented in PubChem, split evenly between journals and patents

  • GVKBIO includes 25% of the cpds reported as active in any of the screening data sets in PubChem, and 70% of those with a pharmacology link in PubChem via MeSH
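The journal/patent figures can be cross-checked against the post-filtration counts by inclusion-exclusion: overlap = journals + patents − total. A short sketch using the counts from the table (variable names are illustrative):

```python
# Post-filtration counts taken from the compound count slide.
gvk_total = 1_488_288
gvk_journals = 542_858
gvk_patents = 1_034_548

# Inclusion-exclusion: compounds appearing in both journals and patents.
journal_patent_overlap = gvk_journals + gvk_patents - gvk_total

# Patents-to-journals ratio, quoted on the slide as roughly 1:2 (journals:patents).
patent_to_journal_ratio = gvk_patents / gvk_journals

# 29% of GVKBIO represented in PubChem, as an absolute count.
gvk_in_pubchem = round(0.29 * gvk_total)
```

The overlap works out to 89,118, in line with the rounded 89,000, and 29% of the GVKBIO total is about 0.43 million compounds.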

PubChem

  • 48% overlap with ChemNavigator (not in matrix)

  • Only 3% screened within the system so far (11% active)

  • Largest coverage of every other database except WOMBAT, of which PubChem covers some 3,000 fewer compounds than GVKBIO

  • Covers 46% of DNP, 42% of MDDR, 92% of DrugBank, 93% of CMC, and 95% of BioPrint and MDDR launched

  • Covers 0.43 million GVKBIO compounds

  • The GVKBIO patent overlap shows that the number of PubChem compounds with potential claims is 238,000

A Key Test Set

  • Prous “Drugs of the Future” is a review journal for new compounds in development

  • 3,318 cpds in PubChem with document outlinks (but no inlinks)

  • 1,374 in PubChem MeSH pharmacology

  • Selected overlaps

    • 2,628 in GVKBIO (with document-cpd-sequence links)

    • 733 in GVKBIO Drugs (with document-cpd-sequence links)

    • 994 in WOMBAT (with document-cpd-sequence links)

    • 1,875 in MDDR, 734 in MDDR launched

    • 543 in DrugBank

  • Numbers allow inferences on triage through different sources
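To make the triage inference concrete, the overlaps can be expressed as percentages of the 3,318-compound Prous set and ranked. A small sketch using only the counts on this slide; the dictionary layout and function name are illustrative:

```python
PROUS_TOTAL = 3_318

# Overlap counts as listed on the slide.
prous_overlaps = {
    "GVKBIO": 2_628,
    "MDDR": 1_875,
    "PubChem MeSH pharmacology": 1_374,
    "WOMBAT": 994,
    "MDDR launched": 734,
    "GVKBIO Drugs": 733,
    "DrugBank": 543,
}


def coverage(overlap_counts, total):
    """Percentage of the test set found in each source, rounded to one decimal place."""
    return {name: round(100 * n / total, 1) for name, n in overlap_counts.items()}


# Sources ranked from highest to lowest coverage of the Prous set.
ranked = sorted(coverage(prous_overlaps, PROUS_TOTAL).items(),
                key=lambda item: item[1], reverse=True)
```

Ranking by coverage shows which source would retrieve most of the development compounds first, and which add comparatively little on their own.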

Venn-type Overlaps Highlight Unique Content

[Figure: Venn-type overlap diagram. Surviving labels: PubChem, 1.49 million, 7.27 million]


Merge of All Bioactives

  • Bioprint, CMC, DNP, DrugBank, GVKBIO, GVKBIO DD, MDDR, PubChem Prous, PubChem PDB, PubChem active, PubChem Pharmacol, ZINC FDA, WOMBAT (not entire PubChem)

  • Gives 1,976,273

  • Filtering to unique content reduces this to 1,741,392

  • Relatively small redundancy collapse of 234,881 (about 12%)

  • Indicates substantial unique content and possibilities for further analysis
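The redundancy collapse is the difference between the merged and the unique totals. A minimal check of that arithmetic, assuming the percentage is taken relative to the merged total (the slide does not state the denominator):

```python
merged_total = 1_976_273   # sum over the merged bioactive sets, as quoted
unique_total = 1_741_392   # after filtering to unique content

# Compounds removed as duplicates across sources.
collapsed = merged_total - unique_total

# Fraction of the merged total that was redundant (assumed denominator).
redundancy_fraction = collapsed / merged_total
```

A low fraction means most records in the merge are unique to a single source, which is the basis for the "substantial unique content" conclusion.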

Conclusions

  • Our filtration and comparison methods clarify compound content

  • Public sources have essential value and complementarity to commercial sources

  • Bioactive coverage is expanding in PubChem sub-sets

  • DrugBank provides exemplary linkage between bioinformatics and cheminformatics

  • Public sources now offer data mining and linking functionality with no commercial equivalent

  • Journal and patent “backfilling” of cpd <-> sequence relationships by expert curation is covered largely only by commercial databases

  • GVKBIO has the highest coverage of cpd <-> sequence links and patent content

  • Some commercial dbs could define their content and extraction methods more explicitly

  • On-line-only data access models are becoming less attractive

Reference and Acknowledgments


Chris Southan, Péter Várkonyi and Sorel Muresan, “Complementarity between public and commercial databases: new opportunities in medicinal chemistry informatics”, Current Topics in Medicinal Chemistry, 2007, in press

Many thanks to:

Prof. Tudor Oprea, Sunset Molecular, for WOMBAT data