Biogen Idec Literature Informatics for Drug Discovery

William Hayes, PhD

Phoebe Roberts, PhD

March 19, 2007

Mission


  • Provide

    • access to literature and text resources

    • tools to access and manage literature and text resources

    • expert analyses of literature and text resources

    • the most advanced tools and analyses available

Agenda


  • Value Proposition

  • Literature Informatics Overview

  • Projects

  • Summary

Value Proposition

  • A recent trend in the industry is to cut the library to a bare operational staff - to manage E-journals and document delivery

  • To do so eliminates our ability to make knowledgeable decisions for drug development

The Scope of the Literature Problem: You Cannot Keep Up!

  • The annual worldwide production of information in publications is estimated as 8 TB in books, 25 TB in newspapers, 20 TB in magazines, and 2 TB in journals

  • Every minute scientific knowledge increases by 2,000 pages

  • It takes five years to read the new scientific material produced every 24 hours

  • 80% of information is stored as unstructured text

  • The number of papers associated with a pharma target:

    • in 1990 = 100

    • in 2001 = 8

Library -> Literature Informatics

  • Deliver information

  • Requires variety of skill sets (Library science, operations, technical, informatics, domain expertise)

What is Literature Informatics?

  • Applying data management and analytical technologies to extract and store knowledge from scientific/business literature

  • Analytical technologies:

    • Information retrieval

    • Text mining

    • Semantic reasoning and inference

  • Analytical objectives:

    • What protein interactions can be found in the corpus?

    • Which genes are expressed in a particular pathway with respect to a specific disease in a specific genetic group?

    • Which compounds inhibit a protein?

    • Which documents found are toxicology-related?

    • Show me all co-occurring genes and diseases
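The last objective, co-occurring genes and diseases, is the simplest one to illustrate. The following is a minimal sketch, assuming tiny hand-made lexicons and sentence-level co-occurrence. The term lists and the `find_terms` and `cooccurrences` helpers are hypothetical and are not taken from any tool named in this presentation.

```python
import re
from collections import Counter
from itertools import product

# Hypothetical lexicons; a real system would load curated vocabularies.
GENES = {"il-1 beta", "mmp-9", "p75"}
DISEASES = {"multiple sclerosis", "stroke", "glioma"}

def find_terms(sentence, lexicon):
    """Return the lexicon terms that occur in the sentence (case-insensitive)."""
    lowered = sentence.lower()
    return {
        term for term in lexicon
        if re.search(r"(?<!\w)" + re.escape(term) + r"(?!\w)", lowered)
    }

def cooccurrences(documents):
    """Count (gene, disease) pairs that are mentioned in the same sentence."""
    counts = Counter()
    for doc in documents:
        # Naive sentence splitter: break after ., ! or ? followed by whitespace.
        for sentence in re.split(r"(?<=[.!?])\s+", doc):
            genes = find_terms(sentence, GENES)
            diseases = find_terms(sentence, DISEASES)
            counts.update(product(sorted(genes), sorted(diseases)))
    return counts

docs = [
    "MMP-9 is elevated after stroke. IL-1 beta was also measured.",
    "IL-1 beta and MMP-9 are both implicated in multiple sclerosis.",
]
for pair, n in cooccurrences(docs).most_common():
    print(pair, n)
```

Sentence-level counting is only one possible window; abstract-level or paragraph-level windows trade precision for recall.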

Literature Informatics Benefits

  • Much more efficient overview of research areas

    • Save significant time for individual researchers/the company

  • Ability to effectively extract information from hundreds to millions of documents

  • Greater than 10X improvements in speed of analysis and recall

  • More value captured from $Millions spent on literature content and research

External vs Internal Research Dollars

  • US Total: $94.3B (2003) (JAMA. 2005;294:1333-1342)

    • Public 43% - NIH (28%), Other Federal (7%), State/local gov (5%), Charity (3%)

    • Private 57% - Pharma (29%), Biotech ~1500 companies (19%), Device (9%)

  • Pfizer R&D (2004)

    • $8B (3.5X of Pfizer spend from one funding agency!)

  • Biogen Idec - 3rd largest biotech

    • $684M (2004) R&D (0.7% of US Total)

Number of Papers Published (from Pubmed)

Text Analytics Financial Analysis

  • Given 1000 researchers

  • 22% time searching and analyzing literature (Outsell survey 2002)

  • 220 person-years per year analyzing literature

    •  $22M / year

  • Significant percentage of that time is retrievable using advanced text analytics and expert analysts
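The arithmetic behind these figures can be checked in a few lines. This is a sketch, not the presenters' model: the slide gives the headcount, the 22% share, and the $22M total, so the $100,000 cost per person-year used below is an assumption inferred from those numbers ($22M / 220), and the function name is hypothetical.

```python
def literature_cost(researchers, fraction_on_literature, cost_per_person_year):
    """Return (person-years spent on literature, annual cost in dollars)."""
    person_years = researchers * fraction_on_literature
    return person_years, person_years * cost_per_person_year

# 1000 researchers and 22% come from the slide; $100,000 per person-year is
# an assumed loaded cost that reproduces the slide's $22M total.
person_years, annual_cost = literature_cost(1000, 0.22, 100_000)
print(f"{person_years:.0f} person-years, ${annual_cost / 1e6:.0f}M per year")
```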

Front-loading Safety Concerns

  • Lead optimization (LO) costs ~$126M (Tufts survey)

  • LO projects take between 2-4 years

  • ~50% LO projects undergo attrition due to safety concerns (Tufts survey results)

  • ~50% of safety issues had literature indicators at beginning of project (anecdotal evidence)

  • $25M per 4 LO projects can be recovered IF comprehensive literature analyses can speed up Safety analyses by 20%
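The slide does not show how the $25M figure is derived. One reading that reproduces it, offered here only as an assumption, multiplies the slide's own numbers together: four projects, the cost per project, the share lost to safety concerns, the share of those with early literature indicators, and the 20% speed-up. The constant and function names are hypothetical.

```python
LO_COST_MUSD = 126            # cost of one lead-optimization project, $M (slide)
SAFETY_ATTRITION = 0.50       # share of LO projects lost to safety concerns (slide)
LITERATURE_INDICATORS = 0.50  # share of those with early literature signals (slide, anecdotal)
SPEED_UP = 0.20               # speed-up of safety analyses (slide)

def recoverable_musd(projects):
    """Assumed model: the saving scales linearly with every factor above."""
    return (projects * LO_COST_MUSD * SAFETY_ATTRITION
            * LITERATURE_INDICATORS * SPEED_UP)

print(f"${recoverable_musd(4):.1f}M per 4 LO projects")
```

Under this reading the result is about $25M, which matches the slide, but other derivations could give the same figure.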

Text Analytics Impact

  • Case 1: start with an unknown protein, determine interaction network. No standard procedure without NLP tools – estimated 2-3 weeks of manual mining. With an NLP tool that extracts connectivity information w/ graph visualization from full-text journal articles – 1 hour

  • Case 2: determine toxicity patterns for a compound, or determine toxicity side-effects of inhibiting a target. With manual OVID search – library scientists have already put in 3 months, a total of a year estimated. With NLP+ontologies (OBIIE) – 2-3 weeks.

  • Case 3: An unknown protein is somehow linked to a known disease. There is a lot of disease literature, but only 4 papers on the protein. Establish a plausible connection of mechanism of action with this disease. Without NLP – indefinite. With OBIIE – 2-3 weeks.

The Analyst’s Role

  • Understand questions asked, problems encountered

    • Too much information

    • Not enough information

    • Relevant information is buried

  • Match resources to needs

    • Protein-centric versus pipeline?

    • Better clinical or chemistry coverage?

  • Know search logic and available tools

  • Pre-screen end-user tools

The Analyst’s Role

  • Link disparate resources for improved coverage

  • Repackage results to match question, user preferences

  • Never lose sight of user experience

    • Alleviate tedium

    • Minimize error

    • Increase relevance

    • Make them look good

  • Raise awareness of previously unanswerable questions

Drug Discovery & Due Diligence Information Requirements

  • Set up alerts/RSS feeds on company, compound, clinical trial info, etc

  • What’s in clinic for indication, trial info/protocols and stage of trial

  • Safety issues

  • Potential alternative indications

  • Biomarkers

  • Toxicities of compounds for indication

  • Potential consultants, collaboration map

  • More comprehensive searches for research, development, pharmacodynamics, clinical trials, adverse events, etc.

Typical Text Mining Workflow

Using workflow technologies to build text mining applications from finer-grained components/services. The diagram shows four stages:

  • Retrieval/Storage: retrieve and organize relevant documents

    • Access drivers

    • Output: text documents

  • Text Processing: pre-process documents to enhance the ease of feature extraction

    • Stop-word filters, pattern filters, lexicon matching, NLP parsing, etc.

    • Output: text documents

  • Feature Extraction: features are summarized into vector forms which are suitable for data mining

    • Word counts, pattern extraction and counts, etc.

    • Gene name counts, etc.

    • Phrase counts, etc.

    • Output: feature vectors

  • Data Mining: results can be document characterization or hidden relationship extraction

    • Classification, clustering, association, statistical analysis, visual analysis, etc.
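The four stages named on this slide (retrieval, text processing, feature extraction, data mining) can be sketched end to end with the Python standard library. This is an illustration under stated assumptions, not the workflow environment used by the presenters: the stop-word list, the gene lexicon, the sample corpus, and every function name are hypothetical, and cosine similarity stands in for the data mining step.

```python
import math
import re
from collections import Counter

STOP_WORDS = {"a", "an", "the", "of", "in", "is", "and", "to", "was"}
GENE_LEXICON = {"mmp-9", "p75", "erbb2"}  # hypothetical lexicon

def retrieve(corpus, keyword):
    """Retrieval/storage: keep the documents that mention the keyword."""
    return [doc for doc in corpus if keyword.lower() in doc.lower()]

def process(doc):
    """Text processing: lowercase, tokenize, drop stop words."""
    tokens = re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", doc.lower())
    return [t for t in tokens if t not in STOP_WORDS]

def extract_features(tokens):
    """Feature extraction: word counts plus a gene-name count."""
    counts = Counter(tokens)
    counts["__gene_mentions__"] = sum(counts[g] for g in GENE_LEXICON)
    return counts

def to_vectors(feature_dicts):
    """Summarize features into aligned vectors over a shared vocabulary."""
    vocabulary = sorted(set().union(*feature_dicts))
    return vocabulary, [[f[term] for term in vocabulary] for f in feature_dicts]

def cosine(u, v):
    """Data mining stand-in: cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

corpus = [
    "An antibody against MMP-9 was administered in the injury model.",
    "MMP-9 antibody blocks signaling in the injury model.",
    "Annual report of the library budget.",
]
docs = retrieve(corpus, "antibody")
features = [extract_features(process(d)) for d in docs]
vocabulary, vectors = to_vectors(features)
print(f"similarity of the two retrieved documents: {cosine(vectors[0], vectors[1]):.2f}")
```

Keeping each stage as a separate function mirrors the slide's point about finer-grained components: any one stage can be swapped out without touching the others.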

Overview


  • Collect

    • Quosa

    • Medline

  • Explore

    • Biovista

  • Extract

    • Linguamatics I2E

  • Infrastructure

    • KDE

Quosa


  • Federated search/alerts

  • Localize full-text papers

  • Find information not found in abstracts (kinetic parameters, experimental protocols, etc)

  • Manage literature

  • Collaborate

  • Analyze literature sets

  • Develop corpora for other applications to analyze

Biovista Interactive Co-occurrence Analysis

  • Basic Research

    • Target expansion and off-target effects

    • Experimental design

    • Going fishing

    • Finding connections between known facts

    • Comprehensive summary of a research area

    • Collaboration

  • Clinical Development

    • Drug-Drug interactions

    • Timeline studies

    • Side effects to worry about

  • Intellectual Property

    • Analyze issued patents

  • Competitive Intelligence

Linguamatics I2E

  • Fact search engine

  • Uses semantic entity types coupled with syntactic search criteria for relationship extraction

  • Agile NLP application

Inforsense KDE

  • Text Mining Infrastructure

  • Text/Data workflow environment

Use case 1: Where are early licensing opportunities in academia?

Goal: identify areas of research that could yield potential therapeutics


  • some efficacy is established in the form of testing in animal models

  • Pre-IND filing

Approach: Survey the literature for papers that describe in vivo testing of reagents that affect a particular biology (e.g., immunity, neurology or tumor growth)

Paint a picture of the desired target

  • Use internal projects to develop search criteria

  • Four early-stage projects each have 5-10 papers describing neutralizing antibodies

  • The papers mention an indication only half the time

  • The papers always mention tissues and cell types

  • Antibodies are described in a limited number of ways

  • The target of the antibody is almost always in the same sentence as the antibody term

  • The ability of an antibody to block function is described in a limited number of ways

Use the desired features to construct a search

Antibody and protein terms in the same sentence

Block/neutralize and variations somewhere in the abstract

Nervous tissues somewhere in the abstract

“a neutralizing monoclonal antibody against IL-1 beta was infused into the wound immediately following the injury”

“a neutralizing monoclonal antibody directed against MMP-9 was administered intravenously”

“anti-rat neutralizing IL-1 beta antibody (anti-IL-1 beta) or control immunoglobulin G antibody (IgG) was microinjected”

“potent blocking of p75 binding occurs only with MAb 909”

“an antibody that blocks erbB2/neu-mediated signaling inhibited vestibular ganglion neuron viability”
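The three search criteria on this slide can be approximated with regular expressions. This is a rough sketch and not the syntax of any commercial tool: the term patterns are small hypothetical stand-ins for real lexicons, and the sentence splitter is deliberately naive.

```python
import re

# Hypothetical term patterns; a production search would use curated ontologies.
ANTIBODY = re.compile(r"\b(?:antibod(?:y|ies)|mab|anti-\w+)\b", re.IGNORECASE)
PROTEIN = re.compile(r"\b(?:IL-1 beta|MMP-9|p75|erbB2)\b", re.IGNORECASE)
BLOCKING = re.compile(r"\b(?:block\w*|neutraliz\w*)\b", re.IGNORECASE)
NERVOUS_TISSUE = re.compile(
    r"\b(?:neurons?|ganglion|spinal cord|brain|nerve)\b", re.IGNORECASE
)

def split_sentences(abstract):
    """Naive splitter: break after ., ! or ? followed by whitespace."""
    return re.split(r"(?<=[.!?])\s+", abstract)

def matches(abstract):
    """Apply the three criteria from the slide to one abstract."""
    # 1. Antibody and protein terms in the same sentence.
    same_sentence = any(
        ANTIBODY.search(s) and PROTEIN.search(s) for s in split_sentences(abstract)
    )
    # 2. A block/neutralize variant, and 3. a nervous tissue, anywhere in the abstract.
    return bool(
        same_sentence and BLOCKING.search(abstract) and NERVOUS_TISSUE.search(abstract)
    )

hit = ("An antibody that blocks erbB2/neu-mediated signaling inhibited "
       "vestibular ganglion neuron viability.")
miss = "A neutralizing monoclonal antibody was tested. MMP-9 levels rose in liver tissue."
print(matches(hit), matches(miss))
```

The second abstract fails because its antibody and protein terms sit in different sentences and no nervous tissue is mentioned, which is the kind of filtering the same-sentence constraint is meant to provide.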

Search Results

Use Case 2: the Gene List

  • Generated by biomarker studies, toxicity studies, central to translational medicine

  • Often hundreds of genes

  • Official names are obscure

  • Finding all the names, and the most common name, is hard

  • On average, one a week
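The name problem described above comes down to mapping whatever alias appears in a gene list back to one official symbol, then choosing the alias that readers are most likely to recognize. A minimal sketch follows, assuming a small hand-made synonym table and made-up mention counts. In practice both would come from a curated gene database and from literature counts, and the function names here are hypothetical.

```python
# Small hand-made synonym table for illustration only.
SYNONYMS = {
    "ERBB2": {"HER2", "neu", "erbB2"},
    "NGFR": {"p75NTR"},
    "IL1B": {"IL-1 beta", "IL-1β"},
}

def build_index(synonyms):
    """Map every known name (lowercased) to its official symbol."""
    index = {}
    for official, aliases in synonyms.items():
        for name in aliases | {official}:
            index[name.lower()] = official
    return index

def normalize(gene_list, index):
    """Return the official symbol for each name, or None if it is unknown."""
    return [index.get(name.strip().lower()) for name in gene_list]

def most_common_name(official, mention_counts, synonyms):
    """Pick the alias with the highest mention count (ties broken alphabetically)."""
    names = synonyms[official] | {official}
    return max(sorted(names), key=lambda n: mention_counts.get(n, 0))

index = build_index(SYNONYMS)
print(normalize(["her2", "p75NTR", "mystery1"], index))

# Hypothetical mention counts, for illustration only.
counts = {"HER2": 120, "ERBB2": 15, "neu": 8}
print(most_common_name("ERBB2", counts, SYNONYMS))
```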

A Literature Analytics Workflow

Gene Expression Analysis

Find Relevant Genes from Online Databases

Find Associations between Frequent Terms

Visualizing search results and information within yields new insights

  • Paging through abstracts one by one doesn’t show the big picture:

    • Who’s collaborating with whom?

    • Who’s patenting their work?

    • When did the field develop and mature?

    • Who are the opinion leaders?
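Several of these questions reduce to simple aggregations over bibliographic records before anything is drawn. The sketch below assumes each record carries a year and an author list. The records, author names, and helper names are all hypothetical, and the number of distinct co-authors is used only as a crude proxy for an opinion leader.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Hypothetical bibliographic records.
papers = [
    {"year": 2001, "authors": ["Author A", "Author B"]},
    {"year": 2003, "authors": ["Author A", "Author C"]},
    {"year": 2003, "authors": ["Author A", "Author B", "Author D"]},
]

def coauthor_edges(records):
    """Who is collaborating with whom: count papers shared by each author pair."""
    edges = Counter()
    for record in records:
        for a, b in combinations(sorted(set(record["authors"])), 2):
            edges[(a, b)] += 1
    return edges

def collaborator_counts(edges):
    """Number of distinct co-authors per author."""
    neighbors = defaultdict(set)
    for a, b in edges:
        neighbors[a].add(b)
        neighbors[b].add(a)
    return {author: len(others) for author, others in neighbors.items()}

def papers_per_year(records):
    """When did the field develop: publication counts by year."""
    return Counter(record["year"] for record in records)

edges = coauthor_edges(papers)
degrees = collaborator_counts(edges)
print(sorted(degrees.items(), key=lambda item: -item[1]))
print(sorted(papers_per_year(papers).items()))
```

The edge list produced here is the kind of input a graph-drawing tool would take to render an author network.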

1934 Author/Affiliations

8893 relations

Blue = Aurora Kinases

Green = Cancer lit

Red = Patents lit

Where do we need to be?

  • Spend less time acquiring, more time assimilating

  • Provide domain experts with powerful literature analytics

  • Mix/match best of breed applications for combining text/data mining

  • Need knowledge discovery/exploitation environment that supports rapid construction of integrated text/data results for researchers

Acknowledgements


  • Connie Matsui

  • June Ivey

  • Pam Gollis

  • Harry Bochner

  • Adrean Andreas

  • Cindy Shamel

  • Steve French

  • Research Informatics
