The research assistant for biological text mining

Software for Biotech and Pharma Research

The Research Assistant for Biological Text Mining

Luc Dehaspe

Other Members of the BioMinT Consortium


Text Mining in the biological domain

  • Emerging field of research and development

    • 40+ articles in “Bioinformatics 2004”

    • Dedicated workshops, competitions and interest groups

  • Information retrieval and extraction to deal with information overflow

    • 12 million citations in Medline from 4600 journals

    • Many more resources on the web

  • Essential link in the semantic integration of the numerous biological resources.


Use of text mining for database annotation

  • curated protein sequence database

  • high level of annotation of proteins

  • high level of integration with other databases

Swiss-Prot Entry Creation Flowchart


Use of database annotations for text mining

  • Tools for information retrieval, filtering, classification, extraction rely on

    • Corpora of examples used by machine learning methods;

    • Linguistic analysis and controlled vocabularies (ontologies, thesauri, biological dictionaries).

  • Databases provide semi-structured information that could be used

    • for corpus elaboration

    • as specific vocabulary resources


The research assistant for biological text mining

  • 3-year FP5 European project, started in January 2003

  • Official web site: www.biomint.org

  • Interdisciplinary consortium spanning Artificial Intelligence and Biological Sciences:

    • University of Antwerp (BE)

    • Austrian Research Institute for AI

    • University of Manchester (UK)

    • PharmaDM (BE), coordinator

    • Swiss Institute of Bioinformatics

    • University of Geneva (CH)


The goals of BioMinT

  • To develop a generic text mining tool that:

    • interprets different types of queries

    • retrieves relevant documents from the biological literature

    • extracts the required information

    • outputs the result as a database slot filler or as a structured report

  • The tool thus provides two essential research support services:

    • Curator's Assistant: accelerate, by partially automating, the annotation and update of databases;

    • Researcher's Assistant: generate readable reports in response to queries from biological researchers.


Curator’s Assistant for Swiss-Prot Annotation

[Figure: Swiss-Prot entry with labelled sections: Definition, Gene name, Reference content, Reference comments, Comments, Keywords, Sequence features]


Curator’s Assistant for PRINTS annotation

  • PRINTS deals with groups of proteins

  • Annotation of 3 types of protein fingerprints: family, super-family, domain-family

  • Extracted information (in the groups shown on the slide):

    • High level function, high level structure, disease associations, subcellular location, tissue distribution, etc.

    • Low level function, super-family structure, disease associations, number of subtypes, etc.

    • Domain structure, domain function


Swiss-Prot Entry Creation Flowchart

Biological Researcher’s Literature Screening Flowchart

The Biological Research Assistant

  • Overlap with Curator’s Assistant

    • All biologists occasionally in the curator’s seat

    • Keep ahead of Swiss-Prot in research area of interest

    • Include private (confidential) document collections


Information retrieval and extraction modules

[Diagram: module overview]

  • GUI

  • IR: Query expansion, PubMed search, Document filtering/ranking, Document organisation

  • IE: Sentence extractor, Case frame generator

  • NLP tools


Information Retrieval

  • A meta-query engine built around PubMed

    • Expansion of the initial query with synonyms using a gene/protein synonym database (GPSDB)

      • the goal being to retrieve an exhaustive set of documents containing information on a protein.

    • Filtration and ranking of the retrieved documents

    • Pre-classification according to information topics.


GPSDB

  • Database for synonym expansion of gene and protein names

  • Populated by the main resources on model organisms

  • Contains 559’294 synonyms referring to 292’472 proteins


GPSDB

[Figure: synonym lists from LocusLink, HUGO, OMIM and Swiss-Prot for two genes, TWIST1 (TWIST, twist, H-twist, BPES2, BPES3, SCS, ACS3) and ACSL3 (FACL3, LACS3, ACS3, PRO2194); the name ACS3 appears in both lists]

  • Cross-reference links connect database entries that refer to the same gene/protein entity, which also exposes homonymy when it occurs
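
The cross-referencing idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the entry identifiers, the data layout and the assignment of names to entries are hypothetical and do not reflect GPSDB's actual schema. Entries joined by a cross-reference are merged into one entity with union-find, and a name that ends up attached to more than one entity is reported as a homonym.

```python
from collections import defaultdict

def find(parent, x):
    # Follow parent pointers to the representative, compressing the path.
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def group_entries(entries, xrefs):
    """entries: {entry_id: set of names}; xrefs: iterable of (id_a, id_b)."""
    parent = {entry_id: entry_id for entry_id in entries}
    for a, b in xrefs:
        parent[find(parent, a)] = find(parent, b)
    entities = defaultdict(set)
    for entry_id, names in entries.items():
        entities[find(parent, entry_id)] |= names
    return list(entities.values())

def homonyms(entities):
    """Names (compared case-insensitively) that belong to more than one entity."""
    seen = defaultdict(int)
    for names in entities:
        for name in {n.lower() for n in names}:
            seen[name] += 1
    return sorted(name for name, count in seen.items() if count > 1)

# Hypothetical entry ids; the name-to-entry assignment is an assumption.
entries = {
    "hugo:1": {"TWIST1", "TWIST", "ACS3"},
    "omim:1": {"TWIST1", "TWIST"},
    "hugo:2": {"ACSL3", "FACL3", "ACS3"},
    "sprot:2": {"ACSL3", "FACL3", "LACS3"},
}
xrefs = [("hugo:1", "omim:1"), ("hugo:2", "sprot:2")]

entities = group_entries(entries, xrefs)
print(homonyms(entities))
```

Merging by cross-reference rather than by shared name is the point of the design: two entries that share only the name ACS3 stay separate, so the shared name can be flagged instead of silently fusing two proteins.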


GPSDB screenshot

  • lap2 is a synonym of three separate protein entities: Erbin, HSP 86, Thymopoietin


GPSDB used for query expansion

  • Original user query: lap2

  • Query expansion based on GPSDB [screenshot]
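
A minimal sketch of synonym-based query expansion, assuming the synonym database can be read as a list of name sets, one per protein entity. The function name `expand_query` and the OR-joined output format are illustrative assumptions; the query syntax the prototype actually sends to PubMed is not shown on the slides.

```python
def expand_query(term, synonym_db):
    """Return a boolean OR query over the term and its known synonyms."""
    names = {term}
    for entity_names in synonym_db:
        lowered = {n.lower() for n in entity_names}
        if term.lower() in lowered:
            # The term names this entity, so pull in all of its other names.
            names |= set(entity_names)
    # Quote multi-word names so they are searched as phrases.
    ordered = sorted(names, key=str.lower)
    quoted = [f'"{n}"' if " " in n else n for n in ordered]
    return " OR ".join(quoted)

# One name set per entity; lap2 is a synonym of three separate entities.
synonym_db = [
    {"lap2", "Erbin"},
    {"lap2", "HSP 86"},
    {"lap2", "Thymopoietin"},
]

print(expand_query("lap2", synonym_db))
```

Because lap2 is a homonym, the expanded query deliberately covers all three entities; narrowing down to the intended protein is left to the later filtering and ranking step.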


Document filtering and ranking

  • Interactive modules which permit a flexible selection of relevant documents for the IE process.

  • Algorithmic approaches

    • Query dependent:

      • Lucene Ranker: Java-based indexing engine giving a ranked output of queried documents

    • Query independent:

      • Naive Bayes Ranker: using pre-trained classification of relevant documents on specific topics
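
As an illustration of query-independent ranking, the sketch below trains a small multinomial Naive Bayes model on documents labelled relevant or not relevant to a topic, then orders new documents by their log-odds of relevance. It is a toy under stated assumptions: the training sentences, the whitespace tokenizer and the add-one smoothing are illustrative choices, not the prototype's actual ranker or corpus.

```python
import math
from collections import Counter

def tokenize(text):
    return text.lower().split()

def train(docs):
    """docs: list of (text, is_relevant). Returns word counts and doc counts per class."""
    word_counts = {True: Counter(), False: Counter()}
    doc_counts = Counter()
    for text, label in docs:
        word_counts[label].update(tokenize(text))
        doc_counts[label] += 1
    return word_counts, doc_counts

def log_odds(text, model):
    """Log-odds that the text is relevant, using add-one smoothing."""
    word_counts, doc_counts = model
    vocab = set(word_counts[True]) | set(word_counts[False])
    score = math.log(doc_counts[True] / doc_counts[False])
    for label, sign in ((True, 1), (False, -1)):
        total = sum(word_counts[label].values()) + len(vocab)
        for word in tokenize(text):
            if word in vocab:  # words never seen in training are ignored
                score += sign * math.log((word_counts[label][word] + 1) / total)
    return score

def rank(texts, model):
    """Most relevant first."""
    return sorted(texts, key=lambda t: log_odds(t, model), reverse=True)

# Toy training set for a hypothetical "Disease" topic.
training = [
    ("mutation causes syndrome in patients", True),
    ("gene linked to inherited disease", True),
    ("crystal structure of the domain", False),
    ("expression in yeast cells", False),
]
model = train(training)

candidates = ["structure of the yeast domain", "patients with inherited syndrome"]
for text in rank(candidates, model):
    print(round(log_odds(text, model), 2), text)
```

The score depends only on the document and the pre-trained topic model, never on the user's query, which is what distinguishes this ranker from the query-dependent one.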


Document filtering and ranking

[Screenshot: output of query dependent ranking]


Document filtering and ranking

[Screenshot: output of query independent ranking with respect to topic “Disease”]


Sentence extractor

  • Goal: extract sentences with information relevant for protein annotation

  • Method: machine learning from corpora with manually labeled sentences

  • Data representation: bag-of-words approach

  • Best results with Support Vector Machines (linear/Radial Basis Function)
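
The bag-of-words representation and a linear classifier on top of it can be sketched as follows. This is a hedged toy, not the project's implementation: the labelled sentences are invented, the vocabulary is built from them directly, and the linear SVM is trained with plain sub-gradient descent on the regularised hinge loss rather than with a dedicated SVM library. The RBF variant mentioned above is not covered.

```python
def bag_of_words(sentence, vocab):
    """Count vector over a fixed vocabulary; word order is discarded."""
    tokens = sentence.lower().split()
    return [tokens.count(word) for word in vocab]

def train_linear_svm(X, y, epochs=200, lr=0.1, lam=0.01):
    """Sub-gradient descent on the L2-regularised hinge loss; labels are -1 or +1."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
            # The regulariser shrinks the weights on every step.
            w = [wj - lr * lam * wj for wj in w]
            if margin < 1:
                # Hinge loss is active: push the boundary away from xi.
                w = [wj + lr * yi * xj for wj, xj in zip(w, xi)]
                b += lr * yi
    return w, b

def predict(sentence, vocab, w, b):
    x = bag_of_words(sentence, vocab)
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b > 0 else -1

# Invented examples: +1 = relevant for annotation, -1 = not relevant.
labelled = [
    ("the protein binds dna", 1),
    ("the enzyme catalyses hydrolysis", 1),
    ("samples were stored overnight", -1),
    ("we thank the reviewers", -1),
]
vocab = sorted({word for sentence, _ in labelled for word in sentence.split()})
X = [bag_of_words(sentence, vocab) for sentence, _ in labelled]
y = [label for _, label in labelled]
w, b = train_linear_svm(X, y)

print(predict("the protein binds dna strongly", vocab, w, b))
```

Words outside the vocabulary, such as "strongly" here, contribute nothing to the vector, so the prediction rests entirely on terms seen in the labelled corpus.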


Sentence extractor: sample output

  • set of sentences extracted from the top 5 ranked papers

  • query-terms are highlighted

  • sentences classified according to topics (function, structure, disease)

  • sentences linked to the PubMed abstract they originate from


Case frame generator

A protein containing the N-terminal domain with the first transmembrane segment of MAN1 is retained in the inner nuclear membrane.

TARGETED_TO {X: MAN1} {Y: inner nuclear membrane}
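
One way to hold such a case frame in code is a relation name plus a mapping from slot names to fillers. The sketch below parses the notation used in the example above; the `CaseFrame` class and the `parse_case_frame` function are hypothetical names introduced for illustration, and the grammar is inferred from this single example only.

```python
import re
from dataclasses import dataclass, field

@dataclass
class CaseFrame:
    relation: str
    slots: dict = field(default_factory=dict)

# Matches one "{NAME: filler}" group, trimming whitespace around the filler.
SLOT = re.compile(r"\{\s*(\w+)\s*:\s*([^}]*?)\s*\}")

def parse_case_frame(text):
    """Parse 'RELATION {X: filler} {Y: filler}' into a CaseFrame."""
    relation, _, rest = text.strip().partition(" ")
    slots = {name: filler for name, filler in SLOT.findall(rest)}
    return CaseFrame(relation, slots)

frame = parse_case_frame("TARGETED_TO {X: MAN1} {Y: inner nuclear membrane}")
print(frame.relation)
print(frame.slots)
```

A structure like this maps directly onto the "database slot filler" output described earlier: each slot name becomes a field and each filler its value.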


Case frame generator

  • Goal: Automatic identification of selected types of entities, relations, or events in free text

  • Methods:

    • Given a set of pre-labeled sentences, learn IE templates with Inductive Logic Programming (ILP)

    • Background knowledge:

      • Syntactic & semantic information from shallow-parser

      • Ontologies providing entities in a given domain

  • Text analysis tools

    • Shallow Parser (MBSP) based on Machine Learning (TiMBL)

    • Shallow parser adapted to biomedical field using Genia corpus


Case frame generator: sample output of the shallow parser

The mouse lymphoma assay (MLA) utilizing the Tk gene is widely used to identify chemical mutagens.

[Figure: parse of the sentence above. Chunks shown: "The mouse lymphoma assay", "MLA", "utilizing", "the TK gene", "to identify", "chemical mutagens". Role labels shown: subject, object, object. Entity labels shown: Cell-line, DNA part.]


Case frame generator: sample output

  • Information extracted by the Case Frame Generator, which applies machine-learned IE rules to the output of the Shallow Parser


Summary

  • The BioMinT prototype is a working, unified system for Biological Text Mining

    • Information Retrieval:

      • query expansion

      • doc filtering/ranking

    • Information extraction

      • Extraction of sentences on user-specified topics

      • Extraction of relationships between entities (Case frames)

  • Based on a variety of resources, technologies and expertise

    • Biological sciences: corpus annotation, database annotation, fingerprints, ontologies, …

    • Artificial intelligence: IR, machine learning (SVM, ILP, …), Natural Language Processing (Shallow Parser), Case Frames, …

    • Software development: databases, web-server, GUI, …


Future BioMinT developments

  • Integration of BioMinT prototype in the future annotation environment of Swiss-Prot & PRINTS

  • Release Q4-2005

    • Free web-based version, with restrictions on

      • Simultaneous users

      • Resources per user (computing & storage)

    • Customization services provided by PharmaDM

      • Integration into researcher’s IT environment (E-mail alerts …)

      • Mining in-house document collections

      • Combination with DMax data analysis software

      • Incorporation of highly specialized background knowledge (ontologies, thesauri, biological dictionaries, etc…)

      • Custom reports and GUI, etc…


The research assistant for biological text mining

WWW

  • BioMinT home page: http://www.biomint.org

  • GPSDB synonyms database: http://biomint.oefai.at

  • BioMinT prototype Quick Tour:

    http://biomint-server.pharmadm.com:8080/xwiki/bin/view/BioMinT/ProtopQuickTour


Acknowledgements

Melanie Hilario, Jee-Hyub Kim, Walter Daelemans, Jo Meyhi, Frederik Durant, Terri Attwood, Alex Mitchell, Paul Bradley, Kurt De Grave, Fred Lefever, Walter Luyten, Kristof Van Belleghem, Andre Vandecandelaere, Johann Petrak, Alexander Seewald, Anne-Lise Veuthey, Marc Zehnder, Violaine Pillet, Swiss-Prot Curators

[Slide groups these names under "Artificial Intelligence" and "Biological Sciences"]

Interested? Demo?

Leave your card at POSTER 49

