Go tag assigning gene ontology labels to medline abstracts
This presentation is the property of its rightful owner.
Sponsored Links
1 / 64

GO Tag: Assigning Gene Ontology Labels to Medline Abstracts PowerPoint PPT Presentation


  • 80 Views
  • Uploaded on
  • Presentation posted in: General

GO Tag: Assigning Gene Ontology Labels to Medline Abstracts. Robert Gaizauskas. Natural Language Processing Group Department of Computer Science. M. Ghanem, Tom Barnwell , Y. Guo Department of Computing. GO Tag: Assigning Gene Ontology Labels to Medline Abstracts. Robert Gaizauskas.

Download Presentation

GO Tag: Assigning Gene Ontology Labels to Medline Abstracts

An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -

Presentation Transcript


Go tag assigning gene ontology labels to medline abstracts

GO Tag: Assigning Gene Ontology Labels to Medline Abstracts

Robert Gaizauskas

Natural Language Processing Group

Department of Computer Science


Go tag assigning gene ontology labels to medline abstracts1

M. Ghanem, Tom Barnwell, Y. Guo

Department of Computing

GO Tag: Assigning Gene Ontology Labels to Medline Abstracts

Robert Gaizauskas

N. Davis, Y.K. Guo, H. Harkema

Natural Language Processing Group

Department of Computer Science

J. Ratcliffe


Outline

Outline

  • Context

    • Project Background

    • The Gene Ontology

    • Go Annotation in Model Organism Databases

    • Medline

  • Go Tagging Tasks

    • User types/scenarios

    • Possible tasks

    • Related Work

  • Data sets/Gold Standards

  • Approaches and Results to Date

    • Lexical lookup

    • Vector Space Similarity

    • Machine Learning

  • Exploiting the Results in Search Tools

NaCTeM Seminar


Project background

Project Background

  • Work is funded by the EPSRC as a Best Practice Project for collaboration between DiscoveryNet and myGrid -- E-Science Pilot Projects (2001-5)

  • Both projects

    • have developed text mining and data analysis components -- complementary approaches NLP vs. datamining/statistical analysis

    • workflow models for co-ordinating distributed services

    • working on life science applications

  • Aim: to develop a unified real-time e-Science text-mining infrastructure that builds upon and extends the technologies and methods developed by both Discovery Net and myGrid

    • Software engineering challenge: integrate complementary service-based text mining capabilities with different metadata models into a single framework

    • Application challenge: annotate biomedical abstracts with semantic categories from the Gene Ontology

NaCTeM Seminar


The gene ontology

The Gene Ontology

  • “The Gene Ontology project provides a controlled vocabulary to describe gene and gene product attributes in any organism”http://www.geneontology.org/

  • Consists of three structured, controlled vocabularies (ontologies) that describe gene products in terms of their associated:

    • biological processes

    • cellular components

    • molecular functions

      in a species-independent manner

  • E.g. gene product cytochrome c can be described by

    • the molecular function term electron transporter activity

    • the biological process terms oxidative phosphorylation and induction of cell death

    • the cellular component terms mitochondrial matrix and mitochondrial inner membrane

NaCTeM Seminar


Gene ontology cont

Gene Ontology (cont)

From: Gene Ontology: tool for

the unification of biology.

The Gene Ontology Consortium

(2000) Nature Genet. 25: 25-29.

NaCTeM Seminar


The gene ontology cont

The Gene Ontology (cont)

  • Started as a joint effort between three model organism databases (FlyBase (Drosophila), the Saccharomyces Genome Database (SGD) and the Mouse Genome Database (MGD))

  • GO now (08/11/05) contains 19022 terms

  • GO Slim(s) are reduced versions of GO ontologies containing a subset of GO terms

    • Aim to give a broad overview of ontology content

    • GO Slim Generic currrently contains 127 terms

  • A typical GO Term

Term name: isotropic cell growth

Accession: GO:0051210

Ontology: biological_process

Synonyms: related: uniform cell growth

Definition: “The process by which a cell irreversibly increases in size uniformly in all directions. In general, a rounded cell morphology reflects isotropic cell growth.”

NaCTeM Seminar


Go annotation in model organism db s

Evidence code

Gene

Go Annotation

Reference

TAS : Traceable

Author Statement

structural constituent

of cytoskeleton

Botstein D, et al. (1997)

The yeast cytoskeleton.

Pruyne D and Bretscher A (2000)

Polarization of cell growth in yeast.

Pruyne D and Bretscher A (2000)

Polarization of cell growth in yeast.

I. Establishment and maintenance

Botstein D, et al. (1997)

The yeast cytoskeleton.

TAS : Traceable

Author Statement

ACT1

exocytosis

Galarneau L, et al. (2000)

Multiple links between the NuA4

histone acetyltransferase complex

and epigenetic control of transcription

IDA : Inferred from

Direct Assay

histone

acetyltransferase

complex

GO Annotation in Model Organism DB’s

  • Model organism db’s typically record for each entry (gene) one or more GO codes + links to the literature supporting the assignment of the GO code

  • E.g. from the Saccharomyces Genome Database (SGD)

IC: Inferred by Curator

IDA: Inferred from Direct Assay

IEA: Inferred from Electronic Annotation

IEP: Inferred from Expression Pattern

IGI: Inferred from Genetic Interaction

IMP: Inferred from Mutant Phenotype

IPI: Inferred from Physical Interaction

ISS: Inferred from Sequence or Structural Similarity

NAS: Non-traceable Author Statement

ND: No biological Data available

RCA: inferred from Reviewed Computational Analysis

TAS: Traceable Author Statement

NR: Not Recorded

NaCTeM Seminar


Pubmed

PubMed

  • PubMed

    • on-line bibliographic database designed to provide access to citations from biomedical literature

    • developed by the US NCBI at the NLM

    • Contains Medline, OldMedline, various other sources

  • Medline

    • Over 12 million citations dating back to 1960’s

    • Author abstracts and citations from > 4800 biomedical journals

NaCTeM Seminar


Pubmed1

PubMed

  • Entrez is NCBI’s integrated, text-based search and retrieval system for the major databases it maintains

NaCTeM Seminar


Outline1

Outline

  • Context

    • Project Background

    • The Gene Ontology

    • Go Annotation in Model Organism Databases

    • Medline

  • Go Tagging Tasks

    • User types/scenarios

    • Possible tasks

    • Related Work

  • Data sets/Gold Standards

  • Approaches and Results to Date

    • Lexical lookup

    • Vector Space Similarity

    • Machine Learning

  • Exploiting the Results in Search Tools

NaCTeM Seminar


User types

User Types

  • Research Geneticists

    • Narrow information interest

      • Particular gene

      • Particular activity/functionality

  • Model Organism Genome DB Curators

    • Broader information interest

    • Typically track a number of publications, seeking to enhance information stored in the model organism genome DB at the locus level

NaCTeM Seminar


User scenarios research geneticist

User Scenarios: Research Geneticist

  • Possible scenarios using GO tagging to support a research geneticist include:

    • Search result presentation:

      • Tag abstracts returned from a PubMed search with GO codes

      • Use GO codes to cluster/structure search results to support more effective information access

    • Structuring of related literature as workflow side-effect

      • Many typical researcher workflows involve BLAST searches yielding BLAST/Swissprot reports

      • Workflow can automatically assemble a set of “related” papers by extracting PMIDs of homologous genes/proteins from reports and collecting these abstracts plus, optionally, others closely related by text similarity

      • Resulting abstract set can be clustered/structured by GO terms and presented to researcher

        (Integrating Text Mining Services into Distributed Bioinformatics Workflows: A Web Services Implementation. Gaizauskas, Davis, Demetriou, Guo and Roberts, In Proceedings of the IEEE International Conference on Services Computing (SCC 2004), 2004.)

NaCTeM Seminar


Search result presentation motivating example

Search Result Presentation: Motivating Example

  • One of the genes involved in the cognitive/social elements of Williams Beuren syndrome is LIM Kinase 1 (LIMK1/LIMK-1)

  • Putting LIM Kinase into Entrez gives 146 possible papers of interest.

NaCTeM Seminar


Search result presentation motivating example1

Search Result Presentation: Motivating Example

  • One of the genes involved in the cognitive/social elements of Williams Beuren syndrome is LIM Kinase 1 (LIMK1/LIMK-1)

  • Putting LIM Kinase into Entrez gives 146 possible papers of interest.

  • However search in the model organism corpus for LIM Kinase yields only 5 papers but a high number of associated GO codes (and this is from only partially annotated papers):

  • Suggests even a single gene may be involved in numerous roles and that clustering according to GO codes may give a more focused method of searching rather than simply supplying more and more keywords which may remove useful and important papers from the result set.

GO:0006468 : biological_process : protein amino acid phosphorylation

GO:0004674 : molecular_function : protein serine/threonine kinase activity

GO:0004672 : molecular_function : protein kinase activity

GO:0007283 : biological_process : spermatogenesis

GO:0008064 : biological_process : regulation of actin polymerization and/or depolymerization

GO:0005515 : molecular_function : protein binding

GO:0005634 : cellular_component : nucleus

GO:0005925 : cellular_component : focal adhesion

GO:0005515 : molecular_function : protein binding

NaCTeM Seminar


Search result presentation motivating example2

Search Result Presentation: Motivating Example

  • However search in the model organism corpus for LIM Kinase yields only 5 papers but a high number of associated GO codes (and this is from only partially annotated papers):

  • Suggests even a single gene may be involved in numerous roles and that clustering according to GO codes may give a more focused method of searching rather than simply supplying more and more keywords which may remove useful and important papers from the result set.

GO:0006468 : biological_process : protein amino acid phosphorylation

GO:0004674 : molecular_function : protein serine/threonine kinase activity

GO:0004672 : molecular_function : protein kinase activity

GO:0007283 : biological_process : spermatogenesis

GO:0008064 : biological_process : regulation of actin polymerization and/or depolymerization

GO:0005515 : molecular_function : protein binding

GO:0005634 : cellular_component : nucleus

GO:0005925 : cellular_component : focal adhesion

GO:0005515 : molecular_function : protein binding

NaCTeM Seminar


User scenarios model organism db curator

User Scenarios: Model Organism DB Curator

  • Possible scenarios using GO tagging/text mining to support DB curators include:

    • Help assemble texts that may support GO code assignment

      • GO tag texts in curator’s watching brief

      • Automated tagging could act as prompt for/check on curator’s judgement

    • Help to determine gene-GO term pairs for annotation

      • Perform GO tagging/ gene name identification at text level and suggest all pairs as candidates

      • Perform GO tagging/gene name identification at sentence level and suggest candidates

      • Attempt to assign GO evidence codes

        • To text segments providing evidence for GO code assignment without identifying GO code/gene pair to which the evidenced pertains

        • To text segments providing evidence plus the GO code/gene pair to which the evidenced pertains

NaCTeM Seminar


Possible tasks 1

Possible Tasks (1)

Assigning GO codes to abstracts/full papers

  • Given: a set of texts (PubMed abstracts/full papers) and the GO/GO Slim ontology

  • Task: assign 0 or more GO codes to a text iff the text is “about” the function/process/component identified by the code (assume most specific code only assigned)

  • Note in this task there is no association of GO code with any specific gene/gene product

NaCTeM Seminar


Possible tasks 2

Possible Tasks (2)

Assigning GO codes to genes/gene products in abstracts/full papers

  • Given: a set of texts (PubMed abstracts/full papers) and the GO/GO Slim ontology

  • Task: If the text supports the assignment of one or more GO codes to a gene/gene product, identify gene/gene product-GO code pairs and the text supporting the assignments

  • This capability would support additional tasks

    • Given a particular gene/gene product and a text collection, find all GO codes for the gene/gene product across the collection

    • Given a GO code and a text collection, find all genes/gene products tagged with the code across the collection

NaCTeM Seminar


Possible tasks 3

Possible Tasks (3)

Assigning evidence codes to genes/gene products-GO code pairings in abstracts/full papers

  • Given: a set of texts (PubMed abstracts/full papers) and the GO/GO Slim ontology

  • Task: As in Task 2. but additionally supply the evidence codes

  • A weaker variant of this is just to suggest evidence text that may assist in the assignment of GO code

NaCTeM Seminar


Related work

Related Work

  • Raychaudri, Chang, Sutphin & Altman (2002)

    • Task: associate GO codes with genes by

      • Associating GO codes with papers

      • Associating a specific GO code with a gene if sufficient number of papers mentioning the gene have the GO associated with them

    • Method: Treat 1. as a document classification task and evaluate maximum entropy, Naïve Bayes and Nearest Neighbours approaches

    • Evaluation: corpus of 20,000 Medline abstracts assigned one or more of 21 GO terms/categories

    • Results: maximum entropy best -- 72.8% classification accuracy over 21 categories

NaCTeM Seminar


Related work1

Related Work

  • Go-KDS (Smith & Cleary, 2003)

    • Product of Reel Two

    • Task: assign arbitrary GO terms to PubMed articles

    • Method:

      • Proprietary Weighted Confidence learner (similar to Naïve Bayes), using only words as features

      • trained on gene/protein DB’s which use GO codes AND have links to Medline

    • Evaluated on approx. same data/task as Raychaudri et al. -- 70.5 % accuracy

NaCTeM Seminar


Related work2

Related Work

  • GoPubMed

    • On-going work at Dresden University (www.gopubmed.org)

    • Task: Annotate PubMed abstracts with GO terms

    • Method: Use a local sequence alignment algorithm with weighted term matching (to overcome limits of strict matching) between GO terms and strings in texts

    • Evaluation: None reported

  • Kiritchenko et al. (U. of Ottawa)

    • Task: assign arbitrary GO terms to biomedical texs

    • Method:

      • Treat task as hierarchical text classification

      • use AdaBoost.MH

    • Evaluation:

      • introduce hierarchical evaluation measure

      • Results unclear

NaCTeM Seminar


Related work cont

Related Work (cont)

  • Biocreative challenge -- task 2 contained three related subtasks

    • Given an article, a protein and a GO code, where the article justifies the assignment of the GO code to the protein, find evidence text in the article supporting the assignment

    • Given protein-article pairs plus the number of GO code assignments supported by the article, find the GO code(s) that should be assigned to the protein based on the article

    • Given a set of proteins, retrieve a set of papers relevant to assigning codes to the proteins plus the GO code annotations and the supporting passages (not evaluated)

  • Results indicated no systems ready for practical use

  • Issues: lack of training data; complexity of tasks

NaCTeM Seminar


Related work cont1

Related Work (cont)

  • TREC Genomics Track 2004 -- three tasks related to GO code assignment

    • Triage -- given a set of articles find those that contain some evidence for the assignment of a GO code, i.e. warrant being curated

    • Given an article and names of genes occurring in the article assign one or more of the top three GO ontologies from which human curators had assigned codes

    • Task 2 plus provide evidence code supporting each gene-GO hierarchy label association

  • Results for all three tasks poor

NaCTeM Seminar


Outline2

Outline

  • Context

    • Project Background

    • The Gene Ontology

    • Go Annotation in Model Organism Databases

    • Medline

  • Go Tagging Tasks

    • User types/scenarios

    • Possible tasks

    • Related Work

  • Data sets/Gold Standards

  • Approaches and Results to Date

    • Lexical lookup

    • Vector Space Similarity

    • Machine Learning

  • Exploiting the Results in Search Tools

NaCTeM Seminar


Data sets and evaluation

Data Sets and Evaluation

  • In order to assess performance of GO tag assignment, a gold standard manually annotated/verified corpus is needed

  • However, no such corpus exists …

NaCTeM Seminar


Data sets and evaluation1

Data Sets and Evaluation

  • Solution 1: SGD Gold Standard

    • Derive a corpus from SGD model organism database (yeast)

    • Assemble all Medline abstracts cited as evidence supporting assignment of GO terms

    • Associate with each abstract the GO term whose assignment it is cited as supporting

    • I.e. given the annotated genes in SGD, assign a GO term T to a paper P if the paper P is referenced in support of a Gene-GO term association involving T

    • SGD Gold Standard

      • 4922 PMIDS

      • 2455 GO terms

      • 10485 PMID-GO term pairs

NaCTeM Seminar


Data sets and evaluation sgd gold standard

Data Sets and Evaluation: SGD “Gold” Standard

  • Advantages:

    • Data already exists -- no extra annotation work required

    • Can assemble similar corpora for each model organism DB

  • Disadvantages:

    • Each abstract has associated with it GO terms whose assignment to specific genes it supports, but may be missing other GO terms which can also be legitimately attached to it

    • Not every paper supporting a GO term assignment will be cited

  • Consequence:

    • SGD gold standard is “GO term incomplete”

      • Weak measure of recall

      • Precision figures difficult to interpret

NaCTeM Seminar


Data sets and evaluation sgd gold standard1

Data Sets and Evaluation: SGD “Gold” Standard

  • Further issue:

    • SGD Gene-GO term assignments are based on full papers, whereas system only has access to abstracts

  • Consequence:

    • Limit on maximum Recall obtainable by system

NaCTeM Seminar


Data sets and evaluation cont

Data Sets and Evaluation (cont)

  • Solution 2: IC Gold Standard

    • Manually extend the GO annotation of abstracts derivable from the SGD

      • Goal: GO term complete gold standard

    • Selected a subset (~800) for which support for all the assigned GO codes is found in the abstract (rather than the full paper)

    • Manually added additional GO annotations using a combination of fuzzy maching against GO and some manual addition of synonyms during checking

      • For included terms, include lowest within each ontology

        ‘cell wall biosynthesis’ => ‘cell wall biosynthesis’‘cell wall’

    • Also applied same methodology to evidence paragraphs -- brief summaries written by curators deliberately using GO vocabulary

  • IC Gold Standard

    • 785 PMIDS

    • 1006 GO terms

    • 5170 PMID-GO term pairs

NaCTeM Seminar


Data sets and evaluation cont1

Data Sets and Evaluation (cont)

  • Advantages:

    • Much closer to a GO-term complete gold standard

  • Disadvantages

    • Still not GO-term complete

      • Method of creation suggests there may still be many unannotated GO terms that ought to be marked (direct mentions of GO terms vs. semantically entailed GO terms)

    • Gold Standard creation method favors lexical look-up approach to GO-tagging

    • Dataset is small

NaCTeM Seminar


Outline3

Outline

  • Context

    • Project Background

    • The Gene Ontology

    • Go Annotation in Model Organism Databases

    • Medline

  • Go Tagging Tasks

    • User types/scenarios

    • Possible tasks

    • Related Work

  • Data sets/Gold Standards

  • Approaches and Results to Date

    • Lexical lookup

    • Vector Space Similarity

    • Machine Learning

  • Exploiting the Results in Search Tools

NaCTeM Seminar


The go tagging task addressed

The Go Tagging Task Addressed

  • The approaches we investigated all considered Task 1, as defined earlier:

    • Given: a set of texts (PubMed abstracts/full papers) and the GO/GO Slim ontology

    • Task: assign 0 or more GO codes to a text iff the text is “about” the function/process/component identified by the code (assume most specific code only assigned)

NaCTeM Seminar


Approach 1 lexical lookup using termino

Approach 1: Lexical Lookup Using Termino

  • Termino: a large-scale terminological resource to support term processing for information extraction, retrieval, and navigation

  • Termino contains a database holding large numbers of terms imported from various existing terminological resources, e.g., UMLS, GO

  • Efficient recognition of terms in text is achieved through use of finite state recognizers compiled from contents of database

  • The results of lexical look-up in Termino can feed into further term processing components, e.g., term parser

  • Available as a Web Service (see http://nlp.shef.ac.uk)

NaCTeM Seminar


Termino terminology engine

Existing Terminological Resources

Raw Texts

Neurofibromin

GO annotations:

- 0008181: tumor supressor

- 0005737: cytoplasma

- …

Medline

Abstracts

UMLS

GO

Mastectomy

UMLS data:

- CUI: C0024881

- semantic type: therapeutic

or preventive procedure

- synonyms: mammectomy

- …

Source-Specific Loaders

TermDB

Term

Induction

Peptidyl-prolyl isomerase

- type: protein term

- source: induced from Medline

- …

Finite State

Look-Up

Term

Parser

Termino

Text out

Text in

Termino Terminology Engine

NaCTeM Seminar


Lexical look up for go tag

Lexical Look-Up for GO Tag

  • Termino

    • Imported names of all terms in GO, plus their GO ids and namespace attributes (18270 names in total)

    • Go term synonyms

    • SGD yeast gene names

  • Recognition of terms in text

    • Case-insensitive

    • “Simple” morphological variants are recognized

      • Cells mapped onto cell

      • Mitochondrial, mitochondria not mapped onto mitochondrion

NaCTeM Seminar


Lexical look up for go tag cont

Lexical Look-Up for GO Tag (cont)

  • GO code assignment

    • GO term T is assigned to text iff name of T is recognized in text

    • Extensions:

      • GO term T is assigned to a paper if synonym ofterm T occurs in the abstract of the paper

      • GO term T is assigned to a paper if yeast gene nameassociated with term T occurs in the abstract of the paper

NaCTeM Seminar


Lexical lookup results for go slim

Lexical Lookup Results for GO Slim

NaCTeM Seminar


Lexical lookup results for go full

Lexical Lookup Results for GO Full

NaCTeM Seminar


Lexical lookup approach discussion

Lexical Lookup Approach: Discussion

  • Recall

    • Effect of curators using full text (SGD) vs. abstracts only (IC)

    • Inherent drawbacks of lexical look-up: term variation, literal mentions

    • Effects of Gold Standard creation method (IC)

  • Precision

    • Effects of Gold Standard creation method (IC)

  • GO vs. GO Slim

    • Recognizing GO Slim terms is easier than recognizing GO terms

  • Effects of extensions (synonyms/gene names) on performance

    • Adding synonyms: variable decrease in Precision, substantial increase in Recall

    • Adding yeast terms: substantial decrease in Precision, substantial increase in Recall

NaCTeM Seminar


Error analysis

Error Analysis

  • False negatives for abstracts:

    • Abbreviation: mismatch repair(GO name) vs. MMR (in text)

    • Permutation, derivation: regulation of translation vs. regulated translation, sporulationvs.sporulate

    • Truncation: galactokinase activity vs. galactokinase

    • Alternative descriptions: protein catabolism vs. proteins for degradation, autophagic vacuolevs. autophagosomal

NaCTeM Seminar


Approach 2 ir based vector space similarity

Approach 2: IR-based Vector Space Similarity

  • Document Collection

    • Build a collection of “GO documents” where each GO document consists of GO term, its synonyms and its definition sentence

  • Query

    • Treat each abstract to which GO codes are to be assigned as a query against the GO document collection

  • Retrieval

    • Given a query (i.e abstract) retrieve relevant “GO documents” (i.e. GO terms)

    • assign top 1, 5, 10 … GO terms to an abstract which are most “similar” as measured by Vector Space Model(VSM)

NaCTeM Seminar


Ir based approach

IR-based Approach

  • indexed the GO documents using Lucene search engine

  • Standard IR preprocessing: tokenization, stop word removal, case normalization, stemming

  • 4 Indices were built according varying as to whether they used

    • Standard GO or GO Slim

    • A GO document consisting of the GO term text (name + definition) or itself plus its ancestor GO terms;

  • Used standard weighting scheme included in Lucene

  • Postprocessing:

    • Re-weighting: give credit to duplicated GO documents (found on more than one path back to root)

    • Threshold: the number of relevant GOIDs to return

NaCTeM Seminar


Ir based results

IR-Based Results

  • Better performance on IC abstracts than on SGD abstracts

  • Hierarchical documents do slightly worse than flat documents

  • Discriminatory effect of specific GO terms may be reducedby occurrence of general terms such as cell and protein

NaCTeM Seminar


Approach 3 machine learning

Approach 3: Machine Learning

  • Variety of text classification algorithms: Naïve Bayes, Decision Tree, SVM classifier, …

    • Naïve Bayes predicts only one GO term per abstract

    • SGD GS: 2.1 GO terms/abstract; IC GS: 6.6 GO terms/abstract

  • Features: words, frequent phrases

    • Preprocessing steps: tokenization, removal ofstop words, stemming

  • Training on 66% of annotated data, evaluation on remainder of data

  • GO term assignments vis-à-vis generic GO Slim to mitigate data sparsity problems

NaCTeM Seminar


Machine learning results

Machine Learning Results

  • One GO term vs. multiple GO terms per abstract makes a difference

  • Higher precision scores than lexical look-up (SGD): GO terms directly mentioned in text not be assigned if GO terms not present in training set

  • Oracle Text Decision Tree (IC): classifier learns systematic, strong correlation between words in text and words in GO terms

NaCTeM Seminar


Comparison of approaches

Comparison of Approaches

  • Best F scores for GO Slim

    • SGD Gold Standard

    • IC Gold Standard

NaCTeM Seminar


Outline4

Outline

  • Context

    • Project Background

    • The Gene Ontology

    • Go Annotation in Model Organism Databases

    • Medline

  • Go Tagging Tasks

    • User types/scenarios

    • Possible tasks

    • Related Work

  • Data sets/Gold Standards

  • Approaches and Results to Date

    • Lexical lookup

    • Vector Space Similarity

    • Machine Learning

  • Exploiting the Results in Search Tools

NaCTeM Seminar


Go tag assigning gene ontology labels to medline abstracts

Input keywords here

Upload a file containing a list of Medline abstracts

Type/paste free texts to get results

NaCTeM Seminar


Go tag assigning gene ontology labels to medline abstracts

Click for the abstract details

Click for the GO definition

Search for the abstracts with similar Go annotations

NaCTeM Seminar


Go tag assigning gene ontology labels to medline abstracts

NaCTeM Seminar


Go tag assigning gene ontology labels to medline abstracts

Click for the abstract details

Click for the GO definition

Search for the abstracts with similar Go annotations

NaCTeM Seminar


Go tag assigning gene ontology labels to medline abstracts

NaCTeM Seminar


Go tag assigning gene ontology labels to medline abstracts

NaCTeM Seminar


Go tag assigning gene ontology labels to medline abstracts

NaCTeM Seminar


Exploiting the results in search tools

Abstract

Titles

GO

Hierarchy

Abstract

Bodies

Go Labels/

Gene Names

Exploiting the Results in Search Tools

NaCTeM Seminar


Go tag assigning gene ontology labels to medline abstracts

Input keywords here

Upload a file containing a list of Medline abstracts

Type/paste free texts to get results

NaCTeM Seminar


Go tag assigning gene ontology labels to medline abstracts

NaCTeM Seminar


Go tag assigning gene ontology labels to medline abstracts

NaCTeM Seminar


Go tag assigning gene ontology labels to medline abstracts

NaCTeM Seminar


Conclusions

Conclusions

  • GO tagging is an “interesting” task that offers significant potential benefits to research biologists and bioinformaticians

    • Several increasingly complex/valuable variants of the task can be identified

  • Simple techniques

    • Direct term matching

    • IR-type text macthing

    • Machine Learning text classification methods

      have been assessed for their level of performance on the simplest task -- assigning GO terms to texts at the whole text level

  • Evaluation methods/resources are critical issues

  • Effectively utilising imperfect text mining results in end user applications is challenging

NaCTeM Seminar


Future work

Future Work

  • Enhancements to each of the 3 simple approaches

  • Combining 3 simple approaches into a hybrid system

  • Look other tasks:

    • Extracting GO term-gene/gene product pairs

    • Assigning evidence codes

  • Improving resources and methodology for evaluating the technology

  • End-user evaluation of search tools employing this technology

NaCTeM Seminar


Go tag assigning gene ontology labels to medline abstracts

END

Reference:

Davis, N., Harkema, H., Gaizauskas, R.,Guo Y.K., Ghanem, M., Barnwell, T., Guo, Y. and Ratcliffe, J. (2006) Three Approaches to GO-Tagging Biomedical Abstracts.

In Proceedings of the Second International Symposium on Semantic Mining in Biomedicine (SMBM06), Jena, April 2006.

Available from http://www.dcs.shef.ac.uk/~robertg/publications/

NaCTeM Seminar


  • Login