

Benchmarking Infrastructure for Mutation Text Mining

Artjom Klein*, Alexandre Riazanov, Christopher J. O. Baker

Computer Science and Applied Statistics, University of New Brunswick, Saint John, Canada.

Matthew Hindle

Synthetic and Systems Biology, Edinburgh University, Edinburgh, UK.

AIMM 2012

September 9th

Basel, Switzerland.

Mutation Text Mining

Mutation text mining facilitates a wide range of activities in multiple scenarios in bioinformatics and systems biology, including:

Modeling of cell signalling pathways

Protein structure annotation

Expansion of disease-mutation database annotations

Development of tools predicting the impacts of mutations

Useful text mining tasks:

Simple identification of mutation mentions

Linking ("grounding") identified mutations to the corresponding genes and proteins

Identifying mutation impacts and/or related phenotypes.
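As a minimal illustration of the first task, point-mutation mentions written in the wild-type / position / mutant style (such as A149T) can be found with a regular expression. The sketch below is an assumption made for illustration only; it is not the pattern set used by MutationFinder or by the authors, and the function name find_mutation_mentions is hypothetical.

```python
import re

# One-letter codes of the 20 standard amino acids.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Wild-type residue, position (no leading zero), mutant residue, e.g. "A149T".
MUTATION_RE = re.compile(
    rf"\b([{AMINO_ACIDS}])([1-9][0-9]*)([{AMINO_ACIDS}])\b"
)

def find_mutation_mentions(text):
    """Return (wild_type, position, mutant) tuples for each mention in text."""
    return [(m.group(1), int(m.group(2)), m.group(3))
            for m in MUTATION_RE.finditer(text)]

sentence = ("The A149T mutant showed only a slight reduction of dehalogenase "
            "activity (Vmax) and while D260N resulted in a larger increase of Km.")
mentions = find_mutation_mentions(sentence)
print(mentions)
```

A pattern this simple also matches unrelated tokens of the same shape (for example some strain or cell-line names), which is one reason grounding to a protein sequence is treated as a separate task.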

Sample Mutation Text Mining Task: Grounding

“Haloalkane dehalogenase (DhlA) from Xanthobacter autotrophicus GJ10 hydrolyses terminally chlorinated and brominated n-alkanes to the corresponding alcohols.”

“The A149T mutant showed only a slight reduction of dehalogenase activity (Vmax) and while D260N resulted in a larger increase of Km with 1,2-dibromoethane.”

Mutation / Protein / Gene / Organism / Direction / Protein Property / Chemical

Protein ID


Sample Mutation Text Mining Tasks: Relations

“Haloalkane dehalogenase (DhlA) from Xanthobacter autotrophicus GJ10 hydrolyses terminally chlorinated and brominated n-alkanes to the corresponding alcohols.”

“The A147F mutant showed only a slight reduction of the enzyme activity (Vmax) and while D157P resulted in a larger increase of Km with 1,2-dibromoethane.”

Mutation / Protein / Gene / Organism / Direction / Protein Property / Chemical

Event: Mutation Impacts (Protein Property + Impact Direction)

Impact Direction: Positive / Negative

Linking / Grounding Impact mentions to Mutation mentions in text

Linking / Grounding Mutation Mentions to Protein mentions in text

Linking / Grounding Protein Property mentions to GO Molecular Function terms

Normalizing Mutation mentions (e.g. to HGVS Nomenclature)
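The normalization step can be sketched as follows for single amino acid substitutions. This is a minimal sketch, assuming the mention has already been split into wild-type residue, position and mutant residue; the function name to_hgvs_protein is hypothetical. It produces only the protein-level change (for example p.Ala149Thr), whereas a complete HGVS description also names a reference sequence.

```python
# Three-letter codes for the 20 standard amino acids.
THREE_LETTER = {
    "A": "Ala", "R": "Arg", "N": "Asn", "D": "Asp", "C": "Cys",
    "Q": "Gln", "E": "Glu", "G": "Gly", "H": "His", "I": "Ile",
    "L": "Leu", "K": "Lys", "M": "Met", "F": "Phe", "P": "Pro",
    "S": "Ser", "T": "Thr", "W": "Trp", "Y": "Tyr", "V": "Val",
}

def to_hgvs_protein(wild_type, position, mutant):
    """Format a single amino acid substitution in HGVS protein-level style."""
    return f"p.{THREE_LETTER[wild_type]}{position}{THREE_LETTER[mutant]}"

print(to_hgvs_protein("A", 149, "T"))  # p.Ala149Thr
```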

Performance of Mutation Grounding Systems

Benchmarking Resources

The Developer needs:

  • Annotated Corpora (training and dev.)

  • Robust tools to evaluate the performance of

    • different systems

    • different runs of an evolving prototype system against a gold standard corpus.

  • Tools for migrating / uploading system outputs

  • Semantic models for integrating data

  • Appropriate Metrics

Typical Benchmarking Challenges

Benchmarks (annotated corpora) often not open / published

Must be built from scratch

Annotation Set (Semantic Types, Ontology classes, …)

Different formats, different annotation schemas

Choice of Representation Format (TXT, XML, TAB, RDF)

Different metrics used for evaluation

What to evaluate and how

Complex submission procedures

Cumbersome / Slow


We leverage the semantic web standards: OWL, RDF and SPARQL

Our Representation Format: RDF

Why not XML? - XML is a widely used standard format for corpora annotation and is supported by a large number of tools.

Processing complex annotations in XML (parsing, storing, querying, evaluation) is usually all but impossible with off-the-shelf XML tools.

Developers need to write schema-specific parsers and processing scripts, and change them each time the schema is changed or extended.

RDF: Extensibility

Not all mutation text mining tasks and requirements can be foreseen

Same data may be used for different tasks

=> We need extensible representations.

OWL/RDF ontologies are highly extensible data schemas providing:

easy integration of new corpora with annotation schemas that need not be identical, as long as they are compatible. 

easy merging of data defined modulo one ontology with data modulo another ontology. 

additional alignments between the ontologies can be provided by the annotation providers – corpus curators or text mining system developers.

RDF: Tool Availability

OWL reasoners for data integrity checking

RDF and OWL APIs for multiple programming languages facilitate easy programmatic generation and manipulation of annotations or RDF data representing text mining results.

The SPARQL query language can be used directly to calculate system performance metrics as well as to run various searches in the gold standard corpora.

Multiple implementations of RDF databases (triplestores) are available that facilitate efficient storing and querying of large volumes of annotations.

Semantic Model: Mutation Impact Extraction Ontology (MIEO)

Riazanov A, Laurila JB and Baker C, Deploying mutation impact text-mining software with the SADI Semantic Web Services framework, BMC Bioinformatics 2011, 12(Suppl 4):S6.

Naderi N and Witte R, Automated extraction and semantic analysis of mutation impacts from the biomedical literature, BMC Genomics 2012, 13(Suppl 4):S10.

Benchmarking with SPARQL

SPARQL is a query language for RDF data.

Creating a new SPARQL query or changing an existing one is usually easier than creating or rewriting scripts.

We use named graphs (identified subsets of RDF statements) to keep apart results from different systems, results from different experiments, and gold standard data from different corpora.

Metrics are calculated by comparing two graphs and finding the overlaps between them.
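The idea of separating statements by named graph and then overlapping two graphs can be sketched in plain Python. This is a simplified illustration, not the triplestore-based implementation: all graph names, document identifiers and the refersTo predicate are hypothetical placeholders, and mutations are reduced to plain strings so that identical statements can be intersected directly.

```python
from collections import defaultdict

# Each statement is a quad: (graph name, subject, predicate, object).
# All names below are hypothetical placeholders.
quads = [
    ("system:run1", "doc1", "refersTo", "A149T"),
    ("system:run1", "doc1", "refersTo", "D260N"),
    ("system:run1", "doc2", "refersTo", "K5R"),
    ("gold:v1", "doc1", "refersTo", "A149T"),
    ("gold:v1", "doc2", "refersTo", "E12Q"),
]

def group_by_graph(quads):
    """Collect the triples of each named graph into its own set."""
    graphs = defaultdict(set)
    for graph, subject, predicate, obj in quads:
        graphs[graph].add((subject, predicate, obj))
    return graphs

graphs = group_by_graph(quads)

# The overlap between a system run and the gold standard.
overlap = graphs["system:run1"] & graphs["gold:v1"]
print(len(overlap))
```

In real RDF data the mutation nodes of the two graphs are distinct resources, so the overlap has to be found by comparing their properties rather than by intersecting identical triples.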

Corpus Development: KinMutBase

A subset of 201 documents annotated with singular amino acid substitutions grounded to proteins

Curation: we ran MutationFinder (high recall) and compared its results with the annotations in the database. Based on this comparison, we discarded about 70 documents that appear to be annotated with protein-level mutations not explicitly mentioned in the text.

The final size of the corpus is 128 documents. In total, we have 271 mutations linked to 26 different UniProt identifiers.

Primarily for Mutation Grounding Evaluations

Stenberg KA, Riikonen PT, Vihinen M., KinMutBase, a database of human disease-causing protein kinase mutations, Nucleic Acids Res. 1999 Jan 1;27(1):362-4.

Corpus Development: EnzyMiner

38 full-text documents, randomly selected from the EnzyMiner* abstracts

The documents cover proteins with 49 UniProt IDs from 24 different species.

Coverage: 488 statements (occurrences of impact information in text), 61 molecular functions and 29 combined mutations.

Annotated Information:

Studied protein-level mutations, in the form of singular amino acid substitutions. For situations when effects of several simultaneous amino acid substitutions are studied, we annotate them as combined mutations.

Proteins to which the mutations are related are identified with UniProt IDs. Host organisms / sets of specific protein sequences can be identified via UniProt IDs.

Protein properties specified as Gene Ontology Molecular function classes.

Mutation impacts qualified as Positive, Negative or Neutral.

Text fragments from which the information was extracted. Typical fragments contain mentions of protein properties, impact directionality words, such as “increased” or “worse”, mutation mentions, protein and organism names, etc.

Documents identified with PubMed IDs.

* Yeniterzi S, Sezerman U., EnzyMiner: automatic identification of protein level mutations and their impact on target enzymes from PubMed abstracts. BMC Bioinformatics. 2009 Aug 27;10 Suppl 8:S2.

Corpora Statistics


As a part of our infrastructure, we created a small set of simple utilities, which facilitate data access:

The evaluator utility calculates standard performance metrics by executing user-provided SPARQL queries, counting the results and making the necessary calculations. The user can supply the queries in a simple configuration file.

The Sesame loader and query client are simple command line applications that allow loading RDF graphs into a Sesame triplestore and executing queries from files.

Mutation Grounding Metrics

Precision: the fraction of correctly grounded mutations (true positives) over all grounded mutations (true positives + false positives).

Recall: the fraction of correctly grounded mutations over all mutations in the gold standard (true positives + false negatives).

A mutation is considered correctly grounded if it is mapped to a sequence corresponding to the UniProt ID specified by the corresponding gold standard corpus. 
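Written with true positives (TP), false positives (FP) and false negatives (FN), the two definitions above take the usual form. The numbers in the worked line are invented for illustration and are not results from the talk.

```latex
\[
\text{Precision} = \frac{TP}{TP + FP}, \qquad
\text{Recall} = \frac{TP}{TP + FN}
\]
% Illustrative numbers only: TP = 40, FP = 10, FN = 40
\[
\text{Precision} = \frac{40}{40 + 10} = 0.8, \qquad
\text{Recall} = \frac{40}{40 + 40} = 0.5
\]
```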

  • Witte R, Baker CJO (2007) "Towards a systematic evaluation of protein mutation extraction" proposes over 15 different metrics to evaluate protein mutation extraction systems.

Evaluation Example

  • Three SPARQL queries count:

    • all correctly grounded mutations (query1),

    • all grounded mutations (query2),

    • all mutations in the gold standard (query3).

  • The queries are defined in the configuration file of the evaluator tool.

  • The evaluator tool executes the three queries and combines their results using the precision and recall formulas, which are also defined in the configuration file as mathematical expressions:

    • precision = query1 / query2

    • recall = query1 / query3
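The combination step can be sketched as below. This is a simplified stand-in and not the evaluator tool itself: the format of the real configuration file is not shown in the talk, so each formula is represented here as a hypothetical (numerator, denominator) pair of query names, and the query counts are invented.

```python
def evaluate(counts, formulas):
    """Compute each metric from named query counts.

    counts:   mapping of query name to number of result rows
    formulas: mapping of metric name to (numerator, denominator) query names
    """
    metrics = {}
    for name, (numerator, denominator) in formulas.items():
        denom = counts[denominator]
        # Guard against an empty result set in the denominator query.
        metrics[name] = counts[numerator] / denom if denom else 0.0
    return metrics

# Hypothetical counts, as if returned by the three SPARQL queries.
counts = {"query1": 40, "query2": 50, "query3": 80}
formulas = {
    "precision": ("query1", "query2"),
    "recall": ("query1", "query3"),
}
results = evaluate(counts, formulas)
print(results)  # {'precision': 0.8, 'recall': 0.5}
```

Keeping the formulas as data rather than code mirrors the design described above: a new metric needs a new query and a new formula entry, not a new script.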

SPARQL Query 1

  • Singular mutation mention recognition. (True positives)

    SELECT ?doc ?singl_mut1
    WHERE {
      # Mutations in the system output graph
      GRAPH <> {
        ?doc sio:'refers to' ?singl_mut1 .
        ?singl_mut1 a mieo:AminoAcidSubstitution .
        ?singl_mut1 mieo:mutationHasWildtypeResidue ?wt_residue .
        ?singl_mut1 mieo:mutationHasMutantResidue ?mut_residue .
        ?singl_mut1 mieo:mutationHasPosition ?pos1 .
        ?pos1 sio:'has_value' ?pos_value .
      } .
      # Mutations in the gold standard graph
      GRAPH goldst:v0.0 {
        ?doc sio:'refers to' ?singl_mut2 .
        ?singl_mut2 a mieo:AminoAcidSubstitution .
        ?singl_mut2 mieo:mutationHasWildtypeResidue ?wt_residue .
        ?singl_mut2 mieo:mutationHasMutantResidue ?mut_residue .
        ?singl_mut2 mieo:mutationHasPosition ?pos2 .
        ?pos2 sio:'has_value' ?pos_value .
      }
    }

We select single mutations from the system output that match single mutations in the gold standard within the same document (the wild-type residue, position and mutant residue must all agree).
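The join performed by the query can be mimicked in plain Python. This is a rough sketch under stated assumptions: the records, identifiers and the function name true_positives are hypothetical, and each mutation is flattened into a tuple instead of being stored as RDF statements.

```python
# Hypothetical records: (document, mutation node id, wild type, position, mutant).
system_output = [
    ("doc1", "sys:m1", "A", 149, "T"),
    ("doc1", "sys:m2", "D", 260, "N"),
    ("doc2", "sys:m3", "A", 149, "T"),
]
gold_standard = [
    ("doc1", "gold:m1", "A", 149, "T"),
    ("doc2", "gold:m2", "D", 260, "N"),
]

def true_positives(system, gold):
    """Return (doc, system mutation id) rows that have a gold match."""
    # The shared variables of the two graph patterns act as the join key.
    gold_keys = {(doc, wt, pos, mut) for doc, _id, wt, pos, mut in gold}
    return [(doc, mut_id) for doc, mut_id, wt, pos, mut in system
            if (doc, wt, pos, mut) in gold_keys]

matches = true_positives(system_output, gold_standard)
print(matches)  # [('doc1', 'sys:m1')]
```

Only the first system mutation matches: the other two share a residue change with a gold annotation but occur in a different document. Unlike this sketch, a SPARQL SELECT without DISTINCT returns one row per matching pair of nodes, so duplicate gold annotations would be counted more than once.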

Testing the Infrastructure

As a proof of concept, the infrastructure was used for testing and iterative performance evaluation during a project dedicated to the development of a robust mutation grounding system.

EnzyMiner was used as the development corpus.

originally created for mutation impact extraction

it only contains information about mutations whose impact is studied

there may be other mutations associated with specific proteins but not with impacts

we only compute our performance metrics on the subsets of mutations mentioned in the annotations

All other corpora were used as test corpora.

Evaluation Results

On all corpora the new system outperforms the original prototype ….


Future work

Further stress-test the infrastructure with text mining tasks other than mutation grounding and mutation impact extraction, and with a third-party mutation text mining system.

Extend the ontology based on the new requirements identified through community involvement and our own research.

Extend the infrastructure to include protein properties other than molecular functions, such as enzyme kinetics, and DNA-level mutations.

Model sentence-level provenance to provide more precise pointers to the text fragments supporting annotations.

Participants wanted: Open Mutation Text Mining Competition


This research was funded in part by the New Brunswick Innovation Foundation, New Brunswick, Canada; the NSERC, Discovery Grant Program, Canada and the Quebec - New Brunswick University Co-operation in Advanced Education – Research Program, Government of New Brunswick, Canada.
