ProMiner at MGI

7th Fraunhofer Symposium on Text Mining, October 6, 2009 • ProMiner at MGI: Implementing Dictionary-Based NER Solutions for Mining Biomedical Literature • Karen Dowell, Monica McAndrews-Hill, David Hill, Harold Drabkin, Judith Blake

From Algorithms to Applications • ProMiner at Mouse Genome Informatics (MGI) • Background on MGI and our biocuration process • Applying Named Entity Recognition (NER) applications to improve MGI curator efficiency and minimize bottlenecks • Our implementation and results to date using ProMiner to annotate full-text scientific journal articles in HTML and PDF format

MGI Model Organism Database • A comprehensive, integrated public information resource for mouse genetics, genomics and biology • Facilitates use of the laboratory mouse as a model for human biology • Provides extensively curated mouse data

www.informatics.jax.org The MGI website presents information on mouse biology in a publicly accessible, content-rich, continually updated online database

Mouse Genome Informatics MGI content spans from DNA sequence to disease phenotype

The Mouse Information Resource MGI integrates information on mouse genes and experimental data through a combination of manual curation, computational curation, and collaboration with other online resources.

MGI Biocuration Workflow • For literature curation we • Review more than 160 scientific journals each month • Screen more than 12,000 articles a year

MGI Biocuration Workflow • Curators pick papers based on • Expression • Mapping • Homology • New Genes • Gene Ontology (GO) • Alleles & Phenotypes • Sequences • Inbred Strain • Tumor • Nomenclature • General Interest • Screen for references to mouse, mice, murine
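The screening step for "mouse, mice, murine" can be pictured as a simple keyword check. The following is only a minimal sketch, not MGI's actual screening procedure (curators screen articles by reading them); the function name `mentions_mouse` is a hypothetical helper introduced for illustration.

```python
import re

# Hypothetical helper for illustration only: flags text that mentions
# "mouse", "mice", or "murine" as whole words, ignoring case.
MOUSE_PATTERN = re.compile(r"\b(mouse|mice|murine)\b", re.IGNORECASE)


def mentions_mouse(text: str) -> bool:
    """Return True if the text contains one of the three screening terms."""
    return MOUSE_PATTERN.search(text) is not None


if __name__ == "__main__":
    print(mentions_mouse("Tlr4 signaling in murine macrophages"))
    print(mentions_mouse("Studies in zebrafish embryos"))
```

Whole-word matching keeps unrelated strings such as "mousepad" from being flagged; a real screen would still need a curator's judgment about whether the article reports mouse data.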

MGI Biocuration Workflow • Selected articles are assigned reference numbers and entered into a master bibliography • In 2009… 10,097 articles added, ~1122 per month (as of September 29, 2009)

MGI Biocuration Workflow Indexing is our internal process of associating article reference numbers to at least one entity within the MGI database. For gene indexing that entity is a gene.

MGI Biocuration Workflow • Curators read each paper and enter information into MGI database using controlled vocabularies • Articles annotated based on • Expression • Mapping • Homology • New Genes • Sequences • Inbred Strains • Tumors • Alleles & Phenotypes

Literature Acquisition at MGI

Text Mining and MGI Biocuration • Many areas could benefit from text mining (as tools, not replacements for human curators) • Selected gene indexing as a prototype project to • Minimize a bottleneck within our curation workflow • More than 2000 articles in gene indexing pipeline

Our Ideal System • A dictionary-based named entity recognition (NER) system that • Complements our existing biocuration processes and workflow • Processes full-text PDF files in batch • Uses MGI or comparable dictionaries of mouse symbols, synonyms, and human orthologs • Produces meaningful reports that aid curators • Provides visualization tools • Achieves high F-scores in published evaluations
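To make "dictionary-based NER" concrete, here is a minimal sketch of matching a text against a small dictionary of gene symbols and synonyms. It is not ProMiner's algorithm, which the slides describe only as rule-based with pre-processed dictionaries. The dictionary contents, `build_pattern`, and `find_genes` are assumptions made for illustration; only the symbols Tlr4 and Ly96 come from this deck.

```python
import re
from typing import Dict, List, Pattern, Tuple

# Toy dictionary: official symbol -> synonyms. Tlr4 and Ly96 appear in this
# deck; the synonyms listed here are illustrative assumptions, not MGI data.
GENE_DICTIONARY: Dict[str, List[str]] = {
    "Tlr4": ["toll-like receptor 4"],
    "Ly96": ["MD-2", "lymphocyte antigen 96"],
}


def build_pattern(dictionary: Dict[str, List[str]]) -> Tuple[Pattern[str], Dict[str, str]]:
    """Compile one regex for all terms and map each term back to its symbol."""
    term_to_symbol: Dict[str, str] = {}
    for symbol, synonyms in dictionary.items():
        for term in [symbol, *synonyms]:
            term_to_symbol[term.lower()] = symbol
    # Try longer terms first so a long synonym wins over a shorter overlap.
    terms = sorted(term_to_symbol, key=len, reverse=True)
    body = "|".join(re.escape(term) for term in terms)
    # Require non-alphanumeric boundaries so "Tlr4" does not match in "Tlr40".
    pattern = re.compile(
        r"(?<![A-Za-z0-9])(" + body + r")(?![A-Za-z0-9])", re.IGNORECASE
    )
    return pattern, term_to_symbol


def find_genes(
    text: str, dictionary: Dict[str, List[str]] = GENE_DICTIONARY
) -> List[Tuple[str, str, int]]:
    """Return (official symbol, matched text, character offset) per hit."""
    pattern, term_to_symbol = build_pattern(dictionary)
    return [
        (term_to_symbol[m.group(1).lower()], m.group(1), m.start())
        for m in pattern.finditer(text)
    ]


if __name__ == "__main__":
    for hit in find_genes("TLR4 and MD-2 form a complex; Tlr4 mutants lack it."):
        print(hit)
```

Case-insensitive matching is a deliberate simplification here: it maps human-style (TLR4) and mouse-style (Tlr4) spellings to the same entry, which a production system would likely want to distinguish.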

ProMiner at MGI • Of all the dictionary-based NER tools we evaluated, ProMiner most closely fit our needs • Rule-based protein and entity recognition using pre-processed dictionaries (Entrez Gene, SwissProt, ATCC, and ECACC) • Batch processing of PDF files (beta release) • Standard and custom reports • Customizable annotation projects and dictionaries/term lists • Initiated collaborative pilot project between SCAI and MGI

ProMiner Technical Specifications • System requirements • Runs on Linux systems, Sun-Ultra, and other UNIX-based systems • Requires minimum 1 GB RAM, 500 MB disk space, Java (v1.5 or higher), and Perl (v5.8 or higher) • Uses GeneDB to retrieve data (requires 1 GB to store index files). Includes an HTML-based (CGI) viewer • One processor can update ~1000 articles per project • On a cluster of 16 processors, ProMiner can search the entire MEDLINE literature base with 1 dictionary in ~2 hours

ProMiner Implementation at MGI • MGI Operating Environment • Dedicated Sun Fire X4100 Server with two dual-core AMD Opteron processors, 2.8 GHz, 64-bit • Solaris 10 V. 508 operating system, Java 5 built-in • Adobe Acrobat Pro Version 9.1 • SCAI delivered… • Installation scripts, ProMiner scripts and dictionaries • Documentation and demos • MGI project definition files for annotation using human and mouse dictionaries

Testing, Testing, Testing • HTML Version 6.4 implemented in March • PDF Version 7.1 delivered in August

Reports to Scan for Gene References

This paper was indexed to mouse genes Tlr4 and Ly96

Annotation Dictionary Layers

Preliminary Results • 1 part-time curator working 5.5 hours a day processing batches of 10 articles at a time • 8 of 10 PDFs processed correctly, without errors • Some PDF format (PDF/A) and color labeling errors • We provide feedback to SCAI to enhance dictionaries and PDF formatting

Processing time = 0.2333 × (No. of articles) + 0.5751 (R² = 0.9905) • ProMiner 7.1 annotates 75 full-text articles in PDF format in less than 20 minutes on our server
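As a quick check of the linear fit, the sketch below evaluates it for a batch of 75 articles. The coefficients come from the slide; the unit of minutes is an assumption inferred from the "less than 20 minutes" statement, and `predicted_processing_time` is a hypothetical helper name.

```python
# Coefficients taken from the fitted line on the slide.
SLOPE = 0.2333      # added processing time per article (assumed minutes)
INTERCEPT = 0.5751  # fixed overhead per batch (assumed minutes)


def predicted_processing_time(n_articles: int) -> float:
    """Evaluate the fitted line for a batch of n_articles."""
    return SLOPE * n_articles + INTERCEPT


if __name__ == "__main__":
    # 0.2333 * 75 + 0.5751 = 18.0726
    print(round(predicted_processing_time(75), 2))
```

Under that unit assumption, the fit gives about 18.1 minutes for 75 articles, which agrees with the reported figure of under 20 minutes.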

Next Steps • Complete performance testing and evaluate status of pilot project with SCAI • Consider extending pilot to continue testing ProMiner 7.1 • Explore future collaborations • Gene Ontology terms • Protein-protein interactions • Other curation functions at MGI

Acknowledgments • Fraunhofer SCAI • Juliane Fluck • Heinz-Theodor Mevissen • Symposium Organizers • MITRE Corporation • Lynette Hirschman • Journal of Immunology • MGI • Judith Blake • Nancy Butler • Harold Drabkin • Alex Diehl • David Hill • Monica McAndrews-Hill • Sue McClatchy • David Shaw • Dmitry Sitnikov • MGI System Administration • Matt Baya • Mike McCrossin • Iry Witham
