Data Mining Tools for Curation of the Human Metabolome Database

Savita Shrivastava, Craig Knox, Paul Stothard, Russ Greiner, David Wishart
University of Alberta, Edmonton, Canada

Abstract: The Human Metabolome Database (HMDB) contains more than 1400 metabolite entries, each consisting of more than 80 data fields. Obtaining and evaluating the contents of these data fields has required the development of several custom software tools. These data mining programs extract information from several publicly accessible databases (KEGG, PubChem, PubMed, MetaCyc, ChEBI, PDB, Swiss-Prot, GenBank) and generate a series of web-based reports. By combining the results obtained from several independent sources, these reports provide a useful means of evaluating the reliability of the metabolite information that is added to the HMDB. The HMDB is regularly updated as additional data becomes available and as source databases and data mining methods improve.

Introduction
The extensive information stored in the HMDB has been assembled by a team of curators using a collection of custom data mining programs developed specifically for building and updating the HMDB. These software tools use sequence and text comparison algorithms to obtain up-to-date metabolite information from some of the most reliable and complete resources. Two of the HMDB data mining tools, MetaboBuilder and MetabolizingInfo, are discussed below.

Building the MetaboCards
The HMDB contains more than 1400 metabolite entries, each consisting of over 80 data fields. The data pertaining to each metabolite is accessible as a “MetaboCard”. The MetaboCard serves as a curator-friendly summary of the current metabolite annotations stored in the HMDB (Fig 1). The initial set of MetaboCards is assembled using a data mining program called MetaboBuilder, which searches a variety of databases using sequence and keyword queries. The results of each search are evaluated to determine whether they are relevant to the metabolite in question or should be discarded. MetaboBuilder also coordinates the updating of fields that are calculated from the contents of other fields, such as protein molecular weight and protein isoelectric point. The content gathered and generated by MetaboBuilder is stored in a relational database and in a flat file database to facilitate curator review.

Evaluating Metabolizing Enzymes
Each of the automatically generated MetaboCards is reviewed by curators, who look for missing or incorrect information. To assist the curators, the HMDB development team has prepared several tools that obtain information from additional resources, using data mining approaches that differ from those used to build the MetaboCards. One of these programs, called MetabolizingInfo, is used to evaluate the content of the MetaboCards relating to metabolizing enzymes. Currently more than 3,000 protein (and DNA) sequences are linked to the metabolite entries. The MetabolizingInfo program uses the name of each metabolite and its known synonyms to obtain publications from PubMed, metabolizing enzymes from Swiss-Prot, and metabolite and metabolizing enzyme information from KEGG. The searches are conducted using a combination of WWW agents and public database APIs. All of the retrieved information is ranked using a scoring system and presented to the curator as an HTML document (Fig 2). Each of the entries in the document is hyperlinked to a complete database record (Fig 3).

Updating the HMDB
The HMDB will never be a “finished” database, since new research is always providing additional data. Furthermore, the HMDB data mining tools and curators constantly scrutinize and update existing content. By using a combination of automated data mining and manual curation, the HMDB aims to be a comprehensive and reliable database of human metabolites. The HMDB is available at http://www.hmdb.ca. We encourage users to provide us with their feedback.

Fig. 1 Data stored in the HMDB is available to users and curators in the form of MetaboCards. The cards are generated by a data mining program that retrieves information from several external and internal databases and scripts. Whenever possible, the contents of the MetaboCards are hyperlinked to additional information to aid in the curation process.

Fig. 2 The MetabolizingInfo program uses text-based searches to retrieve information from Swiss-Prot, PubMed, and KEGG. Records that pass a scoring cut-off are presented in a colour-coded HTML table. The table for corticosterone is shown above. Each external record ID is hyperlinked to its corresponding record for curator review. Some of these records are shown in Fig 3.

Fig. 3 The HMDB data mining tools, such as the MetabolizingInfo program, provide web-based reports for human curators. These reports contain hyperlinks to records in a variety of external databases, including Swiss-Prot, PubMed, and KEGG. Shown above is a Swiss-Prot record, PubMed abstract, KEGG compound record, and KEGG enzyme record obtained for corticosterone.
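The MetabolizingInfo workflow described on this poster (match retrieved records against a metabolite's name and synonyms, rank them with a score, keep those that pass a cut-off, and present them as a hyperlinked HTML table) can be illustrated with a minimal Python sketch. This is not the HMDB code: the poster does not describe the actual scoring system, so the name-counting score, the `Record` class, the function names, the record IDs, the `example.org` URLs, and the synonym `compound-x` are all hypothetical placeholders. Database retrieval and colour coding are omitted.

```python
from dataclasses import dataclass
from html import escape


@dataclass
class Record:
    source: str     # source database, e.g. "Swiss-Prot", "PubMed", "KEGG"
    record_id: str  # identifier within the source database (placeholder here)
    url: str        # link to the complete record (placeholder here)
    text: str       # title or description returned by the search


def score_record(record, names):
    """Hypothetical score: number of metabolite names found in the record text."""
    text = record.text.lower()
    return sum(1 for name in names if name.lower() in text)


def rank_records(records, names, cutoff=1):
    """Score every record, drop those below the cut-off, sort best first."""
    scored = [(score_record(r, names), r) for r in records]
    kept = [(s, r) for s, r in scored if s >= cutoff]
    return sorted(kept, key=lambda pair: pair[0], reverse=True)


def render_report(ranked):
    """Build an HTML table in which each record ID links to its full record."""
    rows = [
        f"<tr><td>{escape(r.source)}</td>"
        f'<td><a href="{escape(r.url)}">{escape(r.record_id)}</a></td>'
        f"<td>{s}</td></tr>"
        for s, r in ranked
    ]
    return "<table>\n" + "\n".join(rows) + "\n</table>"


# Placeholder input: one real metabolite name plus a made-up synonym.
names = ["corticosterone", "compound-x"]
records = [
    Record("PubMed", "ID-2", "https://example.org/ID-2",
           "Study of corticosterone levels"),
    Record("KEGG", "ID-3", "https://example.org/ID-3",
           "Unrelated compound record"),
    Record("Swiss-Prot", "ID-1", "https://example.org/ID-1",
           "Enzyme acting on corticosterone (compound-x)"),
]

ranked = rank_records(records, names, cutoff=1)
report = render_report(ranked)
print(report)
```

In this sketch the cut-off removes the record that mentions neither name, and the remaining records are listed from highest to lowest score. A production tool would presumably weight sources and match types differently; the point here is only the rank, filter, and link structure of the curator report.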
