GeneKeyDB is a gene-centered relational database designed to optimize biological data mining. It simplifies complex queries and enhances data validation. Future plans include updating with APIs and integrating with other databases.
Creation and Maintenance of GeneKeyDB • Research conducted by Kevin Kastner under the direction of Dr. Erich Baker
The Problem • There exist thousands of biomedical data sources. • In 2006, there were ~557 relevant public resources in molecular biology. • This number is growing rapidly: • 203 sources in 1999 • 226 sources in 2000 • 277 sources in 2001.
The Problem • Traditional database approaches are too rigidly structured. • Scientific objects change identification over time. • Gene names change over time. • The Human Genome Nomenclature Database (HUGO) contains 13,594 active symbols, 9,635 literature aliases, and 2,739 withdrawn symbols. • SIR2L1 (withdrawn) is a synonym for SIRT1 and sir2-like 1.
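The naming problem above can be illustrated with a small lookup that maps every known name for a gene (active symbol, literature alias, or withdrawn symbol) to its current active symbol. This is only a minimal sketch: the `SymbolResolver` class and its methods are hypothetical and are not part of GeneKeyDB, and the case-insensitive matching is a simplifying assumption that may not suit every organism's nomenclature.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Optional;

/** Hypothetical helper (not part of GeneKeyDB): maps any known name to the active symbol. */
final class SymbolResolver {
    // Key: normalized name (active, alias, or withdrawn). Value: active symbol.
    private final Map<String, String> toActive = new HashMap<>();

    /** Registers an active symbol together with its aliases and withdrawn symbols. */
    void register(String activeSymbol, String... otherNames) {
        toActive.put(normalize(activeSymbol), activeSymbol);
        for (String name : otherNames) {
            toActive.put(normalize(name), activeSymbol);
        }
    }

    /** Returns the active symbol for a name, or an empty Optional if the name is unknown. */
    Optional<String> resolve(String name) {
        return Optional.ofNullable(toActive.get(normalize(name)));
    }

    // Assumption: names are compared case-insensitively after trimming whitespace.
    private static String normalize(String name) {
        return name.trim().toUpperCase(Locale.ROOT);
    }
}
```

After `register("SIRT1", "SIR2L1", "sir2-like 1")`, a call to `resolve` with either the withdrawn symbol SIR2L1 or the alias sir2-like 1 yields SIRT1, so downstream code only ever sees one identifier per gene.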
The Solution • GeneKeyDB • A gene-centered relational database developed to enhance data mining in biological data sets. • GeneKeyDB relies primarily on existing database identifiers derived from community databases (NCBI, GO, Ensembl, et al.) as well as the known relationships among those identifiers. • Version 1 is already out! • http://www.biomedcentral.com/1471-2105/6/72
Weaknesses of Version 1 • Can no longer be updated • Complex queries must be made to the database in order to obtain desired information
Complex Queries
SELECT ll_xp_cdd.cdd_name, ll_np_cdd.cdd_name, organism
FROM ll_xp_cdd, ll_np_cdd, ll_locus
WHERE ll_xp_cdd.cdd_score = ll_np_cdd.cdd_score
  AND ll_id IN (
    SELECT ll_id FROM ll_refseq_xm
    WHERE ll_refseq_xm_id IN (
      SELECT ll_refseq_xm_id FROM ll_xp_cdd, ll_np_cdd
      WHERE ll_xp_cdd.cdd_score = ll_np_cdd.cdd_score))
  AND ll_id IN (
    SELECT ll_id FROM ll_refseq_nm
    WHERE ll_refseq_nm_id IN (
      SELECT ll_refseq_nm_id FROM ll_xp_cdd, ll_np_cdd
      WHERE ll_xp_cdd.cdd_score = ll_np_cdd.cdd_score));
Current Research • Creation of APIs to validate data in the database and to enable querying to become much easier for the user. • One-step updating of the database and the information it contains.
API Alternative
// fxn(search_params, desired_info), returns ll_id
curated.cdd(score[ ],null)
curated_score[ ] score[ ]
locus_id1[ ] gaa.cdd((name[ ],score[ ]), score[ ])
gaa_name[ ] name[ ]
gaa_score[ ] score[ ]
locus_id2[ ] curated.cdd(name[ ],score[ ])
curated_name[ ] name[ ]
locus_id[ ] intersect(locus_id1[ ],locus_id2[ ])
locus(organism[ ], locus_id[ ])
print(gaa_name[ ], curated_name[ ], organism[ ])
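The idea behind the API alternative, replacing nested subqueries with a few lookups and a set intersection, might look roughly like this in Java. This is a sketch under stated assumptions, not the GeneKeyDB API: `CddTable`, `Row`, and every method name here are hypothetical, the tables are in-memory stand-ins for the curated and gaa CDD sources, and scores are matched by exact equality to mirror the `=` comparison in the SQL.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical in-memory stand-in for one CDD annotation source (not the real API). */
final class CddTable {
    /** One annotation row: a locus ID with a conserved-domain name and score. */
    static final class Row {
        final int locusId;
        final String name;
        final double score;

        Row(int locusId, String name, double score) {
            this.locusId = locusId;
            this.name = name;
            this.score = score;
        }
    }

    private final List<Row> rows = new ArrayList<>();

    /** Adds a row and returns this table so calls can be chained. */
    CddTable add(int locusId, String name, double score) {
        rows.add(new Row(locusId, name, score));
        return this;
    }

    /** Returns the distinct scores present in this table. */
    Set<Double> scores() {
        Set<Double> out = new LinkedHashSet<>();
        for (Row r : rows) {
            out.add(r.score);
        }
        return out;
    }

    /** Returns the locus IDs of rows whose score appears in the given set. */
    Set<Integer> locusIdsWithScoreIn(Set<Double> wanted) {
        Set<Integer> out = new LinkedHashSet<>();
        for (Row r : rows) {
            if (wanted.contains(r.score)) {
                out.add(r.locusId);
            }
        }
        return out;
    }

    /** Set intersection, playing the role of intersect(locus_id1[ ], locus_id2[ ]). */
    static Set<Integer> intersect(Set<Integer> a, Set<Integer> b) {
        Set<Integer> out = new LinkedHashSet<>(a);
        out.retainAll(b);
        return out;
    }
}
```

With two such tables, `gaa.locusIdsWithScoreIn(curated.scores())` and `curated.locusIdsWithScoreIn(gaa.scores())` give the two locus ID lists, and `CddTable.intersect` combines them. Each step is a single readable call, which is the contrast with the nested SQL that the slide is drawing.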
External Implementations • Some other databases, such as Ensembl, also provide APIs. • Ensembl's APIs are written in Perl. • APIs for GeneKeyDB will be written in Java. • More structured language. • Easier to read.
The Future of GeneKeyDB • GeneKeyDB will join even more external and widely used databases together. • Code for updating GeneKeyDB will tie into database information that will change in expected ways. • Lowers the required number of code rewrites. • GeneKeyDB will be dynamically updated.
The Future of GeneKeyDB • Perl APIs will also be written. • Perl is used often, almost exclusively, by biologists. • The Perl APIs can tie into the Java APIs, rather than being created from scratch.
Comments? Questions? • http://genereg.ornl.gov/gkdb/