1 / 16

A Database Platform for Bioinformatics

A Database Platform for Bioinformatics. Sandeepan Banerjee Oracle Corporation. Background. Need massive storage for more and more genomic and proteomic data generated in database

ehlers
Download Presentation

A Database Platform for Bioinformatics

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. A Database Platform for Bioinformatics Sandeepan Banerjee Oracle Corporation

  2. Background • Need massive storage for more and more genomic and proteomic data generated in database • Need high-performance computing platform to search data, identify similarities and patterns within genomic data and unify the slices of distributed developed knowledge

  3. Steps to genomic projects • Divide the chromosomes into smaller fragments that can be isolated • Order these fragments to correspond to their respective locations on the chromosomes. • Determine the sequence of bases A,T,C & G in each fragment. • Annotate the regions of sequenced chromosomes with their function • Catalogue the differences in sequences

  4. Computing for cataloguing • Any two individuals differ in about 1/1000 of their genetic material, i.e. about 3 million base pairs. • The global population is now about 6 billion. • So a full cataloguing of all sequence differences will run to 18*1015 entries.

  5. Traditional Database • Few databases have had a native ability to deal with complex data • Hard to handle high-dimensional data Ex. Query on structural similarity: Given a particular sequence, what other sequences resembling this sequence exist in the database?

  6. BLAST • A set of similarity search programs • Hard to handle due to too complex, too large and far too custom-built • Degrade performing when interactions with database increase • Query optimizations not easy • Hard to manage with database as a whole system.

  7. Four technologies needed • Extensibility database architecture • Data mining and Data Warehousing • Data integration technologies • Internet portal technologies

  8. Extending Databases • User-defined Types • User-defined operators • Domain-specific indexing • Optimizer extensibility

  9. User-defined Types • Oracle Type System • Object types – structure is fully known to the database • Opaque types – not known to the database

  10. User-defined operators • Define domain-specific operatorsresembles() • Can be invoked anywhere built-in operators can be used. Like in Select command:SELECT ID FROM DNATABLE WHERE Contains(fragment, `GCCATA`);

  11. Extensible Indexing • Cooperative indexingUser-supplied implementations and the Oracle server cooperate to build and maintain indexes for complex types such as genetic, text or spatial data. • User implemented Indextype.

  12. Extensible Optimizer • Gives developers control over the three main inputs used by the optimizer:statistics, selectivity, and cost.

  13. Mining Sequence Data • Oracle Darwin for bioinformatics

  14. Integrating Heterogeneous Data • Sequence data will be distributed all over the institutes. • Annotations to this data will make it change and grow all the time. • Oracle use ODBC and OLEDB to connect non-Oracle database system to do query, search, insert, delete.

  15. Portal Technologies • ‘Soft Goods’ Sales • Visualization • Security & Access Control

  16. Questions? • What’s the benefit for Oracle compared with BLAST? • Are there any other technologies required for the Bioinformatics database platform? • Is there anything Darwin can’t do for data mining in the bioinformatics database?

More Related