Bioinformatics Databases: Fundamental Concepts of Database Technology & Data Organization

Presentation Transcript


  1. Bioinformatics Databases: Fundamental Concepts of Database Technology & Data Organization
     Kristen Anton, Director of BioInformatics, Dartmouth Medical School
     BioInformatics @ Dartmouth Medical School

  2. How can data be organized?
     • Paper (e.g. in notebooks)
     • Flat files
       • Collection of data records
       • Minimal structure, no metadata
       • Application program must contain relationship information
     • Database
       • Hierarchical
       • Network
       • Relational

  5. What is a relational database?
     A database composed of relations and conforming to a set of principles governing how such relations are supposed to behave (“Codd’s 12 Rules”). There are many database systems that use tables but don’t conform to all of the principles. These are often called “semirelational” systems.
     (from Understanding SQL, Martin Gruber)

  6. Practically speaking...
     • A database is a body of information stored in two dimensions (rows and columns)
     • Rows are records
     • Columns are attributes of those record entities (usually!)
     • The groups of rows and columns, or tables, are largely independent of each other
     • The power of the database lies in the relationships that you construct among the tables
     • A database is self-describing: it contains metadata, which is a description of its own structure
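
The "self-describing" point can be checked in most SQL engines by querying the catalog. Below is a minimal sketch using Python's built-in sqlite3 module. The table and column names (`sequence`, `sname`, `length`) are invented for illustration, and other database systems expose their metadata through different catalog tables.

```python
import sqlite3

# In-memory database with one illustrative table (names are assumptions).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sequence (sname TEXT, length INTEGER)")
conn.execute("INSERT INTO sequence VALUES ('seq1', 1200)")

# sqlite_master is SQLite's catalog: the database stores its own structure.
tables = [row[0] for row in conn.execute(
    "SELECT name FROM sqlite_master WHERE type = 'table'")]

# PRAGMA table_info yields one row per column:
# (cid, name, type, notnull, dflt_value, pk)
columns = [(row[1], row[2]) for row in conn.execute("PRAGMA table_info(sequence)")]

print(tables)   # ['sequence']
print(columns)  # [('sname', 'TEXT'), ('length', 'INTEGER')]
```

The data row and the description of the table live in the same database file, which is what distinguishes this from a flat file whose structure is known only to the application.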

  7. What is a Database Management System (DBMS)?
     • A set of programs which define, administer and process databases and their associated applications
     • A scalable DBMS can run on multiple platforms (varying sizes)
     • A DBMS that supports interoperability uses industry-standard language and standard ways of exchanging data -> open source
     Examples: Oracle, Sybase, MySQL, MS Access, PostgreSQL, 4D, Filemaker …

  8. Features of a Relational Database
     • Rows (records) are in no particular order
     • Columns (fields) are ordered, numbered and named; names should indicate the content of the field
     • Primary key uniquely identifies each row; it ensures that no row is empty and that every row is different from every other row
     • Two-step commit process
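
A small sketch of the primary-key and row-order points, again with Python's sqlite3. The `gene` table and its contents are hypothetical. The second insert reuses an existing key and is rejected, and the final query asks for an explicit order because the table itself promises none.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# gene_id is the primary key; gname may not be empty (NULL).
conn.execute("CREATE TABLE gene (gene_id INTEGER PRIMARY KEY, gname TEXT NOT NULL)")
conn.execute("INSERT INTO gene VALUES (1, 'TP53')")

# A second row with the same key violates uniqueness and is refused.
try:
    conn.execute("INSERT INTO gene VALUES (1, 'BRCA1')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

conn.execute("INSERT INTO gene VALUES (2, 'BRCA1')")

# Rows have no inherent order, so any required order is stated in the query.
names = [row[0] for row in conn.execute("SELECT gname FROM gene ORDER BY gname")]

print(duplicate_rejected)  # True
print(names)               # ['BRCA1', 'TP53']
```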

  9. Features of a Relational Database
     • A view is a subset of the database that an application (or user) can process
     • The database schema is the structure of the entire database
     • A constraint is a condition you apply to an attribute of a table
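
The three terms on this slide can be seen side by side in a short sqlite3 sketch. The table `known_genes`, the view `long_genes`, the 1000-base cutoff and the sample rows are all assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Constraint: a condition applied to one attribute (length must be positive).
    CREATE TABLE known_genes (
        gname  TEXT PRIMARY KEY,
        length INTEGER CHECK (length > 0)
    );
    -- View: a named subset of the database that a user or application can query.
    CREATE VIEW long_genes AS
        SELECT gname, length FROM known_genes WHERE length >= 1000;
    INSERT INTO known_genes VALUES ('geneA', 450), ('geneB', 2300);
""")

view_rows = conn.execute("SELECT gname, length FROM long_genes").fetchall()

# The CHECK constraint rejects a negative length.
try:
    conn.execute("INSERT INTO known_genes VALUES ('geneC', -5)")
    constraint_enforced = False
except sqlite3.IntegrityError:
    constraint_enforced = True

# Schema: the stored definitions of every table and view in the database.
schema = [row[0] for row in conn.execute(
    "SELECT sql FROM sqlite_master WHERE type IN ('table', 'view') ORDER BY name")]

print(view_rows)            # [('geneB', 2300)]
print(constraint_enforced)  # True
```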

  10. Relationships between tables
      • One-to-One, Many-to-One, Many-to-Many
      • A “join” is an operation that combines data from multiple tables into a single result table
      • An E-R (entity-relationship) diagram is the basic graphic to describe the structure of a database
      SELECT Sequence.sname, KnownGenes.gname, KnownGenes.length
      FROM Sequence, KnownGenes
      WHERE KnownGenes.length = Sequence.length
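
To see what the slide's query returns, it can be run against two toy tables. The sketch below uses Python's sqlite3. The table and column names come from the slide, while the sample rows are invented. The second query is the same join written with explicit `JOIN ... ON` syntax.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Sequence   (sname TEXT, length INTEGER);
    CREATE TABLE KnownGenes (gname TEXT, length INTEGER);
    INSERT INTO Sequence   VALUES ('seq1', 1200), ('seq2', 800);
    INSERT INTO KnownGenes VALUES ('geneA', 1200), ('geneB', 500);
""")

# The query from the slide: pair each sequence with known genes of equal length.
rows = conn.execute("""
    SELECT Sequence.sname, KnownGenes.gname, KnownGenes.length
    FROM Sequence, KnownGenes
    WHERE KnownGenes.length = Sequence.length
""").fetchall()

# Equivalent form with an explicit join condition.
rows_explicit = conn.execute("""
    SELECT Sequence.sname, KnownGenes.gname, KnownGenes.length
    FROM Sequence
    JOIN KnownGenes ON KnownGenes.length = Sequence.length
""").fetchall()

print(rows)  # [('seq1', 'geneA', 1200)]
```

Only `seq1` and `geneA` share a length, so the result table has a single row combining columns from both source tables.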

  11. E-R Diagram

  12. The tool for communicating with relational databases: SQL
      • Structured Query Language (SQL)
      • A query is a question you ask the database, and SQL retrieves the appropriate answer set
      • Interactive SQL (command line) vs. RAD tool/GUI
      • Standardization issue: ANSI (American National Standards Institute)

  13. Data Types
      • Data types determine which functions are possible between related fields
      • Each field is assigned one data type (imposes structure on data)
      • Examples: text (CHAR, VARCHAR), number (INT, DEC), date, time, money, binary
      • Standardization issue: ANSI (American National Standards Institute)

  14. A word about database design:
      • Designing a database is not trivial
      • The value is not only in the data, but also in the structure
      • Design to facilitate the retrieval and interpretation of the data

  15. Design database for data extraction: think it through
      • Relationships ease extraction and/or reporting of data from the system
      • Redundancy
      • Concept of attributes in rows instead of columns
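
One common reading of "attributes in rows instead of columns" is the entity-attribute-value layout, where each row holds a single (entity, attribute, value) triple. The slide does not name the pattern, so treat this as an interpretation. The `observation` table and its contents below are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- One row per (sample, attribute) pair instead of one column per attribute.
    CREATE TABLE observation (
        sample_id TEXT,
        attribute TEXT,
        value     TEXT,
        PRIMARY KEY (sample_id, attribute)
    );
    INSERT INTO observation VALUES
        ('S1', 'tissue', 'colon'),
        ('S1', 'age',    '61'),
        ('S2', 'tissue', 'colon'),
        ('S2', 'smoker', 'yes');
""")

# Pivot the rows back into one record per sample for reporting.
wide = {}
for sample_id, attribute, value in conn.execute(
        "SELECT sample_id, attribute, value FROM observation"):
    wide.setdefault(sample_id, {})[attribute] = value

print(wide["S1"])  # {'tissue': 'colon', 'age': '61'} (key order may vary)
```

The design trade-off is that new attributes (here `smoker`) need no schema change, but every report has to pivot the rows back into columns, and all values share one data type.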

  18. Example: BioInformatics Core Technology
      • Reusable ‘core’ modules, with customizable components
      • Standard business logic framework controls transactions (middle layer)
      • Metadata-based back-end data storage (facilitates data sharing)

  19. BioInformatics Core Technology

  20. Data Security: High Priority
      HIPAA, FIPS 140-2, IRB requirements …

  21. Life science has become a field which generates an enormous amount of un-integrated data. How can methods for data organization help to solve this problem?

  22. What is Data Integration?
      • Creating a system which allows the extraction of a piece or set of information (query result) across multiple domains (possibly disparate data sources: flat files, databases, spreadsheets, URLs...)
      or
      • Pooling data to create power for detection of small signals

  23. Sample integration problem: Cancer Biomarker Discovery
      • Clinical center collects blood samples from 1000 individuals with colon cancer
      • Expression analysis reveals that protein ‘x’ is over-expressed in these samples, relative to controls
      • Could protein ‘x’ be a colon cancer biomarker?

  24. Understanding transcription factors for protein ‘x’ production
      Show me all genes in the public literature that are putatively related to protein ‘x’, have more than 4-fold expression differential between affected and normal tissue and are homologous to known transcription factors.
      Q1: Find homologs (SEQUENCE)
      Q2: Find genes with 4-fold differential (EXPRESSION)
      Q3: Show me genes in public literature (LITERATURE)
      Result: (Q1 ∩ Q2 ∩ Q3)
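
Since the question requires a gene to satisfy all three conditions at once, the combined answer is the intersection of the three sub-query results. A minimal Python sketch follows. The gene identifiers are placeholders, and in practice each set would come from a separate sequence, expression or literature source.

```python
# Placeholder result sets; real ones would be fetched from three different sources.
homologs = {"geneA", "geneB", "geneC"}       # Q1: homologous to known transcription factors
differential = {"geneB", "geneC", "geneD"}   # Q2: more than 4-fold expression differential
literature = {"geneB", "geneC", "geneE"}     # Q3: linked to protein 'x' in the literature

# A gene qualifies only if it appears in all three answers.
candidates = homologs & differential & literature

print(sorted(candidates))  # ['geneB', 'geneC']
```

The hard part of integration is not this final step but getting the three sources to use comparable gene identifiers, so that the sets can be intersected at all.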

  25. Key components to integration
      • Accessing without modifying original data sources
      • Handling redundant, conflicting, missing, changing (versions) data
      • Normalizing analytical data from different data sources
      • Conforming terminology to industry standards
      • Accessing the integrated data as a single repository
      • Including metadata in repository

  26. Approaches to Integration: where are the key issues addressed?
      • Federated database (poses constraints on original data sources; fragility in reliance on source systems)
      • Data warehousing (ETL layer, original data sources untouched, requires understanding of domain, sophisticated update/archive processes)
      • Integrating data source profiles
      • Indexed flat files
      • Others …

  27. Data Warehousing

  31. Metadata: one key to success
      • Describes data types, relationships, histories, etc.
      • Back-end (supports developers), front-end (supports users and application)
      Data value: 55
      Metadata values:
        Data element name: vehicle speed
        Unit: miles per hour
        Description: the average velocity of a vehicle
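
The vehicle-speed example can be expressed as a value that carries its metadata with it. This is only an illustrative sketch in Python; the class and field names (`DataElement`, `Measurement`, `describe`) are invented here and are not taken from the presentation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataElement:
    """Metadata describing what a stored value means."""
    name: str
    unit: str
    description: str


@dataclass(frozen=True)
class Measurement:
    """A raw value paired with the metadata needed to interpret it."""
    value: float
    element: DataElement

    def describe(self) -> str:
        return f"{self.element.name}: {self.value} {self.element.unit}"


speed = DataElement(
    name="vehicle speed",
    unit="miles per hour",
    description="the average velocity of a vehicle",
)
reading = Measurement(55, speed)

print(reading.describe())  # vehicle speed: 55 miles per hour
```

Without the `DataElement`, the bare value 55 could be a speed, an age or a temperature, which is the point the slide builds up to.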

  32. Standards: the final frontier
      • Naming conventions
      • Standard coordinate systems
      • Unify interpretations of single object types
      • Unify software solutions to the same problem (also data formats)
      • Standards for metadata (incompatible or missing metadata)

  33. Developing Standards for Life Sciences Research
      • Discovery science does not lend itself well to constraints (especially system constraints)
      • Decentralized data management infrastructure, competition
      • Wildly varying skill levels for data and information management
      Several groups (Bio-Ontologies, HGNC, OMG, etc.) and national research initiatives (EDRN, caBIG, etc.) are taking the lead in the effort to create ‘workable’ standards.

  34. New approach to integration: Cancer Biomarker Discovery
      • Network of distributed data ‘silos’ (does not perturb data sources)
      • Centralized query and ‘business logic’ servers, accessed through web interface
      • CORBA framework ‘manages’ XML profile definitions across the web
      • A profile is a set of resource definitions implemented in XML for data sources residing in one or more distributed systems
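
The slides do not show what a profile looks like, so the XML below is entirely made up: the element names, attributes and locations are assumptions chosen only to illustrate "a set of resource definitions implemented in XML". The sketch reads such a profile with Python's standard xml.etree.ElementTree module and does not involve CORBA.

```python
import xml.etree.ElementTree as ET

# Hypothetical profile: every element and attribute name here is invented.
PROFILE_XML = """
<profile id="hypothetical-site-1">
  <resource name="specimens" type="database" location="db://example/specimens"/>
  <resource name="expression" type="flatfile" location="file://example/expression.tsv"/>
</profile>
""".strip()

root = ET.fromstring(PROFILE_XML)

# Map each resource name to its (type, location) so a query server could route requests.
resources = {
    res.get("name"): (res.get("type"), res.get("location"))
    for res in root.findall("resource")
}

print(root.get("id"))        # hypothetical-site-1
print(sorted(resources))     # ['expression', 'specimens']
```

The profile describes where the data lives without copying or altering it, which matches the "does not perturb data sources" goal in the first bullet.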
