1 / 49

Databases for Microarrays

Databases for Microarrays. Vidhya Jagannathan SIB, Lausanne. Overview. Microarray data in a nutshell Why databases? What data to represent? What is a database? Different data models E-R modelling Microarray Databases Standards being developed. Microarray Experiment.

bran
Download Presentation

Databases for Microarrays

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Databases for Microarrays Vidhya Jagannathan SIB, Lausanne

  2. Overview • Microarray data in a nutshell • Why databases? • What data to represent? • What is a database? • Different data models • E-R modelling • Microarray Databases • Standards being developed Vidhya Jagannathan, SIB, Lausanne

  3. Microarray Experiment Vidhya Jagannathan, SIB, Lausanne

  4. Microarray Data in a Nutshell • Lots of data to be managed before and after the experiment. • Data to be stored before the experiment . • Description of the array and the sample. • Direct access to all the cDNA and gene sequences, annotations, and physical DNA resources. • Data to be stored after the experiment • Raw Data - scanned images. • Gene Expression Matrix - Relative expression levels observed on various sites on the array. • Hence we can see that database software capable of dealing with larger volumes of numeric and image data is required. Vidhya Jagannathan, SIB, Lausanne

  5. Why Databases? • Tailored to datatype • Tailored to the Scientists • Intuitive ways to query the data • Diagrams, forms, point and click, text etc. • Support for efficient answering of queries. • Query optimisation, indexes, compact physical storage. Vidhya Jagannathan, SIB, Lausanne

  6. Data Representation • Goal: Represent data in an intuitive and convenient manner • Without unnecessary replication of information • Making it easy to write queries to find required information • Supporting efficient retrieval of required information Vidhya Jagannathan, SIB, Lausanne

  7. What is a Database? • A database is an organised collection of pieces of structured electronic information. • Example 1: Libraires use a database system to keep track of library inventory and loans. • Example 2: All airlines use database system to manage their flights and reservations. • The collection of records kept for a common purpose such as these is known as a database. • The records of the database normally reside on a hard disk and the records are retrieved into computer memory only when they are accessed. • So the reasons are obvious why we need to discuss about a Microarray database. Vidhya Jagannathan, SIB, Lausanne

  8. Data Models • Describes a container for data and methods to store and retrieve data from that container. • Abstract math algorithms and concepts. • Cannot touch a data model. • Very useful Vidhya Jagannathan, SIB, Lausanne

  9. Types of Data Models • Ad-hoc file formats (not really data models!) • Relational data model • Object-relational data model • Object-oriented data model • XML (Extensible Markup Language) Vidhya Jagannathan, SIB, Lausanne

  10. Ad-hoc File Formats • The various 'ad-hoc' file formats in use for microarray data are: • Flat file formats. • Spread sheet formats. • Not the least - Even MS-Word documents !!! • Very rudimentary method to store data . • Sometimes contains redundant information. • Extremely inefficient for retrieval of particular subsets of the results. Vidhya Jagannathan, SIB, Lausanne

  11. Relational Data Model • Most prevalent and used in many databases developed today. The collection of related information is represented as a set of tables. Data value is stored in the intersection of row and column Column values are of the same kind. A Simple data validation. Rows are unique. So no data redundancy and every row is meaningful and can be identified by the unique key. Utilises Structured Query Language (SQL) for data storage, retrival and manipulation. Vidhya Jagannathan, SIB, Lausanne

  12. GENE GENE_ID CONTIG_ID CONTIG_START CONTIG_END CONTIG_STRAND GB2VN NT_0106058.3 2354807 2360778 Complement GB2VN32 NT_0106051.3 2308745 2321072 Complement Example Table Row or Record Field or Column Vidhya Jagannathan, SIB, Lausanne

  13. GENE GENE_ID CONTIG_ID CONTIG_BEGIN CONTIG_END CONTIG_STRAND GB2VN NT_0106058.3 2354807 2360778 Complement GB2VN32 NT_0106051.3 2308745 2321072 Complement CLASSIFICATION CLASS_ID GENE_ID DESCRIPTION TYPE MSX2 GB2VN MSH(Drosophila) Gene GO:0003677 GB2VN32 DNA binding Molecular function Example Vidhya Jagannathan, SIB, Lausanne

  14. Advantages of Relational Model • Allows information to be broken up into logical units and stored in tables. • Allows combining data from different tables in different ways to derive useful information. • Great for queries involving information from multiple original sources. • Can easily gather related information. • e.g. information about a particular gene from multiple datasets/experiments Vidhya Jagannathan, SIB, Lausanne

  15. Object Oriented Model • Object Oriented Model allows real world data to be represented as objects. • Objects encapsulate the data and provide methods to access or manipulate it. • Objects with specific structure and set of methods are said to belong to the object class. • Allows new classes to be created by extending the description of the parent class. • Child classes inherit the data and methods of the parent class. Vidhya Jagannathan, SIB, Lausanne

  16. Biomolecule . String seq get_bio_seq() Inherits Inherits DNA Protein DNA Protein DNA Protein Example OODBMS Vidhya Jagannathan, SIB, Lausanne

  17. Example - ArrayExpress Vidhya Jagannathan, SIB, Lausanne

  18. Database Design Entity-Relationship Concept Relationship Entity A Entity B Examples Vidhya Jagannathan, SIB, Lausanne

  19. Gene gene_id sequence sequence gene_id Entities • are real world objects • ex: gene • contain attributes • ex: gene_id, sequence • are drawn as rectangle boxes that holds the name of the entity and attribute in two different notations as there is no standard! Gene notation 2 Vidhya Jagannathan, SIB, Lausanne notation 1

  20. Relationship • Relationships provide connections between two or more entities • ex: Which genes were used in which experiment • When two entities are involved in a relationship, it is known as binary relationship. • When three entities are invoved in a relationship, it is called as ternary relationship. • When more than three entities are involved in a relationship, it is usually broken in to one or more binary or ternary relationships. • are drawn as a line linking the involved entities as: used_in Gene Experiment Vidhya Jagannathan, SIB, Lausanne

  21. Experimenter Experimenter-Id Name E-mail Dept. Institution Gene gene-id sequence Expression-value Sample Sample-Id Organism Cell-type {Drug-Ids} value Experiment Experiment-Id Date Image Array Array-Id Manufacturer Type Batch 1 * Example E-R Diagram Expt-Exptr Expt-Sample ¥Multivalued attribute Notation Expt-Array Many-to-one Vidhya Jagannathan, SIB, Lausanne

  22. Improved relational model by adding some features from object data models. Information is represented as in relational models but column values not restricted to one mutliple values are allowed. Example (sample table in previous slides): Object relational data model Vidhya Jagannathan, SIB, Lausanne

  23. Queries, queries, queries!! • Given a collection of microarray generated gene expression data, what kind of questions the users wish to pose. • Constructing an extensive list of possible interesting queries and data mining problems that has to be supported by the database will facilitate the design process. Vidhya Jagannathan, SIB, Lausanne

  24. Queries, queries, queries!! • Query to the data • Which genes are linked ? • Which genes are expressed similarly to my gene XYZ? • Which genes have a changed the expression in a second condition ? • Which genes are co-expressed in differing conditions ? • classification (of tumors, diseased tissues etc.): which patterns are characteristic for a certain class of samples, which genes are involved? Vidhya Jagannathan, SIB, Lausanne

  25. More Queries !!! • Queries that add a link in additional knowledge • functional classification of genes: Are changes clustered in particular classes? • metabolic pathway information: Is a certain pathway/route in a pathway affected? • disease information & clinical follow up: correlation to expression patterns. • phenotype information for mutants: Are there correlations between particular phenotypes and expression patterns? Vidhya Jagannathan, SIB, Lausanne

  26. More Queries !!! • in what region is the interesting gene located in the genome? • is there synteny in this region with other species? • is there a known trait that maps to this region? Vidhya Jagannathan, SIB, Lausanne

  27. Query Language • Language in which user requests information from the database. • SQL • Data definition helps you implement your model and data manipulation helps you modify and retrive data • Advantages: • Can specify query declaratively and let database system figure out best way of finding answers • Supports queries of medium complexity • Specialized languages • SQL language statements are not abstract but very close to spoken language. Vidhya Jagannathan, SIB, Lausanne

  28. Basic SQL Queries • Find the image for experiment number 1345 • select image from experiment where experiment-id = 1345; • Find the experiment-id and image of all experiments involving e-coli • select experiment-id, image from experiment, sample where experiment.sample-id = sample.sample-id and sample.organism = `e.coli‘; • All combinations of rows from the relations in the from clause are considered, and those that satisfy the where conditions are output Vidhya Jagannathan, SIB, Lausanne

  29. Interfacing • SQL queries are carried out on terminal screen which is not very useful and user friendly for an end user, so applications are created to interface more friendly staments with the SQL statements • A web form is a typical example of interface for SQL • Applications for data loading. • More complex queries (e.g. data mining such as classification and clustering) are very imporatant part of the Microarray Analyis Protocol • It is very important to interface the various applications we use to analyse the retrieved data with database. Vidhya Jagannathan, SIB, Lausanne

  30. Vidhya Jagannathan, SIB, Lausanne

  31. Gene Expression Databases Require Integration • There are many different types of data presenting numerous relationships. • There are a number of Databases with lots of information. • Experiments need to be compared because the experiments are very difficult to perform and very expensive. • Solution: Make all the databases talk the same language. • XML was the choice of data interchange format. Vidhya Jagannathan, SIB, Lausanne

  32. Why XML? • Why XML ?: XML provides the method for defining the meaning or semantics of data. • Example : A XML file of the earlier table we defined <gene_features> <gene_id>GBVN32</gene_id> <contig_id>NT_010651</contig_id> <contig_start>2354807</contig_start> <contig_end>2360778</contig_end> <contig_strand>Complement</contig_strand> </gene_features> Vidhya Jagannathan, SIB, Lausanne

  33. Mapping XML to Relational Database • The Data Structure in XML is defined in Document Type Descrciptor as follows <!ELEMENT gene_id (#PCDATA)> <!ELEMENT contig_id (#PCDATA)> <!ELEMENT contig_start (#PCDATA)> <!ELEMENT contig_end (#PCDATA)> <!ELEMENT contig_sequence (#PCDATA)> • This kind of DTD also helps us to have control over the vocabulary used. • SQL: create table gene ( gene_id varchar(5) primary key, contig_id varchar(10) not null, contig_start integer not null, contig_end integer not null, contig_sequence text not null); • So the DTD can be directly mapped into a relational database. Vidhya Jagannathan, SIB, Lausanne

  34. MAGE-ML As Data Interchage Format Expression Data Converter (program) MAGE-ML Databases Vidhya Jagannathan, SIB, Lausanne

  35. Existing Microarray Databases • Several gene expression databases exist:Both commercial and non-commercial. • Most focus on either a particular technolgy or a particular organism or both. • Commercial databases: • Rosetta Inpharmatics and Genelogic, the specifics of their internal structure is not available for internal scrutiny due to their proprietary nature. • Some non-commercial efforts to design more general databases merit particular mention. • We will discuss few of the most promising ones • ArrayExpress - EBI • The Gene expression Omnibus (GEO) - NLM • The Standford microarray Database • ExpressDB - Harvard • Genex - NCGR Vidhya Jagannathan, SIB, Lausanne

  36. Vidhya Jagannathan, SIB, Lausanne

  37. ArrayExpress • Public repository of microarray based gene expression data. • Implemented in Oracle at EBI. • Contains: • several curated gene expression datasets • possible introduction of an image server to archive raw image data associated with the experiments. • Accepts submissions in MAGE-ML format via a web-based data annotation/submission tool called MIAMExpress. • A demo version of MIAMExpress is available at: http://industry.ebi.ac.uk/~parkinso/subtool/subtype.html • Provides a simple web-based query interface and is directly linked to the Expression Profiler data analysis tool which allows expression data clustering and other types of data exploration directly through the web. Vidhya Jagannathan, SIB, Lausanne

  38. Gene Express Omnibus • The Gene Expression Omnibus ia a gene expression database hosted at the National library of Medicine • It supports four basic data elements • Platform ( the physical reagents used to generate the data) • Sample (information about the mRNA being used) • Submitter ( the person and organisation submitting the data) • Series ( the relationship among the samples). • It allows download of entire datasets, it has not ability to query the relationships • Data are entered as tab delimited ASCII records,with a number of columns that depend on the kind of array selected. • Supports Serial Analysis of Gene Expression (SAGE) data. Vidhya Jagannathan, SIB, Lausanne

  39. Stanford Microarray Database • Contains the largest amount of data. • Uses relational database to answer queries. • Associated with numerious clustering and analysis features. • Users can access the data in SMD from the web interface of the package. • Disadvantage : • It supports only Cy3/Cy5 glass slide data • It is designed to exclusively use an oracle database • Has been recently released outside without anykind of support !! Vidhya Jagannathan, SIB, Lausanne

  40. MaxdSQL • Minor changes to the ArrayExpress object data model allowed it to be instantiated as a relational database, and MaxdSQL is the resulting implementation. • MaxdSQL supports both Spotted and Affymetrix data and not SAGE data. • MaxdSQL is associated with the maxdView, a java suite of analysis and visualisation tools.This tool also provides an environment for developing tools and intergrating existing software. • MaxdLoad is the data-loading application software. Vidhya Jagannathan, SIB, Lausanne

  41. Vidhya Jagannathan, SIB, Lausanne

  42. GeneX • Open source database and integrated tool set released by NCGR http://www.ncgr.org. • Open source - provides a basic infrastructure upon which others can build. • Stores numeric values for a spot measurement (primary or raw data), ratio and averaged data across array measurements. • Includes a web interface to the database that allow users to retrieve: • Entire datasets, subsets • Guided queries for processing by a particular analysis routine • Download data in both tab delimited form and GeneXML format ( more descriptions later) Vidhya Jagannathan, SIB, Lausanne

  43. Vidhya Jagannathan, SIB, Lausanne

  44. ExpressDB • ExpressDB is a relational database containing yeast and E.coli RNA expression data. • It has been conceived as an example on how to manage that kind of data. • It allows web-querying or SQL-querying. • It is linked to an integrated database for functional genomics called Biomolecule Interaction Growth and Expression Database (BIGED). • BIGED is intended to support and integrate RNA expression data with other kinds of functional genomics data Vidhya Jagannathan, SIB, Lausanne

  45. Survey of existing microarray systems This survey is based on the article published in BRIEFINGS IN BIOINFORMATICS, Vol 2, No 2, pp 143-158, May 2001: ®A comparison of microarray databases Vidhya Jagannathan, SIB, Lausanne

  46. The Microarray Gene Expression Database Group (MGED) History and Future: • Founded at a meeting in November, 1999 in Cambridge, UK. • In May 2000 and March 2001: development of recommendations for microarray data annotations (MAIME, MAML). • MGED 2nd meeting: • establishment of a steering committee consisting of representatives of many of the worlds leading microarray laboratories and companies • MGED 4th meeting in 2002: • MAIME 1.0 will be published • MAML/GEML and object models will be accepted by the OMG • concrete ontology and data normalization recommendations will be published. • information can be obtained from http://www.mged.org Vidhya Jagannathan, SIB, Lausanne

  47. The Microarray Gene Expression Database Group (MGED) Goals: Facilitate the adoption of standards for DNA-array experiment annotation and data representation. Introduce standard experimental controls and data normalization methods. Establish gene expression data repositories. Allow comparision of gene expression data from different sources. Vidhya Jagannathan, SIB, Lausanne

  48. MGED Working Groups Goals: MIAME: Experiment description and data representation standards - Alvis Brazma MAGE: Introduce standard experimental controls and data normalization methods - Paul Spellman. This group includes the MAGE-OM and MAGE-ML development. OWG: Microarray data standards, annotations, ontologies and databases - Chris Stoeckert NWG: Standards for normalization of microarray data and cross-platform comparison - Gavin Sherlock Vidhya Jagannathan, SIB, Lausanne

  49. References • URL: • Tutorial on Information Management for Genome Level Bioinformatics, Paton and Goble, at VLDB 2001: http://www.dia.uniroma3.it/~vldbproc/#tutEuropea • European Molecular Biology Network http://www.embnet.org/ • Univ. Manchester site (with relational version of Microarray data representation, and links to other sites) • http://www.bioinf.man.ac.uk • Database textbook with absolutely no bioinformatics coverage • For Microarray Data • http://linkage.rockefeller.edu/wli/microarray/ Vidhya Jagannathan, SIB, Lausanne

More Related