MIAME and ArrayExpress – a standard for microarray gene expression data and the public database at EBI

  1. MIAME and ArrayExpress – a standard for microarray gene expression data and the public database at EBI • Microarray Informatics Team, EMBL-EBI (European Bioinformatics Institute) • Transcriptome Symposium, April 2002, CHU Pitié-Salpêtrière, Université Paris VI

  2. Why have a public database? • EMBL-EBI is a centre for research and services in bioinformatics that builds and maintains public databases: EMBL Nucleotide Sequence Database, SWISS-PROT, Ensembl, MSD, etc. • Practical reasons: • Easy data access • Resolves local storage issues • Common data exchange formats can be developed • Scientific reasons: • Curation can be applied • Annotation can be controlled • Additional information that is missing from publications can be stored • Data comparison is improved! • Public standards can be applied

  • MIAME annotation challenge: MGED BioMaterial Ontology • Uses of MIAME concepts: • ArrayExpress: a public repository for gene expression data • MIAMExpress: a submission and annotation tool

  5. Standard for microarray data - Why? • Size of datasets • Different platforms - nylon, glass • Different technologies - oligos, spotted • References to external databases are not stable! • Array annotation • Sample annotation • Data sharing needs a standardized way to annotate and record the information!

  6. Standard for microarray data - MGED Group • Microarray Gene Expression Data (MGED) Group: EBI + the world’s largest microarray labs and companies (Sanger, Stanford, TIGR, Université d'Aix-Marseille II, Affymetrix, Agilent, NCBI, DDBJ, etc.) • The MGED Group aims to: • Facilitate the adoption of standards for: experiment annotation, data representation • Introduce standards for: experimental controls, data normalization methods

  7. General MIAME principles • Minimum Information About a Microarray Experiment • NOT a formal specification BUT a set of guidelines • Sufficient information must be recorded to: • Correctly interpret and verify the results • Replicate the experiments • Structured information must be recorded to: • Query and correctly retrieve the data • Analyse the data • Reference: Brazma et al., Nature Genetics, 2001

  8. MIAME: 6 parts of a microarray experiment • Sample: sample source, sample treatments, extraction protocol, labeling protocol • Hybridisation: hybridization protocol • Array: array design information, location of each element, description of each element • Image: scanning protocol, software specifications • Quantification matrix: analysis protocol, software specifications
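The groups of information named on this slide can be read as a checklist. The following is a minimal sketch in Python, assuming a submission is held as nested dictionaries; the key names are hypothetical, derived from the slide's labels, and are not an official MIAME or MAGE-ML schema.

```python
# Hypothetical checklist built from the labels on the slide.
# This is NOT an official MIAME schema, only an illustration.
REQUIRED = {
    "sample": ["source", "treatments", "extraction_protocol", "labeling_protocol"],
    "hybridisation": ["hybridization_protocol"],
    "array": ["design_information", "element_locations", "element_descriptions"],
    "image": ["scanning_protocol", "software_specifications"],
    "quantification_matrix": ["analysis_protocol", "software_specifications"],
}


def missing_fields(record):
    """Return 'part.field' names that are absent or empty in the record."""
    missing = []
    for part, fields in REQUIRED.items():
        supplied = record.get(part, {})
        for name in fields:
            if not supplied.get(name):
                missing.append(f"{part}.{name}")
    return missing


# Example: a record that only describes the sample so far.
record = {
    "sample": {
        "source": "liver",
        "treatments": "fenofibrate",
        "extraction_protocol": "protocol text",
        "labeling_protocol": "protocol text",
    }
}
gaps = missing_fields(record)
```

A check of this kind could run at submission time so that incomplete annotation is caught at the source rather than during curation.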

  9. MIAME: 6 parts of a microarray experiment • [Diagram: an Experiment comprising several Hybridisations, Samples and Arrays; Normalisation (strategy, algorithm, control array elements); Final data] • 3 data processing levels • Lack of gene expression measurement units!

  10. MIAME – Annotation challenge • Annotation implementations are required! • Avoid/reduce free-text descriptions • Use of controlled terms • Definitions and sources for each term • Removal of synonyms, or use of synonym mappings • Data curation at source (LIMS) • Integration of controlled terms in query interfaces • Facilitate data queries and analysis…
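The "synonym mappings" point on this slide can be illustrated with a small lookup. This is a sketch only: the synonym table and the function name are assumptions made for illustration and are not taken from any MGED vocabulary.

```python
# Hypothetical synonym table: free-text variants -> one controlled term.
SYNONYMS = {
    "mouse": "Mus musculus",
    "house mouse": "Mus musculus",
    "mus musculus": "Mus musculus",
    "hepatic tissue": "liver",
    "liver": "liver",
}


def normalise_term(free_text):
    """Map a free-text value to its controlled term, or None if unknown."""
    # Lower-case and collapse whitespace before the lookup.
    key = " ".join(free_text.lower().split())
    return SYNONYMS.get(key)


organism = normalise_term("  House   Mouse ")   # known synonym
unknown = normalise_term("rat")                 # not in the table
```

Returning `None` for unknown values, rather than passing the free text through, lets a LIMS or query interface flag the entry for curation.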

  11. A gene expression database from the data analyst’s point of view • Samples: sample annotations (source, treatment) • Genes and transcription units: array description (gene annotations) • Gene expression matrix: gene expression levels
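The analyst's view on this slide (a matrix of expression levels, with annotated genes as rows and annotated samples as columns) can be sketched as follows. All identifiers and numbers are made up for illustration.

```python
# Rows are genes, columns are samples; every value here is invented.
genes = ["geneA", "geneB", "geneC"]
samples = ["s1", "s2"]

gene_annotation = {
    "geneA": "kinase",
    "geneB": "transporter",
    "geneC": "unknown function",
}
sample_annotation = {
    "s1": {"source": "liver", "treatment": "fenofibrate"},
    "s2": {"source": "liver", "treatment": "none"},
}

# matrix[i][j] is the expression level of genes[i] in samples[j].
matrix = [
    [2.1, 0.9],
    [0.5, 0.4],
    [1.0, 3.2],
]


def expression_for(treatment):
    """Select the columns whose sample annotation matches the treatment."""
    cols = [j for j, s in enumerate(samples)
            if sample_annotation[s]["treatment"] == treatment]
    return {g: [matrix[i][j] for j in cols] for i, g in enumerate(genes)}


treated = expression_for("fenofibrate")
```

The query only works because the treatment is recorded as a structured value; with free-text sample descriptions the column selection would not be reliable.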

  12. Annotation – implementations required! • An ontology is needed to describe the sample: • Define controlled vocabularies and… • …use existing external ontologies • Integrate the ontology in LIMS and databases: • Develop a browser or interface for the ontology • Develop internal editing tools for the ontology • However, some free-text description is unavoidable

  14. What are a CV and an ontology? • Controlled vocabulary (CV): • A restricted set of terms used to describe something; in the simplest case it could be a list • An ontology is more than a CV: • It describes the relationships between the terms in a structured way, and provides semantics and constraints • It captures knowledge and makes it machine-processable
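The difference between the two can be shown with a toy example: a CV as a flat set of allowed terms, and an ontology as terms linked by an is-a relationship that a program can traverse. The terms below are illustrative and are not taken from the MGED Ontology.

```python
# A controlled vocabulary in the simplest case: a flat set of allowed terms.
SEX_CV = {"female", "male", "unknown"}

# A toy ontology: each term points to its parent via an is-a relationship.
# Term names are illustrative only.
IS_A = {
    "liver": "organ",
    "kidney": "organ",
    "organ": "organism part",
    "organism part": None,
}


def ancestors(term):
    """Return the chain of parents of a term, nearest first."""
    chain = []
    parent = IS_A[term]
    while parent is not None:
        chain.append(parent)
        parent = IS_A[parent]
    return chain


def is_a(term, candidate):
    """True if term equals candidate or descends from it."""
    return term == candidate or candidate in ancestors(term)


liver_chain = ancestors("liver")
```

With the relationships in place, a query for "organ" can also retrieve samples annotated as "liver", which a flat list cannot support.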

  15. MIAME, MAGE-ML and MGED Ontology • Define MIAME concepts and their relationships, incorporating MAML. • The goal is to generate a document that will provide a clear and common understanding of what should be reported and how. • The tables are a draft to form the basis for such a document. • Located at the Ontology Working Group home page. • [Diagram: Ontology, MIAME, MAGE-OM/ML]

  16. Sample annotation – MGED BioMaterial Ontology: an example • Sample source and treatment description, and its correct annotation using the MGED BioMaterial Ontology classes and corresponding external references: “Seven week old C57BL/6N mice were treated with fenofibrate. Liver was dissected out, RNA prepared…”

  17. MGED BioMaterial Ontology • Classes: BioMaterialDescription, Biosource Property, Organism, Age, DevelopmentStage, Sex, StrainOrLine, BiosourceProvider, OrganismPart, BioMaterialManipulation, EnvironmentalHistory, CultureCondition, Temperature, Humidity, Light, PathogenTests, Water, Nutrients, Treatment, CompoundBasedTreatment (Compound) (Treatment_application) (Measurement) • Instances: 7 weeks after birth; Female; Charles River, Japan; 22 ± 2 °C; 55 ± 5%; 12 hours light/dark cycle; specified pathogen free conditions; ad libitum; MF, Oriental Yeast, Tokyo, Japan; in vivo, oral gavage; 100 mg/kg body weight • External references: NCBI Taxonomy (Mus musculus musculus, id: 39442); Mouse Anatomical Dictionary (Stage 28); International Committee on Standardized Genetic Nomenclature for Mice (C57BL/6); Mouse Anatomical Dictionary (Liver); ChemIDplus (Fenofibrate, CAS 49562-28-9)
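Part of this annotation can be written down as structured data. The sketch below is an assumption-laden illustration: the pairing of each ontology class with its value and external source is inferred from the slide's layout, and the dictionary format is hypothetical, not MAGE-ML.

```python
# Class -> value pairs inferred from the slide; the layout is hypothetical.
# Entries with a "source" key carry an external reference.
annotation = {
    "Organism": {
        "value": "Mus musculus musculus",
        "source": "NCBI Taxonomy",
        "id": "39442",
    },
    "StrainOrLine": {
        "value": "C57BL/6",
        "source": "International Committee on Standardized "
                  "Genetic Nomenclature for Mice",
    },
    "OrganismPart": {"value": "Liver", "source": "Mouse Anatomical Dictionary"},
    "Compound": {
        "value": "Fenofibrate",
        "source": "ChemIDplus",
        "id": "CAS 49562-28-9",
    },
    "Sex": {"value": "Female"},
    "Age": {"value": "7 weeks after birth"},
}


def externally_referenced(ann):
    """List the classes whose value is backed by an external source."""
    return sorted(name for name, entry in ann.items() if "source" in entry)


referenced = externally_referenced(annotation)
```

Keeping the source next to each value makes it possible to check which parts of a description rest on an external vocabulary and which are local instances.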

  Sample annotation: MGED BioMaterial Ontology

  19. Uses of MIAME concepts • Specifies the content of the information: • Sufficient • Structured • Uses: • Creation of MIAME-compliant LIMS or databases, e.g. ArrayExpress • Development of submission/annotation tools for generating MIAME-compliant information, e.g. MIAMExpress

  20. ArrayExpress – data flow • [Diagram components: Users, EBI Web server, Browse-Query, Submission, MIAMExpress, LIMS, MAGE-ML, Loader, Curation database, ArrayExpress, Central database, Data warehouse, Image server, Update, Output]

  21. Resources and… messages • Open-source resources: • ArrayExpress and MIAMExpress schema, access to code • MIAME document and glossary • MAGE-ML DTD and annotation examples • MGED Ontology and other resources… • Be aware of MIAME! • Nature, Lancet and … have already expressed their interest • Funding agencies • Join MGED meetings, tutorials and mailing lists: • MGED-5 meeting in Japan (Sept. 2002) • Ontology for BioSample description, EBI (Nov. 2002)