Functional Genomics Data and Expression look-up tools: ArrayExpress and Expression Atlas
Sarah Morgan, PhD (sarahm@ebi.ac.uk), Training Programme Manager, EMBL-EBI
Girona Workshop, 1st July 2014
In this session…
• What do we mean by “functional genomics data”?
• Two databases: ArrayExpress and Expression Atlas. Why not just one database?
• What’s in each database?
• How to search, interpret & download data?
[File formats pictured: fastq, txt, CEL, bam]
What is functional genomics (FG)?
• The aim of FG is to understand the function of genes and other parts of the genome
• FG experiments typically utilize genome-wide assays to measure and track many genes (or proteins) in parallel under different conditions
• High-throughput technologies such as microarrays and high-throughput sequencing (HTS) are frequently used in this field to interrogate the transcriptome
What biological questions is FG addressing?
• When and where are genes expressed?
• How do gene expression levels differ in various cell types and states?
• What are the functional roles of different genes and in what cellular processes do they participate?
• How are genes regulated?
• How do genes and gene products interact?
• How is gene expression changed in various diseases or following a treatment?
The two databases:
• ArrayExpress: www.ebi.ac.uk/arrayexpress (daily release at 6am UK time)
• Expression Atlas: www.ebi.ac.uk/gxa (monthly release); wwwdev.ebi.ac.uk/gxa (updated more frequently)
The two databases: how are they related?
[Diagram: direct submissions and imports from external databases (mainly NCBI Gene Expression Omnibus) feed ArrayExpress; expression data sets pass through curation and statistical analysis into Expression Atlas; the databases link out to analysis software and to other databases.]
The two databases: how do they compare?
Data volume in ArrayExpress
• ~50,511 experiments, ~1/5 direct submissions, the rest imported
• [Pie charts as of 19 February 2014: microarray vs HTS; RNA-, DNA-, ChIP-seq breakdown]
Data content in ArrayExpress: www.ebi.ac.uk/arrayexpress
• Curated data from direct submissions, available in a structured and standardised format, essential for easy data sharing
• Submissions are curated to the Functional Genomics Data Society (FGED)’s standards:
  • MIAME guidelines & MAGE-TAB format for microarray data
  • MINSEQE guidelines & MAGE-TAB format for HTS data
• Many experiments have supporting publications
Community standards for data requirements
• MIAME = Minimal Information About a Microarray Experiment (http://www.mged.org/Workgroups/MIAME/miame_2.0.html)
• MINSEQE = Minimal Information about a high-throughput Nucleotide SEQuencing Experiment (http://www.mged.org/minseqe)
• The checklist:
Reporting standards - MAGE-TAB format
A simple spreadsheet format that uses a number of tab-delimited text files:
• Array Design Format (ADF) file (microarray only)
  • Probe names, sequence, genomic mapping location
• Investigation Description Format (IDF) file
  • Experiment title + description
  • Submitter’s details
  • All protocols
• Sample and Data Relationship Format (SDRF) file
• Raw and processed data files
[Diagram: the SDRF links samples / sequencing libraries to hybridisation / sequencing assays, to raw data files (e.g. A1.CEL, 1.fq.gz, 2.fq.gz) and to processed data files (e.g. Data1.txt, Data2.txt, Normalized.txt).]
MAGE-TAB in FGED: http://www.mged.org/mage-tab/index.html
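Because the files are plain tab-delimited text, they can be read with standard tools. The sketch below parses a tiny, made-up IDF fragment, assuming the row-oriented layout in which each row starts with a heading and the values follow in the remaining cells. The example content and the `parse_idf` helper are hypothetical and are not part of any official parser.

```python
import csv
import io

# Hypothetical IDF fragment: one heading per row, values in the following
# tab-separated cells. Real IDF files contain many more rows.
EXAMPLE_IDF = (
    "Investigation Title\tExample experiment\n"
    "Person Last Name\tDoe\tRoe\n"
    "Protocol Name\tP-EXTRACT\tP-HYB\n"
)

def parse_idf(text):
    """Return {row heading: [values]} for a row-oriented, tab-delimited IDF."""
    fields = {}
    for row in csv.reader(io.StringIO(text), delimiter="\t"):
        if not row or not row[0].strip():
            continue  # skip blank lines
        # Keep only non-empty cells after the heading.
        fields[row[0].strip()] = [cell for cell in row[1:] if cell.strip()]
    return fields

idf = parse_idf(EXAMPLE_IDF)
print(idf["Investigation Title"])
print(idf["Protocol Name"])
```

For real submissions, the dedicated parsers (e.g. Limpopo or Bio::MAGETAB) are the safer choice, since they validate headings against the specification.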
Example IDF: expt. info and protocols
• Row headings from the MAGE-TAB spec, often with controlled vocabulary
• Submitter-supplied information
ArrayExpress Archive: when to use it?
• Find FG experiments that might be relevant to your research
• Download data and re-analyze it. Data deposited in public repositories can often be used to answer biological questions different from the one asked in the original experiment.
• Submit microarray or HTS data that you want to publish. Major journals require data to be submitted to a public repository like ArrayExpress as part of the peer-review process.
Browsing ArrayExpress: www.ebi.ac.uk/arrayexpress
ArrayExpress “Experiments” (= GEO “Series”)
• Sortable headings
• Data for all samples
Feature (1): Ontology-based search extension
Term suggestions from the Experimental Factor Ontology (EFO, www.ebi.ac.uk/efo)
Expt. factor: “intent” of the study
• The main variable(s) studied, related to the hypothesis or intent of the experiment, e.g. “disease” (diabetes patients vs healthy individuals)
• Values of a factor should vary among samples (e.g. “p53-/-”, “wild type”).
Experimental factor ontology (EFO): http://www.ebi.ac.uk/efo
• A way to systematically organise experimental factor terms: controlled vocabulary + hierarchy (relationship)
• Used in EBI databases and external projects (e.g. NHGRI GWAS Catalogue)
• Combines terms from a subset of well-maintained and compatible ontologies, e.g.
  • Gene Ontology (cellular component + biological process terms)
  • NCBI Taxonomy
Ontology in layman's terms: http://jamesmaloneebi.blogspot.co.uk/2012/06/common-ontology-questions-1-what-is-it.html
EFO in ArrayExpress data: http://www.ebi.ac.uk/efo
• expand on search terms when querying ArrayExpress (Expression Atlas support coming soon)
  • using synonyms (e.g. “cerebral cortex” = “adult brain cortex”)
  • using child terms (e.g. “bone” → “rib” and “vertebra”)
• promote consistency (e.g. F/female/, 1day/24hours)
• avoid ambiguity (e.g. “m” = ? or ?)
• facilitate automatic annotation and integration of external data (e.g. changing “gender” to “sex” automatically)
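The query expansion described above can be sketched in a few lines. This is only an illustration over a toy ontology built from the slide's two examples; the `SYNONYMS` and `CHILDREN` tables and the `expand_term` helper are hypothetical and do not reflect how ArrayExpress implements its search.

```python
# Toy ontology fragment taken from the examples on the slide.
# The real EFO is far larger and is not stored like this.
SYNONYMS = {"cerebral cortex": ["adult brain cortex"]}
CHILDREN = {"bone": ["rib", "vertebra"]}

def expand_term(term):
    """Return the term plus its synonyms and all descendant (child) terms."""
    expanded = [term]
    expanded.extend(SYNONYMS.get(term, []))
    queue = list(CHILDREN.get(term, []))
    while queue:
        child = queue.pop(0)
        if child not in expanded:
            expanded.append(child)
            # Follow the hierarchy downwards so grandchildren are included too.
            queue.extend(CHILDREN.get(child, []))
    return expanded

print(expand_term("bone"))
print(expand_term("cerebral cortex"))
```

A search for “bone” would then also match experiments annotated only with “rib” or “vertebra”.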
EFO marked-up search results
• Exact match to search term
• Matched EFO synonyms of search term
• Matched EFO child term of search term
More examples of EFO terms
• Sample attributes and experimental factor / factor values:
  • “genetic modification”, “kidney”, “diabetes”
  • “keratinocyte”, “arsenic oxide”, “potassium bromate”
  • “RNA-seq of coding RNA”
• ArrayExpress accession number, e.g. “E-MEXP-568”
• Secondary accession number, e.g. GEO series “GSE5389”
• Experiment title, description, e.g. “TG-GATEs”
• Submitter's email address
• Publication title, authors and journal name, PubMed ID
What other search terms can I use?
Feature (2): Advanced search (i.e. filters)
Task: Find experiments with rat liver samples that look at the effect of compounds
Feature (2): Advanced search (i.e. filters)
• Format of search term: field_name:search_term
• Hints:
• Some examples:
• https://www.ebi.ac.uk/arrayexpress/help/how_to_search.html#AdvancedSearchExperiment
Feature (2): Advanced search (i.e. filters)
sa:"liver" AND ef:"compound" AND organism:"Rattus norvegicus"
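When many such filters need to be combined, the query string can be assembled programmatically. The sketch below follows the field_name:search_term format with AND between filters. The `build_query` helper is hypothetical, and the field names are the ones used on the slide.

```python
def build_query(filters):
    """Join (field, value) pairs as field:"value" terms combined with AND."""
    return " AND ".join(f'{field}:"{value}"' for field, value in filters)

# Rat liver samples, with "compound" as an experimental factor.
query = build_query([
    ("sa", "liver"),                      # sample attribute
    ("ef", "compound"),                   # experimental factor
    ("organism", "Rattus norvegicus"),
])
print(query)
```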
Feature (3): Samples table with expt “factor”
Feature (3): Samples table with expt “factor”
• Sortable headings: very handy for these 8105 rows!
• Data download links for each sample/assay
Feature (4): programmatic access options
http://www.ebi.ac.uk/arrayexpress/help/programmatic_access.html
“I want to download data for 250 experiments in one go…”
1. REST / XML: http://www.ebi.ac.uk/arrayexpress/xml/v2/experiments?keywords="breast cancer cell line"
2. JSON: http://www.ebi.ac.uk/arrayexpress/json/v2/experiments?keywords="breast cancer cell line"
3. R/Bioconductor: “ArrayExpress” R package
4. FTP: ftp.ebi.ac.uk/pub/databases/microarray/data
(5.) MAGE-TAB parsers: Limpopo (Java, SourceForge) and Bio::MAGETAB (Perl, CPAN)
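A query URL containing spaces and quotes has to be percent-encoded before it is sent. The sketch below builds such URLs with Python's standard library only and makes no network request. The endpoint paths are those quoted on the slide and may have changed since 2014, and the `experiments_url` helper is a hypothetical name.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Base address and path layout as quoted on the slide (2014).
BASE = "http://www.ebi.ac.uk/arrayexpress"

def experiments_url(keywords, fmt="xml"):
    """Build an experiments query URL for the XML or JSON endpoint."""
    if fmt not in ("xml", "json"):
        raise ValueError("fmt must be 'xml' or 'json'")
    # Wrap the phrase in double quotes, then percent-encode it.
    query = urlencode({"keywords": f'"{keywords}"'})
    return f"{BASE}/{fmt}/v2/experiments?{query}"

url = experiments_url("breast cancer cell line", fmt="json")
print(url)
# Decoding the query string recovers the quoted phrase.
print(parse_qs(urlsplit(url).query)["keywords"])
```

The resulting URL can be passed to any HTTP client; looping over a list of keywords or accessions is then a straightforward way to fetch many experiments in one go.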
Questions about ArrayExpress?
The two databases
[Diagram: direct submissions and imports from external databases (mainly NCBI Gene Expression Omnibus) feed ArrayExpress; expression data sets pass through curation and statistical analysis into Expression Atlas; the databases link out to analysis software and to other databases.]
Expression Atlas: www.ebi.ac.uk/gxa / wwwdev.ebi.ac.uk/gxa
• All manually curated, high-quality data sets, standard analysis pipeline.
Baseline atlas selection criteria
• Experiments covering a broad selection of tissues/cell lines/conditions preferred *
• Presence of good-quality raw fastq files (QC)
• Reference genome build in GenBank/ENA/DDBJ for read alignment
• Biological replicates preferred
* Long term: pool samples from multiple studies, report summarised expression per gene per condition per species.
Baseline Atlas construction
RNA-seq data only!
Input: paired fastq files, e.g.
@read_name/1
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
@read_name/2
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
1. Data quality control: remove low-quality reads (e.g. NNNACTNNN) and contamination
2. Align with TopHat (reference genome from Ensembl), giving mapped reads (bam)
3. Cufflinks, giving FPKMs
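For readers unfamiliar with the unit, FPKM stands for fragments per kilobase of transcript per million mapped fragments. The sketch below shows only the basic definition with made-up numbers. Cufflinks estimates FPKM with a statistical model that also handles isoforms and effective transcript length, so its values will not match this simple calculation exactly.

```python
def fpkm(fragments, gene_length_bp, total_mapped_fragments):
    """Basic FPKM: fragments per kilobase of transcript per million mapped fragments."""
    # 1e9 = 1e3 (bases per kilobase) * 1e6 (fragments per million)
    return fragments * 1e9 / (gene_length_bp * total_mapped_fragments)

# Hypothetical gene: 500 fragments, 2,000 bp long, 25 million mapped fragments in total.
print(fpkm(500, 2000, 25_000_000))  # 10.0
```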
Differential atlas selection criteria
• Clear contrast(s)
• At least 3 replicates for each factor value
• Maximum 4 factors
• Adequate sample annotation using EFO terms
• Adequate array (platform) design to map probes to genes and to external references (e.g. Ensembl gene ID, UniProt ID)
• Good-quality raw data files, e.g. CEL (Affy), fastq (HTS)
• RNA-seq expt: reference genome build in GenBank/ENA/DDBJ
Differential atlas: how many contrasts per expt?
Simple case: “diabetes” vs “normal”
E-MTAB-800 (rat compound treatment experiment, TG-GATEs):
• ~130 compounds
• 4 doses: (none), low, medium, high
• Time of sacrifice: 4, 8, 15, 29 days
• 2 tissues: liver, kidney
>1000 contrasts!!
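A rough upper bound on the number of contrasts can be worked out from these figures. The sketch assumes that each treated dose (low, medium, high) is compared with the untreated control for the same compound, time point and tissue, and that every combination was actually assayed. Both are assumptions made here for illustration, and the compound count is approximate.

```python
compounds = 130      # "~130" on the slide, so the result is approximate
treated_doses = 3    # low, medium, high; "(none)" serves as the control
time_points = 4      # 4, 8, 15, 29 days
tissues = 2          # liver, kidney

# One contrast per treated dose vs control, per compound, time point and tissue.
contrasts = compounds * treated_doses * time_points * tissues
print(contrasts)  # 3120
```

Under these assumptions the count is about 3,000, consistent with the slide's “>1000 contrasts”.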
Differential Atlas construction (microarray)
Input: CEL files and a manually curated “contrast”, e.g. disease: “diabetes” vs “normal”
1. RMA normalization, giving normalised expression values per probe set
2. Moderated t-test (limma)
3. False discovery rate adjustment for p-values (Benjamini & Hochberg, 1995)
Output: fold-change, p-values
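Step 3 can be illustrated with a small pure-Python version of the Benjamini-Hochberg step-up adjustment. This is a sketch of the published procedure with made-up p-values, not the Atlas pipeline's own code, and the function name is hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values, in the input order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical raw p-values for four probe sets.
print(benjamini_hochberg([0.01, 0.04, 0.03, 0.005]))
```

Each adjusted value is the smallest false discovery rate at which that probe set would be called significant.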
Differential Atlas construction (HTS)
Input: paired fastq files and a manually curated “contrast”, e.g.
@read_name/1
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
@read_name/2
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
1. Data quality control: remove low-quality reads (e.g. NNNACTNNN) and contamination
2. Align with TopHat (reference genome from Ensembl), giving mapped reads (bam)
3. HTSeq
4. DESeq, giving fold-change and p-values
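The quality-control step can be pictured with a minimal read filter. The sketch below parses four-line FASTQ records and flags reads by mean base quality and by the fraction of N bases. It assumes Phred+33 quality encoding, the thresholds are arbitrary illustrative values, and the helper names are hypothetical; the Atlas pipeline's actual QC tools and settings are not stated on the slide.

```python
def parse_fastq(lines):
    """Yield (name, sequence, quality) tuples from 4-line FASTQ records."""
    rows = [line.strip() for line in lines if line.strip()]
    for i in range(0, len(rows) - 3, 4):
        header, seq, _plus, qual = rows[i:i + 4]
        yield header[1:], seq, qual  # drop the leading "@"

def mean_phred(quality, offset=33):
    """Mean Phred score of a quality string (Phred+33 encoding assumed)."""
    return sum(ord(c) - offset for c in quality) / len(quality)

def passes_qc(seq, quality, min_mean_q=20.0, max_n_fraction=0.1):
    """Illustrative filter: thresholds are assumptions, not pipeline settings."""
    n_fraction = seq.upper().count("N") / len(seq)
    return mean_phred(quality) >= min_mean_q and n_fraction <= max_n_fraction

# The example read shown on the slide.
record = [
    "@read_name/1",
    "GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT",
    "+",
    "!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65",
]
for name, seq, qual in parse_fastq(record):
    print(name, round(mean_phred(qual), 1), passes_qc(seq, qual))
```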
Mapping microarray probes to genes
• Every (~monthly) Atlas release takes the latest Ensembl gene to probe identifier mapping data.
• From Ensembl genes, we also get:
  • Compara genes
  • External references (xrefs) to other databases, e.g. UniProt protein IDs, NCBI RefSeq IDs, HGNC gene symbols, Gene Ontology terms, InterPro terms
[Diagram: probe identifiers link expression data per probe to Ensembl genes.]
Baseline Atlas use case: KCC2 gene
Scenario: You study the health impact of Bisphenol A (BPA).
• BPA: common additive in household plastic items.
• Negative health effects have been linked to BPA, e.g. on foetal and neonatal brain development.
• PNAS paper (Yeo et al., 2013): Bisphenol A delays the perinatal chloride shift in cortical neurons by epigenetic effects on the Kcc2 promoter.
• [Diagram: BPA → epigenetic downregulation → potassium chloride cotransporter 2 (Kcc2) mRNA levels ↓ in mouse]
Your question: What is the general expression profile of KCC2 in human tissues?
Human KCC2 gene in Baseline Atlas
• Analysis method, experiment design
• FPKM threshold slider
• Tool tips!
Baseline Atlas: ENCODE cell lines
Scenario: You study the role of the apoptosis pathway (Reactome accession: REACT_578) in the hepatoma cell line HepG2.
Your question: What genes in the apoptosis pathway are expressed in HepG2?
Baseline Atlas: Apoptosis genes in HepG2
Differential Atlas use case: human primary hepatocyte and drug Trovafloxacin
• Analytics, experiment design, data download
• FDR and fold-change cut-offs
• Curated experimental factor and contrast
• Colour gradient showing significance of differential expression
• See fold-changes
• MA plots
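The effect of the FDR and fold-change cut-offs can be reproduced on downloaded results with a simple filter. The sketch below uses made-up gene names and values. The field names, default thresholds and the `filter_genes` helper are assumptions for illustration and do not reflect the Atlas download format.

```python
def filter_genes(results, max_fdr=0.05, min_abs_log2fc=1.0):
    """Keep genes passing both the FDR and the absolute log2 fold-change cut-off."""
    return [
        r for r in results
        if r["adj_p"] <= max_fdr and abs(r["log2fc"]) >= min_abs_log2fc
    ]

# Hypothetical results for one contrast (treated vs control).
results = [
    {"gene": "geneA", "log2fc": 2.3, "adj_p": 0.001},   # strongly up, significant
    {"gene": "geneB", "log2fc": -1.4, "adj_p": 0.03},   # down, significant
    {"gene": "geneC", "log2fc": 0.4, "adj_p": 0.01},    # significant but small change
    {"gene": "geneD", "log2fc": 3.0, "adj_p": 0.2},     # large change, not significant
]
print([r["gene"] for r in filter_genes(results)])
```

Tightening either cut-off shortens the list, which mirrors what the cut-off controls do in the web interface.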