
Unlocking the potential of public available gene expression data for large-scale analysis

Unlocking the potential of public available gene expression data for large-scale analysis. Jonatan Taminau PhD defense, November 2012. Introduction. In this thesis: Focus on data to information step. Focus on microarrays technology. Data. Information. Knowledge. Introduction. Data.


Presentation Transcript


  1. Unlocking the potential of public available gene expression data for large-scale analysis • Jonatan Taminau • PhD defense, November 2012

  2. Introduction • In this thesis: • Focus on the data-to-information step. • Focus on microarray technology. • Diagram: Data → Information → Knowledge

  3. Introduction • Diagram: Data → Information • Data repositories: + Massive amounts + Examples: GEO, ArrayExpress + Publicly available! • Analysis software: + Commercial: CLC Bio, Spotfire, etc. + Free: Bioconductor, GenePattern, Galaxy, etc. + A lot of existing research

  4. Introduction • “Although hundreds of thousands of samples are publicly available, and several powerful analysis software solutions exist, the research community is facing a chasm between these two resources.” (Coletta et al., 2012) • “One of the challenges for the future is how to integrate all the DNA microarray data that have been generated and deposited in public databases.” (Larsson et al., 2006)

  5. Introduction • We identified two hurdles for large-scale microarray analysis: • Consistent retrieval of individual datasets. • Integrative analysis of multiple data sets.

  6. Outline Chapter 1 Chapter 2 Chapter 3 Chapter 5 Chapter 4 Chapter 6 Chapter 7 Chapter 8 Chapter 9

  7. Outline • Retrieval of data: Problem Statement, inSilico DB • Integrative Analysis: Problem Statement, Meta-Analysis, Merging • Application

  8. Outline • Retrieval of data: Problem Statement, inSilico DB • Integrative Analysis: Problem Statement, Meta-Analysis, Merging • Application

  9. Retrieval of genomic data • Data is online and freely available • But: it is difficult to retrieve the data consistently (Example: Baggerly & Coombes, 2011) • What does consistent retrieval mean? • Data retrieval is reproducible and tractable • No manual intervention needed • All data is preprocessed the same way

  10. Retrieval of genomic data • Typical microarray workflow: DNA microarray → Scanner → Image → Image analysis → CEL file (numerical ‘raw’ data) → Preprocessing → Gene expression matrix

  11. Retrieval of genomic data • CEL file (numerical ‘raw’ data) → Preprocessing → Gene expression matrix • Preprocessing is complex: + normalization/background correction + probe-to-gene mapping + versioning issues + etc. • Not documented! • “only 48% of all data in GEO and ArrayExpress was submitted with raw data” (Larsson et al., 2006)

  12. Retrieval of genomic data • Instances: + Patients, tissues, etc. + range: 10-100 • Features: + Genes or probes + range: 20k-30k • Gene expression value x_ij: + Expression of gene i in sample j + range between 2 and 14 + log2 scaled
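The matrix layout described on this slide can be sketched in a few lines of Python. This is only an illustration: the gene names, sample names, and intensity values are invented, and the rows-as-features, columns-as-samples orientation is an assumption that matches the "gene i in sample j" indexing.

```python
import math

# Hypothetical raw intensities: rows are features (genes), columns are samples.
genes = ["GENE_A", "GENE_B", "GENE_C"]
samples = ["sample_1", "sample_2"]
raw = [
    [16.0, 64.0],
    [1024.0, 2048.0],
    [4096.0, 16384.0],
]

# log2-scale every value; x[i][j] is the expression of gene i in sample j.
x = [[math.log2(value) for value in row] for row in raw]


def expression(gene, sample):
    """Look up x_ij by gene name and sample name."""
    return x[genes.index(gene)][samples.index(sample)]
```

With these made-up intensities, every log2 value falls in the 2 to 14 range quoted on the slide.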

  13. Retrieval of genomic data • What about phenotypical data or meta-data ? • Extra information about the samples (age, gender, disease, etc.) • No standard way of formatting this information • MIAME / Ontologies / Free text / etc. • Also still an open problem

  14. Retrieval of genomic data • Why is consistent retrieval from public repositories so important? • Reproducibility of results • Comparison of new results with existing studies • Combining different studies

  15. Outline • Retrieval of data: Problem Statement, inSilico DB • Integrative Analysis: Problem Statement, Meta-Analysis, Merging • Application

  16. The inSilico Database • Result of InSilico project • Innoviris (2007-2012) • 8 persons from VUB & ULB • Provides consistently preprocessed and expert-curated genomic data • Being commercialized

  17. The inSilico Database • What makes the inSilico Database so valuable ? • Not the fact that all data is precomputed • But how it is precomputed • What is the underlying engine ? • Genomic Pipelines • Backbone

  18. The inSilico DB | Genomic Pipelines • For every data type there is a different pipeline • Microarray pipeline: • Jobs • Dependencies • Backbone
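The idea of a pipeline made of jobs with dependencies can be illustrated with a topological sort. The sketch below is not the inSilico DB implementation: the job names are hypothetical, and Python's standard `graphlib` module stands in for whatever scheduler the backbone actually uses.

```python
from graphlib import TopologicalSorter

# Hypothetical microarray pipeline: each job maps to the set of jobs it depends on.
jobs = {
    "download_cel": set(),
    "fetch_annotations": set(),
    "normalize": {"download_cel"},
    "map_probes_to_genes": {"normalize", "fetch_annotations"},
    "export_matrix": {"map_probes_to_genes"},
}

# static_order() yields every job after all of its dependencies.
order = list(TopologicalSorter(jobs).static_order())

for job in order:
    print("run", job)
```

A workflow system built on such an ordering can rerun only the jobs downstream of a failed or updated step, which is one way to get the control over intermediate results mentioned on the next slide.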

  19. The inSilico DB | Backbone • Automatic Workflow System • Barely any manual intervention needed • Control of intermediate results • Pre-computation saves time (for the user) • Streamlined error management • Automatic monitoring

  20. The inSilico DB | Backbone • How does it work? • Java daemon (recently replaced by an application server) • Configuration files

  21. inSilicoDb package • One thing missing for large-scale analysis... • Programmatic access via scripting • Contains the basic functionality of InSilico DB • Makes automatic retrieval of data possible! • Seamlessly integrates with other Bioconductor analysis tools • Published in Bioinformatics, downloaded more than 2000 times

  22. Outline • Retrieval of data: Problem Statement, inSilico DB • Integrative Analysis: Problem Statement, Meta-Analysis, Merging • Application

  23. Integrative Analysis • “Combining the information of multiple, independent but related studies in order to extract more general and more reliable results” • Problem: • How to do it ? • Two approaches: • Meta-Analysis • Merging

  24. Integrative Analysis Merging Meta-Analysis

  25. Outline • Retrieval of data: Problem Statement, inSilico DB • Integrative Analysis: Problem Statement, Meta-Analysis, Merging • Application

  26. Meta-Analysis + Consistent retrieval is essential! + inSilicoDb package + Depends on goal + Much focus on finding DEGs (differentially expressed genes) + Defines what the results look like + Combining p-values + Combining effect sizes + Combining ranks + Vote counting + etc.
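Two of the combination strategies listed on this slide can be sketched briefly. The slide does not say which p-value combination rule was used; Fisher's method is shown here only as a well-known example, and the 0.05 threshold in the vote-counting helper is an assumption.

```python
import math


def fisher_combined_p(p_values):
    """Combine independent p-values for one gene with Fisher's method.

    The statistic -2 * sum(ln p_i) follows a chi-square distribution with
    2k degrees of freedom; for even degrees of freedom the upper tail has
    the closed form exp(-s/2) * sum_{i<k} (s/2)^i / i!.
    """
    k = len(p_values)
    half = -sum(math.log(p) for p in p_values)  # half of the Fisher statistic
    term = 1.0
    total = 1.0
    for i in range(1, k):
        term *= half / i
        total += term
    return math.exp(-half) * total


def vote_count(p_values, alpha=0.05):
    """Count the studies in which the gene is significant at level alpha."""
    return sum(1 for p in p_values if p < alpha)
```

For example, a gene with p-values `[0.05, 0.05]` in two studies gets a combined p-value below 0.05 and a vote count of 0, which shows how the choice of method shapes what the results look like.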

  27. Meta-Analysis | Stable Genes • 365 studies were screened for stable genes • Motivation: • Interested in reference genes • Currently used genes (housekeeping genes) are not ideal • Need a compact and diverse list of genes that are stable under most conditions • In collaboration with Dr Bram de Craene (VIB-UGent)

  28. Meta-Analysis | Stable Genes • (1) Retrieve data: + inSilicoDb package + All 365 datasets downloaded in less than 100 min • (2) Calculate stability scores: + For each gene: Coefficient of Variation (CV) = sd / mean + avoid lowly expressed genes • (3) Combine stability scores: + For each gene take the median of the CVs + Rank and take the top 100 • (4) Semantic similarity filtering: + Exclude genes that are related + Uses gene annotation from GO + Innovative step! + From 100 to 10 genes
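Steps (2) and (3) above can be sketched as follows. This is a minimal illustration, not the thesis code: the input layout (one dict per study, mapping a gene to its log2 expression values), the `min_mean` cutoff used to skip lowly expressed genes, and the use of the sample standard deviation are all assumptions. Retrieval (step 1) and the GO-based filtering (step 4) are not covered.

```python
from statistics import mean, median, stdev


def cv(values):
    """Coefficient of variation: standard deviation divided by mean."""
    return stdev(values) / mean(values)


def rank_stable_genes(studies, min_mean=4.0, top_n=100):
    """Rank genes by the median of their per-study CVs (lowest first).

    studies: list of dicts mapping gene -> list of expression values.
    min_mean: hypothetical cutoff; a gene is skipped in a study where its
              mean expression is below this value.
    """
    scores = {}
    for study in studies:
        for gene, values in study.items():
            if mean(values) < min_mean:
                continue  # avoid lowly expressed genes
            scores.setdefault(gene, []).append(cv(values))
    combined = {gene: median(cvs) for gene, cvs in scores.items()}
    return sorted(combined, key=combined.get)[:top_n]


# Toy usage with two invented studies.
studies = [
    {"G1": [8.0, 8.1, 7.9], "G2": [6.0, 9.0, 12.0], "G3": [2.0, 2.1, 1.9]},
    {"G1": [8.2, 8.0, 8.1], "G2": [5.0, 10.0, 9.0], "G3": [2.0, 2.2, 2.1]},
]
ranking = rank_stable_genes(studies)
```

The median is a natural choice for step (3) because a single study in which a gene behaves erratically does not dominate its combined score.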

  29. Meta-Analysis | Stable Genes • Status: • August 2012 | waiting for results… • September 2012 | first positive results! • November 2012 | second test case, positive feedback from NAR, manuscript in preparation…

  30. Outline • Retrieval of data: Problem Statement, inSilico DB • Integrative Analysis: Problem Statement, Meta-Analysis, Merging • Application

  31. Merging + Batch effects + Methods to remove: - Location-scale - Matrix factorization - Discretization + Makes data compatible + Preprocessing not sufficient + Consistent retrieval is essential! + inSilicoDb package + Same as with single studies + Increased sample size!
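As an illustration of the location-scale family of methods, the sketch below centers and scales each gene within each study before concatenating the samples. It is a simplified stand-in, not one of the methods implemented in the inSilicoMerging package; the input layout (one dict per study, gene to list of values) and the choice to keep only genes shared by all studies are assumptions.

```python
from statistics import mean, pstdev


def center_scale(values):
    """Shift values to mean 0 and scale to standard deviation 1."""
    m = mean(values)
    s = pstdev(values)
    if s == 0:
        return [0.0 for _ in values]  # constant gene: nothing to scale
    return [(v - m) / s for v in values]


def merge_location_scale(batches):
    """Merge studies after per-gene, per-study standardization.

    batches: list of dicts mapping gene -> list of expression values.
    Returns a dict gene -> concatenated adjusted values, restricted to
    genes present in every study.
    """
    common = set(batches[0]).intersection(*batches[1:])
    merged = {}
    for gene in sorted(common):
        merged[gene] = []
        for batch in batches:
            merged[gene].extend(center_scale(batch[gene]))
    return merged


# Toy usage: the second study measures G1 on a shifted, stretched scale.
batch_a = {"G1": [1.0, 3.0], "G2": [5.0, 7.0]}
batch_b = {"G1": [10.0, 14.0]}
merged = merge_location_scale([batch_a, batch_b])
```

After adjustment the two studies report G1 on the same scale, which is the sense in which batch effect removal makes data compatible. Note that standardizing per study also removes genuine between-study differences, so it assumes the studies have comparable sample composition.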

  32. Merging | Batch Effects • Illustrative example of what batch effects can cause: • We merged 4 different studies with thyroid samples • All studies contained normal and tumor samples • In collaboration with Wilma Van Staveren (IRIBHM, ULB) • Samples are plotted in MDS space • We expect two clusters

  33. Merging | Batch Effects • Figure panels: merging without batch effect removal; merging with batch effect removal • Legend: + symbol for study + color for normal/tumor

  34. inSilicoMerging package • R/Bioconductor package combining: • 6 different merging methods • 5 visual inspection tools • 6 quantitative measures • Only resource so far combining all this functionality ! • Seamlessly integrates with inSilicoDb package

  35. Outline • Retrieval of data: Problem Statement, inSilico DB • Integrative Analysis: Problem Statement, Meta-Analysis, Merging • Application

  36. Identification of DEGs in Lung Cancer • Idea: compare meta-analysis and merging approaches for integrative analysis • We chose lung cancer as the case study, based on the content of inSilico DB. • Ignore subtypes: DEGs can be seen as playing a role in the basic mechanisms of lung cancer

  37. Identification of DEGs in Lung Cancer • What is our hypothesis? • Due to the small sample sizes of individual studies, there are a lot of false negatives when using meta-analysis • Can we avoid this by using merging as an alternative approach?

  38. Identification of DEGs in Lung Cancer • Constraints: + fRMA preprocessed + > 30 samples + both normal and tumor + GPL96 or GPL570 + inSilicoMerging package • Methodology: + apply limma - p-value < 0.05 - FC > 2 + robustness test - 100 iterations with 90% of data - resampling + take intersection • Merging • Meta-Analysis
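The robustness test in the methodology (100 iterations on 90% of the data, then taking the intersection) can be sketched as below. limma is an R package and is not reproduced here: `fold_change_degs` is a deliberately simple stand-in that applies only the fold-change criterion on log2 data (FC > 2 corresponds to a log2 difference above 1) and ignores the p-value filter. Subsampling normal and tumor samples separately, the fixed seed, and the data layout are assumptions.

```python
import random
from statistics import mean


def fold_change_degs(normal, tumor, min_log2_fc=1.0):
    """Stand-in for limma: genes whose mean log2 expression differs by
    more than min_log2_fc between tumor and normal samples."""
    return {
        gene
        for gene in normal
        if abs(mean(tumor[gene]) - mean(normal[gene])) > min_log2_fc
    }


def subsample(group, fraction, rng):
    """Keep a random fraction of the samples (columns) of one group."""
    n = len(next(iter(group.values())))
    keep = rng.sample(range(n), max(2, round(fraction * n)))
    return {gene: [values[i] for i in keep] for gene, values in group.items()}


def robust_degs(normal, tumor, iterations=100, fraction=0.9, seed=0):
    """Intersect the DEG lists obtained on repeated random subsamples."""
    rng = random.Random(seed)
    result = None
    for _ in range(iterations):
        degs = fold_change_degs(
            subsample(normal, fraction, rng), subsample(tumor, fraction, rng)
        )
        result = degs if result is None else result & degs
    return result


# Toy data: G_OUTLIER looks differential only because of one extreme sample.
normal = {"G_UP": [5.0] * 10, "G_FLAT": [6.0] * 10, "G_OUTLIER": [5.0] * 10}
tumor = {"G_UP": [8.0] * 10, "G_FLAT": [6.1] * 10, "G_OUTLIER": [5.0] * 9 + [20.0]}
full = fold_change_degs(normal, tumor)
robust = robust_degs(normal, tumor)
```

On the full data the outlier-driven gene passes the filter, but it fails in any iteration that happens to drop the extreme sample, so it will almost always be absent from the intersection. A gene that is consistently different survives every iteration.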

  39. Identification of DEGs in Lung Cancer • Meta-Analysis:

  40. Identification of DEGs in Lung Cancer • Merging:

  41. Identification of DEGs in Lung Cancer • Findings: • Resampling helps to remove false positives • Relatively low impact of batch effect removal methods • More DEGs identified through merging (102) than via meta-analysis (25) • “Deriving separate statistics and then averaging is often less powerful than directly computing statistics from aggregated data.” (Xu et al., 2008) • No false positives? + checked literature + initial pathway analysis

  42. Outline • Retrieval of data: Problem Statement, inSilico DB • Integrative Analysis: Problem Statement, Meta-Analysis, Merging • Application • Contributions • Conclusions

  43. Contributions • Genomic pipelines / backbone (Ch 4) • Release of 2 publicly available R/Bioconductor packages (Ch 4 & 7) • Survey of batch effect removal methods (Ch 7) • Two applications • Identification of stable genes via meta-analysis (Ch 6) • Screening of potential biomarkers via integrative analysis (Ch 8)

  44. Conclusions • We identified two hurdles for large-scale microarray analysis: • Consistent retrieval of individual datasets. • Integration of multiple data sets for integrative analysis.

  45. Conclusions • Consistent retrieval of individual datasets: inSilicoDb package • Integration of multiple data sets for integrative analysis: inSilicoMerging package • Paving the road towards unlocking the potential of publicly available gene expression studies

  46. Thanks! + InSilico Team! + Jury! + Audience! + Yann-Michaël!
