
Data integration




  1. Data integration Brixen 2008 Wolfgang Huber EMBL-EBI

  2. Overview • Along genomic coordinates • By gene • (by pairs of genes) • (by sets of genes) • Here, "gene" is used in a loose sense, to be defined as appropriate for the application; the concept encompasses: • Loci on the DNA • Transcripts (RNA molecules) • Proteins

  3. Integration of data along genomic coordinates • An example: • We measured the frequency of recombination events (crossovers, and gene conversions not associated with crossover) throughout the genome of S. cerevisiae. • Is this pattern ('hotspots') associated with: • GC content • promoters • across- or within-species conservation • (Figure legend: CO, NCO)

  4. Testing for association • You can consider the different sets of features along the genome as continuous-valued, or binary, "time" series • X1(t), ..., Xn(t) • Consider, e.g., the case where Xi(t) and Xj(t) are {0,1} indicators. A simple (but, as we will see, inadequate) approach would be to compute an overlap statistic (the two formulas shown on the slide are not preserved in this transcript) and estimate its null distribution through random permutation in t.
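The naive approach on this slide can be sketched in base R. The slide's own formulas are not available here, so the statistic below (the number of positions where both indicators equal 1) is an assumption, and the names `overlap` and `permutation_pvalue` are hypothetical. Shuffling one series along t implicitly assumes a uniform null, which is exactly what the later slides argue is inadequate.

```r
# Assumed overlap statistic: number of positions t where both indicators are 1.
overlap <- function(x, y) sum(x * y)

# Naive permutation test: shuffle y along t and recompute the statistic.
# This treats every position as exchangeable (uniform null).
permutation_pvalue <- function(x, y, n_perm = 1000) {
  observed <- overlap(x, y)
  null <- replicate(n_perm, overlap(x, sample(y)))
  # Add-one correction so the estimated p-value is never exactly zero.
  (sum(null >= observed) + 1) / (n_perm + 1)
}

set.seed(1)
n <- 10000
x <- rbinom(n, 1, 0.02)  # simulated indicator series X_i(t)
y <- rbinom(n, 1, 0.02)  # simulated indicator series X_j(t), independent of x
p_value <- permutation_pvalue(x, y)
```

Because `x` and `y` are simulated independently on a uniform genome, this is the one setting in which the naive test is appropriate.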

  5. Testing for association • Alternatively, one could also compute, for each feature in series i, the distance to the closest feature in j, and then take a summary of the distribution of that statistic (e.g. median).
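A minimal base-R sketch of this distance statistic follows. The function name `nearest_distance` and the example positions are hypothetical; the code on slide 7 obtains the same quantity with `matchpt`, so this version is only a dependency-free stand-in.

```r
# For each position in `a`, the distance to the closest position in `b`.
nearest_distance <- function(a, b) {
  b <- sort(b)
  idx <- findInterval(a, b)  # number of b values that are <= each a
  # Distance to the nearest b on the left (Inf if there is none).
  left <- ifelse(idx >= 1, a - b[pmax(idx, 1)], Inf)
  # Distance to the nearest b on the right (Inf if there is none).
  right <- ifelse(idx < length(b), b[pmin(idx + 1, length(b))] - a, Inf)
  pmin(left, right)
}

features_i <- c(5, 12, 100)
features_j <- c(10, 20, 90)
d <- nearest_distance(features_i, features_j)
summary_stat <- median(d)  # summary of the distance distribution
```

The median is used here because the slide suggests it; any other summary of the distribution could be substituted.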

  6. "Boring" association of features in inhomogeneous time series • (Figure: nearest neighbour distance; panels: uniform genome, genome with blocks)

  7. Flawed testing for association along the genome

```r
## Flawed testing for association along the genome
library("Biobase")       # provides matchpt()
library("geneplotter")   # provides multidensity()
library("RColorBrewer")  # provides brewer.pal()

n = 10000  # length of the simulated genome

## Draw two independent feature sets with the same sampling weights, plot
## them, and return the distance from each feature in e1 to its nearest
## neighbour in e2.
oneplot = function(weights, s=200) {
  e1 = sample(n, s, prob=weights)
  e2 = sample(n, s, prob=weights)
  d = matchpt(e1, e2)$distance
  plot(x=e1, y=rep(1, length(e1)), type="p", pch=16, col="#A6CEE3",
       ylim=c(-0.1, 1.1), xlab="", ylab="")
  points(x=e2, y=rep(0.9, length(e2)), pch=16, col="#B2DF8A")
  lines(weights/sum(weights)*0.3, col="grey")
  return(d)
}

w1 = rep(1, n)                       # uniform genome
w2 = rep(rep(c(0, 1), each=n/8), 4)  # genome with blocks

par(mfrow=c(3,1))
dists = list(
  w1=oneplot(w1),
  w2=oneplot(w2))
multidensity(dists, xlab="Distances", xlim=c(0, 120))
legend("topright", names(dists), lwd=2, lty=1,
       col=brewer.pal(9, "Set1")[seq_along(dists)])
```

  8. Testing for association • "Everything is correlated with GC-content"; Etc. • Hence everything is correlated with everything else. That is not very interesting. • Are two sets of genomic features correlated more than expected? • To be interesting, this "expectation" is not just uniform random distribution along the genome, but includes some "background model". When setting up such a test, we need to define what an interesting background model (null hypothesis) is, then set up an appropriate randomization scheme to try to reject it. • For example, we could say that we know that there are long range structures in the genome, in which we are not interested, and we want to test whether two features that we mapped at fine scale show local correlation above the coarse-scale correlation.
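One possible randomization scheme for such a background model is sketched below in base R. It is an assumption, not the method used in the talk: feature positions are redrawn uniformly within fixed coarse blocks, so the number of features per block (the coarse-scale structure) is preserved while fine-scale placement is randomized. The block size and the names `median_nn_distance`, `shuffle_within_blocks` and `blockwise_pvalue` are hypothetical.

```r
# Median distance from each feature in `a` to its nearest feature in `b`.
median_nn_distance <- function(a, b) {
  median(sapply(a, function(p) min(abs(p - b))))
}

# Redraw each position uniformly within its own coarse block, which keeps
# the per-block feature counts fixed.
shuffle_within_blocks <- function(pos, block_size) {
  block_start <- (pos - 1) %/% block_size * block_size
  block_start + sample.int(block_size, length(pos), replace = TRUE)
}

# One-sided test: is the observed median distance smaller than expected
# when only the coarse-scale structure is shared?
blockwise_pvalue <- function(a, b, block_size, n_perm = 500) {
  observed <- median_nn_distance(a, b)
  null <- replicate(
    n_perm,
    median_nn_distance(a, shuffle_within_blocks(b, block_size))
  )
  (sum(null <= observed) + 1) / (n_perm + 1)
}

set.seed(1)
n <- 10000
block_size <- 1250
weights <- rep(rep(c(0, 1), each = n / 8), 4)  # blocks as in slide 7 (w2)
a <- sample(n, 200, prob = weights)
b <- sample(n, 200, prob = weights)  # shares only the block structure with a
p_independent <- blockwise_pvalue(a, b, block_size)

b_linked <- pmin(a + 3L, n)          # each feature sits 3 positions from a
p_linked <- blockwise_pvalue(a, b_linked, block_size)
```

In this simulation only `b_linked` is locally tied to `a`, so only `p_linked` should be small. How to choose the block size, and whether blocks should be fixed or sliding, depends on which long-range structure is considered uninteresting.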

  9. Data integration via "genes" • A common and intuitive method for data integration is to compare the data from different experiments (assays) by mapping them all to the same set of genes. • This sounds easier than it is: different assays investigate different aspects of a gene • transcript(s) level • protein product(s) level, localization, structure, ... • chromatin state • promoter • UTRs • antisense transcript • and our understanding of how these aspects are organised together in a gene may be subtle, controversial, and changeable over time.

  10. Data integration via "genes" • The reagents and target molecule identifiers used in different experiments may be different: • RefSeq ID • Entrez ID • Ensembl Gene ID • Ensembl Transcript ID • UniProt ID • Gene coordinate on the chromosome • Microarray probe sequence • siRNA sequence • Peptide sequence identified in MS • Short Read Sequence • Bioconductor offers tools to map these to each other (annotation packages; biomaRt; Biostrings).

  11. Data integration via "genes" • Bioconductor offers tools to map these to each other, so • others can reproduce your mapping • you can redo the mapping as the biological databases get updated • you can try out different ways to do the mapping and see how they affect the subsequent data analysis • Think about this early: • - keep the primary reagent identifiers around • - use versioning (annotation packages!) • - make the mapping process part of your reproducible, documented, automated workflow
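The advice above can be illustrated with a small base-R sketch. All identifiers, values and the version label are made up for illustration; in practice the mapping table would come from a versioned annotation package or a biomaRt query rather than being typed in by hand.

```r
# Hypothetical measurements keyed by the primary reagent identifier.
measurements <- data.frame(
  probe_id = c("probe_1", "probe_2", "probe_3", "probe_4"),
  log_ratio = c(1.2, -0.4, 0.8, 0.1),
  stringsAsFactors = FALSE
)

# Hypothetical probe-to-gene mapping table and its version label.
probe_to_gene <- data.frame(
  probe_id = c("probe_1", "probe_2", "probe_2", "probe_3"),
  gene_id = c("gene_A", "gene_B", "gene_C", "gene_A"),
  stringsAsFactors = FALSE
)
mapping_version <- "example-annotation-0.0.1"

# Left join: every probe is kept, including those without a mapped gene,
# so the primary identifiers stay available for a later re-mapping.
mapped <- merge(measurements, probe_to_gene, by = "probe_id", all.x = TRUE)
mapped$mapping_version <- mapping_version

# Probes that need attention before any gene-level analysis.
unmapped_probes <- mapped$probe_id[is.na(mapped$gene_id)]
ambiguous_probes <- unique(mapped$probe_id[duplicated(mapped$probe_id)])
```

Keeping the unmapped and ambiguous probes visible, instead of silently dropping them, makes it easier to see how a different mapping would change the downstream analysis.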

  12. Acknowledgement • Robert Gentleman • Richard Bourgon • Jörn Tödling • Greg Pau

  13. Report Generation hwriter package

  14. References • Visualizing Genomic Data, R. Gentleman, F. Hahne, W. Huber (2006), Bioconductor Project Working Papers, Paper 10 • Choosing Color Palettes for Statistical Graphics, A. Zeileis, K. Hornik (2006), Department of Statistics and Mathematics, Wirtschaftsuniversität Wien, Research Report Series, Report 41
