Combining heterogeneous data to reverse engineer regulatory networks

Allan Tucker • School of Information Systems, Computing and Mathematics, Brunel University, London, UB8 3PH

Intelligent Data Analysis • IDA attempts to deal with data explosion to discover patterns and knowledge from data • Typical analysis tasks: • Clustering • Classification • Feature Selection • Prediction and Forecasting • Structure identification

Bayesian Networks • An IDA method to model a domain using probabilities • Easily interpreted by non-statisticians • Can be used to combine existing knowledge with data • Essentially use independence assumptions to model the joint distribution of a domain

Informative Priors • To build BNs we can also use prior structures and probabilities • These are then updated with data • Priors are usually uniform (equal probability) • Informative priors are used to incorporate existing knowledge into BNs

Microarray Data • Major source of data for gene expression activity • Technology takes measurements over 1000s of genes simultaneously • Gene Regulatory Networks (GRNs) model how genes interact • Eliciting reliable GRNs from data key to understanding biological mechanisms

But... • Reliability issues that surround microarray gene expression data • Mechanisms in different systems & species • Can we build GRN models that have enhanced performance, based on a richer and/or broader collection of data than a single microarray dataset?

The talk • Incorporating literature priors • Consensus networks • Models of Increasing Complexity • Interspecies analysis

Literature-based priors • Information about biomedical concepts such as genes is summarized using concept profiling (Jelier et al., 2007; Schuemie et al., 2007a) • Combines information from several databases, including Entrez Gene, Uniprot, and the Saccharomyces Genome Database • A concept profile is a vector of concepts with weights • Each weight represents uncertainty between the occurrence of one concept and another • Reference: Steele, E., Tucker, A., 't Hoen, P.A.C. and Schuemie, M.J. (2009), Literature-Based Priors for Gene Regulatory Networks, Bioinformatics 25(14): 1768-1774

Literature-based priors • Perform Pearson correlation on the concept profiles of genes to create a literature matrix • Translate correlations into probabilities using confidence scores: each score represents the probability that a particular correlation was not drawn from the distribution of random gene-pair correlations • This is not equal to the probability that an edge exists – see Segal et al. (2002) and Efron (2007) • Incorporate as a prior into the BIC score, with prior weight w, k parameters and n samples: • BIC = w log P(S) + log P(D|S) - 0.5 k log(n)
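
The two steps on this slide can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the method of the cited paper: the concept profiles are toy weight vectors, `confidence` is a simple empirical stand-in for the confidence score, `log_structure_prior` assumes edges are independent, and `log_lik` stands for the data-fit term of the score. All function names are hypothetical.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length weight vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def literature_matrix(profiles):
    """Pairwise correlations between concept profiles (dict: gene -> vector)."""
    genes = sorted(profiles)
    return {(g, h): pearson(profiles[g], profiles[h])
            for g in genes for h in genes if g < h}

def confidence(corr, random_corrs):
    """ASSUMPTION: empirical stand-in for the confidence score, i.e. the
    fraction of random gene-pair correlations lying below the observed one."""
    return sum(1 for r in random_corrs if r < corr) / len(random_corrs)

def log_structure_prior(edges, edge_probs, eps=1e-6):
    """ASSUMPTION: independent edges; a present pair contributes log p,
    an absent pair contributes log(1 - p)."""
    total = 0.0
    for pair, p in edge_probs.items():
        p = min(max(p, eps), 1 - eps)  # keep the logarithm finite
        total += math.log(p) if pair in edges else math.log(1 - p)
    return total

def prior_weighted_bic(log_lik, log_prior, k, n, w):
    """w * log P(S) + data-fit term - 0.5 * k * log(n)."""
    return w * log_prior + log_lik - 0.5 * k * math.log(n)
```

Setting `w = 0` recovers the score without the literature prior, which makes it easy to compare structures learnt with and without the prior.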

The Experiments • Test our approach on synthetic networks generated using differential equations, and on yeast and E. coli studies with known regulatory structures • Report on ROC analysis: • True Positives: links that are correctly identified • False Positives: links that are incorrectly identified • False Negatives: links that are missed • True Negatives: absent links that are correctly left out • Also assess predictive power using cross-validation (CV)
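
The four counts above can be computed by comparing a learnt edge set with the known structure. The sketch below assumes edges are directed and compared as ordered pairs without self-loops; the talk does not state how edge direction is handled in the evaluation, and the function names are illustrative only.

```python
from itertools import permutations

def edge_confusion(nodes, true_edges, learned_edges):
    """Count TP/FP/FN/TN over all ordered node pairs (no self-loops)."""
    true_edges, learned_edges = set(true_edges), set(learned_edges)
    counts = {"TP": 0, "FP": 0, "FN": 0, "TN": 0}
    for pair in permutations(nodes, 2):
        in_true, in_learned = pair in true_edges, pair in learned_edges
        if in_true and in_learned:
            counts["TP"] += 1      # link correctly identified
        elif in_learned:
            counts["FP"] += 1      # link incorrectly identified
        elif in_true:
            counts["FN"] += 1      # link missed
        else:
            counts["TN"] += 1      # absent link correctly left out
    return counts

def roc_point(counts):
    """One ROC point as (false positive rate, true positive rate)."""
    pos = counts["TP"] + counts["FN"]
    neg = counts["FP"] + counts["TN"]
    tpr = counts["TP"] / pos if pos else 0.0
    fpr = counts["FP"] / neg if neg else 0.0
    return fpr, tpr

counts = edge_confusion(
    ["A", "B", "C"],
    true_edges=[("A", "B"), ("B", "C")],
    learned_edges=[("A", "B"), ("C", "A")],
)
print(counts, roc_point(counts))
```

Sweeping a confidence threshold over the learnt edges and collecting one such point per threshold traces out the ROC curve.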

Yeast and E. coli Network Analysis • Issues with circularity when validating

Predictive accuracy

Literature Priors Conclusions • A literature prior weight of between 0.4 and 0.6 appears to be the best choice for identifying relevant regulatory edges on human data for mechanisms involving Muscular Dystrophy • Higher prior weights lead to the inclusion of too many edges (literature associations that are not of a regulatory nature) • This is a lower weight than the optimum prior weights found for yeast and E. coli • Perhaps because there is less literature on the human organism, whereas yeast and E. coli are both well studied

Consensus Bayesian Networks • Different platforms involve different biases: • e.g. oligonucleotide arrays estimate the absolute value of expression whereas cDNA arrays measure relative differences between genes • Previous research established that comparing datasets using standard normalisation is not straightforward • An attempt to combine multiple microarray data sources through post-learning aggregation • Reference: Steele, E. and Tucker, A. (2008), Consensus and Meta-analysis regulatory networks for combining multiple microarray gene expression datasets, Journal of Biomedical Informatics 41(6): 914-926

Consensus Bayes Networks

Consensus Bayesian Networks • Bootstrapping on each dataset to generate robust networks with confidence • Threshold the confidence and generate a PDAG (due to equivalence classes) • Consensus looks for edges with enough support in the input networks • Edge direction is based upon voting of inputs – or left undirected if there is no consensus or if cycles cannot be resolved
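
The thresholding and voting steps can be sketched as below. This is a simplified illustration, not the published algorithm: it assumes each input network is a mapping from directed edges to bootstrap confidence, it leaves tied votes undirected, and it does not handle equivalence classes or cycle resolution. The names `threshold_network`, `consensus` and `min_support` are hypothetical.

```python
from collections import Counter

def threshold_network(edge_conf, threshold):
    """Keep directed edges whose bootstrap confidence reaches the threshold."""
    return {edge for edge, conf in edge_conf.items() if conf >= threshold}

def consensus(networks, min_support):
    """networks: list of sets of directed edges (a, b), with a != b.
    Returns (directed, undirected) consensus edge sets."""
    support = Counter()  # number of networks linking a pair in either direction
    votes = Counter()    # number of networks voting for each direction
    for net in networks:
        for pair in {frozenset(edge) for edge in net}:
            support[pair] += 1
        for edge in net:
            votes[edge] += 1
    directed, undirected = set(), set()
    for pair, count in support.items():
        if count < min_support:
            continue  # not enough input networks contain this link
        a, b = sorted(pair)
        if votes[(a, b)] > votes[(b, a)]:
            directed.add((a, b))
        elif votes[(b, a)] > votes[(a, b)]:
            directed.add((b, a))
        else:
            undirected.add((a, b))  # no majority on direction
    return directed, undirected

bootstrapped = [
    {("X", "Y"): 0.90, ("Y", "Z"): 0.40},
    {("X", "Y"): 0.80, ("Z", "Y"): 0.70},
    {("Y", "X"): 0.75, ("Y", "Z"): 0.85},
]
inputs = [threshold_network(net, 0.6) for net in bootstrapped]
print(consensus(inputs, min_support=2))
```

Raising `min_support` trades recall for reliability: fewer edges survive, but each one is backed by more of the input datasets.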

Consensus Bayes Networks

E. coli

Yeast

Weighting networks • Reference: Steele, E. and Tucker, A. (2009), Selecting and Weighting Data for Building Consensus Gene Regulatory Networks, Advances in Intelligent Data Analysis VIII: 8th International Symposium on Intelligent Data Analysis (IDA 2009), Lecture Notes in Computer Science 5772: 190-201

Models of Increasing Complexity • Specification of three muscle differentiation datasets • Reference: Anvar, S.Y., 't Hoen, P.A.C. and Tucker, A. (2010), The Identification of Informative Genes from Multiple Datasets with Increasing Complexity, BMC Bioinformatics 11: 32

MIC • Select one dataset for training • Others become test sets • Score the mean and variance of the sum of squared errors (SSE) using cross-validation (CV) and the independent test sets • Use these to rank genes
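
A rough sketch of the ranking step follows. The slide does not say how mean and variance are combined, so the ordering used here (lowest mean SSE first, ties broken by lowest variance) is an assumption, as are the function names and the toy error values.

```python
from statistics import mean, pvariance

def sse(predicted, observed):
    """Sum of squared errors between predicted and observed values."""
    return sum((p - o) ** 2 for p, o in zip(predicted, observed))

def rank_genes(errors):
    """errors: dict gene -> list of SSE values, one per CV fold or
    independent test set.
    ASSUMPTION: rank by mean SSE, then by variance; lower is better."""
    stats = {gene: (mean(vals), pvariance(vals)) for gene, vals in errors.items()}
    return sorted(stats, key=lambda gene: stats[gene])

# Hypothetical SSE scores for three genes over three evaluations.
errors = {
    "g1": [1.0, 1.5, 2.0],
    "g2": [0.5, 0.5, 0.5],
    "g3": [3.0, 0.5, 4.0],
}
print(rank_genes(errors))
```

Genes that rank well here are both predictive (low mean error) and consistent across datasets (low variance), which is the property the MIC slides go on to exploit.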

MIC - Datasets • All concerned with the differentiation of cells into the muscle (myogenic) lineage • The in vitro system mimics the formation of new muscle fibres in vivo • Cao uses embryonic fibroblasts, others use a tumor cell line that has the potential for differentiation into different lineages (mainly muscle and bone) • Cao uses MyoD and MyoG to force cell differentiation (others use serum starvation) • Sartorelli includes different treatments that affect timing and efficiency

MIC • Select genes using one dataset (black) at a time and compare the average CV error rate of the BN classifier learnt on the same dataset with its error rate when validated on the other two datasets independently (grey) • Cao does well on CV but overfits • Tomczak does well on both

MIC • Select 100 informative (KS test) and 50 uninformative genes • Train BN classifier on Tomczak and test on Sartorelli • Rank genes according to average error rate • Score average improvement or deterioration of Myogenesis-Related, Top 100 and 50 randomly selected genes in Sartorelli • Compare our method with rankings generated by a concordance model
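
As an illustration of KS-based gene selection, the sketch below computes the two-sample Kolmogorov-Smirnov statistic per gene and keeps the genes with the largest values. It assumes two sample classes labelled 0 and 1 and ranks by the statistic rather than by a p-value; the talk does not specify either choice, and the function names and toy data are hypothetical.

```python
def ks_statistic(sample_a, sample_b):
    """Two-sample KS statistic: largest gap between the empirical CDFs."""
    d = 0.0
    for v in sorted(set(sample_a) | set(sample_b)):
        cdf_a = sum(1 for x in sample_a if x <= v) / len(sample_a)
        cdf_b = sum(1 for x in sample_b if x <= v) / len(sample_b)
        d = max(d, abs(cdf_a - cdf_b))
    return d

def most_informative(expression, labels, top_n):
    """expression: dict gene -> list of values; labels: list of 0/1 classes.
    ASSUMPTION: a larger KS statistic between classes means more informative."""
    scores = {}
    for gene, values in expression.items():
        group0 = [v for v, lab in zip(values, labels) if lab == 0]
        group1 = [v for v, lab in zip(values, labels) if lab == 1]
        scores[gene] = ks_statistic(group0, group1)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

# Toy expression values: g1 separates the two classes, g2 does not.
expression = {"g1": [1, 2, 8, 9], "g2": [1, 9, 2, 8]}
print(most_informative(expression, labels=[0, 0, 1, 1], top_n=1))
```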

MIC Conclusions • Highly predictive and consistent genes from a pool of differentially expressed genes, across independent datasets, are more likely to be fundamentally involved in the biological process under study • Results imply that gene regulatory networks identified in simpler systems can be used to model more complex biological systems

MIC Conclusions • e.g. muscle differentiation: the myogenesis-related network is difficult to derive from in vivo experiments due to the presence of multiple cell types and higher biological variation • But it may become evident after initial training of the network on the cleaner in vitro experiments

Inter-species Mechanisms

Summary • Explored a number of novel techniques for building more reliable GRNs • Incorporating exogenous knowledge in the form of BN priors constructed from biological abstracts • Consensus algorithms for post-learning aggregation of data / networks • Models of increasing complexity for identifying genes that are more confidently associated with a biological process • Future work – extending MIC to inter-organism mechanisms

Thanks • Dr Emma Steele, previously Brunel • Mr Yahya Anvar & Dr Peter-Bram 't Hoen, Leiden University Medical School, Netherlands
