Jorge Viveros. Summer 2006 Workshop, June 29th, 2006. ARACNE. Contents. Overview (the problem, the alternatives, the central idea of ARACNE’s algorithm). Demo (reconstruction of gene regulatory networks from Affymetrix gene expression data)
Algorithm for the Reconstruction of Accurate Cellular Networks
“Reverse engineering” or “deconvolution” problem:
max entropy methods
Gene regulatory network
A.A. Margolin [1,2], I. Nemenman, K. Basso, C. Wiggins [2,4], G. Stolovitzky, R. Dalla-Favera, A. Califano [1,2]
Dept. Biomedical informatics, Joint Centers for Sys Biology, Institute for Cancer Genetics, Dept. of Appl. Physics and Appl. Math.
IBM T.J. Watson Research Center.
BMC Bioinformatics 2006, 7(Suppl 1):S7
Understand normal mammalian cell physiology and complex pathologic
phenotypes through elucidating gene transcriptional regulatory networks.
Statistical associations between mRNA abundance levels help to
uncover gene regulatory mechanisms.
ARACNE recovers specific transcriptional interactions but does not attempt to
recover all of them (too complex a problem).
Genome-wide clustering of gene expression profiles: cannot discern direct
(irreducible) from “cascade” transcriptional gene interactions.
edge = (direct) statistical dependency
= direct regulatory interaction
nodes = genes
Temporal gene expression data for higher eukaryotes, difficult to obtain.
Only steady-state statistical dependencies are studied.
Gene expression values are samples from a joint probability distribution
Consider the multi-information = average log-deviation of the joint probability distribution (JPD) from the product of its marginals (also “Kullback-Leibler divergence” (KL-div)).
Use maximum entropy methods to approximate JPD by an element of its “m-way” marginal Frechet class (m-way maximum-entropy estimate m-MEE)
Use m-MEE to define mth-order connected information (m-cinfo) to account for m-way statistical dependencies (only!).
Multi-info = sum of all m-cinfo’s.
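The multi-information defined above can be illustrated for a small discrete distribution. The following is a minimal sketch, assuming the joint distribution is given as a NumPy array with one axis per variable; the function name `multi_information` and the two toy tables are hypothetical and not part of ARACNE.

```python
import numpy as np

def multi_information(joint):
    """KL divergence (in nats) between a discrete joint distribution
    and the product of its one-variable marginals."""
    joint = np.asarray(joint, dtype=float)
    joint = joint / joint.sum()              # normalize, in case counts were passed
    product = np.ones_like(joint)
    for axis in range(joint.ndim):
        other_axes = tuple(a for a in range(joint.ndim) if a != axis)
        marginal = joint.sum(axis=other_axes, keepdims=True)
        product = product * marginal         # broadcasting builds the product of marginals
    mask = joint > 0                         # convention: 0 * log(0) = 0
    return float(np.sum(joint[mask] * np.log(joint[mask] / product[mask])))

# Toy inputs (assumptions for illustration only).
independent = np.outer([0.3, 0.7], [0.6, 0.4])   # joint factorizes exactly
copied = np.array([[0.5, 0.0], [0.0, 0.5]])      # second variable copies the first

mi_independent = multi_information(independent)
mi_copied = multi_information(copied)
```

For two variables this quantity reduces to the ordinary mutual information, which is why the copied variable yields log 2 nats while the factorized table yields zero.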
“nodes, “expressions” or “genes”
Integral in the continuous case; sum in the discrete case
Entropy of P(x)
JPD not known, approximate it!
The m-MEE has the same m-way marginals as the JPD
m-MEE has the following form:
It has no analytical solution in general, BUT
it can be obtained via the iterative
proportional fitting procedure (IPFP)
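The iterative proportional fitting idea can be sketched for the pairwise (m = 2) case on a small discrete table. This is an illustrative sketch only, assuming a strictly positive joint table stored as a NumPy array; the function name `pairwise_max_entropy` and the fixed iteration count are hypothetical choices, not part of ARACNE.

```python
import numpy as np
from itertools import combinations

def pairwise_max_entropy(joint, n_iter=200):
    """IPFP sketch: start from the uniform distribution and repeatedly
    rescale so that each pairwise marginal matches that of `joint`."""
    joint = np.asarray(joint, dtype=float)
    joint = joint / joint.sum()
    n = joint.ndim
    estimate = np.full(joint.shape, 1.0 / joint.size)
    for _ in range(n_iter):
        for pair in combinations(range(n), 2):
            rest = tuple(a for a in range(n) if a not in pair)
            target = joint.sum(axis=rest, keepdims=True)      # desired 2-way marginal
            current = estimate.sum(axis=rest, keepdims=True)  # current 2-way marginal
            ratio = np.divide(target, current,
                              out=np.zeros_like(target), where=current > 0)
            estimate = estimate * ratio                       # proportional rescaling
    return estimate

# Toy 2x2x2 table (assumption for illustration only).
rng = np.random.default_rng(0)
toy_joint = rng.random((2, 2, 2)) + 0.5
toy_joint = toy_joint / toy_joint.sum()
fitted = pairwise_max_entropy(toy_joint)
```

Starting from the uniform distribution is what makes the fixed point the maximum-entropy member of the class with the prescribed marginals; a different starting point would converge to a different member.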
mth-order connected information
Compensate for the lack of knowledge of JPD by using the (truncated!) multi-info
to establish and quantify statistical dependencies
M-way interaction contributes to multi-info, iff minimum of interaction multi-information (inter multi-info) over -specific Frechet class is positive.
Inter multi-info =
and are m-MEE sharing same m-way marginals except for, perhaps,
Positivity of minimal inter multi-info is an irreducible (direct) interaction
Thus draw edges coming from nodes and meeting at m-edge vertex.
Regulatory cascade (Markov chain)
Information processing inequality
generically dependent (similarly, )
No triplet interactions (coregulation)
2 regulates 1 and 3 OR 1 and 3 regulate 2 jointly
does not factor
but pairwise marginals do
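A standard textbook case where the joint distribution does not factor although every pairwise marginal does is the XOR of two fair bits. The sketch below uses that example as an assumption for illustration (it is not taken from the slides) and checks that all pairwise mutual informations vanish while the multi-information is positive.

```python
import numpy as np
from itertools import product

# Joint distribution of (x1, x2, x3) with x3 = x1 XOR x2 and x1, x2 fair bits.
joint = np.zeros((2, 2, 2))
for x1, x2 in product((0, 1), repeat=2):
    joint[x1, x2, x1 ^ x2] = 0.25

def mutual_information(p_xy):
    """Mutual information (nats) of a two-variable probability table."""
    p_x = p_xy.sum(axis=1, keepdims=True)
    p_y = p_xy.sum(axis=0, keepdims=True)
    mask = p_xy > 0
    return float(np.sum(p_xy[mask] * np.log(p_xy[mask] / (p_x * p_y)[mask])))

# Pairwise marginals: sum out the remaining variable.
pairwise = {
    (0, 1): mutual_information(joint.sum(axis=2)),
    (0, 2): mutual_information(joint.sum(axis=1)),
    (1, 2): mutual_information(joint.sum(axis=0)),
}

# Multi-information: joint versus product of single-variable marginals.
singles = [joint.sum(axis=tuple(a for a in range(3) if a != i)) for i in range(3)]
independent = np.einsum("i,j,k->ijk", *singles)
mask = joint > 0
multi_info = float(np.sum(joint[mask] * np.log(joint[mask] / independent[mask])))
```

Any method that looks only at pairwise statistics would see three independent genes here, which is the limitation of a pairwise truncation that the slide points at.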
Most developed features: microarray data analysis, pathway analysis and reverse engineering, sequence analysis, transcription factor binding site analysis, pattern discovery.
N = 3 # genes
M = 2 # microarrays
Input file has N+1=4 lines
each line has M+2 (or 2M+2) fields
AffyID HG_U95Av2 SudHL6.CHP ST486.CHP
G1 G1 16.477367 0.69939363 20.150969 0.5297595
G2 G2 7.6989274 0.55935365 26.04019 0.5445875
G3 G3 8.8098955 0.5445875 21.554955 0.31372303
Microarray chip names
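The input layout above can be read with a few lines of Python. This is a sketch under stated assumptions: fields are split on whitespace (the real file may be tab-delimited), the first two fields of each row are identifiers, and in the 2M+2 layout the extra number after each expression value is treated as an accompanying per-chip score such as a p-value. The function name `parse_expression_text` is hypothetical.

```python
def parse_expression_text(text):
    """Parse the expression table shown above.
    Returns (chip_names, values, extras); `extras[gene]` is None in the
    M+2 layout and holds the second number of each pair in the 2M+2 layout."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    header = lines[0].split()
    chips = header[2:]                      # first two header fields label the ID columns
    m = len(chips)
    values, extras = {}, {}
    for ln in lines[1:]:
        fields = ln.split()
        gene = fields[0]
        numbers = [float(f) for f in fields[2:]]
        if len(numbers) == m:               # M+2 fields: values only
            values[gene], extras[gene] = numbers, None
        elif len(numbers) == 2 * m:         # 2M+2 fields: (value, extra) pairs
            values[gene], extras[gene] = numbers[0::2], numbers[1::2]
        else:
            raise ValueError(f"unexpected field count for {gene}")
    return chips, values, extras

sample = """AffyID HG_U95Av2 SudHL6.CHP ST486.CHP
G1 G1 16.477367 0.69939363 20.150969 0.5297595
G2 G2 7.6989274 0.55935365 26.04019 0.5445875
G3 G3 8.8098955 0.5445875 21.554955 0.31372303
"""
chips, values, extras = parse_expression_text(sample)
```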
ARACNE: algorithm for gene regulatory network computation given
aracne GeneExpressionFile [-a | -k | -s | -t | -e | -f]
aracne -adj GeneExpressionFile AdjacencyFile [-t | -e]
-a accurate | fast [default: accurate]
-k gaussian kernel width [accurate method only; default: 0.15]
-s Averaging Window step size [fast method only; default: 6]
-t Mutual Info. threshold [default: 0]
-e DPI tolerance (btw 0 and 1) [default: 1]
-f mean stdev [default: no filtering]
# lines = N = # genes
G1:0 8 0.064729
G2:1 2 0.0298643 7 0.0521425
G3:2 1 0.0298643
G4:3 8 0.0427217
G5:4 5 0.403516
G6:5 4 0.403516 6 0.582265
G7:6 5 0.582265 9 0.38039
G8:7 1 0.0521425 8 0.743262
G9:8 0 0.064729 3 0.0427217 7 0.743262 9 0.333104
G10:9 6 0.38039 8 0.333104
Associated gene ID#
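The adjacency listing above can be loaded into a dictionary for further analysis. The sketch assumes, from the sample alone, that each line starts with `name:index` and is followed by (neighbor index, MI) pairs; the function name `parse_adjacency_text` is hypothetical.

```python
def parse_adjacency_text(text):
    """Parse lines of the form 'G2:1 2 0.0298643 7 0.0521425'.
    Returns (names, adjacency) where adjacency[i][j] is the MI of edge i-j."""
    names, adjacency = {}, {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        name, index = fields[0].rsplit(":", 1)
        index = int(index)
        names[index] = name
        pairs = fields[1:]
        if len(pairs) % 2:
            raise ValueError(f"odd number of neighbor fields for {name}")
        adjacency[index] = {
            int(pairs[k]): float(pairs[k + 1]) for k in range(0, len(pairs), 2)
        }
    return names, adjacency

sample_adjacency = """G1:0 8 0.064729
G2:1 2 0.0298643 7 0.0521425
G3:2 1 0.0298643
G4:3 8 0.0427217
G5:4 5 0.403516
G6:5 4 0.403516 6 0.582265
G7:6 5 0.582265 9 0.38039
G8:7 1 0.0521425 8 0.743262
G9:8 0 0.064729 3 0.0427217 7 0.743262 9 0.333104
G10:9 6 0.38039 8 0.333104
"""
names, adjacency = parse_adjacency_text(sample_adjacency)
```

In this sample every edge appears once from each endpoint with the same MI value, so the parsed structure is symmetric.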
Incorporate information-theoretic ideas (Markov networks) to model statistical dependencies (cf. )
$P(\{g_i\})$ = joint probability distribution function of the stationary expressions of all genes (i = 1, …, N)
N = # genes, Z = partition function (normalization factor), $H$ = Hamiltonian,
$\phi_i$, $\phi_{ij}$, $\phi_{ijk}$, … = interaction potentials (e.g., genes i, j, k do not interact in the
model iff $\phi_{ijk} = 0$).
Aim: identify nonzero potentials.
First-order approximation: genes are independent
1st order potentials obtained from marginal probabilities (estimated experimentally).
ARACNE’s approximation: truncate joint prob dist fun to pairwise potentials
In this model, non-interacting genes have zero pairwise potential (this includes
statistically independent genes and genes that do not interact directly,
i.e., $\phi_{ij} = 0$ but $P(g_i, g_j) \neq P(g_i) P(g_j)$).
Reduce number of potential pairwise interactions via realistic biological
Assume two-way interaction: pairwise potentials determine all statistical dependencies.
Mutual information (MI) = measure of relatedness
MI = 0 iff the two genes are statistically independent, $P(g_i, g_j) = P(g_i) P(g_j)$
G = bivariate standard Gaussian density
h = kernel width
Some details and technicalities:
Transform x and y so that their marginal distributions appear uniform
There is no universal way of choosing h; however, the ranking of the MIs
depends only weakly on it.
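A Gaussian-kernel MI estimate of the kind described above can be sketched as follows. This is an illustrative sketch, not ARACNE's implementation: the rank-based transform to near-uniform marginals, the product form of the bivariate kernel, and the function names `rank_transform` and `kernel_mutual_information` are assumptions; the default `h=0.15` simply mirrors the kernel-width default listed in the usage text.

```python
import numpy as np

def rank_transform(values):
    """Map samples to (0, 1) by rank so the marginal looks uniform."""
    values = np.asarray(values, dtype=float)
    ranks = np.argsort(np.argsort(values))
    return (ranks + 0.5) / len(values)

def kernel_mutual_information(x, y, h=0.15):
    """Average of log f(x_i, y_i) / (f(x_i) f(y_i)) with Gaussian kernel
    density estimates of width h (result in nats)."""
    u, v = rank_transform(x), rank_transform(y)
    du = u[:, None] - u[None, :]
    dv = v[:, None] - v[None, :]
    norm = np.sqrt(2.0 * np.pi) * h
    k_u = np.exp(-du**2 / (2.0 * h**2)) / norm    # 1-D kernels
    k_v = np.exp(-dv**2 / (2.0 * h**2)) / norm
    f_u = k_u.mean(axis=1)                        # marginal density estimates
    f_v = k_v.mean(axis=1)
    f_uv = (k_u * k_v).mean(axis=1)               # joint estimate (product kernel)
    return float(np.mean(np.log(f_uv / (f_u * f_v))))

# Toy data (assumption for illustration only).
rng = np.random.default_rng(1)
x = rng.normal(size=300)
y_dependent = x + 0.1 * rng.normal(size=300)
y_independent = rng.normal(size=300)
mi_dependent = kernel_mutual_information(x, y_dependent)
mi_independent = kernel_mutual_information(x, y_independent)
```

Because ARACNE only needs the ranking of MI values, a biased estimate is acceptable as long as dependent pairs score clearly above independent ones, as they do in this toy comparison.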
Define a threshold I0 to discard MIs (lower bound on interaction strength)
Shuffle genes across microarray profiles & evaluate MIs for seemingly
independent genes; choose I0 based on what fraction of MIs falls below the threshold
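The shuffling procedure for choosing the MI threshold can be sketched as a permutation test. The sketch makes several assumptions for brevity: a plain histogram MI estimator stands in for the kernel estimator, the threshold is taken as a quantile of the null MI values, and the names `histogram_mi` and `null_mi_threshold` are hypothetical.

```python
import numpy as np

def histogram_mi(x, y, bins=8):
    """Plug-in MI estimate (nats) from a 2-D histogram."""
    counts, _, _ = np.histogram2d(x, y, bins=bins)
    p_xy = counts / counts.sum()
    p_x = p_xy.sum(axis=1, keepdims=True)
    p_y = p_xy.sum(axis=0, keepdims=True)
    mask = p_xy > 0
    return float(np.sum(p_xy[mask] * np.log(p_xy[mask] / (p_x * p_y)[mask])))

def null_mi_threshold(expr, quantile=0.99, n_pairs=500, seed=0):
    """Shuffle each gene's values across samples, compute MI for random
    gene pairs, and return the requested quantile of the null MIs."""
    rng = np.random.default_rng(seed)
    n_genes = expr.shape[0]
    null = []
    for _ in range(n_pairs):
        i, j = rng.choice(n_genes, size=2, replace=False)
        null.append(histogram_mi(rng.permutation(expr[i]),
                                 rng.permutation(expr[j])))
    return float(np.quantile(null, quantile))

# Toy expression matrix: 20 genes x 200 samples (assumption for illustration).
rng = np.random.default_rng(2)
expr = rng.normal(size=(20, 200))
expr[1] = expr[0] + 0.1 * rng.normal(size=200)   # one strongly dependent pair
threshold = null_mi_threshold(expr)
observed = histogram_mi(expr[0], expr[1])
```

Shuffling destroys any dependence between genes while keeping each gene's marginal distribution, so the null MIs measure only the estimator's finite-sample noise.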
Data processing inequality: if genes g1 and g2 interact only through g3, then $I(g_1, g_2) \le \min[I(g_1, g_3), I(g_3, g_2)]$
ARACNE starts with a network that has an edge for every gene pair whose MI exceeds the threshold;
it then looks at each fully connected gene triplet and removes the edge with the smallest MI
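The triplet-pruning step can be sketched on a symmetric MI matrix. This is a minimal sketch under assumptions: the function name `apply_dpi` is hypothetical, edges are only marked during the scan so the result does not depend on triplet order, and the tolerance is applied as a relative margin, so that in this sketch a tolerance of 1 removes nothing and a tolerance of 0 applies the inequality strictly.

```python
import numpy as np
from itertools import combinations

def apply_dpi(mi, threshold=0.0, tolerance=0.0):
    """Return a boolean adjacency matrix after thresholding and DPI pruning."""
    mi = np.asarray(mi, dtype=float)
    n = mi.shape[0]
    keep = mi > threshold                    # initial network: MI above threshold
    np.fill_diagonal(keep, False)
    remove = np.zeros_like(keep)             # edges marked for removal
    for i, j, k in combinations(range(n), 3):
        if keep[i, j] and keep[i, k] and keep[j, k]:
            edges = sorted([(mi[i, j], i, j), (mi[i, k], i, k), (mi[j, k], j, k)])
            weakest, a, b = edges[0]
            # Mark the weakest edge only if it is clearly below the next one.
            if weakest < edges[1][0] * (1.0 - tolerance):
                remove[a, b] = remove[b, a] = True
    return keep & ~remove

# Toy cascade 0 -> 1 -> 2: the indirect pair (0, 2) has the smallest MI.
toy_mi = np.array([[0.0, 0.8, 0.3],
                   [0.8, 0.0, 0.7],
                   [0.3, 0.7, 0.0]])
pruned = apply_dpi(toy_mi)
unpruned = apply_dpi(toy_mi, tolerance=1.0)
```

Checking all triplets against the unpruned network costs on the order of N^3 operations for N genes, which is acceptable for a sketch but worth keeping in mind for genome-scale inputs.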
N = number of genes, M = number of samples
DPI analysis; MI estimation (order of the number of pairwise interactions)
Thm 1: If MIs are estimated with no errors and the true underlying interaction network is a tree with only pairwise interactions, then ARACNE will reconstruct it.
Thm 2: If the Chow-Liu maximum mutual information tree is a subnetwork of ARACNE’s network, then this is the true network.
Thm 3: “ARACNE will reconstruct tree-network topologies exactly.”
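The Chow-Liu tree mentioned in Thm 2 is the maximum-weight spanning tree of the complete graph whose edge weights are pairwise MIs. The sketch below builds it with Kruskal's algorithm; the function name `chow_liu_tree` and the toy matrix are assumptions for illustration.

```python
def chow_liu_tree(mi):
    """Maximum spanning tree over a symmetric MI matrix (Kruskal's algorithm).
    Returns a list of (i, j) edges with i < j."""
    n = len(mi)
    edges = sorted(
        ((mi[i][j], i, j) for i in range(n) for j in range(i + 1, n)),
        reverse=True,                        # strongest MI first
    )
    parent = list(range(n))                  # union-find forest

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]    # path halving
            a = parent[a]
        return a

    tree = []
    for _, i, j in edges:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:                 # adding the edge creates no cycle
            parent[root_i] = root_j
            tree.append((i, j))
    return tree

toy_mi = [[0.0, 0.8, 0.3],
          [0.8, 0.0, 0.7],
          [0.3, 0.7, 0.0]]
tree_edges = chow_liu_tree(toy_mi)
```

On this three-gene cascade the tree keeps the two strong edges and drops the weak indirect one.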
Reconstruction of class of synthetic transcriptional networks by Mendes et al
(cf. ) and human B lymphocyte genetic network from gene expressions
Performance of ARACNE compared against Bayesian networks (using the LibB
package) and relevance networks (similar to ARACNE but with a less accurate
MI estimation procedure and a less-developed method of assigning statistical significance)
100 genes, 200 interactions organized in two types of networks
1. Erdos-Renyi: each vertex interaction is equally likely
2. Scale-free topology: distribution of vertex connections obeys a power law
Pairwise gene interaction is
“(True) positive” if their statistical regulatory interaction is directly linked.
“(True) negative” if their interaction is not direct.
Precision = fraction of true interactions among all inferred ones
(expected success rate in experimental validation of predicted interactions)
Recall = fraction of true interactions correctly inferred
Performance to be assessed via Precision-Recall curves (PRCs)
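Precision and recall for an inferred network can be computed directly from edge sets, using the standard definitions precision = TP / (TP + FP) and recall = TP / (TP + FN). The sketch treats edges as undirected; the function name `precision_recall` and the toy edge lists are hypothetical.

```python
def precision_recall(true_edges, inferred_edges):
    """Precision and recall of inferred undirected edges against a reference set."""
    true_set = {frozenset(e) for e in true_edges}
    inferred_set = {frozenset(e) for e in inferred_edges}
    true_positives = len(true_set & inferred_set)
    precision = true_positives / len(inferred_set) if inferred_set else 0.0
    recall = true_positives / len(true_set) if true_set else 0.0
    return precision, recall

# Toy networks (assumption for illustration only).
true_edges = [(0, 1), (1, 2), (2, 3), (3, 4)]
inferred_edges = [(1, 0), (1, 2), (0, 4)]
precision, recall = precision_recall(true_edges, inferred_edges)
```

Sweeping the MI threshold and recomputing this pair at each value traces out one precision-recall curve.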
ARACNE’s performance above 40% for both models
ARACNE recovers far more true connections and predicts far fewer false ones
Assembled expression profile data set of ~340 B lymphocytes from normal, tumor-related and experimentally manipulated populations.
Data set was deconvoluted by ARACNE to generate B-cell specific regulatory network of ~129,000 interactions.
Validation of the network’s quality was done by comparing inferred interactions
with those identified through biochemical methods.
AMDeC Bioinformatics Core Facility at the Columbia Genome Center
AMDeC (Academic Medicine Development Company)