Graphs and Networks with Bioconductor
Wolfgang Huber, EMBL/EBI
Bioconductor Conference 2005
Based on chapters from "Bioinformatics and Computational Biology Solutions using R and Bioconductor", Gentleman, Carey, Huber, Irizarry, Dudoit, Springer Verlag.
Graphs
Set of nodes and set of edges.
Nodes: objects of interest
Edges: relationships between them
A useful abstraction to talk about relationships and interactions (think of integer numbers, apples and fingers)
Edges may have weights, directions, types
As always, need to distinguish between the true, underlying property of nature that you want to measure, and the actual result of a measurement (experiment)
1. False positive edges
2. False negative edges (were tested, were not found, but are there in nature)
3. Untested edges (were not tested, are not in your data, but are there in nature)
Uncertainty is not usually considered in mainstream graph theory, but cannot be ignored in functional genomics.
Nice application of these concepts to protein interactions: Gentleman and Scholtens, SAGMB 2004
Adjacency matrix (straightforward)
Adjacency matrix (sparse)
They are equivalent, but may be hugely different in performance and convenience for different applications.
Can coerce between the representations
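The equivalence of the two representations can be illustrated with a small base R sketch. It converts a dense adjacency matrix into an edge list (a simple sparse form) and back. The helper names `to_edge_list` and `to_adjacency` and the four-node example are assumptions made for illustration; they are not functions of the graph package.

```r
# Dense representation: adjacency matrix of a small directed graph
nodes <- c("a", "b", "c", "d")
adj <- matrix(0, 4, 4, dimnames = list(nodes, nodes))
adj["a", "b"] <- 1
adj["b", "c"] <- 1
adj["c", "a"] <- 1
adj["d", "d"] <- 1

# Sparse representation: one row per edge (from, to)
to_edge_list <- function(m) {
  idx <- which(m != 0, arr.ind = TRUE)
  data.frame(from = rownames(m)[idx[, 1]],
             to   = colnames(m)[idx[, 2]],
             stringsAsFactors = FALSE)
}

# Back from the edge list to the dense matrix
to_adjacency <- function(el, nodes) {
  m <- matrix(0, length(nodes), length(nodes), dimnames = list(nodes, nodes))
  m[cbind(el$from, el$to)] <- 1   # index by (row name, column name) pairs
  m
}

el   <- to_edge_list(adj)
back <- to_adjacency(el, nodes)
```

The dense form stores n x n entries regardless of how many edges exist, while the edge list grows only with the number of edges, which is why the choice matters for large, sparse graphs.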
Bioconductor project emphasizes re-use and interfacing to existing, well-tested software implementations rather than reimplementing everything from scratch ourselves.
RBGL package: interface to Boost Graph Library; started by V. Carey, R. Gentleman, now driven by Li Long.
> s = acc(IMCAGraph, "SOS")
Ha-Ras Raf MEK
1 2 3
ERK MYLK MYO
4 5 6
F-actin cell proliferation
National Cancer Institute cMAP
A structured vocabulary to describe molecular function of gene products, biological processes, and cellular components.
A set of "is a", "is part of" relationships between these terms
Directed acyclic graph
Directed, undirected graphs
Walk: alternating sequence of nodes and incident edges
Distance between nodes, shortest walk
Trail: walk with no repeated edges
Path: trail with no repeated nodes (except possibly first/last)
Weakly connected directed graph (see next page)
Cut: remove edges to disconnect a graph
Cut-set: remove nodes to disconnect a graph
Connectivity of a graph
$A_G$: adjacency matrix ($n \times m$) of a bipartite graph $G$ with node sets $U$, $V$
One-mode graphs:
$A_U = A_G^{T} A_G$
$A_V = A_G A_G^{T}$
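A small base R sketch of the one-mode projections. It assumes that the rows of the bipartite matrix are indexed by V and the columns by U, so that the products match the formulas above; the paper and gene names are made up for illustration.

```r
# Hypothetical bipartite graph: rows are papers (node set V),
# columns are genes (node set U); entry 1 means "paper mentions gene"
A_G <- matrix(c(1, 1, 0,
                0, 1, 1),
              nrow = 2, byrow = TRUE,
              dimnames = list(c("paper1", "paper2"),
                              c("geneA", "geneB", "geneC")))

# One-mode graph on U: entry (i, j) counts papers shared by genes i and j
A_U <- t(A_G) %*% A_G

# One-mode graph on V: entry (i, j) counts genes shared by papers i and j
A_V <- A_G %*% t(A_G)
```

The off-diagonal entries are edge weights of the one-mode graphs. The diagonal entries count how many partners each node has in the bipartite graph (papers per gene in `A_U`, genes per paper in `A_V`).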
Can have different types of edges
Hypergraph := set of nodes + set of hyperedges
A hyperedge is a set of nodes (can be more than 2)
A directed hyperedge: pair (tail and head) of sets of nodes
Useful for representing hierarchies, partial orderings (e.g. in time, from general to special, from cause to effect)
n nodes, m edges
p(i,j) = 1/m
with high probability:
m < n/2: many disconnected components
m > n/2: one giant connected component: size ~ n.
(next biggest: size ~ log(n)).
degrees of separation: log(n).
Erdős and Rényi 1960
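The threshold behaviour can be checked by simulation. The following is a minimal base R sketch, assuming m distinct edges drawn uniformly from all node pairs. The helpers `random_edges` and `largest_component` are illustrative names, not functions of any Bioconductor package, and the component search uses a plain union-find without optimisation.

```r
# Draw m distinct undirected edges among n nodes, uniformly at random
random_edges <- function(n, m) {
  all_pairs <- which(upper.tri(matrix(0, n, n)), arr.ind = TRUE)
  all_pairs[sample.int(nrow(all_pairs), m), , drop = FALSE]
}

# Size of the largest connected component (simple union-find)
largest_component <- function(n, edges) {
  parent <- seq_len(n)
  find <- function(i) {
    while (parent[i] != i) i <- parent[i]
    i
  }
  for (k in seq_len(nrow(edges))) {
    a <- find(edges[k, 1])
    b <- find(edges[k, 2])
    if (a != b) parent[a] <- b      # merge the two components
  }
  roots <- sapply(seq_len(n), find)
  max(table(roots))
}

set.seed(1)
n <- 500
small <- largest_component(n, random_edges(n, 100))    # m well below n/2
big   <- largest_component(n, random_edges(n, 1000))   # m well above n/2
```

With m well below n/2 the largest component stays small; with m well above n/2 it should contain most of the nodes.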
Random edge graph: randomEGraph(V, p, edges)
either p: probability per edge
or edges: number of edges
Random graph with latent factor: randomGraph(V, M, p, weights=TRUE)
M: latent factor
For each node, generate a logical vector of length length(M), with P(TRUE)=p. Edges are between nodes that share >= 1 elements. Weights can be generated according to number of shared elements.
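The latent factor mechanism can be sketched in a few lines of base R. This illustrates the description above and is not the implementation of randomGraph; the function name `latent_factor_graph` and the example node and factor names are assumptions.

```r
latent_factor_graph <- function(V, M, p) {
  # membership[i, j] is 1 if node V[i] carries latent factor M[j]
  membership <- matrix(as.numeric(runif(length(V) * length(M)) < p),
                       nrow = length(V), dimnames = list(V, M))
  # shared[i, k] = number of latent factors that nodes i and k have in common
  shared <- membership %*% t(membership)
  diag(shared) <- 0                 # no self-loops
  shared                            # > 0 means an edge; the value is its weight
}

set.seed(1)
w <- latent_factor_graph(V = letters[1:5], M = c("f1", "f2", "f3"), p = 0.5)
```

An entry of `w` greater than zero means the two nodes share at least one factor and are joined by an edge, with the count of shared factors serving as the edge weight.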
Random graph with predefined degree distribution: randomNodeGraph(nodeDegree)
nodeDegree: named integer vector
sum of all node degrees must be even
For statistical inference, one can consider null hypotheses based on aforementioned random graph models; and ones based on node permutation of data graphs.
The second is often more appropriate.
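A node label permutation null can be sketched in base R as follows. The sketch assumes two undirected graphs on the same node set, given as symmetric 0/1 adjacency matrices, and uses the number of common edges as the test statistic. The function names and the toy data are illustrative assumptions.

```r
# Number of edges present in both undirected graphs
common_edges <- function(A, B) sum(A[upper.tri(A)] & B[upper.tri(B)])

# Null distribution: relabel the nodes of A at random, keep B fixed
permutation_null <- function(A, B, n_perm = 1000) {
  replicate(n_perm, {
    idx <- sample(nrow(A))
    common_edges(A[idx, idx], B)
  })
}

# Toy example: compare a random graph with itself
set.seed(1)
n <- 8
A <- matrix(0, n, n)
A[upper.tri(A)] <- rbinom(n * (n - 1) / 2, 1, 0.3)
A <- A + t(A)
B <- A

obs   <- common_edges(A, B)
null  <- permutation_null(A, B, n_perm = 200)
p_val <- mean(null >= obs)          # one-sided permutation p-value
```

Permuting labels keeps the degree distribution and the overall structure of each graph intact, which is one reason this null is often closer to the data than a random graph model.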
For data graphs, the concept of clique is usually too restrictive (false negative or untested edges)
n-clique: distance between all members is <=n. (Clique: n=1)
k-plex: maximal subgraph G in which each member is neighbour of at least |G|-k others. (Clique: k=1)
k-core: maximal subgraph G in which each member is neighbour of at least k others. (Clique: k=|G|-1)
After: Social Network Analysis, Wasserman and Faust (1994)
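As an illustration of the k-core definition, the following base R sketch repeatedly removes nodes with fewer than k neighbours among the remaining nodes. It assumes an undirected graph given as a symmetric 0/1 adjacency matrix with row names; `k_core` is an illustrative name, not a package function.

```r
k_core <- function(A, k) {
  keep <- rep(TRUE, nrow(A))
  repeat {
    # degree of every node, counting only neighbours that are still kept
    deg <- rowSums(A[, keep, drop = FALSE])
    drop_now <- keep & deg < k
    if (!any(drop_now)) break
    keep[drop_now] <- FALSE
  }
  rownames(A)[keep]
}

# Triangle a-b-c with a pendant node d attached to c
nodes <- c("a", "b", "c", "d")
A <- matrix(0, 4, 4, dimnames = list(nodes, nodes))
edges <- rbind(c("a", "b"), c("b", "c"), c("a", "c"), c("c", "d"))
A[edges] <- 1
A[edges[, 2:1]] <- 1                # make the matrix symmetric

core2 <- k_core(A, 2)               # "a" "b" "c"
core3 <- k_core(A, 3)               # empty: no node keeps 3 neighbours
```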
graph: basic class definitions and functionality
RBGL: interface to graph algorithms
Rgraphviz: rendering functionality. Different layout algorithms. Node plotting, line type, color etc. can be controlled by the user.
> library("graph"); library("Rgraphviz")
> myNodes = c("s", "p", "q", "r")
> myEdges = list(
+   s = list(edges = c("p", "q")),
+   p = list(edges = c("p", "q")),
+   q = list(edges = c("p", "r")),
+   r = list(edges = c("s")))
> g = new("graphNEL", nodes = myNodes, edgeL = myEdges, edgemode = "directed")
> nodes(g)
[1] "s" "p" "q" "r"
> edges(g)
$s
[1] "p" "q"
$p
[1] "p" "q"
$q
[1] "p" "r"
$r
[1] "s"
> degree(g)
$inDegree
s p q r
1 3 2 1
$outDegree
s p q r
2 2 2 1
> g1 <- addNode("e", g)
> g2 <- removeNode("d", g)
> ## addEdge(from, to, graph, weights)
> g3 <- addEdge("e", "a", g1, pi/2)
> ## removeEdge(from, to, graph)
> g4 <- removeEdge("e", "a", g3)
> identical(g4, g1)
> adj(g, c("b", "c"))
$b
[1] "b" "c"
$c
[1] "b" "d"
> acc(g, c("b", "c"))
$b
a c d
3 1 2
$c
a b d
2 1 1
From-to matrix:
     [,1] [,2]
[1,]    1    2
[2,]    2    3
[3,]    3    1
[4,]    4    4
Adjacency matrix:
  1 2 3 4
1 0 1 0 0
2 0 0 1 0
3 1 0 0 0
4 0 0 0 1
<graph edgemode="directed" id="G">
  <edge id="e1" from="A" to="C"/>
  <edge id="e2" from="B" to="D"/>
</graph>
GXL (www.gupro.de/GXL) is "an XML sublanguage designed to be a standard exchange format for graphs".
The graph package provides tools for importing and exporting graphs as GXL.
cc = connComp(rg)
Component sizes (top row) and number of components of each size (bottom row):
 1  2  3  4 15 18
36  7  3  2  1  1
Choose the largest component
wh = which.max(listLen(cc))
sg = subGraph(cc[[wh]], rg)
Depth first search
dfsres = dfs(sg, node = "N14")
 "N14" "N94" "N40" "N69" "N02" "N67" "N45" "N53"  "N28" "N46" "N51" "N64" "N07" "N19" "N37" "N35"  "N48" "N09"
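For readers unfamiliar with depth first search, here is a minimal base R sketch that returns the order in which nodes are discovered. It works on a plain named list of neighbours and is not the RBGL implementation; the function name `dfs_order` and the toy graph are assumptions.

```r
dfs_order <- function(adj_list, start) {
  discovered <- character(0)
  visit <- function(v) {
    discovered <<- c(discovered, v)
    for (w in adj_list[[v]]) {
      # go as deep as possible before trying the next neighbour
      if (!(w %in% discovered)) visit(w)
    }
  }
  visit(start)
  discovered
}

# Toy undirected graph: a-b, a-c, b-d
adj_list <- list(a = c("b", "c"), b = c("a", "d"), c = "a", d = "b")
ord <- dfs_order(adj_list, "a")     # "a" "b" "d" "c"
```

Node d is reached before c because the search follows the branch through b to its end before returning to the remaining neighbour of a.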
sc = strongComp(g2)
nattrs = makeNodeAttrs(g2,
for(i in 1:length(sc))
plot(g2, "dot", nodeAttrs=nattrs)
Different algorithms for different types of graphs
o all edge weights the same
o positive edge weights
o real numbers
…and different settings of the problem
o single pair
o single source
o single destination
o all pairs
rg2 = randomEGraph(nodeNames, edges = 100)
fromNode = "N43"
toNode = "N81"
sp = sp.between(rg2,
 "N43" "N08" "N88"
 "N73" "N50" "N89"
 "N64" "N93" "N32"
 "N12" "N81"
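For the simplest setting (all edge weights the same, single pair), a shortest path can be found by breadth first search. The following base R sketch illustrates this case only; it is not the algorithm used by sp.between, and the function name `bfs_shortest_path` and the toy graph are assumptions.

```r
bfs_shortest_path <- function(adj_list, from, to) {
  # prev[w] records the node from which w was first reached
  prev <- setNames(rep(NA_character_, length(adj_list)), names(adj_list))
  visited <- from
  queue <- from
  while (length(queue) > 0) {
    v <- queue[1]
    queue <- queue[-1]
    if (v == to) break
    for (w in adj_list[[v]]) {
      if (!(w %in% visited)) {
        visited <- c(visited, w)
        prev[w] <- v
        queue <- c(queue, w)
      }
    }
  }
  if (!(to %in% visited)) return(NULL)   # no path exists
  # walk back from the target to the source
  path <- to
  while (path[1] != from) path <- c(prev[[path[1]]], path)
  path
}

# Toy undirected graph: a-b, b-c, c-d, a-e, e-d, plus isolated node z
adj_list <- list(a = c("b", "e"), b = c("a", "c"), c = c("b", "d"),
                 d = c("c", "e"), e = c("a", "d"), z = character(0))
sp_ad <- bfs_shortest_path(adj_list, "a", "d")   # "a" "e" "d"
sp_az <- bfs_shortest_path(adj_list, "a", "z")   # NULL
```

With positive or real-valued edge weights, breadth first search no longer gives shortest paths, which is why different algorithms are needed for the other cases listed above.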
Minimum spanning tree
mst = mstree.kruskal(gr)
Consider graph g with single connected component.
Edge connectivity of g: minimum number of edges in g that can be cut to produce a graph with two components.
Minimum disconnecting set: the set of edges in this cut.
 "D" "E"
 "D" "H"
dot: directed graphs. Works best on DAGs and other graphs that can be drawn as hierarchies.
neato: undirected graphs using ’spring’ models
twopi: radial layout. One node (‘root’) chosen as the center. Remaining nodes on a sequence of concentric circles about the origin, with radial distance proportional to graph distance. Root can be specified or chosen heuristically.
lg = agopen(g, …)
tags= list(HREF = href,
TITLE = title,
TARGET = rep("frame2", length(AgNode(nag)))),
imgname=fpng, width=imw, height=imh)
Show drosophila interaction network example
Nodes: all yeast genes
Graph 1: co-expression clusters from yeast cell cycle microarray time course
Graph 2: protein interactions reported in the literature
Graph 3: protein interactions found in a yeast-two-hybrid experiment
Do the graphs overlap more than random?
Is there anything special about overlapping edges?
nPdist: number of common edges as computed by a node label permutation model.
Number observed in data: 42
• Which expression clusters have intersections with which of the literature clusters?
• Are known cell-cycle regulated protein complexes indeed clustered together in both graphs?
• Are there expression clusters that have a number of literature cluster edges going between them, suggesting that expression clustering was too fine or that literature clusters are not cell-cycle regulated?
• Is the expression behavior of genes that are involved in multiple protein complexes different from that of genes that are involved in only one complex?
Nothing in the preceding treatment was specific to physical protein interactions or microarray clustering. Similar reasoning can be used for many other graphs, e.g. genomic vicinity, domain composition similarity.
Packages: GOstats, Rgraphviz
actor size: number of papers that a gene appears in
event size: number of genes that appear in a paper
Example: R. Strausberg et al. Generation and initial analysis of more than 15,000 full-length human and mouse cDNA sequences. PNAS 99:16899–903, 2002
cites 15,000 genes
Note: usually one count (w.l.o.g. n22) is much larger than all the others. Test statistics that do not depend on n22:
Boundary of gene list L: set of all genes that have co-citation (above threshold weight) with genes in L.
From: B. Gunawan et al., Cancer Res. 63: 6200-6205 (2003)
oncotree package by Anja von Heydebreck
Graphs are a natural way to represent relationships, just as numbers are a natural way to represent quantities.
Three main applications:
(1) to represent data (e.g. PPI)
(2) to represent knowledge (e.g. GO)
(3) to represent high-dimensional probability distributions
Bioconductor provides a rich set of tools mainly for (1) and (2). Various parts of R for (3), see also gR project.
There are still many challenges that call for methods to model uncertainty, make inference, and predictions.
Fine control of graph rendering