341: Introduction to Bioinformatics

Dr. Nataša Pržulj

Department of Computing

Imperial College London

natasha@imperial.ac.uk

Winter 2011

slide2

Topics

  • Introduction to biology (cell, DNA, RNA, genes, proteins)
  • Sequencing and genomics (sequencing technology, sequence alignment algorithms)
  • Functional genomics and microarray analysis (array technology, statistics, clustering and classification)
  • Introduction to biological networks
  • Introduction to graph theory
  • Network properties
    • Global: network/node centralities
    • Local: network motifs and graphlets
  • Network models
  • Network/node clustering
  • Network comparison/alignment
  • Software tools for network analysis
  • Interplay between topology and biology


slide4

Network properties: summary of last class

Network Comparisons:

  • Large network comparison is computationally hard due to NP-completeness of the underlying subgraph isomorphism problem:
    • Given 2 graphs G and H as input, determine whether G contains a subgraph that is isomorphic to H.
  • Thus, network comparisons rely on easily computable heuristics (approximate solutions), called “network properties”
  • Network properties can roughly (and historically) be divided into two categories:
    • Global network properties: give an overall view of the network, but might not be detailed enough to capture complex topological characteristics of large networks.
    • Local network properties: more detailed network descriptors which usually encompass a larger number of constraints, thus reducing the degrees of freedom in which the networks being compared can vary.


slide5

Network properties: summary of last class

1. Global Network Properties

  • Readings: Chapter 3 of “Analysis of biological networks” by Junker and Schreiber.
  • Global Network Properties:
    • Degree distribution
    • Average clustering coefficient
    • Clustering spectrum
    • Average Diameter
    • Spectrum of shortest path lengths
    • Centralities
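
As an illustration of how cheaply some of these global properties can be computed, here is a minimal sketch in plain Python. The adjacency-dictionary representation, the function names, and the toy graph are assumptions made for this example; they are not taken from the course material or from GraphCrunch.

```python
from collections import Counter, deque

def degree_distribution(adj):
    """Map each degree k to the number of nodes with exactly k neighbours."""
    return Counter(len(nbrs) for nbrs in adj.values())

def average_clustering(adj):
    """Mean over all nodes of (edges among neighbours) / (possible neighbour pairs)."""
    total = 0.0
    for u, nbrs in adj.items():
        k = len(nbrs)
        if k < 2:
            continue  # nodes of degree < 2 contribute 0 (a common convention)
        links = sum(1 for v in nbrs for w in nbrs if v < w and w in adj[v])
        total += 2.0 * links / (k * (k - 1))
    return total / len(adj)

def shortest_path_spectrum(adj):
    """Count unordered node pairs at each shortest-path distance (BFS from every node)."""
    spectrum = Counter()
    for s in adj:
        dist = {s: 0}
        queue = deque([s])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        for t, d in dist.items():
            if s < t:
                spectrum[d] += 1
    return spectrum

# Toy graph (hypothetical): triangle 0-1-2 with a pendant node 3 attached to node 2.
toy = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}
print(degree_distribution(toy), average_clustering(toy), shortest_path_spectrum(toy))
```

The average diameter (average shortest path length) follows from the spectrum by taking the distance-weighted mean over all connected pairs.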
slide6

Network properties: summary of last class

2. Local Network Properties

  • Readings: Chapter 5 of “Analysis of Biological Networks” by Junker and Schreiber.
  • Local Network Properties:
    • Network motifs
    • Graphlets

Two network comparison measures based on graphlets:

  • 2.1) Relative Graphlet Frequency Distance between two networks
  • 2.2) Graphlet Degree Distribution Agreement between two networks
slide7

1) Network motifs (Uri Alon’s group, ’02-’04)

http://www.weizmann.ac.il/mcb/UriAlon/

Also, see Pajek, MAVisto, and FANMOD

slide8

2) Graphlets

2.1) Relative graphlet frequency distance between two networks

N. Przulj, D. G. Corneil, and I. Jurisica, “Modeling Interactome: Scale Free or Geometric?,” Bioinformatics, vol. 20, num. 18, pg. 3508-3515, 2004.

slide9

2) Graphlets

2.2) Graphlet degree distribution agreement between two networks

N. Przulj, “Biological Network Comparison Using Graphlet Degree Distribution,” ECCB, Bioinformatics, vol. 23, pg. e177-e183, 2007.

slide10

2) Graphlets

2.2) Graphlet degree distribution agreement between two networks

Graphlet Degree (GD) vectors, or “node signatures”

T. Milenkovic and N. Przulj, “Uncovering Biological Network Function via Graphlet Degree Signatures”, Cancer Informatics, vol. 6, pg. 257-273, 2008.

slide11

2) Graphlets

2.2) Graphlet degree distribution agreement between two networks

Signature Similarity Measure between nodes u and v

T. Milenkovic and N. Przulj, “Uncovering Biological Network Function via Graphlet Degree Signatures”, Cancer Informatics, vol. 6, pg. 257-273, 2008.
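
The formula on this slide did not survive extraction. The sketch below shows a weighted, log-scaled signature distance of the kind described in the cited paper, as best recalled here: each coordinate contributes |log(u_i + 1) - log(v_i + 1)| / log(max(u_i, v_i) + 2), scaled by a weight w_i, and the similarity is one minus the weighted average. The function name is hypothetical, and the default uniform weights are a placeholder assumption; the paper derives orbit-specific weights that are not reproduced here.

```python
import math

def signature_similarity(sig_u, sig_v, weights=None):
    """Similarity in (0, 1] between two graphlet degree vectors of equal length."""
    if len(sig_u) != len(sig_v):
        raise ValueError("signatures must have the same length")
    if weights is None:
        weights = [1.0] * len(sig_u)  # placeholder: uniform orbit weights (assumption)
    total = 0.0
    for u_i, v_i, w_i in zip(sig_u, sig_v, weights):
        # log scaling keeps very large orbit counts from dominating the distance
        diff = abs(math.log(u_i + 1) - math.log(v_i + 1))
        total += w_i * diff / math.log(max(u_i, v_i) + 2)
    return 1.0 - total / sum(weights)

# Hypothetical three-orbit signatures, for illustration only.
print(signature_similarity([3, 1, 0], [3, 1, 0]))
print(signature_similarity([5, 0, 2], [0, 7, 1]))
```

Identical signatures give a similarity of exactly 1, and every per-orbit term is strictly below 1, so the result always stays above 0.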

slide12

Software that implements many of these network properties and compares networks with respect to them:

GraphCrunch

http://bio-nets.doc.ic.ac.uk/graphcrunch/

http://bio-nets.doc.ic.ac.uk/graphcrunch2/

slide15

Another Software: Cytoscape

http://www.cytoscape.org/

slide16

Examples of signatures and signature similarities:

[Figure: proteins SMD1, YBR095C, PMA1; label 40%]

[Figure: proteins SMD1, RPO26, SMB1; label 90%*]

*Statistically significant threshold at ~85%

T. Milenković and N. Pržulj, “Uncovering Biological Network Function via Graphlet Degree Signatures,” Cancer Informatics, 2008:6 257-273, 2008 (Highly Visible).

slide20

Later we will see how to use this and other techniques to link network structure with biological function

slide21

Generalizing the degree distribution of a network

  • The degree distribution measures:
    • the number of nodes “touching” k edges for each value of k

N. Przulj, “Biological Network Comparison Using Graphlet Degree Distribution,” Bioinformatics, vol. 23, pg. e177-e183, 2007.

slide24

/ sqrt(2) ( to make it between 0 and 1)

This is called Graphlet Degree Distribution (GDD) Agreement

between networks G and H.
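
The formulas on the preceding slides were lost in extraction, so the following is only a sketch of the per-orbit computation as recalled from the cited 2007 paper: the number of nodes d(k) touching an orbit k times is scaled to d(k)/k, the scaled values are normalized to sum to 1, and the Euclidean distance between the two normalized distributions is divided by sqrt(2). The function names are hypothetical, and the example uses plain node degrees (orbit 0) instead of all orbits.

```python
import math
from collections import Counter

def normalized_distribution(orbit_counts):
    """orbit_counts: one number per node, e.g. node degrees for orbit 0."""
    d = Counter(k for k in orbit_counts if k > 0)
    scaled = {k: d_k / k for k, d_k in d.items()}    # down-weight large k
    total = sum(scaled.values())
    return {k: s / total for k, s in scaled.items()} if total else {}

def orbit_agreement(counts_g, counts_h):
    """1 - (Euclidean distance between normalized distributions) / sqrt(2)."""
    n_g = normalized_distribution(counts_g)
    n_h = normalized_distribution(counts_h)
    keys = set(n_g) | set(n_h)
    sq = sum((n_g.get(k, 0.0) - n_h.get(k, 0.0)) ** 2 for k in keys)
    return 1.0 - math.sqrt(sq) / math.sqrt(2)

# Hypothetical degree sequences of two small networks.
print(orbit_agreement([1, 2, 2, 3], [1, 2, 2, 3]))
print(orbit_agreement([1, 1, 2, 4], [2, 2, 3, 3]))
```

The full agreement would average `orbit_agreement` over every orbit; the division by sqrt(2) is what bounds each per-orbit distance by 1, since two normalized distributions with disjoint support are exactly sqrt(2) apart.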

slide27

Topics

  • Introduction to biology (cell, DNA, RNA, genes, proteins)
  • Sequencing and genomics (sequencing technology, sequence alignment algorithms)
  • Functional genomics and microarray analysis (array technology, statistics, clustering and classification)
  • Introduction to biological networks
  • Introduction to graph theory
  • Network properties
    • Network/node centralities
    • Network motifs
  • Network models
  • Network/node clustering
  • Network comparison/alignment
  • Software tools for network analysis
  • Interplay between topology and biology


does the model network fit the data
Does the model network fit the data?
  • Use network properties:
    • Local
    • Global
  • Why?
  • “Hardness” of graph theoretic problems
    • E.g. NP-completeness of subgraph isomorphism
      • Cannot exactly compare/align networks
        • Use heuristics (approximate solutions)
      • Exact comparison inappropriate in biology
        • Due to biological variation
  • Noise revise models as data sets evolve
why model networks
Why model networks?
  • Understand laws → reproduction/predictions
  • Network models have already been used in biological applications:
    • Network motifs

(Shen-Orr et al., Nature Genetics 2002, Milo et al., Science 2002)

    • De-noising of PPI network data

(Kuchaiev et al., PLoS Comp. Biology, 2009)

    • Guiding biological experiments

(Lappe and Holm, Nature Biotechnology, 2004)

    • Development of computationally easy algorithms on PPI networks for problems that are computationally intensive on graphs in general (Przulj et al., Bioinformatics, 2006)
network models
Network models

We will cover the following network models:

  • Erdos–Renyi random graphs
  • Generalized random graphs (with the same degree distribution as the data networks)
  • Small-world networks
  • Scale-free networks
  • Hierarchical model
  • Geometric random graphs
  • Stickiness index-based network model
erdos renyi random graphs er
Erdos–Renyi random graphs (ER)
  • Model a data network G(V,E) with |V|=n and |E|=m
  • An ER graph that models G is constructed as follows:
    • It has n nodes
    • Edges are added between pairs of nodes uniformly at random with the same probability p
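    • Two (essentially equivalent) methods for constructing ER graphs:
      • $G_{n,p}$: connect each pair of nodes independently with probability p, choosing p so that the expected number of edges in the model network is m
      • $G_{n,m}$: pick m distinct pairs of nodes uniformly at random and add an edge between each pair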
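
Both constructions can be sketched in a few lines of plain Python. The function names and the edge-set representation are assumptions made for this example; the brute-force enumeration of all node pairs is only suitable for small n.

```python
import random
from itertools import combinations

def er_gnp(n, p, rng=random):
    """G(n,p): include each of the n(n-1)/2 possible edges independently with probability p."""
    return {(u, v) for u, v in combinations(range(n), 2) if rng.random() < p}

def er_gnm(n, m, rng=random):
    """G(n,m): choose exactly m distinct node pairs uniformly at random."""
    return set(rng.sample(list(combinations(range(n), 2)), m))

def p_for_edges(n, m):
    """The p for which G(n,p) has m edges in expectation."""
    return m / (n * (n - 1) / 2)

# Model a hypothetical data network with n = 50 nodes and m = 100 edges.
n, m = 50, 100
print(len(er_gnm(n, m)), len(er_gnp(n, p_for_edges(n, m))))
```

G(n,m) matches the edge count of the data exactly, while G(n,p) matches it only on average.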
erdos renyi random graphs er1
Erdos–Renyi random graphs (ER)
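  • Expected number of edges, |E| = m, in $G_{n,p}$ is: $m = p \binom{n}{2} = p \, \frac{n(n-1)}{2}$
  • Average degree is: $z = \frac{2m}{n} = p(n-1) \approx pn$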
erdos renyi random graphs er2
Erdos–Renyi random graphs (ER)
  • Many properties of ER can be proven theoretically

(See: Bollobas, "Random Graphs," 2002)

  • Example:
      • When m = n/2 (i.e., average degree 1), suddenly the giant component emerges, i.e.:
        • One connected component of the network has O(n) nodes
        • The next largest connected component has O(log(n)) nodes
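
The emergence of the giant component is easy to observe numerically. The sketch below is an illustration under assumed parameters (n = 2000, a fixed random seed, hypothetical function name): it builds a G(n,m) graph and tracks connected components with a union-find structure.

```python
import random

def largest_component_fraction(n, m, rng):
    """Fraction of nodes in the largest connected component of a random G(n,m) graph."""
    parent = list(range(n))
    size = [1] * n

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    edges = set()
    while len(edges) < m:
        u, v = rng.randrange(n), rng.randrange(n)
        edge = (min(u, v), max(u, v))
        if u == v or edge in edges:
            continue  # skip self-loops and repeated pairs
        edges.add(edge)
        ru, rv = find(u), find(v)
        if ru != rv:
            if size[ru] < size[rv]:
                ru, rv = rv, ru
            parent[rv] = ru
            size[ru] += size[rv]
    return max(size[find(x)] for x in range(n)) / n

rng = random.Random(1)
n = 2000
for m in (n // 4, n // 2, n):  # below, at, and above the threshold m = n/2
    print(m, largest_component_fraction(n, m, rng))
```

Below the threshold the largest component stays a tiny fraction of the network; above it, a single component absorbs a constant fraction of all nodes.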
erdos renyi random graphs er3
Erdos–Renyi random graphs (ER)
  • The degree distribution is binomial: $P(k) = \binom{n-1}{k} p^k (1-p)^{n-1-k}$
  • For large n, this can be approximated with the Poisson distribution: $P(k) \approx \frac{z^k e^{-z}}{k!}$

where z is the average degree

  • However, currently available biological networks have power-law degree distributions
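
A quick numerical check of the Poisson approximation, using only the standard library. The values n = 1000 and p = 0.004 are arbitrary choices for illustration, and the function names are hypothetical.

```python
import math

def binomial_pk(n, p, k):
    """Probability that a node in G(n,p) has degree k (n-1 possible neighbours)."""
    return math.comb(n - 1, k) * p**k * (1 - p) ** (n - 1 - k)

def poisson_pk(z, k):
    """Poisson approximation with mean z, the average degree."""
    return z**k * math.exp(-z) / math.factorial(k)

n, p = 1000, 0.004
z = p * (n - 1)  # average degree
for k in range(0, 11, 2):
    print(k, round(binomial_pk(n, p, k), 5), round(poisson_pk(z, k), 5))
```

Both distributions decay rapidly for large k, which is why an ER graph has essentially no hubs, in contrast to a power-law degree distribution.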
erdos renyi random graphs er4
Erdos–Renyi random graphs (ER)
  • Clustering coefficient, C, of ER is low (for low p)
  • C=p, since probability p of connecting any two nodes in an ER graph is the same, regardless of whether the nodes are neighbors
  • However, biological networks have high clustering coefficients
erdos renyi random graphs er5
Erdos–Renyi random graphs (ER)
  • Average diameter of ER graphs is small
    • It is approximately $\frac{\ln n}{\ln z}$, where z is the average degree
  • Biological networks also have small average diameters
  • Summary
generalized random graphs er dd
Generalized random graphs (ER-DD)
  • Preserve the degree distribution of data (“ER-DD”)
  • Constructed as follows:
    • An ER-DD network has n nodes (so does the data)
    • Edges are added between pairs of nodes using the “stubs method”
generalized random graphs er dd1
Generalized random graphs (ER-DD)
  • The “stubs method” for constructing ER-DD graphs:
    • The number of “stubs” (to be filled by edges) is assigned to each node in the model network according to the degree distribution of the real network to be modeled
    • Edges are created between pairs of nodes with “available” stubs picked at random
    • After an edge is created, the number of stubs left available at the corresponding “end nodes” of the edge is decreased by one
    • Multiple edges between the same pair of nodes are not allowed
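
The stubs method can be sketched as follows. Shuffling the full list of stubs and pairing consecutive entries corresponds to repeatedly picking available stubs at random. How to enforce the no-multiple-edges rule is not specified on the slide; restarting the whole attempt when a self-loop or repeated edge appears is an assumption made here, as are the function name and the `max_tries` parameter.

```python
import random

def er_dd(degrees, rng=random, max_tries=1000):
    """Random simple graph in which node i has degree degrees[i]."""
    if sum(degrees) % 2:
        raise ValueError("the sum of degrees must be even")
    for attempt in range(max_tries):
        # one stub per unit of degree, e.g. degrees [2, 1, 1] -> stubs [0, 0, 1, 2]
        stubs = [node for node, d in enumerate(degrees) for _ in range(d)]
        rng.shuffle(stubs)
        edges = set()
        for i in range(0, len(stubs), 2):
            u, v = stubs[i], stubs[i + 1]
            edge = (min(u, v), max(u, v))
            if u == v or edge in edges:
                break  # self-loop or multiple edge: discard this attempt
            edges.add(edge)
        else:
            return edges  # every stub was paired without conflict
    raise RuntimeError("no simple graph found; the degree sequence may not be realizable")

# Hypothetical degree sequence taken from a small data network.
print(sorted(er_dd([3, 3, 2, 2, 2, 1, 1], random.Random(0))))
```

For heavy-tailed degree sequences whole-attempt restarts become slow, and edge-swapping schemes are a common alternative.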
generalized random graphs er dd2
Generalized random graphs (ER-DD)
  • Summary
  • 2 global network properties are matched by ER-DD
  • How about local network properties (graphlet frequencies)?
    • Low-density graphlets are over-represented in ER and ER-DD
    • However, data have lots of dense graphlets, since they have high clustering coefficients
small world networks sw
Small-world networks (SW)
  • Watts and Strogatz, 1998
  • Created from regular ring lattices by random rewiring of a small percentage of their edges
  • E.g.
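
A minimal sketch of the rewiring construction, in the spirit of Watts and Strogatz. The parameter names (`k` nearest neighbours, rewiring probability `beta`) and the function name are assumptions made for this example.

```python
import random

def small_world(n, k, beta, rng=random):
    """Ring lattice on n nodes (each linked to its k nearest neighbours, k even, n > k),
    with each lattice edge rewired to a random new endpoint with probability beta."""
    adj = {u: set() for u in range(n)}
    for u in range(n):
        for step in range(1, k // 2 + 1):
            v = (u + step) % n
            adj[u].add(v)
            adj[v].add(u)
    for u in range(n):
        for step in range(1, k // 2 + 1):
            v = (u + step) % n
            if v in adj[u] and rng.random() < beta:
                # candidate endpoints: not u itself and not already a neighbour of u
                choices = [w for w in range(n) if w != u and w not in adj[u]]
                if choices:
                    w = rng.choice(choices)
                    adj[u].discard(v)
                    adj[v].discard(u)
                    adj[u].add(w)
                    adj[w].add(u)
    return adj

lattice = small_world(20, 4, 0.0, random.Random(0))   # beta = 0: regular ring lattice
rewired = small_world(20, 4, 0.1, random.Random(0))   # a few random shortcuts
print(sum(len(nb) for nb in rewired.values()) // 2)   # number of edges is preserved
```

With `beta = 0` the graph is the regular lattice (high clustering, large diameter); with `beta = 1` it approaches a random graph; small values in between give the small-world regime.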
small world networks sw1
Small-world networks (SW)
  • SW networks have:
    • High clustering coefficients – introduced by “ring regularity”
    • Large average diameters of regular lattices – fixed by randomly re-wiring a small percentage of edges
  • Summary
scale free networks sf
Scale-free networks (SF)
  • Power-law degree distributions: $P(k) \sim k^{-\gamma}$
    • γ > 0; typically 2 < γ < 3
scale free networks sf2
Scale-free networks (SF)
  • Different models exist, e.g.:
    • Preferential Attachment Model (SF-BA)

(Barabasi-Albert, 1999)

    • Gene Duplication and Mutation Model (SF-GD)

(Vazquez et al., 2003)

scale free networks sf3
Scale-free networks (SF)
  • Preferential Attachment Model (SF-BA)
    • “Growth” model: nodes are added to an existing network
    • New nodes preferentially attach to existing nodes with probability proportional to the degrees of the existing nodes
    • This is repeated until the size of the SF network matches the size of the data
    • “Rich getting richer”
    • The starting network strongly influences the properties of the resulting network (F. Hormozdiari et al., PLoS Computational Biology, 3(7):e118, July 2007)
    • SF-BA: particularly effective at describing the Internet
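
The growth process can be sketched as below. The seed network (a small clique) and the number `m` of links per new node are assumptions; as noted above, the choice of starting network matters. Sampling uniformly from a list in which every node appears once per incident edge gives degree-proportional selection; insisting on `m` distinct targets distorts that slightly, which is accepted here for simplicity.

```python
import random

def preferential_attachment(n, m, rng=random):
    """Grow a network to n nodes; each new node links to m distinct existing nodes,
    chosen with probability proportional to their current degree."""
    # seed network: a clique on m + 1 nodes (an assumption)
    edges = [(u, v) for u in range(m + 1) for v in range(u + 1, m + 1)]
    endpoints = [x for edge in edges for x in edge]  # node repeated once per incident edge
    for new in range(m + 1, n):
        targets = set()
        while len(targets) < m:
            targets.add(rng.choice(endpoints))  # degree-proportional pick
        for t in targets:
            edges.append((t, new))
            endpoints.extend((t, new))
    return edges

edges = preferential_attachment(200, 2, random.Random(0))
degree = {}
for u, v in edges:
    degree[u] = degree.get(u, 0) + 1
    degree[v] = degree.get(v, 0) + 1
print(len(edges), max(degree.values()))
```

Early nodes tend to accumulate the highest degrees, which is the “rich getting richer” effect.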
scale free networks sf4
Scale-free networks (SF)
  • Gene Duplication and Mutation Model (SF-GD)
    • Biologically motivated
    • Attempts to mimic gene duplication and mutation processes
scale free networks sf5
Scale-free networks (SF)
  • Gene Duplication and Mutation Model (SF-GD)
    • At each time step, a node is added to the network as follows:
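
The step list on this slide did not survive extraction. The sketch below shows one commonly described duplication-divergence scheme, not necessarily the exact steps of the original slide: a random node is duplicated together with its interactions, each shared interaction is then lost by one of the two copies with probability `q`, and the duplicate is linked to its parent with probability `p`. The parameter names, the single-edge seed network, and the function name are all assumptions.

```python
import random

def duplication_divergence(n, p, q, rng=random):
    """Grow a network to n nodes by node duplication followed by random link loss."""
    adj = {0: {1}, 1: {0}}  # seed network: a single edge (assumption)
    while len(adj) < n:
        new = len(adj)
        parent = rng.randrange(new)
        # duplication: the new node inherits all interactions of its parent
        adj[new] = set(adj[parent])
        for j in adj[new]:
            adj[j].add(new)
        # divergence ("mutation"): for each shared neighbour, with probability q
        # one of the two redundant links is removed, chosen at random
        for j in list(adj[new]):
            if rng.random() < q:
                loser = parent if rng.random() < 0.5 else new
                adj[loser].discard(j)
                adj[j].discard(loser)
        # with probability p the duplicate interacts with its parent
        if rng.random() < p:
            adj[parent].add(new)
            adj[new].add(parent)
    return adj

g = duplication_divergence(100, 0.1, 0.4, random.Random(0))
print(sorted(len(nb) for nb in g.values())[-5:])  # the five largest degrees
```

Nodes can end up isolated under this scheme; whether to keep or discard them is a modelling choice left open here.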
hierarchical model
Hierarchical model
  • Preserves network “modularity” via a fractal-like generation of the network
hierarchical model1
Hierarchical model
  • These graphs do not match any biological data and are highly unlikely to be found in data sets
geometric random graphs
Geometric random graphs
  • “Uniform” geometric random graphs (GEO)

N. Przulj lab, 2004-2010

  • Geometric gene duplication and mutation model (GEO-GD)

N. Przulj et al., PSB 2010

geometric random graphs1
Geometric random graphs
  • “Uniform” geometric random graphs (GEO)
    • Take any metric space and, using a uniform random distribution, place nodes within the space
    • If any two nodes are within distance r of each other (calculated via any chosen distance norm for the space), they are connected
    • Choose r so that the size of the GEO network matches that of the data
    • There are many possible metric spaces (e.g., Euclidean space)
    • There are many possible distance norms (e.g., the Euclidean distance, the Chessboard distance, and the Manhattan/Taxi Driver distance)
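
A minimal sketch of the GEO construction in the unit cube, with the three distance norms mentioned above. The function name, the choice of the unit cube as the space, and the parameter names are assumptions made for this example.

```python
import math
import random

def geometric_random_graph(n, r, dim=2, rng=random, norm="euclidean"):
    """Place n points uniformly at random in the unit cube of dimension dim and
    connect every pair whose distance under the chosen norm is at most r."""
    points = [[rng.random() for _ in range(dim)] for _ in range(n)]

    def dist(a, b):
        diffs = [abs(x - y) for x, y in zip(a, b)]
        if norm == "euclidean":
            return math.sqrt(sum(d * d for d in diffs))
        if norm == "manhattan":
            return sum(diffs)
        if norm == "chessboard":
            return max(diffs)
        raise ValueError("unknown norm: " + norm)

    edges = {(u, v) for u in range(n) for v in range(u + 1, n)
             if dist(points[u], points[v]) <= r}
    return points, edges

points, edges = geometric_random_graph(100, 0.15, rng=random.Random(0))
print(len(edges))
```

Because the number of edges grows monotonically with r, a radius that matches the edge count of the data can be found by a simple search (for example bisection) over r on a fixed set of points.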

geometric random graphs2
Geometric random graphs
  • “Uniform” geometric random graphs (GEO)
  • Summary
geometric random graphs3
Geometric random graphs
  • Geometric gene duplication and mutation model (GEO-GD)
    • Gene duplications and mutations can be used to guide the growth process in a geometric graph
geometric random graphs8
Geometric random graphs
  • Geometric gene duplication and mutation model (GEO-GD)
    • This variant also reproduces graphlet properties of the empirical dataset
    • Also, these networks have power-law degree distributions

stickiness index based network model
Stickiness index-based network model

(N. Przulj and D. Higham, Journal of the Royal Society Interface, vol 3, num 10, pp 711 - 716, 2006.)

  • Based on the stickiness index:
    • A number based on a protein’s normalized degree in a PPI network
    • Used to summarize the abundance and popularity of binding domains of a protein
  • Assumption: a high degree protein has many binding domains
    • However, remember “date” vs. “party” hubs
  • A pair of proteins is more likely to interact under this model if both proteins have high stickiness indices
  • The probability of an edge between two nodes is the product of their stickiness indices
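
A sketch of the model under one assumption about the normalization, which the slide does not spell out: the stickiness index of node i is taken to be deg(i) divided by the square root of the sum of all degrees, as recalled from the cited paper. With that choice the expected degree of each node in the model approximately equals its degree in the data. The function names are hypothetical, and the clamp to 1 is a safeguard added here.

```python
import math
import random

def stickiness_indices(degrees):
    """theta_i = deg_i / sqrt(sum of all degrees)  (assumed normalization)."""
    total = sum(degrees)
    return [d / math.sqrt(total) for d in degrees]

def sticky_network(degrees, rng=random):
    """Connect nodes u and v with probability theta_u * theta_v."""
    theta = stickiness_indices(degrees)
    n = len(degrees)
    return {(u, v) for u in range(n) for v in range(u + 1, n)
            if rng.random() < min(1.0, theta[u] * theta[v])}

# Hypothetical degree sequence from a small PPI network.
degrees = [4, 3, 3, 2, 2, 1, 1]
theta = stickiness_indices(degrees)
print([round(t, 3) for t in theta])
print(sorted(sticky_network(degrees, random.Random(0))))
```

Since theta_i times the sum of all theta_j equals deg(i), summing the edge probabilities of a node recovers its data degree up to the excluded self-pair term.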
stickiness index based network model1
Stickiness index-based network model
  • “Sticky networks” have the expected degree distribution of the data
  • Also, they mimic well the clustering coefficients and the diameters of real-world networks
  • Summary
slide73

Software that implements many of these network models and evaluates their fit to data networks with respect to a variety of network properties (but there are others):

GraphCrunch: http://bio-nets.doc.ic.ac.uk/graphcrunch/

GraphCrunch: http://bio-nets.doc.ic.ac.uk/graphcrunch2/

slide75

Topics

  • Introduction to biology (cell, DNA, RNA, genes, proteins)
  • Sequencing and genomics (sequencing technology, sequence alignment algorithms)
  • Functional genomics and microarray analysis (array technology, statistics, clustering and classification)
  • Introduction to biological networks
  • Introduction to graph theory
  • Network properties
    • Global: network/node centralities
    • Local: network motifs and graphlets
  • Network models
  • Network/node clustering
  • Network comparison/alignment
  • Software tools for network analysis
  • Interplay between topology and biology
