CSE891-001 Open Problems in Bioinformatics - Computational Network Biology

113 Ernst Bessey Hall • CSE891-001 Open Problems in Bioinformatics - Computational Network Biology • Jin Chen, 232 Plant Biology Bld. • Fall 2012

About me… • Jin Chen, Assistant Professor in CSE and PRL since 2009 • Office: 232 Plant Biology Lab. Tel: (517) 355-5015. Email: jinchen@msu.edu

Outline • Course Description • Introduction to Computational Network Biology

Course Description • Course objectives: study interesting computational network biology problems and their algorithms, with a focus on the principles used to design those algorithms. (3 credits) • Instructor: Jin Chen, Office: 232 Plant Biology Bld. Email: jinchen@msu.edu • Office hours: Thursday 2PM-3PM. If you cannot attend office hours, email me about scheduling a different time. • Web page: http://www.msu.edu/~jinchen/cse891-2012

Course Description • Course work: one 80-minute lecture, plus 80 minutes of discussion and student presentations, each week • Grading policies: the course will be graded on attendance (10%), participation (20%), and presentation (70%). • No final exam • Term project vs. presentation

Course Description • Prerequisites: Graduate students in science or engineering. Note: an override is necessary for non-CSE graduate students; please send your PID & NetID to me. • No prior knowledge of biology is required. Computationally inclined biology graduate students are encouraged to take the class as well.

Suggested books • A.-L. Barabási, Linked: The new science of networks • U. Alon, An Introduction to Systems Biology • B. Palsson. Systems Biology: Properties of Reconstructed Networks • K. Kaneko, Life: An Introduction to Complex Systems Biology

Course Description Network Biology Graph Mining

Course Description • Select 3 papers for presentation from the online paper list • Each presentation is 45 min, including 15 min Q&A, followed by a discussion • Your grade will be largely determined by the presentation (70%) • Presentations start on Sep 11 • Alternative: 1 term project + 1 presentation

Why Bioinformatics • Recent advances in biotechnology underline the need for new computational tools in modern biology, which are essential for analyzing, understanding and manipulating the detailed information on life we now have at our disposal • Problems in computational biology range from understanding sequence data to the analysis of protein shapes, prediction of biological function, study of gene networks, and cell-wide computations

Different Views to Study Biological Problems • Combinatorial algorithms: transcription factors, protein interactions • Statistical algorithms: gene expression • Imaging algorithms: sub-cellular localization, feature extraction • Graph algorithms: biological networks, everything is related!

Science 14 January 2011: Vol. 331 no. 6014 pp. 183-185 DOI: 10.1126/science.1193210

Biological Solution to a Fundamental Distributed Computing Problem • Computational methods are extensively used to analyze and model biological systems • But this paper provides an example of the reverse of this strategy, in which a biological process is used to derive a solution to a long-standing computational problem • Distributed computing: a large number of processors jointly and distributively solve a task, without any of the processors getting all of the inputs or observing all of the outputs • Biological processes are also distributed

Maximal Independent Set (MIS) • A long-standing distributed computing problem is that of electing a set of local leaders (a maximal independent set) in a network of connected processors. Formally, a MIS is a set of nodes A such that every node in the network is either in A or directly connected to a node in A, and no two nodes in A are connected. A MIS is necessary for the deployment of large, ad hoc sensor networks • Electing a MIS in a distributed fashion has been considered a challenging problem for three decades. Luby and Alon et al. presented fast probabilistic algorithms for electing a MIS. But to date, no method has been able to efficiently reduce message complexity without assuming knowledge of the number of neighbors. (Figure legend: blue = MIS)
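The formal definition above translates directly into a checker. The following is a minimal sketch, assuming an undirected simple graph stored as a dictionary that maps each node to the set of its neighbors; the function name and the example graph are illustrative and do not come from the slides.

```python
def is_maximal_independent_set(adj, A):
    """Return True if A is a maximal independent set of the graph adj.

    adj maps each node to the set of its neighbors (undirected, no self-loops).
    """
    A = set(A)
    # Independence: no two nodes in A are connected.
    for u in A:
        if adj[u] & A:
            return False
    # Maximality: every node outside A is directly connected to a node in A.
    for u in adj:
        if u not in A and not (adj[u] & A):
            return False
    return True


# Hypothetical 5-node path graph: 0 - 1 - 2 - 3 - 4
path = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4}, 4: {3}}
print(is_maximal_independent_set(path, {0, 2, 4}))  # True
print(is_maximal_independent_set(path, {0, 1}))     # False: 0 and 1 are connected
print(is_maximal_independent_set(path, {0}))        # False: nodes 2, 3, 4 are uncovered
```

Note that a graph usually has many different MISs; on this path, {0, 3} and {1, 4} also qualify.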

Selection of Neural Precursors • The selection of neural precursors during the development of the nervous system resembles the MIS election problem. The sensory organ precursors (SOPs) are selected during larval and pupal development from clusters of equivalent cells. • A cell that is selected as a SOP inhibits its neighbors by expressing high levels of the membrane-bound protein Delta, which binds and activates the transmembrane receptor protein Notch on adjacent cells. This lateral-inhibition process is highly accurate, resulting in a regularly spaced pattern in which each cell is either selected as a SOP or is inhibited by a neighboring SOP. (Figure legend: blue = SOP)

Inspiration from Biology • Although similar, the biological solution differs from computational algorithms • SOP selection is probably performed without relying on knowledge of the number of neighbors that are not yet selected • SOP selection requires nonlinear inhibition that in effect reduces communication to the simplest set of possible messages • The authors thus asked whether they can develop an algorithm for MIS selection on the basis of a stochastic rate change model that would not require knowledge about the number of active neighbors and would only use threshold communication

Algorithm • In an arbitrary synchronous communication network, nodes can only broadcast one-bit messages. • A message broadcast by a node reaches all of its neighbors that are still active in the algorithm. • In each round, a processor can only tell whether or not a message was sent to it. When a processor receives a message, it cannot tell which of its neighboring processors sent it, and it cannot count the number of messages received in a round.

Algorithm Complexity • The running time of the algorithm is O(log n * log D) rounds (n: number of nodes; D: an upper bound on the node degree), which is the number of rounds required to execute the two nested loops • Since D is at most n, the worst-case running time is O(log^2 n) • By studying a developmental process in flies, the authors devised a solution to an important distributed computing problem. The new algorithm does not require knowledge of the degree of individual processors, uses one-bit messages, and has an optimal message complexity.
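The communication rules and the two nested loops can be illustrated with a small simulation. The slides do not give the pseudocode, so this is only a sketch under stated assumptions: the outer loop doubles the broadcast probability from about 1/D up to 1, the inner loop runs M * log n rounds with an arbitrary constant M, and a final fallback loop (not part of the analysis on the slides) guarantees that the simulation terminates. The function name `beeping_mis` is hypothetical.

```python
import math
import random


def beeping_mis(adj, seed=0, M=4):
    """Simulate a one-bit-message MIS election on an undirected graph.

    adj maps each node to the set of its neighbors. M is an assumed constant
    that scales the number of rounds per phase.
    """
    rng = random.Random(seed)
    n = len(adj)
    D = max((len(nb) for nb in adj.values()), default=0)
    log_d = max(1, math.ceil(math.log2(D + 1)))
    log_n = max(1, math.ceil(math.log2(n + 1)))

    active = set(adj)  # nodes still taking part in the election
    mis = set()        # elected local leaders

    def run_round(p):
        # Exchange 1: every active node broadcasts one bit with probability p.
        beeped = {u for u in active if rng.random() < p}
        # A node only learns WHETHER some neighbor broadcast, not who or how many.
        candidates = {u for u in beeped if not (adj[u] & beeped)}
        # Exchange 2: unopposed candidates broadcast again and join the MIS;
        # every active node that hears them becomes inactive.
        mis.update(candidates)
        silenced = {u for u in active if adj[u] & candidates}
        active.difference_update(candidates | silenced)

    # Outer loop: O(log D) phases. Inner loop: O(log n) rounds per phase.
    for i in range(log_d + 1):
        p = 1.0 / 2 ** (log_d - i)
        for _ in range(M * log_n):
            if not active:
                return mis
            run_round(p)

    # Fallback (an assumption of this sketch): keep going until everyone is done.
    while active:
        run_round(0.5)
    return mis


# Hypothetical example: a ring of 10 processors.
ring = {i: {(i - 1) % 10, (i + 1) % 10} for i in range(10)}
leaders = beeping_mis(ring, seed=1)
print(sorted(leaders))
```

The elected set depends on the random seed, but by construction two neighbors never join in the same round, and a node only becomes inactive after hearing a neighbor that joined.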

Conclusion • Using insights from biology to advance computational systems has mainly focused on optimization techniques inspired by biological observations. • Areas of computer science that require strict, provable guarantees can also benefit from knowledge regarding how biological systems operate. • Better understanding of these biological systems can lead to further improvement in the design of complex distributed computing systems.

Introduction to Computational Network Biology • Network biology belongs to systems biology, which belongs to genomics • Interested in the relations between entities rather than the entities themselves http://bionet.bioapps.biozentrum.uni-wuerzburg.de/

Networks are everywhere • Internet, social networks, anti-terrorism networks • Biological networks • Protein-protein interaction (PPI) network • protein-DNA interaction network • gene correlation network • gene regulatory network • metabolic network • signaling network… • A network is a tool for understanding complex systems • Network models explain network properties and support the study of network behavior • Network measures provide quantitative analysis of complex systems

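Definition of network (graph) • A graph G(V,E) consists of a set V of nodes (vertices) and a set E of edges • Self-loop: an edge that connects a node to itself • Multi-edges: more than one edge between the same pair of nodes, so that E is a multi-set of edges • Simple graph: has neither loops (self-edges) nor multi-edges.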
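The definition of a simple graph can be checked mechanically. This is a minimal sketch, assuming an undirected graph given as a list of (u, v) pairs; the function name is illustrative.

```python
def is_simple(edges):
    """Return True if an undirected edge list has no self-loops and no multi-edges."""
    seen = set()
    for u, v in edges:
        if u == v:
            return False         # self-loop
        key = frozenset((u, v))  # undirected: (u, v) and (v, u) are the same edge
        if key in seen:
            return False         # multi-edge
        seen.add(key)
    return True


print(is_simple([(1, 2), (2, 3), (1, 3)]))  # True
print(is_simple([(1, 2), (2, 2)]))          # False: self-loop at node 2
print(is_simple([(1, 2), (2, 1)]))          # False: edge 1-2 appears twice
```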

Definition of network (graph) Directed graph vs. Undirected graph Labeled graph vs. Unlabeled graph Symmetric graph vs. Asymmetric graph

Webpage layout • Pages on a web site and the hyperlinks between them. M. Newman and M. Girvan. Finding and evaluating community structure in networks. Phys. Rev. E 69, 026113 (2004)

Adapted from R. Albert’s slides

Biological networks

Yeast Protein-Protein Interaction network (Hawoong Jeong)

Gene regulation network of sea urchin (Eric Davidson)

Metabolic flux analysis of E. coli (Abhishek Murarka)

Why study networks? • Complex systems cannot be described in a reductionist view • The study of the behavior of complex systems starts with understanding the network topology • Network-related questions: • How do we reconstruct a network? • How can we quantitatively describe large networks? • How did networks get to be the way they are?

Simple measures • Node degree: the number of edges connected to the node • In-degree & out-degree (directed graphs): summed over all nodes, total in-degree equals total out-degree • Average degree: the average of the node degrees over all nodes in the network, denoted as $\langle k \rangle = \frac{1}{N} \sum_{i=1}^{N} k_i$, where N is the number of nodes in the network and k_i is the degree of node i • Degree distribution: P(k) gives the fraction of nodes that have exactly k edges
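The degree-based measures above are short enough to compute by hand. The following is a minimal sketch, assuming an undirected graph stored as a dictionary of neighbor sets and a directed graph stored as a list of (source, target) pairs; the function names and the toy graphs are illustrative.

```python
from collections import Counter


def degrees(adj):
    """Node degree: number of edges attached to each node."""
    return {u: len(nb) for u, nb in adj.items()}


def average_degree(adj):
    """Average degree: sum of all node degrees divided by the number of nodes N."""
    return sum(degrees(adj).values()) / len(adj)


def degree_distribution(adj):
    """P(k): fraction of nodes that have exactly k edges."""
    counts = Counter(degrees(adj).values())
    n = len(adj)
    return {k: c / n for k, c in sorted(counts.items())}


def in_out_degrees(edges):
    """In-degree and out-degree of each node for a directed edge list."""
    indeg, outdeg = Counter(), Counter()
    for u, v in edges:
        outdeg[u] += 1
        indeg[v] += 1
    return indeg, outdeg


# Hypothetical undirected toy graph with edges a-b, a-c, a-d, c-d.
toy = {"a": {"b", "c", "d"}, "b": {"a"}, "c": {"a", "d"}, "d": {"a", "c"}}
print(average_degree(toy))       # 2.0
print(degree_distribution(toy))  # {1: 0.25, 2: 0.5, 3: 0.25}

# Every directed edge adds one to both totals, so they are always equal.
indeg, outdeg = in_out_degrees([("a", "b"), ("a", "c"), ("b", "c")])
print(sum(indeg.values()), sum(outdeg.values()))  # 3 3
```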

Simple measures • Shortest path: a path between two nodes such that the sum of the weights of its constituent edges is minimized • Graph diameter: the longest shortest path between any pair of nodes in the graph. • Connected graph: any two vertices can be joined by a path • Bridge: an edge whose removal disconnects the graph
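For an unweighted graph, all four measures above reduce to breadth-first search. The sketch below assumes every edge has weight 1 (weighted graphs would need Dijkstra's algorithm instead) and an undirected graph stored as a dictionary of neighbor sets; the function names and the example graph are illustrative.

```python
from collections import deque


def bfs_distances(adj, source):
    """Shortest-path length (in edges) from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist


def is_connected(adj):
    """Connected graph: every node is reachable from an arbitrary start node."""
    if not adj:
        return True
    start = next(iter(adj))
    return len(bfs_distances(adj, start)) == len(adj)


def diameter(adj):
    """Longest shortest path over all node pairs (assumes a connected graph)."""
    return max(max(bfs_distances(adj, u).values()) for u in adj)


def is_bridge(adj, u, v):
    """True if removing edge u-v leaves v unreachable from u."""
    pruned = {x: set(nb) for x, nb in adj.items()}  # copy, then delete the edge
    pruned[u].discard(v)
    pruned[v].discard(u)
    return v not in bfs_distances(pruned, u)


# Hypothetical example: triangle a-b-c with a tail c-d.
g = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
print(bfs_distances(g, "a")["d"])  # 2
print(is_connected(g))             # True
print(diameter(g))                 # 2
print(is_bridge(g, "c", "d"))      # True
print(is_bridge(g, "a", "b"))      # False
```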

Simple measures • Betweenness centrality: for all node pairs (i, j), count the shortest paths between nodes i and j, denoted as C(i,j), and determine how many of these pass through node k, denoted as C_k(i,j). The betweenness centrality of node k is $B(k) = \sum_{i \neq j \neq k} \frac{C_k(i,j)}{C(i,j)}$ • Calculating the betweenness involves calculating the shortest paths between all pairs of vertices in a graph: O(V^2 log V + VE) for sparse graphs with Johnson’s algorithm. L. C. Freeman, Sociometry 40, 35 (1977); P. E. Black, Dictionary of Algorithms and Data Structures (2004)
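The definition of betweenness centrality can be evaluated directly on small graphs. The sketch below is a brute-force version for unweighted, undirected graphs, not Johnson's algorithm: it counts shortest paths with one breadth-first search per node and then sums C_k(i,j) / C(i,j) over unordered pairs (i, j). It relies on the fact that a shortest path from i to j passes through k exactly when d(i,k) + d(k,j) = d(i,j). Function names and the example graphs are illustrative.

```python
from collections import deque


def shortest_path_counts(adj, source):
    """BFS from source; return (dist, sigma), sigma[v] = number of shortest paths."""
    dist = {source: 0}
    sigma = {source: 1}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:            # first time v is reached
                dist[v] = dist[u] + 1
                sigma[v] = 0
                queue.append(v)
            if dist[v] == dist[u] + 1:   # u precedes v on a shortest path
                sigma[v] += sigma[u]
    return dist, sigma


def betweenness(adj):
    """Betweenness of every node k: sum over pairs (i, j) of C_k(i,j) / C(i,j)."""
    info = {s: shortest_path_counts(adj, s) for s in adj}
    nodes = list(adj)
    score = {k: 0.0 for k in adj}
    for a, i in enumerate(nodes):
        dist_i, sigma_i = info[i]
        for j in nodes[a + 1:]:          # each unordered pair once
            if j not in dist_i:
                continue                 # i and j are not connected
            for k in nodes:
                if k == i or k == j or k not in dist_i:
                    continue
                dist_k, sigma_k = info[k]
                # k lies on a shortest i-j path iff the distances add up.
                if j in dist_k and dist_i[k] + dist_k[j] == dist_i[j]:
                    score[k] += sigma_i[k] * sigma_k[j] / sigma_i[j]
    return score


# Hypothetical examples: a 3-node path and a 4-node cycle.
path3 = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}
square = {"a": {"b", "d"}, "b": {"a", "c"}, "c": {"b", "d"}, "d": {"a", "c"}}
print(betweenness(path3))   # {'a': 0.0, 'b': 1.0, 'c': 0.0}
print(betweenness(square))  # {'a': 0.5, 'b': 0.5, 'c': 0.5, 'd': 0.5}
```

The triple loop makes this O(V^3) after the searches, which is fine for illustration but too slow for large networks.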

Simple measures • Clustering coefficient: a measure of the degree to which nodes in a graph tend to cluster together. It is based on triplets of nodes. • The neighborhood N_i of a vertex v_i is defined as its immediately connected neighbors: $N_i = \{ v_j : e_{ij} \in E \}$ • The local clustering coefficient C_i of a vertex v_i is then given by the number of links between the vertices within its neighborhood divided by the number of links that could possibly exist between them. For an undirected graph, with $k_i = |N_i|$: $C_i = \frac{2 \, |\{ e_{jk} : v_j, v_k \in N_i, \ e_{jk} \in E \}|}{k_i (k_i - 1)}$ L. C. Freeman, Sociometry 40, 35 (1977)
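The local clustering coefficient can be computed directly from neighbor sets. This is a minimal sketch for an undirected simple graph stored as a dictionary of neighbor sets; it returns 0.0 for nodes with fewer than two neighbors, which is a convention assumed here rather than something stated on the slide.

```python
def local_clustering(adj, v):
    """Fraction of possible links among the neighbors of v that actually exist."""
    neighbors = adj[v]
    k = len(neighbors)
    if k < 2:
        return 0.0  # assumed convention: no neighbor pairs, so nothing to cluster
    # Each link between two neighbors is seen from both endpoints, hence // 2.
    links = sum(len(adj[u] & neighbors) for u in neighbors) // 2
    possible = k * (k - 1) / 2
    return links / possible


# Hypothetical example: triangle a-b-c with a tail c-d.
g = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b", "d"}, "d": {"c"}}
print(local_clustering(g, "a"))  # 1.0
print(local_clustering(g, "c"))  # 1 of 3 possible links exists
print(local_clustering(g, "d"))  # 0.0
```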

Complex measures • Frequent subgraph mining • Graph comparison & classification • Graph isomorphism testing

Useful software • Visualization & Topological Analysis • Cytoscape (www.cytoscape.org) • Pajek (vlado.fmf.uni-lj.si/pub/networks/pajek) • Graph related programming • LEDA (www.algorithmic-solutions.com) • Nauty (www.cs.sunysb.edu/~algorith/implement/nauty/implement.shtml)


Real networks are much more complex • Transcription regulatory networks of yeast and E. coli show an interesting example of mixed characteristics • How many genes a TF interacts with: scale-free distribution • How many TFs interact with a given gene: exponential distribution

Modularity and network motifs • Cellular functions are likely to be carried out in a highly modular manner • Module: a group of genes/proteins that work together to achieve distinct functions • Biology is full of examples of modularity

Remaining challenges • Discovery of network motifs is closely related to the generation of random networks • The structure of network motifs does not necessarily determine function • The relation between higher-level organizational, functional states and networks has not yet been studied. Voigt, W. et al. Genetics 2005; Ingram, P.J. et al. BMC Genomics 2006; Eric Werner. Nature 2007

Next class • PPI network construction • False-positive detection
