1 / 50

Biological Networks CSci 732: Introduction to Bioinformatics

Biological Networks CSci 732: Introduction to Bioinformatics. Anne Denton Assistant Professor Department of Computer Science North Dakota State University, Fargo, ND. The Promise of Bioinformatics. Rich supply of data from high-throughput experiments Sequencing Microarray experiments

Download Presentation

Biological Networks CSci 732: Introduction to Bioinformatics

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Biological NetworksCSci 732: Introduction to Bioinformatics Anne Denton Assistant Professor Department of Computer Science North Dakota State University, Fargo, ND

  2. The Promise of Bioinformatics • Rich supply of data from high-throughput experiments • Sequencing • Microarray experiments • Scores of specialized high-throughput experiments • Data made available • Most biological data disseminated in large biological databases • Dissemination is a condition of funding (in US) • Makes biology different from most other disciplines!

  3. Challenge and Opportunity • Traditional approach • Studies of one or a few gene at a time • Conclusions based on thorough domain knowledge • Challenge • Standard data analysis techniques inadequate for hundreds or thousands of genes • Opportunity • Massive data valuable to quantify evidence • New aspects can be studied, such as network structure

  4. From Sequencing to Functional Genomics • Sequencing genomes is a mature discipline experimentally and computationally • Whole-genome sequencing centrally involved computations • But what do the proteins do? • Sequence comparison (BLAST) among species • Great computational and experimental challenges • Networks have a fundamental role in specifying relationships

  5. Computer Scientist’s Introduction to Genomics • Encoding • Smallest unit of information • Computer: 1 bit (0 or 1) • Cell: 1 nucleotide of DNA (A,C,T, or G) • Most practical unit of information • Computer: 8 bit are 1 byte (can represent 256 values) • Cell: 3 nucleotide determine 1 amino acid of the protein (20 different amino acids)

  6. Further Analogy between Cells and Computers • Encoded values can serve very different purposes and are stored in the same location • Computer: Instructions and data are stored in the same memory (von Neumann architecture) • Cell: Proteins in the cell do very different things • Catalyze chemical reactions (proteins are part of the process) • Regulate other proteins (proteins change the process)

  7. Networks in Bioinformatics • Many network definitions: • Protein-protein interactions • Biochemical pathways • Annotation Networks • Different definitions in each category Scientific American 05/03

  8. Why Study Networks? • Biochemical pathways tell us about functioning of cell • Chemical processes (Metabolic pathways) • Control of other proteins (Regulatory pathways) • Neighbors in networks often have similar function • Structure of networks can tell us about evolution • Combined study of networks and data can uncover yet more information about cells

  9. Outline • Part 1: Properties of the Networks • Scale-free networks • Part 2: Networks and data • Relational data mining • Problem: Similarity between network neighbors • Solution: Focus on differences between neighbors • Comparison of different networks

  10. Example of a Biological Network:Physical Protein-Protein Interactions • Proteins interact (attach to each other) • Tests if proteins are stable in a close position • Proteins may perform function together (not tested) • Mathematically: Undirected graph • Only one definition of interactions between proteins? No! • Definition based on function: Genetic • Definition based on evolution: Domain fusion

  11. Physical Protein-Protein in Yeast Scientific American 05/03

  12. Scale-free Networks • Properties • Barabasi, Bonabeau 1998 • Some nodes have large number of links, most have only a few • Number of nodes that have a particular number of links decreases as a power law • Robust against accidental failures • Hubs with high connectivity

  13. Power-law Behavior • Probability that any node is connected to k other nodes is 1/k n with n between 2 and 3 • For k = 2: Probability of having twice as many links is a quarter as likely Scientific American 05/03

  14. Robust against accidental failures • As many as 80% of nodes can fail in a scale-free network without breaking up the entire cluster • I.e., even with large number of random mutations in genes, unaffected proteins continue to work together • Note that random removal of nodes is unlikely to remove hubs • Removal of only a few hubs does break up network significantly

  15. Examples Outside Biology • Hyperlink structure of the Internet • Physical structure (routers and communication lines) of the Internet • Social networks • Airline system • Scientific papers connected by citations

  16. Scale-free Network (Airline System) Scientific American 05/03

  17. Random Networks in Contrast • Links placed randomly • Mathematical model (Erdos 1959) • Example: Highway system • Bell-shaped (Poisson, similar to Gauss) distribution of number of nodes around typical value • For large number of links k, probability of k links decreases exponentially • Very unlikely to have nodes with a very large number of links

  18. Random Network (Highway System) Scientific American 05/03

  19. Reasons for the development of scale-free networks • Networks grow over time • Older nodes have had longer to accumulate links • The most connected nodes in E.coli metabolic network have an early evolutionary history • Preferential attachment • "The rich get richer“ cf. many people hyper-link to Google

  20. "Small World" Property • Game "Six Degrees of Kevin Bacon ": Acting in the same movie as links connects most actors in 6 steps • Internet pages are typically 19 clicks apart • Any two chemicals in a cell are only 3 reactions apart! • Small-world property is not limited to scale-free networks

  21. Are Protein Networks Pure Scale-free Networks? • Clusters of tightly connected nodes • Example • Proteins that perform function together • Recovering scale free network • Clustering of nodes leads to groups that interact as scale-free networks

  22. Summary of Scale-free Networks • Characterized by • Few hubs with a large number of edges • Many nodes with few edges • Show small-world property • Any node can be reached from any other in only a few steps • Ubiquitous in biology and outside

  23. Has Everything Interesting Related to Networks Been Done? • Graph theory is an old topic: Euler 1736 • Work on scale- free network has added to it • Surely now everything is done!

  24. Name: YAL040C Function: Cell Cycle & DNA Process Localization: nucleus Class: Cyclins Complex: Cyclin-dependent kinases Motif: PDOC00013 Phenotype: Cell cycle defects Protein-Protein Interaction Networks Name: YAL003W Function: Protein Synthesis Localization: Cytoplasm Class: GTP/GDP-exchange factors Complex: Translation complexes MOTIF: PDOC00648

  25. Outline (2): Data and Networks • Relational rather than graph-theoretic approach to mining of data on a graph • Problem of similarity between neighbors • New Algorithm • Focuses on differences • Comparison of Networks • Different definitions of Networks • Generalization of difference-based algorithm

  26. Questions of Interest • How do data between nodes relate? • Are there typical patterns among interacting proteins? • Can we find relationships that are not yet known to biologists? • It is expected that proteins in the nucleus interact with other proteins in the nucleus • It is more surprising if proteins in the nucleus interact with proteins in the mitochondria • How do different networks compare?

  27. Frequent Patterns in Tables:Association Rule Mining • 1st step: Finding sets of items that are frequent • Originally: Items in shopping carts • Here: Properties of proteins • Support: Fraction of transactions, in which the set of items occurs • 2nd step: Finding associations A -> B • If we find set A, we are likely to find set B • Similar to correlation, but goes in one direction only • Confidence: Fraction of transactions that have A, which also have B

  28. Relational Approach 0. 1.

  29. Generalization • Works for any number of nodes • Even 2-node structure allows complex rules • Results become harder to interpret for many nodes • “Small world property”: Any protein can be reached in 3-4 hops

  30. Effect of Joining on Item Sets

  31. ARM on Joined Tables • Protein names (key) used for joining but not in ARM • One node table participates multiple times • Items labeled to keep track of instance • Typical transactions • Simplest approach had been done overemphasizes similarities

  32. Problem with Naive Implementation • Many rules that are not interesting • Rules that reflect similarity between neighbors • 0.nucleus → 1.nucleus • Rules involving only one protein • 0.transcription → 0.nucleus • Note: Different support and confidence compared with ARM on node relation (protein data ignoring network) • Rules that are a consequence of both above • 0.transcription → 1.nucleus

  33. Algorithm

  34. Properties of Algorithm • Significant pruning at transaction level • Fewer items in transactions • Fewer transactions • Note: Pruning at rule or item set level would not be consistent • Example: 0.transcription → 1.nucleus • Differs from conventional setting: pruning at itemset level is gold standard • Modular approach • Unique operation can be combined with different ARM implementations

  35. Results for {0.transcription}  {1.nucleus} • Standard ARM: • Support = 0.29%, Confidence = 28.38% Rule: {0.transcription}  {0.nucleus} (0.70%, 69.59%) Rule: {0.nucleus}  {1.nucleus} (5.74%, 29.06%) • Differential ARM: • Support 0.02%, Confidence 2.08% • Typical range for differential ARM • Support 0.2-2%, Confidence 6-20%

  36. Focus on Differences • Similarity can be tested through calculation of correlation of items with themselves • Computationally easy • Association meaningless • Typical rules that are interesting to biologists • Which kinds of proteins show compartmental cross-talk? E.g., proteins in the nucleus interacting with proteins in the mitochondria • How do interaction definitions differ? E.g., which protein families show physical interactions but no domain fusion interactions

  37. Other Results of Differential ARM • Expected Cross-Talk past analysis papers • {1.mitochondria}  {0.cytoplasm} (1.2%, 27.3%) • Interesting Related Rules not found before • {1.mitochondria}  {0.nucleus} (0.72%, 16%) • {1.ER}  {0.mitochondira} (0.21%, 6%)

  38. Performance

  39. Comparison of Different Protein-Protein Interaction Networks • Different Definitions • Physical interactions • Genetic interactions • Domain fusion • Are the resulting networks biologically equivalent? • Common assumption: All networks signify similarity in neighbors

  40. Physical Interactions • Tests whether proteins physically interact when brought close together • Yeast-2-hybrid method • A gene is cut in half, and each half attached to one of the proteins in question • The gene can only perform its job if its parts come close through the protein-protein interactions • Can be done in any cell, i.e. does not test functions in cell

  41. Genetic Interactions • In vivo analysis, i.e., in the living organism • Typical scenario • Assume gene A and B can individually be deleted and the organism survives • Assume deleting A and B together means the organism does not survive • Other combinations are possible • Organism does not survive deletion of A and B individually but does survive combined deletion • Or: Organism survives but is noticeably changed

  42. Domain Fusion Interactions • Comparative Genome Analysis • Purely computational analysis • Based on evolutionary relationships • Assume species A has one gene with two domains 1 and 2 • Assume species B has two genes that have the same evolutionary origin (orthologs), 1' and 2' • Likely that proteins from genes 1' and 2' interact to generate the same function • 24,000 protein-protein interactions in yeast • No experimental verification!

  43. Can ARM Results be Compared Between Networks? • Not all proteins studied for all interactions • Networks have very different properties

  44. Construction of Network Comparison Basis Set

  45. Algorithm • Only nodes are considered that are involved in both networks • Items are eliminated if they occur in either of the two interaction types 1.A 1.C 1.D 2.A 2.B 0.C 0.E

  46. Results • Rules based on physical interactions have higher confidence than genetic or domain fusion • Example: Physical compared with Domain Fusion {0.ABC trans family signature(PDOC00185)}  {1.ATP/GTP binding site motif A(PDOC00017)} (0.45%, 90%) • ABC family is known to function together with ATP binding site • Nevertheless ABC family signature don’t occur in one gene together with ATP binding site in other species (no domain fusion) • Possible reason: many proteins with ATP binding site

  47. Conclusions of Part 2 • Differential algorithm to ARM in relations that describe networks contrasts • Neighbors within neighbors • Multiple networks • Solves other problems of network setting • Overwhelming number of rules due to neighbor similarity • Problem of rules that don’t involve all nodes • Rules that follow from combinations of both above problems • Some results confirmed by biologists • Some results interesting but plausible to biologists

  48. Overall Conclusions • Computer science techniques can uncover patterns in data that could not be identified by simple inspection • Large amount of data • Complex data • Exciting times are only starting • Functional genomics only at its beginning • Mutual understanding of language takes time

  49. Overall Conclusions (Data Miner’s perspective) • Bioinformatics excellent “playing field” for data miners • Data more easily available than in most other disciplines (possible exception: astronomy) • Results can directly benefit researchers in biology • Algorithms can/have to pass test of reality • Real data motivate fundamentally new algorithms

  50. Acknowledgements (Part 2) Computational side • Christopher Besemann Biological interpretation (Collaborators from NDSU Dept. of Biology) • Ajay Yekkirala • Ron Hutchison • Marc Anderson The work was funded by the Dept. of CS, EPSCoR and the NDSU Research Foundation

More Related