
Inferring Prototypical Transcriptional Regulatory Network from Genome Sequence

LO Leung Yau, 7th May, 2009.


Presentation Transcript


  1. Inferring Prototypical Transcriptional Regulatory Network from Genome Sequence LO Leung Yau 7th May, 2009

  2. Outline • Biological Background • Objective • Current Approaches • Various Models • Problem: Insufficient Data • Proposed Approach • Predict TFBS from protein sequence • Predict from protein sequence whether it is TF • Gene sequence → Protein sequence • Short Term Subtasks • Better Motif Model • Better tool to calculate P-value of Patterns

  3. Biological Background – Cell • Basic unit of organisms • Prokaryotic • Eukaryotic • A bag of chemicals • Metabolism controlled by various enzymes • Correct working needs • Suitable amounts of various proteins Picture taken from http://en.wikipedia.org/wiki/Cell_(biology)

  4. Biological Background – Protein • Polymer of 20 types of Amino Acids • Folds into 3D structure • Shape determines the function • Many types • Transcription Factors • Enzymes • Structural Proteins • … Picture taken from http://en.wikipedia.org/wiki/Protein http://en.wikipedia.org/wiki/Amino_acid

  5. Biological Background – DNA & RNA • DNA • Double stranded • Adenine, Cytosine, Guanine, Thymine • A-T, G-C • Those parts coding for proteins are called genes • RNA • Single stranded • Adenine, Cytosine, Guanine, Uracil Picture taken from http://en.wikipedia.org/wiki/Gene

  6. Biological Background – DNA → RNA → Protein Picture taken from http://en.wikipedia.org/wiki/Gene

  7. Biological Background – DNA → RNA → Protein • Figure labels: Promoter regions, Genes, Transcription Factors, Binding sites, Other functions • A transcriptional regulatory network is the complex interaction between genes, transcription factors (TFs) and transcription factor binding sites (TFBSs).

  8. Complex Interactions between Genes, TFs and TFBSs

  10. Importance of Inferring Transcriptional Regulatory Network • Reveals the workings of a cell and of life • Related to many diseases • Genetic disorders • Understanding the network will help us • Understand the diseases • Design drugs to cure the diseases • Genetic engineering

  11. Objective • To infer the transcriptional regulatory network (gene network) from genetic and experimental data, using different data sources as and when appropriate

  13. Current Approaches • Main Data Source • Gene Expression Microarray Data • Models • Parts Lists • Topology Models • Control Logic Models • Dynamic Models • Problem • Insufficient Data

  14. Gene Expression Microarray Data • High throughput • Measures RNA level • Relies on A-T, G-C pairing • Can monitor expression of many genes Picture taken from http://en.wikipedia.org/wiki/DNA_microarray_experiment

  15. Gene Expression Microarray Data Picture taken from http://en.wikipedia.org/wiki/DNA_microarray

  16. Various Models of Transcriptional Regulatory Network (Gene Network) • Different levels of detail • Parts Lists • Topology Models • Control Logic Models • Dynamic Models • Boolean Network • Petri Nets • Difference and Differential Equations • Finite State Linear Model (FSLM) • Stochastic Networks [86, 87, 88]

  17. Parts List • The basic components of the gene network that we model • Including • Genes • Transcription Factors • Promoters • Transcription Factor Binding Sites • …

  18. Topology Models – Example

  19. Control Logic Models

  20. Dynamic Models • Describe and simulate the dynamic changes in the state of the system • Predicting the network’s response to various environmental changes and stimuli. • Boolean Network • Petri Nets • Difference and Differential Equations • Hybrid: Finite State Linear Model (FSLM) • Stochastic Networks

  21. Boolean Network [42, 93, 1, 55]

  22. Boolean Network – Yeast Fission Example • 10 Genes • 1024 States (2^10) [22]
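
The state count on this slide follows from each gene being either on or off, so n genes give 2^n states. The sketch below shows a synchronous Boolean network update on a three-gene toy example. The genes A, B, C and their rules are hypothetical and chosen only for illustration; they are not the network from the presentation.

```python
from itertools import product

# Hypothetical toy rules (not from the slides): each gene's next value
# is a Boolean function of the current state of the network.
RULES = {
    "A": lambda s: not s["C"],        # C represses A
    "B": lambda s: s["A"],            # A activates B
    "C": lambda s: s["A"] and s["B"], # A and B together activate C
}
GENES = tuple(RULES)


def step(state):
    """Apply all rules at once (synchronous update) to a tuple of booleans."""
    named = dict(zip(GENES, state))
    return tuple(bool(RULES[g](named)) for g in GENES)


def attractor_from(state):
    """Follow the trajectory from `state` until it repeats; return the cycle."""
    seen = []
    while state not in seen:
        seen.append(state)
        state = step(state)
    return seen[seen.index(state):]


# The full state space has 2**n states; 10 genes give 2**10 = 1024.
all_states = list(product([False, True], repeat=len(GENES)))

if __name__ == "__main__":
    print(len(all_states), "states for", len(GENES), "genes")
    print("cycle from all-off:", attractor_from((False, False, False)))
```

Because the state space grows exponentially, exhaustive enumeration like this is only practical for small networks.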

  23. Petri Nets - Example [79, 34, 67, 92]

  24. Difference and Differential Equations • Continuous concentration of various molecules • For difference equation, time is discrete • For differential equation, time is continuous • In general, they have the form [15, 24, 96]

  25. Difference and Differential Equations • Usually, the interactions are assumed to be linear • The model needs many parameters • Interpretation: a coefficient >> 0 for gene n in the equation of gene 1 means gene n activates gene 1
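
A minimal sketch of a linear difference-equation model, assuming the common form x_i(t+1) = sum_j w_ij * x_j(t). The symbol w_ij and the example weights are assumptions made for illustration, since the slide's own equation did not survive in the transcript. Under this form, a large positive w_ij means gene j activates gene i, and a network of n genes needs n*n weights.

```python
def step(weights, x):
    """One discrete time step: x_i(t+1) = sum_j w_ij * x_j(t)."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) for row in weights]


def simulate(weights, x0, n_steps):
    """Return the list of states x(0), x(1), ..., x(n_steps)."""
    trajectory = [list(x0)]
    for _ in range(n_steps):
        trajectory.append(step(weights, trajectory[-1]))
    return trajectory


def n_parameters(n_genes):
    """A full linear interaction matrix has one weight per ordered gene pair."""
    return n_genes * n_genes


if __name__ == "__main__":
    # Hypothetical 2-gene example: gene 2 activates gene 1 (w_12 = 0.4 > 0).
    W = [[0.5, 0.4],
         [0.0, 0.9]]
    for t, x in enumerate(simulate(W, [1.0, 1.0], 3)):
        print(t, x)
    print("parameters for 10 genes:", n_parameters(10))
```

The quadratic growth in parameters is why such models are hard to fit when only a few conditions or time points are measured.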

  26. Finite State Linear Model (FSLM) [91, 2, 66]

  27. Stochastic Networks • In the real world, stochastic effects may play an important role • Some stochastic models have been proposed • Noisy Networks • Probabilistic Boolean Networks • Simulating a stochastic model is more computationally expensive • Depending on the purpose, stochastic models may not be necessary

  29. Problem – Insufficient Data • In microarray data • Many genes • Small number of conditions/time points • Leads to unreliably estimated models [17, 53]

  30. Current Directions to Solve Insufficiency Problem • Analysis Techniques for Small Sample Size • Regularization • Akaike Information Criterion (AIC) • Bayesian Information Criterion (BIC) • Minimum Description Length (MDL) • … • New model • Integrate Multiple Microarray Data • Heterogeneous sources • Different experiment settings [21, 77, 54, 62, 104, 72, 84] [60, 107, 48, 8, 38]
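
As a brief illustration of the information criteria listed above, the sketch below uses the standard definitions AIC = 2k - 2 ln L and BIC = k ln n - 2 ln L, where k is the number of parameters, n the number of samples and ln L the maximized log-likelihood. The function names and the example numbers are hypothetical. Both criteria penalize extra parameters, which matters when there are many genes and few samples.

```python
import math


def aic(log_likelihood, n_params):
    """Akaike Information Criterion: 2k - 2 ln L (lower is better)."""
    return 2 * n_params - 2 * log_likelihood


def bic(log_likelihood, n_params, n_samples):
    """Bayesian Information Criterion: k ln n - 2 ln L (lower is better)."""
    return n_params * math.log(n_samples) - 2 * log_likelihood


if __name__ == "__main__":
    # Hypothetical comparison: a sparse model vs. a dense model fitted to
    # the same 20 samples. The dense model fits slightly better but uses
    # far more parameters.
    for name, ll, k in [("sparse", -40.0, 5), ("dense", -35.0, 50)]:
        print(name, "AIC =", aic(ll, k), "BIC =", round(bic(ll, k, 20), 2))
```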

  32. Proposed Approach – Use Sequence • Figure labels: Promoter regions, Genes, Transcription Factors, Binding sites, Other functions • There is a lot of information in the genome sequence • We should try to use it!

  33. Proposed Approach – Core Components • Component 1: Binding Sites? • Component 2: Transcription Factor? • Component 3: DNA → RNA → Protein • The interaction between genes can therefore be inferred.

  34. Proposed Approach – Core Components • Our approach gives an initial network! • Can be used together with other approaches • Figure labels: Microarray Data; Gene and TF nodes; links marked "Missed!" and "Extra!"

  35. Component 1: Protein Sequence → Binding Sites • Need to predict • Binding domains of a protein • The DNA segment bound by the domain • The pattern bound by the protein • Need to search for occurrences of the pattern • A better motif model is helpful • Example protein: …LYDVAEYAGVSYQTVSRVV… • Example DNA: …gaaggGGTCAAGGTGACCgg… Picture taken from http://en.wikipedia.org/wiki/DNA-binding_domain

  36. Component 2: Protein Sequence → Transcription Factor? • Need to distinguish between • Transcription factors, and • Other proteins • Characteristic motifs in binding domains are helpful features • Example protein: …LYDVAEYAGVSYQTVSRVV… (Transcription Factor or Other Proteins?)

  37. Component 3: DNA → RNA → Protein Sequence • DNA → pre-mRNA (trivial, only T → U) • Pre-mRNA → mRNA (alternative splicing!) • mRNA → Protein sequence (the genetic code of amino acids is known and quite universal) Picture taken from http://en.wikipedia.org/wiki/Alternative_splicing
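
The three steps on this slide can be sketched as follows. The function names, the exon coordinates and the example sequence are hypothetical; the codon table is the standard genetic code. The sketch works on the coding strand, so transcription reduces to replacing T with U, and it takes exon boundaries as given rather than predicting them.

```python
from itertools import product

# Standard genetic code, codons enumerated in U, C, A, G order; '*' is stop.
_BASES = "UCAG"
_AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(_BASES, repeat=3), _AMINO)}


def transcribe(coding_strand):
    """DNA -> pre-mRNA for the coding strand: only T becomes U."""
    return coding_strand.upper().replace("T", "U")


def splice(pre_mrna, exons):
    """Pre-mRNA -> mRNA: keep the given (start, end) exon intervals."""
    return "".join(pre_mrna[start:end] for start, end in exons)


def translate(mrna):
    """mRNA -> protein: read codons from the first AUG until a stop codon."""
    start = mrna.find("AUG")
    if start < 0:
        return ""
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        amino = CODON_TABLE[mrna[i:i + 3]]
        if amino == "*":
            break
        protein.append(amino)
    return "".join(protein)


if __name__ == "__main__":
    dna = "ATGGCC" + "GTAAGT" + "TTTTAA"   # exon, intron, exon (made up)
    pre = transcribe(dna)
    print(translate(splice(pre, [(0, 6), (12, 18)])))  # both exons kept
```

Passing a different exon list to `splice` yields a different protein from the same gene, which is why alternative splicing is the hard part of this component.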

  38. Proposed Plan and Phases • Stages: Preparatory; Main Classifiers; Initial Network Construction & Testing • Status labels: "Started!"; "Will start soon"

  40. Short Term Subtasks • Q-gram Indexed Approximate String Matching Tool • Exploring Different Motif Models • Motifs with gaps • Develop an Improved Tool to Search Significant Patterns and Calculate p-value • Deterministic Finite Automata (DFA) • Finite Markov Chain Imbedding (FMCI) • Pattern Markov Chain (PMC) Already Done.

  41. Q-gram Indexed Approximate String Matching Tool • IDEA: quickly discard parts of the target which CANNOT contain a match • A kind of pruning • Pruning is a successful strategy in many problems • Figure labels: Pattern; Target (Text/DB/…) sequence; filtered-out regions, which are not checked with the fully sensitive method
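
The pruning idea can be illustrated with a small filter-then-verify sketch. It is not the author's tool. It assumes mismatches only (Hamming distance) and uses the pigeonhole argument: if a pattern is cut into k+1 non-overlapping pieces of length q, any occurrence with at most k mismatches must contain at least one piece exactly. A q-gram index of the target finds those exact pieces, and only the resulting candidate positions are checked fully. All function names are hypothetical.

```python
from collections import defaultdict


def build_qgram_index(text, q):
    """Map every q-gram of the text to the list of positions where it starts."""
    index = defaultdict(list)
    for i in range(len(text) - q + 1):
        index[text[i:i + q]].append(i)
    return index


def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def find_approx(pattern, text, k, q):
    """Start positions where pattern matches text with at most k mismatches."""
    m = len(pattern)
    if (k + 1) * q > m:
        raise ValueError("need (k + 1) * q <= len(pattern) for the filter")
    index = build_qgram_index(text, q)
    candidates = set()
    for j in range(k + 1):                 # k + 1 disjoint pattern pieces
        offset = j * q
        piece = pattern[offset:offset + q]
        for pos in index.get(piece, ()):   # exact hits of this piece
            start = pos - offset
            if 0 <= start <= len(text) - m:
                candidates.add(start)
    # Fully sensitive check, but only on the surviving candidates.
    return sorted(s for s in candidates if hamming(pattern, text[s:s + m]) <= k)


if __name__ == "__main__":
    print(find_approx("TTGACA", "ACGTACGTTTGACCA", k=1, q=3))
```

Regions of the target that contain none of the pattern pieces are never verified, which is where the speed-up comes from.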

  42. Q-gram Indexed Approximate String Matching Tool

  44. Exploring Different Motif Model • Popular Motif Model • Position Weight Matrix (PWM) • Assumptions • Fixed-length, contiguous • Independence of nucleotides • Easily handles wildcards • But difficult to handle gaps • Has been successful on some datasets • But performs poorly on the Tompa (2005) dataset
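
A minimal PWM sketch under the two assumptions listed on this slide: a fixed-length contiguous site and independent positions, so a window's score is a sum of per-position log-odds. The aligned sites, the pseudocount and the uniform background of 0.25 are assumptions made for illustration, and the scanned sequence is the DNA example from the Component 1 slide.

```python
import math

BASES = "ACGT"


def build_pwm(sites, pseudocount=1.0, background=0.25):
    """Build a log-odds PWM (list of per-position dicts) from aligned sites."""
    total = len(sites) + 4 * pseudocount
    pwm = []
    for pos in range(len(sites[0])):
        column = [site[pos] for site in sites]
        pwm.append({
            b: math.log2(((column.count(b) + pseudocount) / total) / background)
            for b in BASES
        })
    return pwm


def score(pwm, window):
    """Positions are independent, so the score is a sum over positions."""
    return sum(col[base] for col, base in zip(pwm, window))


def best_hit(pwm, sequence):
    """Start index of the highest-scoring window in the sequence."""
    width = len(pwm)
    return max(range(len(sequence) - width + 1),
               key=lambda i: score(pwm, sequence[i:i + width]))


if __name__ == "__main__":
    sites = ["TGACC", "TGACC", "TGACT", "TGATC"]   # hypothetical aligned sites
    pwm = build_pwm(sites)
    seq = "gaaggGGTCAAGGTGACCgg".upper()
    i = best_hit(pwm, seq)
    print(i, seq[i:i + len(pwm)], round(score(pwm, seq[i:i + len(pwm)]), 2))
```

Since every window has the same width, a site with a variable-length gap cannot be represented directly, which is the limitation the slide points out.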

  45. Exploring Different Motif Model • Aim: • To explore if motifs with gaps fit the data • To explore different notions of “over-represented” • Approach: • de novo motif discovery on existing dataset • Assuming different models • Assuming different notions of “over-represented”
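
To make "motifs with gaps" concrete, the sketch below searches for a consensus pattern written with IUPAC nucleotide codes, where N stands for any base and so acts as a fixed-length gap. The pattern GGTCANNNTGACC is an assumption chosen because it fits the DNA example on the Component 1 slide; the slides do not state which gapped patterns were tested.

```python
import re

# IUPAC nucleotide codes expressed as regular-expression character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def motif_to_regex(motif):
    """Translate an IUPAC consensus string into a regular expression."""
    return "".join(IUPAC[ch] for ch in motif.upper())


def find_motif(motif, sequence):
    """Start positions of all (possibly overlapping) occurrences."""
    # A lookahead matches without consuming, so overlapping hits are reported.
    pattern = re.compile("(?=" + motif_to_regex(motif) + ")")
    return [m.start() for m in pattern.finditer(sequence.upper())]


if __name__ == "__main__":
    print(find_motif("GGTCANNNTGACC", "gaaggGGTCAAGGTGACCgg"))
```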

  46. Exploring Different Motif Model • Models Tested

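  47. Exploring Different Motif Model – Notions of “over-represented” • Count score: $(s_1 + s_2 + s_3 + s_4)/4$, where $s_i$ is the score shown next to count $c_i$ in the figure • P-value: $P(> X \text{ times in background})$ for a pattern seen $X$ times, evaluated under the background model • Estimated probability: $P(\mathrm{TFBS} \mid c_1, c_2, \dots, c_4) = \dfrac{P(\mathrm{TFBS})\, P(c_1, c_2, \dots, c_4 \mid \mathrm{TFBS})}{P(c_1, c_2, \dots, c_4)}$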
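
A rough sketch of the P-value notion, assuming the simplest possible background: independent bases with fixed frequencies, and occurrences at different positions treated as independent trials. Under those assumptions the number of occurrences is binomial, and P(> X times in background) is a binomial tail. The function names are hypothetical. The sketch ignores overlapping occurrences and Markov-dependent backgrounds, which automaton-based and Markov-chain-based methods are designed to handle.

```python
from math import comb


def pattern_probability(pattern, base_probs):
    """Probability that an exact pattern occurs at one given position."""
    p = 1.0
    for base in pattern:
        p *= base_probs[base]
    return p


def p_value_more_than(x, n_positions, p):
    """P(count > x) when count ~ Binomial(n_positions, p)."""
    at_most_x = sum(comb(n_positions, i) * p ** i * (1 - p) ** (n_positions - i)
                    for i in range(x + 1))
    return 1.0 - at_most_x


if __name__ == "__main__":
    uniform = {b: 0.25 for b in "ACGT"}
    p = pattern_probability("TGACC", uniform)
    n = 1000 - len("TGACC") + 1        # start positions in a 1000-base sequence
    print(p_value_more_than(3, n, p))  # chance of more than 3 occurrences
```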

  48. Preliminary Results – Max F-measure • Recall = TP/(TP+FN) • Precision = TP/(TP+FP) • F-Measure = 2pr/(p+r), where p is precision and r is recall

  49. Preliminary Results – Tompa • Recall = TP/(TP+FN) • Precision = TP/(TP+FP) • F-Measure = 2pr/(p+r)

  50. Preliminary Results – Tompa • Recall = TP/(TP+FN) • Precision = TP/(TP+FP) • F-Measure = 2pr/(p+r)
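
The three measures defined on these slides can be computed directly from the counts of true positives (TP), false positives (FP) and false negatives (FN). The function name and the example counts below are hypothetical and are not results from the presentation.

```python
def precision_recall_f(tp, fp, fn):
    """Return (precision, recall, F-measure) from raw counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


if __name__ == "__main__":
    # Hypothetical counts: 8 predicted sites are real, 2 are spurious,
    # and 8 real sites were missed.
    p, r, f = precision_recall_f(tp=8, fp=2, fn=8)
    print(f"precision={p:.3f} recall={r:.3f} F={f:.3f}")
```

The F-measure is the harmonic mean of precision and recall, so it is high only when both are high.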
