
A probabilistic method to detect regulatory modules

A probabilistic method to detect regulatory modules. Saurabh Sinha et al. ISMB 2003 Presented by Tian Xia. Outline. Objective To detect regulatory modules (clusters of binding sites) Motivation





Presentation Transcript


  1. A probabilistic method to detect regulatory modules Saurabh Sinha et al. ISMB 2003 Presented by Tian Xia.

  2. Outline • Objective • To detect regulatory modules (clusters of binding sites) • Motivation • Discovery of modules is crucial for understanding the connection between genes and organism diversity • Method • HMM and an EM algorithm, given PWMs of a set of transcription factors known to work together • New features • Take motif correlations into account • Use phylogenetic comparisons to highlight a module

  3. Basics • Objective: module detection • [Figure labels: GENE, PROTEIN, Binding Sites, MODULE]

  4. Basics • Objective: module detection • Binding sites are short, similar to each other, and have some variability • Different transcription factors have different-looking binding sites (“motifs”) • [Figure: example binding sites near a GENE: ACAGTGA, AGAGTGA, ACAGAGA, ACCCGTT, ACCGGTT]

  5. Basics • Operation • Score one DNA sequence S with a set W of motifs. • Proceed from one end to the other, check successive sequence “windows” of a fixed length L. • Each window is scored. Output the series of scores.
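The window scan on this slide can be sketched as follows. This is a minimal illustration, not the Stubb implementation: `score_windows`, the `step` argument, and the placeholder score `gc_fraction` are hypothetical names introduced here, and the real per-window score is the HMM-based one described on the following slides.

```python
from typing import Callable, List, Tuple

def score_windows(seq: str, window_len: int,
                  score_fn: Callable[[str], float],
                  step: int = 1) -> List[Tuple[int, float]]:
    """Score successive fixed-length windows from one end of seq to the other.

    Returns (start position, score) pairs. `step` is an assumption; the
    slide only says that successive windows are checked.
    """
    return [(start, score_fn(seq[start:start + window_len]))
            for start in range(0, len(seq) - window_len + 1, step)]

def gc_fraction(window: str) -> float:
    """Placeholder score (hypothetical): fraction of G/C bases."""
    return sum(base in "GC" for base in window) / len(window)

scores = score_windows("ACGTGCGC", window_len=4, score_fn=gc_fraction)
```

Any function that maps a window to a number can be passed as `score_fn`, so the scan itself stays independent of the scoring model.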

  6. Algorithm: HMM0 • Assumption • Sequence S is generated by an HMM • Model parameter $\theta$ • Given: • a set of motifs W (their PWMs) • a background motif $w_b$ (length 1; the sampling probability of a base depends on the previous k bases in the sequence) • Hidden: • Transition probabilities $\{p_i\}$ • $\Pr(w_j \to w_i) = p_i$ (independent of $w_j$)

  7. Algorithm: HMM0 • The Process • At each step, choose either a $w_i$ from W, or the background motif $w_b$ • Choice dictated by $\{p_i\}$ • Once a motif w is chosen, sample a sequence from the PWM of w and append it at the end of S • Proceed to the next step • Stop when the length of S reaches L
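The generative process above can be sketched in a few lines. Several things here are assumptions for illustration only: PWMs are represented as lists of `{base: probability}` columns, the background is a length-1 PWM of order 0 (the slides describe a background that depends on the previous k bases), and a site that overshoots L is simply truncated.

```python
import random

def sample_site(pwm, rng):
    """Draw one string from a PWM given as a list of {base: prob} columns."""
    return "".join(
        rng.choices(list(col), weights=list(col.values()))[0] for col in pwm
    )

def generate(motifs, probs, length, rng):
    """Run the HMM0 process; return the sequence and its parse.

    motifs: list of PWMs, with the background as a length-1 PWM.
    probs:  transition probabilities p_i, one per motif.
    """
    seq, parse = "", []
    while len(seq) < length:
        i = rng.choices(range(len(motifs)), weights=probs)[0]
        seq += sample_site(motifs[i], rng)
        parse.append(i)
    # Assumption: a final site that runs past `length` is truncated.
    return seq[:length], parse

motif = [{"A": 0.97, "C": 0.01, "G": 0.01, "T": 0.01},
         {"A": 0.01, "C": 0.97, "G": 0.01, "T": 0.01}]
background = [{"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}]
seq, parse = generate([motif, background], [0.2, 0.8], 20, random.Random(0))
```

The returned `parse` is the list of motif indices chosen at each step, which is the object the next slide calls T.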

  8. Algorithm: HMM0 • Parse (T): • The sequence of motifs chosen in the successive steps of the process is called a parse • $\Pr(S, T \mid \theta)$ • Defined for each parse T of the sequence S • $\Pr(S \mid \theta) = \sum_T \Pr(S, T \mid \theta)$ • The probability that S is generated by an HMM with parameter $\theta$

  9. Algorithm: HMM0 • $\Pr(S \mid \theta) = \sum_T \Pr(S, T \mid \theta)$ • The probability that S is generated by an HMM with parameter $\theta$ • $\Pr(S \mid \theta_b)$ • The probability that S is generated using only the background motif $w_b$

  10. Algorithm: HMM0 • The score of the sequence S • Log-likelihood ratio: $F(S, \theta) = \log \frac{\Pr(S \mid \theta)}{\Pr(S \mid \theta_b)}$ • How likely it is that S differs from background • How likely it is that S is generated by an HMM
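A small sketch of how such a log-likelihood ratio could be computed, summing over all parses with a forward recursion. The PWM representation (lists of `{base: probability}` columns), the order-0 background, and the function names are assumptions made for this example; plain floating point is used, so it is only suitable for short sequences.

```python
import math

def emit(pwm, s):
    """Probability that a PWM (list of {base: prob} columns) emits s."""
    p = 1.0
    for col, base in zip(pwm, s):
        p *= col[base]
    return p

def forward(seq, motifs, probs):
    """f[j] = probability of generating exactly the prefix seq[:j],
    summed over all parses."""
    f = [1.0] + [0.0] * len(seq)
    for j in range(1, len(seq) + 1):
        for pwm, p in zip(motifs, probs):
            k = len(pwm)
            if k <= j:
                f[j] += f[j - k] * p * emit(pwm, seq[j - k:j])
    return f

def llr_score(seq, motifs, probs, background):
    """log Pr(S | theta) - log Pr(S | theta_b); probs[-1] is the
    background probability (assumed ordering)."""
    full = forward(seq, motifs + [background], probs)[-1]
    bg_only = forward(seq, [background], [1.0])[-1]
    return math.log(full / bg_only)

motif = [{b: (0.97 if b == c else 0.01) for b in "ACGT"} for c in "ACGT"]
background = [{b: 0.25 for b in "ACGT"}]
hit = llr_score("ACGTACGT", [motif], [0.2, 0.8], background)
miss = llr_score("TTTTTTTT", [motif], [0.2, 0.8], background)
```

In this toy setup a window made of two consensus sites scores above zero and a window with no site scores below zero, which matches the intended reading of the ratio.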

  11. Algorithm: HMM0 • Score: • Train the hidden parameters $\{p_i\}$ to maximize $F(S, \theta)$ • Baum-Welch algorithm • Expectation-Maximization search (converges to a local maximum) • Dynamic programming to compute $\log \Pr(S \mid \theta)$
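One EM iteration for the transition probabilities can be sketched with a forward and a backward pass. This is an illustrative reading of the training step, not the authors' code: it assumes PWMs given as lists of `{base: probability}` columns, an order-0 background treated as a length-1 motif, and unscaled floating point (short sequences only).

```python
def emit(pwm, s):
    """Probability that a PWM (list of {base: prob} columns) emits s."""
    p = 1.0
    for col, base in zip(pwm, s):
        p *= col[base]
    return p

def em_step(seq, motifs, probs):
    """One EM update of {p_i}; returns (new probs, Pr(S | old probs))."""
    n = len(seq)
    f = [1.0] + [0.0] * n          # f[j]: probability of prefix seq[:j]
    for j in range(1, n + 1):
        for pwm, p in zip(motifs, probs):
            k = len(pwm)
            if k <= j:
                f[j] += f[j - k] * p * emit(pwm, seq[j - k:j])
    b = [0.0] * n + [1.0]          # b[j]: probability of suffix seq[j:]
    for j in range(n - 1, -1, -1):
        for pwm, p in zip(motifs, probs):
            k = len(pwm)
            if j + k <= n:
                b[j] += p * emit(pwm, seq[j:j + k]) * b[j + k]
    # E-step: expected number of times each motif is used in a parse.
    counts = []
    for pwm, p in zip(motifs, probs):
        k = len(pwm)
        counts.append(sum(f[j] * p * emit(pwm, seq[j:j + k]) * b[j + k]
                          for j in range(n - k + 1)) / f[n])
    # M-step: renormalize expected counts into probabilities.
    total = sum(counts)
    return [c / total for c in counts], f[n]

motif = [{b: (0.97 if b == c else 0.01) for b in "ACGT"} for c in "ACGT"]
background = [{b: 0.25 for b in "ACGT"}]
probs, likelihoods = [0.5, 0.5], []
for _ in range(10):
    probs, lik = em_step("TTACGTTTACGTTT", [motif, background], probs)
    likelihoods.append(lik)
```

Each iteration cannot decrease the likelihood, but the search can stop at a local optimum, so the starting values of `probs` matter in general.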

  12. Multiple Species • Objective: module detection • More data: genomes of multiple (closely related) species are available • [Figure: EVOLUTIONARY TREE and CONSERVED BLOCKS for species1 GCGTGATCGAGCTATAACGGAA, species2 CTGTGATCGTCGGGTAACGCCC, species3 TGGTGATCGGAACCCCTAACGA, species4 AAGTGATCGATTATCCTAACGT]

  13. Multiple Species • Conserved regions (evolved from a common ancestral sequence) • Four binding sites that evolved from the same ancestral site • $\Pr(\text{species1} \mid \theta)$ • $\Pr(\text{species1}, \text{species2}, \text{species3}, \text{species4} \mid \theta)$ • [Figure: aligned sequences of species1 to species4]

  14. Multiple Species • One binding site, independent of the others • $\Pr(\text{species1}, \text{species2}, \text{species3}, \text{species4} \mid \theta)$ • [Figure: sequences of species1 to species4]

  15. Multiple Species • Look out for both kinds of binding sites; treat them appropriately in the model • [Figure: sequences of species1 to species4]

  16. Multiple Species • Multiple species provide extra information that improves module detection • Two steps • Identify conserved blocks by sequence alignment algorithms • Define a homologous window between species, score the window

  17. Multiple Species • Step 1 Identify conserved blocks by sequence alignment algorithms • Two species: Lagan (Brudno et al., 2003) • More than two: DiAlign (Morgenstern et al., 1998)

  18. Multiple Species • Step 2 • Define a homologous window • For two species A and B, it includes • A set of non-overlapping subsequences $\{x_1, x_2, \dots, x_k\}$ aligned with similar subsequences $\{y_1, y_2, \dots, y_k\}$ • All subsequences of $S_A$ outside of its aligned regions • All subsequences of $S_B$ between $y_i$ and $y_{i+1}$

  19. Multiple Species • Step 2 • Score aligned blocks as a unit in the homologous window • Aligned block: derived from a common ancestor • Use the same weight matrix for the common ancestor and all descendants

  20. Score an aligned block in the homologous window • Generalize to a set of subsequences: $\Pr(s_1, s_2 \mid w)$ • Example: two species, sites $d_1$ and $d_2$ in a conserved block • Short time limit ($t \sim 0$): • $a = d_1 = d_2$ • $\Pr(d_1, d_2 \mid W) = \Pr(a \mid W)$ • Long time limit ($t \sim \infty$): • $\Pr(d_1, d_2 \mid W) = \Pr(d_1 \mid W) \, \Pr(d_2 \mid W)$ • Interpolate between these two limits • [Figure: ancestor a evolving over time t into $d_1$ and $d_2$]

  21. Score an aligned block in the homologous window • $\Pr(d_1, d_2 \mid W) = \sum_a \Pr(a \mid W) \prod_{i \in \{1,2\}} \Pr(d_i \mid a, W, t)$ • $\Pr(d_i \mid a, W, t)$ • depends on time t • depends on motif W • More specifically • Use $\Pr(s_1, s_2 \mid w)$ just as $\Pr(s \mid w)$ in HMM0 • [Figure: ancestor a evolving over time t into $d_1$ and $d_2$]
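The sum over ancestors can be sketched for two aligned sites once a substitution model is fixed. The slides do not spell out $\Pr(d_i \mid a, W, t)$, so the model below is an assumption chosen only because it reproduces both limits from the previous slide: with probability $e^{-t}$ a base is inherited unchanged from the ancestor, otherwise it is redrawn from the motif column. PWM columns are assumed independent, so the sum over ancestral sites factorizes per column.

```python
import math

def site_prob(pwm, s):
    """Pr(s | W) for a PWM given as a list of {base: prob} columns."""
    p = 1.0
    for col, base in zip(pwm, s):
        p *= col[base]
    return p

def pair_column_prob(col, b1, b2, t):
    """Sum over the ancestral base a of Pr(a) Pr(b1 | a, t) Pr(b2 | a, t)."""
    keep = math.exp(-t)            # assumed probability of no substitution
    total = 0.0
    for a, pa in col.items():
        p1 = keep * (b1 == a) + (1 - keep) * col[b1]
        p2 = keep * (b2 == a) + (1 - keep) * col[b2]
        total += pa * p1 * p2
    return total

def aligned_pair_prob(pwm, d1, d2, t):
    """Pr(d1, d2 | W) for two aligned sites of the same length as the PWM."""
    p = 1.0
    for col, b1, b2 in zip(pwm, d1, d2):
        p *= pair_column_prob(col, b1, b2, t)
    return p

pwm = [{"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
       {"A": 0.1, "C": 0.6, "G": 0.2, "T": 0.1}]
short_limit = aligned_pair_prob(pwm, "AC", "AC", t=0.0)
long_limit = aligned_pair_prob(pwm, "AC", "AG", t=50.0)
```

At `t=0.0` identical sites score as a single site, and at large `t` the two sites score as independent samples from W, which is the interpolation the slides describe.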

  22. Multiple Species • Step 2 • Score an aligned block in the homologous window • Score an unaligned subsequence in the homologous window • Use HMM0 • Sum over all aligned blocks and unaligned regions

  23. Multiple Species • Results • Comparison of the discrimination of modules by SSPECIES and MSPECIES

  24. Motif Correlation • Motifs are correlated both in order and in spacing • In HMM0, motifs are chosen independently: $\Pr(w_j \to w_i) = p_i$ • Add to $\theta$ a correlated transition probability $p_{ij}$ • The previous non-background motif placed is ‘remembered’

  25. Motif Correlation • ‘History-conscious HMM’ (hcHMM) • Model parameter $\theta$: include $p_{ij}$ for all pairs of motifs? • No (overfitting) • $p_{ij}$ is added to $\theta$ only if there is evidence for a correlation in occurrences of $w_i$ and $w_j$

  26. Motif Correlation • $p_{ij}$ is added to $\theta$ when $Z_{ij}$ and $E_{ij}$ are above some thresholds • $A_{ij}(S)$: the average number of times $w_j$ follows $w_i$ over all parses of S • $E_{ij}$ and $\sigma_{ij}$: expectation and standard deviation of the random variable $A_{ij}(X)$, over all sequences X of length L
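The threshold test can be sketched as below. The transcript does not define $Z_{ij}$, so this assumes it is the usual z-score of $A_{ij}(S)$ against its expectation and standard deviation; the function name and the threshold values are hypothetical.

```python
def is_correlated(a_ij, e_ij, sigma_ij, z_threshold, e_threshold):
    """Decide whether p_ij should be added to the model.

    Assumption: Z_ij = (A_ij(S) - E_ij) / sigma_ij, a standard z-score.
    Both Z_ij and E_ij must exceed their thresholds, as on the slide.
    """
    z_ij = (a_ij - e_ij) / sigma_ij
    return z_ij > z_threshold and e_ij > e_threshold

# Hypothetical numbers: w_j follows w_i 4.0 times on average in S,
# against an expectation of 1.0 with standard deviation 0.5.
add_pair = is_correlated(4.0, 1.0, 0.5, z_threshold=3.0, e_threshold=0.5)
```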

  27. Motif Correlation • If Corr(i, j) = true: $\Pr(i \to j) = p_{ij}$ • If Corr(i, j) = false: • Now, model parameter $\theta$ includes • $\{p_i\}$, $\{p_{ij}\}$, W • Time complexity of each iteration of hcHMM training: $O(L|W|^2)$ vs. HMM0: $O(L|W|)$

  28. Motif Correlations • Results Performance of Stubb (hcHMM) on gap gene upstream region

  29. Motif Correlation • Results • advantage of hcHMM over HMM0 in detecting modules

  30. Implementation Issues • Stubb system • Windows in the neighborhood of a high-scoring window are also high-scoring • Suppress all overlapping windows • Strand bias • Pre-processing: counts in both directions • Background motif • Context window • Alignment computation for conserved blocks • Lagan & DiAlign
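One way to suppress overlapping windows is a greedy pass over the windows in decreasing score order. The slide does not say how Stubb does this, so the greedy rule and the function name here are assumptions for illustration.

```python
def suppress_overlaps(scored, window_len):
    """Keep the best-scoring windows so that no two kept windows overlap.

    scored: list of (start, score) pairs for windows of length window_len.
    Assumption: greedy selection by decreasing score.
    """
    kept = []
    for start, score in sorted(scored, key=lambda pair: -pair[1]):
        if all(abs(start - s) >= window_len for s, _ in kept):
            kept.append((start, score))
    return sorted(kept)

peaks = suppress_overlaps([(0, 1.0), (1, 3.0), (2, 2.0), (10, 0.5)],
                          window_len=5)
```

Here the windows at positions 0 and 2 are dropped because they overlap the higher-scoring window at position 1, while the distant window at position 10 is kept.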
