
Probabilistic clustering of sequences: Inferring new bacterial regulons by comparative genomics

Erik van Nimwegen et al. Presented by Lyndsy Kron. Goal: to derive a unique probability distribution for assignments of binding sites into clusters, in order to identify regulons.


Presentation Transcript


  1. Probabilistic clustering of sequences: Inferring new bacterial regulons by comparative genomics • Erik van Nimwegen et al. • Presented by Lyndsy Kron

  2. Goal • To derive a unique probability distribution for assignments of binding sites into clusters, in order to identify regulons • Clustering is based on sequence similarity • Sites are partitioned so that each cluster contains the sites targeted by the same transcription factor (TF)

  3. PROCSE Algorithm • Uses Monte Carlo sampling of this distribution to partition and align thousands of short DNA sequences into clusters • Determines the number of clusters • Assigns significance to the resulting clusters • The weight matrices (WMs) are unknown, which is the limiting factor

  4. WM Unknown • A set of sites sampled with unknown WMs is clusterable if it is possible to infer which sites were sampled from the same WM • If WMs are known, this task is trivial

  5. Problem • Input: a set D of short DNA sequences • Output: the most probable clustering C of the input sequences

  6. Assumptions • Sequences in a cluster come from the same motif • Use the weight matrix (WM) model for motifs • Consider only evolutionarily conserved non-coding regions of orthologous genes • Consider bacterial genomes

  7. Model • WM: the probability of finding base α at position i of a site • Information score I: scores the quality of an alignment of putative binding sites • b is the background frequency of base α • The WM probabilities are estimated from the aligned sequences
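
The formula for I did not survive in this transcript. A minimal sketch, assuming the usual relative-entropy form I = Σ_i Σ_α w_α^i log(w_α^i / b_α), is given below. The function names, the pseudocount, and the uniform background are illustrative assumptions, not taken from the slides.

```python
import math

BASES = "ACGT"

def estimate_wm(sites, pseudocount=1.0):
    """Estimate WM columns from equal-length aligned sites.

    The pseudocount is an assumption; it keeps every entry non-zero.
    """
    length = len(sites[0])
    wm = []
    for i in range(length):
        counts = {b: pseudocount for b in BASES}
        for s in sites:
            counts[s[i]] += 1
        total = sum(counts.values())
        wm.append({b: counts[b] / total for b in BASES})
    return wm

def information_score(wm, background=None):
    """Relative entropy of the WM against the background, summed over positions."""
    background = background or {b: 0.25 for b in BASES}
    return sum(col[b] * math.log(col[b] / background[b])
               for col in wm for b in BASES)
```

A well-conserved alignment yields a high score, while an alignment whose columns match the background yields a score near zero.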

  8. Model • Need to cluster a set of binding sites of an unknown number of TFs • Consider all ways to partition into clusters and assign prob. to each – prob. of partition is product of probs. for each cluster

  9. Model • To calculate the probability that a set S of n sequences of length l was drawn from a given WM
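
The equation on this slide was lost in the transcript. Under the standard WM model, in which positions are independent, it would plausibly read as follows. The symbols w_α^i (WM probability of base α at position i) and n_α^i (number of sequences in S with base α at position i) are notation assumed here.

```latex
P(S \mid w) = \prod_{i=1}^{l} \prod_{\alpha \in \{A,C,G,T\}} \left( w_\alpha^i \right)^{n_\alpha^i},
\qquad \sum_{\alpha} n_\alpha^i = n .
```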

  10. Model • To calculate the probability P(S) that all sequences in S came from a single, unknown WM w
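
The slide's formula is missing from the transcript. One way to obtain P(S) is to integrate P(S|w) over all possible WMs. Assuming a uniform prior on each WM column (an assumption made here), each position contributes 3! · Π_α n_α! / (n + 3)!, where n_α is the number of sequences with base α at that position. The sketch below evaluates this in log space; the function name is hypothetical.

```python
from math import lgamma

def log_marginal_likelihood(sites):
    """log P(S) for equal-length sites, assuming a uniform prior per WM column."""
    n = len(sites)
    length = len(sites[0])
    total = 0.0
    for i in range(length):
        counts = [sum(1 for s in sites if s[i] == b) for b in "ACGT"]
        # log of 3! * prod(n_alpha!) / (n + 3)!, using lgamma(k + 1) = log(k!)
        total += lgamma(4) + sum(lgamma(c + 1) for c in counts) - lgamma(n + 4)
    return total
```

For a single site of length 1 this gives log(1/4), and two identical sites score higher than two different ones, which is what drives similar sites into the same cluster.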

  11. Model • From this we can define, for any partition C of a data set of sequences D into clusters, the likelihood P(D|C) that all sequences in each cluster were drawn from the same WM: P(D|C) is the product, over the clusters of C, of the P(S) from the previous slide
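
Written out, with S_c denoting the set of sequences assigned to cluster c (notation assumed here), the product over clusters is:

```latex
P(D \mid C) = \prod_{S_c \in C} P(S_c)
```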

  12. Model • Posterior prob. P(C|D) for partition C given the data D: • Allows calculation of any statistic of interest by summing over the appropriate partitions C
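
The slide's equation is missing; by Bayes' theorem, P(C|D) is proportional to P(D|C) times a prior over partitions. For a handful of sites the sum over partitions can be done by brute force, which makes the idea concrete. The sketch below assumes a uniform prior over partitions and a uniform prior per WM column; all function names are hypothetical, and the enumeration is only feasible for very small inputs.

```python
from math import lgamma, exp

def log_p_cluster(sites):
    """log P(S) for one cluster, assuming a uniform prior per WM column."""
    n, length = len(sites), len(sites[0])
    total = 0.0
    for i in range(length):
        counts = [sum(s[i] == b for s in sites) for b in "ACGT"]
        total += lgamma(4) + sum(lgamma(c + 1) for c in counts) - lgamma(n + 4)
    return total

def set_partitions(items):
    """Yield every partition of items as a list of blocks."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for part in set_partitions(rest):
        yield [[first]] + part                      # first in its own block
        for k in range(len(part)):                  # or joined to an existing block
            yield part[:k] + [[first] + part[k]] + part[k + 1:]

def cocluster_probability(sites, i, j):
    """Posterior probability that sites i and j share a cluster."""
    weighted = []
    for part in set_partitions(list(range(len(sites)))):
        logp = sum(log_p_cluster([sites[k] for k in block]) for block in part)
        together = any(i in block and j in block for block in part)
        weighted.append((exp(logp), together))
    norm = sum(w for w, _ in weighted)
    return sum(w for w, t in weighted if t) / norm
```

Here the statistic of interest is pairwise coclustering; any other statistic follows the same pattern of summing normalized weights over the partitions that satisfy it.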

  13. Classifiability vs. Clusterability • Classification: • Associating TFs with WMs • P(s|w) – prob. that w binds to s • Implies that for a sample s from w, we have that P(s|w) > P(s|w') for all other TFs w'

  14. Clusterability: • Assume we are clustering nG sequences obtained by sampling n times from each of G different WMs • For each WM w, we can calculate the prob. that m of its n samples cocluster by summing P(C|D) over all partitions in which m samples of w occur together • The set is clusterable if, for more than ½ of the G WMs, the avg. of m is > n/2

  15. Monte Carlo Implementation • Monte Carlo random walk to sample the distribution P(C|D) • At each step: • Choose a mini-WM at random • Consider reassigning it to a randomly chosen cluster • The move is accepted or rejected using a Metropolis-Hastings scheme

  16. Metropolis-Hastings Scheme • For a proposed move from partition C to partition C': • Moves that increase the prob. P(C|D) are always accepted • Moves that lower P(C|D) are accepted with prob. P(C'|D)/P(C|D)
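
The acceptance rule can be sketched in a few lines. This assumes symmetric proposals (so no proposal-ratio correction is needed) and works with log probabilities to avoid underflow; the function name and the log-space convention are choices made here, not taken from the slides.

```python
import math
import random

def accept_move(log_p_current, log_p_proposed, rng=random):
    """Metropolis rule: always accept uphill moves, accept downhill moves
    with probability P(C'|D) / P(C|D)."""
    if log_p_proposed >= log_p_current:
        return True
    return rng.random() < math.exp(log_p_proposed - log_p_current)
```

Because only the ratio of posteriors enters, the normalizing sum over all partitions never has to be computed.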

  17. Result of Monte Carlo • Generates “dynamic” clusters, membership fluctuates over time • Clusters can disappear altogether • New clusters can appear when a pair of mini-WMs is moved together • Find “significant” clusters by finding sets of mini-WMs that are persistent

  18. Solutions to Lack of Persistence • First Approach: • Search for the maximum-likelihood (ML) partition that maximizes P(C|D) through simulated annealing • Raise P(C|D) to the power β, increasing β over time • Provides candidate clusters • The significance of the ML clusters is tested by sampling P(C|D)
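
Raising P(C|D) to the power β simply scales the log-probability difference in the acceptance test. The sketch below assumes symmetric proposals and a linear β ramp; the schedule shape, its endpoints, and the function names are illustrative assumptions, since the slides do not specify them.

```python
import math
import random

def anneal_accept(log_p_current, log_p_proposed, beta, rng=random):
    """Accept with probability min(1, (P(C'|D) / P(C|D)) ** beta)."""
    delta = log_p_proposed - log_p_current
    return delta >= 0 or rng.random() < math.exp(beta * delta)

def beta_schedule(step, total_steps, beta_start=1.0, beta_end=10.0):
    """Linear ramp of beta from beta_start to beta_end (assumed schedule)."""
    frac = step / max(total_steps - 1, 1)
    return beta_start + frac * (beta_end - beta_start)
```

As β grows, downhill moves become increasingly rare, so the walk settles into a high-probability partition rather than continuing to sample the distribution.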

  19. Complications: • Computationally prohibitive for large data sets

  20. Solutions to Lack of Persistence • Second Approach: • Use several Monte Carlo random walks • Measure the prob. that each pair of mini-WMs coclusters • Construct a graph in which each node corresponds to a mini-WM and an edge between mini-WMs i and j exists if their coclustering prob. is > ½

  21. Second Approach Cont. • Candidate clusters are now given by the connected components of the graph • Pairwise statistics are processed to obtain probabilistic cluster membership • This yields the probability that mini-WM i belongs to cluster j • For each cluster, also calculate the prob. distribution p(k) that k of its members cocluster • Cluster significance is judged from p(k)
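
The graph construction in the second approach can be sketched as follows. The input format (one list of cluster labels per sampled partition) and the function names are assumptions made for illustration; connected components are found with a small union-find.

```python
from collections import defaultdict

def cocluster_frequencies(samples):
    """Fraction of sampled partitions in which each pair (i, j) shares a label.

    samples: list of label lists, one per sampled partition (assumed format).
    """
    n = len(samples[0])
    freq = defaultdict(float)
    for labels in samples:
        for i in range(n):
            for j in range(i + 1, n):
                if labels[i] == labels[j]:
                    freq[(i, j)] += 1.0 / len(samples)
    return freq

def candidate_clusters(n, freq, threshold=0.5):
    """Connected components of the graph with an edge where freq > threshold."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for (i, j), p in freq.items():
        if p > threshold:
            parent[find(i)] = find(j)
    groups = defaultdict(list)
    for x in range(n):
        groups[find(x)].append(x)
    return sorted(groups.values())
```

For example, with samples `[[0, 0, 1, 1], [0, 0, 1, 2], [0, 0, 0, 1]]`, items 0 and 1 cocluster in every sample and form one candidate cluster, while items 2 and 3 remain singletons.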

  22. Finally • Once clusters are inferred, a WM can be estimated for each cluster • Then search for additional matching motifs to the cluster WMs in all regulatory regions
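
A minimal sketch of the search step, assuming the cluster WM is given as a list of per-position base probabilities and that matches are scored by log-odds against a uniform background. The scoring rule, the threshold, and the function name are assumptions; the slides do not say how matches are scored.

```python
import math

def scan(region, wm, background=0.25, min_score=0.0):
    """Return (offset, log-odds score) for every window scoring above min_score.

    wm: list of dicts mapping each base to its probability at that position.
    """
    width = len(wm)
    hits = []
    for start in range(len(region) - width + 1):
        window = region[start:start + width]
        score = sum(math.log(wm[i][base] / background)
                    for i, base in enumerate(window))
        if score > min_score:
            hits.append((start, score))
    return hits
```

The WM entries must all be non-zero (for instance by estimating them with pseudocounts), otherwise the logarithm is undefined for windows containing an unseen base.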

  23. Data set • 15-25 bp sequences

  24. Ex. Alignment

  25. Thank you!

  26. Results • Found that likelihood P(C|D) for the partition obtained in annealing runs is higher than that obtained when the sites are partitioned by annotation • Algorithm recovers almost ½ of all regulons for which binding sites are known and the large majority of regulons for which there are more than 3 sites known • Most E. coli binding sites are in the unclusterable regime

  27. Discussion • The algorithm assumes that all WMs have a fixed length, so prior information about site lengths and their dimeric nature still needs to be incorporated • The hypothesis could also be extended by assuming that only some fraction, rather than all, of the sequences are WM samples, with the others coming from the background model
