
Counting position weight matrices in a sequence & an application to discriminative motif finding

Saurabh Sinha, Computer Science, University of Illinois, Urbana-Champaign



  1. Counting position weight matrices in a sequence & an application to discriminative motif finding Saurabh Sinha Computer Science University of Illinois, Urbana-Champaign

  2. GENE ACAGTGA PROTEIN Transcriptional Regulation TRANSCRIPTION FACTOR

  3. GENE ACAGTGA PROTEIN Transcriptional Regulation TRANSCRIPTION FACTOR

  4. Binding sites and motifs • Transcription factor binding sites in a gene’s neighborhood are the fundamental units of the regulatory network • Transcription factor binding is specific, hence binding sites are similar to each other, but variability is often seen • A motif is the common sequence pattern among the binding sites of a transcription factor

  5. Motif models • Consensus string, e.g., ACGWGT • Position Weight Matrix (PWM)

  6. Binding sites and their Position Weight Matrix (PWM) • Binding sites: ACCCGTT, ACCGGTT, ACAGGAT, ACCGGTT, ACATGAT • [Figure: the PWM derived from these binding sites]
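The step from aligned binding sites to a PWM can be sketched as follows. This is a minimal illustration that assumes the PWM stores per-column base frequencies; the names `build_pwm` and `pseudocount` are hypothetical and do not come from the slides.

```python
from collections import Counter

# The five binding sites listed on slide 6.
SITES = ["ACCCGTT", "ACCGGTT", "ACAGGAT", "ACCGGTT", "ACATGAT"]
BASES = "ACGT"


def build_pwm(sites, pseudocount=0.0):
    """Return one dict per column, mapping each base to its frequency.

    `pseudocount` (an assumption, not from the slides) is added to every
    base count so that unseen bases do not get probability zero.
    """
    width = len(sites[0])
    if any(len(site) != width for site in sites):
        raise ValueError("all sites must have the same length")
    total = len(sites) + pseudocount * len(BASES)
    pwm = []
    for k in range(width):
        counts = Counter(site[k] for site in sites)
        pwm.append({b: (counts[b] + pseudocount) / total for b in BASES})
    return pwm


pwm = build_pwm(SITES)
for k, column in enumerate(pwm, start=1):
    print(k, column)
```

Columns where all sites agree (for example the leading A) get frequency 1 for that base, while variable columns spread the weight across the observed bases.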

  7. Databases of PWMs • Transfac has hundreds of PWMs for human • Jaspar: a smaller, perhaps better curated database of PWMs • Organism-specific databases are coming up frequently • PWMs in databases are often derived from experimentally validated binding sites

  8. Bioinformatics of PWMs • Popular motif model, i.e., several motif-finding algorithms attempt to find PWMs from sequences • Gibbs sampling: one of the earliest; tries to sample PWMs with high “relative entropy” • MEME: another early algorithm; uses expectation maximization to find PWMs that best “model the sequences” • Many more algorithms to find PWMs from a set of sequences

  9. Problem: counting motifs • Given DNA sequence, and a consensus motif (say “ACGWGT”), count the motif in the sequence • Trivial solution • What if the motif is a Position Weight Matrix (PWM) ? • Why hasn’t this problem been looked at? • Because previous algorithms used different scores of PWMs: how “sharp” they are, how well they explain data, etc.
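The “trivial solution” for a consensus motif can be sketched as a pattern scan. The sketch below assumes the standard IUPAC nucleotide codes (W stands for A or T), counts overlapping matches, and looks at the forward strand only; the function name `count_consensus` is hypothetical.

```python
import re

# Standard IUPAC nucleotide codes expressed as regex character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "S": "[CG]", "W": "[AT]",
    "K": "[GT]", "M": "[AC]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def count_consensus(sequence, consensus):
    """Count (possibly overlapping) forward-strand matches of a consensus."""
    # A zero-width lookahead lets finditer report overlapping matches.
    pattern = "(?=" + "".join(IUPAC[c] for c in consensus) + ")"
    return sum(1 for _ in re.finditer(pattern, sequence))


print(count_consensus("TTACGAGTCCACGTGTAA", "ACGWGT"))
```

Every position either matches or does not, which is why the count is easy for a consensus string and less obvious for a PWM.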

  10. Counting matches to a PWM: A possibility • For each site s in the sequence, compute Pr(s | W) • If Pr(s | W) > some threshold, call s a site • Count the number of sites in the sequence • No distinction between strong and weak sites, as long as they are above threshold • A binary scheme, not realistic
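The thresholding scheme can be sketched as below, assuming Pr(s | W) is the product of the per-column PWM probabilities of the bases in s. The toy two-column PWM and the names `site_probability` and `count_sites` are illustrative assumptions.

```python
# Toy PWM of width 2 (hypothetical values): prefers "AC".
TOY_PWM = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
]


def site_probability(site, pwm):
    """Pr(s | W): product of the column probabilities of each base in s."""
    p = 1.0
    for base, column in zip(site, pwm):
        p *= column[base]
    return p


def count_sites(sequence, pwm, threshold):
    """Count windows whose probability under the PWM exceeds the threshold."""
    width = len(pwm)
    return sum(
        site_probability(sequence[i:i + width], pwm) > threshold
        for i in range(len(sequence) - width + 1)
    )


# "ACCC" has one strong window (AC, about 0.49) and two weak ones (CC, about 0.07).
print(count_sites("ACCC", TOY_PWM, threshold=0.4))
print(count_sites("ACCC", TOY_PWM, threshold=0.05))
```

With the lower threshold the weak windows count exactly as much as the strong one, and with the higher threshold they vanish entirely, which is the binary behavior the slide objects to.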

  11. A wish-list (for the score) • Score should consider both strong and weak occurrences of motif • Score should assign appropriate weights to strong and weak occurrences • Score should be aware that there may also be sites of other known motifs in the sequence • The list goes on: score should be efficiently computable, score should be differentiable, score should …

  12. The “w-score” • Defined by a probabilistic model of sequence generation • Given one or more motifs, and a background distribution, defines a probability space on sequences • A simple (zeroth order) Hidden Markov model (HMM)

  13. Probabilistic Model: toy example • A stochastic process generating sequences of length L • Given two motifs W1, W2, a “background” motif Wb, and a sequence length L • Pr(Wi → Wj) = pj (transition probability) • When in state Wi, emit a substring s chosen with probability Pr(s | Wi) (emission probability) • Stop when the length of the emitted sequence is L • [Figure: HMM with states W1, Wb, W2]

  14. A “path” through the HMM • One possible path T1 • Another possible path T2 • [Figure: two example paths over the same sequence, each a series of W1, W2 and Wb states]

  15. Likelihood of sequence & paths • A path of the HMM defines the locations of motif matches • For a sequence S & a path T, the joint probability Pr(S, T) is easy to compute • Conditional probability of a path T, given the data S, is: Pr(T | S) = Pr(S, T) / Σ_T′ Pr(S, T′) • Paths with strong matches have higher conditional probabilities • Paths with weak matches have lower conditional probabilities • [Figure: an example path of W1, W2 and Wb states]

  16. The “w-score” • Let the number of occurrences of a motif (say W1) in path T be N_W1(T) • Compute: W1(S) = Σ_T N_W1(T) · Pr(T | S) • In words: an average of the motif count N_W1(T) over all paths T, with weights equal to the probability of T given S

  17. The “w-score” (Cont’d) • Score depends both on number and quality of matches to motif. • Every substring is a potential binding site, and paths placing the motif there will contribute to the count • Pr(T | S) depends on the match strength of all motifs, not just the one being counted
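The weighted average over all paths does not require enumerating paths; a forward-backward recursion over sequence positions gives the same expectation. The sketch below assumes the zeroth-order model described on the earlier slides (each step enters state j with probability pj regardless of the previous state, and the background emits one base at a time). The function names, argument layout and toy numbers are assumptions, not the author's implementation, and no rescaling is done, so long sequences would underflow.

```python
def emission(site, pwm):
    """Pr(s | W): product of column probabilities of each base in s."""
    p = 1.0
    for base, column in zip(site, pwm):
        p *= column[base]
    return p


def w_score(sequence, motifs, transitions, target):
    """Expected count of motifs[target], averaged over paths with Pr(T | S).

    motifs: list of PWMs (lists of column dicts); background is a width-1 PWM.
    transitions: transitions[j] is the probability pj of entering motif j.
    """
    L = len(sequence)
    # weights[j][i] = pj * Pr(sequence[i:i+width_j] | Wj)
    weights = []
    for pwm, p in zip(motifs, transitions):
        w = len(pwm)
        weights.append(
            [p * emission(sequence[i:i + w], pwm) for i in range(L - w + 1)]
        )
    # forward[i]: total probability of all partial paths emitting sequence[:i]
    forward = [0.0] * (L + 1)
    forward[0] = 1.0
    for i in range(1, L + 1):
        for pwm, wts in zip(motifs, weights):
            w = len(pwm)
            if i >= w:
                forward[i] += forward[i - w] * wts[i - w]
    # backward[i]: total probability of all partial paths emitting sequence[i:]
    backward = [0.0] * (L + 1)
    backward[L] = 1.0
    for i in range(L - 1, -1, -1):
        for pwm, wts in zip(motifs, weights):
            w = len(pwm)
            if i + w <= L:
                backward[i] += wts[i] * backward[i + w]
    # Sum, over start positions, the posterior probability of a target site.
    w = len(motifs[target])
    expected = sum(
        forward[i] * weights[target][i] * backward[i + w]
        for i in range(L - w + 1)
    )
    return expected / forward[L]


BACKGROUND = [{"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}]
MOTIF_AC = [
    {"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0},
    {"A": 0.0, "C": 1.0, "G": 0.0, "T": 0.0},
]
print(w_score("AC", [MOTIF_AC, BACKGROUND], [0.5, 0.5], target=0))
```

For the two-base sequence "AC" there are only two paths (one motif site, or two background bases), so the result can be checked by hand: 0.5 / (0.5 + 0.125 × 0.125) = 32/33.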

  18. The wish-list (again) • Score should consider both strong and weak occurrences of motif • Score should assign appropriate weights to strong and weak occurrences • Score should be aware that there may also be sites of other known motifs in the sequence • An exciting new feature of this motif score

  19. Computational pros and cons • The time to compute the w-score depends on L, the sequence length, and lm, the motif length; this is relatively expensive • The w-score can be differentiated with respect to all of the PWM parameters with the same time complexity • An important feature for search algorithms

  20. Using the “w-score” in discriminative motif finding

  21. Discriminative motif finding • Suppose we have a set of co-regulated genes, i.e., we believe they have binding sites of the same transcription factor (in their regulatory control regions) • Traditionally, motif finding tries to find these binding sites, based on over-representation, conservation etc. • Often we also know a set of genes that should NOT have binding sites of that transcription factor • Examples: ChIP-on-chip, In situ hybridization pictures of Drosophila embryo, etc.

  22. Problem formulation • Given two sets of sequences S+ and S- • Find a motif that has many occurrences in S+ and few occurrences in S- • Maximize the difference in the average counts of the motif in the two sets • Let W(S) = count of a motif W in sequence S • Maximize: (1/|S+|) Σ_{S ∈ S+} W(S) − (1/|S-|) Σ_{S ∈ S-} W(S)

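  23. Optimization problem • Find motif W that maximizes (1/|S+|) Σ_{S ∈ S+} W(S) − (1/|S-|) Σ_{S ∈ S-} W(S), where W(S) is the count of motif W in sequence S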
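The objective is the difference between the average motif count in the positive set and in the negative set. The sketch below takes the counting function as a parameter; the exact-substring counter used in the example is only a stand-in for the w-score, and all names are hypothetical.

```python
def objective(count, positives, negatives):
    """Average count over S+ minus average count over S-.

    `count` is any function mapping a sequence to a (possibly fractional)
    motif count, e.g. a w-score for a fixed PWM.
    """
    avg_pos = sum(count(s) for s in positives) / len(positives)
    avg_neg = sum(count(s) for s in negatives) / len(negatives)
    return avg_pos - avg_neg


# Stand-in counter (an assumption for illustration): exact occurrences of "ACGT".
def toy_count(sequence):
    return sequence.count("ACGT")


S_PLUS = ["ACGTACGT", "TTACGT"]
S_MINUS = ["TTTT", "ACGTTT"]
print(objective(toy_count, S_PLUS, S_MINUS))
```

Because the objective only sees the counts, any differentiable count such as the w-score can be plugged in without changing this function.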

  24. Derivatives of objective function • Let Wkα be the PWM entry for base α in column k • We can efficiently compute ∂W(S)/∂Wkα • Hence we can efficiently differentiate our objective function

  25. Algorithm • Search space: Set of n = 20 substrings of sequences in S+ (called “site set”) • Objective function: Construct PWM W from site-set, compute score • Length of sites is user-defined

  26. Algorithm Current site-set C S+

  27. Algorithm Replace one site with any site from sequence S+ Pick a replacement that improves objective function

  28. Algorithm • Current solution (site-set): C • Candidate new solution: C′ • Many possibilities for C′ (every substring of every sequence in S+ is a possible replacement) • Evaluate objective function on each candidate C′ • Too slow! • Use derivative information!

  29. Algorithm • Estimate the objective function value for each candidate C′ using partial derivatives and a first-order approximation • Examine each candidate in decreasing order of estimated score • If a candidate C′ is found with a greater score than C, choose it.
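The derivative-guided step can be sketched generically: estimate each candidate's score as f(W) + ∇f(W) · (W′ − W), sort by the estimate, and evaluate the true objective only until an improvement is found. In the sketch below, parameter vectors are plain tuples standing in for PWMs built from site-sets, and the quadratic toy objective is an assumption made purely for illustration.

```python
def pick_replacement(current, candidates, objective, gradient):
    """Return the first candidate, in decreasing order of first-order
    estimated score, whose true score beats the current one (else None)."""
    base = objective(current)
    grad = gradient(current)

    def estimate(candidate):
        # First-order Taylor approximation around the current parameters.
        return base + sum(
            g * (c - x) for g, c, x in zip(grad, candidate, current)
        )

    for candidate in sorted(candidates, key=estimate, reverse=True):
        if objective(candidate) > base:  # expensive exact evaluation
            return candidate
    return None


# Toy objective with its maximum at (1, 2); a stand-in for the real score.
def toy_objective(w):
    return -((w[0] - 1) ** 2) - ((w[1] - 2) ** 2)


def toy_gradient(w):
    return (-2 * (w[0] - 1), -2 * (w[1] - 2))


current = (0, 0)
candidates = [(0, -1), (1, 2), (4, 4)]
print(pick_replacement(current, candidates, toy_objective, toy_gradient))
```

Here (4, 4) has the highest estimated score but a worse true score, so it is rejected and the next candidate, (1, 2), is chosen. The estimate only orders the candidates; acceptance always relies on the exact score.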

  30. Algorithm illustration • Current score = 12 • [Figure: candidates ordered by estimated score, with the accurate scores 10, 13 and 11 shown for the candidates examined]

  31. Algorithm Properties • Objective function has many desirable properties, but is an expensive operation • Derivative computation has the same time complexity, and is used to guide search • Avoids local optima by searching in a discretized PWM space • Performs significantly better and/or faster than Gibbs sampling and Conjugate Gradients, for this particular score

  32. Discriminative PWM Search (DIPS) • Software available • Can easily handle data sets of ~100 sequences • Can find multiple motifs iteratively, but without masking: • Find a PWM, then include it in the model as a known PWM, find another PWM, and so on

  33. Performance tests • Tested on synthetic data • Compared to traditional motif finder as well as two discriminative motif finders • Superior performance in the presence of “distractor” motifs • it really helps to be able to count a motif in the presence of other known motifs

  34. Tests on Drosophila Enhancers • [Figure: BICOID (activator) protein concentration along the embryo, head to tail]

  35. Tests on Drosophila Enhancers • [Figure: CAUDAL (activator) protein concentration along the embryo, head to tail]

  36. DIPS runs • S+ = promoters of genes expressed in anterior half of embryo • S- = promoters of genes expressed in posterior half of embryo • Top motif: Bicoid! • [Figure: BICOID (activator) protein concentration along the embryo, head to tail]

  37. DIPS runs • S+ = promoters of genes expressed in posterior half of embryo • S- = promoters of genes expressed in anterior half of embryo • Top motif: Caudal! • [Figure: CAUDAL (activator) protein concentration along the embryo, head to tail]

  38. Summary of results

  39. Social regulation in honey bee • Transition from nursing in the hive to foraging for food is age-related, but also regulated by the needs of the colony • 32 genes demonstrated to be significantly differentially expressed in brains of nurses and foragers (21 active in foragers only, 11 active in nurses only) • DIPS run on 2 kbp promoters of these social behavior-related genes

  40. Results on honey bee genes

  41. Conclusion • Discriminative motif finding is increasingly becoming a necessary analysis • Motif finding in the presence of other known motifs is also becoming relevant • A search algorithm that maximizes any objective function of the motif counts in the sequences (as long as it is differentiable) • Several extensions and variations are possible

  42. Acknowledgements • Eric Siggia, Eran Segal • Yoseph Barash (“LearnPSSM”) • Andrew Smith (“DME”)

  43. Reference • ISMB 2006 (Brazil); Bioinformatics journal.
