
A Statistical Method for Finding Transcriptional Factor Binding Sites

Authors: Saurabh Sinha and Martin Tompa. Presenter: Christopher Schlosberg (CS598ss).


Presentation Transcript


  1. A Statistical Method for Finding Transcriptional Factor Binding Sites • Authors: Saurabh Sinha and Martin Tompa • Presenter: Christopher Schlosberg • CS598ss

  2. Regulation of Gene Expression

  3. Difficulties of Motif Finding • Regulatory sequences do not necessarily share the orientation of the coding sequence, or of each other • Multiple binding sites may exist for each regulated gene • The binding sites of a single factor vary widely, and this variation is not well understood

  4. Previous & Proposed Methods for Finding Motifs • Previous methods: find longer, more general motifs; use local search algorithms (Gibbs sampling, Expectation Maximization, greedy algorithms) • Proposed method: a TFBS is short enough for enumerative methods; enumerative statistical methods guarantee global optimality and remain computationally affordable

  5. Proposed Method Highlights • Allows variation among the binding site instances of a given transcription factor • Allows motifs to include “spacers” • Allows overlapping occurrences (in both orientations), which leads to complex dependencies • Statistical significance of a motif s is based on the frequencies of shorter (more frequent) oligonucleotides • Uses a Markov chain to model the background genomic distribution • Uses a z-score to measure statistical significance • Allows for multiple binding sites

  6. Characteristics of a Motif • Any single TFBS has significant variation • Many motifs have spacers of 1-11 bp • Variation often occurs as a transition (e.g. purine → purine) rather than a transversion (e.g. pyrimidine → purine) • Variation occurs less often between a pair of complementary bases • Indels are uncommon

  7. Proposed Motif Definition • A motif is a string over the alphabet Σ = {A, C, G, T, R, Y, S, W, N} • A, C, G, T (DNA bases), R (purine: A or G), Y (pyrimidine: C or T), S (strong: C or G), W (weak: A or T), N (spacer: any base) • The TF database SCPD confirms this model of variation • Of 50 binding site consensus sequences, 31 fit exactly (62%) • Another 10 fit if slight variations are allowed
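
As an illustration of this alphabet, the following Python sketch checks whether a concrete site is an instance of a motif. The names `IUPAC` and `is_instance` are made up for this example, and the sample motif is hypothetical rather than one taken from SCPD.

```python
# Each motif symbol stands for a set of bases (standard IUPAC codes).
# N acts as the spacer and accepts any base.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG",    # purine
    "Y": "CT",    # pyrimidine
    "S": "CG",    # strong
    "W": "AT",    # weak
    "N": "ACGT",  # spacer
}

def is_instance(motif, site):
    """Return True if the DNA string `site` is an instance of `motif`."""
    if len(motif) != len(site):
        return False
    return all(base in IUPAC[sym] for sym, base in zip(motif, site))

if __name__ == "__main__":
    # Hypothetical motif with a 3 bp spacer, used only for illustration.
    print(is_instance("CGGNNNWCR", "CGGATCACG"))
```

A lookup table keeps the check to one set-membership test per position, so spacers cost nothing extra at match time.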

  8. Measure of Statistical Significance • Given a set of coregulated S. cerevisiae genes, the input is the corresponding set of 800 bp upstream sequences, each with its 3’ end at the gene's translation start site • The model must measure, from the input sequences: the absolute number of occurrences N_s of motif s; the background genomic distribution • X is a set of random DNA sequences, equal in number and lengths to the input sequences: generated by a Markov chain of order m; transition probabilities determined by (m+1)-mer frequencies in the full complement of 6000+ upstream sequences (each 800 bp long) • The background model uses m = 3
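
A minimal sketch of how such a background model could be fitted, assuming plain relative frequencies of (m+1)-mers with no pseudocounts and no special strand handling. The function name `fit_markov` and the toy input are hypothetical; a real run would use the full set of upstream sequences.

```python
from collections import Counter, defaultdict

def fit_markov(sequences, m=3):
    """Estimate P(next base | previous m bases) from (m+1)-mer counts."""
    counts = defaultdict(Counter)
    for seq in sequences:
        # Every window of length m+1 contributes one (context, next base) count.
        for i in range(len(seq) - m):
            counts[seq[i:i + m]][seq[i + m]] += 1
    # Normalize the counts for each context into transition probabilities.
    return {
        ctx: {base: n / sum(nxt.values()) for base, n in nxt.items()}
        for ctx, nxt in counts.items()
    }

if __name__ == "__main__":
    model = fit_markov(["ACGTACGA"], m=3)  # toy input, illustration only
    print(model["ACG"])
```

Contexts that never appear in the training data are simply absent from the returned dictionary; with thousands of 800 bp sequences and m = 3 this is unlikely to matter, but a smaller data set would need smoothing.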

  9. z-score • X_s: random variable, the number of occurrences of motif s in X • E(X_s): expectation; σ(X_s): standard deviation • z_s: number of standard deviations by which the observed value N_s exceeds its expectation, z_s = (N_s − E(X_s)) / σ(X_s)

  10. Implications • A motif may overlap with itself (in either orientation) • Previous study of pattern autocorrelation • Generalizes the computation of the SD by treating a motif as a finite set of strings • Supports higher order Markov chains • Handles spacers at no extra computational cost • Handles motifs in either orientation

  11. Algorithm • Enumerate over each input sequence • Tabulate the number N_s of occurrences of each motif in either orientation • Compute the expectation and SD for each motif such that N_s > 0 • Calculate the z-score • Rank motifs by z-score
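
The scoring and ranking steps above can be sketched as follows. This is an outline under stated assumptions, not the authors' implementation: `rank_motifs` is a hypothetical name, the observed counts are assumed to be tabulated already, and the expectation and SD are passed in as callables because their derivation is the subject of the later slides.

```python
def rank_motifs(counts, expectation, sd):
    """Rank motifs by z-score.

    counts      -- dict mapping motif -> observed count N_s
    expectation -- callable motif -> E(X_s)      (assumed to be supplied)
    sd          -- callable motif -> sigma(X_s)  (assumed to be supplied)
    """
    scored = []
    for motif, n_s in counts.items():
        if n_s <= 0:
            continue  # only motifs that actually occur are scored
        sigma = sd(motif)
        if sigma <= 0:
            continue  # z-score is undefined without a positive SD
        scored.append((motif, (n_s - expectation(motif)) / sigma))
    # Highest z-score first.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

if __name__ == "__main__":
    # Toy numbers for illustration only.
    counts = {"ACG": 10, "CGT": 3, "TTT": 0}
    mean = {"ACG": 4.0, "CGT": 2.0, "TTT": 1.0}
    dev = {"ACG": 2.0, "CGT": 1.0, "TTT": 1.0}
    print(rank_motifs(counts, mean.get, dev.get))
```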

  12. Algorithm Analysis • For a single motif, the complexity is O(c^2 k^2) • k: number of nonspacer characters in the motif • c: number of instantiations of R, Y, S, W in the motif • k takes only modest values • Linear dependence on genome size • The variance calculation can be trimmed as an optimization

  13. Number of Occurrences • Convert motif s into a multiset W • Add the reverse complement of each string in W • Motif s occurs at a position in X iff some string in W occurs at the same position • X_s: total number of occurrences (in X) of the members of W • Handling palindromes • W_i: a member of W • |W| = T
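
A rough sketch of building W and counting occurrences. Several choices here are assumptions rather than details from the slides: the names `build_W` and `count_occurrences` are hypothetical, spacers are left as `N` wildcards instead of being expanded, and the `palindromes_once` flag is only one possible way to treat strings that equal their own reverse complement.

```python
from collections import Counter
from itertools import product

AMBIG = {"R": "AG", "Y": "CT", "S": "CG", "W": "AT"}
COMP = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}

def revcomp(s):
    """Reverse complement; the spacer symbol N maps to itself."""
    return "".join(COMP[b] for b in reversed(s))

def build_W(motif, palindromes_once=True):
    """Multiset of instantiations of `motif` plus their reverse complements."""
    choices = [AMBIG.get(sym, sym) for sym in motif]  # N stays a wildcard
    W = Counter()
    for combo in product(*choices):
        w = "".join(combo)
        W[w] += 1
        rc = revcomp(w)
        # Assumption: a palindromic string is entered only once.
        if rc != w or not palindromes_once:
            W[rc] += 1
    return W

def occurs_at(w, seq, pos):
    """True if string w (N = any base) matches seq starting at pos."""
    return all(a == "N" or a == b for a, b in zip(w, seq[pos:pos + len(w)]))

def count_occurrences(W, seq):
    """Sum, over positions and members of W, of matches weighted by multiplicity."""
    k = len(next(iter(W)))
    return sum(
        mult
        for pos in range(len(seq) - k + 1)
        for w, mult in W.items()
        if occurs_at(w, seq, pos)
    )

if __name__ == "__main__":
    W = build_W("ACG")
    print(W, count_occurrences(W, "ACGT"))
```

Using a `Counter` keeps the multiset explicit: a motif such as `RY`, which is its own reverse complement at the motif level, yields members with multiplicity 2.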

  14. Number of Occurrences (cont.)

  15. Expectation • Linearity of Expectation
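
A short worked form of the linearity argument, written under an assumed notation: X_{ij} is the indicator that W_i occurs at position j of X, with i ranging over the T members of W and j over valid start positions. The indicator notation is introduced here for illustration only.

```latex
\[
E(X_s) \;=\; E\Bigl(\sum_{i=1}^{T}\sum_{j} X_{ij}\Bigr)
       \;=\; \sum_{i=1}^{T}\sum_{j} E(X_{ij})
       \;=\; \sum_{i=1}^{T}\sum_{j} \Pr\bigl(W_i \text{ occurs at position } j \text{ of } X\bigr)
\]
```

Linearity holds whether or not the indicators are independent, so overlapping occurrences cause no difficulty at this step.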

  16. Variance → B term → C term
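
As a general reminder, and not the slide's own grouping into B and C terms, the variance of a sum of indicators can be written as below. The notation X_{ij} (the indicator that W_i occurs at position j of X) is an assumption made for this illustration.

```latex
\[
\mathrm{Var}(X_s) \;=\; E(X_s^2) - E(X_s)^2,
\qquad
E(X_s^2) \;=\; \sum_{i=1}^{T}\sum_{j}\;\sum_{i'=1}^{T}\sum_{j'} E\bigl(X_{ij}\,X_{i'j'}\bigr)
\]
```

Unlike the expectation, the products E(X_{ij} X_{i'j'}) do not factor into separate terms, because occurrences depend on each other, most strongly when they overlap.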

  17. C Term → A term

  18. A Term

  19. Overlapping Concatenation • CW (like W) is potentially a multiset • One-to-one correspondence

  20. C Term Simplification

  21. A Term Revisited

  22. S_{i1}S_{i2} Term & Approximation • Kleffe and Borodovsky (1992) approximation

  23. B Term

  24. B Term (cont.)

  25. Summary

  26. Higher Order Markov Models • Variance calculations remain the same except for the S_{i1}S_{i2} term • Experiments use m = 3

  27. Experimental Results & Future Considerations • 17 sets of coregulated genes • Each set has a known TF with a known binding site consensus • In 9 experiments, the known consensus was among the 3 highest scoring motifs • Future topics: non-centered spacers; enumeration loop optimization; filtering repeats

  28. Question • E(X_s) is more straightforward to calculate than σ(X_s). Under the assumptions given in the paper, name one reason for this complication.
