Transcription factor binding motifs (part I)

ashley
Presentation Transcript


  1. Transcription factor binding motifs (part I) 10/17/07

  2. Steps of gene transcription [Figure labels: Pol II, TFIID, activator, TATA] • The term “transcription factor” (TF) usually means an activator or repressor.

  3. Understand Regulation • Which TFs are involved in the regulation? • Does a TF enhance or repress gene expression? • Which genes are regulated by this TF? • Are there binding partners or competitors for the TF? • Why does disease arise when a TF goes wrong?

  5. Sequence specificity of TF binding

  6. Motif representation • Consensus: GCGAA • PWM: alignment matrix

  7. Motif representation • Consensus: GCGAA • PWM: frequency matrix

  8. Motif representation • Consensus: GCGAA • PWM • Logo
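
The three representations above can be derived from the same set of aligned sites. The sketch below is illustrative only: the five sites are made up, the function names are hypothetical, and the pseudocount and the uniform background used for the logo heights are assumptions rather than anything stated on the slides.

```python
import math
from collections import Counter

ALPHABET = "ACGT"

def alignment_counts(sites):
    """Alignment matrix: per-column base counts for equal-length sites."""
    width = len(sites[0])
    if any(len(s) != width for s in sites):
        raise ValueError("all sites must have the same length")
    return [Counter(s[j] for s in sites) for j in range(width)]

def frequency_matrix(sites, pseudocount=1.0):
    """Frequency matrix: counts plus a pseudocount, normalised per column."""
    total = len(sites) + pseudocount * len(ALPHABET)
    return [{b: (col[b] + pseudocount) / total for b in ALPHABET}
            for col in alignment_counts(sites)]

def consensus(freqs):
    """Consensus: the most frequent base in each column."""
    return "".join(max(ALPHABET, key=lambda b: col[b]) for col in freqs)

def information_content(freqs):
    """Logo column heights in bits, assuming a uniform background."""
    return [2.0 + sum(p * math.log2(p) for p in col.values()) for col in freqs]

# Made-up aligned binding sites, chosen so the consensus is GCGAA.
sites = ["GCGAA", "GCGAT", "GCGAA", "GCCAA", "ACGAA"]
q = frequency_matrix(sites)
print(consensus(q))
print([round(h, 2) for h in information_content(q)])
```

The consensus discards the per-column variability that the frequency matrix and the logo keep.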

  9. Objectives of motif finding • Known motif mapping • Given a known motif, find all the matches over a query sequence. • De novo motif discovery • Both motif patterns and match positions are unknown • much harder

  10. Known Motif Mapping • The matching score for a new sequence x is computed from qm, the entries of the frequency matrix, and q0, the background model: p0(A), …, p0(T), or a third-order Markov model (see next slide). • Calculate the matching score for all genomic sequences. Motif sites correspond to the highest scores.
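
A common choice for such a matching score is the log-likelihood ratio of the motif model to the background model, summed over positions. The sketch below assumes that form (the slide's own formula is not available here), a zeroth-order background, and a made-up frequency matrix for GCGAA; all names are hypothetical.

```python
import math

ALPHABET = "ACGT"

def log_odds_score(segment, qm, p0):
    """Sum over positions of log(qm[j][base] / p0[base])."""
    return sum(math.log(qm[j][b] / p0[b]) for j, b in enumerate(segment))

def scan(sequence, qm, p0):
    """Score every window of motif width; returns (position, score) pairs."""
    w = len(qm)
    return [(i, log_odds_score(sequence[i:i + w], qm, p0))
            for i in range(len(sequence) - w + 1)]

# Made-up frequency matrix: 0.85 on the consensus base GCGAA, 0.05 elsewhere.
qm = [{b: (0.85 if b == c else 0.05) for b in ALPHABET} for c in "GCGAA"]
p0 = {b: 0.25 for b in ALPHABET}  # assumed uniform background

hits = scan("TTTGCGAATTT", qm, p0)
best_pos, best_score = max(hits, key=lambda h: h[1])
print(best_pos, round(best_score, 3))
```

In practice a score threshold, rather than the single best window, decides which positions are reported as motif sites.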

  11. Third-order Markov model • The probability of generating a new base depends on the previous three bases (3rd-order Markov dependency): p(x_i | x_{i-3}, x_{i-2}, x_{i-1}).
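
A background model of this kind can be estimated by counting which base follows each 3-base context. The following is a minimal sketch under stated assumptions: the training sequence is a toy example, the pseudocount and the uniform fallback for unseen contexts are choices made here, and the function names are hypothetical.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"

def train_markov(sequences, order=3, pseudocount=1.0):
    """Estimate p(base | previous `order` bases) from training sequences."""
    counts = defaultdict(lambda: {b: pseudocount for b in ALPHABET})
    for seq in sequences:
        for i in range(order, len(seq)):
            counts[seq[i - order:i]][seq[i]] += 1
    model = {}
    for context, c in counts.items():
        total = sum(c.values())
        model[context] = {b: c[b] / total for b in ALPHABET}
    return model

def log_prob(seq, model, order=3):
    """log P(seq[order:] | first `order` bases); unseen contexts use 0.25."""
    lp = 0.0
    for i in range(order, len(seq)):
        probs = model.get(seq[i - order:i])
        lp += math.log(probs[seq[i]] if probs else 0.25)
    return lp

model = train_markov(["ACGTACGTACGT"])  # toy training sequence
print(model["ACG"])
print(log_prob("ACGT", model))
```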

  12. De novo motif discovery • Statistical approach • Identify sequence patterns that occur more frequently than expected at random. • Target regions: • Promoter regions of co-regulated genes • Promoter regions of differentially expressed genes • Experimentally identified TF binding sites • Very common • Biophysical approach • Calculate protein-DNA binding affinities from first principles. • See Roider et al. 2006 for an example.

  13. Methods • PWM modeling • MEME, GMS, AlignACE, BioProspector • Word enumeration • YMF, MDscan • Use negative control • REDUCE, Motif Regressor • Comparative genomics • MCS, CompareProspector, PhyloCon • ChIP-chip (will discuss later)

  14. The challenges: no motif sites

  15. The challenges: multiple motif sites

  16. The challenges: variable relative positions

  17. The challenges: variable sequence pattern (e.g., ATCCG vs. ATTCG)

  18. MEME (Bailey and Elkan 1994) • Input • A set of sequences: Y = {Yi} • For a fixed length w, partition Y into overlapping w-mers: X = {Xi} • An alphabet: A = {aj} = {A, C, G, T} • Mixture model • qm: motif model • q0: background model (0th- or 3rd-order Markov)

  19. Log-likelihood • Missing data: Z = {Zi} • The log-likelihood is • Select λ and q to maximize the log-likelihood, but how?
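
One standard way to write the log-likelihood of such a two-component mixture is sketched below. It is an assumption based on the usual mixture formulation, not a recovery of the slide's own formula: λ is taken to be the fraction of w-mers drawn from the motif model, Z_i = 1 marks a motif w-mer, and X_{ij} denotes the j-th base of w-mer X_i.

```latex
\log P(X, Z \mid q, \lambda)
  = \sum_{i} \Big[ Z_i \log\big(\lambda \, P(X_i \mid q_m)\big)
  + (1 - Z_i) \log\big((1 - \lambda) \, P(X_i \mid q_0)\big) \Big],
\qquad
P(X_i \mid q_m) = \prod_{j=1}^{w} q_m(j, X_{ij})
```

Because Z is unobserved, this expression cannot be maximized directly, which is the question the slide ends on.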

  20. Expectation-Maximization (EM) • Iteratively update hidden states and parameter values. Commonly used in bioinformatics research. • E-step: • Under the current estimates q^(0) and λ^(0) and the observed data, evaluate the expected value of the log-likelihood over the values of the missing data Z.

  21. Expectation-Maximization (EM) • M-step: • Update the parameters λ and q so that the expected log-likelihood is maximized. • Iterate the E- and M-steps until convergence.
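
The E- and M-steps can be sketched for a two-component mixture over w-mers as follows. This is a simplified illustration and not MEME itself: it assumes a fixed uniform zeroth-order background, initialises the motif from a seed word, uses a small pseudocount, and runs a fixed number of iterations; the toy sequences and all names are made up.

```python
ALPHABET = "ACGT"

def wmers(seqs, w):
    """All overlapping w-mers X = {Xi} of the input sequences."""
    return [s[i:i + w] for s in seqs for i in range(len(s) - w + 1)]

def likelihood(x, q):
    """P(x | q) for a position-specific frequency matrix q."""
    p = 1.0
    for j, base in enumerate(x):
        p *= q[j][base]
    return p

def em_motif(seqs, seed, n_iter=50, lam=0.1, pseudo=0.1):
    w = len(seed)
    X = wmers(seqs, w)
    q0 = [{b: 0.25 for b in ALPHABET}] * w  # assumed fixed background
    # Initial guess: 0.7 on the seed base, 0.1 on each other base.
    qm = [{b: (0.7 if b == seed[j] else 0.1) for b in ALPHABET}
          for j in range(w)]
    for _ in range(n_iter):
        # E-step: posterior probability that each w-mer is a motif site.
        z = []
        for x in X:
            pm = lam * likelihood(x, qm)
            pb = (1.0 - lam) * likelihood(x, q0)
            z.append(pm / (pm + pb))
        # M-step: re-estimate lambda and the motif matrix from weighted counts.
        lam = sum(z) / len(z)
        counts = [{b: pseudo for b in ALPHABET} for _ in range(w)]
        for x, zi in zip(X, z):
            for j, base in enumerate(x):
                counts[j][base] += zi
        qm = [{b: col[b] / sum(col.values()) for b in ALPHABET}
              for col in counts]
    return qm, lam

# Toy sequences, each containing one planted copy of GCGAA.
seqs = ["TTACGCGAATCT", "CAGCGAATGTAC", "ATGTGCGAACTA", "GCGAATTCATAG"]
qm, lam = em_motif(seqs, seed="GCGAA")
print("".join(max(ALPHABET, key=lambda b: col[b]) for col in qm), round(lam, 3))
```

A fixed iteration count stands in for a proper convergence check on the change in log-likelihood.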

  22. Issues with the EM algorithm • Can get trapped in a local optimum • Results depend on the initial guess • Often need to do multiple runs starting from different initial guesses, then pick the best one.

  23. Gibbs sampling • Gibbs sampling is an algorithm to generate a sequence of samples from the joint probability distribution of two or more random variables • Gibbs sampling is applicable when the joint distribution is not known explicitly, but the conditional distribution of each variable is known. • The sequence of samples comprises a Markov Chain. • As the iteration number goes to infinity, the asymptotic distribution approaches the underlying joint distribution.

  24. Key differences between EM and Gibbs sampling

  25. Gibbs Motif Sampler (Lawrence et al. 1993; Liu et al. 1995) • Assume each sequence contains one motif site, but the position p and the motif frequency matrix q are unknown.

  26. Gibbs Motif Sampler • Take out one sequence, with its site, from the current motif. [Figure: motif built without the segment from sequence 1]

  27. Gibbs Motif Sampler • Score each possible segment of this sequence. [Figure: sequence 1, segment (2-7): 3]

  28. Gibbs Motif Sampler • Sample a new segment to put the sequence back into the motif. [Figure: modified motif including the new segment from sequence 1]
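
The three steps above (take one sequence out, score its segments, sample a new segment) can be sketched as a site sampler. This is a minimal illustration under stated assumptions, not the published implementation: one site per sequence, a uniform background, a fixed pseudocount and number of sweeps, and hypothetical function names and toy sequences.

```python
import random

ALPHABET = "ACGT"

def build_matrix(seqs, positions, w, skip, pseudo=0.5):
    """Frequency matrix from the current sites, leaving out sequence `skip`."""
    counts = [{b: pseudo for b in ALPHABET} for _ in range(w)]
    for k, (s, p) in enumerate(zip(seqs, positions)):
        if k == skip:
            continue
        for j in range(w):
            counts[j][s[p + j]] += 1
    return [{b: col[b] / sum(col.values()) for b in ALPHABET} for col in counts]

def gibbs_motif_sampler(seqs, w, n_sweeps=200, seed=0):
    rng = random.Random(seed)
    p0 = 0.25  # assumed uniform background probability per base
    # Start from a random site in every sequence.
    positions = [rng.randrange(len(s) - w + 1) for s in seqs]
    for _ in range(n_sweeps):
        for k, s in enumerate(seqs):
            # 1. Take sequence k out of the motif.
            q = build_matrix(seqs, positions, w, skip=k)
            # 2. Score each possible segment by its motif/background ratio.
            weights = []
            for i in range(len(s) - w + 1):
                ratio = 1.0
                for j in range(w):
                    ratio *= q[j][s[i + j]] / p0
                weights.append(ratio)
            # 3. Sample a new segment in proportion to its score.
            positions[k] = rng.choices(range(len(weights)), weights=weights)[0]
    return positions

# Toy sequences, each containing one planted copy of GCGAA.
seqs = ["TTACGCGAATCT", "CAGCGAATGTAC", "ATGTGCGAACTA", "GCGAATTCATAG"]
positions = gibbs_motif_sampler(seqs, w=5)
print([s[p:p + 5] for s, p in zip(seqs, positions)])
```

Because step 3 samples rather than picking the top-scoring segment, repeated runs with different seeds can return different sites.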

  29. Advantages of Gibbs sampling • Stochastic sampling permits the algorithm to escape from local optima. More robust than the deterministic updates used in EM. • Fast.

  30. Transcription level changes in glucose vs galactose (Roth 1998)

  31. (Roth 1998)

  32. MDscan (Liu et al. 2002) • Basic idea • True targets are likely to be more differentially expressed than other genes. • Procedure: • Rank genes according to p-values, gene expression levels, etc. • Search for the TF motif in the highest-ranking targets first (high signal/background ratio) • Refine candidate motifs with all targets

  33. m-matches for TGTAACGT • Similarity is defined by m-match, for a given w-mer and any other random w-mer • Example for the 8-mer TGTAACGT: • TGTAACGT matched 8 • AGTAACGT matched 7 • TGCAACAT matched 6 • TGACACGG matched 5 • AATAACAG matched 4 • Pick a reasonable m to call two w-mers similar
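
The match counts listed above are position-by-position identities between two w-mers of equal length. A minimal sketch follows; the function names and the threshold m = 6 in the usage lines are illustrative choices, not values from the slides.

```python
def match_count(a, b):
    """Number of positions at which two equal-length w-mers agree."""
    if len(a) != len(b):
        raise ValueError("w-mers must have the same length")
    return sum(x == y for x, y in zip(a, b))

def is_m_match(a, b, m):
    """True if the two w-mers agree in at least m positions."""
    return match_count(a, b) >= m

seed = "TGTAACGT"
candidates = ["TGTAACGT", "AGTAACGT", "TGCAACAT", "TGACACGG", "AATAACAG"]
for word in candidates:
    print(word, match_count(seed, word), is_m_match(seed, word, m=6))
```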

  34. MDscan Algorithm: Finding candidate motifs [Figure: Seed1 and its m-matches among sequences ranked by significance of differential gene expression]

  35. MDscan Algorithm: Finding candidate motifs [Figure: Seed2 and its m-matches among sequences ranked by significance of differential gene expression]

  36. MDscan Algorithm: Scoring candidate motifs • Maximum a posteriori (MAP) score function [Figure labels on the formula: motif signal; abundant; positions conserved; specific (unlikely in genome background)] • Prefer conserved motifs with many sites that are not often seen in the genome background • Keep the best 30-50 candidate motifs

  37. MDscan Algorithm: Update motifs with remaining sequences [Figure: Seed1 and its m-matches among sequences ranked by significance of differential gene expression]

  38. MDscan Algorithm: Refine the motifs [Figure: Seed1 and its m-matches among sequences ranked by significance of differential gene expression]

  39. MDscan Algorithm • Check sequences with a high signal/background ratio first, since they are more likely to yield the correct motif • Algorithm summary: • Seed with w-mers in the top sequences, find m-matches to make a matrix • Keep good motifs to be updated by the remaining sequences • Refine motifs by removing bad sites • Can check motifs of any width very fast • Only considers existing w-mers in a finite dataset • Seeding in top sequences: O(n^2) • Updating motifs with all sequences: O(n)

  40. Word enumeration: YMF (Sinha and Tompa 2002) • Search ALL possible w-mers. For each w-mer, calculate a z-score measuring whether it is over-represented in the selected sequences vs. the background. • Rank the words by z-score. • Select the top ones. • Advantage: global optimum • Drawback: computation time grows exponentially with w, so it can only be used to search for short motifs (6- to 10-mers).
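
The enumerate-and-rank idea can be sketched as follows. This is a deliberately simplified stand-in for YMF's statistics: it assumes independent bases with fixed background frequencies and a binomial variance that ignores word self-overlap, whereas YMF itself uses a more careful background model. The toy sequences and function name are made up.

```python
import itertools
import math

ALPHABET = "ACGT"

def word_zscores(seqs, w, p0):
    """z-score of every possible w-mer: (observed - expected) / std. dev."""
    n_pos = sum(max(len(s) - w + 1, 0) for s in seqs)
    counts = {}
    for s in seqs:
        for i in range(len(s) - w + 1):
            word = s[i:i + w]
            counts[word] = counts.get(word, 0) + 1
    scores = {}
    # Enumerating all 4**w words is what makes the cost exponential in w.
    for letters in itertools.product(ALPHABET, repeat=w):
        word = "".join(letters)
        p = math.prod(p0[b] for b in word)
        expected = n_pos * p
        sd = math.sqrt(n_pos * p * (1.0 - p))  # simplified binomial variance
        scores[word] = (counts.get(word, 0) - expected) / sd
    return scores

# Toy sequences, each containing one planted copy of GCGAA.
seqs = ["TTACGCGAATCT", "CAGCGAATGTAC", "ATGTGCGAACTA", "GCGAATTCATAG"]
p0 = {b: 0.25 for b in ALPHABET}  # assumed uniform background
scores = word_zscores(seqs, 5, p0)
top = sorted(scores, key=scores.get, reverse=True)[:3]
print(top)
```

Every word is scored, including those that never occur, so the ranking is exhaustive over the 4^w search space.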