Detecting Subtle Sequence Signals: a Gibbs Sampling Strategy for Multiple Alignment

Presentation Transcript

  1. Detecting Subtle Sequence Signals: a Gibbs Sampling Strategy for Multiple Alignment. Lawrence et al., 1993. Presented by: Manish Agrawal. Slides adapted from Prof. Sinha's notes.

  2. To define a motif, let's say we know where the motif starts in each sequence. The motif start positions in the t sequences can then be represented as s = (s1, s2, s3, …, st). (Figure: a motif model shared by genes regulated by the same transcription factor.)

  3. Motifs: Matrices and Consensus • Line up the patterns by their start indexes s = (s1, s2, …, st) • Construct a "position weight matrix" with the frequency of each nucleotide in each column • The consensus nucleotide in each position is the one with the highest frequency in its column

Alignment:
a G g t a c T t
C c A t a c g t
a c g t T A g t
a c g t C c A t
C c g t a c g G

Matrix:
A  3 0 1 0 3 1 1 0
C  2 4 0 0 1 4 0 0
G  0 1 4 0 0 0 3 1
T  0 0 0 5 1 0 1 4

Consensus: A C G T A C G T
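A tiny Python sketch (not from the slides) that recomputes the count matrix and consensus above from the aligned strings; the helper name profile_and_consensus is made up for illustration:

```python
from collections import Counter

# The five aligned patterns from the slide, case preserved as in the example.
alignment = ["aGgtacTt", "CcAtacgt", "acgtTAgt", "acgtCcAt", "CcgtacgG"]

def profile_and_consensus(seqs):
    """Column-wise nucleotide counts and the consensus string of an alignment."""
    columns = list(zip(*[s.upper() for s in seqs]))   # one tuple per column
    counts = [Counter(col) for col in columns]
    consensus = "".join(max("ACGT", key=c.__getitem__) for c in counts)
    return counts, consensus

counts, consensus = profile_and_consensus(alignment)
print(consensus)                 # ACGTACGT
print([c["A"] for c in counts])  # A row of the matrix: [3, 0, 1, 0, 3, 1, 1, 0]
```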

  4. Motif Finding Problem (Simplified) • Given a set of sequences, find the motif shared by all or most sequences, when its starting position in each sequence is unknown • Assumptions: • The motif appears exactly once in each sequence • The motif has a fixed length

  5. Generative Model • Suppose the sequences are aligned; the aligned regions are generated from a motif model • The motif model is a PWM, and a PWM is a position-specific multinomial distribution • For each position i (from 1 to W), there is a multinomial distribution over the 20 amino acids with parameters qi1, qi2, …, qi20 • The unaligned regions are generated from a background model: p1, p2, …, p20

  6. Notations • Set of symbols: the residue alphabet (20 amino acids, or 4 nucleotides for DNA) • Sequences: S = {S1, S2, …, SN} • Starting positions of motifs: A = {a1, a2, …, aN} • Motif model (θ): qij = P(symbol at the i-th position = j) • Background model: pj = P(symbol = j) • Count of symbols in each column: cij = count of symbol j in the i-th column of the aligned regions

  7. Probability of data given model
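The equation on this slide was an image and did not survive transcription; a standard reconstruction from the notation of slide 6 (my rendering, not the slide's exact formula) scores each aligned window under the motif model θ and every other position under the background:

$$P(S \mid A, \theta, p) \;=\; \prod_{k=1}^{N}\left[\;\prod_{i=1}^{W} q_{\,i,\;S_k(a_k+i-1)} \;\;\prod_{l \,\notin\, [a_k,\; a_k+W-1]} p_{\,S_k(l)}\right]$$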

  8. Scoring Function • Maximize the log-odds ratio F (shown below) • F is greater than zero if the data is a better match to the motif model than to the background model
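The ratio itself appeared only as an image on the slide; in the notation of slide 6, the log-odds score referred to as F on the following slides is standardly written as

$$F \;=\; \sum_{i=1}^{W}\sum_{j} c_{ij}\,\log\frac{q_{ij}}{p_{j}}$$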

  9. Scoring function • A particular alignment A gives us the counts cij • In the scoring function F, use the estimate of qij computed from these counts (see below)
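The estimate shown on the slide was an image; the usual pseudocount form, with residue-dependent pseudocounts bj and B = Σj bj, and N − 1 in the denominator because one sequence is held out during sampling, would read:

$$q_{ij} \;=\; \frac{c_{ij} + b_{j}}{N - 1 + B}$$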

  10. Scoring function • Thus, given an alignment A, we can calculate the scoring function F • We need to find A that maximizes this scoring function, which is a log-odds score

  11. Optimization and Sampling • To maximize a function f(x): • Brute-force method: try all possible x • Sampling method: sample x from a probability distribution p(x) ∝ f(x) • Idea: if xmax is the argmax of f(x), then it is also the argmax of p(x), so we have a relatively high probability of selecting xmax (see the sketch below)
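A minimal sketch of the sampling idea; the function f, the candidate range, and the helper name sample_proportional are made up for illustration:

```python
import random

def sample_proportional(candidates, f):
    """Pick one candidate with probability proportional to f(candidate)."""
    weights = [f(x) for x in candidates]
    # random.choices draws with probability proportional to the given weights
    return random.choices(candidates, weights=weights, k=1)[0]

# Toy example: f peaks at x = 7, so 7 is the value sampled most often.
f = lambda x: 1.0 / (1.0 + (x - 7) ** 2)
draws = [sample_proportional(range(10), f) for _ in range(1000)]
print(max(set(draws), key=draws.count))  # most often 7
```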

  12. Markov Chain Sampling • To sample from a probability distribution p(x), we set up a Markov chain such that each state represents a value of x and, for any two states x and y, the transition probabilities satisfy the condition below • This then implies that the chain's stationary distribution is p(x)
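Both conditions appeared as images on the slide; a standard pair consistent with the surrounding text is the detailed-balance condition and its consequence that p is the stationary distribution of the chain:

$$p(x)\,T(x \to y) \;=\; p(y)\,T(y \to x) \quad\Longrightarrow\quad \sum_{x} p(x)\,T(x \to y) \;=\; p(y)$$

So samples drawn from the chain, once it has mixed, are distributed according to p(x).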

  13. Gibbs sampling to maximize F • Gibbs sampling is a special type of Markov chain sampling algorithm • Our goal is to find the optimal A = (a1, …, aN) • The Markov chain we construct only has transitions from A to alignments A' that differ from A in just one of the ai • In round-robin order, pick one of the ai to replace • Consider all A' formed by replacing ai with some other starting position ai' in sequence Si • Move to one of these A' probabilistically • Iterate the last three steps

  14. Algorithm
Randomly initialize A0;
Repeat:
  (1) randomly choose a sequence z from S; A* = At \ {az}; compute θt from A*;
  (2) sample az according to P(az = x), which is proportional to Qx / Px; update At+1 = A* ∪ {x};
Select the At that maximizes F.
Qx: the probability of generating x according to θt; Px: the probability of generating x according to the background model.
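A minimal Python sketch written directly from the pseudocode above. The DNA alphabet, the uniform background, the +1 pseudocounts, and the helper names (column_counts, log_odds, gibbs_motif_search) are illustrative choices, not the paper's exact settings:

```python
import math
import random

ALPHABET = "ACGT"  # illustrative; the paper's examples use the 20 amino acids

def column_counts(seqs, starts, W, skip=None):
    """Symbol counts per motif column, with +1 pseudocounts, ignoring sequence `skip`."""
    counts = [{c: 1 for c in ALPHABET} for _ in range(W)]
    for n, (seq, a) in enumerate(zip(seqs, starts)):
        if n != skip:
            for i in range(W):
                counts[i][seq[a + i]] += 1   # sequences assumed uppercase over ALPHABET
    return counts

def log_odds(seqs, starts, W, bg):
    """F = sum_i sum_j c_ij * log(q_ij / p_j) for the current alignment."""
    score = 0.0
    for col in column_counts(seqs, starts, W):
        total = sum(col.values())
        for c, cnt in col.items():
            # (cnt - 1) is the raw count; cnt / total is the pseudocounted frequency q_ij
            score += (cnt - 1) * math.log((cnt / total) / bg[c])
    return score

def gibbs_motif_search(seqs, W, iters=2000, seed=0):
    """Gibbs sampler over motif start positions, one start per sequence."""
    random.seed(seed)
    bg = {c: 1.0 / len(ALPHABET) for c in ALPHABET}            # uniform background (simplification)
    starts = [random.randrange(len(s) - W + 1) for s in seqs]  # A0
    best, best_F = list(starts), log_odds(seqs, starts, W, bg)
    for _ in range(iters):
        z = random.randrange(len(seqs))                        # (1) choose a sequence z
        counts = column_counts(seqs, starts, W, skip=z)        # theta_t from A* = A_t \ {a_z}
        totals = [sum(col.values()) for col in counts]
        weights = []                                           # (2) Qx / Px for every start x
        for x in range(len(seqs[z]) - W + 1):
            Qx = Px = 1.0
            for i in range(W):
                sym = seqs[z][x + i]
                Qx *= counts[i][sym] / totals[i]
                Px *= bg[sym]
            weights.append(Qx / Px)
        starts[z] = random.choices(range(len(weights)), weights=weights, k=1)[0]
        F = log_odds(seqs, starts, W, bg)
        if F > best_F:                                         # keep the A_t that maximizes F
            best, best_F = list(starts), F
    return best, best_F
```

The best-scoring alignment seen during sampling is returned, mirroring the "Select the At that maximizes F" step above.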

  15. Algorithm Current solution At

  16. Algorithm Choose one az to replace

  17. Algorithm For each candidate site x in sequence z, calculate Qx and Px: the probabilities of sampling x from the motif model and the background model, respectively.

  18. Algorithm Among all possible candidates, choose one (say x) with probability proportional to Qx / Px

  19. Algorithm Set At+1 = A* ∪ {x}

  20. Algorithm Repeat

  21. Local optima • The algorithm may not find the "global" or true maximum of the scoring function • Once At contains many similar substrings, other substrings matching them will be chosen with higher probability • The algorithm can "get locked" into a "local optimum": all neighbors have poorer scores, hence there is little chance of moving out of this solution

  22. Phase shifts • After every M iterations, compare the current At with the alignments obtained by shifting every aligned substring ai by the same amount, either to the left or to the right (a sketch of this check follows)
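A small sketch of one way to run this check; score_fn stands for any alignment scorer, for example a closure over the log-odds F from the sampler sketch above, and greedy selection of the best shift is a simplification of the paper's probabilistic choice:

```python
def phase_shift(seqs, starts, W, score_fn, max_shift=3):
    """Try shifting every start by the same offset; keep the best-scoring alignment."""
    best_starts, best_score = list(starts), score_fn(seqs, starts, W)
    for d in range(-max_shift, max_shift + 1):
        if d == 0:
            continue
        shifted = [a + d for a in starts]
        # Discard shifts that push any motif window outside its sequence
        if any(a < 0 or a + W > len(s) for a, s in zip(shifted, seqs)):
            continue
        score = score_fn(seqs, shifted, W)
        if score > best_score:
            best_starts, best_score = shifted, score
    return best_starts
```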

  23. Phase shift

  24. Phase shift

  25. Pattern Width • The algorithm described so far requires the pattern width (W) as input • We can modify the algorithm so that it runs over a range of plausible widths • The function F is not directly useful for comparing widths, as its optimal value always increases with increasing W

  26. Pattern Width • Instead, a function based on the incomplete-data log-probability ratio G can be used • Dividing G by the number of free parameters needed to specify the pattern (19W in the case of proteins) gives a statistic useful for choosing the pattern width; this quantity is called the information per parameter
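In symbols, the criterion described on this slide amounts to choosing the width W that maximizes

$$\frac{G}{19\,W}\qquad\text{(19 free multinomial parameters per motif column for proteins)}$$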

  27. Examples • The algorithm was applied to locate the helix-turn-helix (HTH) motif, which represents a large class of sequence-specific DNA-binding structures involved in numerous cases of gene regulation • Detection and alignment of HTH motifs is a well-recognized problem because of their great sequence variation

  28. HTH Motif (Figure: results on complete sequences, non-site sequences, and random sequences.)

  29. Convergence behavior of Gibbs Sampling Algorithm

  30. Time complexity analysis • For a typical protein sequence it was found that, for a single pattern width, each input sequence needs to be sampled fewer than T = 100 times before convergence • L × W multiplications are performed in step (2) of the algorithm for a sequence of length L • Total multiplications to execute the algorithm ≈ T · N · Lavg · W • Linear time complexity has been observed in applications
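With illustrative numbers (not from the paper), T = 100 sampling passes, N = 30 sequences, Lavg = 300 residues, and W = 20 columns give

$$T \cdot N \cdot L_{\text{avg}} \cdot W \;=\; 100 \times 30 \times 300 \times 20 \;=\; 1.8 \times 10^{7}\ \text{multiplications},$$

i.e. the cost scales linearly with the total input length N · Lavg.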

  31. Motif finding • The Gibbs sampling algorithm was originally applied to find motifs in amino acid sequences • Protein motifs represent common sequence patterns that are related to the structure and function of a protein • Gibbs sampling is now extensively used to find motifs in DNA sequences, e.g., transcription factor binding sites

  32. Thank You