Detecting Subtle Sequence Signals: a Gibbs Sampling Strategy for Multiple Alignment
Lawrence, Altschul, Boguski, Liu, Neuwald, Wootton. Science, 1993
Motif Finding Problem
• Biological description: given a set of sequences, find the motif shared by all or most of them, where the motif's starting position in each sequence is unknown
• Assumptions:
  • The motif appears exactly once in each sequence
  • The motif has a fixed length
Generative Model
• Suppose the sequences are aligned; the aligned regions are generated from a motif model (a position-specific multinomial distribution)
• The unaligned regions are generated from a background model
Notations
• Set of symbols: the alphabet (e.g., nucleotides or amino acids)
• Sequences: S = {S_1, S_2, ..., S_N}
• Starting positions of motifs: A = {a_1, a_2, ..., a_N}
• Motif model (θ): q_ij = P(symbol at the i-th position = j)
• Background model: p_j = P(symbol = j)
• Count of symbols in each column: c_ij = count of symbol j in the i-th column of the aligned region
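To make the notation concrete, here is a minimal Python sketch (the toy DNA sequences and starting positions below are invented) that extracts the aligned windows given A and tallies the column counts c_ij:

```python
from collections import Counter

ALPHABET = "ACGT"   # assumed DNA alphabet for this toy example
W = 3               # motif width (fixed length)

S = ["GGACATG", "TACAGTC", "CCTACAA"]   # toy sequences (invented)
A = [2, 1, 2]                           # starting positions, 0-based (invented)

# c_ij: count of symbol j in motif column i across the aligned windows
counts = [Counter(seq[a + i] for seq, a in zip(S, A)) for i in range(W)]
for i, col in enumerate(counts):
    print(f"column {i}:", {j: col.get(j, 0) for j in ALPHABET})
```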
Motif Finding Problem
• Problem: find the starting positions and the model parameters simultaneously to maximize the posterior probability:
  P(A, θ | S)
• By Bayes' theorem, assuming a uniform prior distribution, this is equivalent to maximizing the likelihood:
  P(A, θ | S) ∝ P(S | A, θ)
Equivalent Scoring Function
• Maximize the log-odds ratio:
  F = Σ_i Σ_j c_ij · log(q_ij / p_j)
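As a sketch (all probabilities and counts below are invented toy values), F can be computed directly from the count matrix c, the motif model q, and the background p:

```python
import math

p = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}   # background model p_j
q = [                                # motif model q_ij, one dict per column
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
]
c = [                                # column counts c_ij (toy values)
    {"A": 3, "C": 0, "G": 0, "T": 0},
    {"A": 0, "C": 3, "G": 0, "T": 0},
    {"A": 3, "C": 0, "G": 0, "T": 0},
]

# F = sum_i sum_j c_ij * log(q_ij / p_j)
F = sum(c[i][j] * math.log(q[i][j] / p[j])
        for i in range(len(q)) for j in "ACGT")
print(f"F = {F:.3f}")   # larger F = alignment more surprising under background
```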
Sampling and Optimization
• To maximize a function f(x):
  • Brute-force method: try all possible x
  • Sampling method: sample x from a probability distribution p(x) ∝ f(x)
• Idea: if x_max is the argmax of f(x), it is also the argmax of p(x), so we have a high probability of selecting x_max
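A toy illustration of this idea (the function f below is invented): drawing from p(x) ∝ f(x) makes the maximizer the most frequently sampled value:

```python
import math
import random

xs = list(range(10))
f = [math.exp(-(x - 4) ** 2 / 2) for x in xs]    # toy f(x), argmax at x = 4
draws = random.choices(xs, weights=f, k=10_000)  # sample x with p(x) ∝ f(x)
print(max(set(draws), key=draws.count))          # most frequent draw: 4
```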
Detailed Balance of a Markov Chain
• To sample from a probability distribution p(x), we set up a Markov chain such that each state represents a value of x and, for any two states x and y, the transition probabilities T satisfy detailed balance:
  p(x) · T(x → y) = p(y) · T(y → x)
• This then implies that p is the stationary distribution of the chain:
  Σ_x p(x) · T(x → y) = p(y)
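A tiny numeric check on an invented two-state chain: choosing transition probabilities that satisfy detailed balance for p = (0.25, 0.75) makes p the stationary distribution:

```python
p = [0.25, 0.75]          # target distribution (invented)
T = [[0.4, 0.6],          # transition probabilities from state 0
     [0.2, 0.8]]          # transition probabilities from state 1

# detailed balance: p(x) T(x -> y) = p(y) T(y -> x)
assert abs(p[0] * T[0][1] - p[1] * T[1][0]) < 1e-12

# consequence: p is stationary, i.e. sum_x p(x) T(x -> y) = p(y)
print([sum(p[x] * T[x][y] for x in range(2)) for y in range(2)])  # [0.25, 0.75]
```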
Gibbs Sampling
• Idea: a joint distribution may be hard to sample from directly, but it may be easy to sample from the conditional distributions in which all variables are fixed except one
• To sample from p(x_1, x_2, ..., x_n), let each state of the Markov chain represent a tuple (x_1, x_2, ..., x_n). From the current state, pick a coordinate i and move to the state whose i-th coordinate is replaced by a new value x_i', drawn with probability p(x_i' | x_1, ..., x_{i-1}, x_{i+1}, ..., x_n). Detailed balance is then satisfied.
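A minimal sketch of this general recipe on a toy target: a bivariate normal with correlation rho (chosen because both conditionals are one-dimensional normals that are trivial to sample; the value of rho is invented):

```python
import random

rho = 0.8                   # correlation of the toy bivariate normal
sd = (1 - rho ** 2) ** 0.5  # conditional standard deviation
x, y, xs = 0.0, 0.0, []
for _ in range(10_000):
    x = random.gauss(rho * y, sd)   # sample x | y ~ N(rho*y, 1 - rho^2)
    y = random.gauss(rho * x, sd)   # sample y | x ~ N(rho*x, 1 - rho^2)
    xs.append(x)

print(f"empirical mean of x = {sum(xs) / len(xs):.2f} (target: 0)")
```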
Gibbs Sampling in Motif Finding
• Ideally we would sample from the joint distribution of A and θ, but θ is high-dimensional, so this is inefficient
• Instead, we draw each sample from the conditional distribution:
  P(a_z | A*, S), where A* = A \ {a_z}
• It can be shown that computing this conditional probability P(a_z | A*, S) is roughly equivalent to: compute the estimator of θ from A*, then compute the probability of generating a_z's motif window according to this θ
Estimator of θ
• Given an alignment A, i.e. the starting positions of the motifs, θ can be estimated by its MLE with smoothing (equivalently, the posterior mean under a Dirichlet prior with parameters b_j):
  q_ij = (c_ij + b_j) / (N − 1 + B), where B = Σ_j b_j
  (N − 1 because the motif counts come from the N − 1 sequences in A*)
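A one-column sketch of this estimator (the pseudocounts b_j = 0.25 below are an arbitrary illustrative choice):

```python
b = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}  # Dirichlet pseudocounts b_j
B = sum(b.values())
N = 3                                   # number of sequences
c0 = {"A": 2, "C": 0, "G": 0, "T": 0}   # counts in one column over N - 1 motifs

# q_ij = (c_ij + b_j) / (N - 1 + B); pseudocounts keep every q_ij > 0
q0 = {j: (c0[j] + b[j]) / (N - 1 + B) for j in b}
print(q0, sum(q0.values()))   # a valid distribution: sums to 1.0
```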
Algorithm
Randomly initialize A_0
Repeat:
  (1) Randomly choose a sequence z from S; set A* = A_t \ {a_z}; compute θ_t, the estimator of θ given S and A*
  (2) Sample a_z according to P(a_z = x) ∝ Q_x / P_x; update A_{t+1} = A* ∪ {x}
Select the A_t that maximizes F

Q_x: the probability of generating x according to θ_t
P_x: the probability of generating x according to the background model
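Putting the pieces together, here is a minimal runnable sketch of the sampler (not the authors' code; the toy DNA data with a planted ACGT motif, the background taken from overall symbol frequencies, and the fixed pseudocounts are all assumptions made here for illustration):

```python
import math
import random

ALPHABET = "ACGT"
W = 4                                    # motif width (assumed known)
PSEUDO = {j: 0.25 for j in ALPHABET}     # Dirichlet pseudocounts b_j (a choice)

def background(seqs):
    """Background model p_j: overall symbol frequencies (a simple choice)."""
    total = sum(len(s) for s in seqs)
    return {j: sum(s.count(j) for s in seqs) / total for j in ALPHABET}

def motif_model(seqs, starts, skip=None):
    """Smoothed estimator of theta, optionally holding out sequence `skip`."""
    n = len(seqs) - (1 if skip is not None else 0)
    B = sum(PSEUDO.values())
    q = []
    for i in range(W):
        col = dict(PSEUDO)
        for k, (s, a) in enumerate(zip(seqs, starts)):
            if k != skip:
                col[s[a + i]] += 1
        q.append({j: col[j] / (n + B) for j in ALPHABET})
    return q

def log_odds(seqs, starts, q, p):
    """Score F: sum over all motif positions of log(q_ij / p_j)."""
    return sum(math.log(q[i][s[a + i]] / p[s[a + i]])
               for s, a in zip(seqs, starts) for i in range(W))

def gibbs_motif(seqs, iters=2000):
    p = background(seqs)
    starts = [random.randrange(len(s) - W + 1) for s in seqs]
    best, best_score = list(starts), -math.inf
    for _ in range(iters):
        z = random.randrange(len(seqs))            # (1) pick a sequence z
        theta = motif_model(seqs, starts, skip=z)  # estimate theta from A*
        # (2) weight each candidate start x in sequence z by Q_x / P_x
        weights = []
        for x in range(len(seqs[z]) - W + 1):
            w = seqs[z][x:x + W]
            Qx = math.prod(theta[i][w[i]] for i in range(W))
            Px = math.prod(p[ch] for ch in w)
            weights.append(Qx / Px)
        starts[z] = random.choices(range(len(weights)), weights)[0]
        # keep the alignment A_t that maximizes F
        score = log_odds(seqs, starts, motif_model(seqs, starts), p)
        if score > best_score:
            best, best_score = list(starts), score
    return best, best_score

random.seed(0)
seqs = ["TTACGTGA", "CACGTTTT", "GGGACGTA", "ACGTCCCC"]   # planted ACGT motif
starts, score = gibbs_motif(seqs)
print(starts, [s[a:a + W] for s, a in zip(seqs, starts)], f"F = {score:.2f}")
```

This sketch keeps only the core sampling step; the full method in the paper adds refinements such as phase shifts to escape locally optimal but misregistered alignments.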