Gibbs sampling for motif finding in biological sequences

Gibbs sampling for motif finding in biological sequences Christopher Sheldahl

Biological Sequences • Proteins : “strings” of amino acids. 20 common types |alphabet| = 20 • Nucleic acids : “strings” of nucleotides 4 common types for RNA; 4 for DNA |alphabet| = 4

Motif Finding • Multiple sequences : How to detect related segments (or “motifs”)? 1: …ATACGTTGAGA… 2: …CGACGTTGCAA… 3: …CCACGCTGGAC…

Sequence Databases are Large • Genbank : NA : 108 million sequences • SwissProt : proteins : ~500,000 sequences

Growth of SwissProt http://www.expasy.org/sprot/relnotes/relstat.html

The basic problem • We have N biological sequences. • Assume that there is one evolutionarily related segment (substring) of known width W in each sequence. • How do we find this segment in each sequence? Remember that there might be mutations.

Protein case : We’d like to know the following: • Motif description (matrix): Probability of each of the 20 aa at each position in the motif of width w: qi,R : Probability ofR type aa at position i of motif. Matrix is W x 20. • Background probability (vector) of each of the 20 aa in sites not part of motif: p1….p20 • Ak : alignment vector of starting indices for the motif in each of the N sequences.

Scoring a given string • If we knew the motif description matrix and the background vector we could score any string of width W : L : score of segment wrt the motif. Ri : the aa type at position i of the segment.

Scoring an alignment of N sequences We can score an entire alignment :

What if we just had one new sequence? • What if we knew the motif description matrix and the background vector for a large number of sequences, and we want to quickly align a new (presumably related) sequence?

What if we just had one new sequence? • Score possible locations of the motif in the new sequence and pick one with a high (or max) value of L: • Note that the better our motif description and background model are, the better the new alignment will be.

Motif Finding Dilemma • If we knew the alignment, we could calculate the motif description matrix and the background vector. • If we knew the motif description and the background, we could calculate the alignment (by choosing high L positions). • We don’t know either of these - we’ve seen one way to approach this problem before...

Expectation Maximization or The Road Not Taken in this Lecture Start with Random motif description and background. Repeat until motif description converges: 1) E - step : Calculate alignment points (Ak) from Motif Description and Background. 2) M - step : Calculate Motif Description and background from alignments. Lawrence et. al. 1990. “An Expectation Maximization Algorithm for the Identification and Characterization of Common Sites in Unaligned Biolpolymer Sequences”. Proteins.

Local minima • After initialization, simple EM algorithm is a deterministic process. • If the initialization is bad, can get trapped in local minima. • Heuristic : Try many different starting points.

Random Sampling • Note the heuristic of trying many random initial positions. Maybe we can overcome local minima if we employ randomness more thoroughly. • Can we take random sample alignments from the distribution : p(Align | S, W, Motif Matrix, Background) where Align means A1, A2…AN.

Markov Chain Monte Carlo • Markov Chain Monte Carlo methods generate a Markov chain of points that converges to a distribution of interest. • “Monte Carlo” : The methods employ randomness.

Metropolis-Hastings • Metropolis-Hastings is an MCMC model that can sample from any distribution P, using a proposal distribution Q(x’; x). • Initialize with random x. • Generate new x’ = • Proposal position according to • Q(x’; x) • Compute α = min( (P(x’) / P(x) ), 1) and accept change with probability α. Figure : Wikipedia

Gibbs Sampling • Gibbs sampling is a variety of Metropolis-Hastings sampling where the sampling step is always accepted. • For multivariate distributions, in Gibbs sampling only one parameter is changed at a time. • This makes Gibbs sampling particularly useful for multivariate distributions.

Motif Finding with Gibbs :Site Sampler • Site sampler : Sample starting points for motif in each sequence. • Start with random alignments, then use random sampling for one sequence at a time to gradually improve the alignments. Lawrence et al., “Detecting Subtle Sequence Signals: A Gibbs Sampling Strategy for Multiple Alignment”, Science. 1993. 262(5131), 208-214

Gibbs sampling: Changing One Thing at a Time • N = 3 • We want to sample alignment points A1 for sequence 1, given the pattern description M, background model B, the sequences S and width W, and alignments A2 (for sequence 2) and A3 (for sequence 3). p( A1 | S, M, B,W, A2, A3)

Site Sampler Algorithm Initialize with random alignment points. While not converged: Do steps 1 and 2 for each of N sequences : 1) Predictive update step : Calculate motif matrix and background using all sequences but the currently selected one. 2) Sampling step. Calculate Lx = Qx / Px for all starting points x in this sequence. Choose one with a probability proportional to Lx. • The inner loop is an iteration of sequences. • In step 2 you sample a new value for Ak i.e. the alignment position for sequence k • Convergence : In theory you sample until alignment no longer changes

Example: Random Alignments Rouchka, 1997, “A Brief Overview of Gibbs Sampling” DNA: Alphabet = {A,C,G,T}

Example: Initial Counts New Counts Remove 1st sequence (ATTTAT) TCAGAACCAGTTATAAATTTATCATTTCCTTCTCCACTCCT Rouchka, 1997.

Sequence 1: TCAGAACCAGTTATAAATTTATCATTTCCTTCTCCACTCCT Rouchka, 1997.

Final Alignment

Assumptions for Site Sampler • Assumption of one motif. • Assumption of one copy of motif in each sequence. • Assumption that we know the motif width.

Motif Sampler • Sample different motifs for each part of the sequence. Store motif descriptions and background vector for each motif. • Allows multiple motifs, and varying number of each motif to be present. • Still subject to assumption of known motif width. Neuwald et al., “Gibbs Motif Sampling: Detection of bacterial membrane protein repeats”. 1995. Protein Science. 4:1618-1632.

Null model • Null model has pattern description probabilities identical to the background probabilities for its constituent amino acids.

Widths and Copy Numbers • The motif sampler requires widths and expected numbers of each motif per sequence as input parameters. The widths are fixed during the algorithm, but the number of copies can change. • Construction of subfamilies (more specific models) favored by larger widths and smaller expected number of copies. • Construction of superfamilies (less specific models) favored by smaller widths and larger expected number of copies.

Motif Sampler Algorithm Initialize with Ei non-overlapping random segments for each motif i in each of the sequences. While not converged: Do steps 1 and 2 for successive segments in the biological sequences : 1) Predictive update step : If this segment is in an alignment remove it and recalculate motif matrixand background data excluding this segment. 2) Sampling step : Sample one of the motifs with a probability proportional to the score that the motif description assigns this segment. • In the inner loop the motif sampler iterates through segments (substrings) • Motif sampler samples among different motif models for each segment.

Modifications to the Samplers • There are methods to allow for gapped motifs. • High scoring alignments can be stored for later investigation.

Neuwald et al., 1995 Porin structures

Porin alignment Neuwald et al., 1995

Motif descriptions

Motivation for Gibbs sampling • We have a joint probability density f(x, y1, y2,…yp) • We want a marginal density, f(x). • The integrations are difficult. Casella and George, “Explaining the Gibbs Sampler”

Near-optimum Sampler • A particular motif may not be present in the best alignment found, even though its found in many other near-optimal alignments. • Sample among near-optimal alignments to capture all sites present.

Column sampler • The column sampler allows for a motif to shift: GCACCTG --> GCACCTG • It also allows for the development of motifs with gaps: GCACCTG --> GCACCTG

Gibbs sampling for motif finding in biological sequences

Gibbs sampling for motif finding in biological sequences

Presentation Transcript

Regulatory Motif Finding

Gibbs sampling

(Regulatory-) Motif Finding

Motif finding

Motif identification with Gibbs Sampler

Comparative Motif Finding

Motif Finding

Motif Finding

Motif finding

Motif Finding

Motif Finding

Gibbs Sampling

Multiple Species Gene Finding using Gibbs Sampling

Gibbs sampling for motif finding

Biological Motif Discovery

Regulatory Motif Finding

Motif Finding

Motif Finding

Biological sampling

Motif finding with Gibbs sampling

The Gibbs Motif Sampler

Gibbs Sampling in Motif Finding