gibbs sampling for motif finding in biological sequences l.
Download
Skip this Video
Download Presentation
Gibbs sampling for motif finding in biological sequences

Loading in 2 Seconds...

play fullscreen
1 / 37

Gibbs sampling for motif finding in biological sequences - PowerPoint PPT Presentation


  • 133 Views
  • Uploaded on

Gibbs sampling for motif finding in biological sequences. Christopher Sheldahl. Biological Sequences. Proteins : “strings” of amino acids. 20 common types |alphabet| = 20 Nucleic acids : “strings” of nucleotides 4 common types for RNA; 4 for DNA |alphabet| = 4. Motif Finding .

loader
I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
capcha
Download Presentation

PowerPoint Slideshow about 'Gibbs sampling for motif finding in biological sequences' - dunne


Download Now An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript
biological sequences
Biological Sequences
  • Proteins : “strings” of amino acids.

20 common types

|alphabet| = 20

  • Nucleic acids : “strings” of nucleotides

4 common types for RNA; 4 for DNA

|alphabet| = 4

motif finding
Motif Finding
  • Multiple sequences : How to detect related segments (or “motifs”)?

1: …ATACGTTGAGA…

2: …CGACGTTGCAA…

3: …CCACGCTGGAC…

sequence databases are large
Sequence Databases are Large
  • Genbank : NA : 108 million sequences
  • SwissProt : proteins : ~500,000 sequences
growth of swissprot
Growth of SwissProt

http://www.expasy.org/sprot/relnotes/relstat.html

the basic problem
The basic problem
  • We have N biological sequences.
  • Assume that there is one evolutionarily related segment (substring) of known width W in each sequence.
  • How do we find this segment in each sequence? Remember that there might be mutations.
protein case we d like to know the following
Protein case : We’d like to know the following:
  • Motif description (matrix): Probability of each of the 20 aa at each position in the motif of width w:

qi,R : Probability ofR type aa at position i of motif. Matrix is W x 20.

  • Background probability (vector) of each of the 20 aa in sites not part of motif: p1….p20
  • Ak : alignment vector of starting indices for the motif in each of the N sequences.
scoring a given string
Scoring a given string
  • If we knew the motif description matrix and the background vector we could score any string of width W :

L : score of segment wrt the motif.

Ri : the aa type at position i of the segment.

scoring an alignment of n sequences
Scoring an alignment of N sequences

We can score an entire alignment :

what if we just had one new sequence
What if we just had one new sequence?
  • What if we knew the motif description matrix and the background vector for a large number of sequences, and we want to quickly align a new (presumably related) sequence?
what if we just had one new sequence11
What if we just had one new sequence?
  • Score possible locations of the motif in the new sequence and pick one with a high (or max) value of L:
  • Note that the better our motif description and background model are, the better the new alignment will be.
motif finding dilemma
Motif Finding Dilemma
  • If we knew the alignment, we could calculate the motif description matrix and the background vector.
  • If we knew the motif description and the background, we could calculate the alignment (by choosing high L positions).
  • We don’t know either of these - we’ve seen one way to approach this problem before...
expectation maximization or the road not taken in this lecture
Expectation Maximization or The Road Not Taken in this Lecture

Start with Random motif description and background.

Repeat until motif description converges:

1) E - step : Calculate alignment points (Ak) from Motif Description and Background.

2) M - step : Calculate Motif Description and background from alignments.

Lawrence et. al. 1990. “An Expectation Maximization Algorithm for the Identification and Characterization

of Common Sites in Unaligned Biolpolymer Sequences”. Proteins.

local minima
Local minima
  • After initialization, simple EM algorithm is a deterministic process.
  • If the initialization is bad, can get trapped in local minima.
  • Heuristic : Try many different starting points.
random sampling
Random Sampling
  • Note the heuristic of trying many random initial positions. Maybe we can overcome local minima if we employ randomness more thoroughly.
  • Can we take random sample alignments from the distribution :

p(Align | S, W, Motif Matrix, Background)

where Align means A1, A2…AN.

markov chain monte carlo
Markov Chain Monte Carlo
  • Markov Chain Monte Carlo methods generate a Markov chain of points that converges to a distribution of interest.
  • “Monte Carlo” : The methods employ randomness.
metropolis hastings
Metropolis-Hastings
  • Metropolis-Hastings is an MCMC model that can sample from any distribution P, using a proposal distribution

Q(x’; x).

  • Initialize with random x.
  • Generate new x’ =
    • Proposal position according to
    • Q(x’; x)
  • Compute α = min( (P(x’) / P(x) ), 1)

and accept change with probability α.

Figure : Wikipedia

gibbs sampling
Gibbs Sampling
  • Gibbs sampling is a variety of Metropolis-Hastings sampling where the sampling step is always accepted.
  • For multivariate distributions, in Gibbs sampling only one parameter is changed at a time.
  • This makes Gibbs sampling particularly useful for multivariate distributions.
motif finding with gibbs site sampler
Motif Finding with Gibbs :Site Sampler
  • Site sampler : Sample starting points for motif in each sequence.
  • Start with random alignments, then use random sampling for one sequence at a time to gradually improve the alignments.

Lawrence et al., “Detecting Subtle Sequence Signals: A Gibbs Sampling Strategy for

Multiple Alignment”, Science. 1993. 262(5131), 208-214

gibbs sampling changing one thing at a time
Gibbs sampling: Changing One Thing at a Time
  • N = 3
  • We want to sample alignment points A1 for sequence 1, given the pattern description M, background model B, the sequences S and width W, and alignments A2 (for sequence 2) and A3 (for sequence 3).

p( A1 | S, M, B,W, A2, A3)

site sampler algorithm
Site Sampler Algorithm

Initialize with random alignment points.

While not converged:

Do steps 1 and 2 for each of N sequences :

1) Predictive update step :

Calculate motif matrix and background using all sequences but the currently selected one.

2) Sampling step. Calculate Lx = Qx / Px for all starting points x in this sequence. Choose one with a probability proportional to Lx.

  • The inner loop is an iteration of sequences.
  • In step 2 you sample a new value for Ak i.e. the alignment position for sequence k
  • Convergence : In theory you sample until alignment no longer changes
example random alignments
Example: Random Alignments

Rouchka, 1997,

“A Brief Overview

of Gibbs Sampling”

DNA:

Alphabet =

{A,C,G,T}

example
Example:

Initial Counts

New Counts

Remove 1st sequence (ATTTAT)

TCAGAACCAGTTATAAATTTATCATTTCCTTCTCCACTCCT

Rouchka, 1997.

slide24

Sequence 1:

TCAGAACCAGTTATAAATTTATCATTTCCTTCTCCACTCCT

Rouchka, 1997.

assumptions for site sampler
Assumptions for Site Sampler
  • Assumption of one motif.
  • Assumption of one copy of motif in each sequence.
  • Assumption that we know the motif width.
motif sampler
Motif Sampler
  • Sample different motifs for each part of the sequence. Store motif descriptions and background vector for each motif.
  • Allows multiple motifs, and varying number of each motif to be present.
  • Still subject to assumption of known motif width.

Neuwald et al., “Gibbs Motif Sampling: Detection of bacterial membrane protein

repeats”. 1995. Protein Science. 4:1618-1632.

null model
Null model
  • Null model has pattern description probabilities identical to the background probabilities for its constituent amino acids.
widths and copy numbers
Widths and Copy Numbers
  • The motif sampler requires widths and expected numbers of each motif per sequence as input parameters. The widths are fixed during the algorithm, but the number of copies can change.
  • Construction of subfamilies (more specific models) favored by larger widths and smaller expected number of copies.
  • Construction of superfamilies (less specific models) favored by smaller widths and larger expected number of copies.
motif sampler algorithm
Motif Sampler Algorithm

Initialize with Ei non-overlapping random segments for each motif i in each of the sequences.

While not converged:

Do steps 1 and 2 for successive segments in the biological sequences :

1) Predictive update step :

If this segment is in an alignment remove it and recalculate motif matrixand background data excluding this segment.

2) Sampling step : Sample one of the motifs with a probability proportional to the score that the motif description assigns this segment.

  • In the inner loop the motif sampler iterates through segments (substrings)
  • Motif sampler samples among different motif models for each segment.
modifications to the samplers
Modifications to the Samplers
  • There are methods to allow for gapped motifs.
  • High scoring alignments can be stored for later investigation.
porin alignment
Porin alignment

Neuwald et al., 1995

motivation for gibbs sampling
Motivation for Gibbs sampling
  • We have a joint probability density

f(x, y1, y2,…yp)

  • We want a marginal density, f(x).
  • The integrations are difficult.

Casella and George, “Explaining the Gibbs Sampler”

near optimum sampler
Near-optimum Sampler
  • A particular motif may not be present in the best alignment found, even though its found in many other near-optimal alignments.
  • Sample among near-optimal alignments to capture all sites present.
column sampler
Column sampler
  • The column sampler allows for a motif to shift:

GCACCTG --> GCACCTG

  • It also allows for the development of motifs with gaps:

GCACCTG --> GCACCTG