Detecting Subtle Sequence Signals: a Gibbs Sampling Strategy for Multiple Alignment

Lawrence et al. 1993

Presented By: Manish Agrawal

Slides adapted from Prof Sinha’s notes.

A motif model

To define a motif, let's say we know where the motif starts in each sequence.

The motif start positions in their sequences can be represented as s = (s1, s2, s3, …, st)

[Figure: genes regulated by the same transcription factor]

Motifs: Matrices and Consensus

             a G g t a c T t
             C c A t a c g t
Alignment    a c g t T A g t
             a c g t C c A t
             C c g t a c g G
             _______________
           A 3 0 1 0 3 1 1 0
Matrix     C 2 4 0 0 1 4 0 0
           G 0 1 4 0 0 0 3 1
           T 0 0 0 5 1 0 1 4
             _______________
Consensus    A C G T A C G T

  • Line up the patterns by their start indexes s = (s1, s2, …, st)
  • Construct a “position weight matrix” with the frequencies of each nucleotide in the columns
  • The consensus nucleotide in each position has the highest frequency in its column
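The construction of the count matrix and the consensus can be sketched in a few lines of Python. This is only an illustration using the five aligned sites from the slide; the function names `count_matrix` and `consensus` are made up for this example.

```python
from collections import Counter

def count_matrix(sites, alphabet="ACGT"):
    """Return {symbol: [count in column 0, column 1, ...]} for equal-length sites."""
    width = len(sites[0])
    if any(len(site) != width for site in sites):
        raise ValueError("all aligned sites must have the same length")
    # One Counter per column; case is ignored, as on the slide.
    columns = [Counter(site[i].upper() for site in sites) for i in range(width)]
    return {sym: [col[sym] for col in columns] for sym in alphabet}

def consensus(matrix):
    """Pick the most frequent symbol in every column."""
    width = len(next(iter(matrix.values())))
    return "".join(max(matrix, key=lambda sym: matrix[sym][i]) for i in range(width))

# The five aligned sites shown on the slide.
sites = ["aGgtacTt", "CcAtacgt", "acgtTAgt", "acgtCcAt", "CcgtacgG"]
matrix = count_matrix(sites)
for sym, row in matrix.items():
    print(sym, row)
print(consensus(matrix))  # ACGTACGT
```

Dividing every count by the number of sites (5 here) turns the counts into the per-column frequencies that the slide refers to.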
Motif Finding Problem (Simplified)
  • Given a set of sequences, find the motif shared by all or most sequences, while its starting position in each sequence is unknown
  • Assumptions:
    • The motif appears exactly once in each sequence.
    • The motif has fixed length.
Generative Model
  • Suppose the sequences are aligned; the aligned regions are then generated from a motif model.
  • The motif model is a PWM, that is, a position-specific multinomial distribution.
    • For each position i (from 1 to W), there is a multinomial distribution on amino acids, with parameters qi1, qi2, …, qi20
  • The unaligned regions are generated from a background model: p1, p2, …, p20
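To make the generative model concrete, here is a small sketch that generates one sequence with a single planted motif occurrence. It assumes a 4-letter DNA alphabet instead of the 20 amino acids, purely to keep the example short; the parameter values and the names `sample_symbol` and `generate_sequence` are hypothetical.

```python
import random

def sample_symbol(rng, probs):
    """Draw one symbol from a {symbol: probability} multinomial."""
    symbols = list(probs)
    return rng.choices(symbols, weights=[probs[s] for s in symbols], k=1)[0]

def generate_sequence(rng, length, background, pwm):
    """Generate one sequence with a single planted motif occurrence.

    Returns (sequence, start), where start is the 0-based motif position.
    """
    width = len(pwm)
    start = rng.randrange(length - width + 1)
    symbols = []
    for pos in range(length):
        if start <= pos < start + width:
            # Aligned region: column (pos - start) of the motif model q.
            symbols.append(sample_symbol(rng, pwm[pos - start]))
        else:
            # Unaligned region: background model p.
            symbols.append(sample_symbol(rng, background))
    return "".join(symbols), start

# Hypothetical toy parameters (not from the slides).
background = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
pwm = [{s: (0.85 if s == c else 0.05) for s in "ACGT"} for c in "ACGT"]

rng = random.Random(0)
sequence, start = generate_sequence(rng, 30, background, pwm)
print(sequence, start)
```

Motif finding is the inverse problem: given only the sequences, recover the start positions and the PWM.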
Notations
  • Set of symbols: the alphabet (here, the 20 amino acids), indexed by j
  • Sequences: S = {S1, S2, …, SN}
  • Starting positions of motifs: A = {a1, a2, …, aN}
  • Motif model (θ): qij = P(symbol at the i-th position = j)
  • Background model: pj = P(symbol = j)
  • Count of symbols in each column: cij = count of symbol j in the i-th column of the aligned region
Scoring Function
  • Maximize the log-odds ratio: F = Σi Σj cij log(qij / pj), summing over motif positions i = 1, …, W and symbols j
  • F is greater than zero if the data are a better match to the motif model than to the background model
Scoring function
  • A particular alignment “A” gives us the counts cij.
  • In the scoring function “F”, use the motif probabilities qij estimated from these counts.
Scoring function
  • Thus, given an alignment A, we can calculate the scoring function F
  • We need to find A that maximizes this scoring function, which is a log-odds score
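A minimal sketch of computing F for a given alignment, assuming the log-odds form F = Σi Σj cij log(qij / pj) from Lawrence et al. (1993). The estimate of qij used here (counts plus a small pseudocount) is an assumption of this sketch, as are the function names and the toy sequences.

```python
import math

def alignment_counts(sequences, starts, width, alphabet):
    """c_ij: how often symbol j occurs in column i of the aligned windows."""
    counts = [{sym: 0 for sym in alphabet} for _ in range(width)]
    for seq, start in zip(sequences, starts):
        for i in range(width):
            counts[i][seq[start + i]] += 1
    return counts

def log_odds_score(sequences, starts, width, background, pseudocount=0.5):
    """F = sum_i sum_j c_ij * log(q_ij / p_j), with q_ij smoothed by a pseudocount."""
    alphabet = list(background)
    counts = alignment_counts(sequences, starts, width, alphabet)
    total = len(sequences) + pseudocount * len(alphabet)
    score = 0.0
    for column in counts:
        for sym in alphabet:
            c = column[sym]
            if c:  # terms with c_ij = 0 contribute nothing
                q = (c + pseudocount) / total
                score += c * math.log(q / background[sym])
    return score

# Toy DNA example (hypothetical): "ACGT" occurs once in each sequence.
background = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
sequences = ["TTACGTTT", "ACGTGGGG", "CCCCACGT"]
good = log_odds_score(sequences, [2, 0, 4], 4, background)  # windows on the motif
bad = log_odds_score(sequences, [0, 0, 0], 4, background)   # arbitrary windows
print(good, bad)
```

The alignment whose windows all cover the shared pattern scores higher than the arbitrary one, which is the property the search for the best A relies on.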
Optimization and Sampling
  • To maximize a function, f(x):
    • Brute force method: try all possible x
    • Sampling method: sample x from a probability distribution p(x) proportional to f(x)
    • Idea: if xmax is the argmax of f(x), then it is also the argmax of p(x), so xmax is the single most probable value to be sampled
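The sampling idea can be tried on a toy function. The sketch below assumes a small finite set of candidates and a non-negative f (needed so that f can be normalized into a probability distribution); the function `f` and the candidate range are made up for illustration.

```python
import random
from collections import Counter

def sample_proportional(rng, candidates, f, n_samples):
    """Draw n_samples values of x with probability proportional to f(x)."""
    weights = [f(x) for x in candidates]  # must be non-negative, not all zero
    return rng.choices(candidates, weights=weights, k=n_samples)

def f(x):
    # Hypothetical objective with its maximum at x = 7.
    return 1.0 / (1.0 + (x - 7) ** 2)

rng = random.Random(0)
candidates = list(range(20))
draws = sample_proportional(rng, candidates, f, 5000)
most_common_x, _ = Counter(draws).most_common(1)[0]
print(most_common_x, max(candidates, key=f))
```

With enough draws, the most frequently sampled value coincides with the argmax found by brute force. Sampling pays off when the candidate set is too large to enumerate, as it is for alignments.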
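Markov Chain Sampling
  • To sample from a probability distribution p(x), we set up a Markov chain s.t. each state represents a value of x and, for any two states x and y, the transition probabilities T satisfy detailed balance: p(x) T(x → y) = p(y) T(y → x)
  • This would then imply that p(x) is a stationary distribution of the chain, so, for an ergodic chain, the states visited after enough steps are approximately samples from p(x)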
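A standard condition of this kind is detailed balance, p(x) T(x → y) = p(y) T(y → x), which implies that p is a stationary distribution of the chain. The sketch below checks both properties numerically on a three-state example. It builds the transition matrix with a Metropolis rule (not Gibbs sampling), chosen only because it is short; the target distribution `p` is made up.

```python
def metropolis_matrix(p):
    """Transition matrix of a Metropolis chain with target distribution p."""
    n = len(p)
    T = [[0.0] * n for _ in range(n)]
    for x in range(n):
        for y in range(n):
            if y != x:
                # Propose y uniformly among the other states,
                # then accept with probability min(1, p(y) / p(x)).
                T[x][y] = min(1.0, p[y] / p[x]) / (n - 1)
        T[x][x] = 1.0 - sum(T[x])  # stay put with the remaining probability
    return T

p = [0.2, 0.3, 0.5]  # hypothetical target distribution
T = metropolis_matrix(p)
n = len(p)

# Detailed balance: p(x) T(x -> y) == p(y) T(y -> x) for every pair of states.
balanced = all(
    abs(p[x] * T[x][y] - p[y] * T[y][x]) < 1e-12
    for x in range(n) for y in range(n)
)
# Stationarity: one step of the chain started from p stays at p.
after_one_step = [sum(p[x] * T[x][y] for x in range(n)) for y in range(n)]
stationary = all(abs(after_one_step[y] - p[y]) < 1e-12 for y in range(n))
print(balanced, stationary)
```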
Gibbs sampling to maximize F
  • Gibbs sampling is a special type of Markov chain sampling algorithm
  • Our goal is to find the optimal A = (a1,…aN)
  • The Markov chain we construct will only have transitions from A to alignments A’ that differ from A in only one of the ai
  • In round-robin order, pick one of the ai to replace
  • Consider all A’ formed by replacing ai with some other starting position ai’ in sequence Si
  • Move to one of these A’ probabilistically
  • Iterate the last three steps
Algorithm

Randomly initialize A0;
Repeat:
    (1) randomly choose a sequence z from S;
        A* = At \ az; compute θt from A*;
    (2) sample az according to P(az = x), which is proportional to Qx/Px;
        update At+1 = A* ∪ x;
Select At that maximizes F;

Qx: the probability of generating x according to θt;
Px: the probability of generating x according to the background model
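The loop above can be sketched in Python as follows. This is an illustrative reading of the pseudocode, not the authors' implementation: the pseudocount used to estimate θt, the fixed iteration count, the scoring helper `score_F` (assumed to be F = Σi Σj cij log(qij / pj)) and the toy DNA sequences are all assumptions of this sketch.

```python
import math
import random

def column_counts(sequences, starts, width, alphabet, pseudocount, skip=None):
    """Pseudocount-smoothed counts per motif column, optionally leaving one sequence out."""
    counts = [{sym: pseudocount for sym in alphabet} for _ in range(width)]
    for k, (seq, start) in enumerate(zip(sequences, starts)):
        if k == skip:
            continue
        for i in range(width):
            counts[i][seq[start + i]] += 1
    return counts

def score_F(sequences, starts, width, background, pseudocount):
    """Log-odds score F of an alignment (assumed form, see lead-in)."""
    alphabet = list(background)
    counts = column_counts(sequences, starts, width, alphabet, pseudocount)
    total = len(sequences) + pseudocount * len(alphabet)
    f = 0.0
    for column in counts:
        for sym in alphabet:
            c = column[sym] - pseudocount  # raw count c_ij
            if c > 0:
                f += c * math.log((column[sym] / total) / background[sym])
    return f

def gibbs_sampler(sequences, width, background, n_iter=500, pseudocount=0.5, seed=0):
    """Return (best F seen, corresponding start positions)."""
    rng = random.Random(seed)
    alphabet = list(background)
    # Randomly initialize A0.
    starts = [rng.randrange(len(s) - width + 1) for s in sequences]
    best = (score_F(sequences, starts, width, background, pseudocount), list(starts))
    for _ in range(n_iter):
        # (1) choose a sequence z and estimate theta_t from A* = A_t \ a_z.
        z = rng.randrange(len(sequences))
        counts = column_counts(sequences, starts, width, alphabet, pseudocount, skip=z)
        total = (len(sequences) - 1) + pseudocount * len(alphabet)
        # (2) weight every candidate start x in sequence z by Qx / Px.
        weights = []
        for x in range(len(sequences[z]) - width + 1):
            ratio = 1.0
            for i in range(width):
                sym = sequences[z][x + i]
                ratio *= (counts[i][sym] / total) / background[sym]
            weights.append(ratio)
        starts[z] = rng.choices(range(len(weights)), weights=weights, k=1)[0]
        # Keep the alignment A_t with the highest F seen so far.
        f = score_F(sequences, starts, width, background, pseudocount)
        if f > best[0]:
            best = (f, list(starts))
    return best

# Toy DNA input (hypothetical): each sequence contains "ACGTAC" once.
background = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
sequences = ["TTGACGTACCA", "ACGTACGGTTT", "CCATTACGTAC", "GTACGTACTTG"]
best_score, best_starts = gibbs_sampler(sequences, 6, background, n_iter=200, seed=1)
print(best_score, best_starts)
```

Because the sampler is stochastic, different seeds can return different alignments; in practice the run is usually repeated from several random initializations.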

Algorithm
  • Start from the current solution At
  • Choose one az to replace
  • For each candidate site x in sequence z, calculate Qx and Px: the probabilities of sampling x from the motif model and the background model, respectively
  • Among all possible candidates, choose one (say x) with probability proportional to Qx/Px
  • Set At+1 = A* ∪ x
  • Repeat

Local optima
  • The algorithm may not find the “global” or true maximum of the scoring function
  • Once “At” contains many similar substrings, others matching these will be chosen with higher probability
  • Algorithm will “get locked” into a “local optimum”
    • all neighbors have poorer scores, hence low chance of moving out of this solution
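Phase shifts
  • After every M iterations, compare the current At with the alignments obtained by shifting every aligned substring ai by the same small amount, either to the left or to the right
  • This guards against a local optimum in which the alignment is offset from the true pattern by a few positions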
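A simplified sketch of the phase-shift check: every start position is shifted by the same offset and the best-scoring alignment is kept. Two simplifications are assumed here: the best shift is picked greedily rather than sampled, and a simple column-conservation score stands in for F. All names and the toy data are hypothetical.

```python
from collections import Counter

def make_conservation_score(sequences, width):
    """Stand-in score: for each column, the count of its most frequent symbol, summed."""
    def score(starts):
        total = 0
        for i in range(width):
            column = Counter(seq[a + i] for seq, a in zip(sequences, starts))
            total += max(column.values())
        return total
    return score

def best_phase_shift(sequences, starts, width, score, max_shift=2):
    """Return (shift, shifted_starts, score) for the best shift in [-max_shift, max_shift]."""
    best = (0, list(starts), score(starts))
    for shift in range(-max_shift, max_shift + 1):
        shifted = [a + shift for a in starts]
        # Skip shifts that push any window past either end of its sequence.
        if any(a < 0 or a + width > len(seq) for a, seq in zip(shifted, sequences)):
            continue
        s = score(shifted)
        if s > best[2]:
            best = (shift, shifted, s)
    return best

# Toy DNA input: "ACGT" starts at position 2 in each sequence,
# but the current alignment is off by one (starts at 1).
sequences = ["GGACGTAC", "TCACGTGA", "CAACGTTG"]
score = make_conservation_score(sequences, 4)
print(best_phase_shift(sequences, [1, 1, 1], 4, score))  # (1, [2, 2, 2], 12)
```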
Pattern Width
  • The algorithm described so far requires the pattern width (W) as input.
  • We can modify the algorithm so that it executes for a range of plausible widths.
  • The function F is not immediately useful for this purpose, as its optimal value always increases with increasing W.
Pattern Width
  • Another function, G, based on the incomplete-data log-probability ratio, can be used.
  • Dividing G by the number of free parameters needed to specify the pattern (19W in the case of proteins) produces a statistic useful for choosing the pattern width. This quantity can be called the information per parameter.
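As a worked illustration of the normalization, the sketch below divides G by (alphabet size − 1) × W and picks the width with the largest result. The G values are invented purely to show the arithmetic; computing G itself is not shown, and the function names are hypothetical.

```python
def information_per_parameter(G, width, alphabet_size=20):
    """G divided by the number of free parameters, (alphabet_size - 1) * W."""
    return G / ((alphabet_size - 1) * width)

def choose_width(g_by_width, alphabet_size=20):
    """Pick the width whose information per parameter is largest."""
    return max(
        g_by_width,
        key=lambda w: information_per_parameter(g_by_width[w], w, alphabet_size),
    )

# Made-up G values for three candidate widths (illustration only).
g_by_width = {10: 95.0, 15: 171.0, 20: 190.0}
for w, g in g_by_width.items():
    print(w, information_per_parameter(g, w))
print(choose_width(g_by_width))  # 15
```

In this made-up example G is largest at W = 20, but the per-parameter value peaks at W = 15, which is why the normalized statistic is used instead of the raw score.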
Examples
  • The algorithm was applied to locate helix-turn-helix (HTH) motifs, which represent a large class of sequence-specific DNA-binding structures involved in numerous cases of gene regulation.
  • Detection and alignment of HTH motifs is a well-recognized problem because of their great sequence variation.
HTH Motif

[Figure: HTH motif. Panel labels: Complete Sequences, Non-site seq, Random seq]

Time complexity analysis
  • For typical protein sequences, it was found that, for a single pattern width, each input sequence needs to be sampled fewer than T = 100 times before convergence.
  • About L × W multiplications are performed in step (2) of the algorithm for a sequence of length L, since each of its roughly L candidate start positions needs W multiplications.
  • Total multiplications to execute the algorithm = T × N × Lavg × W, where N is the number of sequences and Lavg is their average length
  • Linear time complexity has been observed in applications
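A quick arithmetic check of the T × N × Lavg × W estimate. Only T = 100 comes from the slide; the values of N, Lavg and W below are hypothetical.

```python
def estimated_multiplications(T, N, L_avg, W):
    """Total multiplications: T samplings per sequence, N sequences, L_avg * W each."""
    return T * N * L_avg * W

# Hypothetical input: 30 sequences of average length 250, pattern width 18.
base = estimated_multiplications(T=100, N=30, L_avg=250, W=18)
doubled = estimated_multiplications(T=100, N=60, L_avg=250, W=18)
print(base)            # 13500000
print(doubled / base)  # 2.0
```

Doubling the number of sequences doubles the estimate, which matches the linear behaviour noted above.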
Motif finding
  • The Gibbs sampling algorithm was originally applied to find motifs in amino acid sequences
    • Protein motifs are common sequence patterns in proteins that are related to the structure and function of the protein
  • Gibbs sampling is also used extensively to find motifs in DNA sequences, e.g., transcription factor binding sites