Detecting subtle sequence signals a gibbs sampling strategy for multiple alignment
Download
1 / 32

Powerpoint slides - PowerPoint PPT Presentation


  • 538 Views
  • Updated On :

Detecting Subtle Sequence Signals: a Gibbs Sampling Strategy for Multiple Alignment. Lawrence et al. 1993. Presented By: Manish Agrawal Slides adapted from Prof Sinha’s notes. To define a motif, lets say we know where the motif starts in the sequence

loader
I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
capcha
Download Presentation

PowerPoint Slideshow about 'Powerpoint slides' - bernad


An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript
Detecting subtle sequence signals a gibbs sampling strategy for multiple alignment

Detecting Subtle Sequence Signals: a Gibbs Sampling Strategy for Multiple Alignment

Lawrence et al. 1993

Presented By: Manish Agrawal

Slides adapted from Prof Sinha’s notes.


A motif model

To define a motif, lets say we know where the motif starts in the sequence

The motif start positions in their sequences can be represented as s = (s1,s2,s3,…,st)

A motif model

Genes regulated

by same

transcription

factor


Motifs matrices and consensus

a in the sequenceG g t a c T t

C c A t a c g t

Alignment a c g t T A g t

a c g t C c A t

C c g t a c g G

_________________

A3 0 1 0 31 1 0

Matrix C24 0 0 14 0 0

G 0 1 4 0 0 0 31

T 0 0 0 5 1 0 14

_________________

Consensus A C G T A C G T

Line up the patterns by their start indexes

s = (s1, s2, …, st)

Construct “position weight matrix” with frequencies of each nucleotide in columns

Consensus nucleotide in each position has the highest frequency in column

Motifs: Matrices and Consensus


Motif finding problem simplified
Motif Finding Problem(Simplified) in the sequence

  • Given a set of sequences, find the motif shared by all or most sequences, while its starting position in each sequence is unknown

  • Assumption:

    • Each motif appears exactly once in one sequence.

    • The motif has fixed length.


Generative model
Generative Model in the sequence

  • Suppose the sequences are aligned, the aligned regions are generated from a motif model.

  • Motif model is a PWM. A PWM is a position-specific multinomial distribution.

    • For each position i (from 1 to W), a multinomial distribution on amino acids, consisting of variables qi1, qi2,…..,qi20

  • The unaligned regions are generated from a background model: p1,p2, ……, p20


Notations
Notations in the sequence

  • Set of symbols:

  • Sequences: S = {S1, S2, …, SN}

  • Starting positions of motifs: A = {a1, a2, …, aN}

  • Motif model ( ) : qij = P(symbol at the i-th position = j)

  • Background model: pj = P(symbol = j)

  • Count of symbols in each column: cij= count of symbol, j, in the i-th column in the aligned region



Scoring function
Scoring Function in the sequence

  • Maximize the log-odds ratio:

  • Is greater than zero if the data is a better match to the motif model than to the background model


Scoring function1
Scoring function in the sequence

  • A particular alignment “A” gives us the

  • counts cij.

  • In the scoring function “F”, use:


Scoring function2
Scoring function in the sequence

  • Thus, given an alignment A, we can calculate the scoring function F

  • We need to find A that maximizes this scoring function, which is a log-odds score


Optimization and sampling
Optimization and Sampling in the sequence

  • To maximize a function, f(x):

    • Brute force method: try all possible x

    • Sample method: sample x from probability distribution: p(x) ~ f(x)

    • Idea: suppose xmax is argmax of f(x), then it is also argmax of p(x), thus we have a high probability of selecting xmax


Markov chain sampling
Markov Chain Sampling in the sequence

  • To sample from a probability distribution p(x), we set up a Markov chain s.t. each state represents a value of x and for any two states, x and y, the transitional probabilities satisfy:

  • This would then imply:


Gibbs sampling to maximize f
Gibbs sampling to maximize F in the sequence

  • Gibbs sampling is a special type of Markov chain sampling algorithm

  • Our goal is to find the optimal A = (a1,…aN)

  • The Markov chain we construct will only have transitions from A to alignments A’ that differ from A in only one of the ai

  • In round-robin order, pick one of the ai to replace

  • Consider all A’ formed by replacing ai with some other starting position ai’ in sequence Si

  • Move to one of these A’ probabilistically

  • Iterate the last three steps


Algorithm
Algorithm in the sequence

Randomly initialize A0;

Repeat:

(1) randomly choose a sequence z from S;

A* = At \ az; compute θt from A*;

(2) sample az according to P(az = x), which is proportional to Qx/Px; update At+1 = A*  x;

Select At that maximizes F;

Qx: the probability of generating x according to θt;

Px: the probability of generating x according to the background model


Algorithm1
Algorithm in the sequence

Current solution At


Algorithm2
Algorithm in the sequence

Choose one az to replace


Algorithm3
Algorithm in the sequence

x

For each candidate site

xin sequence z,

calculate Qx and Px:

Probabilities of sampling

x from motif model and

background model resp.


Algorithm4
Algorithm in the sequence

x

Among all possible

candidates, choose one

(say x) with probability

proportional to Qx/Px


Algorithm5
Algorithm in the sequence

x

Set At+1 = A*  x


Algorithm6
Algorithm in the sequence

x

Repeat


Local optima
Local optima in the sequence

  • The algorithm may not find the “global” or true maximum of the scoring function

  • Once “At” contains many similar substrings, others matching these will be chosen with higher probability

  • Algorithm will “get locked” into a “local optimum”

    • all neighbors have poorer scores, hence low chance of moving out of this solution


Phase shifts
Phase shifts in the sequence

  • After every M iterations, compare the current At with alignments obtained by shifting every aligned substring ai by some amount, either to left or right


Phase shift
Phase shift in the sequence


Phase shift1
Phase shift in the sequence


Pattern width
Pattern Width in the sequence

  • The algorithm described so far requires pattern width(W) to be input.

  • We can modify the algorithm so that it executes for a range of plausible widths.

  • The function F is not immediately useful for this purpose as its optimal value always increases with increasing W.


Pattern width1
Pattern Width in the sequence

  • Another function based on the incomplete-data log-probability ratio G can be used.

  • Dividing G by the number of free parameters needed to specify the pattern (19W in the case of proteins) produced a statistic useful for choosing pattern width. This quantity can be called information per parameter.


Examples
Examples in the sequence

  • The algorithm was applied to locate helix-turn-helix (HTH) motif, which represent a large class of sequence-specific DNA binding structures involved in numerous cases of gene regulation.

  • Detection and alignment of HTH motifs is a well recognized problem because of the great sequence variation.


Hth motif
HTH Motif in the sequence

Complete Sequences

Non-site seq

Random seq



Time complexity analysis
Time complexity analysis in the sequence

  • For a typical protein sequence, it was found that, for a single pattern width, each input sequence needs to be sampled fewer than T = 100 times before convergence.

  • L*W multiplications are performed in Step2 of the algorithm.

  • Total multiplications to execute the algorithm = TNLavgW

  • Linear Time complexity has been observed in applications


Motif finding
Motif finding in the sequence

  • The Gibbs sampling algorithm was originally applied to find motifs in amino acid sequences

    • Protein motifs represent common sequence patterns in proteins, that are related to certain structure and function of the protein

  • Gibbs sampling is extensively used to find motifs in DNA sequence, i.e., transcription factor binding sites


Thank you

Thank You in the sequence


ad