Motif identification with Gibbs Sampler



Xuhua Xia

[email protected]

http://dambe.bio.uottawa.ca


Background

  • Named after Josiah Willard Gibbs (February 11, 1839 – April 28, 1903), winner of the Copley Medal of the Royal Society of London in 1901.

  • One of the Markov chain Monte Carlo (MCMC) algorithms

  • Biological applications

    • Identification of regulatory sequences of genes (Aerts et al., 2005; Coessens et al., 2003; Lawrence et al., 1993; Qin et al., 2003; Thijs et al., 2001; Thijs et al., 2002a; Thijs et al., 2002b; Thompson et al., 2004; Thompson et al., 2003) and functional motifs in proteins (Mannella et al., 1996; Neuwald et al., 1995; Qu et al., 1998)

    • Classification of biological images (Samso et al., 2002)

    • Pairwise sequence alignment (Zhu et al., 1998) and multiple sequence alignment (Holmes and Bruno, 2001; Jensen and Hein, 2005).


Motif Identification by Gibbs sampler

Other outputs of the Gibbs sampler:

  • A position weight matrix (PWM) that can be used to scan other sequences for motifs, together with the associated significance tests

  • Position weight matrix scores (PWMS) for the identified motifs


Gibbs sampler in motif finding

  • Site sampler

  • Motif sampler


Algorithm details: Initialization

              1         2         3         4
     1234567890123456789012345678901234567890123
S1   TCAGAACCAGTTATAAATTTATCATTTCCTTCTCCACTCCT
S2   CCCACGCAGCCGCCCTCCTCCCCGGTCACTGACTGGTCCTG
S3   TCGACCCTCTGAACCTATCAGGGACCACAGTCAGCCAGGCAAG
S4   AAAACACTTGAGGGAGCAGATAACTGGGCCAACCATGACTC
S5   GGGTGAATGGTACTGCTGATTACAACCTCTGGTGCTGC..
...
S11  CATGCCCTCAAGTGTGCAGATTGGTCACAGCATTTCAAGG..
...

FA: 325, FC: 316, FG: 267, FT: 301, Sum: 1209

Randomly choose a motif start Ai in each sequence.

Table 7-1. Site-specific distribution of nucleotides from the 29 random motifs of length 6. The second column lists the distribution of nucleotides outside the 29 random motifs.
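
The initialization step can be sketched in Python as follows. This is a minimal illustration, not the author's implementation: it assumes 0-based start positions, uses only the four sequences shown in full above, and the names `initialize`, `site_counts` and `background` are hypothetical.

```python
import random
from collections import Counter

def initialize(seqs, motif_len, rng):
    """Pick a random motif start in each sequence and tally nucleotide counts."""
    starts = [rng.randrange(len(s) - motif_len + 1) for s in seqs]
    site_counts = [Counter() for _ in range(motif_len)]  # one tally per motif site
    background = Counter()                               # nucleotides outside the motifs
    for s, a in zip(seqs, starts):
        for j in range(motif_len):
            site_counts[j][s[a + j]] += 1
        background.update(s[:a] + s[a + motif_len:])
    return starts, site_counts, background

seqs = [
    "TCAGAACCAGTTATAAATTTATCATTTCCTTCTCCACTCCT",
    "CCCACGCAGCCGCCCTCCTCCCCGGTCACTGACTGGTCCTG",
    "TCGACCCTCTGAACCTATCAGGGACCACAGTCAGCCAGGCAAG",
    "AAAACACTTGAGGGAGCAGATAACTGGGCCAACCATGACTC",
]
starts, site_counts, background = initialize(seqs, 6, random.Random(0))
print(starts)
print([dict(c) for c in site_counts])
print(dict(background))
```

With all 29 sequences, `site_counts` would correspond to the site columns of Table 7-1 and `background` to its second column.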


Algorithm details: Predictive update

S11 CATGCCCTCAAGTGTGCAGATTGGTCACAGCATTTCAAGG


Predictive update: Frequencies

Table 7-3. Site-specific distribution of nucleotide frequencies derived from data in Table 7-2, with a pseudocount weight of 0.0001. The second column lists the distribution of nucleotide frequencies outside the 28 random motifs.


Predictive update: PWM

S11 CATGCCCTCAAGTGTGCAGATTGGTCACAGCATTTCAAGG

Odds ratio for CATGCC: $e^{-0.9113 - 0.0693 + 0.1731 - 0.4469 - 0.2228 - 0.4042} \approx 0.153$
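
As a quick numerical check, the odds ratio is the exponential of the summed PWM entries for C, A, T, G, C, C at sites 1 to 6. A minimal Python sketch, using the six values exactly as printed on the slide (the variable names are illustrative):

```python
import math

# PWM entries (natural-log odds) for C, A, T, G, C, C at sites 1-6, as printed.
log_odds = [-0.9113, -0.0693, 0.1731, -0.4469, -0.2228, -0.4042]

# Adding logs multiplies the per-site odds, so exponentiating the sum
# gives the odds ratio for the whole 6-mer.
odds_ratio = math.exp(sum(log_odds))
print(f"{odds_ratio:.4f}")
```

With the four-decimal entries as printed, the result is about 0.152; the slide's 0.153 presumably reflects unrounded PWM entries.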


Predictive update

S11 CATGCCCTCAAGTGTGCAGATTGGTCACAGCATTTCAAGG

Table 7-4. Possible locations of the 6-mer motif along S11, together with the corresponding motifs and their position weight matrix scores expressed as odds ratios. The last column lists the odds ratios normalized to have a sum of 1.

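Table callouts:

  • Last column: odds ratios scaled to sum to 1.

  • Pick the location with the largest odds ratio, update the Ai value, and generate a new frequency matrix and a new PWM.

  • Marked rows: the originally picked location, and the new location that replaces it because it has the largest odds ratio.

  • Number of possible locations along S11: 40 – 6 + 1 = 35.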
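
The scan over S11 can be sketched in Python as below. The PWM behind Table 7-4 is not reproduced in this transcript, so the sketch borrows the final PWM reported later in this presentation (which uses U in place of T); the function name `scan` and the 0-based indexing are assumptions made for illustration.

```python
import math

# Final PWM from the final report (rows = motif sites 1-6), used here as a stand-in.
PWM = [
    {"A": -0.93304, "C": 0.31199, "G": -5.57384, "U": 0.86909},
    {"A": -5.46894, "C": -5.54337, "G": 0.13202, "U": 1.20499},
    {"A": 1.00364, "C": -5.54337, "G": 0.13202, "U": -5.34419},
    {"A": -5.46894, "C": -5.54337, "G": -5.57384, "U": 1.52737},
    {"A": 0.26340, "C": 0.80335, "G": -5.57384, "U": -1.81131},
    {"A": 0.79269, "C": -5.54337, "G": -1.92440, "U": 0.55966},
]

def scan(seq, pwm):
    """Return the odds ratio of every window, plus the values scaled to sum to 1."""
    width = len(pwm)
    odds = [math.exp(sum(pwm[j][seq[a + j]] for j in range(width)))
            for a in range(len(seq) - width + 1)]
    total = sum(odds)
    return odds, [o / total for o in odds]

s11 = "CATGCCCTCAAGTGTGCAGATTGGTCACAGCATTTCAAGG".replace("T", "U")
odds, normalized = scan(s11, PWM)

# Following the slide, keep the location with the largest odds ratio.
best = max(range(len(odds)), key=odds.__getitem__)
print(len(odds), best, s11[best:best + 6], f"{odds[best]:.4f}")
```

With this stand-in PWM the scan covers 35 windows and selects UGGUCA at 0-based offset 21, with an odds ratio close to the 70.5663 listed for S11 in the motif scores.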


Algorithm details: Predictive update

S11 CATGCCCTCAAGTGTGCAGATTGGTCACAGCATTTCAAGG

A New PWM

Scan another sequence



F as a criterion

  • Once all sequences are updated and a new set of Ai values is obtained, compute F.

  • Update all the sequences again to obtain a new set of Ai and a new F. If the new F is greater than the old F, replace the old set of Ai values with the new set. Repeat until F no longer increases or the maximum number of local iterations is reached.

  • This (from initialization to this slide) completes one global cycle of iteration.

  • Repeat a number of global cycles until F does not increase.



Summary of the algorithms

  • To find a motif of length L from a set of N sequences, randomly pick an L-mer from each sequence.

  • From the N L-mers, produce a PWM.

  • Randomly pick a sequence and scan along it with the PWM to obtain a PWM score (PWMS) for each L-mer in the sequence.

  • Use the L-mer with the highest PWMS to update the PWM.

  • Repeat this scanning and updating until all sequences have been used.

  • Calculate F1

  • Repeat the entire process and calculate F2.

  • Continue the process until Fi does not increase any more.

  • Output

    • The final PWM, as well as the PWMS for each sequence

    • The aligned motifs

    • Associated statistics
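
The steps above can be sketched as one global cycle in Python. This is a hedged illustration, not the author's program. Several parts are assumptions: F is taken as the sum over sites and nucleotides of count × ln(Qij/Q0), the form used by Lawrence et al. (1993), since the slides' own expression for F is not shown here; the PWM for a sequence is built from the other sequences' L-mers; and a flat pseudocount `PSEUDO` replaces the slides' weighting scheme. All function names and the toy sequences are hypothetical.

```python
import math
import random
from collections import Counter

ALPHABET = "ACGT"
PSEUDO = 0.01  # assumed pseudocount; keeps every log-odds entry finite

def build_pwm(seqs, starts, L, skip=None):
    """Site counts and log-odds PWM from the current L-mers, leaving out `skip`."""
    used = [(s, a) for i, (s, a) in enumerate(zip(seqs, starts)) if i != skip]
    site = [Counter(s[a + j] for s, a in used) for j in range(L)]
    bg = Counter()
    for s, a in used:
        bg.update(s[:a] + s[a + L:])  # nucleotides outside the motifs
    bg_total = sum(bg.values()) + 4 * PSEUDO
    n = len(used) + 4 * PSEUDO
    pwm = [{b: math.log(((site[j][b] + PSEUDO) / n) / ((bg[b] + PSEUDO) / bg_total))
            for b in ALPHABET} for j in range(L)]
    return site, pwm

def criterion(seqs, starts, L):
    """Assumed form of F: sum over sites and nucleotides of count * log-odds."""
    site, pwm = build_pwm(seqs, starts, L)
    return sum(site[j][b] * pwm[j][b] for j in range(L) for b in ALPHABET)

def update_all(seqs, starts, L, rng):
    """Visit sequences in random order; move each motif to its best-scoring L-mer."""
    starts = list(starts)
    order = list(range(len(seqs)))
    rng.shuffle(order)
    for i in order:
        _, pwm = build_pwm(seqs, starts, L, skip=i)
        s = seqs[i]
        # exp() is monotonic, so the largest log-odds sum has the largest PWMS.
        starts[i] = max(range(len(s) - L + 1),
                        key=lambda a: sum(pwm[j][s[a + j]] for j in range(L)))
    return starts

def gibbs_cycle(seqs, L, rng, max_local=50):
    """One global cycle: random start, then local updates while F increases."""
    starts = [rng.randrange(len(s) - L + 1) for s in seqs]
    best_f = criterion(seqs, starts, L)
    for _ in range(max_local):
        candidate = update_all(seqs, starts, L, rng)
        f = criterion(seqs, candidate, L)
        if f <= best_f:
            break
        starts, best_f = candidate, f
    return starts, best_f

# Toy input with TTATCA planted in each sequence.
seqs = ["ACGTTATCAGGC", "TTATCACCGTGA", "GGCATTATCAAC", "CATGTTATCAGT"]
starts, f = gibbs_cycle(seqs, 6, random.Random(1))
print(starts, [s[a:a + 6] for s, a in zip(seqs, starts)], round(f, 3))
```

Running `gibbs_cycle` several times with different seeds and keeping the result with the largest F corresponds to repeating global cycles; a single cycle can stop at a local optimum.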


Final Report: Final Frequency

Final site-specific counts:

Site   A    C    G    U
1      3   11    0   15
2      0    0    8   21
3     21    0    8    0
4      0    0    0   29
5     10   18    0    1
6     17    0    1   11

Final site-specific frequencies:

Site  A        C        G        U
1     0.10413  0.37882  0.00092  0.51613
2     0.00112  0.00109  0.27563  0.72217
3     0.72225  0.00109  0.27563  0.00103
4     0.00112  0.00109  0.00092  0.99688
5     0.34451  0.61920  0.00092  0.03537
6     0.58489  0.00109  0.03526  0.37877

Final PWM [ln(Qij/Q0)]:

Site   A         C         G         U
1     -0.93304   0.31199  -5.57384   0.86909
2     -5.46894  -5.54337   0.13202   1.20499
3      1.00364  -5.54337   0.13202  -5.34419
4     -5.46894  -5.54337  -5.57384   1.52737
5      0.26340   0.80335  -5.57384  -1.81131
6      0.79269  -5.54337  -1.92440   0.55966
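
The three tables above are linked by a short calculation, sketched here in Python. The formulas are inferred rather than stated in the slides: adding a pseudocount of 0.0001 times each nucleotide's overall count (FA = 325, FC = 316, FG = 267, with FT = 301 read as U) reproduces the printed frequencies, and taking Q0 from the nucleotides outside the 29 motifs reproduces the printed PWM. The variable names are illustrative.

```python
import math

ALPHA = 0.0001                                     # pseudocount weight
TOTAL = {"A": 325, "C": 316, "G": 267, "U": 301}   # overall counts, T read as U
COUNTS = [                                         # final site-specific counts
    {"A": 3, "C": 11, "G": 0, "U": 15},
    {"A": 0, "C": 0, "G": 8, "U": 21},
    {"A": 21, "C": 0, "G": 8, "U": 0},
    {"A": 0, "C": 0, "G": 0, "U": 29},
    {"A": 10, "C": 18, "G": 0, "U": 1},
    {"A": 17, "C": 0, "G": 1, "U": 11},
]

# Site frequencies Qij: counts plus a small pseudocount, renormalized.
n_motifs = sum(COUNTS[0].values())
denom = n_motifs + ALPHA * sum(TOTAL.values())
freqs = [{b: (row[b] + ALPHA * TOTAL[b]) / denom for b in TOTAL} for row in COUNTS]

# Background frequencies Q0: nucleotides outside the motifs.
outside = {b: TOTAL[b] - sum(row[b] for row in COUNTS) for b in TOTAL}
n_outside = sum(outside.values())
q0 = {b: outside[b] / n_outside for b in TOTAL}

# PWM entries: natural log of site frequency over background frequency.
pwm = [{b: math.log(row[b] / q0[b]) for b in TOTAL} for row in freqs]

for site, (f, w) in enumerate(zip(freqs, pwm), start=1):
    print(site,
          " ".join(f"{f[b]:.5f}" for b in "ACGU"),
          " ".join(f"{w[b]:8.5f}" for b in "ACGU"))
```

The printed rows can be compared against the frequency and PWM tables; small differences in the last decimal place would point to a slightly different pseudocount scheme in the original program.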


Motif alignment

Seq V V

1 UCAGAACCAGUUAUAAAUUUAUCAUUUCCUUCUCCACUCCU

2 CCCACGCAGCCGCCCUCCUCCCCGGUCACUGACUGGUCCUG

3 UCGACCCUCUGAACCUAUCAGGGACCACAGUCAGCCAGGCAAG

4 AAAACACUUGAGGGAGCAGAUAACUGGGCCAACCAUGACUC

5 GGGUGAAUGGUACUGCUGAUUACAACCUCUGGUGCUGC

6 AGCCUAGAGUGAUGACUCCUAUCUGGGUCCCCAGCAGGA

7 GCCUCAGGAUCCAGCACACAUUAUCACAAACUUAGUGUCCA

8 CAUUAUCACAAACUUAGUGUCCAUCCAUCACUGCUGACCCU

9 UCGGAACAAGGCAAAGGCUAUAAAAAAAAUUAAGCAGC

10 GCCCCUUCCCCACACUAUCUCAAUGCAAAUAUCUGUCUGAAACGGUUCC

11 CAUGCCCUCAAGUGUGCAGAUUGGUCACAGCAUUUCAAGG

12 GAUUGGUCACAGCAUUUCAAGGGAGAGACCUCAUUGUAAG

13 UCCCCAACUCCCAACUGACCUUAUCUGUGGGGGAGGCUUUUGA

14 CCUUAUCUGUGGGGGAGGCUUUUGAAAAGUAAUUAGGUUUAGC

15 AUUAUUUUCCUUAUCAGAAGCAGAGAGACAAGCCAUUUCUCUUUCCUCCC

23 GAAAAAAAAUAAAUGAAGUCUGCCUAUCUCCGGGCCAGAGCCCCU

24 UGCCUUGUCUGUUGUAGAUAAUGAAUCUAUCCUCCAGUGACU

25 GGCCAGGCUGAUGGGCCUUAUCUCUUUACCCACCUGGCUGU

26 CAACAGCAGGUCCUACUAUCGCCUCCCUCUAGUCUCUG

27 CCAACCGUUAAUGCUAGAGUUAUCACUUUCUGUUAUCAAGUGGCUUCAGC

28 GGGAGGGUGGGGCCCCUAUCUCUCCUAGACUCUGUG

29 CUUUGUCACUGGAUCUGAUAAGAAACACCACCCCUGC


Motif scores

SeqName  Motif   Start  PWMS
S1       UUAUCA  18     493.3101
S2       CGGUCA  22      40.4251
S3       CUAUCA  14     282.6008
S4       AGAUAA  17      16.2174
S5       UGAUUA  16      12.3482
S6       CUAUCU  18     223.8567
S7       UUAUCA  20     493.3101
S8       UUAUCA   2     493.3101
S9       CUAUAA  17     164.6933
S10      CUAUCU  14     223.8567
S11      UGGUCA  21      70.5663
S12      UUGUAA  33     120.2498
S13      UUAUCU  20     390.7660
S14      UUAUCU   2     390.7660
S15      UUAUCA  10     493.3101
...      ...     ...    ...
S27      UUAUCA  19     493.3101
S28      CUAUCU  15     223.8567
S29      UUGUCA   2     206.3393
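
The Start column appears to be a 0-based offset into each sequence of the motif alignment. The short Python check below, a sketch with a hypothetical helper `motif_at`, slices five of the listed sequences at their reported Start and compares the result with the reported motif.

```python
ROWS = [
    # (name, sequence from the motif alignment, reported motif, reported Start)
    ("S1", "UCAGAACCAGUUAUAAAUUUAUCAUUUCCUUCUCCACUCCU", "UUAUCA", 18),
    ("S2", "CCCACGCAGCCGCCCUCCUCCCCGGUCACUGACUGGUCCUG", "CGGUCA", 22),
    ("S3", "UCGACCCUCUGAACCUAUCAGGGACCACAGUCAGCCAGGCAAG", "CUAUCA", 14),
    ("S4", "AAAACACUUGAGGGAGCAGAUAACUGGGCCAACCAUGACUC", "AGAUAA", 17),
    ("S8", "CAUUAUCACAAACUUAGUGUCCAUCCAUCACUGCUGACCCU", "UUAUCA", 2),
]

def motif_at(seq, start, length=6):
    """Return the motif that begins at a 0-based start offset."""
    return seq[start:start + length]

matches = [motif_at(seq, start) == motif for _, seq, motif, start in ROWS]
print(matches)  # → [True, True, True, True, True]
```

Under a 1-based reading the same slices would be shifted by one nucleotide and would not match the listed motifs.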


Motif sampler output

