
A Very Basic Gibbs Sampler for Motif Detection

Frances Tong, July 28, 2004, Southern California Bioinformatics Summer Institute


  1. A Very Basic Gibbs Sampler for Motif Detection. Frances Tong, July 28, 2004, Southern California Bioinformatics Summer Institute.

  2. What?
     • What is a motif?
     • What is the biological point?
     • What is the goal?
     • What is a limitation with dynamic programming?
     • What is Gibbs sampling?
     • What is the program going to do?
     • What is the program missing?
     • What is a much better program?

  3. What is a motif? • A motif is a sequence pattern that occurs repeatedly in a group of related DNA or protein sequences.

  4. What is the biological point?
     • Encoding the “structural motif” of a protein (DNA in exons)
     • Prediction of function (protein)
     • Protein binding sites (DNA)
     • Transcription factor binding sites

  5. What is the goal? • Perform local multiple sequence alignment to find consensus sequences (motifs)

  6. What is a limitation with dynamic programming?
     • Memory and time complexity issues
     • Size of search space = (L-W+1)^N, where L = length of a sequence, W = width of motif, N = number of sequences
     • ex: L=30, W=7, N=10: (30-7+1)^10 = 24^10 = 63403380965376 (about 6.3 x 10^13)
     • Aligning even four such sequences by dynamic programming will take a few hours (~10^4 seconds)
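The search-space formula on this slide is easy to check directly. The following is a minimal sketch; the function name `search_space_size` is made up for illustration and is not part of the original program.

```python
def search_space_size(seq_len: int, motif_width: int, num_seqs: int) -> int:
    """Number of ways to choose one motif start position in every sequence.

    Each sequence of length L has L - W + 1 possible starts for a motif of
    width W, and the choices are independent across the N sequences.
    """
    positions_per_sequence = seq_len - motif_width + 1
    return positions_per_sequence ** num_seqs


# The slide's example: L = 30, W = 7, N = 10
print(search_space_size(30, 7, 10))  # 63403380965376
```

Because the size grows exponentially in N, adding just one more 30 bp sequence multiplies the number of candidate alignments by 24.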

  7. What is Gibbs sampling?
     • Stochastic optimization method
     • Works well with local multiple alignment without gaps (motif searching)
     • Searches for the statistically most probable motifs by sampling random positions instead of going through the entire search space

  8. What is the program going to do?
     1. Ask the user for:
        • a file containing multiple DNA or protein sequences
        • the motif width
        • how many motifs are wanted
     2. Calculate the background frequencies of A, C, G, T from all the sequences, ex:
        [0.34951456310679613, 0.17799352750809061, 0.21035598705501618, 0.23300970873786409]
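The background-frequency step could be sketched as below. This is an assumption about how such a calculation might look, not the original program; the name `background_frequencies` and the toy sequences are hypothetical. The sketch counts only the four bases, so the frequencies sum to 1. The example values on the slide sum to about 0.97, which could happen if the denominator also counted other characters such as newlines, though the slides do not say.

```python
from collections import Counter


def background_frequencies(sequences, alphabet="ACGT"):
    """Fraction of each letter over all sequences, ignoring other characters."""
    counts = Counter()
    for seq in sequences:
        # Count only letters from the alphabet (skips newlines, spaces, N, ...)
        counts.update(ch for ch in seq.upper() if ch in alphabet)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no alphabet characters found in the input")
    return {ch: counts[ch] / total for ch in alphabet}


# Toy input (hypothetical); the trailing newline is ignored by the counter
toy_sequences = ["GATTACA", "ACGTACGT\n", "TTGACA"]
print(background_frequencies(toy_sequences))
```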

  9. What is the program going to do?
     3. Generate random start positions for the motif in each sequence.
        ex: 10 sequences, 30 bp in length, motif width of 7
        start = [2, 6, 9, 14, 5, 7, 20, 20, 6, 22]
        >> random.randint(0, ceiling) where ceiling = len(sequence) - width
        (randint returns an integer and includes both endpoints, so every start is a valid index)
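A minimal sketch of the random initialization, assuming 0-based start positions. The function name `random_starts` is hypothetical. It uses `random.randint`, which returns an integer and includes both endpoints, so the largest possible start, `len(sequence) - width`, still leaves room for a full-width motif.

```python
import random


def random_starts(sequences, width, rng=random):
    """Pick one random 0-based motif start position per sequence."""
    starts = []
    for seq in sequences:
        ceiling = len(seq) - width  # last start that still fits the motif
        if ceiling < 0:
            raise ValueError("sequence is shorter than the motif width")
        starts.append(rng.randint(0, ceiling))
    return starts


# The slide's setup: 10 sequences of 30 bp, motif width 7 (toy sequences)
toy_sequences = ["A" * 30] * 10
print(random_starts(toy_sequences, 7))
```

Passing a seeded `random.Random` instance as `rng` makes a run reproducible.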

  10. What is the program going to do?
      4. Construct a position-specific score matrix from all sequences except one.
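One way to build the leave-one-out matrix is sketched below. The name `build_pssm` and the `pseudocount` parameter are assumptions for illustration; the slides do not say whether the original program adds pseudocounts. A small pseudocount is used here only so that no letter ends up with probability zero. The sketch assumes uppercase sequences containing only A, C, G, T.

```python
def build_pssm(sequences, starts, width, left_out, alphabet="ACGT", pseudocount=1.0):
    """Per-column letter frequencies of the motif windows, skipping one sequence.

    Returns a list with one dict per motif column, mapping letter -> frequency.
    """
    counts = [{ch: pseudocount for ch in alphabet} for _ in range(width)]
    for i, (seq, start) in enumerate(zip(sequences, starts)):
        if i == left_out:
            continue  # this sequence is scored later, so it must not contribute
        for col, ch in enumerate(seq[start:start + width]):
            counts[col][ch] += 1
    pssm = []
    for column in counts:
        total = sum(column.values())
        pssm.append({ch: column[ch] / total for ch in alphabet})
    return pssm


# Toy example (hypothetical data): leave out the third sequence
toy_sequences = ["ACGT", "ACGA", "TTTT"]
for column in build_pssm(toy_sequences, [0, 0, 0], width=2, left_out=2):
    print(column)
```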

  11. What is the program going to do?
      5. Score the left-out sequence according to the position-specific score matrix:

  12. What is the program going to do?
      Example: Use the position-specific score matrix and background from before:
      [A: 0.34951456310679613, C: 0.17799352750809061, G: 0.21035598705501618, T: 0.23300970873786409]
      GATTACA:
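The scoring formula and the worked GATTACA example are not preserved in this transcript. The sketch below therefore assumes one common choice, a likelihood ratio: the product over motif columns of the matrix frequency divided by the background frequency. The function name `score_window` and the two-column toy matrix are hypothetical, and the original program may score differently.

```python
def score_window(window, pssm, background):
    """Likelihood-ratio score of a motif window (assumed scoring scheme).

    pssm is a list of dicts (one per column, letter -> frequency) and
    background maps each letter to its overall frequency.
    """
    score = 1.0
    for col, ch in enumerate(window):
        score *= pssm[col][ch] / background[ch]
    return score


# Hypothetical two-column matrix and a uniform background, for illustration only
toy_pssm = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
]
uniform_background = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
print(score_window("AT", toy_pssm, uniform_background))
print(score_window("CG", toy_pssm, uniform_background))
```

A score above 1 means the window looks more like the motif than like background sequence; a score below 1 means the opposite.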

  13. What is the program going to do?
      6. Randomly generate another start position of the motif for that left-out sequence.
      7. Score that sequence with its new start position.
      8. Compare this new score with its original score.
      9. If newscore >= oldscore, then jump to that new start position, else jump to that new start position with probability =
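The acceptance probability is cut off in this transcript. The sketch below assumes the usual Metropolis-style choice, newscore / oldscore, which is only a guess at what the slide showed. The name `accept_move` is hypothetical, and the scores are assumed to be positive (as with a likelihood ratio), not log scores.

```python
import random


def accept_move(old_score, new_score, rng=random):
    """Decide whether to jump to the newly proposed start position.

    Always accept an equal or better score; otherwise accept with
    probability new_score / old_score (an assumed rule, see above).
    """
    if new_score >= old_score:
        return True
    return rng.random() < new_score / old_score


# A worse candidate scoring a quarter of the old score is accepted about 25% of the time
rng = random.Random(0)
accepted = sum(accept_move(1.0, 0.25, rng) for _ in range(10000))
print(accepted / 10000)
```

Occasionally accepting a worse position is what lets the sampler leave a poor alignment instead of getting stuck in it.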

  14. What is the program going to do?
      10. Start all over again with this updated start position, with another sequence left out.
      Do this many, many times! ~1000 iterations
      Gibbs will converge to a stationary distribution of the start positions => a probable alignment of the multiple sequences
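Putting the steps together, the whole loop might look roughly like the sketch below. It is not the original program. All function names are hypothetical, and several details are assumptions: a likelihood-ratio score, a pseudocount of 1, an acceptance probability of newscore / oldscore, and leaving the sequences out in round-robin order. Input is assumed to be uppercase A, C, G, T only.

```python
import random

ALPHABET = "ACGT"


def background(seqs):
    """Overall frequency of each base across all sequences."""
    counts = {ch: 0 for ch in ALPHABET}
    for seq in seqs:
        for ch in seq:
            counts[ch] += 1
    total = sum(counts.values())
    return {ch: counts[ch] / total for ch in ALPHABET}


def pssm_without(seqs, starts, width, left_out, pseudocount=1.0):
    """Column frequencies of the current motif windows, skipping one sequence."""
    cols = [{ch: pseudocount for ch in ALPHABET} for _ in range(width)]
    for i, (seq, s) in enumerate(zip(seqs, starts)):
        if i != left_out:
            for col, ch in enumerate(seq[s:s + width]):
                cols[col][ch] += 1
    return [{ch: c[ch] / sum(c.values()) for ch in ALPHABET} for c in cols]


def score(window, pssm, bg):
    """Likelihood ratio of the window under the matrix versus the background."""
    result = 1.0
    for col, ch in enumerate(window):
        result *= pssm[col][ch] / bg[ch]
    return result


def basic_sampler(seqs, width, iterations=1000, seed=None):
    """Return one motif start position per sequence after the given iterations."""
    rng = random.Random(seed)
    bg = background(seqs)
    starts = [rng.randint(0, len(s) - width) for s in seqs]
    for it in range(iterations):
        i = it % len(seqs)  # assumed: leave out each sequence in turn
        pssm = pssm_without(seqs, starts, width, i)
        old = score(seqs[i][starts[i]:starts[i] + width], pssm, bg)
        candidate = rng.randint(0, len(seqs[i]) - width)
        new = score(seqs[i][candidate:candidate + width], pssm, bg)
        # Accept improvements always, worse moves with probability new / old
        if new >= old or rng.random() < new / old:
            starts[i] = candidate
    return starts


# Toy sequences (hypothetical), each containing GATTACA somewhere
toy_sequences = ["CCGATTACATG", "TGATTACACCA", "AAGGATTACAT", "GATTACATTCC"]
print(basic_sampler(toy_sequences, width=7, iterations=1000, seed=0))
```

Because the method is stochastic, different seeds can end at different start positions; running it several times and comparing the results is a cheap sanity check.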

  15. What is the program missing?
      • Doesn’t do reinitializations in the middle to get out of local maxima
      • Doesn’t optimize the width (you have to specify the width explicitly)
      • Doesn’t do the Bayesian approach, just the frequentist one (easier for me and for you to understand!)
      • Doesn’t read in FASTA files
      • Doesn’t do error checking!
      • And other things that I don’t know are missing yet!

  16. What is a much better program?
      • Gibbs Motif Sampler: http://bayesweb.wadsworth.org/gibbs/gibbs.html
      • AlignACE: http://atlas.med.harvard.edu/cgi-bin/alignace.pl

  17. That’s it!
