
Motif finding

CS 466, Saurabh Sinha


Presentation Transcript


  1. Motif finding CS 466 Saurabh Sinha

  2. DNA and Proteins www.ornl.gov/.../slides/ images/01-0037low.jpg

  3. Genes to proteins: “transcription” [Figure labels: POL = RNA Polymerase. Source: http://instruct.westvalley.edu/svensson/CellsandGenes/Transcription%5B1%5D.gif]

  4. Regulation of gene activity • More frequent transcription ⇒ more mRNA ⇒ more protein. [Figure labels: GENE, GENE]

  5. Cell states defined by gene activity [Figure labels: Genes 1&2; Genes 1&4; Gene 4; Genes 1&3]

  6. Cell states defined by gene activity [Figure labels: disease onset; young; heart; progressed disease; old; nerve; adult; embryo]

  7. How is the cell state specified? • In other words, how is a specific set of genes turned on at a precise time and in a precise cell?

  8. Gene regulation by “transcription factors” [Figure labels: TF, POL, GENE]

  9. Transcription factors: may activate … or repress [Figure labels: TF, repeated]

  10. Gene regulation is encoded in the DNA • It should be possible to predict where transcription factors bind by reading the DNA sequence [Figure labels: TF, POL, GENE, BINDING SITE; sequence shown: TCTACGTG]

  11. Regulatory networks • Genes are switches, transcription factors are (one type of) input signals, proteins are outputs • Proteins (outputs) may be transcription factors and hence become signals for other genes (switches) • This may be the reason why humans have so few genes (the circuit, not the number of switches, carries the complexity)

  12. Decoding the regulatory network • Find patterns (“motifs”) in DNA sequence that occur more often than expected by chance • These are likely to be binding sites for transcription factors • Knowing these can tell us if a gene is regulated by a transcription factor (i.e., the “switch”)

  13. A motif model [Figure: genes regulated by the same transcription factor] • To define a motif, let's say we know where the motif starts in each sequence • The motif start positions in their sequences can be represented as s = (s1, s2, s3, …, st)

  14. Motifs: Matrices and Consensus • Line up the patterns by their start indexes s = (s1, s2, …, st) • Construct the “profile matrix” with the frequencies of each nucleotide in each column • The consensus nucleotide in each position has the highest frequency in its column

                     a G g t a c T t
                     C c A t a c g t
          Alignment  a c g t T A g t
                     a c g t C c A t
                     C c g t a c g G
                     _______________
                  A  3 0 1 0 3 1 1 0
          Profile C  2 4 0 0 1 4 0 0
          Matrix  G  0 1 4 0 0 0 3 1
                  T  0 0 0 5 1 0 1 4
                     _______________
          Consensus  A C G T A C G T
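The construction on this slide can be sketched in a few lines of Python. This is a minimal illustration, not code from the lecture: the function names `profile_matrix` and `consensus` are made up here, and ties in a column are broken in favor of the earlier base in the order A, C, G, T, which the slide does not specify.

```python
BASES = "ACGT"

def profile_matrix(sequences, starts, length):
    """Count each base per column of the alignment defined by the start positions."""
    sites = [seq[s:s + length].upper() for seq, s in zip(sequences, starts)]
    profile = {base: [0] * length for base in BASES}
    for site in sites:
        for column, base in enumerate(site):
            profile[base][column] += 1
    return profile

def consensus(profile):
    """Most frequent base per column (assumed tie-break: earlier base in BASES)."""
    length = len(profile["A"])
    return "".join(max(BASES, key=lambda b: profile[b][col]) for col in range(length))

# The five aligned sites from the slide. Every start position is 0 here
# because the sites are given already cut out of their sequences.
sites = ["aGgtacTt", "CcAtacgt", "acgtTAgt", "acgtCcAt", "CcgtacgG"]
profile = profile_matrix(sites, [0] * len(sites), 8)
for base in BASES:
    print(base, profile[base])
print(consensus(profile))
```

With full-length sequences, the same functions take the actual start positions s = (s1, …, st) instead of zeros.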

  15. Position weight matrices • Suppose there were t sequences to begin with • Consider a column of a profile matrix • The column may be (t, 0, 0, 0) • A perfectly conserved column • The column may be (t/4, t/4, t/4, t/4) • A completely uniform column • “Good” profile matrices should have more conserved columns

  16. Information Content • First convert a “profile matrix” to a “position weight matrix” or PWM • Convert frequencies to probabilities • PWM W: $W_{\beta k}$ = frequency of base $\beta$ at position $k$ • $q_\beta$ = frequency of base $\beta$ by chance • Information content of W: $\sum_{k} \sum_{\beta \in \{A,C,G,T\}} W_{\beta k} \log \frac{W_{\beta k}}{q_\beta}$

  17. Information Content • If $W_{\beta k}$ is always equal to $q_\beta$, i.e., if W is similar to random sequence, the information content of W is 0. • If W is different from q, the information content is high.
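As a rough sketch of the two properties above, the following Python assumes the information content is the sum, over positions, of the relative entropy between each PWM column and the background distribution. That definition is consistent with the properties stated on this slide, but the logarithm base (2, giving bits), the convention that zero-probability terms contribute 0, and the name `information_content` are assumptions made here.

```python
import math

BASES = "ACGT"

def information_content(pwm, background):
    """Sum over positions k and bases b of W[b][k] * log2(W[b][k] / q[b]).

    Terms with W[b][k] == 0 are skipped (assumed convention: 0 * log 0 = 0).
    """
    length = len(pwm["A"])
    total = 0.0
    for k in range(length):
        for b in BASES:
            w = pwm[b][k]
            if w > 0:
                total += w * math.log2(w / background[b])
    return total

uniform = {b: 0.25 for b in BASES}

# Column 0 is perfectly conserved (always A); column 1 matches the background.
pwm = {
    "A": [1.0, 0.25],
    "C": [0.0, 0.25],
    "G": [0.0, 0.25],
    "T": [0.0, 0.25],
}
print(information_content(pwm, uniform))
```

Under these assumptions the conserved column carries all of the score and the background-like column contributes nothing.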

  18. Motif Finding Problem • Given a set of sequences and a “motif length” parameter • Goal is to find the starting positions s=(s1,…st) of l-length substrings, one in each given sequence, so as to maximize Score(s) = the information content of the resulting PWM

  19. Greedy Motif Search • Try every pair of l-mers, one from sequence 1 and one from sequence 2; for each pair, form the 2 × l alignment matrix and compute Score(s); keep the pair with the highest score. • Iteratively add one l-mer from each of the other (t−2) sequences • At each of the following t−2 iterations, find a “best” l-mer in sequence i. That is, try each l-mer in sequence i, add it to the profile matrix, compute the score, and pick the l-mer that leads to the highest score. • Sacrifices the optimal solution for speed: in fact, the bulk of the time is spent locating the first 2 l-mers
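The greedy procedure on this slide might look roughly like the sketch below. It is an illustration under stated assumptions, not the CONSENSUS program: `score` uses information content in bits against a uniform background, ties are broken by taking the first best candidate, and the toy sequences are invented for the example.

```python
import math
from itertools import product

BASES = "ACGT"

def score(sites):
    """Information content (bits) of the sites' PWM; uniform background assumed."""
    t = len(sites)
    total = 0.0
    for column in zip(*sites):
        for b in BASES:
            w = column.count(b) / t
            if w > 0:
                total += w * math.log2(w / 0.25)
    return total

def greedy_motif_search(sequences, l):
    first, second = sequences[0], sequences[1]
    # Step 1: exhaustively score every pair of l-mers from sequences 1 and 2.
    pairs = product(range(len(first) - l + 1), range(len(second) - l + 1))
    i, j = max(pairs, key=lambda p: score([first[p[0]:p[0] + l],
                                           second[p[1]:p[1] + l]]))
    starts = [i, j]
    sites = [first[i:i + l], second[j:j + l]]
    # Step 2: for each remaining sequence, add the l-mer that maximizes the score.
    for seq in sequences[2:]:
        best = max(range(len(seq) - l + 1),
                   key=lambda x: score(sites + [seq[x:x + l]]))
        starts.append(best)
        sites.append(seq[best:best + l])
    return starts, score(sites)

# Hypothetical toy input with the 4-mer ACGT planted in every sequence.
sequences = ["GGGGACGTGG", "TTACGTTTTT", "CCCCCACGTC", "AAACGTAAAA"]
starts, best_score = greedy_motif_search(sequences, 4)
print(starts, best_score)
```

Step 1 scores every pair of positions in the first two sequences, which is why it dominates the running time, while each later step scans only one sequence.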

  20. Greedy Motif Search • Try different motif lengths • Use statistical criteria to evaluate significance of Information Content • At each step, instead of choosing the top (1) partial motif, keep the top k partial motifs • “Beam search” • The program “CONSENSUS” from Stormo lab. • Further Reading: Hertz, Hartzell & Stormo, CABIOS (1990) http://bioinf.kvl.dk/~gorodkin/teach/bioinf2004/hertz90.pdf

  21. Gibbs sampling for motif finding • This is an alternative to the greedy algorithm • It is a randomized algorithm: uses coin flips to proceed; may not do the same thing on every run • Like Greedy Search, it is not guaranteed to find optimal solution

  22. Optimization and Sampling • To maximize a function, f(x): • Brute force method: try all possible x • Sample method: sample x from probability distribution: p(x) ~ f(x) • Idea: suppose xmax is argmax of f(x), then it is also argmax of p(x), thus we have a high probability of selecting xmax
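A small Python sketch can illustrate the sampling idea. The function `f`, the candidate set, and the helper name `sample_proportional` are all made up for this example; the only point is that when x is drawn with probability proportional to f(x), the argmax of f is also the most likely draw.

```python
import random
from collections import Counter

def sample_proportional(candidates, f, rng):
    """Draw one x with probability p(x) = f(x) / (sum of f over all candidates)."""
    weights = [f(x) for x in candidates]
    return rng.choices(candidates, weights=weights, k=1)[0]

def f(x):
    # Hypothetical positive function with a single peak at x = 6.
    return 2.0 ** (-(x - 6) ** 2)

candidates = list(range(10))

# Brute force: evaluate f at every candidate.
x_max = max(candidates, key=f)

# Sampling: draw repeatedly and see which value comes up most often.
rng = random.Random(0)
draws = Counter(sample_proportional(candidates, f, rng) for _ in range(2000))
most_sampled = draws.most_common(1)[0][0]
print(x_max, most_sampled)
```

Here brute force is trivial because there are only ten candidates; sampling becomes attractive when the space of x is too large to enumerate, as with all combinations of start positions.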

  23. Motif Finding Problem • Given a set of sequences and a “motif length” parameter • Goal is to find the starting positions s=(s1,…st) of l-length substrings, one in each given sequence, so as to maximize Score(s) = the information content of the resulting PWM

  24. Gibbs sampling algorithm for motif finding • Current solution s(t)

  25. Algorithm • Choose one sz ∈ s(t) to replace (the site in sequence z)

  26. Algorithm • Construct a PWM θ(t) from s(t) − {sz}

  27. Algorithm • For each candidate site x in sequence z, calculate Qx and Px: Qx: the probability of generating x according to θ(t); Px: the probability of generating x according to the background model

  28. Algorithm • Among all possible candidates, choose one (say x) with probability proportional to Qx/Px

  29. Algorithm • Set s(t+1) = s(t) − {sz} ∪ {x}

  30. Algorithm • Repeat

  31. Algorithm • Each iteration is a “sample”. Obtain a large number of samples and report the best one.

  32. Qx and Px • Qx: the probability of generating x according to θ(t) • What is θ? • For each position i, a probability distribution on (A,C,G,T): θiA, θiC, θiG, θiT • Px: the probability of generating x according to the background model • What is the background model? • A probability distribution on (A,C,G,T): pA, pC, pG, pT

  33. Gibbs sampling review • Gibbs sampling is a special type of Markov chain sampling algorithm • Our goal is to find the optimal s = (s1,…st) • The Markov chain we construct only has transitions from s to alignments s’ that differ from s in only one of the si • In round-robin order, pick one of the si to replace • Consider all s’ formed by replacing si with some other starting position si’ in sequence i • Move to one of these s’ probabilistically • Iterate the last three steps
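The steps reviewed above can be put together as a compact Python sketch. Several details are assumptions not stated in the slides: a pseudocount of 1 when building the PWM (so that no candidate gets probability zero), a uniform background by default, a random initial solution, and information content in bits as the score used to pick the best sample. The function names and toy sequences are hypothetical.

```python
import math
import random

BASES = "ACGT"

def build_pwm(sites, pseudocount=1.0):
    """theta: one distribution over A, C, G, T per motif position (pseudocount assumed)."""
    total = len(sites) + 4 * pseudocount
    return [{b: (col.count(b) + pseudocount) / total for b in BASES}
            for col in zip(*sites)]

def score(sites):
    """Information content (bits) of the sites' PWM against a uniform background."""
    t = len(sites)
    total = 0.0
    for col in zip(*sites):
        for b in BASES:
            w = col.count(b) / t
            if w > 0:
                total += w * math.log2(w / 0.25)
    return total

def gibbs_motif_search(sequences, l, iterations, rng, background=None):
    background = background or {b: 0.25 for b in BASES}
    # Random initial solution s(0): one start position per sequence.
    starts = [rng.randrange(len(seq) - l + 1) for seq in sequences]
    best_starts = list(starts)
    best_score = score([seq[s:s + l] for seq, s in zip(sequences, starts)])
    for step in range(iterations):
        z = step % len(sequences)  # round-robin choice of the site to replace
        others = [seq[s:s + l]
                  for i, (seq, s) in enumerate(zip(sequences, starts)) if i != z]
        pwm = build_pwm(others)  # PWM from the current solution minus sequence z
        candidates = list(range(len(sequences[z]) - l + 1))
        weights = []
        for x in candidates:
            site = sequences[z][x:x + l]
            q_x = math.prod(pwm[i][b] for i, b in enumerate(site))  # Qx
            p_x = math.prod(background[b] for b in site)            # Px
            weights.append(q_x / p_x)
        # Move probabilistically: pick x with probability proportional to Qx/Px.
        starts[z] = rng.choices(candidates, weights=weights, k=1)[0]
        current = score([seq[s:s + l] for seq, s in zip(sequences, starts)])
        if current > best_score:
            best_starts, best_score = list(starts), current
    return best_starts, best_score

# Hypothetical toy input with the 4-mer ACGT planted in every sequence.
sequences = ["GGGGACGTGG", "TTACGTTTTT", "CCCCCACGTC", "AAACGTAAAA"]
best_starts, best_score = gibbs_motif_search(sequences, 4, 200, random.Random(0))
print(best_starts, best_score)
```

Because each move is drawn at random rather than chosen greedily, different seeds can return different solutions; keeping the best-scoring sample seen so far corresponds to the "report the best one" step.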

  34. Local optima • The algorithm may not find the “global” or true maximum of the scoring function • Once s(t) contains many similar substrings, others matching these will be chosen with higher probability • The algorithm will “get locked” into a “local optimum” • All neighbors have poorer scores, hence there is a low chance of moving out of this solution

  35. But • The important thing to note is that the algorithm can (and will) make some “moves” that decrease the score. • Not a greedy algorithm; more of a “global search”.
