
Evaluation of count scores for weight matrix motifs

Evaluation of count scores for weight matrix motifs. Project Presentation for CS598SS Hong Cheng and Qiaozhu Mei. Problem Background. Understand the mechanism of gene regulation and predict the gene regulation.


Presentation Transcript


  1. Evaluation of count scores for weight matrix motifs Project Presentation for CS598SS Hong Cheng and Qiaozhu Mei

  2. Problem Background • Understand the mechanism of gene regulation and predict the gene regulation. • Need a quantitative measure of the strength of a TF Binding Site correlated in a gene sequence. • This measure can be used as an important feature in the study of gene regulation.

  3. Problem Background (cont.) • There are no standard criteria for such a measure. • But we expect a good measure to model • The quality of a binding site. • The occurrence of a binding site in the sequence. • There can be many choices for such a measure, but which one is better?

  4. Project Overview • Project Goal • Possible scoring measures • Evaluate a score: Constraint Analysis • Experiments and results • Current Status and Future Work

  5. Goal of the project • Three Steps: • Formalize the problem of counting score of weight matrix motifs and propose an evaluation mechanism. • Evaluate the existing scoring methods of weight matrix motifs. • Either suggest a good motif counting method or propose a new score better than existing scores.

  6. Possible scoring measures • Simple Counting • Match or not match • Likelihood Sum • Data likelihood of a site is generated by a motif: sum over all possible sites • Model based scores • Free Energy • Normalization of existing scores

  7. Simple counting • Simple Counting (match or not match) • Doesn’t work for fuzzy motifs • Variation: count a motif if for a subsequence, P(s|w) is above a threshold. • Likelihood sum score • A soft version of simple counting • Ad hoc, doesn’t have a sound probabilistic interpretation
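The two counting variants on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the presenters' code: the motif `pwm`, the sequence `seq`, and the function names are hypothetical, and P(s|w) is assumed to be the product of per-position base probabilities.

```python
from math import prod

BASES = "ACGT"

def site_likelihood(site, pwm):
    """P(s | w): product of the per-position probabilities of the site's bases."""
    return prod(col[BASES.index(b)] for b, col in zip(site, pwm))

def window_likelihoods(seq, pwm):
    """P(s | w) for every window of motif length in the sequence."""
    l = len(pwm)
    return [site_likelihood(seq[i:i + l], pwm) for i in range(len(seq) - l + 1)]

def threshold_count(seq, pwm, threshold):
    """Simple-counting variation: count windows whose P(s | w) is above a threshold."""
    return sum(1 for p in window_likelihoods(seq, pwm) if p > threshold)

def likelihood_sum(seq, pwm):
    """Soft version: sum P(s | w) over all possible sites."""
    return sum(window_likelihoods(seq, pwm))

# Hypothetical 3-column motif with consensus "TAT" (columns ordered A, C, G, T).
pwm = [
    [0.1, 0.1, 0.1, 0.7],
    [0.7, 0.1, 0.1, 0.1],
    [0.1, 0.1, 0.1, 0.7],
]
seq = "GGTATCCTAAG"

strict = threshold_count(seq, pwm, 0.1)    # only the exact "TAT" window passes
loose = threshold_count(seq, pwm, 0.04)    # "TAT" and the near match "TAA" pass
soft = likelihood_sum(seq, pwm)
```

The threshold is the only tuning knob of the hard count, which is one way to see why the slide calls the likelihood sum a softer but ad hoc alternative.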

  8. Model Based Scores • Consider the sequence to be generated by a model involving a set of motifs • HMM model: Stubb (Sinha et al 2003) • Count of a motif as an average number of times the motif is planted in the sequence • Two options: • With fixed transition probabilities • Fitting transition probabilities by unsupervised learning
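To make "average number of times the motif is planted" concrete, here is a heavily simplified sketch with a fixed transition probability. It is not Stubb's implementation: it assumes a single motif, a zero-order uniform background, and a model that at each step either emits one background base (probability `1 - p_motif`) or plants a whole motif occurrence (probability `p_motif`). All names are hypothetical, and there is no rescaling, so it is only suitable for short sequences.

```python
BASES = "ACGT"

def site_likelihood(site, pwm):
    """P(s | w) as a product of per-position base probabilities."""
    p = 1.0
    for b, col in zip(site, pwm):
        p *= col[BASES.index(b)]
    return p

def expected_motif_count(seq, pwm, p_motif, background=(0.25, 0.25, 0.25, 0.25)):
    """Posterior expected number of motif occurrences (requires 0 <= p_motif < 1)."""
    n, l = len(seq), len(pwm)
    q = 1.0 - p_motif
    # emit[i]: likelihood that the motif generated seq[i:i+l] (0 if it does not fit).
    emit = [site_likelihood(seq[i:i + l], pwm) if i + l <= n else 0.0 for i in range(n)]
    bg = [background[BASES.index(b)] for b in seq]
    # fwd[i]: probability of generating the prefix seq[:i].
    fwd = [0.0] * (n + 1)
    fwd[0] = 1.0
    for i in range(1, n + 1):
        fwd[i] = fwd[i - 1] * q * bg[i - 1]
        if i >= l:
            fwd[i] += fwd[i - l] * p_motif * emit[i - l]
    # bwd[i]: probability of generating the suffix seq[i:].
    bwd = [0.0] * (n + 1)
    bwd[n] = 1.0
    for i in range(n - 1, -1, -1):
        bwd[i] = q * bg[i] * bwd[i + 1]
        if i + l <= n:
            bwd[i] += p_motif * emit[i] * bwd[i + l]
    # Sum the posterior probability that an occurrence starts at each position.
    total = fwd[n]
    return sum(fwd[i] * p_motif * emit[i] * bwd[i + l] for i in range(n - l + 1)) / total

# Hypothetical motif (consensus "TAT") and sequence.
pwm = [
    [0.1, 0.1, 0.1, 0.7],
    [0.7, 0.1, 0.1, 0.1],
    [0.1, 0.1, 0.1, 0.7],
]
seq = "GGTATCCTAAG"
low = expected_motif_count(seq, pwm, 0.01)
high = expected_motif_count(seq, pwm, 0.1)
```

Fitting the transition probability by unsupervised learning, the second option on the slide, would wrap this computation in an iterative re-estimation loop, which is omitted here.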

  9. Other Possible Scores • Free Energy: • θ: a set of motifs M and model parameters • θ_b: model parameters with background only • F(S, θ) = log( Pr(S | θ) / Pr(S | θ_b) ) • Models the score of a sequence and a set of motifs; cannot give the score of a specific motif unless the computation is run for that motif alone • Normalization of Existing Scores • Estimating P(C >= x) instead of #sequences • Use well-known normalization methods to normalize the actual counts • Min-Max; Z-Score (Z = (N - E) / S)
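The two normalizations named on this slide are standard and can be sketched as follows. One assumption is made loudly here: E and S in Z = (N - E) / S are taken to be the mean and standard deviation of counts observed on a set of reference (for example, background) sequences. The function names and the sample numbers are hypothetical.

```python
from statistics import mean, pstdev

def min_max(counts):
    """Rescale counts linearly to [0, 1]; constant input maps to all zeros."""
    lo, hi = min(counts), max(counts)
    if hi == lo:
        return [0.0 for _ in counts]
    return [(c - lo) / (hi - lo) for c in counts]

def z_score(n, reference_counts):
    """Z = (N - E) / S, with E and S estimated from reference counts (an assumption)."""
    e = mean(reference_counts)
    s = pstdev(reference_counts)
    if s == 0:
        return 0.0
    return (n - e) / s

# Hypothetical counts of one motif on eight reference sequences.
reference = [2, 4, 4, 4, 5, 5, 7, 9]
z = z_score(9, reference)      # mean is 5, population standard deviation is 2
scaled = min_max([2, 4, 9])
```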

  10. Question: What makes a good score for a weight matrix motif?

  11. Evaluation of scores • Empirical evaluation with Lab Experiments • Comparing the score with lab experiments to see the effectiveness • ChIP-chip studies • Problem: • Lab experiment data is not easy to obtain • Performance may vary over species (thus may be biased) • Analytical evaluation: heuristic constraints

  12. Analytical Evaluation with Heuristic Constraints • There are many heuristic constraints which we expect a good score to satisfy • The effectiveness of a score can be inferred from how well it satisfies the constraints • Whether a score satisfies a constraint can be studied analytically or with experiments on random data • Combined with empirical evaluation, constraint analysis can tell us why a score is better than others, and help us define a new score.

  13. Heuristic Constraints (I) • Formalization • Motif PWMs: w, M • Sequences: S • Possible binding sites: s • Score of the nth run: Cn(S, w) • Motif Quality Constraint • Focus on quality of sites • Contribution of a motif w with length l on sites • For one motif w, two sequences S1, S2, and a site position [i , i + l - 1]: S1[i , i + l - 1] != S2[i , i + l - 1] and all other positions are the same. • If I(S1[i , i + l - 1] ) <= I(S2[i , i + l - 1] ) • C(S1, w) >= C(S2, w)

  14. Heuristic Constraints (II) • Motif Length Constraint • For two motifs w1 and w2, length(w1) = length(w2) + 1. For any position i <= length(w2), the multinomial vector w1(i) = w2(i). • Compute the scores of w1 and w2 on one sequence S independently • C(S, w1) <= C(S, w2) • Motif Sharpness Constraint • For two motifs w1 and w2, length(w1) = length(w2), if for any position i and j = 0, 1, 2, w1(i, j) < w2(i, j) and w1(i, 3) > w2(i, 3) • (w1 is sharper than w2) • Compute the scores of w1 and w2 on a large number of sequences independently • Expectation [C(w1)] <= Expectation [C(w2)]
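The Motif Length Constraint can be checked mechanically for any scoring function. The sketch below does so for a likelihood-sum score on random sequences; the helper names, the motif `w2`, and the appended column are hypothetical. For this particular score the constraint holds by construction, since every window likelihood of w1 is a window likelihood of w2 multiplied by a probability of at most 1, and w1 has one fewer window.

```python
import random

BASES = "ACGT"

def likelihood_sum(seq, pwm):
    """Sum of P(s | w) over all windows of motif length."""
    l = len(pwm)
    total = 0.0
    for i in range(len(seq) - l + 1):
        p = 1.0
        for b, col in zip(seq[i:i + l], pwm):
            p *= col[BASES.index(b)]
        total += p
    return total

def satisfies_length_constraint(score, seq, w2, extra_column):
    """Build w1 by appending one column to w2, then test C(S, w1) <= C(S, w2)."""
    w1 = w2 + [extra_column]
    return score(seq, w1) <= score(seq, w2)

rng = random.Random(0)  # fixed seed so the check is repeatable
w2 = [[0.1, 0.1, 0.1, 0.7], [0.7, 0.1, 0.1, 0.1]]
extra = [0.25, 0.25, 0.25, 0.25]
sequences = ["".join(rng.choice(BASES) for _ in range(50)) for _ in range(20)]
all_ok = all(satisfies_length_constraint(likelihood_sum, s, w2, extra) for s in sequences)
```

Passing a different `score` function (for example, a thresholded count or a model-based score) turns the same harness into an empirical test of that score.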

  15. Heuristic Constraints (III) • Motif Probability Constraint • For one motif w and one sequence S, compute the score C(S, w) twice, giving a higher probability to w in the second run • (e.g. transition probability or prior probability in HMM) • E.g. p1 < p2 • C1(S, w) <= C2(S, w) • Motif Competition Constraint • For two motifs w1 and w2 and one sequence S: first compute the score for w1 only, then compute it considering the co-occurrences of w1 and w2. • C1(S, w1) >= C2(S, w1)

  16. Heuristic Constraints (IV) • Deterministic Constraint • One motif w, one sequence S: if we compute the score of w twice with no parameter change, • C1(S, w) = C2(S, w) • Upper Bound Constraint • An existing set of motifs M, a sequence S: if we add a new motif wn and compute the scores for M and wn again, • But cannot exceed an upper bound (e.g. the length of S)

  17. A summary of constraints • The heuristic constraints allow us to analyze the effectiveness of a score without doing experiments. • If experiments show that one score is better than others, the heuristic constraints can indicate why it is better. • It is difficult to find a closed set of constraints • Some constraints are closely related (maybe not orthogonal, though not redundant)

  18. Experiment Design • Regular (comparing distribution): • Method • Stubb with learnt p • Simple Count • Data • Real motifs, real sequence data • Real motifs, randomly generated very long sequence (say, 10k~100k) • Random motifs, including long, short, fuzzy and sharp combinations, random long sequence

  19. Experiment Design • Stubb with Fixed Prior Probability • Vary prior prob p : 0.0001, …, 0.001, …, 0.01… • Data • Real motifs, randomly generated long sequence • Random motifs, including long, short, fuzzy and sharp combinations, randomly generated long sequence • See score distribution

  20. Experiment Design: Constraints • Motif Length: • Randomly generated motifs (uniform, varying length), randomly generated long sequence. • Randomly generated motifs (uniform, varying length), real sequences • Motif sharpness: • Randomly generated motifs (varying sharpness, equal length), randomly generated long sequence (100k)
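Test data of this kind could be generated along the following lines. The slides do not say how sharpness was varied, so the parameterization here is an assumption: each column mixes a uniform distribution with a one-hot distribution on a randomly chosen base, with `sharpness` in [0, 1] controlling the mix. Function names are hypothetical.

```python
import random

BASES = "ACGT"

def random_sequence(length, rng):
    """Uniform random DNA sequence of the given length."""
    return "".join(rng.choice(BASES) for _ in range(length))

def random_motif(length, sharpness, rng):
    """Random PWM: sharpness 0 gives uniform columns, 1 gives one-hot columns."""
    motif = []
    for _ in range(length):
        preferred = rng.randrange(4)            # index of the favoured base
        col = [(1.0 - sharpness) * 0.25] * 4    # uniform part of the mixture
        col[preferred] += sharpness             # one-hot part of the mixture
        motif.append(col)
    return motif

rng = random.Random(0)  # fixed seed for repeatable experiments
sequence = random_sequence(1000, rng)
# Motif length test set: uniform motifs with lengths 1 to 10.
by_length = [random_motif(l, 0.0, rng) for l in range(1, 11)]
# Motif sharpness test set: length-10 motifs with increasing sharpness.
by_sharpness = [random_motif(10, k / 9, rng) for k in range(10)]
```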

  21. Experiment Design: Constraints • Motif Competition • Real motifs, real sequence/random sequence data • several runs: • 1st run: only motif M1 • 2nd run: M1 and M2, • 3rd run: M1 and M2 and M3, • … • Plot the distribution of M1 in several runs.
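The run structure of the competition experiment can be sketched with a toy model-based score. The scoring function below is a simplified zero-order HMM with the same fixed probability `p_each` for every motif and a uniform background; it stands in for Stubb only for illustration and is not Stubb's implementation. The motifs `m1`, `m2`, `m3` and the sequence are hypothetical, and the sketch has no rescaling, so it suits short sequences only.

```python
BASES = "ACGT"

def site_likelihood(site, pwm):
    p = 1.0
    for b, col in zip(site, pwm):
        p *= col[BASES.index(b)]
    return p

def expected_counts(seq, motifs, p_each, bg=0.25):
    """Posterior expected count of each motif when all motifs compete for sites."""
    n = len(seq)
    q = 1.0 - p_each * len(motifs)  # probability of emitting one background base
    emit = [[site_likelihood(seq[i:i + len(w)], w) if i + len(w) <= n else 0.0
             for i in range(n)] for w in motifs]
    fwd = [1.0] + [0.0] * n         # fwd[i]: probability of the prefix seq[:i]
    for i in range(1, n + 1):
        fwd[i] = fwd[i - 1] * q * bg
        for w, e in zip(motifs, emit):
            if i >= len(w):
                fwd[i] += fwd[i - len(w)] * p_each * e[i - len(w)]
    bwd = [0.0] * n + [1.0]         # bwd[i]: probability of the suffix seq[i:]
    for i in range(n - 1, -1, -1):
        bwd[i] = q * bg * bwd[i + 1]
        for w, e in zip(motifs, emit):
            if i + len(w) <= n:
                bwd[i] += p_each * e[i] * bwd[i + len(w)]
    return [sum(fwd[i] * p_each * e[i] * bwd[i + len(w)]
                for i in range(n - len(w) + 1)) / fwd[n]
            for w, e in zip(motifs, emit)]

m1 = [[0.1, 0.1, 0.1, 0.7], [0.7, 0.1, 0.1, 0.1], [0.1, 0.1, 0.1, 0.7]]  # "TAT"
m2 = [[0.1, 0.7, 0.1, 0.1], [0.1, 0.7, 0.1, 0.1]]                        # "CC"
m3 = [[0.7, 0.1, 0.1, 0.1], [0.7, 0.1, 0.1, 0.1], [0.1, 0.1, 0.7, 0.1]]  # "AAG"
seq = "GGTATCCTAAG"

# 1st run: M1 only; 2nd run: M1 and M2; 3rd run: M1, M2 and M3.
runs = [[m1], [m1, m2], [m1, m2, m3]]
m1_scores = [expected_counts(seq, motifs, 0.05)[0] for motifs in runs]
```

Plotting `m1_scores` across runs, over many sequences, corresponds to the last bullet of the slide.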

  22. Experiment Design (cont.) • Deterministic constraints: • Real motifs, real sequences, run it several times, plot the distributions of Motif 1 to see whether it changes a lot. • Normalization: • Z-Score only; Min-Max only; P(C>=N) only; P(C>=N) + Z-Score; P(C>=N) + Min-Max
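One plausible reading of the P(C>=N) normalization is an empirical tail probability: the fraction of background sequences on which the motif scores at least as high as the observed count N. The sketch below follows that reading, which is an assumption, and uses an exact-match count on uniform random sequences as a stand-in score. All names and numbers are hypothetical.

```python
import random

def count_exact(seq, word):
    """Number of (possibly overlapping) exact occurrences of word in seq."""
    return sum(1 for i in range(len(seq) - len(word) + 1) if seq[i:i + len(word)] == word)

def empirical_tail_probability(observed, background_scores):
    """Estimate P(C >= N) as the fraction of background scores >= the observed one."""
    hits = sum(1 for s in background_scores if s >= observed)
    return hits / len(background_scores)

rng = random.Random(0)  # fixed seed for repeatability
background = [
    count_exact("".join(rng.choice("ACGT") for _ in range(200)), "TAT")
    for _ in range(500)
]
p_value = empirical_tail_probability(5, background)
```

A small tail probability marks a count as unusually high for the background, which puts motifs of different length and sharpness on a comparable scale before any further Z-Score or Min-Max step.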

  23. Experiment Result(1) • Stubb on real sequences against real motifs • Simple count on real sequences against real motifs • Four motifs • Bicoid, length 11, medium sharp • Kruppel, length 9, medium sharp • Gt, length 12, a bit sharper • Hkb, length 7, sharpest, every row has one non-zero count and three 0s

  24. Experiment Result (1)-Stubb

  25. Experiment Result (1)-Simple Count

  26. Experiment Result (1) – Normalization P(x>=N)

  27. Result (1) – Normalization z-score on motif score

  28. Experiment Result(2) • Stubb on random sequences against random motifs • Simple count on random sequences against random motifs • Four motifs • Long_fuzzy, length 20, uniform • Long_sharp, length 20, sharp • Short_fuzzy, length 5, uniform • Short_sharp, length 5, sharp

  29. Experiment Result(2)-Stubb

  30. Experiment Result (2)-Simple Count

  31. Experiment Result(3) • Stubb with Fixed Prior Probability, varying p 0.0001 ~0.05 • Four real motifs • Bicoid • Kruppel • Hkb • Gt • Four random motifs • Long_fuzzy • Long_sharp • Short_sharp • Short_fuzzy

  32. Experiment Result(3)-Bicoid

  33. Experiment Result(3)-Hkb

  34. Experiment Result(3)-Long_fuzzy

  35. Experiment Result(3)-Short_sharp

  36. Experiment Result(4)-Constraint Motif Length • Test on this heuristic • Stubb • Simple Count • Generate 10 random motifs, uniform, vary length from 1 to 10

  37. Experiment Result(4)-Stubb

  38. Experiment Result(4)-Simple count

  39. Experiment Result(5)-Constraint Motif Sharpness • Test on this heuristic • Stubb • Simple Count • Generate 10 random motifs, length 10, vary sharpness

  40. Experiment Result(5)-Stubb

  41. Experiment Result(5)-Simple count

  42. Experiment Result(6)-Motif Competition • Test on this constraint • Stubb • Simple Count • 1st run: using bicoid only • 2nd run: using bicoid and other five motifs • 3rd run: using bicoid and other nine motifs • Monitor the bicoid score

  43. Experiment Result(6)-Stubb

  44. Experiment Result(6)-Simple count

  45. Summary

  46. Future Work • Finish constraint tests • Evaluate more scores (e.g. Free Energy) • Define and formalize more constraints • Compare with ChIP-chip experiment results to study the effectiveness of scores and their relation to the constraints
