Evaluation of count scores for weight matrix motifs

Evaluation of count scores for weight matrix motifs • Project Presentation for CS598SS • Hong Cheng and Qiaozhu Mei

Problem Background • We want to understand the mechanism of gene regulation and to predict gene regulation. • This needs a quantitative measure of how strongly a transcription factor (TF) binding site is represented in a gene sequence. • Such a measure can be used as an important feature in the study of gene regulation.

Project Background (cont.) • There are no standard criteria for such a measure. • But we expect a good measure to model: • The quality of a binding site. • The occurrence of a binding site in the sequence. • There are many possible choices for such a measure, but which one is better?

Project Overview • Project Goal • Possible scoring measures • Evaluate a score: Constraint Analysis • Experiments and results • Current Status and Future Work

Goal of the project • Three Steps: • Formalize the problem of computing count scores for weight matrix motifs and propose an evaluation mechanism. • Evaluate the existing scoring methods for weight matrix motifs. • Either recommend a good existing motif counting method or propose a new score that is better than the existing ones.

Possible scoring measures • Simple Counting • Match or not match • Likelihood Sum • Likelihood that a site is generated by the motif, summed over all possible sites • Model based scores • Free Energy • Normalization of existing scores

Simple counting • Simple Counting (match or not match) • Doesn’t work for fuzzy motifs • Variation: count an occurrence whenever a subsequence s has P(s|w) above a threshold. • Likelihood sum score • A soft version of simple counting • Ad hoc; doesn’t have a sound probabilistic interpretation
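The two scores on this slide can be sketched as follows. This is a minimal illustration, not the code used in the project: the PWM layout (a list of per-position probability dictionaries), the function names, and the toy motif are all assumptions made for the example.

```python
from typing import Dict, Iterator, List

# Hypothetical PWM layout: one dict of base probabilities per motif position.
PWM = List[Dict[str, float]]


def site_likelihood(site: str, w: PWM) -> float:
    """P(s | w): product of the per-position probabilities of the site's bases."""
    p = 1.0
    for base, column in zip(site, w):
        p *= column[base]
    return p


def windows(seq: str, length: int) -> Iterator[str]:
    """All subsequences of the given length, sliding one position at a time."""
    return (seq[i:i + length] for i in range(len(seq) - length + 1))


def threshold_count(seq: str, w: PWM, threshold: float) -> int:
    """Thresholded simple count: number of sites whose P(s | w) is above the threshold."""
    return sum(1 for s in windows(seq, len(w)) if site_likelihood(s, w) > threshold)


def likelihood_sum(seq: str, w: PWM) -> float:
    """Likelihood sum: P(s | w) added up over all possible sites."""
    return sum(site_likelihood(s, w) for s in windows(seq, len(w)))


# Toy two-position motif that prefers "AC" (illustrative values only).
TOY = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
]
```

On the sequence "ACAC" with this toy motif, the two "AC" sites pass a threshold of 0.4 and the "CA" site does not, while the likelihood sum still credits the weak site with a small contribution.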

Model Based Scores • Consider the sequence to be generated by a model involving a set of motifs • HMM: Stubb (Sinha et al. 2003) • The count of a motif is the average number of times the motif is planted in the sequence • Two options: • With fixed transition probabilities • Fitting transition probabilities by unsupervised learning
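The idea of counting a motif as the average number of times it is planted can be illustrated with a deliberately simplified model. The sketch below is not Stubb's implementation: it assumes a single motif, a uniform background, and a fixed probability p of planting the motif at each step, and it uses plain floating point, so it would underflow on very long sequences (a real implementation would rescale or work in log space). All names are hypothetical.

```python
from typing import Dict, List

PWM = List[Dict[str, float]]  # hypothetical layout: one probability dict per position


def site_likelihood(site: str, w: PWM) -> float:
    """P(s | w) for a site of the same length as the motif."""
    p = 1.0
    for base, column in zip(site, w):
        p *= column[base]
    return p


def expected_motif_count(seq: str, w: PWM, p: float, bg: float = 0.25) -> float:
    """Posterior expected number of planted motif copies in seq.

    Generative assumption: at each step, plant the motif with probability p
    (emitting len(w) bases) or emit one background base with probability 1 - p.
    """
    n, l = len(seq), len(w)
    site = [site_likelihood(seq[i:i + l], w) for i in range(n - l + 1)]

    # fwd[i]: total probability of all parses of the first i bases.
    fwd = [0.0] * (n + 1)
    fwd[0] = 1.0
    for i in range(1, n + 1):
        fwd[i] = fwd[i - 1] * (1 - p) * bg
        if i >= l:
            fwd[i] += fwd[i - l] * p * site[i - l]

    # bwd[i]: total probability of all parses of the bases from position i on.
    bwd = [0.0] * (n + 1)
    bwd[n] = 1.0
    for i in range(n - 1, -1, -1):
        bwd[i] = (1 - p) * bg * bwd[i + 1]
        if i + l <= n:
            bwd[i] += p * site[i] * bwd[i + l]

    # Posterior probability of a motif starting at each position, summed.
    planted = sum(fwd[i] * p * site[i] * bwd[i + l] for i in range(n - l + 1))
    return planted / fwd[n]


# Toy two-position motif that prefers "AC" (illustrative values only).
TOY = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
]
```

With fixed transition probabilities, p is simply passed in, as here. Fitting p by unsupervised learning would wrap this computation in an iterative re-estimation loop, which the sketch does not attempt.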

Other Possible Scores • Free Energy: • θ: a set of motifs M and model parameters • θ_b: model parameters with backgrounds only • F(S, θ) = log( Pr(S | θ) / Pr(S | θ_b) ) • Models the score of a sequence and a set of motifs; cannot give the score of a specific motif unless the computation is run for that one motif only • Normalization of Existing Scores • Estimating P(C >= x) instead of #sequences • Use well known normalization methods to normalize the actual counts • Min-Max; Z-Score (Z = (N - E) / S)
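The two normalization methods named above can be sketched as below. The slide does not say how E and S in the Z-score are obtained; this sketch assumes they are the mean and (population) standard deviation of the observed counts themselves, and the function names are hypothetical.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def min_max(counts: Sequence[float]) -> List[float]:
    """Rescale counts linearly to [0, 1]; all zeros if every count is equal."""
    lo, hi = min(counts), max(counts)
    if hi == lo:
        return [0.0 for _ in counts]
    return [(c - lo) / (hi - lo) for c in counts]


def z_scores(counts: Sequence[float]) -> List[float]:
    """Z = (N - E) / S, with E and S estimated from the counts (an assumption)."""
    e = mean(counts)
    s = pstdev(counts)
    if s == 0:
        return [0.0 for _ in counts]
    return [(n - e) / s for n in counts]


# Illustrative counts only, not data from the experiments.
EXAMPLE_COUNTS = [2, 4, 4, 4, 5, 5, 7, 9]
```

If E and S instead come from a background model or from counts on random sequences, only the two estimates inside `z_scores` would change.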

Question: What makes a good score for a weight matrix motif?

Evaluation of scores • Empirical evaluation with Lab Experiments • Compare the score with lab experiments to see how effective it is • ChIP-chip studies • Problems: • Lab experiment data are not easy to get • Performance may vary across species (and thus may be biased) • Analytical evaluation: heuristic constraints

Analytical Evaluation with Heuristic Constraints • There are many heuristic constraints that we expect a good score to satisfy • The effectiveness of a score can be inferred from how well it satisfies the constraints • Whether a score satisfies a constraint can be studied analytically or with experiments on random data • Combined with empirical evaluation, constraint analysis can tell us why a score is better than others, and help us define a new score.

Heuristic Constraints (I) • Formalization • Motif PWMs: w (a single motif), M (a set of motifs) • Sequences: S • Possible binding sites: s • Score of the nth run: Cn(S, w) • Motif Quality Constraint • Focus on quality of sites • Contribution of a motif w with length l on sites • For one motif w, two sequences S1, S2, and a site position [i, i + l - 1]: S1[i, i + l - 1] != S2[i, i + l - 1] and all other positions are the same. • If I(S1[i, i + l - 1]) <= I(S2[i, i + l - 1]) • then C(S1, w) >= C(S2, w)

Heuristic Constraints (II) • Motif Length Constraint • For two motifs w1 and w2 with length(w1) = length(w2) + 1, where for any position i <= length(w2) the multinomial vector w1(i) = w2(i). • Compute the scores of w1 and w2 on one sequence S independently • C(S, w1) <= C(S, w2) • Motif Sharpness Constraint • For two motifs w1 and w2 with length(w1) = length(w2), if for any position i and j = 0, 1, 2, w1(i, j) < w2(i, j) and w1(i, 3) > w2(i, 3) • (w1 is sharper than w2) • Compute the scores of w1 and w2 on a large number of sequences independently • Expectation[C(w1)] <= Expectation[C(w2)]
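As an illustration of checking a constraint on random data, the sketch below tests the Motif Length Constraint for the likelihood sum score: it extends a motif by one column and compares the two scores on the same sequence. The PWM layout, helper names, and the fixed random seed are assumptions for the example, not part of the original experiments.

```python
import random
from typing import Dict, List

PWM = List[Dict[str, float]]  # hypothetical layout: one probability dict per position
BASES = "ACGT"


def site_likelihood(site: str, w: PWM) -> float:
    """P(s | w) for a site of the same length as the motif."""
    p = 1.0
    for base, column in zip(site, w):
        p *= column[base]
    return p


def likelihood_sum(seq: str, w: PWM) -> float:
    """Likelihood sum score: P(s | w) added up over all possible sites."""
    l = len(w)
    return sum(site_likelihood(seq[i:i + l], w) for i in range(len(seq) - l + 1))


def satisfies_length_constraint(seq: str, w2: PWM, extra_column: Dict[str, float]) -> bool:
    """Check C(S, w1) <= C(S, w2), where w1 is w2 plus one extra column."""
    w1 = w2 + [extra_column]
    return likelihood_sum(seq, w1) <= likelihood_sum(seq, w2)


# One random sequence and a short sharp motif (illustrative values only).
rng = random.Random(0)
RANDOM_SEQ = "".join(rng.choice(BASES) for _ in range(500))
SHARP_COLUMN = {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1}
UNIFORM_COLUMN = {b: 0.25 for b in BASES}
SHORT_MOTIF = [SHARP_COLUMN, UNIFORM_COLUMN, SHARP_COLUMN]
```

For this particular score the check cannot fail: every site of the longer motif has likelihood equal to the shorter motif's site likelihood times a probability of at most 1, and the longer motif has one fewer site. Scores such as the model based count need the experimental check.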

Heuristic Constraints (III) • Motif Probability Constraint • For one motif w and one sequence S, compute the score C(S, w) twice, giving a higher probability to w in the second run • (e.g. transition probability or prior probability in the HMM) • E.g. p1 < p2 • C1(S, w) <= C2(S, w) • Motif Competition Constraint • For two motifs w1 and w2 and one sequence S: first compute the score for w1 only, then compute it again considering the co-occurrence of w1 and w2. • C1(S, w1) >= C2(S, w1)

Heuristic Constraints (IV) • Deterministic Constraint • For one motif w and one sequence S, if we compute the score of w twice with no parameter changes, • C1(S, w) = C2(S, w) • Upper Bound Constraint • For an existing set of motifs M and a sequence S, if we add a new motif wn and compute the scores for M and wn again, • the scores cannot exceed an upper bound (e.g. the length of S)

A summary of constraints • The heuristic constraints allow us to analyze the effectiveness of a score without doing experiments. • If experiments show that one score is better than others, the heuristic constraints can indicate why it is better. • It is difficult to find a closed set of constraints • Some constraints are closely related (maybe not orthogonal, though not redundant)

Experiment Design • Regular (comparing distributions): • Method • Stubb with learnt p • Simple Count • Data • Real motifs, real sequence data • Real motifs, randomly generated very long sequence (say, 10k~100k) • Random motifs, including long, short, fuzzy and sharp combinations, random long sequence

Experiment Design • Stubb with Fixed Prior Probability • Vary prior prob p: 0.0001, …, 0.001, …, 0.01… • Data • Real motifs, randomly generated long sequence • Random motifs, including long, short, fuzzy and sharp combinations, randomly generated long sequence • See score distribution

Experiment Design: Constraints • Motif Length: • Randomly generated motifs (uniform, varying length), randomly generated long sequence. • Randomly generated motifs (uniform, varying length), real sequences • Motif sharpness: • Randomly generated motifs (varying sharpness, equal length), randomly generated long sequence (100k)
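The random test data described above could be generated along these lines. The slides do not define how sharpness is parameterized; this sketch assumes a single number in [0, 1] that mixes a uniform column (sharpness 0) with a column putting all mass on one randomly chosen base (sharpness 1). The function names and that parameterization are hypothetical.

```python
import random
from typing import Dict, List

BASES = "ACGT"
PWM = List[Dict[str, float]]  # hypothetical layout: one probability dict per position


def random_motif(length: int, sharpness: float, rng: random.Random) -> PWM:
    """Random motif whose columns mix a uniform and a one-hot distribution.

    sharpness = 0 gives a uniform (fuzzy) motif; sharpness = 1 gives columns
    with one non-zero entry, like the sharpest motif in the result slides.
    """
    motif = []
    for _ in range(length):
        preferred = rng.choice(BASES)
        column = {b: (1 - sharpness) * 0.25 for b in BASES}
        column[preferred] += sharpness
        motif.append(column)
    return motif


def random_sequence(length: int, rng: random.Random) -> str:
    """Random sequence with independent, uniformly distributed bases."""
    return "".join(rng.choice(BASES) for _ in range(length))


rng = random.Random(42)  # fixed seed so that runs are repeatable
# Motif length test: uniform motifs of varying length.
length_test_motifs = [random_motif(n, 0.0, rng) for n in range(1, 11)]
# Motif sharpness test: equal length, varying sharpness.
sharpness_test_motifs = [random_motif(10, k / 9, rng) for k in range(10)]
background = random_sequence(100_000, rng)
```

Fixing the seed also makes it easier to separate randomness in the data from any non-determinism in the score itself.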

Experiment Design: Constraints • Motif Competition • Real motifs, real sequence/random sequence data • several runs: • 1st run: only motif M1 • 2nd run: M1 and M2, • 3rd run: M1 and M2 and M3, • … • Plot the distribution of M1 in several runs.

Experiment Design (cont.) • Deterministic constraint: • Real motifs, real sequences; run the scoring several times and plot the score distributions of Motif 1 to see whether they change much. • Normalization: • Z-Score only; Min-Max only; P(C>=N) only; P(C>=N) + Z-Score; P(C>=N) + Min-Max
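The P(C>=N) normalization can be estimated empirically, as in the sketch below. The slides do not state how this probability is estimated; the sketch assumes a reference sample of counts (for example, counts of the same motif on randomly generated sequences) and reports the fraction of that sample at or above the observed count N. The function name and the example numbers are hypothetical.

```python
from typing import Sequence


def tail_probability(observed: float, reference_counts: Sequence[float]) -> float:
    """Empirical estimate of P(C >= N): fraction of reference counts >= observed."""
    if not reference_counts:
        raise ValueError("reference_counts must not be empty")
    hits = sum(1 for c in reference_counts if c >= observed)
    return hits / len(reference_counts)


# Illustrative reference counts only, not data from the experiments.
REFERENCE = [0, 1, 2, 3, 4]
```

A small value means the observed count is unusually high relative to the reference sample. With a finite sample the estimate can be exactly 0, so combinations such as P(C>=N) followed by Z-Score inherit that coarseness.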

Experiment Result(1) • Stubb on real sequences against real motifs • Simple count on real sequences against real motifs • Four motifs • Bicoid, length 11, medium sharp • Kruppel, length 9, medium sharp • Gt, length 12, a bit sharper • Hkb, length 7, sharpest, every row has one non-zero count and three 0s

Experiment Result (1)-Stubb

Experiment Result (1)-Simple Count

Experiment Result (1) – Normalization P(x>=N)

Result (1) – Normalization z-score on motif score

Experiment Result(2) • Stubb on random sequences against random motifs • Simple count on random sequences against random motifs • Four motifs • Long_fuzzy, length 20, uniform • Long_sharp, length 20, sharp • Short_fuzzy, length 5, uniform • Short_sharp, length 5, sharp

Experiment Result(2)-Stubb

Experiment Result (2)-Simple Count

Experiment Result(3) • Stubb with Fixed Prior Probability, varying p 0.0001 ~0.05 • Four real motifs • Bicoid • Kruppel • Hkb • Gt • Four random motifs • Long_fuzzy • Long_sharp • Short_sharp • Short_fuzzy

Experiment Result(3)-Bicoid

Experiment Result(3)-Hkb

Experiment Result(3)-Long_fuzzy

Experiment Result(3)-Short_sharp

Experiment Result(4)-Constraint Motif Length • Test on this heuristic • Stubb • Simple Count • Generate 10 random motifs, uniform, vary length from 1 to 10

Experiment Result(4)-Simple count

Experiment Result(5)-Constraint Motif Sharpness • Test on this heuristic • Stubb • Simple Count • Generate 10 random motifs, length 10, vary sharpness

Experiment Result(6)-Motif Competition • Test on this constraint • Stubb • Simple Count • 1st run: using bicoid only • 2nd run: using bicoid and other five motifs • 3rd run: using bicoid and other nine motifs • Monitor the bicoid score

Summary

Future Work • Finish constraint tests • Evaluate more scores (e.g. Free Energy) • Define and formalize more constraints • Compare with ChIP-chip experiment results to study the effectiveness of the scores and their relation to the constraints
