CS5263 Bioinformatics

CS5263 Bioinformatics • Lecture 20 • Practical issues in motif finding • Final project

Homework problem 3.1 • Count separately the number of character comparisons and the number of steps needed to find the next matching character using the bad character rule • Question: can you give an example?

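Extended bad character rule • Example: T = xpbctbxabpqqaabpqz, P = tpabxab (k is the mismatched position in T; in the alignment, * marks the mismatch and ^ marks the matched characters) • In this iteration: # of comparisons = 3; table lookups = 2 • Find the occurrence of T(k) in P that is closest to the left of position i, then shift P to align T(k) with that occurrence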
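The shift rule above can be sketched in Python. This is a minimal illustration, not the course's reference solution: the function names `build_positions` and `search` are made up here, only the extended bad character rule is used (no good suffix rule), and a "table lookup" is assumed to mean inspecting one entry in the position list of the mismatched character.

```python
from collections import defaultdict


def build_positions(pattern):
    """Map each character to its positions in the pattern, rightmost first."""
    positions = defaultdict(list)
    for idx in range(len(pattern) - 1, -1, -1):
        positions[pattern[idx]].append(idx)
    return positions


def search(text, pattern):
    """Find all occurrences of pattern in text with the extended bad character rule.

    Returns (match_starts, comparisons, lookups), counting character
    comparisons and position-list lookups separately.
    """
    positions = build_positions(pattern)
    n, m = len(text), len(pattern)
    matches, comparisons, lookups = [], 0, 0
    start = 0
    while start + m <= n:
        # Compare right to left, as in Boyer-Moore.
        i = m - 1
        while i >= 0:
            comparisons += 1
            if pattern[i] != text[start + i]:
                break
            i -= 1
        if i < 0:
            matches.append(start)
            start += 1  # assumption: shift by one after a full match
            continue
        bad = text[start + i]  # T(k), the mismatched text character
        shift = i + 1          # default: T(k) does not occur left of i in P
        for pos in positions.get(bad, ()):
            lookups += 1
            if pos < i:        # closest occurrence of T(k) to the left of i
                shift = i - pos
                break
        start += shift
    return matches, comparisons, lookups


if __name__ == "__main__":
    print(search("xpbctbxabpqqaabpqz", "tpabxab"))
```

Keeping the two counters separate mirrors the homework question: the comparison count depends on how far the right-to-left scan gets, while the lookup count depends on how many occurrences of T(k) lie to the right of position i in P.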

Results for some real genes • llr = 394, E-value = 2.0e-023 • llr = 347, E-value = 9.8e-002 • llr = 110, E-value = 1.4e+004

Strategies to improve results • Combining results from different algorithms is usually helpful • Motifs that appear multiple times are probably more interesting • Except simple repeats like AAAAA or ATATATATA • Cluster motifs into groups according to their similarities

Strategies to improve results • Compare with known motifs in database • TRANSFAC • JASPAR • Issues: • How to determine similarity between motifs? • Alignment between matrices • How similar is similar? • Empirically determine some threshold

Strategies to improve results • Statistical test of significance • Enrichment in target sequences vs background sequences • Target set T: assumed to contain a common motif, P • Background set B: assumed not to contain P, or to contain it at very low frequency • Ideal case: every sequence in T has P, no sequence in B has P

Statistical test for significance • Target set T: size = N; P appears in n sequences • Background set + target set (B + T): size = M; P appears in m sequences • Intuitively: if n / N >> m / M, P is enriched (over-represented) in T • Statistical significance?

Hypergeometric distribution • A box with M balls, of which N are red, and the rest are blue. • Red ball: target sequences • Blue ball: background sequences • If we randomly draw m balls from the box, • what’s the probability we’ll see n red balls? • If the probability is very small, we are probably not drawing randomly • Total # of choices: (M choose m) • # of choices to have n red balls: (N choose n) x (M-N choose m-n) • Probability of exactly n red balls: (N choose n) x (M-N choose m-n) / (M choose m)

Cumulative hypergeometric test for motif significance • We are interested in: if we randomly pick m balls, how likely is it that we’ll see at least n red balls? • This can be interpreted as the p-value for the null hypothesis that we are picking randomly. • Alternative hypothesis: our selection favors red balls. • Equivalent: the target set T is enriched with motif P. • Or: P is over-represented in T.
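Written out with the symbols used on these slides (M balls in total, N red, m drawn, n red among the drawn), the cumulative p-value sums the hypergeometric probabilities from n up to the largest possible number of red balls:

```latex
p \;=\; P(X \ge n) \;=\; \sum_{k=n}^{\min(N,\,m)}
  \frac{\binom{N}{k}\,\binom{M-N}{m-k}}{\binom{M}{m}}
```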

Examples • Yeast genome has 6000 genes • Select 50 genes believed to be co-regulated by a common TF • Found a motif for these 50 genes • It appeared in 20 out of these 50 genes • Among the remaining genes in the genome, 100 have this motif • M = 6000, N = 50, m = 100+20 = 120, n = 20 • Intuition: • m/M = 120/6000. In the genome, 1 out of 50 genes has the motif • N = 50, so we would expect only 1 gene in the target set to have the motif • 20-fold enrichment • P-value = 6 x 10^-22 • If n = 5 instead: 5-fold enrichment, P-value = 0.003 • Normally a very low p-value is needed, e.g. 10^-10
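The yeast example can be checked with a few lines of Python. This is a sketch using exact integer binomials from the standard library; the function name `hypergeom_pvalue` is made up here, and the second call assumes m stays at 120 when n = 5, which the slide does not state. The results should be of the same order as the p-values quoted above.

```python
from math import comb


def hypergeom_pvalue(M, N, m, n):
    """P(at least n red balls) when drawing m balls from M, of which N are red."""
    upper = min(N, m)  # cannot draw more red balls than exist or than are drawn
    favourable = sum(
        comb(N, k) * comb(M - N, m - k) for k in range(n, upper + 1)
    )
    # Python integers are exact, so the ratio is only rounded once, at the end.
    return favourable / comb(M, m)


if __name__ == "__main__":
    # Slide example: 6000 genes, 50 targets, motif in 120 genes, 20 of them targets.
    print(hypergeom_pvalue(6000, 50, 120, 20))
    # Weaker enrichment: only 5 of the 50 targets carry the motif.
    print(hypergeom_pvalue(6000, 50, 120, 5))
```

Summing exact integers before dividing avoids the underflow that can occur when very small probabilities are multiplied as floats.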

ROC curve for motif significance • A motif is usually a PWM • Any word will have a score • Typical scoring function: log [P(W | M) / P(W | B)] • W: a word. • M: a PWM. • B: background model • To determine whether a sequence contains a motif, a cutoff has to be decided • With different cutoffs, you get different numbers of genes with the motif • The hypergeometric test first assumes a cutoff • It may be better to look at a range of cutoffs
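As an illustration of the scoring function, here is a small Python sketch. It assumes the PWM is stored as a list of per-position probability dictionaries with no zero entries (for example after adding pseudocounts) and that the background is a zero-order model; the names `llr_score`, `best_score`, and `contains_motif` are hypothetical.

```python
import math


def llr_score(word, pwm, background):
    """Return log [P(W | M) / P(W | B)] for a word as long as the PWM."""
    if len(word) != len(pwm):
        raise ValueError("word and PWM must have the same length")
    score = 0.0
    for column, base in zip(pwm, word):
        # Positions are treated as independent, so the log ratios add up.
        score += math.log(column[base] / background[base])
    return score


def best_score(sequence, pwm, background):
    """Highest score over all windows of the sequence (-inf if too short)."""
    width = len(pwm)
    if len(sequence) < width:
        return float("-inf")
    return max(
        llr_score(sequence[i:i + width], pwm, background)
        for i in range(len(sequence) - width + 1)
    )


def contains_motif(sequence, pwm, background, cutoff):
    """A sequence 'has' the motif if its best window reaches the cutoff."""
    return best_score(sequence, pwm, background) >= cutoff


if __name__ == "__main__":
    # Toy 3-column PWM and uniform background, both invented for illustration.
    toy_pwm = [
        {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
        {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
        {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25},
    ]
    uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
    print(best_score("TTAGCTT", toy_pwm, uniform))
```

Changing `cutoff` changes which sequences count as containing the motif, which is why m and n depend on the cutoff.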

ROC curve for motif significance • Given a score cutoff: • Target set T: size = N; P appeared in n sequences • Background set + target set (B + T): size = M; P appeared in m sequences • With different score cutoffs, m and n will differ • Assume you want to use P to classify T and B • Sensitivity: n / N • Specificity: (M-N-m+n) / (M-N) • False positive rate (FPR) = 1 - specificity: (m - n) / (M-N) • With decreasing cutoff, both sensitivity and FPR increase

ROC curve for motif significance • [Figure: ROC curves for Motif 1, Motif 2, and Random; x-axis: 1 - specificity (0 to 1); y-axis: sensitivity (0 to 1); one point on the curve is marked as a good cutoff] • Highest cutoff: no motif can pass the cutoff. Sensitivity = 0, specificity = 1. • Lowest cutoff: every sequence has the motif. Sensitivity = 1, specificity = 0. • ROC-AUC: area under the curve. • 1: perfect separation. • 0.5: random. • Motif 1 is better than motif 2.
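A ROC curve and its AUC can be computed directly from per-sequence motif scores. The sketch below assumes each sequence has already been reduced to a single score (for example its best PWM window score), sweeps every distinct score as a cutoff, and integrates with the trapezoid rule; the function names `roc_points` and `auc` are made up for illustration.

```python
def roc_points(target_scores, background_scores):
    """Return (FPR, sensitivity) pairs, from the highest cutoff to the lowest."""
    cutoffs = sorted(set(target_scores) | set(background_scores), reverse=True)
    points = [(0.0, 0.0)]  # cutoff above every score: nothing passes
    for cutoff in cutoffs:
        n = sum(s >= cutoff for s in target_scores)        # targets with motif
        fp = sum(s >= cutoff for s in background_scores)   # this is m - n
        points.append((fp / len(background_scores), n / len(target_scores)))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoid rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


if __name__ == "__main__":
    # Invented scores: targets mostly score higher than background.
    targets = [5.1, 4.2, 3.9, 1.0]
    background = [3.0, 2.2, 1.5, 0.4, 0.1]
    print(auc(roc_points(targets, background)))
```

Because the lowest cutoff lets every sequence pass, the last point is always (1, 1), matching the top-right corner of the plot.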

Other strategies • Cross-validation • Randomly divide sequences into 10 sets, hold 1 set for testing. • Do motif finding on the other 9 sets. Does the motif also appear in the test set? • Phylogenetic conservation information • Does a motif also appear in the homologous genes of another species? • Strongest evidence • However, will not be able to find species-specific ones
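The cross-validation procedure can be sketched as follows. The motif finder and the motif test are passed in as callbacks (`find_motif` and `has_motif`, both hypothetical placeholders for whatever tools are used), and the fold construction is one simple choice among several.

```python
import random


def k_fold_splits(sequences, k=10, seed=0):
    """Yield (train, test) pairs after randomly dividing sequences into k sets."""
    shuffled = list(sequences)
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    for i in range(k):
        train = [s for j, fold in enumerate(folds) if j != i for s in fold]
        yield train, folds[i]


def cross_validate(sequences, find_motif, has_motif, k=10, seed=0):
    """For each fold, return the fraction of held-out sequences with the motif."""
    fractions = []
    for train, test in k_fold_splits(sequences, k, seed):
        if not test:  # fewer sequences than folds: nothing to test on
            continue
        motif = find_motif(train)                 # motif finding on k - 1 sets
        hits = sum(has_motif(motif, s) for s in test)
        fractions.append(hits / len(test))
    return fractions


if __name__ == "__main__":
    # Toy stand-ins: the "motif" is a fixed word, tested by substring search.
    seqs = ["ACGTGACG", "TTGACGTT", "GACGAAAA", "CCCCGACG", "TTTTTTTT"] * 4
    print(cross_validate(seqs, lambda train: "GACG", lambda m, s: m in s))
```

A motif that is found on the training sets but rarely appears in the held-out sets is more likely to be an artifact of the particular sequences used.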

Other strategies • Finding motif modules • Do two motifs always appear in the same gene? • Location preference • Some motifs tend to appear at certain locations • E.g., within 50-150 bp upstream of the transcription start site • If a detected motif has a strong positional bias, that may be a sign of its function • Evidence from other types of data sources • Do the genes having the motif always have similar activities (gene expression levels) across different conditions? • Interact with the same set of proteins? • Similar functions? • etc.

To search for new instances • Usually many false positives • Score cutoff is critical • Can estimate a score cutoff from the “true” binding sites • Motif finding -> scoring function -> a set of scores for the “true” sites • Take mean - std as a cutoff (or a cutoff such that the majority of “true” sites can be predicted).
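Both ways of choosing the cutoff are easy to express in Python. This sketch assumes the scores of the "true" sites are already available as a list; it uses the sample standard deviation (the slide does not say which estimator is meant), and the function names and the default fraction of 0.9 are illustrative choices.

```python
import math
from statistics import mean, stdev


def cutoff_mean_minus_std(true_site_scores):
    """Cutoff = mean - std of the scores of the known sites (sample std assumed)."""
    return mean(true_site_scores) - stdev(true_site_scores)


def cutoff_for_fraction(true_site_scores, fraction=0.9):
    """Largest cutoff that still recovers at least `fraction` of the known sites."""
    ranked = sorted(true_site_scores, reverse=True)
    needed = math.ceil(fraction * len(ranked))  # number of sites that must pass
    return ranked[max(needed, 1) - 1]


if __name__ == "__main__":
    scores = [2, 4, 4, 4, 5, 5, 7, 9]  # invented scores for illustration
    print(cutoff_mean_minus_std(scores))
    print(cutoff_for_fraction(scores, 0.75))
```

A lower cutoff recovers more of the known sites but also admits more false positives when scanning new sequences, so the choice is a trade-off rather than a fixed rule.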

To search for new instances • Use other information, such as positional biases of motifs, to restrict the regions in which a motif may appear • Use gene expression data to help: the genes having the true motif should have similar activities • Phylogenetic conservation is the key

Final project • Write a review paper on a topic that we didn’t cover in lectures, or: • Implement an algorithm and do some experiments • Compare several algorithms (existing implementations OK) • Combine several algorithms to form a pipeline (e.g. gene expression + motif analysis) • Final: • 5-10 page report (single-spaced, single column, 12pt) + 15-minute presentation

Possible topics for term paper • Haplotype inference • Computational challenges associated with new microarray technologies • Phylogenetic footprinting • Small RNA gene / target prediction (siRNA, miRNA, …) • Biomedical text mining • Protein structure prediction • Topology of biological networks

An example project • Given a gene expression data set (say, cell cycle) • Cluster genes using k-means • Find motifs using several algorithms • (Cluster and combine similar motifs) • Rank motifs according to their specificity to the target sequences compared to the other clusters • Get their logos • Use the motifs to search the whole genome for more genes with the motif • Do they have any functional significance?
