Regular Meeting, 14th Jan 2014. Outline: Introduction, Materials, Methodology, Results, Discussion, Conclusion. Transcription Factors (TFs) and TF-binding sites (TFBSs).
The Central Dogma
A motif logo example
Length: L
A k-order de Bruijn sequence contains all $4^k$ overlapping k-mers exactly once.
Subsequences of a de Bruijn sequence, overlapping at the start and the end of each subsequence
The subsequences are put on an array
Perform a protein-binding experiment on the array. The brighter the spot, the stronger the binding
“k-mer: a string of length k”
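The de Bruijn property above can be checked with a short sketch. It uses the standard Lyndon-word construction for a cyclic de Bruijn sequence; the function name `de_bruijn` is illustrative, and this is not necessarily how the array probes in the study were designed.

```python
from itertools import product


def de_bruijn(alphabet: str, k: int) -> str:
    """Return a cyclic de Bruijn sequence of order k over `alphabet`."""
    n = len(alphabet)
    a = [0] * (n * k)
    seq = []

    def db(t: int, p: int) -> None:
        if t > k:
            if k % p == 0:
                seq.extend(a[1:p + 1])
        else:
            a[t] = a[t - p]
            db(t + 1, p)
            for j in range(a[t - p] + 1, n):
                a[t] = j
                db(t + 1, t)

    db(1, 1)
    return "".join(alphabet[i] for i in seq)


k = 3
cyclic = de_bruijn("ACGT", k)
# Unroll the cycle so that k-mers spanning the wrap-around are also visible.
linear = cyclic + cyclic[:k - 1]
kmers = [linear[i:i + k] for i in range(len(cyclic))]
all_kmers = ["".join(p) for p in product("ACGT", repeat=k)]
print(len(cyclic), sorted(kmers) == sorted(all_kmers))
```

The cyclic sequence has length $4^k$; appending its first k-1 letters gives a linear sequence in which every k-mer occurs exactly once.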
The protein binding microarray (PBM) was developed to measure the binding preference of a protein for a complete set of k-mers in vitro. Here are some of its properties:
[Figure: PBM probe design. Labels: probe length L, k-mer length k, the segments of a de Bruijn sequence, primer sequence, normalized signal intensity.]
PWM assumes independence between columns
Source: http://en.wikipedia.org/wiki/Position_weight_matrix
[1] Erill, Ivan, and Michael C. O'Neill. "A reexamination of information theory-based methods for DNA-binding site identification." BMC Bioinformatics 10.1 (2009): 57.
[2] Stormo, Gary D. "Maximally efficient modeling of DNA sequence motifs at all levels of complexity." Genetics 187.4 (2011): 1219-1224.
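To make the independence assumption concrete, here is a minimal sketch of PWM scoring. The three-column matrix and the uniform background of 0.25 are made-up values for illustration; the point is only that the score is a sum of per-column terms, so no column can depend on its neighbours.

```python
import math

# Hypothetical 3-column PWM: one probability distribution per motif position.
PWM = [
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
]


def pwm_log_odds(pwm, site, background=0.25):
    """Log-odds score of `site`: each column contributes independently."""
    if len(site) != len(pwm):
        raise ValueError("site length must equal the number of PWM columns")
    return sum(math.log2(col[base] / background) for col, base in zip(pwm, site))


print(pwm_log_odds(PWM, "ACG"), pwm_log_odds(PWM, "TTT"))
```

Because the score factorizes by column, a PWM cannot express a preference such as "C at position 2 only when A is at position 1", which is one motivation for richer models such as HMMs.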
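[Figure: an HMM illustrated with two coins, A and B. The model: transitions between the coins and each coin's probabilities of a head (H) and a tail (T); probability labels in the figure: 0.9, 0.1, 0.8, 0.2, 0.75, 0.25, and 0.5. A possible run: a sequence of coins used and the heads and tails observed.]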
CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2013
The coins used are hidden; you can only observe whether each toss is a head or a tail.
[Figure: hidden coins ?, ?, ?; observed outcomes T, H, T.]
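When the coins are hidden, the probability of an observed head/tail sequence is obtained by summing over all possible coin sequences, which the forward algorithm does efficiently. The sketch below is a minimal version for a two-coin model; every probability in it (start, transition, and emission) is an assumption chosen for illustration, not a value read reliably from the slide.

```python
STATES = ("A", "B")
# All parameters below are illustrative assumptions.
START = {"A": 0.5, "B": 0.5}
TRANS = {"A": {"A": 0.9, "B": 0.1}, "B": {"A": 0.2, "B": 0.8}}
EMIT = {"A": {"H": 0.5, "T": 0.5}, "B": {"H": 0.25, "T": 0.75}}


def forward(obs, states, start, trans, emit):
    """Probability of the observation sequence, summed over all hidden paths."""
    # alpha[s] = P(observations so far, current hidden state = s)
    alpha = {s: start[s] * emit[s][obs[0]] for s in states}
    for symbol in obs[1:]:
        alpha = {
            s: emit[s][symbol] * sum(alpha[r] * trans[r][s] for r in states)
            for s in states
        }
    return sum(alpha.values())


print(forward("THT", STATES, START, TRANS, EMIT))
```

A quick sanity check on any such implementation is that the probabilities of all observation sequences of a fixed length sum to 1.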
Intensity: 936656.772613291
Intensity: 425493.538359013
……
>= 40,000 sequences
Each sequence has a length of 35 to 40 bp
[Figure: workflow overview with Steps 0, 1, 2, 3, 4a, and 4b.]
Steps 0-3: Training (on one array)
Step 4a: Testing (on another array)
Step 4b: Analyzing the trained HMM
Primer
Example (k = 8)
Probe sequences and their intensities:
CCATGGGCA 601
ATGCCGGAC 552
CATGGGCAA 579
All 8-mers in the probes:
{CCATGGGC, CATGGGCA, ATGCCGGA, TGCCGGAC, CATGGGCA, ATGGGCAA}
After removing duplicates:
{CCATGGGC, CATGGGCA, ATGCCGGA, TGCCGGAC, ATGGGCAA}
Each 8-mer with the intensities of the probes that contain it and their median*:
CCATGGGC {601} 601
CATGGGCA {579, 601} 590
ATGCCGGA {552} 552
TGCCGGAC {552} 552
ATGGGCAA {579} 579
* The median computation has been simplified for presentation purposes.
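The example above can be reproduced with a few lines of Python. This is a sketch of the simplified procedure shown on the slide only: a sliding window of width k over each probe, then the median intensity per k-mer. The function name `kmer_medians` is illustrative, and details the slide does not show (for example, reverse complements) are left out.

```python
from collections import defaultdict
from statistics import median

# Probe sequences and intensities from the example.
PROBES = [("CCATGGGCA", 601), ("ATGCCGGAC", 552), ("CATGGGCAA", 579)]


def kmer_medians(probes, k):
    """Map each k-mer to the median intensity of the probes that contain it."""
    hits = defaultdict(list)
    for seq, intensity in probes:
        # A set, so a k-mer repeated within one probe is counted once.
        for kmer in {seq[i:i + k] for i in range(len(seq) - k + 1)}:
            hits[kmer].append(intensity)
    return {kmer: median(values) for kmer, values in hits.items()}


MEDIANS = kmer_medians(PROBES, 8)
for kmer, value in sorted(MEDIANS.items()):
    print(kmer, value)
```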
Example (continued)
$m$ = median* of {552, 579, 590, 601} = 584.5
$\sigma$ = std* of {552, 579, 590, 601} = 21.016
Assume** $n = 0.5$, so the threshold is $m + n\sigma \approx 595$
CCATGGGC 601 > 595 +ve
CATGGGCA 590 < 595 -ve
ATGCCGGA 552 < 595 -ve
TGCCGGAC 552 < 595 -ve
ATGGGCAA 579 < 595 -ve
The $i$-th k-mer is +ve if $x_i > m + n\sigma$ and -ve otherwise, where $x_i$ is the intensity associated with the $i$-th k-mer.
* The median and std computations have been simplified for presentation purposes.
** $n$ is set to 4 by default in this study.
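The thresholding step can be sketched as follows. As on the slide, the median and the (sample) standard deviation are taken over the distinct k-mer intensities {552, 579, 590, 601}, which is the slide's simplification rather than the exact procedure of the study; `label_kmers` is an illustrative name.

```python
from statistics import median, stdev

# Median intensity per k-mer, as in the example.
KMER_INTENSITY = {
    "CCATGGGC": 601,
    "CATGGGCA": 590,
    "ATGCCGGA": 552,
    "TGCCGGAC": 552,
    "ATGGGCAA": 579,
}


def label_kmers(kmer_intensity, n):
    """Label a k-mer +ve when its intensity exceeds median + n * std."""
    values = sorted(set(kmer_intensity.values()))  # slide's simplification
    threshold = median(values) + n * stdev(values)
    labels = {
        kmer: "+ve" if x > threshold else "-ve"
        for kmer, x in kmer_intensity.items()
    }
    return threshold, labels


THRESHOLD, LABELS = label_kmers(KMER_INTENSITY, n=0.5)
print(round(THRESHOLD, 3), LABELS)
```

With n = 0.5 only CCATGGGC is labeled +ve; with the study's default of n = 4 none of the five example k-mers would pass, which is why the slide uses a smaller n.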
Example
CCATGGGC +ve
GGGCATTT +ve
CGGATTTC +ve
GGACTTTA +ve
TTTACCAT +ve
Multiple Sequence Alignment (MSA)
Actual rank: 3, 4, 1, …, 2
Predicted rank: 4, 3, 2, …, 1
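Spearman’s rank correlation coefficient
$\rho = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2 \sum_i (y_i - \bar{y})^2}}$, where $x_i$ and $y_i$ are the actual and predicted ranks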
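Spearman's coefficient is the Pearson correlation computed on ranks, which the sketch below implements directly (ties receive their average rank). The four-item input is a hypothetical example, not the elided data from the slide.

```python
import math


def average_ranks(values):
    """1-based ranks in ascending order; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of positions i+1 .. j+1
        for t in range(i, j + 1):
            ranks[order[t]] = avg
        i = j + 1
    return ranks


def spearman(actual, predicted):
    """Pearson correlation of the two rank vectors."""
    x, y = average_ranks(actual), average_ranks(predicted)
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical ranks for four sequences.
print(spearman([3, 4, 1, 2], [4, 3, 2, 1]))
```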
Predicted signal intensity
Testing
The HMM approach
For each test sequence: predicted signal intensity = max. of the values associated with its k-mers
Predicted signal intensity
Testing
Gold-standard k-mer approach
For each test sequence: predicted signal intensity = max. of the values associated with its k-mers
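The "max over k-mers" rule can be sketched as below. The per-k-mer values are assumed to be available as a lookup table (here, the median intensities from the earlier example); what those values are in each approach, and how unseen k-mers are handled, are assumptions of this sketch.

```python
# Hypothetical table of per-k-mer values (median intensities from the example).
KMER_VALUE = {
    "CCATGGGC": 601,
    "CATGGGCA": 590,
    "ATGCCGGA": 552,
    "TGCCGGAC": 552,
    "ATGGGCAA": 579,
}


def predict_intensity(seq, kmer_value, k, default=0.0):
    """Predicted intensity of a sequence: the maximum value over its k-mers.

    k-mers missing from the table contribute `default` (an assumption).
    """
    windows = (seq[i:i + k] for i in range(len(seq) - k + 1))
    return max((kmer_value.get(w, default) for w in windows), default=default)


print(predict_intensity("CATGGGCAA", KMER_VALUE, k=8))
```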
Actual class labels: +, -, +, …, -
Predicted class labels: +, +, -, …, +
How to obtain the actual class labels?
Given a list of actual intensities, compute the median $m$ and the standard deviation (std) $\sigma$. A sequence is +ve if its intensity >= $m + n\sigma$; otherwise, it is -ve.
How to obtain the predicted class labels?
After running the forward algorithm, each sequence is assigned a predicted binding intensity. We choose the threshold at which specificity = 99% and use it to perform the classification.
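One way to pick a threshold at a fixed specificity is to sort the predicted intensities of the actual negatives and allow at most 1% of them above the cut. The sketch below does this and then reports the true positive rate; the function names and the tie-handling (predict +ve only when strictly above the threshold) are assumptions, not the authors' exact procedure.

```python
import math


def threshold_at_specificity(scores, labels, specificity=0.99):
    """Threshold such that at most (1 - specificity) of negatives score above it."""
    negatives = sorted(s for s, is_pos in zip(scores, labels) if not is_pos)
    if not negatives:
        raise ValueError("need at least one negative sequence")
    # Number of false positives we can tolerate (small epsilon guards rounding).
    allowed_fp = math.floor((1 - specificity) * len(negatives) + 1e-9)
    allowed_fp = min(allowed_fp, len(negatives) - 1)
    return negatives[len(negatives) - 1 - allowed_fp]


def true_positive_rate(scores, labels, threshold):
    """Fraction of actual positives predicted +ve (score strictly above threshold)."""
    positives = [s for s, is_pos in zip(scores, labels) if is_pos]
    return sum(s > threshold for s in positives) / len(positives)


# Hypothetical data: 100 negatives scored 0..99 and four positives.
SCORES = list(range(100)) + [50, 99.5, 150, 200]
LABELS_TRUE = [False] * 100 + [True] * 4
CUT = threshold_at_specificity(SCORES, LABELS_TRUE)
print(CUT, true_positive_rate(SCORES, LABELS_TRUE, CUT))
```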
kmerHMM performs best compared with the other algorithms, except on Ceh-22.
SR: Spearman rank correlation
TPR: True Positive Rate
AUC: Area Under Curve
Comparison with RankMotif++, which is also a k-mer approach.
Each path can be used to generate a PWM, simply by using the emission probabilities of each state on the path as a matrix column.
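Reading a PWM off a state path amounts to stacking the emission distributions of the visited states. The sketch below shows this with hypothetical state names and emission probabilities; how the paths themselves are selected from the trained HMM is not covered here.

```python
# Hypothetical emission distributions of three HMM states.
EMISSION = {
    "s1": {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    "s2": {"A": 0.1, "C": 0.7, "G": 0.1, "T": 0.1},
    "s3": {"A": 0.1, "C": 0.1, "G": 0.7, "T": 0.1},
}


def path_to_pwm(path, emission):
    """One PWM column per state on the path: that state's emission distribution."""
    return [dict(emission[state]) for state in path]


def consensus(pwm):
    """Most probable base in each column."""
    return "".join(max(col, key=col.get) for col in pwm)


PATH_PWM = path_to_pwm(["s1", "s2", "s3"], EMISSION)
print(consensus(PATH_PWM))
```

Different paths through the same HMM yield different PWMs, which is how a single trained model can describe more than one motif.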
Verification:
We can observe that the motif logos generated by kmerHMM are similar to those of the other approaches, which are based on PWMs.
This experiment also shows that HMM modeling is necessary for multimodal motifs, in contrast to other models in which the state transition topology is manually restricted to a single principal state transition path.
Blue: TFs with one DNA motif in one array
Green: TFs with two DNA motifs in one array
Red: TFs with three DNA motifs in one array
Novelty lies in two aspects.
Limitation: The use of sliding window
Implication: Not limited to motif discovery
Freely accessible in the UniProbe Database
http://the_brain.bwh.harvard.edu/uniprobe/downloads.php
Freely accessible in the authors’ website
http://www.cs.toronto.edu/~wkc/kmerHMM/downloads.html