
Regular Meeting 14th Jan 2014


Presentation Transcript


  1. Regular Meeting 14th Jan 2014

  2. Outline • Introduction • Materials • Methodology • Results • Discussion • Conclusion

  3. Transcription Factor (TF) and TF-binding sites (TFBSs) • To fully understand a gene’s function, it is essential to identify the TFs that regulate the gene and the corresponding TF-binding sites (TFBSs). • TFBSs are relatively short (10–20 bp) and highly degenerate sequence motifs, which makes their effective identification a challenging task. [Figures: the central dogma; an example motif logo]

  4. Experimental Techniques • A fundamental bottleneck in TFBS identification is the lack of quantitative binding affinity data for a large proportion of the TFs. • The advancement of new high-throughput technologies such as ChIP-chip and ChIP-seq has made it possible to determine the binding affinity of these TFs. • While ChIP-chip and ChIP-seq are in vivo (“within the living”) approaches, another high-throughput approach, known as the Protein Binding Microarray (PBM), has been introduced. • It enables us to measure the DNA sequence binding of TFs in vitro (in a test tube) and in a very comprehensive manner.

  5. Protein Binding Microarray (PBM) • A k-order de Bruijn sequence contains all 4^k overlapping k-mers exactly once (a k-mer is a string of length k). • Subsequences of the de Bruijn sequence, overlapping at their starts and ends, are placed on an array of probes of length L. • A protein binding experiment is performed on the array: the brighter the spot, the stronger the binding.
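As a side illustration of how a de Bruijn design covers every k-mer, the sketch below uses the standard Lyndon-word construction to build a de Bruijn sequence over the DNA alphabet. It is a generic illustration of the idea, not the actual probe design used on the arrays.

```python
def de_bruijn(alphabet: str, k: int) -> str:
    """Cyclic de Bruijn sequence of order k over `alphabet` (standard Lyndon-word method)."""
    n = len(alphabet)
    a = [0] * (n * k)
    seq = []

    def db(t, p):
        if t > k:
            if k % p == 0:
                seq.extend(a[1:p + 1])
        else:
            a[t] = a[t - p]
            db(t + 1, p)
            for j in range(a[t - p] + 1, n):
                a[t] = j
                db(t + 1, t)

    db(1, 1)
    return "".join(alphabet[i] for i in seq)

cyclic = de_bruijn("ACGT", 3)            # 4^3 = 64 bases covering every 3-mer once (cyclically)
linear = cyclic + cyclic[:2]             # unwrap so every 3-mer also appears as a plain substring
kmers = {linear[i:i + 3] for i in range(len(linear) - 2)}
print(len(cyclic), len(kmers))           # 64 64
```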

  6. The properties of PBM • PBM was developed to measure the binding preference of a protein to a complete set of k-mers in vitro. Some of its properties: • Each probe has a length (L) considerably greater than the width of the DNA binding site (k) that we intend to inspect, so the binding site will be contained within each spot. • All possible k-mer sequence variants are studied in the experiment in a high-throughput manner, and the binding intensity of each of them can be obtained. • Due to the limitations of the technology, k can only be set to 8–10. However, this already enables us to study DNA binding sites in a comprehensive and efficient manner.

  7. Protein Binding Microarray Data [Figure: array probes showing the segments of a de Bruijn sequence, the primer sequence, and the normalized signal intensities]

  8. Motivation • Given PBM data, a new branch of algorithms is needed that takes the quantitative affinity data into account to uncover a motif model, i.e. motif discovery in PBM data. • Existing algorithms, including MatrixREDUCE, MDScan, PREGO, RankMotif++, and Seed and Wobble, all make use of the most common motif model, the Position Weight Matrix (PWM). A PWM assumes independence between columns. (Source: http://en.wikipedia.org/wiki/Position_weight_matrix)
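To make the independence assumption concrete, here is a small sketch that scores a sequence against a made-up 4-column PWM; the matrix values are hypothetical and each position contributes to the score independently of its neighbours.

```python
import numpy as np

# Hypothetical 4-column PWM: rows are A, C, G, T; columns are motif positions.
pwm = np.array([
    [0.7, 0.1, 0.1, 0.1],   # A
    [0.1, 0.1, 0.7, 0.1],   # C
    [0.1, 0.7, 0.1, 0.1],   # G
    [0.1, 0.1, 0.1, 0.7],   # T
])
idx = {"A": 0, "C": 1, "G": 2, "T": 3}

def pwm_log_score(seq: str) -> float:
    # Independence assumption: the total score is a sum of per-position log-probabilities,
    # so each column's contribution ignores which bases occur at the neighbouring positions.
    return sum(np.log(pwm[idx[base], pos]) for pos, base in enumerate(seq))

print(pwm_log_score("AGCT"))   # the consensus of this toy matrix, highest score
print(pwm_log_score("AGCA"))   # one mismatch at the last position, lower score
```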

  9. Motivation • However, the independence assumption that the PWM makes is unrealistic in many cases [1]. • Although a recent attempt has been made to generalize the PWM, modeling insertion and deletion operations between adjacent nucleotide positions remains challenging [2]. [1] Erill, Ivan, and Michael C. O'Neill. "A reexamination of information theory-based methods for DNA-binding site identification." BMC Bioinformatics 10.1 (2009): 57. [2] Stormo, Gary D. "Maximally efficient modeling of DNA sequence motifs at all levels of complexity." Genetics 187.4 (2011): 1219-1224.

  10. Contribution • In this work, the authors developed a hidden Markov model (HMM)-based approach to rigorously model the dependence between adjacent nucleotide positions, which outperforms existing approaches. • The authors also discovered that the HMM can be used to deduce multiple binding modes for a given TF, which is verified by comparisons with existing algorithms and by statistical tests.

  11. Hidden Markov Model (HMM) • There are two coins in my pocket. • Coin A is fair, with a 50% chance of heads (H) and a 50% chance of tails (T) in a toss. • Coin B is unfair, with a 25% chance of H and a 75% chance of T in a toss. • Now we do this: 1. Take out one of the coins from my pocket without letting you know which one it is. 2. Toss the coin and let you see the outcome. 3. Repeat #1 and #2 a few times (each time I could use a different coin). • Then you need to guess which coin I used each time. [The model and a possible run are shown as a state diagram on the slide; as read from it: start probabilities 0.5/0.5, transitions A→A 0.9, A→B 0.1, B→B 0.8, B→A 0.2, emissions A: H 0.5/T 0.5, B: H 0.25/T 0.75, with the observed run T H T.] (Slide credit: CSCI3220 Algorithms for Bioinformatics, Kevin Yip, CSE CUHK, Fall 2013)

  12. Hidden Markov Model (HMM) • From my perspective: the model and the run are fully known, since I know which coin was used at each step and the outcome it produced. • From your perspective: the coins used are hidden (? ? ?) and you can only observe whether each toss is a head or a tail (T H T). (Slide credit: CSCI3220 Algorithms for Bioinformatics, Kevin Yip, CSE CUHK, Fall 2013)

  13. Three algorithms in HMM • The forward algorithm (evaluation): how likely is an observed sequence under a given model? • The Viterbi algorithm (decoding): what is the most probable sequence of hidden states behind the observations? • The Baum-Welch algorithm (training): how should the model parameters be estimated from the observations? (Slide credit: CSCI3220 Algorithms for Bioinformatics, Kevin Yip, CSE CUHK, Fall 2013)
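To make the coin example of slides 11-12 concrete, here is a minimal sketch (not from the slides) that runs the forward and Viterbi algorithms on the two-coin model; the start and transition probabilities are my reading of the slide's state diagram and are therefore assumptions.

```python
import numpy as np

# Two-coin HMM from slides 11-12. The start/transition values are my reading of the
# slide's state diagram (A->A 0.9, A->B 0.1, B->B 0.8, B->A 0.2, start 0.5/0.5).
states = ["A", "B"]                      # A = fair coin, B = unfair coin
start  = np.array([0.5, 0.5])
trans  = np.array([[0.9, 0.1],           # from A: to A, to B
                   [0.2, 0.8]])          # from B: to A, to B
emit   = np.array([[0.50, 0.50],         # coin A: P(H), P(T)
                   [0.25, 0.75]])        # coin B: P(H), P(T)
obs    = [1, 0, 1]                       # the observed run T, H, T (H = 0, T = 1)

# Forward algorithm: total probability of observing T, H, T under the model.
alpha = start * emit[:, obs[0]]
for o in obs[1:]:
    alpha = (alpha @ trans) * emit[:, o]
print("P(T, H, T) =", alpha.sum())

# Viterbi algorithm: most probable hidden sequence of coins behind T, H, T.
delta = np.log(start) + np.log(emit[:, obs[0]])
backpointers = []
for o in obs[1:]:
    scores = delta[:, None] + np.log(trans)          # scores[i, j]: best path ending i -> j
    backpointers.append(scores.argmax(axis=0))
    delta = scores.max(axis=0) + np.log(emit[:, o])
path = [int(delta.argmax())]
for bp in reversed(backpointers):
    path.append(int(bp[path[-1]]))
print("Most probable coins:", [states[i] for i in reversed(path)])
```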

  14. The authors’ objectives in applying HMMs to PBM data • For a TF, train a hidden Markov model (HMM) on DNA k-mers, with binding intensities obtained from Protein Binding Microarray (PBM) data, to represent the binding specificity of the TF. • The trained HMM can then be used to • Predict whether a DNA sequence can be bound by the TF • Rank the binding intensities of DNA sequences • Identify multiple binding modes in the PBM data

  15. Problem Definition

  16. Outline • Introduction • Materials • Methodology • Results • Discussion • Conclusion

  17. Protein Binding Microarray Data • Two arrays (data replicates) for each TF, e.g. Array #1 and Array #2 for TF Arid3a. • Each array contains >= 40,000 sequences; each sequence has a length of 35–40 bp and an associated intensity (e.g. 936656.772613291, 425493.538359013, ...).

  18. Two Datasets • Dataset 1: 5 TFs • Cbf1 • Ceh-22 • Oct-1 • Rap1 • Zif268 • Each TF has 2 arrays. • These PBM data have also been processed by algorithms including MatrixREDUCE, MDScan, PREGO, RankMotif++ and Seed and Wobble. • Dataset 2 (mouse data): 42 TFs • Arid3a • Ascl2 • Bcl6b • Bhlhb2 • E2F2 • … • Each TF has 2 arrays. • These PBM data have also been processed by the algorithm RankMotif++.

  19. Outline • Introduction • Materials • Methodology • Results • Discussion • Conclusion

  20. Methodology Overview [Flow diagram: Step 0 → Step 1 → Step 2 → Step 3 → Step 4a / Step 4b] • Steps 0–3: Training (on one array) • Step 4a: Testing on the other array • Step 4b: Analyzing the trained HMM

  21. Step 0: Identify all distinct k-mers • For each DNA sequence, trim the primer region. • For each DNA sequence, use a sliding window of length k to identify all the distinct k-mers (taking their reverse complements into consideration). • k = 8 is used throughout the study, in order to compare with the other approaches. Example (k = 8): the probes CCATGGGCA (intensity 601), ATGCCGGAC (552) and CATGGGCAA (579) yield the k-mers {CCATGGGC, CATGGGCA, ATGCCGGA, TGCCGGAC, CATGGGCA, ATGGGCAA}.
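A minimal sketch of Step 0 follows; the primer-trimming direction and the convention of mapping each window to the lexicographically smaller of itself and its reverse complement are assumptions, since the slides do not spell them out.

```python
from typing import Set

COMP = str.maketrans("ACGT", "TGCA")

def canonical(kmer: str) -> str:
    """Collapse a k-mer and its reverse complement onto a single representative."""
    rc = kmer.translate(COMP)[::-1]
    return min(kmer, rc)    # assumption: keep the lexicographically smaller of the two

def distinct_kmers(probe: str, k: int = 8, primer_len: int = 0) -> Set[str]:
    """Trim the primer region, then slide a window of length k over the remaining probe."""
    core = probe[:len(probe) - primer_len] if primer_len else probe   # assumes a 3'-end primer
    return {canonical(core[i:i + k]) for i in range(len(core) - k + 1)}

print(distinct_kmers("CCATGGGCAA", k=8))
```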

  22. Step 1: Build a signal intensity list and associate the median with each k-mer • For each distinct DNA k-mer, build a signal intensity list. • For each DNA k-mer, use the median of its signal intensity list to represent the signal of the k-mer. Example (k = 8), continuing from Step 0: CCATGGGC {601} → 601; CATGGGCA {579, 601} → 590; ATGCCGGA {552} → 552; TGCCGGAC {552} → 552; ATGGGCAA {579} → 579. * The median computation has been simplified for presentation purposes.

  23. Step 1.5: Label the k-mers as +ve / -ve • Compute the median m and the standard deviation σ of all the intensities associated with the k-mers. • Label each k-mer as +ve / -ve according to the following scheme: +ve if y_i ≥ m + n·σ, -ve otherwise, where y_i is the intensity associated with the i-th k-mer. Example: m = median* of {552, 579, 590, 601} = 584.5; σ = std* of {552, 579, 590, 601} = 21.016. Assume** n = 0.5, so m + n·σ = 595. CCATGGGC 601 > 595 → +ve; CATGGGCA 579 < 595 → -ve; ATGCCGGA 552 < 595 → -ve; TGCCGGAC 552 < 595 → -ve; ATGGGCAA 579 < 595 → -ve. * The median and std computations have been simplified for presentation purposes. ** n is set to 4 by default in this study.
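Steps 1 and 1.5 can be sketched together as below, assuming every probe contributes its normalized intensity to each canonical k-mer it contains, and using the m + n·σ threshold described above (n = 4 by default in the study).

```python
from collections import defaultdict
from statistics import median, stdev

COMP = str.maketrans("ACGT", "TGCA")

def canon(kmer: str) -> str:
    rc = kmer.translate(COMP)[::-1]
    return min(kmer, rc)    # collapse reverse complements, as in the Step 0 sketch

def label_kmers(probes, k=8, n=4.0):
    """probes: iterable of (sequence, normalized_intensity) pairs, primers already trimmed."""
    # Step 1: build one intensity list per distinct k-mer and summarise it by its median.
    intensity_lists = defaultdict(list)
    for seq, intensity in probes:
        for i in range(len(seq) - k + 1):
            intensity_lists[canon(seq[i:i + k])].append(intensity)
    signal = {kmer: median(vals) for kmer, vals in intensity_lists.items()}

    # Step 1.5: a k-mer is +ve if its signal reaches median + n * std of all k-mer signals.
    values = list(signal.values())
    threshold = median(values) + n * stdev(values)
    return {kmer: ("+ve" if s >= threshold else "-ve") for kmer, s in signal.items()}
```

Running this on the toy probes of slides 21-23 gives slightly different numbers than the worked example, since the slides deliberately simplify the median computation.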

  24. Step 2: Align the k-mers using Multiple Sequence Alignment (MSA) • Select only the +ve k-mers. • Align the k-mers using multiple sequence alignment. Example: the +ve k-mers CCATGGGC, GGGCATTT, CGGATTTC, GGACTTTA and TTTACCAT are fed into a multiple sequence alignment (MSA).

  25. Step 3: Train an HMM on the Multiple Sequence Alignment • Train an HMM with the multiple sequence alignment as input, using the Baum-Welch training algorithm. • 50 hidden states are used. • As each HMM is initialized randomly, the training is repeated 10 times to avoid suboptimal convergence.
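Step 3 can be sketched with an off-the-shelf discrete-emission HMM. The snippet below is a stand-in for the authors' own Baum-Welch implementation, assuming hmmlearn's CategoricalHMM is available; it uses 50 hidden states and 10 random restarts and keeps the restart with the highest log-likelihood.

```python
import numpy as np
from hmmlearn import hmm  # assumption: a recent hmmlearn providing CategoricalHMM

BASE = {"A": 0, "C": 1, "G": 2, "T": 3}

def train_kmer_hmm(aligned_kmers, n_states=50, restarts=10):
    """aligned_kmers: equal-length MSA rows; gap handling is assumed to happen upstream."""
    encoded = [[BASE[b] for b in row] for row in aligned_kmers]
    X = np.concatenate(encoded).reshape(-1, 1)        # hmmlearn expects one observation column
    lengths = [len(row) for row in encoded]           # one training sequence per MSA row

    best_model, best_ll = None, -np.inf
    for seed in range(restarts):                      # random restarts against suboptimal convergence
        model = hmm.CategoricalHMM(n_components=n_states, n_iter=200, random_state=seed)
        model.fit(X, lengths)
        ll = model.score(X, lengths)
        if ll > best_ll:
            best_model, best_ll = model, ll
    return best_model
```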

  26. Step 4a: Verification on testing DNA sequences • Test the HMM on the other array, i.e. if the HMM is trained on array #1, it is tested on array #2 (and vice versa). • Two testing procedures: • Given only the testing DNA sequences, ask the algorithm to rank them and compare with the original ranking. • Given only the testing DNA sequences, ask the algorithm to classify which ones are likely to be bound by the TF (+ve class) and which ones are less likely to be bound by the TF (-ve class).

  27. Step 4b: Find multimodal binding • Given an HMM, we find the top N most probable paths of L states using the N-max-product algorithm, i.e. a generalized version of the Viterbi algorithm, where L is the alignment length of the k-mers. • For each probable path, we can generate a position weight matrix (PWM) from the emission probabilities of the states along the path. • Single-linkage hierarchical clustering can then be applied to these PWMs to identify multiple binding modes of the TF.
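Reading a PWM off one probable path can be sketched as follows, assuming the trained model exposes an emission-probability matrix of shape (number of states, 4), as hmmlearn's emissionprob_ does; each visited state contributes one PWM column.

```python
import numpy as np

def path_to_pwm(emissionprob: np.ndarray, state_path) -> np.ndarray:
    """Stack the emission distributions of the visited states as PWM columns.

    emissionprob: (n_states, 4) array of P(base | state), columns ordered A, C, G, T
                  (hmmlearn exposes this as model.emissionprob_; an assumption here).
    state_path:   sequence of state indices of length L (the k-mer alignment length).
    Returns a 4 x L position weight matrix, one column per alignment position.
    """
    return np.column_stack([emissionprob[s] for s in state_path])
```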

  28. Outline • Introduction • Materials • Methodology • Results • Discussion • Conclusion

  29. Results Overview • Step 4a: Verification on testing DNA sequences • Experiment 1: evaluate the algorithms by ranking and classification on dataset 1, which consists of 5 TFs. • Experiment 2: evaluate the algorithms by ranking and classification on dataset 2 (mouse data), which consists of 42 TFs. • Step 4b: Find multimodal binding • Experiment 3: analyze the HMMs trained on dataset 1 • Experiment 4: analyze the HMMs trained on dataset 2

  30. Experiment 1 (Ranking) • On the testing dataset, given only the DNA sequences, ask the algorithms to rank them according to their binding intensities. • The algorithms are trained on array #1 and tested on array #2 (or vice versa). • The similarity between the predicted rank and the actual rank is measured by Spearman’s rank correlation coefficient, ρ = 1 - 6·Σ d_i² / (n(n² - 1)), where d_i is the difference between the actual and predicted ranks of the i-th sequence. The higher the value, the better the performance. (Example: actual ranks 3, 4, 1, …, 2 vs. predicted ranks 4, 3, 2, …, 1.)
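The ranking metric can be computed directly with SciPy; the intensities below are made-up values for illustration, not the paper's data.

```python
from scipy.stats import spearmanr

# Hypothetical actual vs. predicted intensities for four testing sequences.
actual_intensity    = [936656.8, 425493.5, 512000.0, 377210.4]
predicted_intensity = [890000.0, 410000.0, 530000.0, 300000.0]

rho, _ = spearmanr(actual_intensity, predicted_intensity)
print(f"Spearman rho = {rho:.3f}")  # 1.000 here: the predicted ordering matches the actual one
```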

  31. HMM: Use the Forward Algorithm to predict the binding intensity • HMM approach: • Use a sliding window to scan all possible k-mers in a sequence. • Use the forward algorithm to associate each k-mer with a probability value. • Select the k-mer with the maximum probability value and assign this value to the sequence as its predicted binding intensity (i.e. the predicted signal intensity of each testing sequence is the maximum of the values associated with its k-mers). • Forward algorithm: given an HMM and a k-mer, the algorithm computes how likely the k-mer is to be generated by the HMM.

  32. The gold standard k-mer approach • Gold-standard k-mer approach: • Use a sliding window to scan all possible k-mers in a sequence. • Recall that in Step 1, we have already associated each k-mer with the median of a list of signal intensities. • Select the k-mer with the maximum associated median intensity and assign this value to the sequence as its predicted binding intensity. • Why is it called the gold standard? Because the k-mer medians are derived from actual signal intensities, information that should be excluded in a testing dataset.

  33. Results of Experiment 1 (Ranking) • From these results, we can observe that kmerHMM performs better than the other methods on three datasets (Cbf1, Oct-1 and Zif268). • On the two other datasets (Ceh-22 and Rap1), kmerHMM is not the top performer but is close. • In the case of Rap1, kmerHMM performed slightly worse than the other methods. The consensus binding motif for Rap1 is 13 nt long, which is longer than that of most common TFs. • kmerHMM only considers motifs of 8 bp; it is therefore at a disadvantage.

  34. Experiment 1 (Classification) • On the testing dataset, given only the DNA sequences, ask the algorithms to classify them as +ve or -ve. • The algorithms are trained on array #1 and tested on array #2 (or vice versa). • The problem can be treated as a binary classification problem and thus evaluated by AUC. • How are the actual class labels obtained? Given the list of actual intensities, compute the median m and the standard deviation σ; a sequence is +ve if its intensity ≥ m + n·σ, and -ve otherwise. • How are the predicted class labels obtained? After applying the forward algorithm, each sequence is assigned a predicted binding intensity. We find a threshold such that the specificity is 99%, and use this threshold to perform the classification. (Example: actual labels +, -, +, …, - vs. predicted labels +, +, -, …, +.)
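A sketch of the classification evaluation, assuming the 99%-specificity threshold is taken as the corresponding quantile of the -ve sequences' predicted scores; scikit-learn provides the AUC.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

def evaluate_classification(actual_labels, predicted_scores, specificity=0.99):
    """actual_labels: 1 for +ve, 0 for -ve; predicted_scores: predicted binding intensities."""
    y = np.asarray(actual_labels)
    s = np.asarray(predicted_scores)

    # Threshold chosen so that `specificity` of the -ve sequences fall below it,
    # i.e. roughly 1% of -ve sequences are called +ve at this cutoff.
    threshold = np.quantile(s[y == 0], specificity)
    predicted_labels = (s >= threshold).astype(int)

    tpr = predicted_labels[y == 1].mean()       # true positive rate at ~99% specificity
    auc = roc_auc_score(y, s)                   # threshold-free area under the ROC curve
    return tpr, auc
```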

  35. Results of Experiment 1 (Classification) kmerHMM performs the best among the compared algorithms, except on Ceh-22.

  36. Results of Experiment 1 (Classification)

  37. Results of Experiment 2 (on 42 TFs) (Ranking & Classification) SR: Spearman rank correlation; TPR: true positive rate; AUC: area under the ROC curve. kmerHMM is compared with RankMotif++, which is also a k-mer-based approach.

  38. Results of Experiment 3: Find multimodal binding in dataset 1 • Run the N-max-product algorithm on the HMM trained for each TF in dataset 1 to find the top N most probable paths. • Analogy: • Viterbi algorithm: given an HMM and an observed sequence of length L, compute the most probable path of state transitions of length L. • N-max-product algorithm: given an HMM and a parameter L, compute the top N most probable paths of state transitions of length L. (In this study, N is increased until the probability of the (N+1)-th path falls below 0.001.) • Each path can be used to generate a PWM, by simply using the emission probabilities of each visited state as a matrix column.

  39. The most probable path Verification: we can observe that the motif logos generated by kmerHMM are similar to those of the other approaches, which are based on PWMs.

  40. Top N most probable paths • Single-linkage hierarchical clustering is applied to the PWMs generated from the top N most probable paths. • A dendrogram cutoff was chosen such that the mean of a cluster validity measure, the silhouette value (the higher, the better), was maximized. • Oct-1 is shown as an illustrative example because it was found to have two clusters.
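The clustering step can be sketched as below, assuming each PWM has been flattened into a vector and compared with Euclidean distance (the paper's distance measure may differ); SciPy's single-linkage clustering and scikit-learn's silhouette score choose the dendrogram cutoff.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist
from sklearn.metrics import silhouette_score

def cluster_pwms(pwm_vectors: np.ndarray):
    """pwm_vectors: (n_paths, 4 * L) array, one flattened PWM per probable path."""
    dists = pdist(pwm_vectors)               # Euclidean by default; the paper's measure may differ
    tree = linkage(dists, method="single")   # single-linkage hierarchical clustering

    best_labels, best_sil = None, -1.0
    # Scan dendrogram cutoffs (parameterised here by the number of clusters) and keep
    # the one whose mean silhouette value is highest.
    for n_clusters in range(2, len(pwm_vectors)):
        labels = fcluster(tree, t=n_clusters, criterion="maxclust")
        if len(set(labels)) < 2:
            continue
        sil = silhouette_score(pwm_vectors, labels)
        if sil > best_sil:
            best_labels, best_sil = labels, sil
    return best_labels, best_sil
```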

  41. Top N most probable paths on Oct-1 This experiment also shows that HMM modeling is necessary for multimodal motifs, compared with other models in which the state-transition topology is manually restricted to a single principal state-transition path.

  42. Results of Experiment 4: Find multimodal binding in dataset 2 • Apply the N-max-product algorithm to all the HMMs trained on the mouse PBM data to discover how frequently a mouse TF can bind to multiple motifs. • An additional requirement is imposed to ensure that the distance between any two output motif models is larger than a threshold t, where t ∈ {0.3, 0.5, 0.7} in this study.

  43. Results of Experiment 4:Find multimodal binding in dataset 2 Blue: TFs with one DNA motif in one array Green: TFs with two DNA motifs in one array Red: TFs with three DNA motifs in one array

  44. Outline • Introduction • Materials • Methodology • Results • Discussion • Conclusion

  45. Discussion The novelty lies in two aspects. • First, kmerHMM outperforms existing methods by using an HMM to derive a model representing PBM data. To the authors’ knowledge, this is the first instance of an HMM being used to represent PBM data, and the first study incorporating HMMs into the PBM motif discovery problem. • Second, kmerHMM incorporates the N-max-product algorithm and can derive multiple motif matrix models to represent PBM data. In a broader sense, this work is also the first study explicitly incorporating max-product algorithms into the general motif discovery problem.

  46. Discussion Limitation: the use of a sliding window • A potential drawback of the proposed approach is that it relies on a sliding window to segment the DNA probe sequences into individual k-mers, which may lose sequence context information. • We expect this limitation to be alleviated once improved PBM technology can generate binding affinities for longer probes (i.e. a higher k value).

  47. Discussion Implication: not limited to motif discovery • From the research framework, we can learn how to make use of k-mers to train an HMM for pattern recognition, rather than relying on simple exact or approximate matching. • From the state-transition path analysis, we can observe that HMM training is effective in handling multimodal pattern recognition, which other modeling methods may not be able to handle.

  48. Outline • Introduction • Materials • Methodology • Results • Discussion • Conclusion

  49. Conclusion • In this study, the authors proposed a computational pipeline for PBM motif discovery in which HMMs are trained to model DNA motifs, and belief propagation is used to elucidate multiple motif models from each trained HMM. • The new algorithm, kmerHMM, was compared with other existing methods on benchmark PBM datasets and demonstrated its effectiveness and uniqueness. • The authors also demonstrated that kmerHMM can capture multiple binding modes of a DNA-binding protein, which a single position weight matrix (PWM) model is unable to do. • The authors foresee that a method like kmerHMM will provide biological insights and will be useful in this arena and other domains.

  50. Protein Binding Microarray Data Freely accessible in the UniPROBE database: http://the_brain.bwh.harvard.edu/uniprobe/downloads.php
