Regular Meeting 14th Jan 2014
Presentation Transcript

Regular Meeting

14th Jan 2014


Outline

  • Introduction

  • Materials

  • Methodology

  • Results

  • Discussion

  • Conclusion


Transcription Factor (TF) and TF-binding sites (TFBSs)

  • To fully understand a gene’s function, it is essential to identify the TFs that regulate the gene and the corresponding TF-binding sites (TFBSs).

  • TFBSs are relatively short (10–20 bp) and highly degenerate sequence motifs, which makes their effective identification a challenging task.

The Central Dogma

A motif logo example


Experimental Techniques

  • A fundamental bottleneck in TFBS identification is the lack of quantitative binding-affinity data for a large proportion of TFs.

  • The advent of new high-throughput technologies such as ChIP-chip and ChIP-seq has made it possible to determine the binding affinities of these TFs.

  • While ChIP-chip and ChIP-seq are in vivo (“within the living”) approaches, another high-throughput approach, known as the Protein Binding Microarray (PBM), has been introduced.

  • It enables us to measure the DNA-sequence binding of TFs in vitro (in a test tube) and in a very comprehensive manner.


Protein Binding Microarray (PBM)

[Figure: PBM probe design. Each probe has length L.]

A k-order de Bruijn sequence contains all 4^k overlapping k-mers exactly once.

Subsequences of the de Bruijn sequence, overlapping at their starts and ends, are put on an array. A protein-binding experiment is then performed on the array: the brighter the spot, the stronger the binding.

(A k-mer is a string of length k.)
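The de Bruijn property above can be illustrated with a short sketch (the classic Lyndon-word construction; the function name is my own):

```python
def de_bruijn(alphabet, k):
    """Generate a k-order de Bruijn sequence: read cyclically, every
    k-mer over the alphabet appears exactly once."""
    n = len(alphabet)
    a = [0] * (k * n)
    seq = []

    def db(t, p):
        if t > k:
            if k % p == 0:
                seq.extend(a[1:p + 1])
        else:
            a[t] = a[t - p]
            db(t + 1, p)
            for j in range(a[t - p] + 1, n):
                a[t] = j
                db(t + 1, t)

    db(1, 1)
    return "".join(alphabet[i] for i in seq)

s = de_bruijn("ACGT", 3)                 # length 4^3 = 64
cyclic = s + s[:2]                       # wrap around to read cyclically
kmers = [cyclic[i:i + 3] for i in range(len(s))]
```

Reading the sequence cyclically, the 64 windows of width 3 are exactly the 64 possible DNA 3-mers, each occurring once.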


The properties of PBM

PBM was developed to measure the binding preference of a protein to a complete set of k-mers in vitro. Here are some of its properties:

  • Each probe has a length (L) considerably greater than the width of the DNA-binding site (k) that we intend to inspect, so the binding site will be contained within each spot.

  • All possible k-mer sequence variants are covered by the experiment in a high-throughput manner, and the binding intensity of each can be obtained.

  • Due to technological limitations, k can only be set to 8–10. Nevertheless, this already enables us to study DNA-binding sites in a comprehensive and efficient manner.


Protein Binding Microarray Data

[Figure: each probe consists of a segment of the de Bruijn sequence plus a primer sequence, and has a normalized signal intensity.]


Motivation

  • Given PBM data, a new branch of algorithms is needed that takes the quantitative affinity data into account to uncover a motif model, i.e. motif discovery in PBM data.

  • Existing algorithms, including MatrixREDUCE, MDScan, PREGO, RankMotif++, and Seed and Wobble, all make use of the most common motif model, the Position Weight Matrix (PWM).

PWM assumes independence between columns

Source: http://en.wikipedia.org/wiki/Position_weight_matrix


Motivation

  • However, the independence assumption that the PWM makes is unrealistic in many cases [1].

  • Although a recent attempt has been made to generalize the PWM, insertion and deletion operations between adjacent nucleotide positions remain challenging [2].

[1] Erill, Ivan, and Michael C. O'Neill. "A reexamination of information theory-based methods for DNA-binding site identification." BMC bioinformatics 10.1 (2009): 57.

[2] Stormo, Gary D. "Maximally efficient modeling of DNA sequence motifs at all levels of complexity." Genetics 187.4 (2011): 1219-1224.


Contribution

  • In this work, the authors developed a hidden Markov model (HMM)-based approach to rigorously model the dependence between adjacent nucleotide positions, which outperforms existing approaches.

  • The authors also discovered that the HMM can be used to deduce multiple binding modes for a given TF, which is verified by comparisons with existing algorithms and statistical tests.


Hidden Markov Model (HMM)

The model:

A possible run:

  • There are two coins in my pocket.

    • Coin A is fair, with 50% chance of head (H) and 50% chance of tail (T) in a toss.

    • Coin B is unfair, with 25% chance of H and 75% chance of T in a toss.

  • Now we do this:

    1. Take out one of the coins from my pocket without letting you know which one it is.

    2. Toss the coin and let you see the outcome.

    3. Repeat steps 1 and 2 a few times (each time I could use a different coin).

    4. Then you need to guess which coin I used each time.

[Figure: state diagram of the two-coin HMM, with start probabilities 0.5 / 0.5, transitions A→A = 0.9, A→B = 0.1, B→B = 0.8, B→A = 0.2, and emissions A: H 0.5 / T 0.5, B: H 0.25 / T 0.75, alongside a possible run of hidden coins and observed outcomes.]
CSCI3220 Algorithms for Bioinformatics | Kevin Yip-cse-cuhk | Fall 2013
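The guessing game above is exactly HMM decoding. A minimal Viterbi sketch for the two-coin model, assuming the transition and emission probabilities shown in the figure (the function name is illustrative):

```python
states = ["A", "B"]
start = {"A": 0.5, "B": 0.5}
trans = {"A": {"A": 0.9, "B": 0.1}, "B": {"A": 0.2, "B": 0.8}}
emit = {"A": {"H": 0.5, "T": 0.5}, "B": {"H": 0.25, "T": 0.75}}

def viterbi(obs):
    """Most probable hidden coin sequence for the observed tosses."""
    v = {s: (start[s] * emit[s][obs[0]], [s]) for s in states}
    for o in obs[1:]:
        nxt = {}
        for s in states:
            # Best predecessor for state s, then emit the observation.
            p, path = max((v[q][0] * trans[q][s], v[q][1]) for q in states)
            nxt[s] = (p * emit[s][o], path + [s])
        v = nxt
    return max(v.values())

prob, path = viterbi(["T", "H", "T"])
```

For the run T, H, T the decoder prefers staying on the fair coin A throughout, because leaving state A is expensive (0.1) relative to the modest emission advantage of B on tails.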


Hidden Markov Model (HMM)

[Figure: the same two-coin HMM shown twice, from my perspective (the coins used and the outcomes are both known) and from your perspective (outcomes only). The coins used are hidden and you can only observe whether each toss is a head or a tail: observed T, H, T; hidden coins ?, ?, ?]


Three algorithms in HMM

[Slide table not captured in the transcript. The three standard HMM algorithms, all used later in this presentation, are: the forward algorithm (evaluation), the Viterbi algorithm (decoding) and the Baum-Welch algorithm (training).]


The Authors’ Objectives in Applying HMMs to PBM Data

  • For a TF, train a hidden Markov model (HMM) on DNA k-mers with binding intensities obtained from Protein Binding Microarray (PBM) data, to represent the binding specificity of the TF.

  • The trained HMM can then be used to:

    • Predict if a DNA sequence can be bound by the TF

    • Rank the binding intensities of the DNA sequences

    • Identify the multiple binding modes in the PBM data



Outline

  • Introduction

  • Materials

  • Methodology

  • Results

  • Discussion

  • Conclusion


Protein Binding Microarray Data: Two Arrays (Data Replicates) for Each TF

  • Array #1 for TF Arid3a

  • Array #2 for TF Arid3a

[Figure: example probe intensities, e.g. 936656.772613291 on array #1 and 425493.538359013 on array #2.]

Each array contains >= 40,000 sequences; each sequence has a length of 35–40 bp.


Two Datasets

  • Dataset 1, 5 TFs

    • Cbf1

    • Ceh-22

    • Oct-1

    • Rap1

    • Zif268

  • Each TF contains 2 arrays.

  • These PBM data have also been processed by algorithms including MatrixREDUCE, MDScan, PREGO, RankMotif++ and Seed and Wobble.

  • Dataset 2 (Mouse Data), 42 TFs, including:

    • Arid3a

    • Ascl2

    • Bcl6b

    • Bhlhb2

    • E2F2

  • Each TF contains 2 arrays.

  • These PBM data have also been processed by the algorithm RankMotif++.


Outline

  • Introduction

  • Materials

  • Methodology

  • Results

  • Discussion

  • Conclusion


Methodology Overview

[Pipeline diagram: Step 0 → Step 1 → Step 2 → Step 3 → Step 4a / Step 4b]

Steps 0–3: training (on one array)

Step 4a: testing on the other array

Step 4b: analyzing the trained HMM


Step 0: Identify all distinct k-mers

  • For each DNA sequence, trim the primer region.

  • For each DNA sequence, use a sliding window of length k to identify all the distinct k-mers (taking their reverse complements into consideration).

  • k = 8 is used throughout the study, to allow comparison with the other approaches.

Example (k = 8), after trimming the primer:

CCATGGGCA 601
ATGCCGGAC 552
CATGGGCAA 579

Distinct 8-mers: {CCATGGGC, CATGGGCA, ATGCCGGA, TGCCGGAC, ATGGGCAA}
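The sliding-window extraction can be sketched as follows (the function name is illustrative; here a k-mer and its reverse complement are collapsed to one canonical form, which is one way of “considering” reverse complements):

```python
def distinct_kmers(seq, k=8):
    """Collect the distinct k-mers of a probe with a sliding window,
    treating each k-mer and its reverse complement as the same entity."""
    comp = str.maketrans("ACGT", "TGCA")
    kmers = set()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        rc = kmer.translate(comp)[::-1]      # reverse complement
        kmers.add(min(kmer, rc))             # canonical representative
    return kmers

ks = distinct_kmers("CCATGGGCA", 8)
```

A 9 bp fragment yields two overlapping 8-mers, so `ks` contains two canonical k-mers.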


Step 1: Build a Signal Intensity List and Associate the Median with Each k-mer

  • For each distinct DNA k-mer, build a signal intensity list.

  • For each DNA k-mer, use the median of the signal intensity list to represent the signal of the k-mer

Example (k = 8)

CCATGGGCA 601
ATGCCGGAC 552
CATGGGCAA 579

Distinct 8-mers: {CCATGGGC, CATGGGCA, ATGCCGGA, TGCCGGAC, ATGGGCAA}

CCATGGGC {601} → 601
CATGGGCA {579, 601} → 590
ATGCCGGA {552} → 552
TGCCGGAC {552} → 552
ATGGGCAA {579} → 579

* The median computation has been simplified for presentation purpose.
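This step can be sketched as follows, using the slide’s example (reverse complements omitted for brevity; the function name is my own):

```python
from collections import defaultdict
from statistics import median

def kmer_medians(probes, k=8):
    """Step 1 sketch: collect an intensity list per distinct k-mer,
    then represent each k-mer by the median of its list."""
    lists = defaultdict(list)
    for seq, intensity in probes:
        for i in range(len(seq) - k + 1):
            lists[seq[i:i + k]].append(intensity)
    return {kmer: median(vals) for kmer, vals in lists.items()}

probes = [("CCATGGGCA", 601), ("ATGCCGGAC", 552), ("CATGGGCAA", 579)]
med = kmer_medians(probes)
```

CATGGGCA occurs in two probes, so its list is {601, 579} and its median is 590, matching the slide.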


Step 1.5: Label the k-mers by +ve / -ve

Example

CCATGGGC {601} → 601
CATGGGCA {579, 601} → 590
ATGCCGGA {552} → 552
TGCCGGAC {552} → 552
ATGGGCAA {579} → 579

m = median* of {552, 579, 590, 601} = 584.5
σ = std* of {552, 579, 590, 601} = 21.016
Assume** n = 0.5, so m + nσ ≈ 595

CCATGGGC 601 > 595 → +ve
CATGGGCA 579 < 595 → -ve
ATGCCGGA 552 < 595 → -ve
TGCCGGAC 552 < 595 → -ve
ATGGGCAA 579 < 595 → -ve

  • Compute the median (m) and the standard deviation (σ) of all the intensities associated with the k-mers.

  • Label each k-mer by +ve / -ve based on the following scheme:

    • +ve: if x_i >= m + nσ

    • -ve: otherwise

      , where x_i is the intensity associated with the i-th k-mer

* The median and std computation has been simplified for presentation purpose.

** n set to 4 by default in this study.
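The labeling scheme can be sketched as follows, reproducing the slide’s simplified example (median and std taken over the distinct values, as the footnote indicates; the function name is illustrative):

```python
from statistics import median, stdev

def label_kmers(med, n=4):
    """Step 1.5 sketch: +ve if a k-mer's intensity is at least
    m + n*sigma over the (distinct) k-mer intensities, else -ve."""
    vals = sorted(set(med.values()))     # the slide's simplified variant
    m, sigma = median(vals), stdev(vals)
    cutoff = m + n * sigma
    return {kmer: ("+ve" if x >= cutoff else "-ve")
            for kmer, x in med.items()}

med = {"CCATGGGC": 601, "CATGGGCA": 590, "ATGCCGGA": 552,
       "TGCCGGAC": 552, "ATGGGCAA": 579}
labels = label_kmers(med, n=0.5)         # n = 0.5 as in the example
```

With the distinct values {552, 579, 590, 601}, m = 584.5 and σ ≈ 21.016, so the cutoff is ≈ 595 and only CCATGGGC is labeled +ve, as on the slide.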


Step 2: Align the k-mers Using Multiple Sequence Alignment (MSA)

  • Select only the +ve k-mers

  • Align the k-mers using multiple sequence alignment.

Example

CCATGGGC +ve

GGGCATTT +ve

CGGATTTC +ve

GGACTTTA +ve

TTTACCAT +ve

Multiple Sequence Alignment (MSA)


Step 3: Train an HMM Based on the Multiple Sequence Alignment

  • Train an HMM on the multiple sequence alignment, using the Baum-Welch training algorithm.

  • 50 hidden states are used.

  • As each HMM is initialized randomly, the training is repeated 10 times to avoid suboptimal convergence.


Step 4a: Verifications on Testing DNA sequences

  • Test the HMM on the other array, i.e. if the HMM is trained on array #1, it is tested on array #2 (and vice versa).

  • Two testing procedures:

    • Given the testing DNA sequences only, ask the algorithm to rank them and compare against the original ranking.

    • Given the testing DNA sequences only, ask the algorithm to classify which sequences are likely to be bound by the TF (+ve class) and which are less likely (-ve class).


Step 4b: Find multimodal binding

  • Given an HMM, we find the top N most probable paths over L states using the N-Max-Product Algorithm (a generalized version of the Viterbi algorithm), where L is the alignment length of the k-mers.

  • For each probable path, we can generate a position weight matrix (PWM) from the emission probabilities of the states along the path.

  • Single-linkage hierarchical clustering can then be applied to these PWMs to identify multiple binding modes of the TF.
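How a state path becomes a PWM can be sketched as follows (the state names and emission numbers are hypothetical, not from the paper):

```python
def path_to_pwm(path, emissions):
    """Build a PWM from a state path: one probability vector over
    A/C/G/T per path position, taken from that state's emissions."""
    return [[emissions[state][base] for base in "ACGT"] for state in path]

emissions = {                         # hypothetical trained emissions
    "s1": {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    "s2": {"A": 0.05, "C": 0.05, "G": 0.85, "T": 0.05},
}
pwm = path_to_pwm(["s1", "s2"], emissions)
```

Each position’s vector sums to 1, so the result can be fed directly to a logo drawer or a clustering routine.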


Outline

  • Introduction

  • Materials

  • Methodology

  • Results

  • Discussion

  • Conclusion


Results Overview

  • Step 4a: Verifications on Testing DNA sequences

    • Experiment 1: evaluate the algorithms by ranking and classification on dataset 1, which consists of 5 TFs.

    • Experiment 2: evaluate the algorithms by ranking and classification on dataset 2 (mouse data), which consists of 42 TFs.

  • Step 4b: Find multimodal binding

    • Experiment 3: to analyze the HMM trained on dataset 1

    • Experiment 4: to analyze the HMM trained on dataset 2


Experiment 1 (Ranking)

  • On the testing dataset, given only the DNA sequences, ask the algorithms to rank them according to their binding intensities.

  • The algorithms are trained on array #1 and tested on array #2 (or vice versa).

  • The similarity between the predicted rank and the actual rank can be measured by Spearman’s rank correlation coefficient. The higher the value is, the better the performance.

Example:

Actual rank:    3, 4, 1, 2
Predicted rank: 4, 3, 2, 1

Spearman’s rank correlation coefficient:

ρ = 1 - (6 Σ d_i²) / (n(n² - 1)), where d_i is the difference between the actual and predicted ranks of item i and n is the number of items. (Here, ρ = 1 - 6·4 / (4·15) = 0.6.)
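The coefficient can be computed directly from the formula (a sketch assuming no tied ranks; the function name is my own):

```python
def spearman_rho(rank_x, rank_y):
    """Spearman's rank correlation: rho = 1 - 6*sum(d_i^2) / (n(n^2-1)),
    where d_i is the rank difference for item i (no ties assumed)."""
    n = len(rank_x)
    d2 = sum((a - b) ** 2 for a, b in zip(rank_x, rank_y))
    return 1 - 6 * d2 / (n * (n * n - 1))

rho = spearman_rho([3, 4, 1, 2], [4, 3, 2, 1])   # the slide's example
```

With the example ranks, Σd² = 4 and n = 4, giving ρ = 0.6.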


HMM: Use the Forward Algorithm to Predict the Binding Intensity

[Figure: the HMM approach at test time. The predicted signal intensity of a sequence = the maximum forward-probability value over its k-mers.]

  • HMM approach:

    • Use a sliding window to scan all possible k-mers on a sequence.

    • Use the forward algorithm to associate each k-mer with a probability value.

    • Select the k-mer with the maximum probability value and assign this value to the sequence as the predicted binding intensity.

  • Forward Algorithm: given an HMM and a k-mer, the algorithm computes how likely the k-mer is to be generated by the HMM.
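A minimal forward-algorithm sketch, reusing the two-coin HMM from the earlier slides (in kmerHMM the observations would be the nucleotides of a k-mer rather than coin tosses):

```python
states = ["A", "B"]
start = {"A": 0.5, "B": 0.5}
trans = {"A": {"A": 0.9, "B": 0.1}, "B": {"A": 0.2, "B": 0.8}}
emit = {"A": {"H": 0.5, "T": 0.5}, "B": {"H": 0.25, "T": 0.75}}

def forward(obs):
    """Total probability that the HMM generates the observed sequence,
    summed over all hidden state paths (unlike Viterbi, which maximizes)."""
    f = {s: start[s] * emit[s][obs[0]] for s in states}
    for o in obs[1:]:
        f = {s: sum(f[q] * trans[q][s] for q in states) * emit[s][o]
             for s in states}
    return sum(f.values())

p = forward(["T", "H", "T"])
```

The forward probability sums over every hidden path, so it is always at least as large as the single best Viterbi path probability.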


The gold standard k mer approach
The gold standard k- mer approach

[Figure: the gold-standard k-mer approach at test time. The predicted signal intensity of a sequence = the maximum of the median intensities associated with its k-mers.]

  • Gold standard k-mer approach:

    • Use a sliding window to scan all possible k-mers on a sequence.

    • Recall that in step 1, we already associated each k-mer with the median of a list of signal intensities.

    • Select the k-mer with the maximum median intensity and assign this value to the sequence as the predicted binding intensity.

  • Why is it called the gold standard?

    • Because, in step 1, we used the actual signal intensities, which should be excluded in a testing dataset.


Results of Experiment 1 (Ranking)

  • From these results, we can observe that kmerHMM performs better than the other methods on three datasets (Cbf1, Oct-1 and Zif268).

  • On the two other datasets (Ceh-22 and Rap1), kmerHMM is not the top performer but is close.

  • In the case of Rap1, kmerHMM performed slightly worse than the other methods. The consensus binding motif for Rap1 is 13 nt long, which is longer than that of most common TFs.

  • kmerHMM only considers motifs of 8 bp; therefore, it is at a disadvantage.


Experiment 1 (Classification)

  • On the testing dataset, given only the DNA sequences, ask the algorithms to classify them as +ve or -ve.

  • The algorithms are trained on array #1 and tested on array #2 (or vice versa).

  • The problem can be considered as a binary classification problem and thus can be evaluated by AUC.

Example:

Actual class labels:    +, -, +, -
Predicted class labels: +, +, -, +
How to obtain the actual class labels?

Given a list of actual intensities, compute the median (m) and the standard deviation (σ). A sequence is +ve if its intensity >= m + nσ; otherwise, it is -ve.

How to obtain the predicted class labels?

After running the forward algorithm, each sequence is assigned a predicted binding intensity. We find a threshold such that specificity = 99%, and use this threshold to perform the classification.
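The threshold choice can be sketched as follows (an illustrative implementation: pick the cutoff below which 99% of the actual negatives fall; not necessarily the paper’s exact procedure):

```python
def threshold_at_specificity(neg_scores, specificity=0.99):
    """Pick a cutoff such that the given fraction of actual -ve
    sequences score below it (i.e. are correctly called -ve)."""
    s = sorted(neg_scores)
    idx = min(int(specificity * len(s)), len(s) - 1)
    return s[idx]

neg = list(range(100))                 # toy scores for 100 -ve sequences
cut = threshold_at_specificity(neg)    # 99% of negatives fall below it
preds = [1 if x >= cut else 0 for x in neg + [250, 300]]
```

With the toy scores 0..99 the cutoff is 99, so exactly one negative is misclassified (99% specificity) while the two high-scoring sequences are called +ve.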


Results of Experiment 1 (Classification)

kmerHMM performs best among the compared algorithms, except on Ceh-22.



Results of Experiment 2 (on 42 TFs) (Ranking & Classification)

Abbreviations: SR = Spearman Ranking; TPR = True Positive Rate; AUC = Area Under Curve.

The comparison is with RankMotif++, which is also a k-mer-based approach.


Results of Experiment 3: Find multimodal binding in dataset 1

  • Run the N-Max-Product Algorithm on the HMM trained for each TF in dataset 1 to find the top N most probable paths.

  • Analogy:

    • Viterbi algorithm: given an HMM and an observed sequence of length L, compute the single most probable state-transition path of length L.

    • N-Max-Product Algorithm: given an HMM and a parameter L, compute the top N most probable state-transition paths of length L. (In the study, N is increased until the probability of the (N+1)-th path is < 0.001.)

Each path can be used to generate a PWM, simply by using the emission probabilities of each state on the path as a matrix column.
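For intuition, on a small HMM the top-N paths can be found by brute-force enumeration; the N-Max-Product Algorithm computes the same ranking without enumerating every path. A sketch reusing the earlier two-coin HMM:

```python
from itertools import product

states = ["A", "B"]
start = {"A": 0.5, "B": 0.5}
trans = {"A": {"A": 0.9, "B": 0.1}, "B": {"A": 0.2, "B": 0.8}}
emit = {"A": {"H": 0.5, "T": 0.5}, "B": {"H": 0.25, "T": 0.75}}

def top_paths(obs, n):
    """Score every hidden path against the observations and return the
    n most probable ones (what N-max-product finds efficiently)."""
    scored = []
    for path in product(states, repeat=len(obs)):
        p = start[path[0]] * emit[path[0]][obs[0]]
        for i in range(1, len(obs)):
            p *= trans[path[i - 1]][path[i]] * emit[path[i]][obs[i]]
        scored.append((p, path))
    return sorted(scored, reverse=True)[:n]

best = top_paths(["T", "H", "T"], 2)
```

For T, H, T the two leading paths are AAA and BBB; when the runner-up path is almost as probable as the best one, it hints at a second mode, which is the idea behind Step 4b.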


The most probable path

Verification:

We can observe that the motif logos generated by kmerHMM are similar to those of the other approaches, which are based on PWMs.


Top N most probable paths

  • A single-linkage hierarchical clustering is applied on the PWMs generated by the Top N most probable paths.

  • A dendrogram cutoff was chosen such that the mean of a cluster validity measure, i.e. the silhouette values (the higher, the better), was the highest.

  • Oct-1 was chosen as an illustrative example because it was found to have two clusters.


Top N most probable paths on Oct-1

This experiment also shows that HMM modeling is necessary for multimodal motifs, compared with other models in which the state-transition topology is manually restricted to a single principal path.


Results of Experiment 4: Find multimodal binding in dataset 2

  • Apply the N-Max-Product Algorithm to all the HMMs trained on mouse PBM data to discover how frequently a mouse TF can bind to multiple motifs.

  • An additional requirement ensures that the distance between any two output motif models is larger than a threshold t, where t ∈ {0.3, 0.5, 0.7} in this study.


Results of Experiment 4: Find multimodal binding in dataset 2

Blue: TFs with one DNA motif in one array

Green: TFs with two DNA motifs in one array

Red: TFs with three DNA motifs in one array


Outline

  • Introduction

  • Materials

  • Methodology

  • Results

  • Discussion

  • Conclusion


Discussion

The novelty lies in two aspects.

  • First, kmerHMM outperforms existing methods by using an HMM to derive a model representing PBM data. To the authors’ knowledge, this is the first time an HMM has been used to represent PBM data.

  • Second, kmerHMM incorporates the N-max-product algorithm and can derive multiple motif matrix models to represent PBM data. This work is the first study to incorporate HMMs into the PBM motif discovery problem and, in a broader sense, the first to explicitly incorporate max-product algorithms into the general motif discovery problem.


Discussion

Limitation: the use of a sliding window

  • A potential drawback of the proposed approach is that it relies on a sliding window to segment DNA probe sequences into individual k-mers, which may lose sequence-context information.

  • We expect this limitation to be alleviated once improved PBM technology can generate binding affinities for longer probes (i.e. a higher k).


Discussion

Implication: not limited to motif discovery

  • From the research framework, we can learn how to use k-mers to train an HMM for pattern recognition, rather than simple exact or approximate matching.

  • From the state-transition path analysis, we can observe that HMM training is effective at handling multimodal pattern recognition, which other modeling methods may not be able to handle.


Outline

  • Introduction

  • Materials

  • Methodology

  • Results

  • Discussion

  • Conclusion


Conclusion

  • In this study, the authors proposed a computational pipeline for PBM motif discovery in which HMMs are trained to model DNA motifs, and belief propagation is used to elucidate multiple motif models from each trained HMM.

  • The new algorithm, kmerHMM, was compared with existing methods on benchmark PBM datasets and demonstrated its effectiveness and uniqueness.

  • The authors also demonstrated that kmerHMM can capture multiple binding modes of a DNA-binding protein, which a single position weight matrix (PWM) model cannot do.

  • The authors foresee that a method like kmerHMM will provide biological insights and be useful in this arena and other domains.


Protein Binding Microarray Data

Freely accessible in the UniPROBE Database

http://the_brain.bwh.harvard.edu/uniprobe/downloads.php


Source Code and Executables

Freely accessible in the authors’ website

http://www.cs.toronto.edu/~wkc/kmerHMM/downloads.html

