
Group Meeting Presentation

Enhancer Prediction

Antonio Sze-To

29-04-2014

Outline
  • Background
    • Biological background about enhancers
    • Properties used in Enhancer Prediction
  • Introduction
    • Motivation
    • Objective
    • Problem Input and Output
  • Materials and Methodology
    • Dataset and Preprocessing
    • Feature Extraction
    • Classification Methods
  • Results
  • Discussion and Conclusion
What are Enhancers?

Enhancers are distinct genomic regions (or the DNA sequences) that contain binding site sequences for transcription factors (TFs) that can upregulate (that is, enhance) the transcription of a target gene from its transcription start site (TSS).

Shlyueva, Daria, Gerald Stampfel, and Alexander Stark. "Transcriptional enhancers: from properties to genome-wide predictions." Nature Reviews Genetics 15.4 (2014): 272-286.

Identifying Enhancers is Challenging

According to Pennacchio, Len A., et al. "Enhancers: five essential questions." Nature Reviews Genetics 14.4 (2013): 288-295, an enhancer can lie as far as 1 million base pairs from its target gene.

Along the linear genomic DNA sequence, enhancers can be located at any distance from their target genes, which makes their identification challenging.

Shlyueva, Daria, Gerald Stampfel, and Alexander Stark. "Transcriptional enhancers: from properties to genome-wide predictions." Nature Reviews Genetics 15.4 (2014): 272-286.

How the enhancers are brought close to target promoters

A single TF

In a given tissue, active enhancers (Enhancer A) are bound by activating TFs and are brought into proximity of their respective target promoters by looping, which is thought to be mediated by cohesin and other protein complexes.

Shlyueva, Daria, Gerald Stampfel, and Alexander Stark. "Transcriptional enhancers: from properties to genome-wide predictions." Nature Reviews Genetics 15.4 (2014): 272-286.

How the enhancers are brought close to target promoters

Multiple TFs

In a given tissue, active enhancers (Enhancer B) are bound by activating TFs and are brought into proximity of their respective target promoters by looping, which is thought to be mediated by cohesin and other protein complexes.

Shlyueva, Daria, Gerald Stampfel, and Alexander Stark. "Transcriptional enhancers: from properties to genome-wide predictions." Nature Reviews Genetics 15.4 (2014): 272-286.

Enhancer activities are cell-type- or tissue-specific

Shlyueva, Daria, Gerald Stampfel, and Alexander Stark. "Transcriptional enhancers: from properties to genome-wide predictions." Nature Reviews Genetics 15.4 (2014): 272-286.

Biochemical features of Active and Inactive gene regulatory elements
  • Active enhancers are characterized by
    • a depletion of nucleosomes, the structural units of eukaryotic chromatin
    • flanking nucleosomes that show specific histone modifications, for example, histone H3 lysine 4 monomethylation (H3K4me1) and H3K27 acetylation (H3K27ac)

Shlyueva, Daria, Gerald Stampfel, and Alexander Stark. "Transcriptional enhancers: from properties to genome-wide predictions." Nature Reviews Genetics 15.4 (2014): 272-286.

Biochemical features of Active and Inactive gene regulatory elements
  • Inactive enhancers might be silenced by different mechanisms, such as
    • the Polycomb protein-associated repressive H3K27me3 mark (Polycomb-group proteins can remodel chromatin such that epigenetic silencing of genes takes place)
    • repressive TF binding

Shlyueva, Daria, Gerald Stampfel, and Alexander Stark. "Transcriptional enhancers: from properties to genome-wide predictions." Nature Reviews Genetics 15.4 (2014): 272-286.

Chromatin accessibility and histone marks at regulatory elements

Chromatin is shown as a ‘gatekeeper’ for transcription factor (TF) binding and enhancer activity. Densely positioned nucleosomes can restrict access for transcription factors and other proteins. Accessible (that is, nucleosome-free) regions can be bound by these proteins, which define and mediate the identity of a region (for example, active enhancers, repressors or core promoters). The transition from ‘open’ to ‘closed’ chromatin, and vice versa, is determined by regulatory proteins, including pioneer transcription factors.

Shlyueva, Daria, Gerald Stampfel, and Alexander Stark. "Transcriptional enhancers: from properties to genome-wide predictions." Nature Reviews Genetics 15.4 (2014): 272-286.

Chromatin accessibility and histone marks at regulatory elements

b) Active enhancer: H3K4me1, H3K27ac

c) Active promoter: H3K4me3, H3K27ac

d) Closed or poised enhancer: H3K4me1, H3K27me3

e) Primed enhancer (Soon to be Active): H3K4me1

Shlyueva, Daria, Gerald Stampfel, and Alexander Stark. "Transcriptional enhancers: from properties to genome-wide predictions." Nature Reviews Genetics 15.4 (2014): 272-286.

Chromatin accessibility and histone marks at regulatory elements

Latent enhancers are located in closed chromatin and are not pre-marked by known histone modifications.

However, in the presence of external stimuli the DNA becomes accessible, and flanking nucleosomes acquire H3K4me1 and H3K27ac marks.

Shlyueva, Daria, Gerald Stampfel, and Alexander Stark. "Transcriptional enhancers: from properties to genome-wide predictions." Nature Reviews Genetics 15.4 (2014): 272-286.

Properties used in Enhancer Prediction
  • Prediction by motifs and sequence conservation
  • Prediction by Regulator Binding
  • Prediction by Chromatin Accessibility
  • Prediction by Histone modification
Prediction by motifs and sequence conservation

[Figure: genome-wide scan of the genome using existing motifs from JASPAR, TRANSFAC or UniPROBE; labels: Predicted Motif, Cis-regulatory Modules]

  • Disadvantages:
    • Random matches are possible
    • Enhancer activity is cell-type/tissue-specific
    • Enhancers are not necessarily conserved in sequence
Prediction by Regulator Binding

with co-activators

Disadvantages:

non-functional or neutral binding

Shlyueva, Daria, Gerald Stampfel, and Alexander Stark. "Transcriptional enhancers: from properties to genome-wide predictions." Nature Reviews Genetics 15.4 (2014): 272-286.

Prediction by Chromatin Accessibility

Disadvantages:

- Promoter regions around TSSs are often accessible invariantly across different cell types

- Proteins that regulate other aspects of gene expression or chromosomal biology also bind to the DNA at accessible regions

Shlyueva, Daria, Gerald Stampfel, and Alexander Stark. "Transcriptional enhancers: from properties to genome-wide predictions." Nature Reviews Genetics 15.4 (2014): 272-286.

Prediction by Histone Modification

Disadvantages:

- None of the known histone modifications correlates perfectly with enhancer activity, and even combinations of marks are not perfect predictors, e.g. many active enhancers lack characteristic marks

Shlyueva, Daria, Gerald Stampfel, and Alexander Stark. "Transcriptional enhancers: from properties to genome-wide predictions." Nature Reviews Genetics 15.4 (2014): 272-286.

Properties used in Enhancer Prediction
  • Prediction by motifs and sequence conservation
    • Random matches
    • Cell-type/tissue-specific
    • Sequence conservation is not required
  • Prediction by Regulator Binding
    • non-functional or neutral binding
  • Prediction by Chromatin Accessibility
    • Promoter regions around TSSs are often accessible invariantly across different cell types
    • Proteins that regulate other aspects of gene expression or chromosomal biology also bind to the DNA at nucleosome-free regions.
  • Prediction by Histone modification
    • None of the known histone modifications correlates perfectly with enhancer activity, and even combinations of marks are not perfect predictors, e.g. many active enhancers lack characteristic marks.
Motivation
  • Understanding enhancers is of fundamental importance to the study of gene regulation.
  • Identification of enhancers remains challenging because they can be located at any distance from their target genes.
  • While chromatin-based prediction methods are prevalent, sequence motifs have not yet been exploited in detail for enhancer prediction in Drosophila melanogaster.
Drosophila melanogaster (Fruit Fly)

http://en.wikipedia.org/wiki/Drosophila_melanogaster

Objective
  • The authors attempt to combine the strengths of both
    • sequence-based features and
    • chromatin-based features

to predict whether a sequence contains an active enhancer in Drosophila melanogaster (fruit fly).

Problem Input and Output
  • Input
    • A DNA sequence
  • Method
    • A classification algorithm to predict if the DNA sequence contains an active enhancer
  • Output
    • Yes or No
Dataset and Preprocessing
  • Training Dataset
    • 8008 positive samples of active (mesodermal) enhancers were taken from Zinzen, Robert P., et al.
    • Statistics of the 8008 positive samples:
      • Average Length: 270.47 base pairs
      • Maximum Length: 1182 base pairs
      • Minimum Length: 115 base pairs
      • Standard deviation: 112 base pairs
    • 8008 negative samples were randomly chosen from the remainder of the genome, with lengths drawn from a Gaussian distribution matching the length statistics of the positive samples.

Zinzen, Robert P., et al. "Combinatorial binding predicts spatio-temporal cis-regulatory activity." Nature 462.7269 (2009): 65-70.
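
The length sampling for the negative set could look roughly like the sketch below. It is only an illustration: the slides give the mean, standard deviation, minimum and maximum, but not the procedure, so the function name `sample_negative_lengths`, the redraw of out-of-range values and the fixed seed are all assumptions. Choosing where in the genome each region is placed is not shown.

```python
import random

def sample_negative_lengths(n, mean, sd, min_len, max_len, seed=0):
    """Draw n region lengths from a Gaussian, redrawing values outside [min_len, max_len].

    The redraw step is an assumption; the slides only say the lengths follow
    a Gaussian with the same statistics as the positive samples.
    """
    rng = random.Random(seed)
    lengths = []
    while len(lengths) < n:
        length = round(rng.gauss(mean, sd))
        if min_len <= length <= max_len:
            lengths.append(length)
    return lengths

# Length statistics of the 8008 positive training samples, as reported on the slide.
lengths = sample_negative_lengths(8008, mean=270.47, sd=112, min_len=115, max_len=1182)
```

Note that discarding out-of-range draws shifts the realised mean slightly away from 270.47, so a real pipeline might handle the bounds differently.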

Dataset and Preprocessing
  • Testing Dataset 1
    • 1830 positive samples of active (general) enhancers were taken from the REDFly Database
    • Statistics of the 1830 positive samples:
      • Average Length: 1829 base pairs
      • Maximum Length: 22573 base pairs
      • Minimum Length: 14 base pairs
    • 1830 negative samples were randomly chosen from the remainder of the genome, with lengths drawn from a Gaussian distribution matching the length statistics of the positive samples.
    • To avoid bias, regions that overlap the training dataset were removed, reducing the dataset to 1480 positive samples and 1824 negative samples.
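
The overlap removal can be sketched as a simple interval filter. This is a hypothetical illustration, not the authors' code: regions are assumed to be `(chromosome, start, end)` tuples with inclusive coordinates, the helper names are invented here, and any shared base counts as an overlap.

```python
def overlaps(a, b):
    """True if two (chrom, start, end) regions with inclusive coordinates share a base."""
    return a[0] == b[0] and a[1] <= b[2] and b[1] <= a[2]

def remove_overlapping(test_regions, train_regions):
    """Keep only the test regions that overlap no training region."""
    return [
        region for region in test_regions
        if not any(overlaps(region, train) for train in train_regions)
    ]

# Toy coordinates, made up for illustration.
train = [("2L", 100, 400)]
test = [("2L", 350, 900), ("2L", 401, 800), ("3R", 100, 400)]
kept = remove_overlapping(test, train)
```

This pairwise check is quadratic in the number of regions, which is acceptable for a few thousand regions; an interval tree or sorted sweep would be the usual choice for larger sets.
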
Dataset and Preprocessing
  • Testing Dataset 2
    • 325 positive samples of active (mesodermal) enhancers were taken from the REDFly Database
    • Statistics of the 325 positive samples:
      • Average Length: 1796 base pairs
      • Maximum Length: 20253 base pairs
      • Minimum Length: 66 base pairs
    • 325 negative samples were randomly chosen from the remainder of the genome, with lengths drawn from a Gaussian distribution matching the length statistics of the positive samples.
    • To avoid bias, regions that overlap the training dataset were removed, reducing the dataset to 250 positive samples and 325 negative samples.
Feature Extraction
  • ChIP-Seq data was taken from Bonn, Stefan, et al.
  • It contained 8 different signal types:
    • H3K4me1
    • H3K4me3
    • H3K27ac
    • H3K27me3
    • H3K36me3
    • H3K79me3
    • Mef2 (Myocyte enhancer factor-2)
    • PolII (RNA polymerase II)

Bonn, Stefan, et al. "Tissue-specific analysis of chromatin state identifies temporal signatures of enhancer activity during embryonic development." Nature genetics 44.2 (2012): 148-156.

Feature Extraction
  • Each row holds the signal strength for one fixed-size window at a particular region of a specific chromosome.
  • For example, with a window size of 50, the value in row 17010 means that the signal strength for positions 850,451 to 850,500 on chromosome 2 is 7.99.
  • Each file has a header description.

What a processed ChIP-Seq file looks like according to Bonn, Stefan, et al.
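
The row-to-position arithmetic in the example above can be written out as a small helper. This is a sketch under two assumptions inferred from that example: rows are counted from 1, and windows are contiguous and start at position 1. The function name `row_to_interval` is hypothetical.

```python
def row_to_interval(row, window=50):
    """Map a 1-based row index to the 1-based inclusive interval its window covers."""
    start = (row - 1) * window + 1
    end = row * window
    return start, end

print(row_to_interval(17010))  # (850451, 850500)
```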

Feature Extraction

[Figure: ChIP-Seq signal along genomic coordinates, with the input sequence marked]

To extract features from ChIP-Seq data, the authors sum the signal values that overlap the input sequence and then average them.

Hence, each input sequence has 8 ChIP-Seq features, one for each of the 8 data types.
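
A minimal sketch of this averaging step is shown below. It assumes the signal for one chromosome is held as a list with one value per 50 bp window, that coordinates are 1-based and inclusive, and that every overlapping window counts equally regardless of how much of it the sequence covers; the slides do not say whether partial windows are weighted. The name `mean_signal` is hypothetical.

```python
def mean_signal(signal, start, end, window=50):
    """Average the per-window values whose windows overlap [start, end]."""
    first = (start - 1) // window                      # 0-based index of first overlapping window
    last = min((end - 1) // window, len(signal) - 1)   # clamp to the available data
    values = signal[first:last + 1]
    return sum(values) / len(values) if values else 0.0

# Toy signal: windows cover positions 1-50, 51-100, 101-150 and 151-200.
toy_signal = [1.0, 3.0, 5.0, 7.0]
feature = mean_signal(toy_signal, 60, 120)  # overlaps the second and third windows
```

Running the same function over each of the 8 signal tracks would give the 8 ChIP-Seq features for one input sequence.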

Feature Extraction
  • 125 insect-related transcription factor binding site motifs, publicly available in the JASPAR database, were downloaded.
  • For each motif-sequence pair, the authors computed a thermodynamic binding energy score.
  • Hence, each input sequence has 125 motif-based features, one per motif.
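
The slides do not give the formula for the binding energy score, so the sketch below is a stand-in rather than the authors' method. It scores every window of a sequence against a position weight matrix of log-odds values and combines the window scores with a log-sum-exp, which loosely mimics summing binding contributions over all possible sites. The toy motif, the pseudocount, the uniform background and the function names are all assumptions, and only the forward strand and the bases A, C, G, T are handled.

```python
import math

def log_odds_matrix(counts, pseudocount=1.0, background=0.25):
    """Convert per-position base counts into log-odds weights (uniform background assumed)."""
    matrix = []
    for column in counts:
        total = sum(column.get(b, 0) for b in "ACGT") + 4 * pseudocount
        matrix.append({
            b: math.log((column.get(b, 0) + pseudocount) / total / background)
            for b in "ACGT"
        })
    return matrix

def motif_score(sequence, matrix):
    """Log of the summed exp(window score) over all forward-strand windows."""
    width = len(matrix)
    window_scores = [
        sum(matrix[i][sequence[p + i]] for i in range(width))
        for p in range(len(sequence) - width + 1)
    ]
    if not window_scores:
        return float("-inf")  # sequence shorter than the motif
    peak = max(window_scores)
    return peak + math.log(sum(math.exp(s - peak) for s in window_scores))

# Toy 4 bp motif with consensus ACGT; not a real JASPAR matrix.
toy_counts = [{"A": 10}, {"C": 10}, {"G": 10}, {"T": 10}]
toy_matrix = log_odds_matrix(toy_counts)
with_site = motif_score("TTACGTTT", toy_matrix)
without_site = motif_score("TTTTTTTT", toy_matrix)
```

Here `with_site` is larger than `without_site` because only the first sequence contains the consensus. Computing one such score per motif would give 125 values per input sequence.
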
Feature Extraction

In short, for each input sequence, i.e. sample, we have 8 features based on ChIP-Seq data and 125 features based on motif data, and one class label.

Classification Methods
  • A Bayesian network was used in Zinzen, Robert P., et al. to perform enhancer prediction based on ChIP-Seq features.
  • In this study, the authors used two more methods, i.e. SVM and Random Forest, to perform enhancer prediction based on ChIP-Seq features and sequence motif features.

Zinzen, Robert P., et al. "Combinatorial binding predicts spatio-temporal cis-regulatory activity." Nature 462.7269 (2009): 65-70.

10-Fold Cross-Validation on the Training Data

Area Under Curve (AUC) is shown.

EPI: Features based on ChIP-Seq only

MOT: Features based on Sequence motifs

ALL: Features based on both types of data

We observed that the Bayesian network performed well on the ChIP-Seq features but not on the sequence motif features. Random Forest performed best in the study.
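
For reference, the evaluation protocol can be sketched in plain Python: split the samples into 10 folds, then score each held-out fold by AUC. The code below shows only the fold assignment and the AUC computation, with AUC taken as the fraction of positive-negative pairs in which the positive sample gets the higher score. The function names, the seed and the toy scores are assumptions, and training the classifier itself is not shown.

```python
import random

def k_fold_indices(n, k=10, seed=0):
    """Shuffle sample indices and deal them into k folds of near-equal size."""
    indices = list(range(n))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def auc(labels, scores):
    """Fraction of positive-negative pairs ranked correctly; ties count as half."""
    pos = [s for label, s in zip(labels, scores) if label == 1]
    neg = [s for label, s in zip(labels, scores) if label == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# 8008 positive + 8008 negative training samples, as on the dataset slide.
folds = k_fold_indices(8008 + 8008)

# Toy labels and classifier scores, made up for illustration.
toy_auc = auc([1, 1, 0, 0], [0.9, 0.4, 0.5, 0.1])  # 3 of 4 pairs ranked correctly
```

The pairwise AUC is quadratic in the fold size; a rank-based formula gives the same value faster on large folds.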

Measuring Feature Importance

ChIP-Seq features were observed to be more important than Motif Features.

[Figure: feature importance of Motif Features, Random Features and ChIP-Seq Features]

Ranking the ChIP-Seq Features

Mef2, H3K4me1 and H3K36me3 were observed to be the most important features.

The minimum set of Features
  • The authors claimed that they could use a minimum set of features consisting of
    • 1 sequence motif (zeste) feature and
    • 3 ChIP-Seq features (H3K4me1, H3K36me3 and Mef2)

to build a model with stable prediction accuracy.

  • This hypothesis was examined and confirmed by 10-fold cross-validation, where the average accuracy obtained was 0.979.
Performance of the classifier on the testing data

Area Under Curve (AUC) is shown.

EPI: Features based on ChIP-Seq only

MOT: Features based on Sequence motifs

ALL: Features based on both types of data

To my understanding, the authors applied Random Forest to the testing data. Sequence-based features were observed to improve the classification performance.

Discussion
  • As enhancers are cell-type/tissue-specific, future classifiers should account for this during training.
  • It is interesting to observe that the classifier can also be applied to the testing datasets, which include a mesodermal-specific dataset and a general dataset. This suggests that the classifier is not over-fitted to the training data.
  • One question to ask is whether it is possible to learn the general and specific properties of enhancers separately.
  • One simple idea is to use all training data to learn one model and use separated data to learn specific models. The trained classification models can then be compared.
Discussion
  • While the results in this study are very promising, it should be noted that the training data come from a relatively simple organism.
  • It would be interesting to study whether this approach can be applied to more complex organisms such as mammals.
  • However, this is currently very difficult because comprehensive enhancer datasets comparable to REDFly are lacking for those organisms.
  • Additionally, the human genome is much larger, which will challenge the scalability of the algorithms.
  • Scalable machine learning algorithms that mix supervised and unsupervised techniques should be developed to meet this challenge.
Conclusion
  • Understanding enhancers is important for studying gene regulation. Enhancer prediction remains challenging because enhancers can be far from their target genes.
  • While classification methods exploiting ChIP-Seq features are popular, the authors attempted to build a classifier exploiting both sequence-based and chromatin features and hypothesized an improvement in performance.
  • The authors tested and confirmed their hypothesis in Drosophila melanogaster with a very high classification accuracy.
  • It would be interesting to test whether these findings also apply to more complex systems such as mammalian genomes.