Group Meeting Presentation

Group Meeting Presentation Enhancer Prediction Antonio Sze-To 29-04-2014

Outline • Background • Biological background about enhancers • Properties used in Enhancer Prediction • Introduction • Motivation • Objective • Problem Input and Output • Materials and Methodology • Dataset and Preprocessing • Feature Extraction • Classification Methods • Results • Discussion and Conclusion

Biological Background about enhancers

What are Enhancers? Enhancers are distinct genomic regions (or the DNA sequences) that contain binding site sequences for transcription factors (TFs) that can upregulate (that is, enhance) the transcription of a target gene from its transcription start site (TSS). Shlyueva, Daria, Gerald Stampfel, and Alexander Stark. "Transcriptional enhancers: from properties to genome-wide predictions." Nature Reviews Genetics 15.4 (2014): 272-286.

Identifying Enhancers is Challenging According to ‘Pennacchio, Len A., et al. "Enhancers: five essential questions." Nature Reviews Genetics 14.4 (2013): 288-295’, an enhancer can be 1 million base pairs away from its target gene. Along the linear genomic DNA sequence, enhancers can be located at any distance from their target genes, which makes their identification challenging. Shlyueva, Daria, Gerald Stampfel, and Alexander Stark. "Transcriptional enhancers: from properties to genome-wide predictions." Nature Reviews Genetics 15.4 (2014): 272-286.

How enhancers are brought close to target promoters: a single TF. In a given tissue, active enhancers (Enhancer A) are bound by activating TFs and are brought into proximity of their respective target promoters by looping, which is thought to be mediated by cohesin and other protein complexes. Shlyueva, Daria, Gerald Stampfel, and Alexander Stark. "Transcriptional enhancers: from properties to genome-wide predictions." Nature Reviews Genetics 15.4 (2014): 272-286.

How enhancers are brought close to target promoters: multiple TFs. In a given tissue, active enhancers (Enhancer B) are bound by activating TFs and are brought into proximity of their respective target promoters by looping, which is thought to be mediated by cohesin and other protein complexes. Shlyueva, Daria, Gerald Stampfel, and Alexander Stark. "Transcriptional enhancers: from properties to genome-wide predictions." Nature Reviews Genetics 15.4 (2014): 272-286.

Enhancer activities are cell-type- or tissue-specific Shlyueva, Daria, Gerald Stampfel, and Alexander Stark. "Transcriptional enhancers: from properties to genome-wide predictions." Nature Reviews Genetics 15.4 (2014): 272-286.

Biochemical features of Active and Inactive gene regulatory elements • Active enhancers are characterized by • a depletion of nucleosomes, the structural units of eukaryotic chromatin • specific histone modifications on the nucleosomes that flank them, for example, histone H3 lysine 4 monomethylation (H3K4me1) and H3K27 acetylation (H3K27ac). Shlyueva, Daria, Gerald Stampfel, and Alexander Stark. "Transcriptional enhancers: from properties to genome-wide predictions." Nature Reviews Genetics 15.4 (2014): 272-286.

Biochemical features of Active and Inactive gene regulatory elements • Inactive enhancers might be silenced by different mechanisms, such as by • the Polycomb protein-associated repressive H3K27me3 mark • (Polycomb-group proteins can remodel chromatin such that epigenetic silencing of genes takes place.) • by repressive TF binding Shlyueva, Daria, Gerald Stampfel, and Alexander Stark. "Transcriptional enhancers: from properties to genome-wide predictions." Nature Reviews Genetics 15.4 (2014): 272-286.

Chromatin accessibility and histone marks at regulatory elements Chromatin is shown as a ‘gatekeeper’ for transcription factor (TF) binding and enhancer activity. Densely positioned nucleosomes can restrict access for transcription factors and other proteins. Accessible (that is, nucleosome-free) regions can be bound by these proteins, which define and mediate the identity of a region (for example, active enhancers, repressors or core promoters). The transition from ‘open’ to ‘closed’ chromatin, and vice versa, is determined by regulatory proteins, including pioneer transcription factors. Shlyueva, Daria, Gerald Stampfel, and Alexander Stark. "Transcriptional enhancers: from properties to genome-wide predictions." Nature Reviews Genetics 15.4 (2014): 272-286.

Chromatin accessibility and histone marks at regulatory elements b) Active enhancer: H3K4me1, H3K27ac c) Active promoter: H3K4me3, H3K27ac d) Closed or poised enhancer: H3K4me1, H3K27me3 e) Primed enhancer (Soon to be Active): H3K4me1 Shlyueva, Daria, Gerald Stampfel, and Alexander Stark. "Transcriptional enhancers: from properties to genome-wide predictions." Nature Reviews Genetics 15.4 (2014): 272-286.
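The mark combinations in panels b) to e) can be read as a small lookup table. The sketch below is only an illustration of that reading: the function name and the "unclassified" fallback are assumptions, and, as a later slide notes, no combination of marks is a perfect predictor of enhancer activity.

```python
# Mark combinations as listed in panels b) to e) of the slide.
# The exact-match lookup is an illustrative assumption, not a published rule.
CHROMATIN_STATES = {
    frozenset({"H3K4me1", "H3K27ac"}): "active enhancer",
    frozenset({"H3K4me3", "H3K27ac"}): "active promoter",
    frozenset({"H3K4me1", "H3K27me3"}): "closed or poised enhancer",
    frozenset({"H3K4me1"}): "primed enhancer",
}


def chromatin_state(marks):
    """Return the slide's label for an exact set of marks, else 'unclassified'."""
    return CHROMATIN_STATES.get(frozenset(marks), "unclassified")
```

For example, `chromatin_state(["H3K4me1", "H3K27ac"])` returns the active-enhancer label, while an empty set of marks (as for a latent enhancer) falls through to "unclassified".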

Chromatin accessibility and histone marks at regulatory elements Latent enhancers are located in closed chromatin and are not pre-marked by known histone modifications. However, in the presence of external stimuli the DNA becomes accessible, and flanking nucleosomes acquire H3K4me1 and H3K27ac marks. Shlyueva, Daria, Gerald Stampfel, and Alexander Stark. "Transcriptional enhancers: from properties to genome-wide predictions." Nature Reviews Genetics 15.4 (2014): 272-286.

Properties used in Enhancer Prediction

Properties used in Enhancer Prediction • Prediction by motifs and sequence conservation • Prediction by Regulator Binding • Prediction by Chromatin Accessibility • Prediction by Histone modification

Prediction by motifs and sequence conservation [Figure: genome-wide scan of the genome with existing motifs from JASPAR, TRANSFAC or UniPROBE; predicted motifs; cis-regulatory modules] • Disadvantages: • Random matching is possible • Enhancer activity is cell-type/tissue-specific • Enhancers are not necessarily conserved in sequence

Prediction by Regulator Binding with co-activators Disadvantages: non-functional or neutral binding Shlyueva, Daria, Gerald Stampfel, and Alexander Stark. "Transcriptional enhancers: from properties to genome-wide predictions." Nature Reviews Genetics 15.4 (2014): 272-286.

Prediction by Chromatin Accessibility Disadvantages: - Promoter regions around TSSs are often accessible invariantly across different cell types - Proteins that regulate other aspects of gene expression or chromosomal biology also bind to the DNA at accessible regions Shlyueva, Daria, Gerald Stampfel, and Alexander Stark. "Transcriptional enhancers: from properties to genome-wide predictions." Nature Reviews Genetics 15.4 (2014): 272-286.

Prediction by Histone Modification Disadvantages: - None of the known histone modifications correlates perfectly with enhancer activity, and even combinations of marks are not perfect predictors, e.g. many active enhancers lack characteristic marks Shlyueva, Daria, Gerald Stampfel, and Alexander Stark. "Transcriptional enhancers: from properties to genome-wide predictions." Nature Reviews Genetics 15.4 (2014): 272-286.

Properties used in Enhancer Prediction • Prediction by motifs and sequence conservation • Random match • Cell-type/tissue-specific • Sequence conservation is not a must • Prediction by Regulator Binding • Non-functional or neutral binding • Prediction by Chromatin Accessibility • Promoter regions around TSSs are often accessible invariantly across different cell types • Proteins that regulate other aspects of gene expression or chromosomal biology also bind to the DNA at nucleosome-free regions • Prediction by Histone modification • None of the known histone modifications correlates perfectly with enhancer activity, and even combinations of marks are not perfect predictors, e.g. many active enhancers lack characteristic marks

An enhancer prediction method making use of chromatin marks and sequence motifs

Motivation • Understanding enhancers is of fundamental importance to the study of gene regulation. • Identification of enhancers remains challenging because they can be located at any distance from their target genes. • While chromatin-based prediction methods are prevalent, sequence motifs have not yet been exploited in detail for enhancer prediction in Drosophila melanogaster.

Drosophila melanogaster (fruit fly) http://en.wikipedia.org/wiki/Drosophila_melanogaster

Objective • The authors attempt to combine the strengths of both • sequence-based features and • chromatin-based features to predict whether a sequence contains an active enhancer in Drosophila melanogaster (fruit fly).

Problem Input and Output • Input • A DNA sequence • Method • A classification algorithm to predict if the DNA sequence contains an active enhancer • Output • Yes or No

Dataset and Preprocessing • Training Dataset • 8008 positive samples of active (mesodermal) enhancers were taken from Zinzen, Robert P., et al. • Statistics of the 8008 positive samples: • Average Length: 270.47 base pairs • Maximum Length: 1182 base pairs • Minimum Length: 115 base pairs • Standard deviation: 112 base pairs • 8008 negative samples were randomly chosen from the remainder of the genome, with lengths drawn from a Gaussian distribution that has the same statistical properties as the lengths of the positive samples. Zinzen, Robert P., et al. "Combinatorial binding predicts spatio-temporal cis-regulatory activity." Nature 462.7269 (2009): 65-70.
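A minimal sketch of how negative samples with Gaussian-distributed lengths could be drawn, using the training-set statistics above. The slides do not say how out-of-range lengths were handled or how positions were chosen, so the rejection step, the uniform placement, and the `chrom_sizes` argument are all assumptions; excluding regions that overlap known enhancers would be a separate filtering step.

```python
import random


def sample_negative_lengths(n, mean=270.47, sd=112.0,
                            min_len=115, max_len=1182, seed=0):
    """Draw n lengths from a Gaussian, rejecting values outside the
    range observed in the positive samples (assumed handling)."""
    rng = random.Random(seed)
    lengths = []
    while len(lengths) < n:
        length = round(rng.gauss(mean, sd))
        if min_len <= length <= max_len:
            lengths.append(length)
    return lengths


def sample_negative_regions(chrom_sizes, lengths, seed=0):
    """Place each length uniformly at random on a chromosome chosen in
    proportion to its size. Coordinates are 1-based and inclusive."""
    rng = random.Random(seed)
    chroms = list(chrom_sizes)
    weights = [chrom_sizes[c] for c in chroms]
    regions = []
    for length in lengths:
        chrom = rng.choices(chroms, weights=weights)[0]
        start = rng.randint(1, chrom_sizes[chrom] - length + 1)
        regions.append((chrom, start, start + length - 1))
    return regions
```

A call such as `sample_negative_regions({"chrA": 5000, "chrB": 3000}, sample_negative_lengths(10))` uses hypothetical chromosome names and sizes; real sizes would come from the genome assembly in use.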

Dataset and Preprocessing • Testing Dataset 1 • 1830 positive samples of active (general) enhancers were taken from the REDfly database • Statistics of the 1830 positive samples: • Average Length: 1829 base pairs • Maximum Length: 22573 base pairs • Minimum Length: 14 base pairs • 1830 negative samples were randomly chosen from the remainder of the genome, with lengths drawn from a Gaussian distribution that has the same statistical properties as the lengths of the positive samples. • To avoid bias, the regions that overlap with the training dataset were removed. Hence the dataset was reduced to 1480 positive samples and 1824 negative samples.

Dataset and Preprocessing • Testing Dataset 2 • 325 positive samples of active (mesodermal) enhancers were taken from the REDfly database • Statistics of the 325 positive samples: • Average Length: 1796 base pairs • Maximum Length: 20253 base pairs • Minimum Length: 66 base pairs • 325 negative samples were randomly chosen from the remainder of the genome, with lengths drawn from a Gaussian distribution that has the same statistical properties as the lengths of the positive samples. • To avoid bias, the regions that overlap with the training dataset were removed. Hence the dataset was reduced to 250 positive samples and 325 negative samples.
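The overlap-removal step applied to both testing datasets can be sketched as a simple interval filter. The region representation (chromosome, 1-based inclusive start and end) and the function names are assumptions for illustration; the slides do not state whether a minimum overlap was required, so any shared base counts as an overlap here.

```python
def overlaps(a, b):
    """True if two (chrom, start, end) regions share at least one base.
    Coordinates are assumed to be 1-based and inclusive."""
    return a[0] == b[0] and a[1] <= b[2] and b[1] <= a[2]


def remove_overlapping(test_regions, training_regions):
    """Keep only the test regions that overlap no training region."""
    return [
        region for region in test_regions
        if not any(overlaps(region, train) for train in training_regions)
    ]
```

This quadratic scan is fine for a few thousand regions per set; an interval tree or a sorted sweep would be the usual choice for larger inputs.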

Feature Extraction • ChIP-Seq data was taken from Bonn, Stefan, et al. • It contained 8 different data types: • H3K4me1 • H3K4me3 • H3K27ac • H3K27me3 • H3K36me3 • H3K79me3 • Mef2 (Myocyte enhancer factor-2) • PolII (RNA polymerase II) Bonn, Stefan, et al. "Tissue-specific analysis of chromatin state identifies temporal signatures of enhancer activity during embryonic development." Nature genetics 44.2 (2012): 148-156.

Feature Extraction [Figure: what a processed ChIP-Seq file looks like according to Bonn, Stefan, et al.] • Each file starts with a header description. • Each value in a row is the signal strength over a fixed-size window at a particular region of a specific chromosome. • For example, at a window size of 50, the value in row 17010 means that the signal strength for positions 850,451 to 850,500 in chromosome 2 is 7.99.
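The row-to-coordinate mapping in the example can be written out directly. This sketch assumes rows are numbered from 1 after the header and that the first window starts at position 1, which is consistent with the row 17010 example but is not stated explicitly on the slide.

```python
def row_to_interval(row, window=50):
    """Map a 1-based row number to the 1-based inclusive genomic
    interval it covers, assuming fixed, non-overlapping windows."""
    start = (row - 1) * window + 1
    end = row * window
    return start, end
```

With the default window size, `row_to_interval(17010)` gives `(850451, 850500)`, matching the positions quoted above.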

Feature Extraction [Figure: signal along genomic coordinates, with the input sequence marked] To extract features from ChIP-Seq data, the authors sum the signals that overlap the input sequence and take the average. Hence, for each input sequence, we have 8 features based on ChIP-Seq, as we have 8 different types of data.
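A minimal sketch of the averaging step, assuming each ChIP-Seq track for one chromosome is a list of per-window signal values in genomic order. Whether the authors weighted partially overlapping windows is not stated, so this version takes an unweighted mean over every window that the input region touches; `tracks` and the function names are hypothetical.

```python
def mean_signal(signal, start, end, window=50):
    """Average the window values overlapping the 1-based inclusive
    region [start, end]. `signal[i]` covers positions
    i*window + 1 .. (i + 1)*window."""
    first = (start - 1) // window   # index of first overlapping window
    last = (end - 1) // window      # index of last overlapping window
    values = signal[first:last + 1]
    return sum(values) / len(values) if values else 0.0


def chip_features(tracks, start, end, window=50):
    """One feature per track, in the order the tracks are given."""
    return [mean_signal(signal, start, end, window) for signal in tracks.values()]
```

Passing a dictionary with the 8 tracks (H3K4me1, H3K4me3, H3K27ac, H3K27me3, H3K36me3, H3K79me3, Mef2, PolII) for the relevant chromosome would yield the 8 ChIP-Seq features for one input sequence.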

Feature Extraction • 125 insect-related transcription factor binding site motifs publicly available in the JASPAR database were downloaded. • For each motif-sequence pair, the authors computed the thermodynamic binding energy score. • Hence, for each input sequence, we have 125 features based on motifs, one per motif.
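The slides do not give the formula for the binding energy score. As a rough illustration only, one common thermodynamic-style formulation treats the log-odds of a site under a position weight matrix as a negative binding energy and sums the Boltzmann weights over every position on both strands. The uniform background, the PWM layout (a list of per-position base-probability dictionaries with no zero entries), and the function names below are all assumptions, not the authors' method.

```python
import math

BACKGROUND = 0.25  # assumed uniform base frequency
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def site_log_odds(pwm, site):
    """Log-odds of one site; higher means stronger predicted binding."""
    return sum(math.log(col[base] / BACKGROUND) for col, base in zip(pwm, site))


def binding_score(pwm, sequence):
    """Log of the summed Boltzmann weights over all sites on both strands."""
    width = len(pwm)
    forward = sequence.upper()
    reverse = forward.translate(COMPLEMENT)[::-1]
    total = 0.0
    for strand in (forward, reverse):
        for i in range(len(strand) - width + 1):
            site = strand[i:i + width]
            if set(site) <= set("ACGT"):  # skip sites with N or other codes
                total += math.exp(site_log_odds(pwm, site))
    return math.log(total) if total > 0 else float("-inf")
```

Computing this score for each of the 125 motifs would give one motif feature vector per input sequence. Unlike a best-hit score, the sum rewards sequences with several weak sites as well as those with one strong site.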

Feature Extraction In short, for each input sequence, i.e. sample, we have 8 features based on ChIP-Seq data and 125 features based on motif data, and one class label.

Classification Methods • Bayesian Network was used in Zinzen, Robert P., et al. to perform enhancer prediction based on ChIP-Seq features. • In this study, the authors used two more methods, i.e. SVM and Random Forest, to perform enhancer prediction based on ChIP-Seq features and sequence motif features. Zinzen, Robert P., et al. "Combinatorial binding predicts spatio-temporal cis-regulatory activity." Nature 462.7269 (2009): 65-70.

10-Fold Cross-Validation on the Training Data. Area Under Curve (AUC) is shown. EPI: Features based on ChIP-Seq only. MOT: Features based on sequence motifs only. ALL: Features based on both types of data. We observed that the Bayesian network performed well on the ChIP-Seq features but not on the sequence motif features. Random Forest was the best method in the study.
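For reference, the two evaluation ingredients named on this slide can be sketched without any library. AUC is computed here as the probability that a randomly chosen positive sample scores higher than a randomly chosen negative one (ties count as half), and the fold split is a plain shuffled partition. The function names and the fixed seed are assumptions; the authors' tooling is not described in the slides.

```python
import random


def auc(labels, scores):
    """Area under the ROC curve via pairwise comparison of positives
    (label 1) against negatives (label 0)."""
    pos = [s for label, s in zip(labels, scores) if label == 1]
    neg = [s for label, s in zip(labels, scores) if label == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def k_fold_indices(n_samples, k=10, seed=0):
    """Shuffle the sample indices and deal them into k disjoint folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]
```

In each of the 10 rounds, one fold serves as the held-out set and the other nine as training data; the reported AUC would then be the average over rounds.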

Measuring Feature Importance. ChIP-Seq features were observed to be more important than motif features. [Figure: importance of motif features, random features and ChIP-Seq features]

Ranking the ChIP-Seq Features. Mef2, H3K4me1 and H3K36me3 were observed to be the most important features.

The minimum set of Features • The authors claimed that they could use a minimum set of features consisting of • 1 sequence motif (zeste) feature and • 3 ChIP-Seq features (H3K4me1, H3K36me3 and Mef2) to build a model with stable prediction accuracy. • This hypothesis was examined and confirmed by 10-fold cross-validation, where the average accuracy obtained was 0.979.

The performance of the classifier on the Testing data. Area Under Curve (AUC) is shown. EPI: Features based on ChIP-Seq only. MOT: Features based on sequence motifs only. ALL: Features based on both types of data. To my understanding, the authors applied Random Forest to the testing data. Sequence-based features were observed to improve the classification performance.

Discussion • As enhancers are cell-type/tissue-specific, future classifiers should take this into account during training. • It is interesting to observe that the classifier can also be applied to the testing datasets, which include a mesoderm-specific dataset and a general dataset. This reduces the risk of over-fitting. • One question to ask is whether it is possible to learn the general and specific properties of enhancers separately. • One simple idea is to use all training data to learn one general model and use separated data to learn specific models. The trained classification models can then be compared.

Discussion • While the results in this study are very promising, it should be noted that the training data comes from a relatively simple organism. • It is interesting to study whether this can be applied to more complex organisms such as mammals. • However, this is currently very difficult due to the lack of comprehensive enhancer datasets comparable to REDfly. • Additionally, the human genome is much larger, which will challenge the scalability of the algorithms. • Scalable machine learning algorithms that mix supervised and unsupervised techniques should be developed to confront the challenge.

Conclusion • Understanding enhancers is important for studying gene regulation. Enhancer prediction remains challenging because enhancers can be far from their target genes. • While classification methods exploiting features from ChIP-Seq are popular, the authors attempted to build a classifier exploiting both sequence-based and chromatin-based features and hypothesized an improvement in performance. • The authors tested and confirmed their hypothesis in Drosophila melanogaster with a very high classification accuracy. • It would be interesting to test whether these findings can also be applied to complex systems such as mammalian genomes.