
Computational detection of cis-regulatory modules

Stein Aerts, Peter Van Loo, Ger Thijs, Yves Moreau and Bart De Moor, Katholieke Universiteit Leuven, Belgium. Slides by Chulyun Kim. Presented by Saurabh Sinha.



  1. Computational detection of cis-regulatory modules • Stein Aerts, Peter Van Loo, Ger Thijs, Yves Moreau and Bart De Moor • Katholieke Universiteit Leuven, Belgium • Slides by Chulyun Kim • Presented by Saurabh Sinha

  2. Contents • Introduction • Methods • Methodology overview • Score functions • ModuleSearch algorithm • Results • Conclusions

  3. Contents • Introduction • Methods • Methodology overview • Score functions • ModuleSearch algorithm • Results • Conclusions

  4. Motivation • The transcriptional regulation of a metazoan gene depends on the cooperative action of multiple transcription factors • These factors bind to cis-regulatory modules (CRMs) located in the neighborhood of the gene • By integrating multiple signals, CRMs confer an organism specific spatial and temporal rate of transcription

  5. Related Work • Yuh et al., 1998: Working with combinations of factors makes it possible to integrate multiple inputs, which further provides cross-coupling of signal transduction and gene regulatory pathways • Bray et al., 2003: AVID, an alignment algorithm designed to identify functional noncoding segments • Aerts et al., 2003: delineation of putative regions containing CRMs in large intergenic sequences • Thijs et al., 2002: detecting DNA motifs by their statistical over-representation in a set of sequences • Aerts et al., 2003: detecting over-represented hits of known TFBSs • Recently, exploiting colocalization to find true binding sites in a particular gene has yielded valuable hypotheses regarding transcriptional regulation

  6. Problem • To find the best combination of transcription factor binding sites (TFBSs) that occurs several times across multiple coregulated human genes • Specifically within regions that are syntenic with the respective mouse orthologous genes

  7. Contents • Introduction • Methods • Methodology overview • Score functions • ModuleSearch algorithm • Results • Conclusions

  8. Methodology Overview

  9. Data • Human-mouse orthologous pairs • 10 kb of sequence upstream of the coding sequence of the human and mouse gene from Ensembl release 9 • 18,778 pairs with successful selection

  10. Alignment and Parsing • Alignment • Each 10kb pair was aligned with AVID • Parsing • The alignment output was parsed using VISTA • Select regions with at least 75% identity in windows of 100 bp • 33,282 regions in total • Syntenic fastA database
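The region-selection rule on this slide (at least 75% identity in windows of 100 bp) can be illustrated with a minimal sketch. This is not the VISTA implementation: the function name `conserved_regions`, the gap character `-`, and the choice to merge overlapping qualifying windows into one region are all assumptions made for illustration.

```python
def conserved_regions(aln_a, aln_b, window=100, min_identity=0.75):
    """Return merged (start, end) spans, in alignment columns, covered by
    sliding windows whose fraction of identical bases is >= min_identity.

    aln_a, aln_b: the two rows of a pairwise alignment (equal length,
    '-' for gaps). Spans are half-open: start inclusive, end exclusive.
    """
    if len(aln_a) != len(aln_b):
        raise ValueError("aligned sequences must have equal length")
    n = len(aln_a)
    # 1 where both rows carry the same base, 0 for mismatches and gaps.
    match = [int(a == b and a != "-")
             for a, b in zip(aln_a.upper(), aln_b.upper())]
    regions = []
    if n < window:
        return regions
    count = sum(match[:window])
    for start in range(n - window + 1):
        if start > 0:
            # Slide the window one column to the right.
            count += match[start + window - 1] - match[start - 1]
        if count >= min_identity * window:
            end = start + window
            if regions and start <= regions[-1][1]:
                regions[-1] = (regions[-1][0], end)  # extend previous region
            else:
                regions.append((start, end))
    return regions
```

With a toy window of 10 columns, `conserved_regions("A" * 10 + "C" * 10, "A" * 10 + "G" * 10, window=10)` keeps the span covered by the windows that still contain at least 8 matching columns.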

  11. Background Model and MotifScanner • Background Model • A 3rd-order Markov model is calculated from the Syntenic fastA database • Used for scoring and for generating artificial datasets • MotifScanner • All syntenic regions are scanned to predict transcription factor binding sites (TFBSs) • TRANSFAC: frequency matrices • All occurrences are stored in GFF format in the Syntenic GFF database • GFF (Gene-Finding Format or General Feature Format): a protocol for the transfer of feature information • Fields are: <seqname> <source> <feature> <start> <end> <score> <strand> <frame> • Example TRANSFAC frequency matrix (position, counts for A, C, G, T, consensus base):
      PO   A   C   G   T
      01  12   4   3   1   A
      02   3   2  11   4   G
      03  11   2   4   3   A
      …
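As an illustration of what a 3rd-order Markov background model stores, here is a minimal sketch that estimates P(base | previous three bases) from a set of sequences. The function name `train_background`, the pseudocount, and the handling of non-ACGT symbols are assumptions; the slides do not say how the authors estimated or smoothed the model.

```python
from collections import defaultdict

def train_background(sequences, order=3, pseudocount=1.0):
    """Estimate an order-`order` Markov model from DNA sequences.

    Returns a dict mapping each observed context (a string of `order`
    bases) to a dict of conditional probabilities for A, C, G, T.
    Positions involving symbols other than A/C/G/T are skipped.
    """
    counts = defaultdict(lambda: dict.fromkeys("ACGT", 0.0))
    for seq in sequences:
        seq = seq.upper()
        for i in range(order, len(seq)):
            context, base = seq[i - order:i], seq[i]
            if base in "ACGT" and set(context) <= set("ACGT"):
                counts[context][base] += 1
    model = {}
    for context, c in counts.items():
        # Additive smoothing so unseen transitions keep nonzero probability.
        total = sum(c.values()) + 4 * pseudocount
        model[context] = {b: (c[b] + pseudocount) / total for b in "ACGT"}
    return model
```

A model trained this way can serve both purposes named on the slide: looking up background probabilities when scoring a site, and sampling bases one at a time to generate artificial sequences.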

  12. Coregulated Genes • Sets of coexpressed genes • From the SOURCE database for cyclin B2 • Dataset of gene expression during the cell cycle in a human cancer cell line • 44 genes might share a common cis-regulatory element • Of these, 34 had an Ensembl identifier • Among them, 13 genes have at least one syntenic region with the respective mouse gene • 32 regions in total

  13. Contents • Introduction • Methods • Methodology overview • Score functions • ModuleSearch algorithm • Results • Conclusions

  14. Scoring single TFBSs • Combines a position-specific frequency matrix (PSFM) Θ and a higher-order background model Bm • Measures how likely it is that the segment is generated by the motif model rather than by the background • x is a segment [b1, b2, … , bw] • bj is the nucleotide found at position j in x • Θ(bj, j) is the probability of finding bj at position j according to the PSFM • P(bj | s, Bm) is the probability of finding bj in the sequence according to the background model
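The formula itself did not survive in the transcript. A plausible reconstruction from the definitions on this slide is a log-likelihood ratio of the motif model against the background; the symbol W(x) for the score and the use of a logarithm are assumptions, not taken from the slide.

```latex
W(x) \;=\; \log \frac{P(x \mid \Theta)}{P(x \mid B_m)}
     \;=\; \sum_{j=1}^{w} \log \frac{\Theta(b_j, j)}{P(b_j \mid s, B_m)}
```

Under this form, a segment scores above zero when the PSFM explains it better than the background model does, and each position contributes independently under the motif model while the background term depends on the preceding bases in s.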

  15. Matrix similarity • Redundancy of motif models • There can be multiple matrices describing the same TF • There can be distinct TFs with similar PSFMs • Kullback-Leibler distance between two motif models • Θ1(j,b) is the probability of finding base b at position j in Motif 1 • w is the length of the motif • A is the set of all possible alignments for an allowed shift • The motif models can be grouped into classes depending on a threshold on this average distance
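The distance formula is missing from the transcript, so the following is only a sketch of one common way to compare two PSFMs: a symmetrized Kullback-Leibler divergence, averaged over the aligned positions and minimized over the allowed shifts. The symmetrization, the averaging, the small constant `eps`, and the function name `kl_distance` are all assumptions.

```python
import math

def kl_distance(m1, m2, max_shift=0, eps=1e-9):
    """Symmetrized, position-averaged KL distance between two motif models.

    m1, m2: lists of dicts mapping 'A', 'C', 'G', 'T' to probabilities,
    one dict per motif position. Each shift in [-max_shift, max_shift]
    defines one alignment; the smallest average distance is returned.
    """
    best = math.inf
    for shift in range(-max_shift, max_shift + 1):
        # Pair up the positions that overlap under this shift.
        pairs = [(m1[j], m2[j + shift])
                 for j in range(len(m1)) if 0 <= j + shift < len(m2)]
        if not pairs:
            continue
        total = 0.0
        for p, q in pairs:
            for b in "ACGT":
                pb, qb = p[b] + eps, q[b] + eps  # avoid log(0)
                total += pb * math.log(pb / qb) + qb * math.log(qb / pb)
        best = min(best, total / (2 * len(pairs)))
    return best
```

Motif models whose distance falls below a chosen threshold would then be placed in the same class. Note that large shifts leave few overlapping positions, which is one reason to bound the allowed shift.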

  16. Module Score Function • A binding site and a motif model (a frequency matrix) → CRMs and CRM models • CRMs: clusters of actual binding sites on a sequence • CRM models: sets of motif models • The score of a CRM model m on a set of sequences s=(s1,…,sn)

  17. The score of a CRM model m on a sequence s • m is a collection of motif models Θ1, …, Θl • t is a set of matching binding sites • represents a count over the occurring TFBSs of model Θi in sequence s • If the number of occurrences is q, can take any value in 0, … , q • is the kth instance of Θi on sequence s • is the score of a single TFBS • b(t) is a boolean function expressing whether the given combination of TFBSs is valid or not • Overlap between different TFBSs • The sites within the specified window length → distance constraint • p(t) is the penalization function of CRMs • The number of occurring sites divided by the number of motif models l • The score does not take the motif order into account
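The two conditions checked by b(t), no overlap between sites and all sites within the specified window, can be sketched as follows. The function name `is_valid_combination`, the half-open (start, end) coordinates, and the rule that any overlap invalidates the combination are assumptions made for illustration.

```python
def is_valid_combination(sites, window=100):
    """Boolean validity check for a combination of binding sites.

    sites: list of (start, end) tuples, start inclusive and end exclusive,
    all on the same sequence. The combination is valid when no two sites
    overlap and the whole cluster fits inside `window` base pairs.
    """
    if not sites:
        return True
    ordered = sorted(sites)
    # After sorting by start, checking neighbours is enough to detect overlap.
    for (_, prev_end), (next_start, _) in zip(ordered, ordered[1:]):
        if next_start < prev_end:
            return False
    span = max(end for _, end in ordered) - ordered[0][0]
    return span <= window
```

For example, two sites at (0, 10) and (12, 20) pass with the default window, while (0, 10) and (5, 15) fail the overlap test and (0, 10) and (150, 160) fail the distance constraint.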

  18. Contents • Introduction • Methods • Methodology overview • Score functions • ModuleSearch algorithm • Results • Conclusions

  19. ModuleSearch • Since the order of sites is not considered, CRM models can be sorted in alphabetical order • nΘ, the number of sites a module should contain, is given • Search for the best CRM model on a set of coregulated genes • Typical best-first / branch-and-bound search • Starting from the empty model, expand incomplete models by adding a motif model from a different class until there are no incomplete models whose overestimate heuristic score is greater than the score of the current best complete model • The model with the best heuristic score is expanded first
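The search strategy on this slide can be sketched on a toy problem. Here each motif model has a fixed score and the score of a CRM model is simply the sum of its members, which is an assumption; in the method described, the score comes from the module score function evaluated on the sequences, and the heuristic is the one defined on the next slide. The function name `best_first_module` and the input layout are hypothetical.

```python
import heapq

def best_first_module(motifs, n_sites):
    """Best-first branch-and-bound over sets of motif models.

    motifs: dict mapping motif name -> (motif_class, score).
    Returns (best_score, names) for the best set of n_sites motifs that
    all belong to different classes, or (-inf, None) if none exists.
    """
    names = sorted(motifs)  # alphabetical order avoids revisiting permutations
    top = sorted((s for _, s in motifs.values()), reverse=True)

    def bound(score, size):
        # Overestimate: pretend the best remaining scores can all be added.
        return score + sum(top[:n_sites - size])

    best_score, best_model = float("-inf"), None
    heap = [(-bound(0.0, 0), 0.0, ())]  # (negated bound, score, model)
    while heap:
        neg_bound, score, model = heapq.heappop(heap)
        if -neg_bound <= best_score:
            break  # no remaining model can beat the current best
        if len(model) == n_sites:
            best_score, best_model = score, model
            continue
        used = {motifs[m][0] for m in model}
        start = names.index(model[-1]) + 1 if model else 0
        for name in names[start:]:
            motif_class, s = motifs[name]
            if motif_class in used:
                continue  # each added motif must come from a new class
            child = model + (name,)
            heapq.heappush(heap, (-bound(score + s, len(child)), score + s, child))
    return best_score, best_model
```

Because the bound never underestimates the best reachable score, the first complete model that becomes the incumbent and survives the stopping test is optimal for this toy scoring; the tighter the overestimate, the fewer incomplete models need to be expanded.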

  20. Heuristic Score • is the score function without penalization of m • is an overestimate heuristic value of the rise in score from CRM model m to the best child CRM model • [Θi] is a CRM model containing one matrix Θi • t = ( )  (Θl +1 , …, Θe) • is a boolean function expressing whether the motif models, when added to m, all belong to different classes

  21. Contents • Introduction • Methods • Methodology overview • Score functions • ModuleSearch algorithm • Results • Conclusions

  22. Semi-Artificial Sequences • Artificial sequences were generated by sampling symbols from the background model

  23. Detecting Modules in Microarray Clusters • Selected gene cluster around cyclin B2 • The best module model in the cluster selected by ModuleSearcher • window=100 bp and nΘ=4 • [NFY, STAF, TCF4, CEBPA]

  24. Contents • Introduction • Methods • Methodology overview • Score functions • ModuleSearch algorithm • Results • Conclusions

  25. Conclusions • Scoring functions for modules in syntenic regions and an algorithm to find the best-scoring module were proposed • The authors tested the proposed algorithm on artificial data and showed that it could find the hidden modules with high sensitivity • They predicted a module in a set of coexpressed genes and validated the prediction using the same approach
