
Modeling Regulatory Motifs



Presentation Transcript


  1. Modeling Regulatory Motifs 3/26/2013

  2. Transcriptional Regulation
     • Transcription is controlled by the interaction of trans-acting elements called transcription factors (TFs) and cis-acting elements of DNA.
     • Prediction of cis-acting elements, or TF binding sites, is a challenging problem in computational biology.
     [Figure: Transcriptional regulation in prokaryotes. Labels: TF binding sites (TF1, TF2), promoter region (-35, -10), σ factor, TSS (+1), ribosome binding site, terminator, 5' UTR, 3' UTR, RNA; upstream coordinate -300.]

  3. Specific Protein-DNA interactions
     • Protein-DNA interactions are specific, which ensures that transcriptional regulation is specific and precise.
     • The specificity of protein-DNA interactions is realized by the 3-D structures of the DNA-binding face of the TF protein and of the TF binding site in the DNA sequence.
     • Usually a TF recognizes variable but similar binding sites associated with different genes.
     • The set of all binding sites recognized by the same TF is called a TF-binding motif.

  4. Experimental determination of binding sites
     • There are in vitro and in vivo methods for determining the binding sites of TFs.
     • Systematic evolution of ligands by exponential enrichment (SELEX) is likely to identify all possible sequences recognized by a TF.
     • SELEX may not work if the TF-DNA interaction requires unknown co-factors.
     • The method is laborious, as tedious molecular cloning and sequencing are required to determine the binding sites.
     Geertz M and Maerkl SJ. Briefings in Functional Genomics 2010;9:362-373

  5. Experimental determination of binding sites
     • The protein binding microarray (PBM) is another in vitro method. It avoids the molecular cloning step, and the binding sites can be read out directly from the microarray.
     • PBM can determine binding sites at single-base resolution.
     • Like SELEX, PBM may not work if the TF-DNA interaction requires unknown co-factors.
     • PBM may not work either if the binding site is long, e.g., longer than 12 bp.
     • A putative binding site determined by PBM is not necessarily a real binding site in cells.
     Geertz M and Maerkl SJ. Briefings in Functional Genomics 2010;9:362-373

  6. Experimental determination of binding sites
     • ChIP-seq and ChIP-chip are two high-throughput in vivo methods for determining the binding sites of a TF.
     • ChIP-seq and ChIP-chip can determine actual binding sites in a genome, but many cell types need to be explored to determine all binding sites.
     Geertz M and Maerkl SJ. Briefings in Functional Genomics 2010;9:362-373

  7. Profile representation of TF binding sites
     Examples of σ70 binding sites in E. coli:
       TACGAT  TATAAT  TATAAT  GATACT  TATGAT  TATGTT  TATAGT  TATAAT
     Consensus sequence: TATAAT
     Regular expression: [TG]A[TC][GA]XT
     Frequency matrix: the count of each base at each position. To avoid zero counts, add a pseudocount of 1.
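
A minimal sketch of how the consensus sequence and the regular expression can be derived from the eight example sites. The function names `consensus` and `to_regex` are illustrative, not from the slides, and the slide's `X` (any base) is written as `.` in Python's regex syntax.

```python
import re
from collections import Counter

# The eight sigma-70 binding sites listed on the slide.
SITES = ["TACGAT", "TATAAT", "TATAAT", "GATACT",
         "TATGAT", "TATGTT", "TATAGT", "TATAAT"]

def consensus(sites):
    """Most frequent base in each column (hypothetical helper)."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*sites))

def to_regex(sites):
    """One character class per column; '.' when all four bases occur."""
    parts = []
    for col in zip(*sites):
        seen = sorted(set(col))
        if len(seen) == 1:
            parts.append(seen[0])
        elif len(seen) == 4:
            parts.append(".")
        else:
            parts.append("[" + "".join(seen) + "]")
    return "".join(parts)

CONSENSUS = consensus(SITES)
PATTERN = to_regex(SITES)
```

The regular expression matches every observed site but carries no information about how often each base occurs, which is what motivates the frequency matrix.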

  8. Profile representation of TF binding sites
     Profile: for a motif of n samples (sequences), the probability of residue b at position i is
     $$p_{b,i} = \frac{n_{b,i} + k}{n + 4k},$$
     where $n_{b,i}$ is the frequency (count) of residue b at position i, and k is a pseudocount to avoid zero probabilities.
     [Table: profile $p_{b,i}$ of the σ70 binding sites in E. coli, pseudocount k = 1.]
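
The profile can be computed directly from the example sites. This is a sketch under the assumption that the pseudocount k is added to each of the four bases, so the denominator is n + 4k; `count_matrix` and `profile` are illustrative names.

```python
BASES = "ACGT"
SITES = ["TACGAT", "TATAAT", "TATAAT", "GATACT",
         "TATGAT", "TATGTT", "TATAGT", "TATAAT"]

def count_matrix(sites):
    """Frequency matrix: one dict of base counts per position."""
    counts = [{b: 0 for b in BASES} for _ in range(len(sites[0]))]
    for site in sites:
        for i, base in enumerate(site):
            counts[i][base] += 1
    return counts

def profile(sites, k=1.0):
    """p[i][b] = (n_{b,i} + k) / (n + 4k); assumes k is added per base."""
    n = len(sites)
    return [{b: (col[b] + k) / (n + 4 * k) for b in BASES}
            for col in count_matrix(sites)]

COUNTS = count_matrix(SITES)
PROFILE = profile(SITES, k=1)
```

Each column of `PROFILE` sums to 1, and no entry is zero even for bases that were never observed at a position.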

  9. Profile representation of TF binding sites
     Position-specific weight (scoring) matrix (PSWM): for a motif of n samples, the weight of residue b at position i is defined as
     $$w_{b,i} = \log_2 \frac{p_{b,i}}{p_b},$$
     where $p_{b,i}$ is the probability of residue b at position i, and $p_b$ is the probability of residue b in the background sequences.
     [Table: PSWM of the σ70 binding sites in E. coli, assuming $p_A = p_C = p_G = p_T = 0.25$.]
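
A short sketch of the PSWM as a log-odds matrix. It assumes a base-2 logarithm, a pseudocount of 1 per base, and a uniform background of 0.25; the function name `pswm` and its parameters are illustrative.

```python
import math

BASES = "ACGT"
SITES = ["TACGAT", "TATAAT", "TATAAT", "GATACT",
         "TATGAT", "TATGTT", "TATAGT", "TATAAT"]

def pswm(sites, k=1.0, background=None):
    """w[i][b] = log2(p_{b,i} / p_b), with pseudocount k per base."""
    background = background or {b: 0.25 for b in BASES}
    n = len(sites)
    matrix = []
    for col in zip(*sites):  # one tuple of bases per position
        matrix.append({
            b: math.log2(((col.count(b) + k) / (n + 4 * k)) / background[b])
            for b in BASES
        })
    return matrix

W = pswm(SITES)
```

Positive weights mark bases that are enriched at a position relative to the background, and negative weights mark depleted ones.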

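  10. Profile representation of TF binding sites
      Information content at position i of the sequence profile is given by
      $$I_i = \sum_{b} p_{b,i} \log_2 \frac{p_{b,i}}{p_b}.$$
      Information content of a motif:
      $$I = \sum_{i} I_i.$$
      Logo representation:
      $$R_i = 2 - \left(H_i + e(n)\right), \qquad H_i = -\sum_{b} p_{b,i} \log_2 p_{b,i},$$
      where e(n) is a correction factor required when only a few (n) samples are available. A pseudocount is not added when computing $p_{b,i}$. The height of each base b at position i is $p_{b,i} R_i$.
      http://weblogo.berkeley.edu/logo.cgi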
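
The per-position information used for a sequence logo can be sketched as follows. Assumptions: no pseudocount, the convention 0·log 0 = 0, and the common small-sample approximation e(n) = 3 / (2 ln 2 · n) for DNA, which is not given on the slide. The function names are illustrative.

```python
import math

BASES = "ACGT"
SITES = ["TACGAT", "TATAAT", "TATAAT", "GATACT",
         "TATGAT", "TATGTT", "TATAGT", "TATAAT"]

def column_information(sites, correct=False):
    """R_i = 2 - (H_i + e(n)) in bits; e(n) is 0 unless correct=True."""
    n = len(sites)
    e_n = 3 / (2 * math.log(2) * n) if correct else 0.0  # assumed approximation
    info = []
    for col in zip(*sites):
        probs = [col.count(b) / n for b in BASES]
        entropy = -sum(p * math.log2(p) for p in probs if p > 0)
        info.append(2.0 - (entropy + e_n))
    return info

def base_heights(sites, correct=True):
    """Logo height of base b at position i: p_{b,i} * R_i."""
    n = len(sites)
    info = column_information(sites, correct)
    return [{b: (col.count(b) / n) * r for b in BASES}
            for col, r in zip(zip(*sites), info)]

INFO = column_information(SITES)
HEIGHTS = base_heights(SITES)
```

With a uniform background, the uncorrected value equals the relative entropy of the column against the background, so a fully conserved position carries 2 bits.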

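  11. Score of a sequence using a PSWM
      If we represent a sequence $S = \{b_1 b_2 \ldots b_j \ldots b_n\}$ as a binary matrix $\{s_{j,b}\}_{n \times 4}$, for example for S = TATAAT:

          j   A  C  G  T
          1   0  0  0  1
          2   1  0  0  0
          3   0  0  0  1
          4   1  0  0  0
          5   1  0  0  0
          6   0  0  0  1

      then the score of the sequence against a profile (or PSWM) is defined as
      $$\mathrm{Score}(S) = \sum_{j=1}^{n} \sum_{b \in \{A,C,G,T\}} s_{j,b}\, w_{b,j}.$$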

  12. Score of a sequence using a PSWM
      Example: TATAAT = $\{s_{j,b}\}$ =

          j   A  C  G  T
          1   0  0  0  1
          2   1  0  0  0
          3   0  0  0  1
          4   1  0  0  0
          5   1  0  0  0
          6   0  0  0  1
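
A minimal sketch of scoring with the binary-matrix representation. It assumes log2 weights, a pseudocount of 1 per base, and a uniform background; `pswm`, `one_hot`, and `score` are illustrative names.

```python
import math

BASES = "ACGT"
SITES = ["TACGAT", "TATAAT", "TATAAT", "GATACT",
         "TATGAT", "TATGTT", "TATAGT", "TATAAT"]

def pswm(sites, k=1.0, bg=0.25):
    """Log-odds weights w[j][b] with pseudocount k and uniform background."""
    n = len(sites)
    return [{b: math.log2(((col.count(b) + k) / (n + 4 * k)) / bg)
             for b in BASES} for col in zip(*sites)]

def one_hot(seq):
    """Binary matrix s[j][c]: 1 where base c (in ACGT order) occurs at j."""
    return [[1 if base == b else 0 for b in BASES] for base in seq]

def score(seq, weights):
    """Sum over positions j and bases b of s[j][b] * w[j][b]."""
    s = one_hot(seq)
    return sum(s[j][c] * weights[j][b]
               for j in range(len(seq))
               for c, b in enumerate(BASES))

W = pswm(SITES)
SCORE_TATAAT = score("TATAAT", W)
```

Because each row of the binary matrix has a single 1, the double sum reduces to looking up one weight per position.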

  13. Higher order PSWM
      To account for the dependence among adjacent positions in TF-DNA interactions, we can use higher order PSWMs. A higher order PSWM corresponds to a k-th order Markov chain, in which position i depends on the previous k positions. A higher order PSWM is also called a position weight array.
      First-order PSWM for the σ70 factor binding sites:
        TACGAT  TATAAT  TATAAT  GATACT  TATGAT  TATGTT  TATAGT
      To avoid zero counts, add a pseudocount of 1.
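
One way to sketch a first-order model is to estimate, for each position i ≥ 2, the probability of base b given the base a at the previous position. The normalization (pseudocount k per base, denominator n_a + 4k), the log2 scoring against a uniform background, and all function names are assumptions made for illustration.

```python
import math

BASES = "ACGT"
# The seven sites listed on this slide.
SITES = ["TACGAT", "TATAAT", "TATAAT", "GATACT",
         "TATGAT", "TATGTT", "TATAGT"]

def first_order_model(sites, k=1.0):
    """Return (first, trans): base probabilities at position 0 and, for each
    later position i, conditional probabilities trans[i-1][a][b] = P(b | a)."""
    n = len(sites)
    first = {b: (sum(s[0] == b for s in sites) + k) / (n + 4 * k)
             for b in BASES}
    trans = []
    for i in range(1, len(sites[0])):
        table = {}
        for a in BASES:
            n_a = sum(s[i - 1] == a for s in sites)
            table[a] = {
                b: (sum(s[i - 1] == a and s[i] == b for s in sites) + k)
                   / (n_a + 4 * k)
                for b in BASES
            }
        trans.append(table)
    return first, trans

def score(seq, first, trans, bg=0.25):
    """Log-odds score of seq under the first-order model."""
    total = math.log2(first[seq[0]] / bg)
    for i in range(1, len(seq)):
        total += math.log2(trans[i - 1][seq[i - 1]][seq[i]] / bg)
    return total

FIRST, TRANS = first_order_model(SITES)
```

With k = 0 previous positions, the same construction reduces to the ordinary PSWM.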

  14. Maximal dependence decomposition
      Maximal dependence decomposition (MDD) models the dependence between any two positions. It estimates the extent to which the nucleotide $b_j$ at position j depends on the nucleotide $b_i$ at position i. MDD uses the χ² test to determine whether position j depends on position i.
      For each position i, we divide the binding sites into two groups:
      • $C_i$: binding sites having the consensus base at position i;
      • $\bar{C}_i$: binding sites having a non-consensus base at position i.
      [Figure: the example binding sites partitioned into $C_i$ and $\bar{C}_i$ according to the base $b_i$ at position i, with the bases $b_j$ at position j compared between the two groups.]

  15. Maximal dependence decomposition
      Let $f_b$ be the probability of base b at position j in the binding sites in $\bar{C}_i$. Let $N$ and $N_b$ be the total number of binding sites and the count of base b at position j in $C_i$, respectively. Then the χ² statistic is defined as
      $$\chi^2 = \sum_{b \in \{A,C,G,T\}} \frac{(N_b - N f_b)^2}{N f_b}.$$
      [Figure: the binding sites split into two groups by the base $b_i$; for the N binding sites in $C_i$, the counts $N_A, N_C, N_G, N_T$ at position j are compared with the probabilities $f_A, f_C, f_G, f_T$.]
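
A hedged sketch of the χ² statistic for one pair of positions. It assumes that the expected probabilities $f_b$ are estimated from the non-consensus group, and it adds a pseudocount k to them so that no expected count is zero; that smoothing and the names `consensus_base` and `chi2_dependence` are not from the slides. Positions are 0-based in the code.

```python
BASES = "ACGT"
SITES = ["TACGAT", "TATAAT", "TATAAT", "GATACT",
         "TATGAT", "TATGTT", "TATAGT", "TATAAT"]

def consensus_base(sites, i):
    """Most frequent base at 0-based position i."""
    column = [s[i] for s in sites]
    return max(BASES, key=column.count)

def chi2_dependence(sites, i, j, k=1.0):
    """Chi-square statistic for the dependence of position j on position i."""
    cons = consensus_base(sites, i)
    c_i = [s for s in sites if s[i] == cons]      # consensus group
    c_bar = [s for s in sites if s[i] != cons]    # non-consensus group
    if not c_bar:
        return 0.0  # position i is fully conserved: nothing to compare
    total = 0.0
    for b in BASES:
        # Expected probability of b at j, smoothed with pseudocount k (assumption).
        f_b = (sum(s[j] == b for s in c_bar) + k) / (len(c_bar) + 4 * k)
        expected = len(c_i) * f_b
        observed = sum(s[j] == b for s in c_i)
        total += (observed - expected) ** 2 / expected
    return total

CHI2_4_ON_2 = chi2_dependence(SITES, i=3, j=1)
```

In practice the statistic would be compared against a χ² distribution with 3 degrees of freedom, and small groups such as these make the test unreliable.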

  16. Maximal dependence decomposition
      This χ² statistic describes the dependence of position j on position i, and is denoted $\chi^2(j|i)$. The MDD approach proceeds iteratively as follows.
      1. For each position i, compute
         $$S_i = \sum_{j \neq i} \chi^2(j|i).$$
      2. Among all the positions, select the position i with maximum $S_i$, and partition the sequences into two groups, $C_i$ and $\bar{C}_i$.
      3. Repeat steps 1 and 2 separately for $C_i$ and $\bar{C}_i$.
      4. Stop if there is no significant dependence or if there is an insufficient number of binding sites in $C_i$ or $\bar{C}_i$. In either case, construct a standard PSWM for the remaining subset of binding sites.
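
The iterative procedure can be sketched as a recursive function that returns a decision tree whose leaves hold the subsets to be modeled by standard PSWMs. The parameters `threshold` (minimum $S_i$ treated as significant) and `min_sites` (minimum group size) are hypothetical stand-ins for the stopping criteria, and the χ² helper uses the smoothed form assumed here rather than a formula confirmed by the slides.

```python
BASES = "ACGT"
SITES = ["TACGAT", "TATAAT", "TATAAT", "GATACT",
         "TATGAT", "TATGTT", "TATAGT", "TATAAT"]

def split_on(sites, i):
    """Return (consensus base, consensus group, non-consensus group) at i."""
    column = [s[i] for s in sites]
    cons = max(BASES, key=column.count)
    return (cons, [s for s in sites if s[i] == cons],
            [s for s in sites if s[i] != cons])

def chi2(sites, i, j, k=1.0):
    """Chi-square of j given i, with pseudocount-smoothed expectations."""
    _, c_i, c_bar = split_on(sites, i)
    if not c_bar:
        return 0.0
    total = 0.0
    for b in BASES:
        f_b = (sum(s[j] == b for s in c_bar) + k) / (len(c_bar) + 4 * k)
        expected = len(c_i) * f_b
        total += (sum(s[j] == b for s in c_i) - expected) ** 2 / expected
    return total

def mdd(sites, threshold, min_sites):
    """Build the MDD tree; leaves hold the sites for one standard PSWM."""
    length = len(sites[0])
    # Step 1: S_i for every position.
    s_values = [sum(chi2(sites, i, j) for j in range(length) if j != i)
                for i in range(length)]
    # Step 2: position with maximum S_i.
    best = max(range(length), key=lambda i: s_values[i])
    cons, c_i, c_bar = split_on(sites, best)
    # Step 4: stop on weak dependence or too few sites in either group.
    if s_values[best] < threshold or min(len(c_i), len(c_bar)) < min_sites:
        return {"leaf": sites}
    # Step 3: recurse on both groups.
    return {"position": best, "base": cons,
            "consensus": mdd(c_i, threshold, min_sites),
            "other": mdd(c_bar, threshold, min_sites)}

def route(tree, seq):
    """Follow the tree to the leaf whose PSWM should score seq."""
    while "leaf" not in tree:
        same = seq[tree["position"]] == tree["base"]
        tree = tree["consensus"] if same else tree["other"]
    return tree["leaf"]

def leaves(tree):
    """All leaf subsets of the tree."""
    if "leaf" in tree:
        return [tree["leaf"]]
    return leaves(tree["consensus"]) + leaves(tree["other"])

TREE = mdd(SITES, threshold=0.0, min_sites=2)
```

Scoring a new sequence then amounts to calling `route` to pick the leaf and scoring against the PSWM built from that leaf's sites.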

  17. Maximal dependence decomposition
      Illustration of the MDD procedure: modeling
      [Figure: MDD tree. The binding sites are first split at position 1 (maximum S1). The subset with the consensus base at position 1 is split again at position 3 (maximum S3). Subsets showing insufficient dependence are modeled by standard PSWMs (PSWM1, PSWM2).]

  18. Maximal dependence decomposition
      Illustration of the MDD procedure: scoring
      To score X = AAGGTG: position 1 has the consensus base 'A' and position 3 has the non-consensus base 'G', so X is scored using PSWM2.
      [Figure: the same MDD tree as in the modeling illustration, with splits at position 1 (maximum S1) and position 3 (maximum S3) and leaves PSWM1 and PSWM2.]

  19. Modeling and detecting arbitrary dependencies
      We can also use a digraph to model the dependence among the positions.
      [Figure: panels a to d, each showing a digraph over positions S1, S2, S3, S4.]

  20. Searching for novel binding sites using a PSWM
      Scan a sequence using a sliding window of the length of the PSWM, and return the windows that have a significantly high score.
        ...G A G T T A T A A T T A A G A...
      The significance of a score S can be computed as an empirical p-value, or as follows:
      $$\frac{S - S_{min}}{S_{max} - S_{min}},$$
      where $S_{min}$ and $S_{max}$ are the minimal and maximal scores that can be obtained with the PSWM.
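
A minimal sketch of the sliding-window scan on the example sequence. It assumes the significance measure is the score rescaled between the minimal and maximal attainable scores; the `cutoff` parameter and function names are illustrative, and window start positions are 0-based.

```python
import math

BASES = "ACGT"
SITES = ["TACGAT", "TATAAT", "TATAAT", "GATACT",
         "TATGAT", "TATGTT", "TATAGT", "TATAAT"]

def pswm(sites, k=1.0, bg=0.25):
    """Log-odds weights with pseudocount k and uniform background."""
    n = len(sites)
    return [{b: math.log2(((col.count(b) + k) / (n + 4 * k)) / bg)
             for b in BASES} for col in zip(*sites)]

def scan(sequence, weights, cutoff):
    """Return (start, window, relative score) for windows at or above cutoff."""
    width = len(weights)
    s_min = sum(min(col.values()) for col in weights)
    s_max = sum(max(col.values()) for col in weights)
    hits = []
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        raw = sum(weights[j][b] for j, b in enumerate(window))
        relative = (raw - s_min) / (s_max - s_min)  # rescaled to [0, 1]
        if relative >= cutoff:
            hits.append((start, window, relative))
    return hits

W = pswm(SITES)
SEQUENCE = "GAGTTATAATTAAGA"
HITS = scan(SEQUENCE, W, cutoff=0.9)
```

An empirical p-value would instead compare each window's score with the distribution of scores obtained on background or shuffled sequences.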

  21. De novo prediction of TF binding sites
      • The motif-finding problem: since the cis-regulatory elements of a TF usually have no fixed pattern, a cis-regulatory element can only be predicted by comparing a set of sequences that are likely to contain binding sites of the same TF. The problem of finding cis-regulatory elements in a given set of sequences is called the motif-finding problem.
      • Currently, all sequence-based motif-finding algorithms are based on the assumption that the binding sites of a TF are more conserved than the flanking sequences in a genome.
      A large number of motif-finding algorithms have been developed:
      • Greedy algorithms: CONSENSUS, DREME
      • Probabilistic algorithms: MEME, BioProspector
      • Graph-theoretic algorithms: CUBIC, MotifClick
      • ……

  22. Methods for finding a set of intergenic sequences for motif-finding
      • One genome, multiple genes approach: identify a set of co-regulated genes from an organism of interest through clustering analysis of gene expression profiles.
      [Figure: intergenic regions IA, IB, IC, ID, IE, IF of the co-regulated genes are used as input to motif finding.]

  23. Methods for finding a set of intergenic sequences for motif-finding
      • One gene, multiple genomes approach (phylogenetic footprinting): in closely related species, both the coding sequences and the cis-regulatory elements of orthologous genes are often conserved.
      [Figure: TFBSs and genes of an operon (coordinates -300, -35, -10, +1) aligned with a homologous operon from another genome.]

  24. Phylogenetic footprinting
      [Figure: workflow. Orthologue identification for genes T.g1 ... T.gm across genomes G1 ... Gn (G1.g1 ... G1.gm, G2.g1 ... G2.gm, ..., Gn.g1 ... Gn.gm), followed by extraction of intergenic regions, motif finding, predicted binding sites, and PSWMs.]

  25. Additional hallmarks of functional TF binding sites
      In higher eukaryotes, genes are regulated by multiple TFs binding to closely clustered binding sites. These clusters of binding sites of the same and/or different TFs are called cis-regulatory modules (CRMs). CRMs can be in different orientations; can be located upstream, downstream, or in an intron of a gene; can be very far away from the target gene; and can even be on a different chromosome.
      Borok MJ et al. Development 2010;137:5-13
      Wasserman WW and Sandelin A. Nature Reviews Genetics 2004;5:276-287
