
Canadian Bioinformatics Workshops

Canadian Bioinformatics Workshops 2009

Module 3

Inferring Regulatory Mechanisms Governing Sets of Genes

Wyeth W. Wasserman

University of British Columbia

Module 3: Overview

Part 1: Overview of transcription

Lab 3.1: Promoters in Genome Browser (UCSC)

Part 2: Prediction of transcription factor binding sites using binding profiles (“Discrimination”)

Lab 3.2: TFBS scan (Footer)

Part 3: Interrogation of sets of co-expressed genes to identify mediating transcription factors

Lab 3.3: TFBS Over-Representation (oPOSSUM)

Part 4: Detection of novel motifs (TFBS) over-represented in regulatory regions of co-expressed genes (“Discovery”)

Lab 3.4: Motif Discovery (MEME/Motif-Compare)

Restrictions in Coverage

  • Focus on Eukaryotic cells and PolII Promoters

    • Principles apply to prokaryotes

    • Will provide suggestions for similar tools for other species

  • Many of the examples drawn from my lab’s work - there are many equivalent tools (links to be provided)

    Part 1: Introduction to transcription in eukaryotic cells

    Transcription Over-Simplified

    • Three-step Process:

    • TF binds to TFBS (DNA)

    • TF catalyzes recruitment of polymerase II complex

    • Production of RNA from transcription start site (TSS)






    Anatomy of Transcriptional Regulation
    WARNING: Terms vary widely in meaning between scientists

    Core Promoter/Initiation Region (Inr)


    • Core Promoter – Sufficient for initiation of transcription; orientation dependent

      • TSR – transcription start region

        • Refers to a region rather than specific start site (TSS)

  • TFBS – single transcription factor binding site

  • Regulatory Regions

    • Proximal/Distal – vague reference to distance from TSR

    • May be positive (enhancing) or negative (repressing)

    • Orientation independent (generally)

    • Modules – Sets of TFBS within a region that function together

  • Transcriptional Unit

    • DNA sequence transcribed as a single polycistronic mRNA

    [Figure labels: Distal Regulatory Region, Proximal Regulatory Region, Distal R.R.]











    Complexity in Transcription

    [Figure labels: Distal enhancer, Proximal enhancer, Core Promoter, Distal enhancer]

    Lab Discovery of TF Binding Sites

    Reporter Gene Activity











    Identify functional regulatory region within a sequence and delineate specific TFBS through mutagenesis (and in vitro binding studies)

    EMSA/Gel Shift Assays to Identify Binding Proteins

    TF + DNA


    High-throughput Methods

    • SELEX

      • mix random ds DNA oligonucleotides with TF protein, recover TF-DNA complexes and sequence DNA

    • Protein Binding Arrays (UniProbe Database)

      • prepare arrays with ds DNA attached, label protein with a fluorescent mark and observe DNA bound by protein

    • ChIP

      • covalently link proteins to DNA in cell, shear DNA, recover protein-DNA complexes and identify DNA (PCR, array or sequencing)


    • In most vertebrates the delineation of the transcription start position is not easy

      • cDNA often incomplete at 5’ end

  • Multiple promoters for most human genes

    • Referencing position relative to the initiation “site” is therefore not a good idea

      • But done almost uniformly in biological papers

    • Translation start equally problematic

      • Can be in internal exon

      • Multiple ORF start positions common

  • Importance of promoter proximal regions varies between species

    • Humans appear to have little enrichment for functional sequences near promoters; because the regions to consider are vast, analyses are generally restricted to a region around the promoter(s), but the justification is not strong

    • Yeast and C. elegans have more compact regions and promoter proximity can be a useful property to restrict analyses

    mRNA Caps for Mapping Initiation Sites

    • The 5’ end of an mRNA has a “cap” structure that can be precipitated with an antibody

      • Allows for large-scale sequencing of “full-length” cDNAs and “tags” derived from the 5’ end of mRNAs

      • RIKEN is the leading generator of such sequences

      • Not well represented in genome annotation resources (unfortunately)

    Classes of Initiation Regions

    [Figure: % CAGE cap tags per position for two classes of initiation regions: TATA Box and CpG Island]


    This is over-simplified - see paper for greater detail. Take home message is that promoters are not drawn from a single continuous distribution of properties, rather drawn from at least two classes.

    Image from Carninci P, et al (2006). Genome-wide analysis of mammalian promoter architecture and evolution. Nat Genet. Apr 28 PMID: 16645617




    CpG Islands

    • DNA methylation occurs in competition with histone acetylation

      • Acetylation promotes open chromatin structure that is permissive for TF binding to DNA

      • Methylation of DNA inhibits histone acetylation

      • Certain TFs promote histone acetylation by recruiting acetylases

  • Methylation occurs on cytosines

    • Preferentially on cytosine adjacent to guanines (CG dinucleotides, generally referred to as CpG)

    • Methylated cytosines frequently undergo deamination to form thymine (CpG -> TpG)

  • CpG Islands are regions of DNA where CG dinucleotides occur at a frequency consistent with C and G mononucleotide frequencies

    • Highlight regions of active transcription
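
One way to make the "frequency consistent with C and G mononucleotide frequencies" criterion concrete is an observed/expected CpG ratio. The sketch below assumes the commonly used definition (number of CG dinucleotides × sequence length) / (number of C × number of G); the function name and the toy sequences are illustrative and do not come from the workshop material.

```python
def cpg_obs_exp(seq):
    """Observed/expected CpG ratio for one DNA sequence.

    Assumed definition: (#CG * length) / (#C * #G). A value near 1 means
    CG dinucleotides are about as frequent as the C and G content predicts.
    """
    seq = seq.upper()
    c_count = seq.count("C")
    g_count = seq.count("G")
    if c_count == 0 or g_count == 0:
        return 0.0  # no CpG is possible without both C and G
    return seq.count("CG") * len(seq) / (c_count * g_count)


# Hypothetical toy sequences: CpG-rich versus CpG-depleted (CpG -> TpG decay).
print(cpg_obs_exp("CGCGCGCG"))
print(cpg_obs_exp("CATGCATG"))
```

In practice the ratio would be computed in sliding windows along a promoter, together with a GC-content threshold; those thresholds are not specified in the slides and are left out here.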

    CpG Islands (2)

    • Important to recognize that promoters selectively active after early development will not be acetylated (and hence will be methylated) in the cell divisions preceding the establishment of germ cells and therefore will not have CpG islands

    • Lists of genes that have higher or lower CpG frequencies than average can misleadingly appear to have TF binding motifs based on this compositional characteristic

      • CpG Island bias in a gene set can mislead an analyst to think that there are patterns of TFBS (patterns with internal CG for island-rich and TG for island-poor sets)

    Additional Topics

    • Chromatin modification studies making great strides

      • Signatures indicative of active regulatory sequences such as H3K4me3

  • Co-activator (p300) ChIP study suggests possibility to “read-off” regulatory regions

  • No methods currently address 3D properties of nucleus (long-run will be necessary)

    Section 3.1: What have we learned?

    • Transcription controlled by regulatory regions

    • Regulatory regions can be distant from initiation regions

    • Laboratory methods can identify regulatory regions and TF binding sites

    • Concept of single initiation site is flawed

    • Promoters fall into subclasses

      • CpG vs TATA

      • Can impact assessment of TFBS in sets of genes








    Please, please, please . . .


    . . . now is a great chance.


    Part 2 Prediction of TF Binding Sites

    Teaching a computer to find TFBS…

    Representing Binding Sites for a TF

    Position frequency matrix (PFM):

    A  14 16  4  0  1 19 20  1  4 13  4  4 13 12  3
    C   3  0  0  0  0  0  0  0  7  3  1  0  3  1 12
    G   4  3 17  0  0  2  0  0  9  1  3  0  5  2  2
    T   0  2  0 21 20  0  1 20  1  4 13 17  0  6  4

    Logo – A graphical representation of the frequency matrix. The y-axis is information content, which reflects the strength of the pattern in each column of the matrix
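
The information content plotted in a logo can be computed directly from the frequency matrix. The following is a minimal sketch that assumes a uniform background (2 bits maximum per column) and applies no small-sample correction; the function name is illustrative.

```python
from math import log2

# Counts copied from the frequency matrix above (15 columns, 21 sites each).
PFM = {
    "A": [14, 16, 4, 0, 1, 19, 20, 1, 4, 13, 4, 4, 13, 12, 3],
    "C": [3, 0, 0, 0, 0, 0, 0, 0, 7, 3, 1, 0, 3, 1, 12],
    "G": [4, 3, 17, 0, 0, 2, 0, 0, 9, 1, 3, 0, 5, 2, 2],
    "T": [0, 2, 0, 21, 20, 0, 1, 20, 1, 4, 13, 17, 0, 6, 4],
}


def information_content(pfm):
    """Return bits of information per column: 2 - Shannon entropy."""
    width = len(pfm["A"])
    bits = []
    for i in range(width):
        total = sum(pfm[b][i] for b in "ACGT")
        entropy = 0.0
        for b in "ACGT":
            p = pfm[b][i] / total
            if p > 0:  # 0 * log2(0) is treated as 0
                entropy -= p * log2(p)
        bits.append(2.0 - entropy)
    return bits


for position, value in enumerate(information_content(PFM), start=1):
    print(position, round(value, 2))
```

Column 4, where all 21 sites carry a T, reaches the 2-bit maximum; mixed columns such as column 9 score much lower.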
































    • A single site


    • A set of sites represented as a consensus

      • VDRTWRWWSHD (IUPAC degenerate DNA)
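
A consensus written in IUPAC degenerate codes can be turned into a pattern matcher. This sketch uses the standard IUPAC nucleotide code table and Python's `re` module; the helper name `consensus_to_regex` and the test sequences are made up for illustration.

```python
import re

# Standard IUPAC nucleotide codes mapped to the bases they allow.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def consensus_to_regex(consensus):
    """Compile an IUPAC consensus into a regular expression."""
    classes = ["[" + IUPAC[ch] + "]" for ch in consensus.upper()]
    return re.compile("".join(classes))


pattern = consensus_to_regex("VDRTWRWWSHD")
print(bool(pattern.fullmatch("AGATAAATGTG")))  # every position allowed
print(bool(pattern.fullmatch("CGATAAATGTC")))  # final C is not allowed by D
```

A consensus match is all-or-nothing, which is the loss of quantitation that motivates the matrix representations.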

    TGCTG = 0.9

    Conversion of PFMs to Position Specific Scoring Matrices (PSSM)

    PSSMs are also known as Position Weight Matrices (PWMs)

    Add the following features to the matrix profile:

    1. Correct for nucleotide frequencies in genome

    2. Weight for the confidence (depth) in the pattern

    3. Convert to log-scale probability for easy arithmetic



    PFM (counts):

    A   5    0    1    0    0
    C   0    2    2    4    0
    G   0    3    1    0    4
    T   0    0    1    1    1

    PSSM (log-scale scores):

    A   1.6 -1.7 -0.2 -1.7 -1.7
    C  -1.7  0.5  0.5  1.3 -1.7
    G  -1.7  1.0 -0.2 -1.7  1.3
    T  -1.7 -1.7 -0.2 -0.2 -0.2

    f(b,i)+ s(n)
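
The fragment f(b,i) + s(n) appears to be the numerator of a pseudocount-corrected frequency. The sketch below assumes one common convention: a pseudocount of sqrt(N)/4 per base for a column with N sites, a uniform background of 0.25, and a base-2 logarithm. These choices are assumptions rather than something the slide text states, although they do reproduce the score matrix shown above when rounded to one decimal.

```python
from math import log2, sqrt

# Counts from the 5-column example PFM above.
PFM = {
    "A": [5, 0, 1, 0, 0],
    "C": [0, 2, 2, 4, 0],
    "G": [0, 3, 1, 0, 4],
    "T": [0, 0, 1, 1, 1],
}


def pfm_to_pssm(pfm, background=None):
    """Convert counts to log2 odds scores with sqrt(N)/4 pseudocounts (assumed)."""
    background = background or {b: 0.25 for b in "ACGT"}
    width = len(pfm["A"])
    pssm = {b: [] for b in "ACGT"}
    for i in range(width):
        n_sites = sum(pfm[b][i] for b in "ACGT")
        pseudo = sqrt(n_sites) / 4  # weight for the depth of the pattern
        for b in "ACGT":
            corrected = (pfm[b][i] + pseudo) / (n_sites + 4 * pseudo)
            # Correct for background frequency and move to log scale.
            pssm[b].append(log2(corrected / background[b]))
    return pssm


for base, row in pfm_to_pssm(PFM).items():
    print(base, [round(v, 1) for v in row])
```

Working on a log scale means the score of a candidate site is a simple sum of one cell per column.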



    PSSM Scoring Scales

    • Raw scores

      • Sum of values from indicated cells of the matrix

  • Relative Scores (most common)

    • Normalize the scores to range of 0-1 or 0%-100%

  • Empirical p-values

    • Based on distribution of scores for some DNA sequence, determine a p-value (see next slide)

    Detecting binding sites in a single sequence

    A [-0.2284  0.4368 -1.5    -1.5    -1.5     0.4368 -1.5    -1.5    -0.2284  0.4368 ]

    C [-0.2284 -0.2284 -1.5    -1.5     1.5128 -1.5    -0.2284 -1.5    -0.2284 -1.5    ]

    G [ 1.2348  1.2348  2.1222  2.1222  0.4368  1.2348  1.5128  1.7457  1.7457 -1.5    ]

    T [ 0.4368 -0.2284 -1.5    -1.5    -0.2284  0.4368  0.4368  0.4368 -1.5     1.7457 ]

    Raw Scores

    Abs_score = 13.4 (sum of column scores)

    Relative Scores

    Max_score = 15.2 (sum of highest column scores)

    Min_score = -10.3 (sum of lowest column scores)
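
The raw, maximum and minimum scores combine into a relative score by rescaling the raw score to the 0-1 range. The sketch below assumes the usual linear normalization (raw − min) / (max − min) and plugs in the three values quoted on the slide; the function names are illustrative.

```python
def raw_score(pssm, site):
    """Sum the matrix cell selected by each base of the candidate site."""
    return sum(pssm[base][i] for i, base in enumerate(site))


def score_bounds(pssm):
    """Lowest and highest raw scores any site can reach with this matrix."""
    width = len(pssm["A"])
    lowest = sum(min(pssm[b][i] for b in "ACGT") for i in range(width))
    highest = sum(max(pssm[b][i] for b in "ACGT") for i in range(width))
    return lowest, highest


def relative_score(raw, lowest, highest):
    """Rescale a raw score to the 0-1 range (assumed linear normalization)."""
    return (raw - lowest) / (highest - lowest)


# Values quoted on the slide: Abs_score, Min_score, Max_score.
slide_relative = relative_score(13.4, -10.3, 15.2)
print(round(slide_relative, 3))
```

With the quoted values the site scores about 93% of the attainable range, which is the form in which thresholds are usually set.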

    Empirical p-value Scores

    [Figure: distribution of relative scores (x-axis: Relative Score, 0.0 to 1.0). The p-value is the area to the right of the value divided by the area under the entire curve.]
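
The "area to right of value over area under entire curve" idea can be sketched by scoring every window of some background DNA and counting how many windows score at least as well as the site of interest. The function names and the toy matrix below are hypothetical; a real analysis would use a long background sequence.

```python
def window_scores(pssm, seq):
    """Raw PSSM score for every window of the matrix width along seq."""
    width = len(pssm["A"])
    return [
        sum(pssm[base][i] for i, base in enumerate(seq[start:start + width]))
        for start in range(len(seq) - width + 1)
    ]


def empirical_p_value(background_scores, observed):
    """Fraction of background scores that are >= the observed score."""
    hits = sum(1 for score in background_scores if score >= observed)
    return hits / len(background_scores)


# Hypothetical 2-column matrix and a very short background sequence.
toy_pssm = {"A": [1, 0], "C": [0, 0], "G": [0, 1], "T": [0, 0]}
scores = window_scores(toy_pssm, "AGAG")
print(scores, empirical_p_value(scores, 2))
```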

    The Good…

    • Tronche (1997) tested 50 predicted HNF1 TFBS using an in vitro binding test and found that 96% of the predicted sites were bound!

    • Stormo and Fields (1998) found in detailed biochemical studies that the best weight matrices produce scores highly correlated with in vitro binding energy




    …the Bad…

    • Fickett (1995) found that a profile for the myoD TF made predictions at a rate of 1 per ~500bp of human DNA sequence

      • This corresponds to an average of 20 sites / gene (assuming 10,000 bp as average gene size)

    …and the Ugly!

    Human cardiac α-actin gene analyzed with a set of profiles (each line represents a TFBS prediction)

    Futility Conjecture: TFBS predictions are almost always wrong

    Red boxes are protein-coding exons (TFBS predictions excluded in this analysis)

    ADVANCED TOPIC: Issues of Column Independence

    • PSSM model assumes independence between positions

      • For example, if you observe a G at position 2, the model assumes there is no influence on the likelihood of a T at position 3 - this is known to be an incorrect assumption

  • Other models can represent dependence

    • Hidden Markov models of Nth order where Nth refers to the number of influencing positions

    • For the cases where there are hundreds of TFBS known for a TF, there has been only modest improvement in the specificity of TFBS predictions using advanced column inter-dependent models

    • The newly emerging ChIP-Seq data collections will ultimately lead to the systematic use of more advanced models (not likely to advance to wet labs for ~3 years)




    A Conundrum…

    • Counter to intuition, the ratio of true positives to predictions fails to improve for “stringent” thresholds

      • For most predictive models this ratio would increase

  • Why?

    • True binding sites are defined by properties not incorporated into the profile scores - above some threshold all sites could be bound if accessible

    Section 3.1A: What have we learned?

    • PSSMs accurately reflect in vitro binding properties of DNA binding proteins

    • Suitable binding sites occur at a rate far too frequent to reflect in vivo function

    • Bioinformatics methods that use PSSMs for binding site studies must incorporate additional information to enhance specificity

      • Unfiltered predictions are too noisy for most applications

      • Organisms with short regulatory sequences are less problematic (e.g. yeast and bacteria)


    Using Phylogenetic Footprinting to Improve TFBS Discrimination

    70,000,000 years of evolution can reveal regulatory regions

    Phylogenetic Footprinting

    FoxC2 – a single exon gene







    % Human-Mouse Identity

    • Align orthologous gene sequences (e.g. LAGAN)

    • For the first 100 bp window of sequence #1, determine the percentage of positions with an identical match in sequence #2

      • Step across the first sequence, recording the percentage of identical nucleotides in each window

  • Observe that the single exon contains a region of high identity that corresponds to the ORF, with lower identity in the 5’ and 3’ UTRs

  • Additional conserved regions could be regulatory regions
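
The windowed identity calculation described above can be sketched as follows. The code assumes the two sequences have already been aligned (gap characters written as "-") so that they have equal length; the function name, the step parameter and the toy alignment are illustrative.

```python
def window_identity(aligned_a, aligned_b, window=100, step=1):
    """Percent identity per window across a pairwise alignment.

    Returns (window start, percent identical) pairs. Aligned gap
    columns never count as matches.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    results = []
    for start in range(0, len(aligned_a) - window + 1, step):
        pairs = zip(aligned_a[start:start + window],
                    aligned_b[start:start + window])
        matches = sum(1 for x, y in pairs if x == y and x != "-")
        results.append((start, 100.0 * matches / window))
    return results


# Hypothetical 8-column alignment scanned with 4-column windows.
print(window_identity("ACGTACGT", "ACGTTCGT", window=4, step=4))
```

Plotting the second value against the first gives the kind of identity profile shown for FoxC2, where peaks outside the ORF are candidate regulatory regions.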

    Phylogenetic Footprinting (cont)

    % Identity

    200 bp Window Start Position (human sequence)

    Actin gene compared between human and mouse

    Multi-species Phylogenetic Footprinting

    • PhastCons scores indicate the regions of DNA which are unusual in their sequence composition in some subset of organisms

    Phylogenetic Footprints in UCSC Genome Browser

    • PhastCons (region score)

    • PhyloP (position score)



    Phylogenetic Footprinting Dramatically Reduces Spurious Hits

    Human

    Actin, alpha cardiac

    TFBS Prediction with Human & Mouse Pairwise Phylogenetic Footprinting



    • Testing set: 40 experimentally defined sites in 15 well studied genes (Replicated with 100+ site set)

    • 75-80% of defined sites detected with conservation filter, while only 11-16% of total predictions retained

    Choosing the “right” species for pairwise comparison...

    ConSite

    TFBS Discrimination Tools

    • Phylogenetic Footprinting Servers

      • FOOTER

      • CONSITE

      • rVISTA

      • ORCAtk

  • SNPs in TFBS Analysis

    • RAVEN

  • Prokaryotes or Yeast



  • Software Packages

    • TOUCAN

  • Programming Tools

    • TFBS

    • ORCAtk

    Analysis of TFBS with Phylogenetic Footprinting

    Scanning a single sequence

    Scanning a pair of orthologous sequences for conserved patterns in conserved sequence regions

    A dramatic improvement in the percentage of biologically significant detections

    • Low specificity of profiles:

      • too many hits

      • great majority not biologically significant

    Section 3.2B: What have we learned?

    • TFBS discrimination coupled with phylogenetic footprinting has greater specificity with tolerable loss of sensitivity

      • As with any purification process, some true binding sites will be lost

  • Available online resources support phylogenetic footprinting

    Questions?

    Please Ask

    Laboratory Exercise 3.2

    TF Binding Site Prediction

    20 minute break

    Until 10:50am

    Next: Sections 3.3 and 3.4


    Part 3: Inferring Regulating TFs for Sets of Co-Expressed Genes

    TFBS Over-representation

    • Akin to the GO studies yesterday, we seek to determine if a set of co-expressed genes contains an over-abundance of predicted binding sites for a known TF

      • Phylogenetic footprinting to reduce false prediction rate

    Two Examples of TFBS Over-Representation

    [Figure: foreground gene sets illustrating two cases: more total TFBS, and more genes with TFBS]

    Statistical Methods for Identifying Over-represented TFBS

    • Binomial test (Z scores)

      • Based on the number of occurrences of the TFBS relative to background

      • Normalized for sequence length

      • Simple binomial distribution model

    • Fisher exact probability scores

      • Based on the number of genes containing the TFBS relative to background

      • Hypergeometric probability distribution
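
Both statistics can be sketched with the standard library. The code below is a simplified illustration, not the oPOSSUM implementation: the z-score treats each scanned position as a binomial trial with the background hit rate, and the Fisher score is a one-sided hypergeometric tail over gene counts. All function and parameter names are hypothetical.

```python
from math import comb, sqrt


def binomial_z_score(fg_hits, fg_positions, bg_hits, bg_positions):
    """Z-score for the number of TFBS hits, normalized for sequence length."""
    rate = bg_hits / bg_positions          # background hit rate per position
    expected = rate * fg_positions
    variance = fg_positions * rate * (1 - rate)
    return (fg_hits - expected) / sqrt(variance)


def fisher_one_sided(fg_with, fg_total, bg_with, bg_total):
    """P(at least fg_with foreground genes contain the site) by chance."""
    genes = fg_total + bg_total
    with_site = fg_with + bg_with
    denominator = comb(genes, fg_total)
    p_value = 0.0
    for k in range(fg_with, min(fg_total, with_site) + 1):
        p_value += (comb(with_site, k)
                    * comb(genes - with_site, fg_total - k) / denominator)
    return p_value


# Hypothetical counts: 30 hits in 1,000 foreground positions versus
# 100 hits in 10,000 background positions; 2 of 2 foreground genes
# and 0 of 2 background genes contain the site.
print(binomial_z_score(30, 1000, 100, 10000))
print(fisher_one_sided(2, 2, 0, 2))
```

The two measures answer different questions, matching the two examples: the z-score responds to more total sites, the Fisher score to more genes carrying at least one site.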

    Validation using Reference Gene Sets

    TFs with experimentally-verified sites in the reference sets.

    C-Myc SAGE Data

    • c-Myc transcription factor dimerizes with the Max protein

    • Key regulator of cell proliferation, differentiation and apoptosis

    • Menssen and Hermeking identified 216 different SAGE tags corresponding to unique mRNAs that were induced after adenoviral expression of c-Myc in HUVEC cells

    • They then went on to confirm the induction of 53 genes using microarray analysis and RT-PCR

    Structurally-related TFs with Indistinguishable TFBS

    • Most structurally related TFs bind to highly similar patterns

      • Zn-finger is a big exception

    oPOSSUM Server

    Ets Factor Family

    • EG232974

    • EG432800

    • Ehf

    • Elf1

    • Elf2

    • Elf3

    • Elf4

    • Elf5

    • Elk1

    • Elk3

    • Elk4

    • Erf

    • Erg

    • Ets1

    • Ets2

    • Etv1

    • Etv2

    • Etv3

    • Etv3l

    • Etv4

    • Etv5

    • Etv6

    • Fev

    • Fli1

    • Gabpa

    • LOC100

    • LOC100

    • factor)

    • LOC634494

    • Sfpi1

    • Spdef

    • Spib

    • Spic

    • How to pick which one?

      • At this stage there are TF catalogs coming that will be coupled to characteristics.

    • Candidate gene prioritization software can be used (such as TOPPGENE)

    Section 3.3: What have we learned?

    • New generation of tools to help interrogate the meaning of observed clusters of co-expressed genes

    • Generally best performance has been with data directly linked to a transcription factor

      • Highly dependent on the experimental design – cannot overcome noisy data from poor design (Recall Day 1)

  • The identity of a mediating TF may not be apparent when many proteins can bind to the same motif

    Questions?

    Now is a good time

    Laboratory Exercise 3.3

    TFBS Over-Representation Analysis


    Part 4: de novo Discovery of TF Binding Sites

    de novo Pattern Discovery

    • String-based

      • e.g. YMF (Sinha & Tompa)

      • Generalization: Identify over-represented oligomers in comparison of “+” and “-” (or complete) promoter collections

      • Used often for yeast promoter analysis

    • Profile-based

      • e.g. AnnSpec (Workman & Stormo) or MEME (Bailey & Elkan)

      • Generalization: Identify strong patterns in “+” promoter collection vs. background model of expected sequence characteristics

    Assessing Discovered Patterns

    • Strength

    • Similarity search

    String-based methods (1)

    How likely are X words in a set of sequences, given background sequence characteristics?

    CCCGCCGGAATGAAATCTGATTGACATTTTCC
    >EP71002 (+) Ce[IV] msp-56 B; range -100 to -75
    TTCAAATTTTAACGCCGGAATAATCTCCTATT
    >EP63009 (+) Ce Cuticle Col-12; range -100 to -75
    TCGCTGTAACCGGAATATTTAGTCAGTTTTTG
    >EP63010 (+) Ce Cuticle Col-13; range -100 to -75
    TATCGTCATTCTCCGCCTCTTTTCTT
    >EP11013 (+) Ce vitellogenin 2; range -100 to -75
    GCTTATCAATGCGCCCGGAATAAAACGCTATA
    >EP11014 (+) Ce vitellogenin 5; range -100 to -75
    CATTGACTTTATCGAATAAATCTGTT
    >EP11015 (-) Ce vitellogenin 4; range -100 to -75
    ATCTATTTACAATGATAAAACTTCAA
    >EP11016 (+) Ce vitellogenin 6; range -100 to -75
    ATGGTCTCTACCGGAAAGCTACTTTCAGAATT
    >EP11017 (+) Ce calmodulin cal-2; range -100 to -75
    TTTCAAATCCGGAATTTCCACCCGGAATTACT
    >EP63007 (-) Ce cAMP-dep. PKR P1+; range -100 to -75
    TTTCCTTCTTCCCGGAATCCACTTTTTCTTCC
    >EP63008 (+) Ce cAMP-dep. PKR P2; range -100 to -75
    ACTGAACTTGTCTTCAAATTTCAACACCGGAA
    >EP17012 (+) Ce hsp 16K-1 A; range -100 to -75
    TCAATGCCGGAATTCTGAATGTGAGTCGCCCT
    >EP55011 (-) Ce hsp 16K-1 B; range

    String-based methods (2)

    Find all words of length n in the yeast promoters (e.g. n=7)


    Make a lookup table:

    AAACCTTT 456

    TTTTTTTT 57788

    GATAGGCA 589
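
Building such a lookup table amounts to counting every overlapping word of length n. A minimal sketch using `collections.Counter` follows; the function name and the two toy "promoters" are illustrative, and a real table would be built from all promoters of the genome.

```python
from collections import Counter


def count_words(sequences, n):
    """Count every overlapping word of length n across the sequences."""
    counts = Counter()
    for seq in sequences:
        for start in range(len(seq) - n + 1):
            counts[seq[start:start + n]] += 1
    return counts


# Hypothetical toy input with n = 3.
table = count_words(["AAAA", "AAAT"], 3)
for word, count in table.most_common():
    print(word, count)
```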


    String-based methods (3)

    Xw: Instances of a word w within our set of X genes

    E[Xw]: Average number of instances of w based on number of genes in our set

    Var[Xw]: Variance – how much deviation from the average is expected for w
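
The slide text does not show how the three quantities are combined. A common choice in string-based methods, assumed here rather than taken from the slide, is a z-score that expresses how many standard deviations the observed count lies above its expectation:

```latex
z_w = \frac{X_w - E[X_w]}{\sqrt{\mathrm{Var}[X_w]}}
```

Words with large positive z_w occur more often in the gene set than the background lookup table predicts.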

    Limitations of String-based Methods

    • Longer word lengths not possible

    • While degeneracy codes can be used, TFBS are not words – we lose quantitation for variable positions with consensus sequences

      • Imagine a column in a PFM with 7 A’s and 1 T: in a consensus sequence we would either represent it as W or throw out the instance with T

  • Recently the string-based method has found renewed utility in the analysis of 3’UTRs for the presence of microRNA target sequences...

    microRNA Target Sequences

    • Lim et al expressed miRNAs in cells and observed that the overall pattern of gene expression shifted toward the pattern of expression observed in cells which naturally express the miRNA

    • The genes with reduced expression in response to miRNA exposure shared 7 nt motifs in the 3’UTR of their transcripts

    • Nice website tutorial:


    Probabilistic Methods for Pattern Discovery

    What is a probabilistic method?

    The Gibbs sampler algorithm

    Probabilistic Methods


    Find a local alignment of width x of sites that maximizes information content (or related measure) in reasonable time

    Usually by Gibbs sampling or EM methods


    TFBS are not words

    Efficiency – can handle longer patterns than string-based methods

    Can be intentionally influenced to reflect prior knowledge

    What does probabilistic mean?

    • Based on probability

    • Functionally, it means we’re going to guess our way to a good pattern (TFBS)

      • We’re going to try to make a good guess

  • Two different flavours of the approach

    • Expectation Maximization in which we try to make the best guess each time

    • Gibbs Sampling in which we make our guesses based on the strength of our conviction

    Gibbs Sampling


    Two data structures used:

    1) Current pattern nucleotide frequencies q_{i,1}, ..., q_{i,4} and corresponding background frequencies p_{i,1}, ..., p_{i,4}

    2) Current positions of site start points in the N sequences a_1, ..., a_N, i.e. the alignment that contributes to q_{i,j}.

    One starting point in each sequence is chosen randomly initially.




    Iterations in Gibbs Sampling

    Remove one sequence z from the set. Update the current pattern according to

    [Equation; labelled terms: pseudocount for symbol j, sum of all pseudocounts in column]

    ‘Score’ the current pattern against each possible occurrence a_k in z. Draw a new a_k with probabilities based on the respective score divided by the background model
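
The two iteration steps can be sketched in a few lines. The update rule is assumed to be the usual site-sampler form q_{i,j} = (c_{i,j} + b_j) / (N − 1 + B), where c_{i,j} counts symbol j at pattern position i in the remaining N − 1 sites, b_j is the pseudocount for symbol j and B is the sum of pseudocounts; the slide's own equation is not recoverable from the text. Function names, the pseudocount of 0.5, the uniform background and the toy sequences are all illustrative choices.

```python
import random

ALPHABET = "ACGT"


def build_profile(sites, pseudocount=0.5):
    """Pattern frequencies q[i][base] from aligned sites, with pseudocounts."""
    width = len(sites[0])
    total = len(sites) + pseudocount * len(ALPHABET)
    profile = []
    for i in range(width):
        column = {base: pseudocount for base in ALPHABET}
        for site in sites:
            column[site[i]] += 1
        profile.append({base: column[base] / total for base in ALPHABET})
    return profile


def gibbs_sample(seqs, width, iterations=200, background=None, rng=None):
    """Return one sampled site start per sequence."""
    rng = rng or random.Random()
    background = background or {base: 0.25 for base in ALPHABET}
    # One starting point in each sequence is chosen randomly initially.
    starts = [rng.randrange(len(s) - width + 1) for s in seqs]
    for _ in range(iterations):
        for z in range(len(seqs)):
            # Remove sequence z and update the pattern from the others.
            others = [seqs[k][starts[k]:starts[k] + width]
                      for k in range(len(seqs)) if k != z]
            profile = build_profile(others)
            # Score each possible start in z: pattern over background.
            weights = []
            for a in range(len(seqs[z]) - width + 1):
                odds = 1.0
                for i, base in enumerate(seqs[z][a:a + width]):
                    odds *= profile[i][base] / background[base]
                weights.append(odds)
            # Draw the new start with probability proportional to its odds.
            starts[z] = rng.choices(range(len(weights)), weights=weights)[0]
    return starts


# Hypothetical toy input; each sequence contains the word ACGT.
toy = ["ACGTACGTAC", "TTTTACGTTT", "GGACGTGGGG"]
print(gibbs_sample(toy, width=4, iterations=50, rng=random.Random(0)))
```

Because each new start is drawn at random rather than taken as the best-scoring position, repeated runs can escape weak local patterns; an Expectation Maximization variant would instead weight or pick the best guess deterministically.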


    Gibbs Sampling (grossly over-simplified)

    ttcgctcc

       1 2 3 4 5 6 7 8

    A  2 0 2 2 2 1 0 1

    C  0 2 3 3 2 1 6 2

    G  0 4 1 0 1 0 1 1

    T  4 1 1 2 2 5 0 2

    Pattern Discovery

    • Gibbs sampling is guaranteed to return an optimal pattern if repeated sufficiently often

      • Procedure is fast, so running many 1000s of times is feasible

  • Unfortunately, we have a problem…what if the mediating TFBS are not strongly over-represented relative to other patterns…

    Applied Pattern Discovery is Acutely Sensitive to Noise

    Pink line is negative control with no Mef2 sites included

    True Mef2 Binding Sites

    Four Approaches to Improve Sensitivity

    • Better background models

      • Higher-order properties of DNA

    • Phylogenetic Footprinting

      • Human:Mouse comparison eliminates ~75% of sequence

    • Regulatory Modules

      • Architectural rules

    • Limit the types of binding profiles allowed

      • TFBS patterns are NOT random

    Pattern Discovery Summary

    • Pattern discovery methods can recover over-represented patterns in the promoters of co-expressed genes

    • Methods are acutely sensitive to noise, indicating that the signal we seek is weak

      • TFs tolerate great variability between binding sites

  • As for pattern discrimination, supplementary information/approaches are required to overcome the noise

    Questions?

    Winding down

    Laboratory Exercise 3.4

    Motif Discovery

    REFLECTIONS

    • Part 2

      • Futility Theorem – Essentially predictions of individual TFBS have no relationship to an in vivo function

      • Successful bioinformatics methods for site discrimination incorporate additional information (clusters, conservation)

    • Part 3

      • TFBS over-representation is a powerful new means to identify TFs likely to contribute to observed patterns of co-expression

    • Part 4

      • Pattern discovery methods are severely restricted by the Signal-to-Noise problem

        • Observed patterns must be carefully considered

      • Successful methods for pattern discovery will have to incorporate additional information (conservation, structural constraints on TFs)


    THE END

    • Questions before the break?

    • Lab exercises address Sections 2 and 3


    LUNCH

    On your own

    (Food court Downstairs)

    Back at: ??