
Canadian Bioinformatics Workshops

www.bioinformatics.ca



Canadian Bioinformatics Workshops 2009

Module 3

Inferring Regulatory Mechanisms Governing Sets of Genes

Wyeth W. Wasserman

University of British Columbia

www.cisreg.ca


Module 3: Overview

Part 1: Overview of transcription

Lab 3.1: Promoters in Genome Browser (UCSC)

Part 2: Prediction of transcription factor binding sites using binding profiles (“Discrimination”)

Lab 3.2: TFBS scan (Footer)

Part 3: Interrogation of sets of co-expressed genes to identify mediating transcription factors

Lab 3.3: TFBS Over-Representation (oPOSSUM)

Part 4: Detection of novel motifs (TFBS) over-represented in regulatory regions of co-expressed genes (“Discovery”)

Lab 3.4: Motif Discovery (MEME/Motif-Compare)


Restrictions in Coverage

  • Focus on Eukaryotic cells and PolII Promoters

    • Principles apply to prokaryotes

    • Will provide suggestions for similar tools for other species

  • Many of the examples drawn from my lab’s work - there are many equivalent tools (links to be provided)


    Part 1: Introduction to transcription in eukaryotic cells


    Transcription Over-Simplified

    • Three-step Process:

    • TF binds to TFBS (DNA)

    • TF catalyzes recruitment of polymerase II complex

    • Production of RNA from transcription start site (TSS)

    [Figure labels: TF, Pol-II, TFBS, TATA, TSS]


    Anatomy of Transcriptional Regulation

    WARNING: Terms vary widely in meaning between scientists

    Core Promoter/Initiation Region (Inr)

    TSR

    • Core Promoter – Sufficient for initiation of transcription; orientation dependent

      • TSR – transcription start region

        • Refers to a region rather than specific start site (TSS)

  • TFBS – single transcription factor binding site

  • Regulatory Regions

    • Proximal/Distal – vague reference to distance from TSR

    • May be positive (enhancing) or negative (repressing)

    • Orientation independent (generally)

    • Modules – Sets of TFBS within a region that function together

  • Transcriptional Unit

    • DNA sequence transcribed as a single polycistronic mRNA

    [Figure labels: Distal Regulatory Region, Proximal Regulatory Region, EXON, TFBS, TATA]


    Complexity in Transcription

    [Figure labels: Chromatin, Distal enhancer, Proximal enhancer, Core Promoter]


    Lab Discovery of TF Binding Sites

    [Figure: LUCIFERASE reporter constructs, one labeled "mutation"; axis: Reporter Gene Activity, 0% to 100%]

    Identify functional regulatory region within a sequence and delineate specific TFBS through mutagenesis (and in vitro binding studies)


    EMSA/Gel Shift Assays to Identify Binding Proteins

    TF + DNA

    DNA

    http://www.biomedcentral.com/content/figures/1741-7015-4-28-8.jpg


    High-throughput Methods

    • SELEX

      • mix random ds DNA oligonucleotides with TF protein, recover TF-DNA complexes and sequence DNA

    • Protein Binding Arrays (UniProbe Database)

      • prepare arrays with ds DNA attached, label protein with a fluorescent mark and observe DNA bound by protein

    • ChIP

      • covalently link proteins to DNA in cell, shear DNA, recover protein-DNA complexes and identify DNA (PCR, array or sequencing)


    Promoters

    • In most vertebrates the delineation of the transcription start position is not easy

      • cDNA often incomplete at 5’ end

  • Multiple promoters for most human genes

    • Referencing position relative to the initiation “site” is therefore not a good idea

      • But done almost uniformly in biological papers

    • Translation start equally problematic

      • Can be in internal exon

      • Multiple ORF start positions common

  • Importance of promoter proximal regions varies between species

    • Humans appear to have little enrichment for functional sequences near promoters; because the regions to consider are vast, analyses are generally restricted to a region around the promoter(s), but the justification is not strong

    • Yeast and C.elegans have more compact regions and promoter proximity can be a useful property to restrict analyses


    mRNA Caps for Mapping Initiation Sites

    • The 5’ ends of mRNAs have a “cap” structure that can be precipitated with an antibody

      • Allows for large-scale sequencing of “full-length” cDNAs and “tags” derived from the 5’ end of mRNAs

      • RIKEN is the leading generator of such sequences

      • Not well represented in genome annotation resources (unfortunately)

    http://departments.oxy.edu/biology/Stillman/bi221/111300/26_18a.GIF


    Classes of Initiation Regions

    [Figure: two classes of initiation region, “Selective” (bias: TATA Box) and “Broad” (bias: CpG Island); axes: % CAGE Cap Tags per Position vs. Position]

    This is over-simplified - see paper for greater detail. Take home message is that promoters are not drawn from a single continuous distribution of properties, rather drawn from at least two classes.

    Image from Carninci P, et al (2006). Genome-wide analysis of mammalian promoter architecture and evolution. Nat Genet. Apr 28 PMID: 16645617


    CpG Islands

    $$\frac{[\mathrm{CpG}]}{[\mathrm{C}]\,[\mathrm{G}]}$$

    • DNA methylation occurs in competition with histone acetylation

      • Acetylation promotes open chromatin structure that is permissive for TF binding to DNA

      • Methylation of DNA inhibits histone acetylation

      • Certain TFs promote histone acetylation by recruiting acetylases

  • Methylation occurs on cytosines

    • Preferentially on cytosine adjacent to guanines (CG dinucleotides, generally referred to as CpG)

    • Methylated cytosines frequently undergo deamination to form thymine (CpG -> TpG)

  • CpG Islands are regions of DNA where CG dinucleotides occur at a frequency consistent with C and G mononucleotide frequencies

    • Highlight regions of active transcription
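The CpG island definition can be turned into a number. The following is a minimal sketch, assuming the commonly used observed/expected CpG ratio (CpG count times sequence length, divided by the product of the C and G counts). The function name and the two example sequences are illustrative only, and no island-calling threshold from the slides is implied.

```python
def cpg_obs_exp(seq: str) -> float:
    """Observed/expected CpG ratio for one DNA sequence.

    The expected CpG count assumes C and G occur independently:
    expected = (count_C * count_G) / length.
    """
    seq = seq.upper()
    n = len(seq)
    c = seq.count("C")
    g = seq.count("G")
    # "CG" cannot overlap itself, so str.count gives the exact dinucleotide count.
    cpg = seq.count("CG")
    if c == 0 or g == 0:
        return 0.0
    return cpg * n / (c * g)


island_like = "CGCGCGCG"       # CpG as frequent as (or more than) C and G predict
depleted = "CAGTCAGTTGCA"      # C and G present, but never adjacent as CpG
print(cpg_obs_exp(island_like))
print(cpg_obs_exp(depleted))
```

A ratio near 1 means CpG occurs about as often as the C and G frequencies predict; most of a vertebrate genome sits well below that because of the CpG -> TpG decay described above.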


    CpG Islands (2)

    • Important to recognize that promoters selectively active after early development will not be acetylated (and hence will be methylated) in the cell divisions preceding the establishment of germ cells and therefore will not have CpG islands

    • Lists of genes that have higher or lower CpG frequencies than average can misleadingly appear to have TF binding motifs based on this compositional characteristic

      • CpG Island bias in a gene set can mislead an analyst to think that there are patterns of TFBS (patterns with internal CG for island-rich and TG for island-poor sets)


    Additional Topics

    • Chromatin modification studies making great strides

      • Signatures indicative of active regulatory sequences such as H3K4me3

  • Co-activator (p300) ChIP study suggests possibility to “read-off” regulatory regions

  • No methods currently address 3D properties of nucleus (long-run will be necessary)


    Section 3.1: What have we learned?

    • Transcription controlled by regulatory regions

    • Regulatory regions can be distant from initiation regions

    • Laboratory methods can identify regulatory regions and TF binding sites

    • Concept of single initiation site is flawed

    • Promoters fall into subclasses

      • CpG vs TATA

      • Can impact assessment of TFBS in sets of genes


    Questions?

    Please, please, please . . .

    ASK QUESTIONS

    . . . now is a great chance.



    Part 2: Prediction of TF Binding Sites

    Teaching a computer to find TFBS…


    Representing Binding Sites for a TF

    A  14  16   4   0   1  19  20   1   4  13   4   4  13  12   3
    C   3   0   0   0   0   0   0   0   7   3   1   0   3   1  12
    G   4   3  17   0   0   2   0   0   9   1   3   0   5   2   2
    T   0   2   0  21  20   0   1  20   1   4  13  17   0   6   4

    Logo – A graphical representation of frequency matrix. The Y-axis is information content, which reflects the strength of the pattern in each column of the matrix
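The information content plotted on the Y-axis of a logo can be computed column by column. This is a minimal sketch, assuming a uniform background (0.25 per base) and no small-sample correction, so each column scores between 0 and 2 bits. The matrix is the 15-column frequency matrix shown on this slide; the function name is illustrative.

```python
import math

# Frequency matrix transcribed from the slide (rows A, C, G, T; 15 columns).
PFM = {
    "A": [14, 16, 4, 0, 1, 19, 20, 1, 4, 13, 4, 4, 13, 12, 3],
    "C": [3, 0, 0, 0, 0, 0, 0, 0, 7, 3, 1, 0, 3, 1, 12],
    "G": [4, 3, 17, 0, 0, 2, 0, 0, 9, 1, 3, 0, 5, 2, 2],
    "T": [0, 2, 0, 21, 20, 0, 1, 20, 1, 4, 13, 17, 0, 6, 4],
}


def column_information(pfm, col):
    """Information content (bits) of one column: 2 minus the Shannon entropy."""
    counts = [pfm[base][col] for base in "ACGT"]
    total = sum(counts)
    entropy = 0.0
    for count in counts:
        if count > 0:
            p = count / total
            entropy -= p * math.log2(p)
    return 2.0 - entropy


info = [column_information(PFM, i) for i in range(len(PFM["A"]))]
for position, bits in enumerate(info, start=1):
    print(position, round(bits, 2))
```

A column in which every site carries the same base (position 4 here, all T) reaches the 2-bit maximum; a column with evenly mixed bases approaches 0.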

    Set of binding sites:

    AAGTTAATGA

    CAGTTAATAA

    GAGTTAAACA

    CAGTTAATTA

    GAGTTAATAA

    CAGTTATTCA

    GAGTTAATAA

    CAGTTAATCA

    AGATTAAAGA

    AAGTTAACGA

    AGGTTAACGA

    ATGTTGATGA

    AAGTTAATGA

    AAGTTAACGA

    AAATTAATGA

    GAGTTAATGA

    AAGTTAATCA

    AAGTTGATGA

    AAATTAATGA

    ATGTTAATGA

    AAGTAAATGA

    AAGTTAATGA

    AAGTTAATGA

    AAATTAATGA

    AAGTTAATGA

    AAGTTAATGA

    AAGTTAATGA

    AAGTTAATGA

    • A single site

      • AAGTTAATGA

    • A set of sites represented as a consensus

      • VDRTWRWWSHD (IUPAC degenerate DNA)
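The step from a set of aligned sites to a count matrix and a degenerate consensus can be sketched as follows. This is an illustration only: it uses the first four sites from the list above, and the consensus rule (report the IUPAC code for every base observed in a column, however rare) is an assumption, not necessarily the rule used to derive the consensus on the slide.

```python
# IUPAC degenerate codes, keyed by the sorted set of bases they stand for.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "AG": "R", "CT": "Y", "CG": "S", "AT": "W", "GT": "K", "AC": "M",
    "CGT": "B", "AGT": "D", "ACT": "H", "ACG": "V", "ACGT": "N",
}


def build_pfm(sites):
    """Count each base at each position of equal-length aligned sites."""
    width = len(sites[0])
    pfm = {base: [0] * width for base in "ACGT"}
    for site in sites:
        for i, base in enumerate(site):
            pfm[base][i] += 1
    return pfm


def consensus_from_pfm(pfm):
    """IUPAC consensus listing every base seen at least once per column."""
    width = len(pfm["A"])
    letters = []
    for i in range(width):
        seen = "".join(base for base in "ACGT" if pfm[base][i] > 0)
        letters.append(IUPAC[seen])
    return "".join(letters)


sites = ["AAGTTAATGA", "CAGTTAATAA", "GAGTTAAACA", "CAGTTAATTA"]
pfm = build_pfm(sites)
consensus = consensus_from_pfm(pfm)
print(pfm)
print(consensus)
```

The consensus keeps which bases occur but discards how often, which is the loss of quantitation that the matrix representation avoids.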


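    Conversion of PFMs to Position Specific Scoring Matrices (PSSMs)

    PSSMs are also known as Position Weight Matrices (PWMs)

    Add the following features to the matrix profile:

    1. Correct for nucleotide frequencies in genome

    2. Weight for the confidence (depth) in the pattern

    3. Convert to log-scale probability for easy arithmetic

    pfm

    A    5    0    1    0    0
    C    0    2    2    4    0
    G    0    3    1    0    4
    T    0    0    1    1    1

    Each cell of the pfm is converted with

    $$\log\left(\frac{f(b,i) + s(n)}{p(b)}\right)$$

    where f(b,i) is the frequency of base b at position i, s(n) is a pseudocount (small-sample correction) term and p(b) is the background frequency of base b.

    pssm

    A  1.6 -1.7 -0.2 -1.7 -1.7
    C -1.7  0.5  0.5  1.3 -1.7
    G -1.7  1.0 -0.2 -1.7  1.3
    T -1.7 -1.7 -0.2 -0.2 -0.2

    Example: TGCTG = 0.9 (the sum of the pssm cells for T, G, C, T, G at positions 1 to 5: -1.7 + 1.0 + 0.5 - 0.2 + 1.3)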
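The conversion can be sketched in a few lines. The slide does not spell out the pseudocount or the log base, so the choices here are assumptions: a total pseudocount of sqrt(N) per column shared out according to the background, a uniform background of 0.25, and log base 2. Under those assumptions the rounded output agrees with the pssm values shown on the slide, but that agreement should not be read as confirmation that this is the exact procedure the author used.

```python
import math

# Count matrix (pfm) from the slide: 5 sites, 5 positions.
PFM = {
    "A": [5, 0, 1, 0, 0],
    "C": [0, 2, 2, 4, 0],
    "G": [0, 3, 1, 0, 4],
    "T": [0, 0, 1, 1, 1],
}


def pfm_to_pssm(pfm, background=None):
    """Convert counts to log2 odds against the background.

    Assumption: pseudocount mass of sqrt(N) per column, split by background.
    """
    if background is None:
        background = {base: 0.25 for base in "ACGT"}
    width = len(pfm["A"])
    pssm = {base: [] for base in "ACGT"}
    for i in range(width):
        n = sum(pfm[base][i] for base in "ACGT")
        pseudo_total = math.sqrt(n)
        for base in "ACGT":
            # Corrected probability of this base at this position.
            p = (pfm[base][i] + pseudo_total * background[base]) / (n + pseudo_total)
            pssm[base].append(math.log2(p / background[base]))
    return pssm


PSSM = pfm_to_pssm(PFM)
for base in "ACGT":
    print(base, [round(x, 1) for x in PSSM[base]])
```

The pseudocount keeps unseen bases from scoring minus infinity, and its weight shrinks as more sites are collected, which is the "confidence (depth)" correction listed above.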


    PSSM Scoring Scales

    • Raw scores

      • Sum of values from indicated cells of the matrix

  • Relative Scores (most common)

    • Normalize the scores to range of 0-1 or 0%-100%

  • Empirical p-values

    • Based on distribution of scores for some DNA sequence, determine a p-value (see next slide)
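Raw and relative scores can be sketched as below. The matrix is the small five-column PSSM from the conversion slide, typed in at one decimal place; the function names and the 0.8 default threshold are illustrative assumptions. The relative score is taken here as (raw - min) / (max - min), which is one common way to normalize to the 0-1 range.

```python
PSSM = {
    "A": [1.6, -1.7, -0.2, -1.7, -1.7],
    "C": [-1.7, 0.5, 0.5, 1.3, -1.7],
    "G": [-1.7, 1.0, -0.2, -1.7, 1.3],
    "T": [-1.7, -1.7, -0.2, -0.2, -0.2],
}


def raw_score(pssm, site):
    """Sum of the matrix cells selected by the bases of the site."""
    return sum(pssm[base][i] for i, base in enumerate(site))


def score_bounds(pssm):
    """Lowest and highest raw scores any site can reach."""
    width = len(pssm["A"])
    max_score = sum(max(pssm[b][i] for b in "ACGT") for i in range(width))
    min_score = sum(min(pssm[b][i] for b in "ACGT") for i in range(width))
    return min_score, max_score


def relative_score(pssm, site):
    """Raw score rescaled so the worst site is 0 and the best site is 1."""
    low, high = score_bounds(pssm)
    return (raw_score(pssm, site) - low) / (high - low)


def scan(pssm, sequence, threshold=0.8):
    """Slide the matrix along the sequence; keep windows at or above threshold."""
    width = len(pssm["A"])
    hits = []
    for start in range(len(sequence) - width + 1):
        site = sequence[start:start + width]
        rel = relative_score(pssm, site)
        if rel >= threshold:
            hits.append((start, site, round(rel, 3)))
    return hits


print(raw_score(PSSM, "TGCTG"))
print(scan(PSSM, "TTAGCCGTT", threshold=0.9))
```

The sketch scans one strand only and expects upper-case A, C, G, T; a practical scanner would also score the reverse complement and handle ambiguous bases.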


    Detecting binding sites in a single sequence

    ACCCTCCCCAGGGGCGGGGGGCGGTGGCCAGGACGGTAGCTCC

    Sp1

    A [ -0.2284   0.4368  -1.5     -1.5     -1.5      0.4368  -1.5     -1.5     -0.2284   0.4368 ]
    C [ -0.2284  -0.2284  -1.5     -1.5      1.5128  -1.5     -0.2284  -1.5     -0.2284  -1.5    ]
    G [  1.2348   1.2348   2.1222   2.1222   0.4368   1.2348   1.5128   1.7457   1.7457  -1.5    ]
    T [  0.4368  -0.2284  -1.5     -1.5     -0.2284   0.4368   0.4368   0.4368  -1.5      1.7457 ]

    Raw Scores

    Abs_score = 13.4 (sum of column scores)

    Relative Scores

    Max_score = 15.2 (sum of highest column scores)

    Min_score = -10.3 (sum of lowest column scores)

    Empirical p-value Scores

    Area to right of value / Area under entire curve

    [Figure: distribution of scores; axes: Frequency (0.0 to 0.3) vs. Relative Score (0.0 to 1.0)]
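The "area to right of value over area under entire curve" idea can be sketched by counting. This is a minimal illustration, assuming the background distribution is available as a plain list of scores; the beta-distributed random background is purely hypothetical stand-in data, not scores from any real sequence.

```python
import random


def empirical_p_value(observed, background_scores):
    """Fraction of background scores that are at least as high as the observed score."""
    if not background_scores:
        raise ValueError("need at least one background score")
    at_least = sum(1 for score in background_scores if score >= observed)
    return at_least / len(background_scores)


# Hypothetical background: 10,000 made-up relative scores between 0 and 1.
rng = random.Random(0)
background = [rng.betavariate(2, 5) for _ in range(10000)]
print(empirical_p_value(0.8, background))
```

In practice the background list would come from scoring every window of some reference DNA with the same matrix, so the p-value depends on which reference sequence is chosen.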


    JASPAR: AN OPEN-ACCESS DATABASE OF TF BINDING PROFILES (jaspar.genereg.net)


    The Good…

    • Tronche (1997) tested 50 predicted HNF1 TFBS using an in vitro binding test and found that 96% of the predicted sites were bound!

    • Stormo and Fields (1998) found in detailed biochemical studies that the best weight matrices produce scores highly correlated with in vitro binding energy

    BINDING

    ENERGY

    PSSM SCORE


    …the Bad…

    • Fickett (1995) found that a profile for the myoD TF made predictions at a rate of 1 per ~500bp of human DNA sequence

      • This corresponds to an average of 20 sites / gene (assuming 10,000 bp as average gene size)


    …and the Ugly!

    Human Cardiac α-Actin gene analyzed with a set of profiles (each line represents a TFBS prediction)

    Futility Conjecture: TFBS predictions are almost always wrong

    Red boxes are protein coding exons - TFBS predictions excluded in this analysis


    ADVANCED TOPIC: Issues of Column Independence

    • PSSM model assumes independence between positions

      • For example, if you observe a G at position 2, the model assumes there is no influence on the likelihood of a T at position 3 - this is known to be an incorrect assumption

  • Other models can represent dependence

    • Hidden Markov models of Nth order where Nth refers to the number of influencing positions

    • For the cases where there are hundreds of TFBS known for a TF, there has been only modest improvement in the specificity of TFBS predictions using advanced column inter-dependent models

    • The newly emerging ChIP-Seq data collections will ultimately lead to the systematic use of more advanced models (not likely to advance to wet labs for ~3 years)


    A Conundrum…

    [Figure axes: PPV vs. THRESHOLD]

    • Counter to intuition, the ratio of true positives to predictions fails to improve for “stringent” thresholds

      • For most predictive models this ratio would increase

  • Why?

    • True binding sites are defined by properties not incorporated into the profile scores - above some threshold all sites could be bound if accessible


    Section 3.1A: What have we learned?

    • PSSMs accurately reflect in vitro binding properties of DNA binding proteins

    • Suitable binding sites occur at a rate far too frequent to reflect in vivo function

    • Bioinformatics methods that use PSSMs for binding site studies must incorporate additional information to enhance specificity

      • Unfiltered predictions are too noisy for most applications

      • Organisms with short regulatory sequences are less problematic (e.g. yeast and bacteria)


    Using Phylogenetic Footprinting to Improve TFBS Discrimination

    70,000,000 years of evolution can reveal regulatory regions


    Phylogenetic Footprinting

    [Figure: FoxC2 – a single exon gene; y-axis: % Human-Mouse Identity (0% to 100%)]

    • Align orthologous gene sequences (e.g. LAGAN)

    • For the first window of 100 bp of sequence #1, determine the % of positions with an identical match in sequence #2

      • Step across the first sequence, recording the percentage of identical nucleotides in each window

  • Observe that single exon contains a region of high identity that corresponds to the ORF, with lower identity in the 5’ and 3’ UTRs

  • Additional conserved regions could be regulatory regions
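The windowed identity calculation can be sketched as follows. It assumes the two orthologous sequences have already been aligned (for example with LAGAN) and are supplied as equal-length strings with "-" for gaps. As a simplification, windows here are measured in alignment columns rather than in bases of the first sequence, and the function name and parameters are illustrative.

```python
def window_identity(aligned_1, aligned_2, window=100, step=1):
    """Percent identity in sliding windows over a pairwise alignment.

    Returns a list of (window_start, percent_identity) pairs.
    Gap columns never count as matches.
    """
    if len(aligned_1) != len(aligned_2):
        raise ValueError("aligned sequences must have equal length")
    results = []
    for start in range(0, len(aligned_1) - window + 1, step):
        pairs = zip(aligned_1[start:start + window], aligned_2[start:start + window])
        matches = sum(1 for a, b in pairs if a == b and a != "-")
        results.append((start, 100.0 * matches / window))
    return results


# Toy alignment with a 4-column window, just to show the output shape.
print(window_identity("ACGTACGT", "ACGTTCGT", window=4, step=4))
```

Plotting the second value against the first gives the kind of identity profile shown for FoxC2, where coding sequence stands out as a plateau of high identity.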


    Phylogenetic Footprinting (cont)

    [Figure: Actin gene compared between human and mouse; axes: % Identity vs. 200 bp Window Start Position (human sequence)]


    Multi-species Phylogenetic Footprinting

    • PhastCons scores indicate the regions of DNA that are unusually conserved across a set of aligned species


    Phylogenetic Footprints in UCSC Genome Browser

    • PhastCons (region score)

    • PhyloP (position score)


    Phylogenetic Footprinting Dramatically Reduces Spurious Hits

    [Figure: TFBS predictions for Actin, alpha cardiac; panels: Human, Mouse]


    TFBS Prediction with Human & Mouse Pairwise Phylogenetic Footprinting

    [Figure labels: SELECTIVITY, SENSITIVITY]

    • Testing set: 40 experimentally defined sites in 15 well studied genes (Replicated with 100+ site set)

    • 75-80% of defined sites detected with conservation filter, while only 11-16% of total predictions retained



    Choosing the “right” species for pairwise comparison...

    [Figure panels: CHICKEN vs. HUMAN, MOUSE vs. HUMAN, COW vs. HUMAN]


    ConSite


    TFBS Discrimination Tools

    • Phylogenetic Footprinting Servers

      • FOOTER http://biodev.hgen.pitt.edu/footer_php/Footerv2_0.php

      • CONSITE http://asp.ii.uib.no:8090/cgi-bin/CONSITE/consite/

      • rVISTA http://rvista.dcode.org/

      • ORCAtk http://burgundy.cmmt.ubc.ca/cgi-bin/OrcaTK/orcatk

  • SNPs in TFBS Analysis

    • RAVEN http://burgundy.cmmt.ubc.ca/cgi-bin/RAVEN/a?rm=home

  • Prokaryotes or Yeast

    • PRODORIC http://prodoric.tu-bs.de/

    • YEASTRACT http://www.yeastract.com/index.php

  • Software Packages

    • TOUCAN http://homes.esat.kuleuven.be/~saerts/software/toucan.php

  • Programming Tools

    • TFBS http://tfbs.genereg.net/

    • ORCAtk http://burgundy.cmmt.ubc.ca/cgi-bin/OrcaTK/orcatk


    Analysis of TFBS with Phylogenetic Footprinting

    Scanning a single sequence

    • Low specificity of profiles:

      • too many hits

      • great majority not biologically significant

    Scanning a pair of orthologous sequences for conserved patterns in conserved sequence regions

    • A dramatic improvement in the percentage of biologically significant detections


    Section 3.2B: What have we learned?

    • TFBS discrimination coupled with phylogenetic footprinting has greater specificity with tolerable loss of sensitivity

      • As with any purification process, some true binding sites will be lost

  • Available online resources support phylogenetic footprinting


    Questions?

    Please Ask


    Laboratory Exercise 3.2

    TF Binding Site Prediction


    20 minute break

    Until 10:50am

    Next: Sections 3.3 and 3.4



    Part 3: Inferring Regulating TFs for Sets of Co-Expressed Genes


    TFBS Over-representation

    • Akin to the GO studies yesterday, we seek to determine if a set of co-expressed genes contains an over-abundance of predicted binding sites for a known TF

      • Phylogenetic footprinting to reduce false prediction rate


    Two Examples of TFBS Over-Representation

    [Figure: Foreground vs. Background gene sets in two panels, “More Total TFBS” and “More Genes with TFBS”]


    Statistical Methods for Identifying Over-represented TFBS

    • Binomial test (Z scores)

      • Based on the number of occurrences of the TFBS relative to background

      • Normalized for sequence length

      • Simple binomial distribution model

    • Fisher exact probability scores

      • Based on the number of genes containing the TFBS relative to background

      • Hypergeometric probability distribution
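Both statistics can be sketched with the standard library alone. This is an illustration under stated assumptions, not the oPOSSUM implementation: the z-score uses the background hit rate per nucleotide as the binomial success probability with a normal approximation, the Fisher score is the one-tailed hypergeometric probability, and the foreground and background gene sets are treated as disjoint. All function and parameter names are hypothetical.

```python
import math


def binomial_z_score(fg_hits, fg_length, bg_hits, bg_length):
    """z-score for the number of TFBS hits in the foreground sequences.

    The background hit rate per nucleotide sets the expected count,
    which normalizes for sequence length.
    """
    p = bg_hits / bg_length
    expected = fg_length * p
    sd = math.sqrt(fg_length * p * (1.0 - p))
    return (fg_hits - expected) / sd


def fisher_one_tailed(fg_with, fg_total, bg_with, bg_total):
    """Probability of seeing at least fg_with site-containing genes in the foreground."""
    genes = fg_total + bg_total
    with_site = fg_with + bg_with
    denominator = math.comb(genes, fg_total)
    upper = min(fg_total, with_site)
    p_value = 0.0
    for k in range(fg_with, upper + 1):
        ways = math.comb(with_site, k) * math.comb(genes - with_site, fg_total - k)
        p_value += ways / denominator
    return p_value


print(binomial_z_score(fg_hits=20, fg_length=10_000, bg_hits=100, bg_length=100_000))
print(fisher_one_tailed(fg_with=3, fg_total=3, bg_with=0, bg_total=3))
```

The two tests answer different questions, matching the two examples above: the z-score reacts to many sites concentrated in a few genes, while the Fisher score only asks how many genes have at least one site.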


    Validation using Reference Gene Sets

    TFs with experimentally-verified sites in the reference sets.



    C-Myc SAGE Data

    • c-Myc transcription factor dimerizes with the Max protein

    • Key regulator of cell proliferation, differentiation and apoptosis

    • Menssen and Hermeking identified 216 different SAGE tags corresponding to unique mRNAs that were induced after adenoviral expression of c-Myc in HUVEC cells

    • They then went on to confirm the induction of 53 genes using microarray analysis and RT-PCR


    Structurally-related TFs with Indistinguishable TFBS

    • Most structurally related TFs bind to highly similar patterns

      • Zn-finger is a big exception


    oPOSSUM Server


    Ets Factor Family

    • EG232974

    • EG432800

    • Ehf

    • Elf1

    • Elf2

    • Elf3

    • Elf4

    • Elf5

    • Elk1

    • Elk3

    • Elk4

    • Erf

    • Erg

    • Ets1

    • Ets2

    • Etv1

    • Etv2

    • Etv3

    • Etv3l

    • Etv4

    • Etv5

    • Etv6

    • Fev

    • Fli1

    • Gabpa

    • LOC100

    • LOC100

    • factor)

    • LOC634494

    • Sfpi1

    • Spdef

    • Spib

    • Spic

    • How to pick which one?

      • At this stage, TF catalogs that will be coupled to characteristics are still forthcoming.

    • Candidate gene prioritization software can be used (such as TOPPGENE)


    Section 3.3: What have we learned?

    • New generation of tools to help interrogate the meaning of observed clusters of co-expressed genes

    • Generally best performance has been with data directly linked to a transcription factor

      • Highly dependent on the experimental design – cannot overcome noisy data from poor design (Recall Day 1)

  • The identity of a mediating TF may not be apparent when many proteins can bind to the same motif


    Questions?

    Now is a good time


    Laboratory Exercise 3.3

    TFBS Over-Representation Analysis



    Part 4: de novo Discovery of TF Binding Sites


    de novo Pattern Discovery

    • String-based

      • e.g. YMF (Sinha & Tompa)

      • Generalization: Identify over-represented oligomers in comparison of “+” and “-” (or complete) promoter collections

      • Used often for yeast promoter analysis

    • Profile-based

      • e.g. AnnSpec (Workman & Stormo) or MEME (Bailey & Elkan)

      • Generalization: Identify strong patterns in “+” promoter collection vs. background model of expected sequence characteristics


    Assessing Discovered Patterns

    • Strength

    • Similarity search


    String-based methods (1)

    How likely are X words in a set of sequences, given background sequence characteristics?

    CCCGCCGGAATGAAATCTGATTGACATTTTCC
    >EP71002 (+) Ce[IV] msp-56 B; range -100 to -75
    TTCAAATTTTAACGCCGGAATAATCTCCTATT
    >EP63009 (+) Ce Cuticle Col-12; range -100 to -75
    TCGCTGTAACCGGAATATTTAGTCAGTTTTTG
    >EP63010 (+) Ce Cuticle Col-13; range -100 to -75
    TATCGTCATTCTCCGCCTCTTTTCTT
    >EP11013 (+) Ce vitellogenin 2; range -100 to -75
    GCTTATCAATGCGCCCGGAATAAAACGCTATA
    >EP11014 (+) Ce vitellogenin 5; range -100 to -75
    CATTGACTTTATCGAATAAATCTGTT
    >EP11015 (-) Ce vitellogenin 4; range -100 to -75
    ATCTATTTACAATGATAAAACTTCAA
    >EP11016 (+) Ce vitellogenin 6; range -100 to -75
    ATGGTCTCTACCGGAAAGCTACTTTCAGAATT
    >EP11017 (+) Ce calmodulin cal-2; range -100 to -75
    TTTCAAATCCGGAATTTCCACCCGGAATTACT
    >EP63007 (-) Ce cAMP-dep. PKR P1+; range -100 to -75
    TTTCCTTCTTCCCGGAATCCACTTTTTCTTCC
    >EP63008 (+) Ce cAMP-dep. PKR P2; range -100 to -75
    ACTGAACTTGTCTTCAAATTTCAACACCGGAA
    >EP17012 (+) Ce hsp 16K-1 A; range -100 to -75
    TCAATGCCGGAATTCTGAATGTGAGTCGCCCT
    >EP55011 (-) Ce hsp 16K-1 B; range


    String-based methods (2)

    Find all words of length n in the yeast promoters (e.g. n=7)

    GTCTTTATCTTCAAAGTTGTCTGTCCAAGATTTGGACTTGAAGGACAAGCGTGTCTTCTCAGAGTTGACTTCAACGTCCCATTGGACGGTAAGAAGATCACTTCTAACCAAAGAATTGTTGCTGCTTTGCCAACCATCAAGTACGTTTTGGAACACCACCCAAGATACGTTGTCTTGTTCTCACTTGGGTAGACCAAACGGTGAAAGAAACGAAAAATACTCTTTGGCTCCAGTTGCTAAGGAATTGCAATCATTGTTGGGTAAGGATGTCACCTTCTTGAACGACTGTGTCGGTCCAGAAGTTGAAGCCGCTGTCAAGGCTTCTGCCCCAGGTTCCGTTATTTTGTTGGAAAACTGCGTTACCACATCGAAGAAGAAGGTTCCAGAAAGGTCGATGGTCAAAAGGTCAAGGCTCAAGGAAGATGTTCAAAAGTTCAGACACGAATTGAGCTCTTTGGCTGATGTTTACATCACGATGCCTTCGGTACCGCTCACAGAGCTCACTCTTCTATGGTCGGTTTCGACTTGCCAACGTGCTGCCGGTTTCTTGTTGGAAAAGGAATTGAAGTACTTCGGTAAGGCTTTGGAGAACCCAACCAGACCATTCTTGGCCATCTTAGGTGGTGCCAAGGTTGCTGACAAGATTCAATTGATTGACAACTTGTTGGACAAGGTCGACTCTATCATCATTGGTGGTGGTATGGCTTTCCCTTCAAGAAGGTTTTGGAAAACACTGAAATCGGTGACTCCATCTTCGACAAGGCTGGTGCTGAAATCGTTCCAAAGTTGATGGAAAAGGCCAAGGCCAAGGGTGTCGAAGTCGTCTTGCAGTCGACTTCATCATTGCTGATGCTTTCTCTGCTGATGCCAACACCAAGACTGTCACTGACAAGGAAGGTATTCCAGCTGGCTGGCAAGGGTTGGACAATGGTCCAGAATCTAGAAAGTGTTTGCTGCTACTGTTGCAAAGGCTAAGACCATTGTCTGGAACGGTCCACCAGGTGTTTTCGAATTCGAAAAGTTCGCTGCTGGTACTAAGGCTTTGTTAGACGAAGTTGTCAAGAGCTCTGCTGCTGGTAACACCGTCATCATTGGTGGTGGTGACACTGCCA

    Make a lookup table:

    AAACCTTT 456

    TTTTTTTT 57788

    GATAGGCA 589

    Etc...
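Building the lookup table amounts to counting every overlapping word of length n. A minimal sketch, assuming the promoters are available as plain strings; the function name and the toy input are illustrative, and only one strand is counted.

```python
from collections import Counter


def count_words(sequences, n=7):
    """Count every overlapping word of length n across a collection of sequences."""
    table = Counter()
    for seq in sequences:
        seq = seq.upper()
        for i in range(len(seq) - n + 1):
            table[seq[i:i + n]] += 1
    return table


# Toy input; in practice this would be the full set of yeast promoter sequences.
promoters = ["GTCTTTATCTTCAAAGTTGTCTGTCC", "AAGATTTGGACTTGAAGGACAAGCG"]
table = count_words(promoters, n=7)
print(table.most_common(3))
```

With 4^7 = 16,384 possible 7-mers the whole table fits comfortably in memory, which is part of why short word lengths are the natural fit for this approach.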


    String-based methods (3)

    Xw: Instances of a word w within our set of X genes

    E[Xw]: Average number of instances of w based on number of genes in our set

    Var[Xw]: Variance – how much deviation from the average is expected for w
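These three quantities combine into a z-score, (Xw - E[Xw]) / sqrt(Var[Xw]). The sketch below estimates the expectation and variance in a deliberately simple way: from the per-promoter counts of the word across a background collection, treating the genes in the set as independent draws. That estimator is an assumption made for illustration; tools such as YMF derive these values from a background sequence model instead.

```python
import math
from statistics import mean, pvariance


def word_z_score(observed, per_promoter_counts, set_size):
    """z-score of a word count in a gene set against a background estimate.

    observed            -- Xw, instances of the word in the gene set
    per_promoter_counts -- count of the word in each background promoter
    set_size            -- number of genes in the set
    """
    expected = set_size * mean(per_promoter_counts)        # E[Xw]
    variance = set_size * pvariance(per_promoter_counts)   # Var[Xw]
    if variance == 0:
        raise ValueError("word count does not vary in the background")
    return (observed - expected) / math.sqrt(variance)


# Hypothetical numbers: the word occurs in half of the background promoters.
print(word_z_score(observed=5, per_promoter_counts=[0, 1, 0, 1], set_size=4))
```

A large positive z-score flags a word that occurs more often in the gene set than the background predicts, after allowing for how variable that word's count normally is.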


    Limitations of String-based Methods

    • Longer word lengths not possible

    • While degeneracy codes can be used, TFBS are not words – we lose quantitation for variable positions with consensus sequences

      • Imagine column in PFM with 7 A’s and 1 T --- in a consensus sequence we would represent as W or throw out the instance with T

  • Recently the string-based method has found renewed utility in the analysis of 3’UTRs for the presence of microRNA target sequences...


    microRNA Target Sequences

    • Lim et al expressed miRNAs in cells and observed that the overall pattern of gene expression shifted toward the pattern of expression observed in cells which naturally express the miRNA

    • The genes with reduced expression in response to miRNA exposure shared 7 nt motifs in the 3’UTR of their transcripts

    • Nice website tutorial:

      • http://www.ambion.com/main/explorations/mirna.html


    Probabilistic Methods for Pattern Discovery

    What is a probabilistic method?

    The Gibbs sampler algorithm


    Probabilistic Methods

    Overview:

    Find a local alignment of width x of sites that maximizes information content (or related measure) in reasonable time

    Usually by Gibbs sampling or EM methods

    Motivation:

    TFBS are not words

    Efficiency – can handle longer patterns than string-based methods

    Can be intentionally influenced to reflect prior knowledge


    What does probabilistic mean?

    • Based on probability

    • Functionally, it means we’re going to guess our way to a good pattern (TFBS)

      • We’re going to try to make a good guess

  • Two different flavours of the approach

    • Expectation Maximization in which we try to make the best guess each time

    • Gibbs Sampling in which we make our guesses based on the strength of our conviction


    Gibbs Sampling

    Two data structures used:

    1) Current pattern nucleotide frequencies $q_{i,1}, \ldots, q_{i,4}$ and corresponding background frequencies $p_{i,1}, \ldots, p_{i,4}$

    2) Current positions of site startpoints in the N sequences $a_1, \ldots, a_N$, i.e. the alignment that contributes to $q_{i,j}$.

    One starting point in each sequence is chosen randomly initially.

    [Figure: example alignment of current sites]

    tgacttcc
    tgatctct
    agacctca
    tgacctct

    Iterations in Gibbs Sampling

    A) Remove one sequence z from the set. Update the current pattern according to

    [Equation not captured in the transcript; its annotations read “Pseudocount for symbol j” and “Sum of all pseudocounts in column”]

    B) ’Score’ the current pattern against each possible occurrence $a_k$ in z. Draw a new $a_k$ with probabilities based on respective score divided by the background model

    [Figure: current sites tgacttcc, tgatctct, agacctca, tgacctct, with sequence z set aside]


    Gibbs Sampling (grossly over-simplified)

    Sites:

    ttcgctcc
    cgatacgc
    tgctacct
    tgacttcc
    agacctca
    ctgtagtg
    acgcatct

    Matrix:

      1 2 3 4 5 6 7 8
    A 2 0 2 2 2 1 0 1
    C 0 2 3 3 2 1 6 2
    G 0 4 1 0 1 0 1 1
    T 4 1 1 2 2 5 0 2
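The two-step iteration can be sketched as a small site sampler. Several details are assumptions rather than content from the slides: exactly one site per sequence, a pseudocount of 0.5 per base, background frequencies estimated from the input sequences themselves, and a fixed number of sweeps. The update used in step A is the standard count-plus-pseudocount estimate, which may differ in detail from the equation on the original slide.

```python
import random

BASES = "ACGT"


def build_profile(sequences, starts, width, skip, pseudocount=0.5):
    """Step A: base probabilities per column from all current sites except sequence `skip`."""
    counts = [{base: pseudocount for base in BASES} for _ in range(width)]
    for index, (seq, start) in enumerate(zip(sequences, starts)):
        if index == skip:
            continue
        for i in range(width):
            counts[i][seq[start + i]] += 1
    profile = []
    for column in counts:
        total = sum(column.values())
        profile.append({base: column[base] / total for base in BASES})
    return profile


def background_frequencies(sequences):
    """Overall base frequencies, with one added count per base to avoid zeros."""
    counts = {base: 1.0 for base in BASES}
    for seq in sequences:
        for base in seq:
            counts[base] += 1
    total = sum(counts.values())
    return {base: counts[base] / total for base in BASES}


def gibbs_sample(sequences, width, iterations=200, seed=0):
    """Return one sampled start position per sequence after the given number of sweeps."""
    rng = random.Random(seed)
    background = background_frequencies(sequences)
    # One starting point in each sequence is chosen randomly initially.
    starts = [rng.randrange(len(seq) - width + 1) for seq in sequences]
    for _ in range(iterations):
        for z, seq in enumerate(sequences):
            profile = build_profile(sequences, starts, width, skip=z)
            # Step B: weight each candidate start by pattern probability / background.
            weights = []
            for start in range(len(seq) - width + 1):
                ratio = 1.0
                for i in range(width):
                    base = seq[start + i]
                    ratio *= profile[i][base] / background[base]
                weights.append(ratio)
            # Draw the new start in proportion to its weight (not simply the best one).
            starts[z] = rng.choices(range(len(weights)), weights=weights)[0]
    return starts


seqs = ["TTGACTTCCA", "CATGATCTCT", "GAGACCTCAT", "ATGACCTCTG"]
print(gibbs_sample(seqs, width=8, iterations=20, seed=1))
```

Drawing in proportion to the weights, instead of always taking the top-scoring position, is what separates Gibbs sampling from Expectation Maximization and lets the sampler escape a poor early alignment. The sketch returns the final sample; a fuller version would keep the best-scoring alignment seen across many restarts.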


    Pattern Discovery

    • Gibbs sampling is guaranteed to return an optimal pattern if repeated sufficiently often

      • Procedure is fast, so running many 1000s of times is feasible

  • Unfortunately, we have a problem…what if the mediating TFBS are not strongly over-represented relative to other patterns…


    Applied Pattern Discovery is Acutely Sensitive to Noise

    [Figure: PATTERN SIMILARITY vs. TRUE MEF2 PROFILE plotted against SEQUENCE LENGTH, for True Mef2 Binding Sites; pink line is negative control with no Mef2 sites included]


    Four Approaches to Improve Sensitivity

    • Better background models

      • Higher-order properties of DNA

    • Phylogenetic Footprinting

      • Human:Mouse comparison eliminates ~75% of sequence

    • Regulatory Modules

      • Architectural rules

    • Limit the types of binding profiles allowed

      • TFBS patterns are NOT random


    Pattern Discovery Summary

    • Pattern discovery methods can recover over-represented patterns in the promoters of co-expressed genes

    • Methods are acutely sensitive to noise, indicating that the signal we seek is weak

      • TFs tolerate great variability between binding sites

  • As for pattern discrimination, supplementary information/approaches are required to overcome the noise


    Questions?

    Winding down


    Laboratory Exercise 3.4

    Motif Discovery


    REFLECTIONS

    • Part 2

      • Futility Theorem – Essentially predictions of individual TFBS have no relationship to an in vivo function

      • Successful bioinformatics methods for site discrimination incorporate additional information (clusters, conservation)

    • Part 3

      • TFBS over-representation is a powerful new means to identify TFs likely to contribute to observed patterns of co-expression

    • Part 4

      • Pattern discovery methods are severely restricted by the Signal-to-Noise problem

        • Observed patterns must be carefully considered

      • Successful methods for pattern discovery will have to incorporate additional information (conservation, structural constraints on TFs)



    THE END

    • Questions before the break?

    • Lab exercises address Sections 2 and 3


    LUNCH

    On your own

    (Food court Downstairs)

    Back at: ??

