Canadian Bioinformatics Workshops

Canadian Bioinformatics Workshops www.bioinformatics.ca


Canadian Bioinformatics Workshops 2009 Module 3 Inferring Regulatory Mechanisms Governing Sets of Genes Wyeth W. Wasserman University of British Columbia www.cisreg.ca

Module 3: Overview • Part 1: Overview of transcription • Lab 3.1: Promoters in Genome Browser (UCSC) • Part 2: Prediction of transcription factor binding sites using binding profiles (“Discrimination”) • Lab 3.2: TFBS scan (Footer) • Part 3: Interrogation of sets of co-expressed genes to identify mediating transcription factors • Lab 3.3: TFBS Over-Representation (oPOSSUM) • Part 4: Detection of novel motifs (TFBS) over-represented in regulatory regions of co-expressed genes (“Discovery”) • Lab 3.4: Motif Discovery (MEME/Motif-Compare)

Restrictions in Coverage • Focus on Eukaryotic cells and PolII Promoters • Principles apply to prokaryotes • Will provide suggestions for similar tools for other species • Many of the examples drawn from my lab’s work - there are many equivalent tools (links to be provided)

Part 1: Introduction to transcription in eukaryotic cells

Transcription Over-Simplified • Three-step Process: • TF binds to TFBS (DNA) • TF catalyzes recruitment of polymerase II complex • Production of RNA from transcription start site (TSS) [Figure labels: TF, Pol-II, TFBS, TATA, TSS]

Anatomy of Transcriptional Regulation WARNING: Terms vary widely in meaning between scientists • Core Promoter – Sufficient for initiation of transcription; orientation dependent • TSR – transcription start region • Refers to a region rather than specific start site (TSS) • TFBS – single transcription factor binding site • Regulatory Regions • Proximal/Distal – vague reference to distance from TSR • May be positive (enhancing) or negative (repressing) • Orientation independent (generally) • Modules – Sets of TFBS within a region that function together • Transcriptional Unit • DNA sequence transcribed as a single polycistronic mRNA [Figure labels: Distal Regulatory Region, Proximal Regulatory Region, Core Promoter/Initiation Region (Inr), TSR, TATA, TFBS, EXON]

Complexity in Transcription [Figure labels: Chromatin, Distal enhancer, Proximal enhancer, Core Promoter]

Lab Discovery of TF Binding Sites • Identify functional regulatory region within a sequence and delineate specific TFBS through mutagenesis (and in vitro binding studies) [Figure: series of LUCIFERASE reporter constructs, one carrying a mutation; axis: Reporter Gene Activity, 0% to 100%]

EMSA/Gel Shift Assays to Identify Binding Proteins [Figure labels: TF + DNA, DNA] http://www.biomedcentral.com/content/figures/1741-7015-4-28-8.jpg

High-throughput Methods • SELEX • mix random ds DNA oligonucleotides with TF protein, recover TF-DNA complexes and sequence DNA • Protein Binding Arrays (UniProbe Database) • prepare arrays with ds DNA attached, label protein with a fluorescent mark and observe DNA bound by protein • ChIP • covalently link proteins to DNA in cell, shear DNA, recover protein-DNA complexes and identify DNA (PCR, array or sequencing)

Promoters • In most vertebrates the delineation of the transcription start position is not easy • cDNA often incomplete at 5’ end • Multiple promoters for most human genes • Referencing position relative to the initiation “site” is therefore not a good idea • But done almost uniformly in biological papers • Translation start equally problematic • Can be in internal exon • Multiple ORF start positions common • Importance of promoter proximal regions varies between species • Humans appear to have little enrichment for functional sequences near promoters; because the regions to consider are vast, analyses are generally restricted to a region around the promoter(s), but the justification is not strong • Yeast and C. elegans have more compact regions and promoter proximity can be a useful property to restrict analyses

mRNA Caps for Mapping Initiation Sites • The 5’ end of an mRNA has a “cap” structure that can be precipitated with an antibody • Allows for large-scale sequencing of “full-length” cDNAs and “tags” derived from the 5’ end of mRNAs • RIKEN is the leading generator of such sequences • Not well represented in genome annotation resources (unfortunately) http://departments.oxy.edu/biology/Stillman/bi221/111300/26_18a.GIF

Classes of Initiation Regions • Bias: TATA Box (“Selective”) • Bias: CpG Island (“Broad”) [Figure axes: % CAGE Cap Tags per Position vs. Position] This is over-simplified - see paper for greater detail. The take-home message is that promoters are not drawn from a single continuous distribution of properties; rather, they are drawn from at least two classes. Image from Carninci P, et al (2006). Genome-wide analysis of mammalian promoter architecture and evolution. Nat Genet. Apr 28 PMID: 16645617

CpG Islands • DNA methylation occurs in competition with histone acetylation • Acetylation promotes open chromatin structure that is permissive for TF binding to DNA • Methylation of DNA inhibits histone acetylation • Certain TFs promote histone acetylation by recruiting acetylases • Methylation occurs on cytosines • Preferentially on cytosines adjacent to guanines (CG dinucleotides, generally referred to as CpG) • Methylated cytosines frequently undergo deamination to form thymine (CpG -> TpG) • CpG Islands are regions of DNA where CG dinucleotides occur at a frequency consistent with C and G mononucleotide frequencies (ratio shown on slide: [CpG] / ([C][G])) • Highlight regions of active transcription
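The [CpG] / ([C][G]) comparison above can be sketched in a few lines. This is a minimal illustration, not a CpG island caller: the function name `cpg_obs_exp` is hypothetical, and the choice to divide the dinucleotide count by the number of dinucleotide positions (n - 1) is an assumption about how the frequencies are normalized.

```python
def cpg_obs_exp(seq: str) -> float:
    """Observed/expected CpG ratio: f(CpG) / (f(C) * f(G)).

    Values near 1 mean CG dinucleotides occur about as often as the
    C and G mononucleotide frequencies predict; values well below 1
    indicate CpG depletion.
    """
    seq = seq.upper()
    n = len(seq)
    c = seq.count("C")
    g = seq.count("G")
    cpg = seq.count("CG")  # "CG" cannot overlap itself, so count() is exact
    if n < 2 or c == 0 or g == 0:
        return 0.0
    f_cpg = cpg / (n - 1)  # n - 1 dinucleotide positions (assumed normalization)
    return f_cpg / ((c / n) * (g / n))


if __name__ == "__main__":
    print(cpg_obs_exp("CGCGCGCG"))   # CpG-rich toy sequence
    print(cpg_obs_exp("CATGCATG"))   # no CG dinucleotides at all
```

In practice the ratio would be computed in sliding windows along a sequence rather than once for the whole sequence.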

CpG Islands (2) • Important to recognize that promoters selectively active after early development will not be acetylated (and hence will be methylated) in the cell divisions preceding the establishment of germ cells and therefore will not have CpG islands • Lists of genes that have higher or lower CpG frequencies than average can misleadingly appear to have TF binding motifs based on this compositional characteristic • CpG Island bias in a gene set can mislead an analyst to think that there are patterns of TFBS (patterns with internal CG for island-rich and TG for island-poor sets)

Additional Topics • Chromatin modification studies making great strides • Signatures indicative of active regulatory sequences such as H3K4me3 • Co-activator (p300) ChIP study suggests possibility to “read-off” regulatory regions • No methods currently address 3D properties of nucleus (long-run will be necessary)

Section 3.1: What have we learned? • Transcription controlled by regulatory regions • Regulatory regions can be distant from initiation regions • Laboratory methods can identify regulatory regions and TF binding sites • Concept of single initiation site is flawed • Promoters fall into subclasses • CpG vs TATA • Can impact assessment of TFBS in sets of genes

Questions? Please, please, please . . . ASK QUESTIONS . . . now is a great chance.


Part 2 Prediction of TF Binding Sites Teaching a computer to find TFBS…

Representing Binding Sites for a TF • A single site • AAGTTAATGA • A set of sites represented as a consensus • VDRTWRWWSHD (IUPAC degenerate DNA) • A matrix describing a set of sites:
A  14 16  4  0  1 19 20  1  4 13  4  4 13 12  3
C   3  0  0  0  0  0  0  0  7  3  1  0  3  1 12
G   4  3 17  0  0  2  0  0  9  1  3  0  5  2  2
T   0  2  0 21 20  0  1 20  1  4 13 17  0  6  4
• Logo – A graphical representation of the frequency matrix. The y-axis is information content, which reflects the strength of the pattern in each column of the matrix
Set of binding sites: AAGTTAATGA CAGTTAATAA GAGTTAAACA CAGTTAATTA GAGTTAATAA CAGTTATTCA GAGTTAATAA CAGTTAATCA AGATTAAAGA AAGTTAACGA AGGTTAACGA ATGTTGATGA AAGTTAATGA AAGTTAACGA AAATTAATGA GAGTTAATGA AAGTTAATCA AAGTTGATGA AAATTAATGA ATGTTAATGA AAGTAAATGA AAGTTAATGA AAGTTAATGA AAATTAATGA AAGTTAATGA AAGTTAATGA AAGTTAATGA AAGTTAATGA
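As a rough illustration of how a set of aligned sites becomes a frequency matrix and a per-column information content, here is a minimal sketch. The function names are hypothetical, the input is just the first four sites from the slide, and the information content assumes a uniform background (maximum 2 bits) with no small-sample correction, which real logo software may apply.

```python
import math

BASES = "ACGT"


def build_pfm(sites):
    """Count each base at each position across equal-length sites."""
    length = len(sites[0])
    if any(len(s) != length for s in sites):
        raise ValueError("all sites must have the same length")
    pfm = {b: [0] * length for b in BASES}
    for site in sites:
        for i, base in enumerate(site.upper()):
            pfm[base][i] += 1
    return pfm


def information_content(pfm):
    """Bits per column: 2 - H(column), assuming a uniform background."""
    ic = []
    for i in range(len(pfm["A"])):
        total = sum(pfm[b][i] for b in BASES)
        entropy = 0.0
        for b in BASES:
            p = pfm[b][i] / total
            if p > 0:
                entropy -= p * math.log2(p)
        ic.append(2.0 - entropy)
    return ic


# First four sites from the slide's set of binding sites
sites = ["AAGTTAATGA", "CAGTTAATAA", "GAGTTAAACA", "CAGTTAATTA"]
pfm = build_pfm(sites)
ic = information_content(pfm)

if __name__ == "__main__":
    for base in BASES:
        print(base, pfm[base])
    print([round(x, 2) for x in ic])
```

Columns where every site agrees reach the full 2 bits; columns with mixed bases score lower, which is what the height of a logo column conveys.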

Conversion of PFMs to Position Specific Scoring Matrices (PSSMs) PSSMs are also known as Position Weight Matrices (PWMs) Add the following features to the matrix profile: 1. Correct for nucleotide frequencies in genome 2. Weight for the confidence (depth) in the pattern 3. Convert to log-scale probability for easy arithmetic
PSSM value = log( (f(b,i) + s(n)) / p(b) ), where f(b,i) is the observed frequency of base b at position i, s(n) is a small-sample correction (pseudocount) and p(b) is the background frequency of base b
PFM:
A  5  0  1  0  0
C  0  2  2  4  0
G  0  3  1  0  4
T  0  0  1  1  1
PSSM:
A   1.6 -1.7 -0.2 -1.7 -1.7
C  -1.7  0.5  0.5  1.3 -1.7
G  -1.7  1.0 -0.2 -1.7  1.3
T  -1.7 -1.7 -0.2 -0.2 -0.2
Example: score of TGCTG = 0.9
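The conversion can be sketched as follows. The slide does not state the pseudocount or the normalization, so two assumptions are made here and labeled in the code: a pseudocount of sqrt(N)/4 per base (N = number of sites) and a uniform background of 0.25. With those choices the sketch reproduces the slide's PSSM values to one decimal place; the function names are hypothetical.

```python
import math

BASES = "ACGT"


def pfm_to_pssm(pfm, background=None):
    """Convert counts to log2 odds scores against a background."""
    background = background or {b: 0.25 for b in BASES}  # assumed uniform
    pssm = {b: [] for b in BASES}
    for i in range(len(pfm["A"])):
        n = sum(pfm[b][i] for b in BASES)
        pseudo = math.sqrt(n) / 4  # assumed pseudocount per base
        for b in BASES:
            # corrected probability of base b at position i
            p = (pfm[b][i] + pseudo) / (n + 4 * pseudo)
            pssm[b].append(math.log2(p / background[b]))
    return pssm


def score(pssm, site):
    """Sum the matrix cells selected by each base of the site."""
    return sum(pssm[base][i] for i, base in enumerate(site))


# PFM from the slide
pfm = {
    "A": [5, 0, 1, 0, 0],
    "C": [0, 2, 2, 4, 0],
    "G": [0, 3, 1, 0, 4],
    "T": [0, 0, 1, 1, 1],
}
pssm = pfm_to_pssm(pfm)

if __name__ == "__main__":
    for base in BASES:
        print(base, [round(v, 1) for v in pssm[base]])
    print("TGCTG", round(score(pssm, "TGCTG"), 1))
```

Because the scores are logarithms, scoring a candidate site is a simple sum of one cell per column, which is the "easy arithmetic" the slide refers to.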

PSSM Scoring Scales • Raw scores • Sum of values from indicated cells of the matrix • Relative Scores (most common) • Normalize the scores to range of 0-1 or 0%-100% • Empirical p-values • Based on distribution of scores for some DNA sequence, determine a p-value (see next slide)

Detecting binding sites in a single sequence
Sequence: ACCCTCCCCAGGGGCGGGGGGCGGTGGCCAGGACGGTAGCTCC
PSSM:
A [-0.2284  0.4368 -1.5    -1.5    -1.5     0.4368 -1.5    -1.5    -0.2284  0.4368 ]
C [-0.2284 -0.2284 -1.5    -1.5     1.5128 -1.5    -0.2284 -1.5    -0.2284 -1.5    ]
G [ 1.2348  1.2348  2.1222  2.1222  0.4368  1.2348  1.5128  1.7457  1.7457 -1.5    ]
T [ 0.4368 -0.2284 -1.5    -1.5    -0.2284  0.4368  0.4368  0.4368 -1.5     1.7457 ]
• Raw Scores (Sp1): Abs_score = 13.4 (sum of column scores)
• Relative Scores: Max_score = 15.2 (sum of highest column scores); Min_score = -10.3 (sum of lowest column scores)
• Empirical p-value: area to right of value / area under entire curve [Figure: distribution of scores; x-axis: Relative Score, 0.0 to 1.0; y-axis: Frequency, 0.0 to 0.3]
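The three scoring scales can be sketched together. This is an illustration under stated assumptions: it uses the small 5-column PSSM from the PFM conversion slide rather than the Sp1 matrix, the 0.8 threshold is arbitrary, the background is a seeded random uniform sequence standing in for "some DNA sequence", and all function names are hypothetical.

```python
import random

# 5-column PSSM from the PFM-to-PSSM slide (rounded values)
PSSM = {
    "A": [1.6, -1.7, -0.2, -1.7, -1.7],
    "C": [-1.7, 0.5, 0.5, 1.3, -1.7],
    "G": [-1.7, 1.0, -0.2, -1.7, 1.3],
    "T": [-1.7, -1.7, -0.2, -0.2, -0.2],
}
WIDTH = 5
MAX_SCORE = sum(max(PSSM[b][i] for b in "ACGT") for i in range(WIDTH))
MIN_SCORE = sum(min(PSSM[b][i] for b in "ACGT") for i in range(WIDTH))


def raw_score(site):
    """Raw score: sum of the cells selected by each base."""
    return sum(PSSM[base][i] for i, base in enumerate(site))


def relative_score(raw, lo=MIN_SCORE, hi=MAX_SCORE):
    """Rescale a raw score to the range 0-1."""
    return (raw - lo) / (hi - lo)


def scan(seq, threshold=0.8):
    """Return (position, site, raw, relative) for windows at or above threshold."""
    hits = []
    for pos in range(len(seq) - WIDTH + 1):
        site = seq[pos:pos + WIDTH]
        raw = raw_score(site)
        rel = relative_score(raw)
        if rel >= threshold:
            hits.append((pos, site, raw, rel))
    return hits


def empirical_p(raw, background_scores):
    """Fraction of background scores at least as high (area to the right)."""
    return sum(s >= raw for s in background_scores) / len(background_scores)


# Assumed background: seeded random uniform DNA
rng = random.Random(0)
background = "".join(rng.choice("ACGT") for _ in range(5000))
bg_scores = [raw_score(background[i:i + WIDTH])
             for i in range(len(background) - WIDTH + 1)]

hits = scan("TTAGCCGTTTGCTGAA", threshold=0.8)

if __name__ == "__main__":
    for pos, site, raw, rel in hits:
        print(pos, site, round(raw, 1), round(rel, 2),
              empirical_p(raw, bg_scores))
    # Relative score from the slide's Sp1 numbers: Abs 13.4, Min -10.3, Max 15.2
    print(round(relative_score(13.4, -10.3, 15.2), 2))
```

A real analysis would draw the background distribution from genomic sequence with realistic base composition, since a uniform random background can understate how often high scores arise by chance.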

JASPAR: AN OPEN-ACCESS DATABASE OF TF BINDING PROFILES ( jaspar.genereg.net )

The Good… • Tronche (1997) tested 50 predicted HNF1 TFBS using an in vitro binding test and found that 96% of the predicted sites were bound! • Stormo and Fields (1998) found in detailed biochemical studies that the best weight matrices produce scores highly correlated with in vitro binding energy [Figure axes: Binding Energy vs. PSSM Score]

…the Bad… • Fickett (1995) found that a profile for the myoD TF made predictions at a rate of 1 per ~500bp of human DNA sequence • This corresponds to an average of 20 sites / gene (assuming 10,000 bp as average gene size)

…and the Ugly! Human Cardiac α-Actin gene analyzed with a set of profiles (each line represents a TFBS prediction) • Futility Conjecture: TFBS predictions are almost always wrong • Red boxes are protein coding exons - TFBS predictions excluded in this analysis

ADVANCED TOPIC: Issues of Column Independence • PSSM model assumes independence between positions • For example, if you observe a G at position 2, the model assumes there is no influence on the likelihood of a T at position 3 - this is known to be an incorrect assumption • Other models can represent dependence • Hidden Markov models of Nth order, where N refers to the number of influencing positions • For the cases where there are hundreds of TFBS known for a TF, there has been only modest improvement in the specificity of TFBS predictions using advanced column inter-dependent models • The newly emerging ChIP-Seq data collections will ultimately lead to the systematic use of more advanced models (not likely to advance to wet labs for ~3 years)

A Conundrum… • Counter to intuition, the ratio of true positives to predictions fails to improve for “stringent” thresholds • For most predictive models this ratio would increase • Why? • True binding sites are defined by properties not incorporated into the profile scores - above some threshold all sites could be bound if accessible [Figure axes: PPV vs. Threshold]

Section 3.2A: What have we learned? • PSSMs accurately reflect in vitro binding properties of DNA binding proteins • Suitable binding sites occur far too frequently to reflect in vivo function • Bioinformatics methods that use PSSMs for binding site studies must incorporate additional information to enhance specificity • Unfiltered predictions are too noisy for most applications • Organisms with short regulatory sequences are less problematic (e.g. yeast and bacteria)

Using Phylogenetic Footprinting to Improve TFBS Discrimination 70,000,000 years of evolution can reveal regulatory regions

Phylogenetic Footprinting • Align orthologous gene sequences (e.g. LAGAN) • For the first window of 100 bp of sequence #1, determine the % with identical match in sequence #2 • Step across the first sequence, recording the percentage of identical nucleotides in each window • Observe that the single exon contains a region of high identity that corresponds to the ORF, with lower identity in the 5’ and 3’ UTRs • Additional conserved regions could be regulatory regions [Figure: FoxC2, a single exon gene; y-axis: % Human-Mouse Identity, 0% to 100%]
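The windowing procedure above can be sketched as follows, assuming the pairwise alignment has already been produced by an aligner and is supplied as two equal-length strings with "-" for gaps. The function name and the toy alignment are hypothetical; windows are counted in bases of sequence #1, so alignment columns where sequence #1 has a gap are skipped.

```python
def window_identity(aln1, aln2, window=100, step=1):
    """Percent identity per window of sequence 1.

    aln1, aln2: aligned sequences of equal length, '-' marks a gap.
    Returns a list of (start position in sequence 1, percent identity).
    """
    if len(aln1) != len(aln2):
        raise ValueError("aligned sequences must have equal length")
    # One flag per base of sequence 1: does the aligned base in
    # sequence 2 match? A gap in sequence 2 counts as a mismatch.
    matches = [a.upper() == b.upper()
               for a, b in zip(aln1, aln2) if a != "-"]
    profile = []
    for start in range(0, len(matches) - window + 1, step):
        identical = sum(matches[start:start + window])
        profile.append((start, 100.0 * identical / window))
    return profile


# Toy alignment (hypothetical), using a 4 bp window for readability
profile = window_identity("ACGTACGT--AC", "ACGTTCGTGGAC", window=4)

if __name__ == "__main__":
    for start, pct in profile:
        print(start, pct)
```

Plotting the second value against the first gives the kind of identity profile shown on the slide.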

Phylogenetic Footprinting (cont) [Figure: Actin gene compared between human and mouse; y-axis: % Identity (200 bp window); x-axis: Window Start Position (human sequence)]

Multi-species Phylogenetic Footprinting • PhastCons scores indicate the regions of DNA that are unusually conserved across a multiple alignment of a given set of species (each base is scored with the probability that it lies in a conserved element)

Phylogenetic Footprints in UCSC Genome Browser • PhastCons (region score) • PhyloP (position score)

Phylogenetic Footprinting Dramatically Reduces Spurious Hits [Figure: Actin, alpha cardiac; Human and Mouse]

TFBS Prediction with Human & Mouse Pairwise Phylogenetic Footprinting • Testing set: 40 experimentally defined sites in 15 well studied genes (Replicated with 100+ site set) • 75-80% of defined sites detected with conservation filter, while only 11-16% of total predictions retained [Figure labels: Selectivity, Sensitivity]
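A conservation filter of the kind described above can be sketched as a mask over the sequence. This is a simplified illustration, not the method used to produce the slide's figures: the 70% identity cutoff, the function names and the toy data are all assumptions.

```python
def conserved_mask(identity_profile, window, seq_len, min_identity=70.0):
    """Mark every position covered by at least one conserved window.

    identity_profile: list of (window start, percent identity).
    min_identity: assumed cutoff, chosen here only for illustration.
    """
    mask = [False] * seq_len
    for start, pct in identity_profile:
        if pct >= min_identity:
            for pos in range(start, min(start + window, seq_len)):
                mask[pos] = True
    return mask


def filter_hits(hits, mask):
    """Keep predicted sites (start, end; end exclusive) lying fully
    inside conserved sequence."""
    return [(s, e) for s, e in hits if all(mask[s:e])]


# Toy data (hypothetical): three windows of 5 bp on a 15 bp sequence
profile = [(0, 90.0), (5, 40.0), (10, 85.0)]
mask = conserved_mask(profile, window=5, seq_len=15)
predicted = [(1, 4), (3, 8), (11, 15)]
retained = filter_hits(predicted, mask)

if __name__ == "__main__":
    print(retained)
    print(f"{len(retained)}/{len(predicted)} predictions retained")
```

The trade-off reported on the slide corresponds to how many true sites survive this step versus how many total predictions are discarded.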

1kbp insulin receptor promoter screened with footprinting

Choosing the “right” species for pairwise comparison... [Figure panels: CHICKEN vs. HUMAN, MOUSE vs. HUMAN, COW vs. HUMAN]

ConSite

TFBS Discrimination Tools • Phylogenetic Footprinting Servers • FOOTER http://biodev.hgen.pitt.edu/footer_php/Footerv2_0.php • CONSITE http://asp.ii.uib.no:8090/cgi-bin/CONSITE/consite/ • rVISTA http://rvista.dcode.org/ • ORCAtk http://burgundy.cmmt.ubc.ca/cgi-bin/OrcaTK/orcatk • SNPs in TFBS Analysis • RAVEN http://burgundy.cmmt.ubc.ca/cgi-bin/RAVEN/a?rm=home • Prokaryotes or Yeast • PRODORIC http://prodoric.tu-bs.de/ • YEASTRACT http://www.yeastract.com/index.php • Software Packages • TOUCAN http://homes.esat.kuleuven.be/~saerts/software/toucan.php • Programming Tools • TFBS http://tfbs.genereg.net/ • ORCAtk http://burgundy.cmmt.ubc.ca/cgi-bin/OrcaTK/orcatk

Analysis of TFBS with Phylogenetic Footprinting • Scanning a single sequence • Low specificity of profiles: • too many hits • great majority not biologically significant • Scanning a pair of orthologous sequences for conserved patterns in conserved sequence regions • A dramatic improvement in the percentage of biologically significant detections

Section 3.2B: What have we learned? • TFBS discrimination coupled with phylogenetic footprinting has greater specificity with tolerable loss of sensitivity • As with any purification process, some true binding sites will be lost • Available online resources support phylogenetic footprinting

Questions? Please Ask

Laboratory Exercise 3.2 TF Binding Site Prediction

20 minute break Until 10:50am Next: Sections 3.3 and 3.4

