Evolutionary and genomic approaches to find gene regulatory sequences. Penn State University, Center for Comparative Genomics and Bioinformatics: Webb Miller, Francesca Chiaromonte, Anton Nekrutenko, Kateryna Makova, Stephan Schuster, Ross Hardison
Evolutionary and genomic approaches to find gene regulatory sequences. Penn State University, Center for Comparative Genomics and Bioinformatics: Webb Miller, Francesca Chiaromonte, Anton Nekrutenko, Kateryna Makova, Stephan Schuster, Ross Hardison. University of California at Santa Cruz: David Haussler, Jim Kent. Children’s Hospital of Philadelphia: Mitch Weiss. NimbleGen: Roland Green. University of Nebraska, Lincoln, February 14, 2007
Major goals of comparative genomics • Identify all DNA sequences in a genome that are functional • Selection to preserve function • Adaptive selection • Determine the biological role of each functional sequence • Elucidate the evolutionary history of each type of sequence • Provide bioinformatic tools so that anyone can easily incorporate insights from comparative genomics into their research
Known types of gene regulatory regions G.A. Maston, S.K. Evans, M.R. Green (2006) Ann. Rev. Genomics & Human Genetics 7:29-59.
Regulatory regions tend to be clusters of transcription factor binding sites Sequence-specific SV40 promoters and enhancer
Properties of known regulatory regions • Binding sites for transcription factors, many with sequence specificity • Clusters of binding sites • Conventional promoters encompass major start sites for transcription • Conserved over evolutionary time???
Structures involved in transcription are probably more complex Middle image: Green: active transcription (Br-UTP label) Red: all nucleic acids HeLa cell Sides: EM spreads of transcripts Peter R. Cook, Oxford University, http://users.path.ox.ac.uk/~pcook/images/Images.html
Domain opening is associated with movement to non-heterochromatic regions Schubeler, Francastel, Cimbora, Reik, Martin, Groudine (2000) Genes & Dev. 14: 940-950
Other possible activities for sequences involved in gene regulation
• Opening or closing a chromosomal domain
• Moving a gene to or away from a transcription factory
• Controlling how long a gene is in a transcription factory
  • Long association: high-level expression; really long gene
  • Short association: lower-level expression; rapid regulation
• Are these conserved over evolutionary time?
3 modes of evolution • Sequence matches at longer phylogenetic distances could reflect purifying selection. • Sequence differences at closer phylogenetic distances could reflect adaptive evolution.
Conservation vs. Constraint • Conserved sequences are those that align between two species thought to be descended from a common ancestor • Constrained sequences show evidence in their alignments of negative (purifying) selection • E.g., they change at a rate significantly slower than “neutral” DNA
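The distinction between "aligns" and "changes significantly more slowly than neutral DNA" can be sketched as a simple test. The following is a minimal illustration, not the method behind phastCons (which uses a phylogenetic hidden Markov model): it assumes a pairwise alignment, independent sites, and a neutral mismatch rate `neutral_rate` supplied by the caller (a hypothetical value, e.g. estimated from ancestral repeats). The function names are made up for this example.

```python
from math import comb


def count_differences(seq_a: str, seq_b: str) -> tuple[int, int]:
    """Return (aligned_columns, mismatches) for two aligned rows, skipping gap columns."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have equal length")
    aligned = mismatches = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == "-" or b == "-":
            continue  # a gap column is not an aligned base pair
        aligned += 1
        if a != b:
            mismatches += 1
    return aligned, mismatches


def binomial_lower_tail(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))


def constraint_p_value(seq_a: str, seq_b: str, neutral_rate: float) -> float:
    """One-sided p-value that the window has as few mismatches as observed
    if it were evolving at the (assumed) neutral mismatch rate."""
    aligned, mismatches = count_differences(seq_a, seq_b)
    return binomial_lower_tail(mismatches, aligned, neutral_rate)
```

A window can be conserved (it aligns) and still return a large p-value here, which is the point of separating conservation from constraint.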
Ideal cases for interpretation [Figure: similarity plotted against position along chromosome, relative to neutral DNA. Human vs mouse: negative (purifying) selection; DNA segments with a function common to divergent species. Human vs rhesus: positive (adaptive) selection, P (not neutral); DNA segments in which change is beneficial to at least one of the two species.]
Messages about evolutionary approaches to predicting regulatory regions • Regulatory regions are conserved, but not all to the same phylogenetic distance. • Incorporation of pattern and composition information along with conservation can lead to effective discrimination of functional classes (regulatory potential). • Regulatory potential in combination with conservation of a GATA-1 binding motif is an effective predictor of enhancer activity. • In vivo occupancy by GATA-1 suggests other activities in addition to enhancers. • Comparison of polymorphism and divergence from closely related species can reveal regulatory regions that are under recent selection.
Finding all gene regulatory regions is a challenge for comparative genomics • Known regulatory regions for the HBB complex • 23 total • 19 conserved (align) between human and mouse • Many others show no significant difference in a measure of constraint (phastCons) from the bulk or neutral DNA
ENCODE projects • ENCODE (ENCyclopedia Of DNA Elements): consortium aiming to find function for all human DNA sequences • Phase I focused on 1% of human DNA • 30 Mb, 44 regions • About 10 regions had known genes of interest (CFTR, HOX) • Others were chosen to get a sampling of regions varying in gene density and alignability with mouse • Major areas • Genes and transcripts • Transcriptional regulation • Chromatin structure • Multiple sequence alignment • Variation in human populations
Biochemical assays for protein-binding sites in DNA • Purified protein & naked DNA • Chromatin immunoprecipitation: DNA sites occupied by a protein inside cells.
Putative transcriptional regulatory regions = pTRRs • Antibodies vs 10 sequence-specific factors: • Sp1, Sp3, E2F1, E2F4, cMyc, STAT1, cJun, CEBPe, PU1, RA Receptor A • High resolution ChIP-chip platforms: Affymetrix and NimbleGen • Data from several different labs in ENCODE consortium • High likelihood hits for ChIP-chip • 5% false discovery rate • Supported by chromatin modification data • Modified histones in chromatin: H4Ac, H3Ac, H3K4me, H3K4me2, H3K4me3, etc. • DNase hypersensitive sites (DHSs) or nucleosome depleted sites • Result: set of 1369 pTRRs
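The slide does not say how the 5% false discovery rate was estimated for the ChIP-chip hits. As an illustration only, the Benjamini-Hochberg step-up procedure is one common way to select hits at a chosen FDR from per-region p-values; the function name and the example p-values below are hypothetical.

```python
def benjamini_hochberg(p_values, fdr=0.05):
    """Return the sorted indices of tests called significant at the given FDR."""
    m = len(p_values)
    # Rank the tests from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        # Keep the largest rank whose p-value is under its step-up threshold.
        if p_values[idx] <= fdr * rank / m:
            cutoff_rank = rank
    return sorted(order[:cutoff_rank])


if __name__ == "__main__":
    example = [0.001, 0.008, 0.039, 0.041, 0.9]  # made-up p-values
    print(benjamini_hochberg(example))
```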
A small fraction of cis-regulatory modules are conserved from human to chicken • About 4% of pTRRs, 4% of DNase HSs, 4-7% of promoters active in multiple cell lines • Tend to regulate genes whose products control transcription and development [Figure: tree scale in millions of years: 91, 173, 310, 450] Credit: David King
Most pTRRs are conserved in eutherian mammals
Percentage of class that align no further than: pTRRs DNase HSs Promoters
11% Primates: 3% 1-13%
70% Eutherians: 71% 63%
14% Marsupials: 21% 16-28%
Tetrapods: 4% 4% 4-7%
Vertebrates: 1% 1% 2-4%
[Figure: tree scale in millions of years: 91, 173, 310, 450]
Within aligned noncoding DNA of eutherians, need to distinguish constrained DNA (purifying selection) from neutral DNA.
Measures of conservation and constraint capture only a subset of pTRRs [Figure panels: Fraction overlapping an MCS; phastCons (background rate corrected); Composite alignability (background rate corrected). Annotations: Aligns, but no inference about purifying selection; Allows a range of constraint; Stringent constraint]
Different measures perform better on specific functional regions [Figure axes: Sensitivity vs. 1 - Specificity]
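A plot of sensitivity against 1 - specificity can be computed from any per-region score (for example a conservation score) and a set of known labels. The sketch below is a generic illustration with made-up scores and labels; it is not the evaluation code behind the figure, and the function names are hypothetical.

```python
def roc_points(scores, labels):
    """Return (1 - specificity, sensitivity) pairs, one per distinct score threshold.

    labels: 1 for a known functional region, 0 for a background region.
    """
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative example")
    points = [(0.0, 0.0)]
    # Lower the threshold step by step; each step can only add calls.
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
        points.append((fp / negatives, tp / positives))
    return points


def area_under_curve(points):
    """Trapezoid-rule area under the points returned by roc_points."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))
```

Comparing the area under the curve for two scoring schemes on the same labelled regions is one way to say that one measure "performs better" on a given functional class.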
Good performance of ESPERR for gene regulatory regions (RP). Credit: Francesca Chiaromonte, James Taylor
Conservation of predicted binding sites for transcription factors Binding site for GATA-1
Genes Co-expressed in Late Erythroid Maturation • G1E-ER cells: proerythroblast line lacking the transcription factor GATA-1. • Can rescue by expressing an estrogen-responsive form of GATA-1 • Rylski et al., Mol Cell Biol. 2003
Predicted cis-Regulatory Modules (preCRMs) Around Erythroid Genes. B: Yong Cheng, Ross, Yuepin Zhou, David King. F: Ying Zhang, Joel Martin, Christine Dorman, Hao Wang
preCRMs with conserved consensus GATA-1 BS tend to be active on transfected plasmids
preCRMs with conserved consensus GATA-1 BS tend to be active after integration into a chromosome
preCRMs with High RP and Conserved Consensus GATA-1 Tend To Be Validated
CACC box helps distinguish validated from nonvalidated preCRMs [Diagram: all validated preCRMs and all nonvalidated preCRMs, same parameters, compare the outputs] Consensus for EKLF binding site: CCNCMCCCW. Credit: Ying Zhang
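The EKLF consensus CCNCMCCCW uses IUPAC ambiguity codes (N = any base, M = A or C, W = A or T). A minimal sketch of scanning a sequence for such a consensus is shown below; it searches the given strand only, and the function names and the example sequence are hypothetical rather than taken from the analysis on the slide.

```python
import re

# IUPAC nucleotide ambiguity codes mapped to regular-expression character classes.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "[AG]", "Y": "[CT]", "M": "[AC]", "K": "[GT]",
    "S": "[CG]", "W": "[AT]", "B": "[CGT]", "D": "[AGT]",
    "H": "[ACT]", "V": "[ACG]", "N": "[ACGT]",
}


def consensus_to_regex(consensus: str) -> re.Pattern:
    """Compile an IUPAC consensus into a regex; the lookahead allows overlapping hits."""
    body = "".join(IUPAC[base] for base in consensus.upper())
    return re.compile("(?=(" + body + "))")


def find_sites(sequence: str, consensus: str) -> list[tuple[int, str]]:
    """Return (0-based start, matched text) for every match on the given strand."""
    pattern = consensus_to_regex(consensus)
    return [(m.start(), m.group(1)) for m in pattern.finditer(sequence.upper())]


if __name__ == "__main__":
    print(find_sites("TTCCACACCCTGG", "CCNCMCCCW"))
```

Running the same scan over the validated and the nonvalidated sets and comparing match counts would mirror the "same parameters, compare the outputs" idea.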
preCRMs with conserved consensus GATA-1 binding sites are usually occupied by that protein: ChIP assay
Design of ChIP-chip for occupancy by GATA-1 [Diagram labels: 50, 50, 100] • Non-overlapping tiling array with 50 bp probes and 100 bp resolution (NimbleGen) • Covered range: mouse chr7:57225996-123812258 (~70 Mbp) • Antibody against the ER portion of the GATA-1-ER protein in rescued G1E-ER4 cells. Credit: Yong Cheng, with Mitch Weiss & Lou Dore (CHoP), Roland Green (NimbleGen)
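The tiling layout can be sketched as probe coordinates. This illustration assumes that "100 bp resolution" means one 50 bp probe starts every 100 bp (so 50 bp of probe, then a 50 bp gap), and that the interval is given in 1-based inclusive coordinates; neither assumption is confirmed by the slide, and the actual array design may differ (for example, by skipping repeats).

```python
def tile_probes(start, end, probe_len=50, step=100):
    """Yield (probe_start, probe_end) in 1-based inclusive coordinates.

    Probes do not overlap as long as step >= probe_len.
    """
    pos = start
    while pos + probe_len - 1 <= end:
        yield pos, pos + probe_len - 1
        pos += step


if __name__ == "__main__":
    # Interval quoted on the slide; the layout parameters are assumptions.
    probes = list(tile_probes(57225996, 123812258))
    print(len(probes), probes[0], probes[-1])
```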
Signals in known occupied sites in Hbb LCR [Figure panels: HS1, HS2, HS3] 1) Cluster of high signals 2) “Hill” shape of the signals
Peak Finding Programs • TAMALPAIS: Mark Bieda, from Peggy Farnham’s lab. Focuses more on the cluster of the signals. 4 thresholds, based on the number of consecutive probes with signals in the 98th or 95th percentiles • MPEAK: Bing Ren’s lab. Focuses more on the “hill” shape of the signal. 4 thresholds, for a series of probes with at least one that is 3, 2.5, 2 or 1 standard deviations above the mean
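The two ideas (a run of consecutive high probes, and a series of probes containing at least one strong summit) can be illustrated with a toy sketch. This is not the TAMALPAIS or MPEAK algorithm; the function names, the nearest-rank percentile, and the `floor_sd` parameter that delimits a "series" are all assumptions made for the example.

```python
import math
from statistics import mean, pstdev


def percentile(values, pct):
    """Nearest-rank percentile, with pct given as an integer from 1 to 100."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def runs_above(signals, threshold, min_run):
    """Return (first_index, last_index) for each run of at least min_run
    consecutive probes whose signal is at or above threshold."""
    runs, start = [], None
    for i, value in enumerate(signals):
        if value >= threshold:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_run:
                runs.append((start, i - 1))
            start = None
    if start is not None and len(signals) - start >= min_run:
        runs.append((start, len(signals) - 1))
    return runs


def hills(signals, n_sd, floor_sd=1.0):
    """Runs of probes above mean + floor_sd * SD that contain at least one
    probe at or above mean + n_sd * SD (the summit of the hill)."""
    mu, sd = mean(signals), pstdev(signals)
    candidates = runs_above(signals, mu + floor_sd * sd, min_run=1)
    return [(a, b) for a, b in candidates
            if max(signals[a:b + 1]) >= mu + n_sd * sd]
```

A cluster-style call would be, for example, `runs_above(signals, percentile(signals, 98), min_run=3)`, with `min_run` a hypothetical choice; loosening the percentile or the run length gives progressively less stringent thresholds.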
ChIP-chip hits for GATA-1 occupancy • Technical replicates of ChIP-chip with antibody against GATA1-ER • Mpeak: 275 hits in both replicates • TAMALPAIS: 276 hits in both replicates • [Venn diagram values: 59, 216, 60] • 321 total ChIP-chip hits
ChIP-chip hits validate at a high rate • Validation determined by quantitative PCR. • 19 of the 321 hits were tested; 13 (~70%) were validated. • Validation rate is similar at different thresholds. • 9 regions were “hits” in only one of the two technical replicates; none were validated. [Figure label: ChIP DNA]
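With only 19 hits tested, the validation rate carries substantial sampling uncertainty. As a sketch, a Wilson score interval for a binomial proportion can be computed as below; the choice of interval and the 95% level (z = 1.96) are assumptions for illustration, not something stated on the slide.

```python
from math import sqrt


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval (lower, upper) for a binomial proportion."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return center - half, center + half


if __name__ == "__main__":
    low, high = wilson_interval(13, 19)  # 13 validated of 19 tested
    print(f"{13 / 19:.2f} ({low:.2f} to {high:.2f})")
```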
Association of WGATAR and conservation with ChIP-chip Hits • 249 out of the 321 (78%) have WGATAR motifs, binding site for GATA-1 • Of the GATA-1 binding motifs in those 249 hits, 112 (45%) are conserved between mouse and at least one non-rodent species.
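Scanning a hit for WGATAR and asking whether a match is preserved in another species can be sketched as follows. W is A or T and R is A or G, so the reverse-strand form of the motif is YTATCW. The conservation check assumes the second sequence is an ungapped aligned row with the same coordinates as the first, which is a strong simplification of a real multiple alignment; all names and example sequences are hypothetical.

```python
import re

# WGATAR on the forward strand, and its reverse complement YTATCW.
FORWARD = re.compile(r"(?=([AT]GATA[AG]))")
REVERSE = re.compile(r"(?=([CT]TATC[AT]))")
MOTIF_LEN = 6


def find_wgatar(seq: str) -> list[tuple[int, str]]:
    """Return (0-based start, strand) for every WGATAR match on either strand."""
    seq = seq.upper()
    hits = [(m.start(), "+") for m in FORWARD.finditer(seq)]
    hits += [(m.start(), "-") for m in REVERSE.finditer(seq)]
    return sorted(hits)


def is_conserved(start: int, strand: str, aligned_other: str) -> bool:
    """True if the same six columns in the other (ungapped) aligned row
    also match the motif on the same strand."""
    window = aligned_other.upper()[start:start + MOTIF_LEN]
    pattern = FORWARD if strand == "+" else REVERSE
    return pattern.match(window) is not None


if __name__ == "__main__":
    mouse = "CCAGATAAGGCTTATCTCC"  # made-up sequence
    other = "CCTGATAGGGCTTGTCTCC"  # made-up aligned row from another species
    for start, strand in find_wgatar(mouse):
        print(start, strand, is_conserved(start, strand, other))
```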
Distribution of ChIP-chip hits on 70Mb of mouse chr7 Yong Cheng, Yuepin Zhou and Christine Dorman
Almost half the GATA-1 ChIP-chip hits increase expression of a transgene, K562 cells • 24 validated out of 56 fragments with ChIP-chip hits tested (43%) [Figure values: 15, 6, 6; labels: No GATA-1; GATA-1 occupied sites by ChIP-chip]
Conserved and nonconserved ChIP-chip hits can be active as enhancers [Figure panels: Conserved, active; Conserved, not active; Not conserved, active]