Translating the Cell’s “Instruction Manual”: A Biophysicist’s Approach to Understanding Gene Regulation
Rachel Patton McCord
Bulyk Lab, Harvard University Biophysics Program
3/20/08
“Knobloch lives?”
• What are characteristics of “life”?
  • Response to environment
  • Take in nutrients and produce waste
  • Reproduction
  • …
Biological Signal Processing
[Figure labels: oxygen, ethanol]
Biological Signal Processing
[Figure labels: Inputs, Outputs, protein, Transcription Factor, mRNA, Nucleus]
Regulation of Gene Expression
• Transcription Factor (TF) recognizes DNA bases (ACGT)
• Promotes gene expression: transcription of mRNA
[Figure labels: RNA Polymerase, Sequence-Specific TFs, RNA (output)]
Organisms
• Ideal: understand gene regulation in humans
• Problems: large genome size, diverse cell types, likely complicated gene regulation “rules”
• Begin with a model system: the single-celled organism Saccharomyces cerevisiae (yeast)
[Figure labels: A few hundred bp, A few hundred kb]
Goals:
• Find DNA sequences bound by TFs
• Predict how TFs function in the cell
• Look for biophysical links between TF structure and function
• Use quantitative approaches to maintain a physically realistic view of biology.
TF-DNA Sequence Recognition
Protein Binding Microarray (PBM) Technology
[Figure labels: TF, dsDNA, Fluorophore-labeled antibody, Microarray slide]
Mukherjee, Berger, et al., Nature Genetics (2004), 36:1331-1339.
TF-DNA Sequence Recognition
Protein Binding Microarray (PBM) Technology
[Figure labels: Detector, Laser (488 nm)]
Mukherjee, Berger, et al., Nature Genetics (2004), 36:1331-1339.
Universal Array Design
• Interested in sequences of 8-10 bases
• 4^10 ≈ 1,000,000 total 10-mers
• 27 10-mers per spot
• 4^10 / 27 ≈ 40,000 total spots
Probe layout (5’ to 3’): 36 nt variable sequence, then 24 nt fixed sequence
CTATCTACACACAACTATGCGGTCGCCATGGAAATGGTCTGTGTTCCGTTGTCCGTGCTG
Overlapping windows from the start of the variable sequence:
CTATCTACACA
 TATCTACACAC
  ATCTACACACA
   TCTACACACAA
Berger, Philippakis et al., Nature Biotechnology (2006), 24:1429-1435.
Philippakis, Qureshi et al., RECOMB (2007).
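The spot-count arithmetic on this slide can be checked in a few lines. This is a minimal sketch, assuming a 36-nt variable region read as overlapping 10-mers (36 - 10 + 1 = 27 windows per spot) and no 10-mer repeated across spots; the function name `spots_needed` is illustrative and not from the original work.

```python
def spots_needed(k, variable_len, alphabet_size=4):
    """Return (total k-mers, k-mers per spot, minimum number of spots).

    Assumes every window in every spot is a distinct k-mer, which is the
    best case that a de Bruijn-style design aims for.
    """
    total_kmers = alphabet_size ** k
    kmers_per_spot = variable_len - k + 1
    return total_kmers, kmers_per_spot, total_kmers / kmers_per_spot


total, per_spot, spots = spots_needed(k=10, variable_len=36)
print(f"{total:,} 10-mers, {per_spot} per spot, about {spots:,.0f} spots")
```

The exact quotient is a little under the rounded 40,000 quoted on the slide.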
Universal Array Design
• Use an idea from cryptography:
  • A “de Bruijn” sequence contains all sequence variants of length k in the shortest sequence possible
[Figure: circular de Bruijn sequence covering all possible 3-mers; length = 4^3 = 64 bp]
All possible 3-mers:
AAA AAC AAG AAT ACA ACC ACG ACT AGA AGC AGG AGT ATA ATC ATG ATT
CAA CAC CAG CAT CCA CCC CCG CCT CGA CGC CGG CGT CTA CTC CTG CTT
GAA GAC GAG GAT GCA GCC GCG GCT GGA GGC GGG GGT GTA GTC GTG GTT
TAA TAC TAG TAT TCA TCC TCG TCT TGA TGC TGG TGT TTA TTC TTG TTT
[Figure labels: Test sequence (36 bp), Fixed sequence (24 bp)]
TCGATTGCGTGACAGGGTAAAACAAGACCCTGACCATGGCAGTGT
TCGATTGCGTGACAGGGTAGTCCGGGTTCTTTGCGCTCACTATAC
Anthony Philippakis, Mike Berger
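To make the de Bruijn idea concrete, here is a small sketch that builds one such sequence over ACGT. It uses the textbook Lyndon-word recursion, which is a stand-in chosen for brevity and not necessarily the construction used to design the actual array; the function name is illustrative.

```python
def de_bruijn(alphabet, k):
    """Return a cyclic de Bruijn sequence of order k over `alphabet`.

    Every length-k word appears exactly once when the sequence is read
    cyclically, so the length is len(alphabet) ** k.
    """
    n = len(alphabet)
    a = [0] * (n * k)
    out = []

    def db(t, p):
        if t > k:
            if k % p == 0:
                out.extend(a[1:p + 1])
        else:
            a[t] = a[t - p]
            db(t + 1, p)
            for j in range(a[t - p] + 1, n):
                a[t] = j
                db(t + 1, t)

    db(1, 1)
    return "".join(alphabet[i] for i in out)


cyclic = de_bruijn("ACGT", 3)
# Append the first k-1 letters to read the cyclic sequence as a linear one.
linear = cyclic + cyclic[:2]
three_mers = {linear[i:i + 3] for i in range(len(cyclic))}
print(len(cyclic), len(three_mers))
```

For k = 3 the cyclic sequence has 4^3 = 64 letters, matching the length given on the slide, and all 64 distinct 3-mers are recovered from it.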
Deriving Binding Strength at each Sequence
• Every 8-mer is represented 16 times
• Take median over intensities of all spots containing this 8-mer
Example: CATGGAAA
CCGTCAGCAGTCATGGAAAGCTGGTAGAAGTTCTGGGTCTGTGTTCCGTTGTCCGTGCTG
TTATACCATGGAAAGACAAACGTAGCATGTTGGAGTGTCTGTGTTCCGTTGTCCGTGCTG
CCATGGAAATGTGTCCCTAAGGGTGGTAACAAAATAGTCTGTGTTCCGTTGTCCGTGCTG
CACTACGCAAGTGCGGTGCATGGAAAGGGTTCTGGAGTCTGTGTTCCGTTGTCCGTGCTG
ATCTCATGGAAAAGACTCATAACGATCAACAGTCGGGTCTGTGTTCCGTTGTCCGTGCTG
ACAACAGAGCACCGATGGCATGGAAACTTGCGTAGAGTCTGTGTTCCGTTGTCCGTGCTG
GTGGAGAAAGGGGTCAAACATGGAAACGCATCGACAGTCTGTGTTCCGTTGTCCGTGCTG
GCCCGGGATCCCATCCATGGAAAATGTCGCTTACATGTCTGTGTTCCGTTGTCCGTGCTG
CAGAAGTGTCCTACGTAACATCCACATGGAAAGTACGTCTGTGTTCCGTTGTCCGTGCTG
GTTGCATACACGCATGGAAATAACAATCGAACTCCAGTCTGTGTTCCGTTGTCCGTGCTG
TCATGTGCTGGGCTTGATTCAGCATGGAAAACCAGTGTCTGTGTTCCGTTGTCCGTGCTG
TATTCTTCTCTTCATGGAAACAGTAAAAAATCGGACGTCTGTGTTCCGTTGTCCGTGCTG
CTATCTACACACAACTATGCGGTCGCCATGGAAATGGTCTGTGTTCCGTTGTCCGTGCTG
CCTGGGGACATGGAAAAATGAAGTCACCCATGGTGCGTCTGTGTTCCGTTGTCCGTGCTG
ATCATCCTTACATTACATGGAAATCGTGTGCCAATAGTCTGTGTTCCGTTGTCCGTGCTG
AAGGCCCATGGAAACCACGTCATATTCACAACTAACGTCTGTGTTCCGTTGTCCGTGCTG
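The median-per-8-mer step can be sketched as below. Several things here are assumptions rather than details from the talk: an 8-mer and its reverse complement are pooled under one key, only the first 36 nt (the variable region) of each probe is scanned, and the intensities are made-up numbers. The three probes are taken from the example list above.

```python
from collections import defaultdict
from statistics import median

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def median_kmer_signal(spots, k=8, variable_len=36):
    """Map each k-mer to the median intensity of the spots containing it.

    `spots` is an iterable of (probe_sequence, intensity) pairs. Keys are
    the alphabetically smaller of a k-mer and its reverse complement.
    """
    by_kmer = defaultdict(list)
    for seq, intensity in spots:
        variable = seq[:variable_len]
        seen = set()  # count each k-mer at most once per spot
        for i in range(len(variable) - k + 1):
            kmer = variable[i:i + k]
            key = min(kmer, reverse_complement(kmer))
            if key not in seen:
                seen.add(key)
                by_kmer[key].append(intensity)
    return {kmer: median(values) for kmer, values in by_kmer.items()}


# Hypothetical intensities attached to three probes from the slide.
example_spots = [
    ("CCGTCAGCAGTCATGGAAAGCTGGTAGAAGTTCTGGGTCTGTGTTCCGTTGTCCGTGCTG", 100.0),
    ("TTATACCATGGAAAGACAAACGTAGCATGTTGGAGTGTCTGTGTTCCGTTGTCCGTGCTG", 300.0),
    ("CCATGGAAATGTGTCCCTAAGGGTGGTAACAAAATAGTCTGTGTTCCGTTGTCCGTGCTG", 200.0),
]
signal = median_kmer_signal(example_spots)
print(signal["CATGGAAA"])
```

Using the median rather than the mean keeps one unusually bright or dim spot from dominating the estimate for an 8-mer.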
Affinity vs. PBM Signal (Cbf1)
[Plot axes: log(KD^-1) vs. log(PBM Median Signal)]
Maerkl and Quake. Science (2007); 315:233-237.
Deriving Binding Strength at each Sequence
8-mer      Rev. Comp.  Median Signal
GTCACGTG   CACGTGAC    108178
GCACGTGC   GCACGTGC    95854
CACGTGCC   GGCACGTG    89203
GCACGTGA   TCACGTGC    74295
TCACGTGA   TCACGTGA    69377
ACACGTGA   TCACGTGT    68733
ATCACGTG   CACGTGAT    58874
CACGTGTA   TACACGTG    58656
CCACGTGA   TCACGTGG    47900
ACACGTGG   CCACGTGT    47240
CACGTGAG   CTCACGTG    42887
AGCACGTG   CACGTGCT    41755
ACACGTGC   GCACGTGT    36764
CACGTGTC   GACACGTG    36463
ACCACGTG   CACGTGGT    36380
CACGTGCG   CGCACGTG    35515
CACGTGCA   TGCACGTG    32370
AACACGTG   CACGTGTT    28948
CCACGTGC   GCACGTGG    22983
CACGTGGC   GCCACGTG    19315
...        ...         ...
[TF] + [DNA] ⇌ [TF-DNA] (association rate ka, dissociation rate kd)
Goals:
• Find DNA sequences bound by TFs
  • PBMs
• Predict how TFs function in the cell
• Look for biophysical links between TF structure and function
• Use quantitative approaches to maintain a physically realistic view of biology.
Predicting TF Cellular Functions
• Use known/measurable inputs and outputs:
[Figure labels: Gene expression, Heat shock, Gene Deletion, mRNA]
Gene Expression Data
• 1327 Publicly Available Microarray Datasets
[Figure labels: Condition 1, mRNA, Condition 2]
Predicting Cellular Functions of Components
• Basic model/assumptions
  • TF binding near genes causes change in expression
  • Similar TF binding probability + similar expression = active regulation
[Figure labels: Expression data, PBM data, TF1, Gene 1, Gene 2, Gene 3, Gene 4, Gene 5]
Physically Realistic Binding Probability
• Simple (and often used) view:
[Figure labels: Cbf1, Gene]
Promoter region is BOUND: Gene is ON
GGCACGTGGCTGCATGAGCGGAGTCACGTGGGAAAATACAACAGTCACCCACGTG
CCGTGCACCGACGTACTCGCCTCAGTGCACCCTTTTATGTTGTCAGTGGGTGCAC
Promoter region is NOT BOUND: Gene is OFF
GGCACGTGGCTGCATGAGCGGAGGCTCGCGGGAAAATACAACAGTCACCCACGTG
CCGTGCACCGACGTACTCGCCTCCGTGCGCCCTTTTATGTTGTCAGTGGGTGCAC
Physically Realistic Binding Probability
• Physical reality:
  • Energy landscape of potential TF binding
  • TF occupancy probability = integration of binding potential across sequence near gene
  • Dictates likelihood of recruiting RNA polymerase and thus level of mRNA transcription
[Figure labels: Cbf1, Gene]
GGCACGTGGCTGCATGAGCGGAGTCACGTGGGAAAATACAACAGTCACCCACGTG
CCGTGCACCGACGTACTCGCCTCAGTGCACCCTTTTATGTTGTCAGTGGGTGCAC
Physically Realistic Binding Probability
• Physical reality:
  • Energy landscape of potential binding
  • Sum median intensity data across all possible 8-mers in sequence near gene
[Figure labels: Cbf1, Gene, Intensity = 117651, Intensity = 215352]
GGCACGTGGCTGCATGAGCGGAGTCACGTGGGAAAATACAACAGTCACCCACGTG
CCGTGCACCGACGTACTCGCCTCAGTGCACCCTTTTATGTTGTCAGTGGGTGCAC
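The "sum over all 8-mers" score for a promoter can be sketched as follows. The lookup table here holds a single Cbf1 value from the PBM table shown earlier in the talk; treating an 8-mer and its reverse complement as equivalent and scoring missing 8-mers as zero are assumptions made for this illustration, and the function names are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    return seq.translate(COMPLEMENT)[::-1]


def promoter_binding_score(promoter, kmer_signal, k=8, default=0.0):
    """Sum PBM median intensities over every k-mer window in `promoter`.

    A window is looked up as written and, failing that, as its reverse
    complement; windows absent from the table contribute `default`.
    """
    total = 0.0
    for i in range(len(promoter) - k + 1):
        kmer = promoter[i:i + k]
        if kmer in kmer_signal:
            total += kmer_signal[kmer]
        else:
            total += kmer_signal.get(reverse_complement(kmer), default)
    return total


# One entry from the Cbf1 table; a real table would cover all 8-mers.
cbf1_signal = {"GTCACGTG": 108178.0}
forward_score = promoter_binding_score("AAGTCACGTGAA", cbf1_signal)
reverse_score = promoter_binding_score("TTCACGTGACTT", cbf1_signal)
print(forward_score, reverse_score)
```

Because every window contributes, weak sites add to the score instead of being discarded by an on/off threshold, which is the contrast the slide draws with the simple bound/not-bound view.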
Goals of New Analysis Method
• Combine binding probability with expression data to predict TF function and condition-specific binding site usage
[Figure labels: Target Gene: 1 2 3 4 5 6; PBM data; Gene expression; Condition A, Condition B, Condition C, Condition D; TF Function]
Goals of New Analysis Method
• Consider all data rather than drawing arbitrary cutoffs
• Low-affinity binding as well as minor expression changes may be biologically relevant
  • Tanay, 2006; Foat et al., 2006
[Figure label: Binding probability ?]
CRACR “Combination Rank-order Analysis of Condition-specific Regulation”
Basics of CRACR Approach
• Order genes by expression in condition of interest
• Assign ranks based on PBM-derived binding probability for TF
TF binding rank: 2 3 6 9 1 8 5 10 4 7 11
Genes (most induced to most repressed): YGR087C YAR003W YGR043C YAR044W YER130C YPL054W YAR014C YGR088W YAR018W YAR029W YAL003C
Basics of Analysis Approach
• Select:
  • similarly expressed foreground genes
  • background set
[Figure labels: foreground, background]
PBM p-value rank: 2 3 6 9 1 8 5 10 4 7 11
Genes (most induced to most repressed): YGR087C YAR003W YGR043C YAR044W YER130C YPL054W YAR014C YGR088W YAR018W YAR029W YAL003C
Basics of Analysis Approach
• Slide window along ordered expression
• Calculate an area statistic for enrichment of PBM targets within each window vs. background
$$\text{area} = \frac{1}{B + F}\left[\frac{\rho_B}{B} - \frac{\rho_F}{F}\right]$$
ρ = rank sum, F = foreground, B = background (B and F also denote the number of genes in each set)
PBM p-value rank: 2 3 6 9 1 8 5 10 4 7 11
Genes (most induced to most repressed): YGR087C YAR003W YGR043C YAR044W YER130C YPL054W YAR014C YGR088W YAR018W YAR029W YAL003C
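The sliding-window calculation can be sketched as below. The formula is an assumption reconstructed from the slide: the mean background rank minus the mean foreground rank, divided by the total number of genes (B + F), which gives the -0.5 to 0.5 range quoted in the talk. Taking every gene outside the window as background is also an assumption, and the function names are illustrative.

```python
def area_statistic(foreground_ranks, background_ranks):
    """Assumed form: (rho_B / B - rho_F / F) / (B + F), with rho a rank sum."""
    f, b = len(foreground_ranks), len(background_ranks)
    rho_f, rho_b = sum(foreground_ranks), sum(background_ranks)
    return (rho_b / b - rho_f / f) / (b + f)


def sliding_area(binding_ranks, window):
    """Area statistic for each window along genes ordered by expression.

    `binding_ranks` lists the TF binding rank of each gene, in order from
    most induced to most repressed. `window` must be smaller than the list.
    """
    scores = []
    for start in range(len(binding_ranks) - window + 1):
        fg = binding_ranks[start:start + window]
        bg = binding_ranks[:start] + binding_ranks[start + window:]
        scores.append(area_statistic(fg, bg))
    return scores


# Ranks shown on the slide for the 11 example genes.
ranks = [2, 3, 6, 9, 1, 8, 5, 10, 4, 7, 11]
scores = sliding_area(ranks, window=3)
print([round(s, 3) for s in scores])
```

A positive value means the window holds better-ranked (stronger predicted) binding targets than the background; a window containing exactly the top-ranked genes reaches the maximum of 0.5.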
Predicting TF Function
• Plot area statistic (ranges -0.5 to 0.5) at each window
• Determine condition significance by permutation test-derived threshold (gray line: p < 0.001)
[Figure: Mig1 area statistic along genes ordered from induced to repressed. Glucose added: Mig1 targets repressed. Labels: Glucose, Mig1, metabolism switch, metabolism enzyme, mRNA]
[Color scale: expression fold change, >8.0 5.0 3.4 2.3 1.5 0 -1.5 -2.3 -3.4 -5 <-8]
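A permutation-derived threshold of this kind can be sketched as follows. The talk does not spell out the permutation scheme, so this version makes two assumptions: binding ranks are shuffled across genes and a random window is scored, and the area statistic is the difference in mean rank between background and foreground divided by the total number of genes. All names and parameter values are illustrative.

```python
import random


def area_statistic(fg, bg):
    """Assumed form of the statistic: bounded between -0.5 and 0.5."""
    return (sum(bg) / len(bg) - sum(fg) / len(fg)) / (len(fg) + len(bg))


def permutation_threshold(n_genes, window, n_perm=10000, alpha=0.001, seed=0):
    """Area value exceeded by roughly a fraction `alpha` of random windows."""
    rng = random.Random(seed)
    ranks = list(range(1, n_genes + 1))
    null = []
    for _ in range(n_perm):
        rng.shuffle(ranks)
        # After shuffling, the first `window` genes are a random foreground.
        null.append(area_statistic(ranks[:window], ranks[window:]))
    null.sort()
    index = min(n_perm - 1, int((1 - alpha) * n_perm))
    return null[index]


threshold = permutation_threshold(n_genes=200, window=20, n_perm=2000)
print(round(threshold, 3))
```

An observed window whose area statistic exceeds the returned value would be called significant at about the chosen alpha, under this simplified null.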
Predicting TF Function
• Determine which individual genes are repressed by Mig1
[Figure: Mig1 area statistic along genes ordered from induced to repressed. Glucose added: Mig1 targets repressed. Labels: Group of genes repressed by Mig1; Mig1 at YHR005C, YER130C, YBL054W]
[Color scale: expression fold change, >8.0 5.0 3.4 2.3 1.5 0 -1.5 -2.3 -3.4 -5 <-8]
Prediction of General TF Function
• Find all (of 1327) expression conditions where a TF is predicted to be active
• Look for enrichment of general biological functions in this set
[Figure: Selected Mcm1 significant conditions]
Prediction of General TF Function
• Find all (of 1327) expression conditions where a TF is predicted to be active
• Look for enrichment of general biological functions in this set
• Prediction: Mcm1 involved in cell cycle and mating
[Figure: Selected Mcm1 significant conditions. Labels: alpha factor, “alpha” cell, “a” cell]
Prediction of TF Function
• Following PBM experiments, CRACR has been used to predict the functions of 90 yeast TFs (paper in progress)
Binding Site Affinity Effects
[Figure: binding affinity vs. TF concentration (low, medium, high). Labels: Gene 1, high-affinity TF site; Gene 2, medium-affinity TF site; Gene 3, low-affinity TF site]
[TF] + [DNA] ⇌ [TF-DNA] (association rate ka, dissociation rate kd)
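The effect sketched on this slide follows from simple equilibrium binding. The example below assumes a 1:1 binding model with free TF in excess, so that the dissociation constant is KD = kd / ka and fractional occupancy is [TF] / ([TF] + KD). The KD values and concentrations are hypothetical numbers chosen only to show the trend.

```python
def fractional_occupancy(tf_conc, k_d):
    """Equilibrium occupancy of a single site: [TF] / ([TF] + KD)."""
    return tf_conc / (tf_conc + k_d)


# Hypothetical dissociation constants (nM) for three binding sites.
sites = {"Gene 1 (high affinity)": 1.0,
         "Gene 2 (medium affinity)": 10.0,
         "Gene 3 (low affinity)": 100.0}

# Hypothetical TF concentrations (nM).
for label, conc in [("low", 1.0), ("medium", 10.0), ("high", 100.0)]:
    row = ", ".join(f"{gene}: {fractional_occupancy(conc, kd):.2f}"
                    for gene, kd in sites.items())
    print(f"TF concentration {label}: {row}")
```

Under these assumptions a site is half occupied when the TF concentration equals its KD, so high-affinity sites fill first and low-affinity sites respond only at high TF concentration.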
Demonstrating Effects of Binding Site Affinity
• Low- vs. high-affinity binding sites may have different biological functions
[Figure: Expression after oxidative stress vs. Rap1 binding affinity, ordered from highest to lowest binding affinity. Label: Experimentally Validated]
Goals:
• Find DNA sequences bound by TFs
  • PBMs
• Predict how TFs function in the cell
  • CRACR
• Look for biophysical links between TF structure and function
• Use quantitative approaches to maintain a physically realistic view of biology.
Reasons for Different Functions: TF structure?
• Goal: Consider biophysical TF structure instead of cartoon “TF blob”
[Figure labels: tup1, cyc8, Mig1]
TF Structure and Function
• Are certain TFs structurally suited for certain types of biological processes?
• Case Study:
  • CST6 (bZIP): lower information content motif; regulatory hub, many target genes; cell fate, cell cycle
  • GAL4 (Zn2Cys6): higher information content motif; more specific, fewer target genes; metabolism of specific nutrients
Future Directions
• Completion of functional predictions and study of yeast gene regulation
• Toward a predictive model in humans
• Experiments for understanding gene regulation rules
Acknowledgements
Martha Bulyk, Mike Berger, Anthony Philippakis, Cong Zhu, Kelsey Byers, Trevor Siggers, Vicky Zhou, Cherelle Walls, Jason Warner, Jaime Chapoy, Other Bulyk Lab Members
• NSF graduate research fellowship
• NIH/NHGRI R01
Advantages and Challenges of Interdisciplinary Work
• Insight gained by quantitative reasoning in biology and by combining different perspectives
• “Physicists and mathematicians choose projects in biology that are fun, but not necessarily important”
• Important not to get caught up in what “counts” as “true biology” or “true physics”