Chris Penkett Wellcome Trust Sanger Institute

Overview:
• Web-based software for orthologs and primers
• In-house microarray processing
• Stationary phase experiments with fission yeast
• Analysis of introns and expression data

YOGY: a web-based database for protein orthologs and associated GO terms
Can be used to search most of the major eukaryotic model organisms using a variety of gene IDs. Data are stored in a MySQL database and results are shown on the web. Includes ortholog predictions from several sources, including KOGs, Inparanoid, OrthoMCL and HomoloGene. Also allows Gene Ontology (GO) terms to be retrieved for each ortholog, along with evidence codes, giving an overview of the protein function. It is now being used by the GO Reference Genome Consortium to aid with assigning GO terms between the model organisms.

Overview of output for search with cdc22

OrthoMCL results: example for cdc22 ortholog prediction output

Part of the GO output for cdc22

PPPP: a web-based primer design program for gene tagging/deletion
Scripts to design primers for N- and C-terminal tagging/deletion of genes using the method of homologous recombination. Primers are integrated into a kanamycin-resistant plasmid using PCR, and then transformed into fission yeast cells. In addition to gene deletion, a gene can be tagged with an inducible promoter, a tag that is recognised by antibodies, or with a fluorescently labelled protein (GFP). Primers can also be designed that allow checking of correct integration of the plasmid into the chromosomal location using PCR.

Primers for homologous recombination

Primers for checking integration

Data flow for in-house arrays
The group has two PCR-based, spotted arrays: one for ORFs (and non-coding RNAs) and one for intergenic regions. The ORF array was originally produced back in 2000, and is still used today. The advantage is that data from a wide range of experiments (environmental stress, cell cycle, mating, sporulation, translation data, RNA half-lives, etc.) have been collected under nearly identical conditions. Needed to produce a robust, easily maintainable pipeline to get the data from these arrays in a Windows-based environment where ~1000 arrays are used per year in the lab. Also needed to design new primers to obtain a nearly complete set of sequences. Recently, the biggest problem was the amount of data in GeneSpring, and it was necessary to upgrade to the Oracle-based version.

Spotted array data flow
[Flow diagram. Components shown: 96-well to 384-well conversion program; Primers: 96-well plate format; Primers: 384-well plate format; GAL file: microarray layout; TAS software; Microarray primer DB; GenePix image analysis software; Primer design scripts: ORF/tiling arrays; Images/GPR files; GeneDB: sequences, annotation; Local normalisation script; Hyb Info DB: experiment info; Tab2Mage; Normalised all/spot/gene files; ArrayExpress; SPGE data viewer; SPGE loaders; GeneSpring/R (BioConductor)]

Microarray primer database
Initiated as a pipeline to check that we had a complete set of valid primers for all ORFs and intergenic regions on the in-house S. pombe arrays. The data from this pipeline are stored in a MySQL database, which is managed and viewed on the web with Perl scripts using CGI/DBI modules. Contains 96-well plate information together with primer information: sequence (for both the primers and the final amplicon), mapping information, melting temperature, % GC content, PCR result, etc.
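Two of the primer fields listed above, % GC content and melting temperature, can be computed directly from the primer sequence. The following is a minimal sketch only: the slide does not say how the database calculates melting temperature, so the Wallace rule (2 °C per A/T, 4 °C per G/C, a rough estimate for short oligos) is an assumption, and the function names are hypothetical.

```python
def gc_percent(sequence):
    """Percentage of G and C bases in a primer sequence."""
    seq = sequence.upper()
    if not seq:
        raise ValueError("empty sequence")
    gc = seq.count("G") + seq.count("C")
    return 100.0 * gc / len(seq)


def wallace_tm(sequence):
    """Rough melting temperature (deg C) by the Wallace rule.

    ASSUMPTION: the primer database's actual Tm method is not stated
    in the talk; this rule is used purely for illustration.
    """
    seq = sequence.upper()
    at = seq.count("A") + seq.count("T")
    gc = seq.count("G") + seq.count("C")
    return 2 * at + 4 * gc


if __name__ == "__main__":
    primer = "ATGCATGC"  # hypothetical primer, not from the database
    print(gc_percent(primer), wallace_tm(primer))
```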

ORF array Intergenic array

96-well to 384-well conversion program
• Java program that works for both ORF and intergenic arrays.
• Two conversion patterns used by array makers.
• Can also add any number of bacterial plates anywhere on array.
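As an illustration of what such a conversion involves, here is a minimal sketch of one widely used layout, in which four 96-well plates are interleaved into the four quadrants of a 384-well plate. The slide does not specify which two patterns the program supports, so this pattern is an assumption, the sketch is in Python rather than the Java of the original program, and the function name is hypothetical.

```python
import string


def to_384(plate_index, well):
    """Map a well on one of four 96-well plates to a 384-well position.

    ASSUMPTION: quadrant interleaving, where plate 0 fills odd rows/odd
    columns, plate 1 odd rows/even columns, plate 2 even rows/odd
    columns and plate 3 even rows/even columns (1-based numbering).
    """
    if plate_index not in (0, 1, 2, 3):
        raise ValueError("plate_index must be 0-3")
    row = ord(well[0].upper()) - ord("A")   # 0-7 for rows A-H
    col = int(well[1:]) - 1                 # 0-11 for columns 1-12
    if not (0 <= row < 8 and 0 <= col < 12):
        raise ValueError(f"not a 96-well position: {well}")
    row384 = 2 * row + plate_index // 2     # 0-15 for rows A-P
    col384 = 2 * col + plate_index % 2      # 0-23 for columns 1-24
    return f"{string.ascii_uppercase[row384]}{col384 + 1}"


if __name__ == "__main__":
    for plate in range(4):
        print(plate, to_384(plate, "A1"), to_384(plate, "H12"))
```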

Local normalisation script
• Perl/Tk script that works on both arrays.
• Uses a sliding window around each spot for normalisation.
• Works with bacterial spikes using various algorithms.
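The sliding-window idea can be sketched as follows: each spot's log ratio is centred on the median log ratio of the spots in a square window around it. This is an illustration in Python, not the Perl/Tk script itself. The choice of the median, the window size and the data layout are all assumptions, and the handling of bacterial spikes is not shown.

```python
from statistics import median


def local_normalise(spots, half_window=2):
    """Centre each spot on the median of its local window.

    spots: dict mapping (row, col) grid positions to log2 ratios.
    ASSUMPTION: median-centring within a square window; the talk only
    states that a sliding window around each spot is used.
    """
    normalised = {}
    for (row, col), value in spots.items():
        neighbours = [
            v for (r, c), v in spots.items()
            if abs(r - row) <= half_window and abs(c - col) <= half_window
        ]
        normalised[(row, col)] = value - median(neighbours)
    return normalised


if __name__ == "__main__":
    # Hypothetical 3 x 3 block with a uniform bias of +1.0 log2 units.
    grid = {(r, c): 1.0 for r in range(3) for c in range(3)}
    print(local_normalise(grid, half_window=1))
```

The nested scan is quadratic in the number of spots, which is acceptable for a sketch but would need an indexed grid for whole arrays.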

Hyb Info DB for MIAME experiment annotation

Starvation/stationary phase study
• Rationale: most cells in our body have stopped growing, and yeast in the wild (on a grape, for example) are also no longer growing.
• Grow WT cells from mid-exponential phase (OD ~ 0.3) to stationary phase (OD ~ 3) in minimal medium at 32 °C.
• Experimental issues:
  • Different numbers of cells at different time points.
  • Less total RNA per cell in stationary phase.
  • Normalise to cell numbers (by counting cells), RNA amounts, and relative mRNA levels (by using bacterial spikes).
  • Need to extract consistent amounts of RNA to normalise using RNA yield.
  • pH and other factors change during experiment.

Fission yeast - life cycle (partial)
[Diagram. Labels shown: Mitotic cell cycle; Environmental factors (stress); Nutrient (glucose) deprivation; Stationary phase; Re-supply nutrients; Nutrient (nitrogen) deprivation; Conjugation of h+/h- cells; Zygote formation; Meiosis/sporulation; Zygotic ascus; Dormant ascospores; Re-supply nutrients]

Stationary phase expression profile: data up to 11 days
[Plots: pH against time (hours); expression profile against time (hours)]

Overall normalisation
Time points normalised using bacterial controls (difficult to get accurate) and cell counts.
[Figure: large-scale expression and cell morphology changes across the time points]

Induced CESR (common) stress genes Ribosomal genes

Pathways related to sugar metabolism
[Plots: glycolysis pathway; starch and sugar metabolism; citric acid cycle]
Pombe seems to use non-glucose sources for energy in stat. phase and stores sugar and starch.

More pathways of interest
[Plots: genes that are up-regulated 2-fold in low glucose medium; mitochondrial electron transport; tRNA coupling genes; genes associated with RNA pol II]

Budding yeast findings
• >1800 genes increase 5-10 mins after refeeding.
• Mitochondrial function is important for stat. phase entry.
• 2 out of 3 stat. phase genes have human orthologs.
• >2500 genes up-regulated.
• ChIP-chip reveals RNA pol II present at intergenic sites upstream of genes that are induced upon stat. phase exit.

Transcription factors that come up early in stat. phase
• rsv1: cell viability in low glucose
• C1105.14: selected for deletion studies
• pcr1: meiosis
• rst2: meiosis
• res2: DNA synthesis/meiotic division - goes down later
• atf1: meiosis/stat. phase/stress response
• pap1: stress response
• hsf: binds to heat shock elements
• mbx2: cell wall synthesis
• php2: respiration/mitochondrial electron transport
• jmj2: chromatin remodelling
• C320.03
• C2H10.01
• C25B8.19c
• C19C7.10

Gene deletion of C1105.14
Exponential phase: 5 repeats
Stationary phase: 2 repeats
Green is for down-regulated genes in the WT time course. Red is for up-regulated genes.

50 most repressed genes in C1105.14 mutant in WT stationary phase time course

Genes regulated in stationary phase
• Up-regulated:
  • Stress MAPK pathway and stress response genes.
  • Citric acid cycle and mitochondrial transport genes.
  • Starch and sugar metabolism genes.
  • Genes that are 2-fold up-regulated in low glucose medium (including sugar transporter genes).
  • Genes involved with RNA polymerase II.
  • Transcription factors known to be involved in starvation, stress, meiosis.
  • Some unknown TFs that are now being investigated further in the lab.
• Down-regulated:
  • Ribosomal proteins.
  • Glycolysis pathway.
  • Fatty acid synthesis genes.
  • tRNA coupling genes.
  • Amino acid and nucleotide metabolism genes.

Effect of introns in up-regulated stress-response (CESR) genes in stationary phase time course
[Plots: genes without introns; genes with introns]

Genes with and without introns in different oxidative stress conditions
The effect seems to be general in pombe stress experiments.

Comparing data sets from different organisms
It seems that in pombe the transcriptional response to stress conditions is governed by a need to produce functional mRNAs quickly (without the need for splicing). Is this common to other organisms?
As studies in different organisms use various time points, need a way to compare data both within and between time courses using a standard common metric: expression change within unit time. For expression levels E1, E2, E3 measured at times t1, t2, t3:
$$R_{2-1} = \frac{E_2 - E_1}{t_2 - t_1}, \qquad R_{3-2} = \frac{E_3 - E_2}{t_3 - t_2}, \qquad R_{max} = \left|\max(R_{2-1}, R_{3-2})\right|$$
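The rate metric can be sketched in a few lines of Python. This follows the slide's definition literally (the absolute value of the largest rate between consecutive time points) and generalises it from three time points to any number, which is an assumption. The function names and example values are hypothetical.

```python
def consecutive_rates(times, expression):
    """Expression change per unit time between consecutive time points."""
    if len(times) != len(expression) or len(times) < 2:
        raise ValueError("need matching lists with at least two points")
    rates = []
    for i in range(1, len(times)):
        dt = times[i] - times[i - 1]
        if dt <= 0:
            raise ValueError("time points must be strictly increasing")
        rates.append((expression[i] - expression[i - 1]) / dt)
    return rates


def r_max(times, expression):
    """Rmax as written on the slide: abs(max(R_2-1, R_3-2, ...))."""
    return abs(max(consecutive_rates(times, expression)))


if __name__ == "__main__":
    # Hypothetical log-ratio values at t = 0, 15, 60 mins.
    print(r_max([0, 15, 60], [0.0, 1.5, 3.0]))
```

Taking the maximum before the absolute value favours induction. A gene that is only repressed has its least negative rate reported, so a variant that takes the maximum of the absolute rates may be preferable when repression matters.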

Rmax for stress data against number of introns in pombe
Data is from Chen et al. (2003), for 5 different stresses with t = 0, 15, 60 mins. Data is from 2-colour microarrays, so is relative expression levels, compared to t = 0.
[Plot: median of Rmax for all data against intron number]
Correlation of max value against intron number using Spearman’s rank: P = 7.6 x 10^-6
Correlation of all values against intron number using Spearman’s rank: P < 2.2 x 10^-16

Compare genes without introns against genes with introns
Compare two data sets using Wilcoxon (Mann-Whitney) non-parametric rank test: P < 2.2 x 10^-16
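Both rank-based statistics used in these comparisons can be illustrated with the standard library alone. The sketch below computes Spearman's rank correlation coefficient and the Mann-Whitney U statistic, using average ranks for ties. It does not compute P values, and the example numbers are hypothetical, not the talk's data. In practice `scipy.stats.spearmanr` and `scipy.stats.mannwhitneyu` provide both statistics together with P values.

```python
def average_ranks(values):
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation of the two rank vectors."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5


def mann_whitney_u(group_a, group_b):
    """U statistic for group_a (rank sum minus its minimum possible value)."""
    ranks = average_ranks(list(group_a) + list(group_b))
    n_a = len(group_a)
    return sum(ranks[:n_a]) - n_a * (n_a + 1) / 2


if __name__ == "__main__":
    introns = [0, 0, 1, 2, 3]           # hypothetical intron counts
    rmax = [0.9, 0.8, 0.5, 0.4, 0.1]    # hypothetical Rmax values
    print(spearman_rho(introns, rmax))
    print(mann_whitney_u(rmax[:2], rmax[2:]))  # 0 introns vs >0 introns
```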

Compare with cell cycle data
Data from 3 elutriations of wt cells over 2 cell cycles (Rustici et al., 2004).
W (0 vs >0): <2.2 x 10^-16; S (all data): <2.2 x 10^-16
W (0 vs >0): 3.9 x 10^-5; S (all data): 4.4 x 10^-5

Compare with Arabidopsis stress data
Data from various stresses in Arabidopsis for both roots and shoots including drought, UV-B, cold, heat, genotoxic, salt, wounding and osmotic from ? et al. Time points include 30, 60, 180, 360, 720, 1440 mins (plus 15 and 240 for some).
W (0 vs >0): <2.2 x 10^-16; S (all data): <2.2 x 10^-16
W (0 vs >0): <2.2 x 10^-16; S (all data): <2.2 x 10^-16

Considerations with Arabidopsis data Data collected using Affymetrix chips, so get absolute expression levels. Hence can use data that is from the absolute values or a ratio to t = 0 (to compare with the 2-colour pombe data). Also they did a time course with control untreated plants, so can also compare stress data using ratios to the control time points.

Rmax for Arabidopsis data using different methods
W (0 vs >0): <2.2 x 10^-16; S (all data): <2.2 x 10^-16
W (0 vs >0): <2.2 x 10^-16; S (all data): <2.2 x 10^-16
W (0 vs >0): <2.2 x 10^-16; S (all data): <2.2 x 10^-16
Absolute data and data vs t = 0 look virtually identical, as logs of the expression values have been taken.
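The agreement between the absolute and ratio-to-t = 0 versions follows from the logarithm. This is a short worked step, assuming the rates are computed on log-transformed values as the slide indicates, with E_0 denoting the expression level at t = 0:

```latex
R_{2-1}^{\text{ratio}}
  = \frac{\log\frac{E_2}{E_0} - \log\frac{E_1}{E_0}}{t_2 - t_1}
  = \frac{(\log E_2 - \log E_0) - (\log E_1 - \log E_0)}{t_2 - t_1}
  = \frac{\log E_2 - \log E_1}{t_2 - t_1}
  = R_{2-1}^{\text{absolute}}
```

The reference term cancels for every pair of measured time points, so the two versions can differ only where the t = 0 measurement itself enters as one of the time points.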

Mouse stress data
2 Affymetrix data sets in GEO: GDS1015 (fetal bovine serum factor; Philippar et al., 2004) and GDS683 (oxidative stress; Madsen et al., 2004).
GDS1015 has better time points: t = 0, 10, 30, 50, 180 mins. GDS683: t = 0, 15, 60, 480, 1008 mins.
Some genes have >100 introns, so put into 22 equi-spaced bins (0-4, 5-9, 10-14, etc. introns).
W (0 introns vs >0 introns): - (mean/median less for 0 introns)
S (all data): - (positive gradient)
Similar poor stats for GDS683.
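The binning by intron number can be sketched as below: each gene is assigned to a bin of width 5 (0-4, 5-9, 10-14, ...), and a summary value is taken per bin. The use of the median as the per-bin summary, the function names and the example values are assumptions for illustration.

```python
from collections import defaultdict
from statistics import median


def bin_label(n_introns, width=5):
    """Label of the equal-width bin containing n_introns, e.g. '5-9'."""
    start = (n_introns // width) * width
    return f"{start}-{start + width - 1}"


def median_rmax_by_bin(genes, width=5):
    """Median Rmax per intron-count bin.

    genes: iterable of (n_introns, rmax) pairs (hypothetical layout).
    """
    grouped = defaultdict(list)
    for n_introns, rmax in genes:
        grouped[bin_label(n_introns, width)].append(rmax)
    return {label: median(values) for label, values in grouped.items()}


if __name__ == "__main__":
    example = [(0, 0.5), (3, 0.25), (7, 0.125), (104, 0.0625)]
    print(median_rmax_by_bin(example))
```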

Use new metric for mouse data
As the transcripts are generally very long in mouse, the time taken to transcribe the pre-mRNA is also going to be a factor, as well as the time taken to splice out introns. Additionally, number of introns correlates with transcript length for Arabidopsis and mouse (only get a small correlation in pombe). Can use an alternative metric called intron density, which correlates positively with intron number and inversely with transcript length:
$$\text{Intron density} = \frac{\text{Number of introns}}{\text{Genomic length of transcript}}$$

Mouse stress data using intron densities
As intron density is a continuous variable, put into 10 equi-spaced bins.
W (<1/10th max vs >1/10th max): 4.0 x 10^-10
S (all data): 1.1 x 10^-6
W (0 introns vs >0 introns): - (mean/median less)
S (all data): - (positive gradient)
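Intron density and its division into 10 equal-width bins can be sketched as follows. The bin edges are assumed to run from 0 to the maximum observed density, which is consistent with the "<1/10th max vs >1/10th max" comparison (first bin against the rest) but is not stated explicitly. The function names are hypothetical.

```python
def intron_density(n_introns, genomic_length_bp):
    """Number of introns divided by genomic length of the transcript."""
    if genomic_length_bp <= 0:
        raise ValueError("genomic length must be positive")
    return n_introns / genomic_length_bp


def equal_width_bin(value, max_value, n_bins=10):
    """Index (0 to n_bins - 1) of the equal-width bin containing value.

    ASSUMPTION: bins span 0 to max_value; the maximum itself is placed
    in the last bin.
    """
    if max_value <= 0 or not 0 <= value <= max_value:
        raise ValueError("need 0 <= value <= max_value and max_value > 0")
    return min(int(value / max_value * n_bins), n_bins - 1)


if __name__ == "__main__":
    # Mean-gene figures quoted in the talk for mouse: ~9 introns, ~33,000 bp.
    print(intron_density(9, 33000))
    print(equal_width_bin(0.55, 1.0))
```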

Check Arabidopsis stress data for trend with intron densities
W (0 vs >0): <2.2 x 10^-16; S (all data): <2.2 x 10^-16
W (<1/10 vs >1/10): <2.2 x 10^-16; S (all data): <2.2 x 10^-16
Intron density still significant for Arabidopsis and pombe.
Pombe stats: W (<1/10 vs >1/10): <2.2 x 10^-16; S (all data): <2.2 x 10^-16

Transcription and splicing kinetics
Transcription proceeds at 1200-1500 bp/minute (Izban & Luse, 1992).
Pombe: mean gene length ~1,500 bp, time to transcribe ~1 min, mean intron number ~1 per gene.
Arabidopsis: mean gene length ~1,900 bp, time ~1.5 mins, mean intron number ~4.4 per gene.
Mouse: mean gene length ~33,000 bp, time ~20 mins, mean intron number ~9 per gene.
Half-lives for splicing reactions are considerably longer: under a minute for the first intron, but of the order of 2-8 mins for the second and third introns (Audibert et al., 2002). Intron splicing may be the rate-limiting factor, since new spliceosomal ‘speckles’ form ~15-20 mins after gene activation in mammalian cells, and speckle morphology changes on the order of 5-7 mins (Misteli et al., 1997). With time scales on this order, it appears that the assembly, splicing and release of the spliceosome may be limiting for rapid changes in gene expression.
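The transcription times quoted above follow from dividing the mean gene length by the elongation rate. A minimal sketch of that arithmetic, using only the figures given in the talk (the function name is hypothetical):

```python
# Elongation rate range from the talk (bp per minute).
RATE_BP_PER_MIN = (1200, 1500)

# Mean gene lengths from the talk (bp).
MEAN_GENE_LENGTH_BP = {"pombe": 1500, "Arabidopsis": 1900, "mouse": 33000}


def transcription_time_range(length_bp, rate=RATE_BP_PER_MIN):
    """(fastest, slowest) time in minutes to transcribe length_bp."""
    slow_rate, fast_rate = rate
    return length_bp / fast_rate, length_bp / slow_rate


if __name__ == "__main__":
    for organism, length in MEAN_GENE_LENGTH_BP.items():
        fastest, slowest = transcription_time_range(length)
        print(f"{organism}: {fastest:.1f}-{slowest:.1f} min")
```

For mouse this gives 22 to 27.5 minutes, the same order as the rounded ~20 mins on the slide, and well above the 2-8 minute splicing half-lives. For pombe and Arabidopsis the transcription time is around a minute, so splicing would dominate.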

Acknowledgements
• Jürg Bähler: Supervisor
• Valerie Wood: Suggested adding GO into YOGY
• Daniel Jeffares: Intron data
• Gavin Burns: Laboratory help for pombe and arrays
• Luis López: Stationary phase mutant data
• Matloob Qureshi: 96 to 384 well program
• Juan Mata: Normalisation script
• Zoë Birtles, James Morris: Summer students
