
Finding Transcription Modules from large gene-expression data sets

Ned Wingreen – Molecular Biology; Morten Kloster, Chao Tang – NEC Laboratories America.


Presentation Transcript


  1. Finding Transcription Modules from large gene-expression data sets. Ned Wingreen – Molecular Biology; Morten Kloster, Chao Tang – NEC Laboratories America

  2. Outline • Introduction – transcription, regulation, gene chips, and transcription modules. • Iterative Signature Algorithm (ISA). • Advantages of Progressive Iterative Signature Algorithm (PISA). • PISA applied to yeast data.

  3. Transcription regulation http://doegenomestolife.org

  4. Gene chips DNA microarray

  5. Gene-expression profile: E_gc, with genes g = 1, 2, ..., Ng and conditions c = 1, 2, ..., Nc. But the data are very noisy…

  6. Transcription module [Figure: transcription factors TF1-TF4, conditions C1-C3, genes G1-G7.] A transcription module: a set of conditions and a set of genes connected by a transcription factor.

  7. Signature of a transcription module [Figure: expression matrix with conditions c1, ..., cNc on one axis and genes g1, ..., gNg on the other.] A gene can be in multiple transcription modules.

  8. Iterative Signature Algorithm (ISA), Barkai group (2002, 2003) [Figure and formulas: expression matrix with conditions c1, ..., cNC and genes g1, ..., gNG; gene vector and condition vector; thresholding; transcription module (TM).] Thresholding on both genes and conditions reduces noise.
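The idea of alternating between a thresholded condition vector and a thresholded gene vector can be sketched as follows. This is only an illustration, not the Barkai group's published formulation: the function names, the z-score thresholds `t_c` and `t_g`, and the use of a single unnormalized matrix `E` are all assumptions made for the example.

```python
import numpy as np

def zscore(v):
    """Center and scale a score vector; return zeros if it has no spread."""
    sd = v.std()
    return (v - v.mean()) / sd if sd > 0 else np.zeros_like(v)

def isa_step(E, genes, t_c=2.0, t_g=2.0):
    """One ISA-style update (assumed form): gene set -> conditions -> gene set.

    E is a (genes x conditions) expression matrix; `genes` is a boolean mask.
    """
    # Condition scores: average expression over the current gene set.
    s_c = E[genes].mean(axis=0)
    # Keep conditions whose score stands out in either direction,
    # weighted by the score itself; all other conditions get weight 0.
    m_c = np.where(np.abs(zscore(s_c)) > t_c, s_c, 0.0)
    # Gene scores: projection of every gene onto the condition vector.
    s_g = E @ m_c
    # Keep genes whose score stands out.
    return zscore(s_g) > t_g

def isa(E, genes, max_iter=50, **thresholds):
    """Iterate until the gene set stops changing or becomes empty."""
    for _ in range(max_iter):
        new = isa_step(E, genes, **thresholds)
        if not new.any() or np.array_equal(new, genes):
            return new
        genes = new
    return genes
```

Starting from a partial gene set, the iteration is expected to grow toward the full set of genes that share the same conditions, because only conditions and genes that pass both thresholds feed the next round.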

  9. Limitations of ISA • Lots of spurious modules (millions…). • Weak modules may be absorbed by strong ones. • ISA does not make use of identified modules to find new ones. [Figure: expression matrix with conditions c1, ..., cNc and genes g1, ..., gNg.]

  10. Progressive Iterative Signature Algorithm (PISA) [Figure: expression matrix with conditions c1, ..., cNc and genes g1, ..., gNg.]

  11. Advantages of PISA over ISA • Removing found modules reveals “hidden” modules, and reduces noise for unrelated modules. • No positive feedback. • Improved thresholding for genes. • Combines coregulated and counter-regulated genes.

  12. Example of PISA vs. ISA [Figure labels: A, B, TF1, TF2, G1, G2.]

  13. The gene-score threshold [Figure: gene scores along the condition vector for some module.] • Goal: fewer than one gene included in the module by mistake. • Require: a threshold that is insensitive to the (unknown) module size.

  14. Eliminating false modules. For scrambled data, preliminary modules have either few genes or few contributing conditions. [Figure label: true positives.]

  15. PISA applied to yeast data • Applied PISA to a dataset containing almost all available microarray data for S. cerevisiae: >6000 genes, ~1000 conditions. • Found ~140 different modules, including all “good” modules found by ISA. • Found some unknown modules. • Found many “good” small modules that ISA could not find / separate from the spurious modules. • ~2600 genes in at least one module, ~900 genes in more than one module.

  16. Some modules found by PISA

  17. Example: Zinc module. Genes shown: ZRT1, ZRT2, ZRT3, ZAP1, INO1, ADH4, YNL254C, YOL154W. Panels: ZAP1-regulated genes during zinc starvation (Lyons et al., PNAS 97, 7957-7962 (2000)); zinc module found by PISA.

  18. Comparison with other databases. “Gold standard”: Gene Ontology (Genome Res. 11, 1425-1433 (2001)). Database A: Immunoprecipitation (Lee et al., Science 298, 799-804 (2002)). Database B: Comparative genomics (Kellis et al., Nature 423, 241-254 (2003)).

  19. Correlations [Figure: modules with number of genes in parentheses; legend: correlated, anticorrelated.] rRNA processing (117); Ribosomal proteins (126); Histone (19); Fatty acid syn ++ (22); Cell cycle G2/M (31); Cell cycle M/G1 (35); Cell cycle G1/S (66); Mating genes for type a (15); Mating type a signaling genes (6); Mating (110); Mating factors/receptors: a/a difference (26); Oxidative stress response (69); De novo purine biosyn (32); Lysine biosyn (11); Biotin syn & transport (6); Arg biosyn (6); aa biosyn (96); Oxidative stress response (69); aryl alcohol dehydrogenase (6); proteolysis (27); trehalose & hexose metabolism/conversion (21); COS genes (11); heat shock (52); repair of disulfide bonds (26).

  20. Summary • Data from gene chips can be used to identify transcription modules (TMs). • Iterative approach (ISA) is promising. • PISA improves on ISA by taking out found TMs. • PISA also improves gene thresholding, avoids positive feedback, and improves signal to noise by grouping coregulated and counter-regulated genes. • PISA very effective for finding “secondary modules”. http://cn.arxiv.org/abs/q-bio/0311017

  21. Future Directions • Input to experiment: • new modules and new genes in old modules. • what kinds of experiments give the most informative data? • Improve PISA: • better pre/post-processing of data. • Apply PISA to other organisms. • Combine PISA with other data (experimental, bioinformatic) to systematically identify TMs, and reconstruct the transcription network.

  22. De novo purine biosynthesis Number of genes: 32 Average number of contributing conditions: 14.6 Consistency: 0.59 Best ISA overlap: 0.59 at tG=5.0; frequency 16

  23. Galactose induced genes Number of genes: 23 Average number of contributing conditions: 18.1 Consistency: 0.55 Best ISA overlap: 0.74 at tG=3.2; frequency 686

  24. Hexose transporters Number of genes: 10 Average number of contributing conditions: 33.7 Consistency: 0.59 Best ISA overlap: 0.6 at tG=3.8; frequency 41

  25. Peroxide shock Number of genes: 69 Average number of contributing conditions: 23.9 Consistency: 0.50 Best ISA overlap: 0.34 at tG=3.4; frequency (1)

  26. Implementation of PISA • Normalization of gene-expression data • Iterative algorithm to find preliminary modules (modified ISA) • avoiding positive feedback • gene-score threshold • Orthogonalization • Finding consistent modules

  27. Normalization of expression data • Gene-score matrix EG: normalizes total RNA levels; removes reference-condition bias; makes gene scores comparable. • Condition-score matrix EC: makes condition scores comparable.
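The slide lists the goals of the normalization but not the formulas. One plausible reading, sketched below, centers each condition over genes, centers each gene over conditions, and then rescales by gene (for EG) or by condition (for EC). Every step here is an assumption made for illustration, and the function name `normalize` is hypothetical; the authors' actual procedure may differ.

```python
import numpy as np

def normalize(E):
    """Return hypothetical (E_G, E_C) from a genes x conditions log-ratio matrix."""
    # Assumed step 1: center each condition (column) over genes, so that
    # chip-to-chip differences in total RNA level drop out.
    X = E - E.mean(axis=0, keepdims=True)
    # Assumed step 2: center each gene (row) over conditions, so that the
    # choice of reference condition does not bias any gene.
    X = X - X.mean(axis=1, keepdims=True)
    # Assumed step 3: scale each gene to unit variance, making gene scores
    # comparable from gene to gene.
    E_G = X / X.std(axis=1, keepdims=True)
    # Assumed step 4: scale each condition to unit variance, making
    # condition scores comparable from condition to condition.
    E_C = X / X.std(axis=0, keepdims=True)
    return E_G, E_C
```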

  28. Iterative algorithm: modified ISA (mISA) • Start with a random set of genes GI. • Produce the condition-score vector sC. • Produce the gene-score vector sG, using “leave-one-out” scoring to avoid positive feedback. • From sG, calculate the gene vector mG for the next iteration.
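A minimal sketch of what “leave-one-out” scoring could look like, assuming the condition-score vector is a weighted sum of the genes' rows in EC and the gene score is a dot product with the gene's row in EC's counterpart EG. Under that assumption, each gene is scored against a condition vector built without its own contribution, so a gene already in the set cannot inflate its own score. The function name and the exact score definitions are illustrative, not taken from the slides.

```python
import numpy as np

def loo_gene_scores(E_G, E_C, m_G):
    """Leave-one-out gene scores (assumed form).

    E_G, E_C: (genes x conditions) normalized matrices.
    m_G: gene weight vector (zero for genes outside the current set).
    """
    # Condition scores from the weighted gene set.
    s_C = m_G @ E_C
    # Naive gene scores, which include each gene's own contribution to s_C.
    full = E_G @ s_C
    # Contribution of gene g to its own score: m_G[g] * (E_G[g] . E_C[g]).
    self_term = m_G * np.einsum("gc,gc->g", E_G, E_C)
    # Subtracting it equals scoring g against s_C rebuilt without g.
    return s_C, full - self_term
```

Genes outside the current set have weight zero, so their scores are unchanged; only genes already in the set lose the self-reinforcing term.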

  29. Orthogonalization. After finding each converged preliminary module (sG, sC), remove the component along sC from all genes. [Figure: vectors s1C, s2C, s’.]
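Removing the component along sC from every gene is a standard vector projection: each gene's row e is replaced by e − (e·u)u, where u is sC scaled to unit length. The sketch below shows that operation; the function name `remove_component` is hypothetical, and whether the authors apply it to EG, EC, or both is not stated on the slide.

```python
import numpy as np

def remove_component(E, s_C):
    """Project every row of E onto the subspace orthogonal to s_C."""
    # Unit vector along the condition-score vector of the found module.
    u = s_C / np.linalg.norm(s_C)
    # Subtract each gene's projection onto u.
    return E - np.outer(E @ u, u)
```

After this step no gene has any remaining signal along sC, so a later run cannot converge to the same module again.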

  30. Why does scrambled data yield large modules? Long tails of expression data lead to single-condition modules.

  31. Finding consistent modules • Repeat PISA runs many times (~30). • Tabulate preliminary modules. • A preliminary module contributes to a module if: • the preliminary module contains > 50% of the genes in the module, • these genes constitute > 20% of the preliminary module. • A gene is included in a module if it appears in >50% of the contributing modules, always with the same gene-score sign.
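The two rules above can be sketched as a single pass over the tabulated preliminary modules. The data layout is an assumption for illustration: each preliminary module is a dict mapping gene name to gene-score sign (+1 or −1), and a candidate module is a set of gene names. The sketch also assumes both contribution conditions must hold together; the function names are hypothetical.

```python
from collections import defaultdict

def contributes(prelim, module):
    """True if the preliminary module passes both overlap criteria."""
    shared = set(prelim) & set(module)
    return (len(shared) > 0.5 * len(module)      # > 50% of the module's genes
            and len(shared) > 0.2 * len(prelim)) # > 20% of the preliminary module

def consensus_genes(prelims, module):
    """Genes (with sign) supported by the contributing preliminary modules."""
    contributing = [p for p in prelims if contributes(p, module)]
    signs = defaultdict(list)
    for p in contributing:
        for gene, sign in p.items():
            signs[gene].append(sign)
    # Keep a gene if it appears in > 50% of contributing modules,
    # always with the same gene-score sign.
    return {g: s[0] for g, s in signs.items()
            if len(s) > 0.5 * len(contributing) and len(set(s)) == 1}
```

For example, a gene that appears in every contributing preliminary module but flips sign in one of them is excluded, while a gene that appears in two out of three with a consistent sign is kept.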

  32. Comparison with other databases. Gene Ontology (Genome Res. 11, 1425-1433 (2001)): Ng = number of genes in organism; m = number of genes in module; c = number of genes in GO category; n = number of genes in both module and GO category. The p value is computed from Ng, m, c, and n. Database A: Immunoprecipitation (Lee et al., Science 298, 799-804 (2002)). Database B: Comparative genomics (Kellis et al., Nature 423, 241-254 (2003)).
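The slide's p-value formula is not preserved in this transcript. A common choice for this kind of overlap test, assumed here, is the hypergeometric tail: the probability that a random set of m genes drawn from Ng shares at least n genes with a category of size c. The function name `overlap_p_value` is hypothetical.

```python
from math import comb

def overlap_p_value(Ng, m, c, n):
    """P(overlap >= n) for a random m-gene set and a c-gene category (assumed test)."""
    # Number of ways to choose any m genes from the organism.
    total = comb(Ng, m)
    # Number of m-gene sets that share exactly k genes with the category,
    # summed over all k from n up to the largest possible overlap.
    tail = sum(comb(c, k) * comb(Ng - c, m - k)
               for k in range(n, min(m, c) + 1))
    return tail / total
```

Exact integer arithmetic is used until the final division, so the sum stays accurate even for large gene counts. With Ng = 10, m = 3, c = 4, n = 3, the only qualifying overlap is all three module genes falling in the category, giving 4/120.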

  33. Correlation of modules [Figure: expression matrix with conditions c1, ..., cNc and genes g1, ..., gNg.]
