Etceteromics

Jeremy Glasner, Ph.D.
October 23, 2007

Example -omes and -omics: receptorome, complexome, phenome, alleome, degradome, regulome, behaviourome, genome, ORFeome, physiome, interactome, biome, transcriptome, allergenome, bibliome, secretome, functome, cardiogenomics, epitome, pathogenome, lectinomics, hygienomics, metabolome, envirome, chemoproteomics, glycome, RNome, epigenome, pseudogenome, cellome, chromatinomics, chaperome, proteome, embryogenomics
Source: http://www.genomicglossaries.com/content/omes.asp

Origin (from http://en.wikipedia.org/wiki/-omics)
The suffix “-om-” originated as a back-formation from “genome”, a word formed in analogy with “chromosome”.[1] The word “chromosome” comes from the Greek stems “χρωμ(ατ)-” (colour) and “σωμ(ατ)-” (body).[1] (Thus, had this word been well-formed, it would instead be “chromatosome”.[2]) Because “genome” refers to the complete genetic makeup of an organism, some people have inferred that there exists a root, *“-ome-”, of Greek origin referring to wholeness or completion, but no such root is known to most or all scholars.[3] Because of the success of large-scale quantitative biology projects such as genome sequencing, the suffix "-om-" has migrated to a host of other contexts. Bioinformaticians and molecular biologists were among the first scientists to apply the "-ome" suffix widely.[citation needed]

Some -omes
• biome: (1916) an ecological community of organisms and environments.
• degradome: the entire protease complement of human cells and tissues.
• degradomics: the application of genomic and proteomic approaches to identify the protease and protease-substrate repertoires, or 'degradomes', on an organism-wide scale.

Gerstein Lab: http://bioinfo.mbb.yale.edu/what-is-it/omes/omes.html
Morphome, Interactome, Glycome, Secretome, Translatome, Ribonome, Orfeome, Regulome, Cellome, Operome, Transportome, Functome, Foldome, Unknome

Relative popularity of different -omes
[Chart; values shown: 9452, 3002, 337]

Unifying themes in -omics
• Technology driven
• High-throughput
• Data rich
• Databases
• Statistical analysis
• Ontology development
• Data integration/unification

Enabling Technologies
• Genomics: sequencing
• Transcriptome: microarrays, sequencing
• Proteomics: mass spec, microarrays
• Metabolomics: mass spec, NMR
• Genomotyping: microarrays, sequencing, mass spec
• Interactome: yeast 2-hybrid, mass spec

Technological Gaps
• Phenomics: some technology is available (e.g. Biolog) but it is not generalizable
• Genetics: not always doable; requires screens (see phenomics above)
• Sample preparation may be rate limiting for many types of experiments
• Cost: things that are doable are not always affordable (see sequencing, microarrays, phenotyping)

Thinking like an -omicist
Given the funds, what -ome would you want to characterize?
Is it possible with current technology?

High density tiled microarrays to detect “islands”
Samples: genome strain, experimental strain
1) Extract genomic DNA
2) Fragment DNA and label fluorescently
3) Hybridize to oligonucleotide array
4) Infer which regions on the chip are variable

~100 variable regions per strain
• DEC5A = 611 Kb, 952 genes
• DEC5D = 624 Kb, 1005 genes
• ECOR37 = 628 Kb, 943 genes
Strains shown: O157:H7 EDL933, O55:H7 DEC5A, O55:H7 DEC5D, ECOR37

Does it make sense to do CGH?
• New advances in sequencing bring its cost and effort in line with hybridization-based approaches.
• A single run on a 454 sequencer generates about 400,000 reads of about 200 bp each, or about 80 Mb of sequence per run.
• Hybridization can only tell you about the presence or absence of sequences you already know about. Sequencing can reveal novel elements.
• Does it make sense to continue doing CGH?
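The per-run yield quoted above is just reads times read length. A minimal sketch of that arithmetic follows; the function name is made up for illustration, and the inputs are the approximate figures from the slide.

```python
def run_yield_bases(reads: int, read_length_bp: int) -> int:
    """Total bases produced by one sequencing run (reads x read length)."""
    return reads * read_length_bp


# Approximate 454 figures from the slide: 400,000 reads of ~200 bp each.
total = run_yield_bases(400_000, 200)
print(f"{total:,} bp = {total / 1e6:.0f} Mb = {total / 1e9:.2f} Gb")
# prints: 80,000,000 bp = 80 Mb = 0.08 Gb
```

The product is 80 million bases, i.e. 80 Mb (0.08 Gb) per run.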

Comparison of transcription factor binding sites across genomes
[Venn diagram: E. coli K-12 MG1655, P. atrosepticum, Dickeya dadantii; values shown: 432, 38, 39, 26, 514, 382, 53]

FNR
• apt: adenine phosphoribosyltransferase
• atpE: F0 sector of membrane-bound ATP synthase, subunit
• cysC: adenosine 5'-phosphosulfate kinase
• narK: nitrate/nitrite transporter
• narX: sensory histidine kinase in two-component regulatory system with NarL
• ndh: respiratory NADH dehydrogenase 2/cupric reductase
• nrdD: anaerobic ribonucleoside-triphosphate reductase
• purM: phosphoribosylaminoimidazole synthetase
• yfiD: pyruvate formate lyase subunit

ArcA
• lpd: lipoamide dehydrogenase, E3 component is part of three enzyme complexes
• mdh: malate dehydrogenase, NAD(P)-binding
• sdhC: succinate dehydrogenase, membrane subunit, binds cytochrome b556
• sodA: superoxide dismutase, Mn
• tpx: lipid hydroperoxide peroxidase

[Venn diagram, FNR: E. coli K-12 MG1655, Dickeya dadantii, P. atrosepticum; values shown: 59, 9, 66, 59]
[Venn diagram, ArcA: E. coli K-12 MG1655, Dickeya dadantii, P. atrosepticum; values shown: 78, 9, 47, 49]
1974 Orthologs

Data integration
• Data storage and dissemination
• Data mining
• Supervised learning
• Biological ontologies

Data integration
[Diagram: Genome Sequencing, Functional Genomics, Genome Alignment, -ome Databases, Evolutionary Analyses, Population Level Comparisons]

Microarray data availability
http://genome-www5.stanford.edu/MicroArray/SMD/
http://www.ncbi.nlm.nih.gov/geo/
http://www.ebi.ac.uk/arrayexpress/
https://asap.ahabs.wisc.edu/annotation/php/logon.php

Pattern Discovery: clustering, data mining, unsupervised learning
From: Eisen MB, Spellman PT, Brown PO and Botstein D. (1998). Cluster Analysis and Display of Genome-Wide Expression Patterns. Proc Natl Acad Sci U S A 95, 14863-8.

K-means Clustering
K-means clustering proceeds by repeated application of a two-step process where:
1) the mean vector for all items in each cluster is computed
2) items are reassigned to the cluster whose center is closest to the item
The parameters controlling k-means clustering are:
1) the number of clusters (K)
2) the maximum number of cycles
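The two-step loop above can be sketched in a few lines of plain Python. This is a minimal illustration, not the implementation used in any particular package: the starting centers are k randomly chosen items (an assumption; the slide does not say how to initialize), distance is squared Euclidean, and the toy expression profiles are made up.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_vector(items):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(items)
    return [sum(column) / n for column in zip(*items)]


def kmeans(items, k, max_cycles=100, seed=0):
    """Return (cluster index per item, list of cluster centers)."""
    rng = random.Random(seed)
    centers = rng.sample(items, k)  # assumption: start from k random items
    assignment = None
    for _ in range(max_cycles):
        # Step 2: reassign each item to the cluster with the closest center.
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(item, centers[c]))
            for item in items
        ]
        if new_assignment == assignment:
            break  # nothing moved, so the clustering has converged
        assignment = new_assignment
        # Step 1: recompute the mean vector of every non-empty cluster.
        for c in range(k):
            members = [it for it, a in zip(items, assignment) if a == c]
            if members:
                centers[c] = mean_vector(members)
    return assignment, centers


# Made-up two-condition expression profiles forming two obvious groups.
profiles = [[0.1, 0.2], [0.0, 0.1], [0.2, 0.0],
            [5.0, 5.1], [5.2, 4.9], [4.9, 5.0]]
assignment, centers = kmeans(profiles, k=2)
print(assignment)
```

The two parameters from the slide appear as `k` and `max_cycles`. Which group receives label 0 depends on the random start, so only the grouping is meaningful, not the label values.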

Clustering From Eisen et al., PNAS 95:14863

Machine Learning
Machine learning is the study of computer algorithms that improve automatically through experience. It is a form of artificial intelligence that can be used to classify objects into known groups.
For example:
• Given a set of patients with a disease and a collection of gene expression profiles, we could train a model on the known cases and use it to predict the disease in samples where the status is unknown.
• Given a set of proteins with shared properties, e.g. virulence factors, can we learn to identify new proteins with similar properties?
Training examples are essential for these methods.
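As a concrete picture of "train on known cases, predict the unknown", here is a minimal sketch using a nearest-centroid classifier. The method is chosen only because it is short; the slide does not name a specific algorithm, and the expression values and labels are invented for illustration.

```python
def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def train(examples):
    """examples: list of (profile, label) pairs. Returns {label: centroid}."""
    by_label = {}
    for profile, label in examples:
        by_label.setdefault(label, []).append(profile)
    return {label: centroid(profiles) for label, profiles in by_label.items()}


def predict(model, profile):
    """Return the label whose centroid is closest to the profile."""
    def distance(label):
        return sum((x - y) ** 2 for x, y in zip(profile, model[label]))
    return min(model, key=distance)


# Hypothetical training set: expression of three genes per patient.
training = [
    ([2.1, 0.3, 1.8], "disease"),
    ([1.9, 0.4, 2.2], "disease"),
    ([0.2, 1.7, 0.1], "healthy"),
    ([0.4, 2.0, 0.3], "healthy"),
]
model = train(training)
prediction = predict(model, [1.8, 0.5, 2.0])
print(prediction)  # prints: disease
```

Without the labeled training pairs there is nothing to compute centroids from, which is why training examples are essential for supervised methods.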

Why you should care about structured text for annotations
• High-throughput experiments require computational analyses
• Computers do best with systematic, highly structured data
• Ontologies are increasingly used in biology
• Open Biomedical Ontologies (OBO): http://obofoundry.org/

Debating how to construct structured text
• “Structured digital abstract makes text mining easy”
  Nature Vol 447, 10 May 2007
  Mark Gerstein, Michael Seringhaus, Stanley Fields
  - biologists should be required to provide abstracts in structured text to make life easier for computational biology
• “Text mining: powering the database revolution”
  Nature Vol 448, 12 July 2007
  Udo Hahn, Joachim Wermter, Rainer Blasczyk, Peter A. Horn
  - terminologies are complex
  - terms only cover a subset of biological phenomena
  - quality and reliability of contributed data is suspect
  - automated text extraction is a possible solution

3 GO ontologies

GO:0006355 regulation of transcription, DNA-dependent
Definition: Any process that modulates the frequency, rate or extent of DNA-dependent transcription.
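Because a GO term is an identifier plus named fields, it is easy for a program to read. The sketch below parses a simplified stanza in the tag-value layout used by OBO flat files. Only the three fields shown on the slide are included (real GO stanzas carry more tags), and `parse_obo_stanza` is a hypothetical helper, not part of any library.

```python
def parse_obo_stanza(text):
    """Parse one [Term] stanza into {tag: value}; single-valued tags only."""
    term = {}
    for line in text.strip().splitlines():
        line = line.strip()
        if not line or line.startswith("["):
            continue  # skip blank lines and the [Term] header
        tag, _, value = line.partition(": ")
        term[tag] = value.strip('"')
    return term


# Simplified stanza built from the term on the slide.
stanza = """
[Term]
id: GO:0006355
name: regulation of transcription, DNA-dependent
def: "Any process that modulates the frequency, rate or extent of DNA-dependent transcription."
"""
term = parse_obo_stanza(stanza)
print(term["id"], "|", term["name"])
# prints: GO:0006355 | regulation of transcription, DNA-dependent
```

The stable identifier (`GO:0006355`) is what other databases store; the name and definition can be looked up from it.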

PAMGO terms for interactions between organisms
---interaction between host and another organism
----pathogenesis
----recognition of host
----adhesion to host
----growth on or near host surface
----growth within host
----entry into host
----avoidance of host defenses
-----suppression of host defenses
-----evasion of host defenses
----induction of host defense response
----translocation of molecules into host
----movement within host
----acquisition of nutrients from host
----modification of host morphology or physiology
-----disruption of host cells
------killing of host cells (and its child terms)
----dissemination or transmission of an organism from a host
-----dissemination or transmission of an organism from a host by a vector
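In the listing above, the number of leading dashes gives each term's depth, so the parent of a term is the nearest preceding term with fewer dashes. A minimal sketch of recovering those parent links from a subset of the list follows. It treats the listing as a simple tree, which is a simplification: GO itself is a graph in which a term may have more than one parent. The helper names are made up for illustration.

```python
def build_parents(lines):
    """Map each term to its parent, using the leading-dash count as depth."""
    parents = {}
    path = []  # (depth, term) pairs from the top term down to the current one
    for line in lines:
        term = line.lstrip("-")
        depth = len(line) - len(term)
        term = term.strip()
        # Drop path entries that are not ancestors of the current term.
        while path and path[-1][0] >= depth:
            path.pop()
        parents[term] = path[-1][1] if path else None
        path.append((depth, term))
    return parents


def ancestors(parents, term):
    """List a term's ancestors, nearest first."""
    chain = []
    while parents[term] is not None:
        term = parents[term]
        chain.append(term)
    return chain


# Subset of the PAMGO listing from the slide.
pamgo = [
    "---interaction between host and another organism",
    "----avoidance of host defenses",
    "-----suppression of host defenses",
    "-----evasion of host defenses",
    "----modification of host morphology or physiology",
    "-----disruption of host cells",
    "------killing of host cells",
]
parents = build_parents(pamgo)
print(ancestors(parents, "killing of host cells"))
```

With the parent links in hand, a gene annotated to "killing of host cells" can also be retrieved by a query for any of its broader ancestor terms.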

GO evidence codes

Biologists need to be told how to report their data
- Minimum information about a microarray experiment (MIAME): toward standards for microarray data
- The minimum information about a proteomics experiment (MIAPE)
- Promoting coherent minimum reporting requirements for biological and biomedical investigations: the MIBBI project
- The minimum information required for reporting a molecular interaction experiment (MIMIx)

The minimum information

What do the computational biologists want?
• Stable, unique, unambiguous identifiers
  - a database and an accession number (genes), taxon IDs (organisms)
• Clear descriptions of all methods, including computational parameters
• Standardized measurements
• Data deposition in publicly accessible databases

Why do they care what I do with my data?
• They want to retrieve, combine, and compare information obtained from different groups using various methods
• They don’t want to have to guess or look through methods sections to obtain important information about the data
• They want to compute, and computers like structured data
