Bioinformatics Applying the Concept of Information in Biology

Bioinformatics Applying the Concept of Information in Biology The theory of evolution is the conceptual framework of biology and medicine and bioinformatics is the tool used to analyze and quantify evolutionary relationships at every level of investigation – molecular, physiological, or ecological.

Diagrammatic view of metabolic pathways showing major functional interaction among synthesis and degradation of nutrients from different food groups (from KEGG; http://www.genome.ad.jp:80/kegg/metabolism.html)

Macromolecular crowding in bacterial cytoplasm Hopper & Mayer, 1999, Prokaryotes. Am.Sci. 87:518 Ellis, E.J., Macromolecular crowding, 2001, TIBS 26:597

What makes a scientific discipline? A look at the history of biochemistry. To know where we come from helps us understand where we are going. Novel ways of curing diseases and fighting off infections will include individualized prescription drug regiments, gene therapy, and the development of new generations of antibiotics. These changes are no less sweeping and broad than those brought to biology by chemists and physicists in the 1920s, 30s, and 40s attracted to a most obvious problem in biology at that time, the staggering lack of an atomistic understanding of genetics.

1926, 1930 First accounts that proteins convey enzymatic activity (urease, pepsin) in cellular metabolism; an important step to demonstrate that proteins catalyze chemical reactions and are not only structural components of cells 1934 First successful X-ray study of the globular protein pepsin by Bernal and Crowfoot; it does not show high resolution details, but demonstrates water covered protein surface

1937 Citric Acid Cycle described by Hans Krebs; this is the central energy yielding pathway in all organisms; complete biochemical pathway reactions could be elucidated in the absence of any protein structure information (kinetic data represents macroscopic behavior of enzymes)

1941 'One gene, one enzyme' hypothesis by Beadle and Tatum 1944 DNA is carrier of genetic information in bacteria (Oswald Avery) 1945 First complete amino acid content of a protein is published (not its sequence, however)

1951 First complete amino acid sequence published of the protein hormone insulin by Fred Sanger Proposed model for alpha helix and beta sheet and importance of so called hydrogen bonds in protein structures (Pauling and Corey) 1953 DNA structure at atomic resolution by Crick, Watson, and Wilkins; they propose a model for DNA replication based on the structural information; the concept of structure-function relationship has been successfully used to solve a major problem in biology

1962 High resolution structure of myoglobin at 2 Angstrom confirms for the first time the existence of alpha helix structures in proteins (Perutz and Kendrew) The structure of the enzyme Lysozyme with a bound inhibitor molecule solved at 2 Angstrom resolution giving the first structural insight into enzyme-substrate interaction and Koshland's induced fit theory 1963 Genetic code solved; links DNA sequence to amino acid sequence in proteins (Holley, Khorana, Nirenberg)

Data base structures • Sequences • Structures • Pathways • Analysis tools • Prediction tools • Functional categories & interactivity • PubMed

Integrated database retrieval system, GenomeNet, Japan http://www.genome.ad.jp:80/dbget/dbget.links.html

KEGG: Kyoto Encyclopedia of Genes and Genomes http://www.genome.ad.jp:80/kegg/kegg2.html

Analysis, Prediction, Data Mining • Similarity searches • Structure prediction • Gene prediction • Pathway reconstruction • Visualization and Modeling • Pattern recognition • Clustering • Annotation

Analysis of sequence information

Prediction of relationship among sequences Cluster of Orthologous Groups at NCBI Principal component analysis of variability found in whole genome databases

Clusters of orthologous groups (sequences of individual proteins or protein families represented in at least 3 species (currently microorganisms only) thus corresponding to an ancient conserved domain) Translation Transcription Signaling

Phylogenetic analyses indicate that R. prowazekii is more closely related to mitochondria than is any other microbe studied so far. E.coli Rickettsia prowazekii Obligate intracellular parasite, the causative agent of epidemic typhus. The functional profiles of these genes show similarities to those of mitochondrial genes: no genes required for anaerobic glycolysis are found in either R. prowazekii or mitochondrial genomes, but a complete set of genes encoding components of the tricarboxylic acid cycle and the respiratory-chain complex is found in R. prowazekii. In effect, ATP production in Rickettsia is the same as that in mitochondria. Many genes involved in the biosynthesis and regulation of biosynthesis of amino acids and nucleosides in free-living bacteria are absent from R. prowazekii and mitochondria. Such genes seem to have been replaced by homologues in the nuclear (host) genome. (Nature 1998 Nov 12;396(6707):133-40)

KEGG – pathway maps

Glycolysis pathway map from KEGG Escherichia coli K-12 MG1655 Rickettsia prowazekii

Rickettsia prowazekii

Escherichia coli

Homo sapiens

What kind of information can be obtained using the COG database? 1. Annotation of proteins. Known functions (and two- or three-dimensional structures) of one COG member can often be directly attributed to the other members of the COG. Caution must be used here, however, since some COGs contain paralogs whose function may not precisely correspond to that of the known protein. 2. Phylogenetic patterns. These show the presence or absence of proteins from a given organism in a specific COG. Used systematically, such patterns can be used to identify whether a particular metabolic pathway exists in an organism. 3. Multiple alignments. Each COG page includes a link to a multiple alignment of COG members, which can be used to identify conserved sequence residues and analyze evolutionary relationships between member proteins.

Hierarchical cluster analysis of DNA microarrays Eisen et al. (1998) PNAS 95:14863.; http://rana.lbl.gov/EisenSoftware.htm

Hierarchical clustering and factor analysis of DNA microarrays Factor analysis (and principal component analysis) demonstrates three independent factors (Eigenvectors) accounting for 99.5% of the variability of the array data (6 arrays; three conditions; each condition repeated once). Factor one (F1) accounts for the variability in hybridization strength. Factor two accounts for gene specific differences of hybridization strength that are more distinguish Va2 from both Vb5 and control (see diagram F2-F1). Factor three shows that there are general condition specific differences that distinguish control from Vb5 and from Va2 but are highly reproducible when repeated by labeling cDNA from same RNA samples. The dendrogram obtained from hierarchical clustering of the six arrays shows the same relationship as determined by the second variable (F2) from factor analysis.

One can ask any biologically interesting question concerning relationship between database entries, e.g.: How many genes in the human genome? Minimal gene set theory! Evolutionary psychology: Explaining behavioral traits.

Minimal gene set theory! The definition of a minimal gene set would be that any knock-out that does not kill the organism, proves that there are more genes than the organism needs for survival. Therefore, a minimal gene set would be one where each single gene knock-out would result in a non-viable clone. The smallest gene set (besides large viral genomes with >200 genes) found is 467 in Mycoplasma genitalium. The latter can hardly be considered a free living organism. Autonomous (neither symbiotic nor parasitic) species to not tend to have minimal gene sets Chemotrophs, for which there are only archaea known, have genomes with usually more than 2,000 open reading frames, up to four times the minimal gene set found in eubacteria. Phototrophs produce, besides bacterial species, some of the largest life forms (trees) containing some of the largest genomes.

Genome “size” (number of proteins) of some microorganisms Name Proteins in COGs Methanococcus jannaschii 1786 1330 Methanobac. thermoautotrophicum 1873 1388 Saccharomyces cerevisiae 5955 2290 Escherichia coli K12 4275 3414 Escherichia coli O157 5315 3662 Helicobacter pylori 1576 1096 Rickettsia prowazekii 835 697 Mycoplasma pneumoniae 689 425 Mycoplasma genitalium 484 381

How many genes in the human genome? Bets: 165 Mean: 61,710 Lowest: 27,462 Highest: 153,478 Assessment of the gene number will occur on the 2003 Cold Spring Harbor Laboratory Genome meeting Source: Sanger Institute http://www.ensembl.org/Genesweep/

Bioinformatics Applying the Concept of Information in Biology