Loading in 2 Seconds...
Loading in 2 Seconds...
Bioinformatics Applying the Concept of Information in Biology
Applying the Concept of Information in Biology
The theory of evolution is the conceptual framework of biology and medicine and bioinformatics is the tool used to analyze and quantify evolutionary relationships at every level of investigation – molecular, physiological, or ecological.
Diagrammatic view of metabolic pathways showing major functional interaction among synthesis and degradation of nutrients from different food groups
(from KEGG; http://www.genome.ad.jp:80/kegg/metabolism.html)
Hopper & Mayer, 1999, Prokaryotes. Am.Sci. 87:518
Ellis, E.J., Macromolecular crowding, 2001, TIBS 26:597
To know where we come from helps us understand where we are going. Novel ways of curing diseases and fighting off infections will include individualized prescription drug regiments, gene therapy, and the development of new generations of antibiotics. These changes are no less sweeping and broad than those brought to biology by chemists and physicists in the 1920s, 30s, and 40s attracted to a most obvious problem in biology at that time, the staggering lack of an atomistic understanding of genetics.
First accounts that proteins convey enzymatic activity (urease, pepsin) in cellular metabolism; an important step to demonstrate that proteins catalyze chemical reactions and are not only structural components of cells
First successful X-ray study of the globular protein pepsin by Bernal and Crowfoot; it does not show high resolution details, but demonstrates water covered protein surface
Citric Acid Cycle described by Hans Krebs; this is the central energy yielding pathway in all organisms; complete biochemical pathway reactions could be elucidated in the absence of any protein structure information (kinetic data represents macroscopic behavior of enzymes)
'One gene, one enzyme' hypothesis by Beadle and Tatum
DNA is carrier of genetic information in bacteria (Oswald Avery)
First complete amino acid content of a protein is published (not its sequence, however)
First complete amino acid sequence published of the protein hormone insulin by Fred Sanger
Proposed model for alpha helix and beta sheet and importance of so called hydrogen bonds in protein structures (Pauling and Corey)
DNA structure at atomic resolution by Crick, Watson, and Wilkins; they propose a model for DNA replication based on the structural information; the concept of structure-function relationship has been successfully used to solve a major problem in biology
High resolution structure of myoglobin at 2 Angstrom confirms for the first time the existence of alpha helix structures in proteins (Perutz and Kendrew)
The structure of the enzyme Lysozyme with a bound inhibitor molecule solved at 2 Angstrom resolution giving the first structural insight into enzyme-substrate interaction and Koshland's induced fit theory
Genetic code solved; links DNA sequence to amino acid sequence in proteins (Holley, Khorana, Nirenberg)
Cluster of Orthologous Groups at NCBI
Principal component analysis of variability found in whole genome databases
(sequences of individual proteins or protein families represented in at least 3 species (currently microorganisms only) thus corresponding to an ancient conserved domain)
Phylogenetic analyses indicate that R. prowazekii is more closely related to mitochondria than is any other microbe studied so far.
Obligate intracellular parasite, the causative agent of epidemic typhus. The functional profiles of these genes show similarities to those of mitochondrial genes: no genes required for anaerobic glycolysis are found in either R. prowazekii or mitochondrial genomes, but a complete set of genes encoding components of the tricarboxylic acid cycle and the respiratory-chain complex is found in R. prowazekii. In effect, ATP production in Rickettsia is the same as that in mitochondria. Many genes involved in the biosynthesis and regulation of biosynthesis of amino acids and nucleosides in free-living bacteria are absent from R. prowazekii and mitochondria. Such genes seem to have been replaced by homologues in the nuclear (host) genome. (Nature 1998 Nov 12;396(6707):133-40)
1. Annotation of proteins. Known functions (and two- or three-dimensional structures) of one COG member can often be directly attributed to the other members of the COG. Caution must be used here, however, since some COGs contain paralogs whose function may not precisely correspond to that of the known protein.
2. Phylogenetic patterns. These show the presence or absence of proteins from a given organism in a specific COG. Used systematically, such patterns can be used to identify whether a particular metabolic pathway exists in an organism.
3. Multiple alignments. Each COG page includes a link to a multiple alignment of COG members, which can be used to identify conserved sequence residues and analyze evolutionary relationships between member proteins.
Eisen et al. (1998) PNAS 95:14863.; http://rana.lbl.gov/EisenSoftware.htm
Factor analysis (and principal component analysis) demonstrates three independent factors (Eigenvectors) accounting for 99.5% of the variability of the array data (6 arrays; three conditions; each condition repeated once). Factor one (F1) accounts for the variability in hybridization strength. Factor two accounts for gene specific differences of hybridization strength that are more distinguish Va2 from both Vb5 and control (see diagram F2-F1). Factor three shows that there are general condition specific differences that distinguish control from Vb5 and from Va2 but are highly reproducible when repeated by labeling cDNA from same RNA samples. The dendrogram obtained from hierarchical clustering of the six arrays shows the same relationship as determined by the second variable (F2) from factor analysis.
One can ask any biologically interesting question concerning relationship between database entries, e.g.:
How many genes in the human genome?
Minimal gene set theory!
Evolutionary psychology: Explaining behavioral traits.
The definition of a minimal gene set would be that any knock-out that does not kill the organism, proves that there are more genes than the organism needs for survival. Therefore, a minimal gene set would be one where each single gene knock-out would result in a non-viable clone.
The smallest gene set (besides large viral genomes with >200 genes) found is 467 in Mycoplasma genitalium. The latter can hardly be considered a free living organism.
Autonomous (neither symbiotic nor parasitic) species to not tend to have minimal gene sets Chemotrophs, for which there are only archaea known, have genomes with usually more than 2,000 open reading frames, up to four times the minimal gene set found in eubacteria.
Phototrophs produce, besides bacterial species, some of the largest life forms (trees) containing some of the largest genomes.
Genome “size” (number of proteins) of some microorganisms
Name Proteins in COGs
Methanococcus jannaschii 1786 1330
Methanobac. thermoautotrophicum 1873 1388
Saccharomyces cerevisiae 5955 2290
Escherichia coli K12 4275 3414
Escherichia coli O157 5315 3662
Helicobacter pylori 1576 1096
Rickettsia prowazekii 835 697
Mycoplasma pneumoniae 689 425
Mycoplasma genitalium 484 381
Assessment of the gene number will occur on the 2003 Cold Spring Harbor Laboratory Genome meeting
Source: Sanger Institute