What are Math and Computer Science doing in Biology? Dan Gusfield UC Davis March 29, 2012 Denison University
One limited perspective
Short Answer: • Bioinformatics • Computational Biology • Statistical Biology • Mathematical Biology • ….. My focus
[Diagram: Biology, Computer Science, and Math & Statistics overlap in Computational biology, Bioinformatics] Computational biology: “An interdisciplinary field that applies the techniques of computer science, applied mathematics and statistics to address biological problems” (Wikipedia)
How can non-biologists, non-chemists understand or contribute to biology? Where does our license come from?
My Fear 30 years ago was that I would first need to master material like:
Bond representation of triplex DNA. This view is down the long axis. The “third” strand is colored.
MYOGLOBIN - An oxygen carrier in muscle. Here is another way of visualising the tertiary structure. Tertiary Structure: Spot the tertiary folding. Quaternary Structure: Spot the haem group.
LYSOZYME Including the Side chains. Can you see any active site now?
By some wonderful fact or fluke of nature, a huge simplification is possible and very productive.
Molecular information is (partially) Digital. And, nature takes notes (leaves historical footnotes).
PRIMARY STRUCTURE: Primary structure is described by the sequence of amino acids in the chain. This diagram shows the primary structure of PIG INSULIN, a protein hormone, as discovered by Frederick Sanger. He was given a Nobel prize in 1958.
Hemoglobin – Primary Structure: NH2-Val-His-Leu-Thr-Pro-Glu-Glu-Lys-Ser-Ala-Val-Thr-Ala-Leu-Trp-Gly-Lys-Val-Asn-Val-Asp-Glu-Val-Gly-Gly-Glu-… (beta subunit amino acid sequence)
It has been amazingly productive to treat protein and DNA molecules just as text: collecting, comparing, creating molecular sequences.
No hard-core chemistry or biology - just text comparison and analysis. Fluke of nature? An imposition of the human mind? Lucky break for us?
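To make "just text comparison" concrete, here is a minimal sketch that scores two equal-length sequences by the percentage of positions where they agree. The function name `percent_identity` and the toy sequences are illustrative assumptions, not data or methods from the talk; real comparisons also have to handle insertions and deletions.

```python
def percent_identity(a: str, b: str) -> float:
    """Percentage of positions at which two equal-length sequences agree."""
    if len(a) != len(b):
        raise ValueError("this simple sketch compares equal-length sequences only")
    if not a:
        return 0.0
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)


# Hypothetical toy sequences (not real gene data); they differ at two positions.
seq1 = "GATTACAGATTACA"
seq2 = "GATTTCAGATAACA"
print(f"{percent_identity(seq1, seq2):.1f}% identical")
```

Nothing in the function knows any chemistry: the molecules are handled purely as strings over a small alphabet.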
The first major success story:
Simian Sarcoma Virus onc Gene, v-sis, Is Derived from the Gene (or Genes) Encoding a Platelet-Derived Growth Factor. R.F. Doolittle et al., Science 1983
“The transforming protein of a primate sarcoma virus and a platelet-derived growth factor are derived from the same or closely related cellular genes. This conclusion is based on the demonstration of extensive sequence similarity.” From the abstract
Sequence similarity suggested that genes involved in cancer were functionally related to genes involved in blood platelet growth, two biological phenomena that had previously seemed unrelated. This was a very surprising result, and a novel kind of reasoning. But,
Biology via sequence analysis is now completely accepted and mainstream. Some biologists have even replaced their wet labs with computer labs, doing biology only by sequence analysis.
“The ultimate rationale behind all purposeful structures and behavior of living things is embodied in the sequence of residues of nascent polypeptide chains …” J. Monod. “The Rosetta stone of modern biology appears to be sequence comparative analysis.” T. Smith
Success stories from sequence analysis are now routine. Why? Mostly shared history and duplication with modification, but also shared physical, chemical constraints.
“We didn't know it at the time, but we found out everything in life is so similar, that the same genes that work in flies are the ones that work in humans.” Eric Wieschaus, co-winner of the 1995 Nobel prize in medicine
Take-home message: High sequence similarity implies significant functional and/or structural similarity. [Diagram: an Ancestor gene giving rise to genes in Species A and Species B, labeled orthologs and paralogs]
Can we reverse the statement? Two sequences with high functional similarity should have similar sequences.
The success of sequence comparison and analysis, and the development of efficient DNA sequencing, has led to huge projects to capture, accumulate, store, curate, and annotate bio-molecular sequences: GenBank, BLAST, the Human Genome Project, specialized databases.
Examples of large-scale sequencing projects
• 1,000 Genomes Project. http://www.1000genomes.org/
• BGI, 10,000 whole human genomes.
• BGI, 1,000 individuals with IQ>145 versus 1,000 random individuals.
• BGI, Autism Genetic Resource Exchange, 10,000 individuals.
• BGI, CHOP, many childhood diseases.
• Genome Institute, Washington U. St. Louis, 600 childhood cancer patients; $65 million over three years. 150 tumor & normal cancer genome pairs.
• Epitwin: TwinsUK & BGI, $30 million for epigenetic differences in 5,000 twins.
• Netherlands Genome Project: BGI, 750 genomes (250 trios) in Dutch biobanks.
• Epi4K: Duke et al., $25M to sequence 4,000 genomes for epilepsy research.
• U. Michigan Cancer Center: Clinical next-gen sequencing of cancer patients.
(R. Michelmore)
In near future: DNA sequence = an inexpensive commodity generated on a variety of platforms
• $1,000 ($100?) human genome coming => $1,000 genome for many animals and plants; $100 genome for fungi; $10 genome for bacteria en masse
• Metagenomics: sequencing of communities; biomes (humans = 100x more bacteria); novel & unculturable organisms; characterization of diversity & unique genes
• Not just genomic DNA sequence: DNA modifications; epigenomics & copy number variation (CNV); expression analysis (RNAseq not arrays)
• Enormous amounts of sequence data; need for major data handling capabilities; vital role for bioinformatics just to manage the data
(R. Michelmore)
More recently: Metagenomics, metabolomics, proteomics, microbiomics, epigenomics, transcriptomics, methylomics…. High-throughput biology generating massive amounts of data; sometimes too large even to store.
NYT November 30, 2011: “The Beijing Genome Center has enough sequencing capacity to sequence 2,000 human genomes per day.” “World capacity is now 13 quadrillion DNA bases a year, an amount that would fill a stack of DVDs two miles high.”
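As a rough sanity check on the "two miles high" figure, the arithmetic can be sketched as follows. Only the 13 quadrillion bases per year comes from the quote; the storage cost of one byte per base, the 4.7 GB single-layer DVD capacity, and the 1.2 mm disc thickness are assumptions made here for illustration.

```python
BASES_PER_YEAR = 13e15       # "13 quadrillion DNA bases a year" (from the quote)
DVD_CAPACITY_BYTES = 4.7e9   # assumption: single-layer DVD
DVD_THICKNESS_MM = 1.2       # assumption: standard disc thickness
MM_PER_MILE = 1_609_344      # 1 mile = 1609.344 m


def dvd_stack_miles(bases: float, bytes_per_base: float = 1.0) -> float:
    """Height in miles of the DVD stack needed to store `bases` bases."""
    dvds = bases * bytes_per_base / DVD_CAPACITY_BYTES
    return dvds * DVD_THICKNESS_MM / MM_PER_MILE


# One text character (one byte) per base, as in a plain sequence file.
print(f"{dvd_stack_miles(BASES_PER_YEAR):.2f} miles")
# A denser encoding of 2 bits per base (four possible letters) shrinks the stack.
print(f"{dvd_stack_miles(BASES_PER_YEAR, bytes_per_base=0.25):.2f} miles")
```

Under these assumptions the one-byte-per-base case comes out at roughly two miles, consistent with the quoted figure, while the 2-bit encoding needs about a quarter of that.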
OK, so sequences and sequence analysis are important, but where’s the promised computer science and math?
Simple sequence comparison, comparing new sequences against sequences in databases, has been extremely productive. But how do we extract the most biological value from sequences? The Larger Challenge and Opportunity: How to utilize the deluge of sequence data?
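One way to picture "comparing new sequences against sequences in databases" is a shared-substring index. The sketch below is a simplified illustration of that idea, not the algorithm of BLAST or any other real tool: it indexes every length-k substring (k-mer) of each database sequence, then ranks database entries by how many distinct k-mers they share with a query. All names and sequences are hypothetical.

```python
from collections import defaultdict


def build_kmer_index(database: dict[str, str], k: int) -> dict[str, set[str]]:
    """Map each k-mer to the names of the database sequences containing it."""
    index: dict[str, set[str]] = defaultdict(set)
    for name, seq in database.items():
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(name)
    return index


def rank_hits(query: str, index: dict[str, set[str]], k: int) -> list[tuple[str, int]]:
    """Rank database sequences by the number of distinct query k-mers they contain."""
    counts: dict[str, int] = defaultdict(int)
    query_kmers = {query[i:i + k] for i in range(len(query) - k + 1)}
    for kmer in query_kmers:
        for name in index.get(kmer, ()):
            counts[name] += 1
    # Highest count first; ties broken alphabetically by name.
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical toy database, not real sequence records.
database = {"seqA": "ACGTACGTGGA", "seqB": "TTTTCCCCGGGG", "seqC": "ACGTTTGGA"}
index = build_kmer_index(database, k=4)
hits = rank_hits("CGTACGTG", index, k=4)
print(hits)
```

The index is built once and reused for every query, which is the kind of trade-off that matters when the database is far larger than any single query.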
Making sense of the code
Lettuce genes are located in a “sea” of repeated sequences. Damien Peltier
How do we analyze so much data? How do we know that patterns we see are meaningful? How do we know that similarities we see are based in biology and not just random happenstance? Humans are good at seeing patterns, even in random events and data.
What we need: • Clear, biologically meaningful definitions of similarity, patterns. Biological models of mutation and evolution - how sequences evolve. • Metrics - how similar, how good the fit. • Efficient methods to compute similarities, and find patterns, and compute the metrics. • Efficient methods to assess the “significance” of the finds.
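As one example of a metric together with an efficient method to compute it, the sketch below computes edit distance: the minimum number of single-character insertions, deletions, and substitutions needed to turn one sequence into another. Edit distance is chosen here only as a familiar illustration; the talk does not single out a particular metric, and biologically motivated scoring schemes are usually more elaborate.

```python
def edit_distance(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, and substitutions turning a into b."""
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        curr = [i]  # turning a[:i] into the empty string takes i deletions
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            curr.append(min(
                prev[j] + 1,                      # delete char_a
                curr[j - 1] + 1,                  # insert char_b
                prev[j - 1] + substitution_cost,  # match or substitute
            ))
        prev = curr
    return prev[-1]


print(edit_distance("ACGT", "AGT"))
```

Dynamic programming fills a table of len(a) × len(b) entries instead of enumerating every possible sequence of edits, and keeping only two rows of the table bounds the memory by the length of the shorter dimension.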
For those tasks, we need • Biology - to define and model meaningful types of similarities and patterns to look for. • Mathematics - to propose and understand the models and metrics. • Computer Science - for efficient sequence analysis and search algorithms. • Statistics - to measure the “significance” (deviation from random happenstance) of the finds.
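The statistical question, whether a similarity deviates from random happenstance, can be illustrated with a shuffle test. The sketch below is an assumption-laden illustration rather than a method from the talk: it scores two sequences with a basic local alignment score (arbitrary match, mismatch, and gap values), then shuffles one sequence many times to see how often chance alone produces a score at least as high.

```python
import random


def local_alignment_score(a: str, b: str, match: int = 2,
                          mismatch: int = -1, gap: int = -2) -> int:
    """Best local alignment score (Smith-Waterman recurrence, linear gap penalty)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for char_a in a:
        curr = [0]
        for j, char_b in enumerate(b, start=1):
            diagonal = prev[j - 1] + (match if char_a == char_b else mismatch)
            curr.append(max(0, diagonal, prev[j] + gap, curr[j - 1] + gap))
            best = max(best, curr[j])
        prev = curr
    return best


def shuffle_p_value(a: str, b: str, n_shuffles: int = 200, seed: int = 0) -> float:
    """Estimate how often a shuffled copy of b scores at least as well as b itself."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    observed = local_alignment_score(a, b)
    letters = list(b)
    at_least_as_good = 0
    for _ in range(n_shuffles):
        rng.shuffle(letters)  # keeps composition, destroys order
        if local_alignment_score(a, "".join(letters)) >= observed:
            at_least_as_good += 1
    # Add-one correction so the estimate is never exactly zero.
    return (at_least_as_good + 1) / (n_shuffles + 1)


# Hypothetical toy sequence compared with itself: the match should look non-random.
seq = "ACGTTGCATGCCGATAGCTAGGCTTAACGT"
print(shuffle_p_value(seq, seq))
```

Shuffling preserves the letter composition while destroying the order, so a small estimate suggests the similarity depends on the order of the letters and not merely on both sequences having similar base frequencies.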
[Diagram: Biology, Computer Science, and Math & Statistics overlap in Computational biology, Bioinformatics] Computational biology: “An interdisciplinary field that applies the techniques of computer science, applied mathematics and statistics to address biological problems” (Wikipedia)
“It costs more to analyze a genome than to sequence a genome.” D. Haussler