Spring 2007, CS584: Computational and Life Science

Spring 2007, CS584: Computational and Life Science • Kim Gernert, BimCore, School of Medicine • James Lu, Mathematics and Computer Science

Overview • Computer Science broadly applicable • Computational X: The construction of mathematical models and numerical solution techniques to analyze and solve problems in X. • X Informatics: The management and processing of data, information, and knowledge in X. • Essential Bioinformatics, Jin Xiong (Cambridge University Press, 2006) • Bioinformatics: limited to sequence, structural, and functional analysis of genes and genomes • Computational Biology: all biological areas that involve computation (e.g., modeling of ecosystems, population dynamics, etc.)

DILS 2005 keynotes Shankar Subramaniam, Professor of Bioengineering and Chemistry at UCSD: • the standard paradigm in biology: ‘hypothesis to experimentation (low throughput data) to models’ is being replaced by ‘data to hypothesis to models’ and ‘experimentation to more data and models’. • need for robust data repositories that allow interoperable navigation, query and analysis across diverse data, a plug-and-play environment that will facilitate seamless interplay of tools and data, and versatile biologist-friendly user interfaces.

Bioinformatics Subfields Software development Database construction and curation

Sample CS Research Problems • In CS terms: a sequence is a string over an alphabet • Nucleotide sequence: Σ = {G,A,T,C} • Protein sequence: Σ = {G,A,L,V,I,P,S,T,C,N,M,Q,F,Y,W,K,R,H,D,E} • Data problems: rapid growth of biological (e.g., sequence) data. These biological data have the following (commonly occurring) characteristics: 1. Complex; 2. Incomplete; 3. Error-prone; 4. Expensive to obtain and maintain; 5. In high demand by a community with important but difficult questions. • Algorithm problems: sequence comparison lies at the heart of bioinformatics analysis; it provides the basis for functional and structural analysis of newly determined sequences
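The "string over an alphabet" view can be made concrete with a short sketch. The names NUCLEOTIDES, AMINO_ACIDS, and is_over are illustrative and not from the slides; the protein alphabet is assumed to be the 20 standard one-letter amino-acid codes.

```python
# Illustrative names (not from the slides): alphabets as sets of one-letter codes.
NUCLEOTIDES = frozenset("GATC")
# Assumption: the 20 standard amino-acid one-letter codes.
AMINO_ACIDS = frozenset("GALVIPSTCNMQFYWKRHDE")


def is_over(seq: str, alphabet: frozenset) -> bool:
    """Return True if every symbol of seq belongs to the alphabet."""
    return all(symbol in alphabet for symbol in seq)


if __name__ == "__main__":
    print(is_over("GATTACA", NUCLEOTIDES))  # a DNA string
    print(is_over("GATUACA", NUCLEOTIDES))  # contains U, which is not in the DNA alphabet
```

A check like this is also a cheap first guard against the data entry errors mentioned under data problems.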

Biological Databases • Use flat files, relational and object databases • Primary databases • Contain original biological data -- raw sequence or structural data submitted by scientists • Three major public sequence databases: GenBank, the European Molecular Biology Laboratory database (EMBL), the DNA Data Bank of Japan (DDBJ); data are freely shared; sequence submission is a precondition to journal publication • Secondary databases • Contain computationally processed or manually curated information from primary databases • SWISS-PROT: detailed sequence annotation that includes structure, function, protein family assignment

NCBI Derivative Sequence Data (Maureen J. Donlin, St. Louis University) [Figure: diagram with the labels Labs, GenBank, Curators, RefSeq, Algorithms, UniGene, and Genome Assembly; the sequence fragments drawn in the figure are omitted.]

Database Curation Peter Buneman, Professor of CS at UPenn (DILS 2005 keynote) • biological data is often created and maintained at a high cost involving extensive human curation • new models of data and query languages are needed to support curation and provenance (knowing where your data comes from) Opportunity: Scientific Database Curator/Biologist Part time or full time -- EBI, Hinxton, UK (February 15, 2006) DESCRIPTION: The high-quality curation of PDB entries is essential in establishing the PDB and MSD databases as world-leading sources of protein information. To maintain and extend this position, the MSD group is looking for expert biologists for a demanding role in database curation. The work involves annotating preliminary PDB submissions and extracting biological information relevant to a given entry. In addition, curators draw on their own area of expertise in contributing to the development of methods and procedures. QUALIFICATIONS AND EXPERIENCE: The ideal candidate should be computer-literate and possess a university degree in life sciences as well as having a broad knowledge in molecular biology, especially biochemistry. In-depth knowledge of biochemistry and/or protein structure analysis would be an advantage. Knowledge of Unix/Linux, Emacs, good communication skills and excellent attention to detail are required.

Database Issues Overview • Data Integration -- • linking across different types of data and across the same data in different formats (see Genomatix) • removing redundancy: RefSeq is curated and reviewed by NCBI staff • Data Quality and Integrity -- data entry and annotation errors • Data Discovery -- which database contains relevant information; interesting side effect: metadatabases (e.g., MetaDB: A Metadatabase for the Biological Sciences is a sorted, searchable collection of over 1200 biological databases). • Text and literature data search • Data extraction and mining

Database Integration Research (extracted from DILS 2005) • User Interfaces and Analysis Tools: • Tools to facilitate biologists in choosing data source, querying based on preferences and semantic concepts (using metadata), and displaying results • Tools that mediate among different annotation data sources • Knowledge discovery that archives access patterns of scientists in PubMed • Automatic annotation of metabolomic data with manual verification • Wrapper generation for legacy (flat file) data and tools for learning layout for wrapper generation • Ontologies and taxonomy to facilitate data integration

Algorithmic Issues • Database querying: submission of a query sequence and finding similar sequences in the database • Technique: sequence alignment to determine similarity • Global alignment: the two sequences to be aligned are assumed to be similar over their entire length; alignment is performed from beginning to end of both sequences to find the best possible alignment across the entire length • Local alignment: finds only the local regions with the highest level of similarity and aligns those regions without regard for the alignment of the rest of the sequences • Similarity measure: [(Ls x 2) / (La + Lb)] x 100, where Ls is the number of aligned residues and La, Lb are the total lengths of the two input sequences
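As a worked instance of the similarity measure, a minimal sketch follows. The function name percent_similarity and its parameter names are illustrative, and the input numbers are made up for the example.

```python
def percent_similarity(aligned: int, len_a: int, len_b: int) -> float:
    """Compute [(Ls x 2) / (La + Lb)] x 100 for Ls aligned residues."""
    if len_a + len_b == 0:
        raise ValueError("at least one sequence must be non-empty")
    return 100.0 * 2 * aligned / (len_a + len_b)


if __name__ == "__main__":
    # Hypothetical case: 8 aligned residues, two sequences of length 10 each.
    print(percent_similarity(8, 10, 10))  # 80.0
```

The factor of 2 makes the measure symmetric: two identical sequences that align over their full length score 100.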

Alignment Algorithms • Dot Matrix Method: a graphical way of comparing sequences S1 and S2 in a |S1| x |S2| matrix M[I,J] = (S1[I] == S2[J]) • Dynamic Programming • Word Method
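The dot matrix definition M[I,J] = (S1[I] == S2[J]) translates almost directly into code. The sketch below uses 0-based indices, and the helper names dot_matrix and render are illustrative. Practical dot plot tools usually also filter noise with a sliding window, which is not shown here.

```python
def dot_matrix(s1: str, s2: str) -> list[list[bool]]:
    """Return the |S1| x |S2| matrix with M[i][j] = (s1[i] == s2[j])."""
    return [[a == b for b in s2] for a in s1]


def render(s1: str, s2: str) -> str:
    """Draw the matrix as text: '*' marks a match, '.' a mismatch."""
    rows = ["  " + " ".join(s2)]
    for symbol, row in zip(s1, dot_matrix(s1, s2)):
        rows.append(symbol + " " + " ".join("*" if hit else "." for hit in row))
    return "\n".join(rows)


if __name__ == "__main__":
    print(render("GATTACA", "GATCACA"))
```

Runs of '*' along a diagonal correspond to regions where the two sequences match without gaps.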

Dot Matrix (Steven M. Thompson, Florida State University)

Alignment Algorithms • Dot Matrix Method • Dynamic Programming: recursively score each cell of an (|S1|+1) x (|S2|+1) matrix (see example) • M[0,J] = M[I,0] = 0, 0≤I≤|S1|, 0≤J≤|S2| • M[I,J] = max{M[I-1,J-1]+(S1[I] == S2[J]), M[I-1,J], M[I,J-1]}, 1≤I≤|S1|, 1≤J≤|S2| • Trace back: recover the best match by walking from the lower right-hand corner, which holds the maximum score, back towards the origin • Word Method
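The recurrence and trace back on this slide can be sketched as follows. The function names are illustrative. The scoring is exactly the slide's: 1 for a match, 0 for a mismatch or gap, so the final score equals the number of matched positions. Production aligners use substitution matrices and gap penalties instead. The slide's 1-based S1[I] corresponds to s1[i - 1] in the code.

```python
def score_matrix(s1: str, s2: str) -> list[list[int]]:
    """Fill the (|S1|+1) x (|S2|+1) matrix; row 0 and column 0 stay 0."""
    n, m = len(s1), len(s2)
    M = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            match = 1 if s1[i - 1] == s2[j - 1] else 0
            M[i][j] = max(M[i - 1][j - 1] + match, M[i - 1][j], M[i][j - 1])
    return M


def trace_back(s1: str, s2: str, M: list[list[int]]) -> tuple[str, str]:
    """Walk from the lower right corner to the origin, emitting one column per step."""
    i, j = len(s1), len(s2)
    top, bottom = [], []
    while i > 0 and j > 0:
        match = 1 if s1[i - 1] == s2[j - 1] else 0
        if M[i][j] == M[i - 1][j - 1] + match:   # diagonal: align two residues
            top.append(s1[i - 1]); bottom.append(s2[j - 1]); i -= 1; j -= 1
        elif M[i][j] == M[i - 1][j]:             # up: gap in s2
            top.append(s1[i - 1]); bottom.append("-"); i -= 1
        else:                                    # left: gap in s1
            top.append("-"); bottom.append(s2[j - 1]); j -= 1
    while i > 0:                                 # leftover prefix of s1
        top.append(s1[i - 1]); bottom.append("-"); i -= 1
    while j > 0:                                 # leftover prefix of s2
        top.append("-"); bottom.append(s2[j - 1]); j -= 1
    return "".join(reversed(top)), "".join(reversed(bottom))


if __name__ == "__main__":
    a, b = "ACGT", "AGT"
    M = score_matrix(a, b)
    print(M[len(a)][len(b)])          # maximum score, in the lower right corner
    print(*trace_back(a, b, M), sep="\n")
```

Filling the matrix takes time and space proportional to |S1| x |S2|, which is the cost the word method trades away for speed.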

Alignment Algorithms • Dot Matrix Method • Dynamic Programming • Word Method: heuristically based -- create a list of words from the query, identify word matches (determined by substitution matrix), find longer alignment by extending similarity regions from words; used in BLAST (see example)
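The seed-and-extend idea behind the word method can be sketched in a few lines. This is a simplified illustration and not BLAST itself: words must match exactly (no substitution matrix or score threshold), extension is ungapped and stops at the first mismatch, and the names index_words and seed_and_extend and the default word length k=3 are assumptions made for the example.

```python
from collections import defaultdict


def index_words(seq: str, k: int) -> dict:
    """Map every length-k word of seq to the list of positions where it starts."""
    index = defaultdict(list)
    for i in range(len(seq) - k + 1):
        index[seq[i:i + k]].append(i)
    return index


def seed_and_extend(query: str, subject: str, k: int = 3) -> list:
    """Return (query_start, subject_start, length) regions, longest first."""
    index = index_words(query, k)
    regions = set()
    for s_pos in range(len(subject) - k + 1):
        for q_pos in index.get(subject[s_pos:s_pos + k], ()):
            # Extend the word hit to the left while residues keep matching.
            q_start, s_start = q_pos, s_pos
            while q_start > 0 and s_start > 0 and query[q_start - 1] == subject[s_start - 1]:
                q_start -= 1
                s_start -= 1
            # Extend to the right in the same way.
            q_end, s_end = q_pos + k, s_pos + k
            while q_end < len(query) and s_end < len(subject) and query[q_end] == subject[s_end]:
                q_end += 1
                s_end += 1
            regions.add((q_start, s_start, q_end - q_start))
    return sorted(regions, key=lambda region: -region[2])


if __name__ == "__main__":
    print(seed_and_extend("ACGTTGCA", "TTACGTTGAA"))
```

Because only positions that share a word are examined, most of the dynamic programming matrix is never touched; the price is that similar regions containing no exact word match are missed.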

Course Objectives • To learn about research studies driving the field and computing techniques that have been developed. • To learn about computational and informatics projects related to biology, medicine, and other “life science” disciplines at Emory. • To learn about opportunities for summer research and dissertation topics. • To stimulate ideas for further collaboration between Math/CS and X. But it is impossible to give a complete treatment of the field.

Meta-Objectives • How does a CS-knowledgeable person become an X-informatics or computational-X researcher? • How useful is it to work with just symbolic abstractions? • How much X does one need to learn for the research to be meaningful? • How can the collaboration be more mutual? • Most of the time, it is just CS servicing X. • X researchers really don’t care how the CS is done. Just Do It!

A (Personally) Useful Understanding The twelve levels of biological organization (pyramid of life) Organism life Cell Molecules (protein, DNA) Genes are sections of DNA

Useful Description …(Lawrence Hunter, University of Colorado: Molecular Biology for Computer Scientists) I have likened evolution to a search through a very large space of possible organism characteristics. That space can be defined quite precisely. All of an organism’s inherited characteristics are contained in a single messenger molecule: DNA. The characteristics are represented in a simple, linear, four-element code. The translation of this code into all the inherited characteristics of an organism (e.g., its body plan, or the wiring of its nervous system) is complex. The particular genetic encoding for an organism is called its genotype. The resulting physical characteristics of an organism are called its phenotype. In the search space metaphor, every point in the space is a genotype. Evolutionary variation (such as mutation, sexual recombination and genetic rearrangements) identifies the legal moves in this space. Selection is an evaluation function that determines how many other points a point can generate, and how long each point persists. The difference between genotype and phenotype is important because allowable (i.e., small) steps in genotype space can have large consequences in phenotype space. It is also worth noting that search happens in genotype space, but selection occurs in phenotypes. Although it is hard to characterize the size of phenotype space, an organism with a large amount of genetic material has about 10^11 elements taken from a four-letter alphabet, meaning that there are roughly 10^70,000,000,000 possible genotypes of that size or less. A vast space indeed! Moves occur asynchronously, both with each other and with the selection process. There are many non-deterministic elements; for example, in which of many possible moves is taken, or in the application of the selection function. Imagine this search process running for billions of iterations, examining trillions of points in this space in parallel at each iteration.
Perhaps it is not such a surprise that evolution is responsible for the wondrous abilities of living things, and for their tremendous diversity.

Approach: case studies • Biological or Math/CS preparation (1 or 2 lectures) • Main speaker -- the PI of the project (keep 3-4 PM open) 1. Dr. Fengzhu Sun (Biology/Math, USC) 2. Dr. Timothy Hickey (CS, Brandeis), Dr. Astrid Prinz (Biology, Emory) 3. Dr. Christine Martens (Surgery, Emory), Dr. Vicki Hertzberg (Biostatistics, Emory) 4. Dr. Christopher Flowers (WCI, Emory) • Summary discussion

Student Responsibilities • Grading: S/U • Class attendance and participation • Prepare summaries of case studies and answers to questions (goal: to produce useful materials for future offerings) • Take turns leading summary discussions • Project?
