Computational Genomics: Theory and Practice in Bioinformatics

Computational Genomics, Fall 2004/5 www.cs.tau.ac.il/~bchor/CG05/comp-genom.html Lecturer: Benny Chor (benny@cs.tau.ac.il) TA: Amos Tanai (amos@post.tau.ac.il) Lectures: Wednesday 10:00-13:00, location unknown Tutorials: Sunday 15:00-16:00, location unknown.

Course Information Requirements & Grades: • 20-25% homework, in five to six assignments, containing both “dry” and “wet” problems. Submission: two weeks from posting. • Homework submission is obligatory. • You are strongly encouraged to solve the assignments independently (or at least give it a serious try). • 75-80% exam. The exam grade must exceed 55 for the homework grade to count.

Bibliography • Biological Sequence Analysis, R.Durbin et al. , Cambridge University Press, 1998 • Introduction to Molecular Biology, J. Setubal, J. Meidanis, PWS publishing Company, 1997 • Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology, D. Gusfield, Cambridge University Press, 1997. • More refs on course page.

Course Prerequisites Computer Science and Probability Background • Computational Models • Algorithms (“efficiency of computation”) • Probability (any course) Some Biology Background • Formally: None, to allow CS students to take this course. • Recommended: Some molecular biology course, and/or a serious desire to complement your knowledge in Biology by reading the appropriate material. Studying the algorithms in this course while acquiring enough biology background is far more rewarding than ignoring the biological context.

Computational Biology Computational biology is the application of computational tools and techniques to (primarily) molecular biology. It enables new ways of study in the life sciences, allowing analytic and predictive methodologies that support and enhance laboratory work. It is a multidisciplinary area of study that combines Biology, Computer Science, and Statistics. Computational biology is also called Bioinformatics, although many practitioners define Bioinformatics somewhat more narrowly, restricting it to the application of specialized software for deducing meaningful biological information. This class has been edited from Nir Friedman’s lecture, which is available at www.cs.huji.ac.il/~nir. Changes made by Dan Geiger, then Shlomo Moran, and finally Benny Chor. Additional slides from Zohar Yakhini and Metsada Pasmanik.

Areas of Interest (partial list) • Building evolutionary trees from molecular (and other) data • Efficiently reconstructing the genome sequence from sub-parts (mapping, assembly, etc.) • Understanding the structure of genomes (Genes, SNP, SSR) • Understanding function of genes in the cell cycle and disease • Deciphering structure and function of proteins • Diagnosing cancer based on DNA microarrays (“chips”) _____________________ SNP: Single Nucleotide Polymorphism SSR: Simple Sequence Repeat

Exponential growth of biological information: growth of sequences, structures, and literature.

Four Aspects Biological • What is the task? Algorithmic • How to perform the task at hand efficiently? Learning • How to adapt/estimate/learn parameters and models describing the task from examples Statistics • How to differentiate true phenomena from artifacts

Example: Sequence Comparison Biological • Evolution preserves sequences, thus similar genes might have similar function Algorithmic • Consider all ways to “align” one sequence against another Learning • How do we define “similar” sequences? Use examples to define similarity Statistical • When we compare to ~10^6 sequences, what is a random match and what is a true one?
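A minimal sketch of the algorithmic aspect, assuming a simple scoring scheme (match +1, mismatch -1, gap -1) that is not taken from the slides. Rather than enumerating all alignments explicitly, dynamic programming scores the best global alignment of two sequences; the function name and parameters are illustrative only.

```python
def global_alignment_score(a, b, match=1, mismatch=-1, gap=-1):
    """Best global alignment score of a and b (Needleman-Wunsch-style DP).

    The scoring values are assumptions chosen for illustration.
    """
    # prev[j] = best score for aligning the processed prefix of a with b[:j]
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # aligning a[:i] against the empty prefix of b
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # a[i-1] aligned to a gap
            left = curr[j - 1] + gap  # b[j-1] aligned to a gap
            curr.append(max(diag, up, left))
        prev = curr
    return prev[-1]
```

The table has (len(a)+1) x (len(b)+1) cells, so the running time is proportional to the product of the two lengths, even though the number of possible alignments grows exponentially.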

Course Goals • Learning about computational tools for (primarily) molecular biology. • Cover computational tasks that are posed by modern molecular biology • Discuss the biological motivation and setup for these tasks • Understand the kinds of solutions that are known, and what principles justify them

Topics I Dealing with DNA/Protein sequences: • Finding similar sequences • Models of sequences: Hidden Markov Models • Genome projects and how sequences are found • Transcription regulation • Protein Families • Gene finding

Topics II Models of genetic change: • Long term: evolutionary changes among species • Reconstructing evolutionary trees from sequences • Short term: genetic variations in a population • Finding genes by linkage and association

Topics III High throughput biotechnologies – potentials and computational challenges • DNA microarrays • applications to diagnostics • applications to understanding gene networks

Topics IV (Structural BioInfo Course) Protein World: • How proteins fold - secondary & tertiary structure • How to predict protein folds from sequence data • How to predict protein function from its structure • How to analyze protein changes from raw experimental measurements (MassSpec)

Algorithmics Will introduce algorithmic techniques that are useful in computational genomics (and elsewhere): • Dynamic programming, dynamic programming, dynamic.. • Suffix trees and arrays • Probabilistic models: PSSM (Position Specific Scoring Matrices), HMM (Hidden Markov Models) • Learning and classification, SVM (Support Vector Machines) • Heuristics for solving hard optimization problems (Many problems in comp. genomics are NP-hard)

Human Genome Most human cells contain 46 chromosomes: • 2 sex chromosomes (X,Y): XY – in males. XX – in females. • 22 pairs of chromosomes named autosomes.

… On Feb. 28, 1953, Francis Crick walked into the Eagle pub in Cambridge, England, and, as James Watson later recalled, announced that "we had found the secret of life." "The structure was too pretty not to be true." -- JAMES D. WATSON, "The Double Helix" Watson and Crick

DNA Organization Source: Alberts et al

The Double Helix Source: Alberts et al

DNA Components Four nucleotide types: • Adenine • Guanine • Cytosine • Thymine Hydrogen bonds (electrostatic connection): • A-T • C-G

Watson-Crick Complementarity Conclusion: DNA strands are complementary (1953). [Table: % of each base and base ratios (purines/pyrimidines) by DNA source: human, sheep, turtle, sea urchin, wheat, E. coli]

Genome Sizes • E. coli (bacteria) 4.6 x 10^6 bases • Yeast (simple fungi) 15 x 10^6 bases • Smallest human chromosome 50 x 10^6 bases • Entire human genome 3 x 10^9 bases

Genetic Information • Genome – the collection of genetic information. • Chromosomes – storage units of genes. • Gene – basic unit of genetic information. Genes determine the inherited characters.

Genes The DNA strings include: • Coding regions (“genes”): E. coli has ~4,000 genes; yeast has ~6,000 genes; C. elegans has ~13,000 genes; humans have ~32,000 genes • Control regions: these are typically adjacent to the genes, and they determine when a gene should be “expressed” • “Junk” DNA (unknown function, ~90% of the DNA in human chromosomes)

Gene Finding Existing programs for locating genes within genomic sequences utilize a number of statistical signals and employ statistical models such as hidden Markov models (HMMs). The problem is not solved yet!

The Cell All cells of an organism contain the same DNA content (and the same genes) yet there is a variety of cell types.

Example: Tissues in Stomach How is this variety encoded and expressed?

Central Dogma Gene → (transcription) → mRNA → (translation) → Protein. Cells express different subsets of the genes in different tissues and under different conditions.

Transcription • Coding sequences can be transcribed to RNA • RNA nucleotides: • Similar to DNA, slightly different backbone • Uracil (U) instead of Thymine (T) Source: Mathews & van Holde

Transcription: RNA Editing • Transcribe to RNA • Eliminate introns • Splice (connect) exons • * Alternative splicing exists Exons hold information and are more stable during evolution. This process takes place in the nucleus. The mRNA molecules then pass through the nuclear membrane into the cytoplasm.

RNA roles • Messenger RNA (mRNA) • Encodes protein sequences. Each triplet of nucleotides (a codon) translates to one amino acid (the protein building block). • Transfer RNA (tRNA) • Decodes the mRNA molecules to amino acids. It binds to the mRNA on one side and holds the appropriate amino acid on its other side. • Ribosomal RNA (rRNA) • Part of the ribosome, a machine for translating mRNA to proteins. It catalyzes (like enzymes) the reaction that attaches the amino acid carried by the tRNA to the amino acid chain being created. • ...

Translation in Eukaryotes http://www1.imim.es/courses/Lisboa01/slide1.6_translation.html Animation: http://cbms.st-and.ac.uk/academics/ryan/Teaching/medsci/Medsci6.htm

Translation • Translation is mediated by the ribosome • Ribosome is a complex of protein & rRNA molecules • The ribosome attaches to the mRNA at a translation initiation site • Then ribosome moves along the mRNA sequence and in the process constructs a sequence of amino acids (polypeptide) which is released and folds into a protein.

Genetic Code There are 20 amino acids from which proteins are built.
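To make the transcription and translation slides concrete, here is a minimal sketch that maps a DNA coding sequence to mRNA (U instead of T) and then reads the mRNA three nucleotides at a time. It assumes the standard genetic code, that the input is the coding strand, and that reading starts at position 0 and stops at the first stop codon; the function names are illustrative and not from the course.

```python
# Standard genetic code, written compactly: codons are enumerated with each
# position running over T, C, A, G; "*" marks a stop codon.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def transcribe(coding_dna):
    """Return the mRNA for a DNA coding strand (assumption: no introns)."""
    return coding_dna.upper().replace("T", "U")


def translate(mrna):
    """Translate mRNA codon by codon, stopping at the first stop codon."""
    protein = []
    for pos in range(0, len(mrna) - 2, 3):
        codon = mrna[pos:pos + 3].upper().replace("U", "T")
        amino = CODON_TABLE[codon]
        if amino == "*":
            break
        protein.append(amino)
    return "".join(protein)
```

For example, `translate(transcribe("ATGGCCTAA"))` reads the codons AUG, GCC, UAA and yields the one-letter protein string "MA". The 64 codons map to only 20 amino acids plus stop, so the code is redundant.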

Protein Structure • Proteins are poly-peptides of 70-3000 amino-acids • This structure is (mostly) determined by the sequence of amino-acids that make up the protein

Protein Structure

Evolution • Related organisms have similar DNA • Similarity in sequences of proteins • Similarity in organization of genes along the chromosomes • Evolution plays a major role in biology • Many mechanisms are shared across a wide range of organisms • During the course of evolution existing components are adapted for new functions

Evolution Evolution of new organisms is driven by • Diversity • Different individuals carry different variants of the same basic blueprint • Mutations • The DNA sequence can be changed due to single base changes, deletion/insertion of DNA segments, etc. • Selection bias

The Tree of Life Source: Alberts et al

Phylogeny Reconstruction Goal: Given a set of species, reconstruct the tree which best explains their evolutionary history.

Trees are Based on What? Darwin (Origin of Species, 1859) and his contemporaries based their work on morphological and physiological properties (e.g. cold/warm blood, existence of scales, number of teeth, existence of wings, etc.). Paleontological data is still in use when constructing trees for certain extinct species (e.g. dinosaurs, mammoths, moas, unicorns, etc.). Today most phylogenetic trees are based on molecular sequence data (DNA or proteins).

Evolution www.tomchalk.com/evolution.gif

Example for Phylogenetic Analysis Input: four nucleotide sequences: AAG, AAA, GGA, AGA, taken from four species. Question: Which evolutionary tree best explains these sequences? One Answer (the parsimony principle): Pick a tree that has a minimum total number of substitutions of symbols between species and their originator in the evolutionary tree (also called a phylogenetic tree). [Tree diagram: internal nodes AAA, AAA, AAA; leaves GGA, AGA, AAG, AAA; edge substitution counts 2, 1, 1; total #substitutions = 4]
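The substitution count for a candidate tree can be computed mechanically. The sketch below uses Fitch's algorithm, a standard technique that the slides do not name, to find the minimum number of substitutions for a fixed rooted binary tree over aligned, equal-length sequences. The nested-tuple tree encoding and the function names are assumptions made for illustration.

```python
def fitch_site(tree, site):
    """Return (candidate state set, substitution count) for one column.

    A tree is either a leaf (a sequence string) or a pair (left, right).
    """
    if isinstance(tree, str):
        return {tree[site]}, 0
    left_set, left_cost = fitch_site(tree[0], site)
    right_set, right_cost = fitch_site(tree[1], site)
    common = left_set & right_set
    if common:
        # The children can agree on a state: no extra substitution needed.
        return common, left_cost + right_cost
    # The children disagree: one substitution is forced on this pair of edges.
    return left_set | right_set, left_cost + right_cost + 1


def parsimony_score(tree, length):
    """Sum the per-column minimum substitution counts over all columns."""
    return sum(fitch_site(tree, site)[1] for site in range(length))


# Two candidate topologies for the four example sequences.
tree_a = (("GGA", "AGA"), ("AAG", "AAA"))
tree_b = (("GGA", "AAG"), ("AGA", "AAA"))
```

Under this sketch, `parsimony_score(tree_a, 3)` is 3 and `parsimony_score(tree_b, 3)` is 4, so the parsimony principle would prefer the topology that groups GGA with AGA. Searching over all topologies is the hard part; this function only scores one given tree.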

WC Complementarity, again: A binds to T, C binds to G. Perfect match: AATGCTTAGTC / TTACGAATCAG. One base mismatch: AATGCGTAGTC / TTACGAATCAG.
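The pairing rule above can be checked with a few lines of code. This is a minimal sketch that compares two strands position by position, as they are written on the slide, and ignores strand orientation; the names `WC_PAIR`, `complement`, and `count_mismatches` are illustrative.

```python
# Watson-Crick pairing: A binds to T, C binds to G.
WC_PAIR = {"A": "T", "T": "A", "C": "G", "G": "C"}


def complement(strand):
    """Return the base-by-base Watson-Crick complement of a DNA strand."""
    return "".join(WC_PAIR[base] for base in strand)


def count_mismatches(probe, target):
    """Count positions where target is not the WC partner of probe."""
    if len(probe) != len(target):
        raise ValueError("strands must have equal length")
    return sum(WC_PAIR[p] != t for p, t in zip(probe, target))
```

With the slide's sequences, `count_mismatches("AATGCTTAGTC", "TTACGAATCAG")` is 0 (perfect match) and `count_mismatches("AATGCGTAGTC", "TTACGAATCAG")` is 1 (the G at the sixth position faces an A).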

Array Based Hybridization Assays (DNA Chips) Unknown sequence (target), many copies. Array of probes.

Array Based Hyb Assays • Target hybridizes to WC complementary probes only • Therefore – the fluorescence pattern is indicative of the target sequence.

Microarrays (“DNA Chips”) Leading edge, future technologies (since 1988): In a single experiment, measure expression level of thousands of genes. • Find informative genes that may have predictive power for medical diagnosis. • Potential for personalized medicine, e.g. kits for identifying cancer types and prescribing “personal” treatment.

DNA Chips - Structure • Each chip has n “pixels” on it. • Every pixel contains copies of a probe from a single gene. • Do m experiments: cells in each experiment are taken from different conditions (different phase of cell cycle, different patient, different type of tissue, etc.). • Purpose: measure mRNA expression levels (color coded) of all n genes in one experiment.

Gene Expression Matrix • Rows correspond to genes (typically n between 500 and 15,000). • Columns correspond to experiments (typically m between 10 and 200). • Entry i, j = expression level of gene i in experiment j.
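As a small illustration of this layout, the sketch below stores a toy n x m matrix as a list of rows (one row per gene, one column per experiment). The gene names and expression values are made up, and ranking genes by variance across experiments is only one assumed stand-in for "informative"; the slides do not prescribe a criterion.

```python
from statistics import pvariance

# Hypothetical data: n = 3 genes (rows), m = 4 experiments (columns).
genes = ["gene_1", "gene_2", "gene_3"]
expression = [
    [1.0, 1.1, 0.9, 1.0],
    [0.2, 2.5, 0.1, 2.8],
    [3.0, 3.0, 3.1, 2.9],
]


def entry(matrix, i, j):
    """Expression level of gene i in experiment j (1-based, as on the slide)."""
    return matrix[i - 1][j - 1]


def most_variable_gene(matrix, names):
    """Name of the gene whose expression varies most across experiments."""
    variances = [pvariance(row) for row in matrix]
    best = max(range(len(matrix)), key=variances.__getitem__)
    return names[best]
```

Here `entry(expression, 2, 4)` returns 2.8, and `most_variable_gene(expression, genes)` returns "gene_2", the only gene whose level changes substantially between conditions.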
