
Computational Genomics Fall 2005/6  cs.tau.ac.il/~bchor/CG05/comp-genom.html

Computational Genomics Fall 2005/6  www.cs.tau.ac.il/~bchor/CG05/comp-genom.html. Lecturers: Benny Chor (benny AT cs.tau.ac.il) Eytan Ruppin (ruppin AT post.tau.ac.il ) TA: Tomer Shlomi (shlomito AT post.tau.ac.il). Lectures: Wednesday 11:00-14:00, Kaploon 324



Presentation Transcript


  1. Computational Genomics, Fall 2005/6 www.cs.tau.ac.il/~bchor/CG05/comp-genom.html Lecturers: Benny Chor (benny AT cs.tau.ac.il), Eytan Ruppin (ruppin AT post.tau.ac.il) TA: Tomer Shlomi (shlomito AT post.tau.ac.il) Lectures: Wednesday 11:00-14:00, Kaploon 324 Tutorials: Sunday 15:00-16:00, Kaploon 118.

  2. Course Information Requirements & Grades: • 20-25% homework, in five to six assignments, containing both “dry” and “wet” problems. Submission is due two weeks from posting. • Homework submission is obligatory. • You are strongly encouraged to solve the assignments independently (or at least give them a serious try). • 75-80% exam. You must score above 55 on the exam for the homework grade to count.

  3. Bibliography • Biological Sequence Analysis, R. Durbin et al., Cambridge University Press, 1998. • Introduction to Computational Molecular Biology, J. Setubal and J. Meidanis, PWS Publishing Company, 1997. • Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology, D. Gusfield, Cambridge University Press, 1997. • Post-genome Informatics, M. Kanehisa, Oxford University Press, 2000. • More refs on course page.

  4. Course Prerequisites Computer Science and Probability Background • Computational Models • Algorithms (“efficiency of computation”) • Probability (any course) Some Biology Background • Formally: None, to allow CS students to take this course. • Recommended: Some molecular biology course, and/or a serious desire to complement your knowledge in Biology by reading the appropriate material. Studying the algorithms in this course while acquiring enough biology background is far more rewarding than ignoring the biological context.

  5. What is Computational Biology? Computational biology is the application of computational tools and techniques to (primarily) molecular biology. It enables new ways of study in the life sciences, allowing analytic and predictive methodologies that support and enhance laboratory work. It is a multidisciplinary area of study that combines Biology, Computer Science, and Statistics. Computational biology is also called Bioinformatics, although many practitioners define Bioinformatics somewhat more narrowly, restricting it to the application of specialized software for deducing meaningful biological information.

  6. Why Bio-informatics? • An explosive growth in the amount of biological information necessitates the use of computers for cataloging, retrieving and analyzing mega-data (> 3 billion bps, > 30,000 genes). • The human genome project. • Improved technologies, e.g. automated sequencing. • GenBank is now approximately doubling every year!

  7. New Biotechnologies & Data • Micro arrays - gene expression. • 2D gels - protein expression. • Multi-level maps - genetic, physical: sequence, annotation. • Networks of protein-protein interactions. • Cross-species relationships - homologous genes. • Chromosome organization. http://www.the-scientist.com/yr2002/apr/research020415.html

  8. BioInformatics Tools are Crucial! • New biotechnology tools generate explosive growth in the amount of biological data. • It is impossible to analyze the data manually. • Novel mathematical, statistical, algorithmic and computational tools are necessary!

  9. Areas of Interest (very partial list) • Building evolutionary trees from molecular (and other) data • Efficiently reconstructing the genome sequence from sub-parts (mapping, assembly, etc.) • Understanding the structure of genomes (Genes, SNP, SSR) • Understanding function of genes in the cell cycle and disease • Deciphering structure and function of proteins • Diagnosing cancer based on DNA microarrays (“chips”) _____________________ SNP: Single Nucleotide Polymorphism SSR: Simple Sequence Repeat Much of this class has been edited from Nir Friedman’s lecture which is available at www.cs.huji.ac.il/~nir. Changes made by Dan Geiger, then Shlomo Moran, and finally Benny Chor. Additional slides from Zohar Yakhini and Metsada Pasmanik.

  10. Growth of DNA Sequence Data: GenBank [Figure: GenBank growth, plotted as sequences (millions) and base pairs of DNA (millions), as of Dec 19, 2001] Most Sequenced Organisms: Human (51%), Mouse (15%), Fruit Fly (4%), Rat (4%), Rice (2%), Wheat (2%), Worm (1%), Chimp (1%), Others (18%).

  11. The Protein Data Bank PDB Content Growth 02 (Experimentally determined) http://www.rcsb.org/pdb/

  12. Four Aspects Biological • What is the task? Algorithmic • How to perform the task at hand efficiently? Learning • How to adapt/estimate/learn parameters and models describing the task from examples Statistics • How to differentiate true phenomena from artifacts

  13. Example: Sequence Comparison Biological • Evolution preserves sequences, thus similar genes might have similar function Algorithmic • Consider all ways to “align” one sequence against another Learning • How do we define “similar” sequences? Use examples to define similarity Statistical • When we compare to ~10^6 sequences, what is a random match and what is a true one?
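The statistical question on this slide can be made concrete with a little arithmetic. The following is a minimal sketch, assuming every database comparison is an independent test with the same per-comparison false-positive probability; the function names and the example threshold of 0.001 are illustrative and not taken from the course.

```python
def expected_chance_hits(num_sequences: int, p_per_comparison: float) -> float:
    """Expected number of database hits that pass the threshold purely by chance."""
    return num_sequences * p_per_comparison


def prob_at_least_one_chance_hit(num_sequences: int, p_per_comparison: float) -> float:
    """Probability that at least one comparison passes by chance (independence assumed)."""
    return 1.0 - (1.0 - p_per_comparison) ** num_sequences


if __name__ == "__main__":
    n = 10**6   # roughly the database size mentioned on the slide
    p = 1e-3    # hypothetical per-comparison false-positive probability
    print(expected_chance_hits(n, p))
    print(prob_at_least_one_chance_hit(n, p))
```

With a million comparisons, a threshold that looks strict for a single comparison still lets through on the order of a thousand chance matches, which is why database search tools report scores corrected for database size.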

  14. Topics I Dealing with DNA/Protein sequences: • Finding similar sequences • Models of sequences: Hidden Markov Models • Genome projects and how sequences are found • Transcription regulation • Protein Families • Gene finding

  15. Topics II High throughput biotechnologies – potentials and computational challenges • DNA microarrays • applications to diagnostics • applications to understanding gene networks

  16. Topics III (Structural BioInfo Course) Protein World: • How proteins fold - secondary & tertiary structure • How to predict protein folds from sequence data • How to predict protein function from its structure • How to analyze protein changes from raw experimental measurements (MassSpec)

  17. Algorithmics Will introduce algorithmic techniques that are useful in computational genomics (and elsewhere): • Dynamic programming, dynamic programming, dynamic... • Suffix trees and arrays • Probabilistic models: PSSM (Position Specific Scoring Matrices), HMM (Hidden Markov Models) • Learning and classification, SVM (Support Vector Machines) • Heuristics for solving hard optimization problems (many problems in comp. genomics are NP-hard)
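As a taste of the dynamic programming item above, here is a minimal sketch of a global alignment score computed in the Needleman-Wunsch style. The scoring values (match +1, mismatch -1, gap -1) and the function name are assumptions chosen for illustration; the slides do not specify a scoring scheme.

```python
def global_alignment_score(a: str, b: str, match: int = 1,
                           mismatch: int = -1, gap: int = -1) -> int:
    """Best global alignment score of a and b, using O(len(b)) memory."""
    # prev[j] = best score for aligning the processed prefix of a with b[:j]
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # aligning a[:i] against the empty prefix costs i gaps
        for j in range(1, len(b) + 1):
            diagonal = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = prev[j] + gap        # a[i-1] aligned against a gap
            left = curr[j - 1] + gap  # b[j-1] aligned against a gap
            curr.append(max(diagonal, up, left))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(global_alignment_score("ACGT", "AGT"))
```

Each table cell depends only on its three neighbours, so the whole computation takes time proportional to the product of the two sequence lengths, compared with the exponential number of possible alignments.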

  18. Human Genome Most human cells contain 46 chromosomes: • 2 sex chromosomes (X,Y): XY – in males. XX – in females. • 22 pairs of chromosomes named autosomes.

  19. … On Feb. 28, 1953, Francis Crick walked into the Eagle pub in Cambridge, England, and, as James Watson later recalled, announced that "we had found the secret of life." "The structure was too pretty not to be true." -- JAMES D. WATSON, "The Double Helix" Watson and Crick

  20. DNA - the Code for Life (1953) 1920-1958 Died from ovarian cancer http://www.nobel.se/medicine/laureates/1962/index.html

  21. The Double Helix Source: Alberts et al

  22. The Central Dogma of Molecular Biology [Figure: DNA (replication) → transcription → mRNA → translation → protein → phenotype, illustrated with short example DNA and mRNA base sequences]

  23. Watson-Crick Complementarity Conclusion: DNA strands are complementary (1953). [Table: base ratios by DNA source (Human, Sheep, Turtle, Sea urchin, Wheat, E. coli), giving the % of each base and the purines/pyrimidines ratio]
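Complementarity is easy to express in code. The sketch below, with illustrative function names, builds the reverse complement of a strand and checks the consequence reflected in the base-ratio table: across both strands of a duplex, purines (A, G) and pyrimidines (C, T) occur in equal numbers.

```python
# Watson-Crick pairing: A with T, C with G (upper and lower case).
COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the complementary strand, read in its own 5'-to-3' direction."""
    return seq.translate(COMPLEMENT)[::-1]


def purine_pyrimidine_ratio(seq: str) -> float:
    """Ratio of purines (A, G) to pyrimidines (C, T) in a sequence."""
    seq = seq.upper()
    purines = seq.count("A") + seq.count("G")
    pyrimidines = seq.count("C") + seq.count("T")
    return purines / pyrimidines


if __name__ == "__main__":
    strand = "AAGC"
    other = reverse_complement(strand)
    print(other)
    # Counting both strands of the duplex gives a ratio of exactly 1.
    print(purine_pyrimidine_ratio(strand + other))
```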

  24. Genome Sizes • E. coli (bacteria) 4.6 x 10^6 bases • Yeast (simple fungi) 15 x 10^6 bases • Smallest human chromosome 50 x 10^6 bases • Entire human genome 3 x 10^9 bases

  25. Genetic Information • Genome – the collection of genetic information. • Chromosomes – storage units of genes. • Gene – basic unit of genetic information. They determine the inherited characters.

  26. What is a Gene? [Figure: a gene drawn as a transcribed region flanked by two un-coded regions; the transcribed region contains a promoter, a start codon, three exons separated by two introns, and a terminal codon] • DNA contains various recognition sites: • Promoter signals. • Transcription start signals. • Start codon. • Exon, intron boundaries. • Transcription termination signal.

  27. Control of the Human b-Globin Gene

  28. Alternative Splicing

  29. Genes: How Many? The DNA strings include: • Coding regions (“genes”) • E. coli has ~4,000 genes • Yeast has ~6,000 genes • C. elegans has ~13,000 genes • Humans have ~32,000 genes • Control regions • These typically are adjacent to the genes • They determine when a gene should be “expressed” • So-called “junk” DNA (unknown function - ~90% of the DNA in human chromosomes)

  30. Gene Finding • Only 4% of the human genome encodes functional genes. • Genes are found along large non-coding DNA regions. • Repeats, pseudo-genes, introns, and vector contamination are confusing.

  31. Gene Finding Existing programs for locating genes within genomic sequences utilize a number of statistical signals and employ statistical models such as hidden Markov models (HMMs). The problem is not solved yet, esp. for the newly discovered “RNA genes”.
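To illustrate the simplest kind of signal a gene finder can look for, the sketch below scans the three forward reading frames for open reading frames: a start codon followed, in frame, by a stop codon. This is a deliberately naive illustration and not how the HMM-based programs mentioned above work; the function name and the `min_codons` parameter are hypothetical, and introns, the reverse strand, and statistical scoring are all ignored.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(dna: str, min_codons: int = 2):
    """Return (start, end) index pairs of forward-strand ORFs.

    Each ORF begins at an ATG and ends just after the first in-frame stop
    codon; min_codons counts the codons before the stop, including the ATG.
    """
    dna = dna.upper()
    orfs = []
    for frame in range(3):
        start = None  # index of the currently open ATG, if any
        for i in range(frame, len(dna) - 2, 3):
            codon = dna[i:i + 3]
            if start is None and codon == "ATG":
                start = i
            elif start is not None and codon in STOP_CODONS:
                if (i - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None
    return orfs


if __name__ == "__main__":
    seq = "CCATGGCCTAAGG"
    for start, end in find_orfs(seq):
        print(start, end, seq[start:end])
```

On real genomes such a scan produces many false positives and misses genes split by introns, which is part of why the slide calls the problem unsolved.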

  32. Diversity of Tissues in Stomach How is this variety encoded and expressed ?

  33. Central Dogma: Gene → (transcription) → mRNA → (translation) → Protein. Cells express different subsets of the genes in different tissues and under different conditions.

  34. Transcription • Coding sequences can be transcribed to RNA • RNA nucleotides: • Similar to DNA, slightly different backbone • Uracil (U) instead of Thymine (T) Source: Mathews & van Holde
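At the level of sequence strings, transcription amounts to copying the coding strand with uracil in place of thymine. A minimal sketch, with illustrative function names; real transcription also involves promoters, termination signals and RNA processing, none of which is modelled here.

```python
# Watson-Crick pairing used to derive the coding strand from the template.
DNA_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def transcribe(coding_strand: str) -> str:
    """mRNA sequence for a DNA coding strand: same bases, with U instead of T."""
    return coding_strand.upper().replace("T", "U")


def transcribe_from_template(template_strand: str) -> str:
    """mRNA for a template strand: reverse-complement it, then swap T for U."""
    coding = template_strand.upper().translate(DNA_COMPLEMENT)[::-1]
    return transcribe(coding)


if __name__ == "__main__":
    print(transcribe("ATGGCCTAA"))
    print(transcribe_from_template("TTAGGCCAT"))
```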

  35. Transcription: RNA Editing • Transcribe to RNA • Eliminate introns • Splice (connect) exons • * Alternative splicing exists. Exons hold information, so they are more stable during evolution. This process takes place in the nucleus. The mRNA molecules diffuse through the nuclear membrane into the cytoplasm.

  36. RNA roles • Messenger RNA (mRNA) • Encodes protein sequences. Each triplet of nucleotides (a codon) translates to one amino acid (the protein building block). • Transfer RNA (tRNA) • Decodes the mRNA molecules to amino acids. It connects to the mRNA on one side and holds the appropriate amino acid on its other side. • Ribosomal RNA (rRNA) • Part of the ribosome, a machine for translating mRNA to proteins. It catalyzes (like enzymes) the reaction that attaches the hanging amino acid from the tRNA to the amino acid chain being created. • ...

  37. New Roles of RNA Cellular Regulation COVER: Researchers are discovering that small RNA molecules play a surprising variety of key roles in cells. They can inhibit translation of messenger RNA into protein, cause degradation of other messenger RNAs, and even initiate complete silencing of gene expression from the genome. http://www.sciencemag.org/content/vol298/issue5602/cover.shtml http://www.nature.com/nature/journal/v408/n6808/fig_tab/408037a0_F1.html

  38. Translation in Eukaryotes http://www1.imim.es/courses/Lisboa01/slide1.6_translation.html Animation:http://cbms.st-and.ac.uk/academics/ryan/Teaching/medsci/Medsci6.htm

  39. Translation • Translation is mediated by the ribosome • The ribosome is a complex of protein & rRNA molecules • The ribosome attaches to the mRNA at a translation initiation site • The ribosome then moves along the mRNA sequence and in the process constructs a sequence of amino acids (polypeptide), which is released and folds into a protein.

  40. Genetic Code There are 20 amino acids from which proteins are built.
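The genetic code is a lookup table from the 64 possible codons to the 20 amino acids plus a stop signal. The sketch below builds the standard table compactly (one-letter amino acid codes, `*` for stop, codons ordered over the bases U, C, A, G) and translates an mRNA from its first AUG. The function name is illustrative, and it assumes the input contains only the letters A, C, G and U.

```python
from itertools import product

BASES = "UCAG"
# Standard genetic code, one letter per codon in UCAG order ('*' marks stop).
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def translate(mrna: str) -> str:
    """Translate from the first AUG up to (not including) the first stop codon."""
    mrna = mrna.upper()
    start = mrna.find("AUG")
    if start == -1:
        return ""  # no start codon, nothing to translate
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        amino_acid = CODON_TABLE[mrna[i:i + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


if __name__ == "__main__":
    print(len(CODON_TABLE))               # number of codons
    print(len(set(AMINO_ACIDS) - {"*"}))  # number of distinct amino acids
    print(translate("GGAUGGCCUAAGG"))
```

Because 64 codons map onto only 20 amino acids, the code is redundant: most amino acids are encoded by several codons.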

  41. Protein Structure • Proteins are poly-peptides of 70-3000 amino-acids • This structure is (mostly) determined by the sequence of amino-acids that make up the protein

  42. Protein Structure

  43. The Central Paradigm of Bio-informatics Molecular structure Biochemical function Genetic information Symptoms

  44. Similarity Search in Databanks Find sequences similar to a working draft. As databanks grow, finding homologies gets harder and quality is reduced. Alignment tools: BLAST & FASTA (time-saving heuristics and approximations). Pairwise alignment:

>gb|BE588357.1|BE588357 194087 BARC 5BOV Bos taurus cDNA 5'.
Length = 369
Score = 272 bits (137), Expect = 4e-71
Identities = 258/297 (86%), Gaps = 1/297 (0%)
Strand = Plus / Plus

Query: 17  aggatccaacgtcgctccagctgctcttgacgactccacagataccccgaagccatggca 76
           |||||||||||||||| | ||| | ||| || ||| | ||||  ||||| |||||||||
Sbjct: 1   aggatccaacgtcgctgcggctacccttaaccact-cgcagaccccccgcagccatggcc 59

Query: 77  agcaagggcttgcaggacctgaagcaacaggtggaggggaccgcccaggaagccgtgtca 136
           |||||||||||||||||||||||| | || ||||||||| | ||||||||||| ||| ||
Sbjct: 60  agcaagggcttgcaggacctgaagaagcaagtggagggggcggcccaggaagcggtgaca 119

Query: 137 gcggccggagcggcagctcagcaagtggtggaccaggccacagaggcggggcagaaagcc 196
            |||||||| | || | ||||||||||||||| ||||||||||| || ||||||||||||
Sbjct: 120 tcggccggaacagcggttcagcaagtggtggatcaggccacagaagcagggcagaaagcc 179

Query: 197 atggaccagctggccaagaccacccaggaaaccatcgacaagactgctaaccaggcctct 256
           ||||||||| | |||||||| |||||||||||||||||| ||||||||||||||||||||
Sbjct: 180 atggaccaggttgccaagactacccaggaaaccatcgaccagactgctaaccaggcctct 239

Query: 257 gacaccttctctgggattgggaaaaaattcggcctcctgaaatgacagcagggagac 313
           || || ||||| ||  ||||||||||| | |||||||||||||||||| ||||||||
Sbjct: 240 gagactttctcgggttttgggaaaaaacttggcctcctgaaatgacagaagggagac 296
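The "Identities" and "Gaps" figures in a BLAST report are plain column counts over the aligned rows. The sketch below recomputes them, together with the row of `|` match marks, for the first Query/Sbjct pair of the alignment above (Query 17-76 against Sbjct 1-59). The function names are illustrative and are not part of BLAST.

```python
def match_line(query: str, subject: str) -> str:
    """Row of '|' marks under identical, non-gap columns of two aligned rows."""
    return "".join(
        "|" if q == s and q != "-" else " " for q, s in zip(query, subject)
    )


def identity_stats(query: str, subject: str):
    """Return (identities, gap columns, alignment length) for two aligned rows."""
    if len(query) != len(subject):
        raise ValueError("aligned rows must have equal length")
    identities = match_line(query, subject).count("|")
    gaps = sum(1 for q, s in zip(query, subject) if q == "-" or s == "-")
    return identities, gaps, len(query)


# First block of the alignment shown on the slide.
QUERY = "aggatccaacgtcgctccagctgctcttgacgactccacagataccccgaagccatggca"
SBJCT = "aggatccaacgtcgctgcggctacccttaaccact-cgcagaccccccgcagccatggcc"

if __name__ == "__main__":
    identities, gaps, length = identity_stats(QUERY, SBJCT)
    print(match_line(QUERY, SBJCT))
    print(f"Identities = {identities}/{length}, Gaps = {gaps}/{length}")
```

Summing the same counts over all five blocks of the alignment reproduces the reported 258/297 identities and 1/297 gaps.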

  45. Multiple Sequence Alignment Multiple alignment: Basis for phylogenetic tree construction. Useful to find protein families and functional domains.

  46. Evolution Evolution - a process in which small changes occur within species over time. These changes are mainly monitored today using molecular sequences (DNA/proteins). The Tree of Life: A classical, basic science problem, since Darwin’s 1859 “Origin of Species”.

  47. Evolution • Related organisms have similar DNA • Similarity in sequences of proteins • Similarity in organization of genes along the chromosomes • Evolution plays a major role in biology • Many mechanisms are shared across a wide range of organisms • During the course of evolution existing components are adapted for new functions
