
Introduction to Bioinformatics

Introduction to Bioinformatics. Juris Viksna, IMCS UL 2019; Alvis Brazma, European Bioinformatics Institute. Planned course schedule. Regular lecture times: Thursdays 16:30-18:00 and 18:15-19:45


Presentation Transcript


  1. Introduction to Bioinformatics Juris Viksna, IMCS UL 2019 Alvis Brazma, European Bioinformatics Institute

  2. Planned course schedule Regular lecture times: Thursdays 16:30-18:00 and 18:15-19:45, in the second week of September and every two weeks thereafter (i.e. on 12.09, 26.09, 10.10, 24.10, 07.11, 21.11, 05.12, 19.12), room 413. A few lectures will likely be rescheduled (hopefully not too many). The dates and times of these (and of replacement lectures) will be announced when known. I will try my best to invite guest lecturers (quite likely this will involve rescheduled lecture times), but this depends on options that might (or might not) become available.

  3. Course requirements To obtain credit for this course you must: - submit a programming project (worth 50% of the grade) or - submit a ‘data analysis’ project (worth 50% of the grade) - take a (written) exam (open book, open internet :) (worth 50% of the grade). Course webpage: http://susurs.mii.lu.lv/juris/courses/bi2019.html

  4. Topics from the original (A.Brazma 2008) bioinformatics course • The subjects covered during the course will be roughly distributed as follows: • Biology as information science (4 hours) • Genome sequencing and architecture (4 hours) • Discrete vs. continuous problems in bioinformatics (2 hours) • Gene expression data analysis (2 hours) • Comparison of protein sequences - algorithms and heuristics (4 hours) • Phylogenetic trees (4 hours) • Modelling and comparison of protein structures (2 hours) • Comparative genomics (2 hours) • Supervised learning approaches to data analysis (2 hours) • Gene networks and methods for their analysis (4 hours) • Biomedical informatics (2 hours)

  5. Bioinformatics • Databases and tools to store and access biomolecular data • Sequence algorithms – assembly from short fragments, alignment of similar sequences, analysis of properties • Evolution and phylogenetics • 3D structure analysis of biomolecules • Machine learning and data mining application to genome and related information • Biomolecular interaction analysis (e.g., protein interactions) • Dynamic systems, modelling of biological networks and systems • Analysis of noisy measurement data, statistical analysis • Data management, databases, interfaces, web services • Links with health records, biomedical informatics

  6. Why might bioinformatics be important for you? • This is a growing science involving an increasing number of computer professionals (e.g., the 1000-human genome project has just started) • Links with medical and health informatics information systems: a growing and important market for software • The Latvian genome project and participation in European genotyping projects: software experts who understand the underlying problems are needed

  7. Topics covered in this course: • Introduction to biology as information science • Overview of some bioinformatics problems • Biosequence and structure analysis, molecular evolution and phylogeny, etc. • Genomics: DNA assembly, haplotypes, etc. • Gene regulation network modelling (graph theory, Boolean networks, dynamic systems) • Analysis of gene expression data, cluster analysis, data mining and analysis • Some new, recently evolving topics (time and material availability permitting...)

  8. FYI – Bioinformatics course in UL Faculty of Biology (by Nils Rostoks) • Biological information: its diversity and volume • Biology, statistics, information technology and programming as the basic elements of bioinformatics • Genome organization and evolution • Comparative genomics • Biological information databases. Information search and retrieval systems

  9. FYI – Bioinformatics course in UL Faculty of Biology (by Nils Rostoks) • Basic principles of nucleic acid and protein sequence similarity. Various comparison methods, their advantages and conditions of use • Phylogenetics. Cluster and cladistic methods for reconstructing phylogenetic trees • Genome expression analysis • DNA chips in genome polymorphism analysis. Genetics of gene expression

  10. FYI – Bioinformatics course in UL Faculty of Biology (by Nils Rostoks) • DNA topology, protein structure, methods for its prediction and applications in pharmacology • Proteomics and systems biology. Network structures as a natural component of biological systems • Prospects of bioinformatics. Bioinformatics as a prerequisite for learning modern biology

  11. NIH WORKING DEFINITION OF BIOINFORMATICS AND COMPUTATIONAL BIOLOGY, July 17, 2000 • Bioinformatics: Research, development, or application of computational tools and approaches for expanding the use of biological, medical, behavioural or health data, including those to acquire, store, organize, archive, analyse, or visualize such data. • Computational Biology: The development and application of data-analytical and theoretical methods, mathematical modelling and computational simulation techniques to the study of biological, behavioural, and social systems.

  12. Human Genome Project • Began in 1990 in the US • Primary goal: to sequence the human DNA, about 3 billion base pairs long • A working draft of the genome was released in 2000 • Finished in 2003, with further analysis still being published

  13. The results of HGP • A sequence about 3 billion letters long, written in the four letters A, T, G and C, containing all inherited human information • Genomes of many other organisms • Development of biotechnology that allows not only sequencing DNA but also studying the function of different biomolecules, producing many TB of data • Databases storing this information (GenBank and the EMBL Data Library) • Data analysis and management needs, leading to the emergence and development of bioinformatics. Things have recently changed again, however: with NGS technologies, sequencing a specific individual has become affordable, with direct implications for the amount of data that needs to be stored and/or analyzed.

  14. All you need to know about Molecular Biology 

  15. One of the first textbooks in bioinformatics MIT Press 2000

  16. A few other textbooks for «Computer Scientists» MIT Press 2004 Chapman and Hall/CRC 1995

  17. A few other textbooks for «Computer Scientists» Cambridge University 2015 CRC 2017

  18. A few other textbooks Cambridge University 2009 Oxford University 2002

  19. Some bioinformatics problems from the perspective of Computer Science Genome sequencing and assembly

  20. Genome sequencing and assembly E.Green (2001) Strategies for the systematic sequencing of complex genomes. Nat Rev Genetics, Vol 2:8, 573-583.

  21. Ensembl genome browser

  22. Genome sequence assembly

  23. Genome sequence assembly

  24. Sequence assembly problem OK, let us assume that we have these hybridizations. How can we reconstruct the initial DNA sequence from them? Affymetrix GeneChip W.Bains, C.Smith (1988) A novel method for nucleic acid sequence determination. Journal of Theoretical Biology, Vol. 135:3, 303-307.

  25. Sequence assembly problem OK, let us assume that we have these hybridizations. How can we reconstruct the initial DNA sequence from them?

  26. SBH – Hamiltonian path approach

  27. SBH – Hamiltonian path approach

  28. Hamiltonian path (cycle) problem For a given graph, find a path (cycle) that visits every vertex exactly once (or show that no such path exists). Unfortunately, the problem is known to be NP-hard: no polynomial-time algorithm is known, and exact search becomes impractical already for comparatively small graphs.
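
The Hamiltonian path view of SBH can be sketched as follows. This is a minimal illustration, not the construction from the slides (whose figures are not part of this transcript): it assumes the hybridization result is a list of k-mers, makes each k-mer a vertex, and draws an edge wherever one k-mer's last k-1 letters match the next one's first k-1 letters. The function names `overlap_graph`, `hamiltonian_path` and `reconstruct` are made up for this example. The search is plain backtracking, so its running time grows exponentially with the number of k-mers.

```python
from typing import Optional


def overlap_graph(kmers: list[str]) -> dict[int, list[int]]:
    """Edge i -> j when the last k-1 letters of kmers[i] equal the first k-1 of kmers[j]."""
    return {
        i: [j for j, b in enumerate(kmers) if i != j and a[1:] == b[:-1]]
        for i, a in enumerate(kmers)
    }


def hamiltonian_path(graph: dict[int, list[int]]) -> Optional[list[int]]:
    """Backtracking search for a path visiting every vertex once (exponential worst case)."""
    n = len(graph)

    def extend(path: list[int], used: set[int]) -> Optional[list[int]]:
        if len(path) == n:
            return path
        for nxt in graph[path[-1]]:
            if nxt not in used:
                used.add(nxt)
                found = extend(path + [nxt], used)
                if found is not None:
                    return found
                used.remove(nxt)  # dead end: undo and try the next neighbour
        return None

    for start in graph:
        found = extend([start], {start})
        if found is not None:
            return found
    return None


def reconstruct(kmers: list[str]) -> Optional[str]:
    """Spell out the sequence along a Hamiltonian path, or return None if there is none."""
    path = hamiltonian_path(overlap_graph(kmers))
    if path is None:
        return None
    return kmers[path[0]] + "".join(kmers[i][-1] for i in path[1:])
```

For the 3-mers of ATGGCGTGCA, `reconstruct` returns a 10-letter sequence with the same 3-mer content. That sequence need not be the original one: ATGCGTGGCA contains exactly the same 3-mers, which illustrates that the spectrum alone can be ambiguous.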

  29. SBH – Eulerian path approach

  30. Eulerian path (cycle) problem For a given graph, find a path (cycle) that visits every edge exactly once (or show that no such path exists).

  31. Eulerian path (cycle) problem For a given graph, find a path (cycle) that visits every edge exactly once (or show that no such path exists). An Eulerian cycle exists in a connected graph if and only if every vertex has even degree. Moreover, there is a simple linear-time algorithm for finding an Eulerian cycle.
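
One linear-time method is Hierholzer's algorithm. The sketch below applies it to SBH under assumptions that go beyond the slide text: the graph is directed, each k-mer becomes an edge from its first k-1 letters to its last k-1 letters, and an Eulerian path is assumed to exist. In the directed case the degree condition reads "in-degree equals out-degree", except possibly for one start and one end vertex. The names `eulerian_path` and `assemble` are illustrative only.

```python
from collections import defaultdict


def eulerian_path(edges: list[tuple[str, str]]) -> list[str]:
    """Hierholzer's algorithm on a directed multigraph; assumes an Eulerian path exists."""
    out_edges: dict[str, list[str]] = defaultdict(list)
    balance: dict[str, int] = defaultdict(int)  # out-degree minus in-degree
    for u, v in edges:
        out_edges[u].append(v)
        balance[u] += 1
        balance[v] -= 1

    # Start at the vertex with one extra outgoing edge if there is one;
    # otherwise all vertices are balanced and any vertex with an edge will do.
    start = next((v for v, b in balance.items() if b == 1), edges[0][0])

    stack, path = [start], []
    while stack:
        v = stack[-1]
        if out_edges[v]:
            stack.append(out_edges[v].pop())  # follow an unused edge
        else:
            path.append(stack.pop())          # no unused edges left: emit the vertex
    return path[::-1]


def assemble(kmers: list[str]) -> str:
    """Treat each k-mer as an edge between (k-1)-mers and spell out an Eulerian path."""
    edges = [(kmer[:-1], kmer[1:]) for kmer in kmers]
    path = eulerian_path(edges)
    return path[0] + "".join(node[-1] for node in path[1:])
```

Every edge is pushed and popped once, so the work is proportional to the number of k-mers. As with any spectrum-based reconstruction, the returned sequence is one that is consistent with the k-mers, not necessarily the only one.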

  32. Next Generation Sequencing (Illumina) In the case of de novo sequencing we have essentially the same fragment assembly problem as for SBH, only the number of DNA sequence fragments is much higher and the fragments are longer (~50-150 bp).

  33. Sequence mappers

  34. Sequence assembly – deBruijn graphs

  35. Sequence assembly – de Bruijn graphs D.Zerbino, E.Birney (2008) Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Research, Vol. 18:5, 821-829.
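
As a rough illustration of the idea (not of Velvet itself), a de Bruijn graph can be built from reads by cutting every read into k-mers and linking each k-mer's prefix to its suffix. The function name `de_bruijn` and the toy reads are assumptions made for this sketch. It ignores sequencing errors, coverage counts and reverse-complement strands, all of which a real assembler has to handle.

```python
from collections import defaultdict


def de_bruijn(reads: list[str], k: int) -> dict[str, set[str]]:
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in some read."""
    graph: dict[str, set[str]] = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            # Using a set merges k-mers that occur in several overlapping reads.
            graph[kmer[:-1]].add(kmer[1:])
    return dict(graph)
```

With the overlapping toy reads `["ATGGCG", "GGCGTG", "CGTGCA"]` and `k=4`, the graph is a single chain of seven vertices starting at `ATG`, and following it spells ATGGCGTGCA.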

  36. All you need to know about Molecular Biology 

  37. Central dogma of molecular biology DNA → (transcription) → RNA → (translation) → Protein
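
The two steps can be mimicked on strings. The sketch below is a simplification under stated assumptions: the input is the coding strand of the DNA, transcription is modelled as replacing T with U, and translation uses the standard genetic code, starting at the first AUG and stopping at the first stop codon. The names `transcribe`, `translate` and `CODON_TABLE` are illustrative.

```python
from itertools import product

# Standard genetic code, listed in the conventional T, C, A, G codon order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def transcribe(coding_strand: str) -> str:
    """Return the mRNA matching the coding strand, with T replaced by U."""
    return coding_strand.upper().replace("T", "U")


def translate(mrna: str) -> str:
    """Translate from the first AUG up to (not including) the first stop codon."""
    start = mrna.find("AUG")
    if start == -1:
        return ""
    protein = []
    for i in range(start, len(mrna) - 2, 3):
        aa = CODON_TABLE[mrna[i:i + 3].replace("U", "T")]
        if aa == "*":  # stop codon
            break
        protein.append(aa)
    return "".join(protein)
```

For example, `translate(transcribe("ATGGCCTAA"))` gives the one-letter amino acid string `"MA"`. Letters other than A, C, G and U in the mRNA raise a `KeyError` in this sketch.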

  38. DNA Four different nucleotides: adenine, guanine, cytosine and thymine. They are usually referred to as bases and denoted by their initial letters A, C, G and T.
      5' C-G-A-T-T-G-C-A-A-C-G-A-T-G-C 3'
         | | | | | | | | | | | | | | |
      3' G-C-T-A-A-C-G-T-T-G-C-T-A-C-G 5'
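
Because A always pairs with T and C with G, one strand determines the other. A minimal sketch of that rule, with the function names `complement` and `reverse_complement` chosen for this example:

```python
# Base-pairing rule: A pairs with T, C pairs with G.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def complement(strand: str) -> str:
    """Paired bases, position by position (reads 3'->5' if the input reads 5'->3')."""
    return strand.translate(COMPLEMENT)


def reverse_complement(strand: str) -> str:
    """The opposite strand written in the usual 5'->3' direction."""
    return complement(strand)[::-1]
```

Applied to the upper strand of the slide, `complement("CGATTGCAACGATGC")` gives `"GCTAACGTTGCTACG"`, which is the lower strand as drawn, and `reverse_complement` gives the same strand read from its own 5' end.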

  39. DNA

  40. DNA - Biology as an information science
      5' C-G-A-T-T-G-C-A-A-C-G-A-T-G-C 3'
         | | | | | | | | | | | | | | |
      3' G-C-T-A-A-C-G-T-T-G-C-T-A-C-G 5'
      Thus, for many information-related purposes, the molecule can be represented as CGATTGCAACGATGC. With four possible letters per position, the maximal amount of information that can be encoded in such a molecule is 2 bits times the length of the sequence. Noting that the distance between nucleotide pairs in DNA is about 0.34 nm, we can calculate that the linear information storage density in DNA is about 6×10^7 bits/cm, which is approximately 7 MB per cm.
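
The density estimate is easy to recompute. The sketch below takes the 0.34 nm spacing and the 2 bits per base from the slide; the constant and function names, and the choice of decimal megabytes, are assumptions of this example. With these inputs the result is close to 6×10^7 bits per cm, or a little over 7 MB per cm.

```python
import math

BITS_PER_BASE = math.log2(4)  # four letters A, C, G, T -> 2 bits per position
BASE_SPACING_NM = 0.34        # distance between neighbouring base pairs (from the slide)
NM_PER_CM = 1e7               # 1 cm = 10^7 nm


def bits_per_cm() -> float:
    """Upper bound on the information stored along 1 cm of DNA."""
    return BITS_PER_BASE * NM_PER_CM / BASE_SPACING_NM


def megabytes_per_cm() -> float:
    """The same bound in decimal megabytes (8 bits per byte, 10^6 bytes per MB)."""
    return bits_per_cm() / 8 / 1e6
```

The same arithmetic applied to a sequence of 3 billion base pairs gives 6×10^9 bits, i.e. about 0.75 GB.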

  41. DNA replication – copying the information

  42. Polymerase chain reaction – PCR – Xeroxing the DNA

  43. Genome sequencing • Reading the nucleotides in the DNA molecule and storing the readout in a computer • Basic technology ideas • A version of PCR • Separation of molecules by chemical properties such as weight or length of the DNA • Molecule labelling, fluorescent labelling in particular • DNA fragmentation into random-length pieces

  44. Anatomy of a chromosome • Centromeres are the largest constriction of the chromosome • Site of attachment of spindle fibers • Hundreds of thousands of copies of a 171 base pair repeat, called alpha satellite sequences • Centromere-associated proteins are bound [Adapted from R.Yasbin]

  45. Genomes, chromosomes A genome is a set of DNA molecules. Each chromosome contains one (long) DNA molecule. The 23 human chromosomes

  46. Genome sizes Information in the human genome: up to about 0.75 GB (3 billion base pairs at 2 bits each)

  47. www.ensembl.org

  48. Genomes and genes [Figure: structure of a gene and its expression. DNA: control statements, TATA box, start, gene, termination (stop). Transcription (RNA polymerase) gives mRNA: 5' UTR, ribosome binding site, 3' UTR. Translation (ribosome) gives protein.]

  49. Chromosomes - Eukaryotes

  50. Chromosomes - Prokaryotes Two subgroups: Archaea and Bacteria
