Bioinformatics For MNW 2nd Year

Presentation Transcript

  1. Bioinformatics For MNW 2nd Year Jaap Heringa FEW/FALW Integrative Bioinformatics Institute VU (IBIVU) heringa@cs.vu.nl

  2. Current Bioinformatics Unit • Jens Kleinjung (1/11/02) • Victor Simosis – PhD (1/12/02) • Radek Szklarczyk - PhD (1/01/03) • John Romein (1/12/02, Henri Bal)

  3. Bioinformatics course 2nd year MNW spring 2003 • Pattern recognition • Supervised/unsupervised learning • Types of data, data normalisation, lacking data • Search image • Similarity tables • Clustering • Principal component analysis • Discriminant analysis

  4. Bioinformatics course 2nd year MNW spring 2003 • Protein • Folding • Structure and function • Protein structure prediction • Secondary structure • Tertiary structure • Function • Post-translational modification • Prot.-Prot. Interaction -- Docking algorithm • Molecular dynamics/Monte Carlo

  5. Bioinformatics course 2nd year MNW spring 2003 • Sequence analysis • Pairwise alignment • Dynamic programming (NW, SW, shortcuts) • Multiple alignment • Combining information • Database/homology searching (Fasta, Blast, Statistical issues-E/P values)

  6. Bioinformatics course 2nd year MNW spring 2003 • Gene structure and gene finding algorithm • Omics • DNA makes RNA makes protein • Expression data, Nucleus to ribosome, translation, etc. • Metabolomics • Physiomics • Databases • DNA, EST • Protein sequence • Protein structure

  7. Bioinformatics course 2nd year MNW spring 2003 • Microarray data • Protein structure (PDB) • Proteomics • Mass spectrometry/NMR/X-ray?

  8. Bioinformatics course 2nd year MNW spring 2003 • Bioinformatics method development • IPR issues • Programming and scripting languages • Web solutions • Computational issues • NP-complete problems • CPU, memory, storage problems • Parallel computing • Bioinformatics method usage/application • Molecular viewers (RasMol, MolMol, etc.)

  9. Gathering knowledge Rembrandt, 1632 • Anatomy, architecture • Dynamics, mechanics • Informatics (Cybernetics – Wiener, 1948) (Cybernetics has been defined as the science of control in machines and animals, and hence it applies to technological, animal and environmental systems) • Genomics, bioinformatics Newton, 1726

  10. Bioinformatics Chemistry Biology Molecular biology Mathematics Statistics Bioinformatics Computer Science Informatics Medicine Physics

  11. Bioinformatics “Studying informational processes in biological systems” (Hogeweg, early 1970s) • No computers necessary • Back of envelope OK “Information technology applied to the management and analysis of biological data” (Attwood and Parry-Smith) Applying algorithms with mathematical formalisms in biology (genomics) -- USA

  12. Bioinformatics in the olden days • Close to Molecular Biology: • (Statistical) analysis of protein and nucleotide structure • Protein folding problem • Protein-protein and protein-nucleotide interaction • Many essential methods were created early on (BG era) • Protein sequence analysis (pairwise and multiple alignment) • Protein structure prediction (secondary, tertiary structure)

  13. Bioinformatics in the olden days (Cont.) • Evolution was studied and methods created • Phylogenetic reconstruction (clustering – NJ method)

  14. The Human Genome -- 26 June 2000

  15. The Human Genome -- 26 June 2000 Dr. Craig Venter Celera Genomics -- Shotgun method Sir John Sulston Human Genome Project

  16. Human DNA • There are about 3bn ($3 \times 10^{9}$) nucleotides in the nucleus of almost all of the trillions ($3.5 \times 10^{12}$) of cells of a human body (an exception is, for example, red blood cells, which have no nucleus and therefore no DNA) – a total of ~$10^{22}$ nucleotides! • Many DNA regions code for proteins, and are called genes (1 gene codes for 1 protein in principle) • Human DNA contains ~30,000 expressed genes • Deoxyribonucleic acid (DNA) comprises 4 different types of nucleotides: adenine (A), thymine (T), cytosine (C) and guanine (G). These nucleotides are sometimes also called bases

  17. Human DNA (Cont.) • All people are different, but the DNA of different people varies by only 0.2% or less, so at most 2 letters in 1000 are expected to differ. Over the whole genome of about $3 \times 10^{9}$ letters, this means that up to about 6 million letters would differ between individuals. • The structure of DNA is the so-called double helix, discovered by Watson and Crick in 1953, where the two helices are cross-linked by A-T and C-G base-pairs (nucleotide pairs – so-called Watson-Crick base pairing).
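The orders of magnitude on the two Human DNA slides can be checked with a few lines of arithmetic. This is only a sketch using the figures quoted on the slides; the constant and function names are chosen here for illustration. Note that 0.2% of $3 \times 10^{9}$ is 6 million, while a variation of 0.1% would give 3 million.

```python
# Figures as quoted on the slides (not independent measurements).
NUCLEOTIDES_PER_NUCLEUS = 3e9   # about 3bn nucleotides per nucleus
CELLS_PER_BODY = 3.5e12         # cell count as stated on the slide

# Total nucleotides in the body, ignoring cells without a nucleus.
total_nucleotides = NUCLEOTIDES_PER_NUCLEUS * CELLS_PER_BODY


def differing_letters(fraction):
    """Expected number of differing letters for a given fraction of variation."""
    return NUCLEOTIDES_PER_NUCLEUS * fraction


print(f"total nucleotides: {total_nucleotides:.2e}")
print(f"differences at 0.2%: {differing_letters(0.002):,.0f}")
print(f"differences at 0.1%: {differing_letters(0.001):,.0f}")
```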

  18. Up to here 3/2 – 10.45-12.30

  19. DNA compositional biases • Base composition of genomes: • E. coli: 25% A, 25% C, 25% G, 25% T • P. falciparum (Malaria parasite): 82%A+T • Translation initiation: • ATG is the near universal motif indicating the start of translation in DNA coding sequence.
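Base composition and ATG motifs are easy to compute for any sequence. The sketch below is illustrative only: the function names and the toy sequence are assumptions, not part of the course material, and an ATG found this way is merely a candidate start site.

```python
from collections import Counter


def base_composition(seq):
    """Return the fraction of A, C, G and T in a DNA sequence."""
    counts = Counter(base for base in seq.upper() if base in "ACGT")
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no A, C, G or T")
    return {base: counts[base] / total for base in "ACGT"}


def at_content(seq):
    """Return the combined A+T fraction of a DNA sequence."""
    comp = base_composition(seq)
    return comp["A"] + comp["T"]


def find_atg(seq):
    """Return the 0-based positions of every ATG triplet in the sequence."""
    seq = seq.upper()
    return [i for i in range(len(seq) - 2) if seq[i:i + 3] == "ATG"]


toy = "TTATGGCATGA"  # hypothetical toy sequence
print(base_composition(toy))
print(at_content(toy))
print(find_atg(toy))
```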

  20. Some facts about human genes • Comprise about 3% of the genome • Average gene length: ~ 8,000 bp • Average of 5-6 exons/gene • Average exon length: ~200 bp • Average intron length: ~2,000 bp • ~8% genes have a single exon • Some exons can be as small as 1 or 3 bp. • HUMFMR1S is not atypical: 17 exons 40-60 bp long, comprising 3% of a 67,000 bp gene

  21. Genetic diseases • Many diseases run in families and are a result of genes which predispose such family members to these illnesses • Examples are Alzheimer’s disease, cystic fibrosis (CF), breast or colon cancer, or heart diseases. • Some of these diseases can be caused by a problem within a single gene, such as with CF.

  22. Genetic diseases (Cont.) • For other illnesses, like heart disease, at least 20-30 genes are thought to play a part, and it is still unknown which combination of problems within which genes is responsible. • A “problem” within a gene means that a single nucleotide, or a combination of nucleotides within the gene, causes the disease (or prevents the body from fighting the disease sufficiently). • Persons with different combinations of these nucleotides could then be unaffected by these diseases.

  23. Genetic diseases (Cont.): Cystic Fibrosis • Known since very early on (“Celtic gene”) • Inherited autosomal recessive condition (Chr. 7) • Symptoms: • Clogging and infection of lungs (early death) • Intestinal obstruction • Reduced fertility and (male) anatomical anomalies • CF gene CFTR has 3-bp deletion leading to Del508 (Phe) in 1480 aa protein (epithelial Cl⁻ channel) – protein degraded in ER instead of inserted into cell membrane

  24. Genomic Data Sources • DNA/protein sequence • Expression (microarray) • Proteome (X-ray, NMR, mass spectrometry) • Metabolome • Physiome (spatial, temporal) • Integrative bioinformatics

  25. Genomic Data Sources Vertical Genomics genome transcriptome proteome metabolome physiome Dinner discussion: Integrative Bioinformatics & Genomics VU

  26. A gene codes for a protein • DNA → (transcription) → mRNA → (translation) → Protein • DNA: CCTGAGCCAACTATTGATGAA • mRNA: CCUGAGCCAACUAUUGAUGAA • Protein: PEPTIDE
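The example on this slide can be reproduced with the standard genetic code. The following is a minimal sketch: the function names are illustrative, the DNA is assumed to be the coding strand read in frame from its first base, and translation simply stops at the first stop codon.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def transcribe(dna):
    """Coding-strand DNA to mRNA: every T becomes U."""
    return dna.upper().replace("T", "U")


def translate(mrna):
    """Translate mRNA codon by codon until a stop codon or the end."""
    dna = mrna.upper().replace("U", "T")
    protein = []
    for pos in range(0, len(dna) - 2, 3):
        amino_acid = CODON_TABLE[dna[pos:pos + 3]]
        if amino_acid == "*":  # stop codon
            break
        protein.append(amino_acid)
    return "".join(protein)


mrna = transcribe("CCTGAGCCAACTATTGATGAA")
print(mrna)             # CCUGAGCCAACUAUUGAUGAA
print(translate(mrna))  # PEPTIDE
```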

  27. Humans have spliced genes…

  28. DNA makes RNA makes Protein

  29. Remark • The problem of identifying (annotating) human genes is considerably harder than the early success story for β-globin might suggest. • The human factor VIII gene (whose mutations cause hemophilia A) is spread over ~186,000 bp. It consists of 26 exons ranging in size from 69 to 3,106 bp, and its 25 introns range in size from 207 to 32,400 bp. The complete gene is thus ~9 kb of exon and ~177 kb of intron. • The biggest human gene yet is for dystrophin. It has > 30 exons and is spread over 2.4 million bp.

  30. DNA makes RNA makes Protein: Expression data • More copies of mRNA for a gene lead to more protein • mRNA can now be measured for all the genes in a cell at once through microarray technology • Can have 60,000 spots (genes) on a single gene chip • Colour change gives intensity of gene expression (over- or under-expression)

  31. Metabolic networks: Glycolysis and Gluconeogenesis • KEGG database (Japan)

  32. High-throughput Biological Data • Enormous amounts of biological data are being generated by high-throughput capabilities; even more are coming • genomic sequences • gene expression data • mass spec. data • protein-protein interaction • protein structures • ......

  33. Protein structural data explosion Protein Data Bank (PDB): 14500 Structures (6 March 2001) 10900 x-ray crystallography, 1810 NMR, 278 theoretical models, others...

  34. Dickerson’s formula: equivalent to Moore’s law • $n = e^{0.19(y-1960)}$, with $y$ the year • On 27 March 2001 there were 12,123 3D protein structures in the PDB: Dickerson’s formula predicts 12,066 (within 0.5%)!

  35. Sequence versus structural data • Despite structural genomics efforts, growth of PDB slowed down in 2001-2002 (i.e. did not keep up with Dickerson’s formula) • More than 100 completely sequenced genomes • Increasing gap between structural and sequence data

  36. Bioinformatics • Large – external (integrative) • Science Human Planetary Science Cultural Anthropology Population Biology Sociology Sociobiology Psychology Systems Biology Biology Medicine Molecular Biology Chemistry Physics • Small – internal (individual)

  37. Bioinformatics • Offers an ever more essential input to • Molecular Biology • Pharmacology (drug design) • Agriculture • Biotechnology • Clinical medicine • Anthropology • Forensic science • Chemical industries (detergent industries, etc.)

  38. High-throughput Biological Data: The data deluge • Hidden in these data is information that reflects the existence, organization, activity, functionality …… of biological machineries at different levels in living organisms • Most effectively utilising this information will prove to be essential for Integrative Bioinformatics

  39. Data Issues …… • Data collection: getting the data • Data representation: data standards, data normalisation ….. • Data organisation and storage: database issues ….. • Data analysis and data mining: discovering “knowledge”, patterns/signals, from data, establishing associations among data patterns • Data utilisation and application: from data patterns/signals to models for bio-machineries • Data visualization: viewing complex data …… • Data transmission: data collection, retrieval, ….. • ……

  40. Up to here 5/2

  41. Bioinformatics • “Nothing in Biology makes sense except in the light of evolution” (Theodosius Dobzhansky (1900-1975)) • “Nothing in bioinformatics makes sense except in the light of Biology”

  42. Pair-wise alignment • T D W V T A L K / T D W L - - I K • Combinatorial explosion: 1 gap in 1 sequence: $n+1$ possibilities; 2 gaps in 1 sequence: $(n+1)n$; 3 gaps in 1 sequence: $(n+1)n(n-1)$, etc. • Number of alignments: $\binom{2n}{n} = \frac{(2n)!}{(n!)^2} \approx \frac{2^{2n}}{\sqrt{\pi n}}$ • 2 sequences of 300 a.a.: ~$10^{88}$ alignments • 2 sequences of 1000 a.a.: ~$10^{600}$ alignments!
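The combinatorial explosion can be made concrete by evaluating the central binomial coefficient $\binom{2n}{n}$ exactly and comparing it with the Stirling approximation $2^{2n}/\sqrt{\pi n}$. This is a sketch with illustrative function names; it assumes Python 3.8 or later for `math.comb`. For n = 1000 the count is on the order of $10^{600}$, in line with the slide. For n = 300 the same formula gives roughly $10^{179}$, so the figure quoted for 300 residues appears to rest on a different counting convention or sequence length.

```python
import math


def central_binomial(n):
    """Exact value of C(2n, n)."""
    return math.comb(2 * n, n)


def stirling_log10(n):
    """log10 of the Stirling approximation 2^(2n) / sqrt(pi * n)."""
    return 2 * n * math.log10(2) - 0.5 * math.log10(math.pi * n)


for n in (10, 300, 1000):
    exact = math.log10(central_binomial(n))
    approx = stirling_log10(n)
    print(f"n={n}: log10 exact={exact:.2f}, log10 Stirling={approx:.2f}")
```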

  43. Dynamic programming: Scoring alignments • $S_{a,b} = \sum_{l} s(a_l, b_l) + \sum_{k} gp(k)$ • $gp(k) = p_i + k\,p_e$ (affine gap penalties) • $p_i$ and $p_e$ are the penalties for gap initialisation and extension, respectively

  44. Dynamic programming: Scoring alignments • T D W V T A L K / T D W L - - I K • Inputs: Amino Acid Exchange Matrix; Gap penalties (open, extension) (figure labels: 2020 10 1) • Score: s(T,T) + s(D,D) + s(W,W) + s(V,L) + Po + 2Px + s(L,I) + s(K,K)
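The scoring scheme of the last two slides can be sketched as a function that sums exchange scores over aligned residue pairs and charges an affine penalty for each run of gaps. Several things here are assumptions for illustration: the toy exchange function (2 for identity, -1 otherwise) stands in for a real amino acid exchange matrix, the penalty values 3 and 1 are hypothetical, and penalties are given as positive numbers that are subtracted from the score.

```python
def gap_penalty(k, p_open, p_ext):
    """Affine gap penalty for a gap of k positions: p_open + k * p_ext."""
    return p_open + k * p_ext


def score_alignment(row_a, row_b, exchange, p_open, p_ext):
    """Score two aligned rows of equal length, with '-' marking a gap."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have equal length")
    total = 0.0
    gap_len = 0      # length of the gap run currently open
    gap_row = None   # which row holds the current gap: "a", "b" or None
    for x, y in zip(row_a, row_b):
        if x == "-" and y == "-":
            raise ValueError("a column cannot contain two gaps")
        current = "a" if x == "-" else ("b" if y == "-" else None)
        if gap_len and current != gap_row:
            # The gap run has ended (or switched rows): charge it once.
            total -= gap_penalty(gap_len, p_open, p_ext)
            gap_len = 0
        if current is None:
            total += exchange(x, y)
        else:
            gap_len += 1
        gap_row = current
    if gap_len:
        total -= gap_penalty(gap_len, p_open, p_ext)
    return total


def toy_exchange(x, y):
    """Hypothetical exchange scores, not a real substitution matrix."""
    return 2 if x == y else -1


print(score_alignment("TDWVTALK", "TDWL--IK", toy_exchange, 3, 1))
```

With these toy values the six aligned pairs contribute 2 + 2 + 2 - 1 - 1 + 2 = 6 and the single gap of length 2 costs 3 + 2 × 1 = 5.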

  45. Pairwise sequence alignment: Global dynamic programming • Sequences (related by evolution): MDAGSTVILCFVG and MDAASTILCGS • Inputs: Amino Acid Exchange Matrix; Gap penalties (open, extension) • Search matrix • Resulting alignment: MDAGSTVILCFVG- / MDAAST-ILC--GS

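  46. Global dynamic programming • $S_{i,j} = s_{i,j} + \max \left\{ S_{i-1,j-1},\ \max_{0<x<i-1} \left( S_{x,j-1} - P_i - (i-x-1)P_x \right),\ \max_{0<y<j-1} \left( S_{i-1,y} - P_i - (j-y-1)P_x \right) \right\}$, where $P_i$ and $P_x$ are the gap initialisation and extension penalties (the subscript in $P_i$ is a label, not the row index $i$)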

  47. Global dynamic programming

  48. Global dynamic programming
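The recurrence for global dynamic programming can be sketched directly, filling each cell from the diagonal predecessor or from any earlier cell in the previous column or previous row, minus an affine gap penalty. The sketch below makes some assumptions that the slides do not state: end gaps are not penalised (the first row and column hold only the exchange score, and the result is the maximum over the last row and last column), only the score is returned rather than the alignment itself, and the scoring function and penalty values are hypothetical. This direct form takes time proportional to m·n·(m+n); faster formulations exist.

```python
def global_alignment_score(a, b, exchange, p_open, p_ext):
    """Best score under S[i][j] = s(i,j) + max(diagonal, gapped predecessors)."""
    m, n = len(a), len(b)
    if m == 0 or n == 0:
        raise ValueError("both sequences must be non-empty")
    S = [[0.0] * n for _ in range(m)]
    for i in range(m):
        for j in range(n):
            s_ij = exchange(a[i], b[j])
            if i == 0 or j == 0:
                # Assumption: leading end gaps are free.
                S[i][j] = s_ij
                continue
            best = S[i - 1][j - 1]  # no gap
            # Gap skipping residues x+1 .. i-1 of sequence a.
            for x in range(i - 1):
                best = max(best, S[x][j - 1] - p_open - (i - x - 1) * p_ext)
            # Gap skipping residues y+1 .. j-1 of sequence b.
            for y in range(j - 1):
                best = max(best, S[i - 1][y] - p_open - (j - y - 1) * p_ext)
            S[i][j] = s_ij + best
    # Assumption: trailing end gaps are free, so take the best cell
    # in the last row or the last column.
    last_row = max(S[m - 1])
    last_col = max(S[i][n - 1] for i in range(m))
    return max(last_row, last_col)


def match_mismatch(x, y):
    """Hypothetical scoring: +1 for identity, -1 otherwise."""
    return 1 if x == y else -1


print(global_alignment_score("ACGT", "ACGT", match_mismatch, 1, 0.5))
print(global_alignment_score("ACGT", "ACT", match_mismatch, 1, 0.5))
print(global_alignment_score("MDAGSTVILCFVG", "MDAASTILCGS", match_mismatch, 1, 0.5))
```

In the second call the best path aligns A, C and T and skips the G, giving 3 matches minus one gap of length 1 (1 + 0.5).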

  49. Up to here 17/02/03