
Sequence Alignment Algorithms – Application to Bioinformatics Tool Development

Sequence Alignment Algorithms – Application to Bioinformatics Tool Development. Dr. S. Parthasarathy Reader and Head Department of Bioinformatics Bharathidasan University Tiruchirappalli – 620 024 (E-mail: partha@cnld.bdu.ac.in ). Plan. Introduction to Bioinformatics


Presentation Transcript


  1. Sequence Alignment Algorithms – Application to Bioinformatics Tool Development Dr. S. Parthasarathy Reader and Head Department of Bioinformatics Bharathidasan University Tiruchirappalli – 620 024 (E-mail: partha@cnld.bdu.ac.in) Dr.S.Parthasarathy, Bharathidasan Univ., Trichy

  2. Plan • Introduction to Bioinformatics • Sequence alignment algorithms • Global alignment: Needleman-Wunsch algorithm • Local alignment: Smith-Waterman algorithm • PredictFold – predicting a fold for a protein sequence • Methodology • Algorithm, Coding & Tool Development • Benchmarking • Conclusions

  3. Introduction • Why do we need Bioinformatics? • What is Bioinformatics? • Where is Bioinformatics used?

  4. Why? • Biological Data Explosion • How did the Biological Data Explosion happen? • Sequence databases are much LARGER than structure databases • Why so?

  5. Introduction – Biological Data: Genome Projects (Latest Revolution) • On 26 June 2000: announcement of completion of the draft of the ‘Human Genome’ (‘Genetic Code of Human Life is Cracked by Scientists’) • Human Genome contains 3.2 x 10^9 bps • Unit of (genome) sequence length • bps (base pairs) • Mbps (Mega base pairs) = 10^6 bps • Gbps (Giga base pairs) = 10^9 bps • huge (human genome equivalent) = 3.2 Gbps • Unit of genetic distance • centiMorgan (cM) - arbitrary unit, named for Thomas Hunt Morgan (e.g. 1 cM corresponds to a recombination frequency of 0.01)

  6. Introduction – Biological Data: Genome Projects • [Figures dated 15 February 2001 and 16 February 2001; images not reproduced in this transcript]

  7. Biological Data: Recombinant DNA Technology (Old Revolution) • 1940 – Role of DNA as the genetic material was confirmed • 1953 – Discovery of DNA structure by James Watson & Francis Crick • 1966 – Establishment of the Genetic Code • 1967 – DNA ligase was isolated (joins two strands of DNA together) – Molecular Glue • 1970 – Isolation of a restriction enzyme – Molecular Scissors • 1972 – Recombinant DNA molecules were generated at Stanford University, USA • 1973 – DNA fragments were joined to the plasmid pSC101 isolated from E. coli; the recombinant plasmids could replicate when introduced into E. coli • The discoveries of 1972 & 1973 triggered the biggest scientific revolution – Genetic Engineering

  8. Biological Data explosion • GenBank, NCBI, USA • 44 Gbps of DNA & 40 million sequences (up to 2004) • GenBank, National Center for Biotechnology Information, USA • Protein Data Bank (PDB), RCSB, USA • 29,000 structures (2004) • PDB, Research Collaboratory for Structural Bioinformatics, USA • QUALITY of Data - HIGH • Experimental error in modern genomic sequencing is extremely low • QUANTITY of Data - HUGE • With recombinant DNA technology & genomic sequencing, the size of sequence databases is increasing very rapidly • SEQUENCE versus STRUCTURE Databases • Sequence databases are much LARGER than structure databases • This leads to Bioinformatics

  9. What? • What is Bioinformatics? • Define Bioinformatics

  10. Bioinformatics - Definition • [Slide graphic: the recurrence $F(i,j) = \max\{F(i-1,j-1) + s(x_i,y_j),\ F(i-1,j) - d,\ F(i,j-1) - d\}$, the DNA string atcggcatgcatcagtcatgcaactg and the peptide string PEPTIDESEQSEDITPEP] • Bioinformatics is an integration of mathematical, statistical and computer methods to analyze biological data. We use computer programs to make inferences from the biological data, to make connections among them and to derive useful and interesting predictions. • The marriage of biology and computer science has created a new field called ‘Bioinformatics’. - Arthur M. Lesk

  11. Biology – Basic Definitions • Cell - the building block of living organisms • Eukaryotic cells or organisms have the nucleus separated from the cytoplasm by a nuclear membrane, and the genetic material is borne on a number of chromosomes consisting of DNA and protein • Chromosome • The physical basis of heredity. Deeply staining rod-like structures present within the nuclei of eukaryotes • Contains DNA and protein arranged in a compact manner • Replicates identically during cell division • The same number of chromosomes is present in the cells of a particular species (e.g. human: 22 autosomes plus X and Y)

  12. Genome – Basic Definitions • Genome • A complete set of chromosomes inherited from one parent • Gene • One of the units of inherited material carried on chromosomes. Genes are arranged in a linear fashion on DNA. Each represents one character, which is recognized by its effect on the individual bearing the gene in its cells. There are many thousands of genes in each nucleus. • DNA (Deoxyribonucleic Acid) • DNA is made up of FOUR bases a t g c – adenine, thymine, guanine, cytosine • Protein • Protein is made up of TWENTY different amino acids A T G C ... – Alanine, Threonine, Glycine, Cysteine, …

  13. Central Dogma • DNA → (transcription) → mRNA → (translation) → Protein • DNA: CCTGAGCCAACTATTGATGAA • mRNA: CCUGAGCCAACUAUUGAUGAA • Protein: PEPTIDE
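
The transcription and translation steps on this slide can be reproduced with a short Python sketch. It assumes the DNA string is the coding strand (so transcription is just T → U) and that translation starts at the first base; the function names `transcribe` and `translate` are illustrative and not part of the presentation.

```python
# Standard genetic code, built from the conventional U/C/A/G codon ordering.
BASES = "UCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}


def transcribe(dna: str) -> str:
    """Coding-strand DNA -> mRNA: replace thymine (T) with uracil (U)."""
    return dna.upper().replace("T", "U")


def translate(mrna: str) -> str:
    """Read codons from position 0 until a stop codon ('*') or the end."""
    protein = []
    usable = len(mrna) - len(mrna) % 3  # ignore an incomplete trailing codon
    for pos in range(0, usable, 3):
        amino_acid = CODON_TABLE[mrna[pos:pos + 3]]
        if amino_acid == "*":
            break
        protein.append(amino_acid)
    return "".join(protein)


mrna = transcribe("CCTGAGCCAACTATTGATGAA")
print(mrna)             # CCUGAGCCAACUAUUGAUGAA
print(translate(mrna))  # PEPTIDE
```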

  14. Genome Data – Human & Model Organisms • Most mapping and sequencing technologies were developed from studies of simpler non-human organisms • Non-human/model organisms • Bacterium Escherichia coli - 4.6 Mbp • Yeast Saccharomyces cerevisiae - 12.1 Mbp • Fruit fly Drosophila melanogaster - 180.0 Mbp • Roundworm C. elegans - 95.5 Mbp • Laboratory mouse Mus musculus - 3.0 Gbp • Human – more complex genome • Human Homo sapiens - 3.2 Gbp

  15. Genome Data – Human (Homo sapiens) • Genome: 1 • Chromosomes: 23 • Genes / DNAs: ~30,000 • Nucleotides: 3.2 x 10^9 bps

  16. Bioinformatics in Genome Research • Data Collection and Interpretation • Collecting and Storing Data • Sequences generated by genome research will be used as a primary information source for human biology and medicine • The vast amount of data produced will first need to be collected, stored and distributed • Interpretation of Data • Recognizing where genes begin and end • Searching a database for a particular DNA sequence may uncover homologous sequences in a known gene from a model organism, revealing insights into the function of the corresponding human gene

  17. Understanding Gene Function • Correct protein function depends on the 3-D or folded structure the protein assumes in biological environments • Understanding protein structure will be essential in determining gene function • [Diagram linking Gene, Protein, Function and Structure]

  18. Where? • Where is Bioinformatics used? • What are the uses of Bioinformatics? • Applications of Bioinformatics

  19. Bioinformatics Tasks • Sequence Analysis (Protein sequences) • Similarity & Homology • pairwise local/global alignment • GCG – Seqlab & Seqweb • Scoring Matrices - PAM, BLOSUM • Database Search • BLAST, FASTA • Multiple alignment • ClustalW, PRINTS, BLOCKS • Secondary Structure Prediction (from Sequence) • Proteins – α-Helix, β-Sheet, Turn or coil • Protein Folding

  20. Bioinformatics Tasks • Structure analysis – Experimental Determination • X-ray crystallography – 3-dimensional coordinates – Structure • Nuclear Magnetic Resonance (NMR) • PDB – Protein Data Bank • RasMol – Molecular Viewing Software • High-throughput crystallographic structure determination • High flux synchrotron radiation sources (data collection) • Multi-wavelength anomalous diffraction (MAD) method (data interpretation) • Bioinformatics - Structure Prediction • Homology Modelling – InsightII, SwissPDBViewer, Biosuite • ‘ab initio’ method - Monte Carlo Simulation • Protein Structure Classification • SCOP - Structural Classification Of Proteins • CATH - Class, Architecture, Topology, Homologous superfamily • FSSP - Fold classification based on Structure-Structure alignment of Proteins – obtained by DALI (Distance-matrix ALIgnment)

  21. Bioinformatics Tasks • Protein Engineering • Mutations • Alter a particular amino acid/base for a desired effect • Site-directed mutagenesis • Identify the potential sites where alterations can be made • Applications • Agricultural – Genetically Modified Plants, Vegetables, GM Food • Pharmaceutical – Molecular modelling based Drug Design • Medical – Gene Therapy • DNA Bending • Application to Genomes (Ref: M.G.Munteanu, K.Vlahovicek, S.Parthasarathy, I.Simon and S.Pongor, Rod Models of DNA: Sequence-dependent anisotropic elastic modelling of local phenomena, Trends in Biochemical Sciences, 23 (1998) 341-347)

  22. Bioinformatics Tasks – Genomics & Proteomics • Genomics is the study of the structure, content, evolution and functions of genes in genomes • Aims of Genomics • To establish an integrated web based database and research interface • To assemble Physical, Genetic and Cytological maps of the Genome • To identify and annotate the complete set of genes encoded within a genome • To provide the resources for comparison with other genomes

  23. Proteomics – Proteome • Proteome is the complete collection of proteins in a cell/tissue/organism at a particular time. Unlike genomes, which are stable over the lifetime of the organism, proteomes change rapidly as each cell responds to its changing environment and produces new proteins in different amounts. • Genome is a more stable entity. An organism has only one genome but many proteomes. • For an organism, there may be • one body-wide proteome • about 200 tissue proteomes • about a trillion (~10^12) individual cell proteomes.

  24. Proteomics – Definition • The study of proteomes that includes determining the 3D shapes of proteins, their roles inside cells, the molecules with which they interact, and defining which proteins are present and how much of each is present at a given time.

  25. Proteomics – Applications • To correlate proteins on the basis of their expression profiles. • To observe patterns in protein synthesis; changes in these patterns can be used as an indicator of the state of the cell and its gene expression. • To characterize bacterial pathogens and to develop novel antimicrobials. • To identify regions of the bacterial genome that encode pathogenic determinants. • To develop drugs and in toxicology – Structural Proteomics • Proteomics as a tool for plant genetics and breeding

  26. Systems Biology • Systems Biology is a new perspective and an emerging field of research in the post-genomic era. • It aims at a system-level understanding of biological systems. • It studies whole cells/tissues/organisms not by the traditional reductionist approach but by holistic means, in an iterative attempt to model the complete cell/tissue/organism. • It treats life as arising from an integrated and interacting network of genes, proteins and biochemical reactions.

  27. Systems Biology

  28. Sequence Alignment Algorithms • Similarity and Homology • Sequence Comparison - Issues • Types of alignments • Algorithms Used

  29. Sequence similarity and homology • Nature is a tinkerer and not an inventor. New sequences are adapted from pre-existing sequences rather than invented de novo. There exists significant similarity between a new sequence and already known sequences. – Fortunate for computational sequence analysis • Similarity – Measurement of resemblance and differences, independent of the source of resemblance. • Homology – The sequences and the organisms in which they occur are descended from a common ancestor. • If two related sequences are homologous, then we can transfer information about structure and/or function, by homology.

  30. 3-D Structure and Homology • 3-D structure patterns (motifs) of proteins are much more evolutionarily conserved than amino acid sequences - this type of homology search could prove more fruitful • Particular motifs may serve similar functions in several different proteins, information that would be valuable in genome analysis • Only a few protein motifs can be recognised at the sequence level • Development of more analytic capabilities to facilitate grouping protein sequences into motif families will make homology searches more useful

  31. Sequence Comparison – Issues • Types of alignment • Global – end to end matching (Needleman-Wunsch) • Local – portions or subsequences matching (Smith-Waterman) • Scoring system used to rank alignments • PAM & BLOSUM matrices • Algorithms used to find optimal (or good) scoring alignments • Heuristic • Dynamic Programming • Hidden Markov Model (HMM) • Statistical methods used to evaluate the significance of an alignment score • Z-score, P-value and E-value

  32. Substitution Matrices – PAM and BLOSUM • PAM (Point Accepted Mutation) • BLOSUM (BLOcks SUbstitution Matrix) • Close sequences: PAM 40, BLOSUM 90 • Default: PAM 250, BLOSUM 62 • Distant sequences: PAM 500, BLOSUM 30

  33. Types of Algorithms • Heuristic – A heuristic is an algorithm that yields reasonable results even though it is not provably optimal and may lack any performance guarantee. Heuristic methods can be very fast, but they make additional assumptions and will miss the best match for some sequence pairs. • Dynamic Programming – An algorithm that finds optimal alignments under an additive alignment score by building them up from optimal alignments of shorter prefixes (discussed in the following slides). This type of algorithm is guaranteed to find the optimal scoring alignment or set of alignments. • HMM – Based on probability theory; very versatile.

  34. Global Alignment – Needleman-Wunsch Algorithm • Formula: $F(i,j) = \max\{\, F(i-1,j-1) + s(x_i,y_j)\ \text{(D)},\ F(i-1,j) - d\ \text{(H)},\ F(i,j-1) - d\ \text{(V)} \,\}$, where the labels D, H and V mark the three possible moves (the diagonal move and the two gap moves)

  35. Global Alignment – Needleman-Wunsch Algorithm • Gap penalties • Linear score: $f(g) = -gd$ • Affine score: $f(g) = -d - (g-1)e$ • d = gap open penalty, e = gap extend penalty • g = gap length • Trace back • Start from the value in the bottom right corner and trace back to the top left corner (i.e. the alignment always runs end to end). • Algorithm complexity • It takes O(nm) time and O(nm) memory, where n and m are the lengths of the sequences.
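
The recurrence, gap penalties and trace back described above can be put together in a minimal Python sketch. The match/mismatch scores stand in for a substitution matrix s(x_i, y_j) and, like the default value of d, are assumptions chosen for illustration; the dynamic programme uses the linear gap penalty only, and the affine formula is shown as a separate helper.

```python
def linear_gap(g, d):
    """Linear gap score f(g) = -g*d for a gap of length g."""
    return -g * d


def affine_gap(g, d, e):
    """Affine gap score f(g) = -d - (g-1)*e (d: open, e: extend)."""
    return -d - (g - 1) * e


def needleman_wunsch(x, y, match=1, mismatch=-1, d=2):
    """Global alignment with linear gap penalty d; O(nm) time and memory."""
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):          # first column: x aligned to gaps
        F[i][0] = linear_gap(i, d)
    for j in range(1, m + 1):          # first row: y aligned to gaps
        F[0][j] = linear_gap(j, d)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(F[i - 1][j - 1] + s,   # align x_i with y_j
                          F[i - 1][j] - d,       # x_i against a gap
                          F[i][j - 1] - d)       # y_j against a gap
    # Trace back from the bottom right corner to the top left corner.
    ax, ay = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            s = match if x[i - 1] == y[j - 1] else mismatch
        if i > 0 and j > 0 and F[i][j] == F[i - 1][j - 1] + s:
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif i > 0 and F[i][j] == F[i - 1][j] - d:
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return F[n][m], "".join(reversed(ax)), "".join(reversed(ay))


score, top, bottom = needleman_wunsch("ACGT", "AGT")
print(score)   # 1
print(top)     # ACGT
print(bottom)  # A-GT
```

When several alignments share the optimal score, this trace back returns just one of them (it prefers the diagonal move), which is a design choice of the sketch rather than part of the algorithm.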

  36. Local Alignment – Smith-Waterman Algorithm • Same as the global alignment algorithm, with TWO differences. • F(i,j) takes the value 0 (zero) if all other options are less than 0. • The alignment can end anywhere in the matrix. Take the highest value of F(i,j) over the whole matrix, start the trace back from there, and stop when a cell with value 0 is reached.

  37. Local Alignment – Smith-Waterman Algorithm • Formula: $F(i,j) = \max\{\, 0,\ F(i-1,j-1) + s(x_i,y_j)\ \text{(D)},\ F(i-1,j) - d\ \text{(H)},\ F(i,j-1) - d\ \text{(V)} \,\}$, where the 0 is taken if all other values are less than 0
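
A minimal Python sketch of the local alignment recurrence follows. As in the global case, the match/mismatch scores and the default gap penalty d are illustrative assumptions standing in for a real substitution matrix; the two differences from the global algorithm (the floor at 0 and the trace back from the highest cell) are marked in the comments.

```python
def smith_waterman(x, y, match=2, mismatch=-1, d=2):
    """Local alignment with linear gap penalty d.

    Returns (best score, aligned part of x, aligned part of y).
    """
    n, m = len(x), len(y)
    F = [[0] * (m + 1) for _ in range(n + 1)]   # first row and column stay 0
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            F[i][j] = max(0,                      # difference 1: never below 0
                          F[i - 1][j - 1] + s,
                          F[i - 1][j] - d,
                          F[i][j - 1] - d)
            if F[i][j] > best:
                best, best_pos = F[i][j], (i, j)
    # Difference 2: start at the highest cell and stop at the first 0.
    ax, ay = [], []
    i, j = best_pos
    while i > 0 and j > 0 and F[i][j] > 0:
        s = match if x[i - 1] == y[j - 1] else mismatch
        if F[i][j] == F[i - 1][j - 1] + s:
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif F[i][j] == F[i - 1][j] - d:
            ax.append(x[i - 1]); ay.append("-"); i -= 1
        else:
            ax.append("-"); ay.append(y[j - 1]); j -= 1
    return best, "".join(reversed(ax)), "".join(reversed(ay))


print(smith_waterman("TTACGTT", "GGACGGG"))  # (6, 'ACG', 'ACG')
```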

  38. Web based server development • Design the web page to collect the input data • Use a cgi-bin program or Perl script to parse the submitted data • Invoke the corresponding program to get the appropriate results • Send the results either by e-mail or directly to the web page
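
The parse / invoke / respond steps above can be sketched as follows. The slide names cgi-bin and Perl; this sketch uses Python's standard `urllib.parse` module instead, purely for consistency with the other examples. The form field name `sequence` and the functions `parse_submission`, `run_tool` and `handle_request` are hypothetical, and `run_tool` is a placeholder for whatever analysis program the server would actually call.

```python
from urllib.parse import parse_qs

VALID_RESIDUES = set("ACDEFGHIKLMNPQRSTVWY")  # 20 standard amino-acid letters


def parse_submission(body: str) -> str:
    """Extract and validate the 'sequence' field of a URL-encoded form body."""
    fields = parse_qs(body)
    raw = fields.get("sequence", [""])[0]
    seq = "".join(raw.split()).upper()        # drop whitespace and line breaks
    if not seq or not set(seq) <= VALID_RESIDUES:
        raise ValueError("sequence must use the 20 standard amino-acid letters")
    return seq


def run_tool(seq: str) -> str:
    """Placeholder for the analysis program invoked by the server."""
    return f"Received {len(seq)} residues"


def handle_request(body: str) -> str:
    """Parse the submitted data, invoke the tool, return text for the page."""
    try:
        seq = parse_submission(body)
    except ValueError as err:
        return f"Error: {err}"
    return run_tool(seq)


print(handle_request("sequence=PEP+TIDE"))  # Received 7 residues
```

Validating the input before invoking the program matters on a public server, since the submitted text comes from an untrusted web form.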

  39. Application to Bioinformatics Tool Development • PredictFold – predicting a fold for a protein sequence

  40. PredictFold – predicting a fold for a protein sequence • To predict possible folds for a given protein sequence whose structure is not known • To develop a fold recognition technique / tool that is sensitive in detecting folds of given protein sequences in the twilight zone (sequences sharing less than 25% identity) • Application of the fold recognition strategy to genomic annotation
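
The twilight-zone criterion can be illustrated with a short Python sketch that computes percent identity from two already-aligned sequences. Definitions of percent identity differ in the choice of denominator; this sketch assumes the number of aligned (gap-free) columns, and the function names are illustrative rather than taken from PredictFold.

```python
def percent_identity(aligned_x: str, aligned_y: str) -> float:
    """Percent of gap-free alignment columns holding identical residues."""
    if len(aligned_x) != len(aligned_y):
        raise ValueError("aligned sequences must have the same length")
    columns = [(a, b) for a, b in zip(aligned_x, aligned_y)
               if a != "-" and b != "-"]
    if not columns:
        return 0.0
    matches = sum(a == b for a, b in columns)
    return 100.0 * matches / len(columns)


def in_twilight_zone(identity: float, threshold: float = 25.0) -> bool:
    """True when identity falls below the 25% threshold quoted on the slide."""
    return identity < threshold


print(percent_identity("AAAA", "AAAT"))  # 75.0
print(in_twilight_zone(24.0))            # True
```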

  41. ‘Twilight Zone’ sequences – Example: Cytochrome Sequences
      • 256b
      >256B:A CYTOCHROME $B562 (OXIDIZED) - CHAIN A
      ADLEDNMETLNDNLKVIEKADNAAQVKDALTKMRAAALDAQKATPPKLEDKSPDSPEMKD
      FRHGFDILVGQIDDALKLANEGKVKEAQAAAEQLKTTRNAYHQKYR
      >256B:B CYTOCHROME $B562 (OXIDIZED) - CHAIN B
      ADLEDNMETLNDNLKVIEKADNAAQVKDALTKMRAAALDAQKATPPKLEDKSPDSPEMKD
      FRHGFDILVGQIDDALKLANEGKVKEAQAAAEQLKTTRNAYHQKYR
      • 2ccy
      >2CCY:A CYTOCHROME $C(PRIME) - CHAIN A
      QQSKPEDLLKLRQGLMQTLKSQWVPIAGFAAGKADLPADAAQRAENMAMVAKLAPIGWAK
      GTEALPNGETKPEAFGSKSAEFLEGWKALATESTKLAAAAKAGPDALKAQAAATGKVCKA
      CHEEFKQD
      >2CCY:B CYTOCHROME $C(PRIME) - CHAIN B
      QQSKPEDLLKLRQGLMQTLKSQWVPIAGFAAGKADLPADAAQRAENMAMVAKLAPIGWAK
      GTEALPNGETKPEAFGSKSAEFLEGWKALATESTKLAAAAKAGPDALKAQAAATGKVCKA
      CHEEFKQD

  42. Example – Sequence similarity • lalign output for 256b & 2ccy follows …

  43. Example – Cytochrome Structures • [Figure: cytochrome structures 256b and 2ccy (seq. similarity 24%)]

  44. Goals • Exploration of suitable fold recognition techniques that are sensitive in detecting similar folds despite low sequence similarity • Identification of functional motifs in proteins at the sequence (1D) and structure (3D) level • Development of a protocol that aids in the rapid classification and annotation of genomic data based on functional motifs

  45. Methodology • Reduction of 3D-structure to 1D-environment string. Environment at each residue position is a function of local secondary structure and extent of exposure to the solvent (based on 3D-1D profile method developed by Eisenberg et al., 1991). • Extract residue environment profiles of the available protein structures. • A scoring matrix is generated from a library of profiles. Each matrix element is the information value of a residue in the given environment. • A library of environment strings is created for the available protein fold structures. • The probe sequence is queried against this library to look for best matches.
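
The scoring-matrix step can be sketched in Python as a log-odds table. The slide says only that each element is the "information value" of a residue in an environment; the form ln(P(residue | environment) / P(residue)) used here is an assumption modelled on 3D-1D profile scoring, and the pseudocount, the toy observations and the ungapped `score_sequence` helper are all illustrative, not the PredictFold implementation.

```python
import math
from collections import Counter


def build_scoring_matrix(observations, pseudocount=1.0):
    """Log-odds score for each (environment, residue) pair.

    observations: iterable of (environment, residue) pairs taken from
    known structures. The pseudocount avoids log(0) for unseen pairs.
    """
    observations = list(observations)
    pair_counts = Counter(observations)
    env_counts = Counter(env for env, _ in observations)
    res_counts = Counter(res for _, res in observations)
    residues = sorted(res_counts)
    total = len(observations)
    matrix = {}
    for env in env_counts:
        for res in residues:
            p_res_given_env = ((pair_counts[(env, res)] + pseudocount)
                               / (env_counts[env] + pseudocount * len(residues)))
            p_res = res_counts[res] / total
            matrix[(env, res)] = math.log(p_res_given_env / p_res)
    return matrix


def score_sequence(matrix, env_string, sequence):
    """Ungapped score of a probe sequence threaded onto an environment string."""
    return sum(matrix[(env, res)] for env, res in zip(env_string, sequence))


# Toy data: leucine (L) mostly buried in helices, lysine (K) mostly exposed.
toy = [("B1A", "L")] * 3 + [("B1A", "K")] + [("E0C", "K")] * 3 + [("E0C", "L")]
matrix = build_scoring_matrix(toy)
print(score_sequence(matrix, ["B1A", "E0C"], "LK") >
      score_sequence(matrix, ["B1A", "E0C"], "KL"))  # True
```

In a full fold-recognition pipeline the ungapped sum would be replaced by a dynamic programming alignment of the probe sequence against each environment string in the library.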

  46. Workflow

  47. Residue Environments • [Figure labels: Helix, Strand, Coil; Buried, Partially buried, Exposed]

  48. Residue Environments • The residue environments are described by • the area (A) of the residue buried in the protein • the fraction (f) of side-chain area that is covered by polar atoms (O and N) • the local secondary structure

  49. Residue Environments – classes by buried area A (Å²) and polar fraction f • BURIED 1 (B1): A > 114, f < 0.45 • BURIED 2 (B2): A > 114, 0.45 < f < 0.58 • BURIED 3 (B3): A > 114, f > 0.58 • PARTIAL 1 (P1): 40 < A < 114, f < 0.67 • PARTIAL 2 (P2): 40 < A < 114, f > 0.67 • EXPOSED (E0): A < 40, f > 0.67

  50. Residue Environment classes • We have 6 classes based on the extent of exposure to solvent • We have 3 classes based on secondary structure – Alpha Helix (A), Beta Sheet (B) & Coil (C) • Total: 6 x 3 = 18 environments • B1A, B1B, B1C, B2A, B2B, B2C, B3A, B3B, B3C, P1A, P1B, P1C, P2A, P2B, P2C, E0A, E0B, E0C. • For example: B1A - Buried 1, Alpha Helix; P2B - Partially Buried 2, Beta Sheet; E0C - Exposed 0, Coil
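
The 18 environment labels and the thresholds from the preceding slides can be combined in a small Python classifier. The slides do not say how values exactly on a boundary (e.g. A = 114 or f = 0.45) are handled, so the inequalities below are an assumption; the sketch also assumes that any residue with A < 40 is assigned to E0 regardless of f. The function names are illustrative.

```python
BURIAL_CLASSES = ["B1", "B2", "B3", "P1", "P2", "E0"]
SECONDARY = ["A", "B", "C"]   # Alpha Helix, Beta Sheet, Coil
ENVIRONMENTS = [b + s for b in BURIAL_CLASSES for s in SECONDARY]


def burial_class(area: float, fraction: float) -> str:
    """Map buried area A (in square angstroms) and polar fraction f to a class."""
    if area > 114:
        if fraction < 0.45:
            return "B1"
        if fraction < 0.58:
            return "B2"
        return "B3"
    if area >= 40:
        return "P1" if fraction < 0.67 else "P2"
    return "E0"   # assumption: A < 40 is exposed whatever the value of f


def environment(area: float, fraction: float, secondary: str) -> str:
    """Combine the burial class with the secondary-structure letter."""
    if secondary not in SECONDARY:
        raise ValueError("secondary must be 'A', 'B' or 'C'")
    return burial_class(area, fraction) + secondary


print(len(ENVIRONMENTS))            # 18
print(environment(120, 0.30, "A"))  # B1A
print(environment(60, 0.80, "B"))   # P2B
print(environment(10, 0.90, "C"))   # E0C
```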
