1 / 51

Tutorial 2: Some problems in bioinformatics 1. Alignment pairs of sequences

Tutorial 2: Some problems in bioinformatics 1. Alignment pairs of sequences Database searching for sequences Multiple sequence alignment Protein classification 2. Phylogeny prediction (tree construction). Sources:

rae-mccarty
Download Presentation

Tutorial 2: Some problems in bioinformatics 1. Alignment pairs of sequences

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. Tutorial 2: Some problems in bioinformatics 1. Alignment pairs of sequences Database searching for sequences Multiple sequence alignment Protein classification 2. Phylogeny prediction (tree construction) Sources: 1) "Bioinformatics: Sequence and Genome Analysis" by David W. Mount. 2001. Cold Spring Harbor Press 2) NCBI tutorial http://www.ncbi.nlm.nih.gov/Education/ and http://www.ncbi.nih.gov/BLAST/tutorial/Altschul-1.html 3) Brian Fristensky. Univ. of Manitoba http://www.umanitoba.ca/faculties/afs/plant_science/COURSES/bioinformatics

  2. Alignment: pairs of sequences TCGCA || || TC-CA DNA: A, G, C, T protein: A, C, D, E, F, G, H, I, K, L, M, N, P, Q, R, S, T, V, W, Y KQTGKG | ||| KSAGKG

  3. DNA to RNA to protein to phenotype

  4. DNA to RNA to protein to phenotype

  5. Alignment: pairs of sequences Concepts: Similarity Identity Homology Orthology Paralog KQTGKGV | |||: KSAGKGL 4/7 identical 5/7 similar

  6. Homology is based on evolutionary history

  7. Figure 45 Lineage-specific expansions of domains and architectures of transcription factors. Top, specific families of transcription factors that have been expanded in each of the proteomes. Approximate numbers of domains identified in each of the (nearly) complete proteomes representing the lineages are shown next to the domains, and some of the most common architectures are shown. Some are shared by different animal lineages; others are lineage-specific.

  8. A partial alignment of globin sequences. Proteins with very little identity (10% or less) can be recognized as sharing a common domain if they match a pattern.

  9. Homology, orthology and paralogy orthologs diverged at a speciation event paralogs diverged at a gene duplication event • - Fitch, W.M. 2001. • Homology: A personal view of some of the problems. Trends Genet. 16: 227-231.

  10. Alignment: pairs of sequences Scoring schemes Score = matches - mismatches - gaps What is the best way to evaluate the contribution of each? GKG-RRWDAKR ||| || GKGAKRWESAP

  11. A partial alignment of globin sequences from Pfam. Proteins with very little identity (10% or less) can be recognized as sharing a common domain if they match a pattern.

  12. Alignment: pairs of sequences Global vs. local alignment. (end gaps are ignored in local alignment)

  13. Dynamic programming TCGCA || || TC-CA Brian Fristensky. Univ. of Manitoba http://www.umanitoba.ca/faculties/afs/plant_science/COURSES/bioinformatics/lec04/lec04.2.html

  14. Dynamic programming Brian Fristensky. Univ. of Manitoba http://www.umanitoba.ca/faculties/afs/plant_science/COURSES/bioinformatics/lec04/lec04.2.html

  15. Dynamic programming Brian Fristensky. Univ. of Manitoba http://www.umanitoba.ca/faculties/afs/plant_science/COURSES/bioinformatics/lec04/lec04.2.html

  16. Dynamic programming Brian Fristensky. Univ. of Manitoba http://www.umanitoba.ca/faculties/afs/plant_science/COURSES/bioinformatics/lec04/lec04.2.html

  17. Alignment: pairs of sequences Scoring schemes Score = matches - mismatches - gaps "The dynamic programming algorithm was improved in performance by Gotoh (1982) by using the linear relationship for a gap weight wx = g + rx, where the weight for a gap of length x is the sum of a gap opening penalty (g) and a gap extension penalty (r) times the gap length (x), and by simplifying the dynamic programming algorithm." D. W. Mount GKG-RRWDAKR ||| || GKGAKRWESAP KQTGKG-RRWDAKR | ||| ||| KSAGKG-----AKR VS.

  18. Alignment: amino acid substitution matrices Scoring schemes "Any [scoring] matrix has an implicit amino acid pair frequency distribution that characterizes the alignments it is optimized for finding. More precisely, let pi be the frequency with which amino acid i occurs in protein sequences and let qij be the freqeuncy with which amino acids i and j are aligned within the class of alignments sought. Then, the scores that best distinguish these alignments from chance are given by the formula: Sij = log (qij / pipj) The base of the logarithm is arbitrary, affecting only the scale of the scores. Any set of scores useful for local alignment can be written in this form, so a choice of substitution matrices can be viewed as an implicit choice of 'target frequencies'" - Altschul et al. 1994 (Nature Genetics 6:119) Those frequencies are characteristic of the sequences being aligned, and are primarily a function of their degree of divergence.

  19. Alignment: amino acid substitution matrices Substitution matrices -- BLOSUM 62 Henikoff and Henikoff. 1992. Amino acid substitution matrices from protein blocks. PNAS 89: 10915-10919.

  20. Alignment: amino acid substitution matrices Substitution matrices -- BLOSUM 62

  21. Alignment: implementations Fasta Introduces the concept of k-tuple perfects alignment to seed longer global alignments. BLAST -- Basic Local Alignment Search Tool Initiates an alignment locally and then extends that alignment. GKG-RRW ||| || GKGAKRW GKG ||| GKG

  22. Alignment: Searching databases for sequences

  23. There are many modifications of BLAST for specific purposes.

  24. The NCBI BLAST interface

  25. The NCBI BLAST interface

  26. Extreme value distribution the expected distribution of the maximum of many independent random variables, generally Y = exp [-x -e-x ] K and lambda are statistical parameters dependent upon the scoring system and the background amino acid frequencies of the sequences being compared. While FASTA estimates these parameters from the scores generated by actual database searches, BLAST estimates them beforehand for specific scoring schemes by comparing many random sequences generated using a standard protein amino acid composition [12].

  27. Fasta can be run at EMBL. The software is also available for download.

  28. Alignment: Multiple sequence alignment

  29. Alignment: Protein classification

  30. Phylogeny prediction (tree construction)

  31. root

  32. Phylogeny prediction (tree construction) Character-based Methods Parsimony Maximum Likelihood tree that maximizes the likelihood of seeing the data Bayesian Analysis trees with greatest likelihoods given the data Distance Methods Unweighted Gap-pair method with Arithmetic Means Neighbor joining

  33. Within- and between-species variation along a single chromosome. a,The interspecies relationships of five chromosome regions to corresponding DNA sequences in a chimpanzee and a gorilla. Most regions show humans to be most closely related to chimpanzees (red) whereas a few regions show other relationships (green and blue). b, The among-human relationships of the same regions are illustrated schematically for five individual chromosomes.

  34. Tutorial III: Open problems in bioinformatics Tentatively: Detection of subtle signals promoter elements exon splicing enhancers noncoding RNAs weak protein similarities Microarrays Protein folding and homology modeling Thursday, June 10, 2:00 - 3:45

  35. Microarray expression data Statistical analysis -- what has changed Clustering -- which genes change together Clustering -- promoter recognition Clustering -- database integration Phenotype determination (e.g. cancer prognosis)

More Related