1 / 39

How do you gauge significance in sequence comparisons?

How do you gauge significance in sequence comparisons?. G G A GA C TG TAG A CA GCTAATGC TAT A G A A CG C CC TAG C CA CGAGCCCT TAT C Match at 11/26 positions…would expect 6.5/26 if random, equal probabilities

orea
Download Presentation

How do you gauge significance in sequence comparisons?

An Image/Link below is provided (as is) to download presentation Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author. Content is provided to you AS IS for your information and personal use only. Download presentation by click this link. While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server. During download, if you can't get a presentation, the file might be deleted by the publisher.

E N D

Presentation Transcript


  1. How do you gauge significance in sequence comparisons? • GGAGACTGTAGACAGCTAATGCTATA • GAACGCCCTAGCCACGAGCCCTTATC • Match at 11/26 positions…would expect 6.5/26 if random, equal probabilities • Probability of 11/26 matches given random, equal probabilities at each position is ~0.04

  2. Understanding sequence variation • Evolution results in complex patterns of similarity among biological sequences. • Sequences change due to mutation, natural selection, and genetic drift. • Understanding sequence alignment strategies (ie. BLAST) requires a thorough understanding of these events

  3. Mutations • Change in DNA sequence • TTA  TTG Silent or synonymous (L) • TTA  TTT Missense (L  F) • Conservative or nonconservative substitution based on structure/function • TTA  TAA Nonsense • Frameshift • Consequences of mutations?

  4. Natural Selection • There must be variation within a population • The variation must be heritable • There must be differential reproduction based on variation • Selective advantages (ie. longer necks) lead to increased fitness, more offspring, etc.

  5. The why and what of natural selection • Variation exists at the DNA level: alleles • This variation is inexhaustible (something important to remember when looking at new genome sequences) • These differences are subjected to selection: • Changes in protein structure are typically unfavorable and as a result, selected against • However, some changes in structure/function are selected for: sickle cell anemia/malaria

  6. Genetic drift • Random genetic drift is a stochastic process (by definition). • One aspect of genetic drift is the random nature of transmitting alleles from one generation to the next given that only a fraction of all possible zygotes become mature adults. • Begin with equal frequency of C or T at given position, next generation observe 60/40 in favor of C…greater chance of C making it into the next generation

  7. Neutral Theory of Evolution - Kimura • Third position of a codon or a nucleotide in a non-coding, non-regulatory region are expected to be invisible to natural selection • Compare Fugu with humans..most conserved sequences are the genes • http://www.sciencemag.org/cgi/content/full/297/5585/1301 • Debate over importance of natural selection and neutral evolution in development of species

  8. Sequence Alignment • Sequence alignment is simply the “optimal” assignment of substitution and indel events to a pair of sequences. • Global alignment: align entire sequences • Local alignment: find best matching regions of sequences

  9. Measuring Alignment Quality • Good alignments should have … • “many” exact matches • “few” mismatches • “many” of the mismatches should be similar residues • “few” gaps

  10. Measuring Alignment Quality Begin with... QTRPQNVLNPP ||| STRQNVINPWAAQ Longest Exact Match S = 3a S=alignment score a=match score

  11. Measuring Alignment Quality … allow some mismatches QTRPQNVLNPP ||| || STRQNVINPWAAQ S=alignment score a=match score b=mismatch penalty S = 5a - 1b

  12. Measuring Alignment Quality …and finally, introduce some gaps QTRPQNVLNPP || ||| || STR-QNVINPWAAQ S=alignment score a=match score b=mismatch penalty c=gap penalty S = 7a - 1b -1c

  13. Global programming algorithm

  14. Global Dynamic Programming • Full sequence must be aligned • Gaps at ends are penalized as much as internal ones • F(n,m) is the best score for alignment • Traceback can give >1 correct alignment • Used to examine closely related sequences • http://www.sbc.su.se/~per/molbioinfo2001/dynprog/dynamic.html

  15. Local Alignment – Smith-Waterman

  16. Local alignments - How • Notice the top row and left column are now filled with 0 (if the best alignment has a negative score, it’s better to start a new one) • The alignment can end anywhere in the matrix • Instead of starting at F (n, m), start traceback at highest value of F (i, j); the traceback ends when you hit a 0

  17. Local Alignment – Why? • This alignment strategy identifies islands of similarity amongst a sea of sequences

  18. Overlap matches – genome assembly

  19. Overlap recursion is similar to global alignment • The top row is and left column are set to 0 at each position, choose the maximum point along the bottom row or right column, continue traceback until hit a 0

  20. Scoring Issues • Relative costs of matches, mismatches, and gaps should depend on their probabilities (rare events receive higher penalties) • In practice, the appropriate costs are rarely known. • A variety of scoring matrices are available.

  21. Notation for algorithmic complexity • It is useful to know how an algorithm’s performance in CPU time and required memory storage will scale with the size of the problem • In previous dynamic programming problems, we are storing (n +1) x (m +1) numbers and each number costs us three sums and a max • These algorithms work in O(nm) time, with O(nm) memory, since n and m are often comparable, will often times be described as O(n2) – feasible but slow

  22. Heuristic alignment algorithms • Shortcuts are important • Searching a sequence length of 1000 against a database with 108 residues requires approximately 1011 matrix cells. At ten million matrix cells a second, it would take about 3 hours. • BLAST – the heuristic is based on that true match alignments are very likely to contain somewhere within them a sort stretch of identities. Look for short stretches to serve as seeds to extend.

  23. Five BLAST programs • BLASTN (nucleotide to nucleotide) • Use: mapping oligos, CDNA or PCR products to genome, screening repeats, finding vector sequences • BLASTP (protein to protein) • Identifying related proteins and domains among proteins • BLASTX (nucleotide to protein) • Finding protein coding genes in genome DNA or cDNA • TBLASTN (protein to nucleotide) • Identifying transcripts similar to a given protein, mapping a protein to genomic DNA • TBLASTX (nucleotide to nucleotide via protein intermediate) • Cross species gene prediction at genome or transcript level, searching for missing genes

  24. Seeding • BLAST takes your query and breaks it down into words of fixed length (3 for protein, 11 for nucleotide) • It scans through a database looking for a word from the query set with some minimum score T, when it finds it, it begins a “hit” extension to extend the possible match in both directions, stopping at the maximum scoring extension.

  25. Extension • The seeds are extended to locally optimal pairs, whose scores cannot be improved by extension or trimming. • These locally optimal alignments are called high scoring segment pairs or HSP’s • Sometimes you return only a portion of a sequence – this is the reason you need to look carefully at your BLAST alignments

  26. Alignment example • The quick brown fox jumps over the lazy dog. • The quiet brown cat purrs when she sees him. • Matches = +1; Mismatches = -1; ignore spaces and do not allow gaps. • Assume the seed is the capital T, extend the alignment • You’ll hit a mismatch c/e should you continue and how far? • Generate a variable X to measure how far the score drops off. • Set X = 5 and try the alignment… • Set X = 2 and try again … • A large X value will increase the speed, however, speed is often modulated by word size and other parameters…

  27. Gapped BLAST – a time saver • Extension is costly, now have a two hit (gapped) BLAST where you require two hits within a distance (A) • A gapped extension takes much longer to execute than ungapped, but overall run fewer extensions – time saver • Gapped BLAST requires two non-overlapping hits of at least score (T) within distance A of one another before ungapped extension of second hit • T is adjustable, higher the T then the smaller the search space

  28. Evaluation • Once seeds are extended to generate alignments, these alignments are tested for statistical significance. • Can establish thresholds for reporting

  29. BLASTN variations • BLASTN seeds are always identical words; T is never used • To make BLASTN faster, increase word size, to make it more sensitive decrease word size • MegaBLAST increases word size to 28 • The minimum word size is 7 • http://monod.uwaterloo.ca/papers/02ph.pdf

  30. BLASTP implementation • To make searches faster, set word size to 3 and T to a large value (999), which removes all potential neighborhood words (two-hit distance is 40 amino acids by default) • Affine gaps • Decreased penalty for gap extension relative to gap introduction

  31. Also, FASTA • Similar to Gapped BLAST – except bigger neighborhood • Generates a lookup table to locate all identically matching words of length ktup protein 1-2, DNA 4-6 • Once identified, looks for diagonals with many mutually supporting word matches • Extensions similar to BLAST

  32. Scoring Matrices • Scoring matrix specifies a score, sij, for aligning sequence Iwith sequence II. • Choice of matrix depends on the divergence level of desired/expected hits. • Examples: PAM, BLOSUM • Both can be modified for different divergence levels (eg, BLOSUM40, BLOSUM62) • Advice: try several matrices when possible.

  33. Dayhoff Family of Matrices • Dayhoff model measures sequence evolution in units of “PAMs” • One PAM unit represents the evolutionary distance in which 1% of the amino acids have changed. • Mutability of an aa is its relative rate of change (amino acids with high mutabilities are more likely to change) • Mutability of alanine was defined to be 100.

  34. Dayhoff Family of Matrices • Problems with the original Dayhoff scheme • It does not consider the genetic code. • Not all amino acid substitutions can occur by a single nucleotide substitution event. • Parameters were estimated from a small sample of closely related proteins. • Evolution at the “average site” of the “average protein” is being modeled.

  35. BLOSUM Scoring Matrices Blocks Substitution Matrix. A substitution matrix in which scores for each position are derived from observations of the frequencies ofsubstitutions in blocks of local alignments in related proteins. Each matrix is tailored to a particular evolutionary distance. In the BLOSUM62 matrix, for example, the alignment from which scores were derived was created using sequences sharing no more than 62% identity. Sequences more identical than 62% are represented by a single sequence in the alignment so as to avoid over-weighting closely related family members. (Henikoff and Henikoff)

  36. PSI-BLASTPosition Specific Iterated BLAST 1. BLAST with query 2. Keep hits w/ E < E* (adjustable constant) 3. Multiple alignment of HSPs from step (2) 4. Build profile 5. BLAST with profile 6. Iterate (1)-(5) until no new hits are found

  37. PSI-BLASTPosition Specific Iterated BLAST Use with great caution!!! Once an unrelated sequence is mistakenly incorporated into the profile, subsequent iterations will incorporate homologues of the unrelated sequence (catastrophic transitivity). Human intervention is essential.

More Related