why align sequences
Download
Skip this Video
Download Presentation
Why Align Sequences?

Loading in 2 Seconds...

play fullscreen
1 / 35

Why Align Sequences? - PowerPoint PPT Presentation


  • 116 Views
  • Uploaded on

Why Align Sequences?. DNA sequences (4 letters in alphabet) GTAAACTGGTACT… Amino acid (protein) sequences (20 letters) SSHLDKLMNEFF… Align them so we can search databases To help predict structure/function of new genes In particular, look for homologues (evolutionary relatives).

loader
I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
capcha
Download Presentation

PowerPoint Slideshow about ' Why Align Sequences?' - domani


An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -
Presentation Transcript
why align sequences
Why Align Sequences?
  • DNA sequences (4 letters in alphabet)
    • GTAAACTGGTACT…
  • Amino acid (protein) sequences (20 letters)
    • SSHLDKLMNEFF…
  • Align them so we can search databases
    • To help predict structure/function of new genes
      • In particular, look for homologues (evolutionary relatives)
example matches
Example matches

1. gattcagacctagct (no indels)

gtcagatcct

2. gattcaga-cctagct (with indels)

g-t-cagatcct

3. gattcagacctagc-t

g-t-----cagatcct

  • Need to come up with algorithms producing:
    • Ways of scoring alignments
    • Ways to search for high scoring alignments
  • Concentrate first on alignments without indels
hamming distances
Hamming Distances
  • Suppose we have
    • Query sequence Q and database sequence D
  • Hamming distance:
    • Number of places where Q and D are different (distance)
  • Example (stars mark differences)
    • SSHLDKLMNEFF
    • * ** *
    • HSHLKLLMKEFFHDMN
    • Scores 4 for Hamming distance (sometimes worry about ends)
  • Simple alignment algorithm: slide Q along D
    • Remember where the Hamming distance was minimised
scoring schemes amino acids
Scoring Schemes (Amino Acids)
  • Hamming distance doesn’t take into account
    • Likelihood of one amino acid changing to another
    • Some amino acid substitutions are disastrous
      • So they don’t survive evolution
    • Some substitutions barely change anything
      • Because the two amino acids are chemically quite similar
  • Scoring schemes address this problem
    • Give scores to the chances of each substitution
  • 2 possibilities:
    • Use empirical evidence
      • Of actual substitutions in known homologues (families)
    • Use theory from chemistry (hydrophobicity, etc.)
the scoring scheme
The Scoring Scheme

Give two sequences we need a number to associate with each possible alignment (i.e. the alignment score = goodness of alignment).

  • The scoring scheme is a set of rules which assigns the alignment score to any given alignment of two sequences.
  • The scoring scheme is residue based: it consists of residue substitution scores (i.e. score for each possible residue alignment), plus penalties for gaps.
  • The alignment score is the sum of substitution scores and gap penalties.
blosum62 scheme
BLOSUM62 Scheme
  • Blocks Amino Acid Substitution Matrices
  • Empirical method
    • Based on roughly 2000 amino acid patterns (blocks)
    • Found in more than 500 families of related proteins
  • Calculate the Log-odds scores for each pair (R1, R2)
    • Let O = observed frequency R1 <=> R2
    • Let E = expected frequency R1 <=> R2
    • I.e., Score = round(2 * log2(O/E))
  • To calculate the score for an alignment of two sequences
    • Add up the pairwise scores for residues
      • We’ve calculated log odds
blosum62 substitution matrix
BLOSUM62 Substitution Matrix
  • Zero: by chance
    • + more than chance
    • - less than chance
  • Arranged by
    • Sidegroups
    • So, high scoring

in the end boxes

  • Example
    • M,I,L,V
    • Interchangeable
example calculation
Example Calculation
  • Query = S S H L D K L M R
  • Dbase = H S H L K L L M G
  • Score = -1 4 8 4 -1 -2 4 5 0
  • Total score = -1+4+8+4+-1+-2+4+5+-2

= 21

  • Write Blosum(Query,Dbase) = 21
    • Not standard to do this
blast algorithm basic local alignment search tool
BLAST AlgorithmBasic Local Alignment Search Tool
  • Fast alignment technique(s)
    • Similar to FASTA algorithms (not used much now)
    • There are more accurate ones, but they’re slower
    • BLAST makes a big use of lookup tables
  • Idea: statistically significant alignments (hits)
    • Will have regions of at least 3 letters same
      • Or at least high scoring with respect to BLOSUM matrix
  • Based on small local alignments

more likely than

CCNDHRKMTCSPNDNNRK

TTNDHRMTACSPDNNNKH

CCNDHRKMTCSPNDNNRK

YTNHHMMTTYSLDNNNKK

blast overview
BLAST Overview
  • Given a query sequence Q
  • Seven main stages
    • Remove (filter) low complexity regions from Q
    • Harvest k-tuples (triples) from Q
    • Expand each triple into ~50 high scoring words
    • Seed a set of possible alignments
    • Generate high scoring pairs (HSPs) from the seeds
    • Test significance of matches from HSPs
    • Report the alignments found from the HSPs
blast algorithm part 1 removing low complexity segments
BLAST Algorithm Part 1 Removing Low-complexity Segments
  • Imagine matching
    • HHHHHHHHKMAY and HHHHHHHHURHD
    • The KMAY and URHD are the interesting parts
    • But this pair score highly using BLOSUM
  • It’s a good idea to remove the HHHHHHHs
    • From the query sequence (low complexity)
  • SEG program does this kind of thing
    • Comes with most BLAST implementations
    • Often doesn’t do much, and it can be turned off
removing low complexity segments
Removing Low-complexity Segments
  • Given a segment of length L
    • With each amino acid occurring n1 n2 … n20 times
  • Use the following measure for “compositional complexity”:
  • To use this measure
    • Slide a “window” of ~12 residues along Query Sequence Q
    • Use a threshold to determine low complexity windows
    • Use a minimise routine to replace the segment
      • With an optimal minimised segment (or just an X)
  • Will do an example calculation in tutorial
blast algorithm part 2 harvesting k tuples
BLAST Algorithm Part 2Harvesting k-tuples
  • Collect all the k-tuples of elements in Q
    • k set to 3 for residues and 11 for DNA (can vary)
    • Triples are called ‘words’. Call this set W

S T S L S T S D K L M R

STS

TSL

SLS

LST

blast algorithm part 3 finding high scoring triples
BLAST Algorithm Part 3Finding High Scoring Triples
  • Given a word w from W
    • Find all other words w’ of same length (3), which:
      • Appear in some database sequence
      • Blosum(w,w’) > a threshold T
  • Choose T to limit number to around 50
    • Call these the high scoring triples (words) for w
  • Example: letting w=PQG, set T to be 13
    • Suppose that PQG, PEG, PSG, PQA are found in database
    • Blosum(PQG,PQG) = 18, Blosum(PQG,PEG) = 15
    • Blosum(PQG,PSG) = 13, Blosum(PQG,PQA) = 12
    • Hence, PQG and PEG only are kept
finding high scoring triples
Finding High Scoring Triples
  • For each w in W, find all the high scoring words
    • Organise these sets of words
      • Remembering all the places where w was found in Q
  • Each high scoring triple is going to be a seed
    • In order to generate possible alignment(s)
      • One seed can generate more than one alignment
  • End of the first half of the algorithm
    • Going to find alignments now
blast algorithm part 4 seeding possible alignments
BLAST Algorithm Part 4Seeding Possible Alignments
  • Look at first triple V in query sequence Q
    • Actually from Q (not from W - which has omissions)
    • Retrieve the set of ~50 high scoring words
      • Call this set HV
    • Retrieve the list of places in Q where V occurs
      • Call this set PV
  • For every pair (word, pos)
    • Where word is from HV and pos is from PV
      • Find all the database sequences D
        • Which have an exact match with word at position pos’
      • Store an alignment between Q and D
        • With Vmatched at pos in Q and pos’ in D
  • Repeat this for the second triple in Q, and so on
seeding possible alignments example
Seeding Possible AlignmentsExample
  • Suppose Q = QQGPHUIQEGQQG
  • Suppose V = QQG, HV = {QQG, QEG}
    • Then PV = {1, 11}
  • Suppose we are looking in the database at:
    • D = PKLMMQQGKQEG
  • Then the alignments seeded are:

QQGPHUIQEGQQG word=QQG QQGPHUIQEGQQG word=QQG

PKLMMQQGKQEG pos=1 PKLMMQQGKQEG pos=11

QQGPHUIQEGQQG word=QEG QQGPHUIQEGQQG word=QEG

PKLMMQQGKQEG pos=1 PKLMMQQGKQEG pos=11

blast algorithm part 5 generating high scoring pairs hsps
BLAST Algorithm Part 5Generating High Scoring Pairs (HSPs)
  • For each alignment A
    • Where sequences Q and D are matched
    • Original region matching was M
  • Extend M to the left
    • Until the Blosum score begins to decrease
  • Extend M to the right
    • Until the Blosum score begins to decrease
  • Larger stretch of sequence now matches
    • May have higher score than the original triple
    • Call these high scoring pairs
  • Throw away any alignments for which the score S of the extended region M is lower than some cutoff score
extending alignment regions example
Extending Alignment RegionsExample

QQGPHUIQEGQQGKEEDPP Blosum(QQG,QQG) = 16

PKLMMQQGKQEGM

QQGPHUIQEGQQGKEEDPP Blosum(QQGK,QQGK) = 21

PKLMMQQGKQEGM

QQGPHUIQEGQQGKEEDPP Blosum(QQGKE,QQGKQ) = 23

PKLMMQQGKQEGM

QQGPHUIQEGQQGKEEDPP Blosum(QQGKEE,QQGKQE) = 28

PKLMMQQGKQEGM

QQGPHUIQEGQQGKEEDPP Blosum(QQGKEED,QQGKQEG) = 27

PKLMMQQGKQEGM

So, the extension to the right stops here

HSP (before left extension) is QQGKEE, scoring 28

blast algorithm part 6 checking statistical significance
BLAST Algorithm Part 6Checking Statistical Significance
  • Reason we extended alignment regions
    • Give a more accurate picture of the probability of that BLOSUM score occurring by chance
  • Question: is a HSP significant?
  • Suppose we have a HSP such that
    • It scores S for a region of length L in sequences Q & D
  • Then the probability of two random sequences Q’ and D’ scoring S in a region of length L is calculated
    • Where Q’ is same length as Q and D’ is same length as D
  • This probability needs to be low for significance
blast algorithm part 7 reporting the alignments
BLAST Algorithm Part 7Reporting the Alignments
  • For each statistically significant HSP
    • The alignment is reported
  • If a sequence D has two HSPs with Query Q
    • Two different alignments are reported
  • Later versions of BLAST
    • Try and unify the two alignments
ncbi blast server protein protein
NCBI BLAST Server (protein-protein)
  • http://www.ncbi.nlm.nih.gov/BLAST/
real example
Real Example
  • MRPQAPGSLVDPNEDELRMAPWYWGRISREEAKSILHGKPDGSFLVRDAL SMKGEYTLTLMKDGCEKLIKICHMDRKYGFIETDLFNSVVEMINYYKENS LSMYNKTLDITLSNPIVRAREDEESQPHGDLCLLSNEFIRTCQLLQNLEQ NLENKRNSFNAIREELQEKKLHQSVFGNTEKIFRNQIKLNESFMKAPADA PSTEAGGAGDGANAAASAAANANARRSLQEHKQTLLNLLDALQAKGQVLN HYMENKKKEELLLERQINALKPELQILQLRKDKYIERLKGFNLKDDDLKM ILQMGFDKWQQLYETVSNQPHSNEALWLLKDAKRRNAEEMLKGAPSGTFL IRARDAGHYALSIACKNIVQHCLIYETSTGFGFAAPYNIYATLKSLVEHY ANNSLEEHNDTLTTTLRWPVLYWKNNPLQVQMIQLQEEMDLEYEQAATLR PPPMMGSSAPIPTSRSREHDVVDGTGSLEAEAAPASISPSNFSTSQ
  • A gene taken from a fruit fly (Drosophila Melanogaster)
    • We’ll alter this a little
    • And see if the NCBI BLAST server can find it for us
database searching overview
Database Searching Overview

Query sequence Q

List of

similar protein

sequences

Comparison

algorithm

Database of

sequences

Infer

homologues

and similar

structures

true false positives and negatives
True/False Positives and Negatives
  • True Positive
    • A hit returned from the database search
      • Which does match in reality with the query sequence
  • False Positive
    • A hit returned from database search
      • Which doesn’t match in reality with the query sequence
  • True Negative
    • A sequence not returned from database search
      • Which doesn’t match in reality with the query sequence
  • False Negative
    • A sequence not returned from database search
      • Which does match in reality with the query sequence
accuracy of database searching an ideal search result
Accuracy of database searching - an ideal search result

Score Output Program Correct answer

High (good) A YES YES

B YES YES

C YES YES

D YES YES

E NO NO

F NO NO

G NO NO

Low (poor) H NO NO

Cut off score

A,B,C,D All correctly assigned and true positives

E,F,G,H All correctly assigned and true negatives

accuracy of database searching a typical search result
Accuracy of database searching - a typical search result

Score Output Program Correct answer

High (good) A YES YES

B YES YES

C YES YES

D YES NO

E NO NO

F NO YES

G NO NO

Low (poor) H NO NO

Cut off score

A,B,C Correctly assigned and true positives

E,G,H Correctly assigned and true negatives

D Incorrectly assigned and false positive

F Incorrectly assigned and false negative

accuracy of database searching a typical search result1
Accuracy of database searching - a typical search result

Score Output

High (good) A

B

C

D

E

F

G

Low (poor) H

How much confidence do

we have that this match

at a particular score (S) is

not due to chance ?

S

sensitivity and selectivity
Sensitivity and Selectivity
  • Given that you know:
    • The false positives and false negatives
  • Ntp = number of true positives
  • Nfp = number of false positives
  • Ntn = number of true negatives
  • Nfn = number of false negatives
  • Sensitivity = Ntp / (Ntp + Nfn)
    • Proportion of the true answers the search found
  • Selectivity = Ntp / (Ntp + Nfp)
    • Proportion of the answers the search found which were correct
sensitivity and selectivity1
Sensitivity and Selectivity
  • In David W. Mount’s book:

“Sensitivity refers to the ability of the method to find most of the members of the protein family represented by the query sequence.”

“Selectivity refers to the ability of the method not to find known members of other families as false positives.”

reliability of a match at score s
Reliability of a Match at Score S
  • P(x  S)
    • is the probability of a score x greater than or equal to the observed score S occurring by chance
  • E(x  S)
    • is the expected number of chance occurrences

of scores greater than or equal to S

  • E-value
    • is the expected number of matches that are errors if you

searched and took all matches scoring up to and including S

    • Estimated number of false positives found using S as the cut off
from the ncbi blast faq pages
From the NCBI BLAST FAQ Pages
  • The Expect value (E) is a parameter that describes the number of hits one can "expect" to see just by chance when searching a database of a particular size. It decreases exponentially with the Score (S) that is assigned to a match between two sequences. Essentially, the E value describes the random background noise that exists for matches between sequences. For example, an E value of 1 assigned to a hit can be interpreted as meaning that in a database of the current size one might expect to see 1 match with a similar score simply by chance. This means that the lower the E-value, or the closer it is to "0" the more "significant" the match is. However, keep in mind that searches with short sequences, can be virtually identical and have relatively high E-Value. This is because the calculation of the E-value also takes into account the length of the Query sequence. This is because shorter sequences have a high probability of occurring in the database purely by chance.
using p and e values
Using P and E Values
  • Most search programs return one or both values
  • For matches < 20 residues
    • We must still be very cautious in suggesting true homology
    • Also, we CANNOT infer short matches will have similar structures
  • We can be confident if P or E < 10-3
    • However, as they are estimated values, these are often wrong
      • You will need experience of the current version of the program
  • Note that P is a probability, so 0 <P < 1, but E can be > 1
    • For low values (<10-3) P and E are virtually the same
calculating p and e values in general
Calculating P and E Values in General
  • Each algorithm/server seems to have its own method
    • Theory for gapped alignments is still very much under debate
    • Theory for non-gapped alignments is solved, but flexible
  • Values consider both
    • the size of the database searched
    • and the score of the match
  • Should also consider the length of the match
    • as short matches are easier to find
  • Calculations often involve “random sequences”
    • Generate randomly with letters in proportion
    • Mix up substrings of existing protein sequences
calculating p and e values in blast
Calculating P and E values in BLAST
  • Remember that each alignment
    • Has a HSP at its heart
  • Suppose we have an alignment of Q and D
    • Q is of length m and D is of length n
    • And they have a HSP scoring S with BLOSUM62
  • Question we’re interested in:
    • Given two random sequences, also of length m and n
      • How many HSPs of score S or greater can we expect to find
    • i.e., is our HSP special, or would we expect one?
ad