Pairwise Sequence Alignment Part 2
Outline • Global alignments (continued) • Local versus global alignment • BLAST algorithms • Evaluating the significance of alignments
Needleman-Wunsch Alignment • Global alignment between sequences • Compare one entire sequence against another • Create a scoring table • Sequence A across the top, B down the left • The cell at column i and row j contains the score of the best alignment between the first i elements of A and the first j elements of B • The global alignment score is in the bottom-right cell
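Example table cells for A = ACGCTG (across top) and B = CATGT (down left), each written as top row / bottom row:
A / -
ACGCTG / ------
----- / CATGT
A / C
AC / -C
ACG / -C-
ACGC / ---C and ACGC / -C--
ACG / -CA
Complete global alignments (bottom-right cell):
ACGCTG- / -C-ATGT
ACGCTG- / -CA-TGT
-ACGCTG / CATG-T-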
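The table-filling rule described for Needleman-Wunsch can be sketched in Python as follows. This is a minimal illustration, not the slides' own implementation: the scoring values (match = +1, mismatch = -1, gap = -1) and the function name `needleman_wunsch_score` are assumptions, since the slides do not state a scoring scheme.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-1):
    """Return the global alignment score of a and b (assumed linear gap cost)."""
    # table[j][i] holds the best score for aligning the first i elements of a
    # with the first j elements of b (a across the top, b down the left).
    table = [[0] * (len(a) + 1) for _ in range(len(b) + 1)]

    # First row and column: a prefix aligned entirely against gaps.
    for i in range(1, len(a) + 1):
        table[0][i] = i * gap
    for j in range(1, len(b) + 1):
        table[j][0] = j * gap

    for j in range(1, len(b) + 1):
        for i in range(1, len(a) + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            diagonal = table[j - 1][i - 1] + pair   # align a[i-1] with b[j-1]
            from_left = table[j][i - 1] + gap       # a[i-1] against a gap
            from_above = table[j - 1][i] + gap      # b[j-1] against a gap
            table[j][i] = max(diagonal, from_left, from_above)

    # The global alignment score is the bottom-right cell.
    return table[-1][-1]


print(needleman_wunsch_score("ACGCTG", "CATGT"))  # -1 under the assumed scores
```

Under these assumed scores, the result of -1 corresponds to 3 matches, 1 mismatch and 3 gaps, which is the composition of the complete alignments of ACGCTG and CATGT listed above.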
Global Alignment versus Local Alignment
Global alignment:
ATTGCAGTG-TCGAGCGTCAGGCT
ATTGCGTCGATCGCAC-GCACGCT
Local alignment:
CATATTGCAGTGGTCCCGCGTCAGGCT
TAAATTGCGT-GGTCGCACTGCACGCT
Global vs. Local alignment
Global alignment:
DOROTHY--------HODGKIN
DOROTHYCROWFOOTHODGKIN
Local alignment:
DOROTHY HODGKIN
DOROTHY HODGKIN
Local Alignment • Best score for aligning parts of the two sequences • Often beats the global alignment score • Similar algorithm: Smith-Waterman • Table cells never score below zero
TAA TAA TACTA TAATA
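The difference from the global table is small enough to show in code. The sketch below is a hedged illustration of the Smith-Waterman idea (cells clamped at zero, best score taken from anywhere in the table); the scoring values and the name `smith_waterman_score` are assumptions, not taken from the slides.

```python
def smith_waterman_score(a, b, match=1, mismatch=-1, gap=-1):
    """Return the best local alignment score of a and b (assumed scores)."""
    # First row and column stay at zero: a local alignment may start anywhere.
    table = [[0] * (len(a) + 1) for _ in range(len(b) + 1)]
    best = 0

    for j in range(1, len(b) + 1):
        for i in range(1, len(a) + 1):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            table[j][i] = max(
                0,                          # never score below zero
                table[j - 1][i - 1] + pair,  # align a[i-1] with b[j-1]
                table[j][i - 1] + gap,       # a[i-1] against a gap
                table[j - 1][i] + gap,       # b[j-1] against a gap
            )
            # The local score is the maximum over all cells,
            # not the bottom-right cell.
            best = max(best, table[j][i])

    return best


print(smith_waterman_score("TACTA", "TAATA"))  # 3 under the assumed scores
```

Because every cell is at least zero, two unrelated sequences score 0 rather than a large negative number, which is why the local score is often higher than the global one.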
Problems with DP for sequence alignments • The complexity is very high • Given a score, how do we evaluate the significance of the alignment?
Complexity • Complexity is determined by the size of the table • Aligning a sequence of length m against one of length n requires calculating m × n cells • Time of calculation: let's say we calculate 10^8 cells per second on a one-processor PC • Aligning two mRNA sequences of 8,000 bp requires 64,000,000 cells → 0.64 seconds • Aligning an mRNA and a 10^7 bp chromosome requires ~10^11 cells → 1,000 seconds ≈ 17 minutes
Complexity for large databases • Let's say a database contains 3 × 10^10 base pairs • Searching an 8,000 bp mRNA against the database will require ~2.5 × 10^14 cells → 2.5 × 10^6 seconds ≈ 1 month! • We need an efficient algorithm to cut down on alignment time
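The arithmetic behind these estimates can be reproduced with a few lines of Python. This is only a back-of-the-envelope sketch: the rate of 10^8 cells per second is the figure assumed on the slides, and the helper name `dp_cost` is hypothetical.

```python
CELLS_PER_SECOND = 10**8  # rate assumed on the slides, not a measurement


def dp_cost(m, n, cells_per_second=CELLS_PER_SECOND):
    """Return (cells, seconds) for filling an m x n dynamic programming table."""
    cells = m * n
    return cells, cells / cells_per_second


cases = [
    ("mRNA vs mRNA", 8_000, 8_000),
    ("mRNA vs chromosome", 8_000, 10**7),
    ("mRNA vs database", 8_000, 3 * 10**10),
]

for label, m, n in cases:
    cells, seconds = dp_cost(m, n)
    print(f"{label}: {cells:.1e} cells, {seconds:,.2f} s, {seconds / 86_400:.2f} days")
```

The exact products are 8 × 10^10 cells (800 seconds) for the chromosome and 2.4 × 10^14 cells (about 28 days) for the database; the slides round these to the nearest convenient figure.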
BLAST • Basic Local Alignment Search Tool • A set of tools developed at NCBI (BlastN, BlastP, ...) • BLAST benefits • Search speed • Ease of use • Statistical rigor
BLAST • A good alignment contains subsequences of absolute identity: • First, identify very short (almost) exact matches. • Next, the best short hits from the 1st step are extended to longer regions of similarity. • Finally, the best hits are optimized using the Smith-Waterman algorithm.
BLAST Algorithm (1) Split the query sequence into words of length W (default W = 11) (2) Compare the word list to the database and identify exact matches
(3) For each word match, extend the alignment in both directions (4) Score the alignments using dynamic programming (5) Evaluate the statistical significance
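The word-matching and extension steps can be sketched as below. This is a simplified illustration of the seed-and-extend idea, not NCBI's implementation: the function names, the ungapped extension, the +1/-1 scores and the `drop` cutoff are all assumptions made for the example, and the final dynamic programming and significance steps are omitted.

```python
from collections import defaultdict


def index_words(sequence, w):
    """Map each word of length w to the positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(sequence) - w + 1):
        index[sequence[pos:pos + w]].append(pos)
    return index


def find_seeds(query, subject, w):
    """Return (query_pos, subject_pos) pairs where a query word matches exactly."""
    subject_index = index_words(subject, w)
    seeds = []
    for q_pos in range(len(query) - w + 1):
        for s_pos in subject_index.get(query[q_pos:q_pos + w], []):
            seeds.append((q_pos, s_pos))
    return seeds


def extend_seed(query, subject, q_pos, s_pos, w, match=1, mismatch=-1, drop=2):
    """Extend a seed without gaps; return (score, query_start, query_end)."""

    def walk(q, s, step):
        # Extend in one direction and stop once the running score has
        # fallen `drop` below the best score seen so far.
        gain = best_gain = 0
        length = kept = 0
        while 0 <= q < len(query) and 0 <= s < len(subject):
            gain += match if query[q] == subject[s] else mismatch
            length += 1
            if gain > best_gain:
                best_gain, kept = gain, length
            elif best_gain - gain >= drop:
                break
            q += step
            s += step
        return best_gain, kept

    right_gain, right_len = walk(q_pos + w, s_pos + w, +1)
    left_gain, left_len = walk(q_pos - 1, s_pos - 1, -1)
    score = w * match + left_gain + right_gain
    return score, q_pos - left_len, q_pos + w + right_len


query, subject = "GATTACA", "CCGATTACACC"
seeds = find_seeds(query, subject, w=4)
print(seeds)
print(extend_seed(query, subject, *seeds[0], w=4))
```

Indexing the database words once is what makes the search fast: only the regions around exact word matches are examined further, instead of filling a full table for every database sequence.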
Database Searches • Using pairwise comparison, each database search normally yields 2 groups of scores: genuinely related sequences and unrelated (random) sequences, with some overlap between them. • A good search method should separate the 2 score groups as completely as possible.
E-value • The number of hits (with a similarity score at least as high) one can "expect" to see just by chance when searching the given string in a database of a particular size. • Higher E-value → lower similarity • “sequences with E-value of less than 0.01 are almost always found to be homologous” • The lower bound is normally 0 (we want to find the best)
Expectation Values • Increases linearly with length of query sequence • Decreases exponentially with score of alignment • Increases linearly with length of database
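These three dependencies are captured by the Karlin-Altschul form E = K · m · n · e^(-λS), where m is the query length, n the database length and S the alignment score. The sketch below only illustrates the shape of that relationship: the values of `k` and `lam` are placeholder assumptions (in practice they depend on the scoring system), and the function name `expected_hits` is hypothetical.

```python
import math


def expected_hits(score, query_len, db_len, k=0.1, lam=0.3):
    """E = K * m * n * exp(-lambda * S); k and lam are illustrative placeholders."""
    return k * query_len * db_len * math.exp(-lam * score)


base = expected_hits(score=50, query_len=200, db_len=10**6)
print(f"baseline E-value:      {base:.3g}")
print(f"query twice as long:   {expected_hits(50, 400, 10**6):.3g}")
print(f"database twice as big: {expected_hits(50, 200, 2 * 10**6):.3g}")
print(f"score 10 points higher: {expected_hits(60, 200, 10**6):.3g}")
```

Doubling either length doubles the E-value, while each extra point of score divides it by a constant factor e^λ, so a modest increase in score outweighs a large increase in database size.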