Pairwise Sequence Alignment Part 2

# Pairwise Sequence Alignment Part 2 - PowerPoint PPT Presentation


## PowerPoint Slideshow about 'Pairwise Sequence Alignment Part 2' - darius

Outline
• Global alignments (continuation)
• Local versus Global
• BLAST algorithms
• Evaluating significance of alignments
Needleman-Wunsch Alignment
• Global alignment between sequences
• Compare entire sequence against another
• Create scoring table
• Sequence A across top, B down left
• Cell at column i and row j contains the score of best alignment between the first i elements of A and the first j elements of B
• Global alignment score is bottom right cell
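
The table-filling step described above can be sketched in Python. This is a minimal illustration, not the slides' own implementation: the scoring values (match +1, mismatch -1, gap -1) and the function name `needleman_wunsch_table` are assumptions, since the slides do not state a scoring scheme.

```python
def needleman_wunsch_table(a, b, match=1, mismatch=-1, gap=-1):
    """Fill the global-alignment table: a runs across the top, b down the left.

    table[j][i] holds the best score for aligning the first i elements of a
    with the first j elements of b. Scoring values are illustrative assumptions.
    """
    rows, cols = len(b) + 1, len(a) + 1
    table = [[0] * cols for _ in range(rows)]
    # First row and column: a prefix aligned entirely against gaps.
    for i in range(1, cols):
        table[0][i] = i * gap
    for j in range(1, rows):
        table[j][0] = j * gap
    for j in range(1, rows):
        for i in range(1, cols):
            diag = table[j - 1][i - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = table[j - 1][i] + gap    # element of b aligned to a gap
            left = table[j][i - 1] + gap  # element of a aligned to a gap
            table[j][i] = max(diag, up, left)
    return table


table = needleman_wunsch_table("ACGCTG", "CATGT")
print(table[-1][-1])  # global alignment score: the bottom-right cell
```

Recovering the alignment itself would need a traceback from the bottom-right cell, which is omitted here to keep the sketch short.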

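Example alignments of prefixes of ACGCTG (top row) and CATGT (bottom row), ending with three full-length alignments:

A
-

ACGCTG
------

-----
CATGT

A
C

AC
-C

ACG
-C-

ACGC
---C

ACGC
-C--

ACG
-CA

ACGCTG-
-C-ATGT

ACGCTG-
-CA-TGT

-ACGCTG
CATG-T-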

Global Alignment versus Local Alignment

Global Alignment

ATTGCAGTG-TCGAGCGTCAGGCT
ATTGCGTCGATCGCAC-GCACGCT

Local Alignment

CATATTGCAGTGGTCCCGCGTCAGGCT
TAAATTGCGT-GGTCGCACTGCACGCT

Global vs. Local alignment

Global alignment:

DOROTHY--------HODGKIN
DOROTHYCROWFOOTHODGKIN

Local alignment:

DOROTHY
DOROTHY

HODGKIN
HODGKIN

Local Alignment
• Best score for aligning part of sequences
• Often beats global alignment score
• Similar algorithm: Smith-Waterman
• Table cells never score below zero

TAA
TAA

TACTA
TAATA
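
The local-alignment variant can be sketched the same way as the global table, with every cell clamped at zero and the best score taken from anywhere in the table. As before, this is only a sketch under assumed scoring values (match +1, mismatch -1, gap -1); the name `smith_waterman_score` is hypothetical.

```python
def smith_waterman_score(a, b, match=1, mismatch=-1, gap=-1):
    """Return the best local-alignment score between a and b.

    Same recurrence as the global table, but cells never drop below zero,
    so an alignment can start and end anywhere. Scoring is an assumption.
    """
    rows, cols = len(b) + 1, len(a) + 1
    table = [[0] * cols for _ in range(rows)]
    best = 0
    for j in range(1, rows):
        for i in range(1, cols):
            diag = table[j - 1][i - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = table[j - 1][i] + gap
            left = table[j][i - 1] + gap
            table[j][i] = max(0, diag, up, left)  # never score below zero
            best = max(best, table[j][i])
    return best


print(smith_waterman_score("TACTA", "TAATA"))
```

Because the score is a maximum over all cells rather than the bottom-right cell, the local score is never lower than zero and is often higher than the global score for the same pair.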

Problems with DP for sequence alignments

• The complexity is very high
• Given a score, how do we evaluate the significance of the alignment?

Complexity
• Complexity is determined by size of table
• Aligning a sequence of length m against one of length n requires calculating m × n cells
• Time of calculation

Let's say we calculate 10^8 cells per second on a one-processor PC

• Aligning two mRNA sequences of 8,000 bp requires 64,000,000 cells: 0.64 seconds
• Aligning an mRNA and a 10^7 bp chromosome requires ~10^11 cells: 1,000 seconds, about 15 minutes
Complexity for large databases
• Let’s say a database contains 3 × 10^10 base pairs
• Searching an mRNA against the database will require ~2.5 × 10^14 cells: 2.5 × 10^6 seconds, about 1 month!
• We need an efficient algorithm that cuts down on the amount of alignment work
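
The estimates above are simple arithmetic and can be reproduced with a few lines of Python. The rate of 10^8 cells per second and the sequence lengths come from the slides; the helper name `alignment_cost` is hypothetical.

```python
CELLS_PER_SECOND = 1e8  # rate assumed on the slides


def alignment_cost(m, n, cells_per_second=CELLS_PER_SECOND):
    """Return (cells, seconds) for filling an m x n dynamic-programming table."""
    cells = m * n
    return cells, cells / cells_per_second


cases = [
    ("mRNA vs mRNA", 8_000, 8_000),
    ("mRNA vs chromosome", 8_000, 10**7),
    ("mRNA vs database", 8_000, 3 * 10**10),
]
for label, m, n in cases:
    cells, seconds = alignment_cost(m, n)
    print(f"{label}: {cells:.1e} cells, {seconds:,.2f} s ({seconds / 86_400:.2f} days)")
```

The exact products are slightly below the rounded slide figures (8 × 10^10 rather than ~10^11 cells, and 2.4 × 10^14 rather than ~2.5 × 10^14), which does not change the conclusion.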
BLAST
• Basic Local Alignment Search Tool
• A set of tools developed at NCBI (BLASTN, BLASTP, ...)
• BLAST benefits
• Search speed
• Ease of use
• Statistical rigor
BLAST
• A good alignment contains subsequences of absolute identity:
• First, identify very short (almost) exact matches.
• Next, the best short hits from the 1st step are extended to longer regions of similarity.
• Finally, the best hits are optimized using the Smith-Waterman algorithm.
BLAST Algorithm

(1) Break the query sequence into words of length W (W default = 11)

(2) Compare the word list to the database and identify exact matches

(3) For each word match, extend the alignment in both directions

(4) Score the alignments using dynamic programming

(5) Evaluate the statistical significance
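
Steps (1) to (3) can be illustrated with a toy seed-and-extend routine in Python. This is a simplified sketch, not the BLAST implementation: it uses exact word seeds and ungapped extension only, and the names `seed_and_extend`, `extend`, the `x_drop` cutoff, and the +1/-1 scoring are all assumptions made for illustration. Steps (4) and (5) are not covered.

```python
from collections import defaultdict


def extend(query, subject, q, s, w, match, mismatch, x_drop):
    """Extend an exact word hit without gaps, first to the right, then to the left.

    Extension in a direction stops once the running score falls more than
    x_drop below the best score seen so far.
    """
    best = run = w * match
    right = k = 0
    while q + w + k < len(query) and s + w + k < len(subject):
        run += match if query[q + w + k] == subject[s + w + k] else mismatch
        k += 1
        if run > best:
            best, right = run, k
        elif best - run > x_drop:
            break
    run, left, k = best, 0, 0
    while q - k - 1 >= 0 and s - k - 1 >= 0:
        run += match if query[q - k - 1] == subject[s - k - 1] else mismatch
        k += 1
        if run > best:
            best, left = run, k
        elif best - run > x_drop:
            break
    return best, q - left, s - left, w + left + right


def seed_and_extend(query, subject, w=11, match=1, mismatch=-1, x_drop=3):
    """Return (score, query_start, subject_start, length) hits, best first."""
    index = defaultdict(list)  # word -> positions in the subject
    for s in range(len(subject) - w + 1):
        index[subject[s:s + w]].append(s)
    hits = set()
    for q in range(len(query) - w + 1):
        for s in index.get(query[q:q + w], []):
            hits.add(extend(query, subject, q, s, w, match, mismatch, x_drop))
    return sorted(hits, key=lambda h: -h[0])


print(seed_and_extend("GGACGTACGTCC", "TTACGTACGTAA", w=4)[0])
```

The word index is what makes the search fast: only positions that share an exact word with the query are examined, instead of filling a full dynamic-programming table against every database sequence.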


Database Searches
• With pairwise comparison, each database search normally yields two groups of scores: genuinely related sequences and unrelated sequences, with some overlap between them.
• A good search method should separate the two score groups as completely as possible.
E-value
• The number of hits (with the same similarity score) one can "expect" to see just by chance when searching for the given string in a database of a particular size.
• The higher the E-value, the less significant the similarity
• “sequences with E-value of less than 0.01 are almost always found to be homologous”
• The lower bound is 0; the best hits have E-values close to 0
Expectation Values

• Increases linearly with length of query sequence
• Decreases exponentially with score of alignment
• Increases linearly with length of database
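
These three trends correspond to the commonly used Karlin-Altschul form E = K · m · n · e^(-λS), where m is the query length, n the database length, and S the alignment score. The sketch below shows the behaviour numerically; the values of K and λ depend on the scoring system, and the defaults used here (`k=0.1`, `lam=0.3`) are illustrative placeholders, not parameters of any particular BLAST program.

```python
import math


def expected_hits(score, query_len, db_len, k=0.1, lam=0.3):
    """E = K * m * n * exp(-lambda * S); k and lam are placeholder values."""
    return k * query_len * db_len * math.exp(-lam * score)


base = expected_hits(score=50, query_len=1_000, db_len=10**9)
print(expected_hits(50, 2_000, 10**9) / base)  # doubling the query length
print(expected_hits(50, 1_000, 2 * 10**9) / base)  # doubling the database length
print(expected_hits(60, 1_000, 10**9) / base)  # raising the score by 10
```

Doubling either length doubles the expected number of chance hits, while each extra score unit multiplies it by a constant factor below one.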