
Sequence similarity

Analysis

Benny Shomer,

April 2004


Identity

The extent to which two (nucleotide or amino acid) sequences are invariant.

Similarity

The extent to which nucleotide or protein sequences are related. The extent of similarity between two sequences can be based on percent sequence identity and/or conservation.

Homology

Similarity attributed to descent from a common ancestor.
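As a rough illustration of the identity definition, the sketch below counts identical columns and gapped columns in a pairwise alignment that is already given as two equal-length strings with `-` for gaps. The function name and the toy sequences are assumptions made for this example; they are not part of any search program.

```python
def alignment_stats(query: str, subject: str) -> dict:
    """Count identities and gaps in two aligned, equal-length strings."""
    if len(query) != len(subject):
        raise ValueError("aligned sequences must have the same length")
    length = len(query)
    # A column is an identity when both residues match and neither is a gap.
    identities = sum(1 for q, s in zip(query, subject) if q == s and q != "-")
    # A column is gapped when either sequence has a gap character.
    gaps = sum(1 for q, s in zip(query, subject) if q == "-" or s == "-")
    return {
        "length": length,
        "identities": identities,
        "gaps": gaps,
        "percent_identity": 100.0 * identities / length,
    }


stats = alignment_stats("ACDE-F", "ACNEKF")
print(stats)
```

Counting "positives" (conservative substitutions) would additionally need a substitution matrix, which is why similarity can be higher than identity for the same alignment.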


Query= uniprot|Q9UP52|TFR2_HUMAN Transferrin receptor protein 2 (TfR2).

>gi|20140567|sp|Q07891|TFR1_CRIGR Transferrin receptor protein 1 (TfR1) (TR) (TfR) (Trfr)

Length = 757

Score = 540 bits (1392), Expect = e-152

Identities = 305/727 (41%), Positives = 412/727 (56%), Gaps = 52/727 (7%)

Query: 87 LTALLIFTGAFLLGYVAF--RGSCQAC--------GDSVLVVSEDVNYEPDLDFHQGRLY 136

+ ++ F F++GY+ + R + C G+S ++ E++ RLY

Sbjct: 71 IAVVIFFLIGFMIGYLGYCKRTEQKDCVRLAETETGNSEIIQEENIP-------QSSRLY 123

Query: 137 WSDLQAMFLQFLGEGRLEDTIRQTSLRERVAGSAGMAALTQDIRAALSRQKLDHVWTDTH 196

W+DL+ + + L DTI+Q S R AGS L I KL VW D H

Sbjct: 124 WADLKKLLSEKLDAIEFTDTIKQLSQTSREAGSQKDENLAYYIENQFRDFKLSKVWRDEH 183

Query: 197 YVGLQFPDPAHPNTLHWVDEAGKVGEQLPLEDPDVYCPYSAIGNVTGELVYAHYGRPEDL 256

YV +Q A N + ++ G + +E+P Y YS V+G+L++A++G +D

Sbjct: 184 YVKIQVKGSAAQNAVTIINVNG---DSDLVENPGGYVAYSKATTVSGKLIHANFGTKKDF 240

Query: 257 QDLRAXXXXXXXXXXXXXXXXISFAQKVTNAQDFGAQGVLIYPEPADFSQDPPKPSLSSQ 316

+DL+ I+FA+KV NAQ F A GVLIY + F P + ++

Sbjct: 241 EDLK---YPVNGSLVIVRAGKITFAEKVANAQSFNAIGVLIYMDQTKF------PVVEAE 291

Query: 317 QAVYGHVHLGTGDPYTPGFPSFNQTQFPPVASSGLPSIPAQPISADIASRLLRKLKGPVA 376

+++GH HLGTGDPYTPGFPSFN TQFPP SSGLPSIP Q IS A +L + ++

Sbjct: 292 LSLFGHAHLGTGDPYTPGFPSFNHTQFPPSQSSGLPSIPVQTISRKAAEKLFQNMETNCP 351




Finding Distant Relatives

For proteins, finding distant relatives is a difficult task.

Distant protein family members may share <20% amino acid identity(!).


Query: the query sequence.

>gi|3582021|emb|CAA70575.1| cytochrome P450 [Nepeta racemosa]

Length = 509

Score = 405 bits (1043), Expect = e-111

Identities = 94/479 (19%), Positives = 192/479 (40%), Gaps = 35/479 (7%)

Query: 61 NLYHFWRETGTHKVHLHHVQNFQKYGPIYREKLGNVESVYVIDPEDVALLFKSEGPNPER 120

NL+ G + H + ++YGP+ + G+V + PE + K++

Sbjct: 45 NLHQL----GLY-PHRYLQSLSRRYGPLMQLHFGSVPVLVASSPEAAREIMKNQDIVFSN 99

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Query: 297 -----DYRGMLYRLLGDSK----MSFEDIKANVTEMLAGGVDTTSMTLQWHLYEMARNLK 347

D+ +L + ++K + + +KA + +M G DTT+ L+W + E+ +N +

Sbjct: 271 GDGALDFVDILLQFQRENKNRSPVEDDTVKALILDMFVAGTDTTATALEWAVAELIKNPR 330

Query: 348 VQDMLRAEVLAARHQAQGDMATMLQLVPLLKASIKETLRLH-PISVTLQRYLVNDLVLRD 406

L+ EV L+ +P LKASIKE+LRLH P+ + + R D +

Sbjct: 331 AMKRLQNEVREVAGSKAEIEEEDLEKMPYLKASIKESLRLHVPVVLLVPRESTRDTNVLG 390

Query: 407 YMIPAKTLVQVAIYALGREPTFFFDPENFDPTRWLSK--DKNITYFRNLGFGWGVRQCLG 464

Y I + T V + +A+ R+P+ + +PE F P R+L D +F L FG G R C G

Sbjct: 391 YDIASGTRVLINAWAIARDPSVWENPEEFLPERFLDSSIDYKGLHFELLPFGAGRRGCPG 450

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~


Main Search Algorithms

Smith-Waterman (SSEARCH / MPsrch)

Dynamic programming based optimal local alignment algorithm.

Most sensitive in detecting distantly related proteins.

Usually runs on MASPAR (1024-16384 processors on a typical MP2 machine)
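The dynamic programming idea behind Smith-Waterman can be sketched in a few lines. This is a minimal score-only version under stated assumptions: a simple match/mismatch score and a linear gap penalty (the default values are illustrative), whereas real protein searches use a substitution matrix and usually affine gap penalties.

```python
def smith_waterman_score(a: str, b: str, match: int = 2,
                         mismatch: int = -1, gap: int = -2) -> int:
    """Return the best local alignment score between a and b."""
    prev = [0] * (len(b) + 1)          # previous row of the DP matrix
    best = 0
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)       # current row; column 0 stays 0
        for j in range(1, len(b) + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            cur[j] = max(0,                 # start a new local alignment
                         prev[j - 1] + s,   # align a[i-1] with b[j-1]
                         prev[j] + gap,     # gap in b
                         cur[j - 1] + gap)  # gap in a
            best = max(best, cur[j])
        prev = cur
    return best


print(smith_waterman_score("XXACGTXX", "YYACGTYY"))
```

Every cell of the matrix is visited, so the cost grows with the product of the two sequence lengths. That cost is what the heuristic programs try to avoid.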

SW


Main Search Algorithms

Two program families which are heuristics: BLAST and FASTA. Both reduce computation time by sacrificing some sensitivity.

Both reduce the size of the problem by:

  • pre-selecting sequences thought to share significant similarity with the query

  • locating similarity regions inside those sequences.

Definition: 1. A rule of thumb, simplification, or educated guess that reduces or limits the search for solutions in domains that are difficult and poorly understood. Unlike algorithms, heuristics do not guarantee optimal, or even feasible, solutions and are often used with no theoretical guarantee.


Main Search Algorithms Compared

[Figure: relative speed and sensitivity of BLAST, FASTA and SW, shown for three tasks: comparing full-length proteins, comparing nucleic acids, and comparing short protein segments.]


DNA vs. Protein

Protein similarity search is generally more sensitive than DNA search.

Protein comparison ignores silent mutations.

Protein substitution matrices are better.

Where appropriate, translate the DNA into protein and compare it against a protein database.
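To make the point about silent mutations concrete, here is a small sketch that translates two DNA fragments with the standard genetic code and compares them at both levels. The two fragments are made up for the example, and only the forward reading frame starting at position 0 is translated.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ...
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def translate(dna: str) -> str:
    """Translate frame 0 of a DNA string; unknown codons become 'X'."""
    dna = dna.upper()
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X")
                   for i in range(0, len(dna) - 2, 3))


def fraction_identical(a: str, b: str) -> float:
    """Fraction of identical positions in two equal-length strings."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


dna1, dna2 = "ATGCTTGAA", "ATGCTCGAG"   # differ only at third codon positions
print(fraction_identical(dna1, dna2))                        # DNA level
print(fraction_identical(translate(dna1), translate(dna2)))  # protein level
```

The two fragments differ at the DNA level but encode the same peptide, so the differences disappear once the comparison is done in protein space.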


FASTA

Written by Bill Pearson in 1990


FASTA

How do they do it?

Step 1:

The Goal: Quickly locate ungapped similarity regions between the query sequence and the database sequences.


FASTA

How do they do it?

Step 1:

  • Determine the word length, k; the words are called “k-tuples” (for proteins usually 1-3 and for DNA 4-6).

  • Pre-compute all possible k-tuples and build a lookup hash (for instance, for proteins with k = 3 there are 20^3 = 8000 possible k-tuples).

{

ARR:[ ], ARN:[ ], ARD:[ ]...

ANR:[ ], ANN:[ ], AND:[ ]...

RNN:[ ], RND:[ ], RNC:[ ]

}


FASTA

How do they do it?

Step 1:

  • Now, slide the k-tuple window along the query sequence and record each k-tuple and its position in the hash structure.

01234567890123456789012345

NTLGTEIAIEDQICQGLKLTFDTTFS

{

NTL:[0], TLG:[1], LGT:[2]...

GTE:[3], TEI:[ ], EIA:[ ]...

IAI:[ ], AIE:[ ], IED:[ ]

}


FASTA

How do they do it?

Step 1:

  • Next, slide the k-tuple window along the next subject sequence and, for each k-tuple found in the hash, record its position in the subject and its offset from the query position.

01234567890123456789012345

QICQGLKLTNTLGTEIAIEDFDTTFS

{

NTL:[0,9,9], TLG:[1,10,9], LGT:[2,11,9]...

GTE:[3,12,9], TEI:[ ], EIA:[ ]...

IAI:[ ], AIE:[ ], IED:[ ]

}
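The lookup step can be sketched as follows, using the query and subject sequences from the slides. The function names are hypothetical, and the sketch only counts word matches per diagonal (offset = subject position minus query position); it does not score or rank regions the way FASTA does afterwards.

```python
from collections import Counter, defaultdict


def build_lookup(query: str, k: int) -> dict:
    """Map every k-tuple of the query to the list of positions where it occurs."""
    table = defaultdict(list)
    for i in range(len(query) - k + 1):
        table[query[i:i + k]].append(i)
    return table


def diagonal_hits(query: str, subject: str, k: int) -> Counter:
    """Count k-tuple matches per diagonal (subject position - query position)."""
    table = build_lookup(query, k)
    diagonals = Counter()
    for j in range(len(subject) - k + 1):
        for i in table.get(subject[j:j + k], ()):
            diagonals[j - i] += 1
    return diagonals


query = "NTLGTEIAIEDQICQGLKLTFDTTFS"
subject = "QICQGLKLTNTLGTEIAIEDFDTTFS"
print(build_lookup(query, 3)["NTL"])
print(diagonal_hits(query, subject, 3).most_common())
```

Diagonals that collect many hits correspond to ungapped regions of similarity, which is what Step 1 sets out to find without filling in a full comparison matrix.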


FASTA

How do they do it?

Step 1:

[Figure: query plotted against subject.]

FASTA

How do they do it?

Step 2:

1. Select the 10 best regions.

2. Evaluate those regions using either the PAM or BLOSUM substitution matrix.

3. The score of the best region (*) is called init1.

Note: Since #2 is done following #1, it may happen that the 10 selected regions are not the most similar ones, in cases where there are many conservative substitutions and few identities.

[Figure: query plotted against subject, with the best region marked (*).]


FASTA

How do they do it?

Step 3:

1. Consider only regions with a score above a certain threshold.

2. Attempt to join the selected regions.

3. Score the result by summing the scores of the regions and subtracting a penalty for each “join” area (similar to a gap penalty).

This score is called initn.

[Figure: query plotted against subject, with the best region marked (*).]


FASTA

How do they do it?

Step 4:

Apply the “Banded Smith-Waterman” algorithm.

Use dynamic programming to calculate a local alignment, but restrict the region of the matrix to a band centered around the diagonal with the best init1 score (*).

The score of this alignment is called the opt score. It is the score used to rank the alignments.

Note: Since the process is confined to a selected band out of the entire matrix, the alignment may be sub-optimal.

[Figure: query plotted against subject, with the band around the best diagonal (*).]
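A banded local alignment can be sketched by filling in only the cells that lie within a fixed distance of a chosen diagonal. The version below is an illustration under assumptions, not FASTA's own code: it uses match/mismatch scores and a linear gap penalty, and the `diagonal` and `width` parameters are names chosen for this example.

```python
def banded_local_score(a: str, b: str, diagonal: int, width: int,
                       match: int = 2, mismatch: int = -1, gap: int = -2) -> int:
    """Best local score using only cells (i, j) with |(j - i) - diagonal| <= width."""
    neg_inf = float("-inf")
    prev = {}                              # previous row: column -> score
    best = 0
    for i in range(1, len(a) + 1):
        cur = {}
        lo = max(1, i + diagonal - width)  # first column inside the band
        hi = min(len(b), i + diagonal + width)
        for j in range(lo, hi + 1):
            s = match if a[i - 1] == b[j - 1] else mismatch
            score = max(0,
                        prev.get(j - 1, 0) + s,        # same diagonal, always in band
                        prev.get(j, neg_inf) + gap,    # may fall outside the band
                        cur.get(j - 1, neg_inf) + gap)
            cur[j] = score
            best = max(best, score)
        prev = cur
    return best


query = "NTLGTEIAIEDQICQGLKLTFDTTFS"
subject = "QICQGLKLTNTLGTEIAIEDFDTTFS"
print(banded_local_score(query, subject, diagonal=9, width=2))
print(banded_local_score(query, subject, diagonal=-11, width=2))
```

Each row touches at most `2 * width + 1` cells, so the work is proportional to the sequence length times the band width rather than to the full matrix. The price is the one stated in the note: an alignment that leaves the band is never seen.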


FASTA

Result Evaluation

Z score:

Simply put, FASTA calculates an average score for each length range of sequences in the database and fits a regression line of score against length. The Z score is the number of standard deviations by which a given real score departs from that regression line.
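The Z score idea can be illustrated with an ordinary least-squares fit. The sketch below assumes the regression is done against the logarithm of the sequence length and skips the length binning and outlier trimming that a real program would apply; the function name and the toy numbers are made up for the example.

```python
import math


def z_scores(scores, lengths):
    """Z score of each hit: residual from a score-vs-ln(length) line, in SD units."""
    xs = [math.log(n) for n in lengths]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(scores) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, scores)) / sxx
    intercept = mean_y - slope * mean_x
    residuals = [y - (intercept + slope * x) for x, y in zip(xs, scores)]
    # Residual standard deviation with n - 2 degrees of freedom.
    sd = math.sqrt(sum(r * r for r in residuals) / (n - 2))
    return [r / sd for r in residuals]


lengths = [100, 200, 400, 800, 1600, 400]
scores = [30, 32, 34, 36, 38, 120]       # the last hit scores far above the trend
zs = z_scores(scores, lengths)
print([round(z, 2) for z in zs])
```

Normalizing by length matters because longer database sequences tend to reach higher scores by chance alone, so a raw score is not comparable across lengths.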


FASTA

Result Evaluation

[Figure: score plotted against length with the regression line; the Z score of a hit (*) is its distance from the line, measured in standard deviations (STD).]


FASTA

Result Evaluation

E value:

The number of random matches (query/subject), for sequences of the same length range, that would be expected to reach a Z score at least as high as z.


opt E()

< 20 1412 0:=

22 95 0:= one = represents 2289 library sequences

24 254 1:*

26 719 30:*

28 2260 323:*

30 6209 1960:*==

32 13248 7578:===*==

34 25551 20550:========*===

36 40898 42205:==================*

38 64075 69749:============================ *

40 91829 97294:========================================= *

42 115365 118931:===================================================*

44 131747 131192:=========================================================*

46 137309 133622:==========================================================*=

48 134659 127927:=======================================================*===

50 127192 116734:==================================================*=====

52 105873 102629:============================================*==

54 85048 87663:======================================*

56 69881 73226:===============================*

58 57673 60117:==========================*

60 46861 48698:=====================*

62 36549 39041:================ *

64 28272 31049:=============*

66 22568 24541:==========*

68 17633 19303:========*

70 13099 15127:======*

72 10981 11820:=====*

74 8757 9216:====*

76 6154 7173:===*

78 4772 5575:==*

80 3467 4329:=*

82 2873 3312:=*

84 2217 2623:=*

86 1679 2030:*

88 1268 1571:* inset = represents 16 library sequences

90 883 1215:*

92 752 940:* :=======================================*

94 507 728:* :================================ *

96 498 563:* :================================ *

98 282 436:* :================== *

100 284 337:* :================== *

102 205 261:* :============= *

104 121 202:* :======== *

106 127 156:* :======== *

108 96 121:* :====== *

110 65 93:* :=====*

112 51 72:* :====*

114 83 56:* :===*==

116 59 43:* :==*=

118 35 33:* :==*

>120 195 26:* :=*===========

Columns, left to right: score range (opt); number of optimized scores in the range; number of random scores expected to be in the range (E()).

= : actual score distribution

* : expected score distribution

Watch the 80-110 range.


FASTA

Result Evaluation

Kolmogorov-Smirnov statistic

116 59 43:* :==*=

118 35 33:* :==*

>120 195 26:* :=*===========

454171735 residues in 1422690 sequences

statistics extrapolated from 60000 to 1422511 sequences

Expectation_n fit: rho(ln(x))= 4.1847+/-0.000201; mu= 24.1751+/- 0.012

mean_var=64.3387+/-13.596, 0's: 152 Z-trim: 160 B-trim: 3848 in 1/64

Lambda= 0.159896

Kolmogorov-Smirnov statistic: 0.0187 (N=29) at 52

FASTA (3.45 Mar 2002) function [optimized, MD_40 matrix (18:-23)] ktup: 2

join: 37, opt: 25, open/ext: -10/-2, width: 16

Scan time: 206.400

The Kolmogorov-Smirnov statistic evaluates the fit of the data to the expected curve.

< 0.1: excellent agreement.

> 0.2: repeat the analysis with higher gap penalties.
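In general terms, a Kolmogorov-Smirnov statistic is the largest gap between two cumulative distributions. The sketch below computes it from two binned histograms (observed and expected counts per score range). It illustrates the general statistic and may differ in detail from how FASTA bins and computes the value it reports.

```python
def ks_statistic(observed, expected):
    """Largest absolute difference between the two cumulative distributions."""
    total_obs = sum(observed)
    total_exp = sum(expected)
    cum_obs = cum_exp = 0.0
    largest = 0.0
    for o, e in zip(observed, expected):
        cum_obs += o / total_obs
        cum_exp += e / total_exp
        largest = max(largest, abs(cum_obs - cum_exp))
    return largest


# A few rows taken from the middle of the histogram shown earlier.
observed = [115365, 131747, 137309, 134659]
expected = [118931, 131192, 133622, 127927]
print(ks_statistic(observed, expected))
```

A small value means the observed score distribution follows the expected curve closely, which is the condition under which the reported E() values can be trusted.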


>>UNIPROT:Q7XB42 Prunus dulcis cyp74C5 gene for cytochrome P450

initn: 2015 init1: 1741 opt: 2734 Z-score: 3396.1 bits: 637.8 E(): 3.3e-181

Smith-Waterman score: 2755; 61.245% identity (64.346% ungapped)

in 498 aa overlap (1-491:1-481)

10 20 30 40 50

Sequen MSSVSSKYPAIASSS-DNESCKPLLQVREIPGDYGFPFFGAIKDRYDYYYSLGADEFFRT

::: :: .::: .: :: :::::: :::: :::::::.:. : .::.:

UNIPRO MSSSSS-----SSSSPNNLPLKP------IPGDYGWPFFGHIKDRYDYFYNQGRYDFFKT

10 20 30 40

60 70 80 90 100 110

Sequen KSLKYNSTIFRTNMPPGPFIAKDPKVIVLLDAISFPILFDCSKVEKKNVLDGTYMPSTDF

. :: ::.:::::::: :: .:::: :::: :::: :: .:: ...:::::::::: .

UNIPRO RIEKYQSTVFRTNMPPGILIASNPKVIALLDAKSFPIIFDNTKVLRRDVLDGTYMPSTAY

50 60 70 80 90 100

120 130 140 150 160 170

Sequen FGGYRPCAFLDPSEPSHATHKGFYLSIISKLHTQFIPIFENSVSLLFQNLEIQISKDGKA

:::: ::.::::::.::: : .. . . :: ::: :..: : .: ::: : ::::::

UNIPRO TGGYRVCAYLDPSEPNHATLKSYFAALLASQHTKFIPLFQSSTSDMFLNLEAQLSKDGKA

110 120 130 140 150 160

FASTA

Result Evaluation

E(): 3.3e-181


FASTA

Versions of the program

[T]FASTA[X/Y/S/F]

FASTA: DNA vs. DNA, or protein vs. protein.

T: Translate a DNA database in all 6 reading frames for comparison with a protein query.

X/Y: For situations where DNA sequences are likely to contain errors (e.g. ESTs).

X: Allow frameshifts only between codons.

Y: Allow frameshifts also within codons.

F: Analyze a set of fragments resulting from electrophoresis band cleavage and sequencing.

S: Analyze data from mass spectrometry analysis of proteins.


FASTA

In Practice.

http://www.ebi.ac.uk/fasta33/

Search form options: Program Type, Select Database, Select Matrix, Adapt Gap Penalty, Select k-tuple size, Both Strands?, View Histogram?, How Many To View?, Interactivity.

Matrix choices: PAM, MDM, BLOSUM.

Slide notes: Developed by Janet Thornton’s group on the basis of the PAM methodology. Performs better with local alignments. k-tuple, speed, sensitivity. Evolutionary distance.


FASTA

In Practice.

Distant Relatives vs. Garbage

How far do we start with?

Where in the sequence?

What size ranges to look into?


Altschul, S.F., Gish, W., Miller, W., Myers, E.W. & Lipman, D.J. (1990) "Basic local alignment search tool." J. Mol. Biol. 215:403-410.


BLAST

Basic Local Alignment Search Tool

For each position of the query sequence, take a word of length w (3 for proteins, 11 for DNA):

SCKPNDQVREIPGDYGFPFFGAIKDRYDYYYSLGA


# Matrix made by matblas from blosum62.iij

# * column uses minimum score

# BLOSUM Clustered Scoring Matrix in 1/2 Bit Units

# Blocks Database = /data/blocks_5.0/blocks.dat

# Cluster Percentage: >= 62

# Entropy = 0.6979, Expected = -0.5209

A R N D C Q E G H I L K M F P S T W Y V B Z X *

A 4 -1 -2 -2 0 -1 -1 0 -2 -1 -1 -1 -1 -2 -1 1 0 -3 -2 0 -2 -1 0 -4

R -1 5 0 -2 -3 1 0 -2 0 -3 -2 2 -1 -3 -2 -1 -1 -3 -2 -3 -1 0 -1 -4

N -2 0 6 1 -3 0 0 0 1 -3 -3 0 -2 -3 -2 1 0 -4 -2 -3 3 0 -1 -4

D -2 -2 1 6 -3 0 2 -1 -1 -3 -4 -1 -3 -3 -1 0 -1 -4 -3 -3 4 1 -1 -4

C 0 -3 -3 -3 9 -3 -4 -3 -3 -1 -1 -3 -1 -2 -3 -1 -1 -2 -2 -1 -3 -3 -2 -4

Q -1 1 0 0 -3 5 2 -2 0 -3 -2 1 0 -3 -1 0 -1 -2 -1 -2 0 3 -1 -4

E -1 0 0 2 -4 2 5 -2 0 -3 -3 1 -2 -3 -1 0 -1 -3 -2 -2 1 4 -1 -4

G 0 -2 0 -1 -3 -2 -2 6 -2 -4 -4 -2 -3 -3 -2 0 -2 -2 -3 -3 -1 -2 -1 -4

H -2 0 1 -1 -3 0 0 -2 8 -3 -3 -1 -2 -1 -2 -1 -2 -2 2 -3 0 0 -1 -4

I -1 -3 -3 -3 -1 -3 -3 -4 -3 4 2 -3 1 0 -3 -2 -1 -3 -1 3 -3 -3 -1 -4

L -1 -2 -3 -4 -1 -2 -3 -4 -3 2 4 -2 2 0 -3 -2 -1 -2 -1 1 -4 -3 -1 -4

K -1 2 0 -1 -3 1 1 -2 -1 -3 -2 5 -1 -3 -1 0 -1 -3 -2 -2 0 1 -1 -4

M -1 -1 -2 -3 -1 0 -2 -3 -2 1 2 -1 5 0 -2 -1 -1 -1 -1 1 -3 -1 -1 -4

F -2 -3 -3 -3 -2 -3 -3 -3 -1 0 0 -3 0 6 -4 -2 -2 1 3 -1 -3 -3 -1 -4

P -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4 7 -1 -1 -4 -3 -2 -2 -1 -2 -4

S 1 -1 1 0 -1 0 0 0 -1 -2 -2 0 -1 -2 -1 4 1 -3 -2 -2 0 0 0 -4

T 0 -1 0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1 1 5 -2 -2 0 -1 -1 0 -4

W -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1 1 -4 -3 -2 11 2 -3 -4 -3 -2 -4

Y -2 -2 -2 -3 -2 -1 -2 -3 2 -1 -1 -2 -1 3 -3 -2 -2 2 7 -1 -3 -2 -1 -4

V 0 -3 -3 -3 -1 -2 -2 -3 -3 3 1 -2 1 -1 -2 -2 0 -3 -1 4 -3 -2 -1 -4

B -2 -1 3 4 -3 0 1 -1 0 -3 -4 0 -3 -3 -2 0 -1 -4 -3 -3 4 1 -1 -4

Z -1 0 0 1 -3 3 4 -2 0 -3 -3 1 -1 -3 -1 0 -1 -3 -2 -2 1 4 -1 -4

X 0 -1 -1 -1 -2 -1 -1 -1 -1 -1 -1 -1 -1 -1 -2 0 0 -2 -1 -1 -1 -1 -1 -4

* -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 -4 1

BLAST

SCKPNDQVREIPGDYGFPFFGAIKDRYDYYYSLGA

Scoring the query word NDQ against the word NDQ: 6+6+5=17



BLAST

SCKPNDQVREIPGDYGFPFFGAIKDRYDYYYSLGA

[NDQ:17]

Scoring the query word NDQ against the word NBZ: 6+4+3=13



BLAST

SCKPNDQVREIPGDYGFPFFGAIKDRYDYYYSLGA

Generate a list of all possible combinations for this word, each with its score (e.g. for protein 3 a.a. words, it is a list of 8000 possible combinations).

[NDQ:17,

NBZ:13,

NCA:2 ,

NEE:10,

BDZ:12,

BBZ:10,

...:..

]


BLAST

SCKPNDQVREIPGDYGFPFFGAIKDRYDYYYSLGA

[NDQ:17,

NBZ:13,

NCA:2 ,

NEE:10,

BDZ:12,

BBZ:10,

...:..

]

Weed out all word combinations having a score below a pre-determined cutoff score T (currently 11).

The resulting list of words scoring at or above the cutoff T is called a “neighbors list”.
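The word-scoring and weeding steps can be sketched with just the N, D and Q rows of the BLOSUM62 table shown earlier, since those are the only rows needed for the query word NDQ. The function names are hypothetical. The sketch enumerates words over all 24 column symbols of that table (including B, Z, X and *) because the slide examples use B and Z; restricting to the 20 standard amino acids gives the 8000 combinations mentioned in the text.

```python
from itertools import product

ALPHABET = "A R N D C Q E G H I L K M F P S T W Y V B Z X *".split()
# Rows copied from the BLOSUM62 table for the three residues of the query word.
ROWS = {
    "N": "-2 0 6 1 -3 0 0 0 1 -3 -3 0 -2 -3 -2 1 0 -4 -2 -3 3 0 -1 -4",
    "D": "-2 -2 1 6 -3 0 2 -1 -1 -3 -4 -1 -3 -3 -1 0 -1 -4 -3 -3 4 1 -1 -4",
    "Q": "-1 1 0 0 -3 5 2 -2 0 -3 -2 1 0 -3 -1 0 -1 -2 -1 -2 0 3 -1 -4",
}
SCORES = {res: dict(zip(ALPHABET, map(int, row.split())))
          for res, row in ROWS.items()}


def word_score(query_word: str, candidate: str) -> int:
    """Sum of pairwise substitution scores, position by position."""
    return sum(SCORES[q][c] for q, c in zip(query_word, candidate))


def neighbors(query_word: str, threshold: int) -> dict:
    """All candidate words scoring at least `threshold` against the query word."""
    result = {}
    for combo in product(ALPHABET, repeat=len(query_word)):
        candidate = "".join(combo)
        score = word_score(query_word, candidate)
        if score >= threshold:
            result[candidate] = score
    return result


print(word_score("NDQ", "NDQ"), word_score("NDQ", "NBZ"))
print(len(neighbors("NDQ", 11)))
```

Raising T shortens the neighbors list and speeds up the search at the cost of sensitivity; lowering it does the opposite.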


BLAST

SCKPNDQVREIPGDYGFPFFGAIKDRYDYYYSLGA

[NDQ:17,

NBZ:13,

BDZ:12,

...:..

]

For each sequence in the database, search for the words of the neighbors list. Each word match is called a “hit”.

SCPFFGAIKDRYDYYYSLKPNDQVREIPGDGAYFG

GDYGFPFFGAIKDRYPBDZVREIDLGAYYYSCKPS

KPNBZVREIPGDYGFPFFSCGAIKDSLGAYDRYYY

PNDQVREIPGDYSCPBBZAKGYDYYFIKDRYSLGA

LGAFGAISCKGEIPGFPFYSKDRYDYPRDQVRYYD

VREIPGDYGBBZPNDDRYDQFPFFGAYSLGAIKYY

QVREISCKPDYGFNRZNDPGGAIKDRYYSLGYDYA


BLAST

For each sequence that produced hits:

We now have many neighbor hits. We need a method to screen which hits can serve as seeds for a gapped local alignment.

1. Plot all neighbor hits on the sequence with their respective distances. Identify all diagonals.

2. Find candidates for extension of the alignment.

Requirement: Two hits (or more) on the same diagonal within a pre-determined distance A can be used as a seed for extension.

Many unrelated or overlapping hits are filtered out this way.

Expensive dynamic programming time is saved.
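The screening rule above can be sketched as a pass over the hits grouped by diagonal. This is a simplified illustration, not BLAST's actual bookkeeping: hits are given as (query position, subject position) pairs, and `max_distance` stands for the distance A. Treating hits closer together than one word length as overlapping is an assumption of this example.

```python
from collections import defaultdict


def two_hit_seeds(hits, max_distance: int, word_size: int = 3):
    """Return (diagonal, first_q, second_q) for pairs of nearby hits on one diagonal."""
    by_diagonal = defaultdict(list)
    for q_pos, s_pos in hits:
        by_diagonal[s_pos - q_pos].append(q_pos)

    seeds = []
    for diagonal, positions in by_diagonal.items():
        last = None
        for q_pos in sorted(positions):
            if last is not None:
                distance = q_pos - last
                if distance < word_size:
                    continue                 # overlaps the previous hit: ignore it
                if distance <= max_distance:
                    seeds.append((diagonal, last, q_pos))
            last = q_pos
    return sorted(seeds)


hits = [(0, 9), (1, 10), (5, 14), (40, 2), (20, 60)]
print(two_hit_seeds(hits, max_distance=10))
```

Isolated hits, such as the last two in the example, never trigger an extension, which is where the saving in dynamic programming time comes from.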


BLAST

Generating HSPs

Extend the hit without gaps in both directions.

Stop extending once the total score drops by more than X below the maximal score obtained so far.

Only segments with a score >= S are kept.

Such segments are called HSPs (High-scoring Segment Pairs).
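Ungapped extension with a drop-off limit can be sketched as below. For self-containment the example scores residues with a simple match/mismatch scheme (an assumption; a real protein search would use the substitution matrix), and the helper names are hypothetical. It reuses the two sequences from the FASTA slides only as convenient test data.

```python
def _extend(query, subject, q, s, step, x_drop, score_fn):
    """Walk along one diagonal; return (best gain, residues kept for that best)."""
    score = best = 0
    length = best_length = 0
    while 0 <= q < len(query) and 0 <= s < len(subject):
        score += score_fn(query[q], subject[s])
        length += 1
        if score > best:
            best, best_length = score, length
        elif best - score > x_drop:
            break                          # dropped too far below the maximum
        q += step
        s += step
    return best, best_length


def ungapped_hsp(query, subject, q_pos, s_pos, word_size, x_drop,
                 match=2, mismatch=-1):
    """Extend a word hit in both directions; return (score, (q_start, q_end))."""
    def score_fn(a, b):
        return match if a == b else mismatch

    seed = sum(score_fn(query[q_pos + i], subject[s_pos + i])
               for i in range(word_size))
    right, r_len = _extend(query, subject, q_pos + word_size,
                           s_pos + word_size, +1, x_drop, score_fn)
    left, l_len = _extend(query, subject, q_pos - 1, s_pos - 1,
                          -1, x_drop, score_fn)
    return seed + left + right, (q_pos - l_len, q_pos + word_size + r_len)


query = "NTLGTEIAIEDQICQGLKLTFDTTFS"
subject = "QICQGLKLTNTLGTEIAIEDFDTTFS"
print(ungapped_hsp(query, subject, q_pos=3, s_pos=12, word_size=3, x_drop=3))
```

The segment would then be kept as an HSP only if its score reaches the cutoff S. The end coordinate in the returned span is exclusive.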


BLAST

Score-limited Gapped Extension

Apply a modified Smith-Waterman algorithm:

Start off from the middle point of the highest scoring neighbor hit within the HSP.

Explore the dynamic programming matrix in both directions.

But…

Score limited: restrict the search for the optimal path to cells where the score does not drop by more than X below the maximal score already obtained.


BLAST

E value

E value (of an alignment having a score S):

The number of alignments with a score >= S that one expects to find by chance when searching a random sequence against a random database (having the same lengths and compositions).
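The usual way to put a number on this expectation is the Karlin-Altschul relation E = K·m·n·e^(−λS), where m and n are the query and database lengths and K and λ depend on the scoring system. The sketch below states that relation and its bit-score form. The function names are chosen for this example, and the K and λ values in the usage lines are placeholders, not parameters of any particular matrix.

```python
import math


def e_value(raw_score: float, m: int, n: int, lam: float, k: float) -> float:
    """Expected number of chance alignments scoring at least raw_score."""
    return k * m * n * math.exp(-lam * raw_score)


def bit_score(raw_score: float, lam: float, k: float) -> float:
    """Normalized score: S' = (lambda * S - ln K) / ln 2."""
    return (lam * raw_score - math.log(k)) / math.log(2)


def e_value_from_bits(bits: float, m: int, n: int) -> float:
    """Same expectation written with the bit score: E = m * n * 2^(-S')."""
    return m * n * 2.0 ** (-bits)


# Placeholder parameters for illustration only.
lam, k = 0.3, 0.1
print(e_value(50, m=100, n=1_000_000, lam=lam, k=k))
print(e_value_from_bits(bit_score(50, lam, k), m=100, n=1_000_000))
```

Because E scales with m·n, the same alignment score becomes less significant as the database grows, which is one reason E values from searches against different databases are not directly comparable.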



Filtering Low Complexity

The Problem:

Regions of low complexity or sequence repeats tend to generate high scores that do not reflect real sequence similarity.


Filtering Low Complexity

The Solution:

SEG: for proteins

DUST: for DNA
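The general idea of complexity filtering can be illustrated with a sliding window and Shannon entropy. This is a simplified sketch of the concept, not the SEG or DUST algorithm itself; the window size, the threshold and the mask character are illustrative assumptions.

```python
import math
from collections import Counter


def shannon_entropy(window: str) -> float:
    """Entropy in bits of the residue composition of a window."""
    n = len(window)
    return -sum(c / n * math.log2(c / n) for c in Counter(window).values())


def mask_low_complexity(seq: str, window: int = 12, threshold: float = 2.2,
                        mask_char: str = "X") -> str:
    """Replace every residue that falls in a low-entropy window with mask_char."""
    masked = list(seq)
    for i in range(len(seq) - window + 1):
        if shannon_entropy(seq[i:i + window]) < threshold:
            for j in range(i, i + window):
                masked[j] = mask_char
    return "".join(masked)


print(mask_low_complexity("ACDEFGHIKLMNPQRSTVWY" + "S" * 15))
```

Masked residues no longer produce word hits, so a serine-rich or otherwise repetitive stretch in the query cannot seed spurious high-scoring matches.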


Filtering Low Complexity

Note:

Masking is practiced on the query sequence only, not on the database sequences.


BLAST

Versions of the program

[t]BLAST[x/n/p]

t: Translate a DNA database in all 6 reading frames for comparison with a protein query.

x: Translate a nucleotide query in all 6 reading frames for comparison with a protein database.

p: Comparison is against a protein database.

n: Comparison is against a nucleotide database.


BLAST

Special Versions

MEGABLAST: Specifically designed to efficiently find long alignments between very similar sequences. MEGABLAST uses longer words in the comparison process.

Discontiguous MEGABLAST: Better at finding nucleotide sequences similar, but not identical, to your nucleotide query.


BLAST

Special Versions

"Search for short nearly exact matches":

Simply a regular BLAST, but with the parameters pre-set for optimally finding significant matches to short segments such as PCR primers.

It uses a shorter word (7), turns off filtering and allows a higher expect threshold.

Will be discussed later on:

PSI-BLAST: Position-Specific Iterated BLAST

RPS-BLAST: Reverse Position Specific BLAST

CDART: Conserved Domain Architecture Retrieval Tool.


BLAST

http://www.ncbi.nlm.nih.gov/

http://www.ncbi.nlm.nih.gov/BLAST/

>gi|3130157|dbj|BAA26124.1| pheromone receptor [Takifugu rubripe

MADItgtlglfftlitlFVSSSTSFNAPTCKLWRKFQLNEMHEPGDVLLGGLFQVHYSSVFPEWTFTSEP

HQPVCTRFDILGFRHAMTMAFAVQEINKNPDLLPNLTLGYRLYDNCGALVVGFSGALALASGQEEAFALQ

GGCAGSPPVLGIVGDSLSTFTIASASVLGLYKIPMVSYFATCSCLTNRQRFPSFFRTIPSDDFQVRAMIQ

ILKHFGWTWVGLLVSDDDYGLHVARSFQSDLVQSGQGCLAYLEVLPWDNYLSENRRIVHVIKESTARVLM

VFAHQSHMIHLMEEVVRQKVTGLQWLASEAWTGTTFLQTPDFMPYLNGTLGIAIRRGEITGLRDFLLRIR

PGQSSNNTSYDMVQQFWEYSFQCKFGASGSAEACTGDENIQQVDAEFLDVSNLRPEYNIYKAVYALAYAL

DDMLQCEPGRGPFSGGSCADIHKLEPWQFVHYLQHVNFTTTFGDQVSFDENGDVLPIYDILNWQWLPDGR

TQVQNVGEVKRSPSRGEELQIHEDKIFWNFESNKPPHSVCSESCPPGTRMSRKKGQPVCCFDCLLCSEGK

ISNTTDSMECTSCPEDFWSSPQRDHCVPKKTEFLSYHEPLGICLTAASLLGTVISVVVLGIFIHHRSTPV

VRANNSELSFLLLVSLKLCFLCSLLFIGRPRLWTCQLRHAAFGISFVLCVSCILVKTMVVLAVFRASKPG

GGATLKWFGAVQQRGTVLGLTSIQAAICFAWLLSSSPKPHKNIQYHKDKIVFECVVGSTVGFAVLLSYIG

LLAILSFLLAFLARNLPDNFNEAKLITFSMLIFCAVWVAFVPAYINSPGKYADAVEVFAILTSSFGLLVA

LFGPKCYIILFRPERNTKRAIMAR

I can limit my search to a selected organism

…or even construct my own searchable database by an Entrez query

Mask According to the case within the query sequence.

Filter Low Complexity regions by SEG or DUST

Mask for lookup table hit search stage, but NOT for the hit extension stage.


BLAST

Lineage Report

root

. Bilateria [animals]

. . Coelomata [animals]

. . . Euteleostomi [vertebrates]

. . . . Tetrapoda [vertebrates]

. . . . . Eutheria [mammals]

. . . . . . Homo sapiens (man) ------------ 571 18 hits [mammals] retinoic acid induced 3; retinoic acid responsive gene [Hom

. . . . . . Mus musculus (mouse) .......... 432 15 hits [mammals] retinoic acid inducible protein 3 [Mus musculus]

. . . . . . Rattus norvegicus (brown rat) . 411 5 hits [mammals] similar to retinoic acid inducible protein 3 [Rattus norveg

. . . . . Xenopus laevis (clawed frog) ---- 216 1 hit [amphibians] MGC68729 protein [Xenopus laevis]

. . . . Takifugu rubripes (torafugu) ------ 40 4 hits [bony fishes] pheromone receptor [Takifugu rubripes]

. . . Drosophila melanogaster ------------- 48 4 hits [flies] CG8285-PA [Drosophila melanogaster] >gi|2827758|sp|P22815|B

. . . Drosophila virilis .................. 39 1 hit [flies] Bride of sevenless protein precursor >gi|1079166|pir||A4755

. . . Anopheles gambiae str. PEST ......... 38 1 hit [flies] ENSANGP00000013404 [Anopheles gambiae] >gi|21296536|gb|EAA0

. . Caenorhabditis elegans ---------------- 41 2 hits [nematodes] calcium-sensing receptor, similar to human metabotropic glu

. environmental sequence ------------------ 40 2 hits [unclassified] unknown [environmental sequence]


