A Sequence Similarity Search Algorithm Based on a Probabilistic Interpretation of an Alignment Scoring System


A Sequence Similarity Search Algorithm Based on a Probabilistic Interpretation of an Alignment Scoring System. Philipp Bucher and Kay Hofmann. Proc Int Conf Intell Syst Mol Biol. 1996;4:44-51. Goal. Modify Smith-Waterman (SW) algorithm such that it has a probabilistic interpretation.


### A Sequence Similarity Search Algorithm Based on a Probabilistic Interpretation of an Alignment Scoring System

Philipp Bucher and Kay Hofmann

Proc Int Conf Intell Syst Mol Biol. 1996;4:44-51

Goal
• Modify Smith-Waterman (SW) algorithm such that it has a probabilistic interpretation
Introduction 1
• Goal: find a local alignment between a query sequence and a sequence in a database
• Local similarity to find conserved domains
• Conservation implies function
Introduction 2
• The Smith-Waterman (SW) algorithm (dynamic programming) is the most sensitive algorithm for identifying local alignments between two sequences
• Heuristic algorithms such as FASTA and BLAST are modifications or special cases of the SW algorithm
• SW runs in O(m × n) time for sequences of lengths m and n
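The quadratic cost comes from filling one dynamic-programming cell per residue pair. A minimal sketch of the SW recurrence, assuming a toy match/mismatch score and a linear gap penalty (the function name and all three parameter values are illustrative choices, not taken from the slides):

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=2):
    """Best local alignment score of a and b under a linear gap penalty."""
    m, n = len(a), len(b)
    # H[i][j]: best score of a local alignment ending at cell (i, j),
    # or 0 if starting a fresh alignment is better.
    H = [[0] * (n + 1) for _ in range(m + 1)]
    best = 0
    for i in range(1, m + 1):          # m rows ...
        for j in range(1, n + 1):      # ... times n columns = O(m * n) cells
            s = match if a[i - 1] == b[j - 1] else mismatch
            H[i][j] = max(
                0,                      # restart: local alignment may begin anywhere
                H[i - 1][j - 1] + s,    # align a[i-1] with b[j-1]
                H[i - 1][j] - gap,      # a[i-1] opposite a gap
                H[i][j - 1] - gap,      # b[j-1] opposite a gap
            )
            best = max(best, H[i][j])
    return best
```

Only the score is returned here; recovering the alignment itself would need a traceback through `H`.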
Definition
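• $a = a_1 a_2 \ldots a_m$ and $b = b_1 b_2 \ldots b_n$ are sequences over an alphabet $S$ containing $N$ elements
• $u$ is an alignment path: $u = (x_1, y_1), (x_2, y_2), \ldots, (x_l, y_l)$
• $x_{k+1} > x_k$, $y_{k+1} > y_k$, $x_l \le m$, $y_l \le n$
• Example with $m = 8$, $n = 7$, $l = 6$:

EGAWGHE-E

P-AW-HEAE

Matched residues: EAWHEE (from a) and PAWHEE (from b)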
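The example can be checked mechanically by reading the alignment path off the two gapped rows. A small sketch, assuming "-" marks a gap and positions are 1-based (the function name `alignment_path` is hypothetical):

```python
def alignment_path(row_a, row_b, gap="-"):
    """Return the 1-based matched position pairs (x_k, y_k) of a gapped alignment."""
    if len(row_a) != len(row_b):
        raise ValueError("alignment rows must have equal length")
    path = []
    x = y = 0  # number of residues of a and b consumed so far
    for ca, cb in zip(row_a, row_b):
        if ca != gap:
            x += 1
        if cb != gap:
            y += 1
        if ca != gap and cb != gap:
            path.append((x, y))  # both rows hold a residue: a matched pair
    return path


a, b = "EGAWGHEE", "PAWHEAE"
u = alignment_path("EGAWGHE-E", "P-AW-HEAE")
matched_a = "".join(a[x - 1] for x, _ in u)
matched_b = "".join(b[y - 1] for _, y in u)
```

For the rows above this gives six pairs, with `matched_a` equal to EAWHEE and `matched_b` equal to PAWHEE.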

Scoring
• Substitution matrix $s(a, b)$
• $S_A(a, b, u) = S_M(a, b, u) + S_G(u)$
• $S_M(a, b, u)$: sequence dependent
• $S_G(u)$: gap score, sequence independent
• Gap weighting function $w(k)$
• $w(k) = \alpha + \beta k$ for $k \ge 1$
• $w(0) = 0$
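To make the decomposition concrete, here is a sketch that scores a given alignment path. Several things are assumptions not stated on the slides: the match term is taken to be the sum of substitution scores over matched pairs, the gap term subtracts w(k) for every run of k skipped residues between consecutive pairs, the substitution scores (+2 / -1) are a toy matrix, and the gap constants 3 and 1 are arbitrary.

```python
def w(k, alpha=3.0, beta=1.0):
    """Gap weighting function: w(0) = 0, w(k) = alpha + beta * k for k >= 1."""
    return 0.0 if k == 0 else alpha + beta * k


def s(x, y):
    """Toy substitution score (assumption): +2 for identity, -1 otherwise."""
    return 2.0 if x == y else -1.0


def alignment_score(a, b, u, alpha=3.0, beta=1.0):
    """Return (S_M, S_G, S_A) for the 1-based alignment path u."""
    # Sequence-dependent part: substitution scores of the matched pairs.
    s_m = sum(s(a[x - 1], b[y - 1]) for x, y in u)
    # Sequence-independent part: depends only on the gap lengths in u.
    s_g = 0.0
    for (x0, y0), (x1, y1) in zip(u, u[1:]):
        s_g -= w(x1 - x0 - 1, alpha, beta) + w(y1 - y0 - 1, alpha, beta)
    return s_m, s_g, s_m + s_g


u = [(1, 1), (3, 2), (4, 3), (6, 4), (7, 5), (8, 7)]
scores = alignment_score("EGAWGHEE", "PAWHEAE", u)
```

With these toy values the path has five identities and one mismatch (match term 9) and three single-residue gaps (gap term -12), so the total is -3. Note that the gap term never looks at the residues, only at the path.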
Probabilistic Smith-Waterman (PSW) Algorithm
• HMM: defines a probability distribution over the sequence space by means of a stochastic process involving a random walk through the model
• ASS (alignment scoring system): defines a probability distribution over the space of sequence pairs by means of a stochastic process involving a random walk through an alignment path matrix

Null model
• Length distribution (same for ASS and null model)
• Residue probability distribution over the alphabet S, giving the null probability of each residue a
Scoring function of local alignment
• $S_A(a, b, u) = S_M(a, b, u) + S_G(u)$
• Length normalizing function
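The residue part of the null model can be sketched as follows, assuming residues are drawn independently from the residue distribution. The length-distribution term is deliberately left out, and the uniform four-letter frequencies are a made-up example rather than values from the slides.

```python
import math


def null_log_prob(seq, freqs):
    """Log probability of the residues of seq under an i.i.d. null model.

    freqs maps each residue to its null probability. The length
    distribution is not included (assumption: it is handled separately).
    """
    return sum(math.log(freqs[r]) for r in seq)


# Hypothetical uniform distribution over a four-letter alphabet.
uniform = {r: 0.25 for r in "ACGT"}
value = null_log_prob("ACGT", uniform)
```

Working in log space avoids underflow for long sequences, since the product of many small residue probabilities becomes a sum.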

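Z is some logarithmic base that satisfies: [equation not preserved in the transcript]

Example alignment:

GAWG--HE-

AAW-RKHEE

Matched residues: GAWHE (in a) and AAWHE (in b)

Unmatched residues: G (in a) and RKE (in b)

Annotations accompanying an equation not preserved in the transcript:
• $P_0(a, b)$
• Length of unmatched pairs
• Length of matched pairs
• $v_k$, $w_k$: unmatched residues in a and b, respectively
• $x_k$, $y_k$: matched residues in a and b, respectively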
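The split into matched and unmatched residues can be reproduced from the gapped rows of the example. A minimal sketch, assuming "-" is the gap character (the function name `split_alignment` is hypothetical):

```python
def split_alignment(row_a, row_b, gap="-"):
    """Split a gapped alignment into matched and unmatched residues.

    Returns (matched_a, matched_b, unmatched_a, unmatched_b) as strings.
    """
    if len(row_a) != len(row_b):
        raise ValueError("alignment rows must have equal length")
    matched_a, matched_b, unmatched_a, unmatched_b = [], [], [], []
    for ca, cb in zip(row_a, row_b):
        if ca != gap and cb != gap:
            # Column with a residue in both rows: a matched pair.
            matched_a.append(ca)
            matched_b.append(cb)
        else:
            # Column with a gap: the residue facing it is unmatched.
            if ca != gap:
                unmatched_a.append(ca)
            if cb != gap:
                unmatched_b.append(cb)
    parts = (matched_a, matched_b, unmatched_a, unmatched_b)
    return tuple("".join(p) for p in parts)


result = split_alignment("GAWG--HE-", "AAW-RKHEE")
```

For the example rows, `result` is GAWHE and AAWHE for the matched residues, G for the unmatched residue of a, and RKE for the unmatched residues of b.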

Performance evaluation of PSW
• BLAST (BLOSUM62)
• SSEARCH
  • Native SW
  • BLOSUM45
  • Default gap weighting function
• PSW
  • BLOSUM45
  • Same gap weighting function as SSEARCH
• Searches run against the Swiss-Prot protein database
• Queries: from well-known protein families and domains
• % true positives affected by
  • Divergence of the sequence family
  • Stringency of the significance criterion applied
• Stringency of the criterion determined by fixing the number of false positives accepted
  • Not appropriate if the status of the sequences is not known in advance
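Fixing the number of accepted false positives turns a ranked hit list into a single sensitivity figure. A sketch of that bookkeeping, assuming each database hit carries a score and a known family-membership label (the function name, argument names, and the data are all hypothetical):

```python
def true_positive_fraction(hits, total_members, max_false_positives):
    """Fraction of family members ranked above the cutoff.

    hits: iterable of (score, is_member) pairs, higher score = better.
    The cutoff is placed just before the (max_false_positives + 1)-th
    non-member in the ranked list. Ties are not treated specially.
    """
    tp = fp = 0
    for _, is_member in sorted(hits, key=lambda h: h[0], reverse=True):
        if is_member:
            tp += 1
        else:
            fp += 1
            if fp > max_false_positives:
                break  # one false positive too many: stop counting
    return tp / total_members


# Made-up ranked list: four family members, two non-members.
hits = [(9.0, True), (8.0, True), (7.0, False),
        (6.0, True), (5.0, False), (4.0, True)]
strict = true_positive_fraction(hits, total_members=4, max_false_positives=0)
relaxed = true_positive_fraction(hits, total_members=4, max_false_positives=1)
```

On this toy list the strict setting recovers half of the members and the relaxed one three quarters. The procedure needs the `is_member` labels up front, which is why it does not apply when the status of the sequences is unknown.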

[Chart values, labels not preserved in the transcript: 5%, 9%, 14%, 14%, 26%, 33%, 54%, 54%, 53%]

Comparison

Equivalent performance of SSEARCH and PSW on G-protein-coupled receptors, SH2 domains, and SH3 domains

Comparison II
• Improved or equivalent performance of PSW compared with native SW
• PSW is especially more sensitive under stringent criteria
Summary
• Pairwise sequence alignments can be improved by interpreting a scoring system as a probabilistic model
• Probabilistic interpretation gives higher sensitivity
• Log-likelihood ratio eliminates scoring bias due to sequence length or choice of the scoring matrix
• Facilitates optimization of gap weighting matrices