Sequence Similarity
PROBCONS: Probabilistic Consistency-based Multiple Alignment of Proteins

A pair-HMM model of pairwise alignment

[Diagram: a three-state pair-HMM with states MATCH (emits the aligned pair xi, yj), INSERT X (emits xi against a gap), and INSERT Y (emits yj against a gap)]
  • Parameterizes a probability distribution, P(A), over all possible alignments of all possible pairs of sequences
  • Transition probabilities ~ gap penalties
  • Emission probabilities ~ substitution matrix (from BLOSUM)

x:  ABRACA-DABRA
y:  AB-ACARDI---
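To make these correspondences concrete, here is a minimal sketch in Python with illustrative parameter values (not the trained PROBCONS parameters), showing how the pair-HMM's log transition and emission probabilities play the roles of gap penalties and substitution scores:

```python
import math

# Illustrative three-state pair-HMM parameters (made-up values, not the trained
# PROBCONS parameters). States: MATCH, INSERT_X, INSERT_Y.
DELTA = 0.02    # P(MATCH -> INSERT): gap-open probability
EPSILON = 0.4   # P(INSERT -> same INSERT): gap-extension probability

transitions = {
    ("MATCH", "MATCH"): 1 - 2 * DELTA,
    ("MATCH", "INSERT_X"): DELTA, ("MATCH", "INSERT_Y"): DELTA,
    ("INSERT_X", "INSERT_X"): EPSILON, ("INSERT_X", "MATCH"): 1 - EPSILON,
    ("INSERT_Y", "INSERT_Y"): EPSILON, ("INSERT_Y", "MATCH"): 1 - EPSILON,
}

# MATCH emits a residue pair (x_i, y_j) from a BLOSUM-style joint distribution
# p(a, b); the INSERT states emit a single residue from background q(a).
# In log space these become the familiar alignment scores:
def substitution_score(p_ab, q_a, q_b):
    return math.log(p_ab / (q_a * q_b))        # log-odds substitution score

gap_open_penalty = math.log(transitions[("MATCH", "INSERT_X")])      # ~ gap open
gap_extend_penalty = math.log(transitions[("INSERT_X", "INSERT_X")]) # ~ gap extend

print(substitution_score(0.02, 0.07, 0.07), gap_open_penalty, gap_extend_penalty)
```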

Computing Pairwise Alignments
  • The Viterbi algorithm
    • identifies the highest-probability alignment, αviterbi, in O(L²) time
    • the conditional distribution P(α | x, y) reflects the model's uncertainty over the "correct" alignment of x and y
  • Caveat: the most likely alignment is not the most accurate
    • Alternative: find the alignment of maximum expected accuracy

[Figure: the posterior distribution P(α | x, y) over alignments, with αviterbi at its mode]

The Lazy-Teacher Analogy
  • 10 students take a 10-question true-false quiz
  • How do you make the answer key?
    • Approach #1: Use the answer sheet of the best student!
    • Approach #2: Weighted majority vote!

[Figure: the ten answer sheets with their grades (A through C); on question 4, a few students answer T while most answer F]
Viterbi vs. Maximum Expected Accuracy (MEA)

Viterbi
  • picks the single alignment with the highest chance of being completely correct
  • mathematically, finds the alignment α that maximizes Eα*[1{α = α*}]

Maximum Expected Accuracy
  • picks the alignment with the highest expected number of correct predictions
  • mathematically, finds the alignment α that maximizes Eα*[accuracy(α, α*)]

[Figure: the quiz answer sheets again; Viterbi corresponds to copying the best student's answers, MEA to a weighted majority vote on each question]

Computing MEA alignments
  • Define
      accuracy(α, α*) = (# of correctly predicted matches) / (length of the shorter sequence)
  • The expected accuracy of an alignment α is then

      Eα*[accuracy(α, α*) | x, y] ∝ Eα*[ ∑(xi, yj) in α 1{(xi, yj) in α*} | x, y ]
        = ∑α' P(α' | x, y) ∑(xi, yj) in α 1{(xi, yj) in α'}
        = ∑(xi, yj) in α ∑α' P(α' | x, y) 1{(xi, yj) in α'}
        = ∑(xi, yj) in α P(xi ~ yj | x, y)

  • Define M[i, j] = P(xi is aligned to yj | x, y), the posterior probability that xi is aligned to yj
    • M can be computed with forward and backward dynamic programming in O(L²) time
  • The MEA alignment is then the highest-summing path through the matrix M[i, j] (see the sketch below)

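Once the posterior matrix M has been computed by forward-backward on the pair-HMM, the MEA step itself is a Needleman-Wunsch-style dynamic program over M. A minimal sketch, assuming M is given as a NumPy array (the forward-backward computation of M is not shown):

```python
import numpy as np

def mea_alignment(M):
    """Given M[i, j] = P(x_i ~ y_j | x, y), return the matched index pairs of the
    maximum-expected-accuracy alignment, i.e. the highest-summing monotone path
    through M. This sketches only the MEA step; computing M itself requires the
    pair-HMM forward-backward algorithm."""
    n, m = M.shape
    D = np.zeros((n + 1, m + 1))   # D[i, j] = best total posterior using x_1..i, y_1..j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i, j] = max(D[i - 1, j - 1] + M[i - 1, j - 1],   # align x_i with y_j
                          D[i - 1, j],                          # leave x_i unaligned
                          D[i, j - 1])                          # leave y_j unaligned
    # Traceback to recover the matched pairs (0-based indices).
    pairs, i, j = [], n, m
    while i > 0 and j > 0:
        if D[i, j] == D[i - 1, j - 1] + M[i - 1, j - 1]:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif D[i, j] == D[i - 1, j]:
            i -= 1
        else:
            j -= 1
    return pairs[::-1]
```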

To estimate P(xi ~ yj | x, y, z)

Method 1: triplet-HMM
  • P(xi ~ yj | x, y, z) = ∑k P(xi ~ yj ~ zk | x, y, z)
  • Parameters trained with unsupervised EM
  • Running time: O(N³L³), where N = # of sequences and L = sequence length

Probabilistic consistency
  • Compute P(xi is aligned to yj | x, y, z) rather than just P(xi is aligned to yj | x, y)
  • 2 approaches:
    • 1) Exact – triplet-HMM, O(L³) time
    • 2) Approximate – use independence assumptions:

      ∑k P(xi ~ zk and zk ~ yj | x, y, z)
        = ∑k P(xi ~ zk | x, z) P(zk ~ yj | x, y, z, xi ~ zk)   (assume indep.)
        ≈ ∑k P(xi ~ zk | x, z) P(zk ~ yj | z, y)

Probabilistic consistency
  • Compute P(xi is aligned to yj | x, y, z):

      P(xi ~ yj | x, y, z) ≈ ∑k P(xi ~ zk | x, z) P(zk ~ yj | z, y)

  • Notice that for any given i, most entries over k and j will be close to 0 – sparse matrices
  • In matrix form: Pxy|z ≈ Pxz Pzy
  • Finally, averaging over all third sequences z in the set S:

      Pxy|S ≈ (1/|S|) ∑z in S Pxz Pzy
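As an illustration of the sparse-matrix view, here is a minimal sketch of the consistency transformation using SciPy CSR matrices; the threshold value and the identity-matrix convention for Pxx are illustrative assumptions, not the published PROBCONS settings:

```python
from scipy import sparse

def consistency_transform(P, threshold=0.01):
    """P maps each ordered pair of sequence names (x, z) to a sparse posterior
    matrix P_xz. Returns re-estimated matrices P'_xy ≈ (1/|S|) * sum_z P_xz @ P_zy.
    Threshold and normalization here are illustrative, not PROBCONS's exact choices."""
    seqs = sorted({s for pair in P for s in pair})
    newP = {}
    for x in seqs:
        for y in seqs:
            if x == y:
                continue
            total = None
            for z in seqs:
                if z == x or z == y:
                    term = P[(x, y)]              # convention: P_xx, P_yy act as identities
                else:
                    term = P[(x, z)] @ P[(z, y)]  # sparse matrix product
                total = term if total is None else total + term
            total = total / len(seqs)
            total.data[total.data < threshold] = 0   # keep the matrices sparse
            total.eliminate_zeros()
            newP[(x, y)] = total
    return newP
```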

Multiple sequence alignment
  • A straightforward generalization
    • sum-of-pairs
    • tree-based progressive alignment
    • iterative refinement

[Figure: the example sequences ABRACADABRA, ABACARDI, and ABRADABI aligned pairwise (e.g. ABRACA-DABRA / AB-ACARDI---) and progressively merged into a single multiple alignment]
Summary of PROBCONS Algorithm
  • Given K sequences to be aligned:
  • (1) Compute M[i, j] for all pairs of sequences x and y
  • (2) Use probabilistic consistency to re-estimate M[i, j]
  • (3) Build a tree of the sequences by connecting the closest pairs first
    • "Closest" defined according to expected accuracy
    • EA(x, y) = E(accuracy) of the MEA alignment of x and y (see the sketch below)
  • (4) Perform progressive alignment along the tree
    • Score of a column: sum-of-pairs M[i, j]
  • (5) Apply iterative refinement
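For step (3), "closest" can be made concrete with a small sketch that reuses mea_alignment() from the earlier MEA sketch; EA(x, y) is taken here as the total posterior along the MEA path, normalized by the shorter sequence length as in the accuracy definition above:

```python
def expected_accuracy(M):
    """EA(x, y): expected accuracy of the MEA alignment of x and y.
    Reuses mea_alignment() from the earlier sketch; M is the (possibly
    consistency-transformed) posterior matrix for the pair x, y."""
    pairs = mea_alignment(M)
    return sum(M[i, j] for i, j in pairs) / min(M.shape)
```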
Training/testing methodology
  • 3 reference benchmark sets: BAliBASE, PREFAB, SABmark
  • PROBCONS parameters trained via unsupervised EM on unaligned sequences from BAliBASE
  • Quality score:

      Q(α, α*) = (# of correctly predicted matches) / (total # of true matches)
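A small sketch of the Q score, under the simplifying assumption that both alignments are represented as sets of matched index pairs (real benchmark scorers also handle annotations such as core blocks):

```python
def q_score(predicted_pairs, true_pairs):
    """Q(α, α*): fraction of the reference's matched residue pairs that the
    predicted alignment reproduces. Both arguments are sets of (i, j) pairs."""
    if not true_pairs:
        return 0.0
    return len(set(predicted_pairs) & set(true_pairs)) / len(true_pairs)
```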

Evaluation of Algorithm Components

[Table: benchmark results for the pairwise, all-pairs, and multiple-alignment variants of the algorithm]

Resources for alignment

Protein Multiple Aligners:
  • CLUSTALW – most widely used (1994): http://www.ebi.ac.uk/clustalw/
  • MUSCLE – most scalable (2004): http://phylogenomics.berkeley.edu/cgi-bin/muscle/input_muscle.py
  • PROBCONS – most accurate (2004): http://probcons.stanford.edu/

Some more protein multiple aligners: MULTALIGN, MSA, DIALIGN, DCA, MACAW, TCOFFEE, MAFFT, DSC, MUSEQUAL, TOPLIGN, SACHMO, MATCHBOX, PRRN, SAM, MAXHOM, STRAP, ALIGN, AMAS, PILEUP, etc.

ProbCons: Chuong (Tom) Do

PFAM – Protein FAMilies database of alignments

  • Profile HMMs describe each family
  • For each family in Pfam you can:
    • Look at multiple alignments
    • View protein domain architectures
    • Examine species distribution
    • Follow links to other databases
    • View known protein structures
PFAM
  • Pfam-A – curated multiple alignments
    • Grows slowly; quality controlled by experts
  • Pfam-B – automatic clustering (ProDom-derived)
    • New sequences instantly incorporated; unchecked
  • Search by: sequence, keyword, domain, taxonomy
  • Browsing by family or genome
  • Evolutionary tree
  • Source of seed alignments:
    • Pfam-B families
    • Published articles
    • 'Domain hunting' studies
Profile HMMs
  • Each M state has a position-specific pre-computed substitution table
  • Each I state has position-specific gap penalties (and in principle can have its own emission distributions)
  • Each D state also has position-specific gap penalties
    • In principle, D-D transitions can also be customized per position

[Diagram: profile HMM for protein family F, with match states M1…Mm, insert states I0…Im, delete states D1…Dm, and BEGIN/END states]

Profile HMMs
  • transition between match states – αM(i)M(i+1)
  • transitions between match and insert states – αM(i)I(i), αI(i)M(i+1)
  • transition within insert state – αI(i)I(i)
  • transition between match and delete states – αM(i)D(i+1), αD(i)M(i+1)
  • transition within delete state – αD(i)D(i+1)
  • emission of amino acid b at a state S – εS(b)


Profile HMMs
  • transition probabilities ~ frequency of a transition in alignment
  • emission probabilities ~ frequency of an emission in alignment
  • pseudocounts are usually introduced

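A minimal sketch of this estimation for a single match state's emission distribution, using simple add-one pseudocounts; real profile-HMM software (e.g. HMMER, SAM) uses more elaborate priors such as Dirichlet mixtures:

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def emission_probs(column, pseudocount=1.0):
    """Estimate the emission distribution of one match state from an alignment
    column (a string of residues, gaps as '-'), with add-one style pseudocounts."""
    counts = {a: pseudocount for a in AMINO_ACIDS}
    for residue in column:
        if residue in counts:        # skip gap characters and unknown symbols
            counts[residue] += 1
    total = sum(counts.values())
    return {a: c / total for a, c in counts.items()}

# Example: a well-conserved column dominated by leucine
print(emission_probs("LLLLILLVLL")["L"])
```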

Alignment of a protein to a profile HMM

To align sequence x1…xn to a profile HMM:

We will find the most likely alignment with the Viterbi DP algorithm

  • Define
    • VjM(i): score of best alignment of x1…xi to the HMM ending in xi being emitted from Mj
    • VjI(i): score of best alignment of x1…xi to the HMM ending in xi being emitted from Ij
    • VjD(i): score of best alignment of x1…xi to the HMM ending in Dj (xi is the last character emitted before Dj)
  • Denote by qa the frequency of amino acid a in a ‘random’ protein
Alignment of a protein to a profile HMM

  • VjM(i) = log( εM(j)(xi) / qxi ) + max { Vj-1M(i – 1) + log αM(j-1)M(j),
                                            Vj-1I(i – 1) + log αI(j-1)M(j),
                                            Vj-1D(i – 1) + log αD(j-1)M(j) }

  • VjI(i) = log( εI(j)(xi) / qxi ) + max { VjM(i – 1) + log αM(j)I(j),
                                            VjI(i – 1) + log αI(j)I(j),
                                            VjD(i – 1) + log αD(j)I(j) }

  • VjD(i) = max { Vj-1M(i) + log αM(j-1)D(j),
                   Vj-1I(i) + log αI(j-1)D(j),
                   Vj-1D(i) + log αD(j-1)D(j) }
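A minimal sketch of this Viterbi recursion, assuming the model is supplied as log-space parameters (emissions already divided by the background frequencies qa, transitions in a dictionary); the BEGIN/END boundary handling and the traceback that recovers the actual alignment are omitted:

```python
import numpy as np

NEG_INF = float("-inf")

def profile_viterbi_scores(x, m, log_e_M, log_e_I, log_a):
    """Fill the Viterbi score matrices for aligning sequence x to a profile HMM with
    m match states. Assumed (not from the slide): log_e_M[j][c] and log_e_I[j][c]
    are log-odds emissions log(e(c)/q_c) for state j and residue c, and
    log_a[(s1, j1, s2, j2)] is the log transition probability, e.g.
    log_a[("M", j-1, "M", j)]. BEGIN is treated as M0; traceback is omitted."""
    n = len(x)
    VM = np.full((m + 1, n + 1), NEG_INF)
    VI = np.full((m + 1, n + 1), NEG_INF)
    VD = np.full((m + 1, n + 1), NEG_INF)
    VM[0][0] = 0.0
    for i in range(1, n + 1):
        c = x[i - 1]
        for j in range(1, m + 1):
            # Match state Mj emits x_i
            VM[j][i] = log_e_M[j][c] + max(
                VM[j - 1][i - 1] + log_a[("M", j - 1, "M", j)],
                VI[j - 1][i - 1] + log_a[("I", j - 1, "M", j)],
                VD[j - 1][i - 1] + log_a[("D", j - 1, "M", j)])
            # Insert state Ij emits x_i
            VI[j][i] = log_e_I[j][c] + max(
                VM[j][i - 1] + log_a[("M", j, "I", j)],
                VI[j][i - 1] + log_a[("I", j, "I", j)],
                VD[j][i - 1] + log_a[("D", j, "I", j)])
            # Delete state Dj emits nothing
            VD[j][i] = max(
                VM[j - 1][i] + log_a[("M", j - 1, "D", j)],
                VI[j - 1][i] + log_a[("I", j - 1, "D", j)],
                VD[j - 1][i] + log_a[("D", j - 1, "D", j)])
    return VM, VI, VD
```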

Weight of each sequence
  • One simple weighting scheme is to find how much edge length each leaf contributes
    • Example: edge 1 belongs to a
    • Example: edge 3 belongs both to a and to b: e3·e1/(e1 + e2) goes to a
  • In general, when processing edge e_current, each leaf i below it receives

      Δwi = e_current · wi / (∑ over leaves k below e_current of wk)

[Figure: a guide tree with leaves a–i; edges 1 and 2 lead to leaves a and b, and edge 3 sits above their common ancestor]
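A minimal sketch of this weighting scheme, assuming a simple nested-tuple tree representation (hypothetical; in practice the guide tree comes from the clustering step):

```python
def leaf_weights(tree):
    """Distribute each edge's length among the leaves below it, in proportion to the
    weights those leaves have accumulated so far (bottom-up). Assumed representation:
    a leaf is ("name", edge_length); an internal node is (children_list, edge_length),
    with edge_length = 0 at the root. Returns {leaf_name: weight}."""
    weights = {}

    def visit(node):
        children_or_name, edge = node
        if isinstance(children_or_name, str):   # leaf: its own edge belongs to it alone
            weights[children_or_name] = edge
            return [children_or_name]
        leaves = []
        for child in children_or_name:          # first accumulate the weights below
            leaves.extend(visit(child))
        total = sum(weights[k] for k in leaves)
        for k in leaves:                        # Δw_k = edge * w_k / Σ w_below
            weights[k] += edge * weights[k] / total if total > 0 else edge / len(leaves)
        return leaves

    visit(tree)
    return weights

# Example matching the slide: leaves a (edge 1) and b (edge 2) joined under edge 3
tree = ([("a", 1.0), ("b", 2.0)], 3.0)
print(leaf_weights(tree))   # a gets 1 + 3*1/3 = 2.0, b gets 2 + 3*2/3 = 4.0
```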

resources on the web
Resources on the web
  • HMMer – a free profile HMM software
    • http://hmmer.wustl.edu/
  • SAM – another free profile HMM software
    • http://www.cse.ucsc.edu/research/compbio/sam.html
  • PFAM – database of alignments and HMMs for protein families and domains
    • http://www.sanger.ac.uk/Software/Pfam/
  • SCOP – a structural classification of proteins
    • http://scop.berkeley.edu/data/scop.b.html