Sequence Similarity
Presentation Transcript

PROBCONS: Probabilistic Consistency-based Multiple Alignment of Proteins

[Figure: pair-HMM state diagram with MATCH, INSERT X, and INSERT Y states]

A pair-HMM model of pairwise alignment

[Figure: three-state pair-HMM – the MATCH state emits the aligned pair (xi, yj), the INSERT X state emits xi against a gap, and the INSERT Y state emits yj against a gap]

  • Parameterizes a probability distribution, P(A), over all possible alignments of all possible pairs of sequences

  • Transition probabilities ~ gap penalties

  • Emission probabilities ~ substitution matrix (from BLOSUM); a toy parameter sketch appears at the end of this slide

An example pairwise alignment of sequences x and y:

    x: ABRACA-DABRA
    y: AB-ACARDI---
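
As a concrete illustration of these ingredients, here is a toy sketch of how a pair-HMM's parameters might be laid out; the two-letter alphabet and all numbers are made up for illustration and are not the trained PROBCONS parameters.

```python
# Toy pair-HMM parameters (illustrative values only -- NOT the trained PROBCONS parameters).
# States: 'M' (MATCH), 'X' (INSERT X: emit x_i against a gap), 'Y' (INSERT Y: emit y_j against a gap).

delta, eps = 0.05, 0.3          # gap-open / gap-extend probabilities (made up)

transition = {                  # transition probabilities ~ gap penalties
    'M': {'M': 1 - 2 * delta, 'X': delta, 'Y': delta},
    'X': {'M': 1 - eps,       'X': eps,   'Y': 0.0},
    'Y': {'M': 1 - eps,       'X': 0.0,   'Y': eps},
}

# MATCH emissions ~ substitution matrix; here a tiny made-up joint distribution
# over a two-letter alphabet (PROBCONS derives these from BLOSUM-like statistics).
emission_match = {('A', 'A'): 0.35, ('A', 'B'): 0.15,
                  ('B', 'A'): 0.15, ('B', 'B'): 0.35}

# INSERT emissions: background frequencies of single residues
emission_insert = {'A': 0.5, 'B': 0.5}
```

The transition probabilities play the role of gap-open and gap-extend penalties, and the MATCH emission table plays the role of a substitution matrix, matching the bullets above.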


Computing Pairwise Alignments

  • The Viterbi algorithm

    • conditional distribution P(α | x, y) reflects model’s uncertainty over the “correct” alignment of x and y

    • identifies the highest-probability alignment, αviterbi, in O(L²) time (see the sketch after this slide)

      Caveat: the most likely alignment is not necessarily the most accurate

    • Alternative: find the alignment of maximum expected accuracy

[Figure: the distribution P(α | x, y) over possible alignments α, with the Viterbi alignment αviterbi marked at the mode]
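
A minimal sketch of the Viterbi recursion for a three-state pair-HMM in log space, assuming parameters shaped like the toy dictionaries shown earlier; it returns only the best score (no traceback), so it illustrates the O(L²) recursion rather than a complete aligner.

```python
import math

def pair_viterbi_score(x, y, transition, emission_match, emission_insert):
    """Log-probability of the single most likely (Viterbi) alignment of x and y.

    transition      : {state: {state: prob}} over the states 'M', 'X', 'Y'
    emission_match  : {(a, b): prob}, emissions of the MATCH state
    emission_insert : {a: prob}, emissions of the two INSERT states
    BEGIN/END transitions are omitted to keep the sketch short.
    Runs in O(L^2) time for sequences of length ~L.
    """
    def lg(p):
        return math.log(p) if p > 0 else float("-inf")

    n, m = len(x), len(y)
    NEG = float("-inf")
    # V[s][i][j] = best log-prob over alignments of x[:i] and y[:j] ending in state s
    V = {s: [[NEG] * (m + 1) for _ in range(n + 1)] for s in "MXY"}
    V["M"][0][0] = 0.0   # start in a dummy "match" state before any emission

    for i in range(n + 1):
        for j in range(m + 1):
            if i > 0 and j > 0:     # MATCH emits the pair (x_i, y_j)
                V["M"][i][j] = lg(emission_match[(x[i-1], y[j-1])]) + max(
                    V[s][i-1][j-1] + lg(transition[s]["M"]) for s in "MXY")
            if i > 0:               # INSERT X emits x_i opposite a gap
                V["X"][i][j] = lg(emission_insert[x[i-1]]) + max(
                    V[s][i-1][j] + lg(transition[s]["X"]) for s in "MXY")
            if j > 0:               # INSERT Y emits y_j opposite a gap
                V["Y"][i][j] = lg(emission_insert[y[j-1]]) + max(
                    V[s][i][j-1] + lg(transition[s]["Y"]) for s in "MXY")
    return max(V[s][n][m] for s in "MXY")
```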


The Lazy-Teacher Analogy

[Figure: ten students' answers to question 4 (T or F) shown alongside their course grades (A through C)]

  • 10 students take a 10-question true-false quiz

  • How do you make the answer key?

    • Approach #1: Use the answer sheet of the best student!

    • Approach #2: Weighted majority vote!


Viterbi vs. Maximum Expected Accuracy (MEA)

  • Viterbi

    • picks the single alignment with the highest chance of being completely correct

    • mathematically, finds the alignment α that maximizes Eα*[1{α = α*}]

  • Maximum Expected Accuracy

    • picks the alignment with the highest expected number of correct predictions

    • mathematically, finds the alignment α that maximizes Eα*[accuracy(α, α*)]

[Figure: the quiz analogy revisited – one top student's answer to question 4 (T) versus the answers and grades of the rest of the class (mostly F)]


Computing MEA alignments

  • Define

    accuracy(α, α*) = (# of correctly predicted matches) / (length of the shorter sequence)

  • Then the expected accuracy of a proposed alignment α is

    Eα*[accuracy(α, α*) | x, y] ∝ Eα*[ ∑(xi, yj) in α 1{(xi, yj) in α*} | x, y ]

      = ∑α' P(α' | x, y) ∑(xi, yj) in α 1{(xi, yj) in α'}

      = ∑(xi, yj) in α ∑α' P(α' | x, y) 1{(xi, yj) in α'}

      = ∑(xi, yj) in α P(xi ~ yj | x, y)

  • Define M[i, j] = P(xi is aligned to yj | x, y), the posterior probability that xi is aligned to yj

    • M can be computed with forward and backward dynamic programming in O(L²) time

  • The MEA alignment is then the highest-summing path through the matrix M[i, j] (see the sketch below)
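
A minimal sketch of the MEA step, assuming the posterior matrix M has already been computed (e.g., by forward-backward on the pair-HMM); the function name and the NumPy representation are illustrative, not the PROBCONS implementation.

```python
import numpy as np

def mea_alignment(M):
    """Maximum expected accuracy alignment.

    M[i, j] = posterior probability that x_i is aligned to y_j
    (an |x| x |y| matrix, assumed precomputed by forward-backward).
    Returns the matched index pairs (i, j) on the highest-summing
    monotone path through M.
    """
    n, m = M.shape
    # D[i, j] = best total posterior over alignments of x[:i] and y[:j]
    D = np.zeros((n + 1, m + 1))
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            D[i, j] = max(D[i - 1, j - 1] + M[i - 1, j - 1],   # match x_i ~ y_j
                          D[i - 1, j],                         # gap in y
                          D[i, j - 1])                         # gap in x
    # Traceback to recover the matched pairs
    pairs, i, j = [], n, m
    while i > 0 and j > 0:
        if D[i, j] == D[i - 1, j - 1] + M[i - 1, j - 1]:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif D[i, j] == D[i - 1, j]:
            i -= 1
        else:
            j -= 1
    return pairs[::-1]
```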


The consistency signal

[Figure: three sequences x, y, and z; the alignment of xi to zk and of zk to yj' provides indirect (consistency) evidence about whether xi should align to yj or to yj']

To estimate P(xi ~ yj | x, y, z)

Method 1: triplet-HMM

    P(xi ~ yj | x, y, z) = ∑k P(xi ~ yj ~ zk | x, y, z)

  • Parameters trained with unsupervised EM

  • Running time: O(N³L³), where N = # of sequences and L = sequence length


Probabilistic consistency

  • Goal: compute P(xi is aligned to yj | x, y, z), not just P(xi is aligned to yj | x, y)

  • 2 approaches:

    • 1) Exact – triplet-HMM, O(L³) time

    • 2) Approximate – use independence assumptions:

      ∑k P(xi ~ zk and zk ~ yj | x, y, z)

        = ∑k P(xi ~ zk | x, z) P(zk ~ yj | x, y, z, xi ~ zk)

        ≈ ∑k P(xi ~ zk | x, z) P(zk ~ yj | z, y)    (assume independence)


Probabilistic consistency

  • Compute P(xi is aligned to yj | x, y, z):

    P(xi ~ yj | x, y, z) ≈ ∑k P(xi ~ zk | x, z) P(zk ~ yj | z, y)

  • Notice that, for any given i, most entries over k and j will be close to 0 – the posterior matrices are sparse

  • In matrix form: Pxy|z ≈ Pxz Pzy

  • Finally, averaging over all sequences z in the set S (a sketch of this transformation follows below):

    Pxy|S ≈ (1/|S|) ∑z in S Pxz Pzy


Multiple sequence alignment

An example multiple alignment and the pairwise alignments it induces:

    ABRACA-DABRA
    AB-ACARDI---
    ABRA---DABI-

[Figure: the three pairwise alignments obtained by projecting this multiple alignment onto each pair of sequences]

  • A straightforward generalization

    • sum-of-pairs (see the sketch below)

    • tree-based progressive alignment

    • iterative refinement
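
A minimal sketch of a sum-of-pairs column score driven by the pairwise posterior matrices M[i, j]; the column representation and the gap handling are assumptions made here for illustration.

```python
from itertools import combinations

def sum_of_pairs_column_score(column_positions, posteriors):
    """Sum-of-pairs score of a single column of a multiple alignment.

    column_positions : {seq_name: residue index in that sequence, or None for a gap}
    posteriors       : {(name_a, name_b): M_ab}, pairwise posterior matrices with
                       M_ab[i][j] ~ P(residue i of a is aligned to residue j of b)
    Sums the pairwise posteriors over every pair of sequences present in the column.
    """
    score = 0.0
    for (a, ia), (b, ib) in combinations(column_positions.items(), 2):
        if ia is None or ib is None:
            continue                                  # gaps contribute nothing here
        M = posteriors.get((a, b))
        score += M[ia][ib] if M is not None else posteriors[(b, a)][ib][ia]
    return score
```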


Multiple sequence alignment

[Figure: the same example sequences, showing how pairwise alignments are combined step by step into the full multiple alignment]


Summary of PROBCONS Algorithm

Given K sequences to be aligned:

  • (1) Compute M[i, j] for all pairs of sequences x and y

  • (2) Use probabilistic consistency to re-estimate M[i, j]

  • (3) Build a tree of the sequences by connecting the closest ones first (see the guide-tree sketch below)

    • "Closest" is defined according to expected accuracy:

      EA(x, y) = E(accuracy) of the MEA alignment of x and y

  • (4) Perform progressive alignment along the tree

    • Score of a column: sum-of-pairs M[i, j]

  • (5) Apply iterative refinement
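
A toy sketch of step (3), greedily joining the closest clusters first; using the average expected accuracy between clusters is an illustrative choice and may not match the exact PROBCONS clustering rule.

```python
from itertools import combinations

def build_guide_tree(ea):
    """Greedily join the closest clusters first.

    ea : dict {frozenset({a, b}): expected accuracy of the MEA alignment
         of sequences a and b}, for every pair of sequence identifiers.
    Returns a nested tuple representing the guide tree.

    Cluster-to-cluster similarity is the average pairwise EA across the
    two clusters (an illustrative choice, not necessarily the PROBCONS rule).
    """
    leaves = {name for pair in ea for name in pair}
    clusters = [(name, {name}) for name in sorted(leaves)]   # (subtree, leaf set)

    def similarity(c1, c2):
        pairs = [(a, b) for a in c1[1] for b in c2[1]]
        return sum(ea[frozenset(p)] for p in pairs) / len(pairs)

    while len(clusters) > 1:
        # find the pair of clusters with the highest average expected accuracy
        i, j = max(combinations(range(len(clusters)), 2),
                   key=lambda ij: similarity(clusters[ij[0]], clusters[ij[1]]))
        ci, cj = clusters[i], clusters[j]
        merged = ((ci[0], cj[0]), ci[1] | cj[1])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters[0][0]

# Example: three sequences with pairwise expected accuracies
ea = {frozenset({'x', 'y'}): 0.9,
      frozenset({'x', 'z'}): 0.6,
      frozenset({'y', 'z'}): 0.5}
print(build_guide_tree(ea))   # ('z', ('x', 'y')): x and y are joined first, then z
```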


Training/testing methodology

  • 3 reference benchmark sets: BAliBASE, PREFAB, SABmark

  • PROBCONS parameters trained via unsupervised EM on unaligned sequences from BAliBASE

  • Quality score (see the sketch below):

    Q(α, α*) = (# of correctly predicted matches) / (total # of true matches)
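
A minimal sketch of the quality score, with alignments represented as sets of matched index pairs (a representation assumed here for illustration).

```python
def quality_score(predicted, reference):
    """Q(α, α*): fraction of true matches that are correctly predicted.

    predicted, reference : sets of (i, j) index pairs, one pair per aligned
    residue pair in the predicted and reference alignments.
    """
    if not reference:
        return 0.0
    return len(predicted & reference) / len(reference)

# Example: 2 of the 3 reference matches are recovered
print(quality_score({(0, 0), (1, 1), (2, 3)}, {(0, 0), (1, 1), (2, 2)}))  # 0.666...
```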


Evaluation of Algorithm Components

[Table: benchmark quality scores broken down by algorithm component – all-pairs, pairwise, and multiple alignment]



Resources for alignment

Protein multiple aligners:

  • CLUSTALW – most widely used (1994)
    http://www.ebi.ac.uk/clustalw/

  • MUSCLE – most scalable (2004)
    http://phylogenomics.berkeley.edu/cgi-bin/muscle/input_muscle.py

  • PROBCONS – most accurate (2004)
    http://probcons.stanford.edu/

Some more protein multiple aligners: MULTALIGN, MSA, DIALIGN, DCA, MACAW, TCOFFEE, MAFFT, DSC, MUSEQUAL, TOPLIGN, SACHMO, MATCHBOX, PRRN, SAM, MAXHOM, STRAP, ALIGN, AMAS, PILEUP, etc.

ProbCons: Chuong (Tom) Do



PFAM

Protein FAMilies database of alignments

  • Profile HMMs describe each family

  • For each family in Pfam you can:

    • Look at multiple alignments

    • View protein domain architectures

    • Examine species distribution

    • Follow links to other databases

    • View known protein structures


PFAM

Pfam-A – curated multiple alignments

  • Grows slowly; quality controlled by experts

Pfam-B – automatic clustering (ProDom derived)

  • New sequences instantly incorporated; unchecked

  • Search by: Sequence, keyword, domain, taxonomy

  • Browsing by family or genome

  • Evolutionary tree

  • Source of seed alignments:

    • Pfam-B families

    • Published articles

    • 'Domain hunting' studies


Profile HMMs

  • Each M state has a position-specific pre-computed substitution table

  • Each I state has position-specific gap penalties (and, in principle, can have its own emission distribution)

  • Each D state also has position-specific gap penalties

    • In principle, D→D transitions can also be customized per position

[Figure: profile HMM for protein family F – BEGIN and END states, match states M1…Mm, insert states I0…Im, and delete states D1…Dm]


Profile HMMs

  • Transition between match states: αM(i)M(i+1)

  • Transitions between match and insert states: αM(i)I(i), αI(i)M(i+1)

  • Transition within an insert state: αI(i)I(i)

  • Transitions between match and delete states: αM(i)D(i+1), αD(i)M(i+1)

  • Transition between delete states: αD(i)D(i+1)

  • Emission of amino acid b at a state S: εS(b)

[Figure: the same profile HMM diagram for protein family F, with the transitions labeled]


Profile HMMs

  • Transition probabilities ~ frequency of the corresponding transition in the alignment

  • Emission probabilities ~ frequency of the corresponding emission in the alignment

  • Pseudocounts are usually introduced (see the sketch below)

[Figure: the same profile HMM diagram for protein family F]
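
A minimal sketch of estimating match-state emission probabilities by counting, with pseudocounts; the 50%-gap rule for deciding which columns are match states and the pseudocount value are illustrative assumptions, not the actual Pfam/HMMER conventions.

```python
from collections import Counter

def match_emissions(alignment, alphabet="ACDEFGHIKLMNPQRSTVWY", pseudocount=1.0):
    """Estimate match-state emission probabilities from a multiple alignment.

    alignment : list of equal-length aligned sequences ('-' for gaps)
    Columns where at most half the sequences have a gap are treated as match
    states (a common heuristic, assumed here). Returns one {residue: prob}
    dict per match state, using add-`pseudocount` smoothing.
    """
    ncol = len(alignment[0])
    emissions = []
    for col in range(ncol):
        residues = [seq[col] for seq in alignment]
        if residues.count('-') > len(residues) / 2:
            continue                                   # insert column, skip
        counts = Counter(r for r in residues if r != '-')
        total = sum(counts.values()) + pseudocount * len(alphabet)
        emissions.append({a: (counts[a] + pseudocount) / total for a in alphabet})
    return emissions

# Example with a tiny three-sequence alignment
aln = ["ACD-", "AC-E", "GCDE"]
probs = match_emissions(aln)
print(len(probs), "match states;", "P(A at state 1) =", round(probs[0]['A'], 3))
```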


Alignment of a protein to a profile HMM

To align a sequence x1…xn to a profile HMM, we find the most likely alignment with the Viterbi DP algorithm.

  • Define

    • VjM(i): score of the best alignment of x1…xi to the HMM, ending with xi emitted from Mj

    • VjI(i): score of the best alignment of x1…xi to the HMM, ending with xi emitted from Ij

    • VjD(i): score of the best alignment of x1…xi to the HMM, ending in Dj (xi is the last character emitted before Dj)

  • Denote by qa the frequency of amino acid a in a 'random' protein


Alignment of a protein to a profile HMM

  • VjM(i) = log( εM(j)(xi) / qxi ) + max of:

      Vj-1M(i – 1) + log αM(j-1)M(j)
      Vj-1I(i – 1) + log αI(j-1)M(j)
      Vj-1D(i – 1) + log αD(j-1)M(j)

  • VjI(i) = log( εI(j)(xi) / qxi ) + max of:

      VjM(i – 1) + log αM(j)I(j)
      VjI(i – 1) + log αI(j)I(j)
      VjD(i – 1) + log αD(j)I(j)

  • VjD(i) = max of:

      Vj-1M(i) + log αM(j-1)D(j)
      Vj-1I(i) + log αI(j-1)D(j)
      Vj-1D(i) + log αD(j-1)D(j)

A code sketch of this recursion appears below.
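
A minimal sketch of this recursion in log-odds space, assuming the model parameters are passed in as plain dictionaries and arrays (an illustrative layout, not the HMMER or SAM file format); M0 serves as the BEGIN state and the transition into END is omitted for brevity.

```python
import numpy as np

def safe_log(p):
    """log that maps 0 to -inf without warnings."""
    return np.log(p) if p > 0 else float("-inf")

def profile_viterbi_score(x, m, em_M, em_I, a, q):
    """Log-odds Viterbi score of sequence x against a profile HMM.

    x    : query sequence (string of amino acids)
    m    : number of match states
    em_M : em_M[j][c] = P(emit c | M_j), for j = 1..m
    em_I : em_I[j][c] = P(emit c | I_j), for j = 0..m
    a    : a[kind][j] = transition probability, where kind is one of
           'MM','IM','DM' (into M_j), 'MI','II','DI' (into I_j),
           'MD','ID','DD' (into D_j); each array indexed 0..m
    q    : background frequency q[c] of amino acid c
    """
    n = len(x)
    NEG = float("-inf")
    VM = np.full((m + 1, n + 1), NEG)   # VM[j, i]: best path ending with x_i emitted by M_j
    VI = np.full((m + 1, n + 1), NEG)   # VI[j, i]: best path ending with x_i emitted by I_j
    VD = np.full((m + 1, n + 1), NEG)   # VD[j, i]: best path ending in D_j after x_1..x_i
    VM[0, 0] = 0.0                      # treat M_0 as the BEGIN state

    for i in range(n + 1):
        for j in range(m + 1):
            c = x[i - 1] if i > 0 else None
            if j >= 1 and i >= 1:       # match state M_j emits x_i
                VM[j, i] = safe_log(em_M[j][c] / q[c]) + max(
                    VM[j - 1, i - 1] + safe_log(a['MM'][j]),
                    VI[j - 1, i - 1] + safe_log(a['IM'][j]),
                    VD[j - 1, i - 1] + safe_log(a['DM'][j]))
            if i >= 1:                  # insert state I_j emits x_i
                VI[j, i] = safe_log(em_I[j][c] / q[c]) + max(
                    VM[j, i - 1] + safe_log(a['MI'][j]),
                    VI[j, i - 1] + safe_log(a['II'][j]),
                    VD[j, i - 1] + safe_log(a['DI'][j]))
            if j >= 1:                  # delete state D_j is silent
                VD[j, i] = max(
                    VM[j - 1, i] + safe_log(a['MD'][j]),
                    VI[j - 1, i] + safe_log(a['ID'][j]),
                    VD[j - 1, i] + safe_log(a['DD'][j]))
    return max(VM[m, n], VI[m, n], VD[m, n])
```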


Weight of each sequence

  • One simple weighting scheme is to find how much edge length each leaf contributes

    • Example: edge 1 belongs to a

    • Example: edge 3 belongs to both a and b: e3 · e1/(e1 + e2) goes to a

    • In general: Δwi = ecurrent · wi / (∑leaves k below ecurrent wk)

[Figure: guide tree with leaves a through i; edges 1 and 2 lead to leaves a and b, and edge 3 is the edge above their common ancestor]

A small sketch of this weighting scheme follows below.
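
A minimal sketch of this bottom-up weighting, assuming the tree is given as nested (subtree, branch-length) tuples, a representation chosen just for illustration; each edge's length is shared among the leaves beneath it in proportion to their current weights, per the Δwi formula above.

```python
def leaf_weights(tree):
    """Distribute branch lengths to leaves, bottom-up.

    tree : either a leaf name (str), or a tuple of (subtree, edge_length)
           pairs, one per child of the node.
    Returns {leaf_name: weight}, where each edge's length is shared among
    the leaves below it in proportion to their current weights
    (Δw_i = e_current * w_i / sum of w_k over leaves k below the edge).
    """
    if isinstance(tree, str):
        return {tree: 0.0}

    weights = {}
    for subtree, edge_len in tree:
        w = leaf_weights(subtree)              # weights accumulated within the subtree
        total = sum(w.values())
        for leaf, wi in w.items():
            if total > 0:
                w[leaf] = wi + edge_len * wi / total    # Δw_i for this edge
            else:
                w[leaf] = wi + edge_len / len(w)        # all-zero weights (e.g. a bare leaf)
        weights.update(w)
    return weights

# Example: leaves a and b on edges of length 1 and 2, their shared edge of length 3,
# and a third leaf c attached to the root by an edge of length 4
tree = (((('a', 1.0), ('b', 2.0)), 3.0), ('c', 4.0))
print(leaf_weights(tree))   # edge 3 is split 1:2 between a and b, as in the slide
```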



Resources on the web

  • HMMer – a free profile HMM software package

    • http://hmmer.wustl.edu/

  • SAM – another free profile HMM software package

    • http://www.cse.ucsc.edu/research/compbio/sam.html

  • PFAM – database of alignments and HMMs for protein families and domains

    • http://www.sanger.ac.uk/Software/Pfam/

  • SCOP – a structural classification of proteins

    • http://scop.berkeley.edu/data/scop.b.html

