- By
**carl** - Follow User

- 82 Views
- Uploaded on

Download Presentation
## PowerPoint Slideshow about ' Sequence Similarity' - carl

**An Image/Link below is provided (as is) to download presentation**

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.

- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -

Presentation Transcript

yj

MATCH

INSERT

X

INSERT

Y

xi

―

―

yj

A pair-HMM model of pairwise alignment- Parameterizes a probability distribution, P(A), over all possible alignments of all possible pairs of sequences
- Transition probabilities ~ gap penalties
- Emission probabilities ~ substitution matrix (from BLOSUM)

x

ABRACA-DABRA

AB-ACARDI---

y

Computing Pairwise Alignments

- The Viterbi algorithm
- conditional distribution P(α | x, y) reflects model’s uncertainty over the “correct” alignment of x and y
- identifies highest probability alignment, αviterbi, in O(L2) time

Caveat: the mostlikely alignment is not the mostaccurate

- Alternative: find the alignment of maximum expected accuracy

P(α)

αviterbi

P(α | x, y)

4. T

4. F

4. F

4. F

B

A-

A

A-

A

4. F

4. F

4. T

4. F

4. F

B-

B+

B+

B-

C

The Lazy-Teacher Analogy- 10 students take a 10-question true-false quiz
- How do you make the answer key?
- Approach #1: Use the answer sheet of the best student!
- Approach #2: Weighted majority vote!

Viterbi

picks single alignment with highest chance of being completely correct

mathematically, finds the alignment α that maximizes

Eα*[1{α = α*}]

Maximum Expected Accuracy

picks alignment with highest expected number of correct predictions

mathematically, finds the alignment α that maximizes

Eα*[accuracy(α, α*)]

4. T

A

Viterbi vs. Maximum Expected Accuracy (MEA)4. F

4. F

4. T

4. F

4. F

B

A-

A

A-

A

4. F

4. F

4. F

4. F

4. T

C

B-

B+

B+

B-

Computing MEA alignments

- Define

accuracy (α, α*) =

Eα*(accuracy(α, α*) | x, y) ~ Eα*(∑(xi, yj) in α1((xi, yj) in α*) | x,y)

= ∑α’P(α’ | x, y) ∑(xi, yj) in α 1((xi, yj) in α’)

= ∑(xi, yj) in α ∑α’P(α’ | x, y) 1((xi, yj) in α’)

= ∑(xi, yj) in α P(xi, yj in α’ | x, y)

- Define M[i, j] = posterior probability that xi is aligned to yj

# of correct predicted matches

length of shorter sequence

Computing MEA alignments

- Define

accuracy (α, α*) =

- Then, MEA alignment is highest summing path through the matrix

M[i, j] = P(xi is aligned to yj | x, y)

- M[i, j] = posterior probability that xi is aligned to yj
- Can compute with forward, backward dynamic programming in O(L2) time

# of correct predicted matches

length of shorter sequence

Computing MEA alignments

- Define

accuracy (α, α*) =

- Then, MEA alignment is highest summing path through the matrix

M[i, j] = P(xi is aligned to yj | x, y)

- M[I, j] = posterior probability that xi is aligned to yj
- Can compute with forward, backward dynamic programming in O(L2) time

# of correct predicted matches

length of shorter sequence

To estimate P(xiyj | x, y, z)

Method 1:triplet-HMM

P(xi ~ yj | x, y, z)

= ∑k P(xi~yj~zk | x, y, z)

Parameters trained with unsupervised EM

Running time: O(N3L3)

N: # sequences

L: sequence lengths

Probabilistic consistency

- Compute P(xi is aligned to yj | x, y)

P(xi is aligned to yj | x, y, z)

- 2 approaches:
- 1) Exact – triplet HMM, O(L3) time
- 2) Approximate – use independence assumptions

∑k P(xi ~ zk and zk ~ yj | x, y, z) =

∑k P(xi ~ zk | x, z) P(zk ~ yj | x, y, z, xi ~ zk) (assume indep.)

∑k P(xi ~ zk | x, z) P(zk ~ yj | z, y)

Probabilistic consistency

- Compute P(xi is aligned to yj | x, y, z)

To compute P(xi ~ yj | x, y, z) ~ ∑k P(xi ~ zk | x, z) P(zk ~ yj | z, y)

Notice that for any given i, most entries k and j will be close to 0

-- sparse matrices

Pxy|z PxzPzy

Finally, let

Pxy|S 1/|S|∑z in S PxzPzy

AB-ACARDI---

ABRA---DABI-

ABRACA-DABRA

AB-ACARDI---

ABRACADABRA

ABRA--DABI-

AB-ACARDI---

ABRA---DABI-

Multiple sequence alignment- A straightforward generalization
- sum-of-pairs
- tree-based progressive alignment
- iterative refinement

AB-ACARDI---

ABRA---DABI-

ABRACA-DABRA

AB-ACARDI---

ABRA---DABI-

ABRACADABRA

ABRA--DABI-

ABRACA-DABRA

AB-ACARDI---

ABRA---DABI-

ABRACA-DABRA

AB-ACARDI---

ABRACADABRA

ABRA--DABI-

AB-ACARDI---

ABRA---DABI-

ABRACA-DABRA

AB-ACARD--I-

ABRA---DABI-

ABRACA-DABRA

AB-ACARDI---

ABACARDI

ABRACADABRA

ABACARDI

ABRADABI

Multiple sequence alignment- A straightforward generalization
- sum-of-pairs
- tree-based progressive alignment
- iterative refinement

Summary of PROBCONS Algorithm

- Given K sequences to be aligned,
- Compute M[i, j] for all pairs of sequences, x and y
- (2) Use probabilistic consistency to reestimate M[i, j]
- Build a tree of the sequences by connecting closest first
- “Closest” defined according to expected accuracy
- EA(x, y) = E(accuracy) of MEA alignment of x and y
- Perform progressive alignment along the tree
- Score of a column: sum-of-pairs M[i, j]
- Apply iterative refinement

Training/testing methodology

- 3 reference benchmark sets
- PROBCONS parameters trained via unsupervised EM on unaligned sequences from BAliBASE.
- Quality score:

Q(α, α*) =

BAliBASE

PREFAB

SABmark

# of correct predicted matches

total # of true matches

Resources for alignment

Protein Multiple Aligners

http://www.ebi.ac.uk/clustalw/

CLUSTALW – most widely used (1994)

http://phylogenomics.berkeley.edu/cgi-bin/muscle/input_muscle.py

MUSCLE – most scalable (2004)

http://probcons.stanford.edu/

PROBCONS – most accurate (2004)

Some more protein multiple aligners:

MULTALIGN, MSA, DIALIGN, DCA, MACAW, TCOFFEE, MAFFT, DSC, MUSEQUAL, TOPLIGN, SACHMO, MATCHBOX, PRRN, SAM, MAXHOM, STRAP, ALIGN, AMAS, PILEUP, etc…….

ProbCons: Chuong (Tom) Do

PFAM

Protein FAMilies database of alignments

- Profile HMMs describe each family
- For each family in Pfam you can:
- Look at multiple alignments
- View protein domain architectures
- Examine species distribution
- Follow links to other databases
- View known protein structures

PFAM

Pfam-A – curated multiple alignments

- Grows slowly; quality controlled by experts

Pfam-B – automatic clustering (ProDom derived)

- New sequences instantly incorporated; unchecked
- Search by: Sequence, keyword, domain, taxonomy
- Browsing by family or genome
- Evolutionary tree
- Source of seed alignments:
- Pfam-B families
- Published articles
- ‘Domain hunting\' studies

Profile HMMs

- Each M state has a position-specific pre-computed substitution table
- Each I state has position-specific gap penalties (and in principle can have its own emission distributions)
- Each D state also has position-specific gap penalties
- In principle, D-D transitions can also be customized per position

Dm-1

Dm

D1

D2

BEGIN

END

I0

I1

Im-1

Im

M1

M2

Mm

Protein Family F

Profile HMMs

- transition between match states – αM(i)M(i+1)
- transitions between match and insert states – αM(i)I(i), αI(i)M(i+1)
- transition within insert state – αI(i)I(i)
- transition between match and delete states – αM(i)D(i+1), αD(i)M(i+1)
- transition within delete state – αD(i)D(i+1)
- emission of amino acid b at a state S – εS(b)

Dm-1

Dm

D1

D2

BEGIN

END

I0

I1

Im-1

Im

M1

M2

Mm

Protein Family F

Profile HMMs

- transition probabilities ~ frequency of a transition in alignment
- emission probabilities ~ frequency of an emission in alignment
- pseudocounts are usually introduced

Dm-1

Dm

D1

D2

BEGIN

END

I0

I1

Im-1

Im

M1

M2

Mm

Protein Family F

Alignment of a protein to a profile HMM

To align sequence x1…xn to a profile HMM:

We will find the most likely alignment with the Viterbi DP algorithm

- Define
- VjM(i): score of best alignment of x1…xi to the HMM ending in xi being emitted from Mj
- VjI(i): score of best alignment of x1…xi to the HMM ending in xi being emitted from Ij
- VjD(i): score of best alignment of x1…xi to the HMM ending in Dj (xi is the last character emitted before Dj)
- Denote by qa the frequency of amino acid a in a ‘random’ protein

Alignment of a protein to a profile HMM

Vj-1M(i – 1) + log αM(j-1)M(j)

- VjM(i) = log (εM(j)(xi) / qxi) + max Vj-1I(i – 1) + log αI(j-1)M(j)

Vj-1D(i – 1) + log αD(j-1)M(j)

VjM(i – 1) + log αM(j)I(j)

- VjI(i) = log (εI(j)(xi) / qxi) + max VjI(i – 1) + log αI(j)I(j)

VjD(i – 1) + log αD(j)I(j)

Vj-1M(i) + log αM(j-1)D(j)

- VjD(i) = max Vj-1I(i) + log αI(j-1)D(j)

Vj-1D(i) + log αD(j-1)D(j)

Weight of each sequence

i

h

g

f

- One simple weighting scheme is to find how much edge length each leaf contributes
- Example: edge 1 belongs to a
- Example: edge 3 belongs both to a, and to b: e3e1/(e1+e2) goes to a

Δwi = ecurrent wi / (leaves k below ecurrentwk)

e

d

c

2

b

3

1

a

Resources on the web

- HMMer – a free profile HMM software
- http://hmmer.wustl.edu/
- SAM – another free profile HMM software
- http://www.cse.ucsc.edu/research/compbio/sam.html
- PFAM – database of alignments and HMMs for protein families and domains
- http://www.sanger.ac.uk/Software/Pfam/
- SCOP – a structural classification of proteins
- http://scop.berkeley.edu/data/scop.b.html

Download Presentation

Connecting to Server..