Inverse alignment

Inverse Alignment. CS 374, Bahman Bahmani, Fall 2006. Sequence Comparison - Alignment: an alignment can be thought of as two sequences that differ because of mutations that occurred during evolution.


Inverse Alignment

CS 374

Bahman Bahmani

Fall 2006


The Papers To Be Presented


Sequence Comparison - Alignment

  • An alignment can be thought of as two sequences that differ because of mutations that occurred during evolution

AGGCTATCACCTGACCTCCAGGCCGATGCCC

TAGCTATCACGACCGCGGTCGATTTGCCCGAC

    -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
     || |||||||  ||||x|  ||x|||  |||||
    TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC


Scoring Alignments

  • Alignments are based on three basic operations:

    • Substitutions

    • Insertions

    • Deletions

  • A score is assigned to each single operation (resulting in a scoring matrix and also in gap penalties). Alignments are then scored by adding the scores of their operations.

  • Standard formulations of string alignment optimize the above score of the alignment.


An Example of Scoring an Alignment Using a Scoring Matrix

AKRANR

KAAANK

-1 + (-1) + (-2) + 5 + 7 + 3 = 11
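The sum in this example can be reproduced with a short sketch. It assumes that the six terms map to the six aligned columns in order; the `SCORES` table holds only those six values and is not a full substitution matrix, and the function name is illustrative.

```python
# Substitution scores for the residue pairs in the example above.
# Assumption: the six terms of the sum correspond to the columns left to right.
SCORES = {
    ("A", "K"): -1,
    ("K", "A"): -1,
    ("R", "A"): -2,
    ("A", "A"): 5,
    ("N", "N"): 7,
    ("R", "K"): 3,
}


def score_ungapped(top: str, bottom: str, scores: dict) -> int:
    """Sum the substitution score of every aligned column (no gaps)."""
    if len(top) != len(bottom):
        raise ValueError("rows of an alignment must have equal length")
    return sum(scores[(a, b)] for a, b in zip(top, bottom))


total = score_ungapped("AKRANR", "KAAANK", SCORES)
print(total)  # 11
```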


Scoring Matrices in Practice

  • Some choices for substitution scores are now common, largely due to convention

  • Most commonly used amino-acid substitution matrices:

    • PAM (Percent Accepted Mutation)

    • BLOSUM (Blocks Amino Acid Substitution Matrix)

BLOSUM50 Scoring Matrix


Gap Penalties

  • Including gaps and gap penalties is necessary to obtain the best alignment

  • If the gap penalty is too high, gaps will never appear in the alignment

    AATGCTGC

    ATGCTGCA

  • If the gap penalty is too low, gaps will appear everywhere in the alignment

    AATGCTGC----

    A----TGCTGCA
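The effect of the gap penalty can be checked with a small dynamic program. This is a minimal sketch, assuming a cost-minimizing formulation with a single per-position gap cost and a unit mismatch cost; these settings and the function name are illustrative and not taken from the slides.

```python
def optimal_alignment_cost(x: str, y: str, mismatch: float, gap: float) -> float:
    """Minimum-cost global alignment of x and y (Needleman-Wunsch style).

    Matches cost 0, mismatches cost `mismatch`, and every gap position
    costs `gap`. Runs in O(len(x) * len(y)) time.
    """
    n, m = len(x), len(y)
    cost = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = i * gap
    for j in range(1, m + 1):
        cost[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            substitute = cost[i - 1][j - 1] + (0 if x[i - 1] == y[j - 1] else mismatch)
            cost[i][j] = min(substitute, cost[i - 1][j] + gap, cost[i][j - 1] + gap)
    return cost[n][m]


# Cheap gaps: shifting one string by a position (two gap columns) is optimal.
cheap = optimal_alignment_cost("AATGCTGC", "ATGCTGCA", mismatch=1, gap=1)
# Expensive gaps: the ungapped alignment with seven mismatches is optimal.
costly = optimal_alignment_cost("AATGCTGC", "ATGCTGCA", mismatch=1, gap=10)
print(cheap, costly)  # 2.0 7.0
```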


Gap Penalties (Cont’d)

Separate penalties for gap opening and gap extension

Opening: the cost to introduce a gap

Extension: the cost to elongate a gap

Opening a gap is costly, while extending a gap is cheap

Unlike scoring matrices, no gap penalties are commonly agreed upon

LETVGY

W----L

-5 -1 -1 -1
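The gap score in this example can be written as a one-line formula. The sketch below assumes, as the numbers above suggest, that the opening score covers the first gap position and each further position adds the extension score; the function name and default values are illustrative.

```python
def affine_gap_score(length: int, gap_open: int = -5, gap_extend: int = -1) -> int:
    """Score of one gap: opening covers position 1, extension covers the rest."""
    if length <= 0:
        return 0
    return gap_open + (length - 1) * gap_extend


# The four-position gap in W----L: -5 -1 -1 -1
example = affine_gap_score(4)
print(example)  # -8
```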


Parametric Sequence Alignment

  • For a given pair of strings, the alignment problem is solved for effectively all possible choices of the scoring parameters and penalties (exhaustive search).

  • A correct alignment is then used to find the best parameter values.

  • However, this method is very inefficient if the number of parameters is large.


Inverse Parametric Alignment

  • INPUT: an alignment of a pair of strings.

  • OUTPUT: a choice of parameters that makes the input alignment an optimal-scoring alignment of its strings.

  • From a machine learning point of view, this learns the parameters for optimal alignment from training examples of correct alignments.


Inverse Optimal Alignment

Definition (Inverse Optimal Alignment):

INPUT: alignments $A_1, A_2, \ldots, A_k$ of strings, and an alignment scoring function $f_w$ with parameters $w = (w_1, w_2, \ldots, w_p)$.

OUTPUT: values $x = (x_1, x_2, \ldots, x_p)$ for $w$.

GOAL: each input alignment is an optimal alignment of its strings under $f_x$.

ATTENTION: This problem may have no solution!


Inverse Near-Optimal Alignment

  • When minimizing the scoring function $f$, we say an alignment $A$ of a set of strings $S$ is $\epsilon$-optimal, for some $\epsilon \ge 0$, if:

    $$f(A) \le (1 + \epsilon)\, f(A^*)$$

    where $A^*$ is the optimal alignment of $S$ under $f$.


Inverse Near-Optimal Alignment (Cont’d)

  • Definition (Inverse Near-Optimal Alignment):

    INPUT: alignments $A_i$, a scoring function $f$, and a real number $\epsilon \ge 0$

    OUTPUT: parameter values $x$

    GOAL: each alignment $A_i$ is $\epsilon$-optimal under $f_x$.

    The smallest possible $\epsilon$ can be found to within a given accuracy using repeated calls to the algorithm.
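One way to search for the smallest feasible value is bisection. This is a sketch under stated assumptions, not necessarily the method of the paper: `solvable` is a hypothetical oracle that reports whether the inverse near-optimal problem has a solution for a given tolerance, and feasibility is assumed to be monotone (once solvable, it stays solvable for larger values).

```python
from typing import Callable


def smallest_tolerance(solvable: Callable[[float], bool], hi: float, accuracy: float) -> float:
    """Bisect [0, hi] for the smallest value accepted by `solvable`.

    Assumes solvable is monotone and solvable(hi) is True. The returned
    value is feasible and at most `accuracy` above the true minimum.
    """
    if not solvable(hi):
        raise ValueError("no feasible value in the search interval")
    lo = 0.0
    if solvable(lo):
        return lo
    while hi - lo > accuracy:
        mid = (lo + hi) / 2
        if solvable(mid):
            hi = mid  # mid is feasible: tighten the upper end
        else:
            lo = mid  # mid is infeasible: raise the lower end
    return hi


# Toy oracle (hypothetical): the problem becomes solvable at 0.3.
found = smallest_tolerance(lambda eps: eps >= 0.3, hi=1.0, accuracy=1e-6)
```

Each halving of the interval costs one oracle call, so the number of calls grows with the logarithm of the inverse accuracy.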


Inverse Unique-Optimal Alignment

  • When minimizing the scoring function $f$, we say an alignment $A$ of a set of strings $S$ is $\delta$-unique, for some $\delta > 0$, if:

    $$f(B) \ge f(A) + \delta$$

    for every alignment $B$ of $S$ other than $A$.


Inverse Unique-Optimal Alignment (Cont’d)

  • Definition (Inverse Unique-Optimal Alignment):

    INPUT: alignments $A_i$, a scoring function $f$, and a real number $\delta > 0$

    OUTPUT: parameter values $x$

    GOAL: each alignment $A_i$ is $\delta$-unique under $f_x$.

    The largest possible $\delta$ can be found to within a given accuracy using repeated calls to the algorithm.


Let There Be Linear Functions …

  • For most standard forms of alignment, the alignment scoring function $f$ is a linear function of its parameters:

    $$f_w(A) = f_0(A) + \sum_{i=1}^{p} w_i\, f_i(A)$$

    where each $f_i$ measures one of the features of $A$, and $f_0$ collects any part of the score that does not depend on the parameters.


Let There Be Linear Functions … (Example I)

  • With fixed substitution scores and two parameters, the gap-open penalty $\gamma$ and the gap-extension penalty $\lambda$, we have $p = 2$ and:

    $$f(A) = \gamma\, g(A) + \lambda\, l(A) + s(A)$$

    where:

    g(A) = number of gaps

    l(A) = total length of gaps

    s(A) = total score of all substitutions
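The three features can be read directly off a gapped alignment. The sketch below assumes the rows are equal-length strings with `-` marking gap positions, and that the score is the gap-open parameter times g(A), plus the gap-extension parameter times l(A), plus s(A); the helper names and the toy substitution cost are illustrative.

```python
import re
from typing import Callable


def gap_features(row_a: str, row_b: str) -> tuple[int, int]:
    """Return (g, l): the number of gaps and their total length, over both rows."""
    runs = re.findall(r"-+", row_a) + re.findall(r"-+", row_b)
    return len(runs), sum(len(run) for run in runs)


def alignment_score(row_a: str, row_b: str, gap_open: float, gap_extend: float,
                    sub_score: Callable[[str, str], float]) -> float:
    """Linear score: gap_open * g(A) + gap_extend * l(A) + s(A)."""
    g, l = gap_features(row_a, row_b)
    s = sum(sub_score(a, b) for a, b in zip(row_a, row_b) if a != "-" and b != "-")
    return gap_open * g + gap_extend * l + s


def unit_cost(a: str, b: str) -> int:
    """Toy substitution cost (assumption): 0 for identical letters, 1 otherwise."""
    return 0 if a == b else 1


features = gap_features("LETVGY", "W----L")  # one gap of length four
score = alignment_score("LETVGY", "W----L", gap_open=5, gap_extend=1, sub_score=unit_cost)
print(features, score)  # (1, 4) 11
```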


Let There Be Linear Functions … (Example II)

  • With no parameters fixed, the substitution scores are also among our parameters and:

    $$f(A) = \gamma\, g(A) + \lambda\, l(A) + \sum_{a,b} \sigma_{ab}\, h_{ab}(A)$$

    where:

    $a$ and $b$ range over all letters in the alphabet

    $\sigma_{ab}$ = score for substituting $a$ by $b$

    $h_{ab}(A)$ = number of substitutions in $A$ replacing $a$ by $b$


Linear Programming Problem

  • INPUT: variables x = (x1, x2, …, xn)

    a system of linear inequalities in x

    a linear objective function in x

    OUTPUT: assignment of real values to x

    GOAL: satisfy all the inequalities and minimize the objective

    In general, the program can be infeasible, bounded, or unbounded.


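Reducing the Inverse Alignment Problems to Linear Programming

  • Inverse Optimal Alignment: For each $A_i$ and every alignment $B$ of the set $S_i$, we have an inequality:

    $$f_x(A_i) \le f_x(B)$$

    or equivalently, writing $f_0$ for the part of the score that does not depend on the parameters:

    $$\sum_{j=1}^{p} \bigl(f_j(A_i) - f_j(B)\bigr)\, x_j \le f_0(B) - f_0(A_i)$$

    The number of alignments of a pair of strings of length $n$ is $\Theta\bigl((3 + 2\sqrt{2})^n / \sqrt{n}\bigr)$, hence a total of exponentially many inequalities in $p$ variables. Also, there is no specific objective function.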
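For the two-parameter case with fixed substitution scores, each competing alignment B contributes one linear inequality in the gap-open and gap-extension parameters. The sketch below builds that inequality from the features of A and B; the `Features` type and the example numbers are illustrative assumptions.

```python
from typing import NamedTuple


class Features(NamedTuple):
    g: int    # number of gaps
    l: int    # total length of gaps
    s: float  # total substitution score (fixed, not a parameter)


def optimality_inequality(a: Features, b: Features) -> tuple[tuple[int, int], float]:
    """Encode f(A) <= f(B) as coeffs[0]*gap_open + coeffs[1]*gap_extend <= rhs."""
    return (a.g - b.g, a.l - b.l), b.s - a.s


def is_satisfied(inequality, gap_open: float, gap_extend: float) -> bool:
    (c_open, c_extend), rhs = inequality
    return c_open * gap_open + c_extend * gap_extend <= rhs


# Hypothetical features: A has one gap of length 4, B is ungapped with more mismatches.
example_a = Features(g=1, l=4, s=2)
example_b = Features(g=0, l=0, s=6)
inequality = optimality_inequality(example_a, example_b)
print(inequality)  # ((1, 4), 4)
```

With these numbers A stays at least as good as B only while the gap-open parameter plus four times the gap-extension parameter is at most 4.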


Separation Theorem

  • Some definitions:

    • Polyhedron: intersection of half-spaces

    • Rational polyhedron: described by inequalities with only rational coefficients

    • Bounded polyhedron: no infinite rays


Separation Theorem (Cont’d)

  • Optimization Problem for a rational polyhedron $P$ in $\mathbb{R}^d$:

    INPUT: rational coefficients $c$ specifying the objective

    OUTPUT: a point $x$ in $P$ minimizing $cx$, or a determination that $P$ is empty.

  • Separation Problem for $P$:

    INPUT: a point $y$ in $\mathbb{R}^d$

    OUTPUT: rational coefficients $w$ and $b$ such that $wx \le b$ for all points $x$ in $P$ but $wy > b$ (a violated inequality), or a determination that $y$ is in $P$.


Separation Theorem (Cont’d)

  • Theorem (Equivalence of Separation and Optimization): The optimization problem on a bounded rational polyhedron can be solved in polynomial time if and only if the separation problem can be solved in polynomial time.

    That is, for bounded rational polyhedra:

    Optimization ⇔ Separation


Cutting-Plane Algorithm

  1. Start with a small subset S of the set L of all inequalities.

  2. Compute an optimal solution x under the constraints in S.

  3. Call the separation algorithm for L on x.

  4. If x is determined to satisfy L, output it and halt; otherwise, add the violated inequality to S and loop back to step (2).
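The loop can be sketched for two variables as follows. This is an illustration under assumptions, not the implementation used in the paper: the linear program is solved by brute-force vertex enumeration (valid only for a bounded two-dimensional region), and the separation oracle simply scans an explicit list, whereas for inverse alignment it would be an optimal-alignment computation. All names and numbers are hypothetical.

```python
from itertools import combinations

Ineq = tuple[float, float, float]  # (a1, a2, b) encodes a1*x1 + a2*x2 <= b


def satisfies(x, ineq: Ineq, tol: float = 1e-9) -> bool:
    a1, a2, b = ineq
    return a1 * x[0] + a2 * x[1] <= b + tol


def solve_lp_2d(c, constraints):
    """Minimize c.x over a bounded 2-D feasible region by checking its vertices."""
    best = None
    for (a1, a2, b), (d1, d2, e) in combinations(constraints, 2):
        det = a1 * d2 - a2 * d1
        if abs(det) < 1e-12:
            continue  # parallel boundary lines have no unique intersection
        x = ((b * d2 - a2 * e) / det, (a1 * e - b * d1) / det)  # Cramer's rule
        if all(satisfies(x, q) for q in constraints):
            value = c[0] * x[0] + c[1] * x[1]
            if best is None or value < best[0]:
                best = (value, x)
    return None if best is None else best[1]


def cutting_plane(c, initial, separate):
    """Steps 1-4: solve on the working set, ask for a violated inequality, repeat."""
    working = list(initial)
    while True:
        x = solve_lp_2d(c, working)
        if x is None:
            return None  # the working set is already infeasible
        violated = separate(x)
        if violated is None:
            return x  # x satisfies every inequality in L
        working.append(violated)


# Bounding box 0 <= x1, x2 <= 10 keeps the region bounded.
BOX = [(-1.0, 0.0, 0.0), (0.0, -1.0, 0.0), (1.0, 0.0, 10.0), (0.0, 1.0, 10.0)]
# Full set L: the box plus x1 + x2 >= 4 and x1 <= 3.
ALL = BOX + [(-1.0, -1.0, -4.0), (1.0, 0.0, 3.0)]


def separate(x):
    """Toy separation oracle: return the first inequality of L that x violates."""
    for ineq in ALL:
        if not satisfies(x, ineq):
            return ineq
    return None


solution = cutting_plane((1.0, 2.0), BOX, separate)  # minimize x1 + 2*x2
```

In this toy run the working set grows by two inequalities before the oracle accepts the point (3, 1).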


Complexity of Inverse Alignment

  • Theorem: Inverse Optimal and Near-Optimal Alignment can be solved in polynomial time for any form of alignment in which:

    1. the alignment scoring function is linear

    2. the parameter values can be bounded

    3. for any fixed parameter choice, an optimal alignment can be found in polynomial time.

    Inverse Unique-Optimal Alignment can be solved in polynomial time if in addition:

    3’. for any fixed parameter choice, a next-best alignment can be found in polynomial time.


Application to Global Alignment

  • Initializing the Cutting-Plane Algorithm: We consider the problem in two cases:

    • All scores and penalties varying: Then the parameter space can be made bounded.

    • Substitution costs are fixed: Then either (1) a bounding inequality, or (2) two inequalities, one a downward half-space and the other an upward half-space, with the slope of the former less than the slope of the latter, can be found in O(1) time, if they exist.


Application to Global Alignment (Cont’d)

  • Choosing an Objective Function: Again we consider two different cases:

  • Fixed substitution scores: in this case we choose the following objective:

  • Varying substitution scores: In this case we choose the following objective:

    where s is the minimum of all non-identity substitution scores and i is the maximum of all identity scores.


Application to Global Alignment (Cont’d)

  • For every objective, two extreme solutions exist: $x_{\text{large}}$ and $x_{\text{small}}$. Then for every $0 \le \alpha \le 1$ we have a corresponding solution:

    $$x_\alpha = (1 - \alpha)\, x_{\text{small}} + \alpha\, x_{\text{large}}$$

    $x_{1/2}$ is expected to generalize better to alignments outside the training set.


Computational Results


Computational Results (Cont’d)


Computational Results (Cont’d)


CONTRAlign

  • What: extensible and fully automatic parameter learning framework for protein pairwise sequence alignment

  • How: pair conditional random fields (pair-CRFs)

  • Who:


Pair-HMMs for Sequence Alignment


Pair-HMMs … (Cont’d)

  • If then:

    where:


Training Pair-HMMs

  • INPUT: a set of training examples

  • OUTPUT: the feature vector w

  • METHOD: maximizing the joint log-likelihood of the data and alignments under constraints on w:


Generating Alignments Using Pair-HMMs

  • Viterbi Algorithm on a Pair-HMM:

    INPUT: two sequences x and y

    OUTPUT: the alignment a of x and y that maximizes P(a | x, y; w)

    RUNNING TIME: O(|x| · |y|)


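Pair-CRFs

  • Directly model the conditional probabilities:

    $$P(a \mid x, y; w) = \frac{\exp\bigl(w^{T} f(a, x, y)\bigr)}{\sum_{a'} \exp\bigl(w^{T} f(a', x, y)\bigr)}$$

    where $f(a, x, y)$ is the feature vector of alignment $a$ and $w$ is a real-valued parameter vector not necessarily corresponding to log-probabilities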
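A pair-CRF of this kind gives each alignment a weight equal to the exponential of the dot product between the parameter vector and that alignment's feature vector, then normalizes over all alignments of the two sequences. The sketch below normalizes over a small explicit list of candidate feature vectors, which is a simplifying assumption (a real model sums over every alignment with a forward-style dynamic program); the feature values are made up.

```python
import math
from typing import Sequence


def conditional_probabilities(w: Sequence[float],
                              candidate_features: Sequence[Sequence[float]]) -> list[float]:
    """Softmax of w . f over an explicit set of candidate alignments."""
    scores = [sum(wi * fi for wi, fi in zip(w, f)) for f in candidate_features]
    top = max(scores)  # subtract the maximum for numerical stability
    weights = [math.exp(s - top) for s in scores]
    z = sum(weights)
    return [v / z for v in weights]


# Hypothetical features per candidate: (number of matches, number of gaps).
candidates = [(3, 0), (2, 1), (3, 1)]
probs = conditional_probabilities((1.0, -2.0), candidates)
```

Because the weights need not be log-probabilities, any real-valued vector defines a valid distribution after normalization.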


Training Pair-CRFs

  • INPUT: a set of training examples

  • OUTPUT: real-valued feature vector w

  • METHOD: maximizing the conditional log-likelihood of the data (discriminative/conditional learning):

    $$\ell(w) = \sum_{i} \log P\bigl(a^{(i)} \mid x^{(i)}, y^{(i)}; w\bigr) + \log P(w)$$

    where $P(w)$ is a Gaussian prior on $w$, to prevent over-fitting.


Properties of Pair-CRFs

  • Far weaker independence assumptions than Pair-HMMs

  • Capable of utilizing complex non-independent feature sets

  • Directly optimizing the predictive ability, ignoring P(x, y), the model that generates the input sequences


Choice of Model Topology in CONTRAlign

  • Some possible model topologies:

    CONTRAlign$_{\text{Double-Affine}}$:

    CONTRAlign$_{\text{Local}}$:


Choice of Feature Sets in CONTRAlign

  • Some possible feature sets to utilize:

    1. Hydropathy-based gap context features (CONTRAlign$_{\text{HYDROPATHY}}$)

    2. External Information:

    2.1. Secondary structure (CONTRAlign$_{\text{DSSP}}$)

    2.2. Solvent accessibility (CONTRAlign$_{\text{ACCESSIBILITY}}$)


Results: Comparison of Model Topologies and Feature Sets


Results: Comparison to Modern Sequence Alignment Tools


Results: Alignment Accuracy in the “Twilight Zone”

For each conservation range, the uncolored bars give accuracies for MAFFT (L-INS-i), T-Coffee, CLUSTALW, MUSCLE, and PROBCONS (Bali), in that order, and the colored bar indicates the accuracy for CONTRAlign.


Questions?


Thank You!

