Inverse Alignment

CS 374

Bahman Bahmani

Fall 2006


The Papers To Be Presented

  • J. Kececioglu and E. Kim, "Simple and Fast Inverse Alignment," RECOMB 2006

  • C. Do, S. Gross, and S. Batzoglou, "CONTRAlign: Discriminative Training for Protein Sequence Alignment," RECOMB 2006


Sequence Comparison - Alignment

  • Alignments can be thought of as capturing how two sequences came to differ through mutations that happened during evolution

AGGCTATCACCTGACCTCCAGGCCGATGCCC

TAGCTATCACGACCGCGGTCGATTTGCCCGAC

-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---

TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC


Scoring Alignments

  • Alignments are based on three basic operations:

  • Substitutions

  • Insertions

  • Deletions

  • A score is assigned to each single operation (collected in a substitution scoring matrix, together with gap penalties). Alignments are then scored by adding up the scores of their operations.

  • Standard formulations of string alignment optimize the above score of the alignment.


An Example of Scoring an Alignment Using a Scoring Matrix

AKRANR

KAAANK

-1 + (-1) + (-2) + 5 + 7 + 3 = 11
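As a minimal sketch, the same computation in Python; the dictionary below lists only the substitution scores this example needs (they match the corresponding BLOSUM50 entries), not a full matrix:

    # Score an ungapped alignment column by column using a substitution matrix.
    # Only the entries needed for the AKRANR/KAAANK example are listed here;
    # in practice a full matrix such as BLOSUM50 would be used.
    SUB = {('A', 'K'): -1, ('K', 'A'): -1, ('R', 'A'): -2,
           ('A', 'A'): 5, ('N', 'N'): 7, ('R', 'K'): 3}

    def score_ungapped(s, t):
        """Sum the substitution scores over the aligned columns (no gaps)."""
        assert len(s) == len(t)
        return sum(SUB[(a, b)] for a, b in zip(s, t))

    print(score_ungapped("AKRANR", "KAAANK"))  # -1 + (-1) + (-2) + 5 + 7 + 3 = 11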


Scoring Matrices in Practice

  • Some choices for substitution scores are now common, largely due to convention

  • The most commonly used amino-acid substitution matrices:

  • PAM (Percent Accepted Mutation)

  • BLOSUM (Blocks Amino Acid Substitution Matrix)

BLOSUM50 Scoring Matrix


Gap Penalties

  • Inclusion of gaps and gap penalties is necessary to obtain the best alignment

  • If gap penalty is too high, gaps will never appear in the alignment

    AATGCTGC

    ATGCTGCA

  • If gap penalty is too low, gaps will appear everywhere in the alignment

    AATGCTGC----

    A----TGCTGCA


Gap Penalties (Cont’d)

Separate penalties for gap opening and gap extension

Opening: The cost to introduce a gap

Extension: The cost to elongate a gap

Opening a gap is costly, while extending a gap is cheap

Unlike substitution matrices, no particular choice of gap penalties is commonly agreed upon

LETVGY

W----L

-5 + (-1) + (-1) + (-1)   (one gap opening, then three extensions; a sketch of this scoring follows)
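A small Python sketch of this convention, scoring the gap columns of one alignment row (the open/extend values are the ones from the example above):

    # Affine gap scoring: the first column of a gap costs `open_`, and every
    # further column of the same gap costs `extend`.
    def gap_score(aln_row, open_=-5, extend=-1):
        total, in_gap = 0, False
        for c in aln_row:
            if c == '-':
                total += extend if in_gap else open_
                in_gap = True
            else:
                in_gap = False
        return total

    print(gap_score("W----L"))  # -5 + (-1) + (-1) + (-1) = -8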


Parametric Sequence Alignment

  • For a given pair of strings, the alignment problem is solved for effectively all possible choices of the scoring parameters and penalties (an exhaustive search).

  • A correct alignment is then used to find the best parameter values.

  • However, this method is very inefficient when the number of parameters is large.


Inverse Parametric Alignment

  • INPUT: an alignment of a pair of strings.

  • OUTPUT: a choice of parameters that makes the input alignment an optimal-scoring alignment of its strings.

  • From a machine-learning point of view, this learns the parameters for optimal alignment from training examples of correct alignments.


Inverse Optimal Alignment

Definition (Inverse Optimal Alignment):

INPUT: alignments A_1, A_2, …, A_k of strings, and an alignment scoring function f_w with parameters w = (w_1, w_2, …, w_p).

OUTPUT: values x = (x_1, x_2, …, x_p) for w

GOAL: each input alignment be an optimal alignment of its strings under f_x.

ATTENTION: This problem may have no solution!


Inverse Near-Optimal Alignment

  • When minimizing the scoring function f, we say an alignment A of a set of strings S is δ-optimal, for some δ ≥ 0, if:

    f(A) ≤ (1 + δ) · f(A*)

    where A* is the optimal alignment of S under f.


Inverse Near-Optimal Alignment (Cont’d)

  • Definition (Inverse Near-Optimal Alignment):

    INPUT: alignments A_i

    a scoring function f

    a real number δ ≥ 0

    OUTPUT: parameter values x

    GOAL: each alignment A_i be δ-optimal under f_x.

    The smallest possible δ can be found to within accuracy ε using O(log(1/ε)) calls to the algorithm (a binary search over δ).


Inverse Unique-Optimal Alignment

  • When minimizing the scoring function f, we say an alignment A of a set of strings S is δ-unique, for some δ > 0, if:

    f(B) ≥ f(A) + δ

    for every alignment B of S other than A.


Inverse Unique-Optimal Alignment (Cont’d)

  • Definition (Inverse Unique-Optimal Alignment):

    INPUT: alignments A_i

    a scoring function f

    a real number δ > 0

    OUTPUT: parameter values x

    GOAL: each alignment A_i be δ-unique under f_x

    The largest possible δ can be found to within accuracy ε using O(log(1/ε)) calls to the algorithm.


Let There Be Linear Functions …

  • For most standard forms of alignment, the alignment scoring function f is a linear function of its parameters:

    f_w(A) = w · f(A) = w_1 f_1(A) + w_2 f_2(A) + … + w_p f_p(A)

    where each f_i measures one of the features of A.


Let There Be Linear Functions … (Example I)

  • With the substitution scores fixed and two parameters, a gap-open penalty γ and a gap-extension penalty λ, we have p = 2 and:

    f(A) = γ · g(A) + λ · ℓ(A) + s(A)

    where:

    g(A) = number of gaps in A

    ℓ(A) = total length of the gaps in A

    s(A) = total score of all substitutions in A (a sketch of computing these features follows)
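A Python sketch of extracting these three features from a two-row alignment; `sub` stands for whatever fixed substitution matrix is in use (an assumed input, e.g. the SUB dictionary above):

    # Extract the features g(A), l(A), s(A) of a pairwise alignment A and
    # evaluate the scoring function, which is linear in (gamma, lam).
    def features(row1, row2, sub):
        g = l = s = 0
        for row in (row1, row2):
            in_gap = False
            for c in row:
                if c == '-':
                    l += 1                  # total gap length
                    if not in_gap:
                        g += 1              # a new gap starts here
                    in_gap = True
                else:
                    in_gap = False
        for a, b in zip(row1, row2):
            if a != '-' and b != '-':
                s += sub[(a, b)]            # substitution columns only
        return g, l, s

    def f(row1, row2, sub, gamma, lam):
        g, l, s = features(row1, row2, sub)
        return gamma * g + lam * l + s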


Let There Be Linear Functions … (Example II)

  • With no parameters fixed, the substitution scores σ_ab are parameters as well, and:

    f_w(A) = γ · g(A) + λ · ℓ(A) + Σ_ab σ_ab · h_ab(A)

    where:

    a and b range over all letters of the alphabet

    h_ab(A) = number of substitutions in A replacing a by b


Linear Programming Problem

  • INPUT: variables x = (x1, x2, …, xn)

    a system of linear inequalities in x

    a linear objective function in x

    OUTPUT: assignment of real values to x

    GOAL: satisfy all the inequalities and minimize the objective

In general, the program can be infeasible, unbounded, or have a finite optimal solution (a tiny worked instance follows).
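For concreteness, a tiny instance in Python using scipy.optimize.linprog as the solver (an assumed dependency; any LP solver would do): minimize x1 + x2 subject to x1 + 2·x2 ≥ 4 and x1, x2 ≥ 0.

    from scipy.optimize import linprog

    # linprog expects "<=" constraints, so x1 + 2*x2 >= 4 becomes -x1 - 2*x2 <= -4.
    res = linprog(c=[1, 1], A_ub=[[-1, -2]], b_ub=[-4], bounds=[(0, None)] * 2)
    print(res.x)       # optimal assignment, here [0, 2]
    print(res.status)  # 0 = optimal, 2 = infeasible, 3 = unbounded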


Reducing The Inverse Alignment Problems To Linear Programming

  • Inverse Optimal Alignment: for each A_i and every alignment B of its set of strings S_i, we have an inequality:

    f_x(A_i) ≤ f_x(B)

    or equivalently:

    (f(A_i) − f(B)) · x ≤ 0

    The number of alignments of a pair of strings of length n is exponential in n, hence we get exponentially many inequalities in p variables. Also, there is no specific objective function.
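Since f_x(A) = x · f(A), each of these constraints is just one linear row of the LP; a one-line sketch of building it from two feature vectors (as produced by the `features` sketch above):

    def constraint_row(feat_Ai, feat_B):
        """Coefficients of the inequality (f(A_i) - f(B)) . x <= 0."""
        return [fa - fb for fa, fb in zip(feat_Ai, feat_B)]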


Separation Theorem

  • Some definitions:

  • Polyhedron: intersection of half-spaces

  • Rational polyhedron: described by inequalities with only rational coefficients

  • Bounded polyhedron: no infinite rays


Separation Theorem (Cont’d)

  • The Optimization Problem for a rational polyhedron P in ℝ^p:

    INPUT: rational coefficients c specifying the objective

    OUTPUT: a point x in P minimizing c · x, or a determination that P is empty.

  • The Separation Problem for P:

    INPUT: a point y in ℝ^p

    OUTPUT: rational coefficients w and b such that w · x ≤ b for all points x in P but w · y > b (a violated inequality), or a determination that y is in P.


Separation Theorem (Cont’d)

  • Theorem (Equivalence of Separation and Optimization): the optimization problem on a bounded rational polyhedron can be solved in polynomial time if and only if the separation problem can be solved in polynomial time.

    That is, for bounded rational polyhedra:

    Optimization ≡ Separation


Cutting-Plane Algorithm

  1. Start with a small subset S of the set L of all inequalities.

  2. Compute an optimal solution x under the constraints in S.

  3. Call the separation algorithm for L on x.

  4. If x is determined to satisfy L, output it and halt; otherwise, add the violated inequality to S and loop back to step (2). (A sketch of this loop follows.)
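A schematic of this loop in Python, reusing scipy.optimize.linprog for step (2); `separation_oracle` is a hypothetical callback that returns a violated inequality (a, b) with a · x > b, or None when x satisfies all of L. For inverse alignment, the oracle amounts to running the alignment algorithm with the candidate parameters x:

    from scipy.optimize import linprog

    def cutting_plane(c, bounds, separation_oracle, max_iters=1000):
        """Minimize c.x subject to a constraint set L known only through a
        separation oracle (a hypothetical sketch, not the paper's code)."""
        A, b = [], []                        # S: inequalities found so far
        for _ in range(max_iters):
            res = linprog(c, A_ub=A or None, b_ub=b or None, bounds=bounds)
            if res.status != 0:              # subproblem infeasible/unbounded
                return None
            cut = separation_oracle(res.x)
            if cut is None:                  # x satisfies all of L: halt
                return res.x
            a_row, b_val = cut               # add the violated inequality to S
            A.append(list(a_row))
            b.append(b_val)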


Complexity of Inverse Alignment

  • Theorem:Inverse Optimal and Near-Optimal Alignment can be solved in polynomial time for any form of alignment in which:

    1. the alignment scoring function is linear

2. the parameter values can be bounded

    3. for any fixed parameter choice, an optimal alignment can be found in polynomial time.

    Inverse Unique-Optimal Alignment can be solved in polynomial time if in addition:

    3’. for any fixed parameter choice, a next-best alignment can be found in polynomial time.


Application to Global Alignment

  • Initializing the cutting-plane algorithm: we consider the problem in two cases:

  • All scores and penalties varying: the parameter space can be made bounded.

  • Substitution costs fixed: either (1) a bounding inequality, or (2) a pair of inequalities, one a downward half-space and the other an upward half-space, with the slope of the former less than the slope of the latter, can be found in O(1) time, if they exist.


Application to Global Alignment (Cont’d)

  • Choosing an objective function: again we consider two cases:

  • Fixed substitution scores: the objective is a fixed linear function of the gap-penalty parameters.

  • Varying substitution scores: the objective also involves the substitution scores, where s is the minimum of all non-identity substitution scores and i is the maximum of all identity scores.


Application to Global Alignment (Cont’d)

  • For every objective, two extreme solutions exist, x_small and x_large, and every convex combination x_α = (1 − α) · x_small + α · x_large with α in [0, 1] gives a corresponding solution.

    The midpoint x_1/2 is expected to generalize better to alignments outside the training set.


Computational Results

(The experimental results were presented as tables and plots, which are not preserved in this transcript.)


CONTRAlign

  • What: an extensible and fully automatic parameter-learning framework for protein pairwise sequence alignment

  • How: pair conditional random fields (pair-CRFs)

  • Who: Chuong Do, Samuel Gross, and Serafim Batzoglou (Stanford University)


Pair-HMMs for Sequence Alignment


Pair-HMMs … (Cont’d)

  • If then:

    where:


Training Pair-HMMs

  • INPUT: a set of training examples (x^(i), y^(i), a^(i))

  • OUTPUT: the parameter vector w

  • METHOD: maximizing the joint log-likelihood of the data and alignments, Σ_i log P(a^(i), x^(i), y^(i); w), under the constraint that the entries of w are log-probabilities


Generating Alignments Using Pair-HMMs

  • The Viterbi algorithm on a pair-HMM:

    INPUT: two sequences x and y

    OUTPUT: the alignment a of x and y that maximizes P(a | x, y; w)

    RUNNING TIME: O(|x| · |y|) (a sketch follows)
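A minimal log-space Viterbi sketch in Python for a three-state pair-HMM (M = both letters aligned, X = letter of x against a gap, Y = letter of y against a gap); the transition/emission scores are hypothetical placeholders rather than trained values, and the traceback that recovers the alignment itself is omitted for brevity. The two nested loops make the O(|x| · |y|) running time visible:

    import math

    NEG = -math.inf

    def viterbi_score(x, y, match=math.log(0.9), mismatch=math.log(0.1),
                      gap_open=math.log(0.2), gap_ext=math.log(0.4)):
        """Log-score of the best state path (alignment) of x and y under a
        3-state pair-HMM; the scores are placeholders, not trained values."""
        n, m = len(x), len(y)
        M = [[NEG] * (m + 1) for _ in range(n + 1)]   # state M
        X = [[NEG] * (m + 1) for _ in range(n + 1)]   # state X (gap in y)
        Y = [[NEG] * (m + 1) for _ in range(n + 1)]   # state Y (gap in x)
        M[0][0] = 0.0
        for i in range(n + 1):
            for j in range(m + 1):
                if i > 0 and j > 0:
                    e = match if x[i-1] == y[j-1] else mismatch
                    M[i][j] = e + max(M[i-1][j-1], X[i-1][j-1], Y[i-1][j-1])
                if i > 0:
                    X[i][j] = max(M[i-1][j] + gap_open, X[i-1][j] + gap_ext)
                if j > 0:
                    Y[i][j] = max(M[i][j-1] + gap_open, Y[i][j-1] + gap_ext)
        return max(M[n][m], X[n][m], Y[n][m])

    print(viterbi_score("AGGCT", "AGCT"))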


Pair-CRFs

  • Directly model the conditional probabilities:

    P(a | x, y; w) = exp(w · f(a, x, y)) / Σ_a' exp(w · f(a', x, y))

    where w is a real-valued parameter vector, not necessarily corresponding to log-probabilities
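The normalizer sums over all alignments a' of x and y; in practice it is computed by the forward algorithm in O(|x| · |y|) time, but the definition itself can be sketched by brute force over an explicit candidate set (the feature vectors and weights below are made-up illustrations):

    import math

    def crf_prob(w, f_a, candidate_fs):
        """P(a | x, y; w) = exp(w.f(a)) / sum over a' of exp(w.f(a')),
        computed here by brute force over an explicit candidate set."""
        dot = lambda u, v: sum(ui * vi for ui, vi in zip(u, v))
        scores = [dot(w, f) for f in candidate_fs]
        z = max(scores)                       # log-sum-exp for stability
        log_norm = z + math.log(sum(math.exp(s - z) for s in scores))
        return math.exp(dot(w, f_a) - log_norm)

    # Made-up feature vectors (match count, gap count) for three alignments:
    fs = [(4, 1), (3, 2), (2, 3)]
    w = (1.0, -0.5)                           # real-valued, not log-probabilities
    print(crf_prob(w, fs[0], fs))             # probability of the first candidate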


Training Pair-CRFs

  • INPUT: a set of training examples (x^(i), y^(i), a^(i))

  • OUTPUT: a real-valued parameter vector w

  • METHOD: maximizing the conditional log-likelihood of the data (discriminative/conditional learning):

    Σ_i log P(a^(i) | x^(i), y^(i); w) − ‖w‖² / (2σ²)

    where the ‖w‖² / (2σ²) term comes from a Gaussian prior on w, included to prevent over-fitting


Properties of Pair-CRFs

  • Far weaker independence assumptions than pair-HMMs

  • Capable of utilizing complex, non-independent feature sets

  • Directly optimizes predictive ability, ignoring P(x, y), the model that generates the input sequences


Choice of Model Topology in CONTRAlign

  • Some possible model topologies (state diagrams shown as figures in the slides):

    CONTRAlign Double-Affine

    CONTRAlign Local


Choice of Feature Sets in CONTRAlign

  • Some possible feature sets to utilize:

    1. Hydropathy-based gap context features (CONTRAlign HYDROPATHY)

    2. External information:

    2.1. Secondary structure (CONTRAlign DSSP)

    2.2. Solvent accessibility (CONTRAlign ACCESSIBILITY)


Results: Comparison of Model Topologies and Feature Sets


Results: Comparison to Modern Sequence Alignment Tools


Results: Alignment Accuracy in the “Twilight Zone”

For each conservation range, the uncolored bars give accuracies for MAFFT (L-INS-i), T-Coffee, CLUSTALW, MUSCLE, and PROBCONS (Bali), in that order, and the colored bar indicates the accuracy of CONTRAlign.


Questions?


Thank You!

