Inverse Alignment

Inverse Alignment CS 374 Bahman Bahmani Fall 2006

The Papers To Be Presented

Sequence Comparison - Alignment
• An alignment can be thought of as relating two sequences that differ because of mutations that occurred during evolution.
    AGGCTATCACCTGACCTCCAGGCCGATGCCC
    TAGCTATCACGACCGCGGTCGATTTGCCCGAC
• Aligned:
    -AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---
     || |||||||  ||||x|  || |||  |||||
    TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC

Scoring Alignments
• Alignments are built from three basic operations:
  • Substitutions
  • Insertions
  • Deletions
• Each operation is assigned a score (giving a scoring matrix for substitutions and gap penalties for insertions and deletions). An alignment is scored by summing the scores of its operations.
• Standard formulations of string alignment optimize this score.

An Example of Scoring an Alignment Using a Scoring Matrix
    AKRANR
    KAAANK
    -1 + (-1) + (-2) + 5 + 7 + 3 = 11
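The arithmetic in this example can be reproduced with a few lines of Python. This is a minimal sketch: the dictionary holds only the five BLOSUM50 entries that the example uses (read off the sum on the slide), and the names `BLOSUM50_SUBSET` and `score_ungapped` are illustrative, not part of any library.

```python
# Only the BLOSUM50 entries needed for this example (values taken from the
# sum shown on the slide); a real scorer would load the full matrix.
BLOSUM50_SUBSET = {
    ("A", "K"): -1,
    ("A", "R"): -2,
    ("A", "A"): 5,
    ("N", "N"): 7,
    ("K", "R"): 3,
}


def score_ungapped(x, y, matrix=BLOSUM50_SUBSET):
    """Sum substitution scores column by column for two equal-length rows."""
    if len(x) != len(y):
        raise ValueError("rows must have the same length")
    # The matrix is symmetric, so look each pair up in sorted order.
    return sum(matrix[tuple(sorted(pair))] for pair in zip(x, y))


total = score_ungapped("AKRANR", "KAAANK")
print(total)  # 11
```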

Scoring Matrices in Practice
• Some choices of substitution scores are now common, largely by convention.
• The most commonly used amino-acid substitution matrices:
  • PAM (Percent Accepted Mutation)
  • BLOSUM (Blocks Amino Acid Substitution Matrix)
[Figure: BLOSUM50 scoring matrix]

Gap Penalties
• Including gaps and gap penalties is necessary to obtain the best alignment.
• If the gap penalty is too high, gaps never appear in the alignment:
    AATGCTGC
    ATGCTGCA
• If the gap penalty is too low, gaps appear everywhere in the alignment:
    AATGCTGC----
    A----TGCTGCA

Gap Penalties (Cont’d)
• Separate penalties are used for gap opening and gap extension:
  • Opening: the cost of introducing a gap
  • Extension: the cost of lengthening a gap
• Opening a gap is costly, while extending a gap is cheap.
• Unlike scoring matrices, no gap penalties are commonly agreed upon.
• Example (one opening penalty followed by three extension penalties):
    LETVGY
    W----L
    -5 -1 -1 -1
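The opening/extension scheme can be sketched as a small function. The default values -5 and -1 are the ones from the example above; the convention that the opening penalty covers the first gap position is an assumption read off that example, and the function name is illustrative.

```python
def affine_gap_score(length, gap_open=-5, gap_extend=-1):
    """Score of one gap of the given length under an affine penalty.

    Assumed convention (from the slide's example): the first gap position
    pays gap_open and every further position pays gap_extend.
    """
    if length <= 0:
        return 0
    return gap_open + (length - 1) * gap_extend


one_long_gap = affine_gap_score(4)                           # -5 -1 -1 -1
two_short_gaps = affine_gap_score(2) + affine_gap_score(2)   # two openings
print(one_long_gap, two_short_gaps)  # -8 -12
```

With these values one gap of length 4 scores better than two gaps of length 2, which is the behaviour the separate opening penalty is meant to produce.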

Parametric Sequence Alignment
• For a given pair of strings, the alignment problem is solved for effectively all possible choices of the scoring parameters and penalties (exhaustive search).
• A correct alignment is then used to find the best parameter values.
• However, this method is very inefficient when the number of parameters is large.

Inverse Parametric Alignment
• INPUT: an alignment of a pair of strings.
• OUTPUT: a choice of parameters that makes the input alignment an optimal-scoring alignment of its strings.
• From a machine-learning point of view, this learns the parameters for optimal alignment from training examples of correct alignments.

Inverse Optimal Alignment
• Definition (Inverse Optimal Alignment):
  INPUT: alignments $A_1, A_2, \ldots, A_k$ of strings, and an alignment scoring function $f_w$ with parameters $w = (w_1, w_2, \ldots, w_p)$
  OUTPUT: values $x = (x_1, x_2, \ldots, x_p)$ for $w$
  GOAL: each input alignment is an optimal alignment of its strings under $f_x$.
• ATTENTION: This problem may have no solution!

Inverse Near-Optimal Alignment
• When minimizing the scoring function $f$, we say an alignment $A$ of a set of strings $S$ is $\epsilon$-optimal, for some $\epsilon \ge 0$, if:
    $f(A) \le (1 + \epsilon)\, f(A^*)$
  where $A^*$ is an optimal alignment of $S$ under $f$.

Inverse Near-Optimal Alignment (Cont’d)
• Definition (Inverse Near-Optimal Alignment):
  INPUT: alignments $A_i$, a scoring function $f$, and a real number $\epsilon \ge 0$
  OUTPUT: parameter values $x$
  GOAL: each alignment $A_i$ is $\epsilon$-optimal under $f_x$.
• The smallest possible $\epsilon$ can be found to within a given accuracy using repeated calls to the algorithm.
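One plausible way to search for the smallest workable tolerance is bisection, sketched below. This is an assumption about the search strategy, not something the slide spells out. The oracle `is_feasible` is a hypothetical stand-in for one call to the inverse near-optimal algorithm at a fixed tolerance, and the sketch relies on feasibility being monotone (anything feasible at one tolerance stays feasible at a larger one).

```python
def smallest_feasible(is_feasible, lo, hi, tol):
    """Bisection for the smallest value in [lo, hi] accepted by the oracle.

    Assumes is_feasible is monotone (False below some threshold, True at and
    above it). Returns None if even hi is infeasible. The result is within
    tol of the true threshold.
    """
    if not is_feasible(hi):
        return None
    if is_feasible(lo):
        return lo
    # Invariant: lo is infeasible, hi is feasible.
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if is_feasible(mid):
            hi = mid
        else:
            lo = mid
    return hi


# Toy oracle standing in for the real algorithm: feasible from 0.3 upward.
toy_oracle = lambda eps: eps >= 0.3
eps_star = smallest_feasible(toy_oracle, 0.0, 1.0, 1e-3)
```

Each halving costs one oracle call, so the number of calls grows with the logarithm of the requested precision.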

Inverse Unique-Optimal Alignment
• When minimizing the scoring function $f$, we say an alignment $A$ of a set of strings $S$ is $\delta$-unique, for some $\delta > 0$, if:
    $f(B) \ge f(A) + \delta$
  for every alignment $B$ of $S$ other than $A$.

Inverse Unique-Optimal Alignment (Cont’d)
• Definition (Inverse Unique-Optimal Alignment):
  INPUT: alignments $A_i$, a scoring function $f$, and a real number $\delta > 0$
  OUTPUT: parameter values $x$
  GOAL: each alignment $A_i$ is $\delta$-unique under $f_x$.
• The largest possible $\delta$ can be found to within a given accuracy using repeated calls to the algorithm.

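Let There Be Linear Functions …
• For most standard forms of alignment, the alignment scoring function $f$ is a linear function of its parameters:
    $f_w(A) = f_0(A) + \sum_{i=1}^{p} w_i f_i(A)$
  where each $f_i$ measures one of the features of $A$, and $f_0$ collects any part of the score that does not depend on the parameters.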

Let There Be Linear Functions … (Example I)
• With fixed substitution scores and two parameters, the gap open ($\gamma$) and gap extension ($\lambda$) penalties, $p = 2$ and:
    $f(A) = \gamma\, g(A) + \lambda\, l(A) + s(A)$
  where:
    $g(A)$ = number of gaps
    $l(A)$ = total length of gaps
    $s(A)$ = total score of all substitutions
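As a rough illustration of Example I, the sketch below extracts the three features g(A), l(A) and s(A) from an alignment written as two gapped rows, then combines them linearly. The unit mismatch cost, the parameter names `gap_open` and `gap_extend`, and the convention that adjacent gaps in different rows count as separate gaps are all assumptions made for this example.

```python
def features(row_x, row_y, sub_cost):
    """Return (g, l, s): number of gaps, total gap length, substitution total."""
    if len(row_x) != len(row_y):
        raise ValueError("alignment rows must have the same length")
    g, l, s = 0, 0, 0.0
    prev = None  # row holding the gap in the previous column, if any
    for a, b in zip(row_x, row_y):
        if a == "-" or b == "-":
            cur = "x" if a == "-" else "y"
            l += 1
            if cur != prev:  # a new gap starts here
                g += 1
            prev = cur
        else:
            s += sub_cost(a, b)
            prev = None
    return g, l, s


def linear_score(feats, gap_open, gap_extend):
    """f(A) = gap_open * g(A) + gap_extend * l(A) + s(A)."""
    g, l, s = feats
    return gap_open * g + gap_extend * l + s


unit_mismatch = lambda a, b: 0.0 if a == b else 1.0  # assumed fixed costs
feats = features("LETVGY", "W----L", unit_mismatch)
cost = linear_score(feats, gap_open=3.0, gap_extend=1.0)
print(feats, cost)  # (1, 4, 2.0) 9.0
```

Because the score is linear in the two gap parameters, comparing two alignments of the same strings only requires their feature triples, not the alignments themselves.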

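Let There Be Linear Functions … (Example II)
• With no parameters fixed, the substitution scores $\sigma_{ab}$ are also parameters, and:
    $f(A) = \gamma\, g(A) + \lambda\, l(A) + \sum_{a,b} \sigma_{ab}\, h_{ab}(A)$
  where:
    $\gamma$ and $\lambda$ are the gap open and gap extension penalties
    $a$ and $b$ range over all letters in the alphabet
    $h_{ab}(A)$ = number of substitutions in $A$ replacing $a$ by $b$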

Linear Programming Problem
• INPUT: variables $x = (x_1, x_2, \ldots, x_n)$, a system of linear inequalities in $x$, and a linear objective function in $x$
• OUTPUT: an assignment of real values to $x$
• GOAL: satisfy all the inequalities and minimize the objective
• In general, the program can be infeasible, bounded, or unbounded.

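Reducing The Inverse Alignment Problems To Linear Programming
• Inverse Optimal Alignment: for each $A_i$ and every alignment $B$ of the set $S_i$, we have an inequality:
    $f_x(A_i) \le f_x(B)$
  or equivalently, writing $f_0$ for any part of the score that does not depend on the parameters:
    $\sum_{j=1}^{p} \big( f_j(A_i) - f_j(B) \big)\, x_j \le f_0(B) - f_0(A_i)$
• The number of alignments of a pair of strings of length $n$ is $\Theta\big((3 + 2\sqrt{2})^n / \sqrt{n}\big)$, hence a total of $O\big(k\,(3 + 2\sqrt{2})^n / \sqrt{n}\big)$ inequalities in $p$ variables.
• There is also no specific objective function.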
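To get a feel for how quickly the number of inequalities grows, the sketch below counts global alignments of two strings with a simple recurrence. It assumes that alignments are distinguished as sequences of columns (substitution, insertion, deletion), which gives the Delannoy numbers; the function name is illustrative.

```python
def count_alignments(m, n):
    """Number of global alignments of strings of lengths m and n.

    Each alignment ends in a substitution column, an insertion column, or a
    deletion column, so D[i][j] = D[i-1][j-1] + D[i][j-1] + D[i-1][j].
    """
    # With one string empty there is exactly one alignment (all gaps).
    D = [[1] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            D[i][j] = D[i - 1][j - 1] + D[i][j - 1] + D[i - 1][j]
    return D[m][n]


counts = [count_alignments(n, n) for n in range(1, 5)]
print(counts)  # [3, 13, 63, 321]
```

Writing one inequality per alternative alignment is therefore impractical even for short strings, which is why the inequalities are generated lazily rather than listed up front.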

Separation Theorem
• Some definitions:
  • Polyhedron: an intersection of half-spaces
  • Rational polyhedron: described by inequalities with only rational coefficients
  • Bounded polyhedron: contains no infinite rays

Separation Theorem (Cont’d)
• Optimization Problem for a rational polyhedron $P$ in $\mathbb{R}^d$:
  INPUT: rational coefficients $c$ specifying the objective
  OUTPUT: a point $x$ in $P$ minimizing $cx$, or a determination that $P$ is empty.
• Separation Problem for $P$:
  INPUT: a point $y$ in $\mathbb{R}^d$
  OUTPUT: rational coefficients $w$ and $b$ such that $wx \le b$ for all points $x$ in $P$ but $wy > b$ (a violated inequality), or a determination that $y$ is in $P$.

Separation Theorem (Cont’d)
• Theorem (Equivalence of Separation and Optimization): The optimization problem on a bounded rational polyhedron can be solved in polynomial time if and only if the separation problem can be solved in polynomial time.
• That is, for bounded rational polyhedra: Optimization $\Leftrightarrow$ Separation

Cutting-Plane Algorithm
1. Start with a small subset $S$ of the set $L$ of all inequalities.
2. Compute an optimal solution $x$ under the constraints in $S$.
3. Call the separation algorithm for $L$ on $x$.
4. If $x$ is determined to satisfy $L$, output it and halt; otherwise, add the violated inequality to $S$ and loop back to step (2).
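The loop above can be sketched generically, with the LP solver and the separation algorithm passed in as functions. To keep the example self-contained, the demonstration uses a deliberately tiny one-variable LP whose "solver" is just interval clipping; in a real setting `solve` would call a proper LP solver. All function names here are illustrative.

```python
def cutting_plane(initial, solve, separate, max_iters=1000):
    """Generic cutting-plane loop; returns (solution, constraints used)."""
    active = list(initial)                 # step 1: small starting subset S
    for _ in range(max_iters):
        x = solve(active)                  # step 2: optimise over S only
        if x is None:
            return None, active            # S is already infeasible
        cut = separate(x)                  # step 3: separation algorithm
        if cut is None:
            return x, active               # step 4: x satisfies all of L
        active.append(cut)                 # otherwise add the violated cut
    raise RuntimeError("cutting-plane loop did not converge")


def solve_min_x(constraints):
    """Toy 1-D LP: minimise x subject to a*x <= b for each (a, b).

    Assumes at least one constraint has a < 0, so x is bounded below.
    """
    lo, hi = float("-inf"), float("inf")
    for a, b in constraints:
        if a < 0:
            lo = max(lo, b / a)
        elif a > 0:
            hi = min(hi, b / a)
        elif b < 0:
            return None                    # 0*x <= b with b < 0
    return lo if lo <= hi else None


def most_violated(x, all_constraints, tol=1e-9):
    """Toy separation: scan the full list and return the worst violation."""
    worst = max(all_constraints, key=lambda c: c[0] * x - c[1])
    return worst if worst[0] * x - worst[1] > tol else None


# Full set L: x >= k/10 for k = 0..50, written as -x <= -k/10.
ALL = [(-1.0, -k / 10) for k in range(51)]
x_opt, used = cutting_plane(
    [(-1.0, 0.0)], solve_min_x, lambda x: most_violated(x, ALL)
)
print(x_opt, len(used))  # 5.0 2
```

In this toy run only 2 of the 51 inequalities are ever handed to the solver, which is the point of the method: the full set is consulted only through the separation step.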

Complexity of Inverse Alignment
• Theorem: Inverse Optimal and Near-Optimal Alignment can be solved in polynomial time for any form of alignment in which:
  1. the alignment scoring function is linear,
  2. the parameter values can be bounded, and
  3. for any fixed parameter choice, an optimal alignment can be found in polynomial time.
• Inverse Unique-Optimal Alignment can be solved in polynomial time if, in addition:
  3’. for any fixed parameter choice, a next-best alignment can be found in polynomial time.
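Condition 3 is what supplies the separation step: computing an optimal alignment at the current parameter values either confirms the input alignment or exposes a violated inequality. The sketch below shows this for a simplified setting that is an assumption of this example, not the slides' full model: costs are minimized, there is a single parameter `d` (cost per gap character), and substitutions have a fixed unit mismatch cost.

```python
def optimal_features(x, y, d, mismatch=1.0):
    """Dynamic program returning (l, s) for a min-cost global alignment.

    l = total gap length, s = total substitution cost, under gap cost d.
    """
    m, n = len(x), len(y)
    # best[i][j] = (cost, gap_len, sub_cost) for aligning x[:i] with y[:j]
    best = [[None] * (n + 1) for _ in range(m + 1)]
    best[0][0] = (0.0, 0, 0.0)
    for i in range(m + 1):
        for j in range(n + 1):
            if i == 0 and j == 0:
                continue
            cands = []
            if i > 0 and j > 0:            # substitution column
                c, l, s = best[i - 1][j - 1]
                sub = 0.0 if x[i - 1] == y[j - 1] else mismatch
                cands.append((c + sub, l, s + sub))
            if i > 0:                      # x[i-1] against a gap
                c, l, s = best[i - 1][j]
                cands.append((c + d, l + 1, s))
            if j > 0:                      # y[j-1] against a gap
                c, l, s = best[i][j - 1]
                cands.append((c + d, l + 1, s))
            best[i][j] = min(cands)        # ties broken by fewer gaps
    _, l, s = best[m][n]
    return l, s


def separate(x, y, l_A, s_A, d, tol=1e-9):
    """Return None if the input alignment (features l_A, s_A) is optimal at d.

    Otherwise return (a, b) encoding the violated inequality a * d <= b,
    which comes from requiring d*l_A + s_A <= d*l_B + s_B.
    """
    l_B, s_B = optimal_features(x, y, d)
    if d * l_B + s_B < d * l_A + s_A - tol:
        return (l_A - l_B, s_B - s_A)
    return None


# Input alignment AC/CA (no gaps, two mismatches): l_A = 0, s_A = 2.
cut = separate("AC", "CA", 0, 2.0, d=0.5)   # (-2, -2.0), i.e. d >= 1
ok = separate("AC", "CA", 0, 2.0, d=1.5)    # None: optimal at d = 1.5
```

Each call costs one dynamic program of size |x| by |y|, so the separation step runs in polynomial time even though the inequalities it ranges over are exponentially many.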

Application to Global Alignment
• Initializing the Cutting-Plane Algorithm: we consider two cases:
  • All scores and penalties vary: the parameter space can be made bounded.
  • Substitution costs are fixed: then, if they exist, either (1) a bounding inequality, or (2) two inequalities, one a downward half-space and the other an upward half-space, with the slope of the former less than the slope of the latter, can be found in O(1) time.

Application to Global Alignment (Cont’d)
• Choosing an Objective Function: again we consider two cases:
  • Fixed substitution scores: in this case we choose the following objective:
  • Varying substitution scores: in this case we choose the following objective:
    where $s$ is the minimum of all non-identity substitution scores and $i$ is the maximum of all identity scores.

Application to Global Alignment (Cont’d)
• For every objective, two extreme solutions exist: $x_{large}$ and $x_{small}$.
• Then for every $\alpha \in [0, 1]$ we have a corresponding solution:
    $x_\alpha = (1 - \alpha)\, x_{small} + \alpha\, x_{large}$
• $x_{1/2}$ is expected to generalize better to alignments outside the training set.

Computational Results

Computational Results (Cont’d)

CONTRAlign
• What: an extensible and fully automatic parameter learning framework for protein pairwise sequence alignment
• How: pair conditional random fields (pair-CRFs)
• Who:

Pair-HMMs for Sequence Alignment

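Pair-HMMs … (Cont’d)
• If $w$ is the vector of log transition and emission probabilities, then:
    $P(a, x, y; w) = \exp\big(w^T f(a, x, y)\big)$
  where $f(a, x, y)$ counts how many times each transition and emission is used by the alignment $a$ of $x$ and $y$.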

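Training Pair-HMMs
• INPUT: a set of training examples $(a^{(i)}, x^{(i)}, y^{(i)})$
• OUTPUT: the parameter vector $w$
• METHOD: maximize the joint log-likelihood of the data and alignments, under the constraints on $w$ (its entries must be logs of valid probability distributions):
    $\max_w \sum_i \log P(a^{(i)}, x^{(i)}, y^{(i)}; w)$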

Generating Alignments Using Pair-HMMs
• Viterbi Algorithm on a Pair-HMM:
  INPUT: two sequences $x$ and $y$
  OUTPUT: the alignment $a$ of $x$ and $y$ that maximizes $P(a \mid x, y; w)$
  RUNNING TIME: $O(|x| \cdot |y|)$

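Pair-CRFs
• Directly model the conditional probabilities:
    $P(a \mid x, y; w) = \dfrac{\exp\big(w^T f(a, x, y)\big)}{\sum_{a'} \exp\big(w^T f(a', x, y)\big)}$
  where $f(a, x, y)$ is the feature vector of alignment $a$, the sum runs over all alignments $a'$ of $x$ and $y$, and $w$ is a real-valued parameter vector not necessarily corresponding to log-probabilities.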
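The conditional model can be illustrated on a toy scale by normalizing exponentiated scores over an explicit list of candidate alignments. This is only a sketch under stated assumptions: a real pair-CRF sums over all alignments with a forward-style dynamic program rather than enumerating them, and the three features used here (matches, mismatches, gap characters) and the weight values are made up for the example.

```python
import math


def conditional_probs(w, feature_vectors):
    """P(a | x, y; w) for each candidate a, given its feature vector f(a).

    Computes exp(w . f(a)) / sum over candidates of exp(w . f(a')).
    """
    scores = [sum(wi * fi for wi, fi in zip(w, f)) for f in feature_vectors]
    top = max(scores)                       # subtract the max for stability
    exps = [math.exp(s - top) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


# Hypothetical weights for (matches, mismatches, gap characters).
w = [1.0, -2.0, -0.5]
# Two candidate alignments of "AC" and "CA":
#   AC / CA      -> 0 matches, 2 mismatches, 0 gap characters
#   -AC / CA-    -> 1 match,   0 mismatches, 2 gap characters
candidates = [(0, 2, 0), (1, 0, 2)]
probs = conditional_probs(w, candidates)
```

Nothing forces the weights to be log-probabilities here; any real values give a valid distribution after normalization, which is the freedom the pair-CRF formulation adds.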

Training Pair-CRFs
• INPUT: a set of training examples $(a^{(i)}, x^{(i)}, y^{(i)})$
• OUTPUT: real-valued parameter vector $w$
• METHOD: maximize the conditional log-likelihood of the data (discriminative/conditional learning):
    $\max_w \; \sum_i \log P(a^{(i)} \mid x^{(i)}, y^{(i)}; w) + \log P(w)$
  where $P(w)$ is a Gaussian prior on $w$, to prevent over-fitting.

Properties of Pair-CRFs
• Far weaker independence assumptions than pair-HMMs
• Capable of utilizing complex, non-independent feature sets
• Directly optimize predictive ability, ignoring $P(x, y)$, the model that generates the input sequences

Choice of Model Topology in CONTRAlign
• Some possible model topologies:
  • CONTRAlign_Double-Affine
  • CONTRAlign_Local

Choice of Feature Sets in CONTRAlign
• Some possible feature sets to utilize:
  1. Hydropathy-based gap context features (CONTRAlign_HYDROPATHY)
  2. External information:
     2.1. Secondary structure (CONTRAlign_DSSP)
     2.2. Solvent accessibility (CONTRAlign_ACCESSIBILITY)

Results: Comparison of Model Topologies and Feature Sets

Results: Comparison to Modern Sequence Alignment Tools

Results: Alignment Accuracy in the “Twilight Zone”
For each conservation range, the uncolored bars give accuracies for MAFFT (L-INS-i), T-Coffee, CLUSTALW, MUSCLE, and PROBCONS (Bali), in that order, and the colored bar indicates the accuracy for CONTRAlign.

Questions?

Thank You!
