Inverse Alignment

CS 374

Bahman Bahmani

Fall 2006



Sequence Comparison - Alignment

  • An alignment can be thought of as showing how two sequences differ due to mutations that occurred during evolution

AGGCTATCACCTGACCTCCAGGCCGATGCCC

TAGCTATCACGACCGCGGTCGATTTGCCCGAC

-AGGCTATCACCTGACCTCCAGGCCGA--TGCCC---

| | | | | | | | | | | | | x | | | | | | | | | | |

TAG-CTATCAC--GACCGC--GGTCGATTTGCCCGAC


Scoring Alignments

  • Alignments are based on three basic operations:

  • Substitutions

  • Insertions

  • Deletions

  • A score is assigned to each single operation (giving a substitution scoring matrix and gap penalties). Alignments are then scored by adding the scores of their operations.

  • Standard formulations of string alignment optimize the above score of the alignment.


An Example of Scoring an Alignment Using a Scoring Matrix

AKRANR

KAAANK

-1 + (-1) + (-2) + 5 + 7 + 3 = 11
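The arithmetic on this slide can be checked in code. A minimal sketch, where `subst` is a toy lookup table holding only the BLOSUM50 entries quoted above (a real implementation would load the full symmetric matrix):

```python
# Score an ungapped alignment column by column with a substitution matrix.
# Only the BLOSUM50 entries used on this slide are included here.
subst = {
    ("A", "K"): -1, ("K", "A"): -1, ("R", "A"): -2,
    ("A", "A"): 5, ("N", "N"): 7, ("R", "K"): 3,
}

def score_alignment(top: str, bottom: str) -> int:
    """Sum the substitution score of each aligned pair of residues."""
    return sum(subst[(a, b)] for a, b in zip(top, bottom))

print(score_alignment("AKRANR", "KAAANK"))  # -1 + (-1) + (-2) + 5 + 7 + 3 = 11
```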


Scoring Matrices in Practice

  • Some choices for substitution scores are now common, largely due to convention

  • The most commonly used amino-acid substitution matrices:

  • PAM (Percent Accepted Mutation)

  • BLOSUM (Blocks Amino Acid Substitution Matrix)

BLOSUM50 Scoring Matrix


Gap Penalties

  • Inclusion of gaps and gap penalties is necessary to obtain the best alignment

  • If gap penalty is too high, gaps will never appear in the alignment

    AATGCTGC

    ATGCTGCA

  • If gap penalty is too low, gaps will appear everywhere in the alignment

    AATGCTGC----

    A----TGCTGCA


Gap Penalties (Cont’d)

Separate penalties for gap opening and gap extension

Opening: The cost to introduce a gap

Extension: The cost to elongate a gap

Opening a gap is costly, while extending a gap is cheap

Unlike substitution scores, no gap penalties are commonly agreed upon

LETVGY

W----L

-5 -1 -1 -1
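The -5 -1 -1 -1 breakdown above is an affine gap penalty: the first gap character pays the opening cost, each additional one pays the extension cost. A minimal sketch, with the slide's values as defaults:

```python
def gap_penalty(length: int, open_cost: int = -5, extend_cost: int = -1) -> int:
    """Affine gap penalty: opening is costly, each extension is cheap."""
    if length == 0:
        return 0
    return open_cost + (length - 1) * extend_cost

def total_gap_score(row: str) -> int:
    """Sum the affine penalty over every maximal run of '-' in one row."""
    total, run = 0, 0
    for ch in row:
        if ch == "-":
            run += 1
        else:
            total += gap_penalty(run)
            run = 0
    return total + gap_penalty(run)

print(total_gap_score("W----L"))  # -5 + 3*(-1) = -8
```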


Parametric Sequence Alignment

  • For a given pair of strings, the alignment problem is solved for effectively all possible choices of the scoring parameters and penalties (exhaustive search).

  • A correct alignment is then used to find the best parameter values.

  • However, this method is very inefficient if the number of parameters is large.


Inverse Parametric Alignment

  • INPUT: an alignment of a pair of strings.

  • OUTPUT: a choice of parameters that makes the input alignment an optimal-scoring alignment of its strings.

  • From a machine-learning point of view, this learns the parameters for optimal alignment from training examples of correct alignments.


Inverse Optimal Alignment

Definition (Inverse Optimal Alignment):

INPUT: alignments A1, A2, …, Ak of strings,

an alignment scoring function fw with parameters w = (w1, w2, …, wp).

OUTPUT: values x = (x1, x2, …, xp) for w

GOAL: each input alignment be an optimal alignment of its strings under fx .

ATTENTION: This problem may have no solution!


Inverse Near-Optimal Alignment

  • When minimizing the scoring function f, we say an alignment A of a set of strings S is ε-optimal, for some ε ≥ 0, if:

    f(A) ≤ (1 + ε) · f(A*)

    where A* is the optimal alignment of S under f.


Inverse Near-Optimal Alignment (Cont’d)

  • Definition (Inverse Near-Optimal Alignment):

    INPUT: alignments Ai

    scoring function f

    real number ε ≥ 0

    OUTPUT: parameter values x

    GOAL: each alignment Ai be ε-optimal under fx .

    The smallest possible ε can be found within accuracy κ using O(log(1/κ)) calls to the algorithm (binary search on ε).
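The binary search over ε can be sketched as follows; `is_feasible` is a hypothetical stand-in for solving the Inverse Near-Optimal Alignment instance at a fixed ε, assumed monotone (a larger ε only relaxes the constraints):

```python
def smallest_epsilon(is_feasible, hi=1.0, accuracy=1e-3):
    """Bisect for the smallest eps in [0, hi] with is_feasible(eps) True.
    Uses O(log(hi / accuracy)) oracle calls, assuming monotone feasibility."""
    lo = 0.0
    while hi - lo > accuracy:
        mid = (lo + hi) / 2
        if is_feasible(mid):
            hi = mid          # mid works; try a smaller eps
        else:
            lo = mid          # mid infeasible; need a larger eps
    return hi

# Stand-in oracle: pretend the instance is feasible exactly when eps >= 0.3.
eps = smallest_epsilon(lambda e: e >= 0.3)
print(round(eps, 3))
```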


Inverse Unique-Optimal Alignment

  • When minimizing the scoring function f, we say an alignment A of a set of strings S is δ-unique, for some δ ≥ 0, if:

    f(B) ≥ (1 + δ) · f(A)

    for every alignment B of S other than A.


Inverse Unique-Optimal Alignment (Cont’d)

  • Definition (Inverse Unique-Optimal Alignment):

    INPUT: alignments Ai

    scoring function f

    real number δ ≥ 0

    OUTPUT: parameter values x

    GOAL: each alignment Ai be δ-unique under fx

    The largest possible δ can be found within accuracy κ using O(log(1/κ)) calls to the algorithm (binary search on δ).


Let There Be Linear Functions …

  • For most standard forms of alignment, the alignment scoring function f is a linear function of its parameters:

    fw(A) = w1·f1(A) + w2·f2(A) + … + wp·fp(A)

    where each fi measures one of the features of A.


Let There Be Linear Functions … (Example I)

  • With fixed substitution scores, and two parameters, the gap-open (γ) and gap-extension (λ) penalties, p = 2 and:

    f(A) = γ·g(A) + λ·l(A) + s(A)

    where:

    g(A) = number of gaps

    l(A) = total length of gaps

    s(A) = total score of all substitutions
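The three features can be read directly off a two-row alignment. A sketch, assuming placeholder match/mismatch values for the fixed substitution scores:

```python
def features(top: str, bottom: str, match=1, mismatch=-1):
    """Return (g, l, s) for a two-row alignment: number of gaps (maximal
    '-' runs, counted per row), total gap length, and total substitution
    score. The match/mismatch scores are placeholder values."""
    g = l = s = 0
    for row in (top, bottom):
        in_gap = False
        for ch in row:
            if ch == "-":
                l += 1
                if not in_gap:
                    g += 1
                in_gap = True
            else:
                in_gap = False
    for a, b in zip(top, bottom):
        if a != "-" and b != "-":
            s += match if a == b else mismatch
    return g, l, s

g, l, s = features("ACG--T", "A-GCCT")
print(g, l, s)             # 2 gaps, total gap length 3, substitution score 3
gamma, lam = 3, 1          # example gap-open and gap-extension penalties
print(gamma * g + lam * l + s)   # f(A) = gamma*g(A) + lambda*l(A) + s(A) = 12
```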


Let There Be Linear Functions … (Example II)

  • With no parameters fixed, the substitution scores are also among the parameters and:

    f(A) = γ·g(A) + λ·l(A) + Σa,b wab·hab(A)

    where:

    a and b range over all letters in the alphabet

    hab(A) = # of substitutions in A replacing a by b


Linear Programming Problem

  • INPUT: variables x = (x1, x2, …, xn)

    a system of linear inequalities in x

    a linear objective function in x

    OUTPUT: assignment of real values to x

    GOAL: satisfy all the inequalities and minimize the objective

    In general, the program can be infeasible, unbounded, or have a finite (bounded) optimum.
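For intuition, a tiny two-variable LP can be solved by brute-force vertex enumeration; this is only an illustration (real solvers use simplex or interior-point methods, and the reduction later in the talk relies on the ellipsoid method):

```python
from itertools import combinations

def solve_lp_2d(constraints, c):
    """Minimize c.x over {x : a.x >= b for each (a, b) in constraints},
    x in R^2, by intersecting pairs of constraint boundaries and keeping
    the best feasible intersection point."""
    best = None
    for (a1, b1), (a2, b2) in combinations(constraints, 2):
        det = a1[0] * a2[1] - a1[1] * a2[0]
        if abs(det) < 1e-12:
            continue  # parallel boundaries: no unique intersection
        x = ((b1 * a2[1] - b2 * a1[1]) / det,
             (a1[0] * b2 - a2[0] * b1) / det)
        if all(a[0] * x[0] + a[1] * x[1] >= b - 1e-9 for a, b in constraints):
            val = c[0] * x[0] + c[1] * x[1]
            if best is None or val < best[0]:
                best = (val, x)
    return best  # None: infeasible, or unbounded toward the objective

# minimize x + y  subject to  x >= 1, y >= 2
print(solve_lp_2d([((1, 0), 1), ((0, 1), 2)], (1, 1)))
```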


Reducing The Inverse Alignment Problems To Linear Programming

  • Inverse Optimal Alignment: For each Ai and every alignment B of the set Si of its strings, we have an inequality:

    fx(Ai) ≤ fx(B)

    or equivalently:

    fx(B) − fx(Ai) ≥ 0

    The number of alignments of a pair of strings of length n is exponential in n, hence we get exponentially many inequalities in p variables. Also, there is no specific objective function.
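The constraint count grows very quickly: counting alignments of two strings, where each column either aligns two characters or gaps one of them, gives the Delannoy recurrence. A sketch:

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def num_alignments(m: int, n: int) -> int:
    """Count alignments of strings of lengths m and n (Delannoy numbers):
    each final column is a substitution, or a gap in one of the strings."""
    if m == 0 or n == 0:
        return 1
    return (num_alignments(m - 1, n - 1)   # substitution/match column
            + num_alignments(m - 1, n)     # gap in the second string
            + num_alignments(m, n - 1))    # gap in the first string

# One inequality f_x(B) >= f_x(A_i) per alternative alignment B, so the
# naive LP for two length-n strings has num_alignments(n, n) - 1 constraints.
print([num_alignments(n, n) for n in range(1, 6)])  # [3, 13, 63, 321, 1683]
```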


Separation Theorem

  • Some definitions:

  • Polyhedron: intersection of half-spaces

  • Rational polyhedron: described by inequalities with only rational coefficients

  • Bounded polyhedron: no infinite rays


Separation Theorem (Cont’d)

  • Optimization Problem for a rational polyhedron P in Rp:

    INPUT: rational coefficients c specifying the objective

    OUTPUT: a point x in P minimizing c·x, or a determination that P is empty.

  • Separation Problem for P:

    INPUT: a point y in Rp

    OUTPUT: rational coefficients w and b such that w·x ≤ b for all points x in P but w·y > b (a violated inequality), or a determination that y is in P.


Separation Theorem (Cont’d)

  • Theorem (Equivalence of Separation and Optimization): The optimization problem on a bounded rational polyhedron can be solved in polynomial time if and only if the separation problem can be solved in polynomial time.

    That is, for bounded rational polyhedra:

    Optimization ≡ Separation


Cutting-Plane Algorithm

  • 1. Start with a small subset S of the set L of all inequalities

  • 2. Compute an optimal solution x under the constraints in S

  • 3. Call the separation algorithm for L on x

  • 4. If x is determined to satisfy L, output it and halt; otherwise, add the violated inequality to S and loop back to step (2)
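Steps (1)-(4) can be sketched as a loop. Here a deliberately trivial one-variable problem (minimize x subject to lower bounds x ≥ a) stands in for the real LP, and a linear scan of L stands in for the separation oracle:

```python
def cutting_plane(all_constraints):
    """Cutting-plane skeleton for: minimize x subject to x >= a for each
    a in all_constraints. (1) start with a small subset S, (2) optimize
    over S, (3) call the separation oracle, (4) add the cut and repeat."""
    S = [all_constraints[0]]                  # step (1): small subset of L
    calls = 0
    while True:
        x = max(S)                            # step (2): optimum under S
        calls += 1                            # step (3): separation oracle
        violated = next((a for a in all_constraints if x < a), None)
        if violated is None:                  # x satisfies all of L: done
            return x, calls
        S.append(violated)                    # step (4): add cut, loop back

x, calls = cutting_plane([1.0, 4.0, 2.5, 3.0])
print(x, calls)   # optimum 4.0; often far fewer than |L| cuts are needed
```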


Complexity of Inverse Alignment

  • Theorem:Inverse Optimal and Near-Optimal Alignment can be solved in polynomial time for any form of alignment in which:

    1. the alignment scoring function is linear

    2. the parameter values can be bounded

    3. for any fixed parameter choice, an optimal alignment can be found in polynomial time.

    Inverse Unique-Optimal Alignment can be solved in polynomial time if in addition:

    3’. for any fixed parameter choice, a next-best alignment can be found in polynomial time.


Application to Global Alignment

  • Initializing the Cutting-Plane Algorithm: We consider the problem in two cases:

  • All scores and penalties varying: Then the parameter space can be made bounded.

  • Substitution costs are fixed: Then either (1) a bounding inequality, or (2) two inequalities, one a downward half-space and the other an upward half-space, where the slope of the former is less than the slope of the latter, can be found in O(1) time, if they exist.


Application to Global Alignment (Cont’d)

  • Choosing an Objective Function: Again we consider two different cases:

  • Fixed substitution scores: in this case we choose the following objective:

  • Varying substitution scores: In this case we choose the following objective:

    where s is the minimum of all non-identity substitution scores and i is the maximum of all identity scores.


Application to Global Alignment (Cont’d)

  • For every objective, two extreme solutions exist: xlarge and xsmall. Then for every λ ∈ [0, 1] we have a corresponding solution:

    xλ = λ·xlarge + (1 − λ)·xsmall

    x1/2 is expected to generalize better to alignments outside the training set.





CONTRAlign

  • What: an extensible and fully automatic parameter-learning framework for protein pairwise sequence alignment

  • How: pair conditional random fields (pair-CRFs)

  • Who: Do, Gross, and Batzoglou (RECOMB 2006)



Pair-HMMs (Cont’d)

  • If the parameters w are log-probabilities, then:

    fw(A) = log P(A, x, y; w)

    where P(A, x, y; w) is the joint probability, under the pair-HMM, of the alignment A and the sequences x and y.


Training Pair-HMMs

  • INPUT: a set of training examples

  • OUTPUT: the feature vector w

  • METHOD: maximizing the joint log-likelihood of the data and alignments under constraints on w:

    w* = argmaxw Σi log P(a(i), x(i), y(i); w)


Generating Alignments Using Pair-HMMs

  • Viterbi Algorithm on a Pair-HMM:

    INPUT: two sequences x and y

    OUTPUT: the alignment a of x and y that maximizes P(a|x,y;w)

    RUNNING TIME: O(|x|·|y|)
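The Viterbi recurrence over a pair-HMM fills the same O(|x|·|y|) table as Needleman-Wunsch. A simplified sketch with a single score table, linear gap costs, and placeholder scores (a real pair-HMM Viterbi keeps one table per state and uses log transition/emission probabilities):

```python
def viterbi_like_align(x: str, y: str, match=1, mismatch=-1, gap=-1):
    """Best score of any global alignment of x and y, filled row by row
    in O(|x| * |y|) time; the three cases mirror the pair-HMM states."""
    m, n = len(x), len(y)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        dp[i][0] = i * gap          # x aligned entirely against gaps
    for j in range(1, n + 1):
        dp[0][j] = j * gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = match if x[i - 1] == y[j - 1] else mismatch
            dp[i][j] = max(dp[i - 1][j - 1] + sub,   # emit (x_i, y_j)
                           dp[i - 1][j] + gap,       # emit (x_i, -)
                           dp[i][j - 1] + gap)       # emit (-, y_j)
    return dp[m][n]

print(viterbi_like_align("AGGCTA", "AGCTA"))  # 5 matches, 1 gap -> 4
```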


Pair-CRFs

  • Directly model the conditional probabilities:

    P(a | x, y; w) = exp(w·F(a, x, y)) / Σa′ exp(w·F(a′, x, y))

    where w is a real-valued parameter vector not necessarily corresponding to log-probabilities


Training Pair-CRFs

  • INPUT: a set of training examples

  • OUTPUT: real-valued feature vector w

  • METHOD: maximizing the conditional log-likelihood of the data (discriminative/conditional learning):

    w* = argmaxw Σi log P(a(i) | x(i), y(i); w) − ||w||² / (2σ²)

    where the last term is a Gaussian prior on w, to prevent over-fitting.


Properties of Pair-CRFs

  • Far weaker independence assumptions than Pair-HMMs

  • Capable of utilizing complex non-independent feature sets

  • Directly optimize predictive ability, ignoring P(x, y), the model generating the input sequences


Choice of Model Topology in CONTRAlign

  • Some possible model topologies:

    CONTRAlignDouble-Affine :

    CONTRAlignLocal :


Choice of Feature Sets in CONTRAlign

  • Some possible feature sets to utilize:

    1. Hydropathy-based gap context features (CONTRAlignHYDROPATHY)

    2. External Information:

    2.1. Secondary structure (CONTRAlignDSSP)

    2.2. Solvent accessibility (CONTRAlignACCESSIBILITY)




Results: Alignment Accuracy in the “Twilight Zone”

For each conservation range, the uncolored bars give accuracies for MAFFT (L-INS-i), T-Coffee, CLUSTALW, MUSCLE, and PROBCONS (Bali) in that order, and the colored bar indicates the accuracy of CONTRAlign.


Questions?


Thank You!

