Composition Alignment

Gary Benson

Departments of Computer Science and Biology

Boston University

Outline of Talk
  • Sequence composition and composition match
  • Composition alignment algorithm
  • Composition match scoring functions
  • Growth of local composition alignment scores
  • Limiting the length of a composition match
  • Biological examples

Identify features in DNA sequences that are not accurately described by position specific patterns.

A position specific pattern, P, has the form:

$P = p_1 p_2 p_3$

where $p_i$ is either a single specific character or a choice (weighted or unweighted) of characters.

In DNA there are features that are characterized by composition rather than by position specific patterns.

Sequence Composition

Composition is a vector quantity describing the frequency of occurrence of each alphabet letter in a particular string. Let S be a string over Σ. Then,

$C(S) = (f_{\sigma_1}, f_{\sigma_2}, f_{\sigma_3}, \ldots, f_{\sigma_{|\Sigma|}})$

is the composition of S, where $f_{\sigma_i}$ is the fraction of the characters in S that are $\sigma_i$.

Composition Example


C(S) = ( 0.17, 0.28, 0.22, 0.33 )


Note that the order of letters is irrelevant as it has no effect on the composition.
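
The example string itself is not reproduced in this transcript. As a minimal Python sketch of the definition, the string below is a hypothetical stand-in chosen to give the same rounded vector, and the alphabet order A, C, G, T is an assumption. The input is assumed to be non-empty.

```python
from collections import Counter

def composition(s, alphabet="ACGT"):
    """Return the fraction of characters in s equal to each alphabet letter."""
    counts = Counter(s)
    return tuple(counts[a] / len(s) for a in alphabet)

# Hypothetical string (not the one from the slide): 3 A, 5 C, 4 G, 6 T.
s = "AAACCCCCGGGGTTTTTT"
print(tuple(round(f, 2) for f in composition(s)))  # (0.17, 0.28, 0.22, 0.33)

# Reordering the letters leaves the composition unchanged.
print(composition(s[::-1]) == composition(s))      # True
```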

Composition and Sequence Features
  • Isochores – Multi-megabase, specifically GC-rich or GC-poor. GC-rich isochores have greater gene density.
  • CpG Islands – Several hundred nucleotides, rich in the dinucleotide CG, which is underrepresented in eukaryotic genomes. Methylation of the cytosine (C) in these dinucleotides affects gene expression.
  • Protein binding regions – Tens of nucleotides, dinucleotide composition contributes to DNA flexibility, allowing the helix to change shape during protein binding.
Composition Match

We hope to identify common features in sequences using a new alignment algorithm. The main new idea is the use of composition matching.

Two strings, S and T, have a composition match if their lengths are equal and C(S) = C(T).

For example, S and T below have a composition match:
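
The slide's example strings did not survive in this transcript. As a stand-in, the test can be sketched in Python with hypothetical strings. Since the lengths are required to be equal, comparing letter counts is equivalent to comparing compositions; the function name is an assumption.

```python
from collections import Counter

def composition_match(s, t):
    """True if s and t have equal length and the same letter counts."""
    return len(s) == len(t) and Counter(s) == Counter(t)

# Hypothetical strings, not taken from the slide.
print(composition_match("ACGTTG", "GTTACG"))  # True
print(composition_match("ACGT", "ACGG"))      # False
```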



Composition Alignment Problem

Given: Two sequences, S and T, of lengths m and n, over an alphabet Σ, and a scoring function cm(s, t) for the score of a composition match between substrings s and t.

Find: The best scoring alignment (global or local) of S with T such that the allowed scoring options include composition match between substrings of S and T as well as the standard options of 1) single character match, 2) single character mismatch, 3) insertion and deletion.

Example of composition alignment





| |<-> | <--->


Related Work
  • Alignment allowing adjacent letter swap.

O(nm), Lowrance and Wagner (1975)

  • All swapped matchings of a pattern in a text.

O(n m^(1/3) log m log |Σ|), Amir, Aumann, Landau, Lewenstein, Lewenstein (2000)

O(n log m log |Σ|), Amir, Cole, Hariharan, Lewenstein, Porat (2001)

  • Composition naming

O(n log m log |Σ|), Amir, Apostolico, Landau, Satta (2003)

Composition Alignment using Dynamic Programming

Given two sequences, S and T, the best alignment of the prefix strings

$S[1, i] = s_1 \ldots s_i$

$T[1, j] = t_1 \ldots t_j$

ends in one of four ways:

  • mismatch,
  • insertion,
  • deletion, or
  • composition match
Ways an Alignment Can End

Mismatch:
S: C G T
T: C G A

Insertion or deletion:
S: C A T
T: C A -

S: C A -
T: C A A

Composition match:
X: C G T A C
Y: C G C T A

Note that the suffixes in a composition match will have a length l where

1 ≤ l ≤ min(i, j, limit)
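
The recurrence implied by these cases can be sketched in Python as follows. This is a naive illustration under stated assumptions, not the talk's implementation: the function name, the default match, mismatch and gap scores, and the linear scan over suffix lengths are all choices made for the example. A single-character match is scored as its own option, as in the problem statement.

```python
from collections import Counter

def composition_alignment_score(S, T, cm, limit, match=1, mismatch=-1, gap=-1):
    """Global composition alignment score (naive dynamic programming sketch).

    cm(l) scores a composition match of length l; limit caps that length.
    The match/mismatch/gap defaults are illustrative assumptions.
    """
    m, n = len(S), len(T)
    NEG = float("-inf")
    D = [[NEG] * (n + 1) for _ in range(m + 1)]
    D[0][0] = 0
    for i in range(m + 1):
        for j in range(n + 1):
            if i == 0 and j == 0:
                continue
            best = NEG
            if i > 0:                      # character of S against a gap
                best = max(best, D[i - 1][j] + gap)
            if j > 0:                      # character of T against a gap
                best = max(best, D[i][j - 1] + gap)
            if i > 0 and j > 0:
                # single-character match or mismatch
                sub = match if S[i - 1] == T[j - 1] else mismatch
                best = max(best, D[i - 1][j - 1] + sub)
                # composition match of suffixes of length l, 1 <= l <= min(i, j, limit)
                diff = Counter()
                for l in range(1, min(i, j, limit) + 1):
                    diff[S[i - l]] += 1
                    diff[T[j - l]] -= 1
                    if not any(diff.values()):   # all letter counts agree
                        best = max(best, D[i - l][j - l] + cm(l))
            D[i][j] = best
    return D[m][n]

# The composition match example above, with cm(k) = k (c = 1).
print(composition_alignment_score("CGTAC", "CGCTA", lambda k: k, limit=3))  # 5
print(composition_alignment_score("CGTAC", "CGCTA", lambda k: k, limit=1))  # 2
```

With limit=3 the suffixes TAC and CTA are scored as one composition match; with limit=1 only single-character matches remain, so the score falls back to that of a standard alignment.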

Time Complexity

Computing the optimal composition alignment with dynamic programming is similar to standard alignment, except for the composition match scoring option. There are nm (i, j) pairs, so the overall time complexity is

O(nmZ)

where Z is the time required per (i, j) pair to find the best length l for the composition match.

Computing length of the shortest composition match

Our goal here is to start with two strings, S and T, of equal length, and for each prefix pair S[1, k], T[1, k], find the length of the shortest suffixes that have a composition match.


For example, let



the table states that for k = 6, the shortest suffixes which have a composition match have length = 3:



Composition difference

We find the matching suffix lengths using composition difference, a vector quantity for two strings x and y:

$CD(x, y) = (c_{\sigma_1}, \ldots, c_{\sigma_{|\Sigma|}})$

where $c_{\sigma_i}$ is the difference between the number of times $\sigma_i$ occurs in x and in y.

Using composition difference

Key observation: two identical composition differences at prefix lengths k and g indicate a composition match of length k – g.
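
The observation follows because the composition difference of the substrings between positions g and k equals the difference of the two prefix vectors, which is zero when they are identical. A small Python check, using hypothetical strings and an assumed alphabet order A, C, G, T:

```python
def composition_difference(x, y, alphabet="ACGT"):
    """Per-letter count of x minus per-letter count of y."""
    return tuple(x.count(a) - y.count(a) for a in alphabet)

# Hypothetical strings, not taken from the slides.
S, T = "AGTCA", "CTGAC"
g, k = 1, 3

cd_g = composition_difference(S[:g], T[:g])
cd_k = composition_difference(S[:k], T[:k])
print(cd_g, cd_k)               # (1, -1, 0, 0) (1, -1, 0, 0)

# Identical differences at g and k: S[g:k] and T[g:k] match in composition.
print(S[g:k], T[g:k])           # GT TG
print(composition_difference(S[g:k], T[g:k]))  # (0, 0, 0, 0)
```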

Sorting to find shortest composition matches

Sort on composition difference using stable sort. Adjacent tuples with the same composition difference identify shortest composition matches.
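
A minimal Python sketch of this sorting step for one pair of equal-length strings follows. The function name, the tuple layout and the inclusion of prefix length 0 (the all-zero difference) are assumptions made for the illustration. Python's sort is stable, so tuples with equal differences stay in increasing order of prefix length.

```python
def shortest_match_lengths(S, T, alphabet="ACGT"):
    """For each k = 1..n, the length of the shortest suffixes of S[:k] and
    T[:k] that have a composition match, or None if there are none."""
    assert len(S) == len(T)
    index = {a: p for p, a in enumerate(alphabet)}
    cd = [0] * len(alphabet)
    tuples = [(tuple(cd), 0)]            # prefix length 0
    for k, (a, b) in enumerate(zip(S, T), start=1):
        cd[index[a]] += 1
        cd[index[b]] -= 1
        tuples.append((tuple(cd), k))
    tuples.sort(key=lambda t: t[0])      # stable sort on the difference only
    result = [None] * (len(S) + 1)
    for (cd_g, g), (cd_k, k) in zip(tuples, tuples[1:]):
        if cd_g == cd_k:                 # adjacent and equal: nearest earlier g
            result[k] = k - g
    return result[1:]

# Hypothetical strings, not taken from the slides.
print(shortest_match_lengths("AGTCA", "CTGAC"))  # [None, None, 2, 4, 2]
```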

Time complexity for composition matches

O(nm|Σ|) to find the shortest composition match lengths for all index pairs of two strings of lengths n and m.

In our work, |Σ| is a small constant (4 for DNA, 16 for dinucleotides). For larger alphabets, the method of Amir, Apostolico, Landau and Satta (2003) can be used.

Composition match scoring functions

We have explored:

Functions based on match length, k:

  • Function 1: cm(k) = ck
  • Function 2: cm(k) = c√k

where c is a constant.

Functions based on substring composition:

  • Function 4: cm(C, B, k) = ck · H(C,B)

where H is the relative entropy function, C is the composition of the matching substrings and B is a background composition.
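
The three functions listed here can be sketched in Python as below. The function names and default c are hypothetical. The relative entropy is taken as the sum over letters of C_i log(C_i / B_i); the base-2 logarithm is an assumption, since the slides do not state a base, and the background B is assumed to be nonzero wherever C is nonzero.

```python
import math

def cm_linear(k, c=1.0):
    """Function 1: cm(k) = c * k."""
    return c * k

def cm_sqrt(k, c=1.0):
    """Function 2: cm(k) = c * sqrt(k)."""
    return c * math.sqrt(k)

def relative_entropy(C, B):
    """H(C, B) = sum_i C_i * log2(C_i / B_i); terms with C_i = 0 contribute 0."""
    return sum(p * math.log2(p / q) for p, q in zip(C, B) if p > 0)

def cm_entropy(C, B, k, c=1.0):
    """Function 4: cm(C, B, k) = c * k * H(C, B)."""
    return c * k * relative_entropy(C, B)

uniform = (0.25, 0.25, 0.25, 0.25)
print(cm_entropy(uniform, uniform, 10))               # 0.0
print(cm_entropy((0.5, 0.5, 0.0, 0.0), uniform, 10))  # 10.0
```

Under this scoring, a match whose composition equals the background earns nothing, while a strongly skewed composition earns more per matched position.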

Additive and subadditive scoring functions

The functions based on length are additive or subadditive:

cm(i + j) ≤ cm(i) + cm(j)

Lemma: For additive or subadditive composition match scoring functions, any best scoring alignment is equivalent in score to an alignment which contains only shortest composition matches.

Theorem: Composition alignment with additive or subadditive match scoring functions and finite alphabet has time complexity O(nm).
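
As a numerical illustration of the inequality (a spot check over small lengths, not a proof), the Python sketch below tests cm(i + j) ≤ cm(i) + cm(j) for the two length-based functions. The helper name, the range of lengths, the constant c = 2 and the floating-point tolerance are assumptions.

```python
import math

def is_subadditive(cm, max_len=50, tol=1e-12):
    """Check cm(i + j) <= cm(i) + cm(j) for all 1 <= i, j <= max_len."""
    return all(cm(i + j) <= cm(i) + cm(j) + tol
               for i in range(1, max_len + 1)
               for j in range(1, max_len + 1))

print(is_subadditive(lambda k: 2.0 * k))             # True (additive)
print(is_subadditive(lambda k: 2.0 * math.sqrt(k)))  # True (subadditive)
print(is_subadditive(lambda k: k * k))               # False (counterexample)
```

When the inequality holds, splitting a composition match into consecutive shorter matches never lowers the score, which is why only shortest matches need to be considered.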

The limit parameter

Intuitively, allowing scrambled letters to match should increase the amount of matching between sequences. If too much matching occurs, alignments will not be meaningful.

The limit parameter is an upper bound on the length l of the longest single composition match, used to prevent excessive matching.

Sequence length = 100, randomly generated

Limit values for DNA
  • Function 1: cm(k) = ck: Limit ≤ 3.
  • Function 2: cm(k) = c√k: Limit ≤ 10.
  • Function 4: cm(C, B, k) = ck · H(C, B): Limit ≤ 50.

Biological examples

Composition alignment was tested on a set of 1796 promoter sequences from the Eukaryotic Promoter Database. Each sequence is 600 nucleotides long, 500 bases upstream and 100 downstream of the transcription initiation site.

Two local alignment scores were produced for each pair using function 1: W from composition alignment and S from standard alignment. The examples shown have statistically significant W, with W ≥ 3 · S to exclude pairs that already have good standard alignments.

Example 1

Composition alignment and standard alignment of the same two promoters. Standard alignment is not statistically significant. Sequences are characteristic of CpG islands.

Composition Alignment:


<->||||<>|<>||<>| ||||<>||<> |<-> |||||| <>|<> ||||<><> |<>| ||<->||


Standard Alignment:



Example 2

Composition alignment of two promoter sequences.

Composition changes at vertical line.


Left: (0.01, 0.61, 0.30, 0.08)

Right: (0.19, 0.16, 0.56, 0.09)


<->|<><>|||| <>|||||| ||<->|<>||||| <>|||| |||| || ||<-> | |<><>|<-> | |<>|<>|<>||||<-><->|




  • define a new alignment problem based on composition matching and test several scoring functions
  • show how to find all-pairs shortest composition match lengths in linear time per pair for a fixed alphabet
  • show that alignment using scoring functions based only on match length requires finding only the shortest composition matches
  • give biological examples where composition alignment finds statistically (and functionally) significant sequence similarity in the absence of significant standard alignments