Composition Alignment


# Composition Alignment - PowerPoint PPT Presentation

Presentation Transcript
Composition Alignment

Gary Benson

Departments of Computer Science and Biology

Boston University


Outline of Talk
• Sequence composition and composition match
• Composition alignment algorithm
• Composition match scoring functions
• Growth of local composition alignment scores
• Limiting the length of a composition match
• Biological examples
Goal

Identify features in DNA sequences that are not accurately described by position specific patterns.

A position specific pattern, P, has the form:

P = p1 p2 p3 ...pk

where pi is either a single specific character or a choice (weighted or unweighted) of characters.

In DNA there are features that are characterized by composition rather than by position specific patterns.

Sequence Composition

Composition is a vector quantity describing the frequency of occurrence of each alphabet letter in a particular string. Let S be a string over an alphabet Σ = {σ_1, …, σ_|Σ|}. Then,

C(S) = (f_σ1, f_σ2, f_σ3, …, f_σ|Σ|)

is the composition of S, where f_σi is the fraction of the characters in S that are σ_i.

Composition Example

S = ACTGTACCTGGCGCTATT

C(S) = (0.17, 0.28, 0.22, 0.33), with the letters ordered (A, C, G, T)

Note that the order of letters is irrelevant as it has no effect on the composition.
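
The definition above can be sketched in a few lines of Python. This is only an illustration: the function name `composition` and the default alphabet order "ACGT" are assumptions made here, and the input string is assumed to be non-empty.

```python
from collections import Counter

def composition(s, alphabet="ACGT"):
    """Return the fraction of characters in s equal to each alphabet letter.

    The function name and the default letter order are illustrative
    assumptions; s is assumed to be non-empty.
    """
    counts = Counter(s)
    return tuple(counts[ch] / len(s) for ch in alphabet)

S = "ACTGTACCTGGCGCTATT"
print([round(f, 2) for f in composition(S)])  # [0.17, 0.28, 0.22, 0.33]
```

Because only letter counts are used, any permutation of S gives the same vector.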

Composition and Sequence Features
• Isochores – Multi-megabase, specifically GC-rich or GC-poor. GC-rich isochores have greater gene density.
• CpG Islands – Several hundred nucleotides, rich in the dinucleotide CG, which is underrepresented in eukaryotic genomes. Methylation of the cytosine (C) in these dinucleotides affects gene expression.
• Protein binding regions – Tens of nucleotides, dinucleotide composition contributes to DNA flexibility, allowing the helix to change shape during protein binding.
Composition Match

We hope to identify common features in sequences using a new alignment algorithm. The main new idea is the use of composition matching.

Two strings, S and T, have a composition match if their lengths are equal and C(S) = C(T).

For example, S and T below have a composition match:

S = ACTGTACCTGGCGCTATT

T = AAACCCCCGGGGTTTTTT
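
A composition match can be tested by comparing letter counts, since for equal-length strings equal counts and equal fractions are the same condition. The sketch below assumes nothing beyond the definition; the function name `composition_match` is hypothetical.

```python
from collections import Counter

def composition_match(s, t):
    """True if s and t have equal length and the same letter counts.

    For equal-length strings, equal counts imply C(s) == C(t).
    The function name is an illustrative assumption.
    """
    return len(s) == len(t) and Counter(s) == Counter(t)

S = "ACTGTACCTGGCGCTATT"
T = "AAACCCCCGGGGTTTTTT"
print(composition_match(S, T))  # True
```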

Composition Alignment Problem

Given: Two sequences, S and T, of lengths m and n, over an alphabet Σ, and a scoring function cm(s, t) for the score of a composition match between substrings s and t.

Find: The best scoring alignment (global or local) of S with T such that the allowed scoring options include composition match between substrings of S and T as well as the standard options of 1) single character match, 2) single character mismatch, 3) insertion and deletion.

Example of composition alignment

S = AACGTCTTTGAGCTC

T = AGCCTGACTGCCTA

Alignment

AACGTCTTTGAGCTC
| |<->  | <--->
AGCCTGACT-GCCTA

Related Work
• Alignment allowing adjacent letter swap.

O(nm), Lowrance and Wagner (1975)

• All swapped matchings of a pattern in a text.

O(n m^(1/3) log m log |Σ|), Amir, Aumann, Landau, Lewenstein, Lewenstein (2000)

O(n log m log |Σ|), Amir, Cole, Hariharan, Lewenstein, Porat (2001)

• Composition naming

O(n log m log |Σ|), Amir, Apostolico, Landau, Satta (2003)

Composition Alignment using Dynamic Programming

Given two sequences, S and T, the best alignment of the prefix strings

S[1, i] = s_1 … s_i

T[1, j] = t_1 … t_j

ends in one of four ways:

• mismatch,
• insertion,
• deletion, or
• composition match
Ways an Alignment Can End

mismatch

S: C G T

T: C G A

insertion or deletion

S: C A T

T: C A -

S: C A -

T: C A A

composition match

X: C G T A C

Y: C G C T A

Note that the matching suffixes in a composition match have a common length l, where

1 ≤ l ≤ min(i, j, limit)

Time Complexity

Computing the optimal composition alignment with dynamic programming is similar to standard alignment, except for the composition match scoring option. The overall time complexity is

O(nmZ)

where Z is the time required per (i, j) pair to find the best length l for the composition match.
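
The recurrence described above can be sketched as a naive global alignment in Python. Everything specific here is an assumption for illustration: the function name, the match, mismatch and gap scores, and the choice to scan every suffix length l up to `limit` at each cell, which costs O(nm · limit · |Σ|) and is not the faster method discussed later in the talk.

```python
from collections import Counter

def composition_alignment_score(S, T, cm, limit, match=2, mismatch=-1, gap=-2):
    """Global composition alignment score (naive sketch).

    cm(l) scores a composition match of length l. The default scores
    are hypothetical values chosen only for this example.
    """
    m, n = len(S), len(T)
    # A[i][j] = best score aligning the prefixes S[:i] and T[:j]
    A = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        A[i][0] = i * gap
    for j in range(1, n + 1):
        A[0][j] = j * gap
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = match if S[i - 1] == T[j - 1] else mismatch
            best = max(A[i - 1][j - 1] + sub,   # match or mismatch
                       A[i - 1][j] + gap,       # gap in T
                       A[i][j - 1] + gap)       # gap in S
            # Grow both suffixes one letter at a time and track the
            # difference in letter counts; all zeros means a match.
            diff = Counter()
            for l in range(1, min(i, j, limit) + 1):
                diff[S[i - l]] += 1
                diff[T[j - l]] -= 1
                if not any(diff.values()):
                    best = max(best, A[i - l][j - l] + cm(l))
            A[i][j] = best
    return A[m][n]

print(composition_alignment_score("GTC", "CTG", cm=lambda k: k, limit=3))  # 3
```

With `limit=1` the same call returns 0, because the length-3 composition match of GTC with CTG is no longer allowed.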

Computing length of the shortest composition match

Our goal here is to start with two strings, S and T, of equal length, and for each prefix pair S[1, k], T[1, k], find the length of the shortest suffixes that have a composition match.

For example, let

S = AACGTCTTTGAGCT

T = AGCCTGACTGCCTA

the table states that for k = 6, the shortest suffixes which have a composition match have length = 3:

S = AACGTC...

T = AGCCTG...

Composition difference

We find the matching suffix lengths using composition difference, a vector quantity for two strings x and y:

CD(x, y) = (c_σ1, …, c_σ|Σ|)

where c_σi is the difference between the number of times σ_i occurs in x and in y.

Using composition difference

Key observation: two identical composition differences at prefix lengths k and g, with g < k, indicate a composition match of length k – g ending at position k.

Sorting to find shortest composition matches

Sort on composition difference using stable sort. Adjacent tuples with the same composition difference identify shortest composition matches.
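
As a rough illustration of the key observation, the sketch below computes the shortest matching suffix length for every prefix pair of two equal-length strings. Note the substitution: instead of the stable sort described on the slide, it keeps a dictionary of the most recent prefix length at which each composition difference was seen, which finds the same shortest matches. The function name and the "ACGT" alphabet are assumptions.

```python
def shortest_suffix_matches(S, T, alphabet="ACGT"):
    """shortest[k] = length of the shortest composition-matching suffixes
    of S[:k] and T[:k], or None if no suffix pair matches.

    Uses a hash map in place of the stable sort from the talk.
    """
    if len(S) != len(T):
        raise ValueError("S and T must have equal length")
    index = {ch: i for i, ch in enumerate(alphabet)}
    diff = [0] * len(alphabet)      # composition difference CD(S[:k], T[:k])
    last_seen = {tuple(diff): 0}    # difference vector -> latest prefix length
    shortest = [None] * (len(S) + 1)
    for k in range(1, len(S) + 1):
        diff[index[S[k - 1]]] += 1
        diff[index[T[k - 1]]] -= 1
        key = tuple(diff)
        if key in last_seen:
            # Same difference at prefix lengths g and k: match of length k - g.
            shortest[k] = k - last_seen[key]
        last_seen[key] = k
    return shortest

S = "AACGTCTTTGAGCT"
T = "AGCCTGACTGCCTA"
print(shortest_suffix_matches(S, T)[6])  # 3
```

The value for k = 6 agrees with the example above, where the suffixes GTC and CTG match.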

Time complexity for composition matches

O(nm|Σ|) time to find the shortest composition match lengths for all index pairs (i, j) of two strings of lengths n and m.

In our work, |Σ| is a small constant (4 for DNA, 16 for dinucleotides). For larger alphabets, the method of Amir, Apostolico, Landau and Satta (2003) can be used.
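
One way to reach the all-pairs bound is to repeat the prefix computation along each diagonal of the alignment matrix, since all pairs (i, j) with the same i – j extend the same pair of aligned strings. The sketch below assumes this diagonal decomposition and uses a dictionary rather than sorting; the function name and alphabet are hypothetical.

```python
def all_pairs_shortest_matches(S, T, alphabet="ACGT"):
    """L[i][j] = length of the shortest composition-matching suffixes of
    S[:i] and T[:j], or None. Runs in O(nm|Σ|) expected time.
    """
    m, n = len(S), len(T)
    index = {ch: c for c, ch in enumerate(alphabet)}
    L = [[None] * (n + 1) for _ in range(m + 1)]
    for d in range(-(n - 1), m):            # d = i - j identifies a diagonal
        i, j = max(d, 0), max(-d, 0)        # starting cell of this diagonal
        diff = [0] * len(alphabet)
        last_seen = {tuple(diff): i}        # difference vector -> latest i
        while i < m and j < n:
            diff[index[S[i]]] += 1
            diff[index[T[j]]] -= 1
            i += 1
            j += 1
            key = tuple(diff)
            if key in last_seen:
                L[i][j] = i - last_seen[key]
            last_seen[key] = i
    return L

L = all_pairs_shortest_matches("AACGTCTTTGAGCTC", "AGCCTGACTGCCTA")
print(L[6][6], L[15][14])  # 3 5
```

The two printed values correspond to the GTC/CTG and AGCTC/GCCTA matches in the example alignment.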

Composition match scoring functions

We have explored:

Functions based on match length, k:

• Function 1: cm(k) = ck
• Function 2: cm(k) = c√k

where c is a constant.

Functions based on substring composition:

• Function 4: cm(C, B, k) = ck · H(C,B)

where H is the relative entropy function, C is the composition of the matching substrings and B is a background composition.

The functions based on length are additive or subadditive:

cm(i + j) ≤ cm(i) + cm(j)
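
The scoring functions above might be written as follows. Several details are assumptions not fixed by the slides: the function names, the default c = 1, the use of base-2 logarithms in the relative entropy, and the convention that letters with zero frequency in C contribute nothing. The background B is assumed to be nonzero wherever C is nonzero.

```python
import math

def cm_linear(k, c=1.0):
    """Function 1: cm(k) = c * k."""
    return c * k

def cm_sqrt(k, c=1.0):
    """Function 2: cm(k) = c * sqrt(k)."""
    return c * math.sqrt(k)

def relative_entropy(C, B):
    """H(C, B) = sum of C_i * log2(C_i / B_i); base 2 is an assumption."""
    return sum(p * math.log2(p / q) for p, q in zip(C, B) if p > 0)

def cm_entropy(C, B, k, c=1.0):
    """Function 4: cm(C, B, k) = c * k * H(C, B)."""
    return c * k * relative_entropy(C, B)

# Numerical spot check of subadditivity for Function 2 on small lengths.
ok = all(cm_sqrt(i + j) <= cm_sqrt(i) + cm_sqrt(j)
         for i in range(1, 50) for j in range(1, 50))
print(ok)  # True
```

Function 1 satisfies the inequality with equality, which is the additive case.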

Lemma: For additive or subadditive composition match scoring functions, any best scoring alignment is equivalent in score to an alignment which contains only shortest composition matches.

Theorem: Composition alignment with additive or subadditive match scoring functions and finite alphabet has time complexity O(nm).

The limit parameter

Intuitively, allowing scrambled letters to match should increase the amount of matching between sequences. If too much matching occurs, alignments will not be meaningful.

The limit parameter is an upper bound on the length l of the longest single composition match, used to prevent excessive matching.

Sequence length = 100, randomly generated

Limit values for DNA
• Function 1: cm(k) = ck: Limit ≤ 3.
• Function 2: cm(k) = c√k: Limit ≤ 10.
• Function 4: cm(C, B, k) = ck · H(C, B): Limit ≤ 50.

Biological examples

Composition alignment was tested on a set of 1796 promoter sequences from the Eukaryotic Promoter Database. Each sequence is 600 nucleotides long, 500 bases upstream and 100 downstream of the transcription initiation site.

Two local alignment scores were produced using function 1, W using composition alignment and S using standard alignment. The examples shown have statistically significant W with W ≥ 3 · S to exclude good standard alignments.

Example 1

Composition alignment and standard alignment of the same two promoters. Standard alignment is not statistically significant. Sequences are characteristic of CpG islands.

Composition Alignment:

GCCCGCCCGCCGCGCTCCCGCCCGCCGCTCTCCGTGGCCC-CGCCG-CGCTGCCGCCGCCGCCGCTGC

<->||||<>|<>||<>| ||||<>||<> |<-> |||||| <>|<> ||||<><> |<>| ||<->||

CCGCGCCGCCGCCGTCCGCGCCGCCCCG-CCCT-TGGCCCAGCCGCTCGCTCGGCTCCGCTCCCTGGC

Standard Alignment:

CGCCGCCGCCG

CGCCGCCGCCG

Example 2

Composition alignment of two promoter sequences.

Composition changes at vertical line.

Composition, with the letters ordered (A, C, G, T):

Left: (0.01, 0.61, 0.30, 0.08)

Right: (0.19, 0.16, 0.56, 0.09)

GCCCCGCGCCCCGCGCCCCGCGCCCCGCGCGCCTC-CGCCCGCCCCT-GCTCCGGC---C-TTGCGCCTGC-GCACAGTGGGATGCGCGGGGAG

<->|<><>|||| <>|||||| ||<->|<>||||| <>|||| |||| || ||<-> | |<><>|<-> | |<>|<>|<>||||<-><->|

CCGCGCGCCCCC-GCCCCCGCCCCGCCCCGGCCTCGGCCCCGGCCCTGGC-CCCGGGGGCAGTCGCGCCTGTG-AACGGTGAGTGCGGGCAGGG

Conclusion

We

• define a new alignment problem based on composition matching and test several scoring functions
• show how to find all-pairs shortest composition match lengths in linear time per pair for a fixed alphabet
• show that alignment using scoring functions based only on match length requires finding only the shortest composition matches
• give biological examples where composition alignment finds statistically (and functionally) significant sequence similarity in the absence of significant standard alignments