Loading in 5 sec....

Finding Regulatory Signals in Genomic SequencesPowerPoint Presentation

Finding Regulatory Signals in Genomic Sequences

- By
**avery** - Follow User

- 184 Views
- Updated On :

Finding Regulatory Signals in Genomic Sequences. Weeder ProFind. Giancarlo Mauri. Bioinformatics and Natural Computing Group Dipartimento di Informatica, Sistemistica e Comunicazione Università degli Studi di Milano-Bicocca. Gene Expression Data.

Related searches for Finding Regulatory Signals in Genomic Sequences

Download Presentation
## PowerPoint Slideshow about 'Finding Regulatory Signals in Genomic Sequences' - avery

**An Image/Link below is provided (as is) to download presentation**

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.

- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -

Presentation Transcript

### Finding Regulatory Signals inGenomic Sequences

### Weeder : A tool for pattern discovery inGenomic Sequences

### ProFind : A GA Approach to the Definitionof Regulatory Signals inGenomic Sequences

### SeQuAl described experimentally, as in the case of fruit fly signals

Weeder

ProFind

Giancarlo Mauri

Bioinformatics and Natural Computing Group

Dipartimento di Informatica, Sistemistica e Comunicazione

Università degli Studi di Milano-Bicocca

Gene Expression Data

- When and how much a gene is expressed under some given conditions (tissue, external stimuli, disease...)
- We can group genes according to their expression profile
- We can suspect a “common cause” for their expression

Transcription Factors

- The expression of a gene starts with transcription from DNA to RNA
- Transcription is modulated by dedicated proteins called transcription factors(TFs)
- TFs bind to DNA in the regions surrounding the starting site of the gene (mostly upstream), and direct polymerases to the “right spot” to start transcription
- Different effects: may enhance or block transcription

TFs Binding Sites (TFBSs)

Fundamental in regulatory analysis is the identification of potential TFBSs

- Bound by transcription factors
- Short degenerate sequences, 5-16 nucleotides long, (gaps possible but rare)
- Each TF does not recognize a single fragment but a set of them (similar to each other) called signal or motif
- Can be illustrated by profiles and/or consensi (computational models)

TFs Binding Sites (TFBSs)

Fundamental in regulatory analysis is the identification of potential TFBSs

- Bound by transcription factors
- Short degenerate sequences, 5-16 nucleotides long, (gaps possible but rare)
- Each TF does not recognize a single fragment but a set of them (similar to each other) called signal or motif
- Can be illustrated by profiles and/or consensi (computational models)

Finding TFBSs

- We have a set of related genes:
- similar expression profile
- similar biological function
- anything else..

- We take their upstream regulatory regions
- If they are regulated by the same TF(s), then we should find its (their) binding sites in the sequences
- We should find short patterns conserved in the sequences
- We could use the detected TFBS to predict the behavior of a gene

Finding Novel TFBSs

- Over-representation:
- First, detect groups of similar oligos
- Describe each group with a consensus or a profile (or in some other smart way)
- Find the most over-represented groups

If sequences were built at random and/or

we picked sequences at random,

the group should not appear with

the same size/conservation

Finding Novel TFBSs

- Most of early research has focused on the first point: how to detect the best groups (unfortunately, there are thousands of candidates) given simple score measures
- Recent research has followed the second point: which is the best measure to tell “significant” groups from random similarities?
- Is it expected or not, to find a group that is conserved?
- Can we take advantage from the wealth of sequence data available?

Giancarlo Mauri*

Giulio Pavesi*

Graziano Pesole^

* Università degli Studi di Milano-Bicocca

^ Università degli Studi di Milano

References

- G. Pavesi, G.Mauri, G.Pesole. An Algorithm for Finding Signals of Unknown Length in Unaligned DNA Sequences. Bioinformatics 17, S207-S214, 2001
- G.Pavesi, P.Mereghetti, G.Mauri, G.Pesole. WeederWeb: Discovery of Transcription Factor Binding Sites in a Set of Sequences from Co-Regulated Genes. Nucleic Acids Research Web Server Issue 2004, 32: W199-W203
- http://159.149.109.16:8080/weederWeb/

Weeder (2001)

- Idea: instead of reducing the set of candidate patterns, reduce the set of possible matches for each pattern, trying to save a “significant” number of valid occurrences
- Instead of searching exhaustively for patterns that occur in every sequence, we “short-sightedly” look for patterns that occur in a subset of them
- The algorithm needs as input only a given error ratio e

Suffix Trees

- A suffix tree is a data structure that exposes the internal structure of a string in a very deep and meaningful way
- Suffix tree T for S = s1…sn
- rooted directed tree
- exactly n leaves numbered 1 to n
- internal nodes with at least two sons
- edges labeled by non empty substrings of S
- labels out of the same node begin with different symbols
- the concatenation of the edge labels on the path from the root to any leaf i exactly spells the suffix of S starting at position i, i.e., s1…sn

Suffix Trees

- The same structure can be built also for a set of k sequences
- To distinguish which sequence a suffix belongs to, it appends a different marker symbol, not occurring elsewhere, to each sequence in the set.
- It is also possible to annotate each node of the tree with a k-bit string, where the i-th bit is set if the word spelled by the path ending at the node occurs in the i-th sequence.

#

C

A

C

C

$

A

C

A

A

G

$

A

A

$

G

#

G

$

A

#

G

#

#

Suffix TreesSuffix tree for ACCA (end with $) and CCAAG (end with #)

Suffix Trees

- A generalized suffix tree can be built in O(N) time and takes O(N) space, where N is the overall length of the sequences
- Annotating it with the bit strings takes additional O(kN) time
- Each pattern occurring in the strings is spelled by a path starting from the root of the tree
- The time needed to search for a pattern depends only on the length of the pattern
- The structure allows to implement recursively the exhaustive enumeration of all the candidate patterns of a given length
- The time complexity is thus reduced to be exponential in the maximum number of mutations allowed (Sagot, 1998)

Searching for an Exact Pattern

- Given a set of sequences and the annotated suffix tree, every pattern appearing in at least one sequence of the set is spelled by a unique path starting from the root
- We match the symbols of pattern p along the unique path in the tree until
- p is exhausted
- In this case, the bit string on the next node on the path specifies which sequences p appears in

- no more matches are possible

- p is exhausted

Searching for a Pattern with Mismatches

- We can also search for a pattern p with at most e mismatches in a similar way.
- We match p along different paths on the tree at the same time, keeping track of the number of mismatches encountered on each path.
- Whenever the number of errors on a path is greater than e, we discard that path.
- The sequences p appears in are given by the logical-OR of the bit strings corresponding to the different paths.

Searching for (M, e) Patterns

- The algorithm starts with the empty pattern from the root of the tree, and recursively expands it
- Let us suppose we have found on the tree the endpoints of paths corresponding to the occurrences of a pattern p=p1…pm in the sequences, where all the paths spell words within distance e from p, with m<M
- If p occurs in at least q sequences, we try and expand it by one symbol

Searching for (M, e) Patterns

- Expanding a pattern by one symbol
- For each character b {A, C, G, T}, we match b against the next symbol on each path
- If a path ends just before a node V of the tree, we match b against the first symbol on each edge leaving V
- When we encounter a mismatch, we increase the previous error along the path by one
- If the new error is greater than e, we discard the path

Searching for (M, e) Patterns

- Once all paths have been checked, the surviving ones represent the approximate occurrences of p’=p1…pmb
- If p’ occurs in at least q sequences, and is shorter than M, we expand p’ as well. Otherwise, we continue with p and the next character in

Searching for (M, e) Patterns

- For example
- It matches the first symbol on each edge leaving the root against A.
- If A is valid, i.e., A occurs in at least q sequences, it is expanded to AA.
- If also AA is valid, we move to AAA, and so on.
- If it is not valid, we proceed to look for occurrences of AC.

- In this method, patterns don’t have to occur exactly in the sequences.

Searching for (M, e) Patterns

- The main drawback is that every pattern of length e satisfies the input constraints, since every other pattern of length e found in the tree is a valid occurrence for it
- Thus, the method works well only for small values of e

Searching for (M, e) Patterns

- To apply the algorithm also to longer patterns with higher values of e, instead of reducing the set of patterns that have to be searched, we restrict the number of paths that have to be followed for each pattern.
- That is, we narrow down the set of valid occurrences－the WEEDER algorithm.

Searching for Approximate Occurrences of Patterns

- Problem Definition:
- Given a set of k sequences on the alphabet= { A, C, G, T }, we want to find all(M, e) patterns
- (M, e) patterns: patterns oflength Mthat occur withat most e mismatchesin at least qsequencesof the set

The Outline of WEEDER

- WEEDER fixes an initial error ratio
- Given a pattern p, a path is valid if the distance from p to the path is not greater than |p|
- |p| is the length of the pattern

- When we expand p by one symbol, the error threshold is set to (|p|+1)

- Given a pattern p, a path is valid if the distance from p to the path is not greater than |p|

4

8

12

16

= 0.25

1

2

3

4

Block Decomposition of a Pattern- Each block size is 1/
- Let p = p1…pm. We can see p as composed of m blocks

Valid Occurrences

- For every pattern p = p1…pm, valid occurrences are words si+1…si+m occurring in the sequences for which: j {1,…, m} d(p1…pj, si+1…si+j) j
- d(p1…pj, si+1…si+j) is the number of mismatches between p1…pj and si+1…si+j

- si+1…si+m is a valid occurrence for p if it is a valid occurrence for all its prefixes
{p1, p1p2, …, p1p2…pm-1}

q = 2, =0.25

S2: AGCTCA&

S1: AATCACGC#

S3: ATGCT%

S4: ACTC$

An Example for WEEDERG

&

C

A

GCTCA&

T

C

T

%

%

ATCACGC#

C

#

GC#

#

GC#

T

T

C

CA&

GCT%

$

TC$

A

%

$

C

CACGC#

A

A&

CGC#

GCT%

&

$

CGC#

&

ACTCA: error max =2. S1, S2, S4 contain ACTCA.

ACTC ACTCAG

&

C

A

GCTCA&

T

T

C

%

%

ATCACGC#

C

GC#

#

GC#

#

CA&

T

T

C

TC$

GCT%

$

A

%

C

$

CACGC#

A

A&

CGC#

GCT%

&

$

CGC#

&

ACTC: error max =1. S1, S2, S4 contain ACTC.

S2: AGCTCA&

S1: AATCACGC#

S3: ATGCT%

S4: ACTC$

ACTCA ACTCAA

G

&

C

A

GCTCA&

T

T

C

%

%

ATCACGC#

C

GC#

#

GC#

#

T

T

C

CA&

TC$

$

GCT%

A

%

$

C

CACGC#

A

A&

CGC#

GCT%

&

$

CGC#

&

ACTCAA: error max =2. S1, S2 contain ACTCAA.

ACTCAC, ACTCAG, ACTCAT are also patterns.

S1: AATCACGC#

S2: AGCTCA&

S4: ACTC$

S3: ATGCT%

Weeder (2001)

- Given a pattern P = p1p2....pm, the algorithm can find all the valid occurrences of P (with at most |P| mutations), such that at most i mutations occur in the first i letters of the pattern
- But: some occurrences of a pattern can be missed altogether
- Are DNA signals always so polite to show up in “blocks-decomposed” form?
- The answer is no, but we can use Weeder with a grain of salt

Using Weeder

- Example: (15,4) pattern occurring in 20 sequences
- Valid (block decomposed) possible occurrences: 829
- Total possible occurrences:1365
- Probability if “hitting” a possible occurrence in a sequence: phit=.61
- Probability of finding the pattern in every sequence: like trying to win the national lottery
- If we search for patterns occurring in at least 10 sequences, the probability of “seeing” at least 10 times the pattern is:
Phit(20,10) = .89

Using Weeder

- Thus, we can use Weeder as a sieve, to filter the set of candidate patterns
- All patterns that are found to occur in at least q of the sequences by Weeder can be searched again in the sequences, but this time with no restriction on the position of mismatches
- We expect the number of patterns (random patterns other than the real signal) passed to the second phase to be much smaller than the original number (and no longer exponential)

Using Weeder

- The probability of finding a pattern in a sequence depends on its length and the error ratio
- The probability of finding a pattern in a set of sequences (and thus the choice of the quorum q for the first phase) depends on the number of sequences
- The same approach can be applied also when the signal does not show up in each sequence

Using Weeder

- When the signal to be found is expected to be short, the algorithm can be used in “exact” mode
- For longer signal, the lower is the quorum q, the higher is the probability of finding the signal
- But: also the number of patterns satisfying the input constraints is higher, and the program is slower
- Users can choose a suitable trade-off between time and accuracy

Theoretical Time Complexity

- Naïve approach: O(4men)
- Suffix tree approach (Sagot, 1998): O(4emekn)
- where n is the input size, m is the pattern length, and e is the number of mutations allowed

- Weeder:
O((1/)e4ekn)

where e is the number of mutations occurring in the longest pattern found

Weeder Web

- Weeder Web is a web interface to the Weeder algorithm, where all the parameters concerning the motifs are automatically set for the discovery of transcription factor binding sites
- Although there is no pre-set limit on the length of the input sequences, feasible results can be obtained by submitting sequences of "typical" length for regulatory/promoter regions (i.e. from 500 to 5000 bps)
- A priori, there's no limit on the number of sequences you can input. Also, for the moment we do not consider correlations among different motifs (i.e. cis-regulatory modules)

Weeder Web

- All the statistical measures (background oligo frequencies, expected occurrences and so on) used to score/rank motifs and to post-process the output have been derived from the analysis of promoter/enhancer and 5'UTR regions only (taken from different organisms)
- If you submit something else (i.e. 3'UTRs, coding regions, noncoding RNAs, and so on) the statistical evaluation probably will not be consistent with your data, and thus produce unreliable results
- http://www.pesolelab.it/Tool/ind.php

Post-Processing

- Real motifs have different degrees of variation in different positions
- Some admit “any” nucleotide
- Some are (almost) perfectly conserved
- We should find “redundant” motifs among the highest-scoring ones
- Pieces of a long motif should appear also in shorter results

Post-Processing

- Look for “redundant” (either in length or in conservation) motifs in the reports of each run
- Collect the instances of each one and build a frequency matrix
- Scan the sequences looking for matches
- Report the best matches (with no constraint on the substitutions allowed)

Characterization of CAP and TATA-box through probability matrices with a genetic algorithm

Giancarlo Mauri, Roberto Mosca and Giulio Pavesi

Bioinformatics and Natural Computing Group

Dipartimento di Informatica, Sistemistica e Comunicazione

Università degli studi di Milano-Bicocca

TATA Box and CAP Binding Sites

- A large number of genes present two characteristic signals:
- TATA-box
- 25-35 bp upstream of the TSS
- When discovered it was given a TATA consensus
- Bound by the TATA Binding Protein (TBP) part of a large complex of some 50 different proteins including TFIID and TFIIB

- CAP (also called Initiator or Inr)
- Straddles the TSS
- Experimental evidence that it is bound by TFIID too
- Previous characterization by Bucher [1990] with a CA[Py] consensus

- TATA-box
- Very strong positional preference for the two signals with respect to the TSS

Describing Binding Sites

Frequency Matrix nij

.........0...........

...CGTGCCATTTGTTGT...

...TCCTACAGTGCAGCA...

...TCACATATTATTGTC...

...GAAAGCAACAACTAA...

...TAAATCGTCAGTGTA...

...CCGACCAGAGTGAAA...

...GGGTTTGGTTTGATA...

...GCGTGCAGTTGTGAA...

...GTCGCCATATACACA...

...GTGGCCGTATGCGCT...

.........0...........

...CGTGCCATTTGTTGT...

...TCCTACAGTGCAGCA...

...TCACATATTATTGTC...

...GAAAGCAACAACTAA...

...TAAATCGTCAGTGTA...

...CCGACCAGAGTGAAA...

...GGGTTTGGTTTGATA...

...GCGTGCAGTTGTGAA...

...GTCGCCATATACACA...

...GTGGCCGTATGCGCT...

Frequency Matrix

- Given a signal of length m, a matrix P of dimension 4m is built up, taking Pi,j=P[nucleotide i at position j]
- We suppose the existence of two different models
- The signal
- The background

Signal occurence

Weights stored

in the matrix

The Problem

Dataset

... TTTTGTTTTTTTATTTCCTGTATTTTT ...

... TCCAGCCCGAACAAAATCGATCAAAAT ...

... ATCCCTCTGGCCATTGGCAATCGATCC ...

... AGAAACAAAACGGCTTGTAAAACAAAC ...

... GTGCAGTGAGTCAGTGTGTTGTGTGCC ...

... GAGCGTAAGCAAGAGAGAGAGGTGAAG ...

... AGGTGAAGCCAGGGGCGGAGGCGCAAG ...

... AGAAAAGAGAGAGTGAAAGCATAGTCC ...

... AGTTTTCATATTGTTACCGTTTGAGTT ...

... TTCACCAGCCACTTTCAGTCGGTTTAT ...

... GCATAACGAATCACTCTGATCGCTGTC ...

... GGTCCAGCGACCACTCGCAGTTCTACA ...

... GATCGGCGTGCCATTTGTTGTTGAATC ...

... CCGCTCTCCTCCAGTGCAGCAGCAGCA ...

... TCCAAGTCACCGATTATTGTCTCAGTG ...

... GAACTGGAAACCAACAACTAACGGAGC ...

... TCAGTCTAAATTTACCCTGTAAAATTC ...

... GTAGTTCCGACCAGAGTGAAACTGAAC ...

... CTTTATGGGTTTAGTTTGATAGGAGTC ...

... TCACTGGCGTTGTTAGAGTTGTGAATG ...

... TTTTGTTTTTTTATTTCCTGTATTTTT ...

... TCCAGCCCGAACAAAATCGATCAAAAT ...

... ATCCCTCTGGCCATTGGCAATCGATCC ...

... AGAAACAAAACGGCTTGTAAAACAAAC ...

... GTGCAGTGAGTCAGTGTGTTGTGTGCC ...

... GAGCGTAAGCAAGAGAGAGAGGTGAAG ...

... AGGTGAAGCCAGGGGCGGAGGCGCAAG ...

... AGAAAAGAGAGAGTGAAAGCATAGTCC ...

... AGTTTTCATATTGTTACCGTTTGAGTT ...

... TTCACCAGCCACTTTCAGTCGGTTTAT ...

... GCATAACGAATCACTCTGATCGCTGTC ...

... GGTCCAGCGACCACTCGCAGTTCTACA ...

... GATCGGCGTGCCATTTGTTGTTGAATC ...

... CCGCTCTCCTCCAGTGCAGCAGCAGCA ...

... TCCAAGTCACCGATTATTGTCTCAGTG ...

... GAACTGGAAACCAACAACTAACGGAGC ...

... TCAGTCTAAATTTACCCTGTAAAATTC ...

... GTAGTTCCGACCAGAGTGAAACTGAAC ...

... CTTTATGGGTTTAGTTTGATAGGAGTC ...

... TCACTGGCGTTGTTAGAGTTGTGAATG ...

... TTTTGTTTTTTTATTTCCTGTATTTTT ...

... TCCAGCCCGAACAAAATCGATCAAAAT ...

... ATCCCTCTGGCCATTGGCAATCGATCC ...

... AGAAACAAAACGGCTTGTAAAACAAAC ...

... GTGCAGTGAGTCAGTGTGTTGTGTGCC ...

... GAGCGTAAGCAAGAGAGAGAGGTGAAG ...

... AGGTGAAGCCAGGGGCGGAGGCGCAAG ...

... AGAAAAGAGAGAGTGAAAGCATAGTCC ...

Negative

score

Positive

score

Negative

score

The Score Function- The score function is made up of two terms:
- A positive term
- A negative term

Total score = positive score – negative score

The Genetic Algorithm

... TTTTGTTTTTTTATTTCCTGTATTTTT ...

... TCCAGCCCGAACAAAATCGATCAAAAT ...

... ATCCCTCTGGCCATTGGCAATCGATCC ...

... AGAAACAAAACGGCTTGTAAAACAAAC ...

... GTGCAGTGAGTCAGTGTGTTGTGTGCC ...

... GAGCGTAAGCAAGAGAGAGAGGTGAAG ...

... AGGTGAAGCCAGGGGCGGAGGCGCAAG ...

... AGAAAAGAGAGAGTGAAAGCATAGTCC ...

... AGTTTTCATATTGTTACCGTTTGAGTT ...

... TTCACCAGCCACTTTCAGTCGGTTTAT ...

... GCATAACGAATCACTCTGATCGCTGTC ...

... GGTCCAGCGACCACTCGCAGTTCTACA ...

... GATCGGCGTGCCATTTGTTGTTGAATC ...

... CCGCTCTCCTCCAGTGCAGCAGCAGCA ...

... TCCAAGTCACCGATTATTGTCTCAGTG ...

... GAACTGGAAACCAACAACTAACGGAGC ...

... TCAGTCTAAATTTACCCTGTAAAATTC ...

... GTAGTTCCGACCAGAGTGAAACTGAAC ...

... CTTTATGGGTTTAGTTTGATAGGAGTC ...

... TCACTGGCGTTGTTAGAGTTGTGAATG ...

The Genetic Algorithm

- One-point crossover
- Mutation operator flips one bit with a probability pm
- Given a genome the frequency matrix is calculated from the sequences selected by the genome itself
- The matrix is scored according to the previously defined score function
- A local optimization procedure is applied to the resulting best individual

Results

Three different datasets:

- Eukaryotic Promoter Database(EPD, www.epd.isb-sib.ch, rel. 74)
- Vertebrates: 2199 seqs, from -50 to 50
- Homo Sapiens: 1796 seqs (included in the previous one)

- Drosophila Core Promoters Database(DCPD, www-biology.ucsd.edui/labs/kadonaga/DCPD.htm)
- 205 seqs, from -47 to 44

Bucher, P.: Weight matrix descriptions of four eukaryotic RNA

polymerase II promoter elements derived from 502 unrelated

promoter sequences. J. Mol. Biol. 212 (1990) 563–78

Results - CAPStart position: -2

Length: 6

Population: 500

Generations: 20000

Crossover Probability: 0.9

Mutation Probability: 0.04

Results - CAP RNA

- CA, CG, TA and TG dinucleotides starting at position -1 in for the EPD dataset
- CA, TA dinucleotides starting at position -1 for the DCPD dataset (consistently with [Kadonaga,2000]

Kutach, A., Kadonaga, J.: The downstream promoter element DPE appears to be as widely used as the TATA box in Drosophila core promoters. Mol. Cell Biol. 20 (2000) 4754–64

DCPD RNA

Results - CAPResults – TATA Box RNA

Start position: -30

Length: 8

Population: 500

Generations: 20000

Crossover Probability: 0.9

Mutation Probability: 0.04

Consistent with the finding that the TBP recognizes the minor grove of DNA, where protein-DNA interactions are typically influenced by A/T-content, but not by the specific nucleotide sequence.

Kim, J.L., Nikolov, D.B., Burley, S.I.: Co-crystal structure

of TBP Recognizing the minor groove of a TATA element.

Nature 365 (1993) 520–527

Lo, K., Smale, S.T.: Generality of a functional initiator

consensus sequence. Gene 182 (1996) 13–22

- The signals found have proved to be consistent with those described experimentally, as in the case of fruit fly signals
- Differently from previous approaches to the same problem:
- it does not make any prior assumption about the signals structure
- it takes advantage of the specific localization of the signals considered

- We do not need to know in advance which sequences contain the signal, this is taken care of by the algorithm
- Further improvements include:
- Development of models taking into account correlation between adjacent positions
- Application of the method to new datasets with a better characterization of the TSS to investigate positional correlation between the TATA-box and the TSS and the presence of alternative TSSs

Large Scale Multiple Alignment of Genomic Sequences

Summary described experimentally, as in the case of fruit fly signals

- Multiple Sequence Alignment
- The Methods So Far
- SeQuAl project
- Producing a Threaded Blockset Alignment
- Results & possible improvements

Multiple Sequence Alignment (1) described experimentally, as in the case of fruit fly signals

The multiple sequence alignment problem is the process of taking three or more input sequences and forcing them to have the same length by inserting a universal gap symbol, in order to maximize their similarity as measured by a score function.

Multiple Sequence Alignment (2) described experimentally, as in the case of fruit fly signals

Dynamic Programming Vs Heuristics described experimentally, as in the case of fruit fly signals

- Dynamic Programming Techniques
- O(NxM) for two sequences of length N and M
- Exponential for multiple sequences
- O(nxM2) for n sequences of length approximately M using some heuristics (ClustalW)

Prohibitively time-consuming for sequences

of length exceeding 10 kb

Need for more heuristics

to speed up the alignment

The Methods So Far described experimentally, as in the case of fruit fly signals

- Pairwise sequence alignment tools
- MUMmer, GLASS, WABA, BLASTZ

- Multiple sequence alignment tools
- MAVID, MLAGAN, MGA, Mauve

- Anchoring techniques
- Seed ungapped alignments and extensions
- Filling gaps using dynamic programming

SeQuAl project described experimentally, as in the case of fruit fly signals

- Goals:
- Ability to find conservation even in the presence of mutations on amino acids
- Ability to find conservation in subsets of the original set

- Anchoring technique based on text indexing structures
- Hashing functions on nucleotides and amino acids

S described experimentally, as in the case of fruit fly signals1

S2

S3

S4

SeQuAl project- Four steps alignment:
- Anchors search
- Anchors chaining
- Anchors refinement
- Gaps filling

New!

New!

S described experimentally, as in the case of fruit fly signals1

S2

S3

S4

Anchors Discovery (1)- Discovery of MUMs and MEMs using Suffix Trees
- Longer than a minimum threshold
- Shorter than a maximum length for speeding up the time complexity
- Statistically significant

- Searched on
- the nucleotide sequence
- the amino acid sequence
- on the sequence of classes of aminoacids

- Fragments generation

New!

New!

(R,1) described experimentally, as in the case of fruit fly signals

{S,R}

$

A

T

T

C

A

T

G

G

{S,R}

C

T

{S,R}

A

C

C

C

$

$

G

{S,R}

A

$

T

{S,R}

(R,5)

(S,6)

(R,6)

{S,R}

T

C

A

G

T

(S,4)

A

T

A

T

$

$

A

A

A

$

C

$

(R,4)

$

$

(S,1)

A

(S,3)

$

(R,3)

(S,5)

(R,2)

(S,2)

Anchors Discovery (2)S: ACGTCA$

R: TGTCTA$

S described experimentally, as in the case of fruit fly signals1

S2

S3

S4

D

A

B

C

E

B

D

start

stop

A

C

E

Fragments Chaining- Ordering relation between fragments

<

<

<

<

- Graph induced by the ordering relation

Shortest Path on a DAG

S described experimentally, as in the case of fruit fly signals1

S2

S3

S4

S4

A

B

C

<

A

B

A

C

Partial Anchors (1)- Ordering relation?

<

<

- Order the partial fragments based on their “barycenter”
- Use a shifting barrier to impose a total ordering between partial fragments

Partial Anchors (2) described experimentally, as in the case of fruit fly signals

S described experimentally, as in the case of fruit fly signals1

S2

S3

S4

Producing a Threaded Blockset Alignment- Blockis a rectangular array of symbols such that removing dashes from any row produces a run of one or more consecutive positions in one of the original sequences
- Blockset: a set of such blocks
- A “ref-blockset" consists of a blockset in which every block has a designated row, all of which come from the same original sequence, called the reference for that ref-blockset.
- If a blockset is threaded by each of the original sequences, we call it a threaded blockset.

Local Alignment

with ClustalW

Results (1) described experimentally, as in the case of fruit fly signals

Results (2) described experimentally, as in the case of fruit fly signals

- Approximate anchors in non-coding areas of the genome described experimentally, as in the case of fruit fly signals
- Developing a hashing function for regulatory regions

- Important evolutionary events
- Modeling rearrangements and inversions

- Memory-saving implementation of suffix trees
- Custom visualization tool and classification of aligned regions based on the nature of anchors

Conclusions described experimentally, as in the case of fruit fly signals

- Exhaustive methods:
- Suitable for short patterns and a once-for-all analysis of data (e.g. whole genomes)

- Sequence driven methods:
- Give faster but less accurate answers
- Limited data sizes
- It is possible to choose a trade-off between time and accuracy

3 described experimentally, as in the case of fruit fly signals

4

4

4

Block DecompositionWhere (3-4-4-4) comes from:

0.26*1 = 1

0.26*2 = 1

0.26*3 = 1

0.26*4 = 2

0.26*5 = 2

…..

0.26*15 = 4

3 described experimentally, as in the case of fruit fly signals

4

4

4

0

0

1

1

1

1

0

0

0

0

0

0

0

1

2

0

0

0

1

0

0

1

1

1

2

0

0

1

3

1

0

1

2

2

1

1

0

0

1

0

0

2

2

2

2

3

2

1

3

3

1

1

1

4

2

1

After iteration

1+12+48+288+216+192+72 = 829

829*34 for all

Number of Patterns- 829*34

- Probability phit = 829/1365 = 0.61
- Probability of finding pattern in all 20 sequences is (0.61)20 ≈ 0

It works… described experimentally, as in the case of fruit fly signals

- Use the algorithm as a sift, i.e. we run it to find all patterns that occur in at least half of the sequences.
- Search the patterns reported by the algorithm and use the suffix tree to check which ones actually appear in all the sequences.

The next… described experimentally, as in the case of fruit fly signals

- The probability that a pattern of length 15 occurs with up to four mutations of a give position of a random sequence:
- The probability is seen by WEEDER is
- The probability that a pattern occurs in a position in a form valid for WEEDER is

Finally… described experimentally, as in the case of fruit fly signals

- The pattern occurs at least once in a sequence of length n is:
- The probability that the pattern occurs at least half of the sequence of set and is found by WEEDER is:
- The expected number of patterns passed by WEEDER to the second phase is :

How good it works….. described experimentally, as in the case of fruit fly signals

- A set of 20 sequences of length 600 raises less than 10 patterns.
- A set of 20 sequences of length 1000 raises to about 300 patterns.
- If we look for patterns that occur in at least 9 sequences:
- 50 for sequences of length 600
- 3500 for sequence of length 1000

Performance Evaluation (cont’d) described experimentally, as in the case of fruit fly signals

- We could partition the set of sequences in two subsets of ten sequences each and the probability that the pattern will pop up among the ones found by the algorithm in either subset is:
- where pmiss = 1 - phit

Performance Evaluation (cont’d) described experimentally, as in the case of fruit fly signals

- A pattern is expanded whenever at least one of the two counters is greater than q.
- q is a minimum number of sequences for each subset

Performance Evaluation (cont’d) described experimentally, as in the case of fruit fly signals

- This approach can be pushed even further.
Partition the set of sequences as long as the parameters are such that only a few patterns satisfy them.

- And it works well on long signals, where the phit value is lower and random patterns are unlikely to be picked up by the sifting phase.

Thus… described experimentally, as in the case of fruit fly signals

- The final exact search has to be performed on a limited number of patterns.
- When the signal length and the number of mutations are known in advance, we can determine the best parameter setting and search strategy for WEEDER.

If the length of the pattern is not known in advance… described experimentally, as in the case of fruit fly signals

- There’re some additional problems to be faced…
If we choose :

The probability of hitting a (16, 4) pattern is 0.54.

But the chances of finding a (15, 4) have dropped to 0.45. (since it’s decomposed as (4-4-4-3))

=>The probabilities of finding a pattern depends on the block decomposition induced by .

Moreover… described experimentally, as in the case of fruit fly signals

- Setting a lower threshold value q would increase the number of candidates.
For example:

Run a algorithm with on 20 sequences of length 400 hoping to find a (16, 4) pattern.

=>The constraints are satisfied by hundreds of patterns of length 12.

What is the solution? described experimentally, as in the case of fruit fly signals

- One possible solution could be to investigate only the longest patterns found.
=>But a significant signal could be hidden also among the shorter ones.

The solution we adopt… described experimentally, as in the case of fruit fly signals

- Expand all patterns that appear in at least q sequences, but report only those that occur in q(m), which is set according to the pattern length.
In the previous example:

A pattern of length 12 with 3 mutations can be found with probability of about 90% by setting q=11.

=>Thus, if a pattern of length 12 appears in at least q(m)=11 sequences, we can pass it to the second phase.

- Since q(m) can be set according to the number of sequences given as input, and the parameters q and .

If the nucleotide composition of the sequences is not uniform…

- For example, 1:1:1:2 (T occurs twice as often as the others)
In order to avoid spending too much time in the final phase, we can set the threshold q(m) according to the pattern probability.

The points just discussed uniform…

- At the end, the algorithm might report more than one pattern satisfying the constraints.
=>We need to introduce significance

measures to sort the output.

Sorting the Output uniform…

- The algorithm may output more than one pattern under:
- The length of the signal to be found is unknown
- The sequences contain more than one signal

Example uniform…

- A successful (15,4) pattern can be expanded by one symbol and becomes a (16,5) pattern.
- All its occurrences are also valid occurrences of the longer one.

A Method for Sorting Outputs uniform…

- A pattern P is given
- is the best instance of P in the sequence i
- is the total number of sequences P occurs in
- is 1 if P appears in sequence i, otherwise 0
- means the distance between P and

Relative Entropy uniform…

- When the nucleotide composition of the sequences is biased, we use the background probabilities to define new match premiums and penalties.

Relative Entropy (cont’d) uniform…

- is the frequency with which residue r occurs in position j in the occurrences of P
- is the frequency of r in the sequences

Relative Entropy (cont’d) uniform…

- It is suitable for sorting patterns under following situations:
- Patterns that appear the same number of times, for example once in every sequence of the set.
- Sequences containing multiple signals.

Another Measure of Significance uniform…

- We can define statistical measures of significance, that compare the actual number of occurrences of a pattern with the expected value.

Measure of Significance (cont’d) uniform…

- is the probability that P occurs in a sequence of the set with at most e mutations.
- N is the overall length of the sequences.
- This value can be computed explicitly (Tompa,1999)

Software Implementation uniform…

- Machine: Pentium II class computer with 64MB RAM
- OS: Linux
- Program Codes: written in ANSI C, about 2500 lines long
- Testing Data: challenge problem as described before, varying the length from 100 to 1000 nucleotides

Experiment Results -- Time uniform…

- Construction of the suffix tree: always within one second
- To find the (15,4) signal with 89% success probability, and run the algorithm with = 0.26 and q = 10. for sequence lengths up to 400 nucleotides. it took less than one minute (including the final exact search).
- Length 500: execution time grows significantly
- Length 500 and 600: average time 125 and 200 seconds
- q = 11: execution time drops to 100 and 120

Experiment Results – Time (cont’d) uniform…

- Increasing the number of sequences did not influence the execution time very much.
- Example
- For every sequence length, the algorithm took just a few more seconds when run over 30 or 40 sequences with q set to 16 and 21, respectively.
- Sequence length set to 800, the program took on the average 320 and 450 seconds to complete the execution with q set to 11 and 10 respectively. When length is 1000 long, it took about 15 minutes.

- Thus the WEEDER algorithm is suited to work on a large set of relatively short (up to 600 nucleotide) sequences than a small set of very long sequences.

Download Presentation

Connecting to Server..