Loading in 2 Seconds...

Spectrum-based de novo repeat detection in genomic sequences

Loading in 2 Seconds...

- By
**colin** - Follow User

- 94 Views
- Uploaded on

Download Presentation
## PowerPoint Slideshow about 'Spectrum-based de novo repeat detection in genomic sequences' - colin

**An Image/Link below is provided (as is) to download presentation**

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.

- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -

Presentation Transcript

AACAACAAC### Evaluation

Outline

- Introduction
- What is a repeat?
- Why studying repeats?
- Related work
- SAGRI
- Algorithm
- Analysis
- Evaluation

What is a repeat? (Definition)

- [General]: Nucleotide sequences occurring multiply within a genome
- [CompBio]: Given a genome sequence S, find a string P which occurs at least twice in S (allowing some errors).

What is a repeat? (Function)

- Motifs
- Very short repeats (10-20bp)
- Transcription factor binding sites
- Long and Short interspersed elements (SINE, LINE)
- Jumping genes
- Genes and Pseudogenes
- Tandem repeats
- Simple short sequence repeats An, CGGn

Why studying repeats? (1)

- Eukaryotic genomes contain a lot of repeats
- E.g. Human genome contains 50% repeats.
- Repeats are believed to play an important role in evolution and disease.
- E.g. Alu elements are particularly prone to recombination. Insertion of Alu repeats inactivate genes in patient with hemophilia and neurofibromatosis (Kazazian, 1998; Deininger and Batzer, 1999)
- Repeats are important to chromatin structure.
- Most TEs in mammals seem to be silenced by methylation. Alu sequences are major target for histone H3-Lys9 methylation in humans (Kondo and Issa, 2003).
- It is known that heterochromatin have a lot of SINE and LINE repeats.

Why studying repeats? (2)

- Repeats complicated sequence assembly and genome comparison
- Many people remove repeats before they analyze the genome.
- Repeats set hurdles on microarray probe signal analysis
- The probe signal may be inaccurate if the probe sequence overlap with repeat regions.
- Repeats may contribute to human diversity more than genes.
- Repeats can be used as DNA fingerprint

Steps in Repeat finding

- Repeat library (RepeatMasker)
- De-novo repeat discovery (two steps):
- Identification of repeats
- Classification of repeats

Algorithm outline

- Input: a text G
- FindHitphase: finds all candidate of second occurrence of repeat regions
- ACGACGCGATTAACCCTCGACGTGATCCTC
- Validation phase: uses hits from phase 1 to find all pairs of repeats
- ACGACGCGATTAACCCTCGACGTGATCCTC

Spectrum-based repeat finder

- What is a spectrum?
- Given a string G, its spectrum is the set of all k-mers.

E.g. k=3, G= ACGACGCTCACCCT

The spectrum is

- ACC, ACG, CAC, CCC, CCT, CGA, CGC, CTC, GAC, GCT, TCA
- CTC is a k-mer occurring at position 7.
- ACG is a k-mer occurring at positions 1, 4.

Observation 1: How to find candidate regions containing repeats?

- Two regions of repeats should share some k-mers.
- E.g. the following repeats share CGA.

ACGACGCGATTAACCCTCGACGTGATCCTC

Feasible extension (bud)

i

S = ACGACGTGATTAACCCTCGACGTGATCCTC

- Given the spectrum S for G[1..i-1]:

i

CGA

A X

C

Feasible extensions!

G X

T

Note: T is called a fooling probe!

Observation 2

- A path of feasible extensions may be a repeat.
- Example:

S = ACGACGCTATCGATGCCCTC

Spectrum S for G[1..10] is

ACG, CGA, CGC, CTA, GAC, GCT, TAT

Starting from position 11, there exists a path of feasible extensions:

CGA-C-G-C

This path corresponds to a length-6 substring in position 2.

Also, this path has one mismatch compare with the length-6 substring for position 11 (CGATGC).

11

Phase 1: FindHit()

Algorithm:

Input: a text G

- Initialize the empty spectrum S
- For i = 1 to n

/* we maintain the variant that S is a spectrum for G[1..i-1] */

- Let x be the k-mer at position i
- If x exists in S, run DetectRepSeq(S,i);
- Insert x into S
- Note: DetectRepSeq(S,i) looks for repeat occurring at position i.

AAC

AAG

ACC

ACG

AGT

ATT

CCC

CCT

CGA

CTC

GAA

GAT

GTG

TAA

TCG

TGA

TTA

ACGAAGTGATTAACCCTCGACGCGATCC

1 2 …

… 18 19 20 21 22 23 24 25 26 27 28

18 19 20

21 22 23 24 25 26 27 28

CGA C G C G A T C T

Ref

Curr

CGA

DetectRepSeg(S(18), 18)

AAC

AAG

ACC

ACG

AGT

ATT

CCC

CCT

CGA

CTC

GAA

GAT

GTG

TAA

TCG

TGA

TTA

ACGAAGTGATTAACCCTCGACGCGATCC

1 2 …

… 18 19 20 21 22 23 24 25 26 27 28

18 19 20

21 22 23 24 25 26 27 28

CGA C G C G A T C T

Ref

Curr

CGA-T1

DetectRepSeg(S(18), 18)

AAC

AAG

ACC

ACG

AGT

ATT

CCC

CCT

CGA

CTC

GAA

GAT

GTG

TAA

TCG

TGA

TTA

ACGAAGTGATTAACCCTCGACGCGATCC

1 2 …

… 18 19 20 21 22 23 24 25 26 27 28

18 19 20

21 22 23 24 25 26 27 28

CGA C G C G A T C T

Ref

Curr

CGA-T1-T2-A3*

A1-G1-T2-G2-A2-T2-T3*

C2-C2-C3*

G3*

DetectRepSeg(S(18), 18)

AAC

AAG

ACC

ACG

AGT

ATT

CCC

CCT

CGA

CTC

GAA

GAT

GTG

TAA

TCG

TGA

TTA

ACGAAGTGATTAACCCTCGACGCGATCC

1 2 …

… 18 19 20 21 22 23 24 25 26 27 28

18 19 20

21 22 23 24 25 26 27 28

CGA C G C G A T C T

Ref

Curr

CGA-T1-T2-A3*

A1-G1-T2-G2-A2-T2-T3*

C2-C2-C3*

G3*

DetectRepSeg(S(18), 18)

AAC

AAG

ACC

ACG

AGT

ATT

CCC

CCT

CGA

CTC

GAA

GAT

GTG

TAA

TCG

TGA

TTA

ACGAAGTGATTAACCCTCGACGCGATCC

1 2 …

… 18 19 20 21 22 23 24 25 26 27 28

18 19 20

21 22 23 24 25 26 27 28

CGA C G C G A T C T

Ref

Curr

CGA-T1-T2-A3*

A1-G1-T2-G2-A2-T2-T3*

C2-C2-C3*

G3*

DetectRepSeg(S(18), 18)

AAG

ACC

ACG

AGT

ATT

CCC

CCT

CGA

CTC

GAA

GAT

GTG

TAA

TCG

TGA

TTA

ACGAAGTGATTAACCCTCGACGCGATCC

1 2 …

… 18 19 20 21 22 23 24 25 26 27 28

18 19 20

21 22 23 24 25 26 27 28

CGA C G C G A T C T

Ref

Curr

CGA-T1-T2-A3*

A1-G1-T2-G2-A2-T2-T3*

C2-C2-C3*

G3*

DetectRepSeg(S(18), 18)

AAG

ACC

ACG

AGT

ATT

CCC

CCT

CGA

CTC

GAA

GAT

GTG

TAA

TCG

TGA

TTA

ACGAAGTGATTAACCCTCGACGCGATCC

1 2 …

… 18 19 20 21 22 23 24 25 26 27 28

18 19 20

21 22 23 24 25 26 27 28

CGA C G C G A T C T

Ref

Curr

CGA-T1-T2-A3*

A1-G1-T2-G2-A2-T2-T3*

C2-C2-C3*

G3*

DetectRepSeg(S(18), 18)

AAG

ACC

ACG

AGT

ATT

CCC

CCT

CGA

CTC

GAA

GAT

GTG

TAA

TCG

TGA

TTA

ACGAAGTGATTAACCCTCGACGCGATCC

1 2 …

… 18 19 20 21 22 23 24 25 26 27 28

18 19 20

21 22 23 24 25 26 27 28

CGA C G C G A T C T

Ref

Curr

CGA-T1-T2-A3*

A1-G1-T2-G2-A2-T2-T3*

C2-C2-C3*

G3*

DetectRepSeg(S(18), 18)

Other details

- Extend backward
- Stop backtracking after h steps

Validation phase

- Decompose hits into set of k-mer and index all the locations of these k-mers.
- Scan for each pair of locations of a k-mer w in the hits, do BLAST extension
- Use some auxiliary data structure to avoid double checking
- Report the pairs whose length exceed our threshold

Analysis

- How to find most repeats?
- Avoid false negative
- How to get better speed?
- Avoid false positive

How do we choose k? (1)

- If k is too big,
- k-mer is too specific and we may miss some repeat
- If k is too small,
- k-mer cannot help us to differentiate repeat from non-repeat
- For repeat of length 50 and similarity>0.9,
- we found that k log4n+2 is good enough.

m

0

How do we choose k? (2)- A random k-mer match with one of n chosen k-mer
- Pr(a k-mer re-occurs by random in a sequence of length n)
- (analog to throwing n balls into 4k bins)
- 1-(1 – 4-k)m 1 – exp(-m/4k).
- We requires 1-exp(-n/4k)1,
- hence, k log4n + log41.
- If we set 1=1/16, k log4n + 2

X

X

x1

x2

Xm+1

L

The occurrence of false negative (missed repeat) (1)- A pair of repeats of length L, with m mismatches
- Probability of a preserved k-mer in repeat is
- M is the number of nonnegative integer solutions
- to Subject to

The occurrence of false negative (missed repeat) (2)

- It is easy to see that M is the coefficient of xL−m in
- Hence

Criterion for path termination (1)

- Instead of fixing the number of mismatches, we may want to fixed the percentage of mismatches, says, 10%.
- Then, the pruning strategy is length dependent.
- If the length of strings in is r, we allow (r) mismatches.

Criterion for path termination (2)

- Let q be the mismatch probability and r be the length of the string.
- Prob that a string has s mismatches =
- For a threshold (says, 0.01), we set
- (r) = max {2 s r-2 | Pq(s) > } + 2

Control of false positives (1)

- Two typical cases
- The probability of (case 1)/ (case 2)

is 2*4-

- P(case1 or case2) is small
- For example: 4 errors, q=0.1, k = 12, P(case 1) = 1.77 * 10-8

Compare with other programs

Programs

- EulerAlign by Zhang and Waterman
- PALS by Edgar and Myers
- REPuter by Kurtz et al.
- SARGRI

Measurement

- Count Ratio (CR): the ratio of number of pairs of repeat share more than 50% with a reference pair to the number of reference pairs.
- Shared Repeat Region (SRR): the ratio of the found region to the reference region.

Simulated data

- Conclusion from simulated data
- The result is consistent with the analysis

Genome data

- M.gen (0.6 Mbp)
- Organism with the smallest genome
- Lives in the primate genital and respiratory tracts
- C.tra (1 Mbp)
- Live inside the cells of humans
- A.ful (2.1 Mbp)
- Found in high-temperature oil fields
- E.coli (4 Mbp)
- An import bacteria live inside lower intestines of mammals
- Human chr22 p20M to p21M (1Mbp)

Use CR and SRR ratio to measure

- Cross validation
- G/H=1, H/G<1 G “outperforms” H
- G/H<1, H/G=1 H “outperforms” G
- G/H<1, H/G<1 G, H are complementary
- G/H=1, H/G=1 G, H are similar

H. H. Do, K. P. Choi, F. P. Preparata, W. K. Sung, L. Zhang. Spectrum-based de novo repeat detection in genomic sequences. Journal of Computational Biology, 15(5):469-487, June 2008

Download Presentation

Connecting to Server..