Spectrum-based de novo repeat detection in genomic sequences

Do Huy Hoang

- Introduction
- What is a repeat?
- Why study repeats?

- Related work
- SAGRI
- Algorithm
- Analysis

- Evaluation

Introduction

- [General]: Nucleotide sequences that occur multiple times within a genome
- [CompBio]: Given a genome sequence S, find a string P which occurs at least twice in S (allowing some errors).

- Motifs
- Very short repeats (10-20bp)
- Transcription factor binding sites

- Long and Short interspersed elements (SINE, LINE)
- Jumping genes

- Genes and Pseudogenes
- Tandem repeats
- Simple short sequence repeats, e.g. (A)n, (CGG)n

- Eukaryotic genomes contain a lot of repeats
- E.g. about 50% of the human genome consists of repeats.

- Repeats are believed to play an important role in evolution and disease.
- E.g. Alu elements are particularly prone to recombination. Insertions of Alu repeats inactivate genes in patients with hemophilia and neurofibromatosis (Kazazian, 1998; Deininger and Batzer, 1999).

- Repeats are important to chromatin structure.
- Most TEs in mammals seem to be silenced by methylation. Alu sequences are a major target for histone H3-Lys9 methylation in humans (Kondo and Issa, 2003).
- Heterochromatin is known to contain many SINE and LINE repeats.

- Repeats complicate sequence assembly and genome comparison
- Many people remove repeats before they analyze the genome.

- Repeats hinder microarray probe signal analysis
- The probe signal may be inaccurate if the probe sequence overlaps with repeat regions.

- Repeats may contribute to human diversity more than genes.
- Repeats can be used as DNA fingerprints

- Repeat library (RepeatMasker)
- De-novo repeat discovery (two steps):
- Identification of repeats
- Classification of repeats

SAGRI algorithm

- Input: a text G
- FindHit phase: finds all candidate second occurrences of repeat regions
- ACGACGCGATTAACCCTCGACGTGATCCTC

- Validation phase: uses hits from phase 1 to find all pairs of repeats
- ACGACGCGATTAACCCTCGACGTGATCCTC

- What is a spectrum?
- Given a string G, its spectrum is the set of all k-mers occurring in G.
- E.g. k=3, G = ACGACGCTCACCCT
- The spectrum is ACC, ACG, CAC, CCC, CCT, CGA, CGC, CTC, GAC, GCT, TCA
- CTC is a k-mer occurring at position 7.
- ACG is a k-mer occurring at positions 1, 4.
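
The spectrum in this example can be computed in a few lines. The sketch below is only an illustration: the function name `spectrum_with_positions` is made up here, and positions are reported 1-based to match the slide.

```python
from collections import defaultdict


def spectrum_with_positions(genome: str, k: int) -> dict[str, list[int]]:
    """Map every k-mer of genome to the 1-based positions where it starts."""
    positions: dict[str, list[int]] = defaultdict(list)
    for start in range(len(genome) - k + 1):
        positions[genome[start:start + k]].append(start + 1)
    return dict(positions)


index = spectrum_with_positions("ACGACGCTCACCCT", 3)
print(sorted(index))               # the spectrum, in alphabetical order
print(index["CTC"], index["ACG"])  # where CTC and ACG occur
```

The keys of the dictionary form the spectrum; keeping the positions as well is convenient later, when a repeated k-mer has to be traced back to its earlier occurrence.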

- Two regions of repeats should share some k-mers.
- E.g. the following repeats share CGA.
ACGACGCGATTAACCCTCGACGTGATCCTC

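- Given the spectrum S for G[1..i-1], which bases can extend the k-mer at position i?
- E.g. k=3, G = ACGACGTGATTAACCCTCGACGTGATCCTC, and the k-mer at position i = 18 is CGA.
- A: not feasible (GAA is not in S)
- C: feasible extension (GAC is in S)
- G: not feasible (GAG is not in S)
- T: feasible extension (GAT is in S)
- Note: T is called a fooling probe! It is feasible because GAT occurs in the prefix, but neither occurrence of CGA is followed by T.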
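
The feasibility test can be sketched as a spectrum lookup. This is a minimal illustration, not the authors' code: `feasible_extensions` is a name chosen here, and the spectrum is assumed to hold every k-mer that starts before position i.

```python
def feasible_extensions(spectrum: set[str], kmer: str) -> list[str]:
    """Bases b such that (last k-1 bases of kmer) + b is a k-mer of the spectrum."""
    suffix = kmer[1:]
    return [base for base in "ACGT" if suffix + base in spectrum]


genome = "ACGACGTGATTAACCCTCGACGTGATCCTC"
k, i = 3, 18  # i is 1-based, as on the slide

# Assumption: the spectrum holds the k-mers starting at positions 1..i-1.
prefix_spectrum = {genome[s:s + k] for s in range(i - 1)}
current = genome[i - 1:i - 1 + k]

print(current, feasible_extensions(prefix_spectrum, current))
```

For CGA the lookup tests GAA, GAC, GAG and GAT, so only C and T survive, matching the slide.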

- A path of feasible extensions may be a repeat.
- Example (k=3):
G = ACGACGCTATCGATGCCCTC

Spectrum S for G[1..10] is

ACG, CGA, CGC, CTA, GAC, GCT, TAT

Starting from position 11, there exists a path of feasible extensions:

CGA-C-G-C

This path corresponds to the length-6 substring at position 2 (CGACGC).

Also, this path has one mismatch compared with the length-6 substring at position 11 (CGATGC).
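
The claims in this example can be checked mechanically. The sketch below uses two hypothetical helpers, `is_feasible_path` and `hamming`, and builds the spectrum from the k-mers lying entirely inside G[1..10], as the example does.

```python
def is_feasible_path(spectrum: set[str], path: str, k: int) -> bool:
    """True if every k-mer along the path belongs to the spectrum."""
    return all(path[s:s + k] in spectrum for s in range(len(path) - k + 1))


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    if len(a) != len(b):
        raise ValueError("strings must have equal length")
    return sum(x != y for x, y in zip(a, b))


genome = "ACGACGCTATCGATGCCCTC"
k = 3
prefix = genome[:10]  # G[1..10]
spectrum = {prefix[s:s + k] for s in range(len(prefix) - k + 1)}

path = "CGACGC"          # CGA-C-G-C
current = genome[10:16]  # the length-6 substring at position 11

print(sorted(spectrum))
print(is_feasible_path(spectrum, path, k), genome.find(path) + 1)
print(current, hamming(path, current))
```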

Algorithm:

Input: a text G of length n

- Initialize the empty spectrum S
- For i = 1 to n
/* we maintain the invariant that S is a spectrum for G[1..i-1] */

- Let x be the k-mer at position i
- If x exists in S, run DetectRepSeq(S,i);
- Insert x into S

- Note: DetectRepSeq(S,i) looks for a repeat occurring at position i.
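
The main loop can be sketched as follows. This is an illustration under stated assumptions: `find_hits` is a name chosen here, the detection step is passed in as a callback instead of being implemented, and the loop stops at the last position where a full k-mer exists.

```python
from typing import Callable


def find_hits(genome: str, k: int, detect: Callable[[set[str], int], None]) -> None:
    """Scan genome left to right, growing the spectrum one k-mer at a time."""
    spectrum: set[str] = set()
    for i in range(1, len(genome) - k + 2):  # 1-based start positions
        kmer = genome[i - 1:i - 1 + k]
        if kmer in spectrum:
            # The k-mer was seen before: position i may start a repeat.
            detect(spectrum, i)
        spectrum.add(kmer)


triggered: list[int] = []
find_hits("ACGAAGTGATTAACCCTCGACGCGATCC", 3, lambda s, i: triggered.append(i))
print(triggered)
```

On this 28-base sequence with k = 3 the callback first fires at position 18, where CGA reappears.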

Example: DetectRepSeg(S(18), 18) with k = 3

G = ACGAAGTGATTAACCCTCGACGCGATCC (positions 1 to 28)

Spectrum S(18): AAC, AAG, ACC, ACG, AGT, ATT, CCC, CCT, CGA, CTC, GAA, GAT, GTG, TAA, TCG, TGA, TTA

Curr (positions 18 to 28): CGA C G C G A T C T

Paths of feasible extensions explored from CGA (Ref), compared against Curr. Each label is an extension base followed by the number of mismatches against Curr so far; * marks a pruned branch.

CGA-T1-T2-A3*
CGA-A1-G1-T2-G2-A2-T2-T3*
CGA-A1-C2-C2-C3*
CGA-A1-C2-G3*
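
The search in this walkthrough behaves like a depth-first traversal over feasible extensions with a mismatch budget. The sketch below is one possible reading, not the authors' implementation: the name `explore_paths` is made up, a branch is assumed to be pruned once it reaches `prune_at` mismatches (3 in the example), and the backward extension and the backtracking limit h are left out.

```python
def explore_paths(genome: str, i: int, k: int, prune_at: int) -> list[tuple[str, int]]:
    """Follow feasible extensions from the k-mer at 1-based position i (k >= 2).

    Returns (path, mismatches) for every branch end: a branch ends when its
    mismatch count reaches prune_at or when no base can extend it.
    """
    start = i - 1
    spectrum = {genome[s:s + k] for s in range(start)}  # k-mers starting before i
    ends: list[tuple[str, int]] = []

    def dfs(path: str, mismatches: int) -> None:
        pos = start + len(path)  # 0-based index of the next base of Curr
        extended = False
        if pos < len(genome):
            for base in "ACGT":
                if path[-(k - 1):] + base not in spectrum:
                    continue  # not a feasible extension
                extended = True
                total = mismatches + (base != genome[pos])
                if total >= prune_at:
                    ends.append((path + base, total))  # pruned branch
                else:
                    dfs(path + base, total)
        if not extended:
            ends.append((path, mismatches))

    dfs(genome[start:start + k], 0)
    return ends


for path, mismatches in explore_paths("ACGAAGTGATTAACCCTCGACGCGATCC", 18, 3, 3):
    print(path, mismatches)
```

Among its results the sketch reproduces the four pruned branches of the worked example (CGATTA, CGAAGTGATT, CGAACCC and CGAACG), each ending with three mismatches.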

- Extend backward
- Stop backtracking after h steps

- Decompose the hits into a set of k-mers and index all the locations of these k-mers.
- For each pair of locations of a k-mer w in the hits, do a BLAST extension
- Use some auxiliary data structure to avoid double checking

- Report the pairs whose length exceeds our threshold
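
The validation steps above can be sketched as follows. Several things here are assumptions made for illustration: hits are given as 0-based half-open intervals, a simple ungapped right-extension with a mismatch budget stands in for the BLAST extension, and a set of already-covered seed pairs plays the role of the auxiliary data structure. The names `extend_right` and `validate_hits` are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def extend_right(genome: str, p: int, q: int, k: int, max_mismatches: int) -> int:
    """Ungapped extension of the seed at 0-based p < q; length ends on a match."""
    length = best = k
    mismatches = 0
    while q + length < len(genome) and p + length < q:
        if genome[p + length] == genome[q + length]:
            length += 1
            best = length
        else:
            mismatches += 1
            if mismatches > max_mismatches:
                break
            length += 1
    return best


def validate_hits(genome, hits, k, max_mismatches, min_length):
    """Return (p, q, length) for repeat pairs seeded by k-mers found in the hits."""
    hit_kmers = {genome[s:s + k] for a, b in hits for s in range(a, b - k + 1)}
    locations = defaultdict(list)
    for s in range(len(genome) - k + 1):
        if genome[s:s + k] in hit_kmers:
            locations[genome[s:s + k]].append(s)
    seeds = sorted(pair for locs in locations.values() for pair in combinations(locs, 2))
    covered = set()  # seed pairs lying inside an extension that was already checked
    pairs = []
    for p, q in seeds:
        if (p, q) in covered:
            continue
        length = extend_right(genome, p, q, k, max_mismatches)
        covered.update((p + off, q + off) for off in range(length - k + 1))
        if length >= min_length:
            pairs.append((p, q, length))
    return pairs


genome = "ACGTTGCA" + "GG" + "ACGTTGCA"
print(validate_hits(genome, [(10, 18)], 3, 0, 6))
```

Processing the seeds in position order means the leftmost seed of a repeat is extended first, so the later seeds on the same diagonal are skipped instead of being extended again.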

Analysis

- How to find most repeats?
- Avoid false negatives

- How to get better speed?
- Avoid false positives

- If k is too big,
- the k-mer is too specific and we may miss some repeats

- If k is too small,
- the k-mer cannot help us differentiate repeats from non-repeats

- For repeats of length 50 and similarity > 0.9,
- we found that $k \ge \log_4 n + 2$ is good enough.

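- A random k-mer matches one of n chosen k-mers
- Pr(a k-mer re-occurs by chance in a sequence of length n)
- (analogous to throwing n balls into $4^k$ bins)
- $1 - (1 - 4^{-k})^n \approx 1 - \exp(-n/4^k)$

- We require $1 - \exp(-n/4^k) \le \epsilon_1$,
- since $1 - e^{-x} \le x$, it suffices that $n/4^k \le \epsilon_1$,
- hence $k \ge \log_4 n + \log_4(1/\epsilon_1)$.
- If we set $\epsilon_1 = 1/16$, $k \ge \log_4 n + 2$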
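
A quick numeric check of this bound, as a sketch. The function names `min_k` and `chance_hit_probability` are made up for the illustration, and the genome length of one million bases is just a sample value.

```python
import math


def min_k(n: int, eps: float) -> int:
    """Smallest integer k with k >= log4(n) + log4(1/eps)."""
    return math.ceil(math.log2(n) / 2 + math.log2(1 / eps) / 2)


def chance_hit_probability(n: int, k: int) -> float:
    """Probability that a fixed k-mer matches at least one of n random k-mers."""
    return 1 - (1 - 4.0 ** -k) ** n


n = 10 ** 6
k = min_k(n, 1 / 16)
exact = chance_hit_probability(n, k)
approx = 1 - math.exp(-n / 4 ** k)
print(k, exact, approx)
```

For a sequence of this length the bound gives k = 12, and the exact and approximate probabilities agree closely and stay below 1/16.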

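- A pair of repeats of length L, with m mismatches
- The m mismatches (X) split the repeat into m+1 runs of matching positions, of lengths $x_1, x_2, \ldots, x_{m+1}$
- Probability of a preserved k-mer in the repeat is $1 - M / \binom{L}{m}$
- M is the number of nonnegative integer solutions
- to $x_1 + x_2 + \cdots + x_{m+1} = L - m$ subject to $x_i \le k - 1$ for all i

- It is easy to see that M is the coefficient of $x^{L-m}$ in $(1 + x + \cdots + x^{k-1})^{m+1}$
- Hence $M = \sum_{j \ge 0} (-1)^j \binom{m+1}{j} \binom{L - jk}{m}$, where terms with $L - jk < m$ are zero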
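
The counting argument can be checked numerically. The sketch below rests on assumptions that the slide does not spell out: the m mismatch positions are taken to be equally likely among all $\binom{L}{m}$ placements, a k-mer counts as preserved when k consecutive positions are free of mismatches, and M is evaluated by inclusion-exclusion on the coefficient of $x^{L-m}$. All function names are made up for the illustration.

```python
from itertools import combinations
from math import comb


def count_no_preserved(L: int, m: int, k: int) -> int:
    """Placements of m mismatches in L positions with every match run shorter than k."""
    total = 0
    for j in range(m + 2):
        if L - j * k < m:
            break  # the remaining terms are zero
        total += (-1) ** j * comb(m + 1, j) * comb(L - j * k, m)
    return total


def prob_preserved(L: int, m: int, k: int) -> float:
    """Probability that at least one k-mer survives the m mismatches."""
    return 1 - count_no_preserved(L, m, k) / comb(L, m)


def brute_force_no_preserved(L: int, m: int, k: int) -> int:
    """Same count as count_no_preserved, by enumerating all placements."""
    count = 0
    for mism in combinations(range(L), m):
        bounds = (-1, *mism, L)
        if all(b - a - 1 < k for a, b in zip(bounds, bounds[1:])):
            count += 1
    return count


print(count_no_preserved(12, 2, 4), brute_force_no_preserved(12, 2, 4))
print(prob_preserved(50, 5, 12))
```

The brute-force count is only practical for small L and m, but it gives an independent check on the closed form.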

- Instead of fixing the number of mismatches, we may want to fix the percentage of mismatches, say, 10%.
- Then, the pruning strategy is length dependent.
- If the length of the string is r, we allow $\tau(r)$ mismatches.

- Let q be the mismatch probability and r be the length of the string.
- Prob that a string has s mismatches = $P_q(s) = \binom{r}{s} q^s (1-q)^{r-s}$

- For a threshold $\epsilon$ (say, 0.01), we set
- $\tau(r) = \max \{\, 2 \le s \le r-2 \mid P_q(s) > \epsilon \,\} + 2$
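
A length-dependent mismatch budget of this kind can be computed directly. The sketch assumes that the number of mismatches in a string of length r follows a binomial distribution with mismatch probability q, and that the budget is the largest s between 2 and r-2 whose probability exceeds the threshold, plus 2. The function names, and the fallback used when no s qualifies, are choices made for this illustration.

```python
from math import comb


def mismatch_probability(r: int, s: int, q: float) -> float:
    """Binomial probability of exactly s mismatches in a string of length r."""
    return comb(r, s) * q ** s * (1 - q) ** (r - s)


def allowed_mismatches(r: int, q: float, threshold: float) -> int:
    """Largest s in [2, r-2] with probability above threshold, plus 2."""
    candidates = [s for s in range(2, r - 1)
                  if mismatch_probability(r, s, q) > threshold]
    # Assumption: if no s qualifies, fall back to a budget of 2.
    return max(candidates, default=0) + 2


for r in (10, 20, 50, 100):
    print(r, allowed_mismatches(r, q=0.1, threshold=0.01))
```

The budget grows with r, so short paths are pruned after a few mismatches while long paths are allowed proportionally more.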

- Two typical cases
- The probability of (case 1)/ (case 2)
is 2*4-

- P(case1 or case2) is small
- For example: 4 errors, q=0.1, k = 12, P(case 1) = $1.77 \times 10^{-8}$

Evaluation

Compare with other programs

- EulerAlign by Zhang and Waterman
- PALS by Edgar and Myers
- REPuter by Kurtz et al.
- SAGRI

- Count Ratio (CR): the ratio of the number of repeat pairs that share more than 50% with a reference pair to the number of reference pairs.
- Shared Repeat Region (SRR): the ratio of the found region to the reference region.
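
The two measures can be sketched on interval data. The definitions above leave some details open, so this is one possible reading and not the evaluation code used in the study: a repeat pair is represented as two 0-based half-open intervals, a reference pair counts as recovered when some found pair overlaps more than 50% of its bases, and SRR is taken as the fraction of reference bases covered by found regions. All names are hypothetical.

```python
Interval = tuple[int, int]
Pair = tuple[Interval, Interval]


def overlap(a: Interval, b: Interval) -> int:
    """Number of positions shared by two half-open intervals."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]))


def shared_fraction(found: Pair, ref: Pair) -> float:
    """Fraction of the reference pair's bases covered by the found pair."""
    shared = overlap(found[0], ref[0]) + overlap(found[1], ref[1])
    total = (ref[0][1] - ref[0][0]) + (ref[1][1] - ref[1][0])
    return shared / total


def count_ratio(found: list[Pair], reference: list[Pair]) -> float:
    """Reference pairs recovered by more than 50%, over all reference pairs."""
    hits = sum(any(shared_fraction(f, r) > 0.5 for f in found) for r in reference)
    return hits / len(reference)


def shared_repeat_region(found: list[Pair], reference: list[Pair]) -> float:
    """Fraction of reference positions that fall inside some found region."""
    def positions(pairs: list[Pair]) -> set[int]:
        return {p for pair in pairs for a, b in pair for p in range(a, b)}
    ref_positions = positions(reference)
    return len(ref_positions & positions(found)) / len(ref_positions)


reference = [((0, 10), (20, 30)), ((40, 50), (60, 70))]
found = [((0, 8), (22, 30))]
print(count_ratio(found, reference), shared_repeat_region(found, reference))
```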

- Conclusion from simulated data
- The result is consistent with the analysis

- M.gen (0.6 Mbp)
- Organism with the smallest genome
- Lives in the primate genital and respiratory tracts

- C.tra (1 Mbp)
- Lives inside human cells

- A.ful (2.1 Mbp)
- Found in high-temperature oil fields

- E.coli (4 Mbp)
- An important bacterium that lives in the lower intestines of mammals

- Human chr22 p20M to p21M (1Mbp)

- Use CR and SRR ratio to measure
- Cross validation
- G/H=1, H/G<1: G “outperforms” H
- G/H<1, H/G=1: H “outperforms” G
- G/H<1, H/G<1: G, H are complementary
- G/H=1, H/G=1: G, H are similar

- H. H. Do, K. P. Choi, F. P. Preparata, W. K. Sung, L. Zhang. Spectrum-based de novo repeat detection in genomic sequences. Journal of Computational Biology, 15(5):469-487, June 2008