Spectrum-based de novo repeat detection in genomic sequences


Do Huy Hoang

Outline
  • Introduction
    • What is a repeat?
    • Why study repeats?
  • Related work
  • SAGRI
    • Algorithm
    • Analysis
  • Evaluation
What is a repeat? (Definition)
  • [General]: Nucleotide sequences occurring multiply within a genome
  • [CompBio]: Given a genome sequence S, find a string P which occurs at least twice in S (allowing some errors).
What is a repeat? (Function)
  • Motifs
    • Very short repeats (10-20bp)
    • Transcription factor binding sites
  • Long and Short interspersed elements (SINE, LINE)
    • Jumping genes
  • Genes and Pseudogenes
  • Tandem repeats
    • Simple short sequence repeats, e.g. (A)n, (CGG)n
Why study repeats? (1)
  • Eukaryotic genomes contain a lot of repeats.
    • E.g. repeats make up 50% of the human genome.
  • Repeats are believed to play an important role in evolution and disease.
    • E.g. Alu elements are particularly prone to recombination. Insertions of Alu repeats inactivate genes in patients with hemophilia and neurofibromatosis (Kazazian, 1998; Deininger and Batzer, 1999).
  • Repeats are important to chromatin structure.
    • Most transposable elements (TEs) in mammals seem to be silenced by methylation. Alu sequences are a major target for histone H3-Lys9 methylation in humans (Kondo and Issa, 2003).
    • Heterochromatin is known to contain many SINE and LINE repeats.
Why study repeats? (2)
  • Repeats complicate sequence assembly and genome comparison.
    • Many people remove repeats before they analyze the genome.
  • Repeats set hurdles for microarray probe signal analysis.
    • The probe signal may be inaccurate if the probe sequence overlaps with repeat regions.
  • Repeats may contribute to human diversity more than genes.
  • Repeats can be used as a DNA fingerprint.
Steps in Repeat finding
  • Repeat library (RepeatMasker)
  • De-novo repeat discovery (two steps):
    • Identification of repeats
    • Classification of repeats
Algorithm outline
  • Input: a text G
  • FindHit phase: finds all candidate second occurrences of repeat regions
    • ACGACGCGATTAACCCTCGACGTGATCCTC
  • Validation phase: uses hits from phase 1 to find all pairs of repeats
    • ACGACGCGATTAACCCTCGACGTGATCCTC
Spectrum-based repeat finder
  • What is a spectrum?
    • Given a string G, its spectrum is the set of all k-mers occurring in G.

E.g. k=3, G= ACGACGCTCACCCT

The spectrum is

  • ACC, ACG, CAC, CCC, CCT, CGA, CGC, CTC, GAC, GCT, TCA
    • CTC is a k-mer occurring at position 7.
    • ACG is a k-mer occurring at positions 1, 4.
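
The spectrum example above can be reproduced with a few lines of Python. This is only an illustrative sketch: the function name `spectrum_positions` is hypothetical, and positions are reported 1-based to match the slide.

```python
from collections import defaultdict


def spectrum_positions(g: str, k: int) -> dict[str, list[int]]:
    """Map each k-mer of g to the 1-based positions where it starts."""
    positions = defaultdict(list)
    for start in range(len(g) - k + 1):
        positions[g[start:start + k]].append(start + 1)
    return dict(positions)


occ = spectrum_positions("ACGACGCTCACCCT", 3)
print(sorted(occ))             # the spectrum, in alphabetical order
print(occ["CTC"], occ["ACG"])  # [7] [1, 4]
```

The keys of the dictionary form the spectrum; keeping the positions as values is a convenience for later lookups, not something the slide requires.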
Observation 1: How to find candidate regions containing repeats?
  • Two regions of repeats should share some k-mers.
    • E.g. the following repeats share CGA.

ACGACGCGATTAACCCTCGACGTGATCCTC

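Feasible extension (bud)

G = ACGACGTGATTAACCCTCGACGTGATCCTC

  • Given the spectrum S for G[1..i-1] and the k-mer CGA at position i, a nucleotide c is a feasible extension of CGA if the k-mer GAc is in S:
    • A: not feasible (X)
    • C: feasible
    • G: not feasible (X)
    • T: feasible
  • Note: T is called a fooling probe!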
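
A minimal sketch of the feasibility check, assuming that S holds every k-mer starting at positions 1..i-1 and that an extension c is feasible when the last k-1 letters of the current k-mer followed by c form a k-mer in S. The function name `feasible_extensions` is hypothetical.

```python
def feasible_extensions(spectrum: set[str], kmer: str) -> list[str]:
    """Return the nucleotides c such that kmer[1:] + c is in the spectrum."""
    return [c for c in "ACGT" if kmer[1:] + c in spectrum]


g = "ACGACGTGATTAACCCTCGACGTGATCCTC"
k, i = 3, 18  # i is the 1-based start of the second CGA
# Assumption: the spectrum contains the k-mers starting at positions 1..i-1.
prefix_spectrum = {g[s:s + k] for s in range(i - 1)}
current = g[i - 1:i - 1 + k]
print(current, feasible_extensions(prefix_spectrum, current))  # CGA ['C', 'T']
```

GAC and GAT both occur before position i, so C and T are feasible; GAA and GAG do not, so A and G are rejected.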

Observation 2
  • A path of feasible extensions may be a repeat.
  • Example:

G = ACGACGCTATCGATGCCCTC

Spectrum S for G[1..10] is

ACG, CGA, CGC, CTA, GAC, GCT, TAT

Starting from position 11, there exists a path of feasible extensions:

CGA-C-G-C

This path corresponds to a length-6 substring in position 2.

Also, this path has one mismatch compared with the length-6 substring at position 11 (CGATGC).
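
The example for Observation 2 can be checked mechanically. The sketch below uses the spectrum exactly as listed on the slide; the helper names `follow_path` and `hamming` are hypothetical.

```python
def follow_path(spectrum: set[str], seed: str, letters: str) -> bool:
    """Return True if appending each letter in turn is a feasible extension."""
    kmer = seed
    for c in letters:
        kmer = kmer[1:] + c
        if kmer not in spectrum:
            return False
    return True


def hamming(a: str, b: str) -> int:
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


g = "ACGACGCTATCGATGCCCTC"
s10 = {"ACG", "CGA", "CGC", "CTA", "GAC", "GCT", "TAT"}  # spectrum as listed on the slide
path = "CGA" + "CGC"                     # the path CGA-C-G-C
current = g[10:16]                       # length-6 substring at position 11

print(follow_path(s10, "CGA", "CGC"))    # True
print(g.find(path) + 1)                  # 2
print(current, hamming(path, current))   # CGATGC 1
```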

Phase 1: FindHit()

Algorithm:

Input: a text G

  • Initialize the empty spectrum S
  • For i = 1 to n-k+1, where n = |G|

/* we maintain the invariant that S is a spectrum for G[1..i-1] */

    • Let x be the k-mer at position i
    • If x exists in S, run DetectRepSeq(S,i);
    • Insert x into S
  • Note: DetectRepSeq(S,i) looks for repeat occurring at position i.
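
The loop above can be sketched as follows. This is an illustration under assumptions, not the authors' implementation: `find_hit` and the `detect` callback are hypothetical names, and the callback here only records the position instead of searching for a repeat.

```python
from typing import Callable


def find_hit(g: str, k: int, detect: Callable[[set[str], int], None]) -> None:
    """Scan g left to right, calling detect(S, i) whenever the k-mer at i is already in S."""
    spectrum: set[str] = set()
    for i in range(1, len(g) - k + 2):   # 1-based start positions 1..n-k+1
        x = g[i - 1:i - 1 + k]           # the k-mer at position i
        if x in spectrum:
            detect(spectrum, i)          # stand-in for DetectRepSeq(S, i)
        spectrum.add(x)                  # S now also covers position i


hits: list[int] = []
find_hit("ACGAAGTGATTAACCCTCGACGCGATCC", 3, lambda s, i: hits.append(i))
print(hits)  # [18, 20, 23, 24]
```

Inserting x only after the check is what keeps S restricted to k-mers that start before position i.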
Example: DetectRepSeg(S(18), 18), with k = 3

G = ACGAAGTGATTAACCCTCGACGCGATCC

Spectrum S(18):

AAC, AAG, ACC, ACG, AGT, ATT, CCC, CCT, CGA, CTC, GAA, GAT, GTG, TAA, TCG, TGA, TTA

Curr (positions 18 to 28): CGA C G C G A T C T

Ref (paths of feasible extensions starting from CGA):

CGA-T1-T2-A3*

A1-G1-T2-G2-A2-T2-T3*

C2-C2-C3*

G3*
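
A plain depth-first enumeration gives a feel for how such paths arise. This sketch rests on a reading of the figure that the slide does not state explicitly: the number after each letter is taken to be the running mismatch count against the current occurrence, and `*` is taken to mark a path that stops at three mismatches. The function name `detect_paths` and the stopping rule are assumptions.

```python
def detect_paths(spectrum, g, i, k, max_mismatch):
    """Enumerate paths of feasible extensions from the k-mer at 1-based position i.

    Each step is written as letter + mismatches so far. A path ends when the
    mismatch count reaches max_mismatch, no extension is feasible, or g ends.
    """
    paths = []

    def extend(kmer, pos, mismatches, trail):
        # pos: 0-based index in g of the letter the next extension is compared with
        if pos < len(g):
            options = [c for c in "ACGT" if kmer[1:] + c in spectrum]
        else:
            options = []
        if mismatches >= max_mismatch or not options:
            paths.append("-".join(trail))
            return
        for c in options:
            m = mismatches + (c != g[pos])
            extend(kmer[1:] + c, pos + 1, m, trail + [f"{c}{m}"])

    extend(g[i - 1:i - 1 + k], i - 1 + k, 0, [])
    return paths


g = "ACGAAGTGATTAACCCTCGACGCGATCC"
k, i = 3, 18
s18 = {g[s:s + k] for s in range(i - 1)}  # k-mers starting at positions 1..17
paths = detect_paths(s18, g, i, k, max_mismatch=3)
for p in paths:
    print(p)
```

Under these assumptions the enumeration contains the four paths drawn in the figure (written out in full from the root) plus two further branches that also stop at three mismatches, so the figure may show only part of the search.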

Other details
  • Extend backward
  • Stop backtracking after h steps
Validation phase
  • Decompose the hits into their k-mers and index all locations of these k-mers.
  • For each pair of locations of a k-mer w in the hits, do a BLAST extension.
    • Use an auxiliary data structure to avoid checking the same pair twice.
  • Report the pairs whose length exceeds our threshold.
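
The steps of the validation phase can be sketched as below. Several simplifications are assumed: the whole text is indexed rather than only the hits, a gap-free extension with a fixed mismatch budget stands in for the BLAST extension, and a per-diagonal list of covered intervals plays the role of the auxiliary data structure. All names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def extend_pair(g, p, q, k, max_mismatch):
    """Extend the seed g[p:p+k] == g[q:q+k] (0-based, p < q) without gaps."""
    left, right, mism = 0, k, 0
    while q + right < len(g):                      # extend to the right
        bad = g[p + right] != g[q + right]
        if mism + bad > max_mismatch:
            break
        mism, right = mism + bad, right + 1
    while p - left - 1 >= 0:                       # extend to the left
        bad = g[p - left - 1] != g[q - left - 1]
        if mism + bad > max_mismatch:
            break
        mism, left = mism + bad, left + 1
    return p - left, q - left, left + right


def validate(g, k, max_mismatch, min_len):
    """Return sorted (start1, start2, length) triples, 1-based, of extended seed pairs."""
    index = defaultdict(list)
    for s in range(len(g) - k + 1):
        index[g[s:s + k]].append(s)
    covered = defaultdict(list)                    # diagonal -> intervals already extended
    pairs = []
    for locs in index.values():
        for p, q in combinations(locs, 2):
            d = q - p
            if any(a <= p < b for a, b in covered[d]):
                continue                           # this seed lies in a checked pair
            s1, s2, length = extend_pair(g, p, q, k, max_mismatch)
            covered[d].append((s1, s1 + length))
            if length >= min_len:
                pairs.append((s1 + 1, s2 + 1, length))
    return sorted(pairs)


result = validate("ACGACGCGATTAACCCTCGACGTGATCCTC", k=3, max_mismatch=1, min_len=8)
print(result)
```

Seeds on the same diagonal (same distance between the two copies) that fall inside an already extended pair are skipped, which is one simple way to avoid double checking.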
Analysis
  • How to find most repeats?
    • Avoid false negatives
  • How to get better speed?
    • Avoid false positives
How do we choose k? (1)
  • If k is too big,
    • the k-mer is too specific and we may miss some repeats.
  • If k is too small,
    • the k-mer cannot help us differentiate repeats from non-repeats.
  • For repeats of length 50 and similarity > 0.9,
    • we found that $k \ge \log_4 n + 2$ is good enough.
How do we choose k? (2)
  • A random k-mer matches one of n chosen k-mers.
  • Pr(a k-mer re-occurs by chance in a sequence of length n)
    • (analogous to throwing n balls into $4^k$ bins)
    • $= 1 - (1 - 4^{-k})^n \approx 1 - \exp(-n/4^k)$.
  • We require $1 - \exp(-n/4^k) \le \varepsilon_1$,
    • hence (using $1 - \exp(-x) \approx x$ for small x), $k \ge \log_4 n + \log_4(1/\varepsilon_1)$.
    • If we set $\varepsilon_1 = 1/16$, $k \ge \log_4 n + 2$.
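
The bound on k is easy to evaluate numerically. In this sketch the function names and the default tolerance of 1/16 are illustrative; base-4 logarithms are computed as half of base-2 logarithms to avoid rounding trouble at exact powers of 4.

```python
import math


def reoccurrence_probability(n: int, k: int) -> float:
    """Pr that a given k-mer matches at least one of n random k-mers."""
    return 1.0 - (1.0 - 4.0 ** -k) ** n


def min_k(n: int, eps: float = 1 / 16) -> int:
    """Smallest integer k with k >= log4(n) + log4(1/eps)."""
    return math.ceil(math.log2(n) / 2 + math.log2(1 / eps) / 2)


n = 10 ** 6
k = min_k(n)
print(k)                                         # 12
print(reoccurrence_probability(n, k) < 1 / 16)   # True
```

For a sequence of one million bases the rule suggests k = 12, for which the chance of a random re-occurrence is a little under 6%.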
The occurrence of false negative (missed repeat) (1)
  • A pair of repeats of length L, with m mismatches (X), which split the repeat into m+1 matching runs of lengths $x_1, x_2, \ldots, x_{m+1}$
  • Probability of a preserved k-mer in repeat is $1 - M / \binom{L}{m}$
  • M is the number of nonnegative integer solutions
    • to $x_1 + x_2 + \cdots + x_{m+1} = L - m$, subject to $0 \le x_i \le k - 1$
The occurrence of false negative (missed repeat) (2)
  • It is easy to see that M is the coefficient of $x^{L-m}$ in $(1 + x + \cdots + x^{k-1})^{m+1}$
  • Hence $M = \sum_{j=0}^{\lfloor (L-m)/k \rfloor} (-1)^j \binom{m+1}{j} \binom{L - jk}{m}$
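
The count M can be cross-checked by brute force on small inputs. The sketch assumes that the m mismatch positions are equally likely to be any m of the L positions, and that a k-mer is preserved exactly when some run between mismatches has length at least k; the closed form used is the inclusion-exclusion expansion of the coefficient described above. Function names are hypothetical.

```python
from itertools import combinations
from math import comb


def count_no_preserved(L: int, m: int, k: int) -> int:
    """M: placements of m mismatches in L positions with every matching run shorter than k."""
    return sum((-1) ** j * comb(m + 1, j) * comb(L - j * k, m)
               for j in range(m + 2) if L - j * k >= m)


def brute_force(L: int, m: int, k: int) -> int:
    """Same count, by enumerating every set of mismatch positions."""
    count = 0
    for mism in combinations(range(L), m):
        bounds = (-1, *mism, L)
        if all(b - a - 1 < k for a, b in zip(bounds, bounds[1:])):
            count += 1
    return count


def preserved_probability(L: int, m: int, k: int) -> float:
    """Probability that at least one k-mer survives the m mismatches."""
    return 1.0 - count_no_preserved(L, m, k) / comb(L, m)


print(count_no_preserved(10, 2, 4), brute_force(10, 2, 4))  # 3 3
print(preserved_probability(10, 2, 4))
```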
Criterion for path termination (1)
  • Instead of fixing the number of mismatches, we may want to fix the percentage of mismatches, say, 10%.
  • Then, the pruning strategy is length dependent.
    • If the length of the strings in the path set is r, we allow $\tau(r)$ mismatches.
Criterion for path termination (2)
  • Let q be the mismatch probability and r be the length of the string.
    • Prob that a string has s mismatches = $P_q(s) = \binom{r}{s} q^s (1-q)^{r-s}$
  • For a threshold $\varepsilon$ (say, 0.01), we set
    • $\tau(r) = \max\{\, 2 \le s \le r-2 \mid P_q(s) > \varepsilon \,\} + 2$
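
A minimal sketch of the length-dependent mismatch allowance, assuming that the number of mismatches in a string of length r follows a binomial distribution with per-position mismatch probability q. The function names, and the fallback of 0 when no s in the range passes the threshold, are assumptions.

```python
from math import comb


def mismatch_pmf(r: int, s: int, q: float) -> float:
    """Binomial probability of exactly s mismatches in a string of length r."""
    return comb(r, s) * q ** s * (1 - q) ** (r - s)


def allowed_mismatches(r: int, q: float = 0.1, threshold: float = 0.01) -> int:
    """max{ s in 2..r-2 : pmf(s) > threshold } + 2."""
    candidates = [s for s in range(2, r - 1) if mismatch_pmf(r, s, q) > threshold]
    # Assumption: if no s qualifies, fall back to 0 before adding 2.
    return (max(candidates) if candidates else 0) + 2


print(allowed_mismatches(50))  # 12
```

With q = 0.1 and a threshold of 0.01, a path of length 50 would be allowed 12 mismatches before it is pruned.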
Control of false positives (1)
  • Two typical cases
  • The probability of (case 1) / (case 2) is … $2 \cdot 4^{-\ldots}$
  • P(case 1 or case 2) is small
    • For example: 4 errors, q = 0.1, k = 12, P(case 1) = $1.77 \times 10^{-8}$
Evaluation

Compare with other programs

Programs
  • EulerAlign by Zhang and Waterman
  • PALS by Edgar and Myers
  • REPuter by Kurtz et al.
  • SAGRI
Measurement
  • Count Ratio (CR): the ratio of the number of repeat pairs that share more than 50% with a reference pair to the number of reference pairs.
  • Shared Repeat Region (SRR): the ratio of the found region to the reference region.
Simulated data
  • Conclusion from simulated data
    • The result is consistent with the analysis
Genome data
  • M.gen (0.6 Mbp)
    • Organism with the smallest genome
    • Lives in the primate genital and respiratory tracts
  • C.tra (1 Mbp)
    • Lives inside the cells of humans
  • A.ful (2.1 Mbp)
    • Found in high-temperature oil fields
  • E.coli (4 Mbp)
    • An important bacterium that lives in the lower intestines of mammals
  • Human chr22 p20M to p21M (1 Mbp)
Use CR and SRR ratios to measure
  • Cross validation
    • G/H=1, H/G<1 ⇒ G “outperforms” H
    • G/H<1, H/G=1 ⇒ H “outperforms” G
    • G/H<1, H/G<1 ⇒ G, H are complementary
    • G/H=1, H/G=1 ⇒ G, H are similar
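
The four cases can be written as a small decision function. This is only a sketch: the name `cross_validate` and the numeric tolerance are assumptions, and the two inputs are whatever ratio (CR or SRR) is obtained when one program's output is scored against the other's.

```python
def cross_validate(g_over_h: float, h_over_g: float, tol: float = 1e-9) -> str:
    """Classify two programs G and H from their mutual ratios."""
    g_full = abs(g_over_h - 1.0) <= tol   # G/H = 1
    h_full = abs(h_over_g - 1.0) <= tol   # H/G = 1
    if g_full and h_full:
        return "G and H are similar"
    if g_full:
        return "G outperforms H"
    if h_full:
        return "H outperforms G"
    return "G and H are complementary"


print(cross_validate(1.0, 0.8))   # G outperforms H
print(cross_validate(0.7, 0.9))   # G and H are complementary
```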
H. H. Do, K. P. Choi, F. P. Preparata, W. K. Sung, L. Zhang. Spectrum-based de novo repeat detection in genomic sequences. Journal of Computational Biology, 15(5):469-487, June 2008