Spectrum based de novo repeat detection in genomic sequences
Download
1 / 45

Spectrum-based de novo repeat detection in genomic sequences - PowerPoint PPT Presentation


  • 82 Views
  • Uploaded on
  • Presentation posted in: General

Spectrum-based de novo repeat detection in genomic sequences. Do Huy Hoang. Outline. Introduction What is a repeat? Why studying repeats? Related work SAGRI Algorithm Analysis Evaluation. Introduction. What is a repeat?(Definition).

loader
I am the owner, or an agent authorized to act on behalf of the owner, of the copyrighted work described.
capcha

Download Presentation

Spectrum-based de novo repeat detection in genomic sequences

An Image/Link below is provided (as is) to download presentation

Download Policy: Content on the Website is provided to you AS IS for your information and personal use and may not be sold / licensed / shared on other websites without getting consent from its author.While downloading, if for some reason you are not able to download a presentation, the publisher may have deleted the file from their server.


- - - - - - - - - - - - - - - - - - - - - - - - - - E N D - - - - - - - - - - - - - - - - - - - - - - - - - -

Presentation Transcript


Spectrum-based de novo repeat detection in genomic sequences

Do Huy Hoang


Outline

  • Introduction

    • What is a repeat?

    • Why studying repeats?

  • Related work

  • SAGRI

    • Algorithm

    • Analysis

  • Evaluation


Introduction


What is a repeat?(Definition)

  • [General]: Nucleotide sequences occurring multiply within a genome

  • [CompBio]: Given a genome sequence S, find a string P which occurs at least twice in S (allowing some errors).


What is a repeat? (Function)

  • Motifs

    • Very short repeats (10-20bp)

    • Transcription factor binding sites

  • Long and Short interspersed elements (SINE, LINE)

    • Jumping genes

  • Genes and Pseudogenes

  • Tandem repeats

    • Simple short sequence repeats An, CGGn


Why studying repeats? (1)

  • Eukaryotic genomes contain a lot of repeats

    • E.g. Human genome contains 50% repeats.

  • Repeats are believed to play an important role in evolution and disease.

    • E.g. Alu elements are particularly prone to recombination. Insertion of Alu repeats inactivate genes in patient with hemophilia and neurofibromatosis (Kazazian, 1998; Deininger and Batzer, 1999)

  • Repeats are important to chromatin structure.

    • Most TEs in mammals seem to be silenced by methylation. Alu sequences are major target for histone H3-Lys9 methylation in humans (Kondo and Issa, 2003).

    • It is known that heterochromatin have a lot of SINE and LINE repeats.


Why studying repeats? (2)

  • Repeats complicated sequence assembly and genome comparison

    • Many people remove repeats before they analyze the genome.

  • Repeats set hurdles on microarray probe signal analysis

    • The probe signal may be inaccurate if the probe sequence overlap with repeat regions.

  • Repeats may contribute to human diversity more than genes.

  • Repeats can be used as DNA fingerprint


Steps in Repeat finding

  • Repeat library (RepeatMasker)

  • De-novo repeat discovery (two steps):

    • Identification of repeats

    • Classification of repeats


SAGRI algorithm


Algorithm outline

  • Input: a text G

  • FindHitphase: finds all candidate of second occurrence of repeat regions

    • ACGACGCGATTAACCCTCGACGTGATCCTC

  • Validation phase: uses hits from phase 1 to find all pairs of repeats

    • ACGACGCGATTAACCCTCGACGTGATCCTC


Spectrum-based repeat finder

  • What is a spectrum?

    • Given a string G, its spectrum is the set of all k-mers.

      E.g. k=3, G= ACGACGCTCACCCT

      The spectrum is

  • ACC, ACG, CAC, CCC, CCT, CGA, CGC, CTC, GAC, GCT, TCA

    • CTC is a k-mer occurring at position 7.

    • ACG is a k-mer occurring at positions 1, 4.


Observation 1: How to find candidate regions containing repeats?

  • Two regions of repeats should share some k-mers.

    • E.g. the following repeats share CGA.

      ACGACGCGATTAACCCTCGACGTGATCCTC


Feasible extension (bud)

i

S = ACGACGTGATTAACCCTCGACGTGATCCTC

  • Given the spectrum S for G[1..i-1]:

i

CGA

A X

C 

Feasible extensions!

G X

T 

Note: T is called a fooling probe!


Observation 2

  • A path of feasible extensions may be a repeat.

  • Example:

    S = ACGACGCTATCGATGCCCTC

    Spectrum S for G[1..10] is

    ACG, CGA, CGC, CTA, GAC, GCT, TAT

    Starting from position 11, there exists a path of feasible extensions:

    CGA-C-G-C

    This path corresponds to a length-6 substring in position 2.

    Also, this path has one mismatch compare with the length-6 substring for position 11 (CGATGC).

11


Phase 1: FindHit()

Algorithm:

Input: a text G

  • Initialize the empty spectrum S

  • For i = 1 to n

    /* we maintain the variant that S is a spectrum for G[1..i-1] */

    • Let x be the k-mer at position i

    • If x exists in S, run DetectRepSeq(S,i);

    • Insert x into S

  • Note: DetectRepSeq(S,i) looks for repeat occurring at position i.


AAC

AAG

ACC

ACG

AGT

ATT

CCC

CCT

CGA

CTC

GAA

GAT

GTG

TAA

TCG

TGA

TTA

ACGAAGTGATTAACCCTCGACGCGATCC

1 2 …

… 18 19 20 21 22 23 24 25 26 27 28

18 19 20

21 22 23 24 25 26 27 28

CGA C G C G A T C T

Ref

Curr

CGA

DetectRepSeg(S(18), 18)


AAC

AAG

ACC

ACG

AGT

ATT

CCC

CCT

CGA

CTC

GAA

GAT

GTG

TAA

TCG

TGA

TTA

ACGAAGTGATTAACCCTCGACGCGATCC

1 2 …

… 18 19 20 21 22 23 24 25 26 27 28

18 19 20

21 22 23 24 25 26 27 28

CGA C G C G A T C T

Ref

Curr

CGA-T1

DetectRepSeg(S(18), 18)


AAC

AAG

ACC

ACG

AGT

ATT

CCC

CCT

CGA

CTC

GAA

GAT

GTG

TAA

TCG

TGA

TTA

ACGAAGTGATTAACCCTCGACGCGATCC

1 2 …

… 18 19 20 21 22 23 24 25 26 27 28

18 19 20

21 22 23 24 25 26 27 28

CGA C G C G A T C T

Ref

Curr

CGA-T1-T2-A3*

A1-G1-T2-G2-A2-T2-T3*

C2-C2-C3*

G3*

DetectRepSeg(S(18), 18)


AAC

AAG

ACC

ACG

AGT

ATT

CCC

CCT

CGA

CTC

GAA

GAT

GTG

TAA

TCG

TGA

TTA

ACGAAGTGATTAACCCTCGACGCGATCC

1 2 …

… 18 19 20 21 22 23 24 25 26 27 28

18 19 20

21 22 23 24 25 26 27 28

CGA C G C G A T C T

Ref

Curr

CGA-T1-T2-A3*

A1-G1-T2-G2-A2-T2-T3*

C2-C2-C3*

G3*

DetectRepSeg(S(18), 18)


AAC

AAG

ACC

ACG

AGT

ATT

CCC

CCT

CGA

CTC

GAA

GAT

GTG

TAA

TCG

TGA

TTA

ACGAAGTGATTAACCCTCGACGCGATCC

1 2 …

… 18 19 20 21 22 23 24 25 26 27 28

18 19 20

21 22 23 24 25 26 27 28

CGA C G C G A T C T

Ref

Curr

CGA-T1-T2-A3*

A1-G1-T2-G2-A2-T2-T3*

C2-C2-C3*

G3*

DetectRepSeg(S(18), 18)


AAC

AAG

ACC

ACG

AGT

ATT

CCC

CCT

CGA

CTC

GAA

GAT

GTG

TAA

TCG

TGA

TTA

ACGAAGTGATTAACCCTCGACGCGATCC

1 2 …

… 18 19 20 21 22 23 24 25 26 27 28

18 19 20

21 22 23 24 25 26 27 28

CGA C G C G A T C T

Ref

Curr

CGA-T1-T2-A3*

A1-G1-T2-G2-A2-T2-T3*

C2-C2-C3*

G3*

DetectRepSeg(S(18), 18)


AAC

AAG

ACC

ACG

AGT

ATT

CCC

CCT

CGA

CTC

GAA

GAT

GTG

TAA

TCG

TGA

TTA

ACGAAGTGATTAACCCTCGACGCGATCC

1 2 …

… 18 19 20 21 22 23 24 25 26 27 28

18 19 20

21 22 23 24 25 26 27 28

CGA C G C G A T C T

Ref

Curr

CGA-T1-T2-A3*

A1-G1-T2-G2-A2-T2-T3*

C2-C2-C3*

G3*

DetectRepSeg(S(18), 18)


AAC

AAG

ACC

ACG

AGT

ATT

CCC

CCT

CGA

CTC

GAA

GAT

GTG

TAA

TCG

TGA

TTA

ACGAAGTGATTAACCCTCGACGCGATCC

1 2 …

… 18 19 20 21 22 23 24 25 26 27 28

18 19 20

21 22 23 24 25 26 27 28

CGA C G C G A T C T

Ref

Curr

CGA-T1-T2-A3*

A1-G1-T2-G2-A2-T2-T3*

C2-C2-C3*

G3*

DetectRepSeg(S(18), 18)


Other details

  • Extend backward

  • Stop backtracking after h steps


Validation phase

  • Decompose hits into set of k-mer and index all the locations of these k-mers.

  • Scan for each pair of locations of a k-mer w in the hits, do BLAST extension

    • Use some auxiliary data structure to avoid double checking

  • Report the pairs whose length exceed our threshold


Analysis


Analysis

  • How to find most repeats?

    • Avoid false negative

  • How to get better speed?

    • Avoid false positive


How do we choose k? (1)

  • If k is too big,

    • k-mer is too specific and we may miss some repeat

  • If k is too small,

    • k-mer cannot help us to differentiate repeat from non-repeat

  • For repeat of length 50 and similarity>0.9,

    • we found that k  log4n+2 is good enough.


m

0

How do we choose k? (2)

  • A random k-mer match with one of n chosen k-mer

  • Pr(a k-mer re-occurs by random in a sequence of length n)

    • (analog to throwing n balls into 4k bins)

    •  1-(1 – 4-k)m  1 – exp(-m/4k).

  • We requires 1-exp(-n/4k)1,

    • hence, k  log4n + log41.

    • If we set 1=1/16, k  log4n + 2


X

X

x1

x2

Xm+1

L

The occurrence of false negative (missed repeat) (1)

  • A pair of repeats of length L, with m mismatches

  • Probability of a preserved k-mer in repeat is

  • M is the number of nonnegative integer solutions

    • to Subject to


The occurrence of false negative (missed repeat) (2)

  • It is easy to see that M is the coefficient of xL−m in

  • Hence


Criterion for path termination (1)

  • Instead of fixing the number of mismatches, we may want to fixed the percentage of mismatches, says, 10%.

  • Then, the pruning strategy is length dependent.

    • If the length of strings in  is r, we allow (r) mismatches.


Criterion for path termination (2)

  • Let q be the mismatch probability and r be the length of the string.

    • Prob that a string has s mismatches =

  • For a threshold  (says, 0.01), we set

    • (r) = max {2  s  r-2 | Pq(s) > } + 2


Control of false positives (1)

  • Two typical cases

  • The probability of (case 1)/ (case 2)

    is  2*4-

  • P(case1 or case2) is small

    • For example: 4 errors, q=0.1, k = 12, P(case 1) = 1.77 * 10-8


Evaluation

Compare with other programs


Programs

  • EulerAlign by Zhang and Waterman

  • PALS by Edgar and Myers

  • REPuter by Kurtz et al.

  • SARGRI


Measurement

  • Count Ratio (CR): the ratio of number of pairs of repeat share more than 50% with a reference pair to the number of reference pairs.

  • Shared Repeat Region (SRR): the ratio of the found region to the reference region.


Simulated data

  • Conclusion from simulated data

    • The result is consistent with the analysis


Genome data

  • M.gen (0.6 Mbp)

    • Organism with the smallest genome

    • Lives in the primate genital and respiratory tracts

  • C.tra (1 Mbp)

    • Live inside the cells of humans

  • A.ful (2.1 Mbp)

    • Found in high-temperature oil fields

  • E.coli (4 Mbp)

    • An import bacteria live inside lower intestines of mammals

  • Human chr22 p20M to p21M (1Mbp)


  • Use CR and SRR ratio to measure

  • Cross validation

    • G/H=1, H/G<1  G “outperforms” H

    • G/H<1, H/G=1  H “outperforms” G

    • G/H<1, H/G<1  G, H are complementary

    • G/H=1, H/G=1  G, H are similar


= 


Questions and Answers


  • H. H. Do, K. P. Choi, F. P. Preparata, W. K. Sung, L. Zhang. Spectrum-based de novo repeat detection in genomic sequences. Journal of Computational Biology, 15(5):469-487, June 2008


ad
  • Login