Spectrum-based de novo repeat detection in genomic sequences
Do Huy Hoang


Outline

  • Introduction

    • What is a repeat?

    • Why study repeats?

  • Related work

  • SAGRI

    • Algorithm

    • Analysis

  • Evaluation


Introduction


What is a repeat?(Definition)

  • [General]: Nucleotide sequences occurring multiply within a genome

  • [CompBio]: Given a genome sequence S, find a string P which occurs at least twice in S (allowing some errors).


What is a repeat? (Function)

  • Motifs

    • Very short repeats (10-20bp)

    • Transcription factor binding sites

  • Long and Short interspersed elements (SINE, LINE)

    • Jumping genes

  • Genes and Pseudogenes

  • Tandem repeats

    • Simple short sequence repeats, e.g. (A)n, (CGG)n


Why study repeats? (1)

  • Eukaryotic genomes contain a lot of repeats

    • E.g. the human genome contains 50% repeats.

  • Repeats are believed to play an important role in evolution and disease.

    • E.g. Alu elements are particularly prone to recombination. Insertions of Alu repeats inactivate genes in patients with hemophilia and neurofibromatosis (Kazazian, 1998; Deininger and Batzer, 1999).

  • Repeats are important to chromatin structure.

    • Most TEs in mammals seem to be silenced by methylation. Alu sequences are a major target for histone H3-Lys9 methylation in humans (Kondo and Issa, 2003).

    • Heterochromatin is known to contain many SINE and LINE repeats.


Why study repeats? (2)

  • Repeats complicate sequence assembly and genome comparison

    • Many people remove repeats before they analyze the genome.

  • Repeats hinder microarray probe signal analysis

    • The probe signal may be inaccurate if the probe sequence overlaps with repeat regions.

  • Repeats may contribute to human diversity more than genes.

  • Repeats can be used as DNA fingerprints


Steps in Repeat finding

  • Repeat library (RepeatMasker)

  • De-novo repeat discovery (two steps):

    • Identification of repeats

    • Classification of repeats


SAGRI algorithm


Algorithm outline

  • Input: a text G

  • FindHit phase: finds all candidate second occurrences of repeat regions

    • ACGACGCGATTAACCCTCGACGTGATCCTC

  • Validation phase: uses hits from phase 1 to find all pairs of repeats

    • ACGACGCGATTAACCCTCGACGTGATCCTC


Spectrum-based repeat finder

  • What is a spectrum?

    • Given a string G, its spectrum is the set of all k-mers (length-k substrings) occurring in G.

      E.g. k = 3, G = ACGACGCTCACCCT

      The spectrum is ACC, ACG, CAC, CCC, CCT, CGA, CGC, CTC, GAC, GCT, TCA

    • CTC is a k-mer occurring at position 7.

    • ACG is a k-mer occurring at positions 1, 4.
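
As a rough illustration of the spectrum definition, the Python sketch below collects every k-mer of a string together with its 1-based start positions. The function name `kmer_positions` is made up for this example and is not taken from SAGRI.

```python
from collections import defaultdict

def kmer_positions(text, k):
    """Map each k-mer of text to the list of its 1-based start positions."""
    positions = defaultdict(list)
    for i in range(len(text) - k + 1):
        positions[text[i:i + k]].append(i + 1)  # slides count from 1
    return dict(positions)

pos = kmer_positions("ACGACGCTCACCCT", 3)
print(sorted(pos))   # the spectrum, in alphabetical order
print(pos["CTC"])    # [7]
print(pos["ACG"])    # [1, 4]
```

The keys of the dictionary form the spectrum; keeping the positions as well makes it easy to look up where an earlier copy of a k-mer occurred.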


Observation 1: How to find candidate regions containing repeats?

  • Two regions of repeats should share some k-mers.

    • E.g. the following repeats share CGA.

      ACGACGCGATTAACCCTCGACGTGATCCTC


Feasible extension (bud)

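  • Given the spectrum S for G[1..i-1], where i is the current position:

    G = ACGACGTGATTAACCCTCGACGTGATCCTC

  • The k-mer at position i is CGA. A character is a feasible extension if appending it to the last k-1 characters gives a k-mer in S:

    • A: not feasible

    • C: feasible

    • G: not feasible

    • T: feasible

  • Note: T is called a fooling probe!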
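
A minimal sketch of the feasible-extension test for this slide's sequence, in Python. It assumes that the spectrum of a prefix contains the k-mers lying completely inside that prefix, and that i = 18 (the second occurrence of CGA); both are readings of the slide, not statements from the authors. The helper names are hypothetical.

```python
def spectrum(text, k):
    """Set of all k-mers that lie completely inside text."""
    return {text[j:j + k] for j in range(len(text) - k + 1)}

def feasible_extensions(spec, kmer):
    """Characters c such that (last k-1 characters of kmer) + c is in spec."""
    suffix = kmer[1:]
    return [c for c in "ACGT" if suffix + c in spec]

G = "ACGACGTGATTAACCCTCGACGTGATCCTC"
i = 18                       # assumed 1-based position of the current k-mer CGA
S = spectrum(G[:i - 1], 3)   # spectrum of G[1..i-1]
print(feasible_extensions(S, "CGA"))  # ['C', 'T']
```

C is feasible because GAC is in the spectrum, and T because GAT is; GAA and GAG are absent, so A and G are rejected.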


Observation 2

  • A path of feasible extensions may be a repeat.

  • Example:

    G = ACGACGCTATCGATGCCCTC

    Spectrum S for G[1..10] is

    ACG, CGA, CGC, CTA, GAC, GCT, TAT

    Starting from position 11, there exists a path of feasible extensions:

    CGA-C-G-C

    This path corresponds to the length-6 substring at position 2 (CGACGC).

    Also, this path has one mismatch compared with the length-6 substring at position 11 (CGATGC).
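
The idea of following feasible extensions can be sketched as a small depth-first search. This is an illustrative Python sketch only: the function `feasible_paths`, its fixed mismatch budget, and the decision to stop at the length of the compared substring are assumptions made for the example, not the authors' DetectRepSeq procedure. It assumes k is at least 2.

```python
def spectrum(text, k):
    """Set of all k-mers that lie completely inside text."""
    return {text[j:j + k] for j in range(len(text) - k + 1)}

def feasible_paths(spec, target, k, max_mismatches):
    """Paths of feasible extensions that start at target[:k], have the same
    length as target, and differ from it in at most max_mismatches places."""
    results = []

    def walk(path, mismatches):
        if len(path) == len(target):
            results.append((path, mismatches))
            return
        for c in "ACGT":
            if path[-(k - 1):] + c in spec:          # feasible extension
                m = mismatches + (c != target[len(path)])
                if m <= max_mismatches:              # prune paths with too many mismatches
                    walk(path + c, m)

    walk(target[:k], 0)
    return results

G = "ACGACGCTATCGATGCCCTC"
S = spectrum(G[:10], 3)        # spectrum of G[1..10]
current = G[10:16]             # length-6 substring at position 11: CGATGC
print(feasible_paths(S, current, 3, 1))  # [('CGACGC', 1)]
```

The single path found spells CGACGC, the substring at position 2, with one mismatch against CGATGC.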


Phase 1: FindHit()

Algorithm:

Input: a text G

  • Initialize the empty spectrum S

  • For i = 1 to n

    /* we maintain the invariant that S is a spectrum for G[1..i-1] */

    • Let x be the k-mer at position i

    • If x exists in S, run DetectRepSeq(S,i);

    • Insert x into S

  • Note: DetectRepSeq(S,i) looks for repeat occurring at position i.
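
The loop above might be sketched in Python as follows. The repeat-detection step is passed in as a callback because its internals are not spelled out here; the name `detect_rep_seq` and the choice to keep every k-mer starting before position i in the spectrum are assumptions of this sketch.

```python
def find_hit(text, k, detect_rep_seq):
    """Scan text left to right, growing the spectrum one k-mer at a time."""
    spectrum = set()                      # k-mers starting before the current position
    for i in range(len(text) - k + 1):
        x = text[i:i + k]                 # k-mer at (0-based) position i
        if x in spectrum:                 # seen before: possible second occurrence
            detect_rep_seq(spectrum, i + 1)   # report the 1-based position
        spectrum.add(x)

# Placeholder callback that only records where detection would be triggered.
calls = []
find_hit("ACGAAGTGATTAACCCTCGACGCGATCC", 3, lambda s, i: calls.append(i))
print(calls)  # [18, 20, 23, 24]
```

Because the k-mer is inserted only after the membership test, the first occurrence of a k-mer never triggers detection; only later occurrences do.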


DetectRepSeg(S(18), 18)

G = ACGAAGTGATTAACCCTCGACGCGATCC

Spectrum S(18):

AAC, AAG, ACC, ACG, AGT, ATT, CCC, CCT, CGA, CTC, GAA, GAT, GTG, TAA, TCG, TGA, TTA

Ref: positions 1, 2, …

Curr: positions 18 to 28, CGA C G C G A T C T

Paths of feasible extensions explored from CGA (the number after each character is the mismatch count against Curr so far; * marks where a path stops):

CGA-T1-T2-A3*
   -A1-G1-T2-G2-A2-T2-T3*
      -C2-C2-C3*
         -G3*


Other details

  • Extend backward

  • Stop backtracking after h steps


Validation phase

  • Decompose the hits into sets of k-mers and index all the locations of these k-mers.

  • For each pair of locations of a k-mer w in the hits, do a BLAST extension

    • Use some auxiliary data structure to avoid double checking

  • Report the pairs whose length exceeds our threshold
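
The steps above can be sketched in Python under heavy simplifications: the whole text is treated as a single hit, and the BLAST extension is replaced by a plain exact-match extension in both directions (no mismatches, no drop-off score). The function name `validate` and the `covered` set used to skip already-checked seed pairs are choices made for this example.

```python
from collections import defaultdict
from itertools import combinations

def validate(text, k, min_len):
    """Return (start_a, start_b, length) for exact repeat pairs, 0-based."""
    # Index every location of every k-mer.
    index = defaultdict(list)
    for i in range(len(text) - k + 1):
        index[text[i:i + k]].append(i)

    covered = set()   # seed pairs lying inside an already extended pair
    pairs = []
    for locs in index.values():
        for a, b in combinations(locs, 2):          # a < b
            if (a, b) in covered:
                continue                            # avoid double checking
            left = 0                                # extend to the left
            while a - left - 1 >= 0 and text[a - left - 1] == text[b - left - 1]:
                left += 1
            right = k                               # extend to the right
            while b + right < len(text) and text[a + right] == text[b + right]:
                right += 1
            start_a, start_b, length = a - left, b - left, left + right
            for t in range(length - k + 1):
                covered.add((start_a + t, start_b + t))
            if length >= min_len:
                pairs.append((start_a, start_b, length))
    return sorted(pairs)

print(validate("GGACGTATTACGTACC", 3, 5))  # [(2, 9, 5)]
```

Here ACGTA occurs at 0-based positions 2 and 9; the seeds CGT and GTA fall inside that pair and are skipped once the ACG seed has been extended.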


Analysis


Analysis

  • How to find most repeats?

    • Avoid false negatives

  • How to get better speed?

    • Avoid false positives


How do we choose k? (1)

  • If k is too big,

    • the k-mer is too specific and we may miss some repeats

  • If k is too small,

    • the k-mer cannot help us to differentiate repeats from non-repeats

  • For repeats of length 50 and similarity > 0.9,

    • we found that $k \ge \log_4 n + 2$ is good enough.


How do we choose k? (2)

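  • A random k-mer may match one of the n k-mers already chosen.

  • Pr(a k-mer re-occurs by chance in a sequence of length n)

    • (analogous to throwing n balls into $4^k$ bins)

    • $= 1 - (1 - 4^{-k})^n \approx 1 - \exp(-n/4^k)$.

  • We require $1 - \exp(-n/4^k) \le \varepsilon_1$,

    • hence, using $1 - e^{-x} \approx x$ for small $x$, $k \ge \log_4 n + \log_4(1/\varepsilon_1)$.

    • If we set $\varepsilon_1 = 1/16$, then $k \ge \log_4 n + 2$.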
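
A small numeric check of this bound, as a Python sketch. The function names are made up for the example, and `min_k` uses the small-probability approximation (the chance is roughly n/4^k), so it is slightly conservative compared with the exact expression.

```python
import math

def reoccur_prob(n, k):
    """Chance that a given k-mer re-occurs by chance among n random k-mers."""
    return 1.0 - (1.0 - 4.0 ** -k) ** n

def min_k(n, eps):
    """Smallest integer k with n / 4**k <= eps, i.e. k >= log4(n) + log4(1/eps)."""
    return math.ceil(math.log(n / eps, 4))

n = 1_000_000
k = min_k(n, 1 / 16)
print(k)                       # 12
print(reoccur_prob(n, k))      # below 1/16
print(reoccur_prob(n, k - 1))  # above 1/16
```

For a sequence of one million bases, log4(n) is just under 10, so the rule gives k = 12.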


The occurrence of false negative (missed repeat) (1)

[Figure: a repeat of length L with mismatch positions marked X and segments labeled x1, x2, …, xm+1]

  • A pair of repeats of length L, with m mismatches

  • Probability of a preserved k-mer in repeat is

  • M is the number of nonnegative integer solutions

    • to Subject to


The occurrence of false negative (missed repeat) (2)

  • It is easy to see that M is the coefficient of $x^{L-m}$ in

  • Hence
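
One way to make this counting argument concrete is sketched below in Python. Everything in it is an assumed model, not a formula taken from the slides: the m mismatches are placed uniformly at random among the L positions, x1, …, xm+1 are the lengths of the matching runs between them (summing to L − m), a k-mer is preserved exactly when some run has length at least k, and M counts the placements in which every run is shorter than k. Under that model M is the coefficient of x^(L−m) in (1 + x + … + x^(k−1))^(m+1).

```python
from itertools import combinations
from math import comb

def count_no_preserved(L, m, k):
    """Coefficient of x^(L-m) in (1 + x + ... + x^(k-1))^(m+1)."""
    target = L - m
    coeffs = [1] + [0] * target               # the constant polynomial 1
    for _ in range(m + 1):                    # multiply by (1 + x + ... + x^(k-1))
        nxt = [0] * (target + 1)
        for deg, c in enumerate(coeffs):
            if c:
                for step in range(min(k, target - deg + 1)):
                    nxt[deg + step] += c
        coeffs = nxt
    return coeffs[target]

def preserved_prob(L, m, k):
    """Probability that at least one k-mer survives, under the assumed model."""
    return 1 - count_no_preserved(L, m, k) / comb(L, m)

def preserved_prob_brute(L, m, k):
    """Same probability by enumerating every mismatch placement (small L only)."""
    good = 0
    for mism in combinations(range(L), m):
        bounds = [-1, *mism, L]
        longest = max(b - a - 1 for a, b in zip(bounds, bounds[1:]))
        good += longest >= k
    return good / comb(L, m)

print(preserved_prob(5, 1, 3))    # 4 of the 5 placements keep a 3-mer
print(preserved_prob(50, 5, 12))  # repeat of length 50, 5 mismatches, k = 12
```

The brute-force version is there only to cross-check the coefficient computation on small inputs.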


Criterion for path termination (1)

  • Instead of fixing the number of mismatches, we may want to fix the percentage of mismatches, say, 10%.

  • Then, the pruning strategy is length dependent.

    • If the length of the strings under consideration is r, we allow $\tau(r)$ mismatches.


Criterion for path termination (2)

  • Let q be the mismatch probability and r be the length of the string.

    • Probability that a string has s mismatches: $P_q(s) = \binom{r}{s} q^s (1-q)^{r-s}$

  • For a threshold $\varepsilon$ (say, 0.01), we set

    • $\tau(r) = \max\{\, s : 2 \le s \le r-2,\ P_q(s) > \varepsilon \,\} + 2$
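
A hedged Python sketch of this length-dependent mismatch allowance. It assumes that mismatches occur independently with probability q, so that the number of mismatches in a string of length r is binomial; the function names, and raising an error when no s qualifies, are choices made for this example.

```python
from math import comb

def mismatch_pmf(r, s, q):
    """Assumed binomial probability of exactly s mismatches in r positions."""
    return comb(r, s) * q ** s * (1 - q) ** (r - s)

def allowed_mismatches(r, q, eps):
    """max{ s in [2, r-2] : P_q(s) > eps } + 2, as on the slide."""
    candidates = [s for s in range(2, r - 1) if mismatch_pmf(r, s, q) > eps]
    if not candidates:
        raise ValueError("no s in [2, r-2] has probability above eps")
    return max(candidates) + 2

print(allowed_mismatches(20, 0.1, 0.01))  # 7
print(allowed_mismatches(50, 0.1, 0.01))  # 12
```

With q = 0.1 and a threshold of 0.01, a path of length 20 may carry up to 7 mismatches and a path of length 50 up to 12, so longer paths are allowed more absolute errors while the rate stays bounded.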


Control of false positives (1)

  • Two typical cases

  • The probability of (case 1)/ (case 2)

    is  2*4-

  • P(case1 or case2) is small

    • For example: 4 errors, q = 0.1, k = 12, P(case 1) = $1.77 \times 10^{-8}$


Evaluation

Comparison with other programs


Programs

  • EulerAlign by Zhang and Waterman

  • PALS by Edgar and Myers

  • REPuter by Kurtz et al.

  • SAGRI


Measurement

  • Count Ratio (CR): the ratio of the number of found repeat pairs that share more than 50% with a reference pair to the number of reference pairs.

  • Shared Repeat Region (SRR): the ratio of the found region to the reference region.


Simulated data

  • Conclusion from simulated data

    • The result is consistent with the analysis


Genome data

  • M.gen (0.6 Mbp)

    • Organism with the smallest genome

    • Lives in the primate genital and respiratory tracts

  • C.tra (1 Mbp)

    • Lives inside the cells of humans

  • A.ful (2.1 Mbp)

    • Found in high-temperature oil fields

  • E.coli (4 Mbp)

    • An important bacterium that lives in the lower intestines of mammals

  • Human chr22 p20M to p21M (1Mbp)


  • Use the CR and SRR ratios as measures

  • Cross validation

    • G/H=1, H/G<1 ⇒ G “outperforms” H

    • G/H<1, H/G=1 ⇒ H “outperforms” G

    • G/H<1, H/G<1 ⇒ G, H are complementary

    • G/H=1, H/G=1 ⇒ G, H are similar


Questions and Answers


  • H. H. Do, K. P. Choi, F. P. Preparata, W. K. Sung, L. Zhang. Spectrum-based de novo repeat detection in genomic sequences. Journal of Computational Biology, 15(5):469-487, June 2008

