A clustering method for repeat analysis in DNA sequences

A clustering method for repeat analysis in DNA sequences. Molecular Biology & Phylogeny Laboratory, first-year master's student Kim Woo-yeon (김우연). Natalia Volfovsky, Brian J Haas and Steven L Salzberg, The Institute for Genomic Research, USA. Genome Biology 2001.

Presentation Transcript


  1. A clustering method for repeat analysis in DNA sequences • Molecular Biology & Phylogeny Laboratory • First-year master's student Kim Woo-yeon (김우연)

  2. A clustering method for repeat analysis in DNA sequences • Natalia Volfovsky, Brian J Haas and Steven L Salzberg • The Institute for Genomic Research, USA • Genome Biology 2001 Pusan Bioinformatics & Biocomplexity Research Center

  3. Abstract

  4. Suffix Trie • Definition • Tree: a finite set of one or more nodes • Suffix: the longest substring starting at a given position • Suffix trie: a trie that represents all suffixes • Example: T = ababa# (positions 123456) [Figure: suffix trie of T; leaves are labeled with suffix start positions 1-6]
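
The definition on this slide can be made concrete with a minimal sketch. It builds a suffix trie for the slide's example T = ababa# as nested dictionaries. The function names and the `"leaf"` key are illustrative assumptions, not anything taken from the presentation or the paper.

```python
def build_suffix_trie(text):
    """Build a suffix trie as nested dicts, one character per edge.

    Assumes `text` ends with a unique terminator such as '#', so every
    suffix ends at its own leaf. The key "leaf" (hypothetical) stores the
    1-based start position of the suffix, as in the slide's figure.
    """
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
        node["leaf"] = start + 1
    return root


def leaf_positions(node):
    """Collect the suffix start positions stored below a node."""
    positions = []
    for key, child in node.items():
        if key == "leaf":
            positions.append(child)
        else:
            positions.extend(leaf_positions(child))
    return positions


def contains(trie, pattern):
    """Return True if `pattern` is a substring of the indexed text."""
    node = trie
    for ch in pattern:
        if ch not in node:
            return False
        node = node[ch]
    return True


trie = build_suffix_trie("ababa#")
print(sorted(leaf_positions(trie)))  # [1, 2, 3, 4, 5, 6]
print(contains(trie, "bab"))         # True
```

Every substring of T is a prefix of some suffix, which is why a walk from the root answers substring queries. The cost is quadratic space, which motivates the compacted form on the next slide.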

  5. Suffix Tree • Definition • Suffix tree: a compacted trie that represents all suffixes • Edge: label • Internal node • Sibling edge • Leaf node <=> Suffix • Example: T = ababa# (positions 123456), P = aba [Figure: suffix tree of T with edge labels a, ba, ba# and #; leaves are labeled with suffix start positions 1-6]
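
As a rough illustration of the compacted form, the sketch below builds a suffix tree by inserting suffixes one at a time and splitting edges where they diverge, then looks up the slide's pattern P = aba. This is a naive quadratic construction chosen for readability, not a linear-time one, and the class and function names are assumptions made for this example.

```python
class Node:
    def __init__(self):
        self.children = {}    # first character -> (edge label, child Node)
        self.position = None  # 1-based suffix start, set on leaves only


def build_suffix_tree(text):
    """Naive construction; assumes `text` ends with a unique terminator."""
    root = Node()
    for start in range(len(text)):
        node, suffix = root, text[start:]
        while True:
            first = suffix[0]
            if first not in node.children:
                leaf = Node()
                leaf.position = start + 1
                node.children[first] = (suffix, leaf)
                break
            label, child = node.children[first]
            k = 0  # length of the common prefix of edge label and suffix
            while k < len(label) and k < len(suffix) and label[k] == suffix[k]:
                k += 1
            if k < len(label):
                # Split the edge at the point of divergence.
                mid = Node()
                mid.children[label[k]] = (label[k:], child)
                node.children[first] = (label[:k], mid)
                child = mid
            node, suffix = child, suffix[k:]
    return root


def leaf_positions(node):
    if not node.children:
        return [node.position]
    found = []
    for _, child in node.children.values():
        found.extend(leaf_positions(child))
    return found


def find_occurrences(root, pattern):
    """Return sorted 1-based start positions of `pattern` in the text."""
    node, rest = root, pattern
    while rest:
        entry = node.children.get(rest[0])
        if entry is None:
            return []
        label, child = entry
        n = min(len(label), len(rest))
        if label[:n] != rest[:n]:
            return []
        node, rest = child, rest[n:]
    return sorted(leaf_positions(node))


tree = build_suffix_tree("ababa#")
print(find_occurrences(tree, "aba"))  # [1, 3]
```

The leaves below the node where the pattern ends give every occurrence, which is the "leaf node <=> suffix" correspondence listed on the slide.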

  6. Example • T = ATGATGC# (positions 12345678) [Figure: suffix tree of T with edge labels such as ATG, TG, G, C# and ATGC#; leaves are labeled with suffix start positions 1-8]

  7. Numerous methods for detecting repeats • RepeatMasker • Uses a database of known repeat sequences and implements a string-matching algorithm • MaskerAid • Same approach • Faster than RepeatMasker • WU-BLAST • Uses the BLAST engine • Based on suffix trees • RepeatMatch, REPuter, RepeatFinder • Find all exact repeats • 10-100 megabases (Mb)

  8. Definitions • An exact repeat • A subsequence that occurs at least twice in a DNA sequence • A maximal repeat • An exact repeat that cannot be extended in either direction without incurring a mismatch
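
The two definitions above can be illustrated with a brute-force sketch that reports maximal exact repeat pairs. It compares every pair of positions directly instead of using a suffix tree, so it only suits short sequences. The function name and the `min_len` parameter are assumptions made for this example, and the two copies are allowed to overlap.

```python
def maximal_repeat_pairs(seq, min_len):
    """Return (pos1, pos2, length) for each maximal exact repeat pair.

    Positions are 1-based. A pair is kept only if it cannot be extended
    to the left (preceding characters differ, or the first copy starts
    the sequence) and it is extended to the right as far as possible.
    """
    n = len(seq)
    pairs = []
    for i in range(n):
        for j in range(i + 1, n):
            # Skip pairs that could be extended one base to the left.
            if i > 0 and seq[i - 1] == seq[j - 1]:
                continue
            length = 0
            while j + length < n and seq[i + length] == seq[j + length]:
                length += 1
            if length >= min_len:
                pairs.append((i + 1, j + 1, length))
    return pairs


print(maximal_repeat_pairs("ATGATGC", 2))  # [(1, 4, 3)]
```

In the example, TG at positions 2 and 5 is an exact repeat but not a maximal one, because both copies are preceded by A and extend to ATG.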

  9. Exact repeats

  10. Definition of repeats

  11. Algorithm description • Uses either of two suffix tree methods • RepeatMatch, REPuter • Based on first identifying all exact repeats • Defines repeat classes by merging and extending • Step1: Selection and pre-processing • Step2: Merging procedure • Step3: Classification • Step4: BLAST searches and repeat class updates

  12. STEP1: Selection and pre-processing • The output of RepeatMatch or REPuter is interpreted as a partition of the original genome sequence • F: forward • RC: reverse complement • l: length

  13. STEP2: Merging procedure • Merges two exact repeats that either overlap or occur within a limited distance (a gap) of each other
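
The merging rule on this slide can be sketched on plain intervals. The example below is a simplification: it treats each repeat copy as a one-dimensional interval with inclusive 1-based coordinates and merges neighbors that overlap or are separated by at most `gap` bases. The actual procedure may impose further conditions that the slide does not show, and the function name and coordinates are hypothetical.

```python
def merge_intervals(intervals, gap):
    """Merge intervals that overlap or lie within `gap` bases of each other.

    `intervals` holds (start, end) tuples with inclusive coordinates.
    The distance between two intervals is the number of bases strictly
    between them, so adjacent intervals have distance 0.
    """
    merged = []
    for start, end in sorted(intervals):
        if merged and start - merged[-1][1] - 1 <= gap:
            # Overlapping or close enough: extend the current interval.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(item) for item in merged]


copies = [(1, 30), (20, 50), (70, 100), (200, 230)]
print(merge_intervals(copies, gap=25))  # [(1, 100), (200, 230)]
print(merge_intervals(copies, gap=0))   # [(1, 50), (70, 100), (200, 230)]
```

With a gap of 25, the 19 bases between positions 50 and 70 are bridged, while the 99 bases before position 200 are not.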

  14. STEP3: Classification • One step of the classification procedure

  15. STEP4: BLAST searches and further merging • If a class appears in multiple similarity pairs, all these similar classes are merged with the original class.
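
One way to picture this class update is a union-find (disjoint-set) structure over class identifiers, where each similarity pair joins two classes. This is a sketch under assumptions: the class IDs and pairs are made up, no BLAST search is run here, and merging transitively (A with B and B with C puts all three together) is a choice of this example, not something the slide states.

```python
def merge_classes(class_ids, similar_pairs):
    """Group class IDs connected by similarity pairs.

    `similar_pairs` is a list of (class_a, class_b) tuples, standing in
    for pairs of classes reported as similar by a search step.
    """
    parent = {c: c for c in class_ids}

    def find(c):
        # Follow parent links to the representative, halving the path.
        while parent[c] != c:
            parent[c] = parent[parent[c]]
            c = parent[c]
        return c

    for a, b in similar_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Keep the smaller ID as the representative of the merged class.
            parent[max(root_a, root_b)] = min(root_a, root_b)

    groups = {}
    for c in class_ids:
        groups.setdefault(find(c), []).append(c)
    return sorted(sorted(group) for group in groups.values())


print(merge_classes([1, 2, 3, 4, 5, 6], [(1, 2), (1, 3), (5, 6)]))
# [[1, 2, 3], [4], [5, 6]]
```

Class 1 appears in two similarity pairs, so classes 2 and 3 both end up merged with it, while class 4 stays on its own.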

  16. Repeat analysis of microbial genomes • Minimal exact repeat length: 25 bp • Gap: 25 bp

  17. Prototype repeat sequences • Prototype • The most representative element for each class

  19. Finding new HERVs by Suffix Tree
