
Thahir P. Mohamed, Asia D. Mitchell and Madhavi Ganapathiraju

Open access toolkit for nonparametric explorative pattern mining to detect events relating to disease in large scale genome sequences. Thahir P. Mohamed, Asia D. Mitchell and Madhavi Ganapathiraju, Department of Biomedical Informatics, University of Pittsburgh School of Medicine


Presentation Transcript


  1. Open access toolkit for nonparametric explorative pattern mining to detect events relating to disease in large scale genome sequences
     Thahir P. Mohamed, Asia D. Mitchell and Madhavi Ganapathiraju
     Department of Biomedical Informatics, University of Pittsburgh School of Medicine, Pittsburgh PA USA
     Advancing Practice, Innovation, and Instruction through Informatics, October 20, 2008

  2. The Genome Sequence
     • The human genome contains…
       • 3 billion nucleotides
       • 20 to 25 thousand genes
       • Two-thirds of the genome made of repetitive elements (2 billion nucleotides)
     ATGGCACTGAGCTCCCAGATCTGGGCCGCTTGCCTCCTGCTCCTCCTCCTCCTCGCCAGCCTGACCAGTGGCTCTGTTTTCCCACAACAGGTGAGAGCCCAGTGGCCTGGGTCCTTAGCAGGGCAGCAGGGATGGGAGAGCCAGGCCTCAGCCTAGGGCACTGGAGACACCCGAGCACTGAGCAGAGCTCAGGACGTCTCAGGAGTACTGGCAGCTGAACAGGAACCAGGACAGGCACGGTGGCTCATGCCTGTAATCCCAGCACTTTGGGAGGTTGAGGCAGGCAGCCCACTTGAGGTCAGTTTGAGACCAGCCTGGCCAACATGGTAAAACCCCGTCTCTACTAAAAATACAAAAGTTAGCCAGGCTTGGTGGCAGGTGCCTGTAATCCCAGCTACTCGGGAGACTGAGGCAGGAGAATTGCTTGAACCCGCAAGGTGGAGGTTGCACAGTGAGCTGAGATTGCACCACTGCACTCCAGCCTGGCAACAGAGCAAGACTCCATCTCCAAAAAAGAACAGAAATCAATGAAGCACCGAGTGACAGGGACTGGAAGGTCCTAATTCCATGGGTATTTACGGAACCCCTACGCCGTGTGGAGTCTTATTCTAGACAGTGGGGACGAGGCCATGAACAAGGTAGATGAGAGAGGAGATTTCTCCATCCTGGTCAGGGAATTTGTTAAAGACTGATGAAAACATGAATAAATAATTGTGTCTAGTACATTCTATTCGTGAATCTCATAACAGACAGTGGTAGAGTGACCGTGACCCATTCGCCACACAGTAGAGTCACTTTTTTGGTTTGTTTTTTAGAGACAGGGTCTTCCTCTGTTGCTGAGGCTGGAGTGCAGTGGTGCAGTCATAGTTCACTGCAGCCTCAACCTCCTGTGCTCAAGCAATCCTCCCACCTCAGCGTCCCAAGTAGCTGGGACAGCAGGCACATGCCACGGGTTGGGGGACCACAGGCATGGTCAAGGGGCTGGCAGTCAAGCAAGTG

  3. Genomic Patterns
     • Short Tandem Repeats (STRs)
       • 1 to 6 nucleotides repeated in tandem
     • Variable Number Tandem Repeats (VNTRs)
       • Same as short tandem repeats
       • Number of repeats variable across individuals
     • CpG Islands
       • A sequence of > 500 nucleotides
       • C+G content of > 55%
       • High frequency of CG dinucleotides
       …CGCGCCGGACGTTACGCGCGCCGCGAAACGCGCGCCGGACGGCGCCGCAAACGGCCGCGCGTAC…
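
The CpG island criteria listed on this slide can be sketched as a simple window test. This is a minimal illustration, not the toolkit's own code: the function names are hypothetical, and because the slide only says "high frequency" of CG dinucleotides, the `min_cpg_per_base` cutoff is an assumed placeholder value.

```python
def gc_fraction(seq: str) -> float:
    """Fraction of bases in seq that are C or G."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("C") + seq.count("G")) / len(seq)


def cpg_count(seq: str) -> int:
    """Number of CG dinucleotides (occurrences of 'CG' cannot overlap)."""
    return seq.upper().count("CG")


def looks_like_cpg_island(seq: str,
                          min_len: int = 500,
                          min_gc: float = 0.55,
                          min_cpg_per_base: float = 0.05) -> bool:
    """Apply the three slide criteria to one candidate sequence.

    min_len and min_gc come from the slide (> 500 nt, > 55% C+G).
    min_cpg_per_base is an assumed stand-in for "high frequency".
    """
    if len(seq) <= min_len:
        return False
    if gc_fraction(seq) <= min_gc:
        return False
    return cpg_count(seq) / len(seq) >= min_cpg_per_base
```

A caller would typically slide this test along a chromosome in fixed-size windows and merge adjacent windows that pass.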

  4. Genomic Patterns
     • Palindromes
       • A sequence that is like a normal palindrome (mom, racecar, …)
       • One half is the complement of the other in reverse order
     • LINE-1 Elements
       • Retrotransposon of >1,000 nucleotides
       • High A+T content
       • Poly A tail
     • ALU Elements
       • Retrotransposon of ~300 nucleotides with
         • High G+C content
         • Recognition site for Alu endonuclease
         • Segment high in A content
         • A poly A tail
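
The palindrome definition above (one half is the complement of the other in reverse order) means the sequence equals its own reverse complement. A minimal sketch follows; the function names are hypothetical and the brute-force scan is for illustration only, not how the toolkit locates palindromes at genome scale.

```python
# Translation table mapping each base to its Watson-Crick complement.
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Complement every base, then reverse the result."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def is_dna_palindrome(seq: str) -> bool:
    """True if seq reads the same as its reverse complement."""
    seq = seq.upper()
    return len(seq) > 0 and seq == reverse_complement(seq)


def find_palindromes(seq: str, min_len: int = 4, max_len: int = 12):
    """Return (start, length, subsequence) for each exact DNA palindrome.

    Only even lengths are checked: in an odd-length window the middle
    base would have to be its own complement, which is impossible.
    """
    seq = seq.upper()
    hits = []
    for length in range(min_len + (min_len % 2), max_len + 1, 2):
        for start in range(len(seq) - length + 1):
            window = seq[start:start + length]
            if is_dna_palindrome(window):
                hits.append((start, length, window))
    return hits
```

For example, the EcoRI recognition site `GAATTC` is a DNA palindrome under this definition.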

  5. Disease Relevance
     [Diagram labels: ALU/LINE-1 Expansions, VNTRs, Palindromes, STRs, CpG Islands; Abnormal Methylation, Alternative Structures, High Mutability, Genomic Instability; Cancer, Disease]

  6. Challenges in Pattern Mining
     Computational tools for pattern mining must be…
     • Scalable
       • Genomes are large: 3 billion nucleotides
       • Genes are small: 3 thousand nucleotides
       • Genomes of different organisms vary greatly in size
     • Flexible
       • Types of patterns differ
       • There are variations within a single type of pattern
       • Flexibility in resolution of analysis
     • Nonparametric
       • New and unknown patterns
       • Explorative analysis
     Currently, there are no tools that are scalable, flexible, and nonparametric for genomic pattern mining

  7. Pattern Mining Toolkit
     The applications layer contains programs that use features computed by the tools layer and the preprocessed layer to compute specific, commonly known patterns such as short tandem repeats, DNA palindromes, short and long interspersed nuclear elements, etc.

  8. Foundation Layer
     Efficient Preprocessing of Genome Sequence
     • Repetitive patterns appear next to each other
     • Allows for efficient computation of patterns
     Data Preprocessing:
     • Suffix array computation
     • Longest common prefix array computation
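
The two preprocessing steps named on this slide can be sketched as follows. This is a small illustration under stated assumptions: the suffix array is built by naive sorting (fine for short strings, not for whole genomes), the LCP array uses Kasai's linear-time algorithm, and none of the function names come from the toolkit itself.

```python
def suffix_array(text: str) -> list[int]:
    """Start positions of all suffixes, in lexicographic order (naive build)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def lcp_array(text: str, sa: list[int]) -> list[int]:
    """Kasai's algorithm: lcp[k] is the length of the longest common
    prefix of the suffixes at sa[k-1] and sa[k]; lcp[0] is 0."""
    n = len(text)
    rank = [0] * n
    for k, start in enumerate(sa):
        rank[start] = k
    lcp = [0] * n
    h = 0
    for i in range(n):
        if rank[i] > 0:
            j = sa[rank[i] - 1]          # suffix just before i in sorted order
            while i + h < n and j + h < n and text[i + h] == text[j + h]:
                h += 1
            lcp[rank[i]] = h
            if h > 0:
                h -= 1                   # next suffix shares at least h-1 chars
        else:
            h = 0
    return lcp


def longest_repeat(text: str) -> str:
    """Longest substring occurring at least twice, read off the LCP array."""
    if not text:
        return ""
    sa = suffix_array(text)
    lcp = lcp_array(text, sa)
    k = max(range(len(lcp)), key=lcp.__getitem__)
    return text[sa[k]:sa[k] + lcp[k]]
```

Because sorting places suffixes with shared prefixes next to each other, repeated patterns become adjacent entries in the suffix array, which is the property the slide refers to.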

  9. Tools Layer
     • Find Ngram Counts
     • Compare Ngram Counts
     • Locate Specific Patterns
     Output shown on the slide:
       TTAAAAAAAA-TTTTTTAAAA            10   251555
       TAAAAAAC-GTTTTTAA                 8   276649
       CAAAAAAG-CTTTTTAG                 8   312629
       TCTCTACTAAAAAT-ATTTTTAAAAAAAA    14   364179
       TGAAAAACA-TGTTTTAAA               9   449648
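
The three tools named on this slide can be illustrated with a direct, in-memory sketch. The function names and return shapes below are hypothetical, and this version counts n-grams by scanning the sequence rather than by using the suffix and LCP arrays of the foundation layer.

```python
from collections import Counter


def ngram_counts(seq: str, n: int) -> Counter:
    """Count every overlapping n-gram in seq."""
    seq = seq.upper()
    return Counter(seq[i:i + n] for i in range(len(seq) - n + 1))


def compare_ngram_counts(seq_a: str, seq_b: str, n: int) -> dict:
    """Map each n-gram whose count differs to (count in a, count in b)."""
    counts_a = ngram_counts(seq_a, n)
    counts_b = ngram_counts(seq_b, n)
    return {
        gram: (counts_a[gram], counts_b[gram])
        for gram in sorted(set(counts_a) | set(counts_b))
        if counts_a[gram] != counts_b[gram]
    }


def locate_pattern(seq: str, pattern: str) -> list[int]:
    """All start positions of pattern in seq, including overlapping ones."""
    positions = []
    start = seq.find(pattern)
    while start != -1:
        positions.append(start)
        start = seq.find(pattern, start + 1)
    return positions
```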

  10. Tools Layer
      • Large Repeats
      • Find RegEx
      • Find Perplexity
      Output shown on the slide:
        23    17    29441   CAGATTTGAAACACTCTTTTTGT
        24    93     4161   ATATCTTCGTATAAAAACAAGACA
        25   123   292054   TTTTCAGAAACTGCTTTGTGATGTG
        31   255     3983   GAAACGGGATTTCTTTATATTATGCTAGACA
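
As one illustration of regular-expression search on a genome sequence, the short tandem repeat definition from slide 3 (a unit of 1 to 6 nucleotides repeated in tandem) can be written as a backreference pattern. How the toolkit's own Find RegEx tool is implemented is not stated in the slides; the function name and the `min_repeats` parameter here are assumptions.

```python
import re


def find_strs(seq: str, min_repeats: int = 3) -> list[tuple[int, str, int]]:
    """Return (start, unit, copies) for each short tandem repeat.

    The lazy group captures the shortest repeating unit of 1 to 6 bases;
    the backreference then requires at least min_repeats - 1 more copies.
    min_repeats must be 2 or greater.
    """
    if min_repeats < 2:
        raise ValueError("min_repeats must be at least 2")
    pattern = re.compile(r"([ACGT]{1,6}?)\1{%d,}" % (min_repeats - 1))
    hits = []
    for match in pattern.finditer(seq.upper()):
        unit = match.group(1)
        copies = len(match.group(0)) // len(unit)
        hits.append((match.start(), unit, copies))
    return hits
```

Matches are non-overlapping and reported left to right, so a long repeat is returned once rather than once per copy.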

  11. Explorative pattern analysis in chromosome 19 [Figure; scale labels: 5 MB]

  12. Explorative pattern analysis in chromosome 19 [Figure; scale labels: 5 MB, 250 KB]

  13. Explorative pattern analysis in chromosome 19 [Figure; scale labels: 5 MB, 250 KB, 10 KB]

  14. Explorative pattern analysis in chromosome 19 [Figure; scale labels: 5 MB, 250 KB, 10 KB, 1 KB]

  15. Feature analysis of the centromere of the X chromosome
      Perplexity drops near the highly repetitive centromere region, which contains ngrams that are unique to this region.
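
The slides do not define how perplexity is computed. One common reading, assumed here, is 2 raised to the Shannon entropy of the n-gram frequency distribution inside a window; under that assumption a highly repetitive window has few distinct n-grams and therefore low perplexity, which matches the drop described above. The function names, window size, and step are all hypothetical.

```python
import math
from collections import Counter


def ngram_perplexity(seq: str, n: int = 3) -> float:
    """2 ** entropy of the empirical n-gram distribution of seq.

    Returns 0.0 when seq is shorter than n (no n-grams to count).
    """
    counts = Counter(seq[i:i + n] for i in range(len(seq) - n + 1))
    total = sum(counts.values())
    if total == 0:
        return 0.0
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return 2.0 ** entropy


def windowed_perplexity(seq: str, window: int, step: int, n: int = 3):
    """Perplexity of each window of the given size, as (start, value) pairs."""
    return [
        (start, ngram_perplexity(seq[start:start + window], n))
        for start in range(0, len(seq) - window + 1, step)
    ]
```

Plotting the second element of each pair against the start position gives a perplexity track along the chromosome at whatever resolution the window and step define.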

  16. Pattern landscape of chromosome 19 [Figure; annotation: Duplication events]

  17. Acknowledgements
      Madhavi Ganapathiraju, Thahir Mohamed, Kamiya Mopwani
      Thank you! Visit us at Department of Biomedical Informatics, University of Pittsburgh, www.dbmi.pitt.edu/madhavi
      [Image: Cathedral of Learning, University of Pittsburgh]
