Genomic Repeat Visualisation Using Suffix Arrays

Nava Whiteford, Department of Chemistry, University of Southampton (new@soton.ac.uk). Outline: The Analysis; Artificial Sequences; Genomic Sequences; The Algorithm; Larger Sequences; Non-genomic Sequences.

Presentation Transcript


  1. Genomic Repeat Visualisation Using Suffix Arrays Nava Whiteford Department of Chemistry University of Southampton new@soton.ac.uk

  2. Repeat Visualisation Using Suffix Arrays • The Analysis • Artificial Sequences • Genomic Sequences • The Algorithm • Larger Sequences • Non-genomic sequences

  3. The repeatscore plot: A sliding window is run over the entire sequence to divide it into all substrings of a given length (in this case, 2). Sequence: ATGCATATA. Windows: AT TG GC CA AT TA AT TA

  14. The repeatscore plot: A sliding window is run over the entire sequence to divide it into all substrings of a given length (in this case, 2). Sequence: ATGCATATA. Windows: AT TG GC CA AT TA AT TA. Counts: AT occurs 3 times; TG occurs 1 time; GC occurs 1 time; CA occurs 1 time; TA occurs 2 times.
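
The sliding-window count on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the presenter's code; the function name `window_counts` and the per-position `scores` list are assumptions introduced here.

```python
from collections import Counter

def window_counts(seq, k):
    """Slide a window of length k over seq and count each substring."""
    windows = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    return windows, Counter(windows)

windows, counts = window_counts("ATGCATATA", 2)
print(windows)  # ['AT', 'TG', 'GC', 'CA', 'AT', 'TA', 'AT', 'TA']

for kmer, n in counts.items():
    print(f"{kmer} occurs {n} time(s)")

# Assumed scoring: each window position gets the count of its substring.
scores = [counts[w] for w in windows]
print(scores)  # [3, 1, 1, 1, 3, 2, 3, 2]
```

A hash-based counter like this is fine for short sequences; the later slides motivate a suffix array for larger ones.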

  16. The repeatscore plot

  17. The repeatscore plot: The resulting matrix is then plotted as an image.
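
The transcript does not say how the matrix is laid out. As a sketch only, assume one row per window length and one column per start position, with each cell holding the number of times that window occurs. The names `repeat_score_matrix` and `to_pgm` are hypothetical, and the plain-text PGM output is just one dependency-free way to get an image.

```python
from collections import Counter

def repeat_score_matrix(seq, max_k):
    """Assumed layout: row k-1 holds the scores for window length k."""
    matrix = []
    for k in range(1, max_k + 1):
        n_windows = len(seq) - k + 1
        counts = Counter(seq[i:i + k] for i in range(n_windows))
        row = [counts[seq[i:i + k]] for i in range(n_windows)]
        row += [0] * (len(seq) - n_windows)  # pad positions with no full window
        matrix.append(row)
    return matrix

def to_pgm(matrix):
    """Render the matrix as a plain-text greyscale PGM (P2) image."""
    peak = max(max(row) for row in matrix) or 1
    lines = ["P2", f"{len(matrix[0])} {len(matrix)}", str(peak)]
    lines += [" ".join(str(v) for v in row) for row in matrix]
    return "\n".join(lines) + "\n"

matrix = repeat_score_matrix("ATGCATATA", 3)
for row in matrix:
    print(row)
# [4, 3, 1, 1, 4, 3, 4, 3, 4]
# [3, 1, 1, 1, 3, 2, 3, 2, 0]
# [1, 1, 1, 1, 2, 1, 2, 0, 0]
```

Writing `to_pgm(matrix)` to a `.pgm` file gives an image most viewers can open; brighter pixels mark more frequently repeated windows.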

  18. Repeatscore plots of Artificial Sequences: Small repeats. The reverse strand is also included.
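
One plausible reading of "reverse strand is also included" is that windows are counted over both the sequence and its reverse complement. The sketch below assumes that reading; `reverse_complement` and `both_strand_counts` are illustrative names, not taken from the presentation.

```python
from collections import Counter

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Complement each base, then reverse the result."""
    return seq.translate(COMPLEMENT)[::-1]

def both_strand_counts(seq, k):
    """Count length-k windows on the forward strand and its reverse complement."""
    counts = Counter()
    for strand in (seq, reverse_complement(seq)):
        counts.update(strand[i:i + k] for i in range(len(strand) - k + 1))
    return counts

print(reverse_complement("ATGCATATA"))  # TATATGCAT
strand_counts = both_strand_counts("ATGCATATA", 2)
print(strand_counts["AT"], strand_counts["TA"])  # 6 4
```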

  19. Random Sequences

  20. DNA Sequences • “The language of life” • Composed of four different bases: A, T, G and C • Sequences range in size from 2000 bp to 670 billion bp.

  21. Small Genomic Sequences: Lambda Phage

  22. Small Genomic Sequences: Random Sequence; Lambda Phage

  23. E. coli

  24. E. coli

  25. E. coli: sequences coding for rRNA; known intergenic repeat elements

  26. E. coli

  27. Repeats in Genomic Sequences

  28. A Linear-Time Algorithm • The plots shown would take hours to construct using traditional methods. • Traditional algorithms do not scale linearly with sequence length. • Creating these plots for large sequences is not feasible without more advanced algorithms.

  29. The suffix array • Original string: banana$ • All suffixes: banana$, anana$, nana$, ana$, na$, a$

  30. The suffix array • Original string: banana$ • All suffixes: banana$, anana$, nana$, ana$, na$, a$ • In sorted order: a$, ana$, anana$, banana$, na$, nana$
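
The sorted suffix list on this slide can be reproduced with a deliberately naive construction. This sketch sorts suffixes directly, which is far from linear time; linear-time constructions such as SA-IS or the skew (DC3) algorithm exist, but the transcript does not say which construction was used. The name `suffix_array` is an assumption.

```python
def suffix_array(text):
    """Return the start positions of all suffixes of text, in sorted order.

    Naive version: comparing whole suffixes makes it roughly O(n^2 log n).
    """
    return sorted(range(len(text)), key=lambda i: text[i:])

text = "banana$"
sa = suffix_array(text)
print(sa)  # [6, 5, 3, 1, 0, 4, 2]
print([text[i:] for i in sa])
# ['$', 'a$', 'ana$', 'anana$', 'banana$', 'na$', 'nana$']
```

Unlike the slide, this listing also includes the one-character suffix `$`, which sorts first because `$` compares lower than any letter.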

  31. Generating the repeatscore plot • Sorted suffixes: a$, ana$, anana$, banana$, na$, nana$
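
The transcript does not describe how the scores are read off the sorted suffixes. One way it could work: suffixes that share their first k characters sit next to each other in sorted order, so the size of each such run is the occurrence count of that length-k substring. The sketch below assumes this approach; `repeat_scores_from_suffix_array` is a hypothetical name, and a practical implementation would likely use an LCP array rather than comparing prefixes as strings.

```python
from itertools import groupby

def repeat_scores_from_suffix_array(text, k):
    """Score each start position by how often its length-k window occurs."""
    # Naive suffix array: sort start positions by the suffix they begin.
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # Keep only suffixes long enough to contain a full window.
    starts = [i for i in sa if i + k <= len(text)]
    scores = [0] * max(len(text) - k + 1, 0)
    # Equal k-prefixes are adjacent in sorted order, so group them in one pass.
    for _, group in groupby(starts, key=lambda i: text[i:i + k]):
        positions = list(group)
        for p in positions:
            scores[p] = len(positions)
    return scores

print(repeat_scores_from_suffix_array("banana$", 2))    # [1, 2, 2, 2, 2, 1]
print(repeat_scores_from_suffix_array("ATGCATATA", 2))  # [3, 1, 1, 1, 3, 2, 3, 2]
```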

  33. Whole human genome

  34. Whole human genome

  35. Whole human genome

  36. Human Chromosome 18

  37. Arabidopsis thaliana chromosome 1, coding region

  38. Fibonacci derived sequences
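
The transcript does not say how the Fibonacci-derived sequences were built. As an assumption, the sketch below uses the standard Fibonacci word recurrence S(n) = S(n-1) + S(n-2) over two bases; the function name and the choice of "A" and "T" as seeds are illustrative only.

```python
def fibonacci_word(n, a="A", b="T"):
    """Return the n-th Fibonacci word, with S(0) = a and S(1) = b."""
    prev, curr = a, b
    for _ in range(n):
        prev, curr = curr, curr + prev
    return prev

for n in range(6):
    print(n, fibonacci_word(n))
# 0 A
# 1 T
# 2 TA
# 3 TAT
# 4 TATTA
# 5 TATTATAT
```

Sequences like this are highly self-similar without being periodic, which makes them a useful artificial test input for a repeat plot.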

  39. Gallus gallus chromosome 20

  40. Application to other sequences • Analysing writing styles • Finding plagiarised text • Any sequence that may contain motif-based, language-like structure.

  41. Shakespeare

  42. Text document containing the text “The quick brown fox jumped over the lazy dog” 16 times.
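
The same window counting applies to plain text. This sketch builds a test document like the one described on the slide and scores every 10-character window. The separator ". " between copies and the window length of 10 are assumptions; the transcript specifies neither.

```python
from collections import Counter

phrase = "The quick brown fox jumped over the lazy dog"
document = (phrase + ". ") * 16  # assumed separator between the 16 copies

k = 10  # assumed window length, in characters
n_windows = len(document) - k + 1
counts = Counter(document[i:i + k] for i in range(n_windows))
scores = [counts[document[i:i + k]] for i in range(n_windows)]

print(len(document), min(scores), max(scores))
```

Because the document is one phrase repeated, every window scores at or near 16, which is the uniform, fully repetitive case a repeat plot would show.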

  43. “On the Economy of Machinery and Manufactures” by Charles Babbage with an artificial repeat inserted 16 times.

  44. “On the Economy of Machinery and Manufactures” by Charles Babbage with an artificial repeat inserted 16 times.

  45. Conclusion • This new visualisation technique can highlight repeat structure in sequences. • In genomic sequences this may be useful in generating annotation. • There are applications in other areas worth pursuing. • Our next step is to allow the repeatscore plot to be easily interrogated by a user in order to better understand the repeat structure.
