
Comparative Motif Discovery: A High Performance Computing Approach



Presentation Transcript


  1. Comparative Motif Discovery: A High Performance Computing Approach. Dieter De Witte (1), Michiel Van Bel (2,3), Jan Van de Velde (2,3), Pieter Audenaert (1), Piet Demeester (1), Bart Dhoedt (1), Klaas Vandepoele (2,3), and Jan Fostier (1). Email: jan.fostier@intec.ugent.be. (1) Department of Information Technology (INTEC), Ghent University - iMinds, Belgium; (2) Department of Plant Systems Biology, VIB, Ghent, Belgium; (3) Department of Plant Biotechnology and Bioinformatics, Ghent University, Belgium. Department of Information Technology (INTEC), Ghent University, Belgium

  2. Background (1). [Figure labels: transcription factor (protein), promoter containing the motif CCACGTG, gene, "regulate".]

  3. Outline: parallel, comparative motif discovery framework. 1) motif target prediction; 2) motif discovery algorithm; 3) parallel framework

  4. Gene families: DNA sequences of orthologous genes from zma, sbi, bdi and osa (Van Bel et al., “Plaza 2.5”, Plant Phys. 2012). [Figure: gene family 1, gene family 2, …, gene family 17724, each drawn with the 2 kbp promoter and the TSS of its genes.]

  5. Branch Length Score (BLS): scoring conservation within a gene family. [Figure: tree over the 2 kbp promoters of zma, sbi, bdi and osa, with branch length labels 23.66%, 8.60%, 26.88% and 5.38%; the motif occurrences are connected by a minimum spanning tree (MST).] BLS = 17.20%

  6. Branch Length Score (BLS): scoring conservation within a gene family. [Figure: same tree as slide 5, with a different set of motif occurrences connected by the MST.] BLS = 64.52%

  7. Branch Length Score (BLS): scoring conservation within a gene family. [Figure: same tree as slide 5, with a different set of motif occurrences connected by the MST.] BLS = 100%

  8. Branch Length Score (BLS): scoring conservation within a gene family. [Figure: same tree as slide 5, with a single motif occurrence and no MST.] BLS = 0%
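
The BLS slides can be illustrated with a minimal Python sketch. It assumes the BLS is the total branch length of the smallest subtree connecting the species in which the motif occurs, expressed as a fraction of the total tree length. The tree topology, the internal node names (`n1`, `n2`, `root`) and all branch lengths below are hypothetical and are not the values shown on the slides.

```python
# Hypothetical rooted tree: child -> (parent, branch length).
# The lengths are made-up fractions that sum to 1.0; they are NOT the slide values.
TREE = {
    "zma": ("n1", 0.20),
    "sbi": ("n1", 0.10),
    "bdi": ("n2", 0.25),
    "osa": ("n2", 0.15),
    "n1": ("root", 0.15),
    "n2": ("root", 0.15),
}


def path_to_root(node):
    """Branches (named by their child node) from a node up to the root."""
    path = []
    while node in TREE:
        path.append(node)
        node = TREE[node][0]
    return path


def branch_length_score(species_with_motif):
    """Length of the smallest subtree connecting the given leaves,
    as a fraction of the total tree length."""
    leaves = set(species_with_motif)
    if len(leaves) < 2:
        return 0.0  # a single occurrence spans no branch
    # Count, for every branch, how many of the selected leaves lie below it.
    below = {}
    for leaf in leaves:
        for branch in path_to_root(leaf):
            below[branch] = below.get(branch, 0) + 1
    # A branch belongs to the connecting subtree when some, but not all,
    # selected leaves lie below it.
    subtree = sum(TREE[b][0 + 1] for b, n in below.items() if n < len(leaves))
    total = sum(length for _, length in TREE.values())
    return subtree / total


if __name__ == "__main__":
    for occ in (["osa"], ["zma", "sbi"], ["zma", "bdi"], ["zma", "sbi", "bdi", "osa"]):
        print(occ, round(100 * branch_length_score(occ), 2), "%")
```

As on slides 7 and 8, a motif found in every species scores 100% and a motif found in a single species scores 0% under this definition.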

  9. Genome-wide scoring (1): genome-wide scoring of conservation
     gene family 1 (2 kbp promoter): BLS = 64.52% > BLSthres? yes
     gene family 2 (2 kbp promoter): BLS = 17.20% > BLSthres? no
     …
     gene family 17724 (2 kbp promoter): BLS = 72.72% > BLSthres? yes
     Fw = # families with BLS > BLSthres

  10. Genome-wide scoring (2): genome-wide scoring of conservation (2)
      Fw = # families with BLS > BLSthres for word w
      Fbg = median # families with BLS > BLSthres for random permutations of w
      Confidence score: C = 1 - Fbg / Fw
      Retain only motifs with confidence score C > 0.9
      Stark et al., Genome Res. 2007

  11. Example. [Figure: BLS threshold; 1000 random permutations of w; confidence score C = 1 - … (the values are not recoverable).]
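
The genome-wide scoring can be sketched in a few lines of Python. The sketch assumes the confidence score has the form C = 1 - Fbg / Fw, which is consistent with the "= 1 -" fragment on the example slide and with the definitions of Fw and Fbg on slide 10. The callback `count_families`, the function names and the 50% threshold in the usage lines are hypothetical.

```python
import random
from statistics import median


def families_above(bls_per_family, bls_threshold):
    """F: number of gene families whose BLS exceeds the threshold."""
    return sum(1 for bls in bls_per_family if bls > bls_threshold)


def background_count(word, count_families, n_permutations=1000, seed=0):
    """F_bg: median family count over random permutations of the word.

    count_families is a hypothetical callback that returns, for any word,
    the number of families in which its BLS exceeds the threshold.
    """
    rng = random.Random(seed)
    counts = []
    for _ in range(n_permutations):
        letters = list(word)
        rng.shuffle(letters)
        counts.append(count_families("".join(letters)))
    return median(counts)


def confidence(f_w, f_bg):
    """C = 1 - F_bg / F_w; set to 0 when the word is conserved in no family."""
    return 1.0 - f_bg / f_w if f_w > 0 else 0.0


# Toy usage: the three BLS values from slide 9 with an assumed threshold of 50%.
f_w = families_above([64.52, 17.20, 72.72], bls_threshold=50.0)
print("F_w =", f_w)
print("C =", confidence(f_w=100, f_bg=5))  # made-up counts
```

With the slide's cutoff, a word would be retained only when `confidence(...)` exceeds 0.9.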

  12. Outline: parallel, comparative motif discovery framework. 1) motif target prediction; 2) motif discovery algorithm; 3) parallel framework

  13. Motif discovery. Main idea: extend the previous ideas to all words that occur in the sequences.

  14. Enumerating all words: generalized suffix tree (GST). Example sequences: S1 = ATGTAT$1, S2 = TTATGC$2. [Figure: GST of S1 and S2; edge labels include T, G, C$2, AT, TAT$1, TATGC$2, GC$2, $1 and $2.]

  15. Enumerating all words: generalized suffix tree (GST). Example sequences: S1 = ATGTAT$1, S2 = TTATGC$2. [Figure: same GST as slide 14.]

  16. Enumerating all words: generalized suffix tree (GST). Example sequences: S1 = ATGTAT$1, S2 = TTATGC$2. [Figure: same GST as slide 14.]

  17. Enumerating all words: generalized suffix tree (GST). Example sequences: S1 = ATGTAT, S2 = TTATGC. [Figure: the same GST drawn without the terminators $1 and $2; edge labels include T, G, C, AT, TAT, TATGC and GC.]
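
As a rough illustration of how a generalized suffix structure enumerates every word together with the sequences that contain it, the Python sketch below indexes the two example sequences. It builds an uncompressed suffix trie (one character per edge) rather than a true compressed suffix tree, which is a simplification chosen for brevity; the function names are hypothetical.

```python
def build_suffix_trie(sequences):
    """Uncompressed generalized suffix trie.

    Each node is a dict with 'children' (character -> node) and 'seqs'
    (1-based ids of the sequences containing the word spelled by the
    path from the root to that node).
    """
    root = {"children": {}, "seqs": set()}
    for seq_id, seq in enumerate(sequences, start=1):
        for start in range(len(seq)):
            node = root
            for ch in seq[start:]:
                node = node["children"].setdefault(
                    ch, {"children": {}, "seqs": set()}
                )
                node["seqs"].add(seq_id)
    return root


def sequences_containing(root, word):
    """Ids of the sequences in which the word occurs (empty set if none)."""
    node = root
    for ch in word:
        node = node["children"].get(ch)
        if node is None:
            return set()
    return set(node["seqs"])


trie = build_suffix_trie(["ATGTAT", "TTATGC"])  # S1 and S2 from the slides
for w in ("TAT", "ATG", "GTA", "TGC"):
    print(w, sorted(sequences_containing(trie, w)))
```

A real suffix tree stores the same information in linear space by merging chains of single-child nodes into one edge, which is what the multi-character edge labels on the slides show.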

  18. Actual discovery: depth-first walk in the tree. Can be extended to the IUPAC alphabet (Sagot, 2001). [Figure: the GST of slide 17.]

  19. Discovery of words: single gene family, words with length [6 … 12], max. 3 degenerate characters.
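
The depth-first walk with degenerate characters might look like the following Python sketch. It extends a candidate pattern one IUPAC symbol at a time and keeps, for each pattern, the set of trie nodes it matches. The trie is again uncompressed, and the `min_seqs` cutoff is a hypothetical stand-in for the BLS-based scoring; the toy parameters in the usage line are much smaller than the length range [6 … 12] used on the slide.

```python
# IUPAC alphabet without B, D, H, V: four exact bases plus seven degenerate symbols.
IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "N": "ACGT",
}


class Node:
    def __init__(self):
        self.children = {}  # base -> Node
        self.seqs = set()   # ids of sequences containing this path's word


def build_trie(sequences, max_len):
    """Uncompressed generalized suffix trie, truncated at depth max_len."""
    root = Node()
    for seq_id, seq in enumerate(sequences, start=1):
        for start in range(len(seq)):
            node = root
            for base in seq[start:start + max_len]:
                node = node.children.setdefault(base, Node())
                node.seqs.add(seq_id)
    return root


def discover(sequences, min_len, max_len, max_degenerate, min_seqs):
    """Map every pattern (min_len >= 1 assumed) to the sequences it matches."""
    root = build_trie(sequences, max_len)
    found = {}

    def walk(pattern, nodes, seqs, n_degenerate):
        if len(pattern) >= min_len:
            found[pattern] = seqs
        if len(pattern) == max_len:
            return
        for symbol, bases in IUPAC.items():
            n_deg = n_degenerate + (len(bases) > 1)
            if n_deg > max_degenerate:
                continue  # too many degenerate characters
            # Follow every base the symbol stands for, from every matched node.
            nxt = [n.children[b] for n in nodes for b in bases if b in n.children]
            nxt_seqs = set().union(*(n.seqs for n in nxt))
            if len(nxt_seqs) >= min_seqs:  # prune poorly supported branches
                walk(pattern + symbol, nxt, nxt_seqs, n_deg)

    walk("", [root], set(), 0)
    return found


found = discover(["ATGTAT", "TTATGC"], min_len=3, max_len=3,
                 max_degenerate=1, min_seqs=2)
print(sorted(found))
```

Pruning works because extending a pattern can only shrink the set of sequences it matches, so a branch that already fails the cutoff never needs to be explored further.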

  20. Outline: parallel, comparative motif discovery framework. 1) motif target prediction; 2) motif discovery algorithm; 3) parallel framework

  21. MapReduce approach
      Input: gene family 1, gene family 2, …, gene family 17724
      Map: each gene family yields lots of words + BLS score, written to disk
      Parallel sorting (from disk to disk)
      Reduce: statistical evaluation
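
The data flow on this slide can be mimicked in plain Python, without any MapReduce library. In this sketch the per-family word discovery is replaced by a precomputed dictionary, the grouping step stands in for the parallel sort, and the reducer only counts families above a threshold. All words, scores, function names and the 50% threshold are made up for illustration.

```python
from collections import defaultdict

# Hypothetical mapper input: gene family id -> {word: BLS in percent}.
FAMILIES = {
    1: {"CACGTG": 64.52, "ATGTAT": 10.0},
    2: {"CACGTG": 17.20},
    3: {"CACGTG": 72.72, "ATGTAT": 55.0},
}


def map_family(word_scores):
    """Map: emit one (word, BLS) pair per word found in a gene family."""
    for word, bls in word_scores.items():
        yield word, bls


def shuffle(pairs):
    """Stand-in for the parallel sort: bring all values of a key together."""
    groups = defaultdict(list)
    for word, bls in sorted(pairs):
        groups[word].append(bls)
    return groups


def reduce_word(bls_values, bls_threshold):
    """Reduce: number of families in which the word's BLS exceeds the threshold."""
    return sum(1 for bls in bls_values if bls > bls_threshold)


pairs = [kv for scores in FAMILIES.values() for kv in map_family(scores)]
result = {word: reduce_word(values, bls_threshold=50.0)
          for word, values in shuffle(pairs).items()}
print(result)
```

Because each mapper only needs one gene family and each reducer only needs the values of one word, both phases can run on independent workers.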

  22. Results
      • Dataset description
        • 4 organisms (zma, sbi, osa, bdi)
        • 17639 gene families
        • Alphabet: IUPAC \ {B, D, H, V}, word length [6 ... 12]
        • Maximum of 3 degenerate characters
      • 20 instances on AWS (19 workers, type m1.xlarge)
        • Map time: 20 h 50 min
        • Reduce time: 12 h 38 min
        • Cost: $421.6
      • Number of <k, v> pairs emitted by mappers: 2 444 660 846 443
        • On average 138 million <k, v> pairs per gene family
        • Output materialized: 4 019 787 705 377 bytes = 3.65 TByte
        • Average 1.64 bytes per intermediate <k, v> pair
      • Number of <k, v> pairs emitted by reducers: 620 294 857
        • About 0.02537% of all words
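
The derived figures on this slide can be checked with a few lines of Python arithmetic using only the raw counts from the slide. One assumption is labeled in the code: the 3.65 TByte figure is reproduced when a terabyte is taken as 2**40 bytes.

```python
pairs_mapped = 2_444_660_846_443   # <k, v> pairs emitted by the mappers
families = 17_639                  # gene families in the dataset
bytes_out = 4_019_787_705_377      # materialized mapper output, in bytes
pairs_reduced = 620_294_857        # <k, v> pairs emitted by the reducers

pairs_per_family = pairs_mapped / families
bytes_per_pair = bytes_out / pairs_mapped
tbytes = bytes_out / 2**40         # assumption: binary terabytes (TiB)
retained_pct = 100 * pairs_reduced / pairs_mapped

print(f"pairs per family: {pairs_per_family:,.0f}")
print(f"bytes per pair:   {bytes_per_pair:.3f}")
print(f"output size:      {tbytes:.3f} TByte")
print(f"retained:         {retained_pct:.5f} %")
```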

  23. Where do we go from here?
      • Output data is still big (~50 GByte of data)
      • Redundancy present in output
      • Clustering algorithms
      • Mapping words back to the promoter sequences
      • Filtering output is easy
        • Subsets with higher confidence scores
        • Subsets with specific BLS scores
        • Subsets which match a specific motif
      • Worked example: KN1 motif
      We need further post-processing.

  24. Knotted 1 motif (Bolduc et al. 2011)

  25. Conclusion
      • We developed…
        • A parallel framework for motif discovery
        • Based on phylogenetic footprinting
        • MPI implementation (memory issues)
        • MapReduce implementation (scales well)
      • Key advantages…
        • Runs in the cloud
        • Exhaustively explores the search space (IUPAC)
        • Alignment-free (unique feature)
        • Flexible framework
      • Future work…
        • Development of post-processing tools
        • Likely also cloud-based

  26. Questions? Jan Fostier, jan.fostier@intec.ugent.be, Department of Information Technology (INTEC), Ghent University, Internet Based Communication Networks and Services (IBCN)
