
Parallel Detection of Regulatory Elements with gMP

Parallel Detection of Regulatory Elements with gMP. Bertil Schmidt, Lin Feng, Amey Laud, Yusdi Santoso. Damayanti Gupta CMSC 838 Presentation.


Presentation Transcript


  1. Parallel Detection of Regulatory Elements with gMP Bertil Schmidt, Lin Feng, Amey Laud, Yusdi Santoso Damayanti Gupta CMSC 838 Presentation

  2. Motivation • Fundamental question • How are expression levels of thousands of genes regulated? • Very important • Understanding of gene function • Response to environment • Understand genetic causes of diseases • Evaluate effects of drugs • Detect mutations • Remember • Sets of genes -> Pathways -> Genetic Networks • Gene regulation • Control decisions turn genes on/off • Gene Regulation Network CMSC 838T – Presentation

  3. Talk Overview • Overview of talk • Motivation • Technique • Experiment • Related work • Conclusions

  4. Technique • Motifs upstream of genes regulate gene expression • Motifs are sites of regulatory activity • Identify regulatory motifs by combining • Gene expression data • Detection of common motifs occurring upstream of genes • Huge datasets • Utilise parallel computing

  5. Technique • gRNA • Java development framework • gMP • Java communication library • REDUCE • Algorithm to identify regulatory motifs • REDUCE parallelised with gMP • Increase computing power • Get motifs ranked by statistical significance

  6. gRNA framework • Consists of APIs

  7. gRNA - APIs • Interact with data sources • Provide functionality from biology • Pipeline tasks into a unified process • Repository of resources • Distributed programming

  8. gRNA environment • gRNA Grid • Clustered computing environment • Application written for gRNA • Multiple-tier application • Applications operate from a client computer • Client communicates with the cluster through a single computer • Hosts EJB server • Server identifies processing nodes • Each of these performs tasks

  9. gRNA Grid

  10. gMP • Java based message passing tool • Built on top of sockets • Manages virtual processors to run on available machines • Scalable • Machines added/removed easily

  11. gMP • Processes are grouped • Communication primitives provided for sending and receiving data • Collective communication to several nodes enabled modularly and efficiently • Enables functions to be implemented on data
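
The slides describe gMP as message passing built on top of sockets, but they do not show its API (gMP itself is a Java library). The sketch below is therefore only an illustration, in Python, of how a send/receive primitive can be layered over a raw socket with a length prefix. The function names `send_obj`, `recv_exact` and `recv_obj`, the message layout and the use of `pickle` are all assumptions, not part of gMP.

```python
import pickle
import socket
import struct

def send_obj(sock, obj):
    """Serialise obj and send it with a 4-byte big-endian length prefix."""
    payload = pickle.dumps(obj)
    sock.sendall(struct.pack("!I", len(payload)) + payload)

def recv_exact(sock, n):
    """Read exactly n bytes, since recv may return fewer than requested."""
    buf = b""
    while len(buf) < n:
        chunk = sock.recv(n - len(buf))
        if not chunk:
            raise ConnectionError("peer closed the connection")
        buf += chunk
    return buf

def recv_obj(sock):
    """Receive one length-prefixed message and deserialise it."""
    (length,) = struct.unpack("!I", recv_exact(sock, 4))
    return pickle.loads(recv_exact(sock, length))

# Two connected endpoints in one process stand in for two nodes.
left, right = socket.socketpair()
send_obj(left, {"tag": "counts", "data": [2, 1, 0]})
message = recv_obj(right)
print(message)
left.close()
right.close()
```

A collective operation such as a broadcast or gather could be assembled from these two primitives by looping over the sockets of a process group; the slides do not say how gMP does this internally.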

  12. REDUCE algorithm • Based on a model • Upstream motifs contribute additively to the expression level of each gene • Quantify the extent to which these motifs contribute to expression data • Fit log of expression ratio to sum of activating and inhibitory terms • Find statistically most significant motifs • Plots of fitting parameters suggest biological function
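
The additive model on this slide can be written compactly as below. The symbols are chosen here for illustration and do not appear on the slides: $A_g$ is the log expression ratio of gene $g$, $N_{\mu g}$ is the number of occurrences of motif $\mu$ upstream of gene $g$, $F_\mu$ is the fitted contribution of motif $\mu$ (positive for an activating term, negative for an inhibitory one), and $C$ is a constant offset.

```latex
A_g \;=\; \log \frac{E_g}{E_g^{\mathrm{ref}}} \;\approx\; C + \sum_{\mu} F_\mu \, N_{\mu g}
```

Under this reading, the fitting parameters mentioned on the slide are the $F_\mu$, and their sign and size are what suggest a motif's biological role.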

  13. REDUCE algorithm • Terms • Occurrence vector • Measure of how often a motif is found • Expression vector • Measure of gene expression

  14. REDUCE method Consists of 1) Motif frequency counter • counts occurrences of DNA motifs upstream of each ORF • motifs are about 7~11 nucleotides in length • get occurrence vectors
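
A minimal sketch of a motif frequency counter follows. It assumes that overlapping occurrences are counted, that only one strand is scanned, and that a single motif length is handled per call; the slides do not specify any of these details. The function name `occurrence_vectors` and the toy sequences are hypothetical.

```python
from collections import defaultdict

def occurrence_vectors(upstream_seqs, motif_len):
    """Return (orf_ids, vectors), where vectors[motif][i] is the number of
    times motif occurs in the upstream sequence of orf_ids[i]."""
    orf_ids = sorted(upstream_seqs)
    vectors = defaultdict(lambda: [0] * len(orf_ids))
    for col, orf in enumerate(orf_ids):
        seq = upstream_seqs[orf].upper()
        # Slide a window of motif_len over the sequence (overlaps allowed).
        for i in range(len(seq) - motif_len + 1):
            vectors[seq[i:i + motif_len]][col] += 1
    return orf_ids, dict(vectors)

# Toy upstream regions; real ones would be far longer.
upstream = {"ORF_A": "ACGTACGT", "ORF_B": "TTACGTT", "ORF_C": "GGGGGG"}
orf_ids, vectors = occurrence_vectors(upstream, motif_len=4)
print(orf_ids)
print(vectors["ACGT"])
```

Each entry of `vectors` is one occurrence vector in the sense of slide 13: one count per ORF for a given motif.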

  15. REDUCE algorithm 2) Significant motif finder • Use i) normalised occurrence vector n_μ, made for each motif μ ii) normalised vector a of the logs of the gene expression ratios • Take the dot product of these, (a · n_μ), and square it • Can be considered as frequency of occurrence × expressive power of regulatory motif • It is squared to get rid of negatives • Correlate gene expression with occurrence of motif • Motif with the largest squared dot product is the most significant

  16. REDUCE algorithm (continued) • a is modified to remove the effect of this motif • residual gene expression vector • Process repeated until motifs are ranked

  17. Table: Finding significant motifs • Uses a = (0.5816, 0.2522, 0.2886, -0.5947, -0.1595, -0.3683)
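
The greedy ranking described on slides 15 and 16 can be sketched as follows, using the expression vector a from slide 17. The occurrence counts and the names `MOTIF_1` to `MOTIF_3` are made up for illustration, since the slide's table did not survive. Two further assumptions are labelled in the code: "normalised" is taken to mean mean-centred and scaled to unit length, and the residual is formed by subtracting the projection of a onto the chosen motif's vector.

```python
import math

def normalise(v):
    """Assumption: centre on the mean, then scale to unit length."""
    mean = sum(v) / len(v)
    centred = [x - mean for x in v]
    norm = math.sqrt(sum(x * x for x in centred))
    return [x / norm for x in centred] if norm > 0 else centred

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def rank_motifs(a, occurrence, rounds):
    """Greedily pick the motif with the largest (a . n_mu)^2, then remove
    its contribution from a and repeat."""
    units = {motif: normalise(v) for motif, v in occurrence.items()}
    ranked = []
    for _ in range(min(rounds, len(units))):
        chosen = {motif for motif, _ in ranked}
        scores = {m: dot(a, n) ** 2 for m, n in units.items() if m not in chosen}
        best = max(scores, key=scores.get)
        ranked.append((best, scores[best]))
        proj = dot(a, units[best])
        # Assumption: residual expression vector = a minus its projection.
        a = [x - proj * y for x, y in zip(a, units[best])]
    return ranked

a = [0.5816, 0.2522, 0.2886, -0.5947, -0.1595, -0.3683]  # from slide 17
occurrence = {  # hypothetical counts for six genes
    "MOTIF_1": [2, 1, 1, 0, 0, 0],
    "MOTIF_2": [0, 0, 1, 2, 1, 1],
    "MOTIF_3": [1, 0, 1, 0, 1, 0],
}
ranked = rank_motifs(a, occurrence, rounds=3)
for motif, score in ranked:
    print(motif, round(score, 4))
```

With these toy counts `MOTIF_1` is picked first, because its occurrences line up with the genes whose entries in a are positive.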

  18. REDUCE parallelised with gMP • Parallel motif frequency counter • Split set of ORFs equally • Distribute across available nodes • Each node calculates in parallel to get occurrence vectors • Matrix transposition • Occurrence vectors scattered across nodes • Advantageous to store each vector in a single node • Transpose motif frequency matrix • For each ORF, a node can only calculate a fraction of the occurrence frequencies for all motifs • But the entire occurrence frequency is needed
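
The two steps on this slide can be sketched in a single process, with list entries standing in for nodes; a real implementation would exchange the partial columns through gMP, whose API is not shown in the slides. The helper names (`split_evenly`, `count_on_node`, `transpose`) and the round-robin assignment of motifs to owner nodes are assumptions made for illustration.

```python
def split_evenly(items, n_nodes):
    """Contiguous chunks whose sizes differ by at most one."""
    base, extra = divmod(len(items), n_nodes)
    chunks, start = [], 0
    for node in range(n_nodes):
        size = base + (1 if node < extra else 0)
        chunks.append(items[start:start + size])
        start += size
    return chunks

def count_overlapping(seq, motif):
    return sum(seq.startswith(motif, i) for i in range(len(seq) - len(motif) + 1))

def count_on_node(seqs, motifs):
    """Partial matrix held by one node: rows = its ORFs, columns = motifs."""
    return [[count_overlapping(seq, m) for m in motifs] for seq in seqs]

def transpose(partials, motifs, n_nodes):
    """Give each motif an owner node and assemble its full occurrence
    vector there from the partial columns of every node."""
    owned = [dict() for _ in range(n_nodes)]
    for j, motif in enumerate(motifs):
        vector = []
        for rows in partials:  # one block per sending node, in ORF order
            vector.extend(row[j] for row in rows)
        owned[j % n_nodes][motif] = vector  # round-robin ownership (assumed)
    return owned

seqs = ["ACGTACGT", "TTACGTT", "GGGGGG", "ACGACG"]  # toy upstream regions
motifs = ["ACG", "CGT", "GGG"]
n_nodes = 2
partials = [count_on_node(chunk, motifs) for chunk in split_evenly(seqs, n_nodes)]
owned = transpose(partials, motifs, n_nodes)
print(owned)
```

Before the transposition each node holds all motifs for a few ORFs; afterwards each node holds a few motifs for all ORFs, which is the layout the significant motif finder needs.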

  19. REDUCE parallelised with gMP (continued) • Parallel significant motif finder • Normalises occurrence vectors within each node • At each node, most significant motif calculated • Global most significant motif calculated • Process iterated to rank occurrence vectors • Interface in gRNA allows ease of implementation
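
A single-process sketch of the parallel significant motif finder follows: each simulated node proposes its local best motif, the global best is the maximum over those proposals, and the expression vector is then updated for the next round. The node layout, motif names and counts are hypothetical, "normalised" is assumed to mean mean-centred and unit length, and in a real run the proposal and update steps would be collective gMP operations rather than Python loops.

```python
import math

def normalise(v):
    """Assumption: centre on the mean, then scale to unit length."""
    mean = sum(v) / len(v)
    centred = [x - mean for x in v]
    norm = math.sqrt(sum(x * x for x in centred))
    return [x / norm for x in centred] if norm > 0 else centred

def dot(u, v):
    return sum(x * y for x, y in zip(u, v))

def local_best(a, owned, done):
    """Best (score, motif) among the motifs this node owns, or None."""
    scored = [(dot(a, n) ** 2, m) for m, n in owned.items() if m not in done]
    return max(scored) if scored else None

def parallel_rank(a, nodes, rounds):
    nodes = [{m: normalise(v) for m, v in owned.items()} for owned in nodes]
    ranked, done = [], set()
    for _ in range(rounds):
        proposals = [p for p in (local_best(a, owned, done) for owned in nodes) if p]
        if not proposals:
            break
        _, motif = max(proposals)  # global reduction over local winners
        ranked.append(motif)
        done.add(motif)
        n = next(owned[motif] for owned in nodes if motif in owned)
        proj = dot(a, n)
        a = [x - proj * y for x, y in zip(a, n)]  # residual expression vector
    return ranked

a = [0.5816, 0.2522, 0.2886, -0.5947, -0.1595, -0.3683]  # from slide 17
nodes = [  # hypothetical occurrence vectors, split over two nodes
    {"MOTIF_1": [2, 1, 1, 0, 0, 0], "MOTIF_3": [1, 0, 1, 0, 1, 0]},
    {"MOTIF_2": [0, 0, 1, 2, 1, 1]},
]
ranked = parallel_rank(a, nodes, rounds=3)
print(ranked)
```

Only one (score, motif) pair per node and one winning vector per round need to cross the network in this scheme, which keeps the communication per iteration small compared with the local dot products.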

  20. Experiment • Use Compaq Alpha system • Consists of cluster of 8 AlphaServer SC/ES45 • Connected by high-speed Alpha SC 16-Port switch and ELAN PCI adapter cards • Each server contains 4 Alpha EV68 processors

  21. Results • Use 7090 gene expressions of yeast • ORFs of length 600 • Motifs up to length 7 • Throughput (in MBytes/s) also shown • 20 most significant motifs computed

  22. Analysis • Runtime scales well with number of processing nodes • Frequency counter scales perfectly • Motif finder also scales • Cannot achieve perfect scaling because of communication overhead
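
The scaling statements above can be read against the usual definitions of speedup and parallel efficiency. The simple cost model on the right is an assumption added here for illustration, not something the slides state: $T(p)$ is the runtime on $p$ processors, $T_{\mathrm{comp}}$ the total computation time and $T_{\mathrm{comm}}(p)$ the communication time.

```latex
S(p) = \frac{T(1)}{T(p)}, \qquad
E(p) = \frac{S(p)}{p}, \qquad
T(p) \approx \frac{T_{\mathrm{comp}}}{p} + T_{\mathrm{comm}}(p)
```

Perfect scaling corresponds to $E(p) = 1$. Any nonzero $T_{\mathrm{comm}}(p)$ pushes $E(p)$ below 1, which matches the slide's point that the motif finder, which must communicate in every iteration, cannot scale perfectly.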

  23. Related work • DiscoveryLink • Provides configurable wrappers as interfaces to multiple data sources • Kleisli system • Systematically manages and integrates external databases • Uses functional query language to perform correlation across databases • Toolkits designed with functionality for specialised areas • BioJava, BioPerl, PAL • Sequence Analysis • Ensembl initiative, DAS • provide extensible approach to issue of annotating genomic data

  24. Related work Previous approaches using Java for high performance computing • Bindings into native message-passing APIs (e.g. MPI) • Do not allow easy integration into larger Java applications • Pure Java message passing interfaces • JMPI, CCJ • Both implemented on top of Java RMI • Slower than using raw sockets • CCJ tries to overcome this • optimised RMI implementation • not portable • Neither can handle integration

  25. Comparison According to authors ... • gRNA distinguishes itself • Addresses the whole range of requirements for applications in computational biology • Provides decoupled, yet inter-related subsystems • Ease of 3rd party implementation

  26. Observations • REDUCE surpasses traditional clustering approach • REDUCE algorithm has high runtime • Complexity depends on the product of the number of possible motifs and the number of genes • Grows exponentially with motif length • So length of motif is restricted • REDUCE algorithm is greedy • suboptimal • REDUCE is simplistic • lacks parameters for interactions between motifs • does not consider impact of other biological knowledge
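
As a rough worked example of the complexity remark, take the figures from slide 21 (motifs up to length 7, 7090 genes) and assume that every nucleotide string of each length from 1 to $k$ counts as a distinct motif, ignoring reverse complements. That counting convention is an assumption, not something the slides specify.

```latex
\#\{\text{motifs of length} \le k\} = \sum_{\ell=1}^{k} 4^{\ell} = \frac{4^{k+1}-4}{3},
\qquad
k = 7:\ \frac{4^{8}-4}{3} = 21844,
\qquad
21844 \times 7090 \approx 1.5 \times 10^{8}\ \text{motif-gene pairs}
```

Each extra nucleotide of motif length multiplies the motif count by roughly four, which is why the motif length has to be capped.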

  27. Observations (continued) • Not clear that results of REDUCE are biologically significant • Experiment does not effectively show how higher computation power helps results • Analysis covers only 9 to 16 processors; is this sufficient to determine ‘good scaling’?

  28. Conclusions Finally... • gRNA demonstrates an efficient mechanism for development of genome-centric applications Further... • Extensions to REDUCE have been proposed • require higher computing power • more specialised programming interfaces required • Identifying communication patterns • Use of data structures e.g. sequences, trees, matrices
