Enhancing Computational Efficiency with RBForest Models in Multivariate Binary Data Analysis

Butte Lab Journal Club 10/25/2010

Boltzmann machines are able to solve difficult combinatorial problems
• Estimating the density function of multivariate binary data is typically done with mixture models or factor models
• Problem: too computationally expensive for many multivariate binary density modeling problems
• Solution: the authors describe a generalization of the restricted Boltzmann machine (RBM), the restricted Boltzmann forest (RBForest)
  • Replaces the binary hidden variables of the RBM with groups of tree-structured binary variables
  • When the size of the trees is varied, the number of parameters of the model can be increased while keeping the computation of the density function tractable
  • Basically, “structured” binning of variables
• Example application: automated diagnosis involving a large number of feature types
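To make the tractability point concrete, here is a minimal sketch of how a plain RBM (not the RBForest itself) sums out its binary hidden units analytically through the free energy. The weights `W`, biases `b` and `c`, and the function names are illustrative assumptions, not taken from the paper. The exact normalization enumerates every visible configuration, so it is only feasible for toy sizes like the one shown.

```python
import itertools
import math

def free_energy(v, W, b, c):
    """Free energy F(v) of a plain RBM with binary hidden units.

    W[j][i] couples hidden unit j to visible unit i, b holds the visible
    biases and c the hidden biases (all hypothetical values). exp(-F(v))
    is the unnormalized probability of v with the hidden units summed out.
    """
    visible_term = sum(bi * vi for bi, vi in zip(b, v))
    hidden_term = 0.0
    for j, cj in enumerate(c):
        activation = cj + sum(W[j][i] * vi for i, vi in enumerate(v))
        # log(1 + exp(a)) accounts for both states of hidden unit j.
        hidden_term += math.log1p(math.exp(activation))
    return -visible_term - hidden_term

def exact_density(W, b, c):
    """Normalize exp(-F(v)) by brute force over all 2^n visible states."""
    states = list(itertools.product([0, 1], repeat=len(b)))
    unnormalized = [math.exp(-free_energy(v, W, b, c)) for v in states]
    partition = sum(unnormalized)
    return {v: u / partition for v, u in zip(states, unnormalized)}

# Toy model: 3 visible units, 2 hidden units (assumed parameters).
W = [[0.5, -0.3, 0.2], [-0.1, 0.4, 0.0]]
b = [0.1, -0.2, 0.0]
c = [0.0, 0.3]
density = exact_density(W, b, c)
```

The per-unit `log(1 + exp(a))` term is what keeps the hidden-unit sum cheap in an RBM. The slide's claim is that the RBForest keeps a comparable closed form while replacing each hidden unit with a tree of binary variables, which this sketch does not attempt to reproduce.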

Computational pipelines are essential, yet there is a paucity of “good” tools for designing pipelines
• eHive has many design features for robustness and scalability:
  • Fault tolerance
  • Agents (“bees”)
  • Graph-based
  • Cloud/GRID-friendly
• Generic infrastructure: Perl, MySQL

• Normalization scheme enables better detection of drug signals
• Less susceptible to known confounders
Figure labels: vorinostat, trichostatin A, asthma drugs, antifungal drugs, calmodulin inhibitors, anti-neoplastic drugs

Emtree = EMBASE’s MeSH equivalent; much more comprehensive in certain areas, e.g., pharmacology
Caveat: SCOPUS is not EMBASE, so SCOPUS does not support the kinds of complex Emtree queries that EMBASE supports, nor some other features (e.g., no thesaurus explosion in SCOPUS)

CenterWatch Databases

Example reports…

Example Pipeline for Multiplying Large Numbers
• Pipeline defined in 4 files:
  • Start.pm splits a multiplication job into sub-tasks and creates corresponding jobs
  • PartMultiply.pm performs a partial multiplication and stores the intermediate result in a table
  • AddTogether.pm waits for partial multiplication results to compute and adds them together into the final result
  • LongMult_conf.pm, the pipeline configuration module that links the previous Runnables into one pipeline
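The three Runnables above can be mimicked in a few lines of plain Python. This is only a sketch of the arithmetic, not eHive code: the function names mirror the module names, and splitting the job by the distinct non-zero digits of the second factor is an assumption about how the sub-tasks are formed. It also assumes non-negative integers.

```python
def start(a, b):
    """Split a * b into one sub-task per distinct non-zero digit of b."""
    digits = sorted({int(ch) for ch in str(b)} - {0})
    return [(a, digit) for digit in digits]

def part_multiply(a, digit):
    """Perform one partial multiplication."""
    return a * digit

def add_together(b, intermediate_result):
    """Shift each partial product to its decimal position and sum them."""
    total = 0
    for position, ch in enumerate(reversed(str(b))):
        digit = int(ch)
        if digit:
            total += intermediate_result[digit] * 10 ** position
    return total

def long_multiply(a, b):
    # A dict stands in for the table of intermediate results.
    intermediate_result = {}
    for a_part, digit in start(a, b):
        intermediate_result[digit] = part_multiply(a_part, digit)
    return add_together(b, intermediate_result)

product = long_multiply(9650156169, 327358788)
```

Because each partial product depends only on `a` and one digit, the sub-tasks are independent and can run in any order, which is what makes the job easy to distribute.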

Features Used in Example Pipeline
• A pipeline can have multiple analyses (e.g., 'start', 'part_multiply' and 'add_together').
• A job of one analysis can create jobs of other analyses by 'flowing the data' down branches. These branches are then assigned specific analysis names in the pipeline configuration file.
  • One 'start' job flows partial multiplication subtasks down branch #2, and a task of adding them together down branch #1.
• Execution of one analysis can be blocked until all jobs of another analysis have been successfully completed ('add_together' is blocked by 'part_multiply').
• eHive processes store intermediate and final results in a database (in this pipeline, the 'intermediate_result' and 'final_result' tables are used).
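The dataflow and blocking behaviour described above can be illustrated with a toy in-memory job queue. Nothing here is eHive's API: `BRANCHES`, `BLOCKED_BY`, and `run_pipeline` are hypothetical names, dicts stand in for the database tables, and the per-digit split of the multiplication is an assumption.

```python
from collections import deque

# Branch number -> analysis name, standing in for the pipeline configuration.
BRANCHES = {1: "add_together", 2: "part_multiply"}
# Analysis -> analysis whose jobs must all finish before it may run.
BLOCKED_BY = {"add_together": "part_multiply"}

def run_pipeline(a, b):
    tables = {"intermediate_result": {}, "final_result": {}}
    queue = deque([("start", {"a": a, "b": b})])
    order = []  # execution order, kept only to show the effect of blocking
    while queue:
        analysis, params = queue.popleft()
        blocker = BLOCKED_BY.get(analysis)
        if blocker and any(name == blocker for name, _ in queue):
            queue.append((analysis, params))  # still blocked: retry later
            continue
        order.append(analysis)
        if analysis == "start":
            # Branch #1 is emitted first on purpose: the blocking rule,
            # not the queue order, is what delays 'add_together'.
            queue.append((BRANCHES[1], params))
            for digit in sorted({int(ch) for ch in str(params["b"])} - {0}):
                queue.append((BRANCHES[2], {"a": params["a"], "digit": digit}))
        elif analysis == "part_multiply":
            partial = params["a"] * params["digit"]
            tables["intermediate_result"][params["digit"]] = partial
        elif analysis == "add_together":
            partials = tables["intermediate_result"]
            total = sum(
                partials[int(ch)] * 10 ** position
                for position, ch in enumerate(reversed(str(params["b"])))
                if ch != "0"
            )
            tables["final_result"][(params["a"], params["b"])] = total
    return tables, order

tables, order = run_pipeline(123, 405)
```

In this sketch the 'add_together' job is created before any 'part_multiply' job, yet it always runs last, because it is re-queued for as long as a job of its blocking analysis is still waiting.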

Other Worthy Features
• eHive performs well for jobs that each run for a very short time but are repeated millions of times
• Converse of typical job scheduling systems, which have high latency
